*Today is Allan Birnbaum’s birthday. In honor of his birthday, I’m posting the articles in the Synthese volume that was dedicated to his memory in 1977. The editors describe it as their way of “paying homage to Professor Birnbaum’s penetrating and stimulating work on the foundations of statistics”. I had posted the volume before, but there are several articles that are very worth rereading. I paste a few snippets from the articles by Giere and Birnbaum. If you’re interested in statistical foundations, and are unfamiliar with Birnbaum, here’s a chance to catch up. (Even if you are, you may be unaware of some of these key papers.)*

**HAPPY BIRTHDAY ALLAN!**

*Synthese* Volume 36, No. 1 Sept 1977: *Foundations of Probability and Statistics*, Part I

**Editorial Introduction:**

This special issue of *Synthese* on the foundations of probability and statistics is dedicated to the memory of Professor Allan Birnbaum. Professor Birnbaum’s essay ‘The Neyman-Pearson Theory as Decision Theory, and as Inference Theory; With a Criticism of the Lindley-Savage Argument for Bayesian Theory’ was received by the editors of *Synthese* in October, 1975, and a decision was made to publish a special symposium consisting of this paper together with several invited comments and related papers. The sad news about Professor Birnbaum’s death reached us in the summer of 1976, but the editorial project could nevertheless be completed according to the original plan. By publishing this special issue we wish to pay homage to Professor Birnbaum’s penetrating and stimulating work on the foundations of statistics. We are grateful to Professor Ronald Giere, who wrote an introductory essay on Professor Birnbaum’s concept of statistical evidence and who compiled a list of Professor Birnbaum’s publications.

THE EDITORS

**Table of Contents**


- Editorial Introduction. (1977). *Synthese*, 36(1), 3.

- Giere, R. (1977). Allan Birnbaum’s Conception of Statistical Evidence. *Synthese*, 36(1), 5-13.

**SUFFICIENCY, CONDITIONALITY AND LIKELIHOOD**

“In December of 1961 Birnbaum presented the paper ‘On the Foundations of Statistical Inference’ (Birnbaum [19]) at a special discussion meeting of the American Statistical Association. Among the discussants was L. J. Savage, who pronounced it “a landmark in statistics”. Explicitly denying any “intent to speak with exaggeration or rhetorically”, Savage described the occasion as “momentous in the history of statistics”. “It would be hard”, he said, “to point to even a handful of comparable events” (Birnbaum [19], pp. 307-8). The reasons for Savage’s enthusiasm are obvious. Birnbaum claimed to have shown that two principles widely held by non-Bayesian statisticians (sufficiency and conditionality) jointly imply an important consequence of Bayesian statistics (likelihood).”[1]

- Giere, R. (1977). Publications by Allan Birnbaum. *Synthese*, 36(1), 15-17.

- Birnbaum, A. (1977). The Neyman-Pearson Theory as Decision Theory, and as Inference Theory; With a Criticism of the Lindley-Savage Argument for Bayesian Theory. *Synthese*, 36(1), 19-49.

**INTRODUCTION AND SUMMARY**

….Two contrasting interpretations of the decision concept are formulated: behavioral, applicable to ‘decisions’ in a concrete literal sense as in acceptance sampling; and evidential, applicable to ‘decisions’ such as ‘reject H’ in a research context, where the pattern and strength of statistical evidence concerning statistical hypotheses is of central interest. Typical standard practice is characterized as based on the confidence concept of statistical evidence, which is defined in terms of evidential interpretations of the ‘decisions’ of decision theory. These concepts are illustrated by simple formal examples with interpretations in genetic research, and are traced in the writings of Neyman, Pearson, and other writers. The Lindley-Savage argument for Bayesian theory is shown to have no direct cogency as a criticism of typical standard practice, since it is based on a behavioral, not an evidential, interpretation of decisions.

- Lindley, D. (1977). The Distinction between Inference and Decision. *Synthese*, 36(1), 51-58.

- Pratt, J. (1977). ‘Decisions’ as Statistical Evidence and Birnbaum’s ‘Confidence Concept’. *Synthese*, 36(1), 59-69.

- Smith, C. (1977). The Analogy between Decision and Inference. *Synthese*, 36(1), 71-85.

- Kyburg, H. (1977). Decisions, Conclusions, and Utilities. *Synthese*, 36(1), 87-96.

- Neyman, J. (1977). Frequentist Probability and Frequentist Statistics. *Synthese*, 36(1), 97-131.

- Le Cam, L. (1977). A Note on Metastatistics or ‘An Essay toward Stating a Problem in the Doctrine of Chances’. *Synthese*, 36(1), 133-160.

- Kiefer, J. (1977). The Foundations of Statistics: Are There Any? *Synthese*, 36(1), 161-176.

[1] By “likelihood” here, Giere means the (strong) Likelihood Principle (SLP). Dotted through the first 3 years of this blog are a number of (formal and informal) posts on his SLP result, and my argument as to why it is unsound. I wrote a paper on this that appeared in *Statistical Science* in 2014. You can find it, along with a number of comments and my rejoinder, in this post: Statistical Science: The Likelihood Principle Issue is Out. The consequences of having found his proof unsound give a new lease on life to statistical foundations, or so I argue in my rejoinder.

Ship StatInfasST will embark on a new journey from 21 May – 18 June, a graduate research seminar for the Philosophy, Logic & Scientific Method Department at the LSE, but given that the pandemic has shut down cruise ships, it will remain at dock in the U.S. and use Zoom. If you care to follow any of the 5 sessions, nearly all of the materials will be linked here, collected from excerpts already on this blog. If you are interested in observing on Zoom beginning 28 May, please follow the directions here. The 21 May session will be put on the seminar web page.

**General Schedule** PDF

**Topic: Current Controversies in Phil Stat** (LSE, remote, 10am-12pm EST / 15:00-17:00 London time; Thursdays 21 May-18 June)

**Main Text (SIST):** *Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars* (CUP, 2018):

**I. (May 21)** Introduction: Controversies in Phil Stat:

**SIST: Preface, Excursion 1**

Preface

Excursion 1 Tour II

Notes/Outline of Excursion 1

Postcard: Souvenir A

**II.** **(May 28)** N-P and Fisherian Tests, Severe Testing:

**SIST: Excursion 3 Tour I** (focus on pages up to p. 152)

**Recommended:** Excursion 2 Tour II, pp. 92-100

*Optional:* I will (try to) answer questions on demarcation of science, induction, falsification, and Popper from Excursion 2 Tour II.

**Handout**: *Areas Under the Standard Normal Curve*

**III.** **(June 4)** Deeper Concepts: Confidence Intervals and Tests: Higgs’ Discovery:

**SIST: Excursion 3 Tour III**

*Optional:* I will answer questions on Excursion 3 Tour II: Howlers and Chestnuts of Tests.

**IV.** **(June 11)** Rejection Fallacies: Do P-values exaggerate evidence?

Jeffreys-Lindley paradox or Bayes/Fisher disagreement:

**SIST: Excursion 4 Tour II**

*Recommended:*

**V. (June 18) The Statistics Wars and Their Casualties:**

**SIST: Excursion 4 Tour III**: pp. 267-286; **Farewell Keepsake**: pp. 436-444

- Amrhein, V., Greenland, S., & McShane, B. (2019). Comment: Retire Statistical Significance. *Nature*, 567: 305-308.
- Ioannidis, J. (2019). The Importance of Predefined Rules and Prespecified Statistical Analyses: Do Not Abandon Significance. *JAMA*, 321(21): 2067-2068. doi:10.1001/jama.2019.4582
- Ioannidis, J. (2019). Correspondence: Retiring statistical significance would give bias a free pass. *Nature*, 567, 461. https://doi.org/10.1038/d41586-019-00969-2
- Mayo, D.G. (2019). P-value thresholds: Forfeit at your peril. *Eur J Clin Invest*, 49: e13170. doi:10.1111/eci.13170

**References:** Captain’s Bibliography

**DELAYED: JUNE 19-20 Workshop: The Statistics Wars and Their Casualties**

Here’s the final part of Brian Haig’s recent paper ‘What can psychology’s statistics reformers learn from the error-statistical perspective?’ in *Methods in Psychology *2 (Nov. 2020). The full article, which is open access, is here. I will make some remarks in the comments.

**5. The error-statistical perspective and the nature of science**

As noted at the outset, the error-statistical perspective has made significant contributions to our philosophical understanding of the nature of science. These are achieved, in good part, by employing insights about the nature and place of statistical inference in experimental science. The achievements include deliberations on important philosophical topics, such as the demarcation of science from non-science, the underdetermination of theories by evidence, the nature of scientific progress, and the perplexities of inductive inference. In this article, I restrict my attention to two such topics: The process of falsification and the structure of modeling.

*5.1. Falsificationism*

The best known account of scientific method is the so-called hypothetico-deductive method. According to its most popular description, the scientist takes an existing hypothesis or theory and tests it indirectly by deriving one or more observational predictions that are subjected to direct empirical test. Successful predictions are taken to provide inductive confirmation of the theory; failed predictions are said to provide disconfirming evidence for the theory. In psychology, NHST is often embedded within such a hypothetico-deductive structure and contributes to weak tests of theories.

Also well known is Karl Popper’s falsificationist construal of the hypothetico-deductive method, which is understood as a general strategy of conjecture and refutation. Although it has been roundly criticised by philosophers of science, it is frequently cited with approval by scientists, including psychologists, even though they do not, indeed could not, employ it in testing their theories. The major reason for this is that Popper does not provide them with sufficient methodological resources to do so.

One of the most important features of the error-statistical philosophy is its presentation of a falsificationist view of scientific inquiry, with error statistics serving an indispensable role in testing. From a sympathetic, but critical, reading of Popper, Mayo endorses his strategy of developing scientific knowledge by identifying and correcting errors through strong tests of scientific claims. Making good on Popper’s lack of knowledge of statistics, Mayo shows how one can properly employ a range of, often familiar, error-statistical methods to implement her all-important severity requirement. Stated minimally, and informally, this requirement says, “A claim is severely tested to the extent that it has been subjected to and passes a test that probably would have found flaws, were they present.” (Mayo, 2018, p. xii) Further, in marked contrast with Popper, who deemed deductive inference to be the only legitimate form of inference, Mayo’s conception of falsification stresses the importance of inductive, or content-increasing, inference in science. We have here, then, a viable account of falsification, which goes well beyond Popper’s account with its lack of operational detail about how to construct strong tests. It is worth noting that the error-statistical stance offers a constructive interpretation of Fisher’s oft-cited remark that the null hypothesis is never proved, only possibly disproved.
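Mayo’s informal severity requirement, quoted above, can also be given a simple quantitative illustration. The sketch below is my own illustration, not code from Haig’s paper or Mayo’s book (the function names and the one-sided Normal setup are assumptions): for a test of H0: μ ≤ 0 with known σ, the severity of the post-test claim μ > μ1 is the probability that the test would have produced a *less* impressive result had μ been only μ1.

```python
from math import sqrt, erf

def norm_cdf(z):
    """Standard normal CDF, computed via the error function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def severity(xbar, mu1, sigma, n):
    """Severity for the claim mu > mu1 after observing sample mean xbar
    in a one-sided Normal test of H0: mu <= 0 (sigma known):
    SEV = P(sample mean <= observed xbar; mu = mu1)."""
    return norm_cdf((xbar - mu1) / (sigma / sqrt(n)))

# With sigma = 1, n = 100, an observed mean of 0.2 gives z = 2.0 (p ~ 0.023).
print(round(severity(0.2, 0.0, 1.0, 100), 3))  # prints 0.977: mu > 0 passes severely
print(round(severity(0.2, 0.2, 1.0, 100), 3))  # prints 0.5: mu > 0.2 is poorly warranted
```

The same observed result thus warrants some claims severely and others hardly at all, which illustrates the piecemeal, claim-specific character of testing that Haig emphasizes.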

*5.2. A hierarchy of models*

In the past, philosophers of science tended to characterize scientific inquiry by focusing on the general relationship between evidence and theory. Similarly, scientists, even today, commonly speak in general terms of the relationship between data and theory. However, due in good part to the labors of experimentally-oriented philosophers of science, we now know that this coarse-grained depiction is a poor portrayal of science. The error-statistical perspective is one such philosophy that offers a more fine-grained parsing of the scientific process.

Building on Patrick Suppes’ (1962) important insight that science employs a hierarchy of models that ranges from experimental experience to theory, Mayo’s (1996) error-statistical philosophy initially adopted a framework in which three different types of models are interconnected and serve to structure error-statistical inquiry: Primary models, experimental models, and data models. Primary models, which are at the top of the hierarchy, break down a research problem, or question, into a set of local hypotheses that can be investigated using reliable methods. Experimental models take the mid-position on the hierarchy and structure the particular models at hand. They serve to link primary models to data models. And, data models, which are at the bottom of the hierarchy, generate and model raw data, put them in canonical form, and check whether the data satisfy the assumptions of the experimental models. It should be mentioned that the error-statistical approach has been extended to primary models and theories of a more global nature (Mayo and Spanos, 2010) and, now, also includes a consideration of experimental design and the analysis and generation of data (Mayo, 2018).

This hierarchy of models facilitates the achievement of a number of goals that are important to the error-statistician. These include piecemeal strong testing of local hypotheses rather than broad theories, and employing the model hierarchy as a structuring device to knowingly move back and forth between statistical and scientific hypotheses. The error-statistical perspective insists on maintaining a clear distinction between statistical and scientific hypotheses, pointing out that psychologists often mistakenly take tests of significance to have direct implications for substantive hypotheses and theories.

**6. The philosophy of statistics**

A heartening attitude that comes through in the error-statistical corpus is the firm belief that the philosophy of statistics is an important part of statistical thinking. This emphasis on the conceptual foundations of the subject contrasts markedly with much of statistical theory, and most of statistical practice. It is encouraging, therefore, that Mayo’s philosophical work has influenced a number of prominent statisticians, who have contributed to the foundations of their discipline. Gelman’s error-statistical philosophy canvassed earlier is a prominent case in point. Through both precept and practice, Mayo’s work makes clear that philosophy can have a direct impact on statistical practice. Given that statisticians operate with an implicit philosophy, whether they know it or not, it is better that they avail themselves of an explicitly thought-out philosophy that serves their thinking and practice in useful ways. More particularly, statistical reformers recommend methods and strategies that have underlying philosophical commitments. It is important that they are identified, described, and evaluated.

The tools used by the philosopher of statistics in order to improve our understanding and use of statistical methods are considerable (Mayo, 2011). They include clarifying disputed concepts, evaluating arguments employed in statistical debates, including the core commitments of rival schools of thought, and probing the deep structure of statistical methods themselves. In doing this work, the philosopher of statistics, as philosopher, ascends to a meta-level to get purchase on their objects of study. This second-order inquiry is a proper part of scientific methodology.

It is important to appreciate that the error-statistical outlook is a scientific methodology in the proper sense of the term. Briefly stated, methodology is the interdisciplinary field that draws from disciplines that include statistics, philosophy of science, history of science, as well as indigenous contributions from the various substantive disciplines. As such, it is the key to a proper understanding of statistical and scientific methods. Mayo’s focus on the role of error statistics in science is deeply informed about the philosophy, history, and theory of statistics, as well as statistical practice. It is for this reason that the error-statistical perspective is strategically positioned to help the reader to go beyond the statistics wars.

**7. Conclusion**

The error-statistical outlook provides researchers, methodologists, and statisticians with a distinctive and illuminating perspective on statistical inference. Its Popper-inspired emphasis on strong tests is a welcome antidote to the widespread practice of weak statistical hypothesis testing that still pervades psychological research. More generally, the error-statistical standpoint affords psychologists an informative perspective on the nature of good statistical practice in science that will help them understand and transcend the statistics wars into which they have been drawn. Importantly, psychologists should know about the error-statistical perspective as a genuine alternative to the new statistics and Bayesian statistics. The new statisticians, Bayesian statisticians, and those with other preferences should address the challenges to their outlooks on statistics that the error-statistical viewpoint provides. Taking these challenges seriously would enrich psychology’s methodological landscape.

*This article is based on an invited commentary on Deborah Mayo’s book, *Statistical inference as severe testing: How to get beyond the statistics wars* (Cambridge University Press, 2018), which appeared at https://statmodeling.stat.columbia.edu/2019/04/12. It is adapted with permission. I thank Mayo for helpful feedback on an earlier draft.

Refer to the paper for the references. I invite your comments and questions.


Here’s a picture of ripping open the first box of (rush) copies of *Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars*,* and here’s a continuation of Brian Haig’s recent paper ‘What can psychology’s statistics reformers learn from the error-statistical perspective?’ in *Methods in Psychology* 2 (Nov. 2020). Haig contrasts error statistics, the “new statistics”, and Bayesian statistics from the perspective of the statistics wars in psychology. The full article, which is open access, is here. I will make several points in the comments.

**4. Bayesian statistics**

Despite its early presence, and prominence, in the history of statistics, the Bayesian outlook has taken an age to assert itself in psychology. However, a cadre of methodologists has recently advocated the use of Bayesian statistical methods as a superior alternative to the messy frequentist practice that dominates psychology’s research landscape (e.g., Dienes, 2011; Kruschke and Liddell, 2018; Wagenmakers, 2007). These Bayesians criticize NHST, often advocate the use of Bayes factors for hypothesis testing, and rehearse a number of other well-known Bayesian objections to frequentist statistical practice.

Of course, there are challenges for Bayesians from the error-statistical perspective, just as there are for the new statisticians. For example, the frequently made claim that *p* values exaggerate the evidence against the null hypothesis, but Bayes factors do not, is shown by Mayo not to be the case. She also makes the important point that Bayes factors, as they are currently used, do not have the ability to probe errors and, thus, violate the requirement for severe tests. Bayesians, therefore, need to rethink whether Bayes factors can be deployed in some way to provide strong tests of hypotheses through error control. As with the new statisticians, Bayesians also need to reckon with the coherent hybrid NHST afforded by the error-statistical perspective, and argue against it, rather than the common inchoate hybrids, if they want to justify abandoning NHST. Finally, I note in passing that Bayesians should consider, among other challenges, Mayo’s critique of the controversial Likelihood Principle, a principle which ignores the post-data consideration of sampling plans.
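The disagreement between *p* values and Bayes factors that Haig mentions can be made concrete with the Jeffreys-Lindley phenomenon. The sketch below is my own illustration, not from Haig’s paper or Mayo’s book; it uses the textbook Normal-Normal marginal-likelihood Bayes factor, and the prior scale tau = 1 is an assumption. Holding the p-value fixed while the sample size grows, the Bayes factor moves from favoring the alternative to strongly favoring the null.

```python
from math import sqrt, exp

def bf01_normal(z, n, sigma=1.0, tau=1.0):
    """Bayes factor for H0: mu = 0 against H1: mu ~ N(0, tau^2),
    given a z-statistic from n Normal(mu, sigma^2) observations.
    BF01 = sqrt(1 + r) * exp(-(z^2 / 2) * r / (1 + r)), with r = n*tau^2/sigma^2."""
    r = n * tau ** 2 / sigma ** 2
    return sqrt(1.0 + r) * exp(-0.5 * z ** 2 * r / (1.0 + r))

# Fix the p-value (z = 1.96, two-sided p ~ 0.05) and let n grow:
for n in (10, 1000, 100000):
    print(n, round(bf01_normal(1.96, n), 2))
# BF01 rises from below 1 (against H0) to well above 1 (for H0),
# even though the p-value reads "significant at 0.05" in every case.
```

Whether this shows that p-values exaggerate evidence, or instead that spiked-prior Bayes factors are the wrong yardstick, is exactly the point at issue between the two camps.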

*4.1. Contrasts between the Bayesian and error-statistical perspectives*

One of the major achievements of the philosophy of error-statistics is that it provides a comprehensive critical evaluation of the major variants of Bayesian statistical thinking, including the classical subjectivist, “default”, pragmatist, and eclectic options within the Bayesian corpus. Whether the adoption of Bayesian methods in psychology will overcome the disorders of current frequentist practice remains to be seen. What is clear from reading the error-statistical literature, however, is that the foundational options for Bayesians are numerous, convoluted, and potentially bewildering. It would be a worthwhile exercise to chart how these foundational options are distributed across the prominent Bayesian statisticians in psychology. For example, the increasing use of Bayes factors for hypothesis testing purposes is accompanied by disorderliness at the foundational level, just as it is in the Bayesian literature more generally. Alongside the fact that some Bayesians are sceptical of the worth of Bayes factors, we find disagreement about the comparative merits of the subjectivist and default Bayesianism outlooks on Bayes factors in psychology (Wagenmakers et al., 2018).

The philosophy of error-statistics contains many challenges for Bayesians to consider. Here, I want to draw attention to three basic features of Bayesian thinking, which are rejected by the error-statistical approach. First, the error-statistical approach rejects the Bayesian insistence on characterizing the evidential relation between hypothesis and evidence in a universal and logical manner in terms of Bayes’ theorem. Instead, it formulates the relation in terms of the substantive and specific nature of the hypothesis and the evidence with regards to their origin, modeling, and analysis. This is a consequence of a strong commitment to a piecemeal, contextual approach to testing, using the most appropriate frequentist methods available for the task at hand. This contextual attitude to testing is taken up in Section 5.2, where one finds a discussion of the role different models play in structuring and decomposing inquiry.

Second, the error-statistical philosophy also rejects the classical Bayesian commitment to the subjective nature of prior probabilities, which the agent is free to choose, in favour of the more objective process of establishing error probabilities understood in frequentist terms. It also finds unsatisfactory the turn to the more popular objective, or “default”, Bayesian option, in which the agent’s appropriate degrees of belief are constrained by relevant empirical evidence. The error-statistician rejects this default option because it fails in its attempts to unify Bayesian and frequentist ways of determining probabilities.

And, third, the error-statistical outlook employs probabilities to measure how effectively *methods* facilitate the detection of error, and how those methods enable us to choose between alternative hypotheses. By contrast, orthodox Bayesians use probabilities to measure *belief* in hypotheses or degrees of confirmation. As noted earlier, most Bayesians are not concerned with error probabilities at all. It is for this reason that error-statisticians will say about Bayesian methods that, without supplementation with error probabilities, they are not capable of providing stringent tests of hypotheses.

*4.2. The Bayesian remove from scientific practice*

Two additional features of the Bayesian focus on beliefs, which have been noted by philosophers of science and statistics, draw attention to their outlook on science. First, Kevin Kelly and Clark Glymour worry that “Bayesian methods assign numbers to answers instead of producing answers outright.” (2004, p. 112) Their concern is that the focus on the scientist’s beliefs “screens off” the scientist’s direct engagement with the empirical and theoretical activities that are involved in the phenomenology of science. Mayo agrees that we should focus on the scientific phenomena of interest, not the associated epiphenomena of degrees of belief. This preference stems directly from the error-statistician’s conviction that probabilities properly quantify the performance of methods, not the scientist’s degrees of belief.

Second, Henry Kyburg is puzzled by the Bayesian’s desire to “replace the fabric of science… with a vastly more complicated representation in which each statement of science is accompanied by its probability, for each of us.” (1992, p.149) Kyburg’s puzzlement prompts the question, ‘Why should we be interested in each other’s probabilities?’ This is a question raised by David Cox about prior probabilities, and noted by Mayo (2018).

This Bayesian remove from science contrasts with the willingness of the error-statistical perspective to engage more directly with science. Mayo is a philosopher of science as well as statistics, and has a keen eye for scientific practice. Given that contemporary philosophers of science tend to take scientific practice seriously, it comes as no surprise that she brings it to the fore when dealing with statistical concepts and issues. Indeed, her error-statistical philosophy should be seen as a significant contribution to the so-called *new experimentalism*, with its strong focus, not just on experimental practice in science, but also on the role of statistics in such practice. Her discussion of the place of frequentist statistics in the discovery of the Higgs boson in particle physics is an instructive case in point.

Taken together, these just-mentioned points of difference between the Bayesian and error-statistical philosophies constitute a major challenge to Bayesian thinking that methodologists, statisticians, and researchers in psychology need to confront.

*4.3. Bayesian statistics with error-statistical foundations*

One important modern variant of Bayesian thinking, which now receives attention within the error-statistical framework, is the *falsificationist Bayesianism* of Andrew Gelman, which received its major formulation in Gelman and Shalizi (2013). Interestingly, Gelman regards his Bayesian philosophy as essentially error-statistical in nature – an intriguing claim, given the anti-Bayesian preferences of both Mayo and Gelman’s co-author, Cosma Shalizi. Gelman’s philosophy of Bayesian statistics is also significantly influenced by Popper’s view that scientific propositions are to be submitted to repeated criticism in the form of strong empirical tests. For Gelman, best Bayesian statistical practice involves formulating models using Bayesian statistical methods, and then checking them through hypothetico-deductive attempts to falsify and modify those models.

Both the error-statistical and neo-Popperian Bayesian philosophies of statistics extend and modify Popper’s conception of the hypothetico-deductive method, while at the same time offering alternatives to received views of statistical inference. The error-statistical philosophy injects into the hypothetico-deductive method an account of statistical induction that employs a panoply of frequentist statistical methods to detect and control for errors. For its part, Gelman’s Bayesian alternative involves formulating models using Bayesian statistical methods, and then checking them through attempts to falsify and modify those models. This clearly differs from the received philosophy of Bayesian statistical modeling, which is regarded as a formal inductive process.

From the wide-ranging error-statistical evaluation of the major varieties of Bayesian statistical thought on offer, Mayo concludes that Bayesian statistics needs new foundations: In short, those provided by her error-statistical perspective. Gelman acknowledges that his falsificationist Bayesian philosophy is underdeveloped, so it will be interesting to learn how its further development relates to Mayo’s error-statistical perspective. It will also be interesting to see if Bayesian thinkers in psychology engage with Gelman’s brand of Bayesian thinking. Despite the appearance of his work in a prominent psychology journal, they have yet to do so. However, Borsboom and Haig (2013) and Haig (2018) provide sympathetic critical evaluations of Gelman’s philosophy of statistics.

It is notable that in her treatment of Gelman’s philosophy, Mayo emphasizes that she is willing to allow a decoupling of statistical outlooks and their traditional philosophical foundations in favour of different foundations, which are judged more appropriate. It is an important achievement of Mayo’s work that she has been able to consider the current statistics wars without taking a particular side in the debates. She achieves this by examining methods, both Bayesian and frequentist, in terms of whether they violate her minimal severity requirement of “bad evidence, no test”.

I invite your comments and questions.

*This picture was taken by Diana Gillooly, Senior Editor for Mathematical Sciences, Cambridge University Press, at the book display for the Sept. 2018 meeting of the Royal Statistical Society in Cardiff. She also had the honor of doing the ripping. A blogpost on the session I was in is here.

This is the title of Brian Haig’s recent paper in *Methods in Psychology *2 (Nov. 2020). Haig is a professor emeritus of psychology at the University of Canterbury. Here he provides both a thorough and insightful review of my book *Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars* (CUP, 2018) as well as an excellent overview of the high points of today’s statistics wars and the replication crisis, especially from the perspective of psychology. I’ll excerpt from his article in a couple of posts. The full article, which is open access, is here.

**Abstract:** In this article, I critically evaluate two major contemporary proposals for reforming statistical thinking in psychology: The recommendation that psychology should employ the “new statistics” in its research practice, and the alternative proposal that it should embrace Bayesian statistics. I do this from the vantage point of the modern error-statistical perspective, which emphasizes the importance of the severe testing of knowledge claims. I also show how this error-statistical perspective improves our understanding of the nature of science by adopting a workable process of falsification and by structuring inquiry in terms of a hierarchy of models. Before concluding, I briefly discuss the importance of the philosophy of statistics for improving our understanding of statistical thinking.

*Keywords:* The error-statistical perspective, The new statistics, Bayesian statistics, Falsificationism, Hierarchy of models, Philosophy of statistics

**1. Introduction**

Psychology has been prominent among a number of disciplines that have proposed statistical reforms for improving our understanding and use of statistics in research. However, despite being at the forefront of these reforms, psychology has ignored the philosophy of statistics to its detriment. In this article, I consider, in a broad-brush way, two major proposals that feature prominently in psychology’s current methodological reform literature: The recommendation that psychology should employ the so-called “new statistics” in its research practice, and the alternative proposal that psychology should embrace Bayesian statistics. I evaluate each from the vantage point of the error-statistical philosophy, which, I believe, is the most coherent perspective on statistics available to us. Before concluding, I discuss two interesting features of the conception of science adopted by the error-statistical perspective, along with brief remarks about the value of the philosophy of statistics for deepening our understanding of statistics.

**2. The error-statistical perspective**

The error-statistical perspective employed in this article is that of Deborah Mayo, sometimes in collaboration with Aris Spanos (Mayo, 1996, 2018; Mayo & Spanos, 2010, 2011). This perspective is landmarked by two major works. The first is Mayo’s ground-breaking book, *Error and the growth of experimental knowledge* (1996), which presented the first extensive formulation of her error-statistical perspective on statistical inference. This philosophy provides a systematic understanding of experimental reasoning in science that uses frequentist statistics in order to manage error. Hence, its name. The novelty of the book lay in the fact that it employed ideas in statistical science to shed light on philosophical problems to do with the nature of evidence and inference.

The second book is Mayo’s recently published *Statistical inference as severe testing* (2018). In contrast with the first book, this work focuses on problems arising from statistical practice, but endeavors to solve them by probing their foundations from the related vantage points of the philosophy of science and the philosophy of statistics. By dealing with the vexed problems of current statistical practice, this book is a valuable repository of ideas, insights, and solutions designed to help a broad readership deal with the current crisis in statistics. Because my focus is on statistical reforms in psychology, I draw mainly from the resources contained in the second book.

Fundamental disputes about the nature and foundations of statistical inference are long-standing and ongoing. Most prominent have been the numerous debates between, and within, frequentist and Bayesian camps. Cutting across these debates have been more recent attempts to unify and reconcile rival outlooks, which have complexified the statistical landscape. Today, these endeavors fuel the ongoing concern that psychology and many sciences have with replication failures, questionable research practices, and the strong demand for an improvement of research integrity. Mayo refers to debates about these concerns as the “statistics wars”. With the addition of *Statistical inference as severe testing* to the error-statistical corpus, it is fair to say that the error-statistical outlook now has the resources to enable statisticians and scientists to understand and advance beyond the bounds of these statistics wars.

The strengths of the error-statistical approach are considerable (Haig, 2017; Spanos, 2019a, 2019b), and I believe that they combine to give us the most coherent philosophy of statistics currently available. For the purpose of this article, it suffices to say that the error-statistical approach contains the methodological and conceptual resources that enable one to diagnose and overcome the common misunderstandings of widely used frequentist statistical methods such as tests of significance. It also provides a trenchant critique of Bayesian ways of thinking in statistics. I will draw from these two strands of the error-statistical perspective to inform my critical evaluation of the new statistics and the Bayesian alternative.

Because the error-statistical and Bayesian outlooks are so different, some might consider it unfair to use the former to critique the latter. My response to this worry is three-fold: First, perspective-taking is an unavoidable feature of the human condition; we cannot rise above our human conceptual frameworks and adopt a position from nowhere. Second, in thinking things through, we often find it useful to proceed by contrast, rather than direct analysis. Indeed, the error-statistical outlook on statistics was originally developed in part by using the Bayesian outlook as a foil. And third, strong debates between Bayesians and frequentists have a long history, and they have helped shape the character of these two alternative outlooks on statistics. By participating in these debates, the error-statistical perspective is itself unavoidably controversial.

**3. The new statistics**

For decades, numerous calls have been made for replacing tests of statistical significance with alternative statistical methods. The new statistics, which urges the abandonment of null hypothesis significance testing (NHST), and the adoption of effect sizes, confidence intervals, and meta-analysis as a replacement package, is one such reform movement (Calin-Jageman and Cumming, 2019; Cumming, 2012, 2014). It has been heavily promoted in psychological circles and touted as a much-needed successor to NHST, which is deemed to be broken-backed. *Psychological Science*, which is the flagship journal of the Association for Psychological Science, endorsed the use of the new statistics, wherever appropriate (Eich, 2014). In fact, the new statistics might be considered the Association’s current quasi-official position on statistical inference. Although the error-statistical outlook does not directly address the new statistics movement, its suggestions for overcoming the statistics wars contain insights about statistics that can be employed to mount a powerful challenge to the integrity of that movement.

*3.1. Null hypothesis significance testing*

The new statisticians contend that NHST has major flaws and recommend replacing it with their favored statistical methods. Prominent among the flaws are the familiar claims that NHST encourages dichotomous thinking, and that it comprises an indefensible amalgam of the Fisherian and Neyman-Pearson schools of thought. However, neither of these features applies to the error-statistical understanding of NHST. The claim that we should abandon NHST because it leads to dichotomous thinking is unconvincing because it is leveled at the misuse of a statistical test that arises from its mechanical application and a poor understanding of its foundations. By contrast, the error-statistical perspective advocates the flexible use of levels of significance tailored to the case at hand as well as reporting of exact *p* values – a position that Fisher himself came to hold.

Further, the error-statistical perspective makes clear that NHST, as commonly understood, is not a faithful amalgam of Fisher’s and Neyman and Pearson’s thinking on the matter, especially their mature thought. The error-statistical outlook can also accommodate both evidential and behavioural interpretations of NHST, respectively serving *probative* and *performance* goals, to use Mayo’s suggestive terms. The error-statistical perspective urges us to move beyond the claim that NHST is an inchoate hybrid. Based on a close reading of the historical record, Mayo argues that Fisher and Neyman and Pearson should be interpreted as compatibilists, and that focusing on the vitriolic exchanges between Fisher and Neyman prevents one from seeing how their views dovetail. Importantly, Mayo formulates the error-statistical perspective on NHST by assembling insights from these founding fathers, and additional sources, into a coherent hybrid. There is much to be said for replacing psychology’s fixation on the muddle that is NHST with the error-statistical perspective on significance testing.

Thus, the recommendation of the new statisticians to abandon NHST, understood as the inchoate hybrid commonly employed in psychology, commits the fallacy of the false dichotomy because there exist alternative defensible accounts of NHST (Haig, 2017). The error-statistical perspective is one such attractive alternative.

*3.2. Confidence intervals*

For the new statisticians, confidence intervals replace *p*-valued null hypothesis significance testing. Confidence intervals are said to be more informative, and more easily understood, than *p* values, as well as serving the important scientific goal of estimation, which is preferred to hypothesis testing. Both of these claims are open to challenge. Whether confidence intervals are more informative than statistical hypothesis tests in a way that matters will depend on the research goals being pursued. For example, *p* values might properly be used to get a useful initial gauge of whether an experimental effect occurs in a particular study, before one runs further studies and reports *p* values, supplementary confidence intervals, and effect sizes. The claim that confidence intervals are more easily understood than *p* values is surprising, and is not borne out by the empirical evidence (e.g., Hoekstra et al., 2014). I will speak to the claim about the greater importance of estimation in the next section.

There is a double irony in the fact that the new statisticians criticize NHST for encouraging simplistic dichotomous thinking: For one, as already noted, such thinking is straightforwardly avoided by employing tests of statistical significance properly, whether or not one adopts the error-statistical perspective. For another, the adoption of standard frequentist confidence intervals in place of NHST forces the new statisticians to engage in dichotomous thinking of another kind: making a decision on whether a parameter estimate falls inside, or outside, its confidence interval.

Error-statisticians have good reason for claiming that their reinterpretation of frequentist confidence intervals is superior to the standard view. The account of confidence intervals adopted by the new statisticians prespecifies a single confidence interval (a strong preference for 0.95 in their case). The single interval estimate corresponding to this level provides the basis for the inference that is drawn about the parameter values, depending on whether they fall inside or outside the interval. A limitation of this way of thinking is that each value of the parameter in the interval is taken to have the same evidential, or probative, force – an unsatisfactory state of affairs that results from weak testing. For example, there is no way of answering the relevant questions, ‘Are the values in the middle of the interval closer to the true value?’, or ‘Are they more probable than others in the interval?’

The error-statistician, by contrast, draws inferences about each of the obtained values, according to whether they are warranted, or not, at different severity levels, thus leading to a series of confidence intervals. Mayo (2018) captures the counterfactual logic of severity thinking involved with the following general example: “Were *μ* less than the 0.995 lower limit, then it is very probable (>0.995) that our procedure would yield a smaller sample mean than 0.6. This probability gives the severity.” (p. 195) Clearly, this is a more nuanced and informative assessment of parameter estimates than that offered by the standard view. Details on the error-statistical conception of confidence intervals can be found in Mayo (2018, pp. 189–201), as well as Mayo and Spanos (2011) and Spanos (2014, 2019a, b).
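Mayo’s counterfactual reasoning can be turned into a short numerical sketch. The sample mean of 0.6 echoes the quotation above; the standard error of 0.2 and the one-sided test of μ ≤ 0 are illustrative assumptions of mine, not values taken from her example.

```python
from statistics import NormalDist

# Assumed for illustration: observed sample mean 0.6, standard error 0.2,
# in a one-sided test of mu <= 0 (the 0.6 echoes the quotation; the rest
# is my assumption).
x_bar, se = 0.6, 0.2
z = NormalDist()

def severity_lower_bound(sev):
    """Bound mu_1 with SEV(mu > mu_1) = sev: were mu equal to mu_1, a sample
    mean smaller than x_bar would occur with probability sev."""
    return x_bar - z.inv_cdf(sev) * se

for sev in (0.5, 0.9, 0.95, 0.995):
    print(f"mu > {severity_lower_bound(sev):.3f} is warranted at severity {sev}")
```

Reading down the list gives the series of nested bounds described above: claims warranted at higher severity assert less, placing the lower bound further below the observed mean.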

Methodologists and researchers in psychology are now taking confidence intervals seriously. However, in the interests of adopting a sound frequentist conception of such intervals, they would be well advised to replace the new statistics conception of them with their superior error-statistical understanding.

*3.3. Estimation and hypothesis tests*

The new statisticians claim, controversially, that parameter estimation, rather than statistical hypothesis testing, leads to better science – presumably in part because of the deleterious effects of NHST. However, a strong preference for estimation leads Cumming (2012) to aver that the typical questions addressed in science are what questions (e.g., “What is the age of the earth?”, “What is the most likely sea-level rise by 2012?”). I think that this is a restricted, rather “flattened”, view of science where, by implication, explanatory* why* questions and *how* questions (which often ask for information about causal mechanisms) are considered atypical.

Why and how questions are just as important for science as what questions. They are often the sort of questions that science seeks to answer when constructing and evaluating explanatory hypotheses and theories. Interestingly, and at variance with this view, Cumming (Fidler and Cumming, 2014) acknowledges that estimation can be usefully combined with hypothesis testing in science, and that estimation can play a valuable role in theory construction. This is as it should be because science frequently incorporates parameter estimates in precise predictions that are used to assess the hypotheses and theories from which they are derived.

Although it predominantly uses the language of testing, the error-statistical perspective maintains that statistical inference can be employed to deal with both estimation and hypothesis testing problems. It also endorses the view that providing explanations of things is an important part of science and, in fact, advocates piecemeal testing of local hypotheses nested within large-scale explanatory theories.

Despite the generally favorable reception of the new statistics in psychology, it has been subject to criticism by both frequentists (e.g., Sakaluk, 2016), and Bayesians (e.g., Kruschke and Liddell, 2018). However, these criticisms have not occasioned a public response from the principal advocates of the new statistics movement. The error-statistical outlook presents a golden opportunity for those who advocate, or endorse, the new statistics to defend their position in the face of challenging criticism. A sound justification for the promotion and adoption of new statistics practices in psychology requires as much.

To be continued…. Please share comments and questions.

Excerpts and Mementos from SIST on this blog are compiled here.

**Stephen Senn**

Consultant Statistician

Edinburgh

The intellectual illness of clinical drug evaluation that I have discussed here can be cured, and it will be cured when we restore intellectual primacy to the questions we ask, not the methods by which we answer them. Lewis Sheiner^{1}

In their recent essay *Causal Evidence and Dispositions in Medicine and Public Health*^{2}, Elena Rocca and Rani Lill Anjum challenge ‘the epistemic primacy of randomised controlled trials (RCTs) for establishing causality in medicine and public health’. That an otherwise stimulating essay by two philosophers, experts on causality, which makes many excellent points on the nature of evidence, repeats a common misunderstanding about randomised clinical trials is grounds enough for me to address this topic again. Before explaining why I disagree with Rocca and Anjum on RCTs, however, I want to make clear that I agree with much of what they say. I loathe these pyramids of evidence, beloved by some members of the evidence-based movement, which have RCTs at the apex or possibly occupying a second place just underneath meta-analyses of RCTs. In fact, although I am a great fan of RCTs and (usually) of *intention to treat* analysis, I am convinced that RCTs alone are not enough. My thinking on this was profoundly affected by Lewis Sheiner’s essay of nearly thirty years ago (from which the quote at the beginning of this blog is taken). Lewis was interested in many aspects of investigating the effects of drugs and would, I am sure, have approved of Rocca and Anjum’s insistence that there are many layers of understanding how and why things work, and that the means of investigating them may have to range from basic laboratory experiments to patient narratives via RCTs. Rocca and Anjum’s essay provides a good discussion of the various ‘causal tasks’ that need to be addressed and backs this up with some excellent examples.

In discussing RCTs Rocca and Anjum write

‘…any difference in outcome between the test group and the control group should be caused by the tested interventions, since all other differences should be homogenously distributed between the two groups,’

and later,

‘The experimental design is intended to minimise complexity—for instance, through strict inclusion and exclusion criteria’.

However, it is not the case that randomisation will guarantee that any difference between the groups should be caused by the intervention. On the contrary, many things apart from the treatment will affect the observed difference. Nor is it the case that the analysis of RCTs requires the minimisation of complexity. Randomisation and its associated analysis deal with complexity in the experimental material, and although the treatment structure in RCTs is often simple, this is not always so (I give an example below), and it was not so in the field (literally) of agriculture for which Fisher developed his theory of randomisation. This is what Fisher himself had to say about complexity:

No aphorism is more frequently repeated in connection with field trials, than that we must ask Nature few questions, or ideally one question, at a time. The writer is convinced that this view is wholly mistaken. Nature, he suggests, will best respond to a logical and carefully thought out questionnaire; indeed, if we ask her a single question, she will often refuse to answer until some other topic has been discussed.

^{3} (p. 511)

This 1926 paper of Fisher’s is an important and early statement of his views on randomisation and was cited recently by Simon Raper in his article in *Significance*^{4}. Raper points out that Fisher was abandoning as unworkable an earlier view of causality due to John Stuart Mill, whereby controlling for everything imaginable was the way one made valid causal judgements. I consider Raper right in thinking of Fisher’s approach as an alternative to Mill’s programme, rather than some realisation of it, so I disagree, for example, with Mumford and Anjum in their book^{5} when they state

‘Fisher’s idea is the basis of the randomized controlled trial (RCT), which builds on J.S. Mill’s earlier method of difference’ (pp. 111-112).

I shall now explain exactly what it is that Fisher’s approach does with the help of an example.

Before going into the example, which is a complex design, it is necessary to clear up one further potential point of confusion in Rocca and Anjum’s essay. N-of-1 studies are not alternatives to RCTs but a subset of them. RCTs include not just conventional parallel group trials but also cluster randomised trials and cross-over trials, including n-of-1 studies. The difference between these studies lies in the level at which one randomises, and this is reflected in my example, which has features of both a parallel group and a cross-over study. Thus, reading Rocca and Anjum’s paper, which I can recommend, will make more sense if their use of *RCT* is understood as ‘*randomised parallel group trials*’.

For the moment, all that it is necessary to know is that within the same design, I can compare the effect on forced expiratory volume in one second (FEV_{1}), measured 12 hours after treatment, of two bronchodilators in asthma, which here I shall just label ISF24 and MTA6, in two different ways. First, I can use 71 patients who were given MTA6 and ISF24 on different occasions. Here I can compare the two treatments patient by patient. These data have the structure of a within-patient study. Second, within the same study there were 37 further patients who were given MTA6 but not ISF24 and 37 further patients who were given ISF24 but not MTA6. Here I can compare the two groups of patients with each other. These data have the structure of a between-patient or parallel group study.

I now proceed to analyse the data from the 71 pairs of values from the patients who were given both using a matched pairs t-test. This will be referred to as the *within-patient study*. Note that this is an analysis of 2×71=142 values in total. I then proceed to compare the 37 patients given MTA6 *only* to the 37 given ISF24 *only* using a two-sample t-test. I shall refer to this as the *between-patient study*. Note that this is an analysis of 37+37=74 values in total. Finally, I combine the two using a meta-analysis.

The results are presented in the figure below which gives the point estimates for the difference between the two treatments and the 95% confidence intervals for both analyses and for a meta-analysis of both, which is labelled ‘combined’. (The horizontal dashed line is the point estimate for a full analysis of all the data and is described in the appendix.) Note how much wider the confidence intervals are for the between-patient study than the within-patient study. This is because the within-patient study is much more precise.

Why is the within-patient study so much more precise? Part of the story is that it is based on more data, in fact nearly twice as many data: 142 rather than 74. However, this is only part of the story. The ratio of variances is more than 30 to 1 and not just approximately 2 to 1, as the number of data might suggest. The main reason is that the within-patient study has balanced for a huge number of factors and the between-patient study has not. Thus, differences in 20,000 plus genes and all life-history until the beginning of the trial are balanced in the within-patient study, since each patient is his or her own control. For the between-patient study none of this is balanced by design. In fact, there are two crucial points regarding balance.

1. Randomisation does not produce balance

2. This does not affect the validity of the analysis

Why do I claim this does not matter? Suppose we accept the within-patient estimate as being nearly perfect because it balances for those huge numbers of factors. It seems that we can then claim that the between-patient estimate did a pretty bad job. The point estimate is 0.2L more than that from the within-patient design, a non-negligible difference. However, this is to misunderstand what the between-patient analysis claims. Its ‘claim’ is not the point estimate; its claim is the distribution associated with it, of which the 95% confidence interval is a sort of minimalist conventional summary and of which the point estimate is only one point. As I have explained elsewhere, such claims of uncertainty are a central feature of statistics. Thus, the true claim made by the between-patient study is not misleading. It is vague and, indeed, when we come to combine the results, the meta-analysis will give 30 times the weight to the within-patient estimate as to the between-patient estimate simply because of the vagueness of the associated claim. This is why the result from the meta-analysis is so similar to that of the within-patient estimate. Furthermore, although this can never be guaranteed, since probabilities are involved, the 95% CI for the between-patient study includes the estimate given by the within-patient study. (Note, that in general, confidence intervals are not a statement about a value in a future study, but about the ‘true’ average value^{6} but here, the within-patient study being very precise, they can be taken to be similar.)
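The arithmetic behind the precision gap and the meta-analytic weighting can be sketched with assumed numbers. The variance components below are illustrative values of mine, chosen to reproduce a ratio of the order of 30:1; the point estimates are hypothetical values differing by 0.2 L, as in the text; none of these are the trial’s actual figures.

```python
# Illustrative variance components (my assumptions, not the trial's values).
var_patient = 0.30   # between-patient variability (genes, life history, ...)
var_noise = 0.02     # residual within-patient variability
n_pairs, n_per_group = 71, 37

# Within-patient: the patient effect cancels in each paired difference.
var_within = 2 * var_noise / n_pairs
# Between-patient: the patient effect stays in each arm's mean.
var_between = 2 * (var_patient + var_noise) / n_per_group
print(round(var_between / var_within, 1))  # ~30, despite only ~2x the data

# Inverse-variance meta-analysis: hypothetical point estimates 0.2 L apart.
est_within, est_between = 0.17, 0.37
w_within, w_between = 1 / var_within, 1 / var_between
combined = (w_within * est_within + w_between * est_between) / (w_within + w_between)
print(round(combined, 3))  # close to the within-patient estimate
```

With these numbers the combined estimate sits almost on top of the within-patient estimate, because nearly all the weight goes to the precise study and next to none to the vague one.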

This works because what Fisher’s analysis does is use variation at an appropriate level to estimate variation in the treatment estimate. So, for the between-patient study, it starts from the following observations:

1) There are numerous factors apart from treatment that could affect the outcome in one arm of the between-patient study compared to the other.

2) However, it is the joint effect of these that matters.

3) This joint effect of such factors will also vary within each of the two treatment groups.

4) Provided I use a method of allocation that is random, there will be no tendency for this variation within the groups to be larger or smaller than that between the groups.

5) Under this condition I have a way of estimating how reliable the treatment estimate is.

Thus, his programme is not about eliminating all sources of variation. He knows that this is impossible and accepts that estimates will be imperfect. Instead, he answers the question: ‘given that estimates are (inevitably) less than perfect, can we estimate how reliable they are?’. The answer he provides is ‘yes’ if we randomise.
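The five-step argument can be checked by simulation. This is an illustrative sketch with made-up outcome data and no treatment effect: the variability of the group difference over repeated random allocations is estimated well by the variation within the groups, which is exactly what the conventional analysis uses.

```python
import random
import statistics

random.seed(2)

# Made-up outcomes for 74 patients, no treatment effect: all variation
# comes from "numerous factors apart from treatment".
outcomes = [random.gauss(0, 1) for _ in range(74)]

# Repeatedly randomise 37 vs 37 and record the difference in group means.
diffs = []
for _ in range(4000):
    random.shuffle(outcomes)
    diffs.append(statistics.mean(outcomes[:37]) - statistics.mean(outcomes[37:]))

# What randomisation actually produces versus Fisher's estimate from the
# variation within the groups (with no treatment effect, the pooled
# within-group variance equals the variance of all 74 values).
actual_var = statistics.variance(diffs)
fisher_var = 2 * statistics.variance(outcomes) / 37
print(actual_var, fisher_var)  # the two should be close
```

Under random allocation the theoretical variance of the difference in means is exactly 2S²/37, where S² is the variance of the 74 outcomes, so the two printed numbers agree up to Monte Carlo error.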

If we now turn to the within-patient estimate, the same argument is repeated but in a first step differences are calculated by patient. These differences do not reflect differences in genes etc. since each patient acts as his or her own control. (They could reflect a treatment-by-patient interaction but this is another story I choose not to go into here^{7, 8}. See my blog on n-of-1 trials for a discussion.) The argument then uses the variance in the single group of differences to estimate how reliable their average will be.

Note that a different design requires a different analysis, in particular because the estimate of the variability of the treatment estimate will be inappropriate even if the estimate itself is not affected. This is illustrated in Figure 2, which shows what happens if you analyse the paired data from the 71 patients as if they were two independent sets of 71 each. Although the point estimate is unchanged, the confidence interval is now much wider than it was before. The value of having the patients as their own control is lost. The downstream effect of this is that the meta-analysis now weights the two estimates inappropriately.
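The damage done by the wrong analysis can be sketched with assumed variance components (illustrative values of mine, not the trial’s): treating the paired values as independent leaves the patient effect in the standard error.

```python
import math

# Assumed (illustrative) variance components, not the trial's actual values.
var_patient, var_noise, n = 0.30, 0.02, 71

se_paired = math.sqrt(2 * var_noise / n)                 # correct: paired analysis
se_wrong = math.sqrt(2 * (var_patient + var_noise) / n)  # treats the arms as independent

print(round(se_wrong / se_paired, 1))  # 4.0: same point estimate, CI four times wider
```

With these particular components the confidence interval from the wrong analysis is exactly four times wider, while the point estimate is untouched.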

Note also, that it is not a feature of Fisher’s approach that claims made by larger or otherwise more precise trials are generally more reliable than smaller or otherwise less precise ones. The increase in precision is *consumed* by the calculation of the confidence interval^{9, 10}. More precise designs produce narrower intervals. Nothing is left to make the claim that is made more valid. It is simply more precise. The allowance for chance effects will be less, and appropriately so. Balance is a matter of precision not validity.

As I often put it, the shocking truth about RCTs is the opposite of what many believe. Far from requiring us to know that all possible causal factors affecting the outcome are balanced in order for the conventional analysis of RCTs to be valid, if we knew all such factors were balanced, the conventional analysis would be *invalid*. RCTs neither guarantee nor require balance. Imbalance is inevitable and Fisher’s analysis allows for this. The allowance that is made for imbalance is appropriate provided that we have randomised. Thus, randomisation is a device for enabling us to make precise estimates of an inevitable imprecision.

I thank George Davey Smith, Elena Rocca and Rani Lill Anjum for helpful comments on an earlier version.

1. Sheiner LB. The intellectual health of clinical drug evaluation. *Clin Pharmacol Ther* 1991; **50**(1): 4-9.
2. Rocca E, Anjum RL. Causal Evidence and Dispositions in Medicine and Public Health. *International Journal of Environmental Research and Public Health* 2020; **17**.
3. Fisher RA. The arrangement of field experiments. *Journal of the Ministry of Agriculture of Great Britain* 1926; **33**: 503-13.
4. Raper S. Turning points: Fisher’s random idea. *Significance* 2019; **16**(1): 20-23.
5. Mumford S, Anjum RL. *Causation: A Very Short Introduction*. OUP Oxford, 2013.
6. Senn SJ. A comment on replication, p-values and evidence, S.N. Goodman, *Statistics in Medicine* 1992; **11**: 875-879. *Statistics in Medicine* 2002; **21**(16): 2437-44.
7. Senn SJ. Mastering variation: variance components and personalised medicine. *Statistics in Medicine* 2016; **35**(7): 966-77.
8. Araujo A, Julious S, Senn S. Understanding Variation in Sets of N-of-1 Trials. *PLoS ONE* 2016; **11**(12): e0167167.
9. Senn SJ. Seven myths of randomisation in clinical trials. *Statistics in Medicine* 2013; **32**(9): 1439-50.
10. Cumberland WG, Royall RM. Does Simple Random Sampling Provide Adequate Balance? *J R Stat Soc* Ser B-Methodol 1988; **50**(1): 118-24.
11. Senn SJ, Lillienthal J, Patalano F, et al. An incomplete blocks cross-over in asthma: a case study in collaboration. In: Vollmar J, Hothorn LA, eds. *Cross-over Clinical Trials*. Stuttgart: Fischer, 1997: 3-26.

This was a so-called *balanced incomplete blocks design*, necessitated because it was desired to study seven treatments (three doses of each of two formulations and a placebo)^{11} but it was not considered practical to give patients more than five treatments. Thus, patients were allocated a different one of the seven treatments in each of the five periods. That is to say, each patient received a subset of five of the seven treatments. Twenty-one sequences of five treatments were used. Each sequence permits (5×4)/2 = 10 pairwise comparisons, but there are (7×6)/2 = 21 pairwise comparisons overall, and the sequences were chosen in such a way that any given one of the 21 pairwise comparisons would appear within a sequence equally often over the design. Looking at the members of such a chosen pair, one would find five further sequences in which the first appeared but not the second, and *vice versa*. This leaves one sequence out of the 21 in which neither treatment would appear. The sort of scheme involved is illustrated in Table 1 below.

The active treatments were MTA6, MTA12, MTA24, ISF6, ISF12, ISF24, where the number refers to a dose in μg and the letters to two different formulations (MTA and ISF) of a dry powder of formoterol delivered by inhaler. The seventh treatment was a placebo.
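The counting above can be verified directly: the 21 sequences correspond to the C(7,5) = 21 five-element subsets of the seven treatments. This enumeration is a sketch of the scheme, not the trial’s actual allocation table.

```python
from itertools import combinations

# The seven treatments; "placebo" labels the seventh.
treatments = ["MTA6", "MTA12", "MTA24", "ISF6", "ISF12", "ISF24", "placebo"]

# All C(7,5) = 21 five-element subsets, one per sequence.
sequences = list(combinations(treatments, 5))
print(len(sequences))  # 21

# For a given pair, count sequences containing both, only the first, or neither.
a, b = "ISF24", "MTA6"
both = sum(a in s and b in s for s in sequences)
first_only = sum(a in s and b not in s for s in sequences)
neither = sum(a not in s and b not in s for s in sequences)
print(both, first_only, neither)  # 10 5 1
```

The counts match the text: each pair can be compared within 10 sequences, each member appears alone in 5 more, and exactly one sequence contains neither.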

In fact, the plan was to recruit six times as many patients as there were sequences, randomising a given patient to a sequence in a way that would guarantee approximately equal numbers per sequence. This would have given 126 patients in total. In the end, this target was exceeded and 161 patients were randomised to one of the sequences.

Obviously, this is a rather complex design, but I have used it because it enabled me to compare two treatments in two different ways. First, by using only the ten sequences in which they both appear; for this purpose, I could use each patient as his or her own control. Second, by using the ten further sequences in which only one of them appears.

This thus permitted me to analyse data from the same trial using a *within-patient analysis* and a *between-patient analysis*. The analyses used above should not be taken too seriously. The analysis would not generally proceed and did not in fact proceed in this way. For example, I ignored the complication of period effects and ignored the fact that by including all the seven treatments in an analysis at once, I could recover more information. I simply chose two treatments to compare and ignored all other information in order to illustrate a point. The two treatments I compared, ‘ISF24’ and ‘MTA6’, were respectively, the highest (24μg) dose of the then (1997) existing standard dry powder formulation, ISF of the beta-agonist formoterol, and the lowest (6μg) of a newer formulation, MTA, it was hoped to introduce. The experiment is discussed in full in Senn, Lilienthal, Patalano and Till^{11}.

The *full model* analysis that I showed as a dotted line in Figure 1 & Figure 2 fitted Patient as a random effect and Treatment and Period as fixed factors with 7 and 5 levels respectively.

Ed. A link to a selection of Senn’s posts and papers is here. Please share comments and thoughts.

As much as doctors and hospitals are raising alarms about a shortage of ventilators for Covid-19 patients, some doctors have begun to call for entirely reassessing the standard paradigm for their use–according to a cluster of articles to appear in the last week. “What’s driving this reassessment is a baffling observation about Covid-19: Many patients have blood oxygen levels so low they should be dead. But they’re not gasping for air, their hearts aren’t racing, and their brains show no signs of blinking off from lack of oxygen.”[1] Within that group of patients, some doctors wonder if the standard use of mechanical ventilators does more harm than good.[2] The issue is controversial; I’ll just report what I find in the articles over the past week. Please share ongoing updates in the comments.

**I. Gattinoni: “COVID-19 pneumonia: different respiratory treatment for different phenotypes?”**

Luciano Gattinoni, one of the world’s experts in mechanical ventilation, “says more than half the patients he and his colleagues have treated in Northern Italy have had this unusual symptom. They seem to be able to breathe just fine, but their oxygen is very low”. …

He says these patients with more normal-looking lungs, but low blood oxygen, may also be especially vulnerable to ventilator-associated lung injury, where pressure from the air that’s being forced into the lungs damages the thin air sacs that exchange oxygen with the blood. [3]

Gattinoni labels these patients (more normal-looking lungs, but low blood oxygen) as Type L, and urges they be treated differently than the type of acute respiratory [ARDS] patients seen prior to Covid-19. This second type he calls Type H. (His editorial is in [4]). I found a picture of Type L and Type H lungs at this link on p. 12.

Patients with respiratory failure who can still breathe OK, but still have very low oxygen, may improve on oxygen alone, or on oxygen delivered through a lower pressure setting on a ventilator.[3]

Gattinoni thinks the trouble for these patients may not be swelling and stiffening of their lung tissue, which is what happens when an infection causes pneumonia. Instead, he thinks the problem may lie in the intricate web of blood vessels in the lungs.[a]

Gattinoni says putting a patient like this on a ventilator under too high a pressure may cause lung damage that ultimately looks like ARDS.[3]

In other words, the high pressure of the ventilator may turn a Type L patient into a more serious Type H patient. “If you start with the wrong protocol, at the end they become similar,” Gattinoni said.[2] Oy! He recommends the two types (which can be determined in a number of ways) be treated differently: Type L patients receive greater benefit from less invasive oxygen support, via breathing masks such as those used for patients with sleep apnea, nasal cannulas, or non-invasive high-flow devices.

Gattinoni said one center in central Europe that had begun using different treatments for different types of COVID-19 patients had not seen any deaths among those patients in its intensive care unit. He said a nearby hospital that was treating all COVID-19 patients based on the same set of instructions had a 60% death rate in its ICU. [He did not give the names of the hospitals.]

“This is a kind of disease in which you don’t have to follow the protocol — you have to follow the physiology,” Gattinoni said. “Unfortunately, many, many doctors around the world cannot think outside the protocol.” [3]

**II. Kyle-Sidell: Covid vent protocols need a second look**

But there are some doctors who may want to think outside the protocol, yet face pressure against doing so–according to Cameron Kyle-Sidell, an emergency room and critical care doctor at Maimonides Medical Center in Brooklyn.

The article that captured my attention on April 6 was the surprising transcript of Kyle-Sidell being video interviewed by WebMD chief medical officer John Whyte [5]:

Whyte: You’ve been talking on social media; you say you’ve seen things that you’ve never seen before. What are some of those things that you’re seeing?

Kyle-Sidell: When I initially started treating patients, I was under the impression, as most people were, that I was going to be treating acute respiratory distress syndrome (ARDS)… And as I start to treat these patients, I witnessed things that are just unusual. …In the past, we haven’t seen patients who are talking in full sentences and not complaining of overt shortness of breath, with saturations [blood oxygen levels] in the high 70s [normal is said to be between 95 and 100].[b] This originally came to me when we had a patient who had hit what we call our trigger to put in a breathing tube, … Most of the time, when patients hit that level of hypoxia, they’re in distress and they can barely talk; they can’t say complete sentences. She could do all of those and she did not want a breathing tube. So she asked that we put it in at the last minute possible. It was this perplexing clinical condition: When was I supposed to put the breathing tube in?…

We ran into an impasse where I could not morally, in a patient-doctor relationship, continue the current protocols which, again, are the protocols of the top hospitals in the country. … So I had to step down from my position in the ICU, and now I’m back in the ER where we are setting up slightly different ventilation strategies. Fortunately, we’ve been boosted by recent work by Gattinoni.

Whyte: Do you feel that somewhere the world made a wrong turn in treating COVID-19?

Kyle-Sidell: I don’t know that they made a wrong turn. I mean, it came so fast. … It’s hard to switch tracks when the train is going a million miles an hour. …But I do think that it starts out with knowing, or at least accepting the idea, that this may be an entirely new disease. Because once you do that, then you can accept the idea that perhaps all the studies on ARDS in the 2000s and 2010s, which were large, randomized, well-performed, well-funded studies, perhaps none of those patients in those studies had COVID-19 or something resembling it. It allows you to move away from a paradigm in which this disease may fit and, unfortunately, walk somewhat into the unknown. …One of the reasons I speak up, and I hope people at the bedside speak up, is that I think there may be a disconnect between those who are seeing these patients directly, who are sensing that something is not quite right, and those brilliant people and researchers and administrators who are writing the protocols and working on finding answers. The first thing to do is see if we can admit that this is something new. I think it all starts from there.

Gattinoni’s paper and Kyle-Sidell’s on-line discussions are having an impact in the popular press. Yesterday, the *Telegraph* reported that “British and American intensive care doctors at the front line of the coronavirus crisis are starting to question the aggressive use of ventilators for the treatment of patients”.[6]

In many cases, they say the machines– which are highly invasive and require the patient to be rendered unconscious– are being used too early and may cause more harm than good. Instead they are finding that less invasive forms of oxygen treatment through face masks or nasal cannulas work better for patients, even those with very low blood oxygen readings….This is the sort of treatment Boris Johnson, the Prime Minister, is said to have received in an intensive care unit at St Thomas’ Hospital in London.

…Increasingly, doctors in the UK, America and Europe are using these less invasive measures and holding back on the use of mechanical ventilation for as long as possible…Invasive ventilation is never a good option for any patient if it can be avoided. It can result in muscle wastage around the lungs and makes secondary infections more likely. It also requires a cocktail of drugs which themselves can prove toxic and lead to organ failure.[6]

“Instead of asking how do we ration a scarce resource, we should be asking how do we best treat this disease?” says physician Muriel Gillick of Harvard Medical School.[1]

**III. Does Non-invasive Ventilation Risk Health Care Workers?**

Yet there’s an important reason the standard protocol is to bypass non-invasive ventilation in Covid-19 patients (in the U.S.), and I don’t know if Gattinoni or Kyle-Sidell address it: these devices are thought to pose risks for health care providers, at least without adequate protective equipment.[c]:

One problem, though, is that CPAP [continuous positive airway pressure] and other positive-pressure machines pose a risk to health care workers…The devices push aerosolized virus particles into the air, where anyone entering the patient’s room can inhale them [spillage]. The intubation required for mechanical ventilators can also aerosolize virus particles, but the machine is a contained system after that.[1]

“If we had unlimited supply of protective equipment and if we had a better understanding of what this virus actually does in terms of aerosolizing, and if we had more negative pressure rooms, then we would be able to use more” of the noninvasive breathing support devices, said [Lakshman] Swamy [an ICU physician and pulmonologist of Boston Medical Center].[1]

But surely it would be easier to procure adequate protective equipment than to obtain more ventilators, especially if it’s a way to beat the grim statistics for a significant group of Covid-19 sufferers. Italy has special plastic helmets that cordon off the patient’s head from the shoulders up, redolent of Victorian diving helmets. A virus filter prevents the aerosolization risk that lies behind the common protocol. The Italian helmet, however, hasn’t been approved by the FDA, and in any case Italy has banned its export given its own Covid-19 crisis. Fortunately, at least one group in the U.S. is building its own coronavirus helmets.

Please share your thoughts and updates, and point out any errors.

**NOTES:**

[a] The following are quotes from reference [3]:

Normally, when lungs become damaged, the vessels that carry blood through the lungs so it can be re-oxygenated constrict, or close down, so blood can be shunted away from the area that’s damaged to an area that’s still working properly. This protects the body from a drop in oxygen. Gattinoni thinks some COVID-19 patients can’t do this anymore. So blood is still flowing to damaged parts of the lungs. People still feel like they’re taking good breaths, but their blood oxygen is dropping all the same.[3]

One doctor treating COVID-19 patients in New York [Cameron Kyle-Sidell] says it was like altitude sickness. It was “as if tens of thousands of my fellow New Yorkers are stuck on a plane at 30,000 feet and the cabin pressure is slowly being let out. These patients are slowly being starved of oxygen”. [3]

Lung scans show the same “ground glass” appearance in both covid-19 and high altitude pulmonary edema (HAPE).

[b] An oximeter I recently bought, of not very good quality, has me at 97.

[c] Except perhaps when mechanical ventilators are in too short supply. (I am not up on the current regulations.) Of course, another reason is the danger of delaying an intubation that might turn out to be necessary.

**REFERENCES:**


[1] “With ventilators running out, doctors say the machines are overused for Covid-19”, STAT, April 8, 2020.

[2] “Is Protocol-Driven COVID-19 Ventilation Doing More Harm Than Good?” Medscape, April 6, 2020.

[3] “Doctors puzzle over covid-19 lung problems”, WebMD Health News, April 07, 2020

[4] Gattinoni’s editorial: “COVID-19 pneumonia: different respiratory treatment for different phenotypes?” L. Gattinoni et al., (2020)

[5] “Do COVID-19 Vent Protocols Need a Second Look?”, WebMD Interview, John Whyte, MD, MPH; Cameron Kyle-Sidell, MD, April 06, 2020

[6] “Intensive care doctors question ‘overly aggressive’ use of ventilators in coronavirus crisis”, Telegraph, April 9, 2020

**Aris Spanos**

Beyond the plenitude of misery and suffering that pandemics bring down on humanity, occasionally they contribute to the betterment of humankind by (inadvertently) boosting creative activity that leads to knowledge, and not just in epidemiology. A case in point is that of Isaac Newton and the pandemic of 1665-6.

Born in 1642 (on Christmas day – old Julian calendar) in the small village of Woolsthorpe Manor, southeast of Nottingham, England, Isaac Newton had a very difficult childhood. He lost his father, also named Isaac, a farmer, three months before he was born; his mother, Hannah, married again when he was 3 years old and moved away with her second husband to start a new family; he was brought up by his maternal grandmother until the age of 10, when his mother returned, after her second husband died, with three young kids in tow.

At age 12, Isaac was enrolled in the King’s School in Grantham [where Margaret Thatcher was born], 8 miles from home, where he boarded at the home of the local pharmacist. During his first two years at King’s School he was an average student, but after a skirmish with a schoolyard bully he took his revenge by distinguishing himself academically, or so the story goes! After that episode, Isaac began to exhibit an exceptional aptitude for constructing mechanical contraptions, such as windmills, dials, water-clocks, and kites. His mother, however, had other ideas, and took young Isaac out of school at age 16 to tend the farm she had inherited from her second husband. Isaac was terrible at farming, and after a year the headmaster of King’s School, Mr. Stokes, persuaded Hannah to allow a promising pupil to return to school, taking Isaac to board in his own home. It was clear to both that young Isaac was not cut out to herd sheep and shovel dung. After completing coursework in Latin, Greek, and some mathematics, Newton was accepted at Trinity College, University of Cambridge, in 1661, at an age close to 19, somewhat older than the other students owing to his detour into farming. For the first three years he paid no tuition, working instead in the College’s kitchen, dining hall, and housekeeping; by 1664 he had shown enough promise to be awarded a scholarship guaranteeing him four more years to complete his MA degree. As an undergraduate, Isaac spent most of his time in solitary intellectual pursuits which, beyond the prescribed Aristotelian texts, included reading in diverse subjects in a conscious attempt to supplement his education with extra-curricular books that attracted his curiosity: in history, philosophy (Rene Descartes in particular), and astronomy, such as the works of Galileo and Thomas Streete, through whom he learned of Kepler’s work. Many scholars attribute Newton’s passion for mathematics to Descartes’s *Geometry*.
He completed his BA degree in 1665 without displaying any sign of the scholarly promise that would make him the most celebrated scientist of all time. That was to be changed by a pandemic!

The bubonic plague of 1665-6 ravaged London, killing more than 100,000 residents (25% of its population), and rapidly spread throughout the country. Like most universities, Cambridge closed its doors, and the majority of its students returned to their family residences in the countryside to isolate themselves and avoid the plague. Isaac, an undistinguished BA student from Cambridge University, returned to Woolsthorpe, where he began a most creative period of assimilating what he had learned during his studies, devoting ample time to reflecting on subjects of great interest to him, including mathematics, philosophy, and physics, to which he could not devote sufficient time during his coursework at Cambridge. These two years of isolation turned out to be the most creative of his life. Newton’s major contributions to science and mathematics, including his work in optics, the laws of motion and universal gravitation, as well as the creation of infinitesimal calculus, can be traced back to these two years of incredible ingenuity and originality; their importance for science can only be compared with Einstein’s 1905 *Annus Mirabilis*.

Newton returned to Cambridge in the autumn of 1667 with notebooks filled with ideas as well as solved and unsolved problems. Soon after, he was elected a Minor Fellow of Trinity College. Newton completed his MA in 1668, during which time he began interacting with Isaac Barrow, the Lucasian Professor of Mathematics, an accomplished mathematician in his own right with important contributions in geometry and optics, whom Newton had failed to impress as an undergraduate. He handed Barrow a set of notes on the generalized binomial theorem and various applications of his newly minted fluxions (modern differential calculus), developed during the two years in Woolsthorpe. After a short period of discoursing with Newton, Barrow realized the importance of his young student’s work. Soon after, in 1669, Barrow retired from the Lucasian chair, recommending Newton, age 26, to succeed him. Newton’s ideas during the next 30 years as Lucasian Professor of Mathematics changed the way we understand the physical world we live in.

One wonders how the history of science would have unfolded if it were not for the bubonic plague of 1665-6 forcing Newton into two years of isolation to study, contemplate and create!

**Aris Spanos **(March 2020)

Ed (Mayo) Note: Aris shared with me the case of Newton working during the bubonic plague two weeks ago, after hearing how unproductive I was. I asked him to write a blogpost on it, and I’m very grateful that he did!

**My “April 1” posts for the past 8 years have been so close to the truth or possible truth that they weren’t always spotted as April Fool’s pranks, which is what made them genuine April Fool’s pranks. (After a few days I either labeled them as such, e.g., “check date!”, or revealed it in a comment.) Given the level of current chaos and stress, I decided against putting up a planned post for today, so I’m just doing a memory lane of past posts. (You can tell from reading the comments which had most people fooled.)**

**4/1/12 Philosophy of Statistics: Retraction Watch, Vol. 1, No. 1 **

This morning I received a paper I have been asked to review (anonymously, as is typical). It is to head up a forthcoming issue of a new journal called *Philosophy of Statistics: Retraction Watch*. This is the first I’ve heard of the journal, and I plan to recommend they publish the piece, conditional on revisions. I thought I would post the abstract here. It’s that interesting.

“Some Slightly More Realistic Self-Criticism in Recent Work in Philosophy of Statistics”: In this paper we delineate some serious blunders that we and others have made in published work on frequentist statistical methods. First, although we have claimed repeatedly that a core thesis of the frequentist testing approach is that a hypothesis may be rejected with increasing confidence as the power of the test increases, we now see that this is completely backwards, and we regret that we have never addressed, or even fully read, the corrections found in Deborah Mayo’s work since at least 1983, and likely even before that. (*Philosophy of Statistics: Retraction Watch*, Vol. 1, No. 1 (2012), pp. 1-19.)

You can read the rest here.

@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@

**4/1/13 Flawed Science and Stapel: Priming for a Backlash? **

Diederik Stapel is back in the news, given the availability of the English translation of the Tilburg (Levelt and Noort Committees) Report as well as his book, *Ontsporing* (Dutch for “Off the Rails”), where he tries to explain his fraud. An earlier post on him is here. While the disgraced social psychologist was shown to have fabricated the data for something like 50 papers, it seems that some people think he deserves a second chance. A childhood friend, Simon Kuper, in an article “The Sin of Bad Science,” describes a phone conversation with Stapel:…

@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@

**4/1/14 Skeptical and enthusiastic Bayesian priors for beliefs about insane asylum renovations at Dept of Homeland Security: I’m skeptical and unenthusiastic **

I had heard of medical designs that employ individuals who supply Bayesian subjective priors that are deemed either “enthusiastic” or “skeptical” as regards the probable value of medical treatments.[i] …But I’d never heard of these Bayesian designs in relation to decisions about building security or renovations! Listen to this….

You may have heard that the Department of Homeland Security (DHS), whose 240,000 employees are scattered among 50 office locations around D.C., has been planning to have headquarters built at an abandoned insane asylum, St Elizabeths, in DC [ii]. (Here’s a 2015 update.)

@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@

**4/01/15** **Are scientists really ready for ‘retraction offsets’ to advance ‘aggregate reproducibility’? (let alone ‘precautionary withdrawals’)**

Given recent evidence of the irreproducibility of a surprising number of published scientific findings, the White House’s Office of Science and Technology Policy (OSTP) sought ideas for “leveraging its role as a significant funder of scientific research to most effectively address the problem”, and announced funding for projects to “reset the self-corrective process of scientific inquiry”. (first noted in this post.)

You can read the rest here.

@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@

**4/1/16 Er, about those “other statistical approaches”: Hold off until a balanced critique is in?**

I could have told them that the degree of accordance enabling the “6 principles” on p-values was unlikely to be replicated when it came to most of the “other approaches” with which some would supplement or replace significance tests–notably Bayesian updating, Bayes factors, or likelihood ratios (confidence intervals are dual to hypothesis tests). [My commentary is here.] So now they may be advising a “hold off” or “go slow” approach until some consilience is achieved. Is that it? I don’t know. I was tweeted an article about the background chatter taking place behind the scenes; I wasn’t one of the people interviewed for this. Here are some excerpts; I may add more later after it has had time to sink in. (Check back later.)

**You can read the rest here.**

@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@

**4/1/17 and 4/1/18 were slight updates of 4/1/16.**

@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@

**4/1/19** **there’s a man at the wheel in your brain & he’s telling you what you’re allowed to say (not probability, not likelihood)**

It seems like every week something of excitement in statistics comes down the pike. Last week I was contacted by Richard Harris (and 2 others) about the recommendation to stop saying the data reach “significance level p” but rather simply say

“the p-value is p”.

(For links, see my previous post.) Friday, he wrote to ask if I would comment on a proposed restriction (?) on saying a test had high power! I agreed that we shouldn’t say a test has high power, but only that it has high power to detect a specific alternative, but I wasn’t aware of any rulings from those in power on power. He explained it was an upshot of a reexamination, by a joint group of the boards of statistical associations in the U.S. and UK, of the full panoply of statistical terms. Something like that. I agreed to speak with him yesterday. He emailed me the proposed ruling on power:

**You can read the rest here.**

@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@


Q. Was it a mistake to quarantine the passengers aboard the Diamond Princess in Japan?

A. The original statement, which is not unreasonable, was that the best thing to do with these people was to keep them safely quarantined in an infection-control manner on the ship. As it turned out, that was very ineffective in preventing spread on the ship. So the quarantine process failed. I mean, I’d like to sugarcoat it and try to be diplomatic about it, but it failed. I mean, there were people getting infected on that ship. So something went awry in the process of quarantining on that ship. I don’t know what it was, but a lot of people got infected on that ship. (Dr. A Fauci, Feb 17, 2020)

This is part of an interview with Dr. Anthony Fauci, the coronavirus point person we’ve been seeing so much of lately. Fauci has been the director of the National Institute of Allergy and Infectious Diseases since all the way back in 1984! You might find his surprise surprising. Even before getting our recent cram course on coronavirus transmission, tales of cruises being hit with viral outbreaks were familiar enough. The horror stories from passengers on the floating petri dish were well known by this Feb 17 interview. Even if everything had gone as planned, the quarantine was really only for the (approximately 3,700) passengers, because the 1,000 or so crew members still had to run the ship, as well as cook and deliver food to the passengers’ cabins. Moreover, the ventilation systems on cruise ships can’t filter out particles smaller than 5000 or 1000 nanometers.[1]

“If the coronavirus is about the same size as SARS [severe acute respiratory syndrome], which is 120 nanometers in diameter, then the air conditioning system would be carrying the virus to every cabin,” according to Purdue researcher Qingyan Chen, who specializes in how air particles spread in different passenger crafts. (His estimate was correct: the coronavirus is 120 nanometers.) Halfway through the quarantine, after passenger complaints, they began circulating only fresh air–which would have been preferable from the start. By then, however, it was too late: the ventilation system was already likely filled with the virus, says Chen.[2] Arthur Caplan, the bioethicist who is famous for issuing rulings on such matters, declares that

“Boats are notorious places for being incubators for viruses. It’s only morally justified to keep people on the boat if there are no other options.”

Admittedly, it is hard to see an alternative option that could accommodate so many passengers for a 2-week quarantine on land, and there was the possible danger of infections spreading to the local population in Japan. So, by his assessment, it may be considered morally justified.

*The upshot*: As of 19 March 2020, at least 712 out of the 3,711 passengers and crew had tested positive for covid-19; 9 of those who were on board have died from the disease (all over the age of 70). As I was writing this, I noted a new CDC report on the Diamond Princess as well as other cruise ships; they state 9 deaths.[3] A table on the distribution of ages of passengers on the Diamond Princess is in Note [4].

*So how did the Diamond Princess cruise ship become a floating petri dish for the coronavirus from Feb 4-Feb 20?*

**The Quarantine**

It was their last night of a 2-week luxury cruise aboard the Diamond Princess in Japan (Feb 3) when the captain came on the intercom. He announced: a passenger on this ship who disembarked in Hong Kong 9 days ago (Jan 25) has tested positive for the coronavirus. (He was on board for 5 days.) Everyone will have to stay on board an extra day to be examined by the Japanese health authorities. A new slate of activities was arranged to occupy passengers during the day of health screening–later mostly dropped. But on the evening of February 3, things continued on the ship more or less as before the intercom message.

“The response aboard the Diamond Princess reflected concern, but not a major one. The buffets remained open as usual. Onboard celebrations, opera performances and goodbye parties continued”. (NYT, March 8)

The next day, as health officials went door to door to screen passengers, guests still circulated on board, lined up for buffets, and used communal spaces. But then, the following morning (Feb 5), as guests were heading to breakfast, the captain came over the intercom again. He announced that 10 people had tested positive for the coronavirus and would be taken off the ship. Everyone else would now have to be quarantined in their cabins for 14 days. On the second day of the quarantine (Feb 6) it was announced that 20 more people had tested positive, then on day three, 41 more, then 64 more, and on and on. By the end of the quarantine on February 19, at least 621 on the ship had tested positive for the virus.

Adding to the stress, “we quickly learned that our tests were part of an initial batch of 273 samples and that the first 10 cases reported on day one were only from the first 31 samples that had been processed” from the passengers with highest risk. (U.S. passenger, Spencer Fehrenbacher, interviewed on the ship)

As the number of infected ballooned, passengers were not always informed right away; some took to counting the ambulances lined up outside to find out how many new cases would be announced. I wonder if the passengers were told that the very first person to test positive was a crew member responsible for preparing food. In fact, by February 9, around 20 of the crew members had tested positive, *15 of whom were workers preparing food*. Crew members lived in close quarters, shared rooms and continued to eat their meals together buffet-style. They had no choice but to keep running the ship as best they could.

“Feverish passengers were left in their rooms for days without being tested for the virus. Health officials and even some medical professionals worked on board without full protective gear. [Several got infected.] Sick crew members slept in cabins with roommates who continued their duties across the ship, undercutting the quarantine”. (NYT Feb 22)

Passengers in cabins without windows (and later, others) were allowed to walk on deck, six feet apart, for a short time daily. Unfortunately, presumed infection-free “green zones” were not rigidly separated from potentially contaminated “red zones”, and people walked back and forth between them. Gay Courter, a writer from the U.S. who, as it happens, situated one of her murder mysteries on a cruise ship, told *Time* “It feels like I’m in a bad movie. I tell myself, ‘Wake up, wake up, this isn’t really happening.’” (Time, Feb 11). This is the same bad movie we are all in now, except our horror tale has gotten much worse than on Feb 10.

At some point, I think Feb 10, the ship became the largest concentration of Covid-19 cases outside China, which is why you’ll notice the Diamond Princess has its own category in the data compiled by the World Health Organization (Worldometer).

In a Science Today article, a Japanese infectious disease specialist regretted the patchwork way in which passenger testing was done:

Japan has missed a chance to answer important epidemiological questions about the new virus and the illness it causes. For instance, a rigorous investigation that tested all passengers at the start of the quarantine and followed them through to the end could have provided information on when infections occurred and answered questions about transmission, the course of the illness, and the behavior of the virus.

(They were only able to test people in stages.) A similar paucity of testing in the U.S. robs us of crucial information for understanding and controlling the coronavirus. However, a fair amount is being gleaned from the Diamond Princess, as you can see in the references below. (Please share additional references in the comments.) More is bound to follow.

**Estimates from the Diamond Princess**

“Data from the *Diamond Princess* cruise ship outbreak provides a unique snapshot of the true mortality and symptomatology of the disease, given that everyone on board was tested, regardless of symptoms”–or at least virtually all. [link] The estimates (from the Diamond Princess) I’ve seen are based on those from the London School of Hygiene and Tropical Medicine, in a paper still in preprint form, “Estimating the infection and case fatality ratio for COVID-19 using age-adjusted data from the outbreak on the Diamond Princess cruise ship”.

Adjusting for delay from confirmation-to-death, we estimated case and infection fatality ratios (CFR, IFR) for COVID-19 on the Diamond Princess ship as 2.3% (0.75%-5.3%) [among symptomatic] and 1.2% (0.38-2.7%) [all cases]. Comparing deaths onboard with expected deaths based on naive CFR estimates using China data, we estimate IFR and CFR in China to be 0.5% (95% CI: 0.2-1.2%) and 1.1% (95% CI: 0.3-2.4%) respectively. (PDF)

(For definitions and computations, see the article.) These are lower than the numbers we are often hearing. They used their lower fatality estimates to adjust (down) the estimates from China data. The paper lists a number of caveats.[5] I hope readers will have a look at it (it’s just a few pages) and share their thoughts in the comments. (Their estimates are in sync with an article by Fauci et al., to come out this week in *NEJM*; but whatever the numbers turn out to be, we know our healthcare system, in many places, is being overloaded. [6])
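For orientation, the crude (unadjusted) ratios implied by the onboard counts quoted earlier in this post can be computed directly. These are only back-of-the-envelope figures; the preprint’s estimates differ because they adjust for the confirmation-to-death delay and for the ship’s unusual age distribution:

```python
total_onboard = 3711   # passengers and crew
confirmed = 712        # positives as of 19 March 2020
deaths = 9             # deaths among those who were on board

attack_rate = confirmed / total_onboard   # fraction ever testing positive
naive_ifr = deaths / confirmed            # crude deaths-per-confirmed-case

print(f"attack rate ~ {100 * attack_rate:.0f}%")   # ~19%
print(f"naive IFR  ~ {100 * naive_ifr:.1f}%")      # ~1.3%
```

The crude 1.3% is in the same range as the preprint’s delay-adjusted IFR of 1.2% (0.38-2.7%), which is why the ship is such a useful calibration point: the denominator of infections is unusually well known.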

Another study uses the daily reports of infections on the Diamond Princess to evaluate the impact of the quarantine, imperfect as it was, in comparison to a counterfactual situation where nothing was done, including not removing infected people from the ship. They estimate that nearly 80%, rather than 17%, would have been infected. [link]

We found that the reproductive number [R_{0}] of COVID-19 in the cruise ship situation of 3,700 persons confined to a limited space was around 4 times higher than in the epicenter in Wuhan, where it was estimated to have a mean of 3.7.[7] The interventions that included the removal of all persons with confirmed COVID-19 disease combined with the quarantine of all passengers substantially reduced the anticipated number of new COVID-19 cases compared to a scenario without any interventions (17% attack rate with intervention versus 79% without intervention) … However, the main conclusion from our modelling is that evacuating all passengers and crew early on in the outbreak would have prevented many more passengers and crew members from getting infected.” [link]

Only 76, rather than 621, would have been infected, they estimate. [8]

Conclusions: The cruise ship conditions clearly amplified an already highly transmissible disease. The public health measures prevented more than 2000 additional cases compared to no interventions. However, evacuating all passengers and crew early on in the outbreak would have prevented many more passengers and crew from infection.

These studies and models are of interest, although I’m in no position to evaluate them. Please share your thoughts and information, and point out any errors you find. I will indicate updates in the title of this post.

**Optimism**

I leave off with the remark of one of the U.S. passengers interviewed while still on the Diamond Princess:

“Being knee deep in the middle of a crisis leaves a person with two options — optimism or pessimism. The former gives a person strength, and the latter gives rise to fear.” (link)

He, like the others who were evacuated, faced an additional 2 weeks of quarantine.[9] He has since returned home and remains infection free.

*****

[1] As a noteworthy aside, Fauci was able to assure the interviewer that the “danger of getting coronavirus now is just minusculely low” (in the U.S. on Feb. 17). What a difference 2 weeks can make.

[2] In a 2015 paper, Chen and colleagues found a cruise ship’s ventilation spread particles from cabin to cabin. They found that 1 infected person typically led to more than 40 cases a week later on a 2000 passenger cruise. By contrast, the coronavirus, with a reproductive rate of 2 cases per infected person, would only lead to 3 new cases during that time. Planes rely on high-strength air filters and are designed to circulate air within cabin sections.

[3] In a March 23 CDC report: Among 3,711 Diamond Princess passengers and crew, 712 (19.2%) had positive test results for SARS-CoV-2. Of these, 331 (46.5%) were asymptomatic at the time of testing. Among 381 symptomatic patients, 37 (9.7%) required intensive care, and nine (1.3%) died (*8*).
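As a rough cross-check on these counts, the crude (unadjusted) fatality ratios can be computed directly. This is only a sketch using the CDC figures above; the preprint's estimates additionally adjust for the delay from confirmation to death and for age:

```python
# Crude fatality ratios from the March 23 CDC counts (no delay or age adjustment)
deaths = 9
symptomatic = 381
positives = 712

crude_cfr = deaths / symptomatic  # case fatality ratio among symptomatic cases
crude_ifr = deaths / positives    # infection fatality ratio among all who tested positive

print(f"crude CFR: {crude_cfr:.1%}")  # ~2.4%, near the preprint's 2.3% estimate
print(f"crude IFR: {crude_ifr:.1%}")  # ~1.3%, near the preprint's 1.2% estimate
```

That the crude ratios land close to the delay-adjusted estimates is an artifact of the outbreak being largely over by the report date; early in an outbreak the naive ratios can be badly biased.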

They found coronavirus in Diamond Princess cabins 17 days after passengers disembarked (prior to cleaning).

[4] A table from the Japanese National Institute of Infectious Diseases (NIID) (Source LINK):

[5]

“There were some limitations to our analysis. Cruise ship passengers may have a different health status to the general population of their home countries, due to health requirements to embark on a multi-week holiday, or differences related to socio-economic status or comorbidities. Deaths only occurred in individuals 70 years or older, so we were not able to generate age-specific cCFRs; the fatality risk may also be influenced by differences in healthcare between countries”.

[6] In a March 26 article by Fauci and others, Covid-19 — Navigating the Uncharted, we read:

“If one assumes that the number of asymptomatic or minimally symptomatic cases is several times as high as the number of reported cases, the case fatality rate may be considerably less than 1%.”

[7] R_{0} may be viewed as the expected number of cases generated directly by 1 case in a susceptible population.
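To illustrate why a roughly fourfold difference in R_{0} matters, here is a hypothetical sketch (ignoring interventions and depletion of susceptibles, both of which matter in practice) of how expected case counts compound over successive generations of transmission:

```python
def expected_new_cases(r0: float, generations: int) -> float:
    """Expected new cases descending from one index case after the given
    number of generations, if each case directly infects r0 others on average."""
    return sum(r0 ** g for g in range(1, generations + 1))

print(expected_new_cases(2.0, 3))   # 2 + 4 + 8 = 14
print(expected_new_cases(3.7, 3))   # ~68: the Wuhan-estimated mean compounds far faster
```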

[8] The number in the most recent report is 712, but that would be after the quarantine ended on Feb 19.

[9] I read today that one of the U.S. evacuated passengers just entered a clinical trial on remdesivir. This would be over a month since the end of the first quarantine.

———–

**REFERENCES:**

- Fauci interview: ‘Danger of getting coronavirus now is just minusculely low’

- Giwa, A., LLB, MD, MBA, FACEP, FAAEM; Desai, A., MD; Duca, A., MD; Translation by: Sabrina Paula Rodera Zorita, MD (2020). “Novel 2019 Coronavirus SARS-CoV-2 (COVID-19): An Updated Overview for Emergency Clinicians – 03-23-20”
*EBMedicine.net*; Pub Med ID: 32207910; (LINK)

- Japanese National Institute of Infectious Diseases (NIID). “Field Briefing: Diamond Princess COVID-19 Cases, 20 Feb Update” (LINK)

- Russell, T., Hellewell, J., Jarvis, C., van Zandvoort, K., Abbott, S., Ratnayake, R., Flasche, S., Eggo, R. & Kucharski, A. (2020). “Estimating the infection and case fatality ratio for COVID-19 using age-adjusted data from the outbreak on the Diamond Princess cruise ship.”
*MedRXIV: The preprint server for the Health Sciences*. (March 9, 2020). (PDF)

- Zheng, L., Chen, Q., Xu, J., & Wu, F. (2016). Evaluation of intervention measures for respiratory disease transmission on cruise ships.
*Indoor and Built Environment*, 25(8), 1267–1278. (First published online August 28, 2015). (PDF)

**Stephen Senn**

*Consultant Statistician*

*Edinburgh *

**Correcting errors about corrected estimates**

Randomised clinical trials are a powerful tool for investigating the effects of treatments. Given appropriate design, conduct and analysis they can deliver good estimates of effects. The key feature is concurrent control. Without concurrent control, randomisation is impossible. Randomisation is necessary, although not sufficient, for effective blinding. It also is an appropriate way to deal with unmeasured predictors, that is to say suspected but unobserved factors that might also affect outcome. It does this by ensuring that, in the absence of any treatment effect, the *expected* value of variation between and within groups is the same. Furthermore, probabilities regarding the relative variation can be delivered and this is what is necessary for valid inference.

There are two extreme positions regarding randomisation that are unreasonable. The first is that because randomisation only ensures performance in terms of probabilities, it is irrelevant to any actual study conducted. The second is that because randomisation delivers valid estimates on average, using observed covariate information, say in a so-called *analysis of covariance* (ANCOVA), is unnecessary and possibly even harmful. The first is easily answered, even if many find the answer difficult to understand: probabilities are what we have to use when we don’t have certainties. For further discussion of this see my previous blogs on this site Randomisation Ratios and Rationality and Indefinite Irrelevance and also a further blog Stop Obsessing about Balance. The second criticism is also, in principle, easy to answer: covariates provide the means of recognising that averages are not relevant to the case in hand (Senn, S. J., 2019). Nevertheless, many trialists stubbornly refuse to use covariate information. This is wrong and in this blog I shall explain why.

A variant of the refusal to use covariates occurs when the covariate is a baseline. The argument is then sometimes used that an obviously appropriate ‘response’ variable is the so-called *change-score* (or *gain-score*), that is to say, the difference between the variable of interest at outcome and its value at baseline. For example, in a trial of asthma, we might be interested in the variable forced expiratory volume in one second (FEV_{1}), a measure of lung function. We measure this at outcome and also at baseline and then use the difference (outcome – baseline) as the ‘response’ measure in subsequent analysis.

The argument then continues that since one has adjusted for the baseline, by subtracting it from the outcome, no further adjustment is necessary and, furthermore, that since analysis of these change-scores is simpler than ANCOVA, it is more robust and reliable.

I shall now explain why this is wrong.

The important point to grasp from the beginning is that what are affected by the treatment are the outcomes. The baselines are not affected by treatment and so they do not carry the causal message. They may be predictive of the outcome, and that being so they may usefully be incorporated in an estimate of the effect of treatment, but treatment only has the capacity to affect the outcomes.

I stress this because much unnecessary complication is introduced by regarding the effect that treatments have on change over time as being fundamental. Such effects on change are a *consequence* of the effect on outcome: outcome is primary, change is secondary.

It will be easier to discuss all this with the help of some symbols. Table 1 shows symbols that can be used for referring to *statistics* for a parallel group trial in asthma with two arms, *control* and *treatment*. To simplify things, I shall take the case where the number of patients in each arm is identical, although this is not necessary to anything important that follows.

A further table, Table 2 gives symbols for various *parameters*. Some simplifying assumptions will be made that various parameter values do not change from control group to treatment group. For the assumptions in lines 2 and 4, this must be true if randomisation is carried out, since we are talking about expectations over all randomisations and the parameters refer to quantities measured *before* treatment starts. For lines 3 and 5, the assumptions are true under the null hypothesis that there is no difference between treatment and control.

Since the treatments can only affect the outcomes, the logical place to start is at the end. Thus, to use a term much in vogue, our *estimand* (that which we wish to estimate) is δ = μ_{Yt} – μ_{Yc}, the difference in ‘true’ means at outcome. Note that we could define the estimand in terms of the double difference

(μ_{Yt} – μ_{Xt}) – (μ_{Yc} – μ_{Xc}),

that is to say the differences between groups of the differences from baseline, but this is pointless because in a randomised trial we have μ_{Xc} = μ_{Xt} = μ_{X} and so this reduces to what we had previously. In fact, if we think of the logic of the randomised clinical trial, by which the patients given control are there purely to estimate what would have happened to the patients given the treatment had they been given the control, this is quite unnecessary.

As a first stab at estimating the estimand, the simplest thing to use is the corresponding difference in statistics, that is to say

Ȳ_{t} – Ȳ_{c}.

This estimator is unbiased for δ: on average it will be equal to the parameter it is supposed to estimate. Its variance, given our assumptions, will be equal to

2σ_{Y}^{2}/n,

where n is the number of patients in each arm. However, it is not independent of the observed difference at baseline. In fact, given our simplification, it has a covariance with this difference of

2ρσ_{X}σ_{Y}/n,

where ρ is the correlation between baseline and outcome.

This dependence implies two things. First, it means that although Ȳ_{t} – Ȳ_{c} is unbiased, it is not *conditionally* unbiased. Given an observed difference at baseline, X̄_{t} – X̄_{c}, we can do better than just assuming that the difference we would see at outcome, in the absence of any treatment effect, would be zero. Zero is the value we would see over all randomisations but it is not the value we would see for all randomised clinical trials with the observed baseline. This can be easily illustrated using a simulation described below.

I simulated 1000 clinical trials of a bronchodilator in asthma using forced expiratory volume in one second (FEV_{1}) as an outcome with parameters set as in Table 3:

The seed is of no relevance to anybody except me but is included here should I ever need to check back in the future. The other parameters are supposed to be what might be plausible in such a clinical trial. The values are drawn from a bivariate Normal assumed to be a suitable approximate theoretical model for a randomised clinical trial.
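Since Table 3 is not reproduced above, here is a minimal Python sketch of the same kind of simulation, with assumed values (true effect 200 mL, baseline–outcome correlation ρ = 0.6, equal SDs of 300 mL, 50 patients per arm, and an arbitrary seed) standing in for the original parameters:

```python
import numpy as np

rng = np.random.default_rng(42)            # arbitrary seed (Senn's own is not given)
n, n_trials = 50, 1000                     # assumed patients per arm; simulated trials
delta, rho, sigma = 200.0, 0.6, 300.0      # assumed effect (mL), correlation, common SD
cov = sigma**2 * np.array([[1, rho], [rho, 1]])  # bivariate Normal (baseline, outcome)

covered = 0
for _ in range(n_trials):
    xc, yc = rng.multivariate_normal([0, 0], cov, n).T       # control arm
    xt, yt = rng.multivariate_normal([0, delta], cov, n).T   # treatment affects outcome only
    est = yt.mean() - yc.mean()                              # simple unadjusted estimator
    se = np.sqrt((yt.var(ddof=1) + yc.var(ddof=1)) / n)
    covered += abs(est - delta) < 1.96 * se                  # 95% interval covers truth?

print(covered / n_trials)  # close to 0.95 unconditionally, as in Figure 1
```

The unconditional coverage is fine; the point of Figure 1 is what happens when the intervals are sorted by the observed baseline difference.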

One thousand confidence intervals are displayed in Figure 1. They have been plotted against the baseline difference. Those that are plotted in black cover the ‘true’ value of 200 mL and those that are plotted in red do not. The point estimate is shown by a black diamond in the former case and by a red circle in the latter. There are 949, or 94.9% that cover the true value and 51 or 5.1% that do not. The differences from the theoretical expectations of 95% and 5% are immaterial.

However, since we have baseline information available, we can recognise something about the intervals: where the baseline difference is negative, they are more likely to underestimate the true value and where the difference is positive, they are likely to overestimate it. In fact, generally, the bigger the baseline difference, the worse the coverage. In the view of many statisticians, including me, this means that a confidence interval calculated in this way is satisfactory if the baselines have __not__ been observed but not if they have.

Can we fix this using the change-score? The answer is no. Figure 2 shows the corresponding plot for the change-score. Now we have the reverse problem. Where the baseline difference is negative, the treatment effect is overestimated and where the difference is positive, underestimated. There are exceptions but a general pattern is visible: the estimates are negatively correlated with the baseline difference.

The reason that this happens is that the values at baseline are not perfectly predictive of the values at outcome. The way to deal with this is to calculate exactly *how* predictive the values are using the data *within* the treatment groups. Because I am in the privileged position of having set the values of the simulation, I know that the relevant *slope* parameter, β, is 0.6, considerably less than the implicit change-score value of 1. This is because the correlation, ρ, was set to be 0.6 and the variances at baseline and outcome were set to be equal. However, I did not ‘cheat’ by using this knowledge to do the adjusted calculation, since in practice I would never know the true value. In fact, for each simulation, the value was estimated from the covariances and variances within the treatment groups.

Figure 3 gives a dot histogram for the estimates of β. The mean over the 1000 simulations was 0.601 and the lowest value of the 1000 was 0.279, with the maximum being 0.869. Different values will have been used for different simulated clinical trials to adjust the difference at outcome by the difference at baseline. The adjusted estimates and confidence limits are given in Figure 4 and now it can be seen that these are independent of the baseline differences, which cannot now be used to select which confidence intervals are less likely to include the true value.

The three estimators we have considered can be regarded as special cases of

(Ȳ_{t} – Ȳ_{c}) – *b*(X̄_{t} – X̄_{c}).

If we set *b* = 0 we have the simple unadjusted estimate of Figure 1. If we set *b* = 1 we have the change-score estimate of Figure 2. Finally, if we set *b* = β̂, the within-group regression coefficient of Y on X shown in Figure 3, we get the ANCOVA estimate of Figure 4.
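As an illustrative sketch (not Senn's code; the parameters and seed are assumptions: effect 200 mL, ρ = 0.6, equal SDs of 300 mL, 50 per arm), the three members of this family can be computed from a single simulated trial, estimating b within the treatment groups for the ANCOVA case:

```python
import numpy as np

rng = np.random.default_rng(7)                   # arbitrary seed
n, delta, rho, sigma = 50, 200.0, 0.6, 300.0     # assumed trial parameters
cov = sigma**2 * np.array([[1, rho], [rho, 1]])
xc, yc = rng.multivariate_normal([0, 0], cov, n).T      # control: baseline X, outcome Y
xt, yt = rng.multivariate_normal([0, delta], cov, n).T  # treated

def estimate(b: float) -> float:
    """Difference in mean adjusted outcomes (Y - b*X), treatment minus control."""
    return (yt - b * xt).mean() - (yc - b * xc).mean()

# pooled within-group slope of Y on X: the ANCOVA choice of b
beta_hat = (np.cov(xc, yc)[0, 1] + np.cov(xt, yt)[0, 1]) / \
           (xc.var(ddof=1) + xt.var(ddof=1))

simple = estimate(0.0)        # b = 0: unadjusted difference at outcome
change = estimate(1.0)        # b = 1: change-score
ancova = estimate(beta_hat)   # b = estimated slope: ANCOVA
```

All three are unbiased for δ; they differ in how they behave conditionally on the baseline difference and in their variances.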

Given our assumptions, not unreasonable given the design, we have that the regression of the mean difference at outcome on the mean difference at baseline should be the same as the within-groups regression of the individual outcome values. Note that this assumption is not so reasonable for different designs, for example, cluster randomised designs, a point that has been misunderstood in some treatments of Lord’s paradox. See Rothamsted Statistics meets Lord’s Paradox for an explanation and also Red Herrings and the Art of Cause Fishing. Thus, we can now study exactly what is going on in terms of the adjusted within group differences *Y* – *bX*. If these adjusted differences are averaged for each of the two groups and we then form the difference of these averages, treatment minus control, we have the corresponding estimate for a given choice of *b*.

Now, the covariance of (*Y* – *bX*) and *X* is σ_{XY} – *b*σ_{X}^{2}, from which it follows immediately that this is 0 if and only if

*b* = σ_{XY}/σ_{X}^{2} = β.

Thus, the value of *b* is then the within-group regression slope β, so this is nothing other than the ANCOVA solution. (In practice we have to estimate β, a point to which I shall return later.) Hence, the lack of dependence between estimate and baseline difference exhibited by Figure 4. Note also that dimensional analysis shows that the units of β are units of *Y* over units of *X*. In our particular example of using a baseline, these two units are the same and so cancel out but the argument is also valid for any predictor, including one whose units are quite different. Once the slope has been multiplied by an *X* it yields a prediction in units of *Y* which is, of course, exactly what is needed for ‘correcting’ an observed *Y*. This, in my opinion, is yet another argument against the change-score ‘solution’. It only works in a special case for which there is no need to abandon the solution that works generally.

In fact, the change-score works badly. To see this, consider a re-writing of the adjusted outcome as

*Y* – *bX* = [*Y* – β*X*] + [(β – *b*)*X*].

On the RHS, for the first term in square brackets, we have the ANCOVA estimate. For the second term in square brackets, we have the amount by which any general estimate differs from the ANCOVA estimate. If we calculate the variance of the RHS, we find it consists of three terms. The first of these is the variance of the ANCOVA estimator. The second is (β – *b*)^{2} times the variance of *X*. The third is 2(β – *b*) times the covariance of [*Y* – β*X*] and *X*. However, we have shown that this covariance is zero. Thus, the third term is zero. Furthermore, the second term, being a product of squared quantities, is greater than zero unless β = *b*, but when this is the case we have the ANCOVA solution. Thus, ANCOVA is the minimum variance solution. The change-score solution will have a higher variance.
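This minimum-variance property can be checked numerically. With assumed parameters (ρ = 0.6, equal SDs, 50 per arm), the variance of the general estimator is proportional to 1 + b² – 2bρ, i.e. to 1, 0.8 and 0.64 for b = 0, 1 and β = 0.6 respectively, so a quick Monte Carlo sketch should reproduce the ordering:

```python
import numpy as np

rng = np.random.default_rng(0)
n, rho, sigma, beta, reps = 50, 0.6, 300.0, 0.6, 5000  # assumed; beta = rho (equal SDs)
cov = sigma**2 * np.array([[1, rho], [rho, 1]])

estimates = {0.0: [], 1.0: [], beta: []}  # b = 0 (simple), 1 (change-score), beta (ANCOVA)
for _ in range(reps):
    xc, yc = rng.multivariate_normal([0, 0], cov, n).T
    xt, yt = rng.multivariate_normal([0, 0], cov, n).T  # null case: variance is what matters
    for b in estimates:
        estimates[b].append((yt - b * xt).mean() - (yc - b * xc).mean())

var_simple, var_change, var_ancova = (np.var(estimates[b]) for b in (0.0, 1.0, beta))
# Theory: variances proportional to 1, 0.8 and 0.64 when rho = 0.6
print(var_ancova < var_change < var_simple)  # True
```

Note that the change-score beats the unadjusted estimator here only because ρ > 0.5; ANCOVA beats both regardless.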

Some summary statistics (over the 1000 simulated trials) are given for the three approaches used to analyse the results. There are no differences worth noticing between coverage. This is because each method of analysis is designed to provide correct coverage on average. The simple analysis of outcomes is positively and highly significantly correlated with the baseline difference. The change-score is highly significantly negatively correlated. The correlation is less in absolute terms because the correlation coefficient used for the simulation (0.6) is greater than 0.5. Had it been less than 0.5, the absolute correlation would have been greater for the change-score. Similarly, although the mean width of the change-score is less than that for the simple analysis, this does not have to be so and is a result of the correlation being greater than 0.5. The ANCOVA estimates are more precise than either and this has to be so in expectation barring a minor issue, discussed in the next section.

It is not quite the case that ANCOVA is the best one can do, since I have assumed in the algebraic development (although not in the simulation) that the slope parameter is known. In practice it has to be estimated and this leads to a small loss in precision compared to the (unrealistic) situation where the parameter is known. Basically, there are two losses: first, the degrees of freedom for estimating the error variance are reduced by one and, second, there is, in practice, a small penalty for the loss of orthogonality. Further discussion of this is beyond the scope of this note but is covered in (Lesaffre, E. & Senn, S., 2003) and (Senn, S. J., 2011). In practice, the effect is small once one has even a modest number of patients.

For randomised clinical trials, there is no excuse for using a change-score approach, rather than analysis of covariance. To do so betrays not only a conceptual confusion about causality but is inefficient. Given the stakes, this is unacceptable (Senn, S. J., 2005).

Lesaffre, E., & Senn, S. (2003). A note on non-parametric ANCOVA for covariate adjustment in randomized clinical trials. *Statistics in Medicine, ***22**(23), 3583-3596.

Senn, S. J. (2005). An unreasonable prejudice against modelling? *Pharmaceutical statistics, ***4**, 87-89.

Senn, S. J. (2011). Modelling in drug development. In M. Christie, A. Cliffe, A. P. Dawid, & S. J. Senn (Eds.), *Simplicity Complexity and Modelling* (pp. 35-49). Chichester: Wiley.

Senn, S. J. (2019). The well-adjusted statistician. *Applied Clinical Trials*, 2.


I will run a graduate Research Seminar at the LSE on Thursdays from May 21-June 18:

(See my new blog, phil-stat-wars.com, for specifics.)

I am co-running a workshop from 19-20 June, 2020 at LSE (Center for the Philosophy of Natural and Social Sciences, CPNSS), with Roman Frigg. Participants include:

If you have a particular Phil Stat event you’d like me to advertise, please send it to me.

*Notre Dame Philosophical Reviews* is a leading forum for publishing reviews of books in philosophy. The philosopher of statistics, Prasanta Bandyopadhyay, published a review of my book *Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars* (2018, CUP)(SIST) in this journal, and I very much appreciate his doing so. Here I excerpt from his review, and respond to a cluster of related criticisms in order to avoid some fundamental misunderstandings of my project. Here’s how he begins:

In this book, Deborah G. Mayo (who has the rare distinction of making an impact on some of the most influential statisticians of our time) delves into issues in philosophy of statistics, philosophy of science, and scientific methodology more thoroughly than in her previous writings. Her reconstruction of the history of statistics, seamless weaving of the issues in the foundations of statistics with the development of twentieth-century philosophy of science, and clear presentation that makes the content accessible to a non-specialist audience constitute a remarkable achievement. Mayo has a unique philosophical perspective which she uses in her study of philosophy of science and current statistical practice.

I regard this as one of the most important philosophy of science books written in the last 25 years.^{[1]} However, as Mayo herself says, nobody should be immune to critical assessment. This review is written in that spirit; in it I will analyze some of the shortcomings of the book.

* * * * * * * * *

I will begin with three issues on which Mayo focuses:

- Conflict about the foundation of statistical inference: Probabilism or Long-run Performance?
- Crisis in science: Which method is adequately general/flexible to be applicable to most problems?
- Replication crisis: Is scientific research reproducible?
Mayo holds that these issues are connected. Failure to recognize that connection leads to problems in statistical inference.

Probabilism, as Mayo describes it, is about accepting reasoned belief when certainty is not available. Error-statistics is concerned with understanding and controlling the probability of errors. This is a long-run performance criterion. Mayo is concerned with “probativeness” for the analysis of “particular statistical inference” (p. 14). She draws her inspiration concerning probativeness from severe testing and calls those who follow this kind of philosophy the “severe testers” (p. 9). This concept is the central idea of the book.

…What should be done, according to the severe tester, is to take refuge in a meta-standard and evaluate each theory from that meta-theoretical standpoint. Philosophy will provide that higher ground to evaluate two contending statistical theories. In contrast to the statistical foundations offered by both probabilism and long-run performance accounts, severe testers advocate probativism, which does not recommend any statement to be warranted unless a fair amount of investigation has been carried out to probe ways in which the statement could be wrong.

Severe testers think their method is adequately general to capture this intuitively appealing requirement on any plausible account of evidence. That is, if a test were not able to find flaws with H even if H were incorrect, then a mere agreement of H with data X_{0} would provide poor evidence for H. This, according to the severe tester’s account, should be a minimal requirement on any account of evidence. This is how they address (ii).

Next consider (iii). According to the severe tester’s diagnosis, the replication crisis arises when there is selective reporting: the statistics are cherry-picked for x, i.e., looked at for significance where it is absent, multiple testing, and the like. Severe testers think their account alone can handle the replication crisis satisfactorily. That leaves the burden on them to show that other accounts, such as probabilism and long-run performance, are incapable of handling the crisis, or are inadequate compared to the severe tester’s account. One way probabilists (such as subjective Bayesians) seem to block problematic inferences resulting from the replication crisis is by assigning a high subjective prior probability to the null-hypothesis, resulting in a high posterior probability for it. Severe testers grant that this procedure can block problematic inferences leading to the replication crisis. However, they insist that this procedure won’t be able to show *what* researchers have initially *done* wrong in producing the crisis in the first place. The nub of their criticism is that Bayesians don’t provide a convincing resolution of the replication crisis since they don’t explain where the researchers make their mistake.

I don’t think we can look to this procedure (“assigning a high subjective prior probability to the null-hypothesis, resulting in a high posterior probability for it”) to block problematic inferences. In some cases, your disbelief in H might be right on the money, but this is precisely what is *unknown* when undertaking research. An account must be able to directly register how biasing selection effects alter error probing capacities if it is to call out the resulting bad inferences–or so I argue. Data-dredged hypotheses are often very believable, that’s what makes them so seductive. Moreover, it’s crucial for an account to be able to say that H is plausible but terribly tested by this particular study or test. I don’t say that inquirers are always in the context of severe testing, by the way. We’re not always truly trying to find things out; often, we’re just trying to make our case. That said, I never claim the severe testing account is the only way to avoid irreplication in statistics, nor do I suggest that the problem of replication is the sole problem for an account of statistical inference. Explaining and avoiding irreplication is a *minimal* problem an account should be capable of solving. This relates to Bandyopadhyay’s central objection below.

In some places, he attributes to me a position that is nearly the opposite of what I argue. After explaining, I try to consider why he might be led to his topsy-turvy allegation.

The problem with the long-run performance-based frequency approach, according to Mayo, is that it is easy to support a false hypothesis with these methods by selective reporting. The severe tester thinks both Fisher’s and Neyman and Pearson’s methods leave the door open for cherry-picking, significance seeking, and multiple-testing, thus generating the possibility of a replication crisis. Fisher’s and Neyman-Pearson’s methods make room for enabling the support of a preferred claim even though it is not warranted by evidence. This causes severe testers like Mayo to abandon the idea of adopting long-run performance as a sufficient condition for statistical inferences; it is merely a necessary condition for them.

No, it is the opposite. The error statistical assessments are highly valuable because they pick up on the effects of data dredging, multiple testing, optional stopping and a host of biasing selection effects. Biasing selection effects are blocked in error statistical accounts because they preclude control of error probabilities! It is precisely because they render the error probability assessments invalid that error statistical accounts are able to require–with justification– predesignation and preregistration. That is the key message of SIST from the very start.

- SIST, p. 20: A key point too rarely appreciated: Statistical facts about P -values themselves demonstrate how data finagling can yield spurious significance. This is true for all error probabilities. That’s what a self-correcting inference account should do. … Scouring different subgroups and otherwise “trying and trying again” are classic ways to blow up the actual probability of obtaining an impressive, but spurious, finding – and that remains so even if you ditch P-values and never compute them.
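The arithmetic behind “trying and trying again” is simple: with k independent tests of true null hypotheses, each at level α, the probability of at least one nominally significant result is 1 − (1 − α)^k, which grows quickly (a minimal illustration):

```python
# Probability of at least one "significant" result among k independent
# tests of true nulls, each at level alpha = 0.05
alpha = 0.05
for k in (1, 5, 20):
    fwer = 1 - (1 - alpha) ** k
    print(f"k = {k:2d}: P(at least one p < 0.05) = {fwer:.2f}")
# k = 1: 0.05;  k = 5: 0.23;  k = 20: 0.64
```

This is the sense in which scouring subgroups “blows up the actual probability” of a spurious finding, whether or not any P-value is ever reported.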

Consider the dramatic opposition between Savage, and Fisher and N-P regarding the Likelihood Principle and optional stopping:

- SIST, p. 46: The lesson about who is allowed to cheat depends on your statistical philosophy. Error statisticians require that the overall and not the “computed” significance level be reported. To them, cheating would be to report the significance level you got after trying and trying again in just
*the same way*as if the test had a fixed sample size.

Bandyopadhyay seems to think that if I have criticisms of the long-run performance (or behavioristic) construal of error probabilities, it must be because I claim it leads to replication failure. That’s the only way I can explain his criticism above.

He is startled that I’m rejecting the long-run performance view I previously held.

This leads me to discuss the severe tester’s rejection of both probabilism and frequency-based long-run performance, especially the latter. It is understandable why Mayo finds fault with probabilists, since they are no friends of Bayesians who take probability theory to be the *only* logic of uncertainty. So, the position is consistent with the severe tester’s account proposed in Mayo’s last two influential books (1996 and 2010). What is surprising is that her account rejects the long-run performance view and only takes the frequency-based probability as necessary for statistical inference.

But I’ve always rejected the long run performance or “behavioristic” construal of error statistical methods–when it comes to using them for scientific inference. I’ve always rejected the supposition that the justification and rationale for error statistical methods is their ability to control the probabilities of erroneous inferences in a long run series of applications. Others have rejected it as well, notably, Birnbaum, Cox, Giere. Their sense is that these tools are satisfying inferential goals but in a way that no one has been able to quite explain. What hasn’t been done, and what I only hinted at in earlier work, is to supply an alternative, inferential rationale for error statistics. The trick is to show when and why long run error control supplies a measure of a method’s *capability* to identify mistakes. This capability assessment, in turn, supplies a measure of how well or poorly tested claims are. So, the inferential assessment, post data, is in terms of how well or poorly tested claims are.

My earlier work, *Error and the Growth of Experimental Knowledge* (EGEK) was directed at the uses of statistics for solving philosophical problems of evidence and inference.[1] SIST, by contrast, is focussed almost entirely on the philosophical problems of statistical practice. Moreover, I stick my neck out, and try to tackle essentially all of the examples around which there have been philosophical controversy from the severe tester’s paradigm. While I freely admit this represents a gutsy, if not radical, gambit, I actually find it perplexing that it hasn’t been done before. It seems to me that we convert information about (long-run) performance into information about well-testedness in ordinary, day to day reasoning. Take the informal example early on in the book.

- SIST, p. 14: Before leaving the USA for the UK, I record my weight on two scales at home, one digital, one not, and the big medical scale at my doctor’s office. …Returning from the UK, to my astonishment, not one but all three scales show anywhere from a 4–5 pound gain. …But the fact that all of them have me at an over 4-pound gain, while none show any difference in the weights of EGEK, pretty well seals it. …No one would say: ‘I can be assured that by following such a procedure, in the long run I would rarely report weight gains erroneously, but I can tell nothing from these readings about my weight now.’ To justify my conclusion by long-run performance would be absurd. Instead we say that the procedure had enormous capacity to reveal if any of the scales were wrong, and from this I argue about the source of the readings:
*H*: I’ve gained weight…. This is the key – granted with a homely example – that can fill a very important gap in frequentist foundations: Just because an account is touted as having a long-run rationale, it does not mean it lacks a short run rationale, or even one relevant for the particular case at hand.
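The capability claim here is quantitative: with several independent scales, the probability that all of them err in the same direction by a large amount is minuscule. Here is a minimal sketch, under an error model that is my assumption (not from SIST): each scale’s reading error is an independent standard normal, in pounds.

```python
import math

# Hypothetical error model (my assumption, not from SIST): each scale's
# reading errs from the true weight by an independent N(0, 1) amount, in lbs.
# If my weight had NOT changed, how probable is it that one scale -- let
# alone all three -- would read 4 or more pounds high?
p_one = 0.5 * math.erfc(4 / math.sqrt(2))  # P(N(0,1) >= 4), ~3.17e-05
p_all_three = p_one ** 3                   # three independent scales

print(f"one scale 4+ lbs high:  {p_one:.2e}")
print(f"all three 4+ lbs high:  {p_all_three:.2e}")  # ~3.18e-14
```

With anything like these error rates, the procedure has enormous capability to detect that a reported gain is spurious, which is why the agreement of all three scales severely tests *H*: I’ve gained weight.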

Let me now clarify why satisfying a long-run performance requirement is only necessary, and not sufficient, for severity. Long-run behavior could be satisfied while the error probabilities fail to reflect well-testedness in the case at hand. Go to the howlers and chestnuts of Excursion 3, Tour II:

- Exhibit (vi): Two Measuring Instruments of Different Precisions. Did you hear about the frequentist who, knowing she used a scale that’s right only half the time, claimed her method of weighing is right 75% of the time? She says, “I flipped a coin to decide whether to use a scale that’s right 100% of the time, or one that’s right only half the time, so, overall, I’m right 75% of the time.” (She wants credit because she could have used a better scale, even knowing she used a lousy one.)
*Basis for the joke:*An N-P test bases error probabilities on all possible outcomes or measurements that could have occurred in repetitions, but did not. As with many infamous pathological examples, often presented as knockdown criticisms of all of frequentist statistics, this was invented by a frequentist, Cox (1958). It was a way to highlight what could go wrong in the case at hand, if one embraced an unthinking behavioral-performance view. Yes, error probabilities are taken over hypothetical repetitions of a process, but not just any repetitions will do.
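The joke’s arithmetic can be spelled out in a few lines (a minimal sketch; the accuracies are the idealized ones stated in the joke):

```python
# Idealized accuracies from Cox's (1958) example: a fair coin flip selects
# the precise scale (always right) or the imprecise one (right half the time).
p_correct = {"precise": 1.0, "imprecise": 0.5}

# Unconditional (behavioristic) accuracy, averaged over the coin flip:
unconditional = 0.5 * p_correct["precise"] + 0.5 * p_correct["imprecise"]
print(unconditional)  # 0.75 -- the "75% of the time" in the joke

# Conditional accuracy, given the imprecise scale was actually used:
print(p_correct["imprecise"])  # 0.5 -- what matters for the case at hand
```

The 75% figure is a true long-run property of the mixed procedure, but once we know which instrument produced the data, it is the conditional error probability that reflects how well tested the claim is.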

In short: I’m taking the tools that are typically justified only because they control the probability of erroneous inferences in the long-run, and providing them with an inferential justification relevant for the case at hand. It’s only when long-run relative frequencies represent the method’s capability to discern mistaken interpretations of data that the performance and severe testing goals line up. Where the two sets of goals do not line up, severe testing takes precedence–at least when we’re trying to find things out. The book is an experiment in trying to do all of philosophy of statistics within the severe testing paradigm.

There’s more to reply to in his review, but I want to focus on this clarification, which should rectify his main criticism. For a discussion of the general points of severely testing theories, I direct the reader to extensive excerpts from SIST. His full review is here.

__________________________________

Bandyopadhyay attended my NEH Summer Seminar in 1999 on Inductive-Experimental Inference. I’m glad that he has pursued philosophy of statistics through the years. I do wish he had sent me his review earlier so that I could clarify the small set of confusions that led him to some unintended places. *Noûs* might have given the author an opportunity to reply, lest readers come away with a distorted view of the book. I will shortly be resuming a discussion of SIST on this blog, picking up with Excursion 2.

Update March 4: Note that I wound up commenting further on the review in the comments to this post.

[1] If you find an example that has been the subject of philosophical debate that is omitted from SIST, let me know. You will notice that all these examples are elementary, which is why I was able to cover them with minimal technical complexity. Some more exotic examples are in “chestnuts and howlers”.


This is a belated birthday post for R.A. Fisher (17 February, 1890-29 July, 1962)–it’s a guest post from earlier on this blog by Aris Spanos.

**Happy belated birthday to R.A. Fisher!**

**‘R. A. Fisher: How an Outsider Revolutionized Statistics’**

by **Aris Spanos**

Few statisticians will dispute that R. A. Fisher **(February 17, 1890 – July 29, 1962)** is the father of modern statistics; see Savage (1976), Rao (1992). Inspired by William Gosset’s (1908) paper on the Student’s t finite sampling distribution, he recast statistics into modern model-based induction in a series of papers in the early 1920s. He put forward a theory of *optimal estimation* based on the method of maximum likelihood that has changed only marginally over the last century. His significance testing, spearheaded by the p-value, provided the basis for the Neyman-Pearson theory of *optimal testing* in the early 1930s. According to Hald (1998):

“Fisher was a genius who almost single-handedly created the foundations for modern statistical science, without detailed study of his predecessors. When young he was ignorant not only of the Continental contributions but even of contemporary publications in English.” (p. 738)

What is not so well known is that Fisher was the *ultimate outsider* when he brought about this change of paradigms in statistical science. As an undergraduate, he studied mathematics at Cambridge, and then did graduate work in statistical mechanics and quantum theory. His meager knowledge of statistics came from his study of astronomy; see Box (1978). That, however, did not stop him from publishing his first paper in statistics in 1912 (while still an undergraduate) on “curve fitting”, questioning Karl Pearson’s method of moments and proposing a new method that would eventually become the likelihood method of his 1921 paper.

After graduating from Cambridge he drifted into a series of jobs, including subsistence farming and teaching high school mathematics and physics, until his temporary appointment as a statistician at Rothamsted Experimental Station in 1919. During the period 1912-1919 his interest in statistics was driven by his passion for eugenics and a realization that his mathematical knowledge of n-dimensional geometry could be put to good use in deriving finite sample distributions for estimators and tests in the spirit of Gosset’s (1908) paper. Encouraged by his early correspondence with Gosset, he derived the finite sampling distribution of the sample correlation coefficient, which he published in 1915 in *Biometrika*, the only statistics journal at the time, edited by Karl Pearson. To put this result in proper context: Pearson had been working on this problem for two decades and had published, with several assistants, more than a dozen papers approximating the first two moments of the sample correlation coefficient; Fisher derived the relevant distribution itself, not just the first two moments.

Due to its importance, the 1915 paper provided Fisher’s first skirmish with the ‘statistical establishment’. Karl Pearson would not lightly accept being overrun by a ‘newcomer’. So he prepared, with four of his assistants, a critical paper that became known as “the cooperative study”, questioning Fisher’s result as stemming from a misuse of Bayes’ theorem. He proceeded to publish it in *Biometrika* in 1917 without bothering to let Fisher know before publication. Fisher was furious at K. Pearson’s move and prepared his answer in a highly polemical style, which Pearson promptly refused to publish in his journal. Eventually, after tempering the style, Fisher was able to publish his answer in *Metron*, a brand-new statistics journal. As a result of this skirmish, Fisher pledged never to send another paper to *Biometrika*, and declared war on K. Pearson’s perspective on statistics. Fisher questioned not only his method of moments, as giving rise to inefficient estimators, but also his derivation of the degrees of freedom of his chi-square test. Several highly critical published papers ensued.[i]

Between 1922 and 1930 Fisher did most of his influential work in recasting statistics, including publishing a highly successful textbook in 1925, but the ‘statistical establishment’ kept him ‘in his place’: a statistician at an experimental station. All his attempts to find an academic position, including a position in Social Biology at the London School of Economics (LSE), were unsuccessful (see Box, 1978, p. 202). Being turned down for the LSE position was not unrelated to the fact that the professor of statistics at the LSE was Arthur Bowley (1869-1957), second only to Pearson in the statistical high priesthood.[ii]

Coming of age as a statistician in 1920s England meant being awarded the Guy Medal in gold, silver or bronze, or at least receiving an invitation to present your work to the Royal Statistical Society (RSS). Despite his fundamental contributions to the field, Fisher’s invitation to the RSS would not come until 1934. To put that in perspective, Jerzy Neyman, his junior by some distance, was invited six months earlier! Indeed, one can make a strong case that the statistical establishment kept Fisher away for as long as they could get away with it. By 1933, however, they must have felt that they had to invite Fisher, after he accepted a professorship at University College London. The position was created after Karl Pearson retired and the College decided to split his chair into a statistics position, which went to Egon Pearson (Pearson’s son), and a Galton Professorship in Eugenics, which was offered to Fisher. To make it worse, Fisher’s offer came with a humiliating clause forbidding him to teach statistics at University College (see Box, 1978, p. 258); the father of modern statistics was explicitly told to keep his views on statistics to himself!

Fisher’s presentation to the Royal Statistical Society, on December 18th, 1934, entitled “The Logic of Inductive Inference”, was an attempt to summarize and explain his published work on recasting the problem of statistical induction since his classic 1922 paper. Bowley was (self?) appointed to move the traditional vote of thanks and open the discussion. After some begrudging thanks for Fisher’s ‘contributions to statistics in general’, he went on to disparage his new approach to statistical inference based on the likelihood function, describing it as abstruse, arbitrary and misleading. His comments were predominantly sarcastic and discourteous, and went so far as to accuse Fisher of plagiarism for not acknowledging Edgeworth’s priority on the likelihood-function idea (see Fisher, 1935, pp. 55-7). The litany of churlish comments continued with the rest of the old guard: Isserlis, Irwin and the philosopher Wolf (1935, pp. 57-64), who was brought in by Bowley to undermine Fisher’s philosophical discussion of induction. Jeffreys complained about Fisher’s criticisms of the Bayesian approach (1935, pp. 70-2).

To Fisher’s support came … Egon Pearson, Neyman and Bartlett. E. Pearson argued that:

“When these ideas [on statistical induction] were fully understood … it would be realized that statistical science owed a very great deal to the stimulus Professor Fisher had provided in many directions.” (Fisher, 1935, pp. 64-5)

Neyman too came to Fisher’s support, praising Fisher’s path-breaking contributions, and explaining Bowley’s reaction to Fisher’s critical review of the traditional view of statistics as an understandable attachment to old ideas (1935, p. 73).

Fisher, in his reply to Bowley and the old guard, was equally contemptuous:

“The acerbity, to use no stronger term, with which the customary vote of thanks has been moved and seconded … does not, I confess, surprise me. From the fact that thirteen years have elapsed between the publication, by the Royal Society, of my first rough outline of the developments, which are the subject of to-day’s discussion, and the occurrence of that discussion itself, it is a fair inference that some at least of the Society’s authorities on matters theoretical viewed these developments with disfavour, and admitted with reluctance. … However true it may be that Professor Bowley is left very much where he was, the quotations show at least that Dr. Neyman and myself have not been left in his company. … For the rest, I find that Professor Bowley is offended with me for “introducing misleading ideas”. He does not, however, find it necessary to demonstrate that any such idea is, in fact, misleading. It must be inferred that my real crime, in the eyes of his academic eminence, must be that of “introducing ideas”. (Fisher, 1935, pp. 76-82)[iii]

In summary, the pioneering work of Fisher, later supplemented by Egon Pearson and Neyman, was largely ignored by the Royal Statistical Society (RSS) establishment until the early 1930s. By 1933 it was difficult to ignore their contributions, published primarily in other journals, and the ‘establishment’ of the RSS decided to display its tolerance of their work by creating ‘the Industrial and Agricultural Research Section’, under whose auspices papers by Neyman and Fisher were presented in 1934 and 1935, respectively.[iv]

In 1943, Fisher was offered the Balfour Chair of Genetics at the University of Cambridge. Recognition from the RSS came in 1946 with the Guy medal in gold, and he became its president in 1952-1954, just after he was knighted! Sir Ronald Fisher retired from Cambridge in 1957. The father of modern statistics never held an academic position in statistics!

You can read more in Spanos 2008 (below)

**References**

Bowley, A. L. (1902, 1920, 1926, 1937) *Elements of Statistics*, 2nd, 4th, 5th and 6th editions, Staples Press, London.

Box, J. F. (1978) *The Life of a Scientist: R. A. Fisher*, Wiley, NY.

Fisher, R. A. (1912), “On an Absolute Criterion for Fitting Frequency Curves,” *Messenger of Mathematics*, 41, 155-160.

Fisher, R. A. (1915) “Frequency distribution of the values of the correlation coefficient in samples from an indefinitely large population,” *Biometrika,* 10, 507-21.

Fisher, R. A. (1921) “On the ‘probable error’ of a coefficient deduced from a small sample,” *Metron* 1, 2-32.

Fisher, R. A. (1922) “On the mathematical foundations of theoretical statistics,” *Philosophical Transactions of the Royal Society*, A 222, 309-68.

Fisher, R. A. (1922a) “On the interpretation of χ² from contingency tables, and the calculation of P,” *Journal of the Royal Statistical Society*, 85, 87-94.

Fisher, R. A. (1922b) “The goodness of fit of regression formulae and the distribution of regression coefficients,” *Journal of the Royal Statistical Society,* 85, 597–612.

Fisher, R. A. (1924) “The conditions under which χ² measures the discrepancy between observation and hypothesis,” *Journal of the Royal Statistical Society*, 87, 442-450.

Fisher, R. A. (1925) *Statistical Methods for Research Workers*, Oliver & Boyd, Edinburgh.

Fisher, R. A. (1935) “The logic of inductive inference,” *Journal of the Royal Statistical Society* 98, 39-54, discussion 55-82.

Fisher, R. A. (1937), “Professor Karl Pearson and the Method of Moments,” *Annals of Eugenics*, 7, 303-318.

Gosset, W. S. (1908) “The probable error of the mean,” *Biometrika*, 6, 1-25.

Hald, A. (1998) *A History of Mathematical Statistics from 1750 to 1930*, Wiley, NY.

Hotelling, H. (1930) “British statistics and statisticians today,” *Journal of the American Statistical Association*, 25, 186-90.

Neyman, J. (1934) “On the two different aspects of the representative method: the method of stratified sampling and the method of purposive selection,” *Journal of the Royal Statistical Society,* 97, 558-625.

Rao, C. R. (1992) “R. A. Fisher: The Founder of Modern Statistical Science,” *Statistical Science*, 7, 34-48.

RSS (Royal Statistical Society) (1934) *Annals of the Royal Statistical Society* 1834-1934, The Royal Statistical Society, London.

Savage, L . J. (1976) “On re-reading R. A. Fisher,” *Annals of Statistics*, 4, 441-500.

Spanos, A. (2008), “Statistics and Economics,” pp. 1057-1097 in *The New Palgrave Dictionary of Economics*, Second Edition. Eds. Steven N. Durlauf and Lawrence E. Blume, Palgrave Macmillan.

Tippet, L. H. C. (1931) *The Methods of Statistics*, Williams & Norgate, London.

[i] Fisher (1937), published a year after Pearson’s death, is particularly acerbic. In Fisher’s mind, Karl Pearson went after a young Indian statistician – totally unfairly – just the way he went after him in 1917.

[ii] Bowley received the Guy Medal in silver from the Royal Statistical Society (RSS) as early as 1895, and became a member of the Council of the RSS in 1898. He was awarded the society’s highest honor, the Guy Medal in gold, in 1935.

[iii] It is important to note that Bowley revised his textbook in statistics for the last time in 1937, and predictably, he missed the whole change of paradigms brought about by Fisher, Neyman and Pearson.

[iv] In their centennial volume published in 1934, the RSS acknowledged the development of ‘mathematical statistics’, referring to Galton, Edgeworth, Karl Pearson, Yule and Bowley as the main pioneers, and listed the most important contributions in this sub-field which appeared in its Journal during the period 1909-33, but the three important papers by Fisher (1922a-b; 1924) are conspicuously absent from that list. The list itself is dominated by contributions in vital, commercial, financial and labour statistics (see RSS, 1934, pp. 208-23). There is a single reference to Egon Pearson.

This was first posted on 17 Feb. 2013 here.

**HAPPY BIRTHDAY R.A. FISHER!**

**I. Doubt is Their Product** is the title of a 2008 book by David Michaels, Assistant Secretary for OSHA from 2009-2017. I first mentioned it on this blog back in 2011 (“Will the Real Junk Science Please Stand Up?”). The expression comes from a statement by a cigarette executive (“doubt is our product”), and the book’s thesis is explained in its subtitle.

**II. Fixing Science.** So, one day in January, I was invited to speak on a panel, “Falsifiability and the Irreproducibility Crisis”, at a conference, “Fixing Science: Practical Solutions for the Irreproducibility Crisis.” The inviter, David Randall, whom I did not know, explained that a speaker had withdrawn from the session because of some kind of controversy surrounding the conference, but did not give details. He pointed me to an op-ed in the *Wall Street Journal.* I had already heard about the conference months before (from Nathan Schachtman), and before checking out the op-ed, my first thought was: I wonder if the controversy has to do with the fact that a keynote speaker is Ron Wasserstein, ASA Executive Director, a leading advocate of retiring “statistical significance” and barring P-value thresholds in interpreting data. Another speaker (D. Trafimow) eschews all current statistical inference methods (e.g., P-values, confidence intervals) as just too uncertain. More specifically, I imagined it might have to do with the controversy over whether the March 2019 editorial in TAS (Wasserstein, Schirm, and Lazar 2019) was a continuation of the ASA 2016 Statement on P-values, and thus an official ASA policy document, or not. Karen Kafadar, recent President of the American Statistical Association (ASA), made it clear in December 2019 that it is not.[2] The “no significance/no thresholds” view is the position of the guest editors of the March 2019 issue. (See “P-Value Statements and Their Unintended(?) Consequences” and “Les stats, c’est moi“.) Kafadar created a new 2020 ASA Task Force on Statistical Significance and Replicability to:

prepare a thoughtful and concise piece …without leaving the impression that p-values and hypothesis tests—and, perhaps by extension as many have inferred, statistical methods generally—have no role in “good statistical practice”. (Kafadar 2019, p. 4)

Maybe those inviting me didn’t know I’m “anti” the Anti-Statistical Significance campaign (“On some self-defeating aspects of the 2019 recommendations“), that I agree with John Ioannidis (2019) that “retiring statistical significance would give bias a free pass“, and published an editorial “P-value Thresholds: Forfeit at Your Peril“. While I regard many of today’s statistical reforms as welcome (preregistration, testing for replication, transparency about data-dredging, P-hacking and multiple testing), I argue that those in Wasserstein et al. (2019) are “Doing more harm than good“. In “Don’t Say What You don’t Mean“, I express doubts that Wasserstein et al. (2019) could really mean to endorse certain statements in their editorial that are so extreme as to conflict with the ASA 2016 guide on P-values. To be clear, I reject oversimple dichotomies and cookbook uses of tests, long lampooned, and have developed a reformulation of tests that avoids the fallacies of significance and non-significance.[1] It’s just that many of the criticisms are confused, and, consequently, so are many reforms.

**III. Bad Statistics is Their Product.** It turns out that the brouhaha around the conference had nothing to do with all that. I thank Dorothy Bishop for pointing me to her blog, which gives a much fuller background. Aside from the lack of women (I learned a new word: a “manference”), her real objection is on the order of “Bad Statistics is Their Product”: the groups sponsoring the *Fixing Science* conference, the National Association of Scholars and the Independent Institute, Bishop argues, are using the replication crisis to cast doubt on well-established risks, notably those of climate change. She refers to a book whose title echoes David Michaels’s: *Merchants of Doubt* (2010), by historians of science Conway and Oreskes. Bishop writes:

Uncertainty about science that threatens big businesses has been promoted by think tanks … which receive substantial funding from those vested interests. The Fixing Science meeting has a clear overlap with those players. (Bishop)

The speakers on bad statistics, as she sees it, are “foils” for these interests, and thus “responsible scientists should avoid” the meeting.

*But what if things are the reverse?* What if “bad statistics is our product” leaders also have an agenda? By influencing groups who have a voice in evidence policy in government agencies, they might effectively discredit methods they don’t like, and advance those they favor. Suppose you have strong arguments that the consequences of this will undermine important safeguards (despite the key players being convinced they’re promoting better science). Then you should speak, if you can, and not stay away. *You should try to fight fire with fire.*

**IV. So what Happened?** So I accepted the invitation and gave what struck me as a fairly radical title: “P-Value ‘Reforms’: Fixing Science or Threats to Replication and Falsification?” (The abstract and slides are below.) Bishop is right that evidence of bad science can be exploited to selectively weaken entire areas of science; but evidence of bad statistics can also be exploited to selectively weaken entire methods one doesn’t like, and successfully gain acceptance of alternative methods, without the hard work of showing those alternative methods do a better, or even a good, job at the task at hand. Of course both of these things might be happening simultaneously.

Do the conference organizers overlap with science policy, as Bishop alleges? I’d never encountered either outfit before, but Bishop quotes from their annual report:

In April we published The Irreproducibility Crisis, a report on the modern scientific crisis of reproducibility—the failure of a shocking amount of scientific research to discover true results because of slipshod use of statistics, groupthink, and flawed research techniques. We launched the report at the Rayburn House Office Building in Washington, DC; it was introduced by Representative Lamar Smith, the Chairman of the House Committee on Science, Space, and Technology.

So there is a mix with science policy makers in Washington, and their publication, *The Irreproducibility Crisis,* is clearly prepared to find its scapegoat in the bad statistics supposedly encouraged by statistical significance tests. To its credit, it discusses how data-dredging and multiple testing can make it easy to arrive at impressive-looking findings that are spurious, but nothing is said about ways to adjust or account for multiple testing and multiple modeling. (P-values *are* defined correctly, but the interpretation of confidence levels is incorrect.) Published before the Wasserstein et al. (2019) call to end P-value thresholds (which would require the FDA and other agencies to abandon what many consider vital safeguards of error control), it doesn’t go that far. *Not yet, at least!* Trying to prevent that from happening is a key reason I decided to attend. (updated 2/16)

My first step was to send David Randall my book, *Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars* (2018, CUP)–which he actually read and wrote a report on–and I met up with him in NYC to talk. He seemed surprised to learn about the controversies over statistical foundations and the disagreements about reforms. So did I hold people’s feet to the fire at the conference, when it came to scapegoating statistical significance tests and banning P-value thresholds for error probability control? I did! I continue to do so in communications with David Randall. (I’ll write more in the comments to this post, once our slides are up.)

As for climate change, I wound up entirely missing that part of the conference: Due to the grounding of all flights to and from CLT the day I was to travel, thanks to rain, hail and tornadoes, I could only fly the following day, so our sessions were swapped. I hear the presentations will be posted. Doubtless, some people will use bad statistics and the “replication crisis” to claim there’s reason to reject our best climate change models, without having adequate knowledge of the science. But the real and present danger today that I worry about is that they will use bad statistics to claim there’s reason to reject our best (error) statistical practices, without adequate knowledge of the statistics or the philosophical and statistical controversies behind the “reforms”.

Let me know what you think in the comments.

**V.** Here’s my abstract and slides

P-Value “Reforms”: Fixing Science or Threats to Replication and Falsification?

Mounting failures of replication give a new urgency to critically appraising proposed statistical reforms. While many reforms are welcome, others are quite radical. The sources of irreplication are not mysterious: in many fields, latitude in collecting and interpreting data makes it too easy to dredge up impressive looking findings even when spurious. Paradoxically, some of the reforms intended to fix science enable rather than reveal illicit inferences due to P-hacking, multiple testing, and data-dredging. Some even preclude testing and falsifying claims altogether. Too often the statistics wars become proxy battles between competing tribal leaders, each keen to advance a method or philosophy, rather than improve scientific accountability.

[1]* Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (SIST)*, 2018; *SIST* excerpts; Mayo and Cox 2006; Mayo and Spanos 2006.

[2] All uses of ASA II (note) on this blog must now be qualified to reflect this.

[3] You can find a lot on the conference and the groups involved on-line. The letter by Lenny Teytelman warning people off the conference is here. Nathan Schachtman has a post up today on his law blog here.
