This is the title of Brian Haig’s recent paper in Methods in Psychology 2 (Nov. 2020). Haig is a professor emeritus of psychology at the University of Canterbury. Here he provides both a thorough and insightful review of my book Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (CUP, 2018) and an excellent overview of the high points of today’s statistics wars and the replication crisis, especially from the perspective of psychology. I’ll excerpt from his article in a couple of posts. The full article, which is open access, is here.
Abstract: In this article, I critically evaluate two major contemporary proposals for reforming statistical thinking in psychology: The recommendation that psychology should employ the “new statistics” in its research practice, and the alternative proposal that it should embrace Bayesian statistics. I do this from the vantage point of the modern error-statistical perspective, which emphasizes the importance of the severe testing of knowledge claims. I also show how this error-statistical perspective improves our understanding of the nature of science by adopting a workable process of falsification and by structuring inquiry in terms of a hierarchy of models. Before concluding, I briefly discuss the importance of the philosophy of statistics for improving our understanding of statistical thinking.
Keywords: The error-statistical perspective, The new statistics, Bayesian statistics, Falsificationism, Hierarchy of models, Philosophy of statistics
1. Introduction
Psychology has been prominent among a number of disciplines that have proposed statistical reforms for improving our understanding and use of statistics in research. However, despite being at the forefront of these reforms, psychology has ignored the philosophy of statistics to its detriment. In this article, I consider, in a broad-brush way, two major proposals that feature prominently in psychology’s current methodological reform literature: The recommendation that psychology should employ the so-called “new statistics” in its research practice, and the alternative proposal that psychology should embrace Bayesian statistics. I evaluate each from the vantage point of the error-statistical philosophy, which, I believe, is the most coherent perspective on statistics available to us. Before concluding, I discuss two interesting features of the conception of science adopted by the error-statistical perspective, along with brief remarks about the value of the philosophy of statistics for deepening our understanding of statistics.
2. The error-statistical perspective
The error-statistical perspective employed in this article is that of Deborah Mayo, sometimes in collaboration with Aris Spanos (Mayo, 1996, 2018; Mayo & Spanos, 2010, 2011). This perspective is landmarked by two major works. The first is Mayo’s ground-breaking book, Error and the growth of experimental knowledge (1996), which presented the first extensive formulation of her error-statistical perspective on statistical inference. This philosophy provides a systematic understanding of experimental reasoning in science that uses frequentist statistics in order to manage error. Hence, its name. The novelty of the book lay in the fact that it employed ideas in statistical science to shed light on philosophical problems to do with the nature of evidence and inference.
The second book is Mayo’s recently published Statistical inference as severe testing (2018). In contrast with the first book, this work focuses on problems arising from statistical practice, but endeavors to solve them by probing their foundations from the related vantage points of the philosophy of science and the philosophy of statistics. By dealing with the vexed problems of current statistical practice, this book is a valuable repository of ideas, insights, and solutions designed to help a broad readership deal with the current crisis in statistics. Because my focus is on statistical reforms in psychology, I draw mainly from the resources contained in the second book.
Fundamental disputes about the nature and foundations of statistical inference are long-standing and ongoing. Most prominent have been the numerous debates between, and within, frequentist and Bayesian camps. Cutting across these debates have been more recent attempts to unify and reconcile rival outlooks, which have complexified the statistical landscape. Today, these endeavors fuel the ongoing concern that psychology and many sciences have with replication failures, questionable research practices, and the strong demand for an improvement of research integrity. Mayo refers to debates about these concerns as the “statistics wars”. With the addition of Statistical inference as severe testing to the error-statistical corpus, it is fair to say that the error-statistical outlook now has the resources to enable statisticians and scientists to understand and advance beyond the bounds of these statistics wars.
The strengths of the error-statistical approach are considerable (Haig, 2017; Spanos, 2019a, 2019b), and I believe that they combine to give us the most coherent philosophy of statistics currently available. For the purpose of this article, it suffices to say that the error-statistical approach contains the methodological and conceptual resources that enable one to diagnose and overcome the common misunderstandings of widely used frequentist statistical methods such as tests of significance. It also provides a trenchant critique of Bayesian ways of thinking in statistics. I will draw from these two strands of the error-statistical perspective to inform my critical evaluation of the new statistics and the Bayesian alternative.
Because the error-statistical and Bayesian outlooks are so different, some might consider it unfair to use the former to critique the latter. My response to this worry is three-fold: First, perspective-taking is an unavoidable feature of the human condition; we cannot rise above our human conceptual frameworks and adopt a position from nowhere. Second, in thinking things through, we often find it useful to proceed by contrast, rather than direct analysis. Indeed, the error-statistical outlook on statistics was originally developed in part by using the Bayesian outlook as a foil. And third, strong debates between Bayesians and frequentists have a long history, and they have helped shape the character of these two alternative outlooks on statistics. By participating in these debates, the error-statistical perspective is itself unavoidably controversial.
3. The new statistics
For decades, numerous calls have been made for replacing tests of statistical significance with alternative statistical methods. The new statistics, which urges the abandonment of null hypothesis significance testing (NHST), and the adoption of effect sizes, confidence intervals, and meta-analysis as a replacement package, is one such reform movement (Calin-Jageman and Cumming, 2019; Cumming, 2012, 2014). It has been heavily promoted in psychological circles and touted as a much-needed successor to NHST, which is deemed to be broken-backed. Psychological Science, which is the flagship journal of the Association for Psychological Science, endorsed the use of the new statistics, wherever appropriate (Eich, 2014). In fact, the new statistics might be considered the Association’s current quasi-official position on statistical inference. Although the error-statistical outlook does not directly address the new statistics movement, its suggestions for overcoming the statistics wars contain insights about statistics that can be employed to mount a powerful challenge to the integrity of that movement.
3.1. Null hypothesis significance testing
The new statisticians contend that NHST has major flaws and recommend replacing it with their favored statistical methods. Prominent among the flaws are the familiar claims that NHST encourages dichotomous thinking, and that it comprises an indefensible amalgam of the Fisherian and Neyman-Pearson schools of thought. However, neither of these features applies to the error-statistical understanding of NHST. The claim that we should abandon NHST because it leads to dichotomous thinking is unconvincing because it is leveled at the misuse of a statistical test that arises from its mechanical application and a poor understanding of its foundations. By contrast, the error-statistical perspective advocates the flexible use of levels of significance tailored to the case at hand as well as reporting of exact p values – a position that Fisher himself came to hold.
Further, the error-statistical perspective makes clear that what is commonly understood as NHST is not in fact an amalgam of Fisher’s and Neyman and Pearson’s thinking on the matter, especially their mature thought. Moreover, the error-statistical outlook can accommodate both evidential and behavioural interpretations of NHST, respectively serving probative and performance goals, to use Mayo’s suggestive terms. The error-statistical perspective urges us to move beyond the claim that NHST is an inchoate hybrid. Based on a close reading of the historical record, Mayo argues that Fisher and Neyman and Pearson should be interpreted as compatibilists, and that focusing on the vitriolic exchanges between Fisher and Neyman prevents one from seeing how their views dovetail. Importantly, Mayo formulates the error-statistical perspective on NHST by assembling insights from these founding fathers, and additional sources, into a coherent hybrid. There is much to be said for replacing psychology’s fixation on the muddle that is NHST with the error-statistical perspective on significance testing.
Thus, the recommendation of the new statisticians to abandon NHST, understood as the inchoate hybrid commonly employed in psychology, commits the fallacy of the false dichotomy because there exist alternative defensible accounts of NHST (Haig, 2017). The error-statistical perspective is one such attractive alternative.
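To make the contrast between a mechanical accept/reject verdict and the reporting of exact p values concrete, here is a minimal sketch. It assumes a one-sided test of a normal mean with known standard deviation; the function name and all of the numbers are hypothetical illustrations, not taken from Haig's article or from Mayo's book.

```python
from statistics import NormalDist


def one_sided_p_value(xbar, mu0, sigma, n):
    """Exact p value for H0: mu <= mu0 against H1: mu > mu0.

    Assumes a normal model with known sigma (an illustrative assumption).
    Returns P(sample mean >= xbar) computed under mu = mu0.
    """
    standard_error = sigma / n ** 0.5
    z = (xbar - mu0) / standard_error
    return 1.0 - NormalDist().cdf(z)


# Hypothetical data: observed mean 0.4, sigma 2.0, n = 100, so z = 2.0.
p = one_sided_p_value(xbar=0.4, mu0=0.0, sigma=2.0, n=100)
print(f"exact p value: {p:.4f}")  # about 0.0228

# The same exact p value can be read against several levels, rather than
# being collapsed into one verdict at a single fixed threshold.
for alpha in (0.10, 0.05, 0.01):
    print(f"alpha = {alpha:.2f}: p <= alpha is {p <= alpha}")
```

Reporting the exact value lets the reader judge the result against whatever level suits the case at hand, which is the flexible use the passage above describes.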
3.2. Confidence intervals
For the new statisticians, confidence intervals replace p-valued null hypothesis significance testing. Confidence intervals are said to be more informative, and more easily understood, than p values, as well as serving the important scientific goal of estimation, which is preferred to hypothesis testing. Both of these claims are open to challenge. Whether confidence intervals are more informative than statistical hypothesis tests in a way that matters will depend on the research goals being pursued. For example, p values might properly be used to get a useful initial gauge of whether an experimental effect occurs in a particular study, before one runs further studies and reports p values, supplementary confidence intervals, and effect sizes. The claim that confidence intervals are more easily understood than p values is surprising, and is not borne out by the empirical evidence (e.g., Hoekstra et al., 2014). I will speak to the claim about the greater importance of estimation in the next section.
There is a double irony in the fact that the new statisticians criticize NHST for encouraging simplistic dichotomous thinking. For one thing, as already noted, such thinking is straightforwardly avoided by employing tests of statistical significance properly, whether or not one adopts the error-statistical perspective. For another, the adoption of standard frequentist confidence intervals in place of NHST forces the new statisticians to engage in dichotomous thinking of another kind: Make a decision on whether a parameter estimate is either inside, or outside, its confidence interval.
Error-statisticians have good reason for claiming that their reinterpretation of frequentist confidence intervals is superior to the standard view. The account of confidence intervals adopted by the new statisticians prespecifies a single confidence level (a strong preference for 0.95 in their case). The single interval estimate corresponding to this level provides the basis for the inference that is drawn about the parameter values, depending on whether they fall inside or outside the interval. A limitation of this way of thinking is that each of the values of a parameter in the interval is taken to have the same evidential, or probative, force – an unsatisfactory state of affairs that results from weak testing. For example, there is no way of answering the relevant questions, ‘Are the values in the middle of the interval closer to the true value?’, or ‘Are they more probable than others in the interval?’
The error-statistician, by contrast, draws inferences about each of the obtained values, according to whether they are warranted, or not, at different severity levels, thus leading to a series of confidence intervals. Mayo (2018) captures the counterfactual logic of severity thinking involved with the following general example: “Were μ less than the 0.995 lower limit, then it is very probable (>0.995) that our procedure would yield a smaller sample mean than 0.6. This probability gives the severity.” (p. 195) Clearly, this is a more nuanced and informative assessment of parameter estimates than that offered by the standard view. Details on the error-statistical conception of confidence intervals can be found in Mayo (2018, pp. 189–201), as well as Mayo and Spanos (2011) and Spanos (2014, 2019a, b).
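The idea of a series of confidence bounds, each with its own severity, can be sketched numerically. The following is a minimal illustration only, assuming a normal mean with known standard error and claims of the form "μ > μ1". The observed mean of 0.6 echoes the quoted passage, but the standard error of 0.2 and the function names are my own assumptions, not values from Mayo (2018).

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def lower_bound(xbar, standard_error, level):
    """One-sided lower confidence bound for mu at the given level."""
    return xbar - STD_NORMAL.inv_cdf(level) * standard_error


def severity_mu_greater(xbar, standard_error, mu1):
    """Severity for the claim 'mu > mu1' given the observed mean xbar.

    Computed as P(sample mean <= xbar) under mu = mu1: the probability
    that the procedure would have produced a smaller sample mean than
    the one observed, were mu only as large as mu1.
    """
    return STD_NORMAL.cdf((xbar - mu1) / standard_error)


# Assumed values for illustration: observed mean 0.6, standard error 0.2.
xbar, standard_error = 0.6, 0.2

# Instead of one prespecified level, report a series of lower bounds.
for level in (0.5, 0.8, 0.9, 0.95, 0.995):
    bound = lower_bound(xbar, standard_error, level)
    sev = severity_mu_greater(xbar, standard_error, bound)
    print(f"level {level:.3f}: mu > {bound:.3f} has severity {sev:.3f}")
```

In this simple model the severity for "μ > lower bound" equals the confidence level used to form that bound, so values of μ1 across the range receive different warrants rather than being treated alike inside a single 0.95 interval.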
Methodologists and researchers in psychology are now taking confidence intervals seriously. However, in the interests of adopting a sound frequentist conception of such intervals, they would be well advised to replace the new statistics conception of them with their superior error-statistical understanding.
3.3. Estimation and hypothesis tests
The new statisticians claim, controversially, that parameter estimation, rather than statistical hypothesis testing, leads to better science – presumably in part because of the deleterious effects of NHST. However, a strong preference for estimation leads Cumming (2012) to aver that the typical questions addressed in science are what questions (e.g., “What is the age of the earth?”, “What is the most likely sea-level rise by 2012?”). I think that this is a restricted, rather “flattened”, view of science where, by implication, explanatory why questions and how questions (which often ask for information about causal mechanisms) are considered atypical.
Why and how questions are just as important for science as what questions. They are often the sort of questions that science seeks to answer when constructing and evaluating explanatory hypotheses and theories. Interestingly, and at variance with this view, Cumming (Fidler and Cumming, 2014) acknowledges that estimation can be usefully combined with hypothesis testing in science, and that estimation can play a valuable role in theory construction. This is as it should be because science frequently incorporates parameter estimates in precise predictions that are used to assess the hypotheses and theories from which they are derived.
Although it predominantly uses the language of testing, the error-statistical perspective maintains that statistical inference can be employed to deal with both estimation and hypothesis testing problems. It also endorses the view that providing explanations of things is an important part of science and, in fact, advocates piecemeal testing of local hypotheses nested within large-scale explanatory theories.
Despite the generally favorable reception of the new statistics in psychology, it has been subject to criticism by both frequentists (e.g., Sakaluk, 2016), and Bayesians (e.g., Kruschke and Liddell, 2018). However, these criticisms have not occasioned a public response from the principal advocates of the new statistics movement. The error-statistical outlook presents a golden opportunity for those who advocate, or endorse, the new statistics to defend their position in the face of challenging criticism. A sound justification for the promotion and adoption of new statistics practices in psychology requires as much.
To be continued… Please share comments and questions.
Excerpts and Mementos from SIST on this blog are compiled here.