This is the title of Brian Haig’s recent paper in Methods in Psychology 2 (Nov. 2020). Haig is a professor emeritus of psychology at the University of Canterbury. Here he provides both a thorough and insightful review of my book Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (CUP, 2018) and an excellent overview of the high points of today’s statistics wars and the replication crisis, especially from the perspective of psychology. I’ll excerpt from his article in a couple of posts. The full article, which is open access, is here.
Abstract: In this article, I critically evaluate two major contemporary proposals for reforming statistical thinking in psychology: The recommendation that psychology should employ the “new statistics” in its research practice, and the alternative proposal that it should embrace Bayesian statistics. I do this from the vantage point of the modern error-statistical perspective, which emphasizes the importance of the severe testing of knowledge claims. I also show how this error-statistical perspective improves our understanding of the nature of science by adopting a workable process of falsification and by structuring inquiry in terms of a hierarchy of models. Before concluding, I briefly discuss the importance of the philosophy of statistics for improving our understanding of statistical thinking.
Keywords: The error-statistical perspective, The new statistics, Bayesian statistics, Falsificationism, Hierarchy of models, Philosophy of statistics
1. Introduction
Psychology has been prominent among a number of disciplines that have proposed statistical reforms for improving our understanding and use of statistics in research. However, despite being at the forefront of these reforms, psychology has ignored the philosophy of statistics to its detriment. In this article, I consider, in a broad-brush way, two major proposals that feature prominently in psychology’s current methodological reform literature: The recommendation that psychology should employ the so-called “new statistics” in its research practice, and the alternative proposal that psychology should embrace Bayesian statistics. I evaluate each from the vantage point of the error-statistical philosophy, which, I believe, is the most coherent perspective on statistics available to us. Before concluding, I discuss two interesting features of the conception of science adopted by the error-statistical perspective, along with brief remarks about the value of the philosophy of statistics for deepening our understanding of statistics.
2. The error-statistical perspective
The error-statistical perspective employed in this article is that of Deborah Mayo, sometimes in collaboration with Aris Spanos (Mayo, 1996, 2018; Mayo & Spanos, 2010, 2011). This perspective is landmarked by two major works. The first is Mayo’s ground-breaking book, Error and the growth of experimental knowledge (1996), which presented the first extensive formulation of her error-statistical perspective on statistical inference. This philosophy provides a systematic understanding of experimental reasoning in science that uses frequentist statistics in order to manage error. Hence, its name. The novelty of the book lay in the fact that it employed ideas in statistical science to shed light on philosophical problems to do with the nature of evidence and inference.
The second book is Mayo’s recently published Statistical inference as severe testing (2018). In contrast with the first book, this work focuses on problems arising from statistical practice, but endeavors to solve them by probing their foundations from the related vantage points of the philosophy of science and the philosophy of statistics. By dealing with the vexed problems of current statistical practice, this book is a valuable repository of ideas, insights, and solutions designed to help a broad readership deal with the current crisis in statistics. Because my focus is on statistical reforms in psychology, I draw mainly from the resources contained in the second book.
Fundamental disputes about the nature and foundations of statistical inference are long-standing and ongoing. Most prominent have been the numerous debates between, and within, frequentist and Bayesian camps. Cutting across these debates have been more recent attempts to unify and reconcile rival outlooks, which have complicated the statistical landscape. Today, these endeavors fuel the ongoing concern of psychology and many other sciences with replication failures and questionable research practices, and the strong demand for improved research integrity. Mayo refers to debates about these concerns as the “statistics wars”. With the addition of Statistical inference as severe testing to the error-statistical corpus, it is fair to say that the error-statistical outlook now has the resources to enable statisticians and scientists to understand and advance beyond the bounds of these statistics wars.
The strengths of the error-statistical approach are considerable (Haig, 2017; Spanos, 2019a, 2019b), and I believe that they combine to give us the most coherent philosophy of statistics currently available. For the purpose of this article, it suffices to say that the error-statistical approach contains the methodological and conceptual resources that enable one to diagnose and overcome the common misunderstandings of widely used frequentist statistical methods such as tests of significance. It also provides a trenchant critique of Bayesian ways of thinking in statistics. I will draw from these two strands of the error-statistical perspective to inform my critical evaluation of the new statistics and the Bayesian alternative.
Because the error-statistical and Bayesian outlooks are so different, some might consider it unfair to use the former to critique the latter. My response to this worry is three-fold: First, perspective-taking is an unavoidable feature of the human condition; we cannot rise above our human conceptual frameworks and adopt a position from nowhere. Second, in thinking things through, we often find it useful to proceed by contrast, rather than direct analysis. Indeed, the error-statistical outlook on statistics was originally developed in part by using the Bayesian outlook as a foil. And third, strong debates between Bayesians and frequentists have a long history, and they have helped shape the character of these two alternative outlooks on statistics. By participating in these debates, the error-statistical perspective is itself unavoidably controversial.
3. The new statistics
For decades, numerous calls have been made for replacing tests of statistical significance with alternative statistical methods. The new statistics, which urges the abandonment of null hypothesis significance testing (NHST), and the adoption of effect sizes, confidence intervals, and meta-analysis as a replacement package, is one such reform movement (Calin-Jageman and Cumming, 2019; Cumming, 2012, 2014). It has been heavily promoted in psychological circles and touted as a much-needed successor to NHST, which is deemed to be broken-backed. Psychological Science, which is the flagship journal of the Association for Psychological Science, endorsed the use of the new statistics wherever appropriate (Eich, 2014). In fact, the new statistics might be considered the Association’s current quasi-official position on statistical inference. Although the error-statistical outlook does not directly address the new statistics movement, its suggestions for overcoming the statistics wars contain insights about statistics that can be employed to mount a powerful challenge to the integrity of that movement.
3.1. Null hypothesis significance testing
The new statisticians contend that NHST has major flaws and recommend replacing it with their favored statistical methods. Prominent among the flaws are the familiar claims that NHST encourages dichotomous thinking, and that it comprises an indefensible amalgam of the Fisherian and Neyman-Pearson schools of thought. However, neither of these features applies to the error-statistical understanding of NHST. The claim that we should abandon NHST because it leads to dichotomous thinking is unconvincing because it is leveled at the misuse of a statistical test that arises from its mechanical application and a poor understanding of its foundations. By contrast, the error-statistical perspective advocates the flexible use of levels of significance tailored to the case at hand as well as reporting of exact p values – a position that Fisher himself came to hold.
Further, the error-statistical perspective makes clear that NHST, as commonly understood, is not a faithful amalgam of Fisher’s and Neyman and Pearson’s thinking on the matter, especially their mature thought. Moreover, the error-statistical outlook can accommodate both evidential and behavioral interpretations of NHST, respectively serving probative and performance goals, to use Mayo’s suggestive terms. The error-statistical perspective urges us to move beyond the claim that NHST is an inchoate hybrid. Based on a close reading of the historical record, Mayo argues that Fisher and Neyman and Pearson should be interpreted as compatibilists, and that focusing on the vitriolic exchanges between Fisher and Neyman prevents one from seeing how their views dovetail. Importantly, Mayo formulates the error-statistical perspective on NHST by assembling insights from these founding fathers, and additional sources, into a coherent hybrid. There is much to be said for replacing psychology’s fixation on the muddle that is NHST with the error-statistical perspective on significance testing.
Thus, the recommendation of the new statisticians to abandon NHST, understood as the inchoate hybrid commonly employed in psychology, commits the fallacy of the false dichotomy because there exist alternative defensible accounts of NHST (Haig, 2017). The error-statistical perspective is one such attractive alternative.
3.2. Confidence intervals
For the new statisticians, confidence intervals replace null hypothesis significance tests and their p values. Confidence intervals are said to be more informative, and more easily understood, than p values, as well as serving the important scientific goal of estimation, which is preferred to hypothesis testing. Both of these claims are open to challenge. Whether confidence intervals are more informative than statistical hypothesis tests in a way that matters will depend on the research goals being pursued. For example, p values might properly be used to get a useful initial gauge of whether an experimental effect occurs in a particular study, before one runs further studies and reports p values, supplementary confidence intervals, and effect sizes. The claim that confidence intervals are more easily understood than p values is surprising, and it is not borne out by the empirical evidence (e.g., Hoekstra et al., 2014). I will speak to the claim about the greater importance of estimation in the next section.
There is a double irony in the fact that the new statisticians criticize NHST for encouraging simplistic dichotomous thinking. For one thing, as already noted, such thinking is straightforwardly avoided by employing tests of statistical significance properly, whether or not one adopts the error-statistical perspective. For another, the adoption of standard frequentist confidence intervals in place of NHST forces the new statisticians into dichotomous thinking of another kind: deciding whether a parameter estimate falls inside, or outside, its confidence interval.
Error-statisticians have good reason for claiming that their reinterpretation of frequentist confidence intervals is superior to the standard view. The account of confidence intervals adopted by the new statisticians prespecifies a single confidence level (a strong preference for 0.95 in their case). The single interval estimate corresponding to this level provides the basis for the inference that is drawn about the parameter values, depending on whether they fall inside or outside the interval. A limitation of this way of thinking is that each of the values of a parameter in the interval is taken to have the same evidential, or probative, force – an unsatisfactory state of affairs that results from weak testing. For example, there is no way of answering the relevant questions, ‘Are the values in the middle of the interval closer to the true value?’, or ‘Are they more probable than other values in the interval?’
The error-statistician, by contrast, draws inferences about each of the obtained values, according to whether they are warranted, or not, at different severity levels, thus leading to a series of confidence intervals. Mayo (2018) captures the counterfactual logic of severity thinking involved with the following general example: “Were μ less than the 0.995 lower limit, then it is very probable (>0.995) that our procedure would yield a smaller sample mean than 0.6. This probability gives the severity.” (p. 195) Clearly, this is a more nuanced and informative assessment of parameter estimates than that offered by the standard view. Details on the error-statistical conception of confidence intervals can be found in Mayo (2018, pp. 189–201), as well as Mayo and Spanos (2011) and Spanos (2014, 2019a, 2019b).
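[To see the arithmetic behind the quoted example, here is a minimal numerical sketch assuming a Normal model with known standard error. The observed mean of 0.6 comes from the quote; the standard error of 0.2 is an illustrative assumption, not necessarily the book’s exact setup.]

```python
# Minimal sketch of the severity logic quoted above (illustrative values).
from scipy.stats import norm

xbar = 0.6   # observed sample mean (as in the quoted passage)
se = 0.2     # assumed standard error of the mean (illustrative)

def severity_mu_greater_than(mu_prime, xbar, se):
    """SEV(mu > mu_prime): probability the procedure would yield a
    smaller sample mean than the one observed, were mu equal to mu_prime."""
    return norm.cdf((xbar - mu_prime) / se)

# The 0.995 lower confidence limit for mu:
lower_995 = xbar - norm.ppf(0.995) * se   # about 0.085 with these values
# Were mu less than this limit, a sample mean as large as 0.6 would be
# very improbable, so the inference mu > lower_995 passes with high severity:
print(f"SEV(mu > {lower_995:.3f}) = {severity_mu_greater_than(lower_995, xbar, se):.4f}")
```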
Methodologists and researchers in psychology are now taking confidence intervals seriously. However, in the interests of adopting a sound frequentist conception of such intervals, they would be well advised to replace the new statistics conception with the superior error-statistical understanding.
3.3. Estimation and hypothesis tests
The new statisticians claim, controversially, that parameter estimation, rather than statistical hypothesis testing, leads to better science – presumably in part because of the deleterious effects of NHST. However, a strong preference for estimation leads Cumming (2012) to aver that the typical questions addressed in science are what questions (e.g., “What is the age of the earth?”, “What is the most likely sea-level rise by 2100?”). I think that this is a restricted, rather “flattened”, view of science where, by implication, explanatory why questions and how questions (which often ask for information about causal mechanisms) are considered atypical.
Why and how questions are just as important for science as what questions. They are often the sort of questions that science seeks to answer when constructing and evaluating explanatory hypotheses and theories. Interestingly, and at variance with this view, Cumming (Fidler and Cumming, 2014) acknowledges that estimation can be usefully combined with hypothesis testing in science, and that estimation can play a valuable role in theory construction. This is as it should be because science frequently incorporates parameter estimates in precise predictions that are used to assess the hypotheses and theories from which they are derived.
Although it predominantly uses the language of testing, the error-statistical perspective maintains that statistical inference can be employed to deal with both estimation and hypothesis testing problems. It also endorses the view that providing explanations of things is an important part of science and, in fact, advocates piecemeal testing of local hypotheses nested within large-scale explanatory theories.
Despite the generally favorable reception of the new statistics in psychology, it has been subject to criticism by both frequentists (e.g., Sakaluk, 2016), and Bayesians (e.g., Kruschke and Liddell, 2018). However, these criticisms have not occasioned a public response from the principal advocates of the new statistics movement. The error-statistical outlook presents a golden opportunity for those who advocate, or endorse, the new statistics to defend their position in the face of challenging criticism. A sound justification for the promotion and adoption of new statistics practices in psychology requires as much.
To be continued…. Please share comments and questions.
Excerpts and Mementos from SIST on this blog are compiled here.
I really like how Haig sets out the landscape as contrasting the “new statistics”, error statistics, and Bayesian schools. While I don’t use the term “new statistics”, I discuss Cumming’s CI campaign quite a lot in the book, explaining how a severity assessment solves key problems with the way CIs are typically used. I never saw why it would be deemed “new”. Not just because CIs are older than significance tests, or because CIs were developed by Neyman as inversions of tests, but because the way Cumming uses and interprets CIs does not reflect the current, right-headed move to avoid “arbitrary” levels like .95. A more current use of CIs is found in Confidence Distributions (CDs), where several confidence levels are used. (It too has an old history, but is being developed now in new ways.) But even CDs, like tests, require an “inferential” construal, and providing one is another way that the severity construal yields a more contemporary formulation of CIs.
On pp. 244–5 of SIST, I remark:
“Cumming’s interpretation of CIs and confidence levels points to their performance-oriented construal: “…We can say we’re 95% confident our one-sided interval includes the true value . . . meaning that for 5% of replications the [lower limit] will exceed the true value” (Cumming 2012, p. 112). What does it mean to be 95% confident in the particular interval estimate for Cumming? “It means that the values in the interval are plausible as true values for μ, and that values outside the interval are relatively implausible – though not impossible” (ibid., p. 79). The performance properties of the method rub off in a plausibility assessment of some sort.
The test that’s dual to the CI would “accept” those parameter values within the corresponding interval, and reject those outside, all at a single predesignated confidence level 1 − α. Our main objection to this is it gives the misleading idea that there’s evidence for each value in the interval, whereas, in fact, the interval simply consists of values that aren’t rejectable, were one testing at the α level. Not being a rejectable value isn’t the same as having evidence for that value. Some values are close to being rejectable, and we should convey this. Standard CIs do not.” (SIST 244–5)
One of the great things about blogging several ideas in the book as I was developing them is that readers’ comments, at times, brought out the need for a qualification. For example, when I suggested in a post what Haig says, that the CI treats all values in the CI on par, one of my frequentist likelihoodist commentators (M. Lew) noted that the points are differentiated by their likelihood in Cumming’s “cat-eye” diagram. But the likelihoodist logic differs from that of tests. Here’s a screen shot from p. 246 of SIST:
I’m pleased to see Mayo’s correction of Brian’s incomplete treatment of Cummings’ interpretation of intervals. (Note the careful use of apostrophes in that sentence: my English teacher would be pleased!)
Readers who want the full details can find them in a recent blog post by Cummings: https://thenewstatistics.com/itns/2020/03/18/the-shape-of-a-confidence-interval-cats-eye-or-plausibility-picture-and-what-about-cliff/. He has published a cross-sectional study of students’ responses to the cat’s eye plots: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5829532/
As a likelihoodist, I am very keen on the cat’s eye plot and the related “plausibility plot” as they represent a very practical use-case for the generally neglected likelihood function.
Michael:
It’s Geoff Cumming, not Cummings.
The trouble is, as I note in the section from SIST posted above, in pointing to a likelihood appraisal, Cumming hasn’t given a confidence interval appraisal that distinguishes the warrant for points within the interval. An appraisal of comparative likelihoods of the points in the interval is at odds with a CI appraisal, again, for the reasons I give. CI inferences, like error statistical tests, take the form of inequalities like μ > μ′ – not points. The way to distinguish the values in a CI, while keeping to (and not flouting) CI reasoning, is to see each value as corresponding to an inference of form μ > μ′ or μ < μ″, one for each value of μ in the interval. Each corresponds to an inference made with a different confidence (or severity) level. Rather than sticking just to confidence level .95, there are several inferences, each corresponding to a different confidence level. A few key benchmarks would do to avoid the adherence to .95.
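[As a concrete illustration of this reasoning, here is a hedged sketch under an assumed Normal model with known standard error; the observed mean and standard error are hypothetical values chosen only for illustration.]

```python
# Sketch: one inference of the form mu > mu' per confidence (severity)
# level, rather than a single 0.95 interval. All values are hypothetical.
from scipy.stats import norm

xbar, se = 0.6, 0.2  # assumed observed mean and standard error

for level in (0.50, 0.75, 0.90, 0.95, 0.975):
    # lower bound licensing the inference mu > mu_prime at this level
    mu_prime = xbar - norm.ppf(level) * se
    print(f"infer mu > {mu_prime:6.3f}  at confidence/severity level {level}")
```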
He, he, he. Sorry Geoff! I don’t know what went wrong in my head. Can I say brain-fart?
Michael: And wasn’t I generous in not teasing you by bringing your English teacher into it?
Deborah: Many thanks for your comments, which expand on my brief treatment of the limitations of the traditional performance-oriented outlook on confidence intervals adopted by Cumming. Reasonably enough, you raise the question: Why call confidence intervals (and effect sizes, and meta-analysis) “the new statistics”? After all, the newest of them, (modern) meta-analysis, is already more than 40 years old and well established.
Cumming is well aware that these are not new methods. In the preface to his book (Understanding the New Statistics, 2012) he acknowledges this, but goes on to say, “adopting them as the main way to analyze data would, for many students and researchers, be new.” (p. ix) In a short time, his well-known article on the new statistics (Psychological Science, 2014, 25, 7-29) has garnered more than 2000 citations (Google Scholar, 01/05/20).
“The new statistics” is an appealing label to use to promote one’s favored statistical methods, and Cumming has been rather successful in pressing for their adoption under this banner. He professes to be something of a Bayesian at heart (at least in his everyday thinking), and I believe that his decision to go with well-established frequentist methods in his reform efforts has undoubtedly contributed to their increased uptake in psychology. A recent paper by Giofrè et al. (PLOS ONE, April 17, 2017) found that, for the period 2013-2015, the use of confidence intervals in the journal Psychological Science (in which the new statistics have been explicitly encouraged) increased from 28% in 2013 to 70% in 2015, although the use of NHST continued to be very high. By comparison, for the Journal of Experimental Psychology: General, the increase in the use of confidence intervals was small, while the use of NHST remained extremely high.
Brian:
You wrote:
“He professes to be something of a Bayesian at heart (at least in his everyday thinking), and I believe that his decision to go with well-established frequentist methods in his reform efforts has undoubtedly contributed to their increased uptake in psychology.”
Does “their” refer to frequentist or Bayesian methods? I’m guessing you mean the latter, but I can’t tell. The CI Crusaders (an apt term introduced by S. Hurlbert) successfully popularize the false dilemma that the choice is between an abusive, cookbook version of tests (often labeled NHST) and their 95% CI approach. [See the following from SIST 6.7 Farewell Keepsake.] That is behind the latest 2019 move to ban the word “significance” and bar the use of P-value thresholds. (Readers can find a lot of discussion of that on this blog: search Wasserstein.) The truth is that the same person (Neyman) developed a non-abusive form of hypothesis tests as well as CIs, the latter being a dual of tests. Most situations involve testing whether the effect is genuine, and then estimating it (via CIs). But CIs inherit problems of N-P (Neyman-Pearson) tests, as already mentioned.
In relation to your remark, it is telling that those who advance ‘CIs only’ while trouncing tests (rather than seeing them as best combined) will misinterpret confidence levels as degrees of probability attached to the particular interval estimate. So they may sound Bayesian, but such a construal could only hold in special cases of diffuse priors.
From 6.7 Farewell Keepsake (SIST)
7. Inference Should Report Effect Sizes. Pre-data error probabilities and P-values do not report effect sizes or discrepancies – their major weakness. We avoid this criticism by interpreting statistically significant results, or “reject H0,” in terms of indications of a discrepancy γ from H0. In test T+: [Normal testing: H0: μ ≤ μ0 vs. H1: μ > μ0], reject H0 licenses inferences of the form: μ > [μ0 + γ]; non-reject H0, to inferences of the form: μ ≤ [μ0 + γ]. A report of discrepancies poorly warranted is also given (Section 3.1). The severity assessment takes account of the particular outcome x0 (Souvenir W). In some cases, a qualitative assessment suffices, for instance, that there’s no real effect. The desire for an effect size interpretation is behind a family feud among frequentists, urging that tests be replaced by confidence intervals (CIs). In fact there’s a duality between CIs and tests: the parameter values within the (1 − α) CI are those that are not rejectable by the corresponding test at level α (Section 3.7). Severity seamlessly connects tests and CIs. A core idea is arguing from the capabilities of methods to what may be inferred, much as we argue from the capabilities of a key to open a door to the shape of the key’s teeth. In statistical contexts, a method’s capabilities are represented by its probabilities of avoiding erroneous interpretations of data (Section 2.7).
The “CIs only” battlers have encouraged the use of CIs as supplements to tests, which is good; but there have been casualties. They often promulgate the perception that the only alternative to standard CIs is the abusive NHST animal, with cookbook, binary thinking. The most vociferous among critics in group (1) may well be waging a proxy war for replacing tests with CIs.
Viewing statistical inference as severe testing leads to improvements that the CI advocate should welcome (Sections 3.7, 4.3, and 5.5): (a) instead of a fixed confidence level, usually 95%, several levels are needed, as with confidence distributions (CDs). (b) We move away from the dichotomy of parameter values being inside or outside a CI estimate; points within a CI correspond to distinct claims, and get different severity assignments. (c) CIs receive an inferential rather than a performance “coverage probability” justification. (d) Fallacies and chestnuts of confidence intervals (vacuous intervals) are avoided.
Here’s all of the Farewell Keepsake: https://errorstatistics.files.wordpress.com/2020/05/6.7-farewell-keepsake.pdf
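[For readers who want to see the discrepancy interpretation of item 7 above in action, here is a minimal numerical sketch under an assumed Normal model with known standard error; the null value, observed mean, and the cutoffs behind the verbal labels are illustrative assumptions, not prescriptions from SIST.]

```python
# Sketch of the discrepancy reading of "reject H0" in test T+
# (H0: mu <= mu0 vs H1: mu > mu0); all numbers are illustrative.
from scipy.stats import norm

mu0, xbar, se = 0.0, 0.5, 0.2  # assumed null value, observed mean, SE

def severity(gamma):
    """SEV(mu > mu0 + gamma) given xbar: probability of a smaller
    sample mean, were mu exactly mu0 + gamma."""
    return norm.cdf((xbar - (mu0 + gamma)) / se)

for gamma in (0.0, 0.1, 0.2, 0.3, 0.4):
    sev = severity(gamma)
    # illustrative cutoffs for the verbal labels
    label = ("well warranted" if sev >= 0.9
             else "poorly warranted" if sev <= 0.5
             else "moderately warranted")
    print(f"mu > {mu0 + gamma:.1f}:  severity {sev:.3f}  ({label})")
```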
Deborah:
My intent was to say that the “increased uptake in psychology” referred to the frequentist methods that Cumming recommends, not Bayesian methods. I think that Cumming finds Bayesian estimation with credible intervals an appealing option. However, he doesn’t like Bayesian hypothesis testing for fear of the dichotomous thinking that he believes it engenders. My sense is that he currently favors frequentist methods, in part because they are more readily teachable.
Brian:
I really don’t think that stat practitioners in psych were encouraged to use frequentist methods because of the CI-only reformers. But since you are in psych, you may know more. I thought they viewed the frequentist machinery in general as the way to place psych on firm or firmer scientific footing, and were/are loath to accept the idea that psych inferences could only be justified by psychological assessments of degrees of belief or plausibility. (Aside: Cumming visited me here in Virginia once. He, Spanos, and I were to be working on something relating to improving tests via a severity assessment, but then we never heard from him again. It’s too bad that none of the work we had done together is cited in his books/papers as the way to solve the problems of testing and estimation.) Eight years ago, on this blog, I had a post: Anything Tests Can Do, CIs Do Better….
https://errorstatistics.com/2012/06/02/anything-tests-can-do-cis-do-better-cis-do-anything-better-than-tests/
It was part of a series of posts on tests and CIs. Cumming wrote some responses that readers can find on the blog.
Deborah:
I didn’t say that the “CI-only reformers” were responsible for psych researchers moving to frequentist methods. I said that the increased uptake of the methods that Cumming promotes was (in part) due to his influence. As you know, NHST has been omnipresent in psychology for decades. It’s the recent increase in use of CIs that is apparent.
Brian: I’m not sure why replies are out of order here. Of course I meant only that I didn’t see that they were encouraged to use frequentist methods by dint of the push to replace tests with CIs, not that they hadn’t already been using them. I crossed out “moved” in my comment, replacing it with “encouraged”. Unfortunately, because the CI reformers are driven to presenting wildly inaccurate and long-lampooned versions of statistical significance tests, the upshot is to discourage the use of error statistical methods. Most people can see the connection between tests and interval estimates. For example, the lower confidence bound is the parameter value that the observed estimate exceeds at the corresponding P-value; similarly, the upper confidence bound is the parameter value that the observed estimate falls below at the corresponding P-value. All this is shown more carefully in SIST. The point is that if P-value reasoning is declared all wrong, then so is CI reasoning. I noticed that Cumming’s follow-up book includes statistical significance tests, but it’s worse to present them in a most ungenerous fashion, warning people to stay away, than to leave them out altogether.
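[As a quick numerical illustration of that duality, here is a sketch under an assumed Normal model with known standard error; the estimate, standard error, and level are hypothetical.]

```python
# Numerical check of the test/CI duality described above (illustrative values).
from scipy.stats import norm

xbar, se, alpha = 0.6, 0.2, 0.05  # assumed estimate, standard error, level

lower = xbar - norm.ppf(1 - alpha) * se  # (1 - alpha) lower confidence bound
# One-sided P-value for testing H0: mu = lower against mu > lower:
p = 1 - norm.cdf((xbar - lower) / se)
print(f"lower bound {lower:.3f}; P-value at that bound {p:.3f} (= alpha)")
```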
I’m reminded of some parody posts I did on task forces on CIs and hypothesis testing in the social sciences, such as this one from a 2015 Sat Nite comedy: https://errorstatistics.com/2015/01/31/2015-saturday-night-brainstorming-and-task-forces-1st-draft/
Deborah:
Thanks for the clarification. I agree with you that the new-statisticians’ handling of significance testing/p values/NHST is unconvincing. There is a lot that could be said in this regard, but I’ll just say here that they do not seriously engage the best literature on the topic.