Brian Haig, Professor Emeritus
Department of Psychology
University of Canterbury
Christchurch, New Zealand
The American Statistical Association’s (ASA) recent effort to advise the statistical and scientific communities on how they should think about statistics in research is ambitious in scope. It is an initial attempt to depict what empirical research might look like in “a world beyond p < 0.05” (The American Statistician, 2019, 73, S1, 1-401). Quite surprisingly, the main recommendation of the lead editorial article in the Special Issue of The American Statistician devoted to this topic (Wasserstein, Schirm, & Lazar, 2019; hereafter, ASA II) is that “it is time to stop using the term ‘statistically significant’ entirely” (p. 2). ASA II acknowledges the controversial nature of this directive and anticipates that it will be subject to critical examination. Indeed, in a recent post, Deborah Mayo began her evaluation of ASA II by making constructive amendments to three recommendations that appear early in the document (‘Error Statistics Philosophy’, June 17, 2019). These amendments have received numerous endorsements, and I record mine here. In this short commentary, I briefly state a number of general reservations that I have about ASA II.
1. The proposal that we should stop using the expression “statistical significance” is given a weak justification.
ASA II proposes a superficial linguistic reform that is unlikely to overcome the widespread misconceptions about, and misuse of, significance testing. A more reasonable, common-sense strategy would be to diagnose the reasons for the misconceptions and misuse and take ameliorative action through the provision of better statistics education, much as ASA I did with p-values. Interestingly, ASA II references Mayo’s recent book, Statistical Inference as Severe Testing (2018), when mentioning the “statistics wars”. However, it refrains from considering the fact that her error-statistical perspective provides an informed justification for continuing to use tests of significance, along with the expression “statistically significant”. Further, ASA II reports cases where some of the Special Issue authors thought that use of a p-value threshold might be acceptable. However, it makes no effort to consider how these cases might challenge its main recommendation.
2. The claimed benefits of abandoning talk of statistical significance are hopeful conjectures.
ASA II makes a number of claims about the benefits that it thinks will follow from abandoning talk of statistical significance. It says, “researchers will see their results more easily replicated – and, even when not, they will better understand why”; “[We] will begin to see fewer false alarms [and] fewer overlooked discoveries …”; and, “As ‘statistical significance’ is used less, statistical thinking will be used more” (p. 1). I do not believe that any of these claims is likely to follow from retirement of the expression “statistical significance”. Unfortunately, no justification is provided for the plausibility of any of the alleged benefits. To take two of these claims: First, removal of the common expression “significance testing” will make little difference to the success rate of replications. It is well known that successful replications depend on a number of important factors, including research design, data quality, effect size, and study power, along with the multiple criteria often invoked in ascertaining replication success. Second, it is implausible to suggest that refraining from talk of statistical significance will appreciably help overcome mechanical decision-making in statistical practice and lead to a greater engagement with statistical thinking. Such an outcome will require, among other things, the implementation of science education reforms that centre on the conceptual foundations of statistical inference.
3. ASA II’s main recommendation is not a majority view.
ASA II bases its main recommendation to stop using the language of “statistical significance” in good part on its review of the articles in the Special Issue. However, an inspection of the Special Issue reveals that this recommendation is at variance with the views expressed in many of the 40-odd articles it contains. Those articles range widely in the topics they cover and in their attitudes toward the usefulness of tests of significance. By my reckoning, only two of the articles advocate banning talk of significance testing. To be fair, ASA II acknowledges the diversity of views held about the nature of tests of significance. However, I think that this diversity should have prompted it to take proper account of the fact that its recommendation is only one of a number of alternative views about significance testing. At the very least, ASA II should have tempered its strong recommendation not to speak of statistical significance any more.
4. The claim for continuity between ASA I and ASA II is misleading.
There is no evidence in ASA I (Wasserstein & Lazar, 2016) for the assertion made in ASA II that the earlier document stopped just short of recommending that claims of “statistical significance” be eliminated. In fact, ASA II marks a clear departure from ASA I, which was essentially concerned with how to better understand and use p-values. There is nothing in the earlier document to suggest that abandoning talk of statistical significance might be the next appropriate step forward in the ASA’s efforts to guide our statistical thinking.
5. Nothing is said about scientific method, and little is said about science.
The announcement of the ASA’s 2017 Symposium on Statistical Inference stated that the Symposium would “focus on specific approaches for advancing scientific methods in the 21st century”. However, the Symposium, and the resulting Special Issue of The American Statistician, showed little interest in matters to do with scientific method. This is regrettable because the myriad insights about scientific inquiry contained in contemporary scientific methodology have the potential to greatly enrich statistical science. The post-p < 0.05 world depicted by ASA II is not an informed scientific world. It is an important truism that statistical inference plays a major role in scientific reasoning. However, for this role to be properly conveyed, ASA II would have to employ an informative conception of the nature of scientific inquiry.
6. Scientists who speak of statistical significance do embrace uncertainty.
I think that it is uncharitable, indeed incorrect, of ASA II to depict many researchers who use the language of significance testing as being engaged in a quest for certainty. John Dewey, Charles Peirce, and Karl Popper taught us quite some time ago that we are fallible, error-prone creatures, and that we must embrace uncertainty. Further, despite their limitations, our science education efforts frequently instruct learners to think of uncertainty as an appropriate epistemic attitude to hold in science. This fact, combined with the oft-made claim that statistics employs ideas about probability in order to quantify uncertainty, requires from ASA II a factually based justification for its claim that many scientists who employ tests of significance do so in a quest for certainty.
Under the heading, “Has the American Statistical Association Gone Post-Modern?”, the legal scholar, Nathan Schachtman, recently stated:
The ASA may claim to be agnostic in the face of the contradictory recommendations, but there is one thing we know for sure: over-reaching litigants and their expert witnesses will exploit the real or apparent chaos in the ASA’s approach. The lack of coherent, consistent guidance will launch a thousand litigation ships, with no epistemic compass. (‘Schachtman’s Law’, March 24, 2019)
I suggest that, with appropriate adjustment, the same can fairly be said about researchers and statisticians who might look to ASA II as an informative guide to a better understanding of tests of significance and the many misconceptions about them that need to be corrected.
Haig, B. D. (2019). Stats: Don’t retire significance testing. Nature, 569, 487.
Mayo, D. G. (2019). The 2019 ASA Guide to P-values and Statistical Significance: Don’t Say What You Don’t Mean (Some Recommendations)(ii). Blog post on Error Statistics Philosophy, June 17, 2019.
Mayo, D. G. (2018). Statistical inference as severe testing: How to get beyond the statistics wars. New York, NY: Cambridge University Press.
Wasserstein, R. L., & Lazar, N. A. (2016). The ASA’s statement on p-values: Context, process, and purpose. The American Statistician, 70, 129-133.
Wasserstein, R. L., Schirm, A. L., & Lazar, N. A. (2019). Editorial: Moving to a world beyond “p < 0.05”. The American Statistician, 73, S1, 1-19.
I thank Brian for taking me up on my offer in my recent post:
“A broad, open, critical debate is sorely needed. Still, we can only debate something when there is a degree of clarity as to what “it” is. I will be very happy to post reader’s meanderings on ASA II (~1000 words) if you send them to me.”
Please send me your views for potential guest posting.
The question of just what ASA II is stating, however, remains a major source of unclarity. I assumed several of the statements were slips, written in the hasty exuberance of taking such a strong line. Now I am much less sure, since there has, to my knowledge, been no move to modify some of the statements, at least so as to cohere with ASA I, with which ASA II now conflicts:
♦ Don’t conclude anything about scientific or practical importance based on statistical significance (or lack thereof). (p. 1)
♦ No p-value can reveal the plausibility, presence, truth, or importance of an association or effect. (p. 2)
♦ Whether a p-value passes any arbitrary threshold should not be considered at all when deciding which results to present or highlight. (p. 2, my emphasis)
It is not just that these are at odds with Principle 1 of ASA I (as discussed in my post); they would also be at odds with other accounts to which the ASA appears to give its blessing. That is, outcomes attaining a given small p-value threshold map one-to-one onto outcomes licensing the corresponding inferences based on confidence intervals, Bayes factors, posteriors, etc. (even if it’s only when we can take the assumptions as holding). You can’t say the very same requirement for an inference holds using one word but not another.
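The one-to-one mapping can be made concrete with a minimal sketch of my own (not from ASA II), assuming the simplest case of a two-sided z-test of H0: μ = 0 with known standard error: the test rejects at α = 0.05 exactly when the 95% confidence interval excludes zero.

```python
# Sketch of the duality between a two-sided z-test at alpha = 0.05
# and the corresponding 95% confidence interval (H0: mu = 0).
Z_CRIT = 1.96  # two-sided 5% cutoff for a standard-normal statistic

def z_test_and_ci(xbar, se):
    """Return (test rejects H0 at 0.05, 95% CI excludes 0)."""
    rejects = abs(xbar / se) > Z_CRIT
    lo, hi = xbar - Z_CRIT * se, xbar + Z_CRIT * se
    excludes_zero = lo > 0 or hi < 0
    return rejects, excludes_zero

# The two verdicts agree for any estimate and standard error:
for xbar in (-0.5, -0.1, 0.0, 0.1, 0.5, 2.0):
    assert z_test_and_ci(xbar, 0.2)[0] == z_test_and_ci(xbar, 0.2)[1]
```

Reporting “the 95% CI excludes the null” rather than “statistically significant at the 0.05 level” changes the word, not the inference.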
If science should be humble and self-critical, as ASA II rightly suggests, then meta-science must be as well. The ASA should show itself to be an exemplar of willingness to find flaws in its standpoint, especially where that standpoint appears to be at odds with best practices in controlled studies. In responding to Amrhein et al. (2019), Cook et al. (2019) write (in the journal Clinical Trials):
“The carefully designed clinical trial based on a traditional statistical testing framework has served as the benchmark for many decades. It enjoys broad support in both the academic and policy communities. There is no competing paradigm that has to date achieved such broad support. The proposals for abandoning p-values altogether often suggest adopting the exclusive use of Bayesian methods. For these proposals to be convincing, it is essential their presumed superior attributes be demonstrated without sacrificing the clear merits of the traditional framework. Many of us have dabbled with Bayesian approaches and find them to be useful for certain aspects of clinical trial design and analysis, but still tend to default to the conventional approach notwithstanding its limitations. While attractive in principle, the reality of regularly using Bayesian approaches on important clinical trials has been substantially less appealing – hence their lack of widespread uptake.”
[PDF: cook-there-is-still-a-place-for-significance-testing-in-clinical-trials-2019-4.pdf]
Haig cites lawyer Schachtman’s concern, and it is one I entirely share; namely, that ASA II will be a ready resource to free researchers from culpability for spinning their interpretations of results that can readily be produced by (or in other cases, scarcely produced by) chance variability. http://schachtmanlaw.com/has-the-american-statistical-association-gone-post-modern/
People will rightly refuse to take part in clinical trials were it to become clear that new rules allow so much latitude for spinning unwelcome results as purportedly showing evidence of the presence, or of the absence, of a genuine effect of concern. Putting a high/low degree of belief on the claim will not somehow make it more fair-minded.
What is to happen to testing altogether if even minimal thresholds are abandoned?
In this connection, it is very important to have a clarification of whether Principle 4 from ASA I still holds in ASA II. This was the principle requiring the reporting of things like data-dependent endpoints, stopping rules, and other biasing selection effects. As Cook et al. observe:
“By removing the prespecified significance level, typically 5%, interpretation could become completely arbitrary. It will also not stop data-dredging, selective reporting, or the numerous other ways in which data analytic strategies can result in grossly misleading conclusions.”
P-values are directly invalidated by these gambits; it is not clear that alternative methods are.
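How such gambits invalidate p-values can be seen in a minimal simulation sketch of my own (an illustration, not anything from ASA II): when a researcher runs several independent null tests but reports only the smallest p-value, the nominal 5% error rate is grossly inflated, whereas a single honest test keeps it near 5%.

```python
import math
import random

def p_value_z(z):
    """Two-sided p-value for a standard-normal test statistic."""
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

def error_rate(n_looks, n_sims=20000, seed=1):
    """Chance of reporting p < 0.05 under the null hypothesis, when
    only the smallest of n_looks independent p-values is reported."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        p_min = min(p_value_z(rng.gauss(0, 1)) for _ in range(n_looks))
        hits += p_min < 0.05
    return hits / n_sims

# One honest test: error rate near 0.05.
# Reporting the best of 10: roughly 1 - 0.95**10, i.e. about 0.40.
```

Whatever statistic is reported in place of a p-value, it is Principle 4’s demand to disclose such selection effects that exposes the inflation.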
The bottom line is that no new formal method can turn troubled scientific practices into sound science. What is required for scientific inquiry with integrity is no big secret. By giving so much weight to the position that the problems with today’s industrialized science are the fault of a very artificial version of error statistical tests, there is a real and present danger that we lose the ability to hold accountable the formal “experts” who have so much control over our lives.
At the risk of being repetitive, let me share some thoughts on the above comments by Haig and the follow-ups by Mayo. Two comments come to mind:
1. Inbreeding – the discussions regarding ASA I and ASA II do not seem to involve the “customers”. Specifically, I am thinking of people in industry and in health care research and delivery. The gap between theory and practice seems to have been widening. Moreover, these discussions assume that these customers listen and wait attentively for the statistics community to pass a verdict on how statistical analysis should be done to ensure reproducibility, repeatability, etc. The 2017 symposium Haig referred to had very few practitioners, both as speakers and as attendees. Most of the discussions in the sessions were inbred exchanges, as if we (statisticians) live on an isolated planet. This ignores the many professional disciplines now active in the data analytics playground.
2. On several occasions I have mentioned the importance of the generalisation of findings and of how findings are represented. This seems to me a crucial aspect that is apparently not getting attention. A proposal for this is made in https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3035070. I will be glad to engage in discussions on it.
The only reason that I joined 840 others in signing the letter which advocated dropping the term “statistically significant” as a synonym for p < 0.05 is that dichotomisation makes no sense. It is obviously silly to treat p = 0.049 as meaning something different from p = 0.051. The letter did not say that p values should be dropped. It didn't even touch much on the deficiencies of p values as evidence.
I don't see how anyone can defend treating p = 0.05 as a threshold. The fact that so many users do exactly that has hindered the progress of science.