Everything is impeach and remove these days! Should that hold also for the concept of statistical significance and P-value thresholds? There’s an active campaign that says yes, but I aver it is doing more harm than good. In my last post, I said I would count the ways it is detrimental until I became “too disconsolate to continue”. There I showed why the new movement, launched by Executive Director of the ASA (American Statistical Association), Ronald Wasserstein (in what I dub ASA II(note)), is self-defeating: it instantiates and encourages the human-all-too-human tendency to exploit researcher flexibility, rewards, and openings for bias in research (F, R & B Hypothesis). That was reason #1. Just reviewing it already fills me with such dismay that I fear I will become too disconsolate to continue before even getting to reason #2. So let me just quickly jot down reasons #2, 3, 4, and 5 (without full arguments) before I expire.
[I thought that with my book Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (2018, CUP), I had said pretty much all I cared to say on this topic (and by and large, this is true), but almost as soon as it appeared in print just around a year ago, things got very strange.]
But wait. Someone might object that I’m the one doing more harm than good by linking the ASA (The American Statistical Association) to Wasserstein’s campaign to get publishers, journalists, authors and the general public to buy into the recommendations of ASA II(note). “Shhhhh!” some counsel, “don’t give it more attention; we want people to look away”. Nothing to see? I don’t think so. I will discuss this point in this post in PART II, as soon as I sketch my list of reasons #2-5.
Before starting, let me remind readers that what I abbreviate as ASA II(note) only refers to those portions of the 2019 editorial by Wasserstein, Schirm, and Lazar that allude to their general recommendations, not their summaries of contributed papers in the issue of TAS.
2 Decriminalize theft to end robbery. The key arguments for impeaching and removing statistical significance levels and P-value thresholds commit fallacies of the “cut off your nose to spite your face” variety. For example, we should ban P-value thresholds because they cause biased selection and data dredging. Discard P-value thresholds and P-hacking disappears! Or so it is argued. Even were this true, it would be like arguing we should decriminalize robbery since then the crime of robbery would disappear (an ends-justify-the-means fallacy). But it is also not true that biased reporting goes away if you have no thresholds. Faced with unwelcome nonsignificant results, eager researchers are still led to massage, spin, and data dredge–only now it is much harder to directly hold them accountable. For the argument, see my “P-value Thresholds: Forfeit at your Peril” (2019).
3 Straw men and women fallacies. ASA I and II(note) do more harm than good by presenting oversimple caricatures of the tests. Even ASA I excludes a consideration of alternatives, error probabilities and power. At the same time, it will contrast these threadbare “nil null” hypothesis tests with confidence intervals (CIs)–never minding that the latter employ alternatives. No wonder CIs look better, but such a test is unfair. (Neyman developed confidence intervals as inversions of tests at the same time he was developing hypothesis tests with alternatives in 1930. Using only significance tests, you could recover the lower (and upper) 1-α CI bounds if you wanted, by asking for the hypotheses that the data are statistically significantly greater (smaller) than, at level α, using the usual 2-sided computation.)
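To make the inversion concrete, here is a minimal sketch in Python. It assumes the simplest textbook setting, which is my illustration and not anything in ASA I or II: a Normal mean with known standard deviation, tested with a two-sided z-test. The function names and the bisection search are hypothetical helpers. The interval is recovered purely by asking which hypothesized means the data do not reject at level α, and it agrees with the usual closed-form 1-α interval.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard Normal, mean 0 and sd 1


def two_sided_p(xbar, mu0, sigma, n):
    """Two-sided p-value of a z-test of H0: mu = mu0 (sigma assumed known)."""
    z = (xbar - mu0) / (sigma / n ** 0.5)
    return 2 * (1 - STD_NORMAL.cdf(abs(z)))


def ci_by_inversion(xbar, sigma, n, alpha=0.05, tol=1e-10):
    """Collect the hypothesized means that the test does NOT reject at level alpha."""
    se = sigma / n ** 0.5

    def boundary(inside, outside):
        # Bisect between a value that is not rejected and one that is.
        while abs(outside - inside) > tol:
            mid = (inside + outside) / 2
            if two_sided_p(xbar, mid, sigma, n) >= alpha:
                inside = mid
            else:
                outside = mid
        return inside

    # mu0 = xbar is never rejected (p = 1); a value 10 standard errors away always is.
    lower = boundary(xbar, xbar - 10 * se)
    upper = boundary(xbar, xbar + 10 * se)
    return lower, upper


def ci_closed_form(xbar, sigma, n, alpha=0.05):
    """The usual 1 - alpha interval: xbar plus or minus z_(alpha/2) standard errors."""
    z = STD_NORMAL.inv_cdf(1 - alpha / 2)
    se = sigma / n ** 0.5
    return xbar - z * se, xbar + z * se
```

Calling `ci_by_inversion(10.0, 2.0, 100)` and `ci_closed_form(10.0, 2.0, 100)` should give the same two bounds up to the search tolerance, which is the sense in which the CI adds no information beyond what the family of tests already contains.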
In ASA II(note), we learn that “no p-value can reveal the …presence…of an association or effect” (at odds with principle 1 of ASA I). That could be true only in the sense that no formal statistical quantity alone could reveal the presence of an association. But in a realistic setting, small p-values surely do reveal the presence of effects. Yes, there are assumptions, but significance tests are prime tools to probe them. We hear of “the seductive certainty falsely promised by statistical significance”, and are told that “a declaration of statistical significance is the antithesis of thoughtfulness”. (How an account that never issues an inference without an associated error probability can be promising certainty is unexplained. On the second allegation, ignoring how thresholds are rendered meaningful by choosing them to reflect background information and a host of theoretical and epistemic considerations, is all more straw.) The requirement in philosophy of a reasonably generous interpretation of what you’re criticizing isn’t a call for being kind or gentle; it’s that otherwise your criticism is guilty of straw men (and women) fallacies, and thus fails.
4 Alternatives to significance testing are given a pass. You will not find any appraisal of the alternative methods recommended to replace significance tests for their intended tasks. Although many of the “alternative measures of evidence” listed in ASA I and II(note) (likelihood ratios, Bayes factors [subjective, default, empirical], posterior predictive values [in diagnostic screening]) have been critically evaluated by leading statisticians, no word of criticism is heard here. Here’s an exercise: run down the list of 6 “principles” of ASA I, applying them to any of the alternative measures of evidence on offer. Take, for example, Bayes factors. I claim that they do worse than do significance tests, even without modifications.
5 Assumes probabilism. Any fair (non question-begging) comparison of statistical methods should recognize different roles probability may play in inference. The role of probability in inference by way of statistical falsification is quite different from using probability to quantify degrees of confirmation, support, plausibility or belief in a statistical hypothesis or model–or comparative measures of these. I abbreviate the former as error statistical methods, the latter, as variants on probabilism. Use whatever terms you like. Statistical significance tests are part of a panoply of methods where probability arises to assess and control misleading interpretations of data.
Error probabilities quantify the capabilities of a method to detect the ways a claim (hypothesis, model or other) may be false, or specifiably flawed. The basic principle of testing is minimalist: there is evidence for a claim only to the extent it has been subjected to, and passes, a test that had at least a reasonable probability of having discerned how the claim may be false. (For a more detailed exposition, see Mayo 2018, or excerpts from SIST on this blog).
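As a rough illustration of what it means for an error probability to quantify a method’s capability, here is a small sketch in Python. It is not the severity assessment developed in SIST; it only computes two familiar error probabilities for a one-sided z-test under assumptions I am supplying for the example (Normal data, known standard deviation, and a hypothetical function name).

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard Normal, mean 0 and sd 1


def rejection_probability(delta, sigma, n, alpha=0.05):
    """Probability that a one-sided z-test of H0: mu <= mu0 rejects at level
    alpha when the true mean is mu0 + delta (sigma assumed known)."""
    z_alpha = STD_NORMAL.inv_cdf(1 - alpha)  # cutoff for the test statistic
    shift = delta * n ** 0.5 / sigma         # discrepancy in standard-error units
    return 1 - STD_NORMAL.cdf(z_alpha - shift)
```

With `delta = 0` the function returns the type I error probability, which is `alpha` by construction. With `delta > 0` it returns the power against that discrepancy: a test with low power against a discrepancy of interest had little capability of discerning it, so failing to reject is poor evidence that the discrepancy is absent.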
Reason #5, then, is that “measures of evidence” in both ASA I and II(note) beg this key question (about the role of probability in statistical inference) in favor of probabilisms–usually comparative as with Bayes factors. If the recommendation in ASA II(note) to remove statistical thresholds is taken seriously, there are no tests and no statistical falsification. Recall what Ioannidis said in objecting to “don’t say significance”, cited in my last post:
Potential for falsification is a prerequisite for science. Fields that obstinately resist refutation can hide behind the abolition of statistical significance but risk becoming self-ostracized from the remit of science. (Ioannidis 2019)
“Self-ostracizing” is a great term. ASA should ostracize self-ostracizing. This takes me back to the question I promised to come back to: is it a mistake to see the ASA as entangled in the campaign to ban use of the “S-word”, and kill P-value thresholds?
Those who say it is a mistake point to the fact that what I’m abbreviating as ASA II(note) did not result from the kind of process that led to ASA I, with extended meetings of statisticians followed by a Board vote. I don’t think that suffices. Granted, the “P-value Project” (as it is called at ASA) is only a small part of the ASA, led by Executive Director Wasserstein. Nevertheless, as indicated on the ASA website, “As executive director, Wasserstein also is an official ASA spokesperson.” In his active campaign to get journals, societies, practitioners and the general public to accept the recommendations in ASA II(note), he wears his executive director hat, does he not?
As soon as I saw the 2019 document, I queried Wasserstein as to the relationship between ASA I and II. It was never clarified. I hope now that it will be, but it will not suffice to note that it never came to a Board vote. The campaign to editors to revise their guidelines for authors, taking account of both ASA I and II(note), should also be addressed. Keeping things blurred gives plausible deniability, but at the cost of increasing confusion and an “anything goes” attitude.
ASA II(note) clearly presents itself as a continuation of ASA I (again, ASA II(note) refers just to the portion of the editorial encompassing the general recommendation: don’t say significance or significant, oust P-value thresholds). It begins with a review of 4 of the 6 principles from ASA I, even though they are stated in more extreme terms than in ASA I. (As I point out in my blog, the result is to give us principles that are in tension with the original 6.) Next, it goes on to say:
The ASA Statement on P-Values and Statistical Significance started moving us toward this world…. The ASA Statement on P-Values and Statistical Significance stopped just short of recommending that declarations of “statistical significance” be abandoned. We take that step here. … it is time to stop using the term “statistically significant” entirely. Nor should variants such as ‘significantly different,’ ‘p < 0.05,’ and ‘nonsignificant’ survive…
Undoubtedly, there are signs in ASA I that they were on the verge of this step, notably, the last section: “In view of the prevalent misuses of and misconceptions concerning p-values, some statisticians prefer to supplement or even replace p-values with other approaches. .. likelihood ratios or Bayes factors”. (p. 132).
A letter to the editor on ASA I was quite prescient. It was written by Ionides, Giessing, Ritov and Page:
Mixed with the sensible advice on how to use p-values comes a message that is being interpreted across academia, the business world, and policy communities, as, “Avoid p-values. They don’t tell you what you want to know.” …The ASA’s statement, while warning statistical practitioners against these abuses, simultaneously warns practitioners away from legitimate use of the frequentist approach to statistical inference.
What do you think? Please share your comments on this blogpost.
 “To keep the statement reasonably simple, we did not address alternative hypotheses, error types, or power (among other things)” (ASA I)
 The ASA 2016 Guide’s Six Principles
- P-values can indicate how incompatible the data are with a specified statistical model.
- P-values do not measure the probability that the studied hypothesis is true, or the probability that the data were produced by random chance alone.
- Scientific conclusions and business or policy decisions should not be based only on whether a p-value passes a specific threshold.
- Proper inference requires full reporting and transparency. P-values and related analyses should not be reported selectively. Conducting multiple analyses of the data and reporting only those with certain p-values (typically those passing a significance threshold) renders the reported p-values essentially uninterpretable.
- A p-value, or statistical significance, does not measure the size of an effect or the importance of a result.
- By itself, a p-value does not provide a good measure of evidence regarding a model or hypothesis.
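The fourth principle (selective reporting) lends itself to a quick arithmetic check. Here is a minimal sketch in Python, under assumptions I am adding for illustration: every null hypothesis is true, the analyses are independent, and each p-value is therefore uniform on (0, 1). The function names are hypothetical.

```python
import random


def prob_at_least_one_significant(k, alpha=0.05):
    """Chance that at least one of k independent true-null tests has p < alpha."""
    return 1 - (1 - alpha) ** k


def simulate_smallest_p(k, alpha=0.05, trials=20000, seed=0):
    """Monte Carlo version: draw k uniform p-values per trial, keep only the
    smallest, and record how often that selected p-value falls below alpha."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        smallest = min(rng.random() for _ in range(k))
        if smallest < alpha:
            hits += 1
    return hits / trials
```

With `k = 20` and `alpha = 0.05`, the closed form gives roughly 0.64, and the simulation should land close to it. So a researcher who runs twenty analyses and reports only the most impressive one will reach “p < 0.05” about two times in three with no real effects at all, which is why the reported p-value no longer means what it says.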
 I am grateful to Ron Wasserstein for inviting me to be a “philosophical observer” of this historical project (I attended just one day).
Blog posts on ASA II(note):
- June 17, 2019: “The 2019 ASA Guide to P-values and Statistical Significance: Don’t Say What You Don’t Mean” (Some Recommendations)(ii)
- July 12, 2019: B. Haig: The ASA’s 2019 update on P-values and significance (ASA II(note)) (Guest Post)
- July 19, 2019: The NEJM Issues New Guidelines on Statistical Reporting: Is the ASA P-Value Project Backfiring? (i)
- September 19, 2019: (Excerpts from) ‘P-Value Thresholds: Forfeit at Your Peril’ (free access). The article by Hardwicke and Ioannidis (2019), and the editorials by Gelman and by me are linked on this post.
- November 4, 2019: On Some Self-Defeating Aspects of the ASA’s (2019) Recommendations on Statistical Significance Tests
On ASA I:
- Link to my published comment on ASA I, “Don’t Throw Out the Error Control Baby With the Bad Statistics Bathwater” is here.
Ioannidis J. (2019). The importance of predefined rules and prespecified statistical analyses: do not abandon significance. JAMA 321:2067‐2068. (pdf)
Ionides, E., Giessing, A., Ritov, Y., and Page, S. (2017). Response to the ASA’s Statement on p-Values: Context, Process, and Purpose, The American Statistician, 71:1, 88-89. (pdf)
Mayo, D. G. (2018). Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (SIST). Cambridge University Press.
Mayo, D. G. (2019), P‐value thresholds: Forfeit at your peril. Eur J Clin Invest, 49: e13170. (pdf) doi:10.1111/eci.13170
Wasserstein, R. & Lazar, N. (2016), “The ASA’s Statement on p-Values: Context, Process, and Purpose”, The American Statistician 70(2).
Wasserstein, R., Schirm, A. and Lazar, N. (2019) “Moving to a World Beyond ‘p < 0.05’”, The American Statistician 73(S1): 1-19: Editorial. (ASA II(note))(pdf)
I understand the sentiment expressed by some people who have emailed me, that if people actually think the ASA supports the P-value campaign recommended by the Wasserstein et al. 2019 editorial, then they will give it stronger weight than it merits, and that, they feel, is a bad thing. If they think it’s a bad thing, they should make their views known and not just hope it will go away, when in fact there’s active campaigning going on.
Your efforts to point out that ASA II is a step backward are much needed and I hope you will persevere. Eventually practices more in line with your thinking will be recognized as optimal, but it appears to be a slow process.
One major factor that appears to me to be slowing statistical progress is the failure to distinguish between exploratory and confirmatory research. For example, in ASA II Wasserstein et al. note toward the end of their discussion (pages 9-10) that confirmatory research and regulated medical research currently have important, established roles for p-value thresholds that would need to be replaced with a new paradigm that remains to be developed. Their paper would have been more balanced if they had made this point conspicuously near the beginning rather than as an afterthought at the end. If their recommendations had been limited to the exploratory stage of research, I would have considered them a useful step forward rather than a major step backward.
The paper by Tong ( https://doi.org/10.1080/00031305.2018.1518264 ) in the American Statistician issue with ASA II pointed out the fundamental role of distinguishing between exploratory and confirmatory research. Tong argued that the current methodological crisis will not be resolved until this distinction becomes fully integrated into statistical thinking. Formal hypothesis tests are appropriate for confirmatory research, whereas exploratory research should focus more on estimation. Having worked in academic research and in regulated medical research, I agree with these points.
Unfortunately, formal confirmatory research has been rare outside of regulated clinical trials. Thus, most academic researchers and academic statisticians do not have experience with this type of confirmatory research. Rather, academic researchers, particularly psychologists, have often treated a second exploratory study as a confirmatory study without recognizing the major methodological differences between the two stages of research.
Bem’s ESP studies are a good example. Based on the stage of research, sample sizes, and descriptions, Bem’s original studies would be classified as Phase II within the framework of regulated clinical trials. Phase II studies are exploratory studies used to develop information for designing confirmatory (Phase III) studies that provide strong evidence. However, within psychology at that time, Bem’s studies had good methodology suitable for publication in a high end journal and were not presented as exploratory. This situation forced psychologists to reevaluate their common methodological practices.
Reality sets in when more formal confirmatory studies are done. Schlitz, Delorme, and Bem conducted a large (512 subjects) multicenter confirmatory study that was preregistered ( http://www.koestler-parapsychology.psy.ed.ac.uk/Documents/KPU_Registry_1007.pdf ). They also conducted a second preregistered multicenter study with 640 subjects ( http://www.koestler-parapsychology.psy.ed.ac.uk/Documents/KPU_registry_1016.pdf ). According to the registration for a third study ( http://www.koestler-parapsychology.psy.ed.ac.uk/Documents/KPU_Registry_1051.pdf ), the planned analyses for both of the above studies did not support the ESP hypothesis.
To my knowledge the most carefully designed study in the history of parapsychology is the ongoing Transparent Psi Project ( https://osf.io/d7sva/ ). This is the only study I have seen in parapsychology (or psychology) that has the degree of planning that was common in my experience in regulated medical research, including thorough evaluation of the operating characteristics of the planned analysis, formal software validation, measures to prevent fraud, and a research audit. This may be a useful point of reference for planning confirmatory research for controversial or high-stakes topics. The primary analysis is Bayesian, but the evaluation of the operating characteristics integrates Bayesian and error statistical methods. That type of integration is also expected for Bayesian analysis in regulated medical research in the U.S. ( http://www.fda.gov/downloads/MedicalDevices/DeviceRegulationandGuidance/GuidanceDocuments/ucm071121.pdf ), and that is the best-developed discussion of Bayesian methods for confirmatory research that I have found.