**Philip B. Stark**

Professor

Department of Statistics

University of California, Berkeley

I enjoyed Prof. Mayo’s comment in *Conservation Biology* (Mayo, 2021) very much, and agree enthusiastically with most of it. Here are my key takeaways and reflections.

- Error probabilities (or error rates) are essential to consider. If you don’t give thought to what the data would be like if your theory is false, you are not doing science.
- Some applications really require a decision to be made. Does the drug go to market or not? Are the girders for the bridge strong enough, or not? Hence, banning “bright lines” is silly.
- Conversely, no threshold for significance, no matter how small, suffices to prove an empirical claim. In replication lies truth.
- Abandoning P-values exacerbates moral hazard for journal editors, although there has always been moral hazard in the gatekeeping function. Absent any objective assessment of evidence, publication decisions are even more subject to cronyism, “taste”, confirmation bias, etc.
- Throwing away P-values because many practitioners don’t know how to use them is perverse. It’s like banning scalpels because most people don’t know how to perform surgery. People who wish to perform surgery should be trained in the proper use of scalpels, and those who wish to use statistics should be trained in the proper use of P-values. Throwing out P-values is self-serving to statistical instruction, too: we’re making our lives easier by teaching *less* instead of teaching *better.*

In my opinion, the main problems with P-values are: faulty interpretation, even of genuine P-values; use of nominal P-values that are not genuine P-values; and perhaps most importantly, testing statistical hypotheses that have no connection to the scientific hypotheses.

A P-value is the observed value of any statistic whose probability distribution is dominated by the uniform distribution when the null is true. That is, a P-value is any measurable function *T* of the data that doesn’t depend on any unknown parameters and for which, if the null hypothesis is true, Pr {*T* ≤ *p*} ≤ *p*. Reported P-values often do not have that **defining** property. One reason is that calculating *T* may involve many steps, including data selection, model selection, test selection, and selective reporting, but practitioners ignore all but the final step in making the probability calculation. That is, in reality, *T* is generally the composition of many functions $T_n \circ T_{n-1} \circ \cdots \circ T_2 \circ T_1(\cdot)$, but often only the final step $T_n(\cdot)$ is considered in calculating the nominal P-value. If what is done to the data involves selection, conditioning, cherry-picking, multiple testing, stopping when things look good, or similar things, they all need to be accounted for, or the result will not be a genuine P-value.
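A small simulation can make the gap between a nominal and a genuine P-value concrete. The sketch below is only an illustration under assumed conditions that are not part of the commentary: `k = 10` independent two-sided z-tests, all with a true null, where the analyst reports only the smallest P-value. The function names and the choice of a Bonferroni adjustment are likewise assumptions made for the example; Bonferroni is one simple way to account for the selection step, not the only one.

```python
import math
import random


def two_sided_p(z):
    """Genuine two-sided P-value for one pre-specified standard normal statistic."""
    return math.erfc(abs(z) / math.sqrt(2.0))


def rejection_rates(k=10, reps=20000, alpha=0.05, seed=0):
    """Estimate how often each reported value is <= alpha when every null is true."""
    rng = random.Random(seed)
    nominal = 0
    adjusted = 0
    for _ in range(reps):
        # k independent test statistics, every one drawn under the null.
        p_values = [two_sided_p(rng.gauss(0.0, 1.0)) for _ in range(k)]
        p_min = min(p_values)         # selection step: report only the best-looking test
        p_bonf = min(1.0, k * p_min)  # Bonferroni adjustment accounts for that selection
        nominal += p_min <= alpha
        adjusted += p_bonf <= alpha
    return nominal / reps, adjusted / reps


nominal_rate, adjusted_rate = rejection_rates()
print(f"Pr{{nominal P <= 0.05}}  ~ {nominal_rate:.3f}")
print(f"Pr{{adjusted P <= 0.05}} ~ {adjusted_rate:.3f}")
```

With ten independent tests, the chance that the smallest nominal P-value is at most 0.05 is 1 − 0.95^10 ≈ 0.40 in theory, so the cherry-picked value fails the defining property Pr {*T* ≤ *p*} ≤ *p*. The adjusted value satisfies it because its calculation includes the selection step.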

In my experience, perhaps the most pernicious error in the use of P-values in applications is a Type III error: answering the wrong question by testing a statistical null hypothesis that has nothing to do with the scientific hypothesis, aside from having some words in common. A statistical null hypothesis needs to capture the science, or testing it sheds no light on the matter. For example, consider a randomized controlled trial with a binary treatment and a binary outcome. The *scientific* null is that the remedy does not improve clinical outcomes (either subject by subject, or on average across the subjects in the trial). A typical *statistical* null is that the responses to treatment and placebo are all IID $N(\mu, \sigma^2)$. The scientific null does not involve independence, normal distributions, or equality of variances. A genuine P-value for the statistical null does not say much about the scientific null. Here is a more nuanced example: do academic audiences interrupt female speakers more often than they interrupt male speakers? A typical statistical hypothesis might involve positing a model for interruptions, say a zero-inflated negative binomial regression model with coefficients for gender, speaker’s years since PhD, and other covariates. The statistical hypothesis might be that the coefficient of gender in that model is zero. Even if one computes a genuine P-value for that statistical hypothesis, what does it have to do with the original scientific question?
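For the randomized trial example, one way to keep the statistical null tied to the scientific null is a randomization test, which uses only the random assignment actually performed and makes no IID or normality assumption. The sketch below is an illustration, not a method taken from the commentary. It tests the subject-by-subject ("strong") null that treatment changes no one's outcome, it assumes simple random assignment of a fixed number of subjects to treatment, and the counts (14 of 20 treated and 8 of 20 controls improving) are hypothetical numbers invented for the example.

```python
from math import comb


def randomization_p_value(treated_successes, n_treated, control_successes, n_control):
    """One-sided P-value for the strong null of no effect on any subject.

    Under that null the outcomes are fixed, and only the assignment to
    treatment is random, so the number of successes landing in the treated
    group is hypergeometric. This is the calculation behind Fisher's exact test.
    """
    n_total = n_treated + n_control
    total_successes = treated_successes + control_successes
    # Number of equally likely ways to choose which subjects are treated.
    denominator = comb(n_total, n_treated)
    upper = min(total_successes, n_treated)
    # Count assignments with at least as many treated successes as observed.
    numerator = sum(
        comb(total_successes, x) * comb(n_total - total_successes, n_treated - x)
        for x in range(treated_successes, upper + 1)
    )
    return numerator / denominator


# Hypothetical counts, for illustration only.
p_value = randomization_p_value(14, 20, 8, 20)
print(f"one-sided randomization P-value: {p_value:.4f}")
```

The design choice here is that the probability model comes from the experiment itself (the randomization) rather than from a convenient parametric family, so a small P-value bears directly on the question of whether the remedy changed outcomes in the trial.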

I close with a comment regarding likelihood-based tests, which are mentioned in the commentary. There are indeed tests that depend only on likelihoods or likelihood ratios — and that allow optional stopping when “the data look favorable” — but that nonetheless rigorously control the probability of a Type I error. Wald’s sequential probability ratio test is the seminal example, but there are a host of other martingale-based methods that give the same protections.
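The contrast between optional stopping with and without error control can be sketched in a simulation. The example below is an illustration under assumptions not in the commentary: Bernoulli data, a null success probability `p0 = 0.5`, an alternative `p1 = 0.6` used only to form the likelihood ratio, and a cap of 500 observations. It uses a one-sided variant of the sequential probability ratio test that rejects when the likelihood ratio reaches 1/α and never stops to accept, rather than Wald's original two-boundary version. Because the likelihood ratio is a nonnegative martingale with initial value 1 under the null, Ville's inequality bounds the chance of ever crossing 1/α by α. The function names are hypothetical.

```python
import math
import random


def sprt_rejects(data, p0, p1, alpha):
    """Reject the null p = p0 as soon as the likelihood ratio reaches 1/alpha."""
    log_lr = 0.0
    log_threshold = math.log(1.0 / alpha)
    for x in data:
        # Update the log likelihood ratio of p1 versus p0 with one observation.
        log_lr += math.log(p1 / p0) if x else math.log((1.0 - p1) / (1.0 - p0))
        if log_lr >= log_threshold:
            return True  # stopped early because the data looked favorable
    return False


def naive_rejects(data, p0, alpha, min_n=10):
    """Rerun a fixed-sample one-sided z-test after every observation,
    stopping at the first nominal P-value <= alpha (no adjustment)."""
    successes = 0
    for n, x in enumerate(data, start=1):
        successes += x
        if n < min_n:
            continue
        z = (successes - n * p0) / math.sqrt(n * p0 * (1.0 - p0))
        nominal_p = 0.5 * math.erfc(z / math.sqrt(2.0))  # upper-tail normal probability
        if nominal_p <= alpha:
            return True
    return False


def type_one_error_rates(p0=0.5, p1=0.6, alpha=0.05, max_n=500, reps=2000, seed=1):
    """Simulate both procedures on data generated with the null true."""
    rng = random.Random(seed)
    sprt_count = 0
    naive_count = 0
    for _ in range(reps):
        data = [rng.random() < p0 for _ in range(max_n)]
        sprt_count += sprt_rejects(data, p0, p1, alpha)
        naive_count += naive_rejects(data, p0, alpha)
    return sprt_count / reps, naive_count / reps


sprt_rate, naive_rate = type_one_error_rates()
print(f"sequential likelihood ratio test, false rejection rate: {sprt_rate:.3f}")
print(f"repeated nominal z-tests, false rejection rate:         {naive_rate:.3f}")
```

In theory the first rate is at most α = 0.05 however long one is allowed to watch the data, while the second has no such guarantee and grows with the number of looks.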

**Earlier commentaries on Mayo 2021 Editorial**

Philip:

Thank you so much for your excellent commentary. I especially endorse your remarks:

“If you don’t give thought to what the data would be like if your theory is false, you are not doing science.” This is a great way of putting why we care about error probabilities; I need to remember it.

Also extremely important and well put:

“A P-value is the observed value of any statistic whose probability distribution is dominated by the uniform distribution when the null is true. That is, a P-value is any measurable function T of the data that doesn’t depend on any unknown parameters and for which, if the null hypothesis is true, Pr {T ≤ p} ≤ p. Reported P-values often do not have that defining property. One reason is that calculating T may involve many steps, including data selection, model selection, test selection, and selective reporting, but practitioners ignore all but the final step in making the probability calculation.”

Unfortunately, many who endorse the Wasserstein et al. (2019) editorial will say, “But we’re not banning p-values”, as if that absolves their conception of p-values from any criticism. Often, all that’s left of their p-value is a report of a nominal p-value with no error probability warrant. Some claim as well that p-values can’t indicate departure from a null because they depend on a “model” (which they go on to say is invariably false), but as you point out, they only depend on the null hypothesis, and that’s for purposes of drawing out implications for testing.

Of course there are bright lines. But that doesn’t mean they should be based on p-values, or on hypothesis tests in general. The p-value does not tell you whether the bridge is safe enough or not.

This is the crux of the fallacy: “We must make a decision, so we must use p-values.” On the contrary, the more consequential the decision (e.g., whether to lock down a city due to Covid concerns), the less justified one is in making a decision mechanically via some rule. This is especially true in light of all the questionable assumptions underlying that p-value (which you note too). The use of likelihood is especially problematic, since most if not all models will be especially vulnerable in the tails. Not to mention the whole i.i.d. assumption etc.

Norm: No one advocates a mechanical rule for consequential decisions or even for interpreting data*. Not Fisher, not N-P. However, the very fact that you think some uses of tests would result in unwarranted claims (whether at the stage of inferring evidence for something or an action) shows the use of a threshold for a poorly warranted claim is needful. Moreover, tests of assumptions also require recognizing when the data indicate they are way off. The fact that you prefer a different quantitative threshold, be it a posterior probability or a confidence level, only affirms the fact that these issues are matters of disagreement and individuals should weigh the arguments. Restrictions on which quantities to employ, along the lines of “p-values, no” and “other measures, yes” (which is itself a kind of thresholding and arbitrary gatekeeping), imposed by those in power are what we think should be avoided.

*They are sometimes used for screening in high energy particle physics, but that is different.