
Professor Yudi Pawitan
Department of Medical Epidemiology and Biostatistics
Karolinska Institutet, Stockholm, Sweden
[An earlier guest post on this topic by Y. Pawitan appeared on Jan 10, 2022: Yudi Pawitan: Behavioral aspects in the statistical significance war-game]
Behavioral aspects in the statistical significance war-game
I remember with fondness the good old days when the only ‘statistical war’-game was fought between the Bayesian and the frequentist. It was simpler, and the participants were for the most part collegial. Moreover, there was a feeling that it was a philosophical debate. Even though the Bayesian-frequentist war is not fully settled, we can see areas of consensus, for example in objective Bayesianism or in conditional inference. However, on the P-value and statistical significance front, the war looks less simple since it is about statistical praxis: it is no longer Bayesian vs frequentist, there is no consensus in sight, and the implications for the day-to-day use of statistics are wide.
Typically, a persistent controversy between otherwise sensible and knowledgeable people might indicate we are missing some common perspective or the big picture. In complex issues, there can be genuinely distinct aspects about which different parties disagree and, at some point, agree to disagree. I am not sure we have reached that point yet, with each side still working to persuade the other about the faults of its position. For now, I can still concur with Mayo’s (2021) appeal that at least the umpires – reviewers and journal editors – recognize (a) the issue at hand and (b) that genuine debates are still ongoing, so it is not yet time to take sides.
I have previously described my disagreement with the ideas of banning the P-value or just its threshold, and of retiring statistical significance and categorical significance statements (Pawitan, 2020). To summarize briefly:
- There are many common data-analysis tasks where we would sorely miss the P-value, such as: displaying and comparing Kaplan-Meier survival curves; the Wilcoxon and non-parametric rank tests in general; trend tests across an ordinal axis; any test with more than one degree of freedom.
- The P-value is often the raw ingredient for further adjustment to account for multiplicity.
- In genome-wide association studies (GWASs), the primary output is the so-called Manhattan plot, which shows a collection of millions of P-values. Considering the crucial role of GWAS in current quantitative-genetics research, this application alone would perhaps justify the existence of the P-value.
- Without formal significance tests, what would happen to practically useful concepts in the design of experiments, such as power and sample-size considerations?
- Without any indication of statistical significance and categorical direction, statements can sound less communicative/digestible/informative/thoughtful. Compare these statements: “Snus use was associated with a statistically significant increased risk of pancreatic cancer (relative risk 2·0; 95% CI 1·2–3·3), but not with oral (0·8, 0·4–1·7) and lung cancer (0·8, 0·5–1·3),” vs “Among snus-users the incidence rates (per 100,000 person-years) of pancreatic, oral and lung cancers are 8.5, 6.4 and 2.6, while the rates among non-users are 3.9, 8.6 and 3.1.” (These are actual rates from Luo et al, Lancet 2007; 369: 2015–20.)
- Theoretically, while the P-value is not a probability of a null hypothesis, it is a valid measure of confidence in the hypothesis. It is exactly the same notion of ‘confidence’ as in the confidence interval, so it is a mainstream concept (Pawitan and Lee, 2024, Chapters 11 and 12). Moreover, the P-value is in fact a form of posterior probability for a certain choice of prior distribution, as the sketch after this list illustrates.
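To make the last point concrete, here is a minimal numerical sketch (my illustration, assuming the standard normal-mean model with known variance and a flat prior): the one-sided right-side P-value for H0: theta = theta0 coincides with the posterior probability that theta <= theta0.

```python
# Minimal sketch: for a normal mean with known sigma, under a flat prior
# the one-sided (right-side) P-value equals the posterior probability
# P(theta <= theta0 | data). All numbers here are illustrative.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
sigma, n, theta0 = 1.0, 30, 0.0
x = rng.normal(0.3, sigma, size=n)            # simulated data
xbar, se = x.mean(), sigma / np.sqrt(n)

# Right-side P-value for H0: theta = theta0 vs H1: theta > theta0
p_right = 1 - norm.cdf((xbar - theta0) / se)

# Posterior under a flat prior: theta | data ~ N(xbar, se^2)
post_prob = norm.cdf((theta0 - xbar) / se)    # P(theta <= theta0 | data)

print(p_right, post_prob)                     # equal up to floating point
```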
However, no matter how much we have talked, debated and argued, it seems the disagreement persists. So, instead of repeating or expanding the arguments, here I would like to discuss where genuine disagreements can occur and be accepted. In game-theoretic or behavioral-economic analyses, it is accepted that rational-intelligent individuals can act differently, and thus disagree, reflecting different personal preferences or utility functions. In this game-theoretic framework, the differing parties accept each other’s position, and they feel no need to persuade or change each other’s opinion.
So let’s start by assuming we are all rational-intelligent players: we fully understand the correct meaning and usage of the P-value in particular and of statistical inference in general. Excluding deliberate fraud, most objections to the P-value or its threshold seem to reflect at least three concerns:
- the potential misunderstanding of non-expert practitioners, who then produce misleading statements;
- the potential misunderstanding of the public or consumer of the statistical results, leading to poor decisions or confusing public discourse or both;
- the potential for more false-positive or false-negative errors than implied by the reported rates.
Consider the last concern first. Since the P-value threshold controls the false-positive rate under the null, the concern regarding false positives must be due to a belief either that (a) the reported P-value does not represent the true level of uncertainty, or that (b) the standard threshold – such as 0.05 – is too large. The first issue arises, for instance, when one reports winners (i.e., selective inference) in a multiple-testing situation without properly accounting for the multiplicity. In principle, this problem can be cured by more rigorous inference procedures, for example using false discovery rate methods, as sketched below.
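As an aside, here is a minimal sketch (my illustration, not a prescription) of one such procedure, the Benjamini-Hochberg step-up adjustment, which takes the raw P-values as input and controls the false discovery rate:

```python
# Sketch of the Benjamini-Hochberg step-up procedure: reject the
# hypotheses whose sorted P-values satisfy p_(k) <= (k/m) * q for the
# largest such k, controlling the false discovery rate at level q.
import numpy as np

def benjamini_hochberg(pvals, q=0.05):
    p = np.asarray(pvals)
    m = len(p)
    order = np.argsort(p)
    passes = p[order] <= (np.arange(1, m + 1) / m) * q
    reject = np.zeros(m, dtype=bool)
    if passes.any():
        k = np.nonzero(passes)[0].max()       # largest rank passing the rule
        reject[order[:k + 1]] = True
    return reject

# Illustration: 5 strong signals among 100 tests
rng = np.random.default_rng(3)
pvals = np.concatenate([np.full(5, 1e-4), rng.uniform(size=95)])
print(benjamini_hochberg(pvals).sum(), "discoveries at FDR level 0.05")
```

Note that the multiplicity-adjusted conclusion is still driven entirely by the P-values; this is one sense in which the P-value is the raw ingredient.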
Regarding (b), reducing the threshold will increase the false-negative rate, so it’s not cost-free. It seems to me the attitude is: don’t bother trying to balance these two errors, let’s just drop the P-value or its threshold. This reflects a preference that is perhaps amenable to further theoretical analysis and discussion, for example in relation to replicability/validation, but in any case, it will not affect the other two concerns.
The first two concerns are different: they reflect some degree of distrust of non-experts and the gullible public. Although I share these concerns, there is a genuine difference in where I put them on my own utility scale relative to the advantages of having the P-value. Furthermore, in game theory for a social setting, we distinguish between a personal preference and a social preference. On any single issue, these can be distinct or may coincide. For instance, personally I would never consider abortion, but I will not impose my personal preference on other people, so in my social preference, abortion is acceptable. However, somebody else might not only reject abortion for herself, but also want to live in a society that does not allow abortion, and so would militate for its ban. Even in liberal countries, where individual preferences/liberty are supposed to be paramount, there are many issues where you might want to project your personal preference as the social preference: vaccination, addictive drugs from marijuana to cocaine, pornography, prostitution, open-carry firearms, gambling, the death penalty, euthanasia, etc. These social issues are typically settled by democratic means, directly in referenda or indirectly by decisions of elected representatives. In either case, there is a mechanism – such as voting – and an authority that can impose the agreed decision as a social contract on the whole society.
What kind of social solution is suitable for something like the P-value war? It is indeed a challenging problem, since we have (i) no boundary that defines the legitimate stakeholders (academic statisticians? +applied statisticians? +chartered statisticians? +statistically literate scientists? +…?), (ii) no formal mechanism to express and combine preferences, and (iii) no real authority to impose any agreed decision. As in society in general, social norms that are not formally democratically controlled are dictated by culture. But how cultures evolve and which social rules get adopted are not predictable; in particular, they may not be decided by the majority. They may well depend on a small number of influencers, in our case perhaps top-ranked-journal editors, or top-ranked statisticians or scientists. Nassim Taleb (2020) highlighted how social changes can be driven by a small intolerant/loud minority in the face of a tolerant/quiet majority. For instance, the few editors of the journal Basic and Applied Social Psychology banned the P-value and statistical inference, and the journal’s numerous authors must acquiesce regardless of their personal views.
I have never seen any rigorous opinion poll on the use of the P-value. An informal poll (n=303) taken during a public debate on the P-value at the National Institute of Statistical Sciences (Oct 2020) showed that a clear majority (55%) of the audience would use the P-value alone, vs 28% both the P-value and the Bayes factor, 8% the Bayes factor alone, and 13% neither (private communication with JL Rosenberger, who ran the poll; see the transcript at https://errorstatistics.com/2020/12/13/the-statistics-debate-niss-in-transcript-form-question-1/). Formal professional bodies such as the American Statistical Association (ASA) or the Royal Statistical Society (RSS) could perhaps run such a poll. They would of course still face the boundary problem I mentioned above, as their members do not represent all users of statistics, but it would be a start. A rigorous poll would be useful, so we can judge the extent of the division within our profession. One may argue strongly that science is not a democratic enterprise: 1000 dissenting but wrong votes cannot beat a single correct vote. But on an issue with no definite right-wrong answer, such as the use of the P-value, large support for banning it or its threshold should encourage all of us to come to a workable consensus. And small support – please do not ask for a threshold! – should give the intolerant/loud minority pause for thought.
References
Luo, J. et al. (2007). Oral use of Swedish moist snuff (snus) and risk for cancer of the mouth, lung, and pancreas in male construction workers: a retrospective cohort study. Lancet, 369: 2015–20.
Mayo, D. (2021). The statistics wars and intellectual conflicts of interest. Conservation Biology. https://conbio.onlinelibrary.wiley.com/doi/full/10.1111/cobi.13861
Pawitan, Y. (2020). Defending the P-value. https://arxiv.org/abs/2009.02099
Pawitan, Y. and Lee, Y. (2024). Philosophies, Puzzles and Paradoxes. Boca Raton: Chapman and Hall/CRC Press.
Taleb, N. N. (2020). Skin in the Game: Hidden Asymmetries in Daily Life. New York: Random House.



I thank Yudi Pawitan for his guest post, which came in just in time to include in this (first) series. I oppose any “appeal to numbers” or popularity in determining scientific methodology, although getting “statistics” is always interesting. I agree that his invoking Taleb’s point as to how social changes can be driven by a small intolerant/loud minority is entirely fitting here. Of relevance to this discussion is a post on this blog discussing an editorial by Hardwicke and Ioannidis on petitions in scientific argumentation: Dissecting the request to retire statistical significance in 2019. Editorials by me and by Gelman also appeared in the same October issue of the European Journal of Clinical Investigation.
I am glad to see he has a new book with “philosophies” in the title. At some point, I will want to discuss what he says about a popular topic on this blog: the likelihood principle, and his suggestion that I turn it into a normative principle.
Yudi:
I really like your comment about how much more communicative the data report is when the information about statistical significance is given, as opposed to when it is not. It is rarely mentioned what a mass of data reports one would otherwise need to comb through, taking extra seconds to check the endpoints of any confidence interval.
Can you say more about construing the P-value as a valid measure of confidence akin to what one obtains from a confidence interval? Is it mostly noting that P-values match posteriors in special cases (and of course a whole theory of matching priors is out there)? Or are you pursuing the kind of view one finds among some who develop confidence distributions? With a severity assessment, one is not purporting to say how believable, well-supported, or confident one is in a claim that passes with severity. The severe tester uses the error probability to inform about the capability the test has for finding flaws in a claim. Failing to find those flaws, it is warranted to infer the claim – at least according to the severe testing principle. Perhaps you don’t see this as very different from confidence distributions. I could not pin down Cox entirely.
Deborah,
Re P-value and confidence: yes, I am thinking along the same path as the confidence distribution. Theoretically, (i) we may view the confidence distribution as a collection of P-values across the space of all possible null hypotheses.
(ii) In regular continuous single-parameter problems, we may interpret the one-sided right-side P-value for H0: theta=theta0 as the ‘confidence for theta<theta0.’ The confidence here is the area under the confidence density for theta<theta0. So, in this case, it is both intuitively and theoretically correct (and satisfying) to say that low P-value = low confidence. Note also that, in situations where the (percentile) bootstrap is valid, the bootstrap distribution is a confidence distribution, so the confidence distribution concept has wide applicability. This is interesting to mention, because the bootstrap is commonly used to get confidence intervals; its connection to the confidence distribution immediately tells us that we can also use it to get P-values.
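A minimal numerical sketch of this connection (my illustration, assuming a regular normal-sample problem): the proportion of percentile-bootstrap means falling below theta0 approximates the confidence for theta < theta0, and hence the one-sided right-side P-value for H0: theta = theta0.

```python
# Sketch: the percentile-bootstrap distribution used as a confidence
# distribution. The fraction of bootstrap means <= theta0 approximates
# C(theta0), the 'confidence for theta < theta0', which in regular
# problems matches the one-sided right-side P-value.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
x = rng.normal(0.4, 1.0, size=50)             # illustrative sample
xbar, se = x.mean(), x.std(ddof=1) / np.sqrt(len(x))

boot_means = np.array([rng.choice(x, size=len(x), replace=True).mean()
                       for _ in range(20000)])

theta0 = 0.0
p_boot = (boot_means <= theta0).mean()        # bootstrap 'P-value'
p_norm = norm.cdf((theta0 - xbar) / se)       # normal-theory counterpart
print(p_boot, p_norm)                         # close in this regular problem
```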
(iii) The prior associated with a confidence density is implied by the definition of the P-value, such that confidence = implied prior x likelihood. In many theoretical examples, the implied prior is equal to Jeffreys’s prior. So the difference with the objective Bayesians is where we start, which is the frequentist P-value.
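A quick numerical check of this implied-prior computation in the normal-mean case (my sketch, with arbitrary illustrative values): the ratio of the confidence density to the likelihood is constant in theta, so the implied prior is flat, which is Jeffreys’s prior for a location parameter.

```python
# Sketch: the prior implied by confidence = prior x likelihood for a
# normal mean. The ratio (confidence density)/(likelihood) is constant
# in theta, i.e., the implied prior is flat (Jeffreys's prior for a
# location parameter).
import numpy as np
from scipy.stats import norm

xbar, se = 0.7, 0.2                           # illustrative values
theta = np.linspace(-1.0, 2.0, 7)

conf_density = norm.pdf((theta - xbar) / se) / se   # c(theta)
likelihood = norm.pdf(xbar, loc=theta, scale=se)    # L(theta), up to a constant
print(conf_density / likelihood)                    # constant across theta
```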
Yudi
Yudi:
I’ve never heard it said that “The prior associated with a confidence density is implied by the definition of the P-value, such that confidence = implied prior x likelihood.” You mean that if we assume the P-value is like confidence (in the null?), then we are led to a prior whereby the p-value matches the posterior probability? Do you think this is how Jeffreys gets his prior? Is your construal an epistemic measure or more like Fraser’s calibration? I expected people to express surprise at your saying the P-value measures confidence (in the corresponding null?), in that it is opposed to many of the criticisms of P-values.
Hi Deborah,
it’s not that we “assume the P-value is like confidence”: the P-value is confidence (for a specific set of parameters associated with the H0). The full confidence distribution is defined across all possible H0s. If we make confidence = prior x likelihood, then the prior is determined by the confidence, in other words by the P-value, since the likelihood is already well-defined. What do you mean by “epistemic measure”? I also use the term “epistemic”, but we may have interpreted it differently.
Jeffreys got his prior by imposing invariance (Sec 9.3 in my philosophy book), so a completely different route.
Yudi
Yudi – Tx for a post which clearly builds on practical experience. Taleb is a great reference, as he is a prime example of someone able to expand the perspective on various problems with new and sometimes unorthodox views. I suggest that statisticians should adopt such a role, among others. Based on this premise, one needs a wide perspective, beyond the “to p or not to p” discourse. These comments are aimed at statisticians. An effort worth thinking about is a concerted initiative to set up some foundations of applied statistics, as mentioned in my post.
Another comment is about the general ecosystem of users of statistics. Your view of “the good old days when the only ‘statistical war’-game was fought between the Bayesian and the frequentist” is a bit nostalgic. Today’s practice of statistics is strongly affected by AI/ML/LLMs, and the application of statistics needs to account for this. The “to p or not to p” statistics wars seem like a niche concern. From my view, they are not much reflected in the ecosystem of users of statistics, and even less present in statistical or AI/ML software platforms.
So, bottom line, yes – change is happening. Some good discussions on this can be found, for example in https://hdsr.mitpress.mit.edu/pub/g9mau4m0/release/2
Again, I suggest more such discussions are needed. This was the main aim of my post… Your thoughts on it will be most welcome.
Ron: I never felt that the statistical significance wars were taken very seriously by practicing statisticians. Everywhere I went, the practitioners kind of shook their heads as if to say: “those poor social scientists, they are still back to battling abuses of statistics from the time of Morrison and Henkel (1962) or before.” I view it as largely a move to replace error-statistical methods (or at least statistical significance tests) with Bayesian alternatives, or with confidence intervals (CIs). The majority of participants, even at the 2016 P-value forum, were Bayesians. The push to only use CIs leads to the kind of difficult-to-parse reports Pawitan mentions. Moreover, it is confusing because CIs are just inversions of statistical tests of hypotheses and were invented by the same fellow who developed Neyman-Pearson tests, namely Neyman. Neyman began as a Bayesian but, finding frequentist priors so rarely available, moved away from them (though he liked empirical Bayes when it came out). However, the motivation for N-P tests was to come up with statistical inferences that hold regardless of priors. As for how the “no threshold” and “don’t say significant” movement relates to AI/ML, I just don’t know. I thought AI/ML was all about classifying data (in contrast to Wasserstein et al., 2019: “quit classifying”). Park’s post said they’re testing AI/ML methods using RCTs and presumably p-values. If true, it seems they come full circle. The same might happen with evaluating “explainable AI/ML”, but I’m really in the dark here. I will speak at the Neyman seminar this fall and I hope to learn more.
Hi Ron,
thanks for your comments! I recall the polemical paper by Breiman (2001) on the two cultures of statistics: algorithmic (as in AI/ML) and classical. Breiman was clearly on the former side and critical of the latter. At least at the time, the classical culture was mainstream; now, with much more advanced AI including the LLMs, it’s perhaps harder to say. In his commentary, David Cox wrote that Breiman’s portrait of the classical culture was “based in part on a caricature.” What is interesting about statistics is how affected one’s view of it is by the data/applications one regularly sees or works on. If you only see big-data ML applications, it’s easy to feel that all of statistics is just prediction problems, model selection, validation, etc., with little need for inference, especially small-data inference problems. I can only agree with Cox when he continued: “One of our failings has, I believe, been, in a wish to stress generality, not to set out more clearly the distinctions between different kinds of application and the consequences for the strategy of statistical analysis.” (Actually, I think this comment also applies to those who want to ban the P-value, its threshold, etc.)
The lack of explicit classical statistical thinking/reporting in AI/ML/LLMs is surely a weakness. If we consult any expert system, we usually expect its answer to come with some measure of confidence. Current LLMs do not provide this. Considering their well-known propensity for hallucination, don’t you wish that they did?
Yudi
Yudi:
By your comment, you seem to share a kindred spirit with me. And I remember Cox’s response to Breiman and his later remarks on Big Data in talks. He knew better than to suppose that all of our understanding of the world was going to be taken over by prediction. How can we possibly develop better theories about the world without theorizing? I wonder if maybe we have enough good theories to keep the prediction game going for a while longer. I can’t imagine humans no longer being curious about the mechanisms underlying phenomena, and how to interact with them. Newton was predicting fine until scientists searched very hard for serious breakdowns. They wanted to, and still want to, understand gravity.
Yudi – your comment on LLMs raises a very important issue. What we are seeing now is a shift of effort from generating information to evaluating what we get. ChatGPT is particularly awkward, as it comes up with pure nonsense, which gets us to notice only the blatant mistakes, not the more subtle ones. We need to be creative in asking questions. I refer to this in https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4760077
Regarding seminal papers, I tend to refer to Tukey’s 1962 Annals of Mathematical Statistics paper on “The Future of Data Analysis”. Tukey had a clairvoyant understanding of the “singularity” affecting statistics. The abstract of the paper starts with: “For a long time I have thought I was a statistician, interested in inferences from the particular to the general. But as I have watched mathematical statistics evolve, I have had cause to wonder and to doubt.” Donoho celebrated the paper’s jubilee in https://www.tandfonline.com/doi/full/10.1080/10618600.2017.1384734
My blog post above is, in a sense, a legacy of Tukey’s vision. It is worthwhile revisiting Tukey’s paper from time to time. (I believe I heard Roger Peng say he does so yearly.)
So, in a few words, I think Breiman opened the door to computer scientists’ interest in data analysis, and Tukey charted the way for an evolution of statistics. Both messages need to strongly impact the application and development of statistics.
This brings me again to the need for foundations of applied statistics, as presented in my blog post. As Neyman used to say (I can confirm this first-hand), “life is complicated but not uninteresting”…
Hi Ron,
thanks for the link to Donoho’s paper. His idea of Greater Data Science looks great, and concrete enough to form the basis for a graduate programme in data science that pays proper respect to classical statistical thinking. GDS avoids the weaknesses of many AI/ML initiatives that willfully ignore the contributions of statistics. It would be a great way forward if some top universities tried it out.
Yudi
Hi Yudi,
Good old days notwithstanding, your introduction to (the role of philosophy in) a search to reconcile conflict by broadening the scope of interest has uberty. It reminds me of Polya’s Inventor’s Paradox (the more ambitious plan may have more chances of success). Considering that “megateams of seventy or more authors array themselves on either side” (Mayo 2019), one may ask what other cultures we might be seeking to reconcile (or how ambitious we should be)?
Rick