**Yudi Pawitan**

Professor

Department of Medical Epidemiology and Biostatistics

Karolinska Institutet, Stockholm

**Behavioral aspects in the statistical significance war-game**

I remember with fondness the good old days when the only ‘statistical war’-game was fought between the Bayesian and the frequentist. It was simpler – except when the likelihood principle was thrown in, always guaranteed to confound the frequentist – and the participants were for the most part collegial. Moreover, there was a feeling that it was a philosophical debate. Even though the Bayesian–frequentist war is not fully settled, we can see areas of consensus, for example in objective Bayesianism or in conditional inference. However, on the P-value and statistical significance front, the war looks less simple, as it is about statistical praxis; it is no longer Bayesian vs frequentist, with no consensus in sight and with wide implications affecting the day-to-day use of statistics. Typically, a persistent controversy between otherwise *sensible and knowledgeable* people – thus excluding anti-vaxxers and conspiracy theorists – might indicate we are missing some common perspectives or perhaps the big picture. In complex issues there can be genuinely distinct aspects about which different players disagree and, at some point, agree to disagree. I am not sure we have reached that point yet, with each side still working to persuade the other of the faults of its position. For now, I can only concur with Mayo's (2021) appeal that at least the umpires – journal editors – recognize (a) the issue at hand and (b) that genuine debates are still ongoing, so it is not yet time to take sides.

I have previously described my disagreement with the ideas of banning the P-value or just its threshold, or retiring statistical significance (Pawitan, 2020). Rather than repeating or expanding the arguments here, I want instead to discuss where genuine disagreements can occur and be accepted. In game-theoretic or behavioral-economic analyses, it is accepted that rational-intelligent individuals can act differently, and thus disagree, reflecting different personal preferences or utility functions. In this game framework, the differing parties accept each other’s position, and there is no need to persuade and change each other’s opinion.

So let’s start by assuming we are all rational-intelligent players: we fully understand the correct meaning and usage of the P-value in particular and statistical inference in general. Excluding deliberate fraud, most objections to the P-value or its threshold seem to refer to at least three concerns: (i) the potential misunderstanding of non-expert practitioners, who then produce misleading statements; (ii) the potential misunderstanding of the public or consumers of the statistical results, leading to poor decisions or confusing public discourse or both; (iii) the potential for more false-positive or false-negative errors. Since the P-value threshold controls the false-positive rate under the null, we must suppose that the concern regarding false positives is either due to a belief that the reported P-value does not represent the true level of uncertainty, or that the standard threshold – such as 0.05 – is too large. The former will occur when using biased data or analysis procedures, so in principle it can be cured by better data or more rigorous procedures. But reducing the threshold to cure the latter will increase the false-negative rate; vice versa, increasing the threshold reduces the false-negative rate but increases the false-positive rate. It seems to me the attitude is that, rather than try to balance these two errors, let’s just not use the P-value or its threshold. This reflects a preference that is perhaps amenable to further theoretical analysis and discussion, for example in relation to replicability/validation, but in any case it will not affect the other two concerns.
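This trade-off can be seen in a small simulation (a hypothetical one-sample z-test; the sample size, effect size, and simulation settings below are made up for illustration): lowering the threshold buys fewer false positives at the cost of more false negatives.

```python
# Illustrative sketch (hypothetical settings): error rates of a two-sided
# z-test at different thresholds alpha, estimated by simulation.
import math
import random
import statistics

random.seed(1)

def z_test_pvalue(sample, mu0=0.0, sigma=1.0):
    """Two-sided z-test p-value for H0: mean = mu0, with known sigma."""
    n = len(sample)
    z = (statistics.fmean(sample) - mu0) / (sigma / math.sqrt(n))
    # standard normal tail probability via the error function
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

def error_rates(alpha, n=20, effect=0.5, n_sim=2000):
    # false positives: H0 is true (mean 0) but p falls below alpha
    fp = sum(z_test_pvalue([random.gauss(0, 1) for _ in range(n)]) < alpha
             for _ in range(n_sim)) / n_sim
    # false negatives: H0 is false (mean = effect) but p stays above alpha
    fn = sum(z_test_pvalue([random.gauss(effect, 1) for _ in range(n)]) >= alpha
             for _ in range(n_sim)) / n_sim
    return fp, fn

for alpha in (0.10, 0.05, 0.01):
    fp, fn = error_rates(alpha)
    print(f"alpha={alpha:.2f}  false-positive≈{fp:.3f}  false-negative≈{fn:.3f}")
```

The estimated false-positive rate tracks alpha, as it must under the null, while the false-negative rate moves in the opposite direction as alpha shrinks.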

The first two concerns are different: they reflect some degree of distrust of non-experts and the gullible public. Although I share these concerns, there is a genuine difference in where I put them on my utility scale relative to the advantages of having the P-value. Furthermore, in game theory for a social setting, we talk about a personal preference and a social preference; on a single issue these can be distinct or may coincide. For instance, personally I would never consider abortion, but I will not impose my preference on other people, so in my social preference, abortion is acceptable. But somebody else might not only reject abortion for herself but also want to live in a society that does not allow abortion, and so would militate for its ban. Even in liberal countries, where individual preferences/liberty are supposed to be paramount, there are many such issues where you might want to project your personal preference as the social preference: vaccination, addictive drugs from marijuana to cocaine, pornography, prostitution, open-carry firearms, gambling, the death penalty, euthanasia, etc. These social issues are typically solved by democratic means, directly in referenda or indirectly by decisions of elected representatives. In either case, there is a mechanism – such as voting – and an authority that can impose the agreed decision as a social contract on the whole society.

What kind of social solution is suitable for something like the P-value war? It is indeed a challenging problem, since we have (i) no boundary that defines the legitimate stakeholders (academic statisticians? +applied statisticians? +chartered statisticians? +statistically literate scientists? +…?), (ii) no formal mechanism to express and combine preferences, and (iii) no real authority to impose any agreed decision. As in society in general, social norms that are not formally democratically controlled are dictated by culture. But how cultures evolve and which social rules get adopted are not predictable; in particular they may not be decided by the majority. They may well depend on a small number of influencers, in our case perhaps top-ranked-journal editors, or top-ranked statisticians or scientists. Nassim Taleb (2020) highlighted how social changes can be driven by a small intolerant/loud minority in the face of a tolerant/quiet majority. For instance, the few editors of the journal *Basic and Applied Social Psychology* banned the P-value and statistical inference, and the numerous authors must acquiesce regardless of their personal views.

I have never seen any rigorous opinion poll on the use of P-values. Formal professional bodies such as the ASA or the RSS could perhaps run such a poll. They will of course still face the boundary problem I mention above, as their members do not represent all users of statistics, but it would be a start. A rigorous poll would be useful, so we can judge the extent of the division within our profession. One may argue strongly that science is not a democratic enterprise: 1000 dissenting but wrong votes cannot beat a single correct vote. But on an issue with no definite right-wrong answer, such as the use of the P-value, large support for banning it or its threshold should encourage all of us to come to a workable consensus. But small support – please do not ask for a threshold! – should give the intolerant/loud minority pause for thought.

**References**

Mayo, D. (2021). The statistics wars and intellectual conflicts of interest. *Conservation Biology*.

Pawitan, Y. (2020). Defending the P-value. https://arxiv.org/abs/2009.02099

Taleb, N. N. (2020). *Skin in the Game: Hidden Asymmetries in Daily Life*. New York: Random House.

Re

> Since the P-value threshold controls the false-positive rate under the null, we must suppose that the concern regarding false positives is either due to a belief that the reported P-value does not represent the true level of uncertainty, or that the standard threshold – such as 0.05 – is too large.

I think a missing concern here is the choice of null. E.g., accepting/rejecting a single point null is usually less informative to me than a list of all point nulls that would be accepted/rejected (i.e., the confidence interval = a full family of tests). This gives some indication of real-world relevance and robustness to ‘crud’ factors.
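The "confidence interval = a full family of tests" point can be sketched numerically (a hypothetical normal-mean example; the sample values and known sigma below are made up): the 95% interval is exactly the set of point nulls not rejected at the 0.05 threshold.

```python
# Sketch (hypothetical data): a 95% CI for a normal mean, recovered by
# inverting the full family of two-sided z-tests over point nulls mu0.
import math
import statistics

def p_value(sample, mu0, sigma=1.0):
    """Two-sided z-test p-value for H0: mean = mu0, with known sigma."""
    n = len(sample)
    z = (statistics.fmean(sample) - mu0) / (sigma / math.sqrt(n))
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

sample = [0.3, 1.1, -0.2, 0.8, 0.5, 0.9, 0.1, 0.6]  # made-up data
n, sigma = len(sample), 1.0
mean = statistics.fmean(sample)

# Direct 95% CI for the mean (known sigma)
lo = mean - 1.96 * sigma / math.sqrt(n)
hi = mean + 1.96 * sigma / math.sqrt(n)

# The same interval recovered by scanning point nulls on a fine grid
# and keeping those NOT rejected at the 0.05 threshold
accepted = [mu0 / 1000 for mu0 in range(-2000, 3000)
            if p_value(sample, mu0 / 1000) >= 0.05]
print(f"direct CI:    ({lo:.3f}, {hi:.3f})")
print(f"by inversion: ({min(accepted):.3f}, {max(accepted):.3f})")
```

The two intervals agree up to the grid resolution, which is the sense in which a confidence interval reports the verdicts of a whole family of tests rather than one.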

In his most recent arXiv paper defending P-values and fixed cutoffs, he writes about confidence distributions, citing S&H (2016) and Fraser (2011), who explore the topic of posterior distributions being approximations to CDs, so I would assume that Yudi may be supportive of their use. But I could be wrong.

I asked him in an email to clarify whether he would support their use, along with likelihood functions, in place of giving one or two sentences describing the effect estimate, the 95% CI, and the test statistic/corresponding P-value for the null. Looking forward to hearing his thoughts.

Omaclaren:

I agree that the point null is an artificial example, although for N-P testers the same one-sided test results whether it’s a point null with a one-sided alternative or a one-sided null (in the opposite direction of the alternative). The critics of statistical significance tests have used the artificial point null as a stick with which to beat tests. For Bayesians, a point null has to be given a point or lump prior, but then the result is that p-values can be small while posteriors on the null are high. But who said they should match? And who says we should be using the point null with a spike? Not to mention that Bayesians disagree with themselves on this (as Senn puts it), since the posterior can match the p-value if you get rid of the spike prior. The topic of “p-values overstate evidence” is all over this blog and in my *Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars* (2018, CUP).



Yudi:

Thank you so much for your guest commentary on my editorial. I recommend readers also read your full paper. I like your analogy: the umpires shouldn’t take sides while the game is still being played. This is not only under debate but – as you note – under a brand new kind of debate. At the literal p-value debate held by the NISS last year (J. Berger, Trafimow, and I were the debaters), a poll taken during a break found a majority favoring use of p-values (I don’t recall how it was worded). Jeske announced something like: sorry, Jim (Berger), they are still a majority.

The full transcript can be read and heard at this link. I haven’t checked on this portion.

During the 11th January forum, Y. Benjamini gave convincing illustrations of the problem of selective inference, or multiple testing, which biases down the P-values of the reported or highlighted findings. Much of the replicability-reproducibility issue can be traced to this practice. Adjustment of the P-value is commonly used to deal with this problem, but there are logical issues involved. For instance, the P-value is affected by the intention of the experimenter; this can feel disturbing if the intention is only a future plan, i.e. not yet realized. And to what collection of tests should the adjustment be made? All tests in a single table? In a single paper? In a single experiment by a single research group? Or over a lifetime of experiments by the research group? If I may, I would like to advertise my other arXiv paper where I discuss and illustrate those issues: “Dealing with multiple testing: To adjust or not to adjust” https://arxiv.org/pdf/2010.02205.pdf
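The dependence on the chosen collection of tests can be sketched with a Bonferroni adjustment (hypothetical numbers; the raw P-value and the family sizes below are made up): the very same finding flips between "significant" and "not significant" depending on which family it is adjusted against.

```python
# Illustrative sketch (made-up numbers): the same raw P-value adjusted
# against three nested families of tests via Bonferroni correction.
p = 0.004  # raw P-value of the highlighted finding (hypothetical)

for family, m in [("single table", 5), ("whole paper", 40), ("lifetime", 5000)]:
    adjusted = min(1.0, p * m)  # Bonferroni: multiply by the family size
    verdict = "significant" if adjusted < 0.05 else "not significant"
    print(f"{family:12s} (m={m:5d}): adjusted P = {adjusted:.3f} -> {verdict}")
```

Nothing in the data changes between the three lines; only the experimenter's declared family does, which is the logical issue raised above.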

The problem is that adjusting for selection is always partial. This includes out-of-study bias such as publication bias. No way to adjust for it….