Department of Medical Epidemiology and Biostatistics
Karolinska Institutet, Stockholm
Behavioral aspects in the statistical significance war-game
I remember with fondness the good old days when the only ‘statistical war’-game was fought between the Bayesian and the frequentist. It was simpler – except when the likelihood principle was thrown in, always guaranteed to confound the frequentist – and the participants were for the most part collegial. Moreover, there was a feeling that it was a philosophical debate. Even though the Bayesian-frequentist war is not fully settled, we can see areas of consensus, for example in objective Bayesianism or in conditional inference. However, on the P-value and statistical significance front, the war looks less simple, as it is about statistical praxis: it is no longer Bayesian vs frequentist, there is no consensus in sight, and the implications are wide, affecting the day-to-day use of statistics. Typically, a persistent controversy between otherwise sensible and knowledgeable people – thus excluding anti-vaxxers and conspiracy theorists – might indicate we are missing some common perspectives or perhaps the big picture. In complex issues there can be genuinely distinct aspects about which different players disagree and, at some point, agree to disagree. I am not sure we have reached that point yet, with each side still working to persuade the other of the faults of its position. For now, I can only concur with the appeal of Mayo (2021) that at least the umpires – journal editors – recognize (a) the issue at hand and (b) that genuine debates are still ongoing, so it is not yet time to take sides.
I have previously described my disagreement with the ideas of banning the P-value or just its threshold, or retiring statistical significance (Pawitan, 2020). Rather than repeating or expanding the arguments here, I want instead to discuss where genuine disagreements can occur and be accepted. In game-theoretic or behavioral-economic analyses, it is accepted that rational-intelligent individuals can act differently, and thus disagree, reflecting different personal preferences or utility functions. In this game framework, the differing parties accept each other’s position, and there is no need to persuade each other or to change each other’s opinion.
So let’s start by assuming we are all rational-intelligent players: we fully understand the correct meaning and usage of the P-value in particular and statistical inference in general. Excluding deliberate frauds, most objections to the P-value or its threshold seem to refer to at least three concerns: (i) the potential misunderstanding of non-expert practitioners, who then produce misleading statements; (ii) the potential misunderstanding of the public or consumer of the statistical results, leading to poor decisions or confusing public discourse or both; (iii) the potential for more false-positive or false-negative errors. Since the P-value threshold controls the false-positive rate under the null, we must suppose that the concern regarding false positives is due either to a belief that the reported P-value does not represent the true level of uncertainty, or to a belief that the standard threshold – such as 0.05 – is too large. The former will occur when using biased data or analysis procedures, so in principle it can be cured by better data or more rigorous procedures. But reducing the threshold to cure the latter will increase the false-negative rate; conversely, increasing the threshold reduces the false-negative rate but increases the false-positive rate. It seems to me the attitude is that, rather than try to balance these two errors, let’s just not use the P-value or its threshold. This reflects a preference that is perhaps amenable to further theoretical analysis and discussion, for example in relation to replicability/validation, but in any case it will not affect the other two concerns.
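The trade-off between the two error rates can be made concrete with a small calculation. The sketch below is only an illustration under stated assumptions, none of which come from the discussion above: a one-sided z-test with known unit variance, a hypothetical standardized effect size of 0.3, and a hypothetical sample size of 50. The function name `error_rates` is likewise made up for this example.

```python
from statistics import NormalDist


def error_rates(alpha, effect_size, n):
    """Return (false_positive_rate, false_negative_rate) for a one-sided z-test.

    Assumptions (illustrative only): known unit variance, the null is
    'mean = 0', and the alternative is 'mean = effect_size' with effect_size >= 0.
    """
    z = NormalDist()  # standard normal distribution
    # Reject the null when the z-statistic exceeds this critical value.
    critical = z.inv_cdf(1 - alpha)
    # Under the alternative, the z-statistic is centred at effect_size * sqrt(n).
    shift = effect_size * n ** 0.5
    power = 1 - z.cdf(critical - shift)
    # By construction, the false-positive rate under the null equals alpha.
    return alpha, 1 - power


if __name__ == "__main__":
    # Hypothetical thresholds, effect size and sample size.
    for alpha in (0.005, 0.05, 0.10):
        fp, fn = error_rates(alpha, effect_size=0.3, n=50)
        print(f"threshold={alpha:<6} false-positive={fp:.3f} false-negative={fn:.3f}")
```

With the effect size and sample size held fixed, lowering the threshold raises the false-negative rate and raising it lowers that rate, which is the balance described above. The specific numbers depend entirely on the assumed effect size and sample size.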
The first two concerns are different: they reflect some degree of distrust of non-experts and the gullible public. Although I share these concerns, there is a genuine difference in where I put them on my utility scale relative to the advantages of having the P-value. Furthermore, in the game theory for a social setting, we talk about a personal preference and a social preference; on a single issue these can be distinct or may coincide. For instance, personally I would never consider abortion, but I will not impose my preference on other people, so in my social preference, abortion is acceptable. But somebody else might not only reject abortion for herself, but also want to live in a society that does not allow abortion, and so would militate for its ban. Even in liberal countries, where individual preferences/liberty are supposed to be paramount, there are many such issues where you might want to project your personal preference as the social preference: vaccination, addictive drugs from marijuana to cocaine, pornography, prostitution, open-carry firearms, gambling, death penalty, euthanasia, etc. These social issues are typically solved by democratic means, directly in referenda or indirectly by decisions of elected representatives. In either case, there is a mechanism – such as voting – and an authority that can impose the agreed decision as a social contract on the whole society.
What kind of social solution is suitable for something like the P-value war? It is indeed a challenging problem, since we have (i) no boundary that defines the legitimate stakeholders (academic statisticians? +applied statisticians? +chartered statisticians? +statistically literate scientists? +…?), (ii) no formal mechanism to express and combine preferences, and (iii) no real authority to impose any agreed decision. As in society in general, social norms that are not formally democratically controlled are dictated by culture. But how cultures evolve and which social rules get adopted are not predictable; in particular, they may not be decided by the majority. They may well depend on a small number of influencers, in our case perhaps top-ranked-journal editors, or top-ranked statisticians or scientists. Nassim Taleb (2020) highlighted how social changes can be driven by a small intolerant/loud minority in the face of a tolerant/quiet majority. For instance, the few editors of the journal Basic and Applied Social Psychology banned the P-value and statistical inference, and the numerous authors must acquiesce regardless of their personal views.
I have never seen any rigorous opinion poll on the use of P-values. Formal professional bodies such as the ASA or the RSS could perhaps run such a poll. They will of course still face the boundary problem I mentioned above, as their members do not represent all users of statistics, but it will be a start. A rigorous poll would be useful, so we can judge the extent of the division within our profession. One may argue strongly that science is not a democratic enterprise: 1000 dissenting but wrong votes cannot beat a single correct vote. But on an issue with no definite right-wrong answer, such as the use of the P-value, large support for banning it or its threshold should encourage all of us to come to a workable consensus. Conversely, small support – please do not ask for a threshold! – should give the intolerant/loud minority pause for thought.
Mayo, D (2021). The statistics wars and intellectual conflicts of interest. Conservation Biology.
Pawitan, Y (2020). Defending the P-value. https://arxiv.org/abs/2009.02099
Taleb, N N (2020). Skin in the Game: Hidden Asymmetries in Daily Life. New York: Random House.