Department of Statistical Sciences
University of Bologna
The ASA controversy on P-values as an illustration of the difficulty of statistics
“I work on Multidimensional Scaling for more than 40 years, and the longer I work on it, the more I realise how much of it I don’t understand. This presentation is about my current state of not understanding.” (John Gower, world leading expert on Multidimensional Scaling, at a conference in 2009)
“The lecturer contradicts herself.” (Student feedback to an ex-colleague, given for teaching methods and then also teaching what problems those methods have)
1 Limits of understanding
Statistical tests and P-values are widely used and widely misused. In 2016, the ASA issued a statement on significance and P-values with the intention of curbing misuse while acknowledging their proper definition and potential use. In my view the statement did a rather good job of saying things that are worthwhile saying while trying to be acceptable both to those who are generally critical of P-values and to those who tend to defend their use. As was predictable, the statement did not settle the issue. A “2019 editorial” by some of the authors of the original statement (recommending “to abandon statistical significance”) and a 2021 ASA task force statement, much more positive on P-values, followed, showing the level of disagreement in the profession.
Statistics is hard. Well-trained, experienced and knowledgeable statisticians disagree about standard methods. Statistics is based on probability modelling, and probability modelling in data analysis is essentially about whether and how often things that did not happen could have happened, which can never be verified. The very meaning of probability, and by extension of every probability statement, is controversial.
The 2021 task force statement states: “Indeed, P-values and significance tests are among the most studied and best understood statistical procedures in the statistics literature.” I do not disagree with this. Probability models assign probabilities to sets, and considering the probability of a well-chosen data-dependent set is a very elementary way to assess the compatibility of a model with the data. The likelihood is another way, not requiring the specification of a test statistic that defines a “direction” in which the model may be violated, but instead relying somewhat more on the exact model specification. Still, given that P-values count as “among the best understood”, it is remarkable how much controversy, lack of understanding, and misunderstanding surrounds them. Indeed there are issues with tests and P-values about which there is disagreement even among the most proficient experts, such as when and how exactly corrections for multiple testing should be used, or under what exact conditions a model can be taken as “valid”. Such decisions depend on the details of the individual situation, and there is no way around personal judgement.
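The multiple-testing issue mentioned above can be made concrete with a small simulation. The following sketch (my own illustration, not from any of the statements discussed; the function name and all parameter values are chosen for this example) generates many data sets from a true null model and computes a z-test P-value for each, showing that “significant” results at the conventional 5% level appear by chance alone:

```python
import math
import random

def p_value_mean_zero(sample):
    """Two-sided P-value for H0: mean = 0, known sd = 1 (z-test)."""
    n = len(sample)
    z = sum(sample) / math.sqrt(n)  # under H0 this statistic is N(0,1)
    # P(|Z| >= |z|) from the standard normal CDF, via math.erf
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

random.seed(1)
m = 200  # number of independent tests, all with a true null
pvals = [p_value_mean_zero([random.gauss(0, 1) for _ in range(30)])
         for _ in range(m)]
hits = sum(p < 0.05 for p in pvals)
print(hits)  # roughly 0.05 * m "significant" findings, all spurious
```

Under the null model the P-values are (approximately) uniform on [0, 1], which is exactly why screening many tests without a correction manufactures apparent discoveries; how to correct, and when, is where the expert disagreement starts.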
I do not think that this is a specific defect of P-values and tests. The task of quantifying evidence and reasoning under uncertainty is so hard that problems of these or other kinds arise with all alternative approaches as well. The opening quote by John Gower is not about P-values, but it would be heart-warming to see top experts on statistical inference talking this way, too. It is also important to acknowledge that there is agreement when it comes to mathematics and basic interpretation (not rejecting the null hypothesis does not mean that it is true, and neither is the P-value a probability for it to be true), a fact from which the general perception may be distracted when the focus is too much on philosophical differences.
A much bigger problem is the tension between the difficulty of statistics and the demand for it to be simple and readily available. Data analysis is essential for science, industry, and society as a whole. Not all data analysis can be done by highly qualified statisticians, and society cannot put off analysing data until statisticians achieve perfect understanding and agreement. On top of this there are incentives for producing headline-grabbing results, and society tends to attribute authority to those who convey certainty rather than to those who emphasise uncertainty. Statistics provides standard model-based indications of uncertainty, but on top of that there is model uncertainty, uncertainty about the reliability of the data, and uncertainty about appropriate strategies of analysis and their implications. A statistician who emphasises all of these will often meet confusion and disregard.
Another important tension exists between the requirement for individual judgement and decision-making depending on the specifics of a situation, and the demand for automated mechanical procedures that can be easily taught, easily transferred from one situation to another, justified by appealing to simple general rules (even though their applicability to the specific situation of interest may be doubtful), and also investigated by statistical theory and systematic simulation.
P-values are so elementary and apparently simple a tool that they are particularly suitable for mechanical use and misuse. To have the data’s verdict about a scientific hypothesis summarised in a single number is a very tempting perspective, even more so if it comes without the requirement to specify a prior first, which puts many practitioners off a Bayesian approach. As a bonus, there are apparently well established cutoff values so that the number can even be reduced to a binary “accept or reject” statement. Of course all this belies the difficulty of statistics and a proper account of the specifics of the situation.
As said in the 2016 ASA Statement, the P-value is an expression of the compatibility of the data with the null model, in a certain respect that is formalised by the test statistic. As such, I have no issues with tests and P-values as long as they are not interpreted as something that they are not. The null model should not be believed to be true (and neither should any other model). A P-value is surely informative; regarding given data, compatibility is the best that models can ever achieve, keeping in mind of course that many models can be compatible with the same data. The fact that P-values (and statistical reasoning in general) regard idealised models that are different from reality seems to be hard to stomach and easy to ignore; conversely, it is sometimes interpreted as testifying to the uselessness of P-values (or frequentist statistical inference in general). It seems more difficult to acknowledge how models can help us to handle reality without being true, and how finding an incompatibility between data and model can be the starting point of an investigation into how exactly reality is different and what that means. For this, a test gives a rough direction (such as “the mean looks too large”), which can be useful, but is certainly limited as information.
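The “rough direction” point can be illustrated with a minimal sketch (my own, with hypothetical parameter values): data are generated from a model that differs from the null in its mean, and a one-sided test flags the incompatibility in the “mean too large” direction, while saying nothing about other ways the null model might fail (variance, shape, dependence):

```python
import math
import random

def upper_p_value(sample):
    """One-sided P-value for H0: mean = 0 (known sd = 1),
    against the direction 'mean too large'."""
    n = len(sample)
    z = sum(sample) / math.sqrt(n)  # N(0,1) under H0
    return 1 - 0.5 * (1 + math.erf(z / math.sqrt(2)))

random.seed(7)
# Reality here: normal with mean 0.5, so the null model is (mildly) false.
data = [random.gauss(0.5, 1) for _ in range(50)]
p = upper_p_value(data)
print(p)  # small: data incompatible with H0 in the chosen direction
```

The small P-value only says the data look incompatible with the null in the direction the test statistic formalises; what exactly is different about reality, and whether it matters, remains to be investigated.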
Alternative statistical approaches have their merits and pitfalls, too, always including the temptation to over-interpret their implications, often by taking the assumed model as a truth rather than a model (a Bayesian model of belief, too, should not just be believed). The pessimistic belief that the general popularity and spread of any statistical approach will correspond to its capacity for being mechanically used, misused, and over-interpreted, making it easy for its opponents to criticise it, seems realistic.
As statisticians we face the dilemma that we want statistics to be popular, authoritative, and in widespread use, but we also want it to be applied carefully and correctly, avoiding oversimplification and misinterpretation. That these aims are in conflict is in my view a major reason for the trouble with P-values, and if P-values were to be replaced by other approaches, I am convinced that we would see very similar trouble with them, and to some extent we already do.
Ultimately I believe that as statisticians we should stand by the complexity and richness of our discipline, including the plurality of approaches. We should resist the temptation to give those who want a simple device to generate strong claims what they want, yet we also need to teach methods that can be widely applied, with a proper appreciation of pitfalls and limitations, because otherwise much data will be analysed with even less insight. Making reference to the second quote above, we exactly need to “contradict ourselves” in the sense of conveying what can be done, together with what the problems of any such approach are.
When it comes to a representative association such as the ASA, I think that the approach taken in the initial statement followed this ideal and was as such valuable. I would have hoped that the assertions made could be accepted by a vast majority of statisticians despite much existing disagreement, perhaps tolerating disagreement with certain details of the statement. The “2019 editorial” had a different spirit by recommending to “abandon” methodology that a substantial number of statisticians routinely use and defend. This was obviously not something that could hope for broad agreement, and I think it was quite damaging for the profession. If we see ourselves as flag bearers of the acknowledgement and communication of uncertainty (and I think we should define ourselves in this way), this task alone puts us in a difficult position with a public who expect certainty and quick results. Regarding methodological controversies within our profession, we should be pluralist and open to the arguments of each side, rather than trying to shut one side out.
Unfortunately, for the participants in such controversies it is tempting and easy to hold difficulties and issues against an approach that they do not favour, in order to promote a particular alternative approach. But the latter may well be affected in one way or another by the same or strongly related issues, as the difficulties with formalising uncertainty run deeper.
What we should like to see is scientists (and other statistics users) who are aware of the many sources of uncertainty and misunderstanding, and interpret their results keeping this in mind. Most of them are not highly trained statisticians, so we cannot expect them to have deep statistical insight or to do very sophisticated things. In any case we should not give them the impression that whether they do things right or wrong is a matter of whether they follow one or the other statistical approach, as long as both find support within the statistics community. Instead it is a matter of awareness of the limitations of whatever they do.
See the Ionides and Ritov commentary here. Prior to that are commentaries by Haig and by Lakens.
Please join us for our special remote Phil Stat Forum on Tuesday Jan 11, 10 AM EST: phil-stat-wars.com (“statistical significance test anxiety”)
All commentaries on Mayo (2021) editorial until Jan 31, 2022 (more to come*)
*Let me know if you wish to write one