Guest Post: Christian Hennig: “Statistical tests in five random research papers of 2024, and related thoughts on the ‘don’t say significant’ initiative”


Professor Christian Hennig
Department of Statistical Sciences “Paolo Fortunati”
University of Bologna

[An earlier post by C. Hennig on this topic:  Jan 9, 2022: The ASA controversy on P-values as an illustration of the difficulty of statistics]

Statistical tests in five random research papers of 2024, and related thoughts on the “don’t say significant” initiative

This text follows an invitation to write on “abandon statistical significance 5 years on”, so I decided to do a tiny bit of empirical research. I had a look at five new papers listed on May 17 on the “Research Articles” page of Scientific Reports. I chose the five most recent papers at the time I looked, without being selective. As I “sampled” papers for a general impression, I don’t want this to be a criticism of particular papers or authors; however, in the interest of transparency, the DOIs of the papers are:

https://doi.org/10.1038/s41598-024-62172-2
https://doi.org/10.1038/s41598-024-61552-y
https://doi.org/10.1038/s41598-024-62074-3
https://doi.org/10.1038/s41598-024-59702-3
https://doi.org/10.1038/s41598-024-62350-2

Four of the papers contain statistical tests. None of the papers contains any of the methods that have been proposed in the statistical literature as alternatives to tests and p-values, such as s-values (Cole et al. 2021), second generation p-values (Blume et al. 2019), e-values (Grünwald et al. 2023), relevance (Stahel 2021), or any Bayesian analysis. Severity (Mayo 2018) does not feature either. There is no trace of any influence of the “abandon statistical significance” discussion. The papers contain a substantial amount of material and key conclusions that don’t rely on statistical tests, on which I will not comment here.

Obviously this is not a representative sample of what goes on in science (for sure looking at only one journal provides a very narrow view); I was just curious to see an interdisciplinary mix of recent papers in a reasonably well reputed journal. Despite this, it suggests (together with some even less formal further looking around) that significance testing has for sure not been abandoned, but still happens all over the place.

How do I feel about this? As probably almost all statisticians do, I think that statistical tests are endemically misinterpreted and misused. Although I don’t agree with demands for tests to be abandoned (more on this later), I hope that the controversy about tests causes at least some researchers to question what they are doing, and to understand a bit better when to use them or not, and how to interpret the results. This may indeed happen in some quarters, but I suspect that it concerns a small minority.

Looking at the four papers that contain tests, there are some issues. One of them only reports whether p<0.05 or not, but not the precise p-value. This bugs me; it is of great informative value whether p is close to 0.05 (not a very strong indication of anything) or rather 0.0001. Binary thinking based on merely distinguishing “significant” from “insignificant” was one of the major issues with tests highlighted by Wasserstein et al. (2019), and I agree that (reasonably) precise p-values should be reported.

Another paper states that the null hypothesis is true in case of non-rejection (“there is no difference”; “data are normal” – no, they aren’t! – when tested and not rejected). I find this problematic, but given that the general tone of the paper conveys that results refer to the specific experiment, and the authors avoid overgeneralising claims and don’t seem to imply that their results are the final word on the matter, this seems rather harmless in the specific case.

The same paper wrongly replaces a paired t-test by an unpaired Wilcoxon test because of “non-normality”, ignoring the dependence between observations on the same individual observed under two conditions. A data plot indicates, however, that the highly significant test result would also have been significant with a correct paired test.
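To make the pairing issue concrete, here is a minimal sketch with simulated data of my own (not taken from the paper): when observations on the same individuals are strongly correlated, an unpaired rank test typically has much less power than the paired alternatives.

```python
# Minimal sketch (simulated data, not from the paper): paired measurements on the
# same individuals under two conditions, with strong within-subject correlation.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 20
baseline = rng.normal(10, 3, size=n)                       # subject-specific level
condition_a = baseline + rng.normal(0, 1, size=n)
condition_b = baseline + 0.8 + rng.normal(0, 1, size=n)    # hypothetical shift of 0.8

# Wrong here: an unpaired two-sample rank test ignores the pairing
_, p_unpaired = stats.mannwhitneyu(condition_a, condition_b)

# Nonparametric paired alternative: Wilcoxon signed-rank test on the differences
_, p_paired_rank = stats.wilcoxon(condition_a, condition_b)

# Parametric paired alternative: paired t-test
_, p_paired_t = stats.ttest_rel(condition_a, condition_b)

print(f"unpaired Mann-Whitney p       = {p_unpaired:.4f}")
print(f"paired Wilcoxon signed-rank p = {p_paired_rank:.4f}")
print(f"paired t-test p               = {p_paired_t:.4f}")
```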

One paper has a rather nonsensical description of a post hoc power analysis, clearly revealing a lack of insight. Another one seems to imply that running a Kolmogorov-Smirnov test amounts to just plotting two distribution functions together, without computing a p-value. The one paper that doesn’t run a formal statistical test nevertheless claims that a certain result is “significant”. This could have been formally tested, though the test would have been non-standard and not so simple. A statistical plot illustrates the result. I believe that a formal test would have confirmed significance due to a large sample size, and it would have been informative to actually run the test.
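For reference, a two-sample Kolmogorov-Smirnov test does more than overlaying two empirical distribution functions: it produces a test statistic and a p-value. A minimal sketch with made-up data (nothing here is taken from the paper):

```python
# Minimal sketch with made-up data: the two-sample Kolmogorov-Smirnov test returns
# the maximum distance between the two empirical distribution functions and a p-value.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.normal(0.0, 1.0, size=100)
y = rng.normal(0.3, 1.0, size=100)   # hypothetical location shift

res = stats.ks_2samp(x, y)
print(f"KS statistic = {res.statistic:.3f}, p-value = {res.pvalue:.4f}")
```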

All four papers with tests run several tests. Two of them don’t account for, or even mention, multiple testing. This is particularly problematic where it is only indicated whether “p<0.05”, because multiple testing corrections would require p-values much smaller than 0.05; the “p<0.05 paper” actually uses a multiple testing correction in one place, though not in another, and not in a way that lets the reader see how this plays out. I don’t think that running corrections for multiple testing is mandatory (this depends on what kind of conclusion is drawn), but if authors don’t demonstrate any awareness of the issue, this raises suspicion. Fortunately, almost all given p-values are either seriously small or comfortably bigger than 0.05, so that I hardly expect conclusions to be affected by multiple testing – unless there actually was more testing that was not reported.
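A minimal simulation sketch of my own (not based on any of the papers) of why this matters: with ten independent tests of true null hypotheses, a single raw “p<0.05” is quite likely to occur by chance, whereas requiring the Bonferroni/Holm threshold for the smallest p-value keeps the family-wise error rate near 5%.

```python
# Minimal simulation sketch: 10 independent two-sample t-tests per "paper",
# all null hypotheses true. How often does at least one test come out "significant"?
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n_sim, n_tests, n_per_group = 2000, 10, 30
any_raw = any_corrected = 0

for _ in range(n_sim):
    pvals = np.array([
        stats.ttest_ind(rng.normal(size=n_per_group),
                        rng.normal(size=n_per_group)).pvalue
        for _ in range(n_tests)
    ])
    any_raw += (pvals < 0.05).any()
    any_corrected += pvals.min() < 0.05 / n_tests   # first (and decisive) Holm/Bonferroni step

print(f"P(at least one raw p < 0.05)        ~ {any_raw / n_sim:.2f}")        # roughly 0.4
print(f"P(at least one corrected rejection) ~ {any_corrected / n_sim:.2f}")  # roughly 0.05
```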

On the positive side, the authors were interested in effect sizes (which are not confused with small p-values) and (mostly) gave intervals quantifying uncertainty. Also, they often plot the data in a way that lets the reader see what the distributions look like and how big the differences from what is expected under the null hypotheses actually are. One paper shows plots diagnosing model assumptions, stating that “the residuals are distributed uniformly around the zero line, indicating the adequacy of the model”, but the corresponding plot indicates clear heteroscedasticity, violating a model assumption.

My overall impression is somewhat ambivalent. There are misunderstandings and misinterpretations galore, but I don’t have the impression that any of them leads to a grossly wrong or misleading assessment of the subject matter. Of course I can’t rule out selective reporting or even computation errors or fake data, but none of the papers seems to hint at such a thing.

I’m fine with running a test as a routine device to check whether what was observed is compatible with meaningless random variation, e.g., “The concentration of sAA were similar between pasture and paddock (46.25 +/- 19.44 and 47.52 +/- 13.11 U/L, respectively; p = 0.7742) (…) BChol was found to be higher during pasture stable and lower in a statistically significant way when horses moved to paddock (12.44 +/- 6.30 and 5.58 +/- 2.39 mU/mL, respectively; p = 0.0068)” (Bazzano et al., 2024), with accompanying boxplots. The p-values here convey a message that is very relevant to the aim of the paper. If a difference cannot be told apart from what a model of meaningless random variation would produce, it certainly cannot be used as an indication of anything meaningful.
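To illustrate the kind of routine check I have in mind (this is not a reproduction of the Bazzano et al. analysis; the group sizes below are hypothetical, and the paper’s own test may well have been different), here is a sketch of a two-sample comparison computed directly from summary statistics:

```python
# Minimal sketch: a Welch two-sample t-test computed from summary statistics
# (mean, SD, n). Means and SDs echo the sAA comparison quoted above, but the
# group size n = 10 is hypothetical, so the resulting p-value is illustrative
# only and is not meant to reproduce the published one.
from scipy import stats

res = stats.ttest_ind_from_stats(
    mean1=46.25, std1=19.44, nobs1=10,   # pasture (hypothetical n)
    mean2=47.52, std2=13.11, nobs2=10,   # paddock (hypothetical n)
    equal_var=False,                     # Welch correction for unequal variances
)
print(f"t = {res.statistic:.2f}, p = {res.pvalue:.3f}")
```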

I don’t actually think that this message could have been conveyed any better with any of the above mentioned “alternatives to p-values”, at least given that information to assess effect sizes is presented as well. The word “significant” doesn’t do any harm here as far as I’m concerned.

Scepticism regarding the generalisability of such results is always justified anyway, and a p-value in isolation (even when accompanied by an interval of effect sizes) certainly doesn’t make a convincing discovery. There are always various ways to raise doubts. Ultimately I think that, in order to achieve the status of a properly reliable scientific discovery, a result needs to be confirmed from different angles and by different authors. A single study with limited scope is generally not enough. If this were generally accepted, quite a bit of the worst trouble with the overinterpretation of statistical tests would go out of the window.

Opponents of statistical tests could argue that in these papers there is a problem with binary thinking, which is encouraged by the binary logic of Neyman-Pearson type tests. Furthermore, the list of issues with testing in just the small set of papers considered here is rather long indeed.

I respond that (a) the tests serve a purpose, and (b) I don’t see how these problems will be solved using any other statistical method in the place of the tests (or just leaving them out). Most of the problems (e.g., confusion about paired observations; not being able to spot a model assumption violation) are on a level that will create issues with any statistical approach. Binary thinking will be brought in whenever an author wants to say that an observed effect is either meaningful or not, regardless of what method is used (a posterior probability for example can be thresholded just as well). Binary thinking (in the sense that “data show that either null hypothesis or alternative is true”) is inappropriate, but there is sometimes a need for binary decisions (e.g., should we use a method based on a normality assumption?), and language is essentially discrete, so any interpretation of a numerical result in words will imply some thresholding, if not necessarily transparently.

Statistical tests are the formalisation of an elementary intuition, namely that the observation of an event that is very unlikely under a certain probability model (which may involve fixed parameter values) indicates evidence against that model. This is probably the closest thing to falsification that we can have for data that come with random variation and models that allow for the possibility of any outcome. As such, the idea behind tests looks very simple. Admittedly, fleshing it out and getting it into use in science comes with difficulties and complexities. In particular, often the probability of any specific result is low; at least for continuous distributions the probability of any precise observation is 0. Obviously, just on this basis it wouldn’t make sense to say that there is evidence against the model whatever we observe. Instead, “statistical falsification” requires the specification of a rejection region (or rather regions at various levels, implying the definition of the p-value as “borderline rejection level”) in advance. There is more than one possible way to do this. Neyman and Pearson proposed to choose a test so that the probability of rejecting the null hypothesis is maximised under a certain alternative hypothesis of interest. Not taking for granted that reality follows either the null or the alternative hypothesis, such a test can be interpreted as saying that the test statistic and nominal alternative hypothesis suggest a certain direction of deviation from the null hypothesis in case of rejection, such as generating larger values on average. In some situations, tests may be preferred that work well at distinguishing bigger sets of null and alternative hypotheses from each other rather than being optimal against specific ones. There are further subtleties, for example the issue of multiple testing, i.e., the potentially large probability of finding a meaningless significance if enough tests are run.
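As a concrete illustration of the “borderline rejection level” reading of the p-value (my own sketch, not tied to any of the papers), here is a one-sided z-test in which the p-value is obtained once from the normal tail and once by simulating the null model:

```python
# Minimal sketch: one-sided z-test of H0: mu = 0 against larger values, with known
# sigma. The p-value is the probability, under the null model, of a test statistic
# at least as extreme as the one observed; equivalently, the smallest level alpha
# at which the observed statistic still falls into the rejection region.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n, sigma = 25, 1.0
x = rng.normal(0.4, sigma, size=n)      # hypothetical data with a true shift
z_obs = np.sqrt(n) * x.mean() / sigma   # test statistic

p_analytic = 1 - stats.norm.cdf(z_obs)  # upper tail probability under H0

# the same by Monte Carlo under the null model
z_null = np.sqrt(n) * rng.normal(0, sigma, size=(100_000, n)).mean(axis=1) / sigma
p_simulated = (z_null >= z_obs).mean()

print(f"observed z = {z_obs:.2f}")
print(f"analytic p = {p_analytic:.4f}, simulated p = {p_simulated:.4f}")
```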

Most proposed alternatives to statistical tests are however even more complex (this may be controversial, but I won’t elaborate on the complexities of any specific alternative here), and several of them require understanding tests first, which is already hard. The superficial simplicity of tests is a blessing as well as a curse, as people all over the place without much statistical insight feel encouraged (or even forced) to use them. I suspect that some users of tests don’t even care about understanding what they are doing; following the “ritual” and getting published is enough. Other users may feel that they understand the basic idea, but may not know about the subtleties.

I actually think that coming up with alternative ways to formalise the evidence in the data is laudable. Many of these approaches have advantages that are potentially useful in certain situations, and I don’t think that people who use them should be pushed back into using standard tests (in many cases doing both, running a test and on top of it giving an e-value, relevance value or similar, can be informative). The scope of a p-value is limited, and there is more information in the data regarding hypotheses of interest than can be captured in a single number. So complementing a p-value with other information is a good thing, highlighting certain issues of p-values is helpful as well, but I am pretty convinced that replacing p-values with alternative single numbers will not improve matters.

This seems to be an instance of “solutionism”: the hope that the problems with statistics in practice can be solved by new methodology without addressing the underlying lack of statistical competence. Statisticians tend to get more credit (and higher-level publications) for inventing new methodology than for increasing the understanding of existing methods, and the potential to come up with something that will later be used by generations of researchers is a strong incentive (even though this hope will be disappointed more often than not).

The aim of any initiative to improve the use of statistics should be to improve understanding. Pointing out misunderstandings and misinterpretations is necessary and worthwhile (reviewers should have asked for precise p-values instead of just the statement “p<0.05”, and of course they should not have let authors get away with using a two-sample Wilcoxon test for paired observations or a wrong interpretation of model diagnostics). I doubt that understanding can be improved by pushing people towards more complex methodology with which little experience exists. Changing methods or philosophy will not solve what really is the problem here. Wasserstein et al. (2019) and the papers in the Special Issue introduced by that editorial have some good things to say on how to improve data analysis in science (some of these papers explicitly advocate retaining statistical tests). Unfortunately the authors of the editorial chose to emphasise the red herring of “abandoning significance” over the more helpful aspects of their initiative.

References

Blume, J. D., Greevy, R. A., Welty, V. F., Smith, J. R., & Dupont, W. D. (2019). An Introduction to Second-Generation p-Values. The American Statistician, 73(sup1), 157-167. https://doi.org/10.1080/00031305.2018.1537893

Cole, S. R., Edwards, J. K., & Greenland, S. (2021). Surprise! American Journal of Epidemiology, 190(2), 191-193. https://doi.org/10.1093/aje/kwaa136

Grünwald, P., de Heide, R., & Koolen, W. M. (2023). Safe testing. arXiv, https://doi.org/10.48550/arXiv.1906.07801. To appear as a discussion paper in the Journal of the Royal Statistical Society.

Mayo, D. (2018). Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars. Cambridge University Press.

Stahel, W. A. (2021). New relevance and significance measures to replace p-values. PLoS ONE, 16(6), e0252991. https://doi.org/10.1371/journal.pone.0252991

Wasserstein, R., Schirm, A., & Lazar, N. (2019). Editorial: Moving to a World Beyond “p < 0.05”. The American Statistician, 73(sup1), 1-19. https://doi.org/10.1080/00031305.2019.1583913



7 thoughts on “Guest Post: Christian Hennig: “Statistical tests in five random research papers of 2024, and related thoughts on the ‘don’t say significant’ initiative”

  1. I am very grateful to Christian Hennig for this new, very thoughtful guest post reflecting on a few recent papers that use statistical testing. I nudged him to qualify his claim that “Statistical tests are the formalisation of an elementary intuition, namely that the observation of an event that is very unlikely under a certain probability model … indicates evidence against that model”, which he has. This reasoning, without very considerable qualification, leads to some of the most flagrant howlers of tests, so it is good that he explains this a bit, for N-P tests. But I claim the fallacy is avoided by Fisherian tests as well. The Bayesian epistemologist Michael Titelbaum (2022) introduces statistical significance testing by saying “consider the hypothesis that John plays soccer”. Since being a goalie is improbable among soccer players, Titelbaum maintains that “by the logic of significance testing” we may infer that John does not play soccer. (He gets the example from someone else, but that doesn’t excuse it.) Such an inference would be wrong with probability 1, since it is given that being a goalie implies being a soccer player (“goalies are rare soccer players”, p. 463). Titelbaum’s fallacy begins with supposing that an event like “John plays soccer” is a statistical hypothesis. The second fallacy is supposing that a p-value is a conditional probability (of an event given a hypothesis). It is not. Nevertheless, I admit, it is rarely explained just how the statistical falsification actually works. Any data described in sufficient detail are improbable. It is not the precise observational data that a statistical hypothesis predicts or explains. It is, rather, a statement about a (specially defined) test statistic d(x) (e.g., exceeding a specified value, or being in a specified rejection region). For example, Ho entails that with high probability d(x) will not exceed a critical value. So I’m glad Hennig qualified this, but I will write a post on the “soccer/goalie” fallacy, because some philosophy students are being introduced to statistical significance tests with this howler.

    One other thing: For all the limitations of the p-value, at least it can be correctly defined very informally in a sentence, in a way that explains the reasoning (e.g., if Pr(d(X) > d(x); Ho) is very low, then d(x) indicates ~Ho, for an appropriate test statistic d(x)). Can Hennig do this for the e-value?

  2. Thanks to both of you for this post – I have a couple questions.

    First, for Christian, how important would you say it is to include exact p-values in the abstract of a paper? For example, would you say it’s bad practice to write “we found significant effects X and Y” instead of “we found significant effects X (p = 0.02) and Y (p = 0.03)”? I’m asking because I know space is at a premium in this situation.

    Second, for both of you, I want to know what you think of this approach to describing testing. TL;DR: Do you think scientists would be willing to hear “statistical significance is the minimum, not the be-all and end-all”?

    I would like to say something like “statistical significance is a necessary but not sufficient condition for claiming a successful prediction”. In other words, and here I disagree with you somewhat, Christian, it doesn’t really matter whether p = 0.001 or p = 0.049 – if you’re over the hurdle, you’re over. I only say “somewhat” because p-values are correlated with proper measures of relative evidence (likelihood ratios) in a lot of situations. So I think it’s not really bad to use a p-value as evidence, just imprecise. And since it’s a minimum requirement, I don’t mind setting alpha higher than 0.05 before starting the experiment. I just want the scientist to honestly follow through on their plan. The real reason to focus on p-values IMO is to stop scientists from engaging in optional stopping or HARKing without penalty – I don’t want them to say such studies have the same rigor as if they had come from a fixed sample size.

    Henry

    • hwyneken:
      You contradict yourself by saying both (a) “The real reason to focus on p-values IMO is to stop scientists from engaging in optional stopping or HARKing without penalty – I don’t want them to say such studies have the same rigor as if they had come from a fixed sample size,” with which I heartily agree, and previously (b) “p-values are correlated with proper measures of relative evidence (likelihood ratios) in a lot of situations. So I think it’s not really bad to use a p-value as evidence, just imprecise.” Proper measures of evidence? Likelihood ratios do not pick up on optional stopping and other biasing selection effects: they follow the likelihood principle.
      Mayo

      • That’s a fair point – I think I meant “proper” as in a “mathematical comparison of how well the alternative hypothesis fits the data vs how well the null hypothesis fits the data”. I know that’s how likelihoodists define it. I don’t mind calling that “evidence” with a small e. I fully agree that a likelihood ratio does not deserve to count as Evidence for something without context. I think we’re in agreement that the likelihood principle is mostly bad news. It goes against my notions of common sense and fairness.

    • @hwyneken: “how important would you say it is to include exact p-values in the abstract of a paper?” I don’t have a strong opinion on whether the precise p-value should be stated in the abstract, given that it is elsewhere in the paper anyway. Before giving a general recommendation I’d want to see some empirical research on what readers take away from a paper depending on whether a precise p-value is in the abstract or not. It’ll probably also depend on disciplinary culture. Also note that significance testing requires a level, so one shouldn’t say “significant” in the abstract (or elsewhere) without specifying the level.

      Regarding your “significance as minimum requirement”, I am very sceptical. I think that requiring significance at the 0.05 level for publication was a big incentive for “fishing for significance”, and I actually share Wasserstein et al.’s mistrust of binary thinking. Binary decisions are sometimes required, but normally when reporting research results we can be more nuanced. I do realise that saying “significance is a minimum requirement” is more nuanced than saying “p<0.05 means discovery”; still, in my view p=0.0002 indicates much stronger evidence against the null hypothesis than p=0.049, and p=0.049 isn’t really much different from p=0.06 regarding the strength of evidence. Note also that if many tests are run in a single paper and only one or a few are significant, p=0.049 may easily happen without anything meaningful going on, whereas p=0.0002 (say, the smallest out of 10 tests) should still tell us something.

      Of course if actual binary decisions are made like when to stop or what other things to look at depending on certain earlier tests, a cutoff value specified in advance is required.

  3. @Mayo: My explanation regarding the “falsification intuition” of tests was not meant to apply to N-P tests exclusively. In any case a test needs to be constructed in such a way that the rejection region indicates a deviation from the null hypothesis that is relevant regarding the aim of the research. N-P tests are constructed as optimal against a specific alternative, which is a mathematically well defined way of achieving this. Alternatively to N-P optimality, we may be interested in distinguishing the null hypothesis from a bigger set of alternatives, where mathematical optimality may not be available. Fisherian tests allow for such an interpretation as well.

    Regarding e-values, I am the wrong person to ask, as I wrote a rather critical comment on Grünwald et al.’s paper (to be published together with their paper in JRSS) that says among other things that I doubt that these are as easy to explain as p-values.

    • Christian:

      I know you didn’t intend your qualification to be limited to N-P theory, but it was still not quite clear to me how it works for Fisher (and Fisher’s disjunctive argument doesn’t help). Looking at optimality properties, one might see the logic as behavioristic: reject H based on a rule with good performance properties, or the like. But you proposed a logic based loosely on taking x as evidence against a model or hypothesis that renders x improbable, and I often see misunderstandings of that logic. Fisher doesn’t make it clear, so we need to identify the reasoning.
      Mayo

