5-year review: Don’t let the tail wag the dog by being overly influenced by flawed statistical inferences


On June 1, 2019, I posted portions of an article [i], “There is Still a Place for Significance Testing in Clinical Trials,” in Clinical Trials, responding to the 2019 call to abandon significance. I reblog it here. While very short, it effectively responds to the 2019 movement (by some) to abandon the concept of statistical significance [ii]. I have recently been involved in researching drug trials for a condition of a family member, and I can say that I’m extremely grateful that trials are still reporting error statistical assessments of new treatments and using carefully designed statistical significance tests with thresholds. Without them, I think we’d be lost in a sea of potential treatments and clinical trials. Please share any of your own experiences in the comments. The emphasis in this excerpt is mine:

Much hand-wringing has been stimulated by the reflection that reports of clinical studies often misinterpret and misrepresent the findings of the statistical analyses. Recent proposals to address these concerns have included abandoning p-values and much of the traditional classical approach to statistical inference, or dropping the concept of statistical significance while still allowing some place for p-values. How should we in the clinical trials community respond to these concerns? Responses may vary from bemusement, pity for our colleagues working in the wilderness outside the relatively protected environment of clinical trials, to unease about the implications for those of us engaged in clinical trials….

However, we should not be shy about asserting the unique role that clinical trials play in scientific research. A clinical trial is a much safer context within which to carry out a statistical test than most other settings. Properly designed and executed clinical trials have opportunities and safeguards that other types of research do not typically possess, such as protocolisation of study design; scientific review prior to commencement; prospective data collection; trial registration; specification of outcomes of interest including, importantly, a primary outcome; and others. For randomised trials, there is even more protection of scientific validity provided by the randomisation of the interventions being compared. It would be a mistake to allow the tail to wag the dog by being overly influenced by flawed statistical inferences that commonly occur in less carefully planned settings….

Furthermore, the research question addressed by clinical trials (comparing alternative strategies) fits well with such an approach and the corresponding decision-making settings (e.g. regulatory agencies, data and safety monitoring committees and clinical guideline bodies) are often ones within which statistical experts are available to guide interpretation. The carefully designed clinical trial based on a traditional statistical testing framework has served as the benchmark for many decades. It enjoys broad support in both the academic and policy communities. There is no competing paradigm that has to date achieved such broad support. The proposals for abandoning p-values altogether often suggest adopting the exclusive use of Bayesian methods. For these proposals to be convincing, it is essential their presumed superior attributes be demonstrated without sacrificing the clear merits of the traditional framework. Many of us have dabbled with Bayesian approaches and find them to be useful for certain aspects of clinical trial design and analysis, but still tend to default to the conventional approach notwithstanding its limitations. While attractive in principle, the reality of regularly using Bayesian approaches on important clinical trials has been substantially less appealing – hence their lack of widespread uptake.

The issues that have led to the criticisms of conventional statistical testing are of much greater concern where statistical inferences are derived from observational data. … Even when the study is appropriately designed, there is also a common converse misinterpretation of statistical tests whereby the investigator incorrectly infers and reports that a non-significant finding conclusively demonstrates no effect. However, it is important to recognise that an appropriately designed and powered clinical trial enables the investigators to potentially conclude there is ‘no meaningful effect’ for the principal analysis.[iii]  More generally, these problems are largely due to the fact that many individuals who perform statistical analyses are not sufficiently trained in statistics. It is naive to suggest that banning statistical testing and replacing it with greater use of confidence intervals, or Bayesian methods, or whatever, will resolve any of these widespread interpretive problems. Even the more modest proposal of dropping the concept of ‘statistical significance’ when conducting statistical tests could make things worse. By removing the prespecified significance level, typically 5%, interpretation could become completely arbitrary. It will also not stop data-dredging, selective reporting, or the numerous other ways in which data analytic strategies can result in grossly misleading conclusions.

These considerations notwithstanding, the field of clinical trials is in rapid evolution and it is entirely possible and appropriate that the statistical framework used for their evaluation must also change. However, such evolution should emerge from careful methodological research and open-minded, self-critical enquiry. We earnestly hope that Clinical Trials will continue to be seen as a natural academic home for exploration and debate about alternative statistical frameworks for making inferences from clinical trials. The Editors welcome articles that evaluate or debate the merits of such alternative paradigms along with the conventional one within the context of clinical trials. Especially welcome are exemplar trial articles and those which are illustrated using practical examples from clinical trials that permit a realistic evaluation of the strengths and weaknesses of the approach.

You can read the full article here.

Please share your comments.

*****************************************************************

[i] Jonathan A. Cook, Dean A. Fergusson, Ian Ford, Mithat Gonen, Jonathan Kimmelman, Edward L. Korn, and Colin B. Begg (2019). “There is still a place for significance testing in clinical trials,” Clinical Trials, Vol. 16(3), pp. 223–224.

[ii] Back in 2019, I was trying to find an apt acronym. I played with the idea of calling those driven to Stop Error Statistical Tests “Obsessed”. I thank Nathan Schachtman for sending me the article.

[iii] It’s disappointing how many critics of tests seem unaware of this simple power analysis point, and how it avoids egregious fallacies of non-rejection (or of a moderate P-value). It precisely follows simple significance test reasoning. The severity account that I favor gives a more custom-tailored approach that is sensitive to the actual outcome. (See, for example, Excursion 5 of Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (2018, CUP).)
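To make the power-analysis point concrete, here is a minimal sketch in Python (my own numerical illustration with hypothetical settings, not an example from the article): a one-sided test of H0: mu ≤ 0 with known sigma, where the trial has roughly 80% power to detect a clinically meaningful effect delta. A non-significant result then licenses the claim “mu < delta” with high severity, computed from the actual outcome rather than from the cutoff alone.

```python
# Sketch (hypothetical settings): one-sided test of H0: mu <= 0 vs H1: mu > 0,
# known sigma, with the sample mean xbar as the test statistic.
import numpy as np
from scipy.stats import norm

n, sigma, alpha = 100, 10.0, 0.05     # hypothetical trial settings
se = sigma / np.sqrt(n)               # standard error of the mean
cutoff = norm.ppf(1 - alpha) * se     # reject H0 when xbar > cutoff

delta = 2.5                           # smallest clinically meaningful effect
power = 1 - norm.cdf(cutoff, loc=delta, scale=se)
print(f"power against mu = {delta}: {power:.3f}")       # about 0.80 here

# Observed result: non-significant (xbar well below the cutoff).
xbar_obs = 0.5
p_value = 1 - norm.cdf(xbar_obs / se)
print(f"observed one-sided p-value: {p_value:.3f}")     # about 0.31

# Severity for the claim 'mu < delta', given this non-rejection: the
# probability of a result larger than the one observed, were mu equal to delta.
severity = 1 - norm.cdf(xbar_obs, loc=delta, scale=se)
print(f"severity for 'mu < {delta}': {severity:.3f}")   # about 0.98
```

The same non-significant p-value in an underpowered trial would give a much lower severity for “mu < delta”, which is exactly how the fallacy of treating non-rejection as proof of no effect is blocked.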

[iv] Bayes factors, like other comparative measures, are not “tests”, and do not falsify (even statistically). They can only say one hypothesis or model is better than a selected other hypothesis or model, based on some^ selected criteria. They can both (all) be improbable, unlikely, or terribly tested. One can always add a “falsification rule”, but it must be shown that the resulting test avoids frequently passing/failing claims erroneously. 

^The Anti-Testers would have to say “arbitrary criterion”, to be consistent with their considering any P-value “arbitrary”, and denying that a statistically significant difference, reaching any P-value, indicates a genuine difference from a reference hypothesis.
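A small numerical illustration of this comparative point (my own sketch, with invented numbers): a Bayes factor between two point hypotheses can overwhelmingly favor one even when the data would statistically reject both, so “better supported than a chosen rival” is not the same as “well tested”.

```python
# Sketch (invented numbers): a Bayes factor between two point hypotheses,
# both of which sit far from the observed data.
import numpy as np
from scipy.stats import norm

n, sigma = 25, 10.0
se = sigma / np.sqrt(n)        # = 2.0
xbar = 0.0                     # observed sample mean

# Two rival point hypotheses for the mean, both poorly supported by xbar = 0:
mu1, mu2 = 10.0, 14.0
bf_12 = norm.pdf(xbar, loc=mu1, scale=se) / norm.pdf(xbar, loc=mu2, scale=se)
print(f"Bayes factor for mu={mu1} over mu={mu2}: {bf_12:.1e}")  # over 100,000

# Yet a significance test of either point hypothesis rejects it decisively:
for mu in (mu1, mu2):
    p = 2 * norm.cdf(-abs(xbar - mu) / se)     # two-sided p-value
    print(f"p-value testing mu = {mu}: {p:.1e}")
```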


Categories: 5-year memory lane, abandon statistical significance, statistical tests

9 thoughts on “5-year review: Don’t let the tail wag the dog by being overly influenced by flawed statistical inferences”

  1. Mircea Zloteanu

    My concern with promoting (or at least not seriously flagging issues with) p-values and NHST is that they are only valid in very specific cases (i.e. experiments). Clinical RCTs, where you can make assumptions about the distribution of p-values under the null (i.e. uniform), are exactly (only?) where they do work (see the simulation sketch after this comment).

    However, p-values also appear in quasi-experimental and observational studies, where they make little sense for long term error control (although some attempts at empirical calibration exist). What should those researchers do, and what claims can they make? I rarely see guidance for them.

    Also, I find it difficult to see how frequentist sequential testing is preferable (read: less harmful to potential patients) to a Bayesian approach in which data collection (subjecting patients to a procedure) can be stopped, after checking each new data point, as soon as an acceptable threshold has been reached.
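Mircea’s premise about the null distribution can be checked directly by simulation: in a randomized two-arm comparison with no true effect and a correctly specified test, p-values are uniform, so P(p ≤ α) ≈ α for any threshold α. A minimal sketch, assuming normally distributed outcomes and a two-sample t-test:

```python
# Sketch: under the null (no treatment effect) in a randomized two-arm
# comparison, two-sample t-test p-values are (approximately) uniform on [0, 1].
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
n_per_arm, n_trials = 50, 20_000
pvals = np.empty(n_trials)
for i in range(n_trials):
    treated = rng.normal(loc=0.0, scale=1.0, size=n_per_arm)  # no true effect
    control = rng.normal(loc=0.0, scale=1.0, size=n_per_arm)
    pvals[i] = ttest_ind(treated, control).pvalue

# Uniformity check: roughly a fraction alpha of the p-values fall below alpha.
for alpha in (0.01, 0.05, 0.10, 0.50):
    print(f"P(p <= {alpha}) is about {np.mean(pvals <= alpha):.3f}")
```

It is precisely this guarantee that selective reporting, data-dependent stopping, and unmodelled confounding in observational settings can destroy.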

    • Mircea:
      Thanks for your comment.
      I disagree. Statistical significance tests can be used to test assumptions, and in fact that is one of their most important functions. How do I test your degrees of belief? As for "non-subjective Bayes", there is no agreement among the several rival systems. Granted, there are, in some cases, frequentist matching priors, but the meaning is quite different. Empirical Bayes can be used in some cases. I concur with the authors: they need to demonstrate their abilities.
      The bottom line is that all statistical methods use model and other assumptions. Statistical significance tests require the fewest assumptions, can test their own assumptions, and error statistical methods in general only require that the error probabilities hold approximately.

      Adaptive trials are regularly conducted in frequentist frameworks, but they need to take account of interim monitoring to ensure error control. See this post.

      Should Bayesian Clinical Trialists Wear Error Statistical Hats? (i)


      The authors, who are Bayesians, show that the type I error was inflated in the Bayesian adaptive designs through incorporation of interim analyses that allowed early stopping for efficacy and without adjustments to account for multiplicity.

      The Type I error probability can actually exceed 0.5! And this is radiation oncology. There’s a reason such designs are, so far, only allowed in exploratory trials.
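Here is a toy simulation of the general phenomenon (a sketch with made-up settings, not the cited study’s actual designs): with no true treatment effect, testing after every block of patients and stopping at the first nominal p < 0.05, with no multiplicity adjustment, drives the overall Type I error rate far above 0.05.

```python
# Sketch: unadjusted interim looks inflate the Type I error rate.
# No true treatment effect; stop the 'trial' at the first look with p < 0.05.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(1)
n_trials, block, n_looks, alpha = 2_000, 10, 20, 0.05

false_positives = 0
for _ in range(n_trials):
    treated = rng.normal(size=0)          # start with no patients enrolled
    control = rng.normal(size=0)
    for _ in range(n_looks):
        treated = np.concatenate([treated, rng.normal(size=block)])
        control = np.concatenate([control, rng.normal(size=block)])
        if ttest_ind(treated, control).pvalue < alpha:   # unadjusted look
            false_positives += 1
            break

print(f"Type I error with {n_looks} unadjusted looks: "
      f"{false_positives / n_trials:.3f}")   # well above the nominal 0.05
```

The more looks that are allowed, the higher the rate climbs, which is the sense in which unadjusted interim monitoring destroys error control.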

      These Bayesian authors admit that "Bayesian interim monitoring may violate the weak repeated sampling principle [Cox and Hinkley 1974] which states that, 'We should not follow procedures which for some possible parameter values would give, in hypothetical repetitions, misleading conclusions most of the time'."
      That would seem to be strong evidence indeed for either avoiding or adjusting for multiplicities. Thus, in regulatory practice, a Bayesian might not wish to wear a strict Bayesian hat.

  2. From a patient perspective, I wonder whether debates about significance testing distract from the question of how to draw inferences from aggregate data that apply to the individual. If an experimental group shows better outcomes on average than a control group, I don’t care as much about whether the difference is “significant” as I do about how individual outcomes contributed to the means. Did a few people in the experimental group improve a lot, or did a lot of them improve a little? If the mean difference is small, I can imagine not caring whether or not it’s “significant” in any sense of the term.

    • Ken:

      Thanks for your comment. I think the debates about statistical significance testing, especially those growing out of the 2019 Wasserstein move to “abandon significance”, have distracted from just about all the interesting questions of statistical inference—questions that had been the focus before their negative campaign. Sure, it’s useful to review the age-old fallacies (p-values aren’t posteriors; they aren’t measures of effect size; their interpretation requires knowing how many tests have been done, and other multiplicities; no evidence against isn’t evidence for; statistical significance is distinct from substantive scientific inference). But the Wasserstein, Schirm, and Lazar (2019) editorial doesn’t focus on these fixable problems; it is keen to ridicule significance, calling it “meaningless” and “thoughtless”, and denying that small p-values can ever supply evidence of the presence of a genuine effect! (See my last post for details.) Criticisms of other methods, especially anything Bayesian, are largely off the table! It’s mostly a matter of (statistical) tribal warfare, but only certain tribes are allowed weapons.

      So I agree. I think you may be referring to something called “fragility indexes”? It’s interesting that I don’t recall that coming up in any of the proposals to fix statistical significance. I see that they are also open to some controversy.

      https://www.pnas.org/doi/full/10.1073/pnas.2105254118

      Are these widely used and reported? I’d love for someone to do a guest post on them.

      I’m not sure if those indices look at the effect size contributed by each patient. What do researchers do when they find the results have high fragility (which I guess means a low number of patients could switch the results from positive to negative, or the other way around)? I’m not familiar with this and would like to learn more.

      • kspringed53fa8cc8c

        Hi Deborah,

        The three or four times I’ve seen fragility indices in studies, they didn’t seem useful, because the authors just made generic inferences that could be derived from the p values themselves. For instance, when the fragility index for a significant effect is low, the authors may add a little note calling for caution, because, hypothetically, the effect could’ve readily been non-significant, or the effect might not turn up significant in a replication. Since FIs and p values are highly correlated, there’s nothing very informative here – no attention to the difference between what a p value tells us vs. what a fragility index tells us.

        Supposedly fragility indices are useful when they alert you that the number of patients lost to follow-up in a particular study exceeds the number of patients whose outcomes, if different, would’ve made significant effects non-significant or vice versa. I guess this is useful, but there are many more patients who weren’t included in the study in the first place, so you might still question the specific usefulness.

        You mentioned not seeing fragility indices as coming up in proposals to fix statistical significance. I don’t think FIs could be a fix, because they build on whatever approach to determining significance the researcher uses. If it’s an inferential test and alpha is .05, that remains the standard for significance against which the FI is interpreted.
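For readers unfamiliar with the measure, here is a minimal sketch of the usual recipe (hypothetical 2×2 counts, not from any study discussed here): starting from a “significant” Fisher exact test, flip outcomes one patient at a time in the arm with fewer events until the p-value crosses 0.05; the number of flips is the fragility index.

```python
# Sketch: fragility index for a 'significant' 2x2 trial result (hypothetical
# counts). Convert non-events to events, one patient at a time, in the arm
# with fewer events until Fisher's exact test is no longer significant.
from scipy.stats import fisher_exact

def fragility_index(events_a, n_a, events_b, n_b, alpha=0.05):
    """Smallest number of outcome flips that pushes the p-value above alpha."""
    e_a, e_b, flips = events_a, events_b, 0
    _, p = fisher_exact([[e_a, n_a - e_a], [e_b, n_b - e_b]])
    while p < alpha:
        if e_a <= e_b:          # flip a patient in the arm with fewer events
            e_a += 1
        else:
            e_b += 1
        flips += 1
        _, p = fisher_exact([[e_a, n_a - e_a], [e_b, n_b - e_b]])
    return flips

# Hypothetical trial: 10/100 events on treatment versus 25/100 on control.
_, p0 = fisher_exact([[10, 90], [25, 75]])
print(f"initial p = {p0:.4f}, fragility index = {fragility_index(10, 100, 25, 100)}")
```

Because the count is defined relative to whatever significance cutoff the trial used (here 0.05), it inherits that convention rather than replacing it, which is the point made above.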

        • Ken:
          I don’t see how you could derive a fragility index from the p-value alone. Anyway, I mentioned it only because your earlier comment called it to mind. Perhaps you were thinking more along the lines of personalized medicine. You can find some posts by Stephen Senn on this blog on the topic. He’s skeptical.

          • kspringed53fa8cc8c

            Deborah,

            My bad for using the term “derive.” I didn’t mean anyone derives a fragility index from a p-value. I meant that those authors drew implications from their fragility indices that could’ve been drawn from their reported p values. They frame the implications in a very generic way (e.g., “Type I error is a concern”), so I didn’t see why they bothered calculating the FI.

            My earlier comment was just a patient’s perspective on abandoning significance testing. Assuming a large-sample RCT, I think patients would find it much more useful to know the proportions of individuals in each group who improved to some clinically meaningful extent, as opposed to knowing whether group means differed significantly, according to any definition of significance. I mean, if they had to choose, they wouldn’t choose significance testing.

            Thanks for guiding me to Senn’s posts.

            • Ken:
              “I mean, if they had to choose, they wouldn’t choose significance testing.”
              That’s too bad, because it is statistical significance testing that drives randomized controlled trials. That’s what gives a difference of proportions its relevance. It would be a completely meaningless report without knowing the expected variability of the effect. There are methods aside from RCTs, but they all bend over backwards to try to achieve what RCTs do.
              And while you’re looking at Senn on this blog, check out his brilliant insights on RCTs (e.g., against some critics from philosophy of science) – he’s a world expert.

              • kspringed53fa8cc8c

                I’ve seen too many RCTs claiming to show that some treatment works, but the mean difference between groups on the DV is small enough to be consistent with the possibility that the majority of participants did not benefit from treatment. I’m not impugning significance testing but merely suggesting that in cases like that, patients wouldn’t find the “expected variability of the effect” to be meaningful without also knowing something about outcome frequencies.
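The contrast in this comment can be made concrete with invented numbers: two treatment groups with the same mean improvement, one in which nearly everyone improves a little and one in which a small minority improves a lot. A sketch:

```python
# Sketch (invented numbers): identical mean improvements, very different
# patterns of individual benefit.
import numpy as np

rng = np.random.default_rng(7)
n = 1000

# Group A: nearly everyone improves a little (mean change about 2 points).
change_a = rng.normal(loc=2.0, scale=1.0, size=n)

# Group B: about 20% of patients improve a lot (~10 points); the rest not at all.
responder = rng.random(n) < 0.2
change_b = rng.normal(loc=0.0, scale=1.0, size=n) + np.where(responder, 10.0, 0.0)

for name, change in [("A", change_a), ("B", change_b)]:
    print(f"group {name}: mean change = {change.mean():.2f}, "
          f"improved at all: {np.mean(change > 0):.0%}, "
          f"improved by >= 5 points: {np.mean(change >= 5):.0%}")
```

Both patterns show the same average advantage, so a mean difference and its significance test cannot distinguish them; reporting the proportion of patients who reach a clinically meaningful improvement can.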

