Don’t let the tail wag the dog by being overly influenced by flawed statistical inferences


An article [i],“There is Still a Place for Significance Testing in Clinical Trials,” appearing recently in Clinical Trials, while very short, effectively responds to recent efforts to stop error statistical testing [ii]. We need more of this. Much more. The emphasis in this excerpt is mine: 

Much hand-wringing has been stimulated by the reflection that reports of clinical studies often misinterpret and misrepresent the findings of the statistical analyses. Recent proposals to address these concerns have included abandoning p-values and much of the traditional classical approach to statistical inference, or dropping the concept of statistical significance while still allowing some place for p-values. How should we in the clinical trials community respond to these concerns? Responses may vary from bemusement, pity for our colleagues working in the wilderness outside the relatively protected environment of clinical trials, to unease about the implications for those of us engaged in clinical trials….

However, we should not be shy about asserting the unique role that clinical trials play in scientific research. A clinical trial is a much safer context within which to carry out a statistical test than most other settings. Properly designed and executed clinical trials have opportunities and safeguards that other types of research do not typically possess, such as protocolisation of study design; scientific review prior to commencement; prospective data collection; trial registration; specification of outcomes of interest including, importantly, a primary outcome; and others. For randomised trials, there is even more protection of scientific validity provided by the randomisation of the interventions being compared. It would be a mistake to allow the tail to wag the dog by being overly influenced by flawed statistical inferences that commonly occur in less carefully planned settings….

The carefully designed clinical trial based on a traditional statistical testing framework has served as the benchmark for many decades. It enjoys broad support in both the academic and policy communities. There is no competing paradigm that has to date achieved such broad support. The proposals for abandoning p-values altogether often suggest adopting the exclusive use of Bayesian methods. For these proposals to be convincing, it is essential their presumed superior attributes be demonstrated without sacrificing the clear merits of the traditional framework. Many of us have dabbled with Bayesian approaches and find them to be useful for certain aspects of clinical trial design and analysis, but still tend to default to the conventional approach notwithstanding its limitations. While attractive in principle, the reality of regularly using Bayesian approaches on important clinical trials has been substantially less appealing – hence their lack of widespread uptake.

The issues that have led to the criticisms of conventional statistical testing are of much greater concern where statistical inferences are derived from observational data. … Even when the study is appropriately designed, there is also a common converse misinterpretation of statistical tests whereby the investigator incorrectly infers and reports that a non-significant finding conclusively demonstrates no effect. However, it is important to recognise that an appropriately designed and powered clinical trial enables the investigators to potentially conclude there is ‘no meaningful effect’ for the principal analysis.[iii]  More generally, these problems are largely due to the fact that many individuals who perform statistical analyses are not sufficiently trained in statistics. It is naive to suggest that banning statistical testing and replacing it with greater use of confidence intervals, or Bayesian methods, or whatever, will resolve any of these widespread interpretive problems. Even the more modest proposal of dropping the concept of ‘statistical significance’ when conducting statistical tests could make things worse. By removing the prespecified significance level, typically 5%, interpretation could become completely arbitrary. It will also not stop data-dredging, selective reporting, or the numerous other ways in which data analytic strategies can result in grossly misleading conclusions.

You can read the full article here.

We may reject, with reasonable severity, the supposition that promoting correctly interpreted statistical tests is the real goal of the most powerful leaders of the movement to Stop Statistical Tests. The goal is just to stop (error) statistical tests altogether.[iv] That today’s CI leaders advance this goal is especially unwarranted and self-defeating, in that confidence intervals are just inversions of N-P tests, and were developed at the same time by the same man (Neyman) who developed (with E. Pearson) the theory of error statistical tests. See this recent post.

Reader: I’ve placed on draft a number of posts while traveling in England over the past week, but haven’t had the chance to read them over, or find pictures for them. This will soon change, so stay tuned!


[i] Jonathan A Cook, Dean A Fergusson, Ian Ford , Mithat Gonen, Jonathan Kimmelman, Edward L Korn and Colin B Begg (2019). “There is still a place for significance testing in clinical trials”, Clinical Trials 2019, Vol. 16(3) 223–224.

[ii] Perhaps we should call those driven to Stop Error Statistical Tests “Obsessed”. I thank Nathan Schachtman for sending me the article.

[iii] It’s disappointing how many critics of tests seem unaware of this simple power analysis point, and how it avoids egregious fallacies of non-rejection, or moderate P-value. It precisely follows simple significance test reasoning. The severity account that I favor gives a more custom-tailored approach that is sensitive to the actual outcome. (See, for example, Excursion 5 of Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (2018, CUP). 

[iv] Bayes factors, like other comparative measures, are not “tests”, and do not falsify (even statistically). They can only say one hypothesis or model is better than a selected other hypothesis or model, based on some^ selected criteria. They can both (all) be improbable, unlikely, or terribly tested. One can always add a “falsification rule”, but it must be shown that the resulting test avoids frequently passing/failing claims erroneously. 

^The Anti-Testers would have to say “arbitrary criterion”, to be consistent with their considering any P-value “arbitrary”, and denying that a statistically significant difference, reaching any P-value, indicates a genuine difference from a reference hypothesis.




Categories: statistical tests

Post navigation

Comments are closed.

Blog at