Human Technology Interaction
Eindhoven University of Technology
Preventing journal editors from making fools of themselves
In a recent editorial, Mayo (2021) warns journal editors not to call for author guidelines that reflect a particular statistical philosophy, and not to go beyond merely enforcing the proper use of significance tests. That such a warning is needed at all should embarrass anyone working in statistics. And yet, a mere three weeks after Mayo’s editorial was published, the need for such warnings was reinforced when a co-editorial by journal editors from the International Society of Physiotherapy (Elkins et al., 2021), titled “Statistical inference through estimation: recommendations from the International Society of Physiotherapy Journal Editors”, stated: “[This editorial] also advises researchers that some physiotherapy journals that are members of the International Society of Physiotherapy Journal Editors (ISPJE) will be expecting manuscripts to use estimation methods instead of null hypothesis statistical tests.”
This co-editorial by journal editors in the field of physiotherapy shows the incompetence that typically underlies bans of p-values – because let’s be honest, it is always the p-value and associated significance tests that are banned, even though empirical research has shown that confidence intervals and Bayes factors are misused and misinterpreted as much as, or more than, p-values (Fricker et al., 2019; Hoekstra et al., 2014; Wong et al., 2021). In the co-editorial, the no-doubt well-intentioned physiotherapy journal editors recommend “Estimation as an alternative approach for statistical inference”. At first glance, one might think this means the editors are recommending estimation as an alternative approach to statistical tests. In other words, we would expect to see questions that are answered by effect size estimates, and not by dichotomous claims about the presence or absence of effects. But then the editors write the following (page 3):
“The estimate and its confidence interval should be compared against the ‘smallest worthwhile effect’ of the intervention on that outcome in that population. The smallest worthwhile effect is the smallest benefit from an intervention that patients feel outweighs its costs, risk and other inconveniences. If the estimate and the ends of its confidence interval are all more favourable than the smallest worthwhile effect, then the treatment effect can be interpreted as typically considered worthwhile by patients in that clinical population.”
This is confused advice, at best. The statistical inference the editors want researchers to make is a dichotomous claim, made on the basis of whether a confidence interval excludes the smallest effect size of interest. This procedure is mathematically equivalent to declaring significance when p < alpha in a test against the smallest worthwhile effect. The question whether a treatment effect is worthwhile or not is logically answered by a dichotomous ‘yes’ or ‘no’. An effect size estimate by itself does not tell us whether the observed effect should be regarded as random noise around a true effect size of zero, or as a non-zero effect.
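A minimal sketch of why the two decision rules coincide, using hypothetical numbers and a normal approximation (the estimate, standard error, and smallest worthwhile effect below are invented for illustration): the lower bound of a two-sided 95% confidence interval lies above the smallest worthwhile effect exactly when a one-sided test against that effect yields p < .025.

```python
import math

def one_sided_p(estimate, se, sesoi):
    """One-sided p-value for H0: true effect <= sesoi (normal approximation)."""
    z = (estimate - sesoi) / se
    # P(Z >= z) for a standard normal Z
    return 1 - 0.5 * (1 + math.erf(z / math.sqrt(2)))

def ci_lower(estimate, se):
    """Lower bound of a two-sided 95% normal confidence interval."""
    return estimate - 1.959964 * se  # 1.959964 is the 97.5th percentile of Z

# Hypothetical treatment effect estimate, standard error, and
# smallest worthwhile effect (all invented numbers)
estimate, se, sesoi = 0.8, 0.25, 0.3

excludes_sesoi = ci_lower(estimate, se) > sesoi
significant = one_sided_p(estimate, se, sesoi) < 0.025
print(excludes_sesoi, significant)  # the two decision rules always agree
```

The point is not the particular numbers: for any estimate and standard error, “the 95% CI lies entirely above the smallest worthwhile effect” and “p < .025 in a one-sided test against that effect” are the same claim, so the editors’ procedure is a significance test in disguise.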
The editors should clearly have followed Mayo’s (2021) advice to not go beyond enforcing the proper use of significance tests. Estimation and significance testing answer two different questions. Estimation can’t, as the physiotherapists hope, replace significance tests. The conflict between the two approaches becomes apparent when we ask ourselves how researchers who want to publish in these physiotherapy journals should deal with situations where they would lower the alpha level to correct for multiple comparisons or sequential analyses. Are authors required to report a 99% confidence interval in cases where they would have used a Bonferroni correction when examining 5 independent test results, because they would otherwise have divided the 5% alpha by five? Or should they ignore error rates, and make claims based on a 95% confidence interval, even when this would lead to many more articles claiming treatments are beneficial than we currently find acceptable? Related applied questions face researchers who want to publish in physiotherapy journals: which confidence interval should they report to begin with (a 95% confidence interval is based on the idea that a maximum 5% error rate is deemed acceptable when making dichotomous claims, but a desired accuracy requires a different justification), and how should they justify their sample size (will editors accept papers with any sample size, or do they still expect an a priori power analysis based on low Type 1 and Type 2 error rates when making claims about effect sizes)?
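The arithmetic behind the 99% figure, as a quick illustration: a Bonferroni correction divides the 5% alpha over the five tests, and a confidence interval that matches the corrected per-test alpha must use the complementary confidence level.

```python
def bonferroni_ci_level(alpha=0.05, m=5):
    """Confidence level whose two-sided interval matches a
    Bonferroni-corrected per-test alpha of alpha/m."""
    return 1 - alpha / m

# With alpha = .05 and five comparisons, each interval must be a 99% CI
print(bonferroni_ci_level(0.05, 5))
```

This is exactly why the recommendation is incoherent: the width of the interval one should report depends on an error-rate justification, which is the significance-testing logic the editors claim to have abandoned.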
As Mayo (2021) writes, “The key function of statistical tests is to constrain the human tendency to selectively favor views they believe in.” Fricker and colleagues (2019) show how removing p-values and significance testing from the journal Basic and Applied Social Psychology has led to the publication of articles in which claims are made that have a much higher probability of being wrong than was the case before p-values were banned, without this high error rate being transparently communicated. Anyone who reads physiotherapy journals that follow the editors’ guidelines to use ‘estimation’ needs to be prepared for the same development in those journals. As Mayo (2021) notes in her editorial, banning proper uses of thresholds in significance tests makes it “harder to hold data dredgers culpable for reporting a nominally small p value obtained through data dredging”.
The statistical philosophy of estimation is not designed to answer questions about the presence or absence of a beneficial effect. That a large group of journal editors believes it can shows how rational thought often takes a backseat when journal editors start making recommendations about how to improve statistical inferences.
What can journal editors require to avert incoherent recommendations that force researchers to use approaches that do not answer the questions they are asking? The answer is simple: they should require a coherent approach to statistical inferences, anchored in an epistemology, that answers the question a researcher is interested in. The task of journals is to evaluate the quality of the work that is submitted, not to dictate the questions researchers ask. Of course, a journal can declare that work in which no scientific claims are made, or in which claims are made without any control of the rate at which those claims are wrong, is its definition of ‘high quality’ – I would look forward to the arguments for such a viewpoint, and doubt they would be convincing. Let’s hope Mayo’s (2021) editorial prevents similar groups of journal editors from making fools of themselves in the future.
See Brian Haig’s commentary next.
- Elkins, M. R., Pinto, R. Z., Verhagen, A., Grygorowicz, M., Söderlund, A., Guemann, M., Gómez-Conesa, A., Blanton, S., Brismée, J.-M., Ardern, C., Agarwal, S., Jette, A., Karstens, S., Harms, M., Verheyden, G., & Sheikh, U. (2021). Statistical inference through estimation: Recommendations from the International Society of Physiotherapy Journal Editors. Journal of Physiotherapy. https://doi.org/10.1016/j.jphys.2021.12.001
- Fricker, R. D., Burke, K., Han, X., & Woodall, W. H. (2019). Assessing the Statistical Analyses Used in Basic and Applied Social Psychology After Their p-Value Ban. The American Statistician, 73(sup1), 374–384. https://doi.org/10.1080/00031305.2018.1537892
- Hoekstra, R., Morey, R. D., Rouder, J. N., & Wagenmakers, E.-J. (2014). Robust misinterpretation of confidence intervals. Psychonomic Bulletin & Review, 21(5), 1157–1164. https://doi.org/10.3758/s13423-013-0572-3
- Mayo, D. (2021). The Statistics Wars and Intellectual Conflicts of Interest.
- Wong, T. K., Kiers, H., & Tendeiro, J. (2021). On the Potential Mismatch between the Function of the Bayes Factor and Researchers’ Expectations.