Monthly Archives: July 2019

Summer Seminar in PhilStat Participants and Special Invited Speakers


Participants in the 2019 Summer Seminar in Philosophy of Statistics


Renée Bolinger, Asst. Professor
Dept of Politics and the Center for Human Values, Princeton University


Lok Chan, Post Doc
Social Science Research Institute, Duke University


Marcello Di Bello, Asst. Professor
Dept of Philosophy, Lehman College CUNY




John Douard, Appellate Staff Attorney
N.J. Office of the Public Defender




Georgi Gardiner, Junior Research Fellow,
St. John’s College, Oxford University
Asst. Professor, Dept. of Philosophy, University of Tennessee


Ruobin (Robin) Gong, Asst. Professor
Department of Statistics and Biostatistics, Rutgers University


Jennifer Juhn, Asst. Professor
Dept of Philosophy, Duke University


Molly Kao, Asst. Professor
Dept. of Philosophy, Université de Montréal



Jesse Krijthe, Post Doc, Data Science
Institute for Computing and Information Sciences, Radboud University



Jonathan Livengood, Assoc. Professor
Dept. of Philosophy, University of Illinois at Urbana-Champaign


Jolynn Pek, Asst. Professor
Dept. of Psychology, Ohio State University




Jonah Schupbach, Assoc. Professor
Dept. of Philosophy, University of Utah




Elay Shech, Asst. Professor
Department of Philosophy, Auburn University



Riet van Bork, Asst. Professor
Department of Psychological Methods, University of Amsterdam


Brian Zaharatos, Director,
Professional MS in Applied Mathematics and Instructor
Dept. of Applied Mathematics, University of Colorado, Boulder



Special Invited Speakers


Professor of Statistics, Political Science, Columbia University (mini bio)




Consultant Statistician, Edinburgh (mini bio) (Presentation abstract)


Lawyer, emphasis on the scientific and medico-legal issues (mini bio) (Presentation abstract & discussion)



Reader, School of Psychology, Cardiff University (mini bio) (Presentation abstract)

Categories: Summer Seminar in PhilStat | Leave a comment

The NEJM Issues New Guidelines on Statistical Reporting: Is the ASA P-Value Project Backfiring? (i)

The New England Journal of Medicine (NEJM) announced new guidelines for authors for statistical reporting yesterday*. The ASA describes the change as “in response to the ASA Statement on P-values and Statistical Significance and subsequent The American Statistician special issue on statistical inference” (ASA I and II, in my abbreviation). If so, it seems to have backfired. I don’t know all the differences in the new guidelines, but those explicitly noted appear to me to move in the reverse direction from where the ASA I and II guidelines were heading.

The most notable point is that the NEJM highlights the need for error control, especially for constraining the Type I error probability, and pays a lot of attention to adjusting P-values for multiple testing and post hoc subgroups. ASA I included an important principle (#4) that P-values are altered and may be invalidated by multiple testing, but they do not call for adjustments for multiplicity, nor do I find a discussion of Type I or II error probabilities in the ASA documents. NEJM gives strict requirements for controlling family-wise error rate or false discovery rates (understood as the Benjamini and Hochberg frequentist adjustments).
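To make the distinction between the two kinds of adjustment concrete, here is a minimal sketch of Bonferroni adjustment (controlling the family-wise error rate) and the Benjamini and Hochberg step-up adjustment (controlling the false discovery rate). The P-values and the 0.05 level are hypothetical illustrations, not taken from the NEJM guidelines, and the function names are my own.

```python
def bonferroni(p_values):
    """Bonferroni-adjusted P-values: multiply by the number of tests, cap at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted P-values (step-up), returned in input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value to the smallest, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical P-values from eight comparisons.
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]
alpha = 0.05

bonf = bonferroni(p_values)
bh = benjamini_hochberg(p_values)
for p, b, h in zip(p_values, bonf, bh):
    print(f"raw={p:.3f}  Bonferroni={b:.3f}  BH={h:.3f}")

n_bonf = sum(b < alpha for b in bonf)
n_bh = sum(h < alpha for h in bh)
print(f"rejections at {alpha}: Bonferroni={n_bonf}, BH={n_bh}")
```

With these made-up inputs, the Bonferroni adjustment is the more conservative of the two, which is the usual trade-off between controlling the family-wise error rate and controlling the false discovery rate.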

They do not go along with the ASA II call for ousting thresholds, ending the use of the words “significance/significant”, or banning “p ≤ 0.05”. In the associated article, we read:

“Clinicians and regulatory agencies must make decisions about which treatment to use or to allow to be marketed, and P values interpreted by reliably calculated thresholds subjected to appropriate adjustments have a role in those decisions”.

When it comes to confidence intervals, the recommendations of ASA II, to the extent they were influential on the NEJM, seem to have had the opposite effect to what was intended–or is this really what they wanted?

  • When no method to adjust for multiplicity of inferences or controlling false discovery rate was specified in the protocol or SAP of a clinical trial, the report of all secondary and exploratory endpoints should be limited to point estimates of treatment effects with 95% confidence intervals. In such cases, the Methods section should note that the widths of the intervals have not been adjusted for multiplicity and that the inferences drawn may not be reproducible. No P values should be reported for these analyses.

Significance levels and P-values, in other words, are terms to be reserved for contexts in which their error statistical meaning is legitimate. This is a key strong point of the NEJM guidelines. Confidence levels, for the NEJM, lose their error statistical or “coverage probability” meaning, unless they follow the adjustments that legitimate P-values call for. Unadjusted intervals may still be reported, but they must be accompanied by a notice warning the reader that the intervals were not adjusted for multiple testing and thus “the inferences drawn may not be reproducible.” The P-value, but not the confidence interval, remains an inferential tool with control of error probabilities. Now CIs are inversions of tests, and strictly speaking should also have error control. Authors may be allowed to forfeit this, but then CIs can’t replace significance tests and their use may even (inadvertently, perhaps) signal lack of error control. (In my view, that is not a good thing.) Here are some excerpts:

For all studies:

  • Significance tests should be accompanied by confidence intervals for estimated effect sizes, measures of association, or other parameters of interest. The confidence intervals should be adjusted to match any adjustment made to significance levels in the corresponding test.
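One way to read the requirement that intervals match the adjusted significance level is sketched below: if a Bonferroni adjustment tests each of m comparisons at level alpha/m, the matching interval has confidence level 1 - alpha/m. The estimate, standard error, number of comparisons, and the normal approximation are all assumptions made for illustration; the guidelines do not prescribe this particular computation.

```python
from statistics import NormalDist


def normal_ci(estimate, std_error, alpha):
    """Two-sided (1 - alpha) interval under a normal approximation."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return estimate - z * std_error, estimate + z * std_error


estimate, std_error = 1.2, 0.5  # hypothetical effect estimate and standard error
n_comparisons = 4               # hypothetical number of comparisons
alpha = 0.05

unadjusted = normal_ci(estimate, std_error, alpha)
adjusted = normal_ci(estimate, std_error, alpha / n_comparisons)

print(f"unadjusted 95% CI:      ({unadjusted[0]:.3f}, {unadjusted[1]:.3f})")
print(f"Bonferroni-adjusted CI: ({adjusted[0]:.3f}, {adjusted[1]:.3f})")
```

In this made-up case the unadjusted interval excludes zero while the adjusted one does not, so the interval and the adjusted test tell the same story only when both are adjusted.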

For clinical trials:

  • Original and final protocols and statistical analysis plans (SAPs) should be submitted along with the manuscript, as well as a table of amendments made to the protocol and SAP indicating the date of the change and its content.

  • The analyses of the primary outcome in manuscripts reporting results of clinical trials should match the analyses prespecified in the original protocol, except in unusual circumstances. Analyses that do not conform to the protocol should be justified in the Methods section of the manuscript. …

  • When comparing outcomes in two or more groups in confirmatory analyses, investigators should use the testing procedures specified in the protocol and SAP to control overall type I error — for example, Bonferroni adjustments or prespecified hierarchical procedures. P values adjusted for multiplicity should be reported when appropriate and labeled as such in the manuscript. In hierarchical testing procedures, P values should be reported only until the last comparison for which the P value was statistically significant. P values for the first nonsignificant comparison and for all comparisons thereafter should not be reported. For prespecified exploratory analyses, investigators should use methods for controlling false discovery rate described in the SAP — for example, Benjamini–Hochberg procedures.

  • When no method to adjust for multiplicity of inferences or controlling false discovery rate was specified in the protocol or SAP of a clinical trial, the report of all secondary and exploratory endpoints should be limited to point estimates of treatment effects with 95% confidence intervals. In such cases, the Methods section should note that the widths of the intervals have not been adjusted for multiplicity and that the inferences drawn may not be reproducible. No P values should be reported for these analyses.

As noted earlier, since P-values would be invalidated in such cases, it’s entirely right not to give them. CIs are permitted, yes, but are required to sport an alert warning that, even though multiple testing was done, the intervals were not adjusted for this and therefore “the inferences drawn may not be reproducible.” In short their coverage probability justification goes by the board.
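A rough sense of how quickly the coverage justification erodes can be had from a simple calculation. Assuming, purely for illustration, that the m intervals are independent and each has exact 95% coverage, the probability that all of them cover their targets simultaneously is 0.95 raised to the power m. Real endpoints are typically correlated, so this is a sketch of the direction of the problem rather than a statement about any particular trial.

```python
def joint_coverage(confidence, m):
    """Probability that m independent intervals all cover, each at the given level."""
    return confidence ** m


for m in (1, 5, 10, 20):
    print(f"{m:2d} unadjusted 95% intervals: joint coverage = {joint_coverage(0.95, m):.3f}")
```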

I wonder if practitioners can opt out of this weakening of CIs, and declare in advance that they are members of a subset of CI users who will only report confidence levels with a valid error statistical meaning, dual to statistical hypothesis tests.
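The duality between confidence intervals and tests mentioned here can be checked numerically in the simplest setting. The sketch below assumes a normally distributed estimate with known standard error (hypothetical numbers), and confirms that a candidate parameter value lies inside the 95% interval exactly when the two-sided test of that value is not rejected at the 0.05 level.

```python
from statistics import NormalDist


def two_sided_p(estimate, std_error, null_value):
    """Two-sided P-value for testing null_value under a normal approximation."""
    z = (estimate - null_value) / std_error
    return 2 * (1 - NormalDist().cdf(abs(z)))


def confidence_interval(estimate, std_error, alpha):
    """Two-sided (1 - alpha) interval under the same normal approximation."""
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    return estimate - z_crit * std_error, estimate + z_crit * std_error


estimate, std_error, alpha = 1.2, 0.5, 0.05  # hypothetical values
low, high = confidence_interval(estimate, std_error, alpha)

# Candidate parameter values from -1.0 to 3.0 in steps of 0.1.
candidates = [round(-1 + 0.1 * k, 1) for k in range(41)]
agreement = all(
    (two_sided_p(estimate, std_error, v) > alpha) == (low < v < high)
    for v in candidates
)
print("interval and test agree on every candidate value:", agreement)
```

The agreement holds only because the interval and the test use the same level; once the test is adjusted for multiplicity and the interval is not, the correspondence is lost.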

The NEJM guidelines continue:

  • …When the SAP prespecifies an analysis of certain subgroups, that analysis should conform to the method described in the SAP. If the study team believes a post hoc analysis of subgroups is important, the rationale for conducting that analysis should be stated. Post hoc analyses should be clearly labeled as post hoc in the manuscript.

  • Forest plots are often used to present results from an analysis of the consistency of a treatment effect across subgroups of factors of interest. …A list of P values for treatment by subgroup interactions is subject to the problems of multiplicity and has limited value for inference. Therefore, in most cases, no P values for interaction should be provided in the forest plots.

  • If significance tests of safety outcomes (when not primary outcomes) are reported along with the treatment-specific estimates, no adjustment for multiplicity is necessary. Because information contained in the safety endpoints may signal problems within specific organ classes, the editors believe that the type I error rates larger than 0.05 are acceptable. Editors may request that P values be reported for comparisons of the frequency of adverse events among treatment groups, regardless of whether such comparisons were prespecified in the SAP.

  • When possible, the editors prefer that absolute event counts or rates be reported before relative risks or hazard ratios. The goal is to provide the reader with both the actual event frequency and the relative frequency. Odds ratios should be avoided, as they may overestimate the relative risks in many settings and be misinterpreted.

  • Authors should provide a flow diagram in CONSORT format. The editors also encourage authors to submit all the relevant information included in the CONSORT checklist. …The CONSORT statement, checklist, and flow diagram are available on the CONSORT

Detailed instructions to ensure that observational studies retain control of error rates are given.
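The guidelines' caution that odds ratios may overestimate relative risks can be illustrated with two made-up two-arm comparisons that share the same relative risk. The counts below are hypothetical and the function names are my own; the point is only that the two measures roughly agree when events are rare and diverge when they are common.

```python
def relative_risk(events_treated, n_treated, events_control, n_control):
    """Ratio of event proportions in the two groups."""
    return (events_treated / n_treated) / (events_control / n_control)


def odds_ratio(events_treated, n_treated, events_control, n_control):
    """Ratio of event odds in the two groups."""
    odds_treated = events_treated / (n_treated - events_treated)
    odds_control = events_control / (n_control - events_control)
    return odds_treated / odds_control


# Hypothetical counts: (events in treated, n treated, events in control, n control).
rare = (2, 100, 1, 100)
common = (60, 100, 30, 100)

for label, counts in (("rare events", rare), ("common events", common)):
    rr = relative_risk(*counts)
    oratio = odds_ratio(*counts)
    print(f"{label}: relative risk = {rr:.2f}, odds ratio = {oratio:.2f}")
```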

In the associated article:

P values indicate how incompatible the observed data may be with a null hypothesis; “P<0.05” implies that a treatment effect or exposure association larger than that observed would occur less than 5% of the time under a null hypothesis of no effect or association and assuming no confounding. Concluding that the null hypothesis is false when in fact it is true (a type I error in statistical terms) has a likelihood of less than 5%. [i]…

The use of P values to summarize evidence in a study requires, on the one hand, thresholds that have a strong theoretical and empirical justification and, on the other hand, proper attention to the error that can result from uncritical interpretation of multiple inferences.[5] This inflation due to multiple comparisons can also occur when comparisons have been conducted by investigators but are not reported in a manuscript. A large array of methods to adjust for multiple comparisons is available and can be used to control the type I error probability in an analysis when specified in the design of a study.[6,7] Finally, the notion that a treatment is effective for a particular outcome if P<0.05 and ineffective if that threshold is not reached is a reductionist view of medicine that does not always reflect reality. [ii]

A well-designed randomized or observational study will have a primary hypothesis and a prespecified method of analysis, and the significance level from that analysis is a reliable indicator of the extent to which the observed data contradict a null hypothesis of no association between an intervention or an exposure and a response. Clinicians and regulatory agencies must make decisions about which treatment to use or to allow to be marketed, and P values interpreted by reliably calculated thresholds subjected to appropriate adjustments have a role in those decisions.

Finally, the current guidelines are limited to studies with a traditional frequentist design and analysis, since that matches the large majority of manuscripts submitted to the Journal. We do not mean to imply that these are the only acceptable designs and analyses. The Journal has published many studies with Bayesian designs and analyses[8-10] and expects to see more such trials in the future. When appropriate, our guidelines will be expanded to include best practices for reporting trials with Bayesian and other designs.

What do you think?

I will update this with corrections and thoughts using (i), (ii), etc.

The author guidelines:

The associated article:

*I meant to thank Nathan Schachtman for notifying me and sending links; also Stuart Hurlbert.

[i] It would be better, it seems to me, if the term “likelihood” were used only in its technical sense in a document like this.

[ii] I don’t see it as a matter of “reductionism” but simply a matter of the properties of the test and the discrepancies of interest in the context at hand.




Categories: ASA Guide to P-values | 15 Comments

B. Haig: The ASA’s 2019 update on P-values and significance (ASA II) (Guest Post)

Brian Haig, Professor Emeritus
Department of Psychology
University of Canterbury
Christchurch, New Zealand

The American Statistical Association’s (ASA) recent effort to advise the statistical and scientific communities on how they should think about statistics in research is ambitious in scope. It makes an initial attempt to depict what empirical research might look like in “a world beyond p<0.05” (The American Statistician, 2019, 73, S1, 1-401). Quite surprisingly, the main recommendation of the lead editorial article in the Special Issue of The American Statistician devoted to this topic (Wasserstein, Schirm, & Lazar, 2019; hereafter, ASA II) is that “it is time to stop using the term ‘statistically significant’ entirely” (p. 2). ASA II acknowledges the controversial nature of this directive and anticipates that it will be subject to critical examination. Indeed, in a recent post, Deborah Mayo began her evaluation of ASA II by making constructive amendments to three recommendations that appear early in the document (‘Error Statistics Philosophy’, June 17, 2019). These amendments have received numerous endorsements, and I record mine here. In this short commentary, I briefly state a number of general reservations that I have about ASA II.

1. The proposal that we should stop using the expression “statistical significance” is given a weak justification

ASA II proposes a superficial linguistic reform that is unlikely to overcome the widespread misconceptions and misuse of the concept of significance testing. A more reasonable, and common-sense, strategy would be to diagnose the reasons for the misconceptions and misuse and take ameliorative action through the provision of better statistics education, much as ASA I did with p values. Interestingly, ASA II references Mayo’s recent book, Statistical Inference as Severe Testing (2018), when mentioning the “statistics wars”. However, it refrains from considering the fact that her error-statistical perspective provides an informed justification for continuing to use tests of significance, along with the expression, “statistically significant”. Further, ASA II reports cases where some of the Special Issue authors thought that use of a p-value threshold might be acceptable. However, it makes no effort to consider how these cases might challenge their main recommendation.

2. The claimed benefits of abandoning talk of statistical significance are hopeful conjectures.

ASA II makes a number of claims about the benefits that it thinks will follow from abandoning talk of statistical significance. It says, “researchers will see their results more easily replicated – and, even when not, they will better understand why”. “[We] will begin to see fewer false alarms [and] fewer overlooked discoveries …”. And, “As ‘statistical significance’ is used less, statistical thinking will be used more.” (p. 1) I do not believe that any of these benefits is likely to follow from retirement of the expression “statistical significance”. Unfortunately, no justification is provided for the plausibility of any of the alleged benefits. To take two of these claims: First, removal of the common expression “significance testing” will make little difference to the success rate of replications. It is well known that successful replications depend on a number of important factors, including research design, data quality, effect size, and study power, along with the multiple criteria often invoked in ascertaining replication success. Second, it is just implausible to suggest that refraining from talk about statistical significance will appreciably help overcome mechanical decision-making in statistical practice, and lead to a greater engagement with statistical thinking. Such an outcome will require, among other things, the implementation of science education reforms that centre on the conceptual foundations of statistical inference.

3. ASA II’s main recommendation is not a majority view.

ASA II bases its main recommendation to stop using the language of “statistical significance” in good part on its review of the articles in the Special Issue. However, an inspection of the Special Issue reveals that this recommendation is at variance with the views of many of the 40-odd articles it contains. Those articles range widely in the topics they cover and in their attitudes to the usefulness of tests of significance. By my reckoning, only two of the articles advocate banning talk of significance testing. To be fair, ASA II acknowledges the diversity of views held about the nature of tests of significance. However, I think that this diversity should have prompted it to take proper account of the fact that its recommendation is only one of a number of alternative views about significance testing. At the very least, ASA II should have tempered its strong recommendation not to speak of statistical significance any more.

4. The claim for continuity between ASA I and ASA II is misleading.

There is no evidence in ASA I (Wasserstein & Lazar, 2016) for the assertion made in ASA II that the earlier document stopped just short of recommending that claims of “statistical significance” should be eliminated. In fact, ASA II marks a clear departure from ASA I, which was essentially concerned with how to better understand and use p-values. There is nothing in the earlier document to suggest that abandoning talk of statistical significance might be the next appropriate step forward in the ASA’s efforts to guide our statistical thinking.

5. Nothing is said about scientific method, and little is said about science.

The announcement of the ASA’s 2017 Symposium on Statistical Inference stated that the Symposium would “focus on specific approaches for advancing scientific methods in the 21st century”. However, the Symposium, and the resulting Special Issue of The American Statistician, showed little interest in matters to do with scientific method. This is regrettable because the myriad insights about scientific inquiry contained in contemporary scientific methodology have the potential to greatly enrich statistical science. The post-p<0.05 world depicted by ASA II is not an informed scientific world. It is an important truism that statistical inference plays a major role in scientific reasoning. However, for this role to be properly conveyed, ASA II would have to employ an informative conception of the nature of scientific inquiry.

6. Scientists who speak of statistical significance do embrace uncertainty.

I think that it is uncharitable, indeed incorrect, of ASA II to depict many researchers who use the language of significance testing as being engaged in a quest for certainty. John Dewey, Charles Peirce, and Karl Popper taught us quite some time ago that we are fallible, error-prone creatures, and that we must embrace uncertainty. Further, despite their limitations, our science education efforts frequently instruct learners to think of uncertainty as an appropriate epistemic attitude to hold in science. This fact, combined with the oft-made claim that statistics employs ideas about probability in order to quantify uncertainty, requires from ASA II a factually-based justification for its claim that many scientists who employ tests of significance do so in a quest for certainty.

Under the heading, “Has the American Statistical Association Gone Post-Modern?”, the legal scholar, Nathan Schachtman, recently stated:

The ASA may claim to be agnostic in the face of the contradictory recommendations, but there is one thing we know for sure: over-reaching litigants and their expert witnesses will exploit the real or apparent chaos in the ASA’s approach. The lack of coherent, consistent guidance will launch a thousand litigation ships, with no epistemic compass. (‘Schachtman’s Law’, March 24, 2019)

I suggest that, with appropriate adjustment, the same can fairly be said about researchers and statisticians, who might look to ASA II as an informative guide to a better understanding of tests of significance, and the many misconceptions about them that need to be corrected.


Haig, B. D. (2019). Stats: Don’t retire significance testing. Nature, 569, 487.

Mayo, D. G. (2019). The 2019 ASA Guide to P-values and Statistical Significance: Don’t Say What You Don’t Mean (Some Recommendations)(ii), blog post on Error Statistics Philosophy Blog, June 17, 2019.

Mayo, D. G. (2018). Statistical inference as severe testing: How to get beyond the statistics wars. New York, NY: Cambridge University Press.

Wasserstein, R. L., & Lazar, N. A. (2016). The ASA’s statement on p-values: Context, process, and purpose. The American Statistician, 70, 129-133.

Wasserstein, R. L., Schirm, A. L., & Lazar, N. A. (2019). Editorial: Moving to a world beyond “p<0.05”. The American Statistician, 73, S1, 1-19.

Categories: ASA Guide to P-values, Brian Haig | 31 Comments

The Statistics Wars: Errors and Casualties


Had I been scheduled to speak later at the 12th MuST Conference & 3rd Workshop “Perspectives on Scientific Error” in Munich, rather than on day 1, I could have (constructively) illustrated some of the errors and casualties by reference to a few of the conference papers that discussed significance tests. (Most gave illuminating discussions of such topics as replication research, the biases that discredit meta-analysis, statistics in the law, formal epistemology [i]). My slides follow my abstract.

Categories: slides, stat wars and their casualties | Leave a comment
