5-year review: The NEJM Issues New Guidelines on Statistical Reporting: Is the ASA P-Value Project Backfiring? (i)

In a July 19, 2019 post I discussed The New England Journal of Medicine’s response to Wasserstein’s (2019) call for journals to change their guidelines in reaction to the “abandon significance” drive. The NEJM said “no thanks” [A]. However, confidence intervals (CIs) got hurt in the mix. In this reblog, I kept the reference to “ASA II” with a note, because that best conveys the context of the discussion at the time. Switching it to WSL (2019) just didn’t read right. I invite your comments.
The New England Journal of Medicine (NEJM) announced new guidelines for authors for statistical reporting yesterday*. The ASA describes the change as “in response to the ASA Statement on P-values and Statistical Significance and subsequent The American Statistician special issue on statistical inference” (ASA I and II(note), in my abbreviation). If so, it seems to have backfired. I don’t know all the differences in the new guidelines, but those explicitly noted appear to me to move in the reverse direction from where the ASA I and II(note) guidelines were heading.

The most notable point is that the NEJM highlights the need for error control, especially for constraining the Type I error probability, and pays a lot of attention to adjusting P-values for multiple testing and post hoc subgroups. ASA I included an important principle (#4) that P-values are altered and may be invalidated by multiple testing, but it does not call for adjustments for multiplicity, nor do I find a discussion of Type I or II error probabilities in the ASA documents. The NEJM gives strict requirements for controlling the family-wise error rate or the false discovery rate (understood as the Benjamini and Hochberg frequentist adjustments); a minimal sketch of both kinds of adjustment appears after the quoted passage below. The NEJM does not go along with the ASA II(note) call for ousting thresholds, ending the use of the words “significance/significant”, or banning “p ≤ 0.05”. In the associated article, we read:

“Clinicians and regulatory agencies must make decisions about which treatment to use or to allow to be marketed, and P values interpreted by reliably calculated thresholds subjected to appropriate adjustments have a role in those decisions”.
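
For readers who want to see concretely what the two kinds of adjustment mentioned above amount to, here is a minimal sketch in Python. It is my own illustration, not code from the NEJM or the ASA documents, and the p-values in it are purely hypothetical.

```python
# A minimal sketch of the two adjustments the NEJM guidelines name:
# Bonferroni (family-wise error rate) and Benjamini-Hochberg (false
# discovery rate). Illustrative only; the p-values below are made up.
import numpy as np

def bonferroni(pvals):
    """Family-wise error control: inflate each p-value by the number of tests."""
    p = np.asarray(pvals, dtype=float)
    return np.minimum(p * p.size, 1.0)

def benjamini_hochberg(pvals):
    """False-discovery-rate control: step-up adjustment of the ordered p-values."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)                            # ascending order of the p-values
    adj = p[order] * m / np.arange(1, m + 1)         # raw Benjamini-Hochberg ratios
    adj = np.minimum.accumulate(adj[::-1])[::-1]     # enforce monotonicity from the top down
    out = np.empty(m)
    out[order] = np.minimum(adj, 1.0)                # map back to the original order
    return out

raw = [0.001, 0.012, 0.030, 0.045, 0.200]            # hypothetical endpoint p-values
print("Bonferroni adjusted:        ", bonferroni(raw))
print("Benjamini-Hochberg adjusted:", benjamini_hochberg(raw))
```

The Bonferroni adjustment protects against any false rejection in the family, which is why the guidelines tie it to confirmatory comparisons, whereas Benjamini–Hochberg only controls the expected proportion of false rejections, which is why it is reserved for prespecified exploratory analyses.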

When it comes to confidence intervals, the recommendations of ASA II(note), to the extent they were influential on the NEJM, seem to have had the opposite effect to what was intended–or is this really what they wanted?

  • When no method to adjust for multiplicity of inferences or controlling false discovery rate was specified in the protocol or SAP of a clinical trial, the report of all secondary and exploratory endpoints should be limited to point estimates of treatment effects with 95% confidence intervals. In such cases, the Methods section should note that the widths of the intervals have not been adjusted for multiplicity and that the inferences drawn may not be reproducible. No P values should be reported for these analyses.

Significance levels and P-values, in other words, are terms to be reserved for contexts in which their error statistical meaning is legitimate. This is a key strong point of the NEJM guidelines. Confidence levels, for the NEJM, lose their error statistical or “coverage probability” meaning unless they follow the adjustments that legitimate P-values call for. Unadjusted CIs may still be reported, but they must be accompanied by a sign that warns the reader the intervals were not adjusted for multiple testing and thus “the inferences drawn may not be reproducible.” The P-value, but not the confidence interval, remains an inferential tool with control of error probabilities. Now CIs are inversions of tests, and strictly speaking should also have error control. Authors may be allowed to forfeit this, but then CIs can’t replace significance tests, and their use may even (inadvertently, perhaps) signal lack of error control. (In my view, that is not a good thing.) Here are some excerpts:

For all studies:

  • Significance tests should be accompanied by confidence intervals for estimated effect sizes, measures of association, or other parameters of interest. The confidence intervals should be adjusted to match any adjustment made to significance levels in the corresponding test.

For clinical trials:

  • Original and final protocols and statistical analysis plans (SAPs) should be submitted along with the manuscript, as well as a table of amendments made to the protocol and SAP indicating the date of the change and its content.

  • The analyses of the primary outcome in manuscripts reporting results of clinical trials should match the analyses prespecified in the original protocol, except in unusual circumstances. Analyses that do not conform to the protocol should be justified in the Methods section of the manuscript. …

  • When comparing outcomes in two or more groups in confirmatory analyses, investigators should use the testing procedures specified in the protocol and SAP to control overall type I error — for example, Bonferroni adjustments or prespecified hierarchical procedures. P values adjusted for multiplicity should be reported when appropriate and labeled as such in the manuscript. In hierarchical testing procedures, P values should be reported only until the last comparison for which the P value was statistically significant. P values for the first nonsignificant comparison and for all comparisons thereafter should not be reported. For prespecified exploratory analyses, investigators should use methods for controlling false discovery rate described in the SAP — for example, Benjamini–Hochberg procedures.

  • When no method to adjust for multiplicity of inferences or controlling false discovery rate was specified in the protocol or SAP of a clinical trial, the report of all secondary and exploratory endpoints should be limited to point estimates of treatment effects with 95% confidence intervals. In such cases, the Methods section should note that the widths of the intervals have not been adjusted for multiplicity and that the inferences drawn may not be reproducible. No P values should be reported for these analyses.

As noted earlier, since P-values would be invalidated in such cases, it’s entirely right not to give them. CIs are permitted, yes, but are required to sport an alert warning that, even though multiple testing was done, the intervals were not adjusted for this and therefore “the inferences drawn may not be reproducible.” In short, their coverage probability justification goes by the board.
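
To give a sense of what is lost, here is a small simulation of my own (not anything from the NEJM), under the simplifying assumption of five independent, normally distributed endpoint estimates with no true effects: each unadjusted 95% interval covers on its own about 95% of the time, but the chance that all five cover simultaneously drops to roughly 77%.

```python
# Simultaneous coverage of unadjusted 95% confidence intervals across five
# independent endpoints; with no adjustment it falls to about 0.95**5 ~ 0.77.
import numpy as np

rng = np.random.default_rng(2019)
m, n, reps, z = 5, 50, 10_000, 1.96          # endpoints, sample size, simulations, 95% z-value
all_covered = 0
for _ in range(reps):
    data = rng.normal(0.0, 1.0, size=(m, n))         # true effect is 0 for every endpoint
    means = data.mean(axis=1)
    ses = data.std(axis=1, ddof=1) / np.sqrt(n)
    covers = np.abs(means) <= z * ses                 # does each 95% CI contain the true value?
    all_covered += covers.all()
print("simultaneous coverage of 5 unadjusted 95% CIs:", all_covered / reps)
```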

I wonder if practitioners can opt out of this weakening of CIs, and declare in advance that they are members of a subset of CI users who will only report confidence levels with a valid error statistical meaning, dual to statistical hypothesis tests.
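
By way of illustration only, and with made-up numbers, such an opt-out could amount to reporting Bonferroni-adjusted intervals: widen each of m intervals to the 1 − 0.05/m level, so that the family retains roughly 95% simultaneous coverage and remains dual to the Bonferroni-adjusted tests.

```python
# A hedged sketch of "opting out": report Bonferroni-adjusted intervals so the
# confidence level keeps its family-wise coverage meaning and stays dual to the
# adjusted tests. The estimate and standard error below are hypothetical.
from scipy import stats

m = 5                                        # number of secondary endpoints examined
est, se = 0.40, 0.15                         # hypothetical effect estimate and its standard error
z_unadj = stats.norm.ppf(1 - 0.05 / 2)       # about 1.96: per-comparison 95% interval
z_adj = stats.norm.ppf(1 - 0.05 / (2 * m))   # about 2.58: Bonferroni-adjusted critical value
print(f"unadjusted 95% CI:           ({est - z_unadj * se:.3f}, {est + z_unadj * se:.3f})")
print(f"adjusted, ~95% simultaneous: ({est - z_adj * se:.3f}, {est + z_adj * se:.3f})")
```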

The NEJM guidelines continue:

  • …When the SAP prespecifies an analysis of certain subgroups, that analysis should conform to the method described in the SAP. If the study team believes a post hoc analysis of subgroups is important, the rationale for conducting that analysis should be stated. Post hoc analyses should be clearly labeled as post hoc in the manuscript.

  • Forest plots are often used to present results from an analysis of the consistency of a treatment effect across subgroups of factors of interest. …A list of P values for treatment by subgroup interactions is subject to the problems of multiplicity and has limited value for inference. Therefore, in most cases, no P values for interaction should be provided in the forest plots.

  • If significance tests of safety outcomes (when not primary outcomes) are reported along with the treatment-specific estimates, no adjustment for multiplicity is necessary. Because information contained in the safety endpoints may signal problems within specific organ classes, the editors believe that the type I error rates larger than 0.05 are acceptable. Editors may request that P values be reported for comparisons of the frequency of adverse events among treatment groups, regardless of whether such comparisons were prespecified in the SAP.

  • When possible, the editors prefer that absolute event counts or rates be reported before relative risks or hazard ratios. The goal is to provide the reader with both the actual event frequency and the relative frequency. Odds ratios should be avoided, as they may overestimate the relative risks in many settings and be misinterpreted.

  • Authors should provide a flow diagram in CONSORT format. The editors also encourage authors to submit all the relevant information included in the CONSORT checklist. …The CONSORT statement, checklist, and flow diagram are available on the CONSORT website.

Detailed instructions to ensure that observational studies retain control of error rates are given.

In the associated article:

P values indicate how incompatible the observed data may be with a null hypothesis; “P<0.05” implies that a treatment effect or exposure association larger than that observed would occur less than 5% of the time under a null hypothesis of no effect or association and assuming no confounding. Concluding that the null hypothesis is false when in fact it is true (a type I error in statistical terms) has a likelihood of less than 5%. [i]…

The use of P values to summarize evidence in a study requires, on the one hand, thresholds that have a strong theoretical and empirical justification and, on the other hand, proper attention to the error that can result from uncritical interpretation of multiple inferences.5 This inflation due to multiple comparisons can also occur when comparisons have been conducted by investigators but are not reported in a manuscript. A large array of methods to adjust for multiple comparisons is available and can be used to control the type I error probability in an analysis when specified in the design of a study.6,7 Finally, the notion that a treatment is effective for a particular outcome if P<0.05 and ineffective if that threshold is not reached is a reductionist view of medicine that does not always reflect reality. [ii]

A well-designed randomized or observational study will have a primary hypothesis and a prespecified method of analysis, and the significance level from that analysis is a reliable indicator of the extent to which the observed data contradict a null hypothesis of no association between an intervention or an exposure and a response. Clinicians and regulatory agencies must make decisions about which treatment to use or to allow to be marketed, and P values interpreted by reliably calculated thresholds subjected to appropriate adjustments have a role in those decisions.

Finally, the current guidelines are limited to studies with a traditional frequentist design and analysis, since that matches the large majority of manuscripts submitted to the Journal. We do not mean to imply that these are the only acceptable designs and analyses. The Journal has published many studies with Bayesian designs and analyses8-10 and expects to see more such trials in the future. When appropriate, our guidelines will be expanded to include best practices for reporting trials with Bayesian and other designs.

What do you think?

The author guidelines:

https://www.nejm.org/author-center/new-manuscripts

The associated article:

https://www.nejm.org/doi/full/10.1056/NEJMe1906559

*I meant to thank Nathan Schachtman for notifying me and sending links; also Stuart Hurlbert.

[i] It would be better, it seems to me, if the term “likelihood” was used only for its technical meaning in a document like this.

[ii] I don’t see it as a matter of “reductionism” but simply a matter of the properties of the test and the discrepancies of interest in the context at hand.

[A] A self-published book on this episode, by Donald Macnaughton, came out in 2021: The War on Statistical Significance: The American Statistician vs. the New England Journal of Medicine.


Categories: 5-year memory lane, abandon statistical significance, ASA Guide to P-values | 6 Comments


6 thoughts on “5-year review: The NEJM Issues New Guidelines on Statistical Reporting: Is the ASA P-Value Project Backfiring? (i)”

  1. Mayo,

    I have not done a careful review of the NEJM since its new statistical guidelines were issued, but I have occasion to read its articles. I’ve not seen any instance in which authors qualified the meaning of their confidence intervals, or adjusted their calculations of standard error to reflect multiple comparisons. On the other hand, I’ve not seen any instance of arguably inappropriate declarations of statistical significance for a non-prespecified primary outcome. In the recent clinical trial (TRAVERSE) of testosterone therapy in hypogonadal men with cardiovascular risk factors, the authors of the article, from June 2023, reported hazard ratios with 95% CIs for primary, secondary, and tertiary end points. Lincoff, Cardiovascular Safety of Testosterone-Replacement Therapy NEJM (2023). Safety end points were presented as % in each arm, with a p-value. Even though an RCT such as this one will have hundreds of unadjudicated adverse events reported in both arms, the p-values are not adjusted for such end points. The article does not contain the words “statistically significant,” but others will, and have, used the phrase to describe the non-pre-specified safety endpoint findings.

    The Journal of the American Medical Association (JAMA) and its many subjournals were also targeted by Dr Wasserstein’s email campaign, and they continue to use the phrase “statistically significant,” probably in a less disciplined way than the NEJM.

    Nathan Schachtman

    • Nathan:

      Thank you for your comment. It is cleverly put in lawyerly terms. You say you’ve not seen any declarations of statistical significance on non-prespecified outcomes, but also observe that p-values are reported without adjustment, in cases where multiplicity would warrant adjustment. Is that right? (I recall NEJM calling for a warning in such cases.) Moreover, you say “others will and have” used the banned terms in relation to the results of the same article? Are these “others” in published writings, court cases, or?

      Your findings are very interesting and I hope you will write a guest post for this blog for my “5-year report” or whatever we might call it.

  2. Haha; your question is a pretty good cross-examination for a non-lawyer! Yes; the p-values for safety outcomes (adverse events) were not adjusted. It was also obvious to readers that they were “nominal” levels of significance probabilities. The safety events are categorized by MedDRA codes, for which there are, I believe, more entries than ICD-10 codes. So well over a thousand. The events are reported by the blinded clinical trial centers without respect to the arm the patient is in, and some patients may give rise to more than one reported event. I believe the TRAVERSE trial involved close to 5,000 patients randomized. (I am going on memory here.) All I can say is that safety events have always been reported this way; many clinical readers distinguish between efficacy and safety outcomes, and indulge a certain amount of precautionary thinking on the latter; no one was fooled; and the authors themselves did not refer to the events reported as “statistically significant disparities.” Other commentators may have called the safety outcomes statistically significant, and certainly in legal settings, I would anticipate hearing the results described as such, because p < 5%.

    In my view, the bigger point is that the NEJM and the JAMA resisted the lobbying of the ASA. The NEJM did refine its guidance for authors, but my sense is that not much changed at JAMA or its family of journals. Of the other three major clinical journals (BMJ, Annals of Internal Medicine, and Lancet), the Annals strikes me as the best for its statistical editing, but I would be interested to hear others’ views.

    Nathan

    • Nathan:

      I hadn’t heard of MedDRA codes, so I looked it up. On a quick glance, it looks like a great big list of standardized terms to use, especially in describing patient adverse events and reactions. I read your last comment too quickly and I thought you were driving at some kind of misleading statistical report or use of p-values. Yes, obviously efficacy is very different from safety, and it’s appropriate to report all standard observed adverse events. There are a variety of known effects, and they are recorded, perhaps for further explanation. There’s no multiplicity and selection of the sort requiring adjustment that I can see, as there would be if, for example, only those in support of a given theory were reported. This sounds more like Fisher’s recommendation to ask many questions of a set of data. David Cox and I gave a very abbreviated list (of when adjustment mattered) in our 2006 paper—I, of course, was following his lead, as the expert. However, I’m just guessing here, as I shouldn’t. I haven’t seen the paper and don’t even know what the trials were all about. Can you link it?

      Yes, it’s good that respectable journals, to my knowledge, resisted the lobbying of certain members of the ASA. The funny thing is, I get along with Wasserstein, whenever we interact, and he’s never really intimidated me on this dispute of ours. I was, and to some extent, still am, convinced the wordsmithers he was relying on went too far. That’s why I wrote the “don’t say what you don’t mean” post on June 19, 2019, reblogged here:

      https://errorstatistics.com/2024/04/05/my-2019-friendly-amendments-to-that-abandon-significance-editorial/

      Not that the proposed revisions were made. He wanted to make a splash, and he did.

      • The randomized controlled trial to which I referred can be found here:

        https://www.nejm.org/doi/full/10.1056/NEJMoa2215025

        It is one that can be freely downloaded from the NEJM. It is, in a way, yet another example of the large RCT putting to rest prior observational studies (often poorly done), and meta-analyses of small RCTs. In medicine, this has happened before. The debacle over Avandia (rosiglitazone) is another example, but then it was Dr Nissen who published the meta-analysis, later undone by the “mega-trial.” Here Nissen was the P.I. of the TRAVERSE trial.

        In addition to what might be a linguistic or word-smithing issue, there was the email campaign.

        I roughly recall that I found an exemplar of the email that was sent out to journals as part of what seemed to be a campaign to end “statistical significance” testing. (I believe I shared the email from the website with you and others at the time. I am in Canada now and I cannot check my “archives.”) As I recall, the email had the logo of the ASA on it, and I took that as further support that the 2019 editorial was at best coy in not having had a disclaimer. Perhaps it is the lawyer in me. Although I don’t do it in informal communications, when I publish articles, I usually note that my views are not necessarily shared by my firm or my clients. Wasserstein put his official position in the 2019 opinion piece; to me, that gave rise to a need for a disclaimer.

        I have no corroboration that the NEJM’s change in statistical guidelines, and its published explanation, came about because its editors received an email from the ASA written by Wasserstein. Or that the journal Clinical Trials published its piece around the same time because of the email campaign. Still, I think the cookie crumbs point in that direction.

        I am not offended by the email campaign or by the editorial other than its implied misrepresentation of “provenance.” It seems clear and entirely proper that Wasserstein, as Wasserstein, has a view of statistical inference and practice that he would like to see followed in the scientific world. If the NEJM moved as a result of Wasserstein’s email or his editorial, the move was generally in a good direction, from where I am observing. He may be disappointed or disapproving, but I think the NEJM guidelines are an improvement over past practice. Perhaps he can take credit for that improvement.

        • Nathan:

          Thank you for the link. I will read it. Of course I recall the details you mention. This blog has a good repository of all of it, including the letterhead, the President’s Task Force, and, ultimately, the disclaimer. I am grateful to you for some of the highlights. I am reblogging some items over the next month, and hope to get a blogpost from you at some point, when you can.

