“There Is No Replication Crisis if We Don’t Expect Replication”.

I don’t know if the editorial paper is intended as the official ASA position, as was the case with the ASA P-value Statement (2016). There’s a definite danger in encouraging a view that statistics embraces postmodernism, radical skepticism, or scientific anarchy.

In Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (2018), I recommend that researchers seek to falsify certain types of claims (claims about measurements and experiments) when a purported effect repeatedly fails to be replicated and survives only by propping results up with ad hockeries.

What I mean by a ‘true result’ in a random sampling process is the result (mean or proportion) obtained after a very large or infinitely large number of observations. It is also important to understand that there are two kinds of prior probability being used in my paper (i.e. Llewelyn H (2019) Replacing P-values with frequentist posterior probabilities of replication—When possible parameter values must have uniform marginal prior probabilities. PLoS ONE 14(2): e0212302. https://doi.org/10.1371/journal.pone.0212302).

The first type is the ‘natural’ prior (e.g. a Bayesian prior) that you are talking about, which is beset with problems when an attempt is made to use it as a default uniform prior, as exemplified by the transformation problem you point out.

The second type of prior probability that I use in my paper is quite different. It is imposed on the data rather like placing a frame around a picture. The universal set on which the ‘natural’ non-uniform prior is based then becomes a subset (inside the ‘frame’) of this ‘imposed’ artificial universal set of parameters with uniform prior probabilities. This means that the odds of the non-uniform ‘natural’ individual prior probabilities within the ‘frame’ become equal to the likelihood ratio of the corresponding likelihood distribution. I give a more detailed explanation in the OSF supplement to my paper: https://osf.io/s6qgy/.

Once this is done it becomes apparent that if a P-value calculation is based on a symmetrical distribution (e.g. a Gaussian distribution), it is equal to the posterior probability that the ‘true’ result is the null hypothesis or something more extreme relative to the observed result (which could be a proportion or mean). This range of ‘true’ results after making a very large number of observations does not contain the observed result and therefore fails to replicate it. The ‘complement’ of this range (i.e. all true results less extreme than the null hypothesis) does contain the observed result and thus replicates it in the long run. The probability of replication within this range is therefore 1 − P. However, I called this the ‘idealistic’ probability of replication, as it assumes impeccable methodology. If no fault can be found after a rigorous examination (Mayo calls it ‘severe testing’), it can be regarded as a ‘realistic’ probability of replication.
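On the stated assumptions of a Gaussian sampling distribution and a uniform (‘framed’) prior, the numerical identity claimed here can be checked directly. The sketch below uses invented numbers (null mean 0, observed mean 0.8, known SD 1, n = 4), not figures from the paper:

```python
import math

def phi(z):
    """Standard normal CDF."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

# Illustrative numbers (assumed for the sketch, not taken from the paper):
mu0, xbar, sigma, n = 0.0, 0.8, 1.0, 4
se = sigma / math.sqrt(n)

# One-sided P-value: probability of the observed mean or something more
# extreme if the null hypothesis mu0 were the 'true' result.
p_value = 1 - phi((xbar - mu0) / se)

# Posterior probability that the 'true' mean is mu0 or more extreme, under a
# uniform prior on mu (the 'frame'), computed by normalising the likelihood
# over a wide grid of possible parameter values.
grid = [-10 + k * 0.001 for k in range(20001)]
lik = [math.exp(-0.5 * ((xbar - mu) / se) ** 2) for mu in grid]
posterior_tail = sum(l for mu, l in zip(grid, lik) if mu <= mu0) / sum(lik)

print(round(p_value, 3), round(posterior_tail, 3))  # the two agree
```

The identity relies on the symmetry of the Gaussian; with a skewed likelihood the one-sided P-value and the uniform-prior posterior tail probability need not coincide.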

All this allows the P-value to be understood in a logical way, with a reasoning process similar to that advocated by Bayesians. I think this explains why the P-value is intuitively useful. The P-value itself, however, is an arbitrary index and not even a probability (not a likelihood, a prior probability, or a posterior probability). It merely has the superficial appearance of a probability, which leads to endless confusion and errors. Many think, erroneously, that it is a false positive rate (or 1 − specificity) to be used with ‘lump’ prior probabilities. I think its real strength is that it is a measure of non-replication, as outlined above.
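A small simulation can illustrate the distinction drawn here between a P-value and a false positive rate among ‘significant’ findings. All the numbers (a 50% ‘lump’ prior on the null, effect size 0.3, n = 25) are assumptions chosen for the sketch:

```python
import math
import random

random.seed(7)

n_sims, n, sigma, effect = 100_000, 25, 1.0, 0.3
se = sigma / math.sqrt(n)

false_pos = true_pos = 0
for _ in range(n_sims):
    null_true = random.random() < 0.5          # 'lump' prior: 50% of effects null
    mu = 0.0 if null_true else effect
    xbar = random.gauss(mu, se)
    p = 1 - 0.5 * (1 + math.erf((xbar / se) / math.sqrt(2)))  # one-sided P-value
    if p < 0.05:
        if null_true:
            false_pos += 1
        else:
            true_pos += 1

# Among 'significant' results, the proportion that are actually null depends
# on the lump prior and the power -- it is not the P-value threshold of 0.05.
fdr = false_pos / (false_pos + true_pos)
print(round(fdr, 2))  # noticeably larger than 0.05
```

With these assumed inputs, roughly one in ten ‘significant’ results comes from a true null, which is one way of seeing why a P-value of 0.05 is not a 5% false positive rate.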

Huw

https://www.nap.edu/catalog/6024/science-and-creationism-a-view-from-the-national-academy-of

“Science is a particular way of knowing about the world. In science, explanations are limited to those based on observations and experiments that can be substantiated by other scientists. Explanations that cannot be based on empirical evidence are not a part of science.

In the quest for understanding, science involves a great deal of careful observation that eventually produces an elaborate written description of the natural world. Scientists communicate their findings and conclusions to other scientists through publications, talks at conferences, hallway conversations, and many other means. Other scientists then test those ideas and build on preexisting work. In this way, the accuracy and sophistication of descriptions of the natural world tend to increase with time, as subsequent generations of scientists correct and extend the work done by their predecessors.”

I still regard statistics, even in its current incomplete form, as an invaluable cornerstone for the scientific method, as a technique for understanding the rate at which we make erroneous conclusions as we continuously re-evaluate previous assertions in our quest to correct and extend the work of our predecessors. Statistical conclusions are not always correct, but error statistical methods allow us to understand and manage the rate at which we make errors.
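A minimal sketch of what ‘managing the rate at which we make errors’ means in practice (the sample size and number of simulated studies below are arbitrary): when every null hypothesis is true and the test is valid, rejections at the 0.05 level occur about 5% of the time, and that rate is known in advance.

```python
import math
import random

random.seed(0)

n_studies, n = 20_000, 30   # arbitrary numbers for the sketch

def one_sided_p(xbar, sigma, n, mu0=0.0):
    """One-sided P-value from a known-sigma z-test of H0: mu = mu0."""
    z = (xbar - mu0) / (sigma / math.sqrt(n))
    return 1 - 0.5 * (1 + math.erf(z / math.sqrt(2)))

rejections = 0
for _ in range(n_studies):
    sample = [random.gauss(0.0, 1.0) for _ in range(n)]   # the null is true
    if one_sided_p(sum(sample) / n, 1.0, n) < 0.05:
        rejections += 1

error_rate = rejections / n_studies
print(round(error_rate, 3))  # close to the nominal 0.05
```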

When I read (with considerable dismay) the statement

” . . . generalizations from single studies are rarely if ever warranted.”

I wonder if the authors of this paper

https://www.tandfonline.com/doi/full/10.1080/00031305.2018.1543137

“Inferential Statistics as Descriptive Statistics: There Is No Replication Crisis if We Don’t Expect Replication”

have read the NAS proffered definition of science, and if not, what on earth their definition of science is.

I hope that Sander Greenland, a frequent contributor at this blog site, who offered this excellent discussion about statistical methods in the same journal issue

https://www.tandfonline.com/doi/full/10.1080/00031305.2018.1529625

“Valid P-Values Behave Exactly as They Should: Some Misleading Criticisms of P-Values and Their Resolution With S-Values”

can discuss why he penned his name alongside Amrhein and Trafimow in making such contraindicated assertions about the scientific method and the statistical assessment of repeatable phenomena. If a single study is not generalizable, its report is an anecdote and not a part of the corpus of science.

I found your blog quite by accident but have profited from the exposure, and it has helped me become at least a voice encouraging my colleagues to be more careful about what they assert to have verified. Thank you for that.

In writing recently about some of the difficulties of venturing into the swamp, I was besieged by one reviewer who denied that anything related to statistical analysis could be learned from philosophy, or even from the practice of medicine, among other fields. While I am far from mastering even the basics, I am now aware of the overconfidence so easily expressed in my colleagues’ application of statistical analysis. No doubt the situation exists elsewhere as well. As you observe, even those most knowledgeable are still debating fundamental issues.

On a different note, my father-in-law, Dan Pletta, taught in the engineering school at VPI for forty years before he died in 1997. I got to know Blacksburg quite well. It has certainly changed since I knew it in the 1950s. Best wishes and keep up the good work.

No need to post this comment.

I have a paper on arXiv that explores the issue in full: https://arxiv.org/abs/1507.08394

I have read pages 14 to 16 as suggested. This is a familiar problem to someone like me who has spent years trying to interpret patients’ weights in clinics! You talk about three weighing scales with similar precision. You are implying that, from single measurements on three weighing scales without holding books, after holding books, and on return from the UK (3×3 = 9 single measurements), if in each of the 9 situations you had weighed yourself a large number of times, you would get a mean weight with a small standard deviation and be able to plot the narrow distribution of your weights. High precision means that when the weighing is performed repeatedly, the probability of replicating a weight within a narrow range is high.

By doing this for the three weighing scales and getting the same mean for each, you would think it improbable that there was bias (unless all three were biased in the same way), so that the three were therefore probably accurate (i.e. not biased) as well as precise. You also establish that the accuracy applies to differences in weight, because your weight increased by an extra 3 pounds when you were carrying three books weighing 3 lb (as measured on a fourth weighing machine, presumably). In order to avoid measurement bias due to poor operator methodology (not the fault of the machine), it would be important also to wear the same items of clothing at home and in the medical centre, not to change them randomly when testing the weighing machine, and to weigh at the same time of day and in the same relation to meals at each of the 9 weighing sessions. This is analogous to going through a checklist as part of severe testing to consider whether a scientific study was conducted in a consistent way if it were to be repeated as described.
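The reasoning about precision and bias in this passage can be sketched with a toy simulation; the true weight, biases, and standard deviations below are invented for illustration:

```python
import random
import statistics

random.seed(1)

TRUE_WEIGHT = 170.0   # invented 'true' body weight in lb
N_REPEATS = 5_000     # many repeated weighings per scale

def weigh(true_weight, bias, sd, n):
    """Simulate n readings from a scale with fixed bias and random error sd."""
    return [random.gauss(true_weight + bias, sd) for _ in range(n)]

# Three hypothetical scales: all precise (small sd), but one biased.
results = {}
for name, bias, sd in [("scale A", 0.0, 0.2),
                       ("scale B", 0.0, 0.2),
                       ("scale C", 1.5, 0.2)]:
    readings = weigh(TRUE_WEIGHT, bias, sd, N_REPEATS)
    results[name] = (statistics.mean(readings), statistics.stdev(readings))
    print(name, round(results[name][0], 1), round(results[name][1], 2))

# A and B agree on the mean, so bias is improbable (unless both err alike);
# C is just as precise (same small sd) yet systematically off by its bias.
```

Precision here is the small spread of repeated readings on one scale; it is the agreement of the means across independent scales that makes bias improbable.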

When you return from the UK, you weigh an extra 4.c lb (c being a constant >0 and <1 that you did not specify). Your scientific hypothesis was that your body weight has increased. However, this scientific inference needs to undergo severe testing by excluding the rival hypotheses: e.g. that, having left the USA in the summer and perhaps returned in the winter, you were not wearing heavier clothing or had not forgotten to take heavy boots off, etc., when using each of the 3 weighing scales at home and at the medical centre after your return. I assume that you would have taken care to wear the same clothing etc. to exclude this possibility as part of severe testing of your hypothesis that your body mass had increased! There are other hypotheses too (e.g. was the increased weight due to increased body fat or fluid, etc.). These are considerations that I have had to make often in medical clinics!

This process of assessing the precision of the weighing scales (e.g. if you had measured your weight many times on each scale instead of once) is the same process that I followed in the PLOS ONE paper, which instead addresses the probability of replication of study data. The realistic probability of study replication would also depend on evidence for the absence of probable methodological inconsistencies that would reduce precision and increase bias. This is a point that you have made strongly by applying severe testing, according to my understanding. Am I right in thinking this?

Anyone who denies the relevance of the small but important roles P-values play should absolutely stop using them; to reject them and then retain them as useful is disingenuous. If those who reject the role of P-values would just stop using them, they could stop torturing everyone else with rules they don’t really mean, and stop wrecking a key tool for the preliminary analysis of drugs and environmental risks.