My 2019 friendly amendments to that “abandon significance” editorial


It was 3 months before I decided to write a blogpost in response to Wasserstein, Schirm and Lazar (2019)’s editorial in The American Statistician, in which they recommend that the concept of “statistical significance” be abandoned, hereafter, WSL 2019. (I titled it “Don’t Say What You don’t Mean”.) In that June 17, 2019 blogpost, pasted below, I proposed 3 “friendly amendments” to the language of that document. (There are 97 comments on that post!) The problem is that WSL 2019 restates several of the 6 principles of ASA I (the 2016 ASA Statement on P-values and Statistical Significance) in such a strengthened form that they become inconsistent, or at least in tension, with the originals. I didn’t think the authors really meant what they said, and I discussed these amendments with Ron Wasserstein, Executive Director of the ASA at the time. Had these friendly amendments been adopted, the document would not have caused as much of a problem, and people might have focused more on the positive recommendations it includes about scientific integrity. The proposed ban on a key concept of statistics would still have been problematic, and would still have resulted in the 2019 ASA President’s Task Force, but the amendments would have helped the document.

At the time, it was not yet known whether WSL 2019 was intended as a continuation of the 2016 ASA policy document [ASA I]. That explains why I first referred to WSL 2019 in this blogpost as ASA II. Once it was revealed (many months later) that it was not official policy at all, but only the recommendations of its 3 authors, I placed a “note” after each mention of ASA II. But given that it caused sufficient confusion to result in the then ASA president (Karen Kafadar) appointing an ASA Task Force on Statistical Significance and Replicability in 2019 (see here and here), and, later, in a disclaimer by the authors, in this reblog I refer to it as WSL 2019. You can search this blog for other posts on the 2019 Task Force: their report is here, and the disclaimer here.

***

“The 2019 Guide to P-values and Statistical significance: Don’t Say What You don’t Mean” (June 17, 2019)

Some have asked me why I haven’t blogged on the recent follow-up to the ASA Statement on P-Values and Statistical Significance (Wasserstein and Lazar 2016)–hereafter, ASA I. They’re referring to the editorial by Wasserstein, R., Schirm, A. and Lazar, N. (2019)–hereafter, [WSL 2019]–opening a special on-line issue of over 40 contributions responding to the call to describe “a world beyond P < 0.05”.[1] Am I falling down on the job? Not really. All of the issues are thoroughly visited in my Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars, SIST (2018, CUP). I invite interested readers to join me on the statistical cruise therein.[2] As the [WSL 2019] authors observe: “At times in this editorial and the papers you’ll hear deep dissonance, the echoes of ‘statistics wars’ still simmering today (Mayo 2018)”. True, and reluctance to reopen old wounds has only allowed them to fester. However, I will admit that when new attempts at reform are put forward, a philosopher of science who has written on the statistics wars ought to weigh in on the specific prescriptions/proscriptions, especially when a jumble of fuzzy conceptual issues is interwoven through a cacophony of competing reforms. (My published comment on ASA I, “Don’t Throw Out the Error Control Baby With the Bad Statistics Bathwater”, is here.)

So I should say something. But the task is delicate. And painful. Very. I should start by asking: What is it (i.e., what is it actually saying)? Then I can offer some constructive suggestions.

The Invitation to Broader Consideration and Debate

The papers in this issue propose many new ideas, ideas that in our determination as editors merited publication to enable broader consideration and debate. The ideas in this editorial are likewise open to debate. ([WSL 2019] p. 1)

The questions around reform need consideration and debate. (p. 9)

Excellent! A broad, open, critical debate is sorely needed. Still, we can only debate something when there is a degree of clarity as to what “it” is. I will be very happy to post readers’ meanderings on [WSL 2019] (~1000 words) if you send them to me.

My focus here is just on the intended positions of the ASA [or WSL 2019], not the summaries of articles. This comprises around the first 10 pages. Even from just the first few pages the reader is met with some noteworthy declarations:

♦ Don’t conclude anything about scientific or practical importance based on statistical significance (or lack thereof). (p. 1)

♦ No p-value can reveal the plausibility, presence, truth, or importance of an association or effect. (p. 2)

♦ A declaration of statistical significance is the antithesis of thoughtfulness. (p. 4)

♦ Whether a p-value passes any arbitrary threshold should not be considered at all when deciding which results to present or highlight. (p. 2, my emphasis)

♦ It is time to stop using the term “statistically significant” entirely. Nor should variants such as “significantly different,” “p < 0.05,” and “nonsignificant” survive. (p. 2)

♦ “Statistically significant”– don’t say it and don’t use it. (p. 2)

(Wow!)

I am very sympathetic with the concerns about rigid cut-offs, and fallacies of moving from statistical significance to substantive scientific claims. I feel as if I’ve just written a whole book on it! I say, on p. 10 of SIST:

In formal statistical testing, the crude dichotomy of “pass/fail” or “significant or not” will scarcely do. We must determine the magnitudes (and directions) of any statistical discrepancies warranted, and the limits to any substantive claims you may be entitled to infer from the statistical ones.

Since [WSL 2019] will still use P-values, you’re bound to wonder why a user wouldn’t just report “the difference is statistically significant at the P-value attained”. (The probability of observing even larger differences, under the assumption of chance variability alone, is p.) Confidence intervals (CIs) are already routinely given alongside P-values. So there is clearly more to the current movement than meets the eye. But for now I’m just trying to decipher what the ASA position is.
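To make that reporting practice concrete, here is a minimal sketch (my own, with simulated data, not an example from WSL 2019 or ASA I) of reporting the attained P-value for a two-group comparison together with the confidence interval routinely given alongside it:

```python
# A minimal sketch (mine, not from the editorial), using made-up data: report the
# attained p-value for a two-group comparison together with a confidence interval.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
treatment = rng.normal(loc=0.5, scale=1.0, size=50)   # hypothetical treated group
control   = rng.normal(loc=0.0, scale=1.0, size=50)   # hypothetical control group

# Pooled two-sample t-test of "no difference in means".  The attained p-value is
# the probability of a difference at least as extreme as the one observed, under
# chance variability alone.
t_stat, p_value = stats.ttest_ind(treatment, control)

# 95% confidence interval for the difference in means, using the same pooled
# standard error -- the kind of interval routinely reported alongside a p-value.
n1, n2 = len(treatment), len(control)
diff = treatment.mean() - control.mean()
sp2 = ((n1 - 1) * treatment.var(ddof=1) + (n2 - 1) * control.var(ddof=1)) / (n1 + n2 - 2)
se = np.sqrt(sp2 * (1 / n1 + 1 / n2))
half_width = stats.t.ppf(0.975, n1 + n2 - 2) * se
print(f"difference = {diff:.2f}, p = {p_value:.3f}, "
      f"95% CI = ({diff - half_width:.2f}, {diff + half_width:.2f})")
```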

What’s the Relationship Between ASA I and [WSL 2019]?

I assume, for this post, that [WSL 2019] is intended to be an extension of ASA I. In that case, it would subsume the 6 principles of ASA I. There is evidence for this. For one thing, it begins by sketching a “sampling” of “don’ts” from ASA I, for those who are new to the debate. Secondly, it recommends that ASA I be widely disseminated. But some Principles (1, 4) are apparently missing[3], and others are rephrased in ways that alter the initial meanings. Do they really mean these declarations as written? Let us try to take them at their word.

But right away we are struck with a conflict with Principle 1 of ASA I–which happens to be the only positive principle given. (See Note 5 for the six Principles of ASA I.)

Principle 1. P-values can indicate how incompatible the data are with a specified statistical model.

A p-value provides one approach to summarizing the incompatibility between a particular set of data and a proposed model for the data. The most common context is a model, constructed under a set of assumptions, together with a so-called “null hypothesis.” Often the null hypothesis postulates the absence of an effect, such as no difference between two groups, or the absence of a relationship between a factor and an outcome. The smaller the p-value, the greater the statistical incompatibility of the data with the null hypothesis, if the underlying assumptions used to calculate the p-value hold. This incompatibility can be interpreted as casting doubt on or providing evidence against the null hypothesis or the underlying assumptions. (ASA I, p. 131)

However, an indication of how incompatible data are with a claim of the absence of a relationship between a factor and an outcome would be an indication of the presence of the relationship; and providing evidence against a claim of no difference between two groups would often be of scientific or practical importance.

So, Principle 1 (from ASA I) doesn’t appear to square with the first bulleted item I listed (from [WSL 2019]):

(1) “Don’t conclude anything about scientific or practical importance based on statistical significance (or lack thereof)” [WSL 2019].

Either modify (1) or erase Principle 1. But if you erase all thresholds for finding incompatibility (whether using P-values or other measures), there are no tests, and no falsifications, even of the statistical kind.

My understanding (from Ron Wasserstein) is that this bullet is intended to correspond to Principle 5 in ASA I – that P-values do not give population effect sizes. But it is now saying something stronger (at least to my ears and to everyone else I’ve asked). Do the authors mean to be saying that nothing (of scientific or practical importance) can be learned from statistical significance tests? I think not.

So, my first recommendation is:

Replace (1) with:

“Don’t conclude anything about the scientific or practical importance of the (population) effect size based only on statistical significance (or lack thereof).”

Either that, or simply stick to Principle 5 from ASA I: “A p-value, or statistical significance[4], does not measure the size of an effect or the importance of a result.” (p. 132) This statement is, strictly speaking, a tautology, true by the definitions of terms: probability isn’t itself a measure of the size of a (population) effect. However, you can use statistically significant differences to infer what the data indicate about the size of the (population) effect.[4]
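As a small illustration of that tautology (again my own sketch, with simulated data, not an example from ASA I or WSL 2019): a trivially small effect yields an ever smaller p-value as the sample grows, so the p-value cannot stand in for the estimated (population) effect size, which has to be reported and assessed on its own.

```python
# A minimal sketch (mine, not from ASA I), with simulated one-sample data: the
# same tiny true effect yields very different p-values as n grows, while the
# estimated effect size stays (correctly) tiny -- p is not an effect-size measure.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
true_effect = 0.05   # a scientifically trivial mean shift (in SD units)

for n in (50, 500, 50_000):
    sample = rng.normal(loc=true_effect, scale=1.0, size=n)
    t_stat, p_value = stats.ttest_1samp(sample, popmean=0.0)
    print(f"n = {n:6d}: estimated effect = {sample.mean():.3f}, p = {p_value:.4f}")

# Small p-values at large n signal incompatibility with "no effect", but the
# estimated (population) effect size must be reported and interpreted separately.
```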

My second friendly amendment concerns the second bulleted item:

(2) No p-value can reveal the plausibility, presence, truth, or importance of an association or effect. (p. 2)

Focus just on “presence”. From this assertion it would seem to follow that no P-values[5], however small, even from well-controlled trials, can reveal the presence of an association or effect–and that is too strong. Again, we get a conflict with Principle 1 from ASA I. But I’m guessing, for now, the authors do not intend to say this. If you don’t mean it, don’t say it.

So, my second recommendation is to replace (2) with:

“No p-value by itself can reveal the plausibility, presence, truth, or importance of an association or effect.”

Without this friendly amendment, [WSL 2019] is at loggerheads with ASA I, and the authors should not be advocating those 6 principles without changing either or both. Without this or a similar modification, moreover, any other statistical quantity or evidential measure is likewise unable to reveal these things. Or so many would argue. These modest revisions might keep some readers from stopping after the first few pages; stopping there would be a shame, as they would miss the many right-headed insights about linking statistical and scientific inference.

This leads to my third bulleted item from [WSL 2019]:

(3) A declaration of statistical significance is the antithesis of thoughtfulness… it ignores what previous studies have contributed to our knowledge. (p. 4)

Surely the authors do not mean to say that anyone who asserts the observed difference is statistically significant at level p has her hands tied and invariably ignores all previous studies, background information, and theories in planning inquiries and in reaching conclusions, decisions, and proposed solutions to problems. I’m totally on board with the importance of background knowledge, and of the multiple steps relating data to scientific claims and problems. Here’s what I say in SIST:

The error statistician begins with a substantive problem or question. She jumps in and out of piecemeal statistical tests both formal and quasi-formal. The pieces are integrated in building up arguments from coincidence, informing background theory, self-correcting via blatant deceptions, in an iterative movement. The inference is qualified by using error probabilities to determine not “how probable,” but rather, “how well-probed” claims are, and what has been poorly probed. (SIST, p. 162)

But good inquiry is piecemeal: There is no reason to suppose one does everything at once in inquiry, and it seems clear from the [WSL 2019] guide that the authors agree. Since I don’t think they literally mean (3), why say it?

Practitioners who use these methods in medicine and elsewhere have detailed protocols for how background knowledge is employed in designing, running, and interpreting tests. When medical researchers specify primary outcomes, for just one example, they do so very explicitly, with due regard for the mechanism of drug action: the primary outcome is intended as the most direct way to pick up on the drug’s mechanism. Finding incompatibility using P-values inherits the meaning already attached to a sensible test hypothesis. That valid P-values require context is presupposed by the very important Principle 4 of ASA I (see note [3]).

As lawyer Nathan Schachtman observes in a recent conversation on [WSL 2019]:

By the time a phase III clinical trial is being reviewed for approval, there is a mountain of data on pharmacology, pharmacokinetics, mechanism, target organ, etc. If Wasserstein wants to suggest that there are some people who misuse or misinterpret p-values, fine. The principle of charity requires that we give a more sympathetic reading to the broad field of users of statistical significance testing. (Schachtman 2019)

Now it is possible the authors are saying a reported P-value can never be thoughtful because thoughtfulness requires that a statistical measure, at any stage of probing, incorporate everything we know (SIST dubs this “big picture” inference). Do we want that? Or maybe (3) is their way of saying a statistical measure must incorporate background beliefs in the manner of Bayesian degree-of-belief (?) priors. Many would beg to differ, including some leading Bayesians. Andrew Gelman (2012) has suggested that ‘Bayesians Want Everybody Else to be Non-Bayesian’:

Bayesian inference proceeds by taking the likelihoods from different data sources and then combining them with a prior (or, more generally, a hierarchical model). The likelihood is key. . .  No funny stuff, no posterior distributions, just the likelihood. . . I don’t want everybody coming to me with their posterior distribution – I’d just have to divide away their prior distributions before getting to my own analysis. (ibid., p. 54)
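A minimal sketch of Gelman’s point (my own, with made-up numbers, not an example from his paper): the same likelihood, combined with different priors, yields different posteriors, so someone else’s posterior alone conceals what their data contributed.

```python
# A minimal sketch (mine, not Gelman's), assuming a conjugate normal model with
# known variance: one likelihood (summarized by the sample mean) combines with
# different priors to give different posteriors, which is why Gelman asks for
# the likelihood rather than someone else's posterior.
import numpy as np

def normal_posterior(prior_mean, prior_var, sample_mean, n, data_var=1.0):
    """Posterior mean and variance for a normal mean with known data variance."""
    like_var = data_var / n                      # variance of the sample mean
    post_var = 1.0 / (1.0 / prior_var + 1.0 / like_var)
    post_mean = post_var * (prior_mean / prior_var + sample_mean / like_var)
    return post_mean, post_var

sample_mean, n = 0.4, 25                         # the likelihood's summary of the data

for prior_mean, prior_var in [(0.0, 0.1), (1.0, 0.1), (0.0, 100.0)]:
    m, v = normal_posterior(prior_mean, prior_var, sample_mean, n)
    print(f"prior N({prior_mean}, {prior_var}): posterior mean = {m:.2f}, sd = {v**0.5:.2f}")
```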

So, my third recommendation is to replace (3) with (something like):

“Failing to report anything beyond a declaration of statistical significance is the antithesis of thoughtfulness.”

There’s much else that bears critical analysis and debate in [WSL 2019]; I’ll come back to it. I hope to hear from the authors of [WSL 2019] about my very slight, constructive amendments (to avoid a conflict with Principle 1).

Meanwhile, I fear we will see court cases piling up denying that anyone can be found culpable for abusing p-values and significance tests, since the ASA declared that all p-values are arbitrary, and whether predesignated thresholds are honored or breached should not be considered at all. (This was already happening based on ASA I.)[6]

Please share your thoughts and any errors in the comments; I will indicate later drafts of this post with (i), (ii), … Do send me other articles you find discussing this. Version (ii) of this post begins a list:

Nathan Schachtman (2019): Has the ASA Gone Post-Modern?

Cook et al. (2019): There is Still a Place for Significance Testing in Clinical Trials

NEJM Manuscript & Statistical Guidelines 2019

Harrington, New Guidelines for Statistical Reporting in the Journal, NEJM 2019

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

References:

Gelman, A. (2012) “Ethics and the Statistical Use of Prior Information”. http://www.stat.columbia.edu/~gelman/research/published/ChanceEthics5.pdf

Mayo, D. (2016). “Don’t Throw out the Error Control Baby with the Bad Statistics Bathwater: A Commentary” on R. Wasserstein and N. Lazar: “The ASA’s Statement on P-values: Context, Process, and Purpose”, The American Statistician 70(2).

Mayo, D. (2018). Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars. Cambridge: Cambridge University Press.

Schachtman, N. (2019). (private communication)

Wasserstein, R. and Lazar, N. (2016). “The ASA’s Statement on P-values: Context, Process and Purpose”, (and supplemental materials), The American Statistician 70(2), 129–33. (ASA I)

Wasserstein, R., Schirm, A. and Lazar, N. (2019) Editorial: “Moving to a World Beyond ‘p < 0.05’”, The American Statistician 73(S1): 1-19.

NOTES

[1] I gave an invited paper at the conference (“A World Beyond…”) out of which the idea for this volume grew. I was in a session with a few other exiles, describing the contexts where statistical significance tests are of value. I was too involved in completing my book to write up my paper for this volume, nor did the others in our small group write up theirs. Links are here to my slides and Yoav Benjamini’s slides. I did post notes to journalists on the Amrhein article here.

[2] Excerpts and mementos from SIST are here.

 
