I never met Karen Kafadar, the 2019 President of the American Statistical Association (ASA), but the other day I wrote to her in response to a call in her extremely interesting June 2019 President’s Corner: “Statistics and Unintended Consequences“:
- “I welcome your suggestions for how we can communicate the importance of statistical inference and the proper interpretation of p-values to our scientific partners and science journal editors in a way they will understand and appreciate and can use with confidence and comfort—before they change their policies and abandon statistics altogether.”
Recently, at chapter meetings, conferences, and other events, I’ve had the good fortune to meet many of our members, many of whom feel queasy about the effects of differing views on p-values expressed in the March 2019 supplement of The American Statistician (TAS). The guest editors— Ronald Wasserstein, Allen Schirm, and Nicole Lazar—introduced the ASA Statement on P-Values (2016) by stating the obvious: “Let us be clear. Nothing in the ASA statement is new.” Indeed, the six principles are well-known to statisticians.The guest editors continued, “We hoped that a statement from the world’s largest professional association of statisticians would open a fresh discussion and draw renewed and vigorous attention to changing the practice of science with regards to the use of statistical inference.”…
Wait a minute. I’m confused about who is speaking. The statements “Let us be clear…” and “We hoped that a statement from the world’s largest professional association…” come from the 2016 ASA Statement on P-values. I abbreviate this as ASA I (Wasserstein and Lazar 2016). The March 2019 editorial that Kafadar says is making many members “feel queasy,” is the update (Wasserstein, Schirm, and Lazar 2019). I abbreviate it as ASA II [i].
A healthy debate about statistical approaches can lead to better methods. But, just as Wilks and his colleagues discovered, unintended consequences may have arisen: Nonstatisticians (the target of the issue) may be confused about what to do. Worse, “by breaking free from the bonds of statistical significance” as the editors suggest and several authors urge, researchers may read the call to “abandon statistical significance” as “abandon statistical methods altogether.” …
But we may need more. How exactly are researchers supposed to implement this “new concept” of statistical thinking? Without specifics, questions such as “Why is getting rid of p-values so hard?” may lead some of our scientific colleagues to hear the message as, “Abandon p-values”—despite the guest editors’ statement: “We are not recommending that the calculation and use of continuous p-values be discontinued.”
Brad Efron once said, “Those who ignore statistics are condemned to re-invent it.” In his commentary (“It’s not the p-value’s fault”) following the 2016 ASA Statement on P-Values, Yoav Benjamini wrote, “The ASA Board statement about the p-values may be read as discouraging the use of p-values because they can be misused, while the other approaches offered there might be misused in much the same way.” Indeed, p-values (and all statistical methods in general) can be misused. (So may cars and computers and cell phones and alcohol. Even words in the English language get misused!) But banishing them will not prevent misuse; analysts will simply find other ways to document a point—perhaps better ways, but perhaps less reliable ones. And, as Benjamini further writes, p-values have stood the test of time in part because they offer “a first line of defense against being fooled by randomness, separating signal from noise, because the models it requires are simpler than any other statistical tool needs”—especially now that Efron’s bootstrap has become a familiar tool in all branches of science for characterizing uncertainty in statistical estimates.[Benjamini is commenting on ASA I.]
… It is reassuring that “Nature is not seeking to change how it considers statistical evaluation of papers at this time,” but this line is buried in its March 20 editorial, titled “It’s Time to Talk About Ditching Statistical Significance.” Which sentence do you think will be more memorable? We can wait to see if other journals follow BASP’s lead and then respond. But then we’re back to “reactive” versus “proactive” mode (see February’s column), which is how we got here in the first place.
… Indeed, the ASA has a professional responsibility to ensure good science is conducted—and statistical inference is an essential part of good science. Given the confusion in the scientific community (to which the ASA’s peer-reviewed 2019 TAS supplement may have unintentionally contributed), we cannot afford to sit back. After all, that’s what started us down the “abuse of p-values” path.
Is it unintentional? [ii]
…Tukey wrote years ago about Bayesian methods: “It is relatively clear that discarding Bayesian techniques would be a real mistake; trying to use them everywhere, however, would in my judgment, be a considerably greater mistake.” In the present context, perhaps he might have said: “It is relatively clear that trusting or dismissing results based on a single p-value would be a real mistake; discarding p-values entirely, however, would in my judgment, be a considerably greater mistake.” We should take responsibility for the situation in which we find ourselves today (and during the past decades) to ensure that our well-researched and theoretically sound statistical methodology is neither abused nor dismissed categorically. I welcome your suggestions for how we can communicate the importance of statistical inference and the proper interpretation of p-values to our scientific partners and science journal editors in a way they will understand and appreciate and can use with confidence and comfort—before they change their policies and abandon statistics altogether. Please send me your ideas!
On Fri, Nov 8, 2019 at 2:09 PM Deborah Mayo <email@example.com> wrote:
Dear Professor Kafadar;
Your article in the President’s Corner of the ASA for June 2019 was sent to me by someone who had read my “P-value Thresholds: Forfeit at your Peril” editorial, invited by John Ioannidis. I find your sentiment welcome and I’m responding to your call for suggestions.
For starters, when representatives of the ASA issue articles criticizing P-values and significance tests, recommending their supplementation or replacement by others, three very simple principles should be followed:
- The elements of tests should be presented in an accurate, fair and at least reasonably generous manner, rather than presenting mainly abuses of the methods;
- The latest accepted methods should be included, not just crude nil null hypothesis tests. How these newer methods get around the often-repeated problems should be mentioned.
- Problems facing the better-known alternatives, recommended as replacements or supplements to significance tests, should be discussed. Such an evaluation should recognize the role of statistical falsification is distinct from (while complementary to) using probability to quantify degrees of confirmation, support, plausibility or belief in a statistical hypothesis or model.
Here’s what I recommend ASA do now in order to correct the distorted picture that is now widespread and growing: Run a conference akin to the one Wasserstein ran on “A World Beyond ‘P < 0.05′” except that it would be on evaluating some competing methods for statistical inference: Comparative Methods of Statistical Inference: Problems and Prospects.
The workshop would consist of serious critical discussions on Bayes Factors, confidence intervals[iii], Likelihoodist methods, other Bayesian approaches (subjective, default non-subjective, empirical), particularly in relation to today’s replication crisis. …
Growth of the use of these alternative methods have been sufficiently widespread to have garnered discussions on well-known problems….The conference I’m describing will easily attract the leading statisticians in the world. …
Please share your comments on this blogpost.
[i] My reference to ASA II refers just to the portion of the editorial encompassing their general recommendations: don’t say significance or significant, oust P-value thresholds. (It mostly encompasses the first 10 pages.) It begins with a review of 4 of the 6 principles from ASA I, even though they are stated in more extreme terms than in ASA I. (As I point out in my blogpost, the result is to give us principles that are in tension with the original 6.) Note my new qualification in [ii]*
[ii]*As soon as I saw the 2019 document, I queried Wasserstein as to the relationship between ASA I and II. It was never clarified. I hope now that it will be, with some kind of disclaimer. That will help, but merely noting that it never came to a Board vote will not quell the confusion now rattling some ASA members. The ASA’s P-value campaign to editors to revise their author guidelines asks them to take account of both ASA I and II. In carrying out the P-value campaign, at which he is highly effective, Ron Wasserstein obviously* wears his Executive Director’s hat. See The ASA’s P-value Project: Why It’s Doing More Harm than Good. So, until some kind of clarification is issued by the ASA, I’ve hit upon this solution.
The ASA P-value Project existed before the 2016 ASA I. The only difference in today’s P-value Project–since the March 20, 2019 editorial by Wasserstein et al– is that the ASA Executive Director (in talks, presentations, correspondence) recommends ASA I and the general stipulations of ASA II–even though that’s not a policy document. I will now call it the 2019 ASA P-value Project II. It also includes the rather stronger principles in ASA II. Even many who entirely agree with the “don’t say significance” and “don’t use P-value thresholds” recommendations have concurred with my “friendly amendments” to ASA II (including, for example, Greenland, Hurlbert, and others). See my post from June 17, 2019.
You merely have to look at the comments to that blog. If Wasserstein would make those slight revisions, the 2019 P-value Project II wouldn’t contain the inconsistencies, or at least “tensions” that it now does, assuming that it retains ASA I. The 2019 ASA P-value Project II sanctions making the recommendations in ASA II, even though ASA II is not an ASA policy statement.
However, I don’t see that those made queasy by ASAII would be any less upset with the reality of the ASA P-value Project II.
[iii]Confidence intervals (CIs) clearly aren’t “alternative measures of evidence” in relation to statistical significance tests. The same man, Neyman, developed tests (with Pearson) and CIs, even earlier ~1930. They were developed as duals, or inversions, of tests. Yet the advocates of CIs–the CI Crusaders, S. Hurlbert calls them–are some of today’s harshest and most ungenerous critics of tests. For these crusaders, it has to be “CIs only”. Supplementing p-values with CIs isn’t good enough. Now look what’s happened to CIS in the latest guidelines of the NEJM. You can readily find them searching NEJM on this blog. (My own favored measure, severity, improves on CIs, moves away from the fixed confidence level, and provides a different assessment corresponding to each point in the CI.
*Or is it not obvious? I think it is, because he is invited and speaks, writes, and corresponds in that capacity.
Related posts on ASA II:
- June 17, 2019: “The 2019 ASA Guide to P-values and Statistical Significance: Don’t Say What You Don’t Mean” (Some Recommendations)
- July 12, 2019: B. Haig: The ASA’s 2019 update on P-values and significance (ASA II)(Guest Post)
- July 19, 2019: The NEJM Issues New Guidelines on Statistical Reporting: Is the ASA P-Value Project Backfiring?
- September 19, 2019: (Excerpts from) ‘P-Value Thresholds: Forfeit at Your Peril’ (free access). The article by Hardwicke and Ioannidis (2019), and the editorials by Gelman and by me are linked on this post.
- Nov 4, 2019. On some Self-defeating aspects of the ASA’s 2019 recommendations of statistical significance tests
- Nov 22. The ASA’s P-value Project: Why It’s Doing More Harm than Good.
Related book (excerpts from posts on this blog are collected here)
Mayo, (2018). Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars, SIST (2018, CUP).