Remember when I wrote to the National Academy of Science (NAS) in September pointing out mistaken definitions of P-values in their document on Reproducibility and Replicability in Science? (see my 9/30/19 post). I’d given up on their taking any action, but yesterday I received a letter from the NAS Senior Program officer:
Dear Dr. Mayo,
I am writing to let you know that the Reproducibility and Replicability in Science report has been updated in response to the issues that you have raised.
Two footnotes, on pages
31 35 and 221, highlight the changes. The updated report is available from the following link: NEW 2020 NAS DOC
Thank you for taking the time to reach out to me and to Dr. Fineberg and letting us know about your concerns.
With kind regards and wishes of a happy 2020,
Jennifer Heimberg, Ph.D.
Senior Program Officer
The National Academies of Sciences, Engineering, and Medicine
Categories: NAS, P-values
Mayo writing to Kafadar
I never met Karen Kafadar, the 2019 President of the American Statistical Association (ASA), but the other day I wrote to her in response to a call in her extremely interesting June 2019 President’s Corner: “Statistics and Unintended Consequences“:
- “I welcome your suggestions for how we can communicate the importance of statistical inference and the proper interpretation of p-values to our scientific partners and science journal editors in a way they will understand and appreciate and can use with confidence and comfort—before they change their policies and abandon statistics altogether.”
I only recently came across her call, and I will share my letter below. First, here are some excerpts from her June President’s Corner (her December report is due any day). Continue reading
“Before we stood on the edge of the precipice, now we have taken a great step forward”
What’s self-defeating about pursuing statistical reforms in the manner taken by the American Statistical Association (ASA) in 2019? In case you’re not up on the latest in significance testing wars, the 2016 ASA Statement on P-Values and Statistical Significance, ASA I, arguably, was a reasonably consensual statement on the need to avoid some well-known abuses of P-values–notably if you compute P-values, ignoring selective reporting, multiple testing, or stopping when the data look good, the computed P-value will be invalid. (Principle 4, ASA I) But then Ron Wasserstein, executive director of the ASA, and co-editors, decided they weren’t happy with their own 2016 statement because it “stopped just short of recommending that declarations of ‘statistical significance’ be abandoned” altogether. In their new statement–ASA II–they announced: “We take that step here….Statistically significant –don’t say it and don’t use it”.
Why do I say it is a mis-take to have taken the supposed next “great step forward”? Why do I count it as unsuccessful as a piece of statistical science policy? In what ways does it make the situation worse? Let me count the ways. The first is in this post. Others will come in following posts, until I become too disconsolate to continue.[i] Continue reading
Mayo banging head
If you were on a committee to highlight issues surrounding P-values and replication, what’s the first definition you would check? Yes, exactly. Apparently, when it came to the recently released National Academies of Science “Consensus Study” Reproducibility and Replicability in Science 2019, no one did. Continue reading
The October 2019 issue of the European Journal of Clinical Investigations came out today. It includes the PERSPECTIVE article by Tom Hardwicke and John Ioannidis, an invited editorial by Gelman and one by me:
Petitions in scientific argumentation: Dissecting the request to retire statistical significance, by Tom Hardwicke and John Ioannidis
When we make recommendations for scientific practice, we are (at best) acting as social scientists, by Andrew Gelman
P-value thresholds: Forfeit at your peril, by Deborah Mayo
I blogged excerpts from my preprint, and some related posts, here.
All agree to the disagreement on the statistical and metastatistical issues: Continue reading
A key recognition among those who write on the statistical crisis in science is that the pressure to publish attention-getting articles can incentivize researchers to produce eye-catching but inadequately scrutinized claims. We may see much the same sensationalism in broadcasting metastatistical research, especially if it takes the form of scapegoating or banning statistical significance. A lot of excitement was generated recently when Ron Wasserstein, Executive Director of the American Statistical Association (ASA), and co-editors A. Schirm and N. Lazar, updated the 2016 ASA Statement on P-Values and Statistical Significance (ASA I). In their 2019 interpretation, ASA I “stopped just short of recommending that declarations of ‘statistical significance’ be abandoned,” and in their new statement (ASA II) announced: “We take that step here….’statistically significant’ –don’t say it and don’t use it”. To herald the ASA II, and the special issue “Moving to a world beyond ‘p < 0.05’”, the journal Nature requisitioned a commentary from Amrhein, Greenland and McShane “Retire Statistical Significance” (AGM). With over 800 signatories, the commentary received the imposing title “Scientists rise up against significance tests”! Continue reading
Nathan Schachtman (who was a special invited speaker at our recent Summer Seminar in Phil Stat) put up a post on his law blog the other day (“Palavering About P-values”) on an article by a statistics professor at Stanford, Helena Kraemer. “Palavering” is an interesting word choice of Schachtman’s. Its range of meanings is relevant here [i]; in my title, I intend both, in turn. You can read Schachtman’s full post here, it begins like this:
The American Statistical Association’s most recent confused and confusing communication about statistical significance testing has given rise to great mischief in the world of science and science publishing.[ASA II 2019] Take for instance last week’s opinion piece about “Is It Time to Ban the P Value?” Please.
Admittedly, their recent statement, which I refer to as ASA II, has seemed to open the floodgates to some very zany remarks about P-values, their meaning and role in statistical testing. Continuing with Schachtman’s post: Continue reading
When science writers, especially “statistical war correspondents”, contact you to weigh in on some article, they may talk to you until they get something spicy, and then they may or may not include the background context. So a few writers contacted me this past week regarding this article (“Retire Statistical Significance”)–a teaser, I now suppose, to advertise the ASA collection growing out of that conference “A world beyond P ≤ .05” way back in Oct 2017, where I gave a paper*. I jotted down some points, since Richard Harris from NPR needed them immediately, and I had just gotten off a plane when he emailed. He let me follow up with him, which is rare and greatly appreciated. So I streamlined the first set of points, and dropped any points he deemed technical. I sketched the third set for a couple of other journals who contacted me, who may or may not use them. Here’s Harris’ article, which includes a couple of my remarks. Continue reading
I came across an interesting letter in response to the ASA’s Statement on p-values that I hadn’t seen before. It’s by Ionides, Giessing, Ritov and Page, and it’s very much worth reading. I make some comments below. Continue reading
I’ve been asked if I agree with Regina Nuzzo’s recent note on p-values [i]. I don’t want to be nit-picky, but one very small addition to Nuzzo’s helpful tips for communicating statistical significance can make it a great deal more helpful. Here’s my friendly amendment. She writes: Continue reading
I’m reblogging a few of the Higgs posts at the 6th anniversary of the 2012 discovery. (The first was in this post.) The following, was originally “Higgs Analysis and Statistical Flukes: part 2″ (from March, 2013).
Some people say to me: “This kind of [severe testing] reasoning is fine for a ‘sexy science’ like high energy physics (HEP)”–as if their statistical inferences are radically different. But I maintain that this is the mode by which data are used in “uncertain” reasoning across the entire landscape of science and day-to-day learning (at least, when we’re trying to find things out) Even with high level theories, the particular problems of learning from data are tackled piecemeal, in local inferences that afford error control. Granted, this statistical philosophy differs importantly from those that view the task as assigning comparative (or absolute) degrees-of-support/belief/plausibility to propositions, models, or theories. Continue reading
An argument that assumes the very thing that was to have been argued for is guilty of begging the question; signing on to an argument whose conclusion you favor even though you cannot defend its premises is to argue unsoundly, and in bad faith. When a whirlpool of “reforms” subliminally alter the nature and goals of a method, falling into these sins can be quite inadvertent. Start with a simple point on defining the power of a statistical test.
I. Redefine Power?
Given that power is one of the most confused concepts from Neyman-Pearson (N-P) frequentist testing, it’s troubling that in “Redefine Statistical Significance”, power gets redefined too. “Power,” we’re told, is a Bayes Factor BF “obtained by defining H1 as putting ½ probability on μ = ± m for the value of m that gives 75% power for the test of size α = 0.05. This H1 represents an effect size typical of that which is implicitly assumed by researchers during experimental design.” (material under Figure 1). Continue reading
Erich Lehmann 20 November 1917 – 12 September 2009
Erich Lehmann was born 100 years ago today! (20 November 1917 – 12 September 2009). Lehmann was Neyman’s first student at Berkeley (Ph.D 1942), and his framing of Neyman-Pearson (NP) methods has had an enormous influence on the way we typically view them.*
I got to know Erich in 1997, shortly after publication of EGEK (1996). One day, I received a bulging, six-page, handwritten letter from him in tiny, extremely neat scrawl (and many more after that). He began by telling me that he was sitting in a very large room at an ASA (American Statistical Association) meeting where they were shutting down the conference book display (or maybe they were setting it up), and on a very long, wood table sat just one book, all alone, shiny red.
He said ” I wonder if it might be of interest to me!” So he walked up to it…. It turned out to be my Error and the Growth of Experimental Knowledge (1996, Chicago), which he reviewed soon after. (What are the chances?) Some related posts on Lehmann’s letter are here and here.
However, it’s mandatory to adjust for selection effects, and Benjamini is one of the leaders in developing ways to carry out the adjustments. Even calling out the avenues for cherry-picking and multiple testing, long known to invalidate p-values, would make replication research more effective (and less open to criticism). Continue reading
.A world beyond p-values?
I was asked to write something explaining the background of my slides (posted here) in relation to the recent ASA “A World Beyond P-values” conference. I took advantage of some long flight delays on my return to jot down some thoughts:
The contrast between the closing session of the conference “A World Beyond P-values,” and the gist of the conference itself, shines a light on a pervasive tension within the “Beyond P-Values” movement. Two very different debates are taking place. First there’s the debate about how to promote better science. This includes welcome reminders of the timeless demands of rigor and integrity required to avoid deceiving ourselves and others–especially crucial in today’s world of high-powered searches and Big Data. That’s what the closing session was about.  Continue reading
Here are my slides from the ASA Symposium on Statistical Inference : “A World Beyond p < .05” in the session, “What are the best uses for P-values?”. (Aside from me,our session included Yoav Benjamini and David Robinson, with chair: Nalini Ravishanker.)
- Why use a tool that infers from a single (arbitrary) P-value that pertains to a statistical hypothesis H0 to a research claim H*?
- Why use an incompatible hybrid (of Fisher and N-P)?
- Why apply a method that uses error probabilities, the sampling distribution, researcher “intentions” and violates the likelihood principle (LP)? You should condition on the data.
- Why use methods that overstate evidence against a null hypothesis?
- Why do you use a method that presupposes the underlying statistical model?
- Why use a measure that doesn’t report effect sizes?
- Why do you use a method that doesn’t provide posterior probabilities (in hypotheses)?
I was part of something called “a brains blog roundtable” on the business of p-values earlier this week–I’m glad to see philosophers getting involved.
Next week I’ll be in a session that I think is intended to explain what’s right about P-values at an ASA Symposium on Statistical Inference : “A World Beyond p < .05”. Continue reading
Having discussed the “p-values overstate the evidence against the null fallacy” many times over the past few years, I leave it to readers to disinter the issues (pro and con), and appraise the assumptions, in the most recent rehearsal of the well-known Bayesian argument. There’s nothing intrinsically wrong with demanding everyone work with a lowered p-value–if you’re so inclined to embrace a single, dichotomous standard without context-dependent interpretations, especially if larger sample sizes are required to compensate the loss of power. But lowering the p-value won’t solve the problems that vex people (biasing selection effects), and is very likely to introduce new ones (see my comment). Kelly Servick, a reporter from Science, gives the ingredients of the main argument given by “a megateam of reproducibility-minded scientists” in an article out today: Continue reading