Here’s an article by Nick Thieme on the same theme as my last blogpost. Thieme, who is Slate’s 2017 AAAS Mass Media Fellow, is the first person to interview me on p-values who (a) was prepared to think through the issue for himself (or herself), and (b) included more than a tiny fragment of my side of the exchange.[i] Please share your comments.
Will Lowering P-Value Thresholds Help Fix Science? P-values are already all over the map, and they’re also not exactly the problem.
By Nick Thieme (for Slate Magazine)
Last week a team of 72 scientists released the preprint of an article attempting to address one aspect of the reproducibility crisis, the crisis of conscience in which scientists are increasingly skeptical about the rigor of our current methods of conducting scientific research.
Their suggestion? Change the threshold for what is considered statistically significant. The team, led by Daniel Benjamin, a behavioral economist from the University of Southern California, is advocating that the “probability value” (p-value) threshold for statistical significance be lowered from the current standard of 0.05 to a much stricter threshold of 0.005.
P-values are tricky business, but here’s the basics on how they work: Let’s say I’m conducting a drug trial, and I want to know if people who take drug A are more likely to go deaf than if they take drug B. I’ll state that my hypothesis is “drugs A and B are equally likely to make someone go deaf,” administer the drugs, and collect the data. The data will show me the number of people who went deaf on drugs A and B, and the p-value will give me an indication of how likely it is that the difference in deafness was due to random chance rather than the drugs. If the p-value is lower than 0.05, it means that the chance this happened randomly is very small—it’s a 5 percent chance of happening, meaning it would only occur 1 out of 20 times if there wasn’t a difference between the drugs. If the threshold is lowered to 0.005 for something to be considered significant, it would mean that the chances of it happening without a meaningful difference between the treatments would be just 1 in 200.
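To make the 1-in-20 and 1-in-200 figures concrete, here is a minimal simulation sketch. It reads the p-value in its standard sense: the probability, computed on the assumption that the drugs do not differ, of seeing a difference at least as large as the one observed. Everything specific in it is an assumption chosen for illustration and does not come from the article: the 10 percent risk of deafness on both drugs, 200 patients per arm, the pooled two-proportion z-test, and the function names.

```python
import math
import random


def two_proportion_p_value(x_a, n_a, x_b, n_b):
    """Two-sided p-value from a pooled two-proportion z-test (normal approximation)."""
    pooled = (x_a + x_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        # Every patient (or no patient) had the event in both arms: no evidence of a difference.
        return 1.0
    z = (x_a / n_a - x_b / n_b) / se
    # For a standard normal Z, P(|Z| >= |z|) equals erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2))


def simulate_null_trials(n_trials=5000, n_per_arm=200, risk=0.1, seed=0):
    """Simulate trials in which drugs A and B carry the same (assumed) risk."""
    rng = random.Random(seed)
    p_values = []
    for _ in range(n_trials):
        x_a = sum(rng.random() < risk for _ in range(n_per_arm))
        x_b = sum(rng.random() < risk for _ in range(n_per_arm))
        p_values.append(two_proportion_p_value(x_a, n_per_arm, x_b, n_per_arm))
    return p_values


p_values = simulate_null_trials()
frac_05 = sum(p < 0.05 for p in p_values) / len(p_values)
frac_005 = sum(p < 0.005 for p in p_values) / len(p_values)
print(f"share of no-difference trials with p < 0.05:  {frac_05:.4f}")
print(f"share of no-difference trials with p < 0.005: {frac_005:.4f}")
```

Because both arms are simulated with the same risk, every "significant" result here is a false alarm. The two printed shares should land near 0.05 and 0.005, up to simulation noise and the roughness of the normal approximation.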
On its face, this doesn’t seem like a bad idea. If this change requires scientists to have more robust evidence before they can come to conclusions, it’s easy to think it’s a step in the right direction. But one of the issues at the heart of making this change is that it seems to assume there’s currently a consensus around how p-value ought to be used and this consensus could just be tweaked to be stronger.
P-value use already varies by scientific field and by journal policies within those fields. Several journals in epidemiology, where the stakes of bad science are perhaps higher than in, say, psychology (if they mess up, people die), have discouraged the use of p-values for years. And even psychology journals are following suit: In 2015, Basic and Applied Social Psychology, a journal that has been accused of bad statistical (and experimental) practice, banned the use of p-values. Many other journals, including PLOS Medicine and Journal of Allergy and Clinical Immunology, actively discourage the use of p-values and significance testing already.
On the other hand, the New England Journal of Medicine, one of the most respected journals in that field, codes the 0.05 threshold for significance into its author guidelines, saying “significant differences between or among groups (i.e P<.05) should be identified in a table.” That may not be an explicit instruction to treat p-values less than 0.05 as significant, but an author could be forgiven for reading it that way. Other journals, like the Journal of Neuroscience and the Journal of Urology, do the same.
Another group of journals, including Science, Nature, and Cell, avoid giving specific advice on exactly how to use p-values; rather, they caution against common mistakes and emphasize the importance of scientific assumptions, trusting the authors to respect the nuance of any statistics tools. Deborah Mayo, award-winning philosopher of statistics and professor at Virginia Tech, thinks this approach to statistical significance, where various fields have different standards, is the most appropriate. Strict cutoffs, regardless of where they fall, are generally bad science.
Mayo was skeptical that it would have the kind of widespread benefit the authors assumed. Their assessment suggested tightening the threshold would reduce the rate of false positives—results that look true but aren’t—by a factor of two. But she questioned the assumption they had used to assess the reduction of false positives—that only 1 in 10 hypotheses a scientist tests is true. (Mayo said that if that were true, perhaps researchers should spend more time on their hypotheses.)
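The role of the 1-in-10 assumption can be seen in a small calculation. The sketch below computes, among all results that cross a significance threshold, the share that come from hypotheses with no real effect. It is not a reproduction of the preprint's calculation, and its numbers need not match the figure quoted above: the fixed power of 0.8, the alternative prior of 0.5, and the function name are assumptions made here for illustration only.

```python
def false_positive_share(alpha, power, prior_true):
    """Share of significant results that are false positives.

    alpha:      significance threshold (chance of a significant result when there is no effect)
    power:      chance of a significant result when the effect is real (assumed value)
    prior_true: assumed fraction of tested hypotheses that are true
    """
    false_positives = alpha * (1 - prior_true)
    true_positives = power * prior_true
    return false_positives / (false_positives + true_positives)


for prior_true in (0.1, 0.5):  # 0.1 is the 1-in-10 assumption; 0.5 is an arbitrary contrast
    for alpha in (0.05, 0.005):
        share = false_positive_share(alpha, power=0.8, prior_true=prior_true)
        print(f"prior_true={prior_true}, alpha={alpha}: false positive share = {share:.3f}")
```

Changing `prior_true` moves the result a good deal, which is why the assumption about how many tested hypotheses are true matters so much to any claimed benefit. Holding power fixed is also a simplification, since at a fixed sample size a stricter threshold lowers power.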
But more broadly, she was skeptical of the idea that lowering the informal p-value threshold will help fix the problem, because she’s doubtful such a move will address “what almost everyone knows is the real cause of nonreproducibility”: the cherry-picking of subjects, testing hypothesis after hypothesis until one of them is proven correct, and selective reporting of results and methodology.
There are plenty of other ways that scientists are testing to help address the replication crisis. There’s the move toward pre-registration of studies before analyzing data, in order to avoid fishing for significance. Researchers are also now encouraged to make data and code public so a third party can rerun analyses efficiently and check for discrepancies. More negative results are being published. And, perhaps most importantly, researchers are actually conducting studies to replicate research that has already been published. Tightening standards around p-values might help, but the debate about reproducibility is more than just a referendum on the p-value. The solution will need to be more than that as well.
[i] We did not discuss that recent test ban (“Don’t ask, don’t tell”). If we had, I might have pointed him to my post on “P-value madness”.
Link to Nick Thieme’s Slate article: “Will Lowering P-Value Thresholds Help Fix Science? P-values are already all over the map, and they’re also not exactly the problem.”
Think of three ways of evaluating evidence:
no p-value, p-value, multiplicity and modeling adjusted p-value.
Epi now uses unadjusted p-values _knowing full well they ask lots of questions_. They could move to stronger p-values, say 0.005, and have them adjusted for multiple testing. But that would destroy their very successful “science” business model.
At some point it would be useful to bring Boos and Stefanski 2011 The American Statistician thinking into the discussion. They suggested 0.05 => 0.001. Note the FDA rule of 2 0.05 studies (2 of k actually) is in the direction of a smaller p-value.
I can’t let this comment go without reply. There is no FDA rule of 2 0.05 studies out of k. Despite the recent publication by van Ravenzwaaij and Ioannidis in which the idea of this requirement is suggested, it is not the case that this is a regulatory rule. I remember in the late 90’s speaking to the marketing organisation of a large pharma company in New York, during which this idea was proposed (do as many studies as it takes until you have 2 +ve studies and marketing authorization would follow), and telling them at the time that they would not be successful following such a strategy. Ultimately it is the weight of evidence that matters, and if there are k-2 negative studies, sponsors had better have a better argument than “look we have 2 +ves and that is all your rule requires”. They obviously hadn’t read Fisher:
“In order to assert that a natural phenomenon is experimentally demonstrable we need, not an isolated record, but a reliable method of procedure. In relation to the test of significance, we may say that a phenomenon is experimentally demonstrable when we know how to conduct an experiment which will rarely fail to give us a statistically significant result.” The Design of Experiments, 1947.
If k is large, 2 out of k is not “rarely fail”.
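A rough calculation shows why a "2 out of k" strategy would be weak evidence. The sketch below gives the chance that a drug with no effect produces at least two significant trials out of k. It assumes independent trials and a 0.05 chance of a significant result in each one; that is a simplification, since only differences favouring the drug would count in practice. The function name is made up for this example.

```python
def prob_at_least_two_significant(k, alpha=0.05):
    """Chance that at least 2 of k independent null trials are significant at level alpha."""
    p_none = (1 - alpha) ** k
    p_exactly_one = k * alpha * (1 - alpha) ** (k - 1)
    return 1 - p_none - p_exactly_one


for k in (2, 5, 10, 20):
    print(f"k={k:2d}: P(at least 2 significant | no effect) = "
          f"{prob_at_least_two_significant(k):.4f}")
```

With k = 2 the chance is 0.05 squared, which is the sense in which two significant studies point toward a smaller p-value. As k grows the chance rises steadily, so two positives among many attempts say much less.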
van Ravenzwaaij, D. and J.P. Ioannidis, A simulation study of the strength of evidence in the recommendation of medications based on two trials with statistically significant results. PLoS One, 2017. 12(3): p. e0173184.