I was part of something called “a brains blog roundtable” on the business of p-values earlier this week–I’m glad to see philosophers getting involved.
Next week I’ll be in a session that I think is intended to explain what’s right about P-values at an ASA Symposium on Statistical Inference : “A World Beyond p < .05”.
Our session, “What are the best uses for P-values?“, I take it, will discuss the value of frequentist (error statistical) testing more generally, as is appropriate. (Aside from me, it includes Yoav Benjamini and David Robinson.)
One of the more baffling things about today’s statistics wars is their tendency to readily undermine themselves. One example is how they lead to relinquishing the strongest criticisms against findings that have been based on fishing expeditions. In the interest of promoting a statistical account that downplays error probabilities (Bayes Factors), it becomes difficult to lambast a researcher for engaging in practices on grounds that they violate error probabilities (cherry-picking, multiple testing, trying and trying again, post-data selection effects). You can’t condemn a researcher for engaging in practices on grounds that they violate error probabilities, if you reject error probabilities. The direct criticism has been lost. The criticism is redirected to finding the cherry picked hypothesis improbable. But now the cherry picker is free to discount the criticism as something a Bayesian can always do to counter a statistically significant finding–and they do this very effectively! Moreover, cherry picked hypotheses are often believable, that’s what makes things like post-data subgroups so seductive. Finally, in an adequate account, the improbability of a claim must be distinguished from its having been poorly tested. (You need to be able to say things like, “it’s plausible, but that’s a lousy test of it.”)
Now I realize error probabilities are often criticized as relevant only for long-run error control, but that’s a mistake. Ask yourself: what bothers you when cherry pickers selectively report favorable findings, and then claim to have good evidence of an effect? You’re not concerned that making a habit out of this would yield poor long-run performance–even though it would. What bothers you, and rightly so, is they haven’t done a good job in ruling out spurious findings in the case at hand. You need a principle to explain this epistemological standpoint–something frequentists have only hinted at. To state it informally:
I haven’t been given evidence for a claim C by dint of a method that had little if any capability to reveal specific flaws in C, even if they are present.
It’s a minimal requirement for evidence; I call it the severity requirement, but it doesn’t matter what its called. The stronger form says:
Data provide evidence for C only to the extent C has been subjected to and passes a reasonably severe test–one that probably would have found flaws in C, if present.
To many, “a world beyond p <.05” suggests a world without error statistical testing. So if you’re to use such a test in the best way, you’ll need to say why. Why use a statistical test of significance for your context and problem rather than some other tool? What type of context would ever make statistical (significance) testing just the ticket? Since the right use of these methods cannot be divorced from responding to expected critical challenges, I’ve decided to focus my remarks on giving those responses. However, I’m still working on it. Feel free to share thoughts.
Anyone who follows this blog knows I’ve been speaking about these things for donkey’s years–search terms of interest on this blog. I’ve also just completed a book, Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (CUP 2018).
 The inference to denying the weight of the finding lacks severity; it can readily be launched by simply giving high enough prior weight to the null hypothesis.
Mayo, D. 2018, Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (Cambridge, 2018)