The International Methods Colloquium will host a roundtable on Friday, October 27th (noon Eastern time) on the reproducibility crisis in the social sciences, motivated by the paper “Redefine statistical significance.” Recall that this was the paper written by a megateam of researchers as part of the movement to require p ≤ .005, based on appraising significance tests by a Bayes Factor analysis, with prior probabilities on a point null and a given alternative. It seems to me that if you’re prepared to scrutinize your frequentist (error statistical) method on grounds of Bayes Factors, then you must endorse using Bayes Factors (BFs) for inference to begin with. If you don’t endorse BFs (and, in particular, the BF required to get the disagreement with p-values),* then it doesn’t make sense to appraise your non-Bayesian method on grounds of agreeing or disagreeing with BFs.
For suppose you assess the recommended BFs from the perspective of an error statistical account, that is, one that checks how frequently the method would uncover or avoid the relevant mistaken inference.[i] Then, if you reach the stipulated BF level against a null hypothesis, you will find the situation is reversed, and the recommended BF exaggerates the evidence! (In particular, with high probability, it gives an alternative H’ fairly high posterior probability, or comparatively higher probability, even though H’ is false.) Failing to reach the BF cut-off, by contrast, can find no evidence against, and even finds evidence for, a null hypothesis with high probability, even when non-trivial discrepancies exist. BFs and p-values are measuring very different things, and it’s illicit to expect an agreement on numbers.[ii] We’ve discussed this quite a lot on this blog (two posts are linked below [iii]).
If the given list of panelists is correct, it looks to be 4 against 1, but I’ve no doubt that Lakens can handle it.
- Daniel Benjamin, Associate Research Professor of Economics at the University of Southern California and a primary co-author of “Redefine Statistical Significance”.
- Daniel Lakens, Assistant Professor in Applied Cognitive Psychology at Eindhoven University of Technology and a primary co-author of a response to “Redefine Statistical Significance” (under review).
- Blake McShane, Associate Professor of Marketing at Northwestern University and a co-author of the recent paper “Abandon Statistical Significance”.
- Jennifer Tackett, Associate Professor of Psychology at Northwestern University and a co-author of the recent paper “Abandon Statistical Significance”.
- E.J. Wagenmakers, Professor at the Methodology Unit of the Department of Psychology at the University of Amsterdam and a co-author of the paper “Redefine Statistical Significance”.
The paradox for those wishing to abandon significance tests on grounds that there’s “a replication crisis” (and I’m not alleging that everyone under the “lower your p-value” umbrella is advancing this) is that lack of replication is effectively uncovered thanks to statistical significance tests. They are also the basis for fraud-busting, and for adjustments for multiple testing and selection effects. Unlike Bayes Factors, they:
- are directly affected by cherry-picking, data dredging and other biasing selection effects
- are able to test statistical model assumptions, and may have their own assumptions vouchsafed by appropriate experimental design
- block inferring a genuine effect when a method has low capability of having found it spurious.
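The first point in the list can be made concrete with a small sketch. It assumes k independent one-sided z-tests, all with true null hypotheses, where only the most impressive result is reported; the function names and the choice of k = 20 are hypothetical, chosen for illustration only. The error probability of the "report the best of k" procedure is no longer the nominal cut-off, and that is exactly what the p-value picks up on.

```python
import random
from statistics import NormalDist


def prob_at_least_one_significant(alpha: float, k: int) -> float:
    """Chance that at least one of k independent true-null tests reaches level alpha."""
    return 1.0 - (1.0 - alpha) ** k


def simulate_cherry_picking(alpha: float, k: int, trials: int = 20000, seed: int = 0) -> float:
    """Monte Carlo check: draw k null z-statistics, keep only the largest,
    and count how often that cherry-picked result is nominally significant."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1.0 - alpha)  # one-sided critical value
    hits = 0
    for _ in range(trials):
        best_z = max(rng.gauss(0.0, 1.0) for _ in range(k))
        if best_z > z_crit:
            hits += 1
    return hits / trials


if __name__ == "__main__":
    for alpha in (0.05, 0.005):
        analytic = prob_at_least_one_significant(alpha, k=20)
        simulated = simulate_cherry_picking(alpha, k=20)
        print(f"alpha={alpha}: analytic={analytic:.3f}, simulated={simulated:.3f}")
```

Under these assumptions, hunting through 20 null effects yields a nominally significant result with probability of roughly 0.64 at the .05 cut-off, and still roughly 0.10 at the .005 cut-off. The lower threshold shrinks the problem but does not remove it.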
In my view, the result of a significance test should be interpreted in terms of the discrepancies that are well or poorly indicated by the result. So we’d avoid the concern that leads some to recommend a .005 cut-off to begin with. But if the .005 cut-off does become the standard for testing the existence of risks, I’d make “there’s an increased risk of at least r” the test hypothesis in a one-sided test, as Neyman recommends. Don’t give a gift to the risk producers. In the most problematic areas of social science, the real problems are (a) the questionable relevance of the “treatment” and “outcome” to what is purported to be measured, (b) cherry-picking, data-dependent endpoints, and a host of biasing selection effects, and (c) violated model assumptions. Lowering the p-value cut-off will do nothing to help with these problems; forgoing statistical tests of significance will do a lot to make them worse.
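The suggestion above, to read a test result in terms of the discrepancies it does or does not indicate, can be sketched numerically. This is a minimal illustration, assuming a one-sided test of a Normal mean (H0: μ ≤ μ0 vs. H1: μ > μ0) with known σ, and assessing a claim “μ > μ1” by the probability of a sample mean no larger than the one observed were μ equal to μ1. The function name and all the numbers are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def severity_mu_greater_than(mu1: float, xbar: float, sigma: float, n: int) -> float:
    """Probability of observing a sample mean <= xbar if the true mean were mu1.

    A high value means the claim 'mu > mu1' is well indicated by the data;
    a low value means that discrepancy is poorly indicated.
    Assumes a Normal model with known sigma (an illustrative assumption).
    """
    standard_error = sigma / sqrt(n)
    return NormalDist().cdf((xbar - mu1) / standard_error)


if __name__ == "__main__":
    # Hypothetical data: sigma = 1, n = 100, observed mean 0.2 (z = 2 against mu0 = 0).
    for mu1 in (0.0, 0.1, 0.2, 0.3):
        sev = severity_mu_greater_than(mu1, xbar=0.2, sigma=1.0, n=100)
        print(f"claim mu > {mu1:.1f}: assessment = {sev:.3f}")
```

With these made-up numbers, the same result that is significant at about the .023 level indicates “μ > 0” well (about 0.98), “μ > 0.1” moderately (about 0.84), and “μ > 0.3” poorly (about 0.16). Reporting which discrepancies are and are not warranted conveys more than any single cut-off does.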
[i] Both “frequentist” and “sampling theory” are unhelpful names. Since the key feature is basing inference on error probabilities of methods, I abbreviate this as error statistics. The error probabilities are based on the sampling distribution of the appropriate test statistic. A proper subset of error statistical contexts consists of those that utilize error probabilities to assess and control the severity by which a particular claim is tested.
[iii] Two related posts: “p-values overstate the evidence against the null fallacy” and “How likelihoodists exaggerate evidence from statistical tests” (search the blog for others).