The replication crisis has created a “cold war between those who built up modern psychology and those” tearing it down with failed replications, or so I read today [i]. As an outsider (to psychology), the severe tester is free to throw some fuel on the fire on both sides. This is a short update on my post “Some ironies in the replication crisis in social psychology” from 2014.
Following the model from clinical trials, an idea gaining steam is to prespecify a “detailed protocol that includes the study rationale, procedure and a detailed analysis plan” (Nosek et al. 2017). In this new paper, they’re called registered reports (RRs). An excellent start. I say it makes no sense to favor preregistration and deny the relevance to evidence of optional stopping and outcomes other than the one observed. That your appraisal of the evidence is altered when you actually see the history supplied by the RR is equivalent to worrying about biasing selection effects when they’re not written down; your statistical method should pick up on them (as do p-values, confidence levels and many other error probabilities). There’s a tension between the RR requirements and accounts following the Likelihood Principle (no need to name names [ii]).
“By reviewing the hypotheses and analysis plans in advance, RRs should also help neutralize P-hacking and HARKing (hypothesizing after the results are known) by authors, and CARKing (critiquing after the results are known) by reviewers with their own investments in the research outcomes, although empirical evidence will be required to confirm that this is the case” (Nosek et al.).
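To make the optional stopping point concrete, here is a minimal simulation sketch (mine, not from Nosek et al.). It assumes a two-sided z-test of a true null with known unit variance, and compares a single test at a fixed sample size with a researcher who peeks every few observations and stops at the first nominally significant result. The function names, the number of looks, and the sample sizes are all hypothetical choices for illustration.

```python
import math
import random


def rejects_fixed_n(rng, n, z_crit=1.96):
    """One two-sided z-test at a prespecified sample size n (null is true)."""
    total = sum(rng.gauss(0.0, 1.0) for _ in range(n))
    return abs(total / math.sqrt(n)) > z_crit


def rejects_with_peeking(rng, max_n, look_every, z_crit=1.96):
    """Test after every `look_every` observations; stop at the first rejection."""
    total = 0.0
    for i in range(1, max_n + 1):
        total += rng.gauss(0.0, 1.0)
        if i % look_every == 0 and abs(total / math.sqrt(i)) > z_crit:
            return True
    return False


def type_one_error_rates(trials=4000, max_n=100, look_every=10, seed=1):
    """Estimate how often each procedure rejects a true null hypothesis."""
    rng = random.Random(seed)
    fixed = sum(rejects_fixed_n(rng, max_n) for _ in range(trials)) / trials
    peeking = sum(
        rejects_with_peeking(rng, max_n, look_every) for _ in range(trials)
    ) / trials
    return fixed, peeking


if __name__ == "__main__":
    fixed, peeking = type_one_error_rates()
    print(f"fixed n:           {fixed:.3f}")
    print(f"optional stopping: {peeking:.3f}")
```

Under these assumptions the fixed-n rate should sit near the nominal 0.05, while the peeking rate comes out well above it. The final data set can look identical in the two cases, which is why an account that ignores the stopping rule cannot register the difference that the preregistered history is meant to expose.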
A novel idea is that papers are to be provisionally accepted before the results are in. To the severe tester, that requires the author to explain how she will pinpoint blame for negative results. How will she use them to learn something (improve or falsify claims or methods)? I see nothing in preregistration, in and of itself, so far, to promote that. Existing replication research doesn’t go there. It would be wrong-headed to condemn CARKing, by the way. Post-data criticism of inquiries must be post-data. How else can you check if assumptions were met by the data in hand? [Note 7/12: Of course, what such criticisms must not be are ad hoc saves of the original finding, else they are unwarranted: minimal severity.] It would be interesting to see inquiries into potential hidden biases not often discussed. For example, what did the students (experimental subjects) know and when did they know it (the older the effect the more likely they know it)? What’s the attitude toward the finding conveyed (to experimental subjects) by the person running the study? I’ve little reason to point any fingers; it’s just part of the severe tester’s inclination toward cynicism and error probing. (See my “rewards and flexibility hypothesis” in my earlier discussion.)
It’s too soon to see how RRs will fare, but plenty of credit is due to those sticking their necks out to upend the status quo. Research into changing incentives is a field in its own right. The severe tester may, again, appear awfully jaundiced to raise any qualms, but we shouldn’t automatically assume that research into incentivizing researchers to behave in a fashion correlated with good science (data sharing, preregistration) is itself likely to improve the original field. Not without thinking through what would be needed to link statistics up with the substantive hypotheses or problem of interest. (Let me be clear, I love the idea of badges and other carrots; it’s just that the real scientific problems shouldn’t be lost sight of.) We might be incentivizing researchers to study how to incentivize researchers to behave in a fashion correlated with good science.
Surely there are areas where the effects or measurement instruments (or both) genuinely aren’t genuine. Isn’t it better to falsify them than to keep finding ad hoc ways to save them? Is jumping on the meta-research bandwagon[iii] just another way to succeed in a field that was questionable? Heresies, I know.
To get the severe tester into further hot water, I’ll share with you her view that, in some fields, if they completely ignored statistics and wrote about plausible conjectures about human motivations, prejudices, attitudes etc. they would have been better off. There’s a place for human interest conjectures, backed by interesting field studies rather than experiments on psych students. It’s when researchers try to “test” them using sciency methods that the whole thing becomes pseudosciency.
Please share your thoughts. (I may add to this, calling it (2).)
[i] This article mentions a failed replication discussed on Gelman’s blog on July 8, on which I left some comments.
[ii] New readers, please search “likelihood principle” on this blog.
[iii] This must be distinguished from the use of “meta” in describing a philosophical scrutiny of methods (meta-methodology). Statistical meta-researchers do not purport to be doing philosophy of science.