On the current state of play in the crisis of replication in psychology: some heresies


The replication crisis has created a “cold war between those who built up modern psychology and those” tearing it down with failed replications–or so I read today [i]. As an outsider (to psychology), the severe tester is free to throw some fuel on the fire on both sides. This is a short update on my post “Some ironies in the replication crisis in social psychology” from 2014.

Following the model from clinical trials, an idea gaining steam is to prespecify a “detailed protocol that includes the study rationale, procedure and a detailed analysis plan” (Nosek et al. 2017). In this new paper, these are called registered reports (RRs). An excellent start. But I say it makes no sense to favor preregistration while denying that optional stopping, and outcomes other than the one observed, are relevant to evidence. If your appraisal of the evidence is altered when you actually see the history supplied by the RR, that is tantamount to worrying about biasing selection effects even when they’re not written down; your statistical method should pick up on them (as do p-values, confidence levels and many other error probabilities). There’s a tension between the RR requirements and accounts that follow the Likelihood Principle (no need to name names [ii]).
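To make the optional stopping point concrete, here is a minimal simulation sketch of my own (it is not from the Nosek et al. paper; the one-sample t-test, sample sizes and cutoff are merely illustrative): a researcher peeks at the data after every new observation and stops as soon as p < 0.05. Even though the null hypothesis is true throughout, the actual probability of ending with a “significant” result climbs far above the nominal 5%, which is exactly the kind of selection effect that error probabilities register and that appraisals obeying the Likelihood Principle, by design, ignore.

```python
# Illustrative simulation (a sketch, not from the post): optional stopping
# ("try and try again") inflates the actual type I error rate well above
# the nominal 0.05 level, even though the null hypothesis is true.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

n_studies = 1000   # simulated studies, each under a true null (mean = 0)
min_n = 10         # first look at the data at n = 10
max_n = 200        # give up after 200 observations
alpha = 0.05

false_rejections = 0
for _ in range(n_studies):
    x = rng.normal(loc=0.0, scale=1.0, size=max_n)
    # Peek after every new observation; stop as soon as p < alpha.
    for n in range(min_n, max_n + 1):
        p_value = stats.ttest_1samp(x[:n], popmean=0.0).pvalue
        if p_value < alpha:
            false_rejections += 1
            break

print(f"Nominal level: {alpha:.2f}")
print(f"Actual rate of nominally significant results under the null: "
      f"{false_rejections / n_studies:.2f}")
# The actual rate comes out several times larger than 5%, and it keeps
# growing as max_n increases: the stopping rule, not just the data in hand,
# determines the error probability, which is why a preregistered analysis
# plan (or an analysis that adjusts for the plan) has evidential relevance.
```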

“By reviewing the hypotheses and analysis plans in advance, RRs should also help neutralize P-hacking and HARKing (hypothesizing after the results are known) by authors, and CARKing (critiquing after the results are known) by reviewers with their own investments in the research outcomes, although empirical evidence will be required to confirm that this is the case” (Nosek et al. 2017).

A novel idea is that papers are to be provisionally accepted before the results are in. To the severe tester, that requires the author to explain how she will pinpoint blame for negative results. How will she use them to learn something (improve or falsify claims or methods)? I see nothing in preregistration, in and of itself, so far, to promote that. Existing replication research doesn’t go there. It would be wrong-headed to condemn CARKing, by the way: post-data criticism of inquiries must be post-data. How else can you check whether assumptions were met by the data in hand? [Note 7/12: Of course, what such criticisms must not be are ad hoc saves of the original finding, else they are unwarranted (minimal severity).] It would be interesting to see inquiries into potential hidden biases not often discussed. For example, what did the students (experimental subjects) know, and when did they know it (the older the effect, the more likely they know it)? What attitude toward the finding is conveyed (to experimental subjects) by the person running the study? I’ve little reason to point any fingers; it’s just part of the severe tester’s inclination toward cynicism and error probing. (See my “rewards and flexibility hypothesis” in my earlier discussion.)

It’s too soon to see how RRs will fare, but plenty of credit is due to those sticking their necks out to upend the status quo. Research into changing incentives is a field in its own right. The severe tester may, again, appear awfully jaundiced to raise any qualms, but we shouldn’t automatically assume that research into incentivizing researchers to behave in a fashion correlated with good science (data sharing, preregistration) is itself likely to improve the original field. Not without thinking through what would be needed to link the statistics up with the substantive hypotheses or problems of interest. (Let me be clear, I love the idea of badges and other carrots; it’s just that the real scientific problems shouldn’t be lost sight of.) We might just be incentivizing researchers to study how to incentivize researchers to behave in a fashion correlated with good science.

Surely there are areas where the effects or the measurement instruments (or both) simply aren’t genuine. Isn’t it better to falsify them than to keep finding ad hoc ways to save them? Is jumping on the meta-research bandwagon [iii] just another way to succeed in a field that was already questionable? Heresies, I know.

To get the severe tester into further hot water, I’ll share with you her view that, in some fields, researchers would have been better off if they had ignored statistics entirely and written about plausible conjectures concerning human motivations, prejudices, attitudes, and the like. There’s a place for human-interest conjectures, backed by interesting field studies rather than experiments on psych students. It’s when researchers try to “test” them using sciency methods that the whole thing becomes pseudosciency.

Please share your thoughts. (I may add to this, calling it (2).)

Nosek, B. A., Ebersole, C. R., DeHaven, A. C., & Mellor, D. T. (2017, July 8). The Preregistration Revolution (PDF). Open Science Framework. Retrieved from osf.io/2dxu5

[i] This article mentions a failed replication discussed on Gelman’s blog on July 8, on which I left some comments.

[ii] New readers, please search “likelihood principle” on this blog.

[iii] This must be distinguished from the use of “meta” in describing a philosophical scrutiny of methods (meta-methodology). Statistical meta-researchers do not purport to be doing philosophy of science.



9 thoughts on “On the current state of play in the crisis of replication in psychology: some heresies”

  1. Great post. We need more of these heresies!

  2. Anonymous

    “A novel idea is that papers are to be provisionally accepted before the results are in. To the severe tester, that requires the author to explain how she will pinpoint blame for negative results. How will she use them to learn something (improve or falsify claims or methods)? I see nothing in preregistration, in and of itself, so far, to promote that. Existing replication research doesn’t go there.”

    I am a junior researcher and a little confused after reading this post. I have 2 questions which may help me understand the point(s) of this post:

    1) Doesn’t a “failed” replication contribute to potentially falsifying claims and/or methods?

    2) Doesn’t accepting papers before the results are known (RRs) in some way encourage potentially falsifying/improving claims and/or methods, because “negative” results will enter the literature and have to be somehow dealt with?

    • Anonymous:
      You ask:
      1) Doesn’t a “failed” replication contribute to potentially falsifying claims and/or methods?

      They could, but existing failed replications haven’t gone that far. They’ve generally been reported as indicating that the effect might be smaller than in the original study, or that the failed replication was due to “lack of fidelity”. I haven’t seen discussions questioning the presupposition that the artificial experiment is probing the substantive phenomenon of interest. More than that, they should set out tests to possibly falsify some of the measurement procedures. Possibly the psych literature will now declare certain effects “overturned”, and that’s something, you’re right, but it’s not clear how that contributes to understanding the original phenomenon.

      The use of statistical tests to move from statistical significance, and successful replication of the statistical effect H, to a substantive research claim H* is still fallacious; it is not permitted by the valid use of statistical tests. I don’t see that mentioned, at least in the context of replication research.

      It can also happen that the general substantive hypothesis H* is correct, but the claim studied statistically, H, doesn’t follow from it. So if the statistical effect H is overturned by an inability to replicate it, it doesn’t follow that H* should be denied.

      Attention to the avoidance of selection effects, cherry-picking, multiple testing, and the like is all to the good, and very welcome, but there’s a questionable presumption that if those are avoided, the research inferences will be on firm scientific footing.

  3. Fritz Strack

    Dear Dr. Mayo,

    as a “big shot”, I guess, I don’t have to introduce myself 😉
    I’ve just read your recent blog with great interest.

    I think it is time the epistemological dimension enters a purely statistical debate.
    It is really funny: during the first years of my studies in Mannheim, I have been intensely studying epistemology and logic with Hans Albert. He was well connected, and I had the chance to meet Popper (twice), Feyerabend, Lakatos, etc.

    During my career as a social psychologist, I have been predominantly focussing on judgments and social cognition. Together with a colleague, I developed a dual-systems model that has been frequently cited, much more often than the “classic” pen study, which is certainly not my main identification.

    However, I am willing to use the failed “replication” as a starting point for attempting to eventually bring back some important epistemological insights that seem to be completely forgotten or neglected. Right now, I am planning a paper on external validity where I will be arguing that the prevalence of null-hypothesis significance testing has falsely spread the importance of ideas like “representativeness” and “generalization”, instead of hypothesis testing. Also, from a hypothesis-testing perspective, the idea of “effect size” is alien to basic science.

    Anyway, I am looking forward to reading about your thoughts on these issues.

    Best regards,

    Fritz Strack

    • Dear Dr. Strack:
      I’m grateful for your comment. Some first thoughts:

      “have been intensely studying epistemology and logic with Hans Albert.”

      Some of the proceedings of a conference I co-organized (with A. Spanos), “Statistical Science and Philosophy of Science: Where Should They Meet?”, along with related discussions soon afterwards, may be found in an online journal co-edited by Max Albert, his son (if I’m not mistaken).
      http://www.rmm-journal.de/htdocs/st01.html
      Max was also at the conference, and at others I’ve attended. We’re grouped together as the non-Bayesian philosophers.

      “I am planning a paper on external validity where I will be arguing that the prevalence of null-hypothesis significance testing has falsely spread the importance of ideas like “representativeness” and “generalization”, instead of hypothesis testing. Also, from a hypothesis-testing perspective, the idea of “effect size” is alien to basic science.”

      I don’t know what you mean by saying statistical testing has “falsely spread the importance of ideas like representativeness and generalization.” Are these dangerous ideas for science? Nor do I see why you say “instead of hypothesis testing”: statistical hypothesis testing is a species of hypothesis testing. It’s true that in some fields, such as particle physics and experimental relativity, incredibly small effects are of great interest, so long as they are systematic. A speculative guess, based also on your response to the non-replication, is that you might be saying something like this: what we want is to discover genuine explanations and understanding of mechanisms, and these can be served even by bringing about small effects in highly artificial experimental set-ups that don’t obviously generalize to actual populations, and that can’t be expected to be replicated on the fly by self-selected groups of replicationists. I don’t see how statistical tests of significance are to be blamed, although, as I say in my post, I admit there are areas that might have been better off not looking to artificial experiments.

      Think of economics. It has only relatively recently looked to experimental probes; both modeling and experimental economics are alive and well. Of course economists have an advantage over psychology in being able to create fairly realistic scenarios (or so they argue convincingly). By the way, they refuse to misrepresent the experiment to the experimental subjects, unlike psychology. It seems to me that writing with the pen held in one’s teeth so as to force a smile (or, in one case, with the hand one doesn’t ordinarily use) served only to create a ruse. That is, the subjects could have rated the jokes writing normally.

  4. Brian Nosek linked to this post on twitter: https://twitter.com/BrianNosek/status/885137670225231876

  5. I think there are definite advantages to the rest of psychology, like social psychology, thinking of experiments as being like clinical trials. The clinical trial literature has for decades been developing rules for clarity, experimental control and transparency.

    I studied some flexible alternatives, including stopping rules.

    Click to access 0912f508125e9500ef000000.pdf

    I think that we have to be very careful about stopping rules. They don’t have a good history in clinical trials. Following my own advice, I produced a disaster. We were doing a trial in Germany where we were comparing psychotherapy to a psychological control group and, at the same time, antidepressants to placebo. The a priori comparisons were between psychotherapy and its control group, and then between antidepressants and a pill placebo. Of course, there was a secondary interest in comparing the psychotherapy to the antidepressant. However, the stopping rules dictated that we end the psychotherapy-versus-control-group accrual early because of negative effects occurring in the psychological control group. But when that was done, the psychotherapy group was left underpowered for any comparisons with the antidepressants. There are lots of other examples of stopping rules being applied and then finding that the smaller trials that resulted couldn’t be replicated, for whatever reason.

    • James: Thanks for your comment. The point (mine, anyway) is not to solve the sticky problems of how best to take complex stopping plans into account, but to acknowledge that optional stopping (“trying and trying again”) should not be considered a mere report of someone’s intentions, locked in their head and of no relevance to assessing the warrant of an inference. In some classic cases, it can enable finding significance (or omitting the null value from a confidence interval) with high or maximal probability, even if the null hypothesis is true. It violates what Cox and Hinkley call the “weak repeated sampling” requirement. A very informal discussion is in this post: https://errorstatistics.com/2014/04/05/who-is-allowed-to-cheat-i-j-good-and-that-after-dinner-comedy-hour-2/

      Also this paper:

      Click to access Mayo%20&%20Kruse%20Principles%20of%20inference%20and%20their%20consequences%20B.pdf

      I meant to note that the reports coming out of clinical trials, despite all this being written in stone, indicate that a large percentage ignore or violate the predesignated specifications. Goldacre says that they insist their expertise allows them to do so. (Sometimes it might.)

  6. Strack’s response to the replication study of his work discussed in his comment:

    Click to access strack-2016-smiling-registered-replication-report.pdf

