Remember “Repligate” [“Some Ironies in the Replication Crisis in Social Psychology”] and, more recently, the much publicized attempt to replicate 100 published psychology articles by the Open Science Collaboration (OSC) [“The Paradox of Replication”]? Well, some of the critics involved in Repligate have just come out with a criticism of the OSC results, claiming they’re way, way off in their low estimate of replications in psychology [1]. (The original OSC report is here.) I’ve only scanned the critical article quickly, but some bizarre statistical claims leap out at once. (Where do they get this notion about confidence intervals?) It’s published in Science! There’s also a response from the OSC researchers. Neither group adequately scrutinizes the validity of many of the artificial experiments and proxy variables–an issue I’ve been on about for a while. Without firming up the statistics-research link, no statistical fixes can help. I’m linking to the articles here for your weekend reading. I invite your comments! For some reason a whole bunch of items of interest, under the banner of “statistics and the replication crisis,” are all coming out at around the same time, and who can keep up? March 7 brings yet more! (Stay tuned.)
My subtitle refers to my post alleging that non-replication articles are becoming so hot that non-significant results are the new significant results. Now we have another meta-level. So long as everyone’s getting published, who’s to complain, right? [2]
I’ll likely return to this once I’ve studied the articles–they’re quite short. Or, maybe readers can just share what they’ve found.
_____
[1] Recall mention of one of the authors in the article cited in my earlier post on repligate:
Mr. Gilbert, a professor of psychology at Harvard University, … wrote that certain so-called replicators are “shameless little bullies” and “second stringers” who engage in tactics “out of Senator Joe McCarthy’s playbook” (he later took back the word “little,” writing that he didn’t know the size of the researchers involved).
What got Mr. Gilbert so incensed was the treatment of Simone Schnall, a senior lecturer at the University of Cambridge, whose 2008 paper on cleanliness and morality was selected for replication in a special issue of the journal Social Psychology.
Wilson was also mentioned.
[2] Never mind if there’s little if any progress in understanding the statistics or the phenomenon.
REFERENCES
Gilbert, King, Pettigrew, and Wilson (2016), “Comment on ‘Estimating the reproducibility of psychological science’,” and the OSC researchers’ “Response.”
OSC report: Estimating the Reproducibility of Psychological Science.
Other blog discussions on this (please add any you find in the comments).
- Uri Simonsohn, on Data Colada: “Evaluating Replications: 40% Full ≠ 60% Empty,” 3/3/16 post.
- Gelman’s blog: “More on replication,” 3/3/16 post.
- Gelman’s blog: “Replication crisis crisis,” 3/5/16 post.
- Simine Vazire, on the Sometimes I’m Wrong blog: “is this what it sounds like when the doves cry?” http://sometimesimwrong.typepad.com/wrong/2016/03/doves-cry.html
- Sanjay Srivastava, on The Hardest Science blog: “Evaluating a new critique of the Reproducibility Project” https://hardsci.wordpress.com/2016/03/03/evaluating-a-new-critique-of-the-reproducibility-project/
- Bishop blog: “There is a reproducibility crisis in psychology and we need to act on it” http://deevybee.blogspot.co.uk/2016/03/there-is-reproducibility-crisis-in.html
- Daniel Lakens, on The 20% Statistician blog: http://daniellakens.blogspot.com/2016/03/the-statistical-conclusions-in-gilbert.html?spref=tw
- Nosek, on Retraction Watch: “Let’s not mischaracterize replication studies: authors.”
The following references to discussions of the OSC criticism are from Retraction Watch:
- Monya Baker, at Nature, takes a look at the analysis: “Psychology’s reproducibility problem is exaggerated – say psychologists.”
- Benedict Carey does the same, at The New York Times.
- Slate’s Rachel Gross has detailed comments from Brian Nosek, who led the original replication effort.
- “Psychology Is in Crisis Over Whether It’s in Crisis,” Katie Palmer at WIRED writes. Palmer notes that Harvard’s Dan Gilbert, one of the authors of the Science article, who in the past has called replicators “shameless bullies,” hung up on her when she asked “if he thought his defensiveness might have colored his interpretation of this data.”
- The reason why many of the studies involved in the Reproducibility Project didn’t replicate? “Overestimation of effect sizes…due to small sample sizes and publication bias in the psychological literature,” says a new paper in PLOS ONE.
- Ed Yong weighs in at The Atlantic with “Psychology’s replication crisis can’t be wished away.”
I seemed to recall the OSC claiming that all the authors of the original studies worked with them to come to an agreement as to how the replication would be conducted. If any reader is inclined to check their report for this, please let me know. I do recall there being an issue regarding the OSC choosing to replicate only the last of several studies from the original report.
There’s something else that concerned me: in the one study I read, the students were told at the end that this was a replication attempt, and were asked not to tell others the purpose of the study if they thought there was any chance those others would sign up. Now these psych students are required to sign up for studies, so I couldn’t understand why they were told this. On the other hand, I really don’t know how these psych experiments are normally run: do they tell the students or other subjects at the end of the study what they were really testing for? That’s relevant because it’s a big part of psych studies to try to hide the purpose of the study. I found myself sympathetic with Bressan’s criticisms.
It is true that the original authors were part of the process in most cases. See “Replication protocol” here in the supplemental materials (supplement). Not all of the original authors responded to requests to be involved, however.
Thanks, it was unclear; unclear as well why they went ahead with protocols when criticisms were raised by original authors. But frankly, looking at the original studies brought up in the critique, they aren’t much better. Do they always tell the subjects the purpose of the study once it’s over, even during the replication project (as in the one I read)?
Yes, telling participants the purpose of an experiment is required by ethical review boards; it’s called a debriefing, and is done to make sure that the participants in the studies 1) benefit from the study (educationally), and 2) are well-informed about what they just experienced.
Richard: I didn’t know this, but then again it’s rather unusual to have deception in experiments. Econ won’t do it, to my knowledge.
But if subjects are coming on board over a period of several months, it’s not the slightest bit plausible to suppose that no one mentioned the rationale to future students in the replication project. Is this source of bias discussed anywhere?
In the line of research I’m trained in (memory and perception) it is considered irrelevant because it is highly unlikely that it will change any results (you can either see/remember something or not; knowing the purpose of the experiment should not affect anything). I don’t know if this is discussed in other areas in psychology, but I *do* know that undergrads talk about research after they’ve done it; at the very least, they discuss with one another what experiments are interesting or boring, or “easy” credits (they’re often required to do so many in a semester, so they tell their friends about which ones they should do/not do).
Whether anyone finds this problematic for their research, I do not know. I will ask people to post their thoughts here.
Thanks. I’d find it troubling. If it didn’t matter, they’d tell them the purpose right off, e.g., “we’re going to have you read a passage on free will and then see if you’re inclined not to press a button when the computer gives you the answer to a math problem you were to have performed yourself (meaning you’ve cheated).” This is a terrible protocol for showing what’s claimed to begin with, but it would be really silly if subjects were told. In just about all the psych experiments I’ve seen, you could probably manufacture any result you wanted by informing subjects. A good study would be to prove that–it would be quite easy.
I’ve been suggesting for years that an interesting research project for psych would be to test the assumed forms of measurement. Now here’s another: scrutinize those studies on students.
Does prior knowledge of experiments affect performance? This is an empirical question that a group of colleagues and I are investigating in a series of 9 preregistered experiments (3 perception/action, 3 memory, and 3 language comprehension). Half the subjects participate twice. We hypothesize that prior participation will not (much) affect performance in the kinds of experiments in our set (which is probably comparable to the kinds of experiments Richard is referring to). More specifically, we don’t expect a decrease in the effect size from the first to the second participation. The reason for this is that subjects typically are unaware of the manipulation in the experiments in our set and that responses are largely automatic. Data collection is complete, so we should have results soon. A study by Gabriele Paolacci (one of my collaborators on this project) and his colleagues found a decrease in effect size for studies in behavioral economics in which the manipulation is much more obvious than in our cognitive experiments. The Paolacci paper can be found here: http://pss.sagepub.com/content/early/2015/06/10/0956797615585115.abstract.
Thanks. Of course what matters is not where it makes no difference but where it does. But the replication project also had some possibly unique issues in that the same study was ongoing for many months. Or is that always true? Further, in revealing that it was a replication attempt (and it isn’t obvious to me why that would be revealed), a distinct motive may enter (e.g., show them wrong or right).
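For concreteness, here is a minimal sketch, with invented data, of the kind of first-vs.-second-participation comparison described above (it is not the preregistered analysis): compute a standardized effect size, e.g. Cohen’s d, separately for first-time and repeat participants and see whether it shrinks.

```python
# Hypothetical sketch: Cohen's d for a manipulation at first vs. second
# participation. All numbers are invented, purely for illustration.
import numpy as np

def cohens_d(group_a, group_b):
    """Standardized mean difference with a pooled standard deviation."""
    na, nb = len(group_a), len(group_b)
    pooled_var = ((na - 1) * np.var(group_a, ddof=1) +
                  (nb - 1) * np.var(group_b, ddof=1)) / (na + nb - 2)
    return (np.mean(group_a) - np.mean(group_b)) / np.sqrt(pooled_var)

rng = np.random.default_rng(1)
# Simulated outcomes for a treatment vs. control manipulation, measured
# once at first participation and again at second participation.
first_treat, first_ctrl = rng.normal(0.5, 1, 40), rng.normal(0.0, 1, 40)
second_treat, second_ctrl = rng.normal(0.5, 1, 40), rng.normal(0.0, 1, 40)

print("d at first participation: ", round(cohens_d(first_treat, first_ctrl), 2))
print("d at second participation:", round(cohens_d(second_treat, second_ctrl), 2))
```

If prior participation mattered, one would expect the second d to be systematically smaller; in this toy setup both waves are drawn from the same distributions, so any difference is just noise.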
Debriefing is essential. And if properly handled, creates little threat to validity.
Studies that seem very juicy, theoretically, can be explained in ways that, although correct and informative, do not play up the excitement of the study.
If the participants are treated with respect and are told about the importance of keeping the study participants ignorant, they usually go along with the request for not discussing it with other students.
There has been research on this–and such requests are surprisingly successful. This doesn’t mean that it always works this way, but that it’s possible.
In addition, one shouldn’t overestimate contact among many undergraduates in their first semester of college at a large state university.
Thanks so much for the info. Are you saying then, that research usually is conducted over a semester? Curious to read about how they test the success of secrecy requests. My experience w/ students in a large state university might be different. The replication project has a somewhat unique feature: telling students it’s an attempt to see if such and such study holds up.
There’s an excellent discussion with charts of the two approaches to appealing to confidence intervals to determine replication success. Both are incorrect and problematic, as I think Simonsohn is aware*. This is the kind of inadvertent but faulty use of CIs encouraged by the language used by CI advocates, e.g., Cumming. Check Simonsohn’s pics.
However, one thing he writes is curious:
“For a replication to fail, the data must support the null. They must affirm the non-existence of a detectable effect.”
*Surely the replicators didn’t consider a statistically significant result in a replication as a failure because the initial estimate is not within the replication CI. I assume he means that once finding a non-significant result in a replication, they check to see if the original estimate is within the replication CI. Else it’s crazy.
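To make the two CI criteria at issue concrete, here is a minimal sketch with invented numbers (not the OSC’s or Simonsohn’s actual analysis): criterion A asks whether the original point estimate falls inside the replication’s 95% CI; criterion B asks whether the replication is statistically significant on its own, i.e., whether its 95% CI excludes zero.

```python
# Minimal sketch of the two CI-based replication criteria discussed above.
# Effect estimates and standard errors are invented for illustration.
from scipy import stats

def ci95(estimate, se):
    """Two-sided 95% confidence interval for an estimate with standard error se."""
    z = stats.norm.ppf(0.975)
    return estimate - z * se, estimate + z * se

original_est, original_se = 0.60, 0.22          # "significant" original result
replication_est, replication_se = 0.25, 0.10    # larger replication sample

rep_lo, rep_hi = ci95(replication_est, replication_se)

# Criterion A: is the original point estimate inside the replication CI?
original_inside_rep_ci = rep_lo <= original_est <= rep_hi

# Criterion B: is the replication significant on its own (CI excludes 0)?
replication_significant = not (rep_lo <= 0 <= rep_hi)

print(f"Replication 95% CI: ({rep_lo:.2f}, {rep_hi:.2f})")
print("Original estimate inside replication CI:", original_inside_rep_ci)
print("Replication significant on its own:     ", replication_significant)
```

With these made-up numbers the replication is significant on its own (CI roughly 0.05 to 0.45), yet the original estimate of 0.60 lies outside the replication CI, so the two criteria give opposite verdicts; counting such a case as a “failure” is exactly the move questioned above.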
You say “Neither group adequately scrutinizes the validity of many of the artificial experiments and proxy variables”
Can you expand on what you mean here and/or provide links to your previous discussion of this please?
Tom: I mean they’re looking just at the statistics, assuming that attaining a small p-value is indicative of the research hypothesis they have in mind. They are reaching causal conclusions on the basis of highly artificial studies. Look at my 2nd installment:
https://errorstatistics.com/2014/06/30/some-ironies-in-the-replication-crisis-in-social-psychology-1st-installment/
The replicationist’s question of methodology, as I understand it, is alleged to be what we might call “purely statistical”. It is not: would the initial positive results warrant the psychological hypothesis, were the statistics unproblematic? The presumption from the start was that the answer to this question is yes. In the case of the controversial Schnall study, the question wasn’t: can the hypotheses about cleanliness and morality be well tested or well probed by finding statistical associations between unscrambling cleanliness words and “being less judgmental” about things like eating your dog if he’s run over? At least not directly. In other words, the statistical-substantive link was not at issue. The question is limited to: do we get the statistically significant effect in a replication of the initial study, presumably one with high power to detect the effects at issue.
Thanks, that helps a lot (and, FWIW, I agree completely that many of these results are somewhere between dubiously meaningful and nonsensical even if they do replicate)
Totally agree with you on this. The “crisis,” very long in the making, developed not only due to the flakiness of the methods used to test hypotheses, but also and more importantly due to the triviality and absence of conceptual foundations of the hypotheses to begin with. Anything and everything could be thrown into the experimental blender, and the splatter of significant results taken as truth. The most obvious example of the problem with this environment is probably Bem’s notorious “proof” of ESP; a result proving something that violates the laws of physics is published because the result is “statistically significant.”
Will philosophers be able to prevent bad psych studies from encroaching upon philosophy?
My main worries with the replicationist conclusions in psychology are that they harbor many of the same presuppositions that cause problems in (at least some) psychological experiments to begin with, notably the tendency to assume that differences observed–any differences– are due to the “treatments”, and further, that they are measuring the phenomenon of interest. Even nonsignificant observed differences are interpreted as merely indicating smaller effects of the experimental manipulation, when the significance test is shouting disconfirmation, if not falsification.
It’s particularly concerning to me in philosophy because these types of experiments are becoming increasingly fashionable in “experimental philosophy,” especially ethics. Ethicists rarely are well-versed in statistics, but they’re getting so enamored of introducing an “empirical” component into their work that they rely on just the kinds of psych studies open to the most problems. Worse, they seem to think they are free from providing arguments for a position, if they can point to a psych study, and don’t realize how easy it is to read your favorite position into the data. This trend, should it grow, may weaken the philosopher’s sharpest set of tools: argumentation and critical scrutiny. Worse still, they act like they’re in a position to adjudicate long-standing philosophical disagreements by pointing to a toy psych study. One of the latest philosophical “facts” we’re hearing now is that political conservatives have greater “disgust sensitivity” than liberals. The studies are a complete mess, but I never hear any of the speakers who drop this “fact” express any skepticism. (Not to mention that it’s known the majority of social scientists are non-conservatives–by their definition.)
One of the psych replication studies considered the hypothesis: believing determinism (vs. free will) makes you more likely to cheat. The “treatment” is reading a single passage on determinism. How do they measure cheating? You’re supposed to answer a math problem, but are told the computer accidentally spews out the correct answer, so you should press a key to get it to disappear and work out the problem yourself. The cheating effect is measured by seeing how often you press the button. But the cheater could very well copy down the right answer given by the computer and be sure to press the button often so as to be scored as not cheating. Then there’s the Macbeth effect, tested by unscrambling soap words and getting you to rate how awful it is to eat your just run-over dog. See this post: https://errorstatistics.com/2014/04/08/out-damned-pseudoscience-non-significant-results-are-the-new-significant-results/. I could go on and on.
Maybe this new fad is the result of the death of logical positivism and the Quinean push to “naturalize” philosophy; or maybe it’s simply that ethics has run out of steam. Fortunately, I’m not in ethics, but it’s encroaching upon philosophical discussions and courses. It offends me greatly to see hard-nosed philosophers uncritically buying into these results. In fact, I find it triggers my sensitivity to disgust, even though I score high on their 6-point “liberal” scale.
Does knowledge of the study influence the results? It seems incredibly common sense that it does in a lot of social psychology studies (especially when deception is used). If people know the purpose of the study, some will just go along with it and say what they think you want in their responses. But I’d guess a higher percentage will be motivated to not be influenced by the researchers. Lots of social psych research finds that people tend to think they are less susceptible to social influences (like adverts, peer pressure – and I presume, social psychology studies) than they think other people are. And we all know that people are motivated to maintain (or at least believe they have) free will and autonomy. From my view, it seems baffling that this is even in question in the context of social psychology studies with experimental manipulations. In the same vein, what students are taught in courses about the topic they are researching also influences the results.
Someone above also mentioned that at large state universities in the U.S. students might not talk much to each other. That may or may not be true; lots of students form Facebook groups for their courses, where one person mentioning this could be seen by lots of students. And in the UK all students in psychology take all required courses together (each course is only offered once a year and you have to take it the year it is offered; 1st years only take 1st-year classes, 2nd years only take 2nd-year, etc.), and most courses are required. So you have 150-250 students in each year taking exactly the same courses (so yes, they do talk to each other, as they are all in every single class together) being eligible for often a small subset of studies. I’d say it is a safe bet they talk to each other. And when it is a small-sample study (like the original studies being replicated), even a few participants can make a major difference: just fiddle with which people to take out and this becomes obvious; removing one or two can cause quite a swing (that’s why p-hacking is a thing).
nheflick: I had missed your interesting comment. All that you say is plausible to me, and would make me question these replication and rereplication efforts. Now they will try to see if ones that didn’t replicate but had non-endorsed protocols will replicate if original authors endorse the protocol. Aside from the fact that the 10 studies are advertised, there’s an incentive to uphold the failed replication. Still no critique of the substantive causal claim or the presumption that the experiment is measuring the intended effect.
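On the small-sample point above, here is a rough simulation, with invented numbers and no connection to any particular study, of how often a flattering post hoc removal of just two participants can push a two-group result across the p < 0.05 line.

```python
# Rough simulation of the small-sample point above: with ~15 subjects per
# group, a flattering post hoc exclusion of two participants can flip a
# non-significant t-test to "significant" surprisingly often.
# All numbers are invented, purely for illustration.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_sims, n_per_group, flips, nonsig = 2000, 15, 0, 0

for _ in range(n_sims):
    treatment = rng.normal(0.3, 1.0, n_per_group)   # small true effect
    control = rng.normal(0.0, 1.0, n_per_group)
    _, p_all = stats.ttest_ind(treatment, control)
    # Drop the two lowest treatment scores -- an exclusion rule chosen
    # after seeing the data, i.e., the raw material of p-hacking.
    _, p_trim = stats.ttest_ind(np.sort(treatment)[2:], control)
    if p_all >= 0.05:
        nonsig += 1
        if p_trim < 0.05:
            flips += 1

print(f"Of {nonsig} non-significant results, {flips} became 'significant' "
      f"after dropping 2 subjects.")
```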
Pingback: Evaluating a new critique of the Reproducibility Project | The Hardest Science
A comment on the ASA site entered by Stuart Hurlbert:
(I’ve known him as a radical Fisherian)
It is an “interesting” comment on the discipline and practice of statistics that it takes a special commission to restate and reaffirm six principles the validity of which has been understood for more than half a century and which students should understand after any good 1-semester introductory statistics course. A strong testament to the rarity of good courses – and how little most statisticians know of the historical literature!
There are many sources of the massive disarray in statistical understanding and practice, and I and a few colleagues have been writing about these for decades, as have others. For one, most statistics texts, whether written by statisticians (Hurlbert 2013a) or biologists (Hurlbert 2013b) or other scientists, contain fair amounts of bad advice and error. For another, editors and reviewers of journals, including statistical journals, often fail to detect even gross errors in manuscripts and offer bad advice or instructions on statistical matters. The only time I submitted a manuscript to The American Statistician I ran into an editor and a reviewer who thought I misunderstood the classic definitions of ‘experimental unit’ and ‘blocking.’ So I found a journal more like the ones Fisher used to publish in (details in supplemental materials for Hurlbert 2013a)! Lazy scholarship on the part of authors criticizing statistical practice is a big problem. Much of the time it is evident that they have read very little of the historical literature on the point (or error) they are making, which in the historical literature has been made and corrected over and over again. Especially in the last few years, off-the-cuff, ‘drama queen’ authors have been getting a free pass. This is particularly true of the literature critical of P values and null hypothesis testing. But I and my colleague, Celia Lombardi, noticed this in the literature for EVERY topic we’ve done review articles on (e.g. Hurlbert & Lombardi 2009a, b, 2012, 2016).
But back to the ASA statement. This is a pretty good statement considering the cats that had to be herded. No actual errors that I can detect, but perhaps a few weaknesses:
1. Saying “a p-value near 0.05 taken by itself offers only weak evidence against the null hypothesis” is meaningless because “weak” can only be defined by comparison with something else. It is not “weak” evidence relative to a p-value of 0.30, for sure! Presumably this was a sop to Bayesians who would, if candid, say it should be compared to an ‘objective’ Bayesian posterior of 0.05. Two of our papers (Hurlbert & Lombardi 2009a, b) exposed many cases where Bayesians have used fallacious logic and word games to discredit p-values. But now ASA has officially declared p = 0.05 to be “weak evidence.” What a diabolical tool to put into the hands of rigid, curmudgeonly editors!! Load those Bayesians into the tumbril!
2. Principle 3 almost gets to the neoFisherian position (Hurlbert & Lombardi 2009a) that alphas should not be specified and the term “statistically significant” never used, a position advocated for decades by many top statisticians and other scientists. In the contexts of basic and applied research, “binary decisions” are never needed. Quality control contexts, fine. Providing cover to FDA bureaucrats, fine. But in the conduct and presentation of research, never. This principle needs to be clarified by removing the vague waffling and going whole-hog neoFisherian.
3. After all the trouble to be correct and clear in the official statement of principles, the commission perhaps erred in putting ASA’s imprimatur on a rather eclectic list of references. A disclaimer should be added: “Several of these works are considered controversial and some scientists claim they collectively contain many misstatements of fact and illogical arguments. Caveat emptor. They will provide, however, a good entrée to the literature.” No need to get personal!
I make this suggestion having read at least two-thirds of the works cited – and having pointed out many of the specific problems in them in our 2009 papers in particular.
Now for a commission on “multiplicity paranoia”!
*******************
Hurlbert, S.H. and C.M. Lombardi. 2009a. Final collapse of the Neyman-Pearson decision theoretic framework and rise of the neoFisherian. Annales Zoologici Fennici 46:311-349.
Lombardi, C.M. and S.H. Hurlbert. 2009b. Misprescription and misuse of one-tailed tests. Austral Ecology 34:447-468 (plus appendix).
Hurlbert, S.H. and C.M. Lombardi. 2012. Lopsided reasoning on lopsided tests and multiple comparisons. Australian and New Zealand Journal of Statistics 54:23-42.
Hurlbert, S.H. 2013a. Affirmation of the classical terminology for experimental design via a critique of Casella’s Statistical Design. Agronomy Journal 105:412-418 + suppl. inform.
Hurlbert, S.H. 2013b. [Review of Biometry, 4th edn, by R.R. Sokal & F.J. Rohlf]. Limnology and Oceanography Bulletin 22(2):62-65.
Hurlbert, S.H. and C.M. Lombardi. 2016. Pseudoreplication, one-tailed tests, neoFisherianism, multiple comparisons, and pseudofactorialism. Integrated Environmental Assessment and Management 12:195-197.