1. The Paradox of Replication
Critic 1: It’s much too easy to get small P-values.
Critic 2: We find it very difficult to get small P-values; only 36 of 100 psychology experiments were found to yield small P-values in the recent Open Science collaboration on replication (in psychology).
Is it easy or is it hard?
You might say, there’s no paradox, the problem is that the significance levels in the original studies are often due to cherry-picking, multiple testing, optional stopping and other biasing selection effects. The mechanism by which biasing selection effects blow up the actual P-values, relative to the nominal ones reported, is very well understood, and we can demonstrate exactly how it occurs. In short, many of the initially significant results merely report “nominal” P-values not “actual” ones, and there’s nothing inconsistent between the complaints of critic 1 and critic 2.
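To make the nominal versus actual distinction concrete, here is a minimal sketch of one biasing selection effect: cherry-picking the best of k outcome measures when there is no real effect at all. The setup is my own illustration, not taken from any of the studies: independent outcomes, a two-sided z-test with known standard deviation, and the hypothetical function names below.

```python
import random
from statistics import NormalDist, fmean

STD_NORMAL = NormalDist()  # standard normal, used for the z-test


def two_sided_p(sample, sigma=1.0):
    """Two-sided z-test P-value for H0: mean = 0, assuming known sigma."""
    z = fmean(sample) * len(sample) ** 0.5 / sigma
    return 2.0 * (1.0 - STD_NORMAL.cdf(abs(z)))


def analytic_actual_level(alpha, k):
    """Chance that at least one of k independent null tests gives P <= alpha."""
    return 1.0 - (1.0 - alpha) ** k


def simulated_actual_level(alpha, k, n=10, trials=2000, seed=1):
    """Fraction of null 'studies' whose best-of-k P-value is <= alpha."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        # k outcome measures, each a sample of n observations with true mean 0
        p_values = [
            two_sided_p([rng.gauss(0.0, 1.0) for _ in range(n)])
            for _ in range(k)
        ]
        if min(p_values) <= alpha:  # report only the most impressive outcome
            hits += 1
    return hits / trials


if __name__ == "__main__":
    for k in (1, 5, 20):
        print(k, round(analytic_actual_level(0.05, k), 3),
              round(simulated_actual_level(0.05, k), 3))
```

Under these assumptions the reported (nominal) level stays at 0.05, while the actual probability of finding at least one such result by chance is 1 − 0.95^k, which is about 0.64 when k = 20. A replication that fixes the outcome in advance is back at k = 1.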
The resolution of the paradox attests to what many have long been saying: the problem is not with the statistical methods but with their abuse. Even the P-value, the most unpopular girl in the class, gets to show a little bit of what she’s capable of. She will give you a hard time when it comes to replicating nominally significant results, if they were largely due to biasing selection effects. That is just what is wanted; it is an asset that she feels the strain, and lets you know. It is statistical accounts that can’t pick up on biasing selection effects that should worry us (especially those that deny they are relevant). That is one of the most positive things to emerge from the recent, impressive, replication project in psychology. From an article in the Smithsonian magazine “Scientists Replicated 100 Psychology Studies, and Fewer Than Half Got the Same Results”:
The findings also offered some support for the oft-criticized statistical tool known as the P value, which measures whether a result is significant or due to chance. …
The project analysis showed that a low P value was fairly predictive of which psychology studies could be replicated. Twenty of the 32 original studies with a P value of less than 0.001 could be replicated, for example, while just 2 of the 11 papers with a value greater than 0.04 were successfully replicated. (Link is here.)
The Replication Report itself, published in Science, gives more details:
Considering significance testing, reproducibility was stronger in studies and journals representing cognitive psychology than social psychology topics. For example, combining across journals, 14 of 55 (25%) of social psychology effects replicated by the P < 0.05 criterion, whereas 21 of 42 (50%) of cognitive psychology effects did so. …The difference in significance testing results between fields appears to be partly a function of weaker original effects in social psychology studies, particularly in JPSP, and perhaps of the greater frequency of high-powered within-subjects manipulations and repeated measurement designs in cognitive psychology as suggested by high power despite relatively small participant samples. …
A negative correlation of replication success with the original study P value indicates that the initial strength of evidence is predictive of reproducibility. For example, 26 of 63 (41%) original studies with P < 0.02 achieved P < 0.05 in the replication, whereas 6 of 23 (26%) that had a P value between 0.02 < P < 0.04 and 2 of 11 (18%) that had a P value > 0.04 did so (Fig. 2). Almost two thirds (20 of 32, 63%) of original studies with P < 0.001 had a significant P value in the replication. [i]
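The counts quoted from the Report can be rechecked with a few lines of arithmetic. This is only a sketch: the counts are copied from the quoted passages, while the labels and function names are my own.

```python
# (label, number replicated at P < 0.05, number of original studies),
# copied from the Report passages quoted above
REPORTED = [
    ("social psychology", 14, 55),
    ("cognitive psychology", 21, 42),
    ("original P < 0.02", 26, 63),
    ("original 0.02 < P < 0.04", 6, 23),
    ("original P > 0.04", 2, 11),
    ("original P < 0.001", 20, 32),
]


def percent(successes, total):
    """Replication rate as a percentage."""
    return 100.0 * successes / total


def replication_table(rows):
    """Attach the computed percentage to each (label, successes, total) row."""
    return [(label, s, n, percent(s, n)) for label, s, n in rows]


if __name__ == "__main__":
    for label, s, n, pct in replication_table(REPORTED):
        print(f"{label:26s} {s:2d}/{n:2d} = {pct:.1f}%")
```

The three P-value bands (63 + 23 + 11) and the two fields (55 + 42) each add up to 97 studies. The 20 of 32 figure works out to 62.5%, which the Report rounds to 63%.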
Since only around 50% of replications are expected to be as strong as the original, the cases with an initial significance level < .02 don’t do too badly, judging just on numbers. But I disagree with those who say that all that’s needed is to lower the required P-value, because that ignores the real monster: biasing selection effects.
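One way to read the “around 50%” remark is the following sketch. It rests on assumptions I am adding for illustration: a normally distributed test statistic with unit standard deviation, a replication with the same sample size as the original, and (optimistically) a true effect exactly equal to the one originally observed. The function names are hypothetical.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def z_from_two_sided_p(p):
    """z-value whose two-sided P-value equals p."""
    return STD_NORMAL.inv_cdf(1.0 - p / 2.0)


def prob_replication_at_least_as_strong(z_original):
    """P(z_rep >= z_original) when z_rep ~ N(z_original, 1)."""
    return 1.0 - NormalDist(mu=z_original, sigma=1.0).cdf(z_original)


def prob_replication_significant(p_original, alpha=0.05):
    """P(replication reaches two-sided alpha in the original direction),
    assuming the true effect equals the originally observed one."""
    z_original = z_from_two_sided_p(p_original)
    z_critical = z_from_two_sided_p(alpha)
    return 1.0 - NormalDist(mu=z_original, sigma=1.0).cdf(z_critical)


if __name__ == "__main__":
    print(prob_replication_at_least_as_strong(z_from_two_sided_p(0.02)))
    for p in (0.05, 0.02, 0.001):
        print(p, round(prob_replication_significant(p), 3))
```

By symmetry the replication is at least as strong as the original half the time, whatever the original z-value. Under the same assumptions, the chance that the replication reaches P < 0.05 is 0.5 for an original P of exactly 0.05, roughly 0.64 for an original P of 0.02, and roughly 0.91 for 0.001. These are benchmarks for an unbiased original only; they say nothing about how much selection went into the original result.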
2. Is there evidence that differences (between initial studies and replications) are due to A, B, C…or not? Moreover, simple significance tests and cognate methods were the tools of choice in exploring possible explanations for the disagreeing results.
Last, there was little evidence that perceived importance of the effect, expertise of the original or replication teams, or self-assessed quality of the replication accounted for meaningful variation in reproducibility across indicators. Replication success was more consistently related to the original strength of evidence (such as original P value, effect size, and effect tested) than to characteristics of the teams and implementation of the replication (such as expertise, quality, or challenge of conducting study) (tables S3 and S4).
They look to a battery of simple significance tests for answers, if only indications. It is apt that they report these explanations as the result of “exploratory” analysis; they weren’t generalizing, but scrutinizing if various factors could readily account for the results.
What evidence is there that the replication studies are not themselves due to bias? According to the Report:
There is no publication bias in the replication studies because all results are reported. Also, there are no selection or reporting biases because all were confirmatory tests based on pre-analysis plans. This maximizes the interpretability of the replication P values and effect estimates.
One needn’t rule out bias altogether to agree with the Report that the replication research controlled the most common biases and flexibilities to which initial experiments were open. If your P-value emerged from torture and abuse, it can’t be hidden from a replication that ties your hands. If you don’t cherry-pick, try and try again, barn hunt, capitalize on flexible theory, and so on, it’s hard to satisfy R.A. Fisher’s requirement of rarely failing to bring about statistically significant results–unless you’ve found a genuine effect. Although this is admittedly a small part of finding things out, the same methods can be used to go deeper in discovering and probing alternative explanations of an effect.
3. Observed differences cannot be taken as caused by the “treatment”: My main worries with the replicationist conclusions in psychology are that they harbor many of the same presuppositions that cause problems in (at least some) psychological experiments to begin with, notably the tendency to assume that differences observed–any differences– are due to the “treatments”, and further, that they are measuring the phenomenon of interest. Even nonsignificant observed differences are interpreted as merely indicating smaller effects of the experimental manipulation, when the significance test is indicating the absence of a genuine effect, let alone support for the particular causal thesis. The statistical test is shouting disconfirmation, if not falsification, of unwarranted hypotheses, but no such interpretation is heard.
It would be interesting to see a list of the failed replications. (I’ll try to dig them out at some point.) The New York Times gives three, but even they are regarded as “simply weaker”.
The overall “effect size,” a measure of the strength of a finding, dropped by about half across all of the studies. Yet very few of the redone studies contradicted the original ones; their results were simply weaker.
This is akin to the habit some researchers have of describing non-significant results as sort of “trending” significant, when the P-value is telling them it’s not significant; and I don’t mean falling short of a “bright line” at .05, but levels like .2, .3, and .4. These differences are easy to bring about by chance variability alone. Psychologists also blur the observed difference (in statistics) with the inferred discrepancy (in parameter values). This inflates the inference. I don’t know the specific P-values for the following three:
More than 60 of the studies did not hold up. Among them was one on free will. It found that participants who read a passage arguing that their behavior is predetermined were more likely than those who had not read the passage to cheat on a subsequent test.
Another was on the effect of physical distance on emotional closeness. Volunteers asked to plot two points that were far apart on graph paper later reported weaker emotional attachment to family members, compared with subjects who had graphed points close together.
A third was on mate preference. Attached women were more likely to rate the attractiveness of single men highly when the women were highly fertile, compared with when they were less so. In the reproduced studies, researchers found weaker effects for all three experiments.
What are the grounds for saying they’re merely weaker? The author of the mate preference study protests even this mild criticism, claiming that “theory-required adjustments” show her findings to have been replicated after all.
In an email, Paola Bressan, a psychologist at the University of Padua and an author of the original mate preference study, identified several such differences [between her study and the replication] — including that her sample of women were mostly Italians, not American psychology students — that she said she had forwarded to the Reproducibility Project. “I show that, with some theory-required adjustments, my original findings were in fact replicated,” she said.
Wait a minute. This was to be a general evolutionary theory, yes? According to the abstract:
Because men of higher genetic quality tend to be poorer partners and parents than men of lower genetic quality, women may profit from securing a stable investment from the latter, while obtaining good genes via extra pair mating with the former. Only if conception occurs, however, do the evolutionary benefits of such a strategy overcome its costs. Accordingly, we predicted that (a) partnered women should prefer attached men, because such men are more likely than single men to have pair-bonding qualities, and hence to be good replacement partners, and (b) this inclination should reverse when fertility rises, because attached men are less available for impromptu sex than single men. (A link to the abstract and paper is here.)
Is the author saying that Italian women obey a distinct evolutionary process? I take it one could argue that evolutionary forces manifest themselves in different ways in distinct cultures. Doubtless, ratings of attractiveness by U.S. psychology students can’t be assumed to reflect assessments about availability for impromptu sex. But can they even among Italian women? This is just one particular story through which the data are being viewed. [9/2/15 Update on the mate preference and ovulation study is in Section 4.]
I can understand that the authors of the replication Report wanted to tread carefully to avoid the kind of pushback that erupted when a hypothesis about cleanliness and morality failed to be replicated. (“Repligate” some called it.) My current concern echoes the one I raised about that case (in an earlier post):
“the [replicationist] question wasn’t: can the hypotheses about cleanliness and morality be well-tested or well probed by finding statistical associations between unscrambling cleanliness words and “being less judgmental” about things like eating your dog if he’s run over? At least not directly. In other words, the statistical-substantive link was not at issue.”
Just because subjects (generally psychology students) select a number on a questionnaire, or can be scored on an official test of attitude, feelings, self-esteem, etc., doesn’t mean it’s actually been measured, and you can proceed to apply statistics. You may adopt a method that allows you to go from statistical significance to causal claims—the unwarranted NHST animal that Fisher opposed—but the question does not disappear [ii]. Reading a passage against “free will” makes me more likely to cheat on a test? (There’s scarce evidence that reading a passage influenced the subject’s view on the deep issue of free will, nor even that the subject (chose to*) “cheat”, much less that the former is responsible for the latter.) When I plot two faraway points on a graph I’m more likely to feel more “distant” from my family than if I plot two close together points? The effect is weaker but still real? There are oceans of studies like these (especially in social psychology & priming research). Some are even taken to inform philosophical theories of mind or ethics when, in my opinion, philosophers should be providing a rigorous methodological critique of these studies [iii]. We need to go deeper; in many cases, no statistical analysis would even be required. The vast literatures on the assumed effect live lives of their own; to test their fundamental presuppositions could bring them all crashing down [iv]. Are they to remain out of bounds of critical scrutiny? What do you think?
4. Update on the Italian Mate Selection Replication
Here’s the situation as I understand it, having read both the replication and the response by Bressan. The women in the study had to be single, not pregnant, not on the pill, heterosexual. Among the single women, some are in relationships; they are “partnered”. The thesis is this: if a partnered woman is not ovulating, she’s more attracted to the “attached” guy, because he is deemed capable of a long-term commitment, as evidenced by his being in a relationship. So she might leave her current guy for him (at least if he’s handsome in a masculine sort of way). On the other hand, if she’s ovulating, she’d be more attracted to a single (not attached) man than an attached man. “In this way she could get pregnant and carry the high-genetic-fitness man’s offspring without having to leave her current, stable relationship” (Frazier and Hasselman, p. 2).
So the deal is this: if she’s ovulating, she’s got to do something fast: have a baby with the single (non-attached) guy who’s not very good at commitments (but shows high testosterone, and thus high immunities, according to the authors), and then race back to have the baby in her current stable relationship. As Bressan puts it in her response to the replication: “This effect was interpreted on the basis of the hypothesis that, during ovulation, partnered women would be ‘shopping for good genes’ because they ‘already have a potentially investing “father” on their side.’” But would he be an invested father if it was another man’s baby? I mean, does this even make sense on crude evolutionary terms? [I don’t claim to know. I thought male lions are prone to stomp on babies fathered by other males. Even with humans, I doubt that even the “feminine” male Pleistocene partner would remain fully invested.]
Nevertheless, when you see the whole picture, Bressan does raise some valid questions about the replication attempt (see the Bressan commentary). I may come back to this later. You can find all the reports, responses by authors, and other related materials here.
[i] Here’s a useful overview from the Report in Science:
Thirty-six percent of replications had statistically significant results; 47% of original effect sizes were in the 95% confidence interval of the replication effect size; 39% of effects were subjectively rated to have replicated the original result; and if no bias in original results is assumed, combining original and replication results left 68% with statistically significant effects. Correlational tests suggest that replication success was better predicted by the strength of original evidence than by characteristics of the original and replication teams.
Since only around 50% of replications are expected to be as strong as the original, this might not seem that low. I think the issue of real importance goes beyond rates, and that focusing on rates of replication actually distracts from what’s involved in appraising a given study or theory.
[ii] Statistical methods are relevant to answering this question and even falsifying conjectured causal claims. My point is that it demands more than checking the purely statistical question in these “direct” replications, and more than P-values. Oddly, since these studies appeal to power, they ought to be in Neyman-Pearson hypothesis testing (ideally without the behavioristic rationale). This would immediately scotch an illicit slide from statistical to substantive inference.
[iii] Yes, this is one of the sources of my disappointment: philosophers of science should be critically assessing this so-called “naturalized” philosophy. It all goes back to Quine, but never mind.
[iv] It would not be difficult to test whether these measures are valid. The following is about the strongest, hedged, claim (from the Report) that the replication result is sounder than the original:
If publication, selection, and reporting biases completely explain the effect differences, then the replication estimates would be a better estimate of the effect size than would the meta-analytic and original results. However, to the extent that there are other influences, such as moderation by sample, setting, or quality of replication, the relative bias influencing original and replication effect size estimation is unknown.