The Paradox of Replication, and the vindication of the P-value (but she can go deeper) 9/2/15 update (ii)


The unpopular P-value is invited to dance.

  1. The Paradox of Replication

Critic 1: It’s much too easy to get small P-values.

Critic 2: We find it very difficult to get small P-values; only 36 of 100 psychology experiments were found to yield small P-values in the recent Open Science collaboration on replication (in psychology).

Is it easy or is it hard?

You might say, there’s no paradox, the problem is that the significance levels in the original studies are often due to cherry-picking, multiple testing, optional stopping and other biasing selection effects. The mechanism by which biasing selection effects blow up P-values is very well understood, and we can demonstrate exactly how it occurs. In short, many of the initially significant results merely report “nominal” P-values not “actual” ones, and there’s nothing inconsistent between the complaints of critic 1 and critic 2.
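To make the mechanism concrete, here is a minimal simulation of my own (not from any of the studies discussed): every “study” tests a true null hypothesis, but the hunter reports only the smallest of 20 P-values — think 20 subgroups, outcome measures, or interim looks. The nominal P-value then behaves nothing like the actual one:

```python
import math
import random

random.seed(1)

def pvalue(sample):
    """Two-sided z-test P-value for mean 0, known sd 1."""
    z = sum(sample) / math.sqrt(len(sample))
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

def one_test(n=30):
    # Data generated under a true null: no effect at all
    return pvalue([random.gauss(0, 1) for _ in range(n)])

# Honest, pre-registered single test: significant about 5% of the time
single = [one_test() for _ in range(2000)]

# Cherry-picking: report only the smallest of 20 P-values per "study"
hunted = [min(one_test() for _ in range(20)) for _ in range(2000)]

print(sum(p < 0.05 for p in single) / 2000)  # close to 0.05
print(sum(p < 0.05 for p in hunted) / 2000)  # close to 1 - 0.95**20, about 0.64
```

The reported “P < .05” in the hunted case is nominal only; the actual probability of producing such a result by chance alone is over 60%, which is just why such findings fail to replicate when the replicator’s hands are tied.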

The resolution of the paradox attests to what many have long been saying: the problem is not with the statistical methods but with their abuse. Even the P-value, the most unpopular girl in the class, gets to show a little bit of what she’s capable of. She will give you a hard time when it comes to replicating nominally significant results, if they were largely due to biasing selection effects. That is just what is wanted; it is an asset that she feels the strain, and lets you know. It is statistical accounts that can’t pick up on biasing selection effects that should worry us (especially those that deny they are relevant). That is one of the most positive things to emerge from the recent, impressive, replication project in psychology. From an article in the Smithsonian magazine “Scientists Replicated 100 Psychology Studies, and Fewer Than Half Got the Same Results”:

The findings also offered some support for the oft-criticized statistical tool known as the P value, which measures whether a result is significant or due to chance. …

The project analysis showed that a low P value was fairly predictive of which psychology studies could be replicated. Twenty of the 32 original studies with a P value of less than 0.001 could be replicated, for example, while just 2 of the 11 papers with a value greater than 0.04 were successfully replicated. (Link is here.)

The Replication Report itself, published in Science, gives more details:

Considering significance testing, reproducibility was stronger in studies and journals representing cognitive psychology than social psychology topics. For example, combining across journals, 14 of 55 (25%) of social psychology effects replicated by the P < 0.05 criterion, whereas 21 of 42 (50%) of cognitive psychology effects did so. …The difference in significance testing results between fields appears to be partly a function of weaker original effects in social psychology studies, particularly in JPSP, and perhaps of the greater frequency of high-powered within-subjects manipulations and repeated measurement designs in cognitive psychology as suggested by high power despite relatively small participant samples. …

A negative correlation of replication success with the original study P value indicates that the initial strength of evidence is predictive of reproducibility. For example, 26 of 63 (41%) original studies with P < 0.02 achieved P < 0.05 in the replication, whereas 6 of 23 (26%) that had a P value between 0.02 < P < 0.04 and 2 of 11 (18%) that had a P value > 0.04 did so (Fig. 2). Almost two thirds (20 of 32, 63%) of original studies with P < 0.001 had a significant P value in the replication. [i]

Since only around 50% of replications are expected to come out as strong as the original, the cases with initial significance level < .02 don’t do too badly, judging just on the numbers. But I disagree with those who say that all that’s needed is to lower the required P-value, because that ignores the real monster: biasing selection effects.

 2. Is there evidence that differences (between initial studies and replications) are due to A, B, C… or not? Notably, simple significance tests and cognate methods were the tools of choice in exploring possible explanations for the disagreeing results.

Last, there was little evidence that perceived importance of the effect, expertise of the original or replication teams, or self-assessed quality of the replication accounted for meaningful variation in reproducibility across indicators. Replication success was more consistently related to the original strength of evidence (such as original P value, effect size, and effect tested) than to characteristics of the teams and implementation of the replication (such as expertise, quality, or challenge of conducting study) (tables S3 and S4).

They look to a battery of simple significance tests for answers, if only indications. It is apt that they report these explanations as the result of “exploratory” analysis; they weren’t generalizing, but scrutinizing whether various factors could readily account for the results.

What evidence is there that the replication studies are not themselves due to bias? According to the Report:

There is no publication bias in the replication studies because all results are reported. Also, there are no selection or reporting biases because all were confirmatory tests based on pre-analysis plans. This maximizes the interpretability of the replication P values and effect estimates.

One needn’t rule out bias altogether to agree with the Report that the replication research controlled the most common biases and flexibilities to which the initial experiments were open. If your P-value emerged from torture and abuse, it can’t be hidden from a replication that ties your hands. If you don’t cherry-pick, try and try again, barn hunt, capitalize on flexible theory, and so on, it’s hard to satisfy R.A. Fisher’s requirement of rarely failing to bring about statistically significant results–unless you’ve found a genuine effect. Admittedly this is only a small part of finding things out; the same methods can be used to go deeper in discovering and probing alternative explanations of an effect.

3. Observed differences cannot be taken as caused by the “treatment”: My main worries with the replicationist conclusions in psychology are that they harbor many of the same presuppositions that cause problems in (at least some) psychological experiments to begin with, notably the tendency to assume that differences observed–any differences–are due to the “treatments”, and further, that they are measuring the phenomenon of interest. Even nonsignificant observed differences are interpreted as merely indicating smaller effects of the experimental manipulation, when the significance test is indicating the absence of a genuine effect, to say nothing of the particular causal thesis. The statistical test is shouting disconfirmation, if not falsification, of unwarranted hypotheses, but no such interpretation is heard.

It would be interesting to see a list of the failed replications. (I’ll try to dig them out at some point.) The New York Times gives three, but even they are regarded as “simply weaker”.

The overall “effect size,” a measure of the strength of a finding, dropped by about half across all of the studies. Yet very few of the redone studies contradicted the original ones; their results were simply weaker.

This is akin to the habit some researchers have of describing non-significant results as sort of “trending” significant––when the P-value is telling them it’s not significant, and I don’t mean falling short of a “bright line” at .05, but levels like .2, .3, and .4.  These differences are easy to bring about by chance variability alone. Psychologists also blur the observed difference (in statistics) with the inferred discrepancy (in parameter values). This inflates the inference. I don’t know the specific P-values for the following three:

More than 60 of the studies did not hold up. Among them was one on free will. It found that participants who read a passage arguing that their behavior is predetermined were more likely than those who had not read the passage to cheat on a subsequent test.

Another was on the effect of physical distance on emotional closeness. Volunteers asked to plot two points that were far apart on graph paper later reported weaker emotional attachment to family members, compared with subjects who had graphed points close together.

A third was on mate preference. Attached women were more likely to rate the attractiveness of single men highly when the women were highly fertile, compared with when they were less so. In the reproduced studies, researchers found weaker effects for all three experiments.

What are the grounds for saying they’re merely weaker? The author of the mate preference study protests even this mild criticism, claiming that a “theory required adjustment” shows her findings to have been replicated after all.

In an email, Paola Bressan, a psychologist at the University of Padua and an author of the original mate preference study, identified several such differences [between her study and the replication] — including that her sample of women were mostly Italians, not American psychology students — that she said she had forwarded to the Reproducibility Project. “I show that, with some theory-required adjustments, my original findings were in fact replicated,” she said.

Wait a minute. This was to be a general evolutionary theory, yes? According to the abstract:

Because men of higher genetic quality tend to be poorer partners and parents than men of lower genetic quality, women may profit from securing a stable investment from the latter, while obtaining good genes via extra pair mating with the former. Only if conception occurs, however, do the evolutionary benefits of such a strategy overcome its costs. Accordingly, we predicted that (a) partnered women should prefer attached men, because such men are more likely than single men to have pair-bonding qualities, and hence to be good replacement partners, and (b) this inclination should reverse when fertility rises, because attached men are less available for impromptu sex than single men. (A link to the abstract and paper is here.)

Is the author saying that Italian women obey a distinct evolutionary process? I take it one could argue that evolutionary forces manifest themselves in different ways in distinct cultures. Doubtless, ratings of attractiveness by U.S. psychology students can’t be assumed to reflect assessments about availability for impromptu sex. But can they even among Italian women? This is just one particular story through which the data are being viewed. [9/2/15 Update on the mate preference and ovulation study is in Section 4.]

I can understand that the authors of the replication Report wanted to tread carefully to avoid the kind of pushback that erupted when a hypothesis about cleanliness and morality failed to be replicated. (“Repligate” some called it.) My current concern echoes the one I raised about that case (in an earlier post):

“the [replicationist] question wasn’t: can the hypotheses about cleanliness and morality be well-tested or well probed by finding statistical associations between unscrambling cleanliness words and “being less judgmental” about things like eating your dog if he’s run over? At least not directly. In other words, the statistical-substantive link was not at issue.”

Just because subjects (generally psychology students) select a number on a questionnaire, or can be scored on an official test of attitude, feelings, self-esteem, etc., doesn’t mean it’s actually been measured, and you can proceed to apply statistics. You may adopt a method that allows you to go from statistical significance to causal claims—the unwarranted NHST animal that Fisher opposed—but the question does not disappear [ii]. Reading a passage against “free will” makes me more likely to cheat on a test? (There’s scarce evidence that reading a passage influenced the subject’s view on the deep issue of free will, nor even that the subject (chose to*) “cheat”, much less that the former is responsible for the latter.) When I plot two faraway points on a graph I’m more likely to feel more “distant” from my family than if I plot two close together points? The effect is weaker but still real? There are oceans of studies like these (especially in social psychology & priming research). Some are even taken to inform philosophical theories of mind or ethics when, in my opinion, philosophers should be providing a rigorous methodological critique of these studies [iii].  We need to go deeper; in many cases, no statistical analysis would even be required. The vast literatures on the assumed effect live lives of their own; to test their fundamental presuppositions could bring them all crashing down [iv]. Are they to remain out of bounds of critical scrutiny? What do you think?

I may come back to this post in later installments.

*Irony intended.

4. Update on the Italian Mate Selection Replication

Here’s the situation as I understand it, having read both the replication and the response by Bressan. The women in the study had to be single, not pregnant, not on the pill, and heterosexual. Among the single women, some are in relationships; they are “partnered”. The thesis is this: if a partnered woman is not ovulating, she’s more attracted to the “attached” guy, because he is deemed capable of a long-term commitment, as evidenced by his being in a relationship. So she might leave her current guy for him (at least if he’s handsome in a masculine sort of way). On the other hand, if she’s ovulating, she’d be more attracted to a single (not attached) man than to an attached man. “In this way she could get pregnant and carry the high-genetic-fitness man’s offspring without having to leave her current, stable relationship” (Frazier and Hasselman Bressan_online_in lab (1).2)

So the deal is this: if she’s ovulating, she’s got to do something fast: have a baby with the single (non-attached) guy who’s not very good at commitments (but shows high testosterone, and thus high immunities, according to the authors), and then race back to have the baby in her current stable relationship. As Bressan puts it in her response to the replication: “This effect was interpreted on the basis of the hypothesis that, during ovulation, partnered women would be “shopping for good genes” because they “already have a potentially investing ‘father’ on their side.” But would he be an invested father if it was another man’s baby? I mean, does this even make sense on crude evolutionary terms? [I don’t claim to know. I thought male lions are prone to stomp on babies fathered by other males. Even with humans, I doubt that even the “feminine” male Pleistocene partner would remain fully invested.]

Nevertheless, when you see the whole picture, Bressan does raise some valid questions about the replication attempt (BRESSAN COMMENTARY). I may come back to this later. You can find all the reports, responses by authors, and other related materials here.

[i] Here’s a useful overview from the Report in Science:

Thirty-six percent of replications had statistically significant results; 47% of original effect sizes were in the 95% confidence interval of the replication effect size; 39% of effects were subjectively rated to have replicated the original result; and if no bias in original results is assumed, combining original and replication results left 68% with statistically significant effects. Correlational tests suggest that replication success was better predicted by the strength of original evidence than by characteristics of the original and replication teams.

Since only around 50% of replications are expected to be as strong as the original, this might not seem that low. I think the entire issue of importance goes beyond rates, and that focusing on rates of replication actually distracts from what’s involved in appraising a given study or theory.
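Where does the “around 50%” benchmark come from, and why can the conditional rate be lower? A toy simulation of my own (invented numbers, purely illustrative): if the replication’s true effect exactly equaled the original’s observed effect, a same-sized replication would beat the original P-value only about half the time; and if true effects are modest, conditioning on the original’s reaching significance does nothing to raise the independent replication’s chances:

```python
import random

random.seed(2)

trials = 20000
z_obs = 2.3  # a hypothetical original study's observed z-score (P ~ 0.02, two-sided)

# (a) If the true effect equals the observed one, a same-sized replication's
# z-score is distributed N(z_obs, 1): it beats the original about half the time.
beats = sum(random.gauss(z_obs, 1) > z_obs for _ in range(trials))
print(beats / trials)  # close to 0.5

# (b) With a modest true effect (mean 1 on the z-scale), select only the
# originals that reached z > 1.96; their independent replications still
# cross that bar far less than half the time.
mu, crit = 1.0, 1.96
orig = [random.gauss(mu, 1) for _ in range(trials)]
sig_reps = [random.gauss(mu, 1) for z in orig if z > crit]
print(sum(z > crit for z in sig_reps) / len(sig_reps))  # well below 0.5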

[ii] Statistical methods are relevant to answering this question, and even to falsifying conjectured causal claims. My point is that doing so demands more than checking the purely statistical question in these “direct” replications, and more than P-values. Oddly, since these studies appeal to power, they ought to be set within Neyman-Pearson hypothesis testing (ideally without the behavioristic rationale). This would immediately scotch an illicit slide from statistical to substantive inference.

[iii] Yes, this is one of the sources of my disappointment: philosophers of science should be critically assessing this so-called “naturalized” philosophy. It all goes back to Quine, but never mind.

[iv] It would not be difficult to test whether these measures are valid. The following is about the strongest, hedged, claim (from the Report) that the replication result is sounder than the original:

If publication, selection, and reporting biases completely explain the effect differences, then the replication estimates would be a better estimate of the effect size than would the meta-analytic and original results. However, to the extent that there are other influences, such as moderation by sample, setting, or quality of replication, the relative bias influencing original and replication effect size estimation is unknown.

Categories: replication research, reproducibility, spurious p values, Statistics


23 thoughts on “The Paradox of Replication, and the vindication of the P-value (but she can go deeper) 9/2/15 update (ii)”

  1. Anonymous

    The question of replication of psychological experiments is a fascinating one. As a psychologist I thoroughly agree that it points to issues far beyond the statistical, or even experimental. Stated briefly, the non-replicability of such experiments appears less to be an issue of error or bias than one of the nature of psychological knowledge and subjects. The call for strict replication presupposes that psychological findings or knowledge are based on universals – whether entities or laws/theories, usually conceived of as biological – which would be found by replication with different subjects in different times and places. The Italian researcher’s comments on attractiveness are potentially correct, I believe, but it is noteworthy that in the original article as quoted here (and presumably its replication) the experiment’s results were abstracted as supporting/refuting biologically-based evolutionary theory – rather than describing the behaviour of Italian women in a particular time and place. These issues have long been debated in critical corners of psychology, and are best brought into the methodological (and political) light when experimental subjects are more Other, for instance from the global South, or simply not middle-class, white subjects. In these cases differences can be marked, and the notion that psychological subjects are universal, biological subjects starts to break down. It is quite possible that culture, place and historical time have a greater impact on social as opposed to cognitive experiments, but many socially-related skills are often incorrectly constructed as “cognitive” too (i.e. part of the functioning of a universal, biological brain). Of course one can question what type of knowledge-base psychology is building with research dominated by experiments with Euro-American university students as subjects, and believe me, there are many within the profession asking just the same thing. Statistics is often being misused here when critical thinking and culturally sensitive theory are required.

    • Anonymous
      There’s an article today in the NYT that also brings up the point that failure to generate a result purporting to be real may just reflect a context-dependency of the effect.* Sure, but at some point there needs to be a determination that the contextual saves have rendered the hypotheses unfalsifiable. Lakatos had come to conclude (after Kuhn) that, strictly speaking, at no point can it be said that saving a hypothesis from falsification (by a context-dependent add-on) is unscientific! (Further, he held that if there comes a point that scientists “decide” to consider a hypothesis degenerating to the degree that it is replaced, it will only have been “falsified” by post hoc fiat). No wonder Popper refused to speak with Lakatos after such a break with Popperian requirements. Feyerabend was right to call Lakatos “a fellow anarchist”. Anyway, I’m with Popper. I completely agree with the importance of discovering general conditions wherein an effect is and is not generated (as in the NYT article about the mice, or were they rats?). But the amended hypothesis must itself make a testable prediction. Only when that prediction passes subsequent stringent tests can the adjusted hypothesis be warranted, and this would not be the case if the modified hypothesis is tantamount to “my hypothesis holds just for the group where it holds”. Clearly, that would be ad hoc and unfalsifiable.


      • Anonymous

        I agree that the ad hoc (or would it be post-hoc) qualification here is just not good science (in anyone’s book).

  2. Carlos Ungil

    > “Thirty-six percent of replications had statistically significant results; …”
    > Since it’s expected to have only around 50% replications as strong as the original, this might not seem that low.

    The 50% expectation of replications as strong as the original is unconditional. The analysis here is conditional on experiments having significant p-values. The expected replication rate will be strictly lower than 50% (depends on the p-value and the distribution of the real effect size).

    On the other hand, the 36% quoted does not refer to results as strong as the original (for example, it would include a study with p-value<0.01 replicated with p-value 0.04). The actual number seems to be around half of that: only 18% of the replications resulted in an effect size larger than in the original study.

    • Carlos Ungil

      I said “The expected replication rate will be strictly lower than 50% (depends on the p-value and the distribution of the real effect size).” On further thought, it will be lower than 50% if we condition only on the p-value being significant (with a value which depends on the underlying distribution of the actual effect size). By conditioning on the observed p-value the probability of getting a lower p-value on the replication can get arbitrarily close to 100% if the effect is large enough.

  3. Norm

    Wonderful cartoon (even if you only found the pic and wrote the caption)! Unfortunately, that “girl” is simply not my type, for reasons I’ve given before (p-values are answering the wrong question, etc., including the assumption that there should be a “bright line”).

    I was disappointed to see that a paper in such a respected journal as Science wrote such things, though I give Editor Marcia McNutt lots of credit for getting the journal to address this vital issue.

    • Norm: As you know, I always report which discrepancies are indicated and which are not. Since you favor confidence intervals, that’s much the same, except that I don’t choose a “bright line” confidence level, as if everything is either in or out. It’s amazing to me that some people are squeamish about choosing an alpha, but happy to choose (1- alpha). Both are too rigid, yet I think there’s a place for an indication of a real effect or lack of one. Research requires many tools; one isn’t always asking the same question. (David Cox has a good taxonomy for when a simple significance test is relevant.)
      The severity assessment is akin to forming several confidence intervals at different levels. But of course, here I was just discussing the replication report, not my statistical philosophy.
      As for cartoons, as it happens, I’ve been drawing several of my own (not this one), but will start putting some up soon.

  4. Pingback: Distilled News | Data Analytics & R

  5. john byrd

    So, there has been some discussion of whether the psych studies reveal information about all humans or more narrow groups of humans. In these studies, do they clearly identify the reference class that the statistical analysis might relate to?

    • John: The psych studies aren’t random samples from humans by any means; they at most apply to the groups studied (which can well be the case for RCTs as well). But I’m already skeptical at the level of having shown the effect in the studied population (in the case of the three discussed in the NYT; I haven’t researched the others.) Some call it internal validity.

      • john byrd

        Is that a large part of the problem, as far as over-interpretation of the results following a significance test? A significance testing strategy should be able to point to a reference class, right? And, the interpretation should be tied to the reference class, at least conceptually.

  6. blog-o-logue: Gelman cites this post on his blog today:
    He agrees with my take on the Italian evolutionary theory example.

  7. Z

    can you expand a little bit on ‘it all goes back to quine’?

    • Z: Sure. When Quine showed (or claimed to show) that there was no analytic-synthetic distinction (“2 Dogmas of empiricism”), it followed that the realm typically thought to be the domain of philosophy–the analytic realm–actually wasn’t distinct from the synthetic realm––the empirical realm. Analytic truths are true by definition or by math/logic and are not contingent empirical claims about the world. If philosophy is to be empirical, or “naturalistic”, he thought, it would look to psychology. This is a huge topic, I’ll see if there are other philosophers who want to jump in. I disagree with Quine in all kinds of ways (you might check my “Error and the Growth of Experimental Knowledge” for the conception of naturalized philosophy of science I favor.) I agree with Popper who said (in response to Quine) that psychology would be the wrong place to look for an empirically informed philosophy of science/knowledge. At least one of the right places to look, in my judgment, is statistical science, generally construed.

  8. Mayo wrote: “But would he be an invested father if it was another man’s baby? I mean, does this even make sense on crude evolutionary terms?”

    In this scenario the woman is cheating on her long-term partner. How did you miss this?

    • Corey: Right, she’s cheating, so what did I miss?

        • Presumably the woman conceals her cheating and the long-term partner thinks the child is his progeny. Isn’t that usually how these things go?

        • Corey: I don’t think so. Jane is not married to trustworthy Tim, and gets pregnant by Tarzan, who is single. Tim is unlikely to tie the knot with Jane. Nor can it be supposed Elvis (Tarzan?) has no interest in his child.

          • Principle of charity suggests that the author of the line you question intended something like my interpretation (regardless of what would actually happen).

            • Corey: Or, she was trying to account for the results which might have been different. I really don’t know.

  9. Russ Wolfinger

    Great post Mayo. I presume now Italian women would advise P-value-ana to really size up her suitor before agreeing to Rumba or Waltz.

    Regarding replicability/reproducibility, a sizable group of scientific colleagues and I have been investigating related issues over the past 15 years in the field of genomics. An original motivating problem is the following: A lab conducts a basic gene expression study comparing treatments A versus B over thousands of genes. They use a t-test on each gene and rank order them by p-value to determine the top 100 most statistically significant. (Whoops very sorry, I just used the forbidden phrase.) A second lab attempts to reproduce the experiment with the same protocols and genetically similar samples and does the exact same statistical analysis. The overlap of the two gene lists ends up being near zero. Why?

    After a lot of friendly debate amongst experts across several disciplines including biochemistry, toxicology, and statistics, along with some simulation studies, our conclusion is that the p-values are performing exactly as advertised. Specifically, they are not designed to reproduce results in this fashion, but rather to control error rates. A second conclusion is that if we rank order the genes by raw effect size (the numerator of the t-statistic), then the degree of overlap in the two lists is much higher. Log fold change is much more reproducible than its corresponding signal-to-noise ratio. This makes sense given the sampling variability of the two statistics. For details, refer to the publications of the MAQC Society.
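    A toy version of the phenomenon (all numbers invented, purely to illustrate the mechanism): two simulated labs measure the same genes with very few degrees of freedom for the per-gene variance, then each ranks genes by |log fold change| and by the t-like signal-to-noise ratio. The fold-change top-100 lists overlap far more across labs:

```python
import math
import random

random.seed(4)

G, df, top = 4000, 4, 100   # genes, per-gene variance df (tiny samples), list size
true = [random.gauss(0, 1.5) if random.random() < 0.1 else 0.0
        for _ in range(G)]  # ~10% of genes truly differential

def lab():
    """One lab's per-gene |log fold change| and |t|-like scores."""
    fc, tt = [], []
    for d in true:
        m = d + random.gauss(0, 0.5)       # observed log fold change
        s2 = 0.25 * sum(random.gauss(0, 1) ** 2 for _ in range(df)) / df
        fc.append(abs(m))
        tt.append(abs(m) / math.sqrt(s2))  # signal-to-noise ratio
    return fc, tt

def top_genes(scores):
    return set(sorted(range(G), key=lambda g: -scores[g])[:top])

fc1, t1 = lab()
fc2, t2 = lab()
print(len(top_genes(fc1) & top_genes(fc2)))  # fold-change lists: larger overlap
print(len(top_genes(t1) & top_genes(t2)))    # t-based lists: smaller overlap
```

    With so few df, the noisy variance estimate in the t-statistic’s denominator reshuffles the rankings from lab to lab, while the numerator alone is comparatively stable.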

    My current view is that reproducibility is a third dimension beyond specificity and sensitivity. Your error statistics philosophy sheds a lot of light on this and we appear to be in need of a better understanding of it along with a much richer set of applied methods to successfully navigate all three dimensions.

    • Russ: I just noticed this. Statistical significance is certainly not a prohibited phrase here, nor at the ASA, we now know. I’d be interested to understand how the “error statistics philosophy sheds a lot of light on this” issue. I will look at your paper.
