Gelman and Loken (2014) recognize that even without explicit cherry picking there is often enough leeway in the “forking paths” between data and inference that, by artful choices, you may be led to one inference even though it could just as well have gone another way. In good sciences, measurement procedures should interlink with well-corroborated theories and offer a triangulation of checks, often missing in the types of experiments Gelman and Loken are on about. Stating a hypothesis in advance, far from protecting from the verification biases, can be the engine that enables data to be “constructed” to reach the desired end.
[E]ven in settings where a single analysis has been carried out on the given data, the issue of multiple comparisons emerges because different choices about combining variables, inclusion and exclusion of cases … and many other steps in the analysis could well have occurred with different data (Gelman and Loken 2014, p. 464).
An idea growing out of this recognition is to imagine the results of applying the same statistical procedure, but with different choices at key discretionary junctures, giving rise to a multiverse analysis rather than the analysis of a single data set (Steegen, Tuerlinckx, Gelman, and Vanpaemel 2016). One lists the different choices thought to be plausible at each stage of data processing. The multiverse displays “which constellation of choices corresponds to which statistical results” (p. 797). The result of this exercise can, at times, mimic the delineation of possibilities in multiple testing and multiple modeling strategies.
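To make the bookkeeping concrete, here is a minimal sketch of how such a multiverse might be enumerated. Everything in it is an assumption made for illustration: the two choice points (`FERTILITY_WINDOWS`, `EXCLUSION_RULES`), the synthetic respondents, and the simple two-proportion test are hypothetical stand-ins, not the choices or the analysis used by Steegen et al. or Durante et al.

```python
import itertools
import math
import random

# Hypothetical choice points; these are NOT the options listed by Steegen et al.
FERTILITY_WINDOWS = {"days 7-14": (7, 14), "days 6-14": (6, 14), "days 9-17": (9, 17)}
EXCLUSION_RULES = {
    "keep everyone": lambda s: True,
    "regular cycles only": lambda s: s["regular_cycle"],
}


def two_proportion_p(x1, n1, x2, n2):
    """Two-sided p-value for a difference in proportions (normal approximation)."""
    if n1 == 0 or n2 == 0:
        return 1.0
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    if se == 0:
        return 1.0
    z = (x1 / n1 - x2 / n2) / se
    return math.erfc(abs(z) / math.sqrt(2))


def simulate_subjects(n, rng):
    """Synthetic respondents whose preference is unrelated to cycle day."""
    return [
        {
            "cycle_day": rng.randint(1, 28),
            "regular_cycle": rng.random() < 0.8,
            "prefers_obama": rng.random() < 0.5,
        }
        for _ in range(n)
    ]


def run_multiverse(subjects):
    """Return one p-value per combination of discretionary choices."""
    results = {}
    for (w_name, (lo, hi)), (e_name, keep) in itertools.product(
        FERTILITY_WINDOWS.items(), EXCLUSION_RULES.items()
    ):
        kept = [s for s in subjects if keep(s)]
        high = [s["prefers_obama"] for s in kept if lo <= s["cycle_day"] <= hi]
        low = [s["prefers_obama"] for s in kept if not (lo <= s["cycle_day"] <= hi)]
        results[(w_name, e_name)] = two_proportion_p(
            sum(high), len(high), sum(low), len(low)
        )
    return results


if __name__ == "__main__":
    subjects = simulate_subjects(400, random.Random(0))
    for choices, p in run_multiverse(subjects).items():
        print(choices, round(p, 3))
```

Each key of the returned dictionary is one “constellation of choices” and each value is the p-value that constellation would have produced. A real multiverse would have more choice points and the study's own test statistic.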
Steegen et al. consider the rather awful example from 2012 purporting to show that single (vs non-single) women prefer Obama to Romney when they are highly fertile; the reverse when they’re at low fertility. (I’m guessing there’s a hold on these ovulation studies during the current election season; maybe that’s one good thing in this election cycle. But let me know if you hear of any.)
Two studies with relatively large and diverse samples of women found that ovulation had different effects on religious and political orientation depending on whether women were single or in committed relationships. Ovulation led single women to become more socially liberal, less religious, and more likely to vote for Barack Obama (Durante et al., p. 1013).
What irks me to no end is the assumption that they’re finding effects of ovulation when all they’ve got are a bunch of correlations with lots of flexibility in analysis. (It was discussed in brief on this blogpost.) Unlike the study claiming to show males are more likely to suffer a drop in self-esteem when their partner surpasses them in something (as opposed to when they surpass their partner), this one’s not even intuitively plausible. (For the former case of “Macho Men” see slides starting from #48 of this post.) The ovulation study was considered so bad that people complained to the network and it had to be pulled. Nevertheless, both studies are open to an analogous critique.
One of the choice points is where to draw the line at “highly fertile” based on days in a woman’s cycle. It wasn’t based on any hormone check, but on an on-line questionnaire asking subjects when they’d had their last period. There’s latitude in using such information (even assuming it to be accurate) to decide whether to place someone in a low or high fertility group (Steegen et al. find 5 sets of days that could have been used). It turns out that under the other choice points, many of the results were statistically insignificant. Had the evidence been “constructed” along these alternative lines, a negative result would often have ensued. Intuitively, considering what could have happened but didn’t is quite relevant for interpreting the significant result they published. But how?
1. A Severity Scrutiny
Suppose the study is taken as evidence for
H1: ovulation makes single women more likely to vote for Obama than Romney.
The data they selected for analysis accord with H1, where highly fertile is defined in their chosen manner, leading to significance. The multiverse arrays how many other choice combinations lead to different p-values. We want to determine how good a job has been done in ruling out flaws in the study purporting to have evidence for H1. To determine how severely H1 has passed we’d ask:
What’s the probability they would not have found some path or other to yield statistical significance, even if in fact H1 is false and there’s no genuine effect?
We want this probability to be high, in order to argue the significant result indicates a genuine effect. That is, we’d like some assurance that the procedure would have alerted us were H1 unwarranted. I’m not sure how to compute this using the multiverse, but it’s clear there’s more leeway than if one definition for fertility had been pinned down in advance. Perhaps each of the k different consistent combinations can count as a distinct hypothesis, and then one tries to consider the probability of getting r out of k hypotheses statistically significant, even if H1 is false, taking account of dependencies. Maybe Stan Young’s “resampling-based multiple modeling” techniques could be employed (Westfall & Young, 1993). In any event, the spirit of the multiverse is, or appears to be, a quintessentially error statistical gambit. In appraising the well-testedness of a claim, anything that alters the probative capacity to discern flaws is relevant; anything that increases the flabbiness in uncovering flaws (in what is to be inferred) lowers the severity of the test that H1 has passed. Clearly, taking a walk on a data construction highway does this, which is the very reason for the common call for preregistration.
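One rough way to get a feel for the probability in question is simulation. The sketch below is only an illustration under stated assumptions: it generates synthetic data in which preference is unrelated to cycle day (so H1 is false), then checks whether any of several overlapping fertility windows yields p < 0.05. The windows, sample size, and two-proportion test are all hypothetical; this is not the Westfall and Young procedure, and not a computation from the actual study. Because every window is applied to the same simulated data, the dependencies between paths are carried along automatically.

```python
import math
import random

# Hypothetical high-fertility day ranges (illustrative, not from the study).
MANY_WINDOWS = [(7, 14), (6, 14), (9, 17), (8, 14), (9, 15)]
ONE_WINDOW = MANY_WINDOWS[:1]  # as if a single definition were fixed in advance


def two_proportion_p(x1, n1, x2, n2):
    """Two-sided p-value for a difference in proportions (normal approximation)."""
    if n1 == 0 or n2 == 0:
        return 1.0
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    if se == 0:
        return 1.0
    z = (x1 / n1 - x2 / n2) / se
    return math.erfc(abs(z) / math.sqrt(2))


def any_path_significant(rng, windows, n=300, alpha=0.05):
    """Simulate one null data set; report whether any window gives p < alpha."""
    days = [rng.randint(1, 28) for _ in range(n)]
    votes = [rng.random() < 0.5 for _ in range(n)]  # unrelated to cycle day
    for lo, hi in windows:
        high = [v for d, v in zip(days, votes) if lo <= d <= hi]
        low = [v for d, v in zip(days, votes) if not (lo <= d <= hi)]
        if two_proportion_p(sum(high), len(high), sum(low), len(low)) < alpha:
            return True
    return False


def prob_no_significant_path(windows, trials=2000, seed=1):
    """Monte Carlo estimate of P(no path reaches significance | no real effect)."""
    rng = random.Random(seed)
    hits = sum(any_path_significant(rng, windows) for _ in range(trials))
    return 1 - hits / trials


if __name__ == "__main__":
    print("one fixed window :", prob_no_significant_path(ONE_WINDOW))
    print("any of 5 windows :", prob_no_significant_path(MANY_WINDOWS))
```

With the same seed, the estimate for “any of 5 windows” can never exceed the estimate for the single fixed window, since the first window is among the five. The gap between the two numbers is one crude measure of how much the added leeway lowers the chance of being alerted when H1 is false.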
If one hadn’t preregistered, and all the other plausible combinations of choices yield non-significance, there’s a strong inkling that researchers selectively arrived at their result. If one had preregistered, finding that other paths yield non-significance is still informative about the fragility of the result. On the other hand, suppose one had preregistered and obtained a negative result. In the interest of reporting the multiverse, positive results may be disinterred, possibly offsetting the initial negative result.
2. It is to be Applicable to Bayesian and Frequentist Approaches
I find it interesting that the authors say that “a multiverse analysis is valuable, regardless of the inferential framework (frequentist or Bayesian)” and regardless of whether the inference is in the form of p-values, CIs, Bayes Factors or posteriors (p. 709). Do the Bayesian tests (posterior or Bayes Factors) find evidence against H1 just when the configuration yields an insignificant result? We’re not told. No, I don’t see why they would. It would depend, of course, on the choice of alternatives and priors. Given how strongly authors Durante et al. believe H1, it wouldn’t be surprising if the multiverse continues to find evidence for it (with a high posterior or high Bayes Factor in favor of H1). Presumably the flexibility in discretionary choices is to show up in diminished Bayesian evidence for H1, but it’s not clear to me how. Nevertheless, even if the approach doesn’t itself consider error probabilities of methods, we can set out to appraise severity on the meta-level. We may argue that there’s a high probability of finding evidence in favor of some alternative H1 or other (varying over definitions of high fertility, say), even if it’s false. Yet I don’t think that’s what Steegen et al. have in mind. I welcome a clarification.
3. Auditing: Just Falsify the Test, If You Can
I find a lot to like in the multiverse scrutiny with its recognition of how different choice points in modeling and collecting data introduce the same kind of flexibility as explicit data-dependent searches. There are some noteworthy differences between it and the kind of critique I’ve proposed.
If no strong arguments can be made for certain choices, we are left with many branches of the multiverse that have large p-values. In these cases, the only reasonable conclusion on the effect of fertility is that there is considerable scientific uncertainty. One should reserve judgment … researchers interested in studying the effects of fertility should work hard to deflate the multiverse (Steegen et al., p. 708).
Reserve judgment? Here’s another reasonable conclusion: The core presumptions are falsified (or would be with little effort). What is overlooked in all of these fascinating multiverses is whether the entire inquiry makes any sense. One should expose or try to expose the unwarranted presuppositions. This is part of what I call auditing. The error statistical account always includes the hypothesis: the test was poorly run, they’re not measuring what they purport to be, or the assumptions are violated. Say each person with high fertility in the first study is tested for candidate preference at a time next month when they are in the low fertility stage. If they have the same voting preferences, the test is falsified.
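The follow-up check just described could be tabulated along the following lines. This is a hedged sketch only: the function name `paired_switch_summary`, the data layout, and the exact sign test on the switchers are illustrative assumptions, and the example pairs are made up rather than taken from any study.

```python
from math import comb


def paired_switch_summary(pairs):
    """Summarize within-person preference changes between two sessions.

    pairs: list of (preference_at_high_fertility, preference_at_low_fertility),
    each entry being "Obama" or "Romney".
    """
    to_romney = sum(1 for h, l in pairs if h == "Obama" and l == "Romney")
    to_obama = sum(1 for h, l in pairs if h == "Romney" and l == "Obama")
    switchers = to_romney + to_obama
    if switchers == 0:
        p_value = 1.0
    else:
        # Exact two-sided sign test on the switchers only.
        k = min(to_romney, to_obama)
        tail = sum(comb(switchers, i) for i in range(k + 1)) / 2 ** switchers
        p_value = min(1.0, 2 * tail)
    return {
        "unchanged": len(pairs) - switchers,
        "to_romney": to_romney,
        "to_obama": to_obama,
        "p_value": p_value,
    }


if __name__ == "__main__":
    # Made-up illustrative data: most respondents do not change preference.
    example = (
        [("Obama", "Obama")] * 10
        + [("Romney", "Romney")] * 5
        + [("Obama", "Romney")]
        + [("Romney", "Obama")]
    )
    print(paired_switch_summary(example))
```

If nearly everyone lands in `unchanged`, or the few switches go in both directions about equally, the claim that ovulation drives candidate preference has failed a test it ought to have been able to pass.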
The onus is on the researchers to belie the hypothesis that the test was poorly run; but if they don’t, then we must.
Please share your comments, suggestions, and any links to approaches related to the multiverse analysis.
Adapted from Mayo, Statistical Inference as Severe Testing (forthcoming)
 I’m reminded of Stapel’s “fix” for science: admit the story you want to tell and how you fixed the statistics to tell it. See this post.
 “Last week CNN pulled a story about a study purporting to demonstrate a link between a woman’s ovulation and how she votes, explaining that it failed to meet the cable network’s editorial standards. The story was savaged online as ‘silly,’ ‘stupid,’ ‘sexist,’ and ‘offensive.’ Others were less nice.” (Citation may be found here.)
 I have found nearly all experimental studies in the social sciences to be open to a falsification probe, and many are readily falsifiable. The fact that some have built-in ways to try and block falsification brings them closer to falling over the edge into questionable science. This is so, even in cases where their hypotheses are plausible. This is a far faster route to criticism than non-replication and all the rest.
Durante, K.M., Rae, A. & Griskevicius, V. 2013, “The Fluctuating Female Vote: Politics, Religion, and the Ovulatory Cycle,” Psychological Science, 24(6): 1007-1016.
Gelman, A. and Loken, E. 2014. “The Statistical Crisis in Science,” American Scientist, 102(6): 460-465.
Mayo, D. Statistical Inference as Severe Testing. CUP (forthcoming).
Steegen, S., Tuerlinckx, F., Gelman, A. and Vanpaemel, W. 2016. “Increasing Transparency Through a Multiverse Analysis,” Perspectives on Psychological Science, 11: 702-712.
Westfall, P. H. and S.S. Young. 1993. Resampling-Based Multiple Testing: Examples and Methods for P-Value Adjustment. A Wiley-Interscience Publication. Wiley.