For Statistical Transparency: Reveal Multiplicity and/or Just Falsify the Test (Remark on Gelman and Colleagues)



Gelman and Loken (2014) recognize that even without explicit cherry picking there is often enough leeway in the “forking paths” between data and inference so that by artful choices you may be led to one inference, even though it also could have gone another way. In good sciences, measurement procedures should interlink with well-corroborated theories and offer a triangulation of checks– often missing in the types of experiments Gelman and Loken are on about. Stating a hypothesis in advance, far from protecting against verification biases, can be the engine that enables data to be “constructed” to reach the desired end [1].

[E]ven in settings where a single analysis has been carried out on the given data, the issue of multiple comparisons emerges because different choices about combining variables, inclusion and exclusion of cases…and many other steps in the analysis could well have occurred with different data (Gelman and Loken 2014, p. 464).

An idea growing out of this recognition is to imagine the results of applying the same statistical procedure, but with different choices at key discretionary junctures–giving rise to a multiverse analysis, rather than a single data set (Steegen, Tuerlinckx, Gelman, and Vanpaemel 2016). One lists the different choices thought to be plausible at each stage of data processing. The multiverse displays “which constellation of choices corresponds to which statistical results” (p. 797). The result of this exercise can, at times, mimic the delineation of possibilities in multiple testing and multiple modeling strategies.
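The bookkeeping behind such a listing is straightforward: each discretionary juncture contributes a set of options, and the multiverse is the Cartesian product of those sets. A minimal sketch (the choice points and options below are hypothetical stand-ins, not Steegen et al.’s actual ones):

```python
# Enumerating a "multiverse": every combination of discretionary
# data-processing choices defines one analysis path over the same data.
# The choice points and options here are hypothetical illustrations.
from itertools import product

choice_points = {
    "fertility_cutoff": ["days 6-14", "days 7-14", "days 9-17"],
    "exclusion_rule": ["keep all", "drop irregular cycles"],
    "relationship_coding": ["single vs. other", "single vs. committed only"],
}

universes = list(product(*choice_points.values()))
print(len(universes))  # 3 * 2 * 2 = 12 distinct analyses of the same data
for u in universes[:3]:
    print(dict(zip(choice_points.keys(), u)))
```

Even three modest choice points already yield a dozen distinct analyses of the same raw data; real multiverses grow multiplicatively with each added juncture.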

Steegen et al. consider the rather awful example from 2012 purporting to show that single (vs. non-single) women prefer Obama to Romney when they are highly fertile, and the reverse when they’re at low fertility. (I’m guessing there’s a hold on these ovulation studies during the current election season–maybe that’s one good thing in this election cycle. But let me know if you hear of any.)

Two studies with relatively large and diverse samples of women found that ovulation had different effects on religious and political orientation depending on whether women were single or in committed relationships. Ovulation led single women to become more socially liberal, less religious, and more likely to vote for Barack Obama (Durante et al. 2013, p. 1013).

What irks me to no end is the assumption that they’re finding effects of ovulation when all they’ve got are a bunch of correlations with lots of flexibility in analysis. (It was discussed in brief on this blogpost.) Unlike the study claiming to show males are more likely to suffer a drop in self-esteem when their partner surpasses them in something (as opposed to when they surpass their partner), this one’s not even intuitively plausible. (For the former case of “Macho Men” see slides starting from #48 of this post.) The ovulation study was considered so bad that people complained to the network and it had to be pulled.[2] Nevertheless, both studies are open to an analogous critique.

One of the choice points is where to draw the line at “highly fertile” based on days in a woman’s cycle. It wasn’t based on any hormone check, but on an on-line questionnaire asking subjects when they’d had their last period. There’s latitude in using such information (even assuming it to be accurate) to decide whether to place someone in a low or high fertility group (Steegen et al. find 5 sets of days that could have been used). It turns out that under the other choice points, many of the results were insignificant. Had the evidence been “constructed” along these alternative lines, a negative result would often have ensued. Intuitively, considering what could have happened but didn’t is quite relevant for interpreting the significant result they published. But how?

1. A severity scrutiny

Suppose the study is taken as evidence for

H1: ovulation makes single women more likely to vote for Obama than Romney.

The data they selected for analysis accords with H1, where highly fertile is defined in their chosen manner, leading to significance. The multiverse arrays how many other choice combinations lead to different p-values. We want to determine how good a job has been done in ruling out flaws in the study purporting to have evidence for H1. To determine how severely H1 had passed, we’d ask:

What’s the probability they would not have found some path or other to yield statistical significance, even if in fact H1 is false and there’s no genuine effect?

We want this probability to be high, in order to argue the significant result indicates a genuine effect. That is, we’d like some assurance that the procedure would have alerted us were H1 unwarranted. I’m not sure how to compute this using the multiverse, but it’s clear there’s more leeway than if one definition for fertility had been pinned down in advance. Perhaps each of the k different consistent combinations can count as a distinct hypothesis, and then one tries to consider the probability of getting r out of k hypotheses statistically significant, even if H1 is false, taking account of dependencies. Maybe Stan Young’s “resampling-based multiple modeling” techniques could be employed (Westfall & Young, 1993). In any event, the spirit of the multiverse is, or appears to be, a quintessentially error statistical gambit. In appraising the well-testedness of a claim, anything that alters the probative capacity to discern flaws is relevant; anything that increases the flabbiness in uncovering flaws (in what is to be inferred) lowers the severity of the test that H1 has passed. Clearly, taking a walk on a data construction highway does this–the very reason for the common call for preregistration.
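For a rough feel of how this leeway erodes severity, here is a toy simulation under assumed numbers: k consistent choice combinations, each yielding a nominal 5% two-sided test, with an assumed equicorrelation between the paths’ test statistics (since they share most of the same data). It estimates the probability that some path or other reaches statistical significance even though there is no genuine effect:

```python
# Toy simulation of how multiple analysis paths inflate the chance of
# *some* significant result under the null -- i.e., how they lower the
# severity of the test H1 passed. All numbers are illustrative
# assumptions, not taken from the studies discussed.
import math
import random

random.seed(1)
k = 12          # assumed number of consistent choice combinations
rho = 0.5       # assumed correlation between paths (shared data)
crit = 1.96     # two-sided 5% critical value for a z-test
n_sim = 20000

hits = 0
for _ in range(n_sim):
    shared = random.gauss(0, 1)
    # equicorrelated z-statistics: common component + path-specific noise
    zs = [math.sqrt(rho) * shared + math.sqrt(1 - rho) * random.gauss(0, 1)
          for _ in range(k)]
    if any(abs(z) > crit for z in zs):
        hits += 1

print(round(hits / n_sim, 2))  # well above the nominal 0.05
```

The estimated chance of at least one nominally significant path lands well above the advertised 0.05; that gap is exactly the severity-lowering effect described above, and it shrinks only as the paths become perfectly dependent or are pinned down in advance.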

If one hadn’t preregistered, and all the other plausible combinations of choices yield non-significance, there’s a strong inkling that researchers selectively arrived at their result. If one had preregistered, finding that other paths yield non-significance is still informative about the fragility of the result. On the other hand, suppose one had preregistered and obtained a negative result. In the interest of reporting the multiverse, positive results may be disinterred, possibly offsetting the initial negative result.

2. It Is Meant to Be Applicable to Bayesian and Frequentist Approaches

I find it interesting that the authors say that “a multiverse analysis is valuable, regardless of the inferential framework (frequentist or Bayesian)” and regardless of whether the inference is in the form of p-values, CIs, Bayes Factors or posteriors (p. 709). Do the Bayesian tests (posterior or Bayes Factors) find evidence against H1 just when the configuration yields an insignificant result? We’re not told. No, I don’t see why they would. It would depend, of course, on the choice of alternatives and priors. Given how strongly authors Durante et al. believe H1, it wouldn’t be surprising if the multiverse continues to find evidence for it (with a high posterior or high Bayes Factor in favor of H1). Presumably the flexibility in discretionary choices is to show up in diminished Bayesian evidence for H1, but it’s not clear to me how. Nevertheless, even if the approach doesn’t itself consider error probabilities of methods, we can set out to appraise severity on the meta-level. We may argue that there’s a high probability of finding evidence in favor of some alternative H1 or other (varying over definitions of high fertility, say), even if it’s false. Yet I don’t think that’s what Steegen et al. have in mind. I welcome a clarification.

3. Auditing: Just Falsify the Test, If You Can

I find a lot to like in the multiverse scrutiny with its recognition of how different choice points in modeling and collecting data introduce the same kind of flexibility as explicit data-dependent searches. There are some noteworthy differences between it and the kind of critique I’ve proposed.

If no strong arguments can be made for certain choices, we are left with many branches of the multiverse that have large p-values. In these cases, the only reasonable conclusion on the effect of fertility is that there is considerable scientific uncertainty. One should reserve judgment…researchers interested in studying the effects of fertility should work hard to deflate the multiverse (Steegen et al., p. 708).

Reserve judgment? Here’s another reasonable conclusion: The core presumptions are falsified (or would be with little effort). What is overlooked in all of these fascinating multiverses is whether the entire inquiry makes any sense. One should expose or try to expose the unwarranted presuppositions. This is part of what I call auditing. The error statistical account always includes the hypothesis: the test was poorly run, they’re not measuring what they purport to be, or the assumptions are violated. Say each person with high fertility in the first study is tested for candidate preference at a time next month when they are in the low fertility stage. If they have the same voting preferences, the test is falsified.

The onus is on the researchers to belie the hypothesis that the test was poorly run; but if they don’t, then we must.[3]

Please share your comments, suggestions, and any links to approaches related to the multiverse analysis. 

Adapted from Mayo, Statistical Inference as Severe Testing (forthcoming)



[1] I’m reminded of Stapel’s “fix” for science: admit the story you want to tell and how you fixed the statistics to tell it. See this post. 

[2] “Last week CNN pulled a story about a study purporting to demonstrate a link between a woman’s ovulation and how she votes, explaining that it failed to meet the cable network’s editorial standards. The story was savaged online as ‘silly,’ ‘stupid,’ ‘sexist,’ and ‘offensive.’ Others were less nice.” (Citation may be found here.)

[3] I have found nearly all experimental studies in the social sciences to be open to a falsification probe, and many are readily falsifiable. The fact that some have built-in ways to try and block falsification brings them closer to falling over the edge into questionable science. This is so, even in cases where their hypotheses are plausible. This is a far faster route to criticism than non-replication and all the rest.


Durante, K.M., Rae, A. and Griskevicius, V. 2013. “The Fluctuating Female Vote: Politics, Religion, and the Ovulatory Cycle,” Psychological Science 24(6): 1007-1016.


Gelman, A. and Loken, E. 2014. “The statistical crisis in science,” American Scientist 102(6): 460-465.

Mayo, D. Statistical Inference as Severe Testing. CUP (forthcoming).

Steegen, S., Tuerlinckx, F., Gelman, A. and Vanpaemel, W. 2016. “Increasing Transparency Through a Multiverse Analysis,” Perspectives on Psychological Science 11: 702-712.

Westfall, P. H. and S.S. Young. 1993. Resampling-Based Multiple Testing: Examples and Methods for P-Value Adjustment. A Wiley-Interscience Publication. Wiley.

Categories: Bayesian/frequentist, Error Statistics, Gelman, P-values, preregistration, reproducibility, Statistics


9 thoughts on “For Statistical Transparency: Reveal Multiplicity and/or Just Falsify the Test (Remark on Gelman and Colleagues)”

  1. Michael Lew

    You hit the core issue of the current ‘reproducibility crisis’ with this introductory sentence: “In good sciences, measurement procedures should interlink with well-corroborated theories and offer a triangulation of checks– often missing in the types of experiments Gelman and Loken are on about.” I would emphasise as well the triangulation by evidence from interrelated sets of experiments within a study as well as confirmatory partial replications.

    In my opinion the fix for that issue is for scientists to recognise the essential role of such triangulation in the inferential process. Currently it is underplayed where it is present, with the consequence that its absence can be ignored too easily. Research works that lack the corroboration of interrelated observations and theory should be considered to be preliminary and exploratory no matter how small the P-value, and no matter how severe the statistical test.

    It is my opinion (not yours Mayo, I know) that the dichotomisation of results inherent in the use of Neyman-Pearsonian hypothesis tests feeds into the naive, erroneous and disastrous assumption that unsupported studies of the type that “Gelman and Loken are on about” are worthy of attention. If, instead of a simple declaration of ‘significant!’, authors can be encouraged to describe and quantify the evidence and place it into context by reasoned argument, we will reduce both the number and impact of false positive papers.

    • Michael: Well I agree with everything except I have it on good evidence that neither Fisher, nor Neyman nor Pearson ever stopped emphasizing the difference between mere statistical effects and substantive theory, models and measurements. They each described, especially in applied work, the embarrassments that occur from running them together. But even if we imagined you were right that these founders (which would also have to include Cox, Lehmann, Barnard, Kempthorne and dozens of others) were guilty of feeding “into the naive, erroneous and disastrous assumption that unsupported studies of the type that ‘Gelman and Loken are on about’ are worthy of attention” (however laughable), it would be irrelevant to the proper philosophy of science and statistics that we ought to hold today. (Have you ever read Neyman or Fisher or Cox on causal inference, experimental design, etc.?)*
      That doesn’t mean that even physics has it easy in struggling for triangulation when it comes to scientific frontiers. Even high energy particle physics has to worry about data dependent choices for “cuts”–to count as an “event”. In this connection I include, in my new book, examples of how measurement is dealt with in well developed (and medium developed) sciences.
      *It’s worth noting that Neyman developed confidence intervals at the same time as tests, intending that demonstrated effects would be followed by estimation and/or power analysis to determine if anything has been “confirmed”.

      • Michael Lew

        Mayo, as is often the case, you jumped to an exaggerated and largely unfounded interpretation of my words. I did not intend readers to assume that either Neyman or Pearson meant that scientists should confuse substantive significance with statistical significance. What they intended is not the issue. Instead, I was drawing attention to the fact that the received wisdom of dichotomised significant/not significant interpretation of results leads to the problems being discussed.

        Yes, as you full well know, I have read the works of Fisher, Neyman et al.*

        * Neyman’s confidence intervals can be used for estimation, but they are just as prone to dichotomous interpretation as are the results of the related hypothesis tests. Estimation should be based on evidential support rather than method-related error rates. Thus the good performance of Neyman’s confidence intervals is mostly a result of the fact that those intervals so often approximate intervals more directly based on the likelihood function.

        • Michael: I didn’t mean to jump; I got the feeling almost as if you thought the start of your comment showed too much agreement, so you went overboard by blaming N-P. In any event, the issue of dichotomy, which oddly enough is continued in today’s “new statistics” (Cumming’s use of CIs), is different from the key issue I’m talking about: poor tests, poorly run tests, and questionable measurements. This entered psychology by its very own, having to do with positivists like Stevens (I believe that’s the spelling) and others. They conveniently defined measurement so as NOT to require showing any real quantity was being measured, or rather, you could reify the thing to make it real and quantifiable. There’s a literature on this that I only know from Paul Meehl (Michell is a name). This is how scientism was introduced to psychology, and these issues have much more to do with its questionable credentials (I’m not lumping it all together) than the statistics. (Then it was an easy slide to statisticism.) They always blame the statistics for blocking them from doing real science, so they grab onto the next new method, but they almost never scrutinize the foundations of their entire field. If they did, dire consequences could follow. So that’s why most of today’s meta-methodology and “methodological activism” (as I’d heard it called) in social psych and related fields is barking up the wrong tree. I’m being too quick, but not by much.

          *As for your reading, I don’t know what you’ve read or haven’t read or how open you are to recognizing the evidential importance of error probabilities to probing errors in the case at hand (probativeness, not performance). Lacking that understanding is at least as big a block as the measurement problem in social science.

        • Here’s an abstract by Michell on measurement in psych:
          Measurement: a beginner’s guide.
          Michell J.
          This paper provides an introduction to measurement theory for psychometricians. The central concept in measurement theory is that of a continuous quantitative attribute and explaining what measurement is requires showing how this central concept leads on to those of ratio and real number and distinguishing measurements from measures. These distinctions made, the logic of quantification is described with particular emphasis upon the scientific task of quantification, as opposed to the instrumental task. The position presented is that measurement is the estimation of the magnitude of a quantitative attribute relative to a unit and that quantification is always contingent upon first attempting the scientific task of acquiring evidence that the relevant attribute is quantitative in structure. This position means that the definition of measurement usually given in psychology is incorrect and that psychologists’ claims about being able to already measure psychological attributes must be seriously questioned. Just how the scientific task of investigating whether psychological attributes are quantitative may be undertaken in psychology is then considered and the corollary that psychological attributes may not actually be quantitative is raised.


  3. Stan Young

    I like this post and discussion.

    Two examples.

    We came across a paper making the claim that women who ate cereal in and around the time of conception were more likely to have a boy baby. After some back and forth we obtained the data set. There were 131 foods in the food questionnaire and there were two time periods at issue. We computed 262 p-values and did a p-value plot. (In a p-value plot the p-values are ranked and plotted against the integers. If the p-values fall on a straight line, the complete null is supported.) The p-values did fall on a straight line, so the claim was most likely a statistical false positive.

    Young, S.S., Bang, H. and Oktay, K. 2008. “Cereal-induced gender selection? Most likely a multiple testing false positive,” Proceedings of the Royal Society B, published online Jan. 14.
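    A minimal sketch of the p-value plot just described, with simulated null p-values standing in for the cereal-study data (the 262 is the only number taken from the example above):

```python
# Sketch of a p-value plot: rank the m p-values and compare them to the
# integers' expected positions. Under the complete null, p-values are
# Uniform(0,1), so the ranked values fall near a straight line with
# slope 1/(m+1). Simulated null data stand in for the 262 food-by-period
# tests of the cereal study.
import random

random.seed(0)
m = 262
pvals = sorted(random.random() for _ in range(m))  # complete null

# expected value of the i-th of m uniform order statistics is i/(m+1)
expected = [i / (m + 1) for i in range(1, m + 1)]
max_dev = max(abs(p - e) for p, e in zip(pvals, expected))
print(max_dev)  # small deviation: the ranked p-values track the line
```

    Under the complete null the ranked p-values hug the diagonal; a handful of genuine effects would instead show up as points falling well below the line at the low end.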

    We looked at a very large data set: air quality (PM2.5 and ozone) and daily deaths for eight air basins in California over 13 years. There were lots of ways to do an analysis. We ran a complex factorial design with over 78,000 different analyses. Plots showed results evenly distributed above and below the null. We concluded there was no effect. Two editors had very strong priors (our results were dismissed out of hand), so the results still await formal publication.

    Kenneth K. Lopiano, Richard L. Smith, S. Stanley Young (2015) Air quality and acute deaths in California, 2000-2012.

    • Stan: I’m still thinking that Gelman’s multiverse may differ importantly from your approach to multiplicity. Or not?

      • Stan Young

        There is an important paper by M. Clyde of Duke, 2000, in Environmetrics. She pointed to the sea of models possible from large observational data sets. My reading is that she was rather completely ignored. Here we are 16 years later looking at the same ground. Or see my figure 3 in Young and Karr (2011). I don’t think we know how much things can be moved around by model selection. I don’t know of much research in the area of model selection bias.

        Clean multiplicity comes from multiple outcomes and multiple predictors. More obscure model selection bias comes from a large number of covariates in or out of the model.

        Young SS, Karr A. (2011) Deming, data and observational studies: A process out of control and needing fixing. Significance, September, 122-126.

        “There are many aspects of model choice that are involved in health effect studies of particulate matter and other pollutants. Some of these choices concern which pollutants and confounding variables should be included in the model, what type of lag structure for the covariates should be used, which interactions need to be considered, and how to model nonlinear trends. Because of the large number of potential variables, model selection is often used to find a parsimonious model. Different model selection strategies may lead to very different models and conclusions for the same set of data. As variable selection may involve numerous test of hypotheses, the resulting significance levels may be called into question, and there is the concern that the positive associations are a result of multiple testing.”
