Yoav Benjamini, "In the world beyond p < .05: When & How to use P < .0499…"


These were Yoav Benjamini's slides, "In the world beyond p < .05: When & How to use P < .0499…," from our session at the ASA 2017 Symposium on Statistical Inference (SSI): A World Beyond p < 0.05. (Mine are in an earlier post.) He begins by asking:

However, it’s mandatory to adjust for selection effects, and Benjamini is one of the leaders in developing ways to carry out the adjustments. Even calling out the avenues for cherry-picking and multiple testing, long known to invalidate p-values, would make replication research more effective (and less open to criticism).
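The best-known of Benjamini's adjustments is the Benjamini-Hochberg step-up procedure for controlling the false discovery rate (FDR). As a rough illustration of how such an adjustment works (function and variable names here are my own, not from the slides):

```python
def benjamini_hochberg(pvalues, q=0.05):
    """Return indices of hypotheses rejected at FDR level q (step-up rule)."""
    m = len(pvalues)
    # Sort p-values, remembering their original positions.
    order = sorted(range(m), key=lambda i: pvalues[i])
    # Find the largest rank k with p_(k) <= (k/m) * q.
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= rank / m * q:
            k_max = rank
    # Reject the hypotheses with the k_max smallest p-values.
    return sorted(order[:k_max])

# Five tests: only the two smallest p-values survive the adjustment,
# even though four of them are below the unadjusted .05 cutoff.
pvals = [0.001, 0.008, 0.039, 0.041, 0.20]
print(benjamini_hochberg(pvals, q=0.05))  # → [0, 1]
```

The point of the step-up comparison is exactly the one in the post: a p-value just under .05 means much less when it is one of many computed.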

Readers of this blog will recall one of the controversial cases of failed replication in psychology: Simone Schnall’s 2008 paper on cleanliness and morality. In an earlier post, I quote from the Chronicle of Higher Education:

…. Ms. Schnall had 40 undergraduates unscramble some words. One group unscrambled words that suggested cleanliness (pure, immaculate, pristine), while the other group unscrambled neutral words. They were then presented with a number of moral dilemmas, like whether it’s cool to eat your dog after it gets run over by a car. Ms. Schnall wanted to discover whether prompting—or priming, in psych parlance—people with the concept of cleanliness would make them less judgmental.

My gripe has long been that the focus of replication research is on the "pure" statistics (do we get a low p-value or not?), largely sidestepping the often tenuous statistical-substantive links and measurement questions, as in this case [1]. When the study failed to replicate, there was a lot of heated debate about the "fidelity" of the replication.[2] Yoav Benjamini cuts to the chase by showing the devastating selection effects involved (on slide #18):

For the severe tester, the onus is on researchers to show explicitly that they can put to rest expected challenges about selection effects. Failure to do so suffices to render the finding poorly or weakly tested (i.e., it passes with low severity), overriding the initial claim to have found a non-spurious effect.

"Every research proposal and paper should have a replicability-check component," Benjamini recommends.[3] Here are the full slides for your Saturday night perusal.

[1] I've yet to hear anyone explain why unscrambling soap-related words should be a good proxy for "situated cognition" of cleanliness. But even without answering that thorny issue, identifying the biasing selection effects that were not taken account of vitiates the nominal low p-value. By selection alone, it is easy rather than difficult to come up with at least one such computed low p-value. I advocate going further, where possible, and falsifying the claim that the statistical correlation is a good test of the hypothesis.
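The "easy rather than difficult" point can be checked by simulation. Under a true null with a continuous test statistic, a p-value is uniform on (0, 1); if twenty such tests are run and only the smallest p-value is reported, a nominal "finding" turns up about 64% of the time (1 − 0.95^20). A quick sketch (illustrative numbers only, nothing here is from the Schnall study):

```python
import random

random.seed(0)

def prob_at_least_one_significant(m, alpha=0.05, trials=10000):
    """Estimate P(min p-value < alpha) when all m nulls are true."""
    hits = 0
    for _ in range(trials):
        # Under a true null, each p-value is uniform on (0, 1);
        # draw m of them independently and keep the minimum.
        if min(random.random() for _ in range(m)) < alpha:
            hits += 1
    return hits / trials

print(prob_at_least_one_significant(1))   # about 0.05
print(prob_at_least_one_significant(20))  # about 0.64
```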

[2] For some related posts, with links to other blogs, check out:

Some ironies in the replication crisis in social psychology.

Repligate returns.

A new front in the statistics wars: peaceful negotiation in the face of so-called methodological terrorism

For statistical transparency: reveal multiplicity and/or falsify the test: remark on Gelman and colleagues

[3] The problem underscores the need for a statistical account that is directly altered by biasing selection effects, as the p-value is.

Categories: Error Statistics, P-values, replication research, selection effects

22 thoughts on "Yoav Benjamini, 'In the world beyond p < .05: When & How to use P < .0499…'"

  1. Stan Young

    In the world of (environmental) epidemiology, there are often many questions at issue. Usually, the reader has to figure out how many questions might be at issue. Essentially never will the researchers adjust for multiple testing or multiple modeling. Usually, the authors will not provide their data set or their analysis code. In such situations, everyone should assume that the paper does not exist. The editor should reject such papers or require the words, “Exploratory research” in the title of the paper.

    • Stan: Is that still true? I know that Greenland was recently defending epi practices because they can't possibly run RCTs, but surely they could report multiplicity.

      • Thanatos Savehn

        The answer is, it depends. Sometimes you have a Robert Rinsky who objected to mindless data dredging thusly: “Subsequent to the release of our results, Bross obtained from us … tables showing numbers of observed and expected deaths by five-year latency and duration categories … in all, the tables contained over 4000 cells. Apparently in the absence of any a priori hypotheses, Bross & Driscoll then proceeded to examine these tables seeking possible associations between radiation exposure and mortality from any cause. From the countless thousands of permutations available to them, Bross & Driscoll perceived a positive relationship between radiation exposure and lung cancer in those workers who had accumulated more than 15 years’ latency… and who had cumulative radiation exposures of one rem or more … [I]t is inappropriate of Bross & Driscoll to have presented their observations as scientific conclusions or, worse, as received truth … Bross & Driscoll’s observations should, quite properly, be considered as suggestions for further research. Over-interpreted or taken out of context, however, their findings illustrate the pitfalls of evaluating complex data sets without benefit of an a priori hypothesis.”

        That was from a letter to the editor published 35 years ago: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1438946/?page=1

        I’ve always found this “explicate your model first” approach to be the demarcation between good expert witnesses and those of the hired gun variety.

        P.S. If you had to define your model first, wouldn't that cure the multiple comparisons problem? Assuming random-like variability, shouldn't it be harder to find an association between radiation exposure and lung cancer than between a subset of cumulative radiation exposure and small cell lung cancer in people first exposed more than 15 years before diagnosis?

  2. Benjamini's analysis debunks the study leading to a finding, rather than just reporting a failed replication. What surprises me is that this did not arise in the back-and-forth quarreling about the Schnall case (to my knowledge). It would shut it down totally, in my view. The fact that it wasn't addressed and put to rest by her suffices for declaring her claim to have been inseverely tested.

  3. In the new "21st Century Cures" bill we're told: "Clinical testing of medical devices or drugs no longer requires the informed consent of the subjects if the testing poses no more than minimal risk and includes safeguards."

  4. “Randomization and blinding, two further design elements, hold the potential to reduce bias and are experiencing resurgences in popularity; however, they are not yet routine in animal research.”
    Yet the new “21st century cures” bill is saying you no longer need clinical trials.
    Isn’t there some conflict between the insistence on ‘raising the bar’ to combat irreproducibility and the new bill that says you don’t need clinical trials? Perhaps the bill garnered support before some of the “omics” disasters, and now we may have to live with it.

    • Mark

      This bill is a very bad idea. The belief is now that anything can be explained by a model, regardless of how that big pile of data came to be. As I’ve said before, the Bayesians have won. Sure, not everybody is a Bayesian, but almost everybody seems to think probability=uncertainty, which is really just a formulation of subjective probability (coupled with logical positivism).

      Anybody who calls for a ban on p-values either doesn’t understand or doesn’t value randomization.

      Where have you gone, David Freedman?

      • Mark:
        Well, it's alarming, and at odds with the insistence on 'raising the bar' to combat irreproducibility.
        Here’s a line from “raising the bar”:

        “Randomization and blinding, two further design elements, hold the potential to reduce bias and are experiencing resurgences in popularity; however, they are not yet routine in animal research.”

        Experiencing resurgences in popularity! As if RCTs are so old. It took less than a decade of trying to downplay them to discover that we needed them after all. It’s all a matter of which will win out: profits or good science.

        • Mark

          Mark Weaver, yes. "Hold the potential to reduce bias." NO, completely misses the point. Randomization, along with true allocation concealment, ELIMINATES the potential for selection bias. That is the true power of randomization, and it's what allows us to use permutation tests, etc., to test a null that really could be true and really is worth testing (I'm sorry, but this nonsense that some are spreading that nobody really thinks the null could be true is just that, nonsense… Randomization is useful *because* the sharp null could be true). Finding evidence against this sharp null suggests that some people (who? cannot know) did better on Drug A, say, than they would have done on Drug B. This is an (uncertain) individual-level causal inference, it's very powerful, and I personally think it's the best we get out of an RCT. I'm so off of estimating "treatment effects," I see that as just more nonsense (but don't tell that to the folks drafting the new expansion to ICH E9 all about "estimands").

          It’s a return to positivism.
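The permutation test alluded to above can be sketched briefly: under the sharp null that treatment made no difference to anyone, the group labels are exchangeable, so the test statistic can be recomputed over random relabelings. The data and names below are hypothetical, not from any study discussed here:

```python
import random

def permutation_pvalue(treated, control, n_perms=10000, seed=0):
    """Two-sided p-value for a difference in means under the sharp null."""
    rng = random.Random(seed)
    pooled = list(treated) + list(control)
    n_t = len(treated)
    observed = abs(sum(treated) / n_t - sum(control) / len(control))
    extreme = 0
    for _ in range(n_perms):
        rng.shuffle(pooled)  # relabel: valid because labels were randomized
        t, c = pooled[:n_t], pooled[n_t:]
        diff = abs(sum(t) / len(t) - sum(c) / len(c))
        if diff >= observed:
            extreme += 1
    # Count the observed assignment itself (a standard convention).
    return (extreme + 1) / (n_perms + 1)

# Hypothetical outcomes from a small randomized comparison:
treated = [7.1, 6.8, 7.4, 7.9, 6.5]
control = [5.2, 5.9, 6.1, 5.4, 6.0]
print(permutation_pvalue(treated, control))
```

The justification comes entirely from the physical act of randomization, which is the point being pressed in the comment: no model of how the outcomes arose is needed beyond the design itself.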

          • Mark

            Along these same lines, I'm sure that you've noticed that most of the folks advocating for the lower alpha threshold (0.005) have a Bayesian bent. Also, here's an upcoming talk that I know you'd have an interest in: https://sph.unc.edu/files/2017/11/BIOS_Announcement_Frank_Harrell_11_16_2017.pdf

            • Mark: Yes, but I also notice they haven’t shown us we’d spot non-replications using their Bayesian approach. I wouldn’t be interested in Harrell’s talk at all, it’s just Bayesian cheerleading, even if he is an “expert”.

              • Mark

                Regarding “interest”, I was just referring to your funny tweet about the elevator…

                • Mark: yes, I meant it in earnest (that if we talked for a while, he’d agree with me), but I can see why the elevator exploded in laughter.

      • Mark: Yes, I miss Freedman, a man of integrity, though a curmudgeon. You are Mark___?

      • Mark: You say the Bayesians have won, but only a recent battle. In the larger war, people won’t go with methods that fail to control error. I don’t rule out that some Bayesian procedures can achieve this, in which case they become error statistical. It’s the supposition that we didn’t have to worry about selection effects, multiple testing, etc.–an attitude fostered by the Bayesian likelihood principle–that is largely responsible for irreplication. Having gobs of data, it turns out (unsurprisingly) is not enough.
        Today we basically have subjective Bayesianism and default “non-subjective” Bayesianism. Nearly everyone gives up on the former, but no one can decide between systems of default Bayesianism, nor tell us what the posteriors vouchsafe. We have empirical Bayesianism, which (following Lindley, Fraser and others) isn’t really Bayesian–just conditional probability with legitimate frequentist priors–rare for theories. There’s a somewhat related “diagnostic model” of inference where the priors are to be prevalences of true nulls in some population of research efforts–highly murky, and managing to confuse just about everyone regarding central notions like the probability of a type I error. Then there’s BFF–Bayesian, frequentist, fiducial fusions, which are still very much in development. They involve “confidence distributions” which are frequentist and are perhaps in sync with severe testing. There’s a Gelman-style falsificationist Bayesian, which I don’t claim to fully understand, but which he regards as error statistical.
        Anyway, the focus of many of these last groups is more on prediction than testing causal claims, but there are so many variants, I can’t say. I hope that my forthcoming book, “Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars” helps to illuminate things.

        • Mark

          I have been very much looking forward to your new book!

          • Mark: I’m impressed that you describe yourself as having a research interest in foundations of frequentist statistics.

            • Mark

              Ah, well, don’t be too impressed… It’s really only randomization that I’m interested in, if I’m being honest.

              You know, I have to tell you that I find it’s really pretty cool how many of my interests have been converging. I just noticed the other day that my first post to your blog was on 2/4/2012; as far as I recall, I just happened upon your website and was drawn in by the whole “frequentist in exile” thing, and of course I subsequently read your books. Prior to that, I’d been reading (and re-reading, and re-reading, etc.) Taleb since 2008; he initially turned me onto Popper, and I’ve since read most of Popper’s books, many multiple times. Frankly, reading Taleb (and Gary Taubes) has completely wrecked my life as a statistician…. And I’m completely grateful! But, then, way before reading Taleb, I was a huge fan of Stephen Senn. Although Senn would probably say I have a “randomization fetish” (I do, I freely admit it, whatever) I still find myself agreeing with about 95% of what he writes.

              So, anyway, thanks for doing what you do!

              • Mark: I’d never heard of a randomization fetish before–kinky. Frequentists in exile have other, related, error probability/power/stringency fetishes.
