However, it’s mandatory to adjust for selection effects, and Benjamini is one of the leaders in developing ways to carry out the adjustments. Even calling out the avenues for cherry-picking and multiple testing, long known to invalidate p-values, would make replication research more effective (and less open to criticism).
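For readers who want to see the mechanics of one such adjustment, here is a minimal sketch (in Python, with made-up p-values) of the Benjamini-Hochberg step-up procedure for controlling the false discovery rate. It is only an illustration of the kind of correction at issue, not the analysis from any study discussed below.

```python
# A minimal sketch of the Benjamini-Hochberg step-up procedure,
# controlling the false discovery rate at level q.
# The p-values used at the bottom are made up for illustration.
import numpy as np

def benjamini_hochberg(pvals, q=0.05):
    """Return a boolean mask of which hypotheses are rejected under BH at FDR level q."""
    p = np.asarray(pvals)
    m = len(p)
    order = np.argsort(p)                      # indices sorting p ascending
    ranked = p[order]
    thresholds = (np.arange(1, m + 1) / m) * q # BH bound: (k/m) * q for rank k
    below = ranked <= thresholds
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])       # largest rank meeting the bound
        reject[order[:k + 1]] = True           # reject all hypotheses up to that rank
    return reject

pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
print(benjamini_hochberg(pvals))  # only the two smallest survive at q = 0.05
```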
Readers of this blog will recall one of the controversial cases of failed replication in psychology: Simone Schnall’s 2008 paper on cleanliness and morality. In an earlier post, I quote from the Chronicle of Higher Education:
…. Ms. Schnall had 40 undergraduates unscramble some words. One group unscrambled words that suggested cleanliness (pure, immaculate, pristine), while the other group unscrambled neutral words. They were then presented with a number of moral dilemmas, like whether it’s cool to eat your dog after it gets run over by a car. Ms. Schnall wanted to discover whether prompting—or priming, in psych parlance—people with the concept of cleanliness would make them less judgmental.
My gripe has long been that the focus of replication research is on the "pure" statistics (do we get a low p-value or not?), largely sidestepping the often tenuous statistical-substantive links and measurement questions, as in this case. When the study failed to replicate, there was a lot of heated debate about the "fidelity" of the replication. Yoav Benjamini cuts to the chase by showing the devastating selection effects involved (on slide #18):
For the severe tester, the onus is on researchers to show explicitly that they could put to rest expected challenges about selection effects. Failure to do so suffices to render the finding poorly or weakly tested (i.e., it passes with low severity), overriding the initial claim to have found a non-spurious effect.
"Every research proposal and paper should have a replicability-check component," Benjamini recommends. Here are the full slides for your Saturday night perusal.
I've yet to hear anyone explain why unscrambling soap-related words should be a good proxy for "situated cognition" of cleanliness. But even without answering that thorny issue, identifying the biasing selection effects that were not taken into account vitiates the nominal low p-value. It is easy rather than difficult to find at least one such computed low p-value by selection alone, as the quick simulation below illustrates. I advocate going further, where possible, and falsifying the claim that the statistical correlation is a good test of the hypothesis.
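A quick sketch, with made-up numbers: the simulation below runs 20 two-sample t-tests per "study" where every null hypothesis is true, and counts how often at least one nominally significant p-value turns up anyway. It illustrates the selection problem in general, not the Schnall study in particular.

```python
# Simulating selection: with enough null comparisons, a "significant"
# p-value almost always turns up. All numbers here are illustrative.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
m, n, trials = 20, 30, 10_000      # 20 null comparisons per simulated "study"
hits = 0
for _ in range(trials):
    # 20 two-sample t-tests in which the null is true in every one
    x = rng.standard_normal((m, n))
    y = rng.standard_normal((m, n))
    p = stats.ttest_ind(x, y, axis=1).pvalue
    hits += (p.min() <= 0.05)      # did at least one test come out "significant"?
print(hits / trials)               # roughly 1 - 0.95**20, i.e., about 0.64
```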
 For some related posts, with links to other blogs, check out:
Some ironies in the replication crisis in social psychology.
A new front in the statistics wars: peaceful negotiation in the face of so-called methodological terrorism
For statistical transparency: reveal multiplicity and/or falsify the test: remark on Gelman and colleagues
The problem underscores the need for a statistical account whose assessments are directly altered by biasing selection effects, as the p-value is.
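To make the alteration concrete (with illustrative numbers, not from any study discussed here): if the smallest of m independent null p-values is selectively reported, the probability of getting a result at least that small somewhere among the m tests is 1 − (1 − p_min)^m. A reported p_min = 0.01 found by searching over m = 20 comparisons thus corresponds to an actual error probability of roughly 1 − (0.99)^20 ≈ 0.18, not 0.01.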
In the world of (environmental) epidemiology, there are often many questions at issue, and the reader is usually left to figure out how many. Essentially never do the researchers adjust for multiple testing or multiple modeling, and usually the authors do not provide their data set or their analysis code. In such situations, everyone should assume that the paper does not exist. The editor should reject such papers or require the words "exploratory research" in the title.