However, it’s mandatory to adjust for selection effects, and Benjamini is one of the leaders in developing ways to carry out the adjustments. Even calling out the avenues for cherry-picking and multiple testing, long known to invalidate p-values, would make replication research more effective (and less open to criticism).
Readers of this blog will recall one of the controversial cases of failed replication in psychology: Simone Schnall’s 2008 paper on cleanliness and morality. In an earlier post, I quote from the Chronicle of Higher Education:
…. Ms. Schnall had 40 undergraduates unscramble some words. One group unscrambled words that suggested cleanliness (pure, immaculate, pristine), while the other group unscrambled neutral words. They were then presented with a number of moral dilemmas, like whether it’s cool to eat your dog after it gets run over by a car. Ms. Schnall wanted to discover whether prompting—or priming, in psych parlance—people with the concept of cleanliness would make them less judgmental.
My gripe has long been that the focus of replication research is on the “pure” statistics (do we get a low p-value or not?), largely sidestepping the often tenuous statistical-substantive links and measurement questions, as in this case. When the study failed to replicate, there was a lot of heated debate about the “fidelity” of the replication. Yoav Benjamini cuts to the chase by showing the devastating selection effects involved (on slide #18):
For the severe tester, the onus is on researchers to show explicitly that they can put to rest expected challenges about selection effects. Failure to do so suffices to render the finding poorly or weakly tested (i.e., it passes with low severity), overriding the initial claim to have found a non-spurious effect.
“Every research proposal and paper should have a replicability-check component,” Benjamini recommends. Here are the full slides for your Saturday night perusal.
I’ve yet to hear anyone explain why unscrambling soap-related words should be a good proxy for “situated cognition” of cleanliness. But even without answering that thorny issue, identifying the biasing selection effects that were not taken into account vitiates the nominal low p-value. It is easy rather than difficult to find at least one such computed low p-value by selection alone. I advocate going further, where possible, and falsifying the claim that the statistical correlation is a good test of the hypothesis.
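To see how easy it is to obtain at least one low p-value by selection alone, here is a minimal simulation sketch (mine, not from Benjamini's slides; the number of outcomes and sample sizes are illustrative assumptions): generate data with no true effect for 20 outcome measures, t-test each, and report only the smallest p-value.

```python
# Selection-effect sketch (illustrative assumptions, not the Schnall design):
# 20 outcome measures, two groups of 40, NO true effect anywhere.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_outcomes, n_per_group = 20, 40

p_values = []
for _ in range(n_outcomes):
    # Both groups drawn from the same distribution: the null is true.
    treated = rng.normal(0.0, 1.0, n_per_group)
    control = rng.normal(0.0, 1.0, n_per_group)
    p_values.append(stats.ttest_ind(treated, control).pvalue)

# Reporting only the smallest p-value is the cherry-pick.
print(f"smallest of {n_outcomes} null p-values: {min(p_values):.3f}")

# Analytically, with 20 independent null tests the chance that at least
# one p-value falls below .05 is 1 - 0.95**20, about 0.64.
print(f"P(at least one p < .05) = {1 - 0.95**n_outcomes:.2f}")
```

The nominal p-value for the selected test ignores the search; the actual probability of finding some "significant" result under the null is closer to two-thirds than to 0.05, which is why unreported selection vitiates the reported p-value.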
The problem underscores the need for a statistical account whose assessments are directly altered by biasing selection effects, as the p-value is.

For some related posts, with links to other blogs, check out: