
I had been invited to speak at a Royal Society meeting, held March 4 and 5, 2024, on “the promises and pitfalls of preregistration”—a topic in which I’m keenly interested. The meeting was organized by Dr Tom Hardwicke, Professor Marcus Munafò, Dr Sophia Crüwell, Professor Dorothy Bishop FRS FMedSci, and Professor Eric-Jan Wagenmakers. Unfortunately, I was unable to travel to London, so a few months ago I had to decline. But I thought I might jot down some remarks here.
The flyer defines preregistration as “publicly declaring study plans before collecting or analyzing data”. I regard it as a welcome consequence of today’s statistical crisis of replication that some social sciences are taking a page from medical trials and calling for preregistration of sampling protocols and full reporting.[1] A major source of failed replications is the ease of obtaining impressive-looking findings by data-dredging, multiple testing, outcome-switching, cherry-picking, optional stopping, and a host of related “selection effects”. Such gambits may practically guarantee an impressive-looking effect, even if it is spurious. The inferred effect, H, agrees with the data, but the test H has passed lacks stringency or severity. Such agreements, Popper (1983) might say, are “too cheap to be worth having”. Little if anything has been done to avoid the key flaw of concern: being fooled by chance. The key promise of preregistration of protocols (along with full reporting and replication checks) is to help block the biasing selection effects known to permit insevere tests.
Pejorative vs non-pejorative cases.
However, there are also cases of data dredging and multiple testing that satisfy severe testing requirements. Consider searching a full database for a match with a criminal’s DNA, where we suppose the probabilities of false negatives and false positives are both extremely low. Since the probability of a mismatch with person i is very high if i is not the criminal (and a nonmatch virtually excludes the person), a match with i warrants inferring that i is the criminal. Unlike examples of pejorative searching, where the concern is mistakenly inferring that an effect is genuine, here there is a known effect or specific event, the criminal’s DNA, and stringent procedures are used to track down its source. Ruling out random chance is very different from explaining a known effect.
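To make the point concrete, here is a back-of-the-envelope sketch with made-up error rates and database size (real forensic figures would differ):

```python
# Back-of-the-envelope sketch with made-up numbers (not actual forensic rates):
# how probable is it that a database-wide search turns up a match by chance
# alone, i.e., with someone who is not in fact the source of the DNA?
false_match_rate = 1e-9      # assumed P(match with person i ; i is not the source)
database_size = 10_000_000   # assumed number of profiles searched

p_some_false_match = 1 - (1 - false_match_rate) ** database_size
print(f"P(at least one false match in the database) ≈ {p_some_false_match:.4f}")
# ≈ 0.01: even granting the search its full "multiplicity", a reported match is
# very improbable unless the matched person really is the source, which is why
# the inference can pass with severity despite the dredging.
```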
Nor need data-dredged hypotheses be prespecified to be tested with severity by data—even where those data were “used” to arrive at or select the hypothesis inferred. An example that is actually quite routine (although it sparked some controversy) comes from the analysis of the 1919 eclipse data. The same data were used to arrive at, as well as to test, the source of one set of blurred eclipse data during the tests of the Einstein deflection effect. The culprit, a telescope mirror distorted by the sun’s heat, was not predesignated, but it was severely tested (i.e., it passed with severity).
This reminds me to make a remark on language: to say that H was severely tested means that H passed a test with (high) severity. That is, H was subjected to, and passed, a test that it probably would have failed, were H false.
It is the severity of a test, or lack of it, that distinguishes whether a data-dredged claim is warranted by data.
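For readers who want the requirement in symbols, here is one rough way to put it (the notation is merely illustrative, not a formal definition):

```latex
% One rough rendering of the severity requirement (illustrative notation only):
% H passes test T severely with data x0 when
%  (S-1) x0 accords with H, and
%  (S-2) with high probability, T would have yielded a result according less
%        well with H than x0 does, if H were false.
\[
\mathrm{SEV}(T, x_0, H) \;=\;
\Pr\!\big(\, T \text{ yields a result according less well with } H
\text{ than } x_0 \text{ does} \;;\; H \text{ false} \,\big),
\]
and $H$ passes severely just to the extent that this probability is high.
```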
Had I been fortunate enough to fulfill the high honor of being invited to speak at this RS meeting, I think I would have started with these points. I would be keen to distinguish right off pejorative from non-pejorative data dredging. That is because one of the alleged “pitfalls” of preregistration is that it can discourage the great discoveries of science. Didn’t all these great scientists reach important discoveries by trenchantly exploring the data for patterns? Sure, but whether they stringently tested those patterns is a distinct question. (Note too that when data-driven discoveries are tested on brand new data, it is not considered data-dredging. The new data were not used in arriving at the hypothesis being scrutinized.) Moreover, frequentist error statisticians are sometimes wrongly criticized for requiring adjustments for selection in cases where no adjustment is needed or makes sense. The case of DNA matching is an example.[2] So I would want to clear the air right away that calls for preregistration to block data-dredging—where they matter—are calls for severe testing, notably where there’s an interest in avoiding being fooled by random chance.
The fact that there are ways to compensate for selection effects does not detract from preregistration’s legitimate rationale. Quite the opposite: it only underscores the importance of recognizing how selection effects can alter the reliability of inferences. Needing (or striving) to compensate entails that they cannot just be ignored. With today’s use of big data and AI/machine learning methods, “post-data selection” is a research area of its own.[3]
Note that multiplicity and “trying and trying again” can refer to dredging for a hypothesis that accords with given data, or to trying and trying again to find data that accord with a fixed hypothesis. (It can refer to other things besides.) There can be just as much latitude for bias in selecting what is reported as “the data” for testing a predesignated hypothesis as there is in selecting a hypothesis that agrees with data.
Biasing selection effects.
In my view, pejorative selection or biasing selection effects occur when data or hypotheses are selected, generated or interpreted in such a way as to result in failing the severity requirement.
Under failing the severity requirement I include not only reporting as severely tested a claim that actually passes with very low severity, but also the inability to assess, even approximately, how severely a claim has been tested. For example, the FDA allows adjusting for selection only among prespecified hypotheses or endpoints. Open-ended data torturing precludes such adjustments.[4]
Prior to the adoption of preregistered endpoints, the FDA reports, it was not uncommon for a clinical trialist who failed to find a statistically significant treatment benefit on the prespecified endpoints to ransack the unblinded data, cherry-pick a subgroup in which treateds did better in some respect than controls, and report it as evidence of treatment benefit. Drawing a line around treateds who happen to show some beneficial effect is akin to the Texas Marksman circling a cluster of closely placed bullet holes and regarding it as evidence of his marksmanship. The accordance between data and hypothesis is due to the biasing selection effects, not the truth of the hypotheses (about the drug’s benefit or his marksmanship).
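A toy simulation, with entirely made-up numbers, shows how easily such post-hoc subgroup hunting manufactures an apparent benefit when the treatment does nothing at all:

```python
# Toy simulation (hypothetical settings, not any actual trial): the treatment
# has zero true effect, but the analyst is free to scan 20 post-hoc subgroups
# of the unblinded data and report any one where treateds look better at p < 0.05.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_trials, n_per_arm, n_subgroups = 2000, 200, 20

false_positive_trials = 0
for _ in range(n_trials):
    treated = rng.normal(size=n_per_arm)                      # no true benefit
    control = rng.normal(size=n_per_arm)
    g_treated = rng.integers(0, n_subgroups, size=n_per_arm)  # arbitrary subgroup labels
    g_control = rng.integers(0, n_subgroups, size=n_per_arm)
    for g in range(n_subgroups):
        a, b = treated[g_treated == g], control[g_control == g]
        if len(a) < 2 or len(b) < 2:
            continue
        t, p = stats.ttest_ind(a, b)
        if t > 0 and p < 0.05:          # "treateds do better" in this slice
            false_positive_trials += 1
            break

print(f"Null trials reporting a 'significant' subgroup benefit: "
      f"{false_positive_trials / n_trials:.0%}")   # roughly 40%, far above the nominal 5%
```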
Preregistration and error probabilities.
An adequate statistical account must be able to pick up on how biasing selection effects alter a method’s capabilities to assess and control erroneous interpretations of data. Moreover, these altered capabilities must show up in the evidential assessment—even if only to declare the inference unwarranted due to data torturing. They show up in a severity assessment of the inference, whether quantitative, semi-quantitative, or qualitative. In a formal statistical context, such assessments may be provided by using the relevant sampling distribution.
Consider what the critical reader of a preregistered report might do, whether pre-data or post-data. She looks, in effect, at the number of chances the researchers give themselves to find apparent effects by chance alone. Even without a formal error-probability computation, she asks: what’s the probability that one or another hypothesis, stopping point, choice of grouping variable, way of interpreting a measurement, and so on, could lead (or could have led) to a false positive? If it’s fairly high, she denies there is evidence for the effect. The onus is on those claiming to provide evidence to show they have worked to avoid known traps that blow up the probability of false positives and make it all too easy to mistake chance variability for a real effect. Thus, the rationale for preregistration goes hand in hand with that of controlling error probabilities. There is a tension, therefore, between popular calls for preregistration and statistical accounts that downplay error probabilities. That’s what I would talk about next.
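In the simplest case, treating the chances as k roughly independent tests at level α, her informal reckoning corresponds to the familiar computation below (numbers purely illustrative):

```python
# Illustrative only: with k roughly independent chances to declare an effect at
# level alpha, the probability that at least one comes up "significant" by
# chance alone is 1 - (1 - alpha)**k.
alpha = 0.05
for k in (1, 5, 10, 20):
    print(f"k = {k:2d} chances -> P(at least one false positive) = {1 - (1 - alpha) ** k:.2f}")
# k =  1 -> 0.05, k =  5 -> 0.23, k = 10 -> 0.40, k = 20 -> 0.64
```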
To be continued. Please share your thoughts in the comments. If I find any of the contributors’ talks are available, I will link to them.
I will update and make corrections, indicating the version.
***
[1] Anyone who has ever read this blog knows I’ve talked quite a lot about this topic over many years (e.g., Mayo 2018, CUP), and have exchanged ideas on the topic with many others. I will refrain from references here, but please search the blog. Links to most of my published papers are at https://errorstatistics.com/mayo-publications/.
[2] Dawid’s (2000) comment on Lindley is here.
[3] AI/ML prediction models might compensate for using the “same” data by cross-validation and data splitting, at least with IID data; but these fields also face reproducibility and replicability crises.
[4] Usually the primary endpoint must be found statistically significant before secondary endpoints are considered.



Deborah:
I spoke at that meeting! I also was not able to be there in person, so I gave a remote talk, which as we all know is not as good as live. Also, I didn’t have the opportunity to participate in discussion, so I can’t really say what the meeting was like. My general feelings about preregistration are similar to Jessica’s in <a href="https://statmodeling.stat.columbia.edu/2023/12/04/modest-pre-registration/">this post</a>. One of my problems with discussions of preregistration is that they can distract from discussion of design and measurement, a topic that I discuss <a href="http://www.stat.columbia.edu/~gelman/research/published/jcp.pdf">here</a>. And <a href="https://statmodeling.stat.columbia.edu/2023/12/21/statistical-practice-as-scientific-exploration-my-talk-on-4-mar-2024-at-the-royal-society-conference-on-the-promises-and-pitfalls-of-preregistration/">here’s the abstract</a> for my talk at that London conference. (My actual talk took a different direction.)
Andrew:
Thanks for your comment. I found it in spam, but wasn’t notified. I’d be interested to see the notes or slides you used for your talk.
I’m sorry that I didn’t have the chance to speak online. I did write to Hardwicke, but maybe it was too late. I just assumed it had to be in person, so when I found I couldn’t go to London, I presumed I was cut. I had another talk that I wound up giving online around the same time.
I’d like to know what you think of my continuation of this because it deals with a new issue.
Mayo
First, I doubt there is evidence, or that it is true, that “The main sources of failed replication are data-dredging, multiple testing, outcome-switching, cherry-picking, optional stopping and a host of related ‘biasing selection effects’.”
Here is my alternative conjecture: the main sources of failed replication are 1. variations in the relevant sample sources; 2. making a conclusion after failing to search for or consider alternative explanations (i.e., failure to data dredge); 3. fragility of effects; 4. failures of relevant balancing of study groups; 5. unrecognized variations in study design, e.g., implicitly conditioning on or controlling variables in one study but not in another; 6. rearranging priors after examining the data and then doing a Bayesian calculation.
Just imagine if Kepler or Cannizzaro or Darwin had to pre-register. And if they had (say Kepler had preregistered that he would fit and test circular orbits for the planets), what would he have done with his data after that hypothesis failed?
I think the methodological wisdom (not the mathematics) from mainline statistics is often bogus.
Clark:
Sorry you had trouble posting.
I had you in mind when emphasizing the objections people raise to preregistration by reference to all these great scientists… It is also why I distinguish pejorative from non-pejorative data dredging. It is possible to dredge and severely test, but that’s not what happens in the majority of non-replications. The researchers either admit or leave a trail showing that they tried and tried again, so it’s unsurprising that, when the protocol is tied down, the random feature goes another way. Searching for alternative explanations is not data dredging; it’s testing. Pejorative data dredging involves trying and trying again when the experiment doesn’t work as wished, and selective reporting. Of course there are many weaknesses that enter whenever one is doing error-prone inference, but it’s the biasing selection effects that enable the researcher to avoid exposing the other frailties you list. Exploiting a host of researcher flexibilities enabled them to arrive at an impressive-looking but spurious effect, rather than exposing their foibles. I include #6 (data-dependent priors) under pejorative selection effects.