
In my last post, I sketched some first remarks I would have made had I been able to travel to London to fulfill my invitation to speak at a Royal Society conference, March 4 and 5, 2024, on “the promises and pitfalls of preregistration.” This is a continuation. It’s a welcome consequence of today’s statistical crisis of replication that some social sciences are taking a page from medical trials and calling for preregistration of sampling protocols and full reporting. In 2018, Brian Nosek and others wrote of the “Preregistration Revolution”, as part of open science initiatives.
The main sources of failed replication are not mysterious: data-dredging, multiple testing, outcome-switching, cherry-picking, optional stopping and a host of related “biasing selection effects” can practically guarantee an impressive-looking effect, even if it is spurious. The inferred effect, H, agrees with the data, but the test H has passed lacks stringency or severity. However, as I noted, I would be keen to distinguish right off pejorative from non-pejorative data dredging, because one of the main arguments I hear against preregistration sets sail by presenting us with cases where trenchant searching is a mark of good science and the route to discovery. One need not always get new data to severely rule out errors of relevance (although what counts as “new” is often unclear). Think of trying and trying again to find a DNA match, the source of a faster-than-light anomaly,[1] or the location of your keys (always the last place you look!). Calls for preregistration to block data-dredging—where they matter—are calls for severe testing, notably where the goal is avoiding being fooled by random chance. Since statistical significance tests are key tools for this end, preregistration (or, in some cases, compensation by error probability adjustments) is valued to avoid systematically misleading results (violating David Cox’s weak repeated sampling principle). In medicine and other fields, perverse incentives to generate and present data so as to selectively advantage desirable inferences have led to elaborate protocols “on best practice in trial reporting, which are endorsed by 585 academic journals” (Goldacre et al., 2019, p. 2), as well as methods for p-value adjustments (Benjamini, 2020).
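To see why biasing selection effects can “practically guarantee” a finding, a toy calculation (my numbers, purely illustrative) suffices: if a researcher tries 20 independent hypotheses, each tested at the 0.05 level, the probability of at least one nominally significant result when every null hypothesis is true is 1 − (0.95)^20 ≈ 0.64; with 50 tries it exceeds 0.92. Reining in this kind of inflation is what the reporting protocols and p-value adjustments are for.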
Now I enter the second portion of my thoughts on the topic.
However, there are rivals to error statistical methods that hold principles of evidence on which the import of the evidence is unaltered by alterations to error probabilities. On one such principle, all of the evidence is contained in likelihood ratios (LR) of hypotheses:
Pr(x0;H1)/Pr(x0;H0)
where Pr(x0;H1) is the probability of x0 under hypothesis H1, and Pr(x0;H0) the probability of x0 under hypothesis H0. This is often called the (strong) likelihood principle.[2]
There is a lot of confusion about likelihoods. With likelihoods, the data x0 are fixed and the hypotheses vary. Often, “likelihood” is used interchangeably with “probability”, but this leads to trouble when we’re keen to talk about the formal concept of likelihood. A hypothesis that perfectly fits (entails) the data has a likelihood equal to 1; so, since many rival hypotheses can each fit x0 perfectly, their likelihoods can sum to well over 1, and it’s clear that likelihoods do not obey the probability calculus.[3]
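A minimal numerical sketch (my own toy example, with made-up numbers) may help fix the point: hold the data fixed, scan across rival hypotheses, and the likelihoods need not sum to anything like 1.

```python
# Toy illustration (mine, not from the post): with the data fixed,
# the likelihoods of rival hypotheses need not sum to 1.
import numpy as np
from scipy.stats import binom

x_obs, n = 7, 10                      # hypothetical data: 7 successes in 10 trials
thetas = np.linspace(0.01, 0.99, 99)  # a grid of rival hypotheses about theta

likelihoods = binom.pmf(x_obs, n, thetas)  # Pr(x_obs; theta) for each rival
print(likelihoods.max())   # ~0.267, at theta = 0.7 (the best-fitting rival)
print(likelihoods.sum())   # ~9, far exceeding 1: likelihoods are not
                           # probabilities of hypotheses
```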
Of pertinence to the issue of preregistration, if all the evidence is in the likelihood ratio, then error probabilities drop out once the data are in hand. As the subjective Bayesian Dennis Lindley observed long ago:
Sampling distributions, significance levels, power, all depend on something more [than the likelihood function]–something that is irrelevant in Bayesian inference—namely the sample space. (Lindley 1971, 436)
What he means is that once the data are observed, other outcomes that could have occurred are irrelevant. Yet error probabilities consider outcomes other than x0. The error statistician cannot assess the evidence in x0 without knowing how the method of sampling would have behaved under different outcomes (i.e., the sampling distribution). Ignoring this error probabilistic behavior, she knows she can be fooled. Finding that the probability of x0 under hypothesis H0 is low, one can easily construct an alternative H1 that fits the data swimmingly in order to get a high likelihood ratio in favor of H1. As statistician George Barnard puts it, “there always is such a rival hypothesis viz., that things just had to turn out the way they actually did” (Barnard 1972, p. 129). The probability of finding some better fitting alternative or other can be high, or guaranteed, even when H0 correctly describes the data generation. This probability enters in assessing how severely an inferred claim has been probed.
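Barnard’s point can be made concrete with a small simulation (my own construction, not from Barnard or the post): generate data from a Normal model with H0: μ = 0 true, then let the “discovered” rival be H1: μ equal to the observed sample mean. The likelihood ratio always favors the dredged rival, and it crosses conventional “evidence” thresholds a sizable fraction of the time even though H0 is true.

```python
# Toy simulation (mine, purely illustrative) of Barnard's point: post-data,
# a rival that fits better than H0 can always be found.
import numpy as np

rng = np.random.default_rng(1)
sigma, n, reps = 1.0, 25, 100_000

favors_rival = 0
exceeds_3 = 0
for _ in range(reps):
    x = rng.normal(0.0, sigma, n)   # data generated with H0 (mu = 0) true
    mu_hat = x.mean()               # the "discovered" rival H1: mu = mu_hat
    # Likelihood ratio of H1 over H0 simplifies to exp(n * mu_hat^2 / (2 * sigma^2))
    lr = np.exp(n * mu_hat**2 / (2 * sigma**2))
    favors_rival += lr >= 1         # always holds: the dredged rival never loses
    exceeds_3 += lr > 3             # crosses a "moderate evidence" threshold

print(favors_rival / reps)  # 1.0
print(exceeds_3 / reps)     # around 0.14, even though H0 is true throughout
```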
Preregistration and controversies about error probabilities
A classic problematic example is when a researcher, failing to find a benefit in a (double-blind) randomized controlled trial of a medical treatment, searches the unblinded data until finding a subgroup where those treated do better than the controls in some way or other. They might search for patterns among patients with different characteristics (age, sex, employment, education, medical conditions, etc.), and next try different proxy variables to use in measuring benefit (“outcome switching”). Let H1PD be the “post data” hypothesis arrived at from the subgroup search and outcome tinkering. The data x support H1PD better than H0PD, the hypothesis that there is zero benefit. We know this because the subgroup has been deliberately selected so that those with the treatment do better than the untreated group by a chosen amount. The likelihood ratio or Bayes factor is
Pr(x|H1PD)/Pr(x|H0PD).
This way of proceeding has a high probability of issuing in a report of drug benefit H1 (in some subgroup or other), even if no benefit exists (i.e., even if the null or test hypothesis H0 is true). The researcher has drawn the line around the post-data subgroup just as the Texas sharpshooter draws a target around the bullet holes. It is even worse if the researcher reports this as a result of the original double-blind trial!
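To put a rough number on that “high probability”, here is a simulation sketch of my own (an entirely hypothetical setup: 200 patients per arm, six arbitrary binary baseline characteristics, a one-sided test of benefit within each resulting subgroup). It counts how often a trial with no treatment effect anywhere yields a nominally significant “benefit” in some subgroup or other.

```python
# Rough sketch (my construction, not from the post) of post hoc subgroup dredging
# in a null trial: no effect exists, yet "benefit" is frequently found somewhere.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(7)
n_per_arm, reps, n_covariates = 200, 2000, 6

found_benefit = 0
for _ in range(reps):
    treat = rng.normal(0, 1, n_per_arm)   # outcomes: treatment has no effect,
    ctrl = rng.normal(0, 1, n_per_arm)    # both arms share the same distribution
    # Arbitrary binary baseline covariates (age group, sex, etc.), unrelated to outcome
    cov_t = rng.integers(0, 2, (n_covariates, n_per_arm))
    cov_c = rng.integers(0, 2, (n_covariates, n_per_arm))
    pvals = []
    for j in range(n_covariates):
        for level in (0, 1):
            t_sub = treat[cov_t[j] == level]
            c_sub = ctrl[cov_c[j] == level]
            # One-sided test of "treated do better" within this subgroup
            pvals.append(ttest_ind(t_sub, c_sub, alternative='greater').pvalue)
    if min(pvals) < 0.05:                 # report the best-looking subgroup
        found_benefit += 1

print(found_benefit / reps)   # several times the nominal 0.05 rate,
                              # despite zero benefit in every simulated trial
```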
Nevertheless, if all of the evidence for a statistical inference is in the likelihood ratio, then alterations to error probabilities do not alter the import of the evidence. Thus, there is a tension between popular calls for preregistration—arguably, one of the most promising ways to boost replication—and accounts that downplay error probabilities: Bayes Factors, Bayesian posteriors, likelihood ratios.
Bayesian analysis does not base decisions on error control. Indeed, Bayesian analysis does not use sampling distributions. …As Bayesian analysis ignores counterfactual error rates, it cannot control them. (Kruschke and Liddell 2017, 13, 15)
So the long-standing controversies about error probabilities, I would expect, would be central at a conference on preregistration in statistics. I am interested to learn whether they were, and what was said.
The constructive upshot of the replication crisis–for most
(Frequentist) error statistical methods are often put on the defensive as to why control of error probabilities matters to the inference at hand. Error control, according to some statistical schools, is only of concern to ensure low error rates in some long-run series of applications. Members of these schools say: we are happy to have methods with good long-run “operating properties” at the design stage, but once the data are in hand, error probabilities drop out. It should be clear from the replication crisis that what bothers us about pejorative data dredging is not a worry about long runs–even though such dredging does damage the reliability of performance. It is that a poor job has been done, in the case at hand, of distinguishing genuine from spurious effects. Little has been done to mitigate or prevent known ways to blow up the probability of false positive results.
By and large, the replication crisis has had the constructive upshot of raising the consciousness of researchers. We have replication research and, as with this conference, a focus on preregistration and registered reports. Well-known statistical critics from psychology, Joseph Simmons, Leif Nelson, and Uri Simonsohn, place at the top of their list of requirements the need to block flexible or “optional” stopping: “Researchers often decide when to stop data collection on the basis of interim data analysis … many believe this practice exerts no more than a trivial influence on false-positive rates” (Simmons et al. 2011, p. 1361). “Contradicting this intuition,” they show that the probability of erroneous rejections balloons.
Consider the often-discussed example of optional stopping in two-sided testing of a zero versus a non-zero Normal mean (H0: μ = μ0 vs. H1: μ ≠ μ0), with σ known: “[I]f an experimenter uses this procedure, then with probability 1 he will eventually reject any sharp null hypothesis, even though it be true” (Edwards, Lindman, and Savage 1963, 239). While both Bayesians and non-Bayesians employ such adaptive or sequential trials, the error statistician must take account of, and perhaps adjust for, the stopping rule. But Edwards, Lindman, and Savage aver that “the import of the sequence of n data actually observed will be exactly the same as it would be had you planned to take exactly n observations in the first place” (ibid., pp. 238-9).
The likelihood principle emphasized in Bayesian statistics implies, among other things, that the rules governing when data collection stops are irrelevant to data interpretation. (Edwards, Lindman, and Savage 1963, p. 193)
[The same stopping procedure can be used to ensure that μ0 is always excluded from the corresponding confidence interval.] Authors of the Likelihood Principle, Jim Berger and Robert Wolpert, remark: “It seems very strange that a frequentist could not analyze a given set of data…if the stopping rule is not given….Data should be able to speak for itself” (Berger and Wolpert 1988, 78).
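For readers who want to see the ballooning Simmons et al. describe, here is a quick simulation sketch (my own, with arbitrary settings: peeking after every observation from n = 10 up to n = 500, stopping at the first nominally significant result):

```python
# Quick sketch (my own settings) of optional stopping: keep testing H0: mu = 0
# after each new observation and stop at the first |z| > 1.96.
import numpy as np

rng = np.random.default_rng(123)
reps, n_max, sigma = 5000, 500, 1.0

rejections = 0
for _ in range(reps):
    x = rng.normal(0.0, sigma, n_max)              # H0 is true throughout
    n = np.arange(1, n_max + 1)
    z = (np.cumsum(x) / n) * np.sqrt(n) / sigma    # z statistic at each sample size
    if np.any(np.abs(z[9:]) > 1.96):               # begin peeking at n = 10
        rejections += 1                            # "significant" at some interim look

print(rejections / reps)   # roughly 0.3 or more, versus the nominal 0.05;
                           # with no cap on n it tends to 1, as Edwards et al. say
```

On the error statistical view, the 1.96 cutoff must be raised (or an alpha-spending plan preregistered) if the overall error rate is to stay near 0.05; on the likelihood principle view, nothing in the evidential report changes.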
The question that arises is this: if a method’s error probabilities do not enter in appraising evidence, it is unclear how to use registered reports of what would alter error probabilities in scrutinizing an inference. (Recall my description in the previous post of what the critical reader of a preregistered report might consider.) How would a Bayesian, for example, assuming they accept the likelihood principle, do so?[4] They can, of course, report that violations of preregistration are problematic for any reported p-values (or other error probability notions: type I and II errors, confidence levels). They might go on to show, or purport to show, that this would not be problematic for their preferred account. If so, does it follow that we can skip the preregistration revolution and adopt their methods? I assume the preregistration conference took advantage of the opportunity to discuss this. (I will report back once I find out.)
Severity on the meta-level
Here things become especially tricky. If one purports to show that biasing selection effects make no difference to a given methodology, it will not do to employ an analysis that has no antenna for picking up on how error probabilities are altered by biasing selection effects. Right? It wouldn’t suffice to declare, for example, that wearing likelihood principle glasses, the likelihood ratio or Bayes factor is insensitive to biasing selection effects that alter error rates. After all, the severity principle also applies on the meta-level:
Severity requirement (minimal). If an inquiry had little or no capability of unearthing the falsity of a claim inferred, then the claim is unwarranted by the data from that inquiry.
As always, the severity assessment naturally takes into account the particular error of inference that is relevant in the context of the inference or claim. Since meta-statistics should be one level removed, one needs to remove the likelihood principle glasses and ask what the effects of biasing selection effects would be on an error statistical account of evidence. This is not always done, however, and some might assume that they need not worry about predesignation if they don’t compute p-values. But they do have to worry (at least in the contexts I am identifying as pejorative selection effects).
Now, error statistical assessments of rivals to error statistical methods have often been done, with varying degrees of success. They are routinely carried out by regulatory agencies in examining Bayesian adaptive or sequential trials in exploratory inquiry.[5] Bayesian trialists in radiation oncology, Ryan et al. (2020), give an interesting objection to such frequentist scrutiny. While they admit that “the type I error was inflated” in their Bayesian adaptive designs, they argue:
The requirement of type I error control for Bayesian adaptive designs causes them to lose many of their philosophical advantages, such as compliance with the likelihood principle, and creates a design that is inherently frequentist. (Ryan et al., 2020, p. 7)
I think it is a hybrid Bayesian-frequentist assessment. Regulatory agencies perform a kind of hybrid Bayesian-frequentist computation of a type I error probability by “determining how frequently the Bayesian design incorrectly declares a treatment to be effective or superior when it is assumed that there is truly no difference” (ibid., p. 3). The idea is to simulate many thousands of trials and find the proportion that result in assigning a posterior probability of .9 (or some other threshold) to H, the claim that a treatment is effective, when H is false. To correct for the raised frequency of reaching such thresholds in sequential trials, researchers are required to raise the thresholds.
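A bare-bones version of such a simulation might look as follows (my own simplification, with a hypothetical schedule of four interim looks and a conjugate normal prior; actual regulatory submissions are far more elaborate):

```python
# Bare-bones sketch (mine, simplified): how often does a Bayesian sequential
# design declare benefit (posterior >= 0.9 that the effect is positive)
# when there is truly no effect?
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2024)
reps, sigma = 10_000, 1.0
looks = [25, 50, 75, 100]        # interim analyses (hypothetical schedule)
prior_mean, prior_sd = 0.0, 1.0  # a vague normal prior on the effect size mu

declared_effective = 0
for _ in range(reps):
    x = rng.normal(0.0, sigma, looks[-1])          # truth: no treatment effect
    for n in looks:
        xbar = x[:n].mean()
        # Conjugate normal-normal update for the effect size mu
        post_var = 1 / (1 / prior_sd**2 + n / sigma**2)
        post_mean = post_var * (prior_mean / prior_sd**2 + n * xbar / sigma**2)
        if 1 - norm.cdf(0, post_mean, np.sqrt(post_var)) >= 0.9:
            declared_effective += 1                # posterior Pr(mu > 0) crosses 0.9
            break

print(declared_effective / reps)   # the simulated "type I error" of the design
```

Raising the 0.9 threshold until the printed proportion falls to, say, 0.05 is the kind of correction described above.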
To be clear, I am not saying error statistical methods could be replaced by such attempted calibrations, even if successful. They cannot. My point is only that an inquiry keen to generalize and to distinguish real from spurious effects should not feel free to ignore preregistration protocols in testing, even if it ditches p-values–at least not until the consequences have been severely checked, wearing error statistical spectacles. This brings me to a closing remark about the links between the general debate about evidence in statistics and in philosophy.
A word about the philosophical advantages
The reason the likelihood principle is viewed as having philosophical imprimatur connects the controversy about preregistration to a controversy in philosophy between Popper’s account of falsification and confirmation theories. To logical empiricist philosophers, the holy grail was to find a logic of inductive inference that is purely formal in the same sense as deductive logic. All you would need to consider were the statements of evidence and hypotheses to arrive at inferences. Evidence “e” was the unproblematic, empirical, ground statement; confirmation logics defined various C-functions between e and h: C(h, e).[6] Popper challenged the idea that data were given, recognizing that we always observe through the lens of a perspective, an interest, a theory, a framework.[7] Nowadays, few hold to the idea that “data” are value-free, offering a relatively unproblematic basis for inference, but the ideal has not lost its attractiveness in many quarters.
I will undoubtedly return to make corrections to this continuation. If I do, I will indicate the version # in the title.
References can be found (with links) on the “Captain’s Biblio” here.
You can find all of the 16 “tours” in my book, Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (CUP 2018) on this blogpost. (It includes the entire manuscript in “proof” form, which is quite readable.)
All topics are also discussed on this blog, with very useful comments from readers.
________
[1] The OPERA experiment (link).
[2] If you are interested, you can search this blog for an enormous amount of material on the likelihood principle.
[3] The law of likelihood (Hacking 1965), which is weaker than the likelihood principle, says that x0 comparatively favors H1 over H0 if Pr(x0;H1) exceeds Pr(x0;H0). Hacking rejected it in 1980, declaring that “there is no such thing as a logic of induction”.
[4] Objective or conventional Bayesians accept technical violations of the likelihood principle, enabling them to achieve matching with frequentist error probabilities, but as I explain in SIST (2018), one feels their hearts aren’t in it. Empirical Bayesians like Efron reject the likelihood principle, which is why Lindley says there’s “no one less Bayesian than an empirical Bayesian” (1969, p. 421). A ‘falsificationist Bayesian’, as Andrew Gelman calls himself, employs error statistics in model checking; at least he did a decade ago. Please see links in SIST (2018), and in searching this blog.
[5] I don’t think they are permitted in confirmatory trials.
[6] Of course, they were never able to find an adequate inductive logic. Carnap constructed a “continuum” of inductive logics. Harold Jeffreys developed an early “objective” Bayesian account in statistics, later built on by many others. Objective Bayesians haven’t settled on an adequate system. A few candidates are: reference priors, frequentist matching, invariance.
[7] Nor can you just look at data x for Popper; you’d need to know if it resulted from a ‘sincere effort’ to falsify the claim inferred. He never fully fleshed out his philosophy of severe testing. He wrote to me once that he regretted not learning more about formal statistical testing.

Deborah:
You quote Berger and Wolpert as writing: “It seems very strange that a frequentist could not analyze a given set of data…if the stopping rule is not given….Data should be able to speak for itself.”
I have two things to say about this.
First, nobody is stopping Berger, Wolpert, or anybody else from summarizing the data without reference to any model. Feel free to compute or plot whatever data summaries you want, and the data have then spoken for themselves (conditional on the summaries you have chosen).
Second, I can’t speak to what “a frequentist” could or could not do, but I will say that a Bayesian analysis requires some model for the stopping rule. That is, if the stopping rule is “not given” in the Bayesian analysis, some assumptions must be made about it. We discuss this in chapter 8 of Bayesian Data Analysis, third edition (chapter 7 of the earlier editions). Indeed, my annoyance with statements such as that above quote is one reason I wrote BDA in the first place! It was frustrating to see all these Bayesians going around telling people what to do, giving advice that contradicted both mathematics and good practice.
Andrew:
Thank you so much for your comment. I’m not sure I understand your view of how Bayesians do or ought to take account of optional stopping, so I will have to study your chapter.
After reading your comment, I looked for a source online of some (subjective) Bayesians who have recently been claiming to show optional stopping is no skin off their nose, but with a bit of a twist to the traditional Bayesian position (e.g., Rouder). Using simulations, they demonstrate the Bayes factor (BF) still has the same meaning. I immediately came across a blogpost of yours with comments that covered many of the key points, so I will just cite it.
https://statmodeling.stat.columbia.edu/2018/01/03/stopping-rules-bayesian-analysis-2/
I agree with one of the commentators, Steven Martin. Another mentions Rouder.
Once the BF is reported, all of the import of the evidence would have been given, so the inferential assessment, post data, can be nothing more than the BF. Objective or conventional Bayesians generally admit, begrudgingly, that they allow “technical violations” of the Likelihood Principle (LP), else they cannot match error statisticians. However, I assume you reject the LP, and wouldn’t start out assuming that priors were correct (in any sense).
I agree that you can
“Feel free to compute or plot whatever data summaries you want, and the data have then spoken for themselves (conditional on the summaries you have chosen)”.
But the Bayesians who claim that optional stopping (and other things that alter a method’s error probabilities) doesn’t matter post data generally aver that they are appealing to a core principle of rationality.
Mayo
Deborah:
The likelihood principle is fine for what it is, but, as we discuss in chapter 8 of BDA and as is also clear from common sense, the likelihood also includes the probability of which data are observed. You need to have some knowledge or make some assumptions about the stopping rule in order to have a likelihood in the first place!