7. Can/Should Bayesian and Error Statistical Philosophies Be Reconciled?
Stephen Senn makes a rather startling but doubtlessly true remark:
The late and great George Barnard, through his promotion of the likelihood principle, probably did as much as any statistician in the second half of the last century to undermine the foundations of the then dominant Neyman-Pearson framework and hence prepare the way for the complete acceptance of Bayesian ideas that has been predicted will be achieved by the De Finetti-Lindley limit of 2020. (Senn 2008, 459)
Many do view Barnard as having that effect, even though he himself rejected the likelihood principle (LP). One can only imagine Savage’s shock at hearing that contemporary Bayesians (save true subjectivists) are lukewarm about the LP! The 2020 prediction could come to pass, only to find Bayesians practicing in bad faith. Kadane, one of the last of the true Savage Bayesians, is left to wonder at what can only be seen as a Pyrrhic victory for Bayesians.
7.1 The (Strong) Likelihood Principle (LP)
Savage defines the LP as follows:
According to Bayes’s theorem, P(x│μ) [. . . ] constitutes the entire evidence of the experiment, that is, it tells all that the experiment has to tell. More fully and more precisely, if y is the datum of some other experiment, and if it happens that P(x│μ) and P(y│μ) are proportional functions of μ (that is, constant multiples of each other), then each of the two data x and y have exactly the same thing to say about the values of μ. (Savage 1962, 17)
Berger and Wolpert, in their monograph The Likelihood Principle (1988), put their finger on the core issue:
The philosophical incompatibility of the LP and the frequentist viewpoint is clear, since the LP deals only with the observed x, while frequentist analyses involve averages over possible observations. [. . .] Enough direct conflicts have been [. . .] seen to justify viewing the LP as revolutionary from a frequentist perspective. (Berger and Wolpert 1988, 65–66)
The reason I argued in 1996 that “you cannot be a little bit Bayesian” is that if one is Bayesian enough to accept the LP, one is Bayesian enough to renounce error probabilities.
7.2 Optional Stopping Effect
That error statistics violates the LP is often illustrated by means of the optional stopping effect. We can allude to our two-sided test from a Normal distribution with mean μ and standard deviation σ, i.e.,
Xi ~ N(μ, σ), and we wish to test H0: μ = 0 vs. H1: μ ≠ 0.
Rather than fix the sample size ahead of time, the rule instructs us:
Keep sampling until H0 is rejected at the .05 level (i.e., keep sampling until |X̄| ≥ 1.96σ/√n, where X̄ is the sample mean).
With n fixed the type 1 error probability is .05, but with this stopping rule the actual significance level differs from, and will be greater than, .05. In The Likelihood Principle, Berger and Wolpert claim that “the need here for involvement of the stopping rule clearly calls the basic frequentist premise into question” (74.2-75). But they are arguing from a statistical philosophy incompatible with the error-statistical philosophy, which requires taking into account the relevant error probabilities.
Therefore, ignoring aspects of the data generation that alter error probabilities leads to erroneous assessments of the well-testedness, or severity, of the inferences. Ignoring the stopping rule allows a high or maximal probability of error, thereby violating what Cox and Hinkley call “the weak repeated sampling principle.” As Birnbaum (1969, 128) puts it, “the likelihood concept cannot be construed so as to allow useful appraisal, and thereby possible control, of probabilities of erroneous interpretations.” From the error statistical standpoint, ignoring the stopping rule allows inferring that there is evidence for a hypothesis even though it has passed with a low or even 0 severity.
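The inflation of the type 1 error probability under this stopping rule is easy to exhibit numerically. The following is a minimal simulation sketch (not from the paper; the function name `try_to_reject` and the trial counts are illustrative choices): data are generated under H0 (μ = 0, σ = 1 known), and sampling continues until the nominal .05 two-sided test rejects, up to a cap n_max.

```python
import random
import math

def try_to_reject(n_max, rng):
    """Sample under H0 (mu = 0, sigma = 1) until the nominal .05 two-sided
    test rejects, i.e. until |X-bar| >= 1.96/sqrt(n), up to n_max draws."""
    total = 0.0
    for n in range(1, n_max + 1):
        total += rng.gauss(0.0, 1.0)          # one more observation under H0
        xbar = total / n
        if abs(xbar) >= 1.96 / math.sqrt(n):  # nominal .05 rejection region
            return True
    return False

rng = random.Random(1)
trials = 4000
for n_max in (1, 10, 100):
    rate = sum(try_to_reject(n_max, rng) for _ in range(trials)) / trials
    print(f"n_max = {n_max:3d}: actual type 1 error rate ~ {rate:.3f}")
```

With n_max = 1 the rejection rate sits near the nominal .05; as the rule is allowed to keep sampling, the actual error rate climbs well above it, and in the limit of unbounded sampling rejection is guaranteed (the “sampling to a foregone conclusion” phenomenon).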
7.3 The Optional Stopping Effect with (Two-sided) Confidence
The equivalent stopping rule can be framed in terms of the corresponding 95% confidence interval method:
Keep sampling until the 95% confidence interval excludes 0.
Berger and Wolpert concede that using this stopping rule “has thus succeeded in getting the [Bayesian] conditionalist to perceive that μ ≠ 0, and has done so honestly” (80-81). This is a striking admission—especially as the Bayesian credibility interval assigns a probability of .95 to the truth of the interval estimate:
μ = x̄ ± 1.96(σ/√n)
Does this lead the authors to renounce the LP? It does not. At least not then. To do so would be to renounce Bayesian coherence. From the perspective of the Bayesian (or likelihoodist), to take the stopping rule into account is tantamount to considering the experimenter’s intentions (when to stop), which have no place in appraising data. This overlooks the fact that the error statistician has an entirely objective way to pick up on the stopping rule effect, or anything else that influences error probabilities—namely, in the error-statistical report. Although the choice of stopping rule (as with other test specifications) is determined by the intentions of the experimenter, it does not follow that taking account of its influence is to take account of subjective intentions. The allegation is a non sequitur.
One need not allude to optional stopping examples to see that error-statistical methods violate the LP. The analogous problem occurs if, given the null hypothesis, one is allowed to search the data for maximally likely alternative hypotheses (Mayo 1996, chap. 9; Mayo and Kruse; Cox and Hinkley).
7.4 Savage’s Sleight of Hand in Defense of the LP
While Savage touts the ‘simplicity and freedom’ enjoyed by the Bayesian, who may ignore the stopping rule, he clearly is bothered by the untoward implications of doing so. (Armitage notes that “thou shalt be misled” if one is unable to take account of the stopping rule.) In dismissing Armitage’s result (as no more possible than a perpetual motion machine), however, Savage switches to a very different case—one where the null and the alternative are both (point) hypotheses that have been fixed before the data, and where the test is restricted to these two preselected values. In this case, it is true, the high probability of error is averted, but it is irrelevant to the context in which the optional stopping problem appears—the two-sided test or corresponding confidence interval. Defenders of the LP often make the identical move to the point against point example (Royall 1997). Shouldn’t we trust our intuition in the simple case of point against point, some ask, where upholding the LP does not lead to problems (Berger and Wolpert, 83)? No. In fact, as Barnard (1962, 75) explained (to Savage’s surprise, at the ‘Savage Forum’), the fact that the alternative hypothesis need not be explicit is what led him to deny the LP in general.
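Why is the high error probability averted in the point-against-point case? With two fixed simple hypotheses, the probability under H0 that the likelihood ratio ever favors H1 by a factor k is bounded by 1/k (the universal bound), no matter how long one samples. A hedged sketch of this (an illustration of the standard bound, not a reconstruction of Savage’s own example; H1: μ = 1 is an arbitrary choice):

```python
import random
import math

def lr_reaches_k(k, n_max, rng):
    """With data generated under H0: mu = 0 (sigma = 1), does the likelihood
    ratio for H1: mu = 1 over H0 ever reach k within n_max observations?
    The log-LR increment per draw x is log[N(x;1,1)/N(x;0,1)] = x - 1/2."""
    log_lr, log_k = 0.0, math.log(k)
    for _ in range(n_max):
        log_lr += rng.gauss(0.0, 1.0) - 0.5   # drifts downward under H0
        if log_lr >= log_k:
            return True
    return False

rng = random.Random(1)
trials = 4000
rate = sum(lr_reaches_k(19, 500, rng) for _ in range(trials)) / trials
print(f"P(LR ever >= 19 under H0) ~ {rate:.3f}  (universal bound: 1/19 ~ 0.053)")
```

Sampling to a foregone conclusion is thus impossible here—but this is exactly why the case is irrelevant to the two-sided test, where no fixed point alternative is specified in advance.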
7.5 The Counterrevolution?
But all this happened before the sands began to shift some ten years ago. Nowadays leading default Bayesians have conceded that desirable reference priors force them to consider the statistical model, “leading to violations of basic principles, such as the likelihood principle and the stopping rule principle” (Berger 2006, 394). But it is not enough to describe a certain decision context and loss function in which a Bayesian could take account of the stopping rule. Following our requirement for assessing statistical methods philosophically, we require a principled ground (see Mayo 2011). Similarly, Bernardo (2005; 2010) leaves us with a concession (to renounce the LP) but without a philosophical foundation. A justification that rests merely on having the numbers agree with those of the error statistician lacks a philosophical core.
8. Concluding Remarks: Deep versus Shallow Statistical Philosophy
As I argued in part 1 (2011, this special topic of RMM), the Bayesians have ripped open their foundations for approaches that scarcely work from any standpoint. While many Bayesians regard the default Bayesian paradigm as more promising than any of its contenders, we cannot ignore its being at odds with two fundamental goals of the Bayesian philosophical standpoint: incorporating information via priors, and adhering to the likelihood principle. Berger (2003) rightly points out that arriving at subjective priors, especially in complex cases, also produces coherency violations. But the fact that human limitations may prevent attaining a formal ideal is importantly different from requiring its violation in order to obtain the recommended priors (Cox and Mayo 2010). In their attempt to secure default priors (different schools have their own favorites), it appears the default Bayesians have made a mess of their philosophical foundations (Cox 2006; Kadane 2011). The priors they recommend are not even supposed to be interpreted as measuring beliefs, or even probabilities—they are often improper. Were default prior probabilities to represent background information, then, as subjective Bayesians rightly ask, why do they differ according to the experimental model? Default Bayesians do not agree with each other even with respect to standard methods.
For instance, Bernardo, but not Berger, rejects the spiked prior that leads to pronounced conflicts between frequentist p-values and posteriors. While this enables an agreement on numbers (with frequentists) there is no evidence that the result is either an objective or rational degree of belief (as he intends) or an objective assessment of well-testedness (as our error statistician achieves).
Embedding the analysis into a decision-theoretic context with certain recommended loss functions can hide all manner of sins, especially once one moves to cases with multiple parameters (where outputs depend on a choice of ordering of importance of nuisance parameters). The additional latitude for discretionary choices in decision-contexts tends to go against the purported goal of maximizing the contribution of the data in order to unearth ‘what can be said’ about phenomena under investigation. I invite leading reference Bayesians to step up to the plate and give voice to the philosophy behind the program into which they have led a generation of statisticians: it appears the emperor has no clothes.
While leading Bayesians embrace default Bayesianism, even they largely seem to do so in bad faith. Consider Jim Berger:
Too often I see people pretending to be subjectivists, and then using weakly informative priors that the objective Bayesian community knows are terrible and will give ridiculous answers; subjectivism is then being used as a shield to hide ignorance. In my own more provocative moments, I claim that the only true subjectivists are the objective Bayesians, because they refuse to use subjectivism as a shield against criticism of sloppy pseudo-Bayesian practice. (Berger 2006, 463)
This hardly seems a recommendation for either type of Bayesian, yet this is what the discussion of foundations tends to look like these days. Note too that the ability to use Bayesian methods to obtain ‘ridiculous answers’ is not taken as grounds to give up on all of it; whereas the possibility of ridiculous uses of frequentist methods is invariably taken as a final refutation of the account—even though we are given no evidence that anyone actually commits them!
To echo Stephen Senn (2011, this special topic of RMM) perhaps the only thing these Bayesian disputants agree on, without question, is that frequentist error statistical methods are wrong, even as they continue to be used and developed in new arenas. The basis for this dismissal? If you do not already know you will have guessed: the handful of well-worn, and thoroughly refuted, howlers from 50 years ago, delineated in section 5.
Still, having found the Bayesian foundations in shambles, even having discarded the Bayesian’s favorite whipping boys, scarcely frees frequentist statisticians from getting beyond the classic caricatures of Fisherian and N-P methods. The truth is that even aside from the distortions due to personality frictions, these caricatures differ greatly from the ways these methods were actually used. Moreover, as stands to reason, the focus was nearly always on theoretical principle and application—not on providing an overarching statistical philosophy. They simply did not have a clearly framed statistical philosophy. Indeed, one finds both Neyman and Pearson emphasizing repeatedly that these were tools that could be used in a variety of ways, and what really irked Neyman was the tendency toward dogmatic adherence to a presumed a priori rationale. How at odds with the subjective Bayesians, who tend to advance their account as the only rational way to proceed. Now that Bayesians have stepped off their a priori pedestal, it may be hoped that a genuinely deep scrutiny of the frequentist and Bayesian accounts will occur. In some corners of practice it appears that frequentist error statistical foundations are being discovered anew. Perhaps frequentist foundations, never made fully explicit, but at most lying deep below the ocean floor, are being disinterred. While some of the issues have trickled down to the philosophers, by and large we see ‘formal epistemology’ assuming the traditional justifications for probabilism that have long been questioned or thrown overboard by Bayesian statisticians. The obligation is theirs to either restore or give up on their model of ‘rationality’.
*To read previous sections, please see the RMM page, and scroll down to Mayo’s Sept 25 paper.
(All references can also be found in the link.)