A reader asks: “Can you tell me about disagreements on numbers between a severity assessment within error statistics, and a Bayesian assessment of posterior probabilities?” Sure.
There are differences between Bayesian posterior probabilities and formal error statistical measures, as well as between the latter and a severity (SEV) assessment, which differs from the standard Type I and II error probabilities, p-values, and confidence levels—despite the numerical relationships. Here are some random thoughts that will hopefully be relevant for both types of differences. (Please search this blog for specifics.)
1. The most noteworthy difference is that error statistical inference makes use of outcomes other than the one observed, even after the data are available: there’s no other way to ask things like, how often would you find at least one nominally statistically significant difference in a hunting expedition over k or more factors? Or to distinguish optional stopping in sequential trials from fixed sample-size experiments. Here’s a quote I came across just yesterday:
“[S]topping ‘when the data looks good’ can be a serious error when combined with frequentist measures of evidence. For instance, if one used the stopping rule [above]…but analyzed the data as if a fixed sample had been taken, one could guarantee arbitrarily strong frequentist ‘significance’ against H0.” (Berger and Wolpert, 1988, 77).
The worry about being guaranteed to erroneously exclude the true parameter value here is an error statistical affliction that the Bayesian is spared (even though I don’t think they can be too happy about it, especially when HPD intervals are assured of excluding the true parameter value). See this post for an amusing note; Mayo and Kruse (2001) below; and, if interested, search this blog for the (strong) likelihood principle and Birnbaum.
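Concretely, here is a minimal simulation sketch of the guarantee Berger and Wolpert describe; the setup (testing after every observation at a nominal .05 level, with sample sizes and seed chosen purely for illustration) is my own:

    import numpy as np

    rng = np.random.default_rng(0)

    # Sampling to a foregone conclusion: X_i ~ N(0,1), so H0: mu = 0 is true.
    # Test after every observation and stop as soon as |z| >= 1.96.
    def ever_rejects(n_max):
        x = rng.standard_normal(n_max)
        z = np.cumsum(x) / np.sqrt(np.arange(1, n_max + 1))
        return np.any(np.abs(z) >= 1.96)

    for n_max in [10, 100, 1000]:
        trials = 2000
        rate = sum(ever_rejects(n_max) for _ in range(trials)) / trials
        print(f"n_max = {n_max}: P(reject at some point) ~ {rate:.2f}")
    # Roughly .19, .37, .53 -- far above the nominal .05 level, and the
    # rate tends to 1 as n_max grows, just as Berger and Wolpert say.

Analyzed as if the sample size were fixed, such a stopping rule makes “significance” against H0 all but inevitable, which is exactly why the error statistician insists on taking the stopping rule into account.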
2. Highly probable vs. highly probed. SEV doesn’t obey the probability calculus: for a given test T and outcome x, the severity for both H and ~H might be horribly low. Moreover, an error statistical analysis is not in the business of probabilifying hypotheses but of evaluating and controlling the capabilities of methods to discern inferential flaws (problems with linking statistical and scientific claims, problems of interpreting statistical tests and estimates, and problems of underlying model assumptions). This is the basis for applying what may be called the Severity principle.
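To make the contrast concrete, here is a minimal sketch of a SEV computation for the simple one-sided Normal test T+ (H0: μ ≤ 0 vs. H1: μ > 0, σ known); the numbers are assumed purely for illustration, and the formula is the one for assessing claims of the form μ > μ1 after a statistically significant result:

    import numpy as np
    from scipy.stats import norm

    # One-sided test T+: H0: mu <= 0 vs H1: mu > 0, sigma known.
    # Illustrative numbers (assumed): n = 100, sigma = 1, observed
    # xbar = 0.2, i.e. d(x0) = 2.0, nominally significant.
    n, sigma, xbar = 100, 1.0, 0.2
    se = sigma / np.sqrt(n)

    def sev_greater(mu1):
        """Severity for the claim mu > mu1 after rejecting H0: the
        probability of a result fitting H0 better than the observed
        one, computed under mu = mu1."""
        return norm.cdf((xbar - mu1) / se)

    for mu1 in [0.0, 0.1, 0.2, 0.3]:
        print(f"SEV(mu > {mu1:.1f}) = {sev_greater(mu1):.3f}")
    # SEV(mu > 0.0) = 0.977, SEV(mu > 0.1) = 0.841,
    # SEV(mu > 0.2) = 0.500, SEV(mu > 0.3) = 0.159

Note that SEV attaches to specific discrepancy claims relative to this test and this outcome; it is not a probability distribution over hypotheses.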
3. I once posted a comment by J. A. Hartigan (1971), on David Bartholomew, that gives a five-line argument that while Bayesian and frequentist intervals may sometimes agree with improper priors, they never exactly agree with proper priors (see below [i]). But improper priors are not considered to provide degrees of belief (not even being proper probabilities). This would seem to suggest that when frequentists and Bayesians agree on numbers, the prior cannot be construed as a proper degree of belief assignment. So is there agreement? If priors are not probabilities, what then is the interpretation of a posterior? (There’s no suggestion it would be a measure of well-testedness, as is SEV.)
4. Now it might instead be a “default” or “conventional” prior. Yet producing identical numbers could only be taken as performing the tasks of an error statistical inference by reinterpreting them to mean confidence levels and significance levels, not posteriors. Then these could be connected to SEV measures.
For example, there’s agreement between “conventional” Bayesians and frequentists in the case of one-sided Normal (IID) testing (known σ), as noted in Ghosh, Delampady, and Samanta (2006, p. 35). If we wish to reject a null value when “the posterior odds against it are 19:1 or more, i.e., if posterior probability of H0 is < .05”, then the rejection region matches that of the corresponding .05-level test of H0. Even with this agreement on numbers, it seems to me this would badly distort the goal and rationale of testing inferences. When properly used, tests typically serve as workaday tools for probing a specific question or problem (e.g., about discrepancies). You jump in, and you jump out. Even after ensuring a p-value is genuine (“isolated tests” are not enough; search this blog for why), the tester is not about to use the information gleaned to assign either “posterior odds” or (actual or rational) degrees of belief to statistical hypotheses within a generally very idealized model. (At least that’s not the direct function of p-values or confidence levels.)
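Here is a minimal sketch of that agreement (the numbers are my own illustration): under the improper uniform prior on μ, the posterior probability of H0 coincides exactly with the one-sided p-value.

    import numpy as np
    from scipy.stats import norm

    # Illustrative numbers (assumed): n = 100, sigma = 1, mu0 = 0.
    n, sigma, mu0 = 100, 1.0, 0.0
    se = sigma / np.sqrt(n)
    xbar = 0.17  # observed sample mean

    # Frequentist p-value for the one-sided test of H0: mu <= mu0:
    p_value = 1 - norm.cdf((xbar - mu0) / se)

    # Posterior P(mu <= mu0 | x) under the improper uniform prior on mu,
    # for which the posterior is N(xbar, se^2):
    post_H0 = norm.cdf((mu0 - xbar) / se)

    print(f"p-value = {p_value:.4f}, posterior P(H0 | x) = {post_H0:.4f}")
    # Identical numbers (here .0446): rejecting when P(H0 | x) < .05
    # reproduces the alpha = .05 rejection region.

Matching numbers, then, but with entirely different meanings attached to them.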
5. Then there’s the familiar fact that there would be disagreement between the frequentist and Bayesian if one were testing the two-sided Normal mean: H0: μ = μ0 vs. H1: μ ≠ μ0. In fact, the same outcome that would be regarded as evidence against the null in the one-sided test (for the default Bayesian and frequentist) can result in statistically significant results being Bayesianly construed as no evidence against the null, or even evidence for it (due to a spiked prior).[ii]
See this post and the chart [iii] comparing p-values and posteriors below.
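For concreteness, here is a sketch of the spiked-prior computation (the half-and-half spike-and-slab setup and all numbers are assumed for illustration, in the spirit of Berger and Sellke 1987):

    import numpy as np
    from scipy.stats import norm

    # Two-sided testing H0: mu = mu0 vs H1: mu != mu0 with the spiked
    # prior: P(H0) = 1/2, and mu ~ N(mu0, tau^2) under H1. All numbers
    # are assumed for illustration: n = 100, sigma = 1, tau = 1, z = 2.
    n, sigma, tau, mu0 = 100, 1.0, 1.0, 0.0
    se = sigma / np.sqrt(n)
    xbar = mu0 + 2.0 * se  # z = 2.0, two-sided p ~ .046

    m0 = norm.pdf(xbar, loc=mu0, scale=se)                       # marginal under H0
    m1 = norm.pdf(xbar, loc=mu0, scale=np.sqrt(se**2 + tau**2))  # marginal under H1
    post_H0 = m0 / (m0 + m1)  # posterior P(H0 | x) with prior odds 1:1

    p_two_sided = 2 * (1 - norm.cdf(2.0))
    print(f"two-sided p = {p_two_sided:.3f}, P(H0 | x) = {post_H0:.2f}")
    # p ~ .046 is nominally significant, yet P(H0 | x) ~ .58: the same
    # outcome is Bayesianly construed as favoring the null.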
6. I’m reminded of Jim Berger’s valiant attempt to get Fisher, Jeffreys, and Neyman to all agree on testing (Berger 2003). Taking the conflict between p-values and Bayesian posteriors in two-sided testing, he offers a revision of tests thought to do a better job from both Bayesian and frequentist perspectives, but neither side liked it very much. See an earlier post. (Also Mayo 2003, and Mayo and Spanos 2011 below [iv].)
The ‘spike and slab’ priors (a cute name I heard some statisticians use) also lead to a conflict with Bayesian ‘credible interval’ reasoning, since the null value 0 is outside the corresponding interval. By contrast, severe testing inferences are harmonious whether one is doing testing or confidence intervals.
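Continuing with the same assumed numbers as in the sketch above, the clash is easy to display:

    # Same assumed numbers as above: xbar = 0.2, se = 0.1 (z = 2.0).
    se, xbar = 0.1, 0.2

    # The default 95% interval for mu (flat prior on mu alone, which
    # here coincides with the confidence interval) excludes 0:
    lo, hi = xbar - 1.96 * se, xbar + 1.96 * se
    print(f"95% interval: ({lo:.3f}, {hi:.3f})")  # -> (0.004, 0.396)
    # Interval reasoning rules out 0, while the spiked-prior posterior
    # above gives H0: mu = 0 probability ~ .58.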
7. Recall in this connection Senn’s note (on this blog) on the allegation that p-values overstate the evidence.
8. Sometimes when people ask for differences they really mean to pose the challenge: “show me a case that I cannot reconstruct Bayesianly.” This leads to “rational reconstructions” or what I call “painting by numbers”.
The search for an agreement/disagreement on numbers across different statistical philosophies is an understandable pastime in the foundations of statistics. Identifying matching or unified numbers, apart from what they might mean, might offer a glimpse of shared (unconscious?) underlying goals. For example, if H passes severely, you are free to say you have warrant to believe H, and then assign it a high probability, but this is not Bayesian inference (and the converse doesn’t hold). On the face of it, any inference, whether to the adequacy of a model (for a given purpose) or to a posterior probability, can be said to be warranted just to the extent that the inference has withstood severe testing: one with a high capability of having found flaws were they present. But I think it’s a mistake to play a numbers game without the statistical philosophy to point toward interpretation as well as avenues for filling gaps in the approach.
[i] Here are the 5 lines:
We need P(θ < θα(x) | θ) = α for all θ.
Assume θα(x) has positive density over the line, for all θ.
Then P(θ < θα(x) | θ, θα(x) > 0) = α(θ) > α.
So P(θ < θα(x) | θα(x) > 0) > α, averaging over θ.
So P(θ < θα(x) | x) = α is impossible.
(J. A. Hartigan, comment on D. J. Bartholomew (1971), “A Comparison of Frequentist and Bayesian Approaches to Inference with Prior Knowledge,” in Godambe and Sprott (eds.), Foundations of Statistical Inference, p. 432)
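Hartigan’s conclusion can also be checked by simulation; here is a sketch (the Normal model, the proper Normal prior, and all numbers are my own illustration):

    import numpy as np

    rng = np.random.default_rng(1)

    # Setup (assumed): X ~ N(theta, 1) with the proper prior
    # theta ~ N(0, tau^2). The 95% upper credible bound is
    # posterior mean + 1.645 * posterior sd; check its frequentist
    # coverage P(theta < bound | theta) at several true theta values.
    tau = 1.0
    w = tau**2 / (1 + tau**2)   # posterior mean = w * x
    post_sd = np.sqrt(w)        # posterior sd (with sigma = 1)

    for theta in [0.0, 1.0, 2.0, 3.0]:
        x = theta + rng.standard_normal(200_000)
        bound = w * x + 1.645 * post_sd
        print(f"theta = {theta:.1f}: coverage = {np.mean(theta < bound):.3f}")
    # Coverage is about .99, .91, .63, .25: above .95 near the prior
    # mean, below it far away. No proper prior yields exact coverage
    # alpha for all theta, just as Hartigan's argument concludes.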
[ii] Not all default Bayesians endorse the spiked priors here, meaning there is a lack of agreement on numbers even within the same philosophical school. See this post.
[iv] For a short and sweet, and not too bad, overview, with comments by Casella, see Mayo 2004 below.
Bartholomew, D. J. (1971), “A Comparison of Frequentist and Bayesian Approaches to Inference with Prior Knowledge,” in Godambe and Sprott (eds.), Foundations of Statistical Inference, 417–429.
Berger, J. (2003), “Could Fisher, Jeffreys and Neyman Have Agreed on Testing?”, Statistical Science 18, 1–12.
Berger, J. and Sellke, T. (1987), “Testing a Point Null Hypothesis: The Irreconcilability of P Values and Evidence,” (with discussion), J. Amer. Statist. Assoc. 82, 112–139.
Berger, J. and Wolpert, R. (1988), The Likelihood Principle, 2nd ed., Lecture Notes–Monograph Series, Vol. 6, Hayward, CA: Institute of Mathematical Statistics.
Casella, G. and Berger, R. (1987), “Reconciling Bayesian and Frequentist Evidence in the One-sided Testing Problem,” (with discussion), J. Amer. Statist. Assoc. 82, 106–111, 123–139.
Ghosh, J., Delampady, M. and Samanta, T. (2006), An Introduction to Bayesian Analysis: Theory and Methods, Springer.
Hartigan, J. (1971), comment on D. J. Bartholomew, “A Comparison of Frequentist and Bayesian Approaches to Inference with Prior Knowledge,” in Godambe and Sprott (eds.), Foundations of Statistical Inference.
Mayo, D. (2003), comment on J. O. Berger’s “Could Fisher, Jeffreys and Neyman Have Agreed on Testing?”, Statistical Science 18, 19–24.
Mayo, D. (2004), “An Error-Statistical Philosophy of Evidence,” in M. Taper and S. Lele (eds.), The Nature of Scientific Evidence: Statistical, Philosophical and Empirical Considerations, Chicago: University of Chicago Press, 79–118.
Mayo, D. and Kruse, M. (2001), “Principles of Inference and Their Consequences,” in D. Corfield and J. Williamson (eds.), Foundations of Bayesianism, Dordrecht: Kluwer Academic Publishers, 381–403.
I neglected another point of difference which is very important:
9. For the error statistician, data generation and modeling are very much a part of the overall inquiry; we don’t seek an account that imagines data are thrown at you and you have to make an inference. Testing assumptions is the other big area of contrast with Bayesians.
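For instance (a minimal sketch of my own, not a recipe from this post), one might probe the Normal IID assumptions behind the computations above before trusting them:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(2)

    # Stand-in for the actual data: heavy-tailed, to show a failure.
    x = rng.standard_t(df=3, size=100)

    w_stat, p_norm = stats.shapiro(x)      # Normality check (Shapiro-Wilk)
    r1 = np.corrcoef(x[:-1], x[1:])[0, 1]  # lag-1 autocorrelation (IID check)

    print(f"Shapiro-Wilk p = {p_norm:.3f}, lag-1 autocorr = {r1:.3f}")
    # A small Shapiro-Wilk p-value or a large |r1| would signal that the
    # assumptions underlying the tests (and SEV calculations) are suspect.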
Now the person who sent me the query was mentioning another blog that was seeking to find cases where SEV and posteriors differ, presumably to show the Bayesian posterior gets it right or whatever. But the person seemed to be overlooking the most central points of disagreement, and #9 is yet another.
“even though I don’t think they can be too happy about it, especially when HPD intervals are assured of excluding the true parameter value.”
… if and only if the prior is absolutely continuous and the true parameter value happens to be in the exact center of the optional stopping window.
Typically, if one suspects that this might *actually* be the case, one has prior information (usually of a theoretical nature) justifying putting a point mass on that specific value, and the quoted claim is no longer true.