The record for the number of hits on this blog goes to “When Bayesian Inference Shatters,” in which Houman Owhadi presented a “Plain Jane” explanation of results now published in “On the Brittleness of Bayesian Inference.” A follow-up post appeared a year ago. Here’s how their paper begins:
Houman Owhadi
Professor of Applied and Computational Mathematics and Control and Dynamical Systems, Computing + Mathematical Sciences,
California Institute of Technology, USA
Clint Scovel
Senior Scientist,
Computing + Mathematical Sciences,
California Institute of Technology, USA
“On the Brittleness of Bayesian Inference”
ABSTRACT: With the advent of high-performance computing, Bayesian methods are becoming increasingly popular tools for the quantification of uncertainty throughout science and industry. Since these methods can impact the making of sometimes critical decisions in increasingly complicated contexts, the sensitivity of their posterior conclusions with respect to the underlying models and prior beliefs is a pressing question to which there currently exist positive and negative answers. We report new results suggesting that, although Bayesian methods are robust when the number of possible outcomes is finite or when only a finite number of marginals of the data-generating distribution are unknown, they could be generically brittle when applied to continuous systems (and their discretizations) with finite information on the data-generating distribution. If closeness is defined in terms of the total variation (TV) metric or the matching of a finite system of generalized moments, then (1) two practitioners who use arbitrarily close models and observe the same (possibly arbitrarily large amount of) data may reach opposite conclusions; and (2) any given prior and model can be slightly perturbed to achieve any desired posterior conclusion. The mechanism causing brittleness/robustness suggests that learning and robustness are antagonistic requirements, which raises the possibility of a missing stability condition when using Bayesian inference in a continuous world under finite information.
© 2015, Society for Industrial and Applied Mathematics
Permalink: http://dx.doi.org/10.1137/130938633
_______________________________________________________________________________
The application of Bayes’ theorem in the form of Bayesian inference has fueled an ongoing debate with practical consequences in science, industry, medicine, and law [21]. One commonly-cited justification for the application of Bayesian reasoning is Cox’s theorem [15], which has been interpreted as stating that any “natural” extension of Aristotelian logic to uncertain contexts must be Bayesian [34]. It has now been shown that Cox’s theorem as originally formulated is incomplete [28] and there is some debate about the “naturality” of the additional assumptions required for its validity [1, 20, 29, 31], e.g., the assumption that knowledge can be always represented in the form of a σ-additive probability measure that assigns to each measurable event a single real-valued probability.
However—and this is the topic of this article—regardless of the internal logic, elegance, and appealing simplicity of Bayesian reasoning, a critical question is that of the robustness of its posterior conclusions with respect to perturbations of the underlying models and priors.
For example, a frequentist statistician might ask, if the data happen to be a sequence of i.i.d. draws from a fixed data-generating distribution μ†, whether or not the Bayesian posterior will asymptotically assign full mass to a parameter value that corresponds to μ†. When it holds, this property is known as frequentist consistency of the Bayes procedure, or the Bernstein–von Mises property.
Alternatively, without resorting to a frequentist data-generating distribution μ†, a Bayesian statistician who is also a numerical analyst might ask questions about stability and conditioning: does the posterior distribution (or the posterior value of a particular quantity of interest) change only slightly when elements of the problem setup (namely, the prior distribution, the likelihood model, and the observed data) are perturbed, e.g., as a result of observational error, numerical discretization, or algorithmic implementation? When it holds, this property is known as robustness of the Bayes procedure.
This paper summarizes recent results [46, 47] that give conditions under which Bayesian inference appears to be nonrobust in the most extreme fashion, in the sense that arbitrarily small changes of the prior and model class lead to arbitrarily large changes of the posterior value of a quantity of interest. We call this extreme nonrobustness “brittleness,” and it can be visualized as the smooth dependence of the value of the quantity of interest on the prior breaking into a fine patchwork, in which nearby priors are associated to diametrically opposed posterior values. Naturally, the notion of “nearby” plays an important role, and this point will be revisited later. Much as classical numerical analysis shows that there are “stable” and “unstable” ways to discretize a partial differential equation (PDE), these results and the wider literature of positive [8, 13, 19, 37, 38, 53, 56] and negative [3, 17, 23, 24, 35, 40] results on Bayesian inference contribute to an emerging understanding of “stable” and “unstable” ways to apply Bayes’ rule in practice.
The results reported in this article show that the process of Bayesian conditioning on data at finite enough resolution is unstable (or “sensitive” as defined in [54]) with respect to the underlying distributions (under the total variation (TV) and Prokhorov metrics) and is the source of negative results similar to those caused by tail properties in statistics [2, 18]. The mechanisms causing the stability/instability of posterior predictions suggest that learning and robustness are conflicting requirements and raise the possibility of a missing stability condition when using Bayesian inference for continuous systems with finite information (akin to the Courant–Friedrichs–Lewy (CFL) stability condition when using discrete schemes to approximate continuous PDEs). …
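Stated schematically (a paraphrase; the precise hypotheses, metrics, and statements are in [46, 47] and in the paper itself): writing π for the prior/model, Φ for a quantity of interest, and E_π[Φ | data] for its posterior value, the brittleness results say roughly that over any small neighbourhood of π the attainable posterior values sweep out essentially the whole deterministic range of Φ:

```latex
% Rough schematic of brittleness; see the cited papers for precise statements.
% d = TV or Prokhorov metric (or matching of finitely many generalized moments),
% pi = nominal prior/model, Phi = quantity of interest, epsilon > 0 arbitrary.
\[
  \inf_{\pi' \,:\, d(\pi,\pi') \le \epsilon} \mathbb{E}_{\pi'}\!\left[\Phi \mid \text{data}\right]
  \;\approx\; \operatorname{ess\,inf} \Phi ,
  \qquad
  \sup_{\pi' \,:\, d(\pi,\pi') \le \epsilon} \mathbb{E}_{\pi'}\!\left[\Phi \mid \text{data}\right]
  \;\approx\; \operatorname{ess\,sup} \Phi .
\]
% In words: arbitrarily small perturbations of the prior/model can push the
% posterior value of Phi to either extreme of its range.
```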
To keep reading the paper: http://epubs.siam.org/doi/10.1137/130938633
Owhadi has agreed to answer reader questions on this blog. You may want to check the discussion in the comments on the two earlier posts here and here.
H. Owhadi, C. Scovel & T. J. Sullivan. “On the Brittleness of Bayesian Inference.” SIAM Review 57(4):566–582, 2015. doi:10.1137/130938633
H. Owhadi, C. Scovel & T. J. Sullivan. “Brittleness of Bayesian Inference under Finite Information in a Continuous World.” Electronic Journal of Statistics 9:1–79, 2015. arXiv:1304.6772
H. Owhadi & C. Scovel. “Brittleness of Bayesian Inference and New Selberg Formulas.” Communications in Mathematical Sciences 14(1):83–145, 2016. arXiv:1304.7046
Previous posts:
September 14, 2013: When Bayesian Inference Shatters: Guest post
January 8, 2015: The Brittleness of Bayesian Inference (re-blog)
This is far beyond my technical knowledge to comment on, but I do wonder whether something like this would prevent the brittleness – https://www.researchgate.net/publication/278969335_Robust_Bayesian_inference_via_coarsening
I don’t think I would ever believe all of the data is as it is reported and to the accuracy claimed.
Keith O’Rourke
Phaneron0: This definitely sounds relevant. Houman has been really good at taking up & illuminating even partially baked questions when a group of them forms. So I hope a few more come in.
Keith: Check what Houman said to me on a somewhat similar point (I think) in 2013: https://errorstatistics.com/2013/09/14/when-bayesian-inference-shatters-owhadi-scovel-and-sullivan-guest-post/#comment-14895
That link increases my expectation that the answer will be yes…
A number of people have suggested replacing ‘exact’ conditioning by a ‘coarsened’ version.
Tarantola also suggested the same in his classic ‘Inverse Problems’ book. A key motivation was apparently the Borel–Kolmogorov conditioning paradox. Jaynes also proposed a resolution based on reconsidering the passage to the limit more carefully. Similarly, ABC might be considered a more ‘honest’ representation of conditioning (and uses a tolerance) despite the name ‘approximate Bayesian computation’.
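As a toy illustration of what tolerance-based (‘coarsened’) conditioning looks like in practice, here is a minimal rejection-ABC sketch; the Gaussian location model, the prior, the summary statistic, and the tolerance values are all illustrative choices, not anything taken from the papers under discussion.

```python
# Minimal rejection-ABC sketch: condition on "summary within tolerance eps"
# rather than on exact equality. Toy Gaussian model; all choices illustrative.
import numpy as np

rng = np.random.default_rng(0)

# "Observed" data: n i.i.d. draws from an unknown mean (here secretly 0.3).
n = 50
observed = rng.normal(0.3, 1.0, size=n)
s_obs = observed.mean()                      # summary statistic

def abc_posterior_sample(eps, n_draws=200_000):
    """Rejection ABC: draw theta from the prior, simulate data, and keep theta
    whenever the simulated summary lies within eps of the observed summary."""
    theta = rng.uniform(-1.0, 1.0, size=n_draws)            # prior: Uniform(-1, 1)
    sims = rng.normal(theta[:, None], 1.0, size=(n_draws, n))
    keep = np.abs(sims.mean(axis=1) - s_obs) < eps
    return theta[keep]

for eps in (0.5, 0.1, 0.02):
    post = abc_posterior_sample(eps)
    print(f"eps={eps:5.2f}  accepted={post.size:6d}  "
          f"posterior mean={post.mean():.3f}  sd={post.std():.3f}")
# Smaller eps -> sharper ("more exact") conditioning but fewer accepted draws:
# the robustness/accuracy trade-off discussed in this thread, in miniature.
```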
I mentioned on Gelman’s blog the other day that these might be thought of as types of hierarchical model (which has been pointed out in the literature, e.g. by Wilkinson, ‘ABC gives exact results under the assumption of model error’).
This is all not too surprising (in retrospect, of course!) from one sort of regularisation perspective – it is common in physical models to introduce an additional scale/parameter to remove singular limits. The price, of course, is needing to choose the coarseness parameter.
This is also a general perspective of many in the dynamical systems world (for example) – view singular (irregular) phenomena as a projection or limit of some more general (e.g. relaxing some implicit or explicit constraint) but more regular model.
The tradeoff indeed seems to be between ‘precision’ in lower dimensions with the risk of singular behaviour and ‘vagueness’ in higher dimensions but with more regularity (at least in some senses).
‘The curse of instability’ (cf. the curse of dimensionality) might be of interest to some here – http://arxiv.org/abs/1505.04334
Deborah: Thanks for the posts.
Keith: Yes “coarsening” (i.e. using coarse models or limiting the resolution of the data in the conditioning process) is one way to avoid brittleness. Note that in doing so you are giving up some degree of convergence/accuracy. In that sense robustness and consistency/accuracy/convergence are conflicting requirements.
Houman
PS: The Q/A section of https://errorstatistics.com/2015/01/08/on-the-brittleness-of-bayesian-inference-an-update-owhadi-and-scovel-guest-post/ and the conclusion of the paper contain a short discussion of the coarsening strategy. Summarizing, the brittleness results appear to suggest that robust inference (in a continuous world under finite information) should perhaps be done with reduced/coarse models rather than highly sophisticated/complex models (with a level of coarseness/reduction that would depend on the available information).
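A minimal sketch of the other sense of coarsening mentioned above (limiting the resolution of the data in the conditioning step): replace density values in the likelihood by the probabilities of bins of width delta around each observation. The Gaussian location model, the flat prior on a grid, and the delta values are illustrative choices only.

```python
# Coarsened-conditioning sketch: swap density values f(x_i | theta) in the
# likelihood for bin probabilities P(x_i - delta/2 < X <= x_i + delta/2 | theta).
# Toy Gaussian location model on a parameter grid; all choices illustrative.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
data = rng.normal(0.3, 1.0, size=30)

thetas = np.linspace(-1.0, 1.0, 401)        # parameter grid
prior = np.ones_like(thetas)                # flat prior on the grid
prior /= prior.sum()

def posterior(delta):
    """Posterior over the grid using data coarsened to resolution delta."""
    if delta == 0.0:                        # "exact" conditioning: use densities
        loglik = norm.logpdf(data[None, :], loc=thetas[:, None]).sum(axis=1)
    else:                                   # coarse conditioning: use bin probabilities
        hi = norm.cdf(data[None, :] + delta / 2, loc=thetas[:, None])
        lo = norm.cdf(data[None, :] - delta / 2, loc=thetas[:, None])
        loglik = np.log(hi - lo).sum(axis=1)
    w = prior * np.exp(loglik - loglik.max())
    return w / w.sum()

for delta in (0.0, 0.1, 1.0):
    post = posterior(delta)
    mean = np.sum(thetas * post)
    sd = np.sqrt(np.sum((thetas - mean) ** 2 * post))
    print(f"delta={delta:4.1f}  posterior mean={mean:.3f}  posterior sd={sd:.3f}")
# Small delta essentially reproduces the density-based posterior; larger delta
# flattens the likelihood, trading sharpness for insensitivity to fine-scale
# features of the model.
```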
Olivier: Thanks for the pointers; we are on the same page. A few remarks:
– The Borel–Kolmogorov paradox can be shown to be generic, e.g. always present when the distribution of the data is not absolutely continuous with respect to the one induced by the prior (e.g. Thm 4 of http://arxiv.org/abs/1508.02449).
– When trying to construct conditional expectations as disintegration or derivation limits, the limit may not exist, may depend on the net, or may be non-computable if the set of priors is not carefully constrained ([Tjur 1974, 1980], [Pfanzagl 1979], [Ackerman, Freer, and Roy 2010]; see Remark 5 of http://arxiv.org/abs/1508.02449 for precise references).
– Taking the limit will also lead to brittleness (the threshold below which one gets brittleness is non-asymptotic), so, as you mention, one question is how to choose the level of coarseness.
Houman
Thanks for the reply and refs. I look forward to having a more careful read.
This is probably much too late, but for the record. Data are generated at the level of distribution functions; Bayesian statistics uses densities. The distribution function and the density are connected by the pathologically discontinuous (unbounded) linear differential operator. Given finite precision, if the Kolmogorov distance d_{ko}(F,G) between two distribution functions F and G is sufficiently small, the data generated by F and G will be equal. In spite of this, the densities f and g can be arbitrarily far apart in the total variation metric, and conclusions based on these densities can also be arbitrarily far apart. This cannot be rectified by making the total variation distance small: even if d_{tv}(F,G) is small, the densities f and g can be unbounded in different regions and zero in different regions, leading to very different conclusions if these are based on likelihood. The above is a way of thinking about statistics; the paper gives it a precise mathematical expression.

Ill-posed problems require regularization, but it is not possible to regularize in a total variation neighbourhood as such neighbourhoods are too small. Presumably such a regularization would look something like min J(g) subject to d_{tv}(f,g) < epsilon. This requires an initial choice of f, and it is not clear how this can be made. In a previous post I mentioned the standard comb distribution C_o. The Kolmogorov distance to the standard normal distribution is d_{ko}(N(0,1),C_o) = 0.02; the total variation distance is d_{tv}(N(0,1),C_o) = 0.483. Thus going from C_o to the N(0,1) by regularization would require regularizing over a very large total variation neighbourhood. Any attempt to regularize at the level of densities will therefore fail.

Regularization over Kolmogorov neighbourhoods does make sense: min J(F) subject to d_{ko}(F,P_n) = O(1/sqrt(n)) < epsilon, where P_n is the empirical measure. Note that the Kolmogorov metric allows a sensible comparison with the empirical distribution; the total variation metric does not, as d_{tv}(F,P_n) = 1. In general, any metric based on a Vapnik–Cervonenkis (polynomial discrimination) family of sets will work. The pathologies of the article can be avoided by regularizing the model in this manner. Presumably something similar can be done with the prior, but this is not my problem.

The proof makes use of balls around points, B_{delta}(x). The set of all such balls is not a Vapnik–Cervonenkis class as they shatter all finite sets. There are good reasons for restricting probabilities P(B) to Vapnik–Cervonenkis classes and to functionals which are continuous (platinum standard: locally uniformly differentiable) with respect to metrics based on such classes. This gives stability of analysis and avoids the pathologies of the paper.

Finally, a word on the Prokhorov metric. The definition involves a number epsilon which, in one and the same expression, is a measure of measuring accuracy with units (e.g. inches) and a unitless quantity, a probability. I can think of no situation where this makes sense.
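To put the empirical-measure point in code: the Kolmogorov distance between a continuous model and the empirical distribution is computable and shrinks like 1/sqrt(n) when the model fits, whereas the total variation distance between any continuous F and the discrete empirical measure P_n is identically 1. This is an illustrative sketch (the N(0,1) model and the sample sizes are my own choices), not anything taken from the paper or the comment above.

```python
# Sketch: d_ko(F, P_n) is small and O(1/sqrt(n)) when the model fits the data,
# while d_tv(F, P_n) = 1 for any continuous F and discrete empirical P_n.
# The N(0,1) model and the sample sizes are illustrative choices.
import numpy as np
from scipy.stats import norm, kstest

rng = np.random.default_rng(2)

for n in (100, 1_000, 10_000):
    x = rng.normal(size=n)
    d_ko = kstest(x, norm.cdf).statistic      # sup_t |F(t) - F_n(t)|
    print(f"n={n:6d}  d_ko(F, P_n) = {d_ko:.4f}   (1/sqrt(n) = {1/np.sqrt(n):.4f})")

# Total variation: P_n puts mass 1/n on finitely many points, each of which has
# F-probability 0 under a continuous F, so d_tv(F, P_n) = 1 regardless of n.
```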
It’s not too late, but maybe you can say more about the upshot of your comment. Is it the discrete nature of the data that causes the problem they’re on about?
Mayo: I believe the discrete data + continuous model context means that the prior doesn’t ‘wash out’ but remains relevant. The reasons why this is so can be traced to measure-theoretic/functional-analytic theorems like those Laurie mentions. They probably also hold for discrete but ill-conditioned problems, e.g. those arising from discretisations of continuous problems, and hence have connections to stability conditions in numerical analysis, as Owhadi et al. mention.
As far as I understand the paper, it exploits the pathological discontinuity of the likelihood with respect to the total variation metric. The authors define the total variation metric d_{tv}(F,G), but as the mathematical background of people reading this blog is heterogeneous, it may be easier to use the equivalent L_1 metric d_1(f,g) = \int |f(x) - g(x)| dx, where f and g are the densities of F and G; we have d_1(f,g) = 2 d_{tv}(F,G). This is easier to interpret. Look at Figure 2 of the paper, in particular theta = 0.4 for f^a and f^b. You have to look carefully, and in particular at the scales, but it seems to me that panel (b) corresponds to theta = 0.4. Now f^a(x,0.4) and f^b(x,0.4) differ only on the very small interval evident in panel (b). This means that d_1(f^a(\cdot,0.4), f^b(\cdot,0.4)) is very small. Suppose you now observe x = 0.5. Then f^a(x,0.4) = f^a(0.5,0.4) \approx 0.75 (reading from the figure) and f^b(x,0.4) = f^b(0.5,0.4) \approx 10^{-9}. You can do it in the other direction: replace the trough at x = 0.5 for f^b by a very high but thin peak, and you can then make f^b(0.5,0.4) as large as you wish.

More generally, let l(f,x) be the likelihood based on the model f for the data x and suppose 0 < l(f,x) < \infty. Then inf_{g: d_1(f,g) < epsilon} l(g,x) = 0 and sup_{g: d_1(f,g) < epsilon} l(g,x) = \infty. Thus making models close in the sense of total variation places no restrictions on the likelihoods. Given this, the conclusions of the paper are not surprising.
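A numerical toy version of this argument (my own construction, not the paper’s f^a and f^b): take f to be the uniform density on [0,1] and g equal to f except for a tiny trough around the observed point; the L_1 (hence total variation) distance is tiny, yet the likelihoods at x differ by many orders of magnitude.

```python
# Toy version of the likelihood-vs-TV argument: two densities on [0,1] that are
# close in L1 (hence in total variation) but whose likelihoods at the observed
# point differ enormously. Construction is illustrative, not the paper's f^a, f^b.
import numpy as np

x_obs = 0.5
width = 1e-4                                  # width of the perturbed interval

def f(t):                                     # f: Uniform(0, 1) density
    return np.ones_like(np.asarray(t, dtype=float))

def g(t):                                     # g: equal to f except a tiny trough at x_obs
    t = np.asarray(t, dtype=float)            # (up to a negligible renormalization)
    return np.where(np.abs(t - x_obs) < width / 2, 1e-9, 1.0)

grid = np.linspace(0.0, 1.0, 1_000_001)
dx = grid[1] - grid[0]
d1 = np.sum(np.abs(f(grid) - g(grid))) * dx   # Riemann sum for the L1 distance

print(f"d_1(f, g) ≈ {d1:.2e}   (so d_tv ≈ {d1 / 2:.2e})")
print(f"f(x_obs)  = {float(f(x_obs)):.3g},  g(x_obs) = {float(g(x_obs)):.3g}")
# The likelihood ratio f(x_obs)/g(x_obs) is ~1e9 even though the models are
# within ~5e-5 of each other in total variation. Replacing the trough by a tall
# thin spike would instead make g(x_obs) arbitrarily large.
```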
To me the f^b model seems more perverse than my comb model. Unless the statistician has good arguments for the f^b model, some other model must be chosen. This implies regularization on the one hand; on the other hand, the chosen model must be consistent with the data. I see no way of doing this using total variation, for the simple reason that d_{tv}(F,F_n) = 1 for any continuous F. Once again, the differential operator is pathologically discontinuous and choosing a density is an ill-posed problem. It requires regularization, but the authors give no hint as to how they propose to regularize.

The Phi of PROBLEM 2 is a functional defined for all probability measures on {\mathcal X} = [0,1], say. The assumption is that the data were generated as an i.i.d. sample from some such probability measure mu^*. The goal is to give some estimate of Phi(mu^*). This sits uneasily with the Bayesian paradigm, as it is not possible to put a prior over all probability measures mu^*. Take the special case of the mean, EXAMPLE 3, Phi(mu^*) = \int x dmu^*(x). This is how I would do it: as the observations lie between 0 and 1, we have a uniform (over the set of models) central limit theorem for the mean, so simply use the mean of the data and a standard confidence interval. This is not a regularization of the model but a regularization of the procedure. How does the Bayesian do it? In this case I suspect the Bayesian could simply use the Gaussian model with a prior on the mean and the variance, both say uniform on [0,1]. There would be no claim that the Gaussian model is in any sense adequate (the data could be binomial(1,0.5)), and so it could not really be called Bayesian, or am I missing something?

Suppose now that Phi(mu^*) is the number of local maxima of the density of mu^*, where consideration is restricted to models with a continuous Lebesgue density. This time Phi is not uniformly differentiable but pathologically discontinuous, and so it must be regularized, as in the minimum number of local extreme values consistent with the data. How does this fit into the framework? A regularization based on total variation will not work. See
https://projecteuclid.org/euclid.aos/1085408496
I don't think all this has anything to do with the precision of the data. The basic problem is yet again the pathological discontinuity of the differential operator, which makes any density-based analysis ill-posed and in need of regularization. Using likelihood makes things worse. The likelihood dogma requires calculating the probability of the data under the model. The authors do this using the balls B_{delta}(x_i). The dogma orders the models according to this probability, the higher the better. If the model has a continuous density then, as delta tends to zero, we can approximate this probability by the standard likelihood function. Calculating this probability means calculating the probability that (X_1,...,X_n) lies in B_{delta}(x_1) x B_{delta}(x_2) x ... x B_{delta}(x_n) under the model. Sets of this form shatter any finite set of points in R. My philosophy is to calculate probabilities only for simpler sets such as (-infty,x], more technically for V-C classes of sets.

This is much too technical for a blog, but here is a simple example. You have 100 data points which are positive integers generated as binomial(5000,0.5). Suppose you model the data as N(mu,sigma^2) where mu and sigma are the mean and standard deviation of the data. If you now calculate P(a < X <= b) under N(2500,1250) the results will agree quite well with the empirical frequencies. The probability of the data under this model is zero. The second model is Poisson with parameter lambda = 1. This is hopeless, but under this model the data have a positive probability. The likelihood dogma says that this model is better. But if you calculate P(a < X <= b) under the Poisson model it will be nothing like the empirical frequencies.
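A quick numerical companion to this binomial example (the sample size, the interval endpoints, and the use of scipy are my own choices): fit N(mean, sd^2) to 100 draws from Binomial(5000, 0.5), compare interval probabilities with empirical frequencies, and contrast with Poisson(1), which assigns the data positive probability yet gets the interval probabilities completely wrong.

```python
# Numerical companion to the binomial(5000, 0.5) example: a fitted normal gives
# zero probability to the integer-valued data but excellent interval
# probabilities; Poisson(1) gives the data positive probability but hopeless
# interval probabilities. Sample size and interval are illustrative choices.
import numpy as np
from scipy.stats import norm, poisson

rng = np.random.default_rng(3)
data = rng.binomial(5000, 0.5, size=100)

mu, sigma = data.mean(), data.std(ddof=1)     # fitted normal N(mu, sigma^2)

a, b = 2450, 2550                             # an interval around the bulk of the data
emp = np.mean((data > a) & (data <= b))       # empirical frequency of (a, b]
p_norm = norm.cdf(b, mu, sigma) - norm.cdf(a, mu, sigma)
p_pois = poisson.cdf(b, 1.0) - poisson.cdf(a, 1.0)

print(f"empirical freq of ({a},{b}]:  {emp:.3f}")
print(f"normal     P((a,b]):         {p_norm:.3f}")
print(f"Poisson(1) P((a,b]):         {p_pois:.3e}")   # essentially 0

# "Probability of the data" under each model (the likelihood-dogma ranking):
# continuous normal: P(X = x_i) = 0 for every integer x_i, so the product is 0;
# Poisson(1): each P(X = x_i) is positive (though astronomically small), so > 0.
print("log P(data | Poisson(1)) =", poisson.logpmf(data, 1.0).sum())
```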
Your comment stopped mid-way through.
Some quick comments. Esp. re: the density question and Laurie’s comments. I agree that traditional Bayes is based on densities and that, for the purposes of formulating a *general* framework, this is a Bad Thing.
There are (attempts at?) generalisations of the Bayesian approach which are not density based and which are applicable to the infinite-dimensional case. Some quick thoughts below – I’m not the best person to give a good presentation/summary of this but perhaps I will try to improve on it at some point.
I believe the general idea is to start from the Radon–Nikodym theorem. It is possible, for example, in the infinite-dimensional setting to have no (posterior, prior) densities π1 and π0 with respect to Lebesgue measure, but where a posterior measure µ1 can be constructed which has a Radon–Nikodym derivative with respect to a prior measure µ0. This forms a natural generalisation of Bayes’ theorem which reduces to the usual case in finite dimensions/when densities exist, etc. The Lebesgue decomposition theorem is also probably important for understanding some of the underdetermination issues Laurie raises and what one might think of as ‘singular degrees of freedom’ in the updating.
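In symbols, roughly as in the function-space inverse-problems literature (e.g. the Stuart survey cited below); the potential notation is schematic, and this Φ is the data-misfit potential, not the quantity of interest from the brittleness discussion above:

```latex
% Schematic Radon–Nikodym form of Bayes' theorem on a general (possibly
% infinite-dimensional) space: the posterior mu^y is absolutely continuous
% with respect to the prior mu_0, with density proportional to the likelihood.
\[
  \frac{\mathrm{d}\mu^{y}}{\mathrm{d}\mu_{0}}(u)
  \;\propto\;
  \exp\!\bigl(-\Phi(u; y)\bigr),
\]
% where \Phi(u; y) is the negative log-likelihood ("potential") of the data y
% given the unknown u. When Lebesgue densities exist, this reduces to the usual
% finite-dimensional Bayes' rule.
```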
(Mayo – The philosophical upshot is probably something like: for a Bayesian, the prior is not something to ‘shy away from’ while emphasising the likelihood. The basic ingredients of Bayes are a good prior measure and a good dynamical rule for updating it to a posterior measure on the basis of new information. A very ‘dynamical systems’ point of view. Both the prior measure and the dynamical updating rule are components of ‘the model’, so both can/should be checked etc – ‘Falsificationist Bayes’ I suppose. The Lebesgue decomposition theorem probably emphasises the need to think about the geometry of the updating dynamics a bit more carefully.)
See A.M. Stuart (2010) ‘Inverse problems: A Bayesian perspective’ in Acta Numerica for more on this sort of perspective.