I’m extremely grateful to Drs. Owhadi and Scovel for replying to my request for “a plain Jane” explication of their interesting paper, “When Bayesian Inference Shatters”, and especially for permission to post it. If readers want to ponder the paper awhile and send me comments for guest posts or “U-PHILS*” (by OCT 15), let me know. Feel free to comment as usual in the meantime.
Professor of Applied and Computational Mathematics and Control and Dynamical Systems, Computing + Mathematical Sciences,
California Institute of Technology, USA
California Institute of Technology, USA
“When Bayesian Inference Shatters: A plain Jane explanation”
This is an attempt at a “plain Jane” presentation of the results discussed in the recent arxiv paper “When Bayesian Inference Shatters” located at http://arxiv.org/abs/1308.6306 with the following abstract:
“With the advent of high-performance computing, Bayesian methods are increasingly popular tools for the quantification of uncertainty throughout science and industry. Since these methods impact the making of sometimes critical decisions in increasingly complicated contexts, the sensitivity of their posterior conclusions with respect to the underlying models and prior beliefs is becoming a pressing question. We report new results suggesting that, although Bayesian methods are robust when the number of possible outcomes is finite or when only a finite number of marginals of the data-generating distribution are unknown, they are generically brittle when applied to continuous systems with finite information on the data-generating distribution. This brittleness persists beyond the discretization of continuous systems and suggests that Bayesian inference is generically ill-posed in the sense of Hadamard when applied to such systems: if closeness is defined in terms of the total variation metric or the matching of a finite system of moments, then (1) two practitioners who use arbitrarily close models and observe the same (possibly arbitrarily large amount of) data may reach diametrically opposite conclusions; and (2) any given prior and model can be slightly perturbed to achieve any desired posterior conclusions.”
Now, it is already known from classical Robust Bayesian Inference that Bayesian Inference has some robustness if the random outcomes live in a finite space or if the class of priors considered is finite-dimensional (i.e. what you know is infinite and what you do not know is finite). What we have shown is that if the random outcomes live in an approximation of a continuous space (for instance, when they are decimal numbers given to finite precision) and your class of priors is finite co-dimensional (i.e. what you know is finite and what you do not know may be infinite) then, if the data is observed at a fine enough resolution, the range of posterior values is the deterministic range of the quantity of interest, irrespective of the size of the data.
A good way to understand this is through a simple example: Assume that you want to estimate the mean Eμ†[X] of some random variable X with respect to some unknown distribution μ† on the interval [0,1] based on the observation of n i.i.d. samples, given to finite resolution δ, from the unknown distribution μ†. The Bayesian answer to that problem is to assume that μ† is the realization of some random measure distributed according to some prior π (i.e. μ ~ π) and then compute the posterior value of the mean by conditioning on the data. Now to specify the prior π you need to specify the distribution of all the moments of μ (i.e. the distribution of the infinite-dimensional vector (Eμ[X], Eμ[X²], Eμ[X³],…)). So a natural way to assess the sensitivity of the Bayesian answer with respect to the choice of prior is to specify the distribution ℚ of only a large, but finite, number of moments of μ (i.e. to specify the distribution of (Eμ[X], Eμ[X²],…, Eμ[Xᵏ]), where k can be arbitrarily large). This defines a class of priors Π and our results show that no matter how large k is, no matter how large the number of samples n is, for any ℚ that has a positive density with respect to the uniform distribution on the first k moments, if you observe the data at a fine enough resolution, then the minimum and maximum of the posterior value of the mean over the class of priors Π are 0 and 1.
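For readers who prefer symbols, the statement in this example can be written compactly as follows. This is only a restatement under assumed notation: d_δ (the n samples as observed at resolution δ) is shorthand introduced here, not notation from the post, and infimum and supremum stand in for the "minimum and maximum" mentioned above. The second line is meant to hold once δ is small enough.

```latex
\[
\Pi = \Bigl\{ \pi \;:\; \bigl(\mathbb{E}_{\mu}[X],\, \mathbb{E}_{\mu}[X^2],\, \dots,\, \mathbb{E}_{\mu}[X^k]\bigr) \sim \mathbb{Q}
\ \text{ when } \mu \sim \pi \Bigr\}
\]
\[
\inf_{\pi \in \Pi} \mathbb{E}_{\mu \sim \pi}\bigl[\, \mathbb{E}_{\mu}[X] \,\big|\, d_\delta \,\bigr] = 0,
\qquad
\sup_{\pi \in \Pi} \mathbb{E}_{\mu \sim \pi}\bigl[\, \mathbb{E}_{\mu}[X] \,\big|\, d_\delta \,\bigr] = 1 .
\]
```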
It is important to note that these brittleness theorems concern the whole of the posterior distribution and not just its expected value, since by simply raising the quantity of interest to an arbitrary power, we obtain brittleness with respect to all higher-order moments. Moreover, since the quantity of interest may be any (measurable) function of the data-generating distribution, in the example above, we would get the same brittleness results if instead of estimating the mean we estimate the median, some other quantile, or the probability of an event of interest.
Instead of moments or other finite-dimensional features, we can also consider perturbations of the model quantified by a metric on the space of probability measures. In this case, our results show that, for any perturbation level α > 0, no matter the size of the data, for any parametric Bayesian model, there is another Bayesian model that is at distance at most α from the first one in the Prokhorov or Total Variation (TV) metric and that leads to diametrically opposite conclusions in the sense described above.
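The mechanism behind this kind of result can be illustrated with a toy calculation. The sketch below is not the construction used in the paper; it is a deliberately simple two-component prior, and every name in it (`posterior_mean`, `alpha`, `w`, `target`) is hypothetical. It assumes a baseline model that is uniform on [0,1] (mean 0.5) and a perturbing component, given prior mass `alpha`, that is built after seeing the data: it spreads a small mass `w` uniformly over the observed cells of width `delta` and puts the rest at a chosen `target` value. As `delta` shrinks, the baseline likelihood (`delta**n`) collapses while the perturbing component's likelihood (`(w/n)**n`) does not, so the posterior mean moves from about 0.5 toward `target`.

```python
import math


def posterior_mean(data, delta, alpha, w, target):
    """Posterior mean of E[X] under a toy two-component prior.

    Assumptions (all illustrative, not from the paper):
      - each observation is reported as the centre of a cell of width `delta`,
        and the n observations fall in n distinct cells;
      - component 0 (prior mass 1 - alpha): uniform on [0, 1], mean 0.5;
      - component 1 (prior mass alpha): mass `w` spread uniformly over the
        n observed cells, mass 1 - w at the point `target`.
    """
    n = len(data)
    # Log-likelihood of the observed cells under each component.
    log_lik_base = n * math.log(delta)   # each cell has probability delta
    log_lik_pert = n * math.log(w / n)   # each cell has probability w / n

    # Unnormalised log posterior weights of the two components.
    log_w0 = math.log(1.0 - alpha) + log_lik_base
    log_w1 = math.log(alpha) + log_lik_pert

    # Normalise in a numerically stable way.
    m = max(log_w0, log_w1)
    e0, e1 = math.exp(log_w0 - m), math.exp(log_w1 - m)
    p1 = e1 / (e0 + e1)

    mean_base = 0.5
    mean_pert = w * (sum(data) / n) + (1.0 - w) * target
    return (1.0 - p1) * mean_base + p1 * mean_pert


if __name__ == "__main__":
    data = [0.1, 0.3, 0.5, 0.7, 0.9]
    for delta in (1e-2, 1e-4, 1e-6):
        hi = posterior_mean(data, delta, alpha=1e-3, w=0.01, target=1.0)
        lo = posterior_mean(data, delta, alpha=1e-3, w=0.01, target=0.0)
        print(f"delta={delta:g}: target 1 -> {hi:.4f}, target 0 -> {lo:.4f}")
```

With the same data and the same small `alpha`, the coarse resolution leaves the answer near 0.5, while the fine resolution lets the perturbing component dominate in whichever direction was chosen.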
Since, as noted by G. E. P. Box, for complex systems, all models are mis-specified, these brittleness results suggest that, in the absence of a rigorous accuracy/performance analysis, the guarantee on the accuracy of Bayesian Inference for complex systems is similar to what one gets by arbitrarily picking a point between the minimum and maximum value of the quantity of interest.
Now concerning using closeness in Kullback–Leibler (KL) divergence rather than Prokhorov or TV, observe that closeness in KL divergence is not something you can test with discrete data, but you can test closeness in TV or Prokhorov. Moreover, the assumption of closeness in KL divergence requires the non-singularity of the data generating distribution with respect to the Bayesian model, which could be a very strong assumption if you are trying to certify the safety of a critical system. Indeed, when performing Bayesian analysis on function spaces, e.g. for studying PDE solutions, as is now increasingly popular, results like the Feldman–Hajek Theorem tell us that “most” pairs of measures are mutually singular, and hence at KL distance infinity from one another. Observe also that if the distance in TV between two Bayesian models is smaller than, say, 10⁻⁹, then those models are nearly indistinguishable (you will not be able to test the difference between them without a ridiculously large amount of data). So the brittleness theorems state that for any Bayesian model there is a second one, nearly indistinguishable from the first, achieving any desired posterior value within the deterministic range of the quantity of interest.
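To put a rough number on “nearly indistinguishable”, one can use a standard bound that is not specific to the paper: if two distributions are at TV distance t, then n i.i.d. samples from each are at TV distance at most 1 − (1 − t)ⁿ, and no test based on those samples can have a gap between its true-positive and false-positive rates larger than that. The helper below is a minimal sketch of this arithmetic; the function name is hypothetical.

```python
import math


def max_test_advantage(tv: float, n: int) -> float:
    """Upper bound 1 - (1 - tv)**n on the TV distance between the n-fold
    product distributions, i.e. on the best achievable gap between
    true-positive and false-positive rates of any test using n samples.
    Computed with log1p/expm1 to stay accurate for very small `tv`.
    """
    return -math.expm1(n * math.log1p(-tv))


if __name__ == "__main__":
    for n in (10**3, 10**6, 10**9):
        print(n, max_test_advantage(1e-9, n))
```

Under this bound, a TV distance of 10⁻⁹ leaves a test with a million samples an advantage of at most about one in a thousand; the bound only becomes non-negligible once the sample size is on the order of a billion.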
To summarize, our understanding of these results is as follows: the current robust Bayesian framework (sensitivity analysis posterior to the observation of the data) leads to brittleness under finite information or local mis-specification. This situation might be remedied by computing robustness estimates prior to the observation of the data with respect to the worst-case scenario of what the data generating distribution could be. We are currently working on this while pursuing the goal of developing a mathematical framework that can reduce the task of developing optimal statistical estimators and models to an algorithm.
We do not think that this is the end of Bayesian Inference, but we hope that these results will help stimulate the discussion on the necessity to couple posterior estimates with rigorous performance/accuracy analysis. Indeed, it is interesting to note that the situations where Bayesian Inference has been successful can be characterized by the presence of some kind of “non-Bayesian” feedback loop on its accuracy. In other words, as it currently stands, although Bayesian Inference offers an easy and powerful “tool” to build/design statistical estimators, it does not say anything (by itself) about the statistical accuracy of these estimators if the model is not exact.
The main examples that come to mind are those where the accuracy of posteriors given by Bayesian Inference cannot be tested by repeating experiments or through the availability of additional samples (climate modeling, catastrophic risk assessment, etc.). The main consequence is that, for such systems, risk may be severely (dangerously, ruinously) underestimated.
Houman and Clint
*“U-PHILS” = “U-Philosophize”. Info and exemplars at the link.
This is all a bit hard to follow for the lay person. But, I think I am seeing a point made which relates to the naivete of blind reliance on the strong likelihood principle in Bayesian inference. That is, the need for post-hoc evaluation of the posteriors seems to torpedo the notion that the data and model and the prior tell us what we need to know. Am I off base?
I hope this does shatter naïve Bayesianism. But maybe something can be reconstructed from the pieces. Contemporary Bayesianism often seems to claim a lot, and to be fragile. Maybe if it claimed less it could be robust.
I note some related work and ideas at my blog,
I will give it some more thought. Regards.
Houman, Clint and Tim: Thanks so much for sending this intriguing guest post, and so soon! I was about to post a new “Saturday night comedy at the Bayesian retreat” when this came through, so I went with this instead! Earlier comments (by Corey and Christian Hennig) when I was running this by last week may be found here:
I’m wondering how this compares to a non-Bayesian treatment; I’ve always suspected Bayesian success stories depended on non-Bayesian feedback loops. But is there actually a danger of observing the data at a fine enough resolution? Are you spared if you don’t? (Naive philosopher’s questions)
It’s nice to see you writing directly on this blog!
I wonder whether the following statement is a bit misleading: “It is important to note that these brittleness theorems concern the whole of the posterior distribution and not just its expected value, since by simply raising the quantity of interest to an arbitrary power, we obtain brittleness with respect to all higher-order moments.” It is true that the discontinuity of the expectation as functional of distributions extends to all higher order moments, but still I’d distinguish this from what I’d call “the whole of the posterior distribution”, because a number of things that people could do with the posterior don’t seem to be affected by this, such as the posterior median or credibility intervals as long as they are not centered about the expectation. I’d think that the fact that moments of a distribution don’t imply information about the probability of sets as long as the sets don’t depend on moments (or the probability is one) makes such a theorem possible in the first place. Or am I wrong?
“We do not think that this is the end of Bayesian Inference”
I expect it’ll suffer the same fate as that fragile flower mechanics did after it was similarly “shattered”: