I’m about to post an update of this most-viewed blogpost, so I am reblogging it here as a refresher. If you are interested, you might check the original discussion.
I am grateful to Drs. Owhadi and Scovel for replying to my request for “a plain Jane” explication of their interesting paper, “When Bayesian Inference Shatters”, and especially for permission to post it.
Professor of Applied and Computational Mathematics and Control and Dynamical Systems, Computing + Mathematical Sciences,
California Institute of Technology, USA
California Institute of Technology, USA
“When Bayesian Inference Shatters: A plain Jane explanation”
This is an attempt at a “plain Jane” presentation of the results discussed in the recent arxiv paper “When Bayesian Inference Shatters” located at http://arxiv.org/abs/1308.6306 with the following abstract:
“With the advent of high-performance computing, Bayesian methods are increasingly popular tools for the quantification of uncertainty throughout science and industry. Since these methods impact the making of sometimes critical decisions in increasingly complicated contexts, the sensitivity of their posterior conclusions with respect to the underlying models and prior beliefs is becoming a pressing question. We report new results suggesting that, although Bayesian methods are robust when the number of possible outcomes is finite or when only a finite number of marginals of the data-generating distribution are unknown, they are generically brittle when applied to continuous systems with finite information on the data-generating distribution. This brittleness persists beyond the discretization of continuous systems and suggests that Bayesian inference is generically ill-posed in the sense of Hadamard when applied to such systems: if closeness is defined in terms of the total variation metric or the matching of a finite system of moments, then (1) two practitioners who use arbitrarily close models and observe the same (possibly arbitrarily large amount of) data may reach diametrically opposite conclusions; and (2) any given prior and model can be slightly perturbed to achieve any desired posterior conclusions.”
Now, it is already known from classical Robust Bayesian Inference that Bayesian Inference has some robustness if the random outcomes live in a finite space or if the class of priors considered is finite-dimensional (i.e. what you know is infinite and what you do not know is finite). What we have shown is that if the random outcomes live in an approximation of a continuous space (for instance, when they are decimal numbers given to finite precision) and your class of priors is finite co-dimensional (i.e. what you know is finite and what you do not know may be infinite) then, if the data is observed at a fine enough resolution, the range of posterior values is the deterministic range of the quantity of interest, irrespective of the size of the data.
A good way to understand this is through a simple example. Assume that you want to estimate the mean E_μ†[X] of some random variable X with respect to some unknown distribution μ† on the interval [0,1], based on the observation of n i.i.d. samples, given to finite resolution δ, from the unknown distribution μ†. The Bayesian answer to that problem is to assume that μ† is the realization of some random measure distributed according to some prior π (i.e. μ ~ π) and then compute the posterior value of the mean by conditioning on the data. Now to specify the prior π you need to specify the distribution of all the moments of μ (i.e. the distribution of the infinite-dimensional vector (E_μ[X], E_μ[X^2], E_μ[X^3], …)). So a natural way to assess the sensitivity of the Bayesian answer with respect to the choice of prior is to specify the distribution ℚ of only a large, but finite, number of moments of μ (i.e. to specify the distribution of (E_μ[X], E_μ[X^2], …, E_μ[X^k]), where k can be arbitrarily large). This defines a class of priors Π, and our results show that no matter how large k is, no matter how large the number of samples n is, for any ℚ that has a positive density with respect to the uniform distribution on the first k moments, if you observe the data at a fine enough resolution, then the minimum and maximum of the posterior value of the mean over the class of priors Π are 0 and 1.
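For readers who prefer symbols, the statement in the example above can be written roughly as follows. This is only a restatement of the paragraph, not the theorem as it appears in the paper: the map Ψ_k and the notation d_1, …, d_n for the samples observed at resolution δ are labels introduced here for illustration, and "≈" stands in for "at a fine enough resolution δ".

```latex
% Psi_k collects the first k moments of a candidate measure mu on [0,1].
\Psi_k(\mu) = \bigl(\mathbb{E}_\mu[X],\ \mathbb{E}_\mu[X^2],\ \dots,\ \mathbb{E}_\mu[X^k]\bigr)

% Pi is the class of priors under which those k moments have distribution Q.
\Pi = \bigl\{\, \pi \;:\; \Psi_k(\mu) \sim \mathbb{Q} \ \text{when}\ \mu \sim \pi \,\bigr\}

% Brittleness: over Pi, the posterior mean sweeps the whole range [0,1].
\inf_{\pi \in \Pi} \mathbb{E}_{\mu \sim \pi}\bigl[\, \mathbb{E}_\mu[X] \,\big|\, d_1, \dots, d_n \,\bigr] \approx 0,
\qquad
\sup_{\pi \in \Pi} \mathbb{E}_{\mu \sim \pi}\bigl[\, \mathbb{E}_\mu[X] \,\big|\, d_1, \dots, d_n \,\bigr] \approx 1
```

Note that the prior distribution of the mean itself is fixed by ℚ (it is the first moment); it is only the posterior value that is left unconstrained.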
It is important to note that these brittleness theorems concern the whole of the posterior distribution and not just its expected value, since by simply raising the quantity of interest to an arbitrary power, we obtain brittleness with respect to all higher-order moments. Moreover, since the quantity of interest may be any (measurable) function of the data-generating distribution, in the example above, we would get the same brittleness results if instead of estimating the mean we estimate the median, some other quantile, or the probability of an event of interest.
Instead of moments or other finite-dimensional features, we can also consider perturbations of the model quantified by a metric on the space of probability measures. In this case, our results show that, for any perturbation level α > 0, no matter the size of the data, for any parametric Bayesian model, there is another Bayesian model that is at distance at most α from the first one in the Prokhorov or Total Variation (TV) metric leading to diametrically opposite conclusions in the sense described above.
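A toy numerical sketch may help convey why finer resolution makes things worse rather than better. It is not the construction used in the paper, and every name and number in it is a made-up assumption: the base prior puts all its mass on Uniform[0,1], and the perturbed prior moves a tiny mass `eps` (so the two priors are at TV distance `eps`) onto a "spiky" measure that places mass `w/n` on each observed bin and the rest on a chosen `target` point. The perturbation is chosen with knowledge of the data, which is what taking a worst case over a class of priors permits.

```python
import math

# Hypothetical observations in [0, 1]; they must fall in distinct bins of width delta.
DATA = [0.11, 0.23, 0.35, 0.42, 0.48, 0.52, 0.61, 0.67, 0.79, 0.88]


def _sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def posterior_mean(data, delta, eps, target, w=0.01):
    """Posterior value of the mean under a two-point prior on measures.

    Prior: weight (1 - eps) on Uniform[0, 1], weight eps on a spiky measure
    with mass w/n on each observed bin and mass (1 - w) at `target`.
    """
    n = len(data)
    # Under Uniform[0, 1] every bin of width delta has probability delta.
    log_lik_base = n * math.log(delta)
    # Under the spiky measure every observed bin has probability w/n,
    # no matter how small delta is.
    log_lik_spiky = n * math.log(w / n)
    # Posterior weight of the spiky measure (Bayes' rule in log space).
    log_a = math.log(eps) + log_lik_spiky
    log_b = math.log1p(-eps) + log_lik_base
    weight_spiky = _sigmoid(log_a - log_b)
    mean_base = 0.5
    # Bin centres are approximated by the data values themselves.
    mean_spiky = w * sum(data) / n + (1.0 - w) * target
    return (1.0 - weight_spiky) * mean_base + weight_spiky * mean_spiky


if __name__ == "__main__":
    for delta in (1e-2, 1e-4, 1e-6):
        up = posterior_mean(DATA, delta, eps=1e-9, target=1.0)
        down = posterior_mean(DATA, delta, eps=1e-9, target=0.0)
        print(f"delta={delta:g}  towards 1: {up:.4f}  towards 0: {down:.4f}")
```

In this sketch the likelihood of the data under the base model shrinks like δ^n while the likelihood under the spiky measure does not depend on δ, so at coarse resolution the posterior mean stays near 0.5 and at fine resolution it is driven close to whichever target was chosen.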
Since, as noted by G. E. P. Box, for complex systems, all models are mis-specified, these brittleness results suggest that, in the absence of a rigorous accuracy/performance analysis, the guarantee on the accuracy of Bayesian Inference for complex systems is similar to what one gets by arbitrarily picking a point between the minimum and maximum value of the quantity of interest.
Now concerning using closeness in Kullback–Leibler (KL) divergence rather than Prokhorov or TV, observe that closeness in KL divergence is not something you can test with discrete data, but you can test closeness in TV or Prokhorov. Moreover, the assumption of closeness in KL divergence requires the non-singularity of the data generating distribution with respect to the Bayesian model, which could be a very strong assumption if you are trying to certify the safety of a critical system. Indeed, when performing Bayesian analysis on function spaces, e.g. for studying PDE solutions as is now increasingly popular, results like the Feldman–Hajek Theorem tell us that “most” pairs of measures are mutually singular, and hence at KL distance infinity from one another. Observe also that if the distance in TV between two Bayesian models is smaller than, say, 10^-9, then those models are nearly indistinguishable (you will not be able to test the difference between them without a ridiculously large amount of data). So the brittleness theorems state that for any Bayesian model there is a second one, nearly indistinguishable from the first, achieving any desired posterior value within the deterministic range of the quantity of interest.
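The contrast between TV and KL can be seen on a small discrete example. The sketch below is only an illustration with invented numbers: `model` and `truth` are hypothetical five-outcome distributions, and `truth` puts a mass `alpha` of 10^-9 on an outcome to which `model` assigns probability zero.

```python
import math


def total_variation(p, q):
    """Total variation distance between two discrete distributions."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))


def kl_divergence(p, q):
    """KL(p || q) in nats; infinite when p puts mass where q has none."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0.0:
            continue  # the term 0 * log(0 / q) is taken to be 0
        if qi == 0.0:
            return math.inf  # p is not absolutely continuous w.r.t. q
        total += pi * math.log(pi / qi)
    return total


alpha = 1e-9
# Hypothetical model: uniform on four outcomes, zero mass on the fifth.
model = [0.25, 0.25, 0.25, 0.25, 0.0]
# Hypothetical data-generating distribution: the model with mass alpha
# moved onto the outcome the model rules out.
truth = [(1.0 - alpha) * m for m in model]
truth[-1] = alpha

if __name__ == "__main__":
    print("TV(truth, model)    =", total_variation(truth, model))
    print("KL(truth || model)  =", kl_divergence(truth, model))
    print("KL(model || truth)  =", kl_divergence(model, truth))
```

Here the two distributions are at TV distance `alpha`, so telling them apart would take on the order of a billion samples, yet the KL divergence of the data-generating distribution from the model is infinite. Assuming closeness in KL therefore rules out exactly this kind of tiny, practically undetectable mis-specification.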
To summarize, our understanding of these results is as follows: the current robust Bayesian framework (sensitivity analysis posterior to the observation of the data) leads to Brittleness under finite-information or local mis-specification. This situation might be remedied by computing robustness estimates prior to the observation of the data with respect to the worst case scenario of what the data generating distribution could be. We are currently working on this while pursuing the goal of developing a mathematical framework that can reduce the task of developing optimal statistical estimators and models into an algorithm.
We do not think that this is the end of Bayesian Inference, but we hope that these results will help stimulate the discussion on the necessity to couple posterior estimates with rigorous performance/accuracy analysis. Indeed, it is interesting to note that the situations where Bayesian Inference has been successful can be characterized by the presence of some kind of “non-Bayesian” feedback loop on its accuracy. In other words, as it currently stands, although Bayesian Inference offers an easy and powerful “tool” to build/design statistical estimators, it does not say anything (by itself) about the statistical accuracy of these estimators if the model is not exact.
The main examples that come to mind are those where the accuracy of posteriors given by Bayesian Inference cannot be tested by repeating experiments, or through the availability of additional samples (climate modeling, catastrophic risk assessment, etc.). The main consequence is that for such systems risk may be severely (dangerously, ruinously) underestimated.
Houman and Clint
*“U-PHILS” = “U-Philosophize”. Info and exemplars at the link.
Any Jane I know who could understand this material would not be plain in any sense I understand. Many thanks to the authors for further description of their research findings; such explorations into model and algorithm performance characteristics are indeed important. I will update when I find a Jane that knows something about Prokhorov.
“We do not think that this is the end of Bayesian Inference, but we hope that these results will help stimulate the discussion on the necessity to couple posterior estimates with rigorous performance/accuracy analysis. Indeed, it is interesting to note that the situations where Bayesian Inference has been successful can be characterized by the presence of some kind of “non-Bayesian” feedback loop on its accuracy. In other words, as it currently stands, although Bayesian Inference offers an easy and powerful “tool” to build/design statistical estimators, it does not say anything (by itself) about the statistical accuracy of these estimators if the model is not exact.”
Indeed, since a Bayesian approach is rarely the only way to characterize a model of a system, Bayesian approaches can and should be compared to other approaches, and reasonable error statistical measures of performance characteristics should be presented. Frequentist methods in which the ratio of model parameters to data does not tend to zero as the amount of data grows to infinity show poor performance characteristics, and there’s no sensible reason to believe that Bayesian methods with complex priors will somehow avoid this problem. These authors show mathematically situations in which Bayesian methods do exhibit this problem.
“The main examples that come to mind are those where the accuracy of posteriors given by Bayesian Inference cannot be tested by repeating experiments, or through the availability of additional samples (climate modeling, catastrophic risk assessment, etc.). The main consequence is that for such systems risk may be severely (dangerously, ruinously) underestimated.”
I find the claim that Bayesian inference methods cannot be tested by repeating experiments to be disingenuous. (I’m not saying these authors make this claim, but I have seen this line of “reasoning” displayed elsewhere.) Any situation that will never repeat is a one-off event that can only be anecdotally described. Since it only happened once, we don’t have to worry about it, because it will never happen again. No science can be done on a one-off event. What does it even mean to have a prior and a posterior for a single event that never repeats again?
Science is all about developing models that reasonably characterize the mechanisms underlying repeatable events. Wherever Bayesian methods are deployed to characterize repeating events, they certainly can have their performance characteristics examined. When early efforts are underway, and the amount of data available upon which to condition is “small” relative to the unknowns that characterize the data generation mechanism, then posteriors can only honestly be described as exploratory.
Climate modeling is only useful for weather patterns that repeat. It may occur that a system generating weather patterns is a chaotic one, which no statistical model can accurately predict far into the future – but even in chaotic systems, statistics can be used to understand the window width within which any useful predictions can ensue for a chaotic model. Stormy weather patterns can only be characterized and predicted over a few hours, balmy mild weather patterns characterized and predicted over a few days, etc. CO2 patterns have risen and fallen multiple times in geological history, so we do have multiple events to assess model performance with respect to modern CO2 escalation.
Risk assessment for earthquakes, floods and other such major problems will continue because they do recur – so proposed Bayesian models can have their utility checked by seeing how well they perform for future such events.
My measure theory chops are below the paygrade needed to thoroughly vet the measure-theoretic arguments presented in this paper, but it’s been on arXiv for over a year and I was not able to find any rebuttals describing any flaws in their mathematical reasoning – so this is an important and useful report on an issue that really isn’t that surprising and that we should all be on the lookout for.