# Wasserman on Wasserman: Update! December 28, 2013

Professor Larry Wasserman

I had invited Larry to give an update, and I’m delighted that he has! The discussion relates to the last post (by Spanos), which follows upon my deconstruction of Wasserman*. So, for your Saturday night reading pleasure, join me** in reviewing this and the past two blogs and the links within.

“Wasserman on Wasserman: Update! December 28, 2013”

My opinions have shifted a bit.

My reference to Franken’s joke suggested that the usual philosophical debates about the foundations of statistics were unimportant, much like the debate about media bias. I was wrong on both counts.

First, I now think Franken was wrong. CNN and network news have a strong liberal bias, especially on economic issues. FOX has an obvious right-wing and anti-atheist bias. (At least FOX has some libertarians on the payroll.) And this does matter, because people believe what they see on TV and what they read in the NY Times. Paul Krugman’s socialist bullshit parading as economics has brainwashed millions of Americans. So media bias is much more than who makes better hummus.

Similarly, the Bayes-Frequentist debate still matters. And people — including many statisticians — are still confused about the  distinction. I thought the basic Bayes-Frequentist debate was behind  us. A year and a half of blogging (as well as reading other blogs)  convinced me I was wrong here too. And this still does matter.

My emphasis on high-dimensional models is germane, however. In our  world of high-dimensional, complex models I can’t see how anyone can  interpret the output of a Bayesian analysis in any meaningful way.

I wish people were clearer about what Bayes is/is not and what  frequentist inference is/is not. Bayes is the analysis of subjective  beliefs but provides no frequency guarantees. Frequentist inference  is about making procedures that have frequency guarantees but makes no  pretense of representing anyone’s beliefs. In the high dimensional  world, you have to choose: objective frequency guarantees or  subjective beliefs. Choose whichever you prefer, but you can’t have  both. I don’t care which one people pick; I just wish they would be  clear about what they are giving up when they make their choice.

In your blog, Deborah, you mentioned these papers

http://arxiv.org/abs/1308.6306

http://arxiv.org/abs/1304.6772

by Houman Owhadi, Clint Scovel and Tim Sullivan [Ed: See this post.]  And then there is this paper

http://arxiv.org/abs/1306.4943

by Gordon Belot (“Failure of Calibration is Typical”).

These challenges to Bayesian inference remain unanswered in my  opinion. In fact, I think Freedman’s Theorem (1965, Annals, p 454)  still remains adequately unanswered.

Of course, one can embrace objective Bayesian inference. If this means “Bayesian procedures with good frequentist properties” then I  am all for it. But this is just frequentist inference in Bayesian  clothing. If it means something else, I’d like to know what.

*I had intended to post a Wasserman update, were I lucky enough to get one, after running the Gelman and Hennig deconstructions from last year, but since Wasserman has surprised me, I’m reversing the order. Two additional related posts are below. I invite the other discussants to share their current reflections and updates, whenever they wish. Normal Deviate is back!

Wasserman (initial) response to Mayo’s deconstruction: https://errorstatistics.com/2012/08/13/u-phil-concluding-the-deconstruction-wasserman-mayo/

**True, it’s all work here in exile, aside from an occasional visit to the Elbar Room!

*************************************************************************

Categories: Error Statistics, frequentist/Bayesian, Statistics, Wasserman | 56 Comments

### 56 thoughts on “Wasserman on Wasserman: Update! December 28, 2013”

1. Normal Deviate: I’m so glad to have your frank update. It’s great, and I agree with all you say, although I’d extend the point to models of any dimension. My problems with so-called objective Bayesian inference are (a) the ones Kass and Wasserman (1996)* bring out for all “default” or conventional Bayesian priors, (b) the fact that I find them schizophrenic in their claimed rationales (we want to put in background information but we want uninformative priors, at least until you come up with your subjective priors, and once you do, we’ll put them in), and (c) the frequentist guarantees aren’t the relevant ones (or the most relevant ones) for making inferences about the particular hypothesis or claim. Even when there’s a match of numbers, I don’t see that they supply the interpretation that’s wanted for scientific inference (i.e., in terms of how well-tested the particular claim is, by the method giving rise to the data). That’s different from screening.

They also encourage another fashionable hybrid: the posterior probabilities of “false positives”, based on presumed relative frequencies of true nulls, coupled with recipe-like “up-down” significance testing.

I very much hope that the Normal Deviate will consider a monthly post on the “error statistics philosophy” blog!

*http://www.stat.cmu.edu/~kass/papers/rules.pdf

2. OK, I left in “bullshit” in the post rather than substitute with “B.S.”, so what’s going to happen to me? (I see “math babe” freely using 4 letter words on her blog.) Anyway, one does wonder if anyone takes Krugman seriously any more.

• Simon Wren-Lewis, Mark Thoma, Brad DeLong, and Noah Smith are four econbloggers I read who take Krugman seriously. Larry Summers took him seriously enough to engage in a debate in 2011; and note that although Summers won that debate, he now argues for the position Krugman took in it.

I really don’t get the hate for Krugman. Is it a social affiliation thing? Is it his tone? The “fact” that he never admits he’s wrong (not an actual fact)?

• kenmccue

Paul  Krugman’s socialist bullshit parading as economics has brainwashed  millions of Americans.

I wonder if anyone would take an economist seriously if he/she started throwing around terms like UMVU and MLE in ways that were totally discordant with the accepted definitions. Socialism is an economic theory that holds society is best served if the “means of production” are owned by the state, or people, or whatever. Socialism isn’t really practiced anywhere anymore, except maybe Cuba or North Korea. Social democracy or social welfare is practiced by every country in the G7 (or G20, for that matter). An expansion of the welfare state is really what Krugman is arguing for; this is not socialism (except in the sense that Roosevelt, for example, was a socialist or communist; see Al Smith’s speech on one capital, either Washington or Moscow, for an example of this). In this sense government support for statistical research is also socialism. And maybe bullshit.

A distinction must also be made between Krugman’s economic models and Krugman’s musings as a pundit. The economic models, which rely on classic upward-downward response curves, are relatively simple (see http://krugman.blogs.nytimes.com/2013/11/29/on-the-importance-of-little-arrows-wonkish) and can be understood by those with just a smattering of economic knowledge (a course or two in macroeconomics). These are often simple enough that they are presented in Krugman’s blog. In fact, while directed towards a larger audience than Wasserman’s blog was, they served an analogous purpose: presenting ideas in the simplest form (consistent with the demands of accurately presenting the idea) to disseminate those ideas more widely. The proper response of those disagreeing with Krugman’s models is to critique the model, not label it with pejoratives (perhaps the Bayesian/frequentist dialectic has influenced the conceptualization of opposing ideas for statisticians, or perhaps it is simply a tendency of human nature to disparage). I personally think Krugman is more accurate than most economists, and not on the basis of any particular theory, but rather of an input-output analysis (pioneered by Leontief) that was used extensively during the Cold War to analyze the Soviet economy. This was used since the internal processes were closed to Western analysts, and I would suggest that the internal processes of our society are often closed (or are the subject of much disagreement). What goes in and what goes out is, however, much more quantifiable.

• Regardless of any claimed accuracy of his models, the criticism of his defensive moves, and especially his dismissals of other positions, would still stand. Of course, it has become so common now (in politics) to argue by begging the question and straw men, that I suspect many people don’t notice it. That’s the real danger: it becomes the accepted mode of argument.

3. You might add some free market economists to your blog list, Corey. Try Don Boudreaux (Cafe Hayek) or John Cochrane or Walter Williams or Greg Mankiw or the Cato Institute or Reason Magazine … There are plenty of economists (some with Nobel prizes as well) who think Krugman has long ago given up making intellectual arguments. Krugman’s view is: if you disagree with him (Krugman) then you are either cruel or stupid. There is no room for intellectual disagreement. Sorry Deborah: I should stick to statistics.

• LW: I’ve read Cochrane and Mankiw, but I gave up on them — Cochrane seems to be entirely unaware of saltwater notions, and Mankiw seems given to writing his bottom line first and filling in the argument with whatever will stick. I’m Canadian, so I get to see various leftish policy ideas in practice; to me (and most non-Americans, I think) libertarian economic thought à la Cato, Reason, Hayek’s modern followers seems infected with inaccurate premises.

• I am Canadian too, Corey. Anyway, we’ll just have to agree to disagree. (This is why we should never discuss politics on a scientific blog.)

• Fair enough.

• I think Krugman, like many economists, sets himself up so that there’s no chance of his being wrong; you can always interpret the situation so as to “confirm” his position. He’s just much more obnoxious and arrogant about it. But really, I rarely read his editorials beyond the caption any more. My guess is that these shifts are somehow related (or at least correlated) to the general shifts in politics/economics a decade or so after the Al Franken book in Wasserman’s post, and certainly post-crash.

• “I think Krugman, like many economists, sets himself up so that there’s no chance of his being wrong; you can always interpret the situation so as to “confirm” his position.”

If you were to read his columns, you would know this is false to fact.

• Corey: I’m afraid not, I do read enough of them to know…really I don’t care either. He’s an entertainer, entertainers have a right to their schticks.

• Krugman today: “But as a europessimist, I do have to admit that it’s now possible to see how this could work.”

It’s a grudging admission in the last paragraph with many caveats; nevertheless it’s an actual real reversal prompted by data.

• Corey: well, he said he was reading my blog and wanted to clean up his image just a bit.

• Larry: Do you know anything about current reactions to, status of, the results by Houman Owhadi, Clint Scovel and Tim Sullivan? Are they deemed relevant for practice? (I heard some people downplay the results as not of practical concern.)

• It isn’t as if statistical controversies are less heated than the political ones. I’ve noticed a website:
http://www.krugmaniswrong.com

• kenmccue

I just glanced at this website. The first entry is on Chinese Ghost Cities (cities and other public works built by the government and not well utilized) and a rant against central planning and government investment (the two are conflated). There is no analysis of when government spending is useful and when it is not (once again, government funding of statistical research is, by this definition, wasteful–I don’t see anyone propounding right-wing or libertarian views in this exchange arguing against that funding).

The next entry is unbelief that Krugman can argue for ignoring the debt (ICYMI: Krugman lost MSNBC) and in particular unfunded liabilities in entitlement spending. The arguments really do beg the question–since debt is bad, anyone who argues against making it the top priority is an “arrogant prick” (in the entry). The other side of this (which is not given) is that countries (Nazi Germany, England (Napoleonic Wars, WWI and II)) have run up huge debts and functioned well (too well in the case of Germany).

The next entry is on the new Star Trek Movie. I’m making an equivalence here between those who post on this site and the “L5 in 95” crowd (remember them?) I used to run into at Caltech.

The next entry is “Breaking News: Krugman is STILL wrong!” No facts here, but a quote is given: “Krugman is a real asshole, seriously, the guys a complete jerk. A total asshole.”

The last entry is anger about a column Krugman wrote on the use of 9/11 in American politics. This entry is dated September 12, 2011. So we have 5 entries over 2 years, with few facts, many ad hominem attacks, and all arguments based on the premise that particular theories of government spending and debt are true. Why is this website being presented as anything other than an inchoate screed by an individual who is lacking in rudimentary intellectual skills?

I’ll make a more general comment. The posts relating to statistics on this blog, Wasserman’s blog, etc. are well thought out and often useful summaries of current/interesting work in statistics (particularly Wasserman’s, which is why I’ll regret it disappearing). The comments on economics/social policy in this thread do not have these characteristics, yet they are apparently felt strongly. This is a microcosm of the problems in the making of governmental policy: people have deeply-held normative beliefs about what people should do or how they should behave. This is different than the sciences, where people generally don’t have strong a priori beliefs about particular genes, chemical reactions, tensile strengths of various materials, etc. (they do tend to associate themselves with a theory and then defend it against new results, but this is not an a priori belief). Anyway, I would recommend adopting an empirical approach to economic analysis and governmental policy in general. Statisticians above all should see the wisdom in this.

• again I regret mentioning politics

let’s drop it and stick to statistics

• OK no more on Krugman, the Editors have been alerted.

I just felt bad for sending it off topic.

• ND: Well it does relate to the general issue of sound/fallacious arguments, especially in “evidence-based” policy.

4. I regret mentioning Krugman. My mistake. Can we agree not to use Deborah’s blog as a political debate blog?

I am not aware of significant rebuttals.

Larry

• kenmccue

Can we agree not to use Deborah’s blog as a political debate blog?

You’re missing the distinction I’m trying to make, which is with regard to methods of argument rather than a political debate (obviously, you’re to the right and I’m to the left). Labeling a pundit’s arguments as socialist when they are not is, well, you provide the word. In your blog you were always judicious and even-handed about presenting statistical arguments, particularly the disputes between the Bayesian and frequentist camps. Similarly, there are disputes about economic policies and governmental responsibilities. Academics should (and this is normative) attempt to present both sides in as non-pejorative a manner as possible, even when they are not experts in the area they are pontificating in. I understand the context of the Franken quote and how you wished to express the belief that it was incorrect about biases (I actually agree with you on that, though I disagree on how the biases present). I just believe your characterization was incorrect, and in a way that contributes to polemic discourse.

I do want to express my appreciation of your blog. Your method of presenting an idea in the simplest way possible (consistent with accurately representing the idea) I found to be extremely useful.

• Kenmccue: I want to note, in case it’s unclear, that Wasserman didn’t say any of the things you object to in his published paper, I’m not sure if you read it. The Franken thing encompasses exactly the little jokey introduction and no more.

• kenmccue

I’m aware Wasserman’s comments are not in a published paper. And I have nothing against humor in academic work; I’ve used it myself (see “The Statistical Foundations of the EI Method”, American Statistician Vol. 55, No. 2 (May, 2001), pp. 106-110; Peter Bickel found this paper amusing). My objection was to the mischaracterization of advocacy for the welfare state as socialism. I think doing this is a mistake, particularly when done by an individual prominent in a field where definitions matter, even if the mischaracterization is not in their academic specialty. Incidentally, the above-mentioned paper is a correction of a mischaracterization of statistical theory, and I had input from David Freedman (who also worked on this problem, ecological inference). Politically I disagreed with pretty much everything he believed in (for that matter, I disagreed with his approach to ecological inference), but he was a good statistician. I think Wasserman is a good statistician also and I particularly regret the loss of his blog (yours is more philosophical, which is fine, but is less useful for applied statisticians).

• LW: I have a vague idea that any rebuttal will basically assert that the distance these authors use is too non-discriminating in some sense, so Bayes fails to distinguish “nice” distributions from nearby (according to the distance) “nasty” ones. My intuition is that these results won’t hold for relative entropy, but I don’t have the knowledge and training to develop this idea — you’d need someone like John Baez for that.

• Baez is a physicist. Why would we want to know what he thinks?

• John Baez is a mathematical physicist who has reinvented himself as a category theorist. I pointed to him because he has recently characterized relative entropy as the unique (up to proportionality) convex linear lower-semicontinuous functor from a category called FinStat to [0, +inf]. FinStat is, roughly, the category of methods for updating probability distributions on finite sets on the basis of observed data. I expect someone like him would have the ability and insight to see in what way, if any, the distances used in Owhadi and others’ work are not natural.

• That would be like me offering opinions on string theory

• Houman Owhadi (forwarded by Mayo)

[Houman Owhadi sent me this response (to Corey) for posting]:

Well, one should define what one means by “nice” and “nasty” (and preferably without invoking circular arguments).

Also, it would seem to me that the statement that TV and Prokhorov cannot be used (or are not relevant) in “classical” Bayes is a powerful result in itself. Indeed TV has not only been used in many parts of statistics but it has also been called the testing metric by Le Cam for a good reason: i.e. (writing n for the number of samples), Le Cam’s Lemma states that
1) For any n, if TV is close enough (as a function of n) all tests are bad.
2) Given any TV distance, with enough sample data there exists a good test.
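
[Ed. sketch] One standard way to write these two statements, in notation that is not in the comment itself and is assumed here for illustration: $P$ and $Q$ are the two candidate distributions, $\varphi$ is a test based on $n$ i.i.d. samples, $\alpha_n(\varphi)$ and $\beta_n(\varphi)$ are its Type I and Type II error probabilities, and $\mathrm{TV}(P,Q)=\sup_A |P(A)-Q(A)|$.

$$\inf_{\varphi}\,\bigl[\alpha_n(\varphi)+\beta_n(\varphi)\bigr] \;=\; 1-\mathrm{TV}\bigl(P^{\otimes n},Q^{\otimes n}\bigr), \qquad \mathrm{TV}\bigl(P^{\otimes n},Q^{\otimes n}\bigr)\;\le\; n\,\mathrm{TV}(P,Q).$$

Read this way, statement 1 says that if $\mathrm{TV}(P,Q)\le \varepsilon/n$ then every test has $\alpha_n+\beta_n\ge 1-\varepsilon$, and statement 2 says that for any fixed $\mathrm{TV}(P,Q)>0$ the left-hand side tends to $0$ as $n\to\infty$, because $\mathrm{TV}(P^{\otimes n},Q^{\otimes n})\to 1$.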

Now concerning using closeness in Kullback–Leibler (KL) divergence rather than Prokhorov or TV, observe that (as noted in our original post) closeness in KL divergence is not something you can test with discrete data, but you can test closeness in TV or Prokhorov. In other words the statement “if the true distribution and my model are close in KL then classical Bayes behaves nicely” can be understood as “if I am given this infinite amount of information then my Bayesian estimation is good” which is precisely one issue/concern raised by our paper (brittleness under “finite” information).

Note also that, the assumption of closeness in KL divergence requires the non-singularity of the data generating distribution with respect to the Bayesian model (which could be a very strong assumption if you are trying to certify the safety of a critical system and results like the Feldman–Hajek Theorem tell us that “most” pairs of measures are mutually singular in the now popular context of stochastic PDEs).

• This procedure — taking the worst case from among the local TV or Prokhorov neighborhood of a model in the limit as the data discretization becomes infinitely fine — blocks all possibility of learning. There’s nothing special about Bayes here; it goes for error statistical learning too. I’ve written a blog post with the explanation.

• Corey: The special thing about Bayes, as it is oftentimes applied, is the complete absence of error analysis other than the one inferred from the (oftentimes arbitrary) prior itself. What those brittleness results show is that in the absence of such error analysis (which could be seen as a non-Bayesian feedback loop) you can get anything you want.

Note that you are already moving in that direction in the Gaussian example (since you are referring to the distribution that is generating the data and you are raising the issue that a max over priors “after” observing the data will lead to worst case scenarios depending on the data), you just need to push the argument a little bit further to observe that to avoid brittleness (in presence of finite information) you need to compute those robustness estimates before observing the data (and this is our conclusion http://arxiv.org/abs/1308.6306), the problem (for some people) is that to do so you have to exit the strict Bayesian framework (and the fact that the error analysis is, in general, much more difficult than computing a posterior value).

Alternatively if you think that the assumptions made in http://arxiv.org/abs/1304.6772 (closeness in TV, Prokhorov, specifying the distributions of a finite but possibly arbitrarily large number of moments) are not relevant to your practice, then you can still use the machinery (the proof of Theorem 5.12 of http://arxiv.org/abs/1304.6772, inequality 3.6 of http://arxiv.org/abs/1304.7046, etc…) to see what you would get under different assumptions/classes of priors.

For instance, if you ask yourself what happens if in addition to closeness in TV/Prokhorov (and/or specifying the distributions of a finite but possibly arbitrarily large number of moments) you also specify that, for all these priors, the probability of the data is within a finite multiplicative constant $\alpha$ of one obtained by a given measure of probability (e.g. an oracle gives you the true probability of the data within a multiplicative constant $\alpha$), then (using the quantitative inequality (3.6) of http://arxiv.org/abs/1304.7046) you still get brittleness as $\alpha$ deviates from 1.
So those worst/best case priors are not isolated nasty priors (solitary chicken little?); they are just directions of instability of the conditional expectation with respect to the particular choice of prior. What does it mean in practice? Well, it means that (a) you can find a nearby prior (or more precisely a set of nearby priors) for which the probability of observing the data is $10^{-7}/3$ rather than $10^{-7}$ and the posterior value (probability of failure) is $0.8$ rather than $0.1$, and (b) your estimates become more and more unstable as you condition on more data or increase the complexity of your system.
This is in some sense not surprising since in the Bayes rule you condition by an event corresponding to the probability of data (which is likely to be (1) very small if you are using a large number of data points (2) quite sensitive to your particular choice of prior for a complex system).
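
[Ed. sketch] The arithmetic behind the figures quoted above can be checked with a three-cell toy example. This is only an illustration under assumed numbers, not the construction used in the papers: the cell layout, the names `joint_a` and `joint_b`, and the helper functions are all hypothetical. Each "prior" is reduced to the joint probabilities it assigns to (data and failure), (data and no failure), and (data not observed).

```python
def posterior_failure(p_data_and_fail, p_data_and_ok):
    """Bayes' rule: P(failure | data) = P(data, failure) / P(data)."""
    return p_data_and_fail / (p_data_and_fail + p_data_and_ok)


def total_variation(p, q):
    """Total variation distance between two distributions on the same finite set."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


# Hypothetical prior A: P(data) = 1e-7 and posterior failure probability 0.1.
p_data_a = 1e-7
joint_a = (0.1 * p_data_a, 0.9 * p_data_a, 1.0 - p_data_a)

# Hypothetical nearby prior B: P(data) = 1e-7 / 3 and posterior failure probability 0.8.
p_data_b = 1e-7 / 3
joint_b = (0.8 * p_data_b, 0.2 * p_data_b, 1.0 - p_data_b)

post_a = posterior_failure(joint_a[0], joint_a[1])
post_b = posterior_failure(joint_b[0], joint_b[1])
tv = total_variation(joint_a, joint_b)

print(f"posterior under A: {post_a:.3f}")
print(f"posterior under B: {post_b:.3f}")
print(f"TV distance between A and B: {tv:.3e}")
```

In this toy setup the two joint distributions differ by less than $10^{-7}$ in total variation, yet the posterior moves from $0.1$ to $0.8$, because conditioning divides by a probability of the data that is itself of order $10^{-7}$.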

Just take the (simple) Gaussian example and add 10,000 more data points and assume that you measure them to finite precision (say $10^{-2}$) and assume that an oracle gives you, (a) the true probability of each data point with 1% error, or (b) the probability of the data as a whole with 30% error, then look at what happens. You will also see that (1) the instability arises not only in the limit as the data discretization becomes infinitely fine but also in the limit where the discretization is finite and you get more and more data (this is so because the measurement resolution needs only to be smaller than a finite threshold that gets larger as you condition over more and more data) (2) as you get more and more data, “small” fluctuations in the probability of the data lead to large fluctuations in your posterior conclusions. Note that you will reach similar conclusions if, instead of closeness in TV, your model captures the distribution of the first 100 moments (which would be more realistic for the Gaussian example since you are looking at a tail estimate).
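
[Ed. sketch] A rough back-of-the-envelope reading of scenario (a), under the added assumption (not stated in the comment) that the data points are independent, so that the probability of the whole data set is a product of $n = 10{,}000$ factors, each known only up to a multiplicative factor in $[0.99, 1.01]$:

$$\prod_{i=1}^{n} p_i\,(1\pm 0.01) \quad\Longrightarrow\quad \text{uncertainty factor up to } (1.01)^{10000} = e^{10000\,\ln 1.01} \approx e^{99.5} \approx 1.6\times 10^{43}.$$

On this reading, a 1% error per data point is far from a 1% error on the probability of the data as a whole, which is the quantity the posterior is divided by.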

Houman

• Professor Owhadi: When I say “it goes for error-statistical learning too” I mean this sort of thing.

Basically, Spanos (and Mayo too, I believe) define statistical adequacy of a model to mean that the available data “looks like” (graphically or by passing formal tests of auxiliary model assumptions) a typical realization of a particular type of stochastic process. So suppose we have a model that passes such a test. If we consider that such a model might nevertheless be misspecified and admit all models in a small TV or Prokhorov neighborhood of it, we’ll find that, e.g., the actual Type I error probabilities of tests of what Mayo calls “primary statistical inferences” can be anything in [0, 1] no matter what the nominal error rates are.

• Aris Spanos

Corey: thorough Mis-Specification (M-S) testing alludes to a battery of well-chosen tests (not one or two) aiming to test 4-5 different assumptions both individually and jointly with a view to evaluate their validity using a number of strategies to ensure that these tests self-correct; they correct each other. The only way I can answer questions of idle speculation about small TVs or Prokhorov neighborhoods pertaining to the potential effectiveness of such M-S testing strategies is to issue a challenge to the skeptic to generate data that involve such pathologies and send me the data. Remember that statistical adequacy does not guarantee “truth” (however understood). It guarantees the statistical reliability of inferences by ensuring that the actual error probabilities approximate closely the nominal ones.

• Spanos: Just to be clear, I’m not arguing against M-S testing here. I’m saying that from the perspective of M-S testing, the OSS argument against Bayes proves too much — the M-S testing approach is just as “brittle”. I’m arguing that one cannot consistently hold that the OSS argument discredits Bayes but does not discredit M-S testing.

(I might take you up on your challenge to the skeptic someday, but my idea for that is currently speculative and is tangential to this discussion in any event.)

• Corey:
In a moving vehicle so I will mainly just quote them. First, I don’t know where you get the claim that a model is deemed statistically adequate in error statistics simply because the data “look like” a possible realization of the model.
On the bigger point: I think you’re confusing the fact that the error probability assessment depends on an approximately statistically adequate model, and the radical lack of error statistical control in the situation discussed by Owhadi, Scovel, and Sullivan, as well as the solution they hint at (via “a non-Bayesian feedback loop”).

“these brittleness results suggest that, in the absence of a rigorous accuracy/performance analysis, the guarantee on the accuracy of Bayesian Inference for complex systems is similar to what one gets by arbitrarily picking a point between the minimum and maximum value of the quantity of interest.”

Mayo: The kind of rigorous accuracy/performance analysis I take it is of the error statistical variety.

“……we hope that these results will help stimulate the discussion on the necessity to couple posterior estimates with rigorous performance/accuracy analysis. Indeed, it is interesting to note that the situations where Bayesian Inference has been successful can be characterized by the presence of some kind of “non-Bayesian” feedback loop on its accuracy…[Bayesian inference] does not say anything (by itself) about the statistical accuracy of these estimators if the model is not exact.”

• Mayo: Any post-data model assessment will be brittle — that’s why Owhadi didn’t tell me “What you have written is ignorant and not even wrong,” but rather, “you are raising the issue that a max over priors [this usage of “priors” includes what we call “models” — ed.] “after” observing the data will lead to worst case scenarios depending on the data… to avoid brittleness (in presence of finite information) you need to compute those robustness estimates before observing the data.”

So no, I think it’s wrong to take “rigorous accuracy/performance analysis” to be of the Mayo-style post-data error statistical variety.

• Corey: I don’t think you can mean this. Obviously, even methods with pre-data error probabilities are applied with data “in hand”. But the error properties “came first” (as it were), unlike their data-dependent priors.
Else all data used in testing would be “old data” or in some way double-counted.

• Mayo: And you think that’s nuts, huh? Take it up with Owhadi — he’s the one who wrote it. *My* claim is that if you were to think through what their math is really saying, you too would notice that the OSS brittleness analysis proves way too much.

• Note: Corey has a more detailed argument on his blog which I just came across:
http://itschancy.wordpress.com/2014/01/04/when-does-bayesian-inference-shatter/

• Corey,

I am not familiar with the Statistical Misspecification Testing developed by Deborah and Aris but from reading “Methodology in Practice: Statistical Misspecification Testing” I don’t see how the brittleness results would apply. My understanding (if it is correct) is that M-S is based on a finite number of tests and error estimates are computed for each test.

The issue with Bayes is its expressiveness as you vary the model/prior (interestingly this is also its strength, since it allows you to build sophisticated estimators). More precisely, if we were to use Bayes to construct hypothesis tests we would get a different test for each different prior. Some of these tests may perform extremely well and some poorly (given a worst case scenario with respect to what the data generating distribution could be). And without a statistical error analysis you will get brittleness (any answer you want) under slight variations of your prior (i.e. in Bayesian terms, if you “do not” interpret the priors as data generating distributions then you get any answer you want under slight variations of your beliefs).

What I mean by computing error estimates before the observation of the data is as follows. Let $\pi$ be your prior (your belief about the system) and $\pi'$ the true prior (from which the data generating distribution is sampled). If you know $\pi'$ then $\pi=\pi'$ will give you an optimal estimator and its (exact) statistical error is obtained by averaging the data with respect to $\pi$.
If you don’t know $\pi'$, then the statistical error of $\pi$ is unknown since it requires an averaging with respect to the distribution of the data (since it is a function of the unknown $\pi'$), but you can still bound the statistical error of $\pi$ by averaging the data with respect to the worst candidate for $\pi'$. Note that by doing so you are bounding the statistical error of your prior before the observation of the data. This bound will basically tell you how much you can trust the estimation obtained by computing the value of the $\pi$-estimator against the actual observed data.
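
[Ed. sketch] A toy version of such a pre-data bound, under assumptions that are not in the comment: a Bernoulli model with $n$ trials, a Beta$(a, b)$ prior standing in for $\pi$, its posterior mean as the $\pi$-estimator, squared error as the loss, and a grid of point masses standing in for the candidates $\pi'$. The function names are hypothetical. The point is only that the bound is computed by averaging over all possible data sets, so no observed data enter.

```python
from math import comb


def risk(a, b, n, p):
    """Exact mean squared error of the posterior-mean estimator (a + k) / (a + b + n)
    when the number of successes k is Binomial(n, p).

    The sum runs over every possible data set, so no observed data are needed."""
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k) * ((a + k) / (a + b + n) - p) ** 2
        for k in range(n + 1)
    )


def worst_case_risk(a, b, n, grid_size=101):
    """Largest risk over an evenly spaced grid of candidate true success probabilities."""
    grid = [i / (grid_size - 1) for i in range(grid_size)]
    return max(risk(a, b, n, p) for p in grid)


n = 25
uniform_bound = worst_case_risk(1.0, 1.0, n)   # Beta(1, 1), i.e. a uniform prior
minimax_bound = worst_case_risk(2.5, 2.5, n)   # Beta(sqrt(n)/2, sqrt(n)/2)

print(f"worst-case MSE, Beta(1, 1) prior:     {uniform_bound:.6f}")
print(f"worst-case MSE, Beta(2.5, 2.5) prior: {minimax_bound:.6f}")
```

For $n = 25$ the worst-case mean squared error is about $0.0086$ for the uniform prior and about $0.0069$ for the Beta$(2.5, 2.5)$ prior, whose risk is the same for every true success probability. Either number is available before any data are seen and bounds how far the resulting estimate can be trusted, in this toy setting only.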

Houman

• Houman: The kind of pre-data robustness analysis I think you’re talking about is alive and well in the mathematical statistics literature. Here’s a recent example: On posterior concentration in misspecified models. Connecting those sorts of results to applied Bayesian practice is indeed hard and not widespread…

• Corey: Yes, I am aware of asymptotic concentration results in KL, etc… (we cite these in Section 2 of the first paper).
The kind of error analysis that I have in mind is the non-asymptotic one that can be traced back to Wald’s statistical decision theory (more precisely a generalization of that to statistical errors, intervals of confidence, etc… but this is not relevant at this stage). I am not sure about this type of analysis being alive and well today in applications (B. Efron has published a recent paper about that in Science http://www.sciencemag.org/content/340/6137/1177).

Our program is basically a computational version of a variation/generalization of Wald’s statistical decision theory and we found the (post-data) brittleness results in the process of developing the required reduction calculus (the min and max have to be taken over measures over spaces of measures and calculus on a computer is necessarily discrete and finite).

Houman

• I think a post-data approach could work, just with a less crazy misspecification set. Hmm… for a real 1-dimensional RV, how about a set such that the relative distribution from the “true” distribution to the model is smooth with a parameter controlling the intensity of the allowed fluctuations? You can study the OUQ bounds as the fluctuation control parameter goes from no fluctuations (i.e., uniform relative distribution, no misspecification) to uncontrolled fluctuations (recovering the TV neighborhood result).

• Well, we already know (from classical robust Bayesian analysis) that Bayes is robust (post data) with respect to finite-dimensional classes of priors.
And we know from the brittleness papers that we get brittleness with finite co-dimensional classes of priors (with a sensitivity increasing not only with measurement resolution but also with the number of sample points).
What is the weakest kind of control (to be added to finite-co-dimensional) that would give us robustness? The machinery in the brittleness papers points towards the direction of controlling fluctuations in the probability of the data (i.e. choosing classes of priors $\Pi$ such that if $\pi$ and $\pi'$ are two priors in $\Pi$ then the probability of the data under $\pi$ divided by the probability of the data under $\pi'$ is between $1/\alpha$ and $\alpha$ where $\alpha$ is close to one). One may find this kind of control/assumption quite strong/restrictive but this is the kind of control that KL enables.

What about robustness in KL divergence (or by other means of control of fluctuations in densities)? There is a precise local sensitivity analysis (in the sense of the Fréchet derivative) done in Larry’s paper (with P. Gustafson, “Local Sensitivity Diagnostics for Bayesian Inference”) for the broader class of $\Phi$-divergences (that includes KL, Hellinger, etc…). Interestingly this analysis shows that, as the number of data points goes to infinity, the local sensitivity measure diverges to infinity with probability 1 (unless the direction of the perturbation has specific added restrictions).

Houman

• Dear Houman:
Let me say that I don’t expect you to try to explain a complex result in blog comments, but that I’m very grateful for the patient attempts, … I’d been on the road the past few days, and now that I’m in one place, I’ve reread the exchange of comments between you and Corey, and am not sure I’m getting the drift of the past few remarks, not to mention the symbols are this way and that way. So, while I absolutely don’t expect you to go back over things, if you do wish to piece together the comments with the intended symbols, I could post them separately.
Here are two remarks/queries just to demonstrate the severity of (felt) gaps in my understanding: (1) If brittleness means, as you often indicate, that “you get anything you want”, then being spared from brittleness suggests only avoidance of a type of extreme underdetermination—scarcely enough to recommend an account. (2) You speak of “your prior (your belief about the system)” and “the true prior (from which the data generating distribution is sampled from)”. I don’t see how this notion of a “true prior” is really understandable (or relevant) in the context of your results.
So I’m afraid even a plainer, plain Jane might be needful, while I realize, likely infeasible in this context.

• Deborah,

In the brittleness papers we indeed do not need/use the notion of a “true prior” (since we use the framework of classical robust Bayesian Inference where robustness is estimated given the data) and we “do not” interpret those priors as data generating distributions (so they could be subjective beliefs about the system of interest).

I speak of “true prior” to elaborate on what we mean by avoiding brittleness by computing bounds on statistical robustness (pre data).
This basically concerns an element of our sequel work but I will describe what I mean in the context of Wald’s Statistical Decision Theory (established on Von Neumann’s game theory). Assume that A (the statistician) and B (the universe) play a game. In this game A tries to estimate some quantity of interest that is a function of some measure of probability $\mu$ chosen by B. A doesn’t know what $\mu$ is but he gets to observe n i.i.d. samples from $\mu$. So A’s estimation is a function $\theta$ of those n i.i.d. samples (let’s call that function the model/estimator). The statistical error of A’s estimation is a function of A’s choice (the model $\theta$) and B’s choice (the data generating distribution $\mu$).

Assume that on the board of this game, there is a bag of measures of probability on the space of data generating distributions (so the elements of that bag are measures of probability over a set of measures of probability). Assume that A selects an element of that bag (his prior) to construct his model/estimator (using Bayes’ rule) and that B also selects an element from that bag (the “true prior”, not necessarily the same as A’s prior) to sample (choose at random) the data generating distribution $\mu$.
Question: What is the statistical error of A’s model if B gets to see A’s model and tries to maximize A’s statistical error?
The answer to that question is a (pre data) robustness/sensitivity estimate (is the answer to the above question small if the bag is small in some sense or the number of observed data points is large?) but you can also see it as an error estimate: the answer is the worst statistical error of A’s model with respect to B’s choice. Note that (for the error estimate) the bag would have to reflect the (lack of) information about $\mu$ (so it could be quite large in practical applications) and it should also contain point masses on the space of $\mu$s (to reflect the possibility that B’s choice may be non-random).
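
A toy version of this game can be computed exactly. The sketch below is only an illustration under assumptions that are not in the comment above: the data are n i.i.d. Bernoulli($\mu$) draws, the quantity of interest is $\mu$ itself, A uses the posterior mean under a Beta(a, b) prior, the statistical error is mean squared error, and B is restricted to point masses on a grid of values of $\mu$. The restriction to point masses loses nothing for the worst case, since an error averaged over any prior cannot exceed the largest error over the values that prior can produce. The function names `risk` and `worst_case_risk` are hypothetical.

```python
from math import comb


def risk(mu, n, a, b):
    """Exact mean squared error of A's estimator when B picks Bernoulli(mu).

    A's estimator is the posterior mean under a Beta(a, b) prior (an
    assumption of this sketch). The expectation is taken over the
    binomial distribution of the number of successes k in n draws.
    """
    total = 0.0
    for k in range(n + 1):
        pmf = comb(n, k) * mu**k * (1 - mu) ** (n - k)
        estimate = (a + k) / (a + b + n)
        total += pmf * (estimate - mu) ** 2
    return total


def worst_case_risk(n, a, b, grid_size=201):
    """B's best response: the largest risk over a grid of point masses on mu."""
    grid = [i / (grid_size - 1) for i in range(grid_size)]
    return max(risk(mu, n, a, b) for mu in grid)


if __name__ == "__main__":
    n = 16
    for a, b in [(1, 1), (2, 2)]:
        print(f"Beta({a}, {b}): worst-case risk = {worst_case_risk(n, a, b):.6f}")
```

With n = 16, the uniform prior Beta(1, 1) has a worst-case risk of 4/324 (about 0.0123), attained at $\mu = 1/2$, while Beta(2, 2) has a risk of 0.01 whatever B chooses. This is the sense in which the bound orders priors before any data are seen; it says nothing about the post-data brittleness discussed elsewhere in the thread.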

Now assume that you use Bayesian Inference to estimate some quantity of interest about a complex system. If you do compute the quantity described above (i.e. find the numerical answer to the above question) then you have a defensible bound on the statistical error of your system (given what you are ready to assume about B’s choice).
What if you don’t compute this bound? Well, if the bag is finite co-dimensional (i.e. you only know finitely many things about reality/B’s choice) then (1) given the data, slight variations of A’s choice (the prior) will lead to any possible answer between the min and max of the quantity of interest, and (2) this instability increases not only with the complexity of the system but also with the number of data points (since each new data point adds a new direction of instability).

Houman

• Houman: Thanks so much for this intriguing game-theory analogy.
“B gets to see A’s model and tries to maximize A’s statistical error”
And so B carries out this effort (to maximize A’s stat error) by dint of a deliberate choice of a member from the bag (of measures of probability on the space of data generating distributions)? Is that the idea?

• Deborah: Yes this is the idea. More precisely, the resulting maximum corresponds to the worst statistical error of A’s model given A’s lack of information about B’s choice (the actual statistical error may be smaller but there is no way of knowing given available information).

Can A avoid computing this bound by using a meta-prior on B’s choice? Well, by doing so he would only make his analysis more brittle (the meta-prior would be a measure on a space of measures on a space of measures and the analysis/inference would be even more brittle with respect to the choice of meta-prior). So A cannot avoid brittleness by staying within the strict framework of Bayesian Inference. What the computation of the bound described above allows is an exit from that strict Bayesian framework: it provides (non-Bayesian) feedback on the accuracy of A’s choice.

Note also that the computation of the bound described above induces an order on the space of priors: it also allows A to compare priors and decide (in a transparent manner) if a prior is “good” or “bad” (an optimal prior would be one minimizing the above bound).
Can A “find/guess/eyeball” a “good prior/model” (or most uninformative prior) without going through an actual computation or without some kind of (possibly empirical) feedback? Well, in the context of this game, this question becomes: can A guess an element of the space of measures over measures (i.e. the space of priors) that is a good enough approximation of the optimal prior or whose least upper bound on the statistical error is sufficiently small for the application (without actually computing the least upper bound)?
Although this might be possible for simple (and non-critical) systems with one or two samples, it would seem to me that this task would be quite challenging for complex systems with a non-negligible number of samples. If there is one thing that we learned from optimization problems over measures (in http://arxiv.org/abs/1009.0679), it is that their solutions are oftentimes surprising/unexpected.

Houman

5. “Bayes is the analysis of subjective beliefs but provides no frequency guarantees.”

LW: Bayes offers the guarantee that, given a loss function for a point estimator, minimizing posterior expected loss yields an admissible estimator. This is half of Wald’s complete class theorem; the other half is that under certain (mild?) assumptions, the class of minimum posterior loss estimators is “essentially complete” with respect to admissible risk functions. I once asked you what you thought of that theorem; you said you don’t think of it. I put it to you that your decision not to think on it has led you to make a demonstrably false claim.

And now I’ll play “let’s you and him fight”. Spanos (Mayo, too?) has basically rejected all risk-function-based justifications for frequentist procedures on the grounds that estimators are to be justified on properties known to hold under the “True State of Nature”. More generally, risk-function-based justifications aren’t targeted at error probabilities, and these are what provides grounds for inferential procedures. I wonder if Mayo and/or Spanos and/or you yourself would categorize your view of frequentist procedures as “behavioral” in the sense of Mayo and Spanos (2006).

6. Yes my view tends to be “behavioral” although I have always hated that term.

7. “I have a hunch that today it is hazardous to be a frequentist. Deborah’s Frequentists in Exile blog is there for good reason!” Norm Matloff
http://normaldeviate.wordpress.com/2013/09/01/is-bayesian-inference-a-religion/#comment-9764
