*Dear Reader*: Tonight marks the 2-year anniversary of this blog; so I’m reblogging my very first posts from 9/3/11 here and here (from the rickety old blog site)*. (One was the “about”.) The current blog was included once again in the top 50 statistics blogs. Amazingly, I have received e-mails from different parts of the world describing experimental recipes for the special concoction we exiles favor! (Mine is here.) If you can fly over to the Elbar Room, please join us: I’m treating everyone to doubles of Elbar Grease! Thanks for reading and contributing! *D. G. Mayo*

(*The old blogspot is a big mix; it was before Rejected blogs. Yes, I still use this old typewriter [ii])

**“Overheard at the Comedy Club at the Bayesian Retreat” 9/3/11 by D. Mayo**

**“Did you hear the one about the frequentist . . .**

- “who claimed that observing “heads” on a biased coin that lands heads with probability .05 is evidence of a statistically significant improvement over the standard treatment of diabetes, on the grounds that such an event occurs with low probability (.05)?”

or

- “who defended the reliability of his radiation reading, despite using a broken radiometer, on the grounds that most of the time he uses one that works, so on average he’s pretty reliable?”

Such jests may work for an after-dinner laugh, but if it turns out that, despite being retreads of “straw-men” fallacies, they form the basis of why some reject frequentist methods, then they are not such a laughing matter. But surely the drubbing of frequentist methods could not be based on a collection of howlers, could it? I invite the curious reader to stay and find out.

If we are to take the criticisms seriously, and put to one side the possibility that they are deliberate distortions of frequentist statistical methods, we need to identify their sources. To this end I consider two interrelated areas around which to organize foundational issues in statistics: (1) the roles of probability in induction and inference, and (2) the nature and goals of statistical inference in science or learning. Frequentist sampling statistics, which I prefer to call “error statistics,” continues to be raked over the coals in the foundational literature, but with little scrutiny of the presuppositions about goals and methods, without which the criticisms lose all force.

First, there is the supposition that an adequate account must assign degrees of probability to hypotheses, an assumption often called *probabilism*. Second, there is the assumption that the main, if not the only, goal of error-statistical methods is to evaluate long-run error rates. Given the wide latitude with which some critics define “controlling long-run error,” it is not surprising to find them arguing that (i) error statisticians approve of silly methods, and/or (ii) rival (e.g., Bayesian) accounts also satisfy error statistical demands. Absent this sleight of hand, Bayesian celebrants would have to go straight to the finale of their entertainment hour: a rousing rendition of “There’s No Theorem Like Bayes’s Theorem.”

Never mind that frequentists have responded to these criticisms; they keep popping up (verbatim) in many Bayesian textbooks and articles on philosophical foundations. The difficulty of articulating a statistical philosophy that fully explains the basis for both (i) insisting on error-statistical guarantees, while (ii) avoiding pathological examples in practice, has turned many a frequentist away from venturing into foundational battlegrounds. Some even concede the distorted perspectives drawn from overly literal and radical expositions of what Fisher, Neyman, and Pearson “really thought”. Many others just find the “statistical wars” distasteful.

Here is where I view my contribution, as a philosopher of science, to the long-standing debate: not merely to call attention to the howlers that pass as legitimate criticisms of frequentist error statistics, but also to sketch the main lines of an alternative statistical philosophy within which to better articulate the roles and value of frequentist tools. Let me be clear that I do not consider this the only philosophical framework for frequentist statistics; different terminology could do as well. I will consider myself successful if I can provide one way of building, or one standpoint from which to build, a frequentist, error-statistical philosophy.

But given this is a blog, I shall be direct and to the point: I hope to cultivate the interests of others who might want to promote intellectual honesty within a generally very lopsided philosophical debate. I will begin with the first entry to the comedy routine, as it is put forth by leading Bayesians……

___________________________________________

**“Frequentists in Exile” 9/3/11 by D. Mayo**

Confronted with the position that “arguments for this personalistic theory were so persuasive that anything to any extent inconsistent with that theory should be discarded” (Cox 2006, 196), frequentists might have seen themselves in a kind of exile when it came to foundations, even those who had been active in the dialogues of an earlier period [i]. Sometime around the late 1990s there were signs that this was changing. Regardless of the explanation, the fact that it did occur and is occurring is of central importance to statistical philosophy.

Now that Bayesians have stepped off their a priori pedestal, it may be hoped that a genuinely deep scrutiny of the frequentist and Bayesian accounts will occur. In some corners of practice it appears that frequentist error statistical foundations are being discovered anew. Perhaps frequentist foundations, never made fully explicit, but at most lying deep below the ocean floor, are finally being disinterred. But let’s learn from some of the mistakes in the earlier attempts to understand it. With this goal I invite you to join me in some deep water drilling, here as I cast about on my Isle of Elba.

Cox, D. R. (2006), *Principles of Statistical Inference*, CUP.

________________________________________________

[i] Yes, that’s the Elba connection: Napoleon’s exile (from which he returned to fight more battles).

[ii] I have discovered a very reliable antique typewriter shop in Oxford that was able to replace the two missing typewriter keys. So long as my “ribbons” and carbon sheets don’t run out, I’m set.

Congratulations on the two-year anniversary! I appreciate the research you perform, and hope to contribute more to this blog.

It has been truly educational to follow this blog for two years. I am beginning to think that the spirited defense of the strong likelihood principle is just not going to show up… After two years of carefully following the various threads, I have come to the opinion that without the SLP, Bayesian posterior probabilities have no real meaning (except perhaps where the posteriors are validated using error statistical approaches). And then there are the priors for which I would make a similar point. I am thinking of this because these topics arose early in the history of the blog. There have been interesting arguments made by outstanding thinkers from various perspectives, and plenty of opportunity to expose weaknesses in error statistical approaches. It appears that the error stat approach is quite solid in its underlying philosophy. I have yet to see a valid refutation of any of the core tenets. This is helpful.

Thanks John! I appreciate your support and interest. I’d still like to get a post from you some time on error statistics in forensics. Oh and 2016, I expect, will be ERROR16!

I add my congratulations, too!

Thanks Christian. I was going to list the people who were top contributors–you would obviously be in the short short list–but in the end decided it would take time I don’t have to get it right. In fact, I’m not sure about keeping up the blog, although it’s got an addicting quality to it, which isn’t the slightest bit justified by professional payoff. I should ask the inner-sanctum group for feedback/suggestions, but I’m not sure I’ll have the time to pursue them. There’s tons I could readily do–return to new U-Phils and maybe new articles in RMM–but I’m very behind in my work…..

Thanks from me as well!

You are right about the addictive nature. I quite often write responses to posts and then throw them away in part because the anticipation of what others will say in response can be distracting from what I am supposed to be doing…

David: do you have a blog? Or do you mean on other people’s blogs, like this one?

If I comment on other people’s blogs, I nearly always avoid going back there to check. If I do it’s sometimes a shock, as on Gelman’s blog last week. (I refer to comments by others.)

Anyway, I figure people trained to keep up their tweeting are worse off, tweeting minutiae that accomplish nothing…

Well, keeping up the blog certainly would serve the aim of impressing me because I always thought that this kind of thing is for people who are really more productive than me…

Christian: Well it’s certainly best for people more productive than I, but I did it anyway (starting at the beginning of a 2-year leave), largely, at first, to discuss the papers growing out of our June 2010 conference. (By the way, didn’t you have some commentary presented at the conference that we might add to the materials?)

“(By the way, didn’t you have some commentary presented at the conference that we might add to the materials?)”

I’d probably have slides but I never wrote it down properly because I was somehow not in the loop when stuff from the conference was published.

Mayo: Congratulations and many thanks. It’s been both enjoyable and edifying to engage with the strongest defense of frequentism I’ve ever encountered. I especially appreciate your demonstration of how the severity principle scotches many common criticisms of frequentism — it’s saved me from propagating wrong claims.

Corey: And thank you for all of your insights. I know that you are an error statistician at heart. Of course, I don’t see myself as giving a defense of “frequentism” as that has ever been formulated. I’m not sure why it hasn’t been formulated this way, when it seems so clear to me that (a) error probabilities lend themselves so well for evaluating and controlling error probing capacities of methods, and (b) inductive reasoning takes the form of inferring what has/has not been well probed, including negations of claims. A poorly tested claim H is not a little bit probable; rather, the data have provided bad evidence, no test (BENT) at all. Scientists aren’t even interested in highly probable claims, in any sense that might be meant, because it directs one to stop with vague, low-content, close-to-observational claims. We want high content theories that suggest probative and informative tests. Moreover, anyone who thinks a general hypothesis or theory (statistical or otherwise) can’t be true, has no business saying there’s some probability it is true. What can it mean? The only probability you could assign such a hypothesis or model is 0. I don’t think anyone intends a probability of, say, .9 in a hypothesis to mean, say, 90% of it is true or shown true. (They really just have likelihoods.) But we can assess whether the hypothesis is adequate, or which of its claims have been well tested, and which not (e.g., we may infer something about the behavior of prion infection, but don’t have a full theory of prions). Local statistical claims are crucial for probing non-trivial models and theories in science. If some statistical aspect h has passed severely, then we can say, if we wanted to, that the probabilities of outcomes would occur about as often as h says they would (in given experiments)–and could even bet accordingly (if we chose to)–but this would not be to assign the probability to h itself. Well you’ve heard all this….and I’ve dashed it off way too fast, as I’m running off…

I’d be interested to see your laundry list of how the severity principle scotches, and saved you from propagating, common howlers!

Mayo: Okay, I’ll put it here.

Mayo: You’re close to correct: my thorax is that of an error statistician. It’s just that I think that the severity principle in informal application is a special case of Bayesian reasoning, and in formal application to statistical hypotheses, slides a bit too glibly from the sampling properties of procedures to the warrant for asserting statistical hypotheses in specific cases. I really am a Bayesian at heart.

Corey: I’m not sure what you mean about severity being a special case of Bayesian reasoning, except maybe if you’re painting by numbers.

As Larry Laudan, John Worrall, Clark Glymour, Henry Kyburg, Wesley Salmon, John Earman, me, and many, many others have said, the flexibility of an inference account to accord with any assessment whatsoever is actually a sign of that account’s lack of content. It’s like lack of falsifiability in a theory. For instance, regardless of how one solves the Duhemian problem of where to place blame for an anomaly (between theory and evidence), one can reconstruct it so as to get a Bayesian rationale. (e.g., one scientist blames the instrument, another blames the theory–both can be reconstructed Bayesianly). This is in EGEK (chapter 3) and elsewhere.

Even where Bayesians agree on numbers, the interpretation may be radically different. Now here are examples of what I think would normally be called disagreements, just off the top of my head. I never assign a probability to a statistical hypothesis, unless it has a frequentist prior, but even then rarely find it relevant to do so (unless it’s a screening example); I reject all of a very fuzzy mass of purported interpretations of subjective probability; reject default and reference Bayesian accounts; reject the strong likelihood principle, stopping rule principles, Dutch book arguments, ….you get the picture. Are you prepared to say that each of my positions here is consistent, and perfectly in keeping, with Bayesianism? If so, then I rest my case: the account is empty. Such an account, or rather your interpretation of the account, might be valuable for some purposes, but would be too promiscuous for purposes of giving any kind of guidance or insight in resolving/understanding debates about the statistical interpretation of data.

Mayo: I’ll explain my position more completely on my blog; I still need to sort out the order I’m going to go through my ideas and where my planned sequence of posts on the howlers fits.

Corey: Oh so you’re backing away from the posts on the howlers being the main focus? Or was that not the focus?

Mayo: I don’t see myself as “backing away” from anything I said I would do. As it says in my first post, I’m going to start by engaging with your Severity Principle, and that includes (but is not limited to) calling out the howlers, as I *volunteered* to do during our email exchange.

You’ve constructed the “dare” framing out of your belief that I’m reluctant to do so. This is not the case: which one of us first brought up the idea? Failing to engage an opposing view’s strongest arguments is ignorance at best, and laziness and/or malice at worst, and it’s a situation in need of fixing. I really do want Bayesians to stop repeating faulty criticisms.

Corey has taken me up on a dare: for a Bayesian who recognized many of the anti-frequentist arguments to be howlers to shout it out. So he has started a blog*:

“Aboot this blog”

Better get that spell check in there…and “hairless ape #2”? who is #1?

Here’s what he wrote:

“I’m Corey Yanofsky, a biostatistician working in Montreal and Ottawa. I’m a Bayesian in theory, a statistical ecumenist in practice. This blog is where I’ll record my random ruminations and reveries. I’m initially planning on grappling with Deborah Mayo’s error statistical philosophy and the severity principle on which it is based.

I’m clearly too late to the statistics blogging game — all the obvious puns are taken.

Posted 2 hours ago by Hairless Ape #2,493,564,909”

*Starting them is the easy part…..

Spell check? I don’t know what you’re talking aboot.

Maybe someone wishes to ponder/comment on/explain the gist of this paper I saw mentioned on Christian Robert’s blog:

http://xianblog.wordpress.com/2013/09/11/bayesian-brittleness-again/

When Bayesian Inference Shatters

http://arxiv.org/abs/1308.6306

Houman Owhadi, Clint Scovel, Tim Sullivan

(Submitted on 28 Aug 2013)

With the advent of high-performance computing, Bayesian methods are increasingly popular tools for the quantification of uncertainty throughout science and industry. Since these methods impact the making of sometimes critical decisions in increasingly complicated contexts, the sensitivity of their posterior conclusions with respect to the underlying models and prior beliefs is becoming a pressing question. We report new results suggesting that, although Bayesian methods are robust when the number of possible outcomes is finite or when only a finite number of marginals of the data-generating distribution are unknown, they are generically brittle when applied to continuous systems with finite information on the data-generating distribution. This brittleness persists beyond the discretization of continuous systems and suggests that Bayesian inference is generically ill-posed in the sense of Hadamard when applied to such systems: if closeness is defined in terms of the total variation metric or the matching of a finite system of moments, then (1) two practitioners who use arbitrarily close models and observe the same (possibly arbitrarily large amount of) data may reach diametrically opposite conclusions; and (2) any given prior and model can be slightly perturbed to achieve any desired posterior conclusions.

Conclusion: Don’t define “closeness” using the TV metric or matching a finite number of moments. Use KL divergence instead.

As far as I understand from first reading, the two brittleness theorems do not concern the whole of the posterior distribution, but its expected value, and they basically state the consequences of the well known instability of the expected value as a functional on the space of distributions for Bayesian inference. In frequentist statistics, such observations gave rise to robustness concepts such as “qualitative robustness” in the 70s/80s but the Bayesians didn’t bother much.

The Bayesians could defend themselves by saying that they are not so much interested in the expectation of the posterior but rather in functionals such as the posterior mode or median, or in probabilities of certain sets of interest (credibility intervals), which are not affected by these theorems (though they may be non-robust/brittle in some sense, too).
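The instability of the expected value that this comment points to can be sketched numerically. The snippet below is only an illustration under stated assumptions, not a reproduction of the theorems in the paper: it takes a distribution P that is uniform on a grid in [0, 1] and mixes in a point mass of weight eps at a far-away outlier, so the total variation distance to P is eps. The helper names (`contaminate`, `weighted_mean`, `weighted_median`) and the chosen numbers are hypothetical.

```python
def contaminate(values, eps, outlier):
    """Support points and weights of (1 - eps) * P + eps * delta_outlier,
    where P is uniform on `values` (illustrative contamination model)."""
    n = len(values)
    points = list(values) + [outlier]
    weights = [(1 - eps) / n] * n + [eps]
    return points, weights


def weighted_mean(points, weights):
    return sum(p * w for p, w in zip(points, weights))


def weighted_median(points, weights):
    # Smallest point whose cumulative weight reaches 1/2.
    cumulative = 0.0
    for p, w in sorted(zip(points, weights)):
        cumulative += w
        if cumulative >= 0.5:
            return p
    return max(points)


base = [i / 100 for i in range(101)]  # uniform on 0.00, 0.01, ..., 1.00
eps = 1e-3                            # TV distance to the uncontaminated P

for outlier in (1e3, 1e6, 1e9):
    pts, wts = contaminate(base, eps, outlier)
    print(outlier, weighted_mean(pts, wts), weighted_median(pts, wts))
```

With eps held fixed, the mean can be pushed as far as one likes by moving the outlier, while the median stays put. This matches the remark that functionals such as the median are not hit in the same way, though it says nothing about whether they are brittle in some other sense.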

The TV metric really isn’t the problem here, particularly as Theorem 1 holds for the Prokhorov metric, too (which for what the authors are interested in seems much more appropriate than the KL divergence, by the way).

Corey: As far as I see it, the only way to argue against the TV metric here is to make a case that what is close in terms of the TV metric is not “really” close in terms of your interpretation of the situation. And this case seems difficult to make. Can you?

(BTW, Davies doesn’t like the TV metric either, but for the opposite reason: What may be far away with respect to TV may be close when it comes to interpretation.)

Christian: R. T. Cox of Cox’s theorem was working on an algebra of inquiry to go along with his algebra of probable inference. The algebra of probable inference extends Boolean algebra to the reals, and probabilities pop out. The algebra of inquiry extends an algebra of questions to the reals, and information entropy pops out. (A KL-divergence-like quantity is the appropriate generalization of entropy in the continuous case.) My interpretation of the situation is that I care about knowing which questions have been answered and which are still open. By showing that TV-small perturbations can result in arbitrarily large differences in the resulting inference, Owhadi has shown that TV-closeness doesn’t reflect the kind of closeness I care about.

Corey: I’d be worried if you’d not care about Prokhorov-closeness, because the weak topology corresponding to the Prokhorov-distance basically implies that for any given amount of data what is close enough in terms of Prokhorov cannot be distinguished from each other. (See Davies’s work. More precisely, “can only be distinguished with vanishing probability”.)

So you’d basically say that you don’t care about the fact that two distributions between which no difference can be “seen” (in data) lead to wildly different results.

(I hope you see also my direct reply to Mayo above, which qualifies this a bit.)

Christian: Your “wildly different” results correspond to the difference between “really really small but non-zero” and “strictly zero”. I plead Cromwell’s rule.

Corey: It’s not “my” wildly different results but those of Owhadi, Scovel and Sullivan, and if I understand them correctly, differences are as wild as it gets: “the range of posterior predictions among all admissible priors is as wide as the deterministic range of the quantity of interest.”

Christian: I’m looking at the actual expressions for the Lévy metric (i.e., 1-D Prokhorov) and KL divergence. With arbitrarily large amounts of data, we get to estimate densities and cdfs to arbitrary accuracy. Under these conditions, the only way two distributions can be indistinguishable in Lévy distance but wildly different according to KL divergence is if the two associated probability measures aren’t equivalent.

Corey: My “wildly different” was not meant regarding KL-divergence but regarding the posterior expectation (result in the paper).

The thing is, estimating cdfs to arbitrary accuracy does *not* guarantee to estimate an expectation to arbitrary accuracy. By the way (as you could learn from Davies) there is no way to estimate a density to arbitrary accuracy, because densities (and therefore also the KL-divergence) are pathologically discontinuous as functionals of distributions.
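A toy case may help make the disagreement over metrics concrete. The sketch below is an assumption-laden illustration, not anything taken from the paper or from Davies: P is uniform on the even points and Q on the odd points of a fine grid in [0, 1). Their cdfs differ by at most 1/n (the Kolmogorov distance, which upper-bounds the Lévy metric), yet the two measures have disjoint supports, so the total variation distance is 1 and the KL divergence is infinite. The function names are hypothetical.

```python
import math


def interleaved(n):
    """P uniform on even-indexed, Q on odd-indexed points of a 2n-point grid."""
    grid = [k / (2 * n) for k in range(2 * n)]
    p = [1 / n if k % 2 == 0 else 0.0 for k in range(2 * n)]
    q = [0.0 if k % 2 == 0 else 1 / n for k in range(2 * n)]
    return grid, p, q


def kolmogorov(p, q):
    # Largest gap between the two cdfs, scanning the shared grid left to right.
    cdf_p = cdf_q = gap = 0.0
    for a, b in zip(p, q):
        cdf_p += a
        cdf_q += b
        gap = max(gap, abs(cdf_p - cdf_q))
    return gap


def total_variation(p, q):
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def kl(p, q):
    # KL(P || Q); infinite as soon as P puts mass where Q has none.
    total = 0.0
    for a, b in zip(p, q):
        if a > 0:
            if b == 0:
                return math.inf
            total += a * math.log(a / b)
    return total


for n in (10, 100, 1000):
    _, p, q = interleaved(n)
    print(n, kolmogorov(p, q), total_variation(p, q), kl(p, q))
```

As n grows the cdfs become indistinguishable at any fixed resolution, while TV and KL report the two distributions as maximally far apart. This is the situation Corey describes in which the two measures are not equivalent, and it is one way to see why KL is discontinuous as a functional of the distribution.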

Mayo: Probably it’s best to let Owhadi explain first… I think I understand the paper (because from robustness theory I had some well-founded expectations of what the result would be and why it holds), but I really haven’t put enough effort in yet to put my hand in the fire for what I make of it.

Christian: Yes, I took your “wildly different” to refer to my KL-closeness rather than Owhadi’s expected posterior inference thingy. My mistake. On density estimation, I think “no way” is a bit strong, eh? Of course, assumptions are needed to get arbitrarily accurate density estimates — Tokdar 2006 gives some in the context of strong consistency of Dirichlet process mixture model density estimation.

Corey: You will need smoothness assumptions. The point about these is that they cannot be checked by data, they are critical (i.e., they are not there for mathematical convenience but everything breaks down without them), and they concern the “joint” between modelling and reality – in reality we only have discrete data, so although the overall shape of a continuous density can “fit” the data, its smoothness definitely cannot. This means that the required assumptions are strictly violated by whatever is observable.

Actually, from time to time I think that a smooth version of a density can be defined for which this doesn’t hold (so that all probability measures in the neighborhood of measures with a nice smooth density actually have a nice smooth density, too), so that estimation of it would actually not be an ill-posed problem in Davies’s sense. I don’t think that such a thing is already in the literature but maybe. When I posted this on the Normal Deviate blog some time ago, Larry Wasserman agreed that this should work.

Christian: I find a certain Quixotic grandeur in Davies’s views on smoothness. I have mixed feelings on the subject — the doctrinaire Bayesian in me gives three cheers; the engineer in me scoffs.

It is always said (e.g., by J. Berger) that a prior is tantamount to assuming infinitely many things. So in a case like this, the data can never catch up—or something like that.

Corey (sorry we’ve apparently exhausted the comment hierarchy): I think that it has much to do with the fact that often in statistics we are interested in something slightly different from what it seems on the surface. Here, for example, people are really interested in some kind of smooth approximation of the distributional shape and not really in the true density. And because methods that are supposedly constructed for the latter task, for which they fail miserably, tend to work OK for the former task, an engineer may be happy with them.

Christian: How does this impact the result of the paper, if it does?

Christian wrote: [M]ethods that are supposedly constructed for [density estimation], for which they fail miserably, tend to work OK for [estimation of a smooth approximation], an engineer may be happy with them.

Wow, you just made my inner doctrinaire Bayesian and my inner engineer shake hands and agree to be friends.

Well if a blog’s good for anything, it’s precisely to float one-quarter or one-eighth baked reflections, and yours gave a lot of interesting perspective. That’s what will help to understand, i.e., reflections from statisticians with expertise regarding related results. Did you send the paper to Davies by the way?

And what happens if no priors are used, and everything else is the same, I wonder…

In the meantime, I hope the policy decisions to which the authors refer are not relying on these methods…or maybe that’s why some of the current policies have been so off…(just kidding)

I really appreciate comments on this paper (were you already familiar with this work?), and I have only scanned it, but I thought it pertained to misspecification, and also to the possibility of slightly modifying a prior to obtain desired results…? But maybe I’m missing something. Owhadi has just agreed to write/exchange something on this blog next month*, so maybe (in advance) people want to write “U-Phils”, but either way I’d be extremely grateful for baby explanations….

*This was arbitrary, but just to give some time to think about it…

Mayo: This discussion has departed a bit from the paper. It came from the initial discussion about whether in the paper the right kind of neighborhood (metric) for distributions was used. Corey argued against it (see above), whereas I’m with Laurie Davies (and others such as Donoho and, as I took from some hints he gave, Larry Wasserman) in this respect: A good metric between distributions is one in which distributions are close to each other if they cannot be distinguished by (a limited but potentially large amount of) observed data. I was defending the authors for the use of the Prokhorov metric, which has this property; neither TV nor KL (Corey’s favourite choice) has.

One of the authors’ results (if I could nominate one as the most important, I’d choose this one) says that if you replace your model by another one which is in an arbitrarily close neighborhood (according to the metric discussed above), the posterior expectation could be as far away as you want. Which, if you choose the right metric, means that you replace your sampling model by another one out of which typical samples *look the same*, and which therefore can be seen as just as appropriate for the situation as the original one.

Note that the result is primarily about a change in the sampling model, not the prior, although it is a bit more complex than that because if you change the sampling model, you need to adapt the prior, too, which is appropriately taken into account by the authors as far as I can see.

(This is connected to my earlier comment that the result is a consequence of a robustness problem with expectations and sampling models that exists in frequentism, too, but there it has attracted much work since the 1960s/70s; it should be well known, and in the “robustness community” people know how to handle it.)