Return to the Comedy Hour: P-values vs posterior probabilities (1)

Comedy Hour

Did you hear the one about the frequentist significance tester when he was shown the nonfrequentist nature of p-values?

JB [Jim Berger]: I just simulated a long series of tests on a pool of null hypotheses, and I found that among tests with p-values of .05, at least 22%—and typically over 50%—of the null hypotheses are true!(1)

Frequentist Significance Tester (scratches head): But rejecting the null with a p-value of .05 ensures erroneous rejection no more than 5% of the time!

Raucous laughter ensues!

(Hah, hah…. I feel I’m back in high school: “So funny, I forgot to laugh!”)

The frequentist tester should retort:

Frequentist Significance Tester: But you assumed 50% of the null hypotheses are true, and computed P(H0|x) (imagining P(H0) = .5)—and then assumed my p-value should agree with the number you get, if it is not to be misleading!

Yet, our significance tester is not heard from as they move on to the next joke….

Of course it is well known that for a fixed p-value, with a sufficiently large n, even a statistically significant result can correspond to a large posterior probability for H0 [i]. Somewhat more recent work generalizes the result, e.g., J. Berger and Sellke (1987). Although from their Bayesian perspective it appears that p-values come up short as measures of evidence, the significance testers balk at the fact that use of the recommended priors allows highly significant results to be interpreted as no evidence against the null — or even evidence for it! An interesting twist in recent work is to try to “reconcile” the p-value and the posterior, e.g., Berger 2003 [ii].

The standard example of the conflict between p-values and Bayesian posteriors considers the two-sided test of a Normal mean, H0: μ = μ0 versus H1: μ ≠ μ0.

“If n = 50 one can classically ‘reject H0 at significance level p = .05,’ although Pr (H0|x) = .52 (which would actually indicate that the evidence favors H0).” (Berger and Sellke, 1987, p. 113).

If n = 1000, a result statistically significant at the .05 level leads to a posterior probability for the null of .82!


Table 1 (modified) from J.O. Berger and T. Sellke (1987), “Testing a Point Null Hypothesis,” JASA 82(397): 113.
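
For readers who want to see where such numbers come from, here is a minimal sketch (my own, not Berger and Sellke’s code) that reproduces the .52 and .82 figures under one standard choice in that literature: a prior of .5 on H0, with the remaining .5 spread over the alternative as a Normal(μ0, σ²), Jeffreys-type prior.

```python
# Sketch: posterior probability of a point null H0: mu = mu0 when a two-sided
# test just reaches p = .05 (z = 1.96), assuming a .5 "spike" prior on H0 and
# a Normal(mu0, sigma^2) "slab" prior over the alternative (a Jeffreys-type choice).
import math

def posterior_null(z, n, prior_null=0.5):
    """Pr(H0 | z) with x-bar ~ N(mu, sigma^2/n) and, under H1, mu ~ N(mu0, sigma^2)."""
    # Bayes factor in favor of H0: ratio of marginal likelihoods f0(z)/f1(z)
    bf01 = math.sqrt(1 + n) * math.exp(-z**2 * n / (2 * (1 + n)))
    odds = (prior_null / (1 - prior_null)) * bf01
    return odds / (1 + odds)

for n in (50, 1000):
    print(n, round(posterior_null(1.96, n), 2))
# -> 0.52 for n = 50 and 0.82 for n = 1000, the figures quoted above
```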

Many find the example compelling evidence that the p-value “overstates evidence against a null” because it claims to use an “impartial” or “uninformative”(?) Bayesian prior probability assignment of .5 to H0, the remaining .5 being spread out over the alternative parameter space. (“Spike and slab” I’ve heard Gelman call this, derisively.) Others charge that the problem is not p-values but the high prior (Casella and R. Berger, 1987). Casella and R. Berger (1987) show that “concentrating mass on the point null hypothesis is biasing the prior in favor of H0 as much as possible” (p. 111), whether in one- or two-sided tests. Note, too, the conflict with confidence interval reasoning, since the null value (here it is 0) lies outside the corresponding confidence interval (Mayo 2005). Moreover, the “spiked concentration of belief in the null” is at odds with the prevailing view that “we know all nulls are false”. See Senn’s very interesting points on this same issue in his letter (to Goodman) here.

But often, as in the opening joke, the prior assignment is claimed to be keeping to the frequentist camp and frequentist error probabilities: it is imagined that we sample randomly from a population of hypotheses, some proportion of which are assumed to be true (50% is a common number used). We randomly draw a hypothesis and get this particular one: maybe it concerns the mean deflection of light, or perhaps it is an assertion of bioequivalence of two drugs, or whatever. The percentage “initially true” (in this urn of nulls) serves as the prior probability for H0. I see this gambit in statistics, psychology, philosophy and elsewhere, and yet it commits a fallacious instantiation of probabilities:

50% of the null hypotheses in a given pool of nulls are true.

This particular null H0 was randomly selected from this urn (some may wish to add “nothing else is known, or the like”).

Therefore P(H0 is true) = .5.

It isn’t that one cannot play a carnival game of reaching into an urn of nulls (and one can imagine lots of choices for what to put in the urn), and use a Bernoulli model for the chance of drawing a true hypothesis (assuming we could even tell), but this “generic hypothesis” is no longer the particular hypothesis one aims to use in computing the probability of data x0 under hypothesis H0. (In other words, it’s no longer the H0 needed for the likelihood portion of the frequentist computation.) [iii] In any event, .5 is not the frequentist probability that the selected null H0 is true. (Note the selected null would get the benefit of being selected from an urn of nulls where few have been shown false yet: “innocence by association”. See my comment on J. Berger 2003, pp. 19-24.)
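
To see what such an urn-of-nulls computation amounts to, and how completely it turns on what is stipulated, here is a hedged little simulation of my own (not one of Berger’s applets). It assumes 50% of the nulls in the pool are true and picks an arbitrary effect size for the false ones; both choices drive the answer.

```python
# Sketch of the "urn of nulls" computation behind the opening joke: a pool of
# hypotheses, half of them true nulls (by stipulation), each tested with a
# two-sided z-test; then ask what fraction of the results with p near .05
# came from true nulls. The effect size under the alternatives is arbitrary.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_hyp, n_obs = 200_000, 50
null_true = rng.random(n_hyp) < 0.5           # the stipulated 50% of true nulls
mu = np.where(null_true, 0.0, 0.5)            # hypothetical effect under the alternatives
xbar = rng.normal(mu, 1 / np.sqrt(n_obs))     # sample means, sigma = 1
p = 2 * stats.norm.sf(np.abs(xbar) * np.sqrt(n_obs))

near_05 = (p > 0.04) & (p < 0.06)             # results with p-values near .05
print(null_true[near_05].mean())              # roughly half are true nulls with these choices
```

The output is a rate about the pool under those stipulations; it is not a frequentist probability that the particular null you happened to test is true, which is just the point of the fallacious-instantiation complaint above.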

Yet J. Berger claims his applets are perfectly frequentist, and that by adopting his recommended O-priors (now called conventional priors), we frequentists can become more frequentist (than by using our flawed p-values) [iv]. We get what he calls conditional p-values (of a special sort). This is a reason for coining a different name, e.g., frequentist error statistician.

Upshot: Berger and Sellke tell us they will cure the significance tester’s tendency to exaggerate the evidence against the null (in two-sided testing) by using some variant on a spiked prior. But the result of their “cure” is that outcomes may too readily be taken as no evidence against, or even evidence for, the null hypothesis, even if it is false. We actually don’t think we need a cure. Faced with conflicts between error probabilities and Bayesian posterior probabilities, the error statistician may well conclude that the flaw lies with the latter measure. This is precisely what Fisher argued:

Discussing a test of the hypothesis that the stars are distributed at random, Fisher takes the low p-value (about 1 in 33,000) to “exclude at a high level of significance any theory involving a random distribution” (Fisher, 1956, page 42). Even if one were to imagine that H0 had an extremely high prior probability, Fisher continues—never minding “what such a statement of probability a priori could possibly mean”—the resulting high posterior probability for H0, he thinks, would only show that “reluctance to accept a hypothesis strongly contradicted by a test of significance” (44) . . . “is not capable of finding expression in any calculation of probability a posteriori” (43). Sampling theorists do not deny there is ever a legitimate frequentist prior probability distribution for a statistical hypothesis: one may consider hypotheses about such distributions and subject them to probative tests. Indeed, Fisher says, if one were to consider the claim about the a priori probability to be itself a hypothesis, it would be rejected by the data!

UPDATE NOVEMBER 28, 2015: Now I realize that some recent arguments of this sort will bite the bullet and admit they’re assessing the prior probability of the particular hypothesis H* you just tested by considering the % of “true” nulls in an urn from which it is imagined that H* has been randomly selected. They admit it’s an erroneous instantiation, but declare that they’re just assessing “science-wise error rates” of some sort or other. Even bending over backwards to grant these rates, my question is this: Why would it be relevant to how good a job you did in testing H* that it came from an urn of nulls assumed to contain k% “true” nulls? (And think of how many ways you could delineate those urns of nulls, e.g., nulls tested by you, by females, by senior scientists, nulls in social psychology, etc., etc.)

(0) If we’re ever going to make progress, or even attain a cumulative understanding, we really need to go back to at least one of the key, earlier criticisms and responses for each classic howler. (This is the first (1) in a “let PBP” series.) Please check comments from this post.

(1) Pratt, commenting on Berger and Sellke (1987), needled them on how he’d shown this long before. I will update this note with references when I return from travels.

[i] A result my late colleague I.J. Good wanted me to call the Jeffreys-Good-Lindley Paradox.

[ii] An applet is available at ∼berger.

[iii] Bayesian philosophers, e.g., Achinstein, allow that this does not yield a frequentist prior, but he claims it yields an acceptable prior for the epistemic probabilist (e.g., see Error and Inference 2010).

[iv] Does this remind you of how the Bayesian is said to become more subjective by using the Berger O-Bayesian prior? See Berger deconstruction.

References & Related articles

Berger, J. O.  (2003). “Could Fisher, Jeffreys and Neyman have Agreed on Testing?” Statistical Science 18: 1-12.

Berger, J. O. and Sellke, T.  (1987). “Testing a point null hypothesis: The irreconcilability of p values and evidence,” (with discussion). J. Amer. Statist. Assoc. 82: 112–139.

Casella, G. and Berger, R. (1987). “Reconciling Bayesian and Frequentist Evidence in the One-sided Testing Problem,” (with discussion). J. Amer. Statist. Assoc. 82: 106–111, 123–139.

Fisher, R. A., (1956). Statistical Methods and Scientific Inference, Edinburgh: Oliver and Boyd.

Jeffreys, H. (1939). Theory of Probability, Oxford: Oxford University Press.

Mayo, D. (2003). Comment on J. O. Berger’s “Could Fisher, Jeffreys and Neyman Have Agreed on Testing?” Statistical Science 18: 19-24.

Mayo, D. (2004). “An Error-Statistical Philosophy of Evidence,” in M. Taper and S. Lele (eds.) The Nature of Scientific Evidence: Statistical, Philosophical and Empirical Considerations. Chicago: University of Chicago Press: 79-118.

Mayo, D.G. and Cox, D. R. (2006). “Frequentist Statistics as a Theory of Inductive Inference,” Optimality: The Second Erich L. Lehmann Symposium (ed. J. Rojo), Lecture Notes-Monograph Series, Institute of Mathematical Statistics (IMS), Vol. 49: 77-97.

Mayo, D. and Kruse, M. (2001). “Principles of Inference and Their Consequences,” in D. Corfield and J. Williamson (eds.), Foundations of Bayesianism. Dordrecht: Kluwer Academic Publishers: 381-403.

Mayo, D. and Spanos, A. (2011). “Error Statistics,” in Philosophy of Statistics, Handbook of Philosophy of Science, Volume 7 (General editors: Dov M. Gabbay, Paul Thagard and John Woods; Volume eds.: Prasanta S. Bandyopadhyay and Malcolm R. Forster). Elsevier: 1-46.

Pratt, J. (1987). “Testing a point null hypothesis: The irreconcilability of p values and evidence: Comment.” J. Amer. Statist. Assoc. 82: 123-125.


Categories: Bayesian/frequentist, Comedy, PBP, significance tests, Statistics | 2 Comments

S. McKinney: On Efron’s “Frequentist Accuracy of Bayesian Estimates” (Guest Post)



Steven McKinney, Ph.D.
Molecular Oncology and Breast Cancer Program
British Columbia Cancer Research Centre


On Bradley Efron’s “Frequentist Accuracy of Bayesian Estimates”

Bradley Efron has produced another fine set of results, yielding a valuable estimate of variability for a Bayesian estimate derived from a Markov Chain Monte Carlo algorithm, in his latest paper “Frequentist accuracy of Bayesian estimates” (J. R. Statist. Soc. B (2015) 77, Part 3, pp. 617–646). I give a general overview of Efron’s brilliance via his Introduction discussion (his words “in double quotes”).

“1. Introduction

The past two decades have witnessed a greatly increased use of Bayesian techniques in statistical applications. Objective Bayes methods, based on neutral or uninformative priors of the type pioneered by Jeffreys, dominate these applications, carried forward on a wave of popularity for Markov chain Monte Carlo (MCMC) algorithms. Good references include Ghosh (2011), Berger (2006) and Kass and Wasserman (1996).”

A nice concise summary, one that should bring joy to anyone interested in Bayesian methods after all the Bayesian-bashing of the middle 20th century. Efron himself has crafted many beautiful results in the Empirical Bayes arena. He has reviewed important differences between Bayesian and frequentist outcomes that point to some as-yet unsettled issues in statistical theory and philosophy such as his scales of evidence work. Continue reading

Categories: Bayesian/frequentist, objective Bayesians, Statistics | 44 Comments

Statistical “reforms” without philosophy are blind (v update)



Is it possible, today, to have a fair-minded engagement with debates over statistical foundations? I’m not sure, but I know it is becoming of pressing importance to try. Increasingly, people are getting serious about methodological reforms—some are quite welcome, others are quite radical. Too rarely do the reformers bring out the philosophical presuppositions of the criticisms and proposed improvements. Today’s (radical?) reform movements are typically launched from criticisms of statistical significance tests and P-values, so I focus on them. Regular readers know how often the P-value (that most unpopular girl in the class) has made her appearance on this blog. Here, I tried to quickly jot down some queries. (Look for later installments and links.) What are some key questions we need to ask to tell what’s true about today’s criticisms of P-values? 

I. To get at philosophical underpinnings, the single most important question is this:

(1) Do the debaters distinguish different views of the nature of statistical inference and the roles of probability in learning from data? Continue reading

Categories: Bayesian/frequentist, Error Statistics, P-values, significance tests, Statistics, strong likelihood principle | 193 Comments

Oy Faye! What are the odds of not conflating simple conditional probability and likelihood with Bayesian success stories?


Faye Flam

ONE YEAR AGO, the NYT “Science Times” (9/29/14) published Faye Flam’s article, first blogged here.

Congratulations to Faye Flam for finally getting her article published at the Science Times at the New York Times, “The odds, continually updated,” after months of reworking and editing, interviewing and reinterviewing. I’m grateful that one remark from me remained. Seriously I am. A few comments: The Monty Hall example is simple probability, not statistics, and finding that fisherman who floated on his boots at best used likelihoods. I might note, too, that critiquing that ultra-silly example about ovulation and voting–a study so bad they actually had to pull it at CNN due to reader complaints[i]–scarcely required more than noticing the researchers didn’t even know the women were ovulating[ii]. Experimental design is an old area of statistics developed by frequentists; on the other hand, these ovulation researchers really believe their theory (and can point to a huge literature)…. Anyway, I should stop kvetching and thank Faye and the NYT for doing the article at all[iii]. Here are some excerpts:


silly pic that accompanied the NYT article

…….When people think of statistics, they may imagine lists of numbers — batting averages or life-insurance tables. But the current debate is about how scientists turn data into knowledge, evidence and predictions. Concern has been growing in recent years that some fields are not doing a very good job at this sort of inference. In 2012, for example, a team at the biotech company Amgen announced that they’d analyzed 53 cancer studies and found it could not replicate 47 of them.

Similar follow-up analyses have cast doubt on so many findings in fields such as neuroscience and social science that researchers talk about a “replication crisis”

Continue reading

Categories: Bayesian/frequentist, Statistics | Leave a comment

(Part 2) Peircean Induction and the Error-Correcting Thesis

C. S. Peirce 9/10/1839 – 4/19/1914

Continuation of “Peircean Induction and the Error-Correcting Thesis”

Deborah G. Mayo
Transactions of the Charles S. Peirce Society: A Quarterly Journal in American Philosophy, Volume 41, Number 2, 2005, pp. 299-319

Part 1 is here.

There are two other points of confusion in critical discussions of the SCT that we may note here:

I. The SCT and the Requirements of Randomization and Predesignation

The concern with “the trustworthiness of the proceeding” for Peirce, like the concern with error probabilities (e.g., significance levels) for error statisticians generally, is directly tied to their view that inductive method should closely link inferences to the methods of data collection as well as to how the hypothesis came to be formulated or chosen for testing.

This account of the rationale of induction is distinguished from others in that it has as its consequences two rules of inductive inference which are very frequently violated (1.95): namely, that the sample be (approximately) random, and that the property being tested not be determined by the particular sample x—i.e., predesignation.

The picture of Peircean induction that one finds in critics of the SCT disregards these crucial requirements for induction: Neither enumerative induction nor H-D testing, as ordinarily conceived, requires such rules. Statistical significance testing, however, clearly does. Continue reading

Categories: Bayesian/frequentist, C.S. Peirce, Error Statistics, Statistics | Leave a comment

Peircean Induction and the Error-Correcting Thesis (Part I)

C. S. Peirce: 10 Sept, 1839-19 April, 1914

Yesterday was C.S. Peirce’s birthday. He’s one of my all-time heroes. You should read him: he’s a treasure chest on essentially any topic. I only recently discovered a passage where Popper calls Peirce one of the greatest philosophical thinkers ever (I don’t have it handy). If Popper had taken a few more pages from Peirce, he would have seen how to solve many of the problems in his work on scientific inference, probability, and severe testing. I’ll blog the main sections of a (2005) paper of mine over the next few days. It’s written for a very general philosophical audience; the statistical parts are pretty informal. I first posted it in 2013. Happy (slightly belated) birthday, Peirce!

Peircean Induction and the Error-Correcting Thesis
Deborah G. Mayo
Transactions of the Charles S. Peirce Society: A Quarterly Journal in American Philosophy, Volume 41, Number 2, 2005, pp. 299-319

Peirce’s philosophy of inductive inference in science is based on the idea that what permits us to make progress in science, what allows our knowledge to grow, is the fact that science uses methods that are self-correcting or error-correcting:

Induction is the experimental testing of a theory. The justification of it is that, although the conclusion at any stage of the investigation may be more or less erroneous, yet the further application of the same method must correct the error. (5.145)

Inductive methods—understood as methods of experimental testing—are justified to the extent that they are error-correcting methods. We may call this Peirce’s error-correcting or self-correcting thesis (SCT):

Self-Correcting Thesis SCT: methods for inductive inference in science are error correcting; the justification for inductive methods of experimental testing in science is that they are self-correcting. Continue reading

Categories: Bayesian/frequentist, C.S. Peirce, Error Statistics, Statistics | Leave a comment

Can You change Your Bayesian prior? (ii)



This is one of the questions high on the “To Do” list I’ve been keeping for this blog.  The question grew out of discussions of “updating and downdating” in relation to papers by Stephen Senn (2011) and Andrew Gelman (2011) in Rationality, Markets, and Morals.[i]

“As an exercise in mathematics [computing a posterior based on the client’s prior probabilities] is not superior to showing the client the data, eliciting a posterior distribution and then calculating the prior distribution; as an exercise in inference Bayesian updating does not appear to have greater claims than ‘downdating’.” (Senn, 2011, p. 59)

“If you could really express your uncertainty as a prior distribution, then you could just as well observe data and directly write your subjective posterior distribution, and there would be no need for statistical analysis at all.” (Gelman, 2011, p. 77)

But if uncertainty is not expressible as a prior, then a major lynchpin for Bayesian updating seems questionable. If you can go from the posterior to the prior, on the other hand, perhaps the data can also lead you to come back and change the prior.

Is it legitimate to change one’s prior based on the data?

I don’t mean update it, but reject the one you had and replace it with another. My question may yield different answers depending on the particular Bayesian view. I am prepared to restrict the entire question of changing priors to Bayesian “probabilisms”, meaning the inference takes the form of updating priors to yield posteriors, or to report a comparative Bayes factor. Interpretations can vary. In many Bayesian accounts the prior probability distribution is a way of introducing prior beliefs into the analysis (as with subjective Bayesians) or, conversely, to avoid introducing prior beliefs (as with reference or conventional priors). Empirical Bayesians employ frequentist priors based on similar studies or well established theory. There are many other variants.



S. SENN: According to Senn, one test of whether an approach is Bayesian is that while Continue reading

Categories: Bayesian/frequentist, Gelman, S. Senn, Statistics | 111 Comments

From our “Philosophy of Statistics” session: APS 2015 convention



“The Philosophy of Statistics: Bayesianism, Frequentism and the Nature of Inference,” at the 2015 American Psychological Society (APS) Annual Convention in NYC, May 23, 2015:


D. Mayo: “Error Statistical Control: Forfeit at your Peril” 


S. Senn: “‘Repligate’: reproducibility in statistical studies. What does it mean and in what sense does it matter?”


A. Gelman: “The statistical crisis in science” (this is not his exact presentation, but he focussed on some of these slides)


For more details see this post.

Categories: Bayesian/frequentist, Error Statistics, P-values, reforming the reformers, reproducibility, S. Senn, Statistics | 10 Comments

Philosophy of Statistics Comes to the Big Apple! APS 2015 Annual Convention — NYC

Start Spreading the News…..



 The Philosophy of Statistics: Bayesianism, Frequentism and the Nature of Inference,
2015 APS Annual Convention
Saturday, May 23  
2:00 PM- 3:50 PM in Wilder

(Marriott Marquis 1535 B’way)





Andrew Gelman

Professor of Statistics & Political Science
Columbia University



Stephen Senn

Head of Competence Center
for Methodology and Statistics (CCMS)

Luxembourg Institute of Health




D.G. Mayo, Philosopher



Richard Morey, Session Chair & Discussant

Senior Lecturer
School of Psychology
Cardiff University
Categories: Announcement, Bayesian/frequentist, Statistics | 8 Comments

Joan Clarke, Turing, I.J. Good, and “that after-dinner comedy hour…”

I finally saw The Imitation Game about Alan Turing and code-breaking at Bletchley Park during WWII. This short clip of Joan Clarke, who was engaged to Turing, includes my late colleague I.J. Good at the end (he’s not second as the clip lists him). Good used to talk a great deal about Bletchley Park and his code-breaking feats while asleep there (see note[a]), but I never imagined Turing’s code-breaking machine (which, by the way, was called the Bombe and not Christopher as in the movie) was so clunky. The movie itself has two tiny scenes including Good. Below I reblog: “Who is Allowed to Cheat?”—one of the topics he and I debated over the years. Links to the full “Savage Forum” (1962) may be found at the end (creaky, but better than nothing.)

[a] ”Some sensitive or important Enigma messages were enciphered twice, once in a special variation cipher and again in the normal cipher. …Good dreamed one night that the process had been reversed: normal cipher first, special cipher second. When he woke up he tried his theory on an unbroken message – and promptly broke it.” This, and further examples, may be found in this obituary.

[b] Pictures comparing the movie cast and the real people may be found here. Continue reading

Categories: Bayesian/frequentist, optional stopping, Statistics, strong likelihood principle | 6 Comments

What’s wrong with taking (1 – β)/α, as a likelihood ratio comparing H0 and H1?



Here’s a quick note on something that I often find in discussions on tests, even though it treats “power”, which is a capacity-of-test notion, as if it were a fit-with-data notion…..

1. Take a one-sided Normal test T+, with n iid samples:

H0: µ ≤ 0 against H1: µ > 0

σ = 10, n = 100, σ/√n = σx̄ = 1, α = .025.

So the test would reject H0 iff Z > c.025 = 1.96 (1.96 is the “cut-off”).


2. Simple rules for alternatives against which T+ has high power:
  • If we add σx̄ (here 1) to the cut-off (here, 1.96), we are at an alternative value for µ that test T+ has .84 power to detect.
  • If we add 3σx̄ to the cut-off, we are at an alternative value for µ that test T+ has ~.999 power to detect. This value can be written as µ.999 = 4.96.

Let the observed outcome just reach the cut-off to reject the null, z0 = 1.96.

If we were to form a “likelihood ratio” of μ = 4.96 compared to μ0 = 0 using

[Power(T+, 4.96)]/α,

it would be 40 (.999/.025).
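
As a quick numerical check (a sketch assuming the test T+ above, with σx̄ = 1), compare that ratio with the actual likelihood ratio at the observed z0 = 1.96:

```python
# Sketch: the power/alpha "ratio" vs. the actual likelihood ratio at z0 = 1.96
from scipy import stats

alpha, cutoff = 0.025, 1.96
mu_alt = cutoff + 3                         # the mu_.999 alternative, 4.96
power = stats.norm.sf(cutoff - mu_alt)      # P(Z > cutoff; mu = 4.96), about .999
print(power / alpha)                        # about 40

z0 = 1.96                                   # outcome just reaching the cut-off
lr = stats.norm.pdf(z0, loc=mu_alt) / stats.norm.pdf(z0, loc=0.0)
print(lr)                                   # about 0.08: the likelihoods favor the null
```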

It is absurd to say the alternative 4.96 is supported 40 times as much as the null, understanding support as likelihood or comparative likelihood. (The data z0 = 1.96 are even closer to 0 than to 4.96; the same point can be made with less extreme cases.) What is commonly done next is to assign priors of .5 to the two hypotheses, yielding

Pr(H0|z0) = 1/(1 + 40) = .024, so Pr(H1|z0) = .976.

Such an inference is highly unwarranted and would almost always be wrong. Continue reading

Categories: Bayesian/frequentist, law of likelihood, Statistical power, statistical tests, Statistics, Stephen Senn | 87 Comments

On the Brittleness of Bayesian Inference–An Update: Owhadi and Scovel (guest post)




Houman Owhadi

Professor of Applied and Computational Mathematics and Control and Dynamical Systems,
Computing + Mathematical Sciences
California Institute of Technology, USA




Clint Scovel
Senior Scientist,
Computing + Mathematical Sciences
California Institute of Technology, USA


 “On the Brittleness of Bayesian Inference: An Update”

Dear Readers,

This is an update on the results discussed in “On the Brittleness of Bayesian Inference” and a high-level presentation of the more recent paper “Qualitative Robustness in Bayesian Inference”.

In the earlier work we looked at the robustness of Bayesian Inference in the classical framework of Bayesian Sensitivity Analysis. In that (classical) framework, the data are fixed, and one computes optimal bounds on (i.e., the sensitivity of) posterior values with respect to variations of the prior in a given class of priors. Now it is already well established that when the class of priors is finite-dimensional, one obtains robustness. What we observe is that, under general conditions, when the class of priors is finite co-dimensional, the optimal bounds on posterior values are as large as possible, no matter the number of data points.

Our motivation for specifying a finite co-dimensional class of priors is to look at what classical Bayesian sensitivity analysis would conclude under finite information, and the best way to understand this notion of “brittleness under finite information” is through the simple example already given there and recalled in Example 1. The mechanism causing this “brittleness” has its origin in the fact that, in classical Bayesian Sensitivity Analysis, optimal bounds on posterior values are computed after the observation of the specific value of the data, and that the probability of observing the data under some feasible prior may be arbitrarily small (see Example 2 for an illustration of this phenomenon). This data dependence of worst priors is inherent to this classical framework, and the resulting brittleness under finite information can be seen as an extreme occurrence of the dilation phenomenon (the fact that optimal bounds on prior values may become less precise after conditioning) observed in classical robust Bayesian inference [6]. Continue reading

Categories: Bayesian/frequentist, Statistics | 13 Comments

“When Bayesian Inference Shatters” Owhadi, Scovel, and Sullivan (reblog)

I’m about to post an update of this (most viewed) blogpost, so I reblog it here as a refresher. If interested, you might check the original discussion.


I am grateful to Drs. Owhadi, Scovel and Sullivan for replying to my request for “a plain Jane” explication of their interesting paper, “When Bayesian Inference Shatters”, and especially for permission to post it. 


Houman Owhadi
Professor of Applied and Computational Mathematics and Control and Dynamical Systems, Computing + Mathematical Sciences,
California Institute of Technology, USA

Clint Scovel
Senior Scientist,
Computing + Mathematical Sciences,
California Institute of Technology, USA

Tim Sullivan
Warwick Zeeman Lecturer,
Assistant Professor,
Mathematics Institute,
University of Warwick, UK

“When Bayesian Inference Shatters: A plain Jane explanation”

This is an attempt at a “plain Jane” presentation of the results discussed in the recent arXiv paper “When Bayesian Inference Shatters”, with the following abstract:

“With the advent of high-performance computing, Bayesian methods are increasingly popular tools for the quantification of uncertainty throughout science and industry. Since these methods impact the making of sometimes critical decisions in increasingly complicated contexts, the sensitivity of their posterior conclusions with respect to the underlying models and prior beliefs is becoming a pressing question. We report new results suggesting that, although Bayesian methods are robust when the number of possible outcomes is finite or when only a finite number of marginals of the data-generating distribution are unknown, they are generically brittle when applied to continuous systems with finite information on the data-generating distribution. This brittleness persists beyond the discretization of continuous systems and suggests that Bayesian inference is generically ill-posed in the sense of Hadamard when applied to such systems: if closeness is defined in terms of the total variation metric or the matching of a finite system of moments, then (1) two practitioners who use arbitrarily close models and observe the same (possibly arbitrarily large amount of) data may reach diametrically opposite conclusions; and (2) any given prior and model can be slightly perturbed to achieve any desired posterior conclusions.”

Now, it is already known from classical Robust Bayesian Inference that Bayesian Inference has some robustness if the random outcomes live in a finite space or if the class of priors considered is finite-dimensional (i.e. what you know is infinite and what you do not know is finite). What we have shown is that if the random outcomes live in an approximation of a continuous space (for instance, when they are decimal numbers given to finite precision) and your class of priors is finite co-dimensional (i.e. what you know is finite and what you do not know may be infinite) then, if the data is observed at a fine enough resolution, the range of posterior values is the deterministic range of the quantity of interest, irrespective of the size of the data. Continue reading

Categories: 3-year memory lane, Bayesian/frequentist, Statistics | 1 Comment

“Probing with Severity: Beyond Bayesian Probabilism and Frequentist Performance” (Dec 3 Seminar slides)

Below are the slides from my Rutgers seminar for the Department of Statistics and Biostatistics yesterday, since some people have been asking me for them. The abstract is here. I don’t know how explanatory a bare outline like this can be, but I’d be glad to try and answer questions[i]. I am impressed at how interested in foundational matters I found the statisticians (both faculty and students) to be. (There were even a few philosophers in attendance.) It was especially interesting to explore, prior to the seminar, possible connections between severity assessments and confidence distributions, where the latter are along the lines of Min-ge Xie (some recent papers of his may be found here).

“Probing with Severity: Beyond Bayesian Probabilism and Frequentist Performance”

[i] They had requested a general overview of some issues in philosophical foundations of statistics. Much of this will be familiar to readers of this blog.



Categories: Bayesian/frequentist, Error Statistics, Statistics | 11 Comments


3 years ago…

MONTHLY MEMORY LANE: 3 years ago: November 2011. I mark in red 3 posts that seem most apt for general background on key issues in this blog.*

  • (11/1) RMM-4:“Foundational Issues in Statistical Modeling: Statistical Model Specification and Validation*” by Aris Spanos, in Rationality, Markets, and Morals (Special Topic: Statistical Science and Philosophy of Science: Where Do/Should They Meet?”)
  • (11/3) Who is Really Doing the Work?*
  • (11/5) Skeleton Key and Skeletal Points for (Esteemed) Ghost Guest
  • (11/9) Neyman’s Nursery 2: Power and Severity [Continuation of Oct. 22 Post]
  • (11/12) Neyman’s Nursery (NN) 3: SHPOWER vs POWER
  • (11/15) Logic Takes a Bit of a Hit!: (NN 4) Continuing: Shpower (“observed” power) vs Power
  • (11/18) Neyman’s Nursery (NN5): Final Post
  • (11/21) RMM-5: “Low Assumptions, High Dimensions” by Larry Wasserman, in Rationality, Markets, and Morals (Special Topic: Statistical Science and Philosophy of Science: Where Do/Should They Meet?”) See also my deconstruction of Larry Wasserman.
  • (11/23) Elbar Grease: Return to the Comedy Hour at the Bayesian Retreat
  • (11/28) The UN Charter: double-counting and data snooping
  • (11/29) If you try sometime, you find you get what you need!

*I announced this new, once-a-month feature at the blog’s 3-year anniversary. I will repost and comment on one of the 3-year-old posts from time to time. [I’ve yet to repost and comment on the one from Oct. 2011, but will shortly.] For newcomers, here’s your chance to catch up; for old timers, this is philosophy: rereading is essential!


 Oct. 2011

Sept. 2011 (Within “All She Wrote (so far)”)












Categories: 3-year memory lane, Bayesian/frequentist, Statistics | Leave a comment

Lucien Le Cam: “The Bayesians Hold the Magic”

Today is the birthday of Lucien Le Cam (Nov. 18, 1924 – April 25, 2000): please see my updated 2013 post on him.


Categories: Bayesian/frequentist, Statistics | Leave a comment

Oxford Gaol: Statistical Bogeymen

Memory Lane: 3 years ago. Oxford Jail (also called Oxford Castle) is an entirely fitting place to be on (and around) Halloween! Moreover, rooting around this rather lavish set of jail cells (what used to be a single cell is now a dressing room) is every bit as conducive to philosophical reflection as is exile on Elba! (It is now a boutique hotel, though many of the rooms are still too jail-like for me.)  My goal (while in this gaol—as the English sometimes spell it) is to try and free us from the bogeymen and bogeywomen often associated with “classical” statistics. As a start, the very term “classical statistics” should, I think, be shelved, not that names should matter.

In appraising statistical accounts at the foundational level, we need to realize the extent to which accounts are viewed through the eyeholes of a mask or philosophical theory. Moreover, the mask some wear while pursuing this task might well be at odds with their ordinary way of looking at evidence, inference, and learning. In any event, to avoid question-begging criticisms, the standpoint from which the appraisal is launched must itself be independently defended. But for (most) Bayesian critics of error statistics, the assumption that uncertain inference demands a posterior probability for claims inferred is thought to be so obvious as not to require support. Critics are implicitly making assumptions that are at odds with the frequentist statistical philosophy. In particular, they assume a certain philosophy about statistical inference (probabilism), often coupled with the allegation that error statistical methods can only achieve radical behavioristic goals, wherein all that matters are long-run error rates (of some sort).

Criticisms then follow readily, taking the form of one or both of the following:

  • Error probabilities do not supply posterior probabilities in hypotheses; interpreted as if they do (and some say we just can’t help it), they lead to inconsistencies.
  • Methods with good long-run error rates can give rise to counterintuitive inferences in particular cases.

I have proposed an alternative philosophy that replaces these tenets with different ones:

  • the role of probability in inference is to quantify how reliably or severely claims (or discrepancies from claims) have been tested;
  • the severity goal directs us to the relevant error probabilities, avoiding the oft-repeated statistical fallacies due to tests that are overly sensitive, as well as those insufficiently sensitive to particular errors;
  • control of long-run error probabilities, while necessary, is not sufficient for good tests or warranted inferences.

Continue reading

Categories: 3-year memory lane, Bayesian/frequentist, Philosophy of Statistics, Statistics | Tags: , | 30 Comments

Gelman recognizes his error-statistical (Bayesian) foundations


From Gelman’s blog:

“In one of life’s horrible ironies, I wrote a paper “Why we (usually) don’t have to worry about multiple comparisons” but now I spend lots of time worrying about multiple comparisons”


Exhibit A: [2012] Why we (usually) don’t have to worry about multiple comparisons. Journal of Research on Educational Effectiveness 5, 189-211. (Andrew Gelman, Jennifer Hill, and Masanao Yajima)

Exhibit B: The garden of forking paths: Why multiple comparisons can be a problem, even when there is no “fishing expedition” or “p-hacking” and the research hypothesis was posited ahead of time, in press. (Andrew Gelman and Eric Loken) (Shortened version is here.)


The “forking paths” paper, in my reading, basically argues that mere hypothetical possibilities about what you would or might have done had the data been different (in order to secure a desired interpretation) suffice to alter the characteristics of the analysis you actually did. That’s an error statistical argument–maybe even stronger than what some error statisticians would say. What’s really being condemned are overly flexible ways to move from statistical results to substantive claims. The p-values are illicit when taken to provide evidence for those claims because an actual p-value requires Prob(P < p; H0) = p (and the actual p-value has become much greater by design). The criticism makes perfect sense if you’re scrutinizing inferences according to how well or severely tested they are. Actual error probabilities are accordingly altered or unable to be calculated. However, if one is going to scrutinize inferences according to severity, then the same problematic flexibility would apply to Bayesian analyses, whether or not they have a way to pick up on it. (It’s problematic if they don’t.) I don’t see the magic by which a concern for multiple testing disappears in Bayesian analysis (e.g., in the first paper) except by assuming some prior takes care of it.
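
To make the point about actual versus nominal p-values concrete, here is a small hedged simulation of my own (an illustration, not Gelman and Loken’s analysis): if the reported result is whichever of several comparisons happened to come out smallest, then the probability of reporting “p < .05” when every null is true is far above .05, so the reported p-value no longer satisfies Prob(P < p; H0) = p.

```python
# Sketch: selecting the smallest of k nominal p-values inflates the actual
# error probability well beyond the nominal .05, even with all nulls true.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_sims, k, n = 10_000, 5, 30                     # k hypothetical ways to analyze the data
x = rng.normal(0.0, 1.0, size=(n_sims, k, n))    # every null hypothesis is true
t = x.mean(axis=2) / (x.std(axis=2, ddof=1) / np.sqrt(n))
p = 2 * stats.t.sf(np.abs(t), df=n - 1)          # nominal two-sided p-values
print((p.min(axis=1) < 0.05).mean())             # roughly .23, far above the nominal .05
```

The k comparisons here are independent for simplicity; data-dependent choices along the forking paths needn’t be, but the inflation is the same kind of thing.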

See my comment here.

Categories: Error Statistics, Gelman | 17 Comments

Oy Faye! What are the odds of not conflating simple conditional probability and likelihood with Bayesian success stories?


Faye Flam

Congratulations to Faye Flam for finally getting her article published at the Science Times at the New York Times, “The odds, continually updated,” after months of reworking and editing, interviewing and reinterviewing. I’m grateful, too, that one remark from me remained. Seriously I am. A few comments: The Monty Hall example is simple probability, not statistics, and finding that fisherman who floated on his boots at best used likelihoods. I might note, too, that critiquing that ultra-silly example about ovulation and voting–a study so bad they actually had to pull it at CNN due to reader complaints[i]–scarcely required more than noticing the researchers didn’t even know the women were ovulating[ii]. Experimental design is an old area of statistics developed by frequentists; on the other hand, these ovulation researchers really believe their theory, so the posterior checks out.

The article says, Bayesian methods can “crosscheck work done with the more traditional or ‘classical’ approach.” Yes, but on traditional frequentist grounds. What many would like to know is how to cross check Bayesian methods—how do I test your beliefs? Anyway, I should stop kvetching and thank Faye and the NYT for doing the article at all[iii]. Here are some excerpts:

Statistics may not sound like the most heroic of pursuits. But if not for statisticians, a Long Island fisherman might have died in the Atlantic Ocean after falling off his boat early one morning last summer.

Continue reading

Categories: Bayesian/frequentist, Statistics | 47 Comments

Continued: “P-values overstate the evidence against the null”: legit or fallacious?



Categories: Bayesian/frequentist, CIs and tests, fallacy of rejection, highly probable vs highly probed, P-values, Statistics | 39 Comments
