Hocus pocus! Adopt a magician’s stance, if you want to reveal statistical sleights of hand



Here’s the follow-up post to the one I reblogged on Feb 3 (please read that one first). When scientists sought to subject Uri Geller to scrutiny, magicians had to be brought in, because only they were sufficiently trained to spot the subtle sleights of hand by which a magician tricks through misdirection. We, too, have to be magicians to discern the subtle misdirections and shifts of meaning in discussions of statistical significance tests (and other methods)—even within a single statistical guide. We needn’t suppose anything deliberately devious is going on at all! Often, a statistical guidebook reflects shifts of meaning that grow out of one or another critical argument. These days, such arguments trickle down quickly to statistical guidebooks, thanks to popular articles on the “statistics crisis in science”. The danger is that the guidebooks themselves wind up containing inconsistencies. To adopt the magician’s stance is to be on the lookout for standard sleights of hand. There aren’t that many.[0]

I don’t know Jim Frost, but he gives statistical guidance at the Minitab blog. The purpose of my previous post was to point out that Frost uses the probability of a Type I error in two incompatible ways in his posts on significance tests. I assumed he’d want to clear this up, but so far he has not. His response to a comment I made on his blog is this:

Based on Fisher’s measure of evidence approach, the correct way to interpret a p-value of exactly 0.03 is like this:

Assuming the null hypothesis is true, you’d obtain the observed effect or more in 3% of studies due to random sampling error.

We know that the p-value is not the error rate because:

1) The math behind the p-values is not designed to calculate the probability that the null hypothesis is true (which is actually incalculable based solely on sample statistics). …

But this is also true for a test’s significance level α, so on these grounds α couldn’t be an “error rate” or error probability either. Yet Frost defines α to be a Type I error probability (“An α of 0.05 indicates that you are willing to accept a 5% chance that you are wrong when you reject the null hypothesis“.) [1]

Let’s use the philosopher’s slightly obnoxious but highly clarifying move of subscripts. There is error probability1—the usual frequentist (sampling theory) notion—and error probability2—the posterior probability that the null hypothesis is true conditional on the data, as in Frost’s remark. (It may also be stated as conditional on the p-value, or on rejecting the null.) Whether a p-value is predesignated or attained (observed), error probability1 ≠ error probability2.[2] Frost, inadvertently I assume, uses the probability of a Type I error in these two incompatible ways in his posts on significance tests.[3]

Interestingly, the simulations Frost cites to “show that the actual probability that the null hypothesis is true [i.e., error probability2] tends to be greater than the p-value by a large margin” work with a fixed p-value, or α level, of say .05. So it’s not a matter of predesignated versus attained p-values [4]. Their computations also employ predesignated probabilities of Type II errors and corresponding power values. The null is rejected based on a single finding that attains a .05 p-value. Moreover, the point null (of “no effect”) is given a spiked prior of .5. (The idea comes from a context of diagnostic testing; the prior is often based on an assumed “prevalence” of true nulls, of which the current null is taken to be a member. Please see my previous post.)
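To see how such simulations generate the claim, here is a minimal sketch of the diagnostic-screening computation behind them. It is my own reconstruction under assumed, illustrative numbers—prevalence of true nulls .5 (the spiked prior), α = .05, power = .2—not Frost’s, or anyone else’s, actual code.

```python
# Monte Carlo sketch of the screening computation behind claims that
# Pr(H0 true | rejection) -- "error probability 2" -- exceeds alpha,
# "error probability 1". All numbers are assumed and illustrative.
import random

random.seed(1)
prior_null, alpha, power = 0.5, 0.05, 0.2

rejections = 0
false_rejections = 0
for _ in range(200_000):
    null_true = random.random() < prior_null
    # A study "rejects" with probability alpha if H0 is true, power otherwise.
    if random.random() < (alpha if null_true else power):
        rejections += 1
        if null_true:
            false_rejections += 1

post_null = false_rejections / rejections  # simulated "error probability 2"
exact = prior_null * alpha / (prior_null * alpha + (1 - prior_null) * power)
print(f"Pr(H0 | reject) = {post_null:.3f} (exact {exact:.3f}); alpha = {alpha}")
```

With these assumed inputs, Pr(H0 | rejection) comes out to .2, four times the α of .05—but notice that everything is driven by the prior prevalence and the fixed .05 cutoff, not by whether the p-value was predesignated or attained.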

Their simulations are the basis of criticisms of error probability1 because what really matters, or so these critics presuppose, is error probability2.

Whether this assumption is correct, and whether these simulations are the slightest bit relevant to appraising the warrant for a given hypothesis, are completely distinct issues. I’m just saying that Frost’s own links mix these notions. If you approach statistical guidebooks with the magician’s suspicious eye, however, you can pull back the curtain on these sleights of hand.

Oh, and don’t lose your nerve just because the statistical guides themselves don’t see it or don’t relent. Send it on to me at


[0] They are the focus of a book I am completing: “Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars” (CUP, 2017).

[1]  I admit we need a more careful delineation of the meaning of ‘error probability’.  One doesn’t have an error probability without there being something that could be “in error”. That something is generally understood as an inference or an interpretation of data. A method of statistical inference moves from data to some inference about the source of the data as modeled; some may wish to see the inference as a kind of “act” (using Neyman’s language) or “decision to assert” but nothing turns on this.
Associated error probabilities refer to the probability a method outputs an erroneous interpretation of the data, where the particular error is pinned down. For example, it might be that the test infers μ > 0 when in fact the data have been generated by a process where μ = 0. The test is defined in terms of a test statistic d(X), and the error probabilities refer to the probability distribution of d(X), the sampling distribution, under various assumptions about the data generating process. Error probabilities in tests, whether of the Fisherian or N-P varieties, refer to hypothetical relative frequencies of error in applying a method.

[2] Notice that error probability2 involves conditioning on the particular outcome. Say you have observed a 1.96 standard deviation difference, and that’s your fixed cut-off. There’s no consideration of the sampling distribution of d(X), if you’ve conditioned on the actual outcome. Yet probabilities of Type I and Type II errors, as well as p-values, are defined exclusively in terms of the sampling distribution of d(X), under a statistical hypothesis of interest. But all that’s error probability1.

[3] Doubtless, part of the problem is that testers fail to clarify when and why a small significance level (or p-value) provides a warrant for inferring a discrepancy from the null. First, for a p-value to be actual (and not merely nominal):

Pr(P < pobs; H0) = pobs .

Cherry picking and significance seeking can yield a small nominal p-value, while the actual probability of attaining one that small (or smaller) under the null is high. So this identity fails. Second, a small p-value warrants inferring a discrepancy from the null because, and to the extent that, a larger p-value would very probably have occurred were the null hypothesis correct. This links the error probabilities of a method to an inference in the case at hand.

….Hence pobs is the probability that we would mistakenly declare there to be evidence against H0, were we to regard the data under analysis as being just decisive against H0.” (Cox and Hinkley 1974, p. 66).
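The failure of the identity under cherry picking can be checked with a toy simulation. This is my own illustration, not drawn from any of the linked posts; the numbers (5 independent tests, a reported cutoff of .05) are assumed for concreteness.

```python
# Why cherry picking makes a p-value merely nominal: run 5 independent
# tests of a true null and report only the smallest p-value as if it came
# from a single preplanned test. Under H0 each p-value is Uniform(0, 1).
import random

random.seed(2)
k, alpha, trials = 5, 0.05, 100_000

hits = sum(
    1 for _ in range(trials)
    if min(random.random() for _ in range(k)) <= alpha
)
actual = hits / trials

# Nominal level is .05, but the actual probability of reporting a p-value
# that small under the null is 1 - (1 - .05)^5, about .226.
print(f"nominal: {alpha}, simulated: {actual:.3f}, exact: {1 - (1 - alpha)**k:.4f}")
```

The reported p-value stays at .05, but Pr(P ≤ .05; H0) is about .23, so the identity Pr(P < pobs; H0) = pobs fails and the nominal p-value is no longer an actual error probability.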

[4] The myth that significance levels lose their error probability status once the attained p-value is reported is just that, a myth. I’ve discussed it a lot elsewhere, but the current point doesn’t turn on this. Still, it’s worth listening to statistician Stephen Senn (2002, p. 2438) on this point.

 I disagree with [Steve Goodman] on two grounds here: (i) it is not necessary to separate p-values from their hypothesis test interpretation; (ii) the replication probability has no direct bearing on inferential meaning. Second he claims that, ‘the replication probability can be used as a frequentist counterpart of Bayesian and likelihood models to show that p-values overstate the evidence against the null-hypothesis’ (p. 875, my italics). I disagree that there is such an overstatement.  In my opinion, whatever philosophical differences there are between significance tests and hypothesis test, they have little to do with the use or otherwise of p-values. For example, Lehmann in Testing Statistical Hypotheses, regarded by many as the most perfect and complete expression of the Neyman–Pearson approach, says

‘In applications, there is usually available a nested family of rejection regions corresponding to different significance levels. It is then good practice to determine not only whether the hypothesis is accepted or rejected at the given significance level, but also to determine the smallest significance level … the significance probability or p-value, at which the hypothesis would be rejected for the given observation’. (Lehmann, Testing Statistical Hypotheses, 1994, p. 70, original italics).

Note to subscribers: Please check back to find follow-ups and corrected versions of blogposts, indicated with (ii), (iii) etc.

Some Relevant Posts:

Categories: frequentist/Bayesian, P-values, reforming the reformers, S. Senn, Statistics | 34 Comments

Gigerenzer at the PSA: “How Fisher, Neyman-Pearson, & Bayes Were Transformed into the Null Ritual”: Comments and Queries (ii)



Gerd Gigerenzer, Andrew Gelman, Clark Glymour and I took part in a very interesting symposium on Philosophy of Statistics at the Philosophy of Science Association last Friday. I jotted down lots of notes, but I’ll limit myself to brief reflections and queries on a small portion of each presentation in turn, starting with Gigerenzer’s “Surrogate Science: How Fisher, Neyman-Pearson, & Bayes Were Transformed into the Null Ritual.” His complete slides are below my comments. I may write this in stages, this being (i).



  1. Good scientific practice–bold theories, double-blind experiments, minimizing measurement error, replication, etc.–became reduced in the social sciences to a surrogate: statistical significance.

I agree that “good scientific practice” isn’t some great big mystery, and that “bold theories, double-blind experiments, minimizing measurement error, replication, etc.” are central and interconnected keys to finding things out in error prone inquiry. Do the social sciences really teach that inquiry can be reduced to cookbook statistics? Or is it simply that, in some fields, carrying out surrogate science suffices to be a “success”? Continue reading

Categories: Fisher, frequentist/Bayesian, Gigerenzer, P-values, spurious p values, Statistics | 11 Comments

“So you banned p-values, how’s that working out for you?” D. Lakens exposes the consequences of a puzzling “ban” on statistical inference



I came across an excellent post on a blog kept by Daniel Lakens: “So you banned p-values, how’s that working out for you?” He refers to the journal that recently banned significance tests, confidence intervals, and a vague assortment of other statistical methods, on the grounds that all such statistical inference tools are “invalid” since they don’t provide posterior probabilities of some sort (see my post). The editors’ charge of “invalidity” could only hold water if these error statistical methods purported to provide posteriors based on priors, which they do not. The entire methodology is instead built on probabilities that qualify a method’s capabilities to detect and avoid erroneous interpretations of data [0]. The logic is of the falsification variety found throughout science. Lakens, an experimental psychologist, does a great job delineating some of the untoward consequences of their inferential ban. I insert some remarks in black. Continue reading

Categories: frequentist/Bayesian, Honorary Mention, P-values, reforming the reformers, science communication, Statistics | 45 Comments

A. Spanos: Talking back to the critics using error statistics

spanos 2014


Given all the recent kvetching about significance tests, it’s an apt time to reblog Aris Spanos’ overview of the error statistician talking back to the critics [1]. A related paper for your Saturday night reading is Mayo and Spanos (2011).[2] It mixes the error statistical philosophy of science with its philosophy of statistics, introduces severity, and responds to 13 criticisms and howlers.

I’m going to comment on some of the ASA discussion contributions I hadn’t discussed earlier. Please share your thoughts in relation to any of this.

[1] It was first blogged here, as part of our seminar 2 years ago.

[2] For those seeking a bit more balance to the main menu offered in the ASA Statistical Significance Reference list.


See also on this blog:

A. Spanos, “Recurring controversies about p-values and confidence intervals revisited

A. Spanos, “Lecture on frequentist hypothesis testing



Categories: Error Statistics, frequentist/Bayesian, reforming the reformers, statistical tests, Statistics | 72 Comments

In defense of statistical recipes, but with enriched ingredients (scientist sees squirrel)


Scientist sees squirrel

Evolutionary ecologist Stephen Heard (Scientist Sees Squirrel) linked to my blog yesterday. Heard’s post asks: “Why do we make statistics so hard for our students?” I recently blogged Barnard, who declared “We need more complexity” in statistical education. I agree with both: after all, Barnard also called for stressing the overarching reasoning for given methods, and that’s in sync with Heard. Here are some excerpts from Heard’s (Oct 6, 2015) post. I follow with some remarks.

This bothers me, because we can’t do inference in science without statistics*. Why are students so unreceptive to something so important? In unguarded moments, I’ve blamed it on the students themselves for having decided, a priori and in a self-fulfilling prophecy, that statistics is math, and they can’t do math. I’ve blamed it on high-school math teachers for making math dull. I’ve blamed it on high-school guidance counselors for telling students that if they don’t like math, they should become biology majors. I’ve blamed it on parents for allowing their kids to dislike math. I’ve even blamed it on the boogie**. Continue reading

Categories: fallacy of rejection, frequentist/Bayesian, P-values, Statistics | 20 Comments

“Fraudulent until proved innocent: Is this really the new “Bayesian Forensics”? (rejected post)





Categories: evidence-based policy, frequentist/Bayesian, junk science, Rejected Posts | 2 Comments

“Intentions” is the new code word for “error probabilities”: Allan Birnbaum’s Birthday

27 May 1923-1 July 1976


Today is Allan Birnbaum’s birthday. Birnbaum’s (1962) classic “On the Foundations of Statistical Inference,” reprinted in Breakthroughs in Statistics (volume I, 1993), concerns a principle that remains at the heart of today’s controversies in statistics–even if it isn’t obvious at first: the Likelihood Principle (LP), also called the Strong Likelihood Principle (SLP) to distinguish it from the weak LP [1]. According to the LP/SLP, given the statistical model, the information from the data is fully contained in the likelihood ratio. Thus, properties of the sampling distribution of the test statistic vanish (as I put it in my slides from my last post)! But error probabilities are all properties of the sampling distribution. Thus, embracing the LP (SLP) blocks our error statistician’s direct ways of taking into account “biasing selection effects” (slide #10).
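A standard textbook illustration of what the LP/SLP implies (my own sketch, not from Birnbaum’s paper) contrasts binomial and negative binomial sampling: the two yield proportional likelihoods for θ, yet different p-values, because p-values depend on the sampling distribution. The data here are the classic assumed example of 9 successes and 3 failures, testing H0: θ = 0.5 against θ > 0.5.

```python
# Proportional likelihoods, different p-values: binomial (12 preplanned
# trials) vs. negative binomial (sample until the 3rd failure).
from math import comb

theta0, n, s = 0.5, 12, 9   # 12 trials, 9 successes, 3 failures

# The two likelihoods differ only by a constant factor in theta.
binom_lik = lambda th: comb(n, s) * th**s * (1 - th)**(n - s)
negbin_lik = lambda th: comb(n - 1, s) * th**s * (1 - th)**(n - s)

# Binomial p-value: Pr(S >= 9), S ~ Bin(12, 0.5). With theta0 = 0.5,
# each term theta^k (1-theta)^(n-k) collapses to 0.5^n.
p_binom = sum(comb(n, k) * theta0**n for k in range(s, n + 1))

# Negative binomial p-value: Pr(at least 9 successes before the 3rd
# failure) = 1 - sum over k < 9 of C(k+2, 2) * 0.5^(k+3).
p_negbin = 1 - sum(comb(k + 2, k) * theta0**(k + 3) for k in range(s))

print(f"binomial p = {p_binom:.4f}, negative binomial p = {p_negbin:.4f}")
```

The likelihood ratio between the two models is a constant in θ, so the LP/SLP says the two data sets carry identical evidence; yet the one-sided p-values differ (about .073 versus .033), which is exactly the sampling-distribution information the error statistician refuses to discard.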

Intentions is a New Code Word: Where, then, is all the information regarding your trying and trying again, stopping when the data look good, cherry picking, barn hunting and data dredging? For likelihoodists and other probabilists who hold the LP/SLP, it is ephemeral information locked in your head reflecting your “intentions”! “Intentions” is a code word for “error probabilities” in foundational discussions, as in “who would want to take intentions into account?” (Replace “intentions” (or “the researcher’s intentions”) with “error probabilities” (or “the method’s error probabilities”) and you get a more accurate picture.) Keep this deciphering tool firmly in mind as you read criticisms of methods that take error probabilities into account [2]. For error statisticians, this information reflects real and crucial properties of your inference procedure.

Continue reading

Categories: Birnbaum, Birnbaum Brakes, frequentist/Bayesian, Likelihood Principle, phil/history of stat, Statistics | 48 Comments

What really defies common sense (Msc kvetch on rejected posts)

Msc Kvetch on my Rejected Posts blog.

Categories: frequentist/Bayesian, msc kvetch, rejected post | Leave a comment

Statistical Science: The Likelihood Principle issue is out…!

Abbreviated Table of Contents:

Here are some items for your Saturday-Sunday reading.

Link to complete discussion: 

Mayo, Deborah G. On the Birnbaum Argument for the Strong Likelihood Principle (with discussion & rejoinder). Statistical Science 29 (2014), no. 2, 227-266.

Links to individual papers:

Mayo, Deborah G. On the Birnbaum Argument for the Strong Likelihood Principle. Statistical Science 29 (2014), no. 2, 227-239.

Dawid, A. P. Discussion of “On the Birnbaum Argument for the Strong Likelihood Principle”. Statistical Science 29 (2014), no. 2, 240-241.

Evans, Michael. Discussion of “On the Birnbaum Argument for the Strong Likelihood Principle”. Statistical Science 29 (2014), no. 2, 242-246.

Martin, Ryan; Liu, Chuanhai. Discussion: Foundations of Statistical Inference, Revisited. Statistical Science 29 (2014), no. 2, 247-251.

Fraser, D. A. S. Discussion: On Arguments Concerning Statistical Principles. Statistical Science 29 (2014), no. 2, 252-253.

Hannig, Jan. Discussion of “On the Birnbaum Argument for the Strong Likelihood Principle”. Statistical Science 29 (2014), no. 2, 254-258.

Bjørnstad, Jan F. Discussion of “On the Birnbaum Argument for the Strong Likelihood Principle”. Statistical Science 29 (2014), no. 2, 259-260.

Mayo, Deborah G. Rejoinder: “On the Birnbaum Argument for the Strong Likelihood Principle”. Statistical Science 29 (2014), no. 2, 261-266.

Abstract: An essential component of inference based on familiar frequentist notions, such as p-values, significance and confidence levels, is the relevant sampling distribution. This feature results in violations of a principle known as the strong likelihood principle (SLP), the focus of this paper. In particular, if outcomes x and y from experiments E1 and E2 (both with unknown parameter θ) have different probability models f1( . ), f2( . ), then even though f1(x|θ) = cf2(y|θ) for all θ, outcomes x and y may have different implications for an inference about θ. Although such violations stem from considering outcomes other than the one observed, we argue, this does not require us to consider experiments other than the one performed to produce the data. David Cox [Ann. Math. Statist. 29 (1958) 357–372] proposes the Weak Conditionality Principle (WCP) to justify restricting the space of relevant repetitions. The WCP says that once it is known which Ei produced the measurement, the assessment should be in terms of the properties of Ei. The surprising upshot of Allan Birnbaum’s [J. Amer. Statist. Assoc. 57 (1962) 269–306] argument is that the SLP appears to follow from applying the WCP in the case of mixtures, and so uncontroversial a principle as sufficiency (SP). But this would preclude the use of sampling distributions. The goal of this article is to provide a new clarification and critique of Birnbaum’s argument. Although his argument purports that [(WCP and SP) entails SLP], we show how data may violate the SLP while holding both the WCP and SP. Such cases also refute [WCP entails SLP].

Key words: Birnbaumization, likelihood principle (weak and strong), sampling theory, sufficiency, weak conditionality

Regular readers of this blog know that the topic of the “Strong Likelihood Principle (SLP)” has come up quite frequently. Numerous informal discussions of earlier attempts to clarify where Birnbaum’s argument for the SLP goes wrong may be found on this blog. [SEE PARTIAL LIST BELOW.[i]] These mostly stem from my initial paper Mayo (2010) [ii]. I’m grateful for the feedback.

In the months since this paper has been accepted for publication, I’ve been asked, from time to time, to reflect informally on the overall journey: (1) Why was/is the Birnbaum argument so convincing for so long? (Are there points being overlooked, even now?) (2) What would Birnbaum have thought? (3) What is the likely upshot for the future of statistical foundations (if any)?

I’ll try to share some responses over the next week. (Naturally, additional questions are welcome.)

[i] A quick take on the argument may be found in the appendix to: “A Statistical Scientist Meets a Philosopher of Science: A conversation between David Cox and Deborah Mayo (as recorded, June 2011)”

 UPhils and responses



Categories: Birnbaum, Birnbaum Brakes, frequentist/Bayesian, Likelihood Principle, phil/history of stat, Statistics | 40 Comments

Are P Values Error Probabilities? or, “It’s the methods, stupid!” (2nd install)



Despite the fact that Fisherians and Neyman-Pearsonians alike regard observed significance levels, or P values, as error probabilities, we occasionally hear allegations (typically from those who are neither Fisherian nor N-P theorists) that P values are actually not error probabilities. The denials tend to go hand in hand with allegations that P values exaggerate evidence against a null hypothesis—a problem whose cure invariably invokes measures that are at odds with both Fisherian and N-P tests. The Berger and Sellke (1987) article from a recent post is a good example of this. When leading figures put forward a statement that looks to be straightforwardly statistical, others tend to simply repeat it without inquiring whether the allegation actually mixes in issues of interpretation and statistical philosophy. So I wanted to go back and look at their arguments. I will post this in installments.

1. Some assertions from Fisher, N-P, and Bayesian camps

Here are some assertions from Fisherian, Neyman-Pearsonian and Bayesian camps: (I make no attempt at uniformity in writing the “P-value”, but retain the quotes as written.)

a) From the Fisherian camp (Cox and Hinkley):

For given observations y we calculate t = tobs = t(y), say, and the level of significance pobs by

pobs = Pr(T > tobs; H0).

….Hence pobs is the probability that we would mistakenly declare there to be evidence against H0, were we to regard the data under analysis as being just decisive against H0.” (Cox and Hinkley 1974, 66).

Thus pobs would be the Type I error probability associated with the test.
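For a concrete instance of the Cox–Hinkley definition, pobs for a one-sided z-test can be computed directly from the standard normal tail. A minimal standard-library sketch (my own illustration, not code from Cox and Hinkley):

```python
# p_obs = Pr(T > t_obs; H0) for a z-statistic, where T ~ N(0, 1) under H0.
# The upper normal tail is expressed via the complementary error function.
import math

def p_obs(t_obs: float) -> float:
    """Upper-tail p-value of an observed z-statistic under H0."""
    return 0.5 * math.erfc(t_obs / math.sqrt(2))

print(p_obs(1.96))   # about 0.025, the familiar one-sided 2.5% level
```

Fixing the cutoff at the observed t gives the hypothetical error rate of the procedure “reject whenever T exceeds tobs”, which is exactly the sense in which pobs is an error probability1.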

b) From the Neyman-Pearson N-P camp (Lehmann and Romano):

“[I]t is good practice to determine not only whether the hypothesis is accepted or rejected at the given significance level, but also to determine the smallest significance level…at which the hypothesis would be rejected for the given observation. This number, the so-called p-value gives an idea of how strongly the data contradict the hypothesis. It also enables others to reach a verdict based on the significance level of their choice.” (Lehmann and Romano 2005, 63-4) 

Very similar quotations are easily found, and are regarded as uncontroversial—even by Bayesians whose contributions stood at the foot of Berger and Sellke’s argument that P values exaggerate the evidence against the null. Continue reading

Categories: frequentist/Bayesian, J. Berger, P-values, Statistics | 32 Comments

Roger Berger on Stephen Senn’s “Blood Simple” with a response by Senn (Guest posts)

Roger L. Berger

School Director & Professor
School of Mathematical & Natural Science
Arizona State University

Comment on S. Senn’s post: “Blood Simple? The complicated and controversial world of bioequivalence” (*)

First, I do agree with Senn’s statement that “the FDA requires conventional placebo-controlled trials of a new treatment to be tested at the 5% level two-sided but since they would never accept a treatment that was worse than placebo the regulator’s risk is 2.5% not 5%.” The FDA procedure essentially defines a one-sided test with Type I error probability (size) of .025. Why it is not just called this, I do not know. And if the regulators believe .025 is the appropriate Type I error probability, then perhaps it should be used in other situations, e.g., bioequivalence testing, as well.

Senn refers to a paper by Hsu and me (Berger and Hsu (1996)), and then attempts to characterize what we said. Unfortunately, I believe he has mischaracterized. Continue reading

Categories: bioequivalence, frequentist/Bayesian, PhilPharma, Statistics | Tags: , | 22 Comments

The Science Wars & the Statistics Wars: More from the Scientism workshop

Here are the slides from my presentation (May 17) at the Scientism workshop in NYC. (They’re sketchy since we were trying for 25-30 minutes.) Below them are some mini notes on some of the talks.

Now for my informal notes. Here’s a link to the Speaker abstracts; the presentations may now be found at the conference site here. Comments, questions, and corrections are welcome. Continue reading

Categories: evidence-based policy, frequentist/Bayesian, Higgs, P-values, scientism, Statistics, StatSci meets PhilSci | 11 Comments

A. Spanos: Talking back to the critics using error statistics (Phil6334)

spanos 2014

Aris Spanos’ overview of error statistical responses to familiar criticisms of statistical tests. Related reading is Mayo and Spanos (2011)

Categories: Error Statistics, frequentist/Bayesian, Phil6334, reforming the reformers, statistical tests, Statistics | Leave a comment

Wasserman on Wasserman: Update! December 28, 2013

Professor Larry Wasserman


I had invited Larry to give an update, and I’m delighted that he has! The discussion relates to the last post (by Spanos), which follows upon my deconstruction of Wasserman*. So, for your Saturday night reading pleasure, join me** in reviewing this and the past two blogs and the links within.

“Wasserman on Wasserman: Update! December 28, 2013”

My opinions have shifted a bit.

My reference to Franken’s joke suggested that the usual philosophical debates about the foundations of statistics were unimportant, much like the debate about media bias. I was wrong on both counts.

First, I now think Franken was wrong. CNN and network news have a strong liberal bias, especially on economic issues. FOX has an obvious right wing, and anti-atheist bias. (At least FOX has some libertarians on the payroll.) And this does matter. Because people believe what they see on TV and what they read in the NY times. Paul Krugman’s socialist bullshit parading as economics has brainwashed millions of Americans. So media bias is much more than who makes better hummus.

Similarly, the Bayes-Frequentist debate still matters. And people — including many statisticians — are still confused about the distinction. I thought the basic Bayes-Frequentist debate was behind us. A year and a half of blogging (as well as reading other blogs) convinced me I was wrong here too. And this still does matter. Continue reading

Categories: Error Statistics, frequentist/Bayesian, Statistics, Wasserman | 56 Comments

Lucien Le Cam: “The Bayesians hold the Magic”

Nov.18, 1924 -April 25, 2000


Today is Lucien Le Cam’s birthday. He was an error statistician whose remarks in an article, “A Note on Metastatistics,” in a collection on foundations of statistics (Le Cam 1977)* had some influence on me. A statistician at Berkeley, Le Cam was a co-editor with Neyman of the Berkeley Symposia volumes. I hadn’t mentioned him on this blog before, so here are some snippets from EGEK (Mayo, 1996, 337-8; 350-1) that begin with a snippet from a passage from Le Cam (1977) (here I have fleshed it out):

“One of the claims [of the Bayesian approach] is that the experiment matters little, what matters is the likelihood function after experimentation. Whether this is true, false, unacceptable or inspiring, it tends to undo what classical statisticians have been preaching for many years: think about your experiment, design it as best you can to answer specific questions, take all sorts of precautions against selection bias and your subconscious prejudices. It is only at the design stage that the statistician can help you.

Another claim is the very curious one that if one follows the neo-Bayesian theory strictly one would not randomize experiments….However, in this particular case the injunction against randomization is a typical product of a theory which ignores differences between experiments and experiences and refuses to admit that there is a difference between events which are made equiprobable by appropriate mechanisms and events which are equiprobable by virtue of ignorance. …

In spite of this the neo-Bayesian theory places randomization on some kind of limbo, and thus attempts to distract from the classical preaching that double blind randomized experiments are the only ones really convincing.

There are many other curious statements concerning confidence intervals, levels of significance, power, and so forth. These statements are only confusing to an otherwise abused public”. (Le Cam 1977, 158)

Back to EGEK:

Why does embracing the Bayesian position tend to undo what classical statisticians have been preaching? Because Bayesian and classical statisticians view the task of statistical inference very differently.

In [chapter 3, Mayo 1996] I contrasted these two conceptions of statistical inference by distinguishing evidential-relationship or E-R approaches from testing approaches, … .

The E-R view is modeled on deductive logic, only with probabilities. In the E-R view, the task of a theory of statistics is to say, for given evidence and hypotheses, how well the evidence confirms or supports hypotheses (whether absolutely or comparatively). There is, I suppose, a certain confidence and cleanness to this conception that is absent from the error-statistician’s view of things. Error statisticians eschew grand and unified schemes for relating their beliefs, preferring a hodgepodge of methods that are truly ampliative. Error statisticians appeal to statistical tools as protection from the many ways they know they can be misled by data as well as by their own beliefs and desires. The value of statistical tools for them is to develop strategies that capitalize on their knowledge of mistakes: strategies for collecting data, for efficiently checking an assortment of errors, and for communicating results in a form that promotes their extension by others.

Given the difference in aims, it is not surprising that information relevant to the Bayesian task is very different from that relevant to the task of the error statistician. In this section I want to sharpen and make more rigorous what I have already said about this distinction.

…. the secret to solving a number of problems about evidence, I hold, lies in utilizing—formally or informally—the error probabilities of the procedures generating the evidence. It was the appeal to severity (an error probability), for example, that allowed distinguishing among the well-testedness of hypotheses that fit the data equally well… .

A few pages later in a section titled “Bayesian Freedom, Bayesian Magic” (350-1):

 A big selling point for adopting the LP (strong likelihood principle), and with it the irrelevance of stopping rules, is that it frees us to do things that are sinful and forbidden to an error statistician.

“This irrelevance of stopping rules to statistical inference restores a simplicity and freedom to experimental design that had been lost by classical emphasis on significance levels (in the sense of Neyman and Pearson). . . . Many experimenters would like to feel free to collect data until they have either conclusively proved their point, conclusively disproved it, or run out of time, money or patience … Classical statisticians … have frowned on [this]”. (Edwards, Lindman, and Savage 1963, 239)1

Breaking loose from the grip imposed by error probabilistic requirements returns to us an appealing freedom.

Le Cam, … hits the nail on the head:

“It is characteristic of [Bayesian approaches] [2] . . . that they … tend to treat experiments and fortuitous observations alike. In fact, the main reason for their periodic return to fashion seems to be that they claim to hold the magic which permits [us] to draw conclusions from whatever data and whatever features one happens to notice”. (Le Cam 1977, 145)

In contrast, the error probability assurances go out the window if you are allowed to change the experiment as you go along. Repeated tests of significance (or sequential trials) are permitted, are even desirable for the error statistician; but a penalty must be paid for perseverance—for optional stopping. Before-trial planning stipulates how to select a small enough significance level to be on the lookout for at each trial so that the overall significance level is still low. …. Wearing our error probability glasses—glasses that compel us to see how certain procedures alter error probability characteristics of tests—we are forced to say, with Armitage, that “Thou shalt be misled if thou dost not know that” the data resulted from the try and try again stopping rule. To avoid having a high probability of following false leads, the error statistician must scrupulously follow a specified experimental plan. But that is because we hold that error probabilities of the procedure alter what the data are saying—whereas Bayesians do not. The Bayesian is permitted the luxury of optional stopping and has nothing to worry about. The Bayesians hold the magic.

Or is it voodoo statistics?
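The Armitage point above — that "thou shalt be misled" if you ignore a try-and-try-again stopping rule — can be checked numerically. Here is a minimal simulation sketch (my own illustration, not from the post; the sample sizes, seed, and nominal level are chosen arbitrarily for the demonstration) of repeated significance testing under a true null hypothesis:

```python
# Illustrative simulation (editor's example): how the "try and try again"
# optional stopping rule inflates the overall Type I error rate.
# We test H0: mu = 0 (data really are N(0,1), so H0 is TRUE) with a
# two-sided z-test at nominal alpha = 0.05, adding one observation at a
# time up to n_max and stopping the moment p < alpha.
import math
import random

def two_sided_p(z):
    """Two-sided p-value for a standard-normal z statistic."""
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

def rejects_with_optional_stopping(n_max, alpha=0.05):
    """Peek after every new observation; reject as soon as p < alpha."""
    total = 0.0
    for n in range(1, n_max + 1):
        total += random.gauss(0.0, 1.0)
        z = total / math.sqrt(n)  # known sigma = 1
        if n >= 2 and two_sided_p(z) < alpha:
            return True  # a "conclusive" result under a true null
    return False

random.seed(1)
trials, n_max = 2000, 100
hits = sum(rejects_with_optional_stopping(n_max) for _ in range(trials))
print(f"Nominal alpha: 0.05; actual rejection rate with peeking "
      f"up to n={n_max}: {hits / trials:.2f}")
```

Even with a modest ceiling of 100 observations per trial, the actual probability of reaching a nominally significant result under the true null far exceeds the advertised 5% — which is exactly why the error statistician insists that the stopping rule alters what the data are saying, and why sequential designs charge a penalty for each interim look.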

When I sent him a note, saying his work had inspired me, he modestly responded that he doubted he could have had all that much of an impact.

*I had forgotten that this Synthese (1977) volume on foundations of probability and statistics is the one dedicated to the memory of Allan Birnbaum after his suicide: “By publishing this special issue we wish to pay homage to professor Birnbaum’s penetrating and stimulating work on the foundations of statistics” (Editorial Introduction). In fact, I somehow had misremembered it as being in a Harper and Hooker volume from 1976. The Synthese volume contains papers by Giere, Birnbaum, Lindley, Pratt, Smith, Kyburg, Neyman, Le Cam, and Kiefer.


Armitage, P. (1961). Contribution to discussion in Consistency in statistical inference and decision, by C. A. B. Smith. Journal of the Royal Statistical Society (B) 23:1-37.

_______(1962). Contribution to discussion in The foundations of statistical inference, edited by L. Savage. London: Methuen.

_______(1975). Sequential Medical Trials. 2nd ed. New York: John Wiley & Sons.

Edwards, W., H. Lindman & L. Savage (1963) Bayesian statistical inference for psychological research. Psychological Review 70: 193-242.

Le Cam, L. (1974). J. Neyman: on the occasion of his 80th birthday. Annals of Statistics, Vol. 2, No. 3 , pp. vii-xiii, (with E.L. Lehmann).

Le Cam, L. (1977). A note on metastatistics or “An essay toward stating a problem in the doctrine of chances.”  Synthese 36: 133-60.

Le Cam, L. (1982). A remark on empirical measures in Festschrift in the honor of E. Lehmann. P. Bickel, K. Doksum & J. L. Hodges, Jr. eds., Wadsworth  pp. 305-327.

Le Cam, L. (1986). The central limit theorem around 1935. Statistical Science, Vol. 1, No. 1,  pp. 78-96.

Le Cam, L. (1988) Discussion of “The Likelihood Principle,” by J. O. Berger and R. L. Wolpert. IMS Lecture Notes Monogr. Ser. 6 182–185. IMS, Hayward, CA

Le Cam, L. (1996) Comparison of experiments: A short review. In Statistics, Probability and Game Theory. Papers in Honor of David Blackwell 127–138. IMS, Hayward, CA.

Le Cam, L.,  J. Neyman and E. L. Scott (Eds). (1973). Proceedings of the Sixth Berkeley Symposium on Mathematical Statistics and Probability, Vol. l: Theory of Statistics, Vol. 2: Probability Theory, Vol. 3: Probability Theory. Univ. of Calif. Press, Berkeley Los Angeles.

Mayo, D. (1996). [EGEK] Error Statistics and the Growth of Experimental Knowledge. Chicago: University of Chicago Press. (Chapter 10; Chapter 3)

Neyman, J. and L. Le Cam (Eds). (1967).  Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Vol. I: Statistics, Vol. II: Probability Part I & Part II. Univ. of Calif. Press, Berkeley and Los Angeles.

[1] For some links on optional stopping on this blog: “Highly probable vs highly probed: Bayesian/error statistical differences”; “Who is allowed to cheat? I.J. Good and that after dinner comedy hour…”; “New Summary”; Mayo: (section 7) “StatSci and PhilSci: part 2″; “After dinner Bayesian comedy hour…”. Search for more, if interested.

[2] Le Cam is alluding mostly to Savage, and (what he called) the “neo-Bayesian” accounts.

Categories: Error Statistics, frequentist/Bayesian, phil/history of stat, strong likelihood principle | 58 Comments

Guest Post: Larry Laudan. Why Presuming Innocence is Not a Bayesian Prior

“Why presuming innocence has nothing to do with assigning low prior probabilities to the proposition that defendant didn’t commit the crime”

by Professor Larry Laudan
Philosopher of Science*

Several of the comments to the July 17 post about the presumption of innocence suppose that jurors are asked to believe, at the outset of a trial, that the defendant did not commit the crime and that they can legitimately convict him if and only if they are eventually persuaded that it is highly likely (pursuant to the prevailing standard of proof) that he did in fact commit it. Failing that, they must find him not guilty. Many contributors here are conjecturing how confident jurors should be at the outset about defendant’s material innocence.

That is a natural enough Bayesian way of formulating the issue but I think it drastically misstates what the presumption of innocence amounts to. In my view, the presumption is not (or at least should not be) an instruction about whether jurors believe defendant did or did not commit the crime. It is, rather, an instruction about their probative attitudes.

There are three reasons for thinking this:

a). Asking a juror to begin a trial believing that the defendant did not commit a crime requires a doxastic act that is probably outside the jurors’ control. It would involve asking jurors to strongly believe an empirical assertion for which they have no evidence whatsoever. It is wholly unclear that any of us has the ability to talk ourselves into resolutely believing x when we have no empirical grounds for asserting x. By contrast, asking jurors to believe that they have as yet seen no proof of the defendant’s guilt is an easy belief to acquiesce in, since it is obviously true.

Categories: frequentist/Bayesian, PhilStatLaw, Statistics | 28 Comments

Mayo Commentary on Gelman & Robert

The following is my commentary on a paper by Gelman and Robert, forthcoming (in early 2013) in The American Statistician* (submitted October 3, 2012).


Discussion of Gelman and Robert, “‘Not only defended but also applied’: The perceived absurdity of Bayesian inference”
Deborah G. Mayo

1. Introduction

I am grateful for the chance to comment on the paper by Gelman and Robert. I welcome seeing statisticians raise philosophical issues about statistical methods, and I entirely agree that methods not only should be applicable but also capable of being defended at a foundational level. “It is doubtful that even the most rabid anti-Bayesian of 2010 would claim that Bayesian inference cannot apply” (Gelman and Robert 2012, p. 6). This is clearly correct; in fact, it is not far off the mark to say that the majority of statistical applications nowadays are placed under the Bayesian umbrella, even though the goals and interpretations found there are extremely varied. There is a plethora of international societies, journals, post-docs, and prizes with “Bayesian” in their name, and a wealth of impressive new Bayesian textbooks and software is available. Even before the latest technical advances and the rise of “objective” Bayesian methods, leading statisticians were calling for eclecticism (e.g., Cox 1978), and most will claim to use a smattering of Bayesian and non-Bayesian methods, as appropriate. George Casella (to whom their paper is dedicated) and Roger Berger in their superb textbook (2002) exemplify a balanced approach.

Categories: frequentist/Bayesian, Statistics | 24 Comments
