Monthly Archives: February 2017

R.A. Fisher: “Statistical methods and Scientific Induction”

I continue a week of Fisherian posts in honor of his birthday (Feb 17). This is his contribution to the “Triad”: an exchange between Fisher, Neyman and Pearson 20 years after the Fisher-Neyman break-up. The three papers are each very short.

17 February 1890 — 29 July 1962

“Statistical Methods and Scientific Induction”

by Sir Ronald Fisher (1955)


The attempt to reinterpret the common tests of significance used in scientific research as though they constituted some kind of  acceptance procedure and led to “decisions” in Wald’s sense, originated in several misapprehensions and has led, apparently, to several more.

The three phrases examined here, with a view to elucidating the fallacies they embody, are:

  1. “Repeated sampling from the same population”,
  2. Errors of the “second kind”,
  3. “Inductive behavior”.

Mathematicians without personal contact with the Natural Sciences have often been misled by such phrases. The errors to which they lead are not only numerical.

To continue reading, see Fisher’s full paper.

The most noteworthy feature is Fisher’s position on fiducial inference, typically downplayed. I’m placing a summary of and link to Neyman’s response below; it’s that interesting.

Note on an Article by Sir Ronald Fisher

by Jerzy Neyman (1956)




  1. Fisher’s allegation that, contrary to some passages in the introduction and on the cover of the book by Wald, this book does not really deal with experimental design is unfounded. In actual fact, the book is permeated with problems of experimentation.
  2. Without consideration of hypotheses alternative to the one under test and without the study of probabilities of the two kinds, no purely probabilistic theory of tests is possible.
  3. The conceptual fallacy of the notion of fiducial distribution rests upon the lack of recognition that valid probability statements about random variables usually cease to be valid if the random variables are replaced by their particular values. The notorious multitude of “paradoxes” of fiducial theory is a consequence of this oversight.
  4. The idea of a “cost function for faulty judgments” appears to be due to Laplace, followed by Gauss.


Categories: fiducial probability, Fisher, Neyman, phil/history of stat | 3 Comments

Guest Blog: ARIS SPANOS: The Enduring Legacy of R. A. Fisher

By Aris Spanos

One of R. A. Fisher’s (17 February 1890 — 29 July 1962) most remarkable, but least recognized, achievements was to initiate the recasting of statistical induction. Fisher (1922) pioneered modern frequentist statistics as a model-based approach to statistical induction anchored on the notion of a statistical model, formalized by:

Mθ(x) = {f(x;θ), θ∈Θ}, x∈R^n, Θ⊂R^m, m < n;   (1)

where the distribution of the sample f(x;θ) ‘encapsulates’ the probabilistic information in the statistical model.

Before Fisher, the notion of a statistical model was vague and often implicit, and its role was primarily confined to the description of the distributional features of the data in hand using the histogram and the first few sample moments, implicitly imposing random (IID) samples. The problem was that statisticians at the time would use descriptive summaries of the data to claim generality beyond the data in hand x0:=(x1,x2,…,xn). As late as the 1920s, the problem of statistical induction was understood by Karl Pearson in terms of invoking (i) the ‘stability’ of empirical results for subsequent samples and (ii) a prior distribution for θ.

Fisher was able to recast statistical inference by turning Karl Pearson’s approach, proceeding from data x0 in search of a frequency curve f(x;θ) to describe its histogram, on its head. He proposed to begin with a prespecified Mθ(x) (a ‘hypothetical infinite population’), and view x0 as a ‘typical’ realization thereof; see Spanos (1999).

In my mind, Fisher’s most enduring contribution is his devising a general way to ‘operationalize’ errors by embedding the material experiment into Mθ(x), and taming errors via probabilification, i.e. defining frequentist error probabilities in the context of a statistical model. These error probabilities are (a) deductively derived from the statistical model, and (b) provide a measure of the ‘effectiveness’ of the inference procedure: how often a certain method will give rise to correct inferences concerning the underlying ‘true’ Data Generating Mechanism (DGM). This cast aside the need for a prior. Both of these key elements, the statistical model and the error probabilities, have been refined and extended by Mayo’s error statistical approach (e.g., Mayo 1996). Learning from data is achieved when an inference is reached by an inductive procedure which, with high probability, will yield true conclusions from valid inductive premises (a statistical model); see Mayo and Spanos (2011).

Frequentist statistical inference was largely in place by the late 1930s. Fisher, almost single-handedly, created the current theory of ‘optimal’ point estimation and formalized significance testing based on p-value reasoning. In the early 1930s Neyman and Pearson (N-P) proposed an ‘optimal’ theory for hypothesis testing, by modifying/extending Fisher’s significance testing. By the late 1930s Neyman proposed an ‘optimal’ theory for interval estimation analogous to N-P testing. Despite these developments in frequentist statistics, its philosophical foundations, concerned with the proper form of the underlying inductive reasoning, were in a confused state. Fisher was arguing for ‘inductive inference’, spearheaded by his significance testing in conjunction with p-values and his fiducial probability for interval estimation. Neyman was arguing for ‘inductive behavior’ based on N-P testing and confidence interval estimation firmly grounded on pre-data error probabilities.

The last exchange between these pioneers took place in the mid 1950s (see [Fisher, 1955; Neyman, 1956; Pearson, 1955]) and left the philosophical foundations of the field in a state of confusion with many more questions than answers.

One of the key issues of disagreement was the relevance of alternative hypotheses and the role of pre-data error probabilities in frequentist testing, i.e. the irrelevance of errors of the “second kind”, as Fisher (1955, p. 69) framed the issue. My take on this issue is that Fisher did understand the importance of alternative hypotheses and the power of a test, by talking about its ‘sensitivity’:

“By increasing the size of the experiment, we can render it more sensitive, meaning by this that it will allow of the detection of a lower degree of sensory discrimination, or, in other words, of a quantitatively smaller departure from the null hypothesis.” (Fisher, 1935, p. 22)

If this is not the same as increasing the power of a test by increasing the sample size, I do not know what it is! What Fisher and many subsequent commentators did not appreciate enough was that Neyman and Pearson defined the relevant alternative hypotheses in a very specific way: to be the complement of the null relative to the prespecified statistical model Mθ(x):

H0: µ∈Θ0 vs. H1: µ∈Θ1 (2)

where Θ0 and Θ1 constitute a partition of the parameter space Θ. That rendered the evaluation of power possible, and it rendered Fisher’s comment about type II errors:

“Such errors are therefore incalculable both in frequency and in magnitude merely from the specification of the null hypothesis.”

simply misplaced.
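As a hypothetical illustration of the ‘sensitivity’ point (the alternative value μ = 0.3, σ = 1, and α = 0.05 below are made-up numbers, not from Spanos’s post), one can compute the power of a simple one-sided z-test at two sample sizes:

```python
from statistics import NormalDist

# Power of a one-sided z-test of H0: mu <= 0 vs H1: mu > 0,
# with sigma = 1 and alpha = 0.05, against the alternative mu = 0.3.
def power(n, mu1=0.3, sigma=1.0, alpha=0.05):
    z_crit = NormalDist().inv_cdf(1 - alpha)   # cutoff for the standardized test statistic
    shift = mu1 * n ** 0.5 / sigma             # mean of the statistic under the alternative
    return 1 - NormalDist().cdf(z_crit - shift)

print(round(power(25), 3))    # 0.442
print(round(power(100), 3))   # 0.912: quadrupling n makes the test far more "sensitive"
```

Increasing n shifts the sampling distribution of the test statistic under the alternative, which is exactly what raises the probability of detecting a given discrepancy from the null.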

Let me finish with a quotation from Fisher (1935) that I find very insightful and as relevant today as it was then:

“In the field of pure research no assessment of the cost of wrong conclusions, or of delay in arriving at more correct conclusions can conceivably be more than a pretence, and in any case such an assessment would be inadmissible and irrelevant in judging the state of the scientific evidence.” (pp. 25-26)

This post was first blogged in 2012.


[1] Fisher, R. A. (1922), “On the mathematical foundations of theoretical statistics”, Philosophical Transactions of the Royal Society A, 222: 309-368.

[2] Fisher, R. A. (1935), The Design of Experiments, Oliver and Boyd, Edinburgh.

[3] Fisher, R. A. (1955), “Statistical methods and scientific induction,” Journal of the Royal Statistical Society, B, 17: 69-78.

[4] Mayo, D. G. and A. Spanos (2011), “Error Statistics,” pp. 151-196 in the Handbook of Philosophy of Science, vol. 7: Philosophy of Statistics, D. Gabbay, P. Thagard, and J. Woods (editors), Elsevier.

[5] Neyman, J. (1956), “Note on an Article by Sir Ronald Fisher,” Journal of the Royal Statistical Society, B, 18: 288-294.

[6] Pearson, E. S. (1955), “Statistical Concepts in Their Relation to Reality,” Journal of the Royal Statistical Society, B, 17: 204-207.

[7] Spanos, A. (1999), Probability Theory and Statistical Inference: Econometric Modeling with Observational Data, Cambridge University Press, Cambridge.

Categories: Fisher, Spanos, Statistics | Tags: , , , , , , | Leave a comment

R.A. Fisher: ‘Two New Properties of Mathematical Likelihood’

17 February 1890–29 July 1962

Today is R.A. Fisher’s birthday. I’ll post some different Fisherian items this week in honor of it. This paper comes just before the conflicts with Neyman and Pearson erupted. Fisher links his tests and sufficiency to the Neyman-Pearson lemma in terms of power. It’s as if we may see them as ending up in a similar place while starting from different origins. I quote just the most relevant portions; the full article is linked below. Happy Birthday, Fisher!

Two New Properties of Mathematical Likelihood

by R.A. Fisher, F.R.S.

Proceedings of the Royal Society, Series A, 144: 285-307 (1934)

  The property that where a sufficient statistic exists, the likelihood, apart from a factor independent of the parameter to be estimated, is a function only of the parameter and the sufficient statistic, explains the principal result obtained by Neyman and Pearson in discussing the efficacy of tests of significance.  Neyman and Pearson introduce the notion that any chosen test of a hypothesis H0 is more powerful than any other equivalent test, with regard to an alternative hypothesis H1, when it rejects H0 in a set of samples having an assigned aggregate frequency ε when H0 is true, and the greatest possible aggregate frequency when H1 is true.

If any group of samples can be found within the region of rejection whose probability of occurrence on the hypothesis H1 is less than that of any other group of samples outside the region, but is not less on the hypothesis H0, then the test can evidently be made more powerful by substituting the one group for the other.

Consequently, for the most powerful test possible the ratio of the probabilities of occurrence on the hypothesis H0 to that on the hypothesis H1 is less in all samples in the region of rejection than in any sample outside it. For samples involving continuous variation the region of rejection will be bounded by contours for which this ratio is constant. The regions of rejection will then be required in which the likelihood of H0 bears to the likelihood of H1, a ratio less than some fixed value defining the contour. (295)…

It is evident, at once, that such a system is only possible when the class of hypotheses considered involves only a single parameter θ, or, what comes to the same thing, when all the parameters entering into the specification of the population are definite functions of one of their number.  In this case, the regions defined by the uniformly most powerful test of significance are those defined by the estimate of maximum likelihood, T.  For the test to be uniformly most powerful, moreover, these regions must be independent of θ showing that the statistic must be of the special type distinguished as sufficient.  Such sufficient statistics have been shown to contain all the information which the sample provides relevant to the value of the appropriate parameter θ.  It is inevitable therefore that if such a statistic exists it should uniquely define the contours best suited to discriminate among hypotheses differing only in respect of this parameter; and it is surprising that Neyman and Pearson should lay it down as a preliminary consideration that ‘the testing of statistical hypotheses cannot be treated as a problem in estimation.’ When tests are considered only in relation to sets of hypotheses specified by one or more variable parameters, the efficacy of the tests can be treated directly as the problem of estimation of these parameters.  Regard for what has been established in that theory, apart from the light it throws on the results already obtained by their own interesting line of approach, should also aid in treating the difficulties inherent in cases in which no sufficient statistic exists. (296)

Categories: Fisher, phil/history of stat, Statistics | Tags: , , , | 2 Comments

Winner of the January 2017 Palindrome contest: Cristiano Sabiu

Winner of January 2017 Palindrome Contest: (a dozen book choices)



Cristiano Sabiu: Postdoctoral researcher in Cosmology and Astrophysics

Palindrome: El truth supremo nor tsar is able, Elba Sir Astronomer push turtle.

The requirement: A palindrome using “astronomy” (or “astronomer”/“astronomical”) and Elba, of course.

Book choice: Error and the Growth of Experimental Knowledge (D. Mayo 1996, Chicago)

Bio: Cristiano Sabiu is a postdoctoral researcher in Cosmology and Astrophysics, working on Dark Energy and testing Einstein’s theory of General Relativity. He was born in Scotland with Italian roots and currently resides in Daejeon, South Korea.

Statement: This was my first palindrome! I was never very interested in writing when I was younger (I almost failed English at school!). However, as my years progress I feel that writing/poetry may be the easiest way for us non-artists to express that which cannot easily be captured by our theorems and logical frameworks. Constrained writing seems to open some of those internal mental doors, I think I am hooked now. Thanks for organising this!

Mayo Comment: Thanks for entering, Cristiano; you just made the “time extension” for this month. That means we won’t have a second month of “astronomy” and the judges will have to come up with a new word. I’m glad you’re hooked. Good choice of book! I especially like the “truth supremo/push turtle”. I’m also very interested in experimental testing of GTR; we’ll have to communicate on this.

Mayo’s January attempts (selected):

  • Elba rap star comedy: Mr. Astronomy. Testset tests etymon or tsar, my democrats’ parable.
  • Parable for astronomy gym, on or tsar of Elba rap.
Categories: Palindrome | Leave a comment

Cox’s (1958) weighing machine example



A famous chestnut given by Cox (1958) recently came up in conversation. The example “is now usually called the ‘weighing machine example,’ which draws attention to the need for conditioning, at least in certain types of problems” (Reid 1992, p. 582). When I describe it, you’ll find it hard to believe many regard it as causing an earthquake in statistical foundations, unless you’re already steeped in these matters. If half the time I reported my weight from a scale that’s always right, and half the time from a scale that gets it right only with probability .5, would you say I’m right with probability ¾? Well, maybe. But suppose you knew that this measurement was made with the scale that’s right with probability .5? The overall error probability is scarcely relevant for assessing the warrant of the particular measurement, knowing which scale was used.
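A quick simulation makes the point concrete (a sketch with made-up mechanics, not Cox’s own formulation): the unconditional probability of being right is indeed about ¾, but it is irrelevant once you condition on which scale was actually used.

```python
import random

random.seed(1)
# One scale is always right; the other is right with probability .5.
# A fair coin flip decides which scale is used on a given occasion.
trials = 100_000
correct_overall = 0
correct_given_bad_scale = 0
bad_scale_uses = 0
for _ in range(trials):
    use_good_scale = random.random() < 0.5
    right = True if use_good_scale else (random.random() < 0.5)
    correct_overall += right
    if not use_good_scale:
        bad_scale_uses += 1
        correct_given_bad_scale += right

print(round(correct_overall / trials, 2))                  # ~0.75 unconditionally
print(round(correct_given_bad_scale / bad_scale_uses, 2))  # ~0.50 once we condition
```

The conditional figure, not the unconditional ¾, is what speaks to the warrant of this particular measurement.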

The following is an excerpt from Cox and Mayo (2010, 295-8):


It had long been thought that the weak conditionality principle (WCP) entails the (strong) Likelihood Principle (LP), which renders error probabilities irrelevant to parametric inference once the data are known. I give a disproof in Mayo (2010), but later recognized the need for a deeper argument, which I gave in Mayo (2014). If you’re interested, the link to Statistical Science includes comments by Bjornstad, Dawid, Evans, Fraser, Hannig, and Martin and Liu. You can find quite a lot on the LP by searching this blog; it was a main topic for the first few years of this blog.

Cox D. R. and Mayo. D. G. (2010). “Objectivity and Conditionality in Frequentist Inference” in Error and Inference: Recent Exchanges on Experimental Reasoning, Reliability and the Objectivity and Rationality of Science (D Mayo and A. Spanos eds.), Cambridge: Cambridge University Press: 276-304.

Mayo, D. G. (2010). “An Error in the Argument from Conditionality and Sufficiency to the Likelihood Principle” in Error and Inference: Recent Exchanges on Experimental Reasoning, Reliability and the Objectivity and Rationality of Science (D Mayo and A. Spanos eds.), Cambridge: Cambridge University Press: 305-14.

Mayo, D. G. (2014). “On the Birnbaum Argument for the Strong Likelihood Principle” (with discussion and Mayo rejoinder), Statistical Science 29(2): 227-239, 261-266.

Reid, N. (1992). Introduction to Fraser (1966) structural probability and a generalization. In Breakthroughs in Statistics (S. Kotz and N. L. Johnson, eds.) 579–586. Springer Series in Statistics. Springer, New York.

Categories: Error Statistics, Sir David Cox, Statistics, strong likelihood principle | 1 Comment

Hocus pocus! Adopt a magician’s stance, if you want to reveal statistical sleights of hand



Here’s the follow-up post to the one I reblogged on Feb 3 (please read that one first). When they sought to subject Uri Geller to the scrutiny of scientists, magicians had to be brought in because only they were sufficiently trained to spot the subtle sleight of hand shifts by which the magician tricks by misdirection. We, too, have to be magicians to discern the subtle misdirections and shifts of meaning in the discussions of statistical significance tests (and other methods)—even by the same statistical guide. We needn’t suppose anything deliberately devious is going on at all! Often, the statistical guidebook reflects shifts of meaning that grow out of one or another critical argument. These days, they trickle down quickly to statistical guidebooks, thanks to popular articles on the “statistics crisis in science”. The danger is that their own guidebooks contain inconsistencies. To adopt the magician’s stance is to be on the lookout for standard sleights of hand. There aren’t that many.[0]

I don’t know Jim Frost, but he gives statistical guidance at the minitab blog. The purpose of my previous post is to point out that Frost uses the probability of a Type I error in two incompatible ways in his posts on significance tests. I assumed he’d want to clear this up, but so far he has not. His response to a comment I made on his blog is this:

Based on Fisher’s measure of evidence approach, the correct way to interpret a p-value of exactly 0.03 is like this:

Assuming the null hypothesis is true, you’d obtain the observed effect or more in 3% of studies due to random sampling error.

We know that the p-value is not the error rate because:

1) The math behind the p-values is not designed to calculate the probability that the null hypothesis is true (which is actually incalculable based solely on sample statistics). …

But this is also true for a test’s significance level α, so on these grounds α couldn’t be an “error rate” or error probability either. Yet Frost defines α to be a Type I error probability (“An α of 0.05 indicates that you are willing to accept a 5% chance that you are wrong when you reject the null hypothesis“.) [1]

Let’s use the philosopher’s slightly obnoxious but highly clarifying move of subscripts. There is error probability1—the usual frequentist (sampling theory) notion—and error probability2—the posterior probability that the null hypothesis is true conditional on the data, as in Frost’s remark.  (It may also be stated as conditional on the p-value, or on rejecting the null.) Whether a p-value is predesignated or attained (observed), error probability1 ≠ error probability2.[2] Frost, inadvertently I assume, uses the probability of a Type I error in these two incompatible ways in his posts on significance tests.[3]

Interestingly, the simulations Frost refers to, which “show that the actual probability that the null hypothesis is true [i.e., error probability2] tends to be greater than the p-value by a large margin”, work with a fixed p-value, or α level, of say .05. So it’s not a matter of predesignated vs. attained p-values [4]. Their computations also employ predesignated probabilities of Type II errors and corresponding power values. The null is rejected based on a single finding that attains a .05 p-value. Moreover, the point null (of “no effect”) is given a spiked prior of .5. (The idea comes from a context of diagnostic testing; the prior is often based on an assumed “prevalence” of true nulls of which the current null is a member. Please see my previous post.)
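The arithmetic behind such “error probability2” computations can be sketched in a few lines; the prevalence and power figures below are illustrative assumptions of mine, not values taken from Frost’s sources:

```python
# P(H0 true | H0 rejected) -- the "false finding rate" -- by Bayes' theorem,
# given an assumed prevalence of true nulls, a fixed alpha, and a fixed power.
def false_finding_rate(prevalence, alpha, power):
    true_null_rejects = prevalence * alpha          # rejections when H0 is true
    false_null_rejects = (1 - prevalence) * power   # rejections when H0 is false
    return true_null_rejects / (true_null_rejects + false_null_rejects)

print(round(false_finding_rate(0.5, 0.05, 0.80), 3))  # 0.059 with high power
print(round(false_finding_rate(0.5, 0.05, 0.20), 3))  # 0.2 with low power
```

Note how the result is driven entirely by the assumed prevalence and power, neither of which is supplied by the significance test itself; that is exactly why this quantity differs in kind from error probability1.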

Their simulations are the basis of criticisms of error probability1 because what really matters, or so these critics presuppose, is error probability2.

Whether this assumption is correct, and whether these simulations are the slightest bit relevant to appraising the warrant for a given hypothesis, are completely distinct issues. I’m just saying that Frost’s own links mix these notions. If you approach statistical guidebooks with the magician’s suspicious eye, however, you can pull back the curtain on these sleights of hand.

Oh, and don’t lose your nerve just because the statistical guides themselves don’t see it or don’t relent. Send it on to me at


[0] They are the focus of a book I am completing: “Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars” (CUP, 2017).

[1]  I admit we need a more careful delineation of the meaning of ‘error probability’.  One doesn’t have an error probability without there being something that could be “in error”. That something is generally understood as an inference or an interpretation of data. A method of statistical inference moves from data to some inference about the source of the data as modeled; some may wish to see the inference as a kind of “act” (using Neyman’s language) or “decision to assert” but nothing turns on this.
Associated error probabilities refer to the probability a method outputs an erroneous interpretation of the data, where the particular error is pinned down. For example, it might be that the test infers μ > 0 when in fact the data have been generated by a process where μ = 0.  The test is defined in terms of a test statistic d(X), and the error probabilities refer to the probability distribution of d(X), the sampling distribution, under various assumptions about the data generating process. Error probabilities in tests, whether of the Fisherian or N-P varieties, refer to hypothetical relative frequencies of error in applying a method.

[2] Notice that error probability2 involves conditioning on the particular outcome. Say you have observed a 1.96 standard deviation difference, and that’s your fixed cut-off. There’s no consideration of the sampling distribution of d(X), if you’ve conditioned on the actual outcome. Yet probabilities of Type I and Type II errors, as well as p-values, are defined exclusively in terms of the sampling distribution of d(X), under a statistical hypothesis of interest. But all that’s error probability1.

[3] Doubtless, part of the problem is that testers fail to clarify when and why a small significance level (or p-value) provides a warrant for inferring a discrepancy from the null. Firstly, for a p-value to be actual (and not merely nominal):

Pr(P < pobs; H0) = pobs.

Cherry picking and significance seeking can yield a small nominal p-value, while the actual probability of attaining even smaller p-values under the null is high. So this identity fails. Second, a small p-value warrants inferring a discrepancy from the null because, and to the extent that, a larger p-value would very probably have occurred, were the null hypothesis correct. This links the error probabilities of a method to an inference in the case at hand.

“…Hence pobs is the probability that we would mistakenly declare there to be evidence against H0, were we to regard the data under analysis as being just decisive against H0.” (Cox and Hinkley 1974, p. 66)
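The identity, and how cherry picking breaks it, can be checked by simulation (a sketch assuming a continuous test statistic, so the null p-value is uniform; the choice of k = 10 looks is an arbitrary illustration):

```python
import random

random.seed(7)
# Under the null a continuous p-value is uniform, so Pr(P <= 0.05; H0) = 0.05.
# Reporting the smallest of k null p-values (cherry picking) keeps the
# nominal level at 0.05 while inflating the actual error probability.
trials, k = 50_000, 10
single_hits = sum(random.random() <= 0.05 for _ in range(trials))
picked_hits = sum(min(random.random() for _ in range(k)) <= 0.05
                  for _ in range(trials))

print(round(single_hits / trials, 2))   # ~0.05: the identity holds
print(round(picked_hits / trials, 2))   # ~0.40: the actual error probability, far above 0.05
```

The inflated rate matches 1 − 0.95^10 ≈ 0.40: the nominal p-value no longer reports the actual probability of so small a p-value under the null.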

[4] The myth that significance levels lose their error probability status once the attained p-value is reported is just that, a myth. I’ve discussed it a lot elsewhere, but the current point doesn’t turn on this. Still, it’s worth listening to statistician Stephen Senn (2002, p. 2438) on this point.

 I disagree with [Steve Goodman] on two grounds here: (i) it is not necessary to separate p-values from their hypothesis test interpretation; (ii) the replication probability has no direct bearing on inferential meaning. Second he claims that, ‘the replication probability can be used as a frequentist counterpart of Bayesian and likelihood models to show that p-values overstate the evidence against the null-hypothesis’ (p. 875, my italics). I disagree that there is such an overstatement.  In my opinion, whatever philosophical differences there are between significance tests and hypothesis test, they have little to do with the use or otherwise of p-values. For example, Lehmann in Testing Statistical Hypotheses, regarded by many as the most perfect and complete expression of the Neyman–Pearson approach, says

‘In applications, there is usually available a nested family of rejection regions corresponding to different significance levels. It is then good practice to determine not only whether the hypothesis is accepted or rejected at the given significance level, but also to determine the smallest significance level … the significance probability or p-value, at which the hypothesis would be rejected for the given observation’. (Lehmann, Testing Statistical Hypotheses, 1994, p. 70, original italics)

Note to subscribers: Please check back to find follow-ups and corrected versions of blogposts, indicated with (ii), (iii) etc.

Some Relevant Posts:

Categories: frequentist/Bayesian, P-values, reforming the reformers, S. Senn, Statistics | 34 Comments

High error rates in discussions of error rates: no end in sight


waiting for the other shoe to drop…

“Guides for the Perplexed” in statistics become “Guides to Become Perplexed” when “error probabilities” (in relation to statistical hypotheses tests) are confused with posterior probabilities of hypotheses. Moreover, these posteriors are neither frequentist, subjectivist, nor default. Since this doublespeak is becoming more common in some circles, it seems apt to reblog a post from one year ago (you may wish to check the comments).

Do you ever find yourself holding your breath when reading an exposition of significance tests that’s going swimmingly so far? If you’re a frequentist in exile, you know what I mean. I’m sure others feel this way too. When I came across Jim Frost’s posts on The Minitab Blog, I thought I might actually have located a success story. He does a good job explaining P-values (with charts), the duality between P-values and confidence levels, and even rebuts the latest “test ban” (the “Don’t Ask, Don’t Tell” policy). Mere descriptive reports of observed differences that the editors recommend, Frost shows, are uninterpretable without a corresponding P-value or the equivalent. So far, so good. I have only small quibbles, such as the use of “likelihood” when meaning probability, and various and sundry nitpicky things. But watch how in some places significance levels are defined as the usual error probabilities —indeed in the glossary for the site—while in others it is denied they provide error probabilities. In those other places, error probabilities and error rates shift their meaning to posterior probabilities, based on priors representing the “prevalence” of true null hypotheses.

Begin with one of his kosher posts “Understanding Hypothesis Tests: Significance Levels (Alpha) and P values in Statistics” (blue is Frost):

(1) The Significance level is the Type I error probability (3/15)

The significance level, also denoted as alpha or α, is the probability of rejecting the null hypothesis when it is true. For example, a significance level of 0.05 indicates a 5% risk of concluding that a difference exists when there is no actual difference….

Keep in mind that there is no magic significance level that distinguishes between the studies that have a true effect and those that don’t with 100% accuracy. The common alpha values of 0.05 and 0.01 are simply based on tradition. For a significance level of 0.05, expect to obtain sample means in the critical region 5% of the time when the null hypothesis is true. In these cases, you won’t know that the null hypothesis is true but you’ll reject it because the sample mean falls in the critical region. That’s why the significance level is also referred to as an error rate! (My emphasis.)

Note: Frost is using the term “error rate” here, which is why I use it in my title. Error probability would be preferable.

This type of error doesn’t imply that the experimenter did anything wrong or require any other unusual explanation. The graphs show that when the null hypothesis is true, it is possible to obtain these unusual sample means for no reason other than random sampling error. It’s just luck of the draw.

(2) Definition Link: Now we go to the blog’s definition link for this “type of error”

No hypothesis test is 100% certain. Because the test is based on probabilities, there is always a chance of drawing an incorrect conclusion.

Type I error

When the null hypothesis is true and you reject it, you make a type I error. The probability of making a type I error is α, which is the level of significance you set for your hypothesis test. An α of 0.05 indicates that you are willing to accept a 5% chance that you are wrong when you reject the null hypothesis. (My emphasis)

Decision        | Null hypothesis true                   | Null hypothesis false
Fail to reject  | Correct decision (probability = 1 – α) | Type II error: fail to reject the null when it is false (probability = β)
Reject          | Type I error: reject the null when it is true (probability = α) | Correct decision (probability = 1 – β)


He gives very useful graphs showing quite clearly that the probability of a Type I error comes from the sampling distribution of the statistic (in the illustrated case, it’s the distribution of sample means).
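The sampling-distribution picture can be reproduced with a short simulation (illustrative numbers of my own: μ = 0, σ = 1, n = 25, two-sided α = .05): draw many samples under a true null and count how often the sample mean lands in the critical region.

```python
import random
import statistics

random.seed(3)
# Sample means under a true null (mu = 0, sigma = 1, n = 25).
# Two-sided alpha = .05 cutoff for the sample mean: 1.96 * sigma / sqrt(n).
n, trials = 25, 20_000
cutoff = 1.96 / n ** 0.5
rejections = 0
for _ in range(trials):
    xbar = statistics.fmean(random.gauss(0, 1) for _ in range(n))
    rejections += abs(xbar) > cutoff

print(round(rejections / trials, 2))  # ~0.05, straight from the sampling distribution
```

This is error probability1 in action: a hypothetical relative frequency of erroneous rejections, computed from the sampling distribution of the statistic, with no prior probability for the null anywhere in sight.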

So it is odd that elsewhere Frost tells us that a significance level (attained or fixed) is not the probability of a Type I error. Note: the issue here isn’t whether the significance level is fixed or attained; the difference to which I’m calling your attention is between an ordinary frequentist error probability and a posterior probability in a null hypothesis, given it is rejected—based on a prior probability for the null vs a probability for a single alternative, which he writes as P(real). We may call it the false finding rate (FFR), and it arises in typical diagnostic screening contexts. I elsewhere take up the allegation, by some, that a significance level is an error probability but a P-value is not (Are P-values error probabilities?). A post is here. Also see note [a] below.
Here are some examples from Frost’s posts on 4/14 & 5/14:

 (3)  In a different post Frost alleges: A significance level is not the Type I error probability

Incorrect interpretations of P values are very common. The most common mistake is to interpret a P value as the probability of making a mistake by rejecting a true null hypothesis (a Type I error).

Now “making a mistake” may be vague (and his statement here is a bit wonky), but the parenthetical link makes it clear he intends the Type I error probability. Guess what? The link is to the exact same definition of Type I error as before: the ordinary error probability computed from the sampling distribution. Yet in the blogpost itself, the Type I error probability now refers to a posterior probability of the null, based on an assumed prior probability of .5!

If a P value is not the error rate, what the heck is the error rate?

Sellke et al.* have estimated the error rates associated with different P values. While the precise error rate depends on various assumptions (which I discuss here), the table summarizes them for middle-of-the-road assumptions.

| P value | Probability of incorrectly rejecting a true null hypothesis |
|---|---|
| 0.05 | At least 23% (and typically close to 50%) |
| 0.01 | At least 7% (and typically close to 15%) |

*Thomas SELLKE, M. J. BAYARRI, and James O. BERGER, Calibration of p Values for Testing Precise Null Hypotheses, The American Statistician, February 2001, Vol. 55, No. 1

We’ve discussed how J. Berger and Sellke (1987) compute these posterior probabilities using spiked priors, generally representing undefined “reference” or conventional priors. (Please see my previous post.) J. Berger claims, at times, that these posterior probabilities (which he computes in lots of different ways) are the error probabilities, and Frost does too, at least in some posts. The allegation that therefore P-values exaggerate the evidence can’t be far behind–or so a reader of this blog surmises–and there it is, right below:

Do the higher error rates in this table surprise you? Unfortunately, the common misinterpretation of P values as the error rate creates the illusion of substantially more evidence against the null hypothesis than is justified. As you can see, if you base a decision on a single study with a P value near 0.05, the difference observed in the sample may not exist at the population level. (Frost 5/14)

Admittedly, Frost is led to his equivocation by the sleights of hand of others, encouraged after around 2003 (in my experience).

(4) J. Berger’s Sleight of Hand: These sleights of hand are familiar to readers of this blog; I wouldn’t have expected them in a set of instructional blogposts about misinterpreting significance levels (at least without a great big warning). But Frost didn’t dream them up; he’s following a practice, traceable to J. Berger, of claiming that a posterior (usually based on conventional priors) gives the real error rate (or the conditional error rate). [The computations come from Edwards, Lindman, and Savage 1963.] Whether it’s the ‘right’ default prior to use or not, my point is simply that the meaning is changed, and Frost ought to issue a trigger alert! Instead, he includes numerous links to related posts on significance tests, making it appear that blithely assigning a .5 spiked prior to the null is not only kosher, but is part of ordinary significance testing. It’s scarcely a justified move. As Casella and R. Berger (1987) show, this is a highly biased prior to use. See this post, and others by Stephen Senn (3/16/15, 5/9/15). Moreover, many people regard point nulls as always false.

In my comment on J. Berger (2003), I noted my surprise at his redefinition of error probability. (See pp. 19-24 in this paper). In response to me, Berger asks,

“Why should the frequentist school have exclusive right to the term ‘error probability’? It is not difficult to simply add the designation ‘frequentist’ (or Type I or Type II) or ‘Bayesian’ to the term to differentiate between the schools” (J. Berger 2003, p. 30).

That would work splendidly; I’m all in favor of differentiating between the schools. Note that he allows “Type I” to go with the ordinary frequentist variant. If Berger had emphasized this distinction in his paper, Frost would have been warned of the slippery slide he’s about to take a trip on. Instead, Berger has increasingly grown accustomed to claiming these are the real frequentist(!) error probabilities. (Others have followed.)

(5) Is Hypothesis testing like Diagnostic Screening? Now Frost (following David Colquhoun) appears to favor (not Berger’s conventionalist prior) a type of frequentist or “prevalence” prior:

It is the proportion of hypothesis tests in which the alternative hypothesis is true at the outset. It can be thought of as the long-term probability, or track record, of similar types of studies. It’s the plausibility of the alternative hypothesis.

Do we know these prevalences? From what reference class should a given hypothesis be regarded as having been selected? There would be many different urns to which a particular hypothesis belongs. Such frequentist-Bayesian computations may be appropriate in contexts of high-throughput screening, where a hypothesis is viewed as a generic, random selection from an urn of hypotheses. Here, a (behavioristic) concern to control the rates of following up false leads is primary. But that’s very different from evaluating how well tested or corroborated a particular H is. And why should fields with “high crud factors” (as Meehl called them) get the benefit of a low prior prevalence of “no effect”?
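The screening computation at issue can be made explicit. Under an assumed prevalence of real effects, a significance cutoff, and a power figure (the numbers below are hypothetical, in the spirit of Colquhoun's examples), Bayes' theorem gives the false finding rate:

```python
def false_finding_rate(prev_real, alpha=0.05, power=0.8):
    """P(null is true | test rejects) in a diagnostic-screening model,
    where `prev_real` is the proportion of tested hypotheses with a
    real effect.  By Bayes' theorem:
    FFR = alpha*(1-prev) / (alpha*(1-prev) + power*prev)."""
    false_pos = alpha * (1 - prev_real)   # true nulls that get rejected
    true_pos = power * prev_real          # real effects that get rejected
    return false_pos / (false_pos + true_pos)

# If only 10% of tested hypotheses are real, over a third of the
# rejections at the 5% level come from true nulls:
print(false_finding_rate(0.1))  # about 0.36
```

Note that the answer depends entirely on the assumed prevalence and power, which is precisely the reference-class problem raised above.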

I’ve discussed all these points elsewhere, and they are beside my current complaint, which is simply this: in some places, Frost construes the probability of a Type I error as an ordinary error probability based on the sampling distribution alone; in other places, as a Bayesian posterior probability of a hypothesis, conditional on a set of data.

Frost goes further in the post to suggest that “hypotheses tests are journeys from the prior probability to posterior probability”.

Hypothesis tests begin with differing probabilities that the null hypothesis is true depending on the specific hypotheses being tested. [Mayo: They do?] This prior probability influences the probability that the null is true at the conclusion of the test, the posterior probability.


| Initial probability of true null (1 – P(real)) | P value obtained | Final minimum probability of true null |
|---|---|---|
| 0.5 | 0.05 | 0.289 |
| 0.5 | 0.01 | 0.110 |
| 0.5 | 0.001 | 0.018 |
| 0.33 | 0.05 | 0.12 |
| 0.9 | 0.05 | 0.76 |

The table is based on calculations by Colquhoun and Sellke et al. It shows that the decrease from the initial probability to the final probability of a true null depends on the P value. Power is also a factor but not shown in the table.
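The 0.5-prior rows of the table can be reproduced (up to rounding) from the Sellke et al. (2001) calibration, which bounds the Bayes factor in favor of the alternative by 1/(−e·p·ln p) for p < 1/e; the 0.33 and 0.9 rows fold in additional power assumptions not sketched here. A minimal Python check:

```python
import math

def min_posterior_null(p, prior_null=0.5):
    """Sellke-Bayarri-Berger lower bound on P(H0 | data): the Bayes
    factor in favor of H0 is at least -e * p * ln(p) (valid for
    p < 1/e); combine it with the prior odds on H0."""
    bf_h0 = -math.e * p * math.log(p)          # bound on BF(H0 : H1)
    prior_odds = prior_null / (1 - prior_null)
    post_odds = prior_odds * bf_h0
    return post_odds / (1 + post_odds)

for p in (0.05, 0.01, 0.001):
    print(p, round(min_posterior_null(p), 3))  # ~0.289, ~0.111, ~0.018
```

The middle value rounds to 0.111 here; the table's 0.110 presumably reflects slightly different rounding or assumptions. Either way, the numbers are posteriors under a spiked prior, not sampling-distribution error probabilities.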

It is assumed that there is just a crude dichotomy: the null is true vs. the effect is real (never mind magnitudes of discrepancy, which I and others insist upon), and further, that you reject on the basis of a single, just statistically significant result. But these moves go against the healthy recommendations for good testing in the other posts on the Minitab blog. I recommend Frost go back and label the places where he has conflated the probability of a Type I error with a posterior probability based on a prior: use Berger’s suggestion of reserving “Type I error probability” for the ordinary frequentist error probability based on a sampling distribution alone, and calling the posterior error probability Bayesian. Else contradictory claims will ensue…but I’m not holding my breath.

I may continue this in (ii)….


January 21, 2016 Update:

Jim Frost from Minitab responded, not to my post, but to a comment I made on his blog prior to writing the post. Since he hasn’t commented here, let me paste the relevant portion of his reply. I want to set aside the issue of predesignated alpha versus the observed P-value, because my point now is quite independent of it. Let’s even just talk of a fixed alpha or fixed P-value for rejecting the null. My point’s very simple: Frost sometimes considers the Type I error probability to be alpha (in the 2015 posts), based solely on the sampling distribution, which he ably depicts; whereas in other places he regards it as the posterior probability of the null hypothesis based on a prior probability (the 2014 posts). He does the same in his reply to me (which I can’t seem to link, but it’s in the discussion here):

From Frost’s reply to me:

Yes, if your study obtains a p-value of 0.03, you can say that 3% of all studies that obtain a p-value less than or equal to 0.03 will have a Type I error. That’s more of a N-P error rate interpretation (except that N-P focused on critical test values rather than p-values). …..

Based on Fisher’s measure of evidence approach, the correct way to interpret a p-value of exactly 0.03 is like this:

Assuming the null hypothesis is true, you’d obtain the observed effect or more in 3% of studies due to random sampling error.

We know that the p-value is not the error rate because:

1) The math behind the p-values is not designed to calculate the probability that the null hypothesis is true (which is actually incalculable based solely on sample statistics). See a graphical representation of the math behind p-values and a post dedicated to how to correctly interpret p-values.

2) We also know this because there have been a number of simulation studies that look at the relationship between p-values and the probability that the null is true. These studies show that the actual probability that the null hypothesis is true tends to be greater than the p-value by a large margin.

3) Empirical studies that look at the replication of significant results also suggest that the actual probability that the null is true is greater than the p-value.

Frost’s points (1)-(3) above would also oust alpha as the Type I error probability, for it too is not designed to give a posterior. Never mind the question of the irrelevance or bias associated with the hefty spiked prior on the null involved in the simulations; all I’m saying is that Frost should make the distinction that even J. Berger agrees to, if he doesn’t want to confuse his readers.
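The frequency claim in Frost's reply (that 3% of true-null studies yield p ≤ 0.03) is just the uniformity of the P-value under the null, and is easy to check by simulation; a sketch, assuming for illustration a one-sided z-test:

```python
import math
import random

def fraction_small_p(cutoff=0.03, reps=50000, seed=2):
    """Under a true null, the one-sided P-value of a z-statistic is
    Uniform(0, 1), so Pr(p <= cutoff) = cutoff in the long run."""
    random.seed(seed)
    small = 0
    for _ in range(reps):
        z = random.gauss(0, 1)                      # statistic under H0
        p = 0.5 * (1 - math.erf(z / math.sqrt(2)))  # one-sided P-value
        if p <= cutoff:
            small += 1
    return small / reps

print(fraction_small_p())  # close to 0.03
```

This long-run rate is exactly the sampling-distribution quantity at issue; it says nothing about the posterior probability of the null.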


[a] A distinct issue, as to whether significance levels, but not P-values (the attained significance level), are error probabilities, is discussed here. Here are some of the assertions from the Fisherian, Neyman-Pearsonian and Bayesian camps cited in that post. (I make no attempt at uniformity in writing the “P-value”, but retain the quotes as written.)

From the Fisherian camp (Cox and Hinkley):

For given observations y we calculate t = t_obs = t(y), say, and the level of significance p_obs by

p_obs = Pr(T ≥ t_obs; H0).

….Hence p_obs is the probability that we would mistakenly declare there to be evidence against H0, were we to regard the data under analysis as being just decisive against H0.” (Cox and Hinkley 1974, 66).

Thus p_obs would be the Type I error probability associated with the test.
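For a concrete instance of Cox and Hinkley's definition, assume (purely for illustration) a standard-normal test statistic T under H0; then the attained level is just its survival probability at the observed value:

```python
import math

def p_obs(t_obs):
    """Attained significance level p_obs = Pr(T >= t_obs; H0) for a
    one-sided test with a standard-normal statistic under H0,
    computed via the normal survival function 1 - Phi(t)."""
    return 0.5 * (1.0 - math.erf(t_obs / math.sqrt(2.0)))

print(round(p_obs(1.645), 3))  # ~0.05, the familiar one-sided cutoff
```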

From the Neyman-Pearson N-P camp (Lehmann and Romano):

“[I]t is good practice to determine not only whether the hypothesis is accepted or rejected at the given significance level, but also to determine the smallest significance level…at which the hypothesis would be rejected for the given observation. This number, the so-called p-value gives an idea of how strongly the data contradict the hypothesis. It also enables others to reach a verdict based on the significance level of their choice.” (Lehmann and Romano 2005, 63-4) 

Very similar quotations are easily found, and are regarded as uncontroversial—even by Bayesians whose contributions lie at the foundation of Berger and Sellke’s argument that P values exaggerate the evidence against the null.

Gibbons and Pratt:

“The P-value can then be interpreted as the smallest level of significance, that is, the ‘borderline level’, since the outcome observed would be judged significant at all levels greater than or equal to the P-value but not significant at any smaller levels. Thus it is sometimes called the ‘level attained’ by the sample….Reporting a P-value, whether exact or within an interval, in effect permits each individual to choose his own level of significance as the maximum tolerable probability of a Type I error.” (Gibbons and Pratt 1975, 21).



Berger, J. O. and Sellke, T.  (1987). “Testing a point null hypothesis: The irreconcilability of p values and evidence,” (with discussion). J. Amer. Statist. Assoc. 82: 112–139.

Casella G. and Berger, R. (1987). “Reconciling Bayesian and Frequentist Evidence in the One-sided Testing Problem,” (with discussion). J. Amer. Statist. Assoc. 82 106–111, 123–139.

Edwards, W., Lindman, H. and Savage, L. J. (1963). “Bayesian Statistical Inference for Psychological Research,” Psychological Review 70(3): 193–242.

Sellke, T., Bayarri, M. J. and Berger, J. O. (2001). “Calibration of p Values for Testing Precise Null Hypotheses,” The American Statistician, 55(1): 62–71.

Frost blog posts:
  • 4/17/14: How to Correctly Interpret P Values
  • 5/1/14: Not All P Values are Created Equal
  • 3/19/15: Understanding Hypothesis Tests: Significance Levels (Alpha) and P values in Statistics

Some Relevant Errorstatistics Posts:

  •  4/28/12: Comedy Hour at the Bayesian Retreat: P-values versus Posteriors.
  • 9/29/13: Highly probable vs highly probed: Bayesian/ error statistical differences.
  • 7/14/14: “P-values overstate the evidence against the null”: legit or fallacious? (revised)
  • 8/17/14: Are P Values Error Probabilities? or, “It’s the methods, stupid!” (2nd install)
  • 3/5/15: A puzzle about the latest test ban (or ‘don’t ask, don’t tell’)
  • 3/16/15: Stephen Senn: The pathetic P-value (Guest Post)
  • 5/9/15: Stephen Senn: Double Jeopardy?: Judge Jeffreys Upholds the Law (sequel to the pathetic P-value)
Categories: highly probable vs highly probed, J. Berger, reforming the reformers, Statistics
