Do you ever find yourself holding your breath when reading an exposition of significance tests that’s going swimmingly so far? If you’re a frequentist in exile, you know what I mean. I’m sure others feel this way too. When I came across Jim Frost’s posts on The Minitab Blog, I thought I might actually have located a success story. He does a good job explaining P-values (with charts), the duality between P-values and confidence levels, and even rebuts the latest “test ban” (the “Don’t Ask, Don’t Tell” policy). Mere descriptive reports of observed differences that the editors recommend, Frost shows, are uninterpretable without a corresponding P-value or the equivalent. So far, so good. I have only small quibbles, such as the use of “likelihood” when meaning probability, and various and sundry nitpicky things. But watch how in some places significance levels are defined as the usual error probabilities and error rates—indeed in the glossary for the site—while in others it is denied they provide error rates. In those other places, error probabilities and error rates shift their meaning to posterior probabilities, based on priors representing the “prevalence” of true null hypotheses.
Begin with one of the kosher posts “Understanding Hypothesis Tests: Significance Levels (Alpha) and P values in Statistics”:
(1) The significance level is the Type I error probability (3/15)
The significance level, also denoted as alpha or α, is the probability of rejecting the null hypothesis when it is true. For example, a significance level of 0.05 indicates a 5% risk of concluding that a difference exists when there is no actual difference….
Keep in mind that there is no magic significance level that distinguishes between the studies that have a true effect and those that don’t with 100% accuracy. The common alpha values of 0.05 and 0.01 are simply based on tradition. For a significance level of 0.05, expect to obtain sample means in the critical region 5% of the time when the null hypothesis is true. In these cases, you won’t know that the null hypothesis is true but you’ll reject it because the sample mean falls in the critical region. That’s why the significance level is also referred to as an error rate! (My emphasis.)
This type of error doesn’t imply that the experimenter did anything wrong or require any other unusual explanation. The graphs show that when the null hypothesis is true, it is possible to obtain these unusual sample means for no reason other than random sampling error. It’s just luck of the draw.
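Frost’s “luck of the draw” point is easy to check by simulation: draw many samples from a population in which the null is true and count how often the sample mean lands in the critical region. A minimal sketch (the normal population, sample size, and two-sided z-test are my illustrative choices, not from Frost’s post):

```python
import numpy as np

rng = np.random.default_rng(0)

mu0, sigma, n = 100.0, 15.0, 25   # null mean, known sd, sample size (illustrative)
z_crit = 1.96                      # two-sided critical value for alpha = 0.05

trials = 100_000
samples = rng.normal(mu0, sigma, size=(trials, n))   # null hypothesis true
z = (samples.mean(axis=1) - mu0) / (sigma / np.sqrt(n))
type_i_rate = np.mean(np.abs(z) > z_crit)

print(type_i_rate)   # close to 0.05: rejections under a true null occur at rate alpha
```

The rejection rate hovers around 0.05 over repeated sampling, which is exactly the sampling-distribution sense of “error rate” in the quoted passage.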
(2) Definition Link: Now we go to the blog’s definition link for this “type of error”
No hypothesis test is 100% certain. Because the test is based on probabilities, there is always a chance of drawing an incorrect conclusion.
Type I error
When the null hypothesis is true and you reject it, you make a type I error. The probability of making a type I error is α, which is the level of significance you set for your hypothesis test. An α of 0.05 indicates that you are willing to accept a 5% chance that you are wrong when you reject the null hypothesis. (My emphasis)
| Decision | Null hypothesis true | Null hypothesis false |
|---|---|---|
| Fail to reject | Correct decision (probability = 1 – α) | Type II error: fail to reject the null when it is false (probability = β) |
| Reject | Type I error: reject the null when it is true (probability = α) | Correct decision (probability = 1 – β) |
He gives very useful graphs showing quite clearly that the probability of a Type I error comes from the sampling distribution of the statistic (in the illustrated case, it’s the distribution of sample means).
So it is odd that elsewhere Frost tells us that a significance level (attained or fixed) is not the probability of a Type I error. [Note: the issue here isn’t whether the significance level is fixed or attained, the difference is between an ordinary frequentist error probability and a posterior probability in a null hypothesis, given it is rejected—based on a prior probability for the null vs a probability for a single alternative, which he writes as P(real). I elsewhere take up the allegation, by some, that a significance level is an error probability but a P-value is not. The post is here. Also see note [a] below.]
Here are some examples from Frost’s posts on 4/14 & 5/14:
(3) A significance level is not the Type I error probability
Incorrect interpretations of P values are very common. The most common mistake is to interpret a P value as the probability of making a mistake by rejecting a true null hypothesis (a Type I error).
Now “making a mistake” may be vague, but the parenthetical link makes it clear he intends the Type I error. Guess what? The link is to the exact same definition of Type I error as before: the ordinary error probability computed from the sampling distribution. Yet in the blogpost itself, the Type I error probability now refers to a posterior probability of the null, based on an assumed prior probability of .5!
If a P value is not the error rate, what the heck is the error rate?
Sellke et al.* have estimated the error rates associated with different P values. While the precise error rate depends on various assumptions (which I discuss here), the table summarizes them for middle-of-the-road assumptions.
| P value | Probability of incorrectly rejecting a true null hypothesis |
|---|---|
| 0.05 | At least 23% (and typically close to 50%) |
| 0.01 | At least 7% (and typically close to 15%) |
*Sellke, T., Bayarri, M. J., and Berger, J. O. (2001). “Calibration of p Values for Testing Precise Null Hypotheses,” The American Statistician 55(1): 62–71.
We’ve discussed how J. Berger and Sellke (1987) compute these posterior probabilities using spiked priors, generally representing undefined “reference” or conventional priors. (Please see my previous post.) J. Berger claims, at times, that these posterior probabilities (which he computes in lots of different ways) are the error probabilities, and Frost does too, at least in some posts. The allegation that therefore P-values exaggerate the evidence can’t be far behind–or so a reader of this blog surmises–and there it is, right below:
Do the higher error rates in this table surprise you? Unfortunately, the common misinterpretation of P values as the error rate creates the illusion of substantially more evidence against the null hypothesis than is justified. As you can see, if you base a decision on a single study with a P value near 0.05, the difference observed in the sample may not exist at the population level. (Frost 5/14)
(4) J. Berger’s Sleight of Hand: These sleights of hand are familiar to readers of this blog; I wouldn’t have expected them in a set of instructional blogposts about misinterpreting significance levels (at least without a great big warning). But Frost didn’t dream them up, he’s following a practice, traceable to J. Berger, of claiming that a posterior (usually based on conventional priors) gives the real error rate (or the conditional error rate). Whether it’s the one to use or not, my point is simply that the meaning is changed, and Frost ought to issue a trigger alert. Instead, he includes numerous links to related posts on significance tests, making it appear that blithely assigning the .5 spiked prior to the null is not only kosher, but is part of ordinary significance testing. It’s scarcely a justified move. As Casella and Berger (1987) show, this is a highly biased prior to use. See this post, and others by Stephen Senn (3/16/15, 5/9/15). Moreover, many people regard point nulls as always false.
In my comment on J. Berger (2003), I noted my surprise at his redefinition of error probability. (See pp. 19-24 in this paper). In response to me, Berger asks,
“Why should the frequentist school have exclusive right to the term ‘error probability’? It is not difficult to simply add the designation ‘frequentist’ (or Type I or Type II) or ‘Bayesian’ to the term to differentiate between the schools” (J. Berger 2003, p. 30).
That would work splendidly; I’m all in favor of differentiating between the schools. Note that he allows “Type I” to go with the ordinary frequentist variant. If Berger had emphasized this distinction in his paper, Frost would have been warned of the slippery slide he’s about to take a trip on. Instead, Berger has increasingly grown accustomed to claiming these are the real frequentist(!) error probabilities.
(5) Is Hypothesis Testing Like Diagnostic Screening? Now Frost appears to favor, not Berger’s conventionalist prior, but (following David Colquhoun) a type of frequentist or “prevalence” prior:
It is the proportion of hypothesis tests in which the alternative hypothesis is true at the outset. It can be thought of as the long-term probability, or track record, of similar types of studies. It’s the plausibility of the alternative hypothesis.
Do we know these prevalences? From what reference class should a given hypothesis be regarded as having been selected? There would be many different urns to which a particular hypothesis belongs. Such frequentist-Bayesian computations may be appropriate in contexts of high-throughput screening, where a hypothesis is viewed as a generic, random selection from an urn of hypotheses. Here, a (behavioristic) concern to control the rates of following up false leads is primary. But that’s very different from evaluating how well tested or corroborated a particular H is. And why should fields with “high crud factors” (as Meehl called them) get the benefit of a low prior prevalence of “no effect”?
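For concreteness, here is the kind of screening arithmetic the “prevalence” prior invites. This is a hypothetical sketch in the style of Colquhoun’s calculation; the prevalence, alpha, and power values are my illustrative choices:

```python
def screening_fdr(prevalence, alpha=0.05, power=0.8):
    """Proportion of 'significant' results that are false positives when a
    fraction `prevalence` of the tested hypotheses have a real effect.
    (Illustrative screening arithmetic, not a posterior for any particular H.)"""
    false_pos = (1 - prevalence) * alpha   # true nulls rejected at rate alpha
    true_pos = prevalence * power          # real effects detected at rate power
    return false_pos / (false_pos + true_pos)

print(round(screening_fdr(0.1), 3))   # 0.36: in a low-prevalence field, most rejections are false leads
print(round(screening_fdr(0.5), 3))   # 0.059
```

The computation is entirely about rates over an urn of hypotheses, which is precisely why it answers a behavioristic screening question rather than a question about how well tested a particular hypothesis is.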
I’ve discussed all these points elsewhere, and they are beside my current complaint which is simply this: Frost construes the probability of a Type I error in some places as an ordinary error probability based on the sampling distribution alone; and other places as a Bayesian posterior probability of a hypothesis, conditional on a set of data.
Frost goes further in the post to suggest that “hypotheses tests are journeys from the prior probability to posterior probability”.
Hypothesis tests begin with differing probabilities that the null hypothesis is true depending on the specific hypotheses being tested. [Mayo: They do?] This prior probability influences the probability that the null is true at the conclusion of the test, the posterior probability.
| Initial probability of true null (1 – P(real)) | P value obtained | Final minimum probability of true null |
|---|---|---|
| 0.5 | 0.05 | 0.289 |
| 0.5 | 0.01 | 0.110 |
| 0.5 | 0.001 | 0.018 |
| 0.33 | 0.05 | 0.12 |
| 0.9 | 0.05 | 0.76 |
The table is based on calculations by Colquhoun and Sellke et al. It shows that the decrease from the initial probability to the final probability of a true null depends on the P value. Power is also a factor but not shown in the table.
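The 0.5-prior rows of the table can be reproduced, to within rounding, from the Sellke et al. bound on the Bayes factor in favor of the null, B(p) >= -e * p * ln(p) (valid for p < 1/e); the rows with other priors fold in power assumptions not modeled here. A sketch:

```python
import math

def min_posterior_null(p, prior_null=0.5):
    """Lower bound on P(H0 | data) from a p-value via the Sellke/Bayarri/Berger
    calibration B(p) >= -e * p * ln(p), combined with prior odds on the null."""
    bayes_factor = -math.e * p * math.log(p)      # bound on odds in favor of H0
    prior_odds = prior_null / (1 - prior_null)
    posterior_odds = prior_odds * bayes_factor
    return posterior_odds / (1 + posterior_odds)

for p in (0.05, 0.01, 0.001):
    print(p, round(min_posterior_null(p), 3))
# 0.289, 0.111, 0.018 -- the table's 0.110 differs only in rounding
```

Note that the output of this calibration is a posterior probability of the null, computed from a prior; it is not the Type I error probability delivered by the sampling distribution, which is the whole point at issue.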
It is assumed that there is just a crude dichotomy: the null is true vs the effect is real (never mind magnitudes of discrepancy which I and others insist upon), and further, that you reject on the basis of a single, just statistically significant result. But these moves go against the healthy recommendations for good testing in the other posts on the Minitab blog. I recommend Frost go back and label the places where he has conflated the probability of a Type I error with a posterior probability based on a prior: use Berger’s suggestion of reserving “Type I error probability” for the ordinary frequentist error probability based on a sampling distribution alone, calling the posterior error probability Bayesian. Else contradictory claims will ensue…But I’m not holding my breath.
I may continue this in (ii)….
January 21, 2016 Update:
Jim Frost from Minitab responded, not to my post, but to a comment I made on his blog prior to writing the post. Since he hasn’t commented here, let me paste the relevant portion of his reply. I want to separate the issue of predesignating alpha and the observed P-value, because my point now is quite independent of it. Let’s even just talk of a fixed alpha or fixed P-value for rejecting the null. My point’s very simple: Frost sometimes considers the type I error probability to be alpha (in the 2015 posts) based solely on the sampling distribution which he ably depicts– whereas in other places he regards it as the posterior probability of the null hypothesis based on a prior probability (the 2014 posts). He does the same in his reply to me (which I can’t seem to link, but it’s in the discussion here):
From Frost’s reply to me:
Yes, if your study obtains a p-value of 0.03, you can say that 3% of all studies that obtain a p-value less than or equal to 0.03 will have a Type I error. That’s more of a N-P error rate interpretation (except that N-P focused on critical test values rather than p-values). …..
Based on Fisher’s measure of evidence approach, the correct way to interpret a p-value of exactly 0.03 is like this:
Assuming the null hypothesis is true, you’d obtain the observed effect or more in 3% of studies due to random sampling error.
We know that the p-value is not the error rate because:
1) The math behind the p-values is not designed to calculate the probability that the null hypothesis is true (which is actually incalculable based solely on sample statistics). See a graphical representation of the math behind p-values and a post dedicated to how to correctly interpret p-values.
2) We also know this because there have been a number of simulation studies that look at the relationship between p-values and the probability that the null is true. These studies show that the actual probability that the null hypothesis is true tends to be greater than the p-value by a large margin.
3) Empirical studies that look at the replication of significant results also suggest that the actual probability that the null is true is greater than the p-value.
Frost’s points (1)-(3) above would also oust alpha as the type I error probability, for it’s also not designed to give a posterior. Never mind the question of the irrelevance or bias associated with the hefty spiked prior to the null involved in the simulations, all I’m saying is that Frost should make the distinction that even J. Berger agrees to, if he doesn’t want to confuse his readers.
[a] A distinct issue, as to whether significance levels, but not P-values (the attained significance level), are error probabilities, is discussed here. Here are some of the assertions from the Fisherian, Neyman-Pearsonian and Bayesian camps cited in that post. (I make no attempt at uniformity in writing the “P-value”, but retain the quotes as written.)
From the Fisherian camp (Cox and Hinkley):
“For given observations y we calculate t = tobs = t(y), say, and the level of significance pobs by
pobs = Pr(T > tobs; H0).
….Hence pobs is the probability that we would mistakenly declare there to be evidence against H0, were we to regard the data under analysis as being just decisive against H0.” (Cox and Hinkley 1974, 66).
Thus pobs would be the Type I error probability associated with the test.
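The Cox and Hinkley definition can be computed directly. A minimal sketch for a one-sided test with a standard normal test statistic (the observed value 1.96 is my illustrative choice):

```python
from math import erf, sqrt

def normal_sf(x):
    """Survival function P(Z > x) for a standard normal Z."""
    return 0.5 * (1 - erf(x / sqrt(2)))

t_obs = 1.96                 # illustrative observed value of the test statistic
p_obs = normal_sf(t_obs)     # p_obs = Pr(T > t_obs; H0)
print(round(p_obs, 3))       # 0.025
```

Read with Cox and Hinkley, p_obs is the probability, under H0, of a result at least as extreme as the one observed; read with Gibbons and Pratt, it is the smallest significance level at which t_obs would be judged significant.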
From the Neyman-Pearson (N-P) camp (Lehmann and Romano):
“[I]t is good practice to determine not only whether the hypothesis is accepted or rejected at the given significance level, but also to determine the smallest significance level…at which the hypothesis would be rejected for the given observation. This number, the so-called p-value gives an idea of how strongly the data contradict the hypothesis. It also enables others to reach a verdict based on the significance level of their choice.” (Lehmann and Romano 2005, 63-4)
Very similar quotations are easily found, and are regarded as uncontroversial—even by Bayesians whose contributions stood at the foot of Berger and Sellke’s argument that P values exaggerate the evidence against the null.
Gibbons and Pratt:
“The P-value can then be interpreted as the smallest level of significance, that is, the ‘borderline level’, since the outcome observed would be judged significant at all levels greater than or equal to the P-value but not significant at any smaller levels. Thus it is sometimes called the ‘level attained’ by the sample….Reporting a P-value, whether exact or within an interval, in effect permits each individual to choose his own level of significance as the maximum tolerable probability of a Type I error.” (Gibbons and Pratt 1975, 21).
Berger, J. O. and Sellke, T. (1987). “Testing a point null hypothesis: The irreconcilability of p values and evidence,” (with discussion). J. Amer. Statist. Assoc. 82: 112–139.
Casella G. and Berger, R. (1987). “Reconciling Bayesian and Frequentist Evidence in the One-sided Testing Problem,” (with discussion). J. Amer. Statist. Assoc. 82 106–111, 123–139.
Sellke, T., Bayarri, M. J. and Berger, J. O. 2001. Calibration of p Values for Testing Precise Null Hypotheses. The American Statistician, 55(1): 62-71.
Frost blog posts:
- 4/17/14: How to Correctly Interpret P Values
- 5/1/14: Not All P Values are Created Equal
- 3/19/15: Understanding Hypothesis Tests: Significance Levels (Alpha) and P values in Statistics
Some Relevant Errorstatistics Posts:
- 4/28/12: Comedy Hour at the Bayesian Retreat: P-values versus Posteriors.
- 9/29/13: Highly probable vs highly probed: Bayesian/ error statistical differences.
- 7/14/14: “P-values overstate the evidence against the null”: legit or fallacious? (revised)
- 8/17/14: Are P Values Error Probabilities? or, “It’s the methods, stupid!” (2nd install)
- 3/5/15: A puzzle about the latest test ban (or ‘don’t ask, don’t tell’)
- 3/16/15: Stephen Senn: The pathetic P-value (Guest Post)
- 5/9/15: Stephen Senn: Double Jeopardy?: Judge Jeffreys Upholds the Law (sequel to the pathetic P-value)
I wrote this in airports and on a transatlantic flight*, so please alert me to errors, several of which I’ve already found (and fixed). I don’t have an e-mail address for Jim Frost, though I sent a link in a comment on his blog. It’s interesting that the more recent posts do not contain these problems, although they were written within a year of the others. Maybe more than one person writes his posts. Perhaps there are some warnings of the shift in meanings that I missed; if you spot some, let me know.
*Of course I wouldn’t dream of wearing heels like these while on a plane!