Do you ever find yourself holding your breath when reading an exposition of significance tests that’s going swimmingly so far? If you’re a frequentist in exile, you know what I mean. I’m sure others feel this way too. When I came across Jim Frost’s posts on The Minitab Blog, I thought I might actually have located a success story. He does a good job explaining P-values (with charts), the duality between P-values and confidence levels, and even rebuts the latest “test ban” (the “Don’t Ask, Don’t Tell” policy). Mere descriptive reports of observed differences that the editors recommend, Frost shows, are uninterpretable without a corresponding P-value or the equivalent. So far, so good. I have only small quibbles, such as the use of “likelihood” when meaning probability, and various and sundry nitpicky things. But watch how in some places significance levels are defined as the usual error probabilities and error rates—indeed in the glossary for the site—while in others it is denied they provide error rates. In those other places, error probabilities and error rates shift their meaning to posterior probabilities, based on priors representing the “prevalence” of true null hypotheses.

Begin with one of the kosher posts “Understanding Hypothesis Tests: Significance Levels (Alpha) and P values in Statistics”:

**(1)** The significance level is the Type I error probability (3/15)

The significance level, also denoted as alpha or α, is the probability of rejecting the null hypothesis when it is true. For example, a significance level of 0.05 indicates a 5% risk of concluding that a difference exists when there is no actual difference….

Keep in mind that there is no magic significance level that distinguishes between the studies that have a true effect and those that don’t with 100% accuracy. The common alpha values of 0.05 and 0.01 are simply based on tradition. For a significance level of 0.05, expect to obtain sample means in the critical region 5% of the time when the null hypothesis is true. In these cases, you won’t know that the null hypothesis is true but you’ll reject it because the sample mean falls in the critical region. That’s why the significance level is also referred to as an *error rate*! (My emphasis.) This type of error doesn’t imply that the experimenter did anything wrong or require any other unusual explanation. The graphs show that when the null hypothesis is true, it is possible to obtain these unusual sample means for no reason other than random sampling error. It’s just luck of the draw.
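
Frost’s point here is easy to check by simulation. A minimal sketch (mine, not from his post): draw repeated samples with the null true, run a two-sided z test at α = 0.05, and the rejection rate settles near 5%, coming from the sampling distribution of the sample mean alone, with no priors anywhere.

```python
import math
import random

random.seed(42)

def type_i_error_rate(n=25, z_crit=1.96, trials=20_000):
    """Simulate a two-sided z test of H0: mu = 0 when H0 is TRUE.

    Each trial draws n observations from N(0, 1), so every rejection
    is a Type I error; the rate is determined by the sampling
    distribution of the sample mean alone.
    """
    rejections = 0
    for _ in range(trials):
        xbar = sum(random.gauss(0, 1) for _ in range(n)) / n
        z = xbar * math.sqrt(n)  # SE of the mean is 1/sqrt(n)
        if abs(z) > z_crit:
            rejections += 1
    return rejections / trials

rate = type_i_error_rate()
print(rate)  # close to 0.05
```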

**(2) Definition Link:** Now we go to the blog’s definition link for this “type of error”:

No hypothesis test is 100% certain. Because the test is based on probabilities, there is always a chance of drawing an incorrect conclusion.

**Type I error:** When the null hypothesis is true and you reject it, you make a type I error. The probability of making a type I error is α, which is the level of significance you set for your hypothesis test.

An α of 0.05 indicates that you are willing to accept a 5% chance that you are wrong when you reject the null hypothesis. (My emphasis)

| Decision | Null hypothesis true | Null hypothesis false |
|---|---|---|
| Fail to reject | Correct decision (probability = 1 – α) | Type II error: fail to reject the null when it is false (probability = β) |
| Reject | Type I error: rejecting the null when it is true (probability = α) | Correct decision (probability = 1 – β) |

He gives very useful graphs showing quite clearly that the probability of a Type I error comes from the *sampling distribution* of the statistic (in the illustrated case, it’s the distribution of sample means).

So it is odd that elsewhere Frost tells us that a significance level (attained or fixed) is not the probability of a Type I error. [Note: the issue here isn’t whether the significance level is fixed or attained, the difference is between an ordinary frequentist error probability and a posterior probability in a null hypothesis, given it is rejected—based on a prior probability for the null vs a probability for a single alternative, which he writes as P(real). I elsewhere take up the allegation, by some, that a significance level is an error probability but a P-value is not. The post is here. Also see note [a] below.]

Here are some examples from Frost’s posts on 4/14 & 5/14:

**(3)** A significance level is not the Type I error probability

Incorrect interpretations of P values are very common. The most common mistake is to interpret a P value as the probability of making a mistake by rejecting a true null hypothesis (a Type I error).

Now “making a mistake” may be vague, but the parenthetical link makes it clear he intends the Type I error. Guess what? The link is to the exact same definition of Type I error as before: the ordinary error probability computed from the sampling distribution. Yet in the blogpost itself, the Type I error probability now refers to a posterior probability of the null, based on an assumed prior probability of .5!

If a P value is not the error rate, what the heck is the error rate?

Sellke et al.* have estimated the error rates associated with different P values. While the precise error rate depends on various assumptions (which I discuss here), the table summarizes them for middle-of-the-road assumptions.

| P value | Probability of incorrectly rejecting a true null hypothesis |
|---|---|
| 0.05 | At least 23% (and typically close to 50%) |
| 0.01 | At least 7% (and typically close to 15%) |

*Thomas Sellke, M. J. Bayarri, and James O. Berger, “Calibration of p Values for Testing Precise Null Hypotheses,” *The American Statistician*, February 2001, Vol. 55, No. 1.

We’ve discussed how J. Berger and Sellke (1987) compute these posterior probabilities using spiked priors, generally representing undefined “reference” or conventional priors. (Please see my previous post.) J. Berger claims, at times, that these posterior probabilities (which he computes in lots of different ways) are the error probabilities, and Frost does too, at least in some posts. The allegation that therefore P-values exaggerate the evidence can’t be far behind–or so a reader of this blog surmises–and there it is, right below:

Do the higher error rates in this table surprise you? Unfortunately, the common misinterpretation of P values as the error rate creates the illusion of substantially more evidence against the null hypothesis than is justified. As you can see, if you base a decision on a single study with a P value near 0.05, the difference observed in the sample may not exist at the population level. (Frost 5/14)

**(4) J. Berger’s Sleight of Hand:** These sleights of hand are familiar to readers of this blog; I wouldn’t have expected them in a set of instructional blogposts about misinterpreting significance levels (at least without a great big warning). But Frost didn’t dream them up, he’s following a practice, traceable to J. Berger, of claiming that a posterior (usually based on conventional priors) gives the real error rate (or the conditional error rate). Whether it’s the one to use or not, my point is simply that the meaning is changed, and Frost ought to issue a trigger alert. Instead, he includes numerous links to related posts on significance tests, making it appear that blithely assigning the .5 spiked prior to the null is not only kosher, but is part of ordinary significance testing. It’s scarcely a justified move. As Casella and Berger (1987) show, this is a highly biased prior to use. See this post, and others by Stephen Senn (3/16/15, 5/9/15). Moreover, many people regard point nulls as always false.

In my comment on J. Berger (2003), I noted my surprise at his redefinition of error probability. (See pp. 19-24 in this paper). In response to me, Berger asks,

“Why should the frequentist school have exclusive right to the term ‘error probability’? It is not difficult to simply add the designation ‘frequentist’ (or Type I or Type II) or ‘Bayesian’ to the term to differentiate between the schools” (J. Berger 2003, p. 30).

That would work splendidly; I’m all in favor of differentiating between the schools. Note that he allows “Type I” to go with the ordinary frequentist variant. If Berger had emphasized this distinction in his paper, Frost would have been warned of the slippery slide he’s about to take a trip on. Instead, Berger has increasingly grown accustomed to claiming these are the real frequentist(!) error probabilities.

**(5) Is Hypothesis testing like Diagnostic Screening?** Now Frost (following David Colquhoun) appears to favor (not Berger’s conventionalist prior) a type of frequentist or “prevalence” prior:

It is the proportion of hypothesis tests in which the alternative hypothesis is true at the outset. It can be thought of as the long-term probability, or track record, of similar types of studies. It’s the plausibility of the alternative hypothesis.

Do we know these prevalences? From what reference class should a given hypothesis be regarded as having been selected? There would be many different urns to which a particular hypothesis belongs. Such frequentist-Bayesian computations may be appropriate in contexts of high throughput screening, where a hypothesis is viewed as a generic, random selection from an urn of hypotheses. Here, a (behavioristic) concern to control the rates of following up false leads is primary. But that’s very different from evaluating how well tested or corroborated a particular *H* is. And why should fields with “high crud factors” (as Meehl called them) get the benefit of a low prior prevalence of “no effect”?
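
The arithmetic behind the screening picture is simple Bayes-theorem bookkeeping. A minimal sketch (the prevalence and power values below are illustrative assumptions, not figures from any of the posts):

```python
def false_discovery_rate(prev_true_null, alpha, power):
    """P(null true | rejection) in the screening model, where a
    hypothesis is a random draw from an 'urn' in which a fraction
    prev_true_null of the nulls are true."""
    false_pos = prev_true_null * alpha        # true nulls rejected
    true_pos = (1 - prev_true_null) * power   # real effects rejected
    return false_pos / (false_pos + true_pos)

# With half the nulls true, alpha = 0.05 and power = 0.8, about 6%
# of rejections are of true nulls.
print(round(false_discovery_rate(0.5, 0.05, 0.8), 3))  # 0.059
```

Everything here turns on the assumed prevalence; change the urn and the “error rate” changes, which is exactly the point at issue.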

I’ve discussed all these points elsewhere, and they are beside my current complaint which is simply this: Frost construes the probability of a Type I error in some places as an ordinary error probability based on the sampling distribution alone; and other places as a Bayesian posterior probability of a hypothesis, conditional on a set of data.

Frost goes further in the post to suggest that “hypotheses tests are journeys from the prior probability to posterior probability”.

Hypothesis tests begin with differing probabilities that the null hypothesis is true depending on the specific hypotheses being tested. [Mayo: They do?] This prior probability influences the probability that the null is true at the conclusion of the test, the posterior probability.

| Initial probability of true null (1 – P(real)) | P value obtained | Final minimum probability of true null |
|---|---|---|
| 0.5 | 0.05 | 0.289 |
| 0.5 | 0.01 | 0.110 |
| 0.5 | 0.001 | 0.018 |
| 0.33 | 0.05 | 0.12 |
| 0.9 | 0.05 | 0.76 |

The table is based on calculations by Colquhoun and Sellke et al. It shows that the decrease from the initial probability to the final probability of a true null depends on the P value. Power is also a factor but not shown in the table.
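
For what it’s worth, the 0.5-prior rows can be reproduced (up to rounding) from the Sellke et al. lower bound on the Bayes factor in favor of the null, B(p) = –e·p·ln(p) for p < 1/e; the 0.33 and 0.9 rows rest on further assumptions involving power. A sketch:

```python
import math

def sellke_bound(p, prior_null=0.5):
    """Sellke-Bayarri-Berger lower bound on P(H0 | data), for p < 1/e.

    B(p) = -e * p * ln(p) bounds the Bayes factor in favour of H0
    from below; posterior odds = prior odds * B(p).
    """
    assert 0 < p < 1 / math.e
    bayes_factor = -math.e * p * math.log(p)
    prior_odds = prior_null / (1 - prior_null)
    post_odds = prior_odds * bayes_factor
    return post_odds / (1 + post_odds)

for p in (0.05, 0.01, 0.001):
    print(p, round(sellke_bound(p), 3))
# 0.05 -> 0.289, 0.01 -> 0.111, 0.001 -> 0.018
# (matching the table's 0.289, 0.110, 0.018 up to rounding)
```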

It is assumed that there is just a crude dichotomy: the null is true vs the effect is real (never mind magnitudes of discrepancy which I and others insist upon), and further, that you reject on the basis of a single, just statistically significant result. But these moves go against the healthy recommendations for good testing in the other posts on the Minitab blog. I recommend Frost go back and label the places where he has conflated the probability of a Type I error with a posterior probability based on a prior: use Berger’s suggestion of reserving “Type I error probability” for the ordinary frequentist error probability based on a sampling distribution alone, calling the posterior error probability Bayesian. Else contradictory claims will ensue…But I’m not holding my breath.

I may continue this in (ii)….

***********************************************************************

January 21, 2016 Update:

Jim Frost from Minitab responded, not to my post, but to a comment I made on his blog prior to writing the post. Since he hasn’t commented here, let me paste the relevant portion of his reply. I want to separate the issue of predesignating alpha and the observed P-value, because my point now is quite independent of it. Let’s even just talk of a fixed alpha or fixed P-value for rejecting the null. My point’s very simple: Frost sometimes considers the type I error probability to be alpha (in the 2015 posts) based solely on the sampling distribution which he ably depicts– whereas in other places he regards it as the posterior probability of the null hypothesis based on a prior probability (the 2014 posts). He does the same in his reply to me (which I can’t seem to link, but it’s in the discussion here):

From Frost’s reply to me:

Yes, if your study obtains a p-value of 0.03, you can say that 3% of all studies that obtain a p-value less than or equal to 0.03 will have a Type I error. That’s more of a N-P error rate interpretation (except that N-P focused on critical test values rather than p-values). ….. Based on Fisher’s measure of evidence approach, the correct way to interpret a p-value of exactly 0.03 is like this: Assuming the null hypothesis is true, you’d obtain the observed effect or more in 3% of studies due to random sampling error.

We know that the p-value is not the error rate because:

1) The math behind the p-values is not designed to calculate the probability that the null hypothesis is true (which is actually incalculable based solely on sample statistics). See a graphical representation of the math behind p-values and a post dedicated to how to correctly interpret p-values.

2) We also know this because there have been a number of simulation studies that look at the relationship between p-values and the probability that the null is true. These studies show that the actual probability that the null hypothesis is true tends to be greater than the p-value by a large margin.

3) Empirical studies that look at the replication of significant results also suggest that the actual probability that the null is true is greater than the p-value.
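
The sort of simulation Frost’s point (2) appeals to is easy to sketch. Assume (illustratively; these numbers are mine) that half of the tested nulls are true and that real effects shift the mean of the z statistic to 2.5; then, among results landing just under p = 0.05, the fraction of true nulls comes out far above 0.05, and that fraction is what such studies compare to the p-value:

```python
import math
import random

random.seed(1)

def z_pvalue(z):
    """Two-sided p-value for a z statistic under N(0, 1)."""
    return math.erfc(abs(z) / math.sqrt(2))

def frac_true_nulls_near_p05(prev_null=0.5, effect=2.5, trials=200_000):
    """Among tests landing at 0.04 < p < 0.05, what fraction of the
    nulls were actually true?  Each trial draws one z statistic:
    N(0, 1) when the null is true, N(effect, 1) when it is false.
    """
    nulls = total = 0
    for _ in range(trials):
        null_true = random.random() < prev_null
        z = random.gauss(0.0 if null_true else effect, 1.0)
        if 0.04 < z_pvalue(z) < 0.05:
            total += 1
            nulls += null_true
    return nulls / total

frac = frac_true_nulls_near_p05()
print(round(frac, 2))  # roughly 0.2-0.25, far above the p-value itself
```

Note that the answer depends entirely on the assumed prevalence and effect size, which is precisely the conflation at issue: the simulation computes a posterior quantity, not the Type I error probability.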

Frost’s points (1)-(3) above would also oust alpha as the type I error probability, for it’s also not designed to give a posterior. Never mind the question of the irrelevance or bias associated with the hefty spiked prior to the null involved in the simulations, all I’m saying is that Frost should make the distinction that even J. Berger agrees to, if he doesn’t want to confuse his readers.

—————————————————————————

[a] A distinct issue, as to whether significance levels, but not P-values (the attained significance level), are error probabilities, is discussed here. Here are some of the assertions from Fisherian, Neyman-Pearsonian and Bayesian camps cited in that post. (I make no attempt at uniformity in writing the “P-value”, but retain the quotes as written.)

*From the Fisherian camp (Cox and Hinkley):*

For given observations y we calculate t = t_obs = t(y), say, and the level of significance p_obs by

p_obs = Pr(T > t_obs; H_0).

“….Hence p_obs is the probability that we would mistakenly declare there to be evidence against H_0, were we to regard the data under analysis as being just decisive against H_0.” (Cox and Hinkley 1974, 66).

Thus p_obs would be the Type I error probability associated with the test.
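
The Cox and Hinkley definition is directly computable for, say, a one-sided z test, where T ~ N(0, 1) under H_0 (a minimal sketch of mine, not from their book):

```python
import math

def p_obs(t_obs):
    """Attained significance level for a one-sided z test:
    p_obs = Pr(T > t_obs; H0), with T ~ N(0, 1) under H0."""
    return 0.5 * math.erfc(t_obs / math.sqrt(2))

print(round(p_obs(1.645), 3))  # ~0.05
print(round(p_obs(2.326), 3))  # ~0.01
```

Note that nothing in the computation refers to a prior: p_obs is a tail-area probability under the null’s sampling distribution.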

*From the Neyman-Pearson (N-P) camp (Lehmann and Romano):*

“[I]t is good practice to determine not only whether the hypothesis is accepted or rejected at the given significance level, but also to determine the smallest significance level…at which the hypothesis would be rejected for the given observation. This number, the so-called p-value gives an idea of how strongly the data contradict the hypothesis. It also enables others to reach a verdict based on the significance level of their choice.” (Lehmann and Romano 2005, 63-4)

Very similar quotations are easily found, and are regarded as uncontroversial—even by Bayesians whose contributions stood at the foot of Berger and Sellke’s argument that P values exaggerate the evidence against the null.

*Gibbons and Pratt:*

“The P-value can then be interpreted as the smallest level of significance, that is, the ‘borderline level’, since the outcome observed would be judged significant at all levels greater than or equal to the P-value but not significant at any smaller levels. Thus it is sometimes called the ‘level attained’ by the sample….Reporting a P-value, whether exact or within an interval, in effect permits each individual to choose his own level of significance as the maximum tolerable probability of a Type I error.” (Gibbons and Pratt 1975, 21).

**REFERENCES:**

Berger, J. O. and Sellke, T. (1987). “Testing a point null hypothesis: The irreconcilability of p values and evidence,” (with discussion). *J. Amer. Statist. Assoc.* **82**: 112–139.

Casella, G. and Berger, R. (1987). “Reconciling Bayesian and Frequentist Evidence in the One-sided Testing Problem,” (with discussion). *J. Amer. Statist. Assoc.* **82**: 106–111, 123–139.

Sellke, T., Bayarri, M. J. and Berger, J. O. (2001). “Calibration of p Values for Testing Precise Null Hypotheses.” *The American Statistician* 55(1): 62–71.

**Frost blog posts:**

- 4/17/14: How to Correctly Interpret P Values
- 5/1/14: Not All P Values are Created Equal
- 3/19/15: Understanding Hypothesis Tests: Significance Levels (Alpha) and P values in Statistics

**Some Relevant Errorstatistics Posts:**

- 9/29/13: Highly probable vs highly probed: Bayesian/ error statistical differences.
- 7/14/14: “P-values overstate the evidence against the null”: legit or fallacious? (revised)
- 8/17/14: Are P Values Error Probabilities? or, “It’s the methods, stupid!” (2nd install)
- 3/5/15: A puzzle about the latest test ban (or ‘don’t ask, don’t tell’)
- 3/16/15: Stephen Senn: The pathetic P-value (Guest Post)
- 5/9/15: Stephen Senn: Double Jeopardy?: Judge Jeffreys Upholds the Law (sequel to the pathetic P-value)

I wrote this in airports and on a transatlantic flight*, so please alert me to errors, several of which I’ve already found (and fixed). I don’t have an e-mail address for Jim Frost, though I sent a link in a comment on his blog. It’s interesting that the more recent posts do not contain these problems, although they are within a year. Maybe more than one person writes his posts. Perhaps there are some warnings of the shift in meanings that I missed; if you spot some let me know.

*Of course I wouldn’t dream of wearing heels like these while on a plane!

I found this post (and am finding Error and the Growth of Experimental Knowledge) highly instructive in pointing out the sleights of hand and misunderstandings that go on in the current debates in statistics. As a graduate student dissatisfied with the misuse of frequentist statistics in my field and swept up in the Bayesian storm, I find your work highly instructive and challenging. Thank you.

Thanks Ed. I hope you don’t get swept too far out to sea. Be skeptical of seemingly convincing arguments from ‘big names’ like ‘we get good error probabilities and posteriors at the same time’. That means you’ll have to have a degree of chutzpah!

In order to define “the probability that the null hypothesis is true”, one needs to specify what “true” means; literally or approximately? If the latter, what kind of approximation, and where is the cutoff between true and false? Over what population of null hypotheses?

Daniel Lakeland, in a comment on Gelman’s blog, recently (kind of) defined a “true null hypothesis” in the space of a pre-chosen family of distributions as the distribution out of that family that produces the best predictions (assuming this is measurable in principle, which it often is). This at least explains what is meant by truth (although it means that changing the model, i.e., the family of distributions, changes the truth). Lakeland, as (I think) a Jaynesian, would then go on and define the probability for a model to be true relative to available information about it pre-data; but he wrote that the only important thing about the prior is that the true value is in the high density region of the prior, which allows a huge variety of admissible priors and will therefore not define a unique pre-data probability of a hypothesis to be true, let alone a frequency-based one over a well defined sample of hypotheses. So this approach is not suitable to explain what Frost and others mean.

Gelman, I think, does indeed have frequencies in some population of hypotheses in mind when it comes to this issue, but I haven’t seen him make this notion precise in the sense above. If one is to test a very large number of hypotheses this may not be so problematic, but in many other instances it’s not clear, as you mention above.

De Finetti thought that probabilities should only be considered about observable events, and the truth of probability hypotheses is not one of these. (He uses them as mere technical devices to get at predictive probabilities for observables.) Obviously even using Lakeland’s approach to define a truth that is observable in principle, his probabilities would still be subjective.

I’d be curious about how Frost and Colquhoun would explain theirs.

For me, still, when doing frequentist analysis, there is no such thing as a probability for a hypothesis to be true, unless there is a specific process generating hypotheses that makes sense to be modelled by a frequentist model (which then needs to be specified and defended, of course).

I certainly wouldn’t attempt to define the probability that the null hypothesis is true. For me, wearing my experimental hat, the point null is what I hope to reject. All I can hope to do is to reject that null. Once I have convinced myself that the observations are probably inconsistent with the null, I’d calculate the effect size and decide whether it was big enough to be important in practice.

The only problem that I can see in my argument is the definition of the prevalence of true effects. It is something that we don’t know (and although I can imagine certain contexts in which it could, in principle, be measured, I’m not aware that it’s ever been done).

The question that I was trying to answer was a restricted one. If we claim to have made a discovery when we observe P = 0.047 in a single unbiased test, what’s the probability that I’ll be wrong? Although we don’t know the prior prevalence of true effects, its relevance to answering the question is undoubted. Just consider the case where we compare two identical pills (or, equivalently, a dummy pill with a homeopathic pill). In this case the null is exactly true, the prior prevalence of true effects is zero and the false positive rate is 100 percent.

Although we haven’t got a value for the prior prevalence of true effects, we can set limits to it. My argument depends on assuming that it is never legitimate to assume a prior that is bigger than 0.5. To assume anything larger would mean presenting a journal editor with evidence that you’d made a discovery and explaining that your statistical justification for that claim was based on the premise that you were more likely than not to be right before the experiment was done. I have never seen any paper that tried to justify its conclusions in that way. I think that any such argument would get short shrift from reviewers, quite rightly.

Given the premise that the prior can’t be bigger than 0.5, the false positive rate for a prior of 0.5 (at least 26 percent for my question) is the minimum that can be expected. Any smaller prior would result in a higher false positive rate (e.g., for my question, the false positive rate is a staggering 76 percent for a prior of 0.1).

In summary, I don’t claim to know the prior, and I don’t claim to be able to calculate the false positive rate. But I do claim that if you say you have made a discovery when you observe P = 0.047 in a single unbiased test, then there is a chance of at least 26 percent that you’ll make a fool of yourself.

David: There’s so many confusions of probabilistic notions here, but it will have to wait til I’m on the plane to respond. Prevalence is what it is, and since you’re playing the prevalence game (not giving a prior plausibility to the hypothesis), you should not insist on biasing it. In many fields, point nulls are known to be false, rather than having the .5 spike you insist on. In truth, the supposed prevalences of true null hypothesis (from which your given H is purportedly randomly selected) are completely irrelevant to the assessment of the evidence for a hypothesis H in science, and irrelevant as well to the replicability of a given finding. (You can read the pages I gave you today, if you get a chance).

Sorry to miss your talk.


Christian: Look at how Colquhoun is muddying the waters further by claiming that if P-values differ from Bayesian posteriors then they are being “misinterpreted”. I’ll be leaving the day he’s giving this talk, so I’m counting on you Christian.

That’s odd. It seems like he’s clinging to the replication studies of reported p-values as if that’s evidence of how p-values themselves work, rather than representing a bias of publishing significant results or researcher degrees of freedom (that whole “garden of forking paths” Gelman always refers to).

He’s confusing the math with its application in publishing. We must be careful about that distinction when discussing the merits of using p-values, because, as we see here, misinformation can flow downstream.

Zakdavid: That’s a good way of putting it.

The replication studies appeared almost a year after I wrote the paper http://rsos.royalsocietypublishing.org/content/1/3/140216

The published replication studies have nothing to do with my argument. Doubtless there are many reasons for the low replicability found by Nosek et al. The underestimation of the false positive rate is, no doubt, one of them.