When they sought to subject Uri Geller to the scrutiny of scientists, magicians had to be brought in because only they were sufficiently trained to spot the subtle sleights of hand by which a magician tricks through misdirection. We, too, have to be magicians to discern the subtle misdirections and shifts of meaning in discussions of statistical significance tests (and other methods), even by the same statistical guide. We needn’t suppose anything deliberately devious is going on at all! Often, the statistical guidebook reflects shifts of meaning that grow out of one or another critical argument. These days, they trickle down quickly to statistical guidebooks, thanks to popular articles on the “statistics crisis in science”. The danger is that the guidebooks themselves end up containing inconsistencies. To adopt the magician’s stance is to be on the lookout for standard sleights of hand. There aren’t that many.
I don’t know Jim Frost, but he gives statistical guidance at the minitab blog. The purpose of my previous post is to point out that Frost uses the probability of a Type I error in two incompatible ways in his posts on significance tests. I assumed he’d want to clear this up, but so far he has not. His response to a comment I made on his blog is this:
Based on Fisher’s measure of evidence approach, the correct way to interpret a p-value of exactly 0.03 is like this:
Assuming the null hypothesis is true, you’d obtain the observed effect or more in 3% of studies due to random sampling error.
We know that the p-value is not the error rate because:
1) The math behind the p-values is not designed to calculate the probability that the null hypothesis is true (which is actually incalculable based solely on sample statistics). …
But this is also true for a test’s significance level α, so on these grounds α couldn’t be an “error rate” or error probability either. Yet Frost defines α to be a Type I error probability (“An α of 0.05 indicates that you are willing to accept a 5% chance that you are wrong when you reject the null hypothesis”).
Let’s use the philosopher’s slightly obnoxious but highly clarifying move of subscripts. There is error probability1, the usual frequentist (sampling theory) notion, and error probability2, the posterior probability that the null hypothesis is true conditional on the data, as in Frost’s remark. (It may also be stated as conditional on the p-value, or on rejecting the null.) Whether a p-value is predesignated or attained (observed), error probability1 ≠ error probability2. Frost, inadvertently I assume, uses the probability of a Type I error in these two incompatible ways in his posts on significance tests.
Interestingly, the simulations Frost cites to “show that the actual probability that the null hypothesis is true [i.e., error probability2] tends to be greater than the p-value by a large margin” work with a fixed p-value, or α level, of say .05. So it’s not a matter of predesignated or attained p-values. Their computations also employ predesignated probabilities of type II errors and corresponding power values. The null is rejected based on a single finding that attains a .05 p-value. Moreover, the point null (of “no effect”) is given a spiked prior of .5. (The idea comes from a context of diagnostic testing; the prior is often based on an assumed “prevalence” of true nulls in a population of nulls of which the current null is a member. Please see my previous post.)
Their simulations are the basis of criticisms of error probability1 because what really matters, or so these critics presuppose, is error probability2.
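To make the two notions concrete, here is a minimal sketch of the diagnostic-screening arithmetic, in its simplest "reject at level α" form. It is not a reproduction of the simulations Frost links to. The function name, the power values, and the alternative prior of 0.9 are hypothetical choices for illustration; only the .05 level and the .5 spiked prior come from the discussion above.

```python
def prob_null_given_rejection(alpha, power, prior_null):
    """Error probability2: Pr(H0 true | the test rejected H0).

    alpha      -- Pr(reject; H0 true), an error probability1
    power      -- Pr(reject; H0 false), also fixed by the sampling distribution
    prior_null -- assumed "prevalence" of true nulls (the spiked prior)
    """
    false_rejections = alpha * prior_null
    true_rejections = power * (1.0 - prior_null)
    return false_rejections / (false_rejections + true_rejections)


alpha = 0.05  # fixed significance level, as in the text

# Hypothetical power values and a hypothetical second prior (0.9).
for prior_null in (0.5, 0.9):
    for power in (0.2, 0.8):
        post = prob_null_given_rejection(alpha, power, prior_null)
        print(f"prior={prior_null:.1f} power={power:.1f} "
              f"alpha={alpha:.2f} Pr(H0 | reject)={post:.3f}")
```

The level α stays at .05 in every row, while the posterior quantity moves with the assumed prior and power. That is the sense in which the two "error probabilities" are different numbers answering different questions.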
Whether this assumption is correct, and whether these simulations are the slightest bit relevant to appraising the warrant for a given hypothesis, are completely distinct issues. I’m just saying that Frost’s own links mix these notions. If you approach statistical guidebooks with the magician’s suspicious eye, however, you can pull back the curtain on these sleights of hand.
Oh, and don’t lose your nerve just because the statistical guides themselves don’t see it or don’t relent. Send it on to me at email@example.com.
 They are the focus of a book I am completing: “How to Tell What’s True About Statistical Inference”.
 I admit we need a more careful delineation of the meaning of ‘error probability’. One doesn’t have an error probability without there being something that could be “in error”. That something is generally understood as an inference or an interpretation of data. A method of statistical inference moves from data to some inference about the source of the data as modeled; some may wish to see the inference as a kind of “act” (using Neyman’s language) or “decision to assert” but nothing turns on this.
Associated error probabilities refer to the probability a method outputs an erroneous interpretation of the data, where the particular error is pinned down. For example, the test might infer μ > 0 when in fact the data have been generated by a process where μ = 0. The test is defined in terms of a test statistic d(X), and the error probabilities1 refer to the probability distribution of d(X), the sampling distribution, under various assumptions about the data generating process. Error probabilities in tests, whether of the Fisherian or N-P varieties, refer to hypothetical relative frequencies of error in applying a method.
 Notice that error probability2 involves conditioning on the particular outcome. Say you have observed a 1.96 standard deviation difference, and that’s your fixed cut-off. There’s no consideration of the sampling distribution of d(X), if you’ve conditioned on the actual outcome. Yet probabilities of Type I and Type II errors, as well as p-values, are defined exclusively in terms of the sampling distribution of d(X), under a statistical hypothesis of interest. But all that’s error probability1.
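As a rough illustration of error probability1 as a property of the sampling distribution of d(X), the sketch below estimates the Type I error probability of a one-sided test with cut-off 1.96. It assumes, purely for illustration, a Normal model with known σ = 1, the statistic d(X) = √n(x̄ − 0)/σ, and a sample size of 10; none of these specifics are given in the post.

```python
import math
import random
from statistics import NormalDist


def d_statistic(sample, mu0=0.0, sigma=1.0):
    """Standardized distance of the sample mean from the null value mu0."""
    n = len(sample)
    mean = sum(sample) / n
    return math.sqrt(n) * (mean - mu0) / sigma


def simulated_type1_rate(cutoff=1.96, n=10, trials=20000, seed=1):
    """Relative frequency of d(X) > cutoff when data are generated with mu = 0."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(trials):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        if d_statistic(sample) > cutoff:
            rejections += 1
    return rejections / trials


# Exact tail area of the standard Normal beyond the cut-off.
exact = 1.0 - NormalDist().cdf(1.96)
simulated = simulated_type1_rate()
print(f"exact Pr(d(X) > 1.96; mu = 0)  = {exact:.4f}")
print(f"simulated relative frequency   = {simulated:.4f}")
```

Nothing here conditions on a particular observed outcome or assigns a prior to μ = 0; the probability is computed entirely from how d(X) would behave over hypothetical repetitions under the null.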
 Doubtless, part of the problem is that testers fail to clarify when and why a small significance level (or p-value) provides a warrant for inferring a discrepancy from the null. Firstly, for a p-value to be actual (and not merely nominal):
Pr(P < pobs; H0) = pobs.
Cherry picking and significance seeking can yield a small nominal p-value, while the actual probability of attaining even smaller p-values under the null is high. So this identity fails. Second, a small p-value warrants inferring a discrepancy from the null because, and to the extent that, a larger p-value would very probably have occurred, were the null hypothesis correct. This links error probabilities of a method to an inference in the case at hand.
“….Hence pobs is the probability that we would mistakenly declare there to be evidence against H0, were we to regard the data under analysis as being just decisive against H0.” (Cox and Hinkley 1974, p. 66).
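The gap between a nominal and an actual p-value can be sketched numerically. The example assumes one simple form of cherry picking: k independent tests are run, each with a p-value that is uniform on (0, 1) under the null, and only the smallest is reported. The function names and the choice k = 20 are hypothetical.

```python
import random


def actual_probability(p_obs, k):
    """Pr(smallest of k independent null p-values < p_obs)."""
    return 1.0 - (1.0 - p_obs) ** k


def simulated_probability(p_obs, k, trials=20000, seed=1):
    """Monte Carlo check: draw k uniform p-values per trial, keep the minimum."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        smallest = min(rng.random() for _ in range(k))
        if smallest < p_obs:
            hits += 1
    return hits / trials


p_obs = 0.05
for k in (1, 20):
    print(f"k={k:2d} nominal={p_obs:.2f} "
          f"actual={actual_probability(p_obs, k):.3f} "
          f"simulated={simulated_probability(p_obs, k):.3f}")
```

With a single preplanned test (k = 1) the identity Pr(P < pobs; H0) = pobs holds. With selection over 20 tests the reported .05 is only nominal: under these assumptions the probability of getting a reported p-value that small under the null is roughly .64.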
The myth that significance levels lose their error probability status once the attained p-value is reported is just that, a myth. I’ve discussed it a lot elsewhere; but the current point doesn’t turn on this. Still, it’s worth listening to statistician Stephen Senn (2002, p. 2438) on this point.
I disagree with [Steve Goodman] on two grounds here: (i) it is not necessary to separate p-values from their hypothesis test interpretation; (ii) the replication probability has no direct bearing on inferential meaning. Second he claims that, ‘the replication probability can be used as a frequentist counterpart of Bayesian and likelihood models to show that p-values overstate the evidence against the null-hypothesis’ (p. 875, my italics). I disagree that there is such an overstatement. In my opinion, whatever philosophical differences there are between significance tests and hypothesis test, they have little to do with the use or otherwise of p-values. For example, Lehmann in Testing Statistical Hypotheses, regarded by many as the most perfect and complete expression of the Neyman–Pearson approach, says
‘In applications, there is usually available a nested family of rejection regions corresponding to different significance levels. It is then good practice to determine not only whether the hypothesis is accepted or rejected at the given significance level, but also to determine the smallest significance level … the significance probability or p-value, at which the hypothesis would be rejected for the given observation’. (Lehmann, Testing Statistical Hypotheses, 1994, p. 70, original italics).
Note to subscribers: Please check back to find follow-ups and corrected versions of blogposts, indicated with (ii), (iii) etc.
Some Relevant Posts:
- 5/10/12: Excerpts from Senn’s letter [to Goodman] on replication, p-values, and evidence.
- 8/17/14: Are P Values Error Probabilities? or, “It’s the methods, stupid!” (2nd install)
- 3/16/15: Stephen Senn: The pathetic P-value (Guest Post)
- 5/9/15: Stephen Senn: Double Jeopardy?: Judge Jeffreys Upholds the Law (sequel to the pathetic P-value)
- previous post: High error rates in discussions of error rates.
I’m suspicious of the whole “magicians had to be brought in” hype. In more recent years, Bem’s paper got a lot of hype but regular non-magicians such as Wagenmakers and myself were able to see through it. Of course magicians are going to want to sell their particular expertise, but I’m doubtful. The fact that some scientists were gullible should not be taken to imply that regular, non-magician scientists can’t work this stuff out. In understanding ESP experiments, I’d much rather have a psychologist in my corner than a magician.