Here’s the follow-up post to the one I reblogged on Feb 3 (please read that one first). When they sought to subject Uri Geller to the scrutiny of scientists, magicians had to be brought in because only they were sufficiently trained to spot the subtle sleight of hand shifts by which the magician tricks by misdirection. We, too, have to be magicians to discern the subtle misdirections and shifts of meaning in the discussions of statistical significance tests (and other methods)—even by the same statistical guide. We needn’t suppose anything deliberately devious is going on at all! Often, the statistical guidebook reflects shifts of meaning that grow out of one or another critical argument. These days, they trickle down quickly to statistical guidebooks, thanks to popular articles on the “statistics crisis in science”. The danger is that their own guidebooks contain inconsistencies. To adopt the magician’s stance is to be on the lookout for standard sleights of hand. There aren’t that many.[0]

I don’t know Jim Frost, but he gives statistical guidance at the Minitab blog. The purpose of my previous post was to point out that Frost uses the probability of a Type I error in two incompatible ways in his posts on significance tests. I assumed he’d want to clear this up, but so far he has not. His response to a comment I made on his blog is this:

Based on Fisher’s measure of evidence approach, the correct way to interpret a p-value of exactly 0.03 is like this: Assuming the null hypothesis is true, you’d obtain the observed effect or more in 3% of studies due to random sampling error.

We know that the p-value is not the error rate because:

1) The math behind the p-values is not designed to calculate the probability that the null hypothesis is true (which is actually incalculable based solely on sample statistics). …

**But this is also true for a test’s significance level α, so on these grounds α couldn’t be an “error rate” or error probability either.**

Yet Frost defines α to be a Type I error probability (“An α of 0.05 indicates that you are willing to accept a 5% chance that you are wrong when you reject the null hypothesis”).**[1]**

Let’s use the philosopher’s slightly obnoxious but highly clarifying move of subscripts. There is error probability_{1}, the usual frequentist (sampling theory) notion, and error probability_{2}, the posterior probability that the null hypothesis is true conditional on the data, as in Frost’s remark. (It may also be stated as conditional on the p-value, or on rejecting the null.) Whether a p-value is predesignated or attained (observed), error probability_{1} ≠ error probability_{2}.[2] Frost, inadvertently I assume, uses the probability of a Type I error in these two incompatible ways in his posts on significance tests.[3]

Interestingly, the simulations Frost cites to “show that the actual probability that the null hypothesis is true [i.e., error probability_{2}] tends to be greater than the p-value by a large margin” work with a *fixed* p-value, or α level, of say .05. So it’s not a matter of predesignated or attained p-values.[4] Their computations also employ predesignated probabilities of type II errors and corresponding power values. The null is rejected based on a single finding that attains a .05 p-value. Moreover, the point null (of “no effect”) is given a spiked prior of .5. (The idea comes from a context of diagnostic testing; the prior is often based on an assumed “prevalence” of true nulls of which the current null is a member. Please see my previous post.)
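A rough sketch of the arithmetic behind such computations may help. It assumes the diagnostic-screening setup just described: a prior of .5 on the null, a fixed α of .05, and a rejection whenever the result reaches that cut-off. The power values and the function name `prob_null_given_rejection` are illustrative assumptions, not taken from the sources Frost links to.

```python
def prob_null_given_rejection(alpha, power, prior_null):
    """Posterior-style 'error probability_2': Pr(null true | test rejects).

    Assumes a two-point screening model: with probability prior_null the
    null is true (rejected with probability alpha), otherwise a single
    alternative is true (rejected with probability power).
    """
    false_rejections = alpha * prior_null
    true_rejections = power * (1.0 - prior_null)
    return false_rejections / (false_rejections + true_rejections)


alpha = 0.05       # Type I error probability (error probability_1), fixed by the test
prior_null = 0.5   # spiked prior on the point null

# Hypothetical power values, chosen only to show how the answer moves.
for power in (0.2, 0.5, 0.8):
    posterior = prob_null_given_rejection(alpha, power, prior_null)
    print(f"alpha = {alpha}, power = {power}: Pr(null | reject) = {posterior:.3f}")
```

The Type I error probability stays at α in every row; only the posterior-style quantity changes with the assumed prior and power, which is one way to see that the two notions are distinct.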

Their simulations are the basis of criticisms of error probability_{1} because what really matters, or so these critics presuppose, is error probability_{2}.

Whether this assumption is correct, and whether these simulations are the slightest bit relevant to appraising the warrant for a given hypothesis, are completely distinct issues. I’m just saying that Frost’s own links mix these notions. If you approach statistical guidebooks with the magician’s suspicious eye, however, you can pull back the curtain on these sleights of hand.

Oh, and don’t lose your nerve just because the statistical guides themselves don’t see it or don’t relent. Send it on to me at error@vt.edu.

******************************************************************************

[0] They are the focus of a book I am completing: “Statistical Inference As Severe Testing: How to Get Beyond the Statistics Wars” (CUP, 2017).

[1] I admit we need a more careful delineation of the meaning of ‘error probability’. One doesn’t have an error probability without there being something that could be “in error”. That something is generally understood as an inference or an interpretation of data. A method of statistical inference moves from data to some inference about the source of the data as modeled; some may wish to see the inference as a kind of “act” (using Neyman’s language) or “decision to assert” but nothing turns on this.

Associated error probabilities refer to the probability a method outputs an erroneous interpretation of the data, where the particular error is pinned down. For example, it might be, the test infers μ > 0 when in fact the data have been generated by a process where μ = 0. The test is defined in terms of a test statistic *d*(**X**), and the error probabilities_{1} refer to the probability distribution of *d*(**X**), the *sampling distribution*, under various assumptions about the data generating process. Error probabilities in tests, whether of the Fisherian or N-P varieties, refer to hypothetical relative frequencies of error in applying a method.

[2] Notice that error probability_{2} involves conditioning on the particular outcome. Say you have observed a 1.96 standard deviation difference, and that’s your fixed cut-off. There’s no consideration of the sampling distribution of *d*(**X**), if you’ve conditioned on the actual outcome. Yet probabilities of Type I and Type II errors, as well as p-values, are defined exclusively in terms of the *sampling distribution* of *d*(**X**), under a statistical hypothesis of interest. But all that’s error probability_{1}.

[3] Doubtless, part of the problem is that testers fail to clarify when and why a small significance level (or p-value) provides a warrant for inferring a discrepancy from the null. Firstly, for a p-value to be *actual* (and not merely *nominal*):

*Pr(P < p_{obs}; H_{0}) = p_{obs}.*

Cherry picking and significance seeking can yield a small *nominal* p-value, while the actual probability of attaining even smaller p-values under the null is high. So this identity fails. Second, a small p-value warrants inferring a discrepancy from the null because, and to the extent that, a larger p-value would very probably have occurred, were the null hypothesis correct. This links error probabilities of a method to an inference in the case at hand.

“…Hence p_{obs} is the probability that we would mistakenly declare there to be evidence against H_{0}, were we to regard the data under analysis as being just decisive against H_{0}.” (Cox and Hinkley 1974, p. 66).
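To see how far a nominal p-value can drift from the actual one under cherry picking, here is a minimal sketch. It assumes a searcher runs k independent tests of true nulls, each with a p-value uniform on (0, 1) under the null, and reports only the smallest p-value. The choice k = 20, the independence assumption, and the function names are hypothetical, chosen purely for illustration.

```python
import random


def actual_p_of_smallest(p_obs, k):
    """Pr(smallest of k independent null p-values <= p_obs)."""
    return 1.0 - (1.0 - p_obs) ** k


def simulate_smallest_p(p_obs, k, n_sims=100_000, seed=1):
    """Monte Carlo check: share of searches whose best p-value is <= p_obs."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        # Each null p-value is modeled as a Uniform(0, 1) draw.
        smallest = min(rng.random() for _ in range(k))
        if smallest <= p_obs:
            hits += 1
    return hits / n_sims


p_obs, k = 0.05, 20  # nominal p-value reported, number of tests searched (assumed)
print(f"nominal p-value:        {p_obs}")
print(f"actual p-value (exact): {actual_p_of_smallest(p_obs, k):.3f}")
print(f"actual p-value (sim):   {simulate_smallest_p(p_obs, k):.3f}")
```

With no searching (k = 1) the exact formula returns the nominal value, so the identity holds; with searching, the actual probability of reporting so small a p-value under the null is far larger than the nominal figure.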

[4] The myth that significance levels lose their error probability status once the attained p-value is reported is just that, a myth. I’ve discussed it a lot elsewhere; but the current point doesn’t turn on this. Still, it’s worth listening to statistician Stephen Senn (2002, p. 2438) on this point.

I disagree with [Steve Goodman] on two grounds here: (i) it is not necessary to separate p-values from their hypothesis test interpretation; (ii) the replication probability has no direct bearing on inferential meaning. Second he claims that, ‘the replication probability can be used as a frequentist counterpart of Bayesian and likelihood models to show that p-values overstate the evidence against the null-hypothesis’ (p. 875, my italics). I disagree that there is such an overstatement. In my opinion, whatever philosophical differences there are between significance tests and hypothesis test, they have little to do with the use or otherwise of p-values. For example, Lehmann in Testing Statistical Hypotheses, regarded by many as the most perfect and complete expression of the Neyman–Pearson approach, says

‘In applications, there is usually available a nested family of rejection regions corresponding to different significance levels. It is then good practice to determine not only whether the hypothesis is accepted or rejected at the given significance level, but also to determine the smallest significance level … the significance probability or p-value, at which the hypothesis would be rejected for the given observation’. (Lehmann, Testing Statistical Hypotheses, 1994, p. 70, original italics).

Note to subscribers: Please check back to find follow-ups and corrected versions of blogposts, indicated with (ii), (iii) etc.

**Some Relevant Posts:**

- 5/10/12: Excerpts from Senn’s letter [to Goodman] on replication, p-values, and evidence.
- 8/17/14: Are P Values Error Probabilities? or, “It’s the methods, stupid!” (2nd install)
- 3/16/15: Stephen Senn: The pathetic P-value (Guest Post)
- 5/9/15: Stephen Senn: Double Jeopardy?: Judge Jeffreys Upholds the Law (sequel to the pathetic P-value)
- previous post: High error rates in discussions of error rates.

I think Frost’s statement about the significance level can be defended on the grounds that “you are willing to accept a 5% chance that you are wrong” is not the same as “there *is* a 5% chance that you are wrong”. So his statement is not actually identifying the significance level with the probability of a Type I error. (It’s just the value which that probability will have in the event that the null happens to be true)

If you study the posts, you’ll see what I mean. And I’m not even saying anything here about which measure is desirable. As pointed out in my last post, I thought everything was unusually correct in his treatment. I assumed he’d want to correct this or clarify, but he did not.

Could it be we’re asking too much of statistics (when it comes to science)?

Maybe. It shouldn’t be used as window-dressing. But I don’t think we’re asking too much of those who set out their shingle as “statistical guides”. We don’t ask enough.

When is your new book coming out? This year?

yes

Suppose I simulate a study (in which the null is true) 1 million times using an alpha equal to 0.05. The null will be rejected 50,000 times more or less, but I end up with 1 million values of p, 50,000 of them being less than 0.05. Looked at this way, alpha is a property of the test and an error rate. But p has a sampling distribution and so is not an error rate. I’m not even sure that we should call it a probability.

So, saying that p is not an error rate does not imply that alpha is not an error rate. Or am I hopelessly confused?
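The simulation described above can be sketched as follows. It assumes each study is summarized by a z statistic that is standard normal under the null, with a two-sided p-value; these choices, and the function name `simulate_null_studies`, are illustrative assumptions rather than details given in the comment.

```python
import math
import random


def simulate_null_studies(n_studies=1_000_000, alpha=0.05, seed=0):
    """Simulate studies in which the null is true.

    Returns the share of studies rejected at level alpha and the list of
    p-values, one per study.
    """
    rng = random.Random(seed)
    p_values = []
    for _ in range(n_studies):
        z = rng.gauss(0.0, 1.0)  # test statistic under the null (assumed N(0, 1))
        # Two-sided p-value: Pr(|Z| >= |z|) = erfc(|z| / sqrt(2))
        p_values.append(math.erfc(abs(z) / math.sqrt(2.0)))
    rejections = sum(1 for p in p_values if p <= alpha)
    return rejections / n_studies, p_values


rate, p_values = simulate_null_studies()
print(f"rejection rate at alpha = 0.05: {rate:.4f}")

# The p-values themselves vary from study to study; under the null, about a
# proportion c of them should fall at or below any cut-off c.
for cut in (0.01, 0.05, 0.5):
    share = sum(1 for p in p_values if p <= cut) / len(p_values)
    print(f"share of p-values <= {cut}: {share:.4f}")
```

The rejection rate belongs to the procedure, while each run produces a different p; yet for any cut-off fixed in advance, the share of null p-values at or below it is close to that cut-off.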

Peter, you might well be hopelessly confused, because the topic is hopelessly confusing as generally presented. If you keep the distinction between method long run performance and the individual outcomes the issue is less confusing.

The nub of the matter is that it is the statistical _procedure_ that has a long run error rate. That is alpha. A particular P-value is a singular index of evidence regarding the match of the data to the model and its null hypothesis. Being singular it cannot be a rate. The rate in question comes from the notional agglomeration of notional singular P-values (in the notional long run) and the application of dichotomisation rules to them in the notional condition of the null hypothesis being true.

Say we have a particular P-value of 0.027. If the procedure was to reject the null hypothesis every time the P-value was 0.027 or less then that P-value would be numerically equal to the alpha of the procedure. However, if you didn’t have that alpha set prior to seeing the data it is not the alpha that pertains to the actually used procedure.

P-values can be (should be) used as indices of evidence that are considered along with other available information prior to making a scientific statement. Any error in the scientific statement would have been the result of not only the P-value but also the considerations and the other information. Error rates are a convenient accounting mechanism for design of procedures and experiments, but they have almost no utility in a sensible scientific procedure once the data are in hand.

I disagree with someone. Please see this post: https://errorstatistics.com/2014/08/17/are-p-values-error-probabilities-installment-1/

Sorry, I have a new ‘puter and Chrome didn’t supply my details automagically. You disagree with me, Mayo, but that is not new information. My response was for the edification of Peter Chapman, not to re-open the endless and pointless discussions between you and me about the error rate nature of P-values.

(I agree with Oliver Maclaren on the point that you have co-opted Fisher’s name into an error account of P-values without clear evidence that he actually intended the meaning that you impute.)

Fisher’s the one who touted .05 as assuring the tester he’d be wrong 5% of the time. He started all that error rate business. N-P said balance 2 types of error. Strictly speaking, the fiducial argument is also in terms of the frequency of being wrong in an aggregate–along the lines of Fraser. Fisher agreed when Neyman first pointed this out. Only after N-P overshadowed him and the 1935 break did Fisher start backing off what he’d said for years. Only by the 1950s did Fisher get the idea (from Barnard) that N had converted “his” tests into acceptance rules. Of course, by then, Wald’s work really did move N-P-W into decision theory territory. E.Pearson wanted no part of it. But Fisher had faded by then.

Anonymous: My comment about being hopelessly confused was a little jest. You have simply repeated what I said. In my comments I stated that the p-value was not an error rate because we get a new value for every run in a long run – ie it is post data, whereas alpha is an error rate fixed at the outset – ie it is pre-data. It is because of this statement that I reserve the right to be “possibly wrong” because “I think” Deborah regards the p-value as an error rate.

I don’t think the topic is at all confusing.

To see why the p-value is an error probability see this post https://errorstatistics.com/2014/08/17/are-p-values-error-probabilities-installment-1/

The supposition that it’s not is based purely on a philosophical presumption about what was in Fisher’s mind, and a failure to see that alpha is the probability we’d erroneously reject, were we to reject when reaching the alpha cut-off. That’s why Lehmann could say, as he does, that reporting the p-value lets others apply their own alpha cut-off. Check the definitions on the post above.

Peter

A p-value is a statistic, just like a t-value, an F-value, or a chi-square value. Generally, under a null hypothesis, p-values have a uniform distribution over the interval from 0 to 1. Alpha is the error rate we specify in setting up a test framework. If the null hypothesis is true, repeated evaluations of our test statistic will show proportion alpha of said test statistic values falling in the critical region specified by the pre-set alpha level. Their corresponding p-values will all be smaller than alpha. Understanding the distributional properties of p-value statistics allows several numerical and graphical methods for displaying and assessing collections of p-values to shed light on whether the null hypothesis fits the observed data better than various alternatives.

RE your point [1] and Fisher vs Neyman. Here is a quote from Rao, heavily influenced by Fisher:

“… it may be necessary to consider the problem of estimation from a wider point of view as ‘extraction of information’ for [later] drawing inferences and for recording it…for possible future uses”

I hence still maintain that the NP view and Fisherian view are fundamentally distinct. Even if Neyman’s behavioural mode is ‘just a matter of speaking’ I think any translation remains distinct from the key elements of a Fisherian, ‘extraction of information’ approach to statistics.

It seems disingenuous to claim Fisher within ‘error statistics’ – especially since he denigrated type II errors and viewed power as a specialist application of likelihood.

Just as we distinguish various types of Bayesian despite occasional agreements, I think we should maintain a distinction between Fisher and Neyman, despite occasional agreements. They have fundamentally different conceptual and philosophical approaches.

Neymanian statistics doesn’t even require likelihood as a basic component (though it still happens to turn up as a derived quantity in particular cases) – on the other hand I think you’d be hard pressed to convince Fisher (and Cox) not to put likelihood front and centre, even if supplemented by eg large sample calibration etc.

Om: As you know, I’ve written quite a lot denying the most problematic claims about Fisher-NP statistics being an inconsistent hybrid, whereas I urge, “it’s the methods stupid”, not what they said (or yelled) and not your favorite philosophy. They are best seen as parts of a conglomeration of tools that include many others. The overarching reasoning is error statistical: probability arises to characterize how well methods control erroneous interpretations of data. N-P were doing their darndest to give a theory for generating and justifying Fisher’s tests, tests he just honed in on in a helter skelter fashion. They wanted to save him from the unwarranted tests that the single null could permit, and the use of “fiducial” probability would allow. Fisher didn’t want to be saved. So they broke up and went their own ways in 1935 just a few years after N-P tests were developed, and that’s really the end of it.

The danger, almost single handedly foisted upon us by that clever man, Jim Berger, is the following move: N-P methods are irrelevant for inference, a P value is intended to be relevant for inference, but any quantity relevant for inference must fall under the banner of “probabilism”–never mind that probabilism is practiced in inconsistent ways. So, Berger continues, we must judge P-values as if they were to be either posterior probabilities in hyps or Bayes factors or the like. Gee they don’t match these measures–at least if you conveniently use the priors Berger recommends for this job. Therefore, we must reject both N-P and Fisher and instead be default Bayesians.

I don’t think Fisher would agree ‘The overarching reasoning is error statistical’ since, for example he had no use for type II errors.

It’s of course convenient to claim that Fisher didn’t know what he was talking about and was really an error statistician, at least for the error statistician, but I don’t think it is accurate at all.

Why not just – there are some points of agreement but some major points of disagreement.

Also, by saying ‘it’s the methods’ you can’t then neglect Fisher’s substantial likelihood-based work, the subsequent work on conditional inference and higher order asymptotics, and just focus on simple tests of hypotheses.

Mayo, your account of Berger’s influence and thought patterns might well be correct, but you probably should ask him. I met him last year and it is my opinion that he would respond to a direct question directly.

You cannot ask Fisher, of course, and so I suppose you can feel free to do as I do and pretend to know his motivations and thoughts. However, note that your version differs markedly from mine. The most important part of the disagreement between Fisher and Neyman revolved around whether or not P-values could stand as evidence about the state of the world pertaining to the collection of the particular dataset. Fisher’s disjunction implies that they do, but Neyman’s long run error rates require that they do not. That difference is expressed very clearly in the argument about how to interpret identical intervals denoted as ‘confidence’ and ‘fiducial’.

It’s a long story. What’s not in my blog over 5 yrs is in my new book. But that’s the least important part. Remember “it’s the methods stupid”. I have spoken to Berger many times. He has a particular, Berger interpretation of Neyman that’s Bayesian.

Your “it’s the methods stupid” seems to be a bizarre reply for a philosopher to give.

Fisherian methods are different from Neymanian methods – even Cox says so in his principles book. Efron distinguishes ‘Fisherian’ statistics. Even Spanos does in his econometrics book. The examples abound.

Yes there are some points of agreement, no there is not universal agreement on methods between Fisher and Neyman.

There are many flavours of Bayesian but there are also many flavours of confidence inference and of likelihood-based inference. For example some of those who consider themselves ‘Fisherian’ and favour a form of ‘direct likelihood’ inference (eg Sprott, Lindsey) also reject the strong likelihood principle and embrace some ‘confidence’-esque reasoning in addition.

You have one interpretation and set of methods. To me your approach is simply not Fisherian and that is fine. But it is not Fisherian and does not in fact even cover the full extent of ‘confidence’ inference.

Om: No Cox says they’re mathematically the same and further that Fisher was more behavioral in practice than was Neyman. I’m not saying they aren’t different tools, but it’s absurd to imagine just one tool for all science. N-P-F all used and invented several tools.

On the philosophy, chapter 11 of EGEK is “Why Pearson rejected the N-P philosophy of statistics”–so I too bought into it to a degree. I later looked more carefully, thanks to Spanos. I’m saying the alleged philosophical difference barely exists and what people have made of the slight difference is a disaster.

But I repeat: it’s the methods stupid, and we should be able to figure out anew how to use them and not keep saying, “but Fisher said” when he was yelling at Neyman. That’s so personality oriented and is silly.

Cox in principles book, comparing to the ‘Fisherian’ approach and use of sufficiency:

“An alternative avenue, following the important and influential work of J. Neyman and E. S. Pearson, in a sense proceeds in the reverse direction. Optimality criteria are set up for a formulation of a specific issue, expressing some notion of achieving the most sensitive analysis possible with the available data. The optimality problem is then solved (typically by appeal to a reservoir of relevant theorems) and a procedure produced which, in the contexts under discussion, is then found to involve the data only via s. The two avenues to s nearly, but not quite always, lead to the same destination.

This second route will not be followed in the present book.

In the Neyman–Pearson theory of tests, the sensitivity of a test is assessed by the notion of power, defined as the probability of reaching a preset level of significance considered for various alternative hypotheses. In the approach adopted here the assessment is via the distribution of the random variable P, again considered for various alternatives.

The main applications of power are in the comparison of alternative procedures and in preliminary calculation of appropriate sample size in the design of studies. The latter application, which is inevitably very approximate, is almost always better done by considering the width of confidence intervals.”

More broadly, anyone who has looked at Fisher’s methods properly can see they are different. Nothing to do with personality as such – simply different.

Om: thanks for the quote:

“In the Neyman–Pearson theory of tests, the sensitivity of a test is assessed by the notion of power, defined as the probability of reaching a preset level of significance considered for various alternative hypotheses. In the approach adopted here the assessment is via the distribution of the random variable P, again considered for various alternatives”. Cox 2006

Yes there’s a difference between the crude behavioristic approach and Fisher, but Cox and I and Fisher (and Barnard and Birnbaum and Spanos and doubtless others) agree that that’s not the only way to use the measures, and further, that in practice, Cox came to see, Neyman did not use them behavioristically. That’s why Cox said in practice Fisher is more behavioristic, Neyman less… Egon said every statistician has his alpha and beta side (theory vs practice).

You know it’s a little awkward because Cox and I started working together in 2003, and power occupied a purely pre-data planning notion for him. He hadn’t known, nor had I until around 1999, that N-P give 3 roles to power; one is post data (to determine if there are grounds to “confirm” the adequacy of the null in the case of non-rejection). This is no different from the logic of significance testing, so once we had principle FEV, we could use SEV. This quote is the way Cox and I were able to agree on severity using what could be described as the distribution of the P-value over alternatives or “attained power” (but we didn’t call it that) in our 2006 paper for the Lehmann conference. This influenced his book, however slightly. Greenland wrote to Cox, a bit alarmed at Mayo and Cox 2006/2010 because of this. Cox said, no what we’re doing is something quite different.

So you see, these things aren’t so hard and fast, and real people (not historical caricatures), really rethinking the logic, and looking at the founder’s practice (free from polemics) can move away from rigid views.

I would love people to move away from rigid views and embrace the full diversity of good methods, including Bayesian, Likelihood, Neymanian, Fisherian, Tukian etc etc.

Om: But I’m not sure you got the point, and since this comes up a lot, it would be good to pin it down. It’s not merely a matter of using a grab bag of tools, it’s the fact that frequentist error probability tools (the ones I’m talking about) can be reinterpreted in this way. Thus it is a mistake to keep saying that Fisher disagreed with Neyman, or that there’s a difference between a behavioristic and an inferential testing interpretation, and to conclude that we have to forever and ever assume these tools must be interpreted along those rigid lines. The founders did not think so, and they were right. It’s extremely important. Don’t insist on placing people in pigeon holes they themselves might have moved beyond, and don’t place yourself in a pigeon-hole in using a method. That said, error statistical methods are in sync as interconnected tools within an overarching philosophy. I don’t know that any of the probabilisms can readily be placed under the error statistical umbrella because they use probability in a very different way, at least the standard ones. But I think any inference, to a posterior or whatnot, can be assessed for severity. That would be a way to place them under an error statistical umbrella. But they can live their own lives as well.

I got your point but I disagree. In my mind you are the one placing Fisherian and other methods into an error statistical pigeonhole.

Om: All that means is that his reports quantify error probabilities. I said nothing about how they are to be interpreted or used. Fisher says, if the null is false, you’ll not be wrong in rejecting it, and if true, you’ll be wrong 5 (or whatever) % of the time in rejecting it.

“Personally, the writer prefers to set a low standard of significance at the 5 per cent point, and ignore entirely all results which fail to reach this level. A scientific fact should be regarded as experimentally established only if a properly designed experiment rarely fails to give this level of significance”.

Fisher (1926)

“A scientific fact should be regarded as experimentally established only if a properly designed experiment rarely fails to give this level of significance”

This doesn’t read to me as being strictly about ‘error probabilities’, rather is a requirement of reproducibility or stability.

You could similarly say

“A scientific fact should be regarded as experimentally established only if a properly designed experiment rarely fails to give this level of *evidence*”

for any measure of ‘evidence’.

Sprott, for example, in his book ‘Statistical inference in science’, opens with a discussion of Fisher on repeatable experiments, while also advocating essentially direct likelihood inference (though not holding the strong likelihood principle).

He does not use ‘error statistical’ reasoning in the form you present it. This is but one example of someone strongly influenced by Fisher, who has further developed his methods and who does not fall easily under the ‘error statistical’ umbrella.

My concern is that such work – which I think is quite appealing – is in danger of being overlooked by trying to assimilate Fisher under the error statistical umbrella.

Similarly, because Gelman offered an olive branch in the form of his paper with Shalizi, you sometimes claim Gelman as ‘error statistical’. On the other hand he has stated over and over that he never uses nor considers type I or type II error reasoning.

That there are *some* points of agreement between these various approaches is good, but it is a mistake in my mind to claim therefore that the ‘overarching reasoning’ involved is error-statistical.

Om: Well you’re confused about what an error probability means to me. Severity reasoning is a proper subset of methods using error probabilities. Gelman calls himself error statistical in the specific sense of using Fisherian type simple significance tests for model checking. He has never disagreed with his quotes wherein he describes himself as using error statistical methods. But I don’t see the relevance of your points. There’s no danger of likelihood analysis being “overlooked”. I’m interested in error statistical methods, and don’t see why probabilisms can’t be probabilisms. I do not worship in the church of the Unificationist.

Only if someone recommends a method that permits inferring H, even though that method had no chance of uncovering a flaw in H, then I say it’s an unwarranted inference, and if you use such methods routinely, it’s bad science. Replace “no” with “little or no” and it still flouts the essential requirement to be self critical in science. I guarantee Gelman concurs with me on that.

You likewise appear to be confused about what I mean.

How about:

“a method that permits inferring [that an approach is based on error statistical reasoning], even though that method had no chance of uncovering [that this approach is not based on error statistical reasoning]…[is]…an unwarranted inference, and if you use such methods routinely, it’s bad [philosophy]”

Om: The majority of my work is showing that popular methods violate the even minimal requirements for a severe test.

In my last comment, I shared some things for the first time on this blog, thinking it might help, but your response makes me feel it was a waste of time. Here’s hoping some other readers got something out of it.

https://errorstatistics.files.wordpress.com/2014/06/fraser_is-bayes-posterior-just-quick-and-dirty-confidence.pdf