You might not have thought there could be new material for 2014, but there is, and if you look a bit more closely, you’ll see that it’s actually not Jay Leno who is standing up there at the mike ….

It’s Sir Harold Jeffreys himself! And his (very famous) joke, I admit, is funny. So, since it’s Saturday night, let’s listen in on Sir Harold’s howler* in criticizing the use of p-values.

“Did you hear the one about significance testers rejecting *H*_{0} because of outcomes *H*_{0} didn’t predict?

‘What’s unusual about that?’ you ask?

Well, what’s unusual, is that they do it when these unpredicted outcomes haven’t even occurred!”

Much laughter.

[The actual quote from Jeffreys: Using p-values implies that “An hypothesis that may be true is rejected because it has failed to predict observable results that have not occurred. This seems a remarkable procedure.” (Jeffreys 1939, 316)]

I say it’s funny, so, to see why, I’ll strive to give it a generous interpretation.

We can view p-values in terms of rejecting *H*_{0}, as in the joke: There’s a test statistic D such that *H*_{0} is rejected if its observed value d_{0} reaches or exceeds a cut-off d* where Pr(D > d*; *H*_{0}) is small, say .025.

Reject *H*_{0} if Pr(D > d_{0}; *H*_{0}) < .025.

The report might be “reject *H*_{0} at level .025”.

*Example*: *H*_{0}: The mean light deflection effect is 0. So if we observe a 1.96 standard deviation difference (in one-sided Normal testing) we’d reject *H*_{0}.

Now it’s true that if the observation were further into the rejection region, say 2, 3 or 4 standard deviations, it too would result in rejecting the null, and with an even smaller p-value. It’s also true that *H*_{0} “has not predicted” a 2, 3, 4, 5 etc. standard deviation difference in the sense that differences so large are “far from” or improbable under the null. But wait a minute. What if we’ve only observed a 1 standard deviation difference (p-value = .16)? It is unfair to count it against the null that 1.96, 2, 3, 4 etc. standard deviation differences would have diverged seriously from the null, when we’ve only observed the 1 standard deviation difference. Yet the p-value tells you to compute Pr(D > 1; *H*_{0}), which includes these more extreme outcomes! This is “a remarkable procedure” indeed! [i]
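To make the numbers in this example concrete, here is a minimal sketch of the tail-area computation, assuming D is a standard Normal statistic under *H*_{0} (one-sided, in standard deviation units). The function name `upper_tail` and the list of observed differences are illustrative choices, not anything taken from Jeffreys or Fisher.

```python
from math import erfc, sqrt

ALPHA = 0.025  # the level used in the example above


def upper_tail(d):
    """Pr(D > d; H0), assuming D is standard Normal under H0."""
    return 0.5 * erfc(d / sqrt(2))


# Observed differences, in standard deviation units.
for d0 in (1.0, 1.96, 2.0, 3.0, 4.0):
    p = upper_tail(d0)
    verdict = "reject H0" if p < ALPHA else "do not reject H0"
    print(f"d0 = {d0:4.2f}   Pr(D > d0; H0) = {p:.5f}   {verdict}")
```

Under these assumptions the 1 standard deviation difference has a tail area of about .16, well above .025, while differences of 1.96 or more fall below it.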

So much for making out the howler. The only problem is that significance tests do not do this, that is, they do not reject with, say, D = 1 because larger D values might have occurred (but did not). D = 1 does not reach the cut-off, and does not lead to rejecting *H*_{0}. Moreover, looking at the tail area makes it harder, not easier, to reject the null (although this isn’t the only function of the tail area), since it requires not merely that Pr(D = d_{0}; *H*_{0}) be small, but that Pr(D > d_{0}; *H*_{0}) be small. And this is well justified because when this probability is not small, you should not regard the outcome as evidence of discrepancy from the null. Before getting to this ….

1. The joke talks about outcomes the null does not predict–just what we wouldn’t know without an assumed test statistic, but the tail area consideration arises in Fisherian tests in order to determine what outcomes *H*_{0} “has not predicted”. That is, it arises to identify a sensible test statistic D.

In familiar scientific tests, we know the outcomes that are ‘more extreme’ from a given hypothesis in the direction of interest, e.g., the more patients show side effects after taking drug Z, the less indicative Z is benign, *not the other way around*. But that’s to assume the equivalent of a test statistic. In Fisher’s set-up, one needs to identify a suitable measure of accordance, fit, or directional departure. Improbability of outcomes (under *H*_{0}) should not indicate discrepancy from *H*_{0} if even less probable outcomes would occur under discrepancies from *H*_{0}. (Note: To avoid confusion, I always use “discrepancy” to refer to the parameter values used in describing the underlying data generation; values of D are “differences”.)

*2. N-P tests and tail areas*: Now N-P tests do not consider “tail areas” explicitly, but they fall out of the desiderata of good tests and sensible test statistics. N-P tests were developed to provide the tests that Fisher used with a rationale by making explicit the alternatives of interest—even if just in terms of directions of departure.

In order to determine the appropriate test and compare alternative tests “Neyman and I introduced the notions of the class of admissible hypotheses and the power function of a test. The class of admissible alternatives is formally related to the direction of deviations—changes in mean, changes in variability, departure from linear regression, existence of interactions, or what you will.” (Pearson 1955, 207)

Under N-P test criteria, tests should rarely reject a null erroneously, and as discrepancies from the null increase, the probability of signaling discordance from the null should increase. In addition to ensuring Pr(D < d*; *H*_{0}) is high, one wants Pr(D > d*; *H*’: μ_{0} + γ) to increase as γ increases. Any sensible distance measure D must **track** discrepancies from *H*_{0}. If you’re going to reason, “the larger the D value, the worse the fit with *H*_{0},” then observed differences must occur **because** of the falsity of *H*_{0} (in this connection consider Kadane’s howler).
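The two requirements just stated can be sketched numerically. This is only an illustration under assumptions of my choosing: D is taken to be Normal with mean γ and unit variance when the true mean is μ_{0} + γ (with γ in standard error units), and the cut-off d* is 1.96. The names `power` and `D_STAR` are hypothetical.

```python
from math import erfc, sqrt

D_STAR = 1.96  # cut-off d*, matching the .025 level used earlier


def upper_tail(x):
    """Upper tail area of the standard Normal distribution."""
    return 0.5 * erfc(x / sqrt(2))


def power(gamma, d_star=D_STAR):
    """Pr(D > d*; mu0 + gamma), assuming D ~ Normal(gamma, 1)."""
    return upper_tail(d_star - gamma)


# gamma = 0 is the null itself: Pr(D < d*; H0) should be high.
print(f"Pr(D < d*; H0) = {1 - power(0.0):.3f}")

# The probability of signaling discordance grows with the discrepancy.
for gamma in (0.0, 0.5, 1.0, 1.96, 3.0, 4.0):
    print(f"gamma = {gamma:4.2f}   Pr(D > d*; mu0 + gamma) = {power(gamma):.3f}")
```

With γ = 0 the rejection probability is about .025, at γ = d* it is one half, and it keeps rising from there, which is the sense in which D tracks discrepancies from *H*_{0}.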

3. But Fisher, strictly speaking, has only the null distribution, along with an implicit interest in tests with *sensitivity* toward implicit departures. To find out if *H*_{0} has or has not predicted observed results, we need a sensible distance measure. (Recall Senn’s post: “Fisher’s alternative to the alternative”.)

Suppose I take an observed difference d_{0} as grounds to reject *H*_{0} on account of its being improbable under *H*_{0}, when in fact larger differences (larger D values) are more probable under *H*_{0}. Then, as Fisher rightly notes, the improbability of the observed difference was a poor indication of underlying discrepancy. This fallacy would be revealed by looking at the tail area; whereas it is readily committed with accounts that only look at the improbability of the observed outcome d_{0} under *H*_{0}.

4. Even if you have a sensible distance measure D (tracking the discrepancy relevant for the inference), and observe D = d, the improbability of d under *H*_{0} should not be indicative of a genuine discrepancy, if it’s rather easy to bring about differences even greater than observed, under *H*_{0}. Equivalently, we want a high probability of inferring *H*_{0} when *H*_{0} is true. In my terms, considering Pr(D < d*; *H*_{0}) is what’s needed to block rejecting the null and inferring alternative *H*’ when you haven’t rejected it with severity (where *H*’ and *H*_{0} exhaust the parameter space). In order to say that we have “sincerely tried”, to use Popper’s expression, to reject *H*’ when it is false and *H*_{0} is correct, we need Pr(D < d*; *H*_{0}) to be high.
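A coin-tossing sketch may help show why the improbability of the observed outcome alone is not enough. The set-up is an assumption made for illustration: *H*_{0} is a fair coin, n = 10,000 tosses, and 5,050 heads are observed (about 1 standard deviation above the expected 5,000). The function names are hypothetical, and the tail here is taken as “at least as many heads as observed”.

```python
from math import comb

N = 10_000       # number of tosses (illustrative)
OBSERVED = 5_050  # observed heads, about 1 standard deviation above N / 2
ALPHA = 0.025


def point_prob(k, n):
    """Pr(X = k; H0) for a fair coin tossed n times."""
    return comb(n, k) / 2**n


def tail_prob(k, n):
    """Pr(X >= k; H0): outcomes at least as extreme as k heads."""
    return sum(comb(n, j) for j in range(k, n + 1)) / 2**n


p_point = point_prob(OBSERVED, N)
p_tail = tail_prob(OBSERVED, N)

print(f"Pr(X = {OBSERVED}; H0)  = {p_point:.4f}")  # small: every exact count is improbable
print(f"Pr(X >= {OBSERVED}; H0) = {p_tail:.4f}")   # not small: such differences arise easily
print("reject on point probability alone:", p_point < ALPHA)
print("reject on tail area:", p_tail < ALPHA)
```

The exact count is improbable under the fair-coin null simply because every exact count is, yet differences at least this large are easy to bring about under *H*_{0}, so the tail area blocks the rejection.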

*5. Concluding remarks*:

The rationale for the tail area, as I see it, is twofold: to get the right direction of departure, but also to ensure Pr(test T does *not* reject *H*_{0}; *H*_{0} ) is high.

If we don’t already have an appropriate distance measure D, then we don’t know which outcomes we should regard as those *H*_{0} *does or does not* predict–so Jeffreys’ quip can’t even be made out. That’s why Fisher looks at the tail area associated with any candidate for a test statistic. Neyman and Pearson make alternatives explicit in order to arrive at relevant test statistics.

If we have an appropriate D, then Jeffreys’ criticism is equally puzzling because considering the tail area does not make it easier to reject *H*_{0} but harder. Harder because it’s not enough that the outcome be improbable under the null; outcomes even greater must be improbable under the null as well. And it makes it a lot harder (leading to blocking a rejection) just when it should: because the data could readily be produced by *H*_{0} [ii].

Either way, Jeffreys’ criticism, funny as it seems, collapses.

When an observation leads to rejecting the null in a significance test, it is because of that outcome—*not because of any unobserved outcomes.* Considering other possible outcomes that could have arisen is essential for determining (and controlling) the capabilities of the given testing method. In fact, understanding the properties of our testing tool T just is to understand what T would do under different outcomes, under different conjectures about what’s producing the data.

[i] Jeffreys’ next sentence, remarkably is: “On the face of it, the evidence might more reasonably be taken as evidence for the hypothesis, not against it.” This further supports my reading, as if we’d reject a fair coin null because it would not predict 100% heads, even though we only observed 51% heads. But the allegation has no relation to significance tests of the Fisherian or N-P varieties.

[ii] One may argue it should be even harder, but this is a distinct issue.

[iii] As usual, I’ll indicate a significantly changed draft with [ii] in the title. This [event] is not improbable, as it’s new material!

*Sir Harold’s “howler” fell out naturally from the alliteration, but I actually don’t think so famous a “one-liner”–one that raises a legitimate question to be answered–should be lumped in with the group of howlers that are repeated over and over again, despite clarifications/explanations/corrections having been given many times. (So there’s a time factor involved.) I also wouldn’t place logical puzzles, e.g., the Birnbaum argument, in this category. By contrast, alleging that rejecting a null is intended, by N-P theory, to give stronger evidence against the null as the power increases, is. Several other howlers may be found on this blog. I realized the need for a qualification in reading a comment today by Gelman.

Jeffreys, H. (1939 edition), *Theory of Probability*. Oxford.

Pearson, E.S. (1955), “Statistical Concepts in Their Relation to Reality.”

Why do you call the P-value, or tail area, Fisherian rather than Karl-Pearsonian?

Hi Golde: Do you think I should call it that?

I think that Jeffreys’ comment was valid regarding a way in which Fisherian significance tests were and are still sometimes presented, also by Fisher himself at times, I think. In order to deal with Jeffreys, it is necessary to say that the H0 is tested against alternatives *that put a larger probability on the test statistic being in the tail area* (if needed, on a specific side of the H0; or, of course, by making equivalent statements in terms of D), but not against general alternatives or “no specific alternative”.

So I think that your discussion is all right, but I’d give it to Jeffreys that the idea of significance tests that he attacked was indeed held and advertised by some.

Christian: I don’t understand what you’re referring to as the idea of tests he’s attacking being one that is held and advertised by some.

I had numerous discussions with people, some even statisticians, who thought that alternative hypotheses are nothing more than a technical requirement for N-P theory and who held that in principle one wants to test the H0 only against an omnibus alternative or no alternative at all. This is implicit for example in Donald Gillies’ 1973 “An Objective Theory of Probability” and Donald told me later that he accepted my criticism regarding this issue. I don’t have the time to go through Fisher’s writings to find it right now but I’m sure that one can find places in which he seemingly advertises this idea; I’m quite sure I have seen such passages. He is more insightful in other places, but he often wrote things that seem to contradict things he wrote elsewhere, so I wouldn’t be surprised that some readers took him as saying that there is no role for any alternative and tests are generally “against everything”. Certainly I have met some fans of Fisher who advertise it like that.

But without the idea of an alternative “direction” with which the H0 is compared, Jeffreys would be right to wonder why some events that have not happened count against H0 in the calculation of the p-value or the definition of the critical region, whereas others don’t.

Christian: I think the issue as to whether Fisherian tests are “against everything” or against alternatives, be they only directional, is distinct from the issue here. Spanos distinguishes what he calls “testing without” (testing against everything) and “testing within” (a model). To implement tests, however, even the former must indicate departures against which the tester requires sensitivity (Fisher’s term). Tail areas arise in test formulation. I understand that there’s a legitimate question as to how one determines “more extreme” values if one doesn’t have at least a directional alternative in mind (and I think Fisher assumed the tester would have such a directional interest for any given test–why else focus on “sensitivity”?). [He says things like, the tester knows what he's interested in.]

What I’m not quite seeing is how this issue gets to the criticism of tail areas as “permitting outcomes that haven’t happened to count against H0 in the calculation of the p-value”. How, for example, would Jeffreys’ next sentence in my Note [i], in this post, make sense?

I guess it isn’t worthwhile to go much deeper here because I think you understood my main point and I agree that it requires a rather selective reading of Fisher to interpret him in such a way that this criticism makes sense, but then, given the way Fisher wrote, it is all too easy to read him selectively. I don’t know where to look for “Jeffreys’ next sentence in my Note 2”, sorry, but anyway I accept your view and just wanted to share what I wrote before.

Christian: I had meant my first note (I’ve corrected it):

[i] Jeffreys’ next sentence, remarkably is: “On the face of it, the evidence might more reasonably be taken as evidence for the hypothesis, not against it.”

I think it is worthwhile because Jeffreys’ criticism has struck a deep chord with people, so it should at least be analyzed as to intended meaning. I truly don’t think it’s simply the problem of testing without an explicit alternative.

To get at what I do think the issue is, I suspect that critics inadvertently transpose the p-value and imagine Ho is being judged given that one of the more extreme outcomes* has occurred.

What matters is not what Fisher said or thought, it’s the fact that p-values employ tail areas. I’m giving a rationale that differs from the behavioristic one.

*I had first written “alternatives”, which might sound like alternative hypotheses; I meant outcomes, as in Pr(Ho|D > d).

Mayo, I’m confused by this sentence: “Improbability of outcomes (under H0) should not indicate discrepancy from H0 if even less probable outcomes would occur under discrepancies from H0.” Is that a typo and you meant to say something like “if even less probable outcomes would occur under H0”?

Mark: This is actually a different claim from the other. Here it’s the fact that outcomes improbable under Ho could be even less probable under discrepancies from Ho. It is a convoluted way of putting it (that larger D values should be more and more probable under discrepancies from Ho than they are under Ho), but it is what I intended. Thanks.

Actually, you say this later, “I take an observed difference d0 as grounds to reject H0 on account of its being improbable under H0, when in fact larger differences (larger D values) are more probable under H0.” I think that’s basically what you meant above, too, but it didn’t come off that way to me.

“e.g., the more patients show side effects after taking drug Z, the less indicative Z is benign”

Can you explain the basis for this argument more completely? Baseline differences are ever present in medicine and just as plausible an explanation as a treatment effect. I agree with it, but it seems very weak evidence.