Sir Harold Jeffreys’ (tail area) one-liner: Saturday night comedy (b)

This headliner appeared before, but to a sparse audience, so Management’s giving him another chance… His joke relates to both Senn’s post (about alternatives), and to my recent post about using (1 – β)/α as a likelihood ratio--but for very different reasons. (I’ve explained at the bottom of this “(b) draft”.)

 ….If you look closely, you’ll see that it’s actually not Jay Leno who is standing up there at the mike, (especially as he’s no longer doing the Tonight Show) ….



It’s Sir Harold Jeffreys himself! And his (very famous) joke, I admit, is funny. So, since it’s Saturday night, let’s listen in on Sir Harold’s howler joke* in criticizing the use of p-values.

“Did you hear the one about significance testers rejecting H0 because of outcomes H0 didn’t predict?

‘What’s unusual about that?’ you ask?

What’s unusual is that they do it when these unpredicted outcomes haven’t even occurred!”

Much laughter.

[The actual quote from Jeffreys: Using p-values implies that “An hypothesis that may be true is rejected because it has failed to predict observable results that have not occurred. This seems a remarkable procedure.” (Jeffreys 1939, 316)]

I say it’s funny, so to see why I’ll strive to give it a generous interpretation.

We can view p-values in terms of rejecting H0, as in the joke: There’s a test statistic D such that H0 is rejected if its observed value d0 reaches or exceeds a cut-off d* where Pr(D > d*; H0) is small, say .025.
           Reject H0 if Pr(D > d0; H0) < .025.
The report might be “reject H0 at level .025”.
Example:  H0: The mean light deflection effect is 0. So if we observe a 1.96 standard deviation difference (in one-sided Normal testing) we’d reject H0.
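
For readers who want to check the numbers, here is a minimal sketch in Python of the one-sided Normal test just described. It assumes D is measured in standard deviation units and is standard Normal under H0; the names `null`, `ALPHA`, `cutoff`, `p_value` and `rejects` are illustrative choices only.

```python
from statistics import NormalDist

# Assumption: under H0 the difference D (in standard deviation units) is standard Normal.
null = NormalDist(mu=0.0, sigma=1.0)

ALPHA = 0.025  # the "small" level used in the text

# Cut-off d* with Pr(D > d*; H0) = ALPHA.
cutoff = null.inv_cdf(1.0 - ALPHA)

def p_value(d0: float) -> float:
    """One-sided tail area Pr(D > d0; H0)."""
    return 1.0 - null.cdf(d0)

def rejects(d0: float) -> bool:
    """Reject H0 when the tail area beyond the observed d0 is below ALPHA."""
    return p_value(d0) < ALPHA

print(f"d* = {cutoff:.3f}")
for d0 in (1.96, 2.0, 3.0, 4.0):
    print(f"d0 = {d0:.2f}  p-value = {p_value(d0):.5f}  reject H0: {rejects(d0)}")
```

Calling `p_value(1.0)` gives roughly .16, well above the .025 level, so `rejects(1.0)` is false.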

Now it’s true that if the observation were further into the rejection region, say 2, 3 or 4 standard deviations, it too would result in rejecting the null, and with an even smaller p-value. It’s also true that H0 “has not predicted” a 2, 3, 4, 5 etc. standard deviation difference in the sense that differences so large are “far from” or improbable under the null. But wait a minute. What if we’ve only observed a 1 standard deviation difference (p-value = .16)? It is unfair to count it against the null that 1.96, 2, 3, 4 etc. standard deviation differences would have diverged seriously from the null, when we’ve only observed the 1 standard deviation difference. Yet the p-value tells you to compute Pr(D > 1; H0), which includes these more extreme outcomes! This is “a remarkable procedure” indeed! [i]

So much for making out the howler. The only problem is that significance tests do not do this, that is, they do not reject with, say, D = 1 because larger D values might have occurred (but did not). D = 1 does not reach the cut-off, and does not lead to rejecting H0. Moreover, looking at the tail area makes it harder, not easier, to reject the null (although this isn’t the only function of the tail area): since it requires not merely that Pr(D = d0 ; H0 ) be small, but that Pr(D > d0 ; H0 ) be small. And this is well justified because when this probability is not small, you should not regard it as evidence of discrepancy from the null. Before getting to this ….
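
The contrast between Pr(D = d0; H0) and the tail area can be made concrete with a discrete case. The sketch below assumes a fair-coin null with 2000 tosses (an illustrative choice, not from the text) and takes D to be the number of heads. Even the single most probable outcome has a small point probability, yet its tail area is about one half, so a tail-area test does not reject.

```python
from math import comb

N_TOSSES = 2000          # hypothetical sample size, chosen only for illustration
D_OBSERVED = 1000        # observed number of heads: exactly what H0 expects

TOTAL = 2 ** N_TOSSES    # number of equally likely sequences under a fair coin

# Point probability Pr(D = d0; H0).
point_prob = comb(N_TOSSES, D_OBSERVED) / TOTAL

# Tail area Pr(D >= d0; H0); for a discrete D the observed value is included.
# The sum is done exactly with integers before dividing.
tail_area = sum(comb(N_TOSSES, k) for k in range(D_OBSERVED, N_TOSSES + 1)) / TOTAL

print(f"Pr(D = {D_OBSERVED}; H0)  = {point_prob:.4f}")
print(f"Pr(D >= {D_OBSERVED}; H0) = {tail_area:.4f}")
```

An account that looked only at the improbability of the observed outcome would have grounds to reject here; the tail area blocks it.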

1. The joke talks about outcomes the null does not predict, which is just what we wouldn’t know without an assumed test statistic. But the tail area consideration arises in Fisherian tests precisely in order to determine what outcomes H0 “has not predicted”. That is, it arises to identify a sensible test statistic D.

In familiar scientific tests, we know the outcomes that are ‘more extreme’ from a given hypothesis in the direction of interest, e.g., the more patients show side effects after taking drug Z, the less indicative it is that Z is benign, not the other way around. But that’s to assume the equivalent of a test statistic. In Fisher’s set-up, one needs to identify a suitable measure of accordance, fit, or directional departure. Improbability of outcomes (under H0) should not indicate discrepancy from H0 if even less probable outcomes would occur under discrepancies from H0. (Note: To avoid confusion, I always use “discrepancy” to refer to the parameter values used in describing the underlying data generation; values of D are “differences”.)

2. N-P tests and tail areas: Now N-P tests do not consider “tail areas” explicitly, but they fall out of the desiderata of good tests and sensible test statistics. N-P tests were developed to provide the tests that Fisher used with a rationale by making explicit the alternatives of interest—even if just in terms of directions of departure.

In order to determine the appropriate test and compare alternative tests “Neyman and I introduced the notions of the class of admissible hypotheses and the power function of a test. The class of admissible alternatives is formally related to the direction of deviations—changes in mean, changes in variability, departure from linear regression, existence of interactions, or what you will.” (Pearson 1955, 207)

Under N-P test criteria, tests should rarely reject a null erroneously, and as discrepancies from the null increase, the probability of signaling discordance from the null should increase. In addition to ensuring Pr(D < d*; H0) is high, one wants Pr(D > d*; H’: μ0 + γ) to increase as γ increases.  Any sensible distance measure D must track discrepancies from H0.  If you’re going to reason, “the larger the D value, the worse the fit with H0,” then observed differences must occur because of the falsity of H0 (in this connection consider Kadane’s howler).
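
The second requirement can be illustrated with the same one-sided Normal test. The sketch assumes D is Normal with unit standard deviation and mean equal to the discrepancy γ (so γ = 0 is H0), with the cut-off d* fixed at level .025. The function name `power` and the grid of γ values are illustrative choices.

```python
from statistics import NormalDist

ALPHA = 0.025
# Cut-off d* fixed under H0 (standard Normal): Pr(D > d*; H0) = ALPHA.
D_STAR = NormalDist().inv_cdf(1.0 - ALPHA)

def power(gamma: float) -> float:
    """Pr(D > d*; mu0 + gamma), assuming D ~ Normal(gamma, 1) in standard deviation units."""
    return 1.0 - NormalDist(mu=gamma, sigma=1.0).cdf(D_STAR)

gammas = [0.0, 0.5, 1.0, 2.0, 3.0, 4.0]
powers = [power(g) for g in gammas]

for g, pw in zip(gammas, powers):
    print(f"gamma = {g:.1f}  Pr(D > d*) = {pw:.3f}")
```

At γ = 0 the probability of rejecting equals the level .025, and it rises steadily as γ grows, which is what it means for D to track discrepancies from H0.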

3. But Fisher, strictly speaking, has only the null distribution, along with an implicit interest in tests with sensitivity toward implicit departures. To find out if H0 has or has not predicted observed results, we need a sensible distance measure. (Recall Senn’s post: “Fisher’s alternative to the alternative”, just reblogged yet again.**)

Suppose I take an observed difference d0 as grounds to reject H0 on account of its being improbable under H0, when in fact larger differences (larger D values) are more probable under H0. Then, as Fisher rightly notes, the improbability of the observed difference was a poor indication of underlying discrepancy. This fallacy would be revealed by looking at the tail area; whereas it is readily committed with accounts that only look at the improbability of the observed outcome d0 under H0.
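
A small numerical case shows the fallacy. Assume, purely for illustration, a fair-coin null with 100 tosses, D the number of heads, and an observed d0 of 35. The observed outcome is improbable under H0, but nearly all of the probability under H0 lies on larger values of D, so the tail area is close to 1 and gives no grounds for rejection.

```python
from math import comb

N_TOSSES = 100     # hypothetical sample size
D_OBSERVED = 35    # hypothetical observed number of heads
TOTAL = 2 ** N_TOSSES

def prob_heads(k: int) -> float:
    """Pr(D = k; H0) for a fair coin."""
    return comb(N_TOSSES, k) / TOTAL

point_prob = prob_heads(D_OBSERVED)

# Pr(D > d0; H0): probability of differences larger than the one observed.
tail_area = sum(prob_heads(k) for k in range(D_OBSERVED + 1, N_TOSSES + 1))

# Count the larger values of D that are individually more probable than d0 under H0.
more_probable_larger = sum(
    1 for k in range(D_OBSERVED + 1, N_TOSSES + 1) if prob_heads(k) > point_prob
)

print(f"Pr(D = {D_OBSERVED}; H0) = {point_prob:.5f}")
print(f"Pr(D > {D_OBSERVED}; H0) = {tail_area:.5f}")
print(f"larger values more probable than d0: {more_probable_larger}")
```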

4. Even if you have a sensible distance measure D (tracking the discrepancy relevant for the inference), and observe D = d, the improbability of d under H0 should not be indicative of a genuine discrepancy, if it’s rather easy to bring about differences even greater than observed, under H0. Equivalently, we want a high probability of inferring H0 when H0 is true. In my terms, considering Pr(D < d*; H0) is what’s needed to block rejecting the null and inferring alternative H’ when you haven’t rejected it with severity (where H’ and H0 exhaust the parameter space). In order to say that we have “sincerely tried”, to use Popper’s expression, to reject H’ when it is false and H0 is correct, we need Pr(D < d*; H0) to be high.

5. Concluding remarks:

The rationale for the tail area for Fisher, as I see it, is twofold: to get the right direction of departure, but also to ensure Pr(test T does not reject H0; H0) is high.

If we don’t already have an appropriate distance measure D, then we don’t know which outcomes we should regard as those H0 does or does not predict–so Jeffreys’ quip can’t even be made out. That’s why Fisher looks at the tail area associated with any candidate for a test statistic. Neyman and Pearson make alternatives explicit in order to arrive at relevant test statistics. (For N-P, the “tail area” falls out of the rejection region; they actually criticize Fisher for not justifying his use of it.)

If we have an appropriate D, then Jeffreys’ criticism is equally puzzling because considering the tail area does not make it easier to reject H0 but harder. Harder because it’s not enough that the outcome be improbable under the null; outcomes even greater must be improbable under the null as well. And it makes it a lot harder (leading to blocking a rejection) just when it should: because the data could readily be produced by H0 [ii].

Either way, Jeffreys’ criticism, funny as it seems, collapses.

When an observation leads to rejecting the null in a significance test, it is because of that outcome—not because of any unobserved outcomes. Considering other possible outcomes that could have arisen is essential for determining (and controlling) the capabilities of the given testing method. In fact, understanding the properties of our testing tool T just is to understand what T would do under different outcomes, under different conjectures about what’s producing the data.

Feb 22 addition(b): The relevance to Senn’s post is pretty obvious, as considering the tail area is a way Fisher ensures a sensible (directional) test statistic. The connection to my post about using  (1 – β)/α as a likelihood ratio is less direct.  If you use this in a Bayesian computation you’ll get a higher posterior probability for the non-null than if you used the observed data point. In this case, considering the tail area (for the power) really would make it easier to find evidence against the null. But it comes from using these terms in an entirely unintended way. Moreover, power is a lousy “distance” (between data and hypothesis) measure. That’s what I was trying to bring up in the post that went off topic.
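
The difference between the two ratios can be seen in a toy calculation. The sketch assumes the one-sided Normal test at level .025, a point alternative placed so that the power is .9, an observed difference sitting exactly at the cut-off, and equal prior weight on the two hypotheses. All of these are illustrative assumptions rather than anything fixed by the posts under discussion.

```python
from statistics import NormalDist

ALPHA = 0.025
POWER = 0.9

null = NormalDist(mu=0.0, sigma=1.0)
d_star = null.inv_cdf(1.0 - ALPHA)

# Place the point alternative H' so that Pr(D > d*; H') = POWER.
mu_alt = d_star + NormalDist().inv_cdf(POWER)
alt = NormalDist(mu=mu_alt, sigma=1.0)

d0 = d_star  # hypothetical observation: exactly at the cut-off

# Ratio built from tail areas: (1 - beta) / alpha.
tail_ratio = (1.0 - alt.cdf(d_star)) / (1.0 - null.cdf(d_star))
# Likelihood ratio at the observed data point.
point_ratio = alt.pdf(d0) / null.pdf(d0)

def posterior_alt(ratio: float, prior_alt: float = 0.5) -> float:
    """Posterior probability of H' given a ratio in favour of H' and a prior for H'."""
    odds = ratio * prior_alt / (1.0 - prior_alt)
    return odds / (1.0 + odds)

print(f"(1 - beta)/alpha = {tail_ratio:.2f}  posterior for H' = {posterior_alt(tail_ratio):.3f}")
print(f"point ratio      = {point_ratio:.2f}  posterior for H' = {posterior_alt(point_ratio):.3f}")
```

Under these assumptions the tail-based ratio comes out much larger than the likelihood ratio at the observed point, so the posterior for the non-null is higher when (1 – β)/α is used.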

[i] Jeffreys’ next sentence, remarkably, is: “On the face of it, the evidence might more reasonably be taken as evidence for the hypothesis, not against it.” This further supports my reading: it is as if we’d reject a fair coin null because it would not predict 100% heads, even though we only observed 51% heads. But the allegation has no relation to significance tests of the Fisherian or N-P varieties.

[ii] One may argue it should be even harder, but this is a distinct issue.

[iii] I’ll indicate a significantly changed draft with [b] in the title.

*I initially called this “Sir Harold’s ‘howler’”. That phrase fell out naturally from the alliteration, but it’s strictly incorrect (as I wish to use the term “howler”). I don’t think so famous a “one-liner”–one that raises a legitimate question to be answered–should be lumped in with the group of howlers that are repeated over and over again, despite clarifications/explanations/corrections having been given many times. (So there’s a time factor involved.) I also wouldn’t place logical puzzles, e.g., the Birnbaum argument, in this category. By contrast, alleging that rejecting a null is intended, by N-P theory, to give stronger evidence against the null as the power increases, is a howler. Several other howlers may be found on this blog. I realized the need for a qualification in reading a comment on this blog by Andrew Gelman (1/14).

**Perhaps Senn disagrees with my take?

Jeffreys, H. (1939 edition), Theory of Probability. Oxford.

Pearson, E.S. (1955), “Statistical Concepts in Their Relation to Reality.”



7 thoughts on “Sir Harold Jeffreys’ (tail area) one-liner: Saturday night comedy (b)”

  1. Alan

    It wasn’t a one liner. It was a part of a long discussion (roughly pages 384 to 388 in the edition I have) where his meaning is plainly stated with appropriate details.

    He wasn’t saying “p-value reasoning seems to have an absurd justification, so reject p-value tail area methods”. It would be more accurate to phrase his claim as “p-value tail area reasoning seems absurd on the face of it, but it may be more legitimate than it looks”.

    First he basically says that, for obvious reasons, you can’t use the number P(observed data|H0) by itself most of the time (especially for continuous variables). He then says basically that “inverse probability” solves this dilemma trivially, correctly and in generality. Non-Bayesians who reject the use of inverse probability thus must use some other method to get something reasonable, hence tail area reasoning. He goes on to say basically that this is sometimes the right thing to do, or approximates it, and sometimes it isn’t, and that some p-values are relevant for Bayesians.

    Here are some actual quotes:

    “The fundamental idea, and one that I should naturally accept, is that a law should not be accepted on data that themselves show large departures from its predictions”

    “It must be said that the method fulfills a practical need; but there was no need for the paradoxical use of P[-values]”

    “It should be said that several of the P[-value] integrals have a definite place in the present [Bayesian] theory …”

    Maybe it’s better and easier to just read Jeffreys than to take one line out of context and try to interpret it. Especially since I’m leaving much out as well.

    • Alan: This has got to be THE best known one-liner of all statistics, and I was perplexed that it is always mentioned as a criticism of Fisher (even by E. Pearson, e.g., “the N-P Story”). (It’s so famous, Senn has a pun on it, for a different problem; I don’t have it handy.)
      In your edition, p. 387, there is the very important point: “For the normal law with a known standard error,..the total area of the tail represents the probability, given the data, that the estimated difference has the wrong sign—provided there is no question whether the difference is zero.” (Jeffreys, p. 387)

  2. Suppose one takes a hierarchical Bayesian point of view, conditionally decomposing the full model into a ‘data/measurement error’ model and an underlying ‘true process’ model.

    For an underlying model that captures the ‘true process’, it then seems that a p-value is basically a check of the ‘measurement error’ part.

    One alternative idea that comes to mind is to include the error distribution parameters as unknowns in the model (e.g. the standard deviation for a simple normal, zero mean case) and estimate them based on the data using the standard Bayes approach for obtaining the marginals of all unknowns.

    If one knows the ‘true’ measurement error then one can then compare the estimated error distribution with the known distribution (say visually at first – could also use other metrics) to see if they are consistent.

    If the ‘true process’ is mischaracterised as well then there could e.g. be unusual visible correlations between the measurement error estimate and the process model estimate, indicating that they are trying to compensate for one another. Having the various marginals would be very useful for this. This again is assessed as part of model checking.

    So in general the p-value idea makes sense to me as part of ‘self-consistency’ checks. But I also think from a practical (maybe theoretical?) point of view it helps to separate the phases into

    – assume a model and do the estimation
    – check for self-consistency of model assumptions

    This amounts to the error statistics approach as emphasised by Spanos, and the Box/Gelman etc. approach.

    p-value style reasoning seems most useful/important as part of the ‘self-consistency’ or ‘inductive premise’ part, but likelihood/Bayes does seem useful for the estimation part in complex cases. Box basically said all this years ago. Frequentist p-values aren’t the only way of carrying out consistency checks, however – many Bayesian or information-theoretic ideas have been proposed as well.

    The use of p-values or other sample-space-dependent reasoning in the *estimation* phase (*in addition to* in the self-consistency phase) appears to amount to adding ‘sample-space/data noise’ into a hierarchical model – why not just include it in a hierarchical Bayes model?

    • Hmm forget that very last sentence, or perhaps most of the above (I’ll stop thinking out loud on your blog!).

      While I accept most of your discussion, I’m still not sure whether p-value concepts (or N-P concepts) have any advantages for estimation (though of course they can be used) over other methods.

      The idea of using p-value-style ideas for model checking seems more acceptable, though (again there may be reasonable alternative strategies however).

      • [Again ignoring my promises, I suppose what I was wondering with the end of my first comment is whether one could include an analogue of additional frequentist sampling variability in the Bayesian estimation through a hierarchical structure…I guess this may relate to bootstrap/bayes/empirical bayes connections or something…]

  3. This surely alludes to the two different ways of calculating the likelihood ratio. The distinction goes back at least as far as Jeffreys, but when I wanted to discuss them I was surprised to find that there was no generally accepted name for the two approaches. So I called them, rather clumsily, the p-equals method and the p-less-than method. I maintain that it is clearly the former that’s relevant for the interpretation of p-values produced by a single test, as discussed in section 3 of

    • Not sure if this is meant to connect with Jeffreys’ tail area one-liner? Of course likelihood ratios do not have error control unless there are 2 predesignated point hypotheses. But that can’t work beyond highly artificial cases.
