Sir Harold Jeffreys’ (tail area) one-liner: Sat night comedy [draft ii]

Posted on January 2, 2016 by Mayo

This headliner appeared two years ago, but to a sparse audience (likely because it was during winter break), so Management’s giving him another chance…

You might not have thought there could be new material for 2014, but there is, and if you look a bit more closely, you’ll see that it’s actually not Jay Leno [1] who is standing up there at the mike ….

It’s Sir Harold Jeffreys himself! And his (very famous) joke, I admit, is funny. So, since it’s Saturday night, let’s listen in on Sir Harold’s howler* in criticizing the use of p-values.

“Did you hear the one about significance testers rejecting H₀ because of outcomes H₀ didn’t predict?

‘What’s unusual about that?’ you ask?

What’s unusual, is that they do it when these unpredicted outcomes haven’t even occurred!”

Much laughter.

[The actual quote from Jeffreys: Using p-values implies that “An hypothesis that may be true is rejected because it has failed to predict observable results that have not occurred. This seems a remarkable procedure.” (Jeffreys 1939, 316)]

I say it’s funny, so to see why I’ll strive to give it a generous interpretation.

We can view p-values in terms of rejecting H₀, as in the joke: There’s a test statistic D such that H₀ is rejected if its observed value d₀ reaches or exceeds a cut-off d* where Pr(D ≥ d*; H₀) is small, say .025.
Reject H₀ if Pr(D ≥ d₀; H₀) ≤ .025.
The report might be “reject H₀ at level .025”.
Example: H₀: The mean light deflection effect is 0. So if we observe a 1.96 standard deviation difference (in one-sided Normal testing) we’d reject H₀.
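
To make the rule concrete, here is a minimal sketch in Python. It assumes D is already expressed in standard deviation units and is standard Normal under H₀; the names `upper_tail` and `ALPHA` are mine, chosen only for illustration.

```python
from math import erfc, sqrt

ALPHA = 0.025  # the level used in the example above


def upper_tail(d):
    """Pr(D >= d; H0) when D is standard Normal under H0 (an assumption)."""
    return 0.5 * erfc(d / sqrt(2))


# Observed differences, in standard deviation units.
for d0 in (1.0, 1.96, 2.0, 3.0, 4.0):
    p = upper_tail(d0)
    decision = "reject H0" if p <= ALPHA else "do not reject H0"
    print(f"d0 = {d0:4.2f}   p-value = {p:.5f}   {decision}")
```

Only observed differences at or beyond the cut-off lead to rejection; an observed difference of 1 standard deviation, with a p-value of about .16, does not.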

Now it’s true that if the observation were further into the rejection region, say 2, 3 or 4 standard deviations, it too would result in rejecting the null, and with an even smaller p-value. It’s also true that H₀ “has not predicted” a 2, 3, 4, 5 etc. standard deviation difference in the sense that differences so large are “far from” or improbable under the null. But wait a minute. What if we’ve only observed a 1 standard deviation difference (p-value = .16)? It is unfair to count it against the null that 1.96, 2, 3, 4 etc. standard deviation differences would have diverged seriously from the null, when we’ve only observed the 1 standard deviation difference. Yet the p-value tells you to compute Pr(D > 1; H₀), which includes these more extreme outcomes! This is “a remarkable procedure” indeed! [i]

So much for making out the howler. The only problem is that significance tests do not do this, that is, they do not reject with, say, D = 1 because larger D values might have occurred (but did not). D = 1 does not reach the cut-off, and does not lead to rejecting H₀. Moreover, looking at the tail area makes it harder, not easier, to reject the null (although this isn’t the only function of the tail area), since it requires not merely that Pr(D = d₀; H₀) be small, but that Pr(D ≥ d₀; H₀) be small. And this is well justified because when this probability is not small, you should not regard the outcome as evidence of discrepancy from the null. Before getting to this ….

1. The joke talks about outcomes the null does not predict, which is just what we wouldn’t know without an assumed test statistic; but the tail area consideration arises in Fisherian tests in order to determine what outcomes H₀ “has not predicted”. That is, it arises to identify a sensible test statistic D.

In familiar scientific tests, we know the outcomes that are ‘more extreme’ from a given hypothesis in the direction of interest, e.g., the more patients show side effects after taking drug Z, the less indicative it is that Z is benign, not the other way around. But that’s to assume the equivalent of a test statistic. In Fisher’s set-up, one needs to identify a suitable measure of accordance, fit, or directional departure. Improbability of outcomes (under H₀) should not indicate discrepancy from H₀ if even less probable outcomes would occur under discrepancies from H₀. (Note: To avoid confusion, I always use “discrepancy” to refer to the parameter values used in describing the underlying data generation; values of D are “differences”.)

2. N-P tests and tail areas: Now N-P tests do not consider “tail areas” explicitly, but they fall out of the desiderata of good tests and sensible test statistics. N-P tests were developed to provide the tests that Fisher used with a rationale by making explicit the alternatives of interest—even if just in terms of directions of departure.

In order to determine the appropriate test and compare alternative tests “Neyman and I introduced the notions of the class of admissible hypotheses and the power function of a test. The class of admissible alternatives is formally related to the direction of deviations—changes in mean, changes in variability, departure from linear regression, existence of interactions, or what you will.” (Pearson 1955, 207)

Under N-P test criteria, tests should rarely reject a null erroneously, and as discrepancies from the null increase, the probability of signaling discordance from the null should increase. In addition to ensuring Pr(D < d*; H₀) is high, one wants Pr(D > d*; H’: μ = μ₀ + γ) to increase as γ increases. Any sensible distance measure D must track discrepancies from H₀. If you’re going to reason, “the larger the D value, the worse the fit with H₀,” then observed differences must occur because of the falsity of H₀ (in this connection consider Kadane’s howler).
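
A rough numerical sketch of these two requirements, under assumptions of my own: D is Normal with unit standard deviation, its mean equals the discrepancy γ measured in standard error units, and the cut-off is d* = 1.96. The function name `prob_reject` is hypothetical.

```python
from statistics import NormalDist

Z = NormalDist()   # standard Normal distribution
D_STAR = 1.96      # cut-off d* taken from the earlier example


def prob_reject(gamma):
    """Pr(D > d*; mu = mu0 + gamma), with gamma in standard error units.

    Assumes D ~ Normal(gamma, 1), so gamma = 0 corresponds to H0.
    """
    return 1.0 - Z.cdf(D_STAR - gamma)


gammas = [0.0, 0.5, 1.0, 2.0, 3.0, 4.0]
powers = [prob_reject(g) for g in gammas]

for g, pw in zip(gammas, powers):
    print(f"gamma = {g:3.1f}   Pr(D > d*) = {pw:.3f}")

# Under H0 (gamma = 0) the test rarely rejects, so Pr(D < d*; H0) is high.
print(f"Pr(D < d*; H0) = {1.0 - prob_reject(0.0):.3f}")
```

Under these assumptions the probability of rejecting is about .025 when γ = 0 and climbs toward 1 as γ grows, which is the sense in which D tracks discrepancies from H₀.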

3. But Fisher, strictly speaking, has only the null distribution, along with an implicit interest in tests with sensitivity toward implicit departures. To find out if H₀ has or has not predicted observed results, we need a sensible distance measure. (Recall Senn’s post: “Fisher’s alternative to the alternative”, just reblogged.**)

Suppose I take an observed difference d₀ as grounds to reject H₀ on account of its being improbable under H₀, when in fact larger differences (larger D values) are more probable under H₀. Then, as Fisher rightly notes, the improbability of the observed difference was a poor indication of underlying discrepancy. This fallacy would be revealed by looking at the tail area; whereas it is readily committed with accounts that only look at the improbability of the observed outcome d₀ under H₀.

4. Even if you have a sensible distance measure D (tracking the discrepancy relevant for the inference), and observe D = d, the improbability of d under H₀ should not be indicative of a genuine discrepancy, if it’s rather easy to bring about differences even greater than observed, under H₀. Equivalently, we want a high probability of inferring H₀ when H₀ is true. In my terms, considering Pr(D < d*;H₀) is what’s needed to block rejecting the null and inferring alternative H’ when you haven’t rejected it with severity (where H’ and H₀ exhaust the parameter space). In order to say that we have “sincerely tried”, to use Popper’s expression, to reject H’ when it is false and H₀ is correct, we need Pr(D < d*; H₀) to be high.
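
A hypothetical illustration of point 4, not taken from any data: toss a fair coin 2000 times and observe exactly 1000 heads. That particular outcome is improbable under the fair coin null, yet outcomes at least as large are easy to bring about under H₀. The numbers `N` and `OBSERVED` are assumptions chosen only to make the contrast visible.

```python
from math import comb

N = 2000         # hypothetical number of tosses
OBSERVED = 1000  # hypothetical observed number of heads


def pmf(k, n=N):
    """Pr(X = k; H0: fair coin), computed from exact integer counts."""
    return comb(n, k) / 2**n


# Improbability of the observed outcome alone.
point_prob = pmf(OBSERVED)

# Tail area: Pr(X >= observed; H0).
tail_prob = sum(comb(N, k) for k in range(OBSERVED, N + 1)) / 2**N

print(f"Pr(X = {OBSERVED}; H0)  = {point_prob:.4f}")
print(f"Pr(X >= {OBSERVED}; H0) = {tail_prob:.4f}")
```

The point probability falls below .025 while the tail area is a little over one half, so a rule that looked only at the improbability of the observed outcome would reject a fair coin on the most fair-looking result possible; the tail area blocks that rejection.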

5. Concluding remarks:

The rationale for the tail area, as I see it, is twofold: to get the right direction of departure, but also to ensure Pr(test T does not reject H₀; H₀ ) is high.

If we don’t already have an appropriate distance measure D, then we don’t know which outcomes we should regard as those H₀ does or does not predict–so Jeffreys’ quip can’t even be made out. That’s why Fisher looks at the tail area associated with any candidate for a test statistic. Neyman and Pearson make alternatives explicit in order to arrive at relevant test statistics.
If we have an appropriate D, then Jeffreys’ criticism is equally puzzling because considering the tail area does not make it easier to reject H₀ but harder. Harder because it’s not enough that the outcome be improbable under the null, outcomes even greater must be improbable under the null. And it makes it a lot harder (leading to blocking a rejection) just when it should: because the data could readily be produced by H₀ [ii].

Either way, Jeffreys’ criticism, funny as it seems, collapses.

When an observation leads to rejecting the null in a significance test, it is because of that outcome—not because of any unobserved outcomes. Considering other possible outcomes that could have arisen is essential for determining (and controlling) the capabilities of the given testing method. In fact, understanding the properties of our testing tool T just is to understand what T would do under different outcomes, under different conjectures about what’s producing the data.

[1] I miss Leno. The new guys aren’t very funny and I rarely watch.

[i] Jeffreys’ next sentence, remarkably is: “On the face of it, the evidence might more reasonably be taken as evidence for the hypothesis, not against it.” This further supports my reading, as if we’d reject a fair coin null because it would not predict 100% heads, even though we only observed 51% heads. But the allegation has no relation to significance tests of the Fisherian or N-P varieties.

[ii] One may argue it should be even harder, but this is a distinct issue.

[iii] As usual, I’ll indicate a significantly changed draft with [iii] in the title. This [event] is not improbable, as it’s new material!

*I initially called this “Sir Harold’s ‘howler’”. That phrase fell out naturally from the alliteration, but it’s strictly incorrect (as I wish to use the term “howler”). I don’t think so famous a “one-liner” (one that raises a legitimate question to be answered) should be lumped in with the group of howlers that are repeated over and over again, despite clarifications/explanations/corrections having been given many times. (So there’s a time factor involved.) I also wouldn’t place logical puzzles, e.g., the Birnbaum argument, in this category. By contrast, alleging that rejecting a null is intended, by N-P theory, to give stronger evidence against the null as the power increases, is a howler. Several other howlers may be found on this blog. I realized the need for a qualification in reading a comment on this blog by Andrew Gelman (1/14).

**Perhaps Senn disagrees with my take?

Jeffreys, H. (1939 edition), Theory of Probability. Oxford.

Pearson, E.S. (1955), “Statistical Concepts in Their Relation to Reality.”

Categories: Comedy, Fisher, Jeffreys, P-values | 9 Comments

9 thoughts on “Sir Harold Jeffreys’ (tail area) one-liner: Sat night comedy [draft ii]”

January 4, 2016

lauriedavies2014

Deborah, suppose you are asked to predict the value of a N(0,1) random variable, how would you do it? It is clear that it is pointless to predict a single value as the prediction will be wrong with probability one. The opposite case is for the prediction to be correct with probability one which can be done by predicting any real number. The interesting case is for the prediction to be correct with a specified probability, say alpha. In this case you would specify a region Gamma(alpha) such that P(X in Gamma(alpha))=alpha. The choice of region would depend on the circumstances but for alpha=0.95 the choice Gamma(0.95)=(-1.96,1.96) would be standard. The prediction is correct if the observed value x lies in Gamma(alpha). What did Jeffreys mean by “failing to predict observable results”? This could be forgetfulness or laziness but I take it here to mean expressly predicting that a value in the complement of Gamma(alpha) will not occur. Putting this together you make a prediction which, based on a hypothesis, will be correct with probability alpha, namely that the observed value will lie in Gamma(alpha). This prediction is equivalent to predicting that an observed value in the complement of Gamma(alpha) will not occur. If the P-value of the observation is smaller than 1-alpha this is equivalent to the observed value lying in the complement of Gamma(alpha). Thus in contrast to Jeffreys the hypothesis is rejected because a value predicted not to occur did in fact occur. This seems unremarkable.

January 4, 2016

Mayo

Laurie: I agree with you, thanks. I find it incredible that to this day almost nobody answers Jeffreys back and it’s considered a knock-down howler. I myself only took it up in this blog a couple of years ago, not in published work until my new book.

January 8, 2016

Mayo

Laurie: This is just what I’m saying in different words, and I’d never understood why people were tricked by Jeffreys’ clever-sounding joke. I don’t offhand know anyone else who has made this point (though now I know you have).

January 4, 2016

lauriedavies2014

Deborah, it is in Chapter 11.2.3 of my book. Before writing it I asked the odd Bayesian what Jeffreys meant by predict. None seemed to know.

January 7, 2016

Mayo

Excellent point.

January 8, 2016

Richard D. Morey

The discrete cases typically used in defense of the likelihood principle make the point more obvious. Two hypotheses may have the same probability for the data observed (say, y); suppose that greater y means greater discrepancy. Further suppose P(Y=y)=.049 for two different hypothetical null hypotheses, H0a and H0b. H0a has P(Y>y)=0 and H0b has P(Y>y)=.02. If H0a were the null, it would be rejected at alpha=0.05 because it did not predict data (that is, it predicted that Y>y would never occur) that were not observed (since Y=y, obviously Y is not > y). If H0b were the null hypothesis, it would not be rejected, because it predicted data that were not observed. This is not to defend Jeffreys’ statement, just to point out that it is perfectly interpretable.

January 8, 2016

Mayo

Richard: If you read my post you know I’ve done more than make it perfectly interpretable, I’ve shown it’s funny. In addition, I show it’s wrong, as a criticism of tail areas.

January 8, 2016

lauriedavies2014

Richard D. Morey. I am not sure I understand your example. A statistical test of size alpha is defined by a test statistic T and a region E(alpha) such that under the hypothesis H_a P_a(T(X) in E(alpha))=1-alpha. In your example Y=T(X) is discrete and for simplicity suppose Y is integer valued with, as is common, E(alpha)=(-infty,k(alpha)] for some integer k(alpha). If we put k(alpha)=y then under Ha P(Y<=y)=1 and this is a test of size zero with alpha=0. If we use y-1 as the upper bound then under Ha P(Y<=y-1)=0.951 and we have a test of size 0.049. Under Hb we have in the first case P(Y<=y)=0.98 and a test of size 0.02. In the second case we have a test of size 0.069. What does this mean for prediction? If we use the y-1 in both cases then we predict no value greater or equal to y will occur. Under Ha the prediction is correct with probability 0.951, under Hb it is correct with probability 0.931. If you require a prediction error of at most 0.05 then under Hb you will have to put k=y but then your prediction error probability is 0.02 < 0.049. It would help if you completely specified models corresponding to Ha and Hb and then, given an attainable alpha, exactly what your predictions are for the two cases? The predictions have to be correct with probability 1-alpha.
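
The arithmetic in this exchange can be checked with a short Python sketch. It uses only the probabilities quoted in the two comments (Pr(Y=y)=.049 under both nulls, Pr(Y>y)=0 under H0a and .02 under H0b); grouping the outcomes into “below y”, “at y” and “above y” and the dictionary keys are my own bookkeeping, not part of either comment.

```python
from fractions import Fraction as F

ALPHA = F("0.05")

# Pr(Y < y), Pr(Y = y), Pr(Y > y) under each null, as quoted in the comments.
NULLS = {
    "H0a": (F("0.951"), F("0.049"), F("0")),
    "H0b": (F("0.931"), F("0.049"), F("0.02")),
}

results = {}
for name, (below, at, above) in NULLS.items():
    assert below + at + above == 1  # each null is a proper distribution
    results[name] = {
        # Tail area at the observed value: Pr(Y >= y).
        "p_value": at + above,
        # Region (-infty, y]: the prediction fails only when Y > y.
        "size_if_k_is_y": above,
        # Region (-infty, y-1]: the prediction fails when Y >= y.
        "size_if_k_is_y_minus_1": at + above,
    }
    results[name]["rejected"] = results[name]["p_value"] <= ALPHA

for name, r in results.items():
    shown = {k: (float(v) if isinstance(v, F) else v) for k, v in r.items()}
    print(name, shown)
```

With these numbers the tail area is .049 under H0a and .069 under H0b, so observing Y = y rejects the first at the .05 level but not the second, and the sizes .049, .069, 0 and .02 match those given in the comment.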

January 10, 2016

braynor2015

1. If I plot the p-value for the theoretical sampling distribution against a Bayes Factor, it’s generally one-to-one, so if it’s a howler for one, it is for the other.

2. If I look at the exact sampling distribution for real data (randomization or bootstrap) it is not generally monotone in the tails, so any point mass or ratio (e.g. a Bayes Factor) will jump around, which is pretty funny if you think about it. The exact p-value doesn’t do that (it does have discontinuities, though).

© Deborah G. Mayo, Error Statistics Philosophy, 2011-2018 All Rights Reserved.
