You might not have thought there could be new material for 2014, but there is, and if you look a bit more closely, you’ll see that it’s actually not Jay Leno who is standing up there at the mike ….
It’s Sir Harold Jeffreys himself! And his (very famous) joke, I admit, is funny. So, since it’s Saturday night, let’s listen in on Sir Harold’s howler* in criticizing the use of p-values.
“Did you hear the one about significance testers rejecting H0 because of outcomes H0 didn’t predict?
‘What’s unusual about that?’ you ask?
Well, what’s unusual, is that they do it when these unpredicted outcomes haven’t even occurred!”
Much laughter.
[The actual quote from Jeffreys: Using p-values implies that “An hypothesis that may be true is rejected because it has failed to predict observable results that have not occurred. This seems a remarkable procedure.” (Jeffreys 1939, 316)]
I say it’s funny, so to see why, I’ll strive to give it a generous interpretation.
We can view p-values in terms of rejecting H0, as in the joke: There’s a test statistic D such that H0 is rejected if its observed value d0 reaches or exceeds a cut-off d* where Pr(D > d*; H0) is small, say .025.
Reject H0 if Pr(D > d0; H0) ≤ .025.
The report might be “reject H0 at level .025”.
Example: H0: The mean light deflection effect is 0. So if we observe a 1.96 standard deviation difference (in one-sided Normal testing), we’d reject H0.
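To make the procedure concrete, here is a minimal sketch in Python of the one-sided Normal test just described. The use of scipy and the function names are my own illustrative choices, not anything from the post; D is taken to be the standardized difference from the hypothesized mean of 0.

```python
# Minimal sketch (illustrative, not from the post) of the one-sided Normal test.
from scipy.stats import norm

ALPHA = 0.025  # the cut-off level used in the example above

def p_value(d_obs):
    """One-sided p-value Pr(D > d_obs; H0) under a standard Normal null."""
    return norm.sf(d_obs)  # survival function = 1 - CDF

d_obs = 1.96
p = p_value(d_obs)
print(f"D = {d_obs}: p-value = {p:.3f}")            # ~ .025
print("reject H0" if p <= ALPHA else "do not reject H0")
```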
Now it’s true that if the observation were further into the rejection region, say 2, 3 or 4 standard deviations, it too would result in rejecting the null, and with an even smaller p-value. It’s also true that H0 “has not predicted” a 2, 3, 4, 5 etc. standard deviation difference in the sense that differences so large are “far from” or improbable under the null. But wait a minute. What if we’ve only observed a 1 standard deviation difference (p-value = .16)? It is unfair to count it against the null that 1.96, 2, 3, 4 etc. standard deviation differences would have diverged seriously from the null, when we’ve only observed the 1 standard deviation difference. Yet the p-value tells you to compute Pr(D > 1; H0), which includes these more extreme outcomes! This is “a remarkable procedure” indeed! [i]
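For reference, the same sketch can be extended (again, my own illustration) to print the tail areas the paragraph mentions: at a 1 standard deviation difference the tail area is about .16, nowhere near the .025 cut-off, while 1.96 and larger differences fall at or below it.

```python
# Tail areas Pr(D > d; H0) under the standard Normal null, for the
# differences mentioned above (illustrative, not from the post).
from scipy.stats import norm

for d in (1.0, 1.96, 2.0, 3.0, 4.0):
    print(f"D = {d:4}: Pr(D > {d}; H0) = {norm.sf(d):.4f}")
# D =  1.0: 0.1587  -> not small; no rejection
# D = 1.96: 0.0250  -> at the cut-off
# D =  2.0: 0.0228
# D =  3.0: 0.0013
# D =  4.0: 0.0000  (about 3e-5)
```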
So much for making out the howler. The only problem is that significance tests do not do this: they do not reject with, say, D = 1 because larger D values might have occurred (but did not). D = 1 does not reach the cut-off, and does not lead to rejecting H0. Moreover, looking at the tail area makes it harder, not easier, to reject the null (although this isn’t the only function of the tail area), since it requires not merely that Pr(D = d0; H0) be small, but that Pr(D > d0; H0) be small (a small numerical illustration appears below). And this is well justified: when this probability is not small, you should not regard the result as evidence of discrepancy from the null. Before getting to this ….
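To see why requiring the tail area to be small is the more demanding condition, consider a hypothetical discrete example (my own, not Jeffreys’ or the post’s): under a Binomial(100, 0.5) null, the point probability of exactly 60 successes is about .011, which looks small on its own, yet the tail probability of 60 or more is about .028, above the .025 cut-off.

```python
# Hypothetical discrete illustration (not from the post): the tail-area
# requirement is stricter than a point-probability requirement.
from scipy.stats import binom

n, p0, k = 100, 0.5, 60
point = binom.pmf(k, n, p0)       # Pr(X = 60; H0) ~ 0.011
tail = binom.sf(k - 1, n, p0)     # Pr(X >= 60; H0) ~ 0.028
print(f"Pr(X = {k}; H0)  = {point:.3f}")   # small on its own
print(f"Pr(X >= {k}; H0) = {tail:.3f}")    # not small enough to reject at .025
```

So counting the more extreme outcomes in the p-value works against rejection, not for it, which is the point of the reply to Jeffreys.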