These days, so many theater productions are updated reviews of older standards. Same with the comedy hours at the Bayesian retreat, and task force meetings of significance test reformers. So (on the 1-year anniversary of this blog) let’s listen in on one of the earliest routines (with highest blog hits), but with some new reflections (first considered here and here).
“Did you hear the one about the frequentist . . .
“who claimed that observing ‘heads’ on a biased coin that lands heads with probability .05 is evidence of a statistically significant improvement over the standard treatment of diabetes, on the grounds that such an event occurs with low probability (.05)?”
The joke came from J. Kadane’s Principles of Uncertainty (2011, CRC Press*).
“Flip a biased coin that comes up heads with probability 0.95, and tails with probability 0.05. If the coin comes up tails reject the null hypothesis. Since the probability of rejecting the null hypothesis if it is true is 0.05, this is a valid 5% level test. It is also very robust against data errors; indeed it does not depend on the data at all. It is also nonsense, of course, but nonsense allowed by the rules of significance testing.” (439)
But is it allowed? I say no. The null hypothesis in the joke can be in any field; perhaps it concerns mean transmission of Scrapie in mice (as in my early Kuru post). I know some people view significance tests as merely rules that rarely reject erroneously, but I claim this is mistaken. Both in significance tests and in scientific hypothesis testing more generally, data indicate inconsistency with H only by being counter to what would be expected under the assumption that H is correct (as regards a given aspect observed). Were someone to tell Prusiner that the testing methods he follows actually allow any old “improbable” event (a stock split in Apple?) to reject a hypothesis about prion transmission rates, Prusiner would say that person didn’t understand the requirements of hypothesis testing in science. Since the criticism would hold no water in the analogous case of Prusiner’s test, it must equally miss its mark in the case of significance tests**. That, recall, was Rule #1.
Now the reader might just say that Kadane is simply making a little joke, but then why include it within a chapter purporting to give serious criticisms of significance testing (and other frequentist methods)? Don’t the familiar fallacies of significance testing already make it enough of a whipping boy? Following the philosopher’s rule of “generous interpretation”, I will assume the criticisms are to be taken seriously and to heart.
Why do I go back to such a silly example one year later? Well, because readers seem to say it’s a bad test but still a valid significance test, whereas I’m trying to argue that it’s missing an “adequate fit measure”. It is missing the second of the three “steps in the original construction of tests”. Perhaps people see a significance probability as a type of conditional probability, where the improbable “event” need not have been rendered improbable by the hypotheses under test.
I am saying that a legitimate statistical test hypothesis must tell us (i.e., let us compute) how improbably far different experimental outcomes are from what would be expected under H. If H has nothing to do with the observed x, H cannot entail probabilities about x. It is correct to regard experimental results as anomalous for a hypothesis H only if, and only because, they run counter to what H tells us would occur in a universe where things are approximately as H asserts.
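To make the contrast concrete, here is a small simulation sketch. It is only an illustration under assumptions of my own choosing, not anything taken from Kadane's text: the data are normal with known standard deviation 1, H says the mean is 0, and the "legitimate" comparison is a one-sided z-test whose test statistic measures how far the sample mean lies from what H predicts. The names `coin_test`, `z_test`, and `rejection_rate` are hypothetical helpers invented for this example.

```python
import random
import statistics
from statistics import NormalDist


def coin_test(data, rng, alpha=0.05):
    """Kadane-style 'test': reject with probability alpha, never looking at data."""
    return rng.random() < alpha


def z_test(data, rng, mu0=0.0, sigma=1.0, alpha=0.05):
    """One-sided z-test (assumed known sigma): reject when the sample mean
    is improbably far above mu0, computed under H. rng is unused; it is
    accepted only so both tests share one call signature."""
    n = len(data)
    z = (statistics.fmean(data) - mu0) / (sigma / n ** 0.5)
    cutoff = NormalDist().inv_cdf(1 - alpha)
    return z > cutoff


def rejection_rate(test, true_mu, n=25, trials=10000, seed=0):
    """Fraction of simulated samples (normal, sd 1, mean true_mu) that test rejects."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(trials):
        data = [rng.gauss(true_mu, 1.0) for _ in range(n)]
        rejections += test(data, rng)
    return rejections / trials


if __name__ == "__main__":
    for name, test in [("coin flip", coin_test), ("z-test", z_test)]:
        under_h = rejection_rate(test, true_mu=0.0)
        under_alt = rejection_rate(test, true_mu=1.0)
        print(f"{name}: rejects {under_h:.3f} when H is true, "
              f"{under_alt:.3f} when the true mean is 1")
```

Both procedures reject a true H about 5% of the time, so both satisfy the bare error-rate requirement. Only the z-test, though, rejects more often as the data move away from what H predicts; the coin flip rejects at the same rate whatever the truth is, because its "improbable event" was never rendered improbable by H.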
We typically use “;”, as in P(x;H), but maybe a distinct symbol is needed to indicate the connection I am after. David Freedman used “||” for a somewhat similar reason, I think.
*For “non-commercial” purposes, at one time it could be downloaded from http://uncertainty.stat.cmu.edu/.
** Statistical tests make this explicit by setting out a “test statistic” that is to be a relevant distance measure.