Return to the comedy hour…(on significance tests)

These days, so many theater productions are updated reviews of older standards. Same with the comedy hours at the Bayesian retreat, and task force meetings of significance test reformers. So (on the 1-year anniversary of this blog) let’s listen in to one of the earliest routines (with highest blog hits), but with some new reflections (first considered here and here).

“Did you hear the one about the frequentist . . .

“who claimed that observing ‘heads’ on a biased coin that lands heads with probability .05 is evidence of a statistically significant improvement over the standard treatment of diabetes, on the grounds that such an event occurs with low probability (.05)?”

The joke came from J. Kadane’s Principles of Uncertainty (2011, CRC Press*).

 “Flip a biased coin that comes up heads with probability 0.95, and tails with probability 0.05.  If the coin comes up tails reject the null hypothesis.  Since the probability of rejecting the null hypothesis if it is true is 0.05, this is a valid 5% level test.  It is also very robust against data errors; indeed it does not depend on the data at all.  It is also nonsense, of course, but nonsense allowed by the rules of significance testing.” (439)

Much laughter.


But is it allowed? I say no. The null hypothesis in the joke can be in any field; perhaps it concerns mean transmission of Scrapie in mice (as in my early Kuru post). I know some people view significance tests as merely rules that rarely reject erroneously, but I claim this is mistaken. Both in significance tests and in scientific hypothesis testing more generally, data indicate inconsistency with H only by being counter to what would be expected under the assumption that H is correct (as regards a given aspect observed). Were someone to tell Prusiner that the testing methods he follows actually allow any old “improbable” event (a stock split in Apple?) to reject a hypothesis about prion transmission rates, Prusiner would say that person didn’t understand the requirements of hypothesis testing in science. Since the criticism would hold no water in the analogous case of Prusiner’s test, it must equally miss its mark in the case of significance tests**. That, recall, was Rule #1.

Now the reader might just say that Kadane is simply making a little joke, but then why include it within a chapter purporting to give serious criticisms of significance testing (and other frequentist methods)?  Don’t the familiar fallacies of significance testing already make it enough of a whipping boy?   Following the philosopher’s rule of “generous interpretation”, I will assume the criticisms are to be taken seriously and to heart.

Why do I go back to such a silly example one year later? Well, because readers seem to say it’s a bad test but still a valid significance test, whereas I’m trying to argue that it’s missing an “adequate fit measure”: the second of the three “steps in the original construction of tests”. Perhaps people see a significance probability as a type of conditional probability, where the improbable “event” need not have been rendered improbable by the hypotheses under test.

I am saying that a legitimate statistical test hypothesis must tell us (i.e., let us compute) how improbably far different experimental outcomes are from what would be expected under H. If H has nothing to do with the observed x, H cannot entail probabilities about x. It is correct to regard experimental results as anomalous for a hypothesis H only if, and only because, they run counter to what H tells us would occur in a universe where things are approximately as H asserts.
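
To make the contrast concrete, here is a minimal sketch (mine, not Kadane’s). It assumes a one-sided z-test of H: mu = 0 against mu > 0 with known sigma; the sample size of 25, the shifts, and the function names are illustrative choices only. It computes the probability of rejecting H under several true values of mu, for the coin “test” and for the z-test.

```python
from statistics import NormalDist

ALPHA = 0.05
STD_NORMAL = NormalDist()  # standard normal, mean 0 and sd 1


def coin_rejection_prob(true_shift: float) -> float:
    """Coin 'test': reject H iff the biased coin lands tails (prob 0.05).

    The argument is deliberately ignored: neither the data nor the
    true state of affairs enters the rejection rule.
    """
    return ALPHA


def z_test_rejection_prob(true_shift: float, n: int = 25, sigma: float = 1.0) -> float:
    """One-sided z-test: reject H when sqrt(n) * xbar / sigma > z_alpha.

    Returns the probability of rejection when the true mean is true_shift.
    """
    z_alpha = STD_NORMAL.inv_cdf(1 - ALPHA)
    return 1 - STD_NORMAL.cdf(z_alpha - true_shift * n ** 0.5 / sigma)


if __name__ == "__main__":
    for shift in (0.0, 0.2, 0.5, 1.0):
        print(
            f"true mu = {shift:.1f}   "
            f"coin: {coin_rejection_prob(shift):.3f}   "
            f"z-test: {z_test_rejection_prob(shift):.3f}"
        )
```

Under these assumptions both procedures reject a true H with probability .05, but only the z-test’s rejection probability grows as the true mean moves away from what H asserts. The coin’s rejection probability is .05 whatever is the case, which is one way of seeing that its “improbable event” is not a distance from H at all.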

We typically use “;” for P(x;H), but maybe a distinct symbol is needed to indicate the connection I am after. David Freedman used “||” for a somewhat similar reason, I think.

*For “non-commercial” purposes, at one time it could be downloaded from

** Statistical tests make this explicit by setting out a “test statistic” that is to be a relevant distance measure.

8 thoughts on “Return to the comedy hour…(on significance tests)”

  1. The statement, “Flip a biased coin that comes up heads with probability 0.95, and tails with probability 0.05,” makes no sense. There is no such coin.

    • Andrew: This was Kadane’s example of course; are you saying such an outcome can’t be rigged, if not with a coin then balls from urns or whatnot?

  2. Yes, you can do it with balls from urns. Just not from coin flipping. I like examples that have some basis in reality.

  3. Miodrag Lovric

    I am reading Deborah’s book and she starts Section 4.3 with the following quote from the same book by Kadane, page 438:

    “[W]ith a large sample size virtually every null hypothesis is rejected,
    while with a small sample size, virtually no null hypothesis is rejected.
    And we generally have very accurate estimates of the sample size
    available without having to use significance testing at all!”

    Interestingly, the same huge misunderstanding of significance testing can be found in many other reputable sources, including (1) Luis Pericchi in “Integrated Objective Bayesian Estimation and Hypothesis Testing”, comment, p. 27:

    “I will finish this subsection with two illuminating quotations, both about testing
    without posterior probabilities:

    “Do you want to reject a hypothesis? Just take enough data!” (Wonnacott
    and Wonnacott in several of their writings).

    “In real life, null hypothesis will always be rejected if enough data are taken
    because there will inevitably be uncontrolled sources of bias”. (Berger and
    Delampady, 1987).”

    Unfortunately, ALL these statisticians are dead wrong: Ronald Wonnacott and Thomas Wonnacott, Joseph Kadane, Luis Pericchi, James Berger, Mohan Delampady, and many many others. They have shown a basic misunderstanding of hypothesis testing.

    Interestingly, again, nobody has discovered the error.

    What is wrong? (It is important to note that Deborah didn’t make any comment on Kadane’s quote. I am sure that she has noticed that but didn’t have time to deal with it in the book.)

    • Miodrag: I was emphasizing this because so many people wrongly suppose that rejecting with higher power is more impressive. But I DO remark in SIST that even though it’s possible to make the test super-sensitive, it does not follow that all nulls will be rejected. They will if they’re false, and the test is sensitive enough.
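
A minimal sketch of this point, under idealized assumptions that are mine and not from SIST: a one-sided z-test of mu = 0 with known sigma, a model that holds exactly, and no uncontrolled bias of the kind Berger and Delampady mention. The function name and the discrepancy of 0.01 are illustrative only. It computes the rejection probability as the sample size grows, once for a true null and once for a slightly false one.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal, mean 0 and sd 1


def rejection_prob(delta: float, n: int, sigma: float = 1.0, alpha: float = 0.05) -> float:
    """Probability that a one-sided level-alpha z-test rejects mu = 0
    when the true mean is delta and the sample size is n."""
    z_alpha = STD_NORMAL.inv_cdf(1 - alpha)
    return 1 - STD_NORMAL.cdf(z_alpha - delta * n ** 0.5 / sigma)


if __name__ == "__main__":
    for n in (10, 1_000, 100_000, 10_000_000):
        print(
            f"n = {n:>10,}   "
            f"true null: {rejection_prob(0.0, n):.3f}   "
            f"mu = 0.01: {rejection_prob(0.01, n):.3f}"
        )
```

In this idealized setting the rejection probability for a true null stays at alpha however large n becomes, while for the slightly false null it climbs toward 1. Large samples alone do not reject every null; they reject false ones.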