Return to the comedy hour…(on significance tests)

These days, so many theater productions are updated reviews of older standards. Same with the comedy hours at the Bayesian retreat, and task force meetings of significance test reformers. So (on the 1-year anniversary of this blog) let’s listen in on one of the earliest routines (with highest blog hits), but with some new reflections (first considered here and here).

“Did you hear the one about the frequentist . . .

“who claimed that observing ‘heads’ on a biased coin that lands heads with probability .05 is evidence of a statistically significant improvement over the standard treatment of diabetes, on the grounds that such an event occurs with low probability (.05)?”

The joke came from J. Kadane’s Principles of Uncertainty (2011, CRC Press*).

 “Flip a biased coin that comes up heads with probability 0.95, and tails with probability 0.05.  If the coin comes up tails reject the null hypothesis.  Since the probability of rejecting the null hypothesis if it is true is 0.05, this is a valid 5% level test.  It is also very robust against data errors; indeed it does not depend on the data at all.  It is also nonsense, of course, but nonsense allowed by the rules of significance testing.” (439)

Much laughter.

___________________

But is it allowed? I say no. The null hypothesis in the joke can be in any field; perhaps it concerns mean transmission of Scrapie in mice (as in my early Kuru post). I know some people view significance tests as merely rules that rarely reject erroneously, but I claim this is mistaken. Both in significance tests and in scientific hypothesis testing more generally, data indicate inconsistency with H only by being counter to what would be expected under the assumption that H is correct (as regards a given aspect observed). Were someone to tell Prusiner that the testing methods he follows actually allow any old “improbable” event (a stock split in Apple?) to reject a hypothesis about prion transmission rates, Prusiner would say that person didn’t understand the requirements of hypothesis testing in science. Since the criticism would hold no water in the analogous case of Prusiner’s test, it must equally miss its mark in the case of significance tests**. That, recall, was Rule #1.

Now the reader might just say that Kadane is simply making a little joke, but then why include it within a chapter purporting to give serious criticisms of significance testing (and other frequentist methods)?  Don’t the familiar fallacies of significance testing already make it enough of a whipping boy?   Following the philosopher’s rule of “generous interpretation”, I will assume the criticisms are to be taken seriously and to heart.

Why do I go back to such a silly example one year later?  Well, because readers seem to say it’s a bad test but still a valid significance test, whereas I’m trying to argue that it’s missing an “adequate fit measure”.  It is missing the second of the three “steps in the original construction of tests”. Perhaps people see a significance probability as a type of conditional probability, where the improbable “event” need not have been rendered improbable by the hypotheses under test.

I am saying that a legitimate statistical test hypothesis must tell us (i.e., let us compute) how improbably far different experimental outcomes are from what would be expected under H. If H has nothing to do with the observed x, H cannot entail probabilities about x.  It is correct to regard experimental results as anomalous for a hypothesis H only if, and only because, they run counter to what H tells us would occur in a universe where things are approximately as H asserts.
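To make the contrast concrete, here is a minimal simulation sketch. It is an illustration under assumptions of my own choosing, not anything taken from Kadane: H is taken to be "the mean of a normal population is 0" with known standard deviation 1, the sample size is 25, and the contrasting legitimate test is a one-sided z-test. The function names (`coin_test_rejects`, `z_test_rejects`, `rejection_rate`) are hypothetical. The point is only that both rules reject a true H about 5% of the time, but the coin rule's rejections have nothing to do with how far the data are from what H would lead us to expect.

```python
import math
import random

Z_CRIT = 1.6448536269514722  # upper 5% point of the standard normal


def coin_test_rejects(data, rng):
    """The coin 'test' from the joke: ignore the data, reject with probability 0.05."""
    return rng.random() < 0.05


def z_test_rejects(data, rng, mu0=0.0, sigma=1.0):
    """One-sided z-test of H: mu = mu0 at the 5% level (rng is unused).

    The test statistic z measures how far the sample mean is from mu0,
    in units of its standard error under H.
    """
    n = len(data)
    z = (sum(data) / n - mu0) / (sigma / math.sqrt(n))
    return z > Z_CRIT


def rejection_rate(test, true_mu, n=25, trials=5000, seed=0):
    """Fraction of simulated samples (normal, sd 1, mean true_mu) on which `test` rejects."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(trials):
        data = [rng.gauss(true_mu, 1.0) for _ in range(n)]
        rejections += test(data, rng)
    return rejections / trials


if __name__ == "__main__":
    for true_mu in (0.0, 0.5):
        print(
            f"true mean {true_mu}: "
            f"coin rule rejects {rejection_rate(coin_test_rejects, true_mu):.3f}, "
            f"z-test rejects {rejection_rate(z_test_rejects, true_mu):.3f}"
        )
```

Under these assumptions, the coin rule's rejection rate stays near .05 whether H is true or false, while the z-test's rejection rate rises as the true mean moves away from what H asserts. Only the second rule uses a measure of fit between the data and H.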

We typically use “;” for P(x;H), but maybe a distinct symbol is needed to indicate the connection I am after. David Freedman used “||” for a somewhat similar reason, I think.

*For “non-commercial” purposes, at one time it could be downloaded from http://uncertainty.stat.cmu.edu/.

** Statistical tests make this explicit by setting out a “test statistic” that is to be a relevant distance measure.

6 thoughts on “Return to the comedy hour…(on significance tests)”

  1. The statement, “Flip a biased coin that comes up heads with probability 0.95, and tails with probability 0.05,” makes no sense. There is no such coin.

    • Andrew: This was Kadane’s example of course; are you saying such an outcome can’t be rigged, if not with a coin then balls from urns or whatnot?

  2. Yes, you can do it with balls from urns. Just not from coin flipping. I like examples that have some basis in reality.
