These days, so many theater productions are updated reviews of older standards. Same with the comedy hours at the Bayesian retreat, and task force meetings of significance test reformers. So (on the 1-year anniversary of this blog) let’s listen in to one of the earliest routines (with highest blog hits), but with some new reflections (first considered here and here).
“Did you hear the one about the frequentist . . .
“who claimed that observing “heads” on a biased coin that lands heads with probability .05 is evidence of a statistically significant improvement over the standard treatment of diabetes, on the grounds that such an event occurs with low probability (.05)?”
The joke came from J. Kadane’s Principles of Uncertainty (2011, CRC Press*).
“Flip a biased coin that comes up heads with probability 0.95, and tails with probability 0.05. If the coin comes up tails reject the null hypothesis. Since the probability of rejecting the null hypothesis if it is true is 0.05, this is a valid 5% level test. It is also very robust against data errors; indeed it does not depend on the data at all. It is also nonsense, of course, but nonsense allowed by the rules of significance testing.” (439)
Much laughter.
___________________
But is it allowed? I say no. The null hypothesis in the joke can be in any field; perhaps it concerns mean transmission of scrapie in mice (as in my early Kuru post). I know some people view significance tests as merely rules that rarely reject erroneously, but I claim this is mistaken. Both in significance tests and in scientific hypothesis testing more generally, data indicate inconsistency with H only by being counter to what would be expected under the assumption that H is correct (as regards a given aspect observed). Were someone to tell Prusiner that the testing methods he follows actually allow any old “improbable” event (a stock split in Apple?) to reject a hypothesis about prion transmission rates, Prusiner would say that person didn’t understand the requirements of hypothesis testing in science. Since the criticism would hold no water in the analogous case of Prusiner’s test, it must equally miss its mark in the case of significance tests**. That, recall, was Rule #1.
Now the reader might just say that Kadane is simply making a little joke, but then why include it within a chapter purporting to give serious criticisms of significance testing (and other frequentist methods)? Don’t the familiar fallacies of significance testing already make it enough of a whipping boy? Following the philosopher’s rule of “generous interpretation”, I will assume the criticisms are to be taken seriously and to heart.
Why do I go back to such a silly example one year later? Well, because readers seem to say it’s a bad test but still a valid significance test, whereas I’m trying to argue that it’s missing an “adequate fit measure”. It is missing the second of the three “steps in the original construction of tests”. Perhaps people see a significance probability as a type of conditional probability, where the improbable “event” need not have been rendered improbable by the hypotheses under test.
I am saying that a legitimate statistical test hypothesis must tell us (i.e., let us compute) how improbably far different experimental outcomes are from what would be expected under H. If H has nothing to do with the observed x, H cannot entail probabilities about x. It is correct to regard experimental results as anomalous for a hypothesis H only if, and only because, they run counter to what H tells us would occur in a universe where things are approximately as H asserts.
We typically use “;” for P(x;H), but maybe a distinct symbol is needed to indicate the connection I am after. David Freedman used “||” for a somewhat similar reason, I think.
*For “non-commercial” purposes, at one time it could be downloaded from http://uncertainty.stat.cmu.edu/.
** Statistical tests make this explicit by setting out a “test statistic” that is to be a relevant distance measure.
The statement, “Flip a biased coin that comes up heads with probability 0.95, and tails with probability 0.05,” makes no sense. There is no such coin.
Andrew: This was Kadane’s example of course; are you saying such an outcome can’t be rigged, if not with a coin then balls from urns or whatnot?
Yes, you can do it with balls from urns. Just not from coin flipping. I like examples that have some basis in reality.
Andrew: well I believe you about the difficulty of biasing coins, but I’m no happier with the criticism using balls in urns.
Perhaps a more extreme version of this coin? http://www.methodsappraisal.com/education-the-wrong-turn/
RJW:
Your comment got caught in a spam filter. Anyway, while not exactly on the topic of this particular issue, your link does have a nice picture of a bent coin. The position you describe there essentially falls under the mantra “Anything tests can do CIs do Better”, and it’s one we’ve discussed a fair amount on this blog. I agree that significance tests need supplementation with an indication of discrepancies that have and have not been corroborated by the results, but I prefer to solve this problem via computing what I call the severity of tests (for claims about various discrepancies). CIs can help, but unless they too are supplemented, as well as applied for several confidence levels, they also will permit misinterpretations of results. In the following blogposts, I very briefly consider two: fallacies of rejection and fallacies of acceptance. Have a look. CI reformers (in this blog) haven’t yet shown that they can match the solution that falls out from the severity demand.
https://errorstatistics.com/2012/06/17/repost-51712-do-cis-avoid-fallacies-of-tests-reforming-the-reformers/
https://errorstatistics.com/2012/06/02/anything-tests-can-do-cis-do-better-cis-do-anything-better-than-tests/
I am reading Deborah’s book and she starts Section 4.3 with the following quote from the same book by Kadane, page 438:
“[W]ith a large sample size virtually every null hypothesis is rejected, while with a small sample size, virtually no null hypothesis is rejected. And we generally have very accurate estimates of the sample size available without having to use significance testing at all!”
Interestingly, the same huge misunderstanding of significance testing can be found in many other reputable sources, including (1) Luis Pericchi in “Integrated Objective Bayesian Estimation and Hypothesis Testing”, comment, p. 27:
“I will finish this subsection with two illuminating quotations, both about testing without posterior probabilities:
“Do you want to reject a hypothesis? Just take enough data!” (Wonnacott and Wonnacott in several of their writings).
“In real life, null hypothesis will always be rejected if enough data are taken because there will inevitably be uncontrolled sources of bias”. (Berger and Delampady, 1987).”
Unfortunately, ALL these statisticians are dead wrong: Ronald Wonnacott and Thomas Wonnacott, Joseph Kadane, Luis Pericchi, James Berger, Mohan Delampady, and many, many others. They have shown a basic misunderstanding of hypothesis testing.
Interestingly, again, nobody has discovered the error.
What is wrong? (It is important to note that Deborah didn’t make any comment on Kadane’s quote. I am sure that she has noticed that but didn’t have time to deal with it in the book.)
Miodrag: I was emphasizing this because so many people wrongly suppose that rejecting with higher power is more impressive. But I DO remark in SIST that even though it’s possible to make the test super-sensitive, it does not follow that all nulls will be rejected. They will if they’re false, and the test is sensitive enough.
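A rough simulation sketch of that last point, under assumptions of my own choosing (normal data with known standard deviation 1, a one-sided 5% z-test of mean 0, and the sample mean drawn directly from its sampling distribution rather than from raw observations). The function name `rejection_rate_at_n` is hypothetical. When the null is exactly true, the rejection rate stays near 5% however large the sample; when the null is false by a small amount (here a true mean of 0.01), the rejection rate climbs toward 1 as the sample size grows.

```python
import math
import random

Z_CRIT = 1.6448536269514722  # upper 5% point of the standard normal


def rejection_rate_at_n(true_mu, n, trials=20000, seed=1, mu0=0.0, sigma=1.0):
    """Rejection rate of a one-sided 5% z-test of H: mu = mu0 at sample size n.

    Assumption: data are normal with known sigma, so the sample mean is
    drawn directly from N(true_mu, sigma / sqrt(n)).
    """
    rng = random.Random(seed)
    se = sigma / math.sqrt(n)
    rejections = 0
    for _ in range(trials):
        xbar = rng.gauss(true_mu, se)
        if (xbar - mu0) / se > Z_CRIT:
            rejections += 1
    return rejections / trials


for n in (100, 10_000, 1_000_000):
    true_null = rejection_rate_at_n(true_mu=0.0, n=n)
    tiny_gap = rejection_rate_at_n(true_mu=0.01, n=n)
    print(f"n = {n}: rate with H true = {true_null:.3f}, rate with mu = 0.01 = {tiny_gap:.3f}")
```

This idealized setup has no uncontrolled bias, so it does not speak to the Berger and Delampady worry about real-life data; it only illustrates that sample size alone does not force a rejection of a true null.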