Whenever I’m in London, my criminologist friend Katrin H. and I go in search of stand-up comedy. Since it’s Saturday night (and I’m in London), we’re setting out in search of a good comedy club (I’ll complete this post upon return). A few years ago we heard Jackie Mason do his shtick, a one-man show billed as his swan song to England. It was like a repertoire of his “Greatest Hits” without a new or updated joke in the mix. Still, hearing his rants for the nth time was often quite hilarious. It turns out that he has already been back doing another “final shtick tour” in England, but not tonight.
A sample: If you want to eat nothing, eat nouvelle cuisine. Do you know what it means? No food. The smaller the portion the more impressed people are, so long as the food’s got a fancy French name, haute cuisine. An empty plate with sauce!
As one critic wrote, Mason’s jokes “offer a window to a different era,” one whose caricatures and biases one can only hope we’ve moved beyond:
But it’s one thing for Jackie Mason to scowl at a seat in the front row and yell to the shocked audience member in his imagination, “These are jokes! They are just jokes!” and another to reprise statistical howlers, which are not jokes, to me. This blog found its reason for being partly as a place to expose, understand, and avoid them. I had earlier used this Jackie Mason opening to launch into a well-known fallacy of rejection using statistical significance tests. I’m going to go further this time around. I began by needling some leading philosophers of statistics:
Fortunately, philosophers of statistics would surely not reprise decades-old howlers and fallacies. After all, it is the philosopher’s job to clarify and expose the conceptual and logical foibles of others; and even if we do not agree, we would never merely disregard and fail to address the criticisms in published work by other philosophers. Oh wait, … one of the leading texts repeats the fallacy in its third edition:
“The classical thesis that a null hypothesis may be rejected with greater confidence, the greater the power of the test, is not borne out; indeed the reverse trend is signaled” (Howson and Urbach 2006, 154).
But this is mistaken. The frequentist appraisal of tests is, and has always been, the reverse, whether of Fisherian significance tests or those of the Neyman-Pearson variety. (Blogpost, 4/4/2012)
I point this out directly in relation to their text in EGEK 1996, pp. 402-3, but to no avail. It’s not surprising that we see a new generation committing the same fallacy and/or repeating identical howlers against significance tests.
But there are some brand new monsters entering the scene, and that’s what I want to ultimately talk about. With nonreplication making the news in psychology, biology and elsewhere, people are getting more reflective than ever about significance tests and related methods. (I offered a reformulation of tests long ago that avoids fallacies and supplies a much-needed evidential interpretation of error probabilities; but readers already know that; see references.) Reforms are being taken seriously: some welcome (e.g., preregistration), others are rather questionable. (I gave a talk at the LSE on the topic last Tuesday.) Some of the more frightening monsters (inadvertently) manage to enshrine or at least encourage fallacies that were well known howlers just a few years ago–and the fallacy of rejection is one of them! It makes one yearn for the days of Morrison and Henkel when test abusers, not tests, were blamed. They asked:
How Could a Group of Psychologists Be So Wrong?
The fallacy of rejection, or fallacy of nouvelle cuisine, was deemed so shocking back in the days of Morrison and Henkel’s 1970 classic, The Significance Test Controversy, that researchers in psychology conducted studies to try to understand it. Rosenthal and Gaito (1963) discovered that statistical significance at a given level was often fallaciously taken as evidence of a greater discrepancy from the null, the larger the sample size n. In fact, it is indicative of less of a discrepancy from the null than if it resulted from a smaller sample size. [Of course the statistical model assumptions must hold; the sample size must be large enough for the model to be adequate. But that’s a distinct point.]
What is shocking is that these psychologists indicated substantially greater confidence or belief in results associated with the larger sample size for the same p values! According to the theory, especially as this has been amplified by Neyman and Pearson, the probability of rejecting the null hypothesis for any given deviation from null and p values increases as a function of the number of observations. The rejection of the null hypothesis when the number of cases is small speaks for a more dramatic effect in the population…The question is, how could a group of psychologists be so wrong (Binder, p. 154)?
(Our convention is for “discrepancy” to refer to the parameter value, not the observed difference. Their use of “deviation” from the null alludes to our “discrepancy”.) The psychologists are falling into the fallacy of nouvelle cuisine, or:
The Mountains out of Molehills Fallacy: A (P-level) rejection of H0 with larger sample size (higher power) is evidence of a greater discrepancy from the null than with smaller sample size. (P-value = α)
The fallacious principle comes down to thinking that if a test is a reasonable indication of some discrepancy, then it is an even better indication of a very large discrepancy! To think this way is akin to making mountains out of molehills. The correct interpretation falls out immediately from the severity interpretation of tests (see slides 49-50 from my LSE talk). That’s why the insensitive toaster is a better indication that your house is ablaze than the one that goes off with burning toast (slide 50).
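To see the point numerically, here is a minimal sketch rather than anything taken from the slides. It assumes a Normal test with known σ = 10 and a hypothetical discrepancy of interest µ1 = 1 (both illustrative choices), and it assumes the usual severity computation for the claim µ > µ1 after a rejection, namely Pr(sample mean ≤ observed mean; µ = µ1). The same just-significant result (z = 1.96) is examined at three sample sizes.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard Normal, mean 0 and sd 1


def severity_mu_greater(mu1, xbar, sigma, n):
    """Assumed severity for the claim mu > mu1: Pr(sample mean <= xbar; mu = mu1)."""
    se = sigma / sqrt(n)  # standard error of the sample mean
    return STD_NORMAL.cdf((xbar - mu1) / se)


sigma = 10.0  # assumed known standard deviation (illustrative)
z0 = 1.96     # the same just-significant result (p = .025) at every sample size
mu1 = 1.0     # hypothetical discrepancy of interest

for n in (100, 400, 10_000):
    xbar = z0 * sigma / sqrt(n)  # observed mean that produces z0 at this n
    sev = severity_mu_greater(mu1, xbar, sigma, n)
    print(f"n = {n:>6}: observed mean = {xbar:.3f}, SEV(mu > {mu1}) = {sev:.3f}")
```

Under these assumptions the severity for µ > 1 falls from about .83 at n = 100 to essentially 0 at n = 10,000, even though the p-value is identical in each case: the larger the sample, the smaller the discrepancy that a just-significant result indicates.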
Why Is the Fallacy Becoming More Prevalent?
It’s rather indirect. It usually stems from trying to foist N-P notions onto an account of inference that is measuring something completely different! Error probabilities are dragged outside their home and put to a different use–ironically, in a few cases, out of a desire to forge a reconciliation between N-P, Fisherian, and Bayesian approaches. If one is trying for something more like a Bayesian posterior or a Bayes factor, or if one has bought into one of the “diagnostic testing factory” models of statistical inference, then one seeks something that resembles a likelihood ratio, and invariably, the ratio (1 – β)/α gets hired for the job. But (1 – β)/α performs this job in a backwards sort of way! Unsurprisingly, it is soon found to be bad for the new job, and is quickly fired! Then the Bayes factor, or other preferred measure, is quickly brought in to save the day. For a recent example of a new monster, note the “rejection ratio” proposed by Bayarri, Benjamin, Berger, Sellke (2016). For a recent post on this, see:
The result is taken as evidence for scrapping the N-P notions in favor of the rival measure, be it a Bayes factor, a likelihood ratio, a positive predictive value or the like. But N-P tests could never countenance that use of (1 – β)/α in the first place.
What’s Wrong with Taking (1 – β)/α as a Likelihood Ratio Comparing H0 and H1?
Take an example from a post by that name (02/10/15). Consider a one-sided Normal test T+ with n iid samples:
H0: µ ≤ 0 against H1: µ > 0
For simplicity, let σ be known, say σ = 10, n = 100, σ/√n = σx = 1, α = .025.
So the test would reject H0 iff Z > c.025 = 1.96. (1.96 is the “cut-off”.)
So the power of this test for alternative µ’ is Pr(Z > 1.96; µ = µ’).
Abbreviate “the power of test T+ to detect µ’ ” by Power(T+,µ’)
Simple rules for alternatives against which T+ has high power:
- If we add σx (here 1) to the cut-off (here, 1.96) we are at an alternative value for µ that test T+ has .84 power to detect.
- If we add 3σx to the cut-off we are at an alternative value for µ that test T+ has ~ .999 power to detect.
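The two rules above can be checked with a short sketch. It uses the example's values (cut-off 1.96, σx = 1); the function name `power` and its parameters are illustrative choices, not anything from the original post.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard Normal, mean 0 and sd 1


def power(mu_alt, cutoff=1.96, se=1.0):
    """Pr(Z > cutoff; mu = mu_alt) for test T+, where Z = sample mean / se."""
    # Under mu = mu_alt, Z is Normal with mean mu_alt / se and sd 1.
    return 1.0 - STD_NORMAL.cdf(cutoff - mu_alt / se)


for k in (0, 1, 3):
    mu_alt = 1.96 + k * 1.0  # cut-off plus k standard errors
    print(f"mu' = {mu_alt:.2f}: power = {power(mu_alt):.3f}")
```

Adding k standard errors to the cut-off gives power Φ(k), which is why the rules yield .5 at the cut-off itself, about .84 one standard error above it, and about .999 three standard errors above it.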
Let the observed outcome just reach the cut-off to reject the null, z0 = 1.96.
If we were to form a “likelihood ratio” of μ = 4.96 compared to μ0 = 0 using
Power(T+, 4.96)/α, that is, (1 – β)/α,
it would be about 40 (.999/.025).
It is absurd to say the alternative 4.96 is supported 40 times as much as the null, understanding support as likelihood or comparative likelihood. (The data 1.96 are even closer to 0 than to 4.96). The same point can be made with less extreme cases. What is commonly done next is to assign priors of .5 to the two hypotheses, yielding
Pr(H0|z0) = 1/ (1 + 40) = .024, so Pr(H1|z0) = .976.
Such an inference is highly unwarranted and would almost always result in an erroneous interpretation of the data.
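As a rough check on the numbers, the sketch below computes the ratio (1 – β)/α for µ = 4.96 and sets it beside the comparative likelihood of the observed z0 = 1.96 under the two point values µ = 4.96 and µ = 0. Treating both as point hypotheses with σx = 1 is an assumption made for illustration, and the variable names are mine.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard Normal, mean 0 and sd 1

alpha = 0.025
cutoff = 1.96
z0 = 1.96       # observed outcome, just at the cut-off
mu_null = 0.0
mu_alt = 4.96   # cut-off plus 3 standard errors (standard error = 1)

# Power of T+ against mu_alt, and the ratio (1 - beta)/alpha
power_alt = 1.0 - STD_NORMAL.cdf(cutoff - mu_alt)
rejection_ratio = power_alt / alpha

# Comparative likelihood of the observed z0 under each point value of mu
likelihood_ratio = NormalDist(mu_alt, 1.0).pdf(z0) / NormalDist(mu_null, 1.0).pdf(z0)

# Posterior for H0 if (1 - beta)/alpha is plugged in with priors of .5
posterior_h0 = 1.0 / (1.0 + rejection_ratio)

print(f"(1 - beta)/alpha         = {rejection_ratio:.1f}")
print(f"likelihood ratio at z0   = {likelihood_ratio:.3f}")
print(f"Pr(H0 | z0) from ratio   = {posterior_h0:.3f}")
```

On these assumptions the ratio (1 – β)/α is about 40, while the likelihood of the actual data under µ = 4.96 is only about .08 times its likelihood under µ = 0. The observed 1.96 favors 0 over 4.96, which is the reverse of what the “likelihood ratio” of 40 suggests.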
How Could People Think It Plausible to Compute a Comparative Likelihood This Way?
I’ll come back to this (but if you’re curious see this post, section 3); in the meantime, feel free to share your comments.
 Share any other explanations for the increased prevalence of this fallacy.
The idea is to replace the probability of a type 1 error with the complement of the “positive predictive value” (PPV) from diagnostic testing. See, for example, this post, the section “but what about the statistics”?
Bayarri, M. J., Benjamin, D. J., Berger, J. O., and Sellke, T. M. (2016). “Rejection Odds and Rejection Ratios: A Proposal for Statistical Practice in Testing Hypotheses.” Journal of Mathematical Psychology.
Howson, C. and P. Urbach (2006). Scientific Reasoning: The Bayesian Approach. La Salle, IL: Open Court.
Mayo, D. G. (1983). “An Objective Theory of Statistical Testing.” Synthese 57(2): 297-340.
Mayo, D. G. (1996). Error and the Growth of Experimental Knowledge [EGEK]. Chicago: University of Chicago Press.
Mayo, D. G. and Cox, D. R. (2006). “Frequentist Statistics as a Theory of Inductive Inference,” in Optimality: The Second Erich L. Lehmann Symposium, ed. J. Rojo, Lecture Notes-Monograph Series, Institute of Mathematical Statistics (IMS), Vol. 49: 77-97.
Mayo, D. G. and Spanos, A. (2006). “Severe Testing as a Basic Concept in a Neyman–Pearson Philosophy of Induction,” British Journal for the Philosophy of Science 57(2): 323–57.
Mayo, D. G. and Spanos, A. (2011). “Error Statistics,” in Philosophy of Statistics, Handbook of Philosophy of Science, Volume 7 (volume eds. Prasanta S. Bandyopadhyay and Malcolm R. Forster; general editors: Dov M. Gabbay, Paul Thagard and John Woods), Elsevier: 1-46.
Morrison, D. and R. Henkel (eds.) (1970). The Significance Test Controversy. Chicago: Aldine.
Rosenthal, R. and J. Gaito (1963). “The Interpretation of Levels of Significance by Psychological Researchers.” Journal of Psychology 55: 33-38.
[i] See also, Mayo & Spanos (2011: 174).
- 3/19/14 Power taboos: Statue of Liberty, Senn, Neyman, Carnap, Severity
- 12/29/14 To raise the power of a test is to lower (not raise) the “hurdle” for rejecting the null (Ziliak and McCloskey 3 years on)
- 01/03/15 No headache power (for Deirdre)
- 02/10/15 What’s wrong with taking (1 – β)/α, as a likelihood ratio comparing H0 and H1?
- 08/20/15 How to Avoid Making Mountains Out of Molehills Using Power/Severity.
Note that in many tests the exact null distribution of the statistic is unknown, and we need asymptotic approximations to get the null quantiles, so one needs a large sample size to get the alpha that one claims to use…
Mayo, a question for clarification: You write “Our convention is for “discrepancy” to refer to the parameter value, not the observed difference”, but I am uncertain as to whether you have in mind that the discrepancy is scaled to the true variance or the observed variance or neither. It makes a difference in cases where the population variance is unknown (i.e. most real-world cases where the normal distribution is reasonable).