Fallacies of Rejection, Nouvelle Cuisine, and assorted New Monsters

Jackie Mason

Whenever I’m in London, my criminologist friend Katrin H. and I go in search of stand-up comedy. Since it’s Saturday night (and I’m in London), we’re setting out in search of a good comedy club (I’ll complete this post upon return). A few years ago we heard Jackie Mason do his shtick, a one-man show billed as his swan song to England. It was like a repertoire of his “Greatest Hits” without a new or updated joke in the mix. Still, hearing his rants for the n^th time was often quite hilarious. It turns out that he has already been back doing another “final shtick tour” in England, but not tonight.

A sample: If you want to eat nothing, eat nouvelle cuisine. Do you know what it means? No food. The smaller the portion the more impressed people are, so long as the food’s got a fancy French name, haute cuisine. An empty plate with sauce!

As one critic wrote, Mason’s jokes “offer a window to a different era,” one whose caricatures and biases one can only hope we’ve moved beyond:

But it’s one thing for Jackie Mason to scowl at a seat in the front row and yell to the shocked audience member in his imagination, “These are jokes! They are just jokes!” and another to reprise statistical howlers, which are not jokes, to me. This blog found its reason for being partly as a place to expose, understand, and avoid them. I had earlier used this Jackie Mason opening to launch into a well-known fallacy of rejection using statistical significance tests. I’m going to go further this time around. I began by needling some leading philosophers of statistics:

Fortunately, philosophers of statistics would surely not reprise decades-old howlers and fallacies. After all, it is the philosopher’s job to clarify and expose the conceptual and logical foibles of others; and even if we do not agree, we would never merely disregard and fail to address the criticisms in published work by other philosophers. Oh wait… one of the leading texts repeats the fallacy in its third edition:

“The classical thesis that a null hypothesis may be rejected with greater confidence, the greater the power of the test, is not borne out; indeed the reverse trend is signaled” (Howson and Urbach 2006, 154).

But this is mistaken. The frequentist appraisal of tests is, and has always been, the reverse, whether of Fisherian significance tests or those of the Neyman-Pearson variety. (Blogpost, 4/4/2012)

I point this out directly in relation to their text in EGEK 1996, pp. 402-3, but to no avail. It’s not surprising that we see a new generation committing the same fallacy and/or repeating identical howlers against significance tests.

New Monsters?

But there are some brand new monsters entering the scene, and that’s what I ultimately want to talk about. With nonreplication making the news in psychology, biology and elsewhere, people are getting more reflective than ever about significance tests and related methods. (I offered a reformulation of tests long ago that avoids fallacies and supplies a much-needed evidential interpretation of error probabilities; but readers already know that; see references.) Reforms are being taken seriously: some welcome (e.g., preregistration), others rather questionable. (I gave a talk at the LSE on the topic last Tuesday.) Some of the more frightening monsters (inadvertently) manage to enshrine or at least encourage fallacies that were well-known howlers just a few years ago, and the fallacy of rejection is one of them! It makes one yearn for the days of Morrison and Henkel when test abusers, not tests, were blamed. They asked:

How Could a Group of Psychologists Be So Wrong?

The fallacy of rejection or fallacy of nouvelle cuisine was deemed so shocking back in the days of Morrison and Henkel’s 1970 classic, The Significance Test Controversy, that researchers in psychology conducted studies to try to understand it. Rosenthal and Gaito (1963) discovered that statistical significance at a given level was often fallaciously taken as evidence of a greater discrepancy from the null, the larger the sample size n. In fact, it is indicative of less of a discrepancy from the null than if it resulted from a smaller sample size. [Of course the statistical model assumptions must hold; the sample size must be large enough for the model to be adequate. But that’s a distinct point.]

What is shocking is that these psychologists indicated substantially greater confidence or belief in results associated with the larger sample size for the same p values! According to the theory, especially as this has been amplified by Neyman and Pearson, the probability of rejecting the null hypothesis for any given deviation from null and p values increases as a function of the number of observations. The rejection of the null hypothesis when the number of cases is small speaks for a more dramatic effect in the population…The question is, how could a group of psychologists be so wrong (Binder, p. 154)?

(Our convention is for “discrepancy” to refer to the parameter value, not the observed difference. Their use of “deviation” from the null alludes to our “discrepancy”.) The psychologists are falling into the fallacy of nouvelle cuisine, or:

The Mountains out of Molehills Fallacy: A (P-level) rejection of H₀ with larger sample size (higher power) is evidence of a greater discrepancy from the null than with smaller sample size. (P-value = α)

The fallacious principle comes down to thinking that if a test is a reasonable indication of some discrepancy, then it is an even better indication of a very large discrepancy! To think this way is akin to making mountains out of molehills. The correct interpretation falls out immediately from the severity interpretation of tests (see slides 49-50 from my LSE talk). That’s why the insensitive toaster is a better indication that your house is ablaze than the one that goes off with burning toast (slide 50).
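To see the direction of the effect numerically, here is a minimal sketch. It assumes a one-sided Normal test of H₀: µ ≤ 0 with known σ = 10, an outcome that just reaches the 1.96 cut-off at every sample size, and a benchmark discrepancy µ₁ = 1 chosen purely for illustration. The function names are hypothetical. Severity for the claim µ > µ₁ is computed as Pr(X̄ ≤ observed x̄; µ = µ₁), following the severity interpretation referred to above.

```python
from math import erf, sqrt

SIGMA = 10.0   # assumed known standard deviation (illustrative)
Z_OBS = 1.96   # outcome that just reaches significance at the .025 level
MU_1 = 1.0     # benchmark discrepancy, an assumption for this sketch


def normal_cdf(x: float) -> float:
    """Standard Normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))


def just_significant_mean(n: int) -> float:
    """Sample mean that yields z = Z_OBS with n observations."""
    return Z_OBS * SIGMA / sqrt(n)


def severity_mu_greater_than(mu_1: float, n: int) -> float:
    """Severity for inferring mu > mu_1 from the just-significant mean."""
    standard_error = SIGMA / sqrt(n)
    x_bar = just_significant_mean(n)
    return normal_cdf((x_bar - mu_1) / standard_error)


if __name__ == "__main__":
    for n in (25, 100, 400, 10000):
        print(
            f"n={n:>5}  just-significant mean={just_significant_mean(n):.3f}  "
            f"SEV(mu > {MU_1})={severity_mu_greater_than(MU_1, n):.3f}"
        )
```

With the same P-value throughout, the just-significant sample mean shrinks as n grows, and the warrant for µ > 1 falls from roughly .93 at n = 25 to about .83 at n = 100 and below .5 at n = 400.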

Why Is the Fallacy Becoming More Prevalent?[1]

It’s rather indirect. It usually stems from trying to foist N-P notions onto an account of inference that is measuring something completely different! Error probabilities are dragged outside their home and put to a different use–ironically, in a few cases, out of a desire to forge a reconciliation between N-P, Fisherian, and Bayesian approaches. If one is trying for something more like a Bayesian posterior or a Bayes factor, or if one has bought into one of the “diagnostic testing factory” models of statistical inference[2], then one seeks something that resembles a likelihood ratio, and invariably, the ratio (1 – β)/α gets hired for the job. But (1 – β)/α performs this job in a backwards sort of way! Unsurprisingly, it is soon found to be bad for the new job, and is quickly fired! Then the Bayes factor, or other preferred measure, is quickly brought in to save the day. For a recent example of a new monster, note the “rejection ratio” proposed by Bayarri, Benjamin, Berger, Sellke (2016). For a recent post on this, see:

https://errorstatistics.com/2016/04/11/when-the-rejection-ratio-1-%CE%B2%CE%B1-turns-evidence-on-its-head-for-those-practicing-in-the-error-statistical-tribe/

The result is taken as evidence for scrapping the N-P notions in favor of the rival measure, be it a Bayes factor, a likelihood ratio, a positive predictive value or the like. But N-P tests could never countenance that use of (1 – β)/α in the first place.

What’s Wrong with Taking (1 – β)/α as a Likelihood Ratio Comparing H₀ and H₁?

Take an example from a post by that name (02/10/15). Consider a one-sided Normal test T+ with n iid samples:

H₀: µ ≤ 0 against H₁: µ > 0

For simplicity, let σ be known, say σ = 10, n = 100, σ/√n = σ_x = 1, α = .025.

So the test would reject H₀ iff Z > c_.025 = 1.96. (1.96 is the “cut-off”.)

So the power of this test for alternative µ’ is Pr(Z > 1.96; µ = µ’).

Abbreviate “the power of test T+ to detect µ’ ” by Power(T+,µ’).

~~~~~~~~~~~~~

Simple rules for alternatives against which T+ has high power:

If we add σ_x (here 1) to the cut-off (here, 1.96) we are at an alternative value for µ that test T+ has .84 power to detect.
If we add 3σ_x to the cut-off we are at an alternative value for µ that test T+ has ~ .999 power to detect.
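These two rules can be checked directly. The sketch below assumes the setup of test T+ as given (σ_x = 1, cut-off 1.96); the helper names are hypothetical, and the Normal CDF is built from the standard library's `math.erf`.

```python
from math import erf, sqrt

SIGMA_X = 1.0   # standard error sigma / sqrt(n) for test T+
CUTOFF = 1.96   # cut-off for rejecting H0 at alpha = .025


def normal_cdf(x: float) -> float:
    """Standard Normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))


def power(mu_alt: float) -> float:
    """Power(T+, mu_alt) = Pr(Z > CUTOFF; mu = mu_alt)."""
    # Under mu_alt, Z is Normal with mean mu_alt / SIGMA_X and unit variance.
    return 1.0 - normal_cdf(CUTOFF - mu_alt / SIGMA_X)


if __name__ == "__main__":
    for k in (0, 1, 3):
        mu_alt = CUTOFF * SIGMA_X + k * SIGMA_X
        print(f"cut-off + {k} sigma_x -> mu' = {mu_alt:.2f}, power = {power(mu_alt):.4f}")
```

At the cut-off itself the power is .5; one σ_x beyond it the power is about .84, and three σ_x beyond it about .999, matching the rules stated above.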

Let the observed outcome just reach the cut-off to reject the null, z₀= 1.96.

If we were to form a “likelihood ratio” of μ = 4.96 compared to μ₀ = 0 using

[Power(T+, 4.96)]/α,

it would be about 40 (.999/.025).

It is absurd to say the alternative 4.96 is supported 40 times as much as the null, understanding support as likelihood or comparative likelihood. (The data 1.96 are even closer to 0 than to 4.96). The same point can be made with less extreme cases. What is commonly done next is to assign priors of .5 to the two hypotheses, yielding

Pr(H₀|z₀) = 1/ (1 + 40) = .024, so Pr(H₁|z₀) = .976.

Such an inference is highly unwarranted and would almost always result in an erroneous interpretation of the data.
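The arithmetic can be set beside an ordinary Normal likelihood ratio for the same outcome. The following is a minimal sketch under the assumptions of the example (σ_x = 1, α = .025, z₀ = 1.96, point values µ = 0 and µ = 4.96). The likelihood-ratio helper is an addition for contrast and is not part of the proposal being criticized; all names are hypothetical.

```python
from math import erf, exp, sqrt

ALPHA = 0.025    # type 1 error probability of test T+
CUTOFF = 1.96    # rejection cut-off
Z_OBS = 1.96     # observed outcome, just reaching the cut-off
MU_NULL = 0.0    # point null value used in the ratio
MU_ALT = 4.96    # alternative against which T+ has ~.999 power


def normal_cdf(x: float) -> float:
    """Standard Normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))


def power(mu_alt: float) -> float:
    """Pr(Z > CUTOFF; mu = mu_alt) with unit standard error."""
    return 1.0 - normal_cdf(CUTOFF - mu_alt)


def normal_likelihood_ratio(z: float, mu_a: float, mu_b: float) -> float:
    """Likelihood of mu_a relative to mu_b given z (unit variance)."""
    return exp(-0.5 * (z - mu_a) ** 2 + 0.5 * (z - mu_b) ** 2)


# The criticized "likelihood ratio": power over alpha.
rejection_ratio = power(MU_ALT) / ALPHA
# Posterior for H0 under .5 priors, treating the ratio as a likelihood ratio.
posterior_h0 = 1.0 / (1.0 + rejection_ratio)
# Ordinary likelihood ratio of the alternative to the null at z0 = 1.96.
likelihood_ratio = normal_likelihood_ratio(Z_OBS, MU_ALT, MU_NULL)

print(f"(1 - beta)/alpha        = {rejection_ratio:.2f}")
print(f"Pr(H0 | z0), .5 priors  = {posterior_h0:.3f}")
print(f"likelihood ratio at z0  = {likelihood_ratio:.4f}")
```

The power-over-alpha ratio comes out near 40, while the ordinary likelihood ratio of µ = 4.96 to µ = 0 at z₀ = 1.96 is well below 1 (about .076), so the comparative likelihood actually favors the null value.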

~~~~~~~~~~~~~~

How Could People Think It Plausible to Compute a Comparative Likelihood This Way?

I’ll come back to this (but if you’re curious see this post, section 3); in the meantime, feel free to share your comments.

*****

[1] Share any other explanations for the increased prevalence of this fallacy.

[2] The idea is to replace the probability of a type 1 error with the complement of the “positive predictive value” (PPV) from diagnostic testing. See, for example, this post, the section “but what about the statistics”?

References

Bayarri, M. J., Benjamin, D. J., Berger, J. O., and Sellke, T. M. (2016). “Rejection Odds and Rejection Ratios: A Proposal for Statistical Practice in Testing Hypotheses.” Journal of Mathematical Psychology.

Howson, C. and P. Urbach (2006). Scientific Reasoning: The Bayesian Approach. La Salle, IL: Open Court.

Mayo, D. G. (1983). “An Objective Theory of Statistical Testing.” Synthese 57(2): 297-340.

Mayo, D. G. (1996). Error and the Growth of Experimental Knowledge [EGEK]. Chicago: University of Chicago Press.

Mayo, D. G. and Cox, D. R. (2006). “Frequentist Statistics as a Theory of Inductive Inference,” in Optimality: The Second Erich L. Lehmann Symposium, ed. J. Rojo, Lecture Notes-Monograph Series, Institute of Mathematical Statistics (IMS), Vol. 49: 77-97.

Mayo, D. G. and Spanos, A. (2006). “Severe Testing as a Basic Concept in a Neyman–Pearson Philosophy of Induction,” British Journal for the Philosophy of Science 57(2): 323–57.

Mayo, D. G. and Spanos, A. (2011). “Error Statistics,” in Philosophy of Statistics, Handbook of Philosophy of Science, Volume 7 (Volume eds. Prasanta S. Bandyopadhyay and Malcolm R. Forster. General editors: Dov M. Gabbay, Paul Thagard and John Woods). Elsevier: 1-46.

Morrison, D. and R. Henkel (eds.) (1970). The Significance Test Controversy. Chicago: Aldine.

Rosenthal, R. and J. Gaito (1963). “The Interpretation of Levels of Significance by Psychological Researchers.” Journal of Psychology 55: 33-38.

[i] See also, Mayo & Spanos (2011: 174).

3/19/14 Power taboos: Statue of Liberty, Senn, Neyman, Carnap, Severity
12/29/14 To raise the power of a test is to lower (not raise) the “hurdle” for rejecting the null (Ziliak and McCloskey 3 years on)
01/03/15 No headache power (for Deirdre)
02/10/15 What’s wrong with taking (1 – β)/α, as a likelihood ratio comparing H0 and H1?
08/20/15 How to Avoid Making Mountains Out of Molehills Using Power/Severity.

5 thoughts on “Fallacies of Rejection, Nouvelle Cuisine, and assorted New Monsters”

May 16, 2016

Anonymous

Note that in many tests the exact null distribution of the statistic is unknown, and we need asymptotic approximations to get the null quantiles, so that one needs a large sample size to get the alpha that you are claiming to use…

May 17, 2016

Michael Lew

Mayo, a question for clarification: You write “Our convention is for “discrepancy” to refer to the parameter value, not the observed difference”, but I am uncertain as to whether you have in mind that the discrepancy is scaled to the true variance or the observed variance or neither. It makes a difference in cases where the population variance is unknown (i.e. most real-world cases where the normal distribution is reasonable).

May 17, 2016

Mayo

Michael: Discrepancy would still refer to the parametric difference, even if one’s estimate was based on the data. How would you recommend keeping the two distinct? I think “effect size” is a mess (in the sense of being highly equivocal).

I meant to alert you to my follow-up post: Frequentstein
https://errorstatistics.com/2016/05/22/frequentstein-whats-wrong-with-1-%CE%B2%CE%B1-as-a-measure-of-evidence-against-the-null/comment-page-1/#comment-141024

May 24, 2016

Andrew Gelman

Deborah:

You’ll love my forthcoming post, “Why the garden-of-forking-paths criticism of p-values is not like a famous Borscht Belt comedy bit.” No kidding.

May 24, 2016

Mayo

Andrew: Great! Only I certainly wouldn’t equate the “making mts. out of molehills” fallacy with the problem of spurious p-values due to cherry-picking, multiple testing, optional stopping and other data-dependent biasing effects. I’m looking forward to blog exchanges!

© Deborah G. Mayo, Error Statistics Philosophy, 2011-2018 All Rights Reserved.
