This, from 2 years ago, “fits” at least as well today…HAPPY HALLOWEEN! Memory Lane

In an earlier post I alleged that frequentist hypothesis tests often serve as whipping boys, by which I meant “scapegoats”, for the well-known misuses, abuses, and flagrant misinterpretations of tests (both simple Fisherian significance tests and Neyman-Pearson tests, although in different ways). Checking the history of this term, however, there is a certain disanalogy with at least the original meaning of a “whipping boy,” namely, an innocent boy who was punished when a medieval prince misbehaved and was in need of discipline. It was thought that seeing an innocent companion, often a friend, beaten for the prince’s own transgressions would supply an effective way to ensure the prince would not repeat the same mistake. But significance test floggings, rather than serving as a tool for humbled self-improvement and commitment to avoiding flagrant rule violations, have tended instead to yield declarations that it is the rules that are invalid! The violators are excused as not being able to help it! The situation is more akin to that of witch hunting, which in some places became an occupation in its own right.

Now some early literature, e.g., Morrison and Henkel’s Significance Test Controversy (1970), performed an important service over fifty years ago. They alerted social scientists to the fallacies of significance tests: mistaking a statistically significant difference for one of substantive importance, interpreting insignificant results as evidence for the null hypothesis (especially problematic with insensitive tests), and the like. Chastising social scientists for applying significance tests in slavish and unthinking ways, the contributors call attention to a cluster of pitfalls and fallacies of testing.
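To see why an insignificant result from an insensitive test is poor evidence for the null, here is a minimal sketch. It assumes a one-sided z-test of a normal mean with known standard deviation; the function name `power`, the discrepancy of 0.2, and the sample sizes are illustrative choices of mine, not taken from the volume.

```python
import math
from statistics import NormalDist


def power(delta, n, sigma=1.0, alpha=0.05):
    """Probability that a one-sided z-test at level alpha rejects the null
    when the true discrepancy from the null is delta (illustrative setup)."""
    nd = NormalDist()
    z_crit = nd.inv_cdf(1 - alpha)  # cutoff for rejection at level alpha
    # Under the alternative the test statistic is shifted by delta * sqrt(n) / sigma.
    return 1 - nd.cdf(z_crit - delta * math.sqrt(n) / sigma)


# Compare a small and a large sample against the same assumed discrepancy.
for n in (10, 400):
    print(n, round(power(0.2, n), 3))
```

With the small sample the test has little chance of detecting a discrepancy of 0.2 even if it exists, so failing to reject tells us almost nothing about whether such a discrepancy is absent.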

The volume describes research studies conducted for the sole purpose of revealing these flaws. Rosenthal and Gaito (1963) document how it is not rare for scientists to mistakenly regard a statistically significant difference, say at level .05, as indicating a greater discrepancy from the null when arising from a large sample size rather than a smaller sample size, even though a correct interpretation of tests indicates the reverse. By and large, these critics are not espousing a Bayesian line but rather see themselves as offering “reforms”, e.g., supplementing simple significance tests with power (e.g., Jacob Cohen’s “power analytic” movement), and most especially, replacing tests with confidence interval estimates of the size of discrepancy (from the null) indicated by the data. (Of course, the use of power is central for (frequentist) Neyman-Pearson tests, and (frequentist) confidence interval estimation even has a duality with hypothesis tests!)
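The Rosenthal and Gaito point can be checked with a few lines of arithmetic. The sketch below again assumes a one-sided z-test of a normal mean with known standard deviation (my simplification, with a hypothetical function name), and computes the smallest observed difference that just reaches significance at a given level.

```python
import math
from statistics import NormalDist


def min_significant_difference(n, sigma=1.0, alpha=0.05):
    """Smallest observed mean difference that a one-sided z-test
    rejects at level alpha with sample size n (illustrative setup)."""
    z_crit = NormalDist().inv_cdf(1 - alpha)
    # The standard error shrinks like 1 / sqrt(n), and so does the cutoff.
    return z_crit * sigma / math.sqrt(n)


for n in (25, 100, 10_000):
    print(n, round(min_significant_difference(n), 4))
```

Quadrupling the sample size halves the difference needed to reach the .05 level, so a result that is just significant with a large sample corresponds to a smaller observed discrepancy than one that is just significant with a small sample.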

But rather than taking a temporary job of pointing up some understandable fallacies in the use of newly adopted statistical tools by social scientific practitioners, or leading by example with right-headed statistical analyses, the New Reformers have seemed to settle into a permanent career of showing the same fallacies. Yes, they advocate “alternative” methods, e.g., “effect size” analysis, power analysis, confidence intervals, meta-analysis. But never having adequately unearthed the essential reasoning and rationale of significance tests (admittedly something that goes beyond many typical expositions), their supplements and reforms often betray the same confusions and pitfalls that underlie the methods they seek to supplement or replace! (I will give readers a chance to demonstrate this in later posts.)

We all reject the highly lampooned, recipe-like uses of significance tests; I and others insist on interpreting tests to reflect the extent of discrepancy indicated or not (as I did back when I was writing my doctoral dissertation, and in EGEK 1996). I never imagined that hypothesis tests (of all stripes) would continue to be flogged again and again, in the same ways!

Frustrated with the limited progress in psychology, apparently inconsistent results, and lack of replication, an imagined malign conspiracy of significance tests is blamed: traditional reliance on statistical significance testing, we hear,

“has a debilitating effect on the general research effort to develop cumulative theoretical knowledge and understanding. However, it is also important to note that it destroys the usefulness of psychological research as a means for solving practical problems in society” (Schmidt 1996, 122)[i].

Meta-analysis was to be the cure that would provide cumulative knowledge to psychology. Lest enthusiasm for revisiting the same cluster of elementary fallacies of tests begin to lose steam, the threats of dangers posed become ever shriller: just as the witch is responsible for whatever ails a community, the significance tester is portrayed as so powerful as to be responsible for blocking scientific progress. In order to keep the gig alive, a certain level of breathless hysteria is common: “statistical significance is hurting people, indeed killing them” (Ziliak and McCloskey 2008, 186)[ii]; significance testers are members of a “cult” led by R.A. Fisher, whom they call “The Wasp”. To the question, “What if there were no Significance Tests,” as the title of one book inquires[iii], surely the implication is that once tests are extirpated, their research projects would bloom and thrive; so let us have Task Forces[iv] to keep reformers busy at journalistic reforms to banish the test once and for all!

Harlow, L., Mulaik, S., Steiger, J. (Eds.) What if there were no significance tests? (pp. 37-64). Mahwah, NJ: Lawrence Erlbaum Associates.

Hunter, J.E. (1997), “Needed: A Ban on the Significance Test,” American Psychological Society 8:3-7.

Morrison, D. and Henkel, R. (eds.) (1970), The Significance Test Controversy, Aldine, Chicago.

MSERA (1998), Research in the Schools, 5(2) “Special Issue: Statistical Significance Testing,” Birmingham, Alabama.

Rosenthal, R. and Gaito, J. (1963), “The Interpretation of Levels of Significance by Psychological Researchers,” Journal of Psychology 55:33-38.

Ziliak, S. and McCloskey, D. (2008), The Cult of Statistical Significance, University of Michigan Press.

[i]Schmidt was the one Erich Lehmann wrote to me about, expressing great concern.

[ii] While setting themselves up as High Priest and Priestess of “reformers”, their own nostrums reveal they fall into the same fallacy pointed up by Rosenthal and Gaito (among many others) nearly half a century ago. That’s what should scare us!

[iii] In Lisa A. Harlow, Stanley A. Mulaik, and James H. Steiger (Eds.) What if there were no significance tests? (pp. 37-64). Mahwah, NJ: Lawrence Erlbaum Associates.

[iv] MSERA (1998): ‘Special Issue: Statistical Significance Testing,’ Research in the Schools, 5. See also Hunter (1997). The last I heard, they have not succeeded in their attempt at an all-out “test ban”. Interested readers might check the status of the effort, and report back.

Related posts:

“Saturday night brainstorming and taskforces”

“What do these share in common: MMs, limbo stick, ovulation, Dale Carnegie?: Sat. night potpourri”


  1. anon

    Mayo, may I say something, that I think it is really important: you are focusing on the wrong “enemies”.

    Your “enemies” are not the ones “witch hunting” significance tests. They are not McCloskeys or Kadanes.

    Your real enemies are 90% of applied scientists who misinterpret significance tests in every single paper. And these are not “bad scientists” only, but good and intelligent people. Even the best in some fields.

    Those are your “enemies”. They misuse the tools you defend. They give frequentist tools a bad name.

    It does not matter if you say that frequentists can avoid all the fallacies, if in practice people do not avoid them!!!!

    This is why people will continue to hit the same key over and over again! They will say that p-values are dangerous, etc. etc. etc… and they are not talking to you! They are talking to those scientists. They want them to stop and think about what they are doing.

    So, if you want to have a BIG and really POSITIVE effect, you really should focus on writing to the scientists who misuse frequentist tools. You should talk to them, not to other philosophers or to Bayesians or to whatever. You should read their papers, point out where and why they are misusing them and how to avoid it with error statistical reasoning. If people stop the “significant”-“not-significant” nonsense, and start looking at warranted discrepancies and at the substantive meaning of their findings, this will be a BIG and POSITIVE change! And the “howlers” will reduce, A LOT!

    Best regards

    • Hate to say it but self-appointed “reformers” like McCloskey and Ziliak commit all the fallacies and the worst of them. The reformers need reforms! (Check this blog for some, e.g., misinterpreting power.) That is the purpose of my “How to Tell What’s True About Statistical Inference” (in the works, but getting there). That may not stop the huge cottage industry in churning out anti-significance test screeds, so long as it’s still so rewarding, but at least it will empower those who want to get to the bottom of what they’re being told or sold.

    • Anon, and now, may I say something? You’ve written a large number of comments to this blog as of late. I think they’d have much greater impact if you weren’t merely “anon”. It’s your choice.
