Saturday Night Brainstorming & Task Forces: The TFSI on NHST

Each year leaders of the movement to reform statistical methodology in psychology and related social sciences get together for a brainstorming session. They review the latest from the Task Force on Statistical Inference (TFSI), propose new regulations they would like the APA publication manual to adopt, and strategize about how to institutionalize improvements to statistical methodology. See my discussion of the New Reformers in the blogposts of Sept. 26, Oct. 3 and 4, 2011.[i]

While frustrated that the TFSI has still not banned null hypothesis significance testing (NHST), despite attempts going back to at least 1996, the reformers have created, and very successfully published in, new meta-level research paradigms designed expressly to study (statistically!) a central question: have the carrots and sticks of reward and punishment been successful in decreasing the use of NHST, and promoting instead use of confidence intervals, power calculations, and meta-analysis of effect sizes? Or not?

Since it’s Saturday night, let’s listen in on part of an (imaginary) brainstorming session of the New Reformers, somewhere near an airport in a major metropolitan area.[ii] Please see 2015 update here.


Franz: It’s frustrating after all these years to see researchers still using NHST methods; some of the newer modeling techniques routinely build on those pesky tests.

Jake: And the premier publication outlets in the social sciences still haven’t mandated the severe reforms sorely needed.

Marty: I have with me a quite comprehensive 2012 report by M. Orlitzky that observes that “NHST is used in 94% of all articles in the Journal of Applied Psychology….Similarly, in economics, reliance on NHST has actually increased rather than decreased after McCloskey and Ziliak’s (1996) critique of the prevalence of NHST in the American Economic Review (Ziliak & McCloskey, 2008)”.

Dora: Oomph! Maybe their articles made things worse; I’d like to test if the effect is statistically real or not.

Pawl: Yes, that would be important. But, what new avenue can we try that hasn’t already been attempted and failed (if not actually galvanized NHST users)?  There’s little point in continuing with methods whose efficacy has been falsified.  Might we just declare that NHST is “surely the most bone-headedly misguided procedure ever institutionalized in the rote training of science students”?

Franz:  Already tried. Rozeboom 1997, p. 335.  Very, very similar phrasing also attempted by many, many others over 50 years.  All failed.

Gerry: It’s crystal clear that these practitioners are suffering from a psychological disorder; their “mindless, mechanical behavior” is very much “reminiscent of compulsive hand washing.”  Perhaps we should begin to view ourselves as Freudian analysts who empathize with “the anxiety and guilt, the compulsive and ritualistic behavior foisted upon” researchers.

We should show that we understand how statistical controversies are “projected onto an ‘intrapsychic’ conflict in the minds of researchers”. It all goes back to that “hybrid logic” attempting “to solve the conflict between its parents by denying its parents.”

Pawl: Oh My, Gerry!  That old Freudian metaphor scarcely worked even after Gigerenzer popularized it. 2000, pages 283, 280, and 281.

Gerry: I thought it was pretty good, especially the part about “denying its parents”.

Dora: I like the part about the “compulsive hand washing”. Cool!

Jake: Well, we need a fresh approach, not redundancy, not repetition. So how about we come right out with it: “What’s wrong with NHST?  Well, … it does not tell us what we want to know, and we so much want to know what we want to know that, out of desperation, we nevertheless believe that it” tells us what we want to know, because we want to know what we want…

Dora: Woah Jake!  Slow down. That was Cohen 1994, page 202, remember?  But I agree with Jake that we’ve got to shout it out with all the oomph we can muster, even frighten people a little bit: “Statistical significance is hurting people, indeed killing them”!  NHST is a method promoted by that Fisherian cult of bee-keepers.

Pawl: She’s right, oh my: “I suggest to you that Sir Ronald has befuddled us, mesmerized us…. [NHST] is basically unsound, poor scientific strategy, and one of the worst things that ever happened in the history of psychology.” Merely refuting the null hypothesis is too weak to corroborate substantive theories, “we have to have ‘Popperian risk’, ‘severe test’ [as in Mayo], or what philosopher Wesley Salmon called a highly improbable coincidence” [iii].

Gerry: H-e-ll-o! Dora and Pawl are just echoing the words in Ziliak and McCloskey 2008, page 186, and Meehl 1991, page 18; Meehl and Waller 2002, page 184, respectively.

Marty: Quite unlike Meehl, some of us deinstitutionalizers and cultural organizational researchers view Popper as not a hero but as the culprit. No one is alerting researchers that “NHST is the key statistical technique that puts into practice hypothetico-deductivism, the scientific inference procedure based on Popper’s falsifiability criterion. So, as long as the [research] community is devoted to hypothetico-deductivism, NHST will likely persist”. Orlitzky 2012, 203.  Rooting Popper out is imperative, if we’re ever going to deinstitutionalize NHST.

Jake: You want to ban Popper too?  Now you’re really going to scare people off our mission.

Franz: Instead of nagging about bans and outlawing, I say we try a more positive approach: point out how meta-analysis “means that cumulative understanding and progress in theory development is possible after all.”

(Franz stands. Chest up, chin out, hand over his heart):

“It means that the behavioral and social sciences can attain the status of true sciences: they are not doomed forever to the status of quasi-sciences or pseudoscience. ... [T]he gloom, cynicism, and nihilism that have enveloped many in the behavioral and social sciences is lifting. Young people starting out in the behavioral and social sciences today can hope for a much brighter future.”

Pawl: My! That was incredibly inspiring, Franz.

Dora: Yes, really moving, only …

Gerry:  Only problem is, Schmidt’s already said it, 1996, page 123.

Jake: Maybe we can link users of NHST with one of the sects on the “watch list” at the TSA.

Dora: Ooh!  Good idea!  I’ll have my guys in D.C. investigate this.

Marty: How about we use a cartoon to convince people?  I’m not quite clear, but perhaps like that composite of Julia, suggesting how the other party isn’t going to help her get a job in web design, or start a garden.

Franz: And just what does that have to do with outlawing NHST?

Marty: Just saying….



PARTING REMARK: I do sincerely hope that the New Reformers succeed with their long-running attempt to ban NHST in the fields with which they are dealing, so that practitioners in these fields can see at last how they may achieve the scientific status Franz describes.  However, if scientists in these fields are convinced that NHST tools are really holding them back from their potential, then ban or no ban, researchers should declare themselves free of them. (I’m not sure that the recommended 95% or 99% CIs are better off, interpreted as they are as “a set of parameter values in which we may have confidence”, with or without meta-analysis. But even just removing the distraction of these critical meta-methodological efforts and hand-wringing should at least allow them to focus on the science itself.) (To read the 2015 update, see this post.)


Cohen, J. (1994). The Earth is round (p < .05). American Psychologist, 49, 997-1003.

Gigerenzer, G. (2000), “The Superego, the Ego, and the Id in Statistical Reasoning,” Adaptive Thinking: Rationality in the Real World, OUP.

McCloskey, D. N., & Ziliak, S. T. (1996). The standard error of regression. Journal of Economic Literature, 34(1), 97-114.

Meehl, P. E. (1991), “Why summaries of research on psychological theories are often uninterpretable.” In R. E. Snow & D. E. Wiley (Eds.), Improving inquiry in social science: A volume in honor of Lee J. Cronbach (pp. 13-59), Hillsdale, NJ: Lawrence Erlbaum.

Meehl, P. and Waller, N. (2002), “The Path Analysis Controversy: A New Statistical Approach to Strong Appraisal of Verisimilitude,” Psychological Methods, Vol. 7: 283–300.

Orlitzky, M. (2012), “How Can Significance Tests be Deinstitutionalized?” Organizational Research Methods 15(2): 199-228.

Popper, K. (1962). Conjectures and Refutations. NY: Basic Books.

Popper, K. (1977). The Logic of Scientific Discovery, NY: Basic Books. (Original published 1959)

Rozeboom, W. (1997), “Good Science is Abductive, not hypothetico-deductive.” In L. L. Harlow, S. A. Mulaik, & J. H. Steiger (Eds.), What if there were no significance tests? (pp. 335-391). Hillsdale, NJ: Lawrence Erlbaum.

Salmon, W. C. (1984). Scientific Explanation and the Causal Structure of the World, Princeton, NJ: Princeton.

Schmidt, F. (1996), “Statistical Significance Testing and Cumulative Knowledge in Psychology: Implications for Training of Researchers,” Psychological Methods, Vol. 1(2): 115-129.

Ziliak, S. T., & McCloskey, D. N. (2008), The cult of statistical significance: How the standard error costs us jobs, justice, and lives. Ann Arbor: University of Michigan Press.

[ii] This is obviously a parody. Perhaps it can be seen as another one of those statistical theater of the absurd pieces, as was “Stat on a Hot Tin Roof.” (You know where to find it.)

[iii] References here are to Popper, 1977, 1962; Mayo, 1991, 1996; Salmon, 1984.


7 thoughts on “Saturday Night Brainstorming & Task Forces: The TFSI on NHST”

  1. The origin of this post: I was writing a section (of a chapter in my new book) that deals with the significance test controversy, and I was reviewing the literature (in psychology) from 2006, when I more or less stopped following it. Last night, after drafting the chapter, I took a look at the sporadic notes of mine that didn’t find their way in, and this post essentially wrote itself. The blogfolk are away for a bit, so excuse typos, etc. and feel free to alert me to them. If you see/know of any updates on the task forces that would likely interest readers, please send them to me at

  2. The subtext of this fictitious, facetious TFSI meeting seems to be that there is an intimate link between the sociology of science and the philosophy of science. Based on my own experience and observations and some of the literature I cite, I sincerely believe this assumption is wrong. Scientific communities are just as full of myth, ceremony, and outdated traditions as any other social system. So, the observation that “reforms have been ineffectual so far” (sociology of science) tells us close to nothing about the substantive merit–or lack of merit–of these reform ideas (domain of the philosophy of science). Similarly, the persistence of NHST does not mean frequentist methods are best. I wish the link were as close as philosophers and statisticians might assume it to be, but unfortunately it is not. Traditions in science live or die because of the same social dynamics observed in other arenas–institutional power dynamics (gatekeepers etc.; Orlitzky, 2011, JMAC). Reason really has little to do with the death or survival of a method–even in science… which is precisely why “meta-methodological” discussions and reviews (such as Orlitzky, 2012, ORM) are so important–rather than distractions. “Meta” efforts can make explicit what was previously only implicit.

    The implication of all this is that even researchers dubious of the net benefits of the NHST can and should continue applying it, as long as the broader epistemology of social science remains unchanged. There is no need for anyone to become a martyr or “witch-hunting” zealot–not even in science…and, btw, on either side of this important debate about NHST (either side can be accused of making strawmen of the other–really unhelpful, see Orlitzky, 2011, JMAC).

    When there are institutions that we–subjectively and on balance–consider harmful for society (whether it’s the welfare state, NHST, or other largely unquestioned institutions) some journals simply give us a chance to vent our grievances.

    • Marc: Thanks so much for your comment. I don’t understand the claim that my little parody assumes “an intimate link between the sociology of science and the philosophy of science”. I’m not even sure what that means. Nor do I hold that “‘reforms have been ineffectual so far’ (sociology of science) tells us close to nothing about the substantive merit–or lack of merit–of these reform ideas (domain of the philosophy of science)”, but it seems a mistake to identify the substantive merit of statistical tools with philosophy of science. (We philosophers of science don’t deserve that status I’m afraid.)

      But to get to your main point: I concur that there may be social explanations for methods that are distinct from the work the tools are capable of, as regards the intended goals, here statistical inference and learning taken broadly. (Common examples pointed to are training, polemical texts, publication edicts.) But it is interesting to note that I just about never hear it considered that the reason these widely used statistical tools remain is that they perform an important epistemological function. It reminds me of the reaction of Erich Lehmann years ago, when I sent him a paper highly critical of significance tests (by philosophers). In his (hand-written) letter, which surprised me with its strength, he wrote, “how arrogant of them” to suppose statistical practitioners are brainwashed and don’t have clear understanding of why they use their tools. He went on to say that was precisely the kind of philosophical attitude that outraged scientists!

      I have nothing at all against meta-methodological meetings, I have even been included in several. I do find the following worrisome: (1) Erroneous definitions of key concepts by the “reformers” (if you’re going to set out a shingle, it only makes sense to first master the concepts being questioned or underutilized, e.g., power). (2) The new reformers can be found committing some of the very fallacies of interpretation they are on about. (If significance tests are not properly understood, the associated methods of confidence intervals will be misunderstood and misused as well, and this is what we see). (3) The simple significance test, only a part of a complex methodological tool-kit, performs a function that I do not see performed by the recommended replacements (CIs and meta-analysis): detecting incompatibilities and indicating unexplained discrepancies, between data and what models and hypotheses say.

    • Jean (to Orlitzky)

      I’m wondering about your evidence that “Reason really has little to do with the death or survival of a method–even in science… “. Is there evidence that reason takes a back seat when it comes to methods for using evidence? Just wondering about the documentation you’re using…(my area is science and technology studies).

      • I had written that “I’m fairly sure that the stat reformers would not wish to embrace” a position that denies there is a reasoned or evidential adjudicator of methods, presuming instead that it’s (largely?) a matter of whichever socio-political-cultural norms happen to have the greater institutional clout. Yet Orlitzky writes that he does embrace such a socio-cultural-institutional constructivism, but I think he is led to his denial of reasoned evidence as a result of a (not uncommon) fallacy of false dilemma: Either the warrant for a claim (about a method or anything else) is an a priori matter (of pure logic or “transcendent” truth) OR it is a matter of who is in power and what group enjoys the most institutional clout. Negating the first disjunct is thought to leave the socio-institutional clout alternative. But, fortunately, these aren’t exhaustive: Even though individuals, task forces, and institutions choose methods (i.e., choice of method is not a priori), individuals may and should apply empirical criteria about performance, and about the capabilities of tools to advance knowledge/improve the critical appraisal of interpretations of data. (There’s a lot on “objectivity” one can find on this blog.) Now we all know it is futile to try to use argument to confront a radical socio-institutional-constructivist, but for a field’s methodology to be decided that way (and, thankfully, I do not think they are or will be) would strike me as unethical. I think Orlitzky is just infatuated by his hero, the colorful Feyerabend, overlooking the irony that exists even in this anarchist’s writings.

    In relation to Jean and Orlitzky:
    Jean’s query evokes the classic dilemma for one who fully holds the social-constructivist standpoint. If choice of methodology is a matter, not of reasoned grounds, but social-political (etc.) factors, then you cannot say the “reform” methods are to be preferred on reasoned grounds either (i.e., cannot say they effectively, or more effectively, advance the various, relevant, knowledge-gaining tasks). Then it is all a matter of “might” makes right, where “might” here would be whichever socio-political-cultural norms have the greater institutional clout. But I’m fairly sure that the stat reformers would not wish to embrace this. So then the question turns to Jean’s (i.e., whether, in this case, there is evidence that reasoned grounds are absent or largely absent).


    Schmidt, Frank L

    I did find this amusing.
    But it is definitely correct in its central message that it is difficult–nearly impossible–to get researchers to abandon voodoo science.
