Each year leaders of the movement to reform statistical methodology in psychology and related social sciences get together for a brainstorming session. They review the latest from the Task Force on Statistical Inference (TFSI), propose new regulations they would like the APA publication manual to adopt, and strategize about how to institutionalize improvements to statistical methodology. See my discussion of the New Reformers in the blogposts of Sept. 26, Oct. 3 and 4, 2011.[i]
While frustrated that the TFSI has still not banned null hypothesis significance testing (NHST), despite attempts going back to at least 1996, the reformers have created, and very successfully published in, new meta-level research paradigms designed expressly to study (statistically!) a central question: have the carrots and sticks of reward and punishment been successful in decreasing the use of NHST, and promoting instead use of confidence intervals, power calculations, and meta-analysis of effect sizes? Or not?
Since it’s Saturday night, let’s listen in on part of an (imaginary) brainstorming session of the New Reformers, somewhere near an airport in a major metropolitan area.[ii] Please see 2015 update here.
Franz: It’s frustrating after all these years to see researchers still using NHST methods; some of the newer modeling techniques routinely build on those pesky tests.
Jake: And the premier publication outlets in the social sciences still haven’t mandated the severe reforms sorely needed.
Marty: I have with me a quite comprehensive 2012 report by M. Orlitzky that observes that “NHST is used in 94% of all articles in the Journal of Applied Psychology….Similarly, in economics, reliance on NHST has actually increased rather than decreased after McCloskey and Ziliak’s (1996) critique of the prevalence of NHST in the American Economic Review (Ziliak & McCloskey, 2008)”.
Dora: Oomph! Maybe their articles made things worse; I’d like to test if the effect is statistically real or not.
Pawl: Yes, that would be important. But, what new avenue can we try that hasn’t already been attempted and failed (if not actually galvanized NHST users)? There’s little point in continuing with methods whose efficacy has been falsified. Might we just declare that NHST is “surely the most bone-headedly misguided procedure ever institutionalized in the rote training of science students”?
Franz: Already tried. Rozeboom 1997, p. 335. Very, very similar phrasing also attempted by many, many others over 50 years. All failed.
Gerry: It’s crystal clear that these practitioners are suffering from a psychological disorder; their “mindless, mechanical behavior” is very much “reminiscent of compulsive hand washing.” Perhaps we should begin to view ourselves as Freudian analysts who empathize with “the anxiety and guilt, the compulsive and ritualistic behavior foisted upon” researchers.
We should show that we understand how statistical controversies are “projected onto an ‘intrapsychic’ conflict in the minds of researchers”. It all goes back to that “hybrid logic” attempting “to solve the conflict between its parents by denying its parents.”
Pawl: Oh My, Gerry! That old Freudian metaphor scarcely worked even after Gigerenzer popularized it. 2000, pages 283, 280, and 281.
Gerry: I thought it was pretty good, especially the part about “denying its parents”.
Dora: I like the part about the “compulsive hand washing”. Cool!
Jake: Well, we need a fresh approach, not redundancy, not repetition. So how about we come right out with it: “What’s wrong with NHST? Well, … it does not tell us what we want to know, and we so much want to know what we want to know that, out of desperation, we nevertheless believe that it” tells us what we want to know, because we want to know what we want…
Dora: Whoa, Jake! Slow down. That was Cohen 1994, page 202, remember? But I agree with Jake that we’ve got to shout it out with all the oomph we can muster, even frighten people a little bit: “Statistical significance is hurting people, indeed killing them”! NHST is a method promoted by that Fisherian cult of bee-keepers.
Pawl: She’s right, oh my: “I suggest to you that Sir Ronald has befuddled us, mesmerized us…. [NHST] is basically unsound, poor scientific strategy, and one of the worst things that ever happened in the history of psychology.” Merely refuting the null hypothesis is too weak to corroborate substantive theories, “we have to have ‘Popperian risk’, ‘severe test’ [as in Mayo], or what philosopher Wesley Salmon called a highly improbable coincidence” [iii].
Gerry: H-e-ll-o! Dora and Pawl are just echoing the words in Ziliak and McCloskey 2008, page 186, and Meehl 1991, page 18; Meehl and Waller 2002, page 184, respectively.
Marty: Quite unlike Meehl, some of us deinstitutionalizers and cultural organizational researchers view Popper not as a hero but as the culprit. No one is alerting researchers that “NHST is the key statistical technique that puts into practice hypothetico-deductivism, the scientific inference procedure based on Popper’s falsifiability criterion. So, as long as the [research] community is devoted to hypothetico-deductivism, NHST will likely persist”. Orlitzky 2012, 203. Rooting Popper out is imperative, if we’re ever going to deinstitutionalize NHST.
Jake: You want to ban Popper too? Now you’re really going to scare people off our mission.
Franz: Instead of nagging about bans and outlawing, I say we try a more positive approach: point out how meta-analysis “means that cumulative understanding and progress in theory development is possible after all.”
(Franz stands. Chest up, chin out, hand over his heart):
“It means that the behavioral and social sciences can attain the status of true sciences: they are not doomed forever to the status of quasi-sciences or pseudoscience. … [T]he gloom, cynicism, and nihilism that have enveloped many in the behavioral and social sciences is lifting. Young people starting out in the behavioral and social sciences today can hope for a much brighter future.”
Pawl: My! That was incredibly inspiring, Franz.
Dora: Yes, really moving, only …
Gerry: Only problem is, Schmidt’s already said it, 1996, page 123.
Jake: Maybe we can link users of NHST with one of the sects on the “watch list” at the TSA.
Dora: Ooh! Good idea! I’ll have my guys in D.C. investigate this.
Marty: How about we use a cartoon to convince people? I’m not quite clear, but perhaps like that composite of Julia, suggesting how the other party isn’t going to help her get a job in web design, or start a garden.
Franz: And just what does that have to do with outlawing NHST?
Marty: Just saying….
PARTING REMARK: I do sincerely hope that the New Reformers succeed with their long-running attempt to ban NHST in the fields with which they are dealing, so that practitioners in these fields can see at last how they may achieve the scientific status Franz describes. However, if scientists in these fields are convinced that NHST tools are really holding them back from their potential, then ban or no ban, researchers should declare themselves free of them. (I’m not sure that the recommended 95% or 99% CIs are better off, interpreted as they are as “a set of parameter values in which we may have confidence”, with or without meta-analysis. But even just removing the distraction of these critical meta-methodological efforts and hand-wringing should at least allow them to focus on the science itself.) (To read the 2015 update, see this post.)
Cohen, J. (1994). The Earth is round (p < .05). American Psychologist, 49, 997-1003.
Gigerenzer, G. (2000), “The Superego, the Ego, and the Id in Statistical Reasoning,” Adaptive Thinking: Rationality in the Real World, OUP.
McCloskey, D. N., & Ziliak, S. T. (1996). The standard error of regression. Journal of Economic Literature, 34(1), 97-114.
Meehl, P. E. (1991), “Why summaries of research on psychological theories are often uninterpretable.” In R. E. Snow & D. E. Wiley (Eds.), Improving inquiry in social science: A volume in honor of Lee J. Cronbach (pp. 13-59), Hillsdale, NJ: Lawrence Erlbaum.
Meehl, P. and Waller, N. (2002), “The Path Analysis Controversy: A New Statistical Approach to Strong Appraisal of Verisimilitude,” Psychological Methods, Vol. 7: 283–300.
Orlitzky, M. (2012), “How Can Significance Tests be Deinstitutionalized?” Organizational Research Methods 15(2): 199-228.
Popper, K. (1962). Conjectures and Refutations. NY: Basic Books.
Popper, K. (1977). The Logic of Scientific Discovery, NY: Basic Books. (Original published 1959)
Rozeboom, W. (1997), “Good Science is Abductive, not hypothetico-deductive.” In L. L. Harlow, S. A. Mulaik, & J. H. Steiger (Eds.), What if there were no significance tests? (pp. 335-391). Hillsdale, NJ: Lawrence Erlbaum.
Salmon, W. C. (1984). Scientific Explanation and the Causal Structure of the World, Princeton, NJ: Princeton.
Schmidt, F. (1996), “Statistical Significance Testing and Cumulative Knowledge in Psychology: Implications for Training of Researchers,” Psychological Methods, Vol. 1(2): 115-129.
Ziliak, S. T., & McCloskey, D. N. (2008), The cult of statistical significance: How the standard error costs us jobs, justice, and lives. Ann Arbor: University of Michigan Press.
[i] (https://errorstatistics.com/2011/09/26/whipping-boys-and-witch-hunters-comments-are-now-open/); (https://errorstatistics.com/2011/10/03/part-2-prionvac-the-will-to-understand-power/); (https://errorstatistics.com/2011/10/04/part-3-prionvac-how-the-reformers-should-have-done-their-job/).
[ii] This is obviously a parody. Perhaps it can be seen as another one of those statistical theater of the absurd pieces, as was “Stat on a Hot Tin Roof.” (You know where to find it.)
[iii] References here are to Popper, 1977, 1962; Mayo, 1991, 1996; Salmon, 1984.