Saturday Night Brainstorming: The TFSI on NHST–reblogging with a 2013 update. Please see most recent 2015 update.
Each year leaders of the movement to reform statistical methodology in psychology, social science and other areas of applied statistics get together around this time for a brainstorming session. They review the latest from the Task Force on Statistical Inference (TFSI), propose new regulations they would like the APA publication manual to adopt, and strategize about how to institutionalize improvements to statistical methodology.
While frustrated that the TFSI has still not banned null hypothesis significance testing (NHST), despite attempts going back to at least 1996, the reformers have created, and very successfully published in, new meta-level research paradigms designed expressly to study (statistically!) a central question: have the carrots and sticks of reward and punishment been successful in decreasing the use of NHST, and promoting instead use of confidence intervals, power calculations, and meta-analysis of effect sizes? Or not?
This year there are a couple of new members who are pitching in to contribute what they hope are novel ideas for reforming statistical practice. Since it’s Saturday night, let’s listen in on part of an (imaginary) brainstorming session of the New Reformers. This is a 2013 update of an earlier blogpost.
Pawl: This meeting will come to order. Welcome, new members Nayth and S.C. To start with an overview…
Franz: It’s so darn frustrating after all these years to see researchers still using NHST methods; some of the newer modeling techniques routinely build on those pesky tests.
Jake: And the premier publication outlets in the social sciences still haven’t mandated the severe reforms sorely needed. Hopefully the new blood (members S.C. and Nayth) can go beyond resurrecting the failed attempts to kick NHST’s butt.
Marty: Well, I have with me a quite comprehensive 2012 report by M. Orlitzky that observes that “NHST is used in 94% of all articles in the Journal of Applied Psychology….Similarly, in economics, reliance on NHST has actually increased rather than decreased after McCloskey and Ziliak’s (1996) critique of the prevalence of NHST in the American Economic Review (Ziliak & McCloskey, 2009)”.
Dora: Oomph! Maybe their articles made things worse; I’d like to test if the effect is statistically real or not.
Pawl: Yes, that would be important. But, what new avenue can we try that hasn’t already been attempted and failed (if not actually galvanized NHST users)? There’s little point in continuing with methods whose efficacy has been falsified. Might we just declare that NHST is “surely the most bone-headedly misguided procedure ever institutionalized in the rote training of science students”?
Franz: Already tried. Rozeboom 1997, page 335. Very, very similar phrasing also attempted by many, many others over 50 years. All failed. Darn.
Jake: Didn’t it kill to see all the attention p-values got with the Higgs boson discovery? P-value policing by Lindley and O’Hagan just made things worse (to use a term from the Normal Deviate).
Pawl: Indeed! Fortunately, one could see the physicists’ analysis in terms of frequentist confidence intervals.
Nayth: As the new “non-academic” member of TFSI, I have something brand new: explain that “frequentist methods–in striving for immaculate statistical procedures that can’t be contaminated by the researcher’s bias–keep him hermetically sealed off from the real world.”
Gerry: Declared recently by Nate Silver 2012, page 253. Anyway, we’re not out to condemn all of frequentist inference, because then our confidence intervals go out the window too!
Dora: I really like the part about the ‘immaculate statistical conception’. It could work; we’ll have to wait till the book’s been out a while.
Gerry: It’s crystal clear that these practitioners are suffering from a psychological disorder; their “mindless, mechanical behavior” is very much “reminiscent of compulsive hand washing.” It’s that germaphobia business that Nayth just raised. Perhaps we should begin to view ourselves as Freudian analysts who empathize with “the anxiety and guilt, the compulsive and ritualistic behavior foisted upon” researchers. We should show that we understand how statistical controversies are “projected onto an ‘intrapsychic’ conflict in the minds of researchers”. It all goes back to that “hybrid logic” attempting “to solve the conflict between its parents by denying its parents.”
Pawl: Oh My, Gerry! That old Freudian metaphor scarcely worked even after Gigerenzer popularized it. 2000, pages 283, 280, and 281.
Gerry: I thought it was pretty good, especially the part about “denying its parents”.
Dora: I like the part about the “compulsive hand washing”. Cool!
Jake: Well, we need a fresh approach, not redundancy, not repetition. So how about we come right out with it: “What’s wrong with NHST? Well, … it does not tell us what we want to know, and we so much want to know what we want to know that, out of desperation, we nevertheless believe that it” tells us what we want to know, because we want to know what we want…
Dora: Whoa, Jake! Slow down. That was Cohen 1994, page 202, remember? But I agree with Jake that we’ve got to shout it out with all the oomph we can muster, even frighten people a little bit: “Statistical significance is hurting people, indeed killing them”! NHST is a method promoted by that Fisherian cult of bee-keepers.
Pawl: She’s right, oh my: “I suggest to you that Sir Ronald has befuddled us, mesmerized us…. [NHST] is basically unsound, poor scientific strategy, and one of the worst things that ever happened in the history of psychology.” Merely refuting the null hypothesis is too weak to corroborate substantive theories, “we have to have ‘Popperian risk’, ‘severe test’ [as in Mayo], or what philosopher Wesley Salmon called a highly improbable coincidence” [iii].
Gerry: H-e-ll-o! Dora and Pawl are just echoing the words in Ziliak and McCloskey 2008, page 186, and Meehl 1991, page 18; Meehl and Waller 2002, page 184, respectively.
Marty: Quite unlike Meehl, some of us deinstitutionalizers and cultural organizational researchers view Popper as not a hero but as the culprit. No one is alerting researchers that “NHST is the key statistical technique that puts into practice hypothetico-deductivism, the scientific inference procedure based on Popper’s falsifiability criterion. So, as long as the [research] community is devoted to hypothetico-deductivism, NHST will likely persist”. Orlitzky 2012, 203. Rooting Popper out is imperative, if we’re ever going to deinstitutionalize NHST.
Jake: You want to ban Popper too? Now you’re really going to scare people off our mission.
Nayth: True, it’s not just Meehl who extols Popper. Even some of the most philosophical of statistical practitioners are channeling Popper. I was just reading an on-line article by Andrew Gelman. He says:
“At a philosophical level, I have been persuaded by the arguments of Popper (1959), … and others that scientific revolutions arise from the identification and resolution of anomalies. In statistical terms, an anomaly is a misfit of model to data (or perhaps an internal incoherence of the model), and it can be identified by a (Fisherian) hypothesis test without reference to any particular alternative (what Cox and Hinkley 1974 call “pure significance testing”). ….At the next stage, we see science—and applied statistics—as resolving anomalies via the creation of improved models …. This view corresponds closely to the error-statistics idea of Mayo (1996)” (Gelman 2011, 70)
S.C.: Maybe we should just use p-values as measures of observed fit and ban all inferential uses of NHST.
Franz: But then you give up control of error probabilities. Instead of nagging about bans and outlawing, I say we try a more positive approach: point out how meta-analysis “means that cumulative understanding and progress in theory development is possible after all.”
(Franz stands. Chest up, chin out, hand over his heart):
“It means that the behavioral and social sciences can attain the status of true sciences: they are not doomed forever to the status of quasi-sciences or pseudoscience. … [T]he gloom, cynicism, and nihilism that have enveloped many in the behavioral and social sciences is lifting. Young people starting out in the behavioral and social sciences today can hope for a much brighter future.”
Pawl: My! That was incredibly inspiring Franz.
Dora: Yes, really moving, only …
Gerry: Only problem is, Schmidt’s already said it, 1996, page 123.
Nayth: From my experience as a TV pundit, I say just tell everyone how bad data and NHST are producing “‘statistically significant’ (but manifestly ridiculous) findings” like how toads can predict earthquakes. You guys need to get on the Bayesian train.
Marty: Is it leaving? Anyway, this is in Nathan Silver’s recent book, page 253. But I don’t see why it’s so ridiculous, I’ll bet it’s true. I read that some woman found that all the frogs she had been studying every day just up and disappeared from the pond just before that quake in L’Aquila, Italy, in 2009.
Dora: I’m with Marty. I really, really believe animals pick up those ions from sand and pools of water near earthquakes. My gut feeling is it’s very probable. Does the new epidemiology member of TFSI want to jump in here?
S.C.: Not in the green frog pool I should hope! But I do have a radical suggestion that no one has so far dared to utter.
Dora: Oomph! Tell, tell!
S.C.: “Given the extensive misinterpretations of frequentist statistics and the enormous (some would say impossible) demands made by fully Bayesian analyses, a serious argument can be made for de-emphasizing (if not eliminating) inferential statistics in favor of more data presentation… Without an inferential ban, however, an improvement of practice will require re-education, not restriction”.
Marty: “Living With P-Values,” Greenland and Poole in the recent issue of Epidemiology. An “inferential ban”. Wow, that’s music to the deinstitutionalizer’s ears.
Pawl: I just had a quick look, but their article appears to just resurrect the same-old same-old: P-values have to be (mis)interpreted as posteriors, so here are some priors to do the trick.
Franz: Historically, the TFSI has not pushed the Bayesian line; we want people to use confidence intervals.
Pawl: But it remains to show how confidence intervals can ensure ‘Popperian risk’ and ‘severe tests’. We may need to supplement CIs with some kind of severity analysis [as in Mayo] discussed in her blog*. In the Year of Statistics, 2013, we should take up the challenge at long last, …starting with our spring meeting. I move this meeting be adjourned, and we regroup at the Elba Room for drinks and dinner. Do I hear a second?
Pawl: Hereby adjourned! Doubles of Elbar Grease for all!
PARTING REMARK (from 2012): I do sincerely hope that the New Reformers succeed with their long-running attempt to ban NHST in the fields with which they are dealing, so that practitioners in these fields can see at last how they may achieve the scientific status Franz describes. However, if scientists in these fields are convinced that NHST tools are really holding them back from their potential, then ban or no ban, researchers should declare themselves free of them. (I’m not sure that the recommended 95% or 99% CIs are better off, interpreted as they are as “a set of parameter values in which we may have confidence”, with or without meta-analysis. But even just removing the distraction of these critical meta-methodological efforts and hand-wringing should at least allow them to focus on the science itself.)
*Among discussions of the New Reformers are the blogposts of Sept 26, Oct. 3 and 4, 2011[i]
Cohen, J. (1994). The Earth is round (p < .05). American Psychologist, 49, 997-1003.
Gigerenzer, G. (2000), “The Superego, the Ego, and the Id in Statistical Reasoning,” Adaptive Thinking: Rationality in the Real World, OUP.
Gelman, A. (2011), “Induction and Deduction in Bayesian Data Analysis,” RMM vol. 2, 2011, 67-78. Special Topic: Statistical Science and Philosophy of Science: where do (should) they meet in 2011 and beyond?
Greenland, S. and Poole, C. (2013), “Living with P Values: Resurrecting a Bayesian Perspective on Frequentist Statistics,” Epidemiology 24: 62-8.
Mayo, D. G. (2012). “Statistical Science Meets Philosophy of Science Part 2: Shallow versus Deep Explorations”, Rationality, Markets, and Morals (RMM) 3, Special Topic: Statistical Science and Philosophy of Science, 71–107.
Mayo, D. G. and Cox, D. R. (2010). “Frequentist Statistics as a Theory of Inductive Inference” in Error and Inference: Recent Exchanges on Experimental Reasoning, Reliability and the Objectivity and Rationality of Science (D Mayo and A. Spanos eds.), Cambridge: Cambridge University Press: 1-27. This paper appeared in The Second Erich L. Lehmann Symposium: Optimality, 2006, Lecture Notes-Monograph Series, Volume 49, Institute of Mathematical Statistics, pp. 247-275.
Mayo, D. G. and Spanos, A. (2011) “Error Statistics” in Philosophy of Statistics , Handbook of Philosophy of Science Volume 7 Philosophy of Statistics, (General editors: Dov M. Gabbay, Paul Thagard and John Woods; Volume eds. Prasanta S. Bandyopadhyay and Malcolm R. Forster.) Elsevier: 1-46.
McCloskey, D. N., & Ziliak, S. T. (1996). The standard error of regression. Journal of Economic Literature, 34(1), 97-114.
Meehl, P. E. (1991), “Why summaries of research on psychological theories are often uninterpretable.” In R. E. Snow & D. E. Wiley (Eds.), Improving inquiry in social science: A volume in honor of Lee J. Cronbach (pp. 13-59), Hillsdale, NJ: Lawrence Erlbaum.
Meehl, P. and Waller, N. (2002), “The Path Analysis Controversy: A New Statistical Approach to Strong Appraisal of Verisimilitude,” Psychological Methods, Vol. 7: 283–300.
Orlitzky, M. (2012), “How Can Significance Tests be Deinstitutionalized?” Organizational Research Methods 15(2): 199-228.
Popper, K. (1962). Conjectures and Refutations. NY: Basic Books.
Popper, K. (1977). The Logic of Scientific Discovery, NY: Basic Books. (Original published 1959)
Rozeboom, W. (1997), “Good Science is Abductive, not hypothetico-deductive.” In L. L. Harlow, S. A. Mulaik, & J. H. Steiger (Eds.), What if there were no significance tests? (pp. 335-391). Hillsdale, NJ: Lawrence Erlbaum.
Salmon, W. C. (1984). Scientific Explanation and the Causal Structure of the World, Princeton, NJ: Princeton.
Schmidt, F. (1996), “Statistical Significance Testing and Cumulative Knowledge in Psychology: Implications for Training of Researchers,” Psychological Methods, Vol. 1(2): 115-129.
Silver, N. (2012), The Signal and the Noise, Penguin.
Ziliak, S. T., & McCloskey, D. N. (2008), The cult of statistical significance: How the standard error costs us jobs, justice, and lives. Ann Arbor: University of Michigan Press. (Short piece see: “The Cult of Statistical Significance” from Section on Statistical Education – JSM 2009).
[i] (https://errorstatistics.com/2011/09/26/whipping-boys-and-witch-hunters-comments-are-now-open/); (https://errorstatistics.com/2011/10/03/part-2-prionvac-the-will-to-understand-power/); (https://errorstatistics.com/2011/10/04/part-3-prionvac-how-the-reformers-should-have-done-their-job/).
[ii] This is obviously a parody. Perhaps it can be seen as another one of those statistical theater of the absurd pieces, as was “Stat on a Hot Tin Roof.” (You know where to find it.)
[iii] References here are to Popper, 1977, 1962; Mayo, 1991, 1996, Salmon, 1984.
I agree (with S.C.) that “an improvement of practice will require re-education, not restriction”: the training of statisticians (in a respectable portion of statistics or mathematics departments) is especially poor, in the sense that any mention of history or philosophy of the subject is pretty much *absent*! As a result, *confusion* inevitably builds up over time for any student who cares to look *beyond* mathematical formalism, such as myself. These *unsuccessful* attempts at banning NHST call for a different approach to statistics education (i.e., educating people about statistical inference), at least in my opinion.
Nicole: Your point is well taken, but before you assume this is what S.C. has in mind, you might read the Greenland and Poole paper. I take it that they, like so many other reformers, want to set themselves up as the re-educators to go to.
It is rather curious to see statistical significance tests accused of foisting on practitioners an “immaculate statistical conception,” “compulsive hand washing” and germaphobia when, according to the founders of classical statistics, the significance test is but one part of a quick and dirty melange, a multiplicity of tools, relevant for different purposes and at different stages. There was always a staunch resistance to the logicist ideal of a clean uniform approach to scientific research. While this is one of the central reasons they are faulted foundationally, according to Bayesians and others, the frequentist zeitgeist chafes under the pretensions of a single uniquely rational approach. It’s just too bad that crucial features of scientific research, such as “report all relevant information” and “don’t selectively report results”, do not get fully formalized. But these features do impact p-values and confidence levels, and that is why the frequentist demands these be validly computed. Don Fraser recently said something to the effect that*, isn’t it just too bad we still have to think!
*Statist. Sci. 26 (2011), 299-316.