Saturday Night Brainstorming: The TFSI on NHST, reblogged with a 2013 update. Please see the most recent 2015 update.
Each year leaders of the movement to reform statistical methodology in psychology, social science and other areas of applied statistics get together around this time for a brainstorming session. They review the latest from the Task Force on Statistical Inference (TFSI), propose new regulations they would like the APA publication manual to adopt, and strategize about how to institutionalize improvements to statistical methodology.
While frustrated that the TFSI has still not banned null hypothesis significance testing (NHST), despite attempts going back to at least 1996, the reformers have created, and very successfully published in, new meta-level research paradigms designed expressly to study (statistically!) a central question: have the carrots and sticks of reward and punishment succeeded in decreasing the use of NHST and promoting instead the use of confidence intervals, power calculations, and meta-analysis of effect sizes? Or not?
This year there are a couple of new members who are pitching in to contribute what they hope are novel ideas for reforming statistical practice. Since it’s Saturday night, let’s listen in on part of an (imaginary) brainstorming session of the New Reformers. This is a 2013 update of an earlier blogpost.
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Pawl: This meeting will come to order. Welcome, new members Nayth and S.C. To start with an overview…
Franz: It’s so darn frustrating after all these years to see researchers still using NHST methods; some of the newer modeling techniques routinely build on those pesky tests.
Jake: And the premier publication outlets in the social sciences still haven’t mandated the severe reforms sorely needed. Hopefully the new blood (members S.C. and Nayth) can go beyond resurrecting the failed attempts to kick NHST’s butt.
Marty: Well, I have with me a quite comprehensive 2012 report by M. Orlitzky that observes that “NHST is used in 94% of all articles in the Journal of Applied Psychology….Similarly, in economics, reliance on NHST has actually increased rather than decreased after McCloskey and Ziliak’s (1996) critique of the prevalence of NHST in the American Economic Review (Ziliak & McCloskey, 2009)”.
Dora: Oomph! Maybe their articles made things worse; I’d like to test if the effect is statistically real or not.
Pawl: Yes, that would be important. But what new avenue can we try that hasn’t already been attempted and failed (if not actually galvanized NHST users)? There’s little point in continuing with methods whose efficacy has been falsified. Might we just declare that NHST is ‘‘surely the most bone-headedly misguided procedure ever institutionalized in the rote training of science students’’?
Franz: Already tried. Rozeboom 1997, page 335. Very, very similar phrasing also attempted by many, many others over 50 years. All failed. Darn.
Jake: Didn’t it kill you to see all the attention p-values got with the Higgs boson discovery? P-value policing by Lindley and O’Hagan just made things worse (to use a term from the Normal Deviate).
Pawl: Indeed! Fortunately, one could see the physicists’ analysis in terms of frequentist confidence intervals.
Nayth: As the new “non-academic” member of TFSI, I have something brand new: explain that “frequentist methods–in striving for immaculate statistical procedures that can’t be contaminated by the researcher’s bias–keep him hermetically sealed off from the real world.”
Gerry: Declared recently by Nate Silver 2012, page 253. Anyway, we’re not out to condemn all of frequentist inference, because then our confidence intervals go out the window too!
Dora: I really like the part about the ‘immaculate statistical conception’. It could work; we’ll have to wait till the book’s been out a while.
Gerry: It’s crystal clear that these practitioners are suffering from a psychological disorder; their “mindless, mechanical behavior” is very much “reminiscent of compulsive hand washing.” It’s that germaphobia business that Nayth just raised. Perhaps we should begin to view ourselves as Freudian analysts who empathize with “the anxiety and guilt, the compulsive and ritualistic behavior foisted upon” researchers. We should show that we understand how statistical controversies are “projected onto an ‘intrapsychic’ conflict in the minds of researchers”. It all goes back to that “hybrid logic” attempting “to solve the conflict between its parents by denying its parents.”
Pawl: Oh My, Gerry! That old Freudian metaphor scarcely worked even after Gigerenzer popularized it. 2000, pages 283, 280, and 281.
Gerry: I thought it was pretty good, especially the part about “denying its parents”.
Dora: I like the part about the “compulsive hand washing”. Cool!
Jake: Well, we need a fresh approach, not redundancy, not repetition. So how about we come right out with it: “What’s wrong with NHST? Well, … it does not tell us what we want to know, and we so much want to know what we want to know that, out of desperation, we nevertheless believe that it” tells us what we want to know, because we want to know what we want…
Dora: Whoa, Jake! Slow down. That was Cohen 1994, page 202, remember? But I agree with Jake that we’ve got to shout it out with all the oomph we can muster, even frighten people a little bit: “Statistical significance is hurting people, indeed killing them”! NHST is a method promoted by that Fisherian cult of bee-keepers.
Pawl: She’s right, oh my: “I suggest to you that Sir Ronald has befuddled us, mesmerized us…. [NHST] is basically unsound, poor scientific strategy, and one of the worst things that ever happened in the history of psychology.” Merely refuting the null hypothesis is too weak to corroborate substantive theories: “we have to have ‘Popperian risk’, ‘severe test’ [as in Mayo], or what philosopher Wesley Salmon called a highly improbable coincidence” [iii].
Gerry: H-e-ll-o! Dora and Pawl are just echoing the words in Ziliak and McCloskey 2008, page 186, and Meehl 1991, page 18; Meehl and Waller 2002, page 184, respectively.
Marty: Quite unlike Meehl, some of us deinstitutionalizers and cultural organizational researchers view Popper not as a hero but as the culprit. No one is alerting researchers that “NHST is the key statistical technique that puts into practice hypothetico-deductivism, the scientific inference procedure based on Popper’s falsifiability criterion. So, as long as the [research] community is devoted to hypothetico-deductivism, NHST will likely persist”. Orlitzky 2012, page 203. Rooting Popper out is imperative if we’re ever going to deinstitutionalize NHST.
Jake: You want to ban Popper too? Now you’re really going to scare people off our mission.
Nayth: True, it’s not just Meehl who extols Popper. Even some of the most philosophical of statistical practitioners are channeling Popper. I was just reading an on-line article by Andrew Gelman. He says:
“At a philosophical level, I have been persuaded by the arguments of Popper (1959), … and others that scientific revolutions arise from the identification and resolution of anomalies. In statistical terms, an anomaly is a misfit of model to data (or perhaps an internal incoherence of the model), and it can be identified by a (Fisherian) hypothesis test without reference to any particular alternative (what Cox and Hinkley 1974 call “pure significance testing”). ….At the next stage, we see science—and applied statistics—as resolving anomalies via the creation of improved models …. This view corresponds closely to the error-statistics idea of Mayo (1996)” (Gelman 2011, 70)
S.C.: Maybe p-values should just be used as measures of observed fit, and all inferential uses of NHST banned.
Franz: But then you give up control of error probabilities. Instead of nagging about bans and outlawing, I say we try a more positive approach: point out how meta-analysis “means that cumulative understanding and progress in theory development is possible after all.”
(Franz stands. Chest up, chin out, hand over his heart):
“It means that the behavioral and social sciences can attain the status of true sciences: they are not doomed forever to the status of quasi-sciences or pseudoscience. …[T]he gloom, cynicism, and nihilism that have enveloped many in the behavioral and social sciences is lifting. Young people starting out in the behavioral and social sciences today can hope for a much brighter future.”
Pawl: My! That was incredibly inspiring, Franz.
Dora: Yes, really moving, only …
Gerry: Only problem is, Schmidt’s already said it, 1996, page 123.
Nayth: From my experience as a TV pundit, I say just tell everyone how bad data and NHST are producing “‘statistically significant’ (but manifestly ridiculous) findings” like how toads can predict earthquakes. You guys need to get on the Bayesian train.
Marty: Is it leaving? Anyway, this is in Nathan Silver’s recent book, page 253. But I don’t see why it’s so ridiculous, I’ll bet it’s true. I read that some woman found that all the frogs she had been studying every day just up and disappeared from the pond just before that quake in L’Aquila, Italy, in 2009.
Dora: I’m with Marty. I really, really believe animals pick up those ions from sand and pools of water near earthquakes. My gut feeling is it’s very probable. Does the new epidemiology member of TFSI want to jump in here?
S.C.: Not in the green frog pool I should hope! But I do have a radical suggestion that no one has so far dared to utter.
Dora: Oomph! Tell, tell!
S.C.: “Given the extensive misinterpretations of frequentist statistics and the enormous (some would say impossible) demands made by fully Bayesian analyses, a serious argument can be made for de-emphasizing (if not eliminating) inferential statistics in favor of more data presentation… Without an inferential ban, however, an improvement of practice will require re-education, not restriction”.
Marty: “Living With P-Values,” Greenland and Poole in the recent issue of Epidemiology. An “inferential ban”. Wow, that’s music to the deinstitutionalizer’s ears.
Pawl: I just had a quick look, but their article appears to just resurrect the same-old same-old: P-values have to be (mis)interpreted as posteriors, so here are some priors to do the trick.
Franz: Historically, the TFSI has not pushed the Bayesian line; we want people to use confidence intervals.
Pawl: But it remains to be shown how confidence intervals can ensure ‘Popperian risk’ and ‘severe tests’. We may need to supplement CIs with some kind of severity analysis [as in Mayo] discussed in her blog*. In the Year of Statistics, 2013, we should take up the challenge at long last, starting with our spring meeting. I move this meeting be adjourned and that we regroup at the Elba Room for drinks and dinner. Do I hear a second?
Everyone: Second!
Pawl: Hereby adjourned! Doubles of Elbar Grease for all!
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
PARTING REMARK (from 2012): I do sincerely hope that the New Reformers succeed in their long-running attempt to ban NHST in the fields with which they are dealing, so that practitioners in those fields can see at last how they may achieve the scientific status Franz describes. However, if scientists in these fields are convinced that NHST tools are really holding them back from their potential, then ban or no ban, researchers should declare themselves free of them. (I’m not sure that the recommended 95% or 99% CIs are better off, interpreted as they are as “a set of parameter values in which we may have confidence,” with or without meta-analysis; a small illustration of what the confidence level does and does not mean follows below. But even just removing the distraction of these critical meta-methodological efforts and hand-wringing should at least allow them to focus on the science itself.)
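To see what the 95% does and does not license, here is a minimal simulation sketch (in Python; the true mean, standard deviation, and sample size are invented for illustration). The 95% attaches to the long-run coverage of the interval-generating procedure, not to any single realized interval:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
mu, sigma, n, reps = 10.0, 2.0, 25, 10_000  # hypothetical true values
z = stats.norm.ppf(0.975)                   # two-sided 95% critical value

covered = 0
for _ in range(reps):
    x = rng.normal(mu, sigma, n)
    half = z * sigma / np.sqrt(n)           # known-sigma interval, for simplicity
    covered += (x.mean() - half <= mu <= x.mean() + half)

print(covered / reps)  # ~0.95: coverage is a property of the procedure
```

Roughly 95% of the intervals so generated cover the true mean; whether “confidence” in the one interval actually reported is thereby warranted is exactly what remains contested.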
*Among discussions of the New Reformers are the blogposts of Sept 26, Oct. 3 and 4, 2011[i]
REFERENCES:
Cohen, J. (1994). The Earth is round (p < .05). American Psychologist, 49, 997-1003.
Gigerenzer, G. (2000), “The Superego, the Ego, and the Id in Statistical Reasoning,” in Adaptive Thinking: Rationality in the Real World, Oxford: Oxford University Press.
Gelman, A. (2011), “Induction and Deduction in Bayesian Data Analysis,” RMM 2: 67-78. Special Topic: Statistical Science and Philosophy of Science: Where Do (Should) They Meet in 2011 and Beyond?
Greenland, S. and Poole, C. (2013), “Living with P Values: Resurrecting a Bayesian Perspective on Frequentist Statistics,” Epidemiology 24: 62-68.
Mayo, D. G. (2012). “Statistical Science Meets Philosophy of Science Part 2: Shallow versus Deep Explorations”, Rationality, Markets, and Morals (RMM) 3, Special Topic: Statistical Science and Philosophy of Science, 71–107.
Mayo, D. G. and Cox, D. R. (2010). “Frequentist Statistics as a Theory of Inductive Inference” in Error and Inference: Recent Exchanges on Experimental Reasoning, Reliability and the Objectivity and Rationality of Science (D Mayo and A. Spanos eds.), Cambridge: Cambridge University Press: 1-27. This paper appeared in The Second Erich L. Lehmann Symposium: Optimality, 2006, Lecture Notes-Monograph Series, Volume 49, Institute of Mathematical Statistics, pp. 247-275.
Mayo, D. G. and Spanos, A. (2011), “Error Statistics,” in Philosophy of Statistics (Handbook of the Philosophy of Science, Volume 7; general editors: Dov M. Gabbay, Paul Thagard and John Woods; volume editors: Prasanta S. Bandyopadhyay and Malcolm R. Forster), Elsevier: 1-46.
McCloskey, D. N., & Ziliak, S. T. (1996). The standard error of regression. Journal of Economic Literature, 34(1), 97-114.
Meehl, P. E. (1991), “Why summaries of research on psychological theories are often uninterpretable,” in R. E. Snow & D. E. Wiley (Eds.), Improving Inquiry in Social Science: A Volume in Honor of Lee J. Cronbach (pp. 13-59), Hillsdale, NJ: Lawrence Erlbaum.
Meehl, P. and Waller, N. (2002), “The Path Analysis Controversy: A New Statistical Approach to Strong Appraisal of Verisimilitude,” Psychological Methods, Vol. 7: 283–300.
Orlitzky, M. (2012), “How Can Significance Tests be Deinstitutionalized?” Organizational Research Methods 15(2): 199-228.
Popper, K. (1962). Conjectures and Refutations. NY: Basic Books.
Popper, K. (1977). The Logic of Scientific Discovery, NY: Basic Books. (Original published 1959)
Rozeboom, W. (1997), “Good Science is Abductive, not hypothetico-deductive.” In L. L. Harlow, S. A. Mulaik, & J. H. Steiger (Eds.), What if there were no significance tests? (pp. 335-391). Hillsdale, NJ: Lawrence Erlbaum.
Salmon, W. C. (1984). Scientific Explanation and the Causal Structure of the World, Princeton, NJ: Princeton University Press.
Schmidt, F. (1996), “Statistical Significance Testing and Cumulative Knowledge in Psychology: Implications for Training of Researchers,” Psychological Methods, Vol. 1(2): 115-129.
Silver, N. (2012), The Signal and the Noise, Penguin.
Ziliak, S. T., & McCloskey, D. N. (2008), The Cult of Statistical Significance: How the Standard Error Costs Us Jobs, Justice, and Lives. Ann Arbor: University of Michigan Press. (For a short piece, see “The Cult of Statistical Significance,” Section on Statistical Education, JSM 2009.)
[i] (https://errorstatistics.com/2011/09/26/whipping-boys-and-witch-hunters-comments-are-now-open/); (https://errorstatistics.com/2011/10/03/part-2-prionvac-the-will-to-understand-power/); (https://errorstatistics.com/2011/10/04/part-3-prionvac-how-the-reformers-should-have-done-their-job/).
[ii] This is obviously a parody. Perhaps it can be seen as another one of those statistical theater of the absurd pieces, as was “Stat on a Hot Tin Roof.”(You know where to find it.)
[iii] References here are to Popper 1977, 1962; Mayo 1991, 1996; Salmon 1984.
I agree (with S.C.) that “an improvement of practice will require re-education, not restriction”: the training of statisticians (in a respectable portion of statistics or mathematics departments) is especially poor, in the sense that any mention of history or philosophy of the subject is pretty much *absent*! As a result, *confusion* inevitably builds up over time for any student who cares to look *beyond* mathematical formalism, such as myself. These *unsuccessful* attempts at banning NHST call for a different approach to statistics education (i.e., educating people about statistical inference), at least in my opinion.
Nicole: Your point is well taken, but before you assume this is what S.C. has in mind, you might read the Greenland and Poole paper. I take it that they, like so many other reformers, want to set themselves up as the re-educators to go to.
It is rather curious to see statistical significance tests accused of foisting on practitioners an “immaculate statistical conception,” “compulsive hand washing,” and germaphobia when, according to the founders of classical statistics, the significance test is but one part of a quick-and-dirty melange, a multiplicity of tools relevant for different purposes and at different stages. There was always staunch resistance to the logicist ideal of a clean, uniform approach to scientific research. While this is one of the central reasons frequentist methods are faulted foundationally (by Bayesians and others), the frequentist zeitgeist chafes under the pretensions of a single uniquely rational approach. It is just too bad that crucial features of scientific research, such as “report all relevant information” and “don’t selectively report results,” do not get fully formalized. But these features do impact p-values and confidence levels, and that is why the frequentist demands that these be validly computed; a quick simulation below illustrates the point. Don Fraser* recently said something to this effect: isn’t it just too bad that we still have to think!
*Fraser, D. A. S. (2011), “Is Bayes Posterior Just Quick and Dirty Confidence?” Statistical Science 26: 299-316.
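On “don’t selectively report results”: here is a quick simulation sketch (in Python; the 20-tests-per-study setup is hypothetical) of why the frequentist insists that reported error probabilities be computed with selection taken into account. If only the smallest of many p-values is reported as if it were a single pre-planned test, the actual Type I error rate balloons far above the nominal 0.05:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, k, reps = 30, 20, 5_000  # per-group n, tests per "study", replications (all hypothetical)

false_alarms = 0
for _ in range(reps):
    # All k null hypotheses are true: both groups come from the same distribution.
    pvals = [stats.ttest_ind(rng.normal(size=n), rng.normal(size=n)).pvalue
             for _ in range(k)]
    false_alarms += (min(pvals) < 0.05)  # report only the "best" finding

print(false_alarms / reps)  # ~1 - 0.95**20 ≈ 0.64, not the nominal 0.05
```

The nominal p-value is then no longer the actual one; that is the sense in which these informal reporting norms “impact p-values and confidence levels.”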
To: anonymous: You say, “It’s just too bad that crucial features of scientific research, such as ‘report all relevant information’ and ‘don’t selectively report results’, do not get fully formalized.” Your comment (to me) brings up the persistent question of how much (or exactly what) can be formalized! As a student who has seen both sides (i.e., technical/mathematical and philosophical), I *urge* that the limits or capabilities of formalizing important concepts, such as ‘relevant information’, be more thoroughly investigated. (My own research will shed light on these questions.) As an important example, attempts at formalizing the notion of “relevant information” have been *especially* unsuccessful, at least in my opinion. Hence, instead of wishing that important concepts be formalized, let’s focus more on finding out exactly what can be formalized, and the *implications* of the formalisms!
Anon: This brings out something else that has always puzzled me: the supposition that everything needs to be formalized is at odds with the rest of science, so why is it thought to be required of those aspects of inquiry that make use of formal statistics? I began this blog with a non-statistical inquiry concerning prions and protein folding quite deliberately. At times formal statistics weaves in and out of the research in that episode, so it’s intended that there be continuity (e.g., rates of transmission can be formally modeled statistically, and being able to alter the rates was one bit of evidence that they were on the right track, in a large and piecemeal progression that makes perfect sense). Statisticians might look at entirely non-statistical arenas too. Just because there is less noise and more theory in some cases should not render the domain discontinuous with the rest of science.
The supposition that all elements of a statistical inquiry are to be formalized leads some to suppose that background information does not enter into frequentist statistics. If background enters, some actually suppose, then it must be via prior probabilities in various hypotheses and assumptions! Since frequentist methods do not formally incorporate priors representing an agent’s degree of belief (actual or rational), it appears (to some) that every inquiry starts with a blank slate (for frequentists). Perhaps this relates to the fact that experimental design was developed separately from statistical inference—or so some claim.
I want to announce that the excellent blogfolk of Elba (in particular, Jean Miller, thank you!) have put up some links to the above dialogue. A couple of people wrote me directly expressing some skepticism that these people really said these things! They did. Schmidt’s paper is especially revealing, I hope interested people will have a look at it.
It reminds me of how much the lambasting of tests in psych arose, at least in part, because Fisherian tests specify only a single (null) distribution, and most psych tests were found to suffer from a serious lack of power. Jacob Cohen (could that be Jake?) promoted power analysis but was frustrated that people didn’t, or couldn’t, really use it. Schmidt here says that since there’s no way to get powerful enough tests in psych, attention to power cannot help. Here’s where meta-analysis enters, and why it’s so important to at least some members of the TFSI. You are to combine distinct results, never mind that they come from studies of quite different quality, to attain high power (a rough sketch of the arithmetic is given below). Note that Schmidt seems to identify power (presumably against interesting alternatives) with replicability (p. 125).
And how does this help with Meehl’s call for better theories so as to have sufficiently precise predictions to qualify for Popperian severity?
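For anyone who wants the arithmetic behind “combine distinct results to attain high power,” here is a rough normal-approximation sketch (in Python; the effect size d = 0.2, per-group n = 30, and k = 10 studies are invented for illustration). A fixed-effect combination of k equal studies has the sampling variance of one study with k times the sample size, which is where the power gain comes from, setting aside study quality, heterogeneity, and publication selection, which is precisely what critics worry about:

```python
import numpy as np
from scipy import stats

def power_two_sample(d, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-sample z-test for standardized effect d."""
    se = np.sqrt(2.0 / n_per_group)            # SE of the standardized mean difference
    z_crit = stats.norm.ppf(1 - alpha / 2)
    return (stats.norm.sf(z_crit - d / se)     # rejections in the correct direction
            + stats.norm.sf(z_crit + d / se))  # (negligible) wrong-direction rejections

d, n, k = 0.2, 30, 10   # hypothetical effect size, per-group n, number of studies
print(power_two_sample(d, n))      # single study: ~0.12
print(power_two_sample(d, k * n))  # fixed-effect pooling of k equal studies: ~0.69
```

Note that the pooled calculation says nothing about whether the combined studies severely probe the same effect, which is Meehl’s worry above.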
In an e-mail today from Dr. John Kmetz, who has a blog Management Junk Science*, Kmetz expresses the following gripe about what he takes to be illicit uses of statistical methods in his field of business administration: “my career has shown that there is absolutely no interest in reform (or legitimacy) on the part of established social-science researchers. Change will have to come from outside”, and he joins many others in writing popular books on this kind of hand-wringing.
But what kind of change from outside can he mean? Are people outside the field going to demand better business management science? Why even assume “reform” is always meaningful? On the Popper-Lakatos view, if a given field is found to be often degenerating (to use a Lakatosian term), e.g., lacking stringent probes, failing to cumulate, propped up only by ad hoc saves against anomaly, cherry-picking, hunting for significance, unable to pinpoint blame for conflicting results, and/or engaging in what Kmetz calls model shopping, then at some point it is a degenerating research programme (Lakatos spelling), not merely passing through a degenerating stage: it is a non-science, fringe science, or pseudoscience. (He calls it junk science.) It may still be conducted for short-term predictions, speculative thinking, and other ends. There is no reason to suppose that every area we may set out to investigate has actual phenomena, or even genuine effects, to find or theorize about. Statistical-scientific research into ESP was entertained until around 1980 or so; there’s no cut-off point (recall my posts over the summer). In short: why assume “reform” makes sense? At best it may mean placing the “research programme” into the non-science column.
*Here is the blog, which for some reason is loaded up with spam comments: http://sites.udel.edu/mjs/gassspp/
I don’t know if that’s intentional.
“[W]hy is [the supposition that everything needs to be formalized] thought to be required of those aspects of inquiry that make use of formal statistics?” I think this is a drastic mischaracterization of the debate over foundations. I don’t know of anyone who believes that ALL aspects of inquiry that make use of formal statistics should be formalized. I also don’t know anyone who thinks that nothing can be formalized. The disagreements are about *what* should be formalized, and what benefits the formalization brings.
If statistical philosophy is a field that is truly alive and is valuable because it progresses, then it only makes sense that practices should change over time as foundations inform theory, which then informs practice. Things that were not formalizable before might become formalizable. This is, in fact, true throughout the history of statistics: we wouldn’t have statistics without people exploring new ways of formalizing things. Likewise, old ways of formalizing things may be found wanting. Foundations are not dead (at least, I hope not, for our jobs’ sake 🙂).
But of course older practices have momentum, not because people believe they are better than other practices, but because scientists don’t have a whole lot of time to consider new methods. Scientists know very little about statistics, and even less about statistical foundations and theory. Reform is difficult for all kinds of reasons that have nothing to do with the quality or usefulness of current methods, so why mock reformers’ slow success and subtly (or not so subtly) imply that it is evidence that their proposals are somehow silly?
Or, perhaps I misread the purpose of the post?