Each year leaders of the movement to “reform” statistical methodology in psychology, social science, and other areas of applied statistics get together around this time for a brainstorming session. They review the latest from the Task Force on Statistical Inference (TFSI) and propose new regulations they would like to see adopted, not just by the APA publication manual any more, but by all science journals! Since it’s Saturday night, let’s listen in on part of an (imaginary) brainstorming session of the New Reformers.
Frustrated that the TFSI has still not banned null hypothesis significance testing (NHST)–a fallacious version of statistical significance tests that dares to violate Fisher’s first rule: It’s illicit to move directly from statistical to substantive effects–the New Reformers have created, and very successfully published in, new meta-level research paradigms designed expressly to study (statistically!) a central question: have the carrots and sticks of reward and punishment been successful in decreasing the use of NHST, and promoting instead use of confidence intervals, power calculations, and meta-analysis of effect sizes? Or not?
Most recently, the group has helped successfully launch a variety of “replication and reproducibility projects”. Having discovered how much the reward structure encourages bad statistics and gaming the system, they have cleverly pushed to change the reward structure: Failed replications (from a group chosen by a crowd-sourced band of replicationistas) would not be hidden in those dusty old file drawers, but would be guaranteed to be published without that long, drawn-out process of peer review. Do these failed replications indicate the original study was a false positive? Or that the replication attempt is a false negative? It’s hard to say.
This year, as is typical, there is a new member who is pitching in to contribute what he hopes are novel ideas for reforming statistical practice. In addition, for the first time, there is a science reporter blogging the meeting for her next freelance “bad statistics” piece for a high impact science journal. Notice that this committee only seems to grow: no one has dropped off in the 3 years I’ve followed them.
Pawl: This meeting will come to order. I am pleased to welcome our new member, Dr. Ian Nydes, adding to the medical strength we have recently built with epidemiologist S.C. In addition, we have a science writer with us today, Jenina Oozo. To familiarize everyone, we begin with a review of old business, and gradually turn to new business.
Franz: It’s so darn frustrating after all these years to see researchers still using NHST methods; some of the newer modeling techniques routinely build on numerous applications of those pesky tests.
Jake: And the premier publication outlets in the social sciences still haven’t mandated the severe reforms sorely needed. Hopefully the new blood, Dr. Ian Nydes, can help us go beyond resurrecting the failed attempts of the past.
Marty: Well, I have with me a comprehensive 2012 report by M. Orlitzky that observes that “NHST is used in 94% of all articles in the Journal of Applied Psychology….Similarly, in economics, reliance on NHST has actually increased rather than decreased after McCloskey and Ziliak’s (1996) critique of the prevalence of NHST in the American Economic Review (Ziliak & McCloskey, 2009)”.
Dora: Oomph! Maybe their articles made things worse; I’d like to test if the effect is statistically real or not.
Pawl: Yes, that would be important. But, what new avenue can we try that hasn’t already been attempted and failed (if not actually galvanized NHST users)? There’s little point in continuing with methods whose efficacy has not stood up to stringent tests. Might we just declare that NHST is “surely the most bone-headedly misguided procedure ever institutionalized in the rote training of science students”?
Franz: Already tried. Rozeboom 1997, page 335. Very, very similar phrasing also attempted by many, many others over 50 years. All failed. Darn.
Pawl: Indeed! And the machine’s back on in 2015. Fortunately, one could see the physicist’s analysis in terms of frequentist confidence intervals.
Nayth: As the “non-academic” Big Data member of the TFSI, I have something brand new: explain that “frequentist methods–in striving for immaculate statistical procedures that can’t be contaminated by the researcher’s bias–keep him hermetically sealed off from the real world.”
Gerry: Declared by Nate Silver 2012, page 253. Anyway, we’re not out to condemn all of frequentist inference, because then our confidence intervals go out the window too! Let alone our replication gig.
Marty: It’s a gig that keeps on giving.
Dora: I really like the part about the ‘immaculate statistical conception’. It could have worked except for the fact that, although people love Silver’s book overall, they regard the short part on frequentist statistics as so far-fetched as to have been sent from another planet!
Nayth: Well here’s a news flash, Dora: I’ve heard the same thing about Ziliak and McCloskey’s book!
Gerry: If we can get back to our business, my diagnosis is that these practitioners are suffering from a psychological disorder; their “mindless, mechanical behavior” is very much “reminiscent of compulsive hand washing.” It’s that germaphobia business that Nayth just raised. Perhaps we should begin to view ourselves as Freudian analysts who empathize with “the anxiety and guilt, the compulsive and ritualistic behavior foisted upon” researchers. We should show that we understand how statistical controversies are “projected onto an ‘intrapsychic’ conflict in the minds of researchers”. It all goes back to that “hybrid logic” attempting “to solve the conflict between its parents by denying its parents.”
Pawl: Oh My, Gerry! That old Freudian metaphor scarcely worked even after Gigerenzer popularized it. 2000, pages 283, 280, and 281.
Gerry: I thought it was pretty good, especially the part about “denying its parents”.
Dora: I like the part about the “compulsive hand washing”. Cool!
Jake: Well, we need a fresh approach, not redundancy, not repetition. So how about we come right out with it: “What’s wrong with NHST? Well, … it does not tell us what we want to know, and we so much want to know what we want to know that, out of desperation, we nevertheless believe that it” tells us what we want to know, because we want to know what we want…
Dora: Whoa Jake! Slow down. That was Cohen 1994, page 202, remember? But I agree with Jake that we’ve got to shout it out with all the oomph we can muster, even frighten people a little bit: “Statistical significance is hurting people, indeed killing them”! NHST is a method promoted by that Fisherian cult of bee-keepers.
Pawl: She’s right, oh my: “I suggest to you that Sir Ronald has befuddled us, mesmerized us…. [NHST] is basically unsound, poor scientific strategy, and one of the worst things that ever happened in the history of psychology.” Merely refuting the null hypothesis is too weak to corroborate substantive theories, “we have to have ‘Popperian risk’, ‘severe test’ [as in Mayo], or what philosopher Wesley Salmon called a highly improbable coincidence” [i].
Marty: Quite unlike Meehl, some of us deinstitutionalizers and cultural organizational researchers view Popper as not a hero but as the culprit. No one is alerting researchers that “NHST is the key statistical technique that puts into practice hypothetico-deductivism, the scientific inference procedure based on Popper’s falsifiability criterion. So, as long as the [research] community is devoted to hypothetico-deductivism, NHST will likely persist”. Orlitzky 2012, 203. Rooting Popper out is imperative, if we’re ever going to deinstitutionalize NHST.
Jake: You want to ban Popper too? Now you’re really going to scare people off our mission.
Nayth: True, it’s not just Meehl who extols Popper. Even some of the most philosophical of statistical practitioners are channeling Popper. I was just reading an on-line article by Andrew Gelman. He says:
“At a philosophical level, I have been persuaded by the arguments of Popper (1959), … and others that scientific revolutions arise from the identification and resolution of anomalies. In statistical terms, an anomaly is a misfit of model to data (or perhaps an internal incoherence of the model), and it can be identified by a (Fisherian) hypothesis test without reference to any particular alternative (what Cox and Hinkley 1974 call “pure significance testing”). ….At the next stage, we see science—and applied statistics—as resolving anomalies via the creation of improved models …. This view corresponds closely to the error-statistics idea of Mayo (1996)” (Gelman 2011, 70)
Jake: Of course Popper’s prime example of non-falsifiable science was Freudian/Adlerian psychology, which gave psychologist Paul Meehl conniptions because he was a Freudian as well as a Popperian. I’ve always suspected that’s one reason Meehl castigated experimental psychologists who could falsify via P-values, and thereby count as scientific (by Popper’s lights) whereas he could not. At least not yet.
Gerry: Maybe for once we should set the record straight: “It should be recognized that, according to Fisher, rejecting the null hypothesis is not equivalent to accepting the efficacy of the cause in question. The latter cannot be established on the basis of one single experiment but requires obtaining more significant results when the experiment, or an improvement of it, is repeated at other laboratories or under other conditions.”
Pawl: That was tried by Gigerenzer in “The Inference Experts” (1989, 96).
S.C.: This is radical but maybe p-values should just be used as measures of observed fit, and all inferential uses of significance testing banned.
Franz: But then you give up control of error probabilities. Instead of nagging about bans and outlawing, I say we try a more positive approach: point out how meta-analysis “means that cumulative understanding and progress in theory development is possible after all.”
(Franz stands. Chest up, chin out, hand over his heart):
“It means that the behavioral and social sciences can attain the status of true sciences: they are not doomed forever to the status of quasi-sciences or pseudoscience. ... [T]he gloom, cynicism, and nihilism that have enveloped many in the behavioral and social sciences is lifting. Young people starting out in the behavioral and social sciences today can hope for a much brighter future.”
Pawl: My! That was incredibly inspiring Franz.
Dora: Yes, really moving, only …
S.C.: How’s that meta-analysis working out for you social scientists, huh? Is the gloom lifting?
Franz: It was until we saw the ‘train wreck looming’ (after Diederik Stapel)... but now we have replication projects.
Nayth: From my experience as a TV pundit, I say just tell everyone how bad data and NHST are producing “‘statistically significant’ (but manifestly ridiculous) findings” like how toads can predict earthquakes. You guys need to get on the Bayesian train.
Marty: Is it leaving? Anyway, this is in Nathan Silver’s recent book, page 253. But I don’t see why it’s so ridiculous, I’ll bet it’s true. I read that some woman found that all the frogs she had been studying every day just up and disappeared from the pond just before that quake in L’Aquila, Italy, in 2009.
Dora: I’m with Marty. I really, really believe animals pick up those ions from sand and pools of water near earthquakes. My gut feeling is it’s very probable. Does our epidemiologist want to jump in here?
S.C.: Not into the green frog pool I should hope! Nyuk! Nyuk! But I do have a radical suggestion that no one has so far dared to utter.
Dora: Oomph! Tell, tell!
S.C.: “Given the extensive misinterpretations of frequentist statistics and the enormous (some would say impossible) demands made by fully Bayesian analyses, a serious argument can be made for de-emphasizing (if not eliminating) inferential statistics in favor of more data presentation… Without an inferential ban, however, an improvement of practice will require re-education, not restriction”.
Marty: “Living With P-Values,” Greenland and Poole in a 2013 issue of Epidemiology. An “inferential ban”. Wow, that’s music to the deinstitutionalizer’s ears.
Pawl: I just had a quick look, but their article appears to resurrect the same-old same-old: P-values are or have to be (mis)interpreted as posteriors, so here are some priors to do the trick. Of course their reconciliation between P-values and the posterior probability (with weak priors) that you’ve got the wrong directional effect (in one sided tests) was shown long ago by Cox, Pratt, others.
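[An aside for readers who want to check the arithmetic behind Pawl's remark. What follows is a minimal sketch, not anything taken from Greenland and Poole's article. It assumes the simplest textbook case: a Normal mean with known standard deviation, a one-sided test of mu <= mu0 against mu > mu0, and a flat (improper) prior on mu. The function names and the numbers are hypothetical, chosen only for illustration.]

```python
from math import erf, sqrt


def normal_cdf(z):
    """Standard Normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))


def one_sided_p_value(xbar, sigma, n, mu0=0.0):
    """P-value for H0: mu <= mu0 vs H1: mu > mu0 (sigma known)."""
    z = (xbar - mu0) / (sigma / sqrt(n))
    return 1.0 - normal_cdf(z)


def posterior_prob_wrong_direction(xbar, sigma, n, mu0=0.0):
    """Pr(mu <= mu0 | data) under a flat prior (an assumption of this sketch).

    With a flat prior the posterior for mu is Normal(xbar, sigma^2 / n).
    """
    return normal_cdf((mu0 - xbar) / (sigma / sqrt(n)))


# Hypothetical data: sample mean 0.4, sigma 1, n = 25, so z = 2.
p = one_sided_p_value(0.4, 1.0, 25)
post = posterior_prob_wrong_direction(0.4, 1.0, 25)
print(f"one-sided P-value:             {p:.4f}")
print(f"posterior Pr(mu <= 0 | data):  {post:.4f}")
```

[The two numbers coincide in this special case. With an informative prior, or a different model, they generally would not, which is part of why the committee wonders what the reconciliation buys them.]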
Franz: It’s a neat trick, but it’s unclear how this reconciliation advances our goals. Historically, the TFSI has not pushed the Bayesian line. We in psychology have enough trouble being taken as serious scientists.
Nayth: Journalists must be Bayesian because, let’s face it, we’re all biased, and we need to be up front about it.
Jenina: I know I’m here just to take notes for a story, but I believe Nate Silver made this pronouncement in his 10 or 11 point list in his presidential address to the Joint Statistical Meetings in 2013. Has it caught on?
Nayth: Well, of course I’d never let the writers of my on-line data-driven news journal introduce prior probabilities into their articles, are you kidding me? No way! We’ve got to keep it objective, push randomized-controlled trials, you know, regular statistical methods–you want me to lose my advertising revenue? Fugetaboutit!
Dora: Do as we say, not as we do– when it’s really important to us, and our cost-functions dictate otherwise.
Gerry: As Franz observes, the TFSI has not pushed the Bayesian line because we want people to use confidence intervals (CIs). Anything tests can do CIs do better.
S.C.: You still need tests. Notice you don’t use CIs alone in replication or fraudbusting.
Dora: No one’s going to notice we use the very methods we are against when we have to test if another researcher did a shoddy job. Mere mathematical quibbling I say.
Pawl: But it remains to show how confidence intervals can ensure ‘Popperian risk’ and ‘severe tests’. We may need to supplement CIs with some kind of severity analysis [as in Mayo] discussed in her blog. In the Year of Statistics, 2013, we promised we’d take up the challenge at long last, but have we? No. I’d like to hear from our new member, Dr. Ian Nydes.
Ian: I have a new and radical suggestion, and coming from a doctor, people will believe it: prove mathematically that “the high rate of non replication (lack of confirmation) of research discoveries is a consequence of the convenient, yet ill-founded strategy of claiming conclusive research finds solely on the basis of a single study assessment by formal statistical significance, typically for a p-value less than 0.05. Research is not most appropriately represented and summarized by p-values, but, unfortunately, there is a widespread notion that medical research articles should be interpreted based only on p-values….
It can be proven that most claimed research findings are false”
S.C.: That was, word for word, in John Ioannidis’ celebrated (2005) paper: “Why Most Published Research Findings Are False.” I and my colleague have serious problems with this alleged “proof”.
Ian: Of course it’s Ioannidis, and many before him (e.g., John Pratt, Jim Berger), but it doesn’t matter. It works. The best thing about it is that many frequentists have bought it–hook, line, and sinker–as a genuine problem for them!
Jake: Ioannidis is to be credited for calling wide attention to some of the terrible practices we’ve been on about since at least the 1960s (if not earlier). Still, I agree with S.C. here: his “proof” works only if you basically assume researchers are guilty of all the sins that we’re trying to block: turning tests into dichotomous “up or down” affairs, p-hacking, cherry-picking and you name it. But as someone whose main research concerns power analysis, the thing I’m most disturbed about is his abuse of power in his computations.
Dora: Oomph! Power, shpower! Fussing over the correct use of such mathematical concepts is just, well, just mathematics; while surely the “Lord’s work,” it doesn’t pay the rent, and Ioannidis has stumbled on a gold mine!
Pawl: Do medical researchers really claim “conclusive research finds solely on the basis of a single study assessment by formal statistical significance”? Do you mean to tell me that medical researchers are actually engaging in worse practices than I’ve been chastising social science researchers about for decades?
Marty: I love it! We’re not the bottom of the scientific barrel after all–medical researchers are worse!
Dora: And they really “are killing people”, unlike that wild exaggeration by Ziliak and McCloskey (2008, 186).
Gerry: Ioannidis (2005) is a wild exaggeration. But my main objection is that it misuses frequentist probability. Suppose you’ve got an urn filled with hypotheses, they can be from all sorts of fields, and let’s say 10% of them are true, the rest false. In an experiment involving randomly selecting a hypothesis from this urn, the probability of selecting one with the property “true” is 10%. But now suppose I pick out H': “Ebola can be contracted by sexual intercourse”, or any hypothesis you like. It’s mistaken to say that the Pr(H’) = 0.10.
Pawl: It’s a common fallacy of probabilistic instantiation, like saying a particular 95% confidence interval estimate has probability .95.
S.C.: Only it’s worse, because we know the probabilities the hypotheses are true are very different from what you get in following his “Cult of the Holy Spike” priors.
Dora: More mathematical nit-picking. Who cares? My loss function says they’re irrelevant.
Ian: But Pawl, if all you care about is long-run screening probabilities, and positive predictive values (PPVs), then it’s correct; the thing is, it works! It’s got everyone all upset.
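[An aside on the screening computation Ian and Gerry are arguing over. This is a minimal sketch of the basic positive predictive value (PPV) arithmetic, assuming a dichotomous "significant / not significant" test and no bias terms. The function name and the inputs are hypothetical. Gerry's 10% urn is used as the prevalence of true hypotheses.]

```python
def positive_predictive_value(prevalence, power, alpha):
    """Pr(hypothesis is true | test rejects the null), as a screening rate.

    prevalence: fraction of hypotheses in the urn that are true
    power:      Pr(reject | hypothesis true)
    alpha:      Pr(reject | hypothesis false), the type I error rate
    """
    true_positives = power * prevalence
    false_positives = alpha * (1.0 - prevalence)
    return true_positives / (true_positives + false_positives)


# Gerry's urn: 10% of hypotheses true. Power values are illustrative only.
for power in (0.8, 0.5, 0.2):
    ppv = positive_predictive_value(prevalence=0.10, power=power, alpha=0.05)
    print(f"power = {power:.1f}  ->  PPV = {ppv:.3f}")
```

[With these inputs the PPV is 0.64 at power 0.8 and falls below one half only when power is low (about 0.31 at power 0.2). So the "most findings are false" conclusion depends on what is assumed about power, prevalence and bias. And, as Gerry says, the number describes the urn, not any particular hypothesis drawn from it.]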
Nayth: Not everyone, we Bayesians love it. The best thing is that it’s basically a Bayesian computation but with frequentist receiver operating curves!
Dora: It’s brilliant! No one cares about the delicate mathematical nit-picking. Ka-ching! (That’s the sound of my cash register, I’m an economist, you know.)
Pawl: I’ve got to be frank, I don’t see how some people, and I include some people in this room, can disparage the scientific credentials of significance tests and yet rely on them to indict other people’s research, and even to insinuate questionable research practices (QRPs) if not out and out fraud (as with Smeesters, Forster, Anil Potti, many others). Folks are saying that in the past we weren’t so hypocritical.
Gerry: I’ve heard others raise Pawl’s charge. They’re irate that the hard core reformers nowadays act like p-values can’t be trusted except when used to show that p-values can’t be trusted. Is this something our committee has to worry about?
Marty: Nah! It’s a gig!
Jenina: I have to confess that this is a part of my own special “gig”. I’ve a simple 3-step recipe that automatically lets me publish an attention-grabbing article whenever I choose.
Nayth: Really? What are they? (I love numbered lists.)
Jenina: It’s very simple:
Step #1: Non-replication: a story of some poor fool who thinks he’s on the brink of glory and fame when he gets a single statistically significant result in a social psychology experiment. Only then it doesn’t replicate! (Priming studies work well, or situated cognition).
Step #2: A crazy, sexy example where p-values are claimed to support a totally unbelievable, far-out claim. As a rule I take examples about sex (as in this study on penis size and voting behavior)–perhaps with a little evolutionary psychology twist thrown in to make it more “high brow”. (Appeals to ridicule about Fisher or Neyman-Pearson give it a historical flair, while still keeping it fun.) Else I choose something real spo-o-ky like ESP. (If I run out of examples, I take what some of the more vocal P-bashers are into, citing them of course, and getting extra brownie points.) Then, it’s a cinch to say “this stuff’s so unbelievable, if we just use our brains we’d know it was bunk!” (I guess you could call this my Bayesian cheerleading bit.)
Step #3: Ioannidis’ proof that most published research findings are false, illustrated with colorful charts.
Nayth: So it’s:
Jenina: (breaks into hysterical laughter): Yes, and sometimes,…(laughing so hard she can hardly speak)…sometimes the guy from Step #1, (who couldn’t replicate, and so couldn’t publish, his result) goes out and joins the replication movement and gets to publish his non-replication, without the hassle of peer review. (Ha Ha!)
(General laughter, smirking, groans, or head shaking)
[Nayth: (Aside to Jenina) Your step #2 is just like my toad example, except that turned out to be somewhat plausible.]
Dora: I love the three-step cha cha cha! It’s a win-win strategy: Non-replication, chump effect, Ioannidis (“most research findings are false”).
Marty: Non-replication, chump effect, Ioannidis. (What a great gig!)
Pawl: I’m more interested in a 2014 paper Ioannidis jointly wrote with several others on reducing research waste.
Gerry: I move we place it on our main reading list for our 2016 meeting and then move we adjourn to drinks and dinner. I’ve got reservations at a 5-star restaurant, all covered by the TFSI Foundation.
Jake: I second. All in favor?
Pawl: Adjourned. Doubles of Elbar Grease for all!
S.C.: Isn’t that Deborah Mayo’s special concoction?
Pawl: Yes, most of us, if truth be known, are closet or even open error statisticians!*
*This post, of course, is a parody or satire (statistical satirical); all quotes are authentic as cited. Send any corrections.
Parting Remark: Given Fisher’s declaration when first setting out tests to the effect that isolated results are too cheap to be worth having, Cox’s insistence donkey’s years ago that “It is very bad practice to summarise an important investigation solely by a value of P”, and a million other admonishments against statistical fallacies and lampoons, I’m starting to think that a license should be procured (upon passing a severe test) before being permitted to use statistical tests of any kind. My own conception is an inferential reformulation of Neyman-Pearson statistics in which one uses error probabilities to infer discrepancies that are well or poorly warranted by given data. (It dovetails also with Fisherian tests, as seen in Mayo and Cox 2010, using essentially the P-value distribution for sensitivity rather than attained power). Some of the latest movements to avoid biases, selection effects, barn hunting, cherry picking, multiple testing and the like, and to promote controlled trials and attention to experimental design are all to the good. They get their justification from the goal of avoiding corrupt error probabilities. Anyone who rejects the inferential use of error probabilities is hard pressed to justify the strenuous efforts to sustain them. These error probabilities, on which confidence levels and severity assessments are built, are very different from PPVs and similar computations that arise in the context of screening, say, thousands of genes. My worry is that the best of the New Reforms, in failing to make clear the error statistical basis for their recommendations, and confusing screening with the evidential appraisal of particular statistical hypotheses, will fail to halt some of the deepest and most pervasive confusions and fallacies about running and interpreting statistical tests.
[i] References here are to Popper, 1977, 1962; Mayo, 1991, 1996; Salmon, 1984.
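[For readers who want a concrete feel for "discrepancies that are well or poorly warranted", here is a minimal sketch under assumptions that the post itself does not spell out. It takes a one-sided test of mu <= mu0 against mu > mu0 for a Normal mean with known sigma, and assesses the claim mu > mu1 by the probability of a sample mean no larger than the one observed, computed under mu = mu1. The function name and the numbers are hypothetical.]

```python
from math import erf, sqrt


def normal_cdf(z):
    """Standard Normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))


def severity_mu_greater(mu1, xbar, sigma, n):
    """Severity for the claim 'mu > mu1' given observed mean xbar.

    Computed as Pr(sample mean <= xbar; mu = mu1), assuming a Normal
    model with known sigma (an assumption of this sketch).
    """
    standard_error = sigma / sqrt(n)
    return normal_cdf((xbar - mu1) / standard_error)


# Hypothetical result: xbar = 0.2, sigma = 1, n = 100 (z = 2 against mu0 = 0).
xbar, sigma, n = 0.2, 1.0, 100
for mu1 in (0.0, 0.1, 0.2, 0.3):
    sev = severity_mu_greater(mu1, xbar, sigma, n)
    print(f"claim mu > {mu1:.1f}: severity = {sev:.3f}")
```

[In this illustration the same statistically significant result warrants "mu > 0" well (about 0.98) and "mu > 0.3" poorly (about 0.16). That is the sense in which a test result licenses some discrepancies and not others, rather than a single up-or-down verdict.]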
[ii] From “The Chump Effect: Reporters are credulous, studies show”, by Andrew Ferguson.
“Entire journalistic enterprises, whole books from cover to cover, would simply collapse into dust if even a smidgen of skepticism were summoned whenever we read that “scientists say” or “a new study finds” or “research shows” or “data suggest.” Most such claims of social science, we would soon find, fall into one of three categories: the trivial, the dubious, or the flatly untrue.”
Cohen, J. (1994). The Earth is round (p < .05). American Psychologist, 49, 997-1003.
Cox, D. R. (1958). Some problems connected with statistical inference. Annals of Mathematical Statistics 29 : 357-372.
Cox, D. R. (1977). The role of significance tests. (With discussion). Scand. J. Statist. 4 : 49-70.
Cox, D. R. (1982). Statistical significance tests. Br. J. Clinical. Pharmac. 14 : 325-331.
Gigerenzer, G. et al. (1989), The Empire of Chance, CUP.
Gigerenzer, G. (2000), “The Superego, the Ego, and the Id in Statistical Reasoning,” Adaptive Thinking: Rationality in the Real World, OUP.
Gelman, A. (2011), “Induction and Deduction in Bayesian Data Analysis,” RMM vol. 2, 2011, 67-78. Special Topic: Statistical Science and Philosophy of Science: where do (should) they meet in 2011 and beyond?
Greenland, S. and Poole, C. (2013), “Living with P Values: Resurrecting a Bayesian Perspective on Frequentist Statistics,” Epidemiology 24: 62-8.
Ioannidis, J. (2005), “Why Most Published Research Findings Are False,” PLoS Medicine 2(8): e124.
Mayo, D. G. (2012). “Statistical Science Meets Philosophy of Science Part 2: Shallow versus Deep Explorations”, Rationality, Markets, and Morals (RMM) 3, Special Topic: Statistical Science and Philosophy of Science, 71–107.
Mayo, D. G. and Cox, D. R. (2010). “Frequentist Statistics as a Theory of Inductive Inference” in Error and Inference: Recent Exchanges on Experimental Reasoning, Reliability and the Objectivity and Rationality of Science (D Mayo and A. Spanos eds.), Cambridge: Cambridge University Press: 1-27. This paper appeared in The Second Erich L. Lehmann Symposium: Optimality, 2006, Lecture Notes-Monograph Series, Volume 49, Institute of Mathematical Statistics, pp. 247-275.
Mayo, D. G. and Spanos, A. (2011) “Error Statistics” in Philosophy of Statistics , Handbook of Philosophy of Science Volume 7 Philosophy of Statistics, (General editors: Dov M. Gabbay, Paul Thagard and John Woods; Volume eds. Prasanta S. Bandyopadhyay and Malcolm R. Forster.) Elsevier: 1-46.
McCloskey, D. N., & Ziliak, S. T. (1996). The standard error of regressions. Journal of Economic Literature, 34(1), 97-114.
Meehl, P. E. (1990), “Why summaries of research on psychological theories are often uninterpretable,” Psychological Reports, 66, 195-244.
Meehl, P. and Waller, N. (2002), “The Path Analysis Controversy: A New Statistical Approach to Strong Appraisal of Verisimilitude,” Psychological Methods, Vol. 7: 283–300.
Orlitzky, M. (2012), “How Can Significance Tests be Deinstitutionalized?” Organizational Research Methods 15(2): 199-228.
Popper, K. (1962). Conjectures and Refutations. NY: Basic Books.
Popper, K. (1977). The Logic of Scientific Discovery, NY: Basic Books. (Original published 1959)
Rozeboom, W. (1997), “Good Science is Abductive, not hypothetico-deductive.” In L. L. Harlow, S. A. Mulaik, & J. H. Steiger (Eds.), What if there were no significance tests? (pp. 335-391). Hillsdale, NJ: Lawrence Erlbaum.
Salmon, W. C. (1984). Scientific Explanation and the Causal Structure of the World, Princeton, NJ: Princeton.
Schmidt, F. (1996), “Statistical Significance Testing and Cumulative Knowledge in Psychology: Implications for Training of Researchers,” Psychological Methods, Vol. 1(2): 115-129.
Silver, N. (2012), The Signal and the Noise, Penguin.
Ziliak, S. T., & McCloskey, D. N. (2008), The cult of statistical significance: How the standard error costs us jobs, justice, and lives. Ann Arbor: University of Michigan Press. (Short piece see: “The Cult of Statistical Significance” from Section on Statistical Education – JSM 2009).