2015 Saturday Night Brainstorming and Task Forces: (4th draft)


TFSI workgroup

Saturday Night Brainstorming: The TFSI on NHST–part reblog from here and here, with a substantial 2015 update!

Each year leaders of the movement to “reform” statistical methodology in psychology, social science, and other areas of applied statistics get together around this time for a brainstorming session. They review the latest from the Task Force on Statistical Inference (TFSI) and propose new regulations they would like to see adopted, no longer just by the APA publication manual, but by all science journals! Since it’s Saturday night, let’s listen in on part of an (imaginary) brainstorming session of the New Reformers.


Frustrated that the TFSI has still not banned null hypothesis significance testing (NHST)–a fallacious version of statistical significance tests that dares to violate Fisher’s first rule: It’s illicit to move directly from statistical to substantive effects–the New Reformers have created, and very successfully published in, new meta-level research paradigms designed expressly to study (statistically!) a central question: have the carrots and sticks of reward and punishment been successful in decreasing the use of NHST, and promoting instead use of confidence intervals, power calculations, and meta-analysis of effect sizes? Or not?  

Most recently, the group has helped successfully launch a variety of “replication and reproducibility projects”. Having discovered how much the reward structure encourages bad statistics and gaming the system, they have cleverly pushed to change the reward structure: Failed replications (from a group chosen by a crowd-sourced band of replicationistas) would not be hidden in those dusty old file drawers, but would be guaranteed to be published without that long, drawn-out process of peer review. Do these failed replications indicate the original study was a false positive, or that the replication attempt is a false negative? It’s hard to say.

This year, as is typical, there is a new member who is pitching in to contribute what he hopes are novel ideas for reforming statistical practice. In addition, for the first time, there is a science reporter blogging the meeting for her next freelance “bad statistics” piece for a high-impact science journal. Notice that this committee seems only to grow: no one has dropped off in the 3 years I’ve followed them.


Pawl: This meeting will come to order. I am pleased to welcome our new member, Dr. Ian Nydes, adding to the medical strength we have recently built with epidemiologist S.C. In addition, we have a science writer with us today, Jenina Oozo. To familiarize everyone, we begin with a review of old business, and gradually turn to new business.

Franz: It’s so darn frustrating after all these years to see researchers still using NHST methods; some of the newer modeling techniques routinely build on numerous applications of those pesky tests.

Jake: And the premier publication outlets in the social sciences still haven’t mandated the severe reforms sorely needed. Hopefully the new blood, Dr. Ian Nydes, can help us go beyond resurrecting the failed attempts of the past.

Marty: Well, I have with me a comprehensive 2012 report by M. Orlitzky that observes that “NHST is used in 94% of all articles in the Journal of Applied Psychology….Similarly, in economics, reliance on NHST has actually increased rather than decreased after McCloskey and Ziliak’s (1996) critique of the prevalence of NHST in the American Economic Review (Ziliak & McCloskey, 2009)”.

Dora: Oomph! Maybe their articles made things worse; I’d like to test if the effect is statistically real or not.

Pawl: Yes, that would be important. But, what new avenue can we try that hasn’t already been attempted and failed (if not actually galvanized NHST users)? There’s little point in continuing with methods whose efficacy has not stood up to stringent tests. Might we just declare that NHST is “surely the most bone-headedly misguided procedure ever institutionalized in the rote training of science students”?

Franz:  Already tried. Rozeboom 1997, page 335.  Very, very similar phrasing also attempted by many, many others over 50 years.  All failed. Darn.

Jake: Didn’t it kill to see all the attention p-values got with the 2012 Higgs Boson discovery?  P-value policing by Lindley and O’Hagan (to use a term from the Normal Deviate) just made things worse.

Pawl: Indeed! And the machine’s back on in 2015. Fortunately, one could see the physicists’ analysis in terms of frequentist confidence intervals.

Nayth: As the “non-academic” Big Data member of the TFSI, I have something brand new: explain that “frequentist methods–in striving for immaculate statistical procedures that can’t be contaminated by the researcher’s bias–keep him hermetically sealed off from the real world.”

Gerry: Declared by Nate Silver 2012, page 253. Anyway, we’re not out to condemn all of frequentist inference, because then our confidence intervals go out the window too! Let alone our replication gig.

Marty: It’s a gig that keeps on giving.

Dora: I really like the part about the ‘immaculate statistical’ conception.  It could have worked except for the fact that, although people love Silver’s book overall, they regard the short part on frequentist statistics as so far-fetched as to have been sent from another planet!

Nayth: Well here’s a news flash, Dora: I’ve heard the same thing about Ziliak and McCloskey’s book!

Gerry: If we can get back to our business, my diagnosis is that these practitioners are suffering from a psychological disorder; their “mindless, mechanical behavior” is very much “reminiscent of compulsive hand washing.” It’s that germaphobia business that Nayth just raised. Perhaps we should begin to view ourselves as Freudian analysts who empathize with “the anxiety and guilt, the compulsive and ritualistic behavior foisted upon” researchers. We should show that we understand how statistical controversies are “projected onto an ‘intrapsychic’ conflict in the minds of researchers”. It all goes back to that “hybrid logic” attempting “to solve the conflict between its parents by denying its parents.”

Pawl: Oh My, Gerry!  That old Freudian metaphor scarcely worked even after Gigerenzer popularized it. 2000, pages 283, 280, and 281.

Gerry: I thought it was pretty good, especially the part about “denying its parents”.

Dora: I like the part about the “compulsive hand washing”. Cool!

Jake: Well, we need a fresh approach, not redundancy, not repetition. So how about we come right out with it: “What’s wrong with NHST?  Well, … it does not tell us what we want to know, and we so much want to know what we want to know that, out of desperation, we nevertheless believe that it” tells us what we want to know, because we want to know what we want…

Dora: Whoa, Jake! Slow down. That was Cohen 1994, page 202, remember? But I agree with Jake that we’ve got to shout it out with all the oomph we can muster, even frighten people a little bit: “Statistical significance is hurting people, indeed killing them”! NHST is a method promoted by that Fisherian cult of bee-keepers.

Pawl: She’s right, oh my: “I suggest to you that Sir Ronald has befuddled us, mesmerized us…. [NHST] is basically unsound, poor scientific strategy, and one of the worst things that ever happened in the history of psychology.” Merely refuting the null hypothesis is too weak to corroborate substantive theories, “we have to have ‘Popperian risk’, ‘severe test’ [as in Mayo], or what philosopher Wesley Salmon called a highly improbable coincidence” [i].

Gerry: H-e-ll-o! Dora and Pawl are just echoing the words in Ziliak and McCloskey 2008, page 186, and Meehl 1990, page 18; Meehl and Waller 2002, page 284, respectively.

Marty: Quite unlike Meehl, some of us deinstitutionalizers and cultural organizational researchers view Popper as not a hero but as the culprit. No one is alerting researchers that “NHST is the key statistical technique that puts into practice hypothetico-deductivism, the scientific inference procedure based on Popper’s falsifiability criterion. So, as long as the [research] community is devoted to hypothetico-deductivism, NHST will likely persist”. Orlitzky 2012, 203.  Rooting Popper out is imperative, if we’re ever going to deinstitutionalize NHST.

Jake: You want to ban Popper too?  Now you’re really going to scare people off our mission.

Nayth: True, it’s not just Meehl who extols Popper. Even some of the most philosophical of statistical practitioners are channeling Popper.  I was just reading an on-line article by Andrew Gelman. He says:

“At a philosophical level, I have been persuaded by the arguments of Popper (1959), … and others that scientific revolutions arise from the identification and resolution of anomalies. In statistical terms, an anomaly is a misfit of model to data (or perhaps an internal incoherence of the model), and it can be identified by a (Fisherian) hypothesis test without reference to any particular alternative (what Cox and Hinkley 1974 call “pure significance testing”). ….At the next stage, we see science—and applied statistics—as resolving anomalies via the creation of improved models …. This view corresponds closely to the error-statistics idea of Mayo (1996)” (Gelman 2011, 70)

Jake: Of course Popper’s prime example of non falsifiable science was Freudian/Adlerian psychology which gave psychologist Paul Meehl conniptions because he was a Freudian as well as a Popperian. I’ve always suspected that’s one reason Meehl castigated experimental psychologists who could falsify via P-values, and thereby count as scientific (by Popper’s lights) whereas he could not. At least not yet.

Gerry: Maybe for once we should set the record straight: “It should be recognized that, according to Fisher, rejecting the null hypothesis is not equivalent to accepting the efficacy of the cause in question. The latter cannot be established on the basis of one single experiment but requires obtaining more significant results when the experiment, or an improvement of it, is repeated at other laboratories or under other conditions.”

Pawl: That was tried by Gigerenzer in “The Inference Experts” (1989, 96).

S.C.: This is radical but maybe p-values should just be used as measures of observed fit, and all inferential uses of significance testing banned.

Franz: But then you give up control of error probabilities. Instead of nagging about bans and outlawing, I say we try a more positive approach: point out how meta-analysis “means that cumulative understanding and progress in theory development is possible after all.”

(Franz stands. Chest up, chin out, hand over his heart):

“It means that the behavioral and social sciences can attain the status of true sciences: they are not doomed forever to the status of quasi-sciences or pseudoscience. … [T]he gloom, cynicism, and nihilism that have enveloped many in the behavioral and social sciences is lifting. Young people starting out in the behavioral and social sciences today can hope for a much brighter future.”

Pawl: My! That was incredibly inspiring Franz.

Dora: Yes, really moving, only …

Gerry:  Only problem is, Schmidt’s already said it, 1996, page 123.

S.C.: How’s that meta-analysis working out for you social scientists, huh? Is the gloom lifting?

Franz: It was until we saw the ‘train wreck looming’ (after Diederik Stapel)… but now we have replication projects.

Nayth: From my experience as a TV pundit, I say just tell everyone how bad data and NHST are producing “‘statistically significant’ (but manifestly ridiculous) findings” like how toads can predict earthquakes. You guys need to get on the Bayesian train.

Marty: Is it leaving? Anyway, this is in Nate Silver’s 2012 book, page 253. But I don’t see why it’s so ridiculous, I’ll bet it’s true. I read that some woman found that all the frogs she had been studying every day just up and disappeared from the pond just before that quake in L’Aquila, Italy, in 2009.

Dora: I’m with Marty. I really, really believe animals pick up those ions from sand and pools of water near earthquakes. My gut feeling is it’s very probable. Does our epidemiologist want to jump in here?

S.C.: Not into the green frog pool I should hope! Nyuk! Nyuk! But I do have a radical suggestion that no one has so far dared to utter.

Dora: Oomph! Tell, tell!

S.C.: “Given the extensive misinterpretations of frequentist statistics and the enormous (some would say impossible) demands made by fully Bayesian analyses, a serious argument can be made for de-emphasizing (if not eliminating) inferential statistics in favor of more data presentation… Without an inferential ban, however, an improvement of practice will require re-education, not restriction”.

Marty: “Living With P-Values,” Greenland and Poole in a 2013 issue of Epidemiology. An “inferential ban”. Wow, that’s music to the deinstitutionalizer’s ears.

Pawl: I just had a quick look, but their article appears to resurrect the same-old same-old: P-values are or have to be (mis)interpreted as posteriors, so here are some priors to do the trick. Of course their reconciliation between P-values and the posterior probability (with weak priors) that you’ve got the wrong directional effect (in one sided tests) was shown long ago by Cox, Pratt, others.
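(An aside for readers who want to see the reconciliation Pawl mentions in numbers. What follows is only a minimal sketch under assumptions that are not in the dialogue: a normal mean with known standard deviation, a one-sided test of H0: mu <= 0, and a flat prior standing in for the limit of "weak priors". The function names and the illustrative data are hypothetical.)

```python
import math


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def one_sided_p_value(xbar, sigma, n):
    """P(Xbar >= observed xbar; mu = 0), for H0: mu <= 0 against mu > 0."""
    z = xbar * math.sqrt(n) / sigma
    return 1.0 - normal_cdf(z)


def posterior_prob_wrong_direction(xbar, sigma, n):
    """P(mu <= 0 | xbar) under a flat prior on mu.

    With a flat prior the posterior for mu is N(xbar, sigma^2 / n),
    so this is the normal CDF evaluated at (0 - xbar) / (sigma / sqrt(n)).
    """
    standard_error = sigma / math.sqrt(n)
    return normal_cdf((0.0 - xbar) / standard_error)


# Illustrative (assumed) data: sample mean 0.4, sigma 1, n = 25, so z = 2.
p_value = one_sided_p_value(0.4, 1.0, 25)
posterior = posterior_prob_wrong_direction(0.4, 1.0, 25)
print(round(p_value, 4), round(posterior, 4))
```

The two numbers agree, which is the whole of the "neat trick": the match is a feature of this particular setup (one-sided test, flat prior), not a general license to read a P-value as a posterior.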

Franz: It’s a neat trick, but it’s unclear how this reconciliation advances our goals. Historically, the TFSI has not pushed the Bayesian line. We in psychology have enough trouble being taken as serious scientists.

Nayth: Journalists must be Bayesian because, let’s face it, we’re all biased, and we need to be up front about it.

Jenina: I know I’m here just to take notes for a story, but I believe Nate Silver made this pronouncement in his 10 or 11 point list in his presidential address to the Joint Statistical Meetings in 2013. Has it caught on?

Nayth: Well, of course I’d never let the writers of my on-line data-driven news journal introduce prior probabilities into their articles, are you kidding me? No way! We’ve got to keep it objective, push randomized-controlled trials, you know, regular statistical methods–you want me to lose my advertising revenue? Fugetaboutit!

Dora: Do as we say, not as we do– when it’s really important to us, and our cost-functions dictate otherwise.

Gerry: As Franz observes, the TFSI has not pushed the Bayesian line because we want people to use confidence intervals (CIs). Anything tests can do CIs do better.

S.C.: You still need tests. Notice you don’t use CIs alone in replication or fraudbusting.

Dora: No one’s going to notice we use the very methods we are against when we have to test if another researcher did a shoddy job. Mere mathematical quibbling I say.

Pawl: But it remains to show how confidence intervals can ensure ‘Popperian risk’ and ‘severe tests’. We may need to supplement CIs with some kind of severity analysis [as in Mayo] discussed in her blog. In the Year of Statistics, 2013, we promised we’d take up the challenge at long last, but have we? No. I’d like to hear from our new member, Dr. Ian Nydes.

Ian: I have a new and radical suggestion, and coming from a doctor, people will believe it: prove mathematically that “the high rate of non replication (lack of confirmation) of research discoveries is a consequence of the convenient, yet ill-founded strategy of claiming conclusive research findings solely on the basis of a single study assessed by formal statistical significance, typically for a p-value less than 0.05. Research is not most appropriately represented and summarized by p-values, but, unfortunately, there is a widespread notion that medical research articles should be interpreted based only on p-values….

It can be proven that most claimed research findings are false.”

S.C.: That was, word for word, in John Ioannidis’ celebrated (2005) paper: “Why Most Published Research Findings Are False.”  I and my colleague have serious problems with this alleged “proof”.

Ian: Of course it’s Ioannidis, and many before him (e.g., John Pratt, Jim Berger), but it doesn’t matter. It works. The best thing about it is that many frequentists have bought it–hook, line, and sinker–as a genuine problem for them!

Jake: Ioannidis is to be credited for calling wide attention to some of the terrible practices we’ve been on about since at least the 1960s (if not earlier).  Still, I agree with S.C. here: his “proof” works only if you basically assume researchers are guilty of all the sins that we’re trying to block: turning tests into dichotomous “up or down” affairs, p-hacking, cherry-picking and you name it. But as someone whose main research concerns power analysis, the thing I’m most disturbed about is his abuse of power in his computations.

Dora: Oomph! Power, shpower! Fussing over the correct use of such mathematical concepts is just, well, just mathematics; while surely the “Lord’s work,” it doesn’t pay the rent, and Ioannidis has stumbled on a gold mine!

Pawl: Do medical researchers really claim “conclusive research findings solely on the basis of a single study assessed by formal statistical significance”? Do you mean to tell me that medical researchers are actually engaging in worse practices than I’ve been chastising social science researchers about for decades?

Marty: I love it! We’re not the bottom of the scientific barrel after all–medical researchers are worse! 

Dora: And they really “are killing people”, unlike that wild exaggeration by Ziliak and McCloskey (2008, 186).

Gerry: Ioannidis (2005) is a wild exaggeration. But my main objection is that it misuses frequentist probability. Suppose you’ve got an urn filled with hypotheses, they can be from all sorts of fields, and let’s say 10% of them are true, the rest false. In an experiment involving randomly selecting a hypothesis from this urn, the probability of selecting one with the property “true” is 10%. But now suppose I pick out H’: “Ebola can be contracted by sexual intercourse”, or any hypothesis you like. It’s mistaken to say that the Pr(H’) = 0.10.

Pawl: It’s a common fallacy of probabilistic instantiation, like saying a particular 95% confidence interval estimate has probability .95.

S.C.: Only it’s worse, because we know the probabilities the hypotheses are true are very different from what you get in following his “Cult of the Holy Spike” priors.

Dora: More mathematical nit-picking. Who cares? My loss function says they’re irrelevant.

Ian: But Pawl, if all you care about is long-run screening probabilities, and positive predictive values (PPVs), then it’s correct; the thing is, it works! It’s got everyone all upset.
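(An aside on the screening arithmetic behind Ian's PPVs. This is a rough sketch, not Ioannidis's own calculation: the 10% share of true hypotheses comes from Gerry's urn, while the power of 0.8 or 0.2 and the 0.05 cutoff are assumed purely for illustration, as is the function name.)

```python
def positive_predictive_value(prior_true, power, alpha):
    """Share of 'significant' results that come from true hypotheses.

    prior_true: proportion of hypotheses in the urn that are true
    power:      probability a true hypothesis yields a significant result
    alpha:      probability a false hypothesis yields a significant result
    """
    true_positives = power * prior_true
    false_positives = alpha * (1.0 - prior_true)
    return true_positives / (true_positives + false_positives)


# Gerry's urn: 10% of hypotheses are true. Power and alpha are assumptions.
ppv_high_power = positive_predictive_value(0.10, 0.80, 0.05)
ppv_low_power = positive_predictive_value(0.10, 0.20, 0.05)
print(round(ppv_high_power, 2), round(ppv_low_power, 2))
```

With these assumed inputs, dropping the power from 0.8 to 0.2 pushes the PPV below one half, which is the sense in which "most findings are false". Note what the number describes: a rate over random draws from the urn. Gerry's objection stands untouched, since nothing here assigns a probability to the particular hypothesis H’.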

Nayth: Not everyone, we Bayesians love it.  The best thing is that it’s basically a Bayesian computation but with frequentist receiver operating curves!

Dora: It’s brilliant! No one cares about the delicate mathematical nit-picking. Ka-ching! (That’s the sound of my cash register, I’m an economist, you know.)

Pawl: I’ve got to be frank, I don’t see how some people, and I include some people in this room, can disparage the scientific credentials of significance tests and yet rely on them to indict other people’s research, and even to insinuate questionable research practices (QRPs) if not out and out fraud (as with Smeesters, Forster, Anil Potti, many others). Folks are saying that in the past we weren’t so hypocritical.


Gerry: I’ve heard others raise Pawl’s charge. They’re irate that the hard core reformers nowadays act like p-values can’t be trusted except when used to show that p-values can’t be trusted.  Is this something our committee has to worry about?

Marty: Nah! It’s a gig!

Jenina: I have to confess that this is a part of my own special “gig”. I’ve a simple 3-step recipe that automatically lets me publish an attention-grabbing article whenever I choose.

Nayth: Really?  What are they? (I love numbered lists.)

Jenina: It’s very simple:

Step #1: Non-replication: a story of some poor fool who thinks he’s on the brink of glory and fame when he gets a single statistically significant result in a social psychology experiment. Only then it doesn’t replicate! (Priming studies work well, or situated cognition).

Step #2: A crazy, sexy example where p-values are claimed to support a totally unbelievable, far-out claim. As a rule I take examples about sex (as in this study on penis size and voting behavior)–perhaps with a little evolutionary psychology twist thrown in to make it more “high brow”. (Appeals to ridicule about Fisher or Neyman-Pearson give it a historical flair, while still keeping it fun.) Else I choose something real spo-o-ky like ESP. (If I run out of examples, I take what some of the more vocal P-bashers are into, citing them of course, and getting extra brownie points.) Then, it’s a cinch to say “this stuff’s so unbelievable, if we just use our brains we’d know it was bunk!” (I guess you could call this my Bayesian cheerleading bit.)

Step #3: Ioannidis’ proof that most published research findings are false, illustrated with colorful charts.

Nayth: So it’s:

-Step #1: non-replication,
-Step #2: an utterly unbelievable but sexy “chump effect”[ii] that someone, somewhere has found statistically significant, and
-Step #3: Ioannidis (2005).

Jenina: (breaks into hysterical laughter): Yes, and sometimes,…(laughing so hard she can hardly speak)…sometimes the guy from Step #1, (who couldn’t replicate, and so couldn’t publish, his result) goes out and joins the replication movement and gets to publish his non-replication, without the hassle of peer review. (Ha Ha!)

(General laughter, smirking, groans, or head shaking)

[Nayth: (Aside to Jenina) Your step #2 is just like my toad example, except that turned out to be somewhat plausible.]

Dora: I love the three-step cha cha cha! It’s a win-win strategy: Non-replication, chump effect, Ioannidis (“most research findings are false”).

Marty: Non-replication, chump effect, Ioannidis. (What a great gig!)

Pawl: I’m more interested in a 2014 paper Ioannidis jointly wrote with several others on reducing research waste.

Gerry: I move we place it on our main reading list for our 2016 meeting and then move we adjourn to drinks and dinner. I’ve got reservations at a 5 star restaurant, all covered by TFSI Foundation.

Jake: I second. All in favor?

All: Aye

Pawl:  Adjourned. Doubles of Elbar Grease for all!

S.C.: Isn’t that Deborah Mayo’s special concoction?

Pawl: Yes, most of us, if truth be known, are closet or even open error statisticians!*

*This post, of course, is a parody or satire (statistical satirical); all quotes are authentic as cited. Send any corrections.


Parting Remark: Given Fisher’s declaration when first setting out tests to the effect that isolated results are too cheap to be worth having, Cox’s (1982) insistence donkey’s years ago that “It is very bad practice to summarise an important investigation solely by a value of P”, and a million other admonishments against statistical fallacies and lampoons, I’m starting to think that a license should be procured (upon passing a severe test) before being permitted to use statistical tests of any kind. My own conception is an inferential reformulation of Neyman-Pearson statistics in which one uses error probabilities to infer discrepancies that are well or poorly warranted by given data. (It dovetails also with Fisherian tests, as seen in Mayo and Cox 2010, using essentially the P-value distribution for sensitivity rather than attained power). Some of the latest movements to avoid biases, selection effects, ban hunting, cherry picking, multiple testing and the like, and to promote controlled trials and attention to experimental design are all to the good. They get their justification from the goal of avoiding corrupt error probabilities. Anyone who rejects the inferential use of error probabilities is hard pressed to justify the strenuous efforts to sustain them. These error probabilities, on which confidence levels and severity assessments are built, are very different from PPVs and similar computations that arise in the context of screening, say, thousands of genes. My worry is that the best of the New Reforms, in failing to make clear the error statistical basis for their recommendations, and confusing screening with the evidential appraisal of particular statistical hypotheses, will fail to halt some of the deepest and most pervasive confusions and fallacies about running and interpreting statistical tests. 
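To make the talk of discrepancies being "well or poorly warranted" a little more concrete, here is a minimal sketch under assumptions that are not stated in this post: a one-sided test of a normal mean with known standard deviation, where the claim mu > mu1 is assessed by the probability of a sample mean no larger than the one observed, were mu equal to mu1. The function names and the data are hypothetical, and this is only one simple illustration, not a full account.

```python
import math


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def severity_mu_greater_than(mu1, xbar, sigma, n):
    """Assessment of the claim 'mu > mu1' given an observed sample mean.

    Computed as P(Xbar <= observed xbar; mu = mu1): how probable a result
    this small or smaller would be if the discrepancy were only mu1.
    """
    standard_error = sigma / math.sqrt(n)
    return normal_cdf((xbar - mu1) / standard_error)


# Assumed data: sample mean 0.4, sigma 1, n = 25 (standard error 0.2).
candidate_discrepancies = [0.0, 0.2, 0.4, 0.6]
severities = [
    severity_mu_greater_than(mu1, 0.4, 1.0, 25)
    for mu1 in candidate_discrepancies
]
for mu1, sev in zip(candidate_discrepancies, severities):
    print(f"mu > {mu1}: {sev:.3f}")
```

In this toy case the same data warrant "mu > 0" well, "mu > 0.4" no better than a coin flip, and "mu > 0.6" poorly. The assessment attaches to a particular claim about a particular parameter, which is quite different from the screening rates that PPVs report.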


[i] References here are to Popper, 1977, 1962; Mayo, 1991, 1996; Salmon, 1984.

[ii] From “The Chump Effect: Reporters are credulous, studies show”, by Andrew Ferguson.

“Entire journalistic enterprises, whole books from cover to cover, would simply collapse into dust if even a smidgen of skepticism were summoned whenever we read that “scientists say” or “a new study finds” or “research shows” or “data suggest.” Most such claims of social science, we would soon find, fall into one of three categories: the trivial, the dubious, or the flatly untrue.”

(selected) REFERENCES:

Cohen, J. (1994). The Earth is round (p < .05). American Psychologist, 49, 997-1003.

Cox, D. R. (1958). Some problems connected with statistical inference. Annals of Mathematical Statistics 29 : 357-372.

Cox, D. R. (1977). The role of significance tests. (With discussion). Scand. J. Statist. 4 : 49-70.

Cox, D. R. (1982). Statistical significance tests. Br. J. Clinical. Pharmac. 14 : 325-331.

Gigerenzer, G. et al. (1989), The Empire of Chance, CUP.

Gigerenzer, G. (2000), “The Superego, the Ego, and the Id in Statistical Reasoning,” Adaptive Thinking: Rationality in the Real World, OUP.

Gelman, A. (2011), “Induction and Deduction in Bayesian Data Analysis,” RMM vol. 2, 2011, 67-78. Special Topic: Statistical Science and Philosophy of Science: where do (should) they meet in 2011 and beyond?

Greenland, S. and Poole, C. (2013), “Living with P Values: Resurrecting a Bayesian Perspective on Frequentist Statistics,” Epidemiology 24: 62-8. 

Ioannidis, J. (2005), “Why Most Published Research Findings Are False,” PLoS Medicine 2(8): e124.

Mayo, D. G. (2012). “Statistical Science Meets Philosophy of Science Part 2: Shallow versus Deep Explorations,” Rationality, Markets, and Morals (RMM) 3, Special Topic: Statistical Science and Philosophy of Science, 71–107.

Mayo, D. G. and Cox, D. R. (2010). “Frequentist Statistics as a Theory of Inductive Inference” in Error and Inference: Recent Exchanges on Experimental Reasoning, Reliability and the Objectivity and Rationality of Science (D Mayo and A. Spanos eds.), Cambridge: Cambridge University Press: 1-27. This paper appeared in The Second Erich L. Lehmann Symposium: Optimality, 2006, Lecture Notes-Monograph Series, Volume 49, Institute of Mathematical Statistics, pp. 247-275.

Mayo, D. G. and Spanos, A. (2011) “Error Statistics” in Philosophy of Statistics , Handbook of Philosophy of Science Volume 7 Philosophy of Statistics, (General editors: Dov M. Gabbay, Paul Thagard and John Woods; Volume eds. Prasanta S. Bandyopadhyay and Malcolm R. Forster.) Elsevier: 1-46.

McCloskey, D. N., & Ziliak, S. T. (1996). The standard error of regression. Journal of Economic Literature, 34(1), 97-114.

Meehl, P. E. (1990), “Why summaries of research on psychological theories are often uninterpretable,” Psychological Reports, 66, 195-244.

Meehl, P. and Waller, N. (2002), “The Path Analysis Controversy: A New Statistical Approach to Strong Appraisal of Verisimilitude,” Psychological Methods, Vol. 7: 283–300.

Orlitzky, M. (2012), “How Can Significance Tests be Deinstitutionalized?” Organizational Research Methods 15(2): 199-228.

Popper, K. (1962). Conjectures and Refutations. NY: Basic Books.

Popper, K. (1977). The Logic of Scientific Discovery, NY: Basic Books. (Original published 1959)

Rozeboom, W. (1997), “Good Science is Abductive, not hypothetico-deductive.” In L. L. Harlow, S. A. Mulaik, & J. H. Steiger (Eds.), What if there were no significance tests? (pp. 335-391). Hillsdale, NJ: Lawrence Erlbaum.

Salmon, W. C. (1984). Scientific Explanation and the Causal Structure of the World, Princeton, NJ: Princeton.

Schmidt, F. (1996), “Statistical Significance Testing and Cumulative Knowledge in Psychology: Implications for Training of Researchers,” Psychological Methods, Vol. 1(2): 115-129.

Silver, N. (2012), The Signal and the Noise, Penguin.

Ziliak, S. T., & McCloskey, D. N. (2008), The cult of statistical significance: How the standard error costs us jobs, justice, and lives. Ann Arbor: University of Michigan Press. (Short piece see: “The Cult of Statistical Significance” from Section on Statistical Education – JSM 2009).


19 thoughts on “2015 Saturday Night Brainstorming and Task Forces: (4th draft)”

  1. Who are Pawl, Franz, Gerry, Jake, Dora, Marty, Nayth, S.C., Ian, Jenina in this statistical satire? A free book from the palindrome list for getting 9 of 10 correct (by Feb. 7). First 2 winners only. Hint: The answers are within the post.

  2. Dears,

    Funny! But the issue seems simple to me, and I worry that making it sound oh-so-Complicated-and-Philosophical will cause more damage. If a man wants to bet with you about the throw of a die and (1.) you have prior knowledge that he is utterly honest and the die is utterly fair and (2.) you don’t care how it comes up, which is to say that nothing hinges on the outcome, then NHST is fine. You have no priors and no loss function, which is the (rare) situation Fisher imagined. Gosset, who was busy brewing beer, was not in such a situation. Nor are most applied statisticians.


    Deirdre McCloskey

    • Deirdre:
      Thanks for your comment. I’m finding it odd, now that I rearranged my blog into two columns in order that commentators didn’t have to squish into a small space, that I’m rarely getting comments—despite getting double and triple the normal hits! Go figure.

      Now as far as what I think, as opposed to what I was doing in this little parody on the current and recent state of play in the statistical reform movements (building on my earlier years which focused on psych and social sciences) what I say above is that NHST is “a fallacious version of statistical significance tests that dares to violate Fisher’s first rule: It’s illicit to move directly from statistical to substantive effects”. But let’s say you’re only talking about ordinary Fisherian tests where a small p-value might be taken as only indicating a discrepancy from the null (in some direction). (Such “isolated” effects, as Fisher calls them, don’t qualify as evidence of genuine effects yet, but might indicate some discrepancy to be probed further.) I think there are uses for such “pure” significance tests (e.g., in Cox’s case of “dividing” nulls, or checking assumptions) but outside of those cases, I’d insist on an assessment of the discrepancy from null that is and is not indicated.

      OK, now you’re saying that the applied statistician would introduce a loss function and maybe priors. Consider the loss function, since I know you’ve often said that. We’re testing if there’s a spike in the radiation level at Fukushima, and we are to assume, as you say, that the company X is playing fair and not muffling the dosimeters as they routinely do or did. Company X has a different loss function than the clean-up worker, and suppose they say “it’s a safe level” even if it would be assessed very differently on the worker’s loss function. So if I wanted to know if the radiation level had increased beyond k, I’d get different answers. If you say, well I wouldn’t let those loss functions influence my readings, then why did you at first say we’d have a loss function? If loss functions alter the evidential assessment, then different players wind up with different assessments of radiation levels. Does that make sense? I say no. If you say, of course the estimate of k would not be done with the loss functions, but in deciding how many months one should be allowed to work on the clean-up, we’d have to negotiate how to balance the concern that workers not get their yearly max dose too readily with the cost of not cleaning up the disaster. And so losses enter for the policy decision. But that is quite distinct from the estimate of k, and in fact it depends on being able to get an estimate of the radiation levels.

      So here we have cost-benefit functions aplenty, but an aversion to allowing them to enter in obtaining rival estimates of radiation doses (in a given plant/job/period), even while we grant that some policy decisions would need to be made in order that Japan not run out of a Fukushima clean-up labor pool in 5 years, as some estimate. Do you agree?

      Dosimeter of Fukushima citizen counts 40% lower than actual, Maker admits

  3. A number of people have written to me about this, even though they didn’t comment (but they should) and have sent me good stuff. One question arose as to whether I was mixing views of mine with those of members of this committee. First, since this is a blog and I was only planning a quick add-on to the previous years, it isn’t fully balanced or anything. Second, and most important, insofar as I knew or know several of the people involved (not all of course), I know what they’ve said, written, or thought about some of the other criticisms. My point is that if you imagine a real committee, with newer guys coming in with possibly more extreme views, it makes perfect sense for some of the older guard to raise issues. And they knew or know my work; the cites to me are actual. They are scarcely all in sync on key matters, and the reservations I expressed were theirs (literally) or inferred from what they’ve elsewhere said. I had no time to flesh things out further, and even got too tired to look up the Cox reference in my “Parting Remark”. Please do send corrections, comments, and questions. Well, at least I laughed, maybe that’s the point of satire.

  4. amateurstats

    If using p-values results most of the time in scenarios like the following:

    The first experiment gets a p-value of .001 in favor of the existence of a real effect, while the second and third experiments get p-values of .5 and .3 respectively.

    Then anyone who complains about p-values being useless is hypocritical since p-values were used to show the effect wasn’t real? I had never looked at it that way before. Thank you.

    • Well, there are many ways, e.g., using p-value reasoning to determine that nominal or computed p-values were not actual p-values (as in the green jelly beans and acne cartoon), using significance tests to determine data are too good to be true, to show when failed replications indict previously observed effects, and to check assumptions of stat models used to test other claims. In general, error probabilities of procedures are the key to identifying (and taking account of) selection effects like cherry picking, multiple testing, stopping rules, etc. That is because error probabilities are self-correcting, meta-statistical quantities that concern the capabilities and incapabilities of tests.

      • amateurstats

        I’m not familiar with the terms “nominal p-value” and “real p-value”. What’s the definition for these that I can use to distinguish a nominal p-value from a real p-value in practice?

        • Amateurstats: The nominal or computed p-value just reflects the number of standard deviations the observed is from the null. That can have no bearing whatever on the actual error rate because of failed assumptions or data-dependent selection effects. Please check this blog for a standard passage from “The Significance Test Controversy” in 1970.

          For a recent source, Goldacre’s “Bad Statistics”. Hunting and cherry picking, post-data subgroup selection, and multiple testing are amongst methods that render the computed or nominal p-value utterly wrong as an assessment of the actual p-value. The section on selection effects in Mayo and Cox (2010) is another. You may also check “optional stopping” on this blog.
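          To make the nominal/actual contrast concrete, here is a minimal simulation sketch. It is only an illustration under stated assumptions, not anything from the sources cited above: every null hypothesis is true, the k test statistics are independent standard normals, and the "hunter" reports only the smallest of the k nominal p-values. The function names (`two_sided_p`, `smallest_nominal_p`, `actual_error_rate`) are hypothetical, made up for this example.

```python
import math
import random


def two_sided_p(z):
    """Nominal two-sided p-value for a standard normal test statistic z."""
    return math.erfc(abs(z) / math.sqrt(2))


def smallest_nominal_p(k, rng):
    """Run k independent tests of TRUE nulls and report only the smallest p-value.

    This mimics hunting or cherry picking: the reported number is still a
    correctly computed (nominal) p-value for the one test that is shown.
    """
    return min(two_sided_p(rng.gauss(0.0, 1.0)) for _ in range(k))


def actual_error_rate(k, alpha=0.05, trials=20000, seed=1):
    """Estimate how often the hunting procedure reports p <= alpha
    even though no real effect exists (assumption: all nulls true)."""
    rng = random.Random(seed)
    hits = sum(smallest_nominal_p(k, rng) <= alpha for _ in range(trials))
    return hits / trials


if __name__ == "__main__":
    alpha = 0.05
    for k in (1, 5, 20):
        simulated = actual_error_rate(k, alpha)
        # For independent tests the exact rate is 1 - (1 - alpha)^k.
        exact = 1 - (1 - alpha) ** k
        print(f"k={k:2d}  simulated={simulated:.3f}  exact={exact:.3f}")
```

          With k = 1 the procedure's error rate stays near the nominal .05; as k grows, the probability of reporting some "p ≤ .05" rises toward 1 - (1 - .05)^k, even though each reported p-value was computed correctly for its own test. That gap is what the nominal p-value fails to register.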

  5. In a note from Gigerenzer today was an article of his with a valuable recognition: Fisher and Neyman (and I would add E. Pearson) were radically opposed to recipe-like uses of tests. His paper contrasts this with the automaticity of many Bayesians.

    “If statisticians agree on one thing, it is that scientific inference should not be made mechanically. Despite virulent disagreements on other issues, Ronald Fisher and Jerzy Neyman, two of the most influential statisticians of the 20th century, were of one voice on this matter. Good science requires both statistical tools and informed judgment about what model to construct, what hypotheses to test, and what tools to use. Practicing statisticians rely on a “statistical toolbox” and on their expertise to select a proper tool; social scientists, in contrast, tend to rely on a single tool”.

    Surrogate Science: The Idol of a Universal Method for Scientific Inference
    Gerd Gigerenzer and Julian N. Marewski
    Journal of Management published online 2 September 2014 DOI: 10.1177/0149206314547522
    The online version of this article can be found at:

  6. amateurstats:
    You have astutely named yourself.

  7. Gerd Gigerenzer

    I love this satire, it is so true!

    • Gerd:
      Thanks so much for having the sense of humor to appreciate this, even with the little Freudian jokes. Of course, you get one of the best lines.

  8. Name

    This is great: ‘immaculate statistical conception’. I hope to see that term peer reviewed so it can be cited by future generations.

    • It’s a quote.

      • Name

        Sorry, I do not have access to the entire book (only the pages google books allows). I thought the “immaculate statistical procedures” was a quote from Nate Silver, but the “immaculate statistical conception” term was new. Is it really in that book?

  9. Sorry for being late to the party here – enjoyable satire indeed. And I was even able to guess most of the real world actors behind the pseudonyms.

    Maybe of interest to you, Mayo: here is a psych journal that as of 2014 has in fact officially banned significance testing, as described in the editorial of the current issue.


    “The Basic and Applied Social Psychology (BASP) 2014 Editorial emphasized that the null hypothesis significance testing procedure (NHSTP) is invalid, and thus authors would be not required to perform it (Trafimow, 2014). However, to allow authors a grace period, the Editorial stopped short of actually banning the NHSTP. The purpose of the present Editorial is to announce that the grace period is over. From now on, BASP is banning the NHSTP.”

    • This has been tried many times in the past, never stuck. I wonder if they define invalid? Undoubtedly they allow confidence intervals, which are open to at least as much abuse, yes?

      • Actually, no – they also ban confidence intervals. They claim to entertain Bayesian analyses on a case-by-case basis, but prefer description and effect sizes.

        I admit that I was surprised reading this editorial.
