Posts Tagged With: significance tests

2015 Saturday Night Brainstorming and Task Forces: (3rd draft)


TFSI workgroup

Saturday Night Brainstorming: The TFSI on NHST–part reblog from here and here, with a substantial 2015 update!

Each year leaders of the movement to “reform” statistical methodology in psychology, social science, and other areas of applied statistics get together around this time for a brainstorming session. They review the latest from the Task Force on Statistical Inference (TFSI), propose new regulations they would like to see adopted, not just by the APA publication manual any more, but by all science journals! Since it’s Saturday night, let’s listen in on part of an (imaginary) brainstorming session of the New Reformers.

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Frustrated that the TFSI has still not banned null hypothesis significance testing (NHST)–a fallacious version of statistical significance tests that dares to violate Fisher’s first rule: It’s illicit to move directly from statistical to substantive effects–the New Reformers have created, and very successfully published in, new meta-level research paradigms designed expressly to study (statistically!) a central question: have the carrots and sticks of reward and punishment been successful in decreasing the use of NHST, and promoting instead use of confidence intervals, power calculations, and meta-analysis of effect sizes? Or not?  

Most recently, the group has helped successfully launch a variety of “replication and reproducibility projects”. Having discovered how much the reward structure encourages bad statistics and gaming the system, they have cleverly pushed to change the reward structure: Failed replications (from a group chosen by a crowd-sourced band of replicationistas) would not be hidden in those dusty old file drawers, but would be guaranteed to be published without that long, drawn-out process of peer review. Do these failed replications indicate the original study was a false positive? Or that the replication attempt is a false negative? It’s hard to say.

This year, as is typical, there is a new member who is pitching in to contribute what he hopes are novel ideas for reforming statistical practice. In addition, for the first time, there is a science reporter blogging the meeting for her next freelance “bad statistics” piece for a high impact science journal. Notice, it seems this committee only grows; no one has dropped off in the 3 years I’ve followed them.

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Pawl: This meeting will come to order. I am pleased to welcome our new member, Dr. Ian Nydes, adding to the medical strength we have recently built with epidemiologist S.C.. In addition, we have a science writer with us today, Jenina Oozo. To familiarize everyone, we begin with a review of old business, and gradually turn to new business.

Franz: It’s so darn frustrating after all these years to see researchers still using NHST methods; some of the newer modeling techniques routinely build on numerous applications of those pesky tests.

Jake: And the premier publication outlets in the social sciences still haven’t mandated the severe reforms sorely needed. Hopefully the new blood, Dr. Ian Nydes, can help us go beyond resurrecting the failed attempts of the past.

Marty: Well, I have with me a comprehensive 2012 report by M. Orlitzky that observes that “NHST is used in 94% of all articles in the Journal of Applied Psychology….Similarly, in economics, reliance on NHST has actually increased rather than decreased after McCloskey and Ziliak’s (1996) critique of the prevalence of NHST in the American Economic Review (Ziliak & McCloskey, 2009)”.

Dora: Oomph! Maybe their articles made things worse; I’d like to test if the effect is statistically real or not.

Pawl: Yes, that would be important. But, what new avenue can we try that hasn’t already been attempted and failed (if not actually galvanized NHST users)?  There’s little point in continuing with methods whose efficacy has not stood up to stringent tests.  Might we just declare that NHST is ‘‘surely the most bone-headedly misguided procedure ever institutionalized in the rote training of science students’’?

Franz:  Already tried. Rozeboom 1997, page 335.  Very, very similar phrasing also attempted by many, many others over 50 years.  All failed. Darn.

Jake: Didn’t it kill to see all the attention p-values got with the 2012 Higgs Boson discovery?  P-value policing by Lindley and O’Hagan (to use a term from the Normal Deviate) just made things worse.

Pawl: Indeed! And the machine’s back on in 2015. Fortunately, one could see the physicists’ analysis in terms of frequentist confidence intervals.

Nayth: As the “non-academic” Big Data member of the TFSI, I have something brand new: explain that “frequentist methods–in striving for immaculate statistical procedures that can’t be contaminated by the researcher’s bias–keep him hermetically sealed off from the real world.”

Gerry: Declared by Nate Silver 2012, page 253. Anyway, we’re not out to condemn all of frequentist inference, because then our confidence intervals go out the window too! Let alone our replication gig.

Marty: It’s a gig that keeps on giving.

Dora: I really like the part about the ‘immaculate statistical conception’.  It could have worked except for the fact that, although people love Silver’s book overall, they regard the short part on frequentist statistics as so far-fetched as to have been sent from another planet!

Nayth: Well here’s a news flash, Dora: I’ve heard the same thing about Ziliak and McCloskey’s book!

Gerry: If we can get back to our business, my diagnosis is that these practitioners are suffering from a psychological disorder; their “mindless, mechanical behavior” is very much “reminiscent of compulsive hand washing.”  It’s that germaphobia business that Nayth just raised. Perhaps we should begin to view ourselves as Freudian analysts who empathize with the “the anxiety and guilt, the compulsive and ritualistic behavior foisted upon” researchers. We should show that we understand how statistical controversies are “projected onto an ‘intrapsychic’ conflict in the minds of researchers”. It all goes back to that “hybrid logic” attempting “to solve the conflict between its parents by denying its parents.”

Pawl: Oh My, Gerry!  That old Freudian metaphor scarcely worked even after Gigerenzer popularized it. 2000, pages 283, 280, and 281.

Gerry: I thought it was pretty good, especially the part about “denying its parents”.

Dora: I like the part about the “compulsive hand washing”. Cool!

Jake: Well, we need a fresh approach, not redundancy, not repetition. So how about we come right out with it: “What’s wrong with NHST?  Well, … it does not tell us what we want to know, and we so much want to know what we want to know that, out of desperation, we nevertheless believe that it” tells us what we want to know, because we want to know what we want…

Dora: Woah Jake!  Slow down. That was Cohen 1994, page 202, remember?  But I agree with Jake that we’ve got to shout it out with all the oomph we can muster, even frighten people a little bit: “Statistical significance is hurting people, indeed killing them”!  NHST is a method promoted by that Fisherian cult of bee-keepers.

Pawl: She’s right, oh my: “I suggest to you that Sir Ronald has befuddled us, mesmerized us…. [NHST] is basically unsound, poor scientific strategy, and one of the worst things that ever happened in the history of psychology.” Merely refuting the null hypothesis is too weak to corroborate substantive theories, “we have to have ‘Popperian risk’, ‘severe test’ [as in Mayo], or what philosopher Wesley Salmon called a highly improbable coincidence” [i].

Gerry: H-e-ll-o! Dora and Pawl are just echoing the words in Ziliak and McCloskey 2008, page 186, and Meehl 1991, page 18; Meehl and Waller 2002, page 184, respectively.

Marty: Quite unlike Meehl, some of us deinstitutionalizers and cultural organizational researchers view Popper as not a hero but as the culprit. No one is alerting researchers that “NHST is the key statistical technique that puts into practice hypothetico-deductivism, the scientific inference procedure based on Popper’s falsifiability criterion. So, as long as the [research] community is devoted to hypothetico-deductivism, NHST will likely persist”. Orlitzky 2012, 203.  Rooting Popper out is imperative, if we’re ever going to deinstitutionalize NHST.

Jake: You want to ban Popper too?  Now you’re really going to scare people off our mission.

Nayth: True, it’s not just Meehl who extols Popper. Even some of the most philosophical of statistical practitioners are channeling Popper.  I was just reading an on-line article by Andrew Gelman. He says:

“At a philosophical level, I have been persuaded by the arguments of Popper (1959), … and others that scientific revolutions arise from the identification and resolution of anomalies. In statistical terms, an anomaly is a misfit of model to data (or perhaps an internal incoherence of the model), and it can be identified by a (Fisherian) hypothesis test without reference to any particular alternative (what Cox and Hinkley 1974 call “pure significance testing”). ….At the next stage, we see science—and applied statistics—as resolving anomalies via the creation of improved models …. This view corresponds closely to the error-statistics idea of Mayo (1996)” (Gelman 2011, 70)

Jake: Of course Popper’s prime example of non-falsifiable science was Freudian/Adlerian psychology, which gave psychologist Paul Meehl conniptions because he was a Freudian as well as a Popperian. I’ve always suspected that’s one reason Meehl castigated experimental psychologists who could falsify via P-values, and thereby count as scientific (by Popper’s lights), whereas he could not. At least not yet.

Gerry: Maybe for once we should set the record straight: “It should be recognized that, according to Fisher, rejecting the null hypothesis is not equivalent to accepting the efficacy of the cause in question. The latter cannot be established on the basis of one single experiment but requires obtaining more significant results when the experiment, or an improvement of it, is repeated at other laboratories or under other conditions.”

Pawl: That was tried by Gigerenzer in “The Inference Experts” (1989, 96).

S.C.: This is radical but maybe p-values should just be used as measures of observed fit, and all inferential uses of significance testing banned.

Franz: But then you give up control of error probabilities. Instead of nagging about bans and outlawing, I say we try a more positive approach: point out how meta-analysis “means that cumulative understanding and progress in theory development is possible after all.”

(Franz stands. Chest up, chin out, hand over his heart):

“It means that the behavioral and social sciences can attain the status of true sciences: they are not doomed forever to the status of quasi-sciences or pseudoscience. ..[T]he gloom, cynicism, and nihilism that have enveloped many in the behavioral and social sciences is lifting. Young people starting out in the behavioral and social sciences today can hope for a much brighter future.”

Pawl: My! That was incredibly inspiring Franz.

Dora: Yes, really moving, only …

Gerry:  Only problem is, Schmidt’s already said it, 1996, page 123.

S.C.: How’s that meta-analysis working out for you social scientists, huh? Is the gloom lifting?

Franz: It was until we saw the ‘train wreck looming’ (after Diederik Stapel)… but now we have replication projects.

Nayth: From my experience as a TV pundit, I say just tell everyone how bad data and NHST are producing “‘statistically significant’ (but manifestly ridiculous) findings” like how toads can predict earthquakes. You guys need to get on the Bayesian train.

Marty: Is it leaving?  Anyway, this is in Nathan Silver’s recent book, page 253. But I don’t see why it’s so ridiculous, I’ll bet it’s true. I read that some woman found that all the frogs she had been studying every day just up and disappeared from the pond just before that quake in L’Aquila, Italy, in 2009.

Dora: I’m with Marty. I really, really believe animals pick up those ions from sand and pools of water near earthquakes. My gut feeling is it’s very probable. Does our epidemiologist want to jump in here?

S.C.: Not into the green frog pool I should hope! Nyuk! Nyuk! But I do have a radical suggestion that no one has so far dared to utter.

Dora: Oomph! Tell, tell!

S.C.: “Given the extensive misinterpretations of frequentist statistics and the enormous (some would say impossible) demands made by fully Bayesian analyses, a serious argument can be made for de-emphasizing (if not eliminating) inferential statistics in favor of more data presentation… Without an inferential ban, however, an improvement of practice will require re-education, not restriction”.

Marty: “Living With P-Values,” Greenland and Poole in a 2013 issue of Epidemiology. An “inferential ban”. Wow, that’s music to the deinstitutionalizer’s ears.

Pawl: I just had a quick look, but their article appears to resurrect the same-old same-old: P-values are or have to be (mis)interpreted as posteriors, so here are some priors to do the trick. Of course their reconciliation between P-values and the posterior probability (with weak priors) that you’ve got the wrong directional effect (in one-sided tests) was shown long ago by Cox, Pratt, and others.
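
[Editorial aside: a minimal sketch of the kind of reconciliation Pawl is alluding to, under assumptions of my own (normal data, known sigma, a flat improper prior on µ) rather than Greenland and Poole’s exact setup. In this simple case the one-sided p-value coincides numerically with the posterior probability of a directional error.]

```python
# Sketch: with normal data, known sigma, and a flat (improper) prior on mu,
# the one-sided p-value for H0: mu <= 0 equals the posterior probability
# that the effect actually goes in the "wrong" (non-positive) direction.
from scipy.stats import norm
import numpy as np

sigma, n, m_obs = 1.0, 25, 0.4      # hypothetical data summary
se = sigma / np.sqrt(n)

p_one_sided = norm.sf(m_obs / se)                          # P(M >= m_obs ; mu = 0)
post_wrong_direction = norm.cdf(0.0, loc=m_obs, scale=se)  # P(mu <= 0 | data), flat prior

print(p_one_sided, post_wrong_direction)                   # both ~0.0228
```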

Franz: It’s a neat trick, but it’s unclear how this reconciliation advances our goals. Historically, the TFSI has not pushed the Bayesian line. We in psychology have enough trouble being taken as serious scientists.

Nayth: Journalists must be Bayesian because, let’s face it, we’re all biased, and we need to be up front about it.

Jenina: I know I’m here just to take notes for a story, but I believe Nate Silver made this pronouncement in his 10 or 11 point list in his presidential address to the Joint Statistical Meetings in 2013. Has it caught on?

Nayth: Well, of course I’d never let the writers of my on-line data-driven news journal introduce prior probabilities into their articles, are you kidding me? No way! We’ve got to keep it objective, push randomized-controlled trials, you know, regular statistical methods–you want me to lose my advertising revenue? Fugetaboutit!

Dora: Do as we say, not as we do– when it’s really important to us, and our cost-functions dictate otherwise.

Gerry: As Franz observes, the TFSI has not pushed the Bayesian line because we want people to use confidence intervals (CIs). Anything tests can do CIs do better.

S.C.: You still need tests. Notice you don’t use CIs alone in replication or fraudbusting.

Dora: No one’s going to notice we use the very methods we are against when we have to test if another researcher did a shoddy job. Mere mathematical quibbling I say.

Pawl: But it remains to show how confidence intervals can ensure ‘Popperian risk’ and ‘severe tests’. We may need to supplement CIs with some kind of severity analysis [as in Mayo] discussed in her blog. In the Year of Statistics, 2013, we promised we’d take up the challenge at long last, but have we?  No. I’d like to hear from our new member, Dr. Ian Nydes.

Ian: I have a new and radical suggestion, and coming from a doctor, people will believe it: prove mathematically that “the high rate of nonreplication (lack of confirmation) of research discoveries is a consequence of the convenient, yet ill-founded strategy of claiming conclusive research findings solely on the basis of a single study assessed by formal statistical significance, typically for a p-value less than 0.05. Research is not most appropriately represented and summarized by p-values, but, unfortunately, there is a widespread notion that medical research articles should be interpreted based only on p-values….

It can be proven that most claimed research findings are false”

S.C.: That was, word for word, in John Ioannidis’ celebrated (2005) paper: “Why Most Published Research Findings Are False.”  I and my colleague have serious problems with this alleged “proof”.

Ian: Of course it’s Ioannidis, and many before him (e.g., John Pratt, Jim Berger), but it doesn’t matter. It works. The best thing about it is that many frequentists have bought it–hook, line, and sinker–as a genuine problem for them!

Jake: Ioannidis is to be credited for calling wide attention to some of the terrible practices we’ve been on about since at least the 1960s (if not earlier).  Still, I agree with S.C. here: his “proof” works only if you basically assume researchers are guilty of all the sins that we’re trying to block: turning tests into dichotomous “up or down” affairs, p-hacking, cherry-picking and you name it. But as someone whose main research concerns power analysis, the thing I’m most disturbed about is his abuse of power in his computations.

Dora: Oomph! Power, shpower! Fussing over the correct use of such mathematical concepts is just, well, just mathematics; while surely the “Lord’s work,” it doesn’t pay the rent, and Ioannidis has stumbled on a gold mine!

Pawl: Do medical researchers really claim “conclusive research findings solely on the basis of a single study assessed by formal statistical significance”? Do you mean to tell me that medical researchers are actually engaging in worse practices than I’ve been chastising social science researchers about for decades?

Marty: I love it! We’re not the bottom of the scientific barrel after all–medical researchers are worse! 

Dora: And they really “are killing people”, unlike that wild exaggeration by Ziliak and McCloskey (2008, 186).

Gerry: Ioannidis (2005) is a wild exaggeration. But my main objection is that it misuses frequentist probability. Suppose you’ve got an urn filled with hypotheses, they can be from all sorts of fields, and let’s say 10% of them are true, the rest false. In an experiment involving randomly selecting a hypothesis from this urn, the probability of selecting one with the property “true” is 10%. But now suppose I pick out H': “Ebola can be contracted by sexual intercourse”, or any hypothesis you like. It’s mistaken to say that the Pr(H’) = 0.10.

Pawl: It’s a common fallacy of probabilistic instantiation, like saying a particular 95% confidence interval estimate has probability .95.

S.C.: Only it’s worse, because we know the probabilities the hypotheses are true are very different from what you get in following his “Cult of the Holy Spike” priors.

Dora: More mathematical nit-picking. Who cares? My loss function says they’re irrelevant.

Ian: But Pawl, if all you care about is long-run screening probabilities, and posterior predictive values (PPVs), then it’s correct; the thing is, it works! It’s got everyone all upset.

Nayth: Not everyone, we Bayesians love it.  The best thing is that it’s basically a Bayesian computation but with frequentist receiver operating curves!

Dora: It’s brilliant! No one cares about the delicate mathematical nit-picking. Ka-ching! (That’s the sound of my cash register, I’m an economist, you know.)

Pawl: I’ve got to be frank, I don’t see how some people, and I include some people in this room, can disparage the scientific credentials of significance tests and yet rely on them to indict other people’s research, and even to insinuate questionable research practices (QRPs) if not out and out fraud (as with Smeesters, Forster, Anil Potti, many others). Folks are saying that in the past we weren’t so hypocritical.

(silence)

Gerry: I’ve heard others raise Pawl’s charge. They’re irate that the hard-core reformers nowadays act like p-values can’t be trusted except when used to show that p-values can’t be trusted.  Is this something our committee has to worry about?

Marty: Nah! It’s a gig!

Jenina: I have to confess that this is a part of my own special “gig”. I’ve a simple 3-step recipe that automatically lets me publish an attention-grabbing article whenever I choose.

Nayth: Really?  What are they? (I love numbered lists.)

Jenina: It’s very simple:

Step #1: Non-replication: a story of some poor fool who thinks he’s on the brink of glory and fame when he gets a single statistically significant result in a social psychology experiment. Only then it doesn’t replicate! (Priming studies work well, or situated cognition).

Step #2: A crazy, sexy example where p-values are claimed to support a totally unbelievable, far-out claim. As a rule I take examples about sex (as in this study on penis size and voting behavior)–perhaps with a little evolutionary psychology twist thrown in to make it more “high brow”. (Appeals to ridicule about Fisher or Neyman-Pearson give it a historical flair, while still keeping it fun.) Else I choose something real spo-o-ky like ESP. (If I run out of examples, I take what some of the more vocal P-bashers are into, citing them of course, and getting extra brownie points.) Then, it’s a cinch to say “this stuff’s so unbelievable, if we just use our brains we’d know it was bunk!” (I guess you could call this my Bayesian cheerleading bit.)

Step #3: Ioannidis’ proof that most published research findings are false, illustrated with colorful charts.

Nayth: So it’s:

- Step #1: non-replication,
- Step #2: an utterly unbelievable but sexy “chump effect”[ii] that someone, somewhere has found statistically significant, and
- Step #3: Ioannidis (2005).

Jenina: (breaks into hysterical laughter): Yes, and sometimes,…(laughing so hard she can hardly speak)…sometimes the guy from Step #1, (who couldn’t replicate, and so couldn’t publish, his result) goes out and joins the replication movement and gets to publish his non-replication, without the hassle of peer review. (Ha Ha!)

(General laughter, smirking, groans, or head shaking)

[Nayth: (Aside to Jenina) Your step #2 is just like my toad example, except that turned out to be somewhat plausible.]

Dora: I love the three-step cha cha cha! It’s a win-win strategy: Non-replication, chump effect, Ioannidis (“most research findings are false”).

Marty: Non-replication, chump effect, Ioannidis. (What a great gig!)

Pawl: I’m more interested in a 2014 paper Ioannidis jointly wrote with several others on reducing research waste.

Gerry: I move we place it on our main reading list for our 2016 meeting and then move we adjourn to drinks and dinner. I’ve got reservations at a 5 star restaurant, all covered by TFSI Foundation.

Jake: I second. All in favor?

All: Aye

Pawl:  Adjourned. Doubles of Elbar Grease for all!

S.C.: Isn’t that Deborah Mayo’s special concoction?

Pawl: Yes, most of us, if truth be known, are closet or even open error statisticians!

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Parting Remark: Given Fisher’s declaration when first setting out tests, to the effect that isolated results are too cheap to be worth having, Cox’s insistence donkey’s years ago that “It is very bad practice to summarise an important investigation solely by a value of P”, and a million other admonishments against statistical fallacies and lampoons, I’m starting to think that a license should be procured (upon passing a severe test) before being permitted to use statistical tests of any kind. My own conception is an inferential reformulation of Neyman-Pearson statistics in which one uses error probabilities to infer discrepancies that are well or poorly warranted by given data. (It dovetails also with Fisherian tests, as seen in Mayo and Cox 2010, using essentially the P-value distribution for sensitivity rather than attained power.) Some of the latest movements to avoid biases, selection effects, barn hunting, cherry picking, multiple testing and the like, and to promote controlled trials and attention to experimental design are all to the good. They get their justification from the goal of avoiding corrupt error probabilities. Anyone who rejects the inferential use of error probabilities is hard pressed to justify the strenuous efforts to sustain them. These error probabilities, on which confidence levels and severity assessments are built, are very different from PPVs and similar computations that arise in the context of screening, say, thousands of genes. My worry is that the best of the New Reforms, in failing to make clear the error statistical basis for their recommendations, and confusing screening with the evidential appraisal of particular statistical hypotheses, will fail to halt some of the deepest and most pervasive confusions and fallacies about running and interpreting statistical tests.
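
To make that last contrast concrete, here is a minimal sketch of the screening-style PPV computation, with illustrative numbers of my own rather than Ioannidis’s:

```python
# Screening-style computation behind "most findings are false" claims
# (a sketch with illustrative numbers; no bias or multiplicity terms).
def ppv(alpha, power, prior_odds_true):
    """P(relationship is real | result was 'significant') in a screening setup."""
    true_positives = power * prior_odds_true   # true relationships flagged
    false_positives = alpha                    # null relationships flagged
    return true_positives / (true_positives + false_positives)

for R in (1.0, 0.25, 0.05):   # pre-study odds that a tested relationship is real
    print(R, round(ppv(alpha=0.05, power=0.8, prior_odds_true=R), 2))
# Output: 0.94, 0.8, 0.44.  With low pre-study odds, most "findings" in the
# screened ensemble are false; but this is a property of the ensemble, not an
# evidential assessment of any particular, severely probed hypothesis.
```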

 

[i] References here are to Popper, 1977, 1962; Mayo, 1991, 1996, Salmon, 1984.

[ii] From “The Chump Effect: Reporters are credulous, studies show”, by Andrew Ferguson.

“Entire journalistic enterprises, whole books from cover to cover, would simply collapse into dust if even a smidgen of skepticism were summoned whenever we read that “scientists say” or “a new study finds” or “research shows” or “data suggest.” Most such claims of social science, we would soon find, fall into one of three categories: the trivial, the dubious, or the flatly untrue.”

(selected) REFERENCES:

Cohen, J. (1994). The Earth is round (p < .05). American Psychologist, 49, 997-1003.

Cox, D. R. (1958). Some problems connected with statistical inference. Annals of Mathematical Statistics 29 : 357-372.

Cox, D. R. (1977). The role of significance tests. (With discussion). Scand. J. Statist. 4 : 49-70.

Cox, D. R. (1982). Statistical significance tests. Br. J. Clinical. Pharmac. 14 : 325-331.

Gigerenzer, G. et al. (1989), The Empire of Chance, Cambridge University Press.

Gigerenzer, G. (2000), “The Superego, the Ego, and the Id in Statistical Reasoning,” in Adaptive Thinking: Rationality in the Real World, OUP.

Gelman, A. (2011), “Induction and Deduction in Bayesian Data Analysis,” RMM vol. 2, 2011, 67-78. Special Topic: Statistical Science and Philosophy of Science: where do (should) they meet in 2011 and beyond?

Greenland, S. and Poole, C. (2013), “Living with P Values: Resurrecting a Bayesian Perspective on Frequentist Statistics,” Epidemiology 24: 62-8. 

Ioannidis, J. (2005), “Why Most Published Research Findings Are False,” PLoS Medicine 2(8): e124.

Mayo, D. G. (2012). “Statistical Science Meets Philosophy of Science Part 2: Shallow versus Deep Explorations”, Rationality, Markets, and Morals (RMM) 3, Special Topic: Statistical Science and Philosophy of Science, 71–107.

Mayo, D. G. and Cox, D. R. (2010). “Frequentist Statistics as a Theory of Inductive Inference” in Error and Inference: Recent Exchanges on Experimental Reasoning, Reliability and the Objectivity and Rationality of Science (D Mayo and A. Spanos eds.), Cambridge: Cambridge University Press: 1-27. This paper appeared in The Second Erich L. Lehmann Symposium: Optimality, 2006, Lecture Notes-Monograph Series, Volume 49, Institute of Mathematical Statistics, pp. 247-275.

Mayo, D. G. and Spanos, A. (2011), “Error Statistics,” in Philosophy of Statistics, Handbook of the Philosophy of Science Volume 7 (General editors: Dov M. Gabbay, Paul Thagard and John Woods; Volume eds.: Prasanta S. Bandyopadhyay and Malcolm R. Forster), Elsevier: 1-46.

McCloskey, D. N., & Ziliak, S. T. (1996). The standard error of regression. Journal of Economic Literature, 34(1), 97-114.

Meehl, P. E. (1991), “Why Summaries of Research on Psychological Theories Are Often Uninterpretable,” in R. E. Snow & D. E. Wiley (Eds.), Improving Inquiry in Social Science: A Volume in Honor of Lee J. Cronbach (pp. 13-59), Hillsdale, NJ: Lawrence Erlbaum.

Meehl, P. and Waller, N. (2002), “The Path Analysis Controversy: A New Statistical Approach to Strong Appraisal of Verisimilitude,” Psychological Methods, 7: 283–300.

Orlitzky, M. (2012), “How Can Significance Tests be Deinstitutionalized?” Organizational Research Methods 15(2): 199-228.

Popper, K. (1962). Conjectures and Refutations. NY: Basic Books.

Popper, K. (1977). The Logic of Scientific Discovery, NY: Basic Books. (Original published 1959)

Rozeboom, W. (1997), “Good Science is Abductive, not hypothetico-deductive.” In L. L. Harlow, S. A. Mulaik, & J. H. Steiger (Eds.), What if there were no significance tests? (pp. 335-391). Hillsdale, NJ: Lawrence Erlbaum.

Salmon, W. C. (1984). Scientific Explanation and the Causal Structure of the World, Princeton, NJ: Princeton.

Schmidt, F. (1996), “Statistical Significance Testing and Cumulative Knowledge in Psychology: Implications for Training of Researchers,” Psychological Methods, 1(2): 115-129.

Silver, N. (2012), The Signal and the Noise, Penguin.

Ziliak, S. T., & McCloskey, D. N. (2008), The Cult of Statistical Significance: How the Standard Error Costs Us Jobs, Justice, and Lives. Ann Arbor: University of Michigan Press. (For a short piece see “The Cult of Statistical Significance,” Section on Statistical Education, JSM 2009.)



Higgs Discovery two years on (1: “Is particle physics bad science?”)


July 4, 2014 was the two year anniversary of the Higgs boson discovery. As the world was celebrating the “5 sigma!” announcement, and we were reading about the statistical aspects of this major accomplishment, I was aghast to be emailed a letter, purportedly instigated by Bayesian Dennis Lindley, through Tony O’Hagan (to the ISBA). Lindley, according to this letter, wanted to know:

“Are the particle physics community completely wedded to frequentist analysis?  If so, has anyone tried to explain what bad science that is?”

Fairly sure it was a joke, I posted it on my “Rejected Posts” blog for a bit until it checked out [1]. (See O’Hagan’s “Digest and Discussion”) Continue reading


Fallacy of Rejection and the Fallacy of Nouvelle Cuisine

Any Jackie Mason fans out there? In connection with our discussion of power, and associated fallacies of rejection*–and since it’s Saturday night–I’m reblogging the following post.

In February [2012], in London, criminologist Katrin H. and I went to see Jackie Mason do his shtick, a one-man show billed as his swan song to England.  It was like a repertoire of his “Greatest Hits” without a new or updated joke in the mix.  Still, hearing his rants for the nth time was often quite hilarious.

A sample: If you want to eat nothing, eat nouvelle cuisine. Do you know what it means? No food. The smaller the portion the more impressed people are, so long as the food’s got a fancy French name, haute cuisine. An empty plate with sauce!

As one critic wrote, Mason’s jokes “offer a window to a different era,” one whose caricatures and biases one can only hope we’ve moved beyond: But it’s one thing for Jackie Mason to scowl at a seat in the front row and yell to the shocked audience member in his imagination, “These are jokes! They are just jokes!” and another to reprise statistical howlers, which are not jokes, to me. This blog found its reason for being partly as a place to expose, understand, and avoid them. Recall the September 26, 2011 post “Whipping Boys and Witch Hunters”: [i]

Fortunately, philosophers of statistics would surely not reprise decades-old howlers and fallacies. After all, it is the philosopher’s job to clarify and expose the conceptual and logical foibles of others; and even if we do not agree, we would never merely disregard and fail to address the criticisms in published work by other philosophers.  Oh wait, ….one of the leading texts repeats the fallacy in their third edition: Continue reading


R.A. Fisher: ‘Two New Properties of Mathematical Likelihood’

17 February 1890–29 July 1962

Exactly 1 year ago: I find this to be an intriguing discussion–before some of the conflicts with N and P erupted.  Fisher links his tests and sufficiency, to the Neyman and Pearson lemma in terms of power.  It’s as if we may see them as ending up in a similar place while starting from different origins. I quote just the most relevant portions…the full article is linked below.

by R.A. Fisher, F.R.S.

Proceedings of the Royal Society, Series A, 144: 285-307 (1934)

  The property that where a sufficient statistic exists, the likelihood, apart from a factor independent of the parameter to be estimated, is a function only of the parameter and the sufficient statistic, explains the principle result obtained by Neyman and Pearson in discussing the efficacy of tests of significance.  Neyman and Pearson introduce the notion that any chosen test of a hypothesis H0 is more powerful than any other equivalent test, with regard to an alternative hypothesis H1, when it rejects H0 in a set of samples having an assigned aggregate frequency ε when H0 is true, and the greatest possible aggregate frequency when H1 is true.

If any group of samples can be found within the region of rejection whose probability of occurrence on the hypothesis H1 is less than that of any other group of samples outside the region, but is not less on the hypothesis H0, then the test can evidently be made more powerful by substituting the one group for the other. Continue reading


WHIPPING BOYS AND WITCH HUNTERS

This, from 2 years ago, “fits” at least as well today…HAPPY HALLOWEEN! Memory Lane

In an earlier post I alleged that frequentist hypothesis tests often serve as whipping boys, by which I meant “scapegoats”, for the well-known misuses, abuses, and flagrant misinterpretations of tests (both simple Fisherian significance tests and Neyman-Pearson tests, although in different ways).  Checking the history of this term, however, there is a certain disanalogy with at least the original meaning of a “whipping boy,” namely, an innocent boy who was punished when a medieval prince misbehaved and was in need of discipline.   It was thought that seeing an innocent companion, often a friend, beaten for his own transgressions would supply an effective way to ensure the prince would not repeat the same mistake. But significance test floggings, rather than being a tool for humbled self-improvement and a commitment to avoiding flagrant rule violations, have tended instead to yield declarations that it is the rules that are invalid! The violators are excused as not being able to help it! The situation is more akin to that of witch hunting, which in some places became an occupation in its own right.

Now some early literature, e.g., Morrison and Henkel’s Significance Test Controversy (1962), performed an important service over fifty years ago.  They alerted social scientists to the fallacies of significance tests: misidentifying a statistically significant difference with one of substantive importance, interpreting insignificant results as evidence for the null hypothesis–especially problematic with insensitive tests, and the like. Chastising social scientists for applying significance tests in slavish and unthinking ways, contributors call attention to a cluster of pitfalls and fallacies of testing. Continue reading


Is Particle Physics Bad Science? (memory lane)

Memory Lane: reblog July 11, 2012 (+ updates at the end). 

I suppose[ed] this was somewhat of a joke from the ISBA, prompted by Dennis Lindley, but as I [now] accord the actual extent of jokiness to be only ~10%, I’m sharing it on the blog [i].  Lindley (according to O’Hagan) wonders why scientists require so high a level of statistical significance before claiming to have evidence of a Higgs boson.  It is asked: “Are the particle physics community completely wedded to frequentist analysis?  If so, has anyone tried to explain what bad science that is?”

Bad science?   I’d really like to understand what these representatives from the ISBA would recommend, if there is even a shred of seriousness here (or is Lindley just peeved that significance levels are getting so much press in connection with so important a discovery in particle physics?)

Well, read the letter and see what you think.

On Jul 10, 2012, at 9:46 PM, ISBA Webmaster wrote:

Dear Bayesians,

A question from Dennis Lindley prompts me to consult this list in search of answers.

We’ve heard a lot about the Higgs boson.  The news reports say that the LHC needed convincing evidence before they would announce that a particle had been found that looks like (in the sense of having some of the right characteristics of) the elusive Higgs boson.  Specifically, the news referred to a confidence interval with 5-sigma limits.

Now this appears to correspond to a frequentist significance test with an extreme significance level.  Five standard deviations, assuming normality, means a p-value of around 0.0000005.  A number of questions spring to mind.

1.  Why such an extreme evidence requirement?  We know from a Bayesian  perspective that this only makes sense if (a) the existence of the Higgs  boson (or some other particle sharing some of its properties) has extremely small prior probability and/or (b) the consequences of erroneously announcing its discovery are dire in the extreme.  Neither seems to be the case, so why  5-sigma?

2.  Rather than ad hoc justification of a p-value, it is of course better to do a proper Bayesian analysis.  Are the particle physics community completely wedded to frequentist analysis?  If so, has anyone tried to explain what bad science that is? Continue reading
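
As a quick arithmetic check of the figure quoted in the letter (a minimal sketch, assuming normality; the quoted 0.0000005 is roughly the two-sided tail area, while the one-sided area is about 2.9e-7):

```python
# Quick check of the "5 sigma" arithmetic in the letter, assuming normality.
from scipy.stats import norm

p_one_sided = norm.sf(5.0)        # ~2.9e-7
p_two_sided = 2 * norm.sf(5.0)    # ~5.7e-7, roughly the 0.0000005 quoted
print(p_one_sided, p_two_sided)
```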


PhilStatLaw: Reference Manual on Scientific Evidence (3d ed) on Statistical Significance (Schachtman)

Memory Lane: One Year Ago on error statistics.com

A quick perusal of the “Manual” on Nathan Schachtman’s legal blog shows it to be chock full of revealing points of contemporary legal statistical philosophy.  The following are some excerpts, read the full blog here.   I make two comments at the end.

July 8th, 2012

Nathan Schachtman

How does the new Reference Manual on Scientific Evidence (RMSE3d 2011) treat statistical significance?  Inconsistently and at times incoherently.

Professor Berger’s Introduction

In her introductory chapter, the late Professor Margaret A. Berger raises the question of the role statistical significance should play in evaluating a study’s support for causal conclusions:

“What role should statistical significance play in assessing the value of a study? Epidemiological studies that are not conclusive but show some increased risk do not prove a lack of causation. Some courts find that they therefore have some probative value, 62 at least in proving general causation. 63”

Margaret A. Berger, “The Admissibility of Expert Testimony,” in RMSE3d 11, 24 (2011).

This seems rather backwards.  Berger’s suggestion that inconclusive studies do not prove lack of causation seems nothing more than a tautology.  And how can that tautology support the claim that inconclusive studies “therefore” have some probative value? This is a fairly obvious logically invalid argument, or perhaps a passage badly in need of an editor.

…………

Chapter on Statistics

The RMSE’s chapter on statistics is relatively free of value judgments about significance probability, and, therefore, a great improvement upon Berger’s introduction.  The authors carefully describe significance probability and p-values, and explain:

“Small p-values argue against the null hypothesis. Statistical significance is determined by reference to the p-value; significance testing (also called hypothesis testing) is the technique for computing p-values and determining statistical significance.”

David H. Kaye and David A. Freedman, “Reference Guide on Statistics,” in RMSE3d 211, 241 (3ed 2011).  Although the chapter confuses and conflates Fisher’s interpretation of p-values with Neyman’s conceptualization of hypothesis testing as a dichotomous decision procedure, this treatment is unfortunately fairly standard in introductory textbooks.

Kaye and Freedman, however, do offer some important qualifications to the untoward consequences of using significance testing as a dichotomous outcome:

“Artifacts from multiple testing are commonplace. Because research that fails to uncover significance often is not published, reviews of the literature may produce an unduly large number of studies finding statistical significance.111 Even a single researcher may examine so many different relationships that a few will achieve statistical significance by mere happenstance. Almost any large data set—even pages from a table of random digits—will contain some unusual pattern that can be uncovered by diligent search. Having detected the pattern, the analyst can perform a statistical test for it, blandly ignoring the search effort. Statistical significance is bound to follow.

There are statistical methods for dealing with multiple looks at the data, which permit the calculation of meaningful p-values in certain cases.112 However, no general solution is available, and the existing methods would be of little help in the typical case where analysts have tested and rejected a variety of models before arriving at the one considered the most satisfactory (see infra Section V on regression models). In these situations, courts should not be overly impressed with claims that estimates are significant. Instead, they should be asking how analysts developed their models.113 ”

Id. at 256-57.  This qualification is omitted from the overlapping discussion in the chapter on epidemiology, where it is very much needed. Continue reading
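
Kaye and Freedman’s point about multiple looks is easy to verify with a toy simulation (my own illustration, not from the Manual):

```python
# Toy simulation of Kaye & Freedman's "multiple looks" point: search enough
# comparisons in pure noise and some will reach p < .05 by happenstance.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
n_comparisons, n_per_group = 100, 30

p_values = []
for _ in range(n_comparisons):
    a = rng.normal(size=n_per_group)   # both groups drawn from the same
    b = rng.normal(size=n_per_group)   # distribution, so every null is true
    p_values.append(ttest_ind(a, b).pvalue)

print(sum(p < 0.05 for p in p_values))  # typically around 5 spurious "findings"
```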


Bad news bears: ‘Bayesian bear’ rejoinder-reblog mashup

Oh No! It’s those mutant bears again. To my dismay, I’ve been sent, for the third time, that silly, snarky, adolescent, clip of those naughty “what the p-value” bears (first posted on Aug 5, 2012), who cannot seem to get a proper understanding of significance tests into their little bear brains. So apparently some people haven’t seen my rejoinder which, as I said then, practically wrote itself. So since it’s Saturday night here at the Elbar Room, let’s listen in to a mashup of both the clip and my original rejoinder (in which p-value bears are replaced with hypothetical Bayesian bears). 

These stilted bear figures and their voices are sufficiently obnoxious in their own right, even without the tedious lampooning of p-values and the feigned horror at learning they should not be reported as posterior probabilities.

Mayo’s Rejoinder:

Bear #1: Do you have the results of the study?

Bear #2: Yes. The good news is there is a .996 probability of a positive difference in the main comparison.

Bear #1: Great. So I can be well assured that there is just a .004 probability that such positive results would occur if they were merely due to chance.

Bear #2: Not really, that would be an incorrect interpretation. Continue reading
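
A minimal numerical sketch of the question my rejoinder says the bears should be asking, namely how often the method would erroneously declare a positive difference. The assumptions are mine for illustration: normal data, known sigma, a flat prior, and a .996 posterior threshold.

```python
# How often would this method erroneously declare a positive difference?
# Toy simulation under illustrative assumptions: normal data, known sigma,
# flat prior, and the rule "declare a positive difference when the
# posterior probability of one reaches .996".
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
sigma, n, trials = 1.0, 25, 200_000
se = sigma / np.sqrt(n)

m = rng.normal(loc=0.0, scale=se, size=trials)  # data with NO real difference
posterior_positive = norm.cdf(m / se)           # flat-prior P(mu > 0 | data)
error_rate = np.mean(posterior_positive >= 0.996)

print(error_rate)  # ~0.004: a property of the method, not of any one posterior report
```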


Do CIs Avoid Fallacies of Tests? Reforming the Reformers (Reblog 5/17/12)

The one method that enjoys the approbation of the New Reformers is that of confidence intervals. The general recommended interpretation is essentially this:

For a reasonably high choice of confidence level, say .95 or .99, values of µ within the observed interval are plausible, those outside implausible.

Geoff Cumming, a leading statistical reformer in psychology, has long been pressing for ousting significance tests (or NHST[1]) in favor of CIs. The level of confidence “specifies how confident we can be that our CI includes the population parameter µ” (Cumming 2012, p. 69). He recommends prespecified confidence levels .9, .95 or .99:

“We can say we’re 95% confident our one-sided interval includes the true value. We can say the lower limit (LL) of the one-sided CI…is a likely lower bound for the true value, meaning that for 5% of replications the LL will exceed the true value.” (Cumming 2012, p. 112)[2]

For simplicity, I will use the 2-standard deviation cut-off corresponding to the one-sided confidence level of ~.98.

However, there is a duality between tests and intervals (the intervals containing the parameter values not rejected at the corresponding level with the given data).[3]

“One-sided CIs are analogous to one-tailed tests but, as usual, the estimation approach is better.”

Is it?   Consider a one-sided test of the mean of a Normal distribution with n iid samples, and known standard deviation σ, call it test T+.

H0: µ ≤ 0 against H1: µ > 0, and let σ = 1.

Test T+ at significance level .02 is analogous to forming the one-sided (lower) 98% confidence interval:

µ > M – 2(1/√n).

where M, following Cumming, is the sample mean (thereby avoiding those x-bars). M – 2(1/√n) is the lower limit (LL) of a 98% CI.

Central problems with significance tests (whether of the N-P or Fisherian variety) include:

(1) results are too dichotomous (e.g., significant at a pre-set level or not);

(2) two equally statistically significant results but from tests with different sample sizes are reported in the same way  (whereas the larger the sample size the smaller the discrepancy the test is able to detect);

(3) significance levels (even observed p-values) fail to indicate the extent of the effect or discrepancy (in the case of test T+ , in the positive direction).

We would like to know for what values of δ it is warranted to infer  µ > µ0 + δ. Continue reading
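
Here is a minimal sketch of the kind of severity computation that addresses that question for test T+ (illustrative numbers of my own; see Mayo and Spanos 2011 for the full treatment):

```python
# Sketch of a severity assessment for test T+ (H0: mu <= 0 vs H1: mu > 0,
# sigma = 1), with illustrative numbers: for which delta is "mu > delta"
# well warranted by an observed mean M?
import numpy as np
from scipy.stats import norm

sigma, n = 1.0, 100
se = sigma / np.sqrt(n)   # 0.1
m_obs = 0.3               # statistically significant (m_obs > 2 * se)

def severity(delta):
    """SEV(mu > delta) = P(M <= m_obs ; mu = delta)."""
    return norm.cdf((m_obs - delta) / se)

for delta in (0.0, 0.1, 0.2, 0.3):
    print(delta, round(severity(delta), 3))
# mu > 0 passes with severity ~0.999 and mu > 0.1 with ~0.977, but
# mu > 0.2 only reaches ~0.841 and mu > 0.3 only 0.5: the same
# "significant" result warrants some discrepancies far better than others.
```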


R.A. Fisher: ‘Two New Properties of Mathematical Likelihood’

17 February 1890–29 July 1962

I find this to be an intriguing discussion–before some of the conflicts with N and P erupted.  Fisher links his tests and sufficiency, to the Neyman and Pearson lemma in terms of power.  It’s as if we may see them as ending up in a similar place while starting from different origins. I quote just the most relevant portions…the full article is linked below.

by R.A. Fisher, F.R.S.

Proceedings of the Royal Society, Series A, 144: 285-307 (1934)

  The property that where a sufficient statistic exists, the likelihood, apart from a factor independent of the parameter to be estimated, is a function only of the parameter and the sufficient statistic, explains the principle result obtained by Neyman and Pearson in discussing the efficacy of tests of significance.  Neyman and Pearson introduce the notion that any chosen test of a hypothesis H0 is more powerful than any other equivalent test, with regard to an alternative hypothesis H1, when it rejects H0 in a set of samples having an assigned aggregate frequency ε when H0 is true, and the greatest possible aggregate frequency when H1 is true.

If any group of samples can be found within the region of rejection whose probability of occurrence on the hypothesis H1 is less than that of any other group of samples outside the region, but is not less on the hypothesis H0, then the test can evidently be made more powerful by substituting the one group for the other.

Consequently, for the most powerful test possible the ratio of the probabilities of occurrence on the hypothesis H0 to that on the hypothesis H1 is less in all samples in the region of rejection than in any sample outside it. For samples involving continuous variation the region of rejection will be bounded by contours for which this ratio is constant. The regions of rejection will then be required in which the likelihood of H0 bears to the likelihood of H1, a ratio less than some fixed value defining the contour. (295)…

It is evident, at once, that such a system is only possible when the class of hypotheses considered involves only a single parameter θ, or, what comes to the same thing, when all the parameters entering into the specification of the population are definite functions of one of their number.  In this case, the regions defined by the uniformly most powerful test of significance are those defined by the estimate of maximum likelihood, T.  For the test to be uniformly most powerful, moreover, these regions must be independent of θ showing that the statistic must be of the special type distinguished as sufficient.  Such sufficient statistics have been shown to contain all the information which the sample provides relevant to the value of the appropriate parameter θ. It is inevitable therefore that if such a statistic exists it should uniquely define the contours best suited to discriminate among hypotheses differing only in respect of this parameter; and it is surprising that Neyman and Pearson should lay it down as a preliminary consideration that ‘the testing of statistical hypotheses cannot be treated as a problem in estimation.’ When tests are considered only in relation to sets of hypotheses specified by one or more variable parameters, the efficacy of the tests can be treated directly as the problem of estimation of these parameters.  Regard for what has been established in that theory, apart from the light it throws on the results already obtained by their own interesting line of approach, should also aid in treating the difficulties inherent in cases in which no sufficient statistic exists. (296)
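
A minimal numerical sketch of the construction Fisher describes, using two simple normal hypotheses (my own illustration, not from the paper):

```python
# Numerical sketch of the construction Fisher describes, for two simple
# hypotheses H0: mu = 0 and H1: mu = 1 (normal, sigma = 1, n = 10): the most
# powerful level-eps test rejects where the ratio L(H0)/L(H1) is smallest,
# which here is simply a cut on the sufficient statistic, the sample mean M.
import numpy as np
from scipy.stats import norm

n, eps = 10, 0.05
mu0, mu1, sigma = 0.0, 1.0, 1.0
se = sigma / np.sqrt(n)

cut = norm.isf(eps, loc=mu0, scale=se)   # reject H0 when M > cut (size eps under H0)

def log_lik_ratio(m):
    """log of L(H0; m) / L(H1; m), computed from the sufficient statistic M."""
    return norm.logpdf(m, mu0, se) - norm.logpdf(m, mu1, se)

# The ratio decreases as M grows, so "M > cut" is exactly "likelihood ratio
# below a fixed contour", as in Fisher's description of Neyman and Pearson.
print(log_lik_ratio(cut + 0.1) < log_lik_ratio(cut))     # True
print(round(norm.sf(cut, loc=mu1, scale=se), 3))         # power against H1 (~0.935)
```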


Saturday Night Brainstorming and Task Forces: (2013) TFSI on NHST

Saturday Night Brainstorming: The TFSI on NHST–reblogging with a 2013 update

Each year leaders of the movement to reform statistical methodology in psychology, social science and other areas of applied statistics get together around this time for a brainstorming session. They review the latest from the Task Force on Statistical Inference (TFSI), propose new regulations they would like the APA publication manual to adopt, and strategize about how to institutionalize improvements to statistical methodology. 

While frustrated that the TFSI has still not banned null hypothesis significance testing (NHST), since attempts going back to at least 1996, the reformers have created, and very successfully published in, new meta-level research paradigms designed expressly to study (statistically!) a central question: have the carrots and sticks of reward and punishment been successful in decreasing the use of NHST, and promoting instead use of confidence intervals, power calculations, and meta-analysis of effect sizes? Or not?  

This year there are a couple of new members who are pitching in to contribute what they hope are novel ideas for reforming statistical practice. Since it’s Saturday night, let’s listen in on part of an (imaginary) brainstorming session of the New Reformers. This is a 2013 update of an earlier blogpost. Continue reading


A “Bayesian Bear” rejoinder practically writes itself…

These stilted bear figures and their voices are sufficiently obnoxious in their own right, even without the tedious lampooning of p-values and the feigned horror at learning they should not be reported as posterior probabilities. Coincidentally, I have been sent several different p-value YouTube clips in the past two weeks, rehearsing essentially the same interpretive issues, but this one (“what the p-value”*) was created by some freebie outfit that will apparently set their irritating cartoon bear voices to your very own dialogue (I don’t know the website or outfit.)

The presumption is that somehow there would be no questions or confusion of interpretation were the output in the form of a posterior probability. The problem of indicating the extent of discrepancies that are/are not warranted by a given p-value is genuine but easy enough to solve**. What I never understand is why it is presupposed that the most natural and unequivocal way to interpret and communicate evidence (in this case, leading to low p-values) is by means of a (posterior) probability assignment, when it seems clear that the more relevant question the testy-voiced (“just wait a tick”) bear would put to the know-it-all bear would be: how often would this method erroneously declare a genuine discrepancy? A corresponding “Bayesian bear” video practically writes itself, but I’ll let you watch this first. Share any narrative lines that come to mind.

*Reference: Blume, J. and J. F. Peipert (2003). “What your statistician never told you about P-values.” J Am Assoc Gynecol Laparosc 10(4): 439-444.

**See for example, Mayo & Spanos (2011) ERROR STATISTICS


Is Particle Physics Bad Science?

I suppose[ed] this was somewhat of a joke from the ISBA, prompted by Dennis Lindley, but as I [now] accord the actual extent of jokiness to be only ~10%, I’m sharing it on the blog [i].  Lindley (according to O’Hagan) wonders why scientists require so high a level of statistical significance before claiming to have evidence of a Higgs boson.  It is asked: “Are the particle physics community completely wedded to frequentist analysis?  If so, has anyone tried to explain what bad science that is?”

Bad science?   I’d really like to understand what these representatives from the ISBA would recommend, if there is even a shred of seriousness here (or is Lindley just peeved that significance levels are getting so much press in connection with so important a discovery in particle physics?)

Well, read the letter and see what you think.

On Jul 10, 2012, at 9:46 PM, ISBA Webmaster wrote:

Dear Bayesians,

A question from Dennis Lindley prompts me to consult this list in search of answers.

We’ve heard a lot about the Higgs boson.  The news reports say that the LHC needed convincing evidence before they would announce that a particle had been found that looks like (in the sense of having some of the right characteristics of) the elusive Higgs boson.  Specifically, the news referred to a confidence interval with 5-sigma limits.

Now this appears to correspond to a frequentist significance test with an extreme significance level.  Five standard deviations, assuming normality, means a p-value of around 0.0000005.  A number of questions spring to mind.

1.  Why such an extreme evidence requirement?  We know from a Bayesian  perspective that this only makes sense if (a) the existence of the Higgs  boson (or some other particle sharing some of its properties) has extremely small prior probability and/or (b) the consequences of erroneously announcing its discovery are dire in the extreme.  Neither seems to be the case, so why  5-sigma?

2.  Rather than ad hoc justification of a p-value, it is of course better to do a proper Bayesian analysis.  Are the particle physics community completely wedded to frequentist analysis?  If so, has anyone tried to explain what bad science that is? Continue reading


PhilStatLaw: Reference Manual on Scientific Evidence (3d ed) on Statistical Significance (Schachtman)

A quick perusal of the “Manual” on Nathan Schachtman’s legal blog shows it to be chock full of revealing points of contemporary legal statistical philosophy.  The following are some excerpts, read the full blog here.   I make two comments at the end.

July 8th, 2012

Nathan Schachtman

How does the new Reference Manual on Scientific Evidence (RMSE3d 2011) treat statistical significance?  Inconsistently and at times incoherently.

Professor Berger’s Introduction

In her introductory chapter, the late Professor Margaret A. Berger raises the question of the role statistical significance should play in evaluating a study’s support for causal conclusions:

“What role should statistical significance play in assessing the value of a study? Epidemiological studies that are not conclusive but show some increased risk do not prove a lack of causation. Some courts find that they therefore have some probative value, 62 at least in proving general causation. 63”

Margaret A. Berger, “The Admissibility of Expert Testimony,” in RMSE3d 11, 24 (2011).

This seems rather backwards.  Berger’s suggestion that inconclusive studies do not prove lack of causation seems nothing more than a tautology.  And how can that tautology support the claim that inconclusive studies “therefore” have some probative value? This is a fairly obvious logically invalid argument, or perhaps a passage badly in need of an editor.

…………

Chapter on Statistics

The RMSE’s chapter on statistics is relatively free of value judgments about significance probability, and, therefore, a great improvement upon Berger’s introduction.  The authors carefully describe significance probability and p-values, and explain:

“Small p-values argue against the null hypothesis. Statistical significance is determined by reference to the p-value; significance testing (also called hypothesis testing) is the technique for computing p-values and determining statistical significance.”

David H. Kaye and David A. Freedman, “Reference Guide on Statistics,” in RMSE3d 211, 241 (3ed 2011).  Although the chapter confuses and conflates Fisher’s interpretation of p-values with Neyman’s conceptualization of hypothesis testing as a dichotomous decision procedure, this treatment is unfortunately fairly standard in introductory textbooks.

Kaye and Freedman, however, do offer some important qualifications to the untoward consequences of using significance testing as a dichotomous outcome: Continue reading

Categories: Statistics | 9 Comments

G. Cumming Response: The New Statistics

Prof. Geoff Cumming [i] has taken up my invitation to respond to “Do CIs Avoid Fallacies of Tests? Reforming the Reformers” (May 17th), reposted today as well. (I extend the same invitation to anyone I comment on, whether in the form of a comment or a full post.) He reviews some of the complaints against p-values and significance tests, but he has not here responded to the particular challenge I raise: to show how his appeals to CIs avoid the fallacies and weaknesses of significance tests. The May 17 post focuses on the fallacy of rejection; the one from June 2, on the fallacy of acceptance. In each case, one needs to supplement his CIs with something along the lines of the testing scrutiny offered by SEV. At the same time, a SEV assessment avoids the much-lampooned uses of p-values–or so I have argued. He does allude to a subsequent post, so perhaps he will address these issues there.

The New Statistics

PROFESSOR GEOFF CUMMING [ii] (submitted June 13, 2012)

I’m new to this blog—what a trove of riches! I’m prompted to respond by Deborah Mayo’s typically insightful post of 17 May 2012, in which she discussed one-sided tests and referred to my discussion of one-sided CIs (Cumming, 2012, pp 109-113). A central issue is:

Cumming (quoted by Mayo): as usual, the estimation approach is better

Mayo: Is it?

Lots to discuss there. In this first post I’ll outline the big picture as I see it.

‘The New Statistics’ refers to effect sizes, confidence intervals, and meta-analysis, which, of course, are not themselves new. But using them, and relying on them as the basis for interpretation, would be new for most researchers in a wide range of disciplines that have for decades relied on null hypothesis significance testing (NHST). My basic argument for the new statistics rather than NHST is summarised in a brief magazine article (http://tiny.cc/GeoffConversation) and radio talk (http://tiny.cc/geofftalk). The website www.thenewstatistics.com has information about the book (Cumming, 2012) and ESCI software, which is a free download.

Continue reading

Categories: Statistics | 5 Comments

Repost (5/17/12): Do CIs Avoid Fallacies of Tests? Reforming the Reformers

The one method that enjoys the approbation of the New Reformers is that of confidence intervals (See May 12, 2012, and links). The general recommended interpretation is essentially this:

For a reasonably high choice of confidence level, say .95 or .99, values of µ within the observed interval are plausible, those outside implausible.

Geoff Cumming, a leading statistical reformer in psychology, has long been pressing for ousting significance tests (or NHST[1]) in favor of CIs. The level of confidence “specifies how confident we can be that our CI includes the population parameter µ” (Cumming 2012, p. 69). He recommends prespecified confidence levels .9, .95 or .99:

“We can say we’re 95% confident our one-sided interval includes the true value. We can say the lower limit (LL) of the one-sided CI…is a likely lower bound for the true value, meaning that for 5% of replications the LL will exceed the true value.” (Cumming 2012, p. 112)[2]

For simplicity, I will use the 2-standard deviation cut-off corresponding to the one-sided confidence level of ~.98.

However, there is a duality between tests and intervals (the intervals containing the parameter values not rejected at the corresponding level with the given data).[3]

“One-sided CIs are analogous to one-tailed tests but, as usual, the estimation approach is better.”

Is it? Consider a one-sided test of the mean of a Normal distribution based on n iid samples, with known standard deviation σ; call it test T+.

H0: µ ≤ 0 against H1: µ > 0, and let σ = 1.

Test T+ at significance level .02 is analogous to forming the one-sided (lower) 98% confidence interval:

µ > M – 2(1/√n),

where M, following Cumming, is the sample mean (thereby avoiding those x-bars). M – 2(1/√n) is the lower limit (LL) of the 98% CI.
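(To make the duality concrete, here is a minimal sketch in Python, my own illustration rather than anything from Cumming: with σ = 1, test T+ at the ~.02 level rejects H0 exactly when the lower limit M – 2(1/√n) of the one-sided 98% CI exceeds 0.)

```python
# Minimal sketch of the test/CI duality for test T+ with sigma = 1 known:
# reject H0: mu <= 0 at the ~.02 level iff the lower 98% CI limit exceeds 0.
import numpy as np

def lower_limit(M, n, sigma=1.0, z=2.0):
    """Lower limit of the one-sided ~98% CI: M - 2*(sigma/sqrt(n))."""
    return M - z * sigma / np.sqrt(n)

def t_plus_rejects(M, n, sigma=1.0, z=2.0, mu0=0.0):
    """Test T+: reject H0 when the standardized sample mean exceeds z."""
    return (M - mu0) / (sigma / np.sqrt(n)) > z

n = 25  # so sigma/sqrt(n) = 0.2 and the rejection cutoff is M > 0.4
for M in (0.2, 0.41, 0.8):
    print(M, t_plus_rejects(M, n), round(lower_limit(M, n), 3))
```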

Central problems with significance tests (whether of the N-P or Fisherian variety) include: Continue reading

Categories: Statistics | Leave a comment

U-Phil: Is the Use of Power* Open to a Power Paradox?

* to assess Detectable Discrepancy Size (DDS)

In my last post, I argued that DDS-type calculations (also called Neymanian power analysis) provide needful information to avoid fallacies of acceptance in the test T+, whereas the corresponding confidence interval does not (at least not without special testing supplements). But some have argued that DDS computations are “fundamentally flawed,” leading to what is called the “power approach paradox”; e.g., Hoenig and Heisey (2001).

We are to consider two variations on the one-tailed test T+: H0: μ ≤ 0 versus H1: μ > 0 (p. 21).  Following their terminology and symbols:  The Z value in the first, Zp1, exceeds the Z value in the second, Zp2, although the same observed effect size occurs in both[i], and both have the same sample size, implying that σ1 < σ2.  For example, suppose σx1 = 1 and σx2 = 2.  Let observed sample mean M be 1.4 for both cases, so Zp1 = 1.4 and Zp2 = .7. They note that for any chosen power, the computable detectable discrepancy size will be smaller in the first experiment, and for any conjectured effect size, the computed power will always be higher in the first experiment.

“These results lead to the nonsensical conclusion that the first experiment provides the stronger evidence for the null hypothesis (because the apparent power is higher but significant results were not obtained), in direct contradiction to the standard interpretation of the experimental results (p-values).” (p. 21)

But rather than showing the DDS assessment to be “nonsensical,” or revealing any direct contradiction with the interpretation of p-values, this just demonstrates something nonsensical in their interpretation of the two p-value results from tests with different variances. Since it’s Sunday night and I’m nursing[ii] overexposure to rowing in the Queen’s Jubilee boats in the rain and wind, how about you find the howler in their treatment? (Also please inform us of articles pointing this out in the last decade, if you know of any.)
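For anyone who wants the numbers in front of them before hunting for the howler, here is a minimal sketch (my own illustration, assuming scipy; not from Hoenig and Heisey) reproducing the two p-values and the power of each test against a conjectured discrepancy of 1.4:

```python
# Minimal sketch of the Hoenig & Heisey setup: same observed mean M = 1.4,
# standard errors 1 and 2, one-tailed test of H0: mu <= 0 at alpha = .05.
from scipy.stats import norm

alpha = 0.05
z_crit = norm.ppf(1 - alpha)                 # ~1.645
M, mu_alt = 1.4, 1.4                         # observed mean; conjectured discrepancy

for se in (1.0, 2.0):
    z_obs = M / se
    p_value = norm.sf(z_obs)                 # ~.08 and ~.24: neither significant
    power = norm.sf(z_crit - mu_alt / se)    # ~.40 and ~.17
    print(f"se={se}: z={z_obs:.2f}, p={p_value:.3f}, power(mu=1.4)={power:.2f}")
```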

______________________

Hoenig, J. M. and D. M. Heisey (2001), “The Abuse of Power: The Pervasive Fallacy of Power Calculations in Data Analysis,” The American Statistician, 55: 19-24.

 


[i] The subscript indicates the p-value of the associated Z value.

[ii] With English tea and a cup of strong “Elbar grease”.

Categories: Statistics, U-Phil | 6 Comments

Saturday Night Brainstorming & Task Forces: The TFSI on NHST

Each year leaders of the movement to reform statistical methodology in psychology and related social sciences get together for a brainstorming session. They review the latest from the Task Force on Statistical Inference (TFSI), propose new regulations they would like the APA publication manual to adopt, and strategize about how to institutionalize improvements to statistical methodology. See my discussion of the New Reformers in the blogposts of Sept. 26, Oct. 3, and Oct. 4, 2011.[i]

While frustrated that the TFSI has still not banned null hypothesis significance testing (NHST), despite attempts going back to at least 1996, the reformers have created, and very successfully published in, new meta-level research paradigms designed expressly to study (statistically!) a central question: have the carrots and sticks of reward and punishment been successful in decreasing the use of NHST, and promoting instead the use of confidence intervals, power calculations, and meta-analysis of effect sizes? Or not?

Since it’s Saturday night, let’s listen in on part of an (imaginary) brainstorming session of the New Reformers, somewhere near an airport in a major metropolitan area.[ii] Continue reading

Categories: Statistics | 7 Comments

Excerpts from S. Senn’s Letter on “Replication, p-values and Evidence”

Dear Reader: I am typing in some excerpts from a letter Stephen Senn shared with me in relation to my April 28, 2012 blogpost. It is a letter to the editor of Statistics in Medicine in response to S. Goodman. It contains several important points that get to the issues we’ve been discussing, and you may wish to track down the rest of it. Sincerely, D. G. Mayo

Statist. Med. 2002; 21:2437–2444  http://errorstatistics.files.wordpress.com/2013/12/goodman.pdf

 STATISTICS IN MEDICINE, LETTER TO THE EDITOR

A comment on replication, p-values and evidence: S.N. Goodman, Statistics in Medicine 1992; 11:875–879

From: Stephen Senn*

Some years ago, in the pages of this journal, Goodman gave an interesting analysis of ‘replication probabilities’ of p-values. Specifically, he considered the possibility that a given experiment had produced a p-value that indicated ‘significance’ or near significance (he considered the range p=0.10 to 0.001) and then calculated the probability that a study with equal power would produce a significant result at the conventional level of significance of 0.05. He showed, for example, that given an uninformative prior, and (subsequently) a resulting p-value that was exactly 0.05 from the first experiment, the probability of significance in the second experiment was 50 per cent. A more general form of this result is as follows. If the first trial yields p=α then the probability that a second trial will be significant at significance level α (and in the same direction as the first trial) is 0.5. Continue reading
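[Not part of Senn’s letter: a small simulation sketch, assuming numpy and scipy, of the 50 per cent result just quoted. Under a flat prior, with the first trial landing exactly at p = 0.05, draw the true standardized effect from the resulting posterior and simulate an equally powered second trial; about half of the replications come out significant at 0.05 in the same direction.]

```python
# Minimal simulation sketch of the 50% replication probability:
# flat prior, first trial observed exactly at p = 0.05 (one-sided).
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
z_alpha = norm.ppf(0.95)                 # observed z of the first trial
n_sim = 200_000

# Flat prior => posterior for the true standardized effect is N(z_alpha, 1).
true_effect = rng.normal(z_alpha, 1.0, n_sim)
# Equally powered second trial: observed z ~ N(true_effect, 1).
z2 = rng.normal(true_effect, 1.0)

print((z2 > z_alpha).mean())             # ~0.5
```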

Categories: Statistics | 9 Comments

That Promissory Note From Lehmann’s Letter; Schmidt to Speak

Juliet Shaffer and Erich Lehmann

Monday, April 16, is Jerzy Neyman’s birthday, but this post is not about Neyman (that comes later, I hope). But in thinking of Neyman, I’m reminded of Erich Lehmann, Neyman’s first student, and a promissory note I gave in a post on September 15, 2011.  I wrote:

“One day (in 1997), I received a bulging, six-page, handwritten letter from him in tiny, extremely neat scrawl (and many more after that).  …. I remember it contained two especially noteworthy pieces of information, one intriguing, the other quite surprising.  The intriguing one (I’ll come back to the surprising one another time, if reminded) was this:  He told me he was sitting in a very large room at an ASA meeting where they were shutting down the conference book display (or maybe they were setting it up), and on a very long, dark table sat just one book, all alone, shiny red.  He said he wondered if it might be of interest to him!  So he walked up to it….  It turned out to be my Error and the Growth of Experimental Knowledge (1996, Chicago), which he reviewed soon after.”

But what about the “surprising one” that I was to come back to “if reminded”? (Yes, one person did remind me last month.) The surprising one is that Lehmann’s letter—this was his first letter to me—asked me to please read a paper by Frank Schmidt, to appear in his wife Juliet Shaffer’s new (at the time) journal, Psychological Methods, as he wondered if I had any ideas as to what might be done to answer such criticisms of frequentist tests! But, clearly, few people could have been in a better position than Lehmann to “do something about” these arguments…hence my surprise. But I think he was reluctant…. Continue reading

Categories: Statistics | 1 Comment
