2015 Saturday Night Brainstorming and Task Forces: (3rd draft)


TFSI workgroup

Saturday Night Brainstorming: The TFSI on NHST–part reblog from here and here, with a substantial 2015 update!

Each year leaders of the movement to “reform” statistical methodology in psychology, social science, and other areas of applied statistics get together around this time for a brainstorming session. They review the latest from the Task Force on Statistical Inference (TFSI) and propose new regulations they would like to see adopted, not just by the APA publication manual any more, but by all science journals! Since it’s Saturday night, let’s listen in on part of an (imaginary) brainstorming session of the New Reformers.

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Frustrated that the TFSI has still not banned null hypothesis significance testing (NHST)–a fallacious version of statistical significance tests that dares to violate Fisher’s first rule: It’s illicit to move directly from statistical to substantive effects–the New Reformers have created, and very successfully published in, new meta-level research paradigms designed expressly to study (statistically!) a central question: have the carrots and sticks of reward and punishment been successful in decreasing the use of NHST, and promoting instead use of confidence intervals, power calculations, and meta-analysis of effect sizes? Or not?  

Most recently, the group has helped successfully launch a variety of “replication and reproducibility projects”. Having discovered how much the reward structure encourages bad statistics and gaming the system, they have cleverly pushed to change the reward structure: Failed replications (from a group chosen by a crowd-sourced band of replicationistas) would not be hidden in those dusty old file drawers, but would be guaranteed to be published without that long, drawn-out process of peer review. Do these failed replications indicate the original study was a false positive? Or that the replication attempt is a false negative? It’s hard to say.

This year, as is typical, there is a new member who is pitching in to contribute what he hopes are novel ideas for reforming statistical practice. In addition, for the first time, there is a science reporter blogging the meeting for her next freelance “bad statistics” piece for a high impact science journal. Notice that this committee only grows; no one has dropped off in the 3 years I’ve followed them.

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Pawl: This meeting will come to order. I am pleased to welcome our new member, Dr. Ian Nydes, adding to the medical strength we have recently built with epidemiologist S.C. In addition, we have a science writer with us today, Jenina Oozo. To familiarize everyone, we begin with a review of old business, and gradually turn to new business.

Franz: It’s so darn frustrating after all these years to see researchers still using NHST methods; some of the newer modeling techniques routinely build on numerous applications of those pesky tests.

Jake: And the premier publication outlets in the social sciences still haven’t mandated the severe reforms sorely needed. Hopefully the new blood, Dr. Ian Nydes, can help us go beyond resurrecting the failed attempts of the past.

Marty: Well, I have with me a comprehensive 2012 report by M. Orlitzky that observes that “NHST is used in 94% of all articles in the Journal of Applied Psychology….Similarly, in economics, reliance on NHST has actually increased rather than decreased after McCloskey and Ziliak’s (1996) critique of the prevalence of NHST in the American Economic Review (Ziliak & McCloskey, 2009)”.

Dora: Oomph! Maybe their articles made things worse; I’d like to test if the effect is statistically real or not.

Pawl: Yes, that would be important. But, what new avenue can we try that hasn’t already been attempted and failed (if it hasn’t actually galvanized NHST users)? There’s little point in continuing with methods whose efficacy has not stood up to stringent tests. Might we just declare that NHST is ‘‘surely the most bone-headedly misguided procedure ever institutionalized in the rote training of science students (see Rozeboom quote here.)’’?

Franz:  Already tried. Rozeboom 1997, page 335.  Very, very similar phrasing also attempted by many, many others over 50 years.  All failed. Darn.

Jake: Didn’t it kill you to see all the attention p-values got with the 2012 Higgs boson discovery? P-value policing by Lindley and O’Hagan (to use a term from the Normal Deviate) just made things worse.

Pawl: Indeed! And the machine’s back on in 2015. Fortunately, one could see the physicist’s analysis in terms of frequentist confidence intervals.

Nayth: As the “non-academic” Big Data member of the TFSI, I have something brand new: explain that “frequentist methods–in striving for immaculate statistical procedures that can’t be contaminated by the researcher’s bias–keep him hermetically sealed off from the real world.”

Gerry: Declared by Nate Silver 2012, page 253. Anyway, we’re not out to condemn all of frequentist inference, because then our confidence intervals go out the window too! Let alone our replication gig.

Marty: It’s a gig that keeps on giving.

Dora: I really like the part about the ‘immaculate statistical conception’.  It could have worked except for the fact that, although people love Silver’s book overall, they regard the short part on frequentist statistics as so far-fetched as to have been sent from another planet!

Nayth: Well here’s a news flash, Dora: I’ve heard the same thing about Ziliak and McCloskey’s book!

Gerry: If we can get back to our business, my diagnosis is that these practitioners are suffering from a psychological disorder; their “mindless, mechanical behavior” is very much “reminiscent of compulsive hand washing.”  It’s that germaphobia business that Nayth just raised. Perhaps we should begin to view ourselves as Freudian analysts who empathize with the “the anxiety and guilt, the compulsive and ritualistic behavior foisted upon” researchers. We should show that we understand how statistical controversies are “projected onto an ‘intrapsychic’ conflict in the minds of researchers”. It all goes back to that “hybrid logic” attempting “to solve the conflict between its parents by denying its parents.”

Pawl: Oh My, Gerry!  That old Freudian metaphor scarcely worked even after Gigerenzer popularized it. 2000, pages 283, 280, and 281.

Gerry: I thought it was pretty good, especially the part about “denying its parents”.

Dora: I like the part about the “compulsive hand washing”. Cool!

Jake: Well, we need a fresh approach, not redundancy, not repetition. So how about we come right out with it: “What’s wrong with NHST?  Well, … it does not tell us what we want to know, and we so much want to know what we want to know that, out of desperation, we nevertheless believe that it” tells us what we want to know, because we want to know what we want…

Dora: Whoa, Jake! Slow down. That was Cohen 1994, page 202, remember? But I agree with Jake that we’ve got to shout it out with all the oomph we can muster, even frighten people a little bit: “Statistical significance is hurting people, indeed killing them”! NHST is a method promoted by that Fisherian cult of bee-keepers.

Pawl: She’s right, oh my: “I suggest to you that Sir Ronald has befuddled us, mesmerized us…. [NHST] is basically unsound, poor scientific strategy, and one of the worst things that ever happened in the history of psychology.” Merely refuting the null hypothesis is too weak to corroborate substantive theories, “we have to have ‘Popperian risk’, ‘severe test’ [as in Mayo], or what philosopher Wesley Salmon called a highly improbable coincidence” [i].

Gerry: H-e-ll-o! Dora and Pawl are just echoing the words in Ziliak and McCloskey 2008, page 186, and Meehl 1990, page 18; Meehl and Waller 2002, page 184, respectively.

Marty: Quite unlike Meehl, some of us deinstitutionalizers and cultural organizational researchers view Popper as not a hero but as the culprit. No one is alerting researchers that “NHST is the key statistical technique that puts into practice hypothetico-deductivism, the scientific inference procedure based on Popper’s falsifiability criterion. So, as long as the [research] community is devoted to hypothetico-deductivism, NHST will likely persist”. Orlitzky 2012, 203.  Rooting Popper out is imperative, if we’re ever going to deinstitutionalize NHST.

Jake: You want to ban Popper too?  Now you’re really going to scare people off our mission.

Nayth: True, it’s not just Meehl who extols Popper. Even some of the most philosophical of statistical practitioners are channeling Popper.  I was just reading an on-line article by Andrew Gelman. He says:

“At a philosophical level, I have been persuaded by the arguments of Popper (1959), … and others that scientific revolutions arise from the identification and resolution of anomalies. In statistical terms, an anomaly is a misfit of model to data (or perhaps an internal incoherence of the model), and it can be identified by a (Fisherian) hypothesis test without reference to any particular alternative (what Cox and Hinkley 1974 call “pure significance testing”). ….At the next stage, we see science—and applied statistics—as resolving anomalies via the creation of improved models …. This view corresponds closely to the error-statistics idea of Mayo (1996)” (Gelman 2011, 70)

Jake: Of course Popper’s prime example of non-falsifiable science was Freudian/Adlerian psychology, which gave psychologist Paul Meehl conniptions because he was a Freudian as well as a Popperian. I’ve always suspected that’s one reason Meehl castigated experimental psychologists who could falsify via P-values, and thereby count as scientific (by Popper’s lights), whereas he could not. At least not yet.

Gerry: Maybe for once we should set the record straight:“It should be recognized that, according to Fisher, rejecting the null hypothesis is not equivalent to accepting the efficacy of the cause in question. The latter cannot be established on the basis of one single experiment but requires obtaining more significant results when the experiment, or an improvement of it, is repeated at other laboratories or under other conditions.”

Pawl: That was tried by Gigerenzer in “The Inference Experts” (1989, 96).

S.C.: This is radical, but maybe p-values should just be used as measures of observed fit, and all inferential uses of significance testing banned.

Franz: But then you give up control of error probabilities. Instead of nagging about bans and outlawing, I say we try a more positive approach: point out how meta-analysis “means that cumulative understanding and progress in theory development is possible after all.”

(Franz stands. Chest up, chin out, hand over his heart):

“It means that the behavioral and social sciences can attain the status of true sciences: they are not doomed forever to the status of quasi-sciences or pseudoscience. ..[T]he gloom, cynicism, and nihilism that have enveloped many in the behavioral and social sciences is lifting. Young people starting out in the behavioral and social sciences today can hope for a much brighter future.”

Pawl: My! That was incredibly inspiring Franz.

Dora: Yes, really moving, only …

Gerry:  Only problem is, Schmidt’s already said it, 1996, page 123.

S.C.: How’s that meta-analysis working out for you social scientists, huh? Is the gloom lifting?

Franz: It was until we saw the ‘train wreck looming’ (after Diederik Stapel)…but now we have replication projects.

Nayth: From my experience as a TV pundit, I say just tell everyone how bad data and NHST are producing “‘statistically significant’ (but manifestly ridiculous) findings” like how toads can predict earthquakes. You guys need to get on the Bayesian train.

Marty: Is it leaving?  Anyway, this is in Nathan Silver’s recent book, page 253. But I don’t see why it’s so ridiculous, I’ll bet it’s true. I read that some woman found that all the frogs she had been studying every day just up and disappeared from the pond just before that quake in L’Aquila, Italy, in 2009.

Dora: I’m with Marty. I really, really believe animals pick up those ions from sand and pools of water near earthquakes. My gut feeling is it’s very probable. Does our epidemiologist want to jump in here?

S.C.: Not into the green frog pool I should hope! Nyuk! Nyuk! But I do have a radical suggestion that no one has so far dared to utter.

Dora: Oomph! Tell, tell!

S.C.: “Given the extensive misinterpretations of frequentist statistics and the enormous (some would say impossible) demands made by fully Bayesian analyses, a serious argument can be made for de-emphasizing (if not eliminating) inferential statistics in favor of more data presentation… Without an inferential ban, however, an improvement of practice will require re-education, not restriction”.

Marty: “Living with P-Values,” Greenland and Poole in a 2013 issue of Epidemiology. An “inferential ban”. Wow, that’s music to the deinstitutionalizer’s ears.

Pawl: I just had a quick look, but their article appears to resurrect the same-old same-old: P-values are or have to be (mis)interpreted as posteriors, so here are some priors to do the trick. Of course their reconciliation between P-values and the posterior probability (with weak priors) that you’ve got the wrong directional effect (in one sided tests) was shown long ago by Cox, Pratt, others.
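For anyone who wants to see the reconciliation Pawl mentions, here is a minimal sketch for the simplest case: a normal mean with known sigma and an improper flat prior. The numbers are hypothetical; the point is only that the one-sided p-value and the posterior probability of the wrong directional effect coincide.

```python
# Minimal sketch of the p-value/posterior reconciliation mentioned above,
# for the simplest case: normal mean mu, known sigma, improper flat prior on mu.
# All numbers (sigma, n, xbar) are hypothetical, chosen only for illustration.
from scipy.stats import norm

sigma, n = 1.0, 25
xbar = 0.4                            # hypothetical observed sample mean
se = sigma / n ** 0.5

# One-sided p-value for H0: mu <= 0 vs H1: mu > 0
p_value = 1 - norm.cdf(xbar / se)

# Posterior probability of the wrong directional effect, P(mu <= 0 | data),
# using mu | data ~ N(xbar, se^2) under the flat prior
post_wrong_sign = norm.cdf(0, loc=xbar, scale=se)

print(round(p_value, 4), round(post_wrong_sign, 4))   # the two values coincide
```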

Franz: It’s a neat trick, but it’s unclear how this reconciliation advances our goals. Historically, the TFSI has not pushed the Bayesian line. We in psychology have enough trouble being taken as serious scientists.

Nayth: Journalists must be Bayesian because, let’s face it, we’re all biased, and we need to be up front about it.

Jenina: I know I’m here just to take notes for a story, but I believe Nate Silver made this pronouncement in his 10 or 11 point list in his presidential address to the Joint Statistical Meetings in 2013. Has it caught on?

Nayth: Well, of course I’d never let the writers of my on-line data-driven news journal introduce prior probabilities into their articles, are you kidding me? No way! We’ve got to keep it objective, push randomized-controlled trials, you know, regular statistical methods–you want me to lose my advertising revenue? Fugetaboutit!

Dora: Do as we say, not as we do – when it’s really important to us, and our cost-functions dictate otherwise.

Gerry: As Franz observes, the TFSI has not pushed the Bayesian line because we want people to use confidence intervals (CIs). Anything tests can do CIs do better.

S.C.: You still need tests. Notice you don’t use CIs alone in replication or fraudbusting.

Dora: No one’s going to notice we use the very methods we are against when we have to test if another researcher did a shoddy job. Mere mathematical quibbling I say.

Pawl: But it remains to show how confidence intervals can ensure ‘Popperian risk’ and ‘severe tests’. We may need to supplement CIs with some kind of severity analysis [as in Mayo] discussed in her blog. In the Year of Statistics, 2013, we promised we’d take up the challenge at long last, but have we? No. I’d like to hear from our new member, Dr. Ian Nydes.

Ian: I have a new and radical suggestion, and coming from a doctor, people will believe it: prove mathematically that “the high rate of non replication (lack of confirmation) of research discoveries is a consequence of the convenient, yet ill-founded strategy of claiming conclusive research findings solely on the basis of a single study assessed by formal statistical significance, typically for a p-value less than 0.05. Research is not most appropriately represented and summarized by p-values, but, unfortunately, there is a widespread notion that medical research articles should be interpreted based only on p-values….

It can be proven that most claimed research findings are false”

S.C.: That was, word for word, in John Ioannidis’ celebrated (2005) paper: “Why Most Published Research Findings Are False.”  I and my colleague have serious problems with this alleged “proof”.

Ian: Of course it’s Ioannidis, and many before him (e.g., John Pratt, Jim Berger), but it doesn’t matter. It works. The best thing about it is that many frequentists have bought it–hook, line, and sinker–as a genuine problem for them!

Jake: Ioannidis is to be credited for calling wide attention to some of the terrible practices we’ve been on about since at least the 1960s (if not earlier).  Still, I agree with S.C. here: his “proof” works only if you basically assume researchers are guilty of all the sins that we’re trying to block: turning tests into dichotomous “up or down” affairs, p-hacking, cherry-picking and you name it. But as someone whose main research concerns power analysis, the thing I’m most disturbed about is his abuse of power in his computations.

Dora: Oomph! Power, shpower! Fussing over the correct use of such mathematical concepts is just, well, just mathematics; while surely the “Lord’s work,” it doesn’t pay the rent, and Ioannidis has stumbled on a gold mine!

Pawl: Do medical researchers really claim “conclusive research findings solely on the basis of a single study assessed by formal statistical significance”? Do you mean to tell me that medical researchers are actually engaging in worse practices than I’ve been chastising social science researchers about for decades?

Marty: I love it! We’re not the bottom of the scientific barrel after all–medical researchers are worse! 

Dora: And they really “are killing people”, unlike that wild exaggeration by Ziliak and McCloskey (2008, 186).

Gerry: Ioannidis (2005) is a wild exaggeration. But my main objection is that it misuses frequentist probability. Suppose you’ve got an urn filled with hypotheses; they can be from all sorts of fields, and let’s say 10% of them are true, the rest false. In an experiment involving randomly selecting a hypothesis from this urn, the probability of selecting one with the property “true” is 10%. But now suppose I pick out H’: “Ebola can be contracted by sexual intercourse”, or any hypothesis you like. It’s mistaken to say that Pr(H’) = 0.10.

Pawl: It’s a common fallacy of probabilistic instantiation, like saying a particular 95% confidence interval estimate has probability .95.
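Pawl’s point is easy to check by simulation: the 0.95 is the long-run coverage of the interval-generating procedure, not a probability attached to any particular realized interval. Here is a minimal sketch with made-up parameter values.

```python
# Quick check of Pawl's point: the 95% is the long-run coverage of the
# CI procedure; any one realized interval either contains mu or it doesn't.
# mu, sigma, n, and the number of trials are arbitrary illustration values.
import numpy as np

rng = np.random.default_rng(1)
mu, sigma, n, trials = 5.0, 2.0, 30, 10_000
covered = 0
for _ in range(trials):
    xbar = rng.normal(mu, sigma, n).mean()
    half = 1.96 * sigma / np.sqrt(n)          # known-sigma 95% interval
    covered += (xbar - half <= mu <= xbar + half)

print(covered / trials)   # close to 0.95, a property of the procedure
```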

S.C.: Only it’s worse, because we know the probabilities the hypotheses are true are very different from what you get in following his “Cult of the Holy Spike” priors.

Dora: More mathematical nit-picking. Who cares? My loss function says they’re irrelevant.

Ian: But Pawl, if all you care about is long-run screening probabilities, and posterior predictive values (PPVs), then it’s correct; the thing is, it works! It’s got everyone all upset.

Nayth: Not everyone, we Bayesians love it.  The best thing is that it’s basically a Bayesian computation but with frequentist receiver operating curves!
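For readers keeping score, the screening computation Nayth is describing boils down to a few lines. Here is a minimal sketch; the prevalence, alpha, power, and bias values are illustrative, not taken from Ioannidis (2005), and the bias term is only a crude stand-in for p-hacking and selective reporting.

```python
# Sketch of the screening-style computation behind "most findings are false"
# arguments: treat hypotheses as draws from an urn and compute the positive
# predictive value (PPV) of a "statistically significant" result.
def ppv(prevalence, alpha=0.05, power=0.8, bias=0.0):
    # bias inflates the rate at which null effects get reported as significant
    true_pos = power * prevalence
    false_pos = (alpha + bias * (1 - alpha)) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)

print(round(ppv(0.10), 2))             # ~0.64: most "significant" findings true
print(round(ppv(0.10, bias=0.3), 2))   # ~0.21: with heavy bias, most are false
print(round(ppv(0.10, power=0.2), 2))  # ~0.31: low power drags the PPV down too
```

Note how sensitive the answer is to the assumed prevalence, power, and bias: precisely the inputs Jake and Gerry are quarreling over.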

Dora: It’s brilliant! No one cares about the delicate mathematical nit-picking. Ka-ching! (That’s the sound of my cash register, I’m an economist, you know.)

Pawl: I’ve got to be frank, I don’t see how some people, and I include some people in this room, can disparage the scientific credentials of significance tests and yet rely on them to indict other people’s research, and even to insinuate questionable research practices (QRPs) if not out and out fraud (as with Smeesters, Forster, Anil Potti, many others). Folks are saying that in the past we weren’t so hypocritical.

(silence)

Gerry: I’ve heard others raise Pawl’s charge. They’re irate that the hard core reformers nowadays act like p-values can’t be trusted except when used to show that p-values can’t be trusted. Is this something our committee has to worry about?

Marty: Nah! It’s a gig!

Jenina: I have to confess that this is a part of my own special “gig”. I’ve a simple 3-step recipe that automatically lets me publish an attention-grabbing article whenever I choose.

Nayth: Really?  What are they? (I love numbered lists.)

Jenina: It’s very simple:

Step #1: Non-replication: a story of some poor fool who thinks he’s on the brink of glory and fame when he gets a single statistically significant result in a social psychology experiment. Only then it doesn’t replicate! (Priming studies work well, or situated cognition).

Step #2: A crazy, sexy example where p-values are claimed to support a totally unbelievable, far-out claim. As a rule I take examples about sex (as in this study on penis size and voting behavior)–perhaps with a little evolutionary psychology twist thrown in to make it more “high brow”. (Appeals to ridicule about Fisher or Neyman-Pearson give it a historical flair, while still keeping it fun.) Else I choose something real spo-o-ky like ESP. (If I run out of examples, I take what some of the more vocal P-bashers are into, citing them of course, and getting extra brownie points.) Then, it’s a cinch to say “this stuff’s so unbelievable, if we just use our brains we’d know it was bunk!” (I guess you could call this my Bayesian cheerleading bit.)

Step #3: Ioannidis’ proof that most published research findings are false, illustrated with colorful charts.

Nayth: So it’s:

-Step #1: non-replication,
-Step #2: an utterly unbelievable but sexy “chump effect” [ii] that someone, somewhere has found statistically significant, and
-Step #3: Ioannidis (2005).

Jenina: (breaks into hysterical laughter): Yes, and sometimes,…(laughing so hard she can hardly speak)…sometimes the guy from Step #1, (who couldn’t replicate, and so couldn’t publish, his result) goes out and joins the replication movement and gets to publish his non-replication, without the hassle of peer review. (Ha Ha!)

(General laughter, smirking, groans, or head shaking)

[Nayth: (Aside to Jenina) Your step #2 is just like my toad example, except that turned out to be somewhat plausible.]

Dora: I love the three-step cha cha cha! It’s a win-win strategy: Non-replication, chump effect, Ioannidis (“most research findings are false”).

Marty: Non-replication, chump effect, Ioannidis. (What a great gig!)

Pawl: I’m more interested in a 2014 paper Ioannidis jointly wrote with several others on reducing research waste.

Gerry: I move we place it on our main reading list for our 2016 meeting and then move we adjourn to drinks and dinner. I’ve got reservations at a 5-star restaurant, all covered by the TFSI Foundation.

Jake: I second. All in favor?

All: Aye

Pawl:  Adjourned. Doubles of Elbar Grease for all!

S.C.: Isn’t that Deborah Mayo’s special concoction?

Pawl: Yes, most of us, if truth be known, are closet or even open error statisticians!*

*This post, of course, is a parody or satire (statistical satirical); all quotes are authentic as cited. Send any corrections.

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Parting Remark: Given Fisher’s declaration when first setting out tests to the effect that isolated results are too cheap to be worth having, Cox’s insistence donkey’s years ago that “It is very bad practice to summarise an important investigation solely by a value of P”, and a million other admonishments against statistical fallacies and lampoons, I’m starting to think that a license should be procured (upon passing a severe test) before being permitted to use statistical tests of any kind. My own conception is an inferential reformulation of Neyman-Pearson statistics in which one uses error probabilities to infer discrepancies that are well or poorly warranted by given data. (It dovetails also with Fisherian tests, as seen in Mayo and Cox 2010, using essentially the P-value distribution for sensitivity rather than attained power). Some of the latest movements to avoid biases, selection effects, barn hunting, cherry picking, multiple testing and the like, and to promote controlled trials and attention to experimental design are all to the good. They get their justification from the goal of avoiding corrupt error probabilities. Anyone who rejects the inferential use of error probabilities is hard pressed to justify the strenuous efforts to sustain them. These error probabilities, on which confidence levels and severity assessments are built, are very different from PPVs and similar computations that arise in the context of screening, say, thousands of genes. My worry is that the best of the New Reforms, in failing to make clear the error statistical basis for their recommendations, and confusing screening with the evidential appraisal of particular statistical hypotheses, will fail to halt some of the deepest and most pervasive confusions and fallacies about running and interpreting statistical tests.
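To make the severity idea a bit more concrete, here is a minimal sketch for the familiar one-sided test of a normal mean with known sigma, along the lines of the severity computation in Mayo and Spanos (2011); the numbers are purely illustrative.

```python
# Sketch of a severity assessment for T+: H0: mu <= mu0 vs H1: mu > mu0,
# normal data, known sigma. After observing xbar, SEV(mu > mu1) is the
# probability of a result less discordant with H0 than xbar, were mu = mu1.
# The test specification and xbar below are illustrative numbers only.
from scipy.stats import norm

mu0, sigma, n = 0.0, 2.0, 100
xbar = 0.5                        # z = 2.5, one-sided p ~ 0.006: reject H0
se = sigma / n ** 0.5

def severity(mu1):
    # SEV(mu > mu1) = P(Xbar <= xbar; mu = mu1)
    return norm.cdf((xbar - mu1) / se)

for mu1 in (0.0, 0.2, 0.4, 0.6):
    print(mu1, round(severity(mu1), 3))
# mu > 0.0 is warranted with severity ~0.994; mu > 0.4 only ~0.69;
# mu > 0.6 is poorly warranted (~0.31), despite the small p-value.
```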

 

[i] References here are to Popper, 1977, 1962; Mayo, 1991, 1996, Salmon, 1984.

[ii] From “The Chump Effect: Reporters are credulous, studies show”, by Andrew Ferguson.

“Entire journalistic enterprises, whole books from cover to cover, would simply collapse into dust if even a smidgen of skepticism were summoned whenever we read that “scientists say” or “a new study finds” or “research shows” or “data suggest.” Most such claims of social science, we would soon find, fall into one of three categories: the trivial, the dubious, or the flatly untrue.”

(selected) REFERENCES:

Cohen, J. (1994). The Earth is round (p < .05). American Psychologist, 49, 997-1003.

Cox, D. R. (1958). Some problems connected with statistical inference. Annals of Mathematical Statistics 29 : 357-372.

Cox, D. R. (1977). The role of significance tests. (With discussion). Scand. J. Statist. 4 : 49-70.

Cox, D. R. (1982). Statistical significance tests. Br. J. Clinical. Pharmac. 14 : 325-331.

Gigerenzer, G. et al. (1989), The Empire of Chance, CUP.

Gigerenzer, G. (2000), “The Superego, the Ego, and the Id in Statistical Reasoning,” Adaptive Thinking: Rationality in the Real World, OUP.

Gelman, A. (2011), “Induction and Deduction in Bayesian Data Analysis,” RMM vol. 2, 2011, 67-78. Special Topic: Statistical Science and Philosophy of Science: where do (should) they meet in 2011 and beyond?

Greenland, S. and Poole, C. (2013), “Living with P Values: Resurrecting a Bayesian Perspective on Frequentist Statistics,” Epidemiology 24: 62-8. 

Ioannidis, J. (2005), “Why Most Published Research Findings Are False,” PLoS Medicine 2(8): e124.

Mayo, D. G. (2012). “Statistical Science Meets Philosophy of Science Part 2: Shallow versus Deep Explorations”, Rationality, Markets, and Morals (RMM) 3, Special Topic: Statistical Science and Philosophy of Science, 71–107.

Mayo, D. G. and Cox, D. R. (2010). “Frequentist Statistics as a Theory of Inductive Inference” in Error and Inference: Recent Exchanges on Experimental Reasoning, Reliability and the Objectivity and Rationality of Science (D Mayo and A. Spanos eds.), Cambridge: Cambridge University Press: 1-27. This paper appeared in The Second Erich L. Lehmann Symposium: Optimality, 2006, Lecture Notes-Monograph Series, Volume 49, Institute of Mathematical Statistics, pp. 247-275.

Mayo, D. G. and Spanos, A. (2011) “Error Statistics” in Philosophy of Statistics , Handbook of Philosophy of Science Volume 7 Philosophy of Statistics, (General editors: Dov M. Gabbay, Paul Thagard and John Woods; Volume eds. Prasanta S. Bandyopadhyay and Malcolm R. Forster.) Elsevier: 1-46.

McCloskey, D. N., & Ziliak, S. T. (1996). The standard error of regression. Journal of Economic Literature, 34(1), 97-114.

Meehl, P. E. (1990), “Why summaries of research on psychological theories are often uninterpretable,” Psychological Reports, 66, 195-244.

Meehl, P. and Waller, N. (2002), “The Path Analysis Controversy: A New Statistical Approach to Strong Appraisal of Verisimilitude,” Psychological Methods, Vol. 7: 283–300.

Orlitzky, M. (2012), “How Can Significance Tests be Deinstitutionalized?” Organizational Research Methods 15(2): 199-228.

Popper, K. (1962). Conjectures and Refutations. NY: Basic Books.

Popper, K. (1977). The Logic of Scientific Discovery, NY: Basic Books. (Original published 1959)

Rozeboom, W. (1997), “Good Science is Abductive, not hypothetico-deductive.” In L. L. Harlow, S. A. Mulaik, & J. H. Steiger (Eds.), What if there were no significance tests? (pp. 335-391). Hillsdale, NJ: Lawrence Erlbaum.

Salmon, W. C. (1984). Scientific Explanation and the Causal Structure of the World, Princeton, NJ: Princeton.

Schmidt, F. (1996), “Statistical Significance Testing and Cumulative Knowledge in Psychology: Implications for Training of Researchers,” Psychological Methods, Vol. 1(2): 115-129.

Silver, N. (2012), The Signal and the Noise, Penguin.

Ziliak, S. T., & McCloskey, D. N. (2008), The Cult of Statistical Significance: How the Standard Error Costs Us Jobs, Justice, and Lives. Ann Arbor: University of Michigan Press. (Short piece see: “The Cult of Statistical Significance” from Section on Statistical Education – JSM 2009).



3 YEARS AGO: (JANUARY 2012) MEMORY LANE

3 years ago…

MONTHLY MEMORY LANE: 3 years ago: January 2012. I mark in red three posts that seem most apt for general background on key issues in this blog.

January 2012

This new, once-a-month feature began at the blog’s 3-year anniversary in Sept. 2014. I will count U-Phils on a single paper as one of the three I highlight (else I’d have to choose between them). I will comment on 3-year-old posts from time to time.

This Memory Lane needs a bit of explanation. This blog began largely as a forum to discuss a set of contributions from a conference I organized (with A. Spanos and J. Miller*), “Statistical Science and Philosophy of Science: Where Do (Should) They Meet?”, at the London School of Economics, Center for the Philosophy of Natural and Social Science, CPNSS, in June 2010 (where I am a visitor). Additional papers grew out of conversations initiated soon after (with Andrew Gelman and Larry Wasserman). The conference site is here. My reflections in this general arena (Sept. 26, 2012) are here.

As articles appeared in a special topic of the on-line journal, Rationality, Markets and Morals (RMM), edited by Max Albert [i]—also a conference participant—I would announce an open invitation to readers to take a couple of weeks to write an extended comment. Each “U-Phil”—which stands for “U philosophize”—was a contribution to this activity. I plan to go back to that exercise at some point. Generally I would give a “deconstruction” of the paper first, followed by U-Phils, and then the author gave responses to U-Phils and me as they wished. You can readily search this blog for all the U-Phils and deconstructions**.

I was also keeping a list of issues that we either haven’t taken up, or need to return to. One example here is: Bayesian updating and down dating. Further notes about the origins of this blog are here. I recommend everyone reread Senn’s paper.** 

For newcomers, here’s your chance to catch up; for old timers, this is philosophy: rereading is essential!

[i] Along with Hartmut Kliemt and Bernd Lahno.

*For a full list of collaborators, sponsors, logisticians, and related collaborations, see the conference page. The full list of speakers is found there as well.

**The U-Phil exchange between Mayo and Senn was published in the same special topic of RMM. But I still wish to know how we can cultivate “Senn’s-ability.” We could continue that activity as well, perhaps.

Previous 3 YEAR MEMORY LANES:

Dec. 2011

Nov. 2011

Oct. 2011

Sept. 2011 (Within “All She Wrote (so far)”)


Trial on Anil Potti’s (clinical) Trial Scandal Postponed Because Lawyers Get the Sniffles (updated)


Trial in Medical Research Scandal Postponed
By Jay Price

DURHAM, N.C. — A judge in Durham County Superior Court has postponed the first civil trial against Duke University by the estate of a patient who had enrolled in one of a trio of clinical cancer studies that were based on bogus science.

The case is part of what the investigative TV news show “60 Minutes” said could go down in history as one of the biggest medical research frauds ever.

The trial had been scheduled to start Monday, but several attorneys involved contracted flu. Judge Robert C. Ervin hasn’t settled on a new start date, but after a conference call with him Monday night, attorneys in the case said it could be as late as this fall.

Flu? Don’t these lawyers get flu shots? Wasn’t Duke working on a flu vaccine? Delaying til Fall 2015?

The postponement delayed resolution in the long-running case for the two patients still alive among the eight who filed suit. It also prolonged a lengthy public relations headache for Duke Medicine that has included retraction of research papers in major scientific journals, the embarrassing segment on “60 Minutes” and the revelation that the lead scientist had falsely claimed to be a Rhodes Scholar in grant applications and credentials.

Because it’s not considered a class action, the eight cases may be tried individually. The one designated to come first was brought by Walter Jacobs, whose wife, Julie, had enrolled in an advanced stage lung cancer study based on the bad research. She died in 2010.

“We regret that our trial couldn’t go forward on the scheduled date,” said Raleigh attorney Thomas Henson, who is representing Jacobs. “As our filed complaint shows, this case goes straight to the basic rights of human research subjects in clinical trials, and we look forward to having those issues at the forefront of the discussion when we are able to have our trial rescheduled.”

It all began in 2006 with research led by a young Duke researcher named Anil Potti. He claimed to have found genetic markers in tumors that could predict which cancer patients might respond well to what form of cancer therapy. The discovery, which one senior Duke administrator later said would have been a sort of Holy Grail of cancer research if it had been accurate, electrified other scientists in the field.

Then, starting in 2007, came the three clinical trials aimed at testing the approach. These enrolled more than 100 lung and breast cancer patients, and were eventually expected to enroll hundreds more.

Duke shut them down permanently in 2010 after finding serious problems with Potti’s science.

Now some of the patients – or their estates, since many have died from their illnesses – are suing Duke, Potti, his mentor and research collaborator Dr. Joseph Nevins, and various Duke administrators. The suit alleges, among other things, that they had engaged in a systematic plan to commercially develop cancer tests worth billions of dollars while using science that they knew or should have known to be fraudulent.

The latest revelation in the case, based on documents that emerged from the lawsuit and first reported in the Cancer Letter, a newsletter that covers cancer research issues, is that a young researcher working with Potti had alerted university officials to problems with the research data two years before the experiments on the cancer patients were stopped.

The whistleblower, Brad Perez, is now finishing up a medical residency at Duke. Perez declined to be interviewed, but responded by email that the issues with the research led him to quit working with Potti, though that cost him an extra year in medical school.

“In the course of my work in the Potti lab, I discovered what I perceived to be problems in the predictor models that made it difficult for me to continue working in that environment,” he wrote. “I raised my concerns with my laboratory peers, laboratory supervisors and medical school administrators. I chose to take an additional year to complete medical school in order to have a more successful research experience.”

In an emailed statement in response to questions about the case, Michael Schoenfeld, Duke’s vice president for public affairs and government relations, said Perez had passed his concerns about the lab through proper channels at Duke, and that the resulting review didn’t find research misconduct.

Since then, though, Perez’s concerns have been fully appreciated and recognized, Schoenfeld wrote.

“We can say with great confidence that any concerns like this received today would be handled very differently,” Schoenfeld wrote.

Really? What would they do differently?

“Despite his experience in Dr. Potti’s lab, we’re pleased that Dr. Perez elected to complete his medical education and research training at Duke, and is currently completing his residency in radiation oncology at Duke.”

Through his Raleigh attorney, Dan McLamb, Potti declined an interview, citing the pending court action. Potti now works at another cancer clinic, in Grand Forks, N.D.

No surprise he’s still practicing. Remind me of where not to go.

Potti, Nevins and various collaborators published studies in major research journals based on Potti’s findings beginning in 2006. But researchers elsewhere couldn’t reproduce their results and quickly began to raise questions. In particular, two biostatisticians at MD Anderson Cancer Center in Houston, Keith Baggerly and Kevin Coombes, brought problems they found to the attention Duke officials and began questioning the research publicly.


In 2009 Duke suspended the enrollment of new patients and commissioned an outside review. But the reviewers reported that Potti’s work seemed fine, and Duke rebooted the trials. University leaders later said those reviewers hadn’t looked at the basic data Potti had used.

Only after the Cancer Letter, which has followed the case closely for years, published a report in 2010 saying that Potti had falsely claimed a Rhodes Scholarship in grant applications and elsewhere did Duke’s official support for the research finally began to crumble. It again suspended new enrollments and ended the studies.

Outside scientists who raised questions about the research said they were most worried about the prospect that patients were being put at risk by their participation in the clinical trials. They said the unproven genetic analysis could result in patients being prescribed an improper treatment.

The following sounds like doublespeak:

Duke has maintained, though, that the patients received proper care.

“The criticism in the lawsuit is not related to the high quality of care this patient received,” Schoenfeld wrote in his statement. “While the science behind the genomic predictor used in the trials was ultimately found to involve falsified data, a key factor in the approval of the trial protocols provided that every patients would receive standard of care therapy for their disease whether or not the predictor ultimately proved to be useful.”

First, the patients were promised “a personalized cancer regimen” custom-tailored for their tumor; second, the last sentence is incomprehensible. Third, some of them were apparently getting the less effective treatment due to data mix-ups. Realize that these “treatments” also involved additional surgeries for purposes of the clinical trial only. Please correct me if I’m mistaken.

Regardless of which treatment patients in the clinical trials received, it was considered a best one for treating their disease, he wrote.

“A” best one?

The lawsuit charges, among other things, that in grant applications for the clinical trials that Potti intentionally lied, and included false and fraudulent information about the research results. Nevins, as his research supervisor, and Duke should have known what was wrong, the lawsuit says, because biostatisticians from MD Anderson and others had made numerous attempts to call attention to flaws in the science.

The suit also charges that the clinical trials began after Duke had been “placed on notice” of the flawed underlying science and suggests that the relationships among researchers and administrators were too cozy within the university for it to properly pursue questions about the research.

Duke has made substantive changes to prevent the problems brought to light in the Potti case from recurring, Schoenfeld said. These include better data management, new reviews of potential conflict of interest and improvements in handling reports related to the integrity of research.

“Many lessons learned from this situation have led to significant improvements in both basic and clinical research processes including many new and expanded programs related to scientific accountability, reporting of concerns related to research integrity, multiple improvements in data management and governance, and new scientific and conflict of interest review processes,” he wrote.

Duke put Potti on administrative leave in July 2010 after the charges about his credentials emerged. The next month, Duke announced that he had indeed padded his resume. Potti resigned six months later, and Nevins began the process of retracting the journal articles. Nevins retired from Duke in 2013.

In 2012, Potti accepted a reprimand from the North Carolina Medical Board. He remains licensed to practice in the state and in a few other states. According to records posted online by the North Carolina Medical Board, Potti had agreed to settlements in at least 11 malpractice cases against him, each resulting in a payment of at least $75,000.

Also, in a consent order negotiated with the medical board, Potti agreed to accept a formal reprimand for unprofessional conduct and admitted to having inaccurate information on his resume and in official Duke biographical sketches and to using those flawed credentials in research grant applications.

After leaving Duke, Potti worked awhile in a clinical role rather than in research at a cancer clinic in South Carolina. He was fired from that job after the “60 Minutes” segment aired, though the company he had worked for there, Coastal Cancer Center, said in a news release that his work had been exemplary.

The news release also said that the company had hired him after received glowing letters of recommendation from top medical officials at Duke.

While the trial may be delayed for months, the judge still is expected to hear motions in the case Thursday.

Why months?

Note: I did not fix any of the ungrammatical parts of this news release.

Read the article here:

For background posts on this blog, please see:

“Only those samples which fit the model best in cross validation were included” (whistleblower) “I suspect that we likely disagree with what constitutes validation” (Potti and Nevins)

What have we learned from the Anil Potti training and test data fireworks? Part 1 (draft 2)

————————————————–

The following is an excerpt from this week’s  Cancer Letter on the issue of “No Harm done?”

http://www.cancerletter.com/articles/20150123_2

No Harm Done? 1/23/15

Duke’s motions for a summary judgment argue that the case should turn on North Carolina law, as opposed to established ethical constructs.

In an effort to determine the burden of proof that has to be met by the plaintiff to demonstrate negligence per se, Duke’s motion states that standards contained in the 1979 report by the National Commission for the Protection of Human Subjects of Biomedical and Behavioral Research, known as the Belmont Report, don’t create obligations under North Carolina law. Similarly, they argue that the federal law, Title 45 Part 46 of the Code of Federal Regulations, which sets out requirements for research institutions, is not a part of North Carolina law, either.

Duke basically states that it did nothing wrong.

“Plaintiffs cannot show that a different course of treatment would have made any difference in their care or chance of survival,” the Duke motion reads. “Expert testimony in this case has not established that any clinical trial available in the United States in 2010 would have prolonged plaintiffs’ life expectancy or treated them more effectively. Therefore, plaintiffs cannot meet causation of damage elements of their negligence per se claim.”

Another court filing deals specifically with the case of Juliet Jacobs, a patient with metastatic lung cancer who—with Potti’s knowledge—made a recording of the now disgraced doctor as he presented the trial to her. Juliet’s widower, Walter, is one of the plaintiffs.

Duke attorneys argue that in that specific instance, “these defendants did not abuse, breach, or take advantage of Mrs. Jacobs’s confidence or trust. Instead, they were open, fair, and honest with Mrs. Jacobs and her husband regarding her prognosis and treatment options. Mr. & Mrs. Jacobs were made aware that the clinical trial may increase, decrease or have no effect on Mrs. Jacobs’s likelihood of responding to chemotherapy. They were also encouraged to seek other treatment alternatives.”

Duke’s filings also hold that “the undisputed evidence in this case has established that there was no clinical trial or other treatment available in the United States in 2010 that would have cured Mrs. Jacobs’s cancer or prolonged her life expectancy. Plaintiff cannot show that a different course of treatment would have made any difference in Mrs. Jacobs’s chance of survival.”

Duke attorneys are not representing Potti, who was dismissed from the university. However, they are representing Nevins, the deans, the IRB chair and the spinoff company that was going to commercialize the Nevins-and-Potti inventions.

The defendants argue that the plaintiffs cannot prove “negligence per se” claims because they cannot show that there was “(1) a duty created by a statute or ordinance; (2) that the statute or ordinance was enacted to protect a class of persons which includes the plaintiff; (3) a breach of the statutory duty; (4) that the injury sustained was suffered by an interest which the statute protected; (5) that the injury was of the nature contemplated in the statute; and (6) that the violation of the statute proximally caused the injury.”

Plaintiffs argue that Duke is ultimately responsible for the actions of its scientists and administrators.

“Defendants admit that Dr. Potti fabricated, falsified and intentionally manipulated the data that formed the ‘basis for clinical trials’ in which Juliet Jacobs was enrolled,” one of the plaintiffs’ filings states. “Much of the… falsified, fabricated, and manipulated data came from the laboratory of Dr. Nevins, for which he was ultimately responsible. In fact, Dr. Nevins admitted one set of ‘intentionally altered’ data that came from his lab ‘provided support for the lung cancer trials…’

“Manipulating and fabricating the data for a clinical trial and then lying to a patient to obtain informed consent is a breach of good faith. It constitutes battery and invalidates informed consent. Dr. Potti is the physician who presented the informed consent to the plaintiffs. He is the one who falsified, fabricated and intentionally manipulated the data. He entered into a Consent Order with the North Carolina Medical Board admitting that he committed ‘unprofessional conduct.’ He admitted that there was a responsibility to tell the patients, including Juliet Jacobs, about the controversy with the medicine. Dr. Potti did not inform the Jacobs of either the ‘controversy’ or the fraud.”

Nevins acknowledges that he did not examine the data until October 2010, three months after this publication reported that Potti had misstated his credentials, claiming to have been a Rhodes Scholar, and after Potti was barred from Duke campus.

 

“Money, Fame and Overall Fortune”

Countering Duke’s assertion that no one was injured because patients were assigned to standard therapy, the plaintiffs say that Juliet Jacobs was falsely led to accept a treatment regimen she would not have ordinarily considered.

Imagine if she had been told the trials had been stopped on grounds of flawed data/bad models, and only recently renewed. Imagine if she’d seen the Perez letter or the Baggerly and Coombes articles. I can’t argue the legal subtleties, but it’s outrageous.

The patient’s husband and daughter “testified to the exact opposite,” the filing reads. “Plaintiffs showed that Juliet and Walter Jacobs did not want standard of care chemotherapy and would not have participated if it had not been for the defendants’ fraud.”

The Duke protocol required a second biopsy and led the patient to a chemotherapy regimen that was more aggressive than she would have ordinarily chosen for end-of-life care.

“The second biopsy was not required for the alleged ‘standard of care’ chemotherapy—it was required for participation in the clinical trials,” the plaintiffs argue. “Defendants want to turn a lawsuit based upon personal injury into a wrongful death action. The question is not whether ‘standard of care chemotherapy’ was provided and whether or not the same caused her death. Instead, the question posed by the plaintiffs is whether or not the defendants’ actions caused a personal injury to Juliet and Walter Jacobs. Attempting to recast this as a wrongful death action…is a red herring thrown to distract the finder of fact.”

Most importantly, Juliet Jacobs was deceived, the plaintiffs’ attorneys argue.

“Because her quality of life was very important to her, if she had been given proper consent and told that there was no ‘silver bullet’ and if she had not been told by Dr. Potti that he could give her a chance to live for ten years, she and Walter would more likely than not have made other choices regarding how they spent her last days and what quality that life would have.”

An audio recording of the Jacobs meeting with Potti captures the doctors expressing hope for a miracle.

In the recording, Juliet Jacobs says that her son-in-law has had chemotherapy for a decade, and that he is the only survivor in a clinical trial.

Potti: “Wow. And I, I wouldn’t be surprised if I expect that from you. That’s what I mean. I’m 100 percent on board here, OK?”

Like other patients, Jacobs was presented with a consent form that contained the claim that the genomic predictor that would be used had the accuracy of approximately 80 percent.

Instead of going into hospice care, Juliet Jacobs ended up with a lot of toxicity and a quality of life her family members described as poor.

The date of the family’s meeting with Potti is important: Feb. 11, 2010, a month after Duke restarted the trials following an internal investigation that has since been shown to be cursory and skewed. That controversy was never mentioned to the prospective patient and her family.

Knowing what he knows now, Walter Jacobs is furious.

“I know that it’s an immoral, evil, awful thing that has been done,” he said in a deposition.

The plaintiffs also allege a “civil conspiracy.”

“The underlying conspiracy was among the defendants and Dr. Potti and Dr. Nevins, on behalf of themselves and on behalf of their outside financial interest, Cancer Guide, to cover up the falsification in order to continue the clinical trials. The successful conclusion of the clinical trials would have meant money, fame and overall fortune.”

 


What do these share in common: m&ms, limbo stick, ovulation, Dale Carnegie? Sat night potpourri


For entertainment only

Here’s the follow-up to my last (reblogged) post, initially here. My take hasn’t changed much from 2013. Should we be labeling some pursuits “for entertainment only”? Why not? (See also a later post on the replication crisis in psych.)

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

I had said I would label as pseudoscience or questionable science any enterprise that regularly permits the kind of ‘verification biases’ in the statistical dirty laundry list.  How regularly? (I’ve been asked)

Well, surely if it’s as regular as, say, much of social psychology, it goes over the line. But it’s not mere regularity, it’s the nature of the data, the type of inferences being drawn, and the extent of self-scrutiny and recognition of errors shown (or not shown). The regularity is just a consequence of the methodological holes. My standards may be considerably more stringent than most, but quite aside from statistical issues, I simply do not find hypotheses well-tested if they are based on “experiments” that consist of giving questionnaires. At least not without a lot more self-scrutiny and discussion of flaws than I ever see. (There may be counterexamples.)

Attempts to recreate phenomena of interest in typical social science “labs” leave me with the same doubts. Huge gaps often exist between elicited and inferred results. One might locate the problem under “external validity” but to me it is just the general problem of relating statistical data to substantive claims.

Experimental economists (expereconomists) take lab results plus statistics to warrant sometimes ingenious inferences about substantive hypotheses.  Vernon Smith (of the Nobel Prize in Econ) is rare in subjecting his own results to “stress tests”.  I’m not withdrawing the optimistic assertions he cites from EGEK (Mayo 1996) on Duhem-Quine (e.g., from “Rhetoric and Reality” 2001, p. 29). I’d still maintain, “Literal control is not needed to attribute experimental results correctly (whether to affirm or deny a hypothesis). Enough experimental knowledge will do”.  But that requires piece-meal strategies that accumulate, and at least a little bit of “theory” and/or a decent amount of causal understanding.[1]

I think the generalizations extracted from questionnaires allow for an enormous amount of “reading into” the data. Suddenly one finds the “best” explanation. Questionnaires should be deconstructed for how they may be misinterpreted, not to mention how responders tend to guess what the experimenter is looking for. (I’m reminded of the current hoopla over questionnaires on breadwinners, housework and divorce rates!) I respond with the same eye-rolling to just-so story telling along the lines of evolutionary psychology.

I apply the “Stapel test”: Even if Stapel had bothered to actually carry out the data-collection plans that he so carefully crafted, I would not find the inferences especially telling in the least. Take for example the planned-but-not-implemented study discussed in the recent New York Times article on Stapel:

 Stapel designed one such study to test whether individuals are inclined to consume more when primed with the idea of capitalism. He and his research partner developed a questionnaire that subjects would have to fill out under two subtly different conditions. In one, an M&M-filled mug with the word “kapitalisme” printed on it would sit on the table in front of the subject; in the other, the mug’s word would be different, a jumble of the letters in “kapitalisme.” Although the questionnaire included questions relating to capitalism and consumption, like whether big cars are preferable to small ones, the study’s key measure was the amount of M&Ms eaten by the subject while answering these questions….Stapel and his colleague hypothesized that subjects facing a mug printed with “kapitalisme” would end up eating more M&Ms.

Stapel had a student arrange to get the mugs and M&Ms and later load them into his car along with a box of questionnaires. He then drove off, saying he was going to run the study at a high school in Rotterdam where a friend worked as a teacher.

Stapel dumped most of the questionnaires into a trash bin outside campus. At home, using his own scale, he weighed a mug filled with M&Ms and sat down to simulate the experiment. While filling out the questionnaire, he ate the M&Ms at what he believed was a reasonable rate and then weighed the mug again to estimate the amount a subject could be expected to eat. He built the rest of the data set around that number. He told me he gave away some of the M&M stash and ate a lot of it himself. “I was the only subject in these studies,” he said.

He didn’t even know what a plausible number of M&Ms consumed would be! But never mind that, observing a genuine “effect” in this silly study would not have probed the hypothesis. Would it?

II. Dancing the pseudoscience limbo: How low should we go?


 

Should those of us serious about improving the understanding of statistics be expending ammunition on studies sufficiently crackpot to lead CNN to withdraw reporting on a resulting (published) paper?

“Last week CNN pulled a story about a study purporting to demonstrate a link between a woman’s ovulation and how she votes, explaining that it failed to meet the cable network’s editorial standards. The story was savaged online as “silly,” “stupid,” “sexist,” and “offensive.” Others were less nice.”

That’s too low down for me… (though it’s good for it to be in Retraction Watch). Even stooping down to the level of “The Journal of Psychological Pseudoscience” strikes me as largely a waste of time–for meta-methodological efforts at least. January 25, 2015 note: Given the replication projects, and the fact that a meta-methodological critique of them IS worthwhile, this claim should be qualified. Remember this post was first blogged in June 2013.

I was hastily making these same points in an e-mail to A. Gelman just yesterday:

E-mail to Gelman: Yes, the idea that X should be published iff a p<.05 in an interesting topic is obviously crazy.

I keep emphasizing that the problems of design and of linking stat to substantive are the places to launch a critique, and the onus is on the researcher to show how violations are avoided.  … I haven’t looked at the ovulation study (but this kind of thing has been done a zillion times) and there are a zillion confounding factors and other sources of distortion that I know were not ruled out. I’m prepared to abide such studies as akin to Zoltar at the fair [Zoltar the fortune teller]. Or, view it as a human interest story—let’s see what amusing data they collected, […oh, so they didn’t even know if women they questioned were ovulating]. You talk of top psych journals, but I see utter travesties in the ones you call top. I admit I have little tolerance for this stuff, but I fail to see how adopting a better statistical methodology could help them. …

Look, there aren’t real regularities in many, many areas–better statistics could only reveal this to an honest researcher. If Stapel actually collected data on M&M’s and having a mug with “Kapitalism” in front of subjects, it would still be B.S.! There are a lot of things in the world I consider crackpot. They may use some measuring devices, and I don’t blame those measuring devices simply because they occupy a place in a pseudoscience or “pre-science” or “a science-wannabe”. Do I think we should get rid of pseudoscience? Yes! [At least if they have pretensions to science, and are not described as “for entertainment purposes only”[2].] But I’m afraid this would shut down [or radically redescribe] a lot more fields than you and most others would agree to.  So it’s live and let live, and does anyone really think it’s hurting honest science very much?

There are fields like (at least parts of) experimental psychology that have been trying to get scientific by relying on formal statistical methods, rather than doing science. We get pretensions to science, and then when things don’t work out, they blame the tools. First, significance tests, then confidence intervals, then meta-analysis,…do you think these same people are going to get the cumulative understanding they seek when they move to Bayesian methods? Recall [Frank] Schmidt in one of my Saturday night comedies, rhapsodizing about meta-analysis:

“It means that the behavioral and social sciences can attain the status of true sciences: they are not doomed forever to the status of quasi-sciences or pseudoscience. ..[T]he gloom, cynicism, and nihilism that have enveloped many in the behavioral and social sciences is lifting. Young people starting out in the behavioral and social sciences today can hope for a much brighter future.”(Schmidt 1996)

III. Dale Carnegie salesman fallacy:

It’s not just that bending over backwards to criticize the most blatant abuses of statistics is a waste of time. I also think dancing the pseudoscientific limbo too low has a tendency to promote its very own fallacy! I don’t know if it has a name, so I made one up. Carnegie didn’t mean this to be used fallaciously, but merely as a means to a positive sales pitch for an idea, call it H. You want to convince a person of H? Get them to say yes to a series of claims first, then throw in H and let them make the leap to accept H too. “You agree that the p-values in the ovulation study show nothing?” “Yes” “You agree that study on bicep diameter is bunk?” “Yes, yes”, and  “That study on ESP—pseudoscientific, yes?” “Yes, yes, yes!” Then announce, “I happen to favor operational probalogist statistics (H)”. Nothing has been said to advance H, no reasons have been given that it avoids the problems raised. But all those yeses may well lead the person to say yes to H, and to even imagine an argument has been given. Dale Carnegie was a shrewd man.

Note: added Jan 24, 2015: You might be interested in the (brief) exchange between Gelman and me in the comments from the original post.
Of relevance was the later post on the replication crisis in psychology: http://errorstatistics.com/2014/06/30/some-ironies-in-the-replication-crisis-in-social-psychology-1st-installment/

[1] Vernon Smith ends his paper:

My personal experience as an experimental economist since 1956 resonates well with Mayo’s critique of Lakatos: “Lakatos, recall, gives up on justifying control; at best we decide—by appeal to convention—that the experiment is controlled. … I reject Lakatos and others’ apprehension about experimental control. Happily, the image of experimental testing that gives these philosophers cold feet bears little resemblance to actual experimental learning. Literal control is not needed to correctly attribute experimental results (whether to affirm or deny a hypothesis). Enough experimental knowledge will do. Nor need it be assured that the various factors in the experimental context have no influence on the result in question—far from it. A more typical strategy is to learn enough about the type and extent of their influences and then estimate their likely effects in the given experiment”. [Mayo EGEK 1996, 240]. V. Smith, “Method in Experiment: Rhetoric and Reality” 2001, 29.

My example in this chapter was linking statistical models in experiments on Brownian motion (by Brown).

[2] I actually like Zoltar (or Zoltan) fortune telling machines, and just the other day was delighted to find one in a costume store on 21st St.

 

Categories: junk science, Statistical fraudbusting, Statistics | 3 Comments

Some statistical dirty laundry

Objectivity 1: Will the Real Junk Science Please Stand Up?


It’s an apt time to reblog the “statistical dirty laundry” post from 2013 here. I hope we can take up the recommendations from Simmons, Nelson and Simonsohn at the end (Note [5]), which we didn’t last time around.

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

I finally had a chance to fully read the 2012 Tilburg Report* on “Flawed Science” last night. Here are some stray thoughts…

1. Slipping into pseudoscience.
The authors of the Report say they never anticipated giving a laundry list of “undesirable conduct” by which researchers can flout pretty obvious requirements for the responsible practice of science. It was an accidental byproduct of the investigation of one case (Diederik Stapel, social psychology) that they walked into a culture of “verification bias”[1]. Maybe that’s why I find it so telling. It’s as if they could scarcely believe their ears when people they interviewed “defended the serious and less serious violations of proper scientific method with the words: that is what I have learned in practice; everyone in my research environment does the same, and so does everyone we talk to at international conferences” (Report 48). So they trot out some obvious rules, and it seems to me that they do a rather good job:

One of the most fundamental rules of scientific research is that an investigation must be designed in such a way that facts that might refute the research hypotheses are given at least an equal chance of emerging as do facts that confirm the research hypotheses. Violations of this fundamental rule, such as continuing an experiment until it works as desired, or excluding unwelcome experimental subjects or results, inevitably tends to confirm the researcher’s research hypotheses, and essentially render the hypotheses immune to the facts…. [T]he use of research procedures in such a way as to ‘repress’ negative results by some means” may be called verification bias. [my emphasis] (Report, 48).

I would place techniques for ‘verification bias’ under the general umbrella of techniques for squelching stringent criticism and repressing severe tests. These gambits make it so easy to find apparent support for one’s pet theory or hypotheses as to count as no evidence at all (see some from their list). Any field that regularly proceeds this way I would call a pseudoscience, or non-science, following Popper. “Observations or experiments can be accepted as supporting a theory (or a hypothesis, or a scientific assertion) only if these observations or experiments are severe tests of the theory” (Popper 1994, p. 89). [2] It is unclear at what point a field slips into the pseudoscience realm.

2. A role for philosophy of science?
I am intrigued that one of the final recommendations in the Report is this:

In the training program for PhD students, the relevant basic principles of philosophy of science, methodology, ethics and statistics that enable the responsible practice of science must be covered. Based on these insights, research Master’s students and PhD students must receive practical training from their supervisors in the application of the rules governing proper and honest scientific research, which should include examples of such undesirable conduct as data massage. The Graduate School must explicitly ensure that this is implemented.

A philosophy department could well create an entire core specialization that revolved around “the relevant basic principles of philosophy of science, methodology, ethics and statistics that enable the responsible practice of science” (ideally linked with one or more other departments). That would be innovative and would fill an important gap, it seems to me. Is anyone doing this?


3. Hanging out some statistical dirty laundry.
Items in their laundry list include:

  • An experiment fails to yield the expected statistically significant results. The experiment is repeated, often with minor changes in the manipulation or other conditions, and the only experiment subsequently reported is the one that did yield the expected results. The article makes no mention of this exploratory method… It should be clear, certainly with the usually modest numbers of experimental subjects, that using experiments in this way can easily lead to an accumulation of chance findings….
  • A variant of the above method is: a given experiment does not yield statistically significant differences between the experimental and control groups. The experimental group is compared with a control group from a different experiment—reasoning that ‘they are all equivalent random groups after all’—and thus the desired significant differences are found. This fact likewise goes unmentioned in the article….
  • The removal of experimental conditions. For example, the experimental manipulation in an experiment has three values. …Two of the three conditions perform in accordance with the research hypotheses, but a third does not. With no mention in the article of the omission, the third condition is left out….
  • The merging of data from multiple experiments [where data] had been combined in a fairly selective way,…in order to increase the number of subjects to arrive at significant results…
  • Research findings were based on only some of the experimental subjects, without reporting this in the article. On the one hand ‘outliers’…were removed from the analysis whenever no significant results were obtained. (Report, 49-50)

For many further examples, and also caveats [3], see the Report.
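To see how quickly the first gambit on the list above piles up chance findings, here is a minimal simulation sketch in Python (my illustration, not anything from the Report); the sample size, the cap on retries, and the two-group t-test setup are all arbitrary assumptions.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2013)

def published_after_retries(n_per_group=20, alpha=0.05, max_tries=5):
    """Re-run a two-group experiment with NO true effect (as if with 'minor
    changes') until p < alpha or the tries run out; report only the last run."""
    for _ in range(max_tries):
        treatment = rng.normal(0, 1, n_per_group)   # null is true: same population
        control = rng.normal(0, 1, n_per_group)
        _, p = stats.ttest_ind(treatment, control)
        if p < alpha:
            return True    # only this run would appear in the article
    return False           # the failed runs stay in the file drawer

trials = 10_000
reported = sum(published_after_retries() for _ in range(trials))
print("nominal false-positive rate: 5%")
print(f"rate of 'significant' findings after up to 5 tries: {reported / trials:.1%}")
# roughly 1 - 0.95**5, i.e. about 23%, even though nothing is there
```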

4.  Significance tests don’t abuse science, people do.
Interestingly the Report distinguishes the above laundry list from the “statistical incompetence and lack of interest found” (52). If the methods used were statistical, then the scrutiny might be called “metastatistical”, or the full scrutiny “meta-methodological”. Stapel often fabricated data, but the upshot of these criticisms is that sufficient finagling may similarly predetermine that a researcher’s favorite hypothesis gets support. (There is obviously a big advantage in having the data to scrutinize, as many are now demanding). Is it a problem of these methods that they are abused? Or does the fault lie with the abusers? Obviously the latter. Statistical methods don’t kill scientific validity, people do.

I have long rejected dichotomous testing, but the gambits in the laundry list create problems even for more sophisticated uses of methods, e.g., for indicating magnitudes of discrepancy and associated confidence intervals. At least the methods admit of tools for mounting a critique.

In “The Mind of a Con Man” (NY Times, April 26, 2013 [4]), Diederik Stapel explains how he always read the research literature extensively to generate his hypotheses. “So that it was believable and could be argued that this was the only logical thing you would find.” Rather than report on believability, researchers need to report the properties of the methods they used: What was their capacity to have identified, avoided, or admitted verification bias? The role of probability here would not be to quantify the degree of confidence or believability in a hypothesis, given the background theory or most intuitively plausible paradigms, but rather to check how severely probed or well-tested a hypothesis is–whether the assessment is formal, quasi-formal or informal. Was a good job done in scrutinizing flaws…or a terrible one? Or was there just a bit of data massaging and cherry picking to support the desired conclusion? As a matter of routine, researchers should tell us. Yes, as Joe Simmons, Leif Nelson and Uri Simonsohn suggest in “A 21-word solution”, they should “say it!” I’m no longer inclined to regard their recommendation as too unserious: researchers who are “clean” should go ahead and “clap their hands”[5]. (I will consider their article in a later post…)

I recommend reading the Tilburg report!


*The subtitle is “The fraudulent research practices of social psychologist Diederik Stapel.”

[1] “A ‘byproduct’ of the Committees’ inquiries is the conclusion that, far more than was originally assumed, there are certain aspects of the discipline itself that should be deemed undesirable or even incorrect from the perspective of academic standards and scientific integrity.” (Report 54).

[2] Mere falsifiability, by the way, does not suffice for stringency; but there are also methods Popper rejects that could yield severe tests, e.g., double counting. (Search this blog for more entries.)

[3] “It goes without saying that the Committees are not suggesting that unsound research practices are commonplace in social psychology. …although they consider the findings of this report to be sufficient reason for the field of social psychology in the Netherlands and abroad to set up a thorough internal inquiry into the state of affairs in the field” (Report, 48).

[4] Philosopher Janet Stemwedel discusses the NY Times article, noting that Stapel taught a course on research ethics!

[5] From  Simmons, Nelson and Simonsohn:

 Many support our call for transparency, and agree that researchers should fully disclose details of data collection and analysis. Many do not agree. What follows is a message for the former; we begin by preaching to the choir.

Choir: There is no need to wait for everyone to catch up with your desire for a more transparent science. If you did not p-hack a finding, say it, and your results will be evaluated with the greater confidence they deserve.

If you determined sample size in advance, say it.

If you did not drop any variables, say it.

If you did not drop any conditions, say it.

The Fall 2012 Newsletter for the Society for Personality and Social Psychology
 
Popper, K. 1994, The Myth of the Framework.
Categories: junk science, reproducibility, spurious p values, Statistics | 27 Comments

Power Analysis and Non-Replicability: If bad statistics is prevalent in your field, does it follow you can’t be guilty of scientific fraud?


If questionable research practices (QRPs) are prevalent in your field, then apparently you can’t be guilty of scientific misconduct or fraud (by mere QRP finagling), or so some suggest. Isn’t that an incentive for making QRPs the norm? 

The following is a recent blog discussion (by Ulrich Schimmack) on the Jens Förster scandal: I thank Richard Gill for alerting me. I haven’t fully analyzed Schimmack’s arguments, so please share your reactions. I agree with him on the importance of power analysis, but I’m not sure that the way he’s using it (via his “R index”) shows what he claims. Nor do I see how any of this invalidates, or spares Förster from, the fraud allegations along the lines of Simonsohn[i]. Most importantly, I don’t see that cheating one way vs another changes the scientific status of Förster’s flawed inference. Förster already admitted that, faced with unfavorable results, he’d always find ways to fix things until he got results in sync with his theory (on the social psychology of creativity priming). Fraud by any other name.

Förster

The official report, “Suspicion of scientific misconduct by Dr. Jens Förster,” is anonymous and dated September 2012. An earlier post on this blog, “Who ya gonna call for statistical fraud busting”, featured a discussion by Neuroskeptic, from Discover Magazine, that I found illuminating: “On the Suspicion of Scientific Misconduct by Jens Förster”. Also see Retraction Watch.

Does anyone know the official status of the Förster case?

“How Power Analysis Could Have Prevented the Sad Story of Dr. Förster”

From Ulrich Schimmack’s “Replicability Index” blog January 2, 2015. A January 14, 2015 update is here. (occasional emphasis in bright red is mine)

Background

In 2011, Dr. Förster published an article in Journal of Experimental Psychology: General. The article reported 12 studies and each study reported several hypothesis tests. The abstract reports that “In all experiments, global/local processing in 1 modality shifted to global/local processing in the other modality”.

For a while this article was just another article that reported a large number of studies that all worked, and neither the reviewers nor the editor who accepted the manuscript for publication found anything wrong with the reported results.

In 2012, an anonymous letter voiced suspicion that Jens Förster had violated rules of scientific conduct. The allegation led to an investigation, but as of today (January 1, 2015) there is no satisfactory account of what happened. Jens Förster maintains that he is innocent (5b. Brief von Jens Förster vom 10. September 2014) and blames the accusations of scientific misconduct on a climate of hypervigilance after the discovery of scientific misconduct by another social psychologist.

The Accusation

The accusation is based on an unusual statistical pattern in three publications. The 3 articles reported 40 experiments with 2284 participants, that is, an average sample size of N = 57 participants per experiment. The 40 experiments all had a between-subject design with three groups: one group received a manipulation designed to increase scores on the dependent variable. A second group received the opposite manipulation to decrease scores on the dependent variable. And a third group served as a control condition, with the expectation that the average of this group would fall in the middle of the two other groups. To demonstrate that both manipulations have an effect, both experimental groups have to show significant differences from the control group.

The accuser noticed that the reported means were unusually close to a linear trend. This means that the two experimental conditions showed markedly symmetrical deviations from the control group. For example, if one manipulation increased scores on the dependent variables by half a standard deviation (d = +.5), the other manipulation decreased scores on the dependent variable by half a standard deviation (d = -.5). Such a symmetrical pattern can be expected when the two manipulations are equally strong AND WHEN SAMPLE SIZES ARE LARGE ENOUGH TO MINIMIZE RANDOM SAMPLING ERROR. However, the sample sizes were small (n = 20 per condition, N = 60 per study). These sample sizes are not unusual and social psychologists often use n = 20 per condition to plan studies. However, these sample sizes have low power to produce consistent results across a large number of studies.

The accuser computed the statistical probability of obtaining the reported linear trend. The probability of obtaining the picture-perfect pattern of means by chance alone was incredibly small.

Based on this finding, the Dutch National Board for Research Integrity (LOWI) started an investigation of the causes for this unlikely finding. An English translation of the final report was published on retraction watch. An important question was whether the reported results could have been obtained by means of questionable research practices or whether the statistical pattern can only be explained by data manipulation. The English translation of the final report includes two relevant passages.

According to one statistical expert “QRP cannot be excluded, which in the opinion of the expert is a common, if not “prevalent” practice, in this field of science.” This would mean that Dr. Förster acted in accordance with scientific practices and that his behavior would not constitute scientific misconduct. 

Mayo: Note the language: “acted in accordance with”. Not even “acted in a way that, while leading to illicit results, is not so very uncommon in this field, so may not rise to the level of scientific misconduct”. With this definition, there’s no misconduct with Anil Potti and a number of other apparent ‘frauds’ either.

In response to this assessment the Complainant “extensively counters the expert’s claim that the unlikely patterns in the experiments can be explained by QRP.” This led to the decision that scientific misconduct occurred.

Four QRPs were considered.

  1. Improper rounding of p-values. This QRP can only be used rarely when p-values happen to be close to .05. It is correct that this QRP cannot produce highly unusual patterns in a series of replication studies. It can also be easily checked by computing exact p-values from reported test statistics.
  2. Selecting dependent variables from a set of dependent variables. The articles in question reported several experiments that used the same dependent variable. Thus, this QRP cannot explain the unusual pattern in the data.
  3. Collecting additional research data after an initial research finding revealed a non-significant result. This description of a QRP is ambiguous. Presumably it refers to optional stopping: that is, continuing data collection when the data trend in the right direction, repeatedly checking p-values, and stopping when the p-value is significant. This practice leads to random variation in sample sizes. However, studies in the reported articles all have more or less 20 participants per condition. Thus, optional stopping can be ruled out. However, if a condition with 20 participants does not produce a significant result, it could simply be discarded, and another condition with 20 participants could be run. With a false-positive rate of 5%, this procedure will eventually yield the desired outcome while holding sample size constant. It seems implausible that Dr. Förster conducted 20 studies to obtain a single significant result. Thus, it is even more plausible that the effect is actually there, but that studies with n = 20 per condition have low power. If power were just 30%, the effect would reach significance in roughly every third study, and only 60 participants per condition (three runs of n = 20) would be needed to produce a significant result in one out of three studies. The report provides insufficient information to rule out this QRP, although it is well known that excluding failed studies is a common practice in all sciences.
  4. Selectively and secretly deleting data of participants (i.e., outliers) to arrive at significant results. The report provides no explanation of how this QRP can be ruled out as an explanation. Simmons, Nelson, and Simonsohn (2011) demonstrated that conducting a study with 37 participants and then deleting data from 17 participants can contribute to a significant result when the null hypothesis is true. However, if an actual effect is present, fewer participants need to be deleted to obtain a significant result. If the original sample size is large enough, it is always possible to delete cases to end up with a significant result. Of course, at some point selective and secretive deletion of observations is just data fabrication. Rather than making up data, actual data from participants are deleted to end up with the desired pattern of results. However, without information about the true effect size, it is difficult to determine whether an effect was present and just embellished (see Fisher’s analysis of Mendel’s famous genetics studies) or whether the null hypothesis is true.

The English translation of the report does not contain any statements about questionable research practices from Dr. Förster. In an email communication on January 2, 2014, Dr. Förster revealed that he in fact ran multiple studies, some of which did not produce significant results, and that he only reported his best studies. He also mentioned that he openly admitted to this common practice to the commission. The English translation of the final report does not mention this fact. Thus, it remains an open question whether QRPs could have produced the unusual linearity in Dr. Förster’s studies.

A New Perspective: The Curse of Low Powered Studies

One unresolved question is why Dr. Förster would manipulate data to produce a linear pattern of means that he did not even mention in his articles. (Discover magazine).

One plausible answer is that the linear pattern is the by-product of questionable research practices to claim that two experimental groups with opposite manipulations are both significantly different from a control group. To support this claim, the articles always report contrasts of the experimental conditions and the control condition (see Table below).

In Table 1 the results of these critical tests are reported with subscripts next to the reported means. As the direction of the effect is theoretically determined, a one-tailed test was used. The null-hypothesis was rejected when p < .05.

Table 1 reports 9 comparisons of global processing conditions and control groups and 9 comparisons of local processing conditions with a control group; a total of 18 critical significance tests. All studies had approximately 20 participants per condition. The average effect size across the 18 studies is d = .71 (median d = .68). An a priori power analysis with d = .7, N = 40, and significance criterion .05 (one-tailed) gives a power estimate of 69%.
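For readers who want to check figures like these, here is a minimal Python sketch of the a priori power computation for an independent-groups, one-tailed t-test with n = 20 per condition (my sketch, not Schimmack’s own code):

```python
import numpy as np
from scipy import stats

def power_two_sample_t(d, n_per_group, alpha=0.05, one_tailed=True):
    """Power of an independent-groups t-test for a standardized effect size d."""
    df = 2 * n_per_group - 2
    ncp = d * np.sqrt(n_per_group / 2)            # noncentrality parameter, equal n
    a = alpha if one_tailed else alpha / 2
    t_crit = stats.t.ppf(1 - a, df)               # rejection cutoff under the null
    return 1 - stats.nct.cdf(t_crit, df, ncp)     # P(reject | true effect = d)

# d = .7, N = 40 (20 per condition), alpha = .05 one-tailed
print(f"{power_two_sample_t(0.7, 20):.3f}")       # roughly 0.69-0.70, as quoted above
```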

An alternative approach is to compute observed power for each study and to use median observed power (MOP) as an estimate of true power. This approach is more appropriate when effect sizes vary across studies. In this case, it leads to the same conclusion, MOP = 67%.

The MOP estimate of power implies that a set of 100 tests is expected to produce 67 significant results and 33 non-significant results. For a set of 18 tests, the expected values are 12.4 significant results and 5.6 non-significant results.

The actual success rate should be easy to infer from Table 1, but there are some inaccuracies in the subscripts. For example, Study 1a shows no significant difference between means of 38 and 31 (d = .60), but it shows a significant difference between means of 31 and 27 (d = .33). Most likely the subscript for the control condition should be c, not a.

Based on the reported means and standard deviations, the actual success rate with N = 40 and p < .05 (one-tailed) is 83% (15 significant and 3 non-significant results).

The actual success rate (83%) is higher than one would expect based on MOP (67%). This inflation in the success rate suggests that the reported results are biased in favor of significant results (the reasons for this bias are irrelevant for the following discussion, but it could be produced by not reporting studies with non-significant results, which would be consistent with Dr. Förster’s account).

The R-Index was developed to correct for this bias. The R-Index subtracts the inflation rate (83% – 67% = 16%) from MOP. For the data in Table 1, the R-Index is 51% (67% – 16%).
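The R-Index arithmetic just described fits in a few lines; this is simply the formula as stated here (not Schimmack’s own code), using the figures quoted in the post:

```python
def r_index(median_observed_power, success_rate):
    """R-Index = median observed power minus the inflation of the success rate."""
    inflation = success_rate - median_observed_power
    return median_observed_power - inflation

# MOP = .67 and 15 significant results out of 18 tests, as reported above
print(f"R-Index = {r_index(0.67, 15 / 18):.0%}")   # 51%
```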

Given the use of a between-subject design and approximately equal sample sizes in all studies, the inflation in power can be used to estimate inflation of effect sizes. A study with N = 40 and p < .05 (one-tailed) has 50% power when d = .50.

Thus, one interpretation of the results in Table 1 is that the true effect sizes of the manipulation is d = .5, that 9 out of 18 tests should have produced a significant contrast at p < .05 (one-tailed) and that questionable research practices were used to increase the success rate from 50% to 83% (15 vs. 9 successes).

The use of questionable research practices would also explain the unusual linearity in the data. Questionable research practices will inflate or omit effect sizes that are insufficient to produce a significant result. With a sample size of N = 40, an effect size of d = .5 is insufficient to produce a significant result: d = .5, se = .32, t(38) = 1.58, p = .06 (one-tailed). Random sampling error that works against the hypothesis can only produce non-significant results that have to be dropped or moved upwards using questionable methods. Random error that favors the hypothesis will inflate the effect size and start producing significant results. However, random error is normally distributed around the true effect size and is more likely to produce results that are just significant (d = .8) than to produce results that are very significant (d = 1.5). Thus, the reported effect sizes will be clustered more closely around the median inflated effect size than one would expect based on an unbiased sample of effect sizes.

The clustering of effect sizes will happen for the positive effects in the global processing condition and for the negative effects in the local processing condition. As a result, the pattern of all three means will be more linear than an unbiased set of studies would predict. In a large set of studies, this bias will produce a very low p-value.

One way to test this hypothesis is to examine the variability in the reported results. The Test of Insufficient Variance (TIVA) was developed for this purpose. TIVA first converts p-values into z-scores. The variance of z-scores is known to be 1. Thus, a representative sample of z-scores should have a variance of 1, but questionable research practices lead to a reduction in variance. The probability that a set of z-scores is a representative set of z-scores can be computed with a chi-square test and chi-square is a function of the ratio of the expected and observed variance and the number of studies. For the set of studies in Table 1, the variance in z-scores is .33. The chi-square value is 54. With 17 degrees of freedom, the p-value is 0.00000917 and the odds of this event occurring by chance are 1 out of 109,056 times.
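Here is a hedged sketch of the idea behind TIVA: convert one-tailed p-values to z-scores, compare their variance to the value of 1 expected for an unselected set, and get a left-tail chi-square probability. The p-values below are made up for illustration, and Schimmack’s own implementation may scale the test statistic differently, so treat this as the textbook variance test rather than a reproduction of his numbers.

```python
import numpy as np
from scipy import stats

def insufficient_variance_test(p_values):
    """Is the variance of the z-scores implied by a set of one-tailed p-values
    suspiciously smaller than the value of 1 expected without selection?"""
    z = stats.norm.isf(p_values)            # one-tailed p -> z-score
    k = len(z)
    var_z = np.var(z, ddof=1)               # observed variance of the z-scores
    chi2 = (k - 1) * var_z                  # (k-1) * s^2 / sigma^2 with sigma^2 = 1
    p_low = stats.chi2.cdf(chi2, k - 1)     # left tail: too little variance
    return var_z, p_low

# Illustrative (made-up) p-values clustered just below .05, a typical selection signature
p_vals = np.array([.04, .03, .045, .02, .05, .035, .04, .01, .03,
                   .045, .025, .04, .02, .048, .03, .045, .015, .04])
var_z, p = insufficient_variance_test(p_vals)
print(f"variance of z-scores = {var_z:.2f}, left-tail chi-square p = {p:.2g}")
```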

Conclusion

Previous discussions about abnormal linearity in Dr. Förster’s studies have failed to provide a satisfactory answer. An anonymous accuser claimed that the data were fabricated or manipulated, which the author vehemently denies. This blog proposes a plausible explanation of what happened. Dr. Förster may have conducted more studies than were reported and included only studies with significant results in his articles. Slight variation in sample sizes suggests that he may also have removed a few outliers selectively to compensate for low power. Importantly, neither of these practices would imply scientific misconduct. The conclusion of the commission that scientific misconduct occurred rests on the assumption that QRPs cannot explain the unusual linearity of means, but this blog points out how selective reporting of positive results may have inadvertently produced this linear pattern of means. Thus, the present analysis supports the conclusion by an independent statistical expert mentioned in the LOWI report: “QRP cannot be excluded, which in the opinion of the expert is a common, if not “prevalent” practice, in this field of science.”

How Unusual is an R-Index of 51?

The R-Index for the 18 statistical tests reported in Table 1 is 51% and TIVA confirms that the reported p-values have insufficient variance. Thus, it is highly probable that questionable research practices contributed to the results, and in a personal communication Dr. Förster confirmed that additional studies with non-significant results exist. This account of events is consistent with other examples.

For example, the R-Index for a set of studies by Roy Baumeister was 49%. Roy Baumeister also explained why his R-Index is so low.

“We did run multiple studies, some of which did not work, and some of which worked better than others. You may think that not reporting the less successful studies is wrong, but that is how the field works.”

Sadly, it is quite common to find an R-Index of 50% or lower for prominent publications in social psychology. This is not surprising because questionable research practices were considered good practices until recently. Even at present, it is not clear whether these practices constitute scientific misconduct (see discussion in Dialogue, Newsletter of the Society for Personality and Social Psychology).

How to Avoid Similar Sad Stories in the Future

One way to avoid accusations of scientific misconduct is to conduct a priori power analyses and to conduct only studies with a realistic chance to produce a significant result when the hypothesis is correct. When random error is small, true patterns in data can emerge without the help of QRPs.

Another important lesson from this story is to reduce the number of statistical tests as much as possible. Table 1 reported 18 statistical tests with the aim of demonstrating significance in each test. Even with a liberal criterion of .1 (one-tailed), it is highly unlikely that so many tests would all produce significant results. Thus, a non-significant result is likely to emerge, and researchers should think ahead of time how they would deal with non-significant results.

For the data in Table 1, Dr. Förster could have reported the means of 9 small studies without significance tests and conducted a significance test only once, for the pattern across all 9 studies. With a total sample size of 360 participants (9 * 40), this test would have 90% power even if the effect size is only d = .35. With 90% power, the total power to obtain significant differences from the control condition for both manipulations would be 81%. Thus, the same amount of resources that were used for the controversial findings could have been used to conduct a powerful empirical test of theoretical predictions without the need to hide inconclusive, non-significant results in studies with low power.
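A quick numerical check of this resource-allocation point (again my sketch, not Schimmack’s; I assume a two-sided α = .05 for the pooled contrast, so the exact figure shifts a little with the choice of tail):

```python
import numpy as np
from scipy import stats

# Pooling 9 studies of N = 40 gives 180 participants per arm for one contrast.
d, n_per_arm, alpha = 0.35, 180, 0.05
df = 2 * n_per_arm - 2
ncp = d * np.sqrt(n_per_arm / 2)                 # noncentrality of the pooled t-test
t_crit = stats.t.ppf(1 - alpha / 2, df)          # two-sided cutoff (my assumption)
pooled_power = 1 - stats.nct.cdf(t_crit, df, ncp)
print(f"power of one pooled contrast at d = .35: {pooled_power:.2f}")   # about .9
# Chance that BOTH pooled contrasts (global and local) reach significance, treated
# as independent -- compare the post's 0.9 * 0.9 = 0.81 figure:
print(f"both contrasts significant: {pooled_power ** 2:.2f}")
```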

Jacob Cohen has been trying to teach psychologists the importance of statistical power for decades and psychologists stubbornly ignored his valuable contribution to research methodology until he died in 1998. Methodologists have been mystified by the refusal of psychologists to increase power in their studies (Maxwell, 2004).

Mayo: Here I am in total agreement. Yet well-known critics claim significance tests can say nothing in the case of statistically insignificant results, or that use of power is an “inconsistent hybrid”. It is not. See Mayo, D. G. and Cox, D. R. (2010). “Frequentist Statistics as a Theory of Inductive Inference” in (Mayo and Spanos) Error and Inference.

One explanation is that small samples provided a huge incentive. A non-significant result can be discarded with little cost of resources, whereas a significant result can be published and have the additional benefit of an inflated effect size, which allows boosting the importance of published results.

The R-Index was developed to balance the incentive structure towards studies with high power. A low R-Index reveals that a researcher is reporting biased results that will be difficult for other researchers to replicate. The R-Index reveals this inconvenient truth and lowers excitement about incredible results that are indeed incredible. The R-Index can also be used by researchers to control their own excitement about results that are mostly due to sampling error and to curb the excitement of eager research assistants who may be motivated to bias results to please a professor.

Curbed excitement does not mean that the R-Index makes science less exciting. Indeed, it will be exciting when social psychologists start reporting credible results about social behavior that boost a high R-Index because for a true scientist nothing is more exciting than the truth.

If so, then why would a “prevalent” practice be to bias inferences by selecting results in sync with one’s hypothesis?

Schimmack has a (Jan 15, 2015) update here in which he appears to retract what he said above! Why? As best as I could understand it, it’s because the accused fraudster denies committing any QRPs, and so, if he doesn’t want to admit the lesser crime, sparing him from the “fraud” label, then he must be guilty of the more serious crime of fraud after all.

Since Richard Gill alerted me to these blogposts, and I trust Gill’s judgment, there’s bound to be something in all of this reanalysis.

[i] Fake Data Colada. Maybe the author has also changed his mind, given his update.

 

 

 

 

 

Categories: junk science, reproducibility, Statistical fraudbusting, Statistical power, Statistics | Tags: | 22 Comments

Winners of the December 2014 Palindrome Contest: TWO!

I am pleased to announce that there were two (returning) winners for the December Palindrome contest.
The requirement was: In addition to Elba, one word: Math

(or maths; mathematics, for anyone brave enough).

The winners in alphabetical order are:


Karthik Durvasula
Visiting Assistant Professor in Phonology & Phonetics at Michigan State University

Palindrome: Ha! Am I at natal bash? tame lives, ol’ able-stats Elba. “Lose vile maths!” a blatant aim, aah!

(This was in honor of my birthday–thanks Karthik!)

Bio: I’m a Visiting Assistant Professor in Phonology & Phonetics at Michigan State University. My work primarily deals with probing people’s subconscious knowledge of (abstract) sound patterns. Recently, I have been working on auditory illusions that stem from the bias that such subconscious knowledge introduces.

Statement: “Trying to get a palindrome that was at least partially meaningful was fun and challenging. Plus I get an awesome book for my efforts. What more could a guy ask for! I also want to thank Mayo for being excellent about email correspondence, and answering my (sometimes silly) questions tirelessly.”

Book choice: An Introduction to the Philosophy of Science (K. Staley 2014, Cambridge University Press).

 


 

Lori Wike: Principal bassoonist of the Utah Symphony; Faculty member at University of Utah and Westminster College

Palindrome: Able foe rip menisci? Tam, eh? Tam-tam? GMAT mathematics in empire of Elba!

(Lori was brave enough to use “mathematics”–successfully! The only reason I know meniscus is from working in a knee clinic as a graduate student to supplement my Fellowship at U Penn. Congratulations!)

Bio: Lori Wike is principal bassoonist of the Utah Symphony and is on the faculty of the University of Utah and Westminster College. She holds a Bachelor of Music degree from the Eastman School of Music and a Master of Arts degree in Comparative Literature from UC-Irvine.

Statement: “I’m very happy to be a third-time winner in this palindrome contest. I definitely appreciated the challenge of trying to work “mathematical” into a palindrome, and I must thank my dear friend, Luke, whose recent knee surgery brought “menisci” to my mind. Here is a picture of me visiting Akaka Falls, a necessary stop on any palindromist tour itinerary! I’ve been fascinated by palindromes ever since first learning about them as a child in a Martin Gardner book. I started writing palindromes several years ago when my interest in the form was rekindled by reading about the constraint-based techniques of several Oulipo writers. While I love all sorts of wordplay and puzzles, and I occasionally write some word-unit palindromes as well, I find writing the traditional letter-unit palindromes to be the most satisfying challenge, due to the extreme formal constraint of exact letter reversal–which is made even more fun in a contest like this where one has to include specific words in the palindrome. Lately I’ve been writing a lot of palindrome limericks (“palimericks”) and I’d like to attempt to write a palindrome sonnet in iambic pentameter.”

Book choice:  What is this thing called science? (A. Chalmers 1999 (3rd ed), Hackett Publishing Company).

CONGRATULATIONS TO BOTH! And thanks so much for your interest!

Mayo’s December attempts/examples included:

Elba, I, math girl, let racecar stats = sexes = stats racecar, tell right, amiable!
Elba, I, math, gin, stats = testset = stats night, amiable.
Elba, I, math, gin = night amiable!

Elba saw, aimed a cadet fight. A math gifted academia was able.
Elba fan I, “stats goddess” I, math girl, right. A missed dog stats in a fable.
Elba nut, I rave: “No stats goddess I. Math gin night. A missed dog stats, one variable, no tuna!”

JANUARY: IRONY, IRONIC, IRONICAL. Extra points for using any two forms of these words in your palindrome. See rules.

 

 

Categories: Palindrome | 2 Comments

“Only those samples which fit the model best in cross validation were included” (whistleblower) “I suspect that we likely disagree with what constitutes validation” (Potti and Nevins)

more Potti training/validation fireworks

So it turns out there was an internal whistleblower in the Potti scandal at Duke after all (despite denials by the Duke researchers involved). It was a medical student, Brad Perez. It’s in the Jan. 9, 2015 Cancer Letter*. Ever since my first post on Potti last May (part 1), I’ve received various e-mails and phone calls from people wishing to confide their inside scoops and first-hand experiences working with Potti (in a statistical capacity), but I was waiting for some published item. I believe there’s a court case still pending (anyone know?).

Now here we have a great example of something I am increasingly seeing: Challenges to the scientific credentials of data analysis are dismissed as mere differences in statistical philosophies or as understandable disagreements about stringency of data validation.[i] This is further enabled by conceptual fuzziness as to what counts as meaningful replication, validation, legitimate cross-validation.

If so, then statistical philosophy is of crucial practical importance.[ii]

Here’s the bulk of Perez’s memo (my emphasis in bold), followed by an even more remarkable reply from Potti and Nevins. Continue reading

Categories: evidence-based policy, junk science, PhilStat/Med, Statistics | Tags: | 28 Comments

On the Brittleness of Bayesian Inference–An Update: Owhadi and Scovel (guest post)


Houman Owhadi

Professor of Applied and Computational Mathematics and Control and Dynamical Systems,
Computing + Mathematical Sciences
California Institute of Technology, USA

 


Clint Scovel
Senior Scientist,
Computing + Mathematical Sciences
California Institute of Technology, USA

 

 “On the Brittleness of Bayesian Inference: An Update”

Dear Readers,

This is an update on the results discussed in http://arxiv.org/abs/1308.6306 (“On the Brittleness of Bayesian Inference”) and a high-level presentation of the more recent paper “Qualitative Robustness in Bayesian Inference”, available at http://arxiv.org/abs/1411.3984.

In http://arxiv.org/abs/1304.6772 we looked at the robustness of Bayesian Inference in the classical framework of Bayesian Sensitivity Analysis. In that (classical) framework, the data is fixed, and one computes optimal bounds on (i.e. the sensitivity of) posterior values with respect to variations of the prior in a given class of priors. Now it is already well established that when the class of priors is finite-dimensional, then one obtains robustness. What we observe is that, under general conditions, when the class of priors is finite co-dimensional, the optimal bounds on posterior values are as large as possible, no matter the number of data points.

Our motivation for specifying a finite co-dimensional class of priors is to look at what classical Bayesian sensitivity analysis would conclude under finite information, and the best way to understand this notion of “brittleness under finite information” is through the simple example already given in http://errorstatistics.com/2013/09/14/when-bayesian-inference-shatters-owhadi-scovel-and-sullivan-guest-post/ and recalled in Example 1. The mechanism causing this “brittleness” has its origin in the fact that, in classical Bayesian Sensitivity Analysis, optimal bounds on posterior values are computed after the observation of the specific value of the data, and that the probability of observing the data under some feasible prior may be arbitrarily small (see Example 2 for an illustration of this phenomenon). This data dependence of worst priors is inherent to this classical framework, and the resulting brittleness under finite information can be seen as an extreme occurrence of the dilation phenomenon (the fact that optimal bounds on prior values may become less precise after conditioning) observed in classical robust Bayesian inference [6]. Continue reading

Categories: Bayesian/frequentist, Statistics | 13 Comments

“When Bayesian Inference Shatters” Owhadi, Scovel, and Sullivan (reblog)

I’m about to post an update of this most-viewed blogpost, so I reblog it here as a refresher. If interested, you might check the original discussion.

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

I am grateful to Drs. Owhadi, Scovel and Sullivan for replying to my request for “a plain Jane” explication of their interesting paper, “When Bayesian Inference Shatters”, and especially for permission to post it. 

—————————————-

Houman Owhadi
Professor of Applied and Computational Mathematics and Control and Dynamical Systems, Computing + Mathematical Sciences,
California Institute of Technology, USA
Clint Scovel
Senior Scientist,
Computing + Mathematical Sciences,
California Institute of Technology, USA
Tim Sullivan
Warwick Zeeman Lecturer,
Assistant Professor,
Mathematics Institute,
University of Warwick, UK

“When Bayesian Inference Shatters: A plain Jane explanation”

This is an attempt at a “plain Jane” presentation of the results discussed in the recent arxiv paper “When Bayesian Inference Shatters” located at http://arxiv.org/abs/1308.6306 with the following abstract:

“With the advent of high-performance computing, Bayesian methods are increasingly popular tools for the quantification of uncertainty throughout science and industry. Since these methods impact the making of sometimes critical decisions in increasingly complicated contexts, the sensitivity of their posterior conclusions with respect to the underlying models and prior beliefs is becoming a pressing question. We report new results suggesting that, although Bayesian methods are robust when the number of possible outcomes is finite or when only a finite number of marginals of the data-generating distribution are unknown, they are generically brittle when applied to continuous systems with finite information on the data-generating distribution. This brittleness persists beyond the discretization of continuous systems and suggests that Bayesian inference is generically ill-posed in the sense of Hadamard when applied to such systems: if closeness is defined in terms of the total variation metric or the matching of a finite system of moments, then (1) two practitioners who use arbitrarily close models and observe the same (possibly arbitrarily large amount of) data may reach diametrically opposite conclusions; and (2) any given prior and model can be slightly perturbed to achieve any desired posterior conclusions.”

Now, it is already known from classical Robust Bayesian Inference that Bayesian Inference has some robustness if the random outcomes live in a finite space or if the class of priors considered is finite-dimensional (i.e. what you know is infinite and what you do not know is finite). What we have shown is that if the random outcomes live in an approximation of a continuous space (for instance, when they are decimal numbers given to finite precision) and your class of priors is finite co-dimensional (i.e. what you know is finite and what you do not know may be infinite) then, if the data is observed at a fine enough resolution, the range of posterior values is the deterministic range of the quantity of interest, irrespective of the size of the data. Continue reading

Categories: 3-year memory lane, Bayesian/frequentist, Statistics | 1 Comment

Significance Levels are Made a Whipping Boy on Climate Change Evidence: Is .05 Too Strict? (Schachtman on Oreskes)

too strict/not strict enough

Given the daily thrashing significance tests receive because of how preposterously easy it is claimed to be to satisfy the .05 significance level requirement, it’s surprising[i] to hear Naomi Oreskes blaming the .05 standard for demanding too high a burden of proof for accepting climate change. “Playing Dumb on Climate Change,” N.Y. Times Sunday Rev. at 2 (Jan. 4, 2015). Is there anything for which significance levels do not serve as convenient whipping boys? Thanks to lawyer Nathan Schachtman for alerting me to her opinion piece today (congratulations to Oreskes!), and to his current blogpost. I haven’t carefully read her article, but one claim jumped out: scientists, she says, “practice a form of self-denial, denying themselves the right to believe anything that has not passed very high intellectual hurdles.” If only! *I add a few remarks at the end. Anyhow, here’s Schachtman’s post:


“Playing Dumb on Statistical Significance”
by Nathan Schachtman

Naomi Oreskes is a professor of the history of science at Harvard University. Her writings on the history of geology are well respected; her writings on climate change tend to be more adversarial, rhetorical, and ad hominem. See, e.g., Naomi Oreskes, Merchants of Doubt: How a Handful of Scientists Obscured the Truth on Issues from Tobacco Smoke to Global Warming (N.Y. 2010). Oreskes’ abuse of the meaning of significance probability for her own rhetorical ends is on display in today’s New York Times. Naomi Oreskes, “Playing Dumb on Climate Change,” N.Y. Times Sunday Rev. at 2 (Jan. 4, 2015).

Oreskes wants her readers to believe that those who are resisting her conclusions about climate change are hiding behind an unreasonably high burden of proof, which follows from the conventional standard of significance in significance probability. In presenting her argument, Oreskes consistently misrepresents the meaning of statistical significance and confidence intervals to be about the overall burden of proof for a scientific claim:

“Typically, scientists apply a 95 percent confidence limit, meaning that they will accept a causal claim only if they can show that the odds of the relationship’s occurring by chance are no more than one in 20. But it also means that if there’s more than even a scant 5 percent possibility that an event occurred by chance, scientists will reject the causal claim. It’s like not gambling in Las Vegas even though you had a nearly 95 percent chance of winning.”

Although the confidence interval is related to the pre-specified Type I error rate, alpha, and so a conventional alpha of 5% does lead to a coefficient of confidence of 95%, Oreskes has misstated the confidence interval to be a burden of proof consisting of a 95% posterior probability. The “relationship” is either true or not; the p-value or confidence interval provides a probability for the sample statistic, or one more extreme, on the assumption that the null hypothesis is correct. The 95% probability of confidence intervals derives from the long-term frequency that 95% of all confidence intervals, based upon samples of the same size, will contain the true parameter of interest.
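The long-run reading of that last sentence can be made concrete with a small simulation (an editorial illustration, not part of Schachtman’s post): draw many samples from a population with a known mean, build the standard 95% interval from each, and count how often the intervals cover the true mean. The 95% describes the procedure’s long-run coverage; it is not a 95% posterior probability that any particular interval is correct.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
true_mu, sigma, n, reps = 10.0, 2.0, 25, 20_000

covered = 0
for _ in range(reps):
    sample = rng.normal(true_mu, sigma, n)
    m = sample.mean()
    se = sample.std(ddof=1) / np.sqrt(n)
    t_crit = stats.t.ppf(0.975, n - 1)           # two-sided 95% interval
    lo, hi = m - t_crit * se, m + t_crit * se
    covered += (lo <= true_mu <= hi)

print(f"proportion of intervals covering the true mean: {covered / reps:.3f}")  # ~0.95
```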

Oreskes is an historian, but her history of statistical significance appears equally ill-considered. Here is how she describes the “severe” standard of the 95% confidence interval: Continue reading

Categories: evidence-based policy, science communication, Statistics | 58 Comments

No headache power (for Deirdre)


Deirdre McCloskey’s comment leads me to try to give a “no headache” treatment of some key points about the power of a statistical test. (Trigger warning: formal stat people may dislike the informality of my exercise.)

We all know that for a given test, as the probability of a type 1 error goes down, the probability of a type 2 error goes up (and power goes down).

And as the probability of a type 2 error goes down (and power goes up), the probability of a type 1 error goes up, leaving everything else the same. There’s a trade-off between the two error probabilities. (No free lunch.) No headache powder called for.

So if someone said that as the power increases, the probability of a type 1 error decreases, they’d be saying: as the probability of a type 2 error decreases, the probability of a type 1 error decreases! That’s the opposite of a trade-off. So you’d know automatically that they’d made a mistake, or were defining things in a way that differs from standard N-P statistical tests.

Before turning to my little exercise, I note that power is defined in terms of a test’s cut-off for rejecting the null, whereas a severity assessment always considers the actual value observed (attained power). Here I’m just trying to clarify regular old power, as defined in a N-P test.

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Let’s use a familiar oversimple example to fix the trade-off in our minds so that it cannot be dislodged. Our old friend, test T+ : We’re testing the mean of a Normal distribution with n iid samples, and (for simplicity) known, fixed σ:

H0: µ ≤ 0 against H1: µ > 0

Let σ = 2 and n = 25, so (σ/√n) = .4. To avoid those annoying X-bars, I will use M for the sample mean. I will abbreviate (σ/√n) as σx.

  • Test T+ is a rule: reject H0 iff M > m*
  • The power of test T+ is computed in relation to values of µ > 0.
  • The power of T+ against the alternative µ = µ1 is: Pr(T+ rejects H0; µ = µ1) = Pr(M > m*; µ = µ1)

We may abbreviate this as: POW(T+, α, µ = µ1). Continue reading
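Since the exercise turns on computing power from the cutoff m*, here is a minimal numerical sketch for the setup above (σ = 2, n = 25, so σx = .4). The choice α = .025, which puts m* at about 2σx = 0.8, is my illustrative assumption, not part of the original post.

```python
from scipy import stats

sigma, n = 2.0, 25
sigma_x = sigma / n ** 0.5                         # .4, the standard error of M

def power_T_plus(mu1, alpha=0.025):
    """POW(T+, alpha, mu = mu1) = Pr(M > m*; mu = mu1), for H0: mu <= 0."""
    m_star = stats.norm.ppf(1 - alpha) * sigma_x   # cutoff m* fixed under the null
    return stats.norm.sf(m_star, loc=mu1, scale=sigma_x)

for mu1 in (0.2, 0.4, 0.8, 1.2):
    print(f"mu1 = {mu1:.1f}: POW = {power_T_plus(mu1):.2f}")
# Lowering alpha pushes m* up and drops the power at every mu1: the trade-off above.
```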

Categories: power, statistical tests, Statistics | 6 Comments

Blog Contents: Oct.- Dec. 2014

BLOG CONTENTS: OCT – DEC 2014*

OCTOBER 2014

  • 10/01 Oy Faye! What are the odds of not conflating simple conditional probability and likelihood with Bayesian success stories?
  • 10/05 Diederik Stapel hired to teach “social philosophy” because students got tired of success stories… or something (rejected post)
  • 10/07 A (Jan 14, 2014) interview with Sir David Cox by “Statistics Views”
  • 10/10 BREAKING THE (Royall) LAW! (of likelihood) (C)
  • 10/14 Gelman recognizes his error-statistical (Bayesian) foundations
  • 10/18 PhilStat/Law: Nathan Schachtman: Acknowledging Multiple Comparisons in Statistical Analysis: Courts Can and Must
  • 10/22 September 2014: Blog Contents
  • 10/25 3 YEARS AGO: MONTHLY MEMORY LANE
  • 10/26 To Quarantine or not to Quarantine?: Science & Policy in the time of Ebola
  • 10/31 Oxford Gaol: Statistical Bogeymen

NOVEMBER 2014

  • 11/01 Philosophy of Science Assoc. (PSA) symposium on Philosophy of Statistics in the Higgs Experiments “How Many Sigmas to Discovery?”
  • 11/09 “Statistical Flukes, the Higgs Discovery, and 5 Sigma” at the PSA
  • 11/11 The Amazing Randi’s Million Dollar Challenge
  • 11/12 A biased report of the probability of a statistical fluke: Is it cheating?
  • 11/15 Why the Law of Likelihood is bankrupt–as an account of evidence
  • 11/18 Lucien Le Cam: “The Bayesians Hold the Magic”
  • 11/20 Erich Lehmann: Statistician and Poet
  • 11/22 Msc Kvetch: “You are a Medical Statistic”, or “How Medical Care Is Being Corrupted”
  • 11/25 How likelihoodists exaggerate evidence from statistical tests
  • 11/30 3 YEARS AGO: MONTHLY (Nov.) MEMORY LANE

 

DECEMBER 2014

  • 12/02 My Rutgers Seminar: tomorrow, December 3, on philosophy of statistics
  • 12/04 “Probing with Severity: Beyond Bayesian Probabilism and Frequentist Performance” (Dec 3 Seminar slides)
  • 12/06 How power morcellators inadvertently spread uterine cancer
  • 12/11 Msc. Kvetch: What does it mean for a battle to be “lost by the media”?
  • 12/13 S. Stanley Young: Are there mortality co-benefits to the Clean Power Plan? It depends. (Guest Post)
  • 12/17 Announcing Kent Staley’s new book, An Introduction to the Philosophy of Science (CUP)
  • 12/21 Derailment: Faking Science: A true story of academic fraud, by Diederik Stapel (translated into English)
  • 12/23 All I want for Chrismukkah is that critics & “reformers” quit howlers of testing (after 3 yrs of blogging)! So here’s Aris Spanos “Talking Back!”
  • 12/26 3 YEARS AGO: MONTHLY (Dec.) MEMORY LANE
  • 12/29 To raise the power of a test is to lower (not raise) the “hurdle” for rejecting the null (Ziliak and McCloskey 3 years on)
  • 12/31 Midnight With Birnbaum (Happy New Year)

* Compiled by Jean A. Miller

Categories: blog contents, Statistics | Leave a comment

Midnight With Birnbaum (Happy New Year)

Just as in the past 3 years since I’ve been blogging, I revisit that spot in the road at 11 p.m.*, just outside the Elbar Room, get into a strange-looking taxi, and head to “Midnight With Birnbaum”. I wonder if they’ll come for me this year, given that my Birnbaum article is out… This is what the place I am taken to looks like. [It’s 6 hrs later here, so I’m about to leave…]

You know how in that (not-so) recent movie, “Midnight in Paris,” the main character (I forget who plays it, I saw it on a plane) is a writer finishing a novel, and he steps into a cab that mysteriously picks him up at midnight and transports him back in time where he gets to run his work by such famous authors as Hemingway and Virginia Woolf?  He is impressed when his work earns their approval and he comes back each night in the same mysterious cab…Well, imagine an error statistical philosopher is picked up in a mysterious taxi at midnight (New Year’s Eve 2011, 2012, 2013, 2014) and is taken back fifty years and, lo and behold, finds herself in the company of Allan Birnbaum.[i] There are a couple of brief (12/31/14) updates at the end.


ERROR STATISTICIAN: It’s wonderful to meet you, Professor Birnbaum; I’ve always been extremely impressed with the important impact your work has had on philosophical foundations of statistics.  I happen to be writing on your famous argument about the likelihood principle (LP).  (whispers: I can’t believe this!)

BIRNBAUM: Ultimately you know I rejected the LP as failing to control the error probabilities needed for my Confidence concept.

ERROR STATISTICIAN: Yes, but I actually don’t think your argument shows that the LP follows from such frequentist concepts as sufficiency S and the weak conditionality principle WCP.[ii]  Sorry,…I know it’s famous…

BIRNBAUM:  Well, I shall happily invite you to take any case that violates the LP and allow me to demonstrate that the frequentist is led to inconsistency, provided she also wishes to adhere to the WCP and sufficiency (although less than S is needed).

ERROR STATISTICIAN: Well, I happen to be a frequentist (error statistical) philosopher; I have recently (2006) found a hole in your proof…er…well, I hope we can discuss it.

BIRNBAUM: Well, well, well: I’ll bet you a bottle of Elba Grease champagne that I can demonstrate it! Continue reading

Categories: Birnbaum Brakes, Statistics, strong likelihood principle | Tags: , , , | 2 Comments

To raise the power of a test is to lower (not raise) the “hurdle” for rejecting the null (Ziliak and McCloskey 3 years on)

I said I’d reblog one of the 3-year “memory lane” posts marked in red, with a few new comments (in burgundy), from time to time. So let me comment on one referring to Ziliak and McCloskey on power (from Oct. 2011). I would think they’d want to correct some wrong statements, or explain their shifts in meaning. My hope is that, 3 years on, they’ll be ready to do so. By mixing some correct definitions with erroneous ones, they introduce more confusion into the discussion.

From my post 3 years ago, “The Will to Understand Power”: In this post, I will adhere precisely to the text, and offer no new interpretation of tests. Type 1 and 2 errors and power are just formal notions with formal definitions. But we need to get them right (especially if we are giving expert advice). You can hate the concepts; just define them correctly, please. They write:

“The error of the second kind is the error of accepting the null hypothesis of (say) zero effect when the null is in fact false, that is, when (say) such and such a positive effect is true.”

So far so good (keeping in mind that “positive effect” refers to a parameter discrepancy, say δ, not an observed difference).

And the power of a test to detect that such and such a positive effect δ is true is equal to the probability of rejecting the null hypothesis of (say) zero effect when the null is in fact false, and a positive effect as large as δ is present.

Fine.

Let this alternative be abbreviated H’(δ):

H’(δ): there is a positive effect as large as δ.

Suppose the test rejects the null when it reaches a significance level of .01.

(1) The power of the test to detect H’(δ) =

P(test rejects null at .01 level; H’(δ) is true).

Say it is 0.85.

“If the power of a test is high, say, 0.85 or higher, then the scientist can be reasonably confident that at minimum the null hypothesis (of, again, zero effect if that is the null chosen) is false and that therefore his rejection of it is highly probably correct”. (Z & M, 132-3).

But this is not so. Perhaps they are slipping into the cardinal error of mistaking (1) for a posterior probability:

(1’) P(H’(δ) is true | test rejects null at .01 level)! Continue reading
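
To see numerically why (1) is not (1’), here is a minimal sketch (my own toy computation, with purely hypothetical numbers; nothing like it appears in Z & M or in the original post). It holds the power at 0.85 and α at .01 and shows that the would-be posterior (1’) can come out high or low depending entirely on an assumed prior proportion of true effects, so it cannot be read off the power:

```python
# Minimal sketch (hypothetical numbers) showing that power (1) is not a posterior (1').
# Suppose a proportion `prior_true` of the hypotheses tested really have an effect >= delta.
alpha, power = 0.01, 0.85

def posterior_given_rejection(prior_true):
    """(1') Pr(H'(delta) | test rejects at the .01 level), by Bayes' rule with the assumed prior."""
    p_reject = power * prior_true + alpha * (1 - prior_true)
    return power * prior_true / p_reject

for prior_true in (0.5, 0.1, 0.01):
    print(f"assumed prior Pr(H'(delta)) = {prior_true:5.2f} -> "
          f"Pr(H'(delta) | reject) = {posterior_given_rejection(prior_true):.2f}")
# The power (1) is fixed at 0.85 throughout, yet (1') swings from about 0.99 down to about 0.46,
# so high power alone cannot make a rejection "highly probably correct".
```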

Categories: 3-year memory lane, power, Statistics | Tags: , , | 6 Comments

3 YEARS AGO: MONTHLY (Dec.) MEMORY LANE


MONTHLY MEMORY LANE: 3 years ago: December 2011. I mark in red 3 posts that seem most apt for general background on key issues in this blog.*

*I announced this new, once-a-month feature at the blog’s 3-year anniversary. I will repost and comment on one of the 3-year old posts from time to time. [I’ve yet to repost and comment on the one from Oct. 2011, but will very shortly.] For newcomers, here’s your chance to catch up; for old timers, this is philosophy: rereading is essential!

Previous 3 YEAR MEMORY LANES:

Nov. 2011

Oct. 2011

Sept. 2011 (Within “All She Wrote (so far)”)

Categories: 3-year memory lane, blog contents, Statistics | Leave a comment

All I want for Chrismukkah is that critics & “reformers” quit howlers of testing (after 3 yrs of blogging)! So here’s Aris Spanos “Talking Back!”


This was initially posted as slides from our joint Spring 2014 seminar: “Talking Back to the Critics Using Error Statistics”. (You can enlarge them.) Related reading is Mayo and Spanos (2011).


Categories: Error Statistics, fallacy of rejection, Phil6334, reforming the reformers, Statistics | 27 Comments

Derailment: Faking Science: A true story of academic fraud, by Diederik Stapel (translated into English)

Diederik Stapel’s book, “Ontsporing,” has been translated into English, with some modifications. From what I’ve read, it’s interesting in a bizarre, fraudster-porn sort of way.

Faking Science: A true story of academic fraud

Diederik Stapel
Translated by Nicholas J.L. Brown

Nicholas J. L. Brown (nick.brown@free.fr)
Strasbourg, France
December 14, 2014


Foreword to the Dutch edition

I’ve spun off, lost my way, crashed and burned; whatever you want to call it. It’s not much fun. I was doing fine, but then I became impatient, overambitious, reckless. I wanted to go faster and better and higher and smarter, all the time. I thought it would help if I just took this one tiny little shortcut, but then I found myself more and more often in completely the wrong lane, and in the end I wasn’t even on the road at all. I left the road where I should have gone straight on, and made my own, spectacular, destructive, fatal accident. I’ve ruined my life, but that’s not the worst of it. My recklessness left a multiple pile-up in its wake, which caught up almost everyone important to me: my wife and children, my parents and siblings, colleagues, students, my doctoral candidates, the university, psychology, science, all involved, all hurt or damaged to some degree or other. That’s the worst part, and it’s something I’m going to have to learn to live with for the rest of my life, along with the shame and guilt. I’ve got more regrets than hairs on my head, and an infinite amount of time to think about them. Continue reading

Categories: Statistical fraudbusting, Statistics | Tags: | 4 Comments

Announcing Kent Staley’s new book, An Introduction to the Philosophy of Science (CUP)


Kent Staley has written a clear and engaging introduction to PhilSci that manages to blend the central topics of philosophy of science with current philosophy of statistics. Quite possibly, Staley explains Error Statistics more clearly in many ways than I do, in his 10-page section 9.4. CONGRATULATIONS STALEY*

You can get this book for free by merely writing one of the simpler palindromes in the December contest.

Here’s an excerpt from that section:


9.4 Error-statistical philosophy of science and severe testing

Deborah Mayo has developed an alternative approach to the interpretation of frequentist statistical inference (Mayo 1996). But the idea at the heart of Mayo’s approach is one that can be stated without invoking probability at all. ….

Mayo takes the following “minimal scientific principle for evidence” to be uncontroversial:

Principle 3 (Minimal principle for evidence) Data x0 provide poor evidence for H if they result from a method or procedure that has little or no ability of finding flaws in H, even if H is false. (Mayo and Spanos, 2009, 3) Continue reading

Categories: Announcement, Palindrome, Statistics, StatSci meets PhilSci | Tags: | 10 Comments

S. Stanley Young: Are there mortality co-benefits to the Clean Power Plan? It depends. (Guest Post)


S. Stanley Young, PhD
Assistant Director of Bioinformatics
National Institute of Statistical Sciences
Research Triangle Park, NC

Are there mortality co-benefits to the Clean Power Plan? It depends.

Some years ago, I listened to a series of lectures on finance. The professor would ask a rhetorical question, pause to give you some time to think, and then, more often than not, answer his question with, “It depends.” Are there mortality co-benefits to the Clean Power Plan? Is mercury coming from power plants leading to deaths? Well, it depends.

So, rhetorically, is an increase in CO2 a bad thing? There is good and bad in everything. Well, for plants an increase in CO2 is a good thing. They grow faster. They convert CO2 into more food and fiber. They give off more oxygen, which is good for humans. Plants appear to be CO2 starved.

It is argued that CO2 is a greenhouse gas and that an increase in CO2 will raise temperatures, ice will melt, sea levels will rise, coastal areas will flood, etc. It depends. In theory, yes; in reality, maybe. But a lot of other events must be orchestrated simultaneously. Obviously, that scenario depends on other things, as, for the last 18 years, CO2 has continued to go up and temperatures have not. So it depends on other factors: solar radiance, water vapor, El Niño, sunspots, cosmic rays, Earth’s precession, etc., just what the professor said.


So suppose ambient temperatures do go up a few degrees. On balance, is that bad for humans? The evidence is overwhelming that warmer is better for humans. One or two examples are instructive. First, Cox et al. (2013), with the title “Warmer is healthier: Effects on mortality rates of changes in average fine particulate matter (PM2.5) concentrations and temperatures in 100 U.S. cities.” To quote from the abstract of that paper, “Increases in average daily temperatures appear to significantly reduce average daily mortality rates, as expected from previous research.” Here is their plot of daily mortality rate versus max temperature. It is clear that as the maximum temperature in a city goes up, mortality goes down. So if the net effect of increasing CO2 is increasing temperature, there should be a reduction in deaths. Continue reading

Categories: evidence-based policy, junk science, Statistics | Tags: | 35 Comments
