Bad Statistics is Their Product: Fighting Fire With Fire (ii)

Mayo fights fire with fire

I. Doubt is Their Product is the title of a (2008) book by David Michaels, Assistant Secretary for OSHA from 2009-2017. I first mentioned it on this blog back in 2011 (“Will the Real Junk Science Please Stand Up?”). The expression comes from a statement by a cigarette executive (“doubt is our product”), and the book’s thesis is explained in its subtitle: How Industry’s Assault on Science Threatens Your Health. Imagine you have just picked up a book, published in 2020: Bad Statistics is Their Product. Is the author writing about how exaggerating bad statistics may serve the interest of denying well-established risks? [Interpretation A]. Or perhaps she’s writing on how exaggerating bad statistics serves the interest of denying well-established statistical methods? [Interpretation B]. Both may result in distorting science and even in dismantling public health safeguards–especially if made the basis of evidence policies in agencies. A responsible philosopher of statistics should care.

II. Fixing Science. So, one day in January, I was invited to speak on a panel, “Falsifiability and the Irreproducibility Crisis”, at a conference, “Fixing Science: Practical Solutions for the Irreproducibility Crisis.” The inviter, David Randall, whom I did not know, explained that a speaker had withdrawn from the session because of some kind of controversy surrounding the conference, but did not give details. He pointed me to an op-ed in the Wall Street Journal. I had already heard about the conference months before (from Nathan Schachtman), and before checking out the op-ed, my first thought was: I wonder if the controversy has to do with the fact that a keynote speaker is Ron Wasserstein, ASA Executive Director, a leading advocate of retiring “statistical significance” and barring P-value thresholds in interpreting data. Another speaker eschews all current statistical inference methods (e.g., P-values, confidence intervals) as just too uncertain (D. Trafimow). More specifically, I imagined it might have to do with the controversy over whether the March 2019 editorial in TAS (Wasserstein, Schirm, and Lazar 2019) was a continuation of the ASA 2016 Statement on P-values, and thus an official ASA policy document, or not. Karen Kafadar, the 2019 President of the American Statistical Association (ASA), made it clear in December 2019 that it is not.[2] The “no significance/no thresholds” view is the position of the guest editors of the March 2019 issue. (See “P-Value Statements and Their Unintended(?) Consequences” and “Les stats, c’est moi“.) Kafadar created a new 2020 ASA Task Force on Statistical Significance and Replicability to:

prepare a thoughtful and concise piece …without leaving the impression that p-values and hypothesis tests—and, perhaps by extension as many have inferred, statistical methods generally—have no role in “good statistical practice”. (Kafadar 2019, p. 4)

Maybe those inviting me didn’t know I’m “anti” the Anti-Statistical Significance campaign (“On some self-defeating aspects of the 2019 recommendations“), that I agree with John Ioannidis (2019) that “retiring statistical significance would give bias a free pass“, and that I published an editorial, “P-value Thresholds: Forfeit at Your Peril“. While I regard many of today’s statistical reforms as welcome (preregistration, testing for replication, transparency about data-dredging, P-hacking and multiple testing), I argue that those in Wasserstein et al. (2019) are “Doing more harm than good“. In “Don’t Say What You don’t Mean“, I express doubts that Wasserstein et al. (2019) could really mean to endorse certain statements in their editorial that are so extreme as to conflict with the ASA 2016 guide on P-values. To be clear, I reject oversimple dichotomies and cookbook uses of tests, long lampooned, and have developed a reformulation of tests that avoids the fallacies of significance and non-significance.[1] It’s just that many of the criticisms are confused, and, consequently, so are many reforms.

III. Bad Statistics is Their Product. It turns out that the brouhaha around the conference had nothing to do with all that. I thank Dorothy Bishop for pointing me to her blog, which gives a much fuller background. Aside from the lack of women (I learned a new word–a manference), her real objection is on the order of “Bad Statistics is Their Product”: the groups sponsoring the Fixing Science conference, the National Association of Scholars and the Independent Institute, Bishop argues, are using the replication crisis to cast doubt on well-established risks, notably those of climate change. She refers to a book whose title echoes David Michaels’s: Merchants of Doubt (2010, by historians of science Naomi Oreskes and Erik Conway). Bishop writes:

Uncertainty about science that threatens big businesses has been promoted by think tanks … which receive substantial funding from those vested interests. The Fixing Science meeting has a clear overlap with those players. (Bishop)

The speakers on bad statistics, as she sees it, are “foils” for these interests, and thus “responsible scientists should avoid” the meeting.

But what if things are the reverse? What if “bad statistics is our product” leaders also have an agenda? By influencing groups who have a voice in evidence policy in government agencies, they might effectively discredit methods they don’t like, and advance those they like. Suppose you have strong arguments that the consequences of this will undermine important safeguards (despite the key players being convinced they’re promoting better science). Then you should speak, if you can, and not stay away. You should try to fight fire with fire.

IV. So What Happened? So I accepted the invitation and gave what struck me as a fairly radical title: “P-Value ‘Reforms’: Fixing Science or Threats to Replication and Falsification?” (The abstract and slides are below.) Bishop is right that evidence of bad science can be exploited to selectively weaken entire areas of science; but evidence of bad statistics can also be exploited to selectively weaken entire methods one doesn’t like, and successfully gain acceptance of alternative methods, without the hard work of showing those alternative methods do a better, or even a good, job at the task at hand. Of course both of these things might be happening simultaneously.

Do the conference organizers overlap with science policy, as Bishop alleges? I’d never encountered either outfit before, but Bishop quotes from their annual report.

In April we published The Irreproducibility Crisis, a report on the modern scientific crisis of reproducibility—the failure of a shocking amount of scientific research to discover true results because of slipshod use of statistics, groupthink, and flawed research techniques. We launched the report at the Rayburn House Office Building in Washington, DC; it was introduced by Representative Lamar Smith, the Chairman of the House Committee on Science, Space, and Technology.

So there is a mix with science policymakers in Washington, and their publication, The Irreproducibility Crisis, is clearly prepared to find its scapegoat in the bad statistics supposedly encouraged by statistical significance tests. To its credit, it discusses how data-dredging and multiple testing can make it easy to arrive at impressive-looking findings that are spurious, but nothing is said about ways to adjust or account for multiple testing and multiple modeling. (P-values are defined correctly, but their interpretation of confidence levels is incorrect.) Published before the Wasserstein et al. (2019) call to end P-value thresholds–which would require the FDA and other agencies to end what many consider vital safeguards of error control–it doesn’t go that far. Not yet, at least! Trying to prevent that from happening is a key reason I decided to attend. (updated 2/16)
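The multiplicity problem at issue, and the sort of adjustment the report leaves out, can be sketched with a toy simulation (my own illustration, not anything from the report): dredging through many tests of true null hypotheses reliably turns up nominally “significant” results, while even a crude Bonferroni adjustment accounts for the multiplicity.

```python
# Toy illustration: testing many true null hypotheses inflates the
# chance of "significant" findings; a Bonferroni correction adjusts
# the threshold for the number of tests performed.
import random
from statistics import NormalDist, mean, stdev

random.seed(1)
m, n, alpha = 100, 30, 0.05  # 100 tests, 30 observations each
z = NormalDist()

def p_value(sample):
    # two-sided test of H0: mu = 0 (normal approximation, sigma estimated)
    t = mean(sample) / (stdev(sample) / n ** 0.5)
    return 2 * (1 - z.cdf(abs(t)))

# every "experiment" draws from N(0, 1), so every null is TRUE
pvals = [p_value([random.gauss(0, 1) for _ in range(n)]) for _ in range(m)]

raw_hits = sum(p < alpha for p in pvals)      # nominally "significant" nulls
adj_hits = sum(p < alpha / m for p in pvals)  # after Bonferroni adjustment

print(raw_hits, adj_hits)  # expect roughly 5 raw hits and about 0 adjusted
```

Sharper corrections than Bonferroni exist (e.g., false discovery rate control), but the point is only that ignoring how many tests were run is what makes data-dredging so effective at manufacturing findings.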

My first step was to send David Randall my book, Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (2018, CUP)–which he actually read and wrote a report on–and I met up with him in NYC to talk. He seemed surprised to learn about the controversies over statistical foundations and the disagreement about reforms. So did I hold people’s feet to the fire at the conference (when it came to scapegoating statistical significance tests and banning P-value thresholds for error probability control)? I did! I continue to do so in communications with David Randall. (I’ll write more in the comments to this post, once our slides are up.)

As for climate change, I wound up entirely missing that part of the conference: Due to the grounding of all flights to and from CLT the day I was to travel, thanks to rain, hail and tornadoes, I could only fly the following day, so our sessions were swapped. I hear the presentations will be posted. Doubtless, some people will use bad statistics and the “replication crisis” to claim there’s reason to reject our best climate change models, without having adequate knowledge of the science. But the real and present danger today that I worry about is that they will use bad statistics to claim there’s reason to reject our best (error) statistical practices, without adequate knowledge of the statistics or the philosophical and statistical controversies behind  the “reforms”.

Let me know what you think in the comments.

V. Here’s my abstract and slides

P-Value “Reforms”: Fixing Science or Threats to Replication and Falsification?

Mounting failures of replication give a new urgency to critically appraising proposed statistical reforms. While many reforms are welcome, others are quite radical. The sources of irreplication are not mysterious: in many fields, latitude in collecting and interpreting data makes it too easy to dredge up impressive-looking findings even when spurious. Paradoxically, some of the reforms intended to fix science enable rather than reveal illicit inferences due to P-hacking, multiple testing, and data-dredging. Some even preclude testing and falsifying claims altogether. Too often the statistics wars become proxy battles between competing tribal leaders, each keen to advance a method or philosophy, rather than improve scientific accountability.

[1] Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (SIST), 2018; SIST excerpts; Mayo and Cox 2006; Mayo and Spanos 2006.

[2] All uses of “ASA II” (note) on this blog must now be qualified to reflect this.

[3] You can find a lot on the conference and the groups involved on-line. The letter by Lenny Teytelman warning people off the conference is here. Nathan Schachtman has a post up today on his law blog here.


Categories: ASA Guide to P-values, Error Statistics, P-values, replication research, slides


33 thoughts on “Bad Statistics is Their Product: Fighting Fire With Fire (ii)”

  1. Nathan Schachtman


    I have blogged a bit about the denunciators but not really very much yet about the substance of the conference. As you know, there is a statement in the lead 2019 ASA editorial “about time for change,” in which the authors declare that their proposals will ameliorate the so-called replication crisis. As best I can make out, folks at the National Association of Scholars took these declarations at face value, and without any empirical support. I for one was glad that you could attend and join issue with these claims.

    Yes; the National Association of Scholars’ report on the replication issues had a misinterpretation of confidence intervals. I pointed this out and told one of the authors not to fret, that it is one of the most common mistakes out there, made by many scientists and even some statisticians. My understanding is that the error will be corrected in subsequent editions.

    The presentations on climate change were difficult to take in completely, and I am looking forward to studying the slides. I can say this, however: two of the presenters acknowledged clearly that the Earth is warming up, and that human activity is at least partially a cause. One of these two presenters took issue with some of the catastrophic models and projections, on what seemed like good grounds. I heard another speaker express general skepticism, but he offered no evidence or analysis to support his assessment, and so I gave it little consideration.

    For me, it is not unthinkable that climate change scientists have overstated their conclusions, for personal or political reasons. Many of the blogviators and twitterers have reduced the issue to accepting or rejecting the most extreme models and projections. Still, I confess that this is an area of science I cannot get up to speed on fully and quickly, and to some extent I must use a trust heuristic in doing so. Still, I don’t think the world will stop rotating on its axis if we try to verify the scientific claims of the climate change proponents. I personally hope to look more closely at the actual data and analyses soon.

    Nathan Schachtman

    • Nathan:
      Thanks for your comment. I had tweeted both of your blogposts on this, but neglected to include them in Note [3] as intended. I planned to go back and give a list. I just now added a link to your blog from today. I hope everyone who reads my blog knows I go back to add bits to my posts and make corrections, usually indicating with numbers or letters after the title.
      You wrote: “As you know, there is a statement in the lead 2019 ASA editorial ‘about time for change,’ in which the authors declare that their proposals will ameliorate the so-called replication crisis…folks at the National Association of Scholars took these declarations at face value, and without any empirical support”.
      Yes, that is why I slogged to CA!
      By the way, we shouldn’t call it the 2019 ASA editorial, but at most the 2019 editorial in the special issue of TAS, or just Wasserstein, Schirm, and Lazar (2019). I know I was calling it ASA II for several months on this blog. I’d like to think that helped aggravate the situation enough to move people a step towards the events that resulted in the call for a new Task Force. It’s not at all clear why Wasserstein, with his ASA Exec Director’s hat on, would wish to take such an extreme and strident line on an issue that he himself admits is disputed by many other statistical practitioners, even though he admits there is no agreement about replacement methods. (I was shocked at the conference when Wasserstein said that even “his boss” disagreed with him! I assume he meant Karen.) He doubtless believes in his heart of hearts that he’s fixing…something, and doubtless defenders of one or another side in other science policy disputes feel likewise.

      You had pointed out the error in CIs in their report, but you shouldn’t tell them not to worry about it. They should worry, and it should lead them to distrust some things they may be reading. I think they should revise considerable portions of it, even though, to give credit where it’s due, there are places where they do an excellent job of informally explaining P-values, and are thorough in surveying the landscape of today’s discussions on irreplication. Then again, I read it quickly.

  2. Prof. Mayo: more later, but:
    1) Nit: Typo, I think: Karan Kafadar => Karen.

    2) Climate science, as a field, tries very hard to quantify & express uncertainty, as can be seen in the way IPCC expresses things (akin to the way the Surgeon General reports on smoking work). I’m an American Geophysical Union member, have attended numerous talks, and a long workshop on uncertainty. The famed climate scientist Stephen Schneider (a friend) pushed IPCC hard on this and always told audiences that the two extreme outcomes “no problem” and “we’re doomed” were the least likely, and that is the mainstream viewpoint. Of course, when filtered through journal abstracts (which often like to play up results, as per Tim Edgell) and worse, popular press, that gets lost.

    3) Of course, unlike epidemiology and many social-science problems, physics is the primary basis for climate science and Arrhenius’ paper-and-pencil estimate 120+ years ago was not too bad:

    As it stands, it is quite clear that humans are responsible for more than 100% of the warming over the last few decades (we also generate aerosols, which reduce the warming). For that NOT to be true, one has to discard pillars of physics like conservation of energy and quantum mechanics.

    4) Climate scientists certainly use statistical methods, but in somewhat different ways than in those areas.
    (I’m an advisor at UCSF’s Center for Tobacco Control Research and Education, often attend internal seminars whose key results can only be obtained via statistical analysis, often without being able to do the experiments one would really like. Nobody will ever start with 10,000 12-year-olds and randomly assign half to smoke, then do longitudinal study until all die. I’ve been in a few meetings going over forest plot meta-analyses to try to understand reasons for differing results.)

    In analyzing temperature trends, climate scientists know they’re dealing with a noisy stochastic process, and generally say they want 30 years. The most common anti-science trick is cherry-picking start and end dates, especially starting with 1998, the year of a truly exceptional El Nino (hot), or picking preferred datasets (satellites vs ground stations). Another is cherry-picking geographies to find cooling somewhere, slightly akin to the garden of forking paths, but less sophisticated. (One of the speakers at Fixing Science has a long history of such things; I have copies of his books.)
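    The start-date trick is easy to demonstrate on synthetic data (a toy series of my own, not real temperature records): a genuine warming trend plus noise, with one injected hot-spike year, shows a robust positive slope over the full record, while a short window that starts at the spike can look flat or even like cooling.

```python
# Toy illustration of start-date cherry-picking in a noisy trend.
import random

random.seed(7)
years = list(range(1980, 2020))
# genuine trend of +0.02/yr plus noise; 1998 gets a fake El Nino spike
temps = [0.02 * (y - 1980) + random.gauss(0, 0.1) for y in years]
temps[years.index(1998)] += 0.4

def slope(xs, ys):
    # ordinary least-squares slope
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return (sum((x - mx) * (yv - my) for x, yv in zip(xs, ys))
            / sum((x - mx) ** 2 for x in xs))

full = slope(years, temps)  # whole record: recovers the warming trend
i = years.index(1998)
cherry = slope(years[i:i + 10], temps[i:i + 10])  # window starting at the spike

print(f"full record: {full:+.3f}/yr, 1998-2007 window: {cherry:+.3f}/yr")
```

    This is why a fixed, pre-specified window length and start date matter when estimating trends in noisy series.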

    Too bad you missed the climate session, would have been interesting to hear your views.

    • John: I fixed the typo, thank you, there are bound to be others as I was going so fast.
      You wrote: “Climate scientists certainly use statistical methods, but in somewhat different ways than in those areas”. Yes, that’s a reason I identify the clear and present danger in the last paragraph (as being more of a threat to statistical methods than to climate science.)

  3. Agreed, again given that statistics matters less than physics to climate science, but more to epidemiology, as in the current challenges in figuring out what’s really going on with vaping, where the first credible studies are pretty new, and of course, no one knows much about the long-term health effects. Similar arguments seem to reappear regarding particulate matter.

    NAS has a long history of attacks on climate science by people who were demonstrably ill-informed about it, but very ideological.
    Distinguished MIT climate scientist Kerry Emanuel tried to educate them, but failed, and ended up resigning from NAS membership.
    NAS President Peter Wood was upset about a profile on me in Science:
    I’d never heard of NAS, so I investigated, just read 1st page:

    (linked PDF: bottling.nonsense.pdf)

    NAS Irreproducibility Report & climate science:
    If one is unfamiliar with climate science, their comments might seem credible, especially when enmeshed in general discussion of problems with science/statistics, some of which are real.

    BUT suppose one actually studies climate science (I’m an American Geophysical Union member, attend the big meetings, read IPCC reports, live near Stanford, attend lectures, know many climate scientists)
    and also studies the climate denial industry and players, then red flags are pervasive.
    The 2 non-scientist authors rely heavily on low-credibility people and sources.
    It really is as bad as a report on public health casting great doubt on the smoking–disease relationship while mostly citing articles by Philip Morris employees and economists in think tanks funded by tobacco companies, as the Independent Institute & Heartland have been for decades.

    Here are a few examples on the report, with more references than you’d want to examine, but they perhaps hint that I recognize a lot of people offhand and actually know a great deal about some of their cites.

    The report was rolled out with Lamar Smith (R-TX), one of the two most famed Texas climate-denier Reps (see the section on climate change).

    The NAS Report says:
    “Sloppy procedures don’t just allow for sloppy science. They allow, as opportunistic infections, politicized groupthink and advocacy-driven science. Above all, they allow for progressive skews and inhibitions on scientific research, especially in ideologically driven fields such as climate science, radiation biology, and social psychology (marriage law). Not all irreproducible research is progressive advocacy; not all progressive advocacy is irreproducible; but the intersection between the two is very large. The intersection between the two is a map of much that is wrong with modern science.”

    Who’s ideologically driven?
    Climate scientists, or the two authors of the report, who cite Scott Adams’ Dilbert as a source, working for NAS, funded by the Sarah Scaife Foundation, L&H Bradley, Charles Koch, etc.?
    I’ve met at least 3 quite famous climate scientists who are (or were) conservatives or registered Republicans, and that didn’t bother anybody or cause their science to be ignored. Scientists mostly want to do science, without being hassled by politicians (like James Inhofe (R-OK) or Joe Barton (R-TX)).

    I noted that there was an Afterword by *Will Happer.*
    That raises serious concerns. They didn’t mention he was for many years Chairman of the Board of the George C. Marshall Institute (a focus of Merchants of Doubt), founded the CO2 Coalition, testified for Peabody Energy (coal), etc. Some of the information in the report is relatively esoteric, leading one to wonder whether Happer and his allies contributed to it.

    References 83, 105, and 106 involve Judith Curry; 106 cites a piece at GWPF (the main British climate denial group, run by sports anthropologist Benny Peiser) and a Scott Adams Dilbert cartoon (a guy who attacked Naomi Oreskes, then admitted he was all wrong).

    James Wallace, Joseph D’Aleo, Craig Idso: I don’t recognize Wallace offhand, but:
    (He runs a little family business, a climate-denial thinktank, often like a subsidiary of Heartland Institute, which funded it for many years.)

    Reference 145: As for McShane & Wyner (MW), that was a paper by 2 statisticians in Wharton’s marketing dept, designed to cast doubt on the climate hockey stick (it failed miserably, especially given later research that kept reaffirming the hockey stick). Judith Curry labeled them “leading statisticians”. One was a PhD student/new PhD and the other was an Assoc. Prof who was skeptical of climate science without knowing much of it. This research went nowhere.

    I wrote of it in 2010 in my analysis of the 2006 Wegman Report (abbreviated WR).
    MW included plagiarism of background material to make people think they knew paleoclimate (they didn’t, as shown by absolutely stupid errors) and some fabrications, as per pp. 96-113 in

    (linked PDF: STRANGE.SCHOLARSHIP.V1.02.pdf)

    (This analysis was picked up by USA Today, led to a Nature editorial, a profile of me in Science, a few retractions, although fewer than should have been.)

    The paper drew numerous reply papers, mostly critical,
    summarized by 2 top climate scientists:

    Bad science happens when people submit papers with bad statistics & reviewers don’t have expertise to catch it…
    but also when people submit papers to statistics journals that don’t have reviewers with application domain expertise to notice instantly-obvious errors and results that simply contradict masses of evidence & well-understood physics. Sigh.

    (By the way, if you’re ever out here and feel like a tour of the Computer History Museum in Mountain View, let me know. I took Andrew Gelman around a few years ago, which was fun.)

  4. John:
    The National Association of Scholars purports to be opposed to “politicized groupthink and advocacy-driven science”. Then they should also be opposed to advocacy-driven statistics. That is why I attended the conference.

    • John Mashey

      Yes, they should!
      Of course, everybody should be in favor of good statistics, and wary of bad (whether from malice (bias, ideology) or incompetence)!
      Sadly, the sometimes-stovepiped structure of academe sometimes leads to poor statistics in application areas. Luckily, I spent my first 10 years at Bell Labs, where people were encouraged to take hard problems to the statistics research dept. Also, internal peer review was tough. A Member of Technical Staff writes a paper for external publication. It goes up the chain: Supervisor, Department Head, Director, Executive Director, who sends it to 2 other EDs. They send it down their chains to folks who can review. Reviews come back up, over to the ED, and down to the author. It was well known that if there was serious statistics, the paper would likely get reviewed by John Tukey & co in the statistics research center. Having a bad Tukey review come back through management = career-limiting move… so of course sensible people would go over there and ask for help first. Bell Labs used statistics rather heavily to make big-$ decisions, so people cared.

      It was good you spoke; my wife enjoyed your talk… although her casual impression of the audience was that many didn’t have the math background to appreciate it.
      Note that I didn’t try to discourage speakers, although having studied NAS before, I did check with one speaker to make sure they knew what to expect. (They did.)

      • John:
        As you can see from my slides, there’s nothing mathematical about my talk. If you cannot say, in advance, about any possible result, that it will not be allowed to count in favor of a claim, then you don’t have a test of that claim. No matter what happens, one is allowed to spin the result as evidence for a favored claim. The method has no chance of finding a flaw in the claim even if it’s false. Is that technical? If so, no one from the conference will understand pseudoscience or falsification–the topic of the panel I was asked to be on. Nor will they grasp a word of the National Association of Scholars’ publication, The Irreproducibility Crisis–which is very clear about this point. If, on the other hand, my claim is understood all too well, then it will be clear that with no thresholds there is no test, no falsification, even of the statistical sort. I wonder if some people claim there’s something mathematical about this elementary idea in order to ignore it. If so, it will come back to haunt them very soon. Maybe they will comprehend Ioannidis (2019):

        “Many fields of investigation … have major gaps in the ways they conduct, analyse, and report studies and lack protection from bias. Instead of trying to fix what is lacking and set better and clearer rules, one reaction is to overturn the tables and abolish any gatekeeping rules (such as removing the term statistical significance). However, potential for falsification is a prerequisite for science. Fields that obstinately resist refutation can hide behind the abolition of statistical significance but risk becoming self‐ostracized from the remit of science.” (Ioannidis 2019)

        “abandoning the concept of statistical significance would make claims of ‘irreproducibility’ difficult if not impossible to make. In our opinion this approach may give bias a free pass.” (Hardwicke and Ioannidis 2019)

        You can find these citations and a discussion in my very short editorial “P-value thresholds: forfeit at your peril.”

        You say “everyone should be in favor of good statistics”, but this ignores the fact that there very often isn’t agreement about this, which is what the “statistics wars” are all about. Take a look at my previous blog “P-values on trial”.

        • Life got busy, I need to actually read your book & look at slides.
          On the latter, for some reason, my Win10 Chrome can’t get beyond Slide 1; I just checked on my iPhone and it seems OK there.

          On “good statistics”: I note that the title of this thread includes “bad statistics”, so whatever’s left I hope includes “good statistics”. 🙂 Not being part of the statistics wars, I’ve generally been interested in usable methods for getting answers to help drive various kinds of decisions in business, product design or public policies … where sometimes the answer is “need more studies” and sometimes “we have to make a decision now, what does the data tell us?”

          All this is an interesting discussion, but not one for which time was previously allocated! Later.

          • John: Well no, removing bad stat doesn’t necessarily leave you with good stat.
            Anyway, I sent you my slides by email. When they’re available on the conference site, I’ll link to that. Send me your reactions as you read the first Excursion of my book, SIST.

            • “so whatever’s left I hope includes “good statistics”.
              Sorry, maybe I needed to be even more explicit. I didn’t think there was any implication that everything left was good.

              My mental model is that for any given problem, there is likely a partial ordering of methods, of which some are bad (for sure, like using arithmetic averages of ratios), through some I *hope* are good (or at least “good enough”).
              For instance, I think of the various normality tests, of which I think there have been occasional arguments 🙂 For my needs in computer benchmark statistics, some of the simpler ones seemed adequate.

              I mentioned the Tukey influence (I still have his 1977 EDA book, obsolete in many ways, but still with good thoughts about using different methods to gain insight. That’s why I say partial ordering above, even without getting into the arguments of the statistics wars.)

              I’ve often repeated a few of his quotes:
              One of my favorites has been:
              “The combination of some data and an aching desire for an answer does not ensure that a reasonable answer can be extracted from a given body of data.”

              In the latter case, there may be no good statistics.

    • Stuart Hurlbert

      Mayo, maybe the phrase you want would be better labeled “crowd-driven science.”

      We are “advocates” every time we advocate a particular procedure or approach in an article or in advice to colleagues or students.

      Neither I nor any of my students ever used the phrase “statistically significant” over the last 30-plus years, and no reviewer or editor attempted to call us out on it. (We would have won, if they did, simply by citing a half dozen reputable statisticians who have made the case for de-dichotomizing P-values over the decades….)

  5. rkenett

    Climate change is a prime example where the claim is verbal: “The earth gets warmer.” This is a verbal representation of research (or “research”) that can be based on parametrized models (or not).

    The S-type error of Gelman and Carlin is the error of this claim being wrong in direction: the earth does not get warmer, or even gets cooler.
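    The S-type (sign) error mentioned here can be illustrated with a small simulation in the spirit of Gelman and Carlin (2014); the numbers below are toy choices of my own, not theirs. With a small true effect and a noisy estimate, a surprisingly large share of the statistically significant estimates point in the wrong direction.

```python
# Toy Type S (sign) error sketch: low power makes significant
# estimates unreliable about the direction of the true effect.
import random

random.seed(0)
true_effect, se, sims = 0.1, 1.0, 20000   # small true effect, noisy estimate
crit = 1.96 * se                          # two-sided 5% significance cutoff

# keep only the estimates that reach statistical significance
significant = [est for est in
               (random.gauss(true_effect, se) for _ in range(sims))
               if abs(est) > crit]

# Type S error: among significant results, how many have the wrong sign?
type_s = sum(est < 0 for est in significant) / len(significant)
print(f"share of significant estimates with the wrong sign: {type_s:.2f}")
```

    Under these toy settings the wrong-sign share is substantial, which is exactly the concern about directional claims drawn from underpowered studies.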

    To follow up on this discussion requires an approach for representing (or generalising) findings that is based on alternative representations of conceptual target statements, and an approach to assess them (for example, assessing the S-type error).

    This path has not been explored. Worse yet, it has been blocked by reviewers who were faced with a new perspective outside their range of visibility.

    Would NAS (I mean Nathan Schachtman) be open to it? For more on this, see my tweets about Mayo’s reference to the Fixing Science event and

    The NAS (National Academies of Science) document on reproducibility in science made several mistakes. One was on the definition of p-values, which Mayo was able to fix. The other was on the difference between reproducibility and repeatability, which I was not able to correct in spite of several emails pointing to my Nature Methods clarification.

    To “fix science” we need to:
    1. clarify the terminology
    2. embrace new perspectives

    Seems like the event Mayo and Schachtman attended did neither.

    • Ron:
      You wrote: “Climate change is a prime example where the claim is verbal: ‘The earth gets warmer’” . Yes, it is verbal, as claims tend to be. There actually was discussion about terminology at the conference. For example, Barry Smith’s keynote focussed on how preregistration will fail if there is latitude in defining terms. The whole field of verbal “ontology”–as that notion is used these days–was discussed.
      As for directional errors, that’s a major focus of standard statistical significance tests, whether called “sign” errors or something else.

      • rkenett

        Ontologies are totally different from alternative representations of claims using alternatives with meaning equivalence and alternatives with surface similarity. Look at the translational medicine and clinical research examples in my paper, where a boundary of meaning (BOM) is delineated in a table presenting research claims. It is certainly not a matter of terminology but an approach to represent and generalise research findings.

        Moreover, the directional aspects you refer to are also different from sign-type errors, where you assess the direction of significant effects. Look at the Gelman and Carlin papers.

        The point I am making is that, with proper generalisation of findings that leverage claims indicating a direction, the S-type error assesses your probability of being wrong. This is what many scientists look for.

        All this is explained, with examples, in the publications I referred to above.
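The S-type (sign) and M-type (magnitude) errors of Gelman and Carlin discussed in this exchange can be illustrated with a small simulation. This is only a sketch: the true effect size, standard error, and cutoff below are invented numbers for a deliberately low-powered study, not taken from any of the papers cited.

```python
import numpy as np

# Sketch of Gelman & Carlin's Type S (sign) and Type M (magnitude) errors
# for a low-powered study. True effect and standard error are made-up
# illustrative numbers.
rng = np.random.default_rng(0)

true_effect = 0.2   # small true effect
se = 1.0            # standard error of the estimate (low power)
n_sims = 100_000

estimates = rng.normal(true_effect, se, n_sims)
significant = np.abs(estimates / se) > 1.96   # two-sided test at .05

# Among "statistically significant" results only:
sig_est = estimates[significant]
type_s = np.mean(sig_est < 0)                          # wrong sign, given significance
exaggeration = np.mean(np.abs(sig_est)) / true_effect  # Type M (exaggeration) ratio

print(f"Power ~ {significant.mean():.3f}")
print(f"Type S error rate ~ {type_s:.3f}")
print(f"Exaggeration ratio (Type M) ~ {exaggeration:.1f}")
```

With these made-up numbers the study is badly underpowered, so a significant estimate has a non-trivial chance of having the wrong sign and, on average, greatly exaggerates the true effect, which is the point of their design-analysis argument.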

        • Ron: Your complaint seems to be that you think there are important topics ignored by this workshop, but I’m not sure how that really furthers the particular conversation. If you want to read a few pages in SIST on the Gelman and Carlin 2014 article (where their analysis is or appears to be in tension with severity), go to p. 359 of SIST.

          • rkenett

            Yes, I had seen the reference you point at, but it treats M-type errors, not S-type errors. Some of your argumentation however still holds, e.g., “When results seem out of whack with what’s known, it’s grounds to suspect the assumptions”.

            Please note that I took a different path, aiming at presentation of findings using a generalization approach based on alternative representations, some with meaning equivalence and some with surface similarity. The application of an S-type error calculation is indeed shpower, but the use of outside information, known before the data was collected, dampens the critiquing. An additional approach I looked into that has a similar flavour is the application of the Cornfield inequality, or sensitivity analysis, to support a verbally described claim.

            One nice feature of the Gelman and Carlin semi-empirical Bayes approach is that it forces a consideration of the study design, something sometimes avoided in Bayesian circles and certainly avoided in the literature on structural causality networks.

            By the way, I did not intend my note to look like a “complaint,” merely an observation and suggestion.

            • Ron: I’ve never seen Gelman call his approach “empirical Bayesian” (do you know where he does?). I doubt he would (I may be wrong or he may have changed) even though he does often view his priors as rooted in frequencies of some sort.

  6. It’s interesting that WordPress, in automatically triggering links to what appear to be related blogs, found the one from 2015 on climate change. Oreskes thinks .05 is too strict a standard, but there’s some confusion.

  7. I’ve now revised this post slightly because I got the sense, from some who emailed, that I wasn’t clear enough as to what I was trying to prevent and push back against in attending this conference. I hope this revision, numbered as (ii), is clearer.

  8. It seems to me the comments & post are talking past each other. Mashey, for example, is looking at climate skepticism, unlikely to be accepted; whereas you’re looking at skepticism regarding statistical tests, which is happening now. Is the reason the “no significance/no threshold” leaders, as you call them, attended a conference whose political agenda they might not agree with, in order to execute their own “political” agenda regarding statistical significance tests? I take it that is what your post was talking about and the reason you went was to put the brakes on it. The Wasserstein et al. (2019) editorial in The American Statistician says that the reason statistical tests will successfully be retired now is that there is “a perfect storm” of irreplication. A crisis should never be wasted. Very glad you were there to pull on the reins of that runaway horse!

    • Jean:
      Thank you for your comment. You are exactly right. Whether or not people are or become more skeptical about some climate change models is entirely separate from “interpretation B” of “bad statistics is their product”. I hope that my attending and speaking, and the follow-up discussions I’ve had and am having with people at the National Association of Scholars, is successfully pulling on the reins of that runaway horse. It’s very easy for those who are not aware of the disagreement regarding the statistical, philosophical and historical issues, to suppose there is general agreement among those who are concerned about a statistical crisis of replication that a good way to fix things is by banning the words “significance/significant” and avoiding using P value thresholds in interpreting results. But there isn’t agreement, and as critics point out, the “retire significance” reforms do nothing to block the key source of irreplication–data-dredging, selection effects, multiple testing. It’s the opposite!

      Eager investigators will still need to spin or data dredge when facing P values that are not small. However, if we’re not allowed to hold them accountable for violating pre-designated thresholds, the data dredgers won’t have to worry about their bad statistics. Ironically, the ability to test for lack of replication will also be stymied. People confuse the (bad) idea of having a single threshold, like .05, to unthinkingly use across all studies*, with the (good) idea that there should be a stipulation, in advance, of results that cannot be interpreted as counting in favor of a claim, e.g., the drug is beneficial, the practice is harmful (in a specified way), etc. Otherwise, there’s no testing. Of course I discuss all this in detail in the papers cited.

      * or move from a single small P value to a substantive scientific theory or claim–manifestly invalid for statistical significance test reasoning.

    • Perhaps people can explain to me why the National Association of Scholars chose to roll out its Irreproducibility Report,
      with an Afterword by Will Happer: WH1-12

      not with AAAS, NIH, NSF, the usual NAS or any other science organization, but with the *most* aggressive attacker of climate science/scientists in Congress at the time, Lamar Smith (R-TX):
      “Climate change
      Smith has unequivocally stated that he believes the climate is changing, and that humans play a role in climate change. However, he questions the extent of the impact, and accuses scientists of promoting a personal agenda unsupported by evidence.[54] Smith has made a number of false and misleading claims about climate change.[55]

      As of 2015, Smith has received more than $600,000 from the fossil fuel industry during his career in Congress.[56] In 2014, Smith got more money from fossil fuels than he did from any other industry.[57] Smith publicly denies global warming.[58][59][60] Under his leadership, the House Science committee has held hearings that feature the views of climate change deniers,[61] subpoenaed the records and communications of scientists who published papers that Smith disapproved of,[58] and attempted to cut NASA’s earth sciences budget.[62] He has been criticized for conducting “witch hunts” against climate scientists.[57] In his capacity as chair of the House Committee on Science, Space and Technology, Smith issued more subpoenas in his first three years than the committee had for its entire 54-year history.[57] In a June 2016 response letter to the Union of Concerned Scientists, Mr. Smith cited the work of the House Un-American Activities Committee in the 1950s as valid legal precedent for his investigation.[63][64]

      On December 1, 2016, as chair on the House Committee on Science, Space and Technology, he tweeted out on behalf of that committee a Breitbart article denying climate change.[65]”

      • John:
        We don’t have a clue why they launched their publication with Lamar Smith, and know absolutely nothing about this group. These are intriguing observations but mostly at right angles from the issues in my post. Jean noted in her comment that “Mashey, for example, is looking at climate skepticism, unlikely to be accepted; whereas you’re looking at skepticism regarding statistical tests, which is happening now.” Another question Jean raises is worthwhile: “Is the reason the ‘no significance/no threshold’ leaders, as you call them, attended a conference whose political agenda they might not agree with, in order to execute their own ‘political’ agenda regarding statistical significance tests?” Maybe they think Lamar can wave a wand and write science policy that is favored by the anti-statistical significance test group.

  9. Tom Kepler

    Let’s deal first with the striking asymmetry here. Deborah Mayo has spent her professional career thinking deeply about statistics and error in science and sharing her ideas broadly. David Randall and Christopher Welser are, respectively, an author of young adult fiction and a faculty fellow in Classics. Neither has publicly expressed any interest in statistics prior to releasing their report, The Irreproducibility Crisis of Modern Science. They have instead repeatedly expressed their interest in social counter-reform. They have their mission and Mayo has hers. They are not equally invested in actually improving statistical practice.

    The event does, however, offer an opportunity to consider the problem of bias. Statistical practice provides the tools to reduce subjectivity and personal bias substantially but never to eliminate them entirely. We will never, in spite of the promises from the promoters of Big Data Science, automate data analysis, much less scientific practice. Human judgment will never be replaced by algorithms. The question that should be asked is the extent to which the remaining bias is relevant to the topic at hand and the impact it has on the outcome.

    Academic scientists clearly have a personal interest in keeping their jobs and gaining respect among their colleagues. Their understandable bias is to publish exciting papers. This is why the results of their research should be examined carefully with this understanding in mind and with the best statistical methods available. I am not downplaying the importance of such bias. It and the perverse incentives that make it hard to overcome are devastating to the social organization of science and thereby to the practice of science as a whole. In spite of this ineradicable self-interest, however, the many scientists I know are certainly interested in making real, repeatable discoveries.

    Biases also arise in the consumers of science and those who are affected by its impact on policy. A CEO whose profits stand to fall if anthropogenic climate change is taken seriously by the EPA has a very different kind of bias with respect to the science. Both he and the academician are interested in preserving their own jobs and social standing, but the CEO is biased against a particular set of results. There is much more to be said about this, but it is important to admit that there are different sources of bias that lead to different problems in scientific research and its influence on policy.

    Apart from that, it’s fun to see Mayo educate her audience on type-II errors and Neyman-Pearson testing. I hope they were particularly attentive to her discussion of non-binary classification of test results. It’s not just the accept/reject classification of our textbooks, but the consideration of the severity of the test. It is very easy to see failures of replicability but much more difficult to see failures of discovery.

    • Tom:
      Thank you so much for your very rich comment, I’m so glad to rein in the discussion to get to the main focus of my post.
      Just to focus on something you say at the end:
      “it’s fun to see Mayo educate her audience on type-II errors and Neyman-Pearson testing. I hope they were particularly attentive to her discussion of non-binary classification of test results.”
      I’m afraid that error statistical tests, whether they be called statistical significance tests (as with Fisher) or tests of statistical hypotheses (Neyman and Pearson), are presented in the most oversimplified and distorted ways. In many cases, people don’t have the statistical background; in others, they’re just repeating a straw man version of tests they’ve been taught somewhere, or one pushed by anti-testers. I’m not sure I’ve succeeded in educating them.

      I’m prepared to allow “NHST” to stand for the abusive animal it is often associated with, but it’s an animal that none of the founders would ever have endorsed, and that good practice shuns. In NHST, you go directly from a single small P-value to whatever substantive full-blown scientific theory you are putting forward to “explain” it. A radical violation of severity. Nonsignificant results are construed as “proving” a point null hypothesis (as alleged in Amrhein et al. 2019). Another radical violation of severity. There is no recognition that statistical tests may be used in a non-binary fashion, even though people are supposed to know about power analysis. Severity blends tests and confidence intervals while improving on both. Instead of having a single confidence level of .95, each point in a CI corresponds to a different hypothesis that has passed with a different severity. The results correspond to inferring which discrepancies are well or poorly tested–degrees of well-testedness being given to each. This is akin, mathematically, to confidence distributions (CDs).
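      The correspondence between CI points and severity assessments can be sketched numerically for a one-sided Normal test with known sigma. This is a toy illustration: the sample mean, sigma, and n below are invented numbers, and the calculation follows the standard known-sigma Normal case.

```python
from math import sqrt
from statistics import NormalDist

# Toy one-sided Normal test: H0: mu <= 0 vs H1: mu > 0, sigma known.
# The numbers are invented for illustration.
sigma, n = 1.0, 100
se = sigma / sqrt(n)        # standard error = 0.1
xbar = 0.2                  # observed sample mean

def severity_mu_greater(mu1):
    """SEV(mu > mu1): probability of a result less extreme than xbar, were mu = mu1."""
    return NormalDist().cdf((xbar - mu1) / se)

# Each point mu1 corresponds to a hypothesis mu > mu1 that has passed with
# its own severity, rather than a single .95 attached to the whole interval:
for mu1 in [0.0, 0.1, 0.2, 0.3]:
    print(f"SEV(mu > {mu1}) = {severity_mu_greater(mu1):.3f}")

# The lower bound of the one-sided 95% interval, xbar - 1.645*se, is
# exactly the point whose severity is .95:
lower = xbar - 1.645 * se
print(f"SEV(mu > {lower:.4f}) = {severity_mu_greater(lower):.3f}")
```

      Reading off the severity at each candidate discrepancy, rather than reporting one interval at one fixed level, is what gives the “degrees of well-testedness” described above.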

      Unfortunately, the “new statistics”, rather than use CDs or vary their confidence levels, recommends .95, even as it blasts .05 as arbitrary in tests. They are exactly on par. I’m sorry to say that the confidence interval (CI) “crusaders”, as Stuart Hurlbert calls them–at least the leaders–are so keen to have people use 95% CIs rather than tests (which needn’t be stuck at .05, but should involve balancing Type I and II error probabilities) that they are among the leading instigators in portraying tests illicitly. (See p. 436, 6.7 Farewell Keepsake in SIST.)

      They’re shooting themselves in the foot (or their feet) because, for one thing, confidence levels are now starting to lose their error probability meaning (e.g., the NEJM cited in my editorial “P value thresholds: Forfeit at your peril”). If this becomes the new normal then only error statistical tests, and not CIs, will afford error control (along with any procedures designed to match them). Yet Neyman developed CIs as inversions of tests. Of course, you know all this.
      I’ll comment some more later on.

  10. The climate change deniers seek to undermine the well-established scientific consensus that climate warming is occurring and that mankind is a significant contributor. They will adopt both “A” and “B”, i.e., criticizing bad statistics to deny “risk,” but also criticizing the use of P value-based research as so corrupted by errors as to render all results using that method unreliable. Your contribution that P-value opponents are throwing the baby (useful statistics) out with the bathwater (data mining and other misuses) is valuable. You ran the risk of being misquoted or used to political ends, but you did not run the risk of remaining silent and allowing opponents of science to control the public space of communication without opposition.

    There is a reproducibility problem in science, especially the social sciences. That needs to be addressed, but to say that because of the reproducibility problem no science should be trusted is an overgeneralization.

    To boycott a conference because speakers are not published or because their organization is funded by climate deniers seems to me to be an ad hominem fallacy. Those speakers should be judged on the merits of their arguments, not on the basis of their membership in a group.

    In sum, I think you did science a service, you did philosophy of science a service, and you did me as a citizen a service by participating in the conference. Thank you for that.

    • William:
      Thank you so much for your insightful comment! And for your thanks. You’re one of the few people who got the point.
      “Your contribution that P-value opponents are throwing the baby (useful statistics) out with the bathwater (data mining and other misuses) is valuable. You ran the risk of being misquoted or used to political ends, but you did not run the risk of remaining silent and allowing opponents of science to control the public space of communication without opposition.”
      Only I think a bit, or a lot, more opposition was probably warranted.
      I’ll ponder this some more!

  11. Steven McKinney

    I can only agree with William Hendricks, well said. The more I read about the participants, and review their misguided websites and ludicrous papers, the more my blood boils and I realize I should refrain from saying what I really think of these characters. Thank you Mayo for fighting the good fight. You are a much better person than I.

  12. Steven McKinney

    I had missed the paragraph in Kafadar’s final editorial announcing that she has started an ASA group to sort out the mess spilled by Wasserstein in the pages of the ASA publication vehicle, The American Statistician. I note, positively, that Karen Kafadar’s group includes Yoav Benjamini, Bradley Efron, Nancy Reid, Stephen Stigler and a host of other sensible scientists. Thanks for your links to that.

    The most ironic discussion at the NAS (no, not that NAS) conference is this one:

    David Trafimow, Professor, Psychology Department, New Mexico State University

    “What Journals Can Do To Fix The Irreproducibility Crisis”

    which consists of banishing statistical methods with decades of proven usefulness. Thankfully he is not a member of Kafadar’s group.

    • Steven:
      Thank you so much for your comments, it’s great to have some kindred spirits come around. Several people emailed me expressing their reluctance to join in the comments, largely because of the climate change remarks.
      Yes, it’s a shock to hear someone advocate, as a cure, “banishing statistical methods with decades of proven usefulness”. You could say, nihilistically, that if you never make statistical inferences, you won’t make bad ones. However, we know that the authors who publish in his journals do not refrain from making inferences; it is found that they exaggerate their findings in ways that would be blocked by using statistical significance tests. (See a blogpost by D. Lakens, and the paper by Fricker et al., 2019.) I call Trafimow’s policy “don’t ask, don’t tell” (search this blog for a post), because he says it’s OK to have used error statistical methods in arriving at a result, but you should not mention this in your write-up. He seems to be a very nice person (I made a point of talking with him during lunch at the conference), and I have no doubt he has seen the terrible things (cases of bad stats) that he described (to me) in social psychology. But why insist that studies that work hard to sustain controls also give up statistical methods? We should not let the tail wag the dog. Unfortunately, he also retains confusions, and claims significance tests are invalid because they do not supply posterior probabilities.

      Yes, the Task Force has excellent people on it. Great work by Kafadar. I will be very impressed if they manage not to kowtow to Wasserstein. I think the ASA board should ask Wasserstein to refrain from his campaign to get journals to adopt his “abandon significance/never use P value thresholds to interpret data” position, since, wearing his executive director’s hat, he is an ASA spokesperson. He should also be blocked from using ASA letterhead for his campaign. Yet he is absolutely driven to impose his view on all science. See my “doing more harm than good” post.

  13. Steven McKinney

    Bradley Efron kowtows to no one.
