
Bad Statistics is Their Product: Fighting Fire With Fire (ii)

Mayo fights fire w/ fire

I. Doubt is Their Product is the title of a 2008 book by David Michaels, Assistant Secretary for OSHA from 2009-2017. I first mentioned it on this blog back in 2011 (“Will the Real Junk Science Please Stand Up?”). The expression is from a statement by a cigarette executive (“doubt is our product”), and the book’s thesis is explained in its subtitle: How Industry’s Assault on Science Threatens Your Health. Imagine you have just picked up a book, published in 2020: Bad Statistics is Their Product. Is the author writing about how exaggerating bad statistics may serve the interest of denying well-established risks? [Interpretation A]. Or perhaps she’s writing about how exaggerating bad statistics serves the interest of denying well-established statistical methods? [Interpretation B]. Both may result in distorting science and even in dismantling public health safeguards–especially if made the basis of evidence policies in agencies. A responsible philosopher of statistics should care.

II. Fixing Science. So, one day in January, I was invited to speak on a panel, “Falsifiability and the Irreproducibility Crisis,” at a conference, “Fixing Science: Practical Solutions for the Irreproducibility Crisis.” The inviter, David Randall, whom I did not know, explained that a speaker had withdrawn from the session because of some kind of controversy surrounding the conference, but did not give details. He pointed me to an op-ed in the Wall Street Journal. I had already heard about the conference months before (from Nathan Schachtman), and before checking out the op-ed, my first thought was: I wonder if the controversy has to do with the fact that a keynote speaker is Ron Wasserstein, Executive Director of the American Statistical Association (ASA) and a leading advocate of retiring “statistical significance” and barring P-value thresholds in interpreting data. Another speaker (D. Trafimow) eschews all current statistical inference methods (e.g., P-values, confidence intervals) as just too uncertain. More specifically, I imagined it might have to do with the controversy over whether the March 2019 editorial in The American Statistician (Wasserstein, Schirm, and Lazar 2019) was a continuation of the ASA 2016 Statement on P-values, and thus an official ASA policy document, or not. Karen Kafadar, recent ASA President, made it clear in December 2019 that it is not.[2] The “no significance/no thresholds” view is the position of the guest editors of the March 2019 issue. (See “P-Value Statements and Their Unintended(?) Consequences” and “Les stats, c’est moi“.) Kafadar created a new 2020 ASA Task Force on Statistical Significance and Replicability to:

prepare a thoughtful and concise piece …without leaving the impression that p-values and hypothesis tests—and, perhaps by extension as many have inferred, statistical methods generally—have no role in “good statistical practice”. (Kafadar 2019, p. 4)

Maybe those inviting me didn’t know I’m “anti” the Anti-Statistical Significance campaign (“On some self-defeating aspects of the 2019 recommendations“), that I agree with John Ioannidis (2019) that “retiring statistical significance would give bias a free pass“, and that I published an editorial, “P-value Thresholds: Forfeit at Your Peril“. While I regard many of today’s statistical reforms as welcome (preregistration, testing for replication, transparency about data-dredging, P-hacking, and multiple testing), I argue that those in Wasserstein et al. (2019) are “Doing more harm than good“. In “Don’t Say What You Don’t Mean“, I express doubts that Wasserstein et al. (2019) could really mean to endorse certain statements in their editorial that are so extreme as to conflict with the ASA 2016 guide on P-values. To be clear, I reject oversimple dichotomies and cookbook uses of tests, long lampooned, and I have developed a reformulation of tests that avoids the fallacies of significance and non-significance.[1] It’s just that many of the criticisms are confused, and, consequently, so are many reforms.
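For concreteness, here is a minimal sketch, in Python, of how a post-data severity assessment might be computed for a simple one-sided normal test; the function name and numbers are illustrative only, and the full account is in SIST.

```python
# Illustrative sketch (not an excerpt from SIST): post-data severity for the
# one-sided test T+: H0: mu <= mu0 vs H1: mu > mu0, with sigma assumed known.
# SEV(mu > mu1) = probability of a result less extreme than the one observed,
# were mu only as large as mu1.
from scipy.stats import norm

def severity_mu_greater_than(mu1, xbar, sigma, n):
    """SEV(mu > mu1) = Pr(Xbar <= xbar_obs; mu = mu1)."""
    se = sigma / n ** 0.5
    return norm.cdf((xbar - mu1) / se)

# Made-up example: H0: mu <= 0, sigma = 1, n = 100, observed xbar = 0.2
# (a result just reaching the 0.05 threshold, z = 2).
for mu1 in (0.0, 0.1, 0.2, 0.3):
    sev = severity_mu_greater_than(mu1, xbar=0.2, sigma=1.0, n=100)
    print(f"SEV(mu > {mu1}) = {sev:.3f}")
# The same "significant" result gives strong grounds for mu > 0 (about 0.98)
# but poor grounds for mu > 0.2 (0.5), blocking fallacies of rejection.
```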

III. Bad Statistics is Their Product. It turns out that the brouhaha around the conference had nothing to do with all that. I thank Dorothy Bishop for pointing me to her blog post, which gives a much fuller background. Aside from the lack of women (I learned a new word–a manference), her real objection is on the order of “Bad Statistics is Their Product”: the groups sponsoring the Fixing Science conference, the National Association of Scholars and the Independent Institute, Bishop argues, are using the replication crisis to cast doubt on well-established risks, notably those of climate change. She refers to a book whose title echoes David Michaels’s: Merchants of Doubt (2010), by historians of science Naomi Oreskes and Erik Conway. Bishop writes:

Uncertainty about science that threatens big businesses has been promoted by think tanks … which receive substantial funding from those vested interests. The Fixing Science meeting has a clear overlap with those players. (Bishop)

The speakers on bad statistics, as she sees it, are “foils” for these interests, and thus “responsible scientists should avoid” the meeting.

But what if things are the reverse? What if the “bad statistics is our product” leaders also have an agenda? By influencing groups who have a voice in evidence policy in government agencies, they might effectively discredit methods they don’t like, and advance those they do. Suppose you have strong arguments that the consequences of this will undermine important safeguards (despite the key players being convinced they’re promoting better science). Then you should speak, if you can, and not stay away. You should try to fight fire with fire.

IV. So What Happened? So I accepted the invitation and gave my talk what struck me as a fairly radical title: “P-Value ‘Reforms’: Fixing Science or Threats to Replication and Falsification?” (The abstract and slides are below.) Bishop is right that evidence of bad science can be exploited to selectively weaken entire areas of science; but evidence of bad statistics can also be exploited to selectively weaken entire methods one doesn’t like, and successfully gain acceptance of alternative methods, without the hard work of showing those alternatives do a better, or even a good, job at the task at hand. Of course, both of these things might be happening simultaneously.

Do the conference organizers overlap with science policy, as Bishop alleges? I’d never encountered either outfit before, but Bishop quotes from their annual report:

In April we published The Irreproducibility Crisis, a report on the modern scientific crisis of reproducibility—the failure of a shocking amount of scientific research to discover true results because of slipshod use of statistics, groupthink, and flawed research techniques. We launched the report at the Rayburn House Office Building in Washington, DC; it was introduced by Representative Lamar Smith, the Chairman of the House Committee on Science, Space, and Technology.

So there is a link with science policy makers in Washington, and their publication, The Irreproducibility Crisis, is clearly prepared to find its scapegoat in the bad statistics supposedly encouraged by statistical significance tests. To its credit, it discusses how data-dredging and multiple testing can make it easy to arrive at impressive-looking findings that are spurious, but nothing is said about ways to adjust or account for multiple testing and multiple modeling. (P-values are defined correctly, but their interpretation of confidence levels is incorrect.) Published before the Wasserstein et al. (2019) call to end P-value thresholds, which would require the FDA and other agencies to end what many consider vital safeguards of error control, it doesn’t go that far. Not yet, at least! Trying to prevent that from happening is a key reason I decided to attend. (updated 2/16)
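To see how easily unadjusted multiple testing yields impressive-looking but spurious findings, and what a simple adjustment does, here is a minimal simulation sketch (my own illustration with made-up numbers; it is not taken from their report):

```python
# Illustrative sketch: searching across many null effects without adjustment
# virtually guarantees some "significant" results; a simple Bonferroni
# correction largely removes them. All numbers are made up for illustration.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_tests, n_per_group = 100, 30          # 100 hypotheses, all nulls true

pvals = []
for _ in range(n_tests):
    a = rng.normal(0, 1, n_per_group)   # "treatment" group, no real effect
    b = rng.normal(0, 1, n_per_group)   # "control" group
    pvals.append(stats.ttest_ind(a, b).pvalue)
pvals = np.array(pvals)

print("nominal 'discoveries' at 0.05:  ", (pvals < 0.05).sum())           # about 5 expected by chance
print("after Bonferroni (0.05 / 100):  ", (pvals < 0.05 / n_tests).sum()) # typically 0
```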

My first step was to send David Randall my book, Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (2018, CUP)–which he actually read and wrote a report on–and I met up with him in NYC to talk. He seemed surprised to learn about the controversies over statistical foundations and the disagreement about reforms. So did I hold people’s feet to the fire at the conference (when it came to scapegoating statistical significance tests and banning P-value thresholds for error probability control)? I did! I continue to do so in communications with David Randall. (I’ll write more in the comments to this post, once our slides are up.)

As for climate change, I wound up entirely missing that part of the conference: due to the grounding of all flights to and from CLT on the day I was to travel, thanks to rain, hail, and tornadoes, I could only fly the following day, so our sessions were swapped. I hear the presentations will be posted. Doubtless, some people will use bad statistics and the “replication crisis” to claim there’s reason to reject our best climate change models, without having adequate knowledge of the science. But the real and present danger today that I worry about is that they will use bad statistics to claim there’s reason to reject our best (error) statistical practices, without adequate knowledge of the statistics or of the philosophical and statistical controversies behind the “reforms”.

Let me know what you think in the comments.

V. Here are my abstract and slides

P-Value “Reforms”: Fixing Science or Threats to Replication and Falsification?

Mounting failures of replication give a new urgency to critically appraising proposed statistical reforms. While many reforms are welcome, others are quite radical. The sources of irreplication are not mysterious: in many fields, latitude in collecting and interpreting data makes it too easy to dredge up impressive-looking findings even when spurious. Paradoxically, some of the reforms intended to fix science enable rather than reveal illicit inferences due to P-hacking, multiple testing, and data-dredging. Some even preclude testing and falsifying claims altogether. Too often the statistics wars become proxy battles between competing tribal leaders, each keen to advance a method or philosophy, rather than improve scientific accountability.

[1] Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (SIST), 2018; SIST excerpts; Mayo and Cox 2006; Mayo and Spanos 2006.

[2] All uses of “ASA II” (the Wasserstein, Schirm, and Lazar 2019 editorial) on this blog must now be qualified to reflect this.

[3] You can find a lot on the conference and the groups involved on-line. The letter by Lenny Teytelman warning people off the conference is here. Nathan Schachtman has a post up today on his law blog here.


Categories: ASA Guide to P-values, Error Statistics, P-values, replication research, slides

The Statistics Wars: Errors and Casualties


Had I been scheduled to speak later at the 12th MuST Conference & 3rd Workshop “Perspectives on Scientific Error” in Munich, rather than on day 1, I could have (constructively) illustrated some of the errors and casualties by reference to a few of the conference papers that discussed significance tests. (Most gave illuminating discussions of such topics as replication research, the biases that discredit meta-analysis, statistics in the law, and formal epistemology [i].) My slides follow my abstract.

Categories: slides, stat wars and their casualties

Replication Crises and the Statistics Wars: Hidden Controversies


Below are the slides from my June 14 presentation at the X-Phil conference on Reproducibility and Replicability in Psychology and Experimental Philosophy at University College London. What I think must be examined seriously are the “hidden” issues that are going unattended in replication research and the related statistics wars. An overview of the “hidden controversies” is on slide #3. Although I was presenting them as “hidden”, I hoped they wouldn’t be quite as invisible as I found them to be throughout the conference. (Since my talk was at the start, I didn’t know what to expect–else I might have noted some examples that seemed to call for further scrutiny.) Exceptions came largely (but not exclusively) from a small group of philosophers (me, Machery, and Fletcher). Then again, there were parallel sessions, so I missed some. However, I did learn something about X-phil, particularly from the very interesting poster session [1]. This new area should invite much, much more scrutiny of statistical methodology from philosophers of science.

[1] The women who organized and ran the conference did an excellent job: Lara Kirfel, a psychology PhD student at UCL, and Pascale Willemsen from Ruhr University.

Categories: Philosophy of Statistics, replication research, slides

Your data-driven claims must still be probed severely

Vagelos Education Center

Below are the slides from my talk today at Columbia University at a session, Philosophy of Science and the New Paradigm of Data-Driven Science, at an American Statistical Association Conference on Statistical Learning and Data Science/Nonparametric Statistics. Todd Kuffner was brave to sneak philosophy of science into an otherwise highly mathematical conference.

Philosophy of Science and the New Paradigm of Data-Driven Science (Room VEC 902/903)
Organizer and Chair: Todd Kuffner (Washington U)

1. Deborah Mayo (Virginia Tech), “Your Data-Driven Claims Must Still be Probed Severely”
2. Ian McKeague (Columbia), “On the Replicability of Scientific Studies”
3. Xiao-Li Meng (Harvard), “Conducting Highly Principled Data Science: A Statistician’s Job and Joy”


Categories: slides, Statistics and Data Science
