Author Archives: Mayo

April 1, 2020: Memory Lane of April 1’s past



My “April 1” posts for the past 8 years have been so close to the truth or possible truth that they weren’t always spotted as April Fool’s pranks, which is what made them genuine April Fool’s pranks. (After a few days I either labeled them as such, e.g., “check date!”, or revealed it in a comment.) Given the level of current chaos and stress, I decided against putting up a planned post for today, so I’m just doing a memory lane of past posts. (You can tell from reading the comments which had most people fooled.)

4/1/12 Philosophy of Statistics: Retraction Watch, Vol. 1, No. 1 

This morning I received a paper I have been asked to review (anonymously as is typical). It is to head up a forthcoming issue of a new journal called Philosophy of Statistics: Retraction Watch.  This is the first I’ve heard of the journal, and I plan to recommend they publish the piece, conditional on revisions. I thought I would post the abstract here. It’s that interesting.

“Some Slightly More Realistic Self-Criticism in Recent Work in Philosophy of Statistics,” Philosophy of Statistics: Retraction Watch, Vol. 1, No. 1 (2012), pp. 1-19. In this paper we delineate some serious blunders that we and others have made in published work on frequentist statistical methods. First, although we have claimed repeatedly that a core thesis of the frequentist testing approach is that a hypothesis may be rejected with increasing confidence as the power of the test increases, we now see that this is completely backwards, and we regret that we have never addressed, or even fully read, the corrections found in Deborah Mayo’s work since at least 1983, and likely even before that.

You can read the rest here.


4/1/13 Flawed Science and Stapel: Priming for a Backlash? 

My first fraud kit


Diederik Stapel is back in the news, given the availability of the English translation of the Tilburg (Levelt and Noort Committees) Report as well as his book, Ontsporing (Dutch for “Off the Rails”), where he tries to explain his fraud. An earlier post on him is here. While the disgraced social psychologist was shown to have fabricated the data for something like 50 papers, it seems that some people think he deserves a second chance. A childhood friend, Simon Kuper, in an article “The Sin of Bad Science,” describes a phone conversation with Stapel:…..

You can read the rest here.


4/1/14 Skeptical and enthusiastic Bayesian priors for beliefs about insane asylum renovations at Dept of Homeland Security: I’m skeptical and unenthusiastic 

Danvers State Hospital


I had heard of medical designs that employ individuals who supply Bayesian subjective priors that are deemed either “enthusiastic” or “skeptical” as regards the probable value of medical treatments.[i] …But I’d never heard of these Bayesian designs in relation to decisions about building security or renovations! Listen to this….

You may have heard that the Department of Homeland Security (DHS), whose 240,000 employees are scattered among 50 office locations around D.C., has been planning to have headquarters built at St Elizabeths, an abandoned insane asylum in D.C. [ii]. (Here’s a 2015 update.)

You can read the rest here.


4/01/15 Are scientists really ready for ‘retraction offsets’ to advance ‘aggregate reproducibility’? (let alone ‘precautionary withdrawals’) 



Given recent evidence of the irreproducibility of a surprising number of published scientific findings, the White House’s Office of Science and Technology Policy (OSTP) sought ideas for “leveraging its role as a significant funder of scientific research to most effectively address the problem”, and announced funding for projects to “reset the self-corrective process of scientific inquiry”. (first noted in this post.)

You can read the rest here.


4/1/16 Er, about those “other statistical approaches”: Hold off until a balanced critique is in?

I could have told them that the degree of accordance enabling the “6 principles” on p-values was unlikely to be replicated when it came to most of the “other approaches” with which some would supplement or replace significance tests–notably Bayesian updating, Bayes factors, or likelihood ratios (confidence intervals are dual to hypothesis tests). [My commentary is here.] So now they may be advising a “hold off” or “go slow” approach until some consilience is achieved. Is that it? I don’t know. I was tweeted an article about the background chatter taking place behind the scenes; I wasn’t one of the people interviewed for this. Here are some excerpts; I may add more later, after it has had time to sink in. (Check back later.)

You can read the rest here.


4/1/17 and 4/1/18 were slight updates of 4/1/16.

4/1/19 there’s a man at the wheel in your brain & he’s telling you what you’re allowed to say (not probability, not likelihood)

It seems like every week something of excitement in statistics comes down the pike. Last week I was contacted by Richard Harris (and 2 others) about the recommendation to stop saying the data reach “significance level p” but rather simply say

“the p-value is p”.

(For links, see my previous post.) Friday, he wrote to ask if I would comment on a proposed restriction (?) on saying a test had high power! I agreed that we shouldn’t say a test has high power, but only that it has high power to detect a specific alternative, but I wasn’t aware of any rulings from those in power on power. He explained it was an upshot of a reexamination, by a joint group of the boards of statistical associations in the U.S. and UK, of the full panoply of statistical terms. Something like that. I agreed to speak with him yesterday. He emailed me the proposed ruling on power:

You can read the rest here.


Categories: Comedy, Statistics | Leave a comment

The Corona Princess: Learning from a petri dish cruise (i)


Q. Was it a mistake to quarantine the passengers aboard the Diamond Princess in Japan?

A. The original statement, which is not unreasonable, was that the best thing to do with these people was to keep them safely quarantined in an infection-control manner on the ship. As it turned out, that was very ineffective in preventing spread on the ship. So the quarantine process failed. I mean, I’d like to sugarcoat it and try to be diplomatic about it, but it failed. I mean, there were people getting infected on that ship. So something went awry in the process of quarantining on that ship. I don’t know what it was, but a lot of people got infected on that ship. (Dr. A Fauci, Feb 17, 2020)

This is part of an interview of Dr. Anthony Fauci, the coronavirus point person we’ve been seeing so much of lately. Fauci has been the director of the National Institute of Allergy and Infectious Diseases since all the way back in 1984! You might find his surprise surprising. Even before getting our recent cram course on coronavirus transmission, tales of cruises being hit with viral outbreaks are familiar enough. The horror stories from passengers on the floating petri dish were well known by this Feb 17 interview. Even if everything had gone as planned, the quarantine was really only for the (approximately 3700) passengers, because the 1000 or so crew members still had to run the ship, as well as cook and deliver food to the passengers’ cabins. Moreover, the ventilation systems on cruise ships can’t filter out particles smaller than 5000 or 1000 nanometers.[1]

“If the coronavirus is about the same size as SARS [severe acute respiratory syndrome], which is 120 nanometers in diameter, then the air conditioning system would be carrying the virus to every cabin,” according to Purdue researcher, Qingyan Chen, who specializes in how air particles spread in different passenger crafts. (His estimate was correct: the coronavirus is 120 nanometers.) Halfway through the quarantine, after passenger complaints, they began circulating only fresh air–which would have been preferable from the start. By then, however, it was too late: the ventilation system was already likely filled with the virus, says Chen.[2] Arthur Caplan, the bioethicist who is famous for issuing rulings on such matters, declares that

“Boats are notorious places for being incubators for viruses. It’s only morally justified to keep people on the boat if there are no other options.”

Admittedly, it is hard to see an alternative option to accommodate so many passengers for a 2-week quarantine on land, and there was the possible danger of any infections spreading to the local population in Japan. So, by his assessment, it may be considered morally justified.

The upshot: As of 19 March 2020, at least 712 out of the 3,711 passengers and crew had tested positive for covid-19; 9 of those who were on board have died from the disease (all over the age of 70). As I was writing this, I noted a new CDC report on the Diamond Princess as well as other cruise ships; they state 9 deaths.[3] A table on the distribution of ages of passengers on the Diamond Princess is in Note [4].

So how did the Diamond Princess cruise ship become a floating petri dish for the coronavirus from Feb 4-Feb 20?

The Quarantine

It was their last night of a 2-week luxury cruise aboard the Diamond Princess in Japan (Feb 3) when the captain came on the intercom. He announced: a passenger on this ship who disembarked in Hong Kong 9 days ago (Jan 25) has tested positive for the coronavirus. (He was on board for 5 days.) Everyone will have to stay on board an extra day to be examined by the Japanese health authorities. A new slate of activities was arranged to occupy passengers during the day of health screening–later mostly dropped. But on the evening of February 3, things continued on the ship more or less as before the intercom message.

“The response aboard the Diamond Princess reflected concern, but not a major one. The buffets remained open as usual. Onboard celebrations, opera performances and goodbye parties continued”. (NYT, March 8)

The next day, as health officials went door to door to screen passengers, guests still circulated on board, lined up for buffets, and used communal spaces. But then, the following morning (Feb 5), as guests were heading to breakfast, the captain came over the intercom again. He announced that 10 people had tested positive for the coronavirus and would be taken off the ship. Everyone else would now have to be quarantined in their cabins for 14 days. The second day of the quarantine (Feb 6) it was announced that 20 more people had tested positive, then on day three, 41 more, then 64 more, and on and on. By the end of the quarantine on February 19 at least 621 on the ship had tested positive for the virus.

Adding to the stress, “we quickly learned that our tests were part of an initial batch of 273 samples and that the first 10 cases reported on day one were only from the first 31 samples that had been processed” from the passengers with highest risk. (U.S. passenger, Spencer Fehrenbacher, interviewed on the ship)

As the number of infected ballooned, passengers were not always informed right away; some took to counting ambulances lined up outside to find out how many new cases would be announced at some point. I wonder if the passengers were told that the very first person to test positive was a crew member responsible for preparing food. In fact, by February 9, around 20 of the crew members had tested positive, 15 of whom were workers preparing food. Crew members lived in close quarters, shared rooms and continued to eat their meals together buffet-style. They had no choice but to keep running the ship as best as they could.

“Feverish passengers were left in their rooms for days without being tested for the virus. Health officials and even some medical professionals worked on board without full protective gear. [Several got infected.] Sick crew members slept in cabins with roommates who continued their duties across the ship, undercutting the quarantine”. (NYT Feb 22)

Passengers in cabins without windows (and later, others) were allowed to walk on deck, six feet apart, for a short time daily. Unfortunately, presumed infection-free “green zones” were not rigidly separated from potentially contaminated “red zones”, and people walked back and forth between them. Gay Courter, a writer from the U.S. who, as it happens, situated one of her murder mysteries on a cruise ship, told Time “It feels like I’m in a bad movie. I tell myself, ‘Wake up, wake up, this isn’t really happening.’” (Time, Feb 11). This is the same bad movie we are all in now, except our horror tale has gotten much worse than on Feb 10.

At some point, I think Feb 10, the ship became the largest concentration of Covid-19 cases outside China, which is why you’ll notice the Diamond Princess has its own category in the data compiled by the World Health Organization (Worldometer).

In a Science Today article, a Japanese infectious disease specialist regretted the patchwork way in which passenger testing was done:

Japan has missed a chance to answer important epidemiological questions about the new virus and the illness it causes. For instance, a rigorous investigation that tested all passengers at the start of the quarantine and followed them through to the end could have provided information on when infections occurred and answered questions about transmission, the course of the illness, and the behavior of the virus.

(They were only able to test people in stages.) A similar paucity of testing in the U.S. robs us of crucial information for understanding and controlling the coronavirus. However, there is a fair amount being gleaned from the Diamond Princess, as you can see in the references below. (Please share additional references in the comments.) More is bound to follow.

Estimates from the Diamond Princess

“Data from the Diamond Princess cruise ship outbreak provides a unique snapshot of the true mortality and symptomatology of the disease, given that everyone on board was tested, regardless of symptoms”–or at least virtually all. [link] The estimates (from the Diamond Princess) I’ve seen are based on those from the London School of Hygiene and Tropical Medicine, in a paper still in preprint form, “Estimating the infection and case fatality ratio for COVID-19 using age-adjusted data from the outbreak on the Diamond Princess cruise ship”.

Adjusting for delay from confirmation-to-death, we estimated case and infection fatality ratios (CFR, IFR) for COVID-19 on the Diamond Princess ship as 2.3% (0.75%-5.3%) [among symptomatic] and 1.2% (0.38-2.7%) [all cases]. Comparing deaths onboard with expected deaths based on naive CFR estimates using China data, we estimate IFR and CFR in China to be 0.5% (95% CI: 0.2-1.2%) and 1.1% (95% CI: 0.3-2.4%) respectively. (PDF)

(For definitions and computations, see the article.) These are lower than the numbers we are often hearing. They used their lower fatality estimates to adjust (down) the estimates from China data. The paper lists a number of caveats.[5] I hope readers will have a look at it (it’s just a few pages) and share their thoughts in the comments. (Their estimates are in sync with an article by Fauci et al., to come out this week in NEJM; but whatever the numbers turn out to be, we know our healthcare system, in many places, is being overloaded. [6])
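To get a feel for the delay adjustment the authors describe, here is a minimal sketch of the idea, not their actual code: the naive CFR (deaths so far divided by cases so far) is biased low mid-outbreak, because recently confirmed cases haven’t had time to resolve. The exponential delay distribution and the case counts below are hypothetical stand-ins (the paper fits a lognormal confirmation-to-death distribution).

```python
import math

def delay_cdf(days, mean_delay=13.0):
    # Hypothetical exponential confirmation-to-death delay;
    # a stand-in for the paper's fitted lognormal distribution.
    return 1.0 - math.exp(-days / mean_delay)

def naive_cfr(daily_cases, deaths_to_date):
    # Deaths so far over cases so far: biased low during an outbreak.
    return deaths_to_date / sum(daily_cases)

def delay_adjusted_cfr(daily_cases, deaths_to_date):
    # Weight each day's cases by the probability that a fatal case
    # confirmed that day would already have died by the series' end.
    T = len(daily_cases)
    known = sum(c * delay_cdf(T - 1 - t) for t, c in enumerate(daily_cases))
    return deaths_to_date / known

# Illustrative counts only (not the actual Diamond Princess series):
cases = [10, 20, 41, 64, 100, 150, 236]
deaths = 7
print(naive_cfr(cases, deaths))           # biased low: recent outcomes unknown
print(delay_adjusted_cfr(cases, deaths))  # larger, delay-corrected estimate
```

The correction is large here only because the toy series is so short; with a longer follow-up most outcomes are known and the two estimates converge.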

Another study takes the daily reports of infections on the Diamond Princess to attempt to evaluate the impact of the quarantine, as imperfect as it was, in comparison to a counterfactual situation where nothing was done, including not removing infected people from the ship. They estimate nearly 80%, rather than 17%, would have been infected. [link]

We found that the reproductive number [R0] of COVID-19 in the cruise ship situation of 3,700 persons confined to a limited space was around 4 times higher than in the epicenter in Wuhan, where [it] was estimated to have a mean of 3.7.[7]

The interventions that included the removal of all persons with confirmed COVID-19 disease combined with the quarantine of all passengers substantially reduced the anticipated number of new COVID-19 cases compared to a scenario without any interventions (17% attack rate with intervention versus 79% without intervention) … However, the main conclusion from our modelling is that evacuating all passengers and crew early on in the outbreak would have prevented many more passengers and crew members from getting infected.” [link]

Only 76, rather than 621, would have been infected, they estimate. [8]

Conclusions: The cruise ship conditions clearly amplified an already highly transmissible disease. The public health measures prevented more than 2000 additional cases compared to no interventions. However, evacuating all passengers and crew early on in the outbreak would have prevented many more passengers and crew from infection.
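To see why a confined-ship R0 of roughly 14.8 (4 times the Wuhan mean of 3.7) is so explosive, here is a deliberately crude discrete-generation toy, not the Rocklöv et al. model; the generation count and the mechanics are illustrative assumptions:

```python
def final_infected(n, r0, i0=1, generations=10):
    # Toy discrete-generation SIR: each infectious person makes r0
    # effective contacts, each infecting a still-susceptible person
    # with probability s/n (depletion of susceptibles).
    s, cumulative, infectious = n - i0, i0, i0
    for _ in range(generations):
        new = min(s, round(infectious * r0 * s / n))
        s -= new
        cumulative += new
        infectious = new
    return cumulative

n = 3700
print(final_infected(n, r0=14.8))  # confined-ship transmission
print(final_infected(n, r0=1.0))   # transmission held near the threshold
```

Under these crude assumptions, R0 = 14.8 burns through essentially the whole ship within four generations, while R0 at the threshold of 1 produces only a trickle. The paper’s 79% no-intervention figure is lower than the toy’s because their model covers a finite outbreak window; the toy also omits the removal of confirmed cases.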

These studies and models are of interest, although I’m in no position to evaluate them. Please share your thoughts and information, and point out any errors you find. I will indicate updates in the title of this post.


I leave off with the remark of one of the U.S. passengers interviewed while still on the Diamond Princess:

“Being knee deep in the middle of a crisis leaves a person with two options — optimism or pessimism. The former gives a person strength, and the latter gives rise to fear.” (link)

He, like the others who were evacuated, faced an additional 2 weeks of quarantine.[9] He has since returned home and remains infection free.



[1] As a noteworthy aside, Fauci was able to assure the interviewer that the “danger of getting coronavirus now is just minusculely low” (in the U.S. on Feb. 17). What a difference 2 weeks can make.

[2] In a 2015 paper, Chen and colleagues found a cruise ship’s ventilation spread particles from cabin to cabin. They found that 1 infected person typically led to more than 40 cases a week later on a 2000-passenger cruise. By contrast, the coronavirus, with a reproductive rate of 2 cases per infected person, would only lead to 3 new cases during that time. Planes rely on high-strength air filters and are designed to circulate air within cabin sections.

[3] In a March 23 CDC report: Among 3,711 Diamond Princess passengers and crew, 712 (19.2%) had positive test results for SARS-CoV-2. Of these, 331 (46.5%) were asymptomatic at the time of testing. Among 381 symptomatic patients, 37 (9.7%) required intensive care, and nine (1.3%) died (8).

They found coronavirus in Diamond Princess cabins 17 days after passengers disembarked (prior to cleaning).
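The percentages in the CDC report follow directly from its counts; a quick check (using only the figures quoted above):

```python
positives, total = 712, 3711
asymptomatic, symptomatic = 331, 381
icu, deaths = 37, 9

# The symptomatic and asymptomatic counts partition the positives.
assert positives == asymptomatic + symptomatic

print(round(100 * positives / total, 1))        # 19.2% tested positive
print(round(100 * asymptomatic / positives, 1)) # 46.5% asymptomatic at testing
print(round(100 * icu / symptomatic, 1))        # 9.7% of symptomatic in intensive care
print(round(100 * deaths / positives, 1))       # 1.3% of positives died
```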

[4] A table from the Japanese National Institute of Infectious Diseases (NIID) (Source LINK):


[5] “There were some limitations to our analysis. Cruise ship passengers may have a different health status to the general population of their home countries, due to health requirements to embark on a multi-week holiday, or differences related to socio-economic status or comorbidities. Deaths only occurred in individuals 70 years or older, so we were not able to generate age-specific cCFRs; the fatality risk may also be influenced by differences in healthcare between countries”.

[6] In a March 26 article by Fauci and others, Covid-19 — Navigating the Uncharted, we read:

“If one assumes that the number of asymptomatic or minimally symptomatic cases is several times as high as the number of reported cases, the case fatality rate may be considerably less than 1%.”

[7] R0 may be viewed as the expected number of cases generated directly by 1 case in a susceptible population.
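In a homogeneous, fully susceptible population, R0 also pins down the expected eventual attack rate A through the classic final-size relation A = 1 − exp(−R0·A). This is the standard textbook SIR identity, not a formula from the cited paper; it can be solved by fixed-point iteration:

```python
import math

def final_attack_rate(r0, iterations=200):
    # Solve A = 1 - exp(-r0 * A) by fixed-point iteration,
    # starting from a mid-range guess.
    a = 0.5
    for _ in range(iterations):
        a = 1.0 - math.exp(-r0 * a)
    return a

print(round(final_attack_rate(2.0), 3))   # roughly 0.8 of the population
print(round(final_attack_rate(14.8), 4))  # essentially everyone
```

This is why an estimated shipboard R0 near 14.8, left unchecked, implies nearly universal infection, absent the time limits and case removals that the quarantine imposed.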

[8] The number in the most recent report is 712, but that would be after the quarantine ended on Feb 19.

[9] I read today that one of the U.S. evacuated passengers just entered a clinical trial on remdesivir. This would be over a month since the end of the first quarantine.



  • Giwa, A., Desai, A., & Duca, A.; translation by Sabrina Paula Rodera Zorita, MD (2020). “Novel 2019 Coronavirus SARS-CoV-2 (COVID-19): An Updated Overview for Emergency Clinicians – 03-23-20”. PubMed ID: 32207910. (LINK)
  • Japanese National Institute of Infectious Diseases (NIID). “Field Briefing: Diamond Princess COVID-19 Cases, 20 Feb Update” (LINK)
  • Rocklöv, J., Sjödin, H., & Wilder-Smith, A.  “COVID-19 outbreak on the Diamond Princess cruise ship: estimating the epidemic potential and effectiveness of public health countermeasures”, Journal of Travel Medicine, (Feb 28, 2020) [link]
  • Russell, T., Hellewell, J., Jarvis, C., van Zandvoort, K., Abbott, S., Ratnayake, R., Flasche, S., Eggo, R., & Kucharski, A. (2020). “Estimating the infection and case fatality ratio for COVID-19 using age-adjusted data from the outbreak on the Diamond Princess cruise ship.” MedRxiv: The preprint server for the Health Sciences. (March 9, 2020). (PDF)
  • Zheng, L., Chen, Q., Xu, J., & Wu, F. (2016). Evaluation of intervention measures for respiratory disease transmission on cruise ships. Indoor and Built Environment, 25(8), 1267–1278. (First Published online August 28, 2015 ). (PDF)
Categories: covid-19 | 22 Comments

Stephen Senn: Being Just about Adjustment (Guest Post)



Stephen Senn
Consultant Statistician

Correcting errors about corrected estimates

Randomised clinical trials are a powerful tool for investigating the effects of treatments. Given appropriate design, conduct and analysis they can deliver good estimates of effects. The key feature is concurrent control. Without concurrent control, randomisation is impossible. Randomisation is necessary, although not sufficient, for effective blinding. It also is an appropriate way to deal with unmeasured predictors, that is to say suspected but unobserved factors that might also affect outcome. It does this by ensuring that, in the absence of any treatment effect, the expected value of variation between and within groups is the same. Furthermore, probabilities regarding the relative variation can be delivered and this is what is necessary for valid inference. Continue reading

Categories: randomization, S. Senn | 6 Comments

My Phil Stat Events at LSE



I will run a graduate Research Seminar at the LSE on Thursdays from May 21-June 18:


(See my new blog for specifics.)
I am co-running a workshop from 19-20 June, 2020 at the LSE (Centre for Philosophy of Natural and Social Science, CPNSS), with Roman Frigg. Participants include:
Alexander Bird (King’s College London), Mark Burgman (Imperial College London), Daniele Fanelli (LSE), David Hand (Imperial College London), Christian Hennig (University of Bologna), Katrin Hohl (City University London), Daniël Lakens (Eindhoven University of Technology), Deborah Mayo (Virginia Tech), Richard Morey (Cardiff University), Stephen Senn (Edinburgh, Scotland).
If you have a particular Phil Stat event you’d like me to advertise, please send it to me.
Categories: Announcement, Philosophy of Statistics | Leave a comment

Replying to a review of Statistical Inference as Severe Testing by P. Bandyopadhyay


Notre Dame Philosophical Reviews is a leading forum for publishing reviews of books in philosophy. The philosopher of statistics, Prasanta Bandyopadhyay, published a review of my book Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (2018, CUP)(SIST) in this journal, and I very much appreciate his doing so. Here I excerpt from his review, and respond to a cluster of related criticisms in order to avoid some fundamental misunderstandings of my project. Here’s how he begins:

In this book, Deborah G. Mayo (who has the rare distinction of making an impact on some of the most influential statisticians of our time) delves into issues in philosophy of statistics, philosophy of science, and scientific methodology more thoroughly than in her previous writings. Her reconstruction of the history of statistics, seamless weaving of the issues in the foundations of statistics with the development of twentieth-century philosophy of science, and clear presentation that makes the content accessible to a non-specialist audience constitute a remarkable achievement. Mayo has a unique philosophical perspective which she uses in her study of philosophy of science and current statistical practice.[1]


I regard this as one of the most important philosophy of science books written in the last 25 years. However, as Mayo herself says, nobody should be immune to critical assessment. This review is written in that spirit; in it I will analyze some of the shortcomings of the book.
Continue reading

Categories: Statistical Inference as Severe Testing–Review | Tags: | 24 Comments

R. A. Fisher: How an Outsider Revolutionized Statistics (Aris Spanos)



This is a belated birthday post for R.A. Fisher (17 February, 1890-29 July, 1962)–it’s a guest post from earlier on this blog by Aris Spanos. 

Happy belated birthday to R.A. Fisher!

‘R. A. Fisher: How an Outsider Revolutionized Statistics’

by Aris Spanos

Few statisticians will dispute that R. A. Fisher (February 17, 1890 – July 29, 1962) is the father of modern statistics; see Savage (1976), Rao (1992). Inspired by William Gosset’s (1908) paper on the Student’s t finite sampling distribution, he recast statistics into the modern model-based induction in a series of papers in the early 1920s. He put forward a theory of optimal estimation based on the method of maximum likelihood that has changed only marginally over the last century. His significance testing, spearheaded by the p-value, provided the basis for the Neyman-Pearson theory of optimal testing in the early 1930s. According to Hald (1998) Continue reading

Categories: Fisher, phil/history of stat, Spanos | 2 Comments

Bad Statistics is Their Product: Fighting Fire With Fire (ii)

Mayo fights fire w/ fire

I. Doubt is Their Product is the title of a (2008) book by David Michaels, Assistant Secretary for OSHA from 2009-2017. I first mentioned it on this blog back in 2011 (“Will the Real Junk Science Please Stand Up?”). The expression is from a statement by a cigarette executive (“doubt is our product”), and the book’s thesis is explained in its subtitle: How Industry’s Assault on Science Threatens Your Health. Imagine you have just picked up a book, published in 2020: Bad Statistics is Their Product. Is the author writing about how exaggerating bad statistics may serve in the interest of denying well-established risks? [Interpretation A]. Or perhaps she’s writing on how exaggerating bad statistics serves the interest of denying well-established statistical methods? [Interpretation B]. Both may result in distorting science and even in dismantling public health safeguards–especially if made the basis of evidence policies in agencies. A responsible philosopher of statistics should care. Continue reading

Categories: ASA Guide to P-values, Error Statistics, P-values, replication research, slides | 33 Comments

My paper, “P values on Trial” is out in Harvard Data Science Review


My new paper, “P Values on Trial: Selective Reporting of (Best Practice Guides Against) Selective Reporting” is out in Harvard Data Science Review (HDSR). HDSR describes itself as A Microscopic, Telescopic, and Kaleidoscopic View of Data Science. The editor-in-chief is Xiao-Li Meng, a statistician at Harvard. He writes a short blurb on each article in his opening editorial of the issue. Continue reading

Categories: multiple testing, P-values, significance tests, Statistics | 29 Comments

S. Senn: “Error point: The importance of knowing how much you don’t know” (guest post)


Stephen Senn
Consultant Statistician

‘The term “point estimation” made Fisher nervous, because he associated it with estimation without regard to accuracy, which he regarded as ridiculous.’ Jimmy Savage [1, p. 453] 

First things second

The classic text by David Cox and David Hinkley, Theoretical Statistics (1974), has two extremely interesting features as regards estimation. The first is in the form of an indirect, implicit message; the second is explicit. Both teach that point estimation is far from being an obvious goal of statistical inference. The indirect message is that the chapter on point estimation (chapter 8) comes after that on interval estimation (chapter 7). This may puzzle the reader, who may anticipate that the complications of interval estimation would be handled after the apparently simpler point estimation rather than before. However, with the start of chapter 8, the reasoning is made clear. Cox and Hinkley state: Continue reading

Categories: Fisher, randomization, Stephen Senn | Tags: | 7 Comments

Aris Spanos Reviews Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars

A. Spanos

Aris Spanos was asked to review my Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (CUP, 2018), but he was to combine it with a review of the re-issue of Ian Hacking’s classic Logic of Statistical Inference. The journal is OEconomia: History, Methodology, Philosophy. Below are excerpts from his discussion of my book (pp. 843-860). I will jump past the Hacking review, and occasionally excerpt for length. To read his full article go to external journal pdf or stable internal blog pdf. Continue reading

Categories: Spanos, Statistical Inference as Severe Testing | Leave a comment

The NAS fixes its (main) mistake in defining P-values!

Mayo new elbow

(reasonably) satisfied

Remember when I wrote to the National Academy of Science (NAS) in September pointing out mistaken definitions of P-values in their document on Reproducibility and Replicability in Science? (see my 9/30/19 post). I’d given up on their taking any action, but yesterday I received a letter from the NAS Senior Program officer:

Dear Dr. Mayo,

I am writing to let you know that the Reproducibility and Replicability in Science report has been updated in response to the issues that you have raised.
Two footnotes, on pages 35 and 221, highlight the changes. The updated report is available from the following link: NEW 2020 NAS DOC

Thank you for taking the time to reach out to me and to Dr. Fineberg and letting us know about your concerns.
With kind regards and wishes of a happy 2020,
Jenny Heimberg
Jennifer Heimberg, Ph.D.
Senior Program Officer

The National Academies of Sciences, Engineering, and Medicine

Continue reading

Categories: NAS, P-values | 2 Comments

Midnight With Birnbaum (Happy New Year 2019)!

Just as in the past 8 years since I’ve been blogging, I revisit that spot in the road at 9 p.m., just outside the Elbar Room, look to get into a strange-looking taxi, to head to “Midnight With Birnbaum”. (The pic on the left is the only blurry image I have of the club I’m taken to.) I wonder if the car will come for me this year, as I wait out in the cold, now that Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (SIST 2018) has been out over a year. SIST doesn’t rehearse the argument from my Birnbaum article, but there’s much in it that I’d like to discuss with him. The (Strong) Likelihood Principle–whether or not it is named–remains at the heart of many of the criticisms of Neyman-Pearson (N-P) statistics (and cognate methods). 2019 was the 61st birthday of Cox’s “weighing machine” example, which was the basis of Birnbaum’s attempted proof. Yet as Birnbaum insisted, the “confidence concept” is the “one rock in a shifting scene” of statistical foundations, insofar as there’s interest in controlling the frequency of erroneous interpretations of data. (See my rejoinder.) Birnbaum bemoaned the lack of an explicit evidential interpretation of N-P methods. Maybe in 2020? Anyway, the cab is finally here…the rest is live. Happy New Year! Continue reading

Categories: Birnbaum Brakes, strong likelihood principle | Tags: , , , | Leave a comment

A Perfect Time to Binge Read the (Strong) Likelihood Principle

An essential component of inference based on familiar frequentist notions (p-values, significance and confidence levels) is the relevant sampling distribution (hence the term sampling theory, or my preferred error statistics, since we get error probabilities from the sampling distribution). This feature results in violations of a principle known as the strong likelihood principle (SLP). Roughly stated, the SLP asserts that all the evidential import of the data (for parametric inference within a model) resides in the likelihoods. If accepted, it would render error probabilities irrelevant post data. Continue reading
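The SLP conflict can be seen in the classic binomial versus negative binomial comparison (a minimal sketch of my own, not part of the post above; the numbers are hypothetical): two experiments each end with 9 successes and 3 failures, so their likelihood functions for θ are proportional, yet they yield different p-values for testing θ = 0.5 because their sampling distributions differ.

```python
from math import comb

# Two hypothetical experiments, each ending with 9 successes and 3 failures,
# testing H0: theta = 0.5 against theta > 0.5.  The likelihoods are
# proportional, so the SLP says the evidential import is the same; the
# p-values differ because the sampling distributions differ.

# Design 1: binomial -- fix n = 12 trials, observe X = 9 successes.
# p-value = P(X >= 9 | theta = 0.5)
p_binomial = sum(comb(12, k) for k in range(9, 13)) / 2**12

# Design 2: negative binomial -- sample until r = 3 failures; here the 3rd
# failure arrived on trial 12.  The 3rd failure falls on trial 12 or later
# iff the first 11 trials contain at most 2 failures, so
# p-value = P(failures in 11 trials <= 2 | theta = 0.5)
p_negbinomial = sum(comb(11, k) for k in range(0, 3)) / 2**11

print(f"binomial design:          p = {p_binomial:.4f}")    # 0.0730
print(f"negative binomial design: p = {p_negbinomial:.4f}")  # 0.0327
```

At the conventional 0.05 cutoff the two designs even disagree about statistical significance, which is exactly why error statisticians take the sampling rule to matter post data.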

Categories: Birnbaum, Birnbaum Brakes, law of likelihood | 7 Comments

61 Years of Cox’s (1958) Chestnut: Excerpt from Excursion 3 Tour II (Mayo 2018, CUP)


2018 marked 60 years since the famous weighing machine example from Sir David Cox (1958)[1]; it is now 61. It’s one of the “chestnuts” in the exhibits of “chestnuts and howlers” in Excursion 3 (Tour II) of my (still) new book Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (SIST, 2018). It’s especially relevant to take this up now, just before we leave 2019, for reasons that will be revealed over the next day or two. For a sneak preview of those reasons, see the “note to the reader” at the end of this post. So, let’s go back to it, with an excerpt from SIST (pp. 170-173). Continue reading
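For readers who want the gist of the chestnut before turning to the excerpt, here is a minimal simulation (my own sketch with made-up σ values, not from SIST): a fair coin decides whether a single measurement of μ comes from a precise or an imprecise instrument, and the conditionality point is that the relevant error probabilities are those of the instrument actually used, not an unconditional average over both.

```python
import random

random.seed(0)
SIGMAS = {"precise": 1.0, "imprecise": 10.0}  # hypothetical instrument precisions
MU, Z = 0.0, 1.96                             # true mean; 95% normal quantile

counts = {s: 0 for s in SIGMAS}
covered = {s: 0 for s in SIGMAS}

for _ in range(100_000):
    scale = random.choice(list(SIGMAS))       # fair coin picks the instrument
    x = random.gauss(MU, SIGMAS[scale])
    counts[scale] += 1
    # Conditional interval: x +/- 1.96 * sigma of the instrument actually used
    if abs(x - MU) <= Z * SIGMAS[scale]:
        covered[scale] += 1

for scale in SIGMAS:
    print(f"{scale:9s}: conditional coverage = {covered[scale] / counts[scale]:.3f}")
# Each branch covers approximately 0.95: conditioning on the instrument used
# gives the relevant error probability for the measurement in hand.
```

An interval calibrated only unconditionally would instead over-cover badly on the precise instrument and under-cover on the imprecise one, which is the force of Cox’s example.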

Categories: Birnbaum, Statistical Inference as Severe Testing, strong likelihood principle | Leave a comment

Posts of Christmas Past (1): 13 howlers of significance tests (and how to avoid them)


I’m reblogging a post from Christmas past, exactly 7 years ago. Guess which well-worn criticism of statistical significance tests I gave as the number 1 (of 13) howlers haunting us back in 2012 (all of which are put to rest in Mayo and Spanos 2011)? Yes, it’s the frightening allegation that statistical significance tests forbid using any background knowledge! The researcher is imagined to start with a “blank slate” in each inquiry (no memories of fallacies past), and then unthinkingly apply a purely formal, automatic, accept-reject machine. What’s newly frightening (in 2019) is the credulity with which this apparition is now being met (by some). I make some new remarks below the post from Christmas past: Continue reading

Categories: memory lane, significance tests, Statistics | Tags: | Leave a comment

“Les stats, c’est moi”: We take that step here! (Adopt our fav word or phil stat!)(iii)

les stats, c’est moi

When it comes to the statistics wars, leaders of rival tribes sometimes sound as if they believed “les stats, c’est moi” [1]. So, rather than say they would like to supplement some well-known tenets (e.g., “a statistically significant effect may not be substantively important”) with a new rule that advances their particular preferred language or statistical philosophy, they may simply blurt out: “we take that step here!” followed by whatever rule of language or statistical philosophy they happen to prefer (as if they have just added the new rule to the existing, uncontested tenets). Karen Kafadar, in her last official (December) report as President of the American Statistical Association (ASA), expresses her determination to call out this problem at the ASA itself. (She raised it first in her June article, discussed in my last post.) Continue reading

Categories: ASA Guide to P-values | 84 Comments

P-Value Statements and Their Unintended(?) Consequences: The June 2019 ASA President’s Corner (b)


Mayo writing to Kafadar

I never met Karen Kafadar, the 2019 President of the American Statistical Association (ASA), but the other day I wrote to her in response to a call in her extremely interesting June 2019 President’s Corner: “Statistics and Unintended Consequences“:

  • “I welcome your suggestions for how we can communicate the importance of statistical inference and the proper interpretation of p-values to our scientific partners and science journal editors in a way they will understand and appreciate and can use with confidence and comfort—before they change their policies and abandon statistics altogether.”

I only recently came across her call, and I will share my letter below. First, here are some excerpts from her June President’s Corner (her December report is due any day). Continue reading

Categories: ASA Guide to P-values, Bayesian/frequentist, P-values | 3 Comments

A. Saltelli (Guest post): What can we learn from the debate on statistical significance?

Professor Andrea Saltelli
Centre for the Study of the Sciences and the Humanities (SVT), University of Bergen (UIB, Norway),
Open Evidence Research, Universitat Oberta de Catalunya (UOC), Barcelona

What can we learn from the debate on statistical significance?

The statistical community is in the midst of a crisis whose latest convulsion is a petition to abolish the concept of significance. The problem is perhaps neither with significance, nor with statistics, but with the inconsiderate way we use numbers, and with our present approach to quantification. Unless the crisis is resolved, there will be a loss of consensus in scientific arguments, with a corresponding decline of public trust in the findings of science. Continue reading

Categories: Error Statistics | 11 Comments

The ASA’s P-value Project: Why it’s Doing More Harm than Good (cont from 11/4/19)


cure by committee

Everything is impeach and remove these days! Should that hold also for the concept of statistical significance and P-value thresholds? There’s an active campaign that says yes, but I aver it is doing more harm than good. In my last post, I said I would count the ways it is detrimental until I became “too disconsolate to continue”. There I showed why the new movement, launched by the Executive Director of the ASA (American Statistical Association), Ronald Wasserstein (in what I dub ASA II (note)), is self-defeating: it instantiates and encourages the human-all-too-human tendency to exploit researcher flexibility, rewards, and openings for bias in research (the F, R & B Hypothesis). That was reason #1. Just reviewing it already fills me with such dismay that I fear I will become too disconsolate to continue before even getting to reason #2. So let me just quickly jot down reasons #2, 3, 4, and 5 (without full arguments) before I expire. Continue reading

Categories: ASA Guide to P-values | 7 Comments

On Some Self-Defeating Aspects of the ASA’s (2019) Recommendations on Statistical Significance Tests (ii)


“Before we stood on the edge of the precipice, now we have taken a great step forward”


What’s self-defeating about pursuing statistical reforms in the manner taken by the American Statistical Association (ASA) in 2019? In case you’re not up on the latest in the significance testing wars, the 2016 ASA Statement on P-Values and Statistical Significance, ASA I, arguably, was a reasonably consensual statement on the need to avoid some well-known abuses of P-values–notably, if you compute P-values while ignoring selective reporting, multiple testing, or stopping when the data look good, the computed P-value will be invalid (Principle 4, ASA I). But then Ron Wasserstein, executive director of the ASA, and co-editors decided they weren’t happy with their own 2016 statement because it “stopped just short of recommending that declarations of ‘statistical significance’ be abandoned” altogether. In their new statement–ASA II (note)–they announced: “We take that step here….Statistically significant–don’t say it and don’t use it”.

Why do I say it is a mis-take to have taken the supposed next “great step forward”? Why do I count it as unsuccessful as a piece of statistical science policy? In what ways does it make the situation worse? Let me count the ways. The first is in this post. Others will come in following posts, until I become too disconsolate to continue.[i] Continue reading

Categories: P-values, stat wars and their casualties, statistical significance tests | 14 Comments
