Author Archives: Mayo

The ASA’s P-value Project: Why it’s Doing More Harm than Good (cont from 11/4/19)


cure by committee

Everything is impeach and remove these days! Should that hold also for the concept of statistical significance and P-value thresholds? There’s an active campaign that says yes, but I aver it is doing more harm than good. In my last post, I said I would count the ways it is detrimental until I became “too disconsolate to continue”. There I showed why the new movement, launched by the Executive Director of the ASA (American Statistical Association), Ronald Wasserstein (in what I dub ASA II), is self-defeating: it instantiates and encourages the human-all-too-human tendency to exploit researcher flexibility, rewards, and openings for bias in research (the F, R & B hypothesis). That was reason #1. Just reviewing it already fills me with such dismay that I fear I will become too disconsolate to continue before even getting to reason #2. So let me just quickly jot down reasons #2, 3, 4, and 5 (without full arguments) before I expire.

[I thought that with my book Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (2018, CUP), that I had said pretty much all I cared to say on this topic (and by and large, this is true), but almost as soon as it appeared in print just around a year ago, things got very strange.]

But wait. Someone might object that I’m the one doing more harm than good by linking the ASA (The American Statistical Association) to Wasserstein’s campaign to get publishers, journalists, authors and the general public to buy into the recommendations of ASA II. “Shhhhh!” some counsel, “don’t give it more attention; we want people to look away”. Nothing to see? I don’t think so. I will discuss this point in this post in PART II, as soon as I sketch my list of reasons #2-5.

Before starting, let me remind readers that what I abbreviate as ASA II only refers to those portions of the 2019 editorial by Wasserstein, Schirm, and Lazar that allude to their general recommendations, not their summaries of contributed papers in the issue of TAS.


2 Decriminalize theft to end robbery. The key arguments for impeaching and removing statistical significance levels and P-value thresholds commit fallacies of the “cut off your nose to spite your face” variety. For example, we should ban P-value thresholds because they cause biased selection and data dredging. Discard P-value thresholds and P-hacking disappears! Or so it is argued. Even were this true, it would be like arguing we should decriminalize robbery since then the crime of robbery would disappear (an ends-justify-the-means fallacy)! But it is also not true that biased reporting goes away if you have no thresholds. Faced with unwelcome nonsignificant results, eager researchers are still led to massage, spin, and data dredge–only now it is much harder to directly hold them accountable. For the argument, see my “P-value Thresholds: Forfeit at Your Peril” (2019).

3 Straw men and women fallacies. ASA I and II do more harm than good by presenting oversimple caricatures of the tests. Even ASA I excludes a consideration of alternatives, error probabilities, and power[1]. At the same time, it contrasts these threadbare “nil null” hypothesis tests with confidence intervals (CIs)–never minding that the latter employ alternatives. No wonder CIs look better, but such a comparison is unfair. (Neyman developed confidence intervals as inversions of tests at the same time he was developing hypothesis tests with alternatives, in 1930. Using only significance tests, you could recover the lower (and upper) 1-α CI bounds if you wanted, by asking for the hypotheses that the data are statistically significantly greater (smaller) than, at level c, using the usual 2-sided computation.)
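The test–CI duality in that parenthetical can be checked numerically. The sketch below is mine, not from the post; it assumes a one-sample z-test with known σ, and all names and numbers are hypothetical. It verifies that the textbook 95% CI endpoints are precisely the points where the two-sided P-value equals α, so the interval collects exactly the hypothesized values the test fails to reject at that level:

```python
from statistics import NormalDist

def two_sided_p(xbar, mu0, sigma, n):
    """Attained significance level of the two-sided z-test of H0: mu = mu0."""
    z = (xbar - mu0) / (sigma / n ** 0.5)
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Hypothetical data summary: mean 10.0 from n = 100 observations, known sigma = 2.0
xbar, sigma, n, alpha = 10.0, 2.0, 100, 0.05

# Textbook (1 - alpha) CI: xbar +/- z_{alpha/2} * sigma / sqrt(n)
z_crit = NormalDist().inv_cdf(1 - alpha / 2)
half_width = z_crit * sigma / n ** 0.5
lower, upper = xbar - half_width, xbar + half_width

# At each CI bound the two-sided p-value is exactly alpha;
# mu0 values strictly inside the interval yield p > alpha (not rejected).
print(round(two_sided_p(xbar, lower, sigma, n), 6))  # 0.05
print(round(two_sided_p(xbar, upper, sigma, n), 6))  # 0.05
```

This is just Neyman’s inversion in miniature: collecting the μ0 values that are not statistically significantly different from the data at level α reproduces the interval estimate.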

In ASA II, we learn that “no p-value can reveal the … presence … of an association or effect” (at odds with principle 1 of ASA I). That could be true only in the sense that no formal statistical quantity alone could reveal the presence of an association. But in a realistic setting, small p-values surely do reveal the presence of effects. Yes, there are assumptions, but significance tests are prime tools to probe them. We hear of “the seductive certainty falsely promised by statistical significance”, and are told that “a declaration of statistical significance is the antithesis of thoughtfulness”. (How an account that never issues an inference without an associated error probability can be promising certainty is unexplained. The second allegation ignores how thresholds are rendered meaningful by choosing them to reflect background information and a host of theoretical and epistemic considerations–all more straw.) The requirement in philosophy of a reasonably generous interpretation of what you’re criticizing isn’t a call for being kind or gentle; it’s that otherwise your criticism commits straw men (and women) fallacies, and thus fails.

4 Alternatives to significance testing are given a pass. You will not find any appraisal of the alternative methods recommended to replace significance tests for their intended tasks. Although many of the “alternative measures of evidence” listed in ASA I and II–likelihood ratios, Bayes factors (subjective, default, empirical), posterior predictive values (in diagnostic screening)–have been critically evaluated by leading statisticians, no word of criticism is heard here. Here’s an exercise: run down the list of 6 “principles” of ASA I, applying them to any of the alternative measures of evidence on offer. Take, for example, Bayes factors. I claim that they do worse than significance tests, even without modifications.[2]

5 Assumes probabilism. Any fair (non question-begging) comparison of statistical methods should recognize different roles probability may play in inference. The role of probability in inference by way of statistical falsification is quite different from using probability to quantify degrees of confirmation, support, plausibility or belief in a statistical hypothesis or model–or comparative measures of these.  I abbreviate the former as error statistical methods, the latter, as variants on probabilism. Use whatever terms you like. Statistical significance tests are part of a panoply of methods where probability arises to assess and control misleading interpretations of data.

Error probabilities quantify the capabilities of a method to detect the ways a claim (hypothesis, model or other) may be false, or specifiably flawed. The basic principle of testing is minimalist: there is evidence for a claim only to the extent it has been subjected to, and passes, a test that had at least a reasonable probability of having discerned how the claim may be false. (For a more detailed exposition, see Mayo 2018, or excerpts from SIST on this blog).
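Mayo’s severity assessment (developed in SIST) makes this minimal principle quantitative. Here is a rough sketch under my reading of the simplest one-sample z case–my illustration with hypothetical numbers, not an excerpt from the book: having inferred μ > μ1 from an observed mean, severity asks how probable a less impressive result would have been were μ only μ1.

```python
from statistics import NormalDist

def severity_mu_greater(xbar_obs, mu1, sigma, n):
    """Severity for the claim mu > mu1, given observed mean xbar_obs
    (one-sample z case, sigma known): the probability the test would have
    produced a result LESS indicative of mu > mu1, were mu only mu1."""
    z = (xbar_obs - mu1) / (sigma / n ** 0.5)
    return NormalDist().cdf(z)

# Hypothetical: testing mu <= 0; observed mean 0.4, sigma = 1, n = 25 (SE = 0.2)
xbar, sigma, n = 0.4, 1.0, 25
print(round(severity_mu_greater(xbar, 0.0, sigma, n), 3))  # 0.977: mu > 0 passes severely
print(round(severity_mu_greater(xbar, 0.3, sigma, n), 3))  # 0.691: mu > 0.3 far less so
```

The same data thus warrant some discrepancy claims well and others poorly–unlike a bare “significant/nonsignificant” report, but very much in the error-statistical spirit of assessing what the method was capable of detecting.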

Reason #5, then, is that “measures of evidence” in both ASA I and II beg this key question (about the role of probability in statistical inference) in favor of probabilisms–usually comparative, as with Bayes factors. If the recommendation in ASA II to remove statistical thresholds is taken seriously, there are no tests and no statistical falsification. Recall what Ioannidis said in objecting to “don’t say significance”, cited in my last post:

Potential for falsification is a prerequisite for science. Fields that obstinately resist refutation can hide behind the abolition of statistical significance but risk becoming self-ostracized from the remit of science. (Ioannidis 2019)

“Self-ostracizing” is a great term. ASA should ostracize self-ostracizing. This takes me back to the question I promised to come back to: is it a mistake to see the ASA as entangled in the campaign to ban use of the “S-word”, and kill P-value thresholds?


Those who say it is a mistake point to the fact that what I’m abbreviating as ASA II did not result from the kind of process that led to ASA I, with extended meetings of statisticians followed by a Board vote[3]. I don’t think that suffices. Granted, the “P-value Project” (as it is called at the ASA) is only a small part of the ASA, led by Executive Director Wasserstein. Nevertheless, as indicated on the ASA website, “As executive director, Wasserstein also is an official ASA spokesperson.” In his active campaign to get journals, societies, practitioners, and the general public to accept the recommendations in ASA II, he wears his executive director hat, does he not?

As soon as I saw the 2019 document, I queried Wasserstein as to the relationship between ASA I and II. It was never clarified. I hope now that it will be, but it will not suffice to note that it never came to a Board vote. The campaign to editors to revise their guidelines for authors, taking account of both ASA I and II, should also be addressed. Keeping things blurred gives plausible deniability, but at the cost of increasing confusion and an “anything goes” attitude.

ASA II clearly presents itself as a continuation of ASA I (again, ASA II refers just to the portion of the editorial encompassing the general recommendation: don’t say significance or significant, oust P-value thresholds). It begins with a review of 4 of the 6 principles from ASA I, even though they are stated in more extreme terms than in ASA I. (As I point out in my blog, the result is to give us principles that are in tension with the original 6.) Next, it goes on to say:

The ASA Statement on P-Values and Statistical Significance started moving us toward this world…. The ASA Statement on P-Values and Statistical Significance stopped just short of recommending that declarations of “statistical significance” be abandoned. We take that step here. … it is time to stop using the term “statistically significant” entirely. Nor should variants such as ‘significantly different,’ ‘p < 0.05,’ and ‘nonsignificant’ survive…

Undoubtedly, there are signs in ASA I that they were on the verge of this step, notably in the last section: “In view of the prevalent misuses of and misconceptions concerning p-values, some statisticians prefer to supplement or even replace p-values with other approaches … likelihood ratios or Bayes factors” (p. 132).

A letter to the editor on ASA I was quite prescient. It was written by Ionides, Giessing, Ritov and Page (link):

Mixed with the sensible advice on how to use p-values comes a message that is being interpreted across academia, the business world, and policy communities, as, “Avoid p-values. They don’t tell you what you want to know.” … The ASA’s statement, while warning statistical practitioners against these abuses, simultaneously warns practitioners away from legitimate use of the frequentist approach to statistical inference.

What do you think? Please share your comments on this blogpost.

[1] “To keep the statement reasonably simple, we did not address alternative hypotheses, error types, or power (among other things)” (ASA I)

[2] The ASA 2016 Guide’s Six Principles

  1. P-values can indicate how incompatible the data are with a specified statistical model.
  2. P-values do not measure the probability that the studied hypothesis is true, or the probability that the data were produced by random chance alone.
  3. Scientific conclusions and business or policy decisions should not be based only on whether a p-value passes a specific threshold.
  4. Proper inference requires full reporting and transparency. P-values and related analyses should not be reported selectively. Conducting multiple analyses of the data and reporting only those with certain p-values (typically those passing a significance threshold) renders the reported p-values essentially uninterpretable.
  5. A p-value, or statistical significance, does not measure the size of an effect or the importance of a result.
  6. By itself, a p-value does not provide a good measure of evidence regarding a model or hypothesis.

[3] I am grateful to Ron Wasserstein for inviting me to be a “philosophical observer” of this historical project (I attended just one day).


Blog posts on ASA II:

  • June 17, 2019: “The 2019 ASA Guide to P-values and Statistical Significance: Don’t Say What You Don’t Mean” (Some Recommendations)(ii)
  • July 12, 2019: B. Haig: The ASA’s 2019 update on P-values and significance (ASA II)(Guest Post)
  • July 19, 2019: The NEJM Issues New Guidelines on Statistical Reporting: Is the ASA P-Value Project Backfiring? (i)
  • September 19, 2019: (Excerpts from) ‘P-Value Thresholds: Forfeit at Your Peril’ (free access). The article by Hardwicke and Ioannidis (2019), and the editorials by Gelman and by me are linked on this post.
  • Nov 4, 2019. On some Self-defeating aspects of the ASA’s 2019 recommendations of statistical significance tests


  • Link to my published comment on ASA I, “Don’t Throw Out the Error Control Baby With the Bad Statistics Bathwater” is here.


Ioannidis J. (2019). The importance of predefined rules and prespecified statistical analyses: do not abandon significance. JAMA 321:2067‐2068. (pdf)

Ionides, E., Giessing, A., Ritov, Y.  & Page, S. (2017). Response to the ASA’s Statement on p-Values: Context, Process, and Purpose, The American Statistician, 71:1, 88-89. (pdf)

Mayo, D. G. (2018). Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (SIST). Cambridge: Cambridge University Press.

Mayo, D. G. (2019), P‐value thresholds: Forfeit at your peril. Eur J Clin Invest, 49: e13170. (pdf) doi:10.1111/eci.13170

Wasserstein, R. & Lazar, N. (2016), “The ASA’s Statement on p-Values: Context, Process, and Purpose”, The American Statistician 70(2): 129-133.

Wasserstein, R., Schirm, A. and Lazar, N. (2019) “Moving to a World Beyond ‘p < 0.05’”, The American Statistician 73(S1): 1-19: Editorial. (ASA II)(pdf)

Categories: ASA Guide to P-values | 2 Comments

On Some Self-Defeating Aspects of the ASA’s (2019) Recommendations on Statistical Significance Tests (ii)


“Before we stood on the edge of the precipice, now we have taken a great step forward”


What’s self-defeating about pursuing statistical reforms in the manner taken by the American Statistical Association (ASA) in 2019? In case you’re not up on the latest in significance testing wars, the 2016 ASA Statement on P-Values and Statistical Significance, ASA I, arguably, was a reasonably consensual statement on the need to avoid some well-known abuses of P-values–notably if you compute P-values, ignoring selective reporting, multiple testing, or stopping when the data look good, the computed P-value will be invalid. (Principle 4, ASA I) But then Ron Wasserstein, executive director of the ASA, and co-editors, decided they weren’t happy with their own 2016 statement because it “stopped just short of recommending that declarations of ‘statistical significance’ be abandoned” altogether. In their new statement–ASA II–they announced: “We take that step here….Statistically significant –don’t say it and don’t use it”.
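Principle 4’s warning is easy to verify in simulation. The sketch below is mine, not from the post (the function names and numbers are hypothetical); it assumes a one-sample z-test under a true null hypothesis, and compares an honest single test against “best of 20 tries” selective reporting:

```python
import random
from statistics import NormalDist

random.seed(1)  # reproducible

def z_test_p(sample, mu0=0.0, sigma=1.0):
    """Two-sided z-test p-value for H0: mu = mu0, with sigma known."""
    n = len(sample)
    z = (sum(sample) / n - mu0) / (sigma / n ** 0.5)
    return 2 * (1 - NormalDist().cdf(abs(z)))

def best_of_k(k=20, n=30):
    """Selective reporting: run k independent null experiments, keep only the smallest p."""
    return min(z_test_p([random.gauss(0, 1) for _ in range(n)]) for _ in range(k))

trials = 2000
honest = sum(z_test_p([random.gauss(0, 1) for _ in range(30)]) < 0.05
             for _ in range(trials)) / trials
hacked = sum(best_of_k() < 0.05 for _ in range(trials)) / trials

print(f"honest rejection rate under H0:     {honest:.3f}")  # near the nominal 0.05
print(f"best-of-20 rejection rate under H0: {hacked:.3f}")  # roughly 1 - 0.95**20 = 0.64
```

Each computed p-value is individually correct; what is invalid is reporting the minimum as if it were a single preplanned test–exactly the abuse Principle 4 flags, and one that a predesignated threshold at least makes auditable.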

Why do I say it is a mis-take to have taken the supposed next “great step forward”? Why do I count it as unsuccessful as a piece of statistical science policy? In what ways does it make the situation worse? Let me count the ways. The first is in this post. Others will come in following posts, until I become too disconsolate to continue.[i]

In this exercise, I imagine I am someone who eagerly wants the recommendations in ASA II to be accepted by authors, journals, agencies, and the general public. In essence the recommendations are: you may report the P-value associated with a test statistic d–a measure of distance or incompatibility between data and a reference hypothesis– but don’t say that what you’re measuring are the attained statistical significance levels associated with d. (Even though that is the mathematical definition of what is being measured.) Do not predesignate a P-value to be used as a threshold for inferring evidence of a discrepancy or incompatibility–or if you do, never use this threshold in interpreting data.

“Whether a p-value passes any arbitrary threshold should not be considered at all” in interpreting data. (ASA II)

This holds, even if you also supply an assessment of indicated population effect size or discrepancy (via confidence intervals, equivalence tests, severity assessments). The same goes for other thresholds based on confidence intervals or Bayes factors.

I imagine myself a member of the ASA II team setting out the recommendation for ASA II, weighing if it’s a good idea. We in this leadership group know there’s serious disagreement about our recommendations in ASA II, and that ASA II could not by any stretch be considered a consensus statement. Indeed even among over 40 papers explicitly invited to discuss “a world beyond P < 0.05”, we (unfortunately) wound up with proposals in radical disagreement. We [ASA II authors] observe “At times in this editorial and the papers you’ll hear deep dissonance, the echoes of ‘statistics wars’ still simmering today (Mayo 2018).”

(Aside: Hey, they are citing my book!)

So we agree there is disagreement. We also agree that a large part of the blame for lack of replication in many fields may be traced to bad behavior encouraged by the reward structure: Incentives to publish surprising and novel studies, coupled with an overly flexible methodology, where many choice points in the “forking paths” (Gelman and Loken 2014) between data and hypotheses open the door into “questionable research practices” (QRPs). Call this the flexibility, rewards, and bias F, R & B hypothesis. On this hypothesis, the pressure to publish, to be accepted, is so great as to seduce even researchers who are well aware of the pitfalls to capitalize on selection biases (even if it’s only subliminal).

As a member of the team, I imagine reasoning as follows:

Either the recommendations in ASA II will be followed or they won’t. If the latter, then it cannot be considered successful. Now suppose the former, that people do take it up to a significant extent. The F, R & B hypothesis predicts that the imprimatur of the ASA will encourage researchers to adopt, or at least act in accordance with, ASA II recommendations. [ii] The trouble is that there will be no grounds for thinking that any apparent conversion was based on good reasons, or, at any rate, we will be unable to distinguish following the ASA II stipulations on grounds of evidence from following them because the ASA said so. Therefore even in the former situation, where the new stipulations are taken up to a significant degree, with lots of apparent converts, ASA II could not count as a success. Therefore, in either case, what had seemed to us a great step forward, is unsuccessful. So we shouldn’t put it forward.

“Before we were with our backs against the wall, now we have done a 180 degree turn”

A further worry occurs to me in my imaginary weighing of whether our ASA team should go ahead with publishing ASA II. It is this: many of the apparent converts to ASA II might well have come to accept its stipulations on grounds of good reasons, after carrying out a reasoned comparison of statistical significance tests with leading alternative methods, as regards its intended task (distinguishing real effects from random or spurious ones)–if the ASA had only seen its role as facilitating the debate between alternative methods, and as offering a forum for airing contrasting arguments held by ASA members. By marching ahead to urge journals, authors, and agencies to comply with ASA II, we will never know.

Not only will we not know how much of any observed compliance is due to finding its stipulations warranted, as opposed to merely confirming the truth of the F, R & B hypothesis–not to mention people’s fear of being on the wrong side of the ASA’s preferences. It’s worse. The tendency to the human weakness of instantiating the F, R & B hypothesis will be strengthened. Why? Because even in the face of acknowledged professional disagreement of a fairly radical sort, and even as we write “the ideas in this editorial are … open to debate” (ASA II), we are recommending our position be accepted without actually having that debate. In asking for compliance, we are saying, in effect, “we have been able to see it is for the better, even though we recognize there is no professional agreement on our recommendations, and even major opposition”. John Ioannidis, no stranger to criticizing statistical significance tests, wrote this note after the publication of ASA II:

Many fields of investigation … have major gaps in the ways they conduct, analyze, and report studies and lack protection from bias. Instead of trying to fix what is lacking and set better and clearer rules, one reaction is to overturn the tables and abolish any gatekeeping rules (such as removing the term statistical significance). However, potential for falsification is a prerequisite for science. Fields that obstinately resist refutation can hide behind the abolition of statistical significance but risk becoming self-ostracized from the remit of science. (Ioannidis 2019)

Therefore, to conclude with my imaginary scenario, we might imagine the ASA team recognizes that putting forward ASA II (in March 2019) is necessarily going to be unsuccessful and self-defeating, extolling the very behavior we supposedly want to eradicate. So we don’t do it. That imaginary situation, unfortunately, is not the real one we find ourselves in.

Making progress, without bad faith, in the real world needn’t be ruled out entirely. There are those, after all, who never heard of ASA II, and do not publish in journals that require obeisance to it. It’s even possible that the necessary debate and comparison of alternative tools for the job could take place after the fact. That would be welcome. None of this would diminish my first self-defeating aspect of the ASA II.

My follow-up post is now up: “The ASA’s P-value Project: Why it’s Doing More Harm than Good”.

[i] See also June 17, 2019. Here I give specific suggestions for why certain principles in ASA II need to be amended to avoid being in tension with ASA I.

[ii] “Imprimatur” means “let it be printed” in Latin. Now I am very careful to follow the context: it is not a consensus document, I make very clear. In fact, that is a key premise of my argument. But the statement that is described as (largely) consensual (ASA I) “stopped just short” of the 2019 editorial. When it first appeared, I asked Wasserstein about the relationship between the two documents (that was the topic of my June 17 post, linked in [i]). It was never made clear. It’s blurred. Is it somewhere in the document and I missed it? Increasingly, now that it’s been out long enough for people to start citing it, it is described as the latest ASA recommendations. (They are still just recommendations.) If the ASA wants to clearly distinguish the 2019 from the 2016 statement, this is the time for the authors to do it. (I only consider, as part of ASA II, those general recommendations that are given, not any of the individual papers in the special issue.)

Blog posts on ASA II:

  • June 17, 2019: “The 2019 ASA Guide to P-values and Statistical Significance: Don’t Say What You Don’t Mean” (Some Recommendations)(ii)
  • July 12, 2019: B. Haig: The ASA’s 2019 update on P-values and significance (ASA II)(Guest Post)
  • July 19, 2019: The NEJM Issues New Guidelines on Statistical Reporting: Is the ASA P-Value Project Backfiring? (i)
  • September 19, 2019: (Excerpts from) ‘P-Value Thresholds: Forfeit at Your Peril’ (free access). The article by Hardwicke and Ioannidis (2019), and the editorials by Gelman and by me are linked on this post.


  • Link to my published comment on ASA I, “Don’t Throw Out the Error Control Baby With the Bad Statistics Bathwater” is here.


Gelman, A. and Loken, E. (2014). “The Statistical Crisis in Science”. American Scientist 2: 460-5. (pdf)

Ioannidis J. (2019). The importance of predefined rules and prespecified statistical analyses: do not abandon significance. JAMA 321:2067‐2068. (pdf)

Mayo, D. G. (2018). Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (SIST). Cambridge: Cambridge University Press.

Mayo, D. G. (2019), P‐value thresholds: Forfeit at your peril. Eur J Clin Invest, 49: e13170. (pdf) doi:10.1111/eci.13170

Wasserstein, R., Schirm, A. and Lazar, N. (2019) “Moving to a World Beyond ‘p < 0.05’”, The American Statistician 73(S1): 1-19: Editorial. (online paper)(pdf)

Categories: P-values, stat wars and their casualties, statistical significance tests | 12 Comments

Exploring a new philosophy of statistics field

This article on our Summer Seminar in Philosophy of Statistics came out on Monday in Virginia Tech News Daily magazine.

October 28, 2019


From universities around the world, participants in a summer session gathered to discuss the merits of the philosophy of statistics. Co-director Deborah Mayo, left, hosted an evening for them at her home.


In the heat of a Blacksburg summer evening, the talk on Deborah Mayo’s back deck was of philosophy and statistics. Fifteen innovators in the Virginia Tech Summer Seminar in Philosophy of Statistics were contemplating the beginnings of a new field — Phil Stat.

“The overarching goal is that Phil Stat, short for the philosophy of statistics, will become a field in philosophy,”  said Mayo, one of the seminar’s co-directors and a professor emerita in the Virginia Tech Department of Philosophy.  “Today the problems about data are everywhere, as are problems about ethics and values. The justification for this new field is if you don’t understand the underpinnings of statistics, you cannot understand the consequences of certain reforms that are being proposed or adopted.”

Mayo defines Phil Stat as the philosophical and conceptual foundations of statistical inference. The idea involves the formation of judgments about the measures that define a population and the reliability of statistical relationships,  usually based on a random sampling of data. With this, Phil Stat analyzes the uses of probability in collecting, modeling, and learning from the data.

Aris Spanos, Mayo’s co-director of the seminar, said that during the past decade, many published, observation-based or experience-based research results in several disciplines within the medical and social sciences have been found not to be replicable. This has led some researchers to regard the results as untrustworthy, and several leading statisticians have been calling for reforms. Spanos said the need is pressing for a better understanding of the main sources of untrustworthy evidence and a balanced appraisal of the proposed reforms.

“We designed the seminar on the philosophy of statistics in response to these discussions to inform the participants about these debates,” said Spanos, the Wilson E. Schmidt Professor of Economics in the College of Science. “We wanted to provide them with a sufficient background in the philosophy of science and statistics to enable them to participate in these debates.”

Mayo and Spanos decided the seminar, held on Virginia Tech’s Blacksburg campus, would help advance scholarship in this new transdisciplinary area, which seminar participants could integrate into their research and teaching. In response to their call for applicants, a selection committee invited 15 of the 55 faculty, postdoctoral fellows, and senior graduate students who applied to participate.

The participants were a diverse group. They came from Auburn University, Duke University, Lehman College at City University of New York, the Ohio State University, Princeton University, Radboud University, Rutgers University, St. John’s College at the University of Oxford, Université de Montréal, the University of Amsterdam, the University of Colorado at Boulder, the University of Illinois at Urbana-Champaign, and the University of Utah. An attorney from the New Jersey Office of the Public Defender also joined their ranks.

Participants with the Dean of Science

Participants from the Virginia Tech Summer Seminar in Philosophy of Statistics included Dean Sally C. Morton (third from the right in the first row).  Deborah Mayo and Aris Spanos appear to the left behind her.


Sally C. Morton, dean of the College of Science and interim director of the Fralin Life Sciences Institute at Virginia Tech, attended one of the seminar sessions.

“The proper use of evidence in decision-making is essential to tackling the complex problems in society today,” said Morton. “The summer seminar that brought together the fields of statistics and philosophy demonstrated the power of using a transdisciplinary approach to give the attendees an expansive view of the challenges we face. I was delighted to see the deliberate inclusion of students and early-career researchers in the seminar.”

For two weeks, the group gathered with special guest speakers, both in person and through an online meeting platform. Presenting at the seminar were Andrew Gelman, a professor of statistics from Columbia University; Richard Morey, a reader for the School of Psychology at Cardiff University; Nathan Schachtman, a lawyer who specializes in scientific and medico-legal issues; and Stephen Senn, a consultant statistician from Edinburgh, Scotland.

The seminar, largely funded by Mayo and her husband, George Chatfield, through their Fund for Experimental Reasoning, Reliability, Objectivity, and Rationality of Science, also benefited from a number of sponsors. These included the College of Liberal Arts and Human Sciences, the College of Science, the Data and Decisions Destination Area, the Department of Philosophy, and the Department of Economics.

The summer seminar is not the first collaboration between Mayo and Spanos. In 2010, they coedited the book “Error and Inference: Recent Exchanges on Experimental Reasoning, Reliability, and the Objectivity and Rationality of Science.” Together they have also published six papers and book chapters in such publications as the British Journal for the Philosophy of Science, Synthese, and Philosophy of Science. The contributions stemmed from a Virginia Tech conference, ERROR06, which included the statistician Sir David Cox and the philosophers Alan Chalmers, Clark Glymour, and Alan Musgrave.

More recently, Mayo authored the book “Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars,” published by Cambridge University Press.

The directors and participants will continue to propel Phil Stat beyond the summer experience through conferences, online publications, and an upcoming book, “Probability Theory and Statistical Inference: Modeling with Observational Data,” slated for publication by Cambridge University Press. As a group, they occasionally meet online and maintain a blog together. And they plan to present sessions at conferences.

“What we initiated here at Virginia Tech,” Mayo said, “will have a big impact not just on the way we think about the philosophy of science, but on how both it and the philosophy of knowledge are taught and integrated.”

-Written by Leslie King

© 2019 Virginia Polytechnic Institute and State University. All rights reserved.


Categories: Philosophy of Statistics, Summer Seminar in PhilStat | 2 Comments

The First Eye-Opener: Error Probing Tools vs Logics of Evidence (Excursion 1 Tour II)

1.4, 1.5

In Tour II of this first Excursion of Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (SIST, 2018, CUP),  I pull back the cover on disagreements between experts charged with restoring integrity to today’s statistical practice. Some advised me to wait until later (in the book) to get to this eye-opener. Granted, the full story involves some technical issues, but after many months, I think I arrived at a way to get to the heart of things informally (with a promise of more detailed retracing of steps later on). It was too important not to reveal right away that some of the most popular “reforms” fall down on the job even with respect to our most minimal principle of evidence (you don’t have evidence for a claim if little if anything has been done to probe the ways it can be flawed).  Continue reading

Categories: Error Statistics, law of likelihood, SIST | 14 Comments

The Current State of Play in Statistical Foundations: A View From a Hot-Air Balloon


Continue to the third, and last, stop of Excursion 1 Tour I of Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (2018, CUP)–Section 1.3. It would be of interest to ponder whether (and how) the current state of play in the stat wars has shifted in just one year. I'll do so in the comments. Use that space to ask me any questions.

How can a discipline, central to science and to critical thinking, have two methodologies, two logics, two approaches that frequently give substantively different answers to the same problems? … Is complacency in the face of contradiction acceptable for a central discipline of science? (Donald Fraser 2011, p. 329)

We [statisticians] are not blameless … we have not made a concerted professional effort to provide the scientific world with a unified testing methodology. (J. Berger 2003, p. 4)

Continue reading

Categories: Statistical Inference as Severe Testing | 3 Comments

Severity: Strong vs Weak (Excursion 1 continues)


Marking one year since the appearance of my book: Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (2018, CUP), let's continue to the second stop (1.2) of Excursion 1 Tour I. It begins on p. 13 with a quote from statistician George Barnard. Assorted reflections will be given in the comments. Ask me any questions pertaining to the Tour.


  • I shall be concerned with the foundations of the subject. But in case it should be thought that this means I am not here strongly concerned with practical applications, let me say right away that confusion about the foundations of the subject is responsible, in my opinion, for much of the misuse of the statistics that one meets in fields of application such as medicine, psychology, sociology, economics, and so forth. (George Barnard 1985, p. 2)

Continue reading

Categories: Statistical Inference as Severe Testing | 5 Comments

How My Book Begins: Beyond Probabilism and Performance: Severity Requirement

This week marks one year since the general availability of my book: Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (2018, CUP). Here's how it begins (Excursion 1 Tour I (1.1)). Material from the preface is here. I will sporadically give some "one year later" reflections in the comments. I invite readers to ask me any questions pertaining to the Tour.

The journey begins… (1.1)

I’m talking about a specific, extra type of integrity that is [beyond] not lying, but bending over backwards to show how you’re maybe wrong, that you ought to have when acting as a scientist. (Feynman 1974/1985, p. 387)

It is easy to lie with statistics. Or so the cliché goes. It is also very difficult to uncover these lies without statistical methods – at least of the right kind. Self-correcting statistical methods are needed, and, with minimal technical fanfare, that's what I aim to illuminate. Since Darrell Huff wrote How to Lie with Statistics in 1954, ways of lying with statistics are so well worn as to have emerged in reverberating slogans:

  • Association is not causation.
  • Statistical significance is not substantive significance.
  • No evidence of risk is not evidence of no risk.
  • If you torture the data enough, they will confess.

Continue reading

Categories: Statistical Inference as Severe Testing, Statistics | 4 Comments

National Academies of Science: Please Correct Your Definitions of P-values

Mayo banging head

If you were on a committee to highlight issues surrounding P-values and replication, what’s the first definition you would check? Yes, exactly. Apparently, when it came to the recently released National Academies of Science “Consensus Study” Reproducibility and Replicability in Science 2019, no one did. Continue reading

Categories: ASA Guide to P-values, Error Statistics, P-values | 19 Comments

Hardwicke and Ioannidis, Gelman, and Mayo: P-values: Petitions, Practice, and Perils (and a question for readers)


The October 2019 issue of the European Journal of Clinical Investigations came out today. It includes the PERSPECTIVE article by Tom Hardwicke and John Ioannidis, an invited editorial by Gelman and one by me:

Petitions in scientific argumentation: Dissecting the request to retire statistical significance, by Tom Hardwicke and John Ioannidis

When we make recommendations for scientific practice, we are (at best) acting as social scientists, by Andrew Gelman

P-value thresholds: Forfeit at your peril, by Deborah Mayo

I blogged excerpts from my preprint, and some related posts, here.

All agree to the disagreement on the statistical and metastatistical issues: Continue reading

Categories: ASA Guide to P-values, P-values, stat wars and their casualties | 16 Comments

(Excerpts from) ‘P-Value Thresholds: Forfeit at Your Peril’ (free access)


A key recognition among those who write on the statistical crisis in science is that the pressure to publish attention-getting articles can incentivize researchers to produce eye-catching but inadequately scrutinized claims. We may see much the same sensationalism in broadcasting metastatistical research, especially if it takes the form of scapegoating or banning statistical significance. A lot of excitement was generated recently when Ron Wasserstein, Executive Director of the American Statistical Association (ASA), and co-editors A. Schirm and N. Lazar, updated the 2016 ASA Statement on P-Values and Statistical Significance (ASA I). In their 2019 interpretation, ASA I “stopped just short of recommending that declarations of ‘statistical significance’ be abandoned,” and in their new statement (ASA II) announced: “We take that step here….’statistically significant’ –don’t say it and don’t use it”. To herald the ASA II, and the special issue “Moving to a world beyond ‘p < 0.05’”, the journal Nature requisitioned a commentary from Amrhein, Greenland and McShane “Retire Statistical Significance” (AGM). With over 800 signatories, the commentary received the imposing title “Scientists rise up against significance tests”! Continue reading

Categories: ASA Guide to P-values, P-values, stat wars and their casualties | 6 Comments

Gelman blogged our exchange on abandoning statistical significance

A. Gelman

I came across this post on Gelman’s blog today:

Exchange with Deborah Mayo on abandoning statistical significance

It was straight out of blog comments and email correspondence back when the ASA, and significant others, were rising up against the concept of statistical significance. Here it is: Continue reading

Categories: Gelman blogs an exchange with Mayo | Tags: | 7 Comments

All She Wrote (so far): Error Statistics Philosophy: 8 years on


Error Statistics Philosophy: Blog Contents (8 years)
By: D. G. Mayo

Dear Reader: I began this blog 8 years ago (Sept. 3, 2011)! A double celebration is taking place at the Elbar Room Friday evening (a smaller one was held earlier in the week), both for the blog and the one-year anniversary of the physical appearance of my book: Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars [SIST] (CUP). A special rush edition made an appearance on Sept 3, 2018 in time for the RSS meeting in Cardiff. If you're in the neighborhood, stop by for some Elba Grease.

Ship Statinfasst made its most recent journey at the Summer Seminar for Phil Stat from July 28-Aug 11, co-directed with Aris Spanos. It was one of the main events that occupied my time this past academic year, from the planning and advertising to the running of it. We had 15 fantastic faculty and post-doc participants (from 55 applicants), and plan to continue the movement to incorporate PhilStat in philosophy and methodology, both in teaching and research. Slides from the Seminar (Zoom videos, including those of special invited speakers, to come) are posted on this blog, as are slides and other materials from the Spring Seminar co-taught with Aris Spanos (and cross-listed with Economics).

Continue reading

Categories: 8 year memory lane, blog contents, Metablog | 3 Comments

(one year ago) RSS 2018 – Significance Tests: Rethinking the Controversy


Here’s what I posted 1 year ago on Aug 30, 2018.


Day 2, Wednesday 05/09/2018

11:20 – 13:20

Keynote 4 – Significance Tests: Rethinking the Controversy Assembly Room

Sir David Cox, Nuffield College, Oxford
Deborah Mayo, Virginia Tech
Richard Morey, Cardiff University
Aris Spanos, Virginia Tech

Intermingled in today's statistical controversies are some long-standing, but unresolved, disagreements on the nature and principles of statistical methods and the roles for probability in statistical inference and modelling. In reaction to the so-called "replication crisis" in the sciences, some reformers single out significance tests as a major culprit. To understand the ramifications of the proposed reforms, there is a pressing need for a deeper understanding of the source of the problems in the sciences and a balanced critique of the alternative methods being proposed to supplant significance tests. In this session speakers offer perspectives on significance tests from statistical science, econometrics, experimental psychology and philosophy of science. There will also be a panel discussion.

Categories: memory lane | Tags: | Leave a comment

Palavering about Palavering about P-values


Nathan Schachtman (who was a special invited speaker at our recent Summer Seminar in Phil Stat) put up a post on his law blog the other day ("Palavering About P-values") on an article by a statistics professor at Stanford, Helena Kraemer. "Palavering" is an interesting word choice of Schachtman's. Its range of meanings is relevant here [i]; in my title, I intend both, in turn. You can read Schachtman's full post here; it begins like this:

The American Statistical Association’s most recent confused and confusing communication about statistical significance testing has given rise to great mischief in the world of science and science publishing.[ASA II 2019] Take for instance last week’s opinion piece about “Is It Time to Ban the P Value?” Please.

Admittedly, their recent statement, which I refer to as ASA II, has seemed to open the floodgates to some very zany remarks about P-values, their meaning and role in statistical testing. Continuing with Schachtman’s post: Continue reading

Categories: ASA Guide to P-values, P-values | Tags: | 12 Comments

A. Spanos: Egon Pearson’s Neglected Contributions to Statistics

Continuing with posts on E.S. Pearson in marking his birthday:

Egon Pearson’s Neglected Contributions to Statistics

by Aris Spanos

    Egon Pearson (11 August 1895 – 12 June 1980) is widely known today for his contribution in recasting Fisher's significance testing into the Neyman-Pearson (1933) theory of hypothesis testing. Occasionally, he is also credited with contributions in promoting statistical methods in industry and in the history of modern statistics; see Bartlett (1981). What is rarely mentioned is Egon's early pioneering work on:

(i) specification: the need to state explicitly the inductive premises of one’s inferences,

(ii) robustness: evaluating the ‘sensitivity’ of inferential procedures to departures from the Normality assumption, as well as

(iii) Mis-Specification (M-S) testing: probing for potential departures from the Normality assumption.

Arguably, modern frequentist inference began with the development of various finite sample inference procedures, initially by William Gosset (1908) [of the Student’s t fame] and then Fisher (1915, 1921, 1922a-b). These inference procedures revolved around a particular statistical model, known today as the simple Normal model:

Xk ∼ NIID(μ,σ²), k = 1, 2, …, n, …             (1)

where ‘NIID(μ,σ²)’ stands for ‘Normal, Independent and Identically Distributed with mean μ and variance σ²’. These procedures include the ‘optimal’ estimators of μ and σ², Xbar and s², and the pivotal quantities:

(a) τ(X) = [√n(Xbar - μ)/s] ∼ St(n-1),  (2)

(b) v(X) = [(n-1)s²/σ²] ∼ χ²(n-1),        (3)

where St(n-1) and χ²(n-1) denote the Student’s t and chi-square distributions with (n-1) degrees of freedom. Continue reading
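For readers who like to check such claims by simulation, here is a minimal sketch (mine, not Spanos's; the sample size and parameter values are arbitrary) verifying that the pivotal quantities (2) and (3) behave as advertised under model (1):

```python
import math
import random
import statistics

def pivots(x, mu, sigma2):
    """Compute the pivotal quantities (2) and (3) for a sample x."""
    n = len(x)
    xbar = statistics.fmean(x)
    s2 = statistics.variance(x)  # unbiased sample variance (n-1 divisor)
    tau = math.sqrt(n) * (xbar - mu) / math.sqrt(s2)  # ~ St(n-1)
    v = (n - 1) * s2 / sigma2                          # ~ chi^2(n-1)
    return tau, v

random.seed(1)
mu, sigma2, n = 10.0, 4.0, 20
taus, vs = [], []
for _ in range(20000):
    x = [random.gauss(mu, math.sqrt(sigma2)) for _ in range(n)]
    tau, v = pivots(x, mu, sigma2)
    taus.append(tau)
    vs.append(v)

# Theoretical values: E[tau] = 0 and E[v] = n-1 = 19
print(round(statistics.fmean(taus), 2))  # ≈ 0
print(round(statistics.fmean(vs), 1))    # ≈ 19
```

The same simulation could be extended to check robustness by drawing from a non-Normal distribution instead, which is exactly Pearson's point (ii).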

Categories: Egon Pearson, Statistics | Leave a comment

Statistical Concepts in Their Relation to Reality–E.S. Pearson

11 August 1895 – 12 June 1980

In marking Egon Pearson's birthday (Aug. 11), I'll post some Pearson items this week. They will contain some new reflections on older Pearson posts on this blog. Today, I'm posting "Statistical Concepts in Their Relation to Reality" (Pearson 1955). I've linked to it several times over the years, but always find a new gem or two, despite its being so short. E. Pearson rejected some of the familiar tenets that have come to be associated with Neyman and Pearson (N-P) statistical tests, notably the idea that the essential justification for tests resides in a long-run control of rates of erroneous interpretations–what he termed the "behavioral" rationale of tests. In an unpublished letter E. Pearson wrote to Birnbaum (1974), he talks about N-P theory admitting of two interpretations: behavioral and evidential:

“I think you will pick up here and there in my own papers signs of evidentiality, and you can say now that we or I should have stated clearly the difference between the behavioral and evidential interpretations. Certainly we have suffered since in the way the people have concentrated (to an absurd extent often) on behavioral interpretations”.

(Nowadays, it might be said that some people concentrate to an absurd extent on “science-wise error rates” in their view of statistical tests as dichotomous screening devices.) Continue reading

Categories: Egon Pearson, phil/history of stat, Philosophy of Statistics | Tags: , , | Leave a comment

Performance or Probativeness? E.S. Pearson’s Statistical Philosophy: Belated Birthday Wish

E.S. Pearson

This is a belated birthday post for E.S. Pearson (11 August 1895-12 June, 1980). It’s basically a post from 2012 which concerns an issue of interpretation (long-run performance vs probativeness) that’s badly confused these days. I’ll post some Pearson items this week to mark his birthday.


Are methods based on error probabilities of use mainly to supply procedures which will not err too frequently in some long run? (performance). Or is it the other way round: that the control of long-run error properties is of crucial importance for probing the causes of the data at hand? (probativeness). I say no to the former and yes to the latter. This, I think, was also the view of Egon Sharpe (E.S.) Pearson.

Cases of Type A and Type B

“How far then, can one go in giving precision to a philosophy of statistical inference?” (Pearson 1947, 172)

Pearson considers the rationale that might be given to N-P tests in two types of cases, A and B:

“(A) At one extreme we have the case where repeated decisions must be made on results obtained from some routine procedure…

(B) At the other is the situation where statistical tools are applied to an isolated investigation of considerable importance…?” (ibid., 170)

Continue reading

Categories: E.S. Pearson, Error Statistics | Leave a comment

S. Senn: Red herrings and the art of cause fishing: Lord’s Paradox revisited (Guest post)


Stephen Senn
Consultant Statistician


Previous posts[a],[b],[c] of mine have considered Lord’s Paradox. To recap, this was considered in the form described by Wainer and Brown[1], in turn based on Lord’s original formulation:

A large university is interested in investigating the effects on the students of the diet provided in the university dining halls…. Various types of data are gathered. In particular, the weight of each student at the time of his arrival in September and his weight the following June are recorded. [2](p. 304)

The issue is whether the appropriate analysis should be based on change-scores (weight in June minus weight in September), as proposed by a first statistician (whom I called John), or analysis of covariance (ANCOVA), using the September weight as a covariate, as proposed by a second statistician (whom I called Jane). There was a difference in mean weight between halls at the time of arrival in September (baseline) and this difference turned out to be identical to the difference in June (outcome). It thus follows that, since the analysis of change scores is algebraically equivalent to correcting the difference between halls at outcome by the difference between halls at baseline, the analysis of change scores returns an estimate of zero. The conclusion is thus that, there being no difference in change scores between halls, diet has no effect. Continue reading
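Senn's algebra can be made concrete in a few lines. In the sketch below, the hall names, mean weights, and the ANCOVA slope are all hypothetical numbers of my own choosing; the point is only the structural contrast between the two corrections:

```python
# Hypothetical hall means (kg): any values with equal baseline and
# outcome differences reproduce the paradox.
sept = {"hall_A": 70.0, "hall_B": 65.0}  # September (baseline) means
june = {"hall_A": 70.0, "hall_B": 65.0}  # June (outcome) means -- same gap

baseline_diff = sept["hall_A"] - sept["hall_B"]  # 5.0
outcome_diff = june["hall_A"] - june["hall_B"]   # 5.0

# John's analysis: difference in mean change scores. Algebraically this
# corrects the outcome difference by the FULL baseline difference (slope 1).
change_score_est = outcome_diff - 1.0 * baseline_diff
print(change_score_est)  # 0.0 -> "diet has no effect"

# Jane's ANCOVA corrects by b * baseline_diff, where b is the pooled
# within-hall regression slope of June weight on September weight
# (b < 1 whenever weights regress toward their hall means).
b = 0.6  # illustrative slope, not estimated from any data here
ancova_est = outcome_diff - b * baseline_diff
print(ancova_est)  # 2.0 -> an apparent diet effect
```

The two analyses differ only in the coefficient applied to the baseline difference, which is why the paradox turns entirely on which correction is appropriate to the causal question asked.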

Categories: Stephen Senn | 24 Comments

Summer Seminar in PhilStat Participants and Special Invited Speakers


Participants in the 2019 Summer Seminar in Philosophy of Statistics

Continue reading

Categories: Summer Seminar in PhilStat | Leave a comment

The NEJM Issues New Guidelines on Statistical Reporting: Is the ASA P-Value Project Backfiring? (i)

The New England Journal of Medicine (NEJM) announced new guidelines for authors for statistical reporting yesterday*. The ASA describes the change as "in response to the ASA Statement on P-values and Statistical Significance and subsequent The American Statistician special issue on statistical inference" (ASA I and II, in my abbreviation). If so, it seems to have backfired. I don't know all the differences in the new guidelines, but those explicitly noted appear to me to move in the reverse direction from where the ASA I and II guidelines were heading.

The most notable point is that the NEJM highlights the need for error control, especially for constraining the Type I error probability, and pays a lot of attention to adjusting P-values for multiple testing and post hoc subgroups. ASA I included an important principle (#4), that P-values are altered and may be invalidated by multiple testing, but it does not call for adjustments for multiplicity, nor do I find a discussion of Type I or II error probabilities in the ASA documents. NEJM gives strict requirements for controlling the family-wise error rate or false discovery rate (understood as the Benjamini and Hochberg frequentist adjustment). Continue reading
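For concreteness, the Benjamini-Hochberg step-up procedure that the NEJM guidelines invoke for false discovery rate control can be sketched in a few lines (my illustration; the P-values below are made up):

```python
def benjamini_hochberg(pvals, q=0.05):
    """Benjamini-Hochberg step-up procedure: return the indices of the
    hypotheses rejected at false-discovery rate q."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    # Find the largest rank k with p_(k) <= (k/m) * q,
    # then reject the hypotheses with the k smallest P-values.
    k = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= (rank / m) * q:
            k = rank
    return sorted(order[:k])

pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
print(benjamini_hochberg(pvals, q=0.05))  # [0, 1]: only the two smallest survive
```

Note that several P-values below 0.05 fail the adjusted thresholds: exactly the kind of multiplicity correction absent from the ASA documents.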

Categories: ASA Guide to P-values | 22 Comments
