P-Value Statements and Their Unintended(?) Consequences: The June 2019 ASA President’s Corner (b)


Mayo writing to Kafadar

I never met Karen Kafadar, the 2019 President of the American Statistical Association (ASA), but the other day I wrote to her in response to a call in her extremely interesting June 2019 President’s Corner: “Statistics and Unintended Consequences“:

  • “I welcome your suggestions for how we can communicate the importance of statistical inference and the proper interpretation of p-values to our scientific partners and science journal editors in a way they will understand and appreciate and can use with confidence and comfort—before they change their policies and abandon statistics altogether.”

I only recently came across her call, and I will share my letter below. First, here are some excerpts from her June President’s Corner (her December report is due any day).

Recently, at chapter meetings, conferences, and other events, I’ve had the good fortune to meet many of our members, many of whom feel queasy about the effects of differing views on p-values expressed in the March 2019 supplement of The American Statistician (TAS). The guest editors— Ronald Wasserstein, Allen Schirm, and Nicole Lazar—introduced the ASA Statement on P-Values (2016) by stating the obvious: “Let us be clear. Nothing in the ASA statement is new.” Indeed, the six principles are well-known to statisticians.The guest editors continued, “We hoped that a statement from the world’s largest professional association of statisticians would open a fresh discussion and draw renewed and vigorous attention to changing the practice of science with regards to the use of statistical inference.”…

Wait a minute. I’m confused about who is speaking. The statements “Let us be clear…” and “We hoped that a statement from the world’s largest professional association…” come from the 2016 ASA Statement on P-values. I abbreviate this as ASA I (Wasserstein and Lazar 2016). The March 2019 editorial that Kafadar says is making many members “feel queasy,” is the update (Wasserstein, Schirm, and Lazar 2019). I abbreviate it as ASA II [i]. 

A healthy debate about statistical approaches can lead to better methods. But, just as Wilks and his colleagues discovered, unintended consequences may have arisen: Nonstatisticians (the target of the issue) may be confused about what to do. Worse, “by breaking free from the bonds of statistical significance” as the editors suggest and several authors urge, researchers may read the call to “abandon statistical significance” as “abandon statistical methods altogether.” …

But we may need more. How exactly are researchers supposed to implement this “new concept” of statistical thinking? Without specifics, questions such as “Why is getting rid of p-values so hard?” may lead some of our scientific colleagues to hear the message as, “Abandon p-values”—despite the guest editors’ statement: “We are not recommending that the calculation and use of continuous p-values be discontinued.”

Brad Efron once said, “Those who ignore statistics are condemned to re-invent it.” In his commentary (“It’s not the p-value’s fault”) following the 2016 ASA Statement on P-Values, Yoav Benjamini wrote, “The ASA Board statement about the p-values may be read as discouraging the use of p-values because they can be misused, while the other approaches offered there might be misused in much the same way.” Indeed, p-values (and all statistical methods in general) can be misused. (So may cars and computers and cell phones and alcohol. Even words in the English language get misused!) But banishing them will not prevent misuse; analysts will simply find other ways to document a point—perhaps better ways, but perhaps less reliable ones. And, as Benjamini further writes, p-values have stood the test of time in part because they offer “a first line of defense against being fooled by randomness, separating signal from noise, because the models it requires are simpler than any other statistical tool needs”—especially now that Efron’s bootstrap has become a familiar tool in all branches of science for characterizing uncertainty in statistical estimates.[Benjamini is commenting on ASA I.]

… It is reassuring that “Nature is not seeking to change how it considers statistical evaluation of papers at this time,” but this line is buried in its March 20 editorial, titled “It’s Time to Talk About Ditching Statistical Significance.” Which sentence do you think will be more memorable? We can wait to see if other journals follow BASP’s lead and then respond. But then we’re back to “reactive” versus “proactive” mode (see February’s column), which is how we got here in the first place.

… Indeed, the ASA has a professional responsibility to ensure good science is conducted—and statistical inference is an essential part of good science. Given the confusion in the scientific community (to which the ASA’s peer-reviewed 2019 TAS supplement may have unintentionally contributed), we cannot afford to sit back. After all, that’s what started us down the “abuse of p-values” path. 

Is it unintentional? [ii]

…Tukey wrote years ago about Bayesian methods: “It is relatively clear that discarding Bayesian techniques would be a real mistake; trying to use them everywhere, however, would in my judgment, be a considerably greater mistake.” In the present context, perhaps he might have said: “It is relatively clear that trusting or dismissing results based on a single p-value would be a real mistake; discarding p-values entirely, however, would in my judgment, be a considerably greater mistake.” We should take responsibility for the situation in which we find ourselves today (and during the past decades) to ensure that our well-researched and theoretically sound statistical methodology is neither abused nor dismissed categorically. I welcome your suggestions for how we can communicate the importance of statistical inference and the proper interpretation of p-values to our scientific partners and science journal editors in a way they will understand and appreciate and can use with confidence and comfort—before they change their policies and abandon statistics altogether. Please send me your ideas! 

You can read the full June President’s Corner.

On Fri, Nov 8, 2019 at 2:09 PM Deborah Mayo <mayod@vt.edu> wrote:

Dear Professor Kafadar;

Your article in the President’s Corner of the ASA for June 2019 was sent to me by someone who had read my “P-value Thresholds: Forfeit at your Peril” editorial, invited by John Ioannidis. I find your sentiment welcome and I’m responding to your call for suggestions.

For starters, when representatives of the ASA issue articles criticizing P-values and significance tests, recommending their supplementation or replacement by others, three very simple principles should be followed:

  • The elements of tests should be presented in an accurate, fair and at least reasonably generous manner, rather than presenting mainly abuses of the methods;
  • The latest accepted methods should be included, not just crude nil null hypothesis tests. How these newer methods get around the often-repeated problems should be mentioned.
  • Problems facing the better-known alternatives, recommended as replacements or supplements to significance tests, should be discussed. Such an evaluation should recognize the role of statistical falsification is distinct from (while complementary to) using probability to quantify degrees of confirmation, support, plausibility or belief in a statistical hypothesis or model.

Here’s what I recommend ASA do now in order to correct the distorted picture that is now widespread and growing: Run a conference akin to the one Wasserstein ran on “A World Beyond ‘P < 0.05′” except that it would be on evaluating some competing methods for statistical inference: Comparative Methods of Statistical Inference: Problems and Prospects.

The workshop would consist of serious critical discussions on Bayes Factors, confidence intervals[iii], Likelihoodist methods, other Bayesian approaches (subjective, default non-subjective, empirical), particularly in relation to today’s replication crisis. …

Growth of the use of these alternative methods have been sufficiently widespread to have garnered discussions on well-known problems….The conference I’m describing will easily attract the leading statisticians in the world. …

D. Mayo

Please share your comments on this blogpost.


[i] My reference to ASA II refers just to the portion of the editorial encompassing their general recommendations: don’t say significance or significant, oust P-value thresholds. (It mostly encompasses the first 10 pages.) It begins with a review of 4 of the 6 principles from ASA I, even though they are stated in more extreme terms than in ASA I. (As I point out in my blogpost, the result is to give us principles that are in tension with the original 6.) Note my new qualification in [ii]*

[ii]*As soon as I saw the 2019 document, I queried Wasserstein as to the relationship between ASA I and II. It was never clarified. I hope now that it will be, with some kind of disclaimer. That will help, but merely noting that it never came to a Board vote will not quell the confusion now rattling some ASA members. The ASA’s P-value campaign to editors to revise their author guidelines asks them to take account of both ASA I and II. In carrying out the P-value campaign, at which he is highly effective, Ron Wasserstein obviously* wears his Executive Director’s hat. See The ASA’s P-value Project: Why It’s Doing More Harm than Good. So, until some kind of clarification is issued by the ASA, I’ve hit upon this solution.

The ASA P-value Project existed before the 2016 ASA I. The only difference in today’s P-value Project–since the March 20, 2019 editorial by Wasserstein et al– is that the ASA Executive Director (in talks, presentations, correspondence) recommends ASA I and the general stipulations of ASA II–even though that’s not a policy document. I will now call it the 2019 ASA P-value Project II. It also includes the rather stronger principles in ASA II. Even many who entirely agree with the “don’t say significance” and “don’t use P-value thresholds” recommendations have concurred with my “friendly amendments” to ASA II (including, for example, Greenland, Hurlbert, and others). See my post from June 17, 2019.

You merely have to look at the comments to that blog. If Wasserstein would make those slight revisions, the 2019 P-value Project II wouldn’t contain the inconsistencies, or at least “tensions” that it now does, assuming that it retains ASA I. The 2019 ASA P-value Project II sanctions making the recommendations in ASA II, even though ASA II is not an ASA policy statement.

However, I don’t see that those made queasy by ASAII would be any less upset with the reality of the ASA P-value Project II.

[iii]Confidence intervals (CIs) clearly aren’t “alternative measures of evidence” in relation to statistical significance tests. The same man, Neyman, developed tests (with Pearson) and CIs, even earlier ~1930. They were developed as duals, or inversions, of tests. Yet the advocates of CIs–the CI Crusaders, S. Hurlbert calls them–are some of today’s harshest and most ungenerous critics of tests. For these crusaders, it has to be “CIs only”. Supplementing p-values with CIs isn’t good enough. Now look what’s happened to CIS in the latest guidelines of the NEJM. You can readily find them searching NEJM on this blog. (My own favored measure, severity, improves on CIs, moves away from the fixed confidence level, and provides a different assessment corresponding to each point in the CI.

*Or is it not obvious? I think it is, because he is invited and speaks, writes, and corresponds in that capacity.


Wasserstein, R. & Lazar, N. (2016) [ASA I], The ASA’s Statement on p-Values: Context, Process, and Purpose”. Volume 70, 2016 – Issue 2.

Wasserstein, R., Schirm, A. and Lazar, N. (2019) [ASA II] “Moving to a World Beyond ‘p < 0.05’”, The American Statistician 73(S1): 1-19: Editorial. (ASA II)(pdf)

Related posts on ASA II:

Related book (excerpts from posts on this blog are collected here)

Mayo, (2018). Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars, SIST (2018, CUP).

Categories: ASA Guide to P-values, Bayesian/frequentist, P-values | Leave a comment

A. Saltelli (Guest post): What can we learn from the debate on statistical significance?

Professor Andrea Saltelli
Centre for the Study of the Sciences and the Humanities (SVT), University of Bergen (UIB, Norway),
Open Evidence Research, Universitat Oberta de Catalunya (UOC), Barcelona

What can we learn from the debate on statistical significance?

The statistical community is in the midst of crisis whose latest convulsion is a petition to abolish the concept of significance. The problem is perhaps neither with significance, nor with statistics, but with the inconsiderate way we use numbers, and with our present approach to quantification.  Unless the crisis is resolved, there will be a loss of consensus in scientific arguments, with a corresponding decline of public trust in the findings of science.

#The sins of quantification

Every quantification which is unclear as to its scope and the context in which it is produced obscures rather than elucidates.

Traditionally, the strength of numbers in the making of an argument has rested on their purported objectivity and neutrality. Expressions such as “Concrete numbers”, “The numbers speak for themselves”, “The data/the model don’t lie” are common currency. Today, doubts about algorithmic instances of quantification – e.g. in promoting, detaining, conceding freedom or credit, are becoming more urgent and visible. Yet the doubt should be general. It is becoming realised that in every activity of quantification, the technique or the methods are never neutral, because it is never possible to separate entirely the act of quantifying from the wishes and expectations of the quantifier.  Thus, books apparently telling separate stories, such as Rigor Mortis, Weapons of Math Destruction, the Tyranny of Metrics, or Useless Arithmetic, dealing with statistics, algorithms, indicators and models, share a common concern.

# Statisticians know

Statisticians are increasingly aware that each number presupposes an underlying narrative, a worldview, and a purpose of the exercise. The maturity of this debate in the house of statistics is not an accident. Statistics is a discipline, with recognized leaders and institutions, and although one might derive an impression of disorder by the use a petition to influence a scientific argument, one cannot deny that the problems in statistics are being tackled head on, in the public arena, in spite of the obvious difficulty for the lay public to follow the technicality of the arguments. With its ongoing  discussion of significance, the community of statistics is teaching us an important lesson about the tight coupling between technique and values. How so? We recap here some elements of the debate.

  • For some, it would be better to throw away the concept of significance altogether, because the p-test, – with its magical p<0.05 threshold, is being misused as a measure of veracity and publishability.
  • Others object that discussion should not take place with the instrument of a petition and that withdrawing tests of significance would make science even more uncertain.
  • The former retort that since this discussion has been going on for decades on academic journal without the existing flaws being fixed, then perhaps times are ripe for action.

A good vantage point to look at this debate in its entirety is this section in Andrew Gelman’s blog.

# Different worlds

An important aspect of this discussion is that the contenders may inhabit different worlds. One world is full of important effects which are overlooked because the test of significance fails (p value greater that 0.05 in statistical parlance). The other world is instead replete with bogus results passed on to the academic literature thanks to a low value of the p-test (p<0.05).

A modicum of investigation reveals that the contention is normative, or indeed political. To take an example, some may fear the introduction on the market of ineffectual pharmaceutical products, others that important epidemiological effects of a pollutant on health may be overlooked. The first group would thus have a more restrictive value for the test, the second group a less restrictive one.

All this is not new. Philosopher Richard Rudner had already written in 1953 that it is impossible to use a test of significance without knowing to what it is being applied, i.e. without making a value judgment. Interestingly, Rudner used this example to make the point that scientists do need to make value judgments.

# How about mathematical models?

In all this discussion mathematical models have enjoyed a relative immunity, perhaps because mathematical modelling is not a discipline. But the absence of awareness of a quality problem is not proof of the absence of a problem.  And there are signals that the crisis there might be even worse than that which is recognised in statistics.

Implausible quantifications of the effect of climate change on the gross domestic product of a country at the year 2100, or of the safety of a disposal for nuclear waste a million years from now, or of the risk of the financial products at the heart of the latest financial crisis, are just examples that are easily seen in the literature. Political decision in the field of transports may be based on a model which needs as an input the average number of passengers sitting is a car several decades in the future. A scholar studying science and technology laments the generation of artefactual numbers through methods and concepts such as ‘expected utility’, ‘decision theory’, ‘life cycle assessment’, ‘ecosystem services’ ‘sound scientific decisions’ and ‘evidence-based policy’ to convey a spurious impression of certainty and control over important issues concerning health and the environment. A rhetorical use of quantification may thus be used in evidence-based policy to hide important knowledge and power asymmetries: the production of evidence empowers those who can pay for it, a trend noted in both the US and Europe.

# Resistance?

Since its inception the current of post normal science (PNS) has insisted on the need to fight against instrumental or fantastic quantifications. PNS scholars suggested the use of pedigree for numerical information (NUSAP), and recently for mathematical models. Combined with PNS’ concept of extended peer communities, these tools are meant to facilitate a discussion of the various attributes of a quantification. This information includes not just its uncertainty, but also its history, the profile of its producers, its position within a system of power and norms, and overall its ‘fitness for function’, while also identifying the possible exclusion of competing stakes and worldviews.

Stat-Activisme, a recent French intellectual ovement, proposes to ‘fight against’ as well as ‘fight with’ numbers. Stat-activisme targets invasive metrics and biased statistics, with a rich repertoire of strategies from ‘statistical judo’ to the construction of alternative measures.

As philosopher Jerome Ravetz reminds us, so long as our modern scientific culture has faith in numbers as if they were ‘nuggets of truth’, we will be victims of ‘funny numbers’ employed to rule our technical society.

Note: A different version of this piece has been published in Italian in the journal Epidemiologia and Prevenzione.

Categories: Error Statistics | 11 Comments

The ASA’s P-value Project: Why it’s Doing More Harm than Good (cont from 11/4/19)


cure by committee

Everything is impeach and remove these days! Should that hold also for the concept of statistical significance and P-value thresholds? There’s an active campaign that says yes, but I aver it is doing more harm than good. In my last post, I said I would count the ways it is detrimental until I became “too disconsolate to continue”. There I showed why the new movement, launched by Executive Director of the ASA (American Statistical Association), Ronald Wasserstein (in what I dub ASA II), is self-defeating: it instantiates and encourages the human-all-too-human tendency to exploit researcher flexibility, rewards, and openings for bias in research (F, R & B Hypothesis). That was reason #1. Just reviewing it already fills me with such dismay, that I fear I will become too disconsolate to continue before even getting to reason #2. So let me just quickly jot down reasons #2, 3, 4, and 5 (without full arguments) before I expire. Continue reading

Categories: ASA Guide to P-values | 7 Comments

On Some Self-Defeating Aspects of the ASA’s (2019) Recommendations on Statistical Significance Tests (ii)


“Before we stood on the edge of the precipice, now we have taken a great step forward”


What’s self-defeating about pursuing statistical reforms in the manner taken by the American Statistical Association (ASA) in 2019? In case you’re not up on the latest in significance testing wars, the 2016 ASA Statement on P-Values and Statistical Significance, ASA I, arguably, was a reasonably consensual statement on the need to avoid some well-known abuses of P-values–notably if you compute P-values, ignoring selective reporting, multiple testing, or stopping when the data look good, the computed P-value will be invalid. (Principle 4, ASA I) But then Ron Wasserstein, executive director of the ASA, and co-editors, decided they weren’t happy with their own 2016 statement because it “stopped just short of recommending that declarations of ‘statistical significance’ be abandoned” altogether. In their new statement–ASA II–they announced: “We take that step here….Statistically significant –don’t say it and don’t use it”.

Why do I say it is a mis-take to have taken the supposed next “great step forward”? Why do I count it as unsuccessful as a piece of statistical science policy? In what ways does it make the situation worse? Let me count the ways. The first is in this post. Others will come in following posts, until I become too disconsolate to continue.[i] Continue reading

Categories: P-values, stat wars and their casualties, statistical significance tests | 12 Comments

Exploring a new philosophy of statistics field

This article came out on Monday on our Summer Seminar in Philosophy of Statistics in Virginia Tech News Daily magazine.

October 28, 2019


From universities around the world, participants in a summer session gathered to discuss the merits of the philosophy of statistics. Co-director Deborah Mayo, left, hosted an evening for them at her home.

Continue reading

Categories: Philosophy of Statistics, Summer Seminar in PhilStat | 2 Comments

The First Eye-Opener: Error Probing Tools vs Logics of Evidence (Excursion 1 Tour II)

1.4, 1.5

In Tour II of this first Excursion of Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (SIST, 2018, CUP),  I pull back the cover on disagreements between experts charged with restoring integrity to today’s statistical practice. Some advised me to wait until later (in the book) to get to this eye-opener. Granted, the full story involves some technical issues, but after many months, I think I arrived at a way to get to the heart of things informally (with a promise of more detailed retracing of steps later on). It was too important not to reveal right away that some of the most popular “reforms” fall down on the job even with respect to our most minimal principle of evidence (you don’t have evidence for a claim if little if anything has been done to probe the ways it can be flawed).  Continue reading

Categories: Error Statistics, law of likelihood, SIST | 14 Comments

The Current State of Play in Statistical Foundations: A View From a Hot-Air Balloon


Continue to the third, and last stop of Excursion 1 Tour I of Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (2018, CUP)–Section 1.3. It would be of interest to ponder if (and how) the current state of play in the stat wars has shifted in just one year. I’ll do so in the comments. Use that space to ask me any questions.

How can a discipline, central to science and to critical thinking, have two methodologies, two logics, two approaches that frequently give substantively different answers to the same problems? … Is complacency in the face of contradiction acceptable for a central discipline of science? (Donald Fraser 2011, p. 329)

We [statisticians] are not blameless … we have not made a concerted professional effort to provide the scientific world with a unified testing methodology. (J. Berger 2003, p. 4)

Continue reading

Categories: Statistical Inference as Severe Testing | 3 Comments

Severity: Strong vs Weak (Excursion 1 continues)


Marking one year since the appearance of my book: Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (2018, CUP), let’s continue to the second stop (1.2) of Excursion 1 Tour 1. It begins on p. 13 with a quote from statistician George Barnard. Assorted reflections will be given in the comments. Ask me any questions pertaining to the Tour.


  • I shall be concerned with the foundations of the subject. But in case it should be thought that this means I am not here strongly concerned with practical applications, let me say right away that confusion about the foundations of the subject is responsible, in my opinion, for much of the misuse of the statistics that one meets in fields of application such as medicine, psychology, sociology, economics, and so forth. (George Barnard 1985, p. 2)

Continue reading

Categories: Statistical Inference as Severe Testing | 5 Comments

How My Book Begins: Beyond Probabilism and Performance: Severity Requirement

This week marks one year since the general availability of my book: Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (2018, CUP). Here’s how it begins (Excursion 1 Tour 1 (1.1)). Material from the preface is here. I will sporadically give some “one year later” reflections in the comments. I invite readers to ask me any questions pertaining to the Tour.

The journey begins..(1.1)

I’m talking about a specific, extra type of integrity that is [beyond] not lying, but bending over backwards to show how you’re maybe wrong, that you ought to have when acting as a scientist. (Feynman 1974/1985, p. 387)

It is easy to lie with statistics. Or so the cliché goes. It is also very difficult to uncover these lies without statistical methods – at least of the right kind. Self- correcting statistical methods are needed, and, with minimal technical fanfare, that’s what I aim to illuminate. Since Darrell Huff wrote How to Lie with Statistics in 1954, ways of lying with statistics are so well worn as to have emerged in reverberating slogans:

  • Association is not causation.
  • Statistical significance is not substantive significamce
  • No evidence of risk is not evidence of no risk.
  • If you torture the data enough, they will confess.

Continue reading

Categories: Statistical Inference as Severe Testing, Statistics | 4 Comments

National Academies of Science: Please Correct Your Definitions of P-values

Mayo banging head

If you were on a committee to highlight issues surrounding P-values and replication, what’s the first definition you would check? Yes, exactly. Apparently, when it came to the recently released National Academies of Science “Consensus Study” Reproducibility and Replicability in Science 2019, no one did. Continue reading

Categories: ASA Guide to P-values, Error Statistics, P-values | 19 Comments

Hardwicke and Ioannidis, Gelman, and Mayo: P-values: Petitions, Practice, and Perils (and a question for readers)


The October 2019 issue of the European Journal of Clinical Investigations came out today. It includes the PERSPECTIVE article by Tom Hardwicke and John Ioannidis, an invited editorial by Gelman and one by me:

Petitions in scientific argumentation: Dissecting the request to retire statistical significance, by Tom Hardwicke and John Ioannidis

When we make recommendations for scientific practice, we are (at best) acting as social scientists, by Andrew Gelman

P-value thresholds: Forfeit at your peril, by Deborah Mayo

I blogged excerpts from my preprint, and some related posts, here.

All agree to the disagreement on the statistical and metastatistical issues: Continue reading

Categories: ASA Guide to P-values, P-values, stat wars and their casualties | 16 Comments

(Excerpts from) ‘P-Value Thresholds: Forfeit at Your Peril’ (free access)


A key recognition among those who write on the statistical crisis in science is that the pressure to publish attention-getting articles can incentivize researchers to produce eye-catching but inadequately scrutinized claims. We may see much the same sensationalism in broadcasting metastatistical research, especially if it takes the form of scapegoating or banning statistical significance. A lot of excitement was generated recently when Ron Wasserstein, Executive Director of the American Statistical Association (ASA), and co-editors A. Schirm and N. Lazar, updated the 2016 ASA Statement on P-Values and Statistical Significance (ASA I). In their 2019 interpretation, ASA I “stopped just short of recommending that declarations of ‘statistical significance’ be abandoned,” and in their new statement (ASA II) announced: “We take that step here….’statistically significant’ –don’t say it and don’t use it”. To herald the ASA II, and the special issue “Moving to a world beyond ‘p < 0.05’”, the journal Nature requisitioned a commentary from Amrhein, Greenland and McShane “Retire Statistical Significance” (AGM). With over 800 signatories, the commentary received the imposing title “Scientists rise up against significance tests”! Continue reading

Categories: ASA Guide to P-values, P-values, stat wars and their casualties | 6 Comments

Gelman blogged our exchange on abandoning statistical significance

A. Gelman

I came across this post on Gelman’s blog today:

Exchange with Deborah Mayo on abandoning statistical significance

It was straight out of blog comments and email correspondence back when the ASA, and significant others, were rising up against the concept of statistical significance. Here it is: Continue reading

Categories: Gelman blogs an exchange with Mayo | Tags: | 7 Comments

All She Wrote (so far): Error Statistics Philosophy: 8 years on


Error Statistics Philosophy: Blog Contents (8 years)
By: D. G. Mayo

Dear Reader: I began this blog 8 years ago (Sept. 3, 2011)! A double celebration is taking place at the Elbar Room Friday evening (a smaller one was held earlier in the week), both for the blog and the 1 year anniversary of the physical appearance of my book: Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars [SIST] (CUP). A special rush edition made an appearance on Sept 3, 2018 in time for the RSS meeting in Cardiff. If you’re in the neighborhood, stop by for some Elba Grease.

Ship Statinfasst made its most recent journey at the Summer Seminar for Phil Stat from July 28-Aug 11, co-directed with Aris Spanos. It was one of the main events that occupied my time the past academic year, from the planning, advertising and running. We had 15 fantastic faculty and post-doc participants (from 55 applicants), and plan to continue the movement to incorporate PhilStat in philosophy and methodology, both in teaching and research. You can find slides from the Seminar (zoom videos, including those of special invited speakers, to come) on SummerSeminarPhilStat.com. Slides and other materials from the Spring Seminar co-taught with Aris Spanos (and cross-listed with Economics) can be found on this blog here

Continue reading

Categories: 8 year memory lane, blog contents, Metablog | 3 Comments

(one year ago) RSS 2018 – Significance Tests: Rethinking the Controversy


Here’s what I posted 1 year ago on Aug 30, 2018.


Day 2, Wednesday 05/09/2018

11:20 – 13:20

Keynote 4 – Significance Tests: Rethinking the Controversy Assembly Room

Sir David Cox, Nuffield College, Oxford
Deborah Mayo, Virginia Tech
Richard Morey, Cardiff University
Aris Spanos, Virginia Tech

Intermingled in today’s statistical controversies are some long-standing, but unresolved, disagreements on the nature and principles of statistical methods and the roles for probability in statistical inference and modelling. In reaction to the so-called “replication crisis” in the sciences, some reformers suggest significance tests as a major culprit. To understand the ramifications of the proposed reforms, there is a pressing need for a deeper understanding of the source of the problems in the sciences and a balanced critique of the alternative methods being proposed to supplant significance tests. In this session speakers offer perspectives on significance tests from statistical science, econometrics, experimental psychology and philosophy of science. There will be also be panel discussion.

Categories: memory lane | Tags: | Leave a comment

Palavering about Palavering about P-values


Nathan Schachtman (who was a special invited speaker at our recent Summer Seminar in Phil Stat) put up a post on his law blog the other day (“Palavering About P-values”) on an article by a statistics professor at Stanford, Helena Kraemer. “Palavering” is an interesting word choice of Schachtman’s. Its range of meanings is relevant here [i]; in my title, I intend both, in turn. You can read Schachtman’s full post here, it begins like this:

The American Statistical Association’s most recent confused and confusing communication about statistical significance testing has given rise to great mischief in the world of science and science publishing.[ASA II 2019] Take for instance last week’s opinion piece about “Is It Time to Ban the P Value?” Please.

Admittedly, their recent statement, which I refer to as ASA II, has seemed to open the floodgates to some very zany remarks about P-values, their meaning and role in statistical testing. Continuing with Schachtman’s post: Continue reading

Categories: ASA Guide to P-values, P-values | Tags: | 12 Comments

A. Spanos: Egon Pearson’s Neglected Contributions to Statistics

Continuing with posts on E.S. Pearson in marking his birthday:

Egon Pearson’s Neglected Contributions to Statistics

by Aris Spanos

    Egon Pearson (11 August 1895 – 12 June 1980), is widely known today for his contribution in recasting of Fisher’s significance testing into the Neyman-Pearson (1933) theory of hypothesis testing. Occasionally, he is also credited with contributions in promoting statistical methods in industry and in the history of modern statistics; see Bartlett (1981). What is rarely mentioned is Egon’s early pioneering work on:

(i) specification: the need to state explicitly the inductive premises of one’s inferences,

(ii) robustness: evaluating the ‘sensitivity’ of inferential procedures to departures from the Normality assumption, as well as

(iii) Mis-Specification (M-S) testing: probing for potential departures from the Normality  assumption.

Arguably, modern frequentist inference began with the development of various finite sample inference procedures, initially by William Gosset (1908) [of the Student’s t fame] and then Fisher (1915, 1921, 1922a-b). These inference procedures revolved around a particular statistical model, known today as the simple Normal model:

Xk ∽ NIID(μ,σ²), k=1,2,…,n,…             (1)

where ‘NIID(μ,σ²)’ stands for ‘Normal, Independent and Identically Distributed with mean μ and variance σ²’. These procedures include the ‘optimal’ estimators of μ and σ², Xbar and s², and the pivotal quantities:

(a) τ(X) =[√n(Xbar- μ)/s] ∽ St(n-1),  (2)

(b) v(X) =[(n-1)s²/σ²] ∽ χ²(n-1),        (3)

where St(n-1) and χ²(n-1) denote the Student’s t and chi-square distributions with (n-1) degrees of freedom. Continue reading

Categories: Egon Pearson, Statistics | Leave a comment

Statistical Concepts in Their Relation to Reality–E.S. Pearson

11 August 1895 – 12 June 1980

In marking Egon Pearson’s birthday (Aug. 11), I’ll  post some Pearson items this week. They will contain some new reflections on older Pearson posts on this blog. Today, I’m posting “Statistical Concepts in Their Relation to Reality” (Pearson 1955). I’ve linked to it several times over the years, but always find a new gem or two, despite its being so short. E. Pearson rejected some of the familiar tenets that have come to be associated with Neyman and Pearson (N-P) statistical tests, notably the idea that the essential justification for tests resides in a long-run control of rates of erroneous interpretations–what he termed the “behavioral” rationale of tests. In an unpublished letter E. Pearson wrote to Birnbaum (1974), he talks about N-P theory admitting of two interpretations: behavioral and evidential:

“I think you will pick up here and there in my own papers signs of evidentiality, and you can say now that we or I should have stated clearly the difference between the behavioral and evidential interpretations. Certainly we have suffered since in the way the people have concentrated (to an absurd extent often) on behavioral interpretations”.

(Nowadays, it might be said that some people concentrate to an absurd extent on “science-wise error rates” in their view of statistical tests as dichotomous screening devices.) Continue reading

Categories: Egon Pearson, phil/history of stat, Philosophy of Statistics | Tags: , , | Leave a comment

Performance or Probativeness? E.S. Pearson’s Statistical Philosophy: Belated Birthday Wish

E.S. Pearson

This is a belated birthday post for E.S. Pearson (11 August 1895-12 June, 1980). It’s basically a post from 2012 which concerns an issue of interpretation (long-run performance vs probativeness) that’s badly confused these days. I’ll post some Pearson items this week to mark his birthday.


Are methods based on error probabilities of use mainly to supply procedures which will not err too frequently in some long run? (performance). Or is it the other way round: that the control of long run error properties are of crucial importance for probing the causes of the data at hand? (probativeness). I say no to the former and yes to the latter. This, I think, was also the view of Egon Sharpe (E.S.) Pearson. 

Cases of Type A and Type B

“How far then, can one go in giving precision to a philosophy of statistical inference?” (Pearson 1947, 172)

Pearson considers the rationale that might be given to N-P tests in two types of cases, A and B:

“(A) At one extreme we have the case where repeated decisions must be made on results obtained from some routine procedure…

(B) At the other is the situation where statistical tools are applied to an isolated investigation of considerable importance…?” (ibid., 170)

Continue reading

Categories: E.S. Pearson, Error Statistics | Leave a comment

S. Senn: Red herrings and the art of cause fishing: Lord’s Paradox revisited (Guest post)


Stephen Senn
Consultant Statistician


Previous posts[a],[b],[c] of mine have considered Lord’s Paradox. To recap, this was considered in the form described by Wainer and Brown[1], in turn based on Lord’s original formulation:

A large university is interested in investigating the effects on the students of the diet provided in the university dining halls : : : . Various types of data are gathered. In particular, the weight of each student at the time of his arrival in September and his weight the following June are recorded. [2](p. 304)

The issue is whether the appropriate analysis should be based on change-scores (weight in June minus weight in September), as proposed by a first statistician (whom I called John) or analysis of covariance (ANCOVA), using the September weight as a covariate, as proposed by a second statistician (whom I called Jane). There was a difference in mean weight between halls at the time of arrival in September (baseline) and this difference turned out to be identical to the difference in June (outcome). It thus follows that, since the analysis of change score is algebraically equivalent to correcting the difference between halls at outcome by the difference between halls at baseline, the analysis of change scores returns an estimate of zero. The conclusion is thus, there being no difference between diets, diet has no effect. Continue reading

Categories: Stephen Senn | 24 Comments

Blog at WordPress.com.