Here are all the slides along with the video from the 11 January Phil Stat Forum with speakers: Deborah G. Mayo, Yoav Benjamini and moderator/discussant David Hand.

# P-values

## Philip Stark (guest post): commentary on “The Statistics Wars and Intellectual Conflicts of Interest” (Mayo Editorial)

**Philip B. Stark**

Professor

Department of Statistics

University of California, Berkeley

I enjoyed Prof. Mayo’s comment in *Conservation Biology* (Mayo 2021) very much, and agree enthusiastically with most of it. Here are my key takeaways and reflections.

Error probabilities (or error rates) are essential to consider. If you don’t give thought to what the data would be like if your theory is false, you are not doing science. Some applications really require a decision to be made. Does the drug go to market or not? Are the girders for the bridge strong enough, or not? Hence, banning “bright lines” is silly. Conversely, no threshold for significance, no matter how small, suffices to prove an empirical claim. In replication lies truth.

Abandoning P-values exacerbates moral hazard for journal editors, although there has always been moral hazard in the gatekeeping function. Absent any objective assessment of evidence, publication decisions are even more subject to cronyism, “taste”, confirmation bias, etc.

Throwing away P-values because many practitioners don’t know how to use them is perverse. It’s like banning scalpels because most people don’t know how to perform surgery. People who wish to perform surgery should be trained in the proper use of scalpels, and those who wish to use statistics should be trained in the proper use of P-values. Throwing out P-values is self-serving to statistical instruction, too: we’re making our lives easier by teaching *less* instead of teaching *better*. Continue reading

## The ASA controversy on P-values as an illustration of the difficulty of statistics

Christian Hennig

Professor

Department of Statistical Sciences

University of Bologna


“I work on Multidimensional Scaling for more than 40 years, and the longer I work on it, the more I realise how much of it I don’t understand. This presentation is about my current state of not understanding.” (John Gower, world-leading expert on Multidimensional Scaling, at a conference in 2009)

“The lecturer contradicts herself.” (Student feedback to an ex-colleague for teaching methods and then teaching the problems those methods have)

**1 Limits of understanding**

Statistical tests and P-values are widely used and widely misused. In 2016, the ASA issued a statement on significance and P-values with the intention of curbing misuse while acknowledging their proper definition and potential uses. In my view the statement did a rather good job of saying things worth saying while trying to be acceptable both to those who are generally critical of P-values and to those who tend to defend their use. As was predictable, the statement did not settle the issue. A 2019 editorial by some of the authors of the original statement (recommending that we “abandon statistical significance”) and a 2021 ASA task force statement, much more positive on P-values, followed, showing the level of disagreement in the profession. Continue reading

## E. Ionides & Ya’acov Ritov (Guest Post) on Mayo’s editorial, “The Statistics Wars and Intellectual Conflicts of Interest”

Edward L. Ionides

Director of Undergraduate Programs and Professor,

Department of Statistics, University of Michigan

Ya’acov Ritov

Professor,

Department of Statistics, University of Michigan

Thanks for the clear presentation of the issues at stake in your recent *Conservation Biology* editorial (Mayo 2021). There is a need for such articles elaborating and contextualizing the ASA President’s Task Force statement on statistical significance (Benjamini et al 2021). The Benjamini et al (2021) statement is sensible advice: for better or worse, it has no references, and just speaks what looks to us like plain sense. However, it avoids addressing why there is a debate in the first place, and what justifications and misconceptions drive the different positions. Consequently, it may be ineffective at communicating with those swing voters who have sympathies with some of the insinuations in the Wasserstein & Lazar (2016) statement. We say “insinuations” because we consider that the 2016 statement made an attack on p-values which was forceful, indirect and erroneous. Wasserstein & Lazar (2016) started with a constructive discussion of the uses and abuses of p-values before moving against them. This approach was good rhetoric: “I have come to praise p-values, not to bury them,” to invert Shakespeare’s Antony. Good rhetoric does not always promote good science, but Wasserstein & Lazar (2016) successfully managed to frame and lead the debate, judging by Google Scholar. We warned of the potential consequences of that article and its flaws (Ionides et al 2017), and we refer the reader to our article for more explanation of these issues (it may be found below). Wasserstein, Schirm and Lazar (2019) made their position clearer, and therefore easier to confront. We are grateful to Benjamini et al (2021) and Mayo (2021) for rising to the debate. Rephrasing Churchill in support of their efforts: “Many forms of statistical methods have been tried, and will be tried in this world of sin and woe. No one pretends that the p-value is perfect or all-wise. Indeed (noting that its abuse has much responsibility for the replication crisis) it has been said that the p-value is the worst form of inference except all those other forms that have been tried from time to time.” Continue reading

## Bickel’s defense of significance testing on the basis of Bayesian model checking

In my last post, I said I’d come back to a 2021 article by David Bickel, “Null Hypothesis Significance Testing Defended and Calibrated by Bayesian Model Checking,” in *The American Statistician.* His abstract begins as follows:

Significance testing is often criticized because p-values can be low even though posterior probabilities of the null hypothesis are not low according to some Bayesian models. Those models, however, would assign low prior probabilities to the observation that the p-value is sufficiently low. That conflict between the models and the data may indicate that the models need revision. Indeed, if the p-value is sufficiently small while the posterior probability according to a model is insufficiently small, then the model will fail a model check…. (from Bickel 2021)

## P-values disagree with posteriors? Problem is your priors, says R.A. Fisher

How often do you hear P-values criticized for “exaggerating” the evidence against a null hypothesis? If your experience is like mine, the answer is ‘all the time’, and in fact the charge is often taken as one of the strongest cards in the anti-statistical-significance playbook. The argument boils down to the fact that the P-value accorded to a point null *H*_{0} can be small while its Bayesian posterior probability is high–provided a high enough prior is accorded to *H*_{0}. But why suppose P-values should match Bayesian posteriors? And what justifies the high (or “spike”) prior on a point null? While I discuss this criticism at considerable length in *Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars* (CUP, 2018), I did not quote an intriguing response by R.A. Fisher to disagreements between P-values and posteriors (in *Statistical Methods and Scientific Inference*, Fisher 1956); namely, that such a prior probability assignment would itself be rejected by the observed small P-value–if the prior were itself regarded as a hypothesis to test. Or so he says. I did mention this response by Fisher in an encyclopedia article from way back in 2006 on “philosophy of statistics”: Continue reading
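To see the alleged conflict in numbers, here is a minimal sketch (my own illustration, not from the post, and only one conventional way of setting up the problem): a point null *H*_{0}: θ = 0 given a 0.5 spike prior, a standard normal prior on θ under the alternative, and data held at z = 1.96 so the two-sided P-value stays near 0.05 at every sample size.

```python
import math

# Sketch of the P-value vs. posterior conflict (my own illustration).
# H0: theta = 0 with prior probability 0.5; under H1, theta ~ N(0, 1).
# Data: z = sqrt(n) * xbar, where xbar is the mean of n N(theta, 1) draws,
# so z = 1.96 gives a two-sided P-value of ~0.05 regardless of n.

def posterior_h0(z, n, prior_h0=0.5):
    """P(H0 | z): the marginal of z is N(0, 1) under H0, N(0, 1 + n) under H1.
    The common 1/sqrt(2*pi) factor cancels in the likelihood ratio."""
    like_h0 = math.exp(-z ** 2 / 2)
    like_h1 = math.exp(-z ** 2 / (2 * (1 + n))) / math.sqrt(1 + n)
    bf01 = like_h0 / like_h1                      # Bayes factor in favor of H0
    return prior_h0 * bf01 / (prior_h0 * bf01 + 1 - prior_h0)

for n in (10, 100, 10_000):
    print(n, round(posterior_h0(1.96, n), 3))
# The same z = 1.96 (P ~ 0.05) gives a posterior on H0 that climbs with n:
# roughly 0.37 at n = 10 but over 0.9 at n = 10,000.
```

The point of the sketch is only that, under the spike prior, a fixed “significant” result becomes *evidence for* the null as n grows, which is precisely the disagreement the critics lean on and the assignment Fisher suggests testing.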

## Memory Lane (4 years ago): Why significance testers should reject the argument to “redefine statistical significance”, even if they want to lower the p-value*

An argument that assumes the very thing that was to have been argued for is guilty of *begging the question*; signing on to an argument whose conclusion you favor even though you cannot defend its premises is to argue *unsoundly*, and in bad faith. When a whirlpool of “reforms” subliminally alter the nature and goals of a method, falling into these sins can be quite inadvertent. Start with a simple point on defining the power of a statistical test. Continue reading
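For readers who want that definition in front of them, here is a minimal sketch (my own, not part of the original argument): the power of a test against an alternative μ₁ is the probability that the test rejects *H*_{0} when μ₁ is the true mean, illustrated for a one-sided z-test.

```python
from math import sqrt
from statistics import NormalDist

N01 = NormalDist()  # standard normal

def power(mu1, n, alpha=0.025, sigma=1.0):
    """Power of the one-sided z-test of H0: mu <= 0 vs mu > 0 at level alpha:
    the probability that z = sqrt(n) * xbar / sigma exceeds the cutoff
    when the true mean is mu1."""
    cutoff = N01.inv_cdf(1 - alpha)               # ~1.96 for alpha = 0.025
    return 1 - N01.cdf(cutoff - mu1 * sqrt(n) / sigma)

# Power grows with the discrepancy from the null and with sample size:
print(round(power(0.2, 100), 3))   # ~0.516 against mu1 = 0.2
print(round(power(0.4, 100), 3))   # ~0.979 against mu1 = 0.4
```

Note that power is computed *under the alternative*, not under the null: at μ₁ = 0 the function simply returns α, the test’s size.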

## Statistics and the Higgs Discovery: 9 yr Memory Lane

*I’m reblogging two of my Higgs posts on the 9th anniversary of the 2012 discovery. (The first was in this post.) The following was originally “Higgs Analysis and Statistical Flukes: Part 2” (from March 2013).[1]*

Some people say to me: “severe testing is fine for ‘sexy science’ like in high energy physics (HEP)”–as if their statistical inferences are radically different. But I maintain that this is the mode by which data are used in “uncertain” reasoning across the entire landscape of science and day-to-day learning, at least when we’re trying to find things out.[2] Even with high-level theories, the particular problems of learning from data are tackled piecemeal, in local inferences that afford error control. Granted, this statistical philosophy differs importantly from those that view the task as assigning comparative (or absolute) degrees of support/belief/plausibility to propositions, models, or theories.

The Higgs discussion finds its way into Tour III in Excursion 3 of my *Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars* (2018, CUP). You can read it (in proof form) here, pp. 202-217, in a section with the provocative title:

3.8 The Probability Our Results Are Statistical Fluctuations: Higgs’ Discovery

## Reminder: March 25 “How Should Applied Science Journal Editors Deal With Statistical Controversies?” (Mark Burgman)

*The seventh meeting of our Phil Stat Forum:*

**The Statistics Wars and Their Casualties**

**25 March, 2021**

**TIME: 15:00-16:45 (London); 11:00-12:45 (New York, NOTE TIME CHANGE TO MATCH UK TIME)**

**For information about the Phil Stat Wars forum and how to join, click on this link.**

**“How should applied science journal editors deal with statistical controversies?”**

**Mark Burgman** Continue reading


## Souvenir From the NISS Stat Debate for Users of Bayes Factors (& P-Values)

What would I say is the most important takeaway from last week’s NISS “statistics debate” if you’re using (or contemplating using) Bayes factors (BFs)–of the sort Jim Berger recommends–as replacements for P-values? It is that J. Berger only regards the BFs as appropriate when there’s grounds for a high concentration (or spike) of probability on a sharp null hypothesis, e.g., H_{0}: θ = θ_{0}.

Thus, it is crucial to distinguish between precise hypotheses that are just stated for convenience and have no special prior believability, and precise hypotheses which do correspond to a concentration of prior belief. (J. Berger and Delampady 1987, p. 330).

## My Responses (at the P-value debate)

How did I respond to those 7 burning questions at last week’s (“P-Value”) Statistics Debate? Here’s a fairly close transcript of my (a) general answer, and (b) final remark, for each question–without the in-between responses to Jim and David. The exception is question 5 on Bayes factors, which naturally included Jim in my general answer.

The questions with the most important consequences, I think, are questions 3 and 5. I’ll explain why I say this in the comments. Please share your thoughts. Continue reading

## The P-Values Debate

## The Statistics Debate! (NISS DEBATE, October 15, Noon – 2 pm ET)

**October 15, Noon – 2 pm ET (Website)**

*Where do* **YOU** *stand?*

Given the issues surrounding the misuses and abuse of p-values, do you think p-values should be used? Continue reading

## August 6: JSM 2020 Panel on P-values & “Statistical Significance”

**July 30** **PRACTICE** **VIDEO** for JSM talk (All materials for Practice JSM session here)

JSM 2020 Panel Flyer (PDF)

JSM online program (with panel abstract & information):

## JSM 2020: P-values & “Statistical Significance”, August 6

## My paper, “P values on Trial” is out in Harvard Data Science Review

My new paper, “*P* Values on Trial: Selective Reporting of (Best Practice Guides Against) Selective Reporting,” is out in *Harvard Data Science Review* (*HDSR*). *HDSR* describes itself as “A Microscopic, Telescopic, and Kaleidoscopic View of Data Science.” The editor-in-chief is Xiao-Li Meng, a statistician at Harvard. He writes a short blurb on each article in his opening editorial of the issue. Continue reading

## The NAS fixes its (main) mistake in defining P-values!

Remember when I wrote to the National Academy of Sciences (NAS) in September pointing out mistaken definitions of P-values in their document on Reproducibility and Replicability in Science? (see my 9/30/19 post). I’d given up on their taking any action, but yesterday I received a letter from the NAS Senior Program Officer:

Dear Dr. Mayo,

I am writing to let you know that the Reproducibility and Replicability in Science report has been updated in response to the issues that you have raised.

Two footnotes, on pages ~~31~~ 35 and 221, highlight the changes. The updated report is available from the following link: NEW 2020 NAS DOC

Thank you for taking the time to reach out to me and to Dr. Fineberg and letting us know about your concerns.

With kind regards and wishes of a happy 2020,

Jenny Heimberg

Jennifer Heimberg, Ph.D.

Senior Program Officer

The National Academies of Sciences, Engineering, and Medicine

## P-Value Statements and Their Unintended(?) Consequences: The June 2019 ASA President’s Corner (b)

I never met Karen Kafadar, the 2019 President of the American Statistical Association (ASA), but the other day I wrote to her in response to a call in her extremely interesting June 2019 President’s Corner, “Statistics and Unintended Consequences”:

- “I welcome your suggestions for how we can communicate the importance of statistical inference and the proper interpretation of p-values to our scientific partners and science journal editors in a way they will understand and appreciate and can use with confidence and comfort—before they change their policies and abandon statistics altogether.”

I only recently came across her call, and I will share my letter below. First, here are some excerpts from her June President’s Corner (her December report is due any day). Continue reading