Author Archives: Mayo

Memory Lane (4 years ago): Why significance testers should reject the argument to “redefine statistical significance”, even if they want to lower the p-value*


An argument that assumes the very thing that was to have been argued for is guilty of begging the question; signing on to an argument whose conclusion you favor even though you cannot defend its premises is to argue unsoundly, and in bad faith. When a whirlpool of “reforms” subliminally alters the nature and goals of a method, falling into these sins can be quite inadvertent. Start with a simple point on defining the power of a statistical test.

I. Redefine Power?

Given that power is one of the most confused concepts from Neyman-Pearson (N-P) frequentist testing, it’s troubling that in “Redefine Statistical Significance”, power gets redefined too. “Power,” we’re told, is a Bayes Factor BF “obtained by defining H1 as putting ½ probability on μ = ± m for the value of m that gives 75% power for the test of size α = 0.05. This H1 represents an effect size typical of that which is implicitly assumed by researchers during experimental design.” (material under Figure 1).

The Bayes factor discussed is of H1 over H0, in two-sided Normal testing of H0: μ = 0 versus H1: μ ≠ 0.

“The variance of the observations is known. Without loss of generality, we assume that the variance is 1, and the sample size is also 1.” (p. 2 supplementary)

“This is achieved by assuming that μ under the alternative hypothesis is equal to ± (z0.025 + z0.75) = ± 2.63 [1.96 + .67]. That is, the alternative hypothesis places ½ its prior mass on 2.63 and ½ its mass on -2.63”. (p. 2 supplementary)

Putting to one side whether this is “without loss of generality”, the use of “power” is quite different from the correct definition. The power of a test T (with type I error probability α) to detect a discrepancy μ’ is the probability T generates an observed difference that is statistically significant at level α, assuming μ = μ’. The value z = 2.63 comes from the fact that the alternative against which this test has power .75 is the value .67 SE in excess of the cut-off for rejection. (Since an SE is 1, they add .67 to 1.96.) I don’t really see why it’s advantageous to ride roughshod over the definition of power, and it’s not the main point of this blogpost, but it’s worth noting if you’re to avoid sinking into the quicksand.
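To make the definition concrete, here is a minimal Python sketch (mine, not from the paper or its supplement) that computes the power of the one-observation Normal test and recovers the alternatives against which the two-sided α = 0.05 test has 75% and 80% power. The function names are hypothetical, the SE is taken to be 1 as in the quoted setup, and the negligible chance of rejecting in the opposite tail is ignored.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard Normal: mean 0, SE 1 (as in the quoted setup)


def power(mu_alt: float, cutoff: float) -> float:
    """P(Z >= cutoff) when Z ~ N(mu_alt, 1); the opposite tail is ignored."""
    return 1.0 - STD_NORMAL.cdf(cutoff - mu_alt)


def alternative_with_power(target_power: float, cutoff: float) -> float:
    """The mu' against which the test rejecting at `cutoff` has the target power."""
    return cutoff + STD_NORMAL.inv_cdf(target_power)


cutoff_05 = STD_NORMAL.inv_cdf(1 - 0.05 / 2)    # two-sided .05 cut-off, about 1.96
m_75 = alternative_with_power(0.75, cutoff_05)  # about 2.63
m_80 = alternative_with_power(0.80, cutoff_05)  # about 2.80

print(f"cut-off = {cutoff_05:.2f}")
print(f"75% power alternative = {m_75:.2f} (power check: {power(m_75, cutoff_05):.2f})")
print(f"80% power alternative = {m_80:.2f} (power check: {power(m_80, cutoff_05):.2f})")
```

Note that power here is a property of the test at a chosen μ’, computed before any data are seen; nothing in the calculation treats μ’ as a prior belief about the effect.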

Let’s distinguish the appropriateness of the test for a Bayesian, from its appropriateness as a criticism of significance tests. The latter is my sole focus. The criticism is that, at least if we accept these Bayesian assignments of priors, the posterior probability on H0 will be larger than the p-value. So if you were to interpret a p-value as a posterior on H0 (a fallacy) or if you felt intuitively that a .05 (2-sided) statistically significant result should correspond to something closer to a .05 posterior on H0, you should instead use a p-value of .005–or so it is argued. I’m not sure of the posterior on H0, but the BF is between around 14 and 26.[1] That is the argument. If you lower the required p-value, it won’t be so easy to get statistical significance, and irreplicable results won’t be as common. [2]

The alternative corresponding to the preferred p =.005 requirement

“corresponds to a classical, two-sided test of size α = 0.005. The alternative hypothesis for this Bayesian test places ½ mass at 2.81 and ½ mass at -2.81. The null hypothesis for this test is rejected if the Bayes factor exceeds 25.7. Note that this curve is nearly identical to the “power” curve if that curve had been defined using 80% power, rather than 75% power. The Power curve for 80% power would place ½ its mass at ±2.80”. (Supplementary, p. 2)

z = 2.8 comes from adding .84 SE to the cut-off: 1.96 SE +.84 SE = 2.8. This gets to the alternative vs which the α = 0.05 test has 80% power. (See my previous post on power.)
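The Bayes factor quoted above can be reproduced with a short sketch, assuming (as in the quoted supplement) a single Normal observation with variance 1 and an alternative that puts half its mass on +m and half on -m. The function name `bayes_factor` is my own label for illustration, not the authors’.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def bayes_factor(z: float, m: float) -> float:
    """BF of H1 (half the prior mass on +m, half on -m) over H0: mu = 0,
    for one observation z ~ N(mu, 1)."""
    like_h1 = 0.5 * NormalDist(m, 1).pdf(z) + 0.5 * NormalDist(-m, 1).pdf(z)
    like_h0 = STD_NORMAL.pdf(z)
    return like_h1 / like_h0


z_005 = STD_NORMAL.inv_cdf(1 - 0.005 / 2)  # two-sided .005 cut-off, about 2.81

bf_matched = bayes_factor(z_005, z_005)    # mass at +/-2.81, the alternative tied to p = .005
bf_power80 = bayes_factor(z_005, 2.80)     # mass at +/-2.80, the "80% power" alternative

print(f"z = {z_005:.2f}: BF with mass at +/-{z_005:.2f} is {bf_matched:.1f}")
print(f"z = {z_005:.2f}: BF with mass at +/-2.80 is {bf_power80:.1f}")
```

The two alternatives give nearly the same BF at the cut-off, which is all the “nearly identical” remark in the supplement amounts to.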

Is this a good form of inference from the Bayesian perspective? (Why are we comparing μ = 0 and μ = 2.8?). As is always the case with “p-values exaggerate” arguments, there’s the supposition that testing should be on a point null hypothesis, with a lump of prior probability given to H0 (or to a region around 0 so small that it’s indistinguishable from 0). I leave those concerns for Bayesians, and I’m curious to hear from you. More importantly, does it constitute a relevant and sound criticism of significance testing? Let’s be clear: a tester might well have her own reasons for preferring z = 2.8 rather than z = 1.96, but that’s not the question. The question is whether they’ve provided a good argument for the significance tester to do so.

II. What might the significance tester say?

For starters, when she sets .8 power to detect a discrepancy, she doesn’t “implicitly assume” it’s a plausible population discrepancy, but simply one she wants the test to detect by producing a statistically significant difference (with probability .8). And if the test does produce a difference that differs statistically significantly from H0, she does not infer the alternative against which the test had high power, call it μ’. (The supposition that she does grows out of fallaciously transposing “the conditional” involved in power.) Such a rule of interpreting data would have a high error probability of erroneously inferring a discrepancy μ’ (here 2.8).

The significance tester merely seeks evidence of some (genuine) discrepancy from 0, and eschews a comparative inference such as the ratio of the probability of the data under the points 0 and 2.63 (or 2.8). I don’t say there’s no role for a comparative inference, nor preclude someone arguing it is comparing how well μ = 2.8 “explains” the data compared to μ = 0 (given the assumptions), but the form of inference is so different from significance testing, it’s hard to compare them. She definitely wouldn’t ignore all the points in between 0 and 2.8. A one-sided test is preferable (unless the direction of discrepancy is of no interest). While one or two-sided doesn’t make that much difference for a significance tester, it makes a big difference for the type of Bayesian analysis that is appealed to in the “p-values exaggerate” literature. That’s because a lump prior, often .5 (but here .9!), is placed on the point 0 null. Without the lump, the p-value tends to be close to the posterior probability for H0, as Casella and Berger (1987a,b) show–even though p-values and posteriors are actually measuring very different things.

“In fact it is not the case that P-values are too small, but rather that Bayes point null posterior probabilities are much too big!….Our concern should not be to analyze these misspecified problems, but to educate the user so that the hypotheses are properly formulated,” (Casella and Berger 1987b, p. 334, p. 335).

There is a long and old literature on all this (at least since Edwards, Lindman and Savage 1963–let me know if you’re aware of older sources).

Those who lodge the “p-values exaggerate” critique often say, we’re just showing what would happen even if we made the strongest case for the alternative. No they’re not. They wouldn’t be putting the lump prior on 0 were they concerned not to bias things in favor of the null, and they wouldn’t be looking to compare 0 with so far away an alternative as 2.8 either.

The only way a significance tester can appraise or calibrate a measure such as a BF (and these will differ depending on the alternative picked) is to view it as a statistic and consider the probability of an even larger BF under varying assumptions about the value of μ. This is an error probability associated with the method. Accounts that appraise inferences according to the error probability of the method used I call error statistical (which is less equivocal than frequentist or other terms.)

For example, rejecting H0 when z ≥ 1.96 (which is the .05 test, since they make it 2-sided), we said, had .8 power to detect μ = 2.8, but with the .005 test it has only 50% power to do so. If one insists on a fixed .005 cut-off, this is construed as no evidence against the null (or even evidence for it–for a Bayesian). The new test has only 30% probability of finding significance were the data generated by μ = 2.3. So the significance tester is rightly troubled by the raised type II error [3], although the members of an imaginary Toxic Co. (having the risks of their technology probed) might be happy as clams.[4]
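The power figures above can be checked with a short sketch, again assuming one Normal observation with SE 1 and ignoring the opposite tail. The helper `power` is a hypothetical name used only for this illustration.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def power(mu_alt: float, cutoff: float) -> float:
    """Probability of a statistically significant result (Z >= cutoff) when mu = mu_alt."""
    return 1.0 - STD_NORMAL.cdf(cutoff - mu_alt)


cutoff_05 = STD_NORMAL.inv_cdf(1 - 0.05 / 2)    # about 1.96
cutoff_005 = STD_NORMAL.inv_cdf(1 - 0.005 / 2)  # about 2.81

for mu in (2.8, 2.3):
    print(
        f"mu = {mu}: power at .05 = {power(mu, cutoff_05):.2f}, "
        f"power at .005 = {power(mu, cutoff_005):.2f}"
    )
```

The drop in power is the same thing as the rise in the type II error probability (1 minus power) at each of these alternatives.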

Suppose we do attain statistical significance at the recommended .005 level, say z = 2.8. The BF advocate assures us we can infer μ = 2.8, which is now 25 times as likely as μ = 0 (if all the various Bayesian assignments hold). The trouble is, the significance tester doesn’t want to claim good evidence for μ = 2.8. The significance tester merely infers an indication of a discrepancy (an isolated low p-value doesn’t suffice, and the assumptions also must be checked). She’d never ignore all the points other than 0 and ± 2.8. Suppose we were testing μ ≤ 2.7 vs. μ > 2.7, and observed z = 2.8. What is the p-value associated with this observed difference? The answer is ~.46. (Her inferences are not in terms of points but of discrepancies from the null, but I’m trying to relate the criticism to significance tests.) To obtain μ ≥ 2.7 using one-sided confidence intervals would require a confidence level of .54: an absurdly low confidence level/high error probability.

The one-sided lower .975 bound with z = 2.8 would only entitle inferring μ > .84 (2.8 – 1.96)–quite a bit smaller than inferring μ = 2.8. If confidence levels are altered as well (and I don’t see why they wouldn’t be), the one-sided lower .995 bound would only be μ > 0. Thus, while the lump prior on H0 results in a bias in favor of a null–increasing the type II error probability–it’s of interest to note that achieving the recommended p-value licenses an inference much larger than what the significance tester would allow.
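Here is a small sketch of the arithmetic behind the p-value for μ ≤ 2.7 and the one-sided .975 lower bound, assuming the observed z = 2.8 and SE = 1 from the example. The variable and function names are mine, for illustration only.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()
z_obs = 2.8  # observed difference, in SE units


def lower_bound(z: float, level: float) -> float:
    """One-sided lower confidence bound for mu at the given confidence level."""
    return z - STD_NORMAL.inv_cdf(level)


# p-value for testing mu <= 2.7 vs. mu > 2.7 with z_obs = 2.8
p_value_27 = 1.0 - STD_NORMAL.cdf(z_obs - 2.7)

# confidence level at which the one-sided lower bound would sit at 2.7
conf_level_27 = 1.0 - p_value_27

lb_975 = lower_bound(z_obs, 0.975)

print(f"p-value for mu <= 2.7: {p_value_27:.2f}")
print(f"confidence level needed to infer mu > 2.7: {conf_level_27:.2f}")
print(f"one-sided lower .975 bound: mu > {lb_975:.2f}")
```

The contrast is between what the comparative BF inference points to (μ = 2.8) and what an error statistical bound with a respectable confidence level would license from the same z.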

Note that their inferences remain comparative in the sense of “H1 over H0” on a given measure; the BF doesn’t actually say there’s evidence against (or for) either (unless it goes on to compute a posterior, not just odds ratios or BFs), nor does it falsify either hypothesis. This just underscores the fact that the BF comparative inference is importantly different from significance tests which seek to falsify a null hypothesis, with a view toward learning if there are genuine discrepancies, and if so, their magnitude.

Significance tests do not assign probabilities to these parametric hypotheses, but even if one wanted to, the spiked priors needed for the criticism are questioned by Bayesians and frequentists alike. Casella and Berger (1987a) say that “concentrating mass on the point null hypothesis is biasing the prior in favor of H0 as much as possible” (p. 111) whether in one or two-sided tests. According to them “The testing of a point null hypothesis is one of the most misused statistical procedures.” (ibid., p. 106)

III. Why significance testers should reject the “redefine statistical significance” argument:

(i) If you endorse this particular Bayesian way of attaining the BF, fine, but then your argument begs the central question against the significance tester (or of the confidence interval estimator, for that matter). The significance tester is free to turn the situation around, as Fisher does, as refuting the assumptions:

Even if one were to imagine that H0 had an extremely high prior probability, says Fisher—never minding “what such a statement of probability a priori could possibly mean” (Fisher, 1973, p. 42)—the resulting high a posteriori probability to H0, he thinks, would only show that “reluctance to accept a hypothesis strongly contradicted by a test of significance” (ibid., p. 44) … “…is not capable of finding expression in any calculation of probability a posteriori” (ibid., p. 43). Indeed, if one were to consider the claim about the a priori probability to be itself a hypothesis, Fisher says, “it would be rejected at once by the observations at a level of significance almost as great [as reached by H0]. …Were such a conflict of evidence, as has here been imagined under discussion… in a scientific laboratory, it would, I suggest, be some prior assumption…that would certainly be impugned.” (p. 44)

(ii) Suppose, on the other hand, you don’t endorse these priors or the Bayesian computation on which the “redefine significance” argument turns. Since lowering the p-value cut-off doesn’t seem too harmful, you might tend to look the other way as to the argument on which it is based. Isn’t that OK? Not unless you’re prepared to have your students compute these BFs and/or posteriors in just the manner upon which the critique of significance tests rests. Will you say, “oh that was just for criticism, not for actual use”? Unless you’re prepared to defend the statistical analysis, you shouldn’t support it. Lowering the p-value that you require for evidence of a discrepancy, or getting more data (should you wish to do so) doesn’t require it.

Moreover, your student might point out that you still haven’t matched p-values and BFs (or posteriors on H0): They still differ, with the p-value being smaller. If you wanted to match the p-value and the posterior, you could do so very easily: use the frequency matching priors (which don’t use the spike). You could still lower the p-value to .005, and obtain a rejection region precisely identical to the Bayesian’s. Why isn’t that a better solution than one based on a conflicting account of statistical inference?

Of course, even that is to grant the problem as put before us by the Bayesian argument. If you’re following good error statistical practice you might instead shirk all cut-offs. You’d report attained p-values, and wouldn’t infer a genuine effect until you’ve satisfied Fisher’s requirements: (a) Replicate yourself, show you can bring about results that “rarely fail to give us a statistically significant result” (1947, p. 14) and that you’re getting better at understanding the causal phenomenon involved. (b) Check your assumptions: the statistical model, the measurements, and the links between statistical measurements and research claims. (c) Make sure you adjust your error probabilities to take account of, or at least report, biasing selection effects (from cherry-picking, trying and trying again, multiple testing, flexible determinations, post-data subgroups)–according to the context. That’s what prespecified reports are to inform you of. The suggestion that these are somehow taken care of by adjusting the pool of hypotheses on which you base a prior will not do. (It’s their plausibility that often makes them so seductive, and anyway, the injury is to how well-tested claims are, not to their prior believability.) The appeal to diagnostic testing computations of “false positive rates” in this paper opens up a whole new urn of worms. Don’t get me started. (see related posts.)

A final word is from a guest post by Senn.  Harold Jeffreys, he says, held that if you use the spike (which he introduced), you are to infer the hypothesis that achieves greater than .5 posterior probability.

Within the Bayesian framework, in abandoning smooth priors for lump priors, it is also necessary to change the probability standard. (In fact I speculate that the 1 in 20 standard seemed reasonable partly because of the smooth prior.) … A parsimony principle is used on the prior distribution. You can’t use it again on the posterior distribution. Once that is calculated, you should simply prefer the more probable model. The error that is made is not only to assume that P-values should be what they are not but that when one tries to interpret them in the way that one should not, the previous calibration survives.

It is as if in giving recommendations in dosing children one abandoned a formula based on age and adopted one based on weight but insisted on using the same number of kg one had used for years.

Error probabilities are not posterior probabilities. Certainly, there is much more to statistical analysis than P-values but they should be left alone rather than being deformed in some way to become second class Bayesian posterior probabilities. (Senn)

Please share your views, and alert me to errors. I will likely update this. Stay tuned for asterisks.
12/17 * I’ve already corrected a few typos.

[1] I do not mean the “false positive rate” defined in terms of α and (1 – β)–a problematic animal I put to one side here (Mayo 2003). Richard Morey notes that using their prior odds of 1:10, even the recommended BF of 26 gives us an unimpressive  posterior odds ratio of 2.6 (email correspondence).
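Morey’s point in note [1] is just the odds form of Bayes’ rule. A minimal sketch, assuming prior odds of 1:10 for H1 against H0 and a BF of 26 in favor of H1 (variable names are mine):

```python
prior_odds_h1 = 1 / 10   # 1:10 odds for H1 against H0, roughly a .9 lump on the null
bayes_factor_h1 = 26     # the recommended BF in favor of H1

# Posterior odds = prior odds x Bayes factor
posterior_odds_h1 = prior_odds_h1 * bayes_factor_h1

# Convert odds to a posterior probability for H1
posterior_prob_h1 = posterior_odds_h1 / (1 + posterior_odds_h1)

print(f"posterior odds for H1: {posterior_odds_h1:.1f}")
print(f"posterior probability of H1: {posterior_prob_h1:.2f}")
```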

[2] Note what I call the “fallacy of replication”. It’s said to be too easy to get low p-values, but at the same time it’s too hard to get low p-values in replication. Is it too easy or too hard? That just shows it’s not the p-value at fault but cherry-picking and other biasing selection effects. Replicating a p-value is hard–when you’ve cheated or been sloppy  the first time.

[3] They suggest increasing the sample size to get the power where it was with rejection at z = 1.96, and, while this is possible in some cases, increasing the sample size changes what counts as one sample. As n increases the discrepancy indicated by any level of significance decreases.
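A rough sketch of both halves of note [3], under assumptions that are mine: a one-sample Normal test with known σ = 1, 80% power as the target, and illustrative sample sizes of 25, 100 and 400. It computes the factor by which n must grow for the .005 test to recover the power the .05 test had, and shows how the observed difference that just reaches a fixed cut-off shrinks as n grows.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()

z_05 = STD_NORMAL.inv_cdf(1 - 0.05 / 2)    # two-sided .05 cut-off
z_005 = STD_NORMAL.inv_cdf(1 - 0.005 / 2)  # two-sided .005 cut-off
z_power = STD_NORMAL.inv_cdf(0.80)         # quantile for 80% power

# Required n is proportional to (cut-off + power quantile)^2 for a fixed discrepancy,
# so the ratio of the two squared sums is the sample size inflation factor.
n_ratio = ((z_005 + z_power) / (z_05 + z_power)) ** 2


def just_significant_difference(cutoff: float, n: int, sigma: float = 1.0) -> float:
    """Observed mean difference that just reaches `cutoff` standard errors."""
    return cutoff * sigma / sqrt(n)


print(f"sample size inflation factor: {n_ratio:.2f}")
for n in (25, 100, 400):
    print(f"n = {n}: just-significant difference at .005 = "
          f"{just_significant_difference(z_005, n):.3f}")
```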

[4] The severe tester would report attained levels and, in this case, would report the discrepancies indicated and ruled out with reasonable severity (Mayo and Spanos 2011). Keep in mind that statistical testing inferences are in the form of µ > µ’ = µ0 + δ, or µ ≤ µ’ = µ0 + δ, or the like. They are not to point values. As for the imaginary Toxic Co., I’d put the existence of a risk of interest in the null hypothesis of a one-sided test.

Related Posts

10/26/17: Going round and round again: a roundtable on reproducibility & lowering p-values

10/18/17: Deconstructing “A World Beyond P-values”

1/19/17: The “P-values overstate the evidence against the null” fallacy

8/28/16 Tragicomedy hour: p-values vs posterior probabilities vs diagnostic error rates

12/20/15 Senn: Double Jeopardy: Judge Jeffreys Upholds the Law, sequel to the pathetic p-value.

2/1/14 Comedy hour at the Bayesian epistemology retreat: highly probable vs highly probed vs B-boosts

11/25/14: How likelihoodists exaggerate evidence from statistical tests

Elements of this post are from Mayo 2018.


Benjamin, D. J., Berger, J., Johannesson, M., Nosek, B. A., Wagenmakers, E.-J., Berk, R., … Johnson, V. (2017, July 22), “Redefine statistical significance”, Nature Human Behaviour.

Berger, J. O. and Delampady, M. (1987). “Testing Precise Hypotheses” and “Rejoinder”, Statistical Science 2(3), 317-335.

Berger, J. O. and Sellke, T.  (1987). “Testing a point null hypothesis: The irreconcilability of p values and evidence,” (with discussion). J. Amer. Statist. Assoc. 82: 112–139.

Casella, G. and Berger, R. (1987a). “Reconciling Bayesian and Frequentist Evidence in the One-sided Testing Problem,” (with discussion). J. Amer. Statist. Assoc. 82: 106–111, 123–139.

Casella, G. and Berger, R. (1987b). “Comment on Testing Precise Hypotheses by J. O. Berger and M. Delampady”, Statistical Science 2(3), 344–347.

Edwards, W., Lindman, H. and Savage, L. (1963). “Bayesian Statistical Inference for Psychological Research”, Psychological Review 70(3): 193-242.

Fisher, R. A. (1947). The Design of Experiments (4th ed.). Edinburgh: Oliver and Boyd. (First published 1935).

Fisher, R. A. (1973). Statistical Methods and Scientific Inference, 3rd ed,  New York: Hafner Press.

Ghosh, J. Delampady, M., and Samanta, T. (2006). An Introduction to Bayesian Analysis: Theory and Methods. New York: Springer.

Mayo, D. G. (2003). “Could Fisher, Jeffreys and Neyman have Agreed on Testing? Commentary on J. Berger’s Fisher Address,” Statistical Science 18: 19-24.

Mayo (2018), Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars. Cambridge (June 2018)

Mayo, D. G. and Spanos, A. (2011) “Error Statistics” in Philosophy of Statistics , Handbook of Philosophy of Science Volume 7 Philosophy of Statistics, (General editors: Dov M. Gabbay, Paul Thagard and John Woods; Volume eds. Prasanta S. Bandyopadhyay and Malcolm R. Forster.) Elsevier: 1-46.

Categories: Bayesian/frequentist, fallacy of rejection, P-values, reforming the reformers, spurious p values | 2 Comments

Our presentations from the PSA: Philosophy in Science (PinS) symposium

Philosophy in Science:
Can Philosophers of Science Contribute to Science?


Below are the presentations from our remote session on “Philosophy in Science” on November 13, 2021 at the Philosophy of Science Association meeting. We are having an extended discussion on Monday, November 22 at 3pm Eastern Standard Time. If you wish to take part, write to me of your interest by email (error) with the subject “PinS” or use comments below. (Include name, affiliation and email).

Session Abstract: Although the question of what philosophy can bring to science is an old topic, the vast majority of current philosophy of science is a meta-discourse on science, taking science as its object of study, rather than an attempt to intervene on science itself. In this symposium, we discuss a particular interventionist approach, which we call “philosophy in science (PinS)”, i.e., an attempt at using philosophical tools to make a significant scientific contribution. This approach remains rare, but has been very successful in a number of cases, especially in philosophy of biology, medicine, physics, statistics, and the social sciences. Our goal is to provide a description of PinS through both a bibliometric approach and the examination of specific case studies. We also aim to explain how PinS differs from mainstream philosophy of science and partly similar approaches such as “philosophy of science in practice”.

Here are the members and the titles of their talks. (Link to session/abstracts):

  • Thomas Pradeu (CNRS & University Of Bordeaux) & Maël Lemoine (University Of Bordeaux): Philosophy in Science: Definition and Boundaries
  • Deborah Mayo (Virginia Tech): My Philosophical Interventions in Statistics
  • Elliott Sober (University Of Wisconsin – Madison): Philosophical Interventions in Science – a Strategy and a Case Study (Parsimony)
  • Randolph Nesse (Arizona State University) & Paul Griffiths (University of Sydney): How Evolutionary Science and Philosophy Can Collaborate to Redefine Disease


T. Pradeu & M. Lemoine slides: “Philosophy in Science: Definition and Boundaries”:


D. Mayo slides: “Philosophical Interventions in the Statistics Wars”:


E. Sober: “Philosophical Interventions in Science – A Strategy and a Case Study (Parsimony)”


R. Nesse & P. Griffiths: “How Evolutionary Science and Philosophy Can Collaborate to Redefine Disease”:

Categories: PSA 2021 | 7 Comments

Our session is now remote: Philo of Sci Association (PSA): Philosophy IN Science (PinS): Can Philosophers of Science Contribute to Science?


Philosophy in Science: Can Philosophers of Science Contribute to Science?
     on November 13, 2-4 pm


OUR SESSION HAS BECOME REMOTE: PLEASE JOIN US on ZOOM! This session revolves around the intriguing question: Can Philosophers of Science Contribute to Science? They’re calling it philosophy “in” science–when philosophical ministrations actually intervene in a science itself.  This is the session I’ll be speaking in. I hope you will come to our session if you’re there–it’s hybrid, so you can’t see it through a remote link. But I’d like to hear what you think about this question–in the comments to this post. Continue reading

Categories: Announcement, PSA 2021 | Leave a comment

S. Senn: The Many Halls Problem (Guest Post)


Stephen Senn
Consultant Statistician
Edinburgh, Scotland


The Many Halls Problem
It’s not that paradox but another

Generalisation is passing…from the consideration of a restricted set to that of a more comprehensive set containing the restricted one…Generalization may be useful in the solution of problems. George Pólya [1] (P108)


In a previous blog I considered Lord’s Paradox[2], applying John Nelder’s calculus of experiments[3, 4]. Lord’s paradox involves two different analyses of the effect of two different diets, one for each of two different student halls, on weight of students. One statistician compares the so-called change scores or gain scores (final weight minus initial weight) and the other compares final weights, adjusting for initial weights using analysis of covariance. Since the mean initial weights vary between halls, the two analyses will come to different conclusions unless the slope of final on initial weights just happens to be one (in practice, it would usually be less). The fact that two apparently reasonable analyses would lead to different conclusions constitutes the paradox. I chose the version of the paradox outlined by Wainer and Brown [5] and also discussed in The Book of Why[6].  I illustrated this by considering two different experiments: one in which, as in the original example, the diet varies between halls and a further example in which it varies within halls. I simulated some data which are available in the appendix to that blog but which can also be downloaded from here so that any reader who wishes to try their hand at analysis can have a go. Continue reading

Categories: Lord's paradox, S. Senn | 7 Comments

I’ll be speaking at the Philo of Sci Association (PSA): Philosophy IN Science: Can Philosophers of Science Contribute to Science?


Philosophy in Science: Can Philosophers of Science Contribute to Science?
     on November 13, 2-4 pm


This session revolves around the intriguing question: Can Philosophers of Science Contribute to Science? They’re calling it philosophy “in” science–when philosophical ministrations actually intervene in a science itself.  This is the session I’ll be speaking in. I hope you will come to our session if you’re there–it’s hybrid, so you can’t see it through a remote link. But I’d like to hear what you think about this question–in the comments to this post. Continue reading

Categories: Error Statistics | 4 Comments

Philo of Sci Assoc (PSA) Session: Current Debates on Statistical Modeling and Inference



The Philosophy of Science Association (PSA) is holding its biennial meeting (one year late)–live/hybrid/remote*–in November, 2021, and I plan to be there (first in-person meeting since Feb 2020). Some of the members from the 2019 Summer Seminar that I ran with Aris Spanos are in a Symposium:

Current Debates on Statistical Modeling and Inference
     on November 13, 9 am-12:15 pm  

Here are the members and talks (Link to session/abstracts):

  • Aris Spanos (Virginia Tech): Self-Correction and Statistical Misspecification (co-author Deborah Mayo (Virginia Tech))
  • Ruobin Gong (Rutgers): Measuring Severity in Statistical Inference
  • Riet van Bork (University of Amsterdam): Psychometric Models: Statistics and Interpretation (co-author Jan-Willem Romeijn (University of Groningen))
  • Marcello di Bello (Lehman College CUNY): Is Algorithmic Fairness Possible?
  • Elay Shech (Auburn University): Statistical Modeling, Mis-specification Testing, and Exploration
Continue reading
Categories: Error Statistics | 1 Comment

The (Vaccine) Booster Wars: A prepost


We’re always reading about how the pandemic has created a new emphasis on preprints, so it stands to reason that non-reviewed preposts would now have a place in blogs. Maybe then I’ll “publish” some of the half-baked posts languishing in draft. I’ll update or replace this prepost after reviewing.

The Booster wars

Continue reading

Categories: the (Covid vaccine) booster wars | 18 Comments

Workshop-New Date!

The Statistics Wars
and Their Casualties

New Date!

4-5 April 2022

London School of Economics (CPNSS)

Yoav Benjamini (Tel Aviv University), Alexander Bird (University of Cambridge), Mark Burgman (Imperial College London),  Daniele Fanelli (London School of Economics and Political Science), Roman Frigg (London School of Economics and Political Science), Stephen Guettinger (London School of Economics and Political Science), David Hand (Imperial College London), Margherita Harris (London School of Economics and Political Science), Christian Hennig (University of Bologna), Katrin Hohl (City University London), Daniël Lakens (Eindhoven University of Technology), Deborah Mayo (Virginia Tech), Richard Morey (Cardiff University), Stephen Senn (Edinburgh, Scotland), Jon Williamson (University of Kent) Continue reading

Categories: Error Statistics | Leave a comment

All She Wrote (so far): Error Statistics Philosophy: 10 years on

Dear Reader: I began this blog 10 years ago (Sept. 3, 2011)! A double celebration is taking place at the Elbar Room–remotely for the first time due to Covid– both for the blog and the 3 year anniversary of the physical appearance of my book: Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars [SIST] (CUP, 2018). A special rush edition made an appearance on Sept 3, 2018 in time for the RSS meeting in Cardiff, where we had a session deconstructing the arguments against statistical significance tests (with Sir David Cox, Richard Morey and Aris Spanos). Join us between 7 and 8 pm in a drink of Elba Grease.


Many of the discussions in the book were importantly influenced (corrected and improved) by readers’ comments on the blog over the years. I posted several excerpts and mementos from SIST here. I thank readers for their input. Readers might want to look up the topics in SIST on this blog to check out the comments, and see how ideas were developed, corrected and turned into “excursions” in SIST.

I recently invited readers to weigh in on the ASA Task Force on Statistical Significance and Replication–any time through September–to be part of a joint guest post (or posts). All contributors will get a free copy of SIST. Continue reading

Categories: 10 year memory lane, Statistical Inference as Severe Testing | Leave a comment

Should Bayesian Clinical Trialists Wear Error Statistical Hats? (i)


I. A principled disagreement

The other day I was in a practice (zoom) for a panel I’m in on how different approaches and philosophies (Frequentist, Bayesian, machine learning) might explain “why we disagree” when interpreting clinical trial data. The focus is radiation oncology.[1] An important point of disagreement between frequentists (error statisticians) and Bayesians concerns whether, and if so how, to modify inferences in the face of a variety of selection effects, multiple testing, and stopping for interim analysis. Such multiplicities directly alter the capabilities of methods to avoid erroneously interpreting data, so the frequentist error probabilities are altered. By contrast, if an account conditions on the observed data, error probabilities drop out, and we get principles such as the stopping rule principle. My presentation included a quote from Bayarri and J. Berger (2004): Continue reading

Categories: multiple testing, statistical significance tests, strong likelihood principle | 26 Comments

Performance or Probativeness? E.S. Pearson’s Statistical Philosophy: Belated Birthday Wish

E.S. Pearson

This is a belated birthday post for E.S. Pearson (11 August 1895-12 June, 1980). It’s basically a post from 2012 which concerns an issue of interpretation (long-run performance vs probativeness) that’s badly confused these days. Yes, I know I’ve been neglecting this blog as of late, but this topic will appear in a new guise in a post I’m writing now, to appear tomorrow.


Are methods based on error probabilities of use mainly to supply procedures which will not err too frequently in some long run? (performance). Or is it the other way round: that the control of long run error properties is of crucial importance for probing the causes of the data at hand? (probativeness). I say no to the former and yes to the latter. This, I think, was also the view of Egon Sharpe (E.S.) Pearson. Continue reading

Categories: E.S. Pearson, Error Statistics | 2 Comments

Fair shares: sexual justice in patient recruitment in clinical trials



Stephen Senn
Consultant Statistician
Edinburgh, Scotland

It is hard to argue against the proposition that approaches to clinical research should treat not only men but also women fairly, and of course this applies also to other ways one might subdivide patients. However, agreeing to such a principle is not the same as acting on it and when one comes to consider what in practice one might do, it is far from clear what the principle ought to be. In other words, the more one thinks about implementing such a principle the less obvious it becomes as to what it is.

Three possible rules

Continue reading

Categories: evidence-based policy, PhilPharma, RCTs, S. Senn | 5 Comments

Invitation to discuss the ASA Task Force on Statistical Significance and Replication


The latest salvo in the statistics wars comes in the form of the publication of The ASA Task Force on Statistical Significance and Replicability, appointed by past ASA president Karen Kafadar in November/December 2019. (In the ‘before times’!) Its members are:

Linda Young, (Co-Chair), Xuming He, (Co-Chair) Yoav Benjamini, Dick De Veaux, Bradley Efron, Scott Evans, Mark Glickman, Barry Graubard, Xiao-Li Meng, Vijay Nair, Nancy Reid, Stephen Stigler, Stephen Vardeman, Chris Wikle, Tommy Wright, Karen Kafadar, Ex-officio. (Kafadar 2020)

The full report of this Task Force is in The Annals of Applied Statistics, and on my blogpost. It begins:

In 2019 the President of the American Statistical Association (ASA) established a task force to address concerns that a 2019 editorial in The American Statistician (an ASA journal) might be mistakenly interpreted as official ASA policy. (The 2019 editorial recommended eliminating the use of “p < 0.05” and “statistically significant” in statistical analysis.) This document is the statement of the task force… (Benjamini et al. 2021)

Continue reading

Categories: 2016 ASA Statement on P-values, ASA Task Force on Significance and Replicability, JSM 2020, National Institute of Statistical Sciences (NISS), statistical significance tests | 2 Comments

Statistics and the Higgs Discovery: 9 yr Memory Lane


I’m reblogging two of my Higgs posts at the 9th anniversary of the 2012 discovery. (The first was in this post.) The following was originally “Higgs Analysis and Statistical Flukes: part 2” (from March, 2013).[1]

Some people say to me: “severe testing is fine for ‘sexy science’ like in high energy physics (HEP)”, as if their statistical inferences are radically different. But I maintain that this is the mode by which data are used in “uncertain” reasoning across the entire landscape of science and day-to-day learning, at least when we’re trying to find things out.[2] Even with high level theories, the particular problems of learning from data are tackled piecemeal, in local inferences that afford error control. Granted, this statistical philosophy differs importantly from those that view the task as assigning comparative (or absolute) degrees-of-support/belief/plausibility to propositions, models, or theories.

The Higgs discussion finds its way into Tour III in Excursion 3 of my Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (2018, CUP). You can read it (in proof form) here, pp. 202-217, in a section with the provocative title:

3.8 The Probability Our Results Are Statistical Fluctuations: Higgs’ Discovery

Continue reading

Categories: Higgs, highly probable vs highly probed, P-values | Leave a comment

Statisticians Rise Up To Defend (error statistical) Hypothesis Testing


What is the message conveyed when the board of a professional association X appoints a Task Force intended to dispel the supposition that a position advanced by the Executive Director of association X reflects the views of association X on a topic that members of X disagree on? What it says to me is that there is a serious breakdown of communication amongst the leadership and membership of that association. So while I’m extremely glad that the ASA appointed the Task Force on Statistical Significance and Replicability in 2019, I’m very sorry that the main reason it was needed was to address concerns that an editorial put forward by the ASA Executive Director (and 2 others) “might be mistakenly interpreted as official ASA policy”. The 2021 Statement of the Task Force (Benjamini et al. 2021) explains:

In 2019 the President of the American Statistical Association (ASA) established a task force to address concerns that a 2019 editorial in The American Statistician (an ASA journal) might be mistakenly interpreted as official ASA policy. (The 2019 editorial recommended eliminating the use of “p < 0.05” and “statistically significant” in statistical analysis.) This document is the statement of the task force…

Continue reading

Categories: ASA Task Force on Significance and Replicability, Schachtman, significance tests | 9 Comments

June 24: “Have Covid-19 lockdowns led to an increase in domestic violence? Drawing inferences from police administrative data” (Katrin Hohl)

The tenth meeting of our Phil Stat Forum*:

The Statistics Wars
and Their Casualties

24 June 2021

TIME: 15:00-16:45 (London); 10:00-11:45 (New York, EDT)

For information about the Phil Stat Wars forum and how to join, click on this link.


“Have Covid-19 lockdowns led to an increase in domestic violence? Drawing inferences from police administrative data” 

Katrin Hohl Continue reading

Categories: Error Statistics | Leave a comment

At long last! The ASA President’s Task Force Statement on Statistical Significance and Replicability

The ASA President’s Task Force Statement on Statistical Significance and Replicability has finally been published. It found a home in The Annals of Applied Statistics, after everyone else they looked to, including the ASA itself, refused to publish it. For background see this post. I’ll comment on it in a later post. There is also an Editorial: Statistical Significance, P-Values, and Replicability by Karen Kafadar. Continue reading

Categories: ASA Task Force on Significance and Replicability | 10 Comments

The F.D.A.’s controversial ruling on an Alzheimer’s drug (letter from a reader)(ii)

I was watching Biogen’s stock (BIIB) climb over 100 points yesterday because its Alzheimer’s drug, aducanumab [brand name: Aduhelm], received surprising FDA approval.  I hadn’t been following the drug at all (it’s enough to try and track some Covid treatments/vaccines). I knew only that the FDA panel had unanimously recommended not to approve it last year, and the general sentiment was that it was heading for FDA rejection yesterday. After I received an email from Geoff Stuart[i] asking what I thought, I found out a bit more. He wrote: Continue reading

Categories: PhilStat/Med, preregistration | 10 Comments

Bayesian philosophers vs Bayesian statisticians: Remarks on Jon Williamson

While I would agree that there are differences between Bayesian statisticians and Bayesian philosophers, those differences don’t line up with the ones drawn by Jon Williamson in his presentation to our Phil Stat Wars Forum (May 20 slides). I hope Bayesians (statisticians, or more generally, practitioners, and philosophers) will weigh in on this. 

Continue reading

Categories: Phil Stat Forum, stat wars and their casualties | 11 Comments
