3 Commentaries on my Editorial are being published in Conservation Biology



There are 3 commentaries soon to be published in Conservation Biology on my editorial, “The statistics wars and intellectual conflicts of interest” also published in Conservation Biology.


Professor Philip B. Stark
Department of Statistics
University of California, Berkeley

You can read a draft of Philip Stark’s commentary here



Professor Christian Hennig
Department of Statistical Sciences “Paolo Fortunati”
University of Bologna.

Here is a draft of Christian Hennig’s commentary



Kent Staley and John Park, who each wrote individual commentaries for the blog, joined forces to write a joint commentary for the journal. You can read the draft here.


Kent W. Staley
Professor, Coordinator of Graduate Studies
Department of Philosophy
Saint Louis University






John Park, MD
Radiation Oncologist
Kansas City VA Medical Center


A commentary by Lakens will appear in the psychology journal that adopted the “abandon significance” recommendation discussed in my editorial. Others may appear as letters or part of longer papers elsewhere. I will update this blog once I have that information. I’m very impressed with these interesting, interdisciplinary efforts, and grateful to all the authors for pursuing them.

All of the initial blog commentaries on Mayo’s (2021) editorial (up through Jan 18, 2022) are below*

Ionides and Ritov

*There are 2 more I will shortly post. I stopped when David Cox died to post some memorial items.

For background: The slides and video from our January 11, 2022 Forum, with presentations by Y. Benjamini, D. Hand and me, which grew out of my editorial, can be found here:

January 11 Forum: “Statistical Significance Test Anxiety” : Benjamini, Mayo, Hand

Categories: Mayo editorial, significance tests

A statistically significant result indicates H’ (μ > μ’) when POW(μ’) is low (not the other way round)–but don’t ignore the standard error


1. New monsters. One of the bizarre facts of life in the statistics wars is that a method from one school may be criticized on grounds that it conflicts with a conception that is the reverse of what that school intends. How is that even to be deciphered? That was the difficult task I set for myself in writing Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (CUP, 2018) [SIST 2018]. I thought I was done, but new monsters keep appearing. In some cases, rather than see how the notion of severity gets us beyond fallacies, misconstruals are taken to criticize severity! So, for example, in the last couple of posts, here and here, I deciphered some of the better known power howlers (discussed in SIST Excursion 5 Tour II). I’m linking to all of this tour (in proofs).

We may examine claim (I) in a typical one-sided test (of the mean):

H0: μ ≤ μ0 vs H1: μ > μ0

(I) Our Claim: If the power of the test to detect μ’ is high (i.e., POW(μ’) is high, e.g., over .5), then a just significant result is poor evidence that μ > μ’; while if POW(μ’) is low (e.g., less than .1), it’s good evidence that μ > μ’ (provided assumptions for these claims hold approximately).

Now this claim (I) follows directly from the definitions of terms, but some argue that this just goes to show what’s wrong with the terms, rather than with their construal of them.

Specific Example of test T+. Let’s use an example from our last post, taken from SIST (2018), just to illustrate. We’re testing the normal mean µ:

H0: µ ≤ 150 against H1: µ > 150

with σ = 10, SE = σ/√n  = 1.  The critical value for α =.025 is z = 1.96. That is, we reject the claim that the population mean is less than or equal to 150, (we reject µ ≤ 150) and infer there is evidence µ > 150 whenever the sample mean M is at least 1.96 SE in excess of 150, i.e., when M > 150 + 1.96(1). For simplicity, let’s use the 2 SE cut-off as the critical value for rejecting:

Test T+: with n = 100: Reject µ ≤ 150 when M > 150 + 2SE = 152.

QUESTION: Now, suppose your outcome just makes it over the hurdle, M = 152. Does it make sense to say that this outcome is better evidence that µ is at least 153 than it is evidence that µ is at least 152? Well, POW(153) > POW(152), so if we thought the higher the power against μ’ the stronger the evidence that µ is at least µ’, then the answer would be yes. But logic alone would tell us that since:

claim A (e.g., µ ≥ 153) entails claim B (e.g., µ ≥ 152), claim B should be better warranted than claim A.
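To make the comparison concrete, here is a minimal Python sketch of the power and severity computations for test T+ (σ = 10, n = 100, SE = 1, 2 SE cut-off). The function names `power` and `severity` are illustrative, not taken from SIST, and the sketch assumes the Normal model holds exactly.

```python
from statistics import NormalDist

Z = NormalDist()  # standard Normal distribution

def power(mu_prime, mu0=150.0, se=1.0, cutoff_se=2.0):
    """POW(mu'): probability that M exceeds the cut-off, computed under mu = mu'."""
    cutoff = mu0 + cutoff_se * se
    return 1 - Z.cdf((cutoff - mu_prime) / se)

def severity(mu_prime, m_obs, se=1.0):
    """SEV(mu > mu'): probability of a sample mean no larger than m_obs, under mu = mu'."""
    return Z.cdf((m_obs - mu_prime) / se)

# Power against each mu', and severity for mu > mu' given the just significant M = 152
for mu_prime in (150.0, 151.0, 152.0, 153.0):
    print(mu_prime, round(power(mu_prime), 3), round(severity(mu_prime, 152.0), 3))
```

With M = 152 sitting exactly on the cut-off, the severity for µ > µ’ is the complement of POW(µ’), so higher power against µ’ goes with weaker warrant for µ > µ’, in line with claim (I).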

Nevertheless, we showed how one can make sense of the allegation that if you reach a just statistically significant result, yet the test had low power to detect a discrepancy from the null that is known from external sources to be correct, and you use the observed M as an estimate of µ (rather than a lower CI bound), then you’ll be “exaggerating” µ. Some take away from this that low power for µ’ indicates poor evidence for µ > µ’. Or they put it as a comparative: the higher the power to detect µ’, the better the evidence for µ > µ’. This conflicts with our claim (I). We show that (I) is correct, but some may argue that upholding (I) is a problem for severity!

2.  A severity critic. One such critic, Rochefort-Maranda (2020), hereafter RM, writes: “My aim…is to show how the problem of inflated effect sizes…corrupts the severity measure of evidence”, where severity is from Mayo and Spanos 2006, 2011. But his example actually has a sample size of only 10! You would be right to suspect violated assumptions, and Aris Spanos (2022), in his article in Philosophy of Science, shows in great detail how far his example is from satisfying the assumptions.

“[RM’s] claim that ‘the more powerful a test that rejects H0, the more the evidence against H0,’ constitutes a misconception. This claim is based on misunderstanding the difference between aiming for “a large n” predata to increase the power of the test (a commendable strategy) and what the particular power implies, postdata” (p. 16)

“Rochefort-Maranda’s (2020) case against the postdata severity evaluation, built on a numerical example using a ‘bad draw’ of simulated data with n = 10, illustrates how one can generate untrustworthy evidence (inconsistent estimators and an underpowered test) and declare severity as the culprit for the ensuing dubious results. His discussion is based on several misconceptions about the proper implementation and interpretation of frequentist testing.” (p. 18)

RM’s data, Spanos shows,

“is a form of “simulated data dredging” that describes the practice of simulating hundreds of replications of size n by changing the “seed” of the pseudorandom number algorithm in search of a desired result.” (ibid.)

Here the desired result is one that appears to lead to endorsing an inference with high severity even where it is stipulated we happen to know the inference is false. He doesn’t show such a misleading result to be probable, merely logically possible, and in fact, he himself says it’s practically impossible to replicate such misleading data!

3. The whole criticism is wrong-headed, even if assumptions hold. I want to raise another very general problem that would hold for such a criticism even if we imagine all the assumptions are met. (This is different from the “M inflates µ” problem.) In a nutshell: The fact that one can imagine a parameter value excluded from a confidence interval CI at a reasonable CI level is scarcely an indictment of CIs! RM’s argument is just of that form, and it seems to me he could have spared himself the elaborate simulations and programming to show this. He overlooks the fact that the error probability must be included to qualify the inference, be it a p-value, confidence interval, or severity assessment.

Go back to our example. We observe M = 152 and our critic says: But suppose we knew the true µ was 150.01.  RM is alarmed: We have observed a difference from 150 of 2 when the true difference is only .01. He blames it on the low power against .01.

“That is because we have a significant result with an underpowered test such that the effect size is incredibly bigger than reality [200 times bigger]. The significance tester ‘would thus be wrong to believe that there is such a substantial difference from H0. But S would feel warranted to reach a similarly bad conclusion with the severity score.” (Rochefort-Maranda 2020)

Bad conclusion? Let’s separate his two allegations [i]: Yes, it would be wrong to take the observed M as the population µ–but do people not already know this? (One would be wrong ~50% of the time.) I come back to this at the end with some quotes from Stephen Senn.

But there’s scarcely anything bad about inferring µ > 150.04—provided the assumptions hold approximately. It’s a well warranted statistical inference.

The .975 lower bound with M = 152 is µ > 150.04.

RM comes along and says but suppose I know µ = 150.01. Of course, we don’t know the true value µ. But let’s suppose we do. Then the alleged problem is that we infer µ > 150.04 with severity .975 (the lower .975 confidence bound), and we’re “wrong” because the true value is outside the confidence interval!! Note that the inference to µ > 150.01 is even stronger, severity .98.

Insofar as a statistical inference account is fallible, a CI, even at high confidence level, can exclude the imagined known µ value. This is no indictment of the method. The same is true for a severity assessment, and of course, there is a duality between tests and CIs. We control such errors at small values as we choose the error probability associated with the method.

Remember, the form of inference (with CIs and tests) is not to a point value but to an inequality such as µ > 150.01.

Of course the inference is actually: if the assumptions hold approximately, then there is a good indication that µ > 150.01. The p-value ~.02. The confidence level with 150.01 as lower bound is ~.98. The fact that the power against µ = 150.01 is low, is actually a way to explain why the just statistically significant M is a good indication that µ > 150.01. (ii)
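The figures quoted here can be checked with a short Python sketch. It assumes SE = 1, the observed M = 152, and the 1.96 SE cut-off of 151.96; the variable names are illustrative only.

```python
from statistics import NormalDist

Z = NormalDist()          # standard Normal distribution
M, SE = 152.0, 1.0        # observed mean and standard error
CUTOFF = 150.0 + 1.96 * SE

# Lower .975 confidence bound: M minus 1.96 SE
lower_975 = M - Z.inv_cdf(0.975) * SE

# Severity for mu > 150.01: Pr(sample mean <= 152; mu = 150.01)
sev_15001 = Z.cdf((M - 150.01) / SE)

# P-value for M = 152 computed under mu = 150.01
p_value = 1 - sev_15001

# Power of the test against mu = 150.01 (barely above alpha)
pow_15001 = 1 - Z.cdf((CUTOFF - 150.01) / SE)

print(round(lower_975, 2), round(sev_15001, 2), round(p_value, 2), round(pow_15001, 3))
```

The low power against 150.01 and the high severity for µ > 150.01 come from the same tail area, read from opposite sides.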

Once again, as in our last post, if one knows the observed difference is radically out of line, one suspects the computations of the p-value, the lower confidence bound, and severity are illicit, typically by biasing selections, ignoring data-dredging, optional stopping and/or using a sample size too small to satisfy assumptions. This is what goes wrong in the RM illustration, as Spanos shows.

4. To conclude…. It does not make sense to say that because the test T+ has low power against values “close” to µ0 (here 150) that a statistically significant M isn’t good evidence that µ exceeds that value. At least not within the error statistical paradigm. It’s the opposite, and one only needs to remember that the power of the test against µ0 is α, say .025. It is even more absurd to say this in the case where M exceeds the 2SE cut-off (we’ve been supposing it just makes it to the statistically significant M, 151.96, or to simplify, 152). Suppose for example M = 153. This is even stronger grounds that µ > 150.01 (p-value ~.001).

On the other hand, it makes sense to say, since it’s true, that as sample size increases, the value of M needed to just reach .025 statistical significance gets closer to 150. So if you’re bound to use the observed M to estimate the population mean, then just reaching .025 significance is less of an exaggeration with higher n.

Question for the Reader: However, what if we compare two .025 statistically significant results in test T+ but with two sample sizes, say one with n = 100 (as before) and a second with n = 10,000? The .025 statistically significant result with n = 10,000 indicates less of a discrepancy from 150 than the one with n = 100. Do you see why? (Give your reply in the comments.) (See (iii))

Finally, to take the observed effect M as a good estimate of the true µ in the population is a bad idea—it is to ignore the fact that statistical methods have uncertainty. Rather, we would use the lower bound of a confidence interval with reasonably high confidence level (or corresponding high severity). If you think .975 or .95 give lower bounds that are too conservative, as someone who emailed me recently avers, then at least report the 1SE lower bound (for a confidence level of .84). Error statistical methods don’t hide the uncertainties associated with the method. If you do, it’s no wonder you end up with unwarranted claims. (iv)
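As a rough illustration of reporting lower bounds instead of the point estimate, the following Python sketch computes one-sided lower confidence bounds for the running example (M = 152, SE = 1) at the three levels mentioned. The helper name `lower_bound` is hypothetical, and the Normal model is assumed to hold.

```python
from statistics import NormalDist

def lower_bound(m_obs, se, level):
    """One-sided lower confidence bound: m_obs minus (Normal quantile for `level`) * se."""
    return m_obs - NormalDist().inv_cdf(level) * se

# Bounds at confidence levels .975 (about 2 SE), .95 and .84 (about 1 SE)
for level in (0.975, 0.95, 0.84):
    print(level, round(lower_bound(152.0, 1.0, level), 2))
```

Each bound is an inference of the form µ > bound, qualified by its confidence level; none of them is the observed 152.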

Stephen Senn on ignoring uncertainty. There’s a guest post on this blog by Stephen Senn, which alludes to R.A. Fisher and Cox and Hinkley (1974), on this issue:

“In my opinion, a great deal of confusion about statistics can be traced to the fact that the point estimate is seen as being the be all and end all, the expression of uncertainty being forgotten.

…to provide a point estimate without also providing a standard error is, indeed, an all too standard error. …if you don’t know how uncertain your evidence is, you can’t use it. Thus, assessing uncertainty is important. … This (perhaps unconscious) obsession with point estimation as the be all and end all causes problems. …

Point estimates are not enough. It is rarely the case that you have to act immediately based on your best guess. Where you don’t, you have to know how good your guesses are. This requires a principled approach to assessing uncertainty.” (Senn)

Use the comments for your queries and response to my “question for the reader”.


[i] I put aside the fact that we would never call the degree of corroboration a “score”.

(ii) I hear people say, well if the power against 150.01 is practically the same as α, then the test isn’t discriminating 150.01 from the null of 150. Fine. M warrants both µ > 150 as well as µ > 150.01, and the warrant for the former is scarcely stronger than the latter. So the warranted inferences are similar.

(iii) On the other hand, if a test just fails to make it to the statistically significant cut-off, and POW(µ’) is low, then there’s poor evidence that µ < µ’. It’s for an inference of this form that the low power creates problems.

(iv) I note in one of my comments that Ioannidis’ (2008) way of stating the “inflation” claim is less open to misconstrual. He says it’s the observed effect (i.e., the observed M) that “inflates” the “true” population effect when the test has low power to detect that effect (but he allows it can also underestimate the true effect), especially in the context of a variety of selective effects.


  • Rochefort-Maranda, G. (2020). Inflated effect sizes and underpowered tests: how the severity measure of evidence is affected by the winners’ curse. Philosophical Studies.
  • Mayo, D. G. (2018). Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars, CUP.
  • Mayo, D. G., & Spanos, A. (2006). Severe testing as a basic concept in a Neyman–Pearson philosophy of induction. The British Journal for the Philosophy of Science, 57(2), 323–357.
  • Mayo, D. G., & Spanos, A. (2011). Error statistics. Philosophy of Statistics, 7, 152–198.
  • Spanos, A. (2022). Severity and Trustworthy Evidence: Foundational Problems versus Misuses of Frequentist Testing. Philosophy of Science.





Categories: power, reforming the reformers, SIST, Statistical Inference as Severe Testing

Do “underpowered” tests “exaggerate” population effects? (iv)


You will often hear that if you reach a just statistically significant result “and the discovery study is underpowered, the observed effects are expected to be inflated” (Ioannidis 2008, p. 64), or “exaggerated” (Gelman and Carlin 2014). This connects to what I’m referring to as the second set of concerns about statistical significance tests, power and magnitude errors. Here, the problem does not revolve around erroneously interpreting power as a posterior probability, as we saw in the fallacy in this post. But there are other points of conflict with the error statistical tester, and much that cries out for clarification, else you will misunderstand the consequences of some of today’s reforms.

(1) In one sense, the charge is unexceptional: If the various discovery procedures in the examples these authors discuss (flexible stopping rules, data dredging, and a host of other biasing selection effects) are at work, then finding statistical significance fails to give evidence of a genuine population effect. In those cases, an assertion about evidence of a genuine effect could be said to be “inflating”, but that’s because the error probability assessments, and thus the computation of power, fail to hold. That is why, as Fisher stressed, “we need, not an isolated record of statistical significance”, but must show it stands up to audits of the data and to replication. Granted, the sample size must be large enough to sustain the statistical model assumptions, and when not, we have grounds to suspect violations.

Let’s clarify further points:

(2) For starters it is incorrect to speak of tests being “underpowered” (tout court), because power is always defined in terms of a specific discrepancy from a test hypothesis or alternative null hypothesis. At most, they can mean that the test has low power to detect discrepancies of interest, or low power to detect a magnitude of (population) effect that is assumed known to be true. (The latter is what these critics tend to have in mind.) Take the type of example from discussions of the “mountains out of molehills” fallacy (in this blog and in Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (CUP, 2018) [SIST 2018], Ex 5 Tour II), in testing the mean µ of a Normal distribution (with a random sample of size n): Test T+:

 H0: µ ≤ µ0 against H1: µ > µ0,

We can speak of a test T+ having low power to detect µ = µ’, (for µ’ a value in H1) while having high power to detect a larger discrepancy µ”. To remind us:

POW(µ’)–the power of the test to detect µ’ –is the probability the test  rejects H0, computed under the assumption that we are in a world where µ = µ’.

(I’ve often said that speaking of a test’s “sensitivity” would be less open to misconstrual, but it’s the same idea.) We want tests sufficiently powerful to detect discrepancies of interest, but once the data are in hand, a construal of the discrepancies warranted must take into account the sensitivity of the test that H0 has failed. (Is it like the fire alarm that goes off with burnt toast? Or one that only triggers when the house is ablaze?) 

If you want an alternative against which test T+ has super high power (~.98), choose µ’ = µ0 + 4 standard error units. But it would be unwarranted to take a just statistically significant result as grounds to infer a µ this large. (It would be wrong 98% of the time.) The high power is telling us that if µ were as large as µ’, then with high probability the test would reject H0: µ ≤ µ0 (and find an indication in the direction of alternative H1). It is an “if-then” claim.
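A small Python sketch of the power function of T+, with the alternative expressed in standard error units above µ0, shows where the ~.98 comes from. It assumes the 1.96 SE cut-off and an exactly Normal sampling distribution; `power_at` is an illustrative name.

```python
from statistics import NormalDist

def power_at(d_se, z_alpha=1.96):
    """Power of T+ against mu' = mu0 + d_se standard errors (cut-off at z_alpha SE)."""
    return 1 - NormalDist().cdf(z_alpha - d_se)

# Power against discrepancies of 0, 1, 2, 3 and 4 standard errors
for d in (0, 1, 2, 3, 4):
    print(d, round(power_at(d), 3))
```

At d = 0 the power is just α (about .025), and at d = 4 it is about .98, the figure used in this paragraph.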

(3) To keep in mind the claim that I am making, I write it here

Mayo: If POW(μ’) is high then a just significant result is poor evidence that μ > μ’; while if POW(μ’) is low it’s good evidence that μ > μ’ (provided assumptions for these claims hold approximately).

Of course we would stipulate values for “high” (e.g., over .5) and “low” (e.g., less than .2), but this suffices for now. Let me suggest an informal way to understand error statistical reasoning from low power against an alternative μ’: Because it is improbable to get as low a P-value as we did (or lower), were μ as small as μ’–i.e., because POW(μ’) is low–it is an indication we’re in a world where population mean μ is greater than μ’. This is exactly the reasoning that allows inferring μ > μ0 with a statistically significant result. And notice: the power of the test against μ0 is α!

(4) The observed effect M is not the estimate of the population effect µ. Rather, we would use the lower bound of a confidence interval with high confidence level (or corresponding high severity). Nor is it correct to say the estimate “has” the power, it’s the test that has it (in relation to various alternatives–forming a power function).

But if it is supposed we will estimate the population mean using the statistically significant effect size (as suggested in Ioannidis 2008 and Gelman and Carlin 2014), and it is further stipulated that this is known to be too high, then yes, you can say the estimate is too high. The observed mean “exaggerates” what you know on good evidence to be the correct mean. No one can disagree with that, although they measure the exaggeration by a ratio. This is not about analyzing results in terms of power (it is not “power analytic reasoning”). But no matter. See “From p. 359 SIST” below or these pages here.

(5) Specific Example of test T+. Let’s use an example from SIST (2018) of testing

µ ≤ 150 vs. µ > 150

with σ = 10, SE = σ/√n  = 1.  The critical value for α =.025 is z = 1.96. That is, we reject when the sample mean  M > 150 + 1.96(1). You observe a just statistically significant result. You reject the null hypothesis and infer µ >150. Gelman and Carlin write:

An unbiased estimate will have 50% power if the true effect is 2 standard errors away from 0, it will have 17% power if the true effect is 1 standard error away from 0, and it will have 10% power if the true effect is 0.65 standard errors away from 0 (ibid., p. 4).

These correspond to µ =152, µ =151, µ =150.65. It’s odd to talk of an estimate having power; what they mean is that the test T+ has a power of .5 to detect a discrepancy 2 standard errors away from 150, and so on. I deliberately use numbers to match theirs.

[At this point, I turn to extracts from pp. 359-361 of SIST.] The “unbiased estimate” here is the statistically significant M. [I’m using M for X.] To see we’d match their numbers, compute POW(µ =152), POW(µ =151), POW(µ =150.65)[i]:

(a) Pr(M > 151.96; µ = 152) = Pr(Z > -.04) = .51;
(b) Pr(M > 151.96; µ = 151) = Pr(Z > .96) = .17;
(c) Pr(M > 151.96; µ = 150.65) = Pr(Z > 1.31) = .1.

They claim if you reach a just statistically significant result, yet the test had low power to detect a discrepancy from the null that is known from external sources to be correct, then the result exaggerates the magnitude of the discrepancy. In particular, when power gets much below 0.5, they say, statistically significant findings tend to be much larger in magnitude than true effect sizes. By contrast, “if the power is this high [.8], overestimation of the magnitude of the effect will be small” (Gelman and Carlin 2014, p. 3). [I inserted this para from SIST p. 359 in version (iii).] Note POW(152.85) = .8….

They appear to be saying that there’s better evidence for µ ≥152 than for µ ≥151 than for µ ≥150.65, since the power assessments go down. Nothing changes if we write >. Notice that in each case the SEV computation for µ ≥152, µ ≥151, µ ≥150.65 are the complements, .49, .83, .9. So the lower the power for µ’ the stronger the evidence for µ > µ’. Thus there’s disagreement with my assertion in (3). But let’s try to pursue their thinking.
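The three power values and their severity complements can be reproduced, to within rounding in the second decimal, with a few lines of Python. This is only a sketch: it assumes SE = 1, the 151.96 cut-off, and an M that lands exactly on the cut-off; the names are illustrative.

```python
from statistics import NormalDist

Z = NormalDist()   # standard Normal distribution
CUTOFF = 151.96    # 150 + 1.96 SE, with SE = 1

results = {}
for mu_prime in (152.0, 151.0, 150.65):
    pow_mu = 1 - Z.cdf(CUTOFF - mu_prime)   # POW(mu'): Pr(M > cut-off; mu = mu')
    sev = Z.cdf(CUTOFF - mu_prime)          # SEV(mu > mu') when M is just at the cut-off
    results[mu_prime] = (pow_mu, sev)
    print(mu_prime, round(pow_mu, 2), round(sev, 2))
```

As the power against µ’ falls from roughly .5 to .1, the severity for µ > µ’ rises from roughly .5 to .9.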

Suppose we observe M = 152. Say we have excellent reason to think it’s too big. We’re rather sure the mean temperature is no more than ~150.25 or 150.5, judging from previous cooling accidents, or perhaps from the fact that we don’t see some drastic effects we’d expect from water that hot. Thus 152 is an overestimate. …Some remarks:

From point (4), the inferred estimate would not be 152 but rather the lower confidence bounds, say, µ > (152 – 2SE ), i.e., µ > 150 (for a .975 lower confidence bound). True, but suppose the lower bound at a reasonable confidence level is still at odds with what we assume is known. For example, a lower .93 bound is µ > 150.5. What then? Then we simply have a conflict between what these data indicate and assumed background knowledge. 

Do Gelman and Carlin really want to say that the statistically significant M fails to warrant µ ≥ µ’ for any µ’ between 150 and 152 on grounds that the power in this range is low (going from .025 to .5)? If so, the result surely couldn’t warrant values larger than 152 (*). So it appears no values would be able to be inferred from the result.

[(*)Here’s a point of logic: If claim A (e.g., µ ≥ 152) entails claim B (e.g., µ ≥ 150.5), then in sensible inference accounts, claim B should be better warranted than claim A.]

A way to make sense of their view is to see them as saying the observed mean is so out of line with what’s known, that we suspect the assumptions of the test are questionable or invalid. Suppose you have considerable grounds for this suspicion: signs of cherry-picking, multiple testing, artificiality of experiments, publication bias and so forth, as are rife in both examples given in Gelman and Carlin’s paper [as in Ioannidis 2008]. You have grounds to question the result because you question the reported error probabilities. …The error statistical point in (3) still stands.

This returns us to point (1). One reasons, if the assumptions are met, and the error probabilities approximately correct, then the statistically significant result would indicate µ > 150.5, P-value .07, or severity level .93. But you happen to know (or so it is stipulated) that µ ≤ 150.5. Thus, that’s grounds to question whether the assumptions are met. You suspect it would fail an audit. In that case put the blame where it belongs.[ii]

Please use the comments for questions and remarks.


From p. 359 SIST:

They claim if you reach a just statistically significant result, yet the test had low power to detect a discrepancy from the null that is known from external sources to be correct, then the result exaggerates the magnitude of the discrepancy. In particular, when power gets much below 0.5, they say, statistically significant findings tend to be much larger in magnitude than true effect sizes. By contrast, if the power is this high [.8], . . . overestimation of the magnitude of the effect will be small (p. 3).

[Added 5/4/22: Remember, to say the power against the (assumed) known discrepancy from the null is less than .5 just is to say that the observed M (which just reaches statistical significance) exceeds the true value. And to say the power against the (assumed) known discrepancy exceeds .5 just is to say it exceeds the observed M (so M is not exaggerating it). Also see note (i) from SIST.]

[i] There are slight differences depending on whether they are using 2 as the cut-off or 1.96, and from their using a two-sided test, but we hardly add anything for the negative direction: For (a), Pr( M < -2; µ =2) = Pr(Z < -4) ~ 0.

[ii] The point can also be made out by increasing power by dint of sample size. If n = 10,000, (σ/√n) = 0.1.  Test T+(n=10,000) rejects H0 at the .025 level if  M > 150.2.  A 95% confidence interval is [150, 150.4]. With n = 100, the just .025 significant result corresponds to the interval [150, 154]. The latter is indicative of a larger discrepancy. Granted, sample size must be large enough for the statistical assumptions to pass an audit.
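The two sample sizes in note [ii] can be compared with a brief Python sketch. It uses the 2 SE approximation from the note, with σ = 10 and µ0 = 150; the helper `just_significant` is a hypothetical name.

```python
from math import sqrt

def just_significant(n, mu0=150.0, sigma=10.0, k=2.0):
    """Return (SE, just significant M, interval M -/+ k*SE) for sample size n."""
    se = sigma / sqrt(n)
    m = mu0 + k * se                    # smallest M that reaches the k*SE cut-off
    return se, m, (m - k * se, m + k * se)

# Compare the just significant outcome at n = 100 and n = 10,000
for n in (100, 10_000):
    se, m, (lo, hi) = just_significant(n)
    print(n, round(se, 2), round(m, 2), (round(lo, 2), round(hi, 2)))
```

The same significance level corresponds to a much narrower interval hugging 150 when n = 10,000.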


Categories: power, reforming the reformers, SIST, Statistical Inference as Severe Testing

Join me in reforming the “reformers” of statistical significance tests


The most surprising discovery about today’s statistics wars is that some who set out shingles as “statistical reformers” themselves are guilty of misdefining some of the basic concepts of error statistical tests—notably power. (See my recent post on power howlers.) A major purpose of my Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (2018, CUP) is to clarify basic notions to get beyond what I call “chestnuts” and “howlers” of tests. The only way that disputing tribes can get beyond the statistics wars is by (at least) understanding correctly the central concepts. But these misunderstandings are more common than ever, so I’m asking readers to help. Why are they more common (than before the “new reformers” of the last decade)? I suspect that at least one reason is the popularity of Bayesian variants on tests: if one is looking to find posterior probabilities of hypotheses, then error statistical ingredients may tend to look as if that’s what they supply. 

Run a little experiment if you come across a criticism based on the power of a test. Ask: are the critics interpreting the power of a test (with null hypothesis H) against an alternative H’ as if it were a posterior probability on H’? If they are, then it’s fallacious. But it will help understand why some people claim that high power against H’ warrants a stronger indication of a discrepancy H’, upon getting a just statistically significant result. But this is wrong. (See my recent post on power howlers.)

I had a blogpost on Ziliak and McCloskey (2008) (Z & M) on power (from Oct. 2011), following a review of their book by Aris Spanos (2008). They write:

“The error of the second kind is the error of accepting the null hypothesis of (say) zero effect when the null is in fact false, that is, when (say) such and such a positive effect is true.”

So far so good, keeping in mind that “positive effect” refers to a parameter discrepancy, say δ, not an observed difference.

And the power of a test to detect that such and such a positive effect δ is true is equal to the probability of rejecting the null hypothesis of (say) zero effect when the null is in fact false, and a positive effect as large as δ is present.

Fine. Let this alternative be abbreviated H’(δ):

H’(δ): there is a positive (population) effect at least as large as δ.

Suppose the test rejects the null when it reaches a significance level of .01 (nothing turns on the small value chosen).

(1) The power of the test to detect H’(δ) =

Pr(test rejects null at the .01 level| H’(δ) is true).

Say it is 0.85.

According to Z & M:

“[If] the power of a test is high, say, 0.85 or higher, then the scientist can be reasonably confident that at minimum the null hypothesis (of, again, zero effect if that is the null chosen) is false and that therefore his rejection of it is highly probably correct.” (Z & M, 132-3)

But this is not so.  They are mistaking (1), defining power, as giving a posterior probability of .85 to H’(δ)! That is, (1) is being transformed to (1′):

(1’) Pr(H’(δ) is true| test rejects null at .01 level)=.85!

(I am using the symbol for conditional probability “|” all the way through for ease in following the argument, even though, strictly speaking, the error statistician would use “;”, abbreviating “under the assumption that”). Or to put this in other words, they argue:

1. Pr(test rejects the null | H’(δ) is true) = 0.85.

2. Test rejects the null hypothesis.

Therefore, the rejection is probably correct, e.g., the probability H’ is true is 0.85.

Oops. Premises 1 and 2 are true, but the conclusion fallaciously replaces premise 1 with 1′.
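The error statistician does not assign prior probabilities to these hypotheses. But purely to see why (1) cannot be read as (1′), suppose, hypothetically, a point null and a point alternative H’ with some prior probability on H’. The Python sketch below applies Bayes’ theorem under that assumption; the function name and the priors tried are illustrative only.

```python
def posterior_h_prime(prior, power=0.85, alpha=0.01):
    """Pr(H' | test rejects), for a point null vs. point alternative with a hypothetical prior on H'."""
    reject_and_h_prime = power * prior        # Pr(reject | H') * Pr(H')
    reject_and_null = alpha * (1 - prior)     # Pr(reject | null) * Pr(null)
    return reject_and_h_prime / (reject_and_h_prime + reject_and_null)

# The same power of .85 yields very different posteriors as the assumed prior changes
for prior in (0.5, 0.01, 0.001):
    print(prior, round(posterior_h_prime(prior), 3))
```

Even for someone who wants a posterior, the power of .85 does not supply it: the result moves with whatever prior is assumed.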

As Aris Spanos (2008) points out, “They have it backwards”. Extracting from a Spanos comment on this blog in 2011:

“When [Ziliak and McCloskey] claim that: ‘What is relevant here for the statistical case is that refutations of the null are trivially easy to achieve if power is low enough or the sample size is large enough.’ (Z & M, p. 152), they exhibit [confusion] about the notion of power and its relationship to the sample size; their two instances of ‘easy rejection’ separated by ‘or’ contradict each other! Rejections of the null are not easy to achieve when the power is ‘low enough’. They are more difficult exactly because the test does not have adequate power (generic capacity) to detect discrepancies from the null; that stems from the very definition of power and optimal tests. [Their second claim] is correct for the wrong reason. Rejections are easy to achieve when the sample size n is large enough due to high not low power. This is because the power of a ‘decent’ (consistent) frequentist test increases monotonically with n!” (Spanos 2011) 
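Spanos’s point that power grows with n can be sketched in Python for the one-sided Normal test. The discrepancy δ = 1, σ = 10 and α = .025 are illustrative assumptions, not values from Z & M, and `power_for_n` is a hypothetical helper.

```python
from math import sqrt
from statistics import NormalDist

def power_for_n(n, delta=1.0, sigma=10.0, alpha=0.025):
    """Power of the one-sided Normal test against a fixed discrepancy delta, at sample size n."""
    z_alpha = NormalDist().inv_cdf(1 - alpha)   # cut-off in standard error units
    se = sigma / sqrt(n)                        # standard error shrinks as n grows
    return 1 - NormalDist().cdf(z_alpha - delta / se)

# Power against the same discrepancy rises steadily with sample size
for n in (25, 100, 400, 1600):
    print(n, round(power_for_n(n), 3))
```

Rejections become easier with large n because the power is high, not low.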

However, their slippery slides are very illuminating for common misinterpretations behind the criticisms of statistical significance tests, assuming a reader can catch them, because they only make them some of the time. [i] According to Ziliak and McCloskey (2008): “It is the history of Fisher significance testing. One erects little significance hurdles, six inches tall, and makes a great show of leaping over them, . . . If a test does a good job of uncovering efficacy, then the test has high power and the hurdles are high not low.” (ibid., p. 133)

They construe little significance as little hurdles! That explains how they wound up supposing high power translates into high hurdles. It’s the opposite. The higher the hurdle required before rejecting the null, the more difficult it is to reject, and the lower the power. High hurdles correspond to insensitive tests, like insensitive fire alarms. It might be that using “sensitivity” rather than “power” would make this abundantly clear. We may coin: The high power = high hurdle (for rejection) fallacy. A powerful test does give the null hypothesis a harder time in the sense that it’s more probable that discrepancies from it are detected. That makes it easier to infer H1. Z & M have their hurdles in a twist.
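To see the direction of the relationship numerically, here is a minimal sketch. It assumes a one-sided Z test of H0: µ ≤ 0 with known σ; the function name `power` and the particular values (µ’ = 0.3, n = 50) are illustrative choices, not anything taken from Z & M or Spanos.

```python
from math import sqrt
from statistics import NormalDist

Z = NormalDist()  # standard Normal distribution


def power(mu_alt, n, alpha, sigma=1.0):
    """Pr(test rejects H0: mu <= 0 ; mu = mu_alt) for a one-sided Z test."""
    cutoff = Z.inv_cdf(1 - alpha)  # the "hurdle", on the z scale
    return 1 - Z.cdf(cutoff - mu_alt * sqrt(n) / sigma)


# Raising the hurdle (smaller alpha) lowers the power against mu_alt = 0.3.
for alpha in (0.10, 0.05, 0.01, 0.001):
    print(f"alpha={alpha:<6} power={power(0.3, n=50, alpha=alpha):.3f}")

# Holding the hurdle fixed, power rises with the sample size n.
for n in (10, 50, 100, 400):
    print(f"n={n:<4} power={power(0.3, n=n, alpha=0.05):.3f}")
```

The first loop raises the hurdle and the power falls; the second keeps the hurdle fixed and the power climbs with n, which is the sense in which rejections become easy with large samples because of high power, not low.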

For a fuller discussion, see this link to Excursion 5 Tour I of SIST (2018). [ii] [iii]

What power howlers have you found? Share them in the comments. 

Spanos, A. (2008), Review of S. Ziliak and D. McCloskey’s The Cult of Statistical Significance, Erasmus Journal for Philosophy and Economics, volume 1, issue 1: 154-164.

Ziliak, S. and McCloskey, D. (2008), The Cult of Statistical Significance: How the Standard Error Costs Us Jobs, Justice and Lives, University of Michigan Press.

[i] When it comes to raising the power by increasing sample size, they often make true claims, so it’s odd when there’s a switch or mixture, as when they say “refutations of the null are trivially easy to achieve if power is low enough or the sample size is large enough”. (Z & M, p. 152) It is clear that “low” is not a typo here either (as I at first assumed), so it’s mysterious. 

[ii] Remember that a power computation is not the probability of data x under some alternative hypothesis, it’s the probability that data fall in the rejection region of a test under some alternative hypothesis. In terms of a test statistic d(X), it is Pr(test statistic d(X) is statistically significant | H’ true), at a given level of significance. So it’s the probability of getting any of the outcomes that would lead to statistical significance at the chosen level, under the assumption that alternative H’ is true. The alternative H’ used to compute power is a point in the alternative region. However, the inference that is made in tests is not to a point hypothesis but to an inequality, e.g., θ > θ’.

[iii] My rendering of their fallacy above sees it as a type of affirming the consequent.  To Z & M, “the so-called fallacy of affirming the consequent may not be a fallacy at all in a science that is serious about decisions and belief.”  It is, they think, how Bayesians reason. They are right that if inference is by way of a Bayes boost, then affirming the consequent is not a fallacy. A hypothesis H that entails data x will get a “B-boost” from x, unless its probability is already 1. The error statistician objects that the probability of finding an H that perfectly fits x is high, even if H is false–but the Bayesian need not object if she isn’t in the business of error probabilities. The trouble erupts when Z & M take an error statistical concept like power, and construe it Bayesianly. Even more confusing, they only do so some of the time.

Categories: power, SIST, statistical significance tests | 1 Comment

Happy Birthday Neyman: What was Neyman opposing when he opposed the ‘Inferential’ Probabilists? Your weekend Phil Stat reading


Today is Jerzy Neyman’s birthday (April 16, 1894 – August 5, 1981). I’m reposting a link to a quirky, but fascinating, paper of his that explains one of the most misunderstood of his positions: what he was opposed to in opposing the “inferential theory”. The paper, from 60 years ago, is Neyman, J. (1962), ‘Two Breakthroughs in the Theory of Statistical Decision Making’ [i]. It’s chock full of ideas and arguments. “In the present paper” he tells us, “the term ‘inferential theory’…will be used to describe the attempts to solve the Bayes’ problem with a reference to confidence, beliefs, etc., through some supplementation …either a substitute a priori distribution [exemplified by the so called principle of insufficient reason] or a new measure of uncertainty” such as Fisher’s fiducial probability. It arises on p. 391 of Excursion 5 Tour III of Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (2018, CUP). Here’s a link to the proofs of that entire tour. If you hear Neyman rejecting “inferential accounts,” you have to understand it in this very specific way: he’s rejecting “new measures of confidence or diffidence”. Here he alludes to them as “easy ways out”. He is not rejecting statistical inference in favor of behavioral performance as is typically thought. It’s amazing how an idiosyncratic use of a word 60 years ago can cause major rumblings decades later. Neyman always distinguished his error statistical performance conception from Bayesian and Fiducial probabilisms [ii]. The surprising twist here is semantical and the culprit is none other than…Allan Birnbaum. Yet Birnbaum gets short shrift, and no mention is made of our favorite “breakthrough” (or did I miss it?). You can find quite a lot on this blog searching Birnbaum. Continue reading

Categories: Bayesian/frequentist, Neyman | Leave a comment

Power howlers return as criticisms of severity

Mayo bangs head

Suppose you are reading about a statistically significant result x that just reaches a threshold p-value α from a test T+ of the mean of a Normal distribution

 H0: µ ≤  0 against H1: µ >  0

with n iid samples, and (for simplicity) known σ.  The test “rejects” H0 at this level & infers evidence of a discrepancy in the direction of H1.

I have heard some people say:

A. If the test’s power to detect alternative µ’ is very low, then the just statistically significant x is poor evidence of a discrepancy (from the null) corresponding to µ’.  (i.e., there’s poor evidence that  µ > µ’ ). See point* on language in notes.

They will generally also hold that if POW(µ’) is reasonably high (at least .5), then the inference to µ > µ’ is warranted, or at least not problematic.

I have heard other people say:

B. If the test’s power to detect alternative µ’ is very low, then the just statistically significant x is good evidence of a discrepancy (from the null) corresponding to µ’ (i.e., there’s good evidence that  µ > µ’).

They will generally also hold that if POW(µ’) is reasonably high (at least .5), then the inference to µ > µ’ is unwarranted.

Which is correct, from the perspective of the frequentist error statistical philosophy? Continue reading
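For readers who want to check A and B numerically, here is a minimal sketch. It assumes the usual error statistical reading on which the severity for µ > µ’ is the probability of a less impressive result than the one observed, were µ equal to µ’; that formula, the function names, and the values n = 100, σ = 1, α = 0.025 are assumptions made for illustration and are not spelled out in the passage above.

```python
from math import sqrt
from statistics import NormalDist

Z = NormalDist()  # standard Normal distribution


def power(mu_alt, n, alpha, sigma=1.0):
    """POW(mu_alt): Pr(T+ rejects H0: mu <= 0 ; mu = mu_alt)."""
    cutoff = Z.inv_cdf(1 - alpha)
    return 1 - Z.cdf(cutoff - mu_alt * sqrt(n) / sigma)


def severity_just_significant(mu_alt, n, alpha, sigma=1.0):
    """SEV(mu > mu_alt) when the observed mean sits exactly at the cutoff:
    Pr(sample mean <= observed mean ; mu = mu_alt)."""
    se = sigma / sqrt(n)
    xbar_obs = Z.inv_cdf(1 - alpha) * se  # just statistically significant
    return Z.cdf((xbar_obs - mu_alt) / se)


for mu_alt in (0.05, 0.1, 0.2, 0.3):
    pw = power(mu_alt, n=100, alpha=0.025)
    sev = severity_just_significant(mu_alt, n=100, alpha=0.025)
    print(f"mu'={mu_alt:<5} POW={pw:.3f}  SEV(mu > mu')={sev:.3f}")
```

Under these assumptions the two columns sum to 1 for a just significant result, so the alternatives against which the test has low power are the ones for which µ > µ’ is well indicated, and conversely.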

Categories: Statistical power, statistical tests | 7 Comments

Insevere Tests of Severe Testing (iv)


One does not have evidence for a claim if little if anything has been done to rule out ways the claim may be false. The claim may be said to “pass” the test, but it’s one that utterly lacks stringency or severity. On the basis of this very simple principle, I build a notion of evidence that applies to any error prone inference. In this account, data x are evidence for a claim C only if (and only to the extent that) C has passed a severe test with x.[1] How to apply this simple idea, however, and how to use it to solve central problems of induction and statistical inference requires careful consideration of how it is to be fleshed out. (See this post on strong vs weak severity.)

Consider a fairly egregious, yet all-too familiar, example of a poorly tested claim to the effect that a given drug improves lung function in people with a given fatal lung disease. Say the CEO of the drug company, confronted with disappointing results from an RCT (they are no better than would be expected by the background variability alone), orders his data analysts to “slice and dice” the data until they get some positive results. They might try and try again to find a benefit among various subgroups (e.g., males, females, employment history, etc.). Failing yet again, they might vary how “lung benefit” is measured using different proxy variables. This way of proceeding has a high probability of issuing in a report of drug benefit H1 (in some subgroup or other), even if no benefit exists (i.e., even if the null or test hypothesis H0 is true). (For a real case, see my “p-values on trial” in Harvard Data Science Review.)

The method has a high error probability in relation to what it infers, H1. H1 passes a test with low or even minimal severity. The gambit leading to low severity here is referred to by a variety of names: multiple testing, significance seeking, data-dredging, subgroup analysis, outcome switching, data torturing, and others besides. Experimental design principles endorsed by hundreds of medical journals, best-practice statistical manuals, and replication researchers reflect the need to block cavalier attitudes towards inferring data-dredged hypotheses. A variety of ways to avoid, adjust or otherwise compensate for “post data selection,” as some now call it, are well-known.
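A rough sketch of why the searching procedure has a high error probability: it assumes, purely for simplicity, that the k subgroup or proxy-outcome tests are independent and each run at level 0.05 (real subgroups overlap, so this is an idealization), and the function name is hypothetical.

```python
def prob_some_significant(k, alpha=0.05):
    """Pr(at least one of k independent tests is significant ; every null true)."""
    return 1 - (1 - alpha) ** k


for k in (1, 5, 20, 60):
    print(f"{k:>2} looks at the data: "
          f"Pr(report some benefit; H0) = {prob_some_significant(k):.2f}")
```

With a single preregistered test the probability of erroneously reporting a benefit stays at the nominal level; after enough slicing and dicing it approaches 1, which is what is meant by H1 passing with minimal severity.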

Some central features of the severity assessment:

  1. The severity assessment attaches to the method of inferring a claim C with a given test T and data x. The resulting assessment for a given hypothesis H1 (in this case low) remains even if H1 is known or believed to be true (plausible, probable, or the like). Perhaps there are other data out there, y, or a different type of test, T’, that provide a warrant for H1, but that doesn’t change the low severity afforded by x from test T. In other words, asserting H1 might be right, but if it’s based on the post-data multiple searching method, it is right for the wrong reason. The method, as I described it, failed to distinguish cases where mere random variation throws up an interesting pattern in the particular subgroup which the researchers seize on from cases of genuine benefit.
  2. It is incorrect to speak of the severity of a test, in and of itself. Severity, as used and developed by me and by Spanos, refers to an assessment of how well-tested a particular claim of interest is. (It is post-data.) It is analogous to Popper’s term “corroboration” (a claim is corroborated if it passes severely)–never mind that he never adequately cashed it out. The severity associated with C measures how well-corroborated C is, with the data x and the test T under consideration.
  3. In assessing the severity associated with a method, we have to consider how it behaves in general, with other possible outcomes–not just the one you happen to observe–and under various alternatives. That is, we consider the method’s error probabilities–its capabilities to avoid (or commit) erroneous interpretations of the data. Methods that use probability (in inference) to assess and control error probabilities I call error statistical accounts. My account of evidence is one of severe testing based on error statistics.
  4. It is rarely the hypotheses or claims themselves that determine the severity with which they pass tests. Hypotheses pass poor tests when they happen to contain sufficiently vague terms, lending themselves to “just so” stories. An example from Popper is the concept of an “inferiority complex” in Adler’s psychological theory. Whatever behavior is observed, Popper charges, can be ‘explained’ as in sync with Adler (same for concepts in Freud). The theory may be logically falsifiable, but it is immunized from being found false. The theory is easily saved by ad hoc means, even if it’s false. The data-dredger can pull off the same stunt, but (as is more typical) the flexibility is in the data and hypothesis generation and analysis. On the flip side, theories with high content and “corroborative tendrils” that give them more chances of failing enjoy high severity provided that they pass a test that probably would have found flaws. (Sometimes philosophers talk of a large scale theory, paradigm, or research program that is understood to include overall testing methods as well as particular hypotheses.) [Updated 4/5 to include the flip side. For a discussion see SIST (2018) pp. 237-8.]

If someone is interested in appraising the value of our account of severity, and especially if they purport to refute it, they should be sure they are talking about an account with these essential features. Otherwise, their assessment will have no bearing on this account of severity.

Severe testing considers alternative hypotheses but is not a comparative account–there’s a big difference!

A comparative account of evidence merely reports that one hypothesis (model or claim) is favored over another in some sense: It might be said to be more likely, better supported, fit the data better or the like. Comparative accounts do not test, provide evidence for, or falsify hypotheses. They are limited to claiming one fits data better than another in some sense, even though they do not exhaust the possibilities, and even though both might be quite lousy. The better of two poorly warranted hypotheses is still a poorly warranted hypothesis. (See Mayo 2018, Mayo and Spanos 2011).

The classic example of a comparative account is based on the likelihood ratio of hypothesis H1 over H0. It compares the probability (or density) of x under H1, which we may write as Pr(x;H1), to the probability of x under H0, Pr(x;H0).

The likelihood ratio is Pr(x;H1)/Pr(x;H0).

With likelihoods, the data x are fixed while the hypotheses vary. Given the data x, it is easy to find a hypothesis H1 that perfectly agrees with the data so that H1 is a better fit to the data than is hypothesis H0. However, as statistician George Barnard puts it, “there always is such a rival hypothesis viz., that things just had to turn out the way they actually did” (Barnard 1972, 129). So the probability of finding some better fitting alternative or other is high (if not guaranteed) even when H0 correctly describes the data generation.
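A small sketch of Barnard’s point, under assumptions chosen only for illustration: the data are summarized by a Normal sample mean with known σ, H0 is the point hypothesis µ = 0, and the rival H1 is picked after the fact to be µ = the observed mean. The function names are hypothetical.

```python
from math import log, sqrt
from statistics import NormalDist

Z = NormalDist()  # standard Normal distribution


def likelihood_ratio(xbar, n, mu1, mu0=0.0, sigma=1.0):
    """Pr(xbar; mu1) / Pr(xbar; mu0), using the Normal density of the sample mean."""
    se = sigma / sqrt(n)
    return NormalDist(mu1, se).pdf(xbar) / NormalDist(mu0, se).pdf(xbar)


def prob_best_fit_lr_at_least(c):
    """Pr(LR >= c ; H0) when H1 is chosen post hoc as mu = xbar.
    In that case LR = exp(z**2 / 2), with z the standardized sample mean."""
    if c <= 1:
        return 1.0  # the best-fitting rival never does worse than H0
    return 2 * (1 - Z.cdf(sqrt(2 * log(c))))


for xbar in (-0.2, 0.01, 0.15, 0.3):
    lr = likelihood_ratio(xbar, n=25, mu1=xbar)
    print(f"xbar={xbar:<5} LR(best-fitting H1 over H0)={lr:.2f}")

for c in (1, 2, 4):
    print(f"Pr(LR >= {c}; H0 true) = {prob_best_fit_lr_at_least(c):.3f}")
```

Whatever mean is observed, the post hoc rival comes out at least as well supported as H0, so “H1 fits better than H0” is a result that was bound to occur even with H0 true.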

Suppose someone proposes that H1 passes a severe test so long as the data are more probable under H1 than under some H0. Such an account will fail to meet even minimal requirements of severe tests in the error statistical account. Since the data dredging and other biasing selection effects do not alter the likelihood ratio or the Bayes Factor, basing severity on such comparative accounts will be at odds with the one we intend. This does not seem to bother the authors of a recent paper, van Dongen, Sprenger and Wagenmakers (2022), hereafter, VSW (2022). They say straight out:

the Bayes factor only depends on the probability of the data in light of the two competing hypotheses. As Mayo emphasizes (e.g., Mayo and Kruse, 2001; Mayo, 2018), the Bayes factor is insensitive to variations in the sampling protocol that affect the error rates, i.e., optional stopping of the experiment.[2] The Bayes factor only depends on the actually observed data, and not on whether they have been collected from an experiment with fixed or variable sample size, and so on. In other words, the Bayesian ex-post evaluation of the evidence stays the same regardless of whether the test has been conducted in a severe or less severe fashion. (VSW 2022)

Stopping at this point and acknowledging the difference in statistical philosophies would be my recommendation. We’re not always in a context of severe testing in our sense. But these authors desire (or appear to desire) an error-statistical severity omelet without breaking the error statistical eggs (to allude to a famous analogous quote by Savage).
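To make concrete what is at stake in the sampling protocol, here is a small simulation sketch of optional stopping. The setup is an assumption for illustration only: N(0,1) data with H0: µ = 0 true, a nominal two-sided 0.05 cutoff checked after every observation, and a fixed random seed; the function names are hypothetical.

```python
import random
from math import sqrt


def rejects_with_optional_stopping(max_n, rng, z_crit=1.96):
    """Draw N(0,1) data one at a time (H0: mu = 0 is true) and stop as soon
    as the running z statistic reaches z_crit in absolute value."""
    total = 0.0
    for n in range(1, max_n + 1):
        total += rng.gauss(0.0, 1.0)
        if abs(total / sqrt(n)) >= z_crit:
            return True
    return False


def type_one_error_rate(max_n, trials=2000, seed=1):
    """Simulated Pr(reject H0 at some look ; H0 true), looking after each of max_n draws."""
    rng = random.Random(seed)
    hits = sum(rejects_with_optional_stopping(max_n, rng) for _ in range(trials))
    return hits / trials


for max_n in (1, 10, 100):
    print(f"up to {max_n:>3} looks: simulated type I error = "
          f"{type_one_error_rate(max_n):.3f}")
```

With one look the simulated rate should sit near the nominal 0.05, and it grows as more looks are allowed, while the likelihood ratio at the stopping point takes no account of how many looks were taken.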

In the next paragraph they assure us that they too can capture severity, if not in my sense, then in a (subjective Bayesian) sense they find superior:

We agree with this observation [in the above quote], but we believe that the proper place for severity in statistical inference is the choice of the tested hypotheses (VSW 2022).

But the example they give that is supposed to convince me that I ought to define severity comparatively is not promising. According to them:

a stringent scrutiny of the claim C: “90% of all swans are white” requires only a single swan if the alternative claim is “all swans are black”.

But H1: 90% swans are white, does not pass a stringent scrutiny by dint of finding a single white swan x, although x falsifies H0: all swans are black. (It doesn’t matter for my point how we label the two hypotheses.) While I don’t know the precise distribution of white and black swans (nor how the sample was collected, nor whether the hypotheses are specified post hoc), it would be silly to suppose that a single white swan is good evidence that 90% of the population of swans are white.

A more familiar example of the same form as theirs would be to take a single case where a treatment works as grounds to stringently pass a hypothesis H1: that it works in at least 90% of the population. For these authors, as I understand them, what does the work that enables the alleged stringent inference to H1 is setting H0 as a hypothesis that x falsifies. Of course these two hypotheses scarcely exhaust the space of hypotheses — but this is a standard move (and a standard problem) in comparativist accounts [3]. To my ears, the example illustrated the problem with a comparative appraisal: Pr(x;H1) is surely greater than Pr(x;H0) which is 0, but H1 has not thereby been subjected to a scrutiny that it probably would have failed, if false.
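The swan example can be put in a few lines. This sketch assumes a single swan drawn at random from the population (the text notes the actual sampling method is unknown), and the function name and the rival percentages are hypothetical.

```python
def prob_white_on_one_draw(fraction_white):
    """Pr(a randomly drawn swan is white ; that fraction of swans are white)."""
    return fraction_white


# Comparative appraisal of x = "one white swan observed".
pr_x_h1 = prob_white_on_one_draw(0.90)  # H1: 90% of swans are white
pr_x_h0 = prob_white_on_one_draw(0.00)  # H0: all swans are black
print("x favors H1 over H0:", pr_x_h1 > pr_x_h0)

# How often would the same "passing" result occur if H1 were false?
for fraction in (0.5, 0.2, 0.05):
    print(f"if only {fraction:.0%} are white: "
          f"Pr(x) = {prob_white_on_one_draw(fraction):.2f}")
```

The comparison with H0 comes out in favor of H1 as soon as one white swan appears, yet the same observation would turn up half the time in a world where only 50% of swans are white, so H1 has not faced a test it would probably have failed if false.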

In statistical significance tests, say, concerning the mean μ of a Normal distribution: H0: μ ≤ μ0 versus H1: μ > μ0, we have an alternative hypothesis, but it is not a comparative account. (We could equally well have H0: μ = μ0.) VSW question how such an alternative can pass with severity because it is composite (p. 6): H1: μ > μ0 includes a range of values, e.g., the mean survival is higher in the treated vs the control group. Here’s how it does: A small p-value can warrant H1 with severity because with high probability, 1 – p, we would have obtained a larger p-value were we in a world where H0 is adequate. It is rather the comparative appraisal of point hypotheses that cannot falsify a hypothesis.
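The “1 – p” reasoning can be checked directly. This is a minimal sketch assuming a one-sided Z test with known σ and an observed standardized statistic of 2.5; the function names, and the `shift` parameter expressing µ in standard-error units away from µ0, are introduced here only for illustration.

```python
from statistics import NormalDist

Z = NormalDist()  # standard Normal distribution


def p_value(z_obs):
    """One-sided p-value: Pr(d(X) >= z_obs ; mu = mu0)."""
    return 1 - Z.cdf(z_obs)


def prob_larger_p_value(z_obs, shift=0.0):
    """Pr(a larger p-value than the one observed ; mu = mu0 + shift * SE),
    i.e. Pr(d(X) < z_obs) under that value of mu."""
    return Z.cdf(z_obs - shift)


z_obs = 2.5
p = p_value(z_obs)
print(f"p = {p:.4f}")
print(f"Pr(larger p; mu = mu0)        = {prob_larger_p_value(z_obs):.4f}")
print(f"Pr(larger p; mu = mu0 - 1 SE) = {prob_larger_p_value(z_obs, shift=-1.0):.4f}")
```

At the boundary µ = µ0 the probability of a larger p-value is exactly 1 – p, and it is higher still for values of µ below µ0, which is why the composite H1 can pass with severity.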


I will study the rest of VSW’s paper at a later date. The subjective Bayesian account is sufficiently flexible to redefine terms and goals so that the newly defined severity passes the test. But since the authors already conceded “the Bayesian ex-post evaluation of the evidence stays the same regardless of whether the test has been conducted in a severe or less severe fashion,” it’s hard to see how their view is being put to a severe test.

I may come back to this in a later post. For a detailed development of severe testing, see proofs of the first three excursions from SIST.[4]

Share your constructive remarks in the comments.


[1] Merely blocking an inference to a claim that passes with low severity is what I call weak severity. A fuller, strong severity principle says: We have evidence for a claim C just to the extent it survives a stringent scrutiny. If C passes a test that was highly capable of finding flaws or discrepancies from C, were they to be present, and yet none or few are found, the passing result, x, is evidence for C.

[2] Optional stopping is another gambit that can wreck error probability guarantees, violating what Cox and Hinkley (1974) call weak repeated sampling. (For details, see SIST pp 44-5; Mayo and Kruse 2001 below).

[3] Some Bayesians object to Bayes factors for similar reasons. Gelman (2011) says: “To me, Bayes factors correspond to a discrete view of the world, in which we must choose between models A, B, or C” (p. 74) or a weighted average of them.

[4] I have finally pulled together the pieces from the page proofs of the first three “excursions” of my Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (2018, CUP) [SIST]. Here they are, beginning with the Preface: Excursions 1-3 from SIST. I would have hoped that scholars discussing severity and Popper would have looked at what I say about Popper in Excursion 2 (especially Tour II). To depict Popper as endorsing the naive or dogmatic variants called out by Lakatos in 1970 is highly problematic, e.g., that old view of falsification by “basic statements”.
The best treatment of Bayes and Popper, I recalled when writing this, is in a book by the non-subjective Bayesian, Roger Rosenkrantz (1977), chapter 6. I looked it up today, and yes I think it is an excellent discussion that at least takes a reader up to Popper 1977. (updated on 4/6/22)



Barnard, G. (1972). The logic of statistical inference (Review of “The Logic of Statistical Inference” by Ian Hacking). British Journal for the Philosophy of Science, 23(2), 123–32.

Cox, D. R. & Hinkley, D. (1974). Theoretical Statistics. London: Chapman and Hall LTD.

Gelman, A. (2011). Induction and Deduction in Bayesian Data Analysis, Rationality, Markets and Morals 2:67-78.

Mayo D. (2018). Statistical inference as severe testing: How to get beyond the statistics wars. Cambridge: Cambridge University Press.

Mayo D. & Kruse, M. (2001). Principles of inference and their consequences. In D. Corfield and J. Williamson (eds.) Foundations of Bayesianism, pp. 381-403. The Netherlands: Kluwer Academic Publishers.

Mayo, D. and Spanos, A. (2011). Error Statistics.

Savage, L. J. (1961). The Foundations of Statistics Reconsidered. Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability 1. Berkeley: University of California Press, 575–86.

van Dongen, N. N. N., Sprenger, J., & Wagenmakers, E. (2020, December 16). A Bayesian perspective on severity: Risky predictions and specific hypotheses. PsyArXiv preprints. (To appear in Psychonomic Bulletin and Review 2022.)

Categories: Error Statistics | 2 Comments

No fooling: The Statistics Wars and Their Casualties Workshop is Postponed to 22-23 September, 2022

The Statistics Wars
and Their Casualties

Postponed to
22-23 September 2022


London School of Economics (CPNSS)

Yoav Benjamini (Tel Aviv University), Alexander Bird (University of Cambridge), Mark Burgman (Imperial College London),
Daniele Fanelli (London School of Economics and Political Science), Roman Frigg (London School of Economics and Political Science), Stephen Guettinger (London School of Economics and Political Science), David Hand (Imperial College London), Margherita Harris (London School of Economics and Political Science), Christian Hennig (University of Bologna), Katrin Hohl* (City University London),
Daniël Lakens (Eindhoven University of Technology), Deborah Mayo (Virginia Tech), Richard Morey (Cardiff University), Stephen Senn (Edinburgh, Scotland), Jon Williamson (University of Kent)

Panel Leaders: TBA

While the field of statistics has a long history of passionate foundational controversy, the last decade has, in many ways, been the most dramatic. Misuses of statistics, biasing selection effects, and high-powered methods of Big-Data analysis have helped to make it easy to find impressive-looking but spurious results that fail to replicate. As the crisis of replication has spread beyond psychology and social sciences to biomedicine, genomics and other fields, people are getting serious about reforms.  Many are welcome (preregistration, transparency about data, eschewing mechanical uses of statistics); some are quite radical. The experts do not agree on how to restore scientific integrity, and these disagreements reflect philosophical battles, old and new, about the nature of inductive-statistical inference and the roles of probability in statistical inference and modeling. These philosophical issues simmer below the surface in competing views about the causes of problems and potential remedies. If statistical consumers are unaware of assumptions behind rival evidence-policy reforms, they cannot scrutinize the consequences that affect them (in personalized medicine, psychology, law, and so on). Critically reflecting on proposed reforms and changing standards requires insights from statisticians, philosophers of science, psychologists, journal editors, economists and practitioners from across the natural and social sciences. This workshop will bring together these interdisciplinary insights, from speakers as well as attendees.

Organizers: D. Mayo and R. Frigg

Logistician (chief logistics and contact person): Jean Miller 

*We have had numerous postponements due to Covid and LSE regulations. Hohl’s attendance is uncertain.

Categories: Error Statistics | Leave a comment

The AI/ML Wars: “explain” or test black box models?


I’ve been reading about the artificial intelligence/machine learning (AI/ML) wars revolving around the use of so-called “black-box” algorithms–too complex for humans, even their inventors, to understand. Such algorithms are increasingly used to make decisions that affect you, but if you can’t understand, or aren’t told, why a machine predicted your graduate-school readiness, or which drug a doctor should prescribe for you, etc, you’d likely be dissatisfied and want some kind of explanation. Being told the machine is highly accurate (in some predictive sense) wouldn’t suffice. A new AI field has grown up around the goal of developing (secondary) “white box” models to “explain” the workings of the (primary) black box model. Some call this explainable AI, or XAI. The black box is still used to reach predictions or decisions, but the explainable model is supposed to help explain why the output was reached. (The EU and DARPA in the U.S. have instituted broad requirements and programs for XAI.) Continue reading

Categories: machine learning, XAI/ML | 15 Comments

Philosophy of Science Association (PSA) 22 Call for Contributed Papers

PSA2022: Call for Contributed Papers


Twenty-Eighth Biennial Meeting of the Philosophy of Science Association
November 10 – November 13, 2022
Pittsburgh, Pennsylvania


Submissions open on March 9, 2022 for contributed papers to be presented at the PSA2022 meeting in Pittsburgh, Pennsylvania, on November 10-13, 2022. The deadline for submitting a paper is 11:59 PM Pacific Standard Time on April 6, 2022. 

Contributed papers may be on any topic in the philosophy of science. The PSA2022 Program Committee is committed to assembling a program with high-quality papers on a variety of topics and diverse presenters that reflects the full range of current work in the philosophy of science. Continue reading

Categories: Announcement | Leave a comment

January 11 Forum: “Statistical Significance Test Anxiety” : Benjamini, Mayo, Hand

Here are all the slides along with the video from the 11 January Phil Stat Forum with speakers: Deborah G. Mayo, Yoav Benjamini and moderator/discussant David Hand.

D. Mayo                 Y. Benjamini.           D. Hand


Y. Benjamini’s slides: “The ASA president Task Force Statement on Statistical Significance and Replicability”



Mayo slides are from the Editorial* in Conservation Biology: “The Statistics Wars and Intellectual Conflicts of Interest” Mayo (2021)  




Video of presentations with D. Hand as moderator/discussant:


Categories: ASA Guide to P-values, ASA Task Force on Significance and Replicability, P-values, statistical significance | Leave a comment

Can’t Take the Fiducial Out of Fisher (if you want to understand the N-P performance philosophy) [i]


R.A. Fisher: February 17, 1890 – July 29, 1962

Continuing with posts in recognition of R.A. Fisher’s birthday, I reblog (with a few new comments) one from a few years ago on a topic that had previously not been discussed on this blog: Fisher’s fiducial probability

[Neyman and Pearson] “began an influential collaboration initially designed primarily, it would seem, to clarify Fisher’s writing. This led to their theory of testing hypotheses and to Neyman’s development of confidence intervals, aiming to clarify Fisher’s idea of fiducial intervals” (D.R. Cox, 2006, p. 195).

Continue reading

Categories: fiducial probability, Fisher, Phil6334/ Econ 6614, Statistics | Leave a comment

R.A. Fisher: “Statistical methods and Scientific Induction” with replies by Neyman and E.S. Pearson

17 Feb 1890-29 July 1962

In recognition of Fisher’s birthday (Feb 17), I reblog what I call the “Triad”–an exchange between  Fisher, Neyman and Pearson (N-P) a full 20 years after the Fisher-Neyman break-up–adding a few new introductory remarks here. While my favorite is still the reply by E.S. Pearson, which alone should have shattered Fisher’s allegations that N-P “reinterpret” tests of significance as “some kind of acceptance procedure”, they are all chock full of gems for different reasons. They are short and worth rereading. Neyman’s article pulls back the cover on what is really behind Fisher’s over-the-top polemics, what with Russian 5-year plans and commercialism in the U.S. Not only is Fisher jealous that N-P tests came to overshadow “his” tests, he is furious at Neyman for driving home the fact that Fisher’s fiducial approach had been shown to be inconsistent (by others). The flaw is glaring and is illustrated very simply by Neyman in his portion of the triad. Further details may be found in my book, SIST (2018) especially pp 388-392 linked to here. It speaks to a common fallacy seen every day in interpreting confidence intervals. As for Neyman’s “behaviorism”, Pearson’s last sentence is revealing. Continue reading

Categories: E.S. Pearson, Fisher, Neyman, phil/history of stat | Leave a comment

Happy Birthday R.A. Fisher: ‘Two New Properties of Mathematical Likelihood’

17 February 1890–29 July 1962

Today is R.A. Fisher’s birthday. I’ll reblog some Fisherian items this week with a few new remarks. This paper comes just before the conflicts with Neyman and Pearson (N-P) erupted.  Fisher links his tests and sufficiency, to the Neyman and Pearson lemma in terms of power. It’s as if we may see Fisher and N-P as ending up in a similar place while starting from different origins, as David Cox might say [1]. Unfortunately, the blow-up that occurred soon after is behind today’s misdirected war vs statistical significance tests.* I quote just the most relevant portions…the full article is linked below.** Happy Birthday Fisher! Continue reading

Categories: Fisher, phil/history of stat | Leave a comment

“Should Science Abandon Statistical Significance?” Session at AAAS Annual Meeting, Feb 18

Karen Kafadar, Yoav Benjamini, and Donald Macnaughton will be in a session:

Should Science Abandon Statistical Significance?

Friday, Feb 18 from 2-2:45 PM (EST) at the AAAS 2022 annual meeting.

The general program is here. To register*, go to this page.


The concept of statistical significance is central in scientific research. However, the concept is often poorly understood and thus is often unfairly criticized. This presentation includes three independent but overlapping arguments about the usefulness of the concept of statistical significance to reliably detect “effects” in frontline scientific research data. We illustrate the arguments with examples of scientific importance from genomics, physics, and medicine. We explain how the concept of statistical significance provides a cost-efficient objective way to empower scientific research with evidence.

Papers Continue reading

Categories: AAAS, Announcement, statistical significance | Leave a comment

January 11 PhilStat Forum: Mayo: “The Stat Wars and Intellectual Conflicts of Interest”

Here are my slides on my Editorial in Conservation Biology: “The Statistics Wars and Intellectual Conflicts of Interest” Mayo (2021)  presented at  the 11 January Phil Stat Forum with speakers: Deborah G. Mayo and Yoav Benjamini and moderator David Hand. (Benjamini’s slides & full Video to come shortly)

D. Mayo, Y. Benjamini, D. Hand



For more details on the focus and background readings, see this post on the Phil Stat Forum blog or this January 10 post.

Categories: editors

ENBIS Webinar: Statistical Significance and p-values

The video recording of yesterday’s event is available at:

European Network for Business and Industrial Statistics (ENBIS) Webinar:
Statistical Significance and p-values
Europe/Amsterdam (CET); 08:00-09:30 am (EST)

ENBIS will dedicate this webinar to the memory of Sir David Cox, who sadly passed away in January 2022.

Continue reading

Categories: Announcement, significance tests, Sir David Cox

“A [very informal] Conversation Between Sir David Cox & D.G. Mayo”

In June 2011, Sir David Cox agreed to a very informal ‘interview’ on the topics of the 2010 workshop that I co-ran at the London School of Economics (CPNSS), Statistical Science and Philosophy of Science, where he was a speaker. Soon after I began taping, Cox stopped me in order to show me how to do a proper interview. He proceeded to ask me questions, beginning with:

COX: Deborah, in some fields foundations do not seem very important, but we both think foundations of statistical inference are important; why do you think that is?

MAYO: I think because they ask about fundamental questions of evidence, inference, and probability. I don’t think that foundations of different fields are all alike; because in statistics we’re so intimately connected to the scientific interest in learning about the world, we invariably cross into philosophical questions about empirical knowledge and inductive inference.

Continue reading

Categories: Birnbaum, Likelihood Principle, Sir David Cox, StatSci meets PhilSci

Sir David Cox: An intellectual interview by Nancy Reid

Hinkley, Reid & Cox

Here’s an in-depth interview of Sir David Cox by Nancy Reid that brings out a rare, intellectual understanding and appreciation of some of Cox’s work. Only someone truly in the know could have managed to elicit these fascinating reflections. The interview took place in October 1993 and was published in 1994.

Nancy Reid (1994). A Conversation with Sir David Cox, Statistical Science 9(3): 439-455.







Categories: Sir David Cox

An interview with Sir David Cox by “Statistics Views” (upon turning 90)

Sir David Cox

Sir David Cox: July 15, 1924-Jan 18, 2022

The original Statistics Views interview is here:

“I would like to think of myself as a scientist, who happens largely to specialise in the use of statistics”– An interview with Sir David Cox


  • Author: Statistics Views
  • Date: 24 Jan 2014

Sir David Cox is arguably one of the world’s leading living statisticians. He has made pioneering and important contributions to numerous areas of statistics and applied probability over the years, of which perhaps the best known is the proportional hazards model, which is widely used in the analysis of survival data. The Cox point process was named after him. Continue reading

Categories: Sir David Cox
