Author Archives: Mayo

S. Senn: “Error point: The importance of knowing how much you don’t know” (guest post)


Stephen Senn
Consultant Statistician

‘The term “point estimation” made Fisher nervous, because he associated it with estimation without regard to accuracy, which he regarded as ridiculous.’ Jimmy Savage [1, p. 453] 

First things second

The classic text by David Cox and David Hinkley, Theoretical Statistics (1974), has two extremely interesting features as regards estimation. The first is an indirect, implicit message and the second an explicit one; both teach that point estimation is far from being an obvious goal of statistical inference. The indirect message is that the chapter on point estimation (chapter 8) comes after that on interval estimation (chapter 7). This may puzzle the reader, who may anticipate that the complications of interval estimation would be handled after the apparently simpler point estimation rather than before. However, with the start of chapter 8, the reasoning is made clear. Cox and Hinkley state:

Superficially, point estimation may seem a simpler problem to discuss than that of interval estimation; in fact, however, any replacement of an uncertain quantity is bound to involve either some arbitrary choice or a precise specification of the purpose for which the single quantity  is required. Note that in interval-estimation we explicitly recognize that the conclusion is uncertain, whereas in point estimation…no explicit recognition is involved in the final answer. [2, p. 250]

In my opinion, a great deal of confusion about statistics can be traced to the fact that the point estimate is seen as being the be all and end all, the expression of uncertainty being forgotten. For example, much of the criticism of randomisation overlooks the fact that the statistical analysis will deliver a probability statement and, other things being equal, the more unobserved prognostic factors there are, the more uncertain the result will be claimed to be. However, statistical statements are not wrong because they are uncertain; they are wrong if claimed to be more certain (or less certain) than they are.

A standard error

Amongst justifications that Cox and Hinkley give for calculating point estimates is that when supplemented with an appropriately calculated standard error they will, in many cases, provide the means of calculating a confidence interval, or if you prefer being Bayesian, a credible interval. Thus, to provide a point estimate without also providing a standard error is, indeed, an all too standard error. Of course, there is no value in providing a standard error unless it has been calculated appropriately and addressing the matter of appropriate calculation is not necessarily easy. This is a point I shall pick up below but for the moment let us proceed to consider why it is useful to have standard errors.

First, suppose you have a point estimate. At some time in the past you or someone else decided to collect the data that made it possible. Time and money were invested in doing this. It would not have been worth doing this unless there was a state of uncertainty that the collection of data went some way to resolving. Has it been resolved? Are you certain enough? If not, should more data be collected or would that not be worth it? This cannot be addressed without assessing the uncertainty in your estimate and this is what the standard error is for.

Second, you may wish to combine the estimate with other estimates. This has a long history in statistics. It has been more recently (in the last half century) developed under the heading of meta-analysis, which is now a huge field of theoretical study and practical application. However, the subject is much older than that. For example, I have on the shelves of my library at home, a copy of the second (1875) edition of On the Algebraical and Numerical Theory of Errors of Observations and the Combination of Observations, by George Biddell Airy (1801-1892). [3] Chapter III is entitled ‘Principles of Forming the Most Advantageous Combination of Fallible Measures’ and treats the matter in some detail. For example, Airy defines what he calls the theoretical weight (t.w.) for combining errors as

$$\text{t.w.} = \frac{1}{(\text{p.e.})^2},$$

where p.e. is the probable error, and then draws attention to ‘two remarkable results’

First. The combination-weight for each measure ought to be proportional to its theoretical weight.

Second. When the combination-weight for each measure is proportional to its theoretical weight, the theoretical weight of the final result is equal to the sum of the theoretical weights of the several collateral measures. (pp. 55-56).

We are now more used to using the standard error (SE) rather than the probable error (PE) to which Airy refers. However, the PE, which can be defined as the SE multiplied by the upper quartile of the standard Normal distribution, is just a multiple of the SE. Thus we have PE ≈ 0.6745 × SE and therefore 50% of values ought to be in the range mean − PE to mean + PE, hence the name. Since the PE is just a multiple of the SE, Airy’s second remarkable result applies in terms of SEs also. Nowadays we might speak of the precision, defined thus

$$\text{precision} = \frac{1}{\text{SE}^2},$$

and say that estimates should be combined in proportion to their precision, in which case the precision of the final result will be the sum of the individual precisions.
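As a minimal sketch of precision weighting, the following combines independent estimates in proportion to 1/SE² and reports the precision of the result. The function name and the two example estimates are hypothetical, chosen only to illustrate the arithmetic.

```python
def combine(estimates, standard_errors):
    """Precision-weighted (inverse-variance) combination of independent estimates."""
    # Precision of each estimate is 1 / SE^2.
    precisions = [1.0 / se ** 2 for se in standard_errors]
    # The precision of the combined result is the sum of the precisions.
    total_precision = sum(precisions)
    pooled = sum(w * est for w, est in zip(precisions, estimates)) / total_precision
    pooled_se = total_precision ** -0.5
    return pooled, pooled_se, total_precision


# Hypothetical example: two estimates, the second with twice the SE of the first.
estimates = [10.0, 14.0]
standard_errors = [1.0, 2.0]
pooled, pooled_se, total_precision = combine(estimates, standard_errors)
print(pooled, pooled_se, total_precision)
```

The less precise estimate receives a quarter of the weight of the more precise one, and the combined standard error is smaller than either individual standard error.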

This second edition of Airy’s book dates from 1875 but, although I have not got a copy of the first edition, which dates from 1861, I am confident that the history can be pushed back at least as far as that. In fact, as has often been noticed, fixed effects meta-analysis is really just a form of least squares, a subject developed at the end of the 18th and beginning of the 19th century by Legendre, Gauss and Laplace, amongst others. [4]

A third reason to be interested in standard errors is that you may wish to carry out a Bayesian analysis. In that case, you should consider what the mean and the ‘standard error’ of your prior distribution are. You can then apply Airy’s two remarkable results as follows.
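A sketch of the standard form this takes, assuming a Normal prior and an estimate that is approximately Normally distributed with a known standard error. The symbols are introduced here for illustration only: $m_0$ and $s_0$ are the prior mean and 'standard error', $\hat\theta$ and $s$ are the estimate and its standard error, and the weights are the precisions $w_0 = 1/s_0^2$ and $w = 1/s^2$.

$$\text{posterior mean} = \frac{w_0\, m_0 + w\, \hat\theta}{w_0 + w}, \qquad \text{posterior precision} = w_0 + w .$$

The first expression weights prior and data in proportion to their theoretical weights; the second says that the weight of the final result is the sum of the two.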


Ignoring uncertainty

Suppose that you regard all this concern with uncertainty as an unnecessary refinement and argue, “Never mind Airy’s precision weighting; when I have more than one estimate, I’ll just use an unweighted average”. This might seem like a reasonable ‘belt and braces’ approach but the figure below illustrates a problem. It supposes the following. You have one estimate and you then obtain a second. You now form an unweighted average of the two. What is the precision of this mean compared to a) the first result alone and b) the second result alone? In the figure, the X axis gives the relative precision of the second result alone to that of the first result alone. The Y axis gives the relative precision of the mean to the first result alone (red curve) or to the second result alone (blue curve).

Figure: Precision of an unweighted mean of two estimates as a function of the relative precision of the second compared to the first. The red curve gives the relative precision of the mean to that of the first and the blue curve the relative precision of the mean to the second. If both estimates are equally precise, the ratio is one and the precision of the mean is twice that of either result alone.

Where a curve is below 1, the precision of the mean is below that of the relevant single result. If the precision of the second result is less than 1/3 of the first, you would be better off using the first result alone. On the other hand, if the second result is more than three times as precise as the first, you would be better off using the second alone. The consequence is that if you do not know the precision of your results, you not only don’t know which one to trust, you don’t even know if an average of them should be preferred.
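The thresholds of 1/3 and 3 can be checked with a short calculation. This is a sketch assuming two independent estimates; the function name is hypothetical. The variance of an unweighted mean of two independent estimates is a quarter of the sum of their variances.

```python
def relative_precision_of_unweighted_mean(r):
    """r is the precision of the second estimate divided by that of the first.

    Returns (precision of mean / precision of first,
             precision of mean / precision of second).
    """
    if r <= 0:
        raise ValueError("r must be positive")
    var_first = 1.0            # scale so the first estimate has precision 1
    var_second = 1.0 / r       # the second estimate then has precision r
    var_mean = (var_first + var_second) / 4.0
    precision_mean = 1.0 / var_mean
    return precision_mean / 1.0, precision_mean / r


for r in (0.2, 1 / 3, 1.0, 3.0, 5.0):
    vs_first, vs_second = relative_precision_of_unweighted_mean(r)
    print(f"r={r:.3f}  mean vs first={vs_first:.3f}  mean vs second={vs_second:.3f}")
```

Algebraically the two ratios are 4r/(1 + r) and 4/(1 + r), which cross 1 at r = 1/3 and r = 3 respectively and both equal 2 at r = 1.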

Not ignoring uncertainty

So, to sum up, if you don’t know how uncertain your evidence is, you can’t use it. Thus, assessing uncertainty is important. However, as I said in the introduction, all too easily, attention focuses on estimating the parameter of interest and not the probability statement. This (perhaps unconscious) obsession with point estimation as the be all and end all causes problems. As a common example of the problem, consider the following statement: ‘all covariates are balanced, therefore they do not need to be in the model’. The point of view expresses the belief that nothing of relevance will change if the covariates are not in the model, so why bother.

It is true that if a linear model applies, the point estimate for a ‘treatment effect’ will not change by including balanced covariates in the model. However, the expression of uncertainty will be quite different. The balanced case is one that does not apply in general. It thus follows that valid expressions of uncertainty have to allow for prognostic factors being imbalanced and this is, indeed, what they do. Misunderstanding of this is an error often made by critics of randomisation. I explain the misunderstanding like this: If we knew that important but unobserved prognostic factors were balanced, the standard analysis of clinical trials would be wrong. Thus, to claim that the analysis of a randomised clinical trial relies on prognostic factors being balanced is exactly the opposite of what is true. [5]

As I explain in my blog Indefinite Irrelevance, if the prognostic factors are balanced, not adjusting for them treats them as if they might be imbalanced: the confidence interval will be too wide given that we know that they are not imbalanced. (See also The Well Adjusted Statistician. [6])

Another way of understanding this is through the following example.

Consider a two-armed placebo-controlled clinical trial of a drug with a binary covariate (let us take the specific example of sex) and suppose that the patients split 50:50 according to the covariate. Now consider these two questions. What allocation of patients by sex within treatment arms will be such that differences in sex do not impact on 1) the estimate of the effect and 2) the estimate of the standard error of the estimate of the effect?

Everybody knows what the answer is to 1): the males and females must be equally distributed with respect to treatment. (Allocation one in the table below.) However, the answer to 2) is less obvious: it is that the two groups within which variances are estimated must be homogeneous by treatment and sex. (Allocation two in the table below shows one of the two possibilities.) That means that if we do not put sex in the model, in order to remove sex from affecting the estimate of the variance, we would have to have all the males in one treatment group and all the females in another.

           Allocation one     Allocation two
           Sex                Sex
           Male    Female     Male    Female     Total

Placebo    25      25         50      0          50
Drug       25      25         0       50         50
Total      50      50         50      50         100

Table: Percentage allocation by sex and treatment for two possible clinical trials

Of course, nobody uses allocation two but if allocation one is used, then the logical approach is to analyse the data so that the influence of sex is eliminated from the estimate of the variance, and hence the standard error. Savage, referring to Fisher, puts it thus:

He taught what should be obvious but always demands a second thought from me: if an experiment is laid out to diminish the variance of comparisons, as by using matched pairs…the variance eliminated from the comparison shows up in the estimate of this variance (unless care is taken to eliminate it)… [1, p. 450]

The consequence is that one needs to allow for this in the estimation procedure. One needs to ensure not only that the effect is estimated appropriately but that its uncertainty is also assessed appropriately. In our example this means that sex, in addition to treatment, must be in the model.
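A small constructed illustration of this point, under stated assumptions: the data are artificial (not from any trial), the treatment and sex effects are hypothetical values, and the same fixed pattern of 'noise' is used in each of the four cells of allocation one so that the arithmetic is exact. The sketch compares the treatment estimate and its standard error with and without sex in the model.

```python
from math import sqrt
from statistics import mean

TREATMENT_EFFECT = 2.0  # hypothetical
SEX_EFFECT = 5.0        # hypothetical
# Fixed residual pattern with mean zero, reused in every cell (25 patients each).
noise = [((i % 5) - 2) * 0.5 for i in range(25)]

# Allocation one: 25 patients in each arm-by-sex cell; arm 1 = drug, sex 1 = female.
cells = {
    (arm, sex): [TREATMENT_EFFECT * arm + SEX_EFFECT * sex + e for e in noise]
    for arm in (0, 1)
    for sex in (0, 1)
}


def sum_sq(values):
    """Sum of squared deviations from the mean."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values)


placebo = cells[(0, 0)] + cells[(0, 1)]
drug = cells[(1, 0)] + cells[(1, 1)]

# Point estimates: identical because sex is balanced across arms.
estimate_unadjusted = mean(drug) - mean(placebo)
estimate_adjusted = mean([mean(cells[(1, s)]) - mean(cells[(0, s)]) for s in (0, 1)])

# Model without sex: residual variation is measured within treatment arms.
var_unadjusted = (sum_sq(placebo) + sum_sq(drug)) / (100 - 2)
# Model with sex: the interaction is exactly zero by construction, so the
# residual sum of squares of the additive model equals the within-cell one.
var_adjusted = sum(sum_sq(c) for c in cells.values()) / (100 - 3)

se_unadjusted = sqrt(var_unadjusted * (1 / 50 + 1 / 50))
se_adjusted = sqrt(var_adjusted * (1 / 50 + 1 / 50))
print(estimate_unadjusted, estimate_adjusted, se_unadjusted, se_adjusted)
```

The two point estimates agree, but the standard error from the model without sex is inflated by the variation between the sexes, which the balanced design has already removed from the comparison.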

Here There be Tygers

it doesn’t approve of your philosophy Ray Bradbury, Here There be Tygers

So, estimating uncertainty is a key task of any statistician. Most commonly, it is addressed by calculating a standard error. However, this is not necessarily a simple matter. The school of statistics associated with design and analysis of agricultural experiments founded by RA Fisher, and to which I have referred as the Rothamsted School, addressed this in great detail. Such agricultural experiments could have a complicated block structure, for example, rows and columns of a field, with whole plots defined by their crossing and subplots within the whole plots. Many treatments could be studied simultaneously, with some (for example crop variety) being capable of being varied at the level of the plots but some (for example fertiliser) at the level of the subplots. This meant that variation at different levels affected different treatment factors. John Nelder developed a formal calculus to address such complex problems [7, 8].

In the world of clinical trials in which I have worked, we distinguish between trials in which patients can receive different treatments on different occasions and those in which each patient can independently receive only one treatment and those in which all the patients in the same centre must receive the same treatment. Each such design (cross-over, parallel, cluster) requires a different approach to assessing uncertainty. (See To Infinity and Beyond.) Naively treating all observations as independent can underestimate the standard error, a problem that Hurlbert has referred to as pseudoreplication. [9]
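To give a rough sense of the size of the problem, here is a sketch using the usual design-effect formula for equal-sized clusters with a common intraclass correlation. The formula is a standard textbook result rather than something taken from the references above, and the numbers are hypothetical.

```python
from math import sqrt


def se_of_mean(sigma, n_clusters, cluster_size, icc):
    """Standard error of an overall mean for equal-sized clusters.

    Assumes a common within-cluster correlation `icc`; icc = 0 gives the
    naive formula that treats all observations as independent.
    """
    n = n_clusters * cluster_size
    design_effect = 1 + (cluster_size - 1) * icc
    return sigma * sqrt(design_effect / n)


# Hypothetical trial: 20 centres of 30 patients, outcome SD of 10.
naive = se_of_mean(10.0, 20, 30, icc=0.0)
clustered = se_of_mean(10.0, 20, 30, icc=0.05)
print(naive, clustered, clustered / naive)
```

Even a modest correlation of 0.05 within centres makes the appropriate standard error more than half as large again as the naive one in this example.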

A key point, however, is this: the formal nature of experimentation forces this issue of variation to our attention. In observational studies we may be careless. We tend to assume that once we have chosen and made various adjustments to correct bias in the point estimate, the ‘errors’ can then be treated as independent. However, only for the simplest of experimental studies would such an assumption be true, so what justifies making it as a matter of habit for observational ones?

Recent work on historical controls has underlined the problem [10-12]. Trials that use such controls have features of both experimental and observational studies and so provide an illustrative bridge between the two. It turns out that treating the data as if they came from one observational study would underestimate the variance and hence overestimate the precision of the result. The implication is that analyses of observational studies more generally may be producing inappropriately narrow confidence intervals. [10]

Rigorous uncertainty

If a man will begin with certainties, he shall end in doubts; but if he will be content to begin with doubts he shall end in certainties. Francis Bacon, The Advancement of Learning, Book I, v,8.

In short, I am making an argument for Fisher’s general attitude to inference. Harry Marks has described it thus:

Fisher was a sceptic…But he was an unusually constructive sceptic. Uncertainty and error were, for Fisher, inevitable. But ‘rigorously specified uncertainty’ provided a firm ground for making provisional sense of the world. H Marks [13, p.94]

Point estimates are not enough. It is rarely the case that you have to act immediately based on your best guess. Where you don’t, you have to know how good your guesses are. This requires a principled approach to assessing uncertainty.


  1. Savage, J., On rereading R.A. Fisher. Annals of Statistics, 1976. 4(3): p. 441-500.
  2. Cox, D.R. and D.V. Hinkley, Theoretical Statistics. 1974, London: Chapman and Hall.
  3. Airy, G.B., On the Algebraical and Numerical Theory of Errors of Observations and the Combination of Observations. 1875, London: Macmillan.
  4. Stigler, S.M., The History of Statistics: The Measurement of Uncertainty before 1900. 1986, Cambridge, Massachusetts: Belknap Press.
  5. Senn, S.J., Seven myths of randomisation in clinical trials. Statistics in Medicine, 2013. 32(9): p. 1439-50.
  6. Senn, S.J., The well-adjusted statistician. Applied Clinical Trials, 2019: p. 2.
  7. Nelder, J.A., The analysis of randomised experiments with orthogonal block structure I. Block structure and the null analysis of variance. Proceedings of the Royal Society of London. Series A, 1965. 283: p. 147-162.
  8. Nelder, J.A., The analysis of randomised experiments with orthogonal block structure II. Treatment structure and the general analysis of variance. Proceedings of the Royal Society of London. Series A, 1965. 283: p. 163-178.
  9. Hurlbert, S.H., Pseudoreplication and the design of ecological field experiments. Ecological Monographs, 1984. 54(2): p. 187-211.
  10. Collignon, O., et al., Clustered allocation as a way of understanding historical controls: Components of variation and regulatory considerations. Stat Methods Med Res, 2019: p. 962280219880213.
  11. Galwey, N.W., Supplementation of a clinical trial by historical control data: is the prospect of dynamic borrowing an illusion? Statistics in Medicine 2017. 36(6): p. 899-916.
  12. Schmidli, H., et al., Robust meta‐analytic‐predictive priors in clinical trials with historical control information. Biometrics, 2014. 70(4): p. 1023-1032.
  13. Marks, H.M., Rigorous uncertainty: why RA Fisher is important. Int J Epidemiol, 2003. 32(6): p. 932-7; discussion 945-8.



Aris Spanos Reviews Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars

A. Spanos

Aris Spanos was asked to review my Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (CUP, 2018), but he was to combine it with a review of the re-issue of Ian Hacking’s classic Logic of Statistical Inference. The journal is OEconomia: History, Methodology, Philosophy. Below are excerpts from his discussion of my book (pp. 843-860). I will jump past the Hacking review, occasionally excerpting for length. To read his full article go to external journal pdf or stable internal blog pdf.


2 Mayo (2018). Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars

The sub-title of Mayo’s (2018) book provides an apt description of the primary aim of the book in the sense that its focus is on the current discussions pertaining to replicability and trustworthy empirical evidence that revolve around the main fault line in statistical inference: the nature, interpretation and uses of probability in statistical modeling and inference. This underlies not only the form and structure of inductive inference, but also the nature of the underlying statistical reasonings as well as the nature of the evidence it gives rise to.

A crucial theme in Mayo’s book pertains to the current confusing and confused discussions on reproducibility and replicability of empirical evidence. The book cuts through the enormous level of confusion we see today about basic statistical terms, and in so doing explains why the experts so often disagree about reforms intended to improve statistical science.

Mayo makes a concerted effort to delineate the issues and clear up these confusions by defining the basic concepts accurately and placing many widely held methodological views in the best possible light before scrutinizing them. In particular, the book discusses at length the merits and demerits of the proposed reforms which include: (a) replacing p-values with Confidence Intervals (CIs), (b) using estimation-based effect sizes and (c) redefining statistical significance.

The key philosophical concept employed by Mayo to distinguish between a sound empirical evidential claim for a hypothesis H and an unsound one is the notion of a severe test: if little has been done to rule out flaws (errors and omissions) in pronouncing that data x0 provide evidence for a hypothesis H, then that inferential claim has not passed a severe test, rendering the claim untrustworthy. One has trustworthy evidence for a claim C only to the extent that C passes a severe test; see Mayo (1983; 1996). A distinct advantage of the concept of severe testing is that it is sufficiently general to apply to both frequentist and Bayesian inferential methods.

Mayo makes a case that there is a two-way link between philosophy and statistics. On one hand, philosophy helps in resolving conceptual, logical, and methodological problems of statistical inference. On the other hand, viewing statistical inference as severe testing gives rise to novel solutions to crucial philosophical problems including induction, falsification and the demarcation of science from pseudoscience. In addition, it serves as the foundation for understanding and getting beyond the statistics wars that currently revolve around the replication crises; hence the title of the book, Statistical Inference as Severe Testing.

Chapter (excursion) 1 of Mayo’s (2018) book sets the scene by scrutinizing the different role of probability in statistical inference, distinguishing between:

(i) Probabilism. Probability is used to assign a degree of confirmation, support or belief in a hypothesis H, given data x0 (Bayesian, likelihoodist, Fisher (fiducial)). An inferential claim H is warranted when it is assigned a high probability, support, or degree of belief (absolute or comparative).
(ii) Performance. Probability is used to ensure the long-run reliability of inference procedures; type I, II, coverage probabilities (frequentist, behavioristic Neyman-Pearson). An inferential claim H is warranted when it stems from a procedure with a low long-run error.
(iii) Probativism. Probability is used to assess the probing capacity of inference procedures, pre-data (type I, II, coverage probabilities), as well as post-data (p-value, severity evaluation). An inferential claim H is warranted when the different ways it can be false have been adequately probed and averted.

Mayo argues that probativism based on the severe testing account uses error probabilities to output an evidential interpretation based on assessing how severely an inferential claim H has passed a test with data x0. Error control and long-run reliability are necessary but not sufficient for probativism. This perspective is contrasted to probabilism (Law of Likelihood (LL) and Bayesian posterior) that focuses on the relationships between data x0 and hypothesis H, and ignores outcomes x∈Rn other than x0 by adhering to the Likelihood Principle (LP): given a statistical model Mθ(x) and data x0, all relevant sample information for inference purposes is contained in L(θ; x0), ∀θ∈Θ. Such a perspective can produce unwarranted results with high probability, by failing to pick up on optional stopping, data dredging and other biasing selection effects. It is at odds with what is widely accepted as the most effective way to improve replication: predesignation, and transparency about how hypotheses and data were generated and selected.

Chapter (excursion) 2 entitled ‘Taboos of Induction and Falsification’ relates the various uses of probability to draw certain parallels between probabilism, Bayesian statistics and Carnapian logics of confirmation on one side, and performance, frequentist statistics and Popperian falsification on the other. The discussion in this chapter covers a variety of issues in philosophy of science, including, the problem of induction, the asymmetry of induction and falsification, sound vs. valid arguments, enumerative induction (straight rule), confirmation theory (and formal epistemology), statistical affirming the consequent, the old evidence problem, corroboration, demarcation of science and pseudoscience, Duhem’s problem and novelty of evidence. These philosophical issues are also related to statistical conundrums as they relate to significance testing, fallacies of rejection, the cannibalization of frequentist testing known as Null Hypothesis Significance Testing (NHST) in psychology, and the issues raised by the reproducibility and replicability of evidence.

Chapter (excursion) 3 on ‘Statistical Tests and Scientific Inference’ provides a basic introduction to frequentist testing paying particular attention to crucial details, such as specifying explicitly the assumed statistical model Mθ(x) and the proper framing of hypotheses in terms of its parameter space Θ, with a view to providing a coherent account by avoiding undue formalism. The Neyman-Pearson (N-P) formulation of hypothesis testing is explained using a simple example, and then related to Fisher’s significance testing. What is different from previous treatments is that the claimed ‘inconsistent hybrid’ associated with the NHST caricature of frequentist testing is circumvented. The crucial difference often drawn is based on the N-P emphasis on pre-data long-run error probabilities, and the behavioristic interpretation of tests as accept/reject rules. By contrast, the post-data p-value associated with Fisher’s significance tests is thought to provide a more evidential interpretation. In this chapter, the two approaches are reconciled in the context of the error statistical framework. The N-P formulation provides the formal framework in the context of which an optimal theory of frequentist testing can be articulated, but in its current expositions lacks a proper evidential interpretation. [For the detailed example see his review pdf.]

If a hypothesis H0 passes a test Τα that was highly capable of finding discrepancies from it, were they to be present, then the passing result indicates some evidence for their absence. The resulting evidential result comes in the form of the magnitude of the discrepancy γ from H0 warranted with test Τα and data x0 at different levels of severity. The intuition underlying the post-data severity is that a small p-value or a rejection of H0 based on a test with low power (e.g. a small n) for detecting a particular discrepancy γ provides stronger evidence for the presence of γ than if the test had much higher power (e.g. a large n).

The post-data severity evaluation outputs the discrepancy γ stemming from the testing results and takes the probabilistic form:

SEV(θ ≶ θ1; x0) = P(d(X) ≷ d(x0); θ1=θ0+γ), for all θ1∈Θ1,

where the inequalities are determined by the testing result and the sign of d(x0). [Ed. note: ≶ is his way of combining the definition of severity for both > and <, in order to abbreviate. It is not used in SIST.] When the relevant N-P test result is ‘accept (reject) H0’ one is seeking the smallest (largest) discrepancy γ, in the form of an inferential claim θ ≶ θ1=θ0+γ, warranted by Τα and x0 at a high enough probability, say .8 or .9. The severity evaluations are introduced by connecting them to more familiar calculations relating to observed confidence intervals and p-value calculations. A more formal treatment of the post-data severity evaluation is given in chapter (excursion) 5. [Ed. note: “Excursions” are actually Parts, Tours are chapters.]
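A sketch of the severity calculation in the simplest setting may help. It assumes a one-sided test of H0: µ ≤ µ0 against H1: µ > µ0 for a Normal mean with known σ, where the test has rejected H0; in that case the severity for the claim µ > µ1 is P(d(X) ≤ d(x0); µ=µ1). The function names and numerical values are hypothetical and are not taken from the review.

```python
from math import sqrt
from statistics import NormalDist


def severity_greater(xbar, mu1, sigma, n):
    """Severity for the claim mu > mu1 after a rejection of H0: mu <= mu0.

    Computed as P(Xbar <= observed xbar; mu = mu1) for a Normal mean
    with known sigma.
    """
    return NormalDist().cdf((xbar - mu1) / (sigma / sqrt(n)))


def observed_mean(d_obs, mu0, sigma, n):
    """Sample mean corresponding to an observed test statistic d(x0)."""
    return mu0 + d_obs * sigma / sqrt(n)


MU0, SIGMA, GAMMA, D_OBS = 0.0, 1.0, 0.2, 2.0  # hypothetical values

# Same observed d(x0) = 2.0 from a small and from a large sample.
sev_small_n = severity_greater(observed_mean(D_OBS, MU0, SIGMA, 25), MU0 + GAMMA, SIGMA, 25)
sev_large_n = severity_greater(observed_mean(D_OBS, MU0, SIGMA, 400), MU0 + GAMMA, SIGMA, 400)
print(round(sev_small_n, 4), round(sev_large_n, 4))
```

With the same d(x0), the claim µ > µ0 + 0.2 passes with severity of about .84 when n = 25 but only about .02 when n = 400, which is the sense in which a rejection from the less powerful test gives stronger evidence for that particular discrepancy.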

Mayo uses the post-data severity perspective to scorch several misinterpretations of the p-value, including the claim that the p-value is not a legitimate error probability. She also calls into question any comparisons of the tail areas of d(X) under H0 that vary with x∈Rn, with posterior distribution tail areas that vary with θ∈Θ, pointing out that this is tantamount to comparing apples and oranges!

The real life examples of the 1919 eclipse data for testing the General Theory of Relativity, as well as the 2012 discovery of the Higgs particle are used to illustrate some of the concepts in this chapter.

The discussion in this chapter sheds light on several important problems in statistical inference, including several howlers of statistical testing, Jeffreys’ tail area criticism, weak conditionality principle and the likelihood principle.

[To read about excursion 4, see his full review pdf.]

Chapter (excursion) 5, entitled ‘Power and Severity’, provides an in-depth discussion of power and its abuses or misinterpretations, and scotches several confusions permeating the current discussions on the replicability of empirical evidence.

Confusion 1: The power of a N-P test Τα:= {d(X), C1(α)} is a pre-data error probability that calibrates the generic (for any sample realization x∈Rn ) capacity of the test in detecting different discrepancies from H0, for a given type I error probability α. As such, the power is not a point function one can evaluate arbitrarily at a particular value θ1. It is defined for all values in the alternative space θ1∈Θ1.

Confusion 2: The power function is properly defined for all θ1∈Θ1 only when (Θ0, Θ1) constitute a partition of Θ. This is to ensure that θ is not in a subset of Θ ignored by the comparisons since the main objective is to narrow down the unknown parameter space Θ using hypothetical values of θ. …Hypothesis testing poses questions as to whether a hypothetical value θ0 is close enough to θ in the sense that the difference (θ – θ0) is ‘statistically negligible’; a notion defined using error probabilities.

Confusion 3: Hacking (1965) raised the problem of using predata error probabilities, such as the significance level α and power, to evaluate the testing results post-data. As mentioned above, the post-data severity aims to address that very problem, and is extensively discussed in Mayo (2018), excursion 5.

Confusion 4: Mayo and Spanos (2006) define “attained power” by replacing cα with the observed d(x0). But this should not be confused with replacing θ1 with its observed estimate [e.g., x̄n], as in what is often called “observed” or “retrospective” power. To compare the two in example 2, contrast:

Attained power: POW(µ1)=Pr(d(X) > d(x0); µ=µ1), for all µ1>µ0,

with what Mayo calls Shpower which is defined at µ=x̄n:

Shpower: POW(x̄n)=Pr(d(X) > d(x0); µ=x̄n).

Shpower makes very little statistical sense, unless point estimation justifies the inferential claim x̄n ≅ µ, which it does not, as argued above. Unfortunately, the statistical literature in psychology is permeated with (implicitly) invoking such a claim when touting the merits of estimation-based effect sizes. The estimate x̄n represents just a single value of X̄n ∼ N(µ, σ²/n), and any inference pertaining to µ needs to take into account the uncertainty described by this sampling distribution; hence, the call for using interval estimation and hypothesis testing to account for that sampling uncertainty. The post-data severity evaluation addresses this problem using hypothetical reasoning and taking into account the relevant statistical context (11). It outputs the discrepancy from H0 warranted by test Τα and data x0, with high enough severity, say bigger than .85. Invariably, inferential claims of the form µ ≷ µ1= x̄n are assigned low severity of .5.
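The contrast between the two quantities can be sketched numerically, assuming a Normal mean with known σ and d(X) = √n(X̄n − µ0)/σ. The function name and the numbers are hypothetical; the point is only that evaluating the probability at µ = x̄n always returns .5, whatever the data.

```python
from math import sqrt
from statistics import NormalDist


def prob_exceeding_observed(xbar, mu, mu0, sigma, n):
    """Pr(d(X) > d(x0); mu), with d(X) = sqrt(n) * (Xbar - mu0) / sigma."""
    se = sigma / sqrt(n)
    d_obs = (xbar - mu0) / se
    shift = (mu - mu0) / se  # mean of d(X) when the true mean is mu
    return 1 - NormalDist(mu=shift, sigma=1).cdf(d_obs)


XBAR, MU0, SIGMA, N = 0.4, 0.0, 1.0, 25  # hypothetical data summary

# Attained power, evaluated at a hypothesised alternative mu1 = 0.2.
attained = prob_exceeding_observed(XBAR, 0.2, MU0, SIGMA, N)
# The same probability evaluated at the point estimate itself.
at_estimate = prob_exceeding_observed(XBAR, XBAR, MU0, SIGMA, N)
print(round(attained, 4), at_estimate)
```

Plugging the observed mean in for µ centres the sampling distribution of d(X) on d(x0), so half of it lies on either side; this is why claims of the form µ ≷ x̄n receive severity .5.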

Confusion 5: Frequentist error probabilities (type I, II, coverage, p-value) are not conditional on H (H0 or H1) since θ=θ0 or θ=θ1 being ‘true or false’ do not constitute legitimate events in the context of Mθ(x); θ is an unknown constant. The clause ‘given H is true’ refers to hypothetical scenarios under which the sampling distribution of the test statistic d(X) is evaluated as in (10).

This confusion undermines the credibility of Positive Predictive Value (PPV):

$$\text{PPV} = \Pr(F \mid R) = \frac{\Pr(R \mid F)\Pr(F)}{\Pr(R \mid F)\Pr(F) + \Pr(R \mid \bar{F})\Pr(\bar{F})},$$

where (i) F = H0 is false (with F̄ its complement), (ii) R=test rejects H0, and (iii) H0: no disease, used by Ioannidis (2005) to make his case that ‘most published research findings are false’ when PPV = Pr(F|R)<.5. His case is based on ‘guessing’ probabilities at a discipline wide level, such as Pr(F)=.1, Pr(R|F)=.8 and Pr(R|F̄)=.15, and presuming that the last two relate to the power and significance level of a N-P test. He then proceeds to blame the widespread abuse of significance testing (p-hacking, multiple testing, cherry-picking, low power) for the high de facto type I error (.15). Granted, such abuses do contribute to untrustworthy evidence, but not via false positive/negative rates since (i) and (iii) are not legitimate events in the context of Mθ(x), and thus Pr(R|F) and Pr(R|F̄) have nothing to do with the significance level and the power of a N-P test. Hence, the analogical reasoning relating the false positive and false negative rates in medical detecting devices to the type I and II error probabilities in frequentist testing is totally misplaced. These rates are established by the manufacturers of medical devices after running a very large number (say, 10000) of medical ‘tests’ with specimens that are known to be positive or negative; they are prepared in a lab. Known ‘positive’ and ‘negative’ specimens constitute legitimate observable events one can condition upon. In contrast, frequentist error probabilities (i) are framed in terms of θ (which are not observable events in Mθ(x)) and (ii) depend crucially on the particular statistical context (11); there is no statistical context for the false positive and false negative rates.
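For reference, the arithmetic behind the PPV figure can be reproduced as follows. This sketch only shows the Bayes' theorem computation with the 'guessed' inputs quoted above (.1, .8 and .15); it takes no position on whether those inputs are legitimate, which is precisely what the review disputes. The function and argument names are hypothetical.

```python
def positive_predictive_value(prior_null_false, pr_reject_if_null_false, pr_reject_if_null_true):
    """Pr(H0 false | test rejects) by Bayes' theorem."""
    numerator = pr_reject_if_null_false * prior_null_false
    denominator = numerator + pr_reject_if_null_true * (1 - prior_null_false)
    return numerator / denominator


# Inputs quoted in the text: Pr(F) = .1, Pr(R | F) = .8, Pr(R | not F) = .15.
ppv = positive_predictive_value(0.1, 0.8, 0.15)
print(round(ppv, 3))
```

With these inputs the PPV is about .37, below the .5 threshold used in the 'most published research findings are false' argument.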

A stronger case can be made that abuses and misinterpretations of frequentist testing are only symptomatic of a more extensive problem: the recipe-like/uninformed implementation of statistical methods. This contributes in many different ways to untrustworthy evidence, including: (i) statistical misspecification (imposing invalid assumptions on one’s data), (ii) poor implementation of inference methods (insufficient understanding of their assumptions and limitations), and (iii) unwarranted evidential interpretations of their inferential results (misinterpreting p-values and CIs, etc.).

Mayo uses the concept of a post-data severity evaluation to illuminate the above-mentioned issues and explain how it can also provide the missing evidential interpretation of testing results. The same concept is also used to clarify numerous misinterpretations of the p-value throughout this book, as well as the fallacies:
(a) Fallacy of acceptance (non-rejection). No evidence against H0 is misinterpreted as evidence for it. This fallacy can easily arise when the power of a test is low (e.g. small n problem) in detecting sizeable discrepancies.
(b) Fallacy of rejection. Evidence against H0 is misinterpreted as evidence for a particular H1. This fallacy can easily arise when the power of a test is very high (e.g. large n problem) and it detects trivial discrepancies; see Mayo and Spanos (2006).
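The role of sample size in both fallacies can be illustrated with a small power calculation. The sketch below assumes a one-sided test of H0: µ ≤ µ0 vs. H1: µ > µ0 in a simple Normal model with known σ; the function name `power` and the numerical values are hypothetical choices made for illustration, not taken from the book.

```python
from math import sqrt
from statistics import NormalDist

Z = NormalDist()  # standard Normal distribution

def power(mu1, mu0, sigma, n, alpha=0.05):
    """Pr(test rejects H0; mu = mu1) for the one-sided test with known sigma."""
    c_alpha = Z.inv_cdf(1.0 - alpha)           # rejection threshold for d(X)
    shift = sqrt(n) * (mu1 - mu0) / sigma      # mean of d(X) when mu = mu1
    return 1.0 - Z.cdf(c_alpha - shift)

# Small n: low power against a sizeable discrepancy (fallacy of acceptance).
print(power(mu1=0.2, mu0=0.0, sigma=1.0, n=10))

# Large n: very high power against a tiny discrepancy (fallacy of rejection).
print(power(mu1=0.05, mu0=0.0, sigma=1.0, n=10_000))
```

In the first case a non-rejection says little about whether a discrepancy of 0.2 is absent; in the second a rejection says little about whether the discrepancy is of any substantive size.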

In chapter 5 Mayo returns to a recurring theme throughout the book, the mathematical duality between Confidence Intervals (CIs) and hypothesis testing, with a view to calling into question certain claims about the superiority of CIs over p-values. This mathematical duality derails any claims that observed CIs are less vulnerable to the large n problem and more informative than p-values. Where they differ is in terms of their inferential claims stemming from their different forms of reasoning, factual vs. hypothetical. That is, the mathematical duality does not imply inferential duality. This is demonstrated by contrasting CIs with the post-data severity evaluation.
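The mathematical side of this duality is easy to check numerically. The sketch below assumes the simple Normal model with known σ and a two-sided test of H0: µ = µ0; all function names and numbers are hypothetical. It verifies on a grid of sample means that the test rejects exactly when the observed CI excludes µ0, and then shows that with a very large n both respond to a tiny discrepancy in the same way.

```python
from math import sqrt
from statistics import NormalDist

Z = NormalDist()

def two_sided_ci(xbar, sigma, n, alpha=0.05):
    """Observed (1 - alpha) confidence interval for mu, sigma known."""
    half_width = Z.inv_cdf(1.0 - alpha / 2.0) * sigma / sqrt(n)
    return xbar - half_width, xbar + half_width

def rejects(xbar, mu0, sigma, n, alpha=0.05):
    """Two-sided test of H0: mu = mu0 at level alpha."""
    d = sqrt(n) * (xbar - mu0) / sigma
    return abs(d) > Z.inv_cdf(1.0 - alpha / 2.0)

def ci_excludes(xbar, mu0, sigma, n, alpha=0.05):
    lower, upper = two_sided_ci(xbar, sigma, n, alpha)
    return mu0 < lower or mu0 > upper

# Duality: rejection and exclusion coincide for every sample mean on the grid.
grid = [i / 100 for i in range(-100, 101)]
agree = all(rejects(x, 0.0, 1.0, 25) == ci_excludes(x, 0.0, 1.0, 25) for x in grid)
print(agree)

# Large n: a sample mean of 0.02 is rejected AND excluded by the CI.
print(rejects(0.02, 0.0, 1.0, 40_000), ci_excludes(0.02, 0.0, 1.0, 40_000))
```

Since the two procedures make the same cut, the observed CI inherits the large n problem; what differs is the inferential claim attached to each.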

Indeed, a case can be made that the post-data severity evaluation addresses several long-standing problems associated with frequentist testing, including the large n problem, the apparent arbitrariness of the N-P framing that allows for simple vs. simple hypotheses, say H0: µ= 1 vs. H1: µ=1, the arbitrariness of the rejection thresholds, the problem of the sharp dichotomy (e.g. reject H0 at .05 but accept H0 at .0499), and distinguishing between statistical and substantive significance. It also provides a natural framework for evaluating reproducibility/replicability issues and brings out the problems associated with observed CIs and estimation-based effect sizes; see Spanos (2019).
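A rough numerical sketch may help convey how a post-data severity evaluation addresses the large n problem. It assumes the one-sided test of H0: µ ≤ µ0 vs. H1: µ > µ0 in the simple Normal model with known σ and, following my reading of Mayo and Spanos (2006), evaluates the claim µ > µ1 after a rejection by Pr(d(X) ≤ d(x0); µ = µ1). The function name and the data values are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

Z = NormalDist()

def severity_after_rejection(xbar, mu1, sigma, n):
    """SEV(mu > mu1) = Pr(Xbar <= observed xbar; mu = mu1), sigma known."""
    return Z.cdf(sqrt(n) * (xbar - mu1) / sigma)

# Hypothetical data: n = 10000, sigma = 1, observed mean 0.02, mu0 = 0,
# so d(x0) = 2 and H0 is rejected at the .05 level.
xbar, sigma, n = 0.02, 1.0, 10_000
for mu1 in (0.0, 0.01, 0.02, 0.05):
    print(mu1, round(severity_after_rejection(xbar, mu1, sigma, n), 3))
```

Under these assumptions the claim µ > 0 passes with high severity, while the claim µ > 0.05 has severity close to zero: the statistically significant result does not warrant inferring a discrepancy of that size, which is the distinction between statistical and substantive significance.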

Chapter 5 also includes a retrospective view of the disputes between Neyman and Fisher in the context of the error statistical perspective on frequentist inference, bringing out their common framing and their differences in emphasis and interpretation. The discussion also includes an interesting summary of their personal conflicts, not always motivated by statistical issues; who said the history of statistics is boring?

Chapter (excursion) 6 of Mayo (2018) raises several important foundational issues and problems pertaining to Bayesian inference, including its primary aim, subjective vs. default Bayesian priors and their interpretations, default Bayesian inference vs. the Likelihood Principle, the role of the catchall factor, the role of Bayes factors in Bayesian testing, and the relationship between Bayesian inference and error probabilities. There is also discussion about attempts by ‘default prior’ Bayesians to unify or reconcile frequentist and Bayesian accounts.

A point emphasized in this chapter pertains to model validation. Despite the fact that Bayesian statistics shares the same concept of a statistical model Mθ(x) with frequentist statistics, there is hardly any discussion on validating Mθ(x) to secure the reliability of the posterior distribution: … upon which all Bayesian inferences are based. The exception is the indirect approach to model validation in Gelman et al. (2013) based on the posterior predictive distribution m(x). Since m(x) is parameter free, one can use it as a basis for simulating a number of replications x1, x2, …, xn to be used as indirect evidence for potential departures from the model assumptions vis-à-vis data x0, which is clearly different from frequentist M-S testing of the Mθ(x) assumptions. The reason is that m(x) is a smoothed mixture of f(x; θ) and π(θ|x0) and one has no way of attributing blame to one or the other when any departures are detected. For instance, in the case of the simple Normal model in (9), a highly skewed prior might contribute (indirectly) to departures from the Normality assumption when tested with data simulated using (12). Moreover, the ‘smoothing’ with respect to the parameters in deriving m(x) is likely to render testing departures from the IID assumptions a lot more unwieldy.
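For readers who want the two distributions in symbols, the usual textbook forms are sketched below. They are written in the notation of the surrounding text, with Θ (the parameter space) and the prior π(θ) as assumptions introduced here; this is the standard definition rather than a quotation of the review's own equations.

```latex
\pi(\theta \mid x_0) \;\propto\; \pi(\theta)\, f(x_0;\theta), \quad \theta \in \Theta,
\qquad
m(x) \;=\; \int_{\Theta} f(x;\theta)\, \pi(\theta \mid x_0)\, d\theta .
```

Written this way, it is visible that m(x) averages the model density f(x; θ) over the posterior, so a departure detected in data simulated from m(x) cannot be traced to f(x; θ) or to the prior separately.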

On the question posed by the title of this review, Mayo’s answer is that the error statistical framework, a refinement or extension of the original Fisher-Neyman-Pearson framing in the spirit of Peirce, provides a pertinent foundation for frequentist modeling and inference.

3 Conclusions

A retrospective view of Hacking (1965) reveals that its main weakness is that its perspective on statistical induction adheres too closely to the philosophy of science framing of that period, and largely ignores the formalism based on the theory of stochastic processes {Xt, t∈N} that revolves around the concept of a statistical model Mθ(x). Retrospectively, its value stems primarily from a number of very insightful arguments and comments that survived the test of time. The three that stand out are: (i) an optimal point estimator [θ-hat(X)] of θ does not warrant the inferential claim [θ-hat(x0)]≅ θ, (ii) a statistical inference is very different from a decision, and (iii) the distinction between the pre-data error probabilities and the post-data evaluation of the evidence stemming from testing results; a distinction that permeates Mayo’s (2018) book. Hacking’s change of mind on the aptness of logicism and the problems with the long run frequency is also particularly interesting. Hacking’s (1980) view of the long run frequency is almost indistinguishable from that of Cramer (1946, 332) and Neyman (1952, 27) mentioned above, or Mayo (2018), when he argues: “Probabilities conform to the usual probability axioms which have among their consequences the essential connection between individual and repeated trials, the weak law of large numbers proved by Bernoulli. Probabilities are to be thought of as theoretical properties, with a certain looseness of fit to the observed world. Part of this fit is judged by rules for testing statistical hypotheses along the lines described by Neyman and Pearson. It is a “frequency view of probability” in which probability is a dispositional property…” (Hacking, 1980, 150-151).

‘Probability as a dispositional property’ of a chance set-up alludes to the propensity interpretation of probability associated with Peirce and Popper, which is in complete agreement with the model-based frequentist interpretation; see Spanos (2019).

The main contribution of Mayo’s (2018) book is to put forward a framework and a strategy to evaluate the trustworthiness of evidence resulting from different statistical accounts. Viewing statistical inference as a form of severe testing elucidates the most widely employed arguments surrounding commonly used (and abused) statistical methods. In the severe testing account, probability arises in inference, not to measure degrees of plausibility or belief in hypotheses, but to evaluate and control how severely tested different inferential claims are. Without assuming that other statistical accounts aim for severe tests, Mayo proposes the following strategy for evaluating the trustworthiness of evidence: begin with a minimal requirement that if a test has little or no chance to detect flaws in a claim H, then H’s passing result constitutes untrustworthy evidence. Then, apply this minimal severity requirement to the various statistical accounts as well as to the proposed reforms, including estimation-based effect sizes, observed CIs and redefining statistical significance. Finding that they fail even the minimal severity requirement provides grounds to question the trustworthiness of their evidential claims. One need not reject some of these methods just because they have different aims, but because they give rise to evidence [claims] that fail the minimal severity requirement. Mayo challenges practitioners to be much clearer about their aims in particular contexts and different stages of inquiry. It is in this way that the book ingeniously links philosophical questions about the roles of probability in inference to the concerns of practitioners about coming up with trustworthy evidence across the landscape of the natural and the social sciences.


  • Barnard, George. 1972. Review article: Logic of Statistical Inference. The British Journal for the Philosophy of Science, 23: 123-190.
  • Cramer, Harald. 1946. Mathematical Methods of Statistics, Princeton: Princeton University Press.
  • Fisher, Ronald A. 1922. On the Mathematical Foundations of Theoretical Statistics. Philosophical Transactions of the Royal Society A, 222(602): 309-368.
  • Fisher, Ronald A. 1925. Statistical Methods for Research Workers. Edinburgh: Oliver & Boyd.
  • Gelman, Andrew, John B. Carlin, Hal S. Stern, and Donald B. Rubin. 2013. Bayesian Data Analysis, 3rd ed. London: Chapman & Hall/CRC.
  • Hacking, Ian. 1972. Review: Likelihood. The British Journal for the Philosophy of Science, 23(2): 132-137.
  • Hacking, Ian. 1980. The Theory of Probable Inference: Neyman, Peirce and Braithwaite. In D. Mellor (ed.), Science, Belief and Behavior: Essays in Honour of R. B. Braithwaite. Cambridge: Cambridge University Press, 141-160.
  • Ioannidis, John P. A. 2005. Why Most Published Research Findings Are False. PLoS Medicine, 2(8): 696-701.
  • Koopman, Bernard O. 1940. The Axioms and Algebra of Intuitive Probability. Annals of Mathematics, 41(2): 269-292.
  • Mayo, Deborah G. 1983. An Objective Theory of Statistical Testing. Synthese, 57(3): 297-340.
  • Mayo, Deborah G. 1996. Error and the Growth of Experimental Knowledge. Chicago: The University of Chicago Press.
  • Mayo, Deborah G. 2018. Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars. Cambridge: Cambridge University Press.
  • Mayo, Deborah G. and Aris Spanos. 2004. Methodology in Practice: Statistical Misspecification Testing. Philosophy of Science, 71(5): 1007-1025.
  • Mayo, Deborah G. and Aris Spanos. 2006. Severe Testing as a Basic Concept in a Neyman–Pearson Philosophy of Induction. British Journal for the Philosophy of Science, 57(2): 323-357.
  • Mayo, Deborah G. and Aris Spanos. 2011. Error Statistics. In D. Gabbay, P. Thagard, and J. Woods (eds), Philosophy of Statistics, Handbook of Philosophy of Science. New York: Elsevier, 151-196.
  • Neyman, Jerzy. 1952. Lectures and Conferences on Mathematical Statistics and Probability, 2nd ed. Washington: U.S. Department of Agriculture.
  • Royall, Richard. 1997. Statistical Evidence: A Likelihood Paradigm. London: Chapman & Hall.
  • Salmon, Wesley C. 1967. The Foundations of Scientific Inference. Pittsburgh: University of Pittsburgh Press.
  • Spanos, Aris. 2013. A Frequentist Interpretation of Probability for Model-Based Inductive Inference. Synthese, 190(9): 1555-1585.
  • Spanos, Aris. 2017. Why the Decision-Theoretic Perspective Misrepresents Frequentist Inference. In Advances in Statistical Methodologies and Their Applications to Real Problems, 3-28.
  • Spanos, Aris. 2018. Mis-Specification Testing in Retrospect. Journal of Economic Surveys, 32(2): 541-577.
  • Spanos, Aris. 2019. Probability Theory and Statistical Inference: Empirical Modeling with Observational Data, 2nd ed. Cambridge: Cambridge University Press.
  • Von Mises, Richard. 1928. Probability, Statistics and Truth, 2nd ed. New York: Dover.
  • Williams, David. 2001. Weighing the Odds: A Course in Probability and Statistics. Cambridge: Cambridge University Press.
Categories: Spanos, Statistical Inference as Severe Testing

The NAS fixes its (main) mistake in defining P-values!


(reasonably) satisfied

Remember when I wrote to the National Academy of Science (NAS) in September pointing out mistaken definitions of P-values in their document on Reproducibility and Replicability in Science? (see my 9/30/19 post). I’d given up on their taking any action, but yesterday I received a letter from the NAS Senior Program officer:

Dear Dr. Mayo,

I am writing to let you know that the Reproducibility and Replicability in Science report has been updated in response to the issues that you have raised.
Two footnotes, on pages 31 35 and 221, highlight the changes. The updated report is available from the following link: NEW 2020 NAS DOC

Thank you for taking the time to reach out to me and to Dr. Fineberg and letting us know about your concerns.
With kind regards and wishes of a happy 2020,
Jenny Heimberg
Jennifer Heimberg, Ph.D.
Senior Program Officer

The National Academies of Sciences, Engineering, and Medicine


Categories: NAS, P-values

Midnight With Birnbaum (Happy New Year 2019)!

Just as in the past 8 years since I’ve been blogging, I revisit that spot in the road at 9 p.m., just outside the Elbar Room, and look to get into a strange-looking taxi to head to “Midnight With Birnbaum”. (The pic on the left is the only blurry image I have of the club I’m taken to.) I wonder if the car will come for me this year, as I wait out in the cold, now that Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (STINT 2018) has been out over a year. STINT doesn’t rehearse the argument from my Birnbaum article, but there’s much in it that I’d like to discuss with him. The (Strong) Likelihood Principle–whether or not it is named–remains at the heart of many of the criticisms of Neyman-Pearson (N-P) statistics (and cognate methods). 2019 was the 61st birthday of Cox’s “weighing machine” example, which was the basis of Birnbaum’s attempted proof. Yet as Birnbaum insisted, the “confidence concept” is the “one rock in a shifting scene” of statistical foundations, insofar as there’s interest in controlling the frequency of erroneous interpretations of data. (See my rejoinder.) Birnbaum bemoaned the lack of an explicit evidential interpretation of N-P methods. Maybe in 2020? Anyway, the cab is finally here…the rest is live. Happy New Year!

Categories: Birnbaum Brakes, strong likelihood principle

A Perfect Time to Binge Read the (Strong) Likelihood Principle

An essential component of inference based on familiar frequentist notions: p-values, significance and confidence levels, is the relevant sampling distribution (hence the term sampling theory, or my preferred error statistics, as we get error probabilities from the sampling distribution). This feature results in violations of a principle known as the strong likelihood principle (SLP). To state the SLP roughly, it asserts that all the evidential import in the data (for parametric inference within a model) resides in the likelihoods. If accepted, it would render error probabilities irrelevant post data.

Categories: Birnbaum, Birnbaum Brakes, law of likelihood

61 Years of Cox’s (1958) Chestnut: Excerpt from Excursion 3 Tour II (Mayo 2018, CUP)


2018 marked 60 years since the famous weighing machine example from Sir David Cox (1958)[1]. It is now 61. It’s one of the “chestnuts” in the exhibits of “chestnuts and howlers” in Excursion 3 (Tour II) of my (still) new book Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (SIST, 2018). It’s especially relevant to take this up now, just before we leave 2019, for reasons that will be revealed over the next day or two. For a sneak preview of those reasons, see the “note to the reader” at the end of this post. So, let’s go back to it, with an excerpt from SIST (pp. 170-173).

Categories: Birnbaum, Statistical Inference as Severe Testing, strong likelihood principle

Posts of Christmas Past (1): 13 howlers of significance tests (and how to avoid them)


I’m reblogging a post from Christmas past–exactly 7 years ago. Guess what I gave as howler number 1 (of 13) among the well-worn criticisms of statistical significance tests haunting us back in 2012–all of which are put to rest in Mayo and Spanos 2011? Yes, it’s the frightening allegation that statistical significance tests forbid using any background knowledge! The researcher is imagined to start with a “blank slate” in each inquiry (no memories of fallacies past), and then unthinkingly apply a purely formal, automatic, accept-reject machine. What’s newly frightening (in 2019) is the credulity with which this apparition is now being met (by some). I make some new remarks below the post from Christmas past:

Categories: memory lane, significance tests, Statistics

“Les stats, c’est moi”: We take that step here! (Adopt our fav word or phil stat!)(iii)


When it comes to the statistics wars, leaders of rival tribes sometimes sound as if they believed “les stats, c’est moi” [1]. So, rather than say they would like to supplement some well-known tenets (e.g., “a statistically significant effect may not be substantively important”) with a new rule that advances their particular preferred language or statistical philosophy, they may simply blurt out: “we take that step here!” followed by whatever rule of language or statistical philosophy they happen to prefer (as if they have just added the new rule to the existing, uncontested tenets). Karen Kafadar, in her last official (December) report as President of the American Statistical Association (ASA), expresses her determination to call out this problem at the ASA itself. (She raised it first in her June article, discussed in my last post.)

Categories: ASA Guide to P-values

P-Value Statements and Their Unintended(?) Consequences: The June 2019 ASA President’s Corner (b)



I never met Karen Kafadar, the 2019 President of the American Statistical Association (ASA), but the other day I wrote to her in response to a call in her extremely interesting June 2019 President’s Corner: “Statistics and Unintended Consequences”:

  • “I welcome your suggestions for how we can communicate the importance of statistical inference and the proper interpretation of p-values to our scientific partners and science journal editors in a way they will understand and appreciate and can use with confidence and comfort—before they change their policies and abandon statistics altogether.”

I only recently came across her call, and I will share my letter below. First, here are some excerpts from her June President’s Corner (her December report is due any day).

Categories: ASA Guide to P-values, Bayesian/frequentist, P-values

A. Saltelli (Guest post): What can we learn from the debate on statistical significance?

Professor Andrea Saltelli
Centre for the Study of the Sciences and the Humanities (SVT), University of Bergen (UIB, Norway),
Open Evidence Research, Universitat Oberta de Catalunya (UOC), Barcelona

What can we learn from the debate on statistical significance?

The statistical community is in the midst of a crisis whose latest convulsion is a petition to abolish the concept of significance. The problem is perhaps neither with significance, nor with statistics, but with the inconsiderate way we use numbers, and with our present approach to quantification. Unless the crisis is resolved, there will be a loss of consensus in scientific arguments, with a corresponding decline of public trust in the findings of science.

Categories: Error Statistics

The ASA’s P-value Project: Why it’s Doing More Harm than Good (cont from 11/4/19)


cure by committee

Everything is impeach and remove these days! Should that hold also for the concept of statistical significance and P-value thresholds? There’s an active campaign that says yes, but I aver it is doing more harm than good. In my last post, I said I would count the ways it is detrimental until I became “too disconsolate to continue”. There I showed why the new movement, launched by Executive Director of the ASA (American Statistical Association), Ronald Wasserstein (in what I dub ASA II), is self-defeating: it instantiates and encourages the human-all-too-human tendency to exploit researcher flexibility, rewards, and openings for bias in research (F, R & B Hypothesis). That was reason #1. Just reviewing it already fills me with such dismay, that I fear I will become too disconsolate to continue before even getting to reason #2. So let me just quickly jot down reasons #2, 3, 4, and 5 (without full arguments) before I expire.

Categories: ASA Guide to P-values

On Some Self-Defeating Aspects of the ASA’s (2019) Recommendations on Statistical Significance Tests (ii)


“Before we stood on the edge of the precipice, now we have taken a great step forward”


What’s self-defeating about pursuing statistical reforms in the manner taken by the American Statistical Association (ASA) in 2019? In case you’re not up on the latest in significance testing wars, the 2016 ASA Statement on P-Values and Statistical Significance, ASA I, arguably, was a reasonably consensual statement on the need to avoid some well-known abuses of P-values–notably if you compute P-values, ignoring selective reporting, multiple testing, or stopping when the data look good, the computed P-value will be invalid. (Principle 4, ASA I) But then Ron Wasserstein, executive director of the ASA, and co-editors, decided they weren’t happy with their own 2016 statement because it “stopped just short of recommending that declarations of ‘statistical significance’ be abandoned” altogether. In their new statement–ASA II–they announced: “We take that step here….Statistically significant –don’t say it and don’t use it”.

Why do I say it is a mis-take to have taken the supposed next “great step forward”? Why do I count it as unsuccessful as a piece of statistical science policy? In what ways does it make the situation worse? Let me count the ways. The first is in this post. Others will come in following posts, until I become too disconsolate to continue.[i]

Categories: P-values, stat wars and their casualties, statistical significance tests

Exploring a new philosophy of statistics field

This article came out on Monday on our Summer Seminar in Philosophy of Statistics in Virginia Tech News Daily magazine.

October 28, 2019


From universities around the world, participants in a summer session gathered to discuss the merits of the philosophy of statistics. Co-director Deborah Mayo, left, hosted an evening for them at her home.


Categories: Philosophy of Statistics, Summer Seminar in PhilStat

The First Eye-Opener: Error Probing Tools vs Logics of Evidence (Excursion 1 Tour II)

1.4, 1.5

In Tour II of this first Excursion of Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (SIST, 2018, CUP), I pull back the cover on disagreements between experts charged with restoring integrity to today’s statistical practice. Some advised me to wait until later (in the book) to get to this eye-opener. Granted, the full story involves some technical issues, but after many months, I think I arrived at a way to get to the heart of things informally (with a promise of more detailed retracing of steps later on). It was too important not to reveal right away that some of the most popular “reforms” fall down on the job even with respect to our most minimal principle of evidence (you don’t have evidence for a claim if little if anything has been done to probe the ways it can be flawed).

Categories: Error Statistics, law of likelihood, SIST

The Current State of Play in Statistical Foundations: A View From a Hot-Air Balloon


Continue to the third, and last stop of Excursion 1 Tour I of Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (2018, CUP)–Section 1.3. It would be of interest to ponder if (and how) the current state of play in the stat wars has shifted in just one year. I’ll do so in the comments. Use that space to ask me any questions.

How can a discipline, central to science and to critical thinking, have two methodologies, two logics, two approaches that frequently give substantively different answers to the same problems? … Is complacency in the face of contradiction acceptable for a central discipline of science? (Donald Fraser 2011, p. 329)

We [statisticians] are not blameless … we have not made a concerted professional effort to provide the scientific world with a unified testing methodology. (J. Berger 2003, p. 4)


Categories: Statistical Inference as Severe Testing

Severity: Strong vs Weak (Excursion 1 continues)


Marking one year since the appearance of my book: Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (2018, CUP), let’s continue to the second stop (1.2) of Excursion 1 Tour 1. It begins on p. 13 with a quote from statistician George Barnard. Assorted reflections will be given in the comments. Ask me any questions pertaining to the Tour.


  • I shall be concerned with the foundations of the subject. But in case it should be thought that this means I am not here strongly concerned with practical applications, let me say right away that confusion about the foundations of the subject is responsible, in my opinion, for much of the misuse of the statistics that one meets in fields of application such as medicine, psychology, sociology, economics, and so forth. (George Barnard 1985, p. 2)


Categories: Statistical Inference as Severe Testing

How My Book Begins: Beyond Probabilism and Performance: Severity Requirement

This week marks one year since the general availability of my book: Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (2018, CUP). Here’s how it begins (Excursion 1 Tour 1 (1.1)). Material from the preface is here. I will sporadically give some “one year later” reflections in the comments. I invite readers to ask me any questions pertaining to the Tour.

The journey begins..(1.1)

I’m talking about a specific, extra type of integrity that is [beyond] not lying, but bending over backwards to show how you’re maybe wrong, that you ought to have when acting as a scientist. (Feynman 1974/1985, p. 387)

It is easy to lie with statistics. Or so the cliché goes. It is also very difficult to uncover these lies without statistical methods – at least of the right kind. Self-correcting statistical methods are needed, and, with minimal technical fanfare, that’s what I aim to illuminate. Since Darrell Huff wrote How to Lie with Statistics in 1954, ways of lying with statistics are so well worn as to have emerged in reverberating slogans:

  • Association is not causation.
  • Statistical significance is not substantive significance.
  • No evidence of risk is not evidence of no risk.
  • If you torture the data enough, they will confess.


Categories: Statistical Inference as Severe Testing, Statistics

National Academies of Science: Please Correct Your Definitions of P-values


If you were on a committee to highlight issues surrounding P-values and replication, what’s the first definition you would check? Yes, exactly. Apparently, when it came to the recently released National Academies of Science “Consensus Study” Reproducibility and Replicability in Science 2019, no one did.

Categories: ASA Guide to P-values, Error Statistics, P-values

Hardwicke and Ioannidis, Gelman, and Mayo: P-values: Petitions, Practice, and Perils (and a question for readers)


The October 2019 issue of the European Journal of Clinical Investigation came out today. It includes the PERSPECTIVE article by Tom Hardwicke and John Ioannidis, an invited editorial by Gelman and one by me:

Petitions in scientific argumentation: Dissecting the request to retire statistical significance, by Tom Hardwicke and John Ioannidis

When we make recommendations for scientific practice, we are (at best) acting as social scientists, by Andrew Gelman

P-value thresholds: Forfeit at your peril, by Deborah Mayo

I blogged excerpts from my preprint, and some related posts, here.

All agree to the disagreement on the statistical and metastatistical issues:

Categories: ASA Guide to P-values, P-values, stat wars and their casualties

(Excerpts from) ‘P-Value Thresholds: Forfeit at Your Peril’ (free access)


A key recognition among those who write on the statistical crisis in science is that the pressure to publish attention-getting articles can incentivize researchers to produce eye-catching but inadequately scrutinized claims. We may see much the same sensationalism in broadcasting metastatistical research, especially if it takes the form of scapegoating or banning statistical significance. A lot of excitement was generated recently when Ron Wasserstein, Executive Director of the American Statistical Association (ASA), and co-editors A. Schirm and N. Lazar, updated the 2016 ASA Statement on P-Values and Statistical Significance (ASA I). In their 2019 interpretation, ASA I “stopped just short of recommending that declarations of ‘statistical significance’ be abandoned,” and in their new statement (ASA II) announced: “We take that step here….’statistically significant’ –don’t say it and don’t use it”. To herald the ASA II, and the special issue “Moving to a world beyond ‘p < 0.05’”, the journal Nature requisitioned a commentary from Amrhein, Greenland and McShane “Retire Statistical Significance” (AGM). With over 800 signatories, the commentary received the imposing title “Scientists rise up against significance tests”!

Categories: ASA Guide to P-values, P-values, stat wars and their casualties
