
S. Senn: “Error point: The importance of knowing how much you don’t know” (guest post)


Stephen Senn
Consultant Statistician

‘The term “point estimation” made Fisher nervous, because he associated it with estimation without regard to accuracy, which he regarded as ridiculous.’ Jimmy Savage [1, p. 453] 

First things second

The classic text by David Cox and David Hinkley, Theoretical Statistics (1974), has two extremely interesting features as regards estimation. The first takes the form of an indirect, implicit message, the second is explicit, and both teach that point estimation is far from being an obvious goal of statistical inference. The indirect message is that the chapter on point estimation (chapter 8) comes after that on interval estimation (chapter 7). This may puzzle the reader, who might anticipate that the complications of interval estimation would be handled after the apparently simpler point estimation rather than before. However, with the start of chapter 8, the reasoning is made clear. Cox and Hinkley state:

Superficially, point estimation may seem a simpler problem to discuss than that of interval estimation; in fact, however, any replacement of an uncertain quantity is bound to involve either some arbitrary choice or a precise specification of the purpose for which the single quantity is required. Note that in interval-estimation we explicitly recognize that the conclusion is uncertain, whereas in point estimation…no explicit recognition is involved in the final answer. [2, p. 250]

In my opinion, a great deal of confusion about statistics can be traced to the fact that the point estimate is seen as being the be-all and end-all, the expression of uncertainty being forgotten. For example, much of the criticism of randomisation overlooks the fact that the statistical analysis will deliver a probability statement and, other things being equal, the more unobserved prognostic factors there are, the more uncertain the result will be claimed to be. However, statistical statements are not wrong because they are uncertain; they are wrong if claimed to be more certain (or less certain) than they are.

A standard error

Amongst the justifications that Cox and Hinkley give for calculating point estimates is that, when supplemented with an appropriately calculated standard error, they will, in many cases, provide the means of calculating a confidence interval or, if you prefer being Bayesian, a credible interval. Thus, to provide a point estimate without also providing a standard error is, indeed, an all too standard error. Of course, there is no value in providing a standard error unless it has been calculated appropriately, and addressing the matter of appropriate calculation is not necessarily easy. This is a point I shall pick up below, but for the moment let us proceed to consider why it is useful to have standard errors.

First, suppose you have a point estimate. At some time in the past you or someone else decided to collect the data that made it possible. Time and money were invested in doing this. It would not have been worth doing this unless there was a state of uncertainty that the collection of data went some way to resolving. Has it been resolved? Are you certain enough? If not, should more data be collected or would that not be worth it? This cannot be addressed without assessing the uncertainty in your estimate and this is what the standard error is for.

Second, you may wish to combine the estimate with other estimates. This has a long history in statistics. It has been more recently (in the last half century) developed under the heading of meta-analysis, which is now a huge field of theoretical study and practical application. However, the subject is much older than that. For example, I have, on the shelves of my library at home, a copy of the second (1875) edition of On the Algebraical and Numerical Theory of Errors of Observations and the Combination of Observations, by George Biddell Airy (1801-1892). [3] Chapter III is entitled ‘Principles of Forming the Most Advantageous Combination of Fallible Measures’ and treats the matter in some detail. For example, Airy defines what he calls the theoretical weight (t.w.) of a fallible measure as the reciprocal of the square of its probable error, t.w. = 1/PE², and then draws attention to ‘two remarkable results’:

First. The combination-weight for each measure ought to be proportional to its theoretical weight.

Second. When the combination-weight for each measure is proportional to its theoretical weight, the theoretical weight of the final result is equal to the sum of the theoretical weights of the several collateral measures. (pp. 55-56).

We are now more used to using the standard error (SE) rather than the probable error (PE) to which Airy refers. However, the PE, which can be defined as the SE multiplied by the upper quartile of the standard Normal distribution, is just a multiple of the SE. Thus we have PE ≈ 0.6745 × SE and therefore 50% of values ought to be in the range mean −PE to mean +PE, hence the name. Since the PE is just a multiple of the SE, Airy’s second remarkable result applies in terms of SEs also. Nowadays we might speak of the precision, defined as the reciprocal of the squared standard error,

precision = 1/SE²,

and say that estimates should be combined in proportion to their precision, in which case the precision of the final result will be the sum of the individual precisions.
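Airy's two results are easy to verify numerically. A minimal sketch in Python (the estimates and standard errors below are invented purely for illustration):

```python
# Precision-weighted combination of independent estimates (Airy's rule).
# The estimates and standard errors below are invented for illustration.

def combine(estimates, ses):
    """Return the precision-weighted mean and its standard error."""
    precisions = [1 / se ** 2 for se in ses]
    total = sum(precisions)
    mean = sum(p * e for p, e in zip(precisions, estimates)) / total
    return mean, total ** -0.5  # combined precision = sum of precisions

est, se = combine([10.0, 12.0], [2.0, 1.0])
# precisions are 0.25 and 1.0, so est = (0.25*10 + 1.0*12)/1.25 = 11.6
# and se = 1/sqrt(1.25) ≈ 0.894, smaller than either input SE
```

Note that the combined standard error is smaller than that of either contributing estimate, as the second remarkable result requires.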

This second edition of Airy’s book dates from 1875 but, although I have not got a copy of the first edition, which dates from 1861, I am confident that the history can be pushed back at least as far as that. In fact, as has often been noticed, fixed effects meta-analysis is really just a form of least squares, a subject developed at the end of the 18th and beginning of the 19th century by Legendre, Gauss and Laplace, amongst others. [4]

A third reason to be interested in standard errors is that you may wish to carry out a Bayesian analysis. In that case, you should consider what the mean and the ‘standard error’ of your prior distribution are. You can then apply Airy’s two remarkable results: weight the prior mean and the data estimate in proportion to their precisions, and the precision of the resulting posterior mean is the sum of the two.
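In the Normal-Normal conjugate case, that combination is exactly the Bayesian update: the posterior precision is the sum of the prior and data precisions, and the posterior mean is the precision-weighted average. A sketch with made-up numbers:

```python
# Normal-Normal Bayesian updating as precision weighting.
# Prior and data summaries are hypothetical numbers for illustration.
prior_mean, prior_se = 0.0, 4.0   # prior distribution for the parameter
data_mean, data_se = 3.0, 2.0     # estimate and its standard error

w_prior = 1 / prior_se ** 2           # prior precision
w_data = 1 / data_se ** 2             # data precision
post_precision = w_prior + w_data     # precisions add (Airy's second result)
post_mean = (w_prior * prior_mean + w_data * data_mean) / post_precision
post_se = post_precision ** -0.5
# post_mean = 2.4: pulled towards the data, which are 4x more precise
```

The posterior mean lands four times closer to the data than to the prior, because the data carry four times the precision.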


Ignoring uncertainty

Suppose that you regard all this concern with uncertainty as an unnecessary refinement and argue, “Never mind Airy’s precision weighting; when I have more than one estimate, I’ll just use an unweighted average”. This might seem like a reasonable ‘belt and braces’ approach but the figure below illustrates a problem. It supposes the following. You have one estimate and you then obtain a second. You now form an unweighted average of the two. What is the precision of this mean compared to a) the first result alone and b) the second result alone? In the figure, the X axis gives the relative precision of the second result alone to that of the first result alone. The Y axis gives the relative precision of the mean to the first result alone (red curve) or to the second result alone (blue curve).

Figure: Precision of an unweighted mean of two estimates as a function of the relative precision of the second compared to the first. The red curve gives the relative precision of the mean to that of the first and the blue curve the relative precision of the mean to the second. If both estimates are equally precise, the ratio is one and the precision of the mean is twice that of either result alone.

Where a curve is below 1, the precision of the mean is below that of the relevant single result. If the precision of the second result is less than 1/3 of the first, you would be better off using the first result alone. On the other hand, if the second result is more than three times as precise as the first, you would be better off using the second alone. The consequence is that if you do not know the precision of your results, you not only don’t know which one to trust, you don’t even know whether an average of them should be preferred.
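The two curves have a simple closed form. If r is the precision of the second result relative to the first and the estimates are independent, the unweighted mean has variance (v1 + v2)/4, so its precision relative to the first is 4r/(1 + r) and relative to the second 4/(1 + r). A short sketch:

```python
# Precision of the unweighted mean of two independent estimates, relative
# to each estimate alone, as a function of r = p2/p1 (precision ratio).

def rel_precision(r):
    """Return (vs_first, vs_second): precision of the unweighted mean
    relative to the first and to the second estimate alone."""
    vs_first = 4 * r / (1 + r)   # the red curve
    vs_second = 4 / (1 + r)      # the blue curve
    return vs_first, vs_second

# Equally precise estimates: the mean doubles the precision of either.
# At r = 1/3 the mean is no better than the first estimate alone, and
# at r = 3 it is no better than the second alone.
```

Setting either expression equal to 1 recovers the 1/3 and 3 thresholds mentioned above.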

Not ignoring uncertainty

So, to sum up, if you don’t know how uncertain your evidence is, you can’t use it. Thus, assessing uncertainty is important. However, as I said in the introduction, all too easily attention focuses on estimating the parameter of interest and not on the probability statement. This (perhaps unconscious) obsession with point estimation as the be-all and end-all causes problems. As a common example of the problem, consider the following statement: ‘all covariates are balanced, therefore they do not need to be in the model’. This point of view expresses the belief that nothing of relevance will change if the covariates are not in the model, so why bother?

It is true that if a linear model applies, the point estimate for a ‘treatment effect’ will not change by including balanced covariates in the model. However, the expression of uncertainty will be quite different. The balanced case is one that does not apply in general. It thus follows that valid expressions of uncertainty have to allow for prognostic factors being imbalanced and this is, indeed, what they do. Misunderstanding of this is an error often made by critics of randomisation. I explain the misunderstanding like this: if we knew that important but unobserved prognostic factors were balanced, the standard analysis of clinical trials would be wrong. Thus, to claim that the analysis of randomised clinical trials relies on prognostic factors being balanced is exactly the opposite of what is true. [5]

As I explain in my blog Indefinite Irrelevance, if the prognostic factors are balanced, not adjusting for them treats them as if they might be imbalanced: the confidence interval will be too wide, given that we know that they are not imbalanced. (See also The Well Adjusted Statistician. [6])

Another way of understanding this is through the following example.

Consider a two-armed placebo-controlled clinical trial of a drug with a binary covariate (let us take the specific example of sex) and suppose that the patients split 50:50 according to the covariate. Now consider these two questions. What allocation of patients by sex within treatment arms will be such that differences in sex do not impact on 1) the estimate of the effect and 2) the estimate of the standard error of the estimate of the effect?

Everybody knows the answer to 1): the males and females must be equally distributed with respect to treatment. (Allocation one in the table below.) However, the answer to 2) is less obvious: it is that the groups within which variances are estimated must be homogeneous by treatment and sex. (Allocation two in the table below shows one of the two possibilities.) That means that if we do not put sex in the model, then in order to prevent sex from affecting the estimate of the variance, we would have to have all the males in one treatment group and all the females in the other.

            Allocation one      Allocation two
            Male    Female      Male    Female     Total
Placebo       25        25        50         0        50
Drug          25        25         0        50        50
Total         50        50        50        50       100

Table: Percentage allocation by sex and treatment for two possible clinical trials

Of course, nobody uses allocation two but if allocation one is used, then the logical approach is to analyse the data so that the influence of sex is eliminated from the estimate of the variance, and hence the standard error. Savage, referring to Fisher, puts it thus:

He taught what should be obvious but always demands a second thought from me: if an experiment is laid out to diminish the variance of comparisons, as by using matched pairs…the variance eliminated from the comparison shows up in the estimate of this variance (unless care is taken to eliminate it)… [1, p. 450]

The consequence is that one needs to allow for this in the estimation procedure. One needs to ensure not only that the effect is estimated appropriately but that its uncertainty is also assessed appropriately. In our example this means that sex, in addition to treatment, must be in the model.
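A small simulation illustrates the point. The sketch below uses an invented sex effect and 50 patients per sex-by-treatment cell, as in allocation one, and compares the residual standard deviation pooled within treatment arms (which leaves sex in the ‘noise’) with that pooled within sex-by-treatment cells:

```python
import random
import statistics

# Hypothetical simulation of allocation one: a balanced 2x2 trial in which
# sex has a large effect on outcome. Leaving sex out of the model leaves
# its variation in the residual, inflating the estimated standard error.
random.seed(42)
treat_effect, sex_effect = 1.0, 5.0
cells = {}  # (treatment, sex) -> list of outcomes, 50 patients per cell
for treat in (0, 1):
    for sex in (0, 1):
        cells[(treat, sex)] = [treat_effect * treat + sex_effect * sex +
                               random.gauss(0, 1) for _ in range(50)]

def pooled_sd(groups):
    """Pooled within-group standard deviation."""
    ss = sum((x - statistics.mean(g)) ** 2 for g in groups for x in g)
    df = sum(len(g) - 1 for g in groups)
    return (ss / df) ** 0.5

arms = [cells[(0, 0)] + cells[(0, 1)], cells[(1, 0)] + cells[(1, 1)]]
sd_unadjusted = pooled_sd(arms)                # sex left in the 'noise'
sd_adjusted = pooled_sd(list(cells.values()))  # sex eliminated
# sd_unadjusted ≈ sqrt(1 + (sex_effect/2)**2) ≈ 2.7, sd_adjusted ≈ 1
```

With sex in the model, the residual standard deviation is close to the true unit noise; without it, the balanced sex effect shows up in the variance estimate, just as Savage's quotation warns.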

Here There be Tygers

‘It doesn’t approve of your philosophy.’ Ray Bradbury, Here There be Tygers

So, estimating uncertainty is a key task of any statistician. Most commonly, it is addressed by calculating a standard error. However, this is not necessarily a simple matter. The school of statistics associated with design and analysis of agricultural experiments founded by RA Fisher, and to which I have referred as the Rothamsted School, addressed this in great detail. Such agricultural experiments could have a complicated block structure, for example, rows and columns of a field, with whole plots defined by their crossing and subplots within the whole plots. Many treatments could be studied simultaneously, with some (for example crop variety) being capable of being varied at the level of the plots but some (for example fertiliser) at the level of the subplots. This meant that variation at different levels affected different treatment factors. John Nelder developed a formal calculus to address such complex problems [7, 8].

In the world of clinical trials in which I have worked, we distinguish between trials in which patients can receive different treatments on different occasions and those in which each patient can independently receive only one treatment and those in which all the patients in the same centre must receive the same treatment. Each such design (cross-over, parallel, cluster) requires a different approach to assessing uncertainty. (See To Infinity and Beyond.) Naively treating all observations as independent can underestimate the standard error, a problem that Hurlbert has referred to as pseudoreplication. [9]

A key point, however, is this: the formal nature of experimentation forces this issue of variation to our attention. In observational studies we may be careless. We tend to assume that, once we have made various adjustments to correct bias in the point estimate, the ‘errors’ can be treated as independent. However, only for the simplest of experimental studies would such an assumption be true, so what justifies making it as a matter of habit for observational ones?

Recent work on historical controls has underlined the problem [10-12]. Trials that use such controls have features of both experimental and observational studies and so provide an illustrative bridge between the two. It turns out that treating the data as if they came from one observational study would underestimate the variance and hence overestimate the precision of the result. The implication is that analyses of observational studies more generally may be producing inappropriately narrow confidence intervals. [10]

Rigorous uncertainty

If a man will begin with certainties, he shall end in doubts; but if he will be content to begin with doubts he shall end in certainties. Francis Bacon, The Advancement of Learning, Book I, v,8.

In short, I am making an argument for Fisher’s general attitude to inference. Harry Marks has described it thus:

Fisher was a sceptic…But he was an unusually constructive sceptic. Uncertainty and error were, for Fisher, inevitable. But ‘rigorously specified uncertainty’ provided a firm ground for making provisional sense of the world. H Marks [13, p.94]

Point estimates are not enough. It is rarely the case that you have to act immediately based on your best guess. Where you don’t, you have to know how good your guesses are. This requires a principled approach to assessing uncertainty.


  1. Savage, L.J., On rereading R.A. Fisher. Annals of Statistics, 1976. 4(3): p. 441-500.
  2. Cox, D.R. and D.V. Hinkley, Theoretical Statistics. 1974, London: Chapman and Hall.
  3. Airy, G.B., On the Algebraical and Numerical Theory of Errors of Observations and the Combination of Observations. 1875, London: Macmillan.
  4. Stigler, S.M., The History of Statistics: The Measurement of Uncertainty before 1900. 1986, Cambridge, Massachusetts: Belknap Press.
  5. Senn, S.J., Seven myths of randomisation in clinical trials. Statistics in Medicine, 2013. 32(9): p. 1439-50.
  6. Senn, S.J., The well-adjusted statistician. Applied Clinical Trials, 2019: p. 2.
  7. Nelder, J.A., The analysis of randomised experiments with orthogonal block structure I. Block structure and the null analysis of variance. Proceedings of the Royal Society of London. Series A, 1965. 283: p. 147-162.
  8. Nelder, J.A., The analysis of randomised experiments with orthogonal block structure II. Treatment structure and the general analysis of variance. Proceedings of the Royal Society of London. Series A, 1965. 283: p. 163-178.
  9. Hurlbert, S.H., Pseudoreplication and the design of ecological field experiments. Ecological monographs, 1984. 54(2): p. 187-211.
  10. Collignon, O., et al., Clustered allocation as a way of understanding historical controls: Components of variation and regulatory considerations. Stat Methods Med Res, 2019: p. 962280219880213.
  11. Galwey, N.W., Supplementation of a clinical trial by historical control data: is the prospect of dynamic borrowing an illusion? Statistics in Medicine, 2017. 36(6): p. 899-916.
  12. Schmidli, H., et al., Robust meta‐analytic‐predictive priors in clinical trials with historical control information. Biometrics, 2014. 70(4): p. 1023-1032.
  13. Marks, H.M., Rigorous uncertainty: why RA Fisher is important. Int J Epidemiol, 2003. 32(6): p. 932-7; discussion 945-8.



Aris Spanos Reviews Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars

A. Spanos

Aris Spanos was asked to review my Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (CUP, 2018), but he was to combine it with a review of the re-issue of Ian Hacking’s classic Logic of Statistical Inference. The journal is OEconomia: History, Methodology, Philosophy. Below are excerpts from his discussion of my book (pp. 843-860), jumping past the Hacking review and occasionally excerpting for length. To read his full article go to the external journal pdf or the stable internal blog pdf.


2 Mayo (2018). Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars

The sub-title of Mayo’s (2018) book provides an apt description of the primary aim of the book in the sense that its focus is on the current discussions pertaining to replicability and trustworthy empirical evidence that revolve around the main fault line in statistical inference: the nature, interpretation and uses of probability in statistical modeling and inference. This underlies not only the form and structure of inductive inference, but also the nature of the underlying statistical reasoning as well as the nature of the evidence it gives rise to.

A crucial theme in Mayo’s book pertains to the current confusing and confused discussions on reproducibility and replicability of empirical evidence. The book cuts through the enormous level of confusion we see today about basic statistical terms, and in so doing explains why the experts so often disagree about reforms intended to improve statistical science.

Mayo makes a concerted effort to delineate the issues and clear up these confusions by defining the basic concepts accurately and placing many widely held methodological views in the best possible light before scrutinizing them. In particular, the book discusses at length the merits and demerits of the proposed reforms which include: (a) replacing p-values with Confidence Intervals (CIs), (b) using estimation-based effect sizes and (c) redefining statistical significance.

The key philosophical concept employed by Mayo to distinguish between a sound empirical evidential claim for a hypothesis H and an unsound one is the notion of a severe test: if little has been done to rule out flaws (errors and omissions) in pronouncing that data x0 provide evidence for a hypothesis H, then that inferential claim has not passed a severe test, rendering the claim untrustworthy. One has trustworthy evidence for a claim C only to the extent that C passes a severe test; see Mayo (1983; 1996). A distinct advantage of the concept of severe testing is that it is sufficiently general to apply to both frequentist and Bayesian inferential methods.

Mayo makes a case that there is a two-way link between philosophy and statistics. On one hand, philosophy helps in resolving conceptual, logical, and methodological problems of statistical inference. On the other hand, viewing statistical inference as severe testing gives rise to novel solutions to crucial philosophical problems including induction, falsification and the demarcation of science from pseudoscience. In addition, it serves as the foundation for understanding and getting beyond the statistics wars that currently revolve around the replication crisis; hence the title of the book, Statistical Inference as Severe Testing.

Chapter (excursion) 1 of Mayo’s (2018) book sets the scene by scrutinizing the different roles of probability in statistical inference, distinguishing between:

(i) Probabilism. Probability is used to assign a degree of confirmation, support or belief in a hypothesis H, given data x0 (Bayesian, likelihoodist, Fisher (fiducial)). An inferential claim H is warranted when it is assigned a high probability, support, or degree of belief (absolute or comparative).
(ii) Performance. Probability is used to ensure the long-run reliability of inference procedures; type I, II, coverage probabilities (frequentist, behavioristic Neyman-Pearson). An inferential claim H is warranted when it stems from a procedure with a low long-run error.
(iii) Probativism. Probability is used to assess the probing capacity of inference procedures, pre-data (type I, II, coverage probabilities), as well as post-data (p-value, severity evaluation). An inferential claim H is warranted when the different ways it can be false have been adequately probed and averted.

Mayo argues that probativism based on the severe testing account uses error probabilities to output an evidential interpretation based on assessing how severely an inferential claim H has passed a test with data x0. Error control and long-run reliability are necessary but not sufficient for probativism. This perspective is contrasted with probabilism (Law of Likelihood (LL) and Bayesian posterior), which focuses on the relationships between data x0 and hypothesis H, and ignores outcomes x∈Rn other than x0 by adhering to the Likelihood Principle (LP): given a statistical model Mθ(x) and data x0, all relevant sample information for inference purposes is contained in L(θ; x0), ∀θ∈Θ. Such a perspective can produce unwarranted results with high probability, by failing to pick up on optional stopping, data dredging and other biasing selection effects. It is at odds with what is widely accepted as the most effective way to improve replication: predesignation, and transparency about how hypotheses and data were generated and selected.

Chapter (excursion) 2 entitled ‘Taboos of Induction and Falsification’ relates the various uses of probability to draw certain parallels between probabilism, Bayesian statistics and Carnapian logics of confirmation on one side, and performance, frequentist statistics and Popperian falsification on the other. The discussion in this chapter covers a variety of issues in philosophy of science, including, the problem of induction, the asymmetry of induction and falsification, sound vs. valid arguments, enumerative induction (straight rule), confirmation theory (and formal epistemology), statistical affirming the consequent, the old evidence problem, corroboration, demarcation of science and pseudoscience, Duhem’s problem and novelty of evidence. These philosophical issues are also related to statistical conundrums as they relate to significance testing, fallacies of rejection, the cannibalization of frequentist testing known as Null Hypothesis Significance Testing (NHST) in psychology, and the issues raised by the reproducibility and replicability of evidence.

Chapter (excursion) 3 on ‘Statistical Tests and Scientific Inference’ provides a basic introduction to frequentist testing, paying particular attention to crucial details, such as specifying explicitly the assumed statistical model Mθ(x) and the proper framing of hypotheses in terms of its parameter space Θ, with a view to providing a coherent account while avoiding undue formalism. The Neyman-Pearson (N-P) formulation of hypothesis testing is explained using a simple example, and then related to Fisher’s significance testing. What is different from previous treatments is that the claimed ‘inconsistent hybrid’ associated with the NHST caricature of frequentist testing is circumvented. The crucial difference often drawn is based on the N-P emphasis on pre-data long-run error probabilities, and the behavioristic interpretation of tests as accept/reject rules. By contrast, the post-data p-value associated with Fisher’s significance tests is thought to provide a more evidential interpretation. In this chapter, the two approaches are reconciled in the context of the error statistical framework. The N-P formulation provides the formal framework in the context of which an optimal theory of frequentist testing can be articulated, but in its current expositions it lacks a proper evidential interpretation. [For the detailed example see his review pdf.]

If a hypothesis H0 passes a test Τα that was highly capable of finding discrepancies from it, were they to be present, then the passing result indicates some evidence for their absence. The resulting evidential result comes in the form of the magnitude of the discrepancy γ from H0 warranted with test Τα and data x0 at different levels of severity. The intuition underlying the post-data severity is that a small p-value or a rejection of H0 based on a test with low power (e.g. a small n) for detecting a particular discrepancy γ provides stronger evidence for the presence of γ than if the test had much higher power (e.g. a large n).

The post-data severity evaluation outputs the discrepancy γ stemming from the testing results and takes the probabilistic form:

SEV(θ ≶ θ1; x0) = P(d(X) ≷ d(x0); θ1 = θ0 + γ), for all θ1∈Θ1,

where the inequalities are determined by the testing result and the sign of d(x0). [Ed Note: ≶ is his way of combining the definition of severity for both > and <, in order to abbreviate. It is not used in SIST.] When the relevant N-P test result is ‘accept (reject) H0’ one is seeking the smallest (largest) discrepancy γ, in the form of an inferential claim θ ≶ θ1 = θ0 + γ, warranted by Τα and x0 at a high enough probability, say .8 or .9. The severity evaluations are introduced by connecting them to more familiar calculations relating to observed confidence intervals and p-value calculations. A more formal treatment of the post-data severity evaluation is given in chapter (excursion) 5. [Ed. note: “Excursions” are actually Parts, Tours are chapters]
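For the familiar one-sided Z test (H0: µ ≤ µ0, known σ), the severity of the claim µ > µ1 after a rejection reduces to a single Normal tail probability. The following sketch, with invented numbers, illustrates the form of the calculation:

```python
from math import erf, sqrt

def Phi(z):
    """Standard Normal CDF."""
    return 0.5 * (1 + erf(z / sqrt(2)))

def severity_reject(xbar, mu0, sigma, n, mu1):
    """SEV(mu > mu1) after rejecting H0: mu <= mu0 with a one-sided Z test:
    P(d(X) <= d(x0); mu = mu1), where d(X) = sqrt(n)(Xbar - mu0)/sigma."""
    d0 = sqrt(n) * (xbar - mu0) / sigma
    return Phi(d0 - sqrt(n) * (mu1 - mu0) / sigma)

# Hypothetical numbers: n = 100, sigma = 1, mu0 = 0, observed xbar = 0.3.
# SEV(mu > 0.2) = Phi(3 - 2) = Phi(1) ≈ 0.84: the claim mu > 0.2 passes
# with reasonable severity; SEV(mu > 0.35) = Phi(-0.5) ≈ 0.31 does not.
```

Larger claimed discrepancies γ receive lower severity, which is exactly the trade-off the post-data evaluation is meant to report.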

Mayo uses the post-data severity perspective to scotch several misinterpretations of the p-value, including the claim that the p-value is not a legitimate error probability. She also calls into question any comparisons of the tail areas of d(X) under H0 that vary with x∈Rn, with posterior distribution tail areas that vary with θ∈Θ, pointing out that this is tantamount to comparing apples and oranges!

The real life examples of the 1919 eclipse data for testing the General Theory of Relativity, as well as the 2012 discovery of the Higgs particle are used to illustrate some of the concepts in this chapter.

The discussion in this chapter sheds light on several important problems in statistical inference, including several howlers of statistical testing, Jeffreys’ tail area criticism, weak conditionality principle and the likelihood principle.

[To read about excursion 4, see his full review  pdf.]

Chapter (excursion) 5, entitled ‘Power and Severity’, provides an in-depth discussion of power and its abuses or misinterpretations, as well as scotch several confusions permeating the current discussions on the replicability of empirical evidence.

Confusion 1: The power of a N-P test Τα:= {d(X), C1(α)} is a pre-data error probability that calibrates the generic (for any sample realization x∈Rn) capacity of the test in detecting different discrepancies from H0, for a given type I error probability α. As such, the power is not a point function one can evaluate arbitrarily at a particular value θ1. It is defined for all values in the alternative space θ1∈Θ1.

Confusion 2: The power function is properly defined for all θ1∈Θ1 only when (Θ0, Θ1) constitute a partition of Θ. This is to ensure that θ is not in a subset of Θ ignored by the comparisons since the main objective is to narrow down the unknown parameter space Θ using hypothetical values of θ. …Hypothesis testing poses questions as to whether a hypothetical value θ0 is close enough to θ in the sense that the difference (θ – θ0) is ‘statistically negligible’; a notion defined using error probabilities.

Confusion 3: Hacking (1965) raised the problem of using predata error probabilities, such as the significance level α and power, to evaluate the testing results post-data. As mentioned above, the post-data severity aims to address that very problem, and is extensively discussed in Mayo (2018), excursion 5.

Confusion 4: Mayo and Spanos (2006) define “attained power” by replacing cα with the observed d(x0). But this should not be confused with replacing θ1 with its observed estimate [e.g., x̄n], as in what is often called “observed” or “retrospective” power. To compare the two in example 2, contrast:

Attained power: POW(µ1) = Pr(d(X) > d(x0); µ = µ1), for all µ1 > µ0,

with what Mayo calls Shpower, which is defined at µ = x̄n:

Shpower: POW(x̄n) = Pr(d(X) > d(x0); µ = x̄n).

Shpower makes very little statistical sense, unless point estimation justifies the inferential claim x̄n ≅ µ, which it does not, as argued above. Unfortunately, the statistical literature in psychology is permeated with (implicitly) invoking such a claim when touting the merits of estimation-based effect sizes. The estimate x̄n represents just a single value of X̄n ∼ N(µ, σ2/n), and any inference pertaining to µ needs to take into account the uncertainty described by this sampling distribution; hence the call for using interval estimation and hypothesis testing to account for that sampling uncertainty. The post-data severity evaluation addresses this problem using hypothetical reasoning and taking into account the relevant statistical context (11). It outputs the discrepancy from H0 warranted by test Τα and data x0, with high enough severity, say bigger than .85. Invariably, inferential claims of the form µ ≷ µ1 = x̄n are assigned low severity of .5.
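The distinction can be made concrete for a one-sided Z test (a sketch, with the test setup assumed rather than taken from the review): evaluating the attained power at µ1 = x̄n always returns .5, whatever the data.

```python
from math import erf, sqrt

def Phi(z):
    """Standard Normal CDF."""
    return 0.5 * (1 + erf(z / sqrt(2)))

def attained_power(xbar, mu0, sigma, n, mu1):
    """POW(mu1) = Pr(d(X) > d(x0); mu = mu1) for a one-sided Z test,
    with d(X) = sqrt(n)(Xbar - mu0)/sigma and observed mean xbar."""
    d0 = sqrt(n) * (xbar - mu0) / sigma
    return 1 - Phi(d0 - sqrt(n) * (mu1 - mu0) / sigma)

# 'Shpower' evaluates this at mu1 = xbar, making the argument of Phi
# exactly zero, so the answer is 1 - Phi(0) = 0.5 regardless of the data.
```

This is why claims of the form µ ≷ x̄n always come out with severity .5: the data cannot discriminate a parameter value sitting exactly at the observed estimate.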

Confusion 5: Frequentist error probabilities (type I, II, coverage, p-value) are not conditional on H (H0 or H1) since θ=θ0 or θ=θ1 being ‘true or false’ do not constitute legitimate events in the context of Mθ(x); θ is an unknown constant. The clause ‘given H is true’ refers to hypothetical scenarios under which the sampling distribution of the test statistic d(X) is evaluated as in (10).

This confusion undermines the credibility of the Positive Predictive Value (PPV):

PPV = Pr(F|R) = Pr(R|F)Pr(F) / [Pr(R|F)Pr(F) + Pr(R|F̄)Pr(F̄)],

where (i) F = H0 is false, (ii) R = test rejects H0, and (iii) H0: no disease, used by Ioannidis (2005) to make his case that ‘most published research findings are false’ when PPV = Pr(F|R) < .5. His case is based on ‘guessing’ probabilities at a discipline-wide level, such as Pr(F) = .1, Pr(R|F) = .8 and Pr(R|F̄) = .15, and presuming that the last two relate to the power and significance level of a N-P test. He then proceeds to blame the wide-spread abuse of significance testing (p-hacking, multiple testing, cherry-picking, low power) for the high de facto type I error (.15). Granted, such abuses do contribute to untrustworthy evidence, but not via false positive/negative rates since (i) and (iii) are not legitimate events in the context of Mθ(x), and thus Pr(R|F) and Pr(R|F̄) have nothing to do with the significance level and the power of a N-P test. Hence, the analogical reasoning relating the false positive and false negative rates in medical detecting devices to the type I and II error probabilities in frequentist testing is totally misplaced. These rates are established by the manufacturers of medical devices after running a very large number (say, 10000) of medical ‘tests’ with specimens that are known to be positive or negative; they are prepared in a lab. Known ‘positive’ and ‘negative’ specimens constitute legitimate observable events one can condition upon. In contrast, frequentist error probabilities (i) are framed in terms of θ (which are not observable events in Mθ(x)) and (ii) depend crucially on the particular statistical context (11); there is no statistical context for the false positive and false negative rates.
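The arithmetic behind the PPV claim is easy to reproduce; the dispute is over the interpretation of the inputs, not the computation. With the illustrative rates quoted above:

```python
# Reproducing Ioannidis-style PPV arithmetic with the illustrative rates.
# F = 'H0 is false', R = 'test rejects H0'; the rates are guessed inputs,
# not error probabilities of any particular N-P test.
p_F = 0.10             # Pr(F): prior probability of a genuine effect
p_R_given_F = 0.80     # Pr(R|F), treated (contentiously) as 'power'
p_R_given_notF = 0.15  # Pr(R|not-F), the claimed de facto type I error

ppv = (p_R_given_F * p_F) / (
    p_R_given_F * p_F + p_R_given_notF * (1 - p_F))
# ppv = 0.08/0.215 ≈ 0.372 < .5, hence the slogan that 'most published
# research findings are false' under these guessed inputs
```

As the passage above argues, the calculation is unobjectionable Bayes-rule arithmetic; what is contested is equating the guessed rates with the significance level and power of a frequentist test.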

A stronger case can be made that abuses and misinterpretations of frequentist testing are only symptomatic of a more extensive problem: the recipe-like/uninformed implementation of statistical methods. This contributes in many different ways to untrustworthy evidence, including: (i) statistical misspecification (imposing invalid assumptions on one’s data), (ii) poor implementation of inference methods (insufficient understanding of their assumptions and limitations), and (iii) unwarranted evidential interpretations of their inferential results (misinterpreting p-values and CIs, etc.).

Mayo uses the concept of a post-data severity evaluation to illuminate the above-mentioned issues and to explain how it can also provide the missing evidential interpretation of testing results. The same concept is also used throughout the book to clarify numerous misinterpretations of the p-value, as well as two fallacies:
(a) Fallacy of acceptance (non-rejection): no evidence against H0 is misinterpreted as evidence for it. This fallacy easily arises when the test has low power to detect sizeable discrepancies (e.g. the small-n problem).
(b) Fallacy of rejection: evidence against H0 is misinterpreted as evidence for a particular H1. This fallacy easily arises when a test has very high power (e.g. the large-n problem) and so detects even trivial discrepancies; see Mayo and Spanos (2006).
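The large-n side of fallacy (b) can be made concrete with a one-sided z-test under the simple Normal model (the numbers below are assumed for illustration, not taken from the book): the very same trivial discrepancy of .02 yields no rejection at n = 100 but an overwhelming rejection at n = 1,000,000:

```python
import math

def one_sided_p(xbar, mu0, sigma, n):
    """One-sided p-value for H0: mu <= mu0 under the simple
    Normal model with known sigma."""
    z = math.sqrt(n) * (xbar - mu0) / sigma
    return 0.5 * math.erfc(z / math.sqrt(2))  # Pr(Z >= z), Z ~ N(0, 1)

# The same trivial discrepancy xbar - mu0 = .02 (sigma = 1):
print(one_sided_p(0.02, 0.0, 1.0, 100))        # ~ .42: no rejection
print(one_sided_p(0.02, 0.0, 1.0, 1_000_000))  # ~ 0: rejected at any threshold
```

Nothing about the substantive discrepancy changed between the two calls; only n did.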

In chapter 5 Mayo returns to a recurring theme of the book, the mathematical duality between Confidence Intervals (CIs) and hypothesis testing, with a view to calling into question certain claims about the superiority of CIs over p-values. This mathematical duality derails any claim that observed CIs are less vulnerable to the large-n problem or more informative than p-values. Where the two differ is in the inferential claims stemming from their different forms of reasoning, factual vs. hypothetical; that is, the mathematical duality does not imply inferential duality. This is demonstrated by contrasting CIs with the post-data severity evaluation.

Indeed, a case can be made that the post-data severity evaluation addresses several long-standing problems associated with frequentist testing, including the large n problem, the apparent arbitrariness of the N-P framing that allows for simple vs. simple hypotheses, say H0: µ=µ0 vs. H1: µ=µ1, the arbitrariness of the rejection thresholds, the problem of the sharp dichotomy (e.g. reject H0 when p=.0499 but accept it when p=.0501), and distinguishing between statistical and substantive significance. It also provides a natural framework for evaluating reproducibility/replicability issues and brings out the problems associated with observed CIs and estimation-based effect sizes; see Spanos (2019).
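For a one-sided test of H0: µ ≤ µ0 in the simple Normal model with known σ, the post-data severity evaluation has a closed form: following Mayo and Spanos (2006), after a rejection the severity of the inferential claim µ > µ1 is SEV(µ > µ1) = Pr(d(X) ≤ d(x0); µ=µ1). A sketch with assumed illustrative numbers (x̄ = .2, σ = 1, n = 100):

```python
import math

def Phi(z):
    """Standard Normal CDF."""
    return 0.5 * math.erfc(-z / math.sqrt(2))

def severity_gt(mu1, xbar, sigma, n):
    """SEV(mu > mu1) = Pr(d(X) <= d(x0); mu = mu1), evaluated after
    the test has rejected H0: mu <= mu0 (simple Normal, known sigma)."""
    return Phi(math.sqrt(n) * (xbar - mu1) / sigma)

# Severity falls as the claimed discrepancy mu1 grows, separating
# statistical from substantive significance:
for mu1 in [0.0, 0.1, 0.2, 0.3]:
    print(mu1, round(severity_gt(mu1, 0.2, 1.0, 100), 3))
# mu1 = 0.0, 0.1, 0.2, 0.3 -> severity 0.977, 0.841, 0.5, 0.159
```

A rejection thus warrants µ > 0 with high severity here, but not the more ambitious claim µ > .3.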

Chapter 5 also includes a retrospective view of the disputes between Neyman and Fisher in the context of the error statistical perspective on frequentist inference, bringing out their common framing and their differences in emphasis and interpretation. The discussion also includes an interesting summary of their personal conflicts, not always motivated by statistical issues; who said the history of statistics is boring?

Chapter (excursion) 6 of Mayo (2018) raises several important foundational issues and problems pertaining to Bayesian inference, including its primary aim, subjective vs. default Bayesian priors and their interpretations, default Bayesian inference vs. the Likelihood Principle, the role of the catchall factor, the role of Bayes factors in Bayesian testing, and the relationship between Bayesian inference and error probabilities. There is also discussion about attempts by ‘default prior’ Bayesians to unify or reconcile frequentist and Bayesian accounts.

A point emphasized in this chapter pertains to model validation. Despite the fact that Bayesian statistics shares the concept of a statistical model Mθ(x) with frequentist statistics, there is hardly any discussion of validating Mθ(x) to secure the reliability of the posterior distribution, π(θ|x0) ∝ π(θ)·f(x0; θ), upon which all Bayesian inferences are based. The exception is the indirect approach to model validation in Gelman et al (2013) based on the posterior predictive distribution: m(x) = ∫ f(x; θ)π(θ|x0)dθ. (12) Since m(x) is parameter-free, one can use it as a basis for simulating a number of replications x1, x2, …, xn to be used as indirect evidence for potential departures from the model assumptions vis-à-vis data x0. This is clearly different from frequentist M-S testing of the Mθ(x) assumptions, because m(x) is a smoothed mixture of f(x; θ) and π(θ|x0), and one has no way of attributing blame to one or the other when departures are detected. For instance, in the case of the simple Normal model in (9), a highly skewed prior might contribute (indirectly) to departures from the Normality assumption when tested using data simulated via (12). Moreover, the ‘smoothing’ with respect to the parameters in deriving m(x) is likely to render testing departures from the IID assumptions a lot more unwieldy.
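The mechanics of such a posterior predictive check can be sketched with a toy version of (12): a Normal model with known variance and a conjugate Normal prior on the mean (all numbers assumed for illustration; this is not Gelman et al.’s example). Note that the skewness discrepancy below probes the Normality assumption only through the smoothed mixture m(x), which is exactly the attribution problem raised above:

```python
import random
import statistics

random.seed(1)  # reproducibility of the sketch

def posterior_params(x, prior_mu=0.0, prior_var=10.0, sigma=1.0):
    """Conjugate posterior N(post_mu, post_var) for mu, known sigma."""
    n = len(x)
    post_var = 1.0 / (1.0 / prior_var + n / sigma ** 2)
    post_mu = post_var * (prior_mu / prior_var + sum(x) / sigma ** 2)
    return post_mu, post_var

def skewness(x):
    """Sample skewness, used as the discrepancy statistic."""
    m, s = statistics.mean(x), statistics.stdev(x)
    return sum(((v - m) / s) ** 3 for v in x) / len(x)

x0 = [random.gauss(0.5, 1.0) for _ in range(50)]  # 'observed' data
post_mu, post_var = posterior_params(x0)

# Replications from m(x): draw theta from the posterior, then x_rep | theta.
reps = []
for _ in range(500):
    theta = random.gauss(post_mu, post_var ** 0.5)
    reps.append([random.gauss(theta, 1.0) for _ in range(50)])

# Posterior predictive p-value for the skewness discrepancy; values near
# 0 or 1 would flag a departure -- but from f, from the prior, or both?
ppp = sum(skewness(r) >= skewness(x0) for r in reps) / len(reps)
print(ppp)
```

Even when such a check flags trouble, the mixture structure of m(x) leaves the diagnosis (likelihood vs. prior) unresolved, unlike frequentist M-S testing which probes the Mθ(x) assumptions directly.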

On the question posed by the title of this review, Mayo’s answer is that the error statistical framework, a refinement or extension of the original Fisher-Neyman-Pearson framing in the spirit of Peirce, provides a pertinent foundation for frequentist modeling and inference.

3 Conclusions

A retrospective view of Hacking (1965) reveals that its main weakness is that its perspective on statistical induction adheres too closely to the philosophy of science framing of that period, and largely ignores the formalism based on the theory of stochastic processes {Xt, t∈N} that revolves around the concept of a statistical model Mθ(x). Retrospectively, its value stems primarily from a number of very insightful arguments and comments that have survived the test of time. The three that stand out are: (i) an optimal point estimator [θ-hat(X)] of θ does not warrant the inferential claim [θ-hat(x0)] ≅ θ, (ii) a statistical inference is very different from a decision, and (iii) the distinction between the pre-data error probabilities and the post-data evaluation of the evidence stemming from testing results; a distinction that permeates Mayo’s (2018) book. Hacking’s change of mind on the aptness of logicism and the problems with the long-run frequency is also particularly interesting. Hacking’s (1980) view of the long-run frequency is almost indistinguishable from that of Cramér (1946, 332) and Neyman (1952, 27) mentioned above, or Mayo (2018), when he argues: “Probabilities conform to the usual probability axioms which have among their consequences the essential connection between individual and repeated trials, the weak law of large numbers proved by Bernoulli. Probabilities are to be thought of as theoretical properties, with a certain looseness of fit to the observed world. Part of this fit is judged by rules for testing statistical hypotheses along the lines described by Neyman and Pearson. It is a ‘frequency view of probability’ in which probability is a dispositional property…” (Hacking, 1980, 150-151).

‘Probability as a dispositional property’ of a chance set-up alludes to the propensity interpretation of probability associated with Peirce and Popper, which is in complete agreement with the model-based frequentist interpretation; see Spanos (2019).

The main contribution of Mayo’s (2018) book is to put forward a framework and a strategy to evaluate the trustworthiness of evidence resulting from different statistical accounts. Viewing statistical inference as a form of severe testing elucidates the most widely employed arguments surrounding commonly used (and abused) statistical methods. In the severe testing account, probability arises in inference, not to measure degrees of plausibility or belief in hypotheses, but to evaluate and control how severely tested different inferential claims are. Without assuming that other statistical accounts aim for severe tests, Mayo proposes the following strategy for evaluating the trustworthiness of evidence: begin with a minimal requirement that if a test has little or no chance to detect flaws in a claim H, then H’s passing result constitutes untrustworthy evidence. Then, apply this minimal severity requirement to the various statistical accounts as well as to the proposed reforms, including estimation-based effect sizes, observed CIs and redefining statistical significance. Finding that they fail even the minimal severity requirement provides grounds to question the trustworthiness of their evidential claims. One need not reject some of these methods just because they have different aims, but because they give rise to evidence [claims] that fail the minimal severity requirement. Mayo challenges practitioners to be much clearer about their aims in particular contexts and different stages of inquiry. It is in this way that the book ingeniously links philosophical questions about the roles of probability in inference to the concerns of practitioners about coming up with trustworthy evidence across the landscape of the natural and the social sciences.


  • Barnard, George. 1972. Review article: Logic of Statistical Inference. The British Journal for the Philosophy of Science, 23: 123-190.
  • Cramér, Harald. 1946. Mathematical Methods of Statistics. Princeton: Princeton University Press.
  • Fisher, Ronald A. 1922. On the Mathematical Foundations of Theoretical Statistics. Philosophical Transactions of the Royal Society A, 222(602): 309-368.
  • Fisher, Ronald A. 1925. Statistical Methods for Research Workers. Edinburgh: Oliver & Boyd.
  • Gelman, Andrew, John B. Carlin, Hal S. Stern, and Donald B. Rubin. 2013. Bayesian Data Analysis, 3rd ed. London: Chapman & Hall/CRC.
  • Hacking, Ian. 1965. Logic of Statistical Inference. Cambridge: Cambridge University Press.
  • Hacking, Ian. 1972. Review: Likelihood. The British Journal for the Philosophy of Science, 23(2): 132-137.
  • Hacking, Ian. 1980. The Theory of Probable Inference: Neyman, Peirce and Braithwaite. In D. Mellor (ed.), Science, Belief and Behavior: Essays in Honour of R. B. Braithwaite. Cambridge: Cambridge University Press, 141-160.
  • Ioannidis, John P. A. 2005. Why Most Published Research Findings Are False. PLoS Medicine, 2(8): 696-701.
  • Koopman, Bernard O. 1940. The Axioms and Algebra of Intuitive Probability. Annals of Mathematics, 41(2): 269-292.
  • Mayo, Deborah G. 1983. An Objective Theory of Statistical Testing. Synthese, 57(3): 297-340.
  • Mayo, Deborah G. 1996. Error and the Growth of Experimental Knowledge. Chicago: The University of Chicago Press.
  • Mayo, Deborah G. 2018. Statistical Inference as Severe Testing: How to Get Beyond the Statistical Wars. Cambridge: Cambridge University Press.
  • Mayo, Deborah G. and Aris Spanos. 2004. Methodology in Practice: Statistical Misspecification Testing. Philosophy of Science, 71(5): 1007-1025.
  • Mayo, Deborah G. and Aris Spanos. 2006. Severe Testing as a Basic Concept in a Neyman–Pearson Philosophy of Induction. British Journal for the Philosophy of Science, 57(2): 323-357.
  • Mayo, Deborah G. and Aris Spanos. 2011. Error Statistics. In D. Gabbay, P. Thagard, and J. Woods (eds), Philosophy of Statistics, Handbook of Philosophy of Science. New York: Elsevier, 151-196.
  • Neyman, Jerzy. 1952. Lectures and Conferences on Mathematical Statistics and Probability, 2nd ed. Washington: U.S. Department of Agriculture.
  • Royall, Richard. 1997. Statistical Evidence: A Likelihood Paradigm. London: Chapman & Hall.
  • Salmon, Wesley C. 1967. The Foundations of Scientific Inference. Pittsburgh: University of Pittsburgh Press.
  • Spanos, Aris. 2013. A Frequentist Interpretation of Probability for Model-Based Inductive Inference. Synthese, 190(9): 1555-1585.
  • Spanos, Aris. 2017. Why the Decision-Theoretic Perspective Misrepresents Frequentist Inference. In Advances in Statistical Methodologies and Their Applications to Real Problems, 3-28.
  • Spanos, Aris. 2018. Mis-Specification Testing in Retrospect. Journal of Economic Surveys, 32(2): 541-577.
  • Spanos, Aris. 2019. Probability Theory and Statistical Inference: Empirical Modeling with Observational Data, 2nd ed. Cambridge: Cambridge University Press.
  • Von Mises, Richard. 1928. Probability, Statistics and Truth, 2nd ed. New York: Dover.
  • Williams, David. 2001. Weighing the Odds: A Course in Probability and Statistics. Cambridge: Cambridge University Press.
Categories: Spanos, Statistical Inference as Severe Testing

The NAS fixes its (main) mistake in defining P-values!


(reasonably) satisfied

Remember when I wrote to the National Academy of Science (NAS) in September pointing out mistaken definitions of P-values in their document on Reproducibility and Replicability in Science? (see my 9/30/19 post). I’d given up on their taking any action, but yesterday I received a letter from the NAS Senior Program officer:

Dear Dr. Mayo,

I am writing to let you know that the Reproducibility and Replicability in Science report has been updated in response to the issues that you have raised.
Two footnotes, on pages 35 and 221, highlight the changes. The updated report is available from the following link: NEW 2020 NAS DOC

Thank you for taking the time to reach out to me and to Dr. Fineberg and letting us know about your concerns.
With kind regards and wishes of a happy 2020,
Jenny Heimberg
Jennifer Heimberg, Ph.D.
Senior Program Officer

The National Academies of Sciences, Engineering, and Medicine


Categories: NAS, P-values
