**Stephen Senn**

*Consultant Statistician*

*Edinburgh*

‘The term “point estimation” made Fisher nervous, because he associated it with estimation without regard to accuracy, which he regarded as ridiculous.’ Jimmy Savage [1, p. 453]

The classic text by David Cox and David Hinkley, *Theoretical Statistics* (1974), has two extremely interesting features as regards estimation. The first is an indirect, implicit message and the second is explicit; both teach that point estimation is far from being an obvious goal of statistical inference. The indirect message is that the chapter on point estimation (chapter 8) comes *after* that on interval estimation (chapter 7). This may puzzle the reader, who may anticipate that the complications of interval estimation would be handled after the apparently simpler point estimation rather than before. However, with the start of chapter 8, the reasoning is made clear. Cox and Hinkley state:

Superficially, point estimation may seem a simpler problem to discuss than that of interval estimation; in fact, however, any replacement of an uncertain quantity is bound to involve either some arbitrary choice or a precise specification of the purpose for which the single quantity is required. Note that in interval-estimation we explicitly recognize that the conclusion is uncertain, whereas in point estimation…no explicit recognition is involved in the final answer.[2, p. 250]

In my opinion, a great deal of confusion about statistics can be traced to the fact that the point estimate is seen as being the be all and end all, the expression of uncertainty being forgotten. For example, much of the criticism of randomisation overlooks the fact that the statistical analysis will deliver a probability statement and, other things being equal, the more unobserved prognostic factors there are, the more uncertain the result will be claimed to be. However, statistical statements are not wrong *because* they are uncertain; they are wrong if claimed to be more certain (or less certain) than they are.

Amongst justifications that Cox and Hinkley give for calculating point estimates is that when supplemented with an appropriately calculated standard error they will, in many cases, provide the means of calculating a confidence interval, or if you prefer being Bayesian, a credible interval. Thus, to provide a point estimate without also providing a standard error is, indeed, an all too standard error. Of course, there is no value in providing a standard error unless it has been calculated appropriately and addressing the matter of appropriate calculation is not necessarily easy. This is a point I shall pick up below but for the moment let us proceed to consider why it is useful to have standard errors.

First, suppose you have a point estimate. At some time in the past you or someone else decided to collect the data that made it possible. Time and money were invested in doing this. It would not have been worth doing this unless there was a state of uncertainty that the collection of data went some way to resolving. Has it been resolved? Are you certain enough? If not, should more data be collected or would that not be worth it? This cannot be addressed without assessing the uncertainty in your estimate and this is what the standard error is for.

Second, you may wish to combine the estimate with other estimates. This has a long history in statistics. It has been more recently (in the last half century) developed under the heading of *meta-analysis*, which is now a huge field of theoretical study and practical application. However, the subject is much older than that. For example, I have on the shelves of my library at home a copy of the second (1875) edition of *On the Algebraical and Numerical Theory of Errors of Observations and the Combination of Observations*, by George Biddell Airy (1801-1892). [3] Chapter III is entitled ‘Principles of Forming the Most Advantageous Combination of Fallible Measures’ and treats the matter in some detail. For example, Airy defines what he calls the *theoretical weight* (*t.w.*) for combining errors as

$$\text{t.w.} = \frac{1}{(\text{p.e.})^{2}}$$

and then draws attention to ‘two remarkable results’

First. The combination-weight for each measure ought to be proportional to its theoretical weight.

Second. When the combination-weight for each measure is proportional to its theoretical weight, the theoretical weight of the final result is equal to the sum of the theoretical weights of the several collateral measures. (pp. 55-56).

We are now more used to using the standard error (*SE*) rather than the probable error (*PE*) to which Airy refers. However, the *PE*, which can be defined as the *SE* multiplied by the upper quartile of the standard Normal distribution, is just a multiple of the *SE*. Thus we have *PE* ≈ 0.6745 × *SE* and therefore 50% of values ought to be in the range −*PE* to +*PE*, hence the name. Since the *PE* is just a multiple of the *SE*, Airy’s second *remarkable result* applies in terms of *SE*s also. Nowadays we might speak of the *precision*, defined thus

$$\text{precision} = \frac{1}{SE^{2}}$$

and say that estimates should be combined in proportion to their precision, in which case the precision of the final result will be the sum of the individual precisions.
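As a minimal sketch of precision weighting, taking precision to be 1/*SE*² and assuming the estimates are independent, the combination can be written in a few lines of Python. The function name and the two input estimates are hypothetical and chosen only for illustration.

```python
def combine_by_precision(estimates, standard_errors):
    """Pool independent estimates with weights proportional to precision (1 / SE**2)."""
    precisions = [1.0 / se**2 for se in standard_errors]
    total_precision = sum(precisions)  # precision of the pooled result
    pooled = sum(p * e for p, e in zip(precisions, estimates)) / total_precision
    pooled_se = total_precision ** -0.5
    return pooled, pooled_se, total_precision


# Hypothetical inputs: two estimates, the second with half the standard error.
estimates = [1.0, 2.0]
standard_errors = [1.0, 0.5]

pooled, pooled_se, total_precision = combine_by_precision(estimates, standard_errors)
print(f"pooled estimate = {pooled:.3f}, SE = {pooled_se:.3f}, precision = {total_precision:.1f}")
```

The second estimate carries four times the precision of the first, so it receives four fifths of the weight, and the pooled precision is the sum of the two individual precisions.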

This second edition of Airy’s book dates from 1875 but, although I have not got a copy of the first edition, which dates from 1861, I am confident that the history can be pushed at least as far back as that. In fact, as has often been noticed, fixed effects meta-analysis is really just a form of least squares, a subject developed at the end of the 18^{th} and beginning of the 19^{th} century by Legendre, Gauss and Laplace, amongst others. [4]

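A third reason to be interested in standard errors is that you may wish to carry out a Bayesian analysis. In that case, you should consider what the mean and the ‘standard error’ of your prior distribution are. You can then apply Airy’s two remarkable results as follows.

$$\text{posterior mean} = \frac{\text{prior precision} \times \text{prior mean} + \text{data precision} \times \text{estimate}}{\text{prior precision} + \text{data precision}}$$

and

$$\text{posterior precision} = \text{prior precision} + \text{data precision}$$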
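A small sketch may help here. It assumes a Normal prior and a Normal likelihood with a known standard error (an assumption of this illustration, not something stated above), in which case Airy's two results give a precision-weighted posterior mean and a posterior precision equal to the sum of the prior and data precisions. The function name and the numbers are hypothetical.

```python
from math import sqrt


def normal_posterior(prior_mean, prior_se, estimate, estimate_se):
    """Normal-Normal update: precision-weighted mean, summed precisions."""
    prior_precision = 1.0 / prior_se**2
    data_precision = 1.0 / estimate_se**2
    post_precision = prior_precision + data_precision
    post_mean = (prior_precision * prior_mean + data_precision * estimate) / post_precision
    post_se = 1.0 / sqrt(post_precision)
    return post_mean, post_se


# Hypothetical prior (mean 0, 'standard error' 2) and data (estimate 1, SE 1).
post_mean, post_se = normal_posterior(prior_mean=0.0, prior_se=2.0, estimate=1.0, estimate_se=1.0)

# Approximate 95% credible interval under the Normal assumption.
lower, upper = post_mean - 1.96 * post_se, post_mean + 1.96 * post_se
print(f"posterior mean = {post_mean:.3f}, SE = {post_se:.3f}, 95% interval = ({lower:.3f}, {upper:.3f})")
```

Without a standard error for the estimate there would be nothing to weight the data against the prior with, which is the point of this third reason.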

Suppose that you regard all this concern with uncertainty as an unnecessary refinement and argue, “Never mind Airy’s precision weighting; when I have more than one estimate, I’ll just use an unweighted average”. This might seem like a reasonable ‘belt and braces’ approach but the figure below illustrates a problem. It supposes the following. You have one estimate and you then obtain a second. You now form an unweighted average of the two. What is the precision of this mean compared to a) the first result alone and b) the second result alone? In the figure, the X axis gives the relative precision of the second result alone to that of the first result alone. The Y axis gives the relative precision of the mean to the first result alone (red curve) or to the second result alone (blue curve).

Where a curve is below 1, the precision of the mean is below the relevant single result. If the precision of the second result is less than 1/3 of the first, you would be better off using the first result alone. On the other hand, if the second result is more than three times as precise as the first, you would be better off using the second alone. The consequence is, that if you do not know the precision of your results you *not only don’t know which one to trust, you don’t even know if an average of them should be preferred*.
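The two curves can be reproduced from a short calculation. The sketch below assumes the two estimates are independent, so that the variance of their unweighted mean is one quarter of the sum of their variances; the function name is hypothetical.

```python
def relative_precision_of_mean(r):
    """r = precision of the second result / precision of the first result.

    Returns the precision of the unweighted mean relative to
    (the first result alone, the second result alone).
    """
    if r <= 0:
        raise ValueError("r must be positive")
    # Scale so the first result has variance 1 and the second has variance 1 / r.
    # Var(mean) = (1 + 1/r) / 4, so its precision is 4r / (1 + r).
    precision_mean = 4.0 * r / (1.0 + r)
    return precision_mean, precision_mean / r


for r in (0.2, 1 / 3, 1.0, 3.0, 5.0):
    vs_first, vs_second = relative_precision_of_mean(r)
    print(f"r = {r:.3f}: mean vs first = {vs_first:.3f}, mean vs second = {vs_second:.3f}")
```

The ratio to the first result is 4r/(1 + r), which falls below 1 when r < 1/3, and the ratio to the second is 4/(1 + r), which falls below 1 when r > 3, matching the two thresholds quoted in the text.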

So, to sum up, if you don’t know how uncertain your evidence is, you can’t use it. Thus, assessing uncertainty is important. However, as I said in the introduction, all too easily, attention focuses on estimating the parameter of interest and not the probability statement. This (perhaps unconscious) obsession with point estimation as the be all and end all causes problems. As a common example of the problem, consider the following statement: ‘all covariates are balanced, therefore they do not need to be in the model’. The point of view expresses the belief that nothing of relevance will change if the covariates are not in the model, so why bother.

It is true that if a linear model applies, the point estimate for a ‘treatment effect’ will not change by including balanced covariates in the model. However, the expression of uncertainty will be quite different. The balanced case is one that does not apply in general. It thus follows that valid expressions of uncertainty have to allow for prognostic factors being imbalanced and this is, indeed, what they do. Misunderstanding of this is an error often made by critics of randomisation. I explain the misunderstanding like this: *If we knew that important but unobserved prognostic factors were balanced, the standard analysis of clinical trials would be wrong*. Thus, to claim that the analysis of randomised clinical trials relies on prognostic factors being balanced is exactly the opposite of what is true. [5]

As I explain in my blog Indefinite Irrelevance, if the prognostic factors are balanced, not adjusting for them treats them as if they might be imbalanced: the confidence interval will be too wide given that we know that they are __not__ imbalanced. (See also The Well Adjusted Statistician. [6])

Another way of understanding this is through the following example.

Consider a two-armed placebo-controlled clinical trial of a drug with a binary covariate (let us take the specific example of *sex*) and suppose that the patients split 50:50 according to the covariate. Now consider these two questions. What allocation of patients by sex within treatment arms will be such that differences in sex do not impact on 1) the estimate of the effect and 2) the estimate of the standard error of the estimate of the effect?

Everybody knows what the answer is to 1): the males and females must be equally distributed with respect to treatment. (Allocation one in the table below.) However, the answer to 2) is less obvious: it is that the two groups within which variances are estimated must be homogeneous by treatment and sex. (Allocation two in the table below shows one of the two possibilities.) That means that if we do not put sex in the model, in order to remove sex from affecting the estimate of the variance, we would have to have all the males in one treatment group and all the females in the other.

| Treatment | Allocation one: Male | Allocation one: Female | Allocation two: Male | Allocation two: Female | Total |
|---|---|---|---|---|---|
| Placebo | 25 | 25 | 50 | 0 | 50 |
| Drug | 25 | 25 | 0 | 50 | 50 |
| Total | 50 | 50 | 50 | 50 | 100 |

Table: Percentage allocation by sex and treatment for two possible clinical trials

Of course, nobody uses allocation two but if allocation one is used, then the logical approach is to analyse the data so that the influence of sex is eliminated from the estimate of the variance, and hence the standard error. Savage, referring to Fisher, puts it thus:

He taught what should be obvious but always demands a second thought from me: if an experiment is laid out to diminish the variance of comparisons, as by using matched pairs…the variance eliminated from the comparison shows up in the estimate of this variance (unless care is taken to eliminate it)… [1, p. 450]

The consequence is that one needs to allow for this in the estimation procedure. One needs to ensure not only that the effect is estimated appropriately but that __its uncertainty is also assessed appropriately__. In our example this means that *sex*, in addition to *treatment*, must be in the model.
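The point can be illustrated with simulated data for allocation one. This is only a sketch: the treatment effect, sex effect, noise level and random seed are hypothetical choices, and ordinary least squares via NumPy stands in for whatever analysis would actually be used.

```python
import numpy as np

rng = np.random.default_rng(0)

# Allocation one: 25 patients in each treatment-by-sex cell (100 in total).
drug = np.repeat([0, 0, 1, 1], 25).astype(float)
female = np.repeat([0, 1, 0, 1], 25).astype(float)

# Hypothetical data-generating values, chosen so that sex is strongly prognostic.
treatment_effect, sex_effect, noise_sd = 1.0, 5.0, 1.0
y = treatment_effect * drug + sex_effect * female + rng.normal(0.0, noise_sd, size=100)


def fit_treatment(covariates):
    """OLS fit; returns the treatment estimate and its standard error.

    The first entry of `covariates` must be the treatment indicator.
    """
    X = np.column_stack([np.ones_like(y)] + covariates)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    residuals = y - X @ beta
    sigma2 = residuals @ residuals / (len(y) - X.shape[1])  # residual variance
    cov = sigma2 * np.linalg.inv(X.T @ X)
    return beta[1], np.sqrt(cov[1, 1])


est_unadjusted, se_unadjusted = fit_treatment([drug])
est_adjusted, se_adjusted = fit_treatment([drug, female])

print(f"treatment only : estimate = {est_unadjusted:.3f}, SE = {se_unadjusted:.3f}")
print(f"treatment + sex: estimate = {est_adjusted:.3f}, SE = {se_adjusted:.3f}")
```

Because sex is perfectly balanced across the arms, the two point estimates coincide, but the standard error from the model without sex is inflated by the variation between the sexes that the design has already removed from the comparison.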

‘it doesn’t approve of your philosophy’ Ray Bradbury, *Here There Be Tygers*

So, estimating uncertainty is a key task of any statistician. Most commonly, it is addressed by calculating a standard error. However, this is not necessarily a simple matter. The school of statistics associated with design and analysis of agricultural experiments founded by RA Fisher, and to which I have referred as the Rothamsted School, addressed this in great detail. Such agricultural experiments could have a complicated block structure, for example, rows and columns of a field, with whole plots defined by their crossing and subplots within the whole plots. Many treatments could be studied simultaneously, with some (for example crop variety) being capable of being varied at the level of the plots but some (for example fertiliser) at the level of the subplots. This meant that variation at different levels affected different treatment factors. John Nelder developed a formal calculus to address such complex problems [7, 8].

In the world of clinical trials in which I have worked, we distinguish between trials in which patients can receive different treatments on different occasions, those in which each patient can independently receive only one treatment, and those in which all the patients in the same centre must receive the same treatment. Each such design (cross-over, parallel, cluster) requires a different approach to assessing uncertainty. (See To Infinity and Beyond.) Naively treating all observations as independent can underestimate the standard error, a problem that Hurlbert has referred to as *pseudoreplication*. [9]
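To make the underestimation concrete, here is a minimal sketch for the mean of a clustered sample. It assumes a simple variance-components model with a between-cluster and a within-cluster variance, which is an assumption of this illustration rather than a description of any particular trial; the function name and numbers are hypothetical.

```python
from math import sqrt


def se_of_mean(n_clusters, per_cluster, between_var, within_var):
    """Standard error of an overall mean under a two-level variance-components model.

    Returns (naive SE treating all observations as independent,
             SE that respects the clustering).
    """
    n = n_clusters * per_cluster
    naive = sqrt((between_var + within_var) / n)
    # The between-cluster component is only averaged over clusters, not over patients.
    clustered = sqrt(between_var / n_clusters + within_var / n)
    return naive, clustered


# Hypothetical design: 10 centres with 20 patients each.
naive_se, cluster_se = se_of_mean(n_clusters=10, per_cluster=20, between_var=1.0, within_var=4.0)
print(f"naive SE = {naive_se:.3f}, SE allowing for clustering = {cluster_se:.3f}")
```

With these numbers the naive standard error is less than half the one that allows for clustering, so the naive analysis would claim far more precision than the design delivers.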

A key point, however, is this: the formal nature of experimentation forces this issue of variation to our attention. In observational studies we may be careless. We tend to assume that once we have chosen and made various adjustments to correct bias in the point estimate, the ‘errors’ can then be treated as independent. However, only for the simplest of experimental studies would such an assumption be true, so what justifies making it as a matter of habit for observational ones?

Recent work on historical controls has underlined the problem [10-12]. Trials that use such controls have features of both experimental and observational studies and so provide an illustrative bridge between the two. It turns out that treating the data as if they came from one observational study would underestimate the variance and hence overestimate the precision of the result. The implication is that analyses of observational studies more generally may be producing inappropriately narrow confidence intervals. [10]

If a man will begin with certainties, he shall end in doubts; but if he will be content to begin with doubts he shall end in certainties. Francis Bacon, *The Advancement of Learning*, Book I, v, 8.

In short, I am making an argument for Fisher’s general attitude to inference. Harry Marks has described it thus:

Fisher was a sceptic…But he was an unusually constructive sceptic. Uncertainty and error were, for Fisher, inevitable. But ‘rigorously specified uncertainty’ provided a firm ground for making provisional sense of the world. H Marks [13, p.94]

Point estimates are not enough. It is rarely the case that you have to act immediately based on your best guess. Where you don’t, you have to know how good your guesses are. This requires a principled approach to assessing uncertainty.

1. Savage, J., *On rereading R.A. Fisher.* Annals of Statistics, 1976. **4**(3): p. 441-500.
2. Cox, D.R. and D.V. Hinkley, *Theoretical Statistics*. 1974, London: Chapman and Hall.
3. Airy, G.B., *On the Algebraical and Numerical Theory of Errors of Observations and the Combination of Observations*. 1875, London: Macmillan.
4. Stigler, S.M., *The History of Statistics: The Measurement of Uncertainty before 1900*. 1986, Cambridge, Massachusetts: Belknap Press.
5. Senn, S.J., *Seven myths of randomisation in clinical trials.* Statistics in Medicine, 2013. **32**(9): p. 1439-50.
6. Senn, S.J., *The well-adjusted statistician.* Applied Clinical Trials, 2019: p. 2.
7. Nelder, J.A., *The analysis of randomised experiments with orthogonal block structure I. Block structure and the null analysis of variance.* Proceedings of the Royal Society of London. Series A, 1965. **283**: p. 147-162.
8. Nelder, J.A., *The analysis of randomised experiments with orthogonal block structure II. Treatment structure and the general analysis of variance.* Proceedings of the Royal Society of London. Series A, 1965. **283**: p. 163-178.
9. Hurlbert, S.H., *Pseudoreplication and the design of ecological field experiments.* Ecological Monographs, 1984. **54**(2): p. 187-211.
10. Collignon, O., et al., *Clustered allocation as a way of understanding historical controls: Components of variation and regulatory considerations.* Stat Methods Med Res, 2019: p. 962280219880213.
11. Galwey, N.W., *Supplementation of a clinical trial by historical control data: is the prospect of dynamic borrowing an illusion?* Statistics in Medicine, 2017. **36**(6): p. 899-916.
12. Schmidli, H., et al., *Robust meta‐analytic‐predictive priors in clinical trials with historical control information.* Biometrics, 2014. **70**(4): p. 1023-1032.
13. Marks, H.M., *Rigorous uncertainty: why RA Fisher is important.* Int J Epidemiol, 2003. **32**(6): p. 932-7; discussion 945-8.


Aris Spanos was asked to review my *Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars* (CUP, 2018), but he was to combine it with a review of the re-issue of Ian Hacking’s classic *Logic of Statistical Inference*. The journal is *OEconomia: History, Methodology, Philosophy*. Below are excerpts from his discussion of my book (pp. 843-860). I will jump past the Hacking review and occasionally excerpt for length. To read his full article go to external journal pdf or stable internal blog pdf.

**….**

The sub-title of Mayo’s (2018) book provides an apt description of the primary aim of the book in the sense that its focus is on the current discussions pertaining to replicability and trustworthy empirical evidence that revolve around the main fault line in statistical inference: the nature, interpretation and uses of probability in statistical modeling and inference. This underlies not only the form and structure of inductive inference, but also the nature of the underlying statistical reasonings as well as the nature of the evidence it gives rise to.

A crucial theme in Mayo’s book pertains to the current confusing and confused discussions on reproducibility and replicability of empirical evidence. The book cuts through the enormous level of confusion we see today about basic statistical terms, and in so doing explains why the experts so often disagree about reforms intended to improve statistical science.

Mayo makes a concerted effort to delineate the issues and clear up these confusions by defining the basic concepts accurately and placing many widely held methodological views in the best possible light before scrutinizing them. In particular, the book discusses at length the merits and demerits of the proposed reforms which include: (a) replacing p-values with Confidence Intervals (CIs), (b) using estimation-based effect sizes and (c) redefining statistical significance.

The key philosophical concept employed by Mayo to distinguish between a *sound* empirical evidential claim for a hypothesis *H* and an unsound one is the notion of a *severe test*: if little has been done to rule out *flaws (errors and omissions)* in pronouncing that data **x**_{0} provide evidence for a hypothesis *H*, then that inferential claim has not passed a severe test, rendering the claim untrustworthy. One has trustworthy evidence for a claim C only to the extent that C passes a severe test; see Mayo (1983; 1996). A distinct advantage of the concept of severe testing is that it is sufficiently general to apply to both frequentist and Bayesian inferential methods.

Mayo makes a case that there is a two-way link between philosophy and statistics. On one hand, philosophy helps in resolving conceptual, logical, and methodological problems of statistical inference. On the other hand, viewing statistical inference as severe testing gives rise to novel solutions to crucial philosophical problems including induction, falsification and the demarcation of science from pseudoscience. In addition, it serves as the foundation for understanding and getting beyond the statistics wars that currently revolve around the replication crises; hence the title of the book, *Statistical Inference as Severe Testing*.

Chapter (excursion) 1 of Mayo’s (2018) book sets the scene by scrutinizing the different role of probability in *statistical inference*, distinguishing between:

**(i) Probabilism.** Probability is used to assign a *degree of confirmation, support or belief* in a hypothesis *H*, given data **x**_{0} (Bayesian, likelihoodist, Fisher (fiducial)). An inferential claim *H* is warranted when it is assigned a *high* probability, support, or degree of belief (absolute or comparative).

**(ii) Performance.** Probability is used to ensure the *long-run reliability* of inference procedures; type I, II, coverage probabilities (frequentist, behavioristic Neyman-Pearson). An inferential claim *H* is warranted when it stems from a procedure with a low long-run error.

**(iii) Probativism.** Probability is used to assess the *probing capacity* of inference procedures, *pre-data* (type I, II, coverage probabilities), as well as *post-data* (p-value, severity evaluation). An inferential claim *H* is warranted when the different ways it can be false have been adequately probed and averted.

Mayo argues that probativism based on the severe testing account uses error probabilities to output an evidential interpretation based on assessing how severely an inferential claim *H* has passed a test with data **x**_{0}. Error control and long-run reliability are necessary but not sufficient for probativism. This perspective is contrasted to probabilism (Law of Likelihood (LL) and Bayesian posterior) that focuses on the relationships between data **x**_{0} and hypothesis *H*, and ignores outcomes **x**∈*R*^{n} other than

Chapter (excursion) 2 entitled ‘Taboos of Induction and Falsification’ relates the various uses of probability to draw certain parallels between probabilism, Bayesian statistics and Carnapian logics of confirmation on one side, and performance, frequentist statistics and Popperian falsification on the other. The discussion in this chapter covers a variety of issues in philosophy of science, including, the problem of induction, the asymmetry of induction and falsification, sound vs. valid arguments, enumerative induction (straight rule), confirmation theory (and formal epistemology), statistical affirming the consequent, the old evidence problem, corroboration, demarcation of science and pseudoscience, Duhem’s problem and novelty of evidence. These philosophical issues are also related to statistical conundrums as they relate to significance testing, fallacies of rejection, the cannibalization of frequentist testing known as Null Hypothesis Significance Testing (NHST) in psychology, and the issues raised by the reproducibility and replicability of evidence.

Chapter (excursion) 3 on ‘Statistical Tests and Scientific Inference’ provides a basic introduction to frequentist testing paying particular attention to crucial details, such as specifying explicitly the assumed statistical model **M**_{θ}(**x**) and the proper framing of hypotheses in terms of its parameter space Θ, with a view to provide a coherent account by avoiding undue formalism. The Neyman-Pearson (N-P) formulation of hypothesis testing is explained using a simple example, and then related to Fisher’s significance testing. What is different from previous treatments is that the claimed ‘inconsistent hybrid’ associated with the NHST caricature of frequentist testing is circumvented. The crucial difference often drawn is based on the N-P emphasis on pre-data long-run error probabilities, and the behavioristic interpretation of tests as accept/reject rules. By contrast, the post-data p-value associated with Fisher’s significance tests is thought to provide a more evidential interpretation. In this chapter, the two approaches are reconciled in the context of the error statistical framework. The N-P formulation provides the formal framework in the context of which an optimal theory of frequentist testing can be articulated, but in its current expositions lacks a proper evidential interpretation. **[For the detailed example see his review pdf.]** …

If a hypothesis *H*_{0} passes a test

The *post-data severity evaluation* outputs the discrepancy γ stemming from the testing results and takes the probabilistic form:

SEV(θ ≶ θ_{1};

where the inequalities are determined by the testing result and the sign of d(**x**_{0}).

Mayo uses the post-data severity perspective to scotch several misinterpretations of the p-value, including the claim that the p-value is not a legitimate error probability. She also calls into question any comparisons of the tail areas of d(**X**) under *H*_{0} that vary with

The real life examples of the 1919 eclipse data for testing the General Theory of Relativity, as well as the 2012 discovery of the Higgs particle are used to illustrate some of the concepts in this chapter.

The discussion in this chapter sheds light on several important problems in statistical inference, including several howlers of statistical testing, Jeffreys’ tail area criticism, weak conditionality principle and the likelihood principle.

…**[To read about excursion 4, see his full review pdf.]**

Chapter (excursion) 5, entitled ‘Power and Severity’, provides an in-depth discussion of power and its abuses or misinterpretations, as well as scotching several confusions permeating the current discussions on the replicability of empirical evidence.

**Confusion 1:** The power of an N-P test *Τ*_{α}:= {d(**X**), C_{1}(α)} is a *pre-data* error probability that calibrates the generic (for any sample realization **x**∈*R*^{n}) capacity of the test in detecting different discrepancies from

**Confusion 2:** The power function is properly defined for all θ_{1}∈Θ_{1} only when (Θ_{0}, Θ_{1}) constitute a partition of Θ. This is to ensure that θ^{∗} is not in a subset of Θ ignored by the comparisons since the *main objective* is to *narrow down* the unknown parameter space Θ using *hypothetical* values of θ. …Hypothesis testing poses questions as to whether a hypothetical value θ_{0} is close enough to θ^{∗} in the sense that the difference (θ^{∗} – θ_{0}) is ‘statistically negligible’; a notion defined using error probabilities.

**Confusion 3:** Hacking (1965) raised the problem of using predata error probabilities, such as the significance level α and power, to evaluate the testing results post-data. As mentioned above, the post-data severity aims to address that very problem, and is extensively discussed in Mayo (2018), excursion 5.

**Confusion 4:** Mayo and Spanos (2006) define “attained power” by replacing c_{α} with the observed d(**x**_{0}). But this should not be confused with replacing θ_{1} with its observed estimate [e.g., *x̄*_{n}], as in what is often called “observed” or “retrospective” power. To compare the two in example 2, contrast:

Attained power: POW(µ_{1})=Pr(d(**X**) > d(**x**_{0}); µ=µ_{1}), for all µ_{1}>µ_{0},

with what Mayo calls Shpower, which is defined at µ=*x̄*_{n}:

Shpower: POW(*x̄*_{n})=Pr(d(**X**) > d(**x**_{0}); µ=*x̄*_{n}).

Shpower makes very little statistical sense, unless point estimation justifies the inferential claim *x̄*_{n} ≅ µ^{∗}, which it does not, as argued above. Unfortunately, the statistical literature in psychology is permeated with (implicitly) invoking such a claim when touting the merits of estimation-based effect sizes. The estimate *x̄*_{n} represents just a single value of *X̄*_{n} ∼ N(µ, σ^{2}/n), and any inference pertaining to µ needs to take into account the uncertainty described by this sampling distribution; hence, the call for using interval estimation and hypothesis testing to account for that sampling uncertainty. The post-data severity evaluation addresses this problem using hypothetical reasoning and taking into account the relevant statistical context (11). It outputs the discrepancy from *H*_{0} warranted by test
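The contrast between attained power and Shpower, as the two formulas are printed above, can be checked numerically. The sketch below assumes a one-sided test of µ=µ_{0} against µ>µ_{0} in a simple Normal model with known σ and test statistic d(**X**) = √n(sample mean − µ_{0})/σ; that form of d(**X**), the function name and all the numbers are assumptions made for illustration, since example 2 is not reproduced here.

```python
from math import sqrt
from statistics import NormalDist


def attained_power(mu1, xbar, mu0, sigma, n):
    """Pr(d(X) > d(x0); mu = mu1) with d(X) = sqrt(n) * (Xbar - mu0) / sigma."""
    d_obs = sqrt(n) * (xbar - mu0) / sigma   # observed test statistic d(x0)
    shift = sqrt(n) * (mu1 - mu0) / sigma    # mean of d(X) when mu = mu1
    return 1.0 - NormalDist().cdf(d_obs - shift)


# Hypothetical numbers: n = 25, sigma = 1, observed sample mean 0.4, so d(x0) = 2.
mu0, sigma, n, xbar = 0.0, 1.0, 25, 0.4

for mu1 in (0.2, 0.4, 0.6):
    print(f"attained power at mu1 = {mu1}: {attained_power(mu1, xbar, mu0, sigma, n):.4f}")

# Shpower: the same expression evaluated at mu equal to the observed sample mean.
shpower = attained_power(xbar, xbar, mu0, sigma, n)
print(f"Shpower: {shpower:.4f}")
```

Under these assumptions, evaluating the expression at the observed sample mean centres the sampling distribution of d(**X**) on d(**x**_{0}), so the value returned is 0.5 whatever the data are, whereas attained power varies with the hypothetical µ_{1} being probed.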

**Confusion 5:** Frequentist error probabilities (type I, II, coverage, p-value) are *not conditional* on *H* (*H*_{0} or *H*_{1}) since θ=θ

This confusion undermines the credibility of Positive Predictive Value (PPV):

where (i) *F* = *H*_{0} is false, (ii) *R* = test rejects

A stronger case can be made that abuses and misinterpretations of frequentist testing are only symptomatic of a more extensive problem: the *recipe-like/uninformed implementation of statistical methods*. This contributes in many different ways to untrustworthy evidence, including: (i) statistical misspecification (imposing invalid assumptions on one’s data), (ii) poor implementation of inference methods (insufficient understanding of their assumptions and limitations), and (iii) unwarranted evidential interpretations of their inferential results (misinterpreting p-values and CIs, etc.).

Mayo uses the concept of a post-data severity evaluation to illuminate the above mentioned issues and explain how it can also provide the missing evidential interpretation of testing results. The same concept is also used to clarify numerous misinterpretations of the p-value throughout this book, as well as the fallacies:

**(a) Fallacy of acceptance (non-rejection).** No evidence against *H*_{0} is misinterpreted as evidence for it. This fallacy can easily arise when the power of a test is low (e.g. small

In chapter 5 Mayo returns to a recurring theme throughout the book, the mathematical duality between Confidence Intervals (CIs) and hypothesis testing, with a view to call into question certain claims about the superiority of CIs over p-values. This mathematical duality derails any claims that observed CIs are less vulnerable to the large n problem and more informative than p-values. Where they differ is in terms of their inferential claims stemming from their different forms of reasoning, factual vs. hypothetical. That is, the mathematical duality does not imply inferential duality. This is demonstrated by contrasting CIs with the post-data severity evaluation.

Indeed, a case can be made that the post-data severity evaluation addresses several long-standing problems associated with frequentist testing, including the large *n* problem, the apparent arbitrariness of the N-P framing that allows for simple vs. simple hypotheses, say *H*_{0}: µ = 1 vs.

Chapter 5 also includes a retrospective view of the disputes between Neyman and Fisher in the context of the error statistical perspective on frequentist inference, bringing out their common framing and their differences in emphasis and interpretation. The discussion also includes an interesting summary of their personal conflicts, not always motivated by statistical issues; who said the history of statistics is boring?

Chapter (excursion) 6 of Mayo (2018) raises several important foundational issues and problems pertaining to Bayesian inference, including its primary aim, subjective vs. default Bayesian priors and their interpretations, default Bayesian inference vs. the Likelihood Principle, the role of the catchall factor, the role of Bayes factors in Bayesian testing, and the relationship between Bayesian inference and error probabilities. There is also discussion about attempts by ‘default prior’ Bayesians to unify or reconcile frequentist and Bayesian accounts.

A point emphasized in this chapter pertains to model validation. Despite the fact that Bayesian statistics shares the same concept of a statistical model **M**_{θ}(**x**) with frequentist statistics, there is hardly any discussion on validating **M**_{θ}(**x**) to secure the reliability of the posterior distribution: … upon which all Bayesian inferences are based. The exception is the indirect approach to model validation in Gelman et al (2013) based on the posterior predictive distribution: Since *m*(**x**) is parameter free, one can use it as a basis for simulating a number of replications **x**_{1},

On the question posed by the title of this review, Mayo’s answer is that the error statistical framework, a refinement or extension of the original Fisher-Neyman-Pearson framing in the spirit of Peirce, provides a pertinent foundation for frequentist modeling and inference.

A retrospective view of Hacking (1965) reveals that its main weakness is that its perspective on statistical induction adheres too closely to the philosophy of science framing of that period, and largely ignores the formalism based on the theory of stochastic processes {X* _{t}*, t∈

‘Probability as a dispositional property’ of a chance set-up alludes to the *propensity interpretation* of probability associated with Peirce and Popper, which is in complete agreement with the model-based frequentist interpretation; see Spanos (2019).

The main contribution of Mayo’s (2018) book is to put forward a framework and a strategy to evaluate the trustworthiness of evidence resulting from different statistical accounts. Viewing statistical inference as a form of severe testing elucidates the most widely employed arguments surrounding commonly used (and abused) statistical methods. In the severe testing account, probability arises in inference, not to measure degrees of plausibility or belief in hypotheses, but to evaluate and control how severely tested different inferential claims are. Without assuming that other statistical accounts aim for severe tests, Mayo proposes the following strategy for evaluating the trustworthiness of evidence: begin with a minimal requirement that if a test has little or no chance to detect flaws in a claim *H*, then *H*’s passing result constitutes untrustworthy evidence. Then, apply this minimal severity requirement to the various statistical accounts as well as to the proposed reforms, including estimation-based effect sizes, observed CIs and redefining statistical significance. Finding that they fail even the minimal severity requirement provides grounds to question the trustworthiness of their evidential claims. One need not reject some of these methods just because they have different aims, but because they give rise to evidence [claims] that fail the minimal severity requirement. Mayo challenges practitioners to be much clearer about their aims in particular contexts and different stages of inquiry. It is in this way that the book ingeniously links philosophical questions about the roles of probability in inference to the concerns of practitioners about coming up with trustworthy evidence across the landscape of the natural and the social sciences.

**References**

- Barnard, George. 1972. Review article: Logic of Statistical Inference. *The British Journal for the Philosophy of Science*, 23: 123-190.
- Cramer, Harald. 1946. *Mathematical Methods of Statistics*. Princeton: Princeton University Press.
- Fisher, Ronald A. 1922. On the Mathematical Foundations of Theoretical Statistics. *Philosophical Transactions of the Royal Society A*, 222(602): 309-368.
- Fisher, Ronald A. 1925. *Statistical Methods for Research Workers*. Edinburgh: Oliver & Boyd.
- Gelman, Andrew, John B. Carlin, Hal S. Stern, Donald B. Rubin. 2013. *Bayesian Data Analysis*, 3rd ed. London: Chapman & Hall/CRC.
- Hacking, Ian. 1972. Review: Likelihood. *The British Journal for the Philosophy of Science*, 23(2): 132-137.
- Hacking, Ian. 1980. The Theory of Probable Inference: Neyman, Peirce and Braithwaite. In D. Mellor (ed.), *Science, Belief and Behavior: Essays in Honour of R. B. Braithwaite*. Cambridge: Cambridge University Press, 141-160.
- Ioannidis, John P. A. 2005. Why Most Published Research Findings Are False. *PLoS Medicine*, 2(8): 696-701.
- Koopman, Bernard O. 1940. The Axioms and Algebra of Intuitive Probability. *Annals of Mathematics*, 41(2): 269-292.
- Mayo, Deborah G. 1983. An Objective Theory of Statistical Testing. *Synthese*, 57(3): 297-340.
- Mayo, Deborah G. 1996. *Error and the Growth of Experimental Knowledge*. Chicago: The University of Chicago Press.
- Mayo, Deborah G. 2018. *Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars*. Cambridge: Cambridge University Press.
- Mayo, Deborah G. and Aris Spanos. 2004. Methodology in Practice: Statistical Misspecification Testing. *Philosophy of Science*, 71(5): 1007-1025.
- Mayo, Deborah G. and Aris Spanos. 2006. Severe Testing as a Basic Concept in a Neyman–Pearson Philosophy of Induction. *British Journal for the Philosophy of Science*, 57(2): 323-357.
- Mayo, Deborah G. and Aris Spanos. 2011. Error Statistics. In D. Gabbay, P. Thagard, and J. Woods (eds), *Philosophy of Statistics, Handbook of Philosophy of Science*. New York: Elsevier, 151-196.
- Neyman, Jerzy. 1952. *Lectures and Conferences on Mathematical Statistics and Probability*, 2nd ed. Washington: U.S. Department of Agriculture.
- Royall, Richard. 1997. *Statistical Evidence: A Likelihood Paradigm*. London: Chapman & Hall.
- Salmon, Wesley C. 1967. *The Foundations of Scientific Inference*. Pittsburgh: University of Pittsburgh Press.
- Spanos, Aris. 2013. A Frequentist Interpretation of Probability for Model-Based Inductive Inference. *Synthese*, 190(9): 1555-1585.
- Spanos, Aris. 2017. Why the Decision-Theoretic Perspective Misrepresents Frequentist Inference. In *Advances in Statistical Methodologies and Their Applications to Real Problems*. http://dx.doi.org/10.5772/65720, 3-28.
- Spanos, Aris. 2018. Mis-Specification Testing in Retrospect. *Journal of Economic Surveys*, 32(2): 541-577.
- Spanos, Aris. 2019. *Probability Theory and Statistical Inference: Empirical Modeling with Observational Data*, 2nd ed. Cambridge: Cambridge University Press.
- Von Mises, Richard. 1928. *Probability, Statistics and Truth*, 2nd ed. New York: Dover.
- Williams, David. 2001. *Weighing the Odds: A Course in Probability and Statistics*. Cambridge: Cambridge University Press.

Remember when I wrote to the National Academy of Sciences (NAS) in September pointing out mistaken definitions of P-values in their document on Reproducibility and Replicability in Science? (see my 9/30/19 post). I’d given up on their taking any action, but yesterday I received a letter from the NAS Senior Program officer:

Dear Dr. Mayo,

I am writing to let you know that the Reproducibility and Replicability in Science report has been updated in response to the issues that you have raised.

Two footnotes, on pages ~~31~~ 35 and 221, highlight the changes. The updated report is available from the following link: NEW 2020 NAS DOC

Thank you for taking the time to reach out to me and to Dr. Fineberg and letting us know about your concerns.

With kind regards and wishes of a happy 2020,

Jenny Heimberg

Jennifer Heimberg, Ph.D.

Senior Program Officer

The National Academies of Sciences, Engineering, and Medicine

I’m really glad to see the effort! The footnote on p. 35 reads:

The original document read:

And the revised paragraph is:

Although my letter had also made the point about the difference between ordinary English, and technical, uses of “likelihood”, I did not expect them to tinker with those because the document is filled with jumbled uses of the two. Notice, just for one example, how the replacement on p. 221, along with the footnote

is immediately followed by:

Do you see any mixture of “likelihood” and “probability”?[1]

Still, I *greatly appreciate* their making the correction which will alert readers to be careful in combing through the document. As encouragement to others to write-in corrections, they might have acknowledged the error corrector, but I’m not complaining. It underscores my position that it’s really not so onerous or impossible to fix mistakes in committee-generated “guides for best practices”. See, for instance, my friendly amendments to the March 2019 editorial in *The American Statistician*.[2]

[1]At a time when people are cavalierly combining Type I error probabilities and power in a quasi-Bayesian computation to yield a “posterior predictive value” (the *diagnostic screening model* of tests)–which is also found in the NAS document– it’s especially important to be consistent in the use of “likelihood”. For a criticism of the diagnostic screening model see pp 361-370 of my *Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars* (SIST) (2018, CUP), or search this blog.

[2] Before you quit a committee on scientific methodology because you think they’re not upholding standards, please alert me at error@vt.edu.


You know how in that Woody Allen movie, “Midnight in Paris,” the main character (I forget who plays it, I saw it on a plane) is a writer finishing a novel, and he steps into a cab that mysteriously picks him up at midnight and transports him back in time where he gets to run his work by such famous authors as Hemingway and Virginia Woolf? (It was a new movie when I began the blog in 2011.) He is wowed when his work earns their approval and he comes back each night in the same mysterious cab… Well, imagine an error statistical philosopher is picked up in a mysterious taxi at midnight (New Year’s Eve ~~2011~~ ~~2012~~, ~~2013~~, ~~2014~~, ~~2015~~, ~~2016~~, ~~2017~~, ~~2018~~, 2019) and is taken back sixty years and, lo and behold, finds herself in the company of Allan Birnbaum.[i] There are a few 2019 updates–one is of great significance.

ERROR STATISTICIAN: It’s wonderful to meet you Professor Birnbaum; I’ve always been extremely impressed with the important impact your work has had on philosophical foundations of statistics. I happen to be writing on your famous argument about the likelihood principle (LP). (whispers: I can’t believe this!)

BIRNBAUM: Ultimately you know I rejected the LP as failing to control the error probabilities needed for my Confidence concept. But you know all this, I’ve read it in your new book: *Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars* (SIST, 2018, CUP).

ERROR STATISTICIAN: You’ve read my new book? Wow! Then you know I don’t think your argument shows that the LP follows from such frequentist concepts as sufficiency S and the weak conditionality principle WCP. I don’t rehearse my argument there, but I first found it in 2006.[ii] Sorry,…I know it’s famous…

BIRNBAUM: Well, I shall happily invite you to take any case that violates the LP and allow me to demonstrate that the frequentist is led to inconsistency, provided she also wishes to adhere to the WCP and sufficiency (although less than S is needed).

ERROR STATISTICIAN: Well I show that no contradiction follows from holding WCP and S, while denying the LP.

BIRNBAUM: Well, well, well: I’ll bet you a bottle of Elba Grease champagne that I can demonstrate it!

ERROR STATISTICAL PHILOSOPHER: It is a great drink, I must admit that: I love lemons.

BIRNBAUM: OK. (A waiter brings a bottle, they each pour a glass and resume talking). Whoever wins this little argument pays for this whole bottle of vintage Ebar or Elbow or whatever it is Grease.

ERROR STATISTICAL PHILOSOPHER: I really don’t mind paying for the bottle.

BIRNBAUM: Good, you will have to. Take any LP violation. Let x’ be a 2-standard deviation difference from the null (asserting μ = 0) in testing a normal mean from the fixed sample size experiment E’, say n = 100; and let x” be a 2-standard deviation difference from an optional stopping experiment E”, which happens to stop at 100. Do you agree that:

(0) For a frequentist, outcome x’ from E’ (fixed sample size) is NOT evidentially equivalent to x” from E” (optional stopping that stops at n)

ERROR STATISTICAL PHILOSOPHER: Yes, that’s a clear case where we reject the strong LP, and it makes perfect sense to distinguish their corresponding p-values (which we can write as p’ and p”, respectively). The searching in the optional stopping experiment makes the p-value quite a bit higher than with the fixed sample size. For n = 100, data x’ yields p’= ~.05; while p” is ~.3. Clearly, p’ is not equal to p”, I don’t see how you can make them equal.

BIRNBAUM: Suppose you’ve observed x”, a 2-standard deviation difference from an optional stopping experiment E”, that finally stops at n=100. You admit, do you not, that this outcome could have occurred as a result of a different experiment? It could have been that a fair coin was flipped where it is agreed that heads instructs you to perform E’ (fixed sample size experiment, with n = 100) and tails instructs you to perform the optional stopping experiment E”, stopping as soon as you obtain a 2-standard deviation difference, and you happened to get tails, and performed the experiment E”, which happened to stop with n =100.

ERROR STATISTICAL PHILOSOPHER: Well, that is not how x” was obtained, but ok, it could have occurred that way.

BIRNBAUM: Good. Then you must grant further that your result could have come from a special experiment I have dreamt up, call it a BB-experiment. In a BB- experiment, if the outcome from the experiment you actually performed has an outcome with a proportional likelihood to one in some other experiment not performed, E’, then we say that your result has an “LP pair”. For any violation of the strong LP, the outcome observed, let it be x”, has an “LP pair”, call it x’, in some other experiment E’. In that case, a BB-experiment stipulates that you are to report x” as if you had determined whether to run E’ or E” by flipping a fair coin.

(They fill their glasses again)

ERROR STATISTICAL PHILOSOPHER: You’re saying that if my outcome from trying and trying again, that is, optional stopping experiment E”, has an “LP pair” in the fixed sample size experiment I did not perform, then I am to report x” as if the determination to run E” was by flipping a fair coin (which decides between E’ and E”)?

BIRNBAUM: Yes, and one more thing. If your outcome had actually come from the fixed sample size experiment E’, it too would have an “LP pair” in the experiment you did not perform, E”. Whether you actually observed x” from E”, or x’ from E’, you are to report it as x” from E”.

ERROR STATISTICAL PHILOSOPHER: So let’s see if I understand a Birnbaum BB-experiment: whether my observed 2-standard deviation difference came from E’ or E” (with sample size n) the result is reported as x’, as if it came from E’ (fixed sample size), and as a result of this strange type of a mixture experiment.

BIRNBAUM: Yes, or equivalently you could just report x*: my result is a 2-standard deviation difference and it could have come from either E’ (fixed sampling, n= 100) or E” (optional stopping, which happens to stop at the 100^{th} trial). That’s how I sometimes formulate a BB-experiment.

ERROR STATISTICAL PHILOSOPHER: You’re saying in effect that if my result has an LP pair in the experiment not performed, I should act as if I accept the strong LP and just report its likelihood; so if the likelihoods are proportional in the two experiments (both testing the same mean), the outcomes are evidentially equivalent.

BIRNBAUM: Well, but since the BB-experiment is an imagined “mixture” it is a *single* experiment, so really you only need to apply the *weak LP* which frequentists accept. Yes? (The *weak LP* is the same as the *sufficiency principle*).

ERROR STATISTICAL PHILOSOPHER: But what is the sampling distribution in this imaginary BB- experiment? Suppose I have Birnbaumized my experimental result, just as you describe, and observed a 2-standard deviation difference from optional stopping experiment E”. How do I calculate the p-value within a Birnbaumized experiment?

BIRNBAUM: I don’t think anyone has ever called it that.

ERROR STATISTICAL PHILOSOPHER: I just wanted to have a shorthand for the operation you are describing, there’s no need to use it, if you’d rather I not. So how do I calculate the p-value within a BB-experiment?

BIRNBAUM: You would report the overall p-value, which would be the average over the sampling distributions: (p’ + p”)/2

Say p’ is ~.05, and p” is ~.3; whatever they are, we know they are different, that’s what makes this a violation of the strong LP (given in premise (0)).
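To make the averaging concrete, here is a minimal sketch in Python, assuming the illustrative values p’ = .05 and p” = .3 quoted above. The function name `bb_p_value` is hypothetical, introduced only for this illustration; it is not notation from Birnbaum.

```python
def bb_p_value(p_fixed, p_optional):
    """Unconditional p-value in the imagined fair-coin mixture: the plain average."""
    return 0.5 * p_fixed + 0.5 * p_optional


# Illustrative values taken from the dialogue (assumed, not computed here).
p_fixed, p_optional = 0.05, 0.30
p_bb = bb_p_value(p_fixed, p_optional)

# Because the two p-values differ, their average lies strictly between them
# and so coincides with neither.
print(f"p' = {p_fixed:.3f}, p'' = {p_optional:.3f}, BB average = {p_bb:.3f}")
```

Whenever p’ and p” differ, the averaged report misstates both: it is larger than the fixed sample size p-value and smaller than the optional stopping one.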

ERROR STATISTICAL PHILOSOPHER: So you’re saying that if I observe a 2-standard deviation difference from E’, I do not report the associated p-value p’, but instead I am to report the average p-value, averaging over some other experiment E” that could have given rise to an outcome with a proportional likelihood to the one I observed, even though I didn’t obtain it this way?

BIRNBAUM: I’m saying that you have to grant that x’ from a fixed sample size experiment E’ could have been generated through a BB- experiment.

*My, this drink is sour!*

ERROR STATISTICAL PHILOSOPHER: Yes, I love pure lemon.

BIRNBAUM: Perhaps you’re in want of a gene; never mind.

I’m saying you have to grant that x’ from a fixed sample size experiment E’ could have been generated through a BB-experiment. If you are to interpret your experiment as if you are within the rules of a BB experiment, then x’ is evidentially equivalent to x” (is equivalent to x*). This is premise (1).

ERROR STATISTICAL PHILOSOPHER: But the result would be that the p-value associated with x’ (fixed sample size) is reported to be larger than it actually is (.05), because I’d be averaging over fixed and optional stopping experiments; while observing x” (optional stopping) is reported to be smaller than it is–in both cases because of an experiment I did not perform.

BIRNBAUM: Yes, the BB-experiment computes the P-value in an *unconditional* manner: it takes the convex combination over the 2 ways the result could have come about.

ERROR STATISTICAL PHILOSOPHER: This is just a matter of your definitions, it is an analytical or mathematical result, so long as we grant being within your BB experiment.

BIRNBAUM: True, (1) plays the role of the sufficiency assumption, but one need not even appeal to this, it is just a matter of mathematical equivalence.

By the way, I am focusing just on LP violations, therefore, the outcome, by definition, has an LP pair. In other cases, where there is no LP pair, you just report things as usual.

ERROR STATISTICAL PHILOSOPHER: OK, but p’ still differs from p”; so I still don’t see how I’m forced to infer the strong LP which identifies the two. In short, I don’t see the contradiction with my rejecting the strong LP in premise (0). (Also we should come back to the “other cases” at some point….)

BIRNBAUM: Wait! Don’t be so impatient; I’m about to get to step (2). Here, let’s toast to the new year: “To Elbar Grease!”

ERROR STATISTICAL PHILOSOPHER: To Elbar Grease!

BIRNBAUM: So far all of this was step (1).

ERROR STATISTICAL PHILOSOPHER: Oy, what is step 2?

BIRNBAUM: STEP 2 is this: Surely, you agree, that once you know from which experiment the observed 2-standard deviation difference actually came, you ought to report the p-value corresponding to that experiment. You ought NOT to report the average (p’ + p”)/2 as you were instructed to do in the BB experiment.

This gives us premise (2a):

(2a) outcome x”, once it is known that it came from E”, should NOT be analyzed as in a BB-experiment where p-values are averaged. The report should instead use the sampling distribution of the optional stopping test E”, yielding the p-value, p” (~.37). In fact, .37 is the value you give in SIST p. 44 (imagining the experimenter keeps taking 10 more).

ERROR STATISTICAL PHILOSOPHER: So, having first insisted I imagine myself in a Birnbaumized, I mean a BB-experiment, and report an average p-value, I’m now to return to my senses and “condition” in order to get back to the only place I ever wanted to be, i.e., back to where I was to begin with?

BIRNBAUM: Yes, at least if you hold to the weak conditionality principle WCP (of D. R. Cox)—surely you agree to this.

(2b) Likewise, if you knew the 2-standard deviation difference came from E’, then

x’ should NOT be deemed evidentially equivalent to x” (as in the BB experiment), the report should instead use the sampling distribution of fixed test E’, (.05).

ERROR STATISTICAL PHILOSOPHER: So, having first insisted I consider myself in a BB-experiment, in which I report the average p-value, I’m now to return to my senses and allow that if I know the result came from optional stopping, E”, I should “condition” on and report p”.

BIRNBAUM: Yes. There was no need to repeat the whole spiel.

ERROR STATISTICAL PHILOSOPHER: I just wanted to be clear I understood you. Of course, all of this assumes the model is correct or adequate to begin with.

BIRNBAUM: Yes, the SLP is a principle for parametric inference within a given model. So you arrive at (2a) and (2b), yes?

ERROR STATISTICAL PHILOSOPHER: OK, but it might be noted that unlike premise (1), premises (2a) and (2b) are not given by definition, they concern an evidential standpoint about how one ought to interpret a result once you know which experiment it came from. In particular, premises (2a) and (2b) say I should condition and use the sampling distribution of the experiment known to have been actually performed, when interpreting the result.

BIRNBAUM: Yes, and isn’t this weak conditionality principle WCP one that you happily accept?

ERROR STATISTICAL PHILOSOPHER: Well the WCP is defined for actual mixtures, where one flipped a coin to determine if E’ or E” is performed, whereas, you’re requiring I consider an imaginary Birnbaum mixture experiment, where the choice of the experiment not performed will vary depending on the outcome that needs an LP pair; and I cannot even determine what this might be until after I’ve observed the result that would violate the LP? I don’t know what the sample size will be ahead of time.

BIRNBAUM: Sure, but you admit that your observed x” could have come about through a BB-experiment, and that’s all I need. Notice

(1), (2a) and (2b) yield the strong LP!

Outcome x” from E”(optional stopping that stops at n) is evidentially equivalent to x’ from E’ (fixed sample size n).

ERROR STATISTICAL PHILOSOPHER: Clever, but your “proof” is obviously unsound; and before I demonstrate this, notice that the conclusion, were it to follow, asserts p’ = p”, (e.g., .05 = .3!), even though it is unquestioned that p’ is not equal to p”, that is because we must start with an LP violation (premise (0)).

BIRNBAUM: Yes, it is puzzling, but where have I gone wrong?

(The waiter comes by and fills their glasses; they are so deeply engrossed in thought they do not even notice him.)

ERROR STATISTICAL PHILOSOPHER: There are many routes to explaining a fallacious argument. Here’s one. What is required for STEP 1 to hold, is the denial of what’s needed for STEP 2 to hold:

Step 1 requires us to analyze results in accordance with a BB- experiment. If we do so, true enough we get:

premise (1): outcome x” (in a BB experiment) is evidentially equivalent to outcome x’ (in a BB experiment):

That is because in either case, the p-value would be (p’ + p”)/2

Step 2 now insists that we should NOT calculate evidential import as if we were in a BB- experiment. Instead we should consider the experiment from which the data actually came, E’ or E”:

premise (2a): outcome x” (in a BB experiment) is/should be evidentially equivalent to x” from E” (optional stopping that stops at n): its p-value should be p”.

premise (2b): outcome x’ (within a BB experiment) is/should be evidentially equivalent to x’ from E’ (fixed sample size): its p-value should be p’.

If (1) is true, then (2a) and (2b) must be false!

If (1) is true and we keep fixed the stipulation of a BB experiment (which we must to apply step 2), then (2a) is asserting:

The average p-value (p’ + p”)/2 = p”, which is false.

Likewise if (1) is true, then (2b) is asserting:

The average p-value (p’ + p”)/2 = p’, which is false.

Alternatively, we can see what goes wrong by realizing:

If (2a) and (2b) are true, then premise (1) must be false.

In short your famous argument requires us to assess evidence in a given experiment in two contradictory ways: as if we are within a BB- experiment (and report the average p-value) and also that we are not, but rather should report the actual p-value.

I can render it as formally valid, but then its premises can never all be true; alternatively, I can get the premises to come out true, but then the conclusion is false—so it is invalid. In no way does it show the frequentist is open to contradiction (by dint of accepting S, WCP, and denying the LP).

BIRNBAUM: Yet some people still think it is a breakthrough (in favor of Bayesianism).

ERROR STATISTICAL PHILOSOPHER: I have a much clearer exposition of what goes wrong in your argument than I did in the discussion from 2010. There were still several gaps, and lack of a clear articulation of the WCP. In fact, I’ve come to see that clarifying the entire argument turns on defining the WCP. Have you seen my 2014 paper in *Statistical Science?* The key difference is that in (2014), the WCP is stated as an equivalence, as you intended. Cox’s WCP, many claim, was not an equivalence, going in 2 directions. Slides from a presentation may be found on this blogpost.

BIRNBAUM: Yes I have seen your 2014 paper, very clever! Your Rejoinder to some of the critics is gutsy, to say the least. Congratulations! I’ve also seen the slides on your blog.

ERROR STATISTICAL PHILOSOPHER: Thank you, I’m amazed you follow my blog! But look I *must* get your answer to a question before you leave this year.

*Sudden interruption by the waiter*

WAITER: Who gets the tab?

BIRNBAUM: I do. To Elbar Grease! And to your new book SIST! I’ve read it 3 times. I have a list of comments and questions right here.

ERROR STATISTICAL PHILOSOPHER: Let me see, I’d love to read your questions and comments. (She takes a long legal-sized yellow sheet from Birnbaum, notices it is filled with tiny hand-written comments, covering both sides.)

BIRNBAUM: **To Elbar Grease! To Severe Testing! Happy New Year!**

ERROR STATISTICAL PHILOSOPHER: I have one quick question, Professor Birnbaum, and I swear that whatever you say will be just between us, I won’t tell a soul. In your last couple of papers, you suggest you’d discovered the flaw in your argument for the LP. Am I right? Even in the discussion of your (1962) paper, you seemed to agree with Pratt that WCP can’t do the job you intend.

BIRNBAUM: Savage, you know, never got off my case about remaining at “the half-way house” of likelihood, and not going full Bayesian. Then I wrote the review about the Confidence Concept as the one rock on a shifting scene… Pratt thought the argument should instead appeal to a Censoring Principle (basically, it doesn’t matter if your instrument cannot measure beyond k units if the measurement you’re making is under k units.)

ERROR STATISTICAL PHILOSOPHER: Yes, but who says frequentist error statisticians deny the Censoring Principle? So back to my question, you disappeared before answering last year…I just want to know…you did see the flaw, yes?

WAITER: We’re closing now; shall I call you a taxicab?

BIRNBAUM: Yes.

ERROR STATISTICAL PHILOSOPHER: ‘Yes’, you discovered the flaw in the argument, or ‘yes’ to the taxi?

MANAGER: We’re closing now; I’m sorry you must leave.

ERROR STATISTICAL PHILOSOPHER: We’re leaving I just need him to clarify his answer….

*Large group of people bustle past.*

Prof. Birnbaum…? Allan? **Where did he go?** (oy, not again!)

**But wait! I’ve got his list of comments and questions in my hand! It’s real!!!**

**Link to complete discussion:**

Mayo, Deborah G. On the Birnbaum Argument for the Strong Likelihood Principle (with discussion & rejoinder). *Statistical Science* 29 (2014), no. 2, 227-266.

[i] Many links on the strong likelihood principle (LP or SLP) and Birnbaum may be found by searching this blog. Good sources for where to start as well as historical background papers may be found in my last blogpost. Please see that post for how you can very easily win a free signed copy of SIST.

[ii] By the way, Ronald Giere gave me numerous original papers of yours. They’re in files in my attic library. Some are in mimeo, others typed…I mean, obviously for that time that’s what they’d be…now of course, oh never mind, sorry.

An essential component of inference based on familiar frequentist notions: p-values, significance and confidence levels, is the relevant sampling distribution (hence the term *sampling theory*, or my preferred *error statistics*, as we get error probabilities from the sampling distribution). This feature results in violations of a principle known as the *strong likelihood principle* (SLP). To state the SLP roughly, it asserts that all the evidential import in the data (for parametric inference within a model) resides in the likelihoods. If accepted, it would render error probabilities irrelevant post data.

**SLP** (We often drop the “strong” and just call it the LP. The “weak” LP just boils down to sufficiency)

For any two experiments E_{1} and E_{2} with different probability models f_{1}, f_{2}, but with the same unknown parameter θ, if outcomes x* and y* (from E_{1} and E_{2} respectively) determine the same (i.e., proportional) likelihood function (f_{1}(x*; θ) = cf_{2}(y*; θ) for all θ), then x* and y* are inferentially equivalent (for an inference about θ).

(What differentiates the weak and the strong LP is that the weak refers to a single experiment.)

**Violation of SLP:**

We have a violation of the SLP whenever outcomes x* and y* from experiments E_{1} and E_{2} with different probability models f_{1}, f_{2}, but with the same unknown parameter θ, satisfy f_{1}(x*; θ) = cf_{2}(y*; θ) for all θ, and yet x* and y* have different implications for an inference about θ.

For an example of a SLP violation, E_{1} might be sampling from a Normal distribution with a fixed sample size n, and E_{2} the corresponding experiment that uses an optional stopping rule: keep sampling until you obtain a result 2 standard deviations away from a null hypothesis that θ = 0 (and for simplicity, a known standard deviation). When you do, stop and reject the point null (in 2-sided testing).
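As a sketch of why this pair qualifies, assume independent Normal observations with known standard deviation σ, and suppose both experiments end with the same n and the same sample mean. The symbols φ (the standard Normal density) and the indicator are notation added here for illustration:

$$
f_{1}(\mathbf{x}^{*};\theta) \;=\; \prod_{i=1}^{n} \frac{1}{\sigma}\,\phi\!\left(\frac{x_i-\theta}{\sigma}\right) \;\propto\; \exp\!\left\{-\frac{n(\bar{x}-\theta)^{2}}{2\sigma^{2}}\right\},
\qquad
f_{2}(\mathbf{y}^{*};\theta) \;=\; \mathbf{1}\{\text{rule first met at } n\}\,\prod_{i=1}^{n} \frac{1}{\sigma}\,\phi\!\left(\frac{y_i-\theta}{\sigma}\right).
$$

The indicator depends on the data but not on θ, so, viewed as functions of θ, the two likelihoods are proportional even though the sampling distributions of the two experiments differ.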

The SLP tells us (in relation to the optional stopping rule) that once you have observed a 2-standard deviation result, there should be no evidential difference between its having arisen from experiment E_{1} , where n was fixed, say, at 100, and experiment E_{2} where the stopping rule happens to stop at n = 100. For the error statistician, by contrast, there is a difference, and this constitutes a violation of the SLP.
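The size of that difference can be gauged with a rough Monte Carlo sketch. It assumes a unit standard deviation, a true null (θ = 0), and a check after every single observation up to n = 100; the function names and the checking schedule are illustrative assumptions, and other schedules give different numbers.

```python
import math
import random


def fixed_n_p_value(z=2.0):
    """Two-sided p-value for a z-standard-error difference when n is fixed in advance."""
    return math.erfc(z / math.sqrt(2.0))


def optional_stopping_rejection_rate(max_n=100, cutoff=2.0, trials=20000, seed=0):
    """Estimate, under the null, the chance that the try-and-try-again rule
    reaches a `cutoff`-standard-error difference at some n <= max_n."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(trials):
        total = 0.0
        for n in range(1, max_n + 1):
            total += rng.gauss(0.0, 1.0)  # one more observation, theta = 0, sigma = 1
            # |mean| >= cutoff * sigma / sqrt(n) is the same as |sum| >= cutoff * sqrt(n)
            if abs(total) >= cutoff * math.sqrt(n):
                rejections += 1
                break
    return rejections / trials


if __name__ == "__main__":
    print("fixed n:          ", round(fixed_n_p_value(), 3))
    print("optional stopping:", round(optional_stopping_rejection_rate(), 3))
```

Under these assumptions the fixed sample size figure is just under .05, while the optional stopping figure comes out several times larger, which is the contrast the error statistician refuses to ignore.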

———————-

*Now for the surprising part:* Remember the 61-year old chestnut from my last post where a coin is flipped to decide which of two experiments to perform? David Cox (1958) proposes something called the Weak Conditionality Principle (WCP) to restrict the space of relevant repetitions for frequentist inference. The WCP says that once it is known which E_{i} produced the measurement, the assessment should be in terms of the properties of the particular E_{i}. Nothing could be more obvious.

The surprising upshot of Allan Birnbaum’s (1962) argument is that the SLP appears to follow from applying the WCP in the case of mixtures, and so uncontroversial a principle as sufficiency (SP)–although even that has been shown to be optional, strictly speaking. But this would preclude the use of sampling distributions. J. Savage calls Birnbaum’s argument “a landmark in statistics” (see [i]).

Although his argument purports that [(WCP and SP) entails SLP], I show how data may violate the SLP while holding both the WCP and SP. Such cases directly refute [WCP entails SLP].

In Birnbaum’s argument, he introduces an informal, and rather vague, notion of the “evidence (or evidential meaning) of an outcome **z** from experiment E”. He writes it: Ev(E,

In my formulation of the argument, I introduce a new symbol to represent a function from a given experiment-outcome pair, (E,**z**) to a generic inference implication. It (hopefully) lets us be clearer than does Ev.

(E,**z**) ⇒ Infr_{E}(**z**) is to be read “the inference implication from outcome **z** in experiment E” (according to whatever inference type/school is being discussed).

*An outline of my argument is in the slides for a talk below: *

**Binge reading the Likelihood Principle.**

If you’re keen to binge read the SLP–a way to break holiday/winter break doldrums– I’ve pasted most of the early historical sources before the slides. The argument is simple; showing what’s wrong with it took a long time. My earliest treatment, via counterexample, is in Mayo (2010). A deeper argument is in Mayo (2014) in *Statistical Science*.[ii] An intermediate paper Mayo (2013) corresponds to the slides below–they were presented at the JSM. Interested readers may search this blog for quite a lot of discussion of the SLP including “U-Phils” (discussions by readers) (e.g., here, and here), and amusing notes (e.g., Don’t Birnbaumize that experiment my friend, and Midnight with Birnbaum).

**Why this issue is bound to resurface in 2020.**

I had blogged this “binge read” post a year ago, but the issue has scarcely been put to rest. I expect it to resurface in 2020 for a few reasons. Firstly, I’d promised myself that once my book (SIST) was out, I’d try to collect central textbooks that are still calling it a theorem, and write to the authors. Hence, my offer in my last post to send you a free signed copy of SIST in exchange for texts you find (1 book per textbook, and the page/pages themselves need to be sent or attached). The argument barely takes a page.

Secondly, I’ve already been asked to review some new attempts to declare an improvement on the original attempt. There’s no mention of my disproof, nor that of Mike Evans’.

Thirdly, it ought to come up as a crucial battle about the very notion of “evidence”, blithely taken for granted in such “best practice guides” as the 2016 ASA statement on P-values and significance (ASA I). It is the interpretation of evidence (left intuitive by Birnbaum) underlying the SLP that is being presupposed.

You may not wish to engage in what looks to be (and is) a rather convoluted logical argument. That’s fine, but just remember that when someone says “it’s been proved mathematically” that error probabilities are irrelevant to evidence post data, you can say, “I read somewhere that this has been disproved”.

—–

[i] Savage on Birnbaum: “This paper is a landmark in statistics. . . . I, myself, like other Bayesian statisticians, have been convinced of the truth of the likelihood principle for a long time. Its consequences for statistics are very great. . . . [T]his paper is really momentous in the history of statistics. It would be hard to point to even a handful of comparable events. …once the likelihood principle is widely recognized, people will not long stop at that halfway house but will go forward and accept the implications of personalistic probability for statistics” (Savage 1962, 307-308).

The argument purports to follow from principles frequentist error statisticians accept.

[ii] The link includes comments on my paper by Bjornstad, Dawid, Evans, Fraser, Hannig, and Martin and Liu, and my rejoinder.

**Classic Birnbaum Papers:**

- Birnbaum, A. (1962), “On the Foundations of Statistical Inference”, *Journal of the American Statistical Association* 57(298), 269-306.
- Savage, L. J., Barnard, G., Cornfield, J., Bross, I., Box, G., Good, I., Lindley, D., Clunies-Ross, C., Pratt, J., Levene, H., Goldman, T., Dempster, A., Kempthorne, O., and Birnbaum, A. (1962), “Discussion on Birnbaum’s On the Foundations of Statistical Inference”, *Journal of the American Statistical Association* 57(298), 307-326.
- Birnbaum, A. (1970), “Statistical Methods in Scientific Inference” (letter to the editor), *Nature* 225, 1033.
- Birnbaum, A. (1972), “More on Concepts of Statistical Evidence”, *Journal of the American Statistical Association* 67(340), 858-861.

**Note to Reader:** If you look at the “discussion”, you can already see Birnbaum backtracking a bit, in response to Pratt’s comments.

Some additional early discussion papers:

**Durbin:**

- Durbin, J. (1970), “On Birnbaum’s Theorem on the Relation Between Sufficiency, Conditionality and Likelihood”, *Journal of the American Statistical Association*, Vol. 65, No. 329 (Mar., 1970), pp. 395-398.
- Savage, L. J. (1970), “Comments on a Weakened Principle of Conditionality”, *Journal of the American Statistical Association*, Vol. 65, No. 329 (Mar., 1970), pp. 399-401.
- Birnbaum, A. (1970), “On Durbin’s Modified Principle of Conditionality”, *Journal of the American Statistical Association*, Vol. 65, No. 329 (Mar., 1970), pp. 402-403.

There’s also a good discussion in Cox and Hinkley 1974.

**Evans, Fraser, and Monette:**

- Evans, M., Fraser, D. A., and Monette, G. (1986), “On Principles and Arguments to Likelihood”, *The Canadian Journal of Statistics* 14: 181-199.

**Kalbfleisch:**

- Kalbfleisch, J. D. (1975), “Sufficiency and Conditionality”, *Biometrika*, Vol. 62, No. 2 (Aug., 1975), pp. 251-259.
- Barnard, G. A. (1975), “Comments on Paper by J. D. Kalbfleisch”, *Biometrika*, Vol. 62, No. 2 (Aug., 1975), pp. 260-261.
- Barndorff-Nielsen, O. (1975), “Comments on Paper by J. D. Kalbfleisch”, *Biometrika*, Vol. 62, No. 2 (Aug., 1975), pp. 261-262.
- Birnbaum, A. (1975), “Comments on Paper by J. D. Kalbfleisch”, *Biometrika*, Vol. 62, No. 2 (Aug., 1975), pp. 262-264.
- Kalbfleisch, J. D. (1975), “Reply to Comments”, *Biometrika*, Vol. 62, No. 2 (Aug., 1975), p. 268.

**My discussions:**

- Mayo, D. G. (2010). “An Error in the Argument from Conditionality and Sufficiency to the Likelihood Principle” in *Error and Inference: Recent Exchanges on Experimental Reasoning, Reliability and the Objectivity and Rationality of Science* (D. Mayo and A. Spanos eds.), Cambridge: Cambridge University Press: 305-14.
- Mayo, D. G. (2013). “Presented Version: On the Birnbaum Argument for the Strong Likelihood Principle”, in *JSM Proceedings*, Section on Bayesian Statistical Science. Alexandria, VA: American Statistical Association: 440-453.
- Mayo, D. G. (2014). “On the Birnbaum Argument for the Strong Likelihood Principle”, paper with discussion and Mayo rejoinder: *Statistical Science* 29(2), pp. 227-239, 261-266.


2018 marked 60 years since the famous weighing machine example from Sir David Cox (1958)[1]. It is now 61. It’s one of the “chestnuts” in the exhibits of “chestnuts and howlers” in Excursion 3 (Tour II) of my (still) new book *Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars* (SIST, 2018). It’s especially relevant to take this up now, just before we leave 2019, for reasons that will be revealed over the next day or two. For a sneak preview of those reasons, see the “note to the reader” at the end of this post. So, let’s go back to it, with an excerpt from SIST (pp. 170-173).

**Exhibit (vi): Two Measuring Instruments of Different Precisions.** *Did you hear about the frequentist who, knowing she used a scale that’s right only half the time, claimed her method of weighing is right 75% of the time?*

She says, “I flipped a coin to decide whether to use a scale that’s right 100% of the time, or one that’s right only half the time, so, overall, I’m right 75% of the time.” (She wants credit because she could have used a better scale, even knowing she used a lousy one.)

*Basis for the joke:* An N-P test bases error probability on all possible outcomes or measurements that could have occurred in repetitions, but did not. As with many infamous pathological examples, often presented as knockdown criticisms of all of frequentist statistics, this was invented by a frequentist, Cox (1958). It was a way to highlight what could go wrong in the case at hand, if one embraced an unthinking behavioral-performance view. Yes, error probabilities are taken over hypothetical repetitions of a process, but not just any repetitions will do. Here’s the statistical formulation.

We flip a fair coin to decide which of two instruments, E_{1} or E_{2}, to use in observing a Normally distributed random sample **Z** to make inferences about mean *θ*.

In testing a null hypothesis such as *θ* = 0, the same * z *measurement would correspond to a much smaller

Suppose that we know we have observed a measurement from E_{2 }with its much larger variance:

The unconditional test says that we can assign this a higher level of significance than we ordinarily do, because if we were to repeat the experiment, we might sample some quite different distribution. But this fact seems irrelevant to the interpretation of an observation which we know came from a distribution [with the larger variance]. (Cox 1958, p. 361)

Once it is known which E_{i } has produced * z*, the

The point essentially is that the marginal distribution of a P-value averaged over the two possible configurations is misleading for a particular set of data. It would mean that an individual fortunate in obtaining the use of a precise instrument in effect sacrifices some of that information in order to rescue an investigator who has been unfortunate enough to have the randomizer choose a far less precise tool. From the perspective of interpreting the specific data that are actually available, this makes no sense. (p. 296)
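The contrast between the conditional and the averaged assessment can be made concrete with a small numerical sketch. It assumes a single observation rather than a sample, a two-sided test of *θ* = 0, and illustrative standard deviations and an observed value that are not taken from the excerpt; all names are hypothetical.

```python
import math

def two_sided_p_value(z, sigma):
    """Two-sided P-value for H0: theta = 0, given one observation z ~ N(theta, sigma^2)."""
    # 2 * (1 - Phi(|z| / sigma)), written with the complementary error function
    return math.erfc(abs(z) / (sigma * math.sqrt(2)))

SIGMA_PRECISE = 1.0      # assumed standard deviation of the precise instrument (E1)
SIGMA_IMPRECISE = 100.0  # assumed standard deviation of the imprecise instrument (E2)
z_observed = 3.0         # assumed measurement, for illustration only

# Conditional assessments: one for each instrument that might have produced z
p_precise = two_sided_p_value(z_observed, SIGMA_PRECISE)
p_imprecise = two_sided_p_value(z_observed, SIGMA_IMPRECISE)

# Averaged assessment: weights both instruments by the fair coin flip
p_averaged = 0.5 * p_precise + 0.5 * p_imprecise

print(f"conditional on E1: {p_precise:.4f}")
print(f"conditional on E2: {p_imprecise:.4f}")
print(f"averaged over the coin flip: {p_averaged:.4f}")
```

Under these assumptions the averaged value sits between the two conditional ones, so it describes neither instrument: it understates the evidence if the precise instrument was used and overstates it if the imprecise one was.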

To scotch his famous example, Cox (1958) introduces a principle: weak conditionality.

Weak Conditionality Principle (WCP): If a mixture experiment (of the aforementioned type) is performed, then, if it is known which experiment produced the data, inferences about *θ* are appropriately drawn in terms of the sampling behavior in the experiment known to have been performed (Cox and Mayo 2010, p. 296).

It is called weak conditionality because there are more general principles of conditioning that go beyond the special case of mixtures of measuring instruments.

While conditioning on the instrument actually used seems obviously correct, nothing precludes the N-P theory from choosing the procedure “which is best on the average over both experiments” (Lehmann and Romano 2005, p. 394), and it’s even possible that the average or unconditional power is better than the conditional. In the case of such a conflict, Lehmann says relevant conditioning takes precedence over average power (1993b). He allows that in some cases of acceptance sampling, the average behavior may be relevant, but in scientific contexts the conditional result would be the appropriate one (see Lehmann 1993b, p. 1246). Context matters. Did Neyman and Pearson ever weigh in on this? Not to my knowledge, but I’m sure they’d concur with N-P tribe leader Lehmann. Admittedly, if your goal in life is to attain a precise α level, then when discrete distributions preclude this, a solution would be to flip a coin to decide the borderline cases! (See also Example 4.6, Cox and Hinkley 1974, pp. 95–6; Birnbaum 1962, p. 491.)
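For readers who have not met the coin-flipping device for borderline cases, here is a minimal sketch of a randomized test. It assumes a one-sided test of a Binomial success probability with rejection for large counts; the function names and the numbers (10 trials, null value 0.5, α = 0.05) are illustrative assumptions, not taken from the excerpt.

```python
import math
import random

def binom_pmf(k, n, theta):
    """P(X = k) for X ~ Binomial(n, theta)."""
    return math.comb(n, k) * theta**k * (1 - theta)**(n - k)

def randomized_test(n, theta0, alpha):
    """Return (c, gamma): reject if X > c, and reject with probability gamma if X == c."""
    upper_tail = 0.0  # P(X > c) under the null
    for c in range(n, -1, -1):
        pmf_c = binom_pmf(c, n, theta0)
        if upper_tail + pmf_c > alpha:
            # Rejecting at X == c outright would overshoot alpha, so randomize there
            gamma = (alpha - upper_tail) / pmf_c
            return c, gamma
        upper_tail += pmf_c
    raise ValueError("alpha must be below 1")

def decide(x, c, gamma, rng=random):
    """Apply the randomized rule to an observed count x; True means reject."""
    if x > c:
        return True
    if x == c:
        return rng.random() < gamma  # the 'coin flip' for the borderline case
    return False

n, theta0, alpha = 10, 0.5, 0.05
c, gamma = randomized_test(n, theta0, alpha)

# Size of the randomized test: P(X > c) + gamma * P(X = c) under the null
size = sum(binom_pmf(k, n, theta0) for k in range(c + 1, n + 1)) + gamma * binom_pmf(c, n, theta0)
print(c, round(gamma, 4), round(size, 4))
```

The test attains the target level exactly, but only because an auxiliary random draw settles the borderline outcome, which is the feature the passage treats with some irony.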

**Is There a Catch?**

The “two measuring instruments” example occupies a famous spot in the pantheon of statistical foundations, regarded by some as causing “a subtle earthquake” in statistical foundations. Analogous examples are made out in terms of confidence interval estimation methods (Tour III, Exhibit (viii)). It is a warning to the most behavioristic accounts of testing from which we have already distinguished the present approach. Yet justification for the conditioning (WCP) is fully within the frequentist error statistical philosophy, for contexts of scientific inference. There is no suggestion, for example, that only the particular data set be considered. That would entail abandoning the sampling distribution as the basis for inference, and with it the severity goal. Yet we are told that “there is a catch” and that WCP leads to the Likelihood Principle (LP)!

It is not uncommon to see statistics texts argue that in frequentist theory one is faced with the following dilemma: either to deny the appropriateness of conditioning on the precision of the tool chosen by the toss of a coin, or else to embrace the strong likelihood principle, which entails that frequentist sampling distributions are irrelevant to inference once the data are obtained. This is a false dilemma. Conditioning is warranted to achieve objective frequentist goals, and the [weak] conditionality principle coupled with sufficiency does not entail the strong likelihood principle. The ‘dilemma’ argument is therefore an illusion. (Cox and Mayo 2010, p. 298)

There is a large literature surrounding the argument for the Likelihood Principle, made famous by Birnbaum (1962). Birnbaum hankered for something in between radical behaviorism and throwing error probabilities out the window. Yet he himself had apparently proved there is no middle ground (if you accept WCP)! Even people who thought there was something fishy about Birnbaum’s “proof” were discomfited by the lack of resolution to the paradox. It is time for post-LP philosophies of inference. So long as the Birnbaum argument, which Savage and many others deemed important enough to dub a “breakthrough in statistics,” went unanswered, the frequentist was thought to be boxed into the pathological examples. She is not.

In fact, I show there is a flaw in his venerable argument (Mayo 2010b, 2013a, 2014b). That’s a relief. Now some of you will howl, “Mayo, not everyone agrees with your disproof! Some say the issue is not settled.” Fine, please explain where my refutation breaks down. It’s an ideal brainbuster to work on along the promenade after a long day’s tour. Don’t be dismayed by the fact that it has been accepted for so long. But I won’t revisit it here.

From *Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars* (Mayo 2018, CUP).

Excursion 3 Tour II, pp. 170-173.

Note to the Reader:

The LP was a main topic for the first few years of this blog. That’s because I was still refining an earlier disproof from Mayo (2010), based on giving a counterexample. I later saw the need for a deeper argument which I give in Mayo (2014) in *Statistical Science*.[3] (There, among other subtleties, the WCP is put as a logical equivalence as intended.)

“It was the adoption of an unqualified equivalence formulation of conditionality, and related concepts, which led, in my 1962 paper, to the monster of the likelihood axiom,” (Birnbaum 1975, 263).

An intermediate paper is Mayo (2013).

What I discovered in 2019 is that rather than admit that Allan Birnbaum’s (1962) alleged proof is circular, some authors are claiming to have new proofs of it. These consist of reiterating the same premises that render the argument circular, but with greater exuberance and more certainty. In this way, it is said to avoid objections to the earlier attempted proof! The only problem is: once an argument is circular, it remains so.

Textbooks should not call a claim a theorem if it’s not a theorem, i.e., if there isn’t a proof of it (within the relevant formal system). **So, in 2020, when you find a textbook that claims the LP is a theorem, provable from the (WCP) and (SP), or (WCP) alone, please send me an attachment or link of the relevant pages and reference. A free signed copy of SIST goes to the first person (1 copy for each such textbook) who does so.** Since there are many such textbooks out there, I expect to part with several copies of SIST in 2020.

If statistical inference follows Bayesian posterior probabilism, the LP follows easily. It’s shown in just a couple of pages of SIST Excursion 1 Tour II (45-6). All the excitement is whether the frequentist (error statistician) is bound to hold it. If she is, then error probabilities become irrelevant to the evidential import of data (once the data are given), at least when making parametric inferences within a statistical model.
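What is at stake can be seen in a standard textbook illustration, which is not drawn from this post or from SIST: the same sequence of 9 successes and 3 failures arises either from a fixed number of 12 trials (Binomial) or from sampling until the third failure (Negative Binomial). The sketch below assumes a one-sided test of *θ* = 0.5 against larger values; the function names are hypothetical.

```python
from math import comb

def binomial_p_value(successes, n, theta0=0.5):
    """P(at least `successes` successes in n fixed trials) under theta0."""
    return sum(comb(n, k) * theta0**k * (1 - theta0)**(n - k)
               for k in range(successes, n + 1))

def negative_binomial_p_value(successes, failures, theta0=0.5):
    """P(at least `successes` successes before the `failures`-th failure) under theta0."""
    # Equivalent event: at most failures - 1 failures among the first
    # successes + failures - 1 trials
    m = successes + failures - 1
    return sum(comb(m, j) * (1 - theta0)**j * theta0**(m - j)
               for j in range(failures))

def likelihood_ratio(theta, successes=9, failures=3):
    """Binomial likelihood divided by Negative Binomial likelihood at theta."""
    n = successes + failures
    kernel = theta**successes * (1 - theta)**failures
    return (comb(n, successes) * kernel) / (comb(n - 1, successes) * kernel)

p_fixed_n = binomial_p_value(9, 12)
p_stop_at_third_failure = negative_binomial_p_value(9, 3)

print(f"Binomial P-value:          {p_fixed_n:.4f}")
print(f"Negative Binomial P-value: {p_stop_at_third_failure:.4f}")
print("likelihood ratio at theta = 0.3 and 0.8:",
      likelihood_ratio(0.3), likelihood_ratio(0.8))
```

The two likelihoods differ only by a constant factor, so any account obeying the LP treats the two outcomes as evidentially equivalent, while the P-values differ because the sampling distributions differ. That difference is exactly what the error statistician wants to retain.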

If you’re keen to try your hand at the arguments (Birnbaum’s or mine), you might start with a summary post (based on slides) here. It is *not* included in SIST. There’s no real mathematics or statistics involved, it’s pure logic. But it’s very circuitous, which is one reason why the supposed “proof” has stuck around as long as it has. The other reason is that many people *want* it to be so.

[1] Cox 1958 has a different variant of the chestnut.

[2] Note sufficiency is not really needed in the “proof”.

[3] The discussion includes commentaries by Dawid, Evans, Martin and Liu, Hannig, and Bjørnstad–some of whom are very unhappy with me. But I’m given the final word in the rejoinder.

**References **(outside of the excerpt; for refs within SIST, please see SIST):

Birnbaum, A. (1962), “On the Foundations of Statistical Inference”, *Journal of the American Statistical Association* 57(298), 269-306.

Birnbaum, A. (1975). “Comments on Paper by J. D. Kalbfleisch”, *Biometrika* 62(2), 262–264.

Cox, D. R. (1958), “Some problems connected with statistical inference”, *The Annals of Mathematical Statistics* 29, 357-372.

Mayo, D. G. (2010). “An Error in the Argument from Conditionality and Sufficiency to the Likelihood Principle” in *Error and Inference: Recent Exchanges on Experimental Reasoning, Reliability and the Objectivity and Rationality of Science* (D. Mayo and A. Spanos eds.), Cambridge: Cambridge University Press: 305-14.

Mayo, D. G. (2013). “Presented Version: On the Birnbaum Argument for the Strong Likelihood Principle”, in *JSM Proceedings*, Section on Bayesian Statistical Science. Alexandria, VA: American Statistical Association: 440-453.

Mayo, D. G. (2014). “On the Birnbaum Argument for the Strong Likelihood Principle”, paper with discussion and Mayo rejoinder: *Statistical Science* 29(2), pp. 227-239, 261-266.

I’m reblogging a post from Christmas past–exactly 7 years ago. Guess what I gave as the number 1 (of 13) ~~howler~~ well-worn criticism of statistical significance tests, haunting us back in 2012–all of which are put to rest in Mayo and Spanos 2011? Yes, it’s the frightening allegation that statistical significance tests forbid using any background knowledge! The researcher is imagined to start with a “blank slate” in each inquiry (no memories of fallacies past), and then unthinkingly apply a purely formal, automatic, accept-reject machine. What’s newly frightening (in 2019) is the credulity with which this apparition is now being met (by some). I make some new remarks below the post from Christmas past:

2013 is right around the corner, and here are 13 well-known criticisms of statistical significance tests, and how they are addressed within the error statistical philosophy, as discussed in Mayo, D. G. and Spanos, A. (2011) “Error Statistics”.

- (#1) Error statistical tools forbid using any background knowledge [1].
- (#2) All statistically significant results are treated the same.
- (#3) The p-value does not tell us how large a discrepancy is found.
- (#4) With large enough sample size even a trivially small discrepancy from the null can be detected.
- (#5) Whether there is a statistically significant difference from the null depends on which is the null and which is the alternative.
- (#6) Statistically insignificant results are taken as evidence that the null hypothesis is true.
- (#7) Error probabilities are misinterpreted as posterior probabilities.
- (#8) Error statistical tests are justified only in cases where there is a very long (if not infinite) series of repetitions of the same experiment.
- (#9) Specifying statistical tests is too arbitrary.
- (#10) We should be doing confidence interval estimation rather than significance tests.
- (#11) Error statistical methods take into account the intentions of the scientists analyzing the data.
- (#12) All models are false anyway.
- (#13) Testing assumptions involves illicit data-mining.

You can read how we avoid them in the full paper here.

My book, *Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars* (SIST 2018, CUP), excavates the most recent variations on all of these howlers. To allege that statistical significance tests don’t use background information is a willful distortion of the tests which Fisher developed, hand-in-hand, with a large methodology of experimental design: randomization, predesignation and testing model assumptions. All these depend on incorporating background information into the specification and interpretation of tests. “The purpose of randomisation” Fisher made clear, “is to guarantee the validity of the test of significance” (1935). Observational (and other) studies that lack proper controls may well need to concede that any reported P-values are illicit–but then why report P-values at all? (Confidence levels are then equally illicit, except as descriptive measures without error control.) I say they should not report P-values lacking in error-statistical interpretations, at least not without reporting this. But don’t punish studies that work hard to attain error control.

Before you jump on the popular (but misguided) bandwagons of “abandoning statistical significance” or derogating P-values as so-called “purely (blank slate) statistical measures”, ask for evidence supporting the criticisms.[2] You will find they are based on rather blatant misuses and abuses. Only by blocking the credulity with which such apparitions are met these days (in some circles) can we attain improved statistical inferences in Christmases yet to come.

[1] “Error statistical methods” is an umbrella term for methods that employ probability in inference to assess and control the capabilities of methods to avoid mistakes in interpreting data. It includes statistical significance tests, confidence intervals, confidence distributions, randomization, resampling and bootstrapping. A proper subset of error statistical methods consists of those that use error probabilities to assess and control the *severity* with which claims may be said to have passed a test (with given data). A claim C passes a test with severity to the extent that it has been subjected to and survives a test that probably would have found specified flaws in C, if present. Please see excerpts from SIST 2018.

[2] See

- November 4, 2019: On some Self-defeating aspects of the ASA’s 2019 recommendations of statistical significance tests
- November 14, 2019: The ASA’s P-value Project: Why it’s Doing More Harm than Good (cont from 11/4/19)
- November 30, 2019: P-Value Statements and Their Unintended(?) Consequences: The June 2019 ASA President’s Corner (b)

The paper referred to in the post from Christmas past (1) is:

Mayo, D. G. and Spanos, A. (2011) “Error Statistics” in *Philosophy of Statistics, Handbook of Philosophy of Science*, Volume 7.

When it comes to the statistics wars, leaders of rival tribes sometimes sound as if they believed “les stats, c’est moi”. [1] So, rather than say they would like to supplement some well-known tenets (e.g., “a statistically significant effect may not be substantively important”) with a new rule that advances their particular preferred language or statistical philosophy, they may simply blurt out: “**we take that step here!**” followed by whatever rule of language or statistical philosophy they happen to prefer (as if they have just added the new rule to the existing, uncontested tenets). Karen Kafadar, in her last official (December) report as President of the American Statistical Association (ASA), expresses her determination to call out this problem at the ASA itself. (She raised it first in her June article, discussed in my last post.)

One final challenge, which I hope to address in my final month as ASA president, concerns issues of significance, multiplicity, and reproducibility. In 2016, the ASA published a statement that simply reiterated what *p*-values are and are not. It did not recommend specific approaches, other than “good statistical practice … principles of good study design and conduct, a variety of numerical and graphical summaries of data, understanding of the phenomenon under study, interpretation of results in context, complete reporting and proper logical and quantitative understanding of what data summaries mean.”

The guest editors of the March 2019 supplement to *The American Statistician* went further, writing: “The ASA Statement on P-Values and Statistical Significance stopped just short of recommending that declarations of ‘statistical significance’ be abandoned. We take that step here. … [I]t is time to stop using the term ‘statistically significant’ entirely.”

Many of you have written of instances in which authors and journal editors – and even some ASA members – have mistakenly assumed this editorial represented ASA policy. The mistake is understandable: The editorial was co-authored by an official of the ASA. In fact, the ASA does not endorse any article, by any author, in any journal – even an article written by a member of its own staff in a journal the ASA publishes. (Kafadar, December President’s Corner)

Yet Wasserstein et al. 2019 describes itself as a *continuation* of the ASA 2016 Statement on P-values, which I abbreviate as ASA I. (Wasserstein is the Executive Director of the ASA.) It describes itself as merely recording the decision to “take that step here”, and add one more “don’t” to ASA I. As part of this new “don’t,” it also stipulates that we should not consider “at all” whether pre-designated P-value thresholds are met. (It also restates four of the six principles in ASA I so as to be considerably stronger than those in ASA I. I argue, in fact, the resulting principles are inconsistent with principles 1 and 4 of ASA I. See my post from June 17, 2019.) Since it describes itself as a continuation of the ASA policy in ASA I, and that description survived peer review at the journal TAS, readers presume that’s what it is; absent any disclaimer to the contrary, that conception (or misconception) remains operative.

There really is no other way to read the claim in the Wasserstein et al. March 2019 editorial: “*The ASA Statement on P-Values and Statistical Significance* stopped just short of recommending that declarations of ‘statistical significance’ be abandoned.[2] We take that step here.” Had the authors viewed their follow-up as anything but a continuation of ASA I, they would have said something like: “Our own recommendation is to go *much further* than ASA I. We suggest that all branches of science stop using the term ‘statistically significant’ entirely.” They do not say that. What they say is written from the perspective of “Les stats, c’est moi”.

**The 2019 P-value Project II**

Kafadar deserves a great deal of credit for providing some needed qualification in her December note. However, there needs to be a disclaimer by ASA as regards what it calls its **P-value Project**. The P-value project, started in 2014, refers to the overall ASA campaign to provide guides for the correct use and interpretation of P-values and statistical significance, and journal editors and societies are to consider revising their instructions to authors taking into account its guidelines. ASA I was distilled from many meetings and discussions from representatives in statistics. The only difference in today’s P-value Project is that both ASA I *and* the 2019 editorial by Wasserstein et al. are to form the new ASA guidelines–even if the latter is not to be regarded as a continuation of ASA I (in accord with Kafadar’s qualification). I will refer to it as the **2019 ASA P-value Project II**. Wasserstein et al. 2019 is a piece of the P-value project, and the authors thank the ASA for its support of this Project at the end of the article. [4] [5]

**Of Policies and Working Groups**

Kafadar continues:

Even our own ASA members are asking each other, “What do we tell our collaborators when they ask us what they should do about statistical hypothesis tests and *p*-values?” Should the ASA have a policy on hypothesis testing or on using “statistical significance”?

Allow me to weigh in here: No, no it should not. At one time I would have said yes, but no more. I can hear the policy now (sounding much like Wasserstein et al. 2019, only written in stone): “Don’t say, never say, or if you really feel you must say significance, and are prepared to thoroughly justify such a “thoughtless” term, then you may only say “significance level p” where p is continuous, and never rounded up or cut off, ever. But never, ever use the “ant” ending: signifi*cant*. Y

Why can’t the ASA merely provide a bipartisan forum for discussion of the multitude of models, methods, aims, goals, and philosophies of its members? Wasserstein et al. 2019 admits there is no agreement, and that there might never be. Spare us another document whose implication is: we need not test, and cannot falsify claims, even statistically (since that is the consequence of no thresholds). I realize that Kafadar is calling for a serious statement–one that counters the impression of the Wasserstein et al. opinion.

To address these issues, I hope to establish a working group that will prepare a thoughtful and concise piece reflecting “good statistical practice,” without leaving the impression that *p*-values and hypothesis tests – and, perhaps by extension as many have inferred, statistical methods generally – have no role in “good statistical practice.” … The ASA should develop – and publicize – a properly endorsed statement on these issues that will guide good practice.

Be careful what you wish for. I give major plaudits to Kafadar for pressing hard to see that alternative views are respected, and to counter the popular but terrible arguments of the form: since these methods are misused, they should be banished, and replaced with methods advocated by group Z (even if the credentials of Z’s methods haven’t been scrutinized!) We have already seen in 2019 the extensive politicization and sensationalizing of bandwagons in statistics. (See my editorial P-value Thresholds: Forfeit at your Peril.) The average ASA member, who doesn’t happen to be a thought leader or member of a politically correct statistical-philosophical tribe, is in great danger of being muffled entirely. There’s already a loss of trust. We already know, under the motto that “a crisis should never be wasted”, that many leaders of statistical tribes view the crisis of replication as an opportunity to sell alternative methods they have long been promoting. Rather than the properly endorsed, truly representative, statement that Kafadar seeks, we may get dictates from those who are quite convinced that they know best: “les stats, c’est moi”.

**APPENDIX. How a Working Group on P-values and Significance Testing Could Work**

I see one way that a working group could actually work. The 2016 ASA statement, ASA I, had a principle, it was #4. You don’t hear about it in the 2019 follow-up. It is that “P-values and related statistics” cannot be correctly interpreted without knowing how many hypotheses were tested, how data were specified and results selected for inference. Notice the qualification “and related statistics”. The presumption is that some methods don’t require that information! That information is necessary only if one is out to control the error probabilities associated with an inference.
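A small calculation shows why the number of hypotheses tested matters for interpreting a P-value. It is a sketch under simplifying assumptions that are mine, not ASA I’s: all null hypotheses are true, the tests are independent, and each is judged against the same nominal threshold. The function name is hypothetical.

```python
def prob_at_least_one_nominal_hit(num_tests, alpha=0.05):
    """Probability that at least one of `num_tests` independent tests of true
    nulls yields a P-value below `alpha`."""
    return 1 - (1 - alpha) ** num_tests

for num_tests in (1, 5, 20, 100):
    chance = prob_at_least_one_nominal_hit(num_tests)
    print(f"{num_tests:>3} tests: {chance:.3f}")
```

A reported P-value of 0.05 therefore carries a very different error probability when it is the smallest of many than when it comes from a single predesignated test, which is why the selection information is needed by anyone out to control error probabilities.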

Here’s my idea: Have the group consist of those who work in areas where statistical inferences depend on controlling error probabilities (I call such methods *error statistical*). They would be involved in current uses and developments of statistical significance testing and the much larger (frequentist) error statistical methodology within which it forms just a part. They would be familiar with, and some would be involved in developing, the latest error statistical tools, including tests and confidence distributions, P-values with high dimensional data, current problems of adjusting for multiple testing, and of testing statistical model assumptions, and they would be able to address different aspects of comparative statistical methods (Bayesian and error statistical). They would present their findings and recommendations, and responses would be sought.

The need for the kind of forum I’m envisioning is so pressing, that it should not be contingent on being created by any outside association. It should emerge spontaneously in 2020. *We take that step here.*

*Please share your comments in the comments.*

[1] This is a pun on “l’état, c’est moi” (“I am the state”, Louis XIV*.) I thank Glenn Shafer for the appropriate French spelling for my pun. (*Thanks to S. Senn for noticing I was missing the X in Louis XIV.)

[2] They are referring to the last section of ASA I on “other measures of evidence”. Indeed, that section suggests an endorsement of an assortment of alternative measures of evidence including Bayes factors, likelihood ratios and others. There is no attention to whether any of these methods accomplish the key task of the statistical significance test–to distinguish genuine from spurious effects. For a fuller explanation of this last section, please see my post from June 17, 2019 and November 14, 2019. And, obviously, check the last section of ASA I.

Shortly after the 2019 editorial appeared, I queried Wasserstein as to the relationship between it and ASA I. It was never clarified. I hope now that it will be. At the same time I informed him of what appeared to me to be slips in expressing principles of ASA I, and I offered friendly amendments (see my post from June 17, 2019).

[3] If you’re giving the history of statistics, you can speak of those bad, bad men–dichotomaniacs, Neyman and Pearson– who, following Fisher, divided results into significant and non-significant discrepancies (introduced the alternative hypotheses, type I and II errors, power and optimal tests) and thereby tried to reduce all of statistics to acceptance sampling, engineering, and 5-year plans in Russia–as Fisher (1955) himself said (after the professional break with Neyman in 1935). Never mind that Neyman developed confidence intervals at the same time, 1930. For a full discussion of the history of the Fisher-Neyman (and related) wars, please see my *Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars* (CUP, 2018).

[4] I was just sent this podcast and interview of Ron Wasserstein, so I’m adding it as a footnote. There, Wasserstein et al. 2019 is clearly described as the ASA’s “further guidance”, and Wasserstein takes no exception to it. The interviewer says:

**“**But it would seem as though Ron’s work has only just begun. The ASA has just published further guidance in the most recent edition of The American Statistician, which is open access and written for non-statisticians. The guidance is intended to go further and argues for an end to the concept of statistical significance and towards a model which the ASA have coined their ATOM Principle: Accept uncertainty, Thoughtful, Open and Modest.”

https://www.howresearchers.com/wp-content/uploads/2019/05/hrcw-transcript-episode-2.pdf

[5] Nathan Schachtman, in a new post just added to his law blog on this very topic, displays a letter from the ASA acknowledging that a journal has revised its guidelines taking into account *both* ASA I and the 2019 Wasserstein et al. editorial. I had seen this letter, in relation to the NEJM, but it’s hard to know what to make of it. I haven’t seen others acknowledging other journals, and there have been around 7 at this point. I may just be out of the loop.

**Selected blog posts on ASA I and the Wasserstein et al. 2019 editorial:**

- March 25, 2019: “Diary for Statistical War Correspondents on the Latest Ban on Speech.”
- June 17, 2019: “The 2019 ASA Guide to P-values and Statistical Significance: Don’t Say What You Don’t Mean” (Some Recommendations)(ii)
- July 19, 2019: The NEJM Issues New Guidelines on Statistical Reporting: Is the ASA P-Value Project Backfiring? (i)
- September 19, 2019: (Excerpts from) ‘P-Value Thresholds: Forfeit at Your Peril’ (free access). The article by Hardwicke and Ioannidis (2019), and the editorials by Gelman and by me are linked on this post.
- November 4, 2019: On some Self-defeating aspects of the ASA’s 2019 recommendations of statistical significance tests
- November 14, 2019: The ASA’s P-value Project: Why it’s Doing More Harm than Good (cont from 11/4/19)
- November 30, 2019: P-Value Statements and Their Unintended(?) Consequences: The June 2019 ASA President’s Corner (b)


I never met Karen Kafadar, the 2019 President of the American Statistical Association (ASA), but the other day I wrote to her in response to a call in her extremely interesting June 2019 President’s Corner: “Statistics and Unintended Consequences“:

- “I welcome your suggestions for how we can communicate the importance of statistical inference and the proper interpretation of p-values to our scientific partners and science journal editors in a way they will understand and appreciate and can use with confidence and comfort—before they change their policies and abandon statistics altogether.”

I only recently came across her call, and I will share my letter below. First, here are some excerpts from her June President’s Corner (her December report is due any day).

Recently, at chapter meetings, conferences, and other events, I’ve had the good fortune to meet many of our members, many of whom feel queasy about the effects of differing views on p-values expressed in the March 2019 supplement of The American Statistician (TAS). The guest editors— Ronald Wasserstein, Allen Schirm, and Nicole Lazar—introduced the ASA Statement on P-Values (2016) by stating the obvious: “Let us be clear. Nothing in the ASA statement is new.” Indeed, the six principles are well-known to statisticians. The guest editors continued, “We hoped that a statement from the world’s largest professional association of statisticians would open a fresh discussion and draw renewed and vigorous attention to changing the practice of science with regards to the use of statistical inference.”…

Wait a minute. I’m confused about who is speaking. The statements “Let us be clear…” and “We hoped that a statement from the world’s largest professional association…” come from the 2016 ASA Statement on P-values. I abbreviate this as ASA I (Wasserstein and Lazar 2016). The March 2019 editorial that Kafadar says is making many members “feel queasy,” is the update (Wasserstein, Schirm, and Lazar 2019). I abbreviate it as ASA II [i].

A healthy debate about statistical approaches can lead to better methods. But, just as Wilks and his colleagues discovered, unintended consequences may have arisen: Nonstatisticians (the target of the issue) may be confused about what to do. Worse, “by breaking free from the bonds of statistical significance” as the editors suggest and several authors urge, researchers may read the call to “abandon statistical significance” as “abandon statistical methods altogether.” …

But we may need more. How exactly are researchers supposed to implement this “new concept” of statistical thinking? Without specifics, questions such as “Why is getting rid of p-values so hard?” may lead some of our scientific colleagues to hear the message as, “Abandon p-values”—despite the guest editors’ statement: “We are not recommending that the calculation and use of continuous p-values be discontinued.”

Brad Efron once said, “Those who ignore statistics are condemned to re-invent it.” In his commentary (“It’s not the p-value’s fault”) following the 2016 ASA Statement on P-Values, Yoav Benjamini wrote, “The ASA Board statement about the p-values may be read as discouraging the use of p-values because they can be misused, while the other approaches offered there might be misused in much the same way.” Indeed, p-values (and all statistical methods in general) can be misused. (So may cars and computers and cell phones and alcohol. Even words in the English language get misused!) But banishing them will not prevent misuse; analysts will simply find other ways to document a point—perhaps better ways, but perhaps less reliable ones. And, as Benjamini further writes, p-values have stood the test of time in part because they offer “a first line of defense against being fooled by randomness, separating signal from noise, because the models it requires are simpler than any other statistical tool needs”—especially now that Efron’s bootstrap has become a familiar tool in all branches of science for characterizing uncertainty in statistical estimates. [Benjamini is commenting on ASA I.]

… It is reassuring that “Nature is not seeking to change how it considers statistical evaluation of papers at this time,” but this line is buried in its March 20 editorial, titled “It’s Time to Talk About Ditching Statistical Significance.” Which sentence do you think will be more memorable? We can wait to see if other journals follow BASP’s lead and then respond. But then we’re back to “reactive” versus “proactive” mode (see February’s column), which is how we got here in the first place.

… Indeed, the ASA has a professional responsibility to ensure good science is conducted—and statistical inference is an essential part of good science. Given the confusion in the scientific community (to which the ASA’s peer-reviewed 2019 TAS supplement may have unintentionally contributed), we cannot afford to sit back. After all, that’s what started us down the “abuse of p-values” path.

Is it unintentional? [ii]

…Tukey wrote years ago about Bayesian methods: “It is relatively clear that discarding Bayesian techniques would be a real mistake; trying to use them everywhere, however, would in my judgment, be a considerably greater mistake.” In the present context, perhaps he might have said: “It is relatively clear that trusting or dismissing results based on a single p-value would be a real mistake; discarding p-values entirely, however, would in my judgment, be a considerably greater mistake.” We should take responsibility for the situation in which we find ourselves today (and during the past decades) to ensure that our well-researched and theoretically sound statistical methodology is neither abused nor dismissed categorically. I welcome your suggestions for how we can communicate the importance of statistical inference and the proper interpretation of p-values to our scientific partners and science journal editors in a way they will understand and appreciate and can use with confidence and comfort—before they change their policies and abandon statistics altogether. Please send me your ideas!

You can read the full June President’s Corner.

On Fri, Nov 8, 2019 at 2:09 PM Deborah Mayo <mayod@vt.edu> wrote:

Dear Professor Kafadar;

Your article in the President’s Corner of the ASA for June 2019 was sent to me by someone who had read my “P-value Thresholds: Forfeit at your Peril” editorial, invited by John Ioannidis. I find your sentiment welcome and I’m responding to your call for suggestions.

For starters, when representatives of the ASA issue articles criticizing P-values and significance tests, recommending their supplementation or replacement by others, three very simple principles should be followed:

- The elements of tests should be presented in an accurate, fair and at least reasonably generous manner, rather than presenting mainly abuses of the methods;
- The latest accepted methods should be included, not just crude nil null hypothesis tests. How these newer methods get around the often-repeated problems should be mentioned.
- Problems facing the better-known alternatives, recommended as replacements or supplements to significance tests, should be discussed. Such an evaluation should recognize the role of statistical falsification is distinct from (while complementary to) using probability to quantify degrees of confirmation, support, plausibility or belief in a statistical hypothesis or model.

Here’s what I recommend ASA do now in order to correct the distorted picture that is now widespread and growing: Run a conference akin to the one Wasserstein ran on “A World Beyond ‘P < 0.05’” except that it would be on evaluating some competing methods for statistical inference: Comparative Methods of Statistical Inference: Problems and Prospects.

The workshop would consist of serious critical discussions on Bayes Factors, confidence intervals[iii], Likelihoodist methods, other Bayesian approaches (subjective, default non-subjective, empirical), particularly in relation to today’s replication crisis. …

Growth of the use of these alternative methods has been sufficiently widespread to have garnered discussions on well-known problems….The conference I’m describing will easily attract the leading statisticians in the world. …

Sincerely,

D. Mayo

Please share your comments on this blogpost.

************************************

[i] My reference to ASA II refers just to the portion of the editorial encompassing their general recommendations: don’t say significance or significant, oust P-value thresholds. (It mostly encompasses the first 10 pages.) It begins with a review of 4 of the 6 principles from ASA I, even though they are stated in more extreme terms than in ASA I. (As I point out in my blogpost, the result is to give us principles that are in tension with the original 6.) Note my new qualification in [ii]*

[ii]*As soon as I saw the 2019 document, I queried Wasserstein as to the relationship between ASA I and II. It was never clarified. I hope now that it will be, with some kind of disclaimer. That will help, but merely noting that it never came to a Board vote will not quell the confusion now rattling some ASA members. The *ASA’s P-value campaign* to editors to revise their author guidelines asks them to take account of both ASA I *and* II. In carrying out the P-value campaign, at which he is highly effective, Ron Wasserstein obviously* wears his Executive Director’s hat. See The ASA’s P-value Project: Why It’s Doing More Harm than Good. So, until some kind of clarification is issued by the ASA, I’ve hit upon this solution.

The ASA P-value Project existed before the 2016 ASA I. The only difference in today’s P-value Project–since the March 20, 2019 editorial by Wasserstein et al– is that the ASA Executive Director (in talks, presentations, correspondence) recommends ASA I *and* the general stipulations of ASA II–even though that’s not a policy document. I will now call it the **2019 ASA P-value Project II**. It also includes the rather stronger principles in ASA II. Even many who entirely agree with the “don’t say significance” and “don’t use P-value thresholds” recommendations have concurred with my “friendly amendments” to ASA II (including, for example, Greenland, Hurlbert, and others). See my post from June 17, 2019.

You merely have to look at the comments to that blog. If Wasserstein would make those slight revisions, the 2019 P-value Project II wouldn’t contain the inconsistencies, or at least “tensions” that it now does, assuming that it retains ASA I. The 2019 ASA P-value Project II sanctions making the recommendations in ASA II, even though ASA II is not an ASA policy statement.

However, I don’t see that those made queasy by ASA II would be any less upset with the reality of the ASA P-value Project II.

[iii] Confidence intervals (CIs) clearly aren’t “alternative measures of evidence” in relation to statistical significance tests. The same man, Neyman, developed tests (with Pearson) and CIs, even earlier ~1930. They were developed as duals, or inversions, of tests. Yet the advocates of CIs–the CI Crusaders, S. Hurlbert calls them–are some of today’s harshest and most ungenerous critics of tests. For these crusaders, it has to be “CIs only”. Supplementing p-values with CIs isn’t good enough. Now look what’s happened to CIs in the latest guidelines of the NEJM. You can readily find them searching NEJM on this blog. (My own favored measure, severity, improves on CIs, moves away from the fixed confidence level, and provides a different assessment corresponding to each point in the CI.)

*****Or is it not obvious? I think it is, because he is invited and speaks, writes, and corresponds in that capacity.

Wasserstein, R. & Lazar, N. (2016) [ASA I], “The ASA’s Statement on *p*-Values: Context, Process, and Purpose”. *The American Statistician* 70(2).

Wasserstein, R., Schirm, A. and Lazar, N. (2019) [ASA II], “Moving to a World Beyond ‘p < 0.05’”, *The American Statistician* 73(S1): 1-19: Editorial.

**Related posts on ASA II:**

- June 17, 2019: “The 2019 ASA Guide to P-values and Statistical Significance: Don’t Say What You Don’t Mean” (Some Recommendations)
- July 12, 2019: B. Haig: The ASA’s 2019 update on P-values and significance (ASA II)(Guest Post)
- July 19, 2019: The NEJM Issues New Guidelines on Statistical Reporting: Is the ASA P-Value Project Backfiring?
- September 19, 2019: (Excerpts from) ‘P-Value Thresholds: Forfeit at Your Peril’ (free access). The article by Hardwicke and Ioannidis (2019), and the editorials by Gelman and by me are linked on this post.
- Nov 4, 2019. On some Self-defeating aspects of the ASA’s 2019 recommendations of statistical significance tests
- Nov 22. The ASA’s P-value Project: Why It’s Doing More Harm than Good.

**Related book (excerpts from posts on this blog are collected here)**

Mayo, D. (2018). *Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars*, SIST (2018, CUP).

Centre for the Study of the Sciences and the Humanities (SVT), University of Bergen (UIB, Norway),

&

Open Evidence Research, Universitat Oberta de Catalunya (UOC), Barcelona

**What can we learn from the debate on statistical significance?**

The statistical community is in the midst of a crisis whose latest convulsion is a petition to abolish the concept of significance. The problem is perhaps neither with significance, nor with statistics, but with the inconsiderate way we use numbers, and with our present approach to quantification. Unless the crisis is resolved, there will be a loss of consensus in scientific arguments, with a corresponding decline of public trust in the findings of science.

**# The sins of quantification**

Every quantification which is unclear as to its scope and the context in which it is produced obscures rather than elucidates.

Traditionally, the strength of numbers in the making of an argument has rested on their purported objectivity and neutrality. Expressions such as “Concrete numbers”, “The numbers speak for themselves”, “The data/the model don’t lie” are common currency. Today, doubts about algorithmic instances of quantification – e.g. in promoting, detaining, conceding freedom or credit, are becoming more urgent and visible. Yet the doubt should be general. It is becoming realised that in every activity of quantification, the technique or the methods are never neutral, because it is never possible to separate entirely the act of quantifying from the wishes and expectations of the quantifier. Thus, books apparently telling separate stories, such as Rigor Mortis, Weapons of Math Destruction, the Tyranny of Metrics, or Useless Arithmetic, dealing with statistics, algorithms, indicators and models, share a common concern.

**# Statisticians know**

Statisticians are increasingly aware that each number presupposes an underlying narrative, a worldview, and a purpose of the exercise. The maturity of this debate in the house of statistics is not an accident. Statistics is a discipline, with recognized leaders and institutions, and although one might derive an impression of disorder from the use of a petition to influence a scientific argument, one cannot deny that the problems in statistics are being tackled head on, in the public arena, in spite of the obvious difficulty for the lay public to follow the technicality of the arguments. With its ongoing discussion of significance, the community of statistics is teaching us an important lesson about the tight coupling between technique and values. How so? We recap here some elements of the debate.

- For some, it would be better to throw away the concept of significance altogether, because the p-test, with its magical p<0.05 threshold, is being misused as a measure of veracity and publishability.
- Others object that discussion should not take place with the instrument of a petition and that withdrawing tests of significance would make science even more uncertain.
- The former retort that since this discussion has been going on for decades in academic journals without the existing flaws being fixed, perhaps the time is ripe for action.

A good vantage point to look at this debate in its entirety is this section in Andrew Gelman’s blog.

**# Different worlds**

An important aspect of this discussion is that the contenders may inhabit different worlds. One world is full of important effects which are overlooked because the test of significance fails (p value greater than 0.05 in statistical parlance). The other world is instead replete with bogus results passed on to the academic literature thanks to a low value of the p-test (p<0.05).

A modicum of investigation reveals that the contention is normative, or indeed political. To take an example, some may fear the introduction on the market of ineffectual pharmaceutical products, others that important epidemiological effects of a pollutant on health may be overlooked. The first group would thus have a more restrictive value for the test, the second group a less restrictive one.
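The trade-off between the two groups can be made concrete with a small sketch. It assumes a one-sided z-test of a zero mean against a positive mean, with known standard deviation; the effect size, standard deviation and sample size below are hypothetical numbers chosen only for illustration. A more restrictive threshold lowers the chance of passing a bogus effect but raises the chance of overlooking a real one.

```python
from statistics import NormalDist


def type_ii_error(alpha: float, effect: float, sigma: float, n: int) -> float:
    """Probability that a one-sided z-test at level alpha fails to reject
    when the true mean equals `effect` (null value 0, known sigma)."""
    nd = NormalDist()
    z_crit = nd.inv_cdf(1.0 - alpha)          # rejection cutoff on the z scale
    shift = effect * n ** 0.5 / sigma         # true mean of the z statistic
    return nd.cdf(z_crit - shift)


if __name__ == "__main__":
    # Hypothetical study: true effect 0.3, sigma 1.0, 50 observations.
    for alpha in (0.10, 0.05, 0.01, 0.001):
        beta = type_ii_error(alpha, effect=0.3, sigma=1.0, n=50)
        print(f"alpha={alpha:<6} type II error={beta:.3f}")
```

Which row of that output is acceptable is not something the arithmetic can settle; it depends on which error the user fears more.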

All this is not new. Philosopher Richard Rudner had already written in 1953 that it is impossible to use a test of significance without knowing to what it is being applied, i.e. without making a value judgment. Interestingly, Rudner used this example to make the point that scientists do need to make value judgments.

**# How about mathematical models?**

In all this discussion mathematical models have enjoyed a relative immunity, perhaps because mathematical modelling is not a discipline. But the absence of awareness of a quality problem is not proof of the absence of a problem. And there are signals that the crisis there might be even worse than that which is recognised in statistics.

Implausible quantifications of the effect of climate change on the gross domestic product of a country at the year 2100, or of the safety of a disposal for nuclear waste a million years from now, or of the risk of the financial products at the heart of the latest financial crisis, are just examples that are easily seen in the literature. Political decisions in the field of transport may be based on a model which needs as an input the average number of passengers sitting in a car several decades in the future. A scholar studying science and technology laments the generation of artefactual numbers through methods and concepts such as ‘expected utility’, ‘decision theory’, ‘life cycle assessment’, ‘ecosystem services’, ‘sound scientific decisions’ and ‘evidence-based policy’ to convey a spurious impression of certainty and control over important issues concerning health and the environment. A rhetorical use of quantification may thus be used in evidence-based policy to hide important knowledge and power asymmetries: the production of evidence empowers those who can pay for it, a trend noted in both the US and Europe.

**# Resistance?**

Since its inception the current of post normal science (PNS) has insisted on the need to fight against instrumental or fantastic quantifications. PNS scholars suggested the use of pedigree for numerical information (NUSAP), and recently for mathematical models. Combined with PNS’ concept of extended peer communities, these tools are meant to facilitate a discussion of the various attributes of a quantification. This information includes not just its uncertainty, but also its history, the profile of its producers, its position within a system of power and norms, and overall its ‘fitness for function’, while also identifying the possible exclusion of competing stakes and worldviews.

Stat-Activisme, a recent French intellectual movement, proposes to ‘fight against’ as well as ‘fight with’ numbers. Stat-activisme targets invasive metrics and biased statistics, with a rich repertoire of strategies from ‘statistical judo’ to the construction of alternative measures.

As philosopher Jerome Ravetz reminds us, so long as our modern scientific culture has faith in numbers as if they were ‘nuggets of truth’, we will be victims of ‘funny numbers’ employed to rule our technical society.

**Note:** A different version of this piece has been published in Italian in the journal Epidemiologia and Prevenzione.

*Everything is impeach and remove these days!* Should that hold also for the concept of statistical significance and P-value thresholds? There’s an active campaign that says yes, but I aver it is doing more harm than good. In my last post, I said I would count the ways it is detrimental until I became “too disconsolate to continue”. There I showed why the new movement, launched by Executive Director of the ASA (American Statistical Association), Ronald Wasserstein (in what I dub ASA II), is self-defeating: it instantiates and encourages the human-all-too-human tendency to exploit researcher flexibility, rewards, and openings for bias in research (**F, R & B Hypothesis**). That was reason #1. Just reviewing it already fills me with such dismay, that I fear I will become too disconsolate to continue before even getting to reason #2. So let me just quickly jot down reasons #2, 3, 4, and 5 (without full arguments) before I expire.

[I thought that with my book *Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars* (2018, CUP), that I had said pretty much all I cared to say on this topic (and by and large, this is true), but almost as soon as it appeared in print just around a year ago, things got very strange.]

But wait. Someone might object that I’m the one doing more harm than good by linking the ASA (The American Statistical Association) to Wasserstein’s campaign to get publishers, journalists, authors and the general public to buy into the recommendations of ASA II. “Shhhhh!” some counsel, “don’t give it more attention; we want people to look away”. Nothing to see? I don’t think so. I will discuss this point in this post in PART II, as soon as I sketch my list of reasons #2-5.

Before starting, let me remind readers that what I abbreviate as ASA II only refers to those portions of the 2019 editorial by Wasserstein, Schirm, and Lazar that allude to their general recommendations, not their summaries of contributed papers in the issue of TAS.

**PART I**

* 2 Decriminalize theft to end robbery.* The key arguments for impeaching and removing statistical significance levels and P-value thresholds commit fallacies of the “cut off your nose to spite your face” variety. For example, we should ban P-value thresholds because they cause biased selection and data dredging. Discard P-value thresholds and P-hacking disappears! Or so it is argued. Even were this true, it would be like arguing we should decriminalize robbery since then the crime of robbery would disappear!

** 3 Straw men and women fallacies.** ASA I and II do more harm than good by presenting oversimple caricatures of the tests. Even ASA I excludes a consideration of alternatives, error probabilities and power[1]. At the same time, it will contrast these threadbare “nil null” hypothesis tests with confidence intervals (CIs)–never minding that the latter employ alternatives. No wonder CIs look better, but such a test is unfair. (Neyman developed confidence intervals as inversions of tests at the same time he was developing hypothesis tests with alternatives in 1930. Using only significance tests, you could recover the lower (and upper) 1-α CI bounds if you wanted, by asking for the hypotheses that the data are statistically significantly greater (smaller) than, at level α, using the usual 2-sided computation).
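The duality between tests and CIs can be checked numerically. The following is only a sketch under simplifying assumptions of my choosing: a Normal mean with known standard deviation, a two-sided z-test, and a crude grid search over candidate null values. The sample mean, standard deviation and sample size are hypothetical. It collects every null value that the data do *not* reject at level α and compares the resulting range with the textbook 1-α interval.

```python
from statistics import NormalDist

ND = NormalDist()


def two_sided_p(xbar: float, mu0: float, sigma: float, n: int) -> float:
    """Two-sided z-test P-value for H0: mu = mu0, with known sigma."""
    z = (xbar - mu0) / (sigma / n ** 0.5)
    return 2.0 * (1.0 - ND.cdf(abs(z)))


def ci_by_inversion(xbar, sigma, n, alpha=0.05, span=5.0, steps=20001):
    """Grid search: keep every mu0 whose test is NOT rejected at level alpha."""
    se = sigma / n ** 0.5
    start = xbar - span * se
    step = 2.0 * span * se / (steps - 1)
    grid = (start + i * step for i in range(steps))
    kept = [mu0 for mu0 in grid if two_sided_p(xbar, mu0, sigma, n) >= alpha]
    return min(kept), max(kept)


def ci_closed_form(xbar, sigma, n, alpha=0.05):
    """Textbook interval: xbar plus or minus z_{1-alpha/2} standard errors."""
    half = ND.inv_cdf(1.0 - alpha / 2.0) * sigma / n ** 0.5
    return xbar - half, xbar + half


if __name__ == "__main__":
    # Hypothetical data summary: mean 10.3, sigma 2.0, n = 25.
    print("inverted tests:", ci_by_inversion(10.3, 2.0, 25))
    print("closed form:   ", ci_closed_form(10.3, 2.0, 25))
```

Up to the grid resolution the two ranges coincide, which is the sense in which a CI contains no more and no less than the corresponding family of tests.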

In ASA II, we learn that “no p-value can reveal the …presence…of an association or effect” (at odds with principle 1 of ASA I). That could be true only in the sense that no formal statistical quantity alone could reveal the presence of an association. But in a realistic setting, small p-values surely do reveal the presence of effects. Yes, there are assumptions, but significance tests are prime tools to probe them. We hear of “the seductive certainty falsely promised by statistical significance”, and are told that “a declaration of statistical significance is the antithesis of thoughtfulness”. (How an account that never issues an inference without an associated error probability can be promising certainty is unexplained. On the second allegation, ignoring how thresholds are rendered meaningful by choosing them to reflect background information and a host of theoretical and epistemic considerations, is all more straw.) The requirement in philosophy of a reasonably generous interpretation of what you’re criticizing isn’t a call for being kind or gentle, it’s that otherwise your criticism is guilty of straw men (and women) fallacies, and thus fails.

* 4 Alternatives to significance testing are given a pass.* You will not find any appraisal of the alternative methods recommended to replace significance tests for their intended tasks. Although many of the “alternative measures of evidence” listed in ASA I and II: Likelihood ratios, Bayes factors (subjective, default, empirical), posterior predictive values (in diagnostic screening) have been critically evaluated by leading statisticians, no word of criticism is heard here. Here’s an exercise: run down the list of 6 “principles” of ASA I, applying them to any of the alternative measures of evidence on offer. Take, for example, Bayes factors. I claim that they do worse than do significance tests, even without modifications.[2]

** 5 Assumes probabilism.** Any fair (non question-begging) comparison of statistical methods should recognize different roles probability may play in inference. The role of probability in inference by way of statistical falsification is quite different from using probability to quantify degrees of confirmation, support, plausibility or belief in a statistical hypothesis or model–or comparative measures of these. I abbreviate the former as

Error probabilities quantify the capabilities of a method to detect the ways a claim (hypothesis, model or other) may be false, or specifiably flawed. The basic principle of testing is minimalist: there is evidence for a claim only to the extent it has been subjected to, and passes, a test that had at least a reasonable probability of having discerned how the claim may be false. (For a more detailed exposition, see Mayo 2018, or excerpts from SIST on this blog).

Reason #5, then, is that “measures of evidence” in both ASA I and II beg this key question (about the role of probability in statistical inference) in favor of probabilisms–usually comparative as with Bayes factors. If the recommendation in ASA II to remove statistical thresholds is taken seriously, there are no tests and no statistical falsification. Recall what Ioannidis said in objecting to “don’t say significance”, cited in my last post:

Potential for falsification is a prerequisite for science. Fields that obstinately resist refutation can hide behind the abolition of statistical significance but risk becoming self-ostracized from the remit of science. (Ioannidis 2019)

“Self-ostracizing” is a great term. ASA should ostracize self-ostracizing. This takes me back to the question I promised to come back to: is it a mistake to see the ASA as entangled in the campaign to ban use of the “S-word”, and kill P-value thresholds?

**PART II**

Those who say it is a mistake, point to the fact that what I’m abbreviating as ASA II did not result from the kind of process that led to ASA I, with extended meetings of statisticians followed by a Board vote[3]. I don’t think that suffices. Granted, the “P-value Project” (as it is called at ASA) is only a small part of the ASA, led by Executive Director Wasserstein. Nevertheless, as indicated on the ASA website, “As executive director, Wasserstein also is an official ASA spokesperson.” In his active campaign to get journals, societies, practitioners and the general public to accept the recommendations in ASA II, he wears his executive director hat, does he not?

As soon as I saw the 2019 document, I queried Wasserstein as to the relationship between ASA I and II. It was never clarified. I hope now that it will be, but it will not suffice to note that it never came to a Board vote. The campaign to editors to revise their guidelines for authors, taking account of both ASA I *and* II, should also be addressed. Keeping things blurred gives plausible deniability, but at the cost of increasing confusion and an “anything goes” attitude.

ASA II clearly presents itself as a continuation of ASA I (again, ASA II refers just to the portion of the editorial encompassing the general recommendation: don’t say significance or significant, oust P-value thresholds). It begins with a review of 4 of the 6 principles from ASA I, even though they are stated in more extreme terms than in ASA I. (As I point out in my blog, the result is to give us principles that are in tension with the original 6.) Next, it goes on to say:

The *ASA Statement on P-Values and Statistical Significance* started moving us toward this world…. The *ASA Statement on P-Values and Statistical Significance* stopped just short of recommending that declarations of “statistical significance” be abandoned. We take that step here. … it is time to stop using the term “statistically significant” entirely. Nor should variants such as ‘significantly different,’ ‘p < 0.05,’ and ‘nonsignificant’ survive…

Undoubtedly, there are signs in ASA I that they were on the verge of this step, notably, the last section: “In view of the prevalent misuses of and misconceptions concerning *p*-values, some statisticians prefer to supplement or even replace *p*-values with other approaches. .. likelihood ratios or Bayes factors”. (p. 132).

A letter to the editor *on ASA I* was quite prescient. It was written by Ionides, Giessing, Ritov and Page (link):

Mixed with the sensible advice on how to use p-values comes a message that is being interpreted across academia, the business world, and policy communities, as, “Avoid p-values. They don’t tell you what you want to know.” … The ASA’s statement, while warning statistical practitioners against these abuses, simultaneously warns practitioners away from legitimate use of the frequentist approach to statistical inference.

What do you think? Please share your comments on this blogpost.

[1] “To keep the statement reasonably simple, we did not address alternative hypotheses, error types, or power (among other things)” (ASA I)

[2] **The ASA 2016 Guide’s Six Principles**

- *P*-values can indicate how incompatible the data are with a specified statistical model.
- *P*-values do not measure the probability that the studied hypothesis is true, or the probability that the data were produced by random chance alone.
- Scientific conclusions and business or policy decisions should not be based only on whether a *p*-value passes a specific threshold.
- Proper inference requires full reporting and transparency. *P*-values and related analyses should not be reported selectively. Conducting multiple analyses of the data and reporting only those with certain *p*-values (typically those passing a significance threshold) renders the reported *p*-values essentially uninterpretable.
- A *p*-value, or statistical significance, does not measure the size of an effect or the importance of a result.
- By itself, a *p*-value does not provide a good measure of evidence regarding a model or hypothesis.

[3] I am grateful to Ron Wasserstein for inviting me to be a “philosophical observer” of this historical project (I attended just one day).

**Blog posts on ASA II:**

- June 17, 2019: “The 2019 ASA Guide to P-values and Statistical Significance: Don’t Say What You Don’t Mean” (Some Recommendations)(ii)
- July 12, 2019: B. Haig: The ASA’s 2019 update on P-values and significance (ASA II)(Guest Post)
- July 19, 2019: The NEJM Issues New Guidelines on Statistical Reporting: Is the ASA P-Value Project Backfiring? (i)
- September 19, 2019: (Excerpts from) ‘P-Value Thresholds: Forfeit at Your Peril’ (free access). The article by Hardwicke and Ioannidis (2019), and the editorials by Gelman and by me are linked on this post.
- Nov 4, 2019. On some Self-defeating aspects of the ASA’s 2019 recommendations of statistical significance tests

**On ASA I:**

- Link to my published comment on ASA I, “Don’t Throw Out the Error Control Baby With the Bad Statistics Bathwater” is here.

**REFERENCES:**

Ioannidis J. (2019). The importance of predefined rules and prespecified statistical analyses: do not abandon significance. *JAMA* 321:2067‐2068. (pdf)

Ionides, E., Giessing, A., Ritov, Y. and Page, S. (2017). Response to the ASA’s Statement on *p*-Values: Context, Process, and Purpose, *The American Statistician*, 71:1, 88-89. (pdf)

Mayo, D. G. (2018). *Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars*, SIST (2018, CUP).

Mayo, D. G. (2019), *P*‐value thresholds: Forfeit at your peril. *Eur J Clin Invest*, 49: e13170. (pdf) doi:10.1111/eci.13170

Wasserstein, R. & Lazar, N. (2016), “The ASA’s Statement on *p*-Values: Context, Process, and Purpose”. *The American Statistician*, Volume 70, 2016 – Issue 2.

Wasserstein, R., Schirm, A. and Lazar, N. (2019) “Moving to a World Beyond ‘p < 0.05’”, *The American Statistician* 73(S1): 1-19: Editorial. (ASA II)(pdf)

What’s self-defeating about pursuing statistical reforms in the manner taken by the American Statistical Association (ASA) in 2019? In case you’re not up on the latest in significance testing wars, the 2016 ASA Statement on P-Values and Statistical Significance, ASA I, arguably, was a reasonably consensual statement on the need to avoid some well-known abuses of P-values–notably if you compute P-values, ignoring selective reporting, multiple testing, or stopping when the data look good, the computed P-value will be invalid. (Principle 4, ASA I) But then Ron Wasserstein, executive director of the ASA, and co-editors, decided they weren’t happy with their own 2016 statement because it “stopped just short of recommending that declarations of ‘statistical significance’ be abandoned” altogether. In their new statement–ASA II–they announced: “We take that step here….Statistically significant –don’t say it and don’t use it”.

Why do I say it is a mis-take to have taken the supposed next “great step forward”? Why do I count it as unsuccessful as a piece of statistical science policy? In what ways does it make the situation worse? Let me count the ways. The first is in this post. Others will come in following posts, until I become too disconsolate to continue.[i]

In this exercise, I imagine I am someone who eagerly wants the recommendations in ASA II to be accepted by authors, journals, agencies, and the general public. In essence the recommendations are: you may report the P-value associated with a test statistic d–a measure of distance or incompatibility between data and a reference hypothesis– but don’t say that what you’re measuring are the attained statistical significance levels associated with d. (Even though that is the mathematical definition of what is being measured.) Do not predesignate a P-value to be used as a threshold for inferring evidence of a discrepancy or incompatibility–or if you do, never use this threshold in interpreting data.

“Whether a p-value passes any arbitrary threshold should not be considered at all” in interpreting data. (ASA II)

This holds, even if you also supply an assessment of indicated population effect size or discrepancy (via confidence intervals, equivalence tests, severity assessments). The same goes for other thresholds based on confidence intervals or Bayes factors.

I imagine myself a member of the ASA II team setting out the recommendation for ASA II, weighing if it’s a good idea. We in this leadership group know there’s serious disagreement about our recommendations in ASA II, and that ASA II could not by any stretch be considered a consensus statement. Indeed even among over 40 papers explicitly invited to discuss “a world beyond P < 0.05”, we (unfortunately) wound up with proposals in radical disagreement. We [ASA II authors] observe “At times in this editorial and the papers you’ll hear deep dissonance, the echoes of ‘statistics wars’ still simmering today (Mayo 2018).”

(Aside: Hey, they are citing my book!)

So we agree there is disagreement. We also agree that a large part of the blame for lack of replication in many fields may be traced to bad behavior encouraged by the reward structure: Incentives to publish surprising and novel studies, coupled with an overly flexible methodology, where many choice points in the “forking paths” (Gelman and Loken 2014) between data and hypotheses open the door to “questionable research practices” (QRPs). Call this the *flexibility, rewards, and bias (F, R & B) hypothesis*. On this hypothesis, the pressure to publish, to be accepted, is so great as to seduce even researchers who are well aware of the pitfalls to capitalize on selection biases (even if it’s only subliminal).

As a member of the team, I imagine reasoning as follows:

Either the recommendations in ASA II will be followed or they won’t. If the latter, then it cannot be considered successful. Now suppose the former, that people do take it up to a significant extent. The F, R & B hypothesis predicts that the imprimatur of the ASA will encourage researchers to adopt, or at least act in accordance with, ASA II recommendations. [ii] The trouble is that there will be no grounds for thinking that any apparent conversion was based on good reasons, or, at any rate, we will be unable to distinguish following the ASA II stipulations on grounds of evidence from following them because the ASA said so. Therefore even in the former situation, where the new stipulations are taken up to a significant degree, with lots of apparent converts, ASA II could not count as a success. Therefore, in either case, what had seemed to us a great step forward, is unsuccessful. So we shouldn’t put it forward.

“Before we were with our backs against the wall, now we have done a 180 degree turn”

A further worry occurs to me in my imaginary weighing of whether our ASA team should go ahead with publishing ASA II. It is this: many of the apparent converts to ASA II might well have come to accept its stipulations on grounds of good reasons, *after* carrying out a reasoned comparison of statistical significance tests with leading alternative methods, as regards its intended task (distinguishing real effects from random or spurious ones)–if the ASA had only seen its role as facilitating the debate between alternative methods, and as offering a forum for airing contrasting arguments held by ASA members. By marching ahead to urge journals, authors, and agencies to comply with ASA II, *we will never know.*

Not only will we not know how much any observed effect in compliance is due to finding its stipulations are warranted, as opposed to it just confirming the truth of the F, R, & B hypothesis–not to mention people’s fear of being on the wrong side of the ASA’s preferences. It’s worse. The tendency to the human weakness of instantiating the F, R & B hypothesis will be strengthened. Why? Because even in the face of acknowledged professional disagreement of a fairly radical sort, and even as we write “the ideas in this editorial are… open to debate” (ASA II), we are recommending our position be accepted without actually having that debate. In asking for compliance, we are saying, in effect, “we have been able to see it is for the better, even though we recognize there is no professional agreement on our recommendations, and even major opposition”. John Ioannidis, no stranger to criticizing statistical significance tests, wrote this note after the publication of ASA II:

Many fields of investigation … have major gaps in the ways they conduct, analyze, and report studies and lack protection from bias. Instead of trying to fix what is lacking and set better and clearer rules, one reaction is to overturn the tables and abolish any gatekeeping rules (such as removing the term statistical significance). However, potential for falsification is a prerequisite for science. Fields that obstinately resist refutation can hide behind the abolition of statistical significance but risk becoming self-ostracized from the remit of science. (Ioannidis 2019)

Therefore, to conclude with my imaginary scenario, we might imagine the ASA team recognizes that putting forward ASA II (in March 2019) is necessarily going to be unsuccessful and self-defeating, extolling the very behavior we supposedly want to eradicate. So we don’t do it. That imaginary situation, unfortunately, is not the real one we find ourselves in.

Making progress, without bad faith, in the real world needn’t be ruled out entirely. There are those, after all, who never heard of ASA II, and do not publish in journals that require obeisance to it. It’s even possible that the necessary debate and comparison of alternative tools for the job could take place after the fact. That would be welcome. None of this would diminish my first self-defeating aspect of the ASA II.

My follow-up post is now up: “The ASA’s P-value Project: Why it’s Doing More Harm than Good”.

[i] See also June 17, 2019. Here I give specific suggestions for why certain principles in ASA II need to be amended to avoid being in tension with ASA I.

[ii] “Imprimatur” means “let it be printed” in Latin. Now I am very careful to follow the context: It is not a consensus document, I make very clear. In fact, that is a key premise of my argument. But the statement that is described as (largely) consensual (ASA I) “stopped just short” of the 2019 editorial. When it first appeared, I asked Wasserstein about the relationship between the two documents. That was the topic of my June 17 post linked in [i]. It was never made clear. It’s blurred. Is it somewhere in the document and I missed it? Increasingly, now that it’s been out long enough for people to start citing it, it is described as the latest ASA recommendations. (They are still just recommendations.) If the ASA wants to clearly distinguish the 2019 from the 2016 statement, this is the time for the authors to do it. (I only consider, as part of ASA II, those general recommendations that are given, not any of the individual papers in the special issue.)

This discussion is continued in my next post: The ASA P-value project: Why it’s doing more harm than good.

**Blog posts on ASA II:**

- June 17, 2019: “The 2019 ASA Guide to P-values and Statistical Significance: Don’t Say What You Don’t Mean” (Some Recommendations)(ii)
- July 12, 2019: B. Haig: The ASA’s 2019 update on P-values and significance (ASA II)(Guest Post)
- July 19, 2019: The NEJM Issues New Guidelines on Statistical Reporting: Is the ASA P-Value Project Backfiring? (i)

**On ASA I:**

- Link to my published comment on ASA I, “Don’t Throw Out the Error Control Baby With the Bad Statistics Bathwater” is here.

**REFERENCES:**

Gelman, A. and Loken, E. (2014). “The Statistical Crisis in Science”. *American Scientist* 2: 460-5. (pdf)

Ioannidis J. (2019). The importance of predefined rules and prespecified statistical analyses: do not abandon significance. *JAMA* 321:2067‐2068. (pdf)

Mayo, D. G. (2018). *Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars*, SIST (2018, CUP).

Mayo, D. G. (2019), *P*‐value thresholds: Forfeit at your peril. *Eur J Clin Invest*, 49: e13170. (pdf) doi:10.1111/eci.13170

Wasserstein, R., Schirm, A. and Lazar, N. (2019) “Moving to a World Beyond ‘p < 0.05’”, *The American Statistician* 73(S1): 1-19: Editorial. (online paper)(pdf)

October 28, 2019

*From universities around the world, participants in a summer session gathered to discuss the merits of the philosophy of statistics. Co-director Deborah Mayo, left, hosted an evening for them at her home.*

In the heat of a Blacksburg summer evening, the talk on Deborah Mayo’s back deck was of philosophy and statistics. Fifteen innovators in the Virginia Tech Summer Seminar in Philosophy of Statistics were contemplating the beginnings of a new field — Phil Stat.

“The overarching goal is that Phil Stat, short for the philosophy of statistics, will become a field in philosophy,” said Mayo, one of the seminar’s co-directors and a professor emerita in the Virginia Tech Department of Philosophy. “Today the problems about data are everywhere, as are problems about ethics and values. The justification for this new field is if you don’t understand the underpinnings of statistics, you cannot understand the consequences of certain reforms that are being proposed or adopted.”

Mayo defines Phil Stat as the philosophical and conceptual foundations of statistical inference. The idea involves the formation of judgments about the measures that define a population and the reliability of statistical relationships, usually based on a random sampling of data. With this, Phil Stat analyzes the uses of probability in collecting, modeling, and learning from the data.

Aris Spanos, Mayo’s co-director of the seminar, said that during the past decade, many published, observation-based or experience-based research results in several disciplines within the medical and social sciences have been found not to be replicable. This has led some researchers to regard the results as untrustworthy, and several leading statisticians have been calling for reforms. Spanos said the need is pressing for a better understanding of the main sources of untrustworthy evidence and a balanced appraisal of the proposed reforms.

“We designed the seminar on the philosophy of statistics in response to these discussions to inform the participants about these debates,” said Spanos, the Wilson E. Schmidt Professor of Economics in the College of Science. “We wanted to provide them with a sufficient background in the philosophy of science and statistics to enable them to participate in these debates.”

Mayo and Spanos decided the seminar, held on Virginia Tech’s Blacksburg campus, would help advance scholarship in this new transdisciplinary area, which seminar participants could then integrate into their research and teaching. In response to their call for applicants, a selection committee invited 15 of the 55 faculty, postdoctoral fellows, and senior graduate students who applied to participate.

The participants were a diverse group. They came from Auburn University, Duke University, Lehman College at City University of New York, the Ohio State University, Princeton University, Radboud University, Rutgers University, St. John’s College at the University of Oxford, Université de Montréal, the University of Amsterdam, the University of Colorado at Boulder, the University of Illinois at Urbana-Champaign, and the University of Utah. An attorney from the New Jersey Office of the Public Defender also joined their ranks.

*Participants from the Virginia Tech Summer Seminar in Philosophy of Statistics included Dean Sally C. Morton (third from the right in the first row). Deborah Mayo and Aris Spanos appear to the left behind her.*

Sally C. Morton, dean of the College of Science and interim director of the Fralin Life Sciences Institute at Virginia Tech, attended one of the seminar sessions.

“The proper use of evidence in decision-making is essential to tackling the complex problems in society today,” said Morton. “The summer seminar that brought together the fields of statistics and philosophy demonstrated the power of using a transdisciplinary approach to give the attendees an expansive view of the challenges we face. I was delighted to see the deliberate inclusion of students and early-career researchers in the seminar.”

For two weeks, the group gathered with special guest speakers, both in person and through an online meeting platform. Presenting at the seminar were Andrew Gelman, a professor of statistics from Columbia University; Richard Morey, a reader for the School of Psychology at Cardiff University; Nathan Schachtman, a lawyer who specializes in scientific and medico-legal issues; and Stephen Senn, a consultant statistician from Edinburgh, Scotland.

The seminar, largely funded by Mayo and her husband, George Chatfield, through their Fund for Experimental Reasoning, Reliability, Objectivity, and Rationality of Science, also benefited from a number of sponsors. These included the College of Liberal Arts and Human Sciences, the College of Science, the Data and Decisions Destination Area, the Department of Philosophy, and the Department of Economics.

The summer seminar is not the first collaboration between Mayo and Spanos. In 2010, they coedited the book “Error and Inference: Recent Exchanges on Experimental Reasoning, Reliability, and the Objectivity and Rationality of Science.” Together they have also published six papers and book chapters in such publications as the British Journal for the Philosophy of Science, Synthese, and the Philosophy of Science. The contributions stemmed from a Virginia Tech conference, ERROR06, which included the statistician Sir David Cox and the philosophers Alan Chalmers, Clark Glymour, and Alan Musgrave.

More recently, Mayo authored the book “Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars,” published by Cambridge University Press.

The directors and participants will continue to propel Phil Stat beyond the summer experience through conferences, online publications, and an upcoming book, “Probability Theory and Statistical Inference: Modeling with Observational Data,” slated for publication by Cambridge University Press. As a group, they occasionally meet online and maintain a blog together. And they plan to present sessions at conferences.

“What we initiated here at Virginia Tech,” Mayo said, “will have a big impact not just on the way we think about the philosophy of science, but on how both it and the philosophy of knowledge are taught and integrated.”

-Written by Leslie King

© 2019 Virginia Polytechnic Institute and State University. All rights reserved.

See our website at SummerSeminarPhilStat.com

In Tour II of this first Excursion of Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (SIST, 2018, CUP), I pull back the cover on disagreements between experts charged with restoring integrity to today’s statistical practice. Some advised me to wait until later (in the book) to get to this eye-opener. Granted, the full story involves some technical issues, but after many months, I think I arrived at a way to get to the heart of things informally (with a promise of more detailed retracing of steps later on). It was too important not to reveal right away that some of the most popular “reforms” fall down on the job even with respect to our most minimal principle of evidence (you don’t have evidence for a claim if little if anything has been done to probe the ways it can be flawed).

All of Excursion 1 Tour II is *here*. After this post, I’ll resume regular blogging for a while, so you can catch up to us. Several free (signed) copies of SIST will be given away on Twitter shortly.

**1.4 The Law of Likelihood and Error Statistics**

If you want to understand what’s true about statistical inference, you should begin with what has long been a holy grail–to use probability to arrive at a type of logic of evidential support–and in the first instance you should look not at full-blown Bayesian probabilism, but at comparative accounts that sidestep prior probabilities in hypotheses. An intuitively plausible logic of comparative support was given by the philosopher Ian Hacking (1965)–the Law of Likelihood. Fortunately, the Museum of Statistics is organized by theme, and the Law of Likelihood and the related Likelihood Principle is a big one.

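*Law of Likelihood (LL):* Data **x** are better evidence for hypothesis *H*_{1} than for *H*_{0} if **x** is more probable under *H*_{1} than under *H*_{0}: Pr(**x**; *H*_{1}) > Pr(**x**; *H*_{0}), that is, the likelihood ratio (LR) of *H*_{1} over *H*_{0} exceeds 1.

*H*_{0} and …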

**Does the Law of Likelihood Obey the Minimal Requirement for Severity?**

Likelihoods are vital to all statistical accounts, but they are often misunderstood because the data are fixed and the hypothesis varies. Likelihoods of hypotheses should not be confused with their probabilities. Two ways to see this. First, suppose you discover all of the stocks in Pickrite’s promotional letter went up in value (**x**) – all winners. A hypothesis …

Suppose Bristol-Roach, in our Bernoulli tea tasting example, got two correct guesses followed by one failure. The observed data can be represented as *x*_{0} = <1,1,0>. Let the hypotheses be different values for θ, the probability of success on each independent trial. The likelihood of the hypothesis *H*_{0}: θ = 0.5, given *x*_{0}, is (0.5)(0.5)(0.5) = 0.125. …

The Law of Likelihood (LL) will immediately be seen to fail on our minimal severity requirement – at least if it is taken as an account of inference. Why? There is no onus on the Likelihoodist to predesignate the rival hypotheses – you are free to search, hunt, and post-designate a more likely, or even maximally likely, rival to a test hypothesis *H*_{0}.

Consider the hypothesis that θ = 1 on trials one and two and 0 on trial three. That makes the probability of **x** maximal. For another example, hypothesize that the observed pattern would always recur in three trials of the experiment (I. J. Good said in his cryptanalysis work these were called “kinkera”). Hunting for an impressive fit, or trying and trying again, one is sure to find a rival hypothesis …

Note that for any outcome of *n* Bernoulli trials, the likelihood of *H*_{0}: θ = 0.5 is (0.5)^{n} …

(*) Pr(LR in favor of *H*_{1} over *H*_{0} …

*Thus the LL permits BENT evidence.* The severity for *H*_{1} is minimal, though the particular …

What do we do to compute (*)? We look beyond the specific observed data to the behavior of the general rule or method, here the LL. The output is always a comparison of likelihoods. We observe one outcome, but we can consider that for any outcome, unless it makes *H*_{0} maximally likely, we can find an …
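The tea-tasting arithmetic can be checked with a minimal Python sketch. It rests on some assumptions of mine: the function names are hypothetical, the "tailored" rival is the post-designated hypothesis that sets the success probability on each trial equal to the observed outcome, and (*) is read as the probability, computed under *H*_{0}, that such a hunted rival ends up with a likelihood ratio in its favor.

```python
from itertools import product
from math import prod


def sequence_likelihood(thetas, outcomes):
    """Probability of the exact 0/1 sequence, given a success probability per trial."""
    return prod(t if x == 1 else 1 - t for t, x in zip(thetas, outcomes))


# Observed data: two correct guesses followed by one failure.
x0 = (1, 1, 0)

lik_h0 = sequence_likelihood([0.5] * 3, x0)            # H0: theta = 0.5 on every trial
lik_rival = sequence_likelihood([1.0, 1.0, 0.0], x0)   # rival tailored to x0 after the fact
lr = lik_rival / lik_h0                                # likelihood ratio, rival over H0


def prob_hunted_rival_wins(n, theta0=0.5):
    """Probability under H0 that a rival tailored to the outcome beats H0 on likelihood."""
    total = 0.0
    for outcome in product((0, 1), repeat=n):
        p_outcome = sequence_likelihood([theta0] * n, outcome)  # Pr(outcome; H0)
        tailored = [float(x) for x in outcome]                  # theta_i = x_i
        if sequence_likelihood(tailored, outcome) > p_outcome:
            total += p_outcome
    return total


print(lik_h0, lik_rival, lr)      # 0.125 1.0 8.0
print(prob_hunted_rival_wins(3))  # 1.0
```

On this reading, the comparison favors the hunted rival whatever the data turn out to be, which is why the method, and not just the one observed outcome, is what gets assessed.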

**To continue reading Excursion 1 Tour II, go here.**

__________

This excerpt comes from Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (Mayo, CUP 2018).

Here’s a link to all excerpts and mementos that I’ve posted (up to July 2019).

Mementos from Excursion I Tour II are here.

Blurbs of all 16 Tours can be found here.

Search topics of interest on this blog for the development of many of the ideas in SIST, and a rich sampling of comments from readers.

Continue to the third, and last stop of Excursion 1 Tour I of Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (2018, CUP)–Section 1.3. It would be of interest to ponder if (and how) the current state of play in the stat wars has shifted in just one year. I’ll do so in the comments. Use that space to ask me any questions.

How can a discipline, central to science and to critical thinking, have two methodologies, two logics, two approaches that frequently give substantively different answers to the same problems? … Is complacency in the face of contradiction acceptable for a central discipline of science? (Donald Fraser 2011, p. 329)

We [statisticians] are not blameless … we have not made a concerted professional effort to provide the scientific world with a unified testing methodology. (J. Berger 2003, p. 4)

From the aerial perspective of a hot-air balloon, we may see contemporary statistics as a place of happy multiplicity: the wealth of computational ability allows for the application of countless methods, with little handwringing about foundations. Doesn’t this show we may have reached “the end of statistical foundations”? One might have thought so. Yet, descending close to a marshy wetland, and especially scratching a bit below the surface, reveals unease on all sides. The false dilemma between probabilism and long-run performance lets us get a handle on it. In fact, the Bayesian versus frequentist dispute arises as a dispute between probabilism and performance. This gets to my second reason for why the time is right to jump back into these debates: the “statistics wars” present new twists and turns. Rival tribes are more likely to live closer and in mixed neighborhoods since around the turn of the century. Yet, to the beginning student, it can appear as a jungle.

**Statistics Debates: Bayesian versus Frequentist**

These days there is less distance between Bayesians and frequentists, especially with the rise of objective [default] Bayesianism, and we may even be heading toward a coalition government. (Efron 2013, p. 145)

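A central way to formally capture probabilism is by means of the formula for conditional probability, where Pr(**x**) > 0:

Pr(*H*|**x**) = Pr(*H* and **x**) / Pr(**x**).

Since Pr(*H* and **x**) = Pr(**x**|*H*)Pr(*H*) and Pr(**x**) = Pr(**x**|*H*)Pr(*H*) + Pr(**x**|~*H*)Pr(~*H*), we get:

Pr(*H*|**x**) = Pr(**x**|*H*)Pr(*H*) / [Pr(**x**|*H*)Pr(*H*) + Pr(**x**|~*H*)Pr(~*H*)],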

where ~*H* is the denial of *H*. It would be cashed out in terms of all rivals to *H* within a frame of reference. Some call it Bayes’ Rule or inverse probability. Leaving probability uninterpreted for now, if the data are very improbable given *H*, then our probability in *H* after seeing **x**, the *posterior* probability Pr(*H*|**x**), may be lower than the probability in *H* prior to **x**, the *prior* probability Pr(*H*). Bayes’ Theorem is just a theorem stemming from the definition of conditional probability; it is only when statistical inference is thought to be encompassed by it that it becomes a statistical philosophy. Using Bayes’ Theorem doesn’t make you a Bayesian.
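A small numerical sketch may help fix the point that data improbable under *H* can pull the posterior below the prior. The function name and the numbers are illustrative assumptions of mine, not taken from the text; the calculation is just Bayes’ Theorem with ~*H* standing in for all rivals to *H*.

```python
def posterior(prior_h, lik_x_given_h, lik_x_given_not_h):
    """Pr(H|x) by Bayes' Theorem, with ~H the denial of H."""
    joint_h = lik_x_given_h * prior_h                  # Pr(x|H) Pr(H)
    joint_not_h = lik_x_given_not_h * (1 - prior_h)    # Pr(x|~H) Pr(~H)
    marginal_x = joint_h + joint_not_h                 # Pr(x), required to be > 0
    if marginal_x <= 0:
        raise ValueError("Pr(x) must be positive")
    return joint_h / marginal_x


# Illustrative (assumed) inputs: x is improbable under H, fairly probable under ~H.
prior = 0.5
post = posterior(prior, lik_x_given_h=0.1, lik_x_given_not_h=0.6)

print(round(post, 3))  # 0.143
print(post < prior)    # True
```

Nothing in the computation says how the probabilities are to be interpreted, which is the sense in which using the theorem does not by itself commit anyone to a Bayesian philosophy.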

Larry Wasserman, a statistician and master of brevity, boils it down to a contrast of goals. According to him (2012b):

The Goal of Frequentist Inference: Construct procedure with frequentist guarantees [i.e., low error rates].

The Goal of Bayesian Inference: Quantify and manipulate your degrees of beliefs. In other words, Bayesian inference is the Analysis of Beliefs.

At times he suggests we use B(*H*) for belief and F(*H*) for frequencies. The distinctions in goals are too crude, but they give a feel for what is often regarded as the Bayesian-frequentist controversy. However, they present us with the false dilemma (performance or probabilism) I’ve said we need to get beyond.

Today’s Bayesian–frequentist debates clearly differ from those of some years ago. In fact, many of the same discussants, who only a decade ago were arguing for the irreconcilability of frequentist *P*-values and Bayesian measures, are now smoking the peace pipe, calling for ways to unify and marry the two. I want to show you what really drew me back into the Bayesian–frequentist debates sometime around 2000. If you lean over the edge of the gondola, you can hear some Bayesian family feuds starting around then or a bit after. Principles that had long been part of the Bayesian hard core are being questioned or even abandoned by members of the Bayesian family. Suddenly sparks are flying, mostly kept shrouded within Bayesian walls, but nothing can long be kept secret even there. Spontaneous combustion looms. Hard core subjectivists are accusing the increasingly popular “objective (non-subjective)” and “reference” Bayesians of practicing in bad faith; the new frequentist–Bayesian unificationists are taking pains to show they are not subjective; and some are calling the new Bayesian kids on the block “pseudo Bayesian.” Then there are the Bayesians camping somewhere in the middle (or perhaps out in left field) who, though they still use the Bayesian umbrella, are flatly denying the very idea that Bayesian updating fits anything they actually do in statistics. Obeisance to Bayesian reasoning remains, but on some kind of a priori philosophical grounds. Let’s start with the unifications.

While subjective Bayesianism offers an algorithm for coherently updating prior degrees of belief in possible hypotheses *H*_{1}, *H*_{2}, …, *H*_{n}, these unifications fall under the umbrella of non-subjective Bayesian paradigms. Here the prior probabilities in hypotheses are not taken to express degrees of belief but are given by various formal assignments, ideally to have minimal impact on the posterior probability. I will call such Bayesian priors …

True blue subjective Bayesians are understandably unhappy with non-subjective priors. Rather than quantify prior beliefs, non-subjective priors are viewed as primitives or conventions for obtaining posterior probabilities. Take Jay Kadane (2008):

The growth in use and popularity of Bayesian methods has stunned many of us who were involved in exploring their implications decades ago. The result … is that there are users of these methods who do not understand the *philosophical basis of the methods* they are using, and hence may misinterpret or badly use the results … No doubt helping people to use Bayesian methods more appropriately is an important task of our time. (p. 457, emphasis added)

I have some sympathy here: Many modern Bayesians aren’t aware of the traditional philosophy behind the methods they’re buying into. Yet there is not just one philosophical basis for a given set of methods. This takes us to one of the most dramatic shifts in contemporary statistical foundations. It had long been assumed that only subjective or personalistic Bayesianism had a shot at providing genuine philosophical foundations, but you’ll notice that groups holding this position, while they still dot the landscape in 2018, have been gradually shrinking. Some Bayesians have come to question whether the widespread use of methods under the Bayesian umbrella, however useful, indicates support for subjective Bayesianism as a foundation.

**Marriages of Convenience?**

The current frequentist–Bayesian unifications are often marriages of convenience; statisticians rationalize them less on philosophical than on practical grounds. For one thing, some are concerned that methodological conflicts are bad for the profession. For another, frequentist tribes, contrary to expectation, have not disappeared. Ensuring that accounts can control their error probabilities remains a desideratum that scientists are unwilling to forgo. Frequentists have an incentive to marry as well. Lacking a suitable epistemic interpretation of error probabilities – significance levels, power, and confidence levels – frequentists are constantly put on the defensive. Jim Berger (2003) proposes a construal of significance tests on which the tribes of Fisher, Jeffreys, and Neyman could agree, yet none of the chiefs of those tribes concur (Mayo 2003b). The success stories are based on agreements on numbers that are not obviously true to any of the three philosophies. Beneath the surface – while it’s not often said in polite company – the most serious disputes live on. I plan to lay them bare.

If it’s assumed an evidential assessment of hypothesis *H* should take the form of a posterior probability of *H* – a form of probabilism – then *P*-values and confidence levels are applicable only through misinterpretation and mistranslation. Resigned to live with *P*-values, some are keen to show that construing them as posterior probabilities is not so bad (e.g., Greenland and Poole 2013). Others focus on long-run error control, but cede territory wherein probability captures the epistemological ground of statistical inference. Why assume significance levels and confidence levels lack an authentic epistemological function? I say they do [have one]: to secure and evaluate how well probed and how severely tested claims are.
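To make the point about construing *P*-values as posterior probabilities concrete, here is a minimal sketch that is not taken from the text. It assumes normal data with unit variance and a standardized sample mean *z*, and it compares a one-sided *P*-value with two posteriors: one under a flat prior on the mean, and one under a hypothetical "spike and slab" prior that puts mass 0.5 on the point null. The function names, the slab standard deviation `tau`, and the sample size are all illustrative assumptions.

```python
from math import exp, sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def one_sided_p_value(z):
    """P(Z >= z) under H0: mu = 0, where z = sqrt(n) * sample mean."""
    return 1.0 - STD_NORMAL.cdf(z)


def flat_prior_posterior_mu_le_0(z):
    """Posterior P(mu <= 0 | data) under an improper flat prior on mu."""
    # With a flat prior, mu | data ~ N(sample mean, 1/n), so P(mu <= 0) = Phi(-z).
    return STD_NORMAL.cdf(-z)


def spike_and_slab_posterior_null(z, n, tau, prior_null=0.5):
    """Posterior P(mu = 0 | data) with prior mass prior_null on mu = 0
    and a N(0, tau^2) prior on mu otherwise (an assumed, hypothetical prior)."""
    k = n * tau ** 2
    # Bayes factor for H0 against H1: ratio of the marginal densities of the sample mean.
    bf01 = sqrt(1.0 + k) * exp(-0.5 * z ** 2 * k / (1.0 + k))
    posterior_odds = (prior_null / (1.0 - prior_null)) * bf01
    return posterior_odds / (1.0 + posterior_odds)


if __name__ == "__main__":
    z = 1.96
    print("one-sided P-value:           ", one_sided_p_value(z))
    print("flat-prior P(mu <= 0 | data):", flat_prior_posterior_mu_le_0(z))
    print("spike-and-slab P(H0 | data): ", spike_and_slab_posterior_null(z, n=100, tau=1.0))
```

Under these assumptions the one-sided *P*-value and the flat-prior posterior coincide (about 0.025 at *z* = 1.96), while the spike-and-slab posterior for the null is about 0.6. Whether a *P*-value "matches" a posterior thus depends on the prior chosen, which is one way to see why agreement on numbers need not reflect agreement on interpretation.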

**Eclecticism and Ecumenism**

If you look carefully between dense forest trees, you can distinguish unification country from lands of eclecticism (Cox 1978) and ecumenism (Box 1983), where tools first constructed by rival tribes are separate, and more or less equal (for different aims). Current-day eclecticisms have a long history – the dabbling in tools from competing statistical tribes has not been thought to pose serious challenges. For example, frequentist methods have long been employed to check or calibrate Bayesian methods (e.g., Box 1983); you might test your statistical model using a simple significance test, say, and then proceed to Bayesian updating. Others suggest scrutinizing a posterior probability or a likelihood ratio from an error probability standpoint. What this boils down to will depend on the notion of probability used. If a procedure frequently gives high probability for claim *C* even if *C* is false, severe testers deny convincing evidence has been provided, and never mind about the meaning of probability. One argument is that throwing different methods at a problem is all to the good, that it increases the chances that at least one will get it right. This may be so, provided one understands how to interpret competing answers. Using multiple methods is valuable when a shortcoming of one is rescued by a strength in another. For example, when randomized studies are used to expose the failure to replicate observational studies, there is a presumption that the former is capable of discerning problems with the latter. But what happens if one procedure fosters a goal that is not recognized or is even opposed by another? Members of rival tribes are free to sneak ammunition from a rival’s arsenal – but what if at the same time they denounce the rival method as useless or ineffective?
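Scrutinizing a posterior probability from an error probability standpoint can be illustrated with a small simulation. The sketch below is an assumed example, not one from the text: data are N(mu, 1), the claim is *C*: mu > 0, the posterior is computed under a flat prior, and the hypothetical procedure keeps sampling until the posterior for *C* exceeds 0.95 or a maximum sample size is reached. Setting mu = 0 makes *C* false, so the simulated rate estimates how often the procedure gives *C* high probability erroneously.

```python
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()


def posterior_mu_positive(sample_mean, n):
    """Posterior P(mu > 0 | data) for N(mu, 1) data under a flat prior on mu."""
    return STD_NORMAL.cdf(sample_mean * n ** 0.5)


def reports_high_probability(mu, max_n, threshold, rng):
    """Draw one observation at a time and stop as soon as the posterior
    for C: mu > 0 exceeds the threshold. Returns True if that ever happens."""
    total = 0.0
    for n in range(1, max_n + 1):
        total += rng.gauss(mu, 1.0)
        if posterior_mu_positive(total / n, n) > threshold:
            return True
    return False


def false_report_rate(max_n, threshold=0.95, trials=2000, seed=0):
    """Relative frequency of 'high probability for C' when C is false (mu = 0)."""
    rng = random.Random(seed)
    hits = sum(
        reports_high_probability(0.0, max_n, threshold, rng) for _ in range(trials)
    )
    return hits / trials


if __name__ == "__main__":
    for max_n in (1, 10, 100, 500):
        print(f"up to {max_n:>3} looks: rate = {false_report_rate(max_n):.3f}")
```

With a single look the rate stays near 0.05, but it climbs well above that as more looks are allowed. Each reported posterior is computed in the same way regardless of the stopping rule; it is the error probability of the overall procedure that changes, and that is what a severe tester would ask about.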

*Decoupling.* On the horizon is the idea that statistical methods may be decoupled from the philosophies in which they are traditionally couched. In an attempted meeting of the minds (Bayesian and error statistical), Andrew Gelman and Cosma Shalizi (2013) claim that “implicit in the best Bayesian practice is a stance that has much in common with the error-statistical approach of Mayo” (p. 10). In particular, Bayesian model checking, they say, uses statistics to satisfy Popperian criteria for

**Why Our Journey?**

We have all, or nearly all, moved past these old [Bayesian-frequentist] debates, yet our textbook explanations have not caught up with the eclecticism of statistical practice. (Kass 2011, p. 1)

When Kass proffers “a philosophy that matches contemporary attitudes,” he finds resistance to his big tent. Being hesitant to reopen wounds from old battles does not heal them. Distilling them in inoffensive terms just leads to the marshy swamp. Textbooks can’t “catch up” by soft-pedaling competing statistical accounts. The old battles show up in the current problems of scientific integrity, irreproducibility, questionable research practices, and in the swirl of methodological reforms and guidelines that spin their way down from journals and reports.

From an elevated altitude we see how it occurs. Once high-profile failures of replication spread to biomedicine, and other “hard” sciences, the problem took on a new seriousness. Where does the new scrutiny look? By and large, it collects from the earlier social science “significance test controversy” and the traditional philosophies coupled to Bayesian and frequentist accounts, along with the newer Bayesian–frequentist unifications we just surveyed. This jungle has never been disentangled. No wonder leading reforms and semi-popular guidebooks contain misleading views about all these tools. No wonder we see the same fallacies that earlier reforms were designed to avoid, and even brand new ones. Let me be clear, I’m not speaking about flat-out howlers such as interpreting a *P*-value as a posterior probability. By and large, the misleading views are more subtle; you’ll want to reach your own position on them. It’s not a matter of switching your tribe, but of excavating the roots of tribal warfare, to tell what’s true about them. I don’t mean understand them at the socio-psychological levels, although there’s a good story there (and I’ll leak some of the juicy parts during our travels).

*How can we make progress when it is difficult even to tell what is true about the different methods of statistics?* We must start afresh, taking responsibility to offer a new standpoint from which to interpret the cluster of tools around which there has been so much controversy. Only then can we alter and extend their limits. I admit that the statistical philosophy that girds our explorations is not out there ready-made; if it were, there would be no need for our holiday cruise. While there are plenty of giant shoulders on which we stand, we won’t be restricted by the pronouncements of any of the high and low priests, as sagacious as many of their words have been. In fact, we’ll brazenly question some of their most entrenched mantras. Grab on to the gondola, our balloon’s about to land.

In Tour II, I’ll give you a glimpse of the core behind statistics battles, with a firm promise to retrace the steps more slowly in later trips.

**FOR ALL OF TOUR I: SIST Excursion 1 Tour I**

**THE FULL ITINERARY:** *Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars*: **SIST Itinerary**

**REFERENCES:**

Berger, J. (2003). ‘Could Fisher, Jeffreys and Neyman Have Agreed on Testing?’ and ‘Rejoinder’, *Statistical Science* 18(1), 1–12; 28–32.

Box, G. (1983). ‘An Apology for Ecumenism in Statistics’, in Box, G., Leonard, T., and Wu, D. (eds.), *Scientific Inference, Data Analysis, and Robustness*, New York: Academic Press, 51–84.

Cox, D. (1978). ‘Foundations of Statistical Inference: The Case for Eclecticism’, *Australian Journal of Statistics* 20(1), 43–59.

Efron, B. (2013). ‘A 250-Year Argument: Belief, Behavior, and the Bootstrap’, *Bulletin of the American Mathematical Society* 50(1), 126–46.

Fraser, D. (2011). ‘Is Bayes Posterior Just Quick and Dirty Confidence?’ and ‘Rejoinder’, *Statistical Science* 26(3), 299–316; 329–31.

Gelman, A. and Shalizi, C. (2013). ‘Philosophy and the Practice of Bayesian Statistics’ and ‘Rejoinder’, *British Journal of Mathematical and Statistical Psychology* 66(1), 8–38; 76–80.

Greenland, S. and Poole, C. (2013). ‘Living with P Values: Resurrecting a Bayesian Perspective on Frequentist Statistics’ and ‘Rejoinder: Living with Statistics in Observational Research’, *Epidemiology* 24(1), 62–8; 73–8.

Kadane, J. (2008). ‘Comment on Article by Gelman’, *Bayesian Analysis* 3(3), 455–8.

Kass, R. (2011). ‘Statistical Inference: The Big Picture (with discussion and rejoinder)’, *Statistical Science* 26(1), 1–20.

Mayo, D. (2003b). ‘Could Fisher, Jeffreys and Neyman Have Agreed on Testing? Commentary on J. Berger’s Fisher Address’, *Statistical Science* 18, 19–24.

Wasserman, L. (2012b). ‘What is Bayesian/Frequentist Inference?’, Blogpost on normaldeviate.wordpress.com (11/7/2012).
