Author Archives: Mayo

Tom Sterkenburg Reviews Mayo’s “Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars” (2018, CUP)

T. Sterkenburg

Tom Sterkenburg, PhD
Postdoctoral Fellow
Munich Center for Mathematical Philosophy
LMU Munich
Munich, Germany

Deborah G. Mayo: Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars

The foundations of statistics is not a land of peace and quiet. “Tribal warfare” is perhaps putting it too strong, but it is the case that for decades now various camps and subcamps have been exchanging heated arguments about the right statistical methodology. That these skirmishes are not just an academic exercise is clear from the widespread use of statistical methods, and contemporary challenges that cry for more secure foundations: the rise of big data, the replication crisis.

One often hears that to blame are classical, frequentist methods, that lack a proper justification and are easily misused at that; so that it is all a matter of stepping up our efforts to spread the Bayesian philosophy. This not only ignores the various conflicting views within the Bayesian camp, but also gives too little credit to opposing philosophical perspectives. In particular, this does not do justice to the work of philosopher of statistics Deborah Mayo. Perhaps most famously in her Lakatos Award winning Error and the Growth of Experimental Knowledge (1996), Mayo has been developing an account of statistical and scientific inference that builds on Popper’s falsificationist philosophy and frequentist statistics. She has now written a new book, with the stated goal of helping us get beyond the statistics wars.

This work is a genuine tour de force. Mayo weaves together an extraordinary amount of philosophical themes, technical discussions, and historical anecdotes into a lively and engaging exposition of what she calls the error-statistical philosophy. Like few other works in the area Mayo instills in the reader an appreciation for both the interest and the significance of the topic of statistical methodology, and indeed for the importance of philosophers engaging with it.

That does not yet make the book an easy read. In fact, the downside of Mayo’s conversational style of presentation is that it can take a serious effort on the reader’s part to distill the argumentative structure and how various observations and explanations hang together. This, unfortunately, also limits its use somewhat for those intended readers that are new to the discussed topics.

In the following I will summarize the book, and conclude with some general remarks. (Mayo organizes her book into “excursions” divided into “tours”—we are invited to imagine we are on a cruise—but below I will stick to chapters divided into parts.)

Chapter 1 serves as a warm-up. In the course of laying out the motivation for the book’s project, Mayo introduces severity as a requirement for evidence. On the weak version of the severity criterion, one does not have evidence for a claim C if the method used to arrive at evidence x, even if x agrees with C, had little capability of finding flaws with C even if they exist (Mayo also uses the acronym BENT: bad evidence, no test). On its strong version, if C passes a test that did have high capability of contradicting C, then the passing outcome x is evidence, or at least an indication, for C. The double role for statistical inference is to identify BENT cases, where we actually have poor evidence; and, using strong severity, to mount positive arguments from coincidence.

Thus if a statistical philosophy is to tell us what we seek to quantify using probability, then Mayo’s error-statistical philosophy says that this is “well-testedness” or probativeness. This she sets apart from probabilism, which sees probability as a way of quantifying plausibility of hypotheses (tenet of the Bayesian approach), but also from performance, where probability is a method’s long-run frequency of faulty inferences (the classical, frequentist approach). Mayo is careful, too, to set her philosophy apart from recent efforts to unify or bridge Bayesian and frequentist statistics, approaches that she chastises as “marriages of convenience” that simply look away from the underlying philosophical incongruities. There is here an ambiguity in the nature of Mayo’s project, that remains unresolved throughout the book: is she indeed proposing a new perspective “to tell what is true about the different methods of statistics” (p. 28), the view-from-a-balloon that might finally get us beyond the statistics wars, or should we actually see her as joining the fray with a yet different competing account? What is certainly clear is that Mayo’s philosophy is much closer to the frequentist than the Bayesian school, so that an important application of the new perspective is to exhibit the flaws of the latter. In the second part of the chapter Mayo immediately gets down to business, revisiting a classic point of contention in the form of the likelihood principle.

In Chapter 2 the discussion shifts to Bayesian confirmation theory, in the context of traditional philosophy of science and the problem of induction. Mayo’s diagnosis is that the aim of confirmation theory is merely to try to spell out inductive method, having given up on actually providing justification for it; and in general, that philosophers of science now feel it is taboo to even try to make progress on this account. The latter assessment is not entirely fair, even if it is true that recent proposals addressing the problem of induction (notably those by John Norton and by Gerhard Schurz, who both abandon the idea of a single context-independent inductive method) are still far removed from actual scientific or statistical practice. More interesting than the familiar issues with confirmation theory Mayo lists in the first part of the chapter is therefore the positive account she defends in the second.

Here she discusses falsificationism and how the error-statistical account builds and improves on Popper’s ideas. We read about demarcation, Duhem’s problem, and novel predictions; but also about the replicability crisis in psychology and fallacies of significance tests. In the last section Mayo returns to the question that has been in the background all this time: what is the error-statistical answer to the problem of inductive inference? By then we have already been handed a number of clues: inferences to hypotheses are arguments from strong coincidence, that (unlike “inductive” but really still deductive probabilistic logics) provide genuine “lift-off”, and that (against Popperians) we are free to call warranted or justified. Mayo emphasises that the output of a statistical inference is not a belief; and it is undeniable that for the plausibility of an hypothesis severe testing is neither necessary (the problem of after-the-fact cooked-up hypotheses, Mayo points out, is exactly that they can be so plausible) nor sufficient (as illustrated by the base-rate fallacy). Nevertheless, the envisioned epistemic yield of a (warranted) inference remains agonizingly imprecise. For instance, we read that (sensibly enough) isolated significant results do not count; but when do results start counting, and how? Much is delegated to the dynamics of the overall inquiry, as further illustrated below.

Chapter 3 goes deeper into severe testing: as employed in actual cases of scientific inference, and as instantiated in methods from classical statistics. Thus the first part starts with the 1919 Eddington experiment to test Einstein’s relativity theory, and continues with a discussion of Neyman–Pearson (N–P) tests. The latter are then accommodated into the error-statistical story, with the admonition that the severity rationale goes beyond the usual behavioural warrant of N–P testing as the guarantee of being rarely wrong in repeated application. Moreover, it is stressed, the statistical methods given by N–P as well as Fisherian tests represent “canonical pieces of statistical reasoning, in their naked form as it were” (p. 150). In a real scientific inquiry these are only part of the investigator’s reservoir of error-probabilistic tools “both formal and quasi-formal”, providing the parts that “are integrated in building up arguments from coincidence, informing background theory, self-correcting […], in an iterative movement” (p. 162).

In the next part of Chapter 3, Mayo defends the classical methods against an array of attacks launched from different directions. Apart from some old charges (or “howlers and chestnuts of statistical tests”), these include the accusations arising from the “family feud” between adherents of Fisher and Neyman–Pearson. Mayo argues that the purported different interpretational stances of the founders (Fisher’s more evidential outlook versus Neyman’s more behaviourist position) are a bad reason to preclude a unified view on both methodologies. In the third part, Mayo extends this discussion to incorporate confidence intervals, and the chapter concludes with another illustration of statistical testing in actual scientific inference, the 2012 discovery of the Higgs boson.

The different parts of Chapter 4 revolve around the theme of objectivity. First up is the “dirty hands argument”, the idea that since we can never be free of the influence of subjective choices, all statistical methods must be (equally) subjective. The mistake, Mayo says, is to assume that we are incapable of registering and managing these inevitable threats to objectivity. The subsequent dismissal of the Bayesian way of taking into account—or indeed embracing—subjectivity is followed, in the second part of the chapter, by a response to a series of Bayesian critiques of frequentist methods, and particularly the charge that, as compared to Bayesian posterior probabilities, P values overstate the evidence. The crux of Mayo’s reply is that “it’s erroneous to fault one statistical philosophy from the perspective of a philosophy with a different and incompatible conception of evidence or inference” (p. 265). This is certainly a fair point, but could just as well be turned against her own presentation of the error-statistical perspective as a meta-methodology. Of course, the lesson we are actually encouraged to draw is that an account of evidence in terms of severe testing is preferable to one in terms of plausibility. For this Mayo makes a strong case, in the next part, in connection to the need for tools to intercept various illegitimate research practices. The remainder of the chapter is devoted to some other important themes around frequentist methods: randomization, the trope that “all models are false”, and model validation.

Chapter 5 is a relatively technical chapter about the notion of a test’s power. Mayo addresses some purported misunderstandings around the use of power, and discusses the notion of attained or post-data power, combining elements of N–P and of Fisher, as part of her severity account. Later in the chapter we revisit the replication crisis, and in the last part we are given an entertaining “deconstruction” of the debates between N–P and Fisher. Finally, in Chapter 6, Mayo takes one last look at the probabilistic “foundations lost”, to clear the way for her parting proclamation of the new probative foundations. She discusses the retreat by theoreticians from full-blown subjective Bayesianism, the shaky grounds under objective or default Bayesianism, and attempts at unification (“schizophrenia”) or flat-out pragmatism. Saved till the end, fittingly, is the recent “falsificationist Bayesianism” that emerges from the writings of Andrew Gelman, who indeed adopts important elements of the error-statistical philosophy.

It seems only a plausible if not warranted inductive inference that the statistics wars will rage on for a while; but what, towards an assessment of Mayo’s programme, should we be looking for in a foundational account of statistics? The philosophical attraction of the dominant Bayesian approach lies in its promise of a principled and unified account of rational inference. It appears to be too rigid, however, in suggesting a fully mechanical method of inference: after you fix your prior it is, on the standard conception, just a matter of conditionalizing. At the same time it appears to leave too much open, in allowing you to reconstruct any desired reasoning episode by suitable choice of model and prior. Mayo is very clear that her account resists the first: we are not looking for a purely formal account, a single method that can be mindlessly pursued. Still, the severity rationale is emphatically meant to be restrictive: to expose certain inferences as unwarranted. But the threat of too much flexibility is still lurking in how much is delegated to the messy context of the overall inquiry. If too much is left to context-dependent expert judgment, for instance, the account risks forfeiting its advertised capacity to help us hold the experts accountable for their inferences. This motivates the desire for a more precise philosophical conception, if possible, of what inferences count as warranted and how. What Mayo’s book should certainly convince us of is the value of seeking to develop her programme further, and for that reason alone the book is recommended reading for all philosophers (not least those of the Bayesian denomination) concerned with the foundations of statistics.

Sterkenburg, T. (2020). “Deborah G. Mayo: Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars”, Journal for General Philosophy of Science 51: 507–510. (Link to review.)

Excerpts, mementos, and sketches of 16 tours (including links to proofs) are here. 

Categories: SIST, Statistical Inference as Severe Testing–Review, Tom Sterkenburg | 6 Comments

CUNY zoom talk on Wednesday: Evidence as Passing a Severe Test

If interested, write to me for the zoom link.


High-profile failures of replication in the social and biological sciences underwrite a minimal requirement of evidence: If little or nothing has been done to rule out flaws in inferring a claim, then there isn’t evidence for it. It has not passed even a minimally severe test. A claim is severely tested to the extent it has been subjected to and passes a test that probably would have found flaws, were they present. This minimal severe-testing requirement leads to reformulating significance tests (and related methods) to avoid familiar criticisms and abuses. Viewing statistical inference as severe testing–whether or not you accept it–offers a key to understand and get beyond the statistics wars.

Categories: Announcement | Leave a comment

April 22 “How an information metric could bring truce to the statistics wars” (Daniele Fanelli)

The eighth meeting of our Phil Stat Forum*:

The Statistics Wars
and Their Casualties

22 April 2021

TIME: 15:00-16:45 (London); 10:00-11:45 (New York, EDT)

For information about the Phil Stat Wars forum and how to join, click on this link.

“How an information metric could bring truce to the statistics wars”

Daniele Fanelli

Abstract: Both sides of debates on P-values, reproducibility, and other meta-scientific issues are entrenched in traditional methodological assumptions. For example, they often implicitly endorse rigid dichotomies (e.g. published findings are either “true” or “false”, replications either “succeed” or “fail”, research practices are either “good” or “bad”), or make simplifying and monistic assumptions about the nature of research (e.g. publication bias is generally a problem, all results should replicate, data should always be shared).

Thinking about knowledge in terms of information may clear a common ground on which all sides can meet, leaving behind partisan methodological assumptions. In particular, I will argue that a metric of knowledge that I call “K” helps examine research problems in a more genuinely “meta-” scientific way, giving rise to a methodology that is distinct, more general, and yet compatible with multiple statistical philosophies and methodological traditions.

This talk will present statistical, philosophical and scientific arguments in favour of K, and will give a few examples of its practical applications.

Daniele Fanelli is a London School of Economics Fellow in Quantitative Methodology, Department of Methodology, London School of Economics and Political Science. He graduated in Natural Sciences, earned a PhD in Behavioural Ecology and trained as a science communicator, before devoting his postdoctoral career to studying the nature of science itself – a field increasingly known as meta-science or meta-research. He has been primarily interested in assessing and explaining the prevalence, causes and remedies to problems that may affect research and publication practices, across the natural and social sciences. Fanelli helps answer these and other questions by analysing patterns in the scientific literature using meta-analysis, regression and any other suitable methodology. He is a member of the Research Ethics and Bioethics Advisory Committee of Italy’s National Research Council, for which he developed the first research integrity guidelines, and of the Research Integrity Committee of the Luxembourg Agency for Research Integrity (LARI).


Fanelli D (2019) A theory and methodology to quantify knowledge. Royal Society Open Science – (PDF)

4 page Background: Fanelli D (2018) Is science really facing a reproducibility crisis, and do we need it to? PNAS – (PDF)



*Meeting 16 of our general Phil Stat series, which began with the LSE Seminar PH500 on May 21

Categories: Phil Stat Forum, replication crisis, stat wars and their casualties | Leave a comment

A. Spanos: Jerzy Neyman and his Enduring Legacy (guest post)

I am reblogging a guest post that Aris Spanos wrote for this blog on Neyman’s birthday some years ago.   

A. Spanos

A Statistical Model as a Chance Mechanism
Aris Spanos 

Jerzy Neyman (April 16, 1894 – August 5, 1981), was a Polish/American statistician[i] who spent most of his professional career at the University of California, Berkeley. Neyman is best known in statistics for his pioneering contributions in framing the Neyman-Pearson (N-P) optimal theory of hypothesis testing and his theory of Confidence Intervals. (This article was first posted here.)


Neyman: 16 April 1894 – 5 Aug 1981

One of Neyman’s most remarkable, but least recognized, achievements was his adapting of Fisher’s (1922) notion of a statistical model to render it pertinent for non-random samples. Fisher’s original parametric statistical model $M_{\theta}(x)$ was based on the idea of ‘a hypothetical infinite population’, chosen so as to ensure that the observed data $x_0 := (x_1, x_2, \ldots, x_n)$ can be viewed as a ‘truly representative sample’ from that ‘population’:

“The postulate of randomness thus resolves itself into the question, ‘Of what population is this a random sample?’” (ibid., p. 313), underscoring that “the adequacy of our choice may be tested a posteriori.” (p. 314)

In cases where data $x_0$ come from sample surveys, or can be viewed as a typical realization of a random sample $X := (X_1, X_2, \ldots, X_n)$, i.e. Independent and Identically Distributed (IID) random variables, the ‘population’ metaphor can be helpful in adding some intuitive appeal to the inductive dimension of statistical inference, because one can imagine using a subset of a population (the sample) to draw inferences pertaining to the whole population.

This ‘infinite population’ metaphor, however, is of limited value in most applied disciplines relying on observational data. To see how inept this metaphor is, consider the question: what is the hypothetical ‘population’ when modeling the gyrations of stock market prices? More generally, what is observed in such cases is a certain on-going process and not a fixed population from which we can select a representative sample. For that very reason, most economists in the 1930s considered Fisher’s statistical modeling irrelevant for economic data!

Due primarily to Neyman’s experience with empirical modeling in a number of applied fields, including genetics, agriculture, epidemiology, biology, astronomy and economics, his notion of a statistical model evolved beyond Fisher’s ‘infinite populations’ in the 1930s into Neyman’s frequentist ‘chance mechanisms’ (see Neyman, 1950, 1952):

Guessing and then verifying the ‘chance mechanism’, the repeated operation of which produces the observed frequencies. This is a problem of ‘frequentist probability theory’. Occasionally, this step is labeled ‘model building’. Naturally, the guessed chance mechanism is hypothetical. (Neyman, 1977, p. 99)

From my perspective, this was a major step forward for several reasons, including the following.

First, the notion of a statistical model as a ‘chance mechanism’ extended the intended scope of statistical modeling to include dynamic phenomena that give rise to data from non-IID samples, i.e. data that exhibit both dependence and heterogeneity, like stock prices.

Second, the notion of a statistical model as a ‘chance mechanism’ is not only of metaphorical value, but it can be operationalized in the context of a statistical model, formalized by:

$$M_{\theta}(x) = \{ f(x;\theta),\ \theta \in \Theta \}, \quad x \in \mathbb{R}^{n}, \quad \Theta \subset \mathbb{R}^{m}, \quad m \ll n,$$

where the distribution of the sample $f(x;\theta)$ describes the probabilistic assumptions of the statistical model. This takes the form of a statistical Generating Mechanism (GM), stemming from $f(x;\theta)$, that can be used to generate simulated data on a computer. An example of such a Statistical GM is:

$$X_t = \alpha_0 + \alpha_1 X_{t-1} + \sigma \varepsilon_t, \quad t = 1, 2, \ldots, n$$

This indicates how one can use pseudo-random numbers for the error term $\varepsilon_t \sim \mathrm{NIID}(0,1)$ to simulate data for the Normal, AutoRegressive [AR(1)] Model. One can generate numerous sample realizations, say N=100000, of sample size n in seconds on a PC.
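To make the statistical GM concrete, here is a minimal sketch in Python (standard library only) that draws realizations from the AR(1) mechanism above. The function names, the parameter values, and the initial condition X_0 = 0 are illustrative assumptions and are not taken from the post.

```python
import random

def simulate_ar1(alpha0, alpha1, sigma, n, x0=0.0, rng=None):
    """Return one realization (X_1, ..., X_n) of
    X_t = alpha0 + alpha1 * X_{t-1} + sigma * eps_t, with eps_t ~ NIID(0, 1)."""
    if rng is None:
        rng = random.Random()
    xs = []
    prev = x0  # assumed initial condition X_0; the post does not specify one
    for _ in range(n):
        eps = rng.gauss(0.0, 1.0)  # pseudo-random standard Normal error term
        prev = alpha0 + alpha1 * prev + sigma * eps
        xs.append(prev)
    return xs

def simulate_many(alpha0, alpha1, sigma, n, N, seed=0):
    """Generate N independent sample realizations, each of size n."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    return [simulate_ar1(alpha0, alpha1, sigma, n, rng=rng) for _ in range(N)]

# Hypothetical parameter values, chosen only for illustration.
samples = simulate_many(alpha0=1.0, alpha1=0.5, sigma=1.0, n=50, N=1000)
print(len(samples), len(samples[0]))
```

Each inner list is one sample realization of size n. Raising N simply repeats the same mechanism, which is the sense in which the GM is repeatable in principle.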

Third, the notion of a statistical model as a ‘chance mechanism’ puts a totally different spin on another metaphor widely used by uninformed critics of frequentist inference. This is the ‘long-run’ metaphor associated with the relevant error probabilities used to calibrate frequentist inferences. The operationalization of the statistical GM reveals that the temporal aspect of this metaphor is totally irrelevant for the frequentist inference; remember Keynes’s catch phrase “In the long run we are all dead”? Instead, what matters in practice is its repeatability in principle, not over time! For instance, one can use the above statistical GM to generate the empirical sampling distributions for any test statistic, and thus render operational not only the pre-data error probabilities, such as the type I and type II error probabilities and the power of a test, but also the post-data probabilities associated with the severity evaluation; see Mayo (1996).
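As a rough illustration of this point, the sketch below uses the AR(1) GM to build empirical sampling distributions of a test statistic and to read off pre-data error probabilities from them. The choices here are assumptions made for the example, not the post's own procedure: the statistic is the OLS slope of X_t on X_{t-1}, the null is α1 = 0, the alternative is α1 = 0.5, the test is one-sided at a nominal 5% level, and X_0 = 0. Post-data severity is not computed.

```python
import random

def simulate_ar1(alpha0, alpha1, sigma, n, rng, x0=0.0):
    """One realization of X_t = alpha0 + alpha1 * X_{t-1} + sigma * eps_t."""
    xs, prev = [], x0  # x0 = 0 is an assumed initial condition
    for _ in range(n):
        prev = alpha0 + alpha1 * prev + sigma * rng.gauss(0.0, 1.0)
        xs.append(prev)
    return xs

def ols_slope(xs):
    """OLS estimate of alpha1 from regressing X_t on X_{t-1} with an intercept."""
    y, z = xs[1:], xs[:-1]
    m = len(y)
    zbar, ybar = sum(z) / m, sum(y) / m
    sxx = sum((a - zbar) ** 2 for a in z)
    sxy = sum((a - zbar) * (b - ybar) for a, b in zip(z, y))
    return sxy / sxx

def sampling_distribution(alpha1, n, N, rng):
    """Sorted empirical sampling distribution of the statistic over N realizations."""
    return sorted(ols_slope(simulate_ar1(0.0, alpha1, 1.0, n, rng)) for _ in range(N))

rng = random.Random(1)
n, N = 50, 2000

# Under the null (alpha1 = 0): take the empirical 95th percentile as the threshold.
null_dist = sampling_distribution(0.0, n, N, rng)
c = null_dist[int(0.95 * N)]
type_I = sum(d > c for d in null_dist) / N

# Under the assumed alternative (alpha1 = 0.5): estimate power and type II error.
alt_dist = sampling_distribution(0.5, n, N, rng)
power = sum(d > c for d in alt_dist) / N
type_II = 1.0 - power

print(f"threshold={c:.3f} type I={type_I:.3f} power={power:.3f} type II={type_II:.3f}")
```

Nothing in the calculation refers to time. The error probabilities come from repeating the mechanism N times on a computer, not from waiting for a long run.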

I have restored all available links to the following references.

For further discussion on the above issues see:

Spanos, A. (2013), “A Frequentist Interpretation of Probability for Model-Based Inductive Inference,” in Synthese.

Fisher, R. A. (1922), “On the mathematical foundations of theoretical statistics,” Philosophical Transactions of the Royal Society A, 222: 309-368.

Mayo, D. G. (1996), Error and the Growth of Experimental Knowledge, The University of Chicago Press, Chicago.

Neyman, J. (1950), First Course in Probability and Statistics, Henry Holt, NY.

Neyman, J. (1952), Lectures and Conferences on Mathematical Statistics and Probability, 2nd ed. U.S. Department of Agriculture, Washington.

Neyman, J. (1977), “Frequentist Probability and Frequentist Statistics,” Synthese, 36, 97-131.

[i]He was born in an area that was part of Russia.

Categories: Neyman, Spanos | Leave a comment

Happy Birthday Neyman: What was Neyman opposing when he opposed the ‘Inferential’ Probabilists?


Today is Jerzy Neyman’s birthday (April 16, 1894 – August 5, 1981). I’m posting a link to a quirky paper of his that explains one of the most misunderstood of his positions–what he was opposed to in opposing the “inferential theory”. The paper is Neyman, J. (1962), ‘Two Breakthroughs in the Theory of Statistical Decision Making’ [i]. It’s chock full of ideas and arguments. “In the present paper” he tells us, “the term ‘inferential theory’…will be used to describe the attempts to solve the Bayes’ problem with a reference to confidence, beliefs, etc., through some supplementation …either a substitute a priori distribution [exemplified by the so called principle of insufficient reason] or a new measure of uncertainty” such as Fisher’s fiducial probability. It arises on p. 391 of Excursion 5 Tour III of Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (2018, CUP). Here’s a link to the proofs of that entire tour. If you hear Neyman rejecting “inferential accounts” you have to understand it in this very specific way: he’s rejecting “new measures of confidence or diffidence”. Here he alludes to them as “easy ways out”. He is not rejecting statistical inference in favor of behavioral performance as typically thought. Neyman always distinguished his error statistical performance conception from Bayesian and Fiducial probabilisms [ii]. The surprising twist here is semantic and the culprit is none other than…Allan Birnbaum. Yet Birnbaum gets short shrift, and no mention is made of our favorite “breakthrough” (or did I miss it?). You can find quite a lot on this blog searching Birnbaum.

Note: In this article, “attacks” on various statistical “fronts” refers to ways of attacking problems in one or another statistical research program.

What doesn’t Neyman like about Birnbaum’s advocacy of a Principle of Sufficiency S (p. 25)? He doesn’t like that it is advanced as a normative principle (e.g., about when evidence is or ought to be deemed equivalent) rather than a criterion that does something for you, such as control errors. (Presumably it is relevant to a type of context, say parametric inference within a model.) S is put forward as a kind of principle of rationality, rather than one with a rationale in solving some statistical problem:

“The principle of sufficiency (S): If E is specified experiment, with outcomes x; if t = t (x) is any sufficient statistic; and if E’ is the experiment, derived from E, in which any outcome x of E is represented only by the corresponding value t = t (x) of the sufficient statistic; then for each x, Ev (E, x) = Ev (E’, t) where t = t (x)… (S) may be described informally as asserting the ‘irrelevance of observations independent of a sufficient statistic’.”

Ev(E, x) is a metalogical symbol referring to the evidence from experiment E with result x. The very idea that there is such a thing as an evidence function is never explained, but to Birnbaum “inferential theory” required such things. (At least that’s how he started out.) The view is very philosophical and it inherits much from logical positivism and logics of induction. The principle S, and also other principles of Birnbaum, have a normative character: Birnbaum considers them “compellingly appropriate”.

“The principles of Birnbaum appear as a kind of substitutes for known theorems” Neyman says. For example, various authors proved theorems to the general effect that the use of sufficient statistics will minimize the frequency of errors. But if you just start with the rationale (minimizing the frequency of errors, say) you wouldn’t need these “principles” from on high as it were. That’s what Neyman seems to be saying in his criticism of them in this paper. Do you agree? He has the same gripe concerning Cornfield’s conception of a default-type Bayesian account akin to Jeffreys. Why?

[i] I am grateful to @omaclaran for reminding me of this paper on twitter in 2018.

[ii] Or so I argue in my Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars, 2018, CUP.

[iii] Do you think Neyman is using “breakthrough” here in reference to Savage’s description of Birnbaum’s “proof” of the (strong) Likelihood Principle? Or is it the other way round? Or neither? Please weigh in.


Neyman, J. (1962), ‘Two Breakthroughs in the Theory of Statistical Decision Making‘, Revue De l’Institut International De Statistique / Review of the International Statistical Institute, 30(1), 11-27.

Categories: Bayesian/frequentist, Error Statistics, Neyman | 3 Comments

Intellectual conflicts of interest: Reviewers


Where do journal editors look to find someone to referee your manuscript (in the typical “double blind” review system in academic journals)? One obvious place to look is the reference list in your paper. After all, if you’ve cited them, they must know about the topic of your paper, putting them in a good position to write a useful review. The problem is that if your paper is on a topic of ardent disagreement, and you argue in favor of one side of the debates, then your reference list is likely to include those with actual or perceived conflicts of interest. After all, if someone has a strong standpoint on an issue of some controversy, and a strong interest in persuading others to accept their side, it creates an intellectual conflict of interest, if that person has power to uphold that view. Since your referee is in a position of significant power to do just that, it follows that they have a conflict of interest (COI). A lot of attention is paid to authors’ conflicts of interest, but little to the intellectual or ideological conflicts of interest of reviewers. At most, the concern is with the reviewer having special reasons to favor the author, usually thought to be indicated by having been a previous co-author. We’ve been talking about journal editors’ conflicts of interest as of late (e.g., with Mark Burgman’s presentation at the last Phil Stat Forum) and this brings to mind another one.

But is it true that just because a reviewer is put in a position of competing interests (staunchly believing in a position opposed to yours, while under an obligation to provide a fair and unbiased review) that their fairness in executing the latter is compromised? I surmise that your answer to this question will depend on which of two scenarios you imagine yourself in: In the first, you imagine yourself reviewing a paper that argues in favor of a position that you oppose. In the second, you imagine that your paper, which argues in favor of a view, has been sent to a reviewer with a vested interest in opposing that view.

In other words, if the paper argues in favor of a position, call it position X, and you oppose X, I’m guessing you imagine you’d have no trouble giving fair and constructive assessments of arguments in favor of X. You would not dismiss arguments in favor of X, just because you sincerely oppose X. You’d give solid reasons. You’d be much more likely to question if a reviewer, staunchly opposed to position X, will be an unbiased reviewer of your paper in favor of X. I’m not biased, but they are.

I think the truth is that reviewers with a strong standpoint on a controversial issue are likely to have an intellectual conflict of interest in reviewing a paper in favor of a position they oppose. Recall that it suffices, according to standard definitions of an individual having a COI, that reasonable grounds exist to question whether their judgments and decisions can be unbiased. (For example, investment advisors avoid recommending stocks they themselves own, to avoid a conflict of interest.) If this is correct, does it follow that opponents of a contentious issue should not serve as reviewers of papers that take an opposite stance? I say no because an author can learn a lot from a biased review about how to present their argument in the strongest possible terms, and how to zero in on the misunderstandings and confusions underlying objections to the view. Authors will almost surely not persuade such a reviewer by means of a revised paper, but they will be in possession of an argument that may enable them to persuade others.

A reviewer who deeply opposes position X will indeed, almost certainly, raise criticisms of a paper that favors X, but it does not follow that they are not objective or valid criticisms. Nevertheless, if all the reviewers come from this group, the result is still an unbalanced and unfair assessment, especially in that–objective or not–the critical assessment is more likely to accentuate the negative. If the position X happens to be currently unpopular, and opposing X the “received” position extolled by leaders of associations, journals, and institutions, then restricting reviewers to those opposed to X would obstruct intellectual progress. Progress comes from challenging the status quo and the tendency of people to groupthink and to jump on the bandwagon endorsed by many influential thought leaders of the day. Thus it would make sense for authors to have an opportunity to point out ahead of time to journal editors–who might not be aware of the particular controversy–the subset of referees with a vested intellectual interest against the view for which they are arguing. If the paper is nevertheless sent to those reviewers, a judicious journal editor should weigh very heavily the author’s retorts and rejoinders. [1]

Here’s an example from outside of academia–the origins of the Coronavirus. The president of an organization that is directly involved with and heavily supported by funds for experimenting on coronaviruses, Peter Daszak, has a vested interest in blocking hypotheses of lab leaks or lab errors. Such hypotheses, if accepted, would have huge and adverse effects on that research and its regulation. When he is appointed to investigate Coronavirus origins, he has a conflict of interest. See this post.

Molecular biologist Richard Ebright, one of the scientists to Call for a Full and Unrestricted International Forensic Investigation into the Origins of COVID-19, claims “the fact that the WHO named Daszak as a member of its mission, and the fact that the WHO retained Daszak as a member of its mission after being informed of his conflicts of interest, make it clear that the WHO study cannot be considered a credible, independent investigation.” (LINK) If all the reviewers of a paper in support of a lab association come from team Daszak, the paper is scarcely being given a fair shake.

Do you agree? Share your thoughts in the comments.

[1] The problem is compounded by the fact that today there are more journal submissions than ever, and with the difficulty in getting volunteers, there’s pressure on the journal editor not to dismiss the views of referees. My guess is that anonymity doesn’t play a big role most of the time.


Categories: conflicts of interest, journal referees | 12 Comments

ASA to Release the Recommendations of its Task Force on Statistical Significance and Replication

The American Statistical Association has announced that it has decided to reverse course and share the recommendations developed by the ASA Task Force on Statistical Significance and Replicability in one of its official channels. The ASA Board created this group [1] in November 2019 “with a charge to develop thoughtful principles and practices that the ASA can endorse and share with scientists and journal editors.” (AMSTATNEWS 1 February 2020). Some members of the ASA Board felt that its earlier decision not to make these recommendations public, but instead to leave the group to publish its recommendations on its own, might give the appearance of a conflict of interest between the obligation of the ASA to represent the wide variety of methodologies used by its members in widely diverse fields, and the advocacy by some members who believe practitioners should stop using the term “statistical significance” and end the practice of using p-value thresholds in interpreting data [the Wasserstein et al. (2019) editorial]. I think that deciding to publicly share the new Task Force recommendations is very welcome, given especially that the Task Force was appointed to avoid just such an apparent conflict of interest. Past ASA President Karen Kafadar noted:

Many of you have written of instances in which authors and journal editors—and even some ASA members—have mistakenly assumed this [Wasserstein et al. (2019)] editorial represented ASA policy. The mistake is understandable: The editorial was co-authored by an official of the ASA.

… To address these issues, I hope to establish a working group that will prepare a thoughtful and concise piece … without leaving the impression that p-values and hypothesis tests…have no role in ‘good statistical practice’. (K. Kafadar, President’s Corner, 2019, p. 4)

Thus the Task Force on Statistical Significance and Replicability was born. Meanwhile, its recommendations remain under wraps. The one principle mentioned in Kafadar’s JSM presentation is that there be a disclaimer on all publications, articles, and editorials authored by ASA staff, making it clear that the views presented are theirs and not the association’s. It is good that we can now count on seeing the original recommendations. Were they only to have appeared in a distinct publication, perhaps in a non-statistics journal, we would never actually know if we were getting to see the original recommendations, or some modified version of them.

For a blogpost that provides the background to this episode, see “Why hasn’t the ASA board revealed the recommendations of its new task force on statistical significance and replicability?”


[1] Members of the ASA Task Force on Statistical Significance and Replicability

Linda Young, National Agricultural Statistics Service and University of Florida (Co-Chair)
Xuming He, University of Michigan (Co-Chair)
Yoav Benjamini, Tel Aviv University
Dick De Veaux, Williams College (ASA Vice President)
Bradley Efron, Stanford University
Scott Evans, The George Washington University (ASA Publications Representative)
Mark Glickman, Harvard University (ASA Section Representative)
Barry Graubard, National Cancer Institute
Xiao-Li Meng, Harvard University
Vijay Nair, Wells Fargo and University of Michigan
Nancy Reid, University of Toronto
Stephen Stigler, The University of Chicago
Stephen Vardeman, Iowa State University
Chris Wikle, University of Missouri





Kafadar, K. President’s Corner: “The Year in Review … And More to Come”, AMSTATNEWS 1 December 2019.

“Highlights of the November 2019 ASA Board of Directors Meeting”, AMSTATNEWS 1 January 2020.

Kafadar, K. “Task Force on Statistical Significance and Replicability Created”, AMSTATNEWS 1 February 2020.

Categories: conflicts of interest | Leave a comment

The Stat Wars and Intellectual conflicts of interest: Journal Editors


Like most wars, the Statistics Wars continue to have casualties. Some of the reforms thought to improve reliability and replication may actually create obstacles to methods known to improve on reliability and replication. At each of our meetings of the Phil Stat Forum: “The Statistics Wars and Their Casualties,” I take 5-10 minutes to draw out a proper subset of casualties associated with the topic of the presenter for the day. (The associated workshop that I have been organizing with Roman Frigg at the London School of Economics (CPNSS) now has a date for a hoped-for in-person meeting in London: 24-25 September 2021.) Of course we’re interested not just in casualties but in positive contributions, though what counts as a casualty and what a contribution is itself a focus of philosophy of statistics battles.

At our last meeting, Thursday, 25 March, Mark Burgman, Director of the Centre for Environmental Policy at Imperial College London and Editor-in-Chief of the journal Conservation Biology, spoke on “How should applied science journal editors deal with statistical controversies?”. His slides are here: (pdf). The casualty I focussed on is how the statistics wars may put journal editors in positions of conflicts of interest that can get in the way of transparency and avoidance of bias. I presented it in terms of 4 questions (nothing to do with the fact that it’s currently Passover):


D. Mayo’s Casualties: Intellectual Conflicts of Interest: Questions for Burgman


  1. In an applied field such as conservation science, where statistical inferences often are the basis for controversial policy decisions, should editors and editorial policies avoid endorsing one side of the long-standing debate revolving around statistical significance tests?  Or should they adopt and promote a favored methodology?
  2. If editors should avoid taking a side in setting author’s guidelines and reviewing papers, what policies should be adopted to avoid deferring to the calls of those wanting them to change their author’s guidelines? Have you ever been encouraged to do so?
  3. If one has a strong philosophical statistical standpoint and a strong interest in persuading others to accept it, does it create a conflict of interest, if that person has power to enforce that philosophy (especially in a group already driven by perverse incentives)? If so, what is your journal doing to take account of and prevent conflicts of interest?
  4. What do you think of the March 2019 Editorial of The American Statistician (Wasserstein et al., 2019): Don’t say “statistical significance” and don’t use predesignated p-value thresholds in interpreting data (e.g., .05, .01, .005)?

(While not an ASA policy document, Wasserstein’s status as ASA executive director gave it a lot of clout. Should he have issued a disclaimer that the article only represents the authors’ views?) [1]

This is the first of some posts on intellectual conflicts of interest that I’ll be writing shortly. [2]

Mark Burgman’s presentation (Link)

D. Mayo’s Casualties (Link)

[1] For those who don’t know the story: Because no disclaimer was issued, the ASA Board appointed a new Task Force on Statistical Significance and Replicability in 2019 to provide recommendations. These have thus far not been made public. For the background, see this post.

Burgman said that he had received a request to follow the “don’t say significance, don’t use P-value thresholds” recommendation, but upon considering it with colleagues, they decided against it. Why not include, as part of journal information shared with authors, that the editors consider it important to retain a variety of statistical methodologies–correctly used–and have explicitly rejected the call to ban any of them (even if they come with official association letterhead)?

[2] WordPress has just sprung a radical change on bloggers, and as I haven’t figured it out yet, and my blog assistant is unavailable, I’ve cut this post short.

Categories: Error Statistics | Leave a comment

Reminder: March 25 “How Should Applied Science Journal Editors Deal With Statistical Controversies?” (Mark Burgman)

The seventh meeting of our Phil Stat Forum*:

The Statistics Wars
and Their Casualties

25 March, 2021

TIME: 15:00-16:45 (London); 11:00-12:45 (New York, NOTE TIME CHANGE TO MATCH UK TIME**)

For information about the Phil Stat Wars forum and how to join, click on this link.

How should applied science journal editors deal with statistical controversies?

Mark Burgman

Mark Burgman is the Director of the Centre for Environmental Policy (CEP) at Imperial College London, Chair in Risk Analysis & Environmental Policy, and Editor-in-Chief of the journal Conservation Biology. Previously, he was Adrienne Clarke Chair of Botany at the University of Melbourne, Australia. He works on expert judgement, ecological modelling, conservation biology and risk assessment. He has written models for biosecurity, medicine regulation, marine fisheries, forestry, irrigation, electrical power utilities, mining, and national park planning. He received a BSc from the University of New South Wales (1974), an MSc from Macquarie University, Sydney (1981), and a PhD from the State University of New York at Stony Brook (1987). He worked as a consultant ecologist and research scientist in Australia, the United States and Switzerland during the 1980s before joining the University of Melbourne in 1990. He joined CEP in February 2017. He has published over two hundred and fifty refereed papers and book chapters and seven authored books. He was elected to the Australian Academy of Science in 2006.

Abstract: Applied sciences come with different focuses. In environmental science, as in epidemiology, the framing and context of problems is often in crises. Decisions are imminent, data and understanding are incomplete, and ramifications of decisions are substantial. This context makes the implications of inferences from data especially poignant. It also makes the claims made by fervent and dedicated authors especially challenging. The full gamut of potential statistical foibles and psychological frailties are on display. In this presentation, I will outline and summarise the kinds of errors of reasoning that are especially prevalent in ecology and conservation biology. I will outline how these things appear to be changing, providing some recent examples. Finally, I will describe some implications of alternative editorial policies.

Some questions:

*Would it be a good thing to dispense with p-values, either through encouragement or through strict editorial policy?

*Would it be a good thing to insist on confidence intervals?

*Should editors of journals in a broad discipline band together and post common editorial policies for statistical inference?

*Should all papers be reviewed by a professional statistician?

If so, which kind?


Professor Burgman is developing this topic anew, so we don’t have the usual background reading. However, we do have his slides:

*Mark Burgman’s Draft Slides:  “How should applied science journal editors deal with statistical controversies?” (pdf)

*D. Mayo’s Slides: “The Statistics Wars and Their Casualties for Journal Editors: Intellectual Conflicts of Interest: Questions for Burgman” (pdf)

*A paper of mine from the Joint Statistical Meetings, “Rejecting Statistical Significance Tests: Defanging the Arguments”, discusses an episode that is relevant for the general topic of how journal editors should deal with statistical controversies.

Video Links: 

Mark Burgman’s presentation:

D. Mayo’s Casualties:

Please feel free to continue the discussion by posting questions or thoughts in the comments section on this PhilStatWars post.

*Meeting 15 of our general Phil Stat series, which began with the LSE Seminar PH500 on May 21

**UK doesn’t change their clock until March 28.

Categories: ASA Guide to P-values, confidence intervals and tests, P-values, significance tests | 1 Comment

Pandemic Nostalgia: The Corona Princess: Learning from a petri dish cruise (reblog 1yr)


Last week, giving a long postponed talk for the NY/NY Metro Area Philosophers of Science Group (MAPS), I mentioned how my book Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (2018, CUP) invites the reader to see themselves on a special interest cruise as we revisit old and new controversies in the philosophy of statistics–noting that I had no idea in writing the book that cruise ships would themselves become controversial in just a few years. The first thing I wrote during early pandemic days last March was this post on the Diamond Princess. The statistics gleaned from the ship remain important resources which haven’t been far off in many ways. I reblog it here. Continue reading

Categories: covid-19, memory lane | Leave a comment

March 25 “How Should Applied Science Journal Editors Deal With Statistical Controversies?” (Mark Burgman)

The seventh meeting of our Phil Stat Forum*:

The Statistics Wars
and Their Casualties

25 March, 2021

TIME: 15:00-16:45 (London); 11:00-12:45 (New York, NOTE TIME CHANGE)

For information about the Phil Stat Wars forum and how to join, click on this link.

How should applied science journal editors deal with statistical controversies?

Mark Burgman Continue reading

Categories: ASA Guide to P-values, confidence intervals and tests, P-values, significance tests | 1 Comment

Falsifying claims of trust in bat coronavirus research: mysteries of the mine (i)-(iv)


Have you ever wondered if people read Master’s (or even Ph.D) theses a decade out? Whether or not you have, I think you will be intrigued to learn the story of why an obscure Master’s thesis from 2012, translated from Chinese in 2020, is now an integral key for unravelling the puzzle of the global controversy about the mechanism and origins of Covid-19. The Master’s thesis by a doctor, Li Xu [1], “The Analysis of 6 Patients with Severe Pneumonia Caused by Unknown Viruses”, describes 6 patients he helped to treat after they entered a hospital in 2012, one after the other, suffering from an atypical pneumonia from cleaning up after bats in an abandoned copper mine in China. Given the keen interest in finding the origin of the 2002–2003 severe acute respiratory syndrome (SARS) outbreak, Li wrote: “This makes the research of the bats in the mine where the six miners worked and later suffered from severe pneumonia caused by unknown virus a significant research topic”. He and the other doctors treating the mine cleaners hypothesized that their diseases were caused by a SARS-like coronavirus from having been in close proximity to the bats in the mine. Continue reading

Categories: covid-19, falsification, science communication | 19 Comments

Aris Spanos: Modeling vs. Inference in Frequentist Statistics (guest post)


Aris Spanos
Wilson Schmidt Professor of Economics
Department of Economics
Virginia Tech

The following guest post (link to updated PDF) was written in response to C. Hennig’s presentation at our Phil Stat Wars Forum on 18 February, 2021: “Testing With Models That Are Not True”. Continue reading

Categories: misspecification testing, Spanos, stat wars and their casualties | 11 Comments

R.A. Fisher: “Statistical methods and Scientific Induction” with replies by Neyman and E.S. Pearson

In recognition of Fisher’s birthday (Feb 17), I reblog his contribution to the “Triad”–an exchange between Fisher, Neyman and Pearson 20 years after the Fisher-Neyman break-up. The other two are below. My favorite is the reply by E.S. Pearson, but all are chock full of gems for different reasons. They are each very short and are worth your rereading. Continue reading

Categories: E.S. Pearson, Fisher, Neyman, phil/history of stat | Leave a comment

R. A. Fisher: How an Outsider Revolutionized Statistics (Aris Spanos)



This is a belated birthday post for R.A. Fisher (17 February, 1890-29 July, 1962)–it’s a guest post from earlier on this blog by Aris Spanos that has gotten the highest number of hits over the years. 

Happy belated birthday to R.A. Fisher!

‘R. A. Fisher: How an Outsider Revolutionized Statistics’

by Aris Spanos

Few statisticians will dispute that R. A. Fisher (February 17, 1890 – July 29, 1962) is the father of modern statistics; see Savage (1976), Rao (1992). Inspired by William Gosset’s (1908) paper on the Student’s t finite sampling distribution, he recast statistics into the modern model-based induction in a series of papers in the early 1920s. He put forward a theory of optimal estimation based on the method of maximum likelihood that has changed only marginally over the last century. His significance testing, spearheaded by the p-value, provided the basis for the Neyman-Pearson theory of optimal testing in the early 1930s. According to Hald (1998) Continue reading

Categories: Fisher, phil/history of stat, Spanos | 2 Comments

Reminder: February 18 “Testing with models that are not true” (Christian Hennig)

The sixth meeting of our Phil Stat Forum*:

The Statistics Wars
and Their Casualties

18 February, 2021

TIME: 15:00-16:45 (London); 10-11:45 a.m. (New York, EST)

For information about the Phil Stat Wars forum and how to join, click on this link. 


Testing with Models that Are Not True Continue reading

Categories: Phil Stat Forum | Leave a comment

S. Senn: The Power of Negative Thinking (guest post)



Stephen Senn
Consultant Statistician
Edinburgh, Scotland

Sepsis sceptic

During an exchange on Twitter, Lawrence Lynn drew my attention to a paper by Laffey and Kavanagh[1]. This makes an interesting, useful and very depressing assessment of the situation as regards clinical trials in critical care. The authors make various claims that RCTs in this field are not useful as currently conducted. I don’t agree with the authors’ logic here although, perhaps surprisingly, I consider that their conclusion might be true. I propose to discuss this here. Continue reading

Categories: power, randomization | 5 Comments

February 18 “Testing with models that are not true” (Christian Hennig)

The sixth meeting of our Phil Stat Forum*:

The Statistics Wars
and Their Casualties

18 February, 2021

TIME: 15:00-16:45 (London); 10-11:45 a.m. (New York, EST)

For information about the Phil Stat Wars forum and how to join, click on this link. 


Testing with Models that Are Not True

Christian Hennig

Continue reading

Categories: Phil Stat Forum | 1 Comment

The Covid-19 Mask Wars : Hi-Fi Mask Asks


Effective yesterday, February 1, it is a violation of federal law not to wear a mask on a public conveyance or in a transit hub, including taxis, trains and commercial trucks. (The 11-page mandate is here.)

The “mask wars” are a major source of disagreement and of politicizing science during the current pandemic, but my interest here is not in clashes between pro- and anti-mask culture warriors, but in the clashing recommendations among science policy officials and scientists wearing their policy hats. A recent Washington Post editorial by Joseph Allen (director of the Healthy Buildings program at the Harvard T.H. Chan School of Public Health) declares “Everyone should be wearing N95 masks now”. In his view: Continue reading

Categories: covid-19 | 27 Comments

January 28 Phil Stat Forum “How Can We Improve Replicability?” (Alexander Bird)

The fifth meeting of our Phil Stat Forum*:

The Statistics Wars
and Their Casualties

28 January, 2021

TIME: 15:00-16:45 (London); 10-11:45 a.m. (New York, EST)


“How can we improve replicability?”

Alexander Bird 

Continue reading

Categories: Phil Stat Forum | 1 Comment
