Error Statistics

A. Saltelli (Guest post): What can we learn from the debate on statistical significance?

Professor Andrea Saltelli
Centre for the Study of the Sciences and the Humanities (SVT), University of Bergen (UIB, Norway),
Open Evidence Research, Universitat Oberta de Catalunya (UOC), Barcelona

What can we learn from the debate on statistical significance?

The statistical community is in the midst of a crisis whose latest convulsion is a petition to abolish the concept of significance. The problem is perhaps neither with significance, nor with statistics, but with the inconsiderate way we use numbers, and with our present approach to quantification. Unless the crisis is resolved, there will be a loss of consensus in scientific arguments, with a corresponding decline of public trust in the findings of science.

# The sins of quantification

Every quantification which is unclear as to its scope and the context in which it is produced obscures rather than elucidates.

Traditionally, the strength of numbers in the making of an argument has rested on their purported objectivity and neutrality. Expressions such as “Concrete numbers”, “The numbers speak for themselves”, or “The data/the model don’t lie” are common currency. Today, doubts about algorithmic instances of quantification – e.g. in promoting, detaining, or conceding freedom or credit – are becoming more urgent and visible. Yet the doubt should be general. It is increasingly realised that in every activity of quantification the technique or the methods are never neutral, because it is never possible to separate entirely the act of quantifying from the wishes and expectations of the quantifier. Thus, books apparently telling separate stories, such as Rigor Mortis, Weapons of Math Destruction, The Tyranny of Metrics, or Useless Arithmetic – dealing with statistics, algorithms, indicators and models respectively – share a common concern.

# Statisticians know

Statisticians are increasingly aware that each number presupposes an underlying narrative, a worldview, and a purpose for the exercise. The maturity of this debate in the house of statistics is not an accident. Statistics is a discipline, with recognized leaders and institutions, and although one might derive an impression of disorder from the use of a petition to influence a scientific argument, one cannot deny that the problems in statistics are being tackled head on, in the public arena, in spite of the obvious difficulty for the lay public in following the technicality of the arguments. With its ongoing discussion of significance, the community of statistics is teaching us an important lesson about the tight coupling between technique and values. How so? We recap here some elements of the debate.

  • For some, it would be better to throw away the concept of significance altogether, because the p-test – with its magical p<0.05 threshold – is being misused as a measure of veracity and publishability.
  • Others object that discussion should not take place with the instrument of a petition and that withdrawing tests of significance would make science even more uncertain.
  • The former retort that since this discussion has been going on for decades in academic journals without the existing flaws being fixed, perhaps the time is ripe for action.

A good vantage point to look at this debate in its entirety is this section in Andrew Gelman’s blog.

# Different worlds

An important aspect of this discussion is that the contenders may inhabit different worlds. One world is full of important effects which are overlooked because the test of significance fails (p-value greater than 0.05 in statistical parlance). The other world is instead replete with bogus results passed into the academic literature thanks to a low value of the p-test (p<0.05).

A modicum of investigation reveals that the contention is normative, or indeed political. To take an example, some may fear the introduction to the market of ineffectual pharmaceutical products; others, that important epidemiological effects of a pollutant on health may be overlooked. The first group would thus favour a more restrictive value for the test, the second group a less restrictive one.
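The tradeoff the two groups are arguing over can be made concrete with a small simulation (a sketch of my own, not from the post): a stricter significance threshold produces fewer bogus findings in a world with no real effect, but overlooks more genuine effects in the other world. The effect size 0.4 and sample size 30 are arbitrary illustrative choices.

```python
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(1)

def error_rates(alpha, true_effect, n=30, trials=20_000):
    """Monte Carlo estimate of false-positive and false-negative rates
    of a one-sided z-test for a mean, at significance threshold alpha."""
    z_crit = NormalDist().inv_cdf(1 - alpha)
    # World 1: no real effect -- how often do we wrongly 'find' one?
    z_null = rng.normal(0.0, 1.0, size=(trials, n)).mean(axis=1) * np.sqrt(n)
    false_pos = np.mean(z_null > z_crit)
    # World 2: a real effect -- how often do we overlook it?
    z_alt = rng.normal(true_effect, 1.0, size=(trials, n)).mean(axis=1) * np.sqrt(n)
    false_neg = np.mean(z_alt <= z_crit)
    return false_pos, false_neg

for alpha in (0.05, 0.005):
    fp, fn = error_rates(alpha, true_effect=0.4)
    print(f"alpha={alpha}: bogus findings ~{fp:.3f}, missed effects ~{fn:.3f}")
```

Tightening alpha from 0.05 to 0.005 cuts the false-positive rate by an order of magnitude, but roughly doubles the rate of missed effects in this setup: exactly the normative choice the two camps weigh differently.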

All this is not new. Philosopher Richard Rudner had already written in 1953 that it is impossible to use a test of significance without knowing to what it is being applied, i.e. without making a value judgment. Interestingly, Rudner used this example to make the point that scientists do need to make value judgments.

# How about mathematical models?

In all this discussion mathematical models have enjoyed a relative immunity, perhaps because mathematical modelling is not a discipline. But the absence of awareness of a quality problem is not proof of the absence of a problem. And there are signals that the crisis there might be even worse than the one recognised in statistics.

Implausible quantifications of the effect of climate change on the gross domestic product of a country in the year 2100, or of the safety of a disposal site for nuclear waste a million years from now, or of the risk of the financial products at the heart of the latest financial crisis, are just examples that are easily seen in the literature. Political decisions in the field of transport may be based on a model which needs as an input the average number of passengers sitting in a car several decades in the future. A scholar studying science and technology laments the generation of artefactual numbers through methods and concepts such as ‘expected utility’, ‘decision theory’, ‘life cycle assessment’, ‘ecosystem services’, ‘sound scientific decisions’ and ‘evidence-based policy’ to convey a spurious impression of certainty and control over important issues concerning health and the environment. A rhetorical use of quantification may thus be used in evidence-based policy to hide important knowledge and power asymmetries: the production of evidence empowers those who can pay for it, a trend noted in both the US and Europe.

# Resistance?

Since its inception the current of post normal science (PNS) has insisted on the need to fight against instrumental or fantastic quantifications. PNS scholars suggested the use of pedigree for numerical information (NUSAP), and recently for mathematical models. Combined with PNS’ concept of extended peer communities, these tools are meant to facilitate a discussion of the various attributes of a quantification. This information includes not just its uncertainty, but also its history, the profile of its producers, its position within a system of power and norms, and overall its ‘fitness for function’, while also identifying the possible exclusion of competing stakes and worldviews.

Stat-Activisme, a recent French intellectual movement, proposes to ‘fight against’ as well as ‘fight with’ numbers. Stat-activisme targets invasive metrics and biased statistics, with a rich repertoire of strategies from ‘statistical judo’ to the construction of alternative measures.

As philosopher Jerome Ravetz reminds us, so long as our modern scientific culture has faith in numbers as if they were ‘nuggets of truth’, we will be victims of ‘funny numbers’ employed to rule our technical society.

Note: A different version of this piece has been published in Italian in the journal Epidemiologia e Prevenzione.

Categories: Error Statistics | 11 Comments

The First Eye-Opener: Error Probing Tools vs Logics of Evidence (Excursion 1 Tour II)

1.4, 1.5

In Tour II of this first Excursion of Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (SIST, 2018, CUP),  I pull back the cover on disagreements between experts charged with restoring integrity to today’s statistical practice. Some advised me to wait until later (in the book) to get to this eye-opener. Granted, the full story involves some technical issues, but after many months, I think I arrived at a way to get to the heart of things informally (with a promise of more detailed retracing of steps later on). It was too important not to reveal right away that some of the most popular “reforms” fall down on the job even with respect to our most minimal principle of evidence (you don’t have evidence for a claim if little if anything has been done to probe the ways it can be flawed).  Continue reading

Categories: Error Statistics, law of likelihood, SIST | 14 Comments

National Academies of Science: Please Correct Your Definitions of P-values

Mayo banging head

If you were on a committee to highlight issues surrounding P-values and replication, what’s the first definition you would check? Yes, exactly. Apparently, when it came to the recently released National Academies of Science “Consensus Study” Reproducibility and Replicability in Science 2019, no one did. Continue reading

Categories: ASA Guide to P-values, Error Statistics, P-values | 19 Comments

Performance or Probativeness? E.S. Pearson’s Statistical Philosophy: Belated Birthday Wish

E.S. Pearson

This is a belated birthday post for E.S. Pearson (11 August 1895-12 June, 1980). It’s basically a post from 2012 which concerns an issue of interpretation (long-run performance vs probativeness) that’s badly confused these days. I’ll post some Pearson items this week to mark his birthday.


Are methods based on error probabilities of use mainly to supply procedures which will not err too frequently in some long run? (performance). Or is it the other way round: that the control of long-run error properties is of crucial importance for probing the causes of the data at hand? (probativeness). I say no to the former and yes to the latter. This, I think, was also the view of Egon Sharpe (E.S.) Pearson.

Cases of Type A and Type B

“How far then, can one go in giving precision to a philosophy of statistical inference?” (Pearson 1947, 172)

Pearson considers the rationale that might be given to N-P tests in two types of cases, A and B:

“(A) At one extreme we have the case where repeated decisions must be made on results obtained from some routine procedure…

(B) At the other is the situation where statistical tools are applied to an isolated investigation of considerable importance…?” (ibid., 170)

Continue reading

Categories: E.S. Pearson, Error Statistics | Leave a comment

Neyman: Distinguishing tests of statistical hypotheses and tests of significance might have been a lapse of someone’s pen

Neyman April 16, 1894 – August 5, 1981

I’ll continue to post Neyman-related items this week in honor of his birthday. This isn’t the only paper in which Neyman makes it clear he denies a distinction between a test of statistical hypotheses and a significance test. He and E. Pearson also discredit the myth that the former is only allowed to report pre-data, fixed error probabilities, and is justified only by dint of long-run error control. Controlling the “frequency of misdirected activities” in the midst of finding something out, or solving a problem of inquiry, on the other hand, is an epistemological goal. What do you think?

Tests of Statistical Hypotheses and Their Use in Studies of Natural Phenomena
by Jerzy Neyman

ABSTRACT. Contrary to ideas suggested by the title of the conference at which the present paper was presented, the author is not aware of a conceptual difference between a “test of a statistical hypothesis” and a “test of significance” and uses these terms interchangeably. A study of any serious substantive problem involves a sequence of incidents at which one is forced to pause and consider what to do next. In an effort to reduce the frequency of misdirected activities one uses statistical tests. The procedure is illustrated on two examples: (i) Le Cam’s (and associates’) study of immunotherapy of cancer and (ii) a socio-economic experiment relating to low-income homeownership problems.

I recommend, especially, the example on home ownership. Here are two snippets: Continue reading

Categories: Error Statistics, Neyman, Statistics | Tags: | Leave a comment

Neyman vs the ‘Inferential’ Probabilists


We celebrated Jerzy Neyman’s Birthday (April 16, 1894) last night in our seminar: here’s a pic of the cake.  My entry today is a brief excerpt and a link to a paper of his that we haven’t discussed much on this blog: Neyman, J. (1962), ‘Two Breakthroughs in the Theory of Statistical Decision Making‘ [i] It’s chock full of ideas and arguments, but the one that interests me at the moment is Neyman’s conception of “his breakthrough”, in relation to a certain concept of “inference”.  “In the present paper” he tells us, “the term ‘inferential theory’…will be used to describe the attempts to solve the Bayes’ problem with a reference to confidence, beliefs, etc., through some supplementation …either a substitute a priori distribution [exemplified by the so called principle of insufficient reason] or a new measure of uncertainty” such as Fisher’s fiducial probability. So if you hear Neyman rejecting “inferential accounts” you have to understand it in this very specific way: he’s rejecting “new measures of confidence or diffidence”. Here he alludes to them as “easy ways out”. Now Neyman always distinguishes his error statistical performance conception from Bayesian and Fiducial probabilisms [ii]. The surprising twist here is semantical and the culprit is none other than…Allan Birnbaum. Yet Birnbaum gets short shrift, and no mention is made of our favorite “breakthrough” (or did I miss it?).

drawn by his wife, Olga

Note: In this article,”attacks” on various statistical “fronts” refers to ways of attacking problems in one or another statistical research program.

Categories: Bayesian/frequentist, Error Statistics, Neyman | Leave a comment

Several reviews of Deborah Mayo’s new book, Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars « Statistical Modeling, Causal Inference, and Social Science

Source: Several reviews of Deborah Mayo’s new book, Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars « Statistical Modeling, Causal Inference, and Social Science

Categories: Error Statistics | Leave a comment

Excursion 1 Tour II: Error Probing Tools versus Logics of Evidence-Excerpt


For the first time, I’m excerpting all of Excursion 1 Tour II from SIST (2018, CUP).

1.4 The Law of Likelihood and Error Statistics

If you want to understand what’s true about statistical inference, you should begin with what has long been a holy grail – to use probability to arrive at a type of logic of evidential support – and in the first instance you should look not at full-blown Bayesian probabilism, but at comparative accounts that sidestep prior probabilities in hypotheses. An intuitively plausible logic of comparative support was given by the philosopher Ian Hacking (1965) – the Law of Likelihood. Fortunately, the Museum of Statistics is organized by theme, and the Law of Likelihood and the related Likelihood Principle is a big one. Continue reading

Categories: Error Statistics, law of likelihood, SIST | 2 Comments

American Phil Assoc Blog: The Stat Crisis of Science: Where are the Philosophers?

Ship StatInfasST

The Statistical Crisis of Science: Where are the Philosophers?

This was published today on the American Philosophical Association blog. 

“[C]onfusion about the foundations of the subject is responsible, in my opinion, for much of the misuse of the statistics that one meets in fields of application such as medicine, psychology, sociology, economics, and so forth.” (George Barnard 1985, p. 2)

“Relevant clarifications of the nature and roles of statistical evidence in scientific research may well be achieved by bringing to bear in systematic concert the scholarly methods of statisticians, philosophers and historians of science, and substantive scientists…” (Allan Birnbaum 1972, p. 861).

“In the training program for PhD students, the relevant basic principles of philosophy of science, methodology, ethics and statistics that enable the responsible practice of science must be covered.” (p. 57, Committee Investigating fraudulent research practices of social psychologist Diederik Stapel)

I was the lone philosophical observer at a special meeting convened by the American Statistical Association (ASA) in 2015 to construct a non-technical document to guide users of statistical significance tests–one of the most common methods used to distinguish genuine effects from chance variability across a landscape of social, physical and biological sciences.

It was, by the ASA Director’s own description, “historical”, but it was also highly philosophical, and its ramifications are only now being discussed and debated. Today, introspection on statistical methods is rather common due to the “statistical crisis in science”. What is it? In a nutshell: high-powered computer methods make it easy to arrive at impressive-looking ‘findings’ that too often disappear when others try to replicate them with hypotheses and data analysis protocols required to be fixed in advance.

Continue reading

Categories: Error Statistics, Philosophy of Statistics, Summer Seminar in PhilStat | 2 Comments

Little Bit of Logic (5 mini problems for the reader)

Little bit of logic (5 little problems for you)[i]

Deductively valid arguments can readily have false conclusions! Yes, deductively valid arguments allow drawing their conclusions with 100% reliability, but only if all their premises are true. For an argument to be deductively valid means simply that if the premises of the argument are all true, then the conclusion is true. For a valid argument to entail the truth of its conclusion, all of its premises must be true. In that case the argument is said to be (deductively) sound.

Equivalently, using the definition of deductive validity that I prefer: a deductively valid argument is one where the truth of all its premises, together with the falsity of its conclusion, leads to a logical contradiction (A & ~A).

Show that an argument with the form of disjunctive syllogism can have a false conclusion. Such an argument takes the form (where A, B are statements): Continue reading
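One mechanical way to verify both claims, by exhausting the truth assignments (a sketch of my own, not part of the original exercise): disjunctive syllogism is valid – no row makes both premises true and the conclusion false – yet rows with a false conclusion exist, and in each of them at least one premise is false.

```python
from itertools import product

# Disjunctive syllogism: premises (A or B) and (not A); conclusion B.
rows = list(product([False, True], repeat=2))

# Validity: no assignment makes both premises true and the conclusion false.
counterexamples = [(A, B) for A, B in rows
                   if (A or B) and (not A) and not B]
print("valid:", counterexamples == [])  # -> valid: True

# A valid argument can still have a false conclusion -- but only when
# at least one premise is false, i.e. when the argument is unsound.
unsound = [(A, B) for A, B in rows
           if not B and not ((A or B) and (not A))]
print("false-conclusion rows:", unsound)
```

In both false-conclusion rows the premises fail jointly, which is exactly the valid-but-unsound case described above.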

Categories: Error Statistics | 22 Comments

Mayo-Spanos Summer Seminar PhilStat: July 28-Aug 11, 2019: Instructions for Applying Now Available


See the Blog at SummerSeminarPhilStat

Categories: Announcement, Error Statistics, Statistics | Leave a comment

You Should Be Binge Reading the (Strong) Likelihood Principle



An essential component of inference based on familiar frequentist notions (p-values, significance and confidence levels) is the relevant sampling distribution (hence the term sampling theory, or my preferred error statistics, as we get error probabilities from the sampling distribution). This feature results in violations of a principle known as the strong likelihood principle (SLP). To state the SLP roughly, it asserts that all the evidential import in the data (for parametric inference within a model) resides in the likelihoods. If accepted, it would render error probabilities irrelevant post data.

SLP (We often drop the “strong” and just call it the LP. The “weak” LP just boils down to sufficiency)

For any two experiments E1 and E2 with different probability models f1, f2, but with the same unknown parameter θ, if outcomes x* and y* (from E1 and E2 respectively) determine the same (i.e., proportional) likelihood function (f1(x*; θ) = cf2(y*; θ) for all θ), then x* and y* are inferentially equivalent (for an inference about θ).

(What differentiates the weak and the strong LP is that the weak refers to a single experiment.)
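The definition can be illustrated with the textbook binomial versus negative binomial pair (my illustration; the numbers n = 12, k = 3 are hypothetical, not from the post): fixing 12 trials and observing 3 successes, versus sampling until the 3rd success and needing 12 trials, yield proportional likelihood functions, so the SLP declares the two outcomes inferentially equivalent even though the sampling distributions (and hence error probabilities) differ.

```python
from math import comb

def binom_lik(theta, n=12, k=3):
    # E1: n Bernoulli trials fixed in advance; k successes observed
    return comb(n, k) * theta**k * (1 - theta)**(n - k)

def negbinom_lik(theta, k=3, n=12):
    # E2: sample until the k-th success, which arrives on trial n
    return comb(n - 1, k - 1) * theta**k * (1 - theta)**(n - k)

# The two likelihood functions are proportional: the ratio is the
# same constant c at every value of theta (the theta terms cancel).
ratios = [binom_lik(t) / negbinom_lik(t) for t in (0.1, 0.25, 0.5, 0.9)]
print(ratios)  # -> [4.0, 4.0, 4.0, 4.0]
```

Here c = C(12,3)/C(11,2) = 220/55 = 4, so by the SLP x* = 3 successes in 12 fixed trials and y* = 12 trials to reach 3 successes carry the same evidential import about θ.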
Continue reading

Categories: Error Statistics, Statistics, strong likelihood principle | 1 Comment

Excerpt from Excursion 4 Tour I: The Myth of “The Myth of Objectivity” (Mayo 2018, CUP)


Tour I The Myth of “The Myth of Objectivity”*


Objectivity in statistics, as in science more generally, is a matter of both aims and methods. Objective science, in our view, aims to find out what is the case as regards aspects of the world [that hold] independently of our beliefs, biases and interests; thus objective methods aim for the critical control of inferences and hypotheses, constraining them by evidence and checks of error. (Cox and Mayo 2010, p. 276)

Whenever you come up against blanket slogans such as “no methods are objective” or “all methods are equally objective and subjective” it is a good guess that the problem is being trivialized into oblivion. Yes, there are judgments, disagreements, and values in any human activity – an observation too trivial, on its own, to distinguish among the very different ways that threats of bias and unwarranted inferences may be controlled. Is the objectivity–subjectivity distinction really toothless, as many would have you believe? I say no. I know it’s a meme promulgated by statistical high priests, but you agreed, did you not, to use a bit of chutzpah on this excursion? Besides, cavalier attitudes toward objectivity are at odds with even more widely endorsed grassroots movements to promote replication, reproducibility, and to come clean on a number of sources behind illicit results: multiple testing, cherry picking, failed assumptions, researcher latitude, publication bias and so on. The moves to take back science are rooted in the supposition that we can more objectively scrutinize results – even if it’s only to point out those that are BENT. The fact that these terms are used equivocally should not be taken as grounds to oust them but rather to engage in the difficult work of identifying what there is in “objectivity” that we won’t give up, and shouldn’t. Continue reading

Categories: Error Statistics, SIST, Statistical Inference as Severe Testing | 4 Comments

Summer Seminar PhilStat: July 28-Aug 11, 2019 (ii)

First draft of PhilStat Announcement


Categories: Announcement, Error Statistics | 5 Comments

It’s the Methods, Stupid: Excerpt from Excursion 3 Tour II (Mayo 2018, CUP)

Tour II It’s the Methods, Stupid

There is perhaps in current literature a tendency to speak of the Neyman–Pearson contributions as some static system, rather than as part of the historical process of development of thought on statistical theory which is and will always go on. (Pearson 1962, 276)

This goes for Fisherian contributions as well. Unlike museums, we won’t remain static. The lesson from Tour I of this Excursion is that Fisherian and Neyman–Pearsonian tests may be seen as offering clusters of methods appropriate for different contexts within the large taxonomy of statistical inquiries. There is an overarching pattern: Continue reading

Categories: Error Statistics, Statistical Inference as Severe Testing | 4 Comments

Memento & Quiz (on SEV): Excursion 3, Tour I


As you enjoy the weekend discussion & concert in the Captain’s Central Limit Library & Lounge, your Tour Guide has prepared a brief overview of Excursion 3 Tour I, and a short (semi-severe) quiz on severity, based on exhibit (i).*


We move from Popper through a gallery on “Data Analysis in the 1919 Eclipse tests of the General Theory of Relativity (GTR)” (3.1), which leads to the main gallery on the origin of statistical tests (3.2) by way of a look at where the main members of our statistical cast are in 1919: Fisher, Neyman and Pearson. From the GTR episode, we identify the key elements of a statistical test – the steps in E.S. Pearson’s opening description of tests in 3.2. The classical testing notions – type I and II errors, power, consistent tests – are shown to grow out of requiring probative tests. The typical (behavioristic) formulation of N-P tests came later. The severe tester breaks out of the behavioristic prison. A first look at the severity construal of N-P tests is in Exhibit (i). Viewing statistical inference as severe testing shows how to do all that N-P tests do (and more) while a member of the Fisherian Tribe (3.3). We consider the frequentist principle of evidence FEV and the divergent interpretations that are called for by Cox’s taxonomy of null hypotheses. The last member of the taxonomy – substantively based null hypotheses – returns us to the opening episode of GTR. Continue reading

Categories: Severity, Statistical Inference as Severe Testing | 16 Comments

First Look at N-P Methods as Severe Tests: Water plant accident [Exhibit (i) from Excursion 3]

Excursion 3 Exhibit (i)

Exhibit (i) N-P Methods as Severe Tests: First Look (Water Plant Accident)

There’s been an accident at a water plant where our ship is docked, and the cooling system had to be repaired. It is meant to ensure that the mean temperature of discharged water stays below the temperature that threatens the ecosystem, perhaps not much beyond 150 degrees Fahrenheit. There were 100 water measurements taken at randomly selected times and the sample mean X̄ computed, each with a known standard deviation σ = 10. When the cooling system is effective, each measurement is like observing X ~ N(150, 10²). Because of this variability, we expect different 100-fold water samples to lead to different values of X̄, but we can deduce its distribution. If each X ~ N(μ = 150, 10²), then X̄ is also Normal with μ = 150, but the standard deviation of X̄ is only σ/√n = 10/√100 = 1. So X̄ ~ N(μ = 150, 1). Continue reading
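Given the setup in the excerpt, the sampling distribution of the sample mean immediately yields error probabilities. A minimal sketch (the observed mean of 152 is my own hypothetical value, not from the text):

```python
from statistics import NormalDist
from math import sqrt

# Setup from the excerpt: each X ~ N(mu, 10^2), n = 100, H0: mu = 150.
mu0, sigma, n = 150.0, 10.0, 100
se = sigma / sqrt(n)  # standard error of the sample mean: 10/10 = 1

def p_value(xbar):
    """One-sided p-value: probability of a sample mean this large or
    larger, were the cooling system working (mu = 150)."""
    return 1 - NormalDist(mu0, se).cdf(xbar)

# A hypothetical observed mean 2 standard errors above 150:
print(round(p_value(152.0), 4))  # -> 0.0228
```

A sample mean of 152 would thus be expected only about 2% of the time were the system working as intended, exactly the kind of error probability the sampling distribution X̄ ~ N(150, 1) delivers.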

Categories: Error Statistics, Severity, Statistical Inference as Severe Testing | 44 Comments

Stephen Senn: Rothamsted Statistics meets Lord’s Paradox (Guest Post)


Stephen Senn
Consultant Statistician

The Rothamsted School

I never worked at Rothamsted but during the eight years I was at University College London (1995-2003) I frequently shared a train journey to London from Harpenden (the village in which Rothamsted is situated) with John Nelder, as a result of which we became friends and I acquired an interest in the software package Genstat®.

That in turn got me interested in John Nelder’s approach to analysis of variance, which is a powerful formalisation of ideas present in the work of others associated with Rothamsted. Nelder’s important predecessors in this respect include, at least, RA Fisher (of course) and Frank Yates and others such as David Finney and Frank Anscombe. John died in 2010 and I regard Rosemary Bailey, who has done deep and powerful work on randomisation and the representation of experiments through Hasse diagrams, as being the greatest living proponent of the Rothamsted School. Another key figure is Roger Payne who turned many of John’s ideas into code in Genstat®. Continue reading

Categories: Error Statistics | 11 Comments

The Replication Crises and its Constructive Role in the Philosophy of Statistics-PSA2018

Below are my slides from a session on replication at the recent Philosophy of Science Association meetings in Seattle.


Categories: Error Statistics | Leave a comment

Tour Guide Mementos (Excursion 1, Tour I of How to Get Beyond the Statistics Wars)


Tour guides in your travels jot down Mementos and Keepsakes from each Tour[i] of my new book: Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (CUP 2018). Their scribblings, which may at times include details, at other times just a word or two, may be modified through the Tour, and in response to questions from travelers (so please check back). Since these are just mementos, they should not be seen as replacements for the more careful notions given in the journey (i.e., book) itself. Still, you’re apt to flesh out your notes in greater detail, so please share yours (along with errors you’re bound to spot), and we’ll create Meta-Mementos. Continue reading

Categories: Error Statistics, Statistical Inference as Severe Testing | 8 Comments
