Memento & Quiz (on SEV): Excursion 3, Tour I


As you enjoy the weekend discussion & concert in the Captain’s Central Limit Library & Lounge, your Tour Guide has prepared a brief overview of Excursion 3 Tour I, and a short (semi-severe) quiz on severity, based on exhibit (i).*


We move from Popper through a gallery on “Data Analysis in the 1919 Eclipse tests of the General Theory of Relativity (GTR)” (3.1) which leads to the main gallery on the origin of statistical tests (3.2) by way of a look at where the main members of our statistical cast are in 1919: Fisher, Neyman and Pearson. From the GTR episode, we identify the key elements of a statistical test–the steps in E.S. Pearson’s opening description of tests in 3.2. The classical testing notions–type I and II errors, power, consistent tests–are shown to grow out of requiring probative tests. The typical (behavioristic) formulation of N-P tests came later. The severe tester breaks out of the behavioristic prison. A first look at the severity construal of N-P tests is in Exhibit (i). Viewing statistical inference as severe testing shows how to do all N-P tests do (and more) while a member of the Fisherian Tribe (3.3). We consider the frequentist principle of evidence FEV and the divergent interpretations that are called for by Cox’s taxonomy of null hypotheses. The last member of the taxonomy–substantively based null hypotheses–returns us to the opening episode of GTR.

key terms (incomplete please send me yours)

GTR, eclipse test, ether effect, corona effect, PPN framework, statistical test ingredients, Anglo-Polish collaboration, Lambda criterion; Type I error, Type II error, power, P-value, unbiased tests, consistent tests uniformly most powerful (UMP); severity interpretation of tests, severity function, water plant accident; sufficient statistic; frequentist principle of evidence FEV; sensitivity achieved, [same as attained power (att power)], Cox’s taxonomy (embedded, nested, dividing, testing assumptions), Nordvedt effect, equivalence principle (strong and weak)

Semi-Severe Severity Quiz, based on the example in Exhibit (i) of Excursion 3

  1. Keeping to Test T+ with H0: μ ≤ 150 vs. H1: μ > 150, σ = 10, and n = 100, observed x  = 152 (i.e., d = 2), find the severity associated with μ > 150.5 .

i.e.,SEV100(μ > 150.5) = ________

  1. Compute 3 or more of the severity assessments for Table 3.2, with x  = 153.
  2. Comparing n = 100 with n = 10,000: Keeping to Test T+ with H0: μ ≤ 150 vs. H1: μ > 150, σ = 10, change the sample size so that  n = 10,000.

The 2SE rejection rule would now be: reject (i.e., “infer evidence against H0”) whenever X   > _____.

Assume x  = just reaches this 2SE cut-off. (added previous sentence, Dec 10, I thought it was clear.) What’s the severity associated with inferring μ > 150.5 now?

i.e., SEV10,000(μ > 150.5) = ____

Compare with SEV100(μ > 150.5).

4. NEW. I realized I needed to include a “negative” result. Assume x  = 151.5. Keeping to the same test with n = 100, find SEV100(μ ≤ 152).

5. If you’re following the original schedule, you’ll have read Tour II of Excursion 3, so here’s an easy question: Why does Souvenir M tell you to “relax”?

6. Extra Credit: supply some key terms from this Tour that I left out in the above list.

*The reference is to Mayo (2018, CUP): Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars.


Categories: Severity, Statistical Inference as Severe Testing | 8 Comments

First Look at N-P Methods as Severe Tests: Water plant accident [Exhibit (i) from Excursion 3]

Excursion 3 Exhibit (i)

Exhibit (i) N-P Methods as Severe Tests: First Look (Water Plant Accident)

There’s been an accident at a water plant where our ship is docked, and the cooling system had to be repaired.  It is meant to ensure that the mean temperature of discharged water stays below the temperature that threatens the ecosystem, perhaps not much beyond 150 degrees Fahrenheit. There were 100 water measurements taken at randomly selected times and the sample mean x computed, each with a known standard deviation σ = 10.  When the cooling system is effective, each measurement is like observing X ~ N(150, 102). Because of this variability, we expect different 100-fold water samples to lead to different values of X, but we can deduce its distribution. If each X ~N(μ = 150, 102) then X is also Normal with μ = 150, but the standard deviation of X is only σ/√n = 10/√100 = 1. So X ~ N(μ = 150, 1).

It is the distribution of X that is the relevant sampling distribution here. Because it’s a large random sample, the sampling distribution of X is Normal or approximately so, thanks to the Central Limit Theorem. Note the mean of the sampling distribution of X is the same as the underlying mean, both are μ. The frequency link was created by randomly selecting the sample, and we assume for the moment it was successful. Suppose they are testing:

H0: μ ≤ 150 vs. H1: μ > 150.

The test rule for α =  0.025 is:

Reject H0: iff  > 150 + cασ/√100 = 150 + 1.96(1)=151.96,
since cα = 1.96.

For simplicity, let’s go to the 2-standard error cut-off for rejection:

Reject H0 (infer there’s an indication that μ >  150) iff X  ≥ 152.

The test statistic d(x) is a standard Normal variable: Z = √100( X – 150)/10 = X – 150 which, for x  = 152 is 2. The area to the right of 2 under the standard Normal is around 0.025.

Now we begin to move beyond the strict N-P interpretation. Say x is just significant at the 0.025 level (x  = 152). What warrants taking the data as indicating μ > 150 is not that they’d rarely be wrong in repeated trials on cooling systems by acting this way–even though that’s true. There’s a good indication that it’s not in compliance right now. Why? The severity rationale: Were the mean temperature no higher than 150, then over 97% of the time their method would have resulted in a lower mean temperature than observed. Were it clearly in the safe zone, say μ = 149 degrees, a lower observed mean would be even more probable. Thus, x = 152 indicates some positive discrepancy from H(though we don’t consider it rejected by a single outcome). They’re going to take another round of measurements before acting. In the context of a policy action, to which this indication might lead, some type of loss function would be introduced. We’re just considering the evidence, based on these measurements; all for illustrative purposes.

Severity Function:

I will abbreviate “the severity with which claim passes test T with data x“:

SEV(test T, outcome x, claim C).

Reject/Do Not Reject: will be interpreted inferentially, in this case as an indication or evidence of the presence or absence of discrepancies of interest.

Let us suppose we are interested in assessing the severity of C: μ > 153. I imagine this would be a full-on emergency for the ecosystem!

Reject H0Suppose the observed mean is x  = 152, just at the cut-off for rejecting H0:

d(x0) = √100(152 – 150)/10 = 2.

The data reject H0 at level 0.025. We want to compute

SEV(T, x = 152, C: μ > 153).

We may say: “the data accord with C: μ > 153,” that is, severity condition (S-1) is satisfied; but severity requires there to be at least a reasonable probability of a worse fit with C if C is false (S-2). Here, “worse fit with C” means x ≤ 152 (i.e., d(x0) ≤ 2). Given it’s continuous, as with all the following examples, < or ≤ give the same result. The context indicates which is more useful. This probability must be high for to pass severely; if it’s low, it’s BENT.

We need Pr(X ≤ 152; μ > 153 is false).  To say μ > 153 is false is to say μ ≤ 153. So we want Pr(X ≤ 152; μ ≤ 153).  But we need only evaluate severity at the point μ = 153, because this probability is even greater for μ < 153:

Pr(X ≤ 152; μ = 153) = Pr(Z ≤ -1) = 0.16.

Here, Z = √100(152 – 153)/10 = -1. Thus SEV(T,  x = 152, C: μ > 153) = 0.16. Very low. Our minimal severity principle blocks μ > 153 because it’s fairly probable (84% of the time) that the test would yield an even larger mean temperature than we got, if the water samples came from a body of water whose mean temperature is 153. Table 3.1 gives the severity values associated with different claims, given x = 152. Call tests of this form T+

In each case, we are making inferences of form: μ > μ= 150 + γ, for different values of γ. To merely infer μ > 150 , the severity is 0.97 since Pr(X ≤ 152; μ = 150) = Pr(Z ≤ 2) = 0.97. While the data give an indication of non-compliance, μ > 150, to infer C: μ > 153 would be making mountains out of molehills. In this case, the observed difference just hit the cut-off for rejection. N-P tests leave things at that coarse level in computing power and the probability of a Type II error, but severity will take into account the actual outcome. Table 3.2 gives the severity values associated with different claims, given x = 153.

If “the major criticism of the Neyman-Pearson frequentist approach” is that it fails to provide “error probabilities fully varying with the data” as J. Berger alleges, (2003, p.6) then, we’ve answered the major criticism.

Non-rejection. Now suppose x = 151, so the test does not reject H0. The standard formulation of N-P (as well as Fisherian) tests stops there. But we want to be alert to a fallacious interpretation of a “negative” result: inferring there’s no positive discrepancy from μ = 150. No (statistical) evidence of non-compliance isn’t evidence of compliance, here’s why. We have (S-1): the data “accord with” H0, but what if the test had little capacity to have alerted us to discrepancies from 150? The alert comes by way of “a worse fit” with H0–namely,  a mean x  > 151*. Condition (S-2) requires us to consider Pr(X > 151; μ = 150), which is only 0.16. To get this, standardize X to obtain a standard Normal variate:  Z = √100(151 – 150)/10 = 1; and Pr(X > 151; μ = 150) = 0.16. Thus, SEV(T+, x  = 151, C: μ ≤ 150) = low(0.16). Table 3.3 gives the severity values associated with different inferences of form: μ ≤ μ1= 150 + γ, given  x = 151.

Can they at least say that x = 151 is a good indication that μ ≤ 150.5? No, SEV(T+, x  = 151, C: μ ≤ 150.5) ≅ 0.3, [Z = 151 – 150.5 = 0.5]. But x = 151 is a good indication that μ ≤ 152 and μ ≤ 153 (with severity indications of 0.84 and 0.97, respectively).

You might say, assessing severity is no different from what we would do with a judicious use of existing error probabilities. That’s what the severe tester says. Formally speaking, it may be seen merely as a good rule of thumb to avoid fallacious interpretations. What’s new is the statistical philosophy behind it. We no longer seek either probabilism or performance, but rather using relevant error probabilities to assess and control severity.5

5Initial developments of the severity idea were Mayo (1983, 1988, 1991, 1996). In Mayo and Spanos (2006, 2011), it was developed much further.


NOTE: I will set out some quiz examples of severity in the next week for practice.

*There is a typo in the book here, it has “-” rather than “>”

You can find the beginning of this section (3.2), the development of N-P tests, in this post.

To read further, see Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (Mayo 2018, CUP).

Where you are in the journey:

Excursion 3: Statistical Tests and Scientific Inference

Tour I Ingenious and Severe Tests                                             119

3.1 Statistical Inference and Sexy Science: The 1919
Eclipse Test                                                                                    121

3.2 N-P Tests: An Episode in Anglo-Polish Collaboration              131

YOU exhibit (i) N-P Methods as Severe Tests: First Look (Water Plant Accident)

3.3 How to Do All N-P Tests Do (and more) While
a Member of the Fisherian Tribe                                                    146

  • All excerpts and mementos (until Nov. 30, 2018) are here.
  • The full Itinerary (Table of Contents) is here.
Categories: Error Statistics, Severity, Statistical Inference as Severe Testing | 43 Comments

Neyman-Pearson Tests: An Episode in Anglo-Polish Collaboration: Excerpt from Excursion 3 (3.2)

Neyman & Pearson

3.2 N-P Tests: An Episode in Anglo-Polish Collaboration*

We proceed by setting up a specific hypothesis to test, Hin Neyman’s and my terminology, the null hypothesis in R. A. Fisher’s . . . in choosing the test, we take into account alternatives to Hwhich we believe possible or at any rate consider it most important to be on the look out for . . .Three steps in constructing the test may be defined:

Step 1. We must first specify the set of results . . .

Step 2. We then divide this set by a system of ordered boundaries . . .such that as we pass across one boundary and proceed to the next, we come to a class of results which makes us more and more inclined, on the information available, to reject the hypothesis tested in favour of alternatives which differ from it by increasing amounts.

Step 3. We then, if possible, associate with each contour level the chance that, if H0 is true, a result will occur in random sampling lying beyond that level . . .

In our first papers [in 1928] we suggested that the likelihood ratio criterion, λ, was a very useful one . . . Thus Step 2 proceeded Step 3. In later papers [1933–1938] we started with a fixed value for the chance, ε, of Step 3 . . . However, although the mathematical procedure may put Step 3 before 2, we cannot put this into operation before we have decided, under Step 2, on the guiding principle to be used in choosing the contour system. That is why I have numbered the steps in this order. (Egon Pearson 1947, p. 173)

In addition to Pearson’s 1947 paper, the museum follows his account in “The Neyman–Pearson Story: 1926–34” (Pearson 1970). The subtitle is “Historical Sidelights on an Episode in Anglo-Polish Collaboration”!

We meet Jerzy Neyman at the point he’s sent to have his work sized up by Karl Pearson at University College in 1925/26. Neyman wasn’t that impressed: Continue reading

Categories: E.S. Pearson, Neyman, Statistical Inference as Severe Testing, statistical tests, Statistics | 1 Comment

Where Are Fisher, Neyman, Pearson in 1919? Opening of Excursion 3

Excursion 3 Statistical Tests and Scientific Inference

Tour I Ingenious and Severe Tests

[T]he impressive thing about [the 1919 tests of Einstein’s theory of gravity] is the risk involved in a prediction of this kind. If observation shows that the predicted effect is definitely absent, then the theory is simply refuted.The theory is incompatible with certain possible results of observation – in fact with results which everybody before Einstein would have expected. This is quite different from the situation I have previously described, [where] . . . it was practically impossible to describe any human behavior that might not be claimed to be a verification of these [psychological] theories. (Popper 1962, p. 36)

Mayo 2018, CUP

The 1919 eclipse experiments opened Popper’ s eyes to what made Einstein’ s theory so different from other revolutionary theories of the day: Einstein was prepared to subject his theory to risky tests.[1] Einstein was eager to galvanize scientists to test his theory of gravity, knowing the solar eclipse was coming up on May 29, 1919. Leading the expedition to test GTR was a perfect opportunity for Sir Arthur Eddington, a devout follower of Einstein as well as a devout Quaker and conscientious objector. Fearing “ a scandal if one of its young stars went to jail as a conscientious objector,” officials at Cambridge argued that Eddington couldn’ t very well be allowed to go off to war when the country needed him to prepare the journey to test Einstein’ s predicted light deflection (Kaku 2005, p. 113). Continue reading

Categories: SIST, Statistical Inference as Severe Testing | 1 Comment

Stephen Senn: On the level. Why block structure matters and its relevance to Lord’s paradox (Guest Post)


Stephen Senn
Consultant Statistician


In a previous post I considered Lord’s paradox from the perspective of the ‘Rothamsted School’ and its approach to the analysis of experiments. I now illustrate this in some detail giving an example.

What I shall do

I have simulated data from an experiment in which two diets have been compared in 20 student halls of residence, each diet having been applied to 10 halls. I shall assume that the halls have been randomly allocated the diet and that in each hall 10 students have been randomly chosen to have their weights recorded at the beginning of the academic year and again at the end. Continue reading

Categories: Lord's paradox, Statistical Inference as Severe Testing, Stephen Senn | 34 Comments

SIST* Posts: Excerpts & Mementos (to Nov 30, 2018)

Surveying SIST Posts so far

SIST* BLOG POSTS (up to Nov 30, 2018)


  • 05/19: The Meaning of My Title: Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars
  • 09/08: Excursion 1 Tour I: Beyond Probabilism and Performance: Severity Requirement (1.1)
  • 09/11: Excursion 1 Tour I (2nd stop): Probabilism, Performance, and Probativeness (1.2)
  • 09/15: Excursion 1 Tour I (3rd stop): The Current State of Play in Statistical Foundations: A View From a Hot-Air Balloon (1.3)
  • 09/29: Excursion 2: Taboos of Induction and Falsification: Tour I (first stop)
  • 10/10: Excursion 2 Tour II (3rd stop): Falsification, Pseudoscience, Induction (2.3)
  • 11/30: Where are Fisher, Neyman, Pearson in 1919? Opening of Excursion 3

Mementos, Keepsakes and Souvenirs

  • 10/29: Tour Guide Mementos (Excursion 1 Tour II of How to Get Beyond the Statistics Wars)
  • 11/8:   Souvenir C: A Severe Tester’s Translation Guide (Excursion 1 Tour II)
  • 10/5:  “It should never be true, though it is still often said, that the conclusions are no more accurate than the data on which they are based” (Keepsake by Fisher, 2.1)
  • 11/14: Tour Guide Mementos and Quiz 2.1 (Excursion 2 Tour I Induction and Confirmation)
  • 11/17: Mementos for Excursion 2 Tour II Falsification, Pseudoscience, Induction

*Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (Mayo, CUP 2018)

Categories: SIST, Statistical Inference as Severe Testing | 3 Comments

Mementos for Excursion 2 Tour II: Falsification, Pseudoscience, Induction (2.3-2.7)


Excursion 2 Tour II: Falsification, Pseudoscience, Induction*

Outline of Tour. Tour II visits Popper, falsification, corroboration, Duhem’s problem (what to blame in the case of anomalies) and the demarcation of science and pseudoscience (2.3). While Popper comes up short on each, the reader is led to improve on Popper’s notions (live exhibit (v)). Central ingredients for our journey are put in place via souvenirs: a framework of models and problems, and a post-Popperian language to speak about inductive inference. Defining a severe test, for Popperians, is linked to when data supply novel evidence for a hypothesis: family feuds about defining novelty are discussed (2.4). We move into Fisherian significance tests and the crucial requirements he set (often overlooked): isolated significant results are poor evidence of a genuine effect, and statistical significance doesn’t warrant substantive, e.g., causal inference (2.5). Applying our new demarcation criterion to a plausible effect (males are more likely than females to feel threatened by their partner’s success), we argue that a real revolution in psychology will need to be more revolutionary than at present. Whole inquiries might have to be falsified, their measurement schemes questioned (2.6). The Tour’s pieces are synthesized in (2.7), where a guest lecturer explains how to solve the problem of induction now, having redefined induction as severe testing.

Mementos from 2.3 Continue reading

Categories: Popper, Statistical Inference as Severe Testing, Statistics | 5 Comments

Tour Guide Mementos and QUIZ 2.1 (Excursion 2 Tour I: Induction and Confirmation)


Excursion 2 Tour I: Induction and Confirmation (Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars)

Tour Blurb. The roots of rival statistical accounts go back to the logical Problem of Induction. (2.1) The logical problem of induction is a matter of finding an argument to justify a type of argument (enumerative induction), so it is important to be clear on arguments, their soundness versus their validity. These are key concepts of fundamental importance to our journey. Given that any attempt to solve the logical problem of induction leads to circularity, philosophers turned instead to building logics that seemed to capture our intuitions about induction. This led to confirmation theory and some projects in today’s formal epistemology. There’s an analogy between contrasting views in philosophy and statistics: Carnapian confirmation is to Bayesian statistics, as Popperian falsification is to frequentist error statistics. Logics of confirmation take the form of probabilisms, either in the form of raising the probability of a hypothesis, or arriving at a posterior probability. (2.2) The contrast between these types of probabilisms, and the problems each is found to have in confirmation theory are directly relevant to the types of probabilisms in statistics. Notably, Harold Jeffreys’ non-subjective Bayesianism, and current spin-offs, share features with Carnapian inductive logics. We examine the problem of irrelevant conjunctions: that if x confirms H, it confirms (H & J) for any J. This also leads to what’s called the tacking paradox.

Quiz on 2.1 Soundness vs Validity in Deductive Logic. Let ~C be the denial of claim C. For each of the following argument, indicate whether it is valid and sound, valid but unsound, invalid. Continue reading

Categories: induction, SIST, Statistical Inference as Severe Testing, Statistics | 10 Comments

Stephen Senn: Rothamsted Statistics meets Lord’s Paradox (Guest Post)


Stephen Senn
Consultant Statistician

The Rothamsted School

I never worked at Rothamsted but during the eight years I was at University College London (1995-2003) I frequently shared a train journey to London from Harpenden (the village in which Rothamsted is situated) with John Nelder, as a result of which we became friends and I acquired an interest in the software package Genstat®.

That in turn got me interested in John Nelder’s approach to analysis of variance, which is a powerful formalisation of ideas present in the work of others associated with Rothamsted. Nelder’s important predecessors in this respect include, at least, RA Fisher (of course) and Frank Yates and others such as David Finney and Frank Anscombe. John died in 2010 and I regard Rosemary Bailey, who has done deep and powerful work on randomisation and the representation of experiments through Hasse diagrams, as being the greatest living proponent of the Rothamsted School. Another key figure is Roger Payne who turned many of John’s ideas into code in Genstat®. Continue reading

Categories: Error Statistics | 11 Comments

Souvenir C: A Severe Tester’s Translation Guide (Excursion 1 Tour II)


I will continue to post mementos and, at times, short excerpts following the pace of one “Tour” a week, in sync with some book clubs reading Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (SIST or Statinfast 2018, CUP), e.g., Lakens. This puts us at Excursion 2 Tour I, but first, here’s a quick Souvenir (Souvenir C) from Excursion 1 Tour II:

Souvenir C: A Severe Tester’s Translation Guide

Just as in ordinary museum shops, our souvenir literature often probes treasures that you didn’t get to visit at all. Here’s an example of that, and you’ll need it going forward. There’s a confusion about what’s being done when the significance tester considers the set of all of the outcomes leading to a d(x) greater than or equal to 1.96, i.e., {x: d(x) ≥ 1.96}, or just d(x) ≥ 1.96. This is generally viewed as throwing away the particular x, and lumping all these outcomes together. What’s really happening, according to the severe tester, is quite different. What’s actually being signified is that we are interested in the method, not just the particular outcome. Those who embrace the LP make it very plain that data-dependent selections and stopping rules drop out. To get them to drop in, we signal an interest in what the test procedure would have yielded. This is a counterfactual and is altogether essential in expressing the properties of the method, in particular, the probability it would have yielded some nominally significant outcome or other. Continue reading

Categories: Statistical Inference as Severe Testing | 7 Comments

The Replication Crises and its Constructive Role in the Philosophy of Statistics-PSA2018

Below are my slides from a session on replication at the recent Philosophy of Science Association meetings in Seattle.


Categories: Error Statistics | Leave a comment

Tour Guide Mementos (Excursion 1 Tour II of How to Get Beyond the Statistics Wars)

Stat Museum

Excursion 1 Tour II: Error Probing Tools vs. Logics of Evidence 

Blurb. Core battles revolve around the relevance of a method’s error probabilities. What’s distinctive about the severe testing account is that it uses error probabilities evidentially: to assess how severely a claim has passed a test. Error control is necessary but not sufficient for severity. Logics of induction focus on the relationships between given data and hypotheses–so outcomes other than the one observed drop out. This is captured in the Likelihood Principle (LP). Tour II takes us to the crux of central wars in relation to the Law of Likelihood (LL) and Bayesian probabilism. (1.4) Hypotheses deliberately designed to accord with the data can result in minimal severity. The likelihoodist wishes to oust them via degrees of belief captured in prior probabilities. To the severe tester, such gambits directly alter the evidence by leading to inseverity. (1.5) Stopping rules: If a tester tries and tries again until significance is reached–optional stopping–significance will be attained erroneously with high probability. According to the LP, the stopping rule doesn’t alter evidence. The irrelevance of optional stopping is an asset for holders of the LP, it’s the opposite for a severe tester. The warring sides talk past each other. Continue reading

Categories: SIST, Statistical Inference as Severe Testing | 1 Comment

A small amendment to Nuzzo’s tips for communicating p-values


I’ve been asked if I agree with Regina Nuzzo’s recent note on p-values [i]. I don’t want to be nit-picky, but one very small addition to Nuzzo’s helpful tips for communicating statistical significance can make it a great deal more helpful. Here’s my friendly amendment. She writes: Continue reading

Categories: P-values, science communication | 2 Comments

severe testing or severe sabotage? Christian Robert and the book slasher.

severe testing or severe sabotage? [not a book review]


I came across this anomaly on Christian Robert’s blog

Last week, I received this new book of Deborah Mayo, which I was looking forward reading and annotating!, but thrice alas, the book had been sabotaged: except for the preface and acknowledgements, the entire book is printed upside down [a minor issue since the entire book is concerned] and with some part of the text cut on each side [a few letters each time but enough to make reading a chore!]. I am thus waiting for a tested copy of the book to start reading it in earnest!

How bizarre, my book has been slashed with a knife, cruelly stabbing the page,letting words bleed out helter skelter. Some part of the text cut on each side? It wasn’t words with “Bayesian” in them was it? The only anomalous volume I’ve seen has a slightly crooked cover. Do you think it is the Book Slasher out for Halloween, or something more sinister? It’s a bit like serving the Michelin restaurant reviewer by dropping his meal on the floor, or accidentally causing a knife wound. I hope they remedy this quickly. (Talk about Neyman and quality control).

Readers: Feel free to use the comments to share you particular tale of woe in acquiring the book.

Categories: Statistical Inference as Severe Testing | 4 Comments

Tour Guide Mementos (Excursion 1, Tour I of How to Get Beyond the Statistics Wars)


Tour guides in your travels jot down Mementos and Keepsakes from each Tour[i] of my new book: Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (CUP 2018). Their scribblings, which may at times include details, at other times just a word or two, may be modified through the Tour, and in response to questions from travelers (so please check back). Since these are just mementos, they should not be seen as replacements for the more careful notions given in the journey (i.e., book) itself. Still, you’re apt to flesh out your notes in greater detail, so please share yours (along with errors you’re bound to spot), and we’ll create Meta-Mementos. Continue reading

Categories: Error Statistics, Statistical Inference as Severe Testing | 8 Comments

Philosophy of Statistics & the Replication Crisis in Science: A philosophical intro to my book (slides)

a road through the jungle

In my talk yesterday at the Philosophy Department at Virginia Tech, I introduced my new book: Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (Cambridge 2018). I began with my preface (explaining the meaning of my title), and turned to the Statistics Wars, largely from Excursion 1 of the book. After the sum-up at the end, I snuck in an example from the replication crisis in psychology. Here are the slides.


Categories: Error Statistics | Leave a comment

Excursion 2 Tour II (3rd stop): Falsification, Pseudoscience, Induction (2.3)

StatSci/PhilSci Museum

Where you are in the Journey*  We’ll move from the philosophical ground floor to connecting themes from other levels, from Popperian falsification to significance tests, and from Popper’s demarcation to current-day problems of pseudoscience and irreplication. An excerpt from our Museum Guide gives a broad-brush sketch of the first few sections of Tour II:

Karl Popper had a brilliant way to “solve” the problem of induction: Hume was right that enumerative induction is unjustified, but science is a matter of deductive falsification. Science was to be demarcated from pseudoscience according to whether its theories were testable and falsifiable. A hypothesis is deemed severely tested if it survives a stringent attempt to falsify it. Popper’s critics denied he could sustain this and still be a deductivist …

Popperian falsification is often seen as akin to Fisher’s view that “every experiment may be said to exist only in order to give the facts a chance of disproving the null hypothesis” (1935a, p. 16). Though scientists often appeal to Popper, some critics of significance tests argue that they are used in decidedly non-Popperian ways. Tour II explores this controversy.

While Popper didn’t make good on his most winning slogans, he gives us many seminal launching-off points for improved accounts of falsification, corroboration, science versus pseudoscience, and the role of novel evidence and predesignation. These will let you revisit some thorny issues in today’s statistical crisis in science. Continue reading

Categories: Statistical Inference as Severe Testing | 11 Comments

“It should never be true, though it is still often said, that the conclusions are no more accurate than the data on which they are based”


My new book, Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars,” you might have discovered, includes Souvenirs throughout (A-Z). But there are some highlights within sections that might be missed in the excerpts I’m posting. One such “keepsake” is a quote from Fisher at the very end of Section 2.1

These are some of the first clues we’ll be collecting on a wide difference between statistical inference as a deductive logic of probability, and an inductive testing account sought by the error statistician. When it comes to inductive learning, we want our inferences to go beyond the data: we want lift-off. To my knowledge, Fisher is the only other writer on statistical inference, aside from Peirce, to emphasize this distinction.

In deductive reasoning all knowledge obtainable is already latent in the postulates. Rigour is needed to prevent the successive inferences growing less and less accurate as we proceed. The conclusions are never more accurate than the data. In inductive reasoning we are performing part of the process by which new knowledge is created. The conclusions normally grow more and more accurate as more data are included. It should never be true, though it is still often said, that the conclusions are no more accurate than the data on which they are based. (Fisher 1935b, p. 54)

How do you understand this remark of Fisher’s? (Please share your thoughts in the comments.) My interpretation, and its relation to the “lift-off” needed to warrant inductive inferences, is discussed in an earlier section, 1.2, posted here.   Here’s part of that. 

Continue reading

Categories: induction, keepsakes from Stat Wars, Statistical Inference as Severe Testing | 7 Comments

Excursion 2: Taboos of Induction and Falsification: Tour I (first stop)

StatSci/PhilSci Museum

Where you are in the Journey* 

Cox: [I]n some fields foundations do not seem very important, but we both think that foundations of statistical inference are important; why do you think that is?

Mayo: I think because they ask about fundamental questions of evidence, inference, and probability … we invariably cross into philosophical questions about empirical knowledge and inductive inference. (Cox and Mayo 2011, p. 103)

Contemporary philosophy of science presents us with some taboos: Thou shalt not try to find solutions to problems of induction, falsification, and demarcating science from pseudoscience. It’s impossible to understand rival statistical accounts, let alone get beyond the statistics wars, without first exploring how these came to be “lost causes.” I am not talking of ancient history here: these problems were alive and well when I set out to do philosophy in the 1980s. I think we gave up on them too easily, and by the end of Excursion 2 you’ll see why. Excursion 2 takes us into the land of “Statistical Science and Philosophy of Science” (StatSci/PhilSci). Our Museum Guide gives a terse thumbnail sketch of Tour I. Here’s a useful excerpt:

Once the Problem of Induction was deemed to admit of no satisfactory, non-circular solutions (~1970s), philosophers of science turned to building formal logics of induction using the deductive calculus of probabilities, often called Confirmation Logics or Theories. A leader of this Confirmation Theory movement was Rudolf Carnap. A distinct program, led by Karl Popper, denies there is a logic of induction, and focuses on Testing and Falsification of theories by data. At best a theory may be accepted or corroborated if it fails to be falsified by a severe test. The two programs have analogues to distinct methodologies in statistics: Confirmation theory is to Bayesianism as Testing and Falsification are to Fisher and Neyman–Pearson.


Continue reading

Categories: induction, Statistical Inference as Severe Testing | 2 Comments

All She Wrote (so far): Error Statistics Philosophy: 7 years on

Error Statistics Philosophy: Blog Contents (7 years) [i]
By: D. G. Mayo

Dear Reader: I began this blog 7 years ago (Sept. 3, 2011)! A big celebration is taking place at the Elbar Room this evening, both for the blog and the appearance of my new book: Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (CUP). While a special rush edition made an appearance on Sept 3, in time for the RSS meeting in Cardiff, it was decided to hold off on the festivities until copies of the book were officially available (yesterday)! If you’re in the neighborhood, stop by for some Elba Grease


Many of the discussions in the book were importantly influenced (corrected and improved) by reader’s comments on the blog over the years. I thank readers for their input. Please peruse the offerings below, taking advantage of the discussions by guest posters and readers! I posted the first 3 sections of Tour I (in Excursion i) here, here, and here.
This blog will return to life, although I’m not yet sure of exactly what form it will take. Ideas are welcome. The tone of a book differs from a blog, so we’ll have to see what voice emerges here.


D. Mayo Continue reading

Categories: 3-year memory lane, 4 years ago!, blog contents, Metablog | 2 Comments

Blog at