As I have every year for the past 7 years of blogging, I revisit that spot in the road at 9 p.m., just outside the Elbar Room, looking to get into a strange-looking taxi to head to “Midnight With Birnbaum”. (The pic on the left is the only blurry image I have of the club I’m taken to.) I wonder if the car will come for me this year, as I wait out in the cold, now that *Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars* (SIST) is out. SIST doesn’t rehearse the argument from my Birnbaum article, but there’s much in it that I’d like to discuss with him. The (Strong) Likelihood Principle–whether or not it is named–remains at the heart of many of the criticisms of Neyman-Pearson (N-P) statistics (and cognate methods). 2018 was the 60th birthday of Cox’s “weighing machine” example, which was the basis of Birnbaum’s attempted proof. Yet as Birnbaum insisted, the “confidence concept” is the “one rock in a shifting scene” of statistical foundations, insofar as there’s interest in controlling the frequency of erroneous interpretations of data. (See my rejoinder.) Birnbaum bemoaned the lack of an explicit evidential interpretation of N-P methods. Maybe in 2019? Anyway, the cab is finally here…the rest is live. Happy New Year! Continue reading

## Midnight With Birnbaum (Happy New Year 2018)

## You Should Be Binge Reading the (Strong) Likelihood Principle

An essential component of inference based on familiar frequentist notions: p-values, significance and confidence levels, is the relevant sampling distribution (hence the term *sampling theory*, or my preferred *error statistics*, as we get error probabilities from the sampling distribution). This feature results in violations of a principle known as the *strong likelihood principle* (SLP). To state the SLP roughly, it asserts that all the evidential import in the data (for parametric inference within a model) resides in the likelihoods. If accepted, it would render error probabilities irrelevant post data.

**SLP** (We often drop the “strong” and just call it the LP. The “weak” LP just boils down to sufficiency)

For any two experiments $E_1$ and $E_2$ with different probability models $f_1$, $f_2$, but with the same unknown parameter $\theta$, if outcomes $x^*$ and $y^*$ (from $E_1$ and $E_2$ respectively) determine the same (i.e., proportional) likelihood function ($f_1(x^*; \theta) = c f_2(y^*; \theta)$ for all $\theta$), then $x^*$ and $y^*$ are inferentially equivalent (for an inference about $\theta$).

(What differentiates the weak and the strong LP is that the weak refers to a single experiment.)
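To make the proportionality condition concrete, here is a minimal sketch using a textbook pairing that is not taken from this post: a binomial experiment with a fixed number of trials versus a negative binomial experiment that samples until a fixed number of successes. The specific numbers (12 trials, 3 successes) and the function names are illustrative assumptions.

```python
from math import comb

def binomial_lik(theta, n=12, x=3):
    """E1 (assumed): n trials fixed in advance, x successes observed."""
    return comb(n, x) * theta**x * (1 - theta)**(n - x)

def neg_binomial_lik(theta, r=3, n=12):
    """E2 (assumed): sample until the r-th success, which arrives on trial n."""
    return comb(n - 1, r - 1) * theta**r * (1 - theta)**(n - r)

# The ratio f1/f2 is the same constant c for every theta, so the two
# outcomes determine proportional likelihood functions.
thetas = [0.1, 0.25, 0.5, 0.9]
ratios = [binomial_lik(t) / neg_binomial_lik(t) for t in thetas]
print(ratios)
```

Under the SLP the two outcomes would count as inferentially equivalent about θ, even though the two sampling distributions, and hence the error probabilities computed from them, differ.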

Continue reading

## 60 Years of Cox’s (1958) Chestnut: Excerpt from Excursion 3 Tour II (Mayo 2018, CUP)

2018 marked 60 years since the famous weighing machine example from Sir David Cox (1958)[1]. It’s one of the “chestnuts” in the exhibits of “chestnuts and howlers” in Excursion 3 (Tour II) of my new book *Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars* (SIST). It’s especially relevant to take this up now, just before we leave 2018, for reasons that will be revealed over the next day or two. So, let’s go back to it, with an excerpt from SIST (pp. 170-173).

**Exhibit (vi): Two Measuring Instruments of Different Precisions.** *Did you hear about the frequentist who, knowing she used a scale that’s right only half the time, claimed her method of weighing is right 75% of the time?*

She says, “I flipped a coin to decide whether to use a scale that’s right 100% of the time, or one that’s right only half the time, so, overall, I’m right 75% of the time.” (She wants credit because she could have used a better scale, even knowing she used a lousy one.)

*Basis for the joke:* An N-P test bases error probability on all possible outcomes or measurements that could have occurred in repetitions, but did not. Continue reading
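The arithmetic behind the joke can be sketched in a few lines. The dictionary and function names are illustrative assumptions; the probabilities are the ones stated in the joke (a fair coin, one scale right 100% of the time, the other right half the time).

```python
# Probability the coin flip selects each scale (a fair coin, as in the joke).
P_SCALE = {"precise": 0.5, "imprecise": 0.5}
# Probability each scale gives the right weight, per the joke's setup.
P_RIGHT = {"precise": 1.0, "imprecise": 0.5}

def unconditional_accuracy():
    """Average over both scales, including the one that was not used."""
    return sum(P_SCALE[s] * P_RIGHT[s] for s in P_SCALE)

def conditional_accuracy(scale_used):
    """Accuracy given the scale known to have been used."""
    return P_RIGHT[scale_used]

print(unconditional_accuracy())           # 0.75
print(conditional_accuracy("imprecise"))  # 0.5
```

The 75% figure averages over a measurement that never took place; once she knows she used the lousy scale, the relevant figure is 50%.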

## Excerpt from Excursion 4 Tour I: The Myth of “The Myth of Objectivity” (Mayo 2018, CUP)

## Tour I The Myth of “The Myth of Objectivity”*

Objectivity in statistics, as in science more generally, is a matter of both aims and methods. Objective science, in our view, aims to find out what is the case as regards aspects of the world [that hold] independently of our beliefs, biases and interests; thus objective methods aim for the critical control of inferences and hypotheses, constraining them by evidence and checks of error. (Cox and Mayo 2010, p. 276)

Whenever you come up against blanket slogans such as “no methods are objective” or “all methods are equally objective and subjective” it is a good guess that the problem is being trivialized into oblivion. Yes, there are judgments, disagreements, and values in any human activity, which alone makes it too trivial an observation to distinguish among very different ways that threats of bias and unwarranted inferences may be controlled. Is the objectivity–subjectivity distinction really toothless, as many will have you believe? I say no. I know it’s a meme promulgated by statistical high priests, but you agreed, did you not, to use a bit of chutzpah on this excursion? Besides, cavalier attitudes toward objectivity are at odds with even more widely endorsed grass roots movements to promote replication, reproducibility, and to come clean on a number of sources behind illicit results: multiple testing, cherry picking, failed assumptions, researcher latitude, publication bias and so on. The moves to take back science are rooted in the supposition that we can more objectively scrutinize results – even if it’s only to point out those that are BENT. The fact that these terms are used equivocally should not be taken as grounds to oust them but rather to engage in the difficult work of identifying what there is in “objectivity” that we won’t give up, and shouldn’t. Continue reading

## Tour Guide Mementos From Excursion 3 Tour III: Capability and Severity: Deeper Concepts

**Excursion 3 Tour III:**

A long-standing family feud among frequentists is between hypotheses tests and confidence intervals (CIs). In fact there’s a clear duality between the two: the parameter values within the (1 – α) CI are those that are not rejectable by the corresponding test at level α. (3.7) illuminates both CIs and severity by means of this duality. A key idea is arguing from the **capabilities** of methods to what may be inferred. CIs thereby obtain an inferential rationale (beyond performance), and several benchmarks are reported. Continue reading
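The duality can be checked numerically. The sketch below assumes a two-sided test of a Normal mean with known σ; the observed mean of 152, σ = 10, n = 100 and α = 0.05 are illustrative values, not figures from the Tour, and the function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def confidence_interval(xbar, sigma, n, alpha):
    """Two-sided (1 - alpha) interval for a Normal mean with known sigma."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    se = sigma / sqrt(n)
    return xbar - z * se, xbar + z * se

def rejects(mu0, xbar, sigma, n, alpha):
    """Two-sided level-alpha test of H0: mu = mu0."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return abs(xbar - mu0) / (sigma / sqrt(n)) > z

xbar, sigma, n, alpha = 152.0, 10.0, 100, 0.05   # assumed values
lo, hi = confidence_interval(xbar, sigma, n, alpha)

# Every candidate mu0 lies inside the interval exactly when the test
# at level alpha fails to reject it.
grid = [148.0 + 0.1 * k for k in range(81)]
agree = all((lo <= mu0 <= hi) == (not rejects(mu0, xbar, sigma, n, alpha))
            for mu0 in grid)
print(round(lo, 2), round(hi, 2), agree)
```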

## Capability and Severity: Deeper Concepts: Excerpts From Excursion 3 Tour III

**Tour III Capability and Severity: Deeper Concepts**

From the itinerary: A long-standing family feud among frequentists is between hypotheses tests and confidence intervals (CIs), but in fact there’s a clear duality between the two. The dual mission of the first stop (Section 3.7) of this tour is to illuminate both CIs and severity by means of this duality. A key idea is arguing from the capabilities of methods to what may be inferred. The severity analysis seamlessly blends testing and estimation. A typical inquiry first tests for the existence of a genuine effect and then estimates magnitudes of discrepancies, or inquires if theoretical parameter values are contained within a confidence interval. At the second stop (Section 3.8) we reopen a highly controversial matter of interpretation that is often taken as settled. It relates to statistics and the discovery of the Higgs particle – displayed in a recently opened gallery on the “Statistical Inference in Theory Testing” level of today’s museum. Continue reading

## Summer Seminar PhilStat: July 28-Aug 11, 2019 (ii)

## Mementos for “It’s the Methods, Stupid!” Excursion 3 Tour II (3.4-3.6)

**Excursion 3 Tour II: It’s The Methods, Stupid**

Tour II disentangles a jungle of conceptual issues at the heart of today’s statistics wars. The first stop **(3.4)** unearths the basis for a number of howlers and chestnuts thought to be licensed by Fisherian or N-P tests.* In each exhibit, we study the basis for the joke. Together, they show: the need for an adequate test statistic, the difference between implicationary assumptions (i-assumptions) and actual assumptions, and the fact that tail areas serve to raise, and not lower, the bar for rejecting a null hypothesis. (Additional howlers occur in Excursion 3 Tour III.)

*recommended: medium to heavy shovel *

## It’s the Methods, Stupid: Excerpt from Excursion 3 Tour II (Mayo 2018, CUP)

**Tour II It’s the Methods, Stupid**

There is perhaps in current literature a tendency to speak of the Neyman–Pearson contributions as some static system, rather than as part of the historical process of development of thought on statistical theory which is and will always go on. (Pearson 1962, 276)

This goes for Fisherian contributions as well. Unlike museums, we won’t remain static. The lesson from Tour I of this Excursion is that Fisherian and Neyman–Pearsonian tests may be seen as offering clusters of methods appropriate for different contexts within the large taxonomy of statistical inquiries. There is an overarching pattern: Continue reading

## Memento & Quiz (on SEV): Excursion 3, Tour I

As you enjoy the weekend discussion & concert in the Captain’s Central Limit Library & Lounge, your Tour Guide has prepared a brief overview of Excursion 3 Tour I, and a short (semi-severe) quiz on severity, based on exhibit (i).*****

We move from Popper through a gallery on “Data Analysis in the 1919 Eclipse tests of the General Theory of Relativity (GTR)” (3.1) which leads to the main gallery on the origin of statistical tests (3.2) by way of a look at where the main members of our statistical cast are in 1919: Fisher, Neyman and Pearson. From the GTR episode, we identify the key elements of a statistical test–the steps in E.S. Pearson’s opening description of tests in 3.2. The classical testing notions–type I and II errors, power, consistent tests–are shown to grow out of requiring probative tests. The typical (behavioristic) formulation of N-P tests came later. The severe tester breaks out of the behavioristic prison. A first look at the severity construal of N-P tests is in Exhibit (i). Viewing statistical inference as severe testing shows how to do all that N-P tests do (and more) while a member of the Fisherian Tribe (3.3). We consider the frequentist principle of evidence FEV and the divergent interpretations that are called for by Cox’s taxonomy of null hypotheses. The last member of the taxonomy–substantively based null hypotheses–returns us to the opening episode of GTR. Continue reading

## First Look at N-P Methods as Severe Tests: Water plant accident [Exhibit (i) from Excursion 3]

**Exhibit (i) N-P Methods as Severe Tests: First Look (Water Plant Accident)**

There’s been an accident at a water plant where our ship is docked, and the cooling system had to be repaired. It is meant to ensure that the mean temperature of discharged water stays below the temperature that threatens the ecosystem, perhaps not much beyond 150 degrees Fahrenheit. There were 100 water measurements taken at randomly selected times and the sample mean $\bar{x}$ computed, each with a known standard deviation $\sigma = 10$. When the cooling system is effective, each measurement is like observing $X \sim N(150, 10^2)$. Because of this variability, we expect different 100-fold water samples to lead to different values of $\bar{X}$, but we can deduce its distribution. If each $X \sim N(\mu = 150, 10^2)$ then $\bar{X}$ is also Normal with $\mu = 150$, but the standard deviation of $\bar{X}$ is only $\sigma/\sqrt{n} = 10/\sqrt{100} = 1$. So $\bar{X} \sim N(\mu = 150, 1)$. Continue reading
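The claim that the sample mean of 100 measurements has standard deviation 1 can be checked with a quick simulation. This is a minimal sketch: the number of repetitions, the seed, and the function names are assumptions made for illustration, while μ = 150, σ = 10 and n = 100 come from the exhibit.

```python
import random
from math import sqrt
from statistics import fmean, stdev

MU, SIGMA, N = 150.0, 10.0, 100   # values stated in the exhibit

def standard_error(sigma, n):
    """Standard deviation of the sample mean: sigma / sqrt(n)."""
    return sigma / sqrt(n)

def simulate_sample_means(reps, seed=0):
    """Draw `reps` independent 100-fold samples and return their means."""
    rng = random.Random(seed)
    return [fmean(rng.gauss(MU, SIGMA) for _ in range(N))
            for _ in range(reps)]

means = simulate_sample_means(reps=5000)
print(standard_error(SIGMA, N))   # 1.0
print(fmean(means), stdev(means))
```

The simulated means should cluster around 150 with a spread close to 1, in line with the deduced distribution of the sample mean.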

## Neyman-Pearson Tests: An Episode in Anglo-Polish Collaboration: Excerpt from Excursion 3 (3.2)

**3.2 N-P Tests: An Episode in Anglo-Polish Collaboration***

We proceed by setting up a specific hypothesis to test, $H_0$ in Neyman’s and my terminology, the null hypothesis in R. A. Fisher’s . . . in choosing the test, we take into account alternatives to $H_0$ which we believe possible or at any rate consider it most important to be on the look out for . . . Three steps in constructing the test may be defined:

Step 1. We must first specify the set of results . . .

Step 2. We then divide this set by a system of ordered boundaries . . . such that as we pass across one boundary and proceed to the next, we come to a class of results which makes us more and more inclined, on the information available, to reject the hypothesis tested in favour of alternatives which differ from it by increasing amounts.

Step 3. We then, if possible, associate with each contour level the chance that, if $H_0$ is true, a result will occur in random sampling lying beyond that level . . . In our first papers [in 1928] we suggested that the likelihood ratio criterion, λ, was a very useful one . . . Thus Step 2 preceded Step 3. In later papers [1933–1938] we started with a fixed value for the chance, ε, of Step 3 . . . However, although the mathematical procedure may put Step 3 before 2, we cannot put this into operation before we have decided, under Step 2, on the guiding principle to be used in choosing the contour system. That is why I have numbered the steps in this order. (Egon Pearson 1947, p. 173)

In addition to Pearson’s 1947 paper, the museum follows his account in “The Neyman–Pearson Story: 1926–34” (Pearson 1970). The subtitle is “Historical Sidelights on an Episode in Anglo-Polish Collaboration”!

We meet Jerzy Neyman at the point he’s sent to have his work sized up by Karl Pearson at University College in 1925/26. Neyman wasn’t that impressed: Continue reading

## Where Are Fisher, Neyman, Pearson in 1919? Opening of Excursion 3

**Excursion 3 Statistical Tests and Scientific Inference**

Tour I Ingenious and Severe Tests

[T]he impressive thing about [the 1919 tests of Einstein’s theory of gravity] is the risk involved in a prediction of this kind. If observation shows that the predicted effect is definitely absent, then the theory is simply refuted. The theory is incompatible with certain possible results of observation – in fact with results which everybody before Einstein would have expected. This is quite different from the situation I have previously described, [where] . . . it was practically impossible to describe any human behavior that might not be claimed to be a verification of these [psychological] theories. (Popper 1962, p. 36)

The 1919 eclipse experiments opened Popper’s eyes to what made Einstein’s theory so different from other revolutionary theories of the day: Einstein was prepared to subject his theory to risky tests.[1] Einstein was eager to galvanize scientists to test his theory of gravity, knowing the solar eclipse was coming up on May 29, 1919. Leading the expedition to test GTR was a perfect opportunity for Sir Arthur Eddington, a devout follower of Einstein as well as a devout Quaker and conscientious objector. Fearing “a scandal if one of its young stars went to jail as a conscientious objector,” officials at Cambridge argued that Eddington couldn’t very well be allowed to go off to war when the country needed him to prepare the journey to test Einstein’s predicted light deflection (Kaku 2005, p. 113). Continue reading

## Stephen Senn: On the level. Why block structure matters and its relevance to Lord’s paradox (Guest Post)

**Stephen Senn**

*Consultant Statistician*

*Edinburgh*

**Introduction**

In a **previous post** I considered Lord’s paradox from the perspective of the ‘Rothamsted School’ and its approach to the analysis of experiments. I now illustrate this in some detail giving an example.

**What I shall do**

I have simulated data from an experiment in which two diets have been compared in 20 student halls of residence, each diet having been applied to 10 halls. I shall assume that the halls have been randomly allocated the diet and that in each hall 10 students have been randomly chosen to have their weights recorded at the beginning of the academic year and again at the end. Continue reading

## SIST* Posts: Excerpts & Mementos (to Nov 30, 2018)

**SIST* BLOG POSTS (up to Nov 30, 2018)**

*Excerpts*

- 05/19: The Meaning of My Title: *Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars*
- 09/08: Excursion 1 Tour I: Beyond Probabilism and Performance: Severity Requirement (1.1)
- 09/11: Excursion 1 Tour I (2nd stop): Probabilism, Performance, and Probativeness (1.2)
- 09/15: Excursion 1 Tour I (3rd stop): The Current State of Play in Statistical Foundations: A View From a Hot-Air Balloon (1.3)
- 09/29: Excursion 2: Taboos of Induction and Falsification: Tour I (first stop)
- 10/10: Excursion 2 Tour II (3rd stop): Falsification, Pseudoscience, Induction (2.3)
- 11/30: Where are Fisher, Neyman, Pearson in 1919? Opening of Excursion 3

*Mementos, Keepsakes and Souvenirs*

- 10/29: Tour Guide **Mementos** (Excursion 1 Tour II of How to Get Beyond the Statistics Wars)
- 11/8: **Souvenir** C: A Severe Tester’s Translation Guide (Excursion 1 Tour II)
- 10/5: “It should never be true, though it is still often said, that the conclusions are no more accurate than the data on which they are based” (**Keepsake** by Fisher, 2.1)
- 11/14: Tour Guide **Mementos** and Quiz 2.1 (Excursion 2 Tour I Induction and Confirmation)
- 11/17: **Mementos** for Excursion 2 Tour II Falsification, Pseudoscience, Induction

\**Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars* (Mayo, CUP 2018)

## Mementos for Excursion 2 Tour II: Falsification, Pseudoscience, Induction (2.3-2.7)

**Excursion 2 Tour II: Falsification, Pseudoscience, Induction***

*Outline of Tour.* Tour II visits Popper, falsification, corroboration, Duhem’s problem (what to blame in the case of anomalies) and the demarcation of science and pseudoscience (2.3). While Popper comes up short on each, the reader is led to improve on Popper’s notions (live exhibit (v)). Central ingredients for our journey are put in place via souvenirs: a framework of models and problems, and a post-Popperian language to speak about inductive inference. Defining a severe test, for Popperians, is linked to when data supply novel evidence for a hypothesis: family feuds about defining novelty are discussed (2.4). We move into Fisherian significance tests and the crucial requirements he set (often overlooked): isolated significant results are poor evidence of a genuine effect, and statistical significance doesn’t warrant substantive, e.g., causal inference (2.5). Applying our new demarcation criterion to a plausible effect (males are more likely than females to feel threatened by their partner’s success), we argue that a real revolution in psychology will need to be more revolutionary than at present. Whole inquiries might have to be falsified, their measurement schemes questioned (2.6). The Tour’s pieces are synthesized in (2.7), where a guest lecturer explains how to solve the problem of induction now, having redefined induction as severe testing.

**Mementos from 2.3** Continue reading

## Tour Guide Mementos and QUIZ 2.1 (Excursion 2 Tour I: Induction and Confirmation)

**Excursion 2 Tour I: Induction and Confirmation** *(Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars)*

*Tour Blurb*. The roots of rival statistical accounts go back to the logical Problem of Induction. (2.1) The logical problem of induction is a matter of finding an argument to justify a type of argument (enumerative induction), so it is important to be clear on arguments, their soundness versus their validity. These are key concepts of fundamental importance to our journey. Given that any attempt to solve the logical problem of induction leads to circularity, philosophers turned instead to building logics that seemed to capture our intuitions about induction. This led to confirmation theory and some projects in today’s formal epistemology. There’s an analogy between contrasting views in philosophy and statistics: Carnapian confirmation is to Bayesian statistics, as Popperian falsification is to frequentist error statistics. Logics of confirmation take the form of probabilisms, either in the form of raising the probability of a hypothesis, or arriving at a posterior probability. (2.2) The contrast between these types of probabilisms, and the problems each is found to have in confirmation theory are directly relevant to the types of probabilisms in statistics. Notably, Harold Jeffreys’ non-subjective Bayesianism, and current spin-offs, share features with Carnapian inductive logics. We examine the problem of irrelevant conjunctions: that if **x** confirms *H*, it confirms (*H* & *J*) for any *J*. This also leads to what’s called the tacking paradox.

**Quiz on 2.1** Soundness vs Validity in Deductive Logic. Let ~*C* be the denial of claim *C*. For each of the following arguments, indicate whether it is **valid and sound**, **valid but unsound**, or **invalid**. Continue reading

## Stephen Senn: Rothamsted Statistics meets Lord’s Paradox (Guest Post)

**Stephen Senn**

Consultant Statistician

Edinburgh

**The Rothamsted School**

I never worked at Rothamsted but during the eight years I was at University College London (1995-2003) I frequently shared a train journey to London from Harpenden (the village in which Rothamsted is situated) with John Nelder, as a result of which we became friends and I acquired an interest in the software package Genstat®.

That in turn got me interested in John Nelder’s approach to analysis of variance, which is a powerful formalisation of ideas present in the work of others associated with Rothamsted. Nelder’s important predecessors in this respect include, at least, RA Fisher (of course) and Frank Yates and others such as David Finney and Frank Anscombe. John died in 2010 and I regard Rosemary Bailey, who has done deep and powerful work on randomisation and the representation of experiments through Hasse diagrams, as being the greatest living proponent of the Rothamsted School. Another key figure is Roger Payne who turned many of John’s ideas into code in Genstat®. Continue reading

## Souvenir C: A Severe Tester’s Translation Guide (Excursion 1 Tour II)

I will continue to post mementos and, at times, short excerpts following the pace of one “Tour” a week, in sync with some book clubs reading *Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars* (SIST or Statinfast 2018, CUP), e.g., Lakens. This puts us at Excursion 2 Tour I, but first, here’s a quick Souvenir (Souvenir C) from Excursion 1 Tour II:

**Souvenir C: A Severe Tester’s Translation Guide**

Just as in ordinary museum shops, our souvenir literature often probes treasures that you didn’t get to visit at all. Here’s an example of that, and you’ll need it going forward. There’s a confusion about what’s being done when the significance tester considers the set of all of the outcomes leading to a d(**x**) greater than or equal to 1.96, i.e., {**x**: d(**x**) ≥ 1.96}, or just d(**x**) ≥ 1.96. This is generally viewed as throwing away the particular **x**, and lumping all these outcomes together. What’s really happening, according to the severe tester, is quite different. What’s actually being signified is that we are interested in the method, not just the particular outcome. Those who embrace the LP make it very plain that data-dependent selections and stopping rules drop out. To get them to drop in, we signal an interest in what the test procedure would have yielded. This is a counterfactual and is altogether essential in expressing the properties of the method, in particular, the probability it would have yielded some nominally significant outcome or other. Continue reading
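As a rough illustration of treating {**x**: d(**x**) ≥ 1.96} as a property of the method, the sketch below computes the probability of that set under the null and compares it with the relative frequency over many hypothetical repetitions. It assumes d(**X**) is standard Normal under the null hypothesis; the function names, seed, and repetition count are illustrative choices, not part of the Souvenir.

```python
import random
from statistics import NormalDist

CUTOFF = 1.96

def tail_probability(cutoff=CUTOFF):
    """P(d(X) >= cutoff) when d(X) is standard Normal (assumed null model)."""
    return 1 - NormalDist().cdf(cutoff)

def simulated_rejection_rate(reps=100_000, seed=1):
    """Relative frequency with which the procedure would have yielded
    d(X) >= CUTOFF across repeated hypothetical samples under the null."""
    rng = random.Random(seed)
    hits = sum(rng.gauss(0.0, 1.0) >= CUTOFF for _ in range(reps))
    return hits / reps

print(round(tail_probability(), 4))
print(simulated_rejection_rate())
```

Both numbers describe what the test procedure would have yielded over repetitions, not the particular outcome observed.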

## The Replication Crises and its Constructive Role in the Philosophy of Statistics-PSA2018

Below are my slides from a session on replication at the recent Philosophy of Science Association meetings in Seattle.