## Several reviews of Deborah Mayo’s new book, Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars « Statistical Modeling, Causal Inference, and Social Science

## Excursion 1 Tour II: Error Probing Tools versus Logics of Evidence-Excerpt

For the first time, I’m excerpting all of Excursion 1 Tour II from SIST (2018, CUP).

**1.4 The Law of Likelihood and Error Statistics**

If you want to understand what’s true about statistical inference, you should begin with what has long been a holy grail–to use probability to arrive at a type of logic of evidential support–and in the first instance you should look not at full-blown Bayesian probabilism, but at comparative accounts that sidestep prior probabilities in hypotheses. An intuitively plausible logic of comparative support was given by the philosopher Ian Hacking (1965)–the Law of Likelihood. Fortunately, the Museum of Statistics is organized by theme, and the Law of Likelihood and the related Likelihood Principle is a big one. Continue reading

## there’s a man at the wheel in your brain & he’s telling you what you’re allowed to say (not probability, not likelihood)

It seems like every week something of excitement in statistics comes down the pike. Last week I was contacted by Richard Harris (and 2 others) about the recommendation to stop saying the data reach “significance level p” but rather simply say

“the p-value is p”.

(For links, see my previous post.) Friday, he wrote to ask if I would comment on a proposed restriction (?) on saying a test had high power! I agreed that we shouldn’t say a test has high power, but only that it has a high power to detect a specific alternative, but I wasn’t aware of any rulings from those in power on power. He explained it was an upshot of a reexamination by a joint group of the boards of statistical associations in the U.S. and UK. of the full panoply of statistical terms. Something like that. I agreed to speak with him yesterday. He emailed me the proposed ruling on power: Continue reading

## Diary For Statistical War Correspondents on the Latest Ban on Speech

When science writers, especially “statistical war correspondents”, contact you to weigh in on some article, they may talk to you until they get something spicy, and then they may or may not include the background context. So a few writers contacted me this past week regarding this article (“Retire Statistical Significance”)–a teaser, I now suppose, to advertise the ASA collection growing out of that conference “A world beyond P ≤ .05” way back in Oct 2017, where I gave a paper*. I jotted down some points, since Richard Harris from NPR needed them immediately, and I had just gotten off a plane when he emailed. He let me follow up with him, which is rare and greatly appreciated. So I streamlined the first set of points, and dropped any points he deemed technical. I sketched the third set for a couple of other journals who contacted me, who may or may not use them. Here’s Harris’ article, which includes a couple of my remarks. Continue reading

## 1 Days to Apply for the Summer Seminar in Phil Stat

Go to the website for instructions: SummerSeminarPhilStat.com.

## S. Senn: To infinity and beyond: how big are your data, really? (guest post)

**Stephen Senn**

Consultant Statistician

Edinburgh

**What is this ***n *you boast about?

*n*you boast about?

Failure to understand components of variation is the source of much mischief. It can lead researchers to overlook that they can be rich in data-points but poor in information. The important thing is always to understand what varies in the data you have, and to what extent your design, and the purpose you have in mind, master it. The result of failing to understand this can be that you mistakenly calculate standard errors of your estimates that are too small because you divide the variance by an *n *that is too big. In fact, the problems can go further than this, since you may even pick up the wrong covariance and hence use inappropriate regression coefficients to adjust your estimates.

I shall illustrate this point using clinical trials in asthma. Continue reading

## Blurbs of 16 Tours: Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (SIST)

**Statistical Inference as Severe Testing:**

*How to Get Beyond the Statistics Wars (2018, CUP)*

**Deborah G. Mayo**

**Abstract for Book**

By disinterring the underlying statistical philosophies this book sets the stage for understanding and finally getting beyond today’s most pressing controversies revolving around statistical methods and irreproducible findings. *Statistical Inference as Severe Testing *takes the reader on a journey that provides a non-technical “how to” guide for zeroing in on the most influential arguments surrounding commonly used–and abused– statistical methods. The book sets sail with a tool for telling what’s true about statistical controversies: If little if anything has been done to rule out flaws in taking data as evidence for a claim, then that claim has not passed a stringent or *severe test*. In the severe testing account, probability arises in inference, not to measure degrees of plausibility or belief in hypotheses, but to assess and control how severely tested claims are. Viewing statistical inference as severe testing supplies novel solutions to problems of induction, falsification and demarcating science from pseudoscience, and serves as the linchpin for understanding and getting beyond the statistics wars. The book links philosophical questions about the roles of probability in inference to the concerns of practitioners in psychology, medicine, biology, economics, physics and across the landscape of the natural and social sciences.

**Keywords for book:**

Severe testing, Bayesian and frequentist debates, Philosophy of statistics, Significance testing controversy, statistics wars, replication crisis, statistical inference, error statistics, Philosophy and history of Neyman, Pearson and Fisherian statistics, Popperian falsification

## Deconstructing the Fisher-Neyman conflict wearing fiducial glasses + Excerpt 5.8 from SIST

This continues my previous post: “Can’t take the fiducial out of Fisher…” in recognition of Fisher’s birthday, February 17. These 2 posts reflect my working out of these ideas in writing Section 5.8 of *Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars* (SIST, CUP 2018).* *Here’s all of Section 5.8 (“Neyman’s Performance and Fisher’s Fiducial Probability”) for your Saturday night reading.*

Move up 20 years to the famous 1955/56 exchange between Fisher and Neyman. Fisher clearly connects Neyman’s adoption of a behavioristic-performance formulation to his denying the soundness of fiducial inference. When “Neyman denies the existence of inductive reasoning, he is merely expressing a verbal preference. For him ‘reasoning’ means what ‘deductive reasoning’ means to others.” (Fisher 1955, p. 74). Continue reading

## Can’t Take the Fiducial Out of Fisher (if you want to understand the N-P performance philosophy) [i]

Continuing with posts in recognition of R.A. Fisher’s birthday, I post one from a few years ago on a topic that had previously not been discussed on this blog: Fisher’s* fiducial probability*.

[Neyman and Pearson] “began an influential collaboration initially designed primarily, it would seem to clarify Fisher’s writing. This led to their theory of testing hypotheses and to Neyman’s development of confidence intervals, aiming to clarify Fisher’s idea of fiducial intervals (D.R.Cox, 2006, p. 195).

The entire episode of fiducial probability is fraught with minefields. Many say it was Fisher’s biggest blunder; others suggest it still hasn’t been understood. The majority of discussions omit the side trip to the Fiducial Forest altogether, finding the surrounding brambles too thorny to penetrate. Besides, a fascinating narrative about the Fisher-Neyman-Pearson divide has managed to bloom and grow while steering clear of fiducial probability–never mind that it remained a centerpiece of Fisher’s statistical philosophy. I now think that this is a mistake. It was thought, following Lehmann (1993) and others, that we could take the fiducial out of Fisher and still understand the core of the Neyman-Pearson vs Fisher (or Neyman vs Fisher) disagreements. We can’t. Quite aside from the intrinsic interest in correcting the “he said/he said” of these statisticians, the issue is intimately bound up with the current (flawed) consensus view of frequentist error statistics. Continue reading

## Guest Blog: R. A. Fisher: How an Outsider Revolutionized Statistics (Aris Spanos)

In recognition of R.A. Fisher’s birthday on February 17…a week of Fisher posts!

**‘R. A. Fisher: How an Outsider Revolutionized Statistics’**

by **Aris Spanos**

Few statisticians will dispute that R. A. Fisher **(February 17, 1890 – July 29, 1962)** is the father of modern statistics; see Savage (1976), Rao (1992). Inspired by William Gosset’s (1908) paper on the Student’s t finite sampling distribution, he recast statistics into the modern model-based induction in a series of papers in the early 1920s. He put forward a theory of *optimal estimation* based on the method of maximum likelihood that has changed only marginally over the last century. His significance testing, spearheaded by the p-value, provided the basis for the Neyman-Pearson theory of *optimal testing* in the early 1930s. According to Hald (1998)

“Fisher was a genius who almost single-handedly created the foundations for modern statistical science, without detailed study of his predecessors. When young he was ignorant not only of the Continental contributions but even of contemporary publications in English.” (p. 738)

What is not so well known is that Fisher was the *ultimate outsider* when he brought about this change of paradigms in statistical science. As an undergraduate, he studied mathematics at Cambridge, and then did graduate work in statistical mechanics and quantum theory. His meager knowledge of statistics came from his study of astronomy; see Box (1978). That, however did not stop him from publishing his first paper in statistics in 1912 (still an undergraduate) on “curve fitting”, questioning Karl Pearson’s method of moments and proposing a new method that was eventually to become the likelihood method in his 1921 paper. Continue reading

## R.A. Fisher: “Statistical methods and Scientific Induction”

I continue a week of Fisherian posts begun on his birthday (Feb 17). This is his contribution to the “Triad”–an exchange between Fisher, Neyman and Pearson 20 years after the Fisher-Neyman break-up. The other two are below. They are each very short and are worth your rereading.

*“Statistical Methods and Scientific Induction”*

*by Sir Ronald Fisher (1955)
*

**SUMMARY**

The attempt to reinterpret the common tests of significance used in scientific research as though they constituted some kind of acceptance procedure and led to “decisions” in Wald’s sense, originated in several misapprehensions and has led, apparently, to several more.

The three phrases examined here, with a view to elucidating they fallacies they embody, are:

- “Repeated sampling from the same population”,
- Errors of the “second kind”,
- “Inductive behavior”.

Mathematicians without personal contact with the Natural Sciences have often been misled by such phrases. The errors to which they lead are not only numerical.

To continue reading Fisher’s paper.

**“Note on an Article by Sir Ronald Fisher“**

**by Jerzy Neyman (1956)**

**Summary**

(1) FISHER’S allegation that, contrary to some passages in the introduction and on the cover of the book by Wald, this book does not really deal with experimental design is unfounded. In actual fact, the book is permeated with problems of experimentation. (2) Without consideration of hypotheses alternative to the one under test and without the study of probabilities of the two kinds, no purely probabilistic theory of tests is possible. Continue reading

## Guest Post: STEPHEN SENN: ‘Fisher’s alternative to the alternative’

**As part of the week of posts on R.A.Fisher (February 17, 1890 – July 29, 1962), I reblog a guest post by Stephen Senn from 2012, and 2017. See especially the comments from Feb 2017. **

*‘Fisher’s alternative to the alternative’*

*By: Stephen Senn*

[2012 marked] the 50th anniversary of RA Fisher’s death. It is a good excuse, I think, to draw attention to an aspect of his philosophy of significance testing. In his extremely interesting essay on Fisher, Jimmie Savage drew attention to a problem in Fisher’s approach to testing. In describing Fisher’s aversion to power functions Savage writes, ‘Fisher says that some tests are *more sensitive* than others, and I cannot help suspecting that that comes to very much the same thing as thinking about the power function.’ (Savage 1976) (P473).

The modern statistician, however, has an advantage here denied to Savage. Savage’s essay was published posthumously in 1976 and the lecture on which it was based was given in Detroit on 29 December 1971 (P441). At that time Fisher’s scientific correspondence did not form part of his available oeuvre but in 1990 Henry Bennett’s magnificent edition of Fisher’s statistical correspondence (Bennett 1990) was published and this throws light on many aspects of Fisher’s thought including on significance tests. Continue reading

## Happy Birthday R.A. Fisher: ‘Two New Properties of Mathematical Likelihood’

*Today is R.A. Fisher’s birthday. I will post some Fisherian items this week in recognition of it*. This paper comes just before the conflicts with Neyman and Pearson erupted. Fisher links his tests and sufficiency, to the Neyman and Pearson lemma in terms of power. We may see them as ending up in a similar place while starting from different origins. I quote just the most relevant portions…the full article is linked below. Happy Birthday Fisher!*

“Two New Properties of Mathematical Likelihood“

by R.A. Fisher, F.R.S.

Proceedings of the Royal Society, Series A, 144: 285-307 (1934)

The property that where a sufficient statistic exists, the likelihood, apart from a factor independent of the parameter to be estimated, is a function only of the parameter and the sufficient statistic, explains the principle result obtained by Neyman and Pearson in discussing the efficacy of tests of significance. Neyman and Pearson introduce the notion that any chosen test of a hypothesis H_{0} is more powerful than any other equivalent test, with regard to an alternative hypothesis H_{1}, when it rejects H_{0} in a set of samples having an assigned aggregate frequency ε when H_{0} is true, and the greatest possible aggregate frequency when H_{1} is true. If any group of samples can be found within the region of rejection whose probability of occurrence on the hypothesis H_{1} is less than that of any other group of samples outside the region, but is not less on the hypothesis H_{0}, then the test can evidently be made more powerful by substituting the one group for the other. Continue reading

## American Phil Assoc Blog: The Stat Crisis of Science: Where are the Philosophers?

## The Statistical Crisis of Science: Where are the Philosophers?

This was published today on the American Philosophical Association blog.

“[C]onfusion about the foundations of the subject is responsible, in my opinion, for much of the misuse of the statistics that one meets in fields of application such as medicine, psychology, sociology, economics, and so forth.” (George Barnard 1985, p. 2)

“Relevant clarifications of the nature and roles of statistical evidence in scientific research may well be achieved by bringing to bear in systematic concert the scholarly methods of statisticians, philosophers and historians of science, and substantive scientists…” (Allan Birnbaum 1972, p. 861).

“In the training program for PhD students, the relevant basic principles of philosophy of science, methodology, ethics and statistics that enable the responsible practice of science must be covered.” (p. 57, Committee Investigating fraudulent research practices of social psychologist Diederik Stapel)

I was the lone philosophical observer at a special meeting convened by the American Statistical Association (ASA) in 2015 to construct a non-technical document to guide users of statistical significance tests–one of the most common methods used to distinguish genuine effects from chance variability across a landscape of social, physical and biological sciences.

It was, by the ASA Director’s own description, “historical”, but it was also highly philosophical, and its ramifications are only now being discussed and debated. Today, introspection on statistical methods is rather common due to the “statistical crisis in science”. What is it? In a nutshell: high powered computer methods make it easy to arrive at impressive-looking ‘findings’ that too often disappear when others try to replicate them when hypotheses and data analysis protocols are required to be fixed in advance.

## Summer Seminar in PhilStat: July 28-Aug 11

Please See New Information for Summer Seminar in PhilStat

## Little Bit of Logic (5 mini problems for the reader)

**Little bit of logic (5 little problems for you)[i]**

**Deductively valid** arguments can readily have false conclusions! Yes, deductively valid arguments allow drawing their conclusions with 100% reliability but **only if** all their premises are true. For an argument to be deductively valid means simply that **if** the premises of the argument are all true, then the conclusion is true. For a valid argument to entail the truth of its conclusion, all of its premises must be true. In that case the argument is said to be (deductively) **sound**.

Equivalently, using the definition of deductive validity that I prefer: A deductively valid argument is one where, the truth of all its premises together with the falsity of its conclusion, leads to a logical contradiction (A & ~A).

Show that an argument with the form of **disjunctive syllogism** can have a false conclusion. Such an argument take the form (where A, B are statements): Continue reading

## Mayo Slides Meeting #1 (Phil 6334/Econ 6614, Mayo & Spanos)

Slides * *Meeting #1 (Phil 6334/Econ 6614: Current Debates on Statistical Inference and Modeling (D. Mayo and A. Spanos)

## Excerpt from Excursion 4 Tour IV: More Auditing: Objectivity and Model Checking

## 4.8 All Models Are False

. . . it does not seem helpful just to say that all models are wrong. The very word model implies simplification and idealization. . . . The construction of idealized representations that capture important stable aspects of such systems is, however, a vital part of general scientific analysis. (Cox 1995, p. 456)

A popular slogan in statistics and elsewhere is “all models are false!” Is this true? What can it mean to attribute a truth value to a model? Clearly what is meant involves some assertion or hypothesis about the model – that it correctly or incorrectly represents some phenomenon in some respect or to some degree. Such assertions clearly can be true. As Cox observes, “the very word model implies simplification and idealization.” To declare, “all models are false” by dint of their being idealizations or approximations, is to stick us with one of those “all flesh is grass” trivializations (Section 4.1). So understood, it follows that all statistical models are false, but we have learned nothing about how statistical models may be used to infer true claims about problems of interest. Since the severe tester’s goal in using approximate statistical models is largely to learn where they break down, their strict falsity is a given. Yet it does make her wonder why anyone would want to place a probability assignment on their truth, unless it was 0? Today’s tour continues our journey into solving the problem of induction (Section 2.7). Continue reading

## Protected: Participants in 6334/6614 Meeting place Jan-Feb

## 6334/6614: Captain’s Library: Biblio With Links

** Mayo and A. Spanos
PHIL 6334/ ECON 6614: Spring 2019: Current Debates on Statistical Inference and Modeling**

*Bibliography **(this includes a selection of articles with links; *numbers 1-15 after the item refer to seminar meeting number.)

See Syllabus (first) for class meetings, and the page PhilStat19 menu up top for other course items.

Achinstein (2010). Mill’s Sins or Mayo’s Errors? (**E&I**: 170-188). (11)

Bacchus, Kyburg, & Thalos (1990). Against Conditionalization, *Synthese* (85): 475-506. (15)

Barnett (1999). *Comparative Statistical Inference* (Chapter 6: Bayesian Inference), John Wiley & Sons. (1), (15)

Begley & Ellis (2012) Raise standards for preclinical cancer research. *Nature* 483: 531-533. (10)