A famous chestnut given by Cox (1958) recently came up in conversation. The example “is now usually called the ‘weighing machine example,’ which draws attention to the need for conditioning, at least in certain types of problems” (Reid 1992, p. 582). When I describe it, you’ll find it hard to believe many regard it as causing an earthquake in statistical foundations, unless you’re already steeped in these matters. If half the time I reported my weight from a scale that’s always right, and half the time from a scale that’s right with probability .5, would you say I’m right with probability ¾? Well, maybe. But suppose you knew that this measurement was made with the scale that’s right only with probability .5? The overall error probability is scarcely relevant to the warrant of the particular measurement, once you know which scale was used. Continue reading
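A quick simulation (my own illustration, not Cox’s; the fair-coin choice between the two scales is just as described above) makes the contrast between the overall and the conditional error probability vivid:

```python
import random

random.seed(1)
N = 100_000

# Each report uses one of two scales, chosen by a fair coin:
# scale A is always right; scale B is right with probability 1/2.
results = []
for _ in range(N):
    scale = random.choice(["A", "B"])
    correct = True if scale == "A" else (random.random() < 0.5)
    results.append((scale, correct))

overall = sum(c for _, c in results) / N
b_only = [c for s, c in results if s == "B"]
cond_B = sum(b_only) / len(b_only)

print(f"overall accuracy:            {overall:.3f}")  # close to 0.75
print(f"accuracy given scale B used: {cond_B:.3f}")   # close to 0.50
```

The overall accuracy of ¾ is a perfectly good pre-data performance property of the procedure, but once you condition on which scale was actually used, it says next to nothing about the warrant of this particular report.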

# Error Statistics

## Cox’s (1958) weighing machine example

## Szucs & Ioannidis Revive the Limb-Sawing Fallacy

When logical fallacies of statistics go uncorrected, they are repeated again and again…and again. And so it is with the limb-sawing fallacy I first posted in one of my “Overheard at the Comedy Hour” posts.* It now resides as a comic criticism of significance tests in a paper by Szucs and Ioannidis (posted this week). Here’s their version:

“[P]aradoxically, when we achieve our goal and successfully reject H₀ we will actually be left in complete existential vacuum because during the rejection of H₀, NHST ‘saws off its own limb’ (Jaynes, 2003; p. 524): If we manage to reject H₀, then it follows that pr(data or more extreme data|H₀) is useless because H₀ is not true” (p. 15).

Here’s Jaynes (p. 524):

“Suppose we decide that the effect exists; that is, we reject [null hypothesis] H₀. Surely, we must also reject probabilities conditional on H₀, but then what was the logical justification for the decision? Orthodox logic saws off its own limb.”

*Ha! Ha!* By this reasoning, no hypothetical testing or falsification could ever occur. As soon as *H* is falsified, the grounds for falsifying disappear! If *H*: all swans are white, then if I see a black swan, *H* is falsified. But according to this criticism, we can no longer assume the deduced prediction from *H*! What? Continue reading

## 3 YEARS AGO (DECEMBER 2013): MEMORY LANE

**MONTHLY MEMORY LANE: 3 years ago: December 2013.** I mark in **red** three posts from each month that seem most apt for general background on key issues in this blog, excluding those reblogged recently [1], and in **green** up to 3 others I’d recommend [2]. Posts that are part of a “unit” or a group count as one. In this post, that makes 12/27-12/28 count as one.

**December 2013**

- (12/3) **Stephen Senn: Dawid’s Selection Paradox (guest post)**
- (12/7) FDA’s New Pharmacovigilance
- (12/9) Why ecologists might want to read more philosophy of science (UPDATED)
- (12/11) Blog Contents for Oct and Nov 2013
- (12/14) The error statistician has a complex, messy, subtle, ingenious piece-meal approach
- (12/15) **Surprising Facts about Surprising Facts**
- (12/19) **A. Spanos lecture on “Frequentist Hypothesis Testing”**
- (12/24) **U-Phil: Deconstructions [of J. Berger]: Irony & Bad Faith 3**
- (12/25) “Bad Arguments” (a book by Ali Almossawi)
- (12/26) Mascots of Bayesneon statistics (rejected post)
- (12/27) **Deconstructing Larry Wasserman**
- (12/28) **More on deconstructing Larry Wasserman (Aris Spanos)**
- (12/28) **Wasserman on Wasserman: Update! December 28, 2013**
- (12/31) **Midnight With Birnbaum (Happy New Year)**

**[1]** Monthly memory lanes began at the blog’s 3-year anniversary in Sept, 2014.

**[2]** New Rule, July 30, 2016 (very convenient).

## “Tests of Statistical Significance Made Sound”: excerpts from B. Haig

I came across a paper, “Tests of Statistical Significance Made Sound,” by Brian Haig, a psychology professor at the University of Canterbury, New Zealand. It hits most of the high notes regarding statistical significance tests, their history & philosophy and, refreshingly, is in the error statistical spirit! I’m pasting excerpts from his discussion of “The Error-Statistical Perspective,” starting on p. 7. [1]

## The Error-Statistical Perspective

An important part of scientific research involves processes of detecting, correcting, and controlling for error, and mathematical statistics is one branch of methodology that helps scientists do this. In recognition of this fact, the philosopher of statistics and science, Deborah Mayo (e.g., Mayo, 1996), in collaboration with the econometrician, Aris Spanos (e.g., Mayo & Spanos, 2010, 2011), has systematically developed, and argued in favor of, an *error-statistical* philosophy for understanding experimental reasoning in science. Importantly, this philosophy permits, indeed encourages, the local use of ToSS, among other methods, to manage error. Continue reading

## 3 YEARS AGO (NOVEMBER 2013): MEMORY LANE

**MONTHLY MEMORY LANE: 3 years ago: November 2013.** I mark in **red** three posts from each month that seem most apt for general background on key issues in this blog, excluding those reblogged recently [1], and in **green** up to 3 others I’d recommend [2]. Posts that are part of a “unit” or a group count as one. Here I’m counting 11/9, 11/13, and 11/16 as one.

**November 2013**

- (11/2) **Oxford Gaol: Statistical Bogeymen**
- (11/4) **Forthcoming paper on the strong likelihood principle**
- (11/9) Null Effects and Replication (cartoon pic)
- (11/9) **Beware of questionable front page articles warning you to beware of questionable front page articles** (iii)
- (11/13) **T. Kepler: “Trouble with ‘Trouble at the Lab’?” (guest post)**
- (11/16) PhilStock: No-pain bull
- (11/16) **S. Stanley Young: More Trouble with ‘Trouble in the Lab’ (Guest post)**
- (11/18) **Lucien Le Cam: “The Bayesians hold the Magic”**
- (11/20) Erich Lehmann: Statistician and Poet
- (11/23) **Probability that it is a statistical fluke [i]**
- (11/27) **“The probability that it be a statistical fluke” [iia]**
- (11/30) Saturday night comedy at the “Bayesian Boy” diary (rejected post*)

**[1]** Monthly memory lanes began at the blog’s 3-year anniversary in Sept, 2014.

**[2]** New Rule, July 30, 2016 (very convenient).

## 3 YEARS AGO (OCTOBER 2013): MEMORY LANE

**MONTHLY MEMORY LANE: 3 years ago: October 2013.** I mark in **red** three posts from each month that seem most apt for general background on key issues in this blog, excluding those reblogged recently [1], and in **green** up to 3 others I’d recommend [2]. Posts that are part of a “unit” or a pair count as one.

**October 2013**

- (10/3) **Will the Real Junk Science Please Stand Up? (critical thinking)**
- (10/5) **Was Janina Hosiasson pulling Harold Jeffreys’ leg?**
- (10/9) **Bad statistics: crime or free speech (II)? Harkonen update: Phil Stat / Law / Stock**
- (10/12) **Sir David Cox: a comment on the post, “Was Hosiasson pulling Jeffreys’ leg?”** (10/5 and 10/12 are a pair)
- (10/19) Blog Contents: September 2013
- (10/19) **Bayesian Confirmation Philosophy and the Tacking Paradox (iv)***
- (10/25) **Bayesian confirmation theory: example from last post…** (10/19 and 10/25 are a pair)
- (10/26) **Comedy hour at the Bayesian (epistemology) retreat: highly probable vs highly probed (vs what?)**
- (10/31) **WHIPPING BOYS AND WITCH HUNTERS** (interesting to see how things have changed and stayed the same over the past few years; share comments)

**[1]** Monthly memory lanes began at the blog’s 3-year anniversary in Sept, 2014.

**[2]** New Rule, July 30, 2016 (very convenient).

## For Statistical Transparency: Reveal Multiplicity and/or Just Falsify the Test (Remark on Gelman and Colleagues)

Gelman and Loken (2014) recognize that even without explicit cherry picking there is often enough leeway in the “forking paths” between data and inference that, by artful choices, you may be led to one inference, even though it could also have gone another way. In good sciences, measurement procedures should interlink with well-corroborated theories and offer a triangulation of checks, often missing in the types of experiments Gelman and Loken are on about. Stating a hypothesis in advance, far from protecting against verification biases, can be the engine that enables data to be “constructed” to reach the desired end [1].

[E]ven in settings where a single analysis has been carried out on the given data, the issue of multiple comparisons emerges because different choices about combining variables, inclusion and exclusion of cases…and many other steps in the analysis could well have occurred with different data (Gelman and Loken 2014, p. 464).

An idea growing out of this recognition is to imagine the results of applying the same statistical procedure, but with different choices at key discretionary junctures–giving rise to a *multiverse analysis*, rather than a single data set (Steegen, Tuerlinckx, Gelman, and Vanpaemel 2016). One lists the different choices thought to be plausible at each stage of data processing. The multiverse displays “which constellation of choices corresponds to which statistical results” (p. 797). The result of this exercise can, at times, mimic the delineation of possibilities in multiple testing and multiple modeling strategies. Continue reading
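To fix ideas, here is a toy sketch of the multiverse idea (my own illustration, not Steegen et al.’s code; the exclusion rules and transforms are hypothetical stand-ins for real discretionary choices): enumerate the plausible choices at each juncture, run the same analysis under every combination, and display the whole grid of results.

```python
import itertools
import random
import statistics

random.seed(0)

# Synthetic data: (outcome, group, flagged-for-possible-exclusion)
data = [(random.gauss(0.2 * g, 1.0), g, random.random() < 0.1)
        for g in [0, 1] * 50]

# Two discretionary junctures, two plausible choices at each (hypothetical):
exclusion_rules = {
    "keep all": lambda row: True,
    "drop flagged": lambda row: not row[2],
}
transforms = {
    "raw": lambda y: y,
    "winsorized": lambda y: max(-2.0, min(2.0, y)),
}

# One "universe" per combination of choices; record the group-mean difference.
results = {}
for (ex_name, keep), (tr_name, tr) in itertools.product(
        exclusion_rules.items(), transforms.items()):
    kept = [row for row in data if keep(row)]
    g0 = [tr(y) for y, g, _ in kept if g == 0]
    g1 = [tr(y) for y, g, _ in kept if g == 1]
    results[(ex_name, tr_name)] = statistics.mean(g1) - statistics.mean(g0)

for (ex_name, tr_name), diff in results.items():
    print(f"{ex_name:12s} | {tr_name:10s} | mean difference = {diff:+.3f}")
```

The point isn’t this particular output, but that “which constellation of choices corresponds to which statistical results” becomes something one can display, rather than something hidden in the forking paths.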

## Peircean Induction and the Error-Correcting Thesis (Part I)

Today is C.S. Peirce’s birthday. He’s one of my all time heroes. You should read him: he’s a treasure chest on essentially any topic, and he anticipated several major ideas in statistics (e.g., randomization, confidence intervals) as well as in logic. I’ll reblog the first portion of a (2005) paper of mine. Links to Parts 2 and 3 are at the end. It’s written for a very general philosophical audience; the statistical parts are pretty informal. *Happy birthday Peirce*.

**Peircean Induction and the Error-Correcting Thesis**

Deborah G. Mayo

*Transactions of the Charles S. Peirce Society: A Quarterly Journal in American Philosophy*, Volume 41, Number 2, 2005, pp. 299-319

Peirce’s philosophy of inductive inference in science is based on the idea that what permits us to make progress in science, what allows our knowledge to grow, is the fact that science uses methods that are self-correcting or error-correcting:

Induction is the experimental testing of a theory. The justification of it is that, although the conclusion at any stage of the investigation may be more or less erroneous, yet the further application of the same method must correct the error. (5.145)

## If you think it’s a scandal to be without statistical falsification, you will need statistical tests (ii)

1. **PhilSci and StatSci.** I’m always glad to come across statistical practitioners who wax philosophical, particularly when Karl Popper is cited. Best of all is when they get the philosophy somewhere close to correct. So, I came across an article by Burnham and Anderson (2014) in *Ecology*:

“While the exact definition of the so-called ‘scientific method’ might be controversial, nearly everyone agrees that the concept of ‘falsifiability’ is a central tenant [sic] of empirical science (Popper 1959). It is critical to understand that historical statistical approaches (i.e., P values) leave no way to ‘test’ the alternative hypothesis. The alternative hypothesis is never tested, hence cannot be rejected or falsified!… Surely this fact alone makes the use of significance tests and P values bogus. Lacking a valid methodology to reject/falsify the alternative science hypotheses seems almost a scandal.” (Burnham and Anderson p. 629)

Well I am (almost) scandalized by this easily falsifiable allegation! I can’t think of a single “alternative”, whether in a “pure” Fisherian or a Neyman-Pearson hypothesis test (whether explicit or implicit), that’s not falsifiable; nor do the authors provide any. I grant that understanding testability and falsifiability is far more complex than the kind of popularized accounts we hear about; granted as well, theirs is just a short paper.[1] But then why make bold declarations on the topic of the “scientific method and statistical science,” on falsifiability and testability? Continue reading

## 3 YEARS AGO (JULY 2013): MEMORY LANE

**MONTHLY MEMORY LANE: 3 years ago: July 2013.** I mark in **red** three posts that seem most apt for general background on key issues in this blog, excluding those reblogged recently [1], and in **green** up to 3 others I’d recommend [2]. Posts that are part of a “unit” or a group of “U-Phils” (you [readers] philosophize) count as one.

**July 2013**

- (7/3) **Phil/Stat/Law: 50 Shades of gray between error and fraud**
- (7/6) Bad news bears: ‘Bayesian bear’ rejoinder–reblog mashup
- (7/10) PhilStatLaw: Reference Manual on Scientific Evidence (3d ed) on Statistical Significance (Schachtman)
- (7/11) **Is Particle Physics Bad Science? (memory lane)**
- (7/13) Professor of Philosophy Resigns over Sexual Misconduct (rejected post)
- (7/14) **Stephen Senn: Indefinite irrelevance**
- (7/17) **Phil/Stat/Law: What Bayesian prior should a jury have? (Schachtman)**
- (7/19) Msc Kvetch: A question on the Martin-Zimmerman case we do not hear
- (7/20) **Guest Post: Larry Laudan. Why Presuming Innocence is Not a Bayesian Prior**
- (7/23) Background Knowledge: Not to Quantify, But To Avoid Being Misled By, Subjective Beliefs
- (7/26) **New Version: On the Birnbaum argument for the SLP: Slides for JSM talk**

[1] Monthly memory lanes began at the blog’s 3-year anniversary in Sept, 2014.

[2] New Rule, July 30, 2016.

## 3 YEARS AGO (JUNE 2013): MEMORY LANE

**MONTHLY MEMORY LANE: 3 years ago: June 2013.** I mark in **red** three posts that seem most apt for general background on key issues in this blog, excluding those reblogged recently [1]. Posts that are part of a “unit” or a group of “U-Phils” (you [readers] philosophize) count as one. Here I grouped 6/5 and 6/6.

**June 2013**

- (6/1) Winner of May Palindrome Contest
- (6/1) Some statistical dirty laundry* **(recently reblogged)**
- (6/5) **Do CIs Avoid Fallacies of Tests? Reforming the Reformers** (6/5 and 6/6 are paired as one)
- (6/6) PhilStock: Topsy-Turvy Game
- (6/6) **Anything Tests Can do, CIs do Better; CIs Do Anything Better than Tests?* (reforming the reformers cont.)**
- (6/8) Richard Gill: “Integrity or fraud… or just questionable research practices?”* **(recently reblogged)**
- (6/11) Mayo: comment on the repressed memory research **[How a conceptual criticism, requiring no statistics, might go.]**
- (6/14) **P-values can’t be trusted except when used to argue that p-values can’t be trusted!**
- (6/19) PhilStock: The Great Taper Caper
- (6/19) **Stanley Young: better p-values through randomization in microarrays**
- (6/22) What do these share in common: m&ms, limbo stick, ovulation, Dale Carnegie? Sat night potpourri* **(recently reblogged)**
- (6/26) Why I am not a “dualist” in the sense of Sander Greenland
- (6/29) Palindrome “contest” contest
- (6/30) Blog Contents: mid-year

**[1]** Monthly memory lanes began at the blog’s 3-year anniversary in Sept, 2014.

## Allan Birnbaum: Foundations of Probability and Statistics (27 May 1923 – 1 July 1976)

*Today is Allan Birnbaum’s birthday. In honor of his birthday this year, I’m posting the articles in the *Synthese* volume that was dedicated to his memory in 1977. The editors describe it as their way of “paying homage to Professor Birnbaum’s penetrating and stimulating work on the foundations of statistics”. I paste a few snippets from the articles by Giere and Birnbaum. If you’re interested in statistical foundations, and are unfamiliar with Birnbaum, here’s a chance to catch up. (Even if you are, you may be unaware of some of these key papers.)*

**HAPPY BIRTHDAY ALLAN!**

*Synthese* Volume 36, No. 1 Sept 1977: *Foundations of Probability and Statistics*, Part I

**Editorial Introduction:**

This special issue of *Synthese* on the foundations of probability and statistics is dedicated to the memory of Professor Allan Birnbaum. Professor Birnbaum’s essay ‘The Neyman-Pearson Theory as Decision Theory; and as Inference Theory; with a Criticism of the Lindley-Savage Argument for Bayesian Theory’ was received by the editors of *Synthese* in October, 1975, and a decision was made to publish a special symposium consisting of this paper together with several invited comments and related papers. The sad news about Professor Birnbaum’s death reached us in the summer of 1976, but the editorial project could nevertheless be completed according to the original plan. By publishing this special issue we wish to pay homage to Professor Birnbaum’s penetrating and stimulating work on the foundations of statistics. We are grateful to Professor Ronald Giere who wrote an introductory essay on Professor Birnbaum’s concept of statistical evidence and who compiled a list of Professor Birnbaum’s publications.

THE EDITORS

## A. Spanos: Talking back to the critics using error statistics

With all the recent kvetching about significance tests, it’s an apt time to reblog Aris Spanos’ overview of the error statistician talking back to the critics [1]. A related paper for your Saturday night reading is Mayo and Spanos (2011) [2]. It mixes the error statistical philosophy of science with its philosophy of statistics, introduces severity, and responds to 13 criticisms and howlers.
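For readers new to severity, here is a minimal numerical sketch (my own, with hypothetical numbers, assuming the familiar one-sided Normal test T+ with known σ, as in Mayo and Spanos’s expositions) of the post-data severity assessment for the inference μ > μ1 after a statistically significant result:

```python
from math import erf, sqrt

def norm_cdf(z):
    """Standard Normal CDF via the error function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def severity(xbar, mu1, sigma, n):
    """SEV(mu > mu1) = P(Xbar <= observed xbar; mu = mu1): the probability
    the test would have produced a result less discordant with H0 were the
    true mean only mu1.  (Test T+: H0: mu <= mu0 vs H1: mu > mu0, sigma known.)"""
    return norm_cdf((xbar - mu1) / (sigma / sqrt(n)))

# Hypothetical numbers: n = 100, sigma = 2 (so SE = 0.2), observed xbar = 0.4
for mu1 in (0.0, 0.2, 0.4, 0.6):
    print(f"SEV(mu > {mu1:.1f}) = {severity(0.4, mu1, 2.0, 100):.3f}")
```

The same rejection warrants μ > 0 with high severity (about .977) but warrants μ > 0.4 with severity only .5: the inference licensed is qualified by how severely the data probe each discrepancy from the null.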

I’m going to comment on some of the ASA discussion contributions I hadn’t discussed earlier. Please share your thoughts in relation to any of this.

[1] It was first blogged here, as part of our seminar 2 years ago.

[2] For those seeking a bit more balance to the main menu offered in the ASA Statistical Significance Reference list.

See also on this blog:

A. Spanos, “Recurring controversies about p-values and confidence intervals revisited”

A. Spanos, “Lecture on frequentist hypothesis testing”

## Your chance to continue the “due to chance” discussion in roomier quarters

Comments get unwieldy after 100, so here’s a chance to continue the **“due to chance” discussion** in some roomier quarters. (There seem to be at least two distinct lanes being travelled.) Now one of the main reasons I run this blog is to discover potential clues to solving, or making progress on, thorny philosophical problems I’ve been wrangling with for a long time. I think I extracted some illuminating gems from the discussion here, but I don’t have time to write them up, and won’t for a bit, so I’ve parked a list of comments wherein the golden extracts lie (I think) over at **my Rejected Posts blog [1]**. (They’re all my comments, but as influenced by readers, so I thank you!) Over there, there’s no “revise and resubmit”, but around a dozen posts have eventually made it over here, tidied up. Please continue the discussion on this blog (I don’t even recommend going over there). You can link to your earlier comments by clicking on the date.

[1] The Spiegelhalter (PVP) link is here.

## Don’t throw out the error control baby with the bad statistics bathwater

**My invited comments on the ASA Document on P-values***

The American Statistical Association is to be credited with opening up a discussion into p-values; now an examination of the foundations of other key statistical concepts is needed.

Statistical significance tests are a small part of a rich set of “techniques for systematically appraising and bounding the probabilities (under respective hypotheses) of seriously misleading interpretations of data” (Birnbaum 1970, p. 1033). These may be called *error statistical methods* (or *sampling theory*). The error statistical methodology supplies what Birnbaum called the “one rock in a shifting scene” (ibid.) in statistical thinking and practice. Misinterpretations and abuses of tests, warned against by the very founders of the tools, shouldn’t be the basis for supplanting them with methods unable or less able to assess, control, and alert us to erroneous interpretations of data. Continue reading

## Statistical Challenges in Assessing and Fostering the Reproducibility of Scientific Results


I generally find National Academy of Science (NAS) manifestos highly informative. I only gave a quick reading to around 3/4 of this one. I thank Hilda Bastian for twittering the link. Before giving my impressions, I’m interested to hear what readers think, whenever you get around to having a look. Here’s from the intro*:

Questions about the reproducibility of scientific research have been raised in numerous settings and have gained visibility through several high-profile journal and popular press articles. Quantitative issues contributing to reproducibility challenges have been considered (including improper data management and analysis, inadequate statistical expertise, and incomplete data, among others), but there is no clear consensus on how best to approach or to minimize these problems…

Continue reading

## Can’t Take the Fiducial Out of Fisher (if you want to understand the N-P performance philosophy) [i]

In recognition of R.A. Fisher’s birthday today, I’ve decided to share some thoughts on a topic that has so far been absent from this blog: Fisher’s *fiducial probability*. **Happy Birthday, Fisher.**

[Neyman and Pearson] “began an influential collaboration initially designed primarily, it would seem, to clarify Fisher’s writing. This led to their theory of testing hypotheses and to Neyman’s development of confidence intervals, aiming to clarify Fisher’s idea of fiducial intervals” (D. R. Cox, 2006, p. 195).

The entire episode of fiducial probability is fraught with minefields. Many say it was Fisher’s biggest blunder; others suggest it still hasn’t been understood. The majority of discussions omit the side trip to the Fiducial Forest altogether, finding the surrounding brambles too thorny to penetrate. Besides, a fascinating narrative about the Fisher-Neyman-Pearson divide has managed to bloom and grow while steering clear of fiducial probability–never mind that it remained a centerpiece of Fisher’s statistical philosophy. I now think that this is a mistake. It was thought, following Lehmann (1993) and others, that we could take the fiducial out of Fisher and still understand the core of the Neyman-Pearson vs Fisher (or Neyman vs Fisher) disagreements. We can’t. Quite aside from the intrinsic interest in correcting the “he said/he said” of these statisticians, the issue is intimately bound up with the current (flawed) consensus view of frequentist error statistics.

So what’s *fiducial inference*? I follow Cox (2006), adapting for the case of the lower limit: Continue reading

## Gelman on ‘Gathering of philosophers and physicists unaware of modern reconciliation of Bayes and Popper’

*“Gathering of philosophers and physicists unaware of modern reconciliation of Bayes and Popper” by Andrew Gelman*

Hiro Minato points us to a news article by physicist Natalie Wolchover entitled “A Fight for the Soul of Science.”

I have no problem with most of the article, which is a report about controversies within physics regarding the purported untestability of physics models such as string theory (as for example discussed by my Columbia colleague Peter Woit). Wolchover writes:

Whether the fault lies with theorists for getting carried away, or with nature, for burying its best secrets, the conclusion is the same: Theory has detached itself from experiment. The objects of theoretical speculation are now too far away, too small, too energetic or too far in the past to reach or rule out with our earthly instruments. . . .

Over three mild winter days, scholars grappled with the meaning of theory, confirmation and truth; how science works; and whether, in this day and age, philosophy should guide research in physics or the other way around. . . .

To social and behavioral scientists, this is all an old old story. Concepts such as personality, political ideology, and social roles are undeniably important but only indirectly related to any measurements. In social science we’ve forever been in the unavoidable position of theorizing without sharp confirmation or falsification, and, indeed, unfalsifiable theories such as Freudian psychology and rational choice theory have been central to our understanding of much of the social world.

But then somewhere along the way the discussion goes astray: Continue reading

## Statistical “reforms” without philosophy are blind (v update)

Is it possible, today, to have a fair-minded engagement with debates over statistical foundations? I’m not sure, but I know it is becoming of pressing importance to try. Increasingly, people are getting serious about methodological reforms—some are quite welcome, others are quite radical. Too rarely do the reformers bring out the philosophical presuppositions of the criticisms and proposed improvements. Today’s (radical?) reform movements are typically launched from criticisms of statistical significance tests and P-values, so I focus on them. Regular readers know how often the P-value (that most unpopular girl in the class) has made her appearance on this blog. Here, I tried to quickly jot down some queries. (Look for later installments and links.) *What are some key questions we need to ask to tell what’s true about today’s criticisms of P-values? *

*I. To get at philosophical underpinnings, the single most important question is this:*

**(1) Do the debaters distinguish different views of the nature of statistical inference and the roles of probability in learning from data? ** Continue reading

## Statistical rivulets: Who wrote this?

[I]t seems to be useful for statisticians generally to engage in retrospection at this time, because there seems now to exist an opportunity for a convergence of view on the central core of our subject. Unless such an opportunity is taken there is a danger that the powerful central stream of development of our subject may break up into smaller and smaller rivulets which may run away and disappear into the sand.

I shall be concerned with the foundations of the subject. But in case it should be thought that this means I am not here strongly concerned with practical applications, let me say right away that confusion about the foundations of the subject is responsible, in my opinion, for much of the misuse of the statistics that one meets in fields of application such as medicine, psychology, sociology, economics, and so forth. It is also responsible for the lack of use of sound statistics in the more developed areas of science and engineering. While the foundations have an interest of their own, and can, in a limited way, serve as a basis for extending statistical methods to new problems, their study is primarily justified by the need to present a coherent view of the subject when teaching it to others. One of the points I shall try to make is, that we have created difficulties for ourselves by trying to oversimplify the subject for presentation to others. It would surely have been astonishing if all the complexities of such a subtle concept as probability in its application to scientific inference could be represented in terms of only three concepts––estimates, confidence intervals, and tests of hypotheses. Yet one would get the impression that this was possible from many textbooks purporting to expound the subject. We need more complexity; and this should win us greater recognition from scientists in developed areas, who already appreciate that inference is a complex business while at the same time it should deter those working in less developed areas from thinking that all they need is a suite of computer programs.

**Who wrote this and when?**