Monthly Archives: November 2018

SIST* Posts: Excerpts & Mementos (to Nov 17, 2018)

Surveying SIST Posts so far

SIST* BLOG POSTS (up to Nov 17, 2018)


  • 05/19: The Meaning of My Title: Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars
  • 09/08: Excursion 1 Tour I: Beyond Probabilism and Performance: Severity Requirement (1.1)
  • 09/11: Excursion 1 Tour I (2nd stop): Probabilism, Performance, and Probativeness (1.2)
  • 09/15: Excursion 1 Tour I (3rd stop): The Current State of Play in Statistical Foundations: A View From a Hot-Air Balloon (1.3)
  • 09/29: Excursion 2: Taboos of Induction and Falsification: Tour I (first stop)
  • 10/10: Excursion 2 Tour II (3rd stop): Falsification, Pseudoscience, Induction (2.3)

Mementos, Keepsakes and Souvenirs

  • 10/29: Tour Guide Mementos (Excursion 1 Tour II of How to Get Beyond the Statistics Wars)
  • 11/08: Souvenir C: A Severe Tester’s Translation Guide (Excursion 1 Tour II)
  • 10/05: “It should never be true, though it is still often said, that the conclusions are no more accurate than the data on which they are based” (Keepsake by Fisher, 2.1)
  • 11/14: Tour Guide Mementos and Quiz 2.1 (Excursion 2 Tour I Induction and Confirmation)
  • 11/17: Mementos for Excursion 2 Tour II Falsification, Pseudoscience, Induction

*Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (Mayo, CUP 2018)

Categories: SIST, Statistical Inference as Severe Testing | Leave a comment

Mementos for Excursion 2 Tour II: Falsification, Pseudoscience, Induction (2.3-2.7)


Excursion 2 Tour II: Falsification, Pseudoscience, Induction*

Outline of Tour. Tour II visits Popper, falsification, corroboration, Duhem’s problem (what to blame in the case of anomalies) and the demarcation of science and pseudoscience (2.3). While Popper comes up short on each, the reader is led to improve on Popper’s notions (live exhibit (v)). Central ingredients for our journey are put in place via souvenirs: a framework of models and problems, and a post-Popperian language to speak about inductive inference. Defining a severe test, for Popperians, is linked to when data supply novel evidence for a hypothesis: family feuds about defining novelty are discussed (2.4). We move into Fisherian significance tests and the crucial requirements he set (often overlooked): isolated significant results are poor evidence of a genuine effect, and statistical significance doesn’t warrant substantive, e.g., causal inference (2.5). Applying our new demarcation criterion to a plausible effect (males are more likely than females to feel threatened by their partner’s success), we argue that a real revolution in psychology will need to be more revolutionary than at present. Whole inquiries might have to be falsified, their measurement schemes questioned (2.6). The Tour’s pieces are synthesized in (2.7), where a guest lecturer explains how to solve the problem of induction now, having redefined induction as severe testing.

Mementos from 2.3

There are four key, interrelated themes from Popper:

(1) Science and Pseudoscience. For a theory to be scientific it must be testable and falsifiable.

(2) Conjecture and Refutation. We learn not by enumerative induction but by trial and error: conjecture and refutation.

(3) Observations Are Not Given. If they are at the “foundation,” it is only because there are apt methods for testing their validity. We dub claims observable because or to the extent that they are open to stringent checks.

(4) Corroboration Not Confirmation, Severity Not Probabilism. Rejecting probabilism, Popper denies scientists are interested in highly probable hypotheses (in any sense). They seek bold, informative, interesting conjectures and ingenious and severe attempts to refute them.

These themes are in the spirit of the error statistician. Considerable spade-work is required to see what to keep and what to revise, so bring along your archeological shovels.

The Severe Tester Revises Popper’s Demarcation of Science (Live Exhibit (vi)): What he should be asking is not whether a theory is unscientific, but when is an inquiry into a theory, or an appraisal of claim H, unscientific? We want to distinguish meritorious modes of inquiry from those that are BENT. If the test methods enable ad hoc maneuvering, sneaky face-saving devices, then the inquiry – the handling and use of data – is unscientific. Despite being logically falsifiable, theories can be rendered immune from falsification by means of questionable methods for their testing.

Greater Content, Greater Severity. The severe tester accepts Popper’s central intuition in (4): if we wanted highly probable claims, scientists would stick to low-level observables and not seek generalizations, much less theories with high explanatory content. A highly explanatory, high-content theory, with interconnected tentacles, has a higher probability of having flaws discerned than low-content theories that do not rule out as much. Thus, when the bolder, higher content, theory stands up to testing, it may earn higher overall severity than the one with measly content. It is the fuller, unifying, theory developed in the course of solving interconnected problems that enables severe tests.

Methodological Probability. Probability in learning attaches to a method of conjecture and refutation, that is to testing: it is methodological probability. An error probability is a special case of a methodological probability. We want methods with a high probability of teaching us (and machines) how to distinguish approximately correct and incorrect interpretations of data. That a theory is plausible is of little interest, in and of itself; what matters is that it is implausible for it to have passed these tests were it false or incapable of adequately solving its set of problems.

Methodological falsification. We appeal to methodological rules for when to regard a claim as falsified.

  • Inductive-statistical falsification proceeds by methods that allow ~H to be inferred with severity. A first step is often to infer an anomaly is real, by falsifying a “due to chance” hypothesis.
  • Going further, we may corroborate (i.e., infer with severity) effects that count as falsifying hypotheses. A falsifying hypothesis is a hypothesis inferred in order to falsify some other claim. Example: the pathological proteins (prions) in mad cow disease infect without nucleic acid. This falsifies: all infectious agents involve nucleic acid.
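As a toy illustration of the first bullet (my own sketch, not from the book): the simplest “due to chance” hypothesis to falsify statistically is a binomial null. The numbers below (60 successes in 100 trials, H0: p = 0.5) are hypothetical, chosen only to show the mechanics of an exact significance test.

```python
from math import comb

def binomial_p_value(successes: int, n: int, p0: float = 0.5) -> float:
    """One-sided exact p-value: Pr(X >= successes) under H0: X ~ Binomial(n, p0)."""
    return sum(comb(n, k) * p0**k * (1 - p0)**(n - k)
               for k in range(successes, n + 1))

# Observing 60 successes in 100 trials: how improbably large is this
# deviation if "mere chance" (p = 0.5) were operating?
p = binomial_p_value(60, 100)
print(f"p-value = {p:.4f}")   # ≈ 0.0284
```

A small p-value here is the ground for inferring the anomaly is real, i.e., for statistically falsifying the “due to chance” hypothesis; a result like 50/100 (p ≈ 0.54) would give no such ground.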

Despite giving lip service to testing and falsification, many popular accounts of statistical inference do not embody falsification – even of a statistical sort.

However, the falsifying hypotheses that are integral for Popper also necessitate an evidence-transcending (inductive) statistical inference. If your statistical account denies we can reliably falsify interesting theories because doing so is not strictly deductive, it is irrelevant to real-world knowledge.

The Popperian (Methodological) Falsificationist Is an Error Statistician

When is a statistical hypothesis to count as falsified? Although extremely rare events may occur, Popper notes:

such occurrences would not be physical effects, because, on account of their immense improbability, they are not reproducible at will … If, however, we find reproducible deviations from a macro effect … deduced from a probability estimate … then we must assume that the probability estimate is falsified. (Popper 1959, p. 203)

In the same vein, we heard Fisher deny that an “isolated record” of statistically significant results suffices to warrant a reproducible or genuine effect (Fisher 1935a, p. 14).

In a sense, the severe tester ‘breaks’ from Popper by solving his key problem: Popper’s account rests on severe tests, tests that would probably falsify claims if false, but he cannot warrant saying a method is probative or severe, because that would mean it was reliable, which makes Popperians squeamish. It would appear to concede to his critics that Popper has a “whiff of induction” after all. But it’s not inductive enumeration. Error statistical methods (whether formal statistical methods or informal ones) can supply the severe tests Popper sought.

A scientific inquiry (a procedure for finding something out) for a severe tester:

  • blocks inferences that fail the minimal requirement for severity;
  • must be able to embark on a reliable probe to pinpoint blame for anomalies (and use the results to replace falsified claims and build a repertoire of errors).

The parenthetical remark isn’t absolutely required, but is a feature that greatly strengthens scientific credentials.

The reliability requirement is: infer claims just to the extent that they pass severe tests. There’s no sharp line for demarcation, but when these requirements are absent, an inquiry veers into the realm of questionable science or pseudoscience.

To see mementos of 2.4-2.7, I’ve placed them here.**

All of 2.3 is here.

Please use the comments for your questions, corrections, suggested additions.

*All items refer to my new book: Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (CUP 2018)

**I’m bound to revise and add to these during a seminar next semester.

Categories: Popper, Statistical Inference as Severe Testing, Statistics | 5 Comments

Tour Guide Mementos and QUIZ 2.1 (Excursion 2 Tour I: Induction and Confirmation)


Excursion 2 Tour I: Induction and Confirmation (Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars)

Tour Blurb. The roots of rival statistical accounts go back to the logical Problem of Induction. (2.1) The logical problem of induction is a matter of finding an argument to justify a type of argument (enumerative induction), so it is important to be clear on arguments, their soundness versus their validity. These are key concepts of fundamental importance to our journey. Given that any attempt to solve the logical problem of induction leads to circularity, philosophers turned instead to building logics that seemed to capture our intuitions about induction. This led to confirmation theory and some projects in today’s formal epistemology. There’s an analogy between contrasting views in philosophy and statistics: Carnapian confirmation is to Bayesian statistics, as Popperian falsification is to frequentist error statistics. Logics of confirmation take the form of probabilisms, either in the form of raising the probability of a hypothesis, or arriving at a posterior probability. (2.2) The contrast between these types of probabilisms, and the problems each is found to have in confirmation theory are directly relevant to the types of probabilisms in statistics. Notably, Harold Jeffreys’ non-subjective Bayesianism, and current spin-offs, share features with Carnapian inductive logics. We examine the problem of irrelevant conjunctions: that if x confirms H, it confirms (H & J) for any J. This also leads to what’s called the tacking paradox.
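The irrelevant-conjunction (tacking) problem mentioned in the blurb can be checked with a few lines of arithmetic. The numbers below are my own toy illustration, not the book’s: J is an “irrelevant” conjunct in the sense of being probabilistically independent of H and of the data x, yet if x raises the probability of H, it raises the probability of H & J by exactly the same factor.

```python
# Toy joint distribution (hypothetical numbers) demonstrating the tacking paradox
# for confirmation-as-probability-raising.
P_H = 0.3
P_J = 0.5                   # irrelevant conjunct: independent of H and of x
P_x_given_H = 0.8
P_x_given_notH = 0.2

# Bayes' theorem for H:
P_x = P_x_given_H * P_H + P_x_given_notH * (1 - P_H)
P_H_given_x = P_x_given_H * P_H / P_x

# Since J is irrelevant, P(x | H & J) = P(x | H), so Bayes gives:
P_HJ = P_H * P_J
P_HJ_given_x = P_x_given_H * P_HJ / P_x

print(P_H_given_x > P_H)      # x confirms H
print(P_HJ_given_x > P_HJ)    # x equally "confirms" H & (irrelevant) J
```

Both ratios P(H|x)/P(H) and P(H&J|x)/P(H&J) come out identical (here 0.8/0.38 ≈ 2.1), which is why probability-raising accounts must say x confirms the conjunction with any irrelevant J tacked on.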

Quiz on 2.1 Soundness vs Validity in Deductive Logic. Let ~C be the denial of claim C. For each of the following arguments, indicate whether it is valid and sound, valid but unsound, or invalid. Continue reading

Categories: induction, SIST, Statistical Inference as Severe Testing, Statistics | 8 Comments

Stephen Senn: Rothamsted Statistics meets Lord’s Paradox (Guest Post)


Stephen Senn
Consultant Statistician

The Rothamsted School

I never worked at Rothamsted but during the eight years I was at University College London (1995-2003) I frequently shared a train journey to London from Harpenden (the village in which Rothamsted is situated) with John Nelder, as a result of which we became friends and I acquired an interest in the software package Genstat®.

That in turn got me interested in John Nelder’s approach to analysis of variance, which is a powerful formalisation of ideas present in the work of others associated with Rothamsted. Nelder’s important predecessors in this respect include, at least, RA Fisher (of course), Frank Yates, and others such as David Finney and Frank Anscombe. John died in 2010 and I regard Rosemary Bailey, who has done deep and powerful work on randomisation and the representation of experiments through Hasse diagrams, as being the greatest living proponent of the Rothamsted School. Another key figure is Roger Payne who turned many of John’s ideas into code in Genstat®. Continue reading

Categories: Error Statistics | 11 Comments

Souvenir C: A Severe Tester’s Translation Guide (Excursion 1 Tour II)


I will continue to post mementos and, at times, short excerpts following the pace of one “Tour” a week, in sync with some book clubs reading Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (SIST or Statinfast 2018, CUP), e.g., Lakens. This puts us at Excursion 2 Tour I, but first, here’s a quick Souvenir (Souvenir C) from Excursion 1 Tour II:

Souvenir C: A Severe Tester’s Translation Guide

Just as in ordinary museum shops, our souvenir literature often probes treasures that you didn’t get to visit at all. Here’s an example of that, and you’ll need it going forward. There’s a confusion about what’s being done when the significance tester considers the set of all of the outcomes leading to a d(x) greater than or equal to 1.96, i.e., {x: d(x) ≥ 1.96}, or just d(x) ≥ 1.96. This is generally viewed as throwing away the particular x, and lumping all these outcomes together. What’s really happening, according to the severe tester, is quite different. What’s actually being signified is that we are interested in the method, not just the particular outcome. Those who embrace the LP (Likelihood Principle) make it very plain that data-dependent selections and stopping rules drop out. To get them to drop in, we signal an interest in what the test procedure would have yielded. This is a counterfactual and is altogether essential in expressing the properties of the method, in particular, the probability it would have yielded some nominally significant outcome or other. Continue reading
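The counterfactual “what the procedure would have yielded” can be made concrete with a simulation. This is my own illustrative sketch, not the book’s: under a true null, a fixed-sample test using the 1.96 cutoff is nominally significant about 5% of the time, but a try-and-try-again procedure that checks the running z-statistic after every observation (up to some maximum) yields “some nominally significant outcome or other” far more often.

```python
import math
import random

random.seed(1)

def significant(z: float) -> bool:
    """Nominal two-sided 5% cutoff on a z-statistic."""
    return abs(z) >= 1.96

def optional_stopping_trial(max_n: int = 100) -> bool:
    """Sample from N(0,1) (null hypothesis true), testing after each
    observation; return True if the running z-statistic ever crosses 1.96."""
    total = 0.0
    for n in range(1, max_n + 1):
        total += random.gauss(0.0, 1.0)
        if significant(total / math.sqrt(n)):   # z for the mean of n draws
            return True
    return False

trials = 4000
rate = sum(optional_stopping_trial() for _ in range(trials)) / trials
print(f"P(some nominally significant result | H0, try-and-try-again) ≈ {rate:.2f}")
```

The simulated rate comes out well above the nominal 0.05, which is exactly why the severe tester insists the stopping rule “drops in”: the error probability attaches to the procedure, not to the particular outcome observed.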

Categories: Statistical Inference as Severe Testing | Leave a comment

The Replication Crisis and its Constructive Role in the Philosophy of Statistics (PSA 2018)

Below are my slides from a session on replication at the recent Philosophy of Science Association meetings in Seattle.


Categories: Error Statistics | Leave a comment
