Author Archives: Mayo

About Mayo

I am a professor in the Department of Philosophy at Virginia Tech and hold a visiting appointment at the Center for the Philosophy of Natural and Social Science of the London School of Economics. I am the author of Error and the Growth of Experimental Knowledge, which won the 1998 Lakatos Prize, awarded to the most outstanding contribution to the philosophy of science during the previous six years. I have applied my approach toward solving key problems in philosophy of science: underdetermination, the role of novel evidence, Duhem's problem, and the nature of scientific progress. I am also interested in applications to problems in risk analysis and risk controversies, and co-edited Acceptable Evidence: Science and Values in Risk Management (with Rachelle Hollander). I teach courses in introductory and advanced logic (including the metatheory of logic and modal logic), in scientific method, and in philosophy of science. I also teach special topics courses in Science and Technology Studies.

Tour Guide Mementos (Excursion 1, Tour I of How to Get Beyond the Statistics Wars)


Tour guides in your travels jot down Mementos and Keepsakes from each Tour[i] of my new book: Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (CUP 2018). Their scribblings, which may at times include details, at other times just a word or two, may be modified through the Tour, and in response to questions from travelers (so please check back). Since these are just mementos, they should not be seen as replacements for the more careful notions given in the journey (i.e., book) itself. Still, you’re apt to flesh out your notes in greater detail, so please share yours (along with errors you’re bound to spot), and we’ll create Meta-Mementos.

Excursion 1. Tour I: Beyond Probabilism and Performance


Notes from Section 1.1 Severity Requirement: Bad Evidence, No Test (BENT)

1.1 Terms (quick looks, to be crystalized as we journey on)

  1. epistemology: The general area of philosophy that deals with knowledge, evidence, inference, and rationality.
  2. severity requirement: In its weakest form it supplies a minimal requirement for evidence:
    severity requirement (weak): One does not have evidence for a claim if little if anything has been done to rule out ways the claim may be false. If data x agree with a claim C but the method used is practically guaranteed to find such agreement, and had little or no capability of finding flaws with C even if they exist, then we have bad evidence, no test (BENT).
  3. error probabilities of a method: the probabilities that it leads, or would lead, to erroneous interpretations of data. (We will formalize this as we proceed.)

error statistical account: one that revolves around the control and assessment of a method’s error probabilities. An inference is qualified by the error probability of the method that led to it.

(This replaces common uses of “frequentist” which actually has many other connotations.)
error statistician: one who uses error statistical methods.

severe testers: a proper subset of error statisticians: those who use error probabilities to assess and control severity. (They may use them for other purposes as well.)

The severe tester also requires reporting what has been poorly probed and inseverely tested.
Error probabilities can, but don’t necessarily, provide assessments of the capability of methods to reveal or avoid mistaken interpretations of data. When they do, they may be used to assess how severely a claim passes a test.

  4. methodology and meta-methodology: Methods we use to study statistical methods may be called our meta-methodology – it’s one level removed.

We can keep to testing language as part of the meta-language we use to talk about formal statistical methods, where the latter include estimation, exploration, prediction, and data analysis.

There’s a difference between finding H poorly tested by data x, and finding x renders H improbable – in any of the many senses the latter takes on.
H: Isaac knows calculus.
x: results of a coin flipping experiment

Even taking H to be true, data x has done nothing to probe the ways in which H might be false.

5. R.A. Fisher, against isolated statistically significant results (p. 4).

[W]e need, not an isolated record, but a reliable method of procedure. In relation to the
test of significance, we may say that a phenomenon is experimentally demonstrable
when we know how to conduct an experiment which will rarely fail to give us
a statistically significant result. (Fisher 1935b/1947, p. 14)


Notes from Section 1.2 of SIST: How to get beyond the stat wars

6. statistical philosophy (associated with a statistical methodology): core ideas that direct its principles, methods, and interpretations.
Two main philosophies about the roles of probability in statistical inference: performance (in the long run) and probabilism.
(i) performance: probability functions to control and assess the relative frequency of erroneous inferences in some long run of applications of the method
(ii) probabilism: probability functions to assign degrees of belief, support, or plausibility to hypotheses. They may be non-comparative (a posterior probability) or comparative (a likelihood ratio or Bayes Factor)

Severe testing introduces a third:
(iii) probativism: probability functions to assess and control a method’s capability of detecting mistaken inferences, i.e., the severity associated with inferences.
• Performance is a necessary but not a sufficient condition for probativeness.
• Just because an account is touted as having a long-run rationale, it does not mean it lacks a short run rationale, or even one relevant for the particular case at hand.

7. Severity strong (argument from coincidence):
We have evidence for a claim C just to the extent it survives a stringent scrutiny. If C passes a test that was highly capable of finding flaws or discrepancies from C, and yet no or few are found, then the passing result, x, is evidence for C.
lift-off vs. drag-down
(i) lift-off: an overall inference can be more reliable and precise than its premises individually.
(ii) drag-down: an overall inference is only as reliable/precise as its weakest premise.

• Lift-off is associated with convergent arguments, drag-down with linked arguments.
• Statistics is the science par excellence for demonstrating lift-off!
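As a toy numerical sketch of lift-off (my own illustration, not from the book): suppose each of several independent checks errs with probability 0.2. A linked argument is hostage to its weakest premise, but the chance that a majority of independent, convergent checks err together is far smaller:

```python
from math import comb

def majority_err(p_err, n):
    """Probability that a strict majority of n independent checks
    are in error, when each errs with probability p_err."""
    return sum(comb(n, k) * p_err**k * (1 - p_err)**(n - k)
               for k in range(n // 2 + 1, n + 1))

# A single premise erring 20% of the time caps a linked argument
# at 0.8 reliability; five independent convergent checks do better:
print(majority_err(0.2, 1))            # → 0.2
print(round(majority_err(0.2, 5), 4))  # → 0.0579
```

The overall inference from five convergent checks is more reliable than any single premise — the arithmetic behind lift-off.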

8. arguing from error: there is evidence an error is absent to the extent that a procedure with a high capability of signaling the error, if and only if it is present, nevertheless detects no error.

Bernoulli (coin tossing) model: we record success or failure, assume a fixed probability of success θ on each trial, and that trials are independent. (P-value in the case of the Lady Tasting Tea, pp. 16-17).
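For concreteness, a one-sided P-value under this Bernoulli model can be computed directly from the binomial distribution. (The numbers below are hypothetical stand-ins, not the book’s tea-tasting data.)

```python
from math import comb

def binom_p_value(successes, n, theta0=0.5):
    """One-sided P-value: the probability, computed under the null
    hypothesis theta = theta0, of at least `successes` successes
    in n independent Bernoulli(theta0) trials."""
    return sum(comb(n, k) * theta0**k * (1 - theta0)**(n - k)
               for k in range(successes, n + 1))

# Hypothetical numbers: 9 correct identifications in 10 trials
print(round(binom_p_value(9, 10), 4))  # → 0.0107
```

A small P-value says results at least this extreme would rarely occur were the trials mere guessing — the raw material for an argument from coincidence, provided the error probabilities are not invalidated (next note).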

Error probabilities can be readily invalidated due to how the data (and hypotheses!) are generated or selected for testing.

9. computed (or nominal) vs actual error probabilities: You may claim it’s very difficult to get such an impressive result due to chance, when in fact it’s very easy to do so, with selective reporting (e.g., your computed P-value can be small, but the actual P-value is high.)
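The gap between computed and actual error probabilities can be made vivid with a small simulation (my own sketch; the 20-test setup is hypothetical). Under a true null with a continuous test statistic, a valid P-value is uniformly distributed on (0, 1); reporting only the smallest of 20 such P-values makes “P ≤ 0.05” far easier to achieve than the nominal 0.05 suggests:

```python
import random

random.seed(1)

ALPHA = 0.05
N_TESTS = 20      # hypothetical: 20 looks at the data, best one reported
TRIALS = 100_000

# Under a true null (continuous test statistic), a valid P-value is
# Uniform(0, 1). Selective reporting announces only the smallest of
# N_TESTS such P-values.
hits = sum(
    min(random.random() for _ in range(N_TESTS)) <= ALPHA
    for _ in range(TRIALS)
)
actual = hits / TRIALS

print(f"computed (nominal) threshold: {ALPHA}")
print(f"actual P(report 'P <= {ALPHA}'): {actual:.3f}")
print(f"theoretical: {1 - (1 - ALPHA) ** N_TESTS:.3f}")  # ≈ 0.642
```

The computed P-value on the reported test may be 0.05, but the actual probability of producing such an “impressive” result by this procedure is roughly 0.64.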

Examples: Peirce and Dr. Playfair (a law is inferred even though half of the cases required Playfair to modify the formula after the fact); Texas marksman (shooting prowess inferred from shooting bullets into the side of a barn, and painting a bull’s eye around clusters of bullet holes); Pickrite stock portfolio (Pickrite’s effectiveness at stock picking is inferred based on selecting those on which the “method” did best).
• We appeal to the same statistical reasoning to show the problematic cases as to show genuine arguments from coincidence.
• A key role for statistical inference is to identify ways to spot egregious deceptions and create strong arguments from coincidence.

10. Auditing a P-value (one part): checking whether the results are due to selective reporting, cherry picking, trying and trying again, or any number of other similar ruses.
• Replicability isn’t enough. Example: observational studies on Hormone Replacement Therapy (HRT) reproducibly showed benefits, but had little capacity to unearth biases due to “the healthy women’s syndrome.”

Souvenir A.[ii] Postcard to Send: the 4 fallacies from the opening of 1.1.
• We should oust the mechanical, recipe-like uses of statistical methods that have long been lampooned,
• But simple significance tests have their uses, and shouldn’t be ousted simply because some people are liable to violate Fisher’s warnings.
• They have the means by which to register formally the fallacies in the postcard list (failed statistical assumptions and selection effects alter a test’s error-probing capacities).
• Don’t throw out the error control baby with the bad statistics bathwater.

severity requirement (weak): If data x agree with a claim C but the method was practically incapable of finding flaws with C even if they exist, then x is poor evidence for C.
severity (strong): If C passes a test that was highly capable of finding flaws or discrepancies from C, and yet no or few are found, then the passing result, x, is an indication of, or evidence for, C.


Notes from Section 1.3: The Current State of Play in Statistical Foundations: A View From a Hot-Air Balloon

The Bayesian versus frequentist dispute parallels disputes between probabilism and performance.

-Using Bayes’ Theorem doesn’t make you a Bayesian.

-subjective Bayesianism and non-subjective (default) Bayesianism

11. Advocates of unifications are keen to show that (i) default Bayesian methods have good performance in a long series of repetitions – so probabilism may yield performance; or alternatively, (ii) frequentist quantities are similar to Bayesian ones (at least in certain cases) – so performance may yield probabilist numbers. Why is this not bliss? Why are so many from all sides dissatisfied?

It had long been assumed that only subjective or personalistic Bayesianism had a shot at providing genuine philosophical foundations, but some Bayesians have come to question whether the widespread use of methods under the Bayesian umbrella, however useful, indicates support for subjective Bayesianism as a foundation.

Marriages of Convenience? The current frequentist–Bayesian unifications are often marriages of convenience;

-some are concerned that methodological conflicts are bad for the profession.

-frequentist tribes have not disappeared; scientists still call for error control.

-Frequentists’ incentive to marry: Lacking a suitable epistemic interpretation of error probabilities – significance levels, power, and confidence levels – frequentists are constantly put on the defensive.

Eclecticism and Ecumenism. Current-day eclecticisms have a long history – the dabbling in tools from competing statistical tribes has not been thought to pose serious challenges.

Decoupling. On the horizon is the idea that statistical methods may be decoupled from the philosophies in which they are traditionally couched (e.g., Gelman and Shalizi 2013). The concept of severe testing is sufficiently general to apply to any of the methods now in use.

Why Our Journey? To disentangle the jungle. Being hesitant to reopen wounds from old battles does not heal them. They show up in the current problems of scientific integrity, irreproducibility, questionable research practices, and in the swirl of methodological reforms and guidelines that spin their way down from journals and reports.

How it occurs: the new stat scrutiny (arising from failures of replication) collects from:

-the earlier social science “significance test controversy”

-the traditional frequentist and Bayesian accounts, and corresponding frequentist-Bayesian wars

-the newer Bayesian–frequentist unifications (non-subjective, default Bayesianism)

This jungle has never been disentangled.

Your Tour Guide invites your questions in the comments.


[i] As these are scribbled down in notebooks through ocean winds, wetlands and insects, do not expect neatness. Please share improvements and corrections.

[ii] For a free copy of “Statistical Inference as Severe Testing”, send me your conception of Souvenir A, your real souvenir A, or a picture of your real Souvenir A (through Nov 16, 2018).


Categories: Error Statistics, Statistical Inference as Severe Testing | Leave a comment

Philosophy of Statistics & the Replication Crisis in Science: A philosophical intro to my book (slides)

a road through the jungle

In my talk yesterday at the Philosophy Department at Virginia Tech, I introduced my new book: Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (Cambridge 2018). I began with my preface (explaining the meaning of my title), and turned to the Statistics Wars, largely from Excursion 1 of the book. After the sum-up at the end, I snuck in an example from the replication crisis in psychology. Here are the slides.


Categories: Error Statistics | Leave a comment

Excursion 2 Tour II (3rd stop): Falsification, Pseudoscience, Induction (2.3)

StatSci/PhilSci Museum

Where you are in the Journey* We’ll move from the philosophical ground floor to connecting themes from other levels, from Popperian falsification to significance tests, and from Popper’s demarcation to current-day problems of pseudoscience and irreplication. An excerpt from our Museum Guide gives a broad-brush sketch of the first few sections of Tour II:

Karl Popper had a brilliant way to “solve” the problem of induction: Hume was right that enumerative induction is unjustified, but science is a matter of deductive falsification. Science was to be demarcated from pseudoscience according to whether its theories were testable and falsifiable. A hypothesis is deemed severely tested if it survives a stringent attempt to falsify it. Popper’s critics denied he could sustain this and still be a deductivist …

Popperian falsification is often seen as akin to Fisher’s view that “every experiment may be said to exist only in order to give the facts a chance of disproving the null hypothesis” (1935a, p. 16). Though scientists often appeal to Popper, some critics of significance tests argue that they are used in decidedly non-Popperian ways. Tour II explores this controversy.

While Popper didn’t make good on his most winning slogans, he gives us many seminal launching-off points for improved accounts of falsification, corroboration, science versus pseudoscience, and the role of novel evidence and predesignation. These will let you revisit some thorny issues in today’s statistical crisis in science.

2.3 Popper, Severity, and Methodological Probability

Here’s Popper’s summary (drawing from Popper, Conjectures and Refutations, 1962, p. 53):

  • [Enumerative] induction … is a myth. It is neither a psychological fact … nor one of scientific procedure.
  • The actual procedure of science is to operate with conjectures…
  • Repeated observation and experiments function in science as tests of our conjectures or hypotheses, i.e., as attempted refutations.
  • [It is wrongly believed that using the inductive method can] serve as a criterion of demarcation between science and pseudoscience. … None of this is altered in the least if we say that induction makes theories only probable.

There are four key, interrelated themes:

(1) Science and Pseudoscience. Redefining scientific method gave Popper a new basis for demarcating genuine science from questionable science or pseudoscience. Flexible theories that are easy to confirm – theories of Marx, Freud, and Adler were his exemplars – where you open your eyes and find confirmations everywhere, are low on the scientific totem pole (ibid., p. 35). For a theory to be scientific it must be testable and falsifiable. Continue reading

Categories: Statistical Inference as Severe Testing | 7 Comments

“It should never be true, though it is still often said, that the conclusions are no more accurate than the data on which they are based”


My new book, Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars, you might have discovered, includes Souvenirs throughout (A-Z). But there are some highlights within sections that might be missed in the excerpts I’m posting. One such “keepsake” is a quote from Fisher at the very end of Section 2.1.

These are some of the first clues we’ll be collecting on a wide difference between statistical inference as a deductive logic of probability, and an inductive testing account sought by the error statistician. When it comes to inductive learning, we want our inferences to go beyond the data: we want lift-off. To my knowledge, Fisher is the only other writer on statistical inference, aside from Peirce, to emphasize this distinction.

In deductive reasoning all knowledge obtainable is already latent in the postulates. Rigour is needed to prevent the successive inferences growing less and less accurate as we proceed. The conclusions are never more accurate than the data. In inductive reasoning we are performing part of the process by which new knowledge is created. The conclusions normally grow more and more accurate as more data are included. It should never be true, though it is still often said, that the conclusions are no more accurate than the data on which they are based. (Fisher 1935b, p. 54)

How do you understand this remark of Fisher’s? (Please share your thoughts in the comments.) My interpretation, and its relation to the “lift-off” needed to warrant inductive inferences, is discussed in an earlier section, 1.2, posted here. Here’s part of that.

Continue reading

Categories: induction, keepsakes from Stat Wars, Statistical Inference as Severe Testing | 2 Comments

Excursion 2: Taboos of Induction and Falsification: Tour I (first stop)

StatSci/PhilSci Museum

Where you are in the Journey* 

Cox: [I]n some fields foundations do not seem very important, but we both think that foundations of statistical inference are important; why do you think that is?

Mayo: I think because they ask about fundamental questions of evidence, inference, and probability … we invariably cross into philosophical questions about empirical knowledge and inductive inference. (Cox and Mayo 2011, p. 103)

Contemporary philosophy of science presents us with some taboos: Thou shalt not try to find solutions to problems of induction, falsification, and demarcating science from pseudoscience. It’s impossible to understand rival statistical accounts, let alone get beyond the statistics wars, without first exploring how these came to be “lost causes.” I am not talking of ancient history here: these problems were alive and well when I set out to do philosophy in the 1980s. I think we gave up on them too easily, and by the end of Excursion 2 you’ll see why. Excursion 2 takes us into the land of “Statistical Science and Philosophy of Science” (StatSci/PhilSci). Our Museum Guide gives a terse thumbnail sketch of Tour I. Here’s a useful excerpt:

Once the Problem of Induction was deemed to admit of no satisfactory, non-circular solutions (~1970s), philosophers of science turned to building formal logics of induction using the deductive calculus of probabilities, often called Confirmation Logics or Theories. A leader of this Confirmation Theory movement was Rudolf Carnap. A distinct program, led by Karl Popper, denies there is a logic of induction, and focuses on Testing and Falsification of theories by data. At best a theory may be accepted or corroborated if it fails to be falsified by a severe test. The two programs have analogues to distinct methodologies in statistics: Confirmation theory is to Bayesianism as Testing and Falsification are to Fisher and Neyman–Pearson.


Continue reading

Categories: induction, Statistical Inference as Severe Testing | 2 Comments

All She Wrote (so far): Error Statistics Philosophy: 7 years on

Error Statistics Philosophy: Blog Contents (7 years) [i]
By: D. G. Mayo

Dear Reader: I began this blog 7 years ago (Sept. 3, 2011)! A big celebration is taking place at the Elbar Room this evening, both for the blog and the appearance of my new book: Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (CUP). While a special rush edition made an appearance on Sept 3, in time for the RSS meeting in Cardiff, it was decided to hold off on the festivities until copies of the book were officially available (yesterday)! If you’re in the neighborhood, stop by for some Elba Grease.


Many of the discussions in the book were importantly influenced (corrected and improved) by readers’ comments on the blog over the years. I thank readers for their input. Please peruse the offerings below, taking advantage of the discussions by guest posters and readers! I posted the first 3 sections of Tour I (in Excursion 1) here, here, and here.
This blog will return to life, although I’m not yet sure of exactly what form it will take. Ideas are welcome. The tone of a book differs from a blog, so we’ll have to see what voice emerges here.


D. Mayo Continue reading

Categories: 3-year memory lane, 4 years ago!, blog contents, Metablog | 2 Comments

Excursion 1 Tour I (3rd stop): The Current State of Play in Statistical Foundations: A View From a Hot-Air Balloon (1.3)


How can a discipline, central to science and to critical thinking, have two methodologies, two logics, two approaches that frequently give substantively different answers to the same problems? … Is complacency in the face of contradiction acceptable for a central discipline of science? (Donald Fraser 2011, p. 329)

We [statisticians] are not blameless … we have not made a concerted professional effort to provide the scientific world with a unified testing methodology. (J. Berger 2003, p. 4)

From the aerial perspective of a hot-air balloon, we may see contemporary statistics as a place of happy multiplicity: the wealth of computational ability allows for the application of countless methods, with little handwringing about foundations. Doesn’t this show we may have reached “the end of statistical foundations”? One might have thought so. Yet, descending close to a marshy wetland, and especially scratching a bit below the surface, reveals unease on all sides. The false dilemma between probabilism and long-run performance lets us get a handle on it. In fact, the Bayesian versus frequentist dispute arises as a dispute between probabilism and performance. This gets to my second reason for why the time is right to jump back into these debates: the “statistics wars” present new twists and turns. Rival tribes are more likely to live closer and in mixed neighborhoods since around the turn of the century. Yet, to the beginning student, it can appear as a jungle.

Statistics Debates: Bayesian versus Frequentist

These days there is less distance between Bayesians and frequentists, especially with the rise of objective [default] Bayesianism, and we may even be heading toward a coalition government. (Efron 2013, p. 145)

Continue reading

Categories: Statistical Inference as Severe Testing | 1 Comment

Excursion 1 Tour I (2nd stop): Probabilism, Performance, and Probativeness (1.2)


I shall be concerned with the foundations of the subject. But in case it should be thought that this means I am not here strongly concerned with practical applications, let me say right away that confusion about the foundations of the subject is responsible, in my opinion, for much of the misuse of the statistics that one meets in fields of application such as medicine, psychology, sociology, economics, and so forth. (George Barnard 1985, p. 2)

While statistical science (as with other sciences) generally goes about its business without attending to its own foundations, implicit in every statistical methodology are core ideas that direct its principles, methods, and interpretations. I will call this its statistical philosophy. To tell what’s true about statistical inference, understanding the associated philosophy (or philosophies) is essential. Discussions of statistical foundations tend to focus on how to interpret probability, and much less on the overarching question of how probability ought to be used in inference. Assumptions about the latter lurk implicitly behind debates, but rarely get the limelight. If we put the spotlight on them, we see that there are two main philosophies about the roles of probability in statistical inference: We may dub them performance (in the long run) and probabilism. Continue reading

Categories: Statistical Inference as Severe Testing | 3 Comments

Excursion 1 Tour I: Beyond Probabilism and Performance: Severity Requirement (1.1)

The cruise begins…

I’m talking about a specific, extra type of integrity that is [beyond] not lying, but bending over backwards to show how you’re maybe wrong, that you ought to have when acting as a scientist. (Feynman 1974/1985, p. 387)

It is easy to lie with statistics. Or so the cliché goes. It is also very difficult to uncover these lies without statistical methods – at least of the right kind. Self-correcting statistical methods are needed, and, with minimal technical fanfare, that’s what I aim to illuminate. Since Darrell Huff wrote How to Lie with Statistics in 1954, ways of lying with statistics are so well worn as to have emerged in reverberating slogans:

  • Association is not causation.
  • Statistical significance is not substantive significance.
  • No evidence of risk is not evidence of no risk.
  • If you torture the data enough, they will confess.

Exposés of fallacies and foibles ranging from professional manuals and task forces to more popularized debunking treatises are legion. New evidence has piled up showing lack of replication and all manner of selection and publication biases. Even expanded “evidence-based” practices, whose very rationale is to emulate experimental controls, are not immune from allegations of illicit cherry picking, significance seeking, P-hacking, and assorted modes of extraordinary rendition of data. Attempts to restore credibility have gone far beyond the cottage industries of just a few years ago, to entirely new research programs: statistical fraud-busting, statistical forensics, technical activism, and widespread reproducibility studies. There are proposed methodological reforms – many are generally welcome (preregistration of experiments, transparency about data collection, discouraging mechanical uses of statistics), some are quite radical. If we are to appraise these evidence policy reforms, a much better grasp of some central statistical problems is needed.

Continue reading

Categories: Statistical Inference as Severe Testing, Statistics | 8 Comments

The Physical Reality of My New Book! Here at the RSS Meeting


Categories: SIST | 3 Comments

RSS 2018 – Significance Tests: Rethinking the Controversy


Day 2, Wednesday 05/09/2018

11:20 – 13:20

Keynote 4 – Significance Tests: Rethinking the Controversy Assembly Room

Sir David Cox, Nuffield College, Oxford
Deborah Mayo, Virginia Tech
Richard Morey, Cardiff University
Aris Spanos, Virginia Tech

Intermingled in today’s statistical controversies are some long-standing, but unresolved, disagreements on the nature and principles of statistical methods and the roles for probability in statistical inference and modelling. In reaction to the so-called “replication crisis” in the sciences, some reformers suggest significance tests as a major culprit. To understand the ramifications of the proposed reforms, there is a pressing need for a deeper understanding of the source of the problems in the sciences and a balanced critique of the alternative methods being proposed to supplant significance tests. In this session speakers offer perspectives on significance tests from statistical science, econometrics, experimental psychology and philosophy of science. There will also be a panel discussion.

Categories: Error Statistics | 2 Comments


3 years ago…

MONTHLY MEMORY LANE: 3 years ago: August 2015. I mark in red 3-4 posts that seem most apt for general background on key issues in this blog, excluding those reblogged recently[1], and in green up to 3 others of relevance to philosophy of statistics [2]. Posts that are part of a “unit” or a group count as one.

August 2015

  • 08/05 Neyman: Distinguishing tests of statistical hypotheses and tests of significance might have been a lapse of someone’s pen
  • 08/08  Statistical Theater of the Absurd: “Stat on a Hot Tin Roof”
  • 08/11 A. Spanos: Egon Pearson’s Neglected Contributions to Statistics (recently reblogged)
  • 08/14 Performance or Probativeness? E.S. Pearson’s Statistical Philosophy
  • 08/15  Severity in a Likelihood Text by Charles Rohde
  • 08/19 Statistics, the Spooky Science
  • 08/20 How to avoid making mountains out of molehills, using power/severity
  • 08/24 3 YEARS AGO (AUGUST 2012): MEMORY LANE
  • 08/31 The Paradox of Replication, and the vindication of the P-value (but she can go deeper) 9/2/15 update (ii)

[1] Monthly memory lanes began at the blog’s 3-year anniversary in Sept, 2014.

[2] New Rule, July 30, 2016, March 30, 2017 – a very convenient way to allow data-dependent choices (note why it’s legit in selecting blog posts, on severity grounds).

Categories: 3-year memory lane | Leave a comment

A. Spanos: Egon Pearson’s Neglected Contributions to Statistics

Continuing with the discussion of E.S. Pearson in honor of his birthday:

Egon Pearson’s Neglected Contributions to Statistics

by Aris Spanos

    Egon Pearson (11 August 1895 – 12 June 1980) is widely known today for his contribution in recasting Fisher’s significance testing into the Neyman-Pearson (1933) theory of hypothesis testing. Occasionally, he is also credited with contributions in promoting statistical methods in industry and in the history of modern statistics; see Bartlett (1981). What is rarely mentioned is Egon’s early pioneering work on:

(i) specification: the need to state explicitly the inductive premises of one’s inferences,

(ii) robustness: evaluating the ‘sensitivity’ of inferential procedures to departures from the Normality assumption, as well as

(iii) Mis-Specification (M-S) testing: probing for potential departures from the Normality assumption.

Arguably, modern frequentist inference began with the development of various finite sample inference procedures, initially by William Gosset (1908) [of the Student’s t fame] and then Fisher (1915, 1921, 1922a-b). These inference procedures revolved around a particular statistical model, known today as the simple Normal model: Continue reading

Categories: E.S. Pearson, Egon Pearson, Statistics | 1 Comment

Egon Pearson’s Heresy

E.S. Pearson: 11 Aug 1895-12 June 1980.

Today is Egon Pearson’s birthday. In honor of his birthday, I am posting “Statistical Concepts in Their Relation to Reality” (Pearson 1955). I’ve posted it several times over the years, but always find a new gem or two, despite its being so short. E. Pearson rejected some of the familiar tenets that have come to be associated with Neyman and Pearson (N-P) statistical tests, notably the idea that the essential justification for tests resides in a long-run control of rates of erroneous interpretations–what he termed the “behavioral” rationale of tests. In an unpublished letter E. Pearson wrote to Birnbaum (1974), he talks about N-P theory admitting of two interpretations: behavioral and evidential:

“I think you will pick up here and there in my own papers signs of evidentiality, and you can say now that we or I should have stated clearly the difference between the behavioral and evidential interpretations. Certainly we have suffered since in the way the people have concentrated (to an absurd extent often) on behavioral interpretations”.

Continue reading

Categories: phil/history of stat, Philosophy of Statistics, Statistics | Tags: , , | 2 Comments

For Popper’s Birthday: Reading from Conjectures and Refutations (+ self-test)


28 July 1902 – 17 September 1994

Today is Karl Popper’s birthday. I’m linking to a reading from his Conjectures and Refutations[i] along with: Popper Self-Test Questions. It includes multiple choice questions, quotes to ponder, an essay, and thumbnail definitions at the end[ii].

Blog readers who wish to send me their answers [use the comments or] will have their papers graded. An A- or better earns a signed copy of my forthcoming book: Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars.[iii]

[i] Popper reading from Conjectures and Refutations
[ii] I might note the “No-Pain philosophy” (3 part) Popper posts on this blog: parts 1, 2, and 3.

[iii] I posted this once before, but now I have a better prize.



Popper, K. (1962). Conjectures and Refutations: The Growth of Scientific Knowledge. New York: Basic Books.



Categories: Popper | Leave a comment


3 years ago...


MONTHLY MEMORY LANE: 3 years ago: July 2015. I mark in red 3-4 posts from each month that seem most apt for general background on key issues in this blog, excluding those reblogged recently[1], and in green up to 3 others of general relevance to philosophy of statistics [2].  Posts that are part of a “unit” or a group count as one.

July 2015

  • 07/03 Larry Laudan: “When the ‘Not-Guilty’ Falsely Pass for Innocent”, the Frequency of False Acquittals (guest post)
  • 07/09  Winner of the June Palindrome contest: Lori Wike
  • 07/11 Higgs discovery three years on (Higgs analysis and statistical flukes)-reblogged recently
  • 07/14  Spot the power howler: α = ß?
  • 07/17  “Statistical Significance” According to the U.S. Dept. of Health and Human Services (ii)
  • 07/22 3 YEARS AGO (JULY 2012): MEMORY LANE
  • 07/24 Stephen Senn: Randomization, ratios and rationality: rescuing the randomized clinical trial from its critics
  • 07/29  Telling What’s True About Power, if practicing within the error-statistical tribe


[1] Monthly memory lanes began at the blog’s 3-year anniversary in Sept, 2014.

[2] New Rule, July 30, 2016, March 30, 2017 – a very convenient way to allow data-dependent choices (note why it’s legit in selecting blog posts, on severity grounds).

Categories: 3-year memory lane | Leave a comment

S. Senn: Personal perils: are numbers needed to treat misleading us as to the scope for personalised medicine? (Guest Post)

Personal perils: are numbers needed to treat misleading us as to the scope for personalised medicine?

A common misinterpretation of Numbers Needed to Treat is causing confusion about the scope for personalised medicine.

Stephen Senn
Consultant Statistician,


Thirty years ago, Laupacis et al1 proposed an intuitively appealing way that physicians could decide how to prioritise health care interventions: they could consider how many patients would need to be switched from an inferior treatment to a superior one in order for one to have an improved outcome. They called this the number needed to be treated. It is now more usually referred to as the number needed to treat (NNT).

Within fifteen years, NNTs were so well established that the then editor of the British Medical Journal, Richard Smith, could write: ‘Anybody familiar with the notion of “number needed to treat” (NNT) knows that it’s usually necessary to treat many patients in order for one to benefit’2. Fifteen years further on, bringing us up to date, Wikipedia makes a similar point: ‘The NNT is the average number of patients who need to be treated to prevent one additional bad outcome (e.g. the number of patients that need to be treated for one of them to benefit compared with a control in a clinical trial).’3
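For concreteness, the arithmetic behind an NNT can be sketched as follows (a hypothetical two-arm trial of my own devising, not figures from Laupacis et al. or Senn’s post):

```python
# Hypothetical event rates in a two-arm trial (illustrative numbers only).
control_event_rate = 0.20   # proportion with the bad outcome under control
treated_event_rate = 0.15   # proportion with the bad outcome under treatment

# The NNT is the reciprocal of the absolute risk reduction (ARR) between arms.
arr = control_event_rate - treated_event_rate   # ≈ 0.05
nnt = 1 / arr                                    # ≈ 20

# Note: this is an *average* over the trial population; it does not say
# that exactly 1 in 20 individual patients "responds" to treatment.
print(f"ARR = {arr:.2f}, NNT = {nnt:.0f}")
```

The caveat in the final comment is the crux of the post: the average nature of the NNT is what the common “1 in 20 patients benefits” reading overlooks.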

This common interpretation is false, as I have pointed out previously in two blogs on this site: Responder Despondency and  Painful Dichotomies. Nevertheless, it seems to me the point is worth making again and the thirty-year anniversary of NNTs provides a good excuse. Continue reading

Categories: personalized medicine, PhilStat/Med, S. Senn | 7 Comments

Statistics and the Higgs Discovery: 5-6 yr Memory Lane


I’m reblogging a few of the Higgs posts on the 6th anniversary of the 2012 discovery. (The first was in this post.) The following was originally “Higgs Analysis and Statistical Flukes: part 2” (from March 2013).[1]

Some people say to me: “This kind of [severe testing] reasoning is fine for a ‘sexy science’ like high energy physics (HEP)”–as if their statistical inferences are radically different. But I maintain that this is the mode by which data are used in “uncertain” reasoning across the entire landscape of science and day-to-day learning (at least, when we’re trying to find things out).[2] Even with high-level theories, the particular problems of learning from data are tackled piecemeal, in local inferences that afford error control. Granted, this statistical philosophy differs importantly from those that view the task as assigning comparative (or absolute) degrees-of-support/belief/plausibility to propositions, models, or theories. Continue reading

Categories: Higgs, highly probable vs highly probed, P-values | 1 Comment

Replication Crises and the Statistics Wars: Hidden Controversies


Below are the slides from my June 14 presentation at the X-Phil conference on Reproducibility and Replicability in Psychology and Experimental Philosophy at University College London. What I think must be examined seriously are the “hidden” issues that are going unattended in replication research and the related statistics wars. An overview of the “hidden controversies” is on slide #3. Although I presented them as “hidden”, I hoped they wouldn’t be quite as invisible as I found them throughout the conference. (Since my talk was at the start, I didn’t know what to expect–else I might have noted some examples that seemed to call for further scrutiny.) Exceptions came largely (but not exclusively) from a small group of philosophers (me, Machery and Fletcher). Then again, there were parallel sessions, so I missed some. However, I did learn something about X-phil, particularly from the very interesting poster session [1]. This new area should invite much, much more scrutiny of statistical methodology from philosophers of science.

[1] The women who organized and ran the conference did an excellent job: Lara Kirfel, a psychology PhD student at UCL, and Pascale Willemsen from Ruhr University.

Categories: Philosophy of Statistics, replication research, slides | Leave a comment

Your data-driven claims must still be probed severely

Vagelos Education Center

Below are the slides from my talk today at Columbia University at a session, Philosophy of Science and the New Paradigm of Data-Driven Science, at an American Statistical Association Conference on Statistical Learning and Data Science/Nonparametric Statistics. Todd was brave to sneak in philosophy of science in an otherwise highly mathematical conference.

Philosophy of Science and the New Paradigm of Data-Driven Science (Room VEC 902/903)
Organizer and Chair: Todd Kuffner (Washington U)

  1. Deborah Mayo (Virginia Tech) “Your Data-Driven Claims Must Still be Probed Severely”
  2.  Ian McKeague (Columbia) “On the Replicability of Scientific Studies”
  3.  Xiao-Li Meng (Harvard) “Conducting Highly Principled Data Science: A Statistician’s Job and Joy”


Categories: slides, Statistics and Data Science | 5 Comments
