Error Statistics

Statistical rivulets: Who wrote this?

questionmark pink


[I]t seems to be useful for statisticians generally to engage in retrospection at this time, because there seems now to exist an opportunity for a convergence of view on the central core of our subject. Unless such an opportunity is taken there is a danger that the powerful central stream of development of our subject may break up into smaller and smaller rivulets which may run away and disappear into the sand.

I shall be concerned with the foundations of the subject. But in case it should be thought that this means I am not here strongly concerned with practical applications, let me say right away that confusion about the foundations of the subject is responsible, in my opinion, for much of the misuse of the statistics that one meets in fields of application such as medicine, psychology, sociology, economics, and so forth. It is also responsible for the lack of use of sound statistics in the more developed areas of science and engineering. While the foundations have an interest of their own, and can, in a limited way, serve as a basis for extending statistical methods to new problems, their study is primarily justified by the need to present a coherent view of the subject when teaching it to others. One of the points I shall try to make is, that we have created difficulties for ourselves by trying to oversimplify the subject for presentation to others. It would surely have been astonishing if all the complexities of such a subtle concept as probability in its application to scientific inference could be represented in terms of only three concepts––estimates, confidence intervals, and tests of hypotheses. Yet one would get the impression that this was possible from many textbooks purporting to expound the subject. We need more complexity; and this should win us greater recognition from scientists in developed areas, who already appreciate that inference is a complex business while at the same time it should deter those working in less developed areas from thinking that all they need is a suite of computer programs.

Who wrote this and when?

Categories: Error Statistics, Statistics | Leave a comment

Popper on pseudoscience: a comment on Pigliucci (i), (ii) 9/18, (iii) 9/20



Jump to Part (ii) 9/18/15 and (iii) 9/20/15 updates

I heard a podcast the other day in which the philosopher of science, Massimo Pigliucci, claimed that Popper’s demarcation of science fails because it permits pseudosciences like astrology to count as scientific! Now Popper requires supplementing in many ways, but we can get far more mileage out of Popper’s demarcation than Pigliucci supposes.

Pigliucci has it that, according to Popper, mere logical falsifiability suffices for a theory to be scientific, and this prevents Popper from properly ousting astrology from the scientific pantheon. Not so. In fact, Popper’s central goal is to call our attention to theories that, despite being logically falsifiable, are rendered immune from falsification by means of ad hoc maneuvering, sneaky face-saving devices, “monster-barring” or “conventionalist stratagems”. Lacking space on Twitter (where the “Philosophy Bites” podcast was linked), I’m placing some quick comments here. (For other posts on Popper, please search this blog.) Excerpts from the classic two pages in Conjectures and Refutations (1962, pp. 36-7) will serve our purpose:

It is easy to obtain confirmations, or verifications, for nearly every theory–if we look for confirmations.



Confirmations should count only if they are the result of risky predictions; that is [if the theory or claim H is false] we should have expected an event which was incompatible with the theory [or claim]….

Every genuine test of a theory is an attempt to falsify it, or to refute it. Testability is falsifiability, but there are degrees of testability, some theories are more testable..

Confirming evidence should not count except when it is the result of a genuine test of the theory, and this means that it can be presented as a serious but unsuccessful attempt to falsify the theory. (I now speak of such cases as ‘corroborating evidence’).

Continue reading

Categories: Error Statistics, Popper, pseudoscience, Statistics | Tags: , | 5 Comments

(Part 3) Peircean Induction and the Error-Correcting Thesis

C. S. Peirce: 10 Sept, 1839-19 April, 1914

C. S. Peirce: 10 Sept, 1839-19 April, 1914

Last third of “Peircean Induction and the Error-Correcting Thesis”

Deborah G. Mayo
Transactions of the Charles S. Peirce Society 41(2) 2005: 299-319

Part 2 is here.

8. Random sampling and the uniformity of nature

We are now at the point to address the final move in warranting Peirce’s SCT. The severity or trustworthiness assessment, on which the error correcting capacity depends, requires an appropriate link (qualitative or quantitative) between the data and the data generating phenomenon, e.g., a reliable calibration of a scale in a qualitative case, or a probabilistic connection between the data and the population in a quantitative case. Establishing such a link, however, is regarded as assuming observed regularities will persist, or making some “uniformity of nature” assumption—the bugbear of attempts to justify induction.

But Peirce contrasts his position with those favored by followers of Mill, and “almost all logicians” of his day, who “commonly teach that the inductive conclusion approximates to the truth because of the uniformity of nature” (2.775). Inductive inference, as Peirce conceives it (i.e., severe testing) does not use the uniformity of nature as a premise. Rather, the justification is sought in the manner of obtaining data. Justifying induction is a matter of showing that there exist methods with good error probabilities. For this it suffices that randomness be met only approximately, that inductive methods check their own assumptions, and that they can often detect and correct departures from randomness.

… It has been objected that the sampling cannot be random in this sense. But this is an idea which flies far away from the plain facts. Thirty throws of a die constitute an approximately random sample of all the throws of that die; and that the randomness should be approximate is all that is required. (1.94)

Continue reading

Categories: C.S. Peirce, Error Statistics, phil/history of stat | Leave a comment

(Part 2) Peircean Induction and the Error-Correcting Thesis

C. S. Peirce 9/10/1839 – 4/19/1914

C. S. Peirce
9/10/1839 – 4/19/1914

Continuation of “Peircean Induction and the Error-Correcting Thesis”

Deborah G. Mayo
Transactions of the Charles S. Peirce Society: A Quarterly Journal in American Philosophy, Volume 41, Number 2, 2005, pp. 299-319

Part 1 is here.

There are two other points of confusion in critical discussions of the SCT, that we may note here:

I. The SCT and the Requirements of Randomization and Predesignation

The concern with “the trustworthiness of the proceeding” for Peirce like the concern with error probabilities (e.g., significance levels) for error statisticians generally, is directly tied to their view that inductive method should closely link inferences to the methods of data collection as well as to how the hypothesis came to be formulated or chosen for testing.

This account of the rationale of induction is distinguished from others in that it has as its consequences two rules of inductive inference which are very frequently violated (1.95) namely, that the sample be (approximately) random and that the property being tested not be determined by the particular sample x— i.e., predesignation.

The picture of Peircean induction that one finds in critics of the SCT disregards these crucial requirements for induction: Neither enumerative induction nor H-D testing, as ordinarily conceived, requires such rules. Statistical significance testing, however, clearly does. Continue reading

Categories: Bayesian/frequentist, C.S. Peirce, Error Statistics, Statistics | Leave a comment

Peircean Induction and the Error-Correcting Thesis (Part I)

C. S. Peirce: 10 Sept, 1839-19 April, 1914

C. S. Peirce: 10 Sept, 1839-19 April, 1914

Yesterday was C.S. Peirce’s birthday. He’s one of my all time heroes. You should read him: he’s a treasure chest on essentially any topic. I only recently discovered a passage where Popper calls Peirce one of the greatest philosophical thinkers ever (I don’t have it handy). If Popper had taken a few more pages from Peirce, he would have seen how to solve many of the problems in his work on scientific inference, probability, and severe testing. I’ll blog the main sections of a (2005) paper of mine over the next few days. It’s written for a very general philosophical audience; the statistical parts are pretty informal. I first posted it in 2013Happy (slightly belated) Birthday Peirce.

Peircean Induction and the Error-Correcting Thesis
Deborah G. Mayo
Transactions of the Charles S. Peirce Society: A Quarterly Journal in American Philosophy, Volume 41, Number 2, 2005, pp. 299-319

Peirce’s philosophy of inductive inference in science is based on the idea that what permits us to make progress in science, what allows our knowledge to grow, is the fact that science uses methods that are self-correcting or error-correcting:

Induction is the experimental testing of a theory. The justification of it is that, although the conclusion at any stage of the investigation may be more or less erroneous, yet the further application of the same method must correct the error. (5.145)

Inductive methods—understood as methods of experimental testing—are justified to the extent that they are error-correcting methods. We may call this Peirce’s error-correcting or self-correcting thesis (SCT):

Self-Correcting Thesis SCT: methods for inductive inference in science are error correcting; the justification for inductive methods of experimental testing in science is that they are self-correcting. Continue reading

Categories: Bayesian/frequentist, C.S. Peirce, Error Statistics, Statistics | Leave a comment

Severity in a Likelihood Text by Charles Rohde

Mayo elbow


I received a copy of a statistical text recently that included a discussion of severity, and this is my first chance to look through it. It’s Introductory Statistical Inference with the Likelihood Function by Charles Rohde from Johns Hopkins. Here’s the blurb:



This textbook covers the fundamentals of statistical inference and statistical theory including Bayesian and frequentist approaches and methodology possible without excessive emphasis on the underlying mathematics. This book is about some of the basic principles of statistics that are necessary to understand and evaluate methods for analyzing complex data sets. The likelihood function is used for pure likelihood inference throughout the book. There is also coverage of severity and finite population sampling. The material was developed from an introductory statistical theory course taught by the author at the Johns Hopkins University’s Department of Biostatistics. Students and instructors in public health programs will benefit from the likelihood modeling approach that is used throughout the text. This will also appeal to epidemiologists and psychometricians. After a brief introduction, there are chapters on estimation, hypothesis testing, and maximum likelihood modeling. The book concludes with sections on Bayesian computation and inference. An appendix contains unique coverage of the interpretation of probability, and coverage of probability and mathematical concepts.

It’s welcome to see severity in a statistics text; an example from Mayo and Spanos (2006) is given in detail. The author even says some nice things about it: “Severe testing does a nice job of clarifying the issues which occur when a hypothesis is accepted (not rejected) by finding those values of the parameter (here mu) which are plausible (have high severity[i]) given acceptance. Similarly severe testing addresses the issue of a hypothesis which is rejected…. . ”  I don’t know Rohde, and the book isn’t error-statistical in spirit at all.[ii] In fact, inferences based on error probabilities are often called “illogical” because they take into account cherry-picking, multiple testing, optional stopping and other biasing selection effects that the likelihoodist considers irrelevant. I wish he had used severity to address some of the classic howlers he delineates regarding N-P statistics. To his credit, they are laid out with unusual clarity. For example a rejection of a point null µ= µ0 based on a result that just reaches the 1.96 cut-off for a one-sided test is claimed to license the inference to a point alternative µ= µ’ that is over 6 standard deviations greater than the null. (pp. 49-50). But it is not licensed. The probability of a larger difference than observed, were the data generated under such an alternative is ~1, so the severity associated with such an inference is ~ 0. SEV(µ <µ’) ~1. 

[i]Not to quibble, but I wouldn’t say parameter values are assigned severity, but rather that various hypotheses about mu pass with severity. The hypotheses are generally in the form of discrepancies, e..g,µ >µ’

[ii] He’s a likelihoodist from Johns Hopkins. Royall has had a strong influence there (Goodman comes to mind), and elsewhere, especially among philosophers. Bayesians also come back to likelihood ratio arguments, often. For discussions on likelihoodism and the law of likelihood see:

How likelihoodists exaggerate evidence from statistical tests

Breaking the Law of Likelihood ©

Breaking the Law of Likelihood, to keep their fit measures in line A, B

Why the Law of Likelihood is Bankrupt as an Account of Evidence

Royall, R. (2004), “The Likelihood Paradigm for Statistical Evidence” 119-138; Rejoinder 145-151, in M. Taper, and S. Lele (eds.) The Nature of Scientific Evidence: Statistical, Philosophical and Empirical Considerations. Chicago: University of Chicago Press.

Categories: Error Statistics, Severity | Leave a comment

A. Spanos: Egon Pearson’s Neglected Contributions to Statistics

egon pearson swim

11 August 1895 – 12 June 1980

Today is Egon Pearson’s birthday. I reblog a post by my colleague Aris Spanos from (8/18/12): “Egon Pearson’s Neglected Contributions to Statistics.”  Happy Birthday Egon Pearson!

    Egon Pearson (11 August 1895 – 12 June 1980), is widely known today for his contribution in recasting of Fisher’s significance testing into the Neyman-Pearson (1933) theory of hypothesis testing. Occasionally, he is also credited with contributions in promoting statistical methods in industry and in the history of modern statistics; see Bartlett (1981). What is rarely mentioned is Egon’s early pioneering work on:

(i) specification: the need to state explicitly the inductive premises of one’s inferences,

(ii) robustness: evaluating the ‘sensitivity’ of inferential procedures to departures from the Normality assumption, as well as

(iii) Mis-Specification (M-S) testing: probing for potential departures from the Normality  assumption.

Arguably, modern frequentist inference began with the development of various finite sample inference procedures, initially by William Gosset (1908) [of the Student’s t fame] and then Fisher (1915, 1921, 1922a-b). These inference procedures revolved around a particular statistical model, known today as the simple Normal model: Continue reading

Categories: phil/history of stat, Statistics, Testing Assumptions | Tags: , , , | Leave a comment

Neyman: Distinguishing tests of statistical hypotheses and tests of significance might have been a lapse of someone’s pen

Neyman April 16, 1894 – August 5, 1981

April 16, 1894 – August 5, 1981

Tests of Statistical Hypotheses and Their Use in Studies of Natural Phenomena” by Jerzy Neyman

ABSTRACT. Contrary to ideas suggested by the title of the conference at which the present paper was presented, the author is not aware of a conceptual difference between a “test of a statistical hypothesis” and a “test of significance” and uses these terms interchangeably. A study of any serious substantive problem involves a sequence of incidents at which one is forced to pause and consider what to do next. In an effort to reduce the frequency of misdirected activities one uses statistical tests. The procedure is illustrated on two examples: (i) Le Cam’s (and associates’) study of immunotherapy of cancer and (ii) a socio-economic experiment relating to low-income homeownership problems.

Neyman died on August 5, 1981. Here’s an unusual paper of his, “Tests of Statistical Hypotheses and Their Use in Studies of Natural Phenomena.” I have been reading a fair amount by Neyman this summer in writing about the origins of his philosophy, and have found further corroboration of the position that the behavioristic view attributed to him, while not entirely without substance*, is largely a fable that has been steadily built up and accepted as gospel. This has justified ignoring Neyman-Pearson statistics (as resting solely on long-run performance and irrelevant to scientific inference) and turning to crude variations of significance tests, that Fisher wouldn’t have countenanced for a moment (so-called NHSTs), lacking alternatives, incapable of learning from negative results, and permitting all sorts of P-value abuses–notably going from a small p-value to claiming evidence for a substantive research hypothesis. The upshot is to reject all of frequentist statistics, even though P-values are a teeny tiny part. *This represents a change in my perception of Neyman’s philosophy since EGEK (Mayo 1996).  I still say that that for our uses of method, it doesn’t matter what anybody thought, that “it’s the methods, stupid!” Anyway, I recommend, in this very short paper, the general comments and the example on home ownership. Here are two snippets: Continue reading

Categories: Error Statistics, Neyman, Statistics | Tags: | 19 Comments

Spot the power howler: α = ß?

Spot the fallacy!

  1. METABLOG QUERYThe power of a test is the probability of correctly rejecting the null hypothesis. Write it as 1 – β.
  2. So, the probability of incorrectly rejecting the null hypothesis is β.
  3. But the probability of incorrectly rejecting the null is α (the type 1 error probability).

So α = β.

I’ve actually seen this, and variants on it [i].

[1] Although they didn’t go so far as to reach the final, shocking, deduction.


Categories: Error Statistics, power, Statistics | 12 Comments

Higgs discovery three years on (Higgs analysis and statistical flukes)



2015: The Large Hadron Collider (LHC) is back in collision mode in 2015[0]. There’s a 2015 update, a virtual display, and links from ATLAS, one of two detectors at (LHC)) here. The remainder is from one year ago. (2014) I’m reblogging a few of the Higgs posts at the anniversary of the 2012 discovery. (The first was in this post.) The following, was originally “Higgs Analysis and Statistical Flukes: part 2″ (from March, 2013).[1]

Some people say to me: “This kind of reasoning is fine for a ‘sexy science’ like high energy physics (HEP)”–as if their statistical inferences are radically different. But I maintain that this is the mode by which data are used in “uncertain” reasoning across the entire landscape of science and day-to-day learning (at least, when we’re trying to find things out)[2] Even with high level theories, the particular problems of learning from data are tackled piecemeal, in local inferences that afford error control. Granted, this statistical philosophy differs importantly from those that view the task as assigning comparative (or absolute) degrees-of-support/belief/plausibility to propositions, models, or theories. 

“Higgs Analysis and Statistical Flukes: part 2”images

Everyone was excited when the Higgs boson results were reported on July 4, 2012 indicating evidence for a Higgs-like particle based on a “5 sigma observed effect”. The observed effect refers to the number of excess events of a given type that are “observed” in comparison to the number (or proportion) that would be expected from background alone, and not due to a Higgs particle. This continues my earlier post (part 1). It is an outsider’s angle on one small aspect of the statistical inferences involved. But that, apart from being fascinated by it, is precisely why I have chosen to discuss it: we [philosophers of statistics] should be able to employ a general philosophy of inference to get an understanding of what is true about the controversial concepts we purport to illuminate, e.g., significance levels. Continue reading

Categories: Higgs, highly probable vs highly probed, P-values, Severity | Leave a comment

What Would Replication Research Under an Error Statistical Philosophy Be?

f1ce127a4cfe95c4f645f0cc98f04fcaAround a year ago on this blog I wrote:

“There are some ironic twists in the way psychology is dealing with its replication crisis that may well threaten even the most sincere efforts to put the field on firmer scientific footing”

That’s philosopher’s talk for “I see a rich source of problems that cry out for ministrations of philosophers of science and of statistics”. Yesterday, I began my talk at the Society for Philosophy and Psychology workshop on “Replication in the Sciences”with examples of two main philosophical tasks: to clarify concepts, and reveal inconsistencies, tensions and ironies surrounding methodological “discomforts” in scientific practice.

Example of a conceptual clarification 

Editors of a journal, Basic and Applied Social Psychology, announced they are banning statistical hypothesis testing because it is “invalid” (A puzzle about the latest “test ban”)

It’s invalid because it does not supply “the probability of the null hypothesis, given the finding” (the posterior probability of H0) (2015 Trafimow and Marks)

  • Since the methodology of testing explicitly rejects the mode of inference they don’t supply, it would be incorrect to claim the methods were invalid.
  • Simple conceptual job that philosophers are good at

(I don’t know if the group of eminent statisticians assigned to react to the “test ban” will bring up this point. I don’t think it includes any philosophers.)



Example of revealing inconsistencies and tensions 

Critic: It’s too easy to satisfy standard significance thresholds

You: Why do replicationists find it so hard to achieve significance thresholds?

Critic: Obviously the initial studies were guilty of p-hacking, cherry-picking, significance seeking, QRPs

You: So, the replication researchers want methods that pick up on and block these biasing selection effects.

Critic: Actually the “reforms” recommend methods where selection effects and data dredging make no difference.


Whether this can be resolved or not is separate.

  • We are constantly hearing of how the “reward structure” leads to taking advantage of researcher flexibility
  • As philosophers, we can at least show how to hold their feet to the fire, and warn of the perils of accounts that bury the finagling

The philosopher is the curmudgeon (takes chutzpah!)

I also think it’s crucial for philosophers of science and statistics to show how to improve on and solve problems of methodology in scientific practice.

My slides are below; share comments.

Categories: Error Statistics, reproducibility, Statistics | 18 Comments

From our “Philosophy of Statistics” session: APS 2015 convention



“The Philosophy of Statistics: Bayesianism, Frequentism and the Nature of Inference,” at the 2015 American Psychological Society (APS) Annual Convention in NYC, May 23, 2015:


D. Mayo: “Error Statistical Control: Forfeit at your Peril” 


S. Senn: “‘Repligate’: reproducibility in statistical studies. What does it mean and in what sense does it matter?”


A. Gelman: “The statistical crisis in science” (this is not his exact presentation, but he focussed on some of these slides)


For more details see this post.

Categories: Bayesian/frequentist, Error Statistics, P-values, reforming the reformers, reproducibility, S. Senn, Statistics | 10 Comments

“Error statistical modeling and inference: Where methodology meets ontology” A. Spanos and D. Mayo



A new joint paper….

“Error statistical modeling and inference: Where methodology meets ontology”

Aris Spanos · Deborah G. Mayo

Abstract: In empirical modeling, an important desideratum for deeming theoretical entities and processes real is that they can be reproducible in a statistical sense. Current day crises regarding replicability in science intertwine with the question of how statistical methods link data to statistical and substantive theories and models. Different answers to this question have important methodological consequences for inference, which are intertwined with a contrast between the ontological commitments of the two types of models. The key to untangling them is the realization that behind every substantive model there is a statistical model that pertains exclusively to the probabilistic assumptions imposed on the data. It is not that the methodology determines whether to be a realist about entities and processes in a substantive field. It is rather that the substantive and statistical models refer to different entities and processes, and therefore call for different criteria of adequacy.

Keywords: Error statistics · Statistical vs. substantive models · Statistical ontology · Misspecification testing · Replicability of inference · Statistical adequacy

To read the full paper: “Error statistical modeling and inference: Where methodology meets ontology.”

The related conference.

Mayo & Spanos spotlight

Reference: Spanos, A. & Mayo, D. G. (2015). “Error statistical modeling and inference: Where methodology meets ontology.” Synthese (online May 13, 2015), pp. 1-23.

Categories: Error Statistics, misspecification testing, O & M conference, reproducibility, Severity, Spanos | 2 Comments

Spurious Correlations: Death by getting tangled in bedsheets and the consumption of cheese! (Aris Spanos)



These days, there are so many dubious assertions about alleged correlations between two variables that an entire website: Spurious Correlation (Tyler Vigen) is devoted to exposing (and creating*) them! A classic problem is that the means of variables X and Y may both be trending in the order data are observed, invalidating the assumption that their means are constant. In my initial study with Aris Spanos on misspecification testing, the X and Y means were trending in much the same way I imagine a lot of the examples on this site are––like the one on the number of people who die by becoming tangled in their bedsheets and the per capita consumption of cheese in the U.S.

The annual data for 2000-2009 are: xt: per capita consumption of cheese (U.S.) : x = (29.8, 30.1, 30.5, 30.6, 31.3, 31.7, 32.6, 33.1, 32.7, 32.8); yt: Number of people who died by becoming tangled in their bedsheets: y = (327, 456, 509, 497, 596, 573, 661, 741, 809, 717)

I asked Aris Spanos to have a look, and it took him no time to identify the main problem. He was good enough to write up a short note which I’ve pasted as slides.


Aris Spanos

Wilson E. Schmidt Professor of Economics
Department of Economics, Virginia Tech



*The site says that the server attempts to generate a new correlation every 60 seconds.

Categories: misspecification testing, Spanos, Statistics, Testing Assumptions | 14 Comments

Neyman: Distinguishing tests of statistical hypotheses and tests of significance might have been a lapse of someone’s pen


Neyman, drawn by ?

Tests of Statistical Hypotheses and Their Use in Studies of Natural Phenomena” by Jerzy Neyman

ABSTRACT. Contrary to ideas suggested by the title of the conference at which the present paper was presented, the author is not aware of a conceptual difference between a “test of a statistical hypothesis” and a “test of significance” and uses these terms interchangeably. A study of any serious substantive problem involves a sequence of incidents at which one is forced to pause and consider what to do next. In an effort to reduce the frequency of misdirected activities one uses statistical tests. The procedure is illustrated on two examples: (i) Le Cam’s (and associates’) study of immunotherapy of cancer and (ii) a socio-economic experiment relating to low-income homeownership problems.

I hadn’t posted this paper of Neyman’s before, so here’s something for your weekend reading:  “Tests of Statistical Hypotheses and Their Use in Studies of Natural Phenomena.”  I recommend, especially, the example on home ownership. Here are two snippets:


The title of the present session involves an element that appears mysterious to me. This element is the apparent distinction between tests of statistical hypotheses, on the one hand, and tests of significance, on the other. If this is not a lapse of someone’s pen, then I hope to learn the conceptual distinction. Continue reading

Categories: Error Statistics, Neyman, Statistics | Tags: | 18 Comments

Heads I win, tails you lose? Meehl and many Popperians get this wrong (about severe tests)!


bending of starlight.

[T]he impressive thing about the 1919 tests of Einstein ‘s theory of gravity] is the risk involved in a prediction of this kind. If observation shows that the predicted effect is definitely absent, then the theory is simply refuted. The theory is incompatible with certain possible results of observation—in fact with results which everybody before Einstein would have expected. This is quite different from the situation I have previously described, [where] was practically impossible to describe any human behavior that might not be claimed to be a verification of these [psychological] theories.” (Popper, CR, [p. 36))


Popper lauds Einstein’s General Theory of Relativity (GTR) as sticking its neck out, bravely being ready to admit its falsity were the deflection effect not found. The truth is that even if no deflection effect had been found in the 1919 experiments it would have been blamed on the sheer difficulty in discerning so small an effect (the results that were found were quite imprecise.) This would have been entirely correct! Yet many Popperians, perhaps Popper himself, get this wrong.[i] Listen to Popperian Paul Meehl (with whom I generally agree).

The stipulation beforehand that one will be pleased about substantive theory T when the numerical results come out as forecast, but will not necessarily abandon it when they do not, seems on the face of it to be about as blatant a violation of the Popperian commandment as you could commit. For the investigator, in a way, is doing…what astrologers and Marxists and psychoanalysts allegedly do, playing heads I win, tails you lose.” (Meehl 1978, 821)

No, there is a confusion of logic. A successful result may rightly be taken as evidence for a real effect H, even though failing to find the effect need not be taken to refute the effect, or even as evidence as against H. This makes perfect sense if one keeps in mind that a test might have had little chance to detect the effect, even if it existed. The point really reflects the asymmetry of falsification and corroboration. Popperian Alan Chalmers wrote an appendix to a chapter of his book, What is this Thing Called Science? (1999)(which at first had criticized severity for this) once I made my case. [i] Continue reading

Categories: fallacy of non-significance, philosophy of science, Popper, Severity, Statistics | Tags: | 2 Comments

All I want for Chrismukkah is that critics & “reformers” quit howlers of testing (after 3 yrs of blogging)! So here’s Aris Spanos “Tallking Back!”

spanos 2014



This was initially posted as slides from our joint Spring 2014 seminar: “Talking Back to the Critics Using Error Statistics”. (You can enlarge them.) Related reading is Mayo and Spanos (2011)


Categories: Error Statistics, fallacy of rejection, Phil6334, reforming the reformers, Statistics | 27 Comments

“Probing with Severity: Beyond Bayesian Probabilism and Frequentist Performance” (Dec 3 Seminar slides)

(May 4) 7 Deborah Mayo  “Ontology & Methodology in Statistical Modeling”Below are the slides from my Rutgers seminar for the Department of Statistics and Biostatistics yesterday, since some people have been asking me for them. The abstract is here. I don’t know how explanatory a bare outline like this can be, but I’d be glad to try and answer questions[i]. I am impressed at how interested in foundational matters I found the statisticians (both faculty and students) to be. (There were even a few philosophers in attendance.) It was especially interesting to explore, prior to the seminar, possible connections between severity assessments and confidence distributions, where the latter are along the lines of Min-ge Xie (some recent papers of his may be found here.)

“Probing with Severity: Beyond Bayesian Probabilism and Frequentist Performance”

[i]They had requested a general overview of some issues in philosophical foundations of statistics. Much of this will be familiar to readers of this blog.



Categories: Bayesian/frequentist, Error Statistics, Statistics | 11 Comments

The Amazing Randi’s Million Dollar Challenge

09randi3-master675-v2-1The NY Times Magazine had a feature on the Amazing Randi yesterday, “The Unbelievable Skepticism of the Amazing Randi.” It described one of the contestants in Randi’s most recent Million Dollar Challenge, Fei Wang:

“[Wang] claimed to have a peculiar talent: from his right hand, he could transmit a mysterious force a distance of three feet, unhindered by wood, metal, plastic or cardboard. The energy, he said, could be felt by others as heat, pressure, magnetism or simply “an indescribable change.” Tonight, if he could demonstrate the existence of his ability under scientific test conditions, he stood to win $1 million.”

Isn’t “an indescribable change” rather vague?

…..The Challenge organizers had spent weeks negotiating with Wang and fine-tuning the protocol for the evening’s test. A succession of nine blindfolded subjects would come onstage and place their hands in a cardboard box. From behind a curtain, Wang would transmit his energy into the box. If the subjects could successfully detect Wang’s energy on eight out of nine occasions, the trial would confirm Wang’s psychic power. …”

After two women failed to detect the “mystic force” the M.C. announced the contest was over.

“With two failures in a row, it was impossible for Wang to succeed. The Million Dollar Challenge was already over.”

You think they might have given him another chance or something.

“Stepping out from behind the curtain, Wang stood center stage, wearing an expression of numb shock, like a toddler who has just dropped his ice cream in the sand. He was at a loss to explain what had gone wrong; his tests with a paranormal society in Boston had all succeeded. Nothing could convince him that he didn’t possess supernatural powers. ‘This energy is mysterious,’ he told the audience. ‘It is not God.’ He said he would be back in a year, to try again.”

The article is here. If you don’t know who A. Randi is, you should read it.

Randi, much better known during Uri Geller spoon-bending days, has long been the guru to skeptics and fraudbusters, but also a hero to some critical psi believers like I.J. Good. Geller continually sued Randi for calling him a fraud. As such, I.J. Good warned me that I might be taking a risk in my use of “gellerization” in EGEK (1996), but I guess Geller doesn’t read philosophy of science. A post on “Statistics and ESP Research” and Diaconis is here.


I’d love to have seen Randi break out of these chains!


Categories: Error Statistics | Tags: | 3 Comments

Philosophy of Science Assoc. (PSA) symposium on Philosophy of Statistics in the Higgs Experiments “How Many Sigmas to Discovery?”



The biennial meeting of the Philosophy of Science Association (PSA) starts this week (Nov. 6-9) in Chicago, together with the History of Science Society. I’ll be part of the symposium:


How Many Sigmas to Discovery?
Philosophy and Statistics in the Higgs Experiments


on Nov.8 with Robert Cousins, Allan Franklin, and Kent Staley. If you’re in the neighborhood stop by.



“A 5 sigma effect!” is how the recent Higgs boson discovery was reported. Yet before the dust had settled, the very nature and rationale of the 5 sigma (or 5 standard deviation) discovery criteria began to be challenged and debated both among scientists and in the popular press. Why 5 sigma? How is it to be interpreted? Do p-values in high-energy physics (HEP) avoid controversial uses and misuses of p-values in social and other sciences? The goal of our symposium is to combine the insights of philosophers and scientists whose work interrelates philosophy of statistics, data analysis and modeling in experimental physics, with critical perspectives on how discoveries proceed in practice. Our contributions will link questions about the nature of statistical evidence, inference, and discovery with questions about the very creation of standards for interpreting and communicating statistical experiments. We will bring out some unique aspects of discovery in modern HEP. We also show the illumination the episode offers to some of the thorniest issues revolving around statistical inference, frequentist and Bayesian methods, and the philosophical, technical, social, and historical dimensions of scientific discovery.


1) How do philosophical problems of statistical inference interrelate with debates about inference and modeling in high energy physics (HEP)?

2) Have standards for scientific discovery in particle physics shifted? And if so, how has this influenced when a new phenomenon is “found”?

3) Can understanding the roles of statistical hypotheses tests in HEP resolve classic problems about their justification in both physical and social sciences?

4) How do pragmatic, epistemic and non-epistemic values and risks influence the collection, modeling, and interpretation of data in HEP?


Abstracts for Individual Presentations

robert cousins(1) Unresolved Philosophical Issues Regarding Hypothesis Testing in High Energy Physics
Robert D. Cousins.
Professor, Department of Physics and Astronomy, University of California, Los Angeles (UCLA)

The discovery and characterization of a Higgs boson in 2012-2013 provide multiple examples of statistical inference as practiced in high energy physics (elementary particle physics).  The main methods employed have a decidedly frequentist flavor, drawing in a pragmatic way on both Fisher’s ideas and the Neyman-Pearson approach.  A physics model being tested typically has a “law of nature” at its core, with parameters of interest representing masses, interaction strengths, and other presumed “constants of nature”.  Additional “nuisance parameters” are needed to characterize the complicated measurement processes.  The construction of confidence intervals for a parameter of interest q is dual to hypothesis testing, in that the test of the null hypothesis q=q0 at significance level (“size”) a is equivalent to whether q0 is contained in a confidence interval for q with confidence level (CL) equal to 1-a.  With CL or a specified in advance (“pre-data”), frequentist coverage properties can be assured, at least approximately, although nuisance parameters bring in significant complications.  With data in hand, the post-data p-value can be defined as the smallest significance level a at which the null hypothesis would be rejected, had that a been specified in advance.  Carefully calculated p-values (not assuming normality) are mapped onto the equivalent number of standard deviations (“s”) in a one-tailed test of the mean of a normal distribution. For a discovery such as the Higgs boson, experimenters report both p-values and confidence intervals of interest. Continue reading

Categories: Error Statistics, Higgs, P-values | Tags: | 18 Comments

Blog at The Adventure Journal Theme.


Get every new post delivered to your Inbox.

Join 1,070 other followers