Sunday, September 10, was C.S. Peirce’s birthday. He’s one of my heroes. He’s a treasure chest on essentially any topic, and anticipated quite a lot in statistics and logic. (As Stephen Stigler (2016) notes, he’s to be credited with articulating and applying randomization.) I always find something that feels astoundingly new, even rereading him. He’s been a great resource as I complete my book, Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (CUP, 2018). I’m reblogging the main sections of a (2005) paper of mine. It’s written for a very general philosophical audience; the statistical parts are very informal. I first posted it in 2013. Happy (belated) Birthday Peirce.
Peircean Induction and the Error-Correcting Thesis
Deborah G. Mayo
Transactions of the Charles S. Peirce Society: A Quarterly Journal in American Philosophy, Volume 41, Number 2, 2005, pp. 299-319
Peirce’s philosophy of inductive inference in science is based on the idea that what permits us to make progress in science, what allows our knowledge to grow, is the fact that science uses methods that are self-correcting or error-correcting:
Induction is the experimental testing of a theory. The justification of it is that, although the conclusion at any stage of the investigation may be more or less erroneous, yet the further application of the same method must correct the error. (5.145)
Inductive methods—understood as methods of experimental testing—are justified to the extent that they are error-correcting methods. We may call this Peirce’s error-correcting or self-correcting thesis (SCT):
Self-Correcting Thesis SCT: methods for inductive inference in science are error correcting; the justification for inductive methods of experimental testing in science is that they are self-correcting.
Peirce’s SCT has been a source of fascination and frustration. By and large, critics and followers alike have denied that Peirce can sustain his SCT as a way to justify scientific induction: “No part of Peirce’s philosophy of science has been more severely criticized, even by his most sympathetic commentators, than this attempted validation of inductive methodology on the basis of its purported self-correctiveness” (Rescher 1978, p. 20).
In this paper I shall revisit the Peircean SCT: properly interpreted, I will argue, Peirce’s SCT not only serves its intended purpose, it also provides the basis for justifying (frequentist) statistical methods in science. While on the one hand, contemporary statistical methods increase the mathematical rigor and generality of Peirce’s SCT, on the other, Peirce provides something current statistical methodology lacks: an account of inductive inference and a philosophy of experiment that links the justification for statistical tests to a more general rationale for scientific induction. Combining the mathematical contributions of modern statistics with the inductive philosophy of Peirce sets the stage for developing an adequate justification for contemporary inductive statistical methodology.
2. Probabilities are assigned to procedures not hypotheses
Peirce’s philosophy of experimental testing shares a number of key features with the contemporary (Neyman and Pearson) Statistical Theory: statistical methods provide, not means for assigning degrees of probability, evidential support, or confirmation to hypotheses, but procedures for testing (and estimation), whose rationale is their predesignated high frequencies of leading to correct results in some hypothetical long-run. A Neyman and Pearson (NP) statistical test, for example, instructs us “To decide whether a hypothesis, H, of a given type be rejected or not, calculate a specified character, x0, of the observed facts; if x> x0 reject H; if x< x0 accept H.” Although the outputs of N-P tests do not assign hypotheses degrees of probability, “it may often be proved that if we behave according to such a rule … we shall reject H when it is true not more, say, than once in a hundred times, and in addition we may have evidence that we shall reject H sufficiently often when it is false” (Neyman and Pearson, 1933, p.142).[i]
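The long-run reading of such a rule can be illustrated with a small simulation. This is only a sketch under assumptions that are not in the paper: the data are taken to be Normal with known standard deviation 1, H is "the mean is 0", the "specified character" x is the standardized sample mean, and the sample size, cutoff level, and alternative mean of 0.8 are all hypothetical choices.

```python
import random
from statistics import NormalDist, fmean


def np_test_rejects(sample, cutoff):
    """Apply the rule 'reject H if x > x0'.

    Here x is the standardized sample mean, assuming (hypothetically)
    Normal data with known standard deviation 1 and H: mean = 0.
    """
    n = len(sample)
    x = fmean(sample) * n ** 0.5
    return x > cutoff


def rejection_rate(true_mean, n=25, alpha=0.01, trials=10000, seed=0):
    """Relative frequency of rejections over many repetitions of the test."""
    rng = random.Random(seed)
    cutoff = NormalDist().inv_cdf(1 - alpha)  # x0 chosen so that P(reject; H) = alpha
    rejections = 0
    for _ in range(trials):
        sample = [rng.gauss(true_mean, 1.0) for _ in range(n)]
        if np_test_rejects(sample, cutoff):
            rejections += 1
    return rejections / trials


# H true: erroneous rejections should occur about once in a hundred times.
type_i_rate = rejection_rate(true_mean=0.0)
# H false (hypothetical true mean 0.8): rejections should be frequent.
power = rejection_rate(true_mean=0.8)

print(f"rejection rate when H is true:  {type_i_rate:.3f}")
print(f"rejection rate when H is false: {power:.3f}")
```

The two printed frequencies correspond to the two properties named in the quotation: rejecting H rarely when it is true, and rejecting it sufficiently often when it is false. Neither number is a probability assigned to H itself.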
The relative frequencies of erroneous rejections and erroneous acceptances in an actual or hypothetical long run sequence of applications of tests are error probabilities; we may call the statistical tools based on error probabilities, error statistical tools. In describing his theory of inference, Peirce could be describing that of the error-statistician:
The theory here proposed does not assign any probability to the inductive or hypothetic conclusion, in the sense of undertaking to say how frequently that conclusion would be found true. It does not propose to look through all the possible universes, and say in what proportion of them a certain uniformity occurs; such a proceeding, were it possible, would be quite idle. The theory here presented only says how frequently, in this universe, the special form of induction or hypothesis would lead us right. The probability given by this theory is in every way different—in meaning, numerical value, and form—from that of those who would apply to ampliative inference the doctrine of inverse chances. (2.748)
The doctrine of “inverse chances” alludes to assigning (posterior) probabilities to hypotheses by applying the definition of conditional probability (Bayes’s theorem), a computation that requires starting out with a (prior or “antecedent”) probability assignment to an exhaustive set of hypotheses:
If these antecedent probabilities were solid statistical facts, like those upon which the insurance business rests, the ordinary precepts and practice [of inverse probability] would be sound. But they are not and cannot be statistical facts. What is the antecedent probability that matter should be composed of atoms? Can we take statistics of a multitude of different universes? (2.777)
For Peircean induction, as in the N-P testing model, the conclusion or inference concerns a hypothesis that either is or is not true in this one universe; thus, assigning a frequentist probability to a particular conclusion, other than the trivial ones of 1 or 0, for Peirce, makes sense only “if universes were as plentiful as blackberries” (2.684). Thus the Bayesian inverse probability calculation seems forced to rely on subjective probabilities for computing inverse inferences, but “subjective probabilities” Peirce charges “express nothing but the conformity of a new suggestion to our prepossessions, and these are the source of most of the errors into which man falls, and of all the worse of them” (2.777).
Hearing Peirce contrast his view of induction with the more popular Bayesian account of his day (the Conceptualists), one could be listening to an error statistician arguing against the contemporary Bayesian (subjective or other), with one important difference. Today’s error statistician seems to grant too readily that the only justification for N-P test rules is their ability to ensure we will rarely take erroneous actions with respect to hypotheses in the long run of applications. This so-called inductive behavior rationale seems to supply no adequate answer to the question of what is learned in any particular application about the process underlying the data. Peirce, by contrast, was very clear that what is really wanted in inductive inference in science is the ability to control error probabilities of test procedures, i.e., “the trustworthiness of the proceeding”. Moreover, it is only by a faulty analogy with deductive inference, Peirce explains, that many suppose that inductive (synthetic) inference should supply a probability to the conclusion: “… in the case of analytic inference we know the probability of our conclusion (if the premises are true), but in the case of synthetic inferences we only know the degree of trustworthiness of our proceeding” (“The Probability of Induction” 2.693).
Knowing the “trustworthiness of our inductive proceeding”, I will argue, enables determining the test’s probative capacity, how reliably it detects errors, and the severity of the test a hypothesis withstands. Deliberately making use of known flaws and fallacies in reasoning with limited and uncertain data, tests may be constructed that are highly trustworthy probes in detecting and discriminating errors in particular cases. This, in turn, enables inferring which inferences about the process giving rise to the data are and are not warranted: an inductive inference to hypothesis H is warranted to the extent that with high probability the test would have detected a specific flaw or departure from what H asserts, and yet it did not.
3. So why is justifying Peirce’s SCT thought to be so problematic?
You can read Section 3 here (it’s not necessary for understanding the rest).
4. Peircean induction as severe testing
… [I]nduction, for Peirce, is a matter of subjecting hypotheses to “the test of experiment” (7.182).
The process of testing it will consist, not in examining the facts, in order to see how well they accord with the hypothesis, but on the contrary in examining such of the probable consequences of the hypothesis … which would be very unlikely or surprising in case the hypothesis were not true. (7.231)
When, however, we find that prediction after prediction, notwithstanding a preference for putting the most unlikely ones to the test, is verified by experiment,…we begin to accord to the hypothesis a standing among scientific results.
This sort of inference it is, from experiments testing predictions based on a hypothesis, that is alone properly entitled to be called induction. (7.206)
While these and other passages are redolent of Popper, Peirce differs from Popper in crucial ways. Peirce, unlike Popper, is primarily interested not in falsifying claims but in the positive pieces of information provided by tests, with “the corrections called for by the experiment” and with the hypotheses, modified or not, that manage to pass severe tests. For Popper, even if a hypothesis is highly corroborated (by his lights), he regards this as at most a report of the hypothesis’ past performance and denies it affords positive evidence for its correctness or reliability. Further, Popper denies that he could vouch for the reliability of the method he recommends as “most rational”—conjecture and refutation. Indeed, Popper’s requirements for a highly corroborated hypothesis are not sufficient for ensuring severity in Peirce’s sense (Mayo 1996, 2003, 2005). Where Popper recoils from even speaking of warranted inductions, Peirce conceives of a proper inductive inference as what had passed a severe test—one which would, with high probability, have detected an error if present.
In Peirce’s inductive philosophy, we have evidence for inductively inferring a claim or hypothesis H when not only does H “accord with” the data x; but also, so good an accordance would very probably not have resulted, were H not true. In other words, we may inductively infer H when it has withstood a test of experiment that it would not have withstood, or withstood so well, were H not true (or were a specific flaw present). This can be encapsulated in the following severity requirement for an experimental test procedure, ET, and data set x.
Hypothesis H passes a severe test with x iff (firstly) x accords with H and (secondly) the experimental test procedure ET would, with very high probability, have signaled the presence of an error were there a discordancy between what H asserts and what is correct (i.e., were H false).
The test would “have signaled an error” by having produced results less accordant with H than what the test yielded. Thus, we may inductively infer H when (and only when) H has withstood a test with high error detecting capacity; the higher this probative capacity, the more severely H has passed. What is assessed (quantitatively or qualitatively) is not the amount of support for H but the probative capacity of the test of experiment ET (with regard to those errors that an inference to H is declaring to be absent) …
You can read the rest of Section 4 here.
5. The path from qualitative to quantitative induction
In my understanding of Peircean induction, the difference between qualitative and quantitative induction is really a matter of degree, according to whether their trustworthiness or severity is quantitatively or only qualitatively ascertainable. This reading not only neatly organizes Peirce’s typologies of the various types of induction, it underwrites the manner in which, within a given classification, Peirce further subdivides inductions by their “strength”.
(I) First-Order, Rudimentary or Crude Induction
Consider Peirce’s First Order of induction: the lowest, most rudimentary form, which he dubs the “pooh-pooh argument”. It is essentially an argument from ignorance: Lacking evidence for the falsity of some hypothesis or claim H, provisionally adopt H. In this very weakest sort of induction, crude induction, the most that can be said is that a hypothesis would eventually be falsified if false. (It may correct itself, but with a bang!) It “is as weak an inference as any that I would not positively condemn” (8.237). While uneliminable in ordinary life, Peirce denies that rudimentary induction is to be included as scientific induction. Without some reason to think evidence of H’s falsity would probably have been detected, were H false, finding no evidence against H is poor inductive evidence for H. H has passed only a highly unreliable error probe.
(II) Second Order (Qualitative) Induction
It is only with what Peirce calls “the Second Order” of induction that we arrive at a genuine test, and thereby scientific induction. Within second order inductions, a stronger and a weaker type exist, corresponding neatly to viewing strength as the severity of a testing procedure.
The weaker of these is where the predictions that are fulfilled are merely of the continuance in future experience of the same phenomena which originally suggested and recommended the hypothesis… (7.116)
The other variety of the argument … is where [results] lead to new predictions being based upon the hypothesis of an entirely different kind from those originally contemplated and these new predictions are equally found to be verified. (7.117)
The weaker type occurs where the predictions, though fulfilled, lack novelty; whereas, the stronger type reflects a more stringent hurdle having been satisfied: the hypothesis has had “novel” predictive success, and thereby higher severity. (For a discussion of the relationship between types of novelty and severity see Mayo 1991, 1996). Note that within a second order induction the assessment of strength is qualitative, e.g., very strong, weak, very weak.
The strength of any argument of the Second Order depends upon how much the confirmation of the prediction runs counter to what our expectation would have been without the hypothesis. It is entirely a question of how much; and yet there is no measurable quantity. For when such measure is possible the argument … becomes an induction of the Third Order [statistical induction]. (7.115)
It is upon these and like passages that I base my reading of Peirce. A qualitative induction, i.e., a test whose severity is qualitatively determined, becomes a quantitative induction when the severity is quantitatively determined; when an objective error probability can be given.
(III) Third Order, Statistical (Quantitative) Induction
We enter the Third Order of statistical or quantitative induction when it is possible to quantify “how much” the prediction runs counter to what our expectation would have been without the hypothesis. In his discussions of such quantifications, Peirce anticipates to a striking degree later developments of statistical testing and confidence interval estimation (Hacking 1980, Mayo 1993, 1996). Since this is not the place to describe his statistical contributions, I move to more modern methods to make the qualitative-quantitative contrast.
6. Quantitative and qualitative induction: significance test reasoning
A statistical significance test illustrates an inductive inference justified by a quantitative severity assessment. The significance test procedure has the following components: (1) a null hypothesis H0, which is an assertion about the distribution of the sample X = (X1, …, Xn), a set of random variables, and (2) a function of the sample, d(X), the test statistic, which reflects the difference between the data x = (x1, …, xn) and the null hypothesis H0. The observed value of d(X) is written d(x). The larger the value of d(x), the further the outcome is from what is expected under H0, with respect to the particular question being asked. We can imagine that null hypothesis H0 is
H0: there are no increased cancer risks associated with hormone replacement therapy (HRT) in women who have taken them for 10 years.
Let d(x) measure the increased risk of cancer in n women, half of whom were randomly assigned to HRT. H0 asserts, in effect, that it is an error to take as genuine any positive value of d(x): any observed difference is claimed to be “due to chance”. The test computes (3) the p-value, which is the probability of a difference larger than d(x), under the assumption that H0 is true:
p-value = Prob(d(X) > d(x); H0).
If this probability is very small, the data are taken as evidence that
H*: cancer risks are higher in women treated with HRT
The reasoning is a statistical version of modus tollens.
If the hypothesis H0 is correct then, with high probability, 1- p, the data would not be statistically significant at level p.
x is statistically significant at level p.
Therefore, x is evidence of a discrepancy from H0, in the direction of the alternative hypothesis H*.
(i.e., H* severely passes, where the severity is 1 minus the p-value)[iii]
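As a rough illustration of the p-value computation above, the following sketch uses made-up trial counts and assumes, purely for illustration, that d(x) is the usual pooled two-proportion z statistic and that d(X) is approximately standard Normal under H0. Neither the counts nor this particular choice of d(x) comes from the paper.

```python
from math import sqrt
from statistics import NormalDist


def standardized_difference(cases_treated, n_treated, cases_control, n_control):
    """d(x): observed difference in cancer rates, in standard-error units.

    Uses the pooled two-proportion z statistic (an illustrative choice).
    """
    rate_treated = cases_treated / n_treated
    rate_control = cases_control / n_control
    pooled = (cases_treated + cases_control) / (n_treated + n_control)
    standard_error = sqrt(pooled * (1 - pooled) * (1 / n_treated + 1 / n_control))
    return (rate_treated - rate_control) / standard_error


def p_value(d_obs):
    """Prob(d(X) > d(x); H0), with d(X) approximately standard Normal under H0."""
    return 1 - NormalDist().cdf(d_obs)


# Hypothetical data: 60 cancers among 1000 women on HRT, 35 among 1000 controls.
d_obs = standardized_difference(60, 1000, 35, 1000)
p = p_value(d_obs)

print(f"d(x) = {d_obs:.2f}")
print(f"p-value = {p:.4f}")
print(f"1 - p-value = {1 - p:.4f}")
```

With these invented counts the p-value is well below 0.01, so on the reasoning above the data would be taken as evidence of some discrepancy from H0 in the direction of H*, with 1 minus the p-value read as the severity with which H* passes.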
If a particular conclusion is wrong, subsequent severe (or highly powerful) tests will with high probability detect it. In particular, if we are wrong to reject H0 (and H0 is actually true), we would find we were rarely able to get so statistically significant a result to recur, and in this way we would discover our original error.
It is true that the observed conformity of the facts to the requirements of the hypothesis may have been fortuitous. But if so, we have only to persist in this same method of research and we shall gradually be brought around to the truth. (7.115)
The correction is not a matter of getting higher and higher probabilities, it is a matter of finding out whether the agreement is fortuitous; whether it is generated about as often as would be expected were the agreement of the chance variety.
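One way to put numbers on "how often would such agreement recur by chance" is the following sketch. It assumes, as a simplification not made in the paper, that follow-up tests are independent and that each has the same probability alpha of yielding a significant result when H0 is true; the values of alpha and the number of replications are hypothetical.

```python
from math import comb


def prob_at_least(m, k, alpha):
    """Probability that at least m of k independent tests are significant at
    level alpha when H0 is true (upper tail of a Binomial(k, alpha))."""
    return sum(
        comb(k, j) * alpha ** j * (1 - alpha) ** (k - j)
        for j in range(m, k + 1)
    )


alpha = 0.05  # hypothetical significance level
k = 5         # hypothetical number of follow-up tests

for m in range(1, k + 1):
    print(f"P(at least {m} of {k} significant; H0) = {prob_at_least(m, k, alpha):.6f}")
```

Under these assumptions a single chance "success" among five attempts is not very surprising, but three or more would occur only about once in a thousand such series. A pattern of recurring significant results is therefore hard to attribute to fortuitous agreement, while a failure to recur exposes the original error.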
Stigler discusses some of the experiments Peirce performed. In one, with Joseph Jastrow, the goal was to test whether there’s a threshold below which you can’t discern the difference in weights between two objects. Psychologists had hypothesized that there was a minimal threshold “such that if the difference was below the threshold, termed the just noticeable difference (jnd), the two stimuli were indistinguishable….[Peirce and Jastrow] showed this speculation was false” (Stigler 2016, 160). No matter how close in weight the objects were, the probability of a correct discernment of difference differed from ½. A good example of evidence for a “no-effect” null by falsifying the alternative statistically.
I’m now, truly, within days of completing a very short, but deep, conclusion. 9/13/17
Hacking, I. 1980 “The Theory of Probable Inference: Neyman, Peirce and Braithwaite”, pp. 141-160 in D. H. Mellor (ed.), Science, Belief and Behavior: Essays in Honour of R.B. Braithwaite. Cambridge: Cambridge University Press.
Laudan, L. 1981 Science and Hypothesis: Historical Essays on Scientific Methodology. Dordrecht: D. Reidel.
Levi, I. 1980 “Induction as Self Correcting According to Peirce”, pp. 127-140 in D. H. Mellor (ed.), Science, Belief and Behavior: Essays in Honour of R.B. Braithwaite. Cambridge: Cambridge University Press.
Mayo, D. 1991 “Novel Evidence and Severe Tests”, Philosophy of Science, 58: 523-552.
Mayo, D. 1993 “The Test of Experiment: C. S. Peirce and E. S. Pearson”, pp. 161-174 in E. C. Moore (ed.), Charles S. Peirce and the Philosophy of Science. Tuscaloosa: University of Alabama Press.
Mayo, D. 1996 Error and the Growth of Experimental Knowledge. Chicago: The University of Chicago Press.
Mayo, D. 2003 “Severe Testing as a Guide for Inductive Learning”, pp. 89-117 in H. Kyburg (ed.), Probability Is the Very Guide of Life. Chicago: Open Court Press.
Mayo, D. 2005 “Evidence as Passing Severe Tests: Highly Probed vs. Highly Proved”, in P. Achinstein (ed.), Scientific Evidence. Johns Hopkins University Press.
Mayo, D. and Kruse, M. 2001 “Principles of Inference and Their Consequences,” pp. 381-403 in Foundations of Bayesianism, D. Corfield and J. Williamson (eds.), Dordrecht: Kluwer Academic Publishers.
Mayo, D. and Spanos, A. 2004 “Methodology in Practice: Statistical Misspecification Testing” Philosophy of Science, Vol. II, PSA 2002, pp. 1007-1025.
Mayo, D. and Spanos, A. 2006 “Severe Testing as a Basic Concept in a Neyman-Pearson Philosophy of Induction”, The British Journal for the Philosophy of Science 57: 323-357.
Mayo, D. and Cox, D.R. 2006 “The Theory of Statistics as the ‘Frequentist’s’ Theory of Inductive Inference”, Institute of Mathematical Statistics (IMS) Lecture Notes-Monograph Series, Contributions to the Second Lehmann Symposium, 2005.
Neyman, J. and Pearson, E.S. 1933 “On the Problem of the Most Efficient Tests of Statistical Hypotheses”, in Philosophical Transactions of the Royal Society, A: 231, 289-337, as reprinted in J. Neyman and E.S. Pearson (1967), pp. 140-185.
Neyman, J. and Pearson, E.S. 1967 Joint Statistical Papers, Berkeley: University of California Press.
Niiniluoto, I. 1984 Is Science Progressive? Dordrecht: D. Reidel.
Peirce, C. S. Collected Papers: Vols. I-VI, C. Hartshorne and P. Weiss (eds.) (1931-1935). Vols. VII-VIII, A. Burks (ed.) (1958), Cambridge: Harvard University Press.
Popper, K. 1962 Conjectures and Refutations: the Growth of Scientific Knowledge, Basic Books, New York.
Rescher, N. 1978 Peirce’s Philosophy of Science: Critical Studies in His Theory of Induction and Scientific Method, Notre Dame: University of Notre Dame Press.
Stigler, S. 2016 The Seven Pillars of Statistical Wisdom, Harvard.
[ii] This statement of (b) is regarded by Laudan as the strong thesis of self-correcting. A weaker thesis would replace (b) with (b’): science has techniques for determining unambiguously whether an alternative T’ is closer to the truth than a refuted T.
[iii] If the p-value were not very small, then the difference would be considered statistically insignificant (generally small values are 0.1 or less). We would then regard H0 as consistent with data x, but we may wish to go further and determine the size of an increased risk r that has thereby been ruled out with severity. We do so by finding a risk increase r such that Prob(d(X) > d(x); risk increase r) is high. Then the assertion that the risk increase < r passes with high severity, we would argue.
If there were a discrepancy from hypothesis H0 of r (or more), then, with high probability, 1-p, the data would be statistically significant at level p.
x is not statistically significant at level p.
Therefore, x is evidence that any discrepancy from H0 is less than r.
For a general treatment of severity, see Mayo and Spanos (2006).
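The calculation described in this note can be sketched as follows. The sketch assumes, for illustration only, that d(X) is Normal with unit variance and with mean equal to the discrepancy r measured in standard-error units; the observed value d(x) = 0.5 and the 0.95 target are hypothetical.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def severity_of_upper_bound(d_obs, r):
    """Prob(d(X) > d(x); discrepancy r), assuming d(X) ~ Normal(r, 1).

    A high value means that a discrepancy as large as r would very probably
    have produced a larger difference than the one observed.
    """
    return 1 - STD_NORMAL.cdf(d_obs - r)


def smallest_bound_with_severity(d_obs, target=0.95):
    """Smallest r for which 'discrepancy < r' passes with severity `target`."""
    return d_obs + STD_NORMAL.inv_cdf(target)


d_obs = 0.5  # hypothetical statistically insignificant result
r_bound = smallest_bound_with_severity(d_obs, target=0.95)

print(f"severity for 'discrepancy < 0.5': {severity_of_upper_bound(d_obs, 0.5):.2f}")
print(f"severity for 'discrepancy < 1.5': {severity_of_upper_bound(d_obs, 1.5):.2f}")
print(f"bound passing with severity 0.95: r = {r_bound:.2f}")
```

The point of the sketch is that a non-significant result does not warrant "no discrepancy at all"; it warrants only those upper bounds r for which the probability above is high.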
[Ed. Note: A not bad biographical sketch can be found on Wikipedia.]