Continuation of “Peircean Induction and the Error-Correcting Thesis”
Deborah G. Mayo
Transactions of the Charles S. Peirce Society: A Quarterly Journal in American Philosophy, Volume 41, Number 2, 2005, pp. 299-319
Part 1 is here.
There are two other points of confusion in critical discussions of the SCT that we may note here:
I. The SCT and the Requirements of Randomization and Predesignation
The concern with “the trustworthiness of the proceeding” for Peirce, like the concern with error probabilities (e.g., significance levels) for error statisticians generally, is directly tied to their view that inductive method should closely link inferences to the methods of data collection as well as to how the hypothesis came to be formulated or chosen for testing.
This account of the rationale of induction is distinguished from others in that it has as its consequences two rules of inductive inference which are very frequently violated (1.95), namely, that the sample be (approximately) random and that the property being tested not be determined by the particular sample x, i.e., predesignation.
The picture of Peircean induction that one finds in critics of the SCT disregards these crucial requirements for induction: Neither enumerative induction nor H-D testing, as ordinarily conceived, requires such rules. Statistical significance testing, however, clearly does.
Suppose, for example, that researchers wishing to demonstrate the benefits of HRT search the data for factors on which treated women fare much better than untreated and, finding one such factor, proceed to test the null hypothesis:
H0: there is no improvement in factor F (e.g. memory) among women treated with HRT.
Having selected this factor for testing solely because it is a factor on which treated women show impressive improvement, it is not surprising that this null hypothesis is rejected and the results taken to show a genuine improvement in the population. However, it is well known that when the null hypothesis is tested on the same data that led it to be chosen for testing, a spurious impression of a genuine effect easily results. Suppose, for example, that 20 factors are examined for impressive-looking improvements among HRT-treated women, and the one difference that appears large enough to test turns out to be significant at the 0.05 level. The actual significance level (the actual probability of reporting a statistically significant effect when in fact the null hypothesis is true) is not 5% but approximately 64% (Mayo 1996, Mayo and Kruse 2001, Mayo and Cox 2006). To infer the denial of H0, and infer there is evidence that HRT improves memory, is to make an inference with low severity (approximately 0.36).
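The 64% figure can be checked with a short calculation. The Python sketch below assumes, for illustration only, that the 20 factors are tested independently and that every null hypothesis is true; the function names are hypothetical and are not taken from the cited works.

```python
import random


def actual_significance_level(num_factors: int, alpha: float) -> float:
    """Probability that at least one of num_factors independent tests
    comes out significant at level alpha when every null is true."""
    return 1.0 - (1.0 - alpha) ** num_factors


def simulate_hunting(num_factors: int, alpha: float, trials: int, seed: int = 0) -> float:
    """Monte Carlo check: under a true null each p-value is uniform on [0, 1].
    A trial counts as a 'finding' if any of the factors looks significant."""
    rng = random.Random(seed)
    findings = 0
    for _ in range(trials):
        if any(rng.random() < alpha for _ in range(num_factors)):
            findings += 1
    return findings / trials


analytic = actual_significance_level(20, 0.05)   # roughly 0.64
simulated = simulate_hunting(20, 0.05, 20000)    # should land close to analytic
severity = 1.0 - analytic                        # roughly 0.36
print(analytic, simulated, severity)
```

On these assumptions the nominal 0.05 level understates the chance of a spurious "significant" result by more than a factor of ten, which is why the inference to a genuine effect passes with low severity.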
II. Understanding the “long-run error correcting” metaphor
Discussions of Peircean ‘self-correction’ often confuse two interpretations of the ‘long-run’ error correcting metaphor, even in the case of quantitative induction:
(a) Asymptotic self-correction (as n approaches ∞): In this construal, it is imagined that one has a sample, say of size n=10, and it is supposed that the SCT assures us that as the sample size increases toward infinity, one gets better and better estimates of some feature of the population, say the mean. Although this may be true, provided assumptions of a statistical model (e.g., the Binomial) are met, it is not the sense intended in significance-test reasoning nor, I maintain, in Peirce’s SCT. Peirce’s idea, instead, gives needed insight for understanding the relevance of ‘long-run’ error probabilities of significance tests to assess the reliability of an inductive inference from a specific set of data.
(b) Error probabilities of a test: In this construal, one has a sample of size n, say 10, and imagines hypothetical replications of the experiment, each with samples of 10. Each sample of 10 gives a single value of the test statistic d(X), but one can consider the distribution of values that would occur in hypothetical repetitions (of the given type of sampling). The probability distribution of d(X) is called the sampling distribution, and the correct calculation of the significance level is an example of how tests appeal to this distribution: Thanks to the relationship between the observed d(x) and the sampling distribution of d(X), the former can be used to reliably probe the correctness of statistical hypotheses (about the procedure) that generated the particular 10-fold sample. That is what the SCT is asserting.
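Construal (b) can be made concrete with a small sketch. It assumes, purely for illustration, a Binomial model with n = 10 trials, a null hypothesis of success probability 0.5, and a test statistic d(X) equal to the number of successes; none of these specifics come from the paper, and the function names are hypothetical.

```python
from math import comb


def sampling_distribution(n: int, p: float) -> list[float]:
    """Probability of each value 0..n of d(X) = number of successes
    in hypothetical repetitions of an n-fold Binomial sample."""
    return [comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]


def significance_level(observed: int, n: int, p0: float) -> float:
    """P(d(X) >= observed; H0), read off the sampling distribution under H0."""
    return sum(sampling_distribution(n, p0)[observed:])


dist = sampling_distribution(10, 0.5)
p_value = significance_level(9, 10, 0.5)  # equals 11/1024 for 9 or more successes
print(p_value)
```

The sample size stays fixed at 10 throughout. What varies is the hypothetical repetition of the same 10-fold experiment, and the observed d(x) is assessed against that distribution rather than against ever larger samples.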
It may help to consider a very informal example. Suppose that weight gain is measured by 10 well-calibrated and stable methods, possibly using several measuring instruments and the results show negligible change over a test period of interest. This may be regarded as grounds for inferring that the individual’s weight gain is negligible within limits set by the sensitivity of the scales. Why? While it is true that by averaging more and more weight measurements, i.e., an eleventh, twelfth, etc., one would get asymptotically close to the true weight, that is not the rationale for the particular inference. The rationale is rather that the error probabilistic properties of the weighing procedure (the probability of ten-fold weighings erroneously failing to show weight change) inform one of the correct weight in the case at hand, e.g., that a 0 observed weight increase passes the “no-weight gain” hypothesis with high severity.
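The weighing example can be given a rough numerical form. The sketch below assumes, hypothetically, that each weighing has an independent Normal error with a known standard deviation `sigma`, and it computes the severity with which "the gain is less than `delta`" passes when the mean of n weighings is `observed_mean`. The Normal model, the parameter values, and the function name are illustrative assumptions, not part of the original discussion.

```python
from math import sqrt
from statistics import NormalDist


def severity_gain_below(delta: float, observed_mean: float, sigma: float, n: int) -> float:
    """Probability that the mean of n weighings would exceed observed_mean
    if the true gain were delta. A high value means a gain as large as delta
    would very probably have shown up as a larger observed change."""
    standard_error = sigma / sqrt(n)
    return 1.0 - NormalDist(mu=delta, sigma=standard_error).cdf(observed_mean)


# Ten weighings, observed mean change of 0, assumed error s.d. of 1 unit per weighing.
sev_large = severity_gain_below(delta=1.0, observed_mean=0.0, sigma=1.0, n=10)
sev_small = severity_gain_below(delta=0.1, observed_mean=0.0, sigma=1.0, n=10)
print(sev_large, sev_small)
```

Under these assumptions the claim that the gain is below one unit passes with high severity, while the claim that it is below a tenth of a unit does not. This is the sense in which the inference is warranted only "within limits set by the sensitivity of the scales".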
7. Induction corrects its premises
Justifying the severity, and accordingly the error-correcting capacity, of tests depends upon being able to justify test assumptions sufficiently, whether in the quantitative or qualitative realms. In the former, a typical assumption would be that the data set constitutes a random sample from the appropriate population; in the latter, assumptions would include such things as “my instrument (e.g., scale) is working”. The problem of justifying methods is often taken to stymie attempts to justify inductive methods. Self-correcting, or error-correcting, enters here too, and precisely in the way that Peirce recognized. This leads me to consider something apparently overlooked by his critics; namely, Peirce’s insistence that induction “not only corrects its conclusions, it even corrects its premises” (3.575).
Induction corrects its premises by checking, correcting, or validating its own assumptions. One way that induction corrects its premises is by correcting and improving upon the accuracy of its data. This idea is at the heart of what allows induction—understood as severe testing—to be genuinely ampliative: to come out with more than is put in. Peirce comes to his philosophical stances from his experiences with astronomical observations.
Every astronomer, however, is familiar with the fact that the catalogue place of a fundamental star, which is the result of elaborate reasoning, is far more accurate than any of the observations from which it was deduced. (5.575)
His day-to-day use of the method of least squares made it apparent to him how knowledge of errors of observation can be used to infer an accurate observation from highly shaky data.
It is commonly assumed that empirical claims are only as reliable as the data involved in their inference, thus it is assumed, with Popper, that “should we try to establish anything with our tests, we should be involved in an infinite regress” (Popper 1962, p. 388). Peirce explicitly rejects this kind of “tower image” and argues that we can often arrive at rather accurate claims from far less accurate ones. For instance, with a little data massaging, e.g., averaging, we can obtain a value of a quantity of interest that is far more accurate than individual measurements.
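A brief simulation illustrates how an averaged value can be more accurate than the observations it is built from. It assumes independent, unbiased Normal measurement errors; the true value, error size, and number of observations are hypothetical numbers chosen only for the illustration.

```python
import random
from math import sqrt


def rms_errors(true_value: float, sigma: float, n_obs: int, trials: int, seed: int = 1):
    """Return (RMS error of a single observation, RMS error of the mean of n_obs
    observations), estimated over repeated simulated measurement sessions."""
    rng = random.Random(seed)
    single_sq = 0.0
    mean_sq = 0.0
    for _ in range(trials):
        obs = [rng.gauss(true_value, sigma) for _ in range(n_obs)]
        single_sq += (obs[0] - true_value) ** 2
        mean_sq += (sum(obs) / n_obs - true_value) ** 2
    return sqrt(single_sq / trials), sqrt(mean_sq / trials)


single_rms, mean_rms = rms_errors(true_value=100.0, sigma=2.0, n_obs=25, trials=5000)
print(single_rms, mean_rms)  # the mean's error is near sigma / sqrt(25)
```

With 25 observations the error of the average is about one fifth of the error of any single observation, provided the independence and no-bias assumptions hold. Checking those assumptions is itself part of the error-correcting work.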
Qualitative Error Correction
Peirce applies the same strategy from astronomy to a qualitative example:
That Induction tends to correct itself, is obvious enough. When a man undertakes to construct a table of mortality upon the basis of the Census, he is engaged in an inductive inquiry. And lo, the very first thing that he will discover from the figures … is that those figures are very seriously vitiated by their falsity. (5.576)
How is it discovered that there are systematic errors in the age reports? By noticing that the number of men reporting their age as 21 far exceeds those who are 20 (while in all other cases ages are much more likely to be expressed in round numbers). Induction, as Peirce understands it, helps to uncover this subject bias, that those under 21 tend to put down that they are 21. It does so by means of formal models of age distributions along with informal, background knowledge of the root causes of such bias. “The young find it to their advantage to be thought older than they are, and the old to be thought younger than they are” (5.576). Moreover, statistical considerations often allow correcting for bias, i.e., by estimating the number of “21” reports that are likely to be attributable to 20-year-olds. As with the star catalogue, the data thus corrected is more accurate than the original data report.
By means of an informal tool kit of key errors and their causes, coupled with formal or systematic tools to model them, experimental inquiry checks and corrects its own assumptions for the purpose of carrying out some other inquiry. As I have been urging for Peircean self-correction generally, satisfying the SCT is not a matter of saying that, with enough data, we will get better and better estimates of the star positions or the distribution of ages in a population; it is a matter of being able to employ methods in a given inquiry to detect and correct mistakes in that inquiry, or that data set. To get such methods off the ground there is no need to build a careful tower where inferences are piled up, each depending on what went on before: Properly exploited, inaccurate observations can give way to far more accurate data. By building up a “repertoire” of errors and means to check, avoid, or correct them, scientific induction is self-correcting.
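One simple way such a correction might be sketched is shown below. The counts are invented for illustration and are not Census figures, and the smoothing rule (linear interpolation between the neighbouring ages 19 and 22) is an assumption of this sketch, not a method attributed to Peirce.

```python
def correct_age_heaping(reported: dict[int, float], young: int = 20, threshold: int = 21):
    """Estimate how many reports at `threshold` are attributable to people aged
    `young`, assuming the true counts vary linearly between the neighbouring
    ages (young - 1) and (threshold + 1). Returns (corrected counts, excess moved)."""
    lower, upper = young - 1, threshold + 1
    slope = (reported[upper] - reported[lower]) / (upper - lower)
    expected_at_threshold = reported[lower] + slope * (threshold - lower)
    # Reports at the threshold age beyond what the smooth model expects.
    excess = max(0.0, reported[threshold] - expected_at_threshold)
    corrected = dict(reported)
    corrected[threshold] = reported[threshold] - excess
    corrected[young] = reported[young] + excess
    return corrected, excess


# Hypothetical reported counts by age: a deficit at 20 and a surplus at 21.
reported_counts = {19: 1000.0, 20: 700.0, 21: 1300.0, 22: 1000.0}
corrected_counts, moved = correct_age_heaping(reported_counts)
print(corrected_counts, moved)
```

The total number of reports is unchanged; only the suspect surplus at 21 is reassigned to 20. The result is only as good as the assumed model of the age distribution, which is where the background knowledge about the causes of the bias comes in.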
Induction Fares Better Than Deduction at Correcting its Errors
Consider how this reading of Peirce makes sense of his holding inductive science as better at self-correcting than deductive science.
Deductive inquiry … has its errors; and it corrects them, too. But it is by no means so sure, or at least so swift to do this as is Inductive science. (5.577)
An example he gives is that the error in Euclid’s Elements was undiscovered until non-Euclidean geometry was developed. Or again, “It is evident that when we run a column of figures down as well as up, as a check” or look out for possible flaws in a demonstration, “we are acting precisely as when in an induction we enlarge our sample for the sake of the self-correcting effect of induction” (5.580). In both cases we are appealing to various methods we have devised because we find they increase our ability to correct our mistakes, and thus increase the error probing power of our reasoning. What is distinctive about the methodology of inductive testing is that it deliberately directs itself to devising tools for reliable error probes. This is not so for mathematics. Granted, “once an error is suspected, the whole world is speedily in accord about it” (5.577) in deductive reasoning. But, for the most part, mathematics does not itself supply tools for uncovering flaws.
So it appears that this marvelous self-correcting property of Reason … belongs to every sort of science, although it appears as essential, intrinsic and inevitable only in the highest type of reasoning, which is induction. (5.579)
In one’s inductive or experimental tool kit, one finds explicit models and methods whose single purpose is the business of detecting patterns of irregularity, checking assumptions, assessing departures from canonical models, and so on. If an experimental test is unable to do this—if it is unable to mount severe tests—then it fails to count as scientific induction.
REFERENCES and Notes (see part 1 here)