(Almost) All about error
BOOK REVIEW
Metascience (2012) 21:709–713. DOI 10.1007/s11016-011-9618-1
K. W. Staley
Associate Professor, Department of Philosophy, Saint Louis University
Deborah G. Mayo and Aris Spanos (eds): Error and inference: Recent exchanges on experimental reasoning, reliability, objectivity, and rationality. New York: Cambridge University Press, 2010, xvii+419 pp.
The ERROR’06 (experimental reasoning, reliability, objectivity, and rationality) conference held at Virginia Tech aimed to advance the discussion of some central themes in philosophy of science debated over the years by Deborah Mayo and her more-or-less friendly critics. The volume reviewed here brings together the contributions of these critics and Mayo’s responses to them (written with her collaborator Aris Spanos). (I helped with the organization of the conference and, with Mayo and Jean Miller, edited a separate collection of workshop papers presented there, published as a special issue of Synthese.) My review will focus on a couple of themes that I hope will be of interest to a broad philosophical audience, then turn more briefly to an overview of the entire collection. The discussions in Error and Inference (E&I) are indispensable for understanding several current issues regarding the methodology of science.
The remarkably useful introductory chapter lays out the broad themes of the volume and discusses ‘‘The Error-Statistical Philosophy’’. Here, Mayo and Spanos provide the most succinct and non-technical account of the error-statistical approach that has yet been published, a feature that alone should commend this text to anyone who has found it difficult to locate a reading on error statistics suitable for use in teaching.
Mayo holds that the central question for a theory of evidence is not the degree to which some observation E confirms some hypothesis H but how well-probed for error a hypothesis H is by a testing procedure T that results in data x0. This reorientation has far-reaching consequences for Mayo’s approach to philosophy of science. On this approach, addressing the question of when data ‘‘provide good evidence for or a good test of’’ a hypothesis requires attention to characteristics of the process by means of which the data are used to bear on the hypothesis. Mayo identifies the starting point from which her account is developed as the ‘‘Weak Severity Principle’’ (WSP):
Data x0 do not provide good evidence for hypothesis H if x0 results from a test procedure with a very low probability or capacity of having uncovered the falsity of H (even if H is incorrect). (21)
The weak severity principle is then developed into the full severity principle (SP), according to which ‘‘data x0 provide a good indication of or evidence for hypothesis H (just) to the extent that test T has severely passed H with x0’’ where H passes a severe test T with x0 if x0 ‘‘agrees with’’ H and ‘‘with very high probability, test T would have produced a result that accords less well with H than does x0, if H were false or incorrect’’ (22). This principle constitutes the heart of the error-statistical account of evidence, and E&I, by including some of the most important critiques of the principle, provides a forum in which Mayo and Spanos attempt to correct misunderstandings of the principle and to clarify its meaning and application.
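For readers who want the principle in symbols, one schematic rendering (not a quotation from the book, and using the test-statistic notation d(X) familiar from Mayo’s statistical writings) is:
\[
\mathrm{SEV}(T, x_0, H) \;=\; P\big(d(X) \text{ accords less well with } H \text{ than } d(x_0) \text{ does}\;;\; H \text{ false}\big),
\]
with H passing a severe test T just in case x0 accords with H and SEV(T, x0, H) is high. The semicolon indicates that the probability is calculated under the supposition that H is false; it is not a conditional probability that assigns a probability to H itself.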
The appearance in the WSP of the disjunctive phrase ‘‘a very low probability or capacity’’ (my emphasis) indicates a point central to much of this clarificatory work. The error-statistical account is resolutely frequentist in its construal of probability. It is commonly held (including by some frequentists) that the rationale for frequentist statistical methods lies exclusively in the fact that they can sometimes be shown to have low error rates in the long run. Throughout E&I, Mayo insists that this ‘‘behaviorist rationale’’ is not applicable when it comes to evaluating a particular body of data in order to determine what inferences may be warranted. That evaluation rests upon thinking about the particular data and the inference at hand in light of the capacity of the test to reveal potential errors in the inference drawn. Frequentist probabilities are part of how one models the error-detecting capacities of the process. As Mayo explains in a later chapter co-authored with David Cox, tests of hypotheses function analogously to measuring instruments: ‘‘Just as with the use of measuring instruments, applied to a specific case, we employ the performance features to make inferences about aspects of the particular thing that is measured, aspects that the measuring tool is appropriately capable of revealing’’ (257).
One of the most fascinating exchanges in E&I concerns the role of severe testing in the appraisal of ‘‘large-scale’’ theories. According to Mayo, theory appraisal proceeds by a ‘‘piecemeal’’ process of severe probing for specific ways in which a theory might be in error. She illustrates this with the history of experimental tests of theories of gravity, emphasizing Clifford Will’s parametrized post-Newtonian (PPN) framework, by means of which all metric theories of gravity can be represented in their weak-field, slow-motion limits in terms of ten parameters. Experimental work on gravity theories then severely tests hypotheses about the values of those parameters. Rather than attempting to confirm or probabilify the general theory of relativity (GTR), the aim is to learn about the ways in which GTR might be in error and, more generally, to ‘‘measure how far off what a given theory says about a phenomenon can be from what a ‘correct’ theory would need to say about it’’ (55).
Alan Chalmers and Alan Musgrave both challenge this view. According to Chalmers, no general theory, whether ‘‘low level’’ or ‘‘high level’’, can pass a severe test because the content of theories surpasses whatever empirical evidence supports them. As a consequence, Chalmers argues, Mayo’s severe-testing account of scientific inference must be incomplete because even low-level experimental testing sometimes demands relying on general theoretical claims. Similarly, Musgrave accuses Mayo of holding that (general) theories are not tested by ‘‘testing their consequences’’, but that ‘‘all that we really test are the consequences’’ (105), leaving her with ‘‘nothing to say’’ about the assessment, adoption, or rejection of general theories (106).
Mayo denies that her shift away from confirmation or probabilification of theories, and toward the question of the ways that a theory has been probed for error, leaves her with ‘‘nothing to say’’ about the role that theoretical premises play in evidential arguments from data. First, she argues, such premises are more susceptible to independent testing than is often alleged. Second, robustness arguments can sometimes be invoked to show that particular theoretical assumptions, although sufficient to warrant a particular inference, are not strictly necessary. Finally, she states that ‘‘the aim of science in the account I favor is not severity but finding things out’’ (353), which is to say that what is sought is not merely truth but ‘‘informative’’ truth. I would suggest that in some cases of the sort Chalmers has in mind, scientists accept theoretical premises not because they have already passed severe tests, but in spite of the fact that they have not done so. Wanting to advance into informative new scientific territory, they play a hunch (perhaps invoking plausibility arguments) and take their chances with fortune. Mayo’s account of theory assessment through piecemeal severe testing tells us about the mechanism by which such risky efforts are subsequently kept honest and objectivity is secured.
For reasons of space, I will be brief in my comments on what remains. John Worrall takes up the issue of novel evidence. Worrall defends his long-standing endorsement of ‘‘the UN (Use-Novelty) charter’’, according to which one is barred from using the same fact both in constructing and in supporting a theory, and criticizes Mayo for failing to adhere to it. Mayo claims on the contrary that UN faces counterexamples, as when a hypothesis is inferred (via averaging, say) from a body of data. Worrall argues for treating such counterexamples in terms of support that is conditional on the acceptance of some broader theoretical framework, but insists that the UN rule cannot be violated in cases where data provide unconditional support.
Peter Achinstein’s contribution is nominally a defense of Mill’s account of induction against criticisms by Mayo and Peirce. The exchange, however, focuses on the allegation that Mayo’s account is subject to counterexamples in virtue of what has been called the ‘‘base-rate fallacy’’. (The tenuous connection with Mill comes via Achinstein’s claim that Mill regards the conclusion of a good inductive argument as having high probability, which the conclusions in the base-rate examples lack.) Achinstein’s attempt to craft a counterexample rests, according to Mayo, on the ‘‘fallacy of probabilistic instantiation’’, whereby the probability that the random selection of an individual from a population will result in an individual with property P is taken to be identical to the probability that some particular individual who has been chosen randomly from that population has property P. Mayo then argues that Achinstein’s counterexample commits him to a violation of the WSP. However, it is not clear whether Achinstein would regard this as a problem.
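To see the base-rate structure at issue, consider a generic illustration (not Achinstein’s own example; the numbers are invented purely for exposition). Suppose a trait has prevalence P(H) = 0.001 in a population, and a test for it has sensitivity P(+ | H) = 0.99 and false-positive rate P(+ | ¬H) = 0.05. Then, by Bayes’ theorem,
\[
P(H \mid +) \;=\; \frac{P(+ \mid H)\,P(H)}{P(+ \mid H)\,P(H) + P(+ \mid \neg H)\,P(\neg H)} \;=\; \frac{0.99 \times 0.001}{0.99 \times 0.001 + 0.05 \times 0.999} \;\approx\; 0.02 .
\]
A hypothesis can thus ‘‘pass’’ a test with low error probabilities while its posterior probability remains small, because the base rate is so low. The alleged counterexamples trade on this structure, and Mayo’s reply turns on distinguishing the probability attaching to the generic selection procedure from a probability assignment to the particular individual selected.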
In his ambitious chapter, Aris Spanos combines a historical perspective on methodological debates in economics with a defense of error statistics as a framework for progress in economic modeling. In brief, Spanos characterizes the historical development as a kind of dialectic: at first, data were subordinated to the role of instantiating theories; then economists turned to data-driven modeling in which models were evaluated on ‘‘goodness of fit’’ criteria; finally, they turned to a ‘‘third way’’ that attempts to allow data a prominent theory-constraining role while acknowledging such difficulties as the gap between econometric data and the terms of economic theorizing. Error statistics, Spanos argues, is the natural philosophical ‘‘home’’ for the third way, because it describes how to bridge the gap between theory and data with interconnected models at different levels, and it accounts for what Spanos calls ‘‘statistical knowledge’’—knowledge about the probabilistic assumptions underlying statistical inferences and their adequacy for the inference at hand.
In a chapter on frequentist statistics, Mayo and David Cox lay out a programmatic discussion of the rationale for frequentist approaches in theoretical statistics, with attention to particular types of examples. This chapter will be best appreciated by those who have at least a passing acquaintance with discussions in the philosophy of statistics, but ambitious readers lacking such an acquaintance could use parts of it as an entry point into important methodological debates. Mayo has in the past written about such topics as optional stopping, data mining, mixed tests, and (fundamentally) the ‘‘likelihood principle’’, but it is useful to have these various issues brought together in a single clearly articulated discussion that should be read by anyone with an interest in debates over statistical methodology.
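To give a flavor of why one of those topics matters: optional stopping concerns sampling plans of the ‘‘keep collecting data until the test rejects’’ variety, which destroy the error probabilities a test is nominally advertised as having (and which the likelihood principle deems evidentially irrelevant). The following minimal simulation sketch is mine, not the chapter’s; the cutoff, maximum sample size, and trial count are illustrative only.

# Minimal illustration (not from E&I): under a true null hypothesis,
# peeking after every observation and stopping at the first |z| > 1.96
# yields "significant" results far more often than the nominal 5%.
import numpy as np

rng = np.random.default_rng(0)

def rejects_with_optional_stopping(max_n=500, cutoff=1.96):
    """Sample N(0,1) data one point at a time and stop as soon as the
    running z-statistic exceeds the nominal 5% two-sided cutoff."""
    total = 0.0
    for n in range(1, max_n + 1):
        total += rng.standard_normal()
        if abs(total / np.sqrt(n)) > cutoff:
            return True  # rejected the (true) null by trying again
    return False

trials = 2000
rate = sum(rejects_with_optional_stopping() for _ in range(trials)) / trials
print(f"Actual type-I error rate with optional stopping: {rate:.2f}")

The point of the exercise is only that the actual error rate climbs well above the nominal level, which is why error statisticians insist that stopping rules are evidentially relevant.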
Clark Glymour contributes a stimulating discussion in which he seeks to draw a connection between ‘‘explanatory virtues’’ and the question of what to believe. According to Glymour, ‘‘explanations and the explanatory virtues that facilitate comprehension also facilitate testing the claims made in explanation’’ (331). He proposes that the assumptions ‘‘implicit in causal inference from experiments’’ ‘‘tie together’’ the ability of a causal hypothesis to explain data in a randomized experiment and the ability of such a test to yield an inference to the causal hypothesis. The conditions in question are the causal Markov condition and the faithfulness condition. Glymour, drawing on work by Zhang, then goes on to show that severe testing (or at least something close to it) is available for ‘‘a broad class of causal explanations, including causal explanations that postulate unobserved quantities’’ (337).
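For readers unfamiliar with them, the two conditions can be stated compactly; the formulations below are the standard ones from the causal-modeling literature, not quotations from Glymour’s chapter. For a causal directed acyclic graph G over a set of variables V with joint distribution P:
\[
\text{Causal Markov:}\quad X \perp\!\!\!\perp \big(\mathrm{ND}(X)\setminus \mathrm{Pa}(X)\big) \;\big|\; \mathrm{Pa}(X) \quad \text{for every } X \in V,
\]
\[
\text{Faithfulness:}\quad \text{the only conditional independencies holding in } P \text{ are those entailed by the Markov condition applied to } G,
\]
where Pa(X) denotes X’s parents in G and ND(X) its non-descendants. Roughly, the graph captures all the probabilistic dependencies in the data (Markov), and no independencies arise from accidental cancellations of causal effects (faithfulness).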
The final chapter by Larry Laudan looks at first like an outlier insofar as it is an essay not in philosophy of science but in legal epistemology, focusing in particular on an apparent inconsistency in how burdens and standards of proof are treated in the criminal justice system when a defendant employs an affirmative defense. Mayo’s response attempts to reframe the issues raised by Laudan in error-statistical terms. What Laudan’s essay shares with Mayo’s work is a concern with doing serious philosophical work that is also practically significant.
Different readers will no doubt be drawn to different aspects of this challenging but rewarding collection of essays. But all of it deserves to be read, and no judgment about the error-statistical approach to philosophy of science should neglect the arguments and insights to be found here.
Kent: I’m very grateful to you for such a clear explanation and overview, especially of my proposed “paradigm shift” from seeking highly probable to highly probed theories. I’m sorry I had not seen this before. I am so glad to have something here that focuses more on the general (error statistical) philosophy of science, and less on the philosophy of statistics. The latter is only really appreciated in relation to the (much broader) questions of the former. Yet the vast majority of commentators on this blog tend to jump in at the point where it is assumed we are in a fairly well-defined statistical context, or faced with a “statistical” problem (as opposed to a general question about finding something out, and controlling errors).
So here’s an idea for readers: compartmentalize. Recognize and try out this distinct (error statistical) use of statistical ideas (viewed broadly, from design, generation, modeling, and analysis of data, to linking statistical and substantive questions), while retaining your favorite goals for distinct contexts, be they quantifying degrees of probability and belief (actual, personalistic or rational), rational updating, betting, maxent, default, constructive, “operational”, prior elicitation, and O-Bayesianism; “laws” of comparative likelihoods, quantifying losses, bookies, and formal decision-making, and much else.
Or, at least, keep it in the background of your mind, and please don’t tell me Jaynes (or any other high priest) is preventing you….