In view of some questions about “behavioristic” vs “evidential” construals of frequentist statistics (from the last post), and how the error statistical philosophy tries to improve on Birnbaum’s attempt at providing the latter, I’m reblogging a portion of a post from Nov. 5, 2011, when I also happened to be in London. (The beginning just records a goofy mishap with a skeletal key, and so I leave it out in this reblog.) Two papers with much more detail are linked at the end.

Error Statistics

(1) There is a “statistical philosophy” and a philosophy of science. (a) An error-statistical philosophy alludes to the methodological principles and foundations associated with frequentist error-statistical methods. (b) An error-statistical philosophy of science, on the other hand, involves using the error-statistical methods, formally or informally, to deal with problems of philosophy of science: to model scientific inference (actual or rational), to scrutinize principles of inference, and to address philosophical problems about evidence and inference (the problem of induction, underdetermination, warranting evidence, theory testing, etc.).

I assume the interest here* is on the former, (a). I have stated it in numerous ways, but the basic position is that inductive inference—i.e., data-transcending inference—calls for methods of controlling and evaluating error probabilities (even if only approximate). An inductive inference, in this conception, takes the form of inferring hypotheses or claims to the extent that they have been well tested. It also requires reporting claims that have not passed severely, or have passed with low severity. In the “severe testing” philosophy of induction, the quantitative assessment offered by error probabilities tells us not “how probable” but, rather, “how well probed” hypotheses are. The local canonical hypotheses of formal tests and estimation methods need not be the ones we entertain post data; but they give us a place to start without having to go “the designer-clothes” route (see Oct. 30 post).

(2) There are cases where low long-run errors of a procedure are just what is wanted. I call these “behavioristic” contexts. In contexts of “scientific inference,” as I will call them, by contrast, we want to evaluate the evidence or warrant for this hypothesis about this phenomenon (in this world).

Question: How can error probabilities (or error-probing capacities) of a procedure be used to make a specific inference H about the process giving rise to this data? Answer: by enabling the assessment of how well probed or how severely tested H is with data x (along with a background or a “repertoire of errors”). By asking a question of interest in terms of a “data generating process” that we can actually trigger and check (or what Neyman might call a “real statistical experiment”), we can and do build knowledge about the world using statistical reasoning.

While the degree of severity with which a hypothesis H has passed a test T lets us determine whether it is warranted to infer H, the degree of severity is not assigned to H itself: it is an attribute of the test procedure as a whole (including the inference under consideration). (The “testing” logic can be applied equally to cases of “estimations.”)

(3) Although the overarching goal of inquiry is to find out what is (truly) the case about aspects of phenomena, the hypotheses erected in actually finding things out are generally approximations and may even be deliberately false. In scientific contexts, the sampling distribution may be seen to describe what it would be like, statistically, if H was incorrect about some specific aspect of the process generating data x (as modeled). Data x do not supply good evidence for the correctness of H when the data attained are scarcely different from what it would be like were H false. Falsifying H requires more (i.e., severely warranting H’s denial). [Model assumptions are separately checked.]

I argue that the logic of probability is inadequate as a logic for well testedness, and then I replace it with a probabilistic concept that succeeds. The goal of attaining such well-probed hypotheses differs crucially from seeking highly probable ones (however probability is interpreted).
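To make the idea of “how well probed” concrete, here is a minimal sketch of a severity assessment. It assumes a one-sided test of a Normal mean with known standard deviation (H0: mu <= mu0 vs. H1: mu > mu0), the kind of simple setting used for illustration in the severity literature. The function name and all the numbers are hypothetical, chosen only for illustration. For the claim “mu > mu1”, severity is taken here to be the probability of a sample mean no larger than the one observed, computed under mu = mu1.

```python
from math import sqrt
from statistics import NormalDist


def severity_mu_greater(xbar, mu1, sigma, n):
    """Severity for the claim 'mu > mu1' given observed sample mean xbar.

    Assumes X_1..X_n are i.i.d. Normal(mu, sigma) with sigma known.
    Returns P(sample mean <= xbar; mu = mu1): the probability of a result
    that accords less well with 'mu > mu1' than the observed one does,
    were mu only as large as mu1.
    """
    standard_error = sigma / sqrt(n)
    return NormalDist().cdf((xbar - mu1) / standard_error)


if __name__ == "__main__":
    # Hypothetical data summary: sigma = 2, n = 100, observed mean 0.4.
    sigma, n, xbar = 2.0, 100, 0.4
    for mu1 in (0.0, 0.2, 0.4, 0.6):
        sev = severity_mu_greater(xbar, mu1, sigma, n)
        print(f"SEV(mu > {mu1:.1f}) = {sev:.3f}")
```

In this sketch the severity for “mu > mu1” falls as mu1 grows, and it is exactly 0.5 when mu1 equals the observed mean: the same data that pass “mu > 0” with high severity give poor warrant for larger discrepancies. The number attaches to the test, the data, and the inference taken together, not to H as a probability of its truth.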

I am happy to use Popper’s “degree of corroboration” notion so long as its meaning is understood. Clearly, the Popperian idea that claims should be accepted only after passing “severe attempts to falsify them” is in the error statistical spirit; but Popper never had an account of statistics that could do justice to this insight. He never made “the error probability turn.” (He also admitted to me that he regretted not having learned statistics.) A better historical figure, if one wants one, is C. S. Peirce.

(4) An objective account of statistical inference (I published a paper with that name in 1983! Scary!) requires being able to control and approximately evaluate error-probing capabilities (formally or informally). For detailed computations, see that very paper, Mayo 1983 [1], or Mayo and Spanos (2006, 2011). [The 2006 paper is also referenced and linked below.]

When a claim is inferred (and this requires detaching it from the assumptions), it must be qualified by an assessment of how well probed it is, in relation to a specific test and data set x. If you want to infer a posterior probability for H, it too must be checked for well testedness, along the same lines.

How do we evaluate a philosophy of statistics? (I discuss this in my RMM contribution; see earlier posts.) We evaluate how well the account captures, and helps solve problems about, statistical learning—understood as the cluster of methods for generating, modeling, and learning from data, and using what is learned to feed into other questions of interest. It should also illuminate scientific inference and empirical progress more generally.

If you accept the basic error-statistical philosophy, or even its minimal requirements, then whatever account you wish to recommend should be able to satisfy its goals and meet its requirements.

Simple as that.

Mayo, D. and Spanos, A. (2006). “Severe Testing as a Basic Concept in a Neyman-Pearson Philosophy of Induction,” British Journal for the Philosophy of Science, 57, 323-357.

Mayo, D. and Cox, D. (2006/2010). “Frequentist Statistics as a Theory of Inductive Inference” in Error and Inference: Recent Exchanges on Experimental Reasoning, Reliability and the Objectivity and Rationality of Science (D. Mayo and A. Spanos, eds.), Cambridge: Cambridge University Press: 1-27. This paper appeared in The Second Erich L. Lehmann Symposium: Optimality, 2006, Lecture Notes-Monograph Series, Volume 49, Institute of Mathematical Statistics, pp. 247-275.

*In the particular post being reblogged, I was responding to statisticians, but philosophers of science are likely to be interested in the latter (b). Thus, I will separately post some things that link statistical science and philosophy of science over the weekend.

[1] I hadn’t introduced the term “severity” yet, but the reasoning is the same, and there are some nifty pictures and graphs. On the other hand, I seem to recall a typo (having to do with a decimal someplace).

“evaluating error probabilities (even if only approximate).”

What would be an example of a non approximate error probability?

I think in the context of the post “approximate” is being juxtaposed with “exact”, meaning even if one is not in a position to be able to calculate the error probabilities, one can attempt to approximate what they would be. So in qualitative inferences (which often take the form of an argument from coincidence), error probabilistic reasoning can be translated into making an argument from error along the lines given in Mayo 1996 (e.g., p. 64). In her chapter 3, she uses Hacking’s microscope example (ibid. pp. 66-7) for inferring that a real effect is being observed rather than an artifact. If we see the same ‘dense bodies’ in red blood platelets using different physical techniques (see the same things in the same arrangements under fluorescence micrographs and electron micrographs, especially as we are using and seeing them against grid lines that we made), we have good grounds for claiming they are a real effect. That is, given this evidence, we would have strong grounds for rejecting the null hypothesis (they are artifacts), because given all the manipulations made, the fact that we are using different technologies based on different physical properties, etc., such a coincidence would require calling upon a Cartesian demon to explain it, as Mayo puts it (p. 66). I think this is what Mayo was referring to when she said “even if only approximately.”

An example of a non-approximate error probability:

Berger’s ‘conditional error probabilities’? 🙂

(e.g., Mayo’s 10/21/12 post) 🙂

I think I understand what she meant by approximate. I was wondering when is it ever possible to deal with exact error probabilities.

Do we ever know the future distribution of errors for real repetitions of real devices? Still less do we ever know the distribution of errors in hypothetical repetitions where a device is used over and over again under the exact same circumstances (same time and place).

Even if we accept approximate as good enough, do we ever know how good the approximation is? Suppose we’ve verified that a measuring device has N(mu,sd) errors approximately (it’s seemingly never exact). How do we know the device will continue to give the same approximate errors? Is there any instance in which we know that with certainty? Making that leap is precisely the problem of induction we’re wrestling with.

It seems like the objective part of our explanation of scientific inference is based on the assumption that we have made correct scientific inferences about the errors. So we’re assuming as an already accomplished fact, the very thing we’re trying to explain.

I suppose we could just assume future errors are the same as they’ve been in the past, but then where does that leave the Frequentist guarantee? It’s worth repeating my original question again: what is an example of a non approximate error probability?

Fisher: To ask that is to ask when can we ever apply statistical models to actual experiments or real phenomena, and the answer is: quite often. The situation is just the same as the “fit” we manage with any empirical phenomena, even in entirely nonstatistical arenas, and in measurements in day-to-day life. We don’t even seek ‘exactitude’ in learning, else deliberately idealized models wouldn’t be so valuable. One is free to be a radical skeptic about all knowledge and science, but it’s not a very interesting position. (See my “no pain” entries on Popper, if interested.)

I’m not even close to being a radical skeptic and my questions had nothing to do with that. My questions were about matters of principle.

We don’t seem to have any instances in which we use non approximate error probabilities. This seems on close analysis to be a nice way of saying all our error probabilities are themselves models based on assumptions which may not be true or, if true, may change at any moment. This wouldn’t be a problem except that this seems, if I’m understanding things correctly, to be the foundation of the objectiveness of Error Statistics and of our Frequentist guarantees.

Objectivity in error statistics, as in science more generally, rests on being able to detect discrepancies and discordances, and being able to reorient ourselves and our models as a result of such error detecting. If you worry about our assumptions being wrong, then presumably you have evidence of claims being found in error—the basis of objectivity. On approximations, fortunately whether an error probability is, say, .048 or .05 or the like, does not alter our inference or prevent us from discovering what is/is not the case about the world. To say, as you do, that true assumptions may change at any moment, is confused. (Check objectivity on this blog.)

Fisher, check your namesake. Permutation based randomization tests have exact error probabilities (i.e., non-approximate) under Fisher’s strong null hypothesis. These tests provide the basis for statistical inference in randomized trials.
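As a minimal sketch of the point about exact error probabilities, the following enumerates every possible assignment of units to treatment in a small, completely randomized experiment. It assumes the strong (sharp) null that treatment changes no unit’s outcome, so the outcomes are fixed and only the labels vary. The function name and the data are hypothetical, and the one-sided test statistic (the sum of the treated outcomes) is an illustrative choice.

```python
from fractions import Fraction
from itertools import combinations


def exact_permutation_pvalue(treated, control):
    """One-sided exact randomization p-value under the sharp null.

    Under the sharp null every unit's outcome is the same whichever group
    it lands in, so each way of choosing the treated units is equally
    likely. With group sizes fixed, the difference in group means is an
    increasing function of the treated sum, so the treated sum is used
    as the test statistic.
    """
    pooled = list(treated) + list(control)
    n_treated = len(treated)
    observed = sum(treated)
    as_extreme = 0
    total = 0
    for chosen in combinations(range(len(pooled)), n_treated):
        total += 1
        if sum(pooled[i] for i in chosen) >= observed:
            as_extreme += 1
    return Fraction(as_extreme, total)


if __name__ == "__main__":
    # Hypothetical outcomes for 3 treated and 3 control units.
    p = exact_permutation_pvalue([12, 14, 15], [8, 9, 10])
    print(p)
```

With three treated units out of six there are 20 equally likely assignments, and only the observed one puts the three largest outcomes in the treated group, so the p-value here is exactly 1/20. The exactness comes from the physical act of randomization and not from a distributional model of the outcomes, though it still presupposes that the randomization was carried out as described.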

There are no examples of non approximate numbers in any real applications, even CERN experiments. That is why applied labs have standard procedures for documenting the uncertainty of measurement for data used in testing. This is universal. Only in pure math does one see some level of approximation as truly problematic. This has nothing to do with foundations of error stats or objectivity per se. Objectivity exists in how the data were generated and how uncertainty is estimated.

John: True. The recent “Fisher” (commentator) is clearly no Fisher.