Today is George Barnard’s 101st birthday. In honor of this, I reblog an exchange between Barnard, Savage (and others) on likelihood vs probability. The exchange is from pp 79-84 (of what I call) “The Savage Forum” (Savage, 1962).[i] Six other posts on Barnard are linked below: 2 are guest posts (Senn, Spanos); the other 4 include a play (pertaining to our first meeting), and a letter he wrote to me.
♠♠♠♠♠♠♠♠♠♠♠♠♠♠♠♠♠♠♠♠♠♠♠♠♠♠♠♠♠♠♠♠♠♠♠♠♠♠♠♠
BARNARD:…Professor Savage, as I understand him, said earlier that a difference between likelihoods and probabilities was that probabilities would normalize because they integrate to one, whereas likelihoods will not. Now probabilities integrate to one only if all possibilities are taken into account. This requires in its application to the probability of hypotheses that we should be in a position to enumerate all possible hypotheses which might explain a given set of data. Now I think it is just not true that we ever can enumerate all possible hypotheses. … If this is so we ought to allow that in addition to the hypotheses that we really consider we should allow something that we had not thought of yet, and of course as soon as we do this we lose the normalizing factor of the probability, and from that point of view probability has no advantage over likelihood. This is my general point, that I think while I agree with a lot of the technical points, I would prefer that this is talked about in terms of likelihood rather than probability. I should like to ask what Professor Savage thinks about that, whether he thinks that the necessity to enumerate hypotheses exhaustively, is important.
SAVAGE: Surely, as you say, we cannot always enumerate hypotheses so completely as we like to think. The list can, however, always be completed by tacking on a catch-all ‘something else’. In principle, a person will have probabilities given ‘something else’ just as he has probabilities given other hypotheses. In practice, the probability of a specified datum given ‘something else’ is likely to be particularly vague–an unpleasant reality. The probability of ‘something else’ is also meaningful of course, and usually, though perhaps poorly defined, it is definitely very small. Looking at things this way, I do not find probabilities unnormalizable, certainly not altogether unnormalizable.
Whether probability has an advantage over likelihood seems to me like the question whether volts have an advantage over amperes. The meaninglessness of a norm for likelihood is for me a symptom of the great difference between likelihood and probability. Since you question that symptom, I shall mention one or two others. …
On the more general aspect of the enumeration of all possible hypotheses, I certainly agree that the danger of losing serendipity by binding oneself to an over-rigid model is one against which we cannot be too alert. We must not pretend to have enumerated all the hypotheses in some simple and artificial enumeration that actually excludes some of them. The list can however be completed, as I have said, by adding a general ‘something else’ hypothesis, and this will be quite workable, provided you can tell yourself in good faith that ‘something else’ is rather improbable. The ‘something else’ hypothesis does not seem to make it any more meaningful to use likelihood for probability than to use volts for amperes.
Let us consider an example. Off hand, one might think it quite an acceptable scientific question to ask, ‘What is the melting point of californium?’ Such a question is, in effect, a list of alternatives that pretends to be exhaustive. But, even specifying which isotope of californium is referred to and the pressure at which the melting point is wanted, there are alternatives that the question tends to hide. It is possible that californium sublimates without melting or that it behaves like glass. Who dare say what other alternatives might obtain? An attempt to measure the melting point of californium might, if we are serendipitous, lead to more or less evidence that the concept of melting point is not directly applicable to it. Whether this happens or not, Bayes’s theorem will yield a posterior probability distribution for the melting point given that there really is one, based on the corresponding prior conditional probability and on the likelihood of the observed reading of the thermometer as a function of each possible melting point. Neither the prior probability that there is no melting point, nor the likelihood for the observed reading as a function of hypotheses alternative to that of the existence of a melting point enter the calculation. The distinction between likelihood and probability seems clear in this problem, as in any other.
BARNARD: Professor Savage says in effect, ‘add at the bottom of the list H_{1}, H_{2},…“something else”’. But what is the probability that a penny comes up heads given the hypothesis ‘something else’? We do not know. What one requires for this purpose is not just that there should be some hypotheses, but that they should enable you to compute probabilities for the data, and that requires very well defined hypotheses. For the purpose of applications, I do not think it is enough to consider only the conditional posterior distributions mentioned by Professor Savage.
LINDLEY: I am surprised at what seems to me an obvious red herring that Professor Barnard has drawn across the discussion of hypotheses. I would have thought that when one says this posterior distribution is such and such, all it means is that among the hypotheses that have been suggested the relevant probabilities are such and such; conditionally on the fact that there is nothing new, here is the posterior distribution. If somebody comes along tomorrow with a brilliant new hypothesis, well of course we bring it in.
BARTLETT: But you would be inconsistent because your prior probability would be zero one day and non-zero another.
LINDLEY: No, it is not zero. My prior probability for other hypotheses may be ε. All I am saying is that conditionally on the other 1 – ε, the distribution is as it is.
BARNARD: Yes, but your normalization factor is now determined by ε. Of course ε may be anything up to 1. Choice of letter has an emotional significance.
LINDLEY: I do not care what it is as long as it is not one.
BARNARD: In that event two things happen. One is that the normalization has gone west, and hence also this alleged advantage over likelihood. Secondly, you are not in a position to say that the posterior probability which you attach to an hypothesis from an experiment with these unspecified alternatives is in any way comparable with another probability attached to another hypothesis from another experiment with another set of possibly unspecified alternatives. This is the difficulty over likelihood. Likelihood in one class of experiments may not be comparable to likelihood from another class of experiments, because of differences of metric and all sorts of other differences. But I think that you are in exactly the same difficulty with conditional probabilities just because they are conditional on your having thought of a certain set of alternatives. It is not rational, in other words. Suppose I come out with a probability of a third that the penny is unbiased, having considered a certain set of alternatives. Now I do another experiment on another penny and I come out of that case with the probability one third that it is unbiased, having considered yet another set of alternatives. There is no reason why I should agree or disagree in my final action or inference in the two cases. I can do one thing in one case and another in another, because they represent conditional probabilities leaving aside possibly different events.
LINDLEY: All probabilities are conditional.
BARNARD: I agree.
LINDLEY: If there are only conditional ones, what is the point at issue?
PROFESSOR E.S. PEARSON: I suggest that you start by knowing perfectly well that they are conditional and when you come to the answer you forget about it.
BARNARD: The difficulty is that you are suggesting the use of probability for inference, and this makes us able to compare different sets of evidence. Now you can only compare probabilities on different sets of evidence if those probabilities are conditional on the same set of assumptions. If they are not conditional on the same set of assumptions they are not necessarily in any way comparable.
LINDLEY: Yes, if this probability is a third conditional on that, and if a second probability is a third, conditional on something else, a third still means the same thing. I would be prepared to take my bets at 2 to 1.
BARNARD: Only if you knew that the condition was true, but you do not.
GOOD: Make a conditional bet.
BARNARD: You can make a conditional bet, but that is not what we are aiming at.
WINSTEN: You are making a cross comparison where you do not really want to, if you have got different sets of initial experiments. One does not want to be driven into a situation where one has to say that everything with a probability of a third has an equal degree of credence. I think this is what Professor Barnard has really said.
BARNARD: It seems to me that likelihood would tell you that you lay 2 to 1 in favour of H_{1} against H_{2}, and the conditional probabilities would be exactly the same. Likelihood will not tell you what odds you should lay in favour of H_{1} as against the rest of the universe. Probability claims to do that, and it is the only thing that probability can do that likelihood cannot.
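The numerical nub of the exchange can be sketched in a few lines. The code below is my own illustration, not part of the 1962 forum: the coin data, the two hypotheses, and above all the value 0.05 assumed for the probability of the data given ‘something else’ are invented for the example. It shows Barnard’s closing point: the likelihood ratio between two specified hypotheses needs no catch-all, whereas the posterior of either hypothesis depends both on ε and on the ill-defined likelihood of the data under ‘something else’.

```python
# Illustrative only: 14 heads in 20 tosses; H1: p = 0.5 (fair coin),
# H2: p = 0.7, plus Savage's catch-all "something else" with prior eps.
# The likelihood 0.05 assigned to the data under "something else" is
# pure invention -- Barnard's point is that nothing tells us this value.
from math import comb

def binom_pmf(k, n, p):
    return comb(n, k) * p**k * (1 - p)**(n - k)

lik_H1 = binom_pmf(14, 20, 0.5)
lik_H2 = binom_pmf(14, 20, 0.7)

# Likelihoods compare H1 and H2 directly, catch-all or no catch-all:
lr_21 = lik_H2 / lik_H1   # roughly 5 to 1 in favour of H2

def posterior_H1(eps, lik_catchall):
    """Posterior of H1 when H1 and H2 split the prior 1 - eps equally."""
    prior = (1 - eps) / 2
    denom = prior * lik_H1 + prior * lik_H2 + eps * lik_catchall
    return prior * lik_H1 / denom

# Same data, same H1 and H2; the posterior moves with eps alone:
p_small_eps = posterior_H1(0.01, 0.05)
p_large_eps = posterior_H1(0.30, 0.05)
```

With ε = 0.01 versus ε = 0.30 the posterior of H1 shifts, though nothing about the data or the two specified hypotheses has changed; the 2-to-1-style odds between H1 and H2 in Barnard’s last remark are untouched by the catch-all.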
You can read the rest of pages 78-103 of the Savage Forum here.
HAPPY BIRTHDAY GEORGE!
References
*Six other Barnard links on this blog:
Guest Posts:
Aris Spanos: Comment on the Barnard and Copas (2002) Empirical Example
Stephen Senn: On the (ir)relevance of stopping rules in meta-analysis
Posts by Mayo:
Barnard, Background Information, and Intentions
Statistical Theater of the Absurd: Stat on a Hot Tin Roof
George Barnard’s 100th Birthday: We Need More Complexity and Coherence in Statistical Education
Letter from George Barnard on the Occasion of my Lakatos Award
Links to a scan of the entire Savage forum may be found at: https://errorstatistics.com/2013/04/06/who-is-allowed-to-cheat-i-j-good-and-that-after-dinner-comedy-hour/
I. The myth of objectivity. Whenever you come up against blanket slogans such as “no methods are objective” or “all methods are equally objective and subjective,” it is a good guess that the problem is being trivialized into oblivion. Yes, there are judgments, disagreements, and values in any human activity; but that observation by itself is too trivial to distinguish among the very different ways that threats of bias and unwarranted inference may be controlled. Is the objectivity-subjectivity distinction really toothless, as many would have you believe? I say no.
Cavalier attitudes toward objectivity are in tension with widely endorsed movements to promote replication, reproducibility, and to come clean on a number of sources behind illicit results: multiple testing, cherry picking, failed assumptions, researcher latitude, publication bias, and so on. The moves to take back science–if they are not mere lip-service–are rooted in the supposition that we can more objectively scrutinize results, even if it’s only to point out those that are poorly tested. The fact that the term “objectivity” is used equivocally should not be taken as grounds to oust it, but rather to engage in the difficult work of identifying what there is in “objectivity” that we won’t give up, and shouldn’t.
II. The Key is Getting Pushback. While knowledge gaps leave plenty of room for biases, arbitrariness and wishful thinking, we regularly come up against data that thwart our expectations and disagree with the predictions we try to foist upon the world. We get pushback! This supplies objective constraints on which our critical capacity is built. Our ability to recognize when data fail to match anticipations affords the opportunity to systematically improve our orientation. In an adequate account of statistical inference, explicit attention is paid to communicating results to set the stage for others to check, debate, extend or refute the inferences reached. Don’t let anyone say you can’t hold them to an objective account of statistical inference.
If you really want to find something out, and have had some experience with flaws and foibles, you deliberately arrange inquiries so as to capitalize on pushback, on effects that will not go away, and on strategies to get errors to ramify quickly to force you to pay attention to them. The ability to register alterations in error probabilities due to hunting, optional stopping, and other questionable research practices (QRPs) is a crucial part of objectivity in statistics. In statistical design, day-to-day tricks of the trade to combat bias are amplified and made systematic. It is not because of a “disinterested stance” that such methods are invented. It is that we, competitively and self-interestedly, want to find things out.
Admittedly, that desire won’t suffice to incentivize objective scrutiny if you can do just as well producing junk. Succeeding in scientific learning is very different from success at grants, honors, publications, or engaging in technical activism, replication research and meta-research. That’s why the reward structure of science is so often blamed nowadays. New incentives, gold stars and badges for sharing data, preregistration, and resisting the urge to cherry pick, outcome-switch, or otherwise engage in bad science are proposed. I say that if the allure of carrots has grown stronger than the sticks (which it has), then what we need are stronger sticks.
III. Objective procedures. It is often urged that, however much we may aim at objective constraints, we can never have clean hands, free of the influence of beliefs and interests. The fact that my background knowledge enters in researching a claim H doesn’t mean I build my beliefs about H into the analysis so as to prejudge the inference. I may instead use background information to give H a hard time. I may use it to question your claim to have grounds to infer H, by showing H hasn’t survived a stringent effort to falsify it or find flaws in it. The test H survived might be quite lousy, and even if I have independent grounds to believe H, I may deny you’ve done a good job testing it.
Others argue that we invariably sully methods of inquiry by the entry of personal judgments in their specification and interpretation. It’s just human all too human. The issue is not that a human is doing the measuring; the issue is whether that which is being measured is something we can reliably use to solve some problem of inquiry. That an inference is done by machine, untouched by human hands, wouldn’t make it objective, in the relevant sense. There are three distinct requirements for an objective procedure for solving problems of inquiry:
Yes, there are numerous choices in collecting, analyzing, modeling, and drawing inferences from data, and there is often disagreement about how they should be made. Why suppose this means all accounts are in the same boat as regards subjective factors? It need not, and they are not. An account of inference shows itself to be objective precisely in how it steps up to the plate in handling potential threats to objectivity.
IV. Idols of Objectivity. We should reject phony objectivity and false trappings of objectivity. They often grow out of one or another philosophical conception of what objectivity requires—even though you will almost surely not see them described that way. If it’s thought that objectivity is limited to direct observations (whatever those are) plus mathematics and logic, as the typical logical positivist held, then it’s no surprise to wind up worshiping “the idols of a universal method,” as Gigerenzer and Marewski (2015) call it. Such a method is supposed to supply a formal, ideally mechanical, way to process statements of observations and hypotheses. To recognize that such mechanical rules don’t exist is not, on this view, to relinquish the idea that objectivity demands them. Instead, objectivity goes by the board, replaced by various stripes of relativism and constructivism, or by more extreme forms of post-modernism.
Relativists may augment their rather thin gruel with a pseudo-objectivity arising from social or political negotiation, cost-benefits (“they’re buying it”), or a type of consensus (“it’s in a 5 star journal”), but that’s to give away the goat far too soon. The result is to abandon the core stipulations of scientific objectivity. To be clear: there are authentic problems that threaten objectivity. We shouldn’t allow outdated philosophical accounts to induce us to give it up.
V. From Discretion to Subjective Probabilities. Some argue that the “discretionary choices” in tests, which Neyman himself tended to call “subjective”[1], lead us to subjective probabilities of claims. A weak version goes: since you can’t avoid subjective (discretionary) choices in getting the data and the model, there can be little ground for complaint about subjective degrees of belief in the resulting inference. This is weaker than arguing that you must use subjective probabilities; it argues merely that doing so is no worse than discretion. But it still misses the point.
First, even if the discretionary judgments made on the journey to a statistical inference/model are capable of introducing subjectivity, they need not. Second, not all discretionary judgments are in the same boat when it comes to being open to severe testing.
A stronger version of the argument goes on a slippery slope from the premise of discretion in data generation and modeling to the conclusion: statistical inference just is a matter of subjective beliefs (or their updates). How does that work? One variant, which I do not try to pin on anyone in particular, involves a subtle slide from “our models are merely objects of belief” to “statistical inference is a matter of degrees of belief”. From there it’s a short step to “statistical inference is a matter of subjective probability” (whether my assignments or those of an imaginary omniscient agent).
It is one thing to describe our models as objects of belief and quite another to maintain that our task is to model beliefs.
This is one of those philosophical puzzles of language that might set some people’s eyes rolling. If I believe in the deflection effect (of gravity) then that effect is the object of my belief, but only in the sense that my belief is about said effect. Yet if I’m inquiring into the deflection effect, I’m not inquiring into beliefs about the effect. The philosopher of science Clark Glymour (2010, p. 335) calls this a shift from phenomena (content) to epiphenomena (degrees of belief).
Karl Popper argues that the central confusion all along was sliding from the degree of the rationality (or warrantedness) of a belief, to the degree of rational belief (1959, p. 424). The former is assessed via degrees of corroboration and well-testedness, rooted in the error probing capacities of procedures. (These are supplied by error probabilities of methods, formal or informal.)
VI. Blurring What’s Being Measured vs My Ability to Test It. You will sometimes hear a Bayesian claim that anyone who says their probability assignments to hypotheses are subjective must also call the use of any model subjective, because it too is based on someone’s choice of specifications. This conflates two notions of “subjective”.
This goes back to my point about what’s required for a feature to be relevant to a method’s objectivity in III.
(Passages, modified, are from Mayo, Statistical Inference as Severe Testing (forthcoming).)
[1] But he never would allow subjective probabilities to enter into statistical inference. Objective, i.e., frequentist, priors for a hypothesis H could enter, but he was very clear that this required H’s truth to be the result of some kind of stochastic mechanism. He found the idea plausible in some cases; the problem was not knowing the stochastic mechanism well enough to assign the priors. Such frequentist (or “empirical”) priors in hypotheses are not given by drawing H randomly from an urn of hypotheses, k% of which are assumed to be true. Yet an “objective” Bayesian like Jim Berger will call these frequentist, resulting in enormous confusion in today’s guidebooks on the probability of type 1 errors.
Cox, D. R. and Mayo, D. G. 2010. ‘Objectivity and Conditionality in Frequentist Inference’, in Error and Inference: Recent Exchanges on Experimental Reasoning, Reliability and the Objectivity and Rationality of Science (D. Mayo and A. Spanos eds.), Cambridge: Cambridge University Press: 276-304.
Gigerenzer, G. and Marewski, J. 2015. ‘Surrogate Science: The Idol of a Universal Method for Scientific Inference,’ Journal of Management 41(2): 421-40.
Glymour, C. 2010. ‘Explanation and Truth’, in Error and Inference: Recent Exchanges on Experimental Reasoning, Reliability and the Objectivity and Rationality of Science (D. Mayo and A. Spanos eds.), CUP: 331–350.
Mayo, D. 1983. ‘An Objective Theory of Statistical Testing’, Synthese 57(2): 297-340.
Popper, K. 1959. The Logic of Scientific Discovery. New York: Basic Books.
Today is C.S. Peirce’s birthday. He’s one of my all-time heroes. You should read him: he’s a treasure chest on essentially any topic, and he anticipated several major ideas in statistics (e.g., randomization, confidence intervals) as well as in logic. I’ll reblog the first portion of a (2005) paper of mine. Links to Parts 2 and 3 are at the end. It’s written for a very general philosophical audience; the statistical parts are pretty informal. Happy birthday Peirce.
Peircean Induction and the Error-Correcting Thesis
Deborah G. Mayo
Transactions of the Charles S. Peirce Society: A Quarterly Journal in American Philosophy, Volume 41, Number 2, 2005, pp. 299-319
Peirce’s philosophy of inductive inference in science is based on the idea that what permits us to make progress in science, what allows our knowledge to grow, is the fact that science uses methods that are self-correcting or error-correcting:
Induction is the experimental testing of a theory. The justification of it is that, although the conclusion at any stage of the investigation may be more or less erroneous, yet the further application of the same method must correct the error. (5.145)
Inductive methods—understood as methods of experimental testing—are justified to the extent that they are error-correcting methods. We may call this Peirce’s error-correcting or self-correcting thesis (SCT):
Self-Correcting Thesis SCT: methods for inductive inference in science are error correcting; the justification for inductive methods of experimental testing in science is that they are self-correcting.
Peirce’s SCT has been a source of fascination and frustration. By and large, critics and followers alike have denied that Peirce can sustain his SCT as a way to justify scientific induction: “No part of Peirce’s philosophy of science has been more severely criticized, even by his most sympathetic commentators, than this attempted validation of inductive methodology on the basis of its purported self-correctiveness” (Rescher 1978, p. 20).
In this paper I shall revisit the Peircean SCT: properly interpreted, I will argue, Peirce’s SCT not only serves its intended purpose, it also provides the basis for justifying (frequentist) statistical methods in science. While on the one hand, contemporary statistical methods increase the mathematical rigor and generality of Peirce’s SCT, on the other, Peirce provides something current statistical methodology lacks: an account of inductive inference and a philosophy of experiment that links the justification for statistical tests to a more general rationale for scientific induction. Combining the mathematical contributions of modern statistics with the inductive philosophy of Peirce sets the stage for developing an adequate justification for contemporary inductive statistical methodology.
2. Probabilities are assigned to procedures not hypotheses
Peirce’s philosophy of experimental testing shares a number of key features with the contemporary (Neyman and Pearson) Statistical Theory: statistical methods provide, not means for assigning degrees of probability, evidential support, or confirmation to hypotheses, but procedures for testing (and estimation), whose rationale is their predesignated high frequencies of leading to correct results in some hypothetical long-run. A Neyman and Pearson (NP) statistical test, for example, instructs us “To decide whether a hypothesis, H, of a given type be rejected or not, calculate a specified character, x_{0}, of the observed facts; if x > x_{0} reject H; if x < x_{0} accept H.” Although the outputs of N-P tests do not assign hypotheses degrees of probability, “it may often be proved that if we behave according to such a rule … we shall reject H when it is true not more, say, than once in a hundred times, and in addition we may have evidence that we shall reject H sufficiently often when it is false” (Neyman and Pearson, 1933, p. 142).[i]
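The two long-run frequencies in that quotation are easy to exhibit numerically. The following toy test is my own illustration, not Neyman and Pearson’s: test H: p = 0.5 for a coin, rejecting H whenever 15 or more heads appear in 20 tosses.

```python
# Toy N-P test (my example, not from the 1933 paper):
# H: p = 0.5; reject H when X >= 15 heads in n = 20 tosses.
from math import comb

def binom_pmf(k, n, p):
    return comb(n, k) * p**k * (1 - p)**(n - k)

def reject_prob(p, n=20, cutoff=15):
    """Long-run frequency with which the rule rejects H
    when the true probability of heads is p."""
    return sum(binom_pmf(k, n, p) for k in range(cutoff, n + 1))

alpha = reject_prob(0.5)  # erroneous rejection of a true H: about 2%
power = reject_prob(0.7)  # rejection when p = 0.7: about 42%
```

The rule rejects a true H only about 2% of the time in a long run of applications, while detecting a bias as large as p = 0.7 about 42% of the time; frequencies of this kind are the error probabilities on which such tests are assessed.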
The relative frequencies of erroneous rejections and erroneous acceptances in an actual or hypothetical long run sequence of applications of tests are error probabilities; we may call the statistical tools based on error probabilities, error statistical tools. In describing his theory of inference, Peirce could be describing that of the error-statistician:
The theory here proposed does not assign any probability to the inductive or hypothetic conclusion, in the sense of undertaking to say how frequently that conclusion would be found true. It does not propose to look through all the possible universes, and say in what proportion of them a certain uniformity occurs; such a proceeding, were it possible, would be quite idle. The theory here presented only says how frequently, in this universe, the special form of induction or hypothesis would lead us right. The probability given by this theory is in every way different—in meaning, numerical value, and form—from that of those who would apply to ampliative inference the doctrine of inverse chances. (2.748)
The doctrine of “inverse chances” alludes to assigning (posterior) probabilities in hypotheses by applying the definition of conditional probability (Bayes’s theorem)—a computation requires starting out with a (prior or “antecedent”) probability assignment to an exhaustive set of hypotheses:
If these antecedent probabilities were solid statistical facts, like those upon which the insurance business rests, the ordinary precepts and practice [of inverse probability] would be sound. But they are not and cannot be statistical facts. What is the antecedent probability that matter should be composed of atoms? Can we take statistics of a multitude of different universes? (2.777)
For Peircean induction, as in the N-P testing model, the conclusion or inference concerns a hypothesis that either is or is not true in this one universe; thus, assigning a frequentist probability to a particular conclusion, other than the trivial ones of 1 or 0, for Peirce, makes sense only “if universes were as plentiful as blackberries” (2.684). Thus the Bayesian inverse probability calculation seems forced to rely on subjective probabilities for computing inverse inferences, but “subjective probabilities”, Peirce charges, “express nothing but the conformity of a new suggestion to our prepossessions, and these are the source of most of the errors into which man falls, and of all the worst of them” (2.777).
Hearing Peirce contrast his view of induction with the more popular Bayesian account of his day (the Conceptualists), one could be listening to an error statistician arguing against the contemporary Bayesian (subjective or other)—with one important difference. Today’s error statistician seems to grant too readily that the only justification for N-P test rules is their ability to ensure we will rarely take erroneous actions with respect to hypotheses in the long run of applications. This so-called inductive behavior rationale seems to supply no adequate answer to the question of what is learned in any particular application about the process underlying the data. Peirce, by contrast, was very clear that what is really wanted in inductive inference in science is the ability to control error probabilities of test procedures, i.e., “the trustworthiness of the proceeding”. Moreover it is only by a faulty analogy with deductive inference, Peirce explains, that many suppose that inductive (synthetic) inference should supply a probability to the conclusion: “… in the case of analytic inference we know the probability of our conclusion (if the premises are true), but in the case of synthetic inferences we only know the degree of trustworthiness of our proceeding” (“The Probability of Induction”, 2.693).
Knowing the “trustworthiness of our inductive proceeding”, I will argue, enables determining the test’s probative capacity, how reliably it detects errors, and the severity of the test a hypothesis withstands. Deliberately making use of known flaws and fallacies in reasoning with limited and uncertain data, tests may be constructed that are highly trustworthy probes in detecting and discriminating errors in particular cases. This, in turn, enables inferring which inferences about the process giving rise to the data are and are not warranted: an inductive inference to hypothesis H is warranted to the extent that with high probability the test would have detected a specific flaw or departure from what H asserts, and yet it did not.
3. So why is justifying Peirce’s SCT thought to be so problematic?
You can read Section 3 here (it’s not necessary for understanding the rest).
4. Peircean induction as severe testing
… [I]nduction, for Peirce, is a matter of subjecting hypotheses to “the test of experiment” (7.182).
The process of testing it will consist, not in examining the facts, in order to see how well they accord with the hypothesis, but on the contrary in examining such of the probable consequences of the hypothesis … which would be very unlikely or surprising in case the hypothesis were not true. (7.231)
When, however, we find that prediction after prediction, notwithstanding a preference for putting the most unlikely ones to the test, is verified by experiment, … we begin to accord to the hypothesis a standing among scientific results.
This sort of inference it is, from experiments testing predictions based on a hypothesis, that is alone properly entitled to be called induction. (7.206)
While these and other passages are redolent of Popper, Peirce differs from Popper in crucial ways. Peirce, unlike Popper, is primarily interested not in falsifying claims but in the positive pieces of information provided by tests, with “the corrections called for by the experiment” and with the hypotheses, modified or not, that manage to pass severe tests. For Popper, even if a hypothesis is highly corroborated (by his lights), he regards this as at most a report of the hypothesis’ past performance and denies it affords positive evidence for its correctness or reliability. Further, Popper denies that he could vouch for the reliability of the method he recommends as “most rational”—conjecture and refutation. Indeed, Popper’s requirements for a highly corroborated hypothesis are not sufficient for ensuring severity in Peirce’s sense (Mayo 1996, 2003, 2005). Where Popper recoils from even speaking of warranted inductions, Peirce conceives of a proper inductive inference as what had passed a severe test—one which would, with high probability, have detected an error if present.
In Peirce’s inductive philosophy, we have evidence for inductively inferring a claim or hypothesis H when not only does H “accord with” the data x; but also, so good an accordance would very probably not have resulted, were H not true. In other words, we may inductively infer H when it has withstood a test of experiment that it would not have withstood, or withstood so well, were H not true (or were a specific flaw present). This can be encapsulated in the following severity requirement for an experimental test procedure, ET, and data set x.
Hypothesis H passes a severe test with x iff (firstly) x accords with H and (secondly) the experimental test procedure ET would, with very high probability, have signaled the presence of an error were there a discordancy between what H asserts and what is correct (i.e., were H false).
The test would “have signaled an error” by having produced results less accordant with H than what the test yielded. Thus, we may inductively infer H when (and only when) H has withstood a test with high error detecting capacity; the higher this probative capacity, the more severely H has passed. What is assessed (quantitatively or qualitatively) is not the amount of support for H but the probative capacity of the test of experiment ET (with regard to those errors that an inference to H is declaring to be absent)…
You can read the rest of Section 4 here.
5. The path from qualitative to quantitative induction
In my understanding of Peircean induction, the difference between qualitative and quantitative induction is really a matter of degree, according to whether their trustworthiness or severity is quantitatively or only qualitatively ascertainable. This reading not only neatly organizes Peirce’s typologies of the various types of induction, it underwrites the manner in which, within a given classification, Peirce further subdivides inductions by their “strength”.
(I) First-Order, Rudimentary or Crude Induction
Consider Peirce’s First Order of induction: the lowest, most rudimentary form that he dubs the “pooh-pooh argument”. It is essentially an argument from ignorance: lacking evidence for the falsity of some hypothesis or claim H, provisionally adopt H. In this very weakest sort of induction, crude induction, the most that can be said is that a hypothesis would eventually be falsified if false. (It may correct itself—but with a bang!) It “is as weak an inference as any that I would not positively condemn” (8.237). While uneliminable in ordinary life, Peirce denies that rudimentary induction is to be included as scientific induction. Without some reason to think evidence of H‘s falsity would probably have been detected, were H false, finding no evidence against H is poor inductive evidence for H. H has passed only a highly unreliable error probe.
(II) Second Order (Qualitative) Induction
It is only with what Peirce calls “the Second Order” of induction that we arrive at a genuine test, and thereby scientific induction. Within second order inductions, a stronger and a weaker type exist, corresponding neatly to viewing strength as the severity of a testing procedure.
The weaker of these is where the predictions that are fulfilled are merely of the continuance in future experience of the same phenomena which originally suggested and recommended the hypothesis… (7.116)
The other variety of the argument … is where [results] lead to new predictions being based upon the hypothesis of an entirely different kind from those originally contemplated and these new predictions are equally found to be verified. (7.117)
The weaker type occurs where the predictions, though fulfilled, lack novelty; whereas, the stronger type reflects a more stringent hurdle having been satisfied: the hypothesis has had “novel” predictive success, and thereby higher severity. (For a discussion of the relationship between types of novelty and severity see Mayo 1991, 1996). Note that within a second order induction the assessment of strength is qualitative, e.g., very strong, weak, very weak.
The strength of any argument of the Second Order depends upon how much the confirmation of the prediction runs counter to what our expectation would have been without the hypothesis. It is entirely a question of how much; and yet there is no measurable quantity. For when such measure is possible the argument … becomes an induction of the Third Order [statistical induction]. (7.115)
It is upon these and like passages that I base my reading of Peirce. A qualitative induction, i.e., a test whose severity is qualitatively determined, becomes a quantitative induction when the severity is quantitatively determined; when an objective error probability can be given.
(III) Third Order, Statistical (Quantitative) Induction
We enter the Third Order of statistical or quantitative induction when it is possible to quantify “how much” the prediction runs counter to what our expectation would have been without the hypothesis. In his discussions of such quantifications, Peirce anticipates to a striking degree later developments of statistical testing and confidence interval estimation (Hacking 1980, Mayo 1993, 1996). Since this is not the place to describe his statistical contributions, I move to more modern methods to make the qualitative-quantitative contrast.
6. Quantitative and qualitative induction: significance test reasoning
Quantitative Severity
A statistical significance test illustrates an inductive inference justified by a quantitative severity assessment. The significance test procedure has the following components: (1) a null hypothesis H_{0}, which is an assertion about the distribution of the sample X = (X_{1}, …, X_{n}), a set of random variables, and (2) a function of the sample, d(X), the test statistic, which reflects the difference between the data x = (x_{1}, …, x_{n}) and the null hypothesis H_{0}. The observed value of d(X) is written d(x). The larger the value of d(x), the further the outcome is from what is expected under H_{0}, with respect to the particular question being asked. We can imagine that the null hypothesis H_{0} is
H_{0}: there are no increased cancer risks associated with hormone replacement therapy (HRT) in women who have taken them for 10 years.
Let d(x) measure the increased risk of cancer in n women, half of whom were randomly assigned to HRT. H_{0} asserts, in effect, that it is an error to take as genuine any positive value of d(x)—any observed difference is claimed to be “due to chance”. The test computes (3) the p-value, which is the probability of a difference larger than d(x), under the assumption that H_{0} is true:
p-value = Prob(d(X) > d(x); H_{0}).
If this probability is very small, the data are taken as evidence that
H*: cancer risks are higher in women treated with HRT
The reasoning is a statistical version of modus tollens.
If the hypothesis H_{0} is correct then, with high probability, 1- p, the data would not be statistically significant at level p.
x is statistically significant at level p.
Therefore, x is evidence of a discrepancy from H_{0}, in the direction of an alternative hypothesis H*.
(i.e., H* severely passes, where the severity is 1 minus the p-value)[iii]
For example, the results of recent, large, randomized treatment-control studies showing statistically significant increased risks (at the 0.001 level) give strong evidence that HRT, taken for over 5 years, increases the chance of breast cancer, the severity being 0.999. If a particular conclusion is wrong, subsequent severe (or highly powerful) tests will with high probability detect it. In particular, if we are wrong to reject H_{0} (and H_{0} is actually true), we would find we were rarely able to get so statistically significant a result to recur, and in this way we would discover our original error.
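The p-value computation in (3) can be sketched numerically. The counts below are hypothetical illustrations (not the actual HRT trial data), and the normal approximation to a difference in proportions is one standard choice, not necessarily the analysis those studies used:

```python
from math import erf, sqrt

def norm_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + erf(z / sqrt(2)))

def p_value(cases_trt, n_trt, cases_ctl, n_ctl):
    """One-sided p-value for H0: no increased risk under treatment,
    using a normal approximation to a difference in proportions."""
    p1, p0 = cases_trt / n_trt, cases_ctl / n_ctl
    pooled = (cases_trt + cases_ctl) / (n_trt + n_ctl)
    se = sqrt(pooled * (1 - pooled) * (1 / n_trt + 1 / n_ctl))
    d_x = (p1 - p0) / se        # observed test statistic d(x)
    return 1 - norm_cdf(d_x)    # Prob(d(X) > d(x); H0)

# Hypothetical counts: 60 cancers among 1,000 treated women vs.
# 30 among 1,000 controls -- a very small p-value results.
print(p_value(60, 1000, 30, 1000))
```

A p-value this small licenses inferring H* only because, were H_{0} true, so large a d(x) would very rarely occur.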
It is true that the observed conformity of the facts to the requirements of the hypothesis may have been fortuitous. But if so, we have only to persist in this same method of research and we shall gradually be brought around to the truth. (7.115)
The correction is not a matter of getting higher and higher probabilities, it is a matter of finding out whether the agreement is fortuitous; whether it is generated about as often as would be expected were the agreement of the chance variety.
[Part 2 and Part 3 are here; you can find the full paper here.]
REFERENCES:
Hacking, I. 1980 “The Theory of Probable Inference: Neyman, Peirce and Braithwaite”, pp. 141-160 in D. H. Mellor (ed.), Science, Belief and Behavior: Essays in Honour of R.B. Braithwaite. Cambridge: Cambridge University Press.
Laudan, L. 1981 Science and Hypothesis: Historical Essays on Scientific Methodology. Dordrecht: D. Reidel.
Levi, I. 1980 “Induction as Self Correcting According to Peirce”, pp. 127-140 in D. H. Mellor (ed.), Science, Belief and Behavior: Essays in Honour of R.B. Braithwaite. Cambridge: Cambridge University Press.
Mayo, D. 1991 “Novel Evidence and Severe Tests”, Philosophy of Science, 58: 523-552.
———- 1993 “The Test of Experiment: C. S. Peirce and E. S. Pearson”, pp. 161-174 in E. C. Moore (ed.), Charles S. Peirce and the Philosophy of Science. Tuscaloosa: University of Alabama Press.
——— 1996 Error and the Growth of Experimental Knowledge, The University of Chicago Press, Chicago.
———–2003 “Severe Testing as a Guide for Inductive Learning”, in H. Kyburg (ed.), Probability Is the Very Guide of Life. Chicago: Open Court Press, pp. 89-117.
———- 2005 “Evidence as Passing Severe Tests: Highly Probed vs. Highly Proved” in P. Achinstein (ed.), Scientific Evidence, Johns Hopkins University Press.
Mayo, D. and Kruse, M. 2001 “Principles of Inference and Their Consequences,” pp. 381-403 in Foundations of Bayesianism, D. Cornfield and J. Williamson (eds.), Dordrecht: Kluwer Academic Publishers.
Mayo, D. and Spanos, A. 2004 “Methodology in Practice: Statistical Misspecification Testing” Philosophy of Science, Vol. II, PSA 2002, pp. 1007-1025.
———- (2006). “Severe Testing as a Basic Concept in a Neyman-Pearson Theory of Induction”, The British Journal for the Philosophy of Science 57: 323-357.
Mayo, D. and Cox, D.R. 2006 “The Theory of Statistics as the ‘Frequentist’s’ Theory of Inductive Inference”, Institute of Mathematical Statistics (IMS) Lecture Notes-Monograph Series, Contributions to the Second Lehmann Symposium, 2005.
Neyman, J. and Pearson, E.S. 1933 “On the Problem of the Most Efficient Tests of Statistical Hypotheses”, in Philosophical Transactions of the Royal Society, A: 231, 289-337, as reprinted in J. Neyman and E.S. Pearson (1967), pp. 140-185.
———- 1967 Joint Statistical Papers, Berkeley: University of California Press.
Niiniluoto, I. 1984 Is Science Progressive? Dordrecht: D. Reidel.
Peirce, C. S. Collected Papers: Vols. I-VI, C. Hartshorne and P. Weiss (eds.) (1931-1935). Vols. VII-VIII, A. Burks (ed.) (1958), Cambridge: Harvard University Press.
Popper, K. 1962 Conjectures and Refutations: the Growth of Scientific Knowledge, Basic Books, New York.
Rescher, N. 1978 Peirce’s Philosophy of Science: Critical Studies in His Theory of Induction and Scientific Method, Notre Dame: University of Notre Dame Press.
[i] Others who relate Peircean induction and Neyman-Pearson tests are Isaac Levi (1980) and Ian Hacking (1980). See also Mayo 1993 and 1996.
[ii] This statement of (b) is regarded by Laudan as the strong thesis of self-correcting. A weaker thesis would replace (b) with (b’): science has techniques for determining unambiguously whether an alternative T’ is closer to the truth than a refuted T.
[iii] If the p-value were not very small, then the difference would be considered statistically insignificant (“small” generally meaning 0.1 or less). We would then regard H_{0} as consistent with data x, but we may wish to go further and determine the size of an increased risk r that has thereby been ruled out with severity. We do so by finding a risk increase r such that Prob(d(X) > d(x); risk increase r) is high. Then, we would argue, the assertion “the risk increase < r” passes with high severity:
If there were a discrepancy from hypothesis H_{0} of r (or more), then, with high probability, 1-p, the data would be statistically significant at level p.
x is not statistically significant at level p.
Therefore, x is evidence that any discrepancy from H_{0} is less than r.
For a general treatment of effect size, see Mayo and Spanos (2006).
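Under a normal model, this severity calculation can be sketched as follows; the observed increase, standard error, and candidate bounds r are all hypothetical numbers chosen for illustration:

```python
from math import erf, sqrt

def norm_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + erf(z / sqrt(2)))

def severity_bound(d_obs, r, se):
    """SEV('risk increase < r') = Prob(d(X) > d(x); risk increase = r),
    for an observed (insignificant) increase d_obs with standard error se."""
    return 1 - norm_cdf((d_obs - r) / se)

# Hypothetical numbers: observed increase 0.5 (se = 1), statistically
# insignificant. Larger bounds r are ruled out with higher severity:
for r in (1.0, 2.0, 3.0):
    print(r, round(severity_bound(0.5, r, 1.0), 3))  # 0.691, 0.933, 0.994
```

Only “risk increase < 2” (or more) passes with high severity here; the data do not warrant ruling out smaller increases.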
[Ed. Note: A not bad biographical sketch can be found on wikipedia.]
Error Statistics Philosophy: Blog Contents (5 years) [i]
By: D. G. Mayo
Dear Reader: It’s hard to believe I’ve been blogging for five years (since Sept. 3, 2011)! A big celebration is taking place at the Elbar Room this evening. If you’re in the neighborhood, stop by for some Elba Grease.
Amazingly, this old typewriter still works; one of the whiz kids on Elba even managed to Bluetooth it so that what I type goes directly onto the blog (I never got used to computer keyboards). I still must travel to London to get replacement ribbons for this klunker.
Please peruse the offerings below, and take advantage of some of the super contributions and discussions by guest posters and readers! I don’t know how much longer I’ll continue blogging, but at least until the publication of my book on statistical inference. After that I plan to run conferences, workshops, and ashrams on PhilStat and PhilSci, and will invite readers to take part! Keep reading and commenting. Sincerely, D. Mayo
[i] Table of Contents (compiled by N. Jinn & J. Miller)*
*I thank Jean Miller for her assiduous work on the blog. I’m very grateful to guest posters in the past year: Laudan, Spanos, Senn, and to all contributors and readers for helping “frequentists in exile” to feel (and truly become) less exiled–wherever they may be!
Did you hear the one about the frequentist significance tester when he was shown the nonfrequentist nature of p-values?
Critic: I just simulated a long series of tests on a pool of null hypotheses, and I found that among tests with p-values of .05, at least 22%—and typically over 50%—of the null hypotheses are true!
Frequentist Significance Tester: Scratches head: But rejecting the null with a p-value of .05 ensures erroneous rejection no more than 5% of the time!
Raucous laughter ensues!
(Hah, hah… “So funny, I forgot to laugh!” Or, “I’m crying and laughing at the same time!”)
The frequentist tester should retort:
Frequentist Tester: But you assume 50% of the null hypotheses are true, compute P(H_{0}|x) using P(H_{0}) = .5, imagine the null is rejected based on a single small p-value, and then blame the p-value for disagreeing with the result of your computation!
At times you even use α and power as likelihoods in your analysis! Such moves violate the principles of both Fisherian and Neyman-Pearson tests.
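A back-of-the-envelope version of the critic’s computation shows that the two numbers answer different questions. The prevalence (0.5) and power (0.17) below are assumptions for illustration, not figures from the post; with them, P(H_{0} | reject) comes out near the critic’s 22% even though α remains 5%:

```python
def alpha_vs_fdr(prior_true, alpha, power):
    """Contrast the type 1 error rate (alpha) with the screening quantity
    P(H0 true | H0 rejected), under an assumed urn of null hypotheses."""
    p_reject = prior_true * alpha + (1 - prior_true) * power
    return alpha, prior_true * alpha / p_reject

# Assumed for illustration: half the nulls true, tests of power 0.17.
alpha, p_null_given_reject = alpha_vs_fdr(0.5, 0.05, 0.17)
print(alpha)                          # 0.05: Prob(reject; H0), fixed by the test
print(round(p_null_given_reject, 2))  # 0.23: a different quantity entirely
```

Nothing in the test’s error probability has changed between the two lines; the second number is manufactured entirely by the assumed urn of nulls and the low assumed power.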
It is well known that for a fixed p-value, with a sufficiently large n, even a statistically significant result can correspond to large posteriors in H_{0}. This Jeffreys-Lindley “disagreement” is considered problematic for Bayes ratios (e.g., Bernardo). It is not problematic for error statisticians: we always indicate the extent of discrepancy that is and is not warranted, and avoid making mountains out of molehills (see Spanos 2013). J. Berger and Sellke (1987) attempt to generalize the result to show the “exaggeration” even without large n. From their Bayesian perspective, it appears that p-values come up short; error statistical testers (and even some tribes of Bayesians) balk at the fact that using the recommended priors allows highly significant results to be interpreted as no evidence against the null — or even evidence for it!
The conflict between p-values and Bayesian posteriors typically considers the two sided test of the Normal mean, H_{0}: μ = μ_{0} versus H_{1}: μ ≠ μ_{0}.
“If n = 50 one can classically ‘reject H_{0} at significance level p = .05,’ although Pr (H_{0}|x) = .52 (which would actually indicate that the evidence favors H_{0}).” (Berger and Sellke, 1987, p. 113).
If n = 1000, a result statistically significant at the .05 level leads to a posterior to the null of .82!
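These figures can be reproduced with a minimal sketch. The particular slab prior below (a normal over the alternative with the same variance as the data, one of the priors in the family Berger and Sellke consider) is my assumption, chosen because it recovers posteriors of roughly .52 and .82 at the .05 significance boundary:

```python
from math import exp, sqrt

def posterior_null(z, n, prior=0.5):
    """P(H0 | x) for two-sided H0: mu = mu0 under a spike-and-slab prior:
    mass `prior` on H0 and a N(mu0, sigma^2) slab over the alternative."""
    b01 = sqrt(1 + n) * exp(-z ** 2 * n / (2 * (1 + n)))  # Bayes factor H0:H1
    return prior * b01 / (prior * b01 + 1 - prior)

# z = 1.96: just significant at the two-sided .05 level.
print(round(posterior_null(1.96, 50), 2))    # 0.52
print(round(posterior_null(1.96, 1000), 2))  # 0.82
```

The p-value is the same in both calls; only the sample size n changes, and with it the spiked posterior — which is the Jeffreys-Lindley effect in miniature.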
Some find the example shows the p-value “overstates evidence against a null” because it claims to use an “impartial” or “uninformative” Bayesian prior probability assignment of .5 to H_{0}, the remaining .5 being spread out over the alternative parameter space. (“Spike and slab” I’ve heard Gelman call this, derisively.) Others demonstrate that the problem is not p-values but the high prior.
Casella and R. Berger (1987) show that “concentrating mass on the point null hypothesis is biasing the prior in favor of H_{0} as much as possible” (p. 111), whether in 1- or 2-sided tests. Note, too, the conflict with confidence interval reasoning, since the null value (here it is 0) lies outside the corresponding confidence interval (Mayo 2005). Many complain that the “spiked concentration of belief in the null” is at odds with the view that “we know all nulls are false” (even though that view is also false). See Senn’s interesting points on this same issue in his letter (to Goodman) here.
But often, as in the opening joke, the prior assignment is claimed to be keeping to the frequentist camp and to frequentist error probabilities. How’s that supposed to work? It is imagined that we sample randomly from a population of hypotheses, k% of which are assumed to be true. 50% is a common number. We randomly draw a hypothesis and get this particular one, maybe it concerns the mean deflection of light, or perhaps it is an assertion of bioequivalence of two drugs or whatever. The percentage “initially true” (in this urn of nulls) serves as the prior probability for your particular H_{0}. I see this gambit in statistics, psychology, philosophy and elsewhere, and yet it commits a fallacious instantiation of probabilities:
50% of the null hypotheses in a given pool of nulls are true.
This particular null H_{0 }was randomly selected from this urn (some may wish to add “nothing else is known” which would scarcely be true here).
Therefore P(H_{0} is true) = .5.
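A minimal simulation makes the fallacy vivid: the 0.5 is a property of the drawing procedure, not of any particular hypothesis drawn. The urn below is entirely hypothetical:

```python
import random

random.seed(0)

# Hypothetical urn: 100 nulls whose truth values are fixed in advance
# (the first 50 true, the rest false).
urn = [(f"H{i}", i < 50) for i in range(100)]
draws = [random.choice(urn) for _ in range(20000)]

# The *event* 'the drawn null is true' has relative frequency near 0.5 ...
freq_true = sum(truth for _, truth in draws) / len(draws)
print(round(freq_true, 2))

# ... but each *particular* hypothesis has a fixed truth value: among its
# own draws, the proportion 'true' is 0 or 1, never 0.5.
per_hyp = {}
for name, truth in draws:
    per_hyp.setdefault(name, []).append(truth)
props = {name: sum(ts) / len(ts) for name, ts in per_hyp.items()}
print(set(props.values()))  # {0.0, 1.0}
```

The relative frequency attaches to the selection event; instantiating it as the probability that the specific H_{0} you happened to draw is true is precisely the fallacious step.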
I discussed this 20 years ago (Mayo 1997a and b; links in the references) and ever since. However, statistical fallacies repeatedly return to fashion in slightly different guises. Nowadays, you’re most likely to see this fallacy within what may be called diagnostic screening models of tests.
It’s not that you can’t play a carnival game of reaching into an urn of nulls (and there are lots of choices for what to put in the urn), and use a Bernoulli model for the chance of drawing a true hypothesis (assuming we even knew the % of true hypotheses, which we do not), but the “event of drawing a true null” is no longer the particular hypothesis one aims to use in computing the probability of data x_{0} under hypothesis H_{0}. In other words, it’s no longer the H_{0} needed for the likelihood portion of the frequentist computation. (Note, too, the selected null would get the benefit of being selected from an urn of nulls where few have been shown false yet: “innocence by association”. See my comment on J. Berger 2003, pp. 19-24.)
In any event, .5 is not the frequentist probability that the selected null H_{0} is true–in those cases where a frequentist prior exists. (I first discussed the nature of legitimate frequentist priors with Erich Lehmann; see the poem he wrote for me as a result in Mayo 1997a).
The diagnostic screening model of tests. The diagnostic screening model of tests has become increasingly commonplace, thanks to Big Data, perverse incentives, nonreplication and all the rest (Ioannidis 2005). As Taleb puts it:
“With big data, researchers have brought cherry-picking to an industrial level”.
Now the diagnostic screening model is apt for various goals–diagnostic screening (for disease) most obviously, but also for TSA bag checks, high throughput studies in genetics and other contexts where the concern is controlling the noise in the network rather than appraising the well-testedness of your research claim. Dichotomies are fine for diagnostics (disease or not, worth further study or not, dangerous bag or not). Forcing scientific inference into a binary basket is what most of us wish to move away from, yet the new screening model dichotomizes results into significant/non-significant, usually at the .05 level. One shouldn’t mix the notions of prevalence, positive predictive value, negative predictive value, etc. from screening with the concepts from statistical testing in science. Yet people do, and there are at least 2 tragicomic results. One is that error probabilities are blamed for disagreeing with measures of completely different things. One journal editor claims the fact that p-values differ from posteriors proves the “invalidity” of p-values.
The second tragicomic result is that inconsistent meanings of type 1 (and 2) error probabilities have found their way into the latest reforms, and into guidebooks for how to avoid inconsistent interpretations of statistical concepts. Whereas there’s a trade-off between type 1 and type 2 error probabilities in Neyman-Pearson style hypothesis tests, this is no longer true when a type 1 error probability is defined as the posterior of H_{0} conditional on rejecting. Topsy-turvy claims about power readily ensue (search this blog under power for numerous examples).
Conventional Bayesian variant. J Berger doesn’t really imagine selecting from an urn of nulls (he claims). Instead, spiked priors come from one of the systems of default or conventional priors. Curiously, he claims that by adopting his recommended conventional priors, frequentists can become more frequentist (than using flawed error probabilities). We get what he calls conditional p-values (or conditional error probabilities). Magician that he is, the result is that frequentist error probabilities are no longer error probabilities, or even frequentist!
How it happens is not entirely clear, but it’s based on his defining a “Frequentist Principle” that demands that a type 1 (or 2) error probability yield the same number as his conventional posterior probability. (See Berger 2003, and my comment in Mayo 2003).
Senn, in a guest post, remarks:
The error that is made is not only to assume that P-values should be what they are not but that when one tries to interpret them in the way that one should not, the previous calibration survives.
It is as if in giving recommendations in dosing children one abandoned a formula based on age and adopted one based on weight but insisted on using the same number of kg one had used for years.
Urn of Nulls. Others appear to be serious about the urn of nulls metaphor (e.g., Colquhoun 2014). Say 50% of the nulls in the urn are imagined to be true. Then, when you select your null, its initial probability of truth is said to be .5. This, however, is to commit the fallacy of probabilistic instantiation.
Two moves are made: (1) it’s admitted it’s an erroneous probabilistic instantiation, but the goal is said to be assessing “science wise error rates” as in a diagnostic screening context. A second move (2) is to claim that a high positive predictive value PPV from the diagnostic model warrants high “epistemic probability”–whatever that is– to the particular case at hand.
The upshot of both is at odds with the goal of restoring scientific integrity. Even if we were to grant these “prevalence rates” (to allude to diagnostic testing), my question is: why would it be relevant to how good a job you did in testing your particular hypothesis, call it H*? Sciences with high “crud factors” (Meehl 1990) might well get a high PPV simply because nearly all their nulls are false. This still wouldn’t be evidence of replication ability, nor of understanding of the phenomenon. It would reward non-challenging thinking, and taking the easiest way out.
Safe Science. We hear it recommended that research focus on questions and hypotheses with high prior prevalence. Of course we’d never know the % of true nulls (many say all nulls are false, although that too is false) and we could cleverly game the description to have suitably high or low prior prevalence. Just think of how many ways you could describe those urns of nulls to get a desired PPV, especially on continuous parameters. Then there’s the heralding of safe science:
Large-scale evidence should be targeted for research questions where the pre-study probability is already considerably high, so that a significant research finding will lead to a post-test probability that would be considered quite definitive (Ioannidis, 2005, p. 0700).
The diagnostic model, in effect, says keep doing what you’re doing: publish after an isolated significant result, possibly with cherry-picking and selection effects to boot, just make sure there’s high enough prior prevalence. That preregistration often makes previous significant results vanish shows the problem isn’t the statistical method but its abuse. Ioannidis has done much to expose bad methods, but not with the diagnostic model he earlier popularized.
In every case of a major advance or frontier science that I can think of, there had been little success in adequately explaining some effect–low prior prevalence. It took Prusiner 10 years of failed experiments to finally transmit the prion for mad cow to chimps. People didn’t believe there could be infection without nucleic acid (some still adhere to the “protein only” hypothesis). He finally won a Nobel Prize, but he would have endured a lot less torture if he’d just gone along to get along, keeping to the central dogma of biology rather than following the results that upended it. However, it’s the researcher who has worked with a given problem, building on results and subjecting them to scrutiny, who understands the phenomenon well enough to not just replicate, but alter the entire process in new ways (e.g., prions are now being linked to Alzheimer’s).
Researchers who have churned out and published isolated significant results, and focused on “research questions where the pre-study probability is already considerably high”, might meet the quota on PPV, but still won’t have the understanding even to show they “know how to conduct an experiment which will rarely fail to give us a statistically significant result”–which was Fisher’s requirement before inferring a genuine phenomenon (Fisher 1947).
Error probabilities are not posterior probabilities. Certainly, there is much more to statistical analysis than P-values but they should be left alone rather than being deformed in some way to become second class Bayesian posterior probabilities. (Senn)
References & Related articles
Berger, J. O. (2003). “Could Fisher, Jeffreys and Neyman have Agreed on Testing?” Statistical Science 18: 1-12.
Berger, J. O. and Sellke, T. (1987). “Testing a point null hypothesis: The irreconcilability of p values and evidence,” (with discussion). J. Amer. Statist. Assoc. 82: 112–139.
Casella, G. and Berger, R. (1987). “Reconciling Bayesian and Frequentist Evidence in the One-sided Testing Problem,” (with discussion). J. Amer. Statist. Assoc. 82: 106–111, 123–139.
Colquhoun, D. (2014) “An investigation of the false discovery rate and the misinterpretation of p-values.” Royal Society Open Science, 2014 1(3): pp. 1-16.
Fisher, R. A., (1956). Statistical Methods and Scientific Inference, Edinburgh: Oliver and Boyd.
Fisher, R. A. (1947). The Design of Experiments, Edinburgh: Oliver and Boyd.
Ioannidis, J. (2005). “Why Most Published Research Findings Are False”. PLoS Medicine 2(8): e124.
Jeffreys, H. (1939). Theory of Probability, Oxford: Oxford University Press.
Mayo, D. (1997a). “Duhem’s Problem, The Bayesian Way, and Error Statistics, or ‘What’s Belief got To Do With It?’” and “Response to Howson and Laudan,” Philosophy of Science 64(1): 222-244 and 323-333. NOTE: This issue only comes up in the “Response”, but it made most sense to include both here.
Mayo, D. (1997b) “Error Statistics and Learning from Error: Making a Virtue of Necessity,” in L. Darden (ed.) Supplemental Issue PSA 1996: Symposia Papers, Philosophy of Science 64: S195-S212.
Mayo, D. (2003). Comment on J. O. Berger’s “Could Fisher, Jeffreys and Neyman Have Agreed on Testing?”, Statistical Science 18: 19-24.
Mayo, D. (2004). “An Error-Statistical Philosophy of Evidence,” in M. Taper and S. Lele (eds.) The Nature of Scientific Evidence: Statistical, Philosophical and Empirical Considerations. Chicago: University of Chicago Press: 79-118.
Mayo (2005). “Philosophy of Statistics” in S. Sarkar and J. Pfeifer (eds.) Philosophy of Science: An Encyclopedia, London: Routledge: 802-815. (Has typos.)
Mayo, D.G. and Cox, D. R. (2006). “Frequentists Statistics as a Theory of Inductive Inference,” Optimality: The Second Erich L. Lehmann Symposium (ed. J. Rojo), Lecture Notes-Monograph series, Institute of Mathematical Statistics (IMS), Vol. 49: 77-97.
Mayo, D. and Kruse, M. (2001). “Principles of Inference and Their Consequences,” in D. Cornfield and J. Williamson (eds.) Foundations of Bayesianism. Dordrecht: Kluwer Academic Publishes: 381-403.
Mayo, D. and Spanos, A. (2011). “Error Statistics” in Philosophy of Statistics , Handbook of Philosophy of Science Volume 7 Philosophy of Statistics, (General editors: Dov M. Gabbay, Paul Thagard and John Woods; Volume eds. Prasanta S. Bandyopadhyay and Malcolm R. Forster.) Elsevier: 1-46.
Meehl, P. E. (1990). Why summaries of research on psychological theories are often uninterpretable. Psychological Reports 66 (1): 195-244.
Pratt, J. (1987). “Testing a point null hypothesis: The irreconcilability of p values and evidence: Comment.” J. Amer. Statist. Assoc. 82: 123-125.
Prusiner, S. (1991). Molecular Biology of Prion Diseases. Science, 252(5012), 1515-1522.
Prusiner, S. B. (2014) Madness and Memory: The Discovery of Prions—a New Biological Principle of Disease, New Haven, Connecticut: Yale University Press.
Spanos, A. (2013). “Who Should Be Afraid of the Jeffreys-Lindley Paradox”.
Taleb, N. (2013). “Beware the Big Errors of Big Data”. Wired.
Related posts:
Prof. Larry Laudan
Lecturer in Law and Philosophy
University of Texas at Austin
“‘Not Guilty’: The Misleading Verdict and How It Fails to Serve either Society or the Innocent Defendant”
Most legal systems in the developed world share a two-tier verdict system: ‘guilty’ and ‘not guilty’. Typically, the standard for a judgment of guilty is set very high while the standard for a not-guilty verdict (if we can call it that) is quite low. That means any level of apparent guilt less than about 90% confidence that the defendant committed the crime leads to an acquittal (90% being the usual gloss on proof beyond a reasonable doubt, although few legal systems venture a definition of BARD that precise). According to conventional wisdom, the major reason for setting the standard as high as we do is the desire, even the moral necessity, to shield the innocent from false conviction.
There is, however, an egregious drawback to a legal system so structured. To wit, a verdict of ‘not guilty’ tells us nothing whatever about whether it is reasonable to believe that the defendant did not commit the crime. It offers no grounds whatever for inferring that an acquitted defendant probably did not commit the crime. That fact alone should make most of us leery about someone acquitted of a felony. Will a bank happily hire someone recently acquitted of a forgery charge? Are the neighbors going to rest easy when one of them was charged with, and then acquitted of, child molestation?
While the current proof standard provides ample protection to the innocent from being falsely convicted (the false positive rate is ~3%), it does little or nothing to protect the reputation of the truly innocent defendants. If properly understood, it fails to send any message to the general public about how they should regard and treat an acquitted defendant because it fails to tell the public whether it’s likely or unlikely that he committed the crime.
It would not be difficult to remedy this centuries-old mess, both for the public and for the acquitted defendant, by employing a three-verdict system, as the Scots have been doing for some time. Their verdicts are: guilty, guilt not proven and innocent. In a Scottish trial, if guilt is proven beyond a reasonable doubt, the defendant is found guilty; if the jury thinks it more likely than not that the defendant committed no crime, his verdict is ‘innocent’; if the jury suspects that the defendant did the crime but is not sure beyond all reasonable doubt, the verdict is ‘guilt not proven’. Both the guilt-not-proven verdict and the innocence verdict are officially acquittals in the sense that those receiving them serve no jail time. (This gives a whole new meaning to the well-known phrase ‘going scot-free’.)
The Scottish verdict pattern serves the interests of both the innocent defendant and the general society. The Scots know that if a defendant received an innocent verdict, then the jury believed it likely that he committed no crime and that he should be treated accordingly. That is both important information for the citizenry and a substantial protection for the innocent defendant himself, since the innocent verdict is in effect an exoneration, entailing the likelihood of his innocence.
On the other hand, the Scottish guilt-not-proven verdict sends out the important message to citizens that no other Anglo-Saxon legal system can; to wit, that the acquitted defendant (with a guilt-not-proven verdict) should be treated warily by society since he was probably the culprit, even though he was neither convicted nor punished.
Interestingly, there is ample use of the intermediary verdict. A Scottish government study of criminal prosecutions in 2005 and 2006 found that 71% of those defendants tried for homicide and acquitted received a ‘guilt-not-proven’ verdict. That means that about 7-in-10 acquittals for murder in Scotland involved defendants regarded by the jurors as having probably committed the crime.[1] In a more recent analysis, the Scottish government reported that in rape cases some 35% of acquittals resulted in ‘guilt not proven’ verdicts. In murder cases, the guilt-not-proven rate was 27% of all acquittals.[2]
It’s worth adding that Scotland’s intermediary verdict gives us access to information on an error whose frequency no other Western legal system can easily compute: to wit, the frequency of false acquittals. It tells us that, at least in Scotland, the rate of false acquittals hovers between 1-in-4 and 1-in-3. That is crucial information for those of us who believe that a legitimate system of inquiry—whether a legal one or otherwise— must get a handle on its error rates. Without knowing that, we cannot possibly figure out whether the distribution of erroneous verdicts is in line with our beliefs about the respective costs of the two errors.
Scottish criminal law has one other interesting feature worthy of mention in this context: a verdict there requires only a majority vote from the 15 citizens who serve as the jury. By contrast, most American states require a unanimous vote among 12 jurors, contributing to a situation in which mistrials are both expensive and common. They are expensive because they usually lead to re-trials, which are rarely cheap. In some jurisdictions in the US, 20% or more of trials end in a hung jury.[3] Not surprisingly, hung juries in Scottish cases are much less frequent.
***
[1] See http://www.scotland.gov.uk/Publications/2006/04/25104019/11. See also the Scottish Government Statistical Bulletin, Crim/2006/Part 11.
[2] See Scottish Government, Criminal Proceedings in Scotland, 2013-14, Table 2B.
[3] A study by Paula Agor et al., (Are Hung Juries a Problem? National Center for State Courts and National Institute of Justice, 2002) found that in Washington, D.C. Superior Courts some 22.4% of jury trials ended in a hung jury; In Los Angeles Superior Courts, the hung jury rate was 19.5%.
ADDITIONAL RESOURCES:
Among Laudan’s books:
1977. Progress and its Problems: Towards a Theory of Scientific Growth
1981. Science and Hypothesis
1984. Science and Values
1990. Science and Relativism: Dialogues on the Philosophy of Science
1996. Beyond Positivism and Relativism
2006. Truth, Error and Criminal Law: An Essay in Legal Epistemology
Here you see my scruffy sketch of Egon drawn 20 years ago for the frontispiece of my book, “Error and the Growth of Experimental Knowledge” (EGEK 1996). The caption is
“I might recall how certain early ideas came into my head as I sat on a gate overlooking an experimental blackcurrant plot… –E.S. Pearson, “Statistical Concepts in Their Relation to Reality”.
He is responding to Fisher to “dispel the picture of the Russian technological bogey”. [i]
So, as I said in my last post, just to make a short story long, I’ve recently been scouring around the history and statistical philosophies of Neyman, Pearson and Fisher for purposes of a book soon to be completed, and I discovered a funny little error about this quote. Only maybe 3 or 4 people alive would care, but maybe someone out there knows the real truth.
OK, so I’d been rereading Constance Reid’s great biography of Neyman, and in one place she interviews Egon about the sources of inspiration for their work. Here’s what Egon tells her:
One day at the beginning of April 1926, down ‘in the middle of small samples,’ wandering among apple plots at East Malling, where a cousin was director of the fruit station, he was ‘suddenly smitten,’ as he later expressed it, with a ‘doubt’ about the justification for using Student’s ratio (the t-statistic) to test a normal mean (quotes are from Pearson in Reid, p. 60).
Soon after, Egon contacted Neyman and their joint work began.
I assumed the meanderings over apple plots were from a different time, and that Egon just had a habit of conducting his deepest statistical thinking while overlooking fruit. Yet it shared certain unique features with the revelation when gazing over at the blackcurrant plot, as in my picture, if only in the date and the great importance he accorded it (although I never recall his saying he was “smitten” before). I didn’t think more about it. Then, late one night last week I grabbed a peculiar book off my shelf that contains a smattering of writings by Pearson for a work he never completed: “Student: A Statistical Biography of William Sealy Gosset” (1990, edited and augmented by Plackett and Barnard, Clarendon, Oxford). The very first thing I open up to is a note by Egon Pearson:
I cannot recall now what was the form of the doubt which struck me at East Malling, but it would naturally have arisen when discussing there the interpretation of results derived from small experimental plots. I seem to visualize myself sitting alone on a gate thinking over the basis of ‘small sample’ theory and ‘mathematical statistics Mark II’ [i.e., Fisher]. When nearly thirty years later (JRSS B, 17, 204 1955), I wrote refuting the suggestion of R.A.F. [Fisher] that the Neyman-Pearson approach to testing statistical hypotheses had arisen in industrial acceptance procedures, the plot which the gate was overlooking had through the passage of time become a blackcurrant one! (Pearson 1990 p. 81)
What? This is weird. So that must mean it wasn’t blackcurrants after all, and Egon is mistaken in the caption under the picture I drew 20 years ago. Yet, he doesn’t say here that it was apples either, only that it had “become a blackcurrant” plot in a later retelling. So, not blackcurrant; putting this clue together with what he told Constance Reid, it must have been apples. So it appears I can no longer quote that “blackcurrant” statement, at least not without explaining that, in all likelihood, it was really apples. If any statistical sleuths out there can corroborate that it was apples, or know the correct fruit that Egon was gazing at (and, come to think of it, why couldn’t it have been both?), I’d be very grateful to know [ii]. I will happily cite you. I know this is a bit of minutiae–don’t say I didn’t warn you [iii]. By contrast, the Pearson paper replying to Fisher is extremely important (and very short). It’s entitled “Statistical Concepts in Their Relation to Reality”. You can read the paper HERE.
[i] Some of the previous lines, and 6 following words:
There was no question of a difference in point of view having ‘originated’ when Neyman ‘re-interpreted’ Fisher’s early work on tests of significance ‘in terms of that technological and commercial apparatus which is known as an acceptance procedure’. …
Indeed, to dispel the picture of the Russian technological bogey, I might recall how certain early ideas came into my head as I sat on a gate overlooking an experimental blackcurrant plot at the East Malling Research Station!–E.S. Pearson, “Statistical Concepts in Their Relation to Reality”
[ii] As Erich Lehmann put it in his EGEK review, Pearson is “the hero of Mayo’s story” because I found in his work, if only in brief discussions, hints, and examples, the key elements for an “inferential” or “evidential” interpretation of Neyman. So I should get the inspirational fruit correct.
[iii] I’m not saying I know the answer isn’t in the book on Student, or someplace else.
Fisher, R. A. (1955), “Statistical Methods and Scientific Induction”, Journal of the Royal Statistical Society, Series B, 17(1): 69-78.
Pearson, E. S. (1955), “Statistical Concepts in Their Relation to Reality”, Journal of the Royal Statistical Society, Series B, 17(2): 204-207.
Reid, C. (1998), Neyman–From Life. Springer.
This is a belated birthday post for E.S. Pearson (11 August 1895 – 12 June 1980). It’s basically a post from 2012 which concerns an issue of interpretation (long-run performance vs probativeness) that’s badly confused these days. I’ve recently been scouring around the history and statistical philosophies of Neyman, Pearson and Fisher for purposes of a book soon to be completed. I recently discovered a little anecdote that calls for a correction in something I’ve been saying for years. While it’s little more than a point of trivia, it’s in relation to Pearson’s (1955) response to Fisher (1955)–the last entry in this post. I’ll wait until tomorrow or the next day to share it, to give you a chance to read the background.
Are methods based on error probabilities of use mainly to supply procedures which will not err too frequently in some long run? (performance). Or is it the other way round: that the control of long-run error properties is of crucial importance for probing the causes of the data at hand? (probativeness). I say no to the former and yes to the latter. This, I think, was also the view of Egon Sharpe (E.S.) Pearson.
Cases of Type A and Type B
“How far then, can one go in giving precision to a philosophy of statistical inference?” (Pearson 1947, 172)
Pearson considers the rationale that might be given to N-P tests in two types of cases, A and B:
“(A) At one extreme we have the case where repeated decisions must be made on results obtained from some routine procedure…
(B) At the other is the situation where statistical tools are applied to an isolated investigation of considerable importance…” (ibid., 170)
In cases of type A, long-run results are clearly of interest, while in cases of type B, repetition is impossible and may be irrelevant:
“In other and, no doubt, more numerous cases there is no repetition of the same type of trial or experiment, but all the same we can and many of us do use the same test rules to guide our decision, following the analysis of an isolated set of numerical data. Why do we do this? What are the springs of decision? Is it because the formulation of the case in terms of hypothetical repetition helps to that clarity of view needed for sound judgment?
Or is it because we are content that the application of a rule, now in this investigation, now in that, should result in a long-run frequency of errors in judgment which we control at a low figure?” (Ibid., 173)
Although Pearson leaves this tantalizing question unanswered, claiming, “On this I should not care to dogmatize”, it is evident from studying how he treats cases of type B that, in his view, “the formulation of the case in terms of hypothetical repetition helps to that clarity of view needed for sound judgment” in learning about the particular case at hand.
“Whereas when tackling problem A it is easy to convince the practical man of the value of a probability construct related to frequency of occurrence, in problem B the argument that ‘if we were to repeatedly do so and so, such and such result would follow in the long run’ is at once met by the commonsense answer that we never should carry out a precisely similar trial again.
Nevertheless, it is clear that the scientist with a knowledge of statistical method behind him can make his contribution to a round-table discussion…” (Ibid., 171).
Pearson gives the following example of a case of type B (from his wartime work), where he claims no repetition is intended:
“Example of type B. Two types of heavy armour-piercing naval shell of the same caliber are under consideration; they may be of different design or made by different firms…. Twelve shells of one kind and eight of the other have been fired; two of the former and five of the latter failed to perforate the plate….”(Pearson 1947, 171)
“Starting from the basis that individual shells will never be identical in armour-piercing qualities, however good the control of production, he has to consider how much of the difference between (i) two failures out of twelve and (ii) five failures out of eight is likely to be due to this inevitable variability. ..” (Ibid.)
We’re interested in considering what other outcomes could have occurred, and how readily, in order to learn what variability alone is capable of producing. As a noteworthy aside, Pearson shows that treating the observed difference (between the two proportions) in one way yields an observed significance level of 0.052; treating it differently (along Barnard’s lines), he gets 0.025 as the (upper) significance level. But in scientific cases, Pearson insists, the difference in error probabilities makes no real difference to substantive judgments in interpreting the results. Only in an unthinking, automatic, routine use of tests would it matter:
“Were the action taken to be decided automatically by the side of the 5% level on which the observation point fell, it is clear that the method of analysis used would here be of vital importance. But no responsible statistician, faced with an investigation of this character, would follow an automatic probability rule.” (ibid., 192)
The two analyses correspond to the tests effectively asking different questions, and if we recognize this, says Pearson, different meanings may be appropriately attached.
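Pearson’s first figure is consistent with the conditional (hypergeometric) treatment: fixing the margins of the 2×2 table, the chance of a split at least as extreme as 2 failures out of 12 versus 5 out of 8 comes to about 0.052. Here is a minimal sketch in Python using only the standard library; the function name is mine, and Barnard’s unconditional treatment (which would yield the 0.025) involves maximizing over a nuisance parameter and is not attempted here:

```python
from math import comb

def exact_upper_level(fail1, n1, fail2, n2):
    """One-sided exact significance level for comparing two failure
    proportions, conditioning on the total number of failures (the
    hypergeometric, Fisher-style treatment): the chance of at least
    fail2 failures among the n2 shells of the second type."""
    total, fails = n1 + n2, fail1 + fail2
    tail = sum(comb(fails, k) * comb(total - fails, n2 - k)
               for k in range(fail2, min(fails, n2) + 1))
    return tail / comb(total, n2)

# Pearson's wartime data: 2 failures in 12 shells vs. 5 failures in 8
print(round(exact_upper_level(2, 12, 5, 8), 3))  # 0.052
```

On these data the conditional calculation gives 0.0521, matching the 0.052 reported for the first treatment; the point of the passage stands, namely that the two treatments ask different questions of the same table.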
Three Steps in the Original Construction of Tests
After setting up the test (or null) hypothesis, and the alternative hypotheses against which “we wish the test to have maximum discriminating power” (Pearson 1947, 173), Pearson defines three steps in specifying tests:
“Step 1. We must specify the experimental probability set, the set of results which could follow on repeated application of the random process used in the collection of the data…
Step 2. We then divide this set [of possible results] by a system of ordered boundaries…such that as we pass across one boundary and proceed to the next, we come to a class of results which makes us more and more inclined on the information available, to reject the hypothesis tested in favour of alternatives which differ from it by increasing amounts” (Pearson 1966a, 173).
“Step 3. We then, if possible[i], associate with each contour level the chance that, if [the null] is true, a result will occur in random sampling lying beyond that level” (ibid.).
Pearson warns that:
“Although the mathematical procedure may put Step 3 before 2, we cannot put this into operation before we have decided, under Step 2, on the guiding principle to be used in choosing the contour system. That is why I have numbered the steps in this order.” (Ibid. 173).
Strict behavioristic formulations jump from step 1 to step 3, after which one may calculate how the test has in effect accomplished step 2. However, the resulting test, while having adequate error probabilities, may have an inadequate distance measure and may even be irrelevant to the hypothesis of interest. This is one reason critics can construct howlers that appear to be licensed by N-P methods, and which make their way from time to time into this blog.
So step 3 remains crucial, even for cases of type [B]. There are two reasons: pre-data planning, which is familiar enough, and, second, post-data scrutiny. Post-data, step 3 enables determining the capability of the test to have detected various discrepancies, departures, and errors, on which a critical scrutiny of the inferences is based. More specifically, the error probabilities are used to determine how well or poorly corroborated, or how severely tested, various claims are, post-data.
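The post-data use of step 3 can be made concrete with a toy computation. For a one-sided test of a normal mean with known σ, the chance the test would have detected a discrepancy γ is Φ(γ√n/σ − z_α). A hedged sketch, with hypothetical numbers and function names of my own choosing:

```python
from math import erf, sqrt

def norm_cdf(x):
    # Standard normal CDF via the error function
    return 0.5 * (1 + erf(x / sqrt(2)))

def detection_capability(gamma, n, sigma=1.0, z_alpha=1.645):
    """Chance a one-sided z-test of H0: mu = 0 (5% level) rejects
    when the true discrepancy from the null is gamma."""
    return 1 - norm_cdf(z_alpha - gamma * sqrt(n) / sigma)

# With n = 100: a discrepancy of 0.5 would almost surely be detected,
# while one of 0.1 would usually be missed.
print(round(detection_capability(0.5, 100), 2))  # ~1.0
print(round(detection_capability(0.1, 100), 2))  # ~0.26
```

A test with high capability to detect γ that nonetheless fails to reject gives grounds for ruling out discrepancies as large as γ; one with low capability warrants no such inference.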
If we can readily bring about statistically significantly higher rates of success with the first type of armour-piercing naval shell than with the second (in the above example), we have evidence the first is superior. Or, as Pearson modestly puts it: the results “raise considerable doubts as to whether the performance of the [second] type of shell was as good as that of the [first]….” (Ibid., 192)[ii]
Still, while error rates of procedures may be used to determine how severely claims have or have not passed, they do not automatically do so; hence, again, the door opens to potential howlers that neither Egon nor, for that matter, Jerzy would have countenanced.
Neyman Was the More Behavioristic of the Two
Pearson was (rightly) considered to have rejected the more behaviorist leanings of Neyman.
Here’s a snippet from an unpublished letter he wrote to Birnbaum (1974) about the idea that the N-P theory admits of two interpretations: behavioral and evidential:
“I think you will pick up here and there in my own papers signs of evidentiality, and you can say now that we or I should have stated clearly the difference between the behavioral and evidential interpretations. Certainly we have suffered since in the way the people have concentrated (to an absurd extent often) on behavioral interpretations”.
In Pearson’s (1955) response to Fisher (blogged here):
“To dispel the picture of the Russian technological bogey, I might recall how certain early ideas came into my head as I sat on a gate overlooking an experimental blackcurrant plot….!” (Pearson 1955, 204)
“To the best of my ability I was searching for a way of expressing in mathematical terms what appeared to me to be the requirements of the scientist in applying statistical tests to his data. After contact was made with Neyman in 1926, the development of a joint mathematical theory proceeded much more surely; it was not till after the main lines of this theory had taken shape with its necessary formalization in terms of critical regions, the class of admissible hypotheses, the two sources of error, the power function, etc., that the fact that there was a remarkable parallelism of ideas in the field of acceptance sampling became apparent. Abraham Wald’s contributions to decision theory of ten to fifteen years later were perhaps strongly influenced by acceptance sampling problems, but that is another story.” (ibid., 204-5).
“It may be readily agreed that in the first Neyman and Pearson paper of 1928, more space might have been given to discussing how the scientific worker’s attitude of mind could be related to the formal structure of the mathematical probability theory….Nevertheless it should be clear from the first paragraph of this paper that we were not speaking of the final acceptance or rejection of a scientific hypothesis on the basis of statistical analysis…. Indeed, from the start we shared Professor Fisher’s view that in scientific enquiry, a statistical test is ‘a means of learning”… (Ibid., 206)
“Professor Fisher’s final criticism concerns the use of the term ‘inductive behavior’; this is Professor Neyman’s field rather than mine.” (Ibid., 207)
__________________________
References:
Pearson, E. S. (1935), The Application of Statistical Methods to Industrial Standardization and Quality Control, London: British Standards Institution.
Pearson, E. S. (1947), “The choice of Statistical Tests illustrated on the Interpretation of Data Classed in a 2×2 Table,” Biometrika 34(1/2): 139-167.
Pearson, E. S. (1955), “Statistical Concepts in Their Relation to Reality”, Journal of the Royal Statistical Society, Series B (Methodological), 17(2): 204-207.
Neyman, J. and Pearson, E. S. (1928), “On the Use and Interpretation of Certain Test Criteria for Purposes of Statistical Inference, Part I.” Biometrika 20(A): 175-240.
[i] In some cases only an upper limit to this error probability may be found.
[ii] Pearson inadvertently switches from number of failures to number of successes in the conclusion of this paper.
[iii] I thank Aris Spanos for locating this work of Pearson’s from 1935
1. PhilSci and StatSci. I’m always glad to come across statistical practitioners who wax philosophical, particularly when Karl Popper is cited. Best of all is when they get the philosophy somewhere close to correct. So, I came across an article by Burnham and Anderson (2014) in Ecology:
“While the exact definition of the so-called ‘scientific method’ might be controversial, nearly everyone agrees that the concept of ‘falsifiability’ is a central tenant [sic] of empirical science (Popper 1959). It is critical to understand that historical statistical approaches (i.e., P values) leave no way to ‘test’ the alternative hypothesis. The alternative hypothesis is never tested, hence cannot be rejected or falsified!… Surely this fact alone makes the use of significance tests and P values bogus. Lacking a valid methodology to reject/falsify the alternative science hypotheses seems almost a scandal.” (Burnham and Anderson p. 629)
Well I am (almost) scandalized by this easily falsifiable allegation! I can’t think of a single “alternative”, whether in a “pure” Fisherian or a Neyman-Pearson hypothesis test (whether explicit or implicit) that’s not falsifiable; nor do the authors provide any. I grant that understanding testability and falsifiability is far more complex than the kind of popularized accounts we hear about; granted as well, theirs is just a short paper.^{[1]} But then why make bold declarations on the topic of the “scientific method and statistical science,” on falsifiability and testability?
We know that literal deductive falsification only occurs with trite examples like “All swans are white”; a single black swan falsifies the universal claim C: all swans are white, whereas observing a single white swan wouldn’t allow inferring C (unless there was only 1 swan, or no variability in color). But Burnham and Anderson are discussing statistical falsification and statistical methods of testing. Moreover, the authors champion a methodology that they say has nothing to do with testing or falsifying: “Unlike significance testing”, the approaches they favor “are not ‘tests,’ are not about testing” (p. 628). I’m not disputing their position that likelihood ratios, odds ratios, and Akaike model selection methods are not about testing, but falsification is all about testing! No tests, no falsification, not even of the null hypotheses (which they presumably agree significance tests can falsify). It seems almost a scandal, and it would be one if critics of statistical testing were held to a more stringent, more severe, standard of evidence and argument than they are.
I may add installments/corrections (certainly on E. Pearson’s birthday Thursday); I’ll update with (i), (ii) and the date.
A bit of background. I view significance tests as only a part of a general statistical methodology of testing, estimation, and modeling that employs error probabilities of methods to control and assess how capable methods are at probing errors, and blocking misleading interpretations of data. I call it an error statistical methodology. I reformulate statistical tests as tools for severe testing. The outputs report on the discrepancies that have and have not been tested with severity. There’s much in Popper I agree with: data x only count as evidence for a claim H_{1} if they constitute an unsuccessful attempt to falsify H_{1}. One does not have evidence for a claim if nothing has been done to rule out ways the claim may be false. I use formal error probabilities to direct a more satisfactory notion of severity than Popper’s.
2. Popper, Fisher-Neyman-Pearson, and falsification.
Popper’s philosophy shares quite a lot with the stringent testing ideas found in Fisher, and also Neyman-Pearson–something Popper himself recognized in the work the authors cite (LSD). Here is Popper:
We say that a theory is falsified only if we have accepted basic statements which contradict it…. This condition is necessary but not sufficient; for we have seen that non-reproducible single occurrences are of no significance to science. Thus a few stray basic statements contradicting a theory will hardly induce us to reject it as falsified. We shall take it as falsified only if we discover a reproducible effect which refutes the theory. In other words, we only accept the falsification if a low level empirical hypothesis which describes such an effect is proposed and corroborated. (Popper LSD, 1959, 203)
Such “a low level empirical hypothesis” is well captured by a statistical claim. Unlike the logical positivists, Popper realized that singular observation statements couldn’t provide the “basic statements” for science. In the same spirit, Fisher warned that in order to use significance tests to legitimately indicate incompatibility with hypotheses, we need not an isolated low P-value, but an experimental phenomenon.
[W]e need, not an isolated record, but a reliable method of procedure. In relation to the test of significance, we may say that a phenomenon is experimentally demonstrable when we know how to conduct an experiment which will rarely fail to give us a statistically significant result. (Fisher 1947, p. 14)
If such statistically significant effects are produced reliably, as Fisher required, they indicate a genuine effect. Conjectured statistical effects are likewise falsified if they contradict data and/or could only be retained through ad hoc saves, verification biases and “exception incorporation”. Moving in stages between data collection, modeling, inferring, and from statistical to substantive hypotheses and back again, learning occurs by a series of piecemeal steps with the same reasoning. The fact that at one stage H_{1} might be the alternative, at another, the test hypothesis, is no difficulty. The logic differs from inductive updating probabilities of a hypothesis, as well as from a comparison of how much more probable H_{1} makes the data than does H_{0}, as in likelihood ratios. These are 2 variants of probabilism.
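Fisher’s demonstrability criterion is easy to see by simulation: when a genuine effect is present and the test is reasonably powerful, significant results recur reliably, while under the null, low P-values arise only at the nominal rate. A rough sketch; the effect size, sample size, and function name are all mine:

```python
import random
from math import sqrt

def fraction_significant(mu, n=30, trials=2000, z_cut=1.645, seed=1):
    """Fraction of repeated experiments in which a one-sided z-test
    (5% level) rejects, sampling n observations from Normal(mu, 1)."""
    rng = random.Random(seed)
    hits = sum(
        1 for _ in range(trials)
        if sum(rng.gauss(mu, 1) for _ in range(n)) / n * sqrt(n) > z_cut
    )
    return hits / trials

# A real effect is 'experimentally demonstrable': the test rarely fails
# to reach significance; with no effect, isolated significant results
# appear only around 5% of the time.
print(fraction_significant(0.8), fraction_significant(0.0))
```

The contrast between the two fractions is exactly Fisher’s point: a phenomenon counts as demonstrated when we know how to conduct an experiment that rarely fails to give a significant result, not when a single low P-value turns up.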
Now there are many who embrace probabilism who deny they need tools to reject or falsify hypotheses. That’s fine. But having declared it a scandal (almost) for a statistical account to lack a methodology to reject/falsify, it’s a bit surprising to learn their account offers no such falsificationist tools. (Perhaps I’m misunderstanding; I invite correction.) For example, the likelihood ratio, they declare, “is an evidence ratio about parameters, given the model and the data. It is the likelihood ratio that defines evidence (Royall 1997)” (Burnham and Anderson, p. 628). They italicize “given” which underscores that these methods begin their work only after models are specified. Richard Royall is mentioned often, but Royall is quite clear that for data to favor H_{1} over H_{0} is not to have supplied evidence against H_{0}. (“the fact that we can find some other hypothesis that is better supported than H does not mean that the observations are evidence against H” (1997, pp.21-2).) There’s no such thing as evidence for or against a single hypothesis for him. But without evidence against H_{0}, one can hardly mount a falsification of H_{0}. Thus, I fail to see how their preferred account promotes falsification. It’s (almost) a scandal.
Maybe all they mean is that “historical” Fisher said the tests have only a null, so the only alternative would be its denial. First, we shouldn’t be limiting ourselves to what Fisher thought, nor keeping an arbitrary distinction between Fisherian tests, N-P tests, and confidence intervals. David Cox is a leading Fisherian, and his tests have either implicit or explicit alternatives. The choice of a test statistic indicates the alternative, even if it’s only directional. In N-P tests, the test hypothesis and the alternative may be swapped. Second, even if one imagines the alternative is limited to either of the following:
(i) the effect is real/ non-spurious, or (ii) a parametric non-zero claim (e.g., μ ≠ 0),
they are still statistically falsifiable. An example of the first came last week. Shock waves were felt in high energy particle physics (HEP) when early indications (from last December) of a genuine new particle—one that would falsify the highly corroborated Standard Model (SM)—were themselves falsified. This was based on falsifying a common statistical alternative in a significance test: the observed “resonance” (a great term) is real. (The “bumps” began to fade with more data [2].) As for case (ii), some of the most important results in science are null results. By means of high precision null hypothesis tests, bounds for statistical parameters are inferred by rejecting (or falsifying) discrepancies beyond the limits tests are capable of detecting. Think of the famous negative result of Michelson-Morley experiments that falsified the “ether” (or aether) of the type ruled out by special relativity, or the famous equivalence principles of experimental GTR. An example of each is briefly touched upon in a paper with David Cox (Mayo and Cox 2006). Of course, background knowledge about the instruments and theories is operative throughout. More typical are the cases where power analysis can be applied, as discussed in this post.
“Nonsignificance Plus High Power Does Not Imply Support for the Null Over the Alternative.”
Perhaps they only mean to say that Fisherian tests don’t directly try to falsify “the effect is real”. They’re supposed to: it should be very difficult to bring about statistically significant results if the world is like H_{0}.
3. Model validation, specification and falsification.
When serious attention is paid to the discovery of new ways to extend models and theories, and to model validation, basic statistical tests are looked to. This is so even for Bayesians, be they ecumenical like George Box, or “falsificationists” like Gelman.
For Box, any account that relies on statistical models requires “diagnostic checks and tests of fit which, I will argue, require frequentist theory significance tests for their formal justification” (Box 1983, p. 57). This leads Box to advocate ecumenism. He asks,
[w]hy can’t all criticism be done using Bayes posterior analysis?…The difficulty with this approach is that by supposing all possible sets of assumptions are known a priori, it discredits the possibility of new discovery. But new discovery is, after all, the most important object of the scientific process (ibid., p. 73).
Listen to Andrew Gelman (2011):
At a philosophical level, I have been persuaded by the arguments of Popper (1959), Kuhn (1970), Lakatos (1978), and others that scientific revolutions arise from the identification and resolution of anomalies. In statistical terms, an anomaly is a misfit of model to data (or perhaps an internal incoherence of the model), and it can be identified by a (Fisherian) hypothesis test without reference to any particular alternative (what Cox and Hinkley 1974 call ‘pure significance testing’)^{[3]} (Gelman 2011, p. 70).
Discovery, model checking and correcting rely on statistical testing, formal or informal.
4. “An explicit, objective criterion of ‘best’ models” using methods that obey the LP (p.628).
Say Burnham and Anderson:
“At a deeper level, P values are not proper evidence as they violate the likelihood principle (Royall 1997)” (p. 627).
A list of pronouncements by Royall follows. What we know at a much deeper level is that any account that obeys the likelihood principle (LP) is not an account that directly assesses or controls the error probabilities of procedures. Control of error probabilities, even approximately, is essential for good tests, and this grows out of a concern, not for controlling error rates in the long run, but for evaluating how well tested models and hypotheses are with the data in hand. As with others who embrace the LP, the authors reject adjusting for selection effects, data dredging, multiple testing, etc.–gambits that alter the sampling distribution and, handled cavalierly, are responsible for much of the bad statistics we see. By the way, reference or default Bayesians also violate the LP. You can’t just make declarations about “proper evidence” without proper evidence. (There’s quite a lot on the LP on this blog; see also links to posts below the references.)
Burnham and Anderson are concerned with how old a method is. Oh, the horrors of being a “historical” method. Appealing to ridicule (“progress should not have to ride in a hearse”) is no argument. Besides, it’s manifestly silly to suppose anyone uses a single method, or that error statistical tests haven’t been advanced, as well as reinterpreted, since Fisher’s day. Moreover, the LP is itself a historical, baked-in principle suited to ye olde logical positivist days when empirical observations were treated as “given”. Within that statistical philosophy, it was typical to hold that the data speak for themselves, and that questionable research practices such as cherry-picking, data dredging, data-dependent selections, and optional stopping are irrelevant to “what the data are saying”! It’s redolent of a time when statistical philosophy sought a single, “objective” evidential relationship to hold between given data, model and hypotheses. Holders of the LP still say this, and the authors are no exception.
[The LP was, I believe, articulated by George Barnard who announced he rejected it at the 1959 Savage forum for all but predesignated simple hypotheses. If you have a date or correction, please let me know. 8/10]
The truth is that one of the biggest problems behind the “replication crisis” is the violation of some age-old truisms about science. It’s the consumers of bad science (in medicine at least) who are likely to ride in a hearse. There’s something wistful about remarks we hear from some quarters now. Listen to Ben Goldacre (2016) in Nature: “The basics of a rigorous scientific method were worked out many years ago, but there is now growing concern about systematic structural flaws that undermine the integrity of published data,” which he follows with a list of selective publication, data dredging and all the rest, “leading collectively to the ‘replication crisis’.”
He’s trying to remind us that the rules for good science were all in place long ago and somehow are now being ignored or trampled over, in some fields. Wherever there’s a legitimate worry about “perverse incentives,” it’s not a good idea to employ methods where selection effects vanish.
5. Concluding comments
I don’t endorse many of the applications of significance tests in the literature, especially in the social sciences. Many p-values reported are vitiated by fallacious interpretations (going from a statistical to substantive effect), violated assumptions, and biasing selection effects. I’ve long recommended a reformulation of the tools to avoid fallacies of rejection and non-rejection. In some cases, sadly, better statistical inference cannot help, but that doesn’t make me want to embrace methods that do not directly pick up on the effects of biasing selections. Just the opposite.
If the authors are serious about upholding Popperian tenets of good science, then they’ll want to ensure the claims they make can be regarded as having passed a stringent probe into their falsity. I invite comments and corrections.
(Look for updates.)
____________
^{[1]}They are replying to an article by Paul Murtaugh. See the link to his paper here.
[2]http://www.physicsmatt.com/blog/2016/8/5/standard-model-1-diphotons-0
^{[3]}Gelman continues: “At the next stage, we see science–and applied statistics–as resolution of anomalies via the creation of improved models which often include their predecessors as special cases. This view corresponds closely to the error-statistics idea of Mayo (1996).”
Related Blogposts
LAW OF LIKELIHOOD: ROYALL
8/29/14: BREAKING THE LAW! (of likelihood): to keep their fit measures in line (A), (B 2nd)
10/10/14: BREAKING THE (Royall) LAW! (of likelihood) (C)
11/15/14: Why the Law of Likelihood is bankrupt—as an account of evidence
11/25/14: How likelihoodists exaggerate evidence from statistical tests
P-VALUES EXAGGERATE
7/14/14: “P-values overstate the evidence against the null”: legit or fallacious? (revised)
7/23/14: Continued: “P-values overstate the evidence against the null”: legit or fallacious?
5/12/16: Excerpts from S. Senn’s Letter on “Replication, p-values and Evidence”
Stephen Senn
Head of Competence Center for Methodology and Statistics (CCMS)
Luxembourg Institute of Health
Twitter @stephensenn
Painful dichotomies
The tweet read “Featured review: Only 10% people with tension-type headaches get a benefit from paracetamol” and immediately I thought, ‘how would they know?’ and almost as quickly decided, ‘of course they don’t know, they just think they know’. Sure enough, on following up the link to the Cochrane Review in the tweet it turned out that, yet again, the deadly mix of dichotomies and numbers needed to treat had infected the brains of researchers to the extent that they imagined that they had identified personal response. (See Responder Despondency for a previous post on this subject.)
The bare facts they established are the following:
The International Headache Society recommends the outcome of being pain free two hours after taking a medicine. The outcome of being pain free or having only mild pain at two hours was reported by 59 in 100 people taking paracetamol 1000 mg, and in 49 out of 100 people taking placebo.
and the false conclusion they immediately asserted is the following:
This means that only 10 in 100 or 10% of people benefited because of paracetamol 1000 mg.
To understand the fallacy, look at the accompanying graph. This shows the simplest possible model describing events over time that is consistent with the ‘facts’. The model in question is the exponential distribution and what is shown is the cumulative probability of response for individuals suffering from tension headache depending on whether they are treated with placebo or paracetamol. The dashed vertical line is at the arbitrary International Headache Society critical time point of 2 hours. This intersects the placebo curve at 0.49 and the paracetamol curve at 0.59, exactly the figures quoted in the Cochrane review.
The model that the diagram represents is simplistic and almost certainly false. It is what would apply if it were the case that all patients given placebo had the same probability over time of headache resolution and ditto for paracetamol and an exponential model applied. However, the point is that for all we know it is true. It would take careful measurement over time for repeated headaches of the same individuals to establish the element of personal response (Senn 2016).
The curve given for placebo is what we would expect for the simple exponential model if the mean time to response were 2.97 hours when a patient is given placebo. The curve for paracetamol has a mean of 2.24 hours. It is important to understand that these figures are perfectly compatible with being the long-term average response times (that is to say, averaged over many, many headaches) for every patient, which means that any patient, at any time, feeling the symptoms of headache could expect to shorten that headache by 2.97 − 2.24 = 0.73 hrs, or just under 45 minutes.
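The two percentages quoted in the Cochrane review fall straight out of the exponential model with these means. A minimal sketch (the means of 2.97 and 2.24 hours are those given above; the model is, as stressed, simplistic and almost certainly false, but consistent with the facts):

```python
import math

# Simple exponential model for time-to-resolution of a headache:
# P(response by time t) = 1 - exp(-t / mean_time)
def p_response(t, mean_time):
    return 1 - math.exp(-t / mean_time)

p_placebo = p_response(2, 2.97)      # ~0.49, the placebo figure in the review
p_paracetamol = p_response(2, 2.24)  # ~0.59, the paracetamol figure
benefit_hrs = 2.97 - 2.24            # 0.73 hours, just under 45 minutes
```

Evaluating both curves at the International Headache Society's 2-hour cut point reproduces the 49% and 59% figures exactly, even though under this model every patient benefits identically.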
Is this a benefit or not? I would say, ‘yes’. And that means that a perfectly logical way to describe the results is to say, ‘for all we know, any patient taking paracetamol for headache will benefit. The size of that benefit is an increase of 10 percentage points in the probability of resolution at 2 hours, or a reduction of mean headache time of three-quarters of an hour’.
The latter, of course, depends on the exponential model being appropriate and it may be that some alternative can be found by careful analysis of the data. The point is, however, that the claim that only 10% will benefit by taking paracetamol is completely unjustified.
Unfortunately, the combination of arbitrary dichotomies (Senn 2003) and naïve analysis continues to fuel misunderstandings regarding personalised medicine.
Acknowledgement:
This work was funded by grant 602552 for the IDEAL project under the European Union FP7 programme and support from the programme is gratefully acknowledged.
MONTHLY MEMORY LANE: 3 years ago: July 2013. I mark in red three posts that seem most apt for general background on key issues in this blog, excluding those reblogged recently [1], and in green up to 3 others I’d recommend[2]. Posts that are part of a “unit” or a group of “U-Phils”(you [readers] philosophize) count as one.
July 2013
[1] Monthly memory lanes began at the blog’s 3-year anniversary in Sept, 2014.
[2] New Rule, July 30, 2016.
Today is Karl Popper’s birthday. I’m linking to a reading from his Conjectures and Refutations[i] along with an undergraduate item I came across: Popper Self-Test Questions. It includes multiple choice questions, quotes to ponder, and thumbnail definitions at the end[ii].
Blog Readers who wish to send me their answers will have their papers graded (at least try the multiple choice; if you’re unsure, do the reading). [Use the comments or e-mail.]
[i] Popper reading from Conjectures and Refutations
[ii] I might note the “No-Pain philosophy” (3 part) Popper posts from this blog: parts 1, 2, and 3.
HAPPY BIRTHDAY POPPER!
REFERENCE:
Popper, K. (1962). Conjectures and Refutations: The Growth of Scientific Knowledge. New York: Basic Books.
Taboos about power nearly always stem from misuse of power analysis. Sander Greenland (2012) has a paper called “Nonsignificance Plus High Power Does Not Imply Support for the Null Over the Alternative.” I’m not saying Greenland errs; the error would be made by anyone who interprets power analysis in a manner giving rise to Greenland’s objection. So what’s (ordinary) power analysis?
(I) Listen to Jacob Cohen (1988) introduce Power Analysis
“PROVING THE NULL HYPOTHESIS. Research reports in the literature are frequently flawed by conclusions that state or imply that the null hypothesis is true. For example, following the finding that the difference between two sample means is not statistically significant, instead of properly concluding from this failure to reject the null hypothesis that the data do not warrant the conclusion that the population means differ, the writer concludes, at least implicitly, that there is no difference. The latter conclusion is always strictly invalid, and is functionally invalid as well unless power is high. The high frequency of occurrence of this invalid interpretation can be laid squarely at the doorstep of the general neglect of attention to statistical power in the training of behavioral scientists.
What is really intended by the invalid affirmation of a null hypothesis is not that the population ES is literally zero, but rather that it is negligible, or trivial. This proposition may be validly asserted under certain circumstances. Consider the following: for a given hypothesis test, one defines a numerical value i (or iota) for the ES, where i is so small that it is appropriate in the context to consider it negligible (trivial, inconsequential). Power (1 – b) is then set at a high value, so that b is relatively small. When, additionally, a is specified, n can be found. Now, if the research is performed with this n and it results in nonsignificance, it is proper to conclude that the population ES is no more than i, i.e., that it is negligible; this conclusion can be offered as significant at the b level specified. In much research, “no” effect (difference, correlation) functionally means one that is negligible; “proof” by statistical induction is probabilistic. Thus, in using the same logic as that with which we reject the null hypothesis with risk equal to a, the null hypothesis can be accepted in preference to that which holds that ES = i with risk equal to b. Since i is negligible, the conclusion that the population ES is not as large as i is equivalent to concluding that there is “no” (nontrivial) effect. This comes fairly close and is functionally equivalent to affirming the null hypothesis with a controlled error rate (b), which, as noted above, is what is actually intended when null hypotheses are incorrectly affirmed” (J. Cohen 1988, p. 16).
Here Cohen imagines the researcher sets the size of a negligible discrepancy ahead of time–something not always available. Even where a negligible i may be specified, the power to detect that i may be low rather than high. Two important points can still be made:
Now to tell what’s true about Greenland’s concern that “Nonsignificance Plus High Power Does Not Imply Support for the Null Over the Alternative.”
(II) The first step is to understand the assertion, giving the most generous interpretation. It deals with nonsignificance, so our ears are perked for a fallacy of nonrejection or nonsignificance. We know that “high power” is an incomplete concept, so he clearly means high power against “the alternative”.
For a simple example of Greenland’s phenomenon, consider the Normal test we’ve discussed a lot on this blog, T+: H_{0}: µ ≤ 12 versus H_{1}: µ > 12, with σ = 2, n = 100. The test statistic is Z = √100(M – 12)/2, where M is the sample mean. With α = .025, the cut-off for declaring .025 significance is M*_{.025} = 12 + 2(2)/√100 = 12.4 (rounding the z-value to 2 rather than 1.96 to use a simple Figure below).
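The cut-off is quick to check. A minimal sketch, using the test's stated ingredients (σ = 2, n = 100, and the z-value rounded to 2 as above):

```python
import math

# Test T+: H0: mu <= 12 vs H1: mu > 12, sigma = 2, n = 100, alpha = .025
sigma, n = 2, 100
sigma_M = sigma / math.sqrt(n)    # standard error of the mean: 0.2
z_alpha = 2                       # 1.96 rounded to 2, following the text
cutoff = 12 + z_alpha * sigma_M   # M*_.025 = 12.4
```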
[Note: The thick black vertical line in the Figure, which I haven’t gotten to yet, is going to be the observed mean, M_{0 }= 12.35. It’s a bit lower than the cut-off at 12.4.]
Now a title like Greenland’s is supposed to signal some problem. What is it? The statistical part just boils down to noting that the observed mean M_{0 }(e.g., 12.35) may fail to make it to the cut-off M* (here 12.4), and yet be closer to an alternative against which the test has high power (e.g., 12.6) than it is to the null value, here 12. This happens because the Type 2 error probability is allowed to be greater than the Type 1 error probability (here .025).
Abbreviate the alternative against which test T+ has .84 power as µ^{.84}, as I’ve often done. (See, for example, this post.) That is, the probability that test T+ rejects the null when µ = µ^{.84} is .84, i.e., POW(T+, µ^{.84}) = .84. One of our power short-cut rules tells us:
µ^{.84} = M* + 1σ_{M} = 12.4 + .2 = 12.6,
where σ_{M} := σ/√100 = .2.
Note: the Type 2 error probability in relation to alternative µ = 12.6 is .16. This is the area to the left of 12.4 under the red curve above: Pr(M < 12.4; μ = 12.6) = Pr(Z < -1) = .16 = β(12.6).
µ^{.84} exceeds the null value by 3σ_{M}, so any observed mean that exceeds 12 by more than 1.5σ_{M} but less than 2σ_{M} gives an example of Greenland’s phenomenon. [Note: the previous sentence corrects an earlier wording.] In T+, values 12.3 < M_{0} < 12.4 do the job. Pick M_{0} = 12.35. That value is indicated by the black vertical line in the figure above.
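The phenomenon can be confirmed in a couple of lines, using the example's own numbers (M_{0} = 12.35, cut-off 12.4, alternative 12.6):

```python
M0 = 12.35               # observed mean from the example
cutoff = 12.4            # .025 cut-off for test T+
null, alt = 12.0, 12.6   # null value; alternative against which power is .84

nonsignificant = M0 < cutoff                    # result misses the cut-off
closer_to_alt = abs(M0 - alt) < abs(M0 - null)  # 0.25 < 0.35: nearer to 12.6
```

Both conditions hold at once: the result is statistically insignificant, yet the observed mean lies closer to the high-powered alternative than to the null.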
Having established the phenomenon, your next question is: so what?
It would be problematic if power analysis took the insignificant result as evidence for μ = 12 (i.e., 0 discrepancy from the null). I’ve no doubt some try to construe it as such, and that Greenland has been put in the position of needing to correct them. This is the reverse of the “mountains out of molehills” fallacy. It’s making molehills out of mountains. It’s not uncommon when a nonsignificant observed risk increase is taken as evidence that risks are “negligible or nonexistent” or the like. The data are looked at through overly rosy glasses (or bottle). Power analysis enters to avoid taking no evidence of increased risk as evidence of no risk. Its reasoning only licenses μ < µ^{.84} where .84 was chosen for “high power”. From what we see in Cohen, he does not give a green light to the fallacious use of power analysis.
(III) Now for how the inference from power analysis is akin to significance testing (as Cohen observes). Let μ^{1−β} be the alternative against which test T+ has high power, (1 – β). Power analysis sanctions the inference that would accrue if we switched the null and alternative, yielding the one-sided test in the opposite direction, T-, we might call it. That is, T- tests H_{0}: μ ≥ μ^{1−β} versus H_{1}: μ < μ^{1−β} at the β level. The test rejects H_{0} (at level β) when M < μ^{1−β} – z_{β}σ_{M}. Such a significant result would warrant inferring μ < μ^{1−β} at significance level β. Using power analysis doesn’t require making this switcheroo, which might seem complicated. The point is that there’s really no new reasoning involved in power analysis, which is why the members of the Fisherian tribe manage it without even mentioning power.
EXAMPLE. Use μ^{.84} in test T+ (α = .025, n = 100, σ_{M }= .2) to create test T-. Test T+ has .84 power against μ^{.84} = 12 + 3σ_{M} = 12.6 (with our usual rounding). So, test T- is
H_{0}: μ ≥ 12.6 versus H_{1}: μ < 12.6
and a result is statistically significantly smaller than 12.6 at level .16 whenever sample mean M < 12.6 – 1σ_{M} = 12.4. To check, note (as when computing the Type 2 error probability of test T+) that
Pr(M < 12.4; μ = 12.6) = Pr(Z < -1) = .16 = β. In test T-, this serves as the Type 1 error probability.
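The .16 figure is just the standard normal CDF at −1, computable from the standard library alone (a sketch; `phi` here is an assumed helper built on `math.erf`, not a named function from the post):

```python
import math

def phi(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

beta = phi(-1)  # Pr(Z < -1) ~ 0.1587, rounded to .16 in the text
```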
So ordinary power analysis follows the very same logic as significance testing. [i] Here’s a qualitative version of the logic of ordinary power analysis.
Ordinary Power Analysis: If data x are not statistically significantly different from H_{0}, and the power to detect discrepancy γ is high, then x indicates that the actual discrepancy is no greater than γ.[ii]
Or, another way to put this:
If data x are not statistically significantly different from H_{0}, then x indicates that the underlying discrepancy (from H_{0}) is no greater than γ, just to the extent that the power to detect discrepancy γ is high.
************************************************************************************************
[i] Neyman, we’ve seen, was an early power analyst. See, for example, this post.
[ii] Compare power analytic reasoning with severity reasoning from a negative or insignificant result.
POWER ANALYSIS: If Pr(d > c_{α}; µ’) = high and the result is not significant, then it’s evidence µ < µ’
SEVERITY ANALYSIS (for an insignificant result): If Pr(d > d_{0}; µ') = high and the result is not significant, then it’s evidence µ < µ'.
Severity replaces the pre-designated cut-off c_{α} with the observed d_{0}. Thus we obtain the same result while remaining in the Fisherian tribe. We still abide by power analysis, though, since if Pr(d > d_{0}; µ') = high then Pr(d > c_{α}; µ') = high, at least in a sensible test like T+. In other words, power analysis is conservative. It gives a sufficient but not a necessary condition for warranting the bound µ < µ'. But why view a miss as good as a mile? Power is too coarse.
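The conservativeness is easy to see numerically with the running example (observed M_{0} = 12.35, cut-off 12.4, alternative µ' = 12.6). A sketch, again using an assumed `phi` helper for the standard normal CDF:

```python
import math

def phi(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

sigma_M = 0.2                 # standard error of the mean in test T+
mu_prime = 12.6               # alternative against which T+ has .84 power
cutoff, M0 = 12.4, 12.35      # pre-designated cut-off vs observed mean

pow_based = 1 - phi((cutoff - mu_prime) / sigma_M)  # Pr(M > cutoff; mu') ~ .84
sev = 1 - phi((M0 - mu_prime) / sigma_M)            # Pr(M > M0; mu') ~ .89
# sev >= pow_based: the severity assessment, tied to the actual outcome,
# is at least as strong as the coarser power-based bound.
```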
Cohen, J. 1988. Statistical Power Analysis for the Behavioral Sciences. 2nd ed. Hillsdale, NJ: Erlbaum. [Link to quote above: p. 16]
Greenland, S. 2012. ‘Nonsignificance Plus High Power Does Not Imply Support for the Null Over the Alternative’, Annals of Epidemiology 22, pp. 364-8. Link to paper: Greenland (2012)
Date: July 17, 2016
Location: London School of Economics, London
Website: http://www.lse.ac.uk/philosophy/events/carlo-rovelli-public-lecture/
Start Date: July 21, 2016
End Date: July, 22, 2016
Location: University of Cambridge
Website: http://www.crassh.cam.ac.uk/events/26814
Start Date: September 6, 2016
End Date: September 9, 2016
Location: University of Exeter, UK
Website: http://www.philsci.eu/epsa17
Submission Deadline: December 16, 2016
Flyer: Structure.pdf
Submission Deadline: September 5, 2016
Flyer: CFPLinconscio_ENG.pdf
Submission Deadline: July 17, 2016
Start Date: October 3, 2016
End Date: October 7, 2016
Location: San Sebastian, Spain
Flyer: Flier-XIIInternationalOntologyCongress.pdf
Start Date: October 12, 2016
End Date: October 13, 2016
Location: Leuven, Belgium
Flyer: TheScienceOfEvolutionAndTheEvolutionOftheSciences.pdf
Start Date: September 5, 2016
End Date: September 9, 2016
Location: Urbino, Italy
Website: https://sites.google.com/site/centroricerchecirfis/sep-2016
Start Date: September 23, 2016
End Date: September 24, 2016
Location: University of Pittsburgh, PA
Website: http://www.pitt.edu/~pittcntr/Events
MONTHLY MEMORY LANE: 3 years ago: June 2013. I mark in red three posts that seem most apt for general background on key issues in this blog, excluding those reblogged recently [1]. Posts that are part of a “unit” or a group of “U-Phils”(you [readers] philosophize) count as one. Here I grouped 6/5 and 6/6.
June 2013
[1] Monthly memory lanes began at the blog’s 3-year anniversary in Sept, 2014.