How can a discipline, central to science and to critical thinking, have two methodologies, two logics, two approaches that frequently give substantively different answers to the same problems? … Is complacency in the face of contradiction acceptable for a central discipline of science? (Donald Fraser 2011, p. 329)
We [statisticians] are not blameless … we have not made a concerted professional effort to provide the scientific world with a unified testing methodology. (J. Berger 2003, p. 4)
From the aerial perspective of a hot-air balloon, we may see contemporary statistics as a place of happy multiplicity: the wealth of computational ability allows for the application of countless methods, with little hand-wringing about foundations. Doesn’t this show we may have reached “the end of statistical foundations”? One might have thought so. Yet, descending close to a marshy wetland, and especially scratching a bit below the surface, reveals unease on all sides. The false dilemma between probabilism and long-run performance lets us get a handle on it. In fact, the Bayesian versus frequentist dispute arises as a dispute between probabilism and performance. This gets to my second reason why the time is right to jump back into these debates: the “statistics wars” present new twists and turns. Rival tribes are more likely to live closer and in mixed neighborhoods since around the turn of the century. Yet, to the beginning student, it can appear as a jungle.
Statistics Debates: Bayesian versus Frequentist
These days there is less distance between Bayesians and frequentists, especially with the rise of objective [default] Bayesianism, and we may even be heading toward a coalition government. (Efron 2013, p. 145)
A central way to formally capture probabilism is by means of the formula for conditional probability, where Pr(x) > 0:

Pr(H|x) = Pr(H and x)/Pr(x).

Since Pr(H and x) = Pr(x|H)Pr(H) and Pr(x) = Pr(x|H)Pr(H) + Pr(x|~H)Pr(~H), we get:

Pr(H|x) = Pr(x|H)Pr(H)/[Pr(x|H)Pr(H) + Pr(x|~H)Pr(~H)],

where ~H is the denial of H. It would be cashed out in terms of all rivals to H within a frame of reference. Some call it Bayes’ Rule or inverse probability. Leaving probability uninterpreted for now, if the data are very improbable given H, then our probability in H after seeing x, the posterior probability Pr(H|x), may be lower than the probability in H prior to x, the prior probability Pr(H). Bayes’ Theorem is just a theorem stemming from the definition of conditional probability; it is only when statistical inference is thought to be encompassed by it that it becomes a statistical philosophy. Using Bayes’ Theorem doesn’t make you a Bayesian.
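Leaving interpretation aside, the computation itself is mechanical. Here is a minimal sketch, with wholly hypothetical numbers for the prior and the two likelihoods:

```python
def posterior(prior_H, lik_H, lik_not_H):
    """Bayes' Theorem: Pr(H|x) = Pr(x|H)Pr(H) / Pr(x),
    with Pr(x) expanded over H and ~H."""
    marginal_x = lik_H * prior_H + lik_not_H * (1 - prior_H)
    return lik_H * prior_H / marginal_x

# Hypothetical inputs: Pr(H) = 0.2, Pr(x|H) = 0.05, Pr(x|~H) = 0.5.
# The data are far more probable under ~H, so the posterior drops below the prior.
print(posterior(0.2, 0.05, 0.5))  # about 0.024, down from the prior 0.2
```

As the text notes, running this computation commits you to nothing philosophically; it becomes a statistical philosophy only when inference is equated with it.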
Larry Wasserman, a statistician and master of brevity, boils it down to a contrast of goals. According to him (2012b):
The Goal of Frequentist Inference: Construct procedure with frequentist guarantees [i.e., low error rates].
The Goal of Bayesian Inference: Quantify and manipulate your degrees of beliefs. In other words, Bayesian inference is the Analysis of Beliefs.
At times he suggests we use B(H) for belief and F(H) for frequencies. The distinctions in goals are too crude, but they give a feel for what is often regarded as the Bayesian–frequentist controversy. However, they present us with the false dilemma (performance or probabilism) I’ve said we need to get beyond.
Today’s Bayesian–frequentist debates clearly differ from those of some years ago. In fact, many of the same discussants, who only a decade ago were arguing for the irreconcilability of frequentist P-values and Bayesian measures, are now smoking the peace pipe, calling for ways to unify and marry the two. I want to show you what really drew me back into the Bayesian–frequentist debates sometime around 2000. If you lean over the edge of the gondola, you can hear some Bayesian family feuds starting around then or a bit after. Principles that had long been part of the Bayesian hard core are being questioned or even abandoned by members of the Bayesian family. Suddenly sparks are flying, mostly kept shrouded within Bayesian walls, but nothing can long be kept secret even there. Spontaneous combustion looms. Hard-core subjectivists are accusing the increasingly popular “objective (non-subjective)” and “reference” Bayesians of practicing in bad faith; the new frequentist–Bayesian unificationists are taking pains to show they are not subjective; and some are calling the new Bayesian kids on the block “pseudo-Bayesian.” Then there are the Bayesians camping somewhere in the middle (or perhaps out in left field) who, though they still use the Bayesian umbrella, are flatly denying the very idea that Bayesian updating fits anything they actually do in statistics. Obeisance to Bayesian reasoning remains, but on some kind of a priori philosophical grounds. Let’s start with the unifications.
While subjective Bayesianism offers an algorithm for coherently updating prior degrees of belief in possible hypotheses H_{1}, H_{2}, …, H_{n}, these unifications fall under the umbrella of non-subjective Bayesian paradigms. Here the prior probabilities in hypotheses are not taken to express degrees of belief but are given by various formal assignments, ideally to have minimal impact on the posterior probability. I will call such Bayesian priors default. Advocates of unifications are keen to show that (i) default Bayesian methods have good performance in a long series of repetitions – so probabilism may yield performance; or alternatively, (ii) frequentist quantities are similar to Bayesian ones (at least in certain cases) – so performance may yield probabilist numbers. Why is this not bliss? Why are so many from all sides dissatisfied?
True blue subjective Bayesians are understandably unhappy with non-subjective priors. Rather than quantifying prior beliefs, non-subjective priors serve as primitives or conventions for obtaining posterior probabilities. Take Jay Kadane (2008):
The growth in use and popularity of Bayesian methods has stunned many of us who were involved in exploring their implications decades ago. The result … is that there are users of these methods who do not understand the philosophical basis of the methods they are using, and hence may misinterpret or badly use the results … No doubt helping people to use Bayesian methods more appropriately is an important task of our time. (p. 457, emphasis added)
I have some sympathy here: Many modern Bayesians aren’t aware of the traditional philosophy behind the methods they’re buying into. Yet there is not just one philosophical basis for a given set of methods. This takes us to one of the most dramatic shifts in contemporary statistical foundations. It had long been assumed that only subjective or personalistic Bayesianism had a shot at providing genuine philosophical foundations, but you’ll notice that groups holding this position, while they still dot the landscape in 2018, have been gradually shrinking. Some Bayesians have come to question whether the widespread use of methods under the Bayesian umbrella, however useful, indicates support for subjective Bayesianism as a foundation.
Marriages of Convenience?
The current frequentist–Bayesian unifications are often marriages of convenience; statisticians rationalize them less on philosophical than on practical grounds. For one thing, some are concerned that methodological conflicts are bad for the profession. For another, frequentist tribes, contrary to expectation, have not disappeared. Ensuring that accounts can control their error probabilities remains a desideratum that scientists are unwilling to forgo. Frequentists have an incentive to marry as well. Lacking a suitable epistemic interpretation of error probabilities – significance levels, power, and confidence levels – frequentists are constantly put on the defensive. Jim Berger (2003) proposes a construal of significance tests on which the tribes of Fisher, Jeffreys, and Neyman could agree, yet none of the chiefs of those tribes concur (Mayo 2003b). The success stories are based on agreements on numbers that are not obviously true to any of the three philosophies. Beneath the surface – while it’s not often said in polite company – the most serious disputes live on. I plan to lay them bare.
If it’s assumed an evidential assessment of hypothesis H should take the form of a posterior probability of H – a form of probabilism – then P-values and confidence levels are applicable only through misinterpretation and mistranslation. Resigned to live with P-values, some are keen to show that construing them as posterior probabilities is not so bad (e.g., Greenland and Poole 2013). Others focus on long-run error control, but cede the territory wherein probability captures the epistemological ground of statistical inference. Why assume significance levels and confidence levels lack an authentic epistemological function? I say they have one: to secure and evaluate how well probed and how severely tested claims are.
Eclecticism and Ecumenism
If you look carefully between dense forest trees, you can distinguish unification country from lands of eclecticism (Cox 1978) and ecumenism (Box 1983), where tools first constructed by rival tribes are separate, and more or less equal (for different aims). Current-day eclecticisms have a long history – the dabbling in tools from competing statistical tribes has not been thought to pose serious challenges. For example, frequentist methods have long been employed to check or calibrate Bayesian methods (e.g., Box 1983); you might test your statistical model using a simple significance test, say, and then proceed to Bayesian updating. Others suggest scrutinizing a posterior probability or a likelihood ratio from an error probability standpoint. What this boils down to will depend on the notion of probability used. If a procedure frequently gives high probability to claim C even if C is false, severe testers deny convincing evidence has been provided, and never mind about the meaning of probability. One argument is that throwing different methods at a problem is all to the good, that it increases the chances that at least one will get it right. This may be so, provided one understands how to interpret competing answers. Using multiple methods is valuable when a shortcoming of one is rescued by a strength in another. For example, when randomized studies are used to expose the failure to replicate observational studies, there is a presumption that the former are capable of discerning problems with the latter. But what happens if one procedure fosters a goal that is not recognized or is even opposed by another? Members of rival tribes are free to sneak ammunition from a rival’s arsenal – but what if at the same time they denounce the rival method as useless or ineffective?
Decoupling. On the horizon is the idea that statistical methods may be decoupled from the philosophies in which they are traditionally couched. In an attempted meeting of the minds (Bayesian and error statistical), Andrew Gelman and Cosma Shalizi (2013) claim that “implicit in the best Bayesian practice is a stance that has much in common with the error-statistical approach of Mayo” (p. 10). In particular, Bayesian model checking, they say, uses statistics to satisfy Popperian criteria for severe tests. The idea of error statistical foundations for Bayesian tools is not as preposterous as it may seem. The concept of severe testing is sufficiently general to apply to any of the methods now in use. On the face of it, any inference, whether to the adequacy of a model or to a posterior probability, can be said to be warranted just to the extent that it has withstood severe testing. Where this will land us is still futuristic.
Why Our Journey?
We have all, or nearly all, moved past these old [Bayesian–frequentist] debates, yet our textbook explanations have not caught up with the eclecticism of statistical practice. (Kass 2011, p. 1)
When Kass proffers “a philosophy that matches contemporary attitudes,” he finds resistance to his big tent. Being hesitant to reopen wounds from old battles does not heal them. Distilling them in inoffensive terms just leads to the marshy swamp. Textbooks can’t “catch up” by soft-pedaling competing statistical accounts. The unhealed disputes show up in the current problems of scientific integrity, irreproducibility, and questionable research practices, and in the swirl of methodological reforms and guidelines that spin their way down from journals and reports.
From an elevated altitude we see how it occurs. Once high-profile failures of replication spread to biomedicine, and other “hard” sciences, the problem took on a new seriousness. Where does the new scrutiny look? By and large, it collects from the earlier social science “significance test controversy” and the traditional philosophies coupled to Bayesian and frequentist accounts, along with the newer Bayesian–frequentist unifications we just surveyed. This jungle has never been disentangled. No wonder leading reforms and semi-popular guidebooks contain misleading views about all these tools. No wonder we see the same fallacies that earlier reforms were designed to avoid, and even brand new ones. Let me be clear, I’m not speaking about flat-out howlers, such as interpreting a P-value as a posterior probability. By and large, the fallacies are more subtle; you’ll want to reach your own position on them. It’s not a matter of switching your tribe, but of excavating the roots of tribal warfare to tell what’s true about them. I don’t mean understanding them at the socio-psychological level, although there’s a good story there (and I’ll leak some of the juicy parts during our travels).
How can we make progress when it is difficult even to tell what is true about the different methods of statistics? We must start afresh, taking responsibility to offer a new standpoint from which to interpret the cluster of tools around which there has been so much controversy. Only then can we alter and extend their limits. I admit that the statistical philosophy that girds our explorations is not out there ready-made; if it were, there would be no need for our holiday cruise. While there are plenty of giant shoulders on which we stand, we won’t be restricted by the pronouncements of any of the high and low priests, as sagacious as many of their words have been. In fact, we’ll brazenly question some of their most entrenched mantras. Grab on to the gondola, our balloon’s about to land.
In Tour II, I’ll give you a glimpse of the core behind statistics battles, with a firm promise to retrace the steps more slowly in later trips.
REFERENCES:
Berger, J. (2003). ‘Could Fisher, Jeffreys and Neyman Have Agreed on Testing?’ and ‘Rejoinder’, Statistical Science 18(1), 1–12; 28–32.
Box, G. (1983). ‘An Apology for Ecumenism in Statistics’, in Box, G., Leonard, T., and Wu, D. (eds.), Scientific Inference, Data Analysis, and Robustness, New York: Academic Press, 51–84.
Cox, D. (1978). ‘Foundations of Statistical Inference: The Case for Eclecticism’, Australian Journal of Statistics 20(1), 43–59.
Efron, B. (2013). ‘A 250-Year Argument: Belief, Behavior, and the Bootstrap’, Bulletin of the American Mathematical Society 50(1), 126–46.
Fraser, D. (2011). ‘Is Bayes Posterior Just Quick and Dirty Confidence?’ and ‘Rejoinder’, Statistical Science 26(3), 299–316; 329–31.
Gelman, A. and Shalizi, C. (2013). ‘Philosophy and the Practice of Bayesian Statistics’ and ‘Rejoinder’, British Journal of Mathematical and Statistical Psychology 66(1), 8–38; 76–80.
Greenland, S. and Poole, C. (2013). ‘Living with P Values: Resurrecting a Bayesian Perspective on Frequentist Statistics’ and ‘Rejoinder: Living with Statistics in Observational Research’, Epidemiology 24(1), 62–8; 73–8.
Kadane, J. (2008). ‘Comment on Article by Gelman’, Bayesian Analysis 3(3), 455–8.
Kass, R. (2011). ‘Statistical Inference: The Big Picture (with discussion and rejoinder)’, Statistical Science 26(1), 1–20.
Mayo, D. (2003b). ‘Could Fisher, Jeffreys and Neyman Have Agreed on Testing? Commentary on J. Berger’s Fisher Address’, Statistical Science 18, 19–24.
Wasserman, L. (2012b). ‘What is Bayesian/Frequentist Inference?’, Blogpost on normaldeviate.wordpress.com (11/7/2012).
I shall be concerned with the foundations of the subject. But in case it should be thought that this means I am not here strongly concerned with practical applications, let me say right away that confusion about the foundations of the subject is responsible, in my opinion, for much of the misuse of the statistics that one meets in fields of application such as medicine, psychology, sociology, economics, and so forth. (George Barnard 1985, p. 2)
While statistical science (as with other sciences) generally goes about its business without attending to its own foundations, implicit in every statistical methodology are core ideas that direct its principles, methods, and interpretations. I will call this its statistical philosophy. To tell what’s true about statistical inference, understanding the associated philosophy (or philosophies) is essential. Discussions of statistical foundations tend to focus on how to interpret probability, and much less on the overarching question of how probability ought to be used in inference. Assumptions about the latter lurk implicitly behind debates, but rarely get the limelight. If we put the spotlight on them, we see that there are two main philosophies about the roles of probability in statistical inference: We may dub them performance (in the long run) and probabilism.
The performance philosophy sees the key function of statistical method as controlling the relative frequency of erroneous inferences in the long run of applications. For example, a frequentist statistical test, in its naked form, can be seen as a rule: whenever your outcome exceeds some value (say, X > x*), reject a hypothesis H_{0} and infer H_{1}. The value of the rule, according to its performance-oriented defenders, is that it can ensure that, regardless of which hypothesis is true, there is both a low probability of erroneously rejecting H_{0} (rejecting H_{0} when it is true) and a low probability of erroneously accepting H_{0} (failing to reject H_{0} when it is false).
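The performance idea can be made concrete with a small simulation. The setup here (a Normal(μ, 1) model, 25 observations, and a cutoff of 1.645/√25 on the sample mean) is my own illustrative choice, not anything prescribed by the text:

```python
import random

def reject(sample, cutoff):
    # The naked test rule: reject H0 whenever the sample mean exceeds the cutoff.
    return sum(sample) / len(sample) > cutoff

def error_rate(mu, n=25, cutoff=0.329, reps=20_000, seed=1):
    # Relative frequency of rejection over many applications of the rule,
    # when the data really come from Normal(mu, 1).
    rng = random.Random(seed)
    rejections = sum(reject([rng.gauss(mu, 1) for _ in range(n)], cutoff)
                     for _ in range(reps))
    return rejections / reps

# cutoff = 1.645 / sqrt(25), so rejecting H0: mu = 0 in favor of mu > 0
# should be an error roughly 5% of the time when H0 is true.
print(error_rate(mu=0))    # close to 0.05: low Type I error rate
print(error_rate(mu=0.7))  # high rejection rate at mu = 0.7: low Type II error
```

The appeal, for performance-oriented defenders, is exactly what the simulation estimates: the rule’s error rates hold over the long run of applications, whatever the true hypothesis.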
The second philosophy, probabilism, views probability as a way to assign degrees of belief, support, or plausibility to hypotheses. Many keep to a comparative report, for example that H_{0} is more believable than is H_{1} given data x; others strive to say H_{0} is less believable given data x than before, and offer a quantitative report of the difference.
What happened to the goal of scrutinizing BENT science by the severity criterion? [See 1.1] Neither “probabilism” nor “performance” directly captures that demand. Taking these goals at face value, it’s easy to see why they come up short. Potti and Nevins’ strong belief in the reliability of their prediction model for cancer therapy scarcely made up for the shoddy testing. Neither is good long-run performance a sufficient condition. Most obviously, there may be no long-run repetitions, and our interest in science is often just the particular statistical inference before us. Crude long-run requirements may be met by silly methods. Most importantly, good performance alone fails to get at why methods work when they do; namely – I claim – to let us assess and control the stringency of tests. This is the key to answering a burning question that has caused major headaches in statistical foundations: why should a low relative frequency of error matter to the appraisal of the inference at hand? It is not probabilism or performance we seek to quantify, but probativeness.
I do not mean to disparage the long-run performance goal – there are plenty of tasks in inquiry where performance is absolutely key. Examples are screening in high-throughput data analysis, and methods for deciding which of tens of millions of collisions in high-energy physics to capture and analyze. New applications of machine learning may lead some to say that only low rates of prediction or classification errors matter. Even with prediction, “black-box” modeling, and non-probabilistic inquiries, there is concern with solving a problem. We want to know if a good job has been done in the case at hand.
Severity (Strong): Argument from Coincidence
The weakest version of the severity requirement (Section 1.1), in the sense of easiest to justify, is negative, warning us when BENT data are at hand, and a surprising amount of mileage may be had from that negative principle alone. It is when we recognize how poorly certain claims are warranted that we get ideas for improved inquiries. In fact, if you wish to stop at the negative requirement, you can still go pretty far along with me. I also advocate the positive counterpart:
Severity (strong): We have evidence for a claim C just to the extent it survives a stringent scrutiny. If C passes a test that was highly capable of finding flaws or discrepancies from C, and yet none or few are found, then the passing result, x, is evidence for C.
One way this can be achieved is by an argument from coincidence. The most vivid cases occur outside formal statistics.
Some of my strongest examples tend to revolve around my weight. Before leaving the USA for the UK, I record my weight on two scales at home, one digital, one not, and the big medical scale at my doctor’s office. Suppose they are well calibrated and nearly identical in their readings, and they also all pick up on the extra 3 pounds when I’m weighed carrying three copies of my 1-pound book, Error and the Growth of Experimental Knowledge (EGEK). Returning from the UK, to my astonishment, not one but all three scales show anywhere from a 4–5 pound gain. There’s no difference when I place the three books on the scales, so I must conclude, unfortunately, that I’ve gained around 4 pounds. Even for me, that’s a lot. I’ve surely falsified the supposition that I lost weight! From this informal example, we may make two rather obvious points that will serve for less obvious cases. First, there’s the idea I call lift-off.
Lift-off: An overall inference can be more reliable and precise than its premises individually.
Each scale, by itself, has some possibility of error, and limited precision. But the fact that all of them have me at an over 4-pound gain, while none show any difference in the weights of EGEK, pretty well seals it. Were one scale off balance, it would be discovered by another, and would show up in the weighing of books. They cannot all be systematically misleading just when it comes to objects of unknown weight, can they? Rejecting a conspiracy of the scales, I conclude I’ve gained weight, at least 4 pounds. We may call this an argument from coincidence, and by its means we can attain lift-off. Lift-off runs directly counter to a seemingly obvious claim of drag-down.
Drag-down: An overall inference is only as reliable/precise as is its weakest premise.
The drag-down assumption is common among empiricist philosophers: As they like to say, “It’s turtles all the way down.” Sometimes our inferences do stand as a kind of tower built on linked stones – if even one stone fails they all come tumbling down. Call that a linked argument.
Our most prized scientific inferences would be in a very bad way if piling on assumptions invariably led to weakened conclusions. Fortunately we can also build what may be called convergent arguments, where lift-off is attained. This seemingly banal point suffices to combat some of the most well-entrenched skepticisms in philosophy of science. And statistics happens to be the science par excellence for demonstrating lift-off!
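The lift-off point can be put in toy numbers. Suppose (hypothetically) that each scale alone misreads by 4 or more pounds with probability 0.01, and that the scales err independently (as the book-weighing checks support):

```python
# Hypothetical error probability for one scale used alone.
p_single = 0.01

# For the convergent argument, all three independent scales would have to be
# misleading at once.
p_all_three = p_single ** 3

print(p_single)     # any one premise: a 1-in-100 chance of error
print(p_all_three)  # the joint conclusion: about 1 in a million
```

The overall inference is thus far more reliable than any single premise, which is just what the drag-down assumption denies.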
Now consider what justifies my weight conclusion, based, as we are supposing it is, on a strong argument from coincidence. No one would say: “I can be assured that by following such a procedure, in the long run I would rarely report weight gains erroneously, but I can tell nothing from these readings about my weight now.” To justify my conclusion by long-run performance would be absurd. Instead we say that the procedure had enormous capacity to reveal if any of the scales were wrong, and from this I argue about the source of the readings: H: I’ve gained weight. Simple as that. It would be a preposterous coincidence if none of the scales registered even slight weight shifts when weighing objects of known weight, and yet were systematically misleading when applied to my weight. You see where I’m going with this. This is the key – granted, with a homely example – that can fill a very important gap in frequentist foundations: Just because an account is touted as having a long-run rationale, it does not mean it lacks a short-run rationale, or even one relevant for the particular case at hand. Nor is it merely the improbability of all the results were H false; it is rather like denying an evil demon has read my mind just in the cases where I do not know the weight of an object, and deliberately deceived me. The argument to “weight gain” is an example of an argument from coincidence to the absence of an error, what I call:
Arguing from Error: There is evidence an error is absent to the extent that a procedure with a very high capability of signaling the error, if and only if it is present, nevertheless detects no error.
I am using “signaling” and “detecting” synonymously: It is important to keep in mind that we don’t know if the test output is correct, only that it gives a signal or alert, like sounding a bell. Methods that enable strong arguments to the absence (or presence) of an error I call strong error probes. Our ability to develop strong arguments from coincidence, I will argue, is the basis for solving the “problem of induction.”
Glaring Demonstrations of Deception
Intelligence is indicated by a capacity for deliberate deviousness. Such deviousness becomes self-conscious in inquiry: An example is the use of a placebo to find out what it would be like if the drug had no effect. What impressed me the most in my first statistics class was the demonstration of how apparently impressive results are readily produced when nothing’s going on, i.e., “by chance alone.” Once you see how it is done, and done easily, there is no going back. The toy hypotheses used in statistical testing are nearly always overly simple as scientific hypotheses. But when it comes to framing rather blatant deceptions, they are just the ticket!
When Fisher offered Muriel Bristol-Roach a cup of tea back in the 1920s, she refused it because he had put the milk in first. What difference could it make? Her husband and Fisher thought it would be fun to put her to the test (1935a). Say she doesn’t claim to get it right all the time but does claim that she has some genuine discerning ability. Suppose Fisher subjects her to 16 trials and she gets 9 of them right. Should I be impressed or not? By a simple experiment of randomly assigning milk first/tea first, Fisher sought to answer this stringently. But don’t be fooled: a great deal of work goes into controlling biases and confounders before the experimental design can work. The main point just now is this: so long as lacking ability is sufficiently like the canonical “coin tossing” (Bernoulli) model (with the probability of success at each trial being 0.5), we can learn from the test procedure. In the Bernoulli model, we record success or failure, assume a fixed probability of success θ on each trial, and assume that trials are independent. If the probability of getting even more successes than she got, merely by guessing, is fairly high, there’s little indication of special tasting ability. The probability of at least 9 of 16 successes, even if θ = 0.5, is 0.4. To abbreviate, Pr(at least 9 of 16 successes; H_{0}: θ = 0.5) = 0.4. This is the P-value of the observed difference; an unimpressive 0.4. You’d expect as many or even more “successes” 40% of the time merely by guessing. It’s also the significance level attained by the result. (I often use P-value, as it’s shorter.) Muriel Bristol-Roach pledges that if her performance may be regarded as scarcely better than guessing, then she hasn’t shown her ability. Typically, a small value such as 0.05, 0.025, or 0.01 is required.
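The 0.4 quoted above is an exact binomial tail probability, and a few lines of code reproduce it (the function name is mine):

```python
from math import comb

def p_value(successes, n=16, theta=0.5):
    """Pr(at least `successes` out of n; H0: theta) -- exact binomial tail."""
    return sum(comb(n, k) * theta**k * (1 - theta)**(n - k)
               for k in range(successes, n + 1))

print(round(p_value(9), 2))  # 0.4: unimpressive, as in the text
print(p_value(16))           # getting all 16 right: about 0.000015
```

A result at or beyond the conventional thresholds (0.05, 0.025, 0.01) would require her to get considerably more than 9 of 16 right.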
Such artificial and simplistic statistical hypotheses play valuable roles at stages of inquiry where what is needed are blatant standards of “nothing’s going on.” There is no presumption of a metaphysical chance agency, just that there is expected variability – otherwise one test would suffice – and that probability models from games of chance can be used to distinguish genuine from spurious effects. Although the goal of inquiry is to find things out, the hypotheses erected to this end are generally approximations and may be deliberately false. To present statistical hypotheses as identical to substantive scientific claims is to mischaracterize them. We want to tell what’s true about statistical inference. Among the most notable of these truths is:
P-values can be readily invalidated due to how the data (or hypotheses!) are generated or selected for testing.
If you fool around with the results afterwards, reporting only successful guesses, your report will be invalid. You may claim it’s very difficult to get such an impressive result due to chance, when in fact it’s very easy to do so, with selective reporting. Another way to put this: your computed P-value is small, but the actual P-value is high! Concern with spurious findings, while an ancient problem, is considered sufficiently serious to have motivated the American Statistical Association to issue a guide on how not to interpret P-values (Wasserstein and Lazar 2016); hereafter, the ASA 2016 Guide. It may seem that if a statistical account is free to ignore such fooling around, then the problem disappears! It doesn’t.
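The gap between the computed and actual P-value is easy to simulate. In this sketch (the 20-tries setup and all numbers are hypothetical), a pure guesser runs 20 tea-tasting experiments of 16 trials each and reports only the best:

```python
import random

def best_of_tries(n_tries, n_trials=16, rng=None):
    # Pure guessing (theta = 0.5) on every trial; run n_tries experiments
    # and report only the best success count -- the selective reporting.
    rng = rng or random.Random(0)
    return max(sum(rng.random() < 0.5 for _ in range(n_trials))
               for _ in range(n_tries))

# A single experiment with >= 12 of 16 successes has a computed P-value of
# about 0.038 (< 0.05). How often does best-of-20 guessing reach that bar?
rng = random.Random(42)
freq = sum(best_of_tries(20, rng=rng) >= 12 for _ in range(5_000)) / 5_000
print(freq)  # the actual P-value: roughly ten times the computed 0.05
```

The reported result looks hard to get by chance; the procedure that produced it gets such results with ease.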
Incidentally, Bristol-Roach got all the cases correct, and thereby taught her husband a lesson about putting her claims to the test.
[Skips p. 18 on Peirce]
Texas Marksman
Take an even simpler and more blatant argument of deception. It is my favorite: the Texas Marksman. A Texan wants to demonstrate his shooting prowess. He shoots all his bullets any old way into the side of a barn and then paints a bull’s-eye in spots where the bullet holes are clustered. This fails utterly to severely test his marksmanship ability. When some visitors come to town and notice the incredible number of bull’s-eyes, they ask to meet this marksman and are introduced to a little kid. How’d you do so well, they ask? Easy, I just drew the bull’s-eye around the most tightly clustered shots. There is impressive “agreement” with shooting ability; he might even compute how improbable it would be for so many bull’s-eyes to occur by chance. Yet his ability to shoot was not tested in the least by this little exercise. There’s a real effect all right, but it’s not caused by his marksmanship! It serves as a potent analogy for a cluster of formal statistical fallacies arising from data-dependent findings of “exceptional” patterns.
The term “apophenia” refers to a tendency to zero in on an apparent regularity or cluster within a vast sea of data and claim a genuine regularity. One of our fundamental problems (and skills) is that we’re apopheniacs. Some investment funds, none that we actually know, are alleged to produce several portfolios by random selection of stocks and send out only the one that did best. Call it the Pickrite method. They want you to infer that it would be a preposterous coincidence to get so great a portfolio if the Pickrite method were like guessing. So their methods are genuinely wonderful, or so you are to infer. If this had been their only portfolio, the probability of doing so well by luck is low. But the probability of at least one of many portfolios doing so well (even if each is generated by chance) is high, if not guaranteed.
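The Pickrite arithmetic is elementary (the 0.02 figure for a single portfolio is hypothetical):

```python
def prob_at_least_one(p_single, n_portfolios):
    # Pr(at least one of n independent chance portfolios does this well)
    return 1 - (1 - p_single) ** n_portfolios

print(prob_at_least_one(0.02, 1))    # one portfolio: low, seemingly impressive
print(prob_at_least_one(0.02, 200))  # 200 tries: near certainty, by chance alone
```

Reporting only the winner hides the denominator, which is what turns a low single-try probability into a near guarantee.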
Let’s review the rogues’ gallery of glaring arguments from deception. The lady tasting tea showed how a statistical model of “no effect” could be used to amplify our ordinary capacities to discern if something really unusual is going on. The P-value is the probability of at least as high a success rate as observed, assuming the test or null hypothesis: that the probability of success is 0.5. Since even more successes than she got are fairly frequent through guessing alone (the P-value is moderate), there’s poor evidence of a genuine ability. The Pickrite and Texas sharpshooter examples, while quasi-formal or informal, demonstrate how to invalidate reports of significant effects. They show how gambits of post-data adjustments or selection can render a method highly capable of spewing out impressive-looking fits even when it’s just random noise.
We appeal to the same statistical reasoning to show the problematic cases as to show genuine arguments from coincidence.
So am I proposing that a key role for statistical inference is to identify ways to spot egregious deceptions (BENT cases) and create strong arguments from coincidence? Yes, I am.
[Skips “Spurious P-values and Auditing” (p. 20) up to Souvenir A (p. 21).]
Souvenir A: Postcard to Send
The gift shop has a postcard listing the four slogans from the start of this Tour. Much of today’s hand-wringing about statistical inference is unified by a call to block these fallacies. In some realms, trafficking in too-easy claims for evidence is, if not a criminal offense, “bad statistics”; in others, notably some social sciences, it is accepted cavalierly – much to the despair of panels on research integrity. We are more sophisticated than ever about the ways researchers can repress unwanted, and magnify wanted, results. Fraud-busting is everywhere, and the most important grain of truth is this: all the fraud-busting is based on error statistical reasoning (if only on the meta-level). The minimal requirement to avoid BENT isn’t met. It’s hard to see how one can grant the criticisms while denying the critical logic.
We should oust mechanical, recipe-like uses of statistical methods that have long been lampooned, and are doubtless made easier by Big Data mining. They should be supplemented with tools to report magnitudes of effects that have and have not been warranted with severity. But simple significance tests have their uses, and shouldn’t be ousted simply because some people are liable to violate Fisher’s warning and report isolated results. They should be seen as part of a conglomeration of error statistical tools for distinguishing genuine and spurious effects. They offer assets that are essential to our task: they have the means by which to register formally the fallacies in the postcard list. The failed statistical assumptions, the selection effects from trying and trying again, all alter a test’s error-probing capacities. This sets off important alarm bells, and we want to hear them. Don’t throw out the error-control baby with the bad statistics bathwater.
The slogans about lying with statistics? View them, not as a litany of embarrassments, but as announcing what any responsible method must register, if not control or avoid. Criticisms of statistical tests, where valid, boil down to problems with the critical alert function. Far from the high capacity to warn, “Curb your enthusiasm!” as correct uses of tests do, there are practices that make sending out spurious enthusiasm as easy as pie. This is a failure for sure, but don’t trade them in for methods that cannot detect failure at all. If you’re shopping for a statistical account, or appraising a statistical reform, your number one question should be: does it embody trigger warnings of spurious effects? Of bias? Of cherry picking and multiple tries? If the response is: “No problem; if you use our method, those practices require no change in statistical assessment!” all I can say is, if it sounds too good to be true, you might wish to hold off buying it.
[Skips remainder of Section 1.2 (bottom of p. 22 – middle of p. 23).]
NOTES:
2 This is the traditional use of “bias” as a systematic error. Ioannidis (2005) alludes to biasing as behaviors that result in a reported significance level differing from the value it actually has or ought to have (e.g., post-data endpoints, selective reporting). I will call those biasing selection effects.
FOR ALL OF TOUR I: SIST Excursion 1 Tour I
THE FULL ITINERARY: Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars: SIST Itinerary
I’m talking about a speciﬁc, extra type of integrity that is [beyond] not lying, but bending over backwards to show how you’re maybe wrong, that you ought to have when acting as a scientist. (Feynman 1974/1985, p. 387)
It is easy to lie with statistics. Or so the cliché goes. It is also very difficult to uncover these lies without statistical methods – at least of the right kind. Self-correcting statistical methods are needed, and, with minimal technical fanfare, that’s what I aim to illuminate. Since Darrell Huff wrote How to Lie with Statistics in 1954, ways of lying with statistics are so well worn as to have emerged in reverberating slogans:
Exposés of fallacies and foibles ranging from professional manuals and task forces to more popularized debunking treatises are legion. New evidence has piled up showing lack of replication and all manner of selection and publication biases. Even expanded “evidence-based” practices, whose very rationale is to emulate experimental controls, are not immune from allegations of illicit cherry picking, significance seeking, P-hacking, and assorted modes of extraordinary rendition of data. Attempts to restore credibility have gone far beyond the cottage industries of just a few years ago, to entirely new research programs: statistical fraud-busting, statistical forensics, technical activism, and widespread reproducibility studies. There are proposed methodological reforms – many are generally welcome (preregistration of experiments, transparency about data collection, discouraging mechanical uses of statistics), some are quite radical. If we are to appraise these evidence policy reforms, a much better grasp of some central statistical problems is needed.
Getting Philosophical
Are philosophies about science, evidence, and inference relevant here? Because the problems involve questions about uncertain evidence, probabilistic models, science, and pseudoscience – all of which are intertwined with technical statistical concepts and presuppositions – they certainly ought to be. Even in an open-access world in which we have become increasingly fearless about taking on scientific complexities, a certain trepidation and groupthink take over when it comes to philosophically tinged notions such as inductive reasoning, objectivity, rationality, and science versus pseudoscience. The general area of philosophy that deals with knowledge, evidence, inference, and rationality is called epistemology. The epistemological standpoints of leaders, be they philosophers or scientists, are too readily taken as canon by others. We want to understand what’s true about some of the popular memes: “All models are false,” “Everything is equally subjective and objective,” “P-values exaggerate evidence,” and “[M]ost published research findings are false” (Ioannidis 2005) – at least if you publish a single statistically significant result after data finagling. (Do people do that? Shame on them.) Yet R. A. Fisher, founder of modern statistical tests, denied that an isolated statistically significant result counts.
[W]e need, not an isolated record, but a reliable method of procedure. In relation to the test of signiﬁcance, we may say that a phenomenon is experimentally demonstrable when we know how to conduct an experiment which will rarely fail to give us a statistically signiﬁcant result. (Fisher 1935b/1947, p. 14)
Satisfying this requirement depends on the proper use of background knowledge and deliberate design and modeling.
This opening excursion will launch us into the main themes we will encounter. You mustn’t suppose, by its title, that I will be talking about how to tell the truth using statistics. Although I expect to make some progress there, my goal is to tell what’s true about statistical methods themselves! There are so many misrepresentations of those methods that telling what is true about them is no mean feat. It may be thought that the basic statistical concepts are well understood. But I show that this is simply not true.
Nor can you just open a statistical text or advice manual for the goal at hand. The issues run deeper. Here’s where I come in. Having long had one foot in philosophy of science and the other in foundations of statistics, I will zero in on the central philosophical issues that lie below the surface of today’s raging debates. “Getting philosophical” is not about articulating rariﬁed concepts divorced from statistical practice. It is to provide tools to avoid obfuscating the terms and issues being bandied about. Readers should be empowered to understand the core presuppositions on which rival positions are based – and on which they depend.
Do I hear a protest? “There is nothing philosophical about our criticism of statistical significance tests (someone might say). The problem is that a small P-value is invariably, and erroneously, interpreted as giving a small probability to the null hypothesis.” Really? P-values are not intended to be used this way; presupposing they ought to be so interpreted grows out of a specific conception of the role of probability in statistical inference. That conception is philosophical. Methods characterized through the lens of oversimple epistemological orthodoxies are methods misapplied and mischaracterized. This may lead one to lie, however unwittingly, about the nature and goals of statistical inference, when what we want is to tell what’s true about them.
Fisher observed long ago, “[t]he political principle that anything can be proved by statistics arises from the practice of presenting only a selected subset of the data available” (Fisher 1955, p. 75). If you report results selectively, it becomes easy to prejudge hypotheses: yes, the data may accord amazingly well with a hypothesis H, but such a method is practically guaranteed to issue so good a ﬁt even if H is false and not warranted by the evidence. If it is predetermined that a way will be found to either obtain or interpret data as evidence for H, then data are not being taken seriously in appraising H. H is essentially immune to having its ﬂaws uncovered by the data. H might be said to have “passed” the test, but it is a test that lacks stringency or severity. Everyone understands that this is bad evidence, or no test at all. I call this the severity requirement. In its weakest form it supplies a minimal requirement for evidence:
Severity Requirement (weak): One does not have evidence for a claim if nothing has been done to rule out ways the claim may be false. If data x agree with a claim C but the method used is practically guaranteed to ﬁnd such agreement, and had little or no capability of ﬁnding ﬂaws with C even if they exist, then we have bad evidence, no test (BENT).
The “practically guaranteed” acknowledges that even if the method had some slim chance of producing a disagreement when C is false, we still regard the evidence as lousy. Little if anything has been done to rule out erroneous construals of data. We’ll need many diﬀerent ways to state this minimal principle of evidence, depending on context….
[Skips bottom of p. 5 – bottom of p. 6.]
Do We Always Want to Find Things Out?
The severity requirement gives a minimal principle based on the fact that highly insevere tests yield bad evidence, no tests (BENT). We can all agree on this much, I think. We will explore how much mileage we can get from it. It applies at a number of junctures in collecting and modeling data, in linking data to statistical inference, and to substantive questions and claims. This will be our linchpin for understanding what’s true about statistical inference. In addition to our minimal principle for evidence, one more thing is needed, at least during the time we are engaged in this project: the goal of ﬁnding things out.
The desire to find things out is an obvious goal; yet most of the time it is not what drives us. We typically may be uninterested in, if not quite resistant to, finding flaws or incongruencies with ideas we like. Often it is entirely proper to gather information to make your case, and ignore anything that fails to support it. Only if you really desire to find out something, or to challenge so-and-so’s (“trust me”) assurances, will you be prepared to stick your (or their) neck out to conduct a genuine “conjecture and refutation” exercise. Because you want to learn, you will be prepared to risk the possibility that the conjecture is found flawed.
We hear that “motivated reasoning has interacted with tribalism and new media technologies since the 1990s in unfortunate ways” (Haidt and Iyer 2016). Not only do we see things through the tunnel of our tribe, social media and web searches enable us to live in the echo chamber of our tribe more than ever. We might think we’re trying to ﬁnd things out but we’re not. Since craving truth is rare (unless your life depends on it) and the “perverse incentives” of publishing novel results so shiny, the wise will invite methods that make uncovering errors and biases as quick and painless as possible. Methods of inference that fail to satisfy the minimal severity requirement fail us in an essential way.
With the rise of Big Data, data analytics, machine learning, and bioinformatics, statistics has been undergoing a good deal of introspection. Exciting results are often being turned out by researchers without a traditional statistics background; biostatistician Jeﬀ Leek (2016) explains: “There is a structural reason for this: data was sparse when they were trained and there wasn’t any reason for them to learn statistics.” The problem goes beyond turf battles. It’s discovering that many data analytic applications are missing key ingredients of statistical thinking. Brown and Kass (2009) crystalize its essence. “Statistical thinking uses probabilistic descriptions of variability in (1) inductive reasoning and (2) analysis of procedures for data collection, prediction, and scientiﬁc inference” (p. 107). A word on each.
(1) Types of statistical inference are too varied to neatly encompass. Typically we employ data to learn something about the process or mechanism producing the data. The claims inferred are not specific events, but statistical generalizations, parameters in theories and models, causal claims, and general predictions. Statistical inference goes beyond the data – by definition that makes it an inductive inference. The risk of error is to be expected. There is no need to be reckless. The secret is controlling and learning from error. Ideally we take precautions in advance: pre-data, we devise methods that make it hard for claims to pass muster unless they are approximately true or adequately solve our problem. With data in hand, post-data, we scrutinize what, if anything, can be inferred.
What’s the essence of analyzing procedures in (2)? Brown and Kass don’t specifically say, but the gist can be gleaned from what vexes them; namely, ad hoc data analytic algorithms where researchers “have done nothing to indicate that it performs well” (p. 107). Minimally, statistical thinking means never ignoring the fact that there are alternative methods: Why is this one a good tool for the job? Statistical thinking requires stepping back and examining a method’s capabilities, whether it’s designing or choosing a method, or scrutinizing the results.
A Philosophical Excursion
Taking the severity principle, then, along with the aim that we desire to find things out without being obstructed in this goal, let’s set sail on a philosophical excursion to illuminate statistical inference. Envision yourself embarking on a special interest cruise featuring “exceptional itineraries to popular destinations worldwide as well as unique routes” (Smithsonian Journeys). What our cruise lacks in glamour will be more than made up for in our ability to travel back in time to hear what Fisher, Neyman, Pearson, Popper, Savage, and many others were saying and thinking, and then zoom forward to current debates. There will be exhibits, a blend of statistics, philosophy, and history, and even a bit of theater. Our standpoint will be pragmatic in this sense: my interest is not in some ideal form of knowledge or rational agency, no omniscience or God’s-eye view – although we’ll start and end surveying the landscape from a hot-air balloon. I’m interested in the problem of how we get the kind of knowledge we do manage to obtain – and how we can get more of it. Statistical methods should not be seen as tools for what philosophers call “rational reconstruction” of a piece of reasoning. Rather, they are forward-looking tools to find something out faster and more efficiently, and to discriminate how good or poor a job others have done.
The job of the philosopher is to clarify but also to provoke reflection and scrutiny precisely in those areas that go unchallenged in ordinary practice. My focus will be on the issues having the most influence, and being most liable to obfuscation. Fortunately, that doesn’t require an abundance of technicalities, but you can opt out of any day trip that appears too technical: an idea not caught in one place should be illuminated in another. Our philosophical excursion may well land us in positions that are provocative to all existing sides of the debate about probability and statistics in scientific inquiry.
Methodology and Metamethodology
We are studying statistical methods from various schools. What shall we call methods for doing so? Borrowing a term from philosophy of science, we may call it our meta-methodology – it’s one level removed.1 To put my cards on the table: A severity scrutiny is going to be a key method of our meta-methodology. It is fairly obvious that we want to scrutinize how capable a statistical method is at detecting and avoiding erroneous interpretations of data. So when it comes to the role of probability as a pedagogical tool for our purposes, severity – its assessment and control – will be at the center. The term “severity” is Popper’s, though he never adequately defined it. It’s not part of any statistical methodology as of yet. Viewing statistical inference as severe testing lets us stand one level removed from existing accounts, where the air is a bit clearer.
Our intuitive, minimal requirement for evidence connects readily to formal statistics. The probabilities that a statistical method lands in erroneous interpretations of data are often called its error probabilities. So an account that revolves around control of error probabilities I call an error statistical account. But “error probability” has been used in different ways. Most familiar are those in relation to hypothesis tests (Type I and II errors), significance levels, confidence levels, and power – all of which we will explore in detail. It has occasionally been used in relation to the proportion of false hypotheses among those now in circulation, which is different. For now it suffices to say that none of the formal notions directly give severity assessments. There isn’t even a statistical school or tribe that has explicitly endorsed this goal. I find this perplexing. That will not preclude our immersion into the mindset of a futuristic tribe whose members use error probabilities for assessing severity; it’s just the ticket for our task: understanding and getting beyond the statistics wars. We may call this tribe the severe testers.
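To preview how these formal error probabilities work, here is a minimal sketch of a one-sided Normal (z) test; the sample size, significance level, and alternative are invented figures for illustration, not from the text:

```python
from statistics import NormalDist

# One-sided test of H0: mu = 0 vs H1: mu > 0, with known sigma = 1
# and n = 25 observations; all numbers are hypothetical.
z = NormalDist()
n, alpha = 25, 0.05
z_crit = z.inv_cdf(1 - alpha)              # reject H0 when sqrt(n)*xbar > z_crit

type_i = 1 - z.cdf(z_crit)                 # P(reject | H0 true): Type I error
power = 1 - z.cdf(z_crit - 0.5 * n**0.5)   # P(reject | mu = 0.5); its
                                           # complement is the Type II error
print(round(type_i, 3), round(power, 3))
```

The nominal significance level is 0.05 by construction; against the alternative μ = 0.5 the power works out to about 0.80, so the Type II error probability there is about 0.20.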
We can keep to testing language. See it as part of the metalanguage we use to talk about formal statistical methods, where the latter include estimation, exploration, prediction, and data analysis. I will use the term “hypothesis,” or just “claim,” for any conjecture we wish to entertain; it need not be one set out in advance of data. Even predesignating hypotheses, by the way, doesn’t preclude bias: that view is a holdover from a crude empiricism that assumes data are unproblematically “given,” rather than selected and interpreted. Conversely, using the same data to arrive at and test a claim can, in some cases, be accomplished with stringency.
As we embark on statistical foundations, we must avoid blurring formal terms such as probability and likelihood with their ordinary English meanings. Actually, “probability” comes from the Latin probare, meaning to try, test, or prove. “Proof” in “The proof is in the pudding” refers to how you put something to the test. You must show or demonstrate, not just believe strongly. Ironically, using probability this way would bring it very close to the idea of measuring well-testedness (or how well shown). But it’s not our current, informal English sense of probability, as varied as that can be. To see this, consider “improbable.” Calling a claim improbable, in ordinary English, can mean a host of things: I bet it’s not so; all things considered, given what I know, it’s implausible; and other things besides. Describing a claim as poorly tested generally means something quite different: little has been done to probe whether the claim holds or not, the method used was highly unreliable, or things of that nature. In short, our informal notion of poorly tested comes rather close to the lack of severity in statistics. There’s a difference between finding H poorly tested by data x, and finding x renders H improbable – in any of the many senses the latter takes on. The existence of a Higgs particle was thought to be probable if not necessary before it was regarded as well tested around 2012. Physicists had to show or demonstrate its existence for it to be well tested. It follows that you are free to pursue our testing goal without implying there are no other statistical goals. One other thing on language: I will have to retain the terms currently used while exploring them. That doesn’t mean I’m in favor of them; in fact, I will jettison some of them by the end of the journey.
To sum up this ﬁrst tour so far, statistical inference uses data to reach claims about aspects of processes and mechanisms producing them, accompanied by an assessment of the properties of the inference methods: their capabilities to control and alert us to erroneous interpretations. We need to report if the method has satisﬁed the most minimal requirement for solving such a problem. Has anything been tested with a modicum of severity, or not? The severe tester also requires reporting of what has been poorly probed, and highlights the need to “bend over backwards,” as Feynman puts it, to admit where weaknesses lie. In formal statistical testing, the crude dichotomy of “pass/fail” or “signiﬁcant or not” will scarcely do. We must determine the magnitudes (and directions) of any statistical discrepancies warranted, and the limits to any substantive claims you may be entitled to infer from the statistical ones. Using just our minimal principle of evidence, and a sturdy pair of shoes, join me on a tour of statistical inference, back to the leading museums of statistics, and forward to current oﬀshoots and statistical tribes.
Why We Must Get Beyond the Statistics Wars
Some readers may be surprised to learn that the field of statistics, arid and staid as it seems, has a fascinating and colorful history of philosophical debate, marked by unusual heights of passion, personality, and controversy for at least a century. Others know them all too well and regard supporting any one side largely as proselytizing. I’ve heard some refer to statistical debates as “theological.” I do not want to rehash the “statistics wars” that have raged in every decade, although the significance test controversy is still hotly debated among practitioners, and even though each generation fights these wars anew – with task forces set up to stem reflexive, recipe-like statistics that have long been deplored.
The time is ripe for a fairminded engagement in the debates about statistical foundations; more than that, it is becoming of pressing importance. Not only because
nor because
– as important as those facets are – but because what is at stake is a critical standpoint that we may be in danger of losing. Without it, we forfeit the ability to communicate with, and hold accountable, the “experts,” the agencies, the quants, and all those data handlers increasingly exerting power over our lives. Understanding the nature and basis of statistical inference must not be considered as all about mathematical details; it is at the heart of what it means to reason scientiﬁcally and with integrity about any ﬁeld whatever. Robert Kass (2011) puts it this way:
We care about our philosophy of statistics, ﬁrst and foremost, because statistical inference sheds light on an important part of human existence, inductive reasoning, and we want to understand it. (p. 19)
Isolating out a particular conception of statistical inference as severe testing is a way of telling what’s true about the statistics wars, and getting beyond them.
Chutzpah, No Proselytizing
Our task is twofold: not only must we analyze statistical methods; we must also scrutinize the jousting on various sides of the debates. Our meta-level standpoint will let us rise above much of the cacophony; but the excursion will involve a dose of chutzpah that is out of the ordinary in professional discussions. You will need to critically evaluate the texts and the teams of critics, including brilliant leaders, high priests, maybe even royalty. Are they asking the most unbiased questions in examining methods, or are they like admen touting their brand, dragging out howlers to make their favorite method look good? (I am not sparing any of the statistical tribes here.) There are those who are earnest but brainwashed, or are stuck holding banners from an earlier battle now over; some are wedded to what they’ve learned, to what’s in fashion, to what pays the rent. Some are so jaundiced about the abuses of statistics as to wonder at my admittedly herculean task. I have a considerable degree of sympathy with them. But I do not sympathize with those who ask: “why bother to clarify statistical concepts if they are invariably misinterpreted?” and then proceed to misinterpret them. Anyone is free to dismiss statistical notions as irrelevant to them, but then why set out a shingle as a “statistical reformer”? You may even be shilling for one of the proffered reforms, thinking it the road to restoring credibility, when it will do nothing of the kind.
You might say that, since rival statistical methods turn on issues of philosophy and on rival conceptions of scientific learning, it’s impossible to say anything “true” about them. You just did. It’s precisely these interpretative and philosophical issues that I plan to discuss. Understanding the issues is different from settling them, but it’s of value nonetheless. Although statistical disagreements involve philosophy, statistical practitioners, and not philosophers, are the ones leading today’s discussions of foundations. Is it possible to pursue our task in a way that will be seen as neither too philosophical nor not philosophical enough? Too statistical or not statistically sophisticated enough? Probably not; I expect grievances from both sides.
Finally, I will not be proselytizing for a given statistical school, so you can relax. Frankly, they all have shortcomings, insofar as one can even glean a clear statement of a given statistical “school.” What we have is more like a jumble with tribal members often speaking right past each other. View the severity requirement as a heuristic tool for telling what’s true about statistical controversies. Whether you resist some of the ports of call we arrive at is unimportant; it suﬃces that visiting them provides a key to unlock current mysteries that are leaving many consumers and students of statistics in the dark about a crucial portion of science.
NOTE:
1 This contrasts with the use of “metaresearch” to describe work on methodological reforms by nonphilosophers. This is not to say they don’t tread on philosophical territory often: they do.
Speakers:
Sir David Cox, Nuffield College, Oxford
Deborah Mayo, Virginia Tech
Richard Morey, Cardiff University
Aris Spanos, Virginia Tech
Intermingled in today’s statistical controversies are some longstanding, but unresolved, disagreements on the nature and principles of statistical methods and the roles for probability in statistical inference and modelling. In reaction to the so-called “replication crisis” in the sciences, some reformers suggest significance tests are a major culprit. To understand the ramifications of the proposed reforms, there is a pressing need for a deeper understanding of the source of the problems in the sciences and a balanced critique of the alternative methods being proposed to supplant significance tests. In this session speakers offer perspectives on significance tests from statistical science, econometrics, experimental psychology, and philosophy of science. There will also be a panel discussion.
Egon Pearson’s Neglected Contributions to Statistics
by Aris Spanos
Egon Pearson (11 August 1895 – 12 June 1980) is widely known today for his contribution in recasting Fisher’s significance testing into the Neyman-Pearson (1933) theory of hypothesis testing. Occasionally, he is also credited with contributions in promoting statistical methods in industry and in the history of modern statistics; see Bartlett (1981). What is rarely mentioned is Egon’s early pioneering work on:
(i) specification: the need to state explicitly the inductive premises of one’s inferences,
(ii) robustness: evaluating the ‘sensitivity’ of inferential procedures to departures from the Normality assumption, as well as
(iii) Mis-Specification (M-S) testing: probing for potential departures from the Normality assumption.
Arguably, modern frequentist inference began with the development of various finite sample inference procedures, initially by William Gosset (1908) [of the Student’s t fame] and then Fisher (1915, 1921, 1922ab). These inference procedures revolved around a particular statistical model, known today as the simple Normal model:
X_{k} ∽ NIID(μ,σ²), k=1,2,…,n,… (1)
where ‘NIID(μ,σ²)’ stands for ‘Normal, Independent and Identically Distributed with mean μ and variance σ²’. These procedures include the ‘optimal’ estimators of μ and σ², Xbar and s², and the pivotal quantities:
(a) τ(X) = [√n(Xbar − μ)/s] ∽ St(n−1), (2)
(b) v(X) = [(n−1)s²/σ²] ∽ χ²(n−1), (3)
where St(n−1) and χ²(n−1) denote the Student’s t and chi-square distributions with (n−1) degrees of freedom.
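The distributional claims in (2) and (3) are easy to check by simulation. A minimal sketch (the sample size and parameters are my choices; 1.833 and 16.92 are the standard tabled 95th percentiles of St(9) and χ²(9), not values from the text):

```python
import math
import random

random.seed(0)
n, reps, mu, sigma = 10, 20_000, 5.0, 2.0

taus, vs = [], []
for _ in range(reps):
    x = [random.gauss(mu, sigma) for _ in range(n)]
    xbar = sum(x) / n
    s2 = sum((xi - xbar) ** 2 for xi in x) / (n - 1)
    taus.append(math.sqrt(n) * (xbar - mu) / math.sqrt(s2))  # pivot (2)
    vs.append((n - 1) * s2 / sigma ** 2)                     # pivot (3)

taus.sort(); vs.sort()
# Empirical 95th percentiles should approximate the tabled values
# St(9): 1.833 and chi-square(9): 16.92.
print(round(taus[int(0.95 * reps)], 2), round(vs[int(0.95 * reps)], 2))
```

The point of a pivot is visible here: the simulated distributions of τ(X) and v(X) do not depend on the unknown μ and σ², so tabled percentiles yield exact finite-sample tests and intervals.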
The question of ‘how these inferential results might be affected when the Normality assumption is false’ was originally raised by Gosset in a letter to Fisher in 1923:
“What I should like you to do is to find a solution for some other population than a normal one.” (Lehmann, 1999)
He went on to say that he tried the rectangular (uniform) distribution but made no progress, and he was seeking Fisher’s help in tackling this ‘robustness/sensitivity’ problem. In his reply, which was unfortunately lost, Fisher must have derived the sampling distribution of τ(X), assuming some skewed distribution (possibly log-Normal). We know this from Gosset’s reply:
“I like the result for z [τ(X)] in the case of that horrible curve you are so fond of. I take it that in skew curves the distribution of z is skew in the opposite direction.” (Lehmann, 1999)
After this exchange Fisher was not particularly receptive to Gosset’s requests to address the problem of working out the implications of non-Normality for the Normal-based inference procedures: the t, chi-square, and F tests.
In contrast, Egon Pearson shared Gosset’s concerns about the robustness of the Normal-based inference results (a)–(b) to non-Normality, and made an attempt to address the problem in a series of papers in the late 1920s and early 1930s.
This line of research began for Pearson with a review of the second (1928) edition of Fisher’s 1925 book, published in Nature and dated June 8th, 1929. Pearson, after praising the book for its path-breaking contributions, dared to raise a mild criticism relating to (i)-(ii) above:
“There is one criticism, however, which must be made from the statistical point of view. A large number of tests are developed upon the assumption that the population sampled is of ‘normal’ form. That this is the case may be gathered from a very careful reading of the text, but the point is not sufficiently emphasised. It does not appear reasonable to lay stress on the ‘exactness’ of tests, when no means whatever are given of appreciating how rapidly they become inexact as the population samples diverge from normality.” (Pearson, 1929a)
Fisher reacted badly to this criticism and was preparing an acerbic reply to the ‘young pretender’ when Gosset jumped into the fray with his own letter to Nature, dated July 20th, in an obvious attempt to moderate the ensuing fight. Gosset succeeded in tempering Fisher’s reply, dated August 17th, but instead of addressing the ‘robustness/sensitivity’ issue, Fisher focused primarily on Gosset’s call to address ‘the problem of what sort of modification of my tables for the analysis of variance would be required to adapt that process to non-normal distributions’. He described that as a hopeless task. This is an example of Fisher’s genius when cornered by an insightful argument. He sidestepped the issue of ‘robustness’ to departures from Normality by broadening it (alluding to other possible departures from the ID assumption) and rendering it a hopeless task, by focusing on the call to ‘modify’ the statistical tables for all possible non-Normal distributions; there is an infinity of potential modifications!
Egon Pearson recognized the importance of stating explicitly the inductive premises upon which inference results are based, and pressed ahead with exploring the robustness issue using several non-Normal distributions within the Pearson family. His probing was based primarily on simulation, relying on tables of pseudo-random numbers; see Pearson and Adyanthaya (1928, 1929), Pearson (1929b, 1931). His broad conclusions were that the t-test:
τ_{0}(X) = [√n(X̄ − μ_{0})/s], C_{1} := {x: |τ_{0}(x)| > c_{α}}, (4)
for testing the hypotheses:
H_{0}: μ = μ_{0} vs. H_{1}: μ ≠ μ_{0}, (5)
is relatively robust to certain departures from Normality, especially when the underlying distribution is symmetric, but the ANOVA test is rather sensitive to such departures! He continued this line of research into his 80s; see Pearson and Please (1975).
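Pearson’s simulation-based probing is easy to reproduce with modern pseudo-random numbers. The sketch below (my illustration; the parent distributions, sample size, and replication count are arbitrary choices, not from the original studies) estimates the actual type I error of the t-test (4) at nominal level .05 when the data come from a symmetric versus a skewed non-Normal parent:

```python
import random
import statistics

random.seed(2)

def rejection_rate(draw, mu0, n=10, c=2.262, reps=50_000):
    # Empirical size of the two-sided t-test (4) at nominal level .05,
    # when the data actually come from the distribution sampled by draw().
    hits = 0
    for _ in range(reps):
        x = [draw() for _ in range(n)]
        xbar, s = statistics.fmean(x), statistics.stdev(x)
        if abs((n ** 0.5) * (xbar - mu0) / s) > c:
            hits += 1
    return hits / reps

# Symmetric non-Normal parent: Uniform(0, 1), whose true mean is 0.5
u_rate = rejection_rate(random.random, 0.5)
# Skewed parent: Exponential(1), whose true mean is 1
e_rate = rejection_rate(lambda: random.expovariate(1.0), 1.0)
print(u_rate, e_rate)  # the symmetric case stays close to .05
```

Runs like this echo Pearson’s broad finding: symmetric departures disturb the t-test’s size far less than skewed ones.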
Perhaps more importantly, Pearson (1930) proposed a test of the Normality assumption based on the skewness and kurtosis coefficients: a Mis-Specification (M-S) test. Ironically, Fisher (1929) provided the sampling distributions of the sample skewness and kurtosis statistics upon which Pearson’s test was based. Pearson continued sharpening his original M-S test for Normality, and his efforts culminated in the D’Agostino and Pearson (1973) test that is widely used today; see also Pearson et al. (1977). The crucial importance of testing Normality stems from the fact that it renders the ‘robustness/sensitivity’ problem manageable: the test results can be used to narrow down the possible departures one needs to worry about, and to suggest ways to respecify the original model.
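A rough illustration of the ingredients of such an M-S test (this computes only the raw sample skewness √b₁ and kurtosis b₂, not the standardized D’Agostino-Pearson combination; the distributions and sample sizes are my illustrative choices):

```python
import random
import statistics

random.seed(3)

def skewness_kurtosis(x):
    # Sample skewness sqrt(b1) and kurtosis b2: the statistics on which
    # Pearson-style normality (mis-specification) tests are built.
    n = len(x)
    m = statistics.fmean(x)
    m2 = sum((v - m) ** 2 for v in x) / n
    m3 = sum((v - m) ** 3 for v in x) / n
    m4 = sum((v - m) ** 4 for v in x) / n
    return m3 / m2 ** 1.5, m4 / m2 ** 2

# Under Normality, sqrt(b1) should be near 0 and b2 near 3.
sk_n, ku_n = skewness_kurtosis([random.gauss(0, 1) for _ in range(20_000)])
# An Exponential(1) sample departs sharply (theory: skewness 2, kurtosis 9).
sk_e, ku_e = skewness_kurtosis([random.expovariate(1.0) for _ in range(20_000)])
print(sk_n, ku_n, sk_e, ku_e)
```

Large departures of these two statistics from (0, 3) are precisely the signal the M-S test formalizes.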
After Pearson’s early publications on the ‘robustness/sensitivity’ problem, Gosset realized that simulation alone was not effective enough to address the question of robustness, and called upon Fisher (who initially rejected the call, saying ‘it was none of his business’) to derive analytically the implications of non-Normality using different distributions:
“How much does it [nonNormality] matter? And in fact that is your business: none of the rest of us have the slightest chance of solving the problem: we can play about with samples [i.e. perform simulation studies], I am not belittling E. S. Pearson’s work, but it is up to you to get us a proper solution.” (Lehmann, 1999).
In this passage one can discern the high esteem in which Gosset held Fisher for his technical ability. Fisher’s reply was rather blunt:
“I do not think what you are doing with non-normal distributions is at all my business, and I doubt if it is the right approach. … Where I differ from you, I suppose, is in regarding normality as only a part of the difficulty of getting data; viewed in this collection of difficulties I think you will see that it is one of the least important.”
It’s clear from this that Fisher understood the problem of how to handle departures from Normality more broadly than his contemporaries. His answer alludes to two issues that were not well understood at the time:
(a) departures from the other two probabilistic assumptions (IID) have much more serious consequences for Normal-based inference than departures from Normality, and
(b) deriving the consequences of particular forms of non-Normality for the reliability of Normal-based inference, and proclaiming that a procedure enjoys a certain level of ‘generic’ robustness, does not provide a complete answer to the problem of dealing with departures from the inductive premises.
In relation to (a), it is important to note that the role of ‘randomness’, as it relates to the IID assumptions, was not well understood until the 1940s, when the notion of non-IID was framed in terms of explicit forms of heterogeneity and dependence pertaining to stochastic processes. Hence, the problem of assessing departures from IID was largely ignored at the time, with attention focused almost exclusively on departures from Normality. Indeed, the early literature on non-parametric inference retained the IID assumptions and focused on inference procedures that replace the Normality assumption with indirect distributional assumptions pertaining to the ‘true’ but unknown f(x), such as the existence of certain moments, symmetry, smoothness, continuity and/or differentiability, unimodality, etc.; see Lehmann (1975). It is interesting to note that Egon Pearson did not consider the question of testing the IID assumptions until his 1963 paper.
In relation to (b), when one poses the question ‘how robust to non-Normality is the reliability of inference based on a t-test?’ one ignores the fact that the t-test might no longer be the ‘optimal’ test under a non-Normal distribution. This is because the sampling distribution of the test statistic and the associated type I and II error probabilities depend crucially on the validity of the statistical model assumptions. When any of these assumptions are invalid, the relevant error probabilities are no longer the ones derived under the original model assumptions, and the optimality of the original test is called into question. For instance, assuming that the ‘true’ distribution is uniform (Gosset’s rectangular):
X_{k} ∽ U(a−μ, a+μ), k = 1, 2, …, n, … (6)
where f(x; a, μ) = 1/(2μ), (a−μ) ≤ x ≤ (a+μ), μ > 0,
how does one assess the robustness of the t-test? One might invoke its generic robustness to symmetric non-Normal distributions and proceed as if the t-test is ‘fine’ for testing the hypotheses (5). A more well-grounded answer would be to assess the discrepancy between the nominal (assumed) error probabilities of the t-test based on (1) and the actual ones based on (6). If the latter approximate the former ‘closely enough’, one can justify the generic robustness. These answers, however, raise a broader question: what are the relevant error probabilities? After all, the optimal test for the hypotheses (5) in the context of (6) is no longer the t-test, but the test defined by:
w(X) = {(n−1)([X_{[1]} + X_{[n]}] − μ_{0})}/{[X_{[n]} − X_{[1]}]} ∽ F(2, 2(n−1)), (7)
with rejection region C_{1} := {x: w(x) > c_{α}}, where X_{[1]} and X_{[n]} denote the smallest and the largest elements of the ordered sample (X_{[1]}, X_{[2]}, …, X_{[n]}), and F(2, 2(n−1)) the F distribution with 2 and 2(n−1) degrees of freedom; see Neyman and Pearson (1928). One can argue that the relevant comparison error probabilities are no longer those of the t-test ‘corrected’ to account for the assumed departure, but those of the test in (7). For instance, let the t-test have nominal and actual significance levels of .05 and .045, and power at μ_{1} = μ_{0} + 1 of .4 and .37, respectively. Conventional wisdom would call the t-test robust, but is it reliable (effective) when compared with the test in (7), whose significance level and power (at μ_{1}) are, say, .03 and .9, respectively?
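A quick way to see why the t-test forfeits optimality under (6): for uniform data, the midrange built from the extreme order statistics pins down the centre of the distribution far more precisely than the sample mean, which is why a test built on (X_[1], X_[n]), like (7), can be much more powerful. A small simulation (my illustration; n and the replication count are arbitrary):

```python
import random
import statistics

random.seed(4)

# Compare two estimators of the centre of a Uniform(-1, 1) population:
# the sample mean (on which the t-test is built) versus the midrange
# (X_[1] + X_[n]) / 2 (on which a test like (7) is built).
n, reps = 20, 20_000
means, midranges = [], []
for _ in range(reps):
    x = [random.uniform(-1.0, 1.0) for _ in range(n)]
    means.append(statistics.fmean(x))
    midranges.append((min(x) + max(x)) / 2)

sd_mean = statistics.pstdev(means)          # theory: 1/sqrt(3n), about 0.13
sd_midrange = statistics.pstdev(midranges)  # shrinks like 1/n: much smaller
print(sd_mean, sd_midrange)
```

The midrange’s standard deviation shrinks at rate 1/n rather than 1/√n, so the extremes-based test leaves the t-test far behind as n grows.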
A strong case can be made that a more complete approach to the statistical misspecification problem is:
(i) to probe thoroughly for any departures from all the model assumptions using trenchant MS tests, and if any departures are detected,
(ii) proceed to respecify the statistical model by choosing a more appropriate model, with a view to accounting for the statistical information that the original model did not.
Admittedly, this is a more demanding way to deal with departures from the underlying assumptions, but it addresses the concerns of Gosset, Egon Pearson, Neyman and Fisher much more effectively than the invocation of vague robustness claims; see Spanos (2010).
References
Bartlett, M. S. (1981) “Egon Sharpe Pearson, 11 August 1895 – 12 June 1980,” Biographical Memoirs of Fellows of the Royal Society, 27: 425-443.
D’Agostino, R. and E. S. Pearson (1973) “Tests for Departure from Normality. Empirical Results for the Distributions of b₂ and √(b₁),” Biometrika, 60: 613-622.
Fisher, R. A. (1915) “Frequency distribution of the values of the correlation coefficient in samples from an indefinitely large population,” Biometrika, 10: 507-521.
Fisher, R. A. (1921) “On the ‘probable error’ of a coefficient of correlation deduced from a small sample,” Metron, 1: 3-32.
Fisher, R. A. (1922a) “On the mathematical foundations of theoretical statistics,” Philosophical Transactions of the Royal Society A, 222: 309-368.
Fisher, R. A. (1922b) “The goodness of fit of regression formulae, and the distribution of regression coefficients,” Journal of the Royal Statistical Society, 85: 597-612.
Fisher, R. A. (1925) Statistical Methods for Research Workers, Oliver and Boyd, Edinburgh.
Fisher, R. A. (1929) “Moments and Product Moments of Sampling Distributions,” Proceedings of the London Mathematical Society, Series 2, 30: 199-238.
Neyman, J. and E. S. Pearson (1928) “On the use and interpretation of certain test criteria for purposes of statistical inference: Part I,” Biometrika, 20A: 175-240.
Neyman, J. and E. S. Pearson (1933) “On the problem of the most efficient tests of statistical hypotheses,” Philosophical Transactions of the Royal Society A, 231: 289-337.
Lehmann, E. L. (1975) Nonparametrics: Statistical Methods Based on Ranks, Holden-Day, San Francisco.
Lehmann, E. L. (1999) “‘Student’ and Small-Sample Theory,” Statistical Science, 14: 418-426.
Pearson, E. S. (1929a) “Review of ‘Statistical Methods for Research Workers,’ 1928, by Dr. R. A. Fisher,” Nature, June 8th, pp. 866-867.
Pearson, E. S. (1929b) “Some notes on sampling tests with two variables,” Biometrika, 21: 337-360.
Pearson, E. S. (1930) “A further development of tests for normality,” Biometrika, 22: 239-249.
Pearson, E. S. (1931) “The analysis of variance in cases of non-normal variation,” Biometrika, 23: 114-133.
Pearson, E. S. (1963) “Comparison of tests for randomness of points on a line,” Biometrika, 50: 315-325.
Pearson, E. S. and N. K. Adyanthaya (1928) “The distribution of frequency constants in small samples from symmetrical populations,” Biometrika, 20: 356-360.
Pearson, E. S. and N. K. Adyanthaya (1929) “The distribution of frequency constants in small samples from non-normal symmetrical and skew populations,” Biometrika, 21: 259-286.
Pearson, E. S. and N. W. Please (1975) “Relations between the shape of the population distribution and the robustness of four simple test statistics,” Biometrika, 62: 223-241.
Pearson, E. S., R. B. D’Agostino and K. O. Bowman (1977) “Tests for departure from normality: comparisons of powers,” Biometrika, 64: 231-246.
Spanos, A. (2010) “Akaike-type Criteria and the Reliability of Inference: Model Selection vs. Statistical Model Specification,” Journal of Econometrics, 158: 204-220.
Student (1908) “The Probable Error of the Mean,” Biometrika, 6: 1-25.
Today is Egon Pearson’s birthday. In honor of his birthday, I am posting “Statistical Concepts in Their Relation to Reality” (Pearson 1955). I’ve posted it several times over the years, but always find a new gem or two, despite its being so short. E. Pearson rejected some of the familiar tenets that have come to be associated with Neyman and Pearson (N-P) statistical tests, notably the idea that the essential justification for tests resides in a long-run control of rates of erroneous interpretations – what he termed the “behavioral” rationale of tests. In an unpublished letter E. Pearson wrote to Birnbaum (1974), he talks about N-P theory admitting of two interpretations: behavioral and evidential:
“I think you will pick up here and there in my own papers signs of evidentiality, and you can say now that we or I should have stated clearly the difference between the behavioral and evidential interpretations. Certainly we have suffered since in the way the people have concentrated (to an absurd extent often) on behavioral interpretations”.
(Nowadays, some people concentrate to an absurd extent on “science-wise error rates in dichotomous screening”.)
When Erich Lehmann, in his review of my Error and the Growth of Experimental Knowledge (EGEK 1996), called Pearson “the hero of Mayo’s story,” it was because I found in E.S.P.’s work, if only in brief discussions, hints, and examples, the key elements for an “inferential” or “evidential” interpretation of N-P statistics. Granted, these “evidential” attitudes and practices have never been explicitly codified to guide the interpretation of N-P tests. I suspect that “Pearson and Pearson” statistics (both Egon, not Karl) would have looked very different from Neyman and Pearson statistics. One of the best sources of E.S. Pearson’s statistical philosophy is his (1955) “Statistical Concepts in Their Relation to Reality”. It begins like this:
Controversies in the field of mathematical statistics seem largely to have arisen because statisticians have been unable to agree upon how theory is to provide, in terms of probability statements, the numerical measures most helpful to those who have to draw conclusions from observational data. We are concerned here with the ways in which mathematical theory may be put, as it were, into gear with the common processes of rational thought, and there seems no reason to suppose that there is one best way in which this can be done. If, therefore, Sir Ronald Fisher recapitulates and enlarges on his views upon statistical methods and scientific induction we can all only be grateful, but when he takes this opportunity to criticize the work of others through misapprehension of their views as he has done in his recent contribution to this Journal (Fisher 1955 “Scientific Methods and Scientific Induction” ), it is impossible to leave him altogether unanswered.
In the first place it seems unfortunate that much of Fisher’s criticism of Neyman and Pearson’s approach to the testing of statistical hypotheses should be built upon a “penetrating observation” ascribed to Professor G.A. Barnard, the assumption involved in which happens to be historically incorrect. There was no question of a difference in point of view having “originated” when Neyman “reinterpreted” Fisher’s early work on tests of significance “in terms of that technological and commercial apparatus which is known as an acceptance procedure”. There was no sudden descent upon British soil of Russian ideas regarding the function of science in relation to technology and to five-year plans. It was really much simpler – or worse. The original heresy, as we shall see, was a Pearson one!…
To continue reading, “Statistical Concepts in Their Relation to Reality” click HERE
Pearson doesn’t mean it was he who endorsed the behavioristic model that Fisher is here attacking.[i] The “original heresy” refers to the break from Fisher in the explicit introduction of alternative hypotheses (even if only directional). Without considering alternatives, Pearson and Neyman argued, significance tests are insufficiently constrained–for evidential purposes! However, this does not mean NP tests give us merely a comparativist appraisal (as in a report of relative likelihoods!)
This is a good weekend to read or reread “the triad”:
I’ll post some other Pearson items over the week.
HAPPY BIRTHDAY E. PEARSON
[i] Fisher’s tirades against behavioral interpretations of “his” tests are almost entirely a reflection of his break with Neyman (after 1935) rather than any radical disagreement either in philosophy or method. Fisher could be even more behavioristic in practice (if not in theory) than Neyman, and Neyman could be even more evidential in practice (if not in theory) than Fisher. Moreover, it was really when others discovered that Fisher’s fiducial methods could fail to correspond to intervals with valid error probabilities that Fisher began claiming he never really was too wild about them! (Check fiducial on this blog.) Contemporary writers love to harp on the so-called “inconsistent hybrid” combining Fisherian and N-P tests, but it’s largely a lot of hoopla growing out of either their taking Fisher-Neyman personality feuds at face value or (more likely) imposing their own philosophies of statistics on the historical exchanges. It’s time to dismiss these popular distractions: they are serious obstacles to progress in statistical understanding. Most notably, Fisherians are kept from adopting features of N-P statistics, and vice versa (or they adopt them improperly). What matters is what the methods are capable of doing! For more on this, see “it’s the methods, stupid!”
Reference
Lehmann, E. (1997). Review of Error and the Growth of Experimental Knowledge by Deborah G. Mayo, Journal of the American Statistical Association, Vol. 92.
Also of relevance:
Erich Lehmann (1993), “The Fisher, Neyman-Pearson Theories of Testing Hypotheses: One Theory or Two?”, Journal of the American Statistical Association, Vol. 88, No. 424: 1242-1249.
Mayo, D. (1996), “Why Pearson Rejected the NeymanPearson (Behavioristic) Philosophy and a Note on Objectivity in Statistics” (Chapter 11) in Error and the Growth of Experimental Knowledge. Chicago: University of Chicago Press. [This is a somewhat older view of mine.]
Mayo, D. (2018). Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars. CUP. (Sept. 1) [A much newer view of mine.]
Today is Karl Popper’s birthday. I’m linking to a reading from his Conjectures and Refutations[i] along with: Popper SelfTest Questions. It includes multiple choice questions, quotes to ponder, an essay, and thumbnail definitions at the end[ii].
Blog Readers who wish to send me their answers will have their papers graded [use the comments or error@vt.edu.] An A or better earns a signed copy of my forthcoming book: Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars. [iii]
[i] Popper reading from Conjectures and Refutations
[ii] I might note the “NoPain philosophy” (3 part) Popper posts on this blog: parts 1, 2, and 3.
[iii] I posted this once before, but now I have a better prize.
HAPPY BIRTHDAY POPPER!
REFERENCE:
Popper, K. (1962). Conjectures and Refutations: The Growth of Scientific Knowledge. New York: Basic Books.
MONTHLY MEMORY LANE: 3 years ago: July 2015. I mark in red 3-4 posts from each month that seem most apt for general background on key issues in this blog, excluding those reblogged recently[1], and in green up to 3 others of general relevance to philosophy of statistics [2]. Posts that are part of a “unit” or a group count as one.
July 2015
[1] Monthly memory lanes began at the blog’s 3year anniversary in Sept, 2014.
[2] New Rule, July 30, 2016, March 30, 2017: a very convenient way to allow data-dependent choices (note why it’s legit in selecting blog posts, on severity grounds).
A common misinterpretation of Numbers Needed to Treat is causing confusion about the scope for personalised medicine.
Stephen Senn
Consultant Statistician,
Edinburgh
Thirty years ago, Laupacis et al^{1} proposed an intuitively appealing way that physicians could decide how to prioritise health care interventions: they could consider how many patients would need to be switched from an inferior treatment to a superior one in order for one to have an improved outcome. They called this the number needed to be treated. It is now more usually referred to as the number needed to treat (NNT).
Within fifteen years, NNTs were so well established that the then editor of the British Medical Journal, Richard Smith, could write: ‘Anybody familiar with the notion of “number needed to treat” (NNT) knows that it’s usually necessary to treat many patients in order for one to benefit’^{2}. Fifteen years further on, bringing us up to date, Wikipedia makes a similar point: ‘The NNT is the average number of patients who need to be treated to prevent one additional bad outcome (e.g. the number of patients that need to be treated for one of them to benefit compared with a control in a clinical trial).’^{3}
This common interpretation is false, as I have pointed out previously in two blogs on this site: Responder Despondency and Painful Dichotomies. Nevertheless, it seems to me the point is worth making again and the thirtyyear anniversary of NNTs provides a good excuse.
NNTs based on dichotomies, as opposed to those based on true binary outcomes (which are very rare), do not measure the proportion of patients who benefit from the drug and even when not based on such dichotomies, they say less about differential response than many suppose. Common false interpretations of NNTs are creating confusion about the scope for personalised medicine.
To illustrate the problem, consider a 2015 Nature comment piece by Nicholas Schork^{4} calling for N-of-1 trials to be used more often in personalising medicine. These are trials in which, as a guide to treatment, patients are repeatedly randomised in different episodes to the therapies being compared^{5}.
NNTs are commonly used in health economics. Other things being equal, a drug with a larger NNT ought to have a lower cost per patient day than one with a smaller NNT if it is to justify its place in the market. Here, however, they were used to make the case for the scope for personalised medicine, and hence the need for N-of-1 trials, a potentially very useful approach to personalising treatment. Schork claimed, ‘The top ten highest-grossing drugs in the United States help between 1 in 25 and 1 in 4 of the people who take them’ (p. 609). This claim may or may not be correct (it is almost certainly wrong), but the argument for it is false.
The figure ‘Imperfect medicine’ is based on Schork’s figure ‘Imprecision medicine’ and shows the NNTs for the ten best-selling drugs in the USA at the time of his comment. The NNTs range, for example, from 4 for Humira® in arthritis to 25 for Nexium® in heartburn. This is then interpreted as meaning that since, for example, on average 4 patients would have to be treated with Humira rather than placebo in order to get one more response, only 1 in 4 patients responds to Humira.
Imperfect medicine: Numbers Needed to Treat, based on a figure in Schork (2015). The total number of dots represents how many patients you would have to switch to the treatment mentioned to get one additional response (blue dot). The red dots are supposed to represent the patients for whom it would make no difference.
Take the example of Nexium. The figure quoted by Schork is taken from a meta-analysis carried out by Gralnek et al^{6} based on several studies comparing Esomeprazole (Nexium) to other proton pump inhibitors. The calculation of the NNT may be illustrated by taking one of the studies that comprise the meta-analysis, the EXPO study reported by Labenz et al^{7}, in which a clinical trial with more than 3000 patients compared Esomeprazole to Pantoprazole. Patients with erosive oesophagitis were treated with either one or the other treatment and then evaluated at 8 weeks.
Of those treated with Esomeprazole, 92.1% were healed; of those treated with Pantoprazole, 87.3% were healed. The difference of 4.8% is the risk difference. Expressed as a proportion this is 0.048, and its reciprocal, rounded up to the nearest whole number, is 21. This figure is the NNT, and an interpretation is that on average you would need to treat 21 patients with Esomeprazole rather than with Pantoprazole to have one extra healed case at 8 weeks. For the meta-analysis as a whole, Gralnek et al^{6} found a risk difference of 4%, which yields an NNT of 25, the figure quoted by Schork. (See Box for further discussion.)
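The arithmetic of the NNT is simple enough to state in a couple of lines (a sketch; the second figure uses the 4% meta-analytic risk difference directly rather than trial-specific rates):

```python
import math

def nnt(risk_treated, risk_control):
    # NNT = reciprocal of the risk difference,
    # rounded up to the next whole patient.
    return math.ceil(1 / (risk_treated - risk_control))

print(nnt(0.921, 0.873))    # 21, the EXPO figure at 8 weeks
print(math.ceil(1 / 0.04))  # 25, from the 4% meta-analytic risk difference
```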
Two different interpretations of the EXPO oesophageal ulcer data
It is impossible for us to observe the ulcers that were studied in the EXPO trial under both treatments. Each patient was treated with either Esomeprazole or Pantoprazole: we can imagine what the response would have been on either, but we can only observe it on one. Table 1 and Table 2 have the same observable marginal probabilities of ulcer healing but different postulated joint ones.
Table 1: Possible joint distribution of response (percentages) for the EXPO trial. Case where no patient would respond on Pantoprazole who did not on Esomeprazole.
In the case of Table 1, no patient who would not have been healed by Esomeprazole could have been healed by Pantoprazole. In consequence, the total number of patients who could have been healed is the number who were healed with Esomeprazole, that is to say 92.1%. In the case of Table 2, all patients who were not healed by Esomeprazole, that is to say 7.9%, could have been healed by Pantoprazole. In principle it becomes possible to heal all patients. Of course, intermediate situations are possible, but all such tables have the same NNT of 21. The NNT cannot tell us which is true.
Table 2: Possible joint distribution of response (percentages) for the EXPO trial. Case where all patients who did not respond on Esomeprazole would respond on Pantoprazole.
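The two extreme tables can be reconstructed from the marginals alone (a sketch of the Box’s arithmetic, in percentages):

```python
# Extreme joint distributions compatible with the EXPO marginals:
# 92.1% healed on Esomeprazole (E), 87.3% on Pantoprazole (P).
p_e, p_p = 92.1, 87.3

# Table 1 extreme: nobody healed by P who was not also healed by E.
both_1 = p_p             # healed by both: 87.3
e_only_1 = p_e - p_p     # healed by E only: 4.8
healable_1 = p_e         # at most 92.1% could ever be healed

# Table 2 extreme: everyone failing on E would have been healed by P.
p_only_2 = 100 - p_e     # healed by P only: 7.9
both_2 = p_p - p_only_2  # healed by both: 79.4
healable_2 = 100.0       # in principle everyone could be healed

# Both tables share the same risk difference, hence the same NNT of 21.
print(healable_1, healable_2)
```

The observable risk difference, and so the NNT, is identical in both cases; only the unobservable joint distribution differs.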

A number of points can be made using this example. First, the NNT is comparator-specific: proton pump inhibitors as a class are highly effective, and one would get quite a different figure if placebo rather than Pantoprazole had been used as the control for Esomeprazole. Second, the figure, of itself, does not tell us the scope for personalising medicine: it is quite compatible with the two extreme positions given in the Box. In the first case, every single patient who was helped by Pantoprazole would have been helped by Esomeprazole; if there are no cost or tolerability advantages to the former, the optimal policy would be to give all patients the latter. In the second case, every single patient who was not helped by Esomeprazole would have been helped by Pantoprazole; if a suitable means can be found of identifying such patients, all patients can be treated successfully. Third, healing is a process that takes time, and the eight-week timepoint is partly arbitrary. The careful analysis presented by Labenz et al^{7} shows healing rates rising with time, with the Esomeprazole rate always above that for Pantoprazole; perhaps with time either would heal all ulcers, the difference between them being one of speed. Fourth, although it is not directly related to this discussion, it should be appreciated that a given drug can have many NNTs: the NNT will vary according to the comparator, the outcome chosen, the cut point for any dichotomy and the follow-up^{8}. (The original article proposing NNTs by Laupacis et al^{1} discusses a number of such caveats.) Indeed, for the EXPO study the risk difference at 4 weeks is 8.7%, with an NNT of 12 rather than the 21 found at 8 weeks. This shows the importance of not mixing NNTs for different follow-ups in a meta-analysis.
There are no shortcuts to finding evidence for variation in response^{9}. Dichotomising continuous measures not only has the capacity to exaggerate unimportant differences it is also inefficient and needlessly increases trial sizes^{10}.
Rather than becoming simpler, the ways in which clinical trials are reported need to be more nuanced. In a previous blog I showed how an NNT of 10 for headache had been misinterpreted as meaning that only 1 in 10 benefitted from paracetamol. It is, or ought to be, obvious that in order to understand the extent to which patients respond to paracetamol you should study them more than once under treatment and under control. For example, a design could be employed in which each patient was treated for four headaches, twice with placebo and twice with paracetamol. This is an example of the N-of-1 trials that Schork calls for^{4}. We hardly ever run these. Of course, for some diseases they are not practical, but where we can’t run them, we should not pretend to have identified what we can’t.
The role for N-of-1 trials is indeed there, but not necessarily to personalise treatment. More careful analysis of response may simply reveal that it is less variable than supposed^{11}. In some cases such trials may deliver the message that we need to do better for everybody^{12}.
In his editorial of 2003 Smith referred to pharmacogenetics as providing ‘hopes that greater understanding of genetics will mean that we will be able to identify with a “simple genetic test” people who will respond to drugs and design drugs for individuals rather than populations.’ and added, ‘We have, however, been hearing this tune for a long time’^{2}.
Smith’s complaint about an old tune is as true today as it was in 2003. However, the message for the pharmaceutical industry may simply be that we need better drugs not better diagnosis.
I am grateful to Andreas Laupacis and Jennifer Deevy for helpfully providing me with a copy of the 1988 paper.
I’m reblogging a few of the Higgs posts at the 6th anniversary of the 2012 discovery. (The first was in this post.) The following, was originally “Higgs Analysis and Statistical Flukes: part 2″ (from March, 2013).[1]
Some people say to me: “This kind of [severe testing] reasoning is fine for a ‘sexy science’ like high energy physics (HEP)”, as if their statistical inferences were radically different. But I maintain that this is the mode by which data are used in “uncertain” reasoning across the entire landscape of science and day-to-day learning (at least, when we’re trying to find things out).[2] Even with high-level theories, the particular problems of learning from data are tackled piecemeal, in local inferences that afford error control. Granted, this statistical philosophy differs importantly from those that view the task as assigning comparative (or absolute) degrees of support/belief/plausibility to propositions, models, or theories.
“Higgs Analysis and Statistical Flukes: part 2”
Everyone was excited when the Higgs boson results were reported on July 4, 2012, indicating evidence for a Higgs-like particle based on a “5 sigma observed effect”. The observed effect refers to the number of excess events of a given type that are “observed” in comparison to the number (or proportion) that would be expected from background alone, and not due to a Higgs particle. This continues my earlier post (part 1). It is an outsider’s angle on one small aspect of the statistical inferences involved. But that, apart from being fascinated by it, is precisely why I have chosen to discuss it: we [philosophers of statistics] should be able to employ a general philosophy of inference to get an understanding of what is true about the controversial concepts we purport to illuminate, e.g., significance levels.
Here I keep close to an official report from ATLAS, in which researchers define a “global signal strength” parameter “such that μ = 0 corresponds to the background-only hypothesis and μ = 1 corresponds to the SM Higgs boson signal in addition to the background” (where SM is the Standard Model). The statistical test may be framed as a one-sided test, where the test statistic (which is actually a ratio) records differences in the positive direction, in standard deviation (sigma) units. Reports such as
Pr(Test T would yield at least a 5 sigma excess; H_{0}: background only) = extremely low
are deduced from the sampling distribution of the test statistic, fortified with much cross-checking of results (e.g., by modeling and simulating relative frequencies of observed excesses generated with “Higgs signal + background” compared to background alone). The inferences, even the formal statistical ones, go beyond p-value reports. For instance, they involve setting lower and upper bounds such that values excluded are ruled out with high severity, to use my term. But the popular report is in terms of the observed 5 sigma excess in an overall test T, and that is mainly what I want to consider here.
Error probabilities
In a Neyman-Pearson setting, a cut-off c_{α} is chosen pre-data so that the probability of a type I error is low. In general,
Pr(d(X) > c_{α}; H_{0}) ≤ α
and in particular, alluding to an overall test T:
(1) Pr(Test T yields d(X) > 5 standard deviations; H_{0}) ≤ .0000003.
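The bound in (1) is just the upper tail of the test statistic’s sampling distribution under H_{0}. As a minimal sketch (using a standard normal statistic as a stand-in for the actual ratio statistic, an assumption made purely for illustration), the one-sided 5 sigma tail probability can be computed with nothing beyond the standard library:

```python
from math import erfc, sqrt

def upper_tail(z: float) -> float:
    """P(Z >= z) for a standard normal Z: the one-sided tail area."""
    return 0.5 * erfc(z / sqrt(2))

# The 5-sigma threshold cited in the Higgs reports:
p_five_sigma = upper_tail(5.0)
print(f"{p_five_sigma:.1e}")  # about 2.9e-07, i.e. the .0000003 in (1)
```

The same function evaluated at smaller thresholds recovers the familiar values (e.g., roughly 0.0013 at 3 sigma), which is useful when reading sigma-level reports.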
The test at the same time is designed to ensure a reasonably high probability of detecting global strength discrepancies of interest. (I always use “discrepancy” to refer to parameter magnitudes, to avoid confusion with observed differences).
[Notice these are not likelihoods.] Alternatively, researchers can report observed standard deviations (here, the sigmas), or equivalently, the associated observed statistical significance probability, p_{0}. In general,
Pr(P < p_{0}; H_{0}) < p_{0}
and in particular,
(2) Pr(Test T yields P < .0000003; H_{0}) < .0000003.
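The inequality in (2) reflects the fact that, for a continuous test statistic, the p-value is itself a random variable, uniformly distributed under H_{0}. A quick simulation (hypothetical normal statistic, standard library only) illustrates why Pr(P < p_{0}; H_{0}) is no greater than p_{0}:

```python
import random
from math import erfc, sqrt

random.seed(1)

def p_value(z: float) -> float:
    """One-sided p-value for an observed statistic z under a N(0,1) null."""
    return 0.5 * erfc(z / sqrt(2))

p0 = 0.05
trials = 100_000
# Generate test statistics under H0 and record how often P falls below p0.
hits = sum(p_value(random.gauss(0.0, 1.0)) < p0 for _ in range(trials))
print(round(hits / trials, 3))  # close to p0 = 0.05, not systematically above it
```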
For test T to yield a “worse fit” with H_{0} (smaller p-value) due to background alone is sometimes called “a statistical fluke” or a “random fluke”, and the probability of so statistically significant a random fluke is ~0. With the March 2013 results, the 5 sigma difference has grown to 7 sigmas.
So probabilistic statements along the lines of (1) and (2) are standard. They allude to sampling distributions, either of the test statistic d(X), or of the p-value viewed as a random variable. They are scarcely illicit or prohibited. (I return to this in the last section of this post.)
An implicit principle of inference or evidence
Admittedly, the move to taking the 5 sigma effect as evidence for a genuine effect (of the Higgs-like sort) results from an implicit principle of evidence that I have been calling the severity principle (SEV). Perhaps the weakest form concerns a statistical rejection or falsification of the null. (I will deliberately use a few different variations on statements that can be made.)
Data x_{0} from a test T provide evidence for rejecting H_{0} (just) to the extent that H_{0} would (very probably) have survived, were it a reasonably adequate description of the process generating the data (with respect to the question).
It is also captured by a general frequentist principle of evidence (FEV) (Mayo and Cox 2010) and a variant on the general idea of severity (SEV) (EGEK 1996, Mayo and Spanos 2006).[3]
The sampling distribution is computed under the assumption that the production of observed results is similar to the “background alone”, with respect to relative frequencies of signal-like events. (Likewise for computations under hypothesized discrepancies.) The relationship between H_{0} and the probabilities of outcomes is an intimate one: the various statistical nulls refer to aspects of general types of data generating procedures (for a taxonomy, see Cox 1958, 1977). “H_{0} is true” is a shorthand for a very long statement that H_{0} is an approximately adequate model of a specified aspect of the process generating the data in the context. (This relates to statistical models and hypotheses living “lives of their own”.)
Severity and the detachment of inferences
The sampling distributions serve to give counterfactuals. In this case, they tell us what it would be like, statistically, were the mechanism generating the observed signals similar to H_{0}.[i] While one would want to go on to consider the probability that test T yields so statistically significant an excess under various alternatives to μ = 0, this suffices for the present discussion. Sampling distributions can be used to arrive at error probabilities that are relevant for understanding the capabilities of the test process, in relation to something we want to find out. Since a relevant test statistic is a function of the data and quantities about which we want to learn, the associated sampling distribution is the key to inference. (This is why the bootstrap, and other types of resampling, work when one has a random sample from the process or population of interest.)
The severity principle, put more generally:
Data from a test T[ii] provide good evidence for inferring H (just) to the extent that H passes severely with x_{0}, i.e., to the extent that H would (very probably) not have survived the test so well were H false.
(The severity principle can also be made out just in terms of relative frequencies, as with bootstrap resampling.) In this case, what is surviving is minimally the non-null. Regardless of the specification of a statistical inference, to assess the severity associated with a claim H requires considering H’s denial: together they exhaust the answers to a given question.
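For a concrete numerical sketch of the severity principle: in a one-sided normal test of μ = 0 vs μ > 0, the severity with which the claim μ > μ_{1} passes, given an observed mean, is the probability of a result less extreme than the one observed, computed under μ = μ_{1} (in the spirit of Mayo and Spanos 2006). The numbers below are invented for illustration:

```python
from math import erf, sqrt

def Phi(z: float) -> float:
    """Standard normal CDF."""
    return 0.5 * (1.0 + erf(z / sqrt(2)))

def severity(xbar: float, mu1: float, se: float) -> float:
    """SEV(mu > mu1): Pr(a less extreme result than xbar; mu = mu1).

    High severity: were mu <= mu1, a result this extreme would very
    probably not have occurred."""
    return Phi((xbar - mu1) / se)

# Hypothetical: observed mean 2.0, standard error 0.4 (a 5-sigma excess over mu = 0)
print(round(severity(2.0, 0.0, 0.4), 7))  # ~0.9999997: mu > 0 passes severely
print(round(severity(2.0, 1.5, 0.4), 3))  # ~0.894: mu > 1.5 is much less well probed
```

Note how the same observed result probes different discrepancies to different degrees: the larger the claimed discrepancy, the lower the severity with which it passes.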
Without making such a principle explicit, some critics assume the argument is all about the reported p-value. The inference actually detached from the evidence can be put in any number of ways, and no uniformity is to be expected or needed:
(3) There is strong evidence for H: a Higgs (or a Higgslike) particle.
(3)’ They have experimentally demonstrated H: a Higgs (or Higgslike) particle.
Or just, infer H.
Doubtless particle physicists would qualify these statements, but nothing turns on that. ((3) and (3)’ are a bit stronger than merely falsifying the null because certain properties of the particle must be shown. I leave this to one side.)
As always, the mere p-value is a pale reflection of the detailed information about the consistency of results that really fortifies the knowledge of a genuine effect. Nor is the precise improbability level what matters. We care about the inferences to real effects (and estimated discrepancies) that are warranted.
Qualifying claims by how well they have been probed
The inference is qualified by the statistical properties of the test, as in (1) and (2), but that does not prevent detaching (3). This much is shown: they are able to experimentally demonstrate the Higgs particle. They can take that much of the problem as solved and move on to other problems of discerning the properties of the particle, and much else that goes beyond our discussion*. There is obeisance to the strict fallibility of every empirical claim, but there is no probability assigned. Neither is there in daytoday reasoning, nor in the bulk of scientific inferences, which are not formally statistical. Having inferred (3), granted, one may say informally, “so probably we have experimentally demonstrated the Higgs”, or “probably, the Higgs exists” (?). Or an informal use of “likely” might arise. But whatever these might mean in informal parlance, they are not formal mathematical probabilities. (As often argued on this blog, discussions on statistical philosophy must not confuse these.)
[We can however write, SEV(H) ~1]
The claim in (3) is approximate and limited–as are the vast majority of claims of empirical knowledge and inference–and, moreover, we can say in just what ways. It is recognized that subsequent data will add precision to the magnitudes estimated, and may eventually lead to new and even entirely revised interpretations of the known experimental effects, models and estimates. That is what cumulative knowledge is about. (I sometimes hear people assert, without argument, that modeled quantities, or parameters, used to describe data generating processes are “things in themselves” and are outside the realm of empirical inquiry. This is silly. Else we’d be reduced to knowing only tautologies and maybe isolated instances as to how “I seem to feel now,” attained through introspection.)
Telling what’s true about significance levels
So we grant the critic that something like the severity principle is needed to move from statistical information plus background (theoretical and empirical) to inferences about evidence (and to what levels of approximation). It may be called lots of other things and framed in different ways, and the reader is free to experiment. What we should not grant the critic is any allegation that there should be, or invariably is, a link from a small observed significance level to a small posterior probability assignment to H_{0}. Worse, (1 – the p-value) is sometimes alleged to be the posterior probability accorded to the Standard Model itself! This is neither licensed nor wanted!
If critics (or the p-value police, as Wasserman called them) maintain that Higgs researchers are misinterpreting their significance levels, correct them with the probabilities in (1) and (2). If they say it is patently obvious that Higgs researchers want to use the p-value as a posterior probability assignment to H_{0}, point out the more relevant, and actually attainable,[iii] inference that is detached in (3). If they persist that what is really, really wanted is a posterior probability assignment to the inference about the Higgs in (3), ask why. As a formal posterior probability, it would require a prior probability on all hypotheses that could explain the data. That would include not just H and H_{0} but all rivals to the Standard Model, rivals to the data and statistical models, and higher-level theories as well. But can’t we just imagine a Bayesian catchall hypothesis? On paper, maybe, but where will we get these probabilities? What do any of them mean? How can the probabilities even be comparable in different data analyses, using different catchalls and different priors?[iv]
Degrees of belief will not do. Many scientists perhaps had (and have) strong beliefs in the Standard Model before the big collider experiments–given its perfect predictive success. Others may believe (and fervently wish) that it will break down somewhere (showing supersymmetry or whatnot); a major goal of inquiry is learning about viable rivals and how they may be triggered and probed. Research requires an open world, not a closed one with all possibilities trotted out and weighed by current beliefs.[v] We need to point up what has not yet been well probed, which, by the way, is very different from saying of a theory that it is “not yet probable”.
Those prohibited phrases
One may wish to return to some of the condemned phrases in particle physics reports. Take,
“There is less than a one in a million chance that their results are a statistical fluke”.
This is not to assign a probability to the null, just one of many ways (perhaps not the best) of putting claims about the sampling distribution: The statistical null asserts that H_{0}: background alone adequately describes the process.
H_{0} does not assert the results are a statistical fluke, but it tells us what we need to determine the probability of observed results “under H_{0}”. In particular, consider all outcomes in the sample space that are further from the null prediction than the observed, in terms of p-values {x: p < p_{0}}. Even when H_{0} is true, such “signal-like” outcomes may occur. They are p_{0}-level flukes. Were such flukes generated even with moderate frequency under H_{0}, they would not be evidence against H_{0}. But in this case, such flukes occur a teeny tiny proportion of the time. Then SEV enters: if we are regularly able to generate such teeny tiny p-values, we have evidence of a genuine discrepancy from H_{0}.
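The point about relative frequencies of flukes under H_{0} can be checked directly by simulation. A sketch (background alone modeled as a standard normal statistic, an illustrative assumption; a 3 sigma threshold is used because 5 sigma flukes are far too rare to see in a cheap simulation):

```python
import random
from math import erfc, sqrt

random.seed(7)

def upper_tail(z: float) -> float:
    return 0.5 * erfc(z / sqrt(2))

threshold = 3.0   # 3 sigma; at 5 sigma the expected count in 200,000 draws is ~0.06
trials = 200_000
# Draws from background alone: any draw above the threshold is a "fluke" --
# a signal-like outcome produced with H0 true.
flukes = sum(random.gauss(0.0, 1.0) > threshold for _ in range(trials))
print(flukes / trials, round(upper_tail(threshold), 5))  # observed rate vs ~0.00135
```

Flukes at the 3 sigma level do occur, at about the rate the sampling distribution says they should; at 5 sigma, regularly generating them under background alone would require an extraordinary run of coincidences.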
I am repeating myself, I realize, in the hope that at least one phrasing will drive the point home. Nor is it even the improbability that substantiates this; it is the fact that an extraordinary set of coincidences would have to have occurred again and again. To nevertheless retain H_{0} as the source of the data would block learning. (Moreover, they know that if some horrible systematic mistake were made, it would be detected in later data analyses.)
I will not deny that there have been misinterpretations of p-values, but if a researcher has just described performing a statistical significance test, it would be “ungenerous” to twist probabilistic assertions into posterior probabilities. It would be a kind of “confirmation bias” whereby one insists on finding one sentence among very many that could conceivably be misinterpreted Bayesianly.
Triggering, indicating, inferring
As always, the error statistical philosopher would distinguish different questions at multiple stages of the inquiry. The aim of many preliminary steps is “behavioristic” and performance oriented: the goal being to control error rates on the way toward finding excess events or bumps of interest.
If interested: See statistical flukes (part 3)
The original posts of parts 1 and 2 had around 30 comments each; you might want to look at them:
Part 1: https://errorstatistics.com/2013/03/17/updateonhiggsdataanalysisstatisticalflukes1/
Part 2 https://errorstatistics.com/2013/03/27/higgsanalysisandstatisticalflukespart2/
*Fisher insisted that to assert a phenomenon is experimentally demonstrable: [W]e need, not an isolated record, but a reliable method of procedure. In relation to the test of significance, we may say that a phenomenon is experimentally demonstrable when we know how to conduct an experiment which will rarely fail to give us a statistically significant result. (Fisher, Design of Experiments 1947, 14)
2018/2015/2014 Notes
[0] Physicists manage to learn quite a lot from negative results. They’d love to find something more exotic, but the negative results will not go away. A recent article from CERN, “We need to talk about the Higgs”, says: While there are valid reasons to feel less than delighted by the null results of searches for physics beyond the Standard Model (SM), this does not justify a mood of despondency.
“Physicists aren’t just praying for hints of new physics, Strassler stresses. He says there is very good reason to believe that the LHC should find new particles. For one, the mass of the Higgs boson, about 125.09 billion electron volts, seems precariously low if the census of particles is truly complete. Various calculations based on theory dictate that the Higgs mass should be comparable to a figure called the Planck mass, which is about 17 orders of magnitude higher than the boson’s measured heft.” The article is here.
[1]My presentation at a Symposium on the Higgs discovery at the Philosophy of Science Association (Nov. 2014) is here.
[2] I have often noted that there are other times where we are trying to find evidence to support a previously held position.
[3] Aspects of the statistical controversy in the Higgs episode are discussed in Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (Mayo 2018).
___________
Original notes:
[i] This is a bit stronger than merely falsifying the null here, because certain features of the particle discerned must also be shown. I leave details to one side.
[ii] Which almost always refers to a set of tests, not just one.
[iii] I sense that some Bayesians imagine P(H) is more “hedged” than to actually infer (3). But the relevant hedging, the type we can actually attain, is given by an assessment of severity or corroboration or the like. Background enters via a repertoire of information about experimental designs, data analytic techniques, mistakes and flaws to be wary of, and a host of theories and indications about which aspects have/have not been severely probed. Many background claims enter to substantiate the error probabilities; others do not alter them.
[iv] In aspects of the modeling, researchers make use of known relative frequencies of events (e.g., rates of types of collisions) that lead to legitimate, empirically based, frequentist “priors” if one wants to call them that.
[v] After sending out the letter, prompted by Lindley, O’Hagan wrote up a synthesis: https://errorstatistics.com/2012/08/25/didhiggsphysicistsmissanopportunitybynotconsultingmorewithstatisticians/
REFERENCES (from March, 2013 post):
ATLAS Collaboration (November 14, 2012), Atlas Note: “Updated ATLAS results on the signal strength of the Higgslike boson for decays into WW and heavy fermion final states”, ATLASCONF2012162. http://cds.cern.ch/record/1494183/files/ATLASCONF2012162.pdf
Cox, D.R. (1958), “Some Problems Connected with Statistical Inference,” Annals of Mathematical Statistics, 29: 357–72.
Cox, D.R. (1977), “The Role of Significance Tests (with Discussion),” Scandinavian Journal of Statistics, 4: 49–70.
Mayo, D.G. (1996), Error and the Growth of Experimental Knowledge, University of Chicago Press, Chicago.
Mayo, D. G. and Cox, D. R. (2010), “Frequentist Statistics as a Theory of Inductive Inference,” in Error and Inference: Recent Exchanges on Experimental Reasoning, Reliability and the Objectivity and Rationality of Science (D. Mayo and A. Spanos eds.), Cambridge: Cambridge University Press: 247–275.
Mayo, D. G. and Spanos, A. (2006), “Severe Testing as a Basic Concept in a Neyman-Pearson Philosophy of Induction,” British Journal for the Philosophy of Science, 57: 323–357.
Below are the slides from my June 14 presentation at the X-Phil conference on Reproducibility and Replicability in Psychology and Experimental Philosophy at University College London. What I think must be examined seriously are the “hidden” issues that are going unattended in replication research and related statistics wars. An overview of the “hidden controversies” is on slide #3. Although I was presenting them as “hidden”, I hoped they wouldn’t be quite as invisible as I found them through the conference. (Since my talk was at the start, I didn’t know what to expect–else I might have noted some examples that seemed to call for further scrutiny.) Exceptions came largely (but not exclusively) from a small group of philosophers (me, Machery and Fletcher). Then again, there were parallel sessions, so I missed some. However, I did learn something about X-Phil, particularly from the very interesting poster session.[1] This new area should invite much, much more scrutiny of statistical methodology from philosophers of science.
[1] The women who organized and ran the conference did an excellent job: Lara Kirfel, a psychology PhD student at UCL, and Pascale Willemsen from Ruhr University.
Below are the slides from my talk today at Columbia University at a session, Philosophy of Science and the New Paradigm of Data-Driven Science, at an American Statistical Association Conference on Statistical Learning and Data Science/Nonparametric Statistics. Todd was brave to sneak philosophy of science into an otherwise highly mathematical conference.
Philosophy of Science and the New Paradigm of Data-Driven Science (Room VEC 902/903)
Organizer and Chair: Todd Kuffner (Washington U)
Today is Allan Birnbaum’s birthday. Birnbaum’s (1962) classic “On the Foundations of Statistical Inference,” in Breakthroughs in Statistics (volume I, 1993), concerns a principle that remains at the heart of today’s controversies in statistics–even if it isn’t obvious at first: the likelihood principle (LP) (also called the strong likelihood principle, SLP, to distinguish it from the weak LP [1]). According to the LP/SLP, given the statistical model, the information from the data is fully contained in the likelihood ratio. Thus, properties of the sampling distribution of the test statistic vanish (as I put it in my slides from this post)! But error probabilities are all properties of the sampling distribution. Thus, embracing the LP (SLP) blocks our error statistician’s direct ways of taking into account “biasing selection effects” (slide #10). [Posted earlier here.] Interestingly, as seen in a 2018 post on Neyman, Neyman did discuss this paper, but had an odd reaction that I’m not sure I understand. (Check it out.)
Intentions is a New Code Word: Where, then, is all the information regarding your trying and trying again, stopping when the data look good, cherry picking, barn hunting and data dredging? For likelihoodists and other probabilists who hold the LP/SLP, it is ephemeral information locked in your head reflecting your “intentions”! “Intentions” is a code word for “error probabilities” in foundational discussions, as in “who would want to take intentions into account?” (Replace “intentions” (or “the researcher’s intentions”) with “error probabilities” (or “the method’s error probabilities”) and you get a more accurate picture.) Keep this deciphering tool firmly in mind as you read criticisms of methods that take error probabilities into account.[2] For error statisticians, this information reflects real and crucial properties of your inference procedure.
Birnbaum struggled. Why? Because he regarded controlling the probability of misleading interpretations to be essential for scientific inference, and yet he seemed to have demonstrated that the LP/SLP followed from frequentist principles! That would mean error statistical principles entailed the denial of error probabilities! For many years this was assumed to be the case, and accounts that rejected error probabilities flourished. Frequentists often admitted their approach seemed to lack what Birnbaum called a “concept of evidence”–even those who suspected there was something pretty fishy about Birnbaum’s “proof”. I have shown the flaw in Birnbaum’s alleged demonstration of the LP/SLP (most fully in the Statistical Science issue). (It only uses logic, really, yet philosophers of science do not seem interested in it.) [3]
The Statistical Science Issue: This is the 4th Birnbaum birthday where I can point to the Statistical Science issue being out. But are textbooks making changes, or still calling this a theorem? I’ve a hunch that Birnbaum would have liked my rejoinder to discussants (Statistical Science): Bjornstad, Dawid, Evans, Fraser, Hannig, and Martin and Liu. For those unfamiliar with the argument, at the end of this entry are slides from an entirely informal talk as well as some links from this blog. Happy Birthday Birnbaum!
[1] The Weak LP concerns a single experiment; whereas, the strong LP concerns two (or more) experiments. The weak LP is essentially just the sufficiency principle.
[2] I will give a free signed hard copy of my new “Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars” (July 31, 2018) to each of the first 10 readers who sends a fully cited and linked published example (with distinct authors, you may be one) of criticisms of frequentist methods based on arguing against the relevance of “intentions”. Include as much of the cited material as needed for a reader to grasp the general argument. Entries must be posted as a comment to this post (not twitter), with a link to the article or portions of the article. A brief discussion of what you think of it should also be included. Judges on Elba have final say. [Write with questions.]*
[3] The argument still cries out for being translated into a symbolic logic of some sort.
Excerpts from my Rejoinder
I. Introduction
… As long-standing as Birnbaum’s result has been, Birnbaum himself went through dramatic shifts in a short period of time following his famous (1962) result. More than being of historical interest, these shifts provide a unique perspective on the current problem.
Already in the rejoinder to Birnbaum (1962), he is worried about criticisms (by Pratt 1962) pertaining to applying WCP to his constructed mathematical mixtures (what I call Birnbaumization), and hints at replacing WCP with another principle (Irrelevant Censoring). Then there is a gap until around 1968, at which point Birnbaum declares the SLP plausible “only in the simplest case, where the parameter space has but two” predesignated points (1968, 301). He tells us in Birnbaum (1970a, 1033) that he has pursued the matter thoroughly, leading to “rejection of both the likelihood concept and various proposed formalizations of prior information”. The basis for this shift is that the SLP permits interpretations that “can be seriously misleading with high probability” (1968, 301). He puts forward the “confidence concept” (Conf), which takes from the Neyman-Pearson (NP) approach “techniques for systematically appraising and bounding the probabilities (under respective hypotheses) of seriously misleading interpretations of data” while supplying it an evidential interpretation (1970a, 1033). Given the many different associations with “confidence,” I use (Conf) in this Rejoinder to refer to Birnbaum’s idea. Many of the ingenious examples of the incompatibilities of SLP and (Conf) are traceable back to Birnbaum, optional stopping being just one (see Birnbaum 1969). A bibliography of Birnbaum’s work is in Giere 1977. Before his untimely death (at 53), Birnbaum denied that the SLP even counts as a principle of evidence (in Birnbaum 1977). He thought it anomalous that (Conf) lacked an explicit evidential interpretation even though, at an intuitive level, he saw it as the “one rock in a shifting scene” in statistical thinking and practice (Birnbaum 1970, 1033). I return to this in part IV of this rejoinder…
IV PostSLP foundations
Return to where we left off in the opening section of this rejoinder: Birnbaum (1969).
The problemarea of main concern here may be described as that of determining precise concepts of statistical evidence (systematically linked with mathematical models of experiments), concepts which are to be nonBayesian, nondecisiontheoretic, and significantly relevant to statistical practice. (Birnbaum 1969, 113)
Given Neyman’s behavioral decision construal, Birnbaum claims that “when a confidence region estimate is interpreted as statistical evidence about a parameter” (1969, p. 122), an investigator has necessarily adjoined a concept of evidence, (Conf), that goes beyond the formal theory. What is this evidential concept? The furthest Birnbaum gets in defining (Conf) is in his posthumous article (1977):
(Conf) A concept of statistical evidence is not plausible unless it finds ‘strong evidence for H_{2} against H_{1}’ with small probability (α) when H_{1} is true, and with much larger probability (1 – β) when H_{2} is true. (1977, 24)
On the basis of (Conf), Birnbaum reinterprets statistical outputs from NP theory as strong, weak, or worthless statistical evidence depending on the error probabilities of the test (1977, 24–26). While this sketchy idea requires extensions in many ways (e.g., beyond pre-data error probabilities, and beyond the two-hypothesis setting), the spirit of (Conf), that error probabilities qualify properties of methods which in turn indicate the warrant to accord a given inference, is, I think, a valuable shift of perspective. This is not the place to elaborate, except to note that my own twist on Birnbaum’s general idea is to appraise evidential warrant by considering the capabilities of tests to have detected erroneous interpretations, a concept I call severity. That Birnbaum preferred a propensity interpretation of error probabilities is not essential. What matters is their role in picking up how features of experimental design and modeling alter a method’s capabilities to control “seriously misleading interpretations”. Even those who embrace a version of probabilism may find a distinct role for a severity concept. Recall that Fisher always criticized the presupposition that a single use of mathematical probability must be competent for qualifying inference in all logical situations (1956, 47).
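Birnbaum’s (Conf), as quoted above, is just a pair of error-probability requirements: a small chance α of reporting “strong evidence for H_{2} against H_{1}” when H_{1} is true, and a much larger chance 1 – β when H_{2} is true. A minimal numerical sketch (the model, sample size, and cut-off are invented for illustration):

```python
from math import erf, sqrt

def Phi(z: float) -> float:
    """Standard normal CDF."""
    return 0.5 * (1.0 + erf(z / sqrt(2)))

# Hypothetical: sample mean ~ N(mu, 1/n), n = 25; H1: mu = 0 vs H2: mu = 0.5.
# "Strong evidence for H2 against H1" is reported when the mean exceeds c.
n, c, mu2 = 25, 0.33, 0.5
se = 1 / sqrt(n)
alpha = 1 - Phi(c / se)                    # P(report strong evidence; H1) -- small
one_minus_beta = 1 - Phi((c - mu2) / se)   # P(report strong evidence; H2) -- large
print(round(alpha, 3), round(one_minus_beta, 3))  # roughly 0.049 and 0.802
```

With these (made-up) numbers the report would count, on (Conf), as strong evidence: it is rarely issued erroneously under H_{1} yet usually issued when H_{2} is true.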
Birnbaum’s philosophy evolved from seeking concepts of evidence in degree of support, belief, or plausibility between statements of data and hypotheses to embracing (Conf) with the required control of misleading interpretations of data. The former view reflected the logical empiricist assumption that there exist contextfree evidential relationships—a paradigm philosophers of statistics have been slow to throw off. The newer (postpositivist) movements in philosophy and history of science were just appearing in the 1970s. Birnbaum was ahead of his time in calling for a philosophy of science relevant to statistical practice; it is now long overdue!
“Relevant clarifications of the nature and roles of statistical evidence in scientific research may well be achieved by bringing to bear in systematic concert the scholarly methods of statisticians, philosophers and historians of science, and substantive scientists” (Birnbaum 1972, 861).
Link to complete discussion:
Mayo, Deborah G. On the Birnbaum Argument for the Strong Likelihood Principle (with discussion & rejoinder). Statistical Science 29 (2014), no. 2, 227266.
Links to individual papers:
Mayo, Deborah G. On the Birnbaum Argument for the Strong Likelihood Principle. Statistical Science 29 (2014), no. 2, 227239.
Dawid, A. P. Discussion of “On the Birnbaum Argument for the Strong Likelihood Principle”. Statistical Science 29 (2014), no. 2, 240241.
Evans, Michael. Discussion of “On the Birnbaum Argument for the Strong Likelihood Principle”. Statistical Science 29 (2014), no. 2, 242246.
Martin, Ryan; Liu, Chuanhai. Discussion: Foundations of Statistical Inference, Revisited. Statistical Science 29 (2014), no. 2, 247251.
Fraser, D. A. S. Discussion: On Arguments Concerning Statistical Principles. Statistical Science 29 (2014), no. 2, 252253.
Hannig, Jan. Discussion of “On the Birnbaum Argument for the Strong Likelihood Principle”. Statistical Science 29 (2014), no. 2, 254258.
Bjørnstad, Jan F. Discussion of “On the Birnbaum Argument for the Strong Likelihood Principle”. Statistical Science 29 (2014), no. 2, 259260.
Mayo, Deborah G. Rejoinder: “On the Birnbaum Argument for the Strong Likelihood Principle”. Statistical Science 29 (2014), no. 2, 261266.
Abstract: An essential component of inference based on familiar frequentist notions, such as p-values, significance and confidence levels, is the relevant sampling distribution. This feature results in violations of a principle known as the strong likelihood principle (SLP), the focus of this paper. In particular, if outcomes x^{∗} and y^{∗} from experiments E_{1} and E_{2} (both with unknown parameter θ) have different probability models f_{1}( . ), f_{2}( . ), then even though f_{1}(x^{∗}; θ) = cf_{2}(y^{∗}; θ) for all θ, outcomes x^{∗} and y^{∗} may have different implications for an inference about θ. Although such violations stem from considering outcomes other than the one observed, we argue, this does not require us to consider experiments other than the one performed to produce the data. David Cox [Ann. Math. Statist. 29 (1958) 357–372] proposes the Weak Conditionality Principle (WCP) to justify restricting the space of relevant repetitions. The WCP says that once it is known which E_{i} produced the measurement, the assessment should be in terms of the properties of E_{i}. The surprising upshot of Allan Birnbaum’s [J. Amer. Statist. Assoc. 57 (1962) 269–306] argument is that the SLP appears to follow from applying the WCP in the case of mixtures, together with so uncontroversial a principle as sufficiency (SP). But this would preclude the use of sampling distributions. The goal of this article is to provide a new clarification and critique of Birnbaum’s argument. Although his argument purports that [(WCP and SP) entails SLP], we show how data may violate the SLP while holding both the WCP and SP. Such cases also refute [WCP entails SLP].
Key words: Birnbaumization, likelihood principle (weak and strong), sampling theory, sufficiency, weak conditionality
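A standard concrete instance of the SLP setup in the abstract is the classic binomial vs negative binomial pair (the specific counts are the usual textbook choice, not taken from the paper): let E_{1} be 12 Bernoulli trials yielding 9 successes, and E_{2} be sampling until the 3rd failure, which happens to occur on trial 12. The likelihoods are proportional for every θ, yet the one-sided p-values for H_{0}: θ = 0.5 differ–exactly the kind of SLP violation at issue:

```python
from math import comb

# Same data pattern: 9 successes, 3 failures, theta unknown.
# E1: binomial, n = 12 fixed.  E2: negative binomial, stop at the 3rd failure.
def lik_binom(theta):  return comb(12, 9) * theta**9 * (1 - theta)**3
def lik_negbin(theta): return comb(11, 9) * theta**9 * (1 - theta)**3

# f1(x*; theta) = c * f2(y*; theta) for all theta, with c = 4:
assert all(abs(lik_binom(t) / lik_negbin(t) - 4.0) < 1e-12 for t in (0.2, 0.5, 0.8))

# One-sided p-values for H0: theta = 0.5 vs theta > 0.5 nonetheless differ:
p_binom  = sum(comb(12, k) for k in range(9, 13)) / 2**12             # P(X >= 9)
p_negbin = 1 - sum(comb(s + 2, 2) * 0.5**(s + 3) for s in range(9))   # P(S >= 9)
print(round(p_binom, 4), round(p_negbin, 4))  # 0.073 vs 0.0327
```

The difference arises because the two stopping rules have different sampling distributions, the very information the SLP declares irrelevant.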
Regular readers of this blog know that the topic of the “Strong Likelihood Principle (SLP)” has come up quite frequently. Numerous informal discussions of earlier attempts to clarify where Birnbaum’s argument for the SLP goes wrong may be found on this blog. [SEE PARTIAL LIST BELOW.[i]] These mostly stem from my initial paper Mayo (2010) [ii]. I’m grateful for the feedback.
[i] A quick take on the argument may be found in the appendix to: “A Statistical Scientist Meets a Philosopher of Science: A conversation between David Cox and Deborah Mayo (as recorded, June 2011)”
Some previous posts on this topic can be found at the following links (and by searching this blog with key words):
UPhils and responses
[ii]
Below are my slides from my May 2, 2014 presentation in the Virginia Tech Department of Philosophy 2014 Colloquium series:
“Putting the Brakes on the Breakthrough, or
‘How I used simple logic to uncover a flaw in a controversial 50 year old ‘theorem’ in statistical foundations taken as a
‘breakthrough’ in favor of Bayesian vs frequentist error statistics’”
Birnbaum, A. 1962. “On the Foundations of Statistical Inference.” In Breakthroughs in Statistics, edited by S. Kotz and N. Johnson, 1:478–518. Springer Series in Statistics 1993. New York: SpringerVerlag.
*Judges reserve the right to decide if the example constitutes the relevant use of “intentions” (amid a foundations of statistics criticism) in a published article. Different subsets of authors can count for distinct entries. No more than 2 entries per person. This means we need your name.