Slides from my March 17 presentation on “Severe Testing: The Key to Error Correction” given at the Boston Colloquium for Philosophy of Science Alfred I.Taub forum on “Understanding Reproducibility and Error Correction in Science.”
Slides from my March 17 presentation on “Severe Testing: The Key to Error Correction” given at the Boston Colloquium for Philosophy of Science Alfred I.Taub forum on “Understanding Reproducibility and Error Correction in Science.”
Here’s the follow-up post to the one I reblogged on Feb 3 (please read that one first). When they sought to subject Uri Geller to the scrutiny of scientists, magicians had to be brought in because only they were sufficiently trained to spot the subtle sleight of hand shifts by which the magician tricks by misdirection. We, too, have to be magicians to discern the subtle misdirections and shifts of meaning in the discussions of statistical significance tests (and other methods)—even by the same statistical guide. We needn’t suppose anything deliberately devious is going on at all! Often, the statistical guidebook reflects shifts of meaning that grow out of one or another critical argument. These days, they trickle down quickly to statistical guidebooks, thanks to popular articles on the “statistics crisis in science”. The danger is that their own guidebooks contain inconsistencies. To adopt the magician’s stance is to be on the lookout for standard sleights of hand. There aren’t that many.
I don’t know Jim Frost, but he gives statistical guidance at the minitab blog. The purpose of my previous post is to point out that Frost uses the probability of a Type I error in two incompatible ways in his posts on significance tests. I assumed he’d want to clear this up, but so far he has not. His response to a comment I made on his blog is this: Continue reading
Gerd Gigerenzer, Andrew Gelman, Clark Glymour and I took part in a very interesting symposium on Philosophy of Statistics at the Philosophy of Science Association last Friday. I jotted down lots of notes, but I’ll limit myself to brief reflections and queries on a small portion of each presentation in turn, starting with Gigerenzer’s “Surrogate Science: How Fisher, Neyman-Pearson, & Bayes Were Transformed into the Null Ritual.” His complete slides are below my comments. I may write this in stages, this being (i).
- Good scientific practice–bold theories, double-blind experiments, minimizing measurement error, replication, etc.–became reduced in the social science to a surrogate: statistical significance.
I agree that “good scientific practice” isn’t some great big mystery, and that “bold theories, double-blind experiments, minimizing measurement error, replication, etc.” are central and interconnected keys to finding things out in error prone inquiry. Do the social sciences really teach that inquiry can be reduced to cookbook statistics? Or is it simply that, in some fields, carrying out surrogate science suffices to be a “success”? Continue reading
I came across an excellent post on a blog kept by Daniel Lakens: “So you banned p-values, how’s that working out for you?” He refers to the journal that recently banned significance tests, confidence intervals, and a vague assortment of other statistical methods, on the grounds that all such statistical inference tools are “invalid” since they don’t provide posterior probabilities of some sort (see my post). The editors’ charge of “invalidity” could only hold water if these error statistical methods purport to provide posteriors based on priors, which is false. The entire methodology is based on methods in which probabilities arise to qualify the method’s capabilities to detect and avoid erroneous interpretations of data . The logic is of the falsification variety found throughout science. Lakens, an experimental psychologist, does a great job delineating some of the untoward consequences of their inferential ban. I insert some remarks in black. Continue reading
Given all the recent attention given to kvetching about significance tests, it’s an apt time to reblog Aris Spanos’ overview of the error statistician talking back to the critics . A related paper for your Saturday night reading is Mayo and Spanos (2011). It mixes the error statistical philosophy of science with its philosophy of statistics, introduces severity, and responds to 13 criticisms and howlers.
I’m going to comment on some of the ASA discussion contributions I hadn’t discussed earlier. Please share your thoughts in relation to any of this.
It was first blogged here, as part of our seminar 2 years ago.
 For those seeking a bit more balance to the main menu offered in the ASA Statistical Significance Reference list.
See also on this blog:
A. Spanos, “Lecture on frequentist hypothesis testing
Evolutionary ecologist, Stephen Heard (Scientist Sees Squirrel) linked to my blog yesterday. Heard’s post asks: “Why do we make statistics so hard for our students?” I recently blogged Barnard who declared “We need more complexity” in statistical education. I agree with both: after all, Barnard also called for stressing the overarching reasoning for given methods, and that’s in sync with Heard. Here are some excerpts from Heard’s (Oct 6, 2015) post. I follow with some remarks.
This bothers me, because we can’t do inference in science without statistics*. Why are students so unreceptive to something so important? In unguarded moments, I’ve blamed it on the students themselves for having decided, a priori and in a self-fulfilling prophecy, that statistics is math, and they can’t do math. I’ve blamed it on high-school math teachers for making math dull. I’ve blamed it on high-school guidance counselors for telling students that if they don’t like math, they should become biology majors. I’ve blamed it on parents for allowing their kids to dislike math. I’ve even blamed it on the boogie**. Continue reading
Fraudulent until proved innocent: Is this really the new “Bayesian Forensics”? (rejected post)
Today is Allan Birnbaum’s Birthday. Birnbaum’s (1962) classic “On the Foundations of Statistical Inference,” in Breakthroughs in Statistics (volume I 1993), concerns a principle that remains at the heart of today’s controversies in statistics–even if it isn’t obvious at first: the Likelihood Principle (LP) (also called the strong likelihood Principle SLP, to distinguish it from the weak LP ). According to the LP/SLP, given the statistical model, the information from the data are fully contained in the likelihood ratio. Thus, properties of the sampling distribution of the test statistic vanish (as I put it in my slides from my last post)! But error probabilities are all properties of the sampling distribution. Thus, embracing the LP (SLP) blocks our error statistician’s direct ways of taking into account “biasing selection effects” (slide #10).
Intentions is a New Code Word: Where, then, is all the information regarding your trying and trying again, stopping when the data look good, cherry picking, barn hunting and data dredging? For likelihoodists and other probabilists who hold the LP/SLP, it is ephemeral information locked in your head reflecting your “intentions”! “Intentions” is a code word for “error probabilities” in foundational discussions, as in “who would want to take intentions into account?” (Replace “intentions” (or the “researcher’s intentions”) with “error probabilities” (or the method’s error probabilities”) and you get a more accurate picture.) Keep this deciphering tool firmly in mind as you read criticisms of methods that take error probabilities into account. For error statisticians, this information reflects real and crucial properties of your inference procedure.
Link to complete discussion:
Mayo, Deborah G. On the Birnbaum Argument for the Strong Likelihood Principle (with discussion & rejoinder). Statistical Science 29 (2014), no. 2, 227-266.
Links to individual papers:
Mayo, Deborah G. On the Birnbaum Argument for the Strong Likelihood Principle. Statistical Science 29 (2014), no. 2, 227-239.
Dawid, A. P. Discussion of “On the Birnbaum Argument for the Strong Likelihood Principle”. Statistical Science 29 (2014), no. 2, 240-241.
Evans, Michael. Discussion of “On the Birnbaum Argument for the Strong Likelihood Principle”. Statistical Science 29 (2014), no. 2, 242-246.
Martin, Ryan; Liu, Chuanhai. Discussion: Foundations of Statistical Inference, Revisited. Statistical Science 29 (2014), no. 2, 247-251.
Fraser, D. A. S. Discussion: On Arguments Concerning Statistical Principles. Statistical Science 29 (2014), no. 2, 252-253.
Hannig, Jan. Discussion of “On the Birnbaum Argument for the Strong Likelihood Principle”. Statistical Science 29 (2014), no. 2, 254-258.
Bjørnstad, Jan F. Discussion of “On the Birnbaum Argument for the Strong Likelihood Principle”. Statistical Science 29 (2014), no. 2, 259-260.
Mayo, Deborah G. Rejoinder: “On the Birnbaum Argument for the Strong Likelihood Principle”. Statistical Science 29 (2014), no. 2, 261-266.
Abstract: An essential component of inference based on familiar frequentist notions, such as p-values, significance and confidence levels, is the relevant sampling distribution. This feature results in violations of a principle known as the strong likelihood principle (SLP), the focus of this paper. In particular, if outcomes x∗ and y∗ from experiments E1 and E2 (both with unknown parameter θ), have different probability models f1( . ), f2( . ), then even though f1(x∗; θ) = cf2(y∗; θ) for all θ, outcomes x∗ and y∗may have different implications for an inference about θ. Although such violations stem from considering outcomes other than the one observed, we argue, this does not require us to consider experiments other than the one performed to produce the data. David Cox [Ann. Math. Statist. 29 (1958) 357–372] proposes the Weak Conditionality Principle (WCP) to justify restricting the space of relevant repetitions. The WCP says that once it is known which Ei produced the measurement, the assessment should be in terms of the properties of Ei. The surprising upshot of Allan Birnbaum’s [J.Amer.Statist.Assoc.57(1962) 269–306] argument is that the SLP appears to follow from applying the WCP in the case of mixtures, and so uncontroversial a principle as sufficiency (SP). But this would preclude the use of sampling distributions. The goal of this article is to provide a new clarification and critique of Birnbaum’s argument. Although his argument purports that [(WCP and SP), entails SLP], we show how data may violate the SLP while holding both the WCP and SP. Such cases also refute [WCP entails SLP].
Key words: Birnbaumization, likelihood principle (weak and strong), sampling theory, sufficiency, weak conditionality
Regular readers of this blog know that the topic of the “Strong Likelihood Principle (SLP)” has come up quite frequently. Numerous informal discussions of earlier attempts to clarify where Birnbaum’s argument for the SLP goes wrong may be found on this blog. [SEE PARTIAL LIST BELOW.[i]] These mostly stem from my initial paper Mayo (2010) [ii]. I’m grateful for the feedback.
In the months since this paper has been accepted for publication, I’ve been asked, from time to time, to reflect informally on the overall journey: (1) Why was/is the Birnbaum argument so convincing for so long? (Are there points being overlooked, even now?) (2) What would Birnbaum have thought? (3) What is the likely upshot for the future of statistical foundations (if any)?
I’ll try to share some responses over the next week. (Naturally, additional questions are welcome.)
[i] A quick take on the argument may be found in the appendix to: “A Statistical Scientist Meets a Philosopher of Science: A conversation between David Cox and Deborah Mayo (as recorded, June 2011)”
UPhils and responses
Despite the fact that Fisherians and Neyman-Pearsonians alike regard observed significance levels, or P values, as error probabilities, we occasionally hear allegations (typically from those who are neither Fisherian nor N-P theorists) that P values are actually not error probabilities. The denials tend to go hand in hand with allegations that P values exaggerate evidence against a null hypothesis—a problem whose cure invariably invokes measures that are at odds with both Fisherian and N-P tests. The Berger and Sellke (1987) article from a recent post is a good example of this. When leading figures put forward a statement that looks to be straightforwardly statistical, others tend to simply repeat it without inquiring whether the allegation actually mixes in issues of interpretation and statistical philosophy. So I wanted to go back and look at their arguments. I will post this in installments.
1. Some assertions from Fisher, N-P, and Bayesian camps
Here are some assertions from Fisherian, Neyman-Pearsonian and Bayesian camps: (I make no attempt at uniformity in writing the “P-value”, but retain the quotes as written.)
a) From the Fisherian camp (Cox and Hinkley):
For given observations y we calculate t = tobs = t(y), say, and the level of significance pobs by
pobs = Pr(T > tobs; H0).
….Hence pobs is the probability that we would mistakenly declare there to be evidence against H0, were we to regard the data under analysis as being just decisive against H0.” (Cox and Hinkley 1974, 66).
Thus pobs would be the Type I error probability associated with the test.
b) From the Neyman-Pearson N-P camp (Lehmann and Romano):
“[I]t is good practice to determine not only whether the hypothesis is accepted or rejected at the given significance level, but also to determine the smallest significance level…at which the hypothesis would be rejected for the given observation. This number, the so-called p-value gives an idea of how strongly the data contradict the hypothesis. It also enables others to reach a verdict based on the significance level of their choice.” (Lehmann and Romano 2005, 63-4)
Very similar quotations are easily found, and are regarded as uncontroversial—even by Bayesians whose contributions stood at the foot of Berger and Sellke’s argument that P values exaggerate the evidence against the null. Continue reading
School Director & Professor
School of Mathematical & Natural Science
Arizona State University
First, I do agree with Senn’s statement that “the FDA requires conventional placebo-controlled trials of a new treatment to be tested at the 5% level two-sided but since they would never accept a treatment that was worse than placebo the regulator’s risk is 2.5% not 5%.” The FDA procedure essentially defines a one-sided test with Type I error probability (size) of .025. Why it is not just called this, I do not know. And if the regulators believe .025 is the appropriate Type I error probability, then perhaps it should be used in other situations, e.g., bioequivalence testing, as well.
Here are the slides from my presentation (May 17) at the Scientism workshop in NYC. (They’re sketchy since we were trying for 25-30 minutes.) Below them are some mini notes on some of the talks.
Aris Spanos’ overview of error statistical responses to familiar criticisms of statistical tests. Related reading is Mayo and Spanos (2011)
I had invited Larry to give an update, and I’m delighted that he has! The discussion relates to the last post (by Spanos), which follows upon my deconstruction of Wasserman*. So, for your Saturday night reading pleasure, join me** in reviewing this and the past two blogs and the links within.
“Wasserman on Wasserman: Update! December 28, 2013”
My opinions have shifted a bit.
My reference to Franken’s joke suggested that the usual philosophical debates about the foundations of statistics were un-important, much like the debate about media bias. I was wrong on both counts.
First, I now think Franken was wrong. CNN and network news have a strong liberal bias, especially on economic issues. FOX has an obvious right wing, and anti-atheist bias. (At least FOX has some libertarians on the payroll.) And this does matter. Because people believe what they see on TV and what they read in the NY times. Paul Krugman’s socialist bullshit parading as economics has brainwashed millions of Americans. So media bias is much more than who makes better hummus.
Similarly, the Bayes-Frequentist debate still matters. And people — including many statisticians — are still confused about the distinction. I thought the basic Bayes-Frequentist debate was behind us. A year and a half of blogging (as well as reading other blogs) convinced me I was wrong here too. And this still does matter. Continue reading
Today is Lucien Le Cam’s birthday. He was an error statistician whose remarks in an article, “A Note on Metastatisics,” in a collection on foundations of statistics (Le Cam 1977)* had some influence on me. A statistician at Berkeley, Le Cam was a co-editor with Neyman of the Berkeley Symposia volumes. I hadn’t mentioned him on this blog before, so here are some snippets from EGEK (Mayo, 1996, 337-8; 350-1) that begin with a snippet from a passage from Le Cam (1977) (Here I have fleshed it out):
“One of the claims [of the Bayesian approach] is that the experiment matters little, what matters is the likelihood function after experimentation. Whether this is true, false, unacceptable or inspiring, it tends to undo what classical statisticians have been preaching for many years: think about your experiment, design it as best you can to answer specific questions, take all sorts of precautions against selection bias and your subconscious prejudices. It is only at the design stage that the statistician can help you.
Another claim is the very curious one that if one follows the neo-Bayesian theory strictly one would not randomize experiments….However, in this particular case the injunction against randomization is a typical product of a theory which ignores differences between experiments and experiences and refuses to admit that there is a difference between events which are made equiprobable by appropriate mechanisms and events which are equiprobable by virtue of ignorance. …
In spite of this the neo-Bayesian theory places randomization on some kind of limbo, and thus attempts to distract from the classical preaching that double blind randomized experiments are the only ones really convincing.
There are many other curious statements concerning confidence intervals, levels of significance, power, and so forth. These statements are only confusing to an otherwise abused public”. (Le Cam 1977, 158)
Back to EGEK:
Why does embracing the Bayesian position tend to undo what classical statisticians have been preaching? Because Bayesian and classical statisticians view the task of statistical inference very differently,
In [chapter 3, Mayo 1996] I contrasted these two conceptions of statistical inference by distinguishing evidential-relationship or E-R approaches from testing approaches, … .
The E-R view is modeled on deductive logic, only with probabilities. In the E-R view, the task of a theory of statistics is to say, for given evidence and hypotheses, how well the evidence confirms or supports hypotheses (whether absolutely or comparatively). There is, I suppose, a certain confidence and cleanness to this conception that is absent from the error-statistician’s view of things. Error statisticians eschew grand and unified schemes for relating their beliefs, preferring a hodgepodge of methods that are truly ampliative. Error statisticians appeal to statistical tools as protection from the many ways they know they can be misled by data as well as by their own beliefs and desires. The value of statistical tools for them is to develop strategies that capitalize on their knowledge of mistakes: strategies for collecting data, for efficiently checking an assortment of errors, and for communicating results in a form that promotes their extension by others.
Given the difference in aims, it is not surprising that information relevant to the Bayesian task is very different from that relevant to the task of the error statistician. In this section I want to sharpen and make more rigorous what I have already said about this distinction.
…. the secret to solving a number of problems about evidence, I hold, lies in utilizing—formally or informally—the error probabilities of the procedures generating the evidence. It was the appeal to severity (an error probability), for example, that allowed distinguishing among the well-testedness of hypotheses that fit the data equally well… .
A few pages later in a section titled “Bayesian Freedom, Bayesian Magic” (350-1):
A big selling point for adopting the LP (strong likelihood principle), and with it the irrelevance of stopping rules, is that it frees us to do things that are sinful and forbidden to an error statistician.
“This irrelevance of stopping rules to statistical inference restores a simplicity and freedom to experimental design that had been lost by classical emphasis on significance levels (in the sense of Neyman and Pearson). . . . Many experimenters would like to feel free to collect data until they have either conclusively proved their point, conclusively disproved it, or run out of time, money or patience … Classical statisticians … have frowned on [this]”. (Edwards, Lindman, and Savage 1963, 239)1
Breaking loose from the grip imposed by error probabilistic requirements returns to us an appealing freedom.
Le Cam, … hits the nail on the head:
“It is characteristic of [Bayesian approaches]  . . . that they … tend to treat experiments and fortuitous observations alike. In fact, the main reason for their periodic return to fashion seems to be that they claim to hold the magic which permits [us] to draw conclusions from whatever data and whatever features one happens to notice”. (Le Cam 1977, 145)
In contrast, the error probability assurances go out the window if you are allowed to change the experiment as you go along. Repeated tests of significance (or sequential trials) are permitted, are even desirable for the error statistician; but a penalty must be paid for perseverance—for optional stopping. Before-trial planning stipulates how to select a small enough significance level to be on the lookout for at each trial so that the overall significance level is still low. …. Wearing our error probability glasses—glasses that compel us to see how certain procedures alter error probability characteristics of tests—we are forced to say, with Armitage, that “Thou shalt be misled if thou dost not know that” the data resulted from the try and try again stopping rule. To avoid having a high probability of following false leads, the error statistician must scrupulously follow a specified experimental plan. But that is because we hold that error probabilities of the procedure alter what the data are saying—whereas Bayesians do not. The Bayesian is permitted the luxury of optional stopping and has nothing to worry about. The Bayesians hold the magic.
Or is it voodoo statistics?
When I sent him a note, saying his work had inspired me, he modestly responded that he doubted he could have had all that much of an impact.
*I had forgotten that this Synthese (1977) volume on foundations of probability and statistics is the one dedicated to the memory of Allan Birnbaum after his suicide: “By publishing this special issue we wish to pay homage to professor Birnbaum’s penetrating and stimulating work on the foundations of statistics” (Editorial Introduction). In fact, I somehow had misremembered it as being in a Harper and Hooker volume from 1976. The Synthese volume contains papers by Giere, Birnbaum, Lindley, Pratt, Smith, Kyburg, Neyman, Le Cam, and Kiefer.
_______(1962). Contribution to discussion in The foundations of statistical inference, edited by L. Savage. London: Methuen.
_______(1975). Sequential Medical Trials. 2nd ed. New York: John Wiley & Sons.
Edwards, W., H. Lindman & L. Savage (1963) Bayesian statistical inference for psychological research. Psychological Review 70: 193-242.
Le Cam, L. (1974). J. Neyman: on the occasion of his 80th birthday. Annals of Statistics, Vol. 2, No. 3 , pp. vii-xiii, (with E.L. Lehmann).
Le Cam, L. (1977). A note on metastatistics or “An essay toward stating a problem in the doctrine of chances.” Synthese 36: 133-60.
Le Cam, L. (1982). A remark on empirical measures in Festschrift in the honor of E. Lehmann. P. Bickel, K. Doksum & J. L. Hodges, Jr. eds., Wadsworth pp. 305-327.
Le Cam, L. (1986). The central limit theorem around 1935. Statistical Science, Vol. 1, No. 1, pp. 78-96.
Le Cam, L. (1988) Discussion of “The Likelihood Principle,” by J. O. Berger and R. L. Wolpert. IMS Lecture Notes Monogr. Ser. 6 182–185. IMS, Hayward, CA
Le Cam, L. (1996) Comparison of experiments: A short review. In Statistics, Probability and Game Theory. Papers in Honor of David Blackwell 127–138. IMS, Hayward, CA.
Le Cam, L., J. Neyman and E. L. Scott (Eds). (1973). Proceedings of the Sixth Berkeley Symposium on Mathematical Statistics and Probability, Vol. l: Theory of Statistics, Vol. 2: Probability Theory, Vol. 3: Probability Theory. Univ. of Calif. Press, Berkeley Los Angeles.
Neyman, J. and L. Le Cam (Eds). (1967). Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Vol. I: Statistics, Vol. II: Probability Part I & Part II. Univ. of Calif. Press, Berkeley and Los Angeles.
 For some links on optional stopping on this blog: Highly probably vs highly probed: Bayesian/error statistical differences.; Who is allowed to cheat? I.J. Good and that after dinner comedy hour….; New Summary; Mayo: (section 7) “StatSci and PhilSci: part 2″; After dinner Bayesian comedy hour….; Search for more, if interested.
 Le Cam is alluding mostly to Savage, and (what he called) the “neo-Bayesian” accounts.
by Professor Larry Laudan
Philosopher of Science*
Several of the comments to the July 17 post about the presumption of innocence suppose that jurors are asked to believe, at the outset of a trial, that the defendant did not commit the crime and that they can legitimately convict him if and only if they are eventually persuaded that it is highly likely (pursuant to the prevailing standard of proof) that he did in fact commit it. Failing that, they must find him not guilty. Many contributors here are conjecturing how confident jurors should be at the outset about defendant’s material innocence.
That is a natural enough Bayesian way of formulating the issue but I think it drastically misstates what the presumption of innocence amounts to. In my view, the presumption is not (or at least should not be) an instruction about whether jurors believe defendant did or did not commit the crime. It is, rather, an instruction about their probative attitudes.
There are three reasons for thinking this:
a). asking a juror to begin a trial believing that defendant did not commit a crime requires a doxastic act that is probably outside the jurors’ control. It would involve asking jurors to strongly believe an empirical assertion for which they have no evidence whatsoever. It is wholly unclear that any of us has the ability to talk ourselves into resolutely believing x if we have no empirical grounds for asserting x. By contrast, asking juries to believe that they have seen as yet no proof of defendant’s guilt is an easy belief to acquiesce in since it is obviously true. Continue reading
I am grateful for the chance to comment on the paper by Gelman and Robert. I welcome seeing statisticians raise philosophical issues about statistical methods, and I entirely agree that methods not only should be applicable but also capable of being defended at a foundational level. “It is doubtful that even the most rabid anti-Bayesian of 2010 would claim that Bayesian inference cannot apply” (Gelman and Robert 2012, p. 6). This is clearly correct; in fact, it is not far off the mark to say that the majority of statistical applications nowadays are placed under the Bayesian umbrella, even though the goals and interpretations found there are extremely varied. There are a plethora of international societies, journals, post-docs, and prizes with “Bayesian” in their name, and a wealth of impressive new Bayesian textbooks and software is available. Even before the latest technical advances and the rise of “objective” Bayesian methods, leading statisticians were calling for eclecticism (e.g., Cox 1978), and most will claim to use a smattering of Bayesian and non-Bayesian methods, as appropriate. George Casella (to whom their paper is dedicated) and Roger Berger in their superb textbook (2002) exemplify a balanced approach. Continue reading