Monthly Archives: July 2013

New Version: On the Birnbaum argument for the SLP: Slides for my JSM talk

In my latest formulation of the controversial Birnbaum argument for the strong likelihood principle (SLP), I introduce a new symbol, ⇒, to represent a function from a given experiment-outcome pair, (E,z), to a generic inference implication. This should clarify my argument (see my new paper).

(E,z) ⇒ Infr_E(z) is to be read “the inference implication from outcome z in experiment E” (according to whatever inference type/school is being discussed).

A draft of my slides for the Joint Statistical Meetings (JSM) in Montreal next week is right after the abstract. Comments are very welcome.

Interested readers may search this blog for quite a lot of discussion of the SLP (e.g., here and here) including links to the central papers, “U-Phils” by others (e.g., here, here, and here), and amusing notes (e.g., Don’t Birnbaumize that experiment my friend, and Midnight with Birnbaum).

On the Birnbaum Argument for the Strong Likelihood Principle


An essential component of inference based on familiar frequentist notions (p-values, significance and confidence levels) is the relevant sampling distribution (hence the term sampling theory). This feature results in violations of a principle known as the strong likelihood principle (SLP), the focus of this paper. In particular, if outcomes x* and y* from experiments E1 and E2 (both with unknown parameter θ) have different probability models f1, f2, then even though f1(x*; θ) = cf2(y*; θ) for all θ, outcomes x* and y* may have different implications for an inference about θ. Although such violations stem from considering outcomes other than the one observed, we argue, this does not require us to consider experiments other than the one performed to produce the data. David Cox (1958) proposes the Weak Conditionality Principle (WCP) to justify restricting the space of relevant repetitions. The WCP says that once it is known which Ei produced the measurement, the assessment should be in terms of the properties of the particular Ei.

The surprising upshot of Allan Birnbaum’s (1962) argument is that the SLP appears to follow from applying the WCP in the case of mixtures, and so uncontroversial a principle as sufficiency (SP). But this would preclude the use of sampling distributions. The goal of this article is to provide a new clarification and critique of Birnbaum’s argument. Although his argument purports that [(WCP and SP) entails SLP], we show how data may violate the SLP while holding both the WCP and SP. Such cases directly refute [WCP entails SLP].
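The canonical illustration of such an SLP pair (a standard textbook example, not drawn from the paper itself) contrasts a binomial experiment, with the number of trials fixed at 12, against a negative binomial experiment that samples until the 3rd success: observing 3 successes in 12 trials yields likelihoods that are proportional for every θ, even though the two sampling distributions differ. A quick sketch in Python:

```python
from math import comb

def binomial_lik(theta, n=12, k=3):
    # E1: fix n = 12 Bernoulli trials, observe k = 3 successes
    return comb(n, k) * theta**k * (1 - theta)**(n - k)

def neg_binomial_lik(theta, r=3, n=12):
    # E2: sample until the r = 3rd success, which arrives on trial n = 12
    return comb(n - 1, r - 1) * theta**r * (1 - theta)**(n - r)

# The likelihoods are proportional with the same constant c for every theta:
# the SLP says the two outcomes then carry identical evidence about theta,
# yet their sampling distributions differ.
for theta in (0.1, 0.3, 0.5, 0.8):
    c = binomial_lik(theta) / neg_binomial_lik(theta)
    print(theta, c)  # c = 4.0 in every case: comb(12,3)/comb(11,2) = 220/55
```

A frequentist test based on E1's sampling distribution can assign these two outcomes different p-values, which is exactly the kind of SLP violation described above.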

Comments, questions, errors are welcome.

Full paper can be found here:

Categories: Error Statistics, Statistics, strong likelihood principle

Background Knowledge: Not to Quantify, But To Avoid Being Misled By, Subjective Beliefs

A low-powered statistical analysis of this blog—nearing its 2-year anniversary!—reveals that the topic to crop up most often—either front and center, or lurking in the bushes—is that of “background information”. The following was one of my early posts, back on Oct. 30, 2011:

October 30, 2011 (London). Increasingly, I am discovering that one of the biggest sources of confusion about the foundations of statistics has to do with what it means or should mean to use “background knowledge” and “judgment” in making statistical and scientific inferences. David Cox and I address this in our “Conversation” in RMM (2011); it is one of the three or four topics in that special volume that I am keen to take up.

Insofar as humans conduct science and draw inferences, and insofar as learning about the world is not reducible to a priori deductions, it is obvious that “human judgments” are involved. True enough, but too trivial an observation to help us distinguish among the very different ways judgments should enter according to contrasting inferential accounts. When Bayesians claim that frequentists do not use or are barred from using background information, what they really mean is that frequentists do not use prior probabilities of hypotheses, at least when those hypotheses are regarded as correct or incorrect, if only approximately. So, for example, we would not assign relative frequencies to the truth of hypotheses such as (1) prion transmission is via protein folding without nucleic acid, or (2) the deflection of light is approximately 1.75” (as if, as Peirce puts it, “universes were as plenty as blackberries”). How odd it would be to try to model these hypotheses as themselves having distributions: to us, statistical hypotheses assign probabilities to outcomes or values of a random variable.

However, quite a lot of background information goes into designing, carrying out, and analyzing inquiries into hypotheses regarded as correct or incorrect. For a frequentist, that is where background knowledge enters. There is no reason to suppose that the background required in order sensibly to generate, interpret, and draw inferences about H should—or even can—enter through prior probabilities for H itself! Of course, presumably, Bayesians also require background information in order to determine that “data x” have been observed, to determine how to model and conduct the inquiry, and to check the adequacy of statistical models for the purposes of the inquiry. So the Bayesian prior only purports to add some other kind of judgment, about the degree of belief in H. It does not get away from the other background judgments that frequentists employ.

This relates to a second point that came up in our conversation when Cox asked, “Do we want to put in a lot of information external to the data, or as little as possible?” Continue reading

Categories: Background knowledge, Error Statistics | Tags: ,

Guest Post: Larry Laudan. Why Presuming Innocence is Not a Bayesian Prior

“Why presuming innocence has nothing to do with assigning low prior probabilities to the proposition that defendant didn’t commit the crime”

by Professor Larry Laudan
Philosopher of Science*

Several of the comments to the July 17 post about the presumption of innocence suppose that jurors are asked to believe, at the outset of a trial, that the defendant did not commit the crime and that they can legitimately convict him if and only if they are eventually persuaded that it is highly likely (pursuant to the prevailing standard of proof) that he did in fact commit it. Failing that, they must find him not guilty. Many contributors here are conjecturing how confident jurors should be at the outset about defendant’s material innocence.

That is a natural enough Bayesian way of formulating the issue, but I think it drastically misstates what the presumption of innocence amounts to. In my view, the presumption is not (or at least should not be) an instruction about whether jurors believe the defendant did or did not commit the crime. It is, rather, an instruction about their probative attitudes.

There are three reasons for thinking this:

(a) Asking a juror to begin a trial believing that the defendant did not commit a crime requires a doxastic act that is probably outside the juror’s control. It would involve asking jurors to strongly believe an empirical assertion for which they have no evidence whatsoever. It is wholly unclear that any of us has the ability to talk ourselves into resolutely believing x if we have no empirical grounds for asserting x. By contrast, asking jurors to believe that they have as yet seen no proof of the defendant’s guilt is an easy belief to acquiesce in, since it is obviously true. Continue reading

Categories: frequentist/Bayesian, PhilStatLaw, Statistics

Msc Kvetch: A question on the Martin-Zimmerman case we do not hear

This is off topic, but a question I don’t hear people ask in regard to the Zimmerman case is: why didn’t any of the several people hearing screams intervene to stop the brawl? Never mind who was screaming; no one felt an obligation to intervene. Eyewitness John Good came out and said something like “stop,” but then immediately ran back inside. Others could be heard saying, “don’t go out there.” I don’t say they should have joined the fight, but if a few people had gone outside and screamed or blown a whistle, it probably would have been effective.

(I’ve been in such situations twice.)

Categories: msc kvetch

Phil/Stat/Law: What Bayesian prior should a jury have? (Schachtman)

Nathan Schachtman, Esq., PC* emailed me the following interesting query a while ago:

When I was working through some of the Bayesian-in-the-law issues with my class, I raised the problem of priors of 0 and 1 being “out of bounds” for a Bayesian analyst. I didn’t realize then that the problem had a name: Cromwell’s Rule.

My point was then, and more so now, what is the appropriate prior the jury should have when it is sworn?  When it hears opening statements?  Just before the first piece of evidence is received?

Do we tell the jury that the defendant is presumed innocent, which means that it’s ok to entertain a very, very small prior probability of guilt, say no more than 1/N, where N is the total population of people? This seems wrong as a matter of legal theory.  But if the prior = 0, then no amount of evidence can move the jury off its prior.
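Schachtman's closing worry can be made concrete with Bayes' rule in odds form (a minimal sketch; the prior and likelihood-ratio values below are purely illustrative): a prior of exactly 0 is immovable no matter how strong the evidence, while even a prior as small as 1/N updates normally.

```python
def posterior_guilt(prior, likelihood_ratio):
    """Update P(guilt) given evidence with the stated likelihood ratio
    P(evidence | guilty) / P(evidence | innocent), via the odds form of Bayes' rule."""
    if prior == 0:          # Cromwell's Rule: a zero prior can never move
        return 0.0
    odds = (prior / (1 - prior)) * likelihood_ratio
    return odds / (1 + odds)

print(posterior_guilt(0.0, 1e12))    # 0.0: no amount of evidence moves a zero prior
print(posterior_guilt(1e-8, 1e12))   # near 1: a tiny but nonzero prior updates freely
```

This is the formal content of Lindley's advice to keep priors away from 0 and 1; whether a jury's "presumption" should be modeled as any prior at all is, of course, the question under dispute.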

*Schachtman’s legal practice focuses on the defense of product liability suits, with an emphasis on the scientific and medico-legal issues.  He teaches a course in statistics in the law at the Columbia Law School, NYC. He also has a legal blog here.

Categories: PhilStatLaw, Statistics | Tags:

Stephen Senn: Indefinite irrelevance

Stephen SennStephen Senn
Head, Methodology and Statistics Group,
Competence Center for Methodology and Statistics (CCMS),

At a workshop on randomisation I attended recently I was depressed to hear what I regard as hackneyed untruths treated as if they were important objections. One of these is that of indefinitely many confounders. The argument goes that although randomisation may make it probable that some confounders are reasonably balanced between the arms, since there are indefinitely many of these, the chance that at least some are badly confounded is so great as to make the procedure useless.

This argument is wrong for several related reasons. The first has to do with the fact that the total effect of these indefinitely many confounders is bounded. The argument put forward is as false as a claim that the infinite series ½, ¼, ⅛, … does not sum to a limit because it has infinitely many terms. The fact is that the outcome value one wishes to analyse poses a limit on the possible influence of the covariates. Suppose that we were able to measure a number of covariates on a set of patients prior to randomisation (in fact this is usually not possible, but that does not matter here). Now construct principal components, C1, C2, …, based on these covariates. We suppose that each of these predicts, to a greater or lesser extent, the outcome, Y (say). In a linear model we could put coefficients on these components, k1, k2, … (say). However, one is not free to postulate anything at all by way of values for these coefficients, since it has to be the case for any set of m such coefficients that k1²V(C1) + k2²V(C2) + … + km²V(Cm) ≤ V(Y), where V( ) indicates the variance of its argument (the cross-terms vanish because the components are orthogonal). Thus variation in outcome bounds variation in prediction. This total variation in outcome has to be shared among the predictors, and the more predictors you postulate, the less, on average, is the influence per predictor.
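The bound is easy to check numerically (a rough sketch, not from Senn's post; the data and the component construction are invented for illustration): project a centred outcome onto any number of orthonormal components, and the variation they capture can never exceed the variation in Y, so the average influence per component must shrink as more are postulated.

```python
import random

random.seed(42)
n, m = 200, 30   # n patients, m hypothetical confounder "components"

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Build m random orthonormal predictors C_1..C_m by Gram-Schmidt
components = []
for _ in range(m):
    v = [random.gauss(0, 1) for _ in range(n)]
    for c in components:                          # strip projections on earlier components
        p = dot(v, c)
        v = [vi - p * ci for vi, ci in zip(v, c)]
    norm = dot(v, v) ** 0.5
    components.append([vi / norm for vi in v])

# A centred outcome vector Y
y = [random.gauss(0, 1) for _ in range(n)]
ybar = sum(y) / n
y = [yi - ybar for yi in y]

total = dot(y, y)                                 # total variation in Y
k = [dot(y, c) for c in components]               # least-squares coefficients k_i
explained = sum(ki ** 2 for ki in k)              # variation captured by the components

# Bessel's inequality: the components can never "out-vary" the outcome, so the
# average contribution per component shrinks as more components are postulated
assert explained <= total
print(explained / m, total)
```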

The second error is to ignore the fact that statistical inference does not proceed on the basis of signal alone but also on noise. It is the ratio of these that is important. If there are indefinitely many predictors then there is no reason to suppose that their influence on the variation between treatment groups will be bigger than their variation within groups and both of these are used to make the inference. Continue reading

Categories: RCTs, Statistics, Stephen Senn

Professor of Philosophy Resigns over Sexual Misconduct (rejected post)

My field (philosophy) is not known for the kinds of data frauds and retractions we’ve discussed on this blog, but scandals revolving around sexual harassment by male faculty are not rare, though I can’t think of another case of a senior faculty member resigning, at least not in recent times. This article is from

A Few Words on the McGinn Imbroglio from the philosophy smoker blog (June 4, 2013)

As I guess we [in philosophy] all know, Colin McGinn has chosen to resign from the University of Miami rather than allow the University to proceed with an investigation into allegations of sexual misconduct involving a research assistant. The article at the Chronicle of Higher Ed is here (paywalled); Sally Haslanger has posted a PDF of the whole thing here. Discussion at NewApps here, here, here, and here; discussion at Feminist Philosophers here; discussion at Leiter here and here.

Briefly, what seems to have happened is this: McGinn had a Research Assistant who was a female graduate student. Last spring, the RA started feeling uncomfortable with McGinn. Then, last April, McGinn allegedly started sending her sexually explicit email messages, including one in which, according to the RA’s boyfriend and two unnamed faculty members, “McGinn wrote that he had been thinking about the student while masturbating.”* Wowza.

The RA then contacted the Office of Equality Administration. According to CHE, “after the university’s Office of Equality Administration and the vice provost for faculty affairs conducted an investigation, Mr. McGinn was given the option of agreeing to resign or having an investigation into the allegations against him continue in a public setting, several of the philosopher’s colleagues said.”

It’s hard to know exactly what to make of this. On one obvious interpretation, there’s a clearly implied threat: if you don’t resign, we’re going to publicly drag your name through the mud. And I’m not sure how normal the prospect of a “public” investigation is in this kind of circumstance. For example, if I recall correctly, the Oregon case from a couple of years ago involved an investigation that was supposed to have been kept private, and was made public only in violation of the University’s procedures. But procedures vary from institution to institution, and I don’t have any expertise here. I don’t really have any idea whether this is unusual or not, although my suspicion is that it is at least a little unusual.

It therefore seems reasonable to worry about whether the procedures Miami followed here were respectful of McGinn’s right to due process. But it’s worth emphasizing that the CHE article is not very clear about precisely what happened—for example, Leiter says that McGinn had legal representation and was acting on his lawyer’s advice, but the CHE doesn’t mention it. It is also worth emphasizing that the account in the CHE comes from unnamed “colleagues,” not McGinn or his representatives or any official source at the University. And this comment at Feminist Philosophers, the veracity of which I am not in a position to verify, makes the meeting seem at least a little less troubling. On that account, it was more like, we’ve got some pretty compelling, well-documented evidence of misconduct, which we are duty-bound to pursue; but we’d like to give you the opportunity to resign now and save us both a big headache.

Harassment occurs between professors, and not just between professors and students, but without the obvious professor-student taboo it is not taken especially seriously, in my experience. Naturally philosophers, being philosophers, some of them will engage in deep philosophical discussion of the nature and justification of the infractions, and even of how they might have grown out of legitimate philosophical research on the evolutionary development of the hand in relation to its physical functions. Continue reading

Categories: Rejected Posts, Uncategorized

Is Particle Physics Bad Science? (memory lane)

Memory Lane: reblog July 11, 2012 (+ updates at the end). 

I suppose[ed] this was somewhat of a joke from the ISBA, prompted by Dennis Lindley, but as I [now] accord the actual extent of jokiness to be only ~10%, I’m sharing it on the blog [i].  Lindley (according to O’Hagan) wonders why scientists require so high a level of statistical significance before claiming to have evidence of a Higgs boson.  It is asked: “Are the particle physics community completely wedded to frequentist analysis?  If so, has anyone tried to explain what bad science that is?”

Bad science?   I’d really like to understand what these representatives from the ISBA would recommend, if there is even a shred of seriousness here (or is Lindley just peeved that significance levels are getting so much press in connection with so important a discovery in particle physics?)

Well, read the letter and see what you think.

On Jul 10, 2012, at 9:46 PM, ISBA Webmaster wrote:

Dear Bayesians,

A question from Dennis Lindley prompts me to consult this list in search of answers.

We’ve heard a lot about the Higgs boson.  The news reports say that the LHC needed convincing evidence before they would announce that a particle had been found that looks like (in the sense of having some of the right characteristics of) the elusive Higgs boson.  Specifically, the news referred to a confidence interval with 5-sigma limits.

Now this appears to correspond to a frequentist significance test with an extreme significance level.  Five standard deviations, assuming normality, means a p-value of around 0.0000005.  A number of questions spring to mind.

1.  Why such an extreme evidence requirement?  We know from a Bayesian  perspective that this only makes sense if (a) the existence of the Higgs  boson (or some other particle sharing some of its properties) has extremely small prior probability and/or (b) the consequences of erroneously announcing its discovery are dire in the extreme.  Neither seems to be the case, so why  5-sigma?

2.  Rather than ad hoc justification of a p-value, it is of course better to do a proper Bayesian analysis.  Are the particle physics community completely wedded to frequentist analysis?  If so, has anyone tried to explain what bad science that is? Continue reading
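For reference, the conversion the letter performs is just a normal tail probability, easily checked (a quick sketch; the convention in particle physics is the one-sided tail, about 2.9 × 10⁻⁷ at 5 sigma, while the letter's "around 0.0000005" matches the two-sided tail):

```python
from statistics import NormalDist

def sigma_to_p(sigma, two_sided=False):
    """Tail probability beyond `sigma` standard deviations, assuming normality."""
    tail = 1.0 - NormalDist().cdf(sigma)
    return 2.0 * tail if two_sided else tail

print(f"{sigma_to_p(5):.2e}")                   # one-sided 5-sigma: 2.87e-07
print(f"{sigma_to_p(5, two_sided=True):.2e}")   # two-sided 5-sigma: 5.73e-07
```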

Categories: philosophy of science, Statistics | Tags: , , , , ,

PhilStatLaw: Reference Manual on Scientific Evidence (3d ed) on Statistical Significance (Schachtman)

Memory Lane: One Year Ago on error

A quick perusal of the “Manual” on Nathan Schachtman’s legal blog shows it to be chock full of revealing points of contemporary legal statistical philosophy.  The following are some excerpts, read the full blog here.   I make two comments at the end.

July 8th, 2012

Nathan Schachtman

How does the new Reference Manual on Scientific Evidence (RMSE3d 2011) treat statistical significance?  Inconsistently and at times incoherently.

Professor Berger’s Introduction

In her introductory chapter, the late Professor Margaret A. Berger raises the question of the role statistical significance should play in evaluating a study’s support for causal conclusions:

“What role should statistical significance play in assessing the value of a study? Epidemiological studies that are not conclusive but show some increased risk do not prove a lack of causation. Some courts find that they therefore have some probative value,62 at least in proving general causation.63”

Margaret A. Berger, “The Admissibility of Expert Testimony,” in RMSE3d 11, 24 (2011).

This seems rather backwards. Berger’s suggestion that inconclusive studies do not prove lack of causation is nothing more than a tautology. And how can that tautology support the claim that inconclusive studies “therefore” have some probative value? This is an obviously invalid argument, or perhaps a passage badly in need of an editor.


Chapter on Statistics

The RMSE’s chapter on statistics is relatively free of value judgments about significance probability, and, therefore, a great improvement upon Berger’s introduction.  The authors carefully describe significance probability and p-values, and explain:

“Small p-values argue against the null hypothesis. Statistical significance is determined by reference to the p-value; significance testing (also called hypothesis testing) is the technique for computing p-values and determining statistical significance.”

David H. Kaye and David A. Freedman, “Reference Guide on Statistics,” in RMSE3d 211, 241 (3ed 2011).  Although the chapter confuses and conflates Fisher’s interpretation of p-values with Neyman’s conceptualization of hypothesis testing as a dichotomous decision procedure, this treatment is unfortunately fairly standard in introductory textbooks.

Kaye and Freedman, however, do offer some important qualifications to the untoward consequences of using significance testing as a dichotomous outcome:

“Artifacts from multiple testing are commonplace. Because research that fails to uncover significance often is not published, reviews of the literature may produce an unduly large number of studies finding statistical significance.111 Even a single researcher may examine so many different relationships that a few will achieve statistical significance by mere happenstance. Almost any large data set—even pages from a table of random digits—will contain some unusual pattern that can be uncovered by diligent search. Having detected the pattern, the analyst can perform a statistical test for it, blandly ignoring the search effort. Statistical significance is bound to follow.

There are statistical methods for dealing with multiple looks at the data, which permit the calculation of meaningful p-values in certain cases.112 However, no general solution is available, and the existing methods would be of little help in the typical case where analysts have tested and rejected a variety of models before arriving at the one considered the most satisfactory (see infra Section V on regression models). In these situations, courts should not be overly impressed with claims that estimates are significant. Instead, they should be asking how analysts developed their models.113 ”

Id. at 256–57. This qualification is omitted from the overlapping discussion in the chapter on epidemiology, where it is very much needed. Continue reading
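Kaye and Freedman's point about diligent search is easy to simulate (a sketch with invented numbers, using only the Python standard library): test enough relationships in pure noise and "significant" findings turn up on schedule, about one in twenty at the 0.05 level.

```python
import random
from statistics import NormalDist

random.seed(7)
norm = NormalDist()
num_tests = 200

# Every "relationship" examined is pure noise: z-statistics are standard normal draws
pvals = []
for _ in range(num_tests):
    z = random.gauss(0, 1)
    pvals.append(2 * (1 - norm.cdf(abs(z))))   # two-sided p-value under the null

hits = sum(p < 0.05 for p in pvals)
print(hits, "of", num_tests, "null tests came out 'significant' at the 0.05 level")
```

An analyst who reports only the hits, "blandly ignoring the search effort," manufactures significance from randomness, which is precisely the multiple-testing artifact the excerpt warns courts about.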

Categories: P-values, PhilStatLaw, significance tests | Tags: , , , ,

Bad news bears: ‘Bayesian bear’ rejoinder-reblog mashup

Oh No! It’s those mutant bears again. To my dismay, I’ve been sent, for the third time, that silly, snarky, adolescent clip of those naughty “what the p-value” bears (first posted on Aug 5, 2012), who cannot seem to get a proper understanding of significance tests into their little bear brains. So apparently some people haven’t seen my rejoinder, which, as I said then, practically wrote itself. Since it’s Saturday night here at the Elbar Room, let’s listen in to a mashup of both the clip and my original rejoinder (in which the p-value bears are replaced with hypothetical Bayesian bears).

These stilted bear figures and their voices are sufficiently obnoxious in their own right, even without the tedious lampooning of p-values and the feigned horror at learning they should not be reported as posterior probabilities.

Mayo’s Rejoinder:

Bear #1: Do you have the results of the study?

Bear #2: Yes. The good news is there is a .996 probability of a positive difference in the main comparison.

Bear #1: Great. So I can be well assured that there is just a .004 probability that such positive results would occur if they were merely due to chance.

Bear #2: Not really, that would be an incorrect interpretation. Continue reading

Categories: Bayesian/frequentist, Comedy, P-values, Statistics | Tags: , , ,

Phil/Stat/Law: 50 Shades of gray between error and fraud

An update on the Diederik Stapel case: July 2, 2013, The Scientist, “Dutch Fraudster Scientist Avoids Jail”.

Two years after being exposed by colleagues for making up data in at least 30 published journal articles, former Tilburg University professor Diederik Stapel will avoid a trial for fraud. Once one of the Netherlands’ leading social psychologists, Stapel has agreed to a pre-trial settlement with Dutch prosecutors to perform 120 hours of community service.

According to Dutch newspaper NRC Handelsblad, the Dutch Organization for Scientific Research awarded Stapel $2.8 million in grants for research that was ultimately tarnished by misconduct. However, the Dutch Public Prosecution Service and the Fiscal Information and Investigation Service said on Friday (June 28) that because Stapel used the grant money for student and staff salaries to perform research, he had not misused public funds. …

In addition to the community service he will perform, Stapel has agreed not to make a claim on 18 months’ worth of illness and disability compensation that he was due under his terms of employment with Tilburg University. Stapel also voluntarily returned his doctorate from the University of Amsterdam and, according to Retraction Watch, retracted 53 of the more than 150 papers he has co-authored.

“I very much regret the mistakes I have made,” Stapel told ScienceInsider. “I am happy for my colleagues as well as for my family that with this settlement, a court case has been avoided.”

No surprise he’s not doing jail time, but 120 hours of community service?  After over a decade of fraud, and tainting 14 of 21 of the PhD theses he supervised?  Perhaps the “community service” should be to actually run the experiments he had designed?  What about his innocence of misusing public funds? Continue reading

Categories: PhilStatLaw, spurious p values, Statistics
