Author Archives: Mayo

About Mayo

I am a professor in the Department of Philosophy at Virginia Tech and hold a visiting appointment at the Center for the Philosophy of Natural and Social Science of the London School of Economics. I am the author of Error and the Growth of Experimental Knowledge, which won the 1998 Lakatos Prize, awarded to the most outstanding contribution to the philosophy of science during the previous six years. I have applied my approach toward solving key problems in philosophy of science: underdetermination, the role of novel evidence, Duhem's problem, and the nature of scientific progress. I am also interested in applications to problems in risk analysis and risk controversies, and co-edited Acceptable Evidence: Science and Values in Risk Management (with Rachelle Hollander). I teach courses in introductory and advanced logic (including the metatheory of logic and modal logic), in scientific method, and in philosophy of science. I also teach special topics courses in Science and Technology Studies.

How likelihoodists exaggerate evidence from statistical tests


I insist on point against point, no matter how much it hurts

Have you ever noticed that some leading advocates of a statistical account, say a testing account A, upon discovering account A is unable to handle a certain kind of important testing problem that a rival testing account, account B, has no trouble at all with, will mount an argument that being able to handle that kind of problem is actually a bad thing? In fact, they might argue that testing account B is not a  “real” testing account because it can handle such a problem? You have? Sure you have, if you read this blog. But that’s only a subliminal point of this post.

I’ve had three posts recently on the Law of Likelihood (LL): Breaking the [LL](a)(b)[c], and [LL] is bankrupt. Please read at least one of them for background. All deal with Royall’s comparative likelihoodist account, which some will say only a few people even use, but I promise you that these same points come up again and again in foundational criticisms from entirely other quarters.[i]

An example from Royall is typical: He makes it clear that an account based on the (LL) is unable to handle composite tests, even simple one-sided tests for which account B supplies uniformly most powerful (UMP) tests. He concludes, not that his test comes up short, but that any genuine test or ‘rule of rejection’ must have a point alternative!  Here’s the case (Royall, 1997, pp. 19-20):

[M]edical researchers are interested in the success probability, θ, associated with a new treatment. They are particularly interested in how θ relates to the old treatment’s success probability, believed to be about 0.2. They have reason to hope θ is considerably greater, perhaps 0.8 or even greater. To obtain evidence about θ, they carry out a study in which the new treatment is given to 17 subjects, and find that it is successful in nine.

Let me interject at this point that of all of Stephen Senn’s posts on this blog, my favorite is the one where he zeroes in on the proper way to think about the discrepancy we hope to find (the .8 in this example). (See note [ii])

A standard statistical analysis of their observations would use a Bernoulli (θ) statistical model and test the composite hypotheses H1: θ ≤ 0.2 versus H2: θ > 0.2. That analysis would show that H1 can be rejected in favor of H2 at any significance level greater than 0.003, a result that is conventionally taken to mean that the observations are very strong evidence supporting H2 over H1. (Royall, ibid.)

Following Royall’s numbers, the observed success rate is:

m0 = 9/17 = .53, exceeding H1: θ ≤ 0.2 by ~3 standard deviations, as σ / √17 ~ 0.1, yielding significance level ~.003.
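The arithmetic above can be checked with a short sketch. It assumes the binomial model Royall describes (17 independent trials, nine successes) and evaluates the exact upper-tail area at the boundary value θ = 0.2. The function name is illustrative only and does not come from Royall's text.

```python
from math import comb, sqrt

def binom_upper_tail(k, n, theta):
    """P(X >= k) for X ~ Binomial(n, theta)."""
    return sum(comb(n, j) * theta**j * (1 - theta)**(n - j)
               for j in range(k, n + 1))

n, k, theta0 = 17, 9, 0.2
m0 = k / n                               # observed success rate, about .53
se = sqrt(theta0 * (1 - theta0) / n)     # standard error at theta = 0.2, about 0.1
z = (m0 - theta0) / se                   # distance from 0.2 in standard errors

exact_p = binom_upper_tail(k, n, theta0)
print(f"m0 = {m0:.2f}, se = {se:.3f}, z = {z:.2f}, exact tail area = {exact_p:.4f}")
```

The exact tail area rounds to Royall's 0.003. With only 17 trials a normal approximation is rough, so the exact binomial sum is used here.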

So, the observed success rate m0 = .53, “is conventionally taken to mean that the observations are very strong evidence supporting H2 over H1.” (ibid. p. 20) [For a link to an article by Royall, see the references.]

And indeed it is altogether warranted to regard the data as very strong evidence that θ > 0.2—which is precisely what H2 asserts (not fussing with his rather small sample size). In fact, m0 warrants inferring even larger discrepancies, but let’s first see where Royall has stopped in his tracks.[iii]

Royall claims he is unable to allow that m0 = .53 is evidence against the null in the one-sided test we are considering:  H1: θ ≤ 0.2 versus H2: θ > 0.2.

He tells us why in the next paragraph (ibid., p. 20):

But because H1 contains some simple hypotheses that are better supported than some hypotheses in H2 (e.g., θ = 0.2 is better supported than θ = 0.9 by a likelihood ratio of LR = (0.2/0.9)⁹(0.8/0.1)⁸ = 22.2), the law of likelihood does not allow the characterization of these observations as strong evidence for H2 over H1. (My emphasis; note I didn’t check his numbers since they hardly matter.)

It appears that Royall views rejecting H1: θ ≤ 0.2 and inferring H2: θ > 0.2 as asserting every parameter point within H2 is more likely than every point in H1! (That strikes me as a highly idiosyncratic meaning.) Whereas, the significance tester just takes it to mean what it says:

to reject H1: θ ≤ 0.2 is to infer some positive discrepancy from .2.

We, who go further, either via severity assessments or confidence intervals, would give discrepancies that were reasonably warranted, as well as those that were tantamount to making great whales out of little guppies (fallacy of rejection)! Conversely, for any discrepancy of interest, we can tell you how well or poorly warranted it is by the data. (The confidence interval theorist would need to supplement the one-sided lower limit which is, strictly speaking, all she gets from the one-sided test. I put this to one side here.)

But Royall is blocked! He’s got to invoke point alternatives, and then give a comparative likelihood ratio (to a point null). Note, too, that the point-against-point form is always required (with a couple of exceptions, maybe) for Royall’s comparative likelihoodist; it’s not just in this example where he imagines a far away alternative point of .8. The ordinary significance test is clearly at a great advantage over the point against point hypotheses, given the stated goal here is to probe discrepancies from the null. (See Senn’s point in note [ii] below.)

Not only is the law of likelihood unable to tackle simple one-sided tests, what it allows us to say is rather misleading:

What does it allow us to say? One statement that we can make is that the observations are only weak evidence in favor of θ = 0.8 versus θ = 0.2 (LR = 4). We can also say that they are rather strong evidence supporting θ = 0.5 over any of the values under H1: θ ≤ 0.2 (LR > 89), and at least moderately strong evidence for θ = 0.5 over any value θ > 0.8 (LR > 22). …Thus we can say that the observation of nine successes in 17 trials is rather strong evidence supporting success rates of about 0.5 over the rate 0.2 that is associated with the old treatment, and at least moderately strong evidence for the intermediate rates versus the rates of 0.8 or greater that we were hoping to achieve. (Royall 1997, p. 20, emphasis is mine)
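For readers who want to check the point-against-point likelihood ratios Royall quotes, here is a minimal sketch. It assumes the binomial likelihood for nine successes in 17 trials; `binom_lr` is a hypothetical helper name, not anything from Royall.

```python
def binom_lr(theta_a, theta_b, k=9, n=17):
    """Likelihood ratio L(theta_a) / L(theta_b) for k successes in n trials.

    The binomial coefficient is the same in numerator and denominator,
    so it cancels and is omitted.
    """
    return (theta_a / theta_b) ** k * ((1 - theta_a) / (1 - theta_b)) ** (n - k)

# Each pair is (favored point, rival point), following the comparisons in the text.
comparisons = [(0.2, 0.9), (0.8, 0.2), (0.5, 0.2), (0.5, 0.8)]
for a, b in comparisons:
    print(f"LR(theta = {a} vs theta = {b}) = {binom_lr(a, b):.1f}")
```

To one decimal these come out at 22.2, 4.0, 88.8 and 22.2, in line with Royall's rounded figures. Note that every value is a comparison between two points; nothing in the calculation speaks to the composite claim θ > 0.2.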

But this is scarcely “rather strong evidence supporting success rates of about 0.5” over the old treatment. What confidence level would you be using if you inferred m0 is evidence that θ > 0.5? Approximately .5. (It’s the typical comparative likelihood move of favoring the claim that the population value equals the observed value. *See comments.)

Royall’s “weak evidence in favor of θ = 0.8 versus θ = 0.2 (LR = 4)” fails to convey that there is rather horrible warrant for inferring θ = 0.8–associated with something like 99% error probability! (It’s outside the 2-standard deviation confidence interval, is it not?)

We significance testers do find strong evidence for discrepancies in excess of .3 (~.97 severity or lower confidence level) and decent evidence of excesses of .4 (~.84 severity or lower confidence level).  And notice that all of these assertions are claims of evidence of positive discrepancies from the null H1: θ ≤ 0.2. In short, at best (if we are generous in our reading, and insist on confidence levels at least .5), Royall is rediscovering what the significance tester automatically says in rejecting the null with the data!
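A rough sketch of how such lower confidence levels (or severity values) can be computed follows. It assumes a normal approximation in which the severity for the claim θ > θ1 is P(M ≤ m0; θ = θ1), and it holds the standard error fixed at the rough value 0.1 used above. Both are simplifying assumptions, since the binomial standard error actually changes with θ. The function names are illustrative.

```python
from math import erf, sqrt

def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1 + erf(z / sqrt(2)))

def lower_bound_level(theta1, m0, se):
    """Approximate P(M <= m0; theta = theta1).

    Read as the severity, or one-sided lower confidence level,
    for the claim theta > theta1 given the observed rate m0.
    """
    return normal_cdf((m0 - theta1) / se)

se = 0.1  # rough standard error, held fixed for simplicity
for m0 in (9 / 17, 0.5):  # the observed rate, and the same rate rounded to .5
    for theta1 in (0.3, 0.4, 0.5):
        level = lower_bound_level(theta1, m0, se)
        print(f"m0 = {m0:.2f}, claim theta > {theta1}: level = {level:.2f}")
```

With m0 rounded to .5 this gives roughly .98, .84 and .5, close to the round figures quoted above; with m0 = 9/17 the values come out somewhat higher. Either way the ordering is the point: the larger the discrepancy claimed, the less well warranted it is by these data.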

His entire analysis is limited to giving a series of reports as to which parameter values the data are comparatively closer to. As I already argued, I regard such an account as bankrupt as an account of inference. It fails to control probabilities of misleading interpretations of data in general, and precludes comparing the warrant for a single H by two data sets x, y. In this post, my aim is different. It is Royall, and some fellow likelihoodists, who lodge the criticism because we significance testers operate with composite alternatives. My position is that dealing with composite alternatives is crucial, and that we succeed swimmingly, while Royall is barely treading water. He will allow much stronger evidence than is warranted in favor of members of H2. Ironically, an analogous move is advocated by those who read the riot act against P-values for exaggerating evidence against a null! [iv]

Elliott Sober, reporting on the Royall road of likelihoodism, remarks:

The fact that significance tests don’t contrast the null hypothesis with alternatives suffices to show that they do not provide a good rule for rejection. (Sober 2008, 56) 

But there is an alternative, it’s just not limited to a point, the highly artificial case we rarely are involved in testing. Perhaps they are more common in biology. I will assume here that Elliott Sober is mainly setting out some of Royall’s criticisms for the reader, rather than agreeing with them.

According to the law of likelihood, as Sober observes, whether the data are evidence against the null hypothesis depends on which point alternative hypothesis you consider. Does he really want to say that so long as you can identify an alternative that is less likely given the data than is the null, then the data are “evidence in favor of the null hypothesis, not evidence against it” (Sober, 56)? Is this a good thing? What about all the points in between?  The significance test above exhausts the parameter space, as do all N-P tests.[v]


[i] I know because, remember, I’m writing a book that’s close to being done.

[ii] “It would be ludicrous to maintain that [the treatment] cannot have an effect which, while greater than nothing, is less than the clinically relevant difference.” (Senn 2008, p. 201)

[iii] Note: a rejection at the 2-standard deviation cut-off would be ~M* = .2 + 2(.1) = .4.

[iv] That is, they allow the low P-value to count as evidence for alternatives we would regard as unwarranted. But I’ll come back to that another time.

[v] In this connection, do we really want to say, about a null with teeny tiny likelihood, that there’s evidence for it, so long as there is a rival, miles away, in the other direction? (Do I feel the J-G-L Paradox coming on? Yes! It’s the next topic in Sober p.56)


Royall, R. (2004), “The Likelihood Paradigm for Statistical Evidence” 119-138; Rejoinder 145-151, in M. Taper, and S. Lele (eds.) The Nature of Scientific Evidence: Statistical, Philosophical and Empirical Considerations. Chicago: University of Chicago Press.

Royall, R. (1997), Statistical Evidence: A Likelihood Paradigm. Chapman and Hall.

Senn, S. (2007), Statistical Issues in Drug Development, Wiley.

Sober, E. (2008). Evidence and Evolution. CUP.

Categories: law of likelihood, Richard Royall, Statistics | Tags: | 16 Comments

Msc Kvetch: “You are a Medical Statistic”, or “How Medical Care Is Being Corrupted”

A NYT op-ed the other day, “How Medical Care Is Being Corrupted” (by Pamela Hartzband and Jerome Groopman, physicians on the faculty of Harvard Medical School), gives a good sum-up of what I fear is becoming the new normal, even under so-called “personalized medicine”.

WHEN we are patients, we want our doctors to make recommendations that are in our best interests as individuals. As physicians, we strive to do the same for our patients.

But financial forces largely hidden from the public are beginning to corrupt care and undermine the bond of trust between doctors and patients. Insurers, hospital networks and regulatory groups have put in place both rewards and punishments that can powerfully influence your doctor’s decisions.

Contracts for medical care that incorporate “pay for performance” direct physicians to meet strict metrics for testing and treatment. These metrics are population-based and generic, and do not take into account the individual characteristics and preferences of the patient or differing expert opinions on optimal practice.

For example, doctors are rewarded for keeping their patients’ cholesterol and blood pressure below certain target levels. For some patients, this is good medicine, but for others the benefits may not outweigh the risks. Treatment with drugs such as statins can cause significant side effects, including muscle pain and increased risk of diabetes. Blood-pressure therapy to meet an imposed target may lead to increased falls and fractures in older patients.

Physicians who meet their designated targets are not only rewarded with a bonus from the insurer but are also given high ratings on insurer websites. Physicians who deviate from such metrics are financially penalized through lower payments and are publicly shamed, listed on insurer websites in a lower tier. Further, their patients may be required to pay higher co-payments.

These measures are clearly designed to coerce physicians to comply with the metrics. Thus doctors may feel pressured to withhold treatment that they feel is required or feel forced to recommend treatment whose risks may outweigh benefits.

It is not just treatment targets but also the particular medications to be used that are now often dictated by insurers. Commonly this is done by assigning a larger co-payment to certain drugs, a negative incentive for patients to choose higher-cost medications. But now some insurers are offering a positive financial incentive directly to physicians to use specific medications. For example, WellPoint, one of the largest private payers for health care, recently outlined designated treatment pathways for cancer and announced that it would pay physicians an incentive of $350 per month per patient treated on the designated pathway.

This has raised concern in the oncology community because there is considerable debate among experts about what is optimal. Dr. Margaret A. Tempero of the National Comprehensive Cancer Network observed that every day oncologists saw patients for whom deviation from treatment guidelines made sense: “Will oncologists be reluctant to make these decisions because of an adverse effects on payments?” Further, some health care networks limit the ability of a patient to get a second opinion by going outside the network. The patient is financially penalized with large co-payments or no coverage at all. Additionally, the physician who refers the patient out of network risks censure from the network administration.

When a patient asks “Is this treatment right for me?” the doctor faces a potential moral dilemma. How should he answer if the response is to his personal detriment? Some health policy experts suggest that there is no moral dilemma. They argue that it is obsolete for the doctor to approach each patient strictly as an individual; medical decisions should be made on the basis of what is best for the population as a whole.

Medicine has been appropriately criticized for its past paternalism, where doctors imposed their views on the patient. In recent years, however, the balance of power has shifted away from the physician to the patient, in large part because of access to clinical information on the web.

In truth, the power belongs to the insurers and regulators that control payment. There is now a new paternalism, largely invisible to the public, diminishing the autonomy of both doctor and patient.

In 2010, Congress passed the Physician Payments Sunshine Act to address potential conflicts of interest by making physician financial ties to pharmaceutical and device companies public on a federal website. We propose a similar public website to reveal the hidden coercive forces that may specify treatments and limit choices through pressures on the doctor.

Medical care is not just another marketplace commodity. Physicians should never have an incentive to override the best interests of their patients.

Categories: PhilStat/Med, Statistics | Tags: | 4 Comments

Erich Lehmann: Statistician and Poet

Erich Lehmann 20 November 1917 – 12 September 2009


Memory Lane 1 Year (with update): Today is Erich Lehmann’s birthday. The last time I saw him was at the Second Lehmann conference in 2004, at which I organized a session on philosophical foundations of statistics (including David Freedman and D.R. Cox).

I got to know Lehmann, Neyman’s first student, in 1997.  One day, I received a bulging, six-page, handwritten letter from him in tiny, extremely neat scrawl (and many more after that).  He told me he was sitting in a very large room at an ASA meeting where they were shutting down the conference book display (or maybe they were setting it up), and on a very long, dark table sat just one book, all alone, shiny red.  He said he wondered if it might be of interest to him!  So he walked up to it….  It turned out to be my Error and the Growth of Experimental Knowledge (1996, Chicago), which he reviewed soon after. Some related posts on Lehmann’s letter are here and here.

That same year I remember having a last-minute phone call with Erich to ask how best to respond to a “funny Bayesian example” raised by Colin Howson. It is essentially the case of Mary’s positive result for a disease, where Mary is selected randomly from a population where the disease is very rare. See for example here. (It’s just like the case of our high school student Isaac). His recommendations were extremely illuminating, and with them he sent me a poem he’d written (which you can read in my published response here*). Aside from being a leading statistician, Erich had a (serious) literary bent.

Juliet Shafer, Erich Lehmann, D. Mayo


The picture on the right was taken in 2003 (by A. Spanos).

(2014 update): It was at this meeting that I proposed organizing a session for the 2004 Erich Lehmann Conference that would focus on “Philosophy of Statistics”. He encouraged me to do so. I invited David Freedman (who accepted), and then had the wild idea of inviting Sir David Cox. He too accepted! (Cox and I later combined our contributions into Mayo and Cox 2006).

Mayo, D. G (1997a), “Response to Howson and Laudan,” Philosophy of Science 64: 323-333.

Mayo, D.G. and Cox, D. R. (2006) “Frequentist Statistics as a Theory of Inductive Inference,” Optimality: The Second Erich L. Lehmann Symposium (ed. J. Rojo), Lecture Notes-Monograph series, Institute of Mathematical Statistics (IMS), Vol. 49: 77-97.


(Selected) Books by Lehmann

  • Testing Statistical Hypotheses, 1959
  • Basic Concepts of Probability and Statistics, 1964, co-author J. L. Hodges
  • Elements of Finite Probability, 1965, co-author J. L. Hodges
  • Lehmann, Erich L.; With the special assistance of H. J. M. D’Abrera (2006). Nonparametrics: Statistical methods based on ranks (Reprinting of 1988 revision of 1975 Holden-Day ed.). New York: Springer. pp. xvi+463. ISBN 978-0-387-35212-1. MR 2279708.
  • Theory of Point Estimation, 1983
  • Elements of Large-Sample Theory (1999). New York: Springer Verlag.
  • Reminiscences of a Statistician, 2007, ISBN 978-0-387-71596-4
  • Fisher, Neyman, and the Creation of Classical Statistics, 2011, ISBN 978-1-4419-9499-8 [published posthumously]

Articles (3 of very many)

Categories: highly probable vs highly probed, phil/history of stat, Sir David Cox, Spanos, Statistics | Leave a comment

Lucien Le Cam: “The Bayesians Hold the Magic”

Today is the birthday of Lucien Le Cam (Nov. 18, 1924-April 25, 2000): Please see my updated 2013 post on him.


Categories: Bayesian/frequentist, Statistics | Leave a comment

Why the Law of Likelihood is bankrupt–as an account of evidence



There was a session at the Philosophy of Science Association meeting last week where two of the speakers, Greg Gandenberger and Jiji Zhang had insightful things to say about the “Law of Likelihood” (LL)[i]. Recall from recent posts here and here that the (LL) regards data x as evidence supporting H1 over H0   iff

Pr(x; H1) > Pr(x; H0).

On many accounts, the likelihood ratio also measures the strength of that comparative evidence. (Royall 1997, p.3). [ii]

H0 and H1 are statistical hypotheses that assign probabilities to the random variable X taking value x.  As I recall, the speakers limited  H1 and H0  to simple statistical hypotheses (as Richard Royall generally does)–already restricting the account to rather artificial cases, but I put that to one side. Remember, with likelihoods, the data x are fixed, the hypotheses vary.

1. Maximally likely alternatives. I didn’t really disagree with anything the speakers said. I welcomed their recognition that a central problem facing the (LL) is the ease of constructing maximally likely alternatives: so long as Pr(x; H0) < 1, a maximally likely alternative H1 would be evidentially “favored”. There is no onus on the likelihoodist to predesignate the rival; you are free to search, hunt, post-designate and construct a best (or better) fitting rival. If you’re bothered by this, says Royall, then this just means the evidence disagrees with your prior beliefs.

After all, Royall famously distinguishes between evidence and belief (recall the evidence-belief-action distinction), and these problematic cases, he thinks, do not vitiate his account as an account of evidence. But I think they do! In fact, I think they render the (LL) utterly bankrupt as an account of evidence. Here are a few reasons. (Let me be clear that I am not pinning Royall’s defense on the speakers[iii], so much as saying it came up in the general discussion[iv].)

2. Appealing to prior beliefs to avoid the problem of maximally likely alternatives. Recall Royall’s treatment of maximally likely alternatives in the case of turning over the top card of a shuffled deck, and finding an ace of diamonds:

According to the law of likelihood, the hypothesis that the deck consists of 52 aces of diamonds (H1) is better supported than the hypothesis that the deck is normal (HN) [by the factor 52]…Some find this disturbing.

Not Royall.

Furthermore, it seems unfair; no matter what card is drawn, the law implies that the corresponding trick-deck hypothesis (52 cards just like the one drawn) is better supported than the normal-deck hypothesis. Thus even if the deck is normal we will always claim to have found strong evidence that it is not. (Royall 1997, pp. 13-14)

To Royall, it only shows a confusion between evidence and belief. If you’re not convinced the deck has 52 aces of diamonds “it does not mean that the observation is not strong evidence in favor of H1 versus HN.” It just wasn’t strong enough to overcome your prior beliefs.

The relation to Bayesian inference, as Royall notes, is that the likelihood ratio “that the law [LL] uses to measure the strength of the evidence, is precisely the factor by which the observation X = x would change the probability ratio” Pr(H0) /Pr(H1). (Royall 1997, p. 6). So, if you don’t think the maximally likely alternative is palatable, you can get around it by giving it a suitably low prior degree of probability. But the more likely hypothesis is still favored on grounds of evidence, according to this view. (Do Bayesians agree?)
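The card example and the prior-odds escape route can be sketched in a few lines. The deck representation and the prior odds of one in a million are hypothetical choices made only for illustration; exact fractions are used so that no rounding enters.

```python
from fractions import Fraction

RANKS = "A 2 3 4 5 6 7 8 9 10 J Q K".split()
SUITS = ["clubs", "diamonds", "hearts", "spades"]
DECK = [(rank, suit) for rank in RANKS for suit in SUITS]

def lr_trick_vs_normal(card):
    """LR of 'the deck is 52 copies of `card`' over 'the deck is normal'."""
    p_trick = Fraction(1)               # the matching trick deck yields `card` for certain
    p_normal = Fraction(1, len(DECK))   # a normal deck yields it with probability 1/52
    return p_trick / p_normal

# Whatever card turns up, the trick-deck hypothesis built to match it is favored.
ratios = {card: lr_trick_vs_normal(card) for card in DECK}
always_favored = all(lr > 1 for lr in ratios.values())

# Hypothetical prior odds (trick deck : normal deck) of one in a million.
prior_odds = Fraction(1, 1_000_000)
posterior_odds = prior_odds * ratios[("A", "diamonds")]

print(always_favored, ratios[("A", "diamonds")], posterior_odds)
```

The low prior keeps the posterior odds small, yet the likelihood ratio of 52 is untouched by it. And because the constructed rival is favored for every possible draw, the probability of reporting evidence against a normal deck, when the deck is normal, is 1.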

When this “appeal to beliefs” solution came up in the discussion at this session, some suggested that you should simply refrain from proposing implausible maximally likely alternatives! I think this misses the crucial issues.

3. What’s wrong with the “appeal to beliefs” solution to the (LL) problem: First, there are many cases where we want to distinguish the warrant for one and the same hypothesis according to whether it was constructed post hoc to fit the data or predesignated. The “use constructed” hypothesis H could well be plausible, but we’d still want to distinguish the evidential credit H deserves in the two cases, and appealing to priors does not help.

Second, to suppose one can be saved from the unpleasant consequences of the (LL) by the deus ex machina of a prior is to misidentify what the problem really is—at least when there is a problem (and not all data-dependent alternatives are problematic—see my double-counting papers, e.g., here). In the problem cases, the problem is due to the error probing capability of the overall testing procedure being diminished. You are not “sincerely trying”, as Popper puts it, to find flaws with claims, but instead you are happily finding evidence in favor of a well-fitting hypothesis that you deliberately construct—unless your intuitions tell you it is unbelievable. So now the task that was supposed to be performed by an account of statistical evidence is not being performed by it at all. It has to be performed by you, and you are the most likely one to follow your preconceived opinions and pet theories. You are the one in danger of confirmation bias. If your account of statistical evidence won’t supply tools to help you honestly criticize yourself (let alone allow the rest of us to fraud-bust your inference), then it comes up short in an essential way.

4. The role of statistical philosophy in philosophy of science. I recall having lunch with Royall when we first met (at an ecology conference around 1998) and trying to explain, “You see, in philosophy, we look to statistical accounts in order to address general problems about scientific evidence, inductive inference, and hypothesis testing. And one of the classic problems we wrestle with is that data underdetermine hypotheses; there are many hypotheses we can dream up to “fit” the data. We look to statistical philosophy to get insights into warranted inductive inference, to distinguish ad hoc hypotheses, confirmation biases, etc. We want to bring out the problem with that Texas “sharpshooter” who fires some shots into the side of a barn and then cleverly paints a target so that most of his hits are in the bull’s eye, and then takes this as evidence of his marksmanship. So, the problem with the (LL) is that it appears to license rather than condemn some of these pseudoscientific practices.”

His answer, as near as I can recall, was that he was doing statistics and didn’t know about these philosophical issues. Had it been current times, perhaps I could have been more effective in pointing up the “reproducibility crisis,” “big data,” and “fraud-busting”. Anyway, he wouldn’t relent, even on stopping rules.

But his general stance is one I often hear: We can take into account those tricky moves later on in our belief assignments. The (LL) just gives a measure of the evidence in the data. But this IS later on. Since these gambits can completely destroy your having any respectable evidence whatsoever, you can’t say “the evidence is fine, I’ll correct things with beliefs later on”.

Besides, the influence of the selection effects is not on the believability of H but rather on the capability of the test to have unearthed errors. Their influence is on the error probabilities of the test procedure, and yet the (LL) is conditional on the actual outcome.

5. Why does the likelihoodist not appeal to error probabilities to solve his problem? The answer is that he is convinced that such an appeal is necessarily limited to controlling erroneous actions in the long run. That is why Royall rejects it (claiming it is only relevant for “action”), and only a few of us here in exile have come around to mounting a serious challenge to this extreme behavioristic rationale for error statistical methods. Fisher, E. Pearson, and even Neyman some of the time, rejected such a crass behavioristic rationale, as have Birnbaum, Cox, Kempthorne and many other frequentists. (See this post on Pearson.)

Yet, I have just shown that the criticisms based on error probabilities have scarcely anything to do with the long run, but have everything to do with whether you have done a good job providing evidence for your favored hypothesis right now.

A likelihood ratio may be a criterion of relative fit but it “is still necessary to determine its sampling distribution in order to control the error involved in rejecting a true hypothesis, because a knowledge of [likelihoods] alone is not adequate to insure control of this error” (Pearson and Neyman, 1930, p. 106).

Pearson and Neyman should have been explicit as to how this error control is essential for a strong argument from coincidence in the case at hand.

Ironically, a great many critical discussions of frequentist error statistical inference (significance tests, confidence intervals, P-values, power, etc.) start with assuming “the law (LL)”, when in fact attention to the probativeness of tests by means of the relevant sampling distribution is just the cure the likelihoodist needs.

6. Is it true that all attempts to say whether x is good or terrible evidence for H are utterly futile? Royall says they are, that only comparing a fixed x to H versus some alternative H’ can work.

“[T]he likelihood view is that observations [like x and y]…have no valid interpretation as evidence in relation to the single hypothesis H.” (Royall 2004, p. 149).

But we should disagree. We most certainly can say that x is quite lousy evidence for H, if nothing (or very little) has been done to find flaws in H, or if I constructed an H to agree swimmingly with x, but by means that make it extremely easy to achieve, even if H is false.

Finding a non-statistically significant difference on the tested factor, I find a subgroup or post-data endpoint that gives “nominal” statistical significance. Whether H was pre-designated or post-designated makes no difference to the likelihood ratio, and the prior given to H would be the same whether it was pre- or post-designated. The post-designated alternative might be highly plausible, but I would still want to say that selection effects, cherry-picking, and generally “trying and trying again” alter the stringency of the test. This altered capacity in the test’s picking up on sources of bias and unreliability has no home in the (LL) account of evidence. That is why I say it fails in an essential way, as an account of evidence.
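To see how “trying and trying again” alters the error probabilities, consider a simplified sketch. It assumes a hypothetical searcher who looks at several independent subgroups, each with a true null, and reports any that reaches nominal significance at the .05 level. Real subgroups usually overlap, so independence is an idealization, and the function names are illustrative.

```python
import random

def prob_some_nominal_hit(num_looks, alpha=0.05):
    """Chance that at least one of `num_looks` independent tests of
    true nulls reaches nominal significance at level alpha."""
    return 1 - (1 - alpha) ** num_looks

def simulate(num_looks, alpha=0.05, trials=20_000, seed=0):
    """Monte Carlo check: under a true null each p-value is uniform on (0, 1)."""
    rng = random.Random(seed)
    hits = sum(
        any(rng.random() < alpha for _ in range(num_looks))
        for _ in range(trials)
    )
    return hits / trials

for looks in (1, 5, 20):
    analytic = prob_some_nominal_hit(looks)
    print(f"{looks:2d} looks: analytic = {analytic:.3f}, simulated = {simulate(looks):.3f}")
```

The likelihood ratio for whichever subgroup is finally reported is the same whether it was the only one examined or the best of twenty. What changes is the probability of turning up such a result when nothing is there, and that is a property of the procedure, not of the likelihoods.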

7. So what does the Bayesian say about the (LL)? I take it the Bayesian would deny that the comparative evidence account given by the (LL) is adequate. LRs are important, of course, but there are also prior probability assignments to hypotheses. Yet that would seem to get us right back to Royall’s problem that we have been discussing here.

In this connection, ponder note [v].

8. Background. You may wish to review “Breaking the Law! (of likelihood) (A) and (B)”, and Breaking the Royall Law of Likelihood (C). A relevant paper by Royall is here.



[i] The PSA program is here: Program.pdf. Zhang and Gandenberger are both excellent young philosophers of science who engage with real statistical methods.

[ii] For a full statement of the [LL] according to Royall: “If hypothesis A implies that the probability that a random variable X takes the value x is pA(x), while hypothesis B implies that the probability is pB(x), then the observation X = x is evidence supporting A over B if and only if pA(x) > pB(x), and the likelihood ratio, pA(x)/ pB(x), measures the strength of that evidence.” (Royall, 2004, p. 122)

“This says simply that if an event is more probable under hypothesis A than hypothesis B, then the occurrence of that event is evidence supporting A over B––the hypothesis that did the better job of predicting the event is better supported by its occurrence.” Moreover, “the likelihood ratio, is the exact factor by which the probability ratio [ratio of priors in A and B] is changed.” (ibid. 123)

Aside from denying the underlined sentence, can a Bayesian violate the [LL]? In comments to this first post, it was argued that they can.

[iii] In fact, Gandenberger’s paper was about why he is not a “methodological likelihoodist” and Zhang was only dealing with a specific criticism of (LL) by Forster.  [Gandenberger’s blog:]

[iv] Granted, the speakers did not declare Royall’s way out of the problem leads to bankruptcy, as I would have wanted them to.

[v] I’m placing this here for possible input later on. Royall considers the familiar example where a positive diagnostic result is more probable under “disease” than “no disease”. If the prior probability for disease is sufficiently small, it can result in a low posterior for disease. For Royall, “to interpret the positive test result as evidence that the subject does not have the disease is never appropriate––it is simply and unequivocally wrong. Why is it wrong?” (2004, 122). Because it violates the (LL). This gets to the contrast between “Bayes boosts” and high posterior again. I take it the Bayesian response would be to agree, but still deny there is evidence for disease. Yes? [This is like our example of Isaac who passes many tests of high school readiness, so the LR in favor of his being ready is positive. However, having been randomly selected from “Fewready” town, the posterior for his readiness is still low (despite its having increased).] Severity here seems to be in sync with the B-boosters, at least in direction of evidence.
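To make the contrast in [v] between a “Bayes boost” and a high posterior concrete, here is a minimal sketch. The prior, the test’s sensitivity, and its false-positive rate are hypothetical numbers chosen only for illustration; none of them comes from Royall.

```python
def likelihood_ratio(p_pos_given_disease, p_pos_given_healthy):
    """LR for 'disease' over 'no disease' given a positive result."""
    return p_pos_given_disease / p_pos_given_healthy


def posterior_disease(prior, p_pos_given_disease, p_pos_given_healthy):
    """Posterior probability of disease after a positive result (Bayes' theorem)."""
    numerator = prior * p_pos_given_disease
    return numerator / (numerator + (1 - prior) * p_pos_given_healthy)


# Hypothetical values, assumed purely for illustration.
PRIOR = 0.001      # assumed prevalence of the disease
SENS = 0.90        # assumed P(positive | disease)
FALSE_POS = 0.05   # assumed P(positive | no disease)

lr = likelihood_ratio(SENS, FALSE_POS)
post = posterior_disease(PRIOR, SENS, FALSE_POS)

print(f"likelihood ratio (disease vs. no disease): {lr:.1f}")
print(f"prior {PRIOR:.4f} -> posterior {post:.4f}")
```

With these assumed inputs the likelihood ratio favors “disease” and the posterior rises above the prior (the boost), yet the posterior stays below 2 percent.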



Mayo, D. G. (2014) On the Birnbaum Argument for the Strong Likelihood Principle (with discussion & rejoinder). Statistical Science 29, no. 2, 227-266.

Mayo, D. G. (2004). “An Error-Statistical Philosophy of Evidence,” 79-118, in M. Taper and S. Lele (eds.) The Nature of Scientific Evidence: Statistical, Philosophical and Empirical Considerations. Chicago: University of Chicago Press.

Pearson, E.S. & Neyman, J. (1930). On the problem of two samples. In J. Neyman and E.S. Pearson, 1967, Joint Statistical Papers, (99-115). Cambridge: CUP.

Royall, R. (1997) Statistical Evidence: A likelihood paradigm, Chapman and Hall, CRC Press.

Royall, R. (2004), “The Likelihood Paradigm for Statistical Evidence” 119-138; Rejoinder 145-151, in M. Taper, and S. Lele (eds.) The Nature of Scientific Evidence: Statistical, Philosophical and Empirical Considerations. Chicago: University of Chicago Press.




Categories: highly probable vs highly probed, law of likelihood, Richard Royall, Statistics | 62 Comments

A biased report of the probability of a statistical fluke: Is it cheating?

One year ago I reblogged a post from Matt Strassler, “Nature is Full of Surprises” (2011). In it he claims that

[Statistical debate] “often boils down to this: is the question that you have asked in applying your statistical method the most even-handed, the most open-minded, the most unbiased question that you could possibly ask?

It’s not asking whether someone made a mathematical mistake. It is asking whether they cheated — whether they adjusted the rules unfairly — and biased the answer through the question they chose…”

(Nov. 2014): I am impressed (i.e., struck by the fact) that he goes so far as to call it “cheating”. Anyway, here is the rest of the reblog from Strassler which bears on a number of recent discussions:

“…If there are 23 people in a room, the chance that two of them have the same birthday is 50 percent, while the chance that two of them were born on a particular day, say, January 1st, is quite low, a small fraction of a percent. The more you specify the coincidence, the rarer it is; the broader the range of coincidences at which you are ready to express surprise, the more likely it is that one will turn up.

Humans are notoriously incompetent at estimating these types of probabilities… which is why scientists (including particle physicists), when they see something unusual in their data, always try to quantify the probability that it is a statistical fluke — a pure chance event. You would not want to be wrong, and celebrate your future Nobel prize only to receive instead a booby prize. (And nature gives out lots and lots of booby prizes.) So scientists, grabbing their statistics textbooks and appealing to the latest advances in statistical techniques, compute these probabilities as best they can. Armed with these numbers, they then try to infer whether it is likely that they have actually discovered something new or not.

And on the whole, it doesn’t work. Unless the answer is so obvious that no statistical argument is needed, the numbers typically do not settle the question.

Despite this remark, you mustn’t think I am arguing against doing statistics. One has to do something better than guessing. But there is a reason for the old saw: “There are three types of falsehoods: lies, damned lies, and statistics.” It’s not that statistics themselves lie, but that to some extent, unless the case is virtually airtight, you can almost always choose to ask a question in such a way as to get any answer you want. … [For instance, in 1991 the volcano Pinatubo in the Philippines had its titanic eruption while a hurricane (or ‘typhoon’ as it is called in that region) happened to be underway. Oh, and the collapse of Lehman Brothers on Sept 15, 2008 was followed within three days by the breakdown of the Large Hadron Collider (LHC) during its first week of running… Coincidence? I-think-so.] One can draw completely different conclusions, both of them statistically sensible, by looking at the same data from two different points of view, and asking for the statistical answer to two different questions.” (my emphasis) Continue reading
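The birthday figures in the quoted passage can be checked directly. The following is a minimal sketch, assuming 365 equally likely birthdays and independence between people; the function names are illustrative and do not come from Strassler’s post.

```python
def p_shared_birthday(n, days=365):
    """P(at least two of n people share some birthday)."""
    p_all_distinct = 1.0
    for i in range(n):
        p_all_distinct *= (days - i) / days
    return 1.0 - p_all_distinct


def p_two_or_more_on_given_day(n, days=365):
    """P(at least two of n people were born on one pre-specified day)."""
    q = 1.0 / days
    p_none = (1 - q) ** n
    p_exactly_one = n * q * (1 - q) ** (n - 1)
    return 1.0 - p_none - p_exactly_one


print(f"any shared birthday, n=23:       {p_shared_birthday(23):.4f}")
print(f"two or more on Jan 1st, n=23:    {p_two_or_more_on_given_day(23):.4f}")
```

The first probability comes out just above one half; the second is a fraction of a percent. Pre-specifying the coincidence is what makes it rare.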

Categories: Higgs, spurious p values, Statistics | 7 Comments

The Amazing Randi’s Million Dollar Challenge

The NY Times Magazine had a feature on the Amazing Randi yesterday, “The Unbelievable Skepticism of the Amazing Randi.” It described one of the contestants in Randi’s most recent Million Dollar Challenge, Fei Wang:

“[Wang] claimed to have a peculiar talent: from his right hand, he could transmit a mysterious force a distance of three feet, unhindered by wood, metal, plastic or cardboard. The energy, he said, could be felt by others as heat, pressure, magnetism or simply “an indescribable change.” Tonight, if he could demonstrate the existence of his ability under scientific test conditions, he stood to win $1 million.”

Isn’t “an indescribable change” rather vague?

“… The Challenge organizers had spent weeks negotiating with Wang and fine-tuning the protocol for the evening’s test. A succession of nine blindfolded subjects would come onstage and place their hands in a cardboard box. From behind a curtain, Wang would transmit his energy into the box. If the subjects could successfully detect Wang’s energy on eight out of nine occasions, the trial would confirm Wang’s psychic power. …”

After two women failed to detect the “mystic force” the M.C. announced the contest was over.

“With two failures in a row, it was impossible for Wang to succeed. The Million Dollar Challenge was already over.”

You’d think they might have given him another chance or something.
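For a rough sense of how demanding “eight out of nine” is, here is a minimal sketch. The per-trial chance rate of 0.5 is purely an assumption for illustration; the article does not say what success rate guessing would yield under the actual protocol.

```python
from math import comb


def p_at_least(k, n, p):
    """Binomial tail: P(at least k successes in n independent trials)."""
    return sum(comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(k, n + 1))


def can_still_pass(failures, n=9, needed=8):
    """True if the remaining trials could still bring the total to `needed`."""
    return n - failures >= needed


ASSUMED_CHANCE_RATE = 0.5  # hypothetical; not stated in the article

print(f"P(>= 8 of 9 by guessing): {p_at_least(8, 9, ASSUMED_CHANCE_RATE):.4f}")
print("still possible after 1 failure: ", can_still_pass(1))
print("still possible after 2 failures:", can_still_pass(2))
```

Under that assumed rate, passing by luck alone would happen about 2 percent of the time, and the second failure is exactly the point at which eight successes become unreachable.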

“Stepping out from behind the curtain, Wang stood center stage, wearing an expression of numb shock, like a toddler who has just dropped his ice cream in the sand. He was at a loss to explain what had gone wrong; his tests with a paranormal society in Boston had all succeeded. Nothing could convince him that he didn’t possess supernatural powers. ‘This energy is mysterious,’ he told the audience. ‘It is not God.’ He said he would be back in a year, to try again.”

The article is here. If you don’t know who A. Randi is, you should read it.

Randi, much better known during Uri Geller spoon-bending days, has long been the guru to skeptics and fraudbusters, but also a hero to some critical psi believers like I.J. Good. Geller continually sued Randi for calling him a fraud. As such, I.J. Good warned me that I might be taking a risk in my use of “gellerization” in EGEK (1996), but I guess Geller doesn’t read philosophy of science. A post on “Statistics and ESP Research” and Diaconis is here.


I’d love to have seen Randi break out of these chains!


Categories: Error Statistics | 3 Comments

“Statistical Flukes, the Higgs Discovery, and 5 Sigma” at the PSA

We had an excellent discussion at our symposium yesterday: “How Many Sigmas to Discovery? Philosophy and Statistics in the Higgs Experiments” with Robert Cousins, Allan Franklin and Kent Staley. Slides from my presentation, “Statistical Flukes, the Higgs Discovery, and 5 Sigma” are posted below (we each only had 20 minutes, so this is clipped, but much came out in the discussion). Even the challenge I read about this morning as to what exactly the Higgs researchers discovered (and I’ve no clue if there’s anything to the idea of a “techni-higgs particle”) — would not invalidate* the knowledge of the experimental effects severely tested.


*Although, as always, there may be a reinterpretation of the results. But I think the article is an isolated bit of speculation. I’ll update if I hear more.

Categories: Higgs, highly probable vs highly probed, Statistics | 26 Comments

Philosophy of Science Assoc. (PSA) symposium on Philosophy of Statistics in the Higgs Experiments “How Many Sigmas to Discovery?”



The biennial meeting of the Philosophy of Science Association (PSA) starts this week (Nov. 6-9) in Chicago, together with the History of Science Society. I’ll be part of the symposium:


How Many Sigmas to Discovery?
Philosophy and Statistics in the Higgs Experiments


on Nov. 8 with Robert Cousins, Allan Franklin, and Kent Staley. If you’re in the neighborhood, stop by.



“A 5 sigma effect!” is how the recent Higgs boson discovery was reported. Yet before the dust had settled, the very nature and rationale of the 5 sigma (or 5 standard deviation) discovery criteria began to be challenged and debated both among scientists and in the popular press. Why 5 sigma? How is it to be interpreted? Do p-values in high-energy physics (HEP) avoid controversial uses and misuses of p-values in social and other sciences? The goal of our symposium is to combine the insights of philosophers and scientists whose work interrelates philosophy of statistics, data analysis and modeling in experimental physics, with critical perspectives on how discoveries proceed in practice. Our contributions will link questions about the nature of statistical evidence, inference, and discovery with questions about the very creation of standards for interpreting and communicating statistical experiments. We will bring out some unique aspects of discovery in modern HEP. We also show the illumination the episode offers to some of the thorniest issues revolving around statistical inference, frequentist and Bayesian methods, and the philosophical, technical, social, and historical dimensions of scientific discovery.


1) How do philosophical problems of statistical inference interrelate with debates about inference and modeling in high energy physics (HEP)?

2) Have standards for scientific discovery in particle physics shifted? And if so, how has this influenced when a new phenomenon is “found”?

3) Can understanding the roles of statistical hypotheses tests in HEP resolve classic problems about their justification in both physical and social sciences?

4) How do pragmatic, epistemic and non-epistemic values and risks influence the collection, modeling, and interpretation of data in HEP?


Abstracts for Individual Presentations

(1) Unresolved Philosophical Issues Regarding Hypothesis Testing in High Energy Physics
Robert D. Cousins.
Professor, Department of Physics and Astronomy, University of California, Los Angeles (UCLA)

The discovery and characterization of a Higgs boson in 2012-2013 provide multiple examples of statistical inference as practiced in high energy physics (elementary particle physics). The main methods employed have a decidedly frequentist flavor, drawing in a pragmatic way on both Fisher’s ideas and the Neyman-Pearson approach. A physics model being tested typically has a “law of nature” at its core, with parameters of interest representing masses, interaction strengths, and other presumed “constants of nature”. Additional “nuisance parameters” are needed to characterize the complicated measurement processes. The construction of confidence intervals for a parameter of interest θ is dual to hypothesis testing, in that the test of the null hypothesis θ=θ0 at significance level (“size”) α is equivalent to whether θ0 is contained in a confidence interval for θ with confidence level (CL) equal to 1-α. With CL or α specified in advance (“pre-data”), frequentist coverage properties can be assured, at least approximately, although nuisance parameters bring in significant complications. With data in hand, the post-data p-value can be defined as the smallest significance level α at which the null hypothesis would be rejected, had that α been specified in advance. Carefully calculated p-values (not assuming normality) are mapped onto the equivalent number of standard deviations (“σ”) in a one-tailed test of the mean of a normal distribution. For a discovery such as the Higgs boson, experimenters report both p-values and confidence intervals of interest. Continue reading
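The mapping the abstract describes, from a p-value to an equivalent number of standard deviations in a one-tailed normal test, can be sketched as follows. This is only an illustration of the conversion itself, using the Python standard library; the function names are illustrative, and nothing here reflects how the experiments computed their p-values in the first place.

```python
from math import erfc, sqrt
from statistics import NormalDist


def sigma_to_p(z):
    """One-tailed p-value for a standard normal deviate z: P(Z >= z)."""
    return 0.5 * erfc(z / sqrt(2))


def p_to_sigma(p):
    """Equivalent number of standard deviations for a one-tailed p-value."""
    # P(Z >= z) = p  is the same as  P(Z <= -z) = p, so z = -inv_cdf(p).
    return -NormalDist().inv_cdf(p)


for z in (2, 3, 5):
    print(f"{z} sigma -> one-tailed p = {sigma_to_p(z):.3g}")

print(f"p = 0.05 -> {p_to_sigma(0.05):.3f} sigma")
```

On this convention, 5 sigma corresponds to a one-tailed p-value of roughly 3 in 10 million.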

Categories: Error Statistics, Higgs, P-values | 18 Comments

Oxford Gaol: Statistical Bogeymen

Memory Lane: 3 years ago. Oxford Jail (also called Oxford Castle) is an entirely fitting place to be on (and around) Halloween! Moreover, rooting around this rather lavish set of jail cells (what used to be a single cell is now a dressing room) is every bit as conducive to philosophical reflection as is exile on Elba! (It is now a boutique hotel, though many of the rooms are still too jail-like for me.)  My goal (while in this gaol—as the English sometimes spell it) is to try and free us from the bogeymen and bogeywomen often associated with “classical” statistics. As a start, the very term “classical statistics” should, I think, be shelved, not that names should matter.

In appraising statistical accounts at the foundational level, we need to realize the extent to which accounts are viewed through the eyeholes of a mask or philosophical theory. Moreover, the mask some wear while pursuing this task might well be at odds with their ordinary way of looking at evidence, inference, and learning. In any event, to avoid question-begging criticisms, the standpoint from which the appraisal is launched must itself be independently defended. But for (most) Bayesian critics of error statistics the assumption that uncertain inference demands a posterior probability for claims inferred is thought to be so obvious as not to require support. Critics are implicitly making assumptions that are at odds with the frequentist statistical philosophy. In particular, they assume a certain philosophy about statistical inference (probabilism), often coupled with the allegation that error statistical methods can only achieve radical behavioristic goals, wherein all that matters are long-run error rates (of some sort).

Criticisms then follow readily, in the form of one or both:

  • Error probabilities do not supply posterior probabilities in hypotheses; interpreted as if they do (and some say we just can’t help it), they lead to inconsistencies.
  • Methods with good long-run error rates can give rise to counterintuitive inferences in particular cases.

I have proposed an alternative philosophy that replaces these tenets with different ones:

  • the role of probability in inference is to quantify how reliably or severely claims (or discrepancies from claims) have been tested
  • the severity goal directs us to the relevant error probabilities, avoiding the oft-repeated statistical fallacies due to tests that are overly sensitive, as well as those insufficiently sensitive to particular errors.
  • Control of long run error probabilities, while necessary, is not sufficient for good tests or warranted inferences.

Continue reading

Categories: 3-year memory lane, Bayesian/frequentist, Philosophy of Statistics, Statistics | 30 Comments

To Quarantine or not to Quarantine?: Science & Policy in the time of Ebola



 Bioethicist Arthur Caplan gives “7 Reasons Ebola Quarantine Is a Bad, Bad Idea”. I’m interested to know what readers think (I claim no expertise in this area.) My occasional comments are in red. 

“Bioethicist: 7 Reasons Ebola Quarantine Is a Bad, Bad Idea”

In the fight against Ebola some government officials in the U.S. are now managing fear, not the virus. Quarantines have been declared in New York, New Jersey and Illinois. In Connecticut, nine people are in quarantine: two students at Yale; a worker from AmeriCARES; and a West African family.

Many others are or soon will be.

Quarantining those who do not have symptoms is not the way to combat Ebola. In fact it will only make matters worse. Far worse. Why?

  1. Quarantining people without symptoms makes no scientific sense.

They are not infectious. The only way to get Ebola is to have someone vomit on you, bleed on you, share spit with you, have sex with you or get fecal matter on you when they have a high viral load.

How do we know this?

Because there is data going back to 1975 from outbreaks in the Congo, Uganda, Sudan, Gabon, Ivory Coast, South Africa, not to mention current experience in the United States, Spain and other nations.

The list of “the only way to get Ebola” does not suggest it is so extraordinarily difficult to transmit as to imply the policy “makes no scientific sense”. That there is “data going back to 1975” doesn’t tell us how it was analyzed. They may not be infectious today, but…

  2. Quarantine is next to impossible to enforce.

If you don’t want to stay in your home or wherever you are supposed to stay for three weeks, then what? Do we shoot you, Taser you, drag you back into your house in a protective suit, or what?

And who is responsible for watching you 24-7? Quarantine relies on the honor system. That essentially is what we count on when we tell people with symptoms to call 911 or the health department.

It does appear that this hasn’t been well thought through yet. NY Governor Cuomo said that “Doctors Without Borders”, the group that sponsors many of the volunteers, already requires volunteers to “decompress” for three weeks upon return from Africa, and they compensate their doctors during this time (see the above link). The state of NY would fill in for those sponsoring groups that do not offer compensation (at least in NY). Is the existing 3 week decompression period already a clue that they want people cleared before they return to work? Continue reading

Categories: science communication | 48 Comments



3 years ago…

MONTHLY MEMORY LANE: 3 years ago: October 2011 (I mark in red 3 posts that seem most apt for general background on key issues in this blog*)

*I indicated I’d begin this new, once-a-month feature at the 3-year anniversary. I will repost and comment on one each month. (I might repost others that I do not comment on, as Oct. 31, 2014). For newcomers, here’s your chance to catch up; for old timers, this is philosophy: rereading is essential!

Categories: 3-year memory lane, blog contents, Statistics | Leave a comment

September 2014: Blog Contents

September 2014: Error Statistics Philosophy
Blog Table of Contents 

Compiled by Jean A. Miller

  • (9/30) Letter from George (Barnard)
  • (9/27) Should a “Fictionfactory” peepshow be barred from a festival on “Truth and Reality”? Diederik Stapel says no (rejected post)
  • (9/23) G.A. Barnard: The Bayesian “catch-all” factor: probability vs likelihood
  • (9/21) Statistical Theater of the Absurd: “Stat on a Hot Tin Roof”
  • (9/18) Uncle Sam wants YOU to help with scientific reproducibility!
  • (9/15) A crucial missing piece in the Pistorius trial? (2): my answer (Rejected Post)
  • (9/12) “The Supernal Powers Withhold Their Hands And Let Me Alone”: C.S. Peirce
  • (9/6) Statistical Science: The Likelihood Principle issue is out…!
  • (9/4) All She Wrote (so far): Error Statistics Philosophy Contents-3 years on
  • (9/3) 3 in blog years: Sept 3 is 3rd anniversary of





Categories: Announcement, blog contents, Statistics | Leave a comment

PhilStat/Law: Nathan Schachtman: Acknowledging Multiple Comparisons in Statistical Analysis: Courts Can and Must



The following is from Nathan Schachtman’s legal blog, with various comments and added emphases (by me, in this color). He will try to reply to comments/queries.

“Courts Can and Must Acknowledge Multiple Comparisons in Statistical Analyses”

Nathan Schachtman, Esq., PC * October 14th, 2014

In excluding the proffered testimony of Dr. Anick Bérard, a Canadian perinatal epidemiologist in the Université de Montréal, the Zoloft MDL trial court discussed several methodological shortcomings and failures, including Bérard’s reliance upon claims of statistical significance from studies that conducted dozens and hundreds of multiple comparisons.[i] The Zoloft MDL court was not the first court to recognize the problem of over-interpreting the putative statistical significance of results that were one among many statistical tests in a single study. The court was, however, among a fairly small group of judges who have shown the needed statistical acumen in looking beyond the reported p-value or confidence interval to the actual methods used in a study[1].



A complete and fair evaluation of the evidence in situations as occurred in the Zoloft birth defects epidemiology required more than the presentation of the size of the random error, or the width of the 95 percent confidence interval.  When the sample estimate arises from a study with multiple testing, presenting the sample estimate with the confidence interval, or p-value, can be highly misleading if the p-value is used for hypothesis testing.  The fact of multiple testing will inflate the false-positive error rate. Dr. Bérard ignored the context of the studies she relied upon. What was noteworthy is that Bérard encountered a federal judge who adhered to the assigned task of evaluating methodology and its relationship with conclusions.

*   *   *   *   *   *   *

There is no unique solution to the problem of multiple comparisons. Some researchers use Bonferroni or other quantitative adjustments to p-values or confidence intervals, whereas others reject adjustments in favor of qualitative assessments of the data in the full context of the study and its methods. See, e.g., Kenneth J. Rothman, “No Adjustments Are Needed For Multiple Comparisons,” 1 Epidemiology 43 (1990) (arguing that adjustments mechanize and trivialize the problem of interpreting multiple comparisons). Two things are clear from Professor Rothman’s analysis. First, for someone intent upon strict statistical significance testing, the presence of multiple comparisons means that the rejection of the null hypothesis cannot be done without further consideration of the nature and extent of both the disclosed and undisclosed statistical testing. Rothman, of course, has inveighed against strict significance testing under any circumstance, but the multiple testing would only compound the problem.

Second, although failure to adjust p-values or intervals quantitatively may be acceptable, failure to acknowledge the multiple testing is poor statistical practice. The practice is, alas, too prevalent for anyone to say that ignoring multiple testing is fraudulent, and the Zoloft MDL court certainly did not condemn Dr. Bérard as a fraudfeasor[2]. [emphasis mine]

I’m perplexed by this mixture of stances. If you don’t mention the multiple testing for which it is acceptable not to adjust, then you’re guilty of poor statistical practice; but it’s “too prevalent for anyone to say that ignoring multiple testing is fraudulent”. This appears to claim it’s poor statistical practice if you fail to mention your results are due to multiple testing, but “ignoring multiple testing” (which could mean failing to adjust or, more likely, failing to mention it) is not fraudulent. Perhaps it’s a questionable research practice (QRP). It’s back to “50 shades of grey between QRPs and fraud.”

  […read his full blogpost here]

Previous cases have also acknowledged the multiple testing problem. In litigation claims for compensation for brain tumors for cell phone use, plaintiffs’ expert witness relied upon subgroup analysis, which added to the number of tests conducted within the epidemiologic study at issue. Newman v. Motorola, Inc., 218 F. Supp. 2d 769, 779 (D. Md. 2002), aff’d, 78 Fed. App’x 292 (4th Cir. 2003). The trial court explained:

“[Plaintiff’s expert] puts overdue emphasis on the positive findings for isolated subgroups of tumors. As Dr. Stampfer explained, it is not good scientific methodology to highlight certain elevated subgroups as significant findings without having earlier enunciated a hypothesis to look for or explain particular patterns, such as dose-response effect. In addition, when there is a high number of subgroup comparisons, at least some will show a statistical significance by chance alone.”

I’m going to require, as part of its meaning, that a statistically significant difference not be one due to “chance variability” alone. Then to avoid self-contradiction, this last sentence might be put as follows: “when there is a high number of subgroup comparisons, at least some will show purported or nominal or unaudited statistical significance by chance alone. [Which term do readers prefer?] If one hunts down one’s hypothesized comparison in the data, then the actual p-value will not equal, and will generally be greater than, the nominal or unaudited p-value.”

So, I will insert “nominal” where needed below (in red).

Texas Sharpshooter fallacy

Id. And shortly after the Supreme Court decided Daubert, the Tenth Circuit faced the reality of data dredging in litigation, and its effect on the meaning of “significance”:

“Even if the elevated levels of lung cancer for men had been [nominally] statistically significant a court might well take account of the statistical “Texas Sharpshooter” fallacy in which a person shoots bullets at the side of a barn, then, after the fact, finds a cluster of holes and draws a circle around it to show how accurate his aim was. With eight kinds of cancer for each sex there would be sixteen potential categories here around which to “draw a circle” to show a [nominally] statistically significant level of cancer. With independent variables one would expect one statistically significant reading in every twenty categories at a 95% confidence level purely by random chance.”

The Texas sharpshooter fallacy is one of my all time favorites. One purports to be testing the accuracy of his aim, when in fact that is not the process that gave rise to the impressive-looking (nominal) cluster of hits. The results do not warrant inferences about his ability to accurately hit a target, since that hasn’t been well-probed. Continue reading
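The arithmetic behind the Tenth Circuit’s “sixteen potential categories” remark can be sketched as follows, assuming (as the court’s sentence does) independent comparisons each tested at the 0.05 level with every null hypothesis true. The function names are illustrative.

```python
def expected_false_positives(m, alpha=0.05):
    """Expected number of nominally significant results among m true nulls."""
    return m * alpha


def p_at_least_one(m, alpha=0.05):
    """P(at least one nominally significant result) for m independent true nulls."""
    return 1 - (1 - alpha) ** m


for m in (1, 16, 20):
    print(
        f"m={m:2d}: expected nominal hits = {expected_false_positives(m):.2f}, "
        f"P(at least one) = {p_at_least_one(m):.3f}"
    )
```

With sixteen categories, the chance of finding at least one circle to draw around is better than even, although each individual comparison is reported at the 0.05 level.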

Categories: P-values, PhilStat Law, Statistics | 12 Comments

Gelman recognizes his error-statistical (Bayesian) foundations


From Gelman’s blog:

“In one of life’s horrible ironies, I wrote a paper “Why we (usually) don’t have to worry about multiple comparisons” but now I spend lots of time worrying about multiple comparisons”


Exhibit A: [2012] Why we (usually) don’t have to worry about multiple comparisons. Journal of Research on Educational Effectiveness 5, 189-211. (Andrew Gelman, Jennifer Hill, and Masanao Yajima)

Exhibit B: The garden of forking paths: Why multiple comparisons can be a problem, even when there is no “fishing expedition” or “p-hacking” and the research hypothesis was posited ahead of time, in press. (Andrew Gelman and Eric Loken) (Shortened version is here.)


The “forking paths” paper, in my reading, basically argues that mere hypothetical possibilities about what you would or might have done had the data been different (in order to secure a desired interpretation) suffice to alter the characteristics of the analysis you actually did. That’s an error statistical argument–maybe even stronger than what some error statisticians would say. What’s really being condemned are overly flexible ways to move from statistical results to substantive claims. The p-values are illicit when taken to provide evidence for those claims because an actual p-value requires Prob(P < p; H0) = p (and the actual p-value has become much greater by design). The criticism makes perfect sense if you’re scrutinizing inferences according to how well or severely tested they are. Actual error probabilities are accordingly altered or unable to be calculated. However, if one is going to scrutinize inferences according to severity then the same problematic flexibility would apply to Bayesian analyses, whether or not they have a way to pick up on it. (It’s problematic if they don’t.) I don’t see the magic by which a concern for multiple testing disappears in Bayesian analysis (e.g., in the first paper) except by assuming some prior takes care of it.
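A small simulation can illustrate how flexibility breaks the requirement Prob(P < p; H0) = p. It is a crude stand-in for forking paths, resting on an assumption that is not in Gelman and Loken’s paper: the analyst has k independent analyses available, the null is true in all of them, and the smallest p-value is the one reported. The seed, trial count, and function names are arbitrary choices for illustration.

```python
import random
from math import erfc, sqrt


def one_sided_p(z):
    """One-tailed p-value for a standard normal test statistic."""
    return 0.5 * erfc(z / sqrt(2))


def actual_rate(k, nominal=0.05, trials=20000, seed=0):
    """Estimate P(smallest of k null p-values < nominal) by simulation."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        # Under H0 each test statistic is standard normal.
        best = min(one_sided_p(rng.gauss(0, 1)) for _ in range(k))
        if best < nominal:
            hits += 1
    return hits / trials


for k in (1, 5, 10):
    print(f"k={k:2d} available analyses: actual rate ~ {actual_rate(k):.3f}")
```

With a single pre-specified analysis the estimated rate sits near the nominal 0.05; with ten to choose from it is several times larger, so the reported p-value no longer means what it claims.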

Categories: Error Statistics, Gelman | 15 Comments

BREAKING THE (Royall) LAW! (of likelihood) (C)



With this post, I finally get back to the promised sequel to “Breaking the Law! (of likelihood) (A) and (B)” from a few weeks ago. You might wish to read that one first.* A relevant paper by Royall is here.

Richard Royall is a statistician[1] who has had a deep impact on recent philosophy of statistics by giving a neat proposal that appears to settle disagreements about statistical philosophy! He distinguishes three questions:

  • What should I believe?
  • How should I act?
  • Is this data evidence of some claim? (or How should I interpret this body of observations as evidence?)

It all sounds quite sensible– at first–and, impressively, many statisticians and philosophers of different persuasions have bought into it. At least they appear willing to go this far with him on the 3 questions.

How is each question to be answered? According to Royall’s writings, what to believe is captured by Bayesian posteriors; how to act, by a behavioristic, N-P long-run performance. And what method answers the evidential question? A comparative likelihood approach. You may want to reject all of them (as I do),[2] but just focus on the last.

Remember, with likelihoods, the data x are fixed, the hypotheses vary. A great many critical discussions of frequentist error statistical inference (significance tests, confidence intervals, p-values, power, etc.) start with “the law”. But I fail to see why we should obey it.

To begin with, a report of comparative likelihoods isn’t very useful: H might be less likely than H’, given x, but so what? What do I do with that information? It doesn’t tell me I have evidence against or for either.[3] Recall, as well, Hacking’s points here about the variability in the meanings of a likelihood ratio across problems. Continue reading

Categories: law of likelihood, Richard Royall, Statistics | 41 Comments

A (Jan 14, 2014) interview with Sir David Cox by “Statistics Views”

Sir David Cox


The original Statistics Views interview is here:

“I would like to think of myself as a scientist, who happens largely to specialise in the use of statistics”– An interview with Sir David Cox


  • Author: Statistics Views
  • Date: 24 Jan 2014
  • Copyright: Image appears courtesy of Sir David Cox

Sir David Cox is arguably one of the world’s leading living statisticians. He has made pioneering and important contributions to numerous areas of statistics and applied probability over the years, of which perhaps the best known is the proportional hazards model, which is widely used in the analysis of survival data. The Cox point process was named after him.

Sir David studied mathematics at St John’s College, Cambridge and obtained his PhD from the University of Leeds in 1949. He was employed from 1944 to 1946 at the Royal Aircraft Establishment, from 1946 to 1950 at the Wool Industries Research Association in Leeds, and from 1950 to 1955 worked at the Statistical Laboratory at the University of Cambridge. From 1956 to 1966 he was Reader and then Professor of Statistics at Birkbeck College, London. In 1966, he took up the Chair position in Statistics at Imperial College London, where he later became Head of the Department of Mathematics for a period. In 1988 he became Warden of Nuffield College and was a member of the Department of Statistics at Oxford University. He formally retired from these positions in 1994 but continues to work in Oxford.

Sir David has received numerous awards and honours over the years. He has been awarded the Guy Medals in Silver (1961) and Gold (1973) by the Royal Statistical Society. He was elected Fellow of the Royal Society of London in 1973, was knighted in 1985 and became an Honorary Fellow of the British Academy in 2000. He is a Foreign Associate of the US National Academy of Sciences and a foreign member of the Royal Danish Academy of Sciences and Letters. In 1990 he won the Kettering Prize and Gold Medal for Cancer Research for “the development of the Proportional Hazard Regression Model” and in 2010 he was awarded the Copley Medal by the Royal Society.

He has supervised and collaborated with many students over the years, many of whom are now successful in statistics in their own right, such as David Hinkley and Past President of the Royal Statistical Society, Valerie Isham. Sir David has served as President of the Bernoulli Society, Royal Statistical Society, and the International Statistical Institute.

This year, Sir David is to turn 90*. Here Statistics Views talks to Sir David about his prestigious career in statistics, working with the late Professor Lindley, his thoughts on Jeffreys and Fisher, being President of the Royal Statistical Society during the Thatcher Years, Big Data and the best time of day to think of statistical methods.

1. With an educational background in mathematics at St John’s College, Cambridge and the University of Leeds, when and how did you first become aware of statistics as a discipline?

I was studying at Cambridge during the Second World War and after two years, one was sent either into the Forces or into some kind of military research establishment. There were very few statisticians then, although it was realised there was a need for statisticians. It was assumed that anybody who was doing reasonably well at mathematics could pick up statistics in a week or so! So, aged 20, I went to the Royal Aircraft Establishment in Farnborough, which is enormous and still there to this day if in a different form, and I worked in the Department of Structural and Mechanical Engineering, doing statistical work. So statistics was forced upon me, so to speak, as was the case for many mathematicians at the time because, aside from UCL, there had been very little teaching of statistics in British universities before the Second World War. Afterwards, it all started to expand.

2. From 1944 to 1946 you worked at the Royal Aircraft Establishment and then from 1946 to 1950 at the Wool Industries Research Association in Leeds. Did statistics have any role to play in your first roles out of university?

Totally. In Leeds, it was largely statistics but also to some extent, applied mathematics because there were all sorts of problems connected with the wool and textile industry in terms of the physics, chemistry and biology of the wool and some of these problems were mathematical but the great majority had a statistical component to them. That experience was not totally uncommon at the time and many who became academic statisticians had, in fact, spent several years working in a research institute first.

3. From 1950 to 1955, you worked at the Statistical Laboratory at Cambridge and would have been there at the same time as Fisher and Jeffreys. The late Professor Dennis Lindley, who was also there at that time, told me that the best people working on statistics were not in the statistics department at that time. What are your memories when you look back on that time and what do you feel were your main achievements?

Lindley was exactly right about Jeffreys and Fisher. They were two great scientists outside statistics – Jeffreys founded modern geophysics and Fisher was a major figure in genetics. Dennis was a contemporary and very impressive and effective. We were colleagues for five years and our children even played together.

The first lectures on statistics I attended as a student consisted of a short course by Harold Jeffreys, who had at the time a massive reputation as virtually the inventor of modern geophysics. His Theory of Probability, published first as a monograph in physics, was and remains of great importance but, amongst other things, his nervousness limited the appeal of his lectures, to put it gently. I met him personally a couple of times – he was friendly but uncommunicative. When I was later at the Statistical Laboratory in Cambridge, relations between the Director, Dr Wishart and R.A. Fisher had been at a very low ebb for 20 years and contact between the Lab and Fisher was minimal. I heard him speak on three or four occasions, interesting if often rambunctious occasions. To some, Fisher showed great generosity but not to the Statistics Lab, which was sad in view of the towering importance of his work.



Categories: Sir David Cox | 3 Comments

Diederik Stapel hired to teach “social philosophy” because students got tired of success stories… or something (rejected post)

Oh My*.

(“But I can succeed as a social philosopher”)

The following is from Retraction Watch. UPDATE: OCT 10, 2014**

Diederik Stapel, the Dutch social psychologist and admitted data fabricator — and owner of 54 retraction notices — is now teaching at a college in the town of Tilburg [i].

According to Omroep Brabant, Stapel was offered the job as a kind of adjunct at Fontys Academy for Creative Industries to teach social philosophy. The site quotes a Nick Welman explaining the rationale for hiring Stapel (per Google Translate):

“It came about because students one after another success story were told from the entertainment industry, the industry which we educate them .”

The students wanted something different.

“They wanted to also focus on careers that have failed. On people who have fallen into a black hole, acquainted with the dark side of fame and success.”

Last month, organizers of a drama festival in The Netherlands cancelled a play co-written by Stapel.

I really think Dean Bon puts the rationale most clearly of all.

…A letter from the school’s dean, Pieter Bon, adds:

We like to be entertained and the length of our lives increases. We seek new ways in which to improve our health and we constantly look for new ways to fill our free time. Fashion and looks are important to us; we prefer sustainable products and we like to play games using smart gadgets. This is why Fontys Academy for Creative Industries exists. We train people to create beautiful concepts, exciting concepts, touching concepts, concepts to improve our quality of life. We train them for an industry in which creativity is of the highest value to a product or service. We educate young people who feel at home in the (digital) world of entertainment and lifestyle, and understand that creativity can also mean business. Creativity can be marketed, it’s as simple as that.

We’re sure Prof. Stapel would agree.

[i] Fontys describes itself thusly: Fontys Academy for Creative Industries (Fontys ACI) in Tilburg has 2500 students working towards a bachelor of Business Administration (International Event, Music & Entertainment Studies and Digital Publishing Studies), a bachelor of Communication (International Event, Music & Entertainment Studies) or a bachelor of Lifestyle (International Lifestyle Studies). Fontys ACI hosts a staff of approximately one hundred (teachers plus support staff) as well as about fifty regular visiting lecturers.

 *I wonder if “social philosophy” is being construed as “extreme postmodernist social epistemology”?  

I guess the students are keen to watch that Fictionfactory Peephole.

**Turns out to have been short-lived. He also admits to sockpuppeting at Retraction Watch. Frankly I thought it was more fun to guess who “Paul” was, but they have rules.

[ii] One of my April Fool’s Day posts is turning from part fiction to fact.

Categories: Rejected Posts, Statistics | 9 Comments

Oy Faye! What are the odds of not conflating simple conditional probability and likelihood with Bayesian success stories?


Faye Flam

Congratulations to Faye Flam for finally getting her article, “The odds, continually updated,” published in the Science Times section of the New York Times after months of reworking and editing, interviewing and reinterviewing. I’m grateful, too, that one remark from me remained. Seriously, I am. A few comments: The Monty Hall example is simple probability, not statistics, and finding that fisherman who floated on his boots at best used likelihoods. I might note, too, that critiquing that ultra-silly example about ovulation and voting–a study so bad they actually had to pull it at CNN due to reader complaints[i]–scarcely required more than noticing the researchers didn’t even know the women were ovulating[ii]. Experimental design is an old area of statistics developed by frequentists; on the other hand, these ovulation researchers really believe their theory, so the posterior checks out.

The article says, Bayesian methods can “crosscheck work done with the more traditional or ‘classical’ approach.” Yes, but on traditional frequentist grounds. What many would like to know is how to cross check Bayesian methods—how do I test your beliefs? Anyway, I should stop kvetching and thank Faye and the NYT for doing the article at all[iii]. Here are some excerpts:

Statistics may not sound like the most heroic of pursuits. But if not for statisticians, a Long Island fisherman might have died in the Atlantic Ocean after falling off his boat early one morning last summer.


Categories: Bayesian/frequentist, Statistics | 47 Comments

Letter from George (Barnard)

Memory Lane (2 yrs): Sept. 30, 2012. George Barnard sent me this note on hearing of my Lakatos Prize. He was to have been at my Lakatos dinner at the LSE (March 17, 1999), to which winners are permitted to invite ~2-3 guests*, but he called me at the LSE at the last minute to say he was too ill to come to London. Instead we had a long talk on the phone the next day, which I can discuss at some point. The main topics were likelihoods and error probabilities. (I have 3 other Barnard letters, with real content, that I might dig out some time.)

*My others were Donald Gillies and Hasok Chang

Categories: Barnard, phil/history of stat | Leave a comment
