Monthly Archives: August 2012

Failing to Apply vs Violating the Likelihood Principle

In writing a new chapter on the Strong Likelihood Principle [i] the past few weeks, I noticed a passage in G. Casella and R. Berger (2002) that in turn recalled a puzzling remark noted in my Jan. 3, 2012 post. The post began:

A question arose from a Bayesian acquaintance:

“Although the Birnbaum result is of primary importance for sampling theorists, I’m still interested in it because many Bayesian statisticians think that model checking violates the (strong) likelihood principle (SLP), as if this principle is a fundamental axiom of Bayesian statistics”.

But this is puzzling for two reasons. First, if the LP does not preclude testing for assumptions (and he is right that it does not[ii]), then why not simply explain that rather than appeal to a disproof of something that actually never precluded model testing?   To take the disproof of the LP as grounds to announce: “So there! Now even Bayesians are free to test their models” would seem only to ingrain the original fallacy.

You can read the rest of the original post here.

The remark in G. Casella and R. Berger seems to me equivocal on this point: Continue reading

Categories: Likelihood Principle, Philosophy of Statistics, Statistics

Frequentist Pursuit

A couple of readers sent me notes about a recent post on (Normal Deviate)* that introduces the term “frequentist pursuit”:

“If we manipulate the data to get a posterior that mimics the frequentist answer, is this really a success for Bayesian inference? Is it really Bayesian inference at all? Similarly, if we choose a carefully constructed prior just to mimic a frequentist answer, is it really Bayesian inference? We call Bayesian inference which is carefully manipulated to force an answer with good frequentist behavior, frequentist pursuit. There is nothing wrong with it, but why bother?

If you want good frequentist properties just use the frequentist estimator.”(Robins and Wasserman)

I take it that the Bayesian response to the question (“why bother?”) is that the computations yield that magical posterior (never mind just how to interpret them).

Cox and Mayo 2010 say, about a particular example of  “frequentist envy pursuit”:

 “Reference priors yield inferences with some good frequentist properties, at least in one-dimensional problems – a feature usually called matching. … First, as is generally true in science, the fact that a theory can be made to match known successes does not redound as strongly to that theory as did the successes that emanated from first principles or basic foundations. This must be especially so where achieving the matches seems to impose swallowing violations of its initial basic theories or principles.

Even if there are some cases where good frequentist solutions are more neatly generated through Bayesian machinery, it would show only their technical value for goals that differ fundamentally from their own.” (301)

Imitation, some say, is the most sincere form of flattery. I don’t agree, but doubtless it is a good thing that we see a degree of self-imposed and/or subliminal frequentist constraints on much Bayesian work in practice. Some (many?) Bayesians suggest that this is merely a nice extra rather than necessary, forfeiting the (non-trivial) pursuit of frequentist (error statistical) foundations for Bayesian pursuits**.

*I had noticed this, but had no time to work through the thicket of the example he considers. I welcome a very simple upshot.

**At least some of them.

Categories: Philosophy of Statistics

knowledge/evidence not captured by mathematical prob.

Mayo mirror

Equivocations between informal and formal uses of “probability” (as well as “likelihood” and “confidence”) are responsible for much confusion in statistical foundations, as is remarked in a famous paper I was rereading today by Allan Birnbaum:

“It is of course common nontechnical usage to call any proposition probable or likely if it is supported by strong evidence of some kind. .. However such usage is to be avoided as misleading in this problem-area, because each of the terms probability, likelihood and confidence coefficient is given a distinct mathematical and extramathematical usage.” (1969, 139 Note 4).

For my part, I find that I never use probabilities to express degrees of evidence (either in mathematical or extramathematical uses), but I realize others might. Even so, I agree with Birnbaum “that such usage is to be avoided as misleading in” foundational discussions of evidence. We know, infer, accept, and detach from evidence, all kinds of claims without any inclination to add an additional quantity such as a degree of probability or belief arrived at via, and obeying, the formal probability calculus.

It is interesting, as a little exercise, to examine scientific descriptions of the state of knowledge in a field. A few days ago, I posted something from Weinberg on the Higgs particle. Here are some statements, with some terms emphasized:

The general features of the electroweak theory have been well tested; their validity is not what has been at stake in the recent experiments at CERN and Fermilab, and would not be seriously in doubt even if no Higgs particle had been discovered.

I see no suggestion of a formal application of Bayesian probability notions. Continue reading

Categories: philosophy of science, Philosophy of Statistics

“Did Higgs Physicists Miss an Opportunity by Not Consulting More With Statisticians?”

On August 20 I posted the start of “Discussion and Digest” by Bayesian statistician Tony O’Hagan, an overview of responses to his letter (ISBA website) on the use of p-values in analyzing the Higgs data, prompted, in turn, by a query of subjective Bayesian Dennis Lindley. I now post the final section in which he discusses his own view. I think it raises many questions of interest both as regards this case, and more generally about statistics and science. My initial July 11 post is here.

“Higgs Boson – Digest and Discussion” By Tony O’Hagan


So here are some of my own views on this.

There are good reasons for being cautious and demanding a very high standard of evidence before announcing something as momentous as H. It is acknowledged by those who use it that the 5-sigma standard is a fudge, though. They would surely be willing to make such an announcement if they were, for instance, 99.99% certain of H’s existence, as long as that 99.99% were rigorously justified. 5-sigma is used because they don’t feel able to quantify the probability of H rigorously. So they use the best statistical analysis that they know how to do, but because they also know there are numerous factors not taken into account by this analysis – the multiple testing, the likelihood of unrecognised or unquantified deficiencies in the data, experiment or statistics, and the possibility of other explanations – they ask for what on the face of it is an absurdly high level of significance from that analysis. Continue reading
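For readers who want the number behind the "5-sigma" convention discussed above, here is a minimal sketch. It assumes a one-sided test with a standard Normal test statistic; the function name is my own and nothing here is taken from O’Hagan’s digest. It converts a number of standard deviations into an upper-tail probability.

```python
import math


def one_sided_p_value(n_sigma: float) -> float:
    """Upper-tail probability P(Z >= n_sigma) for a standard Normal Z.

    Illustrative helper (hypothetical name); assumes a one-sided test.
    """
    # P(Z >= z) = 0.5 * erfc(z / sqrt(2)) for a standard Normal variable.
    return 0.5 * math.erfc(n_sigma / math.sqrt(2.0))


p5 = one_sided_p_value(5.0)
print(f"5-sigma one-sided tail probability: {p5:.3e}")
```

Under these assumptions the tail probability is a little under 3 in 10 million. Note that this is a statement about the test statistic under the null hypothesis, not the "99.99% certain of H’s existence" kind of quantity that the passage contrasts it with.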

Categories: philosophy of science, Philosophy of Statistics, Statistics

Scalar or Technicolor? S. Weinberg, “Why the Higgs?”

CERN’s Large Hadron Collider under construction, 2007

My colleague in philosophy at Va Tech, Ben Jantzen*, sent me this piece by Steven Weinberg on the Higgs. Even though it does not deal with the statistics, it manages to explain some of the general theorizing more clearly than most of the other things I’ve read. (See also my previous post.)

Why the Higgs?
August 16, 2012
Steven Weinberg

The New York Review of Books

The following is part of an introduction to James Baggott’s new book Higgs: The Invention and Discovery of the “God Particle,” which will be published in August by Oxford University Press. Baggott wrote his book anticipating the recent announcement of the discovery at CERN near Geneva—with some corroboration from Fermilab—of a new particle that seems to be the long-sought Higgs particle. Much further research on its exact identity is to come.

It is often said that what was at stake in the search for the Higgs particle was the origin of mass. True enough, but this explanation needs some sharpening.

By the 1980s we had a good comprehensive theory of all observed elementary particles and the forces (other than gravitation) that they exert on one another. One of the essential elements of this theory is a symmetry, like a family relationship, between two of these forces, the electromagnetic force and the weak nuclear force. Electromagnetism is responsible for light; the weak nuclear force allows particles inside atomic nuclei to change their identity through processes of radioactive decay. The symmetry between the two forces brings them together in a single “electroweak” structure. The general features of the electroweak theory have been well tested; their validity is not what has been at stake in the recent experiments at CERN and Fermilab, and would not be seriously in doubt even if no Higgs particle had been discovered. Continue reading

Categories: philosophy of science

Higgs Boson: Bayesian “Digest and Discussion”

Professor O’Hagan sent around (to the ISBA list) his summary of the comments he received in response to his request for information about the use of p-values in relation to the Higgs boson data. My original July 11 post including O’Hagan’s initial letter is here. His “digest” begins:

Before going further, I should say that the wording of this message, including the somewhat inflammatory nature of some parts of it, was mine; I was not quoting Dennis Lindley directly. The wording was, though, quite deliberately intended to provoke discussion. In that objective it was successful – I received more than 30 substantive comments in reply. All of these were thoughtful and I learnt a great deal from them. I promised to construct a digest of the discussion. This document is that digest and a bit more – it includes some personal reflections on the issues. Continue reading

Categories: Philosophy of Statistics, Statistics

A. Spanos: Egon Pearson’s Neglected Contributions to Statistics

Continuing with the discussion of E.S. Pearson:

Egon Pearson’s Neglected Contributions to Statistics

by Aris Spanos

    Egon Pearson (11 August 1895 – 12 June 1980), is widely known today for his contribution in recasting of Fisher’s significance testing into the Neyman-Pearson (1933) theory of hypothesis testing. Occasionally, he is also credited with contributions in promoting statistical methods in industry and in the history of modern statistics; see Bartlett (1981). What is rarely mentioned is Egon’s early pioneering work on:

(i) specification: the need to state explicitly the inductive premises of one’s inferences,

(ii) robustness: evaluating the ‘sensitivity’ of inferential procedures to departures from the Normality assumption, as well as

(iii) Mis-Specification (M-S) testing: probing for potential departures from the Normality  assumption. Continue reading

Categories: Philosophy of Statistics, Statistics

E.S. Pearson’s Statistical Philosophy

E.S. Pearson on the gate,
D. Mayo sketch

Egon Sharpe (E.S.) Pearson’s birthday was August 11.  This slightly belated birthday discussion is directly connected to the question of the uses to which frequentist methods may be put in inquiry.  Are they limited to supplying procedures which will not err too frequently in some vast long run? Or are these long run results of crucial importance for understanding and learning about the underlying causes in the case at hand?   I say no to the former and yes to the latter.  This was also the view of Egon Pearson (of Neyman and Pearson).

(i) Cases of Type A and Type B

“How far then, can one go in giving precision to a philosophy of statistical inference?” (Pearson 1947, 172)

Pearson considers the rationale that might be given to N-P tests in two types of cases, A and B:

“(A) At one extreme we have the case where repeated decisions must be made on results obtained from some routine procedure…

(B) At the other is the situation where statistical tools are applied to an isolated investigation of considerable importance…?” (ibid., 170)

In cases of type A, long-run results are clearly of interest, while in cases of type B, repetition is impossible and may be irrelevant:

“In other and, no doubt, more numerous cases there is no repetition of the same type of trial or experiment, but all the same we can and many of us do use the same test rules to guide our decision, following the analysis of an isolated set of numerical data. Why do we do this? What are the springs of decision? Is it because the formulation of the case in terms of hypothetical repetition helps to that clarity of view needed for sound judgment? Continue reading

Categories: Philosophy of Statistics, Statistics

Good Scientist Badge of Approval?

In an attempt to fix the problem of “unreal” results in science some have started a “reproducibility initiative”. Think of the incentive for being explicit about how the results were obtained the first time…. But would researchers really pay to have their potential errors unearthed in this way? Even for a “good scientist” badge of approval?

August 14, 2012

Fixing Science’s Problem of ‘Unreal’ Results: “Good Scientist: You Get a Badge!”

Carl Zimmer, Slate

As a young biologist, Elizabeth Iorns did what all young biologists do: She looked around for something interesting to investigate. Having earned a Ph.D. in cancer biology in 2007, she was intrigued by a paper that appeared the following year in Nature. Biologists at the University of California-Berkeley linked a gene called SATB1 to cancer. They found that it becomes unusually active in cancer cells and that switching it on in ordinary cells made them cancerous. The flipside proved true, too: Shutting down SATB1 in cancer cells returned them to normal. The results raised the exciting possibility that SATB1 could open up a cure for cancer. So Iorns decided to build on the research.

There was just one problem. As her first step, Iorns tried to replicate the original study. She couldn’t. Boosting SATB1 didn’t make cells cancerous, and shutting it down didn’t make the cancer cells normal again.

For some years now, scientists have gotten increasingly worried about replication failures. In one recent example, NASA made a headline-grabbing announcement in 2010 that scientists had found bacteria that could live on arsenic—a finding that would require biology textbooks to be rewritten. At the time, many experts condemned the paper as a poor piece of science that shouldn’t have been published. This July, two teams of scientists reported that they couldn’t replicate the results. Continue reading

Categories: philosophy of science, Philosophy of Statistics

U-Phil: (concluding the deconstruction) Wasserman / Mayo

It is traditional to end the U-Phil deconstruction discussion with the author’s remarks on the deconstruction itself.  I take this from Wasserman’s initial comment on 7/28/12, and my brief reply. I especially want to highlight the question of goals that arises.


I thank Deborah Mayo for deconstructing me and Al Franken. (And for the record, I couldn’t be further from Franken politically; I just liked his joke.)

I have never been deconstructed before. I feel a bit like Humpty Dumpty. Anyway, I think I agree with everything Deborah wrote. I’ll just clarify two points.

First, my main point was just that the cutting edge of statistics today is dealing with complex, high-dimensional data. My essay was an invitation to Philosophers to turn their analytical skills towards the problems that arise in these modern statistical problems.

Deborah wonders whether these are technical rather than foundational issues. I don’t know. When physicists went from studying medium sized, slow-moving objects to studying the very small, the very fast and the very massive, they found a plethora of interesting questions, both technical and foundational. Perhaps inference for high-dimensional, complex data can also serve as a venue for both technical and foundational questions.

Second, I downplayed the Bayes-Frequentist debate perhaps more than I should have. Indeed, this debate still persists. But I also feel that only a small subset of statisticians care about the debate (because they do what they were taught to do, without questioning it) and those that do care will never be swayed by debate. The way I see it is that there are basically two goals:

  • Goal 1: Find ways to quantify your subjective degrees of belief.
  • Goal 2: Find procedures with good frequency properties. Continue reading
Categories: Statistics

U-PHIL: Wasserman Replies to Spanos and Hennig

Wasserman on Spanos and Hennig on  “Low Assumptions, High Dimensions” (2011)

(originating U-PHIL : “Deconstructing Larry Wasserman” by Mayo )


Thanks to Aris and others for comments.

Response to Aris Spanos:

1. You don’t prefer methods based on weak assumptions? Really? I suspect Aris is trying to be provocative. Yes such inferences can be less precise. Good. Accuracy is an illusion if it comes from assumptions, not from data.

2. I do not think I was promoting inferences based on “asymptotic grounds.” If I did, that was not my intent. I want finite sample, distribution free methods. As an example, consider the usual finite sample (order statistics based) confidence interval for the median. No regularity assumptions, no asymptotics, no approximations. What is there to object to?

3. Indeed, I do have to make some assumptions. For simplicity, and because it is often reasonable, I assumed iid in the paper (as I will here). Other than that, where am I making any untestable assumptions in the example of the median?

4. I gave a very terse and incomplete summary of Davies’ work. I urge readers to look at Davies’ papers; my summary does not do the work justice. He certainly did not advocate eyeballing the data. Continue reading
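The finite-sample, order-statistics-based interval for the median mentioned in point 2 can be sketched as follows. This is my own minimal illustration, not code from Wasserman: it assumes an iid sample from a continuous distribution, and the function name and return format are hypothetical. The interval [X_(k), X_(n-k+1)] covers the median with probability 1 - 2·P(B ≤ k-1), where B is Binomial(n, 1/2), so the sketch picks the largest k whose coverage is still at least 1 - alpha.

```python
import math


def median_confidence_interval(sample, alpha=0.05):
    """Distribution-free confidence interval for the median.

    Assumes iid draws from a continuous distribution (illustrative sketch).
    Returns (lower, upper, exact_coverage).
    """
    xs = sorted(sample)
    n = len(xs)

    def lower_tail(k):
        # P(B <= k - 1) for B ~ Binomial(n, 1/2)
        return sum(math.comb(n, j) for j in range(k)) / 2**n

    # Largest k such that [X_(k), X_(n-k+1)] still has coverage >= 1 - alpha.
    k = 0
    while k + 1 <= n // 2 and 2 * lower_tail(k + 1) <= alpha:
        k += 1
    if k == 0:
        raise ValueError("sample too small for the requested confidence level")

    coverage = 1 - 2 * lower_tail(k)
    # xs is 0-indexed: X_(k) is xs[k - 1], X_(n-k+1) is xs[n - k].
    return xs[k - 1], xs[n - k], coverage


low, high, cov = median_confidence_interval(range(1, 11), alpha=0.05)
print(low, high, cov)
```

Because the coverage comes from a discrete Binomial distribution, the achieved level is usually somewhat above the nominal 1 - alpha rather than equal to it; no asymptotics or approximations are involved.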

Categories: Philosophy of Statistics, Statistics, U-Phil

E.S. Pearson Birthday

Egon Pearson on a Gate (by D. Mayo)

Today is Egon Pearson’s birthday, but I will postpone some discussion of his work for a few days. He is, as Erich Lehmann noted in his review of EGEK (1996)[i]*, “the hero of Mayo’s story” because one may find throughout his work, if only in side discussions, hints, and examples, the key elements for an “inferential” or “evidential” interpretation of Neyman-Pearson theory of statistics.  Pearson and Pearson statistics (both Egon, not Karl) would have looked very different from Neyman and Pearson statistics, I suspect.[i]

[i] Mayo (1996), Error and the Growth of Experimental Knowledge.

*If you have items relating to E.S. Pearson you think might be relevant for this blog, please send them to: until the end of August.

Categories: Statistics

U-PHIL: Hennig and Gelman on Wasserman (2011)

Two further contributions in relation to

“Low Assumptions, High Dimensions” (2011)

Please also see: “Deconstructing Larry Wasserman” by Mayo, and Comments by Spanos

Christian Hennig:  Some comments on Larry Wasserman, “Low Assumptions, High Dimensions”

I enjoyed reading this stimulating paper. These are very important issues indeed. I’ll comment on both main concepts in the text.

1) Low Assumptions. I think that the term “assumption” is routinely misused and misunderstood in statistics. In Wasserman’s paper I can’t see such misuse explicitly, but I think that the “message” of the paper may be easily misunderstood because Wasserman doesn’t do much to stop people from this kind of misunderstanding.

Here is what I mean. The arithmetic mean can be derived as optimal estimator under an i.i.d. Gaussian model, which is often interpreted as “model assumption” behind it. However, we don’t really need the Gaussian distribution to be true for the mean to do a good job. Sometimes the mean will do a bad job in a non-Gaussian situation (for example in presence of gross outliers), but sometimes not. The median has nice robustness properties and is seen as admissible for ordinal data. It is therefore usually associated with “weaker assumptions”. However, the median may be worse than the mean in a situation where the Gaussian “assumption” of the mean is grossly violated. At UCL we ask students on a -2/-1/0/1/2 Likert scale for their general opinion about our courses. The distributions that we get here are strongly discrete and the scale is usually interpreted as of ordinal type. Still, for ranking courses, the median is fairly useless (pretty much all courses end up with a median of 0 or 1); whereas, the arithmetic mean can still detect statistically significant meaningful differences between courses.

Why? Because it’s not only the “official” model assumptions that matter but also whether a statistic uses all the data in an appropriate manner for the given application. Here it’s fatal that the median ignores all differences among observations north and south of it. Continue reading
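A toy illustration of the Likert-scale point above, using invented ratings (these are not UCL data, and the course names are hypothetical): two courses can share the same median on a -2 to 2 scale while their arithmetic means differ. The sketch only shows the difference in the summaries; it makes no claim about statistical significance.

```python
import statistics

# Hypothetical ratings on a -2/-1/0/1/2 scale, invented for illustration.
course_a = [-1, 0, 0, 1, 1, 1, 1, 2, 2]
course_b = [-2, -1, 0, 1, 1, 1, 1, 1, 1]

median_a = statistics.median(course_a)
median_b = statistics.median(course_b)
mean_a = statistics.mean(course_a)
mean_b = statistics.mean(course_b)

# The medians coincide, so they cannot rank the two courses;
# the means use every rating and separate them.
print("medians:", median_a, median_b)
print("means:  ", round(mean_a, 3), round(mean_b, 3))
```

In this made-up example the median ignores how ratings are spread on either side of it, which is the feature the comment calls fatal for ranking courses.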

Categories: Philosophy of Statistics, Statistics, U-Phil

U-PHIL: Aris Spanos on Larry Wasserman

Our first outgrowth of “Deconstructing Larry Wasserman”. 

Aris Spanos – Comments on:

“Low Assumptions, High Dimensions” (2011)

by Larry Wasserman*

I’m happy to play devil’s advocate in commenting on Larry’s very interesting and provocative (in a good way) paper on ‘how recent developments in statistical modeling and inference have [a] changed the intended scope of data analysis, and [b] raised new foundational issues that rendered the ‘older’ foundational problems more or less irrelevant’.

The new intended scope, ‘low assumptions, high dimensions’, is delimited by three characteristics:

“1. The number of parameters is larger than the number of data points.

2. Data can be numbers, images, text, video, manifolds, geometric objects, etc.

3. The model is always wrong. We use models, and they lead to useful insights but the parameters in the model are not meaningful.” (p. 1)

In the discussion that follows I focus almost exclusively on the ‘low assumptions’ component of the new paradigm. The discussion by David F. Hendry (2011), “Empirical Economic Model Discovery and Theory Evaluation,” RMM, 2: 115-145,  is particularly relevant to some of the issues raised by the ‘high dimensions’ component in a way that complements the discussion that follows.

My immediate reaction to the demarcation based on 1-3 is that the new intended scope, although interesting in itself, excludes the overwhelming majority of scientific fields where restriction 3 seems unduly limiting. In my own field of economics the substantive information comes primarily in the form of substantively specified mechanisms (structural models), accompanied with theory-restricted and substantively meaningful parameters.

In addition, I consider the assertion “the model is always wrong” an unhelpful truism when ‘wrong’ is used in the sense that “the model is not an exact picture of the ‘reality’ it aims to capture”. Worse, if ‘wrong’ refers to ‘the data in question could not have been generated by the assumed model’, then any inference based on such a model will be dubious at best! Continue reading

Categories: Philosophy of Statistics, Statistics, U-Phil

Bad news bears: Bayesian rejoinder

This continues yesterday’s post: I checked out the “xtranormal” website. Turns out there are other figures aside from the bears that one may hire out, but they pronounce “Bayesian” as an unrecognizable, foreign-sounding word with around five syllables. Anyway, before taking the plunge, here is my first attempt, just off the top of my head. Please send corrections and additions.

Bear #1: Do you have the results of the study?

Bear #2: Yes. The good news is there is a .996 probability of a positive difference in the main comparison.

Bear #1: Great. So I can be well assured that there is just a .004 probability that such positive results would occur if they were merely due to chance.

Bear #2: Not really, that would be an incorrect interpretation.

Bear #1: Oh. I see. Then you must mean 99.6% of the time a smaller difference would have been observed if in fact the null hypothesis of “no effect” was true.

Bear #2: No, that would also be an incorrect interpretation.

Bear #1: Well then you must be saying it is rational to believe to degree .996 that there is a real difference?

Bear #2: It depends. That might be so if the prior probability distribution was a proper probabilistic distribution representing rational beliefs in the different possible parameter values independent of the data.

Bear #1: But I was assured that this would be a nonsubjective Bayesian analysis.

Bear #2: Yes, the prior would at most have had the more important parameters elicited from experts in the field, the remainder being a product of one of the default or conjugate priors.

Bear #1: Well which one was used in this study? Continue reading

Categories: Statistics

A “Bayesian Bear” rejoinder practically writes itself…

These stilted bear figures and their voices are sufficiently obnoxious in their own right, even without the tedious lampooning of p-values and the feigned horror at learning they should not be reported as posterior probabilities. Coincidentally, I have been sent several different p-value U-Tube clips in the past two weeks, rehearsing essentially the same interpretive issues, but this one (“what the p-value”*) was created by some freebie outfit that will apparently set their irritating cartoon bear voices to your very own dialogue (I don’t know the website or outfit).

The presumption is that somehow there would be no questions or confusion of interpretation were the output in the form of a posterior probability. The problem of indicating the extent of discrepancies that are/are not warranted by a given p-value is genuine but easy enough to solve**. What I never understand is why it is presupposed that the most natural and unequivocal way to interpret and communicate evidence (in this case, leading to low p-values) is by means of a (posterior) probability assignment, when it seems clear that the more relevant question the testy-voiced (“just wait a tick”) bear would put to the know-it-all bear would be: how often would this method erroneously declare a genuine discrepancy? A corresponding “Bayesian bear” video practically writes itself, but I’ll let you watch this first. Share any narrative lines that come to mind.

*Reference: Blume, J. and J. F. Peipert (2003). “What your statistician never told you about P-values.” J Am Assoc Gynecol Laparosc 10(4): 439-444.

**See for example, Mayo & Spanos (2011) ERROR STATISTICS

Categories: Statistics

Stephen Senn: Fooling the Patient: an Unethical Use of Placebo? (Phil/Stat/Med)

Senn in China

Stephen Senn
Competence Centre for Methodology and Statistics
CRP Santé
Strassen, Luxembourg

I think the placebo gets a bad press with ethicists. Many do not seem to understand that the only purpose of a placebo as a control in a randomised clinical trial is to permit the trial to be run as double-blind. A common error is to assume that the giving of a placebo implies the withholding of a known effective treatment. In fact many placebo controlled trials are ‘add-on’ trials in which all patients get proven (partially) effective treatment. We can refer to such treatment as standard common background therapy.  In addition, one group gets an unproven experimental treatment and the other a placebo. Used in this way in a randomised clinical trial, the placebo can be a very useful way to increase the precision of our inferences.

A control group helps eliminate many biases: trend effects affecting the patients, local variations in illness, trend effects in assays and regression to the mean. But such biases could be eliminated by having a group given nothing (apart from the standard common background therapy). Only a placebo, however, can allow patients and physicians to be uncertain whether the experimental treatment is being given or not. And ‘blinding’ or ‘masking’ can play a valuable role in eliminating that bias which is due to either expectation of efficacy or fear of side-effects.

However, there is one use of placebo I consider unethical. In many clinical trials a so-called ‘placebo run-in’ is used. That is to say, there is a period after patients are enrolled in the trial and before they are randomised to one of the treatment groups when all of the patients are given a placebo.  The reasons can be to stabilise the patients or to screen out those who are poor compliers before the trial proper begins. Indeed, the FDA encourages this use of placebo and, for example, in a 2008 guideline on developing drugs for Diabetes advises:  ‘In addition, placebo run-in periods in phase 3 studies can help screen out noncompliant subjects’. Continue reading

Categories: Statistics
