Today is Lucien Le Cam’s birthday. He was an error statistician whose remarks in an article, “A Note on Metastatistics,” in a collection on foundations of statistics (Le Cam 1977)* had some influence on me. A statistician at Berkeley, Le Cam was a co-editor with Neyman of the Berkeley Symposia volumes. I hadn’t mentioned him on this blog before, so here are some snippets from EGEK (Mayo, 1996, 337-8; 350-1) that begin with a passage from Le Cam (1977):

“One of the claims [of the Bayesian approach] is that the experiment matters little, what matters is the likelihood function after experimentation…. It tends to undo what classical statisticians have been preaching for many years: think about your experiment, design it as best you can to answer specific questions, take all sorts of precautions against selection bias and your subconscious prejudices”. (Le Cam 1977, 158)

Why does embracing the Bayesian position tend to undo what classical statisticians have been preaching? Because Bayesian and classical statisticians view the task of statistical inference very differently.

In [chapter 3, Mayo 1996] I contrasted these two conceptions of statistical inference by distinguishing evidential-relationship or E-R approaches from testing approaches, … .

The E-R view is modeled on deductive logic, only with probabilities. In the E-R view, the task of a theory of statistics is to say, for given evidence and hypotheses, how well the evidence confirms or supports hypotheses (whether absolutely or comparatively). There is, I suppose, a certain confidence and cleanness to this conception that is absent from the error-statistician’s view of things. Error statisticians eschew grand and unified schemes for relating their beliefs, preferring a hodgepodge of methods that are truly ampliative. Error statisticians appeal to statistical tools as protection from the many ways they know they can be misled by data as well as by their own beliefs and desires. The value of statistical tools for them is to develop strategies that capitalize on their knowledge of mistakes: strategies for collecting data, for efficiently checking an assortment of errors, and for communicating results in a form that promotes their extension by others.

Given the difference in aims, it is not surprising that information relevant to the Bayesian task is very different from that relevant to the task of the error statistician. In this section I want to sharpen and make more rigorous what I have already said about this distinction.

…. the secret to solving a number of problems about evidence, I hold, lies in utilizing—formally or informally—the error probabilities of the procedures generating the evidence. It was the appeal to severity (an error probability), for example, that allowed distinguishing among the well-testedness of hypotheses that fit the data equally well… .

Then, a few pages later in a section titled “*Bayesian Freedom, Bayesian Magic*” (350-1):

A big selling point for adopting the LP (strong likelihood principle), and with it the irrelevance of stopping rules, is that it frees us to do things that are sinful and forbidden to an error statistician. “This irrelevance of stopping rules to statistical inference restores a simplicity and freedom to experimental design that had been lost by classical emphasis on significance levels (in the sense of Neyman and Pearson). . . . Many experimenters would like to feel free to collect data until they have either conclusively proved their point, conclusively disproved it, or run out of time, money or patience … Classical statisticians … have frowned on [this]”. (Edwards, Lindman, and Savage 1963, 239)

Breaking loose from the grip imposed by error probabilistic requirements returns to us an appealing freedom. [1]

Le Cam, … hits the nail on the head:

“It is characteristic of [Bayesian approaches] [2] . . . that they … tend to treat experiments and fortuitous observations alike. In fact, the main reason for their periodic return to fashion seems to be that they claim to hold the magic which permits [us] to draw conclusions from whatever data and whatever features one happens to notice”. (Le Cam 1977, 145)

In contrast, the error probability assurances go out the window if you are allowed to change the experiment as you go along. Repeated tests of significance (or sequential trials) are permitted, are even desirable for the error statistician; but a penalty must be paid for perseverance—for optional stopping. Before-trial planning stipulates how to select a small enough significance level to be on the lookout for at each trial so that the overall significance level is still low. …. Wearing our error probability glasses—glasses that compel us to see how certain procedures alter error probability characteristics of tests—we are forced to say, with Armitage, that “Thou shalt be misled if thou dost not know that” the data resulted from the try and try again stopping rule. To avoid having a high probability of following false leads, the error statistician must scrupulously follow a specified experimental plan. But that is because we hold that error probabilities of the procedure alter what the data are saying—whereas Bayesians do not. The Bayesian is permitted the luxury of optional stopping and has nothing to worry about. The Bayesians hold the magic.

Or is it voodoo statistics?
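The penalty for "try and try again" in the passage above can be made concrete with a small simulation. What follows is only a sketch under assumptions that are not in the text: a two-sided z-test of a zero mean with known unit variance, a nominal 0.05 cutoff (`z_crit=1.96`), a look after every observation up to a hypothetical maximum of 100, and data generated with the null hypothesis true. All function names are illustrative.

```python
import math
import random


def z_stat(total, n):
    """z statistic for H0: mean = 0, known sigma = 1, after n observations."""
    return total / math.sqrt(n)


def rejects_with_optional_stopping(rng, max_n, z_crit=1.96):
    """Look after every observation; stop and reject at the first |z| >= z_crit."""
    total = 0.0
    for n in range(1, max_n + 1):
        total += rng.gauss(0.0, 1.0)  # data generated under H0
        if abs(z_stat(total, n)) >= z_crit:
            return True
    return False


def rejects_with_fixed_n(rng, n, z_crit=1.96):
    """Single test at a sample size fixed in advance."""
    total = sum(rng.gauss(0.0, 1.0) for _ in range(n))
    return abs(z_stat(total, n)) >= z_crit


def type_i_error_rates(trials=2000, max_n=100, seed=0):
    """Estimate how often a true null is rejected under each sampling plan."""
    rng = random.Random(seed)
    optional = sum(
        rejects_with_optional_stopping(rng, max_n) for _ in range(trials)
    ) / trials
    fixed = sum(rejects_with_fixed_n(rng, max_n) for _ in range(trials)) / trials
    return optional, fixed


if __name__ == "__main__":
    optional, fixed = type_i_error_rates()
    print(f"optional stopping: {optional:.3f}   fixed n: {fixed:.3f}")
```

Under these assumptions the fixed-sample test should reject a true null at close to the nominal rate, while the try-and-try-again plan should reject it far more often, even though each individual look uses the same cutoff. The simulation speaks only to the error probabilities of the two procedures; it does not by itself settle whether those probabilities ought to matter to inference, which is the point in dispute.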

When I sent him a note, saying his work had inspired me, he modestly responded that he doubted he could have had all that much of an impact.

_____________

*I had forgotten that this *Synthese* (1977) volume on foundations of probability and statistics is the one dedicated to the memory of Allan Birnbaum after his suicide: “By publishing this special issue we wish to pay homage to professor Birnbaum’s penetrating and stimulating work on the foundations of statistics” (Editorial Introduction). In fact, I somehow had misremembered it as being in a Harper and Hooker volume from 1976. The *Synthese* volume contains papers by Giere, Birnbaum, Lindley, Pratt, Smith, Kyburg, Neyman, Le Cam, and Kiefer.

REFERENCES:

*Journal of the Royal Statistical Society (B)* 23:1-37.

_______(1962). Contribution to discussion in *The foundations of statistical inference*, edited by L. Savage. London: Methuen.

_______(1975). *Sequential Medical Trials*. 2nd ed. New York: John Wiley & Sons.

Edwards, W., H. Lindman & L. Savage (1963) Bayesian statistical inference for psychological research. *Psychological Review* 70: 193-242.

Le Cam, L. (1974). J. Neyman: on the occasion of his 80th birthday. *Annals of Statistics*, Vol. 2, No. 3 , pp. vii-xiii, (with E.L. Lehmann).

Le Cam, L. (1977). A note on metastatistics or “An essay toward stating a problem in the doctrine of chances.” *Synthese* 36: 133-60.

Le Cam, L. (1982). A remark on empirical measures in *Festschrift in the honor of E. Lehmann.* P. Bickel, K. Doksum & J. L. Hodges, Jr. eds., Wadsworth pp. 305-327.

Le Cam, L. (1986). The central limit theorem around 1935. *Statistical Science*, Vol. 1, No. 1, pp. 78-96.

Le Cam, L. (1988) Discussion of “The Likelihood Principle,” by J. O. Berger and R. L. Wolpert. IMS Lecture Notes Monogr. Ser. 6 182–185. IMS, Hayward, CA

Le Cam, L. (1996) Comparison of experiments: A short review. In *Statistics, Probability and Game Theory. Papers in Honor of David Blackwell* 127–138. IMS, Hayward, CA.

Le Cam, L., J. Neyman and E. L. Scott (Eds). (1973). *Proceedings of the Sixth Berkeley Symposium on Mathematical Statistics and Probability*, Vol. 1: *Theory of Statistics*, Vol. 2: *Probability Theory*, Vol. 3: *Probability Theory*. Univ. of Calif. Press, Berkeley and Los Angeles.

Mayo, D. (1996). [EGEK] *Error and the Growth of Experimental Knowledge.* Chicago: University of Chicago Press. (Chapter 10; Chapter 3)

Neyman, J. and L. Le Cam (Eds). (1967). *Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability*, Vol. I: *Statistics*, Vol. II: *Probability* Part I & Part II. Univ. of Calif. Press, Berkeley and Los Angeles.

[1] For some links on optional stopping on this blog: Highly probably vs highly probed: Bayesian/error statistical differences.; Who is allowed to cheat? I.J. Good and that after dinner comedy hour….; New Summary; Mayo: (section 7) “StatSci and PhilSci: part 2”; After dinner Bayesian comedy hour….; Search for more, if interested.

[2] Le Cam is alluding mostly to Savage, and (what he called) the “neo-Bayesian” accounts.

Wow, very interesting post! I am glad you are utilizing more of the birth/death days I compiled earlier.

Nicole: Yes, but there was no notification of this one arriving–I happened to remember it. Thanks.

‘“It is characteristic of [Bayesian approaches] . . . that they … tend to treat experiments and fortuitous observations alike. In fact, the main reason for their periodic return to fashion seems to be that they claim to hold the magic which permits [us] to draw conclusions from whatever data and whatever features one happens to notice”. (LeCam 1977, 145)’

I regard this as the most cutting of these criticisms of Bayesian statistics. Gelman addresses it fully in Chapter 6 of Badass.

Incidentally, Le Cam was awesome — he insisted on getting the strongest possible theorems from the weakest possible assumptions:

“It is possible to make somewhat restrictive assumptions… to save the theorem…but Le Cam was not willing to do so.” van der Vaart. The Statistical Work of Lucien Le Cam. Ann. Stat. 2002, 30(3), 631-82.

Corey: This cannot be dismissed, nor can I see the Bayesian rationale for wanting to: it is offered as the core asset of the account that the sample space/sampling plan doesn’t matter once the data are given.

Have you read the chapter I referred to? I can’t really explain all of it in one go. Short version: sampling plans, in general, do matter. Gelman gives conditions under which they can be neglected in Bayesian analysis. The usual examples, e.g., optional stopping, satisfy these conditions.

Here’s more from Le Cam’s quote: “One of the claims is that the experiment matters little, what matters is the likelihood function after experimentation. Whether this is true, false, unacceptable or inspiring, it tends to undo what classical statisticians have been preaching for many years: think about your experiment, design it as best you can to answer specific questions, take all sorts of precautions against selection bias and your subconscious prejudices. It is only at the design stage that the statistician can help you.

Another claim is the very curious one that if one follows the neo-Bayesian theory strictly one would not randomize experiments….However, in this particular case the injunction against randomization is a typical product of a theory which ignores differences between experiments and experiences and refuses to admit that there is a difference between events which are made equiprobable by appropriate mechanisms and events which are equiprobable by virtue of ignorance. …

In spite of this the neo-Bayesian theory places randomization on some kind of limbo, and thus attempts to distract from the classical preaching that double blind randomized experiments are the only ones really convincing.

There are many other curious statements concerning confidence intervals, levels of significance, power, and so forth. These statements are only confusing to an otherwise abused public”. (Le Cam 1977, 158)

“An otherwise abused public” indeed.

I really like “it is only at the design stage that the statistician can help you.” I have always viewed design as far more important than analysis. Which, I suppose, is why I truly feel exiled in this new world of “big data (no design)”.

But note how it’s coming back to haunt them*, if they care to understand/improve processes, or make predictions more challenging than what groups A, B might buy in the next couple of days….

*e.g., the decade lost in genomics (according to Stan Young and others). Welcome to rediscovering blindness, randomization, model validation..

Please, don’t cite Stan Y as an authority on statistical genomics. This is laughable. Instead, go ask Peter Donnelly, Rafa Irizarry, John Storey, etc etc what they think, or do. The field has been and remains rich in smart people, doing good work, and making progress.

OG: I wasn’t citing him as an authority, only intending to note the origin of that point on this blog.

(http://errorstatistics.com/2013/06/19/stanley-young-better-p-values-through-randomization-in-microarrays/)

The link within that blog is:

http://blog.goldenhelix.com/?p=322

Since then I have heard this corroborated by people who work in the field. So are you saying it isn’t true? or just pointing out that S. Young isn’t an authority in statistical genetics? Thanks.

Chapter 6 of Badass is the rebuttal again. Gelman gives the mathematical framework that Bayesians can use to model, analyze sensitivity to, and adjust for selection bias.

Regarding randomization, the short version is: randomization makes the design ignorable; far from being irrelevant for Bayesians, it is one of the routes to making the sampling plan irrelevant (in a Bayesian sense, i.e., probabilistically independent of) to the inference of interest. It also serves to make the analysis robust to modelling assumptions.

Le Cam is not wrong in that the 1970s “neo-Bayesians” did sometimes offer these sorts of arguments — they had failed to think the math through. Rubin, Gelman’s mentor, did, and his insights gave the lie to the claims Le Cam found so objectionable.

It is true Le Cam was speaking here of Savage-style Bayesians. Other styles of Bayesians are discussed elsewhere. On your particular so-called “rebuttal”:

Traveling, and don’t have Gelman with me, but here are a couple of points: (1) Gelman-Bayes is his own brand that explicitly rejects Bayesian inductive inference, insists that “A Bayesian Wants Everybody Else to Be a Non-Bayesian” http://errorstatistics.com/2012/10/05/deconstructing-gelman-part-1-a-bayesian-wants-everybody-else-to-be-a-non-bayesian/, and claims to endorse error statistics (e.g., Gelman and Shalizi).

(2) Most importantly, violating the likelihood principle, denying Bayesian updating, and whatever else some Bayesians (wanting to be all things to all people) are prepared to do/say—it doesn’t follow that denying a standard Bayesian principle yields the rationale that leads error statisticians to deny it. For example, default Bayesians “technically” (as they put it) deny the likelihood principle, but it doesn’t follow that their procedures evaluate inferences according to the reasoning used by error statisticians. It’s “agreement on numbers” again (readers can search the blog).

When I first heard Bernardo declare that he’d make optional stopping matter in order to better match the frequentist computation (thereby disagreeing with J. Berger, with whom I believe he’s writing a text), I held out the hope that the consequence would be a consilience with the deeper error statistical rationale (and so invited him to “Stat Sci meets Phil Sci” in 2010, hoping to find out). But it doesn’t turn out to be true at all.

That is, violating X might be a necessary consequence of caring about goals such as the stringency of tests, but it doesn’t follow that violating X is sufficient for attaining those goals.

(error statistics -> ~X)

But it’s not the case that

(~X -> error statistical goals)

for various principles X, e.g., the likelihood principle, the use of subjective priors in parameters.

(3) Finally, Le Cam’s main points here still hold: Bayesians take reasoning about events as holding for reasoning about hypotheses, theories and all kinds of propositions. And “when neo-Bayesians state that a certain event A has probability one-half, this may mean either that he did not bother to think about it, or that he has no information on the subject, or that whether A occurs or not will be decided by the toss of a fair coin. The number 1/2 itself does not contain any information about the process by which it was obtained.” (156).

Dang it, I referenced the wrong chapter. The chapter I meant is Chapter 8, not Chapter 6.

Regarding your points:

(1) Gelman’s philosophy informs Chapter 6 (on model checking and what Spanos would call misspecification) but not Chapter 8, which is straight-up Bayesian modeling of a type acceptable to Bayesians of almost any philosophical persuasion.

(2) The framework in Chapter 8 is a modeling framework; it’s not committed to any particular prior. Those default Bayesian priors that violate the likelihood principle by their dependence on the sampling distribution and/or “frequentist pursuit” are compatible with but are not a part of the framework. That’s part and parcel of the “acceptable to Bayesians of almost any persuasion” claim made in (1).

Regarding “error statistics -> ~X” and what you label as Le Cam’s main point: that doesn’t seem to have much to do with my claim that Le Cam was right to criticize the neo-Bayesians on sampling plans, but Rubin got this right and Gelman wrote it up.

Perhaps it may be clarifying to state explicitly that Gelman’s framework obeys the likelihood principle. The neo-Bayesians thought that the likelihood principle meant all sampling plans were ignorable, and they were wrong.

The likelihood principle was always relative to a given model. Using outcomes other than the one observed in model checking is not a violation of the LP. We’ve discussed this (the blog is searchable). The inference (in which the LP arises) is always inference within a model, in fact parametric inference within a statistical model. The Bayesians didn’t get this wrong; it follows deductively from inference by way of Bayes’ theorem (and also for pure likelihoodists). If the likelihoods are proportional, the proportionality constant drops out.
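The claim that the proportionality constant drops out can be written as a one-line derivation. The notation here is assumed for illustration and does not appear in the thread: $\pi(\theta)$ is a prior density, $L_1$ and $L_2$ are the likelihood functions from outcomes $x_1$ and $x_2$ of two experiments about the same parameter $\theta$, and $c > 0$ is a constant that does not depend on $\theta$.

```latex
% Assumption: L_1(\theta) = c \, L_2(\theta) for all \theta, with c > 0 free of \theta.
\[
p(\theta \mid x_1)
  = \frac{L_1(\theta)\,\pi(\theta)}{\int L_1(\theta')\,\pi(\theta')\,d\theta'}
  = \frac{c\,L_2(\theta)\,\pi(\theta)}{c\int L_2(\theta')\,\pi(\theta')\,d\theta'}
  = \frac{L_2(\theta)\,\pi(\theta)}{\int L_2(\theta')\,\pi(\theta')\,d\theta'}
  = p(\theta \mid x_2).
\]
```

On this sketch, anything about the sampling plan that enters the likelihood only through $c$ leaves the posterior unchanged, provided the same prior and the same model are used for both experiments.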

The only interesting violations of the likelihood principle are those given the model. A rationale must be supplied, and the error statistician of course has one (the importance of error probabilities of a procedure for evaluating any inference from the procedure), but Bayesianism is premised on rejecting and not wanting such a principle. So it makes no sense, really, to say the Bayesians got this wrong, and others came and corrected it.

I don’t mean to say that the neo-Bayesians got the likelihood principle wrong. I’m saying they thought the LP entails the magic that Le Cam rightly derides, and in *this* they were wrong. They made a hidden assumption (ignorability) which, together with the LP, does make Bayesian inferences independent of sampling plans. Rubin developed a modeling framework in which ignorability need not hold.

“So it makes no sense, really, to say the Bayesians got this wrong, and others came and corrected it.”

No, I deny that — it *does* make sense. Le Cam says that Bayesian approaches tend to treat experiments and fortuitous observations alike — true of the neo-Bayesians, not true of Gelman. Le Cam also says we Bayesians claim to hold the magic which permits [us] to draw conclusions from whatever data and whatever features one happens to notice. Again, a fair criticism of the neo-Bayesians, but not of Gelman.

I think randomisation belongs to the best practice of statistics. It has roles in helping to form consensus, simplifying the problem being studied and aiding with computational problems.

In contrast Bayesian statistics is the theory of statistics. It assumes no limits on articulation or computation, it refers to the beliefs of a single individual, and it allows the synthesis of information from arbitrary sources. There is no role in such a system for randomisation.

I think some of the most interesting frontiers in the foundations of statistics are attempts to include imperfect articulation or finite computational resources or consensus into the theory. I don’t think the true meaning of randomisation in a Bayesian setting (or indeed approximation be it Monte Carlo or otherwise) can be solved until these problems are solved.

I really don’t think these problems have been solved satisfactorily to date.

Best practice statistics combined usually with frequentist methods but more importantly experimental protocols (e.g. double blind randomised trials) provides lines in the sand for publications in journals, medical trials and countless other tasks.

The Bayesian is usually in the position of being able to rightly point out that these procedures are quite arbitrary, but if they attempt to improve on them they are faced with the need to encode very detailed and intricate prior specifications and value judgements in order to proceed, or otherwise accept and use heuristics from best practice and apply Bayesian theory in a narrow way which might well have little difference with the frequentist analysis.

David: I largely agree with everything you say. I would definitely agree that our interest in practice is not in so-called ideally rational knowers, but in how humans actually learn about the world, and how we might do it better and faster. That is, the limitations of “best practice” are an essential (not an optional) ingredient for an adequate statistical theory of inquiry. And this theory, as I see it, should be continuous with scientific inference more generally.

One thing: you say, Bayesian theory “allows the synthesis of information from arbitrary sources” but I don’t see how the pieces of background information that we typically wish to bring to bear on a research problem are synthesized or systematically used in a formal Bayesian computation. I have elsewhere described a background ‘repertoire’ of problems, obstacles, strategies for using knowledge about fallibilities in inquiries –all of which actually admit of a small set of groupings into canonical types. But I think one generally would have to jump through hoops to get these considerations packaged in the form of prior probabilities in parameters or statistical models. Even then, it would seem to be an indirect way of taking account of relevant information, by and large.

Mayo: I know these phrasings are just figurative, but I’m having trouble reconciling “continuous with scientific inference more generally” with “you jump in, and you jump out”. The latter seems like a discontinuous move — abstracting/detaching/grabbing(/abducting?) some… thing?… from the statistical modeling level and then moving up to a wider context.

Corey: You are a careful reader, as I have said that (about jumping in and out) and, interestingly, they are one and the same points. The piecemeal nature of learning is what enables the piecemeal error statistical account to be continuous with more qualitative moves in science.

You don’t want to have to carry along some huge assemblage of equipment* just to get started in asking something of the data**, and there are many ways to jump in, and interconnected checks to combine/amplify them. Jump in to probe some particular query: Like a detective the shrewd inquirer poses a few “trick” questions, finds some clues, and jumps out again.

*Not just priors in all kinds of parameters under the sun, but contemplations of how you’d react to such and such, and have every possibility arrayed in front of you ahead of time.

**ready-to-wear vs designer clothes.

I don’t usually defend Bayesians but it should be noted that most Bayesian analyses are conservative. They will tend to shrink estimates towards the prior mean so that they pay less attention to that which is unexpected than would an unadjusted frequentist estimate. Furthermore, if you stop early, the sample size will be smaller, so the degree of shrinkage will be greater. Also the design (including monitoring) will reflect belief so that it is not quite right to expect Bayesians who look and those who don’t to come to the same conclusions.
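The remark above that shrinkage is greater when the sample is smaller can be illustrated with the simplest conjugate case. This is a sketch under assumptions not made in the comment: a normal likelihood with known standard deviation `sigma`, a normal prior with mean `prior_mean` and standard deviation `tau`, and illustrative default values for all three.

```python
def prior_weight(n, sigma=1.0, tau=1.0):
    """Weight the posterior mean places on the prior mean after n observations."""
    prior_precision = 1.0 / tau**2
    data_precision = n / sigma**2
    return prior_precision / (prior_precision + data_precision)


def posterior_mean(sample_mean, n, prior_mean=0.0, sigma=1.0, tau=1.0):
    """Posterior mean in the normal-normal model: a precision-weighted average."""
    w = prior_weight(n, sigma, tau)
    return w * prior_mean + (1.0 - w) * sample_mean


if __name__ == "__main__":
    # Same observed mean, two sample sizes: an early stop versus a full run.
    for n in (10, 100):
        print(n, round(prior_weight(n), 4), round(posterior_mean(0.5, n), 4))
```

With the same observed mean, the estimate from the smaller sample is pulled further toward the prior mean. Note that the posterior here depends only on the sample mean and the sample size, not on the rule that determined when sampling stopped.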

Also I gave a sense in a previous post in which even frequentists can ignore stopping rules. http://errorstatistics.com/2012/09/29/stephen-senn-on-the-irrelevance-of-stopping-rules-in-meta-analysis/

Stephen: Your “defense” is curious. The man bound to go on til “the data look good” (as J. Berger puts it) believes in the effect, so being “conservative” may well mean affirming the prior bias/belief. You can’t just assume the prior comes to the rescue, and besides this misses the point. I ask you how good a job your test did in probing a claim, and you give me a posterior that doesn’t convey whether you tried and tried again or not. The difference doesn’t show up, that’s Le Cam’s point, one of them. Are you saying the design does show up (in contrast to the stopping rule principle)?

I don’t understand what you mean about the design reflecting belief–or rather, I can well see that choice of this design reflects belief in the effect, rather than being low enough to dampen your enthusiasm. In any event, for a subjective Bayesian the design for appraising H doesn’t change the prior belief in H.

I will take the last point first. Two Bayesians with identical beliefs and utilities would have to (if they consider Bayes to be the be all and end all of all rational behaviour) design the same experiment. Thus if one has a stopping rule and the other has not, they must differ in some way as regards what they think about the world. Since they differ about what they think when they design the experiment they must also differ in what they think once it has concluded.

As regards the first point, all I am claiming is that the Bayesian could say that whether or not the frequentist is right to do what (s)he does in analysing sequential designs, it should be understood that if a sequential experiment is run, the frequentist will be adjusting inferences just as the Bayesian does. The difference is that the Bayesian will be adjusting more strongly if the experiment stops early and less strongly if it stops late, whereas the frequentist will be adjusting because the experiment might stop early. So the Bayesian could say “you complain that we don’t adjust for sequential experiments but in fact we adjust everything. Our complaint is that you only adjust sequential experiments.”

Note, also, that the Bayesian deals with such matters by weighting “appropriately” (from the Bayesian perspective). However, as I pointed out in my previous post, if a frequentist weights the results from a series of sequential trials appropriately (according to the amount of information) then even without a prior distribution being added to the mix everything is OK. See http://errorstatistics.com/2012/09/29/stephen-senn-on-the-irrelevance-of-stopping-rules-in-meta-analysis/

As you know I practice a repulsive eclecticism in inference (but then a statistician’s motto is “nobody likes us we don’t care” so being a statistician, I am unconcerned) so I am not making these points to claim the Bayesian approach is superior; I am just pointing out that claiming that Bayesians don’t adjust is, from one point of view, somewhat misleading.

Stephen:

“Thus if one has a stopping rule and the other has not, they must differ in some way as regards what they think about the world.”

Not if sequential stopping is irrelevant. There should be no difference as regards what they think concerning the hypotheses of interest. That’s the essence of the stopping rule principle. Nor am I or Le Cam arguing that Bayesians can’t “adjust” for whatever in the world they feel like. But they cannot have a principled rationale for adjusting and at the same time very happily say they enjoy the “simplicity and freedom” of declaring sampling plans irrelevant when they do. Nor do I see why they should… And of course stopping rules are just one example of a broad class of “selection effects” which the frequentist must take account of—given their goals. If your goal is to express stringency of test, that’s one thing, if it’s a belief or belief-boost assessment, that’s another. When you imagine the Bayesian “weighing appropriately”, my guess is you’ve got your error statistician’s hat on…there’s no reason belief in h should be altered by an experimental design or sampling plan. Traveling in airport–heading to London–so I may not respond for a bit.

Mayo, it’s nice to find some points of agreement, it would probably be wise for me to stop now before they vanish! With that said...

I actually think of subjective Bayes in two quite different ways.

1) The fully specified subjective Bayes theory where inference is by conditioning on the observations to produce a predictive distribution. This should go hand in hand with decision theory, some decisions might include to collect more data, or to quote intervals or something more like a normal decision.

2) A conviction in the correctness of the expected utility hypothesis, and the importance of observables and decisions, but with the acceptance that only very coarse articulation about subjective probabilities and utilities is possible in practice. As a consequence of the imprecision, conditioning by Bayes’ rule doesn’t apply.

I see 1) as a Platonic ideal that doesn’t exist in reality. I agree very much with Senn that with 1) the perfect becomes the enemy of the good. On the other hand it can be extremely fruitful to apply Bayesian decision theory to probability or utility specifications that are developed partially out of convenience.

On the other hand I see 2) as the only thing possible if you are trying to solve a real problem. It’s fine to apply any method whatsoever, but you must evaluate its effectiveness as best you can conditional on your own beliefs, not some arbitrary model. So, I am absolutely fine with the “jump in, jump out” idea but the validity of the result is dependent on a coarse specification of extremely complex mathematical objects. It should not be surprising if you are in practice really inarticulate about the merits of two or more “jump in, jump out” procedures.

I am frustrated when Bayesians mix a version of 1 and 2 together and engage in evangelical point scoring with frequentists, but in reality are using a framework which concedes too much ground with frequentists to begin with. For example a comparison is being made between a credible interval and a confidence interval (but it isn’t really Bayesian to be talking about the parameter space in the first place…).

The criticisms you (or LeCam) make seem to be of this sort of Bayesian school.

The following changes to the framework I believe fix the major underlying problems.

a) accept that you are inarticulate about probabilities and utilities

b) what matters is the predictive distribution both the prior predictive and the posterior predictive (not the prior and the posterior). I think your doubts about probabilistic specification of knowledge can be resolved by a thoughtful consideration of this approach.

c) decision theory is critical. For example collecting more data (or stopping the experiment) is a decision.

I don’t however dismiss some other underlying philosophical difficulties such as the role of randomisation, which I think are not easy to fix.

Gelman writes (on his blog*): “you can use the noninformative prior to get the analysis started, then if your posterior distribution is less informative than you would like, or if it does not make sense, you can go back and add prior information.” To be fair, I will need to go read this, but on the face of it, it sounds like more “magic”, you add info to make it more informative? And how do we assess more informative? A difference from the prior would seem informative.

The manner of correction is all important, and this one sounds flabby…We want to find the info that’s actually there….

*http://andrewgelman.com/2013/11/21/hidden-dangers-noninformative-priors/

Mayo:

There is no magic here, nor do I see the mystery in the statement that you add more information to make a distribution more informative. Adding information, that’s what it’s all about! As I wrote somewhere or another, I see no particular reason to privilege the information that is in some particular dataset that somebody happens to be analyzing right now, as compared to information that is just as good that happens to come from other sources.

I agree that Gelman is cheating here (see also my comment on Gelman and Shalizi http://onlinelibrary.wiley.com/doi/10.1111/j.2044-8317.2012.02065.x/abstract ) but that does not mean that in practice the inferences are not good (a point I also concede). The strictest frequentist approaches do not allow testing of models (a point that worried Kempthorne, J Plan Stat Inf, 1977,1, p2) but many frequentists do in fact pre-test.

Stephen: I’m afraid I don’t know what you’re talking about wrt testing models and the strictest frequentists; all the tests of models I know of are error statistical. That was the point of Box’s ecumenism (using frequentist methods for testing assumptions and then allowing Bayesian methods), and of course David Cox always stressed the need for model checking via simple significance tests and such. I have heard philosophers like R. Rosenkrantz say that since testing model assumptions is post-data, tests of assumptions are in this sense at odds with “predesignationist” requirements. But there is no predesignationist requirement, only a requirement that error probabilities be vouchsafed. I’ll try to look up Kempthorne, “on the road”.

Mayo: Here’s the Kempthorne paper.

Corey: thanks, I had just had it sent to me. This is a paper that Kempthorne gave me in a bunch of reprints he handed to me when he was visiting Virginia Tech. I just reread it, but can’t see Stephen’s point about strict frequentists and model checking on p. 2. One point in the paper I agree with (not the only one, but the others are fairly standard) is that Fisher and N-P were actually very close in their behavioristic thinking, and Fisher’s fussing was more pretense…. Usually Oscar is more clearly taking Fisher’s side. There’s a mention also of a 1935 disagreement where Kempthorne is saying Fisher should have given more credit to Neyman’s side (involving interaction). Not sure if this is one of the debates we spoke about on this blog…. In London, no time to check just now.

SJS: Gelman is only “cheating” according to a standard to which he does not hold. Other standards exist according to which Gelman’s approach is legitimate. For instance, Cox-Jaynes foundations for Bayesian statistics (which you can read about at my blog), about which I have written, “Unlike other approaches to Bayesian foundations, we made no loaded claims of rational behavior, preference, or, arguably, belief. We can jump into and out of any or all joint (over parameters/hypotheses and data) prior probability distributions we care to set up, examining the consequences of each in turn.”

Forgot to close the tag there… rassa-frassin’ HTML… [fading Yosemite Sam-esque cursing]

As I recall, Senn granted that Bayesians were permitted to go tinker with the prior or adjust anything at all to get coherence. (This would have been the “You might believe you’re a Bayesian,…” paper). Anything goes (within coherence, but even that goes in practice)!

Corey: After you “examine the consequences of each in turn,” what do you do next? Do you check that the observed consequences fit the computed ones? Presumably this would be a statistical “fit”, so do you check whether the observed consequences are or are not statistically improbable according to the various claims from which you derive the consequences? How does that go?

“After you “examine the consequences of each in turn,” what do you do next?”

Well, in the context of a certain complicated analysis, I’ve done residual plots to check if a particular parametric error distribution I’ve assumed accounts for the data. This is a self-consistency check: the fact that the residuals match the error model does not necessarily indicate that the model is correct, but if the (posterior expectation of) the empirical residual distribution doesn’t match the error model, then something is definitely wrong. Here I’m talking about ~100,000 individual residuals, so just a simple histogram is a severe enough interrogation to reveal the problem. (I’m trying to speak your lingo, eh?) The Gaussian error model failed this self-consistency check; I switched to a t error model, which passed.

(I think I’ve described this next one on the blog before.) Also in this analysis, it was only after I had computed my initial posterior distribution that I noticed that neither the analysis prior (a “noninformative” one) nor the data were able to rule out the possibility that the instrument that collected the data had recorded certain observations with zero variance (i.e., infinite precision). This caused two problems: bad estimations for the parameters pertaining to those data, and slow convergence in the MCMC sampler. The fix was to revise the prior distribution so that observations of infinite precision were ruled out from the start. This is exactly the process in your quote of Gelman.

Corey, this kind of model checking makes sense to me, but it does not seem to be consistent with Bayesian principles as I understand them from Kadane, Howson, Lindley, etc. It seems Gelman is comfortable with that distinction and maybe you are as well. Why persist in calling your analysis “Bayesian” if you adopt error checking principles inconsistent with so much Bayesian philosophy?

Click on my name to read why.

That Kempthorne paper has long been one of my favorites… in fact, I’d say that it pretty much summarizes my personal philosophy of statistics. “The Fisher [randomization] proposal is absurdly simple.” Exactly, well-said. I’m wondering if anybody has ever severely criticized Kempthorne… I haven’t found it, if they have, would appreciate any references.

Mark: Oscar was a cantankerous fellow prone to provocative claims which many disagreed with (this is just my reaction from reading comments to comments, his visit to VT, and a few letters). I was disappointed that his response to Birnbaum on the SLP (when it first came out), despite beginning with the promising statement that Birnbaum’s argument was based on a logical fallacy, was merely to dismiss it (Birnbaum’s argument) because it assumed the correctness of the model. I thought he would have caught the real logical fallacy, which occurs given the model.

Deborah, as regards testing models, this is what I meant and what (in my opinion) Kempthorne is alluding to.

Any frequentist approach in which you first test which of a number of possible models seems most appropriate and then test some substantive hypothesis using the chosen model can be automated to become a more complicated one-step approach. That is to say, the region in the sample space that leads to rejection can be identified.

A case in point is the so-called two-stage analysis of cross-over trials in which you first test for carry-over and then if it is not judged to be present use a within-patient test of the treatment effect and if it is present use the first period values only. In fact when I arrived at CIBA-Geigy in 1987 a statistician who had just left had written a macro in SAS to do just that. Another example is more elaborate versions of the two sample t-test in which you first test for homogeneity of variance and then use the Student-Fisher approach if not significant and perhaps Satterthwaite or Welch if significant. Some packages do this automatically for you.

Now you can calculate the type I error rate of such integrated one-stage procedures. Suppose that in doing so you discover that as a whole it does not maintain the declared type I error rate. A notorious case in point is the two-stage analysis of cross-over trials, which for a declared 5% type I error rate has an actual rate between about 7% and 9.5% depending on the correlation structure, even if there is no carry-over. See http://www.senns.demon.co.uk/ROEL.pdf for an explanation.

In other words, you can a) perform a test of carry-over that is valid in a frequentist sense in that it maintains the error rate claimed and then b) according to the result perform one of two tests of the treatment effect, each of which if performed unconditionally would (given the assumptions) have the correct error rate but this procedure as a whole does not maintain the type I error rate.
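To make the inflation concrete, here is a rough simulation sketch of the two-stage procedure under the null (no carry-over, no treatment effect). It is not taken from the linked paper. It assumes known-variance z-statistics, a 10% level for the carry-over pre-test (the level mentioned later in this thread), and a within-patient correlation `rho`. Under those assumptions the carry-over statistic is independent of the within-patient statistic and has correlation sqrt((1 + rho) / 2) with the first-period statistic. The function name and parameters are hypothetical.

```python
import math
import random
from statistics import NormalDist


def two_stage_size(rho, n_sims=100_000, pretest_level=0.10,
                   test_level=0.05, seed=0):
    """Estimate the overall type I error of the two-stage cross-over analysis.

    Assumptions (mine, for illustration): known-variance z-tests, AB/BA
    design, no carry-over and no treatment effect, within-patient
    correlation rho between the two period measurements.
    """
    rng = random.Random(seed)
    nd = NormalDist()
    crit_pre = nd.inv_cdf(1 - pretest_level / 2)   # carry-over pre-test cutoff
    crit_test = nd.inv_cdf(1 - test_level / 2)     # treatment test cutoff
    # Correlation between the carry-over statistic (subject totals) and the
    # first-period statistic; the within-patient statistic is independent.
    r = math.sqrt((1 + rho) / 2)
    s = math.sqrt(max(0.0, 1 - r * r))
    rejections = 0
    for _ in range(n_sims):
        z_carry = rng.gauss(0.0, 1.0)
        if abs(z_carry) > crit_pre:
            # Carry-over "found": use first-period data only.
            z_treat = r * z_carry + s * rng.gauss(0.0, 1.0)
        else:
            # Carry-over "not found": use the within-patient test.
            z_treat = rng.gauss(0.0, 1.0)
        if abs(z_treat) > crit_test:
            rejections += 1
    return rejections / n_sims


if __name__ == "__main__":
    for rho in (0.0, 0.5, 0.99):
        print(f"rho = {rho:4.2f}: estimated overall size = {two_stage_size(rho):.3f}")
```

With these simplifications the estimated size should sit above the nominal 5% for every `rho` and rise as the correlation increases, which is the pattern described above.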

So there are problems not just for Bayesians but also for frequentists in model checking. As I know Aris Spanos is a great advocate of model checking it would be interesting to know more about his take on this.

Stephen:

I know what you’re talking about, but didn’t see this implied in Kempthorne. Maybe it is. This gets away from Le Cam, but my reaction to your claim (that model checking is a problem for frequentists and Bayesians alike) is this. The fact that we understand and (as you indicate) may even assess the repercussions of violated assumptions and of different strategies, is an important asset of error statistical inference. We have an idea of what we need the model to do in a given context, and whether the 5% or 7% error rate matters to the inference may be discerned. That’s one key basis for testing its adequacy for the case at hand. The second is that the error probabilities of tests of assumptions are independent (or largely so) of the unknown parameters of the primary inference.

What are the ramifications of the Bayesian model and prior such that I even know what I’m checking when checking them? (Bayesians say many different things.) How do we distinguish the prior and the model (some have both in mind as the model)? Does the model being checked have the same meaning/testable implications in the Bayesian formulation as in the error statistical? Are the Bayesian tests of models testing if I’ve adequately captured my beliefs? or are they testing how often certain types of models and values of parameters occur? or a number of other possibilities? It makes a big difference to what the testable implications can possibly be. Since the model (the likelihood function portion) is uncertain, why aren’t Bayesians assigning it a probability? Maybe some do?

In other words, error statisticians have problems of testing models, but at least I understand what those models are asserting is the case, and can derive testable predictions. At least we understand what the problems are (and can make out criteria for judging when they are bad, or not so bad). I don’t see this with Bayesian models, do you? It isn’t clear what Bayesian models are asserting such that one can check the extent to which their assertions are adequate?

Did Fisher not recommend the omnibus test for consideration of a “family” of test results? I believe his presentation used the chisquared test, but it seems that the principle can be extended to many other situations where there is a need to obtain an overall p-value after multiple tests are performed.
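Assuming the test meant here is Fisher's method for combining independent p-values, the statistic is minus twice the sum of the log p-values, referred to a chi-squared distribution with 2k degrees of freedom for k tests. A minimal sketch follows; the function name is mine, and the method is only valid when the individual tests are independent, which is not the case for the nested pre-test procedures discussed in this thread.

```python
import math


def fisher_combined_p(p_values):
    """Combine independent p-values with Fisher's method.

    Returns (statistic, combined_p), where statistic = -2 * sum(ln p_i)
    and combined_p is the upper tail of a chi-squared distribution with
    2k degrees of freedom.
    """
    k = len(p_values)
    if k == 0:
        raise ValueError("need at least one p-value")
    if any(not (0.0 < p <= 1.0) for p in p_values):
        raise ValueError("p-values must lie in (0, 1]")
    statistic = -2.0 * sum(math.log(p) for p in p_values)
    half = statistic / 2.0
    # For even degrees of freedom 2k the chi-squared upper tail has the
    # closed form exp(-x/2) * sum_{j=0}^{k-1} (x/2)^j / j!.
    term = 1.0
    total = 1.0
    for j in range(1, k):
        term *= half / j
        total += term
    return statistic, math.exp(-half) * total


if __name__ == "__main__":
    stat, p = fisher_combined_p([0.05, 0.05])
    print(f"statistic = {stat:.3f}, combined p = {p:.4f}")
```

A quick sanity check on the closed form: with a single p-value the combined p-value is just that p-value.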

Deborah, if you look once halfway during a sequential trial and would stop if significant and don’t adjust, your type I error rate will not exceed that you would find by using a standard two-stage analysis of cross-over trials. So why is the latter acceptable and the former not?
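For comparison, the size of the unadjusted "look once halfway" procedure can be estimated by simulation. This is only a sketch under assumptions of my own choosing: a two-sided z-test with known variance, equal amounts of data in each half, and the same 5% cutoff applied at both looks. The function name is hypothetical.

```python
import random
from statistics import NormalDist


def interim_look_size(n_sims=100_000, alpha=0.05, seed=1):
    """Estimate the overall type I error when an unadjusted two-sided
    z-test is applied once halfway through and again at the end."""
    rng = random.Random(seed)
    crit = NormalDist().inv_cdf(1 - alpha / 2)
    rejections = 0
    for _ in range(n_sims):
        # Standardised sums for each half of the data under the null.
        first_half = rng.gauss(0.0, 1.0)
        second_half = rng.gauss(0.0, 1.0)
        z_interim = first_half
        z_final = (first_half + second_half) / 2 ** 0.5
        # Reject if either look crosses the unadjusted cutoff.
        if abs(z_interim) > crit or abs(z_final) > crit:
            rejections += 1
    return rejections / n_sims


if __name__ == "__main__":
    print(f"estimated overall size with one interim look: {interim_look_size():.3f}")
```

Under these assumptions the estimate should come out at roughly 8%, which falls inside the 7% to 9.5% range quoted for the two-stage cross-over analysis.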

As for understanding the problem, there are masses of multi-stage procedures where nobody has studied the overall error rate but behaves as if everything is OK. In what sense is the problem understood?

As regards Kempthorne. Just read p2 starting with ‘the basic difficulty’ continuing with ‘a classical type approach….does not have the validity claimed for it’ and finally finishing with ‘It is clear to me that the claims made by those of the Neyman-Pearson school…. cannot be sustained’. It seems clear to me too.

So to sum up, everybody cheats. Bayesians cheat, frequentists cheat and we applied statisticians muddle through. Where this definitely matters is if one claims to have a theory of everything. Some Bayesians do, perhaps some frequentists do but I don’t.

Stephen: Well, I don’t agree that everyone is bound to cheat nor that all methods are equally directed toward taking precautions against cheating. I agree with you that one may believe she is a Bayesian but probably is not; only I don’t leave the person who recognizes this in some kind of limbo, where having broken out of a paradigm no standards exist to scrutinize results. (That was the gist of my “Can we Cultivate Senn’s Ability…?”) Call it metastatistical if you want, as does Le Cam.

I never said anything about your stopping early example–you’re the expert in medical trials. I find those sentences of Kempthorne’s, which of course I read, incredibly vague, especially coming from someone who always raises this gripe, I mean I didn’t have to go to this relatively obscure paper to find Oscar grouching!*

* let me note that this paper may not be obscure in the least to others–one commentator said it’s his favorite,– sorry Mark. My point was just that I’ve seen Oscar go further in detail regarding this one gripe of his.

Stephen: A couple of other things:

Not everyone admits to “cheating”, even having the eyes and vocabulary to call it “cheating” presupposes quite a lot. (You’re just being “Sennsible” again, from outside certain paradigms.) For instance, I have heard Good and J. Berger claim that even a method bound to exclude a true value from an interval with probability 1–or the like–isn’t really cheating because given what he/she believes, one is not “actually” misled. I pondered this for a long time. My most generous reading is essentially that, if you don’t assess an inference (from method M) according to how incapable M is of getting it right, then you have no worries in relation to error probabilities or, as I prefer, in relation to probative incapacity. There is no real answer to this—it’s a matter of aims. (Mine is finding things out and learning from error; theirs is something else):

http://errorstatistics.com/2013/04/06/who-is-allowed-to-cheat-i-j-good-and-that-after-dinner-comedy-hour/

Thanks for these insightful remarks Dr. Mayo! This clears up much, for me at least. I’ve often wondered why applied frequentest papers were so much more reliable than applied Bayesian papers. Now I know. The only reason they get anything right is because they’re all secretly frequentests. Their reasoning is so “flabby” I wonder why they don’t save themselves the embarrassment and just guess the truth. That’s all they seem to be doing anyway.

I took to heart your comments about “I don’t see this with Bayesian models, do you? It isn’t clear what Bayesian models are asserting such that one can check the extent to which their assertions are adequate?” If someone of your deep reading and background in Bayesian methods can’t see “this with Bayesian models” then surely no one can. I can safely ignore Bayesian models assured that they are meaningless according to the greatest philosophical authority on Bayes.

You’ve saved one reader a great deal of time! Thanks again for sharing your many decades of practical experience with applied statistics.

Anon: Ha ha, thanks for the Saturday night jokes for those in exile. Appreciated. Have an Elba Grease–on me– but I would have much preferred insights from the Bayesians to answer these questions. That’s what I was hoping to get.

Deborah, the point of my stopping early analogy was this: if in a conventional sequential trial you look halfway through and would stop if you get significance, you get an inflation of your type I error rate unless you adjust. I think a few posts back you said it is a weakness of Bayesian methods that they don’t adjust. OK, let’s accept this point. However, certain common frequentist model checking procedures also lead to a similar inflation of type I error rates. The two-stage analysis of cross-over trials is a case in point. Actually my paper shows what you would have to do to adjust for pre-testing in cross-over trials but as far as I am aware nobody does. On the other hand true Bayesians would pay a penalty for not knowing the true model here and my friend Andy Grieve who analyses cross-over trials does it properly. See our joint paper. http://www.tandfonline.com/doi/abs/10.1080/10543409808835233#.UpIMbeIlj6g So here is an example where I can claim the opposite. In standard Bayesian analyses of cross-over trials they did pay a penalty for looking and in conventional frequentist analyses they didn’t but should have (a point that the Bayesian Peter Freeman first made in 1989).

OK. You can say that this is only one example but the point is that it’s an example when the overall behaviour of the procedure (model testing + effect testing given model finally chosen) was studied and the effect was found wanting. Bancroft many years ago showed that for some other situations it’s also a problem. However, in general it is ignored. It may be that Aris Spanos has studied the overall performance of his model checking procedures but if so he would be an exception.

To understand the problem with the 2-stage procedure for cross-over trials consider the null null case. There is no carry-over and there is no treatment effect. It is usual to test for carry-over at the 10% level. Then if you “don’t find” carry-over you will use a within-patient test at the nominal 5% level. This test now has a correct 5% conditional size, so no problem. However, if you “find” significant carry-over you use the first period data only. This test is carried out at the 5% level. The problem, however, is that it does not have 5% conditional size. It has 5% unconditional size. Its conditional size having “found” carry-over is between 25% and 50%. Its conditional size having “not found” carry-over is less than 5% and could be as low as 0.

This is now what leads to the problem since the overall procedure in the worst case has size

(0.9 × 0.05) + (0.1 × 0.5) = 0.095

and in the best case

(0.9 × 0.05) + (0.1 × 0.25) = 0.07.

Thus, starting out with the best of intentions and testing each stage of the problem using an apparently acceptable error-statistical approach actually leads to a problem.
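The arithmetic is just the law of total probability over the outcome of the pre-test: overall size = P(pre-test not significant) × (size of the test then used) + P(pre-test significant) × (conditional size of the test then used). A small sketch with the figures quoted in this comment; the function and argument names are mine.

```python
def overall_size(p_pretest_rejects, size_if_not_rejected, size_if_rejected):
    """Overall type I error of a two-stage procedure, by total probability.

    p_pretest_rejects:    chance the pre-test is significant under the null
    size_if_not_rejected: conditional size of the test used when it is not
    size_if_rejected:     conditional size of the test used when it is
    """
    return ((1 - p_pretest_rejects) * size_if_not_rejected
            + p_pretest_rejects * size_if_rejected)


if __name__ == "__main__":
    # Figures quoted above: 10% pre-test, 5% within-patient test,
    # first-period test with conditional size between 25% and 50%.
    worst = overall_size(0.10, 0.05, 0.50)
    best = overall_size(0.10, 0.05, 0.25)
    print(f"worst case: {worst:.3f}")  # 0.095
    print(f"best case:  {best:.3f}")   # 0.070
```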

So, from the jaundiced approach of an applied statistician I would say that I prefer to look at things using different statistical spectacles. If the thing looks the same however I look at it, fine. If it looks different it’s time to think.

Stephen: (1) I don’t know what the rationale is for your Bayesian to adjust as you describe. Is it to get a correct posterior probability or to correctly report some “error probability” (a term which Bayesians use in different ways)?*(remark added)

(2) Perhaps you are talking to “frequentists” in general. I’m not sure if you realize that my thing is neither probabilism nor performance but rather probativeness (for whatever is being claimed or entertained). It can be absurd to try and characterize what one has done at every stage of a scientific research effort, perhaps as a series of decisions, and then sum up all the errors that could occur at each step to arrive at an assessment of a (so-called) error probability–at least if that is to be relevant to evaluating how well probed the claim of interest is.

It’s broadly analogous to the allegation some have made that frequentists do or ought to pay a penalty in DNA matching where the guilty DNA is known and a database of all persons’ DNA is searched for a match. In fact the more negative matches, the higher the probability of guilt when a match is found. (Remember the cartoon, always the last place you look?)

But never mind what is bound to seem an unobvious analogy. My view is piece-meal–just like ordinary science. Once the instruments, initial conditions, model etc. are deemed adequate for a given inquiry, the tests and inferences that follow do not try to formally register all of the mistakes that scientists could have made up to this point in time. That doesn’t mean there isn’t a record of assumptions and potential flaws that could well come up for questioning. I’m not wedded to good long run performance in the least, it is at most a necessary condition. Absurd appraisals of the probativeness at hand are easy to come by (if only error rates are considered). Now as for which is the most relevant assessment of probativeness in your example of crossover trials–as I say, you’re the expert.

*I don’t feel “it is a weakness of Bayesian methods that they don’t adjust” for certain stopping rules–I don’t see the Bayesian rationale for adjusting.

Stephen: You’ve asked a couple of times about Spanos’ reaction to the pre-test bias business and model validation problems. Here’s a paper of his. Sections 5 and 6, or even just 6, should suffice.

http://errorstatistics.files.wordpress.com/2013/11/spanos-2010-journal-of-econometrics.pdf

We’re not in the same country just now.

In the last line of Section 6, Spanos concludes that, “The pre-test bias charge is ill-conceived because it misrepresents model validation as a choice between two models come what may.” But in spite of that, I see a direct tension between the first line of that section, “statistical adequacy renders the relevant error probabilities ascertainable by ensuring that the nominal error probabilities are approximately equal to the actual ones,” and Senn’s cross-over example.

Is this an instance of the fallacy of rejection rearing its ugly head? What “judicious choice[s]” and “astute ordering[s]” of tests would Spanos suggest for M-S testing in the cross-over trial design? And much (much!) more importantly: in light of the goal of “securing the statistical adequacy” of the model within which the primary inference will be assessed, on what (meta-statistical?) basis can one assess the judiciousness and astuteness of any particular M-S testing strategy given some specified context?