Nate Silver gave his ASA Presidential talk to a packed audience (with questions tweeted[i]). Here are some quick thoughts—based on scribbled notes (from last night). Silver gave a list of 10 points that went something like this (turns out there were 11):

1. statistics are not just numbers

2. context is needed to interpret data

3. correlation is not causation

4. averages are the most useful tool

5. human intuitions about numbers tend to be flawed and biased

6. people misunderstand probability

7. we should be explicit about our biases and (in this sense) should be Bayesian?

8. complexity is not the same as not understanding

9. being in the in crowd gets in the way of objectivity

10. making predictions improves accountability

Just to comment on #7, I don’t know if this is a brand new philosophy of Bayesianism, but his position went like this: Journalists and others are incredibly biased; they view data through their prior conceptions, wishes, goals, and interests, and you cannot expect them to be self-critical enough to be aware of, let alone be willing to expose, their propensity toward spin, prejudice, etc. Silver said the reason he favors the Bayesian philosophy (yes, he used the words “philosophy” and “epistemology”) is that people should be explicit about disclosing their biases. I have three queries: (1) If we concur that people are so inclined to see the world through their tunnel vision, what evidence is there that they are able/willing to be explicit about their biases? (2) If priors are to be understood as the way to be explicit about one’s biases, shouldn’t they be kept separate from the data rather than combined with them? (3) I don’t think this is how Bayesians view Bayesianism or priors—is it? Subjective Bayesians, I thought, view priors as representing prior or background information about the statistical question of interest; but Silver sees them as admissions of prejudice, bias, or what have you. As a confession of bias, I’d be all for it—though I think people may be better at exposing others’ biases than their own. Only thing: I’d need an entirely distinct account of warranted inference from data.

This does possibly explain some inexplicable remarks in Silver’s book to the effect that R.A. Fisher denied, excluded, or overlooked human biases since he disapproved of adding subjective prior beliefs to data in scientific contexts. Is Silver just about to recognize/appreciate the genius of Fisher (and others) in developing techniques consciously designed to find things out despite knowledge gaps, variability, and human biases? Or not?

Share your comments and/or links to other blogs discussing his talk (which will surely be posted if it isn’t already). Fill in gaps if you were there—I was far away… (See also my previous post blogging the JSM).

[i] What was the point of this, aside from permitting questions to be cherry-picked? (It would have been fun to see ALL the queries tweeted.) The ones I heard were limited to: how can we make statistics more attractive, who is your favorite journalist, favorite baseball player, and so on. But I may have missed some, as I left before the end.

For a follow-up post including an 11th bullet that I’d missed, see here. My first post on JSM13 (8/5/13) was here.

(4) sounds really odd, but maybe it makes more sense in context. Nassim Taleb would have had a fit if he had been there.

I think part of his answer to your questions might be to appeal to this idea of Bayesian calibration, which I mentioned in the other post, at least in the context of applied predictive models. He would probably regard a prior that leads to miscalibrated predictions as “wrong”.

This is his approach to ensuring that his models converge on actionable predictions, such as:

http://normaldeviate.wordpress.com/2012/11/07/betting-and-elections/
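The calibration idea invoked here can be put in code. A minimal sketch of my own (not from Silver or the linked post; the probability buckets and counts are invented): a forecaster is calibrated if, among the events it assigns probability q, roughly a fraction q actually occur. Here we simulate a well-calibrated forecaster and check one bucket empirically.

```python
import random

# Illustrative sketch of "calibration": among forecasts announcing
# probability q, the event should occur about a fraction q of the time.
# A miscalibrated prior/model would fail this check.

random.seed(3)
forecasts, outcomes = [], []
for _ in range(50_000):
    q = random.choice([0.2, 0.5, 0.8])                 # announced probability
    forecasts.append(q)
    outcomes.append(1 if random.random() < q else 0)   # event occurs with prob q

# Empirical frequency among the q = 0.8 forecasts:
hits = [o for f, o in zip(forecasts, outcomes) if f == 0.8]
print(round(sum(hits) / len(hits), 2))  # should be close to 0.80 if calibrated
```

A forecaster whose prior pushed these frequencies away from the announced q would, on this criterion, have a “wrong” prior.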

Actually I think that it was honorable by de Finetti to acknowledge how the scientific decisions we make depend on our prior point of view/attitude/belief, and to try to put this explicitly into his approach. I also think that it is a good idea for a philosophy of statistics, as well as for researchers in practice, to acknowledge such… I wouldn’t call them “biases” but rather, perhaps,… “influences of the subjective point of view”.

What I don’t like (and this also started already with de Finetti) is that the Bayesians apparently think that it makes sense to have the prior as the one and only correct place where a subjective point of view should be acknowledged. In my personal experience I hardly ever came across a situation in which I thought that the influence of my personal point of view (or the ones of experts collaborating with me) is appropriately incorporated in the analysis by specifying a prior.

Particularly, I think that the way the personal/subjective point of view *should* have an influence on the analysis is by taking into account the aim of the analysis (e.g., exploring vs. explaining vs. predicting), and some personal and not fully formalised ideas about the meaning of the data, worthwhile and less worthwhile hypotheses to think about, limitations of the study design etc., but usually not the *prior belief* about what is true (prior belief may be helpful if fast decision making is crucial and the data are thin).

Still, it is right to criticise frequentists (and Bayesians, regarding all aspects apart from prior specification!) for too often being silent about the need to make subjective decisions.

There’s a methodology that grows directly out of the recognition that people may be biased, self-deceived, unconsciously and consciously prejudicial, and operating with limited information: it’s the frequentist error statistical approach. It is the rationale for principles of experimental design and interpretation. To claim that a methodology that deliberately takes account of biases and limitations is ignoring biases and limitations is to commit a serious fallacy, yet that is exactly what Silver appears to be doing. That is to say, “taking account of” is very different from “denying the existence of.” More when I return to base (one of them)!

Christian: To finish up my points, now that I’m back:

I find your remarks interesting—let’s see, you wrote: (a) “Actually I think that it was honorable by de Finetti to acknowledge how the scientific decisions we make depend on our prior point of view/attitude/belief, and to try to put this explicitly into his approach. …(b) What I don’t like (and this also started already with de Finetti) is that the Bayesians apparently think that it makes sense to have the prior as the one and only correct place where a subjective point of view should be acknowledged. In my personal experience I hardly ever came across a situation in which I thought that the influence of my personal point of view (or the ones of experts collaborating with me) is appropriately incorporated in the analysis by specifying a prior.

(c) Particularly, I think that the way the personal/subjective point of view *should* have an influence on the analysis is by taking into account the aim of the analysis (e.g., exploring vs. explaining vs. predicting), …worthwhile and less worthwhile hypotheses to think about, limitations of the study design etc., but usually not the *prior belief* about what is true …”

The upshot is that we need to ensure these background factors show up in the overall reliability and error probing capacities of the analysis, and in testing the relevance of the formal question to a substantive one, and the like. The account that best accomplishes this is the error statistical one. We see what question you’ve asked, the sensitivity, power, etc….the data, its assumptions and interpretation.

That is why I think it so very, very odd for anyone to claim that introducing subjective priors does anything to improve the objectivity and self-critical standpoint of an analysis. I’m glad you are not one of those people. Those considerations were always there to be picked up on by “frequentist analysis”. But it’s made much more explicit in what I call “error statistics”–at least that’s the idea.

It’s too bad the audience wasn’t given the opportunity to question Silver’s pronouncement #7.

I’m not sure what he’s meant to have said, and I didn’t hear the talk, but here are some questions.

(1) Independently of the Bayesianism question, I’m not sure “disclosing” biases does anything much. When I’m reading a newspaper article and the author writes: “Full disclosure: the subject of this article pays my salary and owns this newspaper,” I don’t think, “Great, the author disclosed, and thus is aware of, her biases and has dealt with them,” I think “This article is probably junk.”

(2) If the idea is to incorporate journalistic full disclosure into Bayesian methodology, is it supposed to go something like “I have a bias toward the person who pays my salary, so you should discount any of my statements about Rupert Murdoch (say) by X?” Unless this results in a judgment to disbelieve everything the reporter or politician says, I’m unsure how biases would work as priors in this way, unless they were just stated as beliefs and treated the same as any other belief.

(3) So, maybe the idea is that the *practice* of disclosing your prior biases is useful? That Bayesian methodology, in forcing you to lay out your prior beliefs, also forces you to disclose your biases, interests, economic self-interest, and the like? But if that’s the idea, again there are practical problems. This would be most useful to the journalist or politician *themselves*, not to anyone else. Because the same mechanisms that are keeping those biases and self-interests in place (your boss won’t let you publish certain things, you want to keep your job, you know no one will vote for you if you say ‘x’ publicly) are still there, and are still good reasons for you to not publish or say the truth. So the practice of evaluating one’s own priors might help the journalist or politician figure out what they think is true, but it won’t necessarily help that truth get to the public.

Lydia: I agree with all this. In science, when a scientist announces, say, I’m going to try and save theory X from refutation no matter what, he is well aware that he’ll have to muster evidential support for disregarding any anomalies for theory X. It’s fine for him to work hard to save theory X–a lot can be learned that way–but the crux of science is precisely to have stringent standards for distinguishing warranted/unwarranted interpretations of data.

It is true that under a proper prior, the Bayesian posterior mean qua estimator does not, in general, have sampling expectation equal to the true parameter value, i.e., in statistical jargon, it is a “biased estimator”. I can’t help but feel that the claim that we are somehow recognizing our biases (in either the colloquial or cognitive sense) through the identification and use of a prior distribution is simply the product of semantic drift and ought to be abandoned by one and all.
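The claim at the start of this comment—that the posterior mean under a proper prior is, in the sampling sense, a biased estimator—can be checked numerically. A minimal sketch of my own (all parameter values invented): for y ~ N(theta, sigma²) with prior theta ~ N(mu0, tau²), the posterior mean is w·y + (1−w)·mu0 with w = tau²/(tau² + sigma²), so its sampling expectation is w·theta + (1−w)·mu0, not theta.

```python
import random

# Monte Carlo check that the posterior mean under a proper prior is a
# "biased estimator": its sampling expectation differs from theta.

random.seed(1)
theta_true = 2.0                    # fixed true parameter (sampling standpoint)
sigma, mu0, tau = 1.0, 0.0, 1.0     # likelihood sd; prior mean and sd
w = tau**2 / (tau**2 + sigma**2)    # shrinkage weight on the data (0.5 here)

n_reps = 200_000
est_sum = 0.0
for _ in range(n_reps):
    y = random.gauss(theta_true, sigma)
    est_sum += w * y + (1 - w) * mu0      # posterior mean for this draw

sampling_mean = est_sum / n_reps
bias = sampling_mean - theta_true         # exact bias is (w - 1) * theta = -1.0
print(f"sampling mean of posterior mean: {sampling_mean:.3f}, bias: {bias:.3f}")
```

The bias vanishes only when the true theta happens to equal the prior mean mu0—which is the statistical-jargon sense of “bias” the comment distinguishes from the colloquial one.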

Corey: Do you really think that’s what’s going on? That would be interesting. Great to meet you in Montreal, by the way!

Mayo: I wouldn’t say that’s what I really think is going on — just that it could be, and right now I’m at a loss to explain it otherwise. But irrespective of the cause of the empirical fact that people do make the claim, I want them to stop.

Corey: I also want them to stop. But how? I think it may have more to do with some of the old slogans about making our subjectivity explicit, along with the “logic error” I consider in my multiple choice question on today’s blog.

I was at the Silver talk, which I very much enjoyed even though I disagree with his statistical views. Actually, I was hoping he would try to make the case for subjective Bayesianism, but aside from calling biases “priors,” he really didn’t touch it. I had the impression that he knew he was speaking to a group of people who know statistics better than he does, and thus avoided the topic.

Indeed, I came away with the impression that he doesn’t know statistics well. He knows politics well, and knows how to apply Bayes’ Rule (in an empirical Bayes context, not subjective), but I don’t think he understands the principles in statistics very well. Drew Linzer, who also predicted the Electoral College vote exactly last year, is actually much more sophisticated statistically.

I don’t mean to bash someone who is obviously quite bright and accomplished, but his characterization of priors as “biases” was quite troubling to me. It dovetails with a point I make in my open source textbook on probability and statistics (http://heather.cs.ucdavis.edu/probstatbook), in the section on Bayesian statistics. I ask the readers if they think that scientific investigations on politically sensitive topics should include subjective priors, which would be a door for bias to enter. Even though some people think my treatment is too harsh, no one has ever even tried to counter that point.

Norm: Thanks for your comment. I’m glad you’re willing to admit:

“I came away with the impression that he doesn’t know statistics well,” because I haven’t heard anyone say that. Likewise, “his characterization of priors as ‘biases’ was quite troubling to me.” I would think people would correct him, or pin him down, on this point; otherwise it encourages others to echo him, as well as to assume a murky vagueness is permissible (so long as it’s under a popular banner). Why do you suppose no one has*?

*I assume they have not because it’s the one thing he said with great forcefulness. I could be wrong.

I’m not sure why you use the word “admit.” As I said, I am not a fan of Nate Silver. And as I said (maybe not clearly), I am not a fan of Bayesian statistics (meaning subjective priors) either. When I said that some people have told me my textbook’s treatment of Bayes comes across as harsh, I was referring to my textbook being critical of the Bayes method.

I would regard Silver’s “bias” remark as almost being a Freudian slip. My point is that one person’s “prior” is another person’s bias. I think that’s a dead-on convincing reason why Bayesian statistics should NOT be used for public consumption. (What someone does for their personal use is up to them.) In my view, it shows why the Bayesians’ rallying cry, “But we should use all the knowledge we have!” is just plain off base.

That Bayesian rallying cry has great appeal, sad to say. Even more sadly, the reason for that appeal is that people don’t think much about statistics and what it really means. It’s just formulas to them, even for many professional statisticians, I’ve found. That’s why a “murky vagueness,” as you put it so well, is quite acceptable. They just want an answer.

Norm:

You wrote: “I’m not sure why you use the word “admit.” As I said, I am not a fan of Nate Silver. And as I said (maybe not clearly), I am not a fan of Bayesian statistics (meaning subjective priors) either.”

I know you’re not a fan of subjective Bayesianism—I checked out your book (and in any event already surmised this from your comments on Normal Deviate’s blog). It’s just that I had the impression that N.S. was the “emperor” at the JSM (and doubtless beyond) and that any such criticism was taboo. I always champion taboo-breaking for the sake of intellectual honesty, so I was crediting you.

Norm: “In my view, it shows why the Bayesians’ rallying cry, “But we should use all the knowledge we have!” is just plain off base. That Bayesian rallying cry has great appeal, sad to say.”

Mayo: Imagine saying “use all the biases you have in interpreting your data!” See my multiple choice question in my current post: https://errorstatistics.com/2013/08/09/11th-bullet-multiple-choice-question-and-last-thoughts-on-the-jsm/

There’s also a crucial distinction, often not drawn, between (a) background information and beliefs regarding the question/hypothesis under test, and (b) background info/knowledge involved in specifying/interpreting results. Even if I warrantedly believed in theory T, if the question at hand was: how well tested is theory T by data x? I would not include that background, lest I beg the question. Scientists might even say, we know theory T has got to be approximately right (about such and such domains), but this data has scarcely discriminated T from any number of rival theories. Or some such thing. For example, many particle physicists felt they already “knew” there would be a Higgs particle, but this was not part of interpreting what the ATLAS data indicated (about the existence and type of particle detected).

Norm: “Even more sadly, the reason for that appeal is that people don’t think much about statistics and what it really means. It’s just formulas to them, even for many professional statisticians, I’ve found.”

Mayo: That is sad, if true. But is it really true? I’m not asking whether professional statisticians are interested in “getting philosophical”—I know that they’re generally not.

Was Nate Silver the “emperor of JSM”? I don’t think so, but it’s important to keep in mind that there is a huge diversity among JSM attendees, who may have had different views of Silver.

However, in my long experience, I do think the vast majority of them do have one thing in common–a nonquestioning, formula-plugging view of the field of statistics.

The mathematical statisticians, for instance (this is the IMS component of the Joint Statistical Meetings), tend to see things in terms of their mathematical elegance, which can blind them to whether what they are doing is of actual practical value. The Bayes method really fits that elegance criterion well, and there is a theorem that says, roughly, “All good decision rules are Bayes rules.” The premises underlying the theorem are unreasonable in a practical sense, but you can imagine how appealing it must be to mathematicians.

Then there are the biologists, many of whom hated math in school and consider statistics to be a nuisance. They want to plug their data into the formulas and be done with it. If people say they should use Bayes formulas, then fine.

As to the physicists and engineers, I recall a joke from my college days: “A mathematician, a physicist and an engineer are investigating a theorem that claims that all odd numbers are primes. The mathematician says, ‘3 is a prime, 5 is a prime, 7 is a prime, 9 is not a prime, the theorem is false.’ The physicist says, ‘3 is a prime, 5 is a prime, 7 is a prime, 9 is an experimental error, 11 is a prime…’ The engineer says, ‘3 is a prime, 5 is a prime, 7 is a prime, 9 is a prime, 11 is a prime…'” Not too much questioning by the latter two professions, at least in the case of statistics.

A few years ago, my department (computer science, housed in a college of engineering) invited a new PhD grad to interview for a faculty position. During her job talk, she stated that she had used a certain prior in her Bayesian analysis. I asked why she had used that particular prior. The candidate was mystified by my question; “Why do you care about the prior?” she asked me with genuine bafflement. And she was from a Top 3 university, with a thesis adviser you may have heard of in a context not really in CS.

Then there are the business people; see the biologists above.

People with actual statistics degrees tend to come originally from one of the above fields.

My wife, who doesn’t have any statistical background, attended the JSM with me. She went to a number of the applied talks, and in looking at the program, she noticed the large number of titles that included the word “Bayesian.” I had told her that the Bayes method is controversial (I mentioned this to prepare her for the Silver talk), so she asked me why, if Bayes is controversial, the program was dominated by this topic, which seemed to indicate general acceptance, indeed enthusiasm. The answer of course is that it only USED TO BE controversial; it’s now become the standard religion, or better, the standard Koolaid. So, if Silver was not the emperor, Bayes certainly was.

Norm: This is very interesting. You say the mathematicians go for elegance, but at least they refuted the claim in your joke (all odd #’s are primes), whereas, oddly the engineers did not.

I meant that I got the sense that N.S. was appraised as the “emperor”, not that he was “the” emperor. Your idea that the Bayesian Way was the real emperor of the JSM is intriguing. I don’t know what to compare it with, but I assume you would. I supposed that people were mostly using Bayesian techniques in relatively non-controversial ways, conjugate priors or technical tricks to get estimates with good error probabilities (as in the session I chaired).

On the other hand, in submitting my paper I searched for a category of “methodology” and found only “Bayesian methodology”. This supports my contention that when it comes to foundations, frequentists are “in exile”.

Yes, the mathematicians at least know what is a prime and what isn’t. 🙂 And they wouldn’t consider it strange for someone to ask a speaker why she chose to use a certain prior.

Regarding the use of Bayesian methods as “technical tricks,” yes, there is some of that. A prior can have a moderating effect, making it less likely that extreme values will arise with an estimator. But I believe I am correct in saying that most Bayesians use the method because of the rallying cry, “Use all your knowledge!” As you point out, using all of one’s knowledge doesn’t mean one must combine one’s knowledge with the data, but once one drinks the Koolaid, it becomes “must.”
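The “moderating effect” described here is ordinary shrinkage, and it can be seen in a few lines. A minimal sketch of my own (all values invented): with a conjugate normal prior, the posterior mean pulls the noisy sample mean toward the prior mean, so extreme estimates arise less often than with the raw sample mean.

```python
import random
import statistics

# Illustration of a prior's "moderating effect": the shrinkage estimator
# (posterior mean) has a smaller sampling spread than the plain MLE,
# making extreme estimates less likely.

random.seed(7)
theta, sigma, n = 0.3, 1.0, 5          # small sample -> noisy sample mean
mu0, tau = 0.0, 0.5                    # prior N(mu0, tau^2)
w = tau**2 / (tau**2 + sigma**2 / n)   # posterior weight on the data

mle, shrunk = [], []
for _ in range(20_000):
    xs = [random.gauss(theta, sigma) for _ in range(n)]
    xbar = sum(xs) / n
    mle.append(xbar)                          # raw sample mean
    shrunk.append(w * xbar + (1 - w) * mu0)   # posterior mean (shrinkage)

print("sd of MLE:     ", round(statistics.stdev(mle), 3))
print("sd of shrinkage:", round(statistics.stdev(shrunk), 3))
```

The price of the reduced spread is, of course, the bias toward mu0 discussed earlier in the thread—which is why this can be used as a purely “technical trick” without any subjective reading of the prior.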

N.S. was certainly not the emperor. No subsequent speaker that I heard mentioned him, nor did I even hear people in the hallway mention him. People went to his talk because he was a celebrity.

And I was using hyperbole when I said that Bayes was the emperor. But there is no question that the Bayesian approach has become extremely prominent, if not dominant, in statistics today. N.S.’s fame has given a boost to statistics, as has the Big Data buzzword (which even I had in my talk), and the fact is that N.S. claims to be a Bayesian and many Big Data/Machine Learning methods are Bayesian, thus giving Bayesianism a boost too.

Your blog title, “Frequentists in Exile,” reflects this. I was at a conference last year at which I attended a talk by a Bayesian, who was describing his analysis in a dental research project. When I asked whether the research sponsor, the NIH I think, would accept a Bayesian analysis, a number of people in the audience jumped on me for merely raising the question—even though they said they were non-Bayesians, even though I had asked my question in a very neutral manner, and even though I don’t think anyone there knew my opinion on the Bayes method. Exile, indeed.

I blame my own field, computer science, for a large part of this. (I started in statistics and still consider myself to be in that field, but my departmental affiliation has been CS for a long time.) CS, being a more aggressive field than stat (in part due to the huge pressure to get research funding in engineering departments), has basically usurped statistics, renaming it Machine Learning. In the view of many statisticians, a lot of the ML research is at best pseudoscience. A typical paper will define a problem only in a “vaguely murky” way, and then list the empirical performance of some ad hoc methods on a few data sets, with no motivation for the methods and no explanation of why some seemed to work. Accordingly, there is no questioning of principles, and once one or two prominent people started using Bayes, most of the rest followed.

Small wonder, then, that a supposedly strong student from a very top university (Number 1 in some rankings, both for CS and in general), working under a top researcher, can be so casual about the Bayes method that she is perplexed when asked why she used a particular prior. And I must add that some of my own colleagues were perplexed too.

Norm,

Thanks for your interesting reflections. Strange that people jumped on you for questioning Bayesian dentistry.* And what did the speaker from NIH say?

Norm: “As you point out, using all of one’s knowledge doesn’t mean one must combine one’s knowledge with the data, but once one drinks the Koolaid, it becomes ‘must’.”

Mayo: But the real problem is that it’s generally not knowledge at all. Moreover, a posterior probability of an event doesn’t produce the kind of appraisal/logic that is really wanted for inductive/statistical inference. But that will get me too far afield: please check the blog (which is searchable and includes tables of contents with links).

On drinking the Koolaid: I wonder how many people know about Jim Jones these days. Perhaps it has more to do with a pressure to follow “the tribal drums” or the like (e.g., a comment on Gelman’s blog about the need to worry “about the possibility of negative push back if they seem to not be enthusiastically marching to that positive tribal drum beat”):

http://andrewgelman.com/2012/12/21/two-reviews-of-nate-silvers-new-book-from-kaiser-fung-and-cathy-oneil/#comment-122785

Norm: “Big Data/Machine Learning methods are Bayesian, thus giving Bayesianism a boost too.”

Mayo: I’m puzzled because I actually got the reverse impression (from Wasserman). I took it that performance and reliability were so important there as to encourage a return to appreciating some rudimentary, but highly clever, frequentist ways.

There are some posts from a simplicity/machine learning conference last year: https://errorstatistics.com/2012/06/29/further-reflections-on-simplicity-mechanisms/

Anyway, several of the points you mention are in sync with the very reason I decided to attend the JSM to give my paper on the flaw in Birnbaum’s (1962) argument for the likelihood principle.

Norm: “Your blog title, “Frequentists in Exile,” reflects this.”

Mayo: You may be one of a teeny tiny handful of people who have agreed with me (I can point out comments), not that I want it to be true or think it makes any sense at all—far from it.

*Non-exiles: this is just a little joke from the story Norm told.

The speaker doing the dental project was funded by NIH, not from there. He is an academic at SDSU who used to be at UCD, which is why I attended the talk. (I had not met him before, but wound up having dinner with him and several ex-UCD students. We didn’t talk about Bayesian issues at the dinner. 🙂 ) He simply said he hadn’t gotten any negative feedback on using Bayesian methods in the paper. It was people in the audience who reacted so negatively.

The Bayesians don’t seem to be very interested in inference, either because they realize it is not really possible or because they don’t think it’s important. Again, the rationale usually given involves estimation error, with the argument being that using a prior will reduce mean squared error (or whatever accuracy criterion is used).

A few months ago on Larry Wasserman’s blog, I raised the question of the effect of using the “wrong” prior. Some discussion ensued. See http://normaldeviate.wordpress.com/2013/03/19/shaking-the-bayesian-machine/

I use the term “drinking the Koolaid” with my students, and they seem to understand, even though they likely could not state the source of the allusion.

I’m not sure what Larry told you, but all you have to do is plug “Bayesian machine learning” into Google and you’ll get a ton of stuff, including some prominent ML books that use Bayesian analysis throughout. And of course my story on the faculty applicant involved a Bayesian approach.

It’s definitely true that the goal of the ML people is prediction accuracy, but that is really just a form of estimation accuracy, which as I said is the putative reason for using Bayesian methods. But ML people don’t tend to be “religious” Bayesians.

One more illustration on the “9 is a prime” attitude I find in CS, at least about statistics: Look at the introductory statistics course offered by the MOOCs purveyor, Udacity. It’s taught by the founder of the firm, Sebastian Thrun, himself a specialist in ML. It is amazingly shallow. This is a criticism I have of MOOCs in general, so I’m biased, but the course is quite shallow even by MOOCs standards. I’m told that Thrun is really an outstanding person and of course his accomplishments are huge, but to me this extremely thin course reflects the typical CS attitude that statistics just consists of a few formulas, nothing worth giving any thought to.

Norm: Of course I meant he was funded by NIH.

Norm: “The Bayesians don’t seem to be very interested in inference, either because they realize it is not really possible or because they don’t think it’s important.”

Mayo: What? If they’re not interested in inference, then they’re not interested in statistical inference, which is about learning from data. Inference is not important? Certainly they claim to be giving us methods for inference, methods they find superior to frequentist inferential methods because they enable moving from data (and a prior, with its various meanings) to the probabilification of a claim, be it a model, estimate, or hypothesis. I realize there are different forms, and true, many are focused on a very limited arena of observable prediction, but it still must purport to be inferential (even if merely deductive). I don’t suppose they view Bayesian methods as merely descriptive of data.

Norm: Again, the rationale usually given involves estimation error, with the argument being that using a prior will reduce mean squared error (or whatever accuracy criterion is used).

Mayo: And this too is a criterion for inference/decision—never mind that we may not be very interested in the behavior of an estimator in long-run use over possibly quite different systems/hypotheses. It may not be a good criterion for inference, but it is still intended as one.

If they really claim inference is unimportant/impossible by means of their methods, they should make this very clear, so others could understand what they’re not getting when they buy the Bayesian package.

What I meant was inference in the sense of things like confidence intervals and tests. Some do “probability-ize” their estimates, but to my knowledge this is not common.

The above refers to statisticians. In the ML community, many don’t even think of their data as a sample from anything, so the inference question becomes moot. I do think they are inconsistent, though, because they always talk about the predictive ability of their model on “new data,” i.e. beyond the data that they fitted their model to. To me, this again is “9 is a prime” thinking.

It should be noted that inference is in fact tough, or even of questionable meaning, in many of the new applications people look at these days. Random graphs (e.g., the analysis of social networks) are an example.

Norm,

Norm: What I meant was inference in the sense of things like confidence intervals and tests. Some do “probability-ize” their estimates, but to my knowledge this is not common.

Mayo: So where do prior probabilities come in then? [Most inference/learning is not of the formal statistical variety. Our accounts of statistical learning should be continuous with learning in general.]

Norm: In the ML community, many don’t even think of their data as a sample from anything, so the inference question becomes moot. I do think they are inconsistent, though, because they always talk about the predictive ability of their model on “new data,” i.e. beyond the data that they fitted their model to. To me, this again is “9 is a prime” thinking.

Mayo: The ones I’ve heard/read allude to iid samples from ….something. I realize there are unique aspects to these problems—and this too calls for philosophical illumination. I get the sense that some/a lot (?) of the work is to capture human discernment abilities: we know the answer, but want to train the robot to make the discernments. Or they have tons of correlations and just want to predict how much more Mayo will pay for the same book/hotel because she uses a Mac and does Y & Z. It’s marketing-advertising-security. Fine, but I scarcely see Bayesian updating (which I’m distinguishing from merely using conditional probability).

The prior probabilities are certainly used; they play a role in the computation of the estimator.

On the lack of a sampling framework in much of ML, see the SVM literature, for instance. As far as I know, most of it does not assume such a framework. See for example http://jmlr.org/proceedings/papers/v28/zhang13c.pdf (Don’t be confused by the word “sampling” which does appear in the text, as it is a different context. They are sampling from what stat people would call a sample.)

I don’t think it’s common in ML to “train the robot to [confirm] the discernments.” But they do indeed want to predict how much Mayo will pay for something, and indeed predict that she will buy that thing in the first place.

I am not sure what you mean by some of your other comments, but maybe the following will clarify. Say we are predicting which people will purchase Nate Silver’s book, based on covariates–demographics, book buying history and so on. For simplicity, say we use classical linear discriminant analysis, fit on a sample of people from the population of all book buyers (all books, not just NS books). Part of that analysis would estimate the unconditional probability p of Yes (yes, they buy the NS book). This is called a prior probability, but it is not Bayesian. Or, it could be considered a simple case of empirical Bayes, but is not a subjective prior.

On the other hand, say we only have samples from two subpopulations, NS book buyers and NS nonbuyers. We then cannot estimate p from our data. BUT we could say, “Oh, I think p is about 0.15” or we could say, “I think p has a beta distribution with such-and-such parameter values,” and then factor that into the discriminant analysis. THAT would be a subjective prior.
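Norm’s two cases can be sketched numerically. The following is my own minimal illustration, not from the exchange itself: a one-dimensional Gaussian discriminant analysis in which the class prior p is either estimated from a pooled sample (his first, non-Bayesian “prior probability”) or supplied by hand (his subjective case). All numbers and the covariate are invented for the sketch.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Simulated "population of book buyers": about 15% buy the NS book (Yes).
# One covariate x (a book-buying-history score, say), Gaussian within class.
n = 10_000
y = rng.random(n) < 0.15
x = np.where(y, rng.normal(2.0, 1.0, n), rng.normal(0.0, 1.0, n))

def lda_posterior(x0, p, mu1, mu0, sigma):
    """P(Yes | x = x0) for 1-D Gaussian discriminant analysis with class prior p."""
    f1 = norm.pdf(x0, mu1, sigma)
    f0 = norm.pdf(x0, mu0, sigma)
    return p * f1 / (p * f1 + (1 - p) * f0)

# Case 1: a sample from the whole population is available, so p is estimated
# from the data -- a "prior probability" that is purely frequentist.
p_hat = y.mean()
mu1_hat, mu0_hat = x[y].mean(), x[~y].mean()
sigma_hat = np.sqrt((x[y].var() * y.sum() + x[~y].var() * (~y).sum()) / n)
post_empirical = lda_posterior(1.0, p_hat, mu1_hat, mu0_hat, sigma_hat)

# Case 2: only the two class-conditional samples exist, so p cannot be
# estimated; it is supplied as a guess ("I think p is about 0.15") --
# Norm's subjective-prior case.
post_subjective = lda_posterior(1.0, 0.15, mu1_hat, mu0_hat, sigma_hat)

print(post_empirical, post_subjective)
```

The point of the sketch is only that the same formula consumes p either way; what differs is where p comes from.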

Norm: Note that in both your cases one views the particular event, Mayo buying an NS book, as a generic type of event. One then tries to judge the relative frequency of its occurrence. Yet the big difference between Bayesians and frequentists purports to be that the latter does not, while the former does, assign probabilities to statements or hypotheses like the deflection of light is 1.75”, prions do not contain nucleic acid, a leading cause of ulcers is helicobacter pylori, the relative frequency of successful perforations by one of Pearson’s naval shells follows distribution D. But these cannot be regarded as general events that occur with relative frequencies over repetitions–can they? (If they can, the frequentist might also assign them probabilities, if that seemed relevant.) So the Bayesian must have something else in mind in assigning these claims prior probabilities. (Some kind of epistemic degree of credence or the like.) Your example does not get at the supposed big deal between frequentists and Bayesians. No difference in the conception of probability (although we error statisticians would want to know if the assignment was well or poorly warranted).

Norm: I’m interested in your views on the unreasonability of the premises underlying Wald’s complete class theorem. Would you care to spell them out?

(One big reason that I, a Bayesian, frequent (ha ha) this blog is the opportunity it affords me to confront the strongest available arguments against my current stance.)

This is not going to be a great answer, as I haven’t thought about this issue for a long time, but here goes.

First of all, I want to be able to do inference, again meaning confidence intervals or at least reporting standard errors. (I’m opposed to doing significance testing.) A subjective prior in effect prevents me from doing that.

On a broader scale, the Wald analysis requires that one choose a utility function, which is unreasonable, especially if my analysis is going to be used by others, who don’t have the same utility function. Even my own utility function, if I can formally state one, might vary from one day to the next. It’s just a very artificial thing, I believe.

Norm: Your subjective probability (which, recall, means whatever it means) of what? And how does it prevent you from performing significance tests? Look, if one is in an “embedded” setting (as Cox calls them), so as to be able to compute CIs, then CIs are better, at least if one doesn’t know how to properly interpret tests in terms of the discrepancies that are and are not severely indicated. But CIs still must be supplied with additional principles if one is to avoid the same fallacies one can commit in tests. (See my “reforming the reformers” posts on CIs.) The points within a CI, for example, should not be regarded as on par.

However, when it comes to things like testing assumptions, the pure significance test has an important role. I take it that’s why Gelman-Bayesians turn to them (for testing models). Even in testing assumptions, interpreting the results generally demands you consider the type of departures the test could or could not readily unearth. You can no more infer an alternative model upon finding a low p-value in tests of assumptions, than you can infer to a substantive explanation of a stat sig result in general.

The criticism that all nulls are false, even where true, I argue, does not mean we’re not interested in how, and in what way, a given null is false. But I also think it’s false that all nulls are false (in any way other than the trivial one, that no model or statement is exactly and completely true about the world). People often confuse the fact that sufficiently sensitive tests will discern discrepancies from precise nulls with thinking the null is false.

I’ve said all this much more clearly (I hope) in published work. The blog can be searched.

Norm: Thanks for taking the time to respond. Regarding the claim that subjective priors prevent one from calculating confidence intervals, I have two things to say (neither from an especially Bayesian stance). First, the requirement to use a (procedure risk-equivalent to the use of a) prior is a consequence of the theorem, not a premise, so pointing to it isn’t responsive to the question. (Let me reiterate that I do appreciate a direct statement of your views.) Second, the claim is false to fact: one can indeed incorporate subjective priors into confidence intervals — yes, honest-to-FSM random regions/intervals with correct coverage properties. The literature on this is fairly sparse; it starts with Pratt 1963, takes a detour through statistical decision theory for interval/region procedures of varying frequentist purity in the 70s through 90s, and returns to “pure” frequentist intervals with Farchione and Kabaila 2008 and Kabaila and Giri 2009. You can even find some discussion of this sort of thing by yours truly right here on this blog (the comment behind that link was written before I did a literature review).

I agree that explicit utility functions are usually quite artificial. My perspective on why the concept is nevertheless valuable is given here.

I looked at one of the references (Kabaila), and I personally don’t consider it acceptable. It is highly contrived and only works for certain ranges of the parameter values.

Things are even worse from my point of view, as I rely mainly on asymptotics. I think using a t-distribution to form “exact” tests or CI, for instance, is silly, as nothing in the real world has a normal distribution. So I use asymptotics, and since the effect of a prior washes out asymptotically, the whole question becomes moot.
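The “washes out” claim can be made concrete with a conjugate normal-normal sketch (my illustration, not part of the exchange; all parameter values are invented): the posterior mean is a precision-weighted average of the prior mean and the sample mean, and the prior’s weight vanishes as n grows.

```python
import numpy as np

rng = np.random.default_rng(1)

mu_true, sigma = 3.0, 1.0   # data: X_i ~ N(mu_true, sigma^2)
mu0, tau = 0.0, 1.0         # prior: mu ~ N(mu0, tau^2), deliberately off-center

def posterior_mean(xbar, n):
    # Conjugate normal-normal update: precision-weighted average
    # of the prior mean and the sample mean.
    w_prior = 1 / tau**2
    w_data = n / sigma**2
    return (w_prior * mu0 + w_data * xbar) / (w_prior + w_data)

for n in (10, 1_000, 100_000):
    xbar = rng.normal(mu_true, sigma, n).mean()
    print(n, xbar, posterior_mean(xbar, n))   # gap to xbar shrinks like 1/n
```

For a fixed sample mean, the pull toward the prior mean is of order 1/n, which is the sense in which the prior’s effect disappears asymptotically.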

Norm: I’m really curious as to what criteria you’re using to judge the procedure unacceptable. It can’t be that it fails to have correct confidence coverage, and “contrived” is an aesthetic judgment rather than an argument against its use…

(Your point about asymptotics is backward — with increasing n, many statistics converge to a pivot having a normal or t-distribution, so a procedure that is exact for those distributions can be turned into a procedure that has asymptotically correct coverage.)

I think that by “only works for certain ranges” you mean that it’s only shorter in expectation than the usual interval in certain ranges. That’s true, but if someone thinks she has reason to believe that the parameter is actually in those ranges, why shouldn’t she aim to report a smaller interval? (Not rhetorical.) Especially since her procedure has a confidence coverage guarantee even if she turns out to be wrong.

Corey, I’m afraid that I didn’t have, and won’t have, time to read that paper in detail. But your comment that “if someone thinks she has reason to believe…” seems to completely undermine the claimed frequentist nature of the procedure.

Your comment on asymptotics seems to be a tautology, something along the lines of “asymptotically normal estimators can use normal-based CIs.” Maybe I’m missing something.

Norm: There’s some disconnect here. Just to be absolutely clear, the procedures I’m discussing don’t use Bayes’ theorem at all; their creators justify them entirely on the basis that — and I want to put this in the strongest possible terms — they have exactly correct confidence coverage for *any* true parameter value. So when you say that the claimed frequentist nature of the procedure seems to have been completely undermined, I can only conclude that either (i) you haven’t taken this latter justification completely on board, or (ii) by “frequentist nature” you mean something other than correct confidence coverage.

If you’re getting fed up and want to tap out, that’s fine. But before you do, I’d like you to forget Kabaila’s papers and take five or ten minutes to contemplate this figure — it just might blow your mind. It’s a plot of two Neyman-style confidence belts for the mean of a single normal deviate with known variance equal to one. The datum value is on the x-axis and the parameter value is on the y-axis. The idea is that one draws a vertical line on the plot at the observed datum value; the intersection of the vertical line with a confidence belt gives the confidence interval.

First consider the usual 95% CI belt, shown in blue. Here’s what makes it a valid CI: pick an arbitrary mu value and draw a *horizontal* line at that value; the intersection of the horizontal line with the confidence belt defines a subset of the data space — i.e., an event, call it E — with exactly 95% probability mass under the assumed mu value. Furthermore, the realized (vertical) CI will cover the assumed mu value if and only if E occurs. Since this is true for *any* given mu, it must be true for the actual unknown mu; hence the confidence belt defines a correct frequentist CI procedure.

Now consider the red confidence belt. It too defines a correct frequentist CI procedure (i.e., for any arbitrary mu value yada yada). It also yields a shorter realized interval (i.e., vertically) than the blue procedure if the datum happens to be near zero, and a longer interval otherwise. The point: if you prefer the blue procedure to the red procedure (or vice versa), you must be basing that preference on something other than correct confidence coverage! And I want to know the basis for any such preference.
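Corey’s horizontal-line argument can be checked by simulation. The sketch below is my own construction, not the belt in the linked figure: it builds a deliberately asymmetric, mu-dependent acceptance region that still contains exactly 95% probability at every mu, which is all that correct confidence coverage requires. The particular tail-splitting function is arbitrary.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)

def acceptance_region(mu, alpha=0.05):
    """A non-standard 95% acceptance region for X ~ N(mu, 1).

    The 5% of excluded probability is split between the two tails in a
    mu-dependent way, so the inverted (vertical) intervals differ from
    the usual [x - 1.96, x + 1.96] -- yet coverage is exact for every mu."""
    a1 = alpha * (0.5 + 0.4 * np.tanh(mu))   # lower-tail mass
    a2 = alpha - a1                          # upper-tail mass
    return mu + norm.ppf(a1), mu + norm.ppf(1 - a2)

# The horizontal-line check: for any mu, P(X in A(mu)) should be 0.95.
for mu in (-2.0, 0.0, 1.5):
    x = rng.normal(mu, 1.0, 200_000)
    lo, hi = acceptance_region(mu)
    print(mu, np.mean((x >= lo) & (x <= hi)))   # each close to 0.95
```

Since the realized interval covers mu exactly when X lands in A(mu), the simulated 95% acceptance probability at every mu is the whole of the coverage guarantee; any preference between such belts must rest on something else.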

My point about asymptotics is that you’ve got it backwards when you write that things are “even worse” from your point of view. My tautological statement points out that these procedures continue to be relevant in Asymptopia. Let me put this more starkly: for these (completely frequentist) procedures, prior information doesn’t wash out.

Corey, first let me assure you that I am NOT fed up. This is a very stimulating discussion. Remember, I’m the one who complained that the profession of statistics is dominated by the “9 is a prime” types, so all this is a very welcome breath of fresh air for me. [Though I might accuse you of being in the “overly enamored of mathematical elegance” camp. 🙂 ]

Now that you point out that the methods you cited don’t use Bayes’ Theorem–which I should have realized even without doing anything more than skim the papers, sorry, my lapse–I realize now that there really isn’t anything remarkable about those results. There are lots of estimators in statistics that will do better for some parameter values than others, even though still correct overall, e.g. in the sense of consistency, and in this case in the sense of (conservative) CIs. Given a choice of two competing estimators, both of them correct but one of which works better in a region one has a hunch is where the parameter lies, it may well be sensible to choose that estimator.

But such estimators are not subjective prior Bayesian, in the sense that people generally use the term, precisely BECAUSE Bayes’ Theorem isn’t used.

I’m afraid you still don’t see my point about asymptotics. What I tried to say is that since a Bayesian estimate becomes non-Bayesian for large n, it’s not fair to say (or, is a triviality to say) that Bayesian estimates enable asymptotic inference. Or, better, because the priors cause bias in the frequentist sense for finite n, then one needs a larger n for the asymptotics to take hold, compared with standard estimators–not a comforting thought.

By the way, you’ve used the term “exact” for various inference procedures a couple of times, and, for better or worse, this kind of thing just doesn’t move me. Interesting that Deborah cites Pearson’s (Fisher’s?) concern about robustness of assumptions; this too is something I settled in my mind–what is robust and what isn’t–from early in my career.

Norm: “Though I might accuse you of being in the “overly enamored of mathematical elegance” camp.”

Nolo contendere. 😉

“There are lots of estimators in statistics that will do better for some parameter values than others, even though still correct overall[,] in this case in the sense of (conservative) CIs. … Given a choice of two competing estimators, both of them correct but one of which works better in a region one has [a] hunch is where the parameter lies, then it may well be sensible to choose that estimator.”

Right, now we’re almost on the same page. The only thing to clarify is the technical point that these procedures aren’t conservative — the non-strict inequality in the definition of confidence coverage is a straight-up equality here (given the model assumptions, natch). That’s all I mean by “exact”.

“But such estimators are not subjective prior Bayesian, in the sense that people generally use the term, precisely BECAUSE Bayes’ Theorem isn’t used.”

True, but recall that the claim at issue is, “I want to be able to do inference, again meaning confidence intervals… A subjective prior in effect prevents me from doing that.” These procedures are the solution to a constrained optimization problem, to wit, the minimization of (subjective!) prior expected interval length subject to the constraint that the resulting procedure have correct frequentist confidence coverage. That’s why I assert that your claim is mistaken.

“What I tried to say is that since a Bayesian estimate becomes non-Bayesian for large n, it’s not fair to say (or, is a triviality to say) that Bayesian estimates enable asymptotic inference. Or, better, because the priors cause bias in the frequentist sense for finite n, then one needs a larger n for the asymptotics to take hold, compared with standard estimators–not a comforting thought.”

This is true — from a certain point of view (as Obi Wan Kenobi would say) — but not relevant to the frequentist procedures I was discussing above (nor to my question about the premises of Wald’s complete class theorem that prompted this highly enjoyable discussion). But now let me put on my statistical decision theorist hat, depart on this tangent, and challenge this point of view directly. Focusing on bias is a mistake — why neglect sampling variance? It’s mean squared error (or, in general, risk) that’s important, and in terms of MSE, a prior can *reduce* the n needed to reach Asymptopia.
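Corey’s MSE point can be illustrated with a toy shrinkage comparison (my simulation, under assumed parameter values, not part of the original exchange): when the true mean sits where the prior expects it, the shrunken estimator trades a small bias for a larger variance reduction, lowering the risk.

```python
import numpy as np

rng = np.random.default_rng(3)

mu_true, sigma, n, reps = 0.2, 1.0, 10, 100_000
# Sampling distribution of the sample mean, replicated many times.
xbar = rng.normal(mu_true, sigma / np.sqrt(n), reps)

# Posterior mean under a N(0, 1) prior: shrinks xbar toward 0 by n/(n+1).
shrunk = (n / (n + 1)) * xbar

mse_xbar = np.mean((xbar - mu_true) ** 2)      # ~ sigma^2/n = 0.100
mse_shrunk = np.mean((shrunk - mu_true) ** 2)  # ~ 0.083: biased, but lower risk
print(mse_xbar, mse_shrunk)
```

The shrinkage estimator is biased for every finite n, yet its mean squared error is smaller whenever the true mean is close enough to the prior’s center, which is the sense in which focusing on bias alone misleads.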

And now I’ll swap that hat for my Bayesian statistician hat and say, funny, I always thought of it as non-Bayesian procedures becoming Bayesian for large n! What I mean is that it seems to me that all of the non-Bayesian procedures for coping with nuisance parameters become equivalent to Bayesian marginalization in the large n limit.

Norm: I know nothing about Corey’s references, but just on your last point: Nothing in the real world is the asymptotic world.

Lots in the real world is well approximated by asymptotics. Early in my career I did quite a bit of investigating how good the approximations are, and thus felt comfortable using them, which I still do today.

Corey: In your “right here on this blog” comment, should 1.42 be ~2.32?:

“For example, if x = 0 is observed, one can claim that the 95% confidence interval is [-1.68, 1.68] rather than the usual [-1.96, 1.96] *provided* that had one observed, e.g., x = 4, one would have reported [1.42, 5.66] instead of the usual [2.04, 5.96].”

Mayo: No, I’m pretty sure the numbers I gave are correct. In this plot, consider the intersection of a vertical line at x=4 with the red confidence belt. The lower CI limit is clearly below 2.

I can’t speak for the Bayesians and thus my comments have always been qualified by phrases such as “it is my understanding that.” Making this disclaimer again, I really don’t think they regard what you refer to as “probability” to really be probability, certainly not in the “frequencies over repetitions” sense. That’s why most don’t give results in probability terms.

I should mention that I even know non-Bayesians who don’t think of probabilities as “frequencies over repetitions.” One colleague actually views the mean as the center of mass, which shocked me.

To react to Norm and Corey:

I confess not to understand what people mean in talking of subjective or personalistic or epistemic probability, and I challenge anyone to pin them down consistently. Yet another twist on this ever-shifting landscape arose in my exchange here with Norm. He seemed to be suggesting that one is dealing in subjective probability if one doesn’t have adequate evidence about relative frequencies of outcomes but just guesses (at the relative frequencies). Is a subjectivist just a frequentist with lousy evidence? I think that’s very strange.

Let me try to give the strongest type of example, where there’s intuitively high corroboration for some parameter values, so that the “state of knowledge” is like saying the parameter follows a normal distribution. There’s a certain folk version that seems sensible enough at first glance. Suppose I want to say that there is strong evidence that a parameter, L, the deflection of light, is close to the value predicted in GTR, 1.75″ (now usually set at 1). As an error statistician I have no problem saying that the GTR value has passed with high severity, but that is not a posterior probability. I also have no problem saying the interval of L values warranted with severity gets increasingly precise with new interferometry measurements and so on. So what will the subjective Bayesian give me that’s better?

L is regarded as a constant, so I can’t speak of frequencies of L taking a value. The subjective Bayesian will say that they will assign an epistemic probability to L, and I’m wondering what they mean. It might mean that, say, values around 1.75 have the strongest evidence, and the ones further away, in either direction, have lower evidence? That cannot be right. I don’t merely mean there’s little evidence of discrepant values, say of a 0 deflection or the .87” of a Newtonian half deflection; I want to say there’s strong evidence that it’s not far away from the GTR value. In fact, we might want to say those far-away values are refuted by the data. So what does the low subjective probability in those far-away values mean?

What about the idea that values near 1.75” are more likely than those further away, in the formal sense? That is, they are values that render the data more probable than do L values far away. But this is an entirely frequentist notion, and it’s using probabilities of evidence under hypotheses as a way to appraise evidential strength or comparative evidential strength. A frequentist does that too. The only additional thing we care about is whether the high comparative likelihood in favor of H’ over H would very frequently be reported even if H is true. But Bayesians should care about this as well.

So far this shows a subjective Bayesian, when he has evidence for a distribution of parameter values, really just has knowledge of comparative likelihoods.
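The “comparative likelihoods” reading can be made concrete with a made-up numerical sketch (the measurement error and sample size are my assumptions, chosen only for illustration): given measurements of the deflection L with known error, values of L near the data render the data more probable than values far away, and that ordering is all the likelihood function itself supplies.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(4)

# Hypothetical deflection measurements centered on the GTR value 1.75",
# with an assumed measurement error of 0.1".
data = rng.normal(1.75, 0.1, 25)

def log_likelihood(L):
    """Log-probability of the data under a hypothesized deflection L."""
    return norm.logpdf(data, L, 0.1).sum()

# GTR value, Newtonian half-deflection, no deflection:
for L in (1.75, 0.87, 0.0):
    print(L, log_likelihood(L))
```

The GTR value makes the data most probable and the far-away values do far worse; that comparative ordering is a frequentist computation throughout, with no posterior probability over L attached.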

Things get really murky when one has poor evidence or no data at all. To say all the values of L are equiprobable—or some other noninformative variant– is to conflate having no evidence and having evidence. But even the “easy case” seems to reduce to likelihood ratios.

Well, remember I’m the one that views a lot of the people as “9 is a prime” types. And, I keep coming back to our faculty position applicant from a very top school who was mystified when I asked her why she had chosen a particular subjective prior. So, I just don’t think many people who use subjective Bayesian methods have really given much thought, if any, to the question of in what sense subjective priors are probabilities.

I believe that most of those who have thought about it do NOT consider subjective priors to be probabilities. They are simply treating the priors as gut feelings, and are definitely not “frequentists with lousy evidence.” Not only would they not be able to give you a precise definition of what a prior really is, they would not be interested in the question.

Norm:

You wrote: “They are simply treating the priors as gut feelings, and are definitely not ‘frequentists with lousy evidence.'” But your example of estimating p (in the case of buying a N.S. book) seemed to draw the distinction according to whether there was evidence or just guessing. But both cases dealt with making a claim about a relative frequency.

Anyway, given what you say, it’s not clear why you objected to S.L.’s idea of the prior being a report of bias or ill-founded personal hunches–or some such thing.