Bad news bears: ‘Bayesian bear’ rejoinder- reblog

To my dismay, I’ve been sent, once again, that silly, snarky, adolescent clip of those naughty “what the p-value” bears (see Aug 5 post), who cannot seem to get a proper understanding of significance tests into their little bear brains. So apparently some people haven’t seen my rejoinder which, as I said then, practically wrote itself. So since it’s Saturday night here at the Elbar Room, let’s listen in to a reblog of my rejoinder (replacing p-value bears with hypothetical Bayesian bears), but you can’t get it without first watching the Aug 5 post, since I’m mimicking them. [My idea for the rejoinder was never polished up for actually making a clip. In fact the original post had 16 comments where several reader improvements were suggested. Maybe someone will want to follow through*.] I just noticed a funny cartoon on Bayesian intervals on Normal Deviate’s post from Nov. 9.

This continues yesterday’s post: I checked out the “xtranormal” website (http://www.xtranormal.com/). Turns out there are other figures aside from the bears that one may hire out, but they pronounce “Bayesian” as an unrecognizable, foreign-sounding word with around five syllables. Anyway, before taking the plunge, here is my first attempt, just off the top of my head. Please send corrections and additions.

Bear #1: Do you have the results of the study?

Bear #2: Yes. The good news is there is a .996 probability of a positive difference in the main comparison.

Bear #1: Great. So I can be well assured that there is just a .004 probability that such positive results would occur if they were merely due to chance.

Bear #2: Not really, that would be an incorrect interpretation.

Bear #1: Oh. I see. Then you must mean 99.6% of the time a smaller difference would have been observed if in fact the null hypothesis of “no effect” was true.

Bear #2: No, that would also be an incorrect interpretation.

Bear #1: Well then you must be saying it is rational to believe to degree .996 that there is a real difference?

Bear #2: It depends. That might be so if the prior probability distribution was a proper probabilistic distribution representing rational beliefs in the different possible parameter values independent of the data.

Bear #1: But I was assured that this would be a nonsubjective Bayesian analysis.

Bear #2: Yes, the prior would at most have had the more important parameters elicited from experts in the field, the remainder being a product of one of the default or conjugate priors.

Bear #1: Well which one was used in this study?

Bear #2: I would need to find out, I came into the project at the point of trying to find an adequate statistical model; this alone required six different adjustments of the model.

Bear #1: So can you explain to me what a posterior of 0.996 really means?

Bear #2: There is no unanimity as to the definition of objective Bayesian analysis, nor even unanimity as to its goal. It is a quantitative construct arrived at by means of a Bayesian computation based on a prior distribution.

Bear #1: But I am assured the priors are coherent, and do not violate probability axioms, correct?

Bear #2: Not in general. Conventional priors may not even be probabilities in that a constant or flat prior for a parameter may not sum to 1 (improper prior).

Bear #1: If priors are not probabilities, how do I know the posterior is a probability?

Bear #2: The posterior distribution can generally be justified as limiting approximations to proper prior posteriors.

Bear #1: Yeah right. Well the important thing is that this is stronger evidence of a genuine effect than was reported in the recent Hands-Price study: they had only a .965 posterior probability.

Bear #2: Not necessarily. I would have to know their sample size, type of prior used, whether they were doing a Bayesian highest probability density interval or treating it as a test, possibly with a “spiked” prior.

Bear #1: You are not serious.

Bear #2: Unfortunately I’m very serious. Bayesian analyses are like that.

Bear #1: Aren’t all the objective, default priors agreed upon conventions?

Bear #2: Not at all. For instance, one school defines the prior via the (asymptotic) model-averaged information difference between the prior and the posterior; by contrast, the matching prior approach seeks priors that yield optimal frequentist confidence sets for the given model, and there are also model-dependent invariance approaches.  Even within a given approach the prior for a particular parameter may depend on whether it is a parameter “of interest” or if it is a nuisance parameter, and even on the “order of importance” in which nuisance parameters are arranged.

Bear #1: Wait a tick: we have a higher posterior probability than the Hands-Price study and you’re saying we might not have stronger evidence?

Bear #2: Yes. Even if you’re both doing the same kind of default Bayesian analysis, the two studies may have started with different choices of priors.

Bear #1: But even if the two studies had started with different priors, that difference would have been swamped out by the data, right?

Bear #2: Not necessarily. It will depend on how extreme the priors are relative to the amount of data collected, among many other things.

Bear #1: What good is that? Please assure me at least that if I report a high posterior probability in the results being genuine there is no way it is the result of such shenanigans as hunting and searching until obtaining such an impressive effect.

Bear #2: I’m afraid I can’t. The effect of optional stopping is not generally regarded as influencing the Bayesian computation; this is called the Stopping Rule Principle.

Bear #1: You’re not serious.

Bear #2: I am very serious. Granted, stopping rules can be taken account of in a prior, but then the result is Bayesian incoherent (in violating the likelihood principle). There is no unanimity on this among Bayesian statisticians at the moment; it is a matter of theoretical research.

Bear #1: Just to try this one last time: can you tell me how to interpret the reported posterior of .996?

Bear #2: The default posteriors are numerical constructs arrived at by means of conventional computations based on a prior which may in some sense be regarded as either primitive or as selected by a combination of pragmatic considerations and background knowledge, together with mathematical likelihoods given by a stipulated statistical model. The interpretation of the posterior probability will depend on the interpretation of the prior that went into the computation, and the priors are to be construed as conventions for obtaining the default posteriors.
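For readers who want to see what Bear #1 is worried about when he asks about “hunting and searching,” here is a minimal simulation sketch. It is not part of the original exchange: the function names, the known-variance z-test, the cap of 100 observations, and the 1.96 cutoff are all assumptions chosen for illustration. It compares how often a true null hypothesis gets a nominally significant result when one tests after every observation and stops at the first success, against how often that happens with a sample size fixed in advance.

```python
import random


def rejects_with_peeking(rng, max_n=100, z_crit=1.96):
    """Draw N(0, 1) data one point at a time under a true null (mean 0).

    Return True as soon as the running z-statistic exceeds z_crit in
    absolute value, i.e. stop at the first "significant" look.
    """
    total = 0.0
    for n in range(1, max_n + 1):
        total += rng.gauss(0.0, 1.0)
        z = total / n ** 0.5  # z-statistic for H0: mean = 0, sigma known to be 1
        if abs(z) > z_crit:
            return True
    return False


def rejects_fixed_n(rng, n=100, z_crit=1.96):
    """Same test, but carried out once at a sample size fixed in advance."""
    total = sum(rng.gauss(0.0, 1.0) for _ in range(n))
    return abs(total / n ** 0.5) > z_crit


def error_rates(trials=2000, seed=1):
    """Estimate the false-rejection rate under each stopping rule."""
    rng = random.Random(seed)
    peek = sum(rejects_with_peeking(rng) for _ in range(trials)) / trials
    fixed = sum(rejects_fixed_n(rng) for _ in range(trials)) / trials
    return peek, fixed


peek_rate, fixed_rate = error_rates()
print("optional stopping:", peek_rate)
print("fixed sample size:", fixed_rate)
```

With these settings the try-and-try-again rate should come out several times larger than the nominal .05 of the fixed-sample test. The error probabilities are sensitive to the stopping rule, which is exactly the information that the Stopping Rule Principle declares irrelevant to the posterior.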

*E.R.R.O.R. fund will support the hiring out of the bears or preferably the much better animated entities on xtranormal.

Categories: Comedy, Metablog, significance tests, Statistics

42 thoughts on “Bad news bears: ‘Bayesian bear’ rejoinder- reblog”

1. Anon

“Just to try this one last time: can you tell me how to interpret the reported posterior of .996?”

It means that for every possible “state of the world” that isn’t ruled out by my knowledge in which there is “no difference”, there are 249 possible states compatible with my knowledge in which there is a difference.

This has an objective, well defined, and useful meaning. It is still meaningful if the “difference” in question is a singular event and can never be used in a “repeated trial”.

It does not however imply that if you could repeat this 250 times that you’d get 1 “no” for every 249 “yes’s”. What would actually happen in a repeated trial is an entirely different question which may or may not be related and relevant.

Thinking of every “.996 probability” as a frequency in a repeated trial is a kind of “scaffolding” surrounding the above meaning. It’s inaccurate in general, unnecessary even in problems that do address repeated trials, and highly limiting since it only really ever applies to a tiny subset of the problems practitioners actually face.
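The “249 possible states for every one” figure in the comment above is just the posterior probability rewritten as odds. A small sketch of that arithmetic follows; the function name is made up for illustration, and exact fractions are used only to avoid floating-point rounding.

```python
from fractions import Fraction


def posterior_odds(p):
    """Convert a probability p into odds in favour, p / (1 - p)."""
    p = Fraction(p)
    return p / (1 - p)


odds = posterior_odds(Fraction(996, 1000))
print(odds)  # 249, i.e. 249 "difference" states for each "no difference" state
```

Note that this conversion says nothing about how the states are to be counted or weighted; it only restates .996 as 249 to 1.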

• Anon: Thanks for this attempt, but you can’t really mean this.

• Anon: Let me try to see what you might mean. The bear wants to convey fairly strong evidence for a real difference (the “what the p-value” bears don’t give specifics so I don’t in my example), and you propose it means something like: a large proportion of the possible states of the world consistent with your knowledge (or the data) x, are worlds with a difference. Or something like that, is that your idea?

• Guest

There is only one world. It would be better to say “some parameters about the world are unknown. A large portion of the possible values for those unknown parameters imply a difference”

2. guest

When a person says “the probability of heads on a coin flip is .5” what they really mean is:

“For every possible initial condition (“state”) of the coin flip system that leads to a heads, there is an allowed initial condition which leads to tails. Because of this symmetry I’ll say the probability is .5”

You think it means:

“if I flip a symmetrical coin many times it will come up heads 50% of those times.”

The latter viewpoint leads one to believe that the probability is a kind of fixed physical property of a coin (“an unbiased coin has probability .5”).

In reality, the same perfectly symmetrical unbiased coin can be made to have any frequency of heads you want by simply changing the space of possible initial conditions. For this and other reasons, the former viewpoint is more useful even when your actual concern is the frequency of heads after many flips!

So yes I am serious.

• Guest: sorry, but I don’t have a clue what you mean by your preferred interpretation, much less how such a thing could be used to assign probabilities to hypotheses (assuming you wanted to—I don’t).

• Guest

It requires a reversal of the way Frequentists normally think. Instead of thinking:

“There is a fixed cause for the Data, and I’ll consider a range of possible outcomes”

Think:

“What actually happened (i.e. the Data) is fixed and I’ll consider a range of potential causes”

For example, the “cause” in the coin flipping example is the initial conditions (for simplicity take this to be the initial orientation and velocity vectors of the coin when it leaves the hand). The “outcome” in n trials would be an element of {H,T}^n.

The latter way of thinking is more down-to-earth. You no longer have to dream up completely fictitious infinitely long sequences of trials. But it’s also far more general: it applies just as well, without modification, for a single coin flip as it does for 1000 coin flips. It also focuses attention on what really matters. The actual frequency of heads can be any value you want if you constrain the initial conditions for each flip to lie in a well-chosen subset of all possible initial conditions!

Unfortunately, the latter viewpoint quickly leads one to a Jaynesian style Objective Bayes which would cause some Frequentists to blow a gasket (I’m not sure I want that on my conscience).
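The claim in the comment above, that the frequency of heads depends on which initial conditions are allowed, can be illustrated with a toy deterministic coin. This is only a sketch under simplifying assumptions that are not in the comment: the coin starts heads up, spins at a constant angular velocity `omega` for a flight time `t`, and shows heads when the heads face ends up pointing upward. The function names and the numerical ranges are hypothetical.

```python
import math
import random


def lands_heads(omega, t):
    """Toy deterministic flip: coin starts heads up and rotates by omega * t radians.

    It shows heads when the heads face points upward at landing.
    """
    return math.cos(omega * t) > 0


def heads_frequency(omega_range, t_range, flips=10000, seed=0):
    """Frequency of heads when initial conditions are drawn uniformly from the given ranges."""
    rng = random.Random(seed)
    heads = sum(
        lands_heads(rng.uniform(*omega_range), rng.uniform(*t_range))
        for _ in range(flips)
    )
    return heads / flips


# Broad, loosely controlled initial conditions: frequency near one half.
broad = heads_frequency((150.0, 250.0), (0.4, 0.6))

# Tightly constrained initial conditions: the same coin always lands the same way.
always_heads = heads_frequency((200.0, 200.5), (0.5, 0.5))  # 1.0
always_tails = heads_frequency((194.8, 195.2), (0.5, 0.5))  # 0.0

print(broad, always_heads, always_tails)
```

Intermediate frequencies can be obtained in the same toy model by mixing such subsets in the desired proportions.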

3. Christian Hennig

guest: How do you know what a person really means when they say something if it isn’t yourself?

• Corey

Christian, I’m sure you mean*:

“guest and Mayo: How do you know what a person really means when they say something if it isn’t yourself?”

* irony!

• Nor does one always know what one means Corey bear!

• Guest

It’s a figure of speech. Obviously we’re talking about what it could mean in principle. But since you took those words literally it would be fun to run with the idea.

Do people really think “prob = .5” refers to the relative frequency of heads in future trials? Well I am not a mind reader so I can’t say. But I can observe what people do.

And what I’ve noticed is that no one actually observes the relative frequency of heads in coin flips and then sets p equal to this observed frequency. They just assume p=.5 and get on with their work. It’s unclear in practice how Frequentists get the “well calibrated” probability distributions they talk about with seemingly so little actual calibration.

On the other hand, the idea that the probability should be .5 impresses itself strongly on the mind. Both Bayesians and Frequentists feel the pull of this intuition and are loath to assign a different value without significant justification. Where does this intuition come from?

Given our typical lack of knowledge about the initial conditions, could the intuition come from the inherent symmetry in the state space (i.e. the set of allowed initial conditions) and it’s this symmetry that makes people think the probability is .5?

• Guest: I would deny that our lack of knowledge is what warrants the .5 assignment; rather, it is knowledge about a given system being sufficiently well modeled by a distribution such that the relative frequency of heads is specifiably close to what is predicted with the relevant Binomial model. It is scarcely different from other uses of mathematical models, e.g., in geometry. How much more so does lack of knowledge fail to warrant probabilistic assignments to hypotheses.
Fortunately, assessments of how much evidence, and how well tested, are not in terms of probability assignments to them.

• Guest

Oh I see. So when a Stat professor does a coin flipping experiment in class and gets “freq of heads =.48” they then do a significance test and say this isn’t significantly different from .49 and instruct the students to use .49 as the probability of heads for the rest of their lives?

Because if so, the students are very bad students indeed. Whenever they encounter a new coin that they have to make predictions about they just assume p=.5.

I stress again that the actual frequency of heads of a perfectly symmetric unbiased coin can be made to have any value you want between 0 and 1 by restricting the space of possible initial conditions in the right way. If by “knowledge about a given system” you also include knowledge of the “space of possible initial conditions” then I respectfully suggest you’re slowly coming around to my way of thinking.

• Guest: I’m not sure what your way of thinking is, but frequentist testers always require their inferences to be connected to a model with a space of possible outcomes.

• Guest

“i would deny that our lack of knowledge is what warrants the .5 assignment”

If I didn’t lack knowledge and knew the initial condition precisely, I’d be able to work out the outcome of the flip. The probability would be either 0 or 1 depending on the physics calculation. So I’ll just make two points:

(1) Our assignment has to be based on what we know (and don’t know) about the universe. We have nothing else to base it on.

(2) Whether a given prediction is accurate depends strongly on what we know.

(3) There’s no getting around (1) and (2)

4. Guest: I haven’t a clue why your posts are sent for moderation since you’ve had comments approved before, sorry. I’ll ask the folks in Elba.

• guest

It’s a different “guest” – I’m the one who posted a lot a while ago, not much recently, and nothing in the above. The email addresses supplied should distinguish who’s who.

From Elba Blog Administrators: Guest (to whom we’re responding): Dr Mayo has said that if someone wants a comment to be anonymous, she will view it that way, and not ask us for e-mails.
However, if there are many guests (for a given post), we wonder if you would take numbers to help the reader?

5. Christian Hennig

guest: I like your last posting; that’s something to think about. If I’m asked to guess what people are thinking, I’d go for some kind of muddled up concept; I’d believe that most people will in fact believe that long run frequencies will match their intuition from symmetry and that they’d find strange at first sight the conceptual need to separate them. As far as I know (this can be found in Gillies’s book on “Philosophical Theories of Probability” but also somewhere in Hacking), it took writers on statistics until the 1830s or so to realise explicitly that there are two concepts of probability around. For example, you don’t find anything like this in Bayes’s original treatise, which is quite ambiguous on the meaning of probability and motivates his use of a prior by a nice repeatable physical experiment.
I think that much confusion in the foundations comes from the fact that both the frequentist and Bayesian approach are different generalisations to cope with situations where the classical “counting of symmetric cases”-concept wouldn’t work. Both use the word “probability” for more or less different things, but because the roots are the same, many people would still insist that there should be only one meaning and so at least one of the two must be wrong. So instead of having the two properly separated and letting each of them do the things it is better for, people tend to somehow muddle them together again or to campaign for eradicating one of them.

6. Guest

“I’d believe that most people will in fact believe that long run frequencies will match their intuition from symmetry”

Yes, because there is a complicating factor which I left out for simplicity. People know more than just “their intuition from symmetry”. They also know that if there were something breaking this symmetry, like for example a law of physics which disallowed most of the initial conditions that lead to “heads”, then this would have long ago been discovered and everyone would know about it.

Obviously, there are plenty of situations where we don’t have that kind of additional knowledge.

7. Guest

Hennig, you wrote: “I think that much confusion in the foundations comes from the fact that both the Frequentist and Bayesian approach are different generalizations to cope with situations where the classical “counting of symmetric cases”-concept wouldn’t work”

I think you’re right. I would only add my own little theory as to how this came about historically. In a comment above I pointed out the Bayesian mindset was to think:

“What actually happened (i.e. the Data) is fixed and I’ll consider a range of potential causes”.

This is easy to do if you actually know a great deal about the causes. In this case we know about initial conditions because we know Classical Mechanics. Historically Laplace was doing the initial applications of statistics in science to Astronomy (i.e. Classical Mechanics).

There is an interesting technical feature of statistics though. Even if we know a deeper model like Classical Mechanics for a coin flip, we can, if we like, express everything statistical in terms of probabilities of H and T. We don’t need to mention distributions over initial conditions or the deeper model explicitly.

This is an important technical fact, because it allows us to operate even when we have no deeper model. For example, people analyzed the number of Boy vs Girl births long before anyone knew about DNA and human embryos.

But this generates a shift in thinking for many people. Since all relevant quantities (probability distributions of whatever kind) are then functions of “outcomes” like {H,T} or {boy,girl}, it becomes easy to shift to the Frequentist way of thinking:

“There is a fixed cause for the Data, and I’ll consider a range of possible outcomes”.

Unfortunately, once you make that shift you’re going to want to drop prior distributions because you can no longer see the justification for them. They seem to be probabilities of something like parameters that don’t have a range of possible outcomes! This shift is especially easy for people in the life and social sciences since they often work without an explicit and accurate deeper model. And in fact, the rise of Frequentist statistics is historically associated with the rise of the life and social sciences.

8. Mark

Guest, I agree with much of what you say, particularly regarding coin flipping, but I deny that that viewpoint would necessarily lead to Jaynesian style objective Bayes. Personally, I subscribe to Popper’s propensity interpretation of probability, which perfectly aligns with what you say about probability being a property of the initial conditions (he calls them more broadly generating conditions). However, I also believe that physical systems to which probability actually applies are fairly limited, and am very skeptical when it comes to applying any statistical model in cases where I just don’t see how probability could possibly apply (e.g., statistical models of non-reflexive (i.e., deliberative) human behavior). For example, I find Nate Silver’s application of “probability” to be misguided and useless.

• Corey

Mark: How then do you account for the performance of Nate Silver’s models — not just in the two most recent presidential elections, but also in the 2008 presidential primaries, which established his reputation? (Predicting primaries is much harder than predicting presidential elections for reasons discussed in the last paragraph of this post on Andrew Gelman’s blog.)

• Mark

Corey, I closely followed that discussion on Gelman’s blog (and even contributed a couple of comments). I certainly don’t want to open a new debate here, and I bet Mayo doesn’t want that either, so I’m sorry I brought it up. Silver has obviously been good at identifying which polls have been the most reliable in the recent past and selecting those to use in his poll averaging algorithm. His selection methods have worked well since 2008, I’m particularly happy that they worked well this year, and they will continue to work well into the future… Until they don’t. I just don’t put any stock into his statement of probability.

• Corey Bear

Mark: I didn’t mean to create a debate — I was just curious, and your reply fully addresses my question and satisfies my curiosity. I shall say no more about it.

• Mark: Putting to one side N. Silver, one could still use statistical reasoning, couldn’t one, to experiment on/test claims about behavior? Popper thought simple statistical testing was at the heart of methodological falsification (Fisherian style), and often brought out the confusion so rampant in his day (and ours) between wanting to test and corroborate statistical hypotheses (e.g., about real effects), and wanting to confirm those hypotheses probabilistically (i.e., assign them degrees of probability)–he called the latter verificationism or probabilism, as you know if you read him.
Also, you may know that he did not lump economics in with other social sciences he disparaged….

• Mark

Deborah, absolutely, but there is a very big difference between experimentally testing behavior or reactions under different conditions and predicting behavior using a probability model.

• Mark

Just to elaborate a bit… In an experiment, probability enters through random assignment and through the assumption that all allowable assignments were equally probable. There is no need to probabilistically model how any individual subject or groups of subjects would behave.
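The point in the comment above, that in an experiment probability enters only through the random assignment, is the basis of the randomization (permutation) test. Here is a minimal sketch; the outcome values, the treated indices, and the function name are hypothetical, and the one-sided difference in means is just one possible test statistic.

```python
from itertools import combinations


def randomization_p_value(outcomes, treated_idx):
    """One-sided randomization p-value.

    Holds the observed outcomes fixed and treats every assignment of the
    same number of subjects to treatment as equally probable. Returns the
    share of assignments whose treated-minus-control mean difference is at
    least as large as the one actually observed.
    """
    n, k = len(outcomes), len(treated_idx)

    def mean_diff(treated):
        t = [outcomes[i] for i in range(n) if i in treated]
        c = [outcomes[i] for i in range(n) if i not in treated]
        return sum(t) / len(t) - sum(c) / len(c)

    observed = mean_diff(set(treated_idx))
    assignments = list(combinations(range(n), k))
    extreme = sum(mean_diff(set(a)) >= observed for a in assignments)
    return extreme / len(assignments)


# Hypothetical data: subjects 0, 1, 2 were randomly assigned to treatment.
outcomes = [12, 15, 14, 9, 8, 10]
p = randomization_p_value(outcomes, treated_idx=[0, 1, 2])
print(p)  # 0.05: only 1 of the 20 possible assignments is this extreme
```

No model of how any individual subject behaves is needed; the reference set is the 20 ways of choosing 3 subjects out of 6.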

9. All: Before moving away from this, I’d really like to hear people’s thoughts on “Anon’s” suggestion at the start of the comments. Unlike the Bayesian, the error statistician employs error probabilities to qualify the method or test. This is akin to Popper’s degree of confirmation or Peirce’s degree of “trustworthiness of the proceeding”. But my main point is that bringing in all our favorite views of probabilities (of events) is moving a bit away from where we began, and that was providing a reading for the Bayesian bear (perhaps in comparison to the p-value) of frequentist error statistics. Anon’s attempt could be seen as somewhat similar, if it could work? Does it require all the possible states be initially equally probable?

• Mark

I think I was assuming that “Anon” and “Guest” were one and the same. That aside, one of Fisher’s criteria for probabilistic inference was “there is a measurable reference set (a well-defined set, perhaps of propositions, perhaps of events).” Ref: http://digital.library.adelaide.edu.au/dspace/bitstream/2440/15274/1/272.pdf

I don’t see how “states of the world” or even “possible values of unknown parameters” meets this criterion.

• Corey Bear

Mayo: As far as I can see, Anon’s suggestion is the Jaynesian party line — it is justified by Cox’s theorem. That theorem, alas, requires a premise that you do not accept, so I no longer bother to try to get you to read about it. See here for more.

The question about equally probable states will run smack into the issue of continuous versus discrete models of the world. In a model with discrete states, *if* the prior information has permutation symmetry with respect to the model’s states then a straightforward argument given by Jaynes in PTLOS shows that the plausibility assessment inherits that symmetry, implying that the states of the world must be equiplausible.

Similar reasoning applies in continuous models, but “equally probable” doesn’t work in the continuous context — permutation, being a discrete transformation, cannot be applied. Symmetry of the prior information with respect to some continuous transformation must be used. Jaynes wrote a paper entitled The Well-Posed Problem giving a particularly insightful application of this notion to Bertrand’s paradox.

• Corey B: Following the link: “Because they don’t accept the premises of Cox’s theorem, in particular, the one that says that the plausibility of a claim shall be represented by a single real number. I’m thinking of Deborah Mayo here.” But are you saying that Anon is assuming this? Why?
And as a topic for another time, the key issue isn’t so much whether the plausibility of a claim could be represented quantitatively (whatever that means), but rather what we need to be representing or measuring in evaluating the well-testedness (corroboration/severity) of hypothesized claims.

• Corey Bear

• Corey B: I take it you’re alluding to the second topic not the first. I saw, by the way, that on the link you sent, Cyan claims I haven’t read these Bayesians, but this does not do me justice at all*. I began way back when (a college undergraduate in math logic-philo, with those axiomatizations at the height of their popularity), and all my potential mentors were knee deep in (one of the two main) philosophical traditions out of which these logics grew, and in fact worked on them, and heard all the gurus ….but then I realized something, and couldn’t go back. We’ve mostly grown out of logical positivism these days….but some still crave these logics, or rather, not logicS, but a very limited first order deductive system obeying truth functional operations ….far from scientific and ordinary reasoning.

Back to Anon: does it differ from a likelihoodist account, in which such and such underlying states would render probable the data, thereby “explaining them”, whereas other states do not account well for them? Still, at least it doesn’t have such cold feet about modeling causes as some we’ve heard from recently. But, from this little snippet I obviously cannot tell Anon’s view.

• Corey Bear

Mayo: You’ve mistaken a (true) statement about the information available to me for a claim about you. I (as Cyan) didn’t claim that you haven’t read all those Bayesians — I said that *as far as I know*, you aren’t familiar with one particular theorem. It’s extremely likely that my interlocutor is aware of the distinction between these two claims and did not mistake the one for the other — long-time members of that forum tend to be educated about the mind projection fallacy.

Other than this, would you say I characterized your stance correctly in the linked discussion? I’m anxious to avoid misrepresenting your views.

My favorite explanation of Cox’s theorem is Kevin Van Horn’s review paper.

• Cyanabear: why are you so anxious? Leaving town now….but here’s a query for you:
Suppose you have your dreamt of probabilistic plausibility measure, and think H is a plausible hypothesis and yet a given analysis has done a terrible job in probing H. Maybe they ignore contrary info, use imprecise tools or what have you. How do you use your probabilistic measure to convey you think H is plausible but this evidence is poor grounds for H? Sorry to be dashing…use any example.

• Corey Bear

Mayo: I’m anxious because it’s a mistake I’d find hard to rectify, since I can’t ensure that those who will have read the thread will see a correction.

Your question about plausibility measures is framed in a sort of agenty way; this comes from arguing against Savage-style subjective Bayesians, I expect. The plausibility measure I’m talking about isn’t agent-bound per se; it’s conditional on states of information.

Ideally, if I “think H is plausible but this evidence is poor grounds for H,” it’s because I have information warranting that belief. The word “convey” is a bit tricky here. If I’m to communicate the brute fact that I think H is plausible, I’d just state my prior probability for H; likewise, to communicate that I think that the evidence is poor grounds for claiming H, I’d say that the likelihood ratio is 1. But if I’m to *convince* someone of my plausibility assessments, I have to communicate the information that warrants them. (Under certain restrictive assumptions that never hold in practice, other Bayesian agents can treat my posterior distribution as direct evidence. This is Aumann’s agreement theorem.)
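To make the “likelihood ratio is 1” remark above concrete, here is a sketch of Bayes’s theorem in odds form: posterior odds equal prior odds times the likelihood ratio. The function name and the prior of 9/10 are assumptions for illustration only.

```python
from fractions import Fraction


def update(prior, likelihood_ratio):
    """Posterior probability of H from a prior and the likelihood ratio for H against not-H."""
    prior_odds = prior / (1 - prior)
    post_odds = prior_odds * likelihood_ratio
    return post_odds / (1 + post_odds)


prior = Fraction(9, 10)                      # "I think H is plausible"
uninformative = update(prior, Fraction(1))   # likelihood ratio 1: evidence does not bear on H
against = update(prior, Fraction(1, 3))      # data three times likelier under not-H

print(uninformative)  # 9/10, unchanged from the prior
print(against)        # 3/4
```

On this account, evidence that is “poor grounds for H” leaves the posterior equal to the prior, so the high number that results reflects the prior rather than the analysis. How well the analysis probed H does not appear as a separate quantity.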

• Paul

It’s certainly simplest if you require the initial states to be equally probable. This works amazingly well for thermodynamics.

• Paul: But how can it make sense otherwise?

• Christian Hennig

Re Anon’s explanation: To me counting “possible states of the world” and even assuming them to be somehow “symmetric”/equally likely seems much more obscure than to assume a frequentist model or to give explanations in terms of betting behaviour. How is a “state of the world” defined? How is symmetry assessed?

• Paul

You can alternatively take the game-theoretic approach of Abraham Wald. Jaynes, a physicist, did not like this since he did not consider nature to be an adversary. But there are plenty of adversaries in the behavioral sciences.

Some great triumphs in this regard include optimal poker strategy (via Harsanyi) and option pricing (via Fischer Black). Plus lots of evolutionary biology.