Yesterday’s slight detour [i] presents an opportunity to (re)read Lindley’s “Philosophy of Statistics” (2000) (see also an earlier post). I recommend the full article and discussion. There is actually much here on which we agree.

The Philosophy of Statistics

Dennis V. Lindley

The Statistician (2000) 49:293-319

Summary. This paper puts forward an overall view of statistics. It is argued that statistics is the study of uncertainty. The many demonstrations that uncertainties can only combine according to the rules of the probability calculus are summarized. The conclusion is that statistical inference is firmly based on probability alone. Progress is therefore dependent on the construction of a probability model; methods for doing this are considered. It is argued that the probabilities are personal. The roles of likelihood and exchangeability are explained. Inference is only of value if it can be used, so the extension to decision analysis, incorporating utility, is related to risk and to the use of statistics in science and law. The paper has been written in the hope that it will be intelligible to all who are interested in statistics.

Around eight pages in we get another useful summary:

Let us summarize the position reached.

(a) Statistics is the study of uncertainty.

(b) Uncertainty should be measured by probability.

(c) Data uncertainty is so measured, conditional on the parameters.

(d) Parameter uncertainty is similarly measured by probability.

(e) Inference is performed within the probability calculus, mainly by equations (1) and (2) (301).

…..

Then on 309:

“The position has been reached that the practical uncertainties should be described by probabilities, incorporated into your model and then manipulated according to the rules of the probability calculus. We now consider the implications that the manipulation within that calculus have on statistical methods, especially in contrast with frequentist procedures, thereby extending the discussion of significance tests and confidence intervals in section 6. It is sometimes said, by those who use Bayes estimates or tests, that all the Bayesian approach does is to add a prior to the frequentist paradigm. A prior is introduced merely as a device for constructing a procedure, that is then investigated within the frequentist framework, ignoring the ladder of the prior by which the procedure was discovered. This is untrue: the adoption of the full Bayesian paradigm entails a drastic change in the way that you think about statistical methods.” (309)

I agree (that the difference can be drastic). While frequentists (or sampling theorists or error statisticians) also assign probabilities to events, the role and interpretation of these probabilities for statistical inference differ notably from what Lindley advocates. Probability arises (i) to control, assess, and reduce the error rates of procedures (in behavioristic contexts); and (ii) to quantify how reliably probed, severely tested, or well-corroborated claims are (in scientific contexts). Details are explained elsewhere (the blog can be searched).

Yet much contemporary discussion of statistical foundations tends to minimize or discount the contrast, and would likely scoff at Lindley’s idea that a “drastic change” is associated with adopting any of the Bayesian (or frequentist) methodologies now on offer. This is a theme that has continually recurred in this blog.

For example, a common assertion is that in scientific practice, by and large, the frequentist sampling theorist (error statistician) ends up in essentially the “same place” as Bayesians, as if to downplay the importance of disagreements within the Bayesian family, as well as between the Bayesian and frequentist. This renders any subsequent claims to prefer the frequentist philosophy as just that—a matter of preference, without a pressing foundational imperative. Yet, even if one were to grant an agreement in numbers, it is altogether crucial to ascertain *who or what is really doing the work*. If we don’t understand what is really responsible for success stories, we cannot hope to improve methods, or get ideas for extending and developing tools in brand new arenas.

Many if not most will claim to be eclectic or pluralistic or the like, even with respect to methods for statistical inference (as distinct from model specification and decision). They say they do not have, and do not need, a clear statistical philosophy, even for the single context of scientific inference, which is my focus. That is fine, for practitioners. I would never claim there is any obstacle to practice in not having a clear statistical philosophy. But that is different from maintaining that practice calls for recognition of underlying foundational issues while also denying that Bayesian-frequentist issues are especially important to them. Even if one or the other paradigm is chosen (perhaps just for a particular problem), there are still basic issues of warrant and interpretation within that paradigm.

We noted a common tendency for “default” Bayesians to profess reverence to subjective Bayesianism deep down (BADD), at a core philosophical level. But if their practice is at odds with the underlying philosophy, they still need to tackle the consequences that Lindley brings out (failing at any of (a) – (e)). So on this Lindley and I agree. Nor do I think it suffices to describe their methods as approximations to a Bayesian normative ideal. I think the methods in practice require their own principles or philosophy or whatever one likes to call it.

Some readers might say that it’s only because I’m a philosopher of science that I think foundations matter. Maybe. But, I think we must admit some fairly blatant “tensions” that sneak into day-to-day practice. For example there seems to be a confusion between our limits in achieving the goal of adequately capturing a given data generating mechanism, and making the goal *itself* *be* to capture our knowledge of, or degrees of belief in (or about), the data generating mechanism. The former may be captured by severity assessments (or something similar), but these are not posterior probabilities (even if one agrees with Lindley that the latter could be). Then there are some slippery slopes about objective/subjective, deduction/induction, and truth/idealizations, deliberately discussed on this blog. These are all philosophical issues, and they are clearly illuminated within frequentist-Bayesian contrasts.

Then there is the rationale for introducing priors. While one group of Bayesians insists we must introduce prior probability distributions (on an exhaustive set of hypotheses) if we are to properly take account of background knowledge (see Oct. 31 post), subjectively elicited priors are often seen as so hard to get, and so rarely to be trusted, that much work goes into developing conventional “default” priors that are not supposed to be expressions of uncertainty, ignorance, or degree of belief. We are back to the question Fisher asked long ago (1934, 287): if prior probabilities in hypotheses are intended to allow subjective background beliefs to influence statistical assessments of hypotheses, then why do we want them? If the priors are designed to have minimal influence on any inferences, then why do we need them? As remarked in Cox and Mayo (2010, p. 301):

“Reference priors yield inferences with some good frequentist properties, at least in one-dimensional problems – a feature usually called matching. Although welcome, it falls short of showing their success as objective methods. First, as is generally true in science, the fact that a theory can be made to match known successes does not redound as strongly to that theory as did the successes that emanated from first principles or basic foundations. This must be especially so where achieving the matches seems to impose swallowing violations of its initial basic theories or principles.

Even if there are some cases where good frequentist solutions are more neatly generated through Bayesian machinery, it would show only their technical value for goals that differ fundamentally from their own. But producing identical numbers could only be taken as performing the tasks of frequentist inference by reinterpreting them to mean confidence levels and significance levels, not posteriors.” (Cox and Mayo 2010)

**I invite your comments, remarks and queries.**

[i] Which got the highest # of hits of any post.

“Yet, even if one were to grant an agreement in numbers, it is altogether crucial to ascertain *who or what is really doing the work*. If we don’t understand what is really responsible for success stories, we cannot hope to improve methods, or get ideas for extending and developing tools in brand new arenas.”

I couldn’t agree more. That said, when making this point, it would be good to lay out explicitly and concisely what would constitute an adequate answer to the question. I say “concisely” because someone without the time to dip into your oeuvre wouldn’t know what you consider necessary and might yet wish to defend some brand of Bayesianism.

Mayo: I agree with much of what you say here. I think that the majority of statisticians and, in fact, scientists using statistics tend to ignore the philosophical issues, which often implies that they can’t give a clear explanation of what the result of what they are doing means.

And actually there are not only success stories; there is an awful lot of stuff supposedly coming out of misused and misinterpreted application of statistics in science.

Unfortunately, I think that we are in a rather bad position to find out “what is really doing the work” because there are all kinds of problems obscuring the issue. Often enough, where clearly good work was done, both Bayesian (epistemic) and frequentist approaches (and also well-trained eclectic statisticians not too interested in philosophy) could have done it and claimed the credit. Both are often enough misused as well. Often enough people wouldn’t agree on what exactly was the “work to be done”, and which parts were done well and less well. One should also not forget that both frequentist and Bayesian (epistemic) probability theory have the same historical roots, so their probability thinking is in some sense very related, despite the fact that when done consistently, they model different things and need to be explained in different ways. It may be that often “the work is done” by the common core.

By the way, I hope that you saw that with some delay I responded to you regarding the Cherkassky and first Ockham’s razor meeting entries. The blog world may be too fast for me…

Christian: I have argued that the use of error statistical methods in scientific contexts, with its frequentist probability, is properly epistemic. The error probabilistic analysis, in other words, enables determining what there is, and is not, good evidence for. On the other hand, I don’t see that probability theory itself adequately captures “epistemic” reasoning. The use of these terms in your comment, I think, is connected to some of the biggest confusions.

Well, I used the term “epistemic” in agreement with what I find in most of the philosophical/statistical literature on interpretations of probability, in order to distinguish epistemic probability (modelling uncertainty assessments of a rational observer) from aleatoric probability (modelling random processes in the reality outside the observer).

I won’t defend this term because I used it to make reference to others who use it, not in order to claim that this is how it generally should be used (I personally have no issues with using it in a descriptive fashion like this, though, and I think that confusion is rather located elsewhere).

Christian:

I realize you’re getting the idea of “epistemic probability (modelling uncertainty assessments of a rational observer)” from some traditional literature, and I’m saying it is a notion that has never been adequately cashed out and should be dropped. I would be interested to hear philosophical clarifications/defenses. (I don’t know what you mean by calling it “descriptive”, I thought it was deemed normative—I may be missing your point). There is a long history of the source, mostly philosophers defining knowledge in terms of true belief. (Some have said that Popper’s philosophy of science is irrelevant to epistemology since he denies such a notion. But so does C.S. Peirce. Musgrave tries to rephrase Popper to include “belief” but it only does harm.) Some blogposts on Achinstein may be relevant.

Anyway, I see one of the main functions of this blog as giving a space to rethink such knee-jerk conceptions. You say it doesn’t cause confusion, but I think it is at the heart of the confusion. If one is told the only way to be relevant to inference, evidence, and knowledge is to provide probabilistic assignments to claims, then one might assume one must be a Bayesian epistemologist, or the like, if one wants to say something about scientific inference. Wrong.

Side point: It may not do much damage in analytic epistemology, since here the use of “probability” may be little more than a way to abbreviate “belief”, but the formalism misleads some to suppose that “formal epistemologists” are doing “philosophy of statistics”. Some might be (I hope they are), I’m just saying that those who are only doing analytic epistemology are not connecting with statistics. (See last 2 sections of my RMM paper.) “Reliabilists” in philosophy are also mainly stuck in an analytic definition game. I would like for them to actually be involved in identifying reliable processes, and statistics is a great place to look.

By “descriptive” I just meant that I used this term to describe a certain way of conceiving probabilities (so I referred to myself as being “descriptive”, not the notion of epistemic probabilities). There wasn’t any value implied by this such as “it’s the only way to be relevant to inference,…”

You may be right to say that one could read such a claim into the use of this word and I’m happy to use another one (although sticking to the literature normally is good for being understood). Any suggestions? It’s not identical to “Bayesian”, though. Gelman for example is Bayesian but apparently doesn’t interpret probabilities “epistemically” – yes there’s the bad word again;-)

Christian: Another word for what? If you’re speaking of subjective probabilities, say so; if it’s frequentist (or propensity), say that; if some kind of default, tell us which (including Carnapian style language-based). Although there are subgroups among each of these, at least it doesn’t prejudge which can be relevant for inference, evidence, and knowledge (i.e., for epistemology).

I’m looking for another generic term for those probability interpretations that refer to “rational strength of uncertainty”, i.e., something that encompasses subjective but also logical and objective Bayes interpretations. Pretty much everything that is *not* frequentist/propensity.

This broad class is often referred to in the literature as “epistemic”.

Christian: While you’re looking for it, stop and think about what “rational strength of uncertainty” could really mean*. What is the event in which one is to report one’s rational degree of uncertainty in appraising scientific hypotheses (e.g., some aspect of the Higgs mechanism by which particles get their mass, just to mention our recent example)? It isn’t that I cannot imagine somehow “grading” positive evidence obtained by other (non-Bayesian) means. But this won’t suffice: remember, it has to also be computed and combined according to the probability calculus. This is at odds with the “logic” of well-testedness, as I see it.

*By the way, I don’t think most default Bayesians interpret their probabilities this way nowadays; many seem to view their priors as some kind of formal construct sans interpretation. Do you not find this?

That may well be and I agree particularly with the footnote. It doesn’t stop me from wanting to have a generic term for these approaches in order to discuss them and as long as I’m not offered a better one, I’ll have to stick to “epistemic”.

Corey: In this passage, of course, I am reacting to those who argue their account of statistics is right because it “works”. Realizing it’s an inadequate argument, and that satisfying an associated requirement, perhaps inadvertently, might be responsible for the success story, suffices for my point.

(I don’t know why this isn’t coming out right under your comment.)

Mayo: The passage I quoted makes an important point well beyond simply noting that “satisfying an associated requirement, perhaps inadvertently, might be responsible for the success story.”

1. I disagree with the claim that “the frequentist sampling theorist (error statistician) ends up in essentially the “same place” as Bayesians”.

2. I have no “reverence to subjective Bayesianism deep down”.

3. I have recently become devoted to “weakly informative priors”, which are neither “subjectively elicited priors”, nor are they “conventional “default” priors that are not supposed to be expressions of uncertainty, ignorance, or degree of belief”. One way I think of my priors is as soft constraints on probability models. If I fit some particular class of models, for example mixtures of normal distributions, I am generally not interested in all possible models that fall within this mathematical class. To define my inferences in that way would be to make myself a slave to default parameterizations. I am happy to use a prior distribution to constrain the models that I fit.

The above does not mean that I disagree with what Mayo wrote above; I’m just trying to clarify the position of (at least) one Bayesian statistician.

Gelman: Thanks much. I wonder, when you use your priors as “soft constraints on probability models” if that is really tantamount to giving a prior; if it is, how do we interpret the prior or does it not matter?

Mayo:

Yes, a prior is a soft constraint. Here’s an analogy that might work: I will analogize the prior distribution to the price of a new product.

If I am a business person releasing a new product, I want to set a price. Let us simplify and suppose there is some optimal price (given all my goals of fame, fortune, a steady stream of income, etc). Ideally I would do some data collection and analysis to estimate an optimal price and make a good decision. Alternatively, I could just set a price using various guesses and heuristics.

Similarly, ideally the prior distribution in a Bayesian analysis represents my best guess of the probability distribution of the parameter I am studying (that is, the probability distribution defined in the reference set of all settings where I might use this model). But if I can’t do this, I approximate. My intuition is that in most settings it is better to have the prior be too weak than too strong, but I don’t have any demonstration of this. In any case, the prior is mathematically a soft constraint. In some settings the constraint derives from explicitly expressed prior information; in other settings it’s just a constraint, to allow me to fit the model I think I want to fit.
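To make the “soft constraint” reading concrete, here is a minimal sketch, assuming the simplest possible setting: a normal mean with known standard deviation and a normal prior. The function name and all numbers are hypothetical and are not taken from the comment above. The posterior mean is a precision-weighted average of the sample mean and the prior mean, so a tight prior pulls the estimate toward the prior mean while a weak prior leaves it close to the data.

```python
def posterior_mean_normal(sample_mean, n, sigma, prior_mean, prior_sd):
    """Posterior mean for a normal mean with known sigma and a
    Normal(prior_mean, prior_sd) prior (standard conjugate result)."""
    data_precision = n / sigma ** 2        # information carried by the data
    prior_precision = 1.0 / prior_sd ** 2  # information carried by the prior
    weight_on_data = data_precision / (data_precision + prior_precision)
    return weight_on_data * sample_mean + (1.0 - weight_on_data) * prior_mean


# Hypothetical data summary: 10 observations, sample mean 3.0, known sigma 2.0.
sample_mean, n, sigma = 3.0, 10, 2.0

# The same data under a strong, a moderate, and a very weak prior centred at 0.
for prior_sd in (0.1, 1.0, 100.0):
    estimate = posterior_mean_normal(sample_mean, n, sigma,
                                     prior_mean=0.0, prior_sd=prior_sd)
    print(f"prior sd {prior_sd:>6}: posterior mean {estimate:.3f}")
```

Nothing here settles how the prior should be interpreted; the sketch only shows the mathematical sense in which the prior constrains the fit without ruling any value out.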

Similar concerns arise with the data model or “likelihood.” Suppose I fit a normal or binomial model. In some settings this reflects real (or imagined) substantive information, in other settings it is just a constraint on my model space.

Gelman: This might work if there is something like “the probability distribution of the parameter I am studying”, but what if it is a fixed constant? Assigning a probability to a price being optimal may well be like assigning a probability to an event, and it makes sense to think of the relative frequency with which a given price will satisfy your goals (e.g., prob of .9 to $32 for a share of Facebook?), but I don’t see the correctness of a scientific hypothesis to be like the occurrence of an event (e.g., it will sell). I will ponder your price analogy some more. As in our earlier discussions, I think our differences are mostly traced to different kinds of applications. In some examples it sounds like an empirical Bayes context. But perhaps I can grasp what you’re not saying (correct me if wrong): I take it you’re not saying the prior is (your?) degree of rational belief.

Mayo: You wrote, “This might work if there is something like “the probability distribution of the parameter I am studying”, but what if it is a fixed constant?”

Gelman didn’t respond to this question, but I can. The parenthetical after the part you quote is the key: “that is, the probability distribution defined in the reference set of all settings where I [Gelman] might use this model”. You are to imagine the set of problems that Gelman faces as a stochastic process of some kind. Presumably his choice to use a particular model in any problem could also be modelled as stochastic. The true parameters associated with each problem inherit the randomness of these stochastic processes.

I have a feeling from this post that you’d call this “fallacious probabilistic instantiation”.

Corey: Fallacious probabilistic instantiation is when, for example, one randomly selects a statistical hypothesis from an urn of hypotheses, p% of which are “true” (or share some other property) and then infers the probability is p that this one particular hypothesis, H, that I have selected is true. [They needn’t even be related in subject.] It is one thing to play a game of randomly selecting hypotheses from an urn, and saying the event of a “success” is p (even though I have no clue how one would know the hypothesis selected was true, or even that the urn had p% true hypotheses). But this event (selecting a true hyp) is no longer referring to this particular H under which I (as a frequentist) would be computing likelihoods. It’s the equivocation through the same computation that is problematic. You can find all of this in posts and papers (remember Isaac). The other thing that always strikes me about the “urn of hypotheses” gambit is just how behavioristic this all is. We get so used to the frequentist being lambasted for considering low relative frequencies of error, that we forget that it is the Bayesian who seems only too happy to consider things like the % of hypotheses that would be true, given x (in a series of urn samplings of hyps). By contrast, the frequentist’s error rate (over different x’s that could have resulted from the one hyp that generates the experimental results of interest, whatever it is), can be quite relevant to determining how well tested THIS hypothesis is, and in scientific contexts, that’s what I’d require. Sorry to be writing so fast, I have urgent work, and have written on this more clearly elsewhere.

Mayo:

Sometimes my prior is (an approximation to) my degree of rational belief, sometimes not. Just as, sometimes a data model is (an approximation to) a believed data-generating process, sometimes not.

Let me add a very brief comment in relation to Gelman’s take:

“One way I think of my priors is as soft constraints on probability models. If I fit some particular class of models, for example mixtures of normal distributions, I am generally not interested in all possible models that fall within this mathematical class. To define my inferences in that way would be to make myself a slave to default parameterizations. I am happy to use a prior distribution to constrain the models that I fit.”

Parameter restrictions are used routinely by frequentists in fields like econometrics where such restrictions stem from substantive (subject matter) information; this is usually done by nesting the structural (substantive) model into the statistical model. The key difference between the frequentist and Bayesian approaches in employing parameter restrictions — apart from treating unknown parameters as constants as opposed to random variables — is that in the context of frequentist modeling and inference these restrictions, like other forms of substantive information, are subject to evaluation; they can be formally tested against the data before being imposed. Introducing parameter restrictions via priors leaves hardly any room to assess their validity vis-a-vis the data!
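As a rough illustration of what “testing a restriction against the data before imposing it” can look like, here is a minimal sketch under strong simplifying assumptions: independent normal observations with known standard deviation, and a single restriction that fixes the mean at a hypothesized value. The function name and the data are hypothetical; real econometric restrictions are typically multivariate and tested with more elaborate machinery.

```python
import math


def restriction_test(data, theta0, sigma):
    """Likelihood-ratio test of the restriction mean == theta0 for
    independent Normal(mean, sigma^2) data with sigma known.

    Returns the statistic n * (xbar - theta0)^2 / sigma^2 and its p-value
    under the chi-square(1) reference distribution."""
    n = len(data)
    xbar = sum(data) / n
    lr_stat = n * (xbar - theta0) ** 2 / sigma ** 2
    # For chi-square with 1 df: P(X > s) = P(|Z| > sqrt(s)) = erfc(sqrt(s / 2)).
    p_value = math.erfc(math.sqrt(lr_stat / 2.0))
    return lr_stat, p_value


# Hypothetical measurements and a hypothetical restriction mean == 0.
measurements = [0.8, 1.4, 0.3, 1.1, 0.9, 1.6, 0.2, 1.0]
stat, p = restriction_test(measurements, theta0=0.0, sigma=1.0)
print(f"LR statistic {stat:.2f}, p-value {p:.4f}")
```

A small p-value would count against imposing the restriction; a large one is only a failure to find a discrepancy with this particular test.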

Aris: in addition to Andrew Gelman’s comments below, you might be interested in Journal of Multivariate Analysis (2009) “On the distribution of penalized maximum likelihood estimators”. In some plausible/popular models, no uniformly consistent estimator for inference exists under model selection. Hence, asking the data “what question can you usefully answer and what is the answer” is very different from asking “what is the answer” to a question you know, a priori, that the data will usefully answer. Priors are one way of defining those questions.

Also, could you (and everyone) please drop the line about “treating unknown parameters as constants as opposed to random variables”. As has been repeated here many times, there are loads of Bayesians who view loads of unknown parameters as fixed constants. Bayesians describe what they know about parameters via probability calculus; this doesn’t rule out the truth being an unknown constant. Being precise about this is particularly important whenever random effects models are used – whether by Bayesians or not.

Thanks for the advice; I will look into the paper in the Journal of Multivariate Analysis (2009) “On the distribution of penalized maximum likelihood estimators”.

Guest: Maybe you can clarify how to understand the prior probability in a value of a fixed constant. Is it a measure of what is known about the fixed constant? Or does it measure the degree accorded to a given value of the fixed parameter? So for example, say L is the fixed deflection of light in GTR. Does a .5 prior probability* assignment to “L = 1.74” mean I have about as much evidence that L is lower than 1.75 as I do that it is greater than 1.75? Or does it mean I know (without a quantification on the knowing) that around half the time the experimental measurements on L would be as if measuring “L=1.74”? I can see it meaning either, but they are very different. It is one of the central ambiguities masked by simply saying priors in parameters represent what we know (e.g., about fixed constants). I’ve never seen pinned down just what the probability quantity is intended to be measuring.

*take any p you like.

If your prior has .5 mass on L=1.74, you judge the plausibility (a la Cox 1946’s axioms) of that exact value being the true L to be equal to the sum of all probabilities assigned elsewhere in the parameter space. The true value is still a fixed constant, albeit one you don’t know the value of. In the absence of stating a full prior, you haven’t said anything about L being higher or lower than this fixed constant value. Also, even a full prior doesn’t say anything about relative frequency of data; whether the estimate L-hat comes in higher or lower than the true L, averaging over repeated experiments, depends on the experiment you do, and the estimate you use. (And yes, if you use a prior in your analysis, which prior you use also matters, to some extent). I don’t see where your “masking” comment comes from; neither of the options you describe captures what priors mean. Also, while different flavor Bayesians disagree on lots of points, the basics above are not controversial, or ambiguous.

At the risk of repeating my earlier comments, you may also find it helpful to step away from point mass priors, with e.g. 50% mass on a single point. Smooth priors, without “spikes”, are much more plausible in many practical scientific situations, can be elicited more easily, and also lead to Bayesian inference that behaves a lot like classical results (p-values, confidence intervals etc). This isn’t true for every example – as Larry W often correctly points out – but it might help understand what priors mean, and don’t mean.

Guest: I didn’t explain myself clearly, sorry, I wasn’t meaning to get into special problems of point masses, only the meaning of probability to a statistical hypothesis H about parameter values understood as fixed (it can be a range of parameter values, or smooth them as you like). Is my prior trying to represent strength of evidence in H? Were the parameter values to vary, one might try to quantify approximately how often the different parameter values are the case. But we’re talking about a fixed parameter. So possibly a prior could be intended to capture how often experimental measurements of this fixed constant would be as H were the case.

Mayo: I’ve been thinking about your question for guest and following your exchange, and I think I have something to contribute. To understand the meaning of a Bayesian probability distribution, it is necessary to go back to the sets of postulates that lead to Bayesian probability. Some of these have operationalizations built in, e.g., de Finetti’s betting coherence, Savage’s rational decision making axioms. Cox’s theorem doesn’t — it just lays out some desiderata for a measure of plausibility. The meaning of a Bayesian probability distribution (exclusive of the decisions or bets one would use it to make) is just that it satisfies the Cox desiderata.

Corey: Just to say it’s whatever satisfies the axioms isn’t to say it serves its purpose, e.g., provides for a good measure of evidence or epistemic warrant or the like. In my next to last pair of remarks (the Socratic ones), I was trying to put aside for the moment questions about whether it’s reasonable to have probabilities for hypotheses, and try to consider my best attempt to cash them out (along the Reichenbach avenue). In my very last comment, I went even further than I’d need to for the critique: Consider an unobjectionable case where we have a probability distribution for outcomes and associated events,* and see if the various probabilities adequately capture strength of evidence in some sense. This is what one does to see if a formal system is “adequate” for representing something (the additional question would be whether it was complete, the converse, which is much harder). It seems to me inadequate.

(Christian was suggesting at one point it’s automatically epistemological by not being frequentist, and perhaps merely by purporting to be.)

*Recall it turned out not to be so easy to even find an uncontroversial case where we have a probability distribution.

Mayo: Priors certainly can be (and often are) elicited by considering sampling properties of simple experiments. For example, if the parameter of interest is a proportion, one could elicit the plausible range (quantiles, formally) of values seen, from multiple samples of 100 independent draws from the population, and reverse-engineer a corresponding prior; the math involved is not challenging.
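One way the “reverse-engineering” step might be sketched, assuming (as an illustration only) a Beta prior on the proportion, so that counts out of 100 draws follow a beta-binomial predictive distribution. The elicited counts, the quantile levels, the integer grid of shape parameters, and the function names are all hypothetical choices, not a description of any published elicitation protocol.

```python
import math


def beta_binomial_pmf(k, n, a, b):
    """P(K = k) when p ~ Beta(a, b) and K | p ~ Binomial(n, p)."""
    log_choose = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    log_num = math.lgamma(k + a) + math.lgamma(n - k + b) - math.lgamma(n + a + b)
    log_den = math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)
    return math.exp(log_choose + log_num - log_den)


def predictive_quantiles(levels, n, a, b):
    """For each level q (given in ascending order), the smallest count k
    with P(K <= k) >= q under the beta-binomial predictive distribution."""
    result, cumulative, k = [], 0.0, -1
    for q in levels:
        while cumulative < q and k < n:
            k += 1
            cumulative += beta_binomial_pmf(k, n, a, b)
        result.append(k)
    return result


def elicit_beta(levels, elicited_counts, n=100, max_shape=30):
    """Grid-search integer Beta(a, b) shapes whose predictive quantiles are
    closest (in squared error) to the elicited counts. Returns (loss, a, b)."""
    best = None
    for a in range(1, max_shape + 1):
        for b in range(1, max_shape + 1):
            implied = predictive_quantiles(levels, n, a, b)
            loss = sum((i - e) ** 2 for i, e in zip(implied, elicited_counts))
            if best is None or loss < best[0]:
                best = (loss, a, b)
    return best


# Hypothetical elicitation: "in samples of 100, I would expect the count to fall
# below 10 about 5% of the time, below 24 about half the time, and below 42
# about 95% of the time."
levels = [0.05, 0.5, 0.95]
elicited = [10, 24, 42]
loss, a, b = elicit_beta(levels, elicited)
print(f"closest prior on the grid: Beta({a}, {b}), squared error {loss}")
```

The grid search is deliberately crude; the point is only that stated quantiles for an imagined sample pin down a prior up to the (mildly ad hoc) choice of parametric family.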

Note that this doesn’t commit you to an experiment of sample size 100, nor does it free you from making some (very mildly) ad hoc assumptions on the parametric form of the prior. One could easily elicit a slightly different prior – but in most situations the differences are no worse than using different ad hoc asymptotic approximations, in non-Bayesian work. Tony O’Hagan, with others, has a practical book on prior elicitation.

While, with enough imagination, you can always elicit priors by considering sampling behavior, you don’t have to, so I’d be reluctant to say priors are “intended to capture” this behavior. More fundamentally, as per Cox 1946, priors express prior plausibility of different parameter values, in a way that obeys probability axioms. Priors do this regardless of whether the parameters are fixed or random; in random effects work, one kicks the whole problem up a level and considers a prior on the distribution that generates the random effects in the first place.

For weakly informative priors, I’m not aware of a formal definition, but they seem to aim to be priors that rule out extreme values of parameters (i.e. downweighting them strongly) while not strongly upweighting any specific values, i.e. no point masses, or similarly strong spikes. Elicitation of them is, I imagine, a mix of the methods above and empirical checking that the output (probably averaged over repeated samples) behaves acceptably.

On a different point, you’re still writing about “a statistical hypothesis H” as if it’s a point mass, when often it’s really not. It may perhaps help to consider the prior (for some real-valued parameter) as a quantification of plausibility of *all* hypotheses about that parameter. In the proportion example, the prior tells us the prior probability of not only whether the proportion is less than or greater than 50%, but also between 6% and 47%, or between any other two numbers you like. Put all this information together and you get a full distribution, which gets updated by the likelihood. The resulting posterior lets us do inference on the parameter value, in any way we choose – but the different inferences will be coherent.

Aris: Thanks for tuning in from far away lands. This is one of my big puzzles. Why do Bayesians insist restrictions come in in terms of a prior as opposed to simply as restrictions in the manner you describe? It would be interesting to compare the difference between the frequentist test of the restrictions and the Bayesian “test”, perhaps along the lines of Gelman’s use of predictive distributions(?). Obviously there’s a big difference here in how this particular background information is being incorporated, and it would be good to get to the foundation and the consequences of each.

Perhaps one reason Bayesians (could say that they) insist that restrictions come in terms of a prior is because the prior distribution is itself equivalent to the observation of some data. Thus the posterior estimate is (can be viewed as) the aggregation of data in a mathematically consistent way. Whereas other forms of restriction are ad hoc (since they do not correspond to something that can be interpreted as the observation of data).

By the way, since the prior distribution can be interpreted as data, it can in principle be independently evaluated. Not that anyone seems to want to do this.
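One concrete way to read the "prior as data" idea is the conjugate Beta-binomial case, sketched below. The choice of the Beta family, the flat Beta(1, 1) baseline, and the specific counts are assumptions made for illustration; the comment itself does not commit to any of them.

```python
def update_beta(a, b, heads, tails):
    """Conjugate update: a Beta(a, b) prior plus binomial data gives Beta(a + heads, b + tails)."""
    return a + heads, b + tails

# Route 1: start from a Beta(3, 2) prior and observe 4 heads and 2 tails.
direct = update_beta(3, 2, heads=4, tails=2)

# Route 2: start from the flat Beta(1, 1) prior, treat the Beta(3, 2) prior as
# 2 earlier heads and 1 earlier tail, then add the same real observations.
pseudo_prior = update_beta(1, 1, heads=2, tails=1)
as_pseudo_data = update_beta(*pseudo_prior, heads=4, tails=2)

print(direct, as_pseudo_data)
```

Both routes end at the same posterior, so in this simple setting the prior behaves exactly like a small earlier data set pooled with the new one.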

Aris, you wrote “…in the context of frequentist modeling and inference these restrictions, like other forms of substantive information, are subject to evaluation; they can be formally tested against the data before being imposed. Introducing parameter restrictions via priors leaves hardly any room to assess their validity vis-a-vis the data!”

In frequentist statistics, when you pick a statistic, you’re also implicitly selecting a set of alternative models against which the substantive restriction can be tested with high power. The Bayesian version of this process involves expanding the hypothesis space to include these alternative models and then seeing if the posterior probability concentrates on the restricted model. So it is not the case that assessing the validity of substantive restrictions vis-a-vis the data is impossible in the Bayesian framework.
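A toy version of this model-expansion check, for a coin, might look as follows. Everything specific here is an assumption for the sake of the sketch: the restriction is taken to be p = 1/2, the expanded model puts a uniform prior on p, and the two models get equal prior weight.

```python
from fractions import Fraction
from math import comb

def posterior_prob_restricted(r, n, prior_restricted=Fraction(1, 2)):
    """Posterior probability of the restricted model (p = 1/2) after r heads in n tosses,
    when the alternative in the expanded space is p ~ Uniform(0, 1)."""
    # Marginal likelihood of the data under the restriction p = 1/2.
    m_restricted = Fraction(comb(n, r), 2 ** n)
    # Under a uniform prior on p, every count r in 0..n has marginal probability 1/(n+1).
    m_expanded = Fraction(1, n + 1)
    numerator = prior_restricted * m_restricted
    return numerator / (numerator + (1 - prior_restricted) * m_expanded)

if __name__ == "__main__":
    for heads in (52, 65, 80):
        print(heads, float(posterior_prob_restricted(heads, 100)))
```

Data close to the restriction leave most of the posterior mass on the restricted model, while data far from it move the mass to the alternative, which is the sense in which the restriction is being assessed against the data.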

I think this subject is tangential to the sort of use Gelman has in mind. The story is basically (warning: I might have some of the details wrong) that he’s run logistic regressions on a large corpus of publicly available well-conditioned data sets and found that parameter estimates almost always fall in the range [-5, 5]. So for logistic regression he uses a prior which concentrates its probability mass in that range. This regularizes estimates for ill-conditioned data, e.g., data that has total separation, resulting in a flat log-likelihood and non-unique MLE.
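The regularizing effect under separation can be seen in a four-point toy problem. This is a sketch under assumptions: the data, the no-intercept model, and the Normal(0, 2.5²) prior are all chosen here for simplicity and are not necessarily the prior Gelman actually uses.

```python
from math import exp, log1p

# Hypothetical, perfectly separated data: every negative x has y = 0, every positive x has y = 1.
X = [-2.0, -1.0, 1.0, 2.0]
Y = [0, 0, 1, 1]
PRIOR_SD = 2.5  # assumed scale; most prior mass lies within [-5, 5]

def log_sigmoid(z):
    """Numerically stable log(1 / (1 + exp(-z)))."""
    return -log1p(exp(-z)) if z >= 0 else z - log1p(exp(z))

def log_likelihood(beta):
    """Log-likelihood of a one-coefficient logistic regression without intercept."""
    return sum(log_sigmoid(beta * x if y == 1 else -beta * x) for x, y in zip(X, Y))

def log_posterior(beta):
    """Log-likelihood plus the log of a Normal(0, PRIOR_SD^2) prior, up to a constant."""
    return log_likelihood(beta) - beta ** 2 / (2 * PRIOR_SD ** 2)

def argmax_concave(f, lo=-50.0, hi=50.0, iters=200):
    """Ternary search for the maximizer of a concave function on [lo, hi]."""
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if f(m1) < f(m2):
            lo = m1
        else:
            hi = m2
    return (lo + hi) / 2

if __name__ == "__main__":
    print("likelihood maximizer on [-50, 50]:", argmax_concave(log_likelihood))
    print("posterior mode with the prior:", argmax_concave(log_posterior))
```

Without the prior the search simply runs to the edge of whatever interval it is given, because the log-likelihood keeps creeping upward; with the prior the mode is finite and sits well inside the range the prior favors.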

Aris:

I find the above comments unpleasant for similar reasons as I was bothered by David Hendry’s article discussed here. My disquiet comes from my experiences twenty years ago at Berkeley, where I had colleagues who put down Bayesian methods thoughtlessly, for example patting themselves on the back for being skeptical of hierarchical models and prior distributions and then using generic data models (and, no, I never saw them check the fit of those models to data in any applied contexts).

I am not saying anything negative about “frequentists.” For one thing, I’m not quite sure what exactly a frequentist is. What I would prefer is for people to not mischaracterize what I do. As the author of the most popular Bayesian book, I think I have as much right as anyone to call what I do “Bayesian.” And I do evaluate my models. We devoted chapters of our books to model evaluation, and I’ve also published several papers on the theory and practice of Bayesian model checking.

I also think of my prior distributions as soft constraints. I recognize there are non-Bayesian regularization methods as well. In describing my work, I’m never intending to claim that others can’t use similar methods but embedded in different inferential frameworks.

As I wrote in my first comment in this thread, I’m not making the argument that Bayesian methods are innocuous. I’m saying that Bayesian methods have been helpful to me, and that the idea of soft constraints or regularization has been a helpful way for me to think about priors. Given all the difficulties that Bayesians and others have had with noninformative priors and with subjective probability, it seems to me to be a good idea to be open to other ways of thinking about probability distributions for parameters.

Gelman: I don’t think Aris was putting down Bayesian methods for restricting parameters. I am puzzled by your degree of defensiveness in relation to this comment, especially given your own departures from Bayesian orthodoxy (because of something that happened 20 years ago?) Clearly by your earlier comment, you disagree with Lindley’s criticism of those Bayesians who claim

“A prior is introduced merely as a device for constructing a procedure, that is then investigated within the frequentist framework.”

Perhaps you have suggested to Lindley that he not mischaracterize what you do or ought to do under the banner of “Bayesian”? His conception seems much further from you than Aris’s. It is evident that some of us are trying to understand and compare the different ways one might make use of “restrictions”, and it would be extremely valuable to really see if/where the methods are doing essentially the same thing, and where not. Anyway I agree with your criticisms of subjective and non-informative priors, and thank you for your comment.

Mayo:

I was not angry at Aris; I just wanted to explain right away where I was coming from in this comment, emotionally speaking. My emotions may not be logical in this case, but there they are.

I am not so sure that things are always as simple in the frequentist world as Aris implies. One problem is that we have to make decisions based on the data we have and another problem is that we have to make decisions as to what data to collect based on the resources we have. That means that in practice we may have to accept that we have to believe something about nuisance parameters to make progress. A notorious case in point is given by two period cross-over trials. What you believe about carry-over is crucial to the inferences you make. (You can get very different inferences depending on whether you act as if it were negligible or act as if it might be anything at all.) It is also the case that the design makes it very difficult for us to say anything much about the carry-over effect itself. More generally it is also known that making a substantive choice of model based on tests of model adequacy can lead to very misleading inferences. There is quite a strong analogy with screening for disease in medicine. It is often assumed that such screening will always be beneficial but it turns out that this is not always the case.

Another point is that it tends to be quite difficult in frequentist frameworks to adopt smooth restrictions. A term is either in the model (in which case it is usually estimated as if it could be anything at all) or out of the model (which is equivalent to having it in the model but with a known parameter value of, say, 0).

Gelman: I meant to ask: you’ve only “recently” become devoted to these? I thought it’s been a while. I’m not sure I understand J. Berger’s criticism of them, assuming that is still true (recall the J. Berger deconstruction on this blog).

Mayo: As far as I can tell, what Gelman calls “weakly informative priors” and what Berger calls “weakly informative priors” are two very different things. I think Berger means what most other folks call non-informative or default priors. Presumably Berger calls them weakly informative because he holds that there are no such things as non-informative priors — which is odd, because that’s one of the reasons the term “default” was introduced.

Corey: I don’t think so, since J. Berger favors default priors. Note what J. Berger said at http://errorstatistics.com/2011/12/29/jim-berger-on-jim-berger/

I see my remark might need clarification. My “I don’t think so” was to the claim that J. Berger regards his default priors as weakly informative. Maybe that wasn’t your claim though.

Mayo:

My first publication on weakly informative priors is from 2006, two years after the release of Bayesian Data Analysis, second edition. This is recent to me.

P.S. As I wrote in my comment on Berger’s comment, I don’t think that what he’s talking about is the same thing that I’m talking about when I talk about weakly informative priors.

Gelman: That is recent, I didn’t realize that.

Christian: (I can’t reply below you.) Well you should care if those accounts of epistemic probability are talking about anything you can understand.

Both error statistical and Bayesian accounts assign probabilities to events. So start with an uncontroversial case of assigning probabilities to events.

For example, suppose it is given that a coin tossing procedure follows the Bernoulli model, with probability of success at each trial equal to p (call this h). Given n iid trials I can deduce from h the probability of event x: r heads in n trials. Say P(x;h) = .7. You say that epistemic probability asserts that .7 is my “rational strength of uncertainty” in the event x occurring. Yes? (After all, subjectivists do accept the relationship between relative frequency and degree of belief, right?)

See my comments above; if the parameter p is of interest, the function of it that tells you P(r heads in n trials) is the integral (over what we know) of an nth order polynomial in p, that also depends on r and n. This is, frankly, not easy to think about directly – a Normal location example would be a much simpler starting place – though the calculations here are entirely possible.

Assuming a beta(a,b) prior on p, given a statement about P(between r1 and r2 heads in n trials) you could reverse-engineer the values of “a” and “b”. In this example, the degree of overdispersion of the a priori P(r in n) statements, relative to the corresponding pure frequency statements for fixed p, tells you about the dispersion in the prior.

In your example, yes, a prior may assign 70% epistemic belief to the event that, in a single hypothetical experiment, we get exactly r heads in n trials (though you’d need a very small value of n to manage 70%; try it). On its own, this information is however not enough information to fully describe the prior. And, if one is not doing an experiment in which that value of “r” or “n” are particularly relevant or plausible, it’s not even a useful summary of the prior. So in practice, other statements or summaries of the prior are preferable.

Guest: In relation to your last comment: I was responding to Christian, where the issue is somewhat different from yours. Sorry, my Socratic style was intended to begin with a relatively uncontroversial case, in order to highlight some differences were we to go further to treat the truth (or adequacy or success or the like) of a statistical hypothesis like an occurrence of an event.

Were I to have had a further round with Christian on this, I might have suggested the following as my best attempt to do something similar in assigning a probability to h. This involves trying to see the truth (or adequacy) of hypothesis h as akin to an occurrence of an event even though h concerns this particular coin tossing mechanism in front of me. Let us assume there is good evidence E that h is approximately correct. (The evidence may be obtained by frequentist means. E.g., I continually observe, and readily generate, various observed % of heads extremely close to the deductively derived % , given by h, and everything else passes muster by a variety of tests and checks, maybe we even have a theory about such procedures. Whatever.)

The point is that I want to express, let’s say, good evidence for h, so I take all the information in E, and judge that P(h;E) = .9 or some other high number. Now we’re trying to understand a rational strength of uncertainty of .9. Is .9 just a kind of abbreviation for “E is rather good evidence for h”? But from that we clearly do not know it behaves like a probability and obeys the probability calculus (e.g., when combining it with other hypotheses, etc.).

My best attempt* to provide it a probability that reflects the evidence for h (without falling into the fallacy of probabilistic instantiation) would go something like this: I consider various cases where evidence just about as impressive as E has been had for a group of hypotheses (they can be in various fields or related in some way). I look at this group of hypotheses, and consider the % that turned out to be true. Say 90%. Then that might be my meaning of the statement P(h;E) = .9. I rationally expect h to turn out true to degree .9. [This is related to Reichenbach’s frequency account, but is importantly different (and I think more satisfactory). Reichenbach thought eventually, at some point in the future, we might be able to judge of a hypothesis h whether it’s the kind of hypothesis that frequently or rarely turns out to be true. I discuss this in Ch 4 of EGEK.]

*Never mind for the moment if it’s workable.

Adequate for what purpose? True evaluated how? These are not easy questions, nor are the answers likely transportable across different areas of science. So, I don’t think generic measures of support for point null hypotheses are going to work, under any statistical paradigm.

What statisticians can do, reasonably successfully, is to measure signal, and noise, and their ratio, when estimating specified parameters that are not too far from those that govern the outcomes of the experiment at hand – which has a known design.

Perhaps your Socratic dialog had better start with something as relatively uncontroversial as this.

Guest: Oy! Missing my mark again. The example actually didn’t matter*. I wanted to start from the “other” (easier) direction, the direction of deductively assigning probabilities to events, any events (here, based on the binomial outcomes). (Then one might consider going from x to h.) I am trying to get at the assumption that things we say about the probability of an event are akin to how much evidence there is for hypotheses (which in turn give probabilities to events). In my Socratic response (to Christian) I did my best to find a way to make it make sense for hypotheses (the improved Reichenbach move) and do not question that it makes sense for ordinary statistical events.

But now let’s question that, just to show how (apparently) unusual my perspective is (though not to me). This baby, non-exotic, hypothesis h entails various events, say x’, x” with different probabilities, say p(x’;h) = .9 and p(x”;h) = .1 (I don’t say the events are points, I don’t care). It seems odd to me to say these reflect the different amounts of evidence I have for the events. I do have evidence for the various probabilities of events (i.e., for the distribution asserted in h) and my evidence would seem the same for all the events governed by h: namely the evidence I have for h (which would be in terms of things like how severely h has passed tests).

(I don’t care if model assumptions are included or are separately appraised).

*And I only put in “adequate” or “successful” so that someone didn’t say no hyps are true, and dismiss my best attempt out of hand.

Not being too clear at this hour, but here it goes….

I appreciate the argument you’re trying to make, really. But the problem with assigning prior probabilities to events, as a foundation, is that many events we actually measure (or try to measure) have probability zero: in many experimental settings, one will simply never see exactly identical experimental results. Comparing the plausibility of various probability-zero events does not seem like a good basis for inference – and to make them non-zero, one has to introduce artificial categorizations that, as well as being ad hoc, are known to throw away power/precision.

However, I don’t want to dismiss the entire underlying idea. Bayesians, like everyone else, can compute signal and noise, and related quantities, without ever dividing by zero. That the tradeoff between signal and noise might change, perhaps dramatically, depending on some third quantity derived from the posterior (that addresses relevance of the signal, say, like your contrast between x’ and x”) seems totally feasible, and indeed sensible in some situations. Adapting in this way also seems entirely compatible with the Bayesian paradigm, so far as I can see.

There are lots of voices here saying interesting things, but I think talking about slightly different things…. at the risk of adding to the confusion I will jump in here… (although I should prefix it with a “I can’t speak for Christian”)

Also FWIW I really like the Bernoulli model as a starting point for discussion (above the normal)… it leads to the very exciting topic of exchangeability which bridges probability theory and philosophy…

In response to your question there is no need to think about proportions to discuss probability. There are good arguments for seeing probability as a naturally arising primitive within decision theory…. but you are right it isn’t obviously meaningful to apply decision theory to a hypothesis.

However you can marginalise out p i.e. go from P(r,p)=P(r|p)P(p) to just P(r), which under conjugate priors is analytical i.e. a Polya (or Beta-binomial) distribution.

Under your assumptions above r is an integer between 0 and n.

Under a common “Laplace” prior which is flat on (0,1) (or the beta(1,1) if you prefer) then it turns out that the probability of r is also uniform i.e. equal to 1/(n+1) for all of its n+1 possible values. In this situation putting a uniform distribution on the statistic r is every bit as natural as putting a uniform distribution on the parameter.
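The uniformity claim is easy to check exactly. The sketch below uses the standard Beta-binomial formula P(r) = C(n, r) B(r + a, n − r + b) / B(a, b) with integer a and b so that everything stays in exact fractions; the function names and the choice n = 10 are just for illustration.

```python
from fractions import Fraction
from math import comb, factorial

def beta_fn(a, b):
    """Beta function B(a, b) for positive integers a and b, as an exact fraction."""
    return Fraction(factorial(a - 1) * factorial(b - 1), factorial(a + b - 1))

def marginal_pmf(r, n, a=1, b=1):
    """P(r heads in n tosses) after integrating p out against a Beta(a, b) prior."""
    return comb(n, r) * beta_fn(r + a, n - r + b) / beta_fn(a, b)

if __name__ == "__main__":
    n = 10
    print([str(marginal_pmf(r, n)) for r in range(n + 1)])        # flat prior
    print([str(marginal_pmf(r, n, 2, 2)) for r in range(n + 1)])  # Beta(2, 2), for contrast
```

With the flat prior every count gets the same probability; any other Beta prior gives a non-uniform distribution over counts.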

It is worth noting that this distribution (along with every other distribution which is produced by applying Bayes theorem to the parameter p) has 2 interesting properties.

While the distribution of the statistic is uniform, the distribution of sequences is not. In the case of n=2:

P(Throw1=Heads,Throw2=Heads)=1/3

P(Throw1=Heads,Throw2=Tails)=1/6

P(Throw1=Tails,Throw2=Heads)=1/6

P(Throw1=Tails,Throw2=Tails)=1/3

and the marginal P(Heads)=0.5

Which leads to

P(Throw2=Heads|Throw1=Heads) > P(Throw2=Heads)

i.e. if you see a head, it increases your probability that the next throw is a head, this is very familiar reasoning.
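The four sequence probabilities and the conditional statement can be reproduced directly. The sketch assumes the flat prior on p from the discussion above and uses the identity that the integral of p^h (1 − p)^t over (0, 1) equals h! t! / (h + t + 1)!; the variable names are illustrative.

```python
from fractions import Fraction
from itertools import product
from math import factorial

def sequence_prob(seq):
    """P(one exact sequence of H/T) when p ~ Uniform(0, 1)."""
    h = seq.count("H")
    t = seq.count("T")
    return Fraction(factorial(h) * factorial(t), factorial(h + t + 1))

# Joint distribution over the four possible outcomes of two throws.
joint = {"".join(s): sequence_prob(s) for s in product("HT", repeat=2)}

p_first_heads = joint["HH"] + joint["HT"]
p_second_heads = joint["HH"] + joint["TH"]
p_second_given_first = joint["HH"] / p_first_heads

if __name__ == "__main__":
    for seq, prob in joint.items():
        print(seq, prob)
    print("P(Throw2=Heads) =", p_second_heads)
    print("P(Throw2=Heads | Throw1=Heads) =", p_second_given_first)
```

The sequences HT and TH get the same probability, which is exchangeability, and seeing a head on the first throw raises the probability of a head on the second above its marginal value.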

It turns out the above is true if the exchangeable distribution on coin toss outcomes is exchangeably extendable, i.e. can be viewed as the marginal of an arbitrarily long distribution.

In order to guide intuition on this it is worth considering situations in which the above two conditions are false. Card counting in blackjack is a good example. Here you observe cards that are either high or low. If you observe a low card then your conditional probability for the next card being high increases (and you might up your bet). Equally this is a distribution on an exchangeable finite sequence over the number of cards in the pack say 52. This exchangeable distribution is not the marginal of any exchangeable distribution of length 53 or greater.
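The card example can be put in numbers with a simplified pack. The even split into 26 high and 26 low cards is an assumption made purely for illustration (real counting systems classify cards differently); the point is only the direction of the dependence.

```python
from fractions import Fraction

HIGH, LOW = 26, 26  # hypothetical split of a 52-card pack

def p_next_high(high_left, low_left):
    """Probability that the next card dealt is high, given what remains in the pack."""
    return Fraction(high_left, high_left + low_left)

p_high = p_next_high(HIGH, LOW)                  # before any card is seen
p_high_after_low = p_next_high(HIGH, LOW - 1)    # one low card has been dealt
p_high_after_high = p_next_high(HIGH - 1, LOW)   # one high card has been dealt

# Exchangeability: the order of the first two cards does not change the probability.
p_low_then_high = (1 - p_high) * p_high_after_low
p_high_then_low = p_high * (1 - p_high_after_high)

if __name__ == "__main__":
    print(p_high, p_high_after_low, p_high_after_high)
    print(p_low_then_high == p_high_then_low)
```

Here seeing a high card lowers the probability of another high card, the opposite of the coin case, even though the sequence is still exchangeable.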

Typically in science we can carry out an experiment an unlimited number of times, and increase our belief in something having observed it happen. If this is the case then the usual Bayesian machinery of distributions on parameters has these properties and seems relevant (provided you see merit in decision theory of course).

As another aside… I am puzzled why you see the need for encouragement to release daring analysis… I await your next post on Wasserman with interest!

David. I’ll study the other points at another time, but on your last, I don’t need the chorus, I just wanted it. I may revise my post actually…tone it down, smooth it out…

Applied statistics is about using inductive thought to infer conclusions from real-life data sets in relation to their scientific backgrounds. If you only address data sets which can be analysed using a neat application of Bayes’ theorem, then you get a pretty warped view of what it is all about. See also ‘The Life of a Bayesian Boy’ on my website. Dennis is a wonderful mathematical statistician.