Some of the recent comments on my May 20 post lead me to point us back to my earlier (April 15) post on dynamic Dutch books, and to continue where Howson left off:
“And where does this conclusion leave the Bayesian theory? ….I claim that nothing valuable is lost by abandoning updating rules. The idea that the only updating policy sanctioned by the Bayesian theory is updating by conditionalization was untenable even on its own terms, since the learning of each conditioning proposition could not itself have been by conditionalization.” (Howson 1997, 289).
So a Bayesian account requires a distinct account of empirical learning in order to learn “of each conditioning proposition” (propositions which may be statistical hypotheses). This was my argument in EGEK (1996, 87)*. And this other account, I would go on to suggest, should ensure that the claims (a term I prefer to “propositions”) are reliably warranted or severely corroborated.
*Error and the Growth of Experimental Knowledge (Mayo 1996): Scroll down to chapter 3.
- Howson, C. (1997). “A Logic of Induction,” Philosophy of Science 64(2):268-290.
- Mayo, D. G. (1996). Error and the Growth of Experimental Knowledge. Chicago: University of Chicago Press.
- Mayo, D. G. (1997). “Duhem’s Problem, The Bayesian Way, and Error Statistics, or ‘What’s Belief Got To Do With It?’” and “Response to Howson and Laudan,” Philosophy of Science 64(2): 222-244 and 323-333.
I don’t get it. If I design a machine to perform online Bayes in response to some signal, it just goes to the RAM location of the conditioning propositions, which in this case are the measured signal data. What corresponds to the “distinct account of empirical learning” in this setup?
I was alluding to warranting the model and conditioning statements, according to what Bayesians say (and for the general purpose of scientific learning). Are they also assigned probabilities, or are they accepted or given in some sense? Of course, a machine “user” needn’t be bothered with what is hidden away in the online black box.
Anyway, if your answer is “no”, fine; subjective Bayesians are always free to just say no (i.e., all you need is Bayes).
I’m definitely baffled by those who want to update using something other than conditionalization, especially if they call themselves “Bayesians”. I think this usage is peculiar to philosophy. I can’t claim that no statistician ever used the term “Bayesian” to refer to a method that didn’t have a prior updated by conditionalization, but it would certainly be non-standard terminology.
I hold that for updating, in theory all one needs is Bayes, but “in theory, theory and practice are the same. In practice, they’re not.” My practice is much closer to Gelman than de Finetti.
As a side note, there is not much Bayes either in what is called “PAC-Bayes” in the machine learning community (and taken up by some statisticians). I was recently at a seminar about this, and the speaker was very baffled that there was someone in the audience who knew Bayes’s original paper and wanted to know what on earth the presented material had to do with it.
Yes, a lot of these practitioners seem to think “Bayesian” is another word for “using statistics”. I recall J. Pearl thinking this way at first, years ago, and now I think he’s become a non-Bayesian.
Is this an update from his self-declared half-Bayesian position?
Guest: That was my understanding from conference reports rather than published work of his (which I do not follow). I think the causal modeling literature tends to be in its own category. I’m not claiming he himself altered his status from “half” to “non”. It was just an informal blog comment, and for all I know the conference reports may be impressionistic rather than verbatim. But I’m not sure what counts as non-Bayesian these days. Are you? Should holding (i) and neither (ii) nor (iii) really be “half”? I don’t know; maybe only Bayesians quantify such things. I don’t consider someone a Bayesian just because they use conditional probability (do you?), but things like “Bayes nets” needn’t be more Bayesian than that. Do you agree?
@Mayo, as you ask, I define a method as Bayesian if it can be derived using Bayesian updating, i.e. if it results from some manipulation of a posterior resulting from some prior and model. Note the “some” here; methods can be Bayesian without us yet knowing that they are – and I suspect, for example, that this applies to the methods for causal inference that Pearl works on. Methods can also be approximately Bayesian, if their large-sample behavior is the same as that of a Bayesian method or methods. A person is being Bayesian when they use Bayesian methods – and similarly (approximately) frequentist when they use (approximately) frequentist methods. Both methods and people can be both Bayesian and frequentist, perhaps approximately; I see no difficulty with this.
I’m aware that the definitions above are very broad. They make it very difficult, for example, to show that methods are not Bayesian; the space of priors, models, and posterior manipulations one has to rule out is typically massive. I also make no claim that, on its own, the property of being Bayesian is any guarantee of being useful. But neither is the property of being frequentist.
Guest: Thanks for this: it has informative aspects, and some not so informative.
“I define a method as Bayesian if it can be derived using Bayesian updating, i.e. if it results from some manipulation of a posterior resulting from some prior and model. Note the “some” here; methods can be Bayesian without us yet knowing that they are”.
Does this exclude methods that deny Bayes’ rule, Bayesian conditioning and the like? (that would make it informative).
Does it encompass “Bayesian reconstructions” such that it countenances what I call “painting by numbers”? (that would make it uninformative.)
(Asides: In the Pearl paper you posted, the half-Bayesian part seemed to be based solely on (i) thinking that background knowledge should enter into inquiry, that one shouldn’t start with a blank slate. I.J. Good, even with his numerous types of Bayesians, by the way, considered the key to be willingness to assign probabilities to hypotheses regarded as true or false, correct/incorrect.)
It is because of the ambiguity and vagueness of “frequentist” (or even “sampling theory”) that I coined the name “error statistical”. This term is informative, especially together with a requirement for evaluating (something like) severity of test. I can see methods counting as Bayesian, under your conception, that would not seem to satisfy error statistical requirements. (This might help you to qualify your definition a bit more.)
Anyway, my way of proceeding is somewhat in the other direction: identify a philosophy of statistical inference/learning and let the principles that emerge point the way for appraising and interpreting methods (wherever they come from). (Nor do I restrict myself to statistical method, since it has to be a general philosophy of science.) Of course one may reject this statistical philosophy (even for its intended domain), but that would at least make the contrasting goals clear.
The debates, the most interesting and important ones, are really not about terms and names. It’s not a counting game (i.e., how many people can we show to be housed under the umbrella of so-and-so-ism). There’s a lot of confusion, now perhaps more than ever (oddly enough), about whether and why there are any fundamental, philosophical differences, and where they would show up in the use and interpretation of methods (even within a given “school”). I think there are. So that’s what I’ve been keen to identify and clarify.
Some short replies; your questions/topics are starred
* Does this exclude methods that deny Bayes’ rule, Bayesian conditioning and the like?
– Only if they cannot be viewed as some form of operation on a posterior (and model, and prior)
* Does it encompass “Bayesian reconstructions” such that it countenances what I call “painting by numbers”?
– Sorry, I’m not sure what you mean exactly by “painting by numbers”. But I’m viewing the method, for all datasets, as Bayesian and not the inference for one specific set of data.
* Asides: …
– assigning probabilities to sets of parameter values is integral to establishing a prior, and getting a posterior. So yes, I agree with Good on that. However, while doing this assignment for point hypotheses is within the rules of the mathematical framework, in my experience it is very rarely an appropriate way to proceed. I’d also note that there exist Bayesian methods for interrogating point null values beyond just putting a point mass in the prior, so I don’t view testing point nulls as a big gulf between the paradigms.
* I can see methods counting as Bayesian, under your conception, that would not seem to satisfy error statistical requirements.
– Certainly this could happen, I don’t think there’s much debate there.
Instead – in line with your comments on general philosophical direction – it would be more useful to show if there are methods that satisfy error statistical requirements that are explicitly not Bayesian (even approximately). This is hard to do; see comments on the “massive” space. But it would be interesting, and informative.
* Counting games
– I think that, among statisticians, indifference is more the current issue, not confusion. To counter that, and to show that the philosophical differences between paradigms matter and merit investigation, one has to give real-world examples where useful methods cannot be viewed as, e.g., Bayesian.
PS: I think that one could have something very close to Gelman’s practice with de Finetti’s philosophy, acknowledging that the philosophy is based on idealisation and that one sometimes wants to violate its principles for good reasons (which then should be explained). However, it seems that Gelman would not want de Finetti’s philosophy (he doesn’t seem to like open subjectivity), nor do I believe that de Finetti would have been happy with Gelman’s eclectic practice.
Right. By the way, I’m not sure how de Finetti entered this exchange.
I introduced de Finetti as an exemplar of one who both preached “all you need is Bayes” and practiced what he preached.
Good for Gelman!
Corey: I was heartened* by your first paragraph, then confused by your second, which seemed to take it all back. Anyway, these are very interesting comments; I will have to get back to them later today.
*Heartened because it’s hard to pin Bayesians down for analysis as they keep morphing into something else. My own feeling is that Gelman’s account, which goes even further than others, needs a different name (maybe error statistical Bayes).
Mayo: I’m a slow writer, and many of your comments (such as this one, and also the one by Christian Hennig about approximation) demand replies that would take me a long time to write. The short version of my reply to this comment goes like this.
Cox’s theorem licenses representing uncertainty by probability and updating by conditionalization, but it isn’t a principle for turning states of information into prior probability assessments. If we had such principles, then Bayes would be sufficient for proceeding; lacking them, one approach is to encode one’s prior information into a probability distribution as best one can, update, and then perform subsequent checking to improve confidence that the tentative prior probability assessment neither left out crucial information nor included assumptions not in evidence.
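[A toy sketch of that encode, update, check loop, added for concreteness; the normal-mean model, prior values, and data below are all invented for illustration, not anything from this exchange.]

```python
# Toy sketch of the encode -> update -> check loop described above (invented
# numbers, assuming a normal-mean model with noise standard deviation known = 1).
import numpy as np

rng = np.random.default_rng(1)
data = rng.normal(3.0, 1.0, 20)                 # "observed" data

# Encode tentative prior information as mu ~ N(0, 10^2).
mu0, tau0 = 0.0, 10.0

# Conjugate Bayesian update (known noise sd = 1).
n, xbar = len(data), data.mean()
post_var = 1.0 / (1.0 / tau0**2 + n)
post_mean = post_var * (mu0 / tau0**2 + n * xbar)

# Check: simulate replicate datasets from the posterior predictive and compare a
# summary with the observed one; a gross mismatch would send us back to the prior/model.
mu_draws = rng.normal(post_mean, np.sqrt(post_var), 1000)
reps = rng.normal(mu_draws[:, None], 1.0, size=(1000, n))
print("observed sd:", round(data.std(ddof=1), 2),
      "| replicate sd 5th-95th pct:",
      np.round(np.percentile(reps.std(axis=1, ddof=1), [5, 95]), 2))
```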
What is being checked: whether you adequately captured your prior degrees of belief in a claim, hypothesis or model? If the prior is changed by data, how can it represent your beliefs based only on info prior to the data? Or do you not get to change your prior if you don’t like the posterior?
I realize it’s usually too hard to write adequate comments…. Also, it’s way too late to be writing. If you like, send me something longer for posting.
“What is being checked: whether you adequately captured your prior degrees of belief in a claim, hypothesis or model?”
More-or-less. I would phrase it as checking whether I adequately encoded my available prior information. I hold, contra Kadane, that it’s not about my belief — it’s about the degree of plausibility justified by a given state of information. In principle, any two agents with exactly the same prior information should end up with the same prior distribution.
“If the prior is changed by data, how can it represent your beliefs based only on info prior to the data? Or do you not get to change your prior if you don’t like the posterior?”
Since assigning priors by introspection is error-prone, one useful insight is that it makes little material difference whether a given piece of information is in both the prior and the likelihood or just in the likelihood. For example, in linear regression with a reasonable amount of data, Bayesians can get away with assigning the so-called noninformative prior for the parameters, yielding posterior credible intervals that coincide with frequentist confidence intervals. In that analysis the prior on the logarithm of the variance of the noise is uniform over the whole real line, which corresponds to no reasonable prior state of information, e.g., we know in advance that our timing devices don’t have measurement noise on the order of the Planck time unit nor on the order of hours. This prior works anyway because the data themselves rule out ridiculously precise or imprecise measurement noise variance; encoding that information in the prior would make no material difference to the subsequent inferences.
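[To see the coincidence concretely, here is a minimal simulation sketch, not the analysis discussed in this thread: under a flat prior on the coefficients and on log(sigma) in a standard normal linear model, the marginal posterior of a coefficient is a t distribution centred at the OLS estimate with the usual standard error, so the 95% credible interval equals the 95% confidence interval. The data and parameter values below are invented.]

```python
# Minimal sketch: flat-prior Bayesian credible intervals coincide with
# frequentist confidence intervals in the normal linear model.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 50
x = rng.uniform(0, 10, n)
y = 1.5 + 0.8 * x + rng.normal(0, 2.0, n)       # simulated data

X = np.column_stack([np.ones(n), x])            # design matrix with intercept
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta_hat
s2 = resid @ resid / (n - 2)                    # residual variance estimate
se = np.sqrt(np.diag(s2 * np.linalg.inv(X.T @ X)))

t = stats.t.ppf(0.975, df=n - 2)
for name, b, s in zip(["intercept", "slope"], beta_hat, se):
    # Frequentist 95% CI and flat-prior 95% credible interval are the same numbers.
    print(f"{name}: {b:.3f} +/- {t * s:.3f}")
```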
I once built a model sufficiently reminiscent of linear regression (actually ~3000 linear regressions) that I unthinkingly used the flat prior on the log-variance parameters. Upon running my Gibbs sampler to generate samples from the posterior distribution, I found that some variance parameters were getting stuck close to zero, which messes up the Gibbs sampler. There’s a technical fix for this, Gelman’s parameter-expanded Gibbs sampler, but I realized the source of the problem was that neither the improper prior density I had used nor the likelihood ruled out effectively infinitely precise measurements for some of the “regressions”. But I had prior information about the expected magnitude of the measurement error from the literature and also directly from the analytical chemist who collected the data. So I switched to a different prior density, one that encoded information ruling out ridiculously high precision. This fixed the glitchy Gibbs sampler and also materially improved the resulting inferences.
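[To make the kind of prior swap concrete, here is an invented-scale sketch, nothing like the real ~3000-regression model: the improper flat prior on log(sigma) piles density onto absurdly small noise scales, while a lognormal prior centred on a plausible measurement-error magnitude assigns them essentially no mass.]

```python
# Invented-scale comparison of the two prior choices described above: the flat
# prior on log(sigma), i.e. p(sigma) proportional to 1/sigma, explodes near 0,
# whereas a lognormal prior centred on an assumed error scale of ~0.5 rules out
# effectively infinite precision.
import numpy as np
from scipy import stats

sigma = np.array([1e-6, 1e-3, 0.5, 5.0])        # candidate noise scales

flat_on_log_sigma = 1.0 / sigma                 # unnormalizable; huge near sigma = 0
informative = stats.lognorm.pdf(sigma, s=1.0, scale=0.5)  # assumed plausible scale

for s_, f, g in zip(sigma, flat_on_log_sigma, informative):
    print(f"sigma={s_:g}:  flat-prior density {f:.3g}   informative density {g:.3g}")
```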
So it was not the case that the prior information was changed by data, nor was it the case that I couldn’t change the prior density when I found features of the posterior inconsistent with the prior information.
RESPONSE TO COREY:
Let me just consider your first points, since there are already so, so many issues (many of which we took up in earlier “deconstructions”, e.g., of J. Berger).
Mayo (from before): “What is being checked: whether you adequately captured your prior degrees of belief in a claim, hypothesis or model?”
Corey: More-or-less. I would phrase it as checking whether I adequately encoded my available prior information. I hold, contra Kadane, that it’s not about my belief — it’s about the degree of plausibility justified by a given state of information. In principle, any two agents with exactly the same prior information should end up with the same prior distribution.
Mayo: So then you test, given the data, whether you correctly encoded your available prior information (which you imagine would be similar to that of any other agent in your shoes with your info). But, except for very special cases, I don’t even see how to attach a meaning to this. It assumes the existing knowledge is representable by a probabilistic prior, but as Bayesians now say, neither introspection nor elicitation adequately yields this, even assuming we wanted it.
I am making inferences about some aspect of Newtonian gravity theory, say in 1919, and all of us (at that point) have pretty strong degrees of belief in Newton, based on tons of evidence. The eclipse data could not have been taken to indicate its falsity.
We would do better, as did the scientists of the day, by deliberately excluding the beliefs and biases to which we are wedded, even on strong evidence–for purposes of correctly interpreting this new data.
Corey: Since assigning priors by introspection is error-prone.
Mayo: Let’s try to parse what it can mean for your introspection of your beliefs to be error prone. Perhaps the idea is that there is an (objective, logical) relation between existing data x and any hypothesis H. You try to capture what this relationship really is, by introspection, and discover that this is a method that is error prone. How did you discover that it was error prone? Possibly by consulting one of the default Bayesian priors (never mind that they are generally improper). Is that the idea? I’m really trying to understand. But then that couldn’t be Kadane’s (subjective) view. (Could it?)
So maybe it’s not that there is an evidential relationship between x and H “out there”, but rather that it is “in here”, or in us; yet this does not help to make sense of the notion of being wrong. Perhaps your idea is that you look at the data and decide the prior must have been wrong, but how do you know the original prior wasn’t right? And what exactly can “wrong” mean? (Lindley says it can mean nothing.)
CONTINUATION OF RESPONSE TO COREY (as these points are general and they keep arising):
COREY: For example, in linear regression with a reasonable amount of data, Bayesians can get away with assigning the so-called noninformative prior for the parameters, yielding posterior credible intervals that coincide with frequentist confidence intervals.
MAYO: Sure, we’ve noted this before, but this scarcely shows the value/appropriateness of the so-called “non-informative” prior. As Fisher points out: https://errorstatistics.com/2012/02/17/two-new-properties-of-mathematical-likelihood/. (Recall, of course, that the corresponding Bayes test can be wildly different from the credible interval). There’s also a difference in interpretation and justification.
COREY: I once built a model sufficiently reminiscent of linear regression (actually ~3000 linear regressions) that I unthinkingly used the flat prior …. I found that some variance parameters getting stuck close to zero, which messes up the Gibbs sampler.
MAYO: Yes, flat priors are scarcely uninformative, in general. Again, no obvious points are scored for this way of proceeding. But I do see your point that you’re able to correct a prior here when you know it makes absolutely no sense…. Still one might rightly be uncomfortable about using the technique for cases where one was trying to find things out and couldn’t count on “bon sens” to rescue one from absurdity.
In my reply I’ve sliced-n-diced your text a bit so as to consolidate the ideas I want to communicate and put them in a certain order. If I don’t quote a chunk of text explicitly, I’m not responding to it.
Mayo: Let’s try to parse what it can mean for your introspection of your beliefs to be error prone…. you test, given the data, whether you correctly encoded your available prior information… Perhaps your idea is that you look at the data, and decide the prior must have been wrong, but how do you know the original prior wasn’t right? And what exactly can wrong mean (Lindley says it can mean nothing).
Corey: I really want to emphasize that when I say prior information I’m not talking about my beliefs but rather a body of knowledge external to myself that I can actually point to. The task is to find a prior probability distribution that captures all the implications of that body of knowledge, but nothing more. “Wrong” means failing at that task in a way that materially affects the subsequent inferences. By saying that introspection is error-prone, what I mean is that the mathematical consequences of a particular choice for the form of the prior and likelihood are not transparent to the human mind (this human’s mind, anyway), and thus can be wrong in the sense given above.
One doesn’t check the data directly — one checks the resulting posterior distribution. If it gives an inference whose implausibility is obvious on the given prior information, then one has to go back, figure out why, and solve the problem. In my example, the original posterior distribution would not have ruled out implausibly precise measurements because, as became apparent, my likelihood and data couldn’t provide that information and I hadn’t put it into the prior density. Prior sensitivity checking is based on similar reasoning, to wit, that if a posterior inference is stable to different choices of prior, then the inference is more likely to be a consequence of information in the data rather than information in the prior. None of this makes sense on a doctrinaire subjectivist account of coherent beliefs. In my account, the focus is always on the relationship between information and the probability distributions that encode it rather than on belief.
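[A stripped-down example of the prior-sensitivity reasoning just described, using a made-up binomial problem rather than anything from this thread: if the posterior summaries barely move across reasonable priors, the inference is coming from the data rather than from the prior.]

```python
# Prior-sensitivity check for a made-up binomial example: compare posterior
# summaries under several candidate Beta priors.
from scipy import stats

successes, trials = 37, 100
for a, b in [(1, 1), (2, 2), (5, 5)]:           # three candidate priors
    post = stats.beta(a + successes, b + trials - successes)
    lo, hi = post.ppf([0.025, 0.975])
    print(f"Beta({a},{b}) prior -> posterior mean {post.mean():.3f}, "
          f"95% interval ({lo:.3f}, {hi:.3f})")
```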
Mayo: It assumes the existing knowledge is representable by a probabilistic prior…
Corey: It does, and I’d absolutely agree that this is a pretty tendentious assertion on its face. Nevertheless I hold that this is a theorem, not an assumption. To really address this issue, I’d have to write up a long explanation of Solomonoff induction, which, frankly, I don’t understand as well as I should. So, a link.
Mayo: Perhaps the idea is that there is an (objective, logical) relation between existing data x and any hypothesis H.
Corey: Yes, that’s the hope. Again, I have to point to Solomonoff induction, but that doesn’t fully address the issue of objectivity because Solomonoff induction is relative to a chosen Turing-complete language and I don’t know of an objective basis for choosing one over another. (The previous point about representability requires only that there exist at least one language that can do the job.)
Mayo: We would do better, as did the scientists of the day, by deliberately excluding the beliefs and biases to which we are wedded, even on strong evidence–for purposes of correctly interpreting this new data.
Corey: I don’t disagree, but I suspect we’d differ on how to continue that thought. I’d say likelihood; I anticipate that you’d say severity. (As I discovered when I read Richard Royall’s little book, nothing in Royall’s writing contradicts the Jaynesian view (as interpreted by me, at least). He’s all about evidence in the new data being measured by the likelihood function and nothing else. He just doesn’t take the next step — the one justified by Cox’s theorem — and combine the evidence with the available prior information.)
Mayo: Sure, we’ve noted this before… There’s also a difference in interpretation and justification.
Corey: I considered including a parenthetical addressing that, but left it out for length reasons. You and I are in violent agreement about the existence of the difference in interpretation and justification. On the question of which justification is sound… not so much.
Mayo: Still one might rightly be uncomfortable about using the technique for cases where one was trying to find things out and couldn’t count on “bon sens” to rescue ONE from absurdity.
Corey: Yep. This is still a fallible endeavor.
Corey: Thanks so much for this. I will have to wait till I arrive at my destination to respond in more detail, but will ponder in airports. One thing: Royall will only give you comparative likelihoods–a comparative appraisal (of fit) among specified hypotheses–not a way to test and reject a claim/model, and certainly not a way to control error-proneness.
Corey:
“The task is to find a prior probability distribution that captures all the implications of that body of knowledge, but nothing more.”
This was the kind of ideal that Carnapians and other logicists had/have. I’m glad you bring it out, because the idea of trying to consider all the implications of a body of knowledge is not something anyone trying to find new things out would or should try to do. It’s not merely an idealization—it’s opposed to what needs to be done in human inquiry; namely, ask a question such that one is very likely to be able to find out something informative upon getting the answer. One doesn’t want to try to carry along, much less to draw out and articulate, this huge baggage of everything that is or seems known, and everything that is entailed by what is known, for simply jumping in now and jumping out again. Lots of info can come from queries that embody known falsehoods. But the thing is to have an efficient method for splitting off a question for learning something now, deliberately bracketing the rest.
Mayo: If instead of “human” you’d written “universal Turing machine with access to a halting oracle“, everything in the above comment would be provably false. I can’t wait to see what you have to say about Solomonoff induction.
Corey: I read the link about Solomonoff induction. Apparently Occam’s razor (or, more precisely, a certain, probably not unique, translation of it into formalism) plays a key role in this.
I don’t see, however, how Occam’s razor considerations can contribute anything to the question of whether a certain hypothesis is true or not (the Wikipedia page, carefully, states in one place that we prefer to *choose* a hypothesis that is simple, which is a different, more pragmatic, justification; it also says elsewhere that we “feel” such hypotheses more likely to be true but I’m very sceptical about the epistemic value of such a feeling).
From my limited knowledge, my impression is actually that one can of course come up with axioms that make it possible to define such a “universal a priori probability”, but whether one likes it or not depends on whether one accepts the axioms as an in some sense “objective” and unique formalisation of what should be achieved here, which I don’t. I see neither convincing arguments for this nor how such arguments could be found for any system of “universal prior distributions”.
I treat Solomonoff induction as an ideal to be approximated because, as shown by the Prediction Error Theorem, it “will learn to correctly predict any computable sequence with only the absolute minimum amount of data. It would thus, in some sense, be the perfect universal prediction algorithm, if only it were computable.”
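For reference, the bound I have in mind (stated from memory in the form usually credited to Solomonoff and later sharpened by Hutter, so treat the exact constant with caution) is

$$\sum_{k=1}^{\infty} \mathbb{E}_{\mu}\!\left[\bigl(M(x_k = 1 \mid x_{<k}) - \mu(x_k = 1 \mid x_{<k})\bigr)^{2}\right] \;\le\; \frac{\ln 2}{2}\, K(\mu),$$

where $\mu$ is the true computable measure generating the binary sequence, $M$ is the Solomonoff predictor, and $K(\mu)$ is the length of the shortest program computing $\mu$. Since the right-hand side is finite, the expected squared prediction errors must shrink to zero.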
Corey: Fair enough, but it is not straightforward to accept either that the random processes observed in reality are computable according to the given definition, or that an algorithm that could predict those that are would be particularly useful for predicting those that are not. (I don’t know of any useful notion of “approximately computable” that could be used here.)
Christian, it’s straightforward enough for me to accept — if I turn out to be wrong, there’s no harm done, since as of right now I only have access to physical computers to do data analysis, and they are, by definition, incapable of dealing usefully with noncomputable processes.
Corey: Well, the finite datasets that we observe are obviously computable, but this to me doesn’t seem to be a good argument.
The theorem is about prediction in potentially infinite computable sequences, if I understand this correctly.
OK, you may still say that we are only interested in predicting a finite future. However, a perfect prediction rule for computable processes with a “minimum amount of data” is not of much help if the sequence is as complex as its length, i.e., later observations stand in no deterministically computable relation to earlier ones. In such cases the theorem says that we can predict as much as we know already and not more (if I get all of this right, of which I’m admittedly not 100% sure).
So, amending my own previous comment, the question is not so much whether one believes that the real processes of interest are really computable, but rather whether they are computable in a way that is so much less complex than their length that Solomonoff’s universal prediction algorithm is of any help.
I guess that in most cases this is much less useful than frequentist modelling, which I take as intentionally simplifying things by subsuming all kinds of stuff under “random noise” without the need to worry about computability. (I am aware that von Mises did worry about computability in connection with frequentist probability, but to me this looks like a failed attempt to ascribe to frequentist models more “objective truth” than they deserve. Kolmogorov and Solomonoff took off from there and ended up at a rather different place.)
Christian: “the question is not so much whether one believes that the real processes of interest are really computable, but rather whether they are computable in a way that is [] much less complex than their length”
Corey: If one has ever used a computer to help one model data, one subscribes to beliefs stronger than this.
Christian: “I guess that in most cases this is much less useful than frequentist modelling, which I take as intentionally simplifying things by subsuming all kinds of stuff to ‘random noise’ without the need to worry about computability.”
Corey: The prediction error theorem I referred to earlier doesn’t operate in a deterministic environment — it operates in a stochastic environment where the *measure mapping events to probabilities* is a computable function. This includes all practical modelling of any type: if one can write code to simulate it (up to machine precision for an arbitrarily precise machine), then Solomonoff prediction would give (if only one could compute it) a conditional predictive distribution approaching the “true” conditional predictive distribution. And one isn’t limited to deterministic PRNGs in the simulation; one can use genuinely random bits (whatever that means) as input if so desired.
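[A trivial example of what is meant by a computable stochastic environment, added for concreteness: any data-generating process one can actually code, like the little AR(1) simulator below (purely illustrative), defines a computable map from past data and random input bits to the distribution of the next observation, and so falls within the theorem's scope in principle.]

```python
# Purely illustrative computable stochastic process: an AR(1) simulator. Any
# process one can code like this (randomness supplied as input) is, in principle,
# covered by the prediction theorem discussed above.
import numpy as np

def simulate_ar1(n, phi=0.8, sigma=1.0, seed=0):
    """Simulate x_t = phi * x_{t-1} + N(0, sigma^2) noise, starting at x_0 = 0."""
    rng = np.random.default_rng(seed)
    x = np.zeros(n)
    for t in range(1, n):
        x[t] = phi * x[t - 1] + rng.normal(0.0, sigma)
    return x

print(np.round(simulate_ar1(5), 3))
```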
Corey: OK, thanks. I probably have to read more about it.
Response to Earlier Guest Comment:
Since I couldn’t respond directly under your comment (I don’t know why), I’m responding on my latest post: 5-28-12. What I don’t mention there, I do below:
Guest: Sorry, I’m not sure what you mean exactly by “painting by numbers”.
Mayo: See latest post: 5-28-12.
Guest: But I’m viewing the method, for all datasets, as Bayesian and not the inference for one specific set of data.
Mayo: I don’t think I understand this.
Mayo: Let me be clear: I never said that statisticians or statistical practitioners were or should be interested in whether or not a method or person could be placed under a specific banner, let alone in the degree to which a method or person could be placed under one or another banner. It is you who began this particular exchange expressing interest in the degree to which Pearl may be said to be Bayesian (either by his lights or yours), and whether that degree has moved from 50% to some lower number in the last decade (or whether it should always be regarded as 100%). As my late colleague Good used to say, “To the Bayesian, all things are Bayesian” (though I think he attributed it to someone else).
Guest: I think that, among statisticians, indifference is more the current issue, not confusion. To counter that, and to show that the philosophical differences between paradigms matter and merit investigation, one has to give real-world examples where useful methods cannot be viewed as, e.g., Bayesian.
Mayo: See post: 5/28/12