From “The Savage Forum” (pp. 79-84, Savage, 1962)
BARNARD:…Professor Savage, as I understand him, said earlier that a difference between likelihoods and probabilities was that probabilities would normalize because they integrate to one, whereas likelihoods will not. Now probabilities integrate to one only if all possibilities are taken into account. This requires in its application to the probability of hypotheses that we should be in a position to enumerate all possible hypotheses which might explain a given set of data. Now I think it is just not true that we ever can enumerate all possible hypotheses. … If this is so we ought to allow that in addition to the hypotheses that we really consider we should allow something that we had not thought of yet, and of course as soon as we do this we lose the normalizing factor of the probability, and from that point of view probability has no advantage over likelihood. This is my general point, that I think while I agree with a lot of the technical points, I would prefer that this is talked about in terms of likelihood rather than probability. I should like to ask what Professor Savage thinks about that, whether he thinks that the necessity to enumerate hypotheses exhaustively, is important.
SAVAGE: Surely, as you say, we cannot always enumerate hypotheses so completely as we like to think. The list can, however, always be completed by tacking on a catch-all ‘something else’. In principle, a person will have probabilities given ‘something else’ just as he has probabilities given other hypotheses. In practice, the probability of a specified datum given ‘something else’ is likely to be particularly vague–an unpleasant reality. The probability of ‘something else’ is also meaningful of course, and usually, though perhaps poorly defined, it is definitely very small. Looking at things this way, I do not find probabilities unnormalizable, certainly not altogether unnormalizable.
Whether probability has an advantage over likelihood seems to me like the question whether volts have an advantage over amperes. The meaninglessness of a norm for likelihood is for me a symptom of the great difference between likelihood and probability. Since you question that symptom, I shall mention one or two others. …
On the more general aspect of the enumeration of all possible hypotheses, I certainly agree that the danger of losing serendipity by binding oneself to an over-rigid model is one against which we cannot be too alert. We must not pretend to have enumerated all the hypotheses in some simple and artificial enumeration that actually excludes some of them. The list can however be completed, as I have said, by adding a general ‘something else’ hypothesis, and this will be quite workable, provided you can tell yourself in good faith that ‘something else’ is rather improbable. The ‘something else’ hypothesis does not seem to make it any more meaningful to use likelihood for probability than to use volts for amperes.
Let us consider an example. Off hand, one might think it quite an acceptable scientific question to ask, ‘What is the melting point of californium?’ Such a question is, in effect, a list of alternatives that pretends to be exhaustive. But, even specifying which isotope of californium is referred to and the pressure at which the melting point is wanted, there are alternatives that the question tends to hide. It is possible that californium sublimates without melting or that it behaves like glass. Who dare say what other alternatives might obtain? An attempt to measure the melting point of californium might, if we are serendipitous, lead to more or less evidence that the concept of melting point is not directly applicable to it. Whether this happens or not, Bayes’s theorem will yield a posterior probability distribution for the melting point given that there really is one, based on the corresponding prior conditional probability and on the likelihood of the observed reading of the thermometer as a function of each possible melting point. Neither the prior probability that there is no melting point, nor the likelihood for the observed reading as a function of hypotheses alternative to that of the existence of a melting point enter the calculation. The distinction between likelihood and probability seems clear in this problem, as in any other.
BARNARD: Professor Savage says in effect, ‘add at the bottom of list H1, H2,…”something else”’. But what is the probability that a penny comes up heads given the hypothesis ‘something else’. We do not know. What one requires for this purpose is not just that there should be some hypotheses, but that they should enable you to compute probabilities for the data, and that requires very well defined hypotheses. For the purpose of applications, I do not think it is enough to consider only the conditional posterior distributions mentioned by Professor Savage.
LINDLEY: I am surprised at what seems to me an obvious red herring that Professor Barnard has drawn across the discussion of hypotheses. I would have thought that when one says this posterior distribution is such and such, all it means is that among the hypotheses that have been suggested the relevant probabilities are such and such; conditionally on the fact that there is nothing new, here is the posterior distribution. If somebody comes along tomorrow with a brilliant new hypothesis, well of course we bring it in.
BARTLETT: But you would be inconsistent because your prior probability would be zero one day and non-zero another.
LINDLEY: No, it is not zero. My prior probability for other hypotheses may be ε. All I am saying is that conditionally on the other 1 – ε, the distribution is as it is.
BARNARD: Yes, but your normalization factor is now determined by ε. Of course ε may be anything up to 1. Choice of letter has an emotional significance.
LINDLEY: I do not care what it is as long as it is not one.
BARNARD: In that event two things happen. One is that the normalization has gone west, and hence also this alleged advantage over likelihood. Secondly, you are not in a position to say that the posterior probability which you attach to an hypothesis from an experiment with these unspecified alternatives is in any way comparable with another probability attached to another hypothesis from another experiment with another set of possibly unspecified alternatives. This is the difficulty over likelihood. Likelihood in one class of experiments may not be comparable to likelihood from another class of experiments, because of differences of metric and all sorts of other differences. But I think that you are in exactly the same difficulty with conditional probabilities just because they are conditional on your having thought of a certain set of alternatives. It is not rational in other words. Suppose I come out with a probability of a third that the penny is unbiased, having considered a certain set of alternatives. Now I do another experiment on another penny and I come out of that case with the probability one third that it is unbiased, having considered yet another set of alternatives. There is no reason why I should agree or disagree in my final action or inference in the two cases. I can do one thing in one case and other in another, because they represent conditional probabilities leaving aside possibly different events.
LINDLEY: All probabilities are conditional.
BARNARD: I agree.
LINDLEY: If there are only conditional ones, what is the point at issue?
PROFESSOR E.S. PEARSON: I suggest that you start by knowing perfectly well that they are conditional and when you come to the answer you forget about it.
BARNARD: The difficulty is that you are suggesting the use of probability for inference, and this makes us able to compare different sets of evidence. Now you can only compare probabilities on different sets of evidence if those probabilities are conditional on the same set of assumptions. If they are not conditional on the same set of assumptions they are not necessarily in any way comparable.
LINDLEY: Yes, if this probability is a third conditional on that, and if a second probability is a third, conditional on something else, a third still means the same thing. I would be prepared to take my bets at 2 to 1.
BARNARD: Only if you knew that the condition was true, but you do not.
GOOD: Make a conditional bet.
BARNARD: You can make a conditional bet, but that is not what we are aiming at.
WINSTEN: You are making a cross comparison where you do not really want to, if you have got different sets of initial experiments. One does not want to be driven into a situation where one has to say that everything with a probability of a third has an equal degree of credence. I think this is what Professor Barnard has really said.
BARNARD: It seems to me that likelihood would tell you that you lay 2 to 1 in favour of H1 against H2, and the conditional probabilities would be exactly the same. Likelihood will not tell you what odds you should lay in favour of H1 as against the rest of the universe. Probability claims to do that, and it is the only thing that probability can do that likelihood cannot.
In their attempts to get the “catchall factor” to disappear, many appeal to comparative assessments–likelihood ratios or Bayes’ factors. Several key problems remain: (i) the appraisal is always relative to the choice of alternative, and this allows “favoring” one or the other hypothesis without being able to say there is evidence for either; (ii) although the hypotheses are not exhaustive, many give priors to the null and alternative that sum to 1; (iii) the ratios do not have the same evidential meaning in different cases (what’s high? 10, 50, 800?); and (iv) there’s a lack of control of the probability of misleading interpretations, except with predesignated point-against-point hypotheses or special cases (this is why Barnard later rejected the Likelihood Principle). You can read the rest of pages 78-103 of the Savage Forum here. This exchange was first blogged here. Share your comments.
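A toy numerical sketch of point (i), using a made-up binomial example of my own (not one discussed in the Forum): with 5 heads in 10 tosses, a likelihood ratio can strongly “favor” one hypothesis over another even though neither fits the data well.

```python
# Toy illustration of point (i): a large likelihood ratio can "favor" H1 over H2
# even though both hypotheses fit the data badly (data and hypotheses made up).
from scipy.stats import binom

n, k = 10, 5                       # observed: 5 heads in 10 tosses

def lik(p):
    """Likelihood of k heads in n tosses given Pr(heads) = p."""
    return binom.pmf(k, n, p)

H1, H2 = 0.9, 0.99                 # two badly fitting hypotheses about Pr(heads)
print(f"L(H1) = {lik(H1):.2e}, L(H2) = {lik(H2):.2e}")
print(f"likelihood ratio L(H1)/L(H2) = {lik(H1) / lik(H2):.0f}")   # roughly 6 x 10^4
print(f"for comparison, L(p = 0.5) = {lik(0.5):.2e}")
```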
References
*Other Barnard links on this blog:
Aris Spanos: Comment on the Barnard and Copas (2002) Empirical Example
Mayo, Barnard, Background Information/Intentions
Links to a scan of the entire Savage forum may be found here.
I’ve come to see most ‘within-model Bayesians’, e.g. ‘falsificationist Bayesians’ who are Bayesian within a model but use ‘outside the model’ checks, as (arguably) likelihoodists from a fundamental perspective. Allowing priors *within* a model, however, is incredibly useful for hierarchical models and nuisance parameters.
The calibration of the ‘absolute fit’ – as opposed to the relative fit given by Bayesian/likelihood methods – is an interesting problem. This is where the ideas of Gelman etc. need more formal investigation, in my opinion. Mike Evans has some interesting ideas in this direction from what I’ve been reading lately.
Or, put another way: ultimately both Bayesians and likelihoodists are conditional probability modellers. At some point every model needs closure assumptions. Falsificationist Bayesians/likelihoodists use a conditional independence assumption that the rest of the background variables are probabilistically irrelevant. If they are, then one can follow the usual estimation methods.
But these assumptions are always provisional and can only be subject to limited (though useful) testing.
Omaclaren: If there are statisticians calling themselves “falsificationist Bayesians” then they need to tell us the nature of their falsification rule by which claims are declared falsified. Of course they won’t be deductively falsified, so the question is by what criterion will they output a falsification?
One criterion Bayesians have used in the past is based on low (posterior) probability of claim C (be it about a model, hypothesis or other). This is problematic because, for one thing, it is relative to a catchall hypothesis; for another, declaring a highly improbable hypothesis falsified seems to go against the Bayesian idea that claims are probabilified, not declared true or false. Still, it’s a possible route, but almost certainly not the one you have in mind.
Another might be to declare C falsified when there is sufficiently low likelihood of data x given C. This can’t work because it would be too easy to falsify. One might appeal to a comparative likelihood approach, but then Barnard’s question about the alternative enters. The comparativist can at most say that one claim is more or less likely than another given data (and of course the likelihood already supposes a model). You’d need another rule mapping low likelihood ratio to “falsified”–but it would really only be comparative. Simple significance tests have a real advantage here, and that’s the final possibility.
A final possibility might be to run significance tests on model claims, rejecting them if a P-value is too low. Again, this is not a deductive falsification but may be defended as rejecting claims based on an observed difference D between data and what C predicts the data would be. If Pr(D > d| C) is sufficiently low, then C might be declared (provisionally) falsified. If it is not low, then assuming the adequacy of C might be declared acceptable (perhaps for a given purpose). The justification would be based on low error probabilities (perhaps with a severity rather than a long-run construal). Is this the view you favor?
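A minimal sketch of that last rule, with made-up data and a threshold of my own choosing (a binomial coin model stands in for the claim C): compute the tail probability of the observed difference under C and declare C provisionally falsified if it falls below the threshold.

```python
# A sketch of the significance-test falsification rule described above
# (threshold and data are made up for illustration).
from scipy.stats import binom

alpha = 0.01                   # rejection threshold, chosen for illustration
n, k = 100, 72                 # made-up data: 72 heads in 100 tosses
# C: the binomial coin model with Pr(heads) = 0.5; D: the number of heads.
p_value = binom.sf(k - 1, n, 0.5)          # tail probability Pr(D >= k | C)
print(f"Pr(D >= {k} | C) = {p_value:.2e}")
if p_value < alpha:
    print(f"C is (provisionally) declared falsified at level {alpha}")
else:
    print(f"no grounds to declare C falsified at level {alpha}")
```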
The next question is what is done when C is falsified? Is it replaced with C’ that passes the test? Barnard will remind us that there are numerous different alternative models that could pass the test, and we are back to where we began with at most a comparative assessment, i.e., this model accords with the data better than that one. The pure significance tester or (Spanos-type) misspecification tester would declare C inadequate or discordant, but without being permitted to infer C’ (unless its denial was ruled out by other tests).
An additional problem enters if one is allowed to replace the “falsified” C with a tailoring of a prior. Yet more members of the “catchall” rear their heads.
I’m being sketchy, obviously, but the onus is on your Bayesian falsificationist to explain the nature of the recommended statistical falsification rule. I’ve suggested three basic approaches. It won’t do to simply say we falsify claims or we do diagnostic checks without indicating the reasoning and the role of probability in that reasoning.
Here is my own sketch. Let me know if it fails to address your points.
Consider a model where we predict a quantity y as a function of x in background context b. Grant the existence of p(y|x,b) and p(x|b). Note b is only ever on the right hand side so need not be considered probabilistic (notation can be formalised further if you want).
My closure assumptions are
a) p(y|x,b) = p(y|x) for all x,y,b
b) p(x|b) is given
If these are satisfied then using Bayesian parameter inference is fine. If not, then the model is misspecified. Note that these only ever hold in the domain of b, and we can only explore this by varying b and seeing if the conditions hold.
a) is an assumption about mechanism, i.e. x determines y regardless of context b.
b) is an assumption about experimental manipulation, i.e. boundary conditions.
These are checked by the analogue of pure significance testing. The checks essentially ask: ‘can I predict y given x without knowing the values of other variables?’ and ‘do I know how my interventions affect my predictor x?’
These sorts of assumptions are ‘meta-statistical’ closure assumptions, but they are testable to the extent we can explore/consider different contexts (values of b).
Also note that stopping rules etc may be included as part of the variable ‘b’, so could affect the validity of assumption a) and/or assumption b).
E.g. a particular stopping rule may be construed as preserving a) while requiring modification of b), i.e. a different prior. If the model is misspecified then it may be corrected on that data in that context, but this may invalidate its application in other contexts. Invariance of the relationship between y and x for all contexts b is crucial here.
Also, I should have used 1) and 2) for the conditions to avoid confusion with the variable b
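A minimal sketch, of my own construction (not omaclaren’s actual code), of how checks of conditions 1) and 2) might look on simulated data, using simple significance-test-style diagnostics:

```python
# A toy sketch of checking the two closure conditions by simple
# significance-test-style diagnostics (simulated data, my own construction).
# Condition 1): p(y|x,b) = p(y|x) -- residuals from a pooled fit of y on x
#               should not depend on the context b.
# Condition 2): p(x|b) is as assumed -- the distribution of x should match
#               the assumed design in each context.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def simulate(n, b, context_effect=0.0):
    """Simulate (x, y) in context b; a nonzero context_effect violates condition 1)."""
    x = rng.normal(size=n)                       # assumed design: x ~ N(0,1) in every context
    y = 2.0 * x + context_effect * b + rng.normal(scale=1.0, size=n)
    return x, y

# Two contexts; set context_effect=1.0 to see condition 1) fail.
x0, y0 = simulate(200, b=0)
x1, y1 = simulate(200, b=1)

# Pooled fit of y on x, ignoring b (the "within-model" inference step).
fit = stats.linregress(np.concatenate([x0, x1]), np.concatenate([y0, y1]))
resid0 = y0 - (fit.intercept + fit.slope * x0)
resid1 = y1 - (fit.intercept + fit.slope * x1)

# Check 1): do residuals differ between contexts? (crude two-sample diagnostic)
p1 = stats.ttest_ind(resid0, resid1).pvalue
print(f"check of condition 1), p-value: {p1:.3f}")

# Check 2): does x follow the assumed N(0,1) design in each context?
for b, x in [(0, x0), (1, x1)]:
    p2 = stats.kstest(x, "norm").pvalue
    print(f"check of condition 2), context b={b}, p-value: {p2:.3f}")
```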
Omaclaren: The trouble is delineating the background b; that is Barnard’s point. The assumption of “closure” and boundary conditions is just what Barnard is questioning. Savage says you must leave a “catchall” for all that hasn’t been considered. So even if you could vary b, you can’t vary the possibilities not thought of; and even if your predictions “pass” the test (under your assumptions), it wouldn’t indicate you had gotten hold of even a partial explanation of the data. You might have to try very hard to think of ways you could be wrong so as to generate failed predictions regularly. Then you might need to arrive at something entirely new to explain it. Substantive causal or other theories (of the data and the domain) beyond your statistical model would be needed. Assuming the fixed framework or closure wouldn’t get you there.
But you might say that every modeling account has this problem, and Barnard’s point is that, no, it’s a special problem of using Bayesian probabilism.
The frequentist tester and modeler can start anywhere and doesn’t require closure.
I very much reject the ‘catchall’ approach – I think it’s misguided. For me ‘closure’ of a mathematical model is always provisional. That’s the point – a strong ‘for all’ quantification assumption that can be rejected by counterexamples a la Popper!
Also these closure assumptions are really meta-statistical ‘causal’-esque assumptions. They’re not that far from Pearl/Glymour style assumptions.
And just to emphasise – these are not ‘knowledge is closed under entailment’ assumptions. I’m generally against this a la Nozick. I simply mean temporary closure of a mathematical model via a falsifiable ‘for all’ statement giving ‘boundary conditions’ and the like, as in a differential equation model.
And this is compatible with using Bayes within a model (probabilities integrate to one within a model) but allowing the closure assumptions to be falsified.
One more comment for the moment if you don’t mind…sorry it’s hard to have a (semi-) philosophical argument in a limited number of blog comments.
A motto:
‘Forall is not catchall’!
(‘For all’ is falsifiable, as Popper pointed out, ‘catch all’ is not).
Also, without wanting to distract from my main reply, Spanos’ misspecification testing to ‘secure the inductive premise’ is precisely a type of ‘closure’ assumption in the sense I am using it.
There is an important misconception in what Barnard wrote, and in the responses to it. This is the core part: “Now probabilities integrate to one only if all possibilities are taken into account. This requires in its application to the probability of hypotheses that we should be in a position to enumerate all possible hypotheses which might explain a given set of data. Now I think it is just not true that we ever can enumerate all possible hypotheses.” What is missing is the fact that the probabilities will integrate (or sum) to unity only if the probabilities all come from the same statistical model. It is a statistical model that supplies the probabilities, and the probabilities apply to parameter values of the model. It is unfortunate that we talk of ‘hypotheses’ in this context because that word allows a far larger set of more complicated things than the model parameters for which we can supply probabilities.
To illustrate that point, consider a bent coin tossing experiment. In ten tosses it comes up HHTHTTTHHT, where H and T have the obvious meaning. The usual statistical model assumes a binary outcome with a fixed probability, p, for each toss of the coin. The probabilities of all possible values of the parameter p, between 0 and 1, are obtained from the binomial distribution, and those probabilities integrate to unity. Now we add “something that we had not thought of yet”: a hypothesis that the result HHTHTTTHHT is a pre-ordained result, a determinist hypothesis that the observed result had to happen with probability one. It is valid as a hypothesis and it is, presumably, part of the catch-all, but it is not a valid part of the statistical model that features a fixed value of p. That catch-all hypothesis is an evil demon hypothesis where an evil demon sets the probability of H to one for each of the throws where H was observed, and sets the probability of T to one for each of the throws where T was observed. Clearly that evil demon hypothesis is not part of the statistical model where we obtain the probabilities of H and T from a fixed value of the parameter p. After appending the evil demon hypothesis to the model we have a ‘super-model’ that presents the problem that its probabilities add up to more than unity. Any way we choose to normalise those probabilities requires us to supply a weighting factor that is equivalent to Bayesian priors on the binomial part of the super-model and the evil demon part. I would be happy to supply such priors (unity for the binomial part and zero for the evil demon part), but I dare say that some would not be happy to do so.
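A small numerical check of this point (my own sketch, using the sequence above): within the binomial model the likelihood of the exact sequence tops out at 0.5^10, far below the value 1 assigned by the deterministic hypothesis, which has no place in the model’s parameter space.

```python
# Likelihood of the exact sequence HHTHTTTHHT as a function of the binomial
# parameter p, compared with the value 1 assigned by the deterministic
# ("evil demon") hypothesis, which lies outside the model's parameter space.
import numpy as np

seq = "HHTHTTTHHT"
h, t = seq.count("H"), seq.count("T")        # 5 heads, 5 tails

p_grid = np.linspace(0.0, 1.0, 1001)
lik = p_grid**h * (1.0 - p_grid)**t          # likelihood of this exact sequence given p
print(f"max likelihood within the model: {lik.max():.6f} at p = {p_grid[lik.argmax()]:.2f}")
print("likelihood assigned by the deterministic hypothesis: 1.0")
```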
The problem comes entirely from thinking in terms of hypotheses as something different from parameter values in a defined statistical model.
Michael: I think you are saying what Barnard says, for example: “The difficulty is that you are suggesting the use of probability for inference, and this makes us able to compare different sets of evidence. Now you can only compare probabilities on different sets of evidence if those probabilities are conditional on the same set of assumptions.”
The reports of the Bayesians here (Savage, Lindley) are posterior probabilities.
On your other point, even though it is a side issue: there’s no reason to suppose that a maximal likelihood hypothesis has anything to do with an “evil demon”, whatever that is. Coming up with “just so stories” and good-fitting explanations of data (and of course they needn’t make the probability of the data maximal) is quite common, and there’s no need to invoke any demons. (Even Royall’s hypothesis that the deck contains only cards exactly like the one I observed doesn’t involve any demons.)
Mayo, Royall’s example of drawing a single card from a deck of 52 is not originally his. Edwards wrote about it in 1970 (Edwards, A. W. (1970). Likelihood. Nature, 227, 92. http://doi.org/10.1038/227092a0). It’s a nice example of overfitting allowing a determinist hypothesis. There are several others, one of which is discussed in the paper I sent you a couple of months ago (http://arxiv.org/abs/1507.08394). None of those should be taken as a criticism of likelihood methods because overfitting mucks up most forms of statistical inference.
The evil demon is a philosopher’s creation, originally from Descartes, as you should well know. The demon is able to do whatever it wants and is intent on fooling us.
As far as I can see the only ways to get a sure-thing (‘whatever happened had to happen’) hypothesis into a statistical model are overfitting, where there are more parameters than data points, and maneuvers that are equivalent to invocation of the evil demon. I would be pleased to hear of other strategies.
Just so stories do not generally fit into the statistical model. That is a point that I am trying to make. The only “stories” that the statistical model allows the evidence to evaluate are the stories that equate to parameter values of the model. I used the coin example because you have used it as an illustration of a determinist, “maximally likely” hypothesis in the past (Mayo & Spanos 2011). You write that the hypothesis “can assert that the probability of heads is 1 just on those tosses that yield heads, 0 otherwise”. That hypothesis is an evil demon hypothesis because it does not fit into the binomial model that is typically used for coin tossing. It would not be one of the hypotheses for which the statistical model can supply a likelihood.
Michael – I agree that many of these issues arise from not thinking about hypotheses as parameters within a mathematical model. And that most ‘counterexamples’ rely on overfitting via ‘too flexible models’.
I do think that blocking models that overfit relies on considerations of prior or future contexts/data not immediately at hand, as in my comments above. That is, notions of a range of contexts within which the model is invariant. This also motivates the common use of splitting into ‘training’ vs ‘testing’ datasets to guard against overfitting.
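A toy sketch of the train/test point, on simulated data of my own (the polynomial degrees are arbitrary): the more flexible model fits the training half better but typically predicts the held-out half worse.

```python
# A toy train/test split to illustrate guarding against overfitting
# (simulated data; polynomial degrees chosen arbitrarily).
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, 40)
y = 2 * x + rng.normal(scale=0.3, size=40)   # assumed true relation: linear plus noise
x_tr, y_tr = x[:20], y[:20]                  # training half
x_te, y_te = x[20:], y[20:]                  # held-out (testing) half

for degree in (1, 9):
    coefs = np.polyfit(x_tr, y_tr, degree)
    train_mse = np.mean((np.polyval(coefs, x_tr) - y_tr) ** 2)
    test_mse = np.mean((np.polyval(coefs, x_te) - y_te) ** 2)
    # the flexible fit typically does better on the training half, worse on the held-out half
    print(f"degree {degree}: train MSE = {train_mse:.3f}, test MSE = {test_mse:.3f}")
```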
So anyway, hiding among some of my ramble on this post I feel like there is a fairly defensible sketch of how a Bayesian may be able to be a falsificationist. They provisionally accept two conditional probability statements which involve conditioning on a background domain of validity. This background does not need a probability distribution over it, as it is only ever conditioned on.
The goal of a scientist is a ‘search’ problem: to find (e.g. by guessing) theories which satisfy these conditions for a desirable range of background contexts. When the conditions are not satisfied for a given background then the theory is false (-ified) for that domain.
As an example, Newton’s law f - ma = 0 is a statement of an invariant relationship parameterised by ‘context’. When the context involves knowing ‘gravity is present and the relevant masses are known’ and I want to predict acceleration, then the expression for f is determined by background knowledge and is used to predict acceleration. When acceleration and mass are known relative to a background reference frame, then the net force can be predicted. This relationship is nice because we can satisfy the two conditions I gave under a wide range of conditions.
Note that these closure assumptions don’t really have anything to do with being Bayesian or not – I believe Glymour and Pearl have said things along these lines – but are perfectly compatible with a Bayesian approach. If you don’t want to use Bayesian models, fine, but the argument that they cannot be compatible with a falsificationist approach to doing science is clearly wrong (to me anyway). Bayesian and likelihoodist methods also happen to have particularly intuitive interpretations for parameter estimation within a model defined conditionally with respect to a background context.
Again, it’s scientists’ job to find instances of theories satisfying the closure conditions and determine the range of backgrounds over which they hold. There is no ‘catchall’ – there is a ‘for all’ for which we need to determine the truth set (range of values for which the quantification holds).
Omaclaren: The bottom line is that you don’t have inference by way of posteriors without a catchall. The issue of falsification is a bit different. You don’t have falsification without a falsification rule. It will not be deductive, that’s clear. So what’s your probabilistic falsification rule? I indicated some possible avenues.
It would be good if statisticians took philosophy of science so they could see the need to clarify vague claims along the lines of “it is the scientist’s job” to find closure conditions, as if that answers the question of how your particular account promotes learning despite not having such conditions, or reliably finding them. It’s a stretch to see statistical modeling as giving full-blown theories with ceteris paribus conditions.
Sigh. I’ve tried my best. Personally I find my account less vague and have been using it on applied problems. A similar view which doesn’t make the closure conditions as explicit is in Gelman’s BDA.
Some of us are happy to use conditional probability as basic and posteriors where useful. Here posteriors do come in – over the parameters *within* the model. Also, closure conditions allow you to pass from the prior *predictive* to posterior *predictive* distributions (see BDA or Bernardo and Smith for definitions), so they do also allow (predictive) inference using a posterior. That this requires accepting two conditional probability statements is neither here nor there to me as far as ‘being Bayesian’ or not is concerned.
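For reference, the prior predictive and posterior predictive distributions alluded to are, in standard notation (my paraphrase of the textbook definitions, not a quotation from BDA or Bernardo and Smith):

```latex
% Prior predictive and posterior predictive distributions for data y, new data
% \tilde{y}, parameter \theta, prior p(\theta), and likelihood p(y \mid \theta):
p(y) = \int p(y \mid \theta)\, p(\theta)\, d\theta,
\qquad
p(\tilde{y} \mid y) = \int p(\tilde{y} \mid \theta)\, p(\theta \mid y)\, d\theta .
```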
I tried to collect some of the points I made here on my blog (click my name) for anyone interested. I guess it’s a one-way street – there are people who use Bayesian methods open to Mayo’s ideas and methods (e.g. me, Gelman) but Mayo doesn’t seem to have any interest in anything that ever involves Bayesian methods at all.
Again, sigh.
Also, I didn’t mean it’s the scientist’s job to find the closure conditions – I gave them. Their job is to find theories satisfying them. They do this by ‘conjecture and refutation’ you might say. Anyway, not sure why I’m bothering – I suppose these arguments on the internet have been helpful for clarifying my views (to myself at least) so that I can better use them. My goal is not to score points or win a philosophical argument.
Omaclaren: Everyone who uses probability uses conditional probability, and theorems of statistics/probability/logic. My point, and Barnard’s, has nothing to do with being interested in using Bayesian ideas or not (ideas which are generally probabilist, not falsificationist); it is about trying to pin down the falsification you have in mind. In the works you mention, I believe the answer they give to the “falsificationist” argument is declaring misfit (or the like) based on statistically significant differences, as with pure significance tests, based on tail areas. Whether a rival that passes is inferred is unclear.
I remain perplexed at your claim “There is no ‘catchall’ – there is a ‘for all’ ” especially as you stress the importance of “closure conditions”.
We can return to this another time.
I didn’t know you had a blog, I’ll have a look.