A comment from Professor Peter Grünwald
Head, Information-theoretic Learning Group, Centrum voor Wiskunde en Informatica (CWI)
Part-time full professor at Leiden University.
This is a follow-up to Vladimir Cherkassky’s comments on Deborah’s blog. First of all, let me thank Vladimir for taking the time to clarify his position. Still, there is one issue on which we disagree and which, I think, needs clarification, so I decided to write this follow-up. [related posts 1]
The issue is how central VC (Vapnik-Chervonenkis) theory is to inductive inference.
I agree with Vladimir that VC-theory is one of the most important achievements in the field ever, and indeed, that it fundamentally changed our way of thinking about learning from data. Yet I also think that there are many problems of inductive inference on which it has no direct bearing. Some of these are concerned with hypothesis testing, but even when one is concerned with prediction accuracy – which Vladimir considers the basic goal – there are situations where I do not see how it plays a direct role. One of these is sequential prediction with log-loss or its generalization, Cover’s loss. This loss function plays a fundamental role in (1) language modeling, (2) on-line data compression, (3a) gambling and (3b) sequential investment on the stock market (here we need Cover’s loss). [A super-quick intro to log-loss as well as some references are given below under [A]; see also my talk at the Ockham workshop (slides 16-26 about weather forecasting!).]
The central notion of VC-theory is the VC-dimension, which was originally developed for classification with 0/1-loss in an i.i.d. setting. (See [B] below for a very quick idea of what it means.) Now, the log-loss function behaves entirely differently from the 0/1-loss. Moreover, the three settings above are all settings in which no i.i.d. assumption is ever made. VC-theory can be extended to other loss functions (e.g. regression problems), but I don’t see how it could be extended to log-loss in a non-i.i.d. setting. To be sure, there exist analogues of VC-dimension, such as Rissanen’s ‘parametric complexity’. For a statistical model like, e.g., a K-th order Markov model, this defines a corresponding ‘complexity’, a complicated function that is, indeed, increasing with K and thus increasing with the number of parameters, but that also depends on several other aspects of the model. Parametric complexity has a direct interpretation in terms of log-loss prediction accuracy, in the game-theoretic sense of ‘worst-case regret’, without any assumption of a ‘true model’ or ‘true distribution’ at all (!). This way of measuring prediction error is an important topic in machine learning; see e.g. the book Prediction, Learning, and Games by Cesa-Bianchi and Lugosi. Last week our fellow workshop speaker Larry Wasserman put a short introduction to the idea on his blog – very much recommended. (The story there is for absolute loss, but it is straightforward to adapt it to log-loss.)
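(To make the ‘several other aspects’ concrete: under standard regularity conditions, the parametric complexity COMP_n(M) of a k-dimensional model M at sample size n has the well-known asymptotic expansion, due to Rissanen,

COMP_n(M) = \frac{k}{2} \log \frac{n}{2\pi} + \log \int_\Theta \sqrt{\det I(\theta)}\, d\theta + o(1),

where I(\theta) is the Fisher information matrix. The first term grows with the dimension k; the second depends on the functional form of the model, which is why two models with the same number of parameters can have quite different complexities.)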
Now it turns out that there is a relation between VC-dimension, Rissanen’s parametric complexity (developed mostly in the ‘minimum description length’ world – references below) and several other measures of a model’s ‘richness’ after all. Interestingly, these can all be seen as instances of clever ways of coding hypotheses using lossless codes. I give four examples:
(1) As to VC-dimension: using Sauer’s Lemma, if you have a model of VC-dimension d, then ‘d log n’ can be interpreted as the worst-case number of bits you need to encode an element of your model if you are allowed to use the X-values (inputs/covariates) in your coding scheme (see the short calculation after this list). Indeed, there is a close connection between VC-bounds and MDL (minimum description length) bounds; a first, pioneering bound in this direction already appears in Vapnik’s work, and more refined bounds have been given later by people like Blum & Langford and Bousquet (machine learners) and Audibert (statistician). Here is a link to the Blum and Langford paper.
(2) As to parametric complexity: this can (essentially) be seen as the number of bits you need to encode an element of your model if you use the code that minimizes the worst-case regret, another technical notion. This is explained in, e.g., my MDL book or my (much shorter) MDL tutorial, or, perhaps the fastest way to learn about it, watch this video.
(3) The bounds for expected prediction error I showed in my talk at the Ockham workshop involve quantities of the form −log prior(classifier), which can be thought of as being related to a Bayesian prior, but also as the number of bits needed to code an element of your model, i.e. a set of classifiers. Again, Blum and Langford and Audibert have results showing how these bounds can be unified with the standard VC-bounds.
(4) Rademacher complexity, another way of measuring complexity for classification problems, can be directly related to both VC-dimension and parametric log-loss complexity (see Chapter 17 of my MDL book, where I go into this in some detail).
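Here is the back-of-the-envelope version of the ‘d log n’ claim in (1). Sauer’s Lemma says that a model of VC-dimension d, restricted to the inputs x_1, …, x_n (with n ≥ d), realizes at most

\sum_{i=0}^{d} \binom{n}{i} \le \left(\frac{en}{d}\right)^{d}

distinct labelings. Hence, once the X-values are given, \log_2 (en/d)^d = d \log_2(en/d) = O(d \log n) bits suffice to specify any element of the model up to its behavior on those inputs – a worst-case code length, requiring no probabilistic assumption on the labels.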
Now let’s go back to Vladimir’s notes on the Ockham workshop on this blog. He writes “Some philosophers suggested that the Occam’s Razor principle still holds if the VC-dimension is used as a measure of complexity. This semantic game playing seems counter-productive and only breeds more confusion.”
This probably refers to a remark that I made after Vladimir’s talk. I realize now that my remark was quite unclear (entirely my fault) – I certainly didn’t mean to say the above. Let me try to explain what I really meant: I tried to convey that, perhaps, the notion of a *code* over the set of allowed hypotheses is a general notion that can capture many different ways of ensuring good predictive accuracy: in the typical VC-context (0/1-error, i.i.d. but no other assumption), in the log-loss context (no stochastic assumption at all) and in the ‘PAC-Bayesian’ context of my own talk later that day. One designs a code over hypotheses before seeing the data; one then has a learning algorithm which uses this code, and a mathematical analysis which provides bounds on predictive accuracy in terms of this code. The code can (a) be one which has worst-case optimality properties (such as codes based on VC-bounds in classification problems, and codes based on minimax regret in log-loss problems) – and, contrary to what is often suggested, such codes usually do not depend on arbitrary things such as what names one gives to what hypotheses. Or (b) the code may incorporate some subjective ‘luckiness’ information (one trades in good performance on a subset of hypotheses for worse performance on another subset) – one then gets ‘objective subjectivity’: bounds on predictive accuracy which hold irrespective of whether the prior/subjective assumptions are correct, but which get progressively weaker the less the assumptions are corroborated by the data. This ‘luckiness idea’ was the subject of my talk.
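To make the role of the code explicit, here is one standard shape such a bound takes in the classification case (this is the classical ‘Occam’s razor bound’, obtained from the Kraft inequality, a union bound and Hoeffding’s inequality; constants vary across texts). Any prefix code over a countable set of classifiers has lengths L(h) satisfying \sum_h 2^{-L(h)} \le 1, and then, with probability at least 1−\delta over an i.i.d. sample of size n, simultaneously for all h,

R(h) \le \widehat{R}_n(h) + \sqrt{\frac{L(h)\ln 2 + \ln(1/\delta)}{2n}},

where R(h) is the true 0/1-risk and \widehat{R}_n(h) the empirical risk. The guarantee depends on the hypotheses only through the code lengths L(h), fixed before seeing the data – not on how the hypotheses happen to be named.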
Thus, I think codes play a fundamental role in many (not all) problems of inductive inference, and might even play a role in unifying several different strands of inductive inference. I see nothing wrong with calling a class of hypotheses that allows for a short worst-case code length (e.g., d log n in the VC-case) ‘simple’, and so, in my terminology, this coding approach can be related to Occam’s Razor after all (note that, e.g., on Wikipedia, VC-dimension is also introduced as ‘measuring, in some sense, how complicated a set of hypotheses is’). This is what I wanted to say in my remark after Vladimir’s talk.
That’s it – best wishes to everyone!
Appendix:
[A] About Prediction with Log-Loss
In the simplest version of the log-loss prediction game, if one predicts outcome y with distribution p, one suffers loss −log p(y); i.e., if one assigned high probability to the outcome that actually occurs, the loss is small. In the sequential prediction game, at each point in time one predicts y_t with a probability distribution that is allowed to depend on y_1, …, y_{t−1} and some side-information x_1, …, x_t. The total loss at time T is then the cumulative (summed) loss one makes at times t = 1, …, T.
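In symbols, writing p_t for the distribution predicted at time t,

\mathrm{Loss}_T = \sum_{t=1}^{T} -\log p_t(y_t), \quad \text{where } p_t(\cdot) = p(\cdot \mid y_1, \ldots, y_{t-1}, x_1, \ldots, x_t).

Since \sum_t -\log p_t(y_t) = -\log \prod_t p_t(y_t), a sequential prediction strategy is mathematically the same object as a single joint probability assignment to the whole sequence – the identity behind the connections to compression and gambling discussed next.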
While the predictions one can make are formally probability distributions, they might perhaps better be called ‘nonnegative weights summing to 1’ – for example, in the gambling application, p(y) represents the proportion of one’s capital that one invests in y; i.e., one uses p(y) times one’s total capital to buy tickets that pay off if y occurs. And in the data compression scenario, −log p(y) is the number of bits one needs to encode outcome y using ‘the code corresponding to p’. See the Cesa-Bianchi/Lugosi book or my talk (slides 16-26 about weather forecasting!) for further information.
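For the programmatically inclined, here is a minimal, self-contained sketch of the binary version of the game (the choice of predictor – Laplace’s rule of succession – and the example sequence are mine, purely for illustration). It computes the cumulative log-loss in bits and the regret with respect to the best fixed Bernoulli predictor chosen in hindsight:

```python
import math

def laplace_predictor(history):
    """Probability that the next outcome is 1 under Laplace's rule of
    succession: (#ones + 1) / (#outcomes seen + 2)."""
    return (sum(history) + 1) / (len(history) + 2)

def cumulative_log_loss(seq):
    """Total log-loss (in bits) of the sequential Laplace predictor on seq."""
    loss, history = 0.0, []
    for y in seq:
        p1 = laplace_predictor(history)
        p = p1 if y == 1 else 1 - p1
        loss += -math.log2(p)       # loss suffered on this round's prediction
        history.append(y)
    return loss

def best_hindsight_loss(seq):
    """Log-loss of the best fixed Bernoulli(theta), chosen after seeing seq."""
    k, n = sum(seq), len(seq)
    theta = k / n
    term = lambda p, m: -m * math.log2(p) if m and p else 0.0  # 0*log 0 = 0
    return term(theta, k) + term(1 - theta, n - k)

seq = [0, 1, 1, 0, 1, 1, 1, 0, 1, 1]
print("sequential loss:", cumulative_log_loss(seq))
print("hindsight loss :", best_hindsight_loss(seq))
print("regret         :", cumulative_log_loss(seq) - best_hindsight_loss(seq))
```

The regret of this particular predictor is at most log2(T+1) bits after T rounds, however the sequence was generated – note that no i.i.d. assumption appears anywhere.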
[B] About VC-dimension
Consider data (X,Y) where Y is 0 or 1 and X can take values in an arbitrary set. Suppose we see i.i.d. pairs (X1,Y1), (X2,Y2), … . These are our training data. We also have a ‘model’ (a set of classifiers, such as support vector machines for a given kernel, or all feedforward neural networks with 6 hidden nodes, or all decision trees that can be defined on X). Our goal is to learn to predict Y given X as well as the best element in the model, where predicting well means achieving a small expected 0/1-loss, i.e. a small probability of making the wrong prediction.
Very broadly speaking, if the model has finite VC-dimension then, no matter what the data generating distribution is, as long as data are i.i.d., ‘empirical risk minimization’ (= picking the classifier which has the smallest loss on the given data) will ‘work well’: with probability 1, as you get more and more data, the estimator you pick gets better and better and its performance converges to the best performance that can be obtained by any classifier in the model.
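The quantitative statement behind this, in one classical form (constants differ between textbooks, so take it as indicative): with probability at least 1−\delta, simultaneously for all classifiers h in a model of VC-dimension d,

R(h) \le \widehat{R}_n(h) + \sqrt{\frac{d(\ln(2n/d) + 1) + \ln(4/\delta)}{n}},

where \widehat{R}_n(h) is the empirical 0/1-loss on the n training pairs and R(h) the expected 0/1-loss. Because the bound holds uniformly over the model, it applies in particular to the empirical risk minimizer.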
___________________
1 Previous posts from the CMU Ockham’s Razor Workshop:
- (July 6) Vladimir Cherkassky Responds on Foundations of Simplicity
- (July 4) Comment on Falsification
- (July 3) Elliott Sober Responds on Foundations of Simplicity
- (July 2) More from the Foundations of Simplicity Workshop
- (June 29) Further Reflections on Simplicity: Mechanisms
- (June 26) Deviates, Sloths, and Exiles: Philosophical Remarks on the Ockham’s Razor Workshop
Peter: Thanks much for your follow-up comments. I have recently had several rounds (of e-mail) with Cherkassky; I’m guessing he’ll have a reaction to the specifics on machine learning – where I’m a complete outsider. Even though our discussions from the last few days may be rather distant from machine learning, some connections might be worth noting.
Here’s just one, from near the end of your posted slides (link to “my talk”). We were discussing two days ago whether various (prior) restrictions were essentially captured (or well captured) Bayesianly: you say that the prior-dependent methods you describe are not Bayesian because
“purely Bayesian algorithms may fail dramatically…. You may assign small prior to certain [values] because you think they are not likely to predict well… But also because they may not be useful!
“If you’re lucky, prior is well aligned with data, and bound is strong. But bound holds whether you are lucky or not! There’s no such thing in Bayesian inference.”
This seems redolent of Neyman’s stipulation that performance of methods be independent of prior probabilities, even while granting that you could do better if you had the right (or better) priors.
But, on the other hand, this seems almost the opposite of Kiefer’s allusion to ‘luckiness’ (alluding to data, not priors), which is keen to report the error probabilities conditional on your lucky (or unlucky) data (in contrast to Neyman). But you say they are connected. (Kiefer may well have other notions; it was your talk that got me to go back and check his 1977 papers.)
Hi Deborah,
When I wrote that Kiefer already had the idea of ‘luckiness’, I indeed meant exactly what you wrote:
“Kiefer’s luckiness (alluding to data, not priors) is keen to report the error probabilities conditional on your lucky (or unlucky) data (in contrast to Neyman).”
My idea of luckiness goes well beyond Kiefer’s (the extended use of the word probably dates back to work by R. Williamson), but one can really view it as an extension of Kiefer’s idea; it’s certainly not a contradiction. I think this is an important point you raise here, so let me clarify:
In Kiefer’s approach, the ‘luckier’ you are on the data you observe (e.g. the data very strongly point to one of your hypotheses), the stronger your conclusion will be.
In my approach, the same holds: the ‘luckier’ you are on the data you observe, the stronger your conclusion will be.
But now ‘luckiness’ is measured in terms of – informally stated – how likely the data are under your prior assumptions.
A very simple example: consider the normal location family with fixed variance 1. A Bayesian might put, for mathematical convenience, a prior on the mean parameter \mu that is itself a normal distribution, with variance 1 and some mean \tau. In the approach I described we could do the same thing, and then compute, for example, the posterior mean estimate \bar{\mu} of \mu given the data. Now – and here things become different from a Bayesian approach – I can calculate a ‘luckiness confidence interval’ around \bar{\mu}. The 95% luckiness confidence interval around \bar{\mu} will be narrower (which means more knowledge about \mu!) if \bar{\mu} is close to \tau: if you’re lucky, the data are well-aligned with the prior (\bar\mu is close to \tau) and you get a more informative interval. If you’re unlucky, \bar\mu is far from \tau (‘the prior seems wrong, in retrospect’), and you get a wider, less informative interval.
But importantly, the luckiness confidence interval is valid in general, independent of the distance of \bar\mu to \tau. So this is what I meant by ‘objective subjectivity’: you can use priors, and they determine your method (\bar\mu depends on the prior), but you get frequentist confidence statements about your method that hold independently of whether the prior is in any sense correct or valid – it’s just that the confidence statement you get is data dependent, and will be stronger if you’re lucky (the data correspond to the prior in some sense).
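For concreteness, here is a little Monte Carlo sketch (purely illustrative; the code and names are mine). It estimates the frequentist coverage of the naive 95% Bayesian credible interval in the example above, for a single observation x ~ N(\mu, 1) with prior \mu ~ N(\tau, 1), across a grid of true values of \mu:

```python
import math
import random

def credible_interval(x, tau, z=1.96):
    """Naive 95% credible interval for mu given one observation x ~ N(mu, 1)
    and prior mu ~ N(tau, 1): the posterior is N((x + tau)/2, 1/2)."""
    center = (x + tau) / 2
    half = z * math.sqrt(0.5)   # posterior standard deviation is 1/sqrt(2)
    return center - half, center + half

def coverage(mu, tau, trials=100_000):
    """Monte Carlo estimate of frequentist coverage at the true value mu."""
    hits = 0
    for _ in range(trials):
        x = random.gauss(mu, 1)
        lo, hi = credible_interval(x, tau)
        hits += lo <= mu <= hi
    return hits / trials

tau = 0.0
for mu in [0, 1, 2, 3, 4]:
    print(f"mu = {mu}: coverage approx. {coverage(mu, tau):.3f}")
```

The naive interval over-covers near \mu = \tau and under-covers badly far away; the gap it leaves far from \tau is exactly what a valid luckiness interval closes by widening there, while cashing in the surplus near \tau as a narrower interval.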
So note that what I’m doing here is *not* consistent with what you refer to as ‘Neyman’s stipulation that performance of methods be independent of prior probabilities’. The performance of the luckiness methods as above is very dependent on the prior probabilities – it’s just that you prove confidence statements about the method which hold generally, irrespective of whether the prior probabilities are ‘correct’ or ‘good’.
I hope this clarifies the issue a little.
Peter: Thanks for your reply. Firstly, what Neyman meant by independence of the prior is that the error statistical claims hold regardless of the prior, not that one couldn’t do better if one had a legitimate prior. Confidence intervals are, of course, already data dependent, as are severity assessments (which naturally have more gradations). Conditional claims may be warranted, enabling more precise error probability claims, but I don’t see why this is tantamount to introducing a prior. In truth I don’t know enough about what you’re recommending (or anything really), so I shouldn’t be guessing.
Maybe a plot or two can clarify things a bit. Here’s a plot of one possible 95% luckiness confidence interval for the NIID(mu, 1) model, shown in Neyman’s confidence-belt form. That is, the abscissa is the datum x and the ordinate is the parameter mu. Upon observing a random value x, draw a vertical line at that value – the intersections with the confidence belts give the realized CI. For any given mu (i.e., horizontally), the confidence belt encloses 95% of the sampling distribution’s probability mass, ensuring that the above CI procedure has exactly correct confidence coverage. The luckiness confidence belt is shown in red; the usual procedure is shown in blue.
Here’s a plot of the expected (purple) and realized (red) length of the CI. The usual procedure is again shown in blue. The luckiness procedure’s best case and worst case relative expected lengths are -13% and +19% respectively.
A few things are immediately apparent from the plot. First, there’s no such thing as a one-sided luckiness CI – luckiness maintains the confidence property by pushing probability mass around horizontally, and there’s no freedom to do that in a one-sided CI procedure. Second, for the null hypothesis “mu = 0”, inverting the CI to get a rejection region gives the same region as the usual CI. I doubt it’s a coincidence that mu = 0 is also the value of the parameter that minimizes the expected length. Third, the luckiness CI demonstrates the manner in which CI procedures can depend on data that might have happened but didn’t. For example, if x = 0 is observed, one can claim that the 95% confidence interval is [-1.68, 1.68] rather than the usual [-1.96, 1.96] *provided* that had one observed, e.g., x = 4, one would have reported [1.42, 5.66] instead of the usual [2.04, 5.96].
It’s not entirely clear how luckiness CI procedures are related to prior probabilities. Perhaps the notion is that one minimizes the double expectation of the luckiness CI length with respect to both the sampling distribution and a prior distribution. This seems more like an exercise in decision theory to me, since the choice of coverage level and of the parameterization in which to minimize CI length ought to be motivated by problem-specific substantive concerns.
Corey: Thanks. I’m intrigued by luckiness confidence intervals, though I’m far from seeing where they come from; I haven’t had time to study this (and won’t for a week), but wanted to note my interest. By the way, are there connections with those “confidence distributions” you cited a while ago?
I’d like to know what Peter thinks of your remark about the (lack of?) connection to prior probabilities (aside from the double expectation you mention).
Mayo: In this case, the luckiness CI procedure came from me shoving the edge of the confidence belt over a bit by algorithmic fiat.
There’s one thing I’d like to note at this point. You have stated that “[i]f there is genuine knowledge, say, about the range of a parameter, then that would seem to be something to be taken care of in a proper model of the case—at least for a non-Bayesian.” I don’t think this is quite right. I think that the correct statement is that for a non-Bayesian, knowledge about the range of a parameter goes into the function mapping the data to the procedure’s output. This is true for hard parameter constraints, as in Feldman and Cousins, A Unified Approach to the Classical Statistical Analysis of Small Signals, and also for soft constraints, as in luckiness CI procedures.
I don’t believe that there are connections between luckiness CIs and confidence distributions — confidence distributions give rise to valid one-sided CIs, so they can’t be lucky.
Here’s a relevant reference: Robin Willink. Shrinkage confidence intervals for the normal mean: Using a guess for greater efficiency. Canadian Journal of Statistics, 2008, 36(4), 623-637.
Hi Deborah,
Just a short clarifying note:
you wrote
“Confidence intervals are, of course, already data dependent”
this is true and indeed, you continue to write
“…Conditional claims may be warranted, enabling more precise error probability claims, but I don’t see why this is tantamount to introducing a prior”
Just to clarify: it’s not “tantamount” to introducing a prior; you can have conditional claims (and ‘luckiness’) with and without priors. Priors offer a lot of additional flexibility, and I think you can’t do without them in some contexts (McAllester’s method of “PAC-Bayesian classification”, popular in some corners of machine learning, is one example). But in other contexts you can certainly get ‘luckiness’-type conditional claims without any priors. I’m not sure whether Kiefer himself already thought of using priors.
I found Peter Grünwald’s observation that VC-theory is limited to the i.i.d. setting potentially very helpful for understanding the strong claims that its proponents make about it, and perhaps also their claims about its philosophical implications, since I doubt that the i.i.d. assumption ever holds for any real-world behavioral science phenomenon.
Paul: Well, it’s interesting that you say that, because when I said something along the lines of the first part of your claim at the CMU conference, Cherkassky sounded very surprised. He was and is claiming that this provides a well-formed way to present the problem of induction, and is therefore the proper way to state (and presumably solve) it. He made it clear, of course, that he was assuming the training and test cases were i.i.d., and perhaps the background knowledge in those classification problems justifies this. Which is fine, but that won’t be true in general for inductive learning.
Actually, others show that even i.i.d. doesn’t always take one that far—something that comes up in Wasserman’s (2011, RMM) paper.
Thank you, Peter, for this very nice account, with lots of intuition, of minimum description length theory and how it compares to other learning theories like Vapnik-Chervonenkis.
I was asked at the workshop about how minimum description length relates to the grue problem. Goodman’s Riddle can be seen as a direct attack on the principle of minimizing description length (MDL), so this is a good point for interdisciplinary discussion. I assume familiarity with Goodman’s “New Riddle of Induction” involving the grue(t) predicate: an emerald is grue(t) if it is examined before time t and green, or if it is examined at or after time t and blue. Computer scientists may appreciate Russell and Norvig’s brief discussion of Goodman’s Riddle in their AIMA textbook. In the third edition, you can find it in Ch. 19 on rule learning, at the beginning of the biographical notes. Or look up “grue” in the index.
Most MDL theories are theories of the complexity of strings, so the first step is to translate data and hypotheses into strings, meaning sequences of symbols from a finite set. I think that the best way to explain the grue problem in terms of the MDL principle is as an issue about how to code the *data*, rather than as an issue of how to code the different hypotheses. Goodman does say that for any sample of grue(t) emeralds observed before the critical time t, we have “parallel evidence statements”, namely that all of them are green, and equivalently, that all of them are grue(t).
To specify how we encode observations as strings, let us say that green/blue speakers write “0” for each time a green emerald is observed, and “1” for each blue emerald. Then “all emeralds are green” corresponds to the infinite sequence “00000000000….”. And “all emeralds are grue(3)” corresponds to the sequence “000111111….”. There are theories of descriptive simplicity that give the result that the sequence “00000000…” is simpler than the sequence “0001111111….”. For example, Kolmogorov’s definition of the complexity of strings has that consequence.
The argument then continues by pointing out that a grue/bleen speaker may well prefer the following encoding of the data: write “0” for each grue(3) emerald, and “1” for each bleen(3) emerald. Then the hypothesis “all emeralds are green” corresponds to the infinite sequence “000111111….” and the hypothesis “all emeralds are grue(3)” corresponds to “000000000…”. If the string “0000000….” is simpler than the string “00011111….”, we see that the prescription of maximizing string simplicity reverses the hypothesis chosen: under the green/blue encoding, we get that “all emeralds are green” should be projected, and under the grue/bleen encoding, we get that “all emeralds are grue” should be projected. Contradictory conclusions from the same observations, but with different syntactic representations of the data.
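This symmetry can be made completely mechanical. Here is a toy script (entirely my own, for illustration; a crude run-length count stands in for Kolmogorov complexity) showing that each speaker’s own generalization comes out as the simpler string under that speaker’s encoding:

```python
def encode(colors, t, grue_speaker=False):
    """Encode observed emerald colors ('green'/'blue') as bits. A green/blue
    speaker writes 0 for green, 1 for blue. A grue/bleen speaker writes 0 for
    grue(t) -- green when examined before time t, blue at or after t."""
    bits = []
    for i, c in enumerate(colors):
        if grue_speaker:
            grue = (c == "green") if i < t else (c == "blue")
            bits.append(0 if grue else 1)
        else:
            bits.append(0 if c == "green" else 1)
    return bits

def run_length_cost(bits):
    """Crude description length: one unit per maximal run of equal bits."""
    return 1 + sum(a != b for a, b in zip(bits, bits[1:]))

t = 3
worlds = {
    "all emeralds are green":   ["green"] * 9,
    "all emeralds are grue(3)": ["green"] * t + ["blue"] * (9 - t),
}
for hypothesis, colors in worlds.items():
    for speaker in ("green/blue", "grue/bleen"):
        bits = encode(colors, t, grue_speaker=(speaker == "grue/bleen"))
        print(f"{hypothesis:25s} | {speaker:10s} speaker: "
              f"{''.join(map(str, bits))}  cost = {run_length_cost(bits)}")
```

Each speaker assigns cost 1 to their own favored generalization and cost 2 to the rival one – Goodman’s point in executable form.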
I’m not quite sure how Rissanen’s theory of parametric complexity, which Peter mentions, applies to the Riddle of Induction. Each generalization is a point hypothesis; there are no parametrized models, so I’ll just use the term “hypothesis” for “point hypothesis”. There isn’t really a question of how to code the data given a point hypothesis, because each hypothesis deterministically entails the data (or is inconsistent with the data). In probability notation, P(data|hypothesis) = 1 or = 0. Looking at Peter’s fine tutorial http://homepages.cwi.nl/~pdg/ftp/mdlintro.pdf, this seems to entail that model selection just comes down to the model complexity term COMP_n. This notion of model complexity depends on the number of data sequences that a hypothesis can fit. In the grue problem, each hypothesis fits exactly one data sequence for each n < t, so it seems there is no preference. This agrees with retraction theory/topological complexity as I presented it.
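Making this concrete, using the definition from the tutorial (as I read it):

COMP_n(M) = \log \sum_{x^n} \max_{H \in M} P(x^n \mid H).

For M = {green, grue(t)} and n < t, both hypotheses assign probability 1 to the same prefix 0^n and probability 0 to every other sequence, so the sum has a single nonzero term and COMP_n = \log 1 = 0. The complexity term is a property of the whole model, and it is silent about which of the two hypotheses to project.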
The grue version I presented at the workshop was more complex, because it allowed infinitely many grue predicates grue(1), grue(2), …. It seems to me that in this case, the conclusion is again that parametric complexity does not prefer one generalization over the other. This disagrees with retraction theory/topological complexity. It seems that, generally, in a setting with deterministic point hypotheses, parametric complexity doesn’t select a model.
In terms of Peter's tutorial, what topological complexity does is to assign a ranking to hypotheses, corresponding to code lengths L(H). For instance, if the topological complexity of the whole hypothesis space is n, and hypothesis H has topological complexity k, it could be assigned code length L(H) = n-k. Kevin was discussing how this complexity notion can be interpreted in terms of regret, with respect to retractions (and other losses).
Kevin’s way of connecting retractions with regret leads me to a speculation, I hope an interesting one. I wonder if the idea of minimax regret coding can also be applied to codes L(H) for hypotheses. The general idea is that, given a data sequence, we would want to assign shorter codes to hypotheses that fit the data better, and the difference between the prior and posterior codes would be the regret. So the regret is not only with respect to compressing the data, but also with respect to whether, in hindsight, we assigned too much complexity to a correct hypothesis; call this “total regret”. One formal way to cash out the intuition in terms of Peter’s tutorial is to take regret with respect to the “crude” two-part hypothesis score L(H) + L(D|H). I am embarrassed to associate myself with anything so crude as crude MDL, but my feeling is that this will select “all emeralds are grue” in my infinitary version of the problem. It may even more generally select code lengths L(H) that correspond to topological complexity. I cannot go into the mathematical details here, but I think it would be a beautiful result to show that Rissanen’s concept of simplicity from the 1990s agrees with Cantor’s concept of simplicity from the 1890s!
Oliver: Thanks for your comments. Let me take a little lull here to jump in with an anecdote from my experience solving grue. Of course the grue problem in philosophy (with its zillions of articles and books) stands as a towering monument to the logical empiricist program, wherein accounts of inductive inference (or logics of confirmation) were to be based on purely syntactic properties alone. This is also the case with the other infamous “paradoxes of confirmation”, which ultimately (I think) led to the winding down of the logical positivist confirmation program* – at least in that traditional form.
The problem in a general nutshell:
If it is assumed that confirmation follows a type of straight rule – (roughly) for properties A, B: from all/most A’s observed to be B’s, infer that all/most/the next A will be a B – then one can have 100% positive instances for two generalizations, even though they give contradictory predictions for unobserved cases. (Just as many curves can fit observed points while giving different predictions for unobserved cases.)
When I submitted my manuscript “Error and the Growth of Experimental Knowledge” (1996), David Hull (editor of the Chicago series in conceptual foundations of science) was utterly horrified to find it contained… yes, a chapter on grue. The book, after all, purported to exemplify post-positivist philosophy of science, so that chapter should go. When I shared this report with Wesley Salmon, he responded with an analogy to Hume’s famous remark; though I don’t have his letter in front of me, it went something like “If we take in our hand a volume of philosophy, and ask, Does it contain a chapter on grue? Yes. Commit it then to the flames.” Of course, I removed that chapter (the manuscript was too long anyway) and I did publish the book in his series.
But I was arguing that a proper account of testing that is not purely syntactical escapes the problem. In a nutshell, I had argued (1) that the straight rule is a highly unreliable rule: it finds support for a hypothesis H with data x even though H has passed a test with low or even zero severity with x; and (2) that although the green and grue hypotheses are equally well confirmed according to the syntactic straight rule, they are not equally well tested by x. The underdetermination vanishes with the problem. (I never published the chapter.) There may be a connection with your discussion, but I definitely do not advise pursuing it.
*In computer science, it seems clear that this type of purely formal curve fitting problem remains very central.
Statisticians generally tackle the grue problem numerically. But grue is (or at least should be) treated quite differently from a physical sciences perspective.
For example, the physical scientist can definitely contemplate a grue emerald. A major focus would be to posit a coherent process by which the gem’s color would be green until time t, and blue thereafter.
It would be fairly natural to explore the possibility of ionic substitution of iron for the chromium or vanadium impurities — in what medium could this be effected by application of electrical current in order to achieve the transformation at t? — but abundant alternative explanations could be advanced, such as acceleration to a spectrum-shifting velocity.
The favored hypothesis would draw support from specific information bearing on the grue emerald under investigation. (“Is it on a rocket ship?”)
The point being that for the physical scientist, grue would not be a matter of definition, or some attempt to extrapolate from a sequence of numericized observations. Rather, it would resolve as (1) a “normal science” account of the phenomenon (i.e. employing established scientific principles) … or (2) scientific advance by postulation of a new principle that would not deal just with grue but also explain additional hitherto puzzling phenomena.
Paul: Something along these lines would be the basis for my claim that the data x cannot have well-tested both grue and green hypotheses.
Yep.
Peter: Thanks for the clarification. I am glad we agree that conditional claims may be warranted, enabling more precise error probability claims, while not being tantamount to introducing a prior. The pathological examples in statistics that are “cured” by conditioning actually are mostly if not all rigged to permit a pathology that would never have crept into an error statistician’s framing of the example (as my colleague Aris Spanos argues; I don’t have a reference handy).
But back to your point: “Priors offer a lot of additional flexibility, and I think you can’t do without them in some contexts (McAllester’s method of ‘PAC-Bayesian classification’, popular in some corners of machine learning, is one example)”. It seems to me, on the basis of the little I know of these classification problems, that the priors have legitimate frequentist interpretations, and as the goal is prediction, it seems akin to estimating the probability of a “successful classification” – an error probability. If this is even roughly the point, I could hardly disagree; whether one wants to call such an assessment “Bayesian” seems a matter of terminology, but it can be confusing. But I may be missing your point.