possible progress on the comedy hour circuit?

[Image: businesswoman rolling a giant stone]

It’s not April Fool’s Day yet, so I take it that Corey Yanofsky, one of the top 6 commentators on this blog, is serious in today’s exchange, despite claiming to be a Jaynesian (whatever that is). I dare not scratch too deep or look too close…along the lines of not looking a gift horse in the mouth, or however that goes. So here’s a not-too-selective report from our exchange in the comments on my previous blogpost:

Mayo: You wrote: ”I think I wrote something to the effect that your philosophy was the only one I have encountered that could possibly put frequentist procedures on a sound footing; I stand by that.” I’m curious as to why I deserve this honor ….

Corey: Mayo: It was always obvious no competent frequentist statistician would use a procedure criticized by the howlers; the problem was that I had never seen a compelling explanation why (beyond “that’s obviously stupid”). So you deserve the honor for putting forth a single principle from which error statistical procedures flow that refutes all of the howlers at once.

Mayo: Corey: Wow, that’s a big concession even coupled with your remaining doubts….maybe I should highlight this portion of our exchange for our patient readers, looking for any sign of progress…

Corey: Mayo: Feel free to highlight it. I will point out that this “concession” shouldn’t be news to you: in an email I sent you on September 11, 2012, I wrote, ‘I now appreciate how the severity-based approach fully addresses all the typical criticisms offered during “Bayesian comedy hour”. Now, when I encounter these canards in Bayesian writings, I feel chagrin that they are being propagated; I certainly shall not be repeating them myself.’

Mayo: Ok, so you get an Honorable Mention, especially as I’m always pushing this boulder, or maybe it’s a stone egg. It will be a miracle if any to-be-published Bayesian texts or new editions excise some of the howlers!

But I still don’t understand the hesitancy in coming over to the error statistical side….


42 thoughts on “possible progress on the comedy hour circuit?”

  1. Nicole Jinn

    I agree with Corey that you deserve such an honor – your hard work is paying off, and please remember that I am on your side. I, too, have not been able to understand the reluctance to accept your error statistical approach, at least ever since I completed my transition from the formal/mathematical sciences to philosophy. In fact, understanding the motivation behind that reluctance is going to be an essential component of my MA thesis!

  2. Nicole: You’re very kind, but don’t take this business too seriously. I don’t. I’ll settle for fewer howlers.

  3. David Rohde

    I have three comments…

    * It seems you have attended one or more Bayesian meetings where there was obnoxious behaviour or frequentist bashing… I am sorry to hear this… the few meetings I have been at have been fun, and I haven’t seen anything like this. I have never seen Bayesian meetings with the chorus of agreement that you suggest.

    * The “howlers” that I find most common and most convincing are the stopping-rule ones, i.e. binomial vs negative binomial trials or the faulty multi-meter story. I think you have an answer to this, but I haven’t seen it, or at least I haven’t made the connection… so if you can point me to a place where you discuss this, I will read it with interest. (See the numerical sketch following this comment.)

    * My Bayesian inclinations have very little to do with the “howlers”. I like the Bayesian focus on real things, real decisions and real observables, and that would be my main argument. I don’t like testing, and as a consequence I don’t like the “howlers”. Although it seems to me that the stopping rule principle at the heart of the howlers does have relevance to real problems. Here is my more practical “howler”: we are solving a binary pattern classification problem. Two decision boundaries are fit to the data, one from a very high capacity model (neural network / SVM), the other from a low capacity model. After these decision boundaries are fit, it turns out that they are identical and achieve 95% correct classification.

    The relevant question seems to me to be the Bayesian question: Which decision boundary has the lowest expected error rate? – which is of course the same in this case.

    I have a real problem seeing how the frequentist questions, which would consider the different sampling behaviour of the estimators/algorithms, are relevant. Yes, it is more surprising that the low capacity model achieved 95% correct classification than that the high capacity model achieved the same 95% correct classification…. but why should we care?….
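
A minimal numerical sketch of the stopping-rule example mentioned above, using the textbook version (9 successes in 12 Bernoulli trials; the specific numbers are illustrative, not from this exchange). The same data yield different p-values for H0: p = 0.5 depending on whether n = 12 was fixed in advance (binomial) or sampling stopped at the 3rd failure (negative binomial), even though the likelihoods are proportional, so a Bayesian analysis with a fixed prior returns the same posterior either way:

```python
# Illustrative sketch only: the classic binomial vs. negative-binomial
# stopping-rule example, assuming scipy is available.
from scipy import stats

k, n = 9, 12   # 9 successes and 3 failures observed
p0 = 0.5       # null hypothesis value

# Binomial sampling plan: n = 12 fixed in advance; p-value = P(X >= 9).
p_binom = stats.binom.sf(k - 1, n, p0)

# Negative-binomial plan: stop at the 3rd failure; p-value = P(at least 9
# successes before the 3rd failure). With p0 = 0.5 the success/failure
# labelling in scipy's parameterization does not affect the answer.
p_negbinom = stats.nbinom.sf(k - 1, n - k, p0)

print(f"binomial p-value:          {p_binom:.3f}")     # about 0.073
print(f"negative binomial p-value: {p_negbinom:.3f}")  # about 0.033
```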

    • Corey

      David Rohde: I don’t regard optional stopping as a howler — in fact, I regard it as an irreconcilable difference in the approaches, and I’m currently attempting to create a steelman (opposite of strawman) CI procedure for optional stopping to compare and contrast to the Bayesian answer.

      Genuine howler: a hypothesis test with the desired size, but based on a random variable independent of the data, as promulgated by Jay Kadane.

      • Corey: yes, the irreconcilable difference is that we care about the error probabilities of procedures (see the simulation sketch at the end of this thread). The current interest in avoiding fraud may be relevant here.

        • Corey

          Mayo: Let me amend my statement: the way Jack Good liked to put the situation — offering a p-value that doesn’t reflect the optional stopping sampling distribution — is indeed a howler.

      • Corey: There are CI procedures for the optional stopping case.

        • Corey

          Mayo: I looked them up and didn’t like them. I want a procedure that optimizes some criterion — that way I know what is being achieved beyond confidence coverage.
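
To make the error-probability point in this thread concrete, here is a small simulation of my own (not from the exchange): peeking at N(0, 1) data after each new observation and stopping as soon as a nominal 5% z-test rejects drives the actual probability of rejecting a true null well above 5%.

```python
# Illustrative sketch only: optional stopping inflates the type I error of
# a nominal 5% two-sided z-test for H0: mu = 0 (sigma = 1 known).
import numpy as np

rng = np.random.default_rng(0)
n_sims, n_min, n_max = 5000, 10, 100
false_rejections = 0

for _ in range(n_sims):
    x = rng.normal(0.0, 1.0, n_max)       # H0 is true
    for n in range(n_min, n_max + 1):     # look after every new observation
        z = x[:n].mean() * np.sqrt(n)     # z-statistic for the running mean
        if abs(z) > 1.96:                 # nominal 5% rejection rule
            false_rejections += 1
            break

print(f"type I error under optional stopping: {false_rejections / n_sims:.2f}")
# prints roughly 0.3, several times the nominal 0.05
```

A Bayesian posterior for mu, by contrast, is unchanged by the stopping rule given the same data, which is why the two approaches genuinely part ways here.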

  4. Corey

    Mayo: Here’s why I don’t come over to the error-statistical side. You’ve confirmed to me by email that you don’t think it’s possible for an artificial intelligence to do science. But I currently see no reason to suppose that the scientific reasoning carried out by human brains is magic, i.e., uncomputable, like solving the Halting Problem; the opposite strikes me as fantastically more plausible. So I can only be satisfied by an account of science/learning/growth of experimental knowledge that is consistent with the notion that it’s possible — in principle — to reduce the task of generating an accurate model of reality to an algorithm. Perhaps the error-statistical approach can be extended to such an account. But you regard the notion as chimerical, so you’ve made no effort in this direction, and I suspect that any such account would be unacceptable to you. Conversely, Bayes is a natural fit — in fact, Jaynes motivates the Cox axioms through the literary conceit that he and his readers want to build a robot that reasons.

    A related issue is your requirement that scientists must be free to find all available hypotheses inadequate and think up new ones. “Pure” (for lack of a better word) Bayesian approaches seemingly don’t permit this, since Bayes operates on a partition of hypothesis space and partitions are exhaustive by definition. (Gelman’s approach isn’t pure in this sense.) Hence you view this requirement as a knock-down argument against Bayes as a general approach to scientific reasoning. In my view, this stance overlooks the fact that the set of all scientifically useful hypotheses is countable (because scientifically useful hypotheses must be written down to be useful, if for no other reason). Ignoring time and space constraints, Bayes is able to operate on probability distributions on this set. (Of course, any practical artificial intelligence will need to employ heuristic searches in the set of useful hypotheses, but this is no different from what human scientists do currently.)

    • Corey:
      I like this idea of coming over to the other side, like a spy. I didn’t respond to that e-mail from long ago (right?). I am pretty sure I saw it but was pondering your questions, and didn’t get back to it. The fact that I cannot create a robot to do science doesn’t mean it can’t be done or that I think it can’t be done. I don’t see why it couldn’t be done, with the right programs/robots. They’re programming fraud detectors now for the social psychologists and others. But I cannot see how that could be the deciding factor in determining the better foundation for statistical inference. If you want a tool for the job, and it turns out that, at the moment anyway, some of the ingenuity comes from humans and is not obviously assignable to robots, then you wouldn’t want to just lower the standards for the kind of problems you are able to solve.

      Put it this way: I think we can systematize ways to learn much faster than we do, and it’s also much faster than trotting out priors. But please explain why this would matter to you.

      I also don’t think the closedness of the Bayesian account is a knock-down anything… Extremely tired, so tell me more. By the way, are you still far away? I forget if it was Canada.

      So bottom line: AI can do my kind of science, it just has to be programmed the right way first.

      • Corey

        Mayo wrote: “AI can do my kind of science, it just has to be programmed the right way first.”

        This isn’t what you told me in our exchange back in September. I asked, “Ignoring time and space constraints, do you think it is possible to write an algorithm/program an artificial intelligence to carry out scientific investigation, including the requisite statistical reasoning?” and you replied, “No, obviously I don’t think an AI program can be created for science, fortunately—even when we all become bores or bots it won’t be!”

        This matters to me because I suspect that the number of crucial insights into intelligence necessary for human-level AI is rather small — perhaps 10 or less. (I think that Cox’s theorem is one, and Pearl-style causality is another.) But once that level is achieved, an I.J. Good-style intelligence explosion is, if not “unquestionable” in Good’s phrasing, at least a serious concern. By necessity, any safe AI must be built using correct principles for creating an accurate model of reality; any otherwise-safe AI built using substandard principles will get walloped by an unsafe AI built with correct principles. And then it seems to follow corollary-like that the principles of reasoning good enough for a safe, potentially weakly godlike intelligence, they ought to be good enough for me in my pursuit of scientific knowledge.

        • Corey

          Whoops, last sentence got mangled due to incomplete rephrasing. It should read: And then it seems to follow corollary-like that the principles of reasoning good enough for a safe potentially-weakly-godlike intelligence ought to be good enough for me in my pursuit of scientific knowledge.

          • Corey

            I forgot to respond re: location. Yes, Canada.

    • Paul

      The union of all countable sets is uncountable. Even the countable union of countable sets may be uncountable (viz., proving countability requires the axiom of countable choice, which cannot be proven in Zermelo-Fraenkel set theory without the axiom of choice).

      So any specific system of hypotheses may be countable, but the union of all such — or even a scientifically useful subset — may be uncountable.

      … But this isn’t a showstopper for Bayesian methods, since they can be applied to uncountable sets.

      • Corey

        Paul: I actually consider a hypothesis useful only if it leads to an effective method for calculating predictions. This set is definitely countable.

        • Paul

          But there are problems that no effective method can solve.

          Also, this ignores the problem of mapping an effective method to the real world. This set of maps is uncountable.

          • Corey

            Paul: If constructing an accurate model of reality is one of those problems that no effective method can solve, then we’re screwed anyway as far as I can see. And the problem isn’t mapping an effective method to the real world — it’s mapping observations to a set of effective methods. Since real data are stored on physical media, the set of possible data sets is finite, albeit unimaginably large. In fact, if you think about how almost all real data are stored nowadays — as binary files on computer disks — you’ll see that this isn’t a problem at all: just restrict consideration to effective methods that “know” the data file encoding already. (Likewise, prediction of future observables based on a given effective method must assume a stable data encoding.)

            • Corey: I thought you had banished all talk of an accurate model of reality. Do you think if you had all possible observable data (human or otherwise), you’d have an accurate model of reality?

              • Corey

                Mayo: I think that if I had all possible observable data except a small holdout set chosen in an ignorable way, a halting oracle sufficiently high in the arithmetical hierarchy (as far as I know, a halting oracle for the usual Turing machine would be enough, but I could be wrong) and unlimited computational resources, my *predictions* would be virtually error-free. I could also identify which model in the set of computable models was mostly responsible for the predictions — it would be the one with the smallest Kolmogorov complexity. I couldn’t claim that this model is *true*, but I would sure be interested in knowing what kind of physical theory it encoded.

        • Corey: I remain mystified by your points about effective methods and error statistics, but it would likely be too hard to discuss here. If by an effective method you have in mind the usual idea of one that solves a mathematical problem so as to get all right answers and no wrong answers, in finitely many steps…then it’s hard to see what this has to do with empirical science. It is both too strong and too weak, if you know what I mean.

          • Corey

            Mayo: “Effective method” just means there’s an algorithm to get predictions from the model. As a counterexample, a physical theory with predictions that depended on arbitrarily deep digits of Chaitin’s constant (for some specific programming language) would not have an effective method. It’s an extremely weak requirement.

      • Paul: What? A union of countable sets is countable, like, say, the rationals.

        • Corey

          Mayo: A countable union of countable sets is countable. An uncountable union of countable sets needn’t be. Something like the long line construction, but restricted to copies of the rationals in [0, 1), is an example of such a union.
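
Spelling out the set-theoretic point at issue, under the usual ZFC conventions (my own elaboration of the example above):

```latex
% With countable choice, a countable union of countable sets is countable:
% if each $A_n$ ($n \in \mathbb{N}$) is countable, so is $\bigcup_{n} A_n$.
% An uncountable union of countable sets need not be: indexing disjoint copies
% of the rationals in $[0,1)$ by the countable ordinals, as in the long-line
% construction mentioned above, gives
\[
  \bigcup_{\alpha < \omega_1} \{\alpha\} \times \bigl(\mathbb{Q} \cap [0,1)\bigr),
\]
% a set of cardinality $\aleph_1 \cdot \aleph_0 = \aleph_1$, hence uncountable.
```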

      • Corey

        Mayo: I seem to recall including the appeal-to-scientific-practice while verbally summarizing EGEK to a friend. This memory could be a confabulation at any of a number of levels.

          • Corey: You verbally summarized EGEK? The philosophy of science is certainly intended to be relevant to practice—relevant to how humans actually learn and speed up learning—but that’s different from saying it gets a justification from anyone consciously following it. Or rather, once there is a scientific success story, any number of accounts can be claimed to represent it. It’s intended to be forward-looking.

    • David Rohde

      Corey: Yes, I agree the Kadane example isn’t great…. Although the rest of the book seems worthwhile.

      I think an elaboration of the classification example above might be made into a “steel man”. A philosophical puzzle that has been semi-dormant with me for almost a decade is why cross-validation is so highly regarded when it violates the likelihood principle… Maybe one day I will try to submit a discussion of this to a philosophy journal and try my luck there….

      I find the Hutter-Schmidhuber stuff that you often mention interesting, but also puzzling. I interpret it as a form of objective Bayes. My main queries with it are: a) why should subjective probabilities take on the form given by some (sophisticated) objective Bayes rule, and b) what does convergence mean in these contexts when repetition doesn’t seem to be assumed (as far as I can tell)…

      • Corey

        David: There’s not necessarily any reason for subjective probabilities to follow the form of Solomonoff induction. It all depends on what philosophical baggage one is willing to take on board before setting sail. If one is satisfied with de Finetti’s coherence argument and nothing else, there’s no contradiction I could exploit to force a change. I personally am not satisfied with just de Finetti-style coherence.

        The convergence is in squared one-step prediction error, so no repetition is needed.

      • David: All of the revered sampling and resampling methods for inference violate the likelihood principle. I know of no rationale for its support outside of extremely limited problems where comparative assessments are all that matter. That was also Birnbaum’s position.

  5. David Rohde

    Corey: Thanks for the answer, although I don’t fully understand still.

    If Solomonoff induction doesn’t lead to subjective probabilities, it would appear, along with other objective Bayesian methods, to be unable to subjectively order decisions…

    Repetition seems to give something to converge to. What does Solomonoff induction converge to… surely it isn’t a delta function on the truth?…

    I appreciate that I am being lazy and could read some very hard papers instead of asking you questions….

    • Corey

      David: There are two ways in which Solomonoff induction can be said to converge. Both pertain to a sequential/time-series setting.

      In the deterministic framework, the final state of the tape after the operation of a possibly non-halting Turing machine is revealed one symbol at a time. The Solomonoff prior is a probability distribution over Turing machines. The sum of the one-step-ahead squared prediction errors is bounded as the data go to infinity.

      In the stochastic framework, some computable probability measure is postulated to generate the data. The Solomonoff prior is a probability measure over the set of computable probability measures, and the posterior predictive distribution converges in Hellinger distance to the “true” probability measure generating the data.
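
For concreteness, this is the flavour of the bound being alluded to, written from memory of Hutter’s presentation of Solomonoff’s result; the exact constants and conditions should be checked against the source. With M the Solomonoff mixture, mu a computable measure generating a binary sequence, and K(mu) the prefix complexity of mu:

```latex
\[
  \sum_{t=1}^{\infty} \mathbb{E}_{\mu}\!\left[
    \bigl( M(x_t = 1 \mid x_{<t}) - \mu(x_t = 1 \mid x_{<t}) \bigr)^{2}
  \right]
  \;\le\; \tfrac{1}{2}\,\ln 2 \cdot K(\mu) \;<\; \infty .
\]
```

Square-summability forces the one-step prediction errors to go to zero, which is the sense in which convergence here needs no notion of repeated sampling; in the deterministic framework mu is concentrated on a single computable sequence and an analogous bound applies.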

      • Corey: I can’t imagine what meaning there is to the “distance to “true” probability measure generating the data”. How does the “true” (Bayesian) probability measure generate data?

        • Corey

          Mayo: This is really a result in probability theory proper. Let S be the Solomonoff probability measure, H(.,.) be the Hellinger distance, x be the data from 1 to n, and y be the (n+1)th datum. Then for all P in the set of computable probability measures, H(S(y|x), P(y|x)) goes to 0, P-almost surely. I called P the “true” probability measure because that’s how statisticians, frequentists and Bayesians alike, usually relate such results to the real world.
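
For readers unfamiliar with the metric: the squared Hellinger distance between the two one-step predictive distributions is (up to a factor of 1/2 that some authors omit)

```latex
\[
  H^{2}\bigl(S(\cdot \mid x),\, P(\cdot \mid x)\bigr)
    \;=\; \tfrac{1}{2} \sum_{y} \Bigl( \sqrt{S(y \mid x)} - \sqrt{P(y \mid x)} \Bigr)^{2} ,
\]
```

and the convergence claim is that this quantity goes to 0 as n grows, P-almost surely, for every computable P.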

    • David Rohde

      Thanks for the reply Deborah, I just saw this now.

      I am struggling to think of a real problem which is not a comparative assessment. Real problems seem to be of the form “should we take decision 1 or decision 2?”, i.e. they are a direct comparison.

      If somebody, say a policy maker, wants to use sampling theory and interpret it to make a decision, they might well be disturbed to know that the outcome of the test depends on what they would have done with data they didn’t get, in a way that appears to have no bearing on a reasonable assessment of what the future will bring and, in turn, of the quality of the decisions being contemplated.

      • Sampling distributions, on which computations of precision, reliability, and accuracy are based, are certainly relevant for inferences about the future. The problem with comparative appraisals is that (i) selecting from A and B is a bad idea when they’re both false, poor, or whatnot; that is why a severity analysis can’t be seen as obeying the posterior probability calculus; (ii) we don’t have a complete set of discrete possibilities, and the better of the two may still be horrible; (iii) we often want to ascertain whether this model or hypothesis is adequate, or how consistent it is with data.

  6. David Rohde

    Hi Corey,

    Thanks for your informative response! I am admittedly confused about how important this result is.

    I think the meaning of P(.) is not really clear in either the frequentist or the Bayesian sense… It makes most sense to me to say that the true value of y is simply y, and that the idea that there is a true probability has no clear meaning…. but your response sheds some light on it…

    • David Rohde

      Maybe this is a good time to agree to disagree….

      I don’t see how the sampling distribution is relevant for forecasting. I can use an estimate to produce a plug-in predictive distribution, ignoring model uncertainty…. but what can I do with the sampling distribution of the estimator? Shouldn’t this broaden the predictive distribution somehow?… Also, while the data are irrelevant to the sampling distribution of the estimator, what I would have done with data I didn’t get is relevant… this seems to make it simply the wrong tool to “adjust” the plug-in predictive distribution to account for model uncertainty…

      (i) and (ii) In real life most of the time you must decide. If you have the option not to decide then that is an additional decision “C”.

      (iii) Questions of this form are difficult to formalise in a satisfying way; furthermore, while some might say that is what they want, if the answer cannot be used to drive a decision theory, then are they really useful?

      • Well, you can use sampling distributions for things like determining the consistency of the estimator, and the accuracy and precision of the estimates. Silly things like that.

  7. David Rohde

    Exactly, you can’t use it as a driver for decision theory.

  8. David Rohde

    A little more verbose…

    The posterior uses the data to give the uncertainty of the _estimate_. This can be propagated through to produce a predictive distribution. The predictive distribution allows the application of decision theory (to real observations).

    In contrast, the sampling distribution uses the sample space to give the uncertainty of the _estimator_. This seems (to me) to depend on arbitrary concerns; moreover, it cannot be used to make predictions or to apply a decision theory (to real observations). (See the sketch following this comment.)

    You are very consistent in advocating error probabilities, but I think the value of error probabilities is so obvious to you that you feel no need to explain why. The value of error probabilities is not obvious to me.

    Qualification: The real world is messy… approximate methods such as approximate Bayesian computation mix together conditional and unconditional uncertainty; given the choice, I am interested in conditional uncertainty only – but the real world is rarely so neat…. When an approximation is employed, the likelihood principle is violated and a version of sampling theory is needed. For example, sampling theory is used to analyse MCMC output.
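
A minimal numerical contrast, with made-up numbers and a uniform Beta prior standing in for whatever prior one would actually use, between the plug-in predictive and the posterior predictive for Bernoulli data; it is meant only to make the “propagate the uncertainty of the estimate” point above concrete:

```python
# Illustrative sketch only: plug-in vs. posterior predictive for Bernoulli
# data, predicting the number of successes in m future trials.
from scipy import stats

successes, trials = 7, 10      # made-up data
a0, b0 = 1.0, 1.0              # uniform Beta(1, 1) prior (an assumption)
m = 20                         # number of future observations

# Plug-in predictive: fix theta at the MLE and use Binomial(m, theta_hat).
theta_hat = successes / trials
plugin_pred = stats.binom(m, theta_hat)

# Posterior predictive: integrate theta out against the Beta posterior,
# giving a Beta-Binomial(m, a_n, b_n) distribution.
a_n, b_n = a0 + successes, b0 + trials - successes
posterior_pred = stats.betabinom(m, a_n, b_n)

print(f"plug-in predictive sd:   {plugin_pred.std():.2f}")     # about 2.0
print(f"posterior predictive sd: {posterior_pred.std():.2f}")  # about 3.3
# The posterior predictive is wider because the uncertainty about theta has
# been propagated into the prediction rather than plugged in as a point estimate.
```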
