Deviates, Sloths, and Exiles: Philosophical Remarks on the Ockham’s Razor Workshop*

Picking up the pieces…

My flight out of Pittsburgh has been cancelled, and as I may be stuck in the airport for some time, I will try to make a virtue of it by jotting down some of my promised reflections on the “simplicity and truth” conference at Carnegie Mellon (organized by Kevin Kelly). My remarks concern only the explicit philosophical connections drawn by (4 of) the seven non-philosophers who spoke. For more general remarks, see the blogs of Larry Wasserman (Normal Deviate) and Cosma Shalizi (Three-Toed Sloth). (The following, based on my notes and memory, may include errors/gaps, but I trust that my fellow bloggers and sloggers will correct me.)

First to speak were Vladimir Vapnik and Vladimir Cherkassky, from the field of machine learning, a discipline I know of only formally. Vapnik, of Vapnik-Chervonenkis (VC) theory, is known for his seminal work here. Their papers, both of which directly addressed the philosophical implications of their work, share enough themes to merit being taken up together.

Vapnik and Cherkassky find a number of striking dichotomies in the standard practice of both philosophy and statistics. They contrast the “classical” conception of scientific knowledge as essentially rational with the more modern, “data-driven” empirical view:

The former depicts knowledge as objective, deterministic, rational. Ockham’s razor is a kind of synthetic a priori statement that warrants our rational intuitions as the foundation of truth with a capital T, as well as a naïve realism (we may rely on Cartesian “clear and distinct” ideas; God does not deceive; and so on). The latter empirical view, illustrated by machine learning, is enlightened. It settles for predictive successes and instrumentalism, views models as mental constructs (in here, not out there), and exhorts scientists to restrict themselves to problems deemed “well posed” by machine-learning criteria.

But why suppose the choice is between assuming “a single best (true) theory or model” and the extreme empiricism of their instrumental machine learner? A similar dichotomy arises in the description of “classical” statistics as contrasted with modern accounts of machine learning. The classical statistician is viewed as beginning with a known parametric distribution, or a true statistical model, the goal being to estimate or test parameters within it. This is an erroneous if familiar caricature of what goes on in statistical science. I gained some insight from Cherkassky during a Q and A period: The perspective stems from a complaint by Leo Breiman and other new machine learners that statistics had downplayed their work for some time. But biases in editorial policy, whether past or present, do not themselves justify so superficial a conception of classical statistics.

Cherkassky qualified the successes of machine learning in two important ways: (1) machine learning aims for good predictions but provides us with wholly uninterpretable “black boxes”; and (2) machine-learning inductions, based on training samples and teachers, work only so long as stationarity is sufficient to ensure that the new data are adequately similar to the training data. He did not seem to see these as noteworthy drawbacks or limitations, ones that (while perhaps just fine for machine learning) an adequate, full-bodied account of statistical science would, and does, regularly break out of.

Cherkassky began with the interesting claim that philosophical ideas form only in the context of scientific developments and in response to technological advances. I take his upshot to be that now that science has changed (toward empirical machine learning) philosophy of science changes accordingly.

This general position deserves further reflection. For now I will say that, since the technology of machine learning is only a small part of science, even if it has demanded an instrumental philosophy, it does not follow that this would be an adequate philosophy for science in general. Also, consider this radical possibility: how do they know that the goals of machine learning would not be furthered by striving to understand underlying mechanisms, empirical and theoretical?

I have no reason to doubt that machine learning has had great successes. I hope it has a machine in the works to obviate cancelled flights and save me from having to plead with the machine for an “agent” to fix the mess.

This morning (Sunday), Vapnik clarified some of the formal issues, but repeated the idea that we should restrict ourselves to so-called well-formed problems. My own interest is in how scientists arrive at sufficiently well-formed problems despite very messy points of departure, and despite their more ambitious goals, which include a theoretical and empirical understanding of phenomena. It is one thing to view the work of machine learning as having carved out an important domain with increasingly diverse applications, but quite another to suggest that that is all there is to learning about the world. Vapnik invoked at least partial connections to Popperian falsificationism. But Popper would have regarded the enterprise as akin to Kuhnian “normal science,” or science constrained within the bounds of a paradigm.

Vapnik concluded, if I understood him (and this is a first pass for me), that in order to improve the teacher/trainer in machine-learning classification tasks (e.g., distinguishing a handwritten 5 from an 8), we must, to handle equivocal cases, consider features that go beyond the usual classification features, in particular, he thinks, various metaphors and mystical, holistic, “yin/yang” harmonies. The list I made during the talk includes what may be seen as human idiosyncrasies (e.g., egotism, willfulness, stubbornness, a willingness to cause pain). I am not surprised that capturing the shades of human discrimination requires us to go beyond observable sense data; if given the stark[i] choice between mysticism and a naïve empiricism, humans will always tend toward mysticism.

Peter Gruenwald asked the same question I often ask: “Where are the philosophers?” [on a variety of issues in contemporary statistical science]. He raised the problem that arises when Bayesians are led to revise their priors on the grounds that they do not like the resulting posteriors. To avoid Bayesian inconsistency, he says, requires “non-Occam priors.” This should be understood, he suggests, in terms of what he calls “luckiness,” an idea he has found in Kiefer’s conditional frequentist inference. There was a period during which I worked through Kiefer’s approach—Casella, Lehmann, and others having told me that I seemed to be doing something similar. But insofar as I worked through his approach, it appeared similar only in the sense that severity is a data-dependent assessment. Now, having heard Gruenwald, I want to go back and see what I can find in Kiefer on luck.

Cosma Shalizi began by explaining that his switch from physics to statistics was prompted by discovering many links between statistics and machine learning, on the one hand, and fundamental philosophical questions of knowledge and inference, on the other. In a (refreshing!) contrast with the machine learners, he conceives the goals of statistical science broadly: it seeks ways of using data to answer very general inferential questions about the world by developing “abstract machines” that honestly assess uncertainty. While he discovered, unsurprisingly, that statistical practice was much more focused on nuts-and-bolts applications than foundational principles, he claimed that computational developments saved it from being boring. Do the new computations alter the interrelations between the formal work and general conceptions of knowledge, cause, and inference? Do they promote or downplay the connections? I heard Shalizi’s answer (in the Q and A) as being that such philosophical issues actually have nothing to do with it. Maybe his point was simply that computational challenges make statistics/machine learning more fun.

Yin. Yang.

My new flight, to an unintended destination, is here; then I’m to be picked up and make my way by car, bus, and ferry.

My slides are posted:  http://www.phil.vt.edu/dmayo/personal_website/June24,12MayoCMU-SIMP.pdf (Link has expired)


[i] This is a false choice or false dilemma.

*CMU Workshop on Foundations for Ockham’s Razor (https://errorstatistics.com/2012/06/12/4644/)

Categories: philosophy of science, Statistics


14 thoughts on “Deviates, Sloths, and Exiles: Philosophical Remarks on the Ockham’s Razor Workshop*”

  1. Christian Hennig

    This is not on your posting here but rather on your slides. I just wrote on Larry Wasserman’s blog on the same conference and wondered whether anybody had discussed what simplicity is good for. I’m delighted to see that you did that.

    It is interesting that you start by looking at other aspects of simplicity than what statisticians would usually think of (mainly number of parameters). I had always thought of the frequentist approach as *simpler* than the Bayesian one, contrary to what you wrote, because the Bayesian models are usually prior plus sampling model, whereas frequentists only have the latter layer (which is why I think that if a Bayesian uses a prior, he/she needs to justify what this adds to the analysis – often not much, I guess).

    A simplicity issue that you don’t mention but I think is highly relevant to your philosophy is multiple testing – particularly how to control for error when advertising the use of a comprehensive battery of misspecification tests. Bonferroni seems to imply that the more tests you use, the less you can learn from every single one.
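
    A minimal numerical sketch of that trade-off, with one-sided z-tests and a made-up fixed effect size (all numbers purely illustrative): holding the family-wise error rate at alpha over m tests means running each test at level alpha/m, so the power of each single test against the same effect shrinks as m grows.

        from scipy import stats

        alpha, effect = 0.05, 2.5  # family-wise level and an assumed fixed effect size
        for m in (1, 5, 20, 100):
            z_crit = stats.norm.ppf(1 - alpha / m)       # per-test critical value at level alpha/m
            power = 1 - stats.norm.cdf(z_crit - effect)  # power of a single test against the fixed effect
            print(f"m = {m:3d}: per-test level = {alpha/m:.4f}, power = {power:.2f}")

    Here the per-test power drops from about 0.80 at m = 1 to roughly 0.21 at m = 100, which is one concrete sense in which “the more tests you use, the less you can learn from every single one.”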

    • Christian: Thanks for your comment. The Bayesian algorithm is often said to be simpler in the sense that one always (allegedly!) does, or needs to do, the very same thing. But as I point out, implementing it is far more complex than the “ready to use” and “easy to check” error statistical methods. So I guess we agree there. On multiple testing, I don’t see it as an issue of simplicity at all. But, to your comment, I think there’s a confusion between hunting for an effect (and various other selection effects) in inference, and explaining an effect, once identified. Hunting for a nominally significant effect, and reporting just the one found in the same way as if there had been no hunting, is a case of the former. Identifying a flawed assumption of a test or experiment, if done correctly, is more like the latter. Striving to explain the source of my carpet stain by searching through many substances until I find that it is, say, definitely chocolate, does not yield a less reliable identification, once reached, than if I tested chocolate first.

      • Christian Hennig

        Mayo: It’s interesting to realise that the word “simplicity” can be given several different relevant meanings (I was surprised when I read you commenting on the apparent simplicity of the Bayesian approach).

        Running a battery of tests in order to check model assumptions or to arrive at a new model doesn’t necessarily do much about the simplicity of the resulting model, but it creates an overall procedure that is increasingly difficult to analyse. In order to fully understand what such testing does, one would need to analyse what is done based on the model finally arrived at (error probabilities etc.) conditionally on the model testing/selection procedure leading to this model. True, this is not as obviously bad as hunting for significances, but it still does something, in most cases, and the more complex the procedure, the more tests involved, the more difficult it is to find out what exactly, and to control for the effects.
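
        A small Monte Carlo sketch of this point (the particular tests and all settings below are purely illustrative): even when the assumed model is exactly true, the data that survive one misspecification test are no longer distributed as the plain model says, so the nominal error rates of whatever is done afterwards need not hold conditionally.

            import numpy as np
            from scipy import stats

            rng = np.random.default_rng(0)
            n, n_sim, alpha = 100, 20000, 0.05

            passed_first = rej_uncond = rej_cond = 0
            for _ in range(n_sim):
                x = rng.normal(size=n)                # the assumed (normal) model is exactly true
                sw_stat, sw_p = stats.shapiro(x)      # a second misspecification check
                jb_stat, jb_p = stats.jarque_bera(x)  # the first misspecification check
                rej_uncond += int(sw_p < alpha)       # unconditional rejections: nominal 5% by construction
                if jb_p >= alpha:                     # data "pass" the first check
                    passed_first += 1
                    rej_cond += int(sw_p < alpha)     # same second check, conditional on passing the first

            print("Shapiro-Wilk rejection rate, unconditional           :", rej_uncond / n_sim)
            print("Shapiro-Wilk rejection rate, given Jarque-Bera passed:", rej_cond / passed_first)

        Because the two statistics are correlated, the conditional rate comes out below the nominal 0.05: the samples that survive the first check are no longer a plain draw from the null model. The same reasoning applies to any estimate or test carried out after a battery of assumption checks, which is the conditional analysis described above.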

        • Christian Hennig

          I should probably add that the problem is not only the difficulty of analysing it (things can be set up so that this is still possible; I know that Hendry has done something), but also that the more you do, the more effect it will have on the final analysis.

        • Hennig: I think it is a mistake to suppose that “one would need to analyse what is done based on the model finally arrived at (error probabilities etc.) conditionally on the model testing/selection procedure leading to this model”. The reason I emphasize philosophical perspectives is precisely to unearth flaws in what might be thought to be entirely sensible (in this case) statistical procedures. Try to address my informal examples (e.g., searching for the source of a stain, or the location of my keys), and let the answer direct what makes sense in the formal example. I am reminded, in this connection, of a flaw regarding DNA analysis (thinking we need to punish a procedure that searched through a DNA database)—see discussion of this under “selection effects” in Cox and Mayo (2010). We’re hit by power outages here, so I’m minimizing links, but you know where to find it.

          • Christian Hennig

            The Cox and Mayo (2010) I know apparently doesn’t have anything on “selection effects” in it (even the word search “selection” doesn’t find anything). Do I look in the right place?
            I know that in your view I may overstress this point but I really think that there is more in it than you can get from informal examples. (Actually Hendry and Aris are well aware of this but tend to reveal it only where they can handle it. There is too much “mentioning the good news only” for my taste.)

            • Hennig:
              It’s Mayo and Cox, Section 4.2, “Need for Adjustments for Selection”, and in particular, pp. 270-1 (e.g., the DNA example).
              You’re missing my point in regards to informal examples. The point is an instantiation of “Rule #1”, the one and only rule I set for this blog (back on Sept 4): https://errorstatistics.com/2011/09/04/drilling-rule-1/
              Looking to the most clear-cut examples can illuminate the right application of formal ideas. From rule #1: if you concur with the informal example, then, given the case under discussion is analogous, it too must be granted. One is not inferring a statistical generalization here*, but identifying the source of this data. As always, of course, the given interpretation has to be shown to be warranted with severity.
              *One would have to completely reformulate it to view it as such.

              • Christian Hennig

                As always I’m severely inhibited by my weak memory and therefore I’m not going to discuss the informal examples right now (not remembering precisely what they were and how they are connected to this topic at the moment… I thought I’d remember, but…).

                Also I could have remembered the part of Mayo and Cox (2010) but had looked up Cox and Mayo (2010) instead.
                Mayo and Cox (2010) is indeed to the point. Actually Examples 4-6 fit more directly and there the general tendency is actually that adjustment could be required and analysis is sometimes difficult. The only bit where I disagree is the second half of the discussion of Ex. 4, because I think what counts is not the intention with which model diagnostics are run, but what they actually do, which should be analysed as far as possible (which of course may – sometimes – back up your optimism… and yes, I do agree that it’s a good thing to take logs if it can be seen that this improves the model fit big time, but still without analysis of the situation conditionally on making such a decision we don’t know what it *exactly* does and if we do too many things like this on the same data, it may still backfire).

                • Christian: I was just alluding to the example I trotted out in my post, no memory needed:
                  Striving to explain the source of my carpet stain by searching through many substances until I find that it is, say, definitely chocolate, does not yield a less reliable identification, once reached, than if I tested chocolate first.
                  I will have to study your other remarks later…we’re still hunting for a hotel that will not lose its electricity to lightning the first or second night….

                  • Christian Hennig

                    That’s different from probability modelling, though. Let’s say you fit a time series model MT after having rejected independence over time first and then having tested and not rejected several other assumptions of your model. Now you estimate or test a certain parameter of that model. This parameter would not have been tested, had your battery of tests led you to another model. If you want to investigate the characteristics of what you did, e.g., error probabilities, assuming MT, you have to account for the probability that your test battery had opted for another model despite MT being true, because there is (usually) a nonzero probability for this to happen. Conditioning on your knowing that MT was actually picked, the distribution to analyse is not MT, but “MT conditionally on having been picked by your battery of tests”, which is usually different from MT.
                    “Chocolate” though is not different from “chocolate conditionally on not being olive oil”.

  2. Eileen

    Just some quick questions on a point you mentioned from C. Shalizi’s talk that seems very similar to some of the discussion on the blog here. The post about Stephen Senn and statistics by computer program (what you called “grace and amen” Bayesianism) raised the point that some hold the view that statisticians are free to treat “competing” statistical tools as simply different formal tools, which they can throw at a problem freely without any commitments to the different (contradictory) philosophical underpinnings of those different tools. Is Shalizi part of the grace and amen choir here? Also, I know this attitude alarms you, but doesn’t it alarm subjective Bayesians too (e.g., Kadane), thus giving you common cause with them?

    • Eileen: thanks for your comment. I’m not entirely sure what you mean. I’m pretty sure Shalizi wouldn’t endorse grace and amen Bayesianism, but perhaps your point is that by viewing statistical tools as mere tools to churn out numbers without worrying too much as to what they mean or how to justify their use, practitioners can unwittingly credit principles and methods that are not actually responsible for the work. Subjective Bayesians do complain that “casual” Bayesians are too casual, and there is an odd sort of agreement between us that if one is going to call oneself Bayesian, one should consciously embrace and defend Bayesian principles. This is my attempt to get your point; I may have missed it. Feel free to try explaining your concern again.

    Peter Grunwald sent a nice reply to the workshop mailing list; he said he would post it on this blog, but I haven’t seen it. Peter gave a very nice account of minimum description length theory as compared to, say, Vapnik-Chervonenkis prediction theory. I think his view is that the minimum description length principle, properly understood, has VC theory and other theories of inductive inference as special cases.

    I was asked at the workshop about how minimum description length relates to the grue problem. Goodman’s Riddle can be seen as a direct attack on the principle of minimizing description length (MDL), so this is a good point for interdisciplinary discussion.

    Most MDL theories are theories of the complexity of strings, so the first step is to translate data and hypotheses into strings, meaning sequences of symbols from a finite set. I think that the best way to explain the grue problem in terms of the MDL principle is as an issue about how to code the data, rather than as an issue of how to code the different hypotheses. Goodman does say that for any sample of grue emeralds observed before the critical time, we have “parallel evidence statements”, namely that all of them are green, and that all of them are grue.

    Let us say that green/blue speakers write “0” for each time a green emerald is observed, and “1” for each blue emerald. Then, with critical time t = 3 (say), “all emeralds are green” corresponds to the infinite sequence “00000000000….”, and “all emeralds are grue” corresponds to the sequence “000111111….”. Now there are various ways to get the conclusion that the sequence “00000000…” is simpler than the sequence “0001111111….”. For example, Kolmogorov’s definition of the complexity of strings has that consequence.

    The way I would continue the argument is to point out that a grue/bleen speaker may well prefer the following encoding of the data: Write “0” for each grue emerald, and “1” for each bleen emerald. Then the hypothesis “all emeralds are green” corresponds to the infinite sequence “000111111….” and the hypothesis “all emeralds are grue” corresponds to “000000000…”. Accepting the result that the string “0000000….” is simpler than the string “00011111….”, we see that the prescription of MDL reverses: on the green/blue encoding, we get that “all emeralds are green” should be projected, and on the grue/bleen encoding, we get that “all emeralds are grue” should be projected. Contradictory conclusions from the same actual data but with different syntactic representations of the data. On this proposal, the problem for MDL is that even if it uses an objectively correct notion of the simplicity of strings, there is an arbitrariness in how data can be represented as strings.
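
    A toy sketch of this reversal, using the length of a naive run-length code as a crude, computable stand-in for the complexity of a string (Kolmogorov complexity itself is uncomputable); the critical time t = 3 and the finite truncation of the infinite sequences are arbitrary choices.

        def run_length_code(bits: str) -> str:
            """Encode a binary string as 'symbol x run-length' blocks, e.g. '000111' -> '0x3;1x3'."""
            out, i = [], 0
            while i < len(bits):
                j = i
                while j < len(bits) and bits[j] == bits[i]:
                    j += 1
                out.append(f"{bits[i]}x{j - i}")
                i = j
            return ";".join(out)

        N, t = 1000, 3  # finite prefix length and critical time (both arbitrary)

        # green/blue encoding: "0" = a green emerald observed, "1" = a blue one
        green_gb = "0" * N                  # data predicted by "all emeralds are green"
        grue_gb  = "0" * t + "1" * (N - t)  # data predicted by "all emeralds are grue"

        # grue/bleen encoding: "0" = a grue emerald observed, "1" = a bleen one
        green_gr = "0" * t + "1" * (N - t)  # data predicted by "all emeralds are green"
        grue_gr  = "0" * N                  # data predicted by "all emeralds are grue"

        for label, s in [("green, green/blue code", green_gb), ("grue,  green/blue code", grue_gb),
                         ("green, grue/bleen code", green_gr), ("grue,  grue/bleen code", grue_gr)]:
            print(label, "-> description length", len(run_length_code(s)))

    On the green/blue encoding the all-green sequence gets the shorter description, while on the grue/bleen encoding the all-grue sequence does: the same data, differently represented, reverse which hypothesis counts as simpler.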

    Case closed? It’s not that simple actually, because the full-blown MDL theory (as presented by Peter Gruenwald for instance) encodes hypotheses as a function of the set of hypotheses under consideration. Thus confirmation is not just a two-place relation “evidence e confirms hypothesis H”, but a three-place relation “evidence e confirms hypothesis H given the alternatives H1,…Hn,…”. As Kevin Kelly put it at the workshop, the inductive conclusions should not depend on how data are represented, but they can depend on what question is being asked. This is like the topological theory of simplicity that I presented, where simplicity is not a feature of a hypothesis alone, but depends on the context of the entire hypothesis space. It seems to me now that the MDL theory advocated by Peter Gruenwald will agree with the topological theory in many cases (i.e., when there is always a unique simplest theory to conjecture), including the Riddle of Induction, where it gives the natural projection rule. This is not the place to go into the mathematical details, but it would be a beautiful result to show that Rissanen’s concept of simplicity from the 1990s agrees with Cantor’s concept of simplicity from the 1890s!

    • Bill Jefferys

      Note that grue has a parameter that green does not have, namely the time at which the observed color changes. That the usual presentation of the dilemma fixes a date does not solve this problem: for example, the date that was originally proposed in the dilemma has already passed and has to be changed when we present the dilemma in the future. This may be considered an indication that grue is more complex than green.
