My flight out of Pittsburgh has been cancelled, and as I may be stuck in the airport for some time, I will try to make a virtue of it by jotting down some of my promised reflections on the “simplicity and truth” conference at Carnegie Mellon (organized by Kevin Kelly). My remarks concern only the explicit philosophical connections drawn by (4 of) the seven non-philosophers who spoke. For more general remarks, see the blogs of Larry Wasserman (Normal Deviate) and Cosma Shalizi (Three-Toed Sloth). (The following, based on my notes and memory, may include errors/gaps, but I trust that my fellow bloggers and sloggers will correct me.)
First to speak were Vladimir Vapnik and Vladimir Cherkassky, from the field of machine learning, a discipline I know only formally. Vapnik, of Vapnik–Chervonenkis (VC) theory, is known for his seminal work in the field. Their papers, both of which directly addressed the philosophical implications of their work, share enough themes to merit being taken up together.
Vapnik and Cherkassky find a number of striking dichotomies in the standard practice of both philosophy and statistics. They contrast the “classical” conception of scientific knowledge as essentially rational with the more modern, “data-driven” empirical view:
The former depicts knowledge as objective, deterministic, rational. Ockham’s razor is a kind of synthetic a priori statement that warrants our rational intuitions as the foundation of truth with a capital T, as well as a naïve realism (we may rely on Cartesian “clear and distinct” ideas; God does not deceive; and so on). The latter empirical view, illustrated by machine learning, is enlightened. It settles for predictive successes and instrumentalism, views models as mental constructs (in here, not out there), and exhorts scientists to restrict themselves to problems deemed “well posed” by machine-learning criteria.
But why suppose the choice is between assuming “a single best (true) theory or model” and the extreme empiricism of their instrumental machine learner? A similar dichotomy arises in the description of “classical” statistics as contrasted with modern accounts of machine learning. The classical statistician is viewed as beginning with a known parametric distribution, or a true statistical model, the goal being to estimate or test parameters within it. This is an erroneous if familiar caricature of what goes on in statistical science. I gained some insight from Cherkassky during a Q and A period: The perspective stems from a complaint by Leo Breiman and other new machine learners that statistics had downplayed their work for some time. But biases in editorial policy, whether past or present, do not themselves justify so superficial a conception of classical statistics.
Cherkassky qualified the successes of machine learning in two important ways: (1) machine learning aims for good predictions but provides us with wholly uninterpretable “black boxes”; and (2) machine-learning inductions, based on training samples and teachers, work only so long as stationarity is sufficient to ensure that the new data are adequately similar to the training data. He did not seem to see these as noteworthy drawbacks or limitations that (while perhaps just fine for machine learning) an adequate, full-bodied account of statistical science would and does regularly break out of.
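To see the force of the second qualification, here is a minimal sketch of my own (a toy illustration, not anything from the talks; it assumes scikit-learn and made-up Gaussian data): a classifier trained on one population can drop to chance-level accuracy once the incoming data drift away from the training data.

```python
# Toy illustration (my own, not from the talks): a learner trained on one
# distribution degrades badly when stationarity fails and the new data no
# longer resemble the training data. All names and numbers are made up.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def sample(n, shift=0.0):
    """Two Gaussian classes; `shift` moves the whole population at test time."""
    X0 = rng.normal(loc=0.0 + shift, scale=1.0, size=(n, 2))
    X1 = rng.normal(loc=2.0 + shift, scale=1.0, size=(n, 2))
    X = np.vstack([X0, X1])
    y = np.r_[np.zeros(n), np.ones(n)]
    return X, y

X_train, y_train = sample(500)             # the training sample (the "teacher")
clf = LogisticRegression().fit(X_train, y_train)

X_same, y_same = sample(500, shift=0.0)    # stationary: new data resemble the old
X_drift, y_drift = sample(500, shift=3.0)  # non-stationary: the population has moved

print("accuracy on stationary test data:", clf.score(X_same, y_same))
print("accuracy on shifted test data:   ", clf.score(X_drift, y_drift))
```

The numbers are arbitrary; the point is only that the induction leans entirely on the new cases resembling the training cases, which is just Cherkassky’s second caveat.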
Cherkassky began with the interesting claim that philosophical ideas form only in the context of scientific developments and in response to technological advances. I take his upshot to be that, now that science has changed (toward empirical machine learning), philosophy of science changes accordingly.
This general position deserves further reflection. For now I will say that, since the technology of machine learning is only a small part of science, even if it has demanded an instrumental philosophy, it does not follow that this would be an adequate philosophy for science in general. Also, consider this radical possibility: how do they know that the goals of machine learning would not be furthered by striving to understand underlying mechanisms, empirical and theoretical?
I have no reason to doubt that machine learning has had great successes. I hope it has a machine in the works to obviate cancelled flights and save me from having to plead with the machine for an “agent” to fix the mess.
This morning (Sunday), Vapnik clarified some of the formal issues, but repeated the idea that we should restrict ourselves to so-called well-formed problems. My own interest is in how scientists arrive at sufficiently well-formed problems despite very messy points of departure, and despite their more ambitious goals, which include a theoretical and empirical understanding of phenomena. It is one thing to view the work of machine learning as having carved out an important domain with increasingly diverse applications, but quite another to suggest that that is all there is to learning about the world. Vapnik invoked at least partial connections to Popperian falsificationism. But Popper would have regarded the enterprise as akin to Kuhnian “normal science,” or science constrained within the bounds of a paradigm.
Vapnik concluded, if I understood him (and this is a first pass for me), that in order to improve the teacher/trainer in machine-learning classification tasks (e.g., disambiguate a handwritten 5 from an 8), we must, to resolve equivocal cases, consider features that go beyond the usual classification features, in particular, he thinks, various metaphors and mystical, holistic, “yin/yang” harmonies. The list I made during the talk includes what may be seen as human idiosyncrasies (e.g., egotism, willfulness, stubbornness, a willingness to cause pain). I am not surprised that capturing the shades of human discrimination requires us to go beyond observable sense data; if given the stark[i] choice between mysticism and a naïve empiricism, humans will always tend toward mysticism.
Peter Gruenwald asked the same question I often ask: “Where are the philosophers?” [on a variety of issues in contemporary statistical science]. He raised the problem that arises when Bayesians are led to revise their priors on the grounds that they do not like the resulting posteriors. To avoid Bayesian inconsistency, he says, requires “non-Occam priors.” This should be understood, he suggests, in terms of what he calls “luckiness,” an idea he has found in Kiefer’s conditional frequentist inference. There was a period during which I worked through Kiefer’s approach—Casella, Lehmann, and others having told me that I seemed to be doing something similar. But insofar as I worked through his approach, it appeared similar only in the sense that severity is a data-dependent assessment. Now, having heard Gruenwald, I want to go back and see what I can find in Kiefer on luck.
Cosma Shalizi began by explaining that his switch from physics to statistics was prompted by discovering many links between statistics and machine learning, on the one hand, and fundamental philosophical questions of knowledge and inference, on the other. In a (refreshing!) contrast with the machine learners, he conceives the goals of statistical science broadly: it seeks ways of using data to answer very general inferential questions about the world by developing “abstract machines” that honestly assess uncertainty. While he discovered, unsurprisingly, that statistical practice was much more focused on nuts-and-bolts applications than foundational principles, he claimed that computational developments saved it from being boring. Do the new computations alter the interrelations between the formal work and general conceptions of knowledge, cause, and inference? Do they promote or downplay the connections? I heard Shalizi’s answer (in the Q and A) as being that such philosophical issues actually have nothing to do with it. Maybe his point was simply that computational challenges make statistics/machine learning more fun.
My new flight, to an unintended destination, is here; then I’m to be picked up and make my way by car, bus, and ferry.
My slides are posted: http://www.phil.vt.edu/dmayo/personal_website/June24,12MayoCMU-SIMP.pdf (Link has expired)
[i] This is a false choice or false dilemma.
*CMU Workshop on Foundations for Ockham’s Razor (https://errorstatistics.com/2012/06/12/4644/)
This is not on your posting here but rather on your slides. I just wrote on Larry Wasserman’s blog on the same conference and wondered whether anybody had discussed what simplicity is good for. I’m delighted to see that you did that.
It is interesting that you start by looking at other aspects of simplicity than what statisticians would usually think of (mainly number of parameters). I had always thought of the frequentist approach as *simpler* than the Bayesian one, contrary to what you wrote, because Bayesian models are usually a prior plus a sampling model, whereas frequentists only have the latter layer (which is why I think that if a Bayesian uses a prior, he/she needs to justify what this adds to the analysis – often not much, I guess).
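To make the “extra layer” concrete, here is a toy sketch (my own numbers, nothing from the post or slides) of a binomial proportion estimated frequentist-style from the sampling model alone, versus Bayesian-style with a Beta prior stacked on top of the same sampling model:

```python
# Toy sketch (my own numbers): the frequentist analysis uses only the sampling
# model (binomial likelihood); the Bayesian analysis adds a prior layer on top.
successes, n = 37, 100                 # hypothetical data

# Frequentist: sampling model only -> maximum-likelihood estimate
mle = successes / n

# Bayesian: Beta(a, b) prior + the same sampling model -> posterior mean
a, b = 2.0, 2.0                        # an illustrative, mildly informative prior
posterior_mean = (a + successes) / (a + b + n)

print(f"MLE            : {mle:.3f}")
print(f"Posterior mean : {posterior_mean:.3f}")  # close to the MLE once n is large
```

With data of this size the prior barely moves the answer, which is the “often not much” worry above.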
A simplicity issue that you don’t mention but I think is highly relevant to your philosophy is multiple testing – particularly how to control for error when advocating the use of a comprehensive battery of misspecification tests. Bonferroni seems to imply that the more tests you use, the less you can learn from any single one.
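As a back-of-the-envelope illustration of the Bonferroni point (my own numbers): holding the family-wise error rate at 0.05 across a battery of m misspecification tests forces each individual test down to level 0.05/m, so each one becomes correspondingly less sensitive as the battery grows.

```python
# Back-of-the-envelope Bonferroni arithmetic (illustrative numbers only):
# to keep the family-wise error rate at alpha across m tests, run each test
# at level alpha/m; the per-test level shrinks as the battery grows.
alpha = 0.05
for m in (1, 5, 20, 100):
    per_test = alpha / m                      # Bonferroni-adjusted level
    fwer_bound = 1 - (1 - per_test) ** m      # <= alpha if the tests are independent
    print(f"{m:4d} tests: per-test level {per_test:.5f}, FWER bound {fwer_bound:.4f}")
```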