Vladimir Cherkassky Responds on Foundations of Simplicity

I thank Dr. Vladimir Cherkassky for taking up my general invitation to comment. I don’t have much to add to my original post[i], except to make two corrections at the end of this post.  I invite readers’ comments.

Vladimir Cherkassky

As I could not participate in the discussion session on Sunday, I would like to address several technical issues and points of disagreement that became evident during this workshop. All opinions are mine, and may not be representative of the “machine learning community.” Unfortunately, the machine learning community at large is not very much interested in the philosophical and methodological issues. This breeds a lot of fragmentation and confusion, as evidenced by the existence of several technical fields: machine learning, statistics, data mining, artificial neural networks, computational intelligence, etc.—all of which are mainly concerned with the same problem of estimating good predictive models from data.

Occam’s Razor (OR) is a general metaphor in the philosophy of science, and it has been discussed for ages. One of the main goals of this workshop was to understand the role of OR as a general inductive principle in the philosophy of science and, in particular, its importance in data-analytic knowledge discovery for statistics and machine learning.

Data-analytic modeling is concerned with estimating good predictive models from finite data samples. This is directly related to the philosophical problem of inductive inference. The problem of learning (generalization) from finite data was formally investigated in VC-theory about 40 years ago. This theory starts with a mathematical formulation of the problem of learning from finite samples, without making any assumptions about parametric distributions. This formalization is very general and relevant to many applications in machine learning, statistics, life sciences, etc. Further, this theory provides necessary and sufficient conditions for generalization. That is, a set of admissible models (hypotheses about the data) should be constrained, i.e., should have finite VC-dimension. Therefore, any inductive theory or algorithm designed to explain the data should satisfy VC-theoretical conditions.

It is difficult to proceed with constructive discussions on the subject of inductive inference, until these basic facts of VC-theory are acknowledged and understood. In my talk, I briefly introduced the VC-theoretical framework, and it has been challenged by philosophers during Q&A periods and discussions. They raised objections on two accounts:

1.       An underlying assumption that future data is statistically similar to past data is too narrow. In particular, Deborah Mayo suggested that “statistical science” offers methodologies that enable inductive learning from arbitrarily changing distributions. This assertion is puzzling to me, as I am unaware of any such methods.

2.       There exist “better theories.” This assertion is too vague. Certainly, one can hope and strive for better theories. However, in order to be scientifically sound, these better theories have to include the VC-theory as a special case. Clearly, this was not the case with the philosophical theories presented during this workshop.

With regard to the philosophical interpretation of data-analytic knowledge discovery, my presentation advocated an Instrumentalist position versus the Realist view of classical statistics. My arguments used both pragmatic considerations (following Leo Breiman’s paper) and VC-theoretical results. According to VC-theory, it is not possible to estimate a true model from finite data, whereas it is still possible to estimate a good predictive model. So the Instrumentalist view follows directly from a sound scientific theory, and not from the philosophical arguments. This instrumentalist approach has many philosophical and practical implications for the interpretation of data-analytic models, as discussed in my talk.

In a more general sense, as noted by Vapnik, the problem of induction is ill-posed. This ill-posedness is the property of the problem itself, not the solution. So realism is not possible, and instrumentalism is an appropriate (technically sound) philosophical position. Vapnik’s observation is very fundamental, and it challenges existing classical statistical methods (maximum likelihood, least squares estimation, etc.). This view, of course, is consistent with his original VC-theory concerned with theoretical analysis of the binary classification problem (e.g., estimating a good rule for discriminating handwritten digits 5 versus 8).

Finally, on the importance/relevance of Occam’s Razor for statistical learning. It can be addressed by VC-theory (assuming, of course, that this theoretical framework is adopted by all participants). Namely, inductive inference/generalization is controlled by the VC-dimension. This complexity index is different from the number of free parameters (or entities) used by statisticians and philosophers to measure the model complexity. Therefore, OR is not relevant for the problem of inductive inference in statistical learning. This conclusion may be discouraging and unpleasant to philosophers. Some philosophers suggested that the OR principle still holds if the VC-dimension is used as a measure of complexity. This semantic game-playing seems counter-productive and only breeds more confusion. Of course, this discussion is limited to data-analytic modeling: the OR principle may still be useful for discovering other kinds of knowledge, i.e., first-principle knowledge.
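
The gap between parameter count and VC-dimension can be illustrated with a standard textbook example, which is not taken from the talk itself. The one-parameter family f(x) = sign(sin(w x)) can realize every labeling of the points x_i = 10^-i, so its VC-dimension is not bounded by its single free parameter. The Python sketch below checks this for a handful of points; the function names are illustrative assumptions.

```python
import itertools
import math


def shattering_frequency(labels):
    """Return w such that sign(sin(w * 10**-i)) reproduces labels[i-1].

    Textbook construction: w = pi * (1 + sum over i of (1 - y_i)/2 * 10**i),
    where each label y_i is +1 or -1.
    """
    total = sum(((1 - y) // 2) * 10 ** i for i, y in enumerate(labels, start=1))
    return math.pi * (1 + total)


def classify(w, x):
    """One-parameter classifier: the sign of sin(w * x)."""
    return 1 if math.sin(w * x) > 0 else -1


def is_shattered(n_points):
    """Check that every +1/-1 labeling of x_i = 10**-i is realized by some w."""
    points = [10.0 ** -i for i in range(1, n_points + 1)]
    for labels in itertools.product([1, -1], repeat=n_points):
        w = shattering_frequency(labels)
        if [classify(w, x) for x in points] != list(labels):
            return False
    return True


print(is_shattered(5))
```

The check is limited to small point sets because floating-point precision eventually breaks the construction, although the mathematical argument holds for any number of points.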

In conclusion, a few more general remarks:

In the course of this workshop, it became evident that there is disagreement about, and misunderstanding of, the learning problem setting used by statisticians/machine learning researchers and philosophers. Without such a common understanding and agreement on the basic assumptions, it is difficult to have a meaningful technical discussion. VC-theory uses a quantifiable notion of generalization (prediction risk) that originated from Rosenblatt’s perceptron, and assumes the standard inductive learning setting. So my and Vapnik’s talks discussed inductive inference, Occam’s Razor, etc., in the context of this setting. Similar settings are also used in most machine learning and statistical methods.
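
For readers unfamiliar with this setting, the notion of prediction risk can be written down as follows. This is a sketch in standard textbook notation, not taken from the talk: the symbols L (loss), P (unknown joint distribution), n (sample size), h (VC-dimension) and η (confidence parameter) are assumptions of this illustration, and the bound shown is one commonly quoted form for binary classification with 0/1 loss.

```latex
% Prediction risk and its empirical counterpart on an i.i.d. sample of size n
R(f) = \int L\bigl(y, f(x)\bigr)\, dP(x, y), \qquad
R_{\mathrm{emp}}(f) = \frac{1}{n} \sum_{i=1}^{n} L\bigl(y_i, f(x_i)\bigr)

% With probability at least 1 - \eta, simultaneously for all f
% in a class of finite VC-dimension h:
R(f) \le R_{\mathrm{emp}}(f)
  + \sqrt{\frac{h\bigl(\ln(2n/h) + 1\bigr) - \ln(\eta/4)}{n}}
```

The second term depends on the model class only through h, which is the sense in which generalization is said to be controlled by the VC-dimension.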

Apparently, this learning setting is not familiar to philosophers, and they use a different set of assumptions and concepts.

Some philosophers refer to hypothesis testing when they discuss induction and statistical learning. Under this setting, given a set of data points, the goal is to decide whether this data set was generated by a given (probabilistic) model. In contrast, in machine learning, we are given a set of data samples, and the goal is to select the best model (function, hypothesis) from a given set of possible models. This latter (machine learning) view was used in Popper’s discussions on polynomial curve fitting, used for illustrating his ideas on the connection between falsifiability and complexity.

Finally, as stated in my talk, useful philosophical ideas/interpretations usually develop in response to new scientific and technological advances. It seems that some philosophers favor a different view, that philosophy should guide scientific methodology (i.e., statistical inference). According to this view, philosophical models of induction should yield better results for statistical inference problems. I have not seen any such empirical evidence in philosophical presentations. Several presentations discussed polynomial curve fitting, but it was not clear whether/how philosophical models of induction yield an improvement for this well-known problem. In this regard, I emphasize that empirical verification is an important part of any true scientific theory, and logical arguments alone do not suffice. According to Albert Einstein:

Pure logical thinking cannot yield us any knowledge of the empirical world. All knowledge starts from experience and ends in it.

Note that Einstein’s quotation does not mention intelligibility or “truth.” Even though Einstein referred to first-principle knowledge, his argument certainly holds for data-analytic knowledge as well.

 Dr. Vladimir Cherkassky, Department of Electrical and Computer Engineering
 University of Minnesota


Just to briefly correct two errors: (1) I never claimed that “‘statistical science’ offers methodologies that enable inductive learning from arbitrarily changing distributions,” only that the prediction-classification examples Cherkassky discusses, however important and interesting, are special cases and not the whole of statistical science (much less, all of science). Says Cherkassky: “In machine learning, we are given a set of data samples, and the goal is to select the best model (function, hypothesis) from a given set of possible models.” Fine, but is the background knowledge required to obtain the pieces of this setup, to ensure iid, and so on, itself reducible to prediction-classification problems? I say no, as would any good Popperian.    (2) I would never suggest that philosophy guide scientific methodology; on the contrary, I have long argued that statistical science is highly relevant to the problems of induction and evidence-transcending inference that philosophers care about. However, the suitability of an instrumental philosophy for machine learning is no argument for a general instrumental epistemology or metaphysics. I may have more to say on this later, but for now I will leave it as open to readers’ comments.

[i] See http://errorstatistics.com/2012/06/26/deviates-sloths-and-exiles-philosophical-remarks-on-the-ockhams-razor-workshop/.  Several other subsequent posts follow up on the “Foundations of Simplicity” theme.

Categories: philosophy of science, Statistics

12 thoughts on “Vladimir Cherkassky Responds on Foundations of Simplicity”

  1. To say a bit more about the relationship between philosophy of science and statistical science, my view is that neither should be the handmaiden of the other; I have long urged a “two-way street.” This blog largely grew out of a conference in June, 2010 at the London School of Economics entitled, “Statistical Science and Philosophy of Science: Where Do/Should They Meet in 2010 and Beyond?”
    • Statistical Science and Philosophy of Science: Where Do (Should) They Meet in 2011 and Beyond?
    Cherkassky and other new readers might find something of interest in the on-line publication of contributions to the general forum (in RMM, Rationality, Markets and Morals). The link is on the left column of the blog.

    I strongly endorse the idea that philosophical problems of knowledge can derive valuable insights from statistical methods, including ways to understand and solve the problem of induction (namely, by showing the existence of reliable methods and ways to assess and control the severity of tests). It is this idea that drew me to statistics as a graduate student and that I have been developing and promoting ever since. I can, however, imagine the chagrin of the philosopher who decides for the first time to have a look: “Okay, Mayo, I’m taking your advice to see what statistics has to offer,” and is told that the upshot of all these methods is that one can make reliable inferences from observed to unobserved cases just when it is reliable to do so. To the philosopher’s ears, that is what Cherkassky seemed to be saying in his talk. After a laugh, he or she would resume trying to justify the straight rule based on “inductive intuition” (from all observed A’s have been B’s to the next or most A’s will be B’s). My concern is that this yields too much to philosophical skepticism, though I’m fairly certain that wasn’t the intent.

  2. Eileen

    Maybe I misread him above, but it sounded like he was accusing you of wanting to put the philosophers in charge of scientific/statistical methodology using philosophical theories of induction, etc.! Which is pretty strange, even humorous, because in everything that I’ve read of yours so far (especially Error and the Growth of Experimental Knowledge), you are always imploring philosophers to look to standard statistics and statistical methods to get insight into how to solve their philosophical problems like induction, Duhem’s problem, etc.!

  3. Eileen: We had not met before, and I doubt he is familiar with my work, but that is one of the valuable things about interdisciplinary conferences. I also think there is some ambiguity between the “philosophical basis” or rationale for a statistical method, in the sense of how to interpret and justify it in relation to given aims (something practitioners would presumably possess) and a general philosophy of science or knowledge (something philosophers would have). I have tried to distinguish these (e.g., Mayo and Spanos 2011, “Error Statistics”). I’m actually unclear as to why machine learners would care to foist a general instrumental metaphysics or epistemology onto philosophers. There is an endorsement of Popper, or so it seems, but Popper was a realist, and would recoil at such instrumentalism. As to “which came first, the observation or the hypothesis?” Popper held it was always a hypothesis.

    • Christian Hennig

      To me what Cherkassky writes sounds far too much like “prediction is the only game in town”. I think it’s not.

  4. Christian: True, that’s what I was trying to say at the conference, but he wasn’t having it. It occurs to me, though, Christian, that your area, cluster analysis, must have a lot in common with the machine learning classification tasks. Do you work with them? Are they rediscovering some of the results from your field, or is it very different? I’d be interested to know what you think.

    • Christian Hennig

      Mayo: This is a question I just have to deal with, writing an overview paper about how to select a good cluster analysis method. Currently the majority of work suggesting new methodology is from the machine learning community. There is also a fair amount on cluster validation and comparison of methods. I’m rather impressed (and scared, because I should take much of it in) by the amount of promising methodological work they produce. They have certainly passed the state of just rediscovering stuff; in some instances I’m amazed by how good it is.
      However, I also see a tendency in line with Cherkassky’s view of trying to reduce the clustering problem to prediction. Pretty much all papers that they have end with applying the newly suggested method to some kind of standard set of benchmarking data sets with “known truth” in order to evaluate the prediction quality there.
      Some of the data sets are real ones with “known classes” from supervised classification tasks, others have been generated by mixture distributions.
      This doesn’t look too bad to a statistician at first sight, but the problem here (missed by many statisticians, too) is that what we are interested in is not necessarily this apparent “known truth”.
      In real data sets with known classes, there is obviously no real scientific interest in reconstructing these classes, so this does not mimic what cluster analysis is supposed to do in new applications. Also, the “true classes” are usually defined by an external criterion, and there is no guarantee that they are the classes that one would reasonably want to find data analytically if one didn’t know them already.
      The problem with mixture models is that often there is no proper definition of what kind of clusters the method is supposed to find and it is just assumed that there are “natural true clusters” out there, so that the choice of mixture distributions seems rather arbitrary and unmotivated.

      So I think that they are quite good at coming up with ideas but not so good at defining the problem properly, which may partly be because they lack philosophical thinking.
      They have pretty good hammers so they try to make the problem look like a really sophisticated nail.

      • Christian: Thanks so much for this. “In real data sets with known classes, there is obviously no real scientific interest in reconstructing these classes, so this does not mimic what cluster analysis is supposed to do in new applications.” This was exactly my suspicion, but when I suggested as much (at the conference) I didn’t get a sense that others agreed, so I figured I was wrong (about cluster analysis). Perhaps you can explain more of what the “new applications” are supposed to do that is not covered. Also, can you please explain briefly “mixture models” in this context? Thanks!

        • Christian Hennig

          What cluster analysis is supposed to do is to find subpopulations that are separated from the rest in data analytic terms, often in the hope that such groups point to some meaningful discoveries, but sometimes also for organisational, communicative and indirect reasons (e.g., constructing a meaningful categorical variable out of messy data to use this in order to analyse something else).

          The task of supervised classification is to find out how the information in the given data set can be used to separate known classes which are defined by other means than the data set used for classification.

          This means that there is no guarantee that the known classes form data analytic clusters. If you look at gene expression data in order to separate patients with a certain tumor from patients without that tumor, there is no guarantee that these two groups are clearly separated in the data (one would want to find the best classification rule regardless), let alone that this is the best possible partition of the data according to purely data analytic criteria. In the very same data set there may be people with and without another, unrecorded condition that separates the data much better into two classes, so that a clustering method that doesn’t find the given supposedly “true” classes but rather the hidden ones may actually do a better job. That’s the problem with comparing the method’s results with what is supposed to be “ground truth”.

          I’ll write another posting on mixture models.

          • Christian Hennig

            OK, mixture models. A mixture model is a model for a two-step process in which for every data point first a multinomial random variable selects a membership to one of a number of mixture components (although this information is unobserved), and then a data point is generated from a certain distribution with its parameter depending on the component. For example, 0.3*N(0,1)+0.2*N(1,1)+0.5*N(10,10) is a 3-component normal mixture, and one may want to estimate all the component proportions and parameters.
            In cluster analysis one usually interprets every component as model for a cluster and can then classify points by the estimated conditional probabilities of belonging to the components given the observed data (this is an application of Bayes’s rule but it’s not Bayesian in a philosophical sense).
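
The two-step process and the Bayes' rule classification described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the second argument of N(.,.) is read as a variance, the true component parameters are used in place of estimated ones, and all function names are made up for the example.

```python
import math
import random

# Components of 0.3*N(0,1) + 0.2*N(1,1) + 0.5*N(10,10) as (weight, mean, variance).
# Assumption: the second argument of N(.,.) is the variance.
COMPONENTS = [(0.3, 0.0, 1.0), (0.2, 1.0, 1.0), (0.5, 10.0, 10.0)]


def normal_pdf(x, mean, var):
    """Density of a normal distribution with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)


def sample_point(rng):
    """Two-step draw: pick a component, then draw from that component's normal."""
    weights = [w for w, _, _ in COMPONENTS]
    k = rng.choices(range(len(COMPONENTS)), weights=weights)[0]
    _, mean, var = COMPONENTS[k]
    return k, rng.gauss(mean, math.sqrt(var))


def membership_probabilities(x):
    """Bayes' rule: P(component k | x) is proportional to weight_k * density_k(x)."""
    joint = [w * normal_pdf(x, mean, var) for w, mean, var in COMPONENTS]
    total = sum(joint)
    return [j / total for j in joint]


rng = random.Random(0)
k, x = sample_point(rng)  # k is the hidden label, unobserved in real data
probs = membership_probabilities(x)
print(k, round(x, 3), [round(p, 3) for p in probs])
```

In an actual cluster analysis the component label k is not available and the weights, means and variances would first have to be estimated from the data; a point is then assigned to the component with the largest conditional probability.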

            The problem with the way this is used for cluster analysis method benchmarking is that they just select some distributions for the mixture components without discussing how this corresponds to the kind of cluster they expect their method to find. How to do this is far from trivial. For example if you plot the density of 0.6*N(0,1)+0.4*N(1,1), this will rather look like a single cluster than like two of them because it’s unimodal, and if you use this mixture for benchmarking, a clustering method will come out as doing well if it counterintuitively tells you that that’s two clusters.
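
The unimodality claim can be checked numerically without plotting. The Python sketch below counts local maxima of a mixture density on a grid; the grid bounds, the step count and the well-separated comparison mixture 0.6*N(0,1)+0.4*N(5,1) are assumptions added purely for contrast, and the second argument of N(.,.) is again read as a variance.

```python
import math


def normal_pdf(x, mean, var):
    """Density of a normal distribution with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)


def mixture_density(x, components):
    """Density of a mixture given as a list of (weight, mean, variance)."""
    return sum(w * normal_pdf(x, mean, var) for w, mean, var in components)


def count_modes(components, lo=-10.0, hi=20.0, steps=3000):
    """Count local maxima of the mixture density on an evenly spaced grid."""
    grid = [lo + (hi - lo) * i / steps for i in range(steps + 1)]
    dens = [mixture_density(x, components) for x in grid]
    return sum(
        1
        for i in range(1, steps)
        if dens[i] > dens[i - 1] and dens[i] >= dens[i + 1]
    )


overlapping = [(0.6, 0.0, 1.0), (0.4, 1.0, 1.0)]  # the example from the text
separated = [(0.6, 0.0, 1.0), (0.4, 5.0, 1.0)]    # hypothetical contrast case

print(count_modes(overlapping), count_modes(separated))
```

A benchmark that scores a method against the two generating components of the overlapping mixture rewards splitting what looks, by this mode count, like a single cluster.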

            OK, what is used for benchmarking in the literature is rarely as nonsensical as this, but anyway it is usually picked in a rather unmotivated fashion by the intuition of the authors, without a proper definition of what kind of problem their method is actually meant to solve.

            • Thanks Christian. I will have to read up on this, I’m not sufficiently familiar with cluster analysis to reply.

  5. I am a bit confused by exactly what Cherkassky proposes as a general philosophy of science, or if that is what he wants to do in the first place. Perhaps I missed something, but what he seems to offer in the end comes down to good old empiricism or a simplistic version of instrumentalism. If so, do the developments in machine learning end up telling us to go back to an already well-known philosophical view of science, namely instrumentalism? Personally, I do not think there’s anything seriously wrong with instrumentalism as a very general view on large-scale theories of science. But I fail to see what new things the instrumentalism supported by machine learning can teach us about science that instrumentalist philosophers of science have not already said. Perhaps Cherkassky could make clear how his version of instrumentalism is different from, say, Van Fraassen’s instrumentalism…

    I also want to reinforce the question why what may be an adequate philosophy of machine learning should also be an adequate philosophy of science. Is it because there is an assumption that the type of learning studied in machine learning is the same or similar to what scientists do? To me, this is no assumption but simply an open empirical question about what cognitive processes are involved as scientists do research and draw inferences about the world. It is an interesting question and, as some think, it may be creating a need for a “cognitive science” of science. But to do that we should first have a methodologically sound and reliable cognitive science. How can that be secured? I think that is exactly one of the places where collaborations between statistical science and philosophy of science can be of real use…

    • Emrah: these are good questions and thoughts. I wish we had taken up more of them at the recent conference. I think the machine learner’s “instrumentalism” is of the naive sense data sort, which is very different from van Fraassen’s constructive empiricism. I think they regard the choices to be between ‘rationalism’ and ‘instrumentalism’ in the old naive sense. See my comment for an explanation.
      I do think there is something “seriously wrong with instrumentalism as a very general view on large-scale theories of science” because, for starters, scientists are not content merely to predict without understanding things like mechanisms and causes. For seconders, it’s not clear how there could be empirically warranted theoretical knowledge, or large scale theories as even approximately capturing what is the case about such entities as genes, gravity, electrons, etc. according to their conception, as I understand it. As you know, I emphasize experimental knowledge, which some (e.g., Musgrave) had viewed as anti-realist, yet error statistics decides on a case by case basis. But I think your deeper question is quite right: why are they trying to argue from these classification-prediction problems to a general philosophy of science? And even if they do think their work fits instrumentalism, what are they adding to many existing empirical accounts in philosophy of science?

I welcome constructive comments for 14-21 days
