I thank Dr. Vladimir Cherkassky for taking up my general invitation to comment. I don’t have much to add to my original post[i], except to make two corrections at the end of this post. I invite readers’ comments.
As I could not participate in the discussion session on Sunday, I would like to address several technical issues and points of disagreement that became evident during this workshop. All opinions are mine, and may not be representative of the “machine learning community.” Unfortunately, the machine learning community at large is not very much interested in the philosophical and methodological issues. This breeds a lot of fragmentation and confusion, as evidenced by the existence of several technical fields: machine learning, statistics, data mining, artificial neural networks, computational intelligence, etc.—all of which are mainly concerned with the same problem of estimating good predictive models from data.
Occam’s Razor (OR) is a general metaphor in the philosophy of science, and it has been discussed for ages. One of the main goals of this workshop was to understand the role of OR as a general inductive principle in the philosophy of science and, in particular, its importance in data-analytic knowledge discovery for statistics and machine learning.
Data-analytic modeling is concerned with estimating good predictive models from finite data samples. This is directly related to the philosophical problem of inductive inference. The problem of learning (generalization) from finite data was formally investigated in VC-theory about 40 years ago. This theory starts with a mathematical formulation of the problem of learning from finite samples, without making any assumptions about parametric distributions. This formalization is very general and relevant to many applications in machine learning, statistics, life sciences, etc. Further, this theory provides necessary and sufficient conditions for generalization. That is, the set of admissible models (hypotheses about the data) should be constrained, i.e., should have finite VC-dimension. Therefore, any inductive theory or algorithm designed to explain the data should satisfy VC-theoretical conditions.
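For concreteness, one commonly quoted form of the VC generalization bound for binary classification is sketched below. The notation is an assumption introduced here for illustration and was not part of the workshop talk: $R(f)$ is the expected (prediction) risk of a model $f$, $R_{\mathrm{emp}}(f)$ is its error rate on $n$ i.i.d. training samples, $h$ is the VC-dimension of the set of admissible models, and $1-\eta$ is the confidence level. With probability at least $1-\eta$, simultaneously for all $f$ in the set,

$$
R(f) \;\le\; R_{\mathrm{emp}}(f) + \sqrt{\frac{h\left(\ln\frac{2n}{h} + 1\right) - \ln\frac{\eta}{4}}{n}} .
$$

The second term is finite only when $h$ is finite, and it shrinks as $n/h$ grows, which is the sense in which constraining the set of models is what makes generalization possible.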
It is difficult to proceed with constructive discussions on the subject of inductive inference until these basic facts of VC-theory are acknowledged and understood. In my talk, I briefly introduced the VC-theoretical framework, and it was challenged by philosophers during Q&A periods and discussions. They raised objections on two counts:
1. An underlying assumption that future data is statistically similar to past data is too narrow. In particular, Deborah Mayo suggested that “statistical science” offers methodologies that enable inductive learning from arbitrarily changing distributions. This assertion is puzzling to me, as I am unaware of any such methods.
2. There exist “better theories.” This assertion is too vague. Certainly, one can hope and strive for better theories. However, in order to be scientifically sound, these better theories have to include the VC-theory as a special case. Clearly, this was not the case with the philosophical theories presented during this workshop.
With regard to philosophical interpretation of data-analytic knowledge discovery, my presentation advocated an Instrumentalist position versus the Realist view of classical statistics. My arguments used both pragmatic considerations (following Leo Breiman’s paper) and VC-theoretical results. According to VC-theory, it is not possible to estimate a true model from finite data, whereas it is still possible to estimate a good predictive model. So the Instrumentalist view follows directly from a sound scientific theory, and not from the philosophical arguments. This instrumentalist approach has many philosophical and practical implications for interpretation of data-analytic models, as discussed in my talk.
In a more general sense, as noted by Vapnik, the problem of induction is ill-posed. This ill-posedness is a property of the problem itself, not of the solution. So realism is not possible, and instrumentalism is an appropriate (technically sound) philosophical position. Vapnik’s observation is very fundamental, and it challenges existing classical statistical methods (maximum likelihood, least squares estimation, etc.). This view, of course, is consistent with his original VC-theory concerned with theoretical analysis of the binary classification problem (e.g., estimating a good rule for discriminating handwritten digits 5 versus 8).
Finally, consider the importance/relevance of Occam’s Razor for statistical learning. This question can be addressed by VC-theory (assuming, of course, that this theoretical framework is adopted by all participants). Namely, inductive inference/generalization is controlled by the VC-dimension. This complexity index is different from the number of free parameters (or entities) used by statisticians and philosophers to measure model complexity. Therefore, OR is not relevant for the problem of inductive inference in statistical learning. This conclusion may be discouraging and unpleasant to philosophers. Some philosophers suggested that the OR principle still holds if the VC-dimension is used as a measure of complexity. This semantic game-playing seems counter-productive and only breeds more confusion. Of course, this discussion is limited to data-analytic modeling: the OR principle may still be useful for discovering other kinds of knowledge, e.g., first-principle knowledge.
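The gap between VC-dimension and parameter count can be illustrated with a standard textbook example (not taken from the workshop talks): the one-parameter rule sign(sin(w·x)) can realize every possible labeling of the points 10^-1, ..., 10^-n, for any n, so its VC-dimension is infinite even though it has a single free parameter. The sketch below checks this by brute force for n = 5; the function names are hypothetical and chosen only for this illustration.

```python
import math
from itertools import product

def sine_classifier(w, x):
    """One-parameter rule: label +1 if sin(w * x) > 0, else -1."""
    return 1 if math.sin(w * x) > 0 else -1

def weight_for(labels):
    """Pick w so that sin(w * 10**-i) has the sign of labels[i-1].

    Standard construction: w = pi * (1 + sum of 10**i over all i labeled -1).
    """
    k = 1 + sum(10 ** i for i, y in enumerate(labels, start=1) if y == -1)
    return math.pi * k

def shatters(n):
    """Return True if every +/-1 labeling of the n points is realized."""
    points = [10.0 ** -i for i in range(1, n + 1)]
    for labels in product((-1, 1), repeat=n):
        w = weight_for(labels)
        if tuple(sine_classifier(w, x) for x in points) != labels:
            return False
    return True

print(shatters(5))
```

The check is kept to small n because the construction relies on floating-point evaluation of sin at large arguments; the mathematical argument itself holds for any n.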
In conclusion, a few more general remarks:
In the course of this workshop, it became evident that there is disagreement about, or misunderstanding of, the learning problem setting used by statisticians/machine learning researchers on the one hand and philosophers on the other. Without a common understanding and agreement on the basic assumptions, it is difficult to have a meaningful technical discussion. VC-theory uses a quantifiable notion of generalization (prediction risk) that originated from Rosenblatt’s perceptron, and assumes the standard inductive learning setting. So my and Vapnik’s talks discussed inductive inference, Occam’s Razor, etc., in the context of this setting. Similar settings are also used in most machine learning and statistical methods.
Apparently, this learning setting is not familiar to philosophers, and they use a different set of assumptions and concepts.
Some philosophers refer to hypothesis testing when they discuss induction and statistical learning. Under this setting, given a set of data points, the goal is to decide whether this data set was generated by a given (probabilistic) model. In contrast, in machine learning, we are given a set of data samples, and the goal is to select the best model (function, hypothesis) from a given set of possible models. This latter (machine learning) view was used in Popper’s discussions on polynomial curve fitting, used for illustrating his ideas on the connection between falsifiability and complexity.
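The machine learning setting described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a method from the talk: the candidate models are a hypothetical finite set of threshold rules ("predict 1 if x > t"), the data are synthetic with 10% label noise, and the selection criterion is plain empirical risk minimization on the training sample.

```python
import random

def empirical_risk(threshold, samples):
    """Fraction of (x, y) samples misclassified by 'predict 1 if x > threshold'."""
    errors = sum((x > threshold) != bool(y) for x, y in samples)
    return errors / len(samples)

def select_model(thresholds, samples):
    """Select the candidate model with the smallest training error."""
    return min(thresholds, key=lambda t: empirical_risk(t, samples))

def make_samples(n, rng, true_threshold=0.6, noise=0.1):
    """Synthetic data (an assumption): x uniform on [0, 1], labels flipped with prob. noise."""
    samples = []
    for _ in range(n):
        x = rng.random()
        y = int(x > true_threshold)
        if rng.random() < noise:
            y = 1 - y
        samples.append((x, y))
    return samples

rng = random.Random(0)
train = make_samples(50, rng)      # finite sample used for model selection
test = make_samples(5000, rng)     # fresh data from the same distribution
candidates = [i / 10 for i in range(11)]

best = select_model(candidates, train)
print("selected threshold:", best)
print("training error:", empirical_risk(best, train))
print("test error (estimate of prediction risk):", empirical_risk(best, test))
```

Note that nothing here asks whether the data were generated by a particular probabilistic model; the only question is which member of the given set predicts best on new data from the same distribution.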
Finally, as stated in my talk, useful philosophical ideas/interpretations usually develop in response to new scientific and technological advances. It seems that some philosophers favor a different view, that philosophy should guide scientific methodology (e.g., statistical inference). According to this view, philosophical models of induction should yield better results for statistical inference problems. I have not seen any such empirical evidence in the philosophical presentations. Several presentations discussed polynomial curve fitting, but it was not clear whether/how philosophical models of induction yield an improvement for this well-known problem. In this regard, I emphasize that empirical verification is an important part of any true scientific theory, and logical arguments alone do not suffice. According to Albert Einstein:
Pure logical thinking cannot yield us any knowledge of the empirical world. All knowledge starts from experience and ends in it.
Note that Einstein’s quotation does not mention intelligibility or “truth.” Even though Einstein referred to first-principle knowledge, his argument certainly holds for data-analytic knowledge as well.
Dr. Vladimir Cherkassky, Department of Electrical and Computer Engineering, University of Minnesota
Just to briefly correct two errors: (1) I never claimed that “‘statistical science’ offers methodologies that enable inductive learning from arbitrarily changing distributions,” only that the prediction-classification examples Cherkassky discusses, however important and interesting, are special cases and not the whole of statistical science (much less, all of science). Says Cherkassky: “In machine learning, we are given a set of data samples, and the goal is to select the best model (function, hypothesis) from a given set of possible models.” Fine, but is the background knowledge required to obtain the pieces of this setup, to ensure iid, and so on, itself reducible to prediction-classification problems? I say no, as would any good Popperian. (2) I would never suggest that philosophy guide scientific methodology; on the contrary, I have long argued that statistical science is highly relevant to the problems of induction and evidence-transcending inference that philosophers care about. However, the suitability of an instrumental philosophy for machine learning is no argument for a general instrumental epistemology or metaphysics. I may have more to say on this later, but for now I will leave it as open to readers’ comments.
[i] See http://errorstatistics.com/2012/06/26/deviates-sloths-and-exiles-philosophical-remarks-on-the-ockhams-razor-workshop/. Several other subsequent posts follow up on the “Foundations of Simplicity” theme.