Larry Wasserman (“Normal Deviate”) has announced he will stop blogging (for now at least). That means we’re losing one of the wisest blog-voices on issues relevant to statistical foundations (among many other areas in statistics). Whether this lures him back or reaffirms his decision to stay away, I thought I’d reblog my (2012) “deconstruction” of him (in relation to a paper linked below).[i]
Deconstructing Larry Wasserman [i] by D. Mayo
The temptation is strong, but I shall refrain from using the whole post to deconstruct Al Franken’s 2003 quip about media bias (from Lies and the Lying Liars Who Tell Them: A Fair and Balanced Look at the Right), with which Larry Wasserman begins his paper “Low Assumptions, High Dimensions” (2011), his contribution to Rationality, Markets and Morals (RMM) Special Topic: Statistical Science and Philosophy of Science:
Wasserman: There is a joke about media bias from the comedian Al Franken:
‘To make the argument that the media has a left- or right-wing, or a liberal or a conservative bias, is like asking if the problem with Al-Qaeda is: do they use too much oil in their hummus?’
According to Wasserman, “a similar comment could be applied to the usual debates in the foundations of statistical inference.”
Although it’s not altogether clear what Wasserman means by his analogy with comedian (now senator) Franken, it’s clear enough what Franken meant if we follow up the quip with the next sentence in his text (which Wasserman omits): “The problem with al Qaeda is that they’re trying to kill us!” (p. 1). The rest of Franken’s opening chapter is not about al Qaeda but about bias in media. Conservatives, he says, decry what they claim is a liberal bias in mainstream media. Franken rejects their claim.
The mainstream media does not have a liberal bias. And for all their other biases . . . , the mainstream media . . . at least try to be fair. …There is, however, a right-wing media. . . . They are biased. And they have an agenda…The members of the right-wing media are not interested in conveying the truth… . They are an indispensable component of the right-wing machine that has taken over our country… . We have to be vigilant. And we have to be more than vigilant. We have to fight back… . Let’s call them what they are: liars. Lying, lying, liars. (Franken, pp. 3-4)
When I read this in 2004 (when Bush was in office), I couldn’t have agreed more. How things change*. Now, of course, any argument that swerves from the politically correct is by definition unsound, irrelevant, and/or biased. [ii] (December 2016 update: This just shows how things get topsy-turvy every 5-8 years. Now we have extremes on both sides.)
But what does this have to do with Bayesian-frequentist foundations? What is Wasserman, deep down, really trying to tell us by way of this analogy (if only subliminally)? Such are my ponderings—and thus this deconstruction. (I will invite your “U-Phils” at the end[a].) I will allude to passages from my contribution to RMM (2011) (in red).
A. What Is the Foundational Issue?
Wasserman: To me, the most pressing foundational question is: how do we reconcile the two most powerful needs in modern statistics: the need to make methods assumption free and the need to make methods work in high dimensions… . The Bayes-Frequentist debate is not irrelevant but it is not as central as it once was. (p. 201)
One may wonder why he calls this a foundational issue, as opposed to, say, a technical one. I will assume he means what he says and attempt to extract his meaning by looking through a foundational lens.
Let us examine the urgency of reconciling the need to make methods assumption-free and that of making them work in complex high dimensions. The problem of assumptions of course arises when they are made about unknowns that can introduce threats of error and/or misuse of methods.
Wasserman: These days, statisticians often deal with complex, high dimensional datasets. Researchers in statistics and machine learning have responded by creating many new methods … . However, many of these new methods depend on strong assumptions. The challenge of bringing low assumption inference to high dimensional settings requires new ways to think about the foundations of statistics. (p. 201)
It is not clear if Wasserman thinks these new methods run into trouble as a result of unwarranted assumptions. This is a substantive issue about Wasserman’s applications that foundational discussions are unlikely to answer. Still, he sees the issue as one of foundations, so I shall take him at his word.
The last decade or more has also given rise to many new problem areas that call for novel methods (e.g., machine learning). Do they call for new foundations? Or, can existing foundations be relevant here too? (See Larry Wasserman’s contribution.) A lack of clarity on the foundations of existing methods tends to leave these new domains in foundational limbo. (Mayo 2011, 92)
I may seem to be at odds with Wasserman’s call to move on past frequentist-Bayesian debates:
Debates over the philosophical foundations of statistics have a long and fascinating history; the decline of a lively exchange between philosophers of science and statisticians is relatively recent. Is there something special about 2011 (and beyond) that calls for renewed engagement in these fields? I say yes. (Mayo, p. 80)
Perhaps this is Wasserman’s meaning: new types of problems and methods call for a more pragmatic perspective on learning from data. One cannot begin at the point at which different interpretations of probability (Bayesian or frequentist) enter; so frequentist-Bayesian debates are not as central to current practice.
I would never claim there is any obstacle to practice in not having a clear statistical philosophy. But that is different from maintaining both that practice calls for recognition of underlying foundational issues and that Bayesian-frequentist issues are not especially important to them. The fact is, key underlying issues come to the surface and are illuminated within frequentist-Bayesian contrasts, as are issues surrounding objective/subjective, deduction/induction, and truth/idealizations, deliberately discussed on this blog. It may be insisted we are beyond them, but they invariably lurk in the background; they are the elephants in the room.
We deliberately used ‘statistical science’ in our forum title because it may be understood broadly to include the full gamut of statistical methods, from experimental design, generation, analysis, and modeling of data to using statistical inference to answer scientific questions. (Even more broadly, we might include a variety of formal but nonprobabilistic methods in computer science and engineering, as well as machine learning.) (Mayo, p. 85)
B. Models Are Always Wrong
Wasserman: One then looks for adequate models rather than true models… . [A] distribution P is an adequate approximation for x1,…, xn, if typical data sets of size n, generated under P ‘look like’ x1,…, xn. (p. 203)
The recognition that “the model is always wrong” (in the sense of being an idealization) was clear to the founders of “classical” statistics* (see relevant remarks from Cox, Fisher, and Neyman elsewhere on this blog). Although this recognition discredits the idea that inference is all about assigning degrees of belief or confirmation to hypotheses and models, it supports the use of probability in standard error statistics, or so I argue. One can learn true things from idealized models.
Wasserman: A more extreme example of using weak assumptions is to abandon probability completely… . Why are scholars in foundations ignoring this? (pp. 203-4)
By and large, the point that the idea of data being literally “generated from a distribution is usually a fiction” (p. 203) is also not news to error statisticians; in a sense, observations are always deterministic. Viewing the sample as if it were generated probabilistically may simply be a way to cope with incomplete information and the incorrect inferences that can result. Probability is introduced as attached to methods (which, in this example, would be for a type of prediction or classification tool).
The machine learners say that there is little need to understand what actually produced the numbers. Fine, then methods are apt that enable an increasingly successful error-rate reduction. Under error statistics’ big umbrella, machine learning appears to fall under the subset of the philosophy of “inductive behavior,” the goals of which involve controlling/improving performance and setting bounds for error rates, and trading off precision and accuracy where appropriate to the particular case. This is in contrast to the subset that is the main focus of my work: that which uses error rates to assess and control how severely claims have passed tests. The latter are contexts of scientific inference. In the prediction-classification example, however, the error-rate guarantees are just the ticket. (I would not rule out inferences about the case at hand.) Yet in the domains of both inductive behavior and scientific inference, the error statistician regards models as approximations and idealizations, or, as Neyman saw them, “picturesque” ways of talking about actual experiments.
Wasserman has proved many intriguing results about the problems of and prospects for low-assumption methods. Whether methods that invoke assumptions could do better, perhaps alongside these (checking or making allowances later), is not something on which I can speculate. As complex as the classification prediction problems are, they enjoy an outcome that’s normally absent: we get to find out if we’ve been successful. Background knowledge enters in qualitative ways, not obviously as prior probability distributions in parameters.
C. Is It Bayesian?
Wasserman: In principle, low assumption Bayesian inference is possible. We simply put a prior π on the set of all distributions P. The rest follows from Bayes theorem. But this is clearly unsatisfactory. The resulting priors have no guarantees, except the solipsistic guarantee that the answer is consistent with the assumed prior. (p. 206) [iii]
One big reason some may turn aside from frequentist-Bayesian contrasts is that today even most Bayesians grant the importance of good performance characteristics (though their meaning may differ distinctly). The traditional idea that statistical learning is well-captured by Bayes’ theorem is rarely upheld (we have seen exceptions, most recently Lindley, also Kadane) [iv].
Today’s debates clearly differ from the Bayesian-frequentist debates of old. In fact, some of those same discussants of statistical philosophy, who only a decade ago were arguing for the ‘irreconcilability’ of frequentist p-values and (Bayesian) measures of evidence, are now calling for ways to ‘unify’ or ‘reconcile’ frequentist and Bayesian accounts… . (Mayo, p. 82)
In some cases the nonsubjective posteriors may have good error-statistical properties of the proper frequentist sort, at least in the asymptotic long run. But then another concern arises: If the default Bayesian has merely given us technical tricks to achieve frequentist goals, as some suspect, then why consider them Bayesian (Cox 2006)? Wasserman (2008, 464) puts it bluntly: If the Bayes’ estimator has good frequency-error probabilities, then we might as well use the frequentist method. If it has bad frequency behavior then we shouldn’t use it. (The situation is even more problematic for those of us who insist on a relevant severity warrant.) (Mayo, p. 90)
Wasserman: [In other cases] the answers are usually checked against held out data. This is quite sensible but then this is Bayesian in form not substance. (p. 206)
In this context, insofar as I understand it, the goal is to be able to assess how well the rule can predict “test sets” and to provide an estimate of prediction error. The substance is of an error-statistical kind: through various strategies (e.g., cross validation) we may learn approximately how well a predictive model will perform in cases other than those already used to fit the model. It connects with a general set of strategies for preventing too-easy fits and avoiding (pejorative) double-counting, “overfitting,” and nongeneralizable results.
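To make the held-out idea concrete, here is a minimal sketch of k-fold cross validation. Nothing in it comes from Wasserman’s paper: the straight-line model, the synthetic data, and the function names (fit_line, mse, k_fold_indices, cross_validated_mse) are all assumptions chosen for illustration. It only shows the general strategy of scoring a fitted rule on cases that were not used to fit it.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def mse(model, xs, ys):
    """Mean squared prediction error of a fitted (a, b) on the given cases."""
    a, b = model
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys)) / len(xs)


def k_fold_indices(n, k, seed=0):
    """Shuffle the indices 0..n-1 and deal them into k disjoint folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def cross_validated_mse(xs, ys, k=5, seed=0):
    """Average error on each fold when the model is fit on the other folds."""
    errors = []
    for fold in k_fold_indices(len(xs), k, seed):
        held = set(fold)
        train = [i for i in range(len(xs)) if i not in held]
        model = fit_line([xs[i] for i in train], [ys[i] for i in train])
        errors.append(mse(model, [xs[i] for i in fold], [ys[i] for i in fold]))
    return sum(errors) / len(errors)


# Synthetic data (an assumption for illustration): a line plus Gaussian noise.
rng = random.Random(1)
xs = [i / 10 for i in range(50)]
ys = [2.0 + 0.5 * x + rng.gauss(0, 1) for x in xs]

in_sample = mse(fit_line(xs, ys), xs, ys)      # error on the cases used to fit
held_out = cross_validated_mse(xs, ys, k=5)    # error on cases not used to fit
print(f"in-sample MSE: {in_sample:.3f}  cross-validated MSE: {held_out:.3f}")
```

The in-sample figure reuses the fitting data and so tends to flatter the model; the cross-validated figure is the one that speaks to performance on new cases.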
So where does this leave us in deconstructing Wasserman’s call for new-fangled foundations?
Franken deconstructed: Let us imagine Franken as representing a frequentist error statistician[v]. He begins by noting that while Bayesians may detect a frequentist bias (in certain circles), he detects no such thing. Besides, such a quibble would be akin to worrying about Al-Qaeda using too much oil in their hummus!
Frequentists, he says, are at least trying to meet a fundamental scientific requirement for controlling error, and are open to any number of ways of accomplishing this. But Bayesians—at least dyed-in-the-wool (or staunch subjective or “philosophical”) Bayesians—have an agenda, Franken is saying, by analogy. They charge frequentists with legitimating a hodgepodge of “incoherent” and “inadmissible” methods; they say that frequentists care only for low error rates in the long run, have no way of incorporating background information, invariably misinterpret their own methods, and top it all off with a litany of howlers (that the Bayesian easily avoids). If the discourse on frequentist foundations seems biased, our frequentist Franken continues, it is only to correct the many blatant misinterpretations of its methods.
Now Wasserman comes in and utters the scientific equivalent of “Let’s move on” (as with the Clinton scandal, which gave rise to MoveOn.org, i.e., “Get over it”). The Bayesian requirements and philosophy do not underwrite the substance of the most promising new complex methods. So if our focus is to justify, interpret, and extend these new contexts, we are allowed to leave the old (frequentist-Bayesian) scandals behind. But, as Wasserman seems further to imply, finding oneself in an essentially frequentist, error-statistical world is not enough either, especially when it comes to the kinds of complex classification and prediction problems of machine learning, data mining, and the like. At any rate, new foundational concerns must loom large….
So let me inject myself into the interpretive mix I’ve created.
I concur with the deconstructed Franken and Wasserman. Taking seriously Wasserman’s intimation that there is not only a technical-statistical problem here (which only statisticians can solve) but also a foundational problem, I read him as seeking a ground for applications where probabilistic bounds, however crude, do not directly describe a data-generating mechanism, but assess/reduce/balance procedural error rates.
The “long-run” relative frequencies have probabilistic implications for bounding the next test set. The old accusation that good error-statistical properties are irrelevant to the case at hand goes by the wayside. Anyone who takes a broad view of error-statistical methods would have no problem finding a home for the variety of methods of creative control and assessment of approximate sampling distributions and error rates. This falls more clearly under what may be called “a behavioristic” context than one of scientific inference (though the latter is not precluded). It would require breaking out of traditional notions of frequentist statistics and, in so doing, simultaneously scotching the oft-repeated howlers.[vi]
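One standard way to express such a bound, offered here only as an illustration and not drawn from either RMM paper, is Hoeffding’s inequality. Assume a prediction rule fixed before the test set is seen, a 0-1 loss, and m test cases treated as independent draws from a common source (itself an idealization of the kind discussed above). Write e for the rule’s long-run error rate and \hat{e}_m for its observed error rate on the test set; these symbols are introduced just for this sketch.

```latex
\Pr\left( \left| \hat{e}_m - e \right| \ge \varepsilon \right)
  \;\le\; 2\exp\!\left( -2 m \varepsilon^{2} \right),
\qquad\text{so with probability at least } 1-\delta:\quad
\left| \hat{e}_m - e \right| \;\le\; \sqrt{\frac{\ln(2/\delta)}{2m}} .
```

With m = 1000 and δ = 0.05, for instance, the right-hand bound is about 0.043. The probability attaches to the procedure of drawing and scoring test sets, not to a claim about what generated the data.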
Ironically many seem prepared to allow that Bayesianism still gets it right for epistemology, even as statistical practice calls for methods more closely aligned with frequentist principles. What I would like the reader to consider is that what is right for epistemology is also what is right for statistical learning in practice. That is, statistical inference in practice deserves its own epistemology. (Mayo, p. 100)
Constructing such a framework would be one payoff of genuinely transcending the frequentist-Bayesian debates, rather than rendering them taboo, or closed.
Cox, D. R. (2006), Principles of Statistical Inference, Cambridge: Cambridge University Press.
Gelman, A. and C. Shalizi (2012), “Philosophy and the Practice of Bayesian Statistics (with discussion),” British Journal of Mathematical and Statistical Psychology. Article first published online: 24 Feb 2012.
Mayo, D. (2011), “Statistical Science and Philosophy of Science: Where Do/Should They Meet in 2011 (and Beyond)?” RMM Vol. 2, 79–102.
Wasserman, L. (2008), “Comment on article by Gelman,” Bayesian Analysis 3(3): 463–465.
*7/29 I modified this assertion, and will explicate the different senses in which Neyman and Pearson viewed the relationship between approximate models and correct/incorrect claims about the world later on.
[a] This had been an open “U-Phil”. If you send me a new analysis, I’m willing to post it.
[ii] Says Franken: “And what shocked me most…was the silence from those conservatives who complain about the ugliness of political discourse in this country.” (19) Oh pleeeze (to use Franken’s expression).
[iii] For some examples of methods applicable to large numbers of variables in econometrics under the error statistical umbrella, see the two contributions to the special topic by Aris Spanos, and David Hendry. It would be interesting to hear of relationships.
[iv] Even where Bayesian methods are usefully applied, some say “most of the standard philosophy of Bayes is wrong” (Gelman and Shalizi 2012, 2 n2). See https://errorstatistics.com/2012/06/19/the-error-statistical-philosophy-and-the-practice-of-bayesian-statistics-comments-on-gelman-and-shalizi/
[v] Never mind that, intuitively, I think, it would fit more closely to see him wearing a Bayesian hat. Please weigh in on this.
*I think Hillary C. was right about the right-wing conspiracy at the time; hence my 2008 endorsement of the PUMAs (standing for “Political Unity My Ass”).
Some related reactions and responses to Wasserman:
Spanos on Wasserman
Hennig and Gelman on Wasserman
Wasserman response to Mayo’s deconstruction: