Two further contributions in relation to
“Low Assumptions, High Dimensions” (2011)
Please also see: “Deconstructing Larry Wasserman” by Mayo, and Comments by Spanos
Christian Hennig: Some comments on Larry Wasserman, “Low Assumptions, High Dimensions”
I enjoyed reading this stimulating paper. These are very important issues indeed. I’ll comment on both main concepts in the text.
1) Low Assumptions. I think that the term “assumption” is routinely misused and misunderstood in statistics. I can’t see such misuse explicitly in Wasserman’s paper, but the “message” of the paper may easily be misunderstood, because Wasserman doesn’t do much to prevent this kind of misunderstanding.
Here is what I mean. The arithmetic mean can be derived as the optimal estimator under an i.i.d. Gaussian model, which is often interpreted as the “model assumption” behind it. However, we don’t really need the Gaussian distribution to be true for the mean to do a good job. Sometimes the mean will do a bad job in a non-Gaussian situation (for example, in the presence of gross outliers), but sometimes not. The median has nice robustness properties and is seen as admissible for ordinal data. It is therefore usually associated with “weaker assumptions”. However, the median may be worse than the mean in a situation where the Gaussian “assumption” of the mean is grossly violated.

At UCL we ask students for their general opinion about our courses on a -2/-1/0/1/2 Likert scale. The distributions that we get are strongly discrete, and the scale is usually interpreted as ordinal. Still, for ranking courses the median is fairly useless (pretty much all courses end up with a median of 0 or 1), whereas the arithmetic mean can still detect statistically significant, meaningful differences between courses.
Why? Because it’s not only the “official” model assumptions that matter but also whether a statistic uses all the data in an appropriate manner for the given application. Here it’s fatal that the median ignores all differences among observations north and south of it.
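A minimal sketch of this kind of situation (the response probabilities below are invented for illustration, not actual UCL data) could look as follows: two courses share the same median on the -2..2 scale, yet a comparison of means, backed by a simple permutation test, separates them clearly.

    import numpy as np

    rng = np.random.default_rng(0)
    scale = np.array([-2, -1, 0, 1, 2])

    # Hypothetical response probabilities for two courses (made up for illustration).
    course_a = rng.choice(scale, size=200, p=[0.02, 0.05, 0.18, 0.40, 0.35])
    course_b = rng.choice(scale, size=200, p=[0.07, 0.13, 0.20, 0.45, 0.15])

    print("medians:", np.median(course_a), np.median(course_b))  # typically both 1.0
    print("means:  ", round(course_a.mean(), 2), round(course_b.mean(), 2))

    # Simple permutation test on the difference of means.
    observed = course_a.mean() - course_b.mean()
    pooled = np.concatenate([course_a, course_b])
    diffs = []
    for _ in range(5000):
        rng.shuffle(pooled)
        diffs.append(pooled[:200].mean() - pooled[200:].mean())
    print("permutation p-value:", np.mean(np.abs(diffs) >= abs(observed)))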
The mean in fact assumes that “the data are such that the mean doesn’t give a misleading result”, which depends not only on the underlying truth but also on how the result is used and interpreted. This is what properly used misspecification tests actually do: they don’t test the truth of the model assumption (that’s impossible); rather, they test whether there is anything in the data that will mislead the statistic that one wants to compute.
This holds for all kinds of statistics, be they parametric or nonparametric. Just because there is a theorem that tells us something about a certain method in a large class of models, or even in a model-free setting, doesn’t mean that the method comes without assumptions. It still cannot be taken for granted that the assumption “the method will make sense of the specific dataset” is fulfilled.
As an example, take the minimax regret result on individual sequences from Wasserman’s paper. Does this theorem tell us that the method works “without assumptions”? Actually, no. I’d suspect that there are individual sequences that just cannot be predicted well, and on which all predictors will give rubbish in the future. The result then says that the weighted “assumption-less” predictor is not much worse than the best of the other predictors here (which will still give rubbish), and not worse on other sequences either. So the result doesn’t say that the weighted predictor “works” anywhere. In comparison, something with strong explicit assumptions is at least guaranteed to work where the assumptions hold. If we want to know where the weighted “assumption-less” predictor really works well, we have to look at what exactly it is and, guess what, make some assumptions.
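To see what such a weighted predictor looks like, here is a minimal sketch of an exponentially weighted average forecaster, a standard construction in individual-sequence prediction (not necessarily the exact estimator discussed in Wasserman’s paper): on a 0/1 sequence that none of the experts can predict, the regret against the best expert is small, but the absolute loss remains large.

    import numpy as np

    rng = np.random.default_rng(1)
    T = 500
    y = rng.choice([0.0, 1.0], size=T)           # an "unpredictable" 0/1 sequence

    experts = np.array([0.25, 0.5, 0.75])        # three constant forecasters
    eta = np.sqrt(8 * np.log(len(experts)) / T)  # a standard learning-rate choice

    cum_loss = np.zeros(len(experts))            # each expert's cumulative squared error
    weighted_loss = 0.0
    for t in range(T):
        w = np.exp(-eta * cum_loss)
        w /= w.sum()
        forecast = w @ experts                   # exponentially weighted average prediction
        weighted_loss += (forecast - y[t]) ** 2
        cum_loss += (experts - y[t]) ** 2

    best = cum_loss.min()
    print("weighted forecaster loss:", round(weighted_loss, 1))
    print("best single expert loss: ", round(best, 1))
    print("regret vs best expert:   ", round(weighted_loss - best, 1))
    # The regret is small, but both losses are of order 0.25*T: low regret on a
    # sequence that nobody can predict does not mean that the predictions are good.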
2) High Dimensions. Actually, high d is “modern” but small n is not. Where in the past there were 40 2-d observations, there may now be 40 (or even 200) 10,000-d observations. That’s not worse. The missing 9,998 dimensions may well have existed in the past too; they simply weren’t observed, and it was implicitly assumed that they didn’t play a role. That’s not exactly a weaker assumption than what we have to make these days. If we really believe that what we did for 2-d in the past was appropriate, we can now just pick the two best of the 10,000 dimensions by thinking about their scientific meaning and throw away the others. What is really different from the past is that people tend to trust their computers more than their brains for finding the meaningful variables.
OK, I accept that one actually wants to get some added value out of the 9,998 new dimensions (although competing with a good brain that constructs some meaningful low-d indexes from high-d data is tough).
The thing about high-d data is that distributional shape is so complex that 40 (or even 200) observations cannot get the statistician anywhere near a clear idea about it. So we have to make assumptions such as “sparsity”, “a linear predictor will do OK”, etc. There is just no way around this (apart from confining attention to problems that don’t require such distributional details), although the inventors of methods will do their best to pretend that what they do is “low assumptions”. Just to illustrate this: for p>n, with probability 1, all Mahalanobis distances between points are the same. No affine equivariant method can give any information. So every method needs to assume, implicitly and without the possibility of checking it, that some directions in space are more meaningful than others (for much standard methodology, those along the original variables).
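A small numerical check of the Mahalanobis remark (a sketch, not a proof: it uses the Moore-Penrose pseudo-inverse as the natural replacement for the inverse of the necessarily singular sample covariance, and a modest p so that the computation stays cheap):

    import numpy as np

    rng = np.random.default_rng(2)
    n, p = 40, 200                    # p > n; kept modest so the pseudo-inverse is cheap
    X = rng.standard_normal((n, p))

    # Pseudo-inverse of the (singular) sample covariance; rcond chosen well above
    # numerical noise and well below the genuinely nonzero eigenvalues.
    S_pinv = np.linalg.pinv(np.cov(X, rowvar=False), rcond=1e-10)

    dists = []
    for i in range(n):
        for j in range(i + 1, n):
            diff = X[i] - X[j]
            dists.append(np.sqrt(diff @ S_pinv @ diff))
    dists = np.array(dists)

    # All pairwise distances coincide up to rounding error, at about sqrt(2*(n-1)).
    print(dists.min(), dists.max(), np.sqrt(2 * (n - 1)))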
I grant that some methods in high-d often do a good job. As I said before, the really important assumption is usually not the one about which a mathematical theorem can be proved, but that “nothing in the specific data breaks the method down”, given the intended interpretation of the result (it may be that assuming a slim graph structure “works” just because the human brain cannot conceive of the ways in which it doesn’t). Recall also that with 40*10,000 data one cannot do worse than with 40*2 data, so although 10,000 dimensions may look scarier, they don’t really make it more difficult to find a good predictor (and one shouldn’t therefore marvel too much at those that work).
_______________________________________________________________________________
Andrew Gelman on Larry Wasserman (reblogging Gelman’s comments, posted by Andrew on 13 February 2012)
Continuing with my discussion of the articles in the special topic of the journal Rationality, Markets and Morals (RMM): Statistical Science and Philosophy of Science:
Larry Wasserman, “Low Assumptions, High Dimensions”:
This article was refreshing to me because it was so different from anything I’ve seen before. Larry works in a statistics department and I work in a statistics department but there’s so little overlap in what we do. Larry and I both work in high dimensions (maybe his dimensions are higher than mine, but a few thousand dimensions seems like a lot to me!), but there the similarity ends. His article is all about using few to no assumptions, while I use assumptions all the time. Here’s an example. Larry writes:
P. Laurie Davies (and his co-workers) have written several interesting papers where probability models, at least in the sense that we usually use them, are eliminated. Data are treated as deterministic. One then looks for adequate models rather than true models. His basic idea is that a distribution P is an adequate approximation for x1,…,xn, if typical data sets of size n, generated under P, look like x1,…,xn. In other words, he asks whether we can approximate the deterministic data with a stochastic model.
This sounds cool. And it’s so different from my world! I do a lot of work with survey data, where the sample is intended to mimic the population, and a key step comes in the design, which is all about probability sampling. I agree that Wasserman’s (or Davies’s) approach could be applied to surveys—the key step would be to replace random sampling with quota sampling, and maybe this would be a good idea—but in the world of surveys we would typically think of quota sampling or other nonprobabilistic approaches as an unfortunate compromise with reality rather than as a desirable goal. In short, typical statisticians such as myself see probability modeling as a valuable tool that is central to applied statistics, while Wasserman appears to see probability as an example of an assumption to be avoided.
Just to be clear: I’m not at all saying Wasserman is wrong in any way here; rather, I’m just marveling at how different his perspective is from mine. I can’t immediately see how his assumption-free approach could possibly be used to estimate public opinion or votes cross-classified by demographics, income, and state. But, then again, maybe my models wouldn’t work so well on the applications on which Wasserman works. Bridges from both directions would probably be good.
With different methods and different problems come different philosophies. My use of generative modeling motivates, and allows, me to check fit to data using predictive simulation. Wasserman’s quite different approach motivates him to understand his methods using other tools.
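To make the quoted adequacy idea a bit more concrete, here is a toy sketch (the data, the candidate distribution, and the chosen statistics are all invented for illustration; this is not Davies’s formal procedure, and only loosely in the spirit of such checks): simulate many datasets of the same size from a candidate P and ask whether the observed data look typical among them.

    import numpy as np

    rng = np.random.default_rng(3)
    x = rng.exponential(scale=2.0, size=100)   # stand-in for some observed data

    # Candidate P: a normal distribution fitted by the data's mean and sd.
    def simulate(n):
        return rng.normal(loc=x.mean(), scale=x.std(ddof=1), size=n)

    def stats(data):
        # The choice of statistics encodes what "looking like the data" means here.
        return np.array([np.median(data), data.min(), np.mean(data > 2 * data.mean())])

    sims = np.array([stats(simulate(len(x))) for _ in range(2000)])
    obs = stats(x)

    # For each statistic: the proportion of simulated datasets at least as extreme.
    tail = np.minimum((sims >= obs).mean(axis=0), (sims <= obs).mean(axis=0))
    print("observed statistics:", np.round(obs, 2))
    print("tail proportions:   ", np.round(tail, 3))
    # Tiny tail proportions (here, notably, for the minimum) indicate that typical
    # datasets generated under this P do not look like x, i.e. P is not adequate.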
Christian: Thanks so much for bringing a different perspective to the whole problem of assumptions and what it genuinely means to be threatened/not threatened by them. People rarely bring up the fact that what really matters is not obeying assumptions but rather not invalidating/obstructing whatever one is trying to do/learn from the method/model. (I’d like to see an attempt to tackle this kind of (admittedly vague) question, for at least some cases.*) As you note: “it’s not only the ‘official’ model assumptions that matter but also whether a statistic uses all the data in an appropriate manner for the given application”. This is the kind of thing we should be thinking about in worrying about assumptions. I find your last point very intriguing as well.
*Do you know of such general discussions?
Andrew: I very much appreciate your having commented on this paper soon after it came out, months before the rest of us, along with your other RMM reflections. Your points interestingly strengthen those of Spanos and Hennig, I think. But Wasserman will have his own take on these matters tomorrow.
Mayo: Unfortunately I don’t think that there is much in the literature that explicitly treats the question of how to validate whether a statistical method is “fit for the given purpose with the given data”. I try to take some modest steps in that direction in the area of cluster analysis, but it’s early days.
What one can find in some good applied statistical work (including Gelman’s) are scattered arguments in this direction, but as far as I know there is no systematic approach.
Andrew Gelman: As in my reply to Aris’s posting, I again feel that I should comment on Davies’s work. I don’t see how it implies that one should replace random sampling by quota sampling. I can’t find anything in Davies’s work that points in this direction (one could see it as a weakness on his side that he doesn’t say much about sampling and the design of experiments; his work seems to be rather exclusively about data analysis once the data are there, from whatever source; but then, although I know quite a bit of his work, I may have missed something).
Christian: I was not commenting on Davies (none of whose work I’ve read) but on Wasserman. Wasserman wrote of “several interesting papers where probability models, at least in the sense that we usually use them, are eliminated.” I remarked that this is much different from the world of surveys, where random sampling is (a) standard, and (b) considered to be a good thing, not a bad thing. As I wrote, I think it would be possible to do survey sampling without using probability modeling; it would just be an unusual approach, compared to many decades of theory and practice of sample surveys. To analyze survey data without the use of probability models is way out of the mainstream of statistics. It might be a fine idea; I was just surprised to hear about it.