U-PHIL: Aris Spanos on Larry Wasserman

Our first outgrowth of “Deconstructing Larry Wasserman”. 

Aris Spanos – Comments on:

“Low Assumptions, High Dimensions” (2011)

by Larry Wasserman*

I’m happy to play devil’s advocate in commenting on Larry’s very interesting and provocative (in a good way) paper on ‘how recent developments in statistical modeling and inference have [a] changed the intended scope of data analysis, and [b] raised new foundational issues that rendered the ‘older’ foundational problems more or less irrelevant’.

The new intended scope, ‘low assumptions, high dimensions’, is delimited by three characteristics:

“1. The number of parameters is larger than the number of data points.

2. Data can be numbers, images, text, video, manifolds, geometric objects, etc.

3. The model is always wrong. We use models, and they lead to useful insights but the parameters in the model are not meaningful.” (p. 1)

In the discussion that follows I focus almost exclusively on the ‘low assumptions’ component of the new paradigm. The discussion by David F. Hendry (2011), “Empirical Economic Model Discovery and Theory Evaluation,” RMM, 2: 115-145, is particularly relevant to some of the issues raised by the ‘high dimensions’ component in a way that complements the discussion that follows.

My immediate reaction to the demarcation based on 1-3 is that the new intended scope, although interesting in itself, excludes the overwhelming majority of scientific fields where restriction 3 seems unduly limiting. In my own field of economics the substantive information comes primarily in the form of substantively specified mechanisms (structural models), accompanied by theory-restricted and substantively meaningful parameters.

In addition, I consider the assertion “the model is always wrong” an unhelpful truism when ‘wrong’ is used in the sense that “the model is not an exact picture of the ‘reality’ it aims to capture”. Worse, if ‘wrong’ refers to ‘the data in question could not have been generated by the assumed model’, then any inference based on such a model will be dubious at best!

To separate these two types of ‘wrongness’ one needs to distinguish between two different, but interrelated, types of models. Behind every substantive (structural, scientific) model Mϕ(x), aiming to capture certain key features of the phenomenon of interest, there is (often implicit) a statistical model Mθ(x) that concerns the assumed probabilistic structure of the stochastic process {Xk, k∈N=(1,2,…)} underlying the data x0:=(x1,…,xn). It is one thing to claim that Mϕ(x) is not a ‘realistic enough’ picture of the reality it aims to capture, and entirely another to claim that the assumed process {Xk, k∈N}, underlying Mθ(x), could not have generated data x0. A substantive model Mϕ(x) may always be not-realistic-enough, but a statistical model Mθ(x) may be entirely satisfactory [when its probabilistic assumptions are valid for the data–statistically adequate] to reliably probe substantive questions of interest. Indeed, unrealistic models, when they are statistically adequate, can ‘give rise to useful insights’; see Spanos (2010).

Wasserman’s ‘weak assumptions’ is partly motivated by a widely held viewpoint:

“I think most statisticians would agree that methods based on weak assumptions are generally preferable to methods based on strong assumptions.” (p. 2)

I wouldn’t, because the assertion is based on two questionable presuppositions. First, that weaker (stronger) assumptions give rise to inferences that are less (more) vulnerable to misspecification. Second, that the precision of inference is not too adversely affected by the weaker assumptions.

Starting with the latter, weaker assumptions do give rise to less precise inferences; there is a clear trade-off between weaker (stronger) assumptions and less (more) precise inferences – assuming the assumptions are valid. In light of that, weaker assumptions can contribute to less reliable inferences in two different ways. Firstly, weaker assumptions often give rise to inferences that rely on asymptotic results. This, however, does not guarantee that, for a given sample size n, the nominal (assumed) error probabilities are close to the actual ones to ensure error reliability of inference; applying a 5% significance level test when the actual type I error (for a given n) is closer to 90% can easily lead the inference astray. Secondly, weaker premises often invoke non-testable conditions which render model validation impossible. An instance of such untestable conditions, when kernel smoothing is employed, is: the “true” density function f(x) has (a) uniformly continuous derivatives of up to order 3, and (b) bounded support [a, b]; see Thompson and Tapia (1991). Clearly, these assumptions can only be taken at face value since they cannot be tested! This means that such models/procedures are potentially more, not less, vulnerable to statistical misspecification than models with strong but testable assumptions. ‘Weak assumptions’ will definitely undermine the precision of inference and when unvalidated they will derail its reliability. Indeed, I will go a step further and claim that we learn from data about phenomena of interest by employing reliable and incisive (optimal) procedures to establish trustworthy evidence. The best way to achieve that is to use fully parametric models that enable one to establish both the reliability and precision (optimality) of inference.
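The gap between nominal and actual error probabilities can be illustrated with a small simulation. The sketch below is only an illustration under assumptions that are not in the original discussion: the departure from the premises is taken to be AR(1) dependence with coefficient 0.9, the sample size is n=100, and the helper names (t_test_rejects, simulate_ar1, actual_size) are hypothetical. It applies the usual t-ratio test of μ=0 at a nominal 5% level and records how often it rejects when the null is true.

```python
import numpy as np

def t_test_rejects(x, crit=1.96):
    """Two-sided test of H0: mu = 0 using the usual t-ratio and a Normal critical value."""
    n = len(x)
    t_ratio = np.sqrt(n) * x.mean() / x.std(ddof=1)
    return abs(t_ratio) > crit

def simulate_ar1(n, rho, rng):
    """Zero-mean AR(1) process: X_k = rho * X_{k-1} + e_k, with e_k ~ N(0, 1)."""
    e = rng.standard_normal(n)
    x = np.empty(n)
    x[0] = e[0] / np.sqrt(1.0 - rho**2)  # start from the stationary distribution
    for k in range(1, n):
        x[k] = rho * x[k - 1] + e[k]
    return x

def actual_size(n, rho, reps=2000, seed=0):
    """Monte Carlo estimate of the actual type I error of the nominal 5% test."""
    rng = np.random.default_rng(seed)
    rejections = sum(t_test_rejects(simulate_ar1(n, rho, rng)) for _ in range(reps))
    return rejections / reps

size_iid = actual_size(n=100, rho=0.0)  # independence holds: close to the nominal 5%
size_dep = actual_size(n=100, rho=0.9)  # independence fails: far above the nominal 5%
print(f"actual size, IID data:   {size_iid:.3f}")
print(f"actual size, AR(1) data: {size_dep:.3f}")
```

The null hypothesis is true in both runs; only the validity of the independence assumption changes, and with it the actual type I error.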

2.1 Completely Nonparametric Inference

All the models in this category assume IID, which are highly restrictive assumptions for fields like economics. What these models do not assume is a direct distributional assumption; Normal, Poisson, etc.

In a classic paper Bahadur and Savage (1956) asked whether the optimal inference procedures associated with μ in the case of the simple Normal model:

Mθ(x): Xk ∽ NIID(μ,σ²), k=1,2,…,n, …,

like the sample mean being an excellent estimator, the one-sided t-test being UMP, etc., can be extended to the case of a statistical model, which retains the IID but replaces Normality with indirect distributional assumptions, like the unknown distribution belongs to a family Ϝ satisfying conditions like:

(i) it has finite mean and variance, or
(ii) all its moments are finite, or
(iii) it has an unknown but closed and bounded support a≤xk≤b.

In particular, is there a reasonably reliable and precise test or Confidence Interval (CI) for μ in the context of simple (IID) models with a distribution belonging to Ϝ? The surprising answer was NO! They showed that any test (or CI) for μ based on Ϝ will be biased and inconsistent, i.e. its power will be less than or equal to the size (type I error) of the test for all n, and asymptotically the power goes to zero as n→∞. That is, relaxing Normality had dire consequences on all inferences concerning μ:

“It is shown that there is neither an effective test of the hypothesis that μ=0, nor an effective confidence interval for μ, nor an effective point estimate of μ.” (p. 1115)

Bahadur and Savage went on to explain this as follows:

“These conclusions concerning μ flow from the fact that μ is sensitive to the tails of the population distribution; parallel conclusions hold for other sensitive parameters, and they can be established by the same methods as are here used for μ.” (p. 1115)

The intuition underlying the Bahadur and Savage (1956) result is that inference is hopeless unless one can ‘tie down’ the assumed family Ϝ sufficiently to “tame” the sampling distribution of the estimator for the quantity of interest, so that the tail areas – where error probabilities live – can be evaluated (even approximately). In the example in question, the ‘untamedness’ of the distributions in Ϝ is largely inherited by the sampling distribution of the sample mean.
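The sensitivity of μ to the tails can be made concrete with a stylised contamination argument. This is a sketch in the spirit of the result, not Bahadur and Savage’s own proof, and the function names and numerical values are illustrative assumptions. Take F0 = N(0,1), with mean 0, and F1 = (1−ε)·N(0,1) + ε·(point mass at M), with mean εM. Both have all moments finite, so both belong to Ϝ under condition (ii). For n IID draws, the two samples differ only when at least one draw comes from the point mass, so the rejection probability of any test can differ between F0 and F1 by at most 1−(1−ε)^n.

```python
def contaminated_mean(eps, M):
    """Mean of the mixture (1 - eps) * N(0, 1) + eps * (point mass at M)."""
    return eps * M

def max_rejection_gap(n, eps):
    """Upper bound on |P_F1(reject) - P_F0(reject)| for ANY test based on n IID draws.

    The samples from F0 and F1 can be coupled so that they differ only when at
    least one draw comes from the point mass; that event has probability
    1 - (1 - eps)**n.
    """
    return 1.0 - (1.0 - eps) ** n

n = 50      # sample size (illustrative)
mu1 = 1.0   # mean under the alternative F1, held fixed
for eps in (1e-2, 1e-4, 1e-6):
    M = mu1 / eps  # push the contaminating point further out as eps shrinks
    gap = max_rejection_gap(n, eps)
    print(f"eps={eps:g}  M={M:g}  mean under F1={contaminated_mean(eps, M):g}  "
          f"max gap in rejection probability={gap:.6f}")
```

As ε shrinks with εM held fixed, the mean under F1 stays at 1 while no test based on 50 observations can have power appreciably above its size against F1.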

In light of that, the intuition behind the Donoho (1988) result is that the sampling distribution of ‘the number of modes’ is ‘tame enough’ because of its discreteness, in contrast to the sample mean distribution which is continuous! Hence, I will be happy to hazard an answer to Wasserman’s question:

“for which quantities do there exist non-trivial nonparametric confidence intervals when the dimension d > n?” (p. 2)

Spanos: ‘for those quantities of interest whose estimator has a sampling distribution ‘tame enough’ to render the evaluation of its tail areas possible’.

The Bahadur and Savage (1956) result brings out the important role played in inference by direct distributional assumptions. The real difference between a parametric and nonparametric inference is that one trades a direct and testable distributional assumption for an indirect and (often) untestable one relating to restrictions on the unknown f(x). The end result is that this tradeoff is often nothing short of undermining both the reliability and precision of inference. The former because one cannot validate the inductive premises, and the latter because one trades ‘global’ for ‘local’ optimality relative to a particular loss function, and invariably invokes a ‘large enough n’ to justify asymptotic approximations.

2.2 Inference without Models

From the discussion in Wasserman’s paper, it seems that P. Laurie Davies (and his co-workers) are attacking a strawman when they assert that their approach differs from the traditional because:

“Data are treated as deterministic. One then looks for adequate models rather than true models. His basic idea is that a distribution P is an adequate approximation for x0=(x1,…,xn), if typical data sets of size n, generated under P ‘look like’ (x1,…,xn). In other words, he asks whether we can approximate the deterministic data with a stochastic model.” (p. 3)

Indeed, there is nothing stochastic about a set of numbers (x1,…,xn). Nevertheless, not every set of numbers is amenable to statistical modeling. They are amenable only when they exhibit certain ‘chance regularity patterns’, i.e. when these numbers can be justifiably viewed as a realization of a generic stochastic process {Xk, k∈N} whose assumed probabilistic structure renders the particular data a ‘typical realization’ thereof. In addition, Davies’s criterion of ‘randomness’ of the residuals is nothing more than a crude version of statistical adequacy, in the sense that one basically tests whether the residuals are ‘patternless enough’. It is crude because the latter does not usually include all forms of systematic statistical information, e.g. distributional information. Note that systematic statistical information can be classified under three broad categories: Distributional, Dependence and Heterogeneity.

The most effective way to select the probabilistic structure of this generic process is to give a pertinent answer to the question: ‘what probabilistic structure, when imposed on the process {Xk, k∈N}, would render data x0 a “typical” realization thereof?’ An effective way to assess the “typicality” is to use trenchant Mis-Specification (M-S) testing to assess the probabilistic assumptions comprising the statistical model based on {Xk, k∈N}, and not to eyeball some simulated data based on P to assess if they ‘look like’ the actual data. Macroeconometricians have been doing just that for the last quarter century with embarrassing results. As Mayo (1996) would assert, ‘eyeballing’ does not constitute a severe test for whether the data constitute a typical realization of a prespecified stochastic process.
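To give a rough idea of what probing the three categories (Distribution, Dependence, Heterogeneity) might look like for the simple Normal, IID model, here is a minimal sketch. It is not the battery of M-S tests used in practice: the particular statistics (lag-1 autocorrelation, correlation with the time index, a skewness-kurtosis statistic), the function name ms_checks and the simulated data are all illustrative assumptions. Critical values are the approximate 5% values under the null that all the model assumptions hold.

```python
import numpy as np

def ms_checks(x):
    """Crude checks of the Normal, Independent, Identically Distributed assumptions.

    Returns a dict of booleans; True means the check flags a departure at
    (approximately) the 5% level.
    """
    x = np.asarray(x, dtype=float)
    n = len(x)
    z = (x - x.mean()) / x.std()

    # Dependence: lag-1 autocorrelation, approximately N(0, 1/n) under IID.
    r1 = np.mean(z[1:] * z[:-1])
    dependence = abs(np.sqrt(n) * r1) > 1.96

    # Heterogeneity: correlation with the time index picks up a trending mean.
    t = np.arange(n)
    trend_corr = np.corrcoef(t, x)[0, 1]
    heterogeneity = abs(np.sqrt(n) * trend_corr) > 1.96

    # Distribution: skewness-kurtosis statistic, approximately chi-square(2)
    # under Normality (5% critical value 5.991).
    skew = np.mean(z**3)
    kurt = np.mean(z**4)
    sk_stat = n / 6.0 * (skew**2 + (kurt - 3.0) ** 2 / 4.0)
    normality = sk_stat > 5.991

    return {"dependence": dependence, "heterogeneity": heterogeneity, "normality": normality}

rng = np.random.default_rng(1)
x_trend = rng.standard_normal(200) + 0.02 * np.arange(200)  # mean drifts with k
x_skewed = rng.exponential(size=500)                         # IID but not Normal

print("trending data:", ms_checks(x_trend))
print("skewed data:  ", ms_checks(x_skewed))
```

Each check is aimed at one assumption, but a departure in one direction can spill over into the others (a trending mean also produces apparent dependence), which is one reason such checks are best read jointly rather than one at a time.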

Without models? This is another misleading claim. The example from Davies, Kovac and Meise (2009) is a theory-driven regression model:

Yk = f(Xk) + εk, εk∽IID(0,σ²), k=1,2,…,n,…

whose statistical adequacy is reduced to the cruder form based on whether the residuals are IID(0,σ²) (‘random’ in their terminology). This is a statistical model whose adequacy is not defined in terms of the validity of the probabilistic assumptions relating to the observable process {(Yk|Xk) k∈N}. In that sense, the minimization of the complexity function ψ(f) subject to this crude adequacy measure (randomness of the residuals) is an attempt to achieve some kind of simplicity without forsaking basic adequacy. This is not an unreasonable procedure, but by focusing on the error term it ignores the fact that if the data come from the joint distribution D(Xk,Yk;θ), then the statistical regression function E(Yk|Xk) is determined by the structure of D(Xk,Yk;θ), in the sense that it is based on the first moment of the conditional distribution derived via D(Xk,Yk;θ).

Hence, just looking at the scatter plot {(xk,yk), k=1,2,…,n} can be highly informative about the functional form of E(Yk|Xk)=h(Xk), and potentially compare it with the prespecified f(Xk).
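As a worked illustration of how the joint distribution pins down the regression function, consider the textbook case where D(Xk,Yk;θ) is bivariate Normal. The moment symbols below (μ1, μ2, σ11, σ12, σ22) are notation assumed for this example and do not appear in the discussion above.

```latex
\begin{pmatrix} Y_k \\ X_k \end{pmatrix} \sim
\mathsf{N}\!\left(
\begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix},
\begin{pmatrix} \sigma_{11} & \sigma_{12} \\ \sigma_{12} & \sigma_{22} \end{pmatrix}
\right)
\;\Longrightarrow\;
\begin{cases}
E(Y_k \mid X_k) = \beta_0 + \beta_1 X_k, \\[4pt]
\mathrm{Var}(Y_k \mid X_k) = \sigma_{11} - \dfrac{\sigma_{12}^2}{\sigma_{22}},
\end{cases}
\qquad
\beta_1 = \frac{\sigma_{12}}{\sigma_{22}}, \quad
\beta_0 = \mu_1 - \beta_1 \mu_2 .
```

Here linearity and homoskedasticity are not extra assumptions about an error term; they follow from joint Normality. A different joint distribution will in general yield a different functional form h(Xk) and a conditional variance that depends on Xk, which is why the scatter plot carries information about both.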

2.3 Individual Sequences

Inferences that are based on mathematical approximation algorithms without any probabilistic structure constitute attempts to revive the pre-Gauss (1809) approach to curve-fitting that relies exclusively on mathematical approximation theory. As argued in Spanos (2010) this framework is lacking the necessary understructure for inductive inference, i.e. it does not delimit the probabilistic premises stating the conditions under which the various statistics invoked, including estimators, test statistics and goodness-of-fit measures, are inferentially reliable, as opposed to mathematically justifiable. The ‘inferential’ results in such a framework often come in the form of providing an upper bound for the potential errors which converges to zero in some mathematical way. Wasserman (2011), p. 4, asserts:

“Moreover, this bound is tight. Note that there is no assumptions about randomness (subjective or frequentist).

In summary, we can do sensible inference without invoking probability at all. Why are scholars in foundations ignoring this?”

Such results have two major weaknesses. First, the mathematical approximation framework does not guarantee that the approximation errors (residuals) are ‘non-systematic’ in a probabilistic sense, and this will often undermine any statistical inference! Second, such upper bounds are often incredibly crude to the extent that they cannot give rise to incisive inferences. Indeed, Wasserman’s claim about bounds being tight has to be qualified – ‘tight with respect to what?’ In the example of this section, let us consider the case where N=10 and n=20. A number of obvious questions arise: (i) In what sense is the upper bound √((lnN)/(2n))=√((ln(10))/(2(20)))=.24 tight? (ii) What does this say about the accuracy of the underlying expert system compared to some alternative model/procedure? Granted, one can always fabricate a loss function for which the value .24 constitutes a ‘relatively’ small discrepancy, but (iii) doesn’t that canard bring the whole procedure into disrepute?
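The arithmetic behind the figure .24 is easy to reproduce, along with how slowly the bound shrinks as n grows. The snippet below is only a sketch; the function names and the target value 0.05 are illustrative assumptions.

```python
import math

def upper_bound(N, n):
    """The bound sqrt(ln(N) / (2n)) for N experts and n observations."""
    return math.sqrt(math.log(N) / (2.0 * n))

def n_needed(N, target):
    """Smallest n for which the bound is at most `target`."""
    return math.ceil(math.log(N) / (2.0 * target**2))

print(round(upper_bound(10, 20), 4))  # 0.2399, i.e. the .24 quoted above
print(n_needed(10, 0.05))             # 461
```

With N=10, bringing the bound down from .24 to .05 takes 461 observations instead of 20, since the bound falls only at the rate 1/√n.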

To answer these questions, I’m willing to compare the predictive accuracy of the Cesa-Bianchi and Lugosi (2006) procedure with that of a statistical model that only accounts for the ‘chance regularities’ exhibited by data x0=(x1,…,xn). My educated guess is that for any data (x1,…,xn) that can be viewed as a realization of a non-IID process – i.e. there is something to predict beyond a constant – the latter model will out-predict any expert system!

3. Low Assumptions in High Dimensions

“Can we apply some of this thinking to high dimensional problems?” (p.4)

Wasserman considers a number of examples that fall, in one way or another, into the demarcated intended scope.

3.1 Prediction

“The lasso (and its variants) has been used successfully in so many application areas that one cannot doubt its usefulness. This is interesting because the linear model is certainly wrong and is also uncheckable. For example, if Y is a disease outcome and X represents the expression levels of 50,000 genes, it is inconceivable that the mean of Y given X=x would be linear.” (p. 5)

“Thus, the lasso “works” under very weak conditions.” (p. 5)

Using a linear approximation for unknown regression functions has been employed for many years as a tool for crude prediction purposes, when no better procedures are available. However, sophisticated graphical analysis of the data can be used to reduce the dimensionality of the problem as well as provide additional information concerning the appropriateness of key assumptions, including linearity and homoskedasticity (imposed by LASSO), and/or adjustments to the original approximation that might improve the predictions.

As a practicing econometrician in the trenches, I have been inundated by reassuring messages from several high priests (including Nobel prize winners), as well as numerous publications in prestigious economic journals, that the recent macro-models known by the pretentious name, Dynamic Stochastic General Equilibrium (DSGE) models, have revolutionized policy making because they “work” well in practice. Here is a typical assertion by researchers in the European Central Bank:

“Recent developments in the construction, simulation and estimation of dynamic stochastic general equilibrium (DSGE) models have made it possible to combine a rigorous microeconomic derivation of the behavioural equations of macro models with an empirically plausible calibration or estimation which fits the main features of the macroeconomic time series.” http://www.ecb.int/home/html/researcher_swm.en.html.

As a result of such reassuring claims most central banks around the world have developed their own DSGE models to provide the basis of their forecasting and policy simulation analysis. Despite these reassuring claims, when I tested several of these models against the data, it turned out that their claim needs to be qualified to: “these models ‘work’ for uses other than drawing any form of inference, however formal or informal, including prediction and policy simulations”. These models are atrocious for inference purposes! One will do better in forecasting key macro-variables like inflation, GDP growth and interest rates by using a crystal ball spewing numbers between 1 and 5. When challenged, the adherents of DSGE models would invariably justify their continued use by invoking a lame form of relativism: alternative data-based models forecast equally badly:

“… it is shown that the estimated model is able to compete with more standard, unrestricted time series models, such as vector auto regressions (VARs), in out-of-sample forecasting”, as claimed by the same researchers at the European Central Bank (http://www.ecb.int/home/html/researcher_swm.en.html).

What the adherents of DSGE modeling neglect to mention (or never bother to check) is that the statistical models used in these comparisons, like the VAR, are just as misspecified (statistically) as their own models – their probabilistic assumptions are totally invalid for their data. Not surprisingly, when I constructed a statistically adequate model based on the same data, the model rejected their structural restrictions out of hand and predicted the key variables for the crisis period 2008-2010 very well; the prediction errors were both non-systematic and small in a precisely defined probabilistic sense! In contrast, predictions from a typical DSGE model, for both the crisis period and before, were both systematic [overprediction for the whole of the period] and very large [more than 20 times larger than those of the statistically adequate model]! Spare a thought for poor Greece that has to abide by the policy recommendations stemming from these empirically baseless models!

In light of that and other similar experiences in econometrics, I’m particularly skeptical when practitioners claim that their procedures/models ‘really work’, but when it comes to explaining what they mean by ‘really work’, they seek refuge in the worst kind of relativism: they do not compare too badly with certain alternatives in terms of particular loss functions. The real problem is that they have ditched any notion of ‘adequacy’ and ‘optimality’ by forsaking the use of well-defined statistical models in favor of ‘weak assumptions’, in a desperate search for that elusive ‘free lunch’ economists are infamous for. They pay dearly for the indirect distributional assumption in terms of imprecision and unreliability of inference!

3.2 Salient Structure

This example brings out the potential trade-off between weaker (stronger) statistical premises and stronger (weaker) substantive premises. Using this trade-off is a good idea when the reliability of the substantive information has been validated by other means. However, when the invoked substantive information is simply based on “strong beliefs” and armchair empiricism, as in the case of most economic commentators, ignoring the probabilistic structure of the data can be calamitous! I am particularly skeptical about this trade-off because the DSGE modeling referred to above constitutes an example that continues to yield dubious inference results, but the central banks (including the Federal Reserve) show no sign of abandoning such models primarily because of “strong beliefs” in their ‘usefulness’, despite the overwhelming empirical evidence that they are useless for inference purposes!

4. Bayes?

I agree with Larry that the Bayesian approach is not suited to statistical modeling and inference in the context of ‘low assumptions, high dimensions’.

5. Conclusion

“The best way to do this is to look at the wide array of new problems that statisticians (and computer scientists) are facing with the deluge of high dimensional complex data that is now so common.” (p. 8)

I agree with this conclusion, but I disagree with the claim that the Efron (1978) list of philosophical issues is irrelevant for today’s discussions pertaining to the philosophical foundations of statistics. The foundational problems on Efron’s list concern (directly or indirectly) the basic question:

“how does one construct inductive inference procedures that give rise to learning from data about phenomena of interest?”

It was argued above that such learning can take place when the reliability and precision of inference have been secured by validating the inductive premises and relying on optimal (most effective) procedures. This remains the key desideratum in the context of the ‘low assumptions, high dimensions’ paradigm. However, the examples discussed in this paper sidestep the reliability and precision problems by relying on weak premises and invoking asymptotic results based on unvalidated premises. In this newly demarcated paradigm there is nothing to prevent the residuals from the fitted model/procedure from being “systematic” in a probabilistic sense; they contain relevant statistical information not accounted for by the fitted model. That would undermine the reliability/precision of inference in these examples, as well as the ‘local optimality’ framed in terms of a particular loss function.

These problems are also prevalent in statistical learning theory, where the emphasis is placed on the algorithmic dimension of mathematical approximation. The fact that one can demonstrate that a certain algorithm “works” in simulations does not render such an algorithm effective in the case of actual data. This is because simulations sidestep both of the above key problems. The reliability of inference problem is sidestepped by forging reality. Finding the most effective (optimal) procedure (in a global sense) is sidestepped by invoking relative optimality based on a particular loss function, ignoring the fact that one can always concoct a loss function with respect to which any lame procedure is rendered relatively optimal!

In a certain sense, the newly demarcated intended scope of statistics returns to the pre-Fisher era, where the focus was on data description and least-squares curve-fitting, with inferences being justified using goodness-of-fit and invoking vague ‘large sample’ approximations. Granted, since the 1920s mathematical approximation theory has developed a lot more powerful results, and the modern computers have rendered the numerical aspects of this theory a lot less daunting, but, by themselves, these developments do not alleviate the unreliability and imprecision of inference problems. Indeed, goodness-of-fit/prediction is neither necessary nor sufficient for statistical adequacy; the latter will secure the reliability of inference; see Spanos (2010).

*Rationality, Markets and Morals (RMM) Special Topic: Statistical Science and Philosophy of Science:

Cesa-Bianchi, N. and G. Lugosi (2006), Prediction, Learning, and Games, Cambridge: Cambridge University Press.

Mayo, D. G. (1996), Error and the Growth of Experimental Knowledge, The University of Chicago Press, Chicago.

Spanos, A. (2010), “Akaike-type Criteria and the Reliability of Inference: Model Selection vs. Statistical Model Specification,” Journal of Econometrics, 158: 204-220.

Thompson, J. R. and R. A. Tapia (1991), Nonparametric function estimation, modeling, and simulation, SIAM, Philadelphia.


7 thoughts on “U-PHIL: Aris Spanos on Larry Wasserman”

  1. Aris: Thanks so much for these comments; they are extremely interesting and useful. I should also acknowledge the very helpful role your earlier notes on this paper served in my getting at possible foundational underpinnings in my “Wasserman deconstruction” (July 28 post). Do you think the new wave (possibly into a much older paradigm) reflects a genuinely different set of needs/applications (in areas like machine learning), and/or a desire to escape some of the controversies surrounding the use and interpretation of statistical models? My comments say it is both, but that the latter enters to a greater degree than is generally recognized.

    Finally, if, as you argue, the “parametric” problems not only reappear but possibly in more virulent form (in being less open to self-scrutiny), then it would seem pressing to probe the basis for saying a method “works”. (Are they really resting their claims on asymptotic results and simulations?) This might move us to a deeper level on foundations of statistics altogether.

    • Aris Spanos

      My feeling is that the new intended scope has developed partly as a result of applying statistical modeling and inference to new problems using primarily a computer intensive perspective. The latter perspective gives the impression that one can replace fully parametric (Fisher-Neyman-Pearson) frequentist modeling with much weaker premises by relying a lot more on computer intensive techniques that can be justified on mathematical approximation grounds instead of probabilistic grounds; the latter is an illusion, or so I argue. I’m sure that this was partly motivated by a desire to escape some of the controversies surrounding the use and interpretation of parametric statistical modeling and inference. Having said that, I get the impression that the new literature hasn’t made any genuine attempt to explain how the new perspective can avoid or circumvent the older problems and controversies. Indeed, there is an element of wishful thinking about claims relating to “weak premises”.
      My main argument in the comments above is that weaker premises create very serious foundational problems, including undue reliance on asymptotic inference and nonvalidated premises, that have not been adequately appreciated in this literature. Demonstrating that a procedure “works” in simulations is only half the battle; the easy half! In practice, there is a huge gap between “weak” and “valid” premises that needs to be bridged; a problem ducked by simulation demonstrations.

  2. Aris writes:

    To answer these questions, I’m willing to compare the predictive accuracy of the Cesa-Bianchi and Lugosi (2006) procedure with that of a statistical model that only accounts for the ‘chance regularities’ exhibited by data x0=(x1,…,xn). My educated guess is that for any data (x1,…,xn) that can be viewed as a realization of a non-IID process – i.e. there is something to predict beyond a constant – the latter model will out-predict any expert system!

    Two questions. The reference is not given. What is it? And, what does Aris mean to include within the scope of “expert systems”?

    Clark Glymour

    • Aris Spanos

      My apologies for taking my time to reply; I have been traveling all morning. I’m commenting on Wasserman’s paper and the reference is given there, together with an approving summary of the paper. The notion of the “expert system” is given in the reference in question, and my challenge is that I will do better at predicting by totally ignoring all the “expert system” information and instead construct a model that relies exclusively on the chance regularities exhibited by the data!

  3. Christian Hennig

    There is quite a bit in Aris’s posting with which I agree (perhaps more on this later). For the moment just one critical comment. I wouldn’t exactly say that Wasserman misrepresented the work and philosophy of P. L. Davies, but there’s certainly more to it than what was explained in Wasserman’s article. I think that the beating that this approach receives from Aris is somewhat undeserved, as Aris could perhaps check himself when reading Davies’s own “Data Features” and “Approximating Data” papers.

    Certainly Davies doesn’t only advertise “eyeballing” but has something that at least mathematically is quite close to misspecification testing although Davies would probably say that that’s a misnomer.

    • Christian: Just to let you know, I will be posting your post on Wasserman separately. Just taking our time.

    • Aris Spanos

      Christian: I take your point, but please remember that what I’m commenting on is not the perspective of P. L. Davies, but how Wasserman thinks this perspective fits into the new intended scope characterized by conditions 1-3!

I welcome constructive comments for 14-21 days. If you wish to have a comment of yours removed during that time, send me an e-mail.
