The Nature of the Inferences From Graphical Techniques: What is the status of the learning from graphs? In this view, the graphs afford good ideas about the kinds of violations it would be useful to probe, much as a forensic clue (e.g., a footprint or tire track) helps to narrow down the search for a suspect, or a fault-tree the search for a cause. The same discernment can be achieved with a formal analysis (using parametric and nonparametric tests), perhaps more discriminating than even the most trained eye can manage, but the reasoning and the justification are much the same. (The capabilities of these techniques may be checked by simulating data deliberately generated to violate or obey the various assumptions.)
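To make that parenthetical point concrete, here is a minimal simulation sketch (my own illustration, not from Spanos's example): one series is generated to obey the constant-mean assumption, another is deliberately generated with a trending mean, and one can then check whether a simple t-plot picks up the difference. All series, sample sizes, and parameter values below are illustrative assumptions.

```python
# Minimal sketch: check what a t-plot shows for data deliberately generated
# to obey vs. violate the constant-mean (identically distributed) assumption.
# The series, sample size, and parameter values are illustrative assumptions.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
n = 60
t = np.arange(n)

obeys = rng.normal(loc=10.0, scale=1.0, size=n)            # NIID: constant mean
violates = 10.0 + 0.15 * t + rng.normal(0.0, 1.0, size=n)  # deliberately trending mean

fig, axes = plt.subplots(1, 2, figsize=(10, 3))
for ax, series, title in zip(axes, (obeys, violates),
                             ("constant mean (obeys)", "trending mean (violates)")):
    ax.plot(t, series, marker="o", linewidth=1)
    ax.set_title("t-plot: " + title)
    ax.set_xlabel("t")
plt.tight_layout()
plt.show()
```

Running graphical checks on data with known properties in this way is one means of gauging how discriminating the diagnostics really are before trusting them on actual data.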
Taken together, the graphs indicate departures from the LRM in the direction of the DLRM, but only, for the moment, as pointing to a fruitful model to probe further. We are not licensed to infer that it is itself a statistically adequate model until its own assumptions are subsequently tested. Even when they are checked and found to hold up – which they happen to be in this case – our inference must still be qualified. While we may infer that the model is statistically adequate, this should be understood only as licensing the use of the model as a reliable tool for primary statistical inferences, not necessarily as representing the substantive phenomenon being modeled.
Back to the Primary Statistical Inference – Nonsense Regressions: In a nutshell, the respecified model that turned out to be statistically adequate, after thorough M-S testing, is the DLRM:
y_t = 17.68 + 0.19t - 0.000x_t + 1.50y_{t-1} + 0.013x_{t-1} - 0.56y_{t-2} + 0.014x_{t-2} + ε_t
Having established the statistical adequacy of the estimated DLRM, we are then licensed to make ‘primary’ statistical inferences about the values of the parameters in this model. In particular, we can proceed to assess the ability of the secret variable to help predict the population of the USA (y_t). An F test of the joint significance of the coefficients of (x_t, x_{t-1}, x_{t-2}) yields F(3, 26) = 0.302 [p = 0.823], which does not reject the hypothesis that they are all 0, indicating that the secret variable is uncorrelated with the population variable! That is, despite the earlier “impressive” t-ratios and excellent goodness-of-fit in the estimated equation (1) [see part 1], the secret variable is unrelated to y_t!
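For readers who want to see the mechanics, here is a hedged sketch of this kind of joint F test in Python with statsmodels. The data below are simulated stand-ins (Spanos's actual series are not reproduced here), so the numbers will not match F(3, 26) = 0.302; the point is only the form of the test.

```python
# Sketch of a joint F test that the coefficients of x_t, x_{t-1}, x_{t-2}
# are all zero in a dynamic linear regression model (DLRM).
# All data below are simulated stand-ins, not the series from the post.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 60
t = np.arange(n, dtype=float)

# Stand-in series: y follows an AR(2) with trend; x is an unrelated trending series.
y = np.zeros(n)
for i in range(2, n):
    y[i] = 1.0 + 0.02 * t[i] + 1.4 * y[i - 1] - 0.5 * y[i - 2] + rng.normal(0.0, 0.1)
x = 5.0 + 0.1 * t + rng.normal(0.0, 0.5, size=n)

df = pd.DataFrame({"y": y, "t": t, "x": x})
df["y_lag1"] = df["y"].shift(1)
df["y_lag2"] = df["y"].shift(2)
df["x_lag1"] = df["x"].shift(1)
df["x_lag2"] = df["x"].shift(2)
df = df.dropna()

# Estimate the DLRM: y_t on a constant, trend, x_t, y_{t-1}, x_{t-1}, y_{t-2}, x_{t-2}.
X = sm.add_constant(df[["t", "x", "y_lag1", "x_lag1", "y_lag2", "x_lag2"]])
dlrm = sm.OLS(df["y"], X).fit()

# Joint null hypothesis: the contemporaneous and lagged x-coefficients are all zero.
print(dlrm.f_test("x = 0, x_lag1 = 0, x_lag2 = 0"))
```

On a statistically adequate model, a non-rejection of this joint null plays the role described in the text: it licenses dropping the x-terms from the specification.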
We are thus led to drop these terms from the DLRM, giving rise to an autoregressive model of order m (AR(m)) with a trend. The estimated form of this AR(m) model is:
y_t = 17.148 + 0.217t + 1.475y_{t-1} - 0.572y_{t-2} + ε_t   (3)
Hence, on the basis of a statistically adequate model, we were able to infer reliably that the secret variable contributed nothing towards predicting or explaining the population variable. The regressions between x_t and y_t suggested by the estimated models M0 and M1 turn out to be nonsense. The source of the problem is that the inferences concerning the significance of x_t were unreliable because the underlying models were misspecified.
Revealing the identity of the secret variable shows the egregiousness of such erroneous inferences. Surprisingly, it turns out that:
x_t – the number of pairs of shoes owned by Spanos’s grandmother!
While this was an extreme case, the usual methods do not readily pick up the problem. It serves as a ‘canonical exemplar’ of the kind of error that requires methods to probe for and rule out, especially in the social sciences.
Brief Concluding Remarks:
I began by noting in part 1 (of these four m-s posts) that I was shopping for an account of testing assumptions of statistical models at around the time I came across Aris Spanos’ account (in 1999). The account I have only briefly sketched here promotes many of the central features of the error statistical philosophy I favor.
1. A piecemeal account.
(a) A central asset is that it avoids blurring problems due to a statistically misspecified model, on the one hand, with evidence of discrepancies from a substantive scientific claim or theory, on the other. If an underlying statistical model is misspecified, no valid severity assessment of subsequent claims can be made – except that they are all unwarranted. Far from merely assuming the likelihood function on which error statistical inferences are based, we have available, under the error statistical umbrella, a well-founded methodology for testing its assumptions. The problem of inferring the adequacy of the statistical model attains independence from the substantive inferences that follow (where this fails, the problem is usefully amplified).
(b) This piecemeal breakdown is essential even within the task of arriving at a statistically adequate model. We saw, for example, that the m-s test itself must be distinguished from the task of arriving at a new model, or what Spanos calls model respecification. We also saw that blithely inferring the alternative in the case of the Durbin-Watson test (part 2) permits inferences whose severity is low and/or incalculable.
(c) While the approach is piecemeal, the probing within each question split off is exhaustive. The headings of the “menu” in my illustration may be used to define any statistical model. Like most dabblers in statistical model checking, I had not been aware of how detailed and numerous the assumptions underlying statistical models are, nor of how dangerous working with an incomplete set can be to the enterprise. For instance, if we had thought that homoskedasticity was sufficient for valid inferences in the above example, we would have been in trouble: the homoskedasticity assumption was indeed valid, but the inferences were completely unreliable.
2. Fit is not enough. Perhaps the most surprising implication of statistical inadequacy is that it calls into question the most widely used criteria of model selection, goodness-of-fit/prediction measures like R² and the mean-square prediction error. What goes wrong with R² is that, in the presence of t-heterogeneity in the mean, the variances of y_t and x_t cannot be estimated consistently from sums of squared deviations around a fixed sample mean, since the true mean is trending. Goodness-of-fit/prediction is neither necessary nor sufficient for statistical adequacy (Spanos 2007). This is because such criteria rely on the “smallness” of the residuals instead of their “non-systematicity”: residuals can be “small” yet systematically different from white noise. One can explain the overwhelming majority of “spurious correlations, regressions, etc.” as stemming from statistically inadequate models.
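As a quick illustration of the “small but systematic residuals” point (my own simulation, not from the post): two unrelated series that both trend in the mean can give an impressive R², while a simple check of residual non-systematicity, here a Durbin-Watson statistic, flags the misspecification. The data-generating choices below are illustrative assumptions only.

```python
# Simulated illustration: large R^2 from two unrelated trending series,
# while the residuals are systematic (strongly positively autocorrelated)
# rather than white-noise. All numbers here are illustrative assumptions.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(42)
n = 100

# Two independent random walks with drift: both have trending means,
# but their innovations are unrelated to one another.
y = np.cumsum(0.5 + rng.normal(0.0, 1.0, size=n))
x = np.cumsum(-0.4 + rng.normal(0.0, 1.0, size=n))

res = sm.OLS(y, sm.add_constant(x)).fit()
print(f"R^2           = {res.rsquared:.3f}")              # typically large
print(f"Durbin-Watson = {durbin_watson(res.resid):.2f}")  # typically far below 2
```

The fit looks excellent by the usual criteria, yet the residuals are anything but non-systematic, which is exactly the pattern behind “nonsense regressions.”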
3. Solving Duhemian problems. Its focus on distinguishing the effects of different violations, and on capitalizing on a number of procedures for reliably pinpointing the blame for anomalies, is precisely in the error statistical spirit. Statistical adequacy, in the sense of the current account, constitutes statistical knowledge and is analogous to the role I have in mind for “experimental knowledge” more generally in science. It informs us of what the genuine regularities are, which in turn may serve for the (inductive!) inferences needed for falsifying or corroborating scientific hypotheses (see “no-pain philosophy” posts (1), (2), (3)). Adequate experimental or statistical models also teach us what kinds of models can reliably be used to learn about a phenomenon of interest. This is radically different from attaining ‘good fit’, however the latter is measured, and for much the same reason that “fit” alone is inadequate for scientific evidence more generally. Further, the information used to infer an adequate/inadequate statistical model with severity is distinguishable from that needed to probe substantive questions of interest.
(The above discussion is intended as a sketch of the m-s testing reasoning, for purposes of philosophical discussion (here and elsewhere); it omits many of the tests, parametric and non-parametric, singly and jointly, used in a full application of the same ideas. For more detailed accounts, see Aris Spanos.)
…how is it that Spanos has annual data on the number of pairs of shoes owned by his grandmother going back to 1955?
That’s easy! My grandmother would never throw away any shoes and each pair had a different story behind it; the stories I grew up with. Each pair was bought at a specific annual fair and it was dated.
Neat! Thanks, AS.
…while it is not strange that she had so many shoes, only a future economist would record the data…
In forensic science, tests, with their statistical models, have to be “validated” prior to use. This nearly always involves trial runs against independent data. We regard this as a robust test of adequacy. How does this compare to the other m-s tools?
I need to look into the case of forensic science in more detail to give an informed answer, but my first thought is that the statistical models are used to “calibrate” the lab processes used in such a context.
I happened to come across good discussions of forensics today at the Egyptian museum. For fun, here’s a link to their clickable mummy:
http://www.akhet.co.uk/clikmumm.htm