We’re going to be discussing the philosophy of m-s testing today in our seminar, so I’m reblogging this from Feb. 2012. I’ve linked the 3 follow-ups below. Check the original posts for some good discussion. (Note visitor*)
“This is the kind of cure that kills the patient!”
is the line of Aris Spanos that I most remember from when I first heard him talk about testing assumptions of, and respecifying, statistical models in 1999. (The patient, of course, is the statistical model.) On finishing my book, EGEK 1996, I had been keen to fill its central gaps, one of which was fleshing out a crucial piece of the error-statistical framework of learning from error: how to validate the assumptions of statistical models. But the whole problem turned out to be far more philosophically (not to mention technically) challenging than I had imagined. I will try (in 3 short posts) to sketch a procedure that I think puts the entire process of model validation on a sound logical footing. Thanks to attending several of Spanos' seminars (and his patient tutorials, for which I am very grateful), I was eventually able to reflect philosophically on aspects of his already well-worked-out approach. (Synergies with the error-statistical philosophy, of which this is a part, warrant a separate discussion.)
Problems of Validation in the Linear Regression Model (LRM)
The example Spanos was considering was the Linear Regression Model (LRM), which may be seen to take the form:
M_{0}: y_{t} = β_{0 }+ β_{1}x_{t} + u_{t}, t=1,2,…,n,…
where µ_{t} = β_{0} + β_{1}x_{t} is viewed as the systematic component, and u_{t} = y_{t} – β_{0} – β_{1}x_{t} as the error (non-systematic) component. The error process {u_{t}, t=1, 2, …, n, …} is assumed to be Normal, Independent and Identically Distributed (NIID) with mean 0 and variance σ^{2}, i.e., Normal white noise. Using the data z_{0}:={(x_{t}, y_{t}), t=1, 2, …, n}, the coefficients (β_{0}, β_{1}) are estimated (by least squares), yielding an empirical equation intended to enable us to understand how y_{t} varies with x_{t}.
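As a minimal sketch of this estimation step, the snippet below simulates an LRM with NIID errors and recovers (β_{0}, β_{1}) by least squares. All of the numbers here (the true coefficients, σ, and the x_{t} series) are made up for illustration; the post's actual series is not reproduced.

```python
import numpy as np

# Simulated LRM data; beta0, beta1, sigma and x_t are illustrative only.
rng = np.random.default_rng(0)
n, beta0, beta1, sigma = 50, 2.0, 0.5, 1.0
x = rng.uniform(0.0, 10.0, size=n)
u = rng.normal(0.0, sigma, size=n)        # NIID(0, sigma^2) errors
y = beta0 + beta1 * x + u                 # y_t = beta0 + beta1*x_t + u_t

# Least-squares estimates of (beta0, beta1)
X = np.column_stack([np.ones(n), x])
b0_hat, b1_hat = np.linalg.lstsq(X, y, rcond=None)[0]
resid = y - (b0_hat + b1_hat * x)         # estimated errors u_hat_t
```

Note that with an intercept in the model, the least-squares residuals always average to zero by construction, which is why the zero-mean error assumption cannot be checked from the residual mean alone.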
Empirical Example
Suppose that in her attempt to find a way to understand and predict changes in the U.S.A. population, an economist discovers, using regression, an empirical relationship that appears to provide almost a ‘law-like’ fit (see figure 1):
y_{t} = 167.115+ 1.907x_{t} + û_{t}, (1)
where y_{t} denotes the population of the USA (in millions), and x_{t} denotes a secret variable whose identity is not revealed until the end of these 3 posts. Both series refer to annual data for the period 1955–1989.
A Primary Statistical Question: How good a predictor is x_{t}?
The goodness-of-fit measure of this estimated regression, R^{2}=.995, indicates an almost perfect fit. Testing the statistical significance of the coefficients shows them to be highly significant: the p-values are zero (0) to the third decimal, indicating a very strong relationship between the variables. Everything looks hunky dory; what could go wrong?
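The flavor of this result is easy to reproduce in a toy setting. The sketch below uses simulated data (not the post's actual series) with a smoothly trending regressor and the coefficients of equation (1) borrowed purely for illustration; it shows how such data yield an R^{2} near 1 and an enormous t-statistic, before any assumption has been checked.

```python
import numpy as np

# Simulated stand-in for regression (1); series and noise are illustrative.
rng = np.random.default_rng(1)
n = 35
x = np.linspace(80.0, 130.0, n)                  # smooth, trending regressor
y = 167.115 + 1.907 * x + rng.normal(0.0, 1.0, n)

X = np.column_stack([np.ones(n), x])
beta = np.linalg.lstsq(X, y, rcond=None)[0]
fitted = X @ beta

# Goodness of fit: R^2 = 1 - RSS/TSS
tss = np.sum((y - y.mean()) ** 2)
rss = np.sum((y - fitted) ** 2)
r_squared = 1.0 - rss / tss

# t-statistic for the slope: b1 / SE(b1)
s2 = rss / (n - 2)
sxx = np.sum((x - x.mean()) ** 2)
t_slope = beta[1] / np.sqrt(s2 / sxx)
```

A huge R^{2} and t-statistic of this kind say nothing, by themselves, about whether the model's probabilistic assumptions hold; that is exactly the question the next section raises.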
Is this inference reliable? Not unless the data z_{0} satisfy the probabilistic assumptions of the LRM, i.e., the errors are NIID with mean 0, variance σ^{2}.
Misspecification (M-S) Tests: Questions of model validation may be seen as ‘secondary’ questions in relation to primary statistical ones; the latter often concern the sign and magnitude of the coefficients of this linear relationship.
Partitioning the Space of Possible Models: Probabilistic Reduction (PR)
The task in validating a model M_{0} (the LRM) is to test 'M_{0} is valid' against everything else!
In other words, if we let H_{0} assert that the 'true' distribution of the sample Z, f(z), belongs to M_{0}, the alternative H_{1} would be the entire complement of M_{0}; more formally:
H_{0}: f(z) € M_{0} vs. H_{1}: f(z) € [P - M_{0}]
where P denotes the set of all possible statistical models that could have given rise to z_{0}:={(x_{t},y_{t}), t=1, 2, …, n}, and € is “an element of” (all we could find).
The traditional analysis of the LRM has already, implicitly, reduced the space of models that could be considered. It reflects just one way of reducing the set of all possible models of which data z_{0} can be seen to be a realization. This provides the motivation for Spanos’ modeling approach (first in Spanos 1986, 1989, 1995).
Given that each statistical model arises as a parameterization from the joint distribution:
D(Z_{1},…,Z_{n};φ): = D((X_{1}, Y_{1}), (X_{2}, Y_{2}), …., (X_{n}, Y_{n}); φ),
we can consider how one or another set of probabilistic assumptions on the joint distribution gives rise to different models. The assumptions used to reduce P, the set of all possible models, to a single model, here the LRM, come from a menu of three broad categories. These three categories can always be used in statistical modeling:
(D) Distribution, (M) Dependence, (H) Heterogeneity.
For example, the LRM arises when we reduce P by means of the “reduction” assumptions:
(D) Normal (N), (M) Independent (I), (H) Identically Distributed (ID).
Since we are partitioning or reducing P by means of the probabilistic assumptions, it may be called the Probabilistic Partitioning or Probabilistic Reduction (PR) approach.[i]
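In outline (my hedged gloss; Spanos gives the full derivation), imposing the reduction assumptions on the joint distribution simplifies it step by step:

\begin{align*}
D(Z_1,\ldots,Z_n;\varphi)
  &\overset{\text{I}}{=} \prod_{t=1}^{n} D_t(Z_t;\varphi_t)
   \;\overset{\text{ID}}{=}\; \prod_{t=1}^{n} D(Z_t;\varphi) \\
  &= \prod_{t=1}^{n} D(y_t \mid x_t;\theta)\, D(x_t;\psi),
\end{align*}

and under Normality the conditional factor D(y_{t}|x_{t}; θ) is precisely the LRM, with θ:=(β_{0}, β_{1}, σ^{2}).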
The same assumptions, traditionally given by means of the error term, are instead specified in terms of the observable random variables (y_{t}, X_{t}): [1]-[5] in table 1 to render them directly assessable by the data in question.
Table 1 – The Linear Regression Model (LRM)

y_{t} = β_{0} + β_{1}x_{t} + u_{t}, t=1,2,…,n,…

[1] Normality: D(y_{t}|x_{t}; θ) is Normal
[2] Linearity: E(y_{t}|X_{t}=x_{t}) = β_{0} + β_{1}x_{t}, linear in x_{t}
[3] Homoskedasticity: Var(y_{t}|X_{t}=x_{t}) = σ^{2}, free of x_{t}
[4] Independence: {(y_{t}|X_{t}=x_{t}), t=1,…,n,…} is an independent process
[5] t-invariance: θ:=(β_{0}, β_{1}, σ^{2}) is constant over t
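Because [1]–[5] are stated in terms of observables, simple residual-based checks can probe several of them directly. The sketch below is an informal stand-in for proper M-S tests, run on simulated data with illustrative thresholds; it probes [3] via a trend in the squared residuals, [4] via lag-1 autocorrelation, and [5] via split-sample slope estimates.

```python
import numpy as np

# Informal checks of assumptions [3]-[5] from the residuals of a fitted
# LRM; data are simulated and the checks are illustrative only.
rng = np.random.default_rng(2)
n = 200
x = rng.normal(10.0, 2.0, size=n)
y = 1.0 + 0.5 * x + rng.normal(0.0, 1.0, size=n)

X = np.column_stack([np.ones(n), x])
beta = np.linalg.lstsq(X, y, rcond=None)[0]
resid = y - X @ beta

# [3] Homoskedasticity: squared residuals should show no trend in x_t.
slope_u2_on_x = np.polyfit(x, resid**2, 1)[0]

# [4] Independence: lag-1 autocorrelation of residuals should be near 0.
r1 = np.corrcoef(resid[:-1], resid[1:])[0, 1]

# [5] t-invariance: slope estimates from the two halves of the sample
# should roughly agree if theta is constant over t.
half = n // 2
b1_first = np.polyfit(x[:half], y[:half], 1)[0]
b1_second = np.polyfit(x[half:], y[half:], 1)[0]
```

These are crude diagnostics, not the formal tests of the PR approach, but they show how each assumption becomes directly assessable once stated in terms of y_{t} and x_{t}.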
There are several advantages to specifying the model assumptions in terms of the observables y_{t} and x_{t} instead of the unobservable error term.
First, hidden or implicit assumptions now become explicit ([5]).
Second, some of the error term assumptions, such as having a zero mean, do not look nearly as innocuous when expressed as an assumption concerning the linearity of the regression function between y_{t} and x_{t} .
Third, the LRM (conditional) assumptions can be assessed indirectly from the data via the (unconditional) reduction assumptions, since:
N entails [1]-[3], I entails [4], ID entails [5].
As a first step, we partition the set of all possible models coarsely
in terms of reduction assumptions on D(Z_{1},…,Z_{n};φ):
| | LRM | Alternatives |
| --- | --- | --- |
| (D) Distribution: | Normal | non-Normal |
| (M) Dependence: | Independent | Dependent |
| (H) Heterogeneity: | Identically Distributed | non-ID |
Given the practical impossibility of probing for violations in all possible directions, the PR approach consciously considers an effective probing strategy to home in on the directions in which the primary statistical model might be potentially misspecified. Having taken us back to the joint distribution, why not get ideas by looking at y_{t} and x_{t} themselves using a variety of graphical techniques? This is what the Probabilistic Reduction (PR) approach prescribes for its diagnostic task…. Stay tuned!
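As a crude numeric stand-in for such a graphical look (a t-plot of each series), the sketch below computes each series' correlation with time; values near 1 flag trending, a mean-heterogeneity alarm. The series are simulated, and the post's secret x_{t} is not reproduced here.

```python
import numpy as np

# Numeric stand-in for a t-plot: correlation with time flags trending.
# Series are simulated for illustration.
rng = np.random.default_rng(3)
n = 35
t = np.arange(n, dtype=float)
x = 100.0 + 2.0 * t + rng.normal(0.0, 1.0, size=n)    # trending regressor
y = 167.0 + 1.9 * x + rng.normal(0.0, 2.0, size=n)    # inherits the trend

corr_x_t = np.corrcoef(x, t)[0, 1]
corr_y_t = np.corrcoef(y, t)[0, 1]
# Both near 1: a warning that a high R^2 may reflect shared time
# dependence (an ID violation) rather than a genuine law-like relation.
```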
*Rather than list scads of references, I direct the interested reader to those in Spanos.
[i] This is because, when the NIID assumptions are imposed on the joint distribution, it simplifies into a product of conditional distributions, yielding the LRM.
See follow-up parts:
PART 2: http://errorstatistics.com/2012/02/23/misspecification-testing-part-2/
PART 3: http://errorstatistics.com/2012/02/27/misspecification-testing-part-3-m-s-blog/
PART 4: http://errorstatistics.com/2012/02/28/m-s-tests-part-4-the-end-of-the-story-and-some-conclusions/
*We also have a visitor to the seminar from Hawaii, John Byrd, a forensic anthropologist and statistical osteometrician. He’s long been active on the blog. I’ll post something of his later on.