A Statistical Model as a Chance Mechanism
Aris Spanos
Jerzy Neyman (April 16, 1894 – August 5, 1981) was a Polish/American statistician[i] who spent most of his professional career at the University of California, Berkeley. Neyman is best known in statistics for his pioneering contributions in framing the Neyman-Pearson (N-P) optimal theory of hypothesis testing and for his theory of Confidence Intervals.
One of Neyman’s most remarkable, but least recognized, achievements was his adaptation of Fisher’s (1922) notion of a statistical model to render it pertinent for non-random samples. Fisher’s original parametric statistical model Mθ(x) was based on the idea of ‘a hypothetical infinite population’, chosen so as to ensure that the observed data x_0 := (x_1, x_2, …, x_n) can be viewed as a ‘truly representative sample’ from that ‘population’:
“The postulate of randomness thus resolves itself into the question, Of what population is this a random sample?” (ibid., p. 313), underscoring that “the adequacy of our choice may be tested a posteriori.” (p. 314)
In cases where the data x_0 come from sample surveys, or can be viewed as a typical realization of a random sample X := (X_1, X_2, …, X_n), i.e. Independent and Identically Distributed (IID) random variables, the ‘population’ metaphor can be helpful in adding some intuitive appeal to the inductive dimension of statistical inference, because one can imagine using a subset of a population (the sample) to draw inferences pertaining to the whole population.
This ‘infinite population’ metaphor, however, is of limited value in most applied disciplines relying on observational data. To see how inept this metaphor is, consider the question: what is the hypothetical ‘population’ when modeling the gyrations of stock market prices? More generally, what is observed in such cases is a certain ongoing process and not a fixed population from which we can select a representative sample. For that very reason, most economists in the 1930s considered Fisher’s statistical modeling irrelevant for economic data!
Due primarily to Neyman’s experience with empirical modeling in a number of applied fields, including genetics, agriculture, epidemiology, biology, astronomy and economics, his notion of a statistical model evolved beyond Fisher’s ‘infinite populations’ in the 1930s into the frequentist ‘chance mechanism’ (see Neyman, 1950, 1952):
Guessing and then verifying the ‘chance mechanism’, the repeated operation of which produces the observed frequencies. This is a problem of ‘frequentist probability theory’. Occasionally, this step is labeled ‘model building’. Naturally, the guessed chance mechanism is hypothetical. (Neyman, 1977, p. 99)
From my perspective, this was a major step forward for several reasons, including the following.
First, the notion of a statistical model as a ‘chance mechanism’ extended the intended scope of statistical modeling to include dynamic phenomena that give rise to data from non-IID samples, i.e. data that exhibit both dependence and heterogeneity, like stock prices.
Second, the notion of a statistical model as a ‘chance mechanism’ is not merely of metaphorical value; it can be operationalized in the context of a statistical model, formalized by:
Mθ(x) = {f(x;θ), θ∈Θ}, x∈R^n, Θ⊂R^m; m ≪ n,
where the distribution of the sample f(x;θ) describes the probabilistic assumptions of the statistical model. This takes the form of a statistical Generating Mechanism (GM), stemming from f(x;θ), that can be used to generate simulated data on a computer. An example of such a Statistical GM is:
X_t = α_0 + α_1X_{t-1} + σε_t,  t = 1, 2, …, n.
This indicates how one can use pseudo-random numbers for the error term ε_t ~ NIID(0,1) to simulate data for the Normal, AutoRegressive [AR(1)] Model. One can generate numerous sample realizations, say N=100,000, of sample size n in nanoseconds on a PC.
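To make this concrete, here is a minimal simulation sketch in Python with NumPy (the function name simulate_ar1 and the particular values alpha0 = 0.5, alpha1 = 0.8, sigma = 1.0 are illustrative assumptions, not taken from the post):

```python
import numpy as np

def simulate_ar1(alpha0, alpha1, sigma, n, N, seed=None):
    """Simulate N realizations of sample size n from the Normal AR(1) model
    X_t = alpha0 + alpha1*X_{t-1} + sigma*eps_t, with eps_t ~ NIID(0,1)."""
    rng = np.random.default_rng(seed)
    eps = rng.standard_normal((N, n))          # pseudo-random NIID(0,1) errors
    X = np.empty((N, n))
    X[:, 0] = alpha0 / (1.0 - alpha1)          # start each series at the stationary mean (assumes |alpha1| < 1)
    for t in range(1, n):
        X[:, t] = alpha0 + alpha1 * X[:, t - 1] + sigma * eps[:, t]
    return X

# e.g. N = 100,000 realizations of sample size n = 100 (illustrative values)
data = simulate_ar1(alpha0=0.5, alpha1=0.8, sigma=1.0, n=100, N=100_000, seed=1)
```

Each row of data is one realization x_0 of the sample X generated by the postulated chance mechanism.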
Third, the notion of a statistical model as a ‘chance mechanism’ puts a totally different spin on another metaphor widely used by uninformed critics of frequentist inference: the ‘long-run’ metaphor associated with the relevant error probabilities used to calibrate frequentist inferences. The operationalization of the statistical GM reveals that the temporal aspect of this metaphor is irrelevant for frequentist inference; remember Keynes’s catch phrase “In the long run we are all dead”? Instead, what matters in practice is repeatability in principle, not repetition over time! For instance, one can use the above statistical GM to generate the empirical sampling distribution of any test statistic, and thus render operational not only the pre-data error probabilities, such as the type I and type II errors and the power of a test, but also the post-data probabilities associated with the severity evaluation; see Mayo (1996).
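As a hedged sketch of that point, one could reuse the simulate_ar1 function above to build the empirical sampling distribution of, say, the OLS t-statistic for α_1, and read off empirical type I error and power frequencies (the null value, cutoff, and sample sizes below are illustrative assumptions):

```python
import numpy as np

def ar1_t_stats(X, alpha1_null):
    """OLS t-statistic for H0: alpha1 = alpha1_null, computed for each simulated realization."""
    N, n = X.shape
    t_stats = np.empty(N)
    for i in range(N):
        y, x = X[i, 1:], X[i, :-1]
        Z = np.column_stack([np.ones(n - 1), x])       # regressors: constant and lagged X
        beta, *_ = np.linalg.lstsq(Z, y, rcond=None)   # OLS estimates of (alpha0, alpha1)
        resid = y - Z @ beta
        s2 = resid @ resid / (n - 3)                   # residual variance: n-1 observations, 2 parameters
        se_a1 = np.sqrt(s2 * np.linalg.inv(Z.T @ Z)[1, 1])
        t_stats[i] = (beta[1] - alpha1_null) / se_a1
    return t_stats

# Empirical sampling distribution under the null (alpha1 = 0.8) -> empirical type I error at a 5% cutoff
t_null = ar1_t_stats(simulate_ar1(0.5, 0.8, 1.0, n=100, N=10_000, seed=2), alpha1_null=0.8)
print("empirical type I error:", np.mean(np.abs(t_null) > 1.96))

# Simulate under an alternative (alpha1 = 0.9), apply the same test -> empirical power
t_alt = ar1_t_stats(simulate_ar1(0.5, 0.9, 1.0, n=100, N=10_000, seed=3), alpha1_null=0.8)
print("empirical power:", np.mean(np.abs(t_alt) > 1.96))
```

The repeatability invoked here is the in-principle repeatability of the generating mechanism on the computer; nothing in the calculation requires a literal long run over time.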
HAPPY BIRTHDAY NEYMAN!
For further discussion on the above issues see:
Spanos, A. (2012), “A Frequentist Interpretation of Probability for Model-Based Inductive Inference,” in Synthese:
http://www.econ.vt.edu/faculty/2008vitas_research/Spanos/1Spanos-2011-Synthese.pdf
Fisher, R. A. (1922), “On the mathematical foundations of theoretical statistics,” Philosophical Transactions of the Royal Society A, 222: 309-368.
Mayo, D. G. (1996), Error and the Growth of Experimental Knowledge, The University of Chicago Press, Chicago.
Neyman, J. (1950), First Course in Probability and Statistics, Henry Holt, NY.
Neyman, J. (1952), Lectures and Conferences on Mathematical Statistics and Probability, 2nd ed. U.S. Department of Agriculture, Washington.
Neyman, J. (1977), “Frequentist Probability and Frequentist Statistics,” Synthese, 36, 97-131.
[i] He was born in an area that was then part of Russia.
Thank you for re-posting this. 1) A metaphysical question: “data” appears to be a more metaphysically neutral term – being surely just a number for a random variable. But does it matter what that data refers to? For instance, does the data refer to an event (in time) or an entity (extended in space), such as a real organism in a population? Or a mixture thereof? I suspect you shall say that data has been abstracted to such an extent that its ontology is beside the point – i.e. its reality is in the realm of logic, statistics or mathematics only. But surely inductive reasoning includes background knowledge on WHAT was being researched, not just numbers on the page or in the computer? 3) If I understand this correctly, Neyman focused on HOW the population was generated, unlike Fisher. Fisher’s infinite pops were, as he acknowledged, totally imaginary and outside the realm of “the real”, whereas Neyman appears to have been able to bypass this by his repeatability principle. Frankly, they both seem to be searching for some ontological constant to anchor their epistemology, but then perhaps that is cognitively and philosophically inevitable and essential. Fascinating stuff, thanks.
Lauren: We had some discussion on statistical ontology almost exactly 1 year ago:
https://errorstatistics.com/2013/04/14/does-statistics-have-an-ontology-does-it-need-one-draft-1/
Not that we precisely addressed your question. Neyman seems to me to be an arch instrumentalist when it comes to using statistics and statistical modeling. Of course one knows what one is talking about. One might speak of actual data and modelled data, but events are different. Perhaps Aris will add to this.
Lauren: your questions raise particularly interesting issues that I have discussed in other places. Let me focus on the first question. With respect to that question, it is important to distinguish between a substantive (structural) model, stemming primarily from subject matter information, and the associated statistical model, which is viewed as a particular parameterization of the stochastic process {Z_{t}, t∈N} underlying the data in question Z₀. The connection between the two comes in the form of choosing the parameterization of the latter to parametrically embed the former, i.e. render it a special case.
This particular statistical perspective formalizes statistical information (chance regularity patterns) in terms of probability theory, e.g. probabilistic assumptions pertaining to the process underlying data Z₀, and the substantive information (meaning of Z₀) is irrelevant to the purely statistical problem of validating the statistical premises. Here, there is a loose analogy with Shannon’s (1948) information theory, which is based on formalizing the informational content of a message by separating ‘regularity patterns’ in strings of ‘bits’ from any substantive ‘meaning’: “Frequently the messages have meaning … These semantic aspects of communication are irrelevant to the engineering problem.”
Returning to your question, this purely probabilistic construal of statistical models renders the traditional distinctions between cross-section, time series and panel data models irrelevant; these different forms of data can all be viewed as realizations of stochastic processes with different index sets N denoting the relevant ordering(s) of interest, that being time, space, gender, etc.
The substantive information plays a crucial role in evaluating the adequacy of the substantive model vis-a-vis the phenomenon of interest, once the statistical adequacy is secured. The latter is needed to ensure the statistical reliability of the inference procedures used to answer substantive questions of interest.
Sorry for the nitpick, but I think numbers are important.
You cannot generate 100 000 normally distributed random numbers in a nanosecond on a modern computer, not even close. Modern computers run at around 2-3 clock cycles per nanosecond. That is only time enough for the computer to perform the simplest of computations. The algorithms for creating good normal random numbers are really quite complex.
On my workstation, generating 100 000 random normals takes around 20 ms. Still plenty fast enough to do some simulation studies, but many orders of magnitude slower than a nanosecond.
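For what it's worth, here is a quick way to check such timings (assuming NumPy; the exact figure will of course depend on the machine and the generator used):

```python
import timeit
import numpy as np

rng = np.random.default_rng()
# Average time to draw 100,000 standard normals, over 1,000 repetitions.
secs = timeit.timeit(lambda: rng.standard_normal(100_000), number=1_000) / 1_000
print(f"~{secs * 1e3:.2f} ms per 100,000 normals")
```

On typical current hardware this comes out on the order of a millisecond: fast, as you say, but nowhere near a nanosecond.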