**A new joint paper….**

**“Error statistical modeling and inference: Where methodology meets ontology”**

**Aris Spanos · Deborah G. Mayo**

**Abstract:** In empirical modeling, an important desideratum for deeming theoretical entities and processes real is that they can be reproducible in a statistical sense. Current day crises regarding replicability in science intertwine with the question of how statistical methods link data to statistical and substantive theories and models. Different answers to this question have important methodological consequences for inference, which are intertwined with a contrast between the ontological commitments of the two types of models. The key to untangling them is the realization that behind every substantive model there is a statistical model that pertains exclusively to the probabilistic assumptions imposed on the data. It is not that the methodology determines whether to be a realist about entities and processes in a substantive field. It is rather that the substantive and statistical models refer to different entities and processes, and therefore call for different criteria of adequacy.

**Keywords:** Error statistics · Statistical vs. substantive models · Statistical ontology · Misspecification testing · Replicability of inference · Statistical adequacy

To read the full paper: “Error statistical modeling and inference: Where methodology meets ontology.”

The related conference.

**Reference:** Spanos, A. & Mayo, D. G. (2015). “Error statistical modeling and inference: Where methodology meets ontology.” *Synthese* (online May 13, 2015), pp. 1-23.

There is much in the paper with which I agree; there is also much with which I disagree. On the agreement side, the authors emphasize that models are approximations and that they are adequate rather than true, as in ‘statistically adequate’. On the disagreement side, their vocabulary contains words associated with a concept of truth, such as ‘actual’ as in ‘*actual* error probabilities’ and ‘wrong’ as in ‘wrong likelihood’. I have no idea what an actual error probability is unless the concept is restricted to simulations. Does the use of ‘wrong likelihood’ mean that there is some ‘correct likelihood’ and, if so, how are we to recognize it when we see it (or them)? The disagreement is about substantial matters which are reflected in the vocabulary.

The authors' concept of statistical adequacy relies on the ability to simulate data sets under the model and to compare these simulated data sets with the real data. This is to be applauded, but unfortunately the form of comparison is never made precise. Here is how it is done in ‘Data Analysis and Approximate Models’. A model is a fully specified probability measure, that is, all parameters have explicit values, as they must have if the model is to be used for simulations. The next step is to decide which features of the data set are to be replicated by the model. Suppose for the sake of argument that the model is that of i.i.d. Gaussian random variables and that the features of interest are (i) shape, as measured by the Kolmogorov distance between the empirical distribution $\mathbb{P}_n$ and the model distribution, $T_1=d_{ko}(\mathbb{P}_n, N(\mu,\sigma^2))$, and (ii) the lack of outliers, as measured by $T_2=\max_i \vert X_i-\mu\vert/\sigma$. These play the role of the authors' mis-specification tests. One now generates data under the model $N(\mu,\sigma^2)$ and calculates, say, the 0.975-quantiles of $T_1$ and $T_2$, denoted $q_1(0.975)$ and $q_2(0.975)$ respectively. Given data $x_1,\ldots,x_n$, the adequate Gaussian models are those $N(\mu,\sigma^2)$ for which $d_{ko}(\mathbb{P}_n, N(\mu,\sigma^2))\le q_1(0.975)$ and $\max_i \vert x_i-\mu\vert/\sigma\le q_2(0.975)$.
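The procedure in this paragraph can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code behind ‘Data Analysis and Approximate Models’: the function names, the number of simulations, and the example data are all hypothetical. It uses the fact that both $T_1$ and $T_2$ have the same distribution under every $N(\mu,\sigma^2)$, so the quantiles can be simulated once from $N(0,1)$.

```python
import math
import random
from statistics import NormalDist


def kolmogorov_distance(xs, mu, sigma):
    """T_1: sup |F_n - F| between the empirical cdf of xs and N(mu, sigma^2)."""
    model = NormalDist(mu, sigma)
    xs = sorted(xs)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = model.cdf(x)
        # The empirical cdf jumps from i/n to (i+1)/n at x.
        d = max(d, f - i / n, (i + 1) / n - f)
    return d


def max_standardised_deviation(xs, mu, sigma):
    """T_2: largest absolute deviation from mu in units of sigma."""
    return max(abs(x - mu) for x in xs) / sigma


def simulated_quantiles(n, level=0.975, n_sim=2000, seed=0):
    """Monte Carlo quantiles of T_1 and T_2 for samples of size n.

    Both statistics are location-scale invariant under the model,
    so simulating from N(0, 1) is enough.
    """
    rng = random.Random(seed)
    t1, t2 = [], []
    for _ in range(n_sim):
        z = [rng.gauss(0.0, 1.0) for _ in range(n)]
        t1.append(kolmogorov_distance(z, 0.0, 1.0))
        t2.append(max_standardised_deviation(z, 0.0, 1.0))
    k = min(n_sim - 1, math.ceil(level * n_sim) - 1)
    return sorted(t1)[k], sorted(t2)[k]


def is_adequate(xs, mu, sigma, q1, q2):
    """N(mu, sigma^2) is adequate if both features are within their quantiles."""
    return (kolmogorov_distance(xs, mu, sigma) <= q1
            and max_standardised_deviation(xs, mu, sigma) <= q2)


# Usage sketch with made-up data (not taken from the paper or the book).
data = [-1.2, -0.4, 0.1, 0.3, 0.8, 1.5, -0.7, 0.0, 0.6, -0.2]
q1, q2 = simulated_quantiles(len(data))
print(is_adequate(data, mu=0.0, sigma=1.0, q1=q1, q2=q2))
```

Running `is_adequate` over a grid of $(\mu,\sigma)$ values would trace out the set of adequate Gaussian models for the data.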

Note that this concept of adequacy specifies the parameter values. Maximum likelihood has nothing to add, although one can include the behaviour of the mean and standard deviation in the features to be replicated. One can also define mis-specification tests without specifying parameters. Thus $T_1$ can be replaced by $T_3=\inf_{\mu,\sigma} d_{ko}(\mathbb{P}_n, N(\mu,\sigma^2))$ (with $\mathbb{P}_n$ the empirical distribution of the data) and $T_2$ by $T_4=\max_i \vert x_i-\mathrm{mean}(x)\vert/\mathrm{sd}(x)$, where $\mathrm{mean}(x)$ and $\mathrm{sd}(x)$ are the mean and standard deviation of the data. Now a model can be declared adequate without specifying any parameter values. This leaves the statistician free to use, say, maximum likelihood in the interests of efficiency and severity. This can, however, go completely wrong, as the resulting maximum likelihood estimate can produce parameter values for which the resulting model is an arbitrarily poor approximation to the data.
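A rough sketch of the two parameter-free statistics follows. The infimum in $T_3$ has no closed form, so the hypothetical helper `t3_grid` approximates it with a shrinking grid search started at the sample mean and standard deviation; what it returns is therefore an upper bound on $T_3$, not its exact value. The function names, the grid settings, and the use of the sample standard deviation with divisor $n-1$ are all assumptions made for illustration.

```python
import math
from statistics import NormalDist, fmean, stdev


def kolmogorov_distance(xs, mu, sigma):
    """sup |F_n - F| between the empirical cdf of xs and N(mu, sigma^2)."""
    model = NormalDist(mu, sigma)
    xs = sorted(xs)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = model.cdf(x)
        d = max(d, f - i / n, (i + 1) / n - f)
    return d


def t3_grid(xs, steps=21, rounds=4):
    """Approximate T_3 = inf over (mu, sigma) of the Kolmogorov distance.

    Returns (distance, mu, sigma) for the best grid point found.
    The grid always contains its own centre, so the result never
    exceeds the distance at (mean, sd).
    """
    m, s = fmean(xs), stdev(xs)
    best = (kolmogorov_distance(xs, m, s), m, s)
    half_mu, half_log = s, math.log(2.0)
    for _ in range(rounds):
        _, c_mu, c_sigma = best
        for i in range(steps):
            mu = c_mu + half_mu * (2 * i / (steps - 1) - 1)
            for j in range(steps):
                # Search sigma on a multiplicative scale so it stays positive.
                sigma = c_sigma * math.exp(half_log * (2 * j / (steps - 1) - 1))
                best = min(best, (kolmogorov_distance(xs, mu, sigma), mu, sigma))
        half_mu /= 4
        half_log /= 4
    return best


def t4(xs):
    """T_4: largest absolute deviation from the mean in units of sd."""
    m, s = fmean(xs), stdev(xs)
    return max(abs(x - m) for x in xs) / s
```

Because neither statistic depends on unknown parameters, its quantiles under the Gaussian model can again be simulated from $N(0,1)$ samples of the same size.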

Finally, a comment on severity. Suppose the model is $N(\mu,\sigma^2)$ and the null hypothesis is $H_0: \mu=0$. Presumably a severe test will be based on the mean of the sample. However, the careful statistician decides first to check the adequacy of the model using some mis-specification tests. The data pass the mis-specification tests and the null hypothesis is accepted. Suppose we now consider all symmetric location/scale models which pass the mis-specification tests and then use maximum likelihood to define a severe test of $H_0$. It turns out that the test using the Gaussian model is the least severe of all the tests. The moral is that severity depends not only on the data but also on the model, and that severity can be imported from the model. Tukey calls this a free lunch. In mathematical terms, testing $H_0$ is an ill-posed problem if the model can also be chosen. The problem needs regularizing, and one way of doing this is to use minimum Fisher information models. The Gaussian model is one such. The test based on the mean is the severest test using the least severe model, that is, the model which does not introduce spurious severity. I miss a discussion of this problem in the paper.
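The role of Fisher information can be made concrete with a small calculation. The sketch below compares three symmetric location-scale families scaled to the same standard deviation. Matching the variance is a simplifying assumption of this illustration; the comparison described above is among models that pass the mis-specification tests. The choice of families and the function names are likewise hypothetical. Among distributions with a given variance, the Gaussian has the smallest Fisher information for location, so it gives the largest asymptotic standard error for the maximum likelihood estimate of $\mu$.

```python
import math


def location_fisher_information(family, sigma):
    """Fisher information for location, per observation, for a symmetric
    family scaled to have standard deviation sigma (closed forms)."""
    if family == "gaussian":
        return 1.0 / sigma**2
    if family == "logistic":
        s = sigma * math.sqrt(3.0) / math.pi  # variance = s^2 * pi^2 / 3
        return 1.0 / (3.0 * s**2)
    if family == "laplace":
        b = sigma / math.sqrt(2.0)            # variance = 2 * b^2
        return 1.0 / b**2
    raise ValueError(f"unknown family: {family}")


def asymptotic_se(family, sigma, n):
    """Asymptotic standard error of the location MLE: 1 / sqrt(n * I)."""
    return 1.0 / math.sqrt(n * location_fisher_information(family, sigma))


# Usage sketch: same sigma and n, different model, different standard error.
for name in ("gaussian", "logistic", "laplace"):
    print(name, asymptotic_se(name, sigma=1.0, n=100))
```

With the data held fixed, moving from the Gaussian to the Laplace model shrinks the nominal standard error by a factor of $1/\sqrt{2}$, which is one way to read the remark that severity can be imported from the model.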