While immersed in our fast-paced, remote, NISS debate (October 15) with J. Berger and D. Trafimow, I didn’t immediately catch all that was said by my co-debaters (I will shortly post a transcript). We had all opted for no practice. But looking over the transcript, I was surprised that David Trafimow was indeed saying the answer to the question in my title is yes. Here are some excerpts from his remarks:
See, it’s tantamount to impossible that the model is correct, which means that the model is wrong. And so what you’re in essence doing then, is you’re using the P-value to index evidence against a model that is already known to be wrong. …But the point is the model was wrong. And so there’s no point in indexing evidence against it. So given that, I don’t really see that there’s any use for them. …
I’ll make a more general comment, which is that since since the model is wrong, in the sense of not being exactly correct, whenever you reject it, you haven’t learned anything. And in the case where you fail to reject it, you’ve made a mistake. So the worst, so the best possible cases you haven’t learned anything, the worst possible cases is you’re wrong…
Now, Deborah, again made the point that you need procedures for testing discrepancies from the null hypothesis, but I will repeat that …P-values don’t give you that. P-values are about discrepancies from the model…
But P-values are not about discrepancies from the model (in which a null or test hypothesis is embedded). If they were, you might say, as he does, that a correct test should always yield a small P-value, so long as the model isn't exactly correct, and that failing to find one is a mistake. But this is wrong, and in need of clarification. In fact, if violations of the model assumptions prevent computing a legitimate P-value, then its value is not really "about" anything.
Three main points:
1. It's very important to see that the statistical significance test is not testing whether the overall model is wrong, and it is not indexing evidence against the model. It is only testing the null hypothesis (or test hypothesis) H0. It is an essential part of the definition of a test statistic T that its distribution be known, at least approximately, under H0. Cox has discussed this for over 40 years; I'll refer first to a recent, and then an early, paper.
Suppose that we study a system with haphazard variation and are interested in a hypothesis, H, about the system. We find a test quantity, a function t(y) of data y, such that if H holds, t(y) can be regarded as the observed value of a random variable t(Y) having a distribution under H that is known numerically to an adequate approximation, either by mathematical theory or by computer simulation. Often the distribution of t(Y) is known also under plausible alternatives to H, but this is not necessary. It is enough that the larger the value of t(y), the stronger the pointer against H.
The basis of a significance test is an ordering of the points in [a sample space] in order of increasing inconsistency with H0, in the respect under study. Equivalently there is a function t = t(y) of the observations, called a test statistic, and such that the larger is t(y), the stronger is the inconsistency of y with H0, in the respect under study. The corresponding random variable is denoted by T. To complete the formulation of a significance test, we need to be able to compute, at least approximately,
p(y_obs) = p_obs = Pr(T > t_obs; H0),     (1)
called the observed level of significance.
…To formulate a test, we therefore need to define a suitable function t(.), or rather the associated ordering of the sample points. Essential requirements are that (a) the ordering is scientifically meaningful, (b) it is possible to evaluate, at least approximately, the probability (1).
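Cox's formulation can be made concrete with a minimal simulation sketch (my own illustration, not from either paper; the data and the choice of statistic are invented): the distribution of T under H0 is approximated by computer simulation, exactly the route Cox mentions, and the P-value (1) is the fraction of simulated values of T at least as large as the observed t(y).

```python
import random
import statistics

# Illustrative sketch: H0 says the data are i.i.d. N(0, 1); the test
# statistic t(y) = |sample mean| orders samples by increasing
# inconsistency with H0, in the respect under study.

def t_stat(y):
    return abs(statistics.fmean(y))

def p_value(y_obs, n_sims=10_000, seed=0):
    """Approximate p = Pr(T >= t_obs; H0) by simulating T under H0."""
    rng = random.Random(seed)
    t_obs = t_stat(y_obs)
    n = len(y_obs)
    exceed = sum(
        t_stat([rng.gauss(0, 1) for _ in range(n)]) >= t_obs
        for _ in range(n_sims)
    )
    return exceed / n_sims

y = [0.9, 1.4, 0.3, 1.1, 0.7, 1.2, 0.5, 1.0, 0.8, 1.3]  # made-up data
p = p_value(y)  # small here: a mean of 0.92 is far from what H0 leads us to expect
```

Note that nothing in the computation asks whether the N(0, 1) model is "true" of the world; it only asks how inconsistent the observed t(y) is with what that model implies.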
To suppose, as Trafimow plainly does, that we can never commit a Type 1 error in statistical significance testing because the underlying model “is not exactly correct” is a serious misinterpretation. The statistical significance test only tests one null hypothesis at a time. It is piecemeal. If it’s testing, say, the mean of a Normal distribution, it’s not also testing the underlying assumptions of the Normal model (Normal, IID). Those assumptions are tested separately, and the error statistical methodology offers systematic ways for doing so, with yet more statistical significance tests [see point 3].
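The piecemeal character of the testing can be seen in a small sketch (an invented example, using only Python's standard library): the primary test bears only on H0 about the mean, while a separate secondary statistic probes the IID assumption. Neither test is a test of "the model" wholesale.

```python
import math
import random
import statistics

rng = random.Random(1)
y = [rng.gauss(0.8, 1.0) for _ in range(50)]  # simulated data

# Primary test: approximate z-test of H0: mu = 0 (one-sided).
n = len(y)
mean = statistics.fmean(y)
se = statistics.stdev(y) / math.sqrt(n)
z = mean / se
p_primary = 0.5 * math.erfc(z / math.sqrt(2))  # upper-tail normal prob.

# Secondary test: lag-1 sample autocorrelation as a crude check of the
# IID assumption; under independence r1 is roughly N(0, 1/n).
centered = [v - mean for v in y]
r1 = sum(a * b for a, b in zip(centered, centered[1:])) / sum(c * c for c in centered)
z_iid = r1 * math.sqrt(n)
p_secondary = math.erfc(abs(z_iid) / math.sqrt(2))  # two-sided
```

A small `p_primary` indexes inconsistency with H0: mu = 0, not with the Normal IID model; that model is interrogated separately by `p_secondary` (and, in practice, by a battery of further checks).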
2. Moreover, although the model assumptions must be met adequately in order for the P-value to serve as a test of H0, it isn't required that we have an exactly correct model, merely that the reported error probabilities are close to the actual ones. As I say in Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (2018) (several excerpts of which can be found on this blog):
Statistical models are at best approximations of aspects of the data-generating process. Reasserting this fact is not informative about the case at hand. These models work because they need only capture rather coarse properties of the phenomena: the error probabilities of the test method are approximately and conservatively related to actual ones. …Far from wanting true (or even “truer”) models, we need models whose deliberate falsity enables finding things out. (p. 300)
Nor do P-values “track” violated assumptions; such violations can lead to computing an incorrectly high, or an incorrectly low, P-value.
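A small simulation brings this out (my illustration, with invented parameters): when the IID assumption fails, the nominal P-value is miscalibrated. Here positively autocorrelated data, analyzed as if independent, reject a true H0: mu = 0 far more often than the nominal 5% of the time.

```python
import math
import random
import statistics

def z_test_p(y):
    """Two-sided P-value treating the data as IID (normal approximation)."""
    n = len(y)
    z = statistics.fmean(y) / (statistics.stdev(y) / math.sqrt(n))
    return math.erfc(abs(z) / math.sqrt(2))

rng = random.Random(3)
rejections = 0
trials = 500
for _ in range(trials):
    # AR(1) series with mean 0: H0 is true, but the draws are dependent.
    y, prev = [], 0.0
    for _ in range(30):
        prev = 0.7 * prev + rng.gauss(0, 1)
        y.append(prev)
    if z_test_p(y) < 0.05:
        rejections += 1

actual_rate = rejections / trials  # well above the nominal 0.05
```

The violated assumption does not announce itself through the P-value; it silently inflates (or, with negative dependence, deflates) it.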
And what about cases where we know ahead of time that a hypothesis H0 is strictly false?—I’m talking about the hypothesis here, not the underlying model. (Examples would be with a point null, or one asserting “there’s no Higgs boson”.) Knowing a hypothesis H0 is false is not yet to falsify it. That is, we are not warranted in inferring we have evidence of a genuine effect or discrepancy from H0, and we still don’t know in which way it is flawed.
3. What is of interest in testing H0 with a statistical significance test is whether there is a systematic discrepancy or inconsistency with H0: one that is not readily accounted for by background variability, chance, or "noise" (as modelled). We don't need, or even want, a model that fully represents the phenomenon (whatever that would mean). In "design-based" tests, we look to experimental procedures within our control, as with randomisation.
the simple precaution of randomisation will suffice to guarantee the validity of the test of significance, by which the result of the experiment is to be judged. (Fisher 1935, 21)
We look to RCTs quite often these days to test the benefits (and harms) of vaccines for Covid-19. Researchers observe differences in the number of Covid-19 cases in two randomly assigned groups, vaccinated and unvaccinated. We know there is ordinary variability in contracting Covid-19; it might be that, just by chance, more people who would have remained Covid-free even without the vaccine happen to be assigned to the vaccination group. The random assignment allows determining the probability that an even larger difference in Covid-19 rates would be observed even if H0 is true: that the two groups have the same chance of avoiding Covid-19. (I'm describing things extremely roughly; a much more realistic account of randomisation is given in several guest posts by Senn (e.g., blogpost).) Unless this probability is small, it would not be correct to reject H0 and infer there is evidence the vaccine is effective. Yet Trafimow, if we take him seriously, is saying it would always be correct to reject H0, and that failing to reject it is a mistake. I hope no one is seriously suggesting we should always infer there's evidence a vaccine or other treatment works, but I don't know how else to understand his position. It is a dangerous and wrong view, of which vaccine researchers, fortunately, are not guilty.
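The design-based reasoning can be sketched as a randomization (permutation) test; the case counts below are invented for illustration, not real trial data. Randomization itself licenses the reference distribution: we shuffle the group labels and ask how often a difference in case rates at least as large as the observed one would arise if the vaccine made no difference.

```python
import random

rng = random.Random(0)
vaccinated = [1] * 8 + [0] * 992      # 8 Covid cases among 1000 (hypothetical)
unvaccinated = [1] * 80 + [0] * 920   # 80 cases among 1000 (hypothetical)

obs_diff = (sum(unvaccinated) / len(unvaccinated)
            - sum(vaccinated) / len(vaccinated))

pooled = vaccinated + unvaccinated
n_v = len(vaccinated)
count = 0
n_perm = 2000
for _ in range(n_perm):
    # Under H0, the group labels are arbitrary: reshuffle and recompute.
    rng.shuffle(pooled)
    diff = (sum(pooled[n_v:]) / (len(pooled) - n_v)
            - sum(pooled[:n_v]) / n_v)
    if diff >= obs_diff:
        count += 1

p_perm = (count + 1) / (n_perm + 1)  # add-one so p is never exactly 0
```

Only because this probability comes out small are we warranted in rejecting H0; had the case counts been similar in the two groups, the same procedure would rightly withhold that inference.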
When we don’t have design-based assumptions, we may check the model-based assumptions by means of tests that are secondary in relation to the primary test. The trick is to get them to be independent of the unknowns in the primary test, and there are systematic ways to achieve this.
We now turn to a complementary use of these ideas, namely to test the adequacy of a given model, what is also sometimes called model criticism… It is necessary, if we are to parallel the previous argument, to find a statistic whose distribution is exactly or very nearly independent of the unknown parameter μ. An important way of doing this is by appeal to the second property of sufficient statistics, namely that after conditioning on their observed value the remaining data have a fixed distribution. (2006, p. 33)
“In principle, the information in the data is split into two parts, one to assess the unknown parameters of interest and the other for model criticism” (Cox 2006, p. 198). If the model is appropriate then the conditional distribution of Y given the value of the sufficient statistic s is known, so it serves to assess if the model is violated. The key is often to look at residuals: the difference between each observed outcome and what is expected under the model. The full data are remodelled to ask a different question. [i]
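Here is a minimal residual-based sketch of the idea (my own example, not Cox's, with invented data): for a Normal model the sample mean is sufficient for μ, so the residuals y_i − ȳ have a distribution free of the unknown μ, and they can be used for model criticism without touching the primary question about μ.

```python
import random
import statistics

rng = random.Random(2)
y = [rng.gauss(5.0, 1.0) for _ in range(40)]  # simulated data

ybar = statistics.fmean(y)
s = statistics.stdev(y)
residuals = [(v - ybar) / s for v in y]  # standardized residuals; free of mu

# One crude criticism statistic: the largest standardized residual.
# For roughly N(0,1) residuals with n = 40 we expect max |r| around 2-3;
# values far beyond that point to a violation (e.g. heavy tails, outliers).
max_abs = max(abs(r) for r in residuals)
```

The split is exactly as Cox describes: ȳ carries the information about the parameter of interest, while the residual pattern, whose distribution is fixed if the model holds, carries the information used to criticize the model.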
In testing assumptions, the null hypothesis is generally that the assumption(s) hold approximately. Again, even when we know this secondary null is strictly false, we want to learn in what way, and use the test to pinpoint improved models to try. (These new models must be separately tested.) [ii]
The essence of the reasoning can be made out entirely informally. Think of how the 1919 Eddington eclipse tests probed departures from the Newtonian predicted light deflection. They tested the Newtonian "half deflection" H0: μ ≤ 0.87, vs H1: μ > 0.87, which includes the Einstein value of 1.75. These primary tests relied upon sufficient accuracy in the telescopes to get a usable standard error for the star positions during the eclipse, and 6 months before (SIST, Excursion 3 Tour I). In one set of plates, which some thought supported Newton, this necessary assumption was falsified using a secondary test. Relying only on known star positions and the detailed data, it was clear that the sun's heat had systematically distorted the telescope mirror. No assumption about general relativity was required.
If I update this, I will indicate with (i), (ii), etc.
I invite your comments and/or guest posts on this topic.
NOTE: Links to the full papers/book are given in this post, so you might want to check them out.
[i] See Spanos 2010 (pp. 322-323) from Error & Inference. (This is his commentary on Cox and Mayo in the same volume.) Also relevant: Mayo and Spanos 2011 (pp. 193-194).
[ii] It's important to see that other methods, error statistical or Bayesian, also rely on models. A central asset of simple significance tests, on which Bayesians will concur, is their apt role in testing assumptions.
Part of the problem here is that people, even the great David Cox, explain p-values assuming a simple hypothesis about the distribution of some statistic. That is, they assume you've reduced the data to some many-to-one function thereof, and they suppose, moreover, that the null hypothesis exactly or approximately fixes the distribution of that statistic. But most null hypotheses are extremely composite; even after we've chosen our test statistic, "the p-value" actually corresponds to a "best case" (seen from the point of view of the null hypothesis). So non-mathematicians (and especially many philosophers) get confused. Of course David knows all this, and no doubt, if his comments were seen in context, one would see that he is talking about a special case.