While immersed in our fast-paced, remote, NISS debate (October 15) with J. Berger and D. Trafimow, I didn’t immediately catch all that was said by my co-debaters (I will shortly post a transcript). We had all opted for no practice. But looking over the transcript, I was surprised that David Trafimow was indeed saying the answer to the question in my title is yes. Here are some excerpts from his remarks:

Trafimow 8:44

See, it’s tantamount to impossible that the model is correct, which means that the model is wrong. And so what you’re in essence doing then, is you’re using the P-value to index evidence against a model that is already known to be wrong. …But the point is the model was wrong. And so there’s no point in indexing evidence against it. So given that, I don’t really see that there’s any use for them. …

Trafimow 18:27

I’ll make a more general comment, which is that since since the model is wrong, in the sense of not being exactly correct, whenever you reject it, you haven’t learned anything. And in the case where you fail to reject it, you’ve made a mistake. So the worst, so the best possible cases you haven’t learned anything, the worst possible cases is you’re wrong…

Trafimow 37:54

Now, Deborah, again made the point that you need procedures for testing discrepancies from the null hypothesis, but I will repeat that …P-values don’t give you that. P-values are about discrepancies from the model…

But P-values are not about discrepancies from the model (in which a null or test hypothesis is embedded). If they were, you might say, as he does, that you should properly always find small P-values, so long as the model isn’t exactly correct. If you don’t, he says, you’re making a mistake. But this is wrong, and is in need of clarification. In fact, if violations of the model assumptions prevent computing a legitimate P-value, then its value is not really “about” anything.

Three main points:

[1] It’s very important to see that the statistical significance test is not testing whether the overall model is wrong, and it is not indexing evidence against the model. It is only testing the null hypothesis (or test hypothesis) *H*_{0}. It is an essential part of the definition of a test statistic T that its distribution be known, at least approximately, under *H*_{0}. Cox has discussed this for over 40 years; I’ll refer first to a recent, and then an early paper.

Suppose that we study a system with haphazard variation and are interested in a hypothesis, H, about the system. We find a test quantity, a function t(y) of data y, such that if H holds, t(y) can be regarded as the observed value of a random variable t(Y) having a distribution under H that is known numerically to an adequate approximation, either by mathematical theory or by computer simulation. Often the distribution of t(Y) is known also under plausible alternatives to H, but this is not necessary. It is enough that the larger the value of t(y), the stronger the pointer against H.

The basis of a significance test is an ordering of the points in [a sample space] in order of increasing inconsistency with H_{0}, in the respect under study. Equivalently there is a function t = t(y) of the observations, called a test statistic, and such that the larger is t(y), the stronger is the inconsistency of y with H_{0}, in the respect under study. The corresponding random variable is denoted by T. To complete the formulation of a significance test, we need to be able to compute, at least approximately,

p(y_{obs}) = p_{obs} = pr(T > t_{obs}; H_{0}), (1)

called the observed level of significance.

…To formulate a test, we therefore need to define a suitable function t(.), or rather the associated ordering of the sample points. Essential requirements are that (a) the ordering is scientifically meaningful, (b) it is possible to evaluate, at least approximately, the probability (1).
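To make requirement (b) concrete, here is a rough sketch in Python (an invented toy example, not from Cox's papers): the distribution of t(Y) under H_{0} is obtained by computer simulation, and the observed level of significance (1) is the simulated proportion of t values exceeding t_obs.

```python
import numpy as np

rng = np.random.default_rng(0)

def p_value_by_simulation(t_obs, draw_t_under_h0, n_sim=20_000):
    """Approximate (1): p_obs = pr(T > t_obs; H0), by Monte Carlo.
    Cox's requirement (b) asks only that this probability be
    evaluable approximately, e.g. by computer simulation."""
    t_sim = np.array([draw_t_under_h0() for _ in range(n_sim)])
    return float(np.mean(t_sim > t_obs))

# Toy case: under H0, Y_1..Y_10 are N(0, 1); the test statistic
# orders samples by |mean(y)| (larger = stronger pointer against H0).
def draw_t():
    return abs(rng.normal(0.0, 1.0, size=10).mean())

y = rng.normal(0.9, 1.0, size=10)   # data actually generated off the null
print(p_value_by_simulation(abs(y.mean()), draw_t))
```

The point of the sketch is only that the test statistic's null distribution must be computable; nothing about the rest of the model is being tested by this probability.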

To suppose, as Trafimow plainly does, that we can never commit a Type 1 error in statistical significance testing because the underlying model “is not exactly correct” is a serious misinterpretation. The statistical significance test only tests one null hypothesis at a time. It is piecemeal. If it’s testing, say, the mean of a Normal distribution, it’s not also testing the underlying assumptions of the Normal model (Normal, IID). Those assumptions are tested separately, and the error statistical methodology offers systematic ways for doing so, with yet more statistical significance tests [see point 3].

[2] Moreover, although the model assumptions must be met adequately in order for the P-value to serve as a test of *H*_{0}, it isn’t required that we have an exactly correct model, merely that the reported error probabilities are close to the actual ones. As I say in *Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars* (2018) (several excerpts of which can be found on this blog):

Statistical models are at best approximations of aspects of the data-generating process. Reasserting this fact is not informative about the case at hand. These models work because they need only capture rather coarse properties of the phenomena: the error probabilities of the test method are approximately and conservatively related to actual ones. …Far from wanting true (or even “truer”) models, we need models whose deliberate falsity enables finding things out. (p. 300)

Nor do P-values “track” violated assumptions; such violations can lead to computing an incorrectly high, or an incorrectly low, P-value.
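This can be seen in a small simulation (the dependence structures below are invented for illustration). A z-test P-value computed as if the data were IID is roughly calibrated when the assumption holds, comes out too small under positive dependence, and too large under negative dependence:

```python
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(1)

def nominal_p(x):
    """Two-sided z-test p-value computed AS IF x were n IID draws
    from N(mu, 1); this is the 'reported' p-value."""
    z = abs(x.mean()) * sqrt(len(x))
    return 1 - erf(z / sqrt(2))

n = 20

def iid():          # assumptions actually hold
    return rng.normal(0.0, 1.0, n)

def pos_dep():      # a shared shock induces positive dependence
    return rng.normal(0.0, 1.0, n) + rng.normal(0.0, 1.0)

def neg_dep():      # antithetic pairs induce negative dependence
    e = rng.normal(0.0, 1.0, n // 2)
    return np.concatenate((e, -e))

def rejection_rate(sample, n_rep=4000, alpha=0.05):
    return float(np.mean([nominal_p(sample()) < alpha for _ in range(n_rep)]))

# Nominal 5% test: roughly 5% under IID, far too many small p-values
# under positive dependence, far too few under negative dependence.
print(rejection_rate(iid), rejection_rate(pos_dep), rejection_rate(neg_dep))
```

So a violated assumption does not announce itself through a small P-value; it can push the computed value in either direction.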

And what about cases where we know ahead of time that a hypothesis *H*_{0} is strictly false?—I’m talking about the hypothesis here, not the underlying model. (Examples would be a point null, or one asserting “there’s no Higgs boson”.) Knowing a hypothesis *H*_{0} is false is *not* yet to falsify it. That is, we are not warranted in inferring we have evidence of a genuine effect or discrepancy from *H*_{0}, and we still don’t know in *which way* it is flawed.

[3] What is of interest in testing *H*_{0} with a statistical significance test is whether there is a *systematic* discrepancy or inconsistency with *H*_{0}—one that is not readily accounted for by background variability, chance, or “noise” (as modelled). We don’t need, or even want, a model that fully represents the phenomenon—whatever that would mean. In “design-based” tests, we look to experimental procedures, within our control, as with randomisation.

Fisher:

the simple precaution of randomisation will suffice to guarantee the validity of the test of significance, by which the result of the experiment is to be judged. (Fisher 1935, 21)

We look to RCTs quite often these days to test the benefits (and harms) of vaccines for Covid-19. Researchers observe differences in the number of Covid-19 cases in two randomly assigned groups, vaccinated and unvaccinated. We know there is ordinary variability in contracting Covid-19; it might be that, just by chance, more people who would have remained Covid-free, even without the vaccine, happen to be assigned to the vaccination group. The random assignment allows determining the probability that an even larger difference in Covid-19 rates would be observed even if *H*_{0} is true: that the two groups have the same chance of avoiding Covid-19. (I’m describing things extremely roughly; a much more realistic account of randomisation is given by several guest posts by Senn (e.g., blogpost).) Unless this probability is small, it would not be correct to reject *H*_{0} and infer that there is evidence the vaccine is effective. Yet Trafimow, if we take him seriously, is saying it would always be correct to reject *H*_{0}, and that to fail to reject it is to make a mistake. I hope that no one’s seriously suggesting that we should always infer there’s evidence a vaccine or other treatment works. But I don’t know how else to understand the position that it’s always correct to reject *H*_{0}, and that to fail to reject it is to make a mistake. This is a dangerous and wrong view, which fortunately vaccine researchers are not guilty of.
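The rough description above can be sketched as a small re-randomization computation (the case counts are made up for illustration; a real trial analysis is far more careful, as Senn's posts explain):

```python
import numpy as np

rng = np.random.default_rng(2)

# Made-up illustrative counts, not real trial data:
# 10 Covid cases among 10,000 vaccinated; 90 among 10,000 unvaccinated.
vaccinated   = np.array([1]*10 + [0]*9990)
unvaccinated = np.array([1]*90 + [0]*9910)

observed_diff = unvaccinated.mean() - vaccinated.mean()

pooled = np.concatenate((vaccinated, unvaccinated))
n = len(vaccinated)

def rerandomized_diff():
    """Reassign group labels at random, mimicking the trial's
    random assignment under H0 (vaccine makes no difference)."""
    shuffled = rng.permutation(pooled)
    return shuffled[n:].mean() - shuffled[:n].mean()

n_sim = 2000
p = float(np.mean([rerandomized_diff() >= observed_diff
                   for _ in range(n_sim)]))
print(p)   # probability of a difference at least this large under H0
```

Here the probability calculation rests on the physical act of random assignment, not on a distributional model of who contracts Covid-19.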

When we don’t have design-based assumptions, we may check the model-based assumptions by means of tests that are secondary in relation to the primary test. The trick is to get them to be independent of the unknowns in the primary test, and there are systematic ways to achieve this.

Cox 2006:

We now turn to a complementary use of these ideas, namely to test the adequacy of a given model, what is also sometimes called model criticism… It is necessary if we are to parallel the previous argument to find a statistic whose distribution is exactly or very nearly independent of the unknown parameter μ. An important way of doing this is by appeal to the second property of sufficient statistics, namely that after conditioning on their observed value the remaining data have a fixed distribution. (2006, p. 33)

“In principle, the information in the data is split into two parts, one to assess the unknown parameters of interest and the other for model criticism” (Cox 2006, p. 198). If the model is appropriate then the conditional distribution of Y given the value of the sufficient statistic s is known, so it serves to assess if the model is violated. The key is often to look at residuals: the difference between each observed outcome and what is expected under the model. The full data are remodelled to ask a different question. [i]
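For the simplest case, Cox's split can be sketched as follows (a toy N(mu, 1) example, not taken from Cox): the sample mean carries the information about mu, while the residuals have a mu-free distribution and so are available for model criticism, here via a simple kurtosis check of Normality:

```python
import numpy as np

rng = np.random.default_rng(3)

def split_for_criticism(x):
    """Split the data: the sample mean (sufficient for mu under the
    N(mu, 1) model) estimates the parameter of interest; the residuals,
    whose distribution is free of mu, are left over for model criticism."""
    xbar = x.mean()
    return xbar, x - xbar

# The residuals behave the same whatever mu generated the data:
_, r_a = split_for_criticism(rng.normal(0.0,   1.0, 200))
_, r_b = split_for_criticism(rng.normal(100.0, 1.0, 200))

# A criticism statistic computed from residuals alone probes
# Normality without knowing mu; sample kurtosis is near 3 for Normal data.
def kurtosis(r):
    return float(np.mean(r**4) / np.mean(r**2)**2)

print(kurtosis(r_a), kurtosis(r_b))
```

The secondary test thus asks its question of the residuals, which are deliberately insulated from the unknown parameter of the primary question.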

In testing assumptions, the null hypothesis is generally that the assumption(s) hold approximately. Again, even when we know this secondary null is strictly false, we want to learn in what way, and use the test to pinpoint improved models to try. (These new models must be separately tested.) [ii]

The essence of the reasoning can be made out entirely informally. Think of how the 1919 Eddington eclipse tests probed departures from the Newtonian predicted light deflection. They tested the Newtonian “half deflection” *H*_{0}: μ ≤ 0.87, vs *H*_{1}: μ > 0.87, which includes the Einstein value of 1.75. These primary tests relied upon sufficient accuracy in the telescopes to get a usable standard error for the star positions during the eclipse, and six months before (SIST, Excursion 3 Tour I). In one set of plates, which some thought supported Newton, this necessary assumption was falsified using a secondary test. Relying only on known star positions and the detailed data, it was clear that the sun’s heat had systematically distorted the telescope mirror. No assumption about general relativity was required.

If I update this, I will indicate with (i), (ii), etc.

*I invite your comments and/or guest posts on this topic.*

**NOTE**: Links to the full papers/book are given in this post, so you might want to check them out.

[i] See Spanos 2010 (pp. 322-323) from *Error & Inference. *(This is his commentary on Cox and Mayo in the same volume.) Also relevant Mayo and Spanos 2011 (pp. 193-194).

[ii] It’s important to see that other methods, error statistical or Bayesian, rely on models. A central asset of simple significance tests, as Bayesians will concur, is their apt role in testing assumptions.

Part of the problem here is that people – even the great David Cox – explain p-values assuming a simple hypothesis about the distribution of some statistic, i.e., they assume you’ve reduced the data to some many-to-one function thereof, and they suppose moreover that the null hypothesis exactly or approximately fixes the distribution of the statistic. But most null hypotheses are extremely composite; even after we’ve chosen our test statistic, “the p-value” actually corresponds to a “best case” (seen from the point of view of the null hypothesis). So non-mathematicians (and especially, many philosophers) get confused. Of course David knows all this, and no doubt, if his comments were seen in context, one would know that he is talking about a special case.

[once posted but not approved I can’t correct typos etc]

Richard: Composite because of nuisance parameters? I’m not sure what it means to view the P-value as the “best case”. As for seeing his comments in context, I supply the link to the entire paper and even the entire 2006 book, so you can check.

It means: calculate the supremum over all distributions allowed by the null, of the right tail probability. Usually the supremum is achieved. That can be seen as the best case. If the best case probability is still very small, that discredits the null.

Can you explain your point by example using the Eddington 1919 experiment? (I do not see your point…)

John: Are you referring to my mention of the eclipse test? I was merely saying that in testing that there was a distortion of the mirror, invalidating the estimate of error needed to test the primary GTR hypotheses, they did not have to assume GTR. They tested the telescope by means independent of GTR. But maybe this is not what you’re asking about, or maybe the question is for Richard.

Please be more explicit. I can explain my point with numerous concrete examples. The simplest is the case of testing the null hypothesis mu is less than or equal to zero given n independent observations from the normal distribution with mean mu and variance one. The test statistic is the sample average times root n, which is N(root n times mu, 1) distributed. Its distribution is not fixed by the null hypothesis.
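A numerical sketch of this example may help (assuming, as the comment describes, that T = root-n times the sample average, which is N(root-n times mu, 1) distributed): the tail probability pr(T > t; mu) increases with mu, so the supremum over the composite null mu <= 0 is attained at the boundary mu = 0.

```python
import numpy as np
from math import erf, sqrt

def norm_sf(z):
    """pr(Z > z) for standard normal Z."""
    return 0.5 * (1 - erf(z / sqrt(2)))

def tail_prob(t, mu, n):
    """pr(T > t; mu) where T = sqrt(n) * (sample mean), so that
    T is N(sqrt(n) * mu, 1) distributed."""
    return norm_sf(t - sqrt(n) * mu)

t_obs, n = 2.1, 25

# Over the composite null mu <= 0 the tail probability grows with mu,
# so its supremum is attained at the boundary mu = 0:
mus = np.linspace(-2.0, 0.0, 201)
probs = [tail_prob(t_obs, mu, n) for mu in mus]
p_value = max(probs)
print(p_value, tail_prob(t_obs, 0.0, n))   # equal: the sup is at mu = 0
```

If even this best-case tail probability is very small, the whole composite null is discredited.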

Richard: The null is embedded in the Normal model (with its NIID assumptions). They are to be separately tested. But I guess you’re addressing Byrd.

Yes. I am trying to find out what Byrd meant. The statistical treatment of the eclipse data is something one can easily write a book about.

Sorry, I fell into a black hole with work and did not see your response. I was asking how the test using the eclipse data is a “composite”, and how the p-value is a best case scenario in this example?

John, please tell me exactly which test (using the eclipse data) you are referring to. Or whose treatment of that example you were thinking of.

PS I will check the paper and book. Context is important!

You take up something that was in my mind, too, when I heard Trafimow.

Two questions:

(a) “Nor do P-values “track” violated assumptions; such violations can lead to computing an incorrectly high, or an incorrectly low, P-value.” What would in that case be a “correct” p-value, to which this seems to refer?

(b) ” “In principle, the information in the data is split into two parts, one to assess the unknown parameters of interest and the other for model criticism” (Cox 2006, p. 198). If the model is appropriate then the conditional distribution of Y given the value of the sufficient statistic s is known, so it serves to assess if the model is violated.” Isn’t the possibility to do this itself part of the model assumption and would have to be tested? In other words, I suspect that testing model assumptions in this way, even where it can be done, will always be incomplete.

Christian:

You ask: “What would in that case be a “correct” p-value, to which this seems to refer?” Indeed, as I say, if the assumptions are violated the P-values aren’t “about” anything. My point was just to deny that you should expect a low P-value whenever the model isn’t exactly true. Someone might distort things to arrange for a high or moderate P-value. An example where we might speak of the “correct” P-value being different from the reported one might be when the violated assumptions are due to multiple testing, selective reporting or the like.
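The multiple-testing case can be illustrated with a small simulation (an invented selective-reporting toy, not an analysis of real data): if only the smallest of k P-values is reported, the reported value no longer carries the relevant error probability.

```python
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(4)

def two_sided_p(z):
    """p-value of a single two-sided z-test."""
    return 1 - erf(abs(z) / sqrt(2))

def best_of_k(k=20):
    """Run k independent z-tests on pure noise (every null true)
    and report only the smallest p-value, as selective reporting does."""
    return min(two_sided_p(z) for z in rng.normal(0.0, 1.0, k))

n_rep = 5000
reported = np.array([best_of_k() for _ in range(n_rep)])
# The reported p dips below 0.05 in about 1 - 0.95**20 of experiments,
# roughly 64%, even though every null hypothesis is true:
print(float(np.mean(reported < 0.05)))
```

In that sense one might speak of the reported P-value differing from the "correct" one: the selection has destroyed the reported value's error-probability interpretation.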

On p. 141 of Cox’s 2006 book, Principles of Statistical Inference, which this blogpost links to, he says:

An important possibility is that the data under analysis are derived from a probability density g(.) that is not a member of the family f(y; θ ) originally chosen to specify the model. Note that since all models are idealizations the empirical content of this possibility is that the data may be seriously inconsistent with the assumed model and that although a different model is to be preferred, it is fruitful to examine the consequences for the fitting of the original family.

Sorry to be late to this discussion. It is nice to see the role of the statistical model explicitly discussed.

My contribution to the discussion might seem facetious or trivial, but it is given with serious intent: type I errors are a Neyman–Pearsonian construct and they go with the “hypothesis test” method that yields a decision to reject, or not, the null hypothesis. The method dichotomises the results on the basis of a pre-specified threshold (yes, the problematical <0.05).

In contrast, "significance tests" yield a p-value that is an evaluation of the evidence in the data according to the model against the null hypothesis value of the parameter(s). It does not entail a decision to reject the null hypothesis, but only generates an index of the strength of the evidence against it. One might make a decision regarding the null hypothesis on the basis of that evidential evaluation, but one really should consider other information at the same time while making that decision and therefore any erroneous decision will not belong solely to the significance test. Significance tests do not by themselves yield type I (or type II) errors.

If any reader is uncertain about this, or if they disbelieve me, then I would direct them to an extensive literature on the topic. Maybe start here: https://stats.stackexchange.com/questions/16218/what-is-the-difference-between-testing-of-hypothesis-and-test-of-significance/16227#16227 and then move on to my larger work on the topic, A Reckless Guide to P-values: Local Evidence, Global Errors, available in full here: https://link.springer.com/chapter/10.1007%2F164_2019_286

Hi Michael:

Whatever philosophical, professional or personality differences there were between Fisher and Neyman, they don’t affect the issue here. Our debate was on P-values and Trafimow brought up Type I errors, and it was fine for him to do so. My post was addressing his remarks. I think it’s high time we recognized the mathematics of the accounts for what they are and stopped slavishly repeating a characterization of N-P as ‘behavioristic’ and Fisher as ‘evidential’. That is step #1 for getting beyond the statistics wars. Fisher justified his stat sig tests by appealing to performance of the method, and N-P were just firming up the logic of his formulation. They are mathematically nearly identical, and the N-P tester, from the start (including N and P), reports the attained P-values. Here’s Lehmann:

What enables the reader to use reported P-values to apply their own threshold is that P-values also report the corresponding Type 1 error probability (associated with erroneously interpreting data as evidence against Ho). After the break-up between Fisher and Neyman, due to jealousies and the fact that Neyman insisted on repeating the flaw in Fisher’s fiducial intervals, Fisher changed lines of his books and started talking as if he always construed P-values evidentially. I talk a great deal about this in Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (2018, CUP). I would direct the reader, especially if she is bound to let her current philosophy of statistics depend on a popular historical tale, to read about the actual history. Excursion 3 of SIST is linked here: https://errorstatistics.com/2018/12/11/its-the-methods-stupid-excerpt-from-excursion-3-tour-ii-mayo-2018-cup/sa

It’s not what Fisher wrote and it’s not what Neyman and Pearson wrote. It’s how the methods work and what their results mean and we can work that out for ourselves. Trafimow’s understanding of statistics might be limited to the mythical and dysfunctional hybrid “NHST”, but mine is not and yours should not be.

Here is the nub: evidence is not all-or-none and so even if the all-or-none reject/do not reject response of the hypothesis test method might align with evidence on average in the long run, it does not correspond to the evidence in the actual particular data in question. Thus even though one might like to pretend that dichotomising hypothesis tests are equivalent in a practical sense to non-dichotomising significance tests, they are not.

You probably do not think that I’m being fair to your philosophy, but you wrote “associated with erroneously interpreting data as evidence against Ho” and that is a direct invitation to assume evidence as all-or-none. There is almost always _some_ evidence against the null hypothesis, it’s just that the evidence is often insufficient, too weak, or unconvincing to support a decision to reject that hypothesis. (Yes, one might write that in terms of opinion or belief for the Bayesians.)

P.S. The first edition of Lehmann’s book did NOT include the sentence suggesting that p-values be specified. It is probably not a core part of his (and Neyman’s) method.

Michael:

“There is almost always _some_ evidence against the null hypothesis”. Then there’s also almost always evidence against the alternative in an N-P test.

True. But I struggle to see why you would mention it. I am arguing against all-or-none thinking about evidence, and against careless phrasing that implies all-or-none thinking about evidence.

Michael: It’s not what Lehmann wrote, it’s how the methods work and we can figure this out for ourselves. The P-value lets people choose their maximum tolerable type 1 error probability because it is the error probability associated with a test having that significance level.

I happened across something in Cox 2006 (Principles of Statistical Inference) that I had never noticed before. I tweeted it, and here’s a link to the tweet. https://twitter.com/learnfromerror/status/1335665994317099008?s=20

I point out the lack of any mention of the existence of p-values in the first edition of Lehmann’s book simply because the number of times that you have mentioned its presence in a later edition has reached a threshold for action.

Sure, a p-value does appear to do what you say, and the argument that a p-value should be provided even where an analysis leads to a decision using a fixed threshold is valid as far as it goes. However, the people who are usually in the position to offer the richest and most valuable evaluation of the overall strength of evidence regarding the experimental hypothesis are the people who did the experiment. An assumption that the reader can perform his or her own evaluation of that evidence on the basis of just a p-value will often be mistaken. (Note that I’m not here talking of the statistical evidence, but the evidence regarding the non-statistical experimental hypothesis. That evidence includes more than just a p-value unless the experimenters are doing the first relevant evaluation of anything related to the hypothesis in question.)

How do the methods work? I have given advice to scientists regarding the extension of inference from the statistical realm into the realm of hypotheses regarding the real world in my book chapter A Reckless Guide to P-values: Local Evidence, Global Errors, available in full here: https://link.springer.com/chapter/10.1007%2F164_2019_286.

Mayo distorted my position and made a straw person argument in her blog. Although she quoted me accurately, she then mischaracterized the quoted material. She said: “Yet Trafimow, if we take him seriously, is saying it would always be correct to reject H0, and that to fail to reject it is to make a mistake.”

But that is not my position. If you re-read the quoted material, you will see that my emphasis is on the null model, not the null hypothesis. The null model includes the null hypothesis but lots of additional assumptions too, not all of which are right. My position is that the null model is always wrong, not that the null hypothesis is always wrong (especially not range null hypotheses). Thus, we should always reject the null model, regardless of what value we get for p; but I do NOT say we should always reject the null hypothesis. Rather, I say that the p-value we get is insufficient for providing a sound basis for deciding about the null hypothesis. And this, of course, is because of the added assumptions in the model, not all of which are right.

Aside from the issue of distorting my argument, I think the disagreement comes down to a very simple sentence that I quote directly from Deborah’s blog. She described a significance test as follows: “It is only testing the null hypothesis (or test hypothesis) H0.” Of course, this disagrees with my assertion that it is the whole model that is being tested.

At this point, I have a very simple challenge for anyone reading this. If you believe that significance tests concern the null hypothesis, without adding any assumptions to it (e.g., random selection from the population), you should agree with Mayo. But if you think there are added assumptions too, then you should disagree with Mayo that significance testing is only testing the null hypothesis.

David Trafimow

On David Trafimow’s comment:

I thank David for his comment. But I haven’t misconstrued him at all, and reading the quoted remarks from him shows this. Perhaps Trafimow is confusing “presupposing” with testing. As I explain in my remarks, applying a statistical significance test presupposes that certain assumptions hold sufficiently well to attain an approximate (and conservative) assessment of any of the quantities for testing or estimation, be they P-values, standard errors or confidence levels. However, the statistical significance test of Ho is not testing those assumptions. An analogy I give in my post is that the test of General Relativity with the eclipse experiment is not testing whether the telescope mirror has been bent by the sun by more than such and such—even though it assumes it hasn’t been bent by an amount that precludes the needed estimate of error. A P-value is like a score on a test. A report of the score on a math test, say, does not test whether the grader scored some students in a biased fashion, or whether the students cheated in some way. But insofar as there was bias in scoring or cheating, then the alleged test fails to be testing the math ability intended (even though it would, provided the test is taken and graded properly).

The above is my main point.

We may be able, and usually are able, to test the telescope mirror distortion (e.g., using positions of known stars—as they did) using full details of the data, but doing so would be a distinct test. The score from the initial test, call it the primary test, is not tracking the threats from improperly working instruments, biased design, or cheating. We cannot expect these violations to result in a high or low score. Were they to lack sufficiently accurate instruments for the purpose, obviously, they would not be able to learn from the eclipse result about the Einstein deflection effect. But this is not an indictment of the statistical significance test. Trafimow asserts we can learn nothing from a statistical significance test because it’s within a model that isn’t exactly true, and that’s wrong. Were it so, it would also follow that we can learn nothing from a confidence level, likelihood ratio, or other model- or design-based statistical methods.

All of this is in my post. Read points [1 – 3].

And finally, to repeat something else in my post, even when we’re doing a (secondary) test of the model assumptions, and so here the hypotheses concern those assumptions, e.g., that the telescope mirrors are working properly, it is just wrong to suppose we always have evidence of systematic violations of them—even if no mirror is perfectly smooth. We can give the utterly uninteresting and useless (“all flesh is grass”*) remark that “something’s wrong somewhere”—in the sense that the model isn’t a perfect replica of the actual phenomenon—but this is not to have falsified a model.

*SIST Excursion 4 Tour I

As I say in the post, in testing assumptions the null hypothesis is generally that the assumption(s) hold approximately. Again, even when we know this secondary null is strictly false, we want to learn in what way, and use the test to pinpoint improved models to try. (These new models must be separately tested.)

Moreover, as noted on p. 141 of Cox’s 2006 book, Principles of Statistical Inference, to which this blogpost links, even with serious inconsistency between model and data, we can learn a lot from fitting a model known to be wrong:

An important possibility is that the data under analysis are derived from a probability density g(.) that is not a member of the family f(y; θ ) originally chosen to specify the model. Note that since all models are idealizations the empirical content of this possibility is that the data may be seriously inconsistent with the assumed model and that although a different model is to be preferred, it is fruitful to examine the consequences for the fitting of the original family.

But let’s not confuse these last points about testing model assumptions with the main issue: namely, whether the statistical significance test of a hypothesis within a model is testing the assumptions of the model. Even presupposing is not testing.

I agree with Mayo’s points here. The telescope and maths test score examples seem quite clear and relevant to me.

I would like to know how David’s idea that “we should always reject the null model” differs from Box’s famous dictum that all models are wrong. And I would like to know if David concurs with the phrase that usually follows: but some models are useful.

I also agree with David’s key point: “the p-value we get is insufficient for providing a sound basis for deciding about the null hypothesis”. The p-value from a significance test does not force a rejection of the null hypothesis and it should be weighed with several other important factors before a decision regarding that hypothesis is contemplated. (Note that I am talking about a significance test, not a dichotomous hypothesis test.) The usefulness of the model is one of those factors.

The idea that one should (or can) test departures from model assumptions statistically using the same dataset is problematical to me. The most important flaws are probably (i) random sample and, where relevant, (ii) independent values. Both of those would be readily ‘tested’ by the data collector reporting fully the circumstances of the collection and reflecting on the nature of the data. They are not readily ‘tested’ by analysis using models that assume them away.

Michael:

I’m glad that you agree with me about the main point here.

As for your second point, it should go without saying that:

“The p-value from a significance test does not force a rejection of the null hypothesis and it should be weighed with several other important factors before a decision regarding that hypothesis is contemplated.” No formal statistical method forces anything. And even N-P understood “reject/do not reject” as purely “picturesque” labels to arrive at methods with certain optimality results.

As you would know from SIST, I describe the result of a stat sig test as indicating a discrepancy or inconsistency with a null. The move from “indicating” to “evidence” requires checking assumptions. The formulation in Cox and Mayo (2006) is similar.

As for using the “same” data to check assumptions about the data generating mechanism or data model, I am referring to the entire panoply of data, which has to be remodeled to ask different questions. So if I want to know whether the telescope was working to get the data to be used in testing a deflection effect, the full panoply of data x1,x2, …xn recordings, weather, and photos would be the place to look and check for systematic measurement errors. Of course there is background data and knowledge—like the known positions of stars, what the sun’s heat does to telescope mirrors, etc.

It was interesting to hear the mention of “null model” next to “null hypothesis”. That really puts us into a Neyman-Pearson situation. H0 and H1 are two disjoint sets of probability distributions of the data. Define their union to be H. Then H is the null model, I would rather call it ‘the background assumptions’. E.g. everyone agrees we have an iid sample. That is H. H0 is “the observations are normally distributed”. H1 “they aren’t”.

It strikes me that goodness of fit represents a challenge to Trafimow’s arguments even without the logical extension in this post. That is, I take his broader argument as being

“The model (your assumptions?) will always be wrong in some respect. This can be ok for estimation because a model that is close to true will give estimates that are close to true, but by using a test to dichotomize you lose continuity and a decision will no longer necessarily be close.”

(As an aside, I’d interpret Trafimow’s contention as being more along the lines of: “It’s not that we would always reject H0, it’s that when the model is not correct, the test is no longer meaningful. To paraphrase Hofstadter, the answer is neither ‘yes’ nor ‘no’ but ‘mu’.”)

(As a second aside, I’m not quite sure why it’s illegitimate to look at p-values as being continuous in model space, or why evidence can’t be approximated, but I think this is part of Mayo’s point.)

But I think we run into a problem with goodness of fit. Even if one is only obtaining estimates, these estimates are still taken under model assumptions that you likely want to check. And here a no-hypothesis-test proponent runs into their own issues:

1. Goodness of fit implies an intrinsically dichotomous decision: do I keep the current model/hypothesis/estimate/whatever, or decide that I need to reformulate my analysis? One presumably wants some form of evidentiary assessment to decide this.

2. To the extent that GoF implies a larger model space, this is often sufficiently complex that you can’t readily produce intelligible confidence intervals in the model space. You can summarize discrepancy by a test statistic, but this then maps 1:1 onto a p-value.

3. Besides having their own assumptions that could be wrong, GoF tests are frequently derived from asymptotic arguments, meaning that they are approximate for the current data, even assuming the model from which they are derived.

And yet, I suspect that Trafimow would still want to use them, or something very much like them.

A couple of observations here:

1. The issue of approximate inference in the form of asymptotic arguments hasn’t been given much attention in these debates, despite such arguments dominating the mathematical statistics literature. They add a few complications:

a. They usually provide a way to loosen assumptions — one need not have exactly normal data to conduct a valid z-test.

b. They require exactly the sort of “this is approximately correct” arguments above.

c. They imply challenges for reporting (to how many decimal places do I report an approximate p-value? One of Bin Yu’s bugbears) and for combining p-values in things like false discovery rates (I think Flori Bunea has some results on this).

2. I think looking at less-idealized models can be helpful sometimes. I appreciate the ways in which looking at a test in a N(mu,1) model can clarify questions, but it can also miss important aspects. The discussion above was sparked in my mind by thinking about testing the adequacy of a k-factor model in multivariate analysis. I like this example because it can be thought of either as goodness of fit or as its own primary hypothesis, depending on the study in question.

3. Goodness of fit also provides a challenge to the “confidence intervals, not p-values” crowd. What sort of confidence intervals do I produce for a complex hypothesis, such as testing the k-factor model above? It’s often not feasible to produce an intelligible contrast for estimates, but the p-value assessment of evidence still contains meaning.

I still think that the fact that there are added assumptions, with the model including the null hypothesis and the added assumptions, is not being given sufficiently full consideration. Mayo argues that the researcher is not testing the added assumptions, but that dodges the fact that the assumptions are still there, and that the p-value still depends on them. (And the dependence is still there whether you call them assumptions, presuppositions, or whatever else.)

As a simple example, suppose the researcher wishes to test the null hypothesis that 50% of the US favors the Democratic party over the Republican party, against the alternative hypothesis of more than 50% Democratic-leaning people. One of the added assumptions is random selection from the population. Now, I will agree that the researcher might not be interested in testing the assumption of random selection; but the assumption is still there regardless of whether the researcher cares about it or not. To see this, suppose that the researcher obtains a sample biased in the direction of the Democratic party and gets p = .00000001 in the direction favoring the alternative hypothesis of more Democratic-leaning people. What does this mean? The interpretation that fails to recognize that there is a model here, with added assumptions, would be that the data strongly militate against the null hypothesis. But that is a poor interpretation. A better interpretation is that the data strongly militate against the model, including the assumption of random selection. Thus, even if the researcher is not interested in the issue of random selection, and is merely interested in rejecting the null hypothesis, THERE IS NO ESCAPING THE ADDED ASSUMPTIONS!
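The polling scenario above is easy to simulate. The sketch below is my own illustration (the specific bias mechanism, where Democratic leaners are twice as likely to be sampled, is an assumption for the example): the population is exactly 50/50, so the null hypothesis is true, yet the violated random-selection assumption drives the computed p-value toward zero.

```python
import math
import random

random.seed(0)

n = 1000

# Population: exactly 50% favor the Democratic party, so H0 is true.
# Selection bias (hypothetical): Democratic leaners are twice as likely
# to be sampled, so each sampled person leans Democratic with prob 2/3.
k = sum(1 for _ in range(n) if random.random() < 2 / 3)
p_hat = k / n

# One-sided z-test of H0: p = 0.5 against H1: p > 0.5, computed
# as if the sample were a random draw from the population.
z = (p_hat - 0.5) / math.sqrt(0.5 * 0.5 / n)
p_value = 0.5 * math.erfc(z / math.sqrt(2))  # upper normal tail

print(round(p_hat, 3), p_value)  # p_value is astronomically small
```

The tiny p-value here indicts the random-sampling assumption, not the (true) null hypothesis, which is exactly the point of the example.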

What are the consequences of the foregoing? I’ll address consequences with respect to two contexts: (1) using a p-value as a source of info and (2) the typical use of a p-value.

1. Suppose a researcher obtains a p-value, recognizes that the model is almost certainly wrong, and realizes that she would be unsound if she were to take the p-value at face value as indicating the state of evidence against the null hypothesis. Rather, she recognizes that the p-value indicates evidence against the model, and that is all. Here is the problem our researcher faces. With respect to the model, since she already knows it is wrong, there is not much gained by indexing evidence against it. And with respect to the hypothesis, the p-value doesn’t say much because the hypothesis is embedded in a known wrong model.

She might decide to be “reasonable,” and say something like: “Well, I can at least use the p-value as added information to be considered in light of all the other information I have.” But exactly how is she going to do this?

More than that, p-values confound sample effect sizes with sample sizes. Unconfounding would mean considering the sample effect size and sample size separately. That is, is the sample effect size, if a good estimate of the population effect size, such that it matters for the researcher’s substantive goals? And is the sample sufficiently large, and appropriately collected, to engender confidence that the sample effect size really is a good estimate of the population effect size? In the present example, where the sampling is plainly biased, the answer to the second question would be in the negative; but it is possible to imagine examples where the researcher might have reason to believe that the sample effect size is a good, though not perfect, estimate of the population effect size. In this latter case, however, it would be rather silly to go through the exercise of rejecting the null hypothesis, as the researcher could simply say that she has a good estimate of the population effect size and present that estimate. The example illustrates that under conditions where the sample effect size is not a good estimate of the population effect size, taking the p-value seriously is a bad idea; and under conditions where the sample effect size is a good estimate of the population effect size, it is better to simply use the sample effect size as an estimate of the population effect size.
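The confounding of effect size with sample size can be made concrete. In this sketch (my own illustration, using a one-sample z-test with known unit variance) the observed standardized effect size is held fixed while only n changes; the p-value swings from clearly non-significant to vanishingly small.

```python
import math

def z_test_pvalue(d, n):
    """Two-sided p-value for a one-sample z-test with sigma = 1,
    where d is the observed standardized sample effect size."""
    z = d * math.sqrt(n)
    return math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail

d = 0.2  # the identical sample effect size in every case
for n in (10, 100, 1000):
    print(n, z_test_pvalue(d, n))
# The same d = 0.2 gives p > 0.5 at n = 10 but p < 1e-9 at n = 1000.
```

Reporting d and n separately, as suggested above, disentangles what the single p-value number conflates.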

The obvious rejoinder to the foregoing is to say: “Wait a minute Trafimow, you said that the model is wrong, so we should never take the sample effect size as a good estimate of the population effect size!” But here is where the difference between a p-value and a sample mean, sample proportion, and so on, really comes into play. Researchers don’t use p-values to estimate population values whereas they do use sample means to estimate population means, sample proportions to estimate population proportions, and so on. And here is where the Box and Draper quotation about models always being wrong but sometimes being useful can be implemented. That is, in the case of estimating a population mean, population proportion, and so on; if the model is “good enough for government work,” the researcher might be justified in assuming that the sample statistic is a good enough estimate of the corresponding population parameter. But going back to p-values, the Box and Draper quotation is misapplied. The p-value obtained from a sample is not being used to estimate a corresponding population parameter! And in that sense, the p-value is very different from a sample mean, sample proportion, and so on.

In summary, if the argument is that a p-value simply be used as an additional bit of info, in combination with all the other bits of info, my response would be that it doesn’t add much. The researcher who has other kinds of info, such as the sample effect size and sample size, doesn’t need the p-value. Moreover, given the plethora of p-value abuses that both p-value fans and p-value detractors have moaned about, I think it is necessary to demonstrate clearly where p-values provide ADDED BENEFIT over other information researchers typically have available to them, to justify their use. Finally, everything I said above under 1 is under the unrealistic assumption that researchers are actually thinking and not just using an automatic threshold, as is typical. Let me address typical NHST under 2 below.

2. If one is going to have an automatic rejection threshold, then the first thing we need to do is distinguish what we are rejecting: Are we rejecting the hypothesis or the model? If we are rejecting the hypothesis, that is simply unsound because the p-value is based on the model and not only on the null hypothesis. And if we are rejecting the model, that is an exercise in futility because we already know the model is wrong. Worse yet, if we fail to reject the model, we are committing an error. So, the best-case scenario is not learning anything, and the worst-case scenario is being wrong. Expected utility is negative!

Moreover, because p-values are based on samples, there is much randomness, thereby resulting in the so-called “dance of the p-values.” Of course, sample effect sizes also vary from sample to sample, and when the researcher gets lucky with a large sample effect size, there is a better chance of getting a p-value below threshold. With publishing standards strongly in favor of p-values below threshold, it should be clear that published findings feature strongly inflated effect sizes. Nor is this regression-to-the-mean phenomenon only an esoteric mathematical point. The Open Science Collaboration replicated about 100 studies published in top psychology journals and reported sample effect sizes and p-values. With respect to p-values, non-replication was more prevalent than replication (well over 60% non-replication). More important for my present point, the mean effect size in the replication cohort of studies was less than half that in the original cohort of studies. In one sense, this finding might be considered trivial as merely confirming what regression to the mean says should happen. In another sense, it is important because it is an empirical demonstration of just how strong effect size inflation is in the literature. A strong disadvantage of the usual procedure, with an automatic cutoff, is that it leads to dramatic effect size inflation.
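The inflation mechanism described here can be demonstrated in a few lines. This is my own sketch, with assumed numbers (true standardized effect 0.2, studies of n = 50, a one-sided p < .05 publication filter): only studies that happen to draw a large sample effect clear the threshold, so the mean "published" effect size substantially overstates the truth.

```python
import math
import random

random.seed(1)

true_effect = 0.2   # assumed true standardized population effect
n = 50              # assumed per-study sample size
n_studies = 10000

published = []
for _ in range(n_studies):
    # Observed standardized effect varies around the truth with
    # standard error 1/sqrt(n): the "dance" across samples.
    obs = random.gauss(true_effect, 1 / math.sqrt(n))
    z = obs * math.sqrt(n)
    p = 0.5 * math.erfc(z / math.sqrt(2))  # one-sided p-value
    if p < 0.05:                           # publication threshold
        published.append(obs)

mean_published = sum(published) / len(published)
# The published mean is well above the true effect of 0.2,
# because only samples with obs > 1.645/sqrt(50) ever pass.
print(round(mean_published, 3))
```

This is pure regression to the mean plus a selection filter; no questionable research practices are needed to produce the inflation.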

Because of all the foregoing, it would be better to jettison p-values and find something better, and there are better options, but that is another story.

“find something better”. I think that you are after something that can only be had by those with a god’s eye view of the world.

We already have likelihood functions that are better than p-values with respect to evaluation of the evidence in the data. They are better in that they display the relative favouring of all of the possible ‘hypotheses’ in the model (the parameter values of the model), but they are related one-to-one to p-values and they depend on the same statistical model. They will not be any better according to the criteria implied by your criticism of p-values.
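The one-to-one relationship claimed here can be exhibited directly in the simplest case. In this sketch (my own illustration, for a normal mean with known sigma), the maximum likelihood ratio against the null is exp(z²/2), a strictly monotone function of the z statistic and hence of the two-sided p-value: both are deterministic functions of the same statistic under the same model.

```python
import math

def p_two_sided(z):
    """Two-sided p-value for a z statistic under the normal model."""
    return math.erfc(abs(z) / math.sqrt(2))

def max_likelihood_ratio(z):
    """L(xbar) / L(mu0) for a normal mean with known sigma: the
    likelihood at the best-supported parameter value relative to
    the null value. It equals exp(z^2 / 2)."""
    return math.exp(z * z / 2)

# As z grows, the p-value falls and the likelihood ratio rises in
# lockstep: each determines the other given the model.
for z in (0.5, 1.0, 1.96, 3.0):
    print(z, p_two_sided(z), max_likelihood_ratio(z))
```

So any criticism that turns on the model being wrong applies to the likelihood function exactly as it applies to the p-value, which is the point of the comment above.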