We constantly hear that procedures of inference are inescapably subjective because of the latitude of human judgment as it bears on the collection, modeling, and interpretation of data. But this is seriously equivocal: Being the product of a human subject is hardly the same as being subjective, at least not in the sense we are speaking of—that is, as a threat to objective knowledge. Are all these arguments about the allegedly inevitable subjectivity of statistical methodology rooted in equivocations? I argue that they are!

Insofar as humans conduct science and draw inferences, it is obvious that human judgments and human measurements are involved. True enough, but too trivial an observation to help us distinguish among the different ways judgments should enter, and how, nevertheless, to avoid introducing bias and unwarranted inferences. The issue is not that a human is doing the measuring, but whether we can reliably use the thing being measured to find out about the world.

Remember the dirty-hands argument? In the early days of this blog (e.g., October 13, 16), I deliberately took up this argument as it arises in evidence-based policy because it offered a certain clarity that I knew we would need to come back to in considering general “arguments from discretion”. To abbreviate:

- Numerous human judgments go into specifying experiments, tests, and models.
- Because there is latitude and discretion in these specifications, they are “subjective.”
- Whether data are taken as evidence for a statistical hypothesis or model depends on these subjective methodological choices.
- Therefore, statistical inference and modeling are invariably subjective, if only in part.

We can spot the fallacy in the argument much as we did in the dirty hands argument about evidence-based policy. It is true, for example, that a very insensitive test for detecting a positive discrepancy d’ from a 0 null has a low probability of finding statistical significance even if a discrepancy as large as d’ exists. But that doesn’t prevent us from determining, objectively, that an insignificant difference from that test fails to warrant inferring evidence of a discrepancy less than d’.
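To make the point about an insensitive test concrete, here is a minimal sketch of a power calculation. It assumes a one-sided z-test of a 0 null with known standard deviation, which is only one simple setting in which the argument applies; the function name and the numbers (d’ = 0.2, sigma = 1, n = 10 or 400) are hypothetical choices for illustration.

```python
from math import sqrt
from statistics import NormalDist


def power_one_sided_z(d, sigma, n, alpha=0.05):
    """Probability that a one-sided z-test of H0: mu <= 0 rejects
    when the true mean is d (sigma assumed known)."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha)   # rejection cutoff on the z scale
    shift = d * sqrt(n) / sigma              # mean of the z statistic when mu = d
    return 1 - std_normal.cdf(z_crit - shift)


# Hypothetical illustration: the same discrepancy, two sample sizes.
low_power = power_one_sided_z(0.2, 1.0, n=10)
high_power = power_one_sided_z(0.2, 1.0, n=400)
```

With these assumed numbers the small-sample test detects d’ only about 16% of the time, while the large-sample test detects it about 99% of the time. However the sample size came to be chosen, the power itself is a computable fact about the test, and that is what lets us say a non-significant result from the first test is poor grounds for ruling out a discrepancy of d’.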

Test specifications may well be a matter of personal interest and bias, but, given the choices made, whether or not an inference is warranted is not a matter of personal interest and desire. Setting up a test with low power against d’ might be a product of your desire not to find an effect for economic reasons, of insufficient funds to collect a larger sample, or of the inadvertent choice of a bureaucrat. Or ethical concerns may have entered. But none of this precludes our critical evaluation of what the resulting data do and do not indicate (about the question of interest). The critical task need not itself be a matter of economics, ethics, or what have you. Critical scrutiny of evidence reflects an interest all right—an interest in not being misled, an interest in finding out what the case is, and others of an epistemic nature.

Objectivity in statistical inference, and in science more generally, is a matter of being able to critically evaluate the warrant of any claim. This, in turn, is a matter of evaluating the extent to which we have avoided or controlled those specific flaws that could render the claim incorrect. If the inferential account cannot discern any flaws, performs the task poorly, or denies there can ever be errors, then it fails as an objective method of obtaining knowledge.

Consider a parallel with the problem of objectively interpreting observations: observations are always relative to the particular instrument or observation scheme employed. But we are often aware not only of the fact that observation schemes influence what we observe but also of how they influence observations and how much noise they are likely to produce so as to subtract them out. Hence, objective learning from observation is not a matter of getting free of arbitrary choices of instrument, but a matter of critically evaluating the extent of their influence to get at the underlying phenomenon.

For a similar analogy, the fact that my weight shows up as k pounds reflects the convention (in the United States) of using the pound as a unit of measurement on a particular type of scale. But given the convention of using this scale, whether or not my weight shows up as k pounds *is* a matter of how much I weigh!*

Likewise, the result of a statistical test is only partly determined by the specification of the tests (e.g., when a result counts as statistically significant); it is also determined by the underlying scientific phenomenon, at least as modeled. What enables objective learning to take place is the possibility of devising means for recognizing and effectively “subtracting out” the influence of test specifications, in order to learn about the underlying phenomenon, as modeled.

Focusing just on statistical inference, we can distinguish between an objective statistical inference, and an objective statistical method of inference. *A specific statistical inference is objectively warranted, if it has passed a severe test; a statistical method is objective by being able to evaluate and control (at least approximately) the error probabilities needed for a severity appraisal*. This also requires the method to communicate the information needed to conduct the error statistical evaluation (or report it as problematic).

It should be kept in mind that we are after the dual aims of severity and informativeness. Merely stating tautologies is to state objectively true claims, but they are not informative. But, it is vital to have a notion of objectivity, and we should stop feeling that we have to say, well there are objective and subjective elements in all methods; we cannot avoid dirty hands in discretionary choices of specification, so all inference methods do about as well when it comes to the criteria of objectivity. They do not.

*Which, in turn, is a matter of my having overeaten in London.

To the above discussion, let me add an “apparent” crucial difference between the Bayesian and frequentist perspectives as it relates to the specification of the likelihood function L(θ;z₀) and the associated statistical model Mθ(z).

According to Kadane (2011):

“… likelihoods are just as subjective as priors, and there is no reason to expect scientists to agree on them in the context of an applied problem.” (p. 445)

From the frequentist perspective likelihoods are defined by the probabilistic assumptions comprising the statistical model Mθ(z) in question, like [1]-[5] for the Linear Regression model in table 1; see Intro to Misspecification Testing: Ordering From A Full Diagnostic Menu (part 1), Posted on February 22, 2012 by Mayo.

In light of that, there is nothing subjective or arbitrary about the choice of a statistical model Mθ(z) or the associated likelihood function, and its choice is not based on any agreement amongst scientists. The validity of these assumptions is independently testable vis-a-vis data z₀, using thorough Mis-Specification (M-S) testing.

Kadane, J. B. (2011), Principles of Uncertainty, Chapman & Hall, NY.

Aris: Thanks for the Kadane reference, I was thinking specifically of Box and, at times at least (e.g., his paper in the RMM volume), Gelman. But I was also trying to point out something that is overlooked, and perhaps it is hard to explain clearly.

In an attempt to do so, I wrote “The issue is not that a human is doing the measuring, but whether we can reliably use the thing being measured to find out about the world.”

So let us imagine there was a perfect way to measure a person’s real and true degrees of belief in a hypothesis (maybe with some neuropsych development), while with frequentist statistical models, we grope our way and at most obtain statistically adequate representations of aspects of the data generating mechanism producing the relevant phenomenon. In the former, the measurement is 100% reliable, but the question that remains is the relevance of the thing being measured for finding out about the world. People seem utterly to overlook this.

I hope you are writing your comments in sight of the Parthenon or the like. Mayo

I like this entry a lot, although as you know I have my doubts about the concept of objectivity. I think that you explained your view here clearly and this is good food for thought.

Now let’s say we are in a situation in which the data, through M-S tests, will rule out neither a Normal nor a t_5 distribution (more precisely, a location-scale model based on a t_5), say, because the amount of data available just doesn’t allow one to distinguish between these two (as it often doesn’t).

Would you agree that it is a subjective choice whether further analyses are based on a Normal or on a t_5 distribution (despite the fact that after choosing one of them what follows may be called “objective”)? If not, please explain.

Of course in many cases inferences based on a Normal and a t_5 distribution will yield the same results in terms of interpretation, but one can always find distributions (albeit often quite messy ones) for which this is not the case and which still are not in detectable disagreement with the data.

The problem is with the hypothetical scenario: one has data which can be viewed as a realization of an IID process whose distribution can be either Normal or Student’s t with 5 df (degrees of freedom) but one cannot decide using M-S testing.

A simple, back of the envelope, calculation shows that if the distribution is Normal the kurtosis will be 3 (three), but if it’s Student’s t with 5 df, the kurtosis will be 9 (nine)! If one cannot detect the difference between these two values using a simple M-S test (even with n=20), there is something wrong with one’s tools.
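For readers who want to check the back-of-the-envelope figure, the standard result is that Student’s t with df degrees of freedom has population kurtosis 3(df − 2)/(df − 4), finite only for df > 4. The small function below (its name is my own) just evaluates that formula; note that it gives the population value, not what a sample of a given size will show.

```python
def student_t_kurtosis(df):
    """Population kurtosis (not excess kurtosis) of Student's t.

    Uses the standard formula 3 * (df - 2) / (df - 4), valid for df > 4.
    """
    if df <= 4:
        raise ValueError("kurtosis of Student's t is not finite for df <= 4")
    return 3.0 * (df - 2) / (df - 4)


kurtosis_t5 = student_t_kurtosis(5)   # 3 * 3 / 1 = 9, against 3 for the Normal
```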

In the case of a vector process (more than one data series) a simple test for the presence of heteroskedasticity in the Linear Regression model based on E(y(t) | X(t)) should easily distinguish between the two cases.

In my 1999 book [248-253] I used a real-world data series (exchange rates) to demonstrate how, in practice, one can distinguish between Student’s t with 4 df and Student’s t with 5 df. The reason is that M-S tests can be highly effective tools in the hands of an attentive modeler.

Aris: The problem with kurtosis is that it is dependent on points in the extreme tails that occur with a fairly low probability.

I just ran a little simulation. I generated 1000 datasets from a t_5 with n=100 (admittedly not very large but much larger than 20) and ran an Anscombe-test for kurtosis deviations from normality. A p-value larger than 0.05 occurred in 396 cases. This means that if you’re lucky, you can distinguish a t_5 from a normal, but at least fairly often (here almost 40%) you’re not.
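A simulation of this kind can be sketched in plain Python as follows. This is a reconstruction under stated assumptions, not the code actually used: it assumes the two-sided Anscombe-Glynn kurtosis test in its usual textbook form, draws t variates as a standard normal divided by the square root of a scaled chi-square, and uses function names and a seed of my own choosing.

```python
import math
import random
from statistics import NormalDist


def anscombe_glynn_pvalue(sample):
    """Two-sided p-value of the Anscombe-Glynn kurtosis test of normality
    (the approximation is usually recommended for n of about 20 or more)."""
    n = len(sample)
    mean = sum(sample) / n
    m2 = sum((v - mean) ** 2 for v in sample) / n
    m4 = sum((v - mean) ** 4 for v in sample) / n
    b2 = m4 / m2 ** 2                                  # sample kurtosis
    e_b2 = 3.0 * (n - 1) / (n + 1)                     # mean of b2 under normality
    var_b2 = 24.0 * n * (n - 2) * (n - 3) / ((n + 1) ** 2 * (n + 3) * (n + 5))
    x_std = (b2 - e_b2) / math.sqrt(var_b2)            # standardized kurtosis
    sqrt_beta1 = (6.0 * (n * n - 5 * n + 2) / ((n + 7) * (n + 9))
                  * math.sqrt(6.0 * (n + 3) * (n + 5) / (n * (n - 2) * (n - 3))))
    a = 6.0 + 8.0 / sqrt_beta1 * (2.0 / sqrt_beta1
                                  + math.sqrt(1.0 + 4.0 / sqrt_beta1 ** 2))
    denom = 1.0 + x_std * math.sqrt(2.0 / (a - 4.0))
    ratio = (1.0 - 2.0 / a) / denom
    cube_root = math.copysign(abs(ratio) ** (1.0 / 3.0), ratio)  # signed cube root
    z = ((1.0 - 2.0 / (9.0 * a)) - cube_root) / math.sqrt(2.0 / (9.0 * a))
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


def draw_t(df, n, rng):
    """n draws from Student's t: standard normal over sqrt(chi-square / df)."""
    return [rng.gauss(0.0, 1.0) / math.sqrt(rng.gammavariate(df / 2.0, 2.0) / df)
            for _ in range(n)]


def share_not_rejected(df, n=100, n_datasets=1000, alpha=0.05, seed=1):
    """Share of simulated t(df) datasets whose kurtosis-test p-value exceeds alpha."""
    rng = random.Random(seed)
    kept = sum(anscombe_glynn_pvalue(draw_t(df, n, rng)) > alpha
               for _ in range(n_datasets))
    return kept / n_datasets
```

Calling `share_not_rejected(5)` gives the fraction of t_5 datasets that “pass” the test at the 5% level; the exact figure depends on the seed and the generator, so it should not be expected to reproduce the count reported above digit for digit.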

Christian: can you please summarize how differently the inference summaries behave (e.g. coverage of the population mean by nominal 95% intervals) in the datasets that do and don’t “pass” the Anscombe test?

Guest: Good question. Stay tuned.

Thanks for these comments. To your query, Christian, it’s important to see that decisive data is not required for an objective assessment for or against a given inference, or for a decision whether to use a model for a given purpose. Where possible, implications for the choice would be reported. Where not, it is extremely valuable (and part of the objective communication of results) to indicate where uncertainty remains. In my March 19 post, I say that (objectivity) “may well mean that the interpretation of the data itself is a report of the obstacles to inference!” or potential obstacles.

Guest: I ran t-tests for all these datasets (and for 1000 from N(0,1) for comparison). The t-test p-values for the t_5 are surprisingly well behaved; they don’t look any further from uniform than those from a N(0,1). I checked this using a Kolmogorov-Smirnov test against a Uniform(0,1), which wasn’t rejected for any of the three sets (t_5 data where Anscombe rejected the Normal, t_5 data where it did not, and data from the Normal). So in this sense doing t-test inference based on the Normal for data from a t_5 isn’t bad anyway, and the fact that the Normal often cannot be told apart from the t_5 isn’t a big problem. (I could probably construct an identifiability problem of this kind with a different distribution where it is a problem but I can’t put too much time into this blog, sorry.)

There is however one surprising result. Just because I became curious from something that I saw under way, I ran the same thing again with 100,000 simulation runs (still n=100).

The distribution of p-values still looks more or less OK, but with such a high number of p-values the distribution that you get from t_5 data conditionally under *not* rejecting normality by Anscombe is significantly different from a Uniform(0,1), whereas the p-value distribution from t_5 data where Anscombe rejected the Normal was not significantly different from the Uniform.

This means that M-S testing here, if anything, does more harm than good!

Thanks Christian

I do not find your simulation results very surprising, but it is not obvious to me what the results indicate about the reliability of inference when using a t-test based on IID data from two different distributions: the Normal and Student’s t with 5 df. The reliability of inference pertains to the potential discrepancy between the actual and nominal error probabilities (both type I and II or power). What I’m puzzled by is how the shape of the empirical distribution of the p-value (evaluated under the null), i.e. its being close to Uniform, could potentially provide one with information pertaining to the discrepancy between the actual type I error probability and the actual power and the nominal (assumed) ones.
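One direct way to look at the discrepancy between actual and nominal type I error is to estimate the rejection rate of the nominal 5% t-test when the data are in fact t_5 with the null true. The sketch below is only illustrative and rests on several assumptions of mine: n is fixed at 100 so that a tabulated critical value (about 1.984 for 99 df) can be hard-coded, the t variates come from a normal over root chi-square construction, and the function names and seed are hypothetical.

```python
import math
import random

T_CRIT_99DF = 1.984  # tabulated two-sided 5% critical value of Student's t, 99 df


def draw_t5(n, rng):
    """n draws from Student's t with 5 df (normal over sqrt(chi-square / 5))."""
    return [rng.gauss(0.0, 1.0) / math.sqrt(rng.gammavariate(2.5, 2.0) / 5.0)
            for _ in range(n)]


def actual_type1_rate(n_datasets=2000, seed=7):
    """Share of t_5 samples of size 100 (true mean 0) in which the
    nominal 5% two-sided one-sample t-test rejects mean = 0."""
    n = 100  # fixed, because the critical value above is for 99 df
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_datasets):
        sample = draw_t5(n, rng)
        mean = sum(sample) / n
        var = sum((v - mean) ** 2 for v in sample) / (n - 1)
        t_stat = mean / math.sqrt(var / n)
        if abs(t_stat) > T_CRIT_99DF:
            rejections += 1
    return rejections / n_datasets
```

Comparing `actual_type1_rate()` with the nominal 0.05 addresses only the type I side at one level and one sample size; power under specific alternatives would need a separate calculation.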

In addition, simulations — when not performed with due care — can easily mislead one. In particular, the algorithms for generating pseudo-random numbers from a Student’s t (with df lower than 6) are notoriously unreliable and can easily lead one astray, unless one carefully screens the quality of the simulated data before using them to derive the empirical distributions of interest.

One last thought, when performing M-S testing one should not solely use omnibus tests like the ones referred to by Christian, because they are known to have very low power. One needs to combine such tests with parametric ones like the D’Agostino-Pearson kurtosis test to get a more effective detection of the difference between the Normal and Student’s t (5).

Aris: Thanks so much for detailed replies, I hadn’t seen all of these. Yes, as you say, it isn’t at all clear what Hennig’s simulations of the distribution of p-values (which p-values?), conditional on not rejecting or rejecting, etc, (who calls for such a computation*?) have got to do with showing the value of m-s tests. The main goal, I thought, was to be able to say about a statistical model, if deemed statistically adequate by m-s tests, that the statistical inferences based on that model had actual type 1 and 2 error probabilities close to those computed, based on the model (thereby also giving corresponding severity assessments).

*it involves its own significance test.

Let me add that in econometrics the Student’s t distribution has made a huge difference in modeling financial data over the last two decades. It turns out that the statistical regularities exhibited by numerous financial data series are more appropriately modeled using Student’s t instead of the Normal distribution. This has been established by thorough M-S testing that required econometricians like me to develop powerful tests that can discriminate between the two distributions by going beyond their shape. This is extremely important in financial econometrics because the difference in using the Student’s t rather than the Normal is not just the shape of the distribution [the former is leptokurtic]. Much more importantly, Regression type models [including vector autoregressions etc.] based on the multivariate Student’s t give rise to statistical models that enable one to capture the heteroskedasticity [volatility] and second order temporal dependence that was impossible to do systematically in the context of the Normal distribution. To those who have doubts that going from the Normal to the Student’s t can make a world of difference to one’s modeling and learning from data I can give several references of published empirical studies and they can judge for themselves.

Christian: I don’t see the relevance of these conditional distributions of p-values to the question of the value of probative m-s testing, sorry.

George Barnard once pointed out to me that although over all samples the mean is a notoriously poor location estimator for the Cauchy (it is not consistent) it works very well for those very many samples which look Normal. A simple way to see this is to appreciate that the median is a good estimator to use for the Cauchy. The mean is bad because occasionally there will be a spectacular outlier to which it will be very sensitive. However, in all those many samples where the mean is close to the median then the mean will be quite a good estimator and these are precisely the sort of samples that look Normal even though the population is Cauchy.
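Barnard’s observation is easy to try out by simulation. The sketch below is my own illustration, not anything Barnard or Copas published: it draws Cauchy samples with true location 0, and compares the typical error of the sample mean over all samples with its typical error in just those samples where the mean lies close to the median. The sample size, the closeness tolerance, and the seed are arbitrary assumptions.

```python
import math
import random
from statistics import median


def cauchy_sample(n, rng):
    """n draws from the standard Cauchy via the inverse-CDF transform."""
    return [math.tan(math.pi * (rng.random() - 0.5)) for _ in range(n)]


def mean_error_summary(n=25, n_datasets=4000, tol=0.2, seed=3):
    """Median absolute error of the sample mean (true location 0):
    over all samples, and over samples with |mean - median| < tol.
    Also returns the share of samples meeting the closeness condition."""
    rng = random.Random(seed)
    all_errors, close_errors = [], []
    for _ in range(n_datasets):
        sample = cauchy_sample(n, rng)
        m = sum(sample) / n
        all_errors.append(abs(m))
        if abs(m - median(sample)) < tol:
            close_errors.append(abs(m))
    return median(all_errors), median(close_errors), len(close_errors) / n_datasets


overall_err, close_err, close_share = mean_error_summary()
```

In runs of this kind the mean’s error in the “mean close to median” subset is far smaller than its error over all samples, which is the conditional behaviour described above; how large that subset is depends on the tolerance chosen.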

I think this insight is both supportive of and undermining of frequentist inference. Supportive in the sense that it shows that you don’t have to be right about everything to be right about something of interest. Undermining in that it questions the wisdom of looking at everything in terms of long run properties.

I am not sure if Barnard published this or not but it might be worth looking at his paper with John Copas

http://www.sciencedirect.com/science/article/pii/S0378375802002719

In that connection, those who like paradoxes may be interested in this one by Guernsey McPearson

http://www.senns.demon.co.uk/wprose.html#Agony

Just on one point (as I’m heading to the airport): good frequentists (e.g., error statisticians) certainly do not deem it wise to look “at everything in terms of long run properties”—supposing they do is the second of what I listed as the two main erroneous assumptions in criticizing frequentist statistics. A particular choice of long-run may not even be relevant for evaluating probativeness. See, for example, the comedy club howler (based on the famous Cox 1958 example of averaging over measurements with different precisions): in: http://errorstatistics.com/2011/12/02/getting-credit-or-blame-for-something-you-dont-deserve-and-first-honorable-mention/.

I realize mine is a general point; when I’m in one place, I will study Spanos’ point and your two other links, in relation to the m-s testing issue at hand.

Please explain why it’s okay for “good” frequentists to dismiss frequentist properties, should they feel like it. This argument seems like an invitation for users to choose whichever horribly-calibrated analysis method gives their preferred answer.

I don’t have a clue what the Guest is talking about wrt ignoring relative frequencies! There are all kinds of actual and hypothetical experiments over which one could average relative frequencies, but which differ from a test’s error probabilities as the severity associated with the claim in question helps one to see (e.g., J. Berger’s conditional p-values). In the post to which the guest is replying, I alluded to the “oil exec” (from my Dec 2 post) who claims to have good evidence that H: the pressure was at normal levels on April 20, 2010, despite his “test” having very little chance of unearthing dangerous pressure levels, on the grounds that most of the time they do (or would have done) a more stringent test and he was reporting the average (over entirely different pressure tests). There is scarcely good evidence for “no danger” by means of a test that has little if any chance of detecting the presence of danger; such positive results, as Popper would say, “are too cheap to be worth having” and fail to measure how well corroborated H is in the case at hand.

You wrote that “good frequentists certainly do not deem it wise to look at everything in terms of long run properties”.

If one is going to discard frequentist calibration of methods, and (as you seem to regularly suggest) disavow everything Bayesian, how are methods to be calibrated? What scale(s), if any, are you suggesting instead?

Note I am *not* arguing against evaluation of frequentist properties under different data-generating mechanisms, such as conditioning (or not) on ancillary statistics, as discussed in Cox 1958. I want to know what alternatives you propose, instead of the use of frequentist properties.

Aris: If the p-value distribution is uniform, type I error probabilities will be fine for tests at any level (some may still be fine at specific levels without the p-values being uniform) and type II probabilities of course depend on the alternative. I’m happy to admit that my little simulations are not perfect, but if you want to convince me that one can do much (!) better distinguishing a t_5 from a Normal with n=100 or even 20 than I did in this simulation, I’d prefer proper results to just naming a test. I actually chose Anscombe not because I think that this is the best one, but rather because you suggested that the kurtosis of a t_5 should be easily distinguishable from the Normal.

I’m not saying that one cannot do better, I’m just saying: show me.

My issue originally was a general one. There will always be some distributions which you won’t distinguish from your nominal model using whatever programme of M-S tests you carry out, and it is worthwhile to ask and far from clear whether error probabilities behave well under these just because the data passed the M-S tests that you tried.

Also note that in order to investigate this, one would need to come up with a well-defined menu of M-S tests which overall doesn’t imply a high type I error rate, and one should also look at to what extent the nominal error probabilities from the model you actually assume are stable conditionally on passing the M-S tests.

This is not a criticism but rather a suggestion of something that I think is worthwhile to investigate. Unfortunately there are loads of different scenarios to look at. (I may do a tiny little bit of this at some point.)

Of course what Stephen pointed out is another instance of this kind of reasoning (it may still be that the mean is OK but Normal based inference is not).

If your test is quite poor at distinguishing two distributions, then failing to signal a distinction is poor evidence for inferring it is one rather than the other. If your argument for showing you have a bad test is sound, then you have shown (objectively!) that your “negative” result is not good grounds for taking the distribution to be one rather than the other (so if the distinction matters for a given use, you don’t have the warrant needed). The fact that Spanos might show you how he would do a better job than you did would in no way remedy your analysis.

The point of Christian’s simulations is that, directly contrary to what Aris claimed, in plausible circumstances there is little power to distinguish Normal data from data that is considerably non-Normal. Following the arguments put forward in the articles I cited some time ago, this is to be expected; there is generally little power to detect deviations from parametric models, including deviations that will invalidate the inference we subsequently draw, when using methods that rely on parametric assumptions.

Given that we can’t reliably test our way to the right parametric model, the alternatives you seem to be suggesting are some combination of i) using the models and parametric methods anyway, reporting sensitivity analyses as an uncalibrated health warning; ii) giving up on having well-calibrated inference; iii) giving up on inference altogether, saying there is no way to severely test the important parametric assumptions. None of these is very attractive, particularly for the (common) situations where we can iv) use methods that get their validity from known properties of study design, e.g. independence of observations, and not from parametric assumptions.

Guest: Sorry, but I think you are seriously misinterpreting any lesson in Hennig’s particular case (especially as, when done correctly, he had no trouble making the distinction, as he admitted); nor have you or he shown that parametric tests have little power to detect violations of parametric assumptions (even though the recommended procedure would generally be a combination of parametric and non-parametric tests). Finally, the alternatives you claim we “seem to be suggesting” are not ones I recognize. So I think this issue warrants more attention.

@Mayo; Christian’s simulations follow the pattern established in the fairly large literature on the subject. In plausible circumstances, there is little power to detect important deviations from assumed models. Procedures that involve pre-testing get recommended (like the now-disowned ones that Senn mentioned) but can have notably worse properties than procedures that do not involve these steps.

I don’t see the relevance of Aris’s argument. We surely don’t believe that anything literally has a t with 5 degrees of freedom. If whatever Christian simulated from has got tails that even approximately are as heavy as a t_5 and you can’t easily distinguish this from a Normal then this is surely relevant information.

I myself, however, am very wary about pre-testing for assumptions as part of a general strategy of testing a substantive hypothesis. Such procedures as a whole can have bad properties. (So I find myself in agreement with Christian.) For many years the accepted wisdom was that if data from a cross-over trial showed apparent significant carry-over you should only use the first period values. Now we know that the reverse is the case. You should never use the first period values if the test for carry-over is significant.

Stephen: Thanks for your various comments. Just on the last para, what do you mean “the reverse is the case”? And how was pre-testing the culprit, if it was, in erroneously suggesting the first period values were usable (and for what?).

Update to Senn: I thank you for the paper on crossover effects, and would really like to understand it better, but I’m missing the connection to Spanos’ m-s tests.

Well, the guest has apparently now made a bit more of my simulation than I saw in it… Actually yes, I did show that using one specific test it is difficult to tell apart the Normal and the t_5, but on the other hand, at least regarding the distribution of t-test p-values, this isn’t such a disaster because the t_5 doesn’t behave very differently from the Normal under this test anyway. M-S testing didn’t seem to help matters, though.

My general point of view is that it is an open question and a worthwhile area of research whether M-S testing is actually helpful, or more precisely, under which circumstances which M-S tests are helpful. It may turn out that how model checking/M-S testing should be done depends more or less strongly on what afterwards is done with the “accepted” model. (At least not using mean/variance/normality based analyses if extreme outliers are found is clearly sensible, so it can’t *all* be wrong.)

Guest: (I think this is the limit for indenting comments, so I’m tacking it here). I was reacting to Senn’s particular quote, and I explained that “a particular choice of long-run may not even be relevant for evaluating probativeness” (or severity). Now in the case of m-s tests, say one’s procedure is a form of the “error fixing” we saw in the common Durbin-Watson fallacy (rejecting the null and inferring a particular alternative, the autocorrelation-corrected LRM on grounds that it “accounts for” the failure of independence). See my Feb 23 post. Suppose one goes on to assess the error probabilities associated with erroneous rejections of this fallacious sort, and announces that because they are high, then m-s tests have poor error probabilities! But that is not at all the recommended procedure, so the poor error probabilities of that procedure are irrelevant! More than that, they show why we reject that procedure!

If this issue is unclear, as it seems to be, we should take it up in a regular post. Aside from m-s testing, low error probs have never been sufficient for assessing the severity associated with a given inference…..

meaning?