We interspersed key issues from the reading for this session (from Howson and Urbach) with portions of my presentation at the Boston Colloquium (Feb, 2014): Revisiting the Foundations of Statistics in the Era of Big Data: Scaling Up to Meet the Challenge. (Slides below)*.

*Someone sent us a recording (mp3) of the panel discussion from that Colloquium (there’s a lot on “big data” and its politics) including: Mayo, Xiao-Li Meng (Harvard), Kent Staley (St. Louis), and Mark van der Laan (Berkeley).

See if this works: | mp3

*There’s a prelude here to our visitor on April 24: Professor Stanley Young from the National Institute of Statistical Sciences.

I’d forgotten how much of the panel discussion with Meng, Staley, van der Laan and me revolved around the politics of big data. I’m reminded of Larry Wasserman who pointed this out early on. Since it’s not my field, I was less aware of that side of it.

Mayo, great slides, thanks much for posting. You really put things together nicely, it all tells a clear story. One point always irks me, though. You say on slide 20, and many others often repeat (especially around Gelman’s blog), “all methods of statistical inference rest on statistical models.” In a truly randomized study with *perfect retention*, what statistical model is required to test the “strong null hypothesis” (i.e., that no experimental unit would have experienced a different outcome had that unit, counter-factually, been randomized to the other treatment)? What assumptions are required to test this hypothesis? This is the hypothesis being tested by Fisher’s exact test and many other randomization-based permutation tests. Clearly, randomization is not an assumption, it is a fact. I suppose one could consider the null hypothesis itself a “model”, but only in the weakest sense (there are no assumptions that require testing).

Mark:

I do think that all methods of statistical inference rest on statistical models, and I’m sorry that this statement irks you. You give an example in which the model is known to be true. This can happen (that we use a model that we know is true) but in practice it is quite rare. Experiments typically have problems with compliance, measurement error, generalizability to the real world, and representativeness of the sample. Surveys typically have problems of undercoverage, nonresponse, and measurement error. In addition, any p-value is conditional on the model in which the same test would have been performed under other data that would have been seen. But, yes, there are some rare settings where the model on which the inference rests is known to be true.

Andrew: Nice to have you come over to Elba. On Mark’s point, I do think it’s legitimate to make the distinction between model based and design based. And it isn’t even just that the design does the work, it’s the fact that this is a kind of “pure” significance test where there isn’t an explicit statistical alternative embedded in a model (as with a parametric alternative). Beyond that, it’s largely just a semantic issue.

Mayo: You write of “the distinction between model based and design based.” But I don’t see it. The assumption that nonresponse is at random, or the assumption that the people in an experiment are a random sample of the population of interest, or the assumption that there is zero measurement error–these are all mathematical models. I don’t see how calling these assumptions “design” gets them off the hook.

Andrew, that’s fine, I can accept that randomization provides a “true model”. However, I’d argue that when dealing with experiments in people, they *always* have problems with generalizability to the real world and “representativeness” (whatever that means) of the sample. In other words, in experiments with humans I’d say internal validity is everything while external validity requires a trip down the rabbit hole.

Mark: But your example was not about representing a population, you had it right in your initial description as being about the experimental group(s).

Mark:

External validity does not require a trip down the rabbit hole. Much can be learned via multilevel regression and poststratification; see here, for example.

Maybe, but there’s no basis for it in a randomized experiment. Pretending that such a study population is a random sample from any larger population is indeed a trip down the rabbit hole.

Mark:

Real surveys are done to learn about the general population. But real surveys are not random samples. If you choose to not “go down the rabbit hole” of generalizing from sample to population, you might as well not do the survey at all.

That’s fine, there are a lot of settings where you could choose to not “go down a rabbit hole.” For example, educational tests: what are they exactly measuring? Nobody knows, maybe don’t go down that rabbit hole either. Medical research: even if it’s a randomized experiment, the participants in the study won’t be a random sample from the population for whom you’d recommend treatment. Don’t go down that rabbit hole either.

By refusing to go down the rabbit hole of generalizing from data to outcomes of interest, or from non-random sample to population, you’ve decided not to use surveys, educational tests, or medical experiments. Ultimately, by not wanting to use statistical models (as you put it, not “pretending” anything), you’ve restricted yourself to pure mathematics. That’s fine–math is a job like any other–but you should just be aware of the consequences of your decision. You can choose not to play the game, but the game will just go on without you because, in the meantime, surveys will be done, tests will be taken, and decisions will be made based on the results of experiments.

Andrew, I didn’t say anything about surveys that are intended to represent a larger population. There is (or should be) no such intent in a randomized trial. That is not the point. I’m perfectly comfortable and perfectly satisfied with the goals of randomized trials without having to pretend that they are “representative” of anybody who wasn’t actually randomized. But having a very carefully selected study population in a randomized trial does not invalidate or lessen the *statistical* inferences therefrom to any extent. I’m fine with that.

Mark: So you say that you think it’s enough to learn something about the counterfactual outcomes of the study participants, even if this doesn’t tell us anything about anybody else?

Actually you seem to say that if the study population is “carefully selected”, it will tell you something about other people. Or not? But if the answer is yes, one would need to think about conditions/assumptions for generalisability, which is what I was writing about, and, I think, from a different perspective also Andrew. And in this respect a model seems more helpful to me that conceptualises the studied “sample” as representative for a bigger population (so that one then can think about what this population is), than modelling just random assignments of the 10 people we have in the study, treating them as fixed.

Christian,

Yes, I think it’s enough to learn something (something very big, actually… does my drug actually *cause* a difference in any of these people) about the counterfactual outcomes of the study participants. I fully understand that people, especially clinicians, would *like* generalizability. I just don’t think there’s ever any statistical basis for it (note, I am talking strictly about randomized trials! Not surveys, not educational tests). Edgington and Onghena (2007) referred to such clinical generalizations as “non-statistical inference”… yes, that sounds about right to me.

And, yes, before Andrew brings it up, I realize that non-retention potentially causes a problem, a VERY BIG problem, in randomized trials. And addressing this ubiquitous problem involves making untestable assumptions. However, these assumptions do not change the inference space. Suppose you did a truly random survey of the US population. You’d undoubtedly have non-response, right? So, you’d make assumptions about that non-response such that you could still make (hopefully) reasonable inference… but to which population would this inference apply? Hopefully, it’s still to the population that was actually sampled. Same is true in randomized trials. Dropouts suck. We have to make completely untestable assumptions (oh my god, it’s a model!) about those dropouts. But, my inference space is still the population from which I’ve sampled, namely the finite population of counterfactuals induced by randomization.

This is one of my favorite papers of all time, not because I think it’s the gospel, but because I think it lays it all out so clearly: http://www.ncbi.nlm.nih.gov/pubmed/?term=groundhog+day+stewart

By the way, I also almost always use ANCOVA of some sort to analyze randomized trials (oh my god, models, models everywhere). But my (fully statistical) inference is still to those who were actually randomized. It’s amazing how quickly the randomization distribution of a test statistic approaches the normal distribution…
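As a rough illustration of the remark about the randomization distribution approaching the normal, here is a minimal sketch that enumerates every equal split of 16 units and checks how much of the randomization distribution of a difference in means falls within 1.96 standard deviations of its center. The outcome values and the choice of statistic are assumptions made for the illustration; they are not taken from the comment, and this is not an ANCOVA.

```python
from itertools import combinations
from statistics import mean, pstdev

# Hypothetical fixed outcomes for 16 units (an assumption for this sketch).
outcomes = list(range(1, 17))
n = len(outcomes)
half = n // 2
total = sum(outcomes)

# Randomization distribution of the difference in arm means under the
# strong null: outcomes stay fixed, only the assignment varies.
diffs = []
for group_a in combinations(range(n), half):
    sum_a = sum(outcomes[i] for i in group_a)
    diffs.append(sum_a / half - (total - sum_a) / half)

center = mean(diffs)
spread = pstdev(diffs)

# Share of assignments within 1.96 standard deviations of the center;
# a normal distribution would put about 95% of its mass there.
within = sum(1 for d in diffs if abs(d - center) <= 1.96 * spread) / len(diffs)
print(len(diffs), round(within, 3))
```

With only 16 units the share is already close to the normal value, though how quickly this happens will depend on the outcomes and the statistic chosen.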

Mark: Fair enough. I agree that generalisation beyond the people actually in the study involves making untestable assumptions some of which are problematic. I just don’t think that this is a non-statistical issue. I think that statistical modelling can help a lot to at least make transparent what kind of assumptions are needed, although it cannot ensure that they are fulfilled, of course. (Actually I think that your discussion is well informed by statistical modelling.)

Mark: I think you’re right, and that randomized groups are not generally intended to be representative (e.g., they volunteered), but to extract some information about a process in an artificial set-up. This came up a few years ago when Nancy Cartwright was criticizing RCTs in development economics (in relation to our June 2010 conference), and Cox wrote a few pages for the conference specifically to clarify this point.

Mark: I don’t think that we can ever test what you call the “strong null hypothesis”. Such tests can only ever test “equality of distributions”, which means that treatments may have different effects on individuals in random manners as long as they are not systematically different.

Regarding randomisation, my way of modelling would be that before/without randomisation, units may not come from a homogeneous distribution, i.e., different units may be drawn from different distributions. *After* randomisation (but before treatment) units in both groups come from the same distribution, which is a mixture of the not necessarily identical distributions before, and if there is no treatment effect, after treatment it is still the same. Without randomisation this cannot be granted and allocation of units may be a source of systematic differences.

Christian, I don’t understand what you’re saying. What null hypothesis is being tested by a randomization-based permutation test if it’s not the “strong null hypothesis” as I phrased it?

Here’s a simple example. I randomize 10 people, 5 to receive drug A and 5 to receive drug B. There are 252 possible ways to do that randomization. Suppose all 5 on drug A die but all 5 on drug B are alive at the end of the study. Assuming the strong null hypothesis, I can calculate an exact 2-sided p-value of 2/252, which is about 0.008. Personally, I would interpret this as fairly strong evidence that drug B was beneficial (or perhaps drug A harmful, depending on the condition being treated) for at least *some* of those people. That is, I conclude that some of those people who received drug A would not have died had they been randomized to drug B, even if I can’t pinpoint exactly which people those are. What’s wrong with this test/interpretation?
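The arithmetic in this example can be checked by brute force. The sketch below enumerates all ways of assigning 5 of the 10 people to drug A, holds each person's outcome fixed as the strong null requires, and counts the assignments at least as extreme as the one observed. The unit labels and the test statistic (deaths on A minus deaths on B) are assumptions chosen for illustration.

```python
from itertools import combinations

# Hypothetical labels: units 0-4 actually received drug A, units 5-9 drug B.
# Under the strong null each outcome is fixed regardless of assignment.
died = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]  # 1 = died, 0 = alive
units = range(len(died))

def deaths_diff(group_a):
    """Deaths in arm A minus deaths in arm B for one possible assignment."""
    in_a = sum(died[i] for i in group_a)
    return in_a - (sum(died) - in_a)

# Every way of choosing which 5 of the 10 units go to arm A.
assignments = list(combinations(units, 5))
n_assignments = len(assignments)

observed = deaths_diff((0, 1, 2, 3, 4))

# Two-sided: count assignments whose statistic is at least as extreme.
as_extreme = sum(1 for g in assignments if abs(deaths_diff(g)) >= abs(observed))
p_value = as_extreme / n_assignments

print(n_assignments, as_extreme, round(p_value, 4))  # 252 2 0.0079
```

Only the observed assignment and its mirror image are this extreme, which is where the 2 in 2/252 comes from.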

Mark: OK, I see your point. The null hypothesis I had in mind was that “everyone has the same probability of death under both treatments” which is much weaker than “the treatments have the same outcome (counterfactual in half of the cases) for all patients”. I see how you use the latter one to compute your p-value (assuming that the overall number of deaths is fixed at 5). I think that applying the same condition to the former null gives the same test (because then all 252 ways again have the same probability).

So you can’t distinguish your stronger null from my weaker one in this way. Of course, rejecting a weaker null is a stronger result, so you’re entitled to your interpretation.

Christian, great. But that brings up an interesting point. I’m pretty sure that you’re the commenter who often mentions that differences in interpretations of probability are what really distinguishes statistical philosophies (and I completely agree with you on this point (assuming that I have this right that it is you), even if Mayo doesn’t always like to go there!). So, that said, how do you interpret the “probability of death” in your explanation? How should that probability be interpreted? This is something else that has been bugging me quite a bit lately. I think of this as a difference between modeling data “prospectively”, as in using something like a predictive model, and analyzing data as given, a result of the design (obviously, I’m still thinking about this and don’t quite have my wording worked out, maybe differentiating between fixed and stochastic is better).

Anyway, this is also related, I think, to things like clinical prediction models, such as: https://my.americanheart.org/professional/StatementsGuidelines/PreventionGuidelines/Prevention-Guidelines_UCM_457698_SubHomePage.jsp

What can it possibly mean to say that *my* 10-year risk for CVD is 2% (yes, yes, I understand that it applies to a hypothetical randomly selected person from a specific sub-population… assuming that the model is correct, of course (it clearly isn’t)… but they don’t say that, they say “your risk”)? I see this as some kind of cross between “fate” and subjective probability, but it’s simply just meaningless to me.

Mark: I have no problem “going there”. What mainly matters, or what needs clearing up first, is the use of probability in inference, as opposed to, say, in probability modeling. The error statistician sees its role as controlling and assessing the error probabilities of methods in such a way as to capture their capabilities to probe various claims and flaws that are of interest in the inquiry at hand. So when people say, as they so often do, that what we really want is the probability that C: this drug improves survival (roughly your example), I say what they really mean is that they want some way to use probability to aptly qualify/quantify how well (or poorly) warranted C is by the data. They also want, or should want, to know which threats of error have not been well ruled out, and what further tests would be needed to check and discriminate claims involved. This is quite different from how believable C is, or how much you should bet on C, or the % of micro-states that are consistent with C being true (as the Jaynesians seem to want), and any number of construals in terms of degrees of support or plausibility. High probability of a scientific hypothesis, on any of these interpretations, might be OK for some purposes (e.g., summing up what’s learned by other means), but it is not what’s needed to use probability to find things out (i.e., as a “forward-looking” tool).

The use of probability in qualifying evidence for hypotheses in statistics should, in my view, be continuous with the way claims (and inquiries) are qualified throughout science, the vast majority of which is not a matter of formal statistics. Scientists might informally say of a theory that has been well tested that it is “probably true” but that is quite different from the formal notion that allowed them to figure out it was well-tested. One shouldn’t confuse informal English uses of terms such as “probable” and “likely” with their formal counterparts within foundational discussions.

Mayo, of course, sorry wasn’t implying that you won’t talk about probabilistic foundations, I just remember Christian (right?) stepping somewhat lightly around this subject previously. As far as I understand it, I agree with everything you said here.

Mark: For me, the interpretation of such a probability refers to how we think about the situation in terms of the probability model. So if I interpret probabilities in a frequentist way, which I’d do here, this means that for the “probability of death” I think of a very large, actually infinitely large population of units “of the same kind” and then about a relative frequency of deaths in that population.

Obviously this is not how reality really is – units are not exactly of the same kind, not even two of them. In order to back such thinking up with at least some real frequency, I’d therefore have to think about units of “about the same kind” ignoring some differences between them that seem irrelevant to me.

Actually, the idea behind that is more an idea of a propensity of a single unit, which however is connected to a relative frequency if I think of lots of units of this kind – I think that this corresponds to Donald Gillies’s propensity interpretation of probability. In order to make this meaningful in reality you’d need to apply this to a reasonably large group of units that are reasonably similar (and hope that the existing differences between them don’t distort the applicability of the death probability – and see below), and then you can say something meaningful about the estimated expected relative frequency of death.

There is a subjective element in this. However, this is essentially different from subjective probability in the Bayes/de Finettian sense, because I think of probabilities as modelling something in the world outside my brain (although my brain does this), whereas subjective Bayesian probabilities model something that is supposed to be in the brain of the subject.

Now actually, with randomisation, the “reasonably large group” may contain quite different units. How so? Because randomisation makes the probability of a single unit assigned to one of the groups a mixture of probabilities from the potentially diverse people that go into the study, in which all the different unit’s propensities are represented. But there is still the issue to what extent the population of units of interest is represented by the units in the study (random sampling vs. randomisation).

In any case, the interpretation refers to frequencies within certain (more or less idealised) groups, not to individuals (or to individuals only to the extent to which they feel to be “random representatives” of the group).

Christian, great, thanks! I’ve always liked Popper’s propensity interpretation, but I have a hard time applying it here, I just can’t imagine an individual’s future as a (hypothetically repeatable) chance setup. I’m not familiar with Gillies’ interpretation, but will definitely read up!

I think that Gillies’s propensity is more relevant to statisticians because it leads to claims that can be tested in the usual way, and ideas such as severity and misspecification testing apply, which however means that we need to compromise and idealise and interpret situations as “repetitions” that are not really repetitions without ignoring some details, whereas Popper’s concept to me seems to be further away from getting hands dirty by looking at nasty real data… but then I know of Popper’s propensities rather through citations in other peoples’ work than from Popper himself.

There are problems with propensities that I’ll have to come back to another time. This is a placeholder.

Christian:

At some point, you might say how you see Gillies’s view. I think he is wrong to suppose frequentist accounts are “operational” in the sense of reducing concepts to mere observables.

Mark: I take Christian’s point to be your way of stating the “no effect null” as “no experimental unit would have experienced a different outcome had that unit, counter-factually, been randomized to the other treatment”. The inference in rejecting the null isn’t the logical denial of this. Stephen Senn can give us his take.

Harumph… at the risk of answering my own question, I suppose one could consider the assumption that every possible (allowable) random assignment is equally probable to be a “model”, although it’s one that is easily checked and is part of the “fact” of randomization.

Mark: First, thanks for appreciating my slides. It’s true that there is a legitimate distinction between “design based” and “model based” assessments (Cox’s distinction). So even though there are assumptions in the former, it needn’t be called a model, fine. But when I said “all methods of statistical inference” I had in mind “all methodologies of statistical inference” as a whole, the point really being that the comparison between different existing methodologies doesn’t turn on the use of statistical models. And of course, you still might need to check assumptions with design-based inference, if there is really to be inference.

Anyway, given some of the radical things I said in those slides, I’d be happy to get off with this much of a caveat.

One has to be careful in taking extreme results as examples since one controversial aspect of P-values is avoided: one does not have to consider more extreme results since there are none.

Even with such an example, however, there has to be some sort of a model to guide what is the most extreme case. For instance, some fraud detection tests don’t look at extreme cases but the reverse. It has been found that fraudsters produce data that are too good to be true so I think that what alternative you had in mind (which is some sort of a model) would guide your test. This is true even when you are looking for treatment effects, as in your example. See for example

Conover WJ, Salsburg DS. Locally Most Powerful Tests for Detecting Treatment Effects When Only a Subset of Patients Can Be Expected to Respond to Treatment. Biometrics 1988; 44: 189-196.

Of course, rejecting a given null does not entitle you to assert the alternative you had in mind when constructing a given test. Also, as I have pointed out in a previous post, Fisher regarded the null hypothesis as having priority over the test statistic but the test statistic as having priority over the alternative hypothesis.

One may note, by the by, that for discrete distributions the issue as to how to handle two-sided tests is very controversial.

Stephen: this business about “having priority” is obscure. I mean I’m aware of Fisher’s preference, but it seems odd to call it a priority of some sort. I suspect Fisher was largely underscoring a way to be at odds with Neyman. Remember Pearson telling Fisher it was actually his “heresy” (introducing the alternative), and not Neyman’s?

I gave your regards to Stan Young who led a great seminar here this afternoon.

Stephen, thanks for that. Point taken, I sort of cheated by choosing an extreme example (but it made the math easy!). Of course, any randomization test would be based on some selected test statistic… But I’m still wondering if that is a “model” (in the sense that it doesn’t require any assumptions, beyond the assumption that the test statistic is reasonable for the particular situation… Ok, that’s somewhat loaded, I admit, maybe it is a model). Wasn’t this a key point of conflict between Fisher and Neyman?

I’ll check out that reference (feel like I’ve read it before, a long time ago). Thanks!

Mark: Don’t forget the distinction between simple or “pure” significance tests and “embedded” tests (Cox’s terms). It’s because of the limited assessment under the null that you may be able to get as close to a genuine modus tollens as is to be expected. Dashing off to Stan Young’s class.

Good point, and I’m jealous.

Mark:

It is really just a matter of how wrong assumptions can be rather than an absence or presence. In your example, Peirce would point out (quote below) “truly randomized” is clearly an assumption but one whose “wrongness” is not important. So it is a model of how subjects were assigned. I think of models and representations as the same thing, so I would argue you can’t even think without them.

Fisher did argue that the primary objective in statistical design should be to arrange both that assumptions would be less wrong and the remaining “wrongness” should have minimum impact on the inference. Your example is one Fisher likely gave of doing this well. But even here you need independent response of the units and, more generally, SUTVA: the stable unit treatment value assumption (if I recall correctly).

Transcribed from: Peirce, Charles Sanders, (1839-1914) The collected papers of Charles Sanders Peirce Cambridge, MA : Harvard University Press, 1931-1935 Volumes 1-6 edited by Charles Hartshorne and Paul Weiss. Section XVI. §16. Reasoning from Samples

94. That this does justify induction is a mathematical proposition beyond dispute. It has been objected that the sampling cannot be random in this sense. But this is an idea which flies far away from the plain facts. Thirty throws of a die constitute an approximately random sample of all the throws of that die; and that the randomness should be approximate is all that is required.

Thanks Keith. Actually, I don’t believe you do need SUTVA to justify the test, that only comes into play once you intend to estimate the treatment effect (which, of course, implicitly involves a model). Actually, as was pointed out to me recently, under the null hypothesis, SUTVA is given.

I agree with that quote from Peirce (I’m a big fan of Peirce… this might be thanks to reading Mayo, but I can’t recall).

Keith: I use the quote you cited in EGEK (Mayo 1996): “Thirty throws of a die constitute an approximately random sample of all the throws of that die” and elsewhere, but do you think it is true?

Of course, neither I nor Peirce would think it’s true. Peirce’s concept of truth, of which you are surely aware, is what an enquiring community _would_ settle on if the community enquired into it sufficiently. Cheryl Misak argues that the word _would_ needs to be emphasised to avoid any idea that this will ever happen. The important point here though is that the wrongness from random assignment being approximately rather than actually random is (habitually) harmless.

Additionally as the wrongness is learned about it is lessened. Better dice are made and computerised random number generators are continually revised. Today, I might be somewhat suspicious of a randomised trial that did not use a documented computerised random number generator with a documented seed that initialised it. Fortunately, most statisticians are trained to do this. Or I wish they were.

Keith
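The practice described above, a documented generator with a documented seed, can be sketched as follows. This is only an illustration: the seed value, the unit numbering, and the `assign` helper are hypothetical, and an audited trial would also record the software version, since the shuffle produced by a given seed can differ between implementations.

```python
import random

def assign(unit_ids, seed):
    """Split units into two equal arms using a documented seed."""
    # A dedicated generator, so the seed alone determines the allocation.
    rng = random.Random(seed)
    shuffled = list(unit_ids)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return sorted(shuffled[:half]), sorted(shuffled[half:])

SEED = 20140424  # hypothetical seed, written down before allocation
arm_a, arm_b = assign(range(1, 31), SEED)

# Anyone holding the seed and the unit list can regenerate the allocation.
assert assign(range(1, 31), SEED) == (arm_a, arm_b)
print(arm_a)
print(arm_b)
```

Recording the seed lets a third party rerun the allocation and confirm that it matches what was actually used.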

Keith: I’m somewhat perplexed by your answer. Peirce clearly thought it true that the 30 tosses are approximately random. I was wondering if you agreed.

On the matter of truth as “what an enquiring community _would_ settle on”–according to Peirce–well, I have a different reading of what he meant than some. It isn’t as if an inquiring community could “settle on” this or that, thereby rendering it true. Rather, if the path of inquiry is unobstructed and is one that “settles on” a claim only if it has survived severe error probes (or arguments from coincidence), then the claim is true or approximately true. That is, Peirce (as I see him) is a realist who articulates methods of scientific inquiry for discovering what is true.

> a realist who articulates methods of scientific inquiry for discovering what is true

Yes, but he would likely add, you never can rule out reality brutally surprising you in the future. Individual enquirers don’t matter – including Peirce.

You might recall his quip “Good thing we die else we would live long enough to discover anything we thought we understood we didn’t.”

But we can be comfortable having different readings of Peirce – it’s hard not to given what remains of his largely unfinished work.

Keith: Still curious if you agree to the claim about the 30 tosses of the coin being approximately random tosses of the coin.

Actually I don’t know that quip of his; I’m not a Peirce scholar really. So where is it?

> Still curious if you agree to the claim about the 30 tosses of the coin being approximately random tosses of the coin.

Yes, in the sense that if someone did this to assign 30 subjects to one of two groups, and the rolls were witnessed by a third party and the origin of the die was documented and accepted as well manufactured (why was coin mentioned instead of die?) – I would not seriously question the assumption of random assignment.

On the other hand, I would never suggest it be done this way, and there are better ways (and the process needs to be audited.)

(Peirce reference not at hand.)