**Stephen Senn**

*Head of the Methodology and Statistics Group,*

*Competence Center for Methodology and Statistics (CCMS), Luxembourg*

An issue sometimes raised about randomized clinical trials is the problem of indefinitely many confounders. This, for example, is what John Worrall has to say:

Even if there is only a small probability that an individual factor is unbalanced, given that there are indefinitely many possible confounding factors, then it would seem to follow that the probability that there is some factor on which the two groups are unbalanced (when remember randomly constructed) might for all anyone knows be high. (Worrall J. What evidence in evidence-based medicine? Philosophy of Science 2002; 69: S316-S330; see page S324.)

It seems to me, however, that this overlooks four matters. The first is that it is not indefinitely many variables we are interested in but only one, albeit one we can’t measure perfectly. This variable can be called ‘outcome’. We wish to see to what extent the difference observed in outcome between groups is compatible with the idea that chance alone explains it. The indefinitely many covariates can help us predict outcome but they are only of interest to the extent that they do so. However, although we can’t measure the difference we would have seen in outcome between groups in the absence of treatment, we can measure how much it varies within groups (where the variation cannot be due to differences between treatments). Thus we can say a great deal about random variation to the extent that group membership is indeed random.

The second point is that in the absence of a treatment effect, where randomization has taken place, the statistical theory predicts probabilistically how the variation in outcome between groups relates to the variation within.

The third point, strongly related to the other two, is that statistical inference in clinical trials proceeds using ratios. The F statistic produced from Fisher’s famous analysis of variance is the *ratio* of the variance between to the variance within and is calculated using observed outcomes. (The ratio form is due to Snedecor but Fisher’s approach using semi-differences of natural logarithms is equivalent.) The critics of randomization are talking about the effect of the unmeasured covariates on the *numerator* of this ratio. However, any factor that could be imbalanced *between* groups could vary strongly *within* and thus while the numerator would be affected, so would the denominator. Any Bayesian will soon come to the conclusion that, given randomization, coherence imposes strong constraints on the degree to which one expects an unknown something to inflate the numerator (which implies not only differing between groups but also, coincidentally, having predictive strength) but not the denominator.
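As a rough illustration of the third point, the sketch below simulates trials with no treatment effect and compares the average F ratio when a hidden binary factor has no effect on outcome with the average when it has a very strong one. Everything in it (the function names, the sample size of 100, the binary factor, the effect size of 10) is an assumption chosen for illustration and not taken from the post. If the argument holds, the hidden factor should inflate numerator and denominator together and leave the ratio hovering around 1 in both cases.

```python
import random
import statistics


def f_ratio(group_a, group_b):
    """Two-group analysis-of-variance F: variance between over variance within."""
    n_a, n_b = len(group_a), len(group_b)
    mean_a = statistics.fmean(group_a)
    mean_b = statistics.fmean(group_b)
    grand = (sum(group_a) + sum(group_b)) / (n_a + n_b)
    # Between-group mean square (1 degree of freedom for two groups).
    ms_between = n_a * (mean_a - grand) ** 2 + n_b * (mean_b - grand) ** 2
    # Within-group mean square (n_a + n_b - 2 degrees of freedom).
    ss_within = sum((y - mean_a) ** 2 for y in group_a)
    ss_within += sum((y - mean_b) ** 2 for y in group_b)
    return ms_between / (ss_within / (n_a + n_b - 2))


def mean_f_without_treatment(strength, n=100, trials=2000, seed=1):
    """Average F over simulated trials in which the treatment does nothing.

    Hypothetical model: outcome = strength * hidden binary factor + unit noise.
    """
    rng = random.Random(seed)
    ratios = []
    for _ in range(trials):
        outcomes = [strength * rng.randint(0, 1) + rng.gauss(0, 1) for _ in range(n)]
        rng.shuffle(outcomes)  # random allocation: first half vs second half
        ratios.append(f_ratio(outcomes[: n // 2], outcomes[n // 2:]))
    return statistics.fmean(ratios)


mean_f_weak = mean_f_without_treatment(strength=0.0)
mean_f_strong = mean_f_without_treatment(strength=10.0)
print(f"hidden factor with no effect:     mean F = {mean_f_weak:.2f}")
print(f"hidden factor with strong effect: mean F = {mean_f_strong:.2f}")
```

The strong factor makes the raw between-group differences much larger, but the within-group variance grows in step, which is why the ratio rather than the numerator is the quantity to watch.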

The final point is that statistical inferences are probabilistic: either about statistics in the frequentist mode or about parameters in the Bayesian mode. Many strong predictors varying from patient to patient will tend to inflate the variance within groups; this will be reflected in due turn in wider confidence intervals for the estimated treatment effect. It is not enough to attack the estimate. Being a statistician means never having to say you are certain. It is not the estimate that has to be attacked to prove a statistician a liar, it is the certainty with which the estimate has been expressed. We don’t call a man a liar who claims that with probability one half you will get one head in two tosses of a coin just because you might get two tails.

Stephen: I think this response to a common criticism is very important, and you explain it quite clearly. Thanks. (I have invited Worrall to respond when he can.) I take the final remark to mean that the problem critics raise would show up in a wide confidence interval; thus the statistician forming the interval would be telling the truth. Nowadays, some (many?) Bayesian practitioners seem to want to uphold randomization, but it is not clear what their grounds of justification are.

Mayo: the grounds are strong ignorability of the treatment assignment mechanism, which licenses a causal interpretation of the resulting estimate.

Corey: Not sure I get this.

Mayo: I’m too terse, as usual. I doubt I could make further comment on the topic without making a mistake unless I review Ch. 7 of Gelman’s Bayesian Data Analysis.

Dr. Senn: Wow, this was an eye-opener. We always hear critics of evidence based policy say that RCTs are invalid because of a high probability of imbalance on some unknown factor. (This is what I’d always thought randomization was designed to take care of in the first place, but seemed to be in the minority!) Now, if I understand you correctly, you are saying that any imbalance would have the same effect on both groups, so at most the precision is hurt, and this shows up in the confidence interval? Just making sure I have this right…

Senn just wrote to me from Shanghai, but he will reply when he can.

Deborah: yes. The point is that variation in response leads, other things being equal, to wider confidence intervals. I often put it like this. If we knew that many strongly predictive covariates were perfectly balanced between groups the conventional statistical calculations would be wrong. This is because the calculations make an allowance for the factors being unbalanced, which allowance would be inappropriate in a perfectly balanced case. It is for this reason that matched pairs designs generally lead to narrower confidence intervals than do completely randomised ones. It is an infamous elementary error to analyse a matched pairs design as if it were completely randomised.
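The matched-pairs remark can be checked with a small simulation. This is only a sketch under assumed numbers (50 pairs, a shared prognostic factor with standard deviation 5, unit noise, and a hypothetical function name): each pair shares a strongly predictive factor, treatment is randomised within the pair, and the standard error of the treatment estimate is computed once respecting the pairing and once as if the design were completely randomised.

```python
import math
import random
import statistics


def standard_errors(n_pairs=50, pair_sd=5.0, noise_sd=1.0, effect=0.0, seed=2):
    """Return (paired SE, unpaired SE) for one simulated matched-pairs trial."""
    rng = random.Random(seed)
    treated, control = [], []
    for _ in range(n_pairs):
        shared = rng.gauss(0, pair_sd)  # strong predictor, balanced by the matching
        first = shared + rng.gauss(0, noise_sd)
        second = shared + rng.gauss(0, noise_sd)
        if rng.random() < 0.5:  # randomise which member of the pair is treated
            first, second = second, first
        treated.append(first + effect)
        control.append(second)
    diffs = [t - c for t, c in zip(treated, control)]
    # Correct analysis: the shared factor cancels in the within-pair differences.
    paired_se = statistics.stdev(diffs) / math.sqrt(n_pairs)
    # Erroneous analysis: treats the two arms as independent samples.
    unpaired_se = math.sqrt(
        statistics.variance(treated) / n_pairs + statistics.variance(control) / n_pairs
    )
    return paired_se, unpaired_se


paired_se, unpaired_se = standard_errors()
print(f"paired SE = {paired_se:.2f}, unpaired SE = {unpaired_se:.2f}")
```

The unpaired calculation carries an allowance for the shared factor being unbalanced between arms. Because matching has removed that imbalance, the allowance is inappropriate here and the resulting interval is far too wide.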

A Bayesian justification for randomisation was given by Mervyn Stone in 1969. See Stone M. The Role of Experimental Randomization in Bayesian Statistics: Finite Sampling and Two Bayesians. Biometrika 1969; 56: 681-683.

Stephen: That’s an excellent way to put this. It should be emphasized in philosophical discussions of EBM, which rarely discuss these statistical aspects. Granted there are valid criticisms of certain studies that purport to satisfy requirements of RCTs, as in developmental economics. But then the fault lies more with the study and the inference claimed to be warranted. I don’t know if you’re familiar with that at all.

Eileen. The point is that the critics have to prove that there is a high probability of an unknown prognostic factor varying between groups but not within, because if it varies within, the confidence intervals for the treatment effect will be wide.

Stephen, how does this fit in with the following (very sagely put) “incontrovertible facts about a randomized clinical trial:

1) over all randomizations the groups are balanced;

2) for a particular randomization they are unbalanced.”

This, in a nutshell, essentially forms the basis for my personal philosophy of statistics. But I’m having difficulty seeing where it fits into your four points above (perhaps point number 4, but unclear). Have you modified your own views on this? Thanks!

I don’t see any contradiction. I dislike tests of baseline balance because my philosophy is that you fit what is important in terms of its effect on outcome, not what is unbalanced. I would fit an observed predictive covariate whether or not it was imbalanced. This by the way is why I dislike the propensity score approach.

Stephen: Can you say how the propensity score approach works, and why you dislike it?

Hi Stephen, I suppose I didn’t see so much a contradiction as an oddity… I was just wondering why you went through all 4 of the above points when these two incontrovertible facts would have sufficed, it seems to me. It just seems that Worrall completely misunderstands this… The probability isn’t “high” that the groups are unbalanced on some factor, the probability is 1 that they are.

Is not the whole issue of baseline comparison problematic? Trials are not really powered to detect baseline differences, and recent trends in medical RCTs show greater and greater numbers of baseline comparisons, increasing the risk of imbalance by chance (even if not powered to that end).

Ross, yes, baseline comparison is misguided. Stephen Senn wrote, as far as I’m concerned, the definitive paper on this topic: http://www.ncbi.nlm.nih.gov/pubmed/7997705 (he actually wrote several, but this is by far my favorite; it’s actually among my favorite statistical papers ever, highly recommend!). The easiest way to understand why it’s nonsensical is this: hypothesis tests compare populations, not observed samples, and asking whether the populations from which randomized groups were drawn differ is plain silly… they were randomly selected from exactly the same study population of participants! Of course the populations don’t differ, they’re one and the same!

Of course, not all statisticians agree. Some, such as Vance Berger, still see value in doing the baseline comparisons to rule out serious selection bias due to inadequate randomization or allocation concealment mechanisms. I think Berger probably has a point, but I totally agree with Senn that using baseline comparisons as a means to decide what to control for is misguided at best and misleading at worst.

Thank you for the reference Mark. I appreciate the group’s indulgence of an enthusiastic newbie!

I’ve thought a little about Worrall’s unknown confounders argument, and I have a different take on what is wrong about it. Here’s what I wrote in a 2011 EJPS paper:

Randomization is unlikely to control for confounding factors only in the event that there are many unrelated population variables that influence outcome, because only in that complex case is one of those variables likely to be accidentally unbalanced by the randomizations. Worrall considers only the abstract possibility of multiple unknown variables; he does not consider the likely relationship (correlation) of those variables with one another and he does not give us a reason to think that, in practice, randomization generally (rather than rarely) leaves some causally relevant population variables accidentally selected for and thereby able to bias the outcome. In addition, an attempt to replicate a result can quickly eliminate particular confounders (just not all possible confounders).

Miriam: Thanks for your comment. One thing of interest in Senn’s remark is that unbalance is expected and taken account of in the analysis. Your replication point is important.

Miriam. Yes, I think we are broadly in agreement but I think that it is not just the issue of correlation. Perhaps I can explain it this way. Suppose that you want to simulate the results from a set of patients using a complex model of the sort Worrall seems implicitly to suppose might be relevant. Take the case of asthma in which we decide to measure forced expiratory volume in one second (FEV1) as our outcome. Let’s be really ambitious and suppose that these indefinitely many predictors are 30,000 genetic loci (factors) that can have different alleles (levels of the factor), a figure that at one time was claimed to be the number that made us who we are. So even if you want to set up an additive model you need tens of thousands of regression coefficients. But caution! You can’t make these coefficients whatever you like. They should not produce impossible values of FEV1, which really ought to lie between 500ml (very low) and 5000ml (very high). The fact that this is so limits pretty strongly what you can put on these coefficients and this means, of course, that their joint probability distribution is very tightly constrained.

However, luckily, if you stop to think about this you will soon realise that this tedious exercise is unnecessary. If I know what the outcome is, why should I care about the predictors? All you need to simulate is one set of FEV1 values for all the patients in the trial. Do it for 300 patients which is a typical number we would use in a phase III trial in asthma.

Now the game continues as follows. I randomly assign the 300 patients one by one to either ‘treatment’ or ‘placebo’. You are allowed to add a value to all the simulated FEV1 values for all the patients in the treatment group. I now produce a 95% confidence interval for the treatment effect (treatment – placebo) using the Fisherian machinery (although Fisher would not have liked the term confidence interval). Let us suppose that you actually added 0 ml. Do the 30,000 hidden factors have any relevance to the probability that the true effect (0 in this instance) is contained within the confidence interval? Answer: none whatsoever.
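The game just described can be sketched in a few lines. The details below are assumptions made for illustration, not part of the original comment: 200 hidden loci stand in for the 30,000 to keep the run short, the coefficients are drawn uniformly between -40 ml and 40 ml, allocation is an equal 150/150 split via a shuffle, and the interval uses an approximate t quantile of 1.968 for 298 degrees of freedom. One set of FEV1 values is built from the hidden loci and a second is drawn directly with no predictors at all. Each fixed set is then re-randomised many times with 0 ml added to the treatment group, and the share of 95% intervals that contain 0 is recorded.

```python
import math
import random
import statistics


def fev1_from_hidden_loci(n_patients=300, n_loci=200, seed=3):
    """FEV1 (ml) from an additive model over hidden binary loci (hypothetical)."""
    rng = random.Random(seed)
    coefs = [rng.uniform(-40, 40) for _ in range(n_loci)]
    return [2500 + sum(c * rng.randint(0, 1) for c in coefs) for _ in range(n_patients)]


def fev1_direct(n_patients=300, seed=5):
    """FEV1 (ml) simulated directly, with no predictors at all (hypothetical)."""
    rng = random.Random(seed)
    return [rng.gauss(2500, 400) for _ in range(n_patients)]


def coverage(outcomes, added_effect=0.0, trials=2000, seed=4, crit=1.968):
    """Share of randomisations whose 95% interval contains the added effect."""
    rng = random.Random(seed)
    n = len(outcomes)
    half = n // 2
    hits = 0
    for _ in range(trials):
        order = outcomes[:]
        rng.shuffle(order)  # random allocation to treatment or placebo
        treatment = [y + added_effect for y in order[:half]]
        placebo = order[half:]
        diff = statistics.fmean(treatment) - statistics.fmean(placebo)
        pooled = (
            (len(treatment) - 1) * statistics.variance(treatment)
            + (len(placebo) - 1) * statistics.variance(placebo)
        ) / (n - 2)
        se = math.sqrt(pooled * (1 / len(treatment) + 1 / len(placebo)))
        if diff - crit * se <= added_effect <= diff + crit * se:
            hits += 1
    return hits / trials


coverage_hidden = coverage(fev1_from_hidden_loci())
coverage_direct = coverage(fev1_direct())
print(f"coverage with hidden loci:  {coverage_hidden:.3f}")
print(f"coverage, direct outcomes:  {coverage_direct:.3f}")
```

Both figures should land close to 0.95. The analysis only ever sees the outcome values, so the machinery that generated them, however elaborate, does not enter the calculation.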

Stephen: I really need to ponder more closely your analysis using the asthma example–I’m not immediately grasping it, but it seems crucial. If an obvious simplification comes to mind, I’d be grateful. Thanks so much!

I always found Worrall’s ‘indefinitely many possible confounders’ argument (against the epistemic importance of randomization) unconvincing. I am with Dr. Senn on this.

Worrall’s criticism (which goes back to Howson & Urbach) overlooks matters. Behind it is this idea that not only does randomization not provide preconditions to ground experimental inference (e.g. vis-à-vis significance tests), it is unnecessary for the comparison of groups in clinical trials. For Worrall, the essential condition is that the groups be adequately matched on factors believed to have prognostic significance; these beliefs should come from experts and be incorporated into subjective priors, an orthodox Bayesian move. The objection has a certain appeal if you assign a different (yet mistaken) role to experimental randomization. The mistaken idea is that, if randomization is to ground experimental inference by balancing experimental groups, it should perform the same service for the unknown factors (possible confounders) as controls perform for the known factors. But that gets randomization wrong.

For RCTs, it is precisely when we do not have any further evidence or suspicion about which factors to conditionalize on (for instance when using observational studies) that randomization provides its epistemic importance – ethical concerns aside.

Senn’s second and third points are right on the mark. Although it is implicit in Senn’s second point, it is worth stressing that randomization has played its role only after intervention (control) has taken place, that is, after the system has been disturbed, if the absence of a treatment effect is then detected (e.g. no significant change in mortality rates between groups), severity concerns aside. Randomization permits the counterfactual reasoning: had a treatment effect been present (a certain drop in mortality rates), it would have been detected and would have been due to the experimenter’s intervention, rather than to something that happens to be correlated with the intervention.

Thanks Roger. It’s good to see that philosophers may be getting more sophisticated about the statistical functions of randomization in a frequentist-type analysis these days, rather than just exiling them.

The final point about probabilistic inferences does not get as much attention in medicine as it deserves. I think the fundamentally deeper issues relate to how probability statements make claims about the world. This is an issue that Fisher turned to late in his life and in his usual inimitable way has much to say, though I remain unconvinced about his direct account. His account relates to the issue raised in the discussion with Cox (in the linked papers) about how prior ideas influence design and analysis of experiments.

Fisher accords great weight to the role of scientists, especially in terms of taxonomy and measurement prior to experimentation, and to his application of probability, which I have come to believe was, in his mind and setting aside mutation, fundamentally about permutations once the conditions of the experiment have been set. (See Mathematical Probability in the Natural Sciences, published in 1959 and available for download as a .pdf from the exquisite University of Adelaide Fisher Archive.) This means there is much in the background for which there is some convention or consensus before an experiment is even done. This then speaks to the important role of theory formation, which is utterly neglected in EBM.

Another important issue re EBM is not so much whether RCT’s balance co-variates, but the obsessive reliance on using the same standard of significance and confidence intervals (either .05 or 95%) regardless of the issue being investigated. Fisher dismissed the use of an invariant standard early on; EBM and Cochrane have enshrined it (and near reified it) with significant perils in both directions.

Actually, in practice, Fisher was using p-value benchmarks as automatically as anyone, never mind his argument with Neyman about the behavioristic account (see Neyman’s reply to him). We have discussed this a fair amount on this blog, which you can find. Actually, it also came up very briefly (with some references) in my July 10 blog of last night.

Ross:

Late in Fisher’s career he wrote that he had initially not realized that the real value in experimental design was to make inferences depend less on assumptions (rather than be most efficient under given assumptions) and to make those assumptions unlikely to be importantly wrong. For instance, the randomized experiment with permutation inference about the strict Null is something that, as Stephen pointed out, even the devil cannot mess up.

Thanks Deborah. I have some catch up reading to do on the blogs after I finish devouring the volume on statistics and philosophy of science. An unexpected treasure trove in early summer!

Ross: I’m grateful for your interest!

Deborah, a quote from Fisher:

“one has to consider the problem in an extreme form. Let the Devil choose the yields of the plots to his liking. . . If now I assign treatments to plots on any system which allows any two plots which may be treated alike an equal chance of being treated differently. . . then it can be shown both that the experiment is unbiased by the Devil’s machinations, and that my test of significance is valid.”

Fisher, R. A., in Bennett, J. H. (ed.), Statistical Inference and Analysis: Selected Correspondence of R. A. Fisher (p. 269)

So what I am saying is let anybody build a model for FEV1 using an indefinitely rich set of predictors. Let these predictors be hidden but assigned to each patient. Let the outcome value also be assigned to each patient but let us then randomise the patients to treatment. If we now analyse the outcome values, it is only their distribution that matters, not that of the hidden predictors.

See Senn SJ. Fisher’s game with the devil. Statistics in Medicine 1994; 13: 217-230.
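Fisher's game with the Devil can be played out exactly on a small scale. The sketch below is an illustration under assumed values, not a reconstruction of Fisher's own calculation: the ten yields are arbitrary and deliberately ugly, five plots are treated, the treatment does nothing, and the test statistic is the absolute difference in means. It enumerates all 252 possible allocations, treats each in turn as the one actually drawn, and computes its exact randomisation p-value.

```python
from itertools import combinations

# Yields chosen "by the Devil": arbitrary, skewed, with wild outliers (hypothetical).
devil_yields = [0, 0, 0, 1, 1, 2, 3, 50, 100, 1000]
n_plots = len(devil_yields)
n_treated = 5
total = sum(devil_yields)

# Treatment-minus-control difference in means for every possible allocation,
# given that the treatment has no effect on any plot.
differences = []
for treated_plots in combinations(range(n_plots), n_treated):
    treated_sum = sum(devil_yields[i] for i in treated_plots)
    control_sum = total - treated_sum
    differences.append(treated_sum / n_treated - control_sum / (n_plots - n_treated))

# Unbiasedness: averaged over all allocations, the estimated effect is zero.
mean_difference = sum(differences) / len(differences)

# Validity: the exact p-value for an allocation is the share of allocations whose
# absolute difference is at least as large as the one it produced.
abs_diffs = [abs(d) for d in differences]
p_values = [sum(1 for other in abs_diffs if other >= d) / len(abs_diffs) for d in abs_diffs]

alpha = 0.05
rejection_rate = sum(1 for p in p_values if p <= alpha) / len(p_values)
print(f"mean estimated effect over all allocations: {mean_difference:.6f}")
print(f"share of allocations rejecting at {alpha}: {rejection_rate:.4f}")
```

Whatever numbers are substituted for `devil_yields`, the share of allocations rejecting at level alpha cannot exceed alpha, because the p-values depend only on the ranking of allocations against one another. The hidden predictors behind the yields never enter the calculation.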