Head, Methodology and Statistics Group,
Competence Center for Methodology and Statistics (CCMS),
At a workshop on randomisation I attended recently I was depressed to hear what I regard as hackneyed untruths treated as if they were important objections. One of these is that of indefinitely many confounders. The argument goes that although randomisation may make it probable that some confounders are reasonably balanced between the arms, since there are indefinitely many of these, the chance that at least some are badly confounded is so great as to make the procedure useless.
This argument is wrong for several related reasons. The first is that the total effect of these indefinitely many confounders is bounded. The argument is thus false in the same way as a claim that the infinite series ½, ¼, ⅛, … does not sum to a limit because it has infinitely many terms. The fact is that the outcome value one wishes to analyse imposes a limit on the possible influence of the covariates. Suppose that we were able to measure a number of covariates on a set of patients prior to randomisation (in fact this is usually not possible but that does not matter here). Now construct principal components, C1, C2, …, based on these covariates. We suppose that each of these predicts, to a greater or lesser extent, the outcome, Y (say). In a linear model we could put coefficients on these components, k1, k2, … (say). However, one is not free to postulate anything at all by way of values for these coefficients, since it has to be the case for any set of m such coefficients that $\sum_{i=1}^{m} k_i^2 V(C_i) \le V(Y)$, where V( ) indicates variance of. Thus variation in outcome bounds variation in prediction. This total variation in outcome has to be shared between the predictors, and the more predictors you postulate there are, the less on average the influence per predictor.
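The bound can be checked numerically. What follows is only a minimal sketch on simulated data, assuming a linear model with independent, unit-variance components and independent unit-variance noise; the coefficients 1/(i+1), the number of components and the sample size are hypothetical choices for illustration, not anything taken from the trial.

```python
import random
import statistics

def variance_bound_demo(m=50, n=5000, seed=1):
    """Simulate Y = sum_i k_i * C_i + noise with independent components C_i."""
    rng = random.Random(seed)
    # Hypothetical coefficients (assumption): k_i = 1 / (i + 1).
    k = [1.0 / (i + 1) for i in range(m)]
    # By construction V(C_i) = 1 and V(noise) = 1, so these are exact values.
    explained = sum(ki ** 2 for ki in k)   # sum_i k_i^2 V(C_i)
    total = explained + 1.0                # V(Y), since all terms are independent
    # Empirical variance of simulated outcomes, as a sanity check on V(Y).
    y = [
        sum(ki * rng.gauss(0, 1) for ki in k) + rng.gauss(0, 1)
        for _ in range(n)
    ]
    return explained, total, statistics.variance(y)

explained, total, empirical = variance_bound_demo()
print("explained <= total:", explained <= total)
print("explained:", round(explained, 3), "V(Y):", round(total, 3),
      "empirical V(Y):", round(empirical, 3))
```

However many components are added, the explained variance cannot exceed V(Y); under these assumptions, adding components only changes how that fixed total is shared out.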
The second error is to ignore the fact that statistical inference does not proceed on the basis of signal alone but also on noise. It is the ratio of these that is important. If there are indefinitely many predictors then there is no reason to suppose that their influence on the variation between treatment groups will be bigger than their influence on the variation within groups, and both of these are used to make the inference.
I can illustrate this by taking a famous data-set. These are the ‘Hills and Armitage enuresis data’(1). The data are from a cross-over trial so that each patient provides two measurements; one on treatment and one on placebo. This means that a huge number of factors are controlled for. For example it is sometimes claimed that there are 30,000 genes in the human genome. Thus 30,000 such potential factors (with various possible levels) are eliminated from consideration, since they do not differ between the patient when measured on placebo and the patient when measured on treatment. Of course there are many more such factors that do not differ: all life-history factors including what the patient ate every day from birth up to enrollment in the trial, all the social interactions the patient had during life to date, any chance infection the patient acquired at any point etc, etc. The only factors that can differ are transient ones during the trial, what are sometimes referred to as period level as opposed to patient level factors.
The figure below shows two randomisation analyses of the trial. One uses the fact that the trial was a cross-over trial and pairs the results patient by patient, only randomly swapping which actual observed outcome in a pair was under treatment. The other pays no attention to pairs and randomly permutes the labels, maintaining only the total numbers under treatment and placebo.
What is plotted are the distributions of the t-statistics, the signal-to-noise ratios, and it can be seen that these are remarkably similar. Controlling for these thousands and thousands of factors has made no appreciable difference to the ratio because any effect they have on the numerator is reflected in the denominator. What changes, however, is where the actual observed t-statistic is placed compared to these permutation distributions. The t-statistic conditioning on patient as a factor is 3.53 (illustrated by the rightmost vertical dashed line) and the statistic not conditioning on patient is 2.25 (illustrated by the leftmost line). This shows that if we realise that this is a cross-over trial and therefore that differences between patients can have no effect on the outcome, the observed difference is much more impressive.
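The two randomisation analyses can be sketched in code. This is not the analysis of the enuresis data, which are not reproduced here: the outcomes are simulated, and the number of patients, the treatment effect and the standard deviations are hypothetical. It is also an assumption that the paired analysis uses the paired t-statistic and the unpaired one the pooled two-sample t-statistic.

```python
import math
import random
import statistics

def paired_t(diffs):
    """t-statistic for within-patient differences."""
    return statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(len(diffs)))

def two_sample_t(a, b):
    """Pooled-variance t-statistic ignoring the pairing."""
    na, nb = len(a), len(b)
    pooled = ((na - 1) * statistics.variance(a)
              + (nb - 1) * statistics.variance(b)) / (na + nb - 2)
    return (statistics.mean(a) - statistics.mean(b)) / math.sqrt(pooled * (1 / na + 1 / nb))

def simulate_crossover(n=20, effect=2.0, patient_sd=3.0, noise_sd=1.0, seed=7):
    """Synthetic cross-over data: a large patient effect shared by both periods."""
    rng = random.Random(seed)
    treat, placebo = [], []
    for _ in range(n):
        level = rng.gauss(0, patient_sd)  # patient-level factors, same in both periods
        treat.append(level + effect + rng.gauss(0, noise_sd))
        placebo.append(level + rng.gauss(0, noise_sd))
    return treat, placebo

def permutation_distributions(treat, placebo, reps=2000, seed=11):
    rng = random.Random(seed)
    n = len(treat)
    diffs = [t - p for t, p in zip(treat, placebo)]
    pooled = treat + placebo
    paired, unpaired = [], []
    for _ in range(reps):
        # Paired: swap labels within a patient, i.e. flip the sign of the difference.
        paired.append(paired_t([d if rng.random() < 0.5 else -d for d in diffs]))
        # Unpaired: permute labels freely, keeping n per arm.
        shuffled = pooled[:]
        rng.shuffle(shuffled)
        unpaired.append(two_sample_t(shuffled[:n], shuffled[n:]))
    return paired, unpaired

treat, placebo = simulate_crossover()
paired_dist, unpaired_dist = permutation_distributions(treat, placebo)
obs_paired = paired_t([t - p for t, p in zip(treat, placebo)])
obs_unpaired = two_sample_t(treat, placebo)
print("SD of permutation t (paired, unpaired):",
      round(statistics.stdev(paired_dist), 2), round(statistics.stdev(unpaired_dist), 2))
print("observed t (paired, unpaired):", round(obs_paired, 2), round(obs_unpaired, 2))
```

In a simulation of this kind the two permutation distributions should have a similar spread, while the observed paired statistic sits further out in the tail than the unpaired one.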
This fact is reflected in the parametric analysis. In the table below, the left hand panel shows the analysis not putting patient in the model and the right hand one the analysis when patient is put in the model. On the LHS, both the t-statistic and the P-value are less impressive than on the RHS. The point estimate of 2.172 dry nights (the difference between treatment and placebo) is the same in both.
This brings me on to the third error of the ‘indefinitely many predictors’ criticism. Statistical statements are statements with uncertainty attached to them. The reason that the point estimate is the same for these two cases is that we did in fact run a cross-over trial. If we had run a parallel group trial instead with these patients, then it is unlikely that the treatment estimate delivered would have been 2.172 which it was for the cross-over. The enemy of randomisation may say this just shows how unreliable randomisation is. However, this overlooks the fact that the method of analysis delivers with it a measure of its reliability. The confidence interval for not fitting the patient effect is 0.24 to 4.10 and thus much wider than the interval of 0.91 to 3.43 when it is fitted.
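The figures quoted above can be cross-checked with a little arithmetic. The sketch below uses only the numbers reported in this post (the estimate, the two t-statistics and the two confidence intervals); the implied standard errors and interval multipliers are derived quantities, not values taken from the original tables.

```python
estimate = 2.172  # dry nights, as reported above

# Reported t-statistics and 95% confidence limits for the two analyses.
reported = {
    "patient not fitted": {"t": 2.25, "ci": (0.24, 4.10)},
    "patient fitted": {"t": 3.53, "ci": (0.91, 3.43)},
}

def summarise(estimate, t_stat, ci):
    """Recover the implied standard error and interval half-width."""
    se = estimate / t_stat                 # since t = estimate / SE
    half_width = (ci[1] - ci[0]) / 2
    midpoint = (ci[0] + ci[1]) / 2         # should match the point estimate
    return {
        "se": se,
        "half_width": half_width,
        "midpoint": midpoint,
        "multiplier": half_width / se,     # implied critical value
    }

summary = {name: summarise(estimate, v["t"], v["ci"]) for name, v in reported.items()}
for name, s in summary.items():
    print(name, {key: round(value, 3) for key, value in s.items()})
```

Both intervals are centred on the same estimate, and both implied multipliers are close to 2, so the difference in width comes almost entirely from the larger standard error when the patient effect is not fitted.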
In fact the former method has allowed for the fact that thousands and thousands of factors are not fitted. How does it do this? It does this by using the very simple trick of realising that these are only important to the extent that they affect the outcome and that variation in outcome can be measured easily within treatment groups. The theory developed by Fisher relates the extent to which the outcomes can vary between groups to the extent to which they vary within. Thus randomisation, and its associated analysis, makes an allowance for these indefinite factors.
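As a rough simulation sketch of this point (simulated parallel-group trials with no treatment effect; the sample size, the equal coefficients 1/√m and the numbers of unmeasured covariates are illustrative assumptions), one can count how many covariates end up "significantly" imbalanced and compare that with how often the standard t-test on the outcome rejects.

```python
import math
import random

T_CRIT_38 = 2.024  # two-sided 5% point of Student's t with 38 df (20 per arm)

def mean(xs):
    return sum(xs) / len(xs)

def sample_var(xs):
    mu = mean(xs)
    return sum((x - mu) ** 2 for x in xs) / (len(xs) - 1)

def two_sample_t(a, b):
    """Pooled-variance t-statistic for two groups."""
    na, nb = len(a), len(b)
    pooled = ((na - 1) * sample_var(a) + (nb - 1) * sample_var(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / math.sqrt(pooled * (1 / na + 1 / nb))

def one_trial(m, n_per_arm, rng):
    n = 2 * n_per_arm
    # m unmeasured covariates per patient, each with coefficient 1/sqrt(m).
    covs = [[rng.gauss(0, 1) for _ in range(m)] for _ in range(n)]
    y = [sum(row) / math.sqrt(m) + rng.gauss(0, 1) for row in covs]
    # Randomise patients to two arms; there is no treatment effect.
    order = list(range(n))
    rng.shuffle(order)
    arm_a, arm_b = order[:n_per_arm], order[n_per_arm:]
    t_outcome = two_sample_t([y[i] for i in arm_a], [y[i] for i in arm_b])
    # Count covariates whose own t-statistic exceeds the critical value.
    imbalanced = sum(
        abs(two_sample_t([covs[i][j] for i in arm_a],
                         [covs[i][j] for i in arm_b])) > T_CRIT_38
        for j in range(m)
    )
    return abs(t_outcome) > T_CRIT_38, imbalanced

def run(m, reps=300, n_per_arm=20, seed=3):
    rng = random.Random(seed)
    rejections, imbalanced_total = 0, 0
    for _ in range(reps):
        rejected, imbalanced = one_trial(m, n_per_arm, rng)
        rejections += rejected
        imbalanced_total += imbalanced
    return rejections / reps, imbalanced_total / reps

results = {m: run(m) for m in (1, 10, 50)}
for m, (rate, imbalanced) in results.items():
    print(f"m={m}: rejection rate={rate:.3f}, mean imbalanced covariates={imbalanced:.2f}")
```

Under these assumptions the average number of imbalanced covariates grows with m, as the critics say, yet the rejection rate of the outcome test should stay near the nominal 5%, because the same covariates inflate the within-group variation used in the denominator.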
In fact the situation is quite the opposite of what the critics of randomisation suppose. Randomisation theory does not rely on indefinitely many factors being balanced. On the contrary it relies on them not being balanced. If they were balanced then the standard analysis would be wrong. We can see this since in the case of this trial it is the RHS analysis controlling for the patient effect that is correct since the trial was actually run as a cross-over. It is the LHS analysis that is wrong. The wider confidence interval of the LHS analysis is making an allowance that is inappropriate. It is allowing for random differences between patients but such differences can have no effect on the result given the way the trial was run. Further discussion will be found in my paper ‘Seven myths of randomisation in clinical trials’(2).
1. Hills M, Armitage P. The two-period cross-over clinical trial. Br J Clin Pharmacol. 1979; 8: 7-20.
2. Senn S. Seven myths of randomisation in clinical trials. Statistics in Medicine. 2012 Dec 17.
(See also on this blog: S. Senn’s July 9, 2012 post: Randomization, ratios and rationality: rescuing the randomized clinical trial from its critics.)
Stephen: Thanks so much for your interesting post. Your points clarifying the thorny issue of randomization and statistical inference are welcome (as they so often arise, especially in philosophy). So how would you say these points relate to all the efforts pre-data to stratify, and post-data to check for homogeneity? In a discussion of one of the early studies on birth-control pills and clotting disorders (chapter 5 of EGEK), I noted the analyses carried out, post-data, to ensure that the treated and control women were sufficiently homogeneous with respect to the chance of a blood-clotting disorder, by a series of null hypothesis tests on age, number of pregnancies, income etc. No statistical significances were found. Are you saying it wouldn’t have mattered (for the validity of testing the effect of the pill on clotting disorders) if they had been found, but at most the accuracy or the like? In this connection, I’m not completely sure I get the point as to why RCTs depend on the group being imbalanced, in general.
I read your paper on the seven myths and gained fresh perspective. I have a quibble, so naturally I’m going to go on and on about the part I disagree with, passing over the other unobjectionable (indeed, edifying) content in silence.
My quibble is in regard to your comments on Lindley. In my view, he has the right of it, and your criticism, while not wrong, is shallow; it takes little effort to answer it.
Rather than getting hung up on the word “haphazard”, let’s see what we can make of the question of whether the lady is “reasonably entitled to make the assumption of exchangeability”. Suppose that our experimental design scheme is to take the previous day’s horse racing program (without bookies’ odds) and construct a map from possible outcomes to the 70 possible tea experiment designs. We suppose that the lady in question is fully informed about the mapping, but, like us, does not know the races’ outcomes or the odds that were offered. We then go, find the race results, mix the tea, and have it sent out to her. Is she entitled to make the assumption of exchangeability?
It depends. Perhaps she is an aficionado of horse race betting; although the previous day’s results are unknown to her, she knows the horses and can set her own reasonably well-informed odds.
So we come up with another scheme. Instead of yesterday’s race results, we’ll come up with a mapping between the lengths of the reigns of the rulers of the Goryeo dynasty (from which modern-day Korea gets its name) and the 70 possible designs. That should be sufficiently haphazard, no?
But what if she’s an historian specializing in the postclassical Far East? Let’s instead use the digits of the permittivity of free space. But perhaps she’s a physicist, so let’s use some far digits of pi. But maybe she’s a mnemonist! All right, fine, let’s just do the randomized design. As long as she isn’t psychic…
But it turns out she’s no gambler, historian, physicist, nor mnemonist (and not psychic): she was a statistician all along. For her, race horses are exchangeable; she doesn’t know the Silla kingdom from the Joseon dynasty; she has no need of the vacuum permittivity constant in her work, and she’s too sensible to spend time memorizing digits of pi. Any of our schemes would have supplied the necessary conditions. Randomization’s virtue, as you note in the paper, is that it’s a particularly cheap and robust way of making sure there’s no extraneous information being used to make the predictions.