Error Statistics Philosophy

Stephen Senn: Indefinite irrelevance

Stephen Senn
Head, Methodology and Statistics Group,
Competence Center for Methodology and Statistics (CCMS),
Luxembourg

At a workshop on randomisation I attended recently I was depressed to hear what I regard as hackneyed untruths treated as if they were important objections. One of these is that of indefinitely many confounders. The argument goes that although randomisation may make it probable that some confounders are reasonably balanced between the arms, since there are indefinitely many of these, the chance that at least some are badly confounded is so great as to make the procedure useless.

This argument is wrong for several related reasons. The first is that the total effect of these indefinitely many confounders is bounded. The argument is thus false in the same way as a claim that the infinite series ½, ¼, ⅛, … does not sum to a limit because it has infinitely many terms. The fact is that the outcome value one wishes to analyse imposes a limit on the possible influence of the covariates. Suppose that we were able to measure a number of covariates on a set of patients prior to randomisation (in fact this is usually not possible, but that does not matter here). Now construct principal components, C1, C2, …, based on these covariates. We suppose that each of these predicts, to a greater or lesser extent, the outcome, Y (say). In a linear model we could put coefficients on these components, k1, k2, … (say). However, one is not free to postulate anything at all by way of values for these coefficients, since it has to be the case for any set of m such coefficients that

$$\sum_{i=1}^{m} k_i^2 \, V(C_i) \le V(Y),$$

where V( ) indicates variance of. Thus variation in outcome bounds variation in prediction. This total variation in outcome has to be shared between the predictors, and the more predictors you postulate there are, the less on average the influence per predictor.
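The reasoning behind this bound can be spelled out in a few lines. What follows is a sketch under stated assumptions: the components are mutually uncorrelated (as principal components are by construction), the coefficients are those of the least-squares linear fit, and e denotes the residual of that fit. The symbol e is introduced here for illustration and does not appear in the original argument.

```latex
\begin{align*}
Y &= \sum_{i=1}^{m} k_i C_i + e,
  \qquad \operatorname{Cov}(C_i, C_j) = 0 \ (i \ne j),
  \quad \operatorname{Cov}(C_i, e) = 0, \\
V(Y) &= \sum_{i=1}^{m} k_i^2 \, V(C_i) + V(e)
  \;\ge\; \sum_{i=1}^{m} k_i^2 \, V(C_i), \\
\frac{1}{m} \sum_{i=1}^{m} k_i^2 \, V(C_i) &\le \frac{V(Y)}{m}.
\end{align*}
```

The last line is the sense in which the average contribution per predictor must shrink as m grows: the right-hand side tends to zero for fixed V(Y).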

The second error is to ignore the fact that statistical inference does not proceed on the basis of signal alone but also on noise. It is the ratio of these that is important. If there are indefinitely many predictors then there is no reason to suppose that their influence on the variation between treatment groups will be bigger than their variation within groups and both of these are used to make the inference.

I can illustrate this by taking a famous data-set. These are the ‘Hills and Armitage enuresis data’(1). The data are from a cross-over trial so that each patient provides two measurements; one on treatment and one on placebo. This means that a huge number of factors are controlled for. For example it is sometimes claimed that there are 30,000 genes in the human genome. Thus 30,000 such potential factors (with various possible levels) are eliminated from consideration, since they do not differ between the patient when measured on placebo and the patient when measured on treatment. Of course there are many more such factors that do not differ: all life-history factors including what the patient ate every day from birth up to enrollment in the trial, all the social interactions the patient had during life to date, any chance infection the patient acquired at any point etc, etc. The only factors that can differ are transient ones during the trial, what are sometimes referred to as period level as opposed to patient level factors.

The figure below shows two randomisation analyses of the trial. One uses the fact that the trial was a cross-over trial and pairs the results patient by patient, randomly swapping only which observed outcome in a pair was under treatment. The other pays no attention to pairs and randomly permutes the labels, maintaining only the total numbers under treatment and placebo.

What is plotted are the distributions of the t-statistics, the signal-to-noise ratios, and it can be seen that these are remarkably similar. Controlling for these thousands and thousands of factors has made no appreciable difference to the ratio because any effect they have on the numerator is reflected in the denominator. What changes, however, is where the actual observed t-statistic is placed compared to these permutation distributions. The t-statistic conditioning on patient as a factor is 3.53 (illustrated by the rightmost vertical dashed line) and the statistic not conditioning on patient is 2.25 (illustrated by the leftmost line). This shows that if we realise that this is a cross-over trial, and therefore that differences between patients can have no effect on the outcome, the observed difference is much more impressive.
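The two randomisation schemes can be sketched in a few lines of Python. This is an illustration only: it uses simulated data, not the Hills and Armitage data, and the sample size, effect size, standard deviations, seed and all function names are assumptions made up for the example. Patient-level variation is deliberately made large relative to period-level variation, to mimic the situation in which many fixed patient factors matter.

```python
import random
from math import sqrt

def mean(xs):
    return sum(xs) / len(xs)

def sample_var(xs):
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

def paired_t(treat, plac):
    # t-statistic based on within-patient differences
    d = [t - p for t, p in zip(treat, plac)]
    return mean(d) / sqrt(sample_var(d) / len(d))

def two_sample_t(a, b):
    # pooled-variance t-statistic that ignores the pairing
    na, nb = len(a), len(b)
    pooled = ((na - 1) * sample_var(a) + (nb - 1) * sample_var(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / sqrt(pooled * (1 / na + 1 / nb))

def simulate_crossover(n, effect, patient_sd, period_sd, rng):
    # hypothetical data: one fixed patient effect shared by both periods
    treat, plac = [], []
    for _ in range(n):
        patient = rng.gauss(0, patient_sd)
        plac.append(patient + rng.gauss(0, period_sd))
        treat.append(patient + effect + rng.gauss(0, period_sd))
    return treat, plac

def paired_permutation(treat, plac, reps, rng):
    # randomly swap the treatment/placebo labels within each patient
    stats = []
    for _ in range(reps):
        t_new, p_new = [], []
        for t, p in zip(treat, plac):
            if rng.random() < 0.5:
                t, p = p, t
            t_new.append(t)
            p_new.append(p)
        stats.append(paired_t(t_new, p_new))
    return stats

def unpaired_permutation(treat, plac, reps, rng):
    # permute labels freely, keeping only the group sizes fixed
    pooled = treat + plac
    n = len(treat)
    stats = []
    for _ in range(reps):
        rng.shuffle(pooled)
        stats.append(two_sample_t(pooled[:n], pooled[n:]))
    return stats

def quantile(xs, q):
    s = sorted(xs)
    return s[int(q * (len(s) - 1))]

rng = random.Random(1)
treat, plac = simulate_crossover(30, 2.0, 3.0, 1.0, rng)

obs_paired = paired_t(treat, plac)
obs_unpaired = two_sample_t(treat, plac)
q_paired = quantile(paired_permutation(treat, plac, 2000, rng), 0.975)
q_unpaired = quantile(unpaired_permutation(treat, plac, 2000, rng), 0.975)

print("97.5% point, paired permutation:  ", round(q_paired, 2))
print("97.5% point, unpaired permutation:", round(q_unpaired, 2))
print("observed t, paired:  ", round(obs_paired, 2))
print("observed t, unpaired:", round(obs_unpaired, 2))
```

Under these assumptions the upper percentage points of the two permutation distributions should come out close to each other, while the observed paired statistic sits much further out than the unpaired one. Exact values depend on the seed and the simulation settings.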

This fact is reflected in the parametric analysis. In the table below, the left-hand panel shows the analysis without patient in the model and the right-hand panel the analysis with patient in the model. On the LHS, both the t-statistic and the P-value are less impressive than on the RHS. The point estimate of 2.172 dry nights (the difference between treatment and placebo) is the same in both.

   

This brings me on to the third error of the ‘indefinitely many predictors’ criticism. Statistical statements are statements with uncertainty attached to them. The reason that the point estimate is the same for these two cases is that we did in fact run a cross-over trial. If we had run a parallel group trial instead with these patients, then it is unlikely that the treatment estimate delivered would have been 2.172, as it was for the cross-over. The enemy of randomisation may say that this just shows how unreliable randomisation is. However, this overlooks the fact that the method of analysis delivers with it a measure of its reliability. The confidence interval when the patient effect is not fitted is 0.24 to 4.10, and thus much wider than the interval of 0.91 to 3.43 when it is fitted.
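The figures quoted in this post are enough to check that the wider interval comes from a larger standard error and not from a different estimate. The short Python sketch below uses only numbers reported here (the estimate, the two t-statistics and the two intervals) and assumes the usual relationship t = estimate / standard error. The variable and function names are illustrative.

```python
# Values reported in the post
estimate = 2.172
t_unpaired, t_paired = 2.25, 3.53
ci_unpaired = (0.24, 4.10)   # patient effect not fitted
ci_paired = (0.91, 3.43)     # patient effect fitted

def implied_se(est, t):
    # assumes t = estimate / standard error
    return est / t

def half_width(ci):
    return (ci[1] - ci[0]) / 2

def midpoint(ci):
    return (ci[0] + ci[1]) / 2

se_unpaired = implied_se(estimate, t_unpaired)
se_paired = implied_se(estimate, t_paired)

# Multiplier applied to the standard error to obtain the interval half-width
mult_unpaired = half_width(ci_unpaired) / se_unpaired
mult_paired = half_width(ci_paired) / se_paired

print("implied SE, not fitted:", round(se_unpaired, 3))
print("implied SE, fitted:    ", round(se_paired, 3))
print("SE ratio:              ", round(se_unpaired / se_paired, 2))
print("multipliers:           ", round(mult_unpaired, 2), round(mult_paired, 2))
print("interval midpoints:    ", round(midpoint(ci_unpaired), 2), round(midpoint(ci_paired), 2))
```

Both intervals are centred on the same estimate (to the rounding of the reported limits) and use a multiplier of about 2. Almost all of the difference in width comes from the standard error, which is roughly one and a half times larger when the patient effect is not fitted.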

In fact the former method has allowed for the fact that thousands and thousands of factors are not fitted. How does it do this? It does this by using the very simple trick of realising that these are only important to the extent that they affect the outcome and that variation in outcome can be measured easily within treatment groups. The theory developed by Fisher relates the extent to which the outcomes can vary between groups to the extent to which they vary within. Thus randomisation, and its associated analysis, makes an allowance for these indefinite factors.
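A small simulation can illustrate this allowance. The Python sketch below is a hypothetical parallel-group set-up, not an analysis of any real trial: each patient's outcome is driven by many hidden covariates that are never measured or fitted, there is no treatment effect, and the same patients are re-randomised many times. The number of patients, the number of hidden covariates, the seed and the function names are all assumptions for illustration. The critical value 2.024 is the approximate two-sided 5% point of Student's t with 38 degrees of freedom.

```python
import random
from math import sqrt

T_CRIT_38 = 2.024  # approx. two-sided 5% critical value, Student's t, 38 df

def mean(xs):
    return sum(xs) / len(xs)

def sample_var(xs):
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

def two_sample_t(a, b):
    # pooled-variance two-sample t-statistic
    na, nb = len(a), len(b)
    pooled = ((na - 1) * sample_var(a) + (nb - 1) * sample_var(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / sqrt(pooled * (1 / na + 1 / nb))

def make_patients(n, n_hidden, rng):
    # each outcome is the sum of many unmeasured covariate contributions plus noise
    outcomes = []
    for _ in range(n):
        hidden = sum(rng.gauss(0, 1) for _ in range(n_hidden))
        outcomes.append(hidden + rng.gauss(0, 1))
    return outcomes

def rejection_rate(outcomes, reps, rng):
    # re-randomise the same patients to two equal arms; no treatment effect is added
    n = len(outcomes)
    idx = list(range(n))
    rejections = 0
    for _ in range(reps):
        rng.shuffle(idx)
        arm_a = [outcomes[i] for i in idx[: n // 2]]
        arm_b = [outcomes[i] for i in idx[n // 2 :]]
        if abs(two_sample_t(arm_a, arm_b)) > T_CRIT_38:
            rejections += 1
    return rejections / reps

rng = random.Random(2012)
outcomes = make_patients(40, 200, rng)
rate = rejection_rate(outcomes, 4000, rng)
print("proportion of randomisations declared significant at 5%:", rate)
```

In any single randomisation some of the 200 hidden covariates will be imbalanced between the arms, yet under these assumptions the proportion of false positives should stay close to the nominal 5%. The hidden covariates inflate the within-group variance in the denominator to the same extent that their imbalance perturbs the numerator.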

In fact the situation is quite the opposite of what the critics of randomisation suppose. Randomisation theory does not rely on indefinitely many factors being balanced. On the contrary it relies on them not being balanced. If they were balanced then the standard analysis would be wrong. We can see this since in the case of this trial it is the RHS analysis controlling for the patient effect that is correct since the trial was actually run as a cross-over. It is the LHS analysis that is wrong. The wider confidence interval of the LHS analysis is making an allowance that is inappropriate. It is allowing for random differences between patients but such differences can have no effect on the result given the way the trial was run. Further discussion will be found in my paper ‘Seven myths of randomisation in clinical trials’(2).

References:

1.  Hills M, Armitage P. The two-period cross-over clinical trial. Br J Clin Pharmacol. 1979; 8: 7-20.

2.  Senn S. Seven myths of randomisation in clinical trials. Statistics in Medicine. 2012 Dec 17.

(See also on this blog: S. Senn’s July 9, 2012 post, “Randomization, ratios and rationality: rescuing the randomized clinical trial from its critics”.)