The idea of local control is fundamental to the design and analysis of experiments and contributes greatly to a design’s efficiency. In clinical trials such control is often accompanied by randomisation, and the way that the randomisation is carried out has a close relationship to how the analysis should proceed. For example, if a parallel group trial is carried out in different centres, but randomisation is ‘blocked’ by centre then, logically, centre should be in the model (Senn, S. J. & Lewis, R. J., 2019). On the other hand, if all the patients in a given centre are allocated the same treatment at random, as in a so-called cluster randomised trial, then the fundamental unit of inference becomes the centre and patients are regarded as repeated measures on it. In other words, the way in which the allocation has been carried out affects the degree of matching that has been achieved and this, in turn, is related to the analysis that should be employed. A previous blog of mine, To Infinity and Beyond, discusses the point.
In all of this, balance, or rather the degree of it, plays a fundamental part, if not the one that many commentators assume. Balance of prognostic factors is often taken as being necessary to avoid bias. In fact, it is not necessary. For example, suppose we wished to eliminate the effect of differences between centres in a clinical trial but had not, in fact, blocked by centre. We would then, just by chance, have some centres in which numbers of patients on treatment and control differed. The simple difference of the two means for the trial as a whole would then have some influence from the centres, which might be regarded as biasing. However, these effects can be eliminated by the simple stratagem of analysing the data in two stages. In the first stage we compare the means under treatment and control within each centre. In the second stage we combine these differences across the centres, weighting them according to the amount of information provided. In fact, including centre as a factor in a linear model to analyse the effect of treatment achieves the same result as this two-stage approach.
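The equivalence of the two-stage approach and the linear model can be checked numerically. What follows is only a minimal sketch under stated assumptions: the data are invented for illustration, a common within-centre variance is assumed (so that the information in a centre's difference is proportional to n0·n1/(n0 + n1)), and the centre-adjusted estimate is obtained by centring treatment and outcome within each centre, which by the standard Frisch-Waugh-Lovell result gives the same treatment coefficient as fitting centre as a factor.

```python
from collections import defaultdict

# Hypothetical data for illustration only: (centre, treated flag, outcome).
# Allocation is deliberately unbalanced within centres A and B.
data = [
    ("A", 1, 5.1), ("A", 1, 4.7), ("A", 0, 3.9),
    ("B", 1, 6.2), ("B", 0, 5.0), ("B", 0, 5.4), ("B", 0, 4.8),
    ("C", 1, 2.9), ("C", 1, 3.3), ("C", 0, 2.1), ("C", 0, 2.5),
]


def mean(values):
    return sum(values) / len(values)


def naive_estimate(rows):
    """Simple difference of the two overall means, ignoring centre."""
    treated = [y for _, t, y in rows if t == 1]
    control = [y for _, t, y in rows if t == 0]
    return mean(treated) - mean(control)


def two_stage_estimate(rows):
    """Stage 1: difference within each centre. Stage 2: weighted average.

    Assumes every centre has at least one patient in each arm and a
    common within-centre variance, so the weight is n0*n1/(n0+n1).
    """
    by_centre = defaultdict(lambda: {0: [], 1: []})
    for centre, treated, y in rows:
        by_centre[centre][treated].append(y)
    numerator = denominator = 0.0
    for arms in by_centre.values():
        n0, n1 = len(arms[0]), len(arms[1])
        difference = mean(arms[1]) - mean(arms[0])
        weight = n0 * n1 / (n0 + n1)
        numerator += weight * difference
        denominator += weight
    return numerator / denominator


def centre_adjusted_estimate(rows):
    """Treatment coefficient from a linear model with centre as a factor,
    computed by centring treatment and outcome within each centre."""
    by_centre = defaultdict(list)
    for centre, treated, y in rows:
        by_centre[centre].append((treated, y))
    sxy = sxx = 0.0
    for observations in by_centre.values():
        t_bar = mean([t for t, _ in observations])
        y_bar = mean([y for _, y in observations])
        for t, y in observations:
            sxy += (t - t_bar) * (y - y_bar)
            sxx += (t - t_bar) ** 2
    return sxy / sxx


print("naive difference of means:", naive_estimate(data))
print("two-stage estimate:       ", two_stage_estimate(data))
print("centre-adjusted estimate: ", centre_adjusted_estimate(data))
```

In this invented example the two adjusted estimates agree to rounding error, while the naive difference of means is pulled away from them by the imbalance across centres.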
This raises the issue, ‘what is the value of balance?’. The answer is that, other things being equal, balanced allocations are more efficient in that they lead to lower variances. This follows from the fact that the variance of a contrast based on two means is

$$\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}, \tag{1}$$

where $\sigma_1^2$, $\sigma_2^2$ are the variances in the two groups being compared and $n_1$, $n_2$ the two sample sizes. In an experimental context, it is often reasonable to proceed as if $\sigma_1^2 = \sigma_2^2$ so that, writing $\sigma^2$ for each variance, we have an expression for the variance of the contrast of

$$\sigma^2\left(\frac{1}{n_1} + \frac{1}{n_2}\right). \tag{2}$$
Now consider the successive ratios 1, 1/2, 1/3, …, 1/n. Each term is smaller than the preceding term. However, the amount by which a term is smaller is less than the amount by which the preceding term was smaller than the term that preceded it. For example, 1/3 − 1/4 = 1/12 but 1/2 − 1/3 = 1/6. In general we have 1/n − 1/(n+1) = 1/[n(n+1)], which clearly reduces with increasing n. It thus follows that if an extra observation can be added to construct such a contrast, it will have the greater effect on reducing the variance of that contrast if it is added to the group that has the fewer observations. This in turn implies, other things being equal, that balanced contrasts are more efficient.
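The arithmetic can be made concrete with a few lines of code. This is a sketch only: the function names and the group sizes (3 and 10, and a total of 20) are chosen purely for illustration, and the variance is expressed in units of the common within-group variance.

```python
from fractions import Fraction


def variance_factor(n1, n2):
    """Return 1/n1 + 1/n2: the contrast variance in units of sigma^2."""
    return Fraction(1, n1) + Fraction(1, n2)


def reduction_from_adding_one(n1, n2, group):
    """Drop in the variance factor when one observation joins `group` (1 or 2)."""
    before = variance_factor(n1, n2)
    if group == 1:
        after = variance_factor(n1 + 1, n2)
    else:
        after = variance_factor(n1, n2 + 1)
    return before - after


# Illustrative sizes: group 1 has 3 observations, group 2 has 10.
print(reduction_from_adding_one(3, 10, group=1))  # gain from adding to the smaller group
print(reduction_from_adding_one(3, 10, group=2))  # gain from adding to the larger group

# For a fixed total of 20 observations, search for the split with lowest variance.
total = 20
best_split = min(range(1, total), key=lambda n1: variance_factor(n1, total - n1))
print(best_split, variance_factor(best_split, total - best_split))
```

Adding the extra observation to the group of 3 reduces the factor by 1/12, whereas adding it to the group of 10 reduces it by only 1/110, and for a fixed total of 20 the equal split of 10 and 10 gives the smallest value, 1/5.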
Exploiting the ex-external
However, it is often the case in a randomised clinical trial of a new treatment that a potential control treatment has been much studied in the past. Thus, many more observations, albeit of a historical nature, are available for the control treatment than the experimental one. This in turn suggests that if the argument that balanced datasets are better is used, we should now allocate more patients, and perhaps even all that are available, to the experimental arm. In fact, things are not so simple.
First, it should be noted that if blinding of patients and treating physicians to the treatment being given is considered important, this cannot be convincingly implemented unless randomisation is employed (Senn, S. J., 1994). I have discussed the way that this may have to proceed in a previous blog, Placebos: it’s not only the patients that are fooled. In what follows, however, I am going to assume that blinding is unimportant and consider other problems with using historical controls.
When historical controls are used there are two common strategies. The first is to regard the historical controls as providing an external standard which may be regarded as having negligible error and to use it, therefore, as an unquestionably valid reference. If significance tests are used, a one-sample test is applied to compare the experimental mean to the historical standard. The second is to treat historical controls as if they were concurrent controls and to carry out the statistical analysis that would be relevant were this the case. Both of these are inadequate. Once I have considered them, I shall turn to a third approach that might be acceptable.
A standard error
If an experimental group is compared to a historical standard, as if that standard were currently appropriate and established without error, an implicit analogy is being made to a parallel group trial with a control group arm of infinite size. This can be seen by looking at formula (2). Suppose that we let the first group be the control group and the second one the experimental group. As $n_1 \to \infty$, formula (2) will approach $\sigma^2/n_2$, which is, in fact, the formula we intend to use.
Figure 1 shows the variance that this approach uses as a horizontal red line and the variance that would apply to a parallel group trial as a blue line. The experimental group size has been set at 100 and the control group sample size varies from 100 to 2000. The within-group variance has been set to $\sigma^2 = 1$. It can be seen that this approach of the historical standard considerably underestimates the variance that will apply. In fact, even the formula given by the blue line will underestimate the variance, as we shall explain below.
It thus follows that assessing the effect from a single arm given an experimental treatment by comparison to a value from historical controls, but using a formula for the standard error of $\sigma/\sqrt{n_2}$, where $\sigma$ is the within-treated-group standard deviation and $n_2$ is the number of patients, will underestimate the uncertainty in this comparison.
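The comparison behind Figure 1 can be tabulated directly. The following is a minimal sketch using the settings described for the figure (100 experimental patients, a within-group variance of 1, and between 100 and 2000 controls); the function names and the particular control sizes printed are my own choices for illustration.

```python
def historical_standard_variance(n_experimental, sigma2=1.0):
    """Variance used when the historical value is treated as known without error."""
    return sigma2 / n_experimental


def parallel_group_variance(n_control, n_experimental, sigma2=1.0):
    """Variance of the contrast if the controls were concurrent: sigma^2 (1/n1 + 1/n2)."""
    return sigma2 * (1.0 / n_control + 1.0 / n_experimental)


n_experimental = 100
standard = historical_standard_variance(n_experimental)

print("controls  standard  parallel  ratio")
for n_control in (100, 200, 500, 1000, 2000):
    parallel = parallel_group_variance(n_control, n_experimental)
    print(f"{n_control:>8}  {standard:8.4f}  {parallel:8.4f}  {parallel / standard:5.2f}")
```

With these settings the parallel group variance falls from 0.02 with 100 controls to 0.0105 with 2000, but it always stays above the value of 0.01 used by the historical-standard approach.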
A common alternative is to treat the historical data as if they came concurrently from a parallel group trial. This overlooks many matters, not least of which is that in many cases the data will have come from completely different centres and, whether or not they came from different centres, they came from different studies. That being so, the nearest analogue of a randomised trial is not a parallel group trial but a cluster randomised trial with study as the unit of clustering. The general set-up is illustrated in Figure 2. This shows a comparison of data taken from seven historical studies of a control treatment (C) and one new study of an experimental treatment (E).
This means that there is a between-study variance that has to be added to the within-study variances.
The consequence is that the control variance is not just a function of the number of patients but also of the number of studies. Suppose there are $k$ such studies; then, even if each of these studies has a huge number of patients, the variance of the control mean cannot be less than $\Upsilon^2/k$, where $\Upsilon^2$ is the between-study variance. However, there is worse to come. The study of the new experimental treatment also has a between-study contribution, but since there is only one such study its variance is $\Upsilon^2/1 = \Upsilon^2$. The result is that a lower bound for the variance of the contrast using historical data is

$$\frac{\Upsilon^2}{k} + \Upsilon^2 = \Upsilon^2\left(1 + \frac{1}{k}\right). \tag{3}$$
It turns out that the variance of the treatment contrast decreases disappointingly according to the number of clusters you can muster. Of course, in practice, things are worse, since all of this is making the optimistic assumption that historical studies are exchangeable with the current one (Collignon, O. et al., 2019; Schmidli, H. et al., 2014).
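How slowly the bound falls can be seen by computing it for a few values of the number of historical studies. The sketch below rests on assumptions that should be flagged: the function names are mine, the second function assumes equal-sized historical studies and a common within-study variance (simplifications introduced only for illustration), and the numerical values passed to it are invented.

```python
from fractions import Fraction


def lower_bound_factor(k):
    """Lower bound on the contrast variance, as a multiple of the
    between-study variance: 1/k (k control studies) + 1 (one new study)."""
    return Fraction(1, k) + 1


def contrast_variance(k, patients_per_study, n_experimental, within2, between2):
    """Contrast variance with within-study terms included.

    Assumes (for illustration) k equal-sized historical studies and a
    common within-study variance `within2`; `between2` is the
    between-study variance.
    """
    control = between2 / k + within2 / (k * patients_per_study)
    experimental = between2 + within2 / n_experimental
    return control + experimental


for k in (1, 2, 7, 19, 100):
    print(k, lower_bound_factor(k), float(lower_bound_factor(k)))

# With enormous studies the within-study terms vanish and only the bound remains.
print(contrast_variance(7, 10**9, 10**9, within2=1.0, between2=0.5))
```

With the seven historical studies of Figure 2 the bound is 8/7 of the between-study variance, and however many historical studies are added it can never fall below the between-study variance itself, because the single new study contributes that amount on its own.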
Optimists may ask, however, whether this is not all a fuss about nothing. The theory indicates that this might be a problem, but is there anything in practice to indicate that it is? Unfortunately, yes. The TARGET study provides a good example of the sort of difficulties encountered in practice (Senn, S. J., 2008). This was a study comparing lumiracoxib, ibuprofen and naproxen in osteoarthritis. For practical reasons, centres were either enrolled in a sub-study comparing lumiracoxib to ibuprofen or one comparing lumiracoxib to naproxen. There were considerable differences between sub-studies in terms of baseline characteristics but not within sub-studies, and there were even differences at outcome for lumiracoxib depending on which sub-study patients were enrolled in. This was not a problem for the way the trial was analysed, since it was foreseen from the outset, but it provides a warning that differences between studies may be important.
Another example is provided by Collignon, O. et al. (2019). Looking at historical data on acute myeloid leukaemia (AML), they identified 19 studies of a proposed control treatment Azacitidine. However, the variation from study to study was such that the 1279 subjects treated in these studies would only provide, in the best of cases, as much information as 50 patients studied concurrently.
How have we done in the age of COVID? Not always very well. To give an example, a trial that received much coverage was one of hydroxychloroquine in the treatment of patients suffering from coronavirus infection (Gautret, P. et al., 2020). The trial was in 20 patients and “Untreated patients from another center and cases refusing the protocol were included as negative controls.” The senior author, Didier Raoult, later complained of the ‘invasion of methodologists’, blamed them and the pharmaceutical industry for a ‘moral dictatorship’ that physicians should resist, and compared modellers to astrologers (Nau, J.-Y., 2020).
However, the statistical analysis section of the paper has the following to say:
Statistical differences were evaluated by Pearson’s chi-square or Fisher’s exact tests as categorical variables, as appropriate. Means of quantitative data were compared using Student’s t-test.
Now, Karl Pearson, RA Fisher and Student were all methodologists. So, Gautret, P. et al. (2020) do not appear to be eschewing the work of methodologists, far from it. They are merely choosing to use this work inappropriately. But nature is a hard task-mistress and if outcome varies considerably amongst those infected with COVID-19, and we know it does, and if patients vary from centre to centre, and we know they do, then variation from centre to centre cannot be ignored and trials in which patients have not been randomised concurrently cannot be analysed as if they were. Fisher’s exact test, Pearson’s chi-square and Student’s t will underestimate the variation.
The moral dictatorship of methodology
Methodologists are, indeed, moral dictators. If you do not design your investigations carefully, you are on the horns of a dilemma. Either you carry out simplistic analyses that are simply wrong or you are condemned to using complex and often unconvincing modelling. Far from banishing the methodologists, you are holding the door wide open to let them in.
This is based on work that was funded by grant 602552 for the IDEAL project under the European Union FP7 programme and support from the programme is gratefully acknowledged.
Collignon, O., Schritz, A., Senn, S. J., & Spezia, R. (2019). Clustered allocation as a way of understanding historical controls: Components of variation and regulatory considerations. Statistical Methods in Medical Research, 962280219880213
Gautret, P., Lagier, J. C., Parola, P., Hoang, V. T., Meddeb, L., Mailhe, M., . . . Raoult, D. (2020). Hydroxychloroquine and azithromycin as a treatment of COVID-19: results of an open-label non-randomized clinical trial. Int J Antimicrob Agents, 105949
Nau, J.-Y. (2020). Hydroxychloroquine : le Pr Didier Raoult dénonce la «dictature morale» des méthodologistes. Retrieved from https://jeanyvesnau.com/2020/03/28/hydroxychloroquine-le-pr-didier-raoult-denonce-la-dictature-morale-des-methodologistes/
Schmidli, H., Gsteiger, S., Roychoudhury, S., O’Hagan, A., Spiegelhalter, D., & Neuenschwander, B. (2014). Robust meta‐analytic‐predictive priors in clinical trials with historical control information. Biometrics, 70(4), 1023-1032
Senn, S. J. (2008). Lessons from TGN1412 and TARGET: implications for observational studies and meta-analysis. Pharmaceutical Statistics, 7, 294-301
Senn, S. J. (1994). Fisher’s game with the devil. Statistics in Medicine, 13(3), 217-230
Senn, S. J., & Lewis, R. J. (2019). Treatment Effects in Multicenter Randomized Clinical Trials. JAMA
I’m extremely grateful for this guest post from you. Gelman has a blogpost today asking whether we would be, or would have been, better off without randomized controlled trials, a question that might be answerable if we could do a controlled trial. His view is that statistical significance tests (which he used to endorse) are so bad that the net value of controlled trials might be negative. But if he agrees, as he claims to, that RCTs have value, then he is endorsing the counterfactual reasoning they enable in learning about effects of interventions. Non-randomized methods for causal inquiry succeed, where they do, by mimicking the counterfactual knowledge gained by randomization (in all of its variants). I think he is mainly being provocative.
It must be fascinating for you, with the knowledge you have, to watch the methodology in today’s covid-19 research. It would be great to know what you think of some of the other covid-19 appraisals coming down the pike, and the manner in which controlled trials are being relied upon. I’ll send some examples when I have them.
I’ll have to think some more about your last paragraph and methodologists being moral dictators.
Thank you Stephen Senn for your fine review of proper analysis of blocked designs that include fixed and random (repeated measures) factors, and the consequences of balanced and unbalanced designs with respect to variance.
These issues do matter, despite Raoult’s complaints and Gelman’s provocations.
As to Gelman, the less said the better.
As to Raoult’s comparison of modelers to astrologers, all I can say is “Project much?” The howlings of so many unqualified or misguided coronavirus commenters reflect more upon their own fears and failings, and are not exactly helpful in the midst of a pandemic. Viruses of course care little about the egos of the unqualified who merely seek the limelight, and those of us who fall for such rhetoric also risk falling prey to the little ball of RNA.
By now the evidence is clear that the effects of this virus are not the same as influenza, and it isn’t going away in the summer heat as another seeker of the limelight opined just weeks ago. A resurgence of infections in The Sun Belt of America in the middle of the summer has yet to convince the sycophantic followers of said limelight seeker that scientists and modelers have much better advice and guidance, as Senn has provided here.
Anyone evaluating the next un-peer-reviewed offerings on bioRxiv would do well to keep Senn’s points in mind while doing so.