The intellectual illness of clinical drug evaluation that I have discussed here can be cured, and it will be cured when we restore intellectual primacy to the questions we ask, not the methods by which we answer them. Lewis Sheiner1
Cause for concern
In their recent essay Causal Evidence and Dispositions in Medicine and Public Health2, Elena Rocca and Rani Lill Anjum challenge, ‘the epistemic primacy of randomised controlled trials (RCTs) for establishing causality in medicine and public health’. That an otherwise stimulating essay by two philosophers, experts on causality, which makes many excellent points on the nature of evidence, repeats a common misunderstanding about randomised clinical trials, is grounds enough for me to address this topic again. Before, however, explaining why I disagree with Rocca and Anjum on RCTs, I want to make clear that I agree with much of what they say. I loathe these pyramids of evidence, beloved by some members of the evidence-based movement, which have RCTs at the apex or possibly occupying a second place just underneath meta-analyses of RCTs. In fact, although I am a great fan of RCTs and (usually) of intention to treat analysis, I am convinced that RCTs alone are not enough. My thinking on this was profoundly affected by Lewis Sheiner’s essay of nearly thirty years ago (from which the quote at the beginning of this blog is taken). Lewis was interested in many aspects of investigating the effects of drugs and would, I am sure, have approved of Rocca and Anjum’s insistence that there are many layers of understanding how and why things work, and that means of investigating them may have to range from basic laboratory experiments to patient narratives via RCTs. Rocca and Anjum’s essay provides a good discussion of the various ‘causal tasks’ that need to be addressed and backs this up with some excellent examples.
It’s not about balance and it’s not about homogeneity
In discussing RCTs Rocca and Anjum write
‘…any difference in outcome between the test group and the control group should be caused by the tested interventions, since all other differences should be homogenously distributed between the two groups,’
‘The experimental design is intended to minimise complexity—for instance, through strict inclusion and exclusion criteria’.
However, it is not the case that randomisation will guarantee that any difference between the groups should be caused by the intervention. On the contrary, many things apart from the treatment will affect the observed difference. And it is not the case that the analysis of RCTs requires the minimisation of complexity. Randomisation and its associated analysis deal with complexity in the experimental material and, although the treatment structure in RCTs is often simple, this is not always so (I give an example below) and it was not so in the field (literally) of agriculture for which Fisher developed his theory of randomisation. This is what Fisher himself had to say about complexity:
No aphorism is more frequently repeated in connection with field trials, than that we must ask Nature few questions, or ideally one question, at a time. The writer is convinced that this view is wholly mistaken. Nature, he suggests, will best respond to a logical and carefully thought out questionnaire; indeed, if we ask her a single question, she will often refuse to answer until some other topic has been discussed.3 (p. 511)
This 1926 paper of Fisher’s is an important and early statement of his views on randomisation and was cited recently by Simon Raper in his article in Significance4. Raper points out that Fisher was abandoning as unworkable an earlier view of causality due to John Stuart Mill, whereby controlling for everything imaginable was the way you made valid causal judgements. I consider Raper is right in thinking of Fisher’s approach as an alternative to Mill’s programme, rather than some realisation of it, so I disagree, for example, with Mumford and Anjum in their book5 when they state
‘Fisher’s idea is the basis of the randomized controlled trial (RCT), which builds on J.S. Mill’s earlier method of difference’ (pp. 111-112).
I shall now explain exactly what it is that Fisher’s approach does with the help of an example.
Before going into the example, which is a complex design, it is necessary to clear up one further potential point of confusion in Rocca and Anjum’s essay. N-of-1 studies are not alternatives to RCTs but a subset of them. RCTs include not just conventional parallel group trials but also cluster randomised trials and cross-over trials, including n-of-1 studies. The difference between these studies is the level at which one randomises, and this is reflected in my example, which has features of both a parallel group and a cross-over study. Thus, reading Rocca and Anjum’s paper, which I can recommend, will make more sense if their use of ‘RCT’ is understood as ‘randomised parallel group trials’.
For the moment, all that it is necessary to know is that within the same design, I can compare the effect on forced expiratory volume in one second (FEV1), measured 12 hours after treatment, of two bronchodilators in asthma, which here I shall just label ISF24 and MTA6, in two different ways. First, I can use 71 patients who were given MTA6 and ISF24 on different occasions. Here I can compare the two treatments patient by patient. These data have the structure of a within-patient study. Second, within the same study there were 37 further patients who were given MTA6 but not ISF24 and 37 further patients who were given ISF24 but not MTA6. Here I can compare the two groups of patients with each other. These data have the structure of a between-patient or parallel group study.
I now proceed to analyse the data from the 71 pairs of values from the patients who were given both using a matched pairs t-test. This will be referred to as the within-patient study. Note that this is an analysis of 2×71=142 values in total. I then proceed to compare the 37 patients given MTA6 only to the 37 given ISF24 only using a two-sample t-test. I shall refer to this as the between-patient study. Note that this is an analysis of 37+37=74 values in total. Finally, I combine the two using a meta-analysis.
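The two analyses described above can be sketched on simulated data. This is only an illustration: the FEV1 values are generated, not taken from the trial, and the assumed treatment difference, patient-to-patient spread and occasion-to-occasion noise are hypothetical numbers chosen for the sketch. The code computes the point estimate and its standard error for each analysis; a 95% confidence interval would multiply each standard error by the appropriate t quantile.

```python
import math
import random
import statistics


def paired_estimate(first, second):
    """Mean within-patient difference and its standard error (matched pairs)."""
    diffs = [a - b for a, b in zip(first, second)]
    return statistics.mean(diffs), statistics.stdev(diffs) / math.sqrt(len(diffs))


def two_sample_estimate(first, second):
    """Difference in group means and its pooled-variance standard error."""
    n1, n2 = len(first), len(second)
    pooled_var = ((n1 - 1) * statistics.variance(first)
                  + (n2 - 1) * statistics.variance(second)) / (n1 + n2 - 2)
    diff = statistics.mean(first) - statistics.mean(second)
    return diff, math.sqrt(pooled_var * (1 / n1 + 1 / n2))


# Simulated FEV1 values in litres; every number below is an assumption.
rng = random.Random(2020)
TRUE_DIFF = 0.1      # assumed ISF24 minus MTA6 difference
SD_PATIENT = 0.6     # assumed spread of patients' underlying levels
SD_OCCASION = 0.15   # assumed occasion-to-occasion noise within a patient


def simulate_fev1(patient_level, effect):
    """One simulated measurement for a patient with the given underlying level."""
    return patient_level + effect + rng.gauss(0, SD_OCCASION)


# Within-patient data: 71 patients measured under both treatments.
levels = [rng.gauss(2.5, SD_PATIENT) for _ in range(71)]
isf24_within = [simulate_fev1(level, TRUE_DIFF) for level in levels]
mta6_within = [simulate_fev1(level, 0.0) for level in levels]

# Between-patient data: 37 + 37 different patients, one treatment each.
isf24_between = [simulate_fev1(rng.gauss(2.5, SD_PATIENT), TRUE_DIFF) for _ in range(37)]
mta6_between = [simulate_fev1(rng.gauss(2.5, SD_PATIENT), 0.0) for _ in range(37)]

est_within, se_within = paired_estimate(isf24_within, mta6_within)
est_between, se_between = two_sample_estimate(isf24_between, mta6_between)
print(f"within-patient:  estimate {est_within:.3f} L, SE {se_within:.3f}")
print(f"between-patient: estimate {est_between:.3f} L, SE {se_between:.3f}")
print(f"variance ratio (between / within): {(se_between / se_within) ** 2:.1f}")
```

Because the patient level cancels in each within-patient difference, only the occasion-to-occasion noise feeds the paired standard error, whereas the two-sample standard error carries the full patient-to-patient spread.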
The results are presented in the figure below which gives the point estimates for the difference between the two treatments and the 95% confidence intervals for both analyses and for a meta-analysis of both, which is labelled ‘combined’. (The horizontal dashed line is the point estimate for a full analysis of all the data and is described in the appendix.) Note how much wider the confidence intervals are for the between-patient study than the within-patient study. This is because the within-patient study is much more precise.
Why is the within-patient study so much more precise? Part of the story is that it is based on more data, in fact nearly twice as many data: 142 rather than 74. However, this is only part of the story. The ratio of variances is more than 30 to 1 and not just approximately 2 to 1, as the number of data might suggest. The main reason is that the within-patient study has balanced for a huge number of factors and the between-patient study has not. Thus, differences in 20,000 plus genes and all life-history until the beginning of the trial are balanced in the within-patient study, since each patient is his or her own control. For the between-patient study none of this is balanced by design. In fact, there are two crucial points regarding balance.
1. Randomisation does not produce balance
2. This does not affect the validity of the analysis
Why do I claim this does not matter? Suppose we accept the within-patient estimate as being nearly perfect because it balances for those huge numbers of factors. It seems that we can then claim that the between-patient estimate did a pretty bad job. The point estimate is 0.2L more than that from the within-patient design, a non-negligible difference. However, this is to misunderstand what the between-patient analysis claims. Its ‘claim’ is not the point estimate; its claim is the distribution associated with it, of which the 95% confidence interval is a sort of minimalist conventional summary and of which the point estimate is only one point. As I have explained elsewhere, such claims of uncertainty are a central feature of statistics. Thus, the true claim made by the between-patient study is not misleading. It is vague and, indeed, when we come to combine the results, the meta-analysis will give 30 times the weight to the within-patient estimate as to the between-patient estimate simply because of the vagueness of the associated claim. This is why the result from the meta-analysis is so similar to that of the within-patient estimate. Furthermore, although this can never be guaranteed, since probabilities are involved, the 95% CI for the between-patient study includes the estimate given by the within-patient study. (Note, that in general, confidence intervals are not a statement about a value in a future study, but about the ‘true’ average value6 but here, the within-patient study being very precise, they can be taken to be similar.)
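The weighting can be made concrete with a small sketch. It assumes a fixed-effect, inverse-variance meta-analysis, which is what the 30-to-1 weighting suggests but is not spelled out in the text. The two estimates and standard errors are hypothetical values chosen only so that the between-patient variance is 30 times the within-patient variance and the point estimates differ by 0.2 L.

```python
import math


def fixed_effect_meta(estimates, standard_errors):
    """Inverse-variance (fixed-effect) combination of independent estimates.

    Returns the combined estimate, its standard error, and each
    estimate's share of the total weight.
    """
    weights = [1.0 / se ** 2 for se in standard_errors]
    total = sum(weights)
    combined = sum(w * e for w, e in zip(weights, estimates)) / total
    combined_se = math.sqrt(1.0 / total)
    shares = [w / total for w in weights]
    return combined, combined_se, shares


# Hypothetical inputs (litres): not the trial's actual results.
within_est, within_se = 0.10, 0.02
between_est, between_se = 0.30, 0.02 * math.sqrt(30)  # variance 30 times larger

combined, combined_se, shares = fixed_effect_meta(
    [within_est, between_est], [within_se, between_se]
)
print(f"weight shares: within {shares[0]:.3f}, between {shares[1]:.3f}")
print(f"combined estimate: {combined:.4f} L (SE {combined_se:.4f})")
```

With a 30-to-1 variance ratio the within-patient estimate carries 30/31 of the weight, so the combined estimate moves only 1/31 of the way towards the between-patient estimate.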
How this works
This works because what Fisher’s analysis does is use variation at an appropriate level to estimate variation in the treatment estimate. So, for the between-patient study, it starts from the following observations:
1) There are numerous factors apart from treatment that could affect the outcome in one arm of the between-patient study compared to the other.
2) However, it is the joint effect of these that matters.
3) This joint effect of such factors will also vary within each of the two treatment groups.
4) Provided I use a method of allocation that is random, there will be no tendency for this variation within the groups to be larger or smaller than that between the groups.
5) Under this condition I have a way of estimating how reliable the treatment estimate is.
Thus, his programme is not about eliminating all sources of variation. He knows that this is impossible and accepts that estimates will be imperfect. Instead, he answers the question: ‘given that estimates are (inevitably) less than perfect, can we estimate how reliable they are?’. The answer he provides is ‘yes’ if we randomise.
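The five observations above can be checked with a small re-randomisation sketch. Everything in it is hypothetical: 74 simulated patients whose outcome is the joint effect of 50 unobserved factors, no treatment effect, and 4,000 random allocations to two arms of 37. For each allocation it records the observed difference between arms (which reflects chance imbalance) and the variance that the usual two-sample analysis would claim for that difference from the within-group variation.

```python
import random
import statistics

rng = random.Random(1926)

N_PER_ARM = 37
N_FACTORS = 50             # hypothetical number of unobserved prognostic factors
N_RERANDOMISATIONS = 4000

# A fixed set of patients: each outcome is the joint effect of many hidden factors.
outcomes = [
    sum(rng.gauss(0, 1) for _ in range(N_FACTORS))
    for _ in range(2 * N_PER_ARM)
]

differences = []          # observed difference between arms, per allocation
estimated_variances = []  # variance of that difference claimed by the analysis

for _ in range(N_RERANDOMISATIONS):
    shuffled = outcomes[:]
    rng.shuffle(shuffled)  # random allocation to two arms of equal size
    arm_a, arm_b = shuffled[:N_PER_ARM], shuffled[N_PER_ARM:]
    differences.append(statistics.mean(arm_a) - statistics.mean(arm_b))
    # Pooled within-group variance (equal arm sizes), scaled to the difference.
    pooled = (statistics.variance(arm_a) + statistics.variance(arm_b)) / 2
    estimated_variances.append(pooled * 2 / N_PER_ARM)

actual_variance = statistics.variance(differences)
mean_estimated_variance = statistics.mean(estimated_variances)
ratio = mean_estimated_variance / actual_variance

print(f"actual variance of the difference over allocations: {actual_variance:.3f}")
print(f"average variance claimed by the analysis:           {mean_estimated_variance:.3f}")
print(f"ratio: {ratio:.3f}")
```

No single allocation is balanced, yet the variance claimed from within-group variation agrees on average with how much the between-group difference actually varies over allocations. Adding a constant treatment effect to one arm would shift every difference by the same amount and leave both variances unchanged.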
If we now turn to the within-patient estimate, the same argument is repeated but in a first step differences are calculated by patient. These differences do not reflect differences in genes etc. since each patient acts as his or her own control. (They could reflect a treatment-by-patient interaction but this is another story I choose not to go into here7, 8. See my blog on n-of-1 trials for a discussion.) The argument then uses the variance in the single group of differences to estimate how reliable their average will be.
Note that a different design requires a different analysis, in particular because the estimate of the variability of the estimate will otherwise be inappropriate, even if the estimate itself is not affected. This is illustrated in Figure 2, which shows what happens if you analyse the paired data from the 71 patients as if they were two independent sets of 71 each. Although the point estimate is unchanged, the confidence interval is now much wider than it was before. The value of having the patients as their own control is lost. The downstream effect of this is that the meta-analysis now weights the two estimates inappropriately.
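Why the point estimate stays the same while the interval widens can be sketched algebraically. The notation here is introduced for illustration and is not the original's: for n patients, y_iA and y_iB are a patient's values under the two treatments, s_A and s_B are the sample standard deviations of each set of values, and r is their sample correlation.

```latex
\begin{align}
\hat{\delta} &= \frac{1}{n}\sum_{i=1}^{n}\left(y_{iA}-y_{iB}\right)
              = \bar{y}_{A}-\bar{y}_{B}
  && \text{(same point estimate either way)}\\
\widehat{\operatorname{Var}}_{\text{paired}}(\hat{\delta})
  &= \frac{s_{A}^{2}+s_{B}^{2}-2\,r\,s_{A}s_{B}}{n}
  && \text{(variance of the $n$ differences, divided by $n$)}\\
\widehat{\operatorname{Var}}_{\text{unpaired}}(\hat{\delta})
  &= \frac{s_{A}^{2}+s_{B}^{2}}{n}
  && \text{(two samples of $n$ treated as independent)}
\end{align}
```

The unpaired formula drops the term in r. When values from the same patient are strongly positively correlated, as they are when patients differ a great deal from one another, that term is large and the unpaired analysis overstates the variance of the estimate.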
Note also, that it is not a feature of Fisher’s approach that claims made by larger or otherwise more precise trials are generally more reliable than smaller or otherwise less precise ones. The increase in precision is consumed by the calculation of the confidence interval9, 10. More precise designs produce narrower intervals. Nothing is left to make the claim that is made more valid. It is simply more precise. The allowance for chance effects will be less, and appropriately so. Balance is a matter of precision not validity.
The shocking truth
As I often put it, the shocking truth about RCTs is the opposite of what many believe. Far from requiring us to know that all possible causal factors affecting the outcome are balanced in order for the conventional analysis of RCTs to be valid, if we knew all such factors were balanced, the conventional analysis would be invalid. RCTs neither guarantee nor require balance. Imbalance is inevitable and Fisher’s analysis allows for this. The allowance that is made for imbalance is appropriate provided that we have randomised. Thus, randomisation is a device for enabling us to make precise estimates of an inevitable imprecision.
I thank George Davey Smith, Elena Rocca and Rani Lill Anjum for helpful comments on an earlier version.
1. Sheiner LB. The intellectual health of clinical drug evaluation [see comments]. Clin Pharmacol Ther 1991; 50(1): 4-9.
2. Rocca E, Anjum RL. Causal Evidence and Dispositions in Medicine and Public Health. International Journal of Environmental Research and Public Health 2020; 17.
3. Fisher RA. The arrangement of field experiments. Journal of the Ministry of Agriculture of Great Britain 1926; 33: 503-13.
4. Raper S. Turning points: Fisher’s random idea. Significance 2019; 16(1): 20-23.
5. Mumford S, Anjum RL. Causation: a very short introduction: OUP Oxford, 2013.
6. Senn SJ. A comment on replication, p-values and evidence S.N.Goodman, Statistics in Medicine 1992; 11: 875-879. Statistics in Medicine 2002; 21(16): 2437-44.
7. Senn SJ. Mastering variation: variance components and personalised medicine. Statistics in Medicine 2016; 35(7): 966-77.
8. Araujo A, Julious S, Senn S. Understanding Variation in Sets of N-of-1 Trials. PloS one 2016; 11(12): e0167167.
9. Senn SJ. Seven myths of randomisation in clinical trials. Statistics in Medicine 2013; 32(9): 1439-50.
10. Cumberland WG, Royall RM. Does Simple Random Sampling Provide Adequate Balance. J R Stat Soc Ser B-Methodol 1988; 50(1): 118-24.
11. Senn SJ, Lilienthal J, Patalano F, et al. An incomplete blocks cross-over in asthma: a case study in collaboration. In: Vollmar J, Hothorn LA, eds. Cross-over Clinical Trials. Stuttgart: Fischer, 1997: 3-26.
Appendix: The design of MTA/02
This was a so-called balanced incomplete blocks design, necessitated because it was desired to study seven treatments (three doses of each of two formulations and a placebo)11 but it was not considered practical to treat patients in more than five periods. Thus, patients were allocated a different one of the seven treatments in each of the five periods. That is to say, each patient received a subset of five of the seven treatments. Twenty-one sequences of five treatments were used. Each sequence permits (5×4)/2 = 10 pairwise comparisons but there are (7×6)/2 = 21 pairwise comparisons overall, and the sequences were chosen in such a way that any given one of the 21 pairwise comparisons would appear within a sequence equally often over the design: in ten of the 21 sequences. Looking at the members of any such pair, one would find that in five further sequences the first would appear and not the second, and in another five the second would appear and not the first. This leaves one sequence out of the 21 in which neither treatment would appear. The sort of scheme involved is illustrated in Table 1 below.
The active treatments were MTA6, MTA12, MTA24, ISF6, ISF12, ISF24, where the number refers to a dose in μg and the letters to two different formulations (MTA and ISF) of a dry powder of formoterol delivered by inhaler. The seventh treatment was a placebo.
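The counts quoted for the design can be checked by enumeration. The sketch below rests on an assumption the source does not state: that the 21 sequences correspond to the 21 distinct subsets of five treatments chosen from seven (there are exactly 21 such subsets). The order of treatments within a sequence is ignored, since only which treatments a patient receives matters for these counts.

```python
from itertools import combinations

treatments = ["MTA6", "MTA12", "MTA24", "ISF6", "ISF12", "ISF24", "placebo"]

# Assumption: one sequence per distinct five-treatment subset of the seven.
subsets = [set(chosen) for chosen in combinations(treatments, 5)]

pairs_per_sequence = len(list(combinations(range(5), 2)))  # (5 x 4) / 2
pairs_overall = len(list(combinations(treatments, 2)))     # (7 x 6) / 2

first, second = "ISF24", "MTA6"
both = sum(1 for s in subsets if first in s and second in s)
only_first = sum(1 for s in subsets if first in s and second not in s)
only_second = sum(1 for s in subsets if second in s and first not in s)
neither = sum(1 for s in subsets if first not in s and second not in s)

print(f"sequences: {len(subsets)}")
print(f"pairwise comparisons per sequence: {pairs_per_sequence}, overall: {pairs_overall}")
print(f"both: {both}, only {first}: {only_first}, only {second}: {only_second}, neither: {neither}")
```

Under this assumption the same four counts hold for every pair of treatments, not just ISF24 and MTA6, which is the balance property of the design.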
In fact, the plan was to recruit six times as many patients as there were sequences, randomising a given patient to a sequence in a way that would guarantee approximately equal numbers per sequence. This would have given 126 patients in total. In the end, this target was exceeded and 161 patients were randomised to one of the sequences.
Obviously, this is a rather complex design, but I have used it because it enabled me to compare two treatments in two different ways: first, by using only the ten sequences in which they both appear, for which purpose I could use each patient as his or her own control; second, by using the ten further sequences in which only one of the two appears.
This thus permitted me to analyse data from the same trial using a within-patient analysis and a between-patient analysis. The analyses used above should not be taken too seriously. The analysis would not generally proceed, and did not in fact proceed, in this way. For example, I ignored the complication of period effects and ignored the fact that by including all the seven treatments in an analysis at once, I could recover more information. I simply chose two treatments to compare and ignored all other information in order to illustrate a point. The two treatments I compared, ‘ISF24’ and ‘MTA6’, were, respectively, the highest (24μg) dose of the then (1997) existing standard dry powder formulation, ISF, of the beta-agonist formoterol, and the lowest (6μg) dose of a newer formulation, MTA, that it was hoped to introduce. The experiment is discussed in full in Senn, Lilienthal, Patalano and Till11.
The full model analysis that I showed as a dotted line in Figure 1 & Figure 2 fitted Patient as a random effect and Treatment and Period as fixed factors with 7 and 5 levels respectively.
Ed. A link to a selection of Senn’s posts and papers is here. Please share comments and thoughts.
I am extremely thankful to Stephen Senn for a further installment on his discussions of randomisation on this blog. His response to the common criticism that RCTs don’t achieve balance is extremely important and opened my eyes to where the critics (often philosophers) go wrong. I would like to draw Stephen out a bit more, though, on what he means in saying randomisation is about randomisation. Fisher said:
The purpose of randomisation . . . is to guarantee the validity of the test of significance, this test being based on an estimate of error made possible by replication. (Fisher [1935b] 1951, section 26) (SIST p. 286).
The meaning of ‘randomisation is about randomisation’, I suggest, is that its function is to guarantee the error probability assessments (for estimators and tests). That is why its relevance comes into question by subjective Bayesians who question the relevance of error probabilities to inference (though some find a role for it in vouchsafing posterior probabilities). For the error statistician, it’s a way to deliberately introduce the probabilistic considerations needed to appraise the reliability of the inference that a method outputs.
Thanks, Stephen, for engaging with our paper and for sharing this convincing argument about the non-necessity of the ‘balance condition’ in RCTs. As you correctly note, we don’t touch upon this in our paper, despite the fact that some arguments for the same statement come also from the philosophy of science front (see for instance J. Fuller, 2018, ‘The Confounding Question of Confounding Causes in Randomized Trials’, Brit. J. Phil. Sci.).
For the point we want to make in our paper, there is no problem in adopting the view that an RCT aims to have confounding factors randomly distributed, and not homogeneously distributed, among groups. In light of your post, we should have said that:
‘…any correlation between the tested intervention and the difference in outcome between the test group and the control group can be estimated, since all other causally relevant factors have been randomly distributed between the two groups’
Rather than what we said, namely that:
‘…any difference in outcome between the test group and the control group should be caused by the tested interventions, since all other differences should be homogenously distributed between the two groups’.
We believe that it would have been technically more correct given what you explain.
For the specific purpose of our analysis in the paper, however, we want to make clear why this is not a misunderstanding of the logic underlying this experimental design.
A quick background. The point in our paper is to argue for the necessity of plurality of methods to make a causal claim in medicine, and it is based on the philosophical idea that what medicine ultimately needs to know about are intrinsic properties (also called dispositions, capacities or causal powers). For instance, ibuprofen causes gastrointestinal symptoms in a large number of patients. This observation is useful, because it points to an intrinsic property of ibuprofen. Now, whether one agrees with this idea or not, the problem remains that intrinsic properties are not directly observable in the same way that a correlation is, and different methods have some advantages and some blind spots in detecting such properties. In the paper, we discuss some of the research methods commonly used in medicine and we explain such advantages and disadvantages for the purpose of detecting intrinsic dispositions.
With RCTs, the main point we want to make is that this type of experimental design is the one best suited to detect difference-making at group level, which perfectly fits the Humean definition of causation as difference-maker and regularity. We can say that one randomises for the purpose of having a corresponding degree of variation within and between groups, rather than for the purpose of homogeneity, as argued in the blogpost (to which we have no objection).
This does not change that the aim of randomisation is to make it possible to (1) estimate difference-making, (2) estimate how reliable such estimate of difference-making is.
This is everything we need for what we say in the paper, that:
(1) An estimate of difference-making is a good indication for a disposition, since dispositions tend (although not always do) to make a statistical difference at group level;
(2) Difference-making shows a correlation but not necessarily an intrinsic disposition;
(3) Difference-making might indicate a necessary condition and not an intrinsic disposition.
About this last point, we say:
‘Statistically significant results from an RCT could indicate that the intervention played a causal role for the outcome, either as an intrinsic disposition or as a necessary background condition (a sine qua non). The latter would be where something that was necessary for the effect, although it did not as such cause the effect. Hypothetically speaking, if one had no understanding of the underlying biological mechanisms, one might, for instance, find that hysterectomy significantly reduces the risk of unwanted pregnancy and take this to mean that the uterus is the cause rather than a necessary condition for pregnancy. This shows how what is the advantage of RCTs from a dispositionalist perspective is also the reason why they cannot produce dispositional evidence on their own.’
Elena Rocca and Rani Lill Anjum