Head, Methodology and Statistics Group,
Competence Center for Methodology and Statistics (CCMS), Luxembourg
Is Pooling Fooling?
‘And take the case of a man who is ill. I call two physicians: they differ in opinion. I am not to lie down, and die between them: I must do something.’ Samuel Johnson, in Boswell’s The Journal of a Tour to the Hebrides
A common dilemma facing meta-analysts is deciding what to put together with what. One may have a set of trials that seem to be addressing approximately the same question but whose features differ. For example, the inclusion criteria might have differed, with some trials only admitting patients who were extremely ill but others treating the moderately ill as well. Or different measurements might have been taken in different trials. An even more extreme case occurs when different, if presumed similar, treatments have been used.
It is helpful to make a point of terminology here. In what follows I shall be talking about pooling results from various trials. This does not involve naïve pooling of patients across trials. I assume that each trial will provide a valid within-trial comparison of treatments. It is these comparisons that are to be pooled (appropriately).
A possible way to think of this is in terms of a Bayesian model with a prior distribution covering the extent to which results might differ as features of trials are changed. I don’t deny that this is sometimes an interesting way of looking at things (although I do maintain that it is much more tricky than many might suppose) but I would also like to draw attention to the fact that there is a frequentist way of looking at this problem that is also useful.
Suppose that we have k ‘null’ hypotheses that we are interested in testing, each of which can be tested in one of k trials. We can label these Hn1, Hn2, … Hnk. We are perfectly entitled to test the null hypothesis Hjoint that they are all jointly true. In doing this we can use appropriate judgement to construct a composite statistic, based on all the trials, whose distribution is known under the null. This is a justification for pooling.
Of course, how we choose to pool is a matter of skill, judgement, experience and statistical know-how. In the Neyman-Pearson framework it would depend on some suitable and possibly quite complicated alternative hypothesis. In the Fisherian framework the choice is more direct. Either way we can construct a composite statistic based on all the trials and test Hjoint.
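As a minimal sketch of the Fisherian route (with hypothetical p-values, not data from any trial discussed here), one classical composite statistic is Fisher's combination: under Hjoint, minus twice the sum of the log p-values from k independent trials follows a chi-squared distribution with 2k degrees of freedom.

```python
import math

# Hypothetical p-values from k = 3 independent trials, each testing its own null.
pvals = [0.08, 0.20, 0.04]

# Fisher's combination: under Hjoint, -2 * sum(ln p_i) ~ chi-squared with 2k df.
stat = -2 * sum(math.log(p) for p in pvals)
df = 2 * len(pvals)

print(f"chi-squared = {stat:.2f} on {df} df")
```

Note that no single trial here is significant at the 5% level, yet the composite statistic (about 14.7 on 6 df) exceeds the 5% critical value of 12.59, illustrating how pooling can detect what individual trials cannot.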
What we have to be careful about, however, is what hypothesis we are entitled to assert if Hjoint is rejected. Rejection of Hjoint does not entitle us to regard each of Hn1, Hn2, … Hnk as rejected; it entitles us to assert only that at least one of them is false.
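To make the distinction concrete, here is a minimal sketch (with hypothetical numbers) of fixed-effect pooling of within-trial estimates by inverse-variance weighting. The resulting z-statistic is standard normal under Hjoint, but its rejection licenses only the claim that at least one of the trial-level nulls is false, not that each is.

```python
import math

# Hypothetical within-trial treatment effects (e.g. log odds ratios)
# and their standard errors from k = 3 trials.
effects = [0.25, 0.40, 0.10]
ses     = [0.20, 0.30, 0.15]

# Inverse-variance (fixed-effect) pooling of the within-trial comparisons.
weights = [1 / se**2 for se in ses]
pooled = sum(w * e for w, e in zip(weights, effects)) / sum(weights)
pooled_se = math.sqrt(1 / sum(weights))

# Composite statistic: standard normal under the joint null Hjoint.
z = pooled / pooled_se
phi = 0.5 * (1 + math.erf(abs(z) / math.sqrt(2)))  # standard normal CDF at |z|
p = 2 * (1 - phi)                                  # two-sided p-value

# A small p here would reject Hjoint: at least one trial-level null is false.
# It would NOT establish an effect in every one of the three trials.
print(f"pooled effect = {pooled:.3f}, z = {z:.2f}, p = {p:.3f}")
```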
The issue at stake is well illustrated by a famous meta-analysis of rofecoxib carried out by Juni et al in 2004. They pooled a number of studies comparing rofecoxib with various comparators (naproxen, other non-steroidal anti-inflammatory drugs or placebo) and concluded that it was possible to decide, prior to 2004 when Merck withdrew the drug from the market, that there was an increased risk of adverse cardiovascular events compared to placebo. In a subsequent commentary Kim and Reicin, two scientists working for Merck, protested that pooling comparators in this way violated general principles of meta-analysis.
However, the discussion above shows that both parties were wrong. There is no general principle requiring the pooling of like with like, but on the other hand it is not logical, having pooled unlike with unlike, to conclude that a treatment which is not identical to all comparators must be different from each and every one of them.
In fact, a pooled meta-analysis is sometimes properly regarded as a step towards looking further. In a much-cited paper that presented, amongst other matters, a meta-analysis of the effect of cholesterol-lowering treatment on risk of ischaemic heart disease, Simon Thompson was able to show heterogeneity of effect amongst the trials and that this was ascribable to a number of factors, amongst them the type of treatment given.
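A sketch of the kind of heterogeneity check that motivates such further investigation (using hypothetical numbers, not Thompson's data) is Cochran's Q: a weighted sum of squared deviations of the trial estimates from the pooled estimate, approximately chi-squared with k - 1 degrees of freedom under homogeneity.

```python
import math

# Hypothetical within-trial effects and standard errors from k = 4 trials,
# imagined as two trials of one treatment type and two of another.
effects = [0.10, 0.15, 0.55, 0.60]
ses     = [0.12, 0.15, 0.14, 0.16]

# Fixed-effect pooled estimate via inverse-variance weights.
weights = [1 / se**2 for se in ses]
pooled = sum(w * e for w, e in zip(weights, effects)) / sum(weights)

# Cochran's Q: under homogeneity, approximately chi-squared with k - 1 df.
q = sum(w * (e - pooled)**2 for w, e in zip(weights, effects))
df = len(effects) - 1

# A large Q (here about 10.4 on 3 df, against a 5% critical value of 7.81)
# suggests heterogeneity worth investigating rather than averaging away.
print(f"Q = {q:.2f} on {df} df")
```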
Of course, the fact that one may pool different treatments does not mean that it is always wise to do so. My experience is that those who have worked in drug development are very reluctant to pool different formulations, let alone different molecules, without careful consideration, whereas those who have not are less so. A trial I worked on nearly twenty years ago showed (to high precision) a relative potency of four to one between two dry-powder formulations of the same drug.
Pooling doses might be suitable for some purposes but not others. For instance, if there were no difference between several doses of a treatment and placebo as regards side-effects, one might take this as reassuring regarding the lowest dose, but it would be quite unacceptable as proof of the safety of the highest. Similarly, if in such a pooling there were a definite benefit compared to placebo, one might take this as showing the efficacy of the highest dose but not the lowest. Such judgements would be based on the presumed monotonicity of the dose response. However, in either of these cases the pooled analysis (if performed at all) would probably be taken as a starting point for investigation, with attempts to follow (depending on the numbers available) to say something about the individual doses.
It is interesting to note that fashions regarding the pooling of treatments are changing rapidly as network meta-analysis (see Senn et al, 2013, for an example) becomes much more popular. Such analyses use comparisons within trials as a means of connecting treatments but maintain the distinctions between different treatments.
Considering the case of pooling different populations where the treatments are otherwise identical raises different issues. Here the very problem raised by pooling calls into question the interpretation of a single trial. Consider the case where two trials are run in asthma. One specifies that patients should be aged 18-65 and the other that they should be aged 65-75. Let us call the first group non-elderly adults and the second elderly. By pooling the trials in a meta-analysis we are testing the hypothesis that there is no difference between the effects of the treatments in either group. A rejected hypothesis then implies a difference in effect in at least one group.
The issue this raises is that any trial can be regarded as containing subgroups that might have formed the object of separate study. For example, we could have run a single trial which included patients aged 18-75. Clearly it would be absurd to suggest, if analysis shows a difference between treatments, that there is therefore a difference for all patients of any age: non-elderly adults and also the elderly.
It might be supposed that this means there can never be any justification for using any treatment, because we can always imagine some further subdivision of the patients. However, this ignores the necessity of choice, and this is where the relevance of Johnson’s remark quoted at the beginning comes in. Consider a case where A has been compared to B in a trial or set of trials involving many different types of patient: young, old, male, female, severely ill, moderately ill and so forth. The fact that the mean effect of B is better than that of A does not prove that B is better than A for every patient. But consider this: if nothing else is known, however much you might doubt whether B really was better than A for a given patient, it would be perverse to use this as a reason for recommending A, given that A was on average worse than B. Whatever your doubts about B for this patient, your doubts about A would be higher.
In much of the recent discussion of subgroups in clinical trials, some of it driven by regulators, I think that this point has been overlooked. One could say that, having established reasonably precisely the average effect of a treatment, this then becomes, if not the new null hypothesis, then at least a base hypothesis for future action. In my view the further investigation of subgroups then becomes one project amongst many possible projects. If it can realistically be done cheaply in a way that permits useful inferences, so be it. If not, it should be regarded as competing for resources with other projects, perhaps involving other treatments altogether. The question then is: ‘does it make the cut?’
Declaration of interest
I consult regularly for the pharmaceutical industry. A full declaration of interest is maintained here: http://www.senns.demon.co.uk/Declaration_Interest.htm
References
- Senn, S.J., Trying to be precise about vagueness. Statistics in Medicine, 2007. 26: p. 1417-1430.
- Juni, P., et al., Risk of cardiovascular events and rofecoxib: cumulative meta-analysis. Lancet, 2004. 364(9450): p. 2021-9.
- Kim, P.S. and A.S. Reicin, Discontinuation of Vioxx. Lancet, 2005. 365(9453): p. 23; author reply 26-7.
- Senn, S.J., Overstating the evidence: double counting in meta-analysis and related problems. BMC Medical Research Methodology, 2009. 9: p. 10.
- Thompson, S.G., Systematic review: why sources of heterogeneity in meta-analysis should be investigated. British Medical Journal, 1994. 309(6965): p. 1351-1355.
- Senn, S.J., et al., An incomplete blocks cross-over in asthma: a case study in collaboration, in Cross-over Clinical Trials, J. Vollmar and L.A. Hothorn, Editors. 1997, Fischer: Stuttgart. p. 3-26.
- Senn, S., et al., Issues in performing a network meta-analysis. Statistical Methods in Medical Research, 2013. 22(2): p. 169-189.