Head, Methodology and Statistics Group,
Competence Center for Methodology and Statistics (CCMS), Luxembourg
Is Pooling Fooling?
‘And take the case of a man who is ill. I call two physicians: they differ in opinion. I am not to lie down, and die between them: I must do something.’ Samuel Johnson, in Boswell’s The Journal of a Tour to the Hebrides
A common dilemma facing meta-analysts is deciding what to put together with what. One may have a set of trials that seem to address approximately the same question but differ in some features. For example, the inclusion criteria might have differed, with some trials admitting only patients who were extremely ill while others treated the moderately ill as well. Or different measurements might have been taken in different trials. An even more extreme case occurs when different, if presumably similar, treatments have been used.
It is helpful to make a point of terminology here. In what follows I shall be talking about pooling results from various trials. This does not involve naïve pooling of patients across trials. I assume that each trial will provide a valid within-trial comparison of treatments. It is these comparisons that are to be pooled (appropriately).
A possible way to think of this is in terms of a Bayesian model with a prior distribution covering the extent to which results might differ as features of trials are changed. I don’t deny that this is sometimes an interesting way of looking at things (although I do maintain that it is much more tricky than many might suppose) but I would also like to draw attention to the fact that there is a frequentist way of looking at this problem that is also useful.
Suppose that we have k ‘null’ hypotheses that we are interested in testing, each of which can be tested in one of k trials. We can label these Hn1, Hn2, … Hnk. We are perfectly entitled to test the null hypothesis Hjoint that they are all jointly true. In doing this we can use appropriate judgement to construct a composite statistic, based on all the trials, whose distribution is known under the null. This is a justification for pooling.
Of course, how we choose to pool is a matter of skill, judgement, experience and statistical know-how. In the Neyman-Pearson framework it would depend on some suitable and possibly quite complicated alternative hypothesis. In the Fisherian framework the choice is more direct. Either way we can construct a composite statistic based on all the trials and test Hjoint.
What we have to be careful about, however, is deciding what hypothesis we are entitled to assert if Hjoint is rejected. Rejection of Hjoint does not entitle us to regard each of Hn1, Hn2, … Hnk as rejected; it entitles us only to assert that at least one of them is false.
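One concrete Fisherian recipe for such a composite statistic is Fisher’s combination test, which pools the p-values from the k trials. The sketch below is purely illustrative (the three p-values are invented); it exploits the closed form that the chi-squared survival function takes for even degrees of freedom.

```python
import math

def fisher_joint_test(p_values):
    """Fisher's combination test of the joint null H_joint.

    Under H_joint (all k component nulls true) and independent trials,
    X = -2 * sum(log p_i) is chi-squared on 2k degrees of freedom.
    For even degrees of freedom 2k the survival function has the
    closed form P(X > x) = exp(-x/2) * sum_{j=0}^{k-1} (x/2)^j / j!.
    """
    k = len(p_values)
    x = -2.0 * sum(math.log(p) for p in p_values)
    half = x / 2.0
    p_joint = math.exp(-half) * sum(half**j / math.factorial(j) for j in range(k))
    return x, p_joint

# Invented p-values from three trials, none individually significant
# at the 5% level.
stat, p_joint = fisher_joint_test([0.08, 0.06, 0.11])
# If p_joint falls below 5%, rejecting H_joint tells us only that at
# least one of the three component nulls is false -- not which one(s).
```

Note how the pooled test can reject even though no single trial does; that is the gain from pooling, and the limited conclusion in the final comment is its price.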
The issue at stake is well illustrated by a famous meta-analysis of rofecoxib carried out by Juni et al. in 2004. They pooled a number of studies comparing rofecoxib to various comparators (naproxen, other non-steroidal anti-inflammatory drugs or placebo) and concluded that an increased risk of adverse cardiovascular events could have been established prior to 2004, when Merck pulled the drug off the market. In a subsequent commentary Kim and Reicin, two scientists working for Merck, protested that pooling comparators in this way violated general principles of meta-analysis.
However, the discussion above shows that both sides were wrong. There is no general principle requiring the pooling of like with like; on the other hand, having pooled unlike with unlike, it is not logical to conclude that a treatment shown to differ from its comparators collectively must differ from each and every one of them.
In fact, sometimes a pooled meta-analysis is properly regarded as a step towards looking further. In a much-cited paper that presented, amongst other matters, a meta-analysis of the effect of cholesterol-lowering treatment on the risk of ischaemic heart disease, Simon Thompson was able to show heterogeneity of effect amongst the trials, and that this could be ascribed to a number of factors, amongst them the type of treatment given.
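The standard machinery behind such an investigation can be sketched briefly: an inverse-variance pooled estimate together with Cochran’s Q statistic, which tests the hypothesis of a common effect across trials. The numbers below are invented for illustration only.

```python
import math

def pool_and_test_heterogeneity(effects, std_errors):
    """Fixed-effect pooled estimate plus Cochran's Q statistic.

    Weights are inverse variances. Under the null of a single common
    effect, Q is approximately chi-squared on k - 1 degrees of freedom.
    """
    weights = [1.0 / se**2 for se in std_errors]
    pooled = sum(w * e for w, e in zip(weights, effects)) / sum(weights)
    q = sum(w * (e - pooled) ** 2 for w, e in zip(weights, effects))
    return pooled, q

# Hypothetical treatment effects (e.g. log odds ratios) and standard
# errors from four trials:
effects = [-0.30, -0.10, 0.05, -0.45]
ses = [0.10, 0.12, 0.15, 0.11]
pooled, q = pool_and_test_heterogeneity(effects, ses)
# Compare q with the chi-squared 5% critical value on 3 df (about 7.81).
# A large Q is a signal to investigate sources of heterogeneity, in
# Thompson's spirit, rather than to stop at the pooled estimate.
```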
Of course, the fact that one may pool different treatments does not mean that this is always wise. My experience is that those who have worked in drug development are very reluctant to pool different formulations, let alone different molecules, without careful consideration, whereas those who have not are less so. A trial I worked on nearly twenty years ago showed (with high precision) a relative potency of four to one between two dry-powder formulations of the same drug.
Pooling doses might be suitable for some purposes but not others. For instance, if there were no difference between several doses of a treatment and placebo as regards side-effects, one might take this as reassuring regarding the lowest dose, but it would be quite unacceptable as proof of the safety of the highest. Similarly, if such a pooling showed a definite benefit compared to placebo, one might take this as demonstrating the efficacy of the highest dose but not the lowest. Such judgements would be based upon the presumed monotonicity of the dose response. However, in either case the pooled analysis (if performed at all) would probably be taken as a starting point for investigation, with attempts following (depending on the numbers available) to say something about the individual doses.
It is interesting to note that fashions regarding the pooling of treatments are changing rapidly as network meta-analysis (see Senn et al., 2013, for an example) becomes much more popular. Such analyses use comparisons within trials as a means of connecting treatments but maintain the distinctions between different treatments.
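The simplest building block of such a network is the adjusted indirect comparison: two treatments A and B that have each been compared with a common comparator (say placebo P), but never with each other, are contrasted via their within-trial estimates. The figures below are hypothetical; the point is that only randomised within-trial contrasts are combined.

```python
import math

def indirect_comparison(d_ap, se_ap, d_bp, se_bp):
    """Adjusted indirect comparison of A vs B via a common comparator P.

    Each input is a within-trial estimate (and its standard error) of a
    treatment against P, so the A-vs-B contrast is built entirely from
    randomised comparisons while the A/B distinction is preserved.
    """
    d_ab = d_ap - d_bp
    se_ab = math.sqrt(se_ap**2 + se_bp**2)
    return d_ab, se_ab

# Hypothetical within-trial estimates (e.g. log odds ratios vs placebo):
d_ab, se_ab = indirect_comparison(-0.30, 0.12, -0.10, 0.15)
# The price of indirectness: the variance of the A-vs-B contrast is the
# *sum* of the two component variances, so precision is always lost.
```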
Considering the case of pooling different populations where the treatments are otherwise identical raises different issues. Here the very problem raised by pooling calls into question the interpretation of a single trial. Consider the case where two trials are run in asthma. One specifies that patients should be aged 18-65 and the other aged 65-75. Let us call the first group non-elderly adults and the second elderly. By pooling them in a meta-analysis we are testing the hypothesis that there is no difference between the effects of the treatments in either group. A rejected hypothesis then implies a difference in effect in at least one group.
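To make the logic concrete, here is a minimal sketch of that pooling, with invented effect estimates for the two trials, using inverse-variance weighting and a z test of the joint null.

```python
import math

def pooled_z(effects, std_errors):
    """Fixed-effect (inverse-variance) pooled estimate and z statistic."""
    weights = [1.0 / se**2 for se in std_errors]
    pooled = sum(w * e for w, e in zip(weights, effects)) / sum(weights)
    se_pooled = math.sqrt(1.0 / sum(weights))
    return pooled, pooled / se_pooled

# Hypothetical treatment effects: the non-elderly trial (larger, more
# precise) and the elderly trial (smaller, less precise).
pooled, z = pooled_z([-0.25, -0.05], [0.10, 0.20])
# |z| > 1.96 rejects the joint null at the 5% level, but this licenses
# only the claim that the treatment has an effect in at least one of
# the two age groups -- not in each of them.
```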
The issue this raises is that any trial can be regarded as containing subgroups that might have formed the object of separate study. For example, we could have run a single trial which included patients aged 18-75. Clearly it would be absurd to suggest, if analysis shows a difference between treatments, that therefore there is a difference for all patients of any age: non-elderly adults and also the elderly.
It might be supposed that this means there can never be any justification for using any treatment, because we can always imagine some further subdivision of the patients. However, this ignores the necessity of choice. This is where the relevance of Johnson’s remark quoted at the beginning comes in. Consider a case where A has been compared to B in a trial or set of trials involving many different types of patient: young, old, male, female, severely ill, moderately ill and so forth. The fact that the mean effect of B is better than that of A does not prove that B is better than A for every patient. But consider this: if nothing else is known, however much you might doubt whether B really is better than A for a given patient, it would be perverse to use that doubt as a reason for recommending A, given that A was on average worse than B. Whatever your doubts about B for this patient, your doubts about A should be higher.
In much of the recent discussion about subgroups in clinical trials, some of it driven by regulators, I think this point has been overlooked. One could say that, having established reasonably precisely the average effect of a treatment, this then becomes, if not the new null hypothesis, then at least a base hypothesis for future action. In my view the further investigation of subgroups then becomes one project amongst many possible projects. If it can realistically be done cheaply, in a way that permits useful inferences, so be it. If not, it should be regarded as competing for resources with other projects, perhaps involving other treatments altogether. The question then is ‘does it make the cut?’
Declaration of interest
I consult regularly for the pharmaceutical industry. A full declaration of interest is maintained here http://www.senns.demon.co.uk/Declaration_Interest.htm
- Senn, S.J., Trying to be precise about vagueness. Statistics in Medicine, 2007. 26: p. 1417-1430.
- Juni, P., et al., Risk of cardiovascular events and rofecoxib: cumulative meta-analysis. Lancet, 2004. 364(9450): p. 2021-9.
- Kim, P.S. and A.S. Reicin, Discontinuation of Vioxx. Lancet, 2005. 365(9453): p. 23; author reply 26-7.
- Senn, S.J., Overstating the evidence: double counting in meta-analysis and related problems. BMC Medical Research Methodology, 2009. 9: p. 10.
- Thompson, S.G., Systematic Review – Why Sources of Heterogeneity in Meta-analysis Should Be Investigated. British Medical Journal, 1994. 309(6965): p. 1351-1355.
- Senn, S.J., et al., An incomplete blocks cross-over in asthma: a case study in collaboration, in Cross-over Clinical Trials, J. Vollmar and L.A. Hothorn, Editors. 1997, Fischer: Stuttgart. p. 3-26.
- Senn, S., et al., Issues in performing a network meta-analysis. Statistical Methods in Medical Research, 2013. 22(2): p. 169-189.
Stephen: Thank you so much for contributing a guest post. I’m not very familiar with this arena, but in the interest of launching some discussion, here are some thoughts. First, I was wondering about your remark that in the “Neyman-Pearson framework it would depend on some suitable and possibly quite complicated alternative hypothesis. In the Fisherian framework the choice is more direct.” Is that because N-P looks at power, or requires stipulating the alternatives you are allowed to infer?
In general I’m not sure I get the difference, in this discussion anyway, between joint tests and meta-analysis of existing tests and their results. You almost make it sound as if you can pool studies and then cut them up every which way post data; I’m missing something, obviously. Also, don’t various tests have to be performed (on assumptions) before this kind of tossed salad?
As with George Box’s statistical quip “All models are wrong, some are useful” one certainly can find flaws in any study and thereby declare all studies wrong.
The exercise thus always is determining the usefulness of the various analyses.
Juni et al. worked up a set of specifications concerning which patient groups to assess: “We included all randomised controlled trials in adult patients with chronic musculoskeletal disorders that compared rofecoxib 12·5–50 mg daily with other NSAIDs or placebo.”
Kim and Reicin give these arguments:
“Moreover, Juni and colleagues ignore data included in previous analyses, available on the US Food and Drug Administration’s (FDA) website (http://www.fda.gov), from large placebo-controlled studies in about 2000 patients with Alzheimer’s disease. The results of these trials show no difference between rofecoxib (Vioxx) and placebo.”
“Furthermore, the results of the APPROVe study (which led to the withdrawal of Vioxx; http://www.vioxx.com) are consistent with previously available placebo-controlled data—for the first 18 months of the study, there was no evidence of any difference in cardiovascular risk between rofecoxib and placebo.”
As Senn points out, both are wrong. Which then is more useful? A collection of all randomised controlled trials of adult patients with chronic musculoskeletal disorders, precisely the indication for which millions of patients were given prescriptions for this drug, or an Alzheimer’s study plus the first 18 months of the APPROVe study?
Each of us has to weigh such evidence before deciding whether to pop that pill down our gullets. I’m more comfortable with the findings of the Juni et al. effort than the teeny tiny cherry-picked anecdotal arguments posed by the industry affiliates Kim and Reicin. Luckily I have statistical training, so that I can understand that focusing only on the first 18 months of a study, which guarantees that the number of relevant events will be small, is not useful for patients considering far longer term use of the drug for a chronic condition. I also understand that just because no statistically significant difference was seen in the first 18 months, that does not mean that Vioxx is as safe as the placebo. Failing to reject the null hypothesis does not mean the null hypothesis is to be accepted, until and unless it is established that the test involved had sufficient severity.
Kim and Reicin’s examples exhibit no severity. Juni et al. compile a mountain of evidence whose usefulness is apparent to me.
We have plenty of history now to judge the Vioxx debacle.
From Ross et al. (2008), who assess documents forcibly extracted from Merck’s file cabinets during court cases, in the paper
“Guest Authorship and Ghostwriting in Publications Related to Rofecoxib: A Case Study of Industry Documents From Rofecoxib Litigation”
“This case-study review of industry documents related to rofecoxib demonstrates that Merck used a systematic strategy to facilitate the publication of guest authored and ghostwritten medical literature. Articles related to rofecoxib were frequently authored by Merck employees but attributed first authorship to external, academically affiliated investigators who did not always disclose financial support from Merck, although financial support of the study was nearly always provided.”
Merck made many millions, if not billions of dollars of profit from this drug. What’s the hurry in assessing its side effects when the major side effect is such profit?
Is pooling fooling? Sometimes it is, sometimes it isn’t. Unfortunately we each have to make that determination when difficult circumstances such as medical dilemmas present themselves. It is useful for people who understand the technical issues to offer an opinion as to who is and who is not trying to fool you. The rest of us have to assess such opinions carefully, often by building a trust network of people we judge knowledgeable. Build your trust network carefully now; your life will one day depend on it.