Stephen Senn
Head, Methodology and Statistics Group,
Competence Center for Methodology and Statistics (CCMS), Luxembourg
Is Pooling Fooling?
‘And take the case of a man who is ill. I call two physicians: they differ in opinion. I am not to lie down, and die between them: I must do something.’ Samuel Johnson, in Boswell’s The Journal of a Tour to the Hebrides
A common dilemma facing meta-analysts is what to put together with what? One may have a set of trials that seem to be approximately addressing the same question but some features may differ. For example, the inclusion criteria might have differed with some trials only admitting patients who were extremely ill but with other trials treating the moderately ill as well. Or it might be the case that different measurements have been taken in different trials. An even more extreme case occurs when different, if presumed similar, treatments have been used.
It is helpful to make a point of terminology here. In what follows I shall be talking about pooling results from various trials. This does not involve naïve pooling of patients across trials. I assume that each trial will provide a valid within-trial comparison of treatments. It is these comparisons that are to be pooled (appropriately).
A possible way to think of this is in terms of a Bayesian model with a prior distribution covering the extent to which results might differ as features of trials are changed. I don’t deny that this is sometimes an interesting way of looking at things (although I do maintain that it is much more tricky than many might suppose[1]) but I would also like to draw attention to the fact that there is a frequentist way of looking at this problem that is also useful.
Suppose that we have k ‘null’ hypotheses that we are interested in testing, each capable of being tested in one of k trials. We can label these Hn1, Hn2, … Hnk. We are perfectly entitled to test the null hypothesis Hjoint that they are all jointly true. In doing this we can use appropriate judgement to construct a composite statistic, based on all the trials, whose distribution is known under the null. This is a justification for pooling.
Of course, how we choose to pool is a matter of skill, judgement, experience and statistical know-how. In the Neyman-Pearson framework it would depend on some suitable and possibly quite complicated alternative hypothesis. In the Fisherian framework the choice is more direct. Either way we can construct a composite statistic based on all the trials and use it to test Hjoint.
What we have to be careful about, however, is choosing what hypothesis we are entitled to assert if Hjoint is rejected. Rejection of Hjoint does not entitle us to regard each of Hn1, Hn2, … Hnk as rejected. We are entitled to assert only that at least one of them is false.
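As a minimal sketch of this logic (with made-up trial results, and an inverse-variance weighted combination standing in for whichever composite statistic one would actually choose), Hjoint can be tested by pooling the within-trial estimates; rejection then licenses only the claim that at least one of Hn1, …, Hnk is false.

```python
# A minimal sketch: testing the joint null Hjoint that every within-trial
# treatment effect is zero, by combining the k within-trial estimates with
# inverse-variance weights. The estimates and standard errors below are
# made-up numbers for illustration only.
import numpy as np
from scipy.stats import norm

theta_hat = np.array([0.25, 0.40, -0.05, 0.30])  # within-trial effect estimates (e.g. log odds ratios)
se = np.array([0.20, 0.15, 0.30, 0.25])          # their standard errors

w = 1.0 / se**2                                  # inverse-variance weights
theta_pooled = np.sum(w * theta_hat) / np.sum(w)
se_pooled = np.sqrt(1.0 / np.sum(w))
z = theta_pooled / se_pooled

# Two-sided p-value under Hjoint (all within-trial effects zero).
p_joint = 2 * norm.sf(abs(z))
print(f"z = {z:.2f}, p = {p_joint:.3f}")

# Rejecting Hjoint licenses only the claim that at least one of
# Hn1, ..., Hnk is false -- not that each of them is false.
```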
The issue at stake is well illustrated by a famous meta-analysis of rofecoxib[2] carried out by Juni et al. in 2004. They pooled a number of studies comparing rofecoxib to various comparators (naproxen, other non-steroidal anti-inflammatory drugs or placebo) and concluded that an increased risk of adverse cardiovascular events compared to placebo could have been identified before 2004, when Merck pulled the drug off the market. In a subsequent commentary Kim and Reicin, two scientists working for Merck, protested that pooling comparators like this violated general principles of meta-analysis[3].
However, the discussion above shows that both were wrong. There is no general principle requiring the pooling of like with like, but on the other hand it is not logical, having pooled unlike with unlike, to conclude that a treatment that is not identical to all comparators must be different from each and every one[4].
In fact, sometimes a pooled meta-analysis is properly regarded as a step towards looking further. In a much-cited paper that presented, amongst other matters, a meta-analysis of the effect of cholesterol-lowering treatment on risk of ischaemic heart disease, Simon Thompson[5] was able to show heterogeneity of effect amongst the trials and that this was ascribable to a number of factors, included amongst them the type of treatment given.
Of course, the fact that one may pool different treatments does not mean that this is always wise. My experience is that those who have worked in drug development are very reluctant to pool different formulations, let alone different molecules, without careful consideration, whereas those who have not are less so. A trial I worked on nearly twenty years ago showed (to high precision) a relative potency of four to one between two dry-powder formulations[6] of the same drug.
Pooling doses might be suitable for some purposes but not others. For instance, if there were no difference between several doses of a treatment and placebo as regards side-effects, one might take this as reassuring regarding the lowest dose, but it would be quite unacceptable as a proof of safety of the highest. Similarly, if in such a pooling there was a definite benefit compared to placebo, one might take this as showing the efficacy of the highest dose but not the lowest. Such judgements would be based upon the presumed monotonicity of the dose response. However, in either of these cases the pooled analysis (if performed at all) would probably be taken as a starting point for investigation, with attempts to follow (depending on the numbers available) to say something about individual doses.
It is interesting to note that fashions regarding pooling of treatments are rapidly changing as network meta-analysis (see[7] for an example) is becoming much more popular. Such analyses use comparisons within trials as a means of connecting treatments but maintain distinctions between different treatments.
Considering the case of pooling different populations where the treatments are otherwise identical raises different issues. Here the very problem raised by pooling calls into question the interpretation of a single trial. Consider the case where two trials are run in asthma. One specifies that patients should be aged 18-65 and the other aged 65-75. Let us call the first group non-elderly adults and the second elderly. By pooling them in a meta-analysis we are testing the hypothesis that there is no difference between the effects of the treatments in either group. A rejected hypothesis then implies a difference in effect in at least one group.
The issue this raises is that any trial can be regarded as containing subgroups that might have formed the object of separate study. For example, we could have run a single trial which included patients aged 18-75. Clearly it would be absurd to suggest, if analysis shows a difference between treatments, that there is therefore a difference for all patients of any age: non-elderly adults and also the elderly.
It might be supposed that this means that there can never be any justification in using any treatment because we can always imagine some further subdivision of the patients. However, this ignores the necessity of choice. This is where the relevance of Johnson’s remark quoted at the beginning comes in. Consider a case where A has been compared to B in a trial or set of trials involving many different types of patient: young, old, male, female, severely ill, moderately ill and so forth. The fact that the mean effect of B is better than A does not prove that it is better than A for every patient. But consider this: if nothing else is known, however much you might doubt whether B really was better than A for a given patient, it would be perverse to use this as a reason for recommending A, given that A was on average worse than B. Whatever your doubts about B for this patient, your doubts about A would be higher.
In much of the recent discussion about subgroups in clinical trials, some of it driven by regulators, I think that this point has been overlooked. One could say that, having established reasonably precisely the average effect of a treatment, this then becomes, if not the new null hypothesis, then at least a base hypothesis for future action. In my view the further investigation of subgroups then becomes a project amongst many possible projects. If it can realistically be done cheaply in a way that permits useful inferences, so be it. If not, it should be regarded as competing for resources with other projects, perhaps involving other treatments altogether. The question then is ‘does it make the cut?’
Declaration of interest
I consult regularly for the pharmaceutical industry. A full declaration of interest is maintained here http://www.senns.demon.co.uk/Declaration_Interest.htm
References
1. Senn, S.J., Trying to be precise about vagueness. Statistics in Medicine, 2007. 26: p. 1417-1430.
2. Juni, P., et al., Risk of cardiovascular events and rofecoxib: cumulative meta-analysis. Lancet, 2004. 364(9450): p. 2021-9.
3. Kim, P.S. and A.S. Reicin, Discontinuation of Vioxx. Lancet, 2005. 365(9453): p. 23; author reply 26-7.
4. Senn, S.J., Overstating the evidence: double counting in meta-analysis and related problems. BMC Medical Research Methodology, 2009. 9: p. 10.
5. Thompson, S.G., Systematic Review: Why sources of heterogeneity in meta-analysis should be investigated. British Medical Journal, 1994. 309(6965): p. 1351-1355.
6. Senn, S.J., et al., An incomplete blocks cross-over in asthma: a case study in collaboration, in Cross-over Clinical Trials, J. Vollmar and L.A. Hothorn, Editors. 1997, Fischer: Stuttgart. p. 3-26.
7. Senn, S., et al., Issues in performing a network meta-analysis. Statistical Methods in Medical Research, 2013. 22(2): p. 169-189.
Stephen: Thank you so much for contributing a guest post. I’m not very familiar with this arena, but in the interest of launching some discussion, here are some thoughts. First, I was wondering about your remark that in the “Neyman-Pearson framework it would depend on some suitable and possibly quite complicated alternative hypothesis. In the Fisherian framework the choice is more direct. ” Is that because N-P looks at power, or requires stipulating the alternatives you are allowed to infer?
In general I’m not sure I get the difference, in this discussion anyway, between joint tests and meta-analysis of existing tests and their results. You almost make it sound as if you can pool studies and then cut them up every which way post data; I’m obviously missing something. Also, do various tests have to be performed (on assumptions) before this kind of tossed salad?
As with George Box’s statistical quip “All models are wrong, some are useful” one certainly can find flaws in any study and thereby declare all studies wrong.
The exercise thus always is determining the usefulness of the various analyses.
Juni et al. worked up a set of specifications concerning which patient groups to assess: “We included all randomised controlled trials in adult patients with chronic musculoskeletal disorders that compared rofecoxib 12·5–50 mg daily with other NSAIDs or placebo.”
Kim and Reicin give these arguments:
“Moreover, Juni and colleagues ignore data included in previous analyses, available on the US Food and Drug Administration’s (FDA) website (http://www.fda.gov), from large placebo-controlled studies in about 2000 patients with Alzheimer’s disease. The results of these trials show no difference between rofecoxib (Vioxx) and placebo.”
“Furthermore, the results of the APPROVe study (which led to the withdrawal of Vioxx; http://www.vioxx.com) are consistent with previously available placebo-controlled data—for the first 18 months of the study, there was no evidence of any difference in cardiovascular risk between rofecoxib and placebo.”
As Senn points out, both are wrong. Which then is more useful? A collection of all randomised controlled trials of adult patients with chronic musculoskeletal disorders, precisely the indication for which millions of patients were given prescriptions for this drug, or an Alzheimer’s study plus the first 18 months of the APPROVe study?
Each of us has to weigh such evidence before deciding whether to pop that pill down our gullets. I’m more comfortable with the findings of the Juni et al. effort than the teeny tiny cherry-picked anecdotal arguments posed by the industry affiliates Kim and Reicin. Luckily I have statistical training, so that I can understand that focusing only on the first 18 months of a study, which guarantees that the number of relevant events will be small, is not useful for patients considering far longer term use of the drug for a chronic condition. I also understand that just because no statistically significant difference was seen in the first 18 months, that does not mean that Vioxx is as safe as the placebo. Failing to reject the null hypothesis does not mean the null hypothesis is to be accepted, until and unless it is established that the test involved had sufficient severity.
Kim and Reicin’s examples exhibit no severity. Juni et al. compile a mountain of evidence whose usefulness is apparent to me.
We have plenty of history now to judge the Vioxx debacle.
From Ross et al (2008) who assess documents forcibly extracted from Merck’s file cabinets during court cases, in the paper
“Guest Authorship and Ghostwriting in Publications Related to Rofecoxib A Case Study of Industry Documents From Rofecoxib Litigation”
JAMA. 2008;299(15):1800-1812
“COMMENT
This case-study review of industry documents related to rofecoxib demonstrates that Merck used a systematic strategy to facilitate the publication of guest authored and ghost written medical literature. Articles related to rofecoxib were frequently authored by Merck employees but attributed first authorship to external, academically affiliated investigators who did not always disclose financial support from Merck, although financial support of the study was nearly always provided.”
Merck made many millions, if not billions of dollars of profit from this drug. What’s the hurry in assessing its side effects when the major side effect is such profit?
Is pooling fooling? Sometimes it is, sometimes it isn’t. Unfortunately we each have to make that determination when difficult circumstances such as medical dilemmas present. It is useful for people who understand the technical issues to offer an opinion as to who is and who is not trying to fool you. The rest of us have to assess such opinions carefully, often by building a trust network of people we judge knowledgeable. Build your trust network carefully now, your life will one day depend on it.
Steven:
Yes, “Build your trust network carefully now, your life will one day depend on it,” but it’s very tricky to assess what they trusted and how to weigh that in your choice.
But first, it is a very nice post by Stephen (note the different spelling) bringing out some issues that underlie what is often referred to as meta-analysis. Hopefully the most important issue will become evident: that far too few people think through more than one study carefully or fully (and any one study can usually be split into sub-studies, though these are more likely to be of equal quality).
Undisclosed “garbage in” haunts everyone.
Some trust the Cochrane Collaboration but one of their members has recently cautioned against that.
(I am guessing Stephen already is aware of this.)
Tom Jefferson, et al (of The Cochrane Collaboration). Risk of bias in industry-funded oseltamivir trials: comparison of core reports versus full clinical study reports http://bmjopen.bmj.com/content/4/9/e005253.full
The conclusion:
“This approach is not possible when assessing trials reported in journal publications, in which articles necessarily reflect post hoc reporting with a far more sparse level of detail. We suggest that when bias is so limiting as to make meta-analysis results unreliable, either it should not be carried out or a prominent explanation of its clear limitations should be included alongside the meta-analysis.”
What is interesting here is that this is the first time anyone in the Cochrane Collaboration had access to the data usually only regulators have.
A perhaps extreme interpretation of “it should not be carried out” would be that the current Cochrane library be taken off line and published papers from the group be retracted from journals.
A less extreme would be that the label “a prominent explanation of its clear limitations should be included” be affixed to help prevent anyone from thinking they “built their trust network credibly”.
Keith: I looked up your BMJ link–interesting.
http://bmjopen.bmj.com/content/4/9/e005253.full
Perhaps Cochrane will increasingly rely on clinical study reports and be able to determine bias, rather than “risk of bias”. I don’t see why they wouldn’t move to the kind of detail the reviewers are calling for. The following is from the article:
“We found the Cochrane risk of bias tool to be difficult to apply to clinical study reports. We think this is not because the tool was constructed to assess journal publications but, as with all list-like instruments, its use lends itself to a checklist approach (in which each design item is sought and, if found, eliminated from the bias equation rather than with thought and consideration).
The background to our use of clinical study reports was our mistrust of journal publications of oseltamivir trials. Many trials were unpublished, and of those published, we found and documented examples of reporting bias. At least one trial publication was drafted by an unnamed medical writer. As evidence of reporting bias in industry trial publication mounts, 8 ,16–21 we believe Cochrane reviews should increasingly rely on clinical study reports as the basic unit of analysis. Sponsors and researchers both have a responsibility to make all efforts to make full clinical study reports publicly available. The systematic evaluation of bias or risk of bias remains an essential aspect of evidence synthesis, as it forces reviewers to critically examine trials. However, the current Cochrane risk of bias tool does not sufficiently identify possible faults with study design, and nor does it help to organise and check the coherence of large amounts of information that are found in clinical study reports. Our experience suggests that more detailed extraction sheets that prompt reviewers to consider additional aspects of study may be needed. Until a more appropriate guide is developed, we offer our custom extraction sheets to Cochrane reviewers and others interested in assessing risk of bias using clinical study reports and encourage further development.”
In reply to Steven’s comments, I am not defending Merck’s original handling of the rofecoxib story. In fact, in a previous paper that I cited (1), I pointed out that, in deciding whether to switch from naproxen to rofecoxib, it made little practical difference whether the latter was cardiotoxic or the former cardioprotective.
However, I also cannot regard the Juni et al. methodology as appropriate. (In fact the authors seem to have subsequently embraced network meta-analysis, and it would be an interesting exercise to revisit the data they originally analysed using the rather different technique they now favour and see whether the conclusion would survive.)
As regards Keith’s comments, yes Cochrane are far from perfect and I drew attention to a case of double counting in the paper I cited (1). Equally well, of course, much of what they do is good. The point is that meta-analyses should be judged on technique and checkability rather than on authorial reputation.
As regards Deborah’s question, what I meant by Fisher’s approach being more direct was that it was sufficient for a researcher to justify preference for one valid test over another in terms of a statistic rather than in terms of a hypothesis that justified the statistic. See https://errorstatistics.com/2014/02/21/stephen-senn-fishers-alternative-to-the-alternative/
No, I am not suggesting cutting things up any way you like. However, in another context, that of closed test procedures, an approach can be used in which you designate a sequence of tests that you follow in a principled manner. These do sometimes involve pooling treatments at a higher level, although this is usually not where you want to stop. In the spirit of this approach, Juni et al.’s original pooling could have been justified to prove that rofecoxib was worse than at least one of the treatments but not to assert that it was therefore worse than them all. (A small sketch of the closed testing idea follows below.)
(1) Senn, S.J., Overstating the evidence: double counting in meta-analysis and related problems. BMC Medical Research Methodology, 2009. 9: p. 10.
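To make the closed testing idea concrete, here is a minimal sketch with invented p-values, using a simple Bonferroni rule as a stand-in for whichever valid local test one would actually choose: an elementary hypothesis (say, ‘rofecoxib is no different from comparator j’) is rejected only if every intersection hypothesis containing it is rejected.

```python
# A minimal sketch of a closed testing procedure (not the specific
# procedure discussed above). Elementary hypotheses H1..Hk are rejected
# only if every intersection hypothesis containing them is rejected by a
# valid local test -- here, for simplicity, a Bonferroni test.
from itertools import combinations

def closed_test(p_values, alpha=0.05):
    """Return the indices of elementary hypotheses rejected by closed
    testing with Bonferroni local tests."""
    k = len(p_values)
    all_subsets = [s for r in range(1, k + 1) for s in combinations(range(k), r)]
    # An intersection hypothesis (subset) is rejected if its Bonferroni
    # local p-value is at most alpha.
    rejected_subsets = {s for s in all_subsets
                        if min(p_values[i] for i in s) * len(s) <= alpha}
    # Reject H_i only if every intersection containing i is rejected.
    return [i for i in range(k)
            if all(s in rejected_subsets for s in all_subsets if i in s)]

# Hypothetical p-values for a treatment versus three comparators.
print(closed_test([0.004, 0.03, 0.20]))  # -> [0]: only the first null is rejected
```

With these numbers the closure rejects only the first elementary hypothesis even though the overall three-way intersection is rejected: exactly the distinction between rejecting Hjoint and rejecting every component.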
Stephen: Thank you for coming down from your mountain, river, ravine, briar patch, cavern, stream or precipice to reply to comments. Seriously.
Stephen:
> yes Cochrane are far from perfect
As we all are, but here the “undisclosed garbage in” that haunts everyone is not their fault, and a clear statement from them of the limitations that follow from this is very commendable.
The remaining important question is: if you or a loved one needed a treatment (and you had no inside information or access to regulatory data), would you read a Cochrane meta-analysis (or even one of mine from years ago)?
I’ll try later to say something about your other Fisher post from Peirce’s perspective, and about one of my past attempts (with input from David Cox) to provide a statistical theory for Peto’s approach.
Keith: Curious to hear your Peircean thoughts (with input from Cox).
Another great post! Senn’s position seems to be rather close to Richard Peto’s position, where the test is really the thing in a meta-analysis, not this pooled estimate of effect size that everyone seems so hung up on these days (always shaped like a diamond!). In particular, Peto strongly favored the “fixed effects” analysis over the “random effects” analysis simply because the fixed effects analysis provides a valid test of the null that Stephen mentions above. Most meta-analysts today seem to prefer the random effects model because it seems silly to assume a constant treatment effect across studies (and it should!), but they miss the larger point that the random effects model makes some pretty strong (and quite silly) assumptions about random sampling of study designs. Whenever I try to explain why I strongly favor the fixed effects approach (I don’t really give a damn about the pooled estimate of treatment effect), all I get back is something like “but then you’re assuming a constant effect across studies, which is just ridiculous.” No, I’m not assuming any such thing…
Mark: I’d be grateful if you explained a bit more about what you wrote in your comment: e.g., about “the test” being the thing, not the pooled estimate, about the fixed effects, and what you infer when Senn’s null is rejected.
Hi Deborah,
Regarding “the test being the thing,” I personally think that the test of the null hypothesis (in particular, Fisher’s “strong null”, more on this below) is the primary inference warranted by any randomised trial (putting “non-inferiority” aside for now). I realize that an estimate of treatment effect is desirable, but such estimates typically require assuming some model (unless we’re looking at a simple difference in means), and as David Freedman showed in a series of papers, randomization alone does not provide a basis for such models. Thus, as emphasised by Thomas Cook and David DeMets in their excellent (and practical!) book, a randomized trial is primarily about testing a hypothesis. So, if individual trials focus on the test, why shouldn’t pooled summaries of such trials? I understand that others (including Senn, I believe) do emphasise the estimate of treatment effect in individual trials…. I just don’t put much stock in it. I probably risk being warned about not “making a fetish of randomization” (a phrase that I REALLY want to ask Senn about), but that’s just how I see it. I think I might have a randomization fetish!
Regarding not focusing on the pooled estimate, that frankly requires far too many modeling assumptions (unnecessary modeling assumptions) for my comfort level.
Fixed effects versus random effects. Simply put, fixed effects assume that all trials are estimating (errrr… testing) a constant effect (which, of course, they are under the null), whereas random effects assume that effects may vary across trials but that they’re distributed around some average effect across all possible trials. (A small numerical sketch of the two approaches follows at the end of this comment.)
So, what do I infer if Senn’s null is rejected? Pretty much the same thing that I think that one can infer from the rejection of the strong null in an individual trial. Namely that the test intervention had a differential effect for at least *some* folks who were randomized in the trial, although we can’t even pinpoint who or how many. That’s really a pretty strong conclusion, if you think about it… I’m saying that we can infer (with possible error, of course) that the treatment *caused* an effect for some people, which is why we were doing the study in the first place! Anyway, I’d interpret a meta-analysis similarly…. Namely, as Senn said, that at least one of the component hypotheses can be rejected. It might seem weak, it’s definitely uncertain, but I truly think that this is as far as randomization takes us (and I’m not too convinced that we can go much further than randomization).
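For readers who want to see the fixed/random distinction in numbers, here is a minimal sketch with invented study results: an inverse-variance fixed-effect pool alongside a DerSimonian-Laird random-effects pool, the latter using one common (if crude) estimator of the between-trial variance.

```python
# A minimal sketch contrasting a fixed-effect (inverse-variance) pooled
# estimate with a DerSimonian-Laird random-effects pooled estimate.
# The study estimates and standard errors are illustrative only.
import numpy as np
from scipy.stats import norm

theta = np.array([0.10, 0.45, 0.30, -0.05, 0.60])  # per-trial effect estimates
se = np.array([0.15, 0.20, 0.10, 0.25, 0.30])      # their standard errors

w = 1.0 / se**2                                    # fixed-effect weights
theta_fe = np.sum(w * theta) / np.sum(w)
se_fe = np.sqrt(1.0 / np.sum(w))

# DerSimonian-Laird estimate of the between-trial variance tau^2.
k = len(theta)
Q = np.sum(w * (theta - theta_fe)**2)              # Cochran's Q statistic
c = np.sum(w) - np.sum(w**2) / np.sum(w)
tau2 = max(0.0, (Q - (k - 1)) / c)

w_re = 1.0 / (se**2 + tau2)                        # random-effects weights
theta_re = np.sum(w_re * theta) / np.sum(w_re)
se_re = np.sqrt(1.0 / np.sum(w_re))

for label, est, s in [("fixed", theta_fe, se_fe), ("random", theta_re, se_re)]:
    print(f"{label:6s}: {est:.3f} (SE {s:.3f}), p = {2 * norm.sf(abs(est / s)):.3f}")
```

As Mark and Senn argue above, under the joint null the fixed-effect z-test is valid whether or not the true effects are constant; it is the random-effects interval that leans on the additional assumption that the trials behave like a sample from a population of trials.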
Mark:
First, I don’t know why your comment disappeared; I dug it out. I wanted to say that I like the idea of merely negating the null based on randomization without an alternative, and in fact what I was trying to get at is the characterization of the type of case where this kind of “pure significance test” is appropriate. Of course Cox gives lists, but not including meta-analysis to my knowledge.
Mark: I do agree with you (and Stephen) about being on firmest ground addressing “test intervention had a differential effect for at least *some* folks who were randomized in the trial” – but you do need to go further to answer: should another study be done, and if so how; and if not, what do you conclude about the effect(s) of the treatment? I would not attribute this reasoning to Peto but rather to Fisher (e.g. his discussion in Design of Experiments quoted in Meta-Analysis: Conceptual Issues of Addressing Apparent Failure of Individual Study Replication or “Inexplicable” Heterogeneity. K. O’Rourke http://link.springer.com/chapter/10.1007/978-1-4613-0141-7_11#page-1 .)
Peto also said (according to David Cox) that he knew the treatment effects varied (he was NOT assuming them fixed) but he did not want to account for this variance of treatment effects in the estimate of the typical treatment effect, and admitted his interval was not really a confidence interval. I invited Peto to give a presentation on this at the SAMSI Meta-analysis program and he declined.
Now to make sense of what Peto was doing, one could think of the estimate/test as being for a counter-factual population that would have the same mix of patients as in the trials – it is a post-stratified estimate for that population. Now if one could assume that effects are always positive or always negative (but never both), that would be a useful assessment (and very precise). But David Cox convinced me that was a silly assumption – it might hold for certain vaccines (facilitates recognition but could not deter it), but I am not sure.
I agree with Mark. The situation in my view is similar to the oft-expressed but false claim that the two-sample t-test (in an experimental setting) relies on an assumption of equal variances. It does not, because if the treatment is (say) a placebo it makes no difference to what group anyone is assigned, and over all randomisations the distributions must be the same.
However, as soon as one begins to consider (based on a test) that the null hypothesis is not true, then one is in the realm of an alternative hypothesis. It now becomes relevant to consider more carefully the matter of estimating the difference and, as soon as one admits that the treatment assigned makes a difference, then one has to concede that the variances might be different.
My position on meta-analysis is that I think that a fixed-effects meta-analysis should always be done. Random effects meta-analysis may be of interest for certain purposes but one must be very careful as to what one is trying to do and how far the standard formal random effects model can get you: the trials you ran are not a random sample of the trials you might have run, and the average patient is not in the average trial.
Thanks to you and Mark, this actually helps a lot! And furthermore, I am in agreement.
@mark @steven in defense of random effects models, the motivation is not so much to model reality as to focus on improved estimation by introducing shrinkage, from a bias-variance tradeoff perspective (a small sketch of this shrinkage follows after this comment). I think critiques of random effects models from the perspective of what-does-this-correspond-to-in-reality lead to the use of unbiased estimators for the sake of unbiasedness, rather than correctness.
Testing in a meta-observational context is fraught with peril to begin with and it is almost impossible to make a valid test given unknown-unknown confounders (as Keith points out). Thus I’d rather someone give me an estimate with a well-defined model and put the caveats front and center.
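On the shrinkage motivation above, a minimal sketch with invented numbers and an assumed (rather than estimated) between-trial variance: under a normal-normal random-effects model each study’s estimate is pulled toward the pooled mean in proportion to its imprecision, which is the bias-variance trade that motivates the model quite apart from how literally one takes the random-sampling-of-trials story.

```python
# A minimal sketch of the shrinkage point: under a normal-normal
# random-effects model, each study's estimate is shrunk toward the pooled
# mean in proportion to its imprecision. Numbers are illustrative, and
# tau2 is simply assumed here (in practice it would be estimated, e.g. by
# DerSimonian-Laird as in the earlier sketch).
import numpy as np

theta = np.array([0.10, 0.45, 0.30, -0.05, 0.60])  # per-trial estimates
se = np.array([0.15, 0.20, 0.10, 0.25, 0.30])      # their standard errors
tau2 = 0.02                                        # assumed between-trial variance

w = 1.0 / (se**2 + tau2)
mu = np.sum(w * theta) / np.sum(w)                 # random-effects pooled mean

# Shrinkage factor B: noisier studies are shrunk more toward mu.
B = se**2 / (se**2 + tau2)
theta_shrunk = B * mu + (1 - B) * theta

for t, s, ts in zip(theta, se, theta_shrunk):
    print(f"raw {t:+.2f} (SE {s:.2f}) -> shrunk {ts:+.2f}")
```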
I agree, when it comes to observational studies.
I partly agree. I think that random effect models have their uses. However, I think that they have been very much over-hyped. In particular, I object to claims that they are a natural Bayesian way to incorporate uncertainty if the Bayesian formulation that is then used does not really reflect reasonable belief properly. This was the main thrust of my “being precise about vagueness” paper. In particular, as I pointed out in that paper, any reasonable Bayesian approach has to be capable of dealing with the case when there is only one trial.
However, if I dislike naive Bayesian approaches, the same applies a fortiori to naive frequentist approaches, where one assumes that because one has pooled (for example) comparators and found a difference from the treatment, one is entitled to claim that the treatment is different to each and every comparator.
Reference
Senn, S.J., Trying to be precise about vagueness. Statistics in Medicine, 2007. 26: p. 1417-1430.
Schachtman has something of interest on his blog today on Sander Greenland and statistical testimony.
http://schachtmanlaw.com/sander-greenland-on-the-need-for-critical-appraisal-of-expert-witnesses-in-epidemiology-and-statistics/
It links to the discussion that arose on my post featuring Schachtman on Oreskes:
https://errorstatistics.com/2015/01/04/significance-levels-made-whipping-boy-on-climate-change-evidence-is-05-too-strict/#comments