Over 100 patients signed up for the chance to participate in clinical trials at Duke (2007-10) that promised a custom-tailored cancer treatment spewed out by a cutting-edge prediction model developed by Anil Potti, Joseph Nevins and their team. Their model purported to predict your probable response to one or another chemotherapy based on microarray analyses of various tumors. While Potti and Nevins are now described as “false pioneers” of personalized cancer treatments, it’s not clear what has been learned from the fireworks surrounding the Potti episode overall. Most of the popular focus has been on glaring typographical and data-processing errors; at least that’s what I mainly heard about until recently. Although those errors were quite crucial to the science in this case (surely more so than Potti’s CV padding), what interests me now are the general methodological and logical concerns that rarely make it into the popular press. These revolve around the capability of the predictive model, and the back-and-forth criticisms and defenses of its reported error rates, both for so-called “internal validity” and especially for the intended recommendations on new patients. Even after the errors were exposed by Baggerly and Coombes (2007, 2009), the trials were allowed to continue (after a brief pause while an internal Duke committee investigated; it found no problems). Surely they would have tested and validated a model on which they would be recommending chemo treatments and associated surgery; it couldn’t be that these human subjects were the first external tests of the model. Could it?
This is my first foray into the episode, and I don’t claim to have a worked-out view on the methodology (that’s the beauty of a blog, right?). Here, then, for your weekend reading, are some background materials in relation to this episode. For starters, there is what Baggerly and Coombes call “a starter kit”:
Other key links and background will be found through this post. I’ll also be adding to this case later on.
1.2 It’s Not My Fault if You Didn’t Apply My Method
True or false? You can’t complain about not being able to reproduce my result if you haven’t used my method.
Well, suppose I’ve claimed to provide evidence of a genuine statistical effect or of a reliable statistical predictor, and applying “my method” for warranting my claim depends on such gambits as leaving out unfriendly data, cherry-picking, or ignoring multiple testing. Then you can rightly complain about not being able to reproduce my result. That’s because my claim C (to have evidence of a genuine effect or reliable predictor) readily “passes the test”–by the lights of my method–even if C is false. Diederik Stapel had “a method” for assuring support for his social psychology hypotheses (i.e., inventing data), but no one would think the purported effects are justified by his method simply because we too could finagle data that “fit” his hypotheses.
On the other hand, we can imagine cases where it would be correct to complain that a perfectly valid method had been misapplied. So some distinctions are needed, and I will try to supply them. (I take this up in Part 2).
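The point can be made concrete with a small simulation of my own (a generic sketch of the multiple-testing gambit, not a reconstruction of any analysis from the Duke case): a “method” that runs many tests on pure noise and reports only the best one will “find” a significant effect nearly every time.

```python
import numpy as np
from math import erfc, sqrt

rng = np.random.default_rng(0)

def best_p_of_many(n_tests=100, n=30):
    """Run n_tests two-sided z-tests on pure-noise samples; return the smallest p-value."""
    smallest = 1.0
    for _ in range(n_tests):
        x = rng.standard_normal(n)                 # the null is true: population mean is 0
        z = x.mean() / (x.std(ddof=1) / sqrt(n))
        p = erfc(abs(z) / sqrt(2))                 # two-sided p, normal approximation
        smallest = min(smallest, p)
    return smallest

# "My method": report only the best of 100 looks at noise, batch after batch.
hits = sum(best_p_of_many() < 0.05 for _ in range(200))
print(f"batches of pure noise yielding a 'significant' effect: {hits}/200")
```

With 100 tests at the 0.05 level, the chance of at least one nominally significant result on pure noise is about 1 − 0.95^100 ≈ 0.99, which is why a claim warranted by such a method “passes” even when false.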
1.3 Potti and Nevins and Baggerly and Coombes
Potti and Nevins rejected the criticism of Baggerly and Coombes (B & C), which was based on B & C’s inability to reproduce the published results, and denounced B & C’s allegation that the Potti and Nevins method does “not work”. B & C had written:
“When we apply the same methods but maintain the separation of training and test sets, predictions are poor….Simulations show that the results are no better than those obtained with randomly selected cell lines.” (Baggerly, Wang and Coombes, Nature Medicine, Nov 2007, p. 1277.)
To which Potti and Nevins responded:
[T]hey suggest that our method of including both training and test data in the generation of mutagenes (principal components) is flawed. We feel this approach is entirely appropriate, as it does not include any information regarding the actual patient response and thus does not influence the generation of the signature with respect to predicting patient outcome….In short, they reproduce our result when they use our method. (Potti and Nevins, Nature Medicine, Nov 2007, p. 1277.)
The Institute of Medicine (IOM) report (link below) growing out of the Duke episode clearly appears to side with B & C:
“Candidate omics-based tests should be confirmed using an independent set of samples not used in the generation of the computational model, and when feasible, blinded to any outcome or phenotypic data until after the computational procedures have been locked down. …Ideally the specimens for independent confirmation will have been collected at a different point in time, at different institutions, from a different patient population, with samples processed in a different laboratory to demonstrate that the test has broad applicability and is not overfit to any particular situation.” (p. 36)
See Committee on the Review of Omics-Based Tests for Predicting Patient Outcomes in Clinical Trials; Board on Health Care Services; Board on Health Sciences Policy; Institute of Medicine, “Evolution of Translational Omics: Lessons Learned and the Path Forward.”
We will want to examine (in Part 2) when it is warranted to claim that a method “does not work” or “fails to reproduce”.
1.4 Steven McKinney letter to the IOM
It was a recent comment on this blog by statistician Steven McKinney that led to my delving further into this case. He agreed to my posting a letter he supplied to the IOM committee (PAF Document 19) below, and to responding to reader questions on this blog. [It can be found at the Cancer Letter website page: www.cancerletter.com/downloads/20110107_2/download, item 3 in “Internal NCI documents” (zip files 1, 2 and 3).] So have a read, and I’ll come back to this in Part 2 later on.
December 16, 2010
Steven McKinney, Ph.D.
Molecular Oncology and Breast Cancer Program
British Columbia Cancer Research Centre
Christine M. Micheel, Ph.D.
Board on Health Care Services and National Cancer Policy Forum
Institute of Medicine
500 5th Street, NW, 767
Washington, DC 20001, USA
Dear Dr. Micheel,
I have been following with interest and concern the development of events related to the three clinical trials (NCT00509366, NCT00545948, NCT00636441) currently under review by the Institute of Medicine (Review of Omics-Based Tests for Predicting Patient Outcomes in Clinical Trials).
I have reviewed many of the omics papers related to this issue, and wish to communicate my concerns to the review committee. In brief, my concern is that the now-retracted papers, and many others issued by the Duke group, all use a flawed statistical analytical paradigm. Essentially, the paradigm involves fitting a statistical model to all available study data, then splitting the data into subsets, labeling one of them a “training” set and another a “validation” or “test” set, and showing that the statistical model works well for both sets. The analysis paradigm is described as a statistical train-test-validate exercise in several published papers, though it is technically not a true train-test-validate exercise, as the model under evaluation involves predictor components derived from the full data set.
I believe that this issue needs to be investigated as part of the Institute of Medicine’s review, because concerned readers who have written letters to journal editors have not been successful in educating a wider audience (in particular, journal editors and reviewers of biomedical journals) as to the problematic aspects of the analysis method repeatedly used by the Duke group. The issue at hand is not just one researcher who committed errors in one analysis, but rather the systematic use of a flawed analytical paradigm in multiple papers discussing personalized medicine in a widening scope of medical scenarios.
The statistical properties of this analytical paradigm, in particular its type I error rate, have not to my knowledge been reviewed or published. I respectfully request the IOM committee to include this issue in its agenda for the upcoming review, as findings from this committee will provide a broader educational opportunity, allowing journal editors and reviewers to have a better understanding of the statistical properties of the analyses repeatedly developed and submitted for publication by the Duke University investigators.
As a citizen of the United States and a taxpayer, and as a practicing biomedical applied statistician, I am especially concerned about the possibility that the funding garnered for such potentially flawed studies is detracting from other groups’ ability to obtain funding to perform valid research in the valuable arena of personalized medicine. Additionally, the use of human subjects in ongoing studies involving this methodology is ethically problematic.
I will discuss this issue in greater detail in the attached Appendix to this letter of concern, and provide citations to the literature illustrating the various aspects involved.
Thank you for your consideration of this matter.
Attachments: Appendix – Details of points of concern regarding the statistical analytical paradigm repeatedly used in personalized medicine research papers published by Duke University investigators.
Appendix – Details of points of concern regarding the statistical analytical paradigm repeatedly used in personalized medicine research papers published by Duke University investigators
In 2001 West et al.  published some details of a statistical analytical method involving “Bayesian regression models that provide predictive capability based on gene expression data”. In the Statistical Methods section of this paper they state that the “Analysis uses binary regression models combined with singular value decompositions (SVDs) and with stochastic regularization by using Bayesian analysis (M.W., unpublished work) as discussed and referenced in Experimental Procedures, which are published as supporting information on the PNAS web site.”
Given the current state of affairs, it is of concern that so many papers have been published using this methodology when some undetermined amount of the underlying theory is unpublished.
In the supporting information on the PNAS website, the authors state “Statistical Methods. The analysis uses standard binary regression models combined with singular value decompositions (SVDs), also referred to as singular factor decompositions, and with stochastic regularization using Bayesian analysis (1). It is beyond the scope here to provide full technical details, so the interested reader is referred to ref. 2, which extends ref. 3 from linear to binary regression models; these manuscripts are available at the Duke web site, www.isds.duke.edu/~mw. Some key details are elaborated here.”
It is unclear why it should be “beyond the scope” to include details of the analytical methods in the supporting information materials – typically this is precisely the place to provide such details. Fortunately, the reference “ref. 2” cited above (West, M., Nevins, J. R., Marks, J. R., Spang, R. & Zuzan, H. (2000) German Conference on Bioinformatics, in press) is still available as an online publication in the electronic journal In Silico Biology (reference below).
In the online journal article, the authors provide additional details about the analytical method, including the fact that “In a first step we fitted the regression model using the entire set of expression profiles and class assignments” (see the section titled “Probabilistic tumor classification”). This is a key point, and is precisely why the investigators’ repeated claim in subsequent publications to have “validated” their analysis is false and deserves thorough statistical evaluation as part of the IOM review of these issues. When predictor variables derived from the entire set of data are used, it cannot be claimed that subsequent “validation” exercises are true cross-validation or out-of-sample evaluations of the model’s predictive capabilities, as the Duke investigators repeatedly state in publications.
In the same paragraph, the authors state “Note, that if we draw a decision line at a probability of 0.5 we obtain a perfect classification of all 27 tumors. However the analysis uses the true class assignments z1 … z27 of all the tumors. Hence, although the plot demonstrates a good fit of the model to the data it does not give us reliable indications for a good predictive performance. One might suspect that the method just “stores” the given class assignments in the parameter, . Indeed this would be the case if one uses binary regression for n samples and n predictors without the additional restrains introduced by the priors. That this suspicion is unjustified with respect to the Bayesian method can be demonstrated by out-of-sample predictions.”
I believe this is the key flaw in the reasoning behind this statistical analytical method. The authors state without proof (via theoretical derivation or simulation study) that this Bayesian method is somehow immune to the issue of overfitting a model to a set of data. This is the aspect of this analytical paradigm that truly needs a sound statistical evaluation, so that a determination as to the true predictive capacity of this method can be scientifically demonstrated.
The authors state further in the Discussion section that “Clearly, the methodology is not limited to only this medical context nor is it specialized to diagnostic questions only. We have applied our model to the problem of predicting the nodal status of breast tumors based on expression profiles of tissue samples from the primary tumor. The results are reported in West et al., 2001. Due to the very general setting of our model, we expect it would be successful for a large class of diagnostic problems in various fields of medicine.”
Interestingly, the supporting material cited in  actually references . This again is an issue of concern.
Also now of concern is the realization of the authors’ prediction that they expect the method to be applicable to a large class of diagnostic problems in various fields of medicine. Indeed the authors have used this methodology in a widening scope of medical fields, as will be outlined below. That this methodology has been accepted for publication in many journals over many years, before its statistical properties have truly been investigated, is indeed an issue of concern.
I believe that part of the reason that journal editors and reviewers have not questioned the methodology is that the method uses primarily Bayesian statistical models, which are not as widely taught or understood in biological and medical higher education. It is difficult for many non-statisticians to follow the statistical logic and mathematical aspects of such complex Bayesian methods.
Thus the authors clearly describe that their paradigm is to fit a model to an entire data set, derive a set of predictors from that model, then use those predictors along with others on subsets of the entire data set. They state that the excellent performance of such models is validated, when it appears that what is actually demonstrated is that a model overfitted to an entire data set performs well on subsets of that entire data set. This issue should in my opinion be a key issue of concern in this IOM review of this omics methodology.
In early 2006, an apparently seminal paper was published by the Duke investigators in the journal Nature (Bild et al. ). This was followed by a publication in the New England Journal of Medicine (Potti et al. ) and another in Nature Medicine (Potti et al. ). All of these papers cite West et al.  and use the methodology therein. References  and  discuss breast cancer, and reference  discusses lung cancer. All use the same analytical paradigm: fitting an initial model to all available data to develop predictors (called “metagenes” in  and , then “gene expression signatures” in ) based on a singular value decomposition of the entire data set, then using these predictors on various subsets of the data involved and calling some portion of this subset analysis a “validation” exercise.
At this point researchers at the M.D. Anderson Cancer Center explored the possibility of adapting this analytical paradigm, and asked statisticians Keith Baggerly and Kevin Coombes to review the publications. Their investigations are of course key in shedding light on this issue. In 2007 Baggerly and Coombes published a letter in the Correspondence section of Nature Medicine (Coombes et al. ). Coombes et al. state “Their software does not maintain the independence of training and test sets, and the test data alter the model. Specifically, their software uses ‘metagenes’: weighted combinations of individual genes. Weights are assigned through a singular value decomposition (SVD). Their software applies SVD to the training and test data simultaneously, yielding different weights than when SVD is applied only to the training data (Supplementary Report 9). Even using this more extensive model, however, we could not reproduce the reported results,” and further state that “When we apply the same methods but maintain the separation of training and test sets, predictions are poor (Fig. 1 and Supplementary Report 7). Simulations show that the results are no better than those obtained with randomly selected cell lines (Supplementary Report 8).”
Thus Coombes et al. have performed some initial analysis that sheds light on the true type I and type II error rates of this methodology. What is unclear from the work of Coombes et al. is the degree of departure from the null condition of no difference between groups of interest in the data sets used, so that the power of the statistical method can be properly evaluated. This is why a careful systematic study of this methodology, using known null data (data with equivalent distributional characteristics between groups of interest) and known non-null data (data with increasing levels of differential characteristics between groups of interest) is required, so that power characteristics of the methodology can be measured under null and non-null conditions. Further, such analysis needs to properly evaluate model performance on true out-of-sample data.
In 2007, the Duke group published another heavily cited paper (Hsu et al. , recently retracted on November 16, 2010). SVD components developed for this paper were termed “gene expression signatures”. All of these papers share the attribute that excessive claims of model accuracy are repeatedly asserted, with purported evidence from exercises termed “cross-validation”.
Recently, additional papers from the Duke investigators have been published concerning viral infection (Zaas et al. ) and bacterial infection (Zaas et al. ). Statnikov et al.  submitted a letter challenging this methodology once again, stating “We suggest several approaches to improve the analysis protocol that led to discovery of the acute respiratory viral response signature. First, to obtain an unbiased estimate of predictive accuracy, genes should be selected using the training set of subjects as opposed to selecting genes from the entire data set as was done in the study of Zaas et al. (2009). The latter gene selection procedure is known to typically lead to overoptimistic predictive accuracy estimates. Second, the cross-validation procedure employed by Zaas et al. should be modified to prohibit the use of samples from the same subjects both for developing signature and estimating its predictive accuracy, as this is another potential source of over-optimism.”
The Duke investigators, as with all previous challenges, offer only verbal refutations to these points, with no formal statistical evaluation via simulation or otherwise to address the true distributional properties of this method.
More recent papers from the Duke investigators that should be reviewed include Chen et al.  and Chen et al. . Additional complex Bayesian methods continue to be combined around the same analytical paradigm, and it is beyond the capability of many journal editors and reviewers to understand and deconstruct the arguments offered by the Duke investigators.
Additionally, with the apparent weight of so many seemingly accurate analysis results, resources such as research grants from federal agencies are being utilized without proper understanding of the value of the returns. Moreover, several studies on humans (the clinical trials currently scheduled for review, and the viral infection studies described in references  and ) have been conducted based on methodology with as yet unknown statistical properties. This is an issue of major concern, and a review of the statistical properties of the methodology used throughout these studies along with a publication of guidelines for evaluation of whether or not human trials involving this methodology should be permitted would be very valuable to the research community.
[McKinney References: see page 7 of PAF Document 19]
I’ll come back to this in “Potti Training and Test Data” Part 2. Please share your thoughts.
- Baggerly & Coombes. (2009). Deriving Chemosensitivity from cell lines: Forensic Bioinformatics and reproducible research in high-throughput Biology, Ann. of Appl. Stat., Vol. 3, No. 4 (Dec. 2009), pp. 1309-1334. [Starter Kit Webpage supplement for B&C 2009: http://bioinformatics.mdanderson.org/Supplements/ReproRsch-All/Modified/StarterSet/index.html]
- Baggerly, Coombes, Neeley (2008) Run Batch Effects Potentially Compromise the Usefulness of Genomic Signatures for Ovarian Cancer. JCO March 1, 2008:1186-1187.
- Coombes, Wang & Baggerly (2007). “Microarrays: retracing steps.” Nat. Med. Nov 13(11):1276-7.
- Dressman, Potti, Nevins & Lancaster (2008) In Reply. JCO March 1, 2008:1187-1188
- McShane (2010). NCI Address to Institute of Medicine Committee Convened to Review Omics-Based Tests for Predicting Patient Outcomes in Clinical Trials. PAF 20.
- Micheel et al (Eds) Committee on the Review of Omics-Based Tests for Predicting Patient Outcomes in Clinical Trials; Board on Health Care Services; Board on Health Sciences Policy; Institute of Medicine (2012). Evolution of Translational Omics: Lessons Learned and the Path Forward. Nat. Acad. Press.
- Potti et al.(2006). Genomic signatures to guide the use of chemotherapeutics. Nat. Med. Nov 12(11):1294-300. Epub 2006 Oct 22.
- Potti and Nevins (2007) Reply to Coombes, Wang & Baggerly Nat. Med. Nov 13(11):1277-8.
- Spang et al. (2002) Prediction and uncertainty in the analysis of gene expression profiles. In Silico Biology 2, 0033.