What have we learned from the Anil Potti training and test data fireworks? Part 1 (draft 2)

[Image: "toilet fireworks" by stephenthruvegas on Flickr]

Over 100 patients signed up for the chance to participate in the clinical trials at Duke (2007-10) that promised a custom-tailored cancer treatment spewed out by a cutting-edge prediction model developed by Anil Potti, Joseph Nevins and their team at Duke. Their model purported to predict your probable response to one or another chemotherapy based on microarray analyses of various tumors. While they are now described as “false pioneers” of personalized cancer treatments, it’s not clear what has been learned from the fireworks surrounding the Potti episode overall. Most of the popular focus has been on glaring typographical and data processing errors—at least that’s what I mainly heard about until recently. Although those errors were quite crucial to the science in this case (surely more so than Potti’s CV padding), what interests me now are the general methodological and logical concerns that rarely make it into the popular press. These revolve around the capability of the predictive model, and the back-and-forth criticisms and defenses of its reported error rates, both for so-called “internal validity” and especially for the intended recommendations on new patients. Even after the errors were exposed by Baggerly and Coombes (2007, 2009), the trials were allowed to continue (after a brief pause while a Duke internal committee investigated; it found no problems). Surely they would have tested and validated a model on which they would be recommending chemo treatments and associated surgery; it couldn’t be that these human subjects were the first external tests of the model? Could it?

This is my first foray into the episode, and I don’t claim to have a worked-out view on the methodology (that’s the beauty of a blog, right?). Here, then, for your weekend reading, are some background materials in relation to this episode. For starters, there is what Baggerly and Coombes call “a starter kit”:

http://bioinformatics.mdanderson.org/Supplements/ReproRsch-All/Modified/StarterSet/index.html

Other key links and background will be found throughout this post. I’ll also be adding to this case later on.

1.2 It’s Not My Fault if You Didn’t Apply My Method

True or false? You can’t complain about not being able to reproduce my result if you haven’t used my method.

Well, suppose I’ve claimed to provide evidence of a genuine statistical effect or of a reliable statistical predictor, and applying “my method” for warranting my claim depends on such gambits as: leaving out unfriendly data, cherry picking, ignoring multiple testing or the like. Then you certainly can rightly complain about not being able to reproduce my result. That’s because my claim C (to have evidence of a genuine effect or reliable predictor) readily “passes the test”–by the lights of my method–even if C is false. Diederik Stapel had “a method” for assuring support for his social psychology hypotheses (i.e., inventing data), but no one would think the purported effects are justified by his method simply because we too could finagle data that “fit” his hypotheses.

On the other hand, we can imagine cases where it would be correct to complain that a perfectly valid method had been misapplied. So some distinctions are needed, and I will try to supply them. (I take this up in Part 2).

1.3 Potti and Nevins and Baggerly and Coombes:

Potti and Nevins rejected Baggerly and Coombes’ (B & C) criticism, which was based on B & C’s inability to reproduce their results, and denounced B & C’s allegation that the Potti and Nevins method does “not work”.

“When we apply the same methods but maintain the separation of training and test sets, predictions are poor….Simulations show that the results are no better than those obtained with randomly selected cell lines.” (Coombes, Wang and Baggerly, Nature Medicine, Nov 2007, p. 1277.)

To which Potti and Nevins responded:

[T]hey suggest that our method of including both training and test data in the generation of metagenes (principal components) is flawed. We feel this approach is entirely appropriate, as it does not include any information regarding the actual patient response and thus does not influence the generation of the signature with respect to predicting patient outcome….In short, they reproduce our result when they use our method. (Potti and Nevins, Nature Medicine, Nov 2007, p. 1277.)

The Institute of Medicine (IOM) report (link below) growing out of the Duke episode clearly appears to be siding with B & C:

“Candidate omics-based tests should be confirmed using an independent set of samples not used in the generation of the computational model, and when feasible, blinded to any outcome or phenotypic data until after the computational procedures have been locked down. …Ideally the specimens for independent confirmation will have been collected at a different point in time, at different institutions, from a different patient population, with samples processed in a different laboratory to demonstrate that the test has broad applicability and is not overfit to any particular situation.” (p. 36)

See “Committee on the Review of Omics-Based Tests for Predicting Patient Outcomes in Clinical Trials; Board on Health Care Services; Board on Health Sciences Policy; Institute of Medicine, Evolution of Translational Omics: Lessons Learned and the Path Forward.”

We will want to examine (in part 2) when it is warranted to claim a method “does not work” or “fails to reproduce”.

1.4 Steven McKinney letter to the IOM

It was a recent comment on this blog by statistician Steven McKinney that led to my delving further into this case. He agreed to my posting below a letter he supplied to the IOM committee (PAF Document 19), and to responding to reader questions on this blog. [It can be found at the Cancer Letter website page www.cancerletter.com/downloads/20110107_2/download, item 3 in “Internal NCI documents (zip files 1, 2 and 3)”.] So have a read, and I’ll come back to this in Part 2 later on.

——-

December 16, 2010

Steven McKinney, Ph.D.
Statistician
Molecular Oncology and Breast Cancer Program
British Columbia Cancer Research Centre
Vancouver B.C.
Canada

Christine M. Micheel, Ph.D.

Program Officer
Board on Health Care Services and National Cancer Policy Forum
Institute of Medicine
500 5th Street, NW, 767
Washington, DC 20001, USA

Dear Dr. Micheel,

I have been following with interest and concern the development of events related to the three clinical trials (NCT00509366, NCT00545948, NCT00636441) currently under review by the Institute of Medicine (Review of Omics-Based Tests for Predicting Patient Outcomes in Clinical Trials).

I have reviewed many of the omics papers related to this issue, and wish to communicate my concerns to the review committee. In brief, my concern is that the now retracted papers, and many others issued by the Duke group, all use a flawed statistical analytical paradigm. Essentially, the paradigm involves fitting a statistical model to all available study data, then splitting the data into subsets, labeling one of them a “training” set and another a “validation” or “test” set, and showing that the statistical model works well for both sets. The analysis paradigm is described as a statistical train-test-validate exercise in several published papers, though it is technically not a true train-test-validate exercise, as the model under evaluation involves predictor components derived from the full data set.
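
To make the distinction concrete, here is a minimal sketch of the two designs in Python, on hypothetical simulated data (scikit-learn; this is only an illustration of the paradigm described above, not the Duke group’s actual code or data). The difference is simply where the singular value decomposition / principal components are computed.

# Sketch: where the SVD/principal components are computed determines whether
# the "test" samples are truly independent of the fitted model.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 1000))   # hypothetical expression matrix: 100 samples x 1000 genes
y = rng.integers(0, 2, size=100)   # hypothetical binary outcome (e.g., responder / non-responder)

# Paradigm described above: components derived from ALL samples, then split.
pcs_all = PCA(n_components=3).fit_transform(X)
Z_tr, Z_te, y_tr, y_te = train_test_split(pcs_all, y, test_size=0.5, random_state=1)
acc_leaky = LogisticRegression().fit(Z_tr, y_tr).score(Z_te, y_te)

# True train-test exercise: components derived from the training samples only,
# then the held-out samples are merely projected onto those fixed components.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=1)
pca = PCA(n_components=3).fit(X_tr)
acc_true = (LogisticRegression()
            .fit(pca.transform(X_tr), y_tr)
            .score(pca.transform(X_te), y_te))

print(acc_leaky, acc_true)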

I believe that this issue needs to be investigated as part of the Institute of Medicine’s review, because concerned readers who have written letters to journal editors have not been successful in educating a wider audience (in particular, journal editors and reviewers of biomedical journals) as to the problematic aspects of the analysis method repeatedly used by the Duke group. The issue at hand is not just one researcher who committed errors in one analysis, but rather the systematic use of a flawed analytical paradigm in multiple papers discussing personalized medicine across a widening scope of medical scenarios.

The statistical properties of this analytical paradigm, in particular its type I error rate, have not to my knowledge been reviewed or published. I respectfully request that the IOM committee include this issue in its agenda for the upcoming review, as findings from this committee will provide a broader educational opportunity, allowing journal editors and reviewers to better understand the statistical properties of the analyses repeatedly developed and submitted for publication by the Duke University investigators.

As a citizen of the United States and a taxpayer, and as a practicing biomedical applied statistician, I am especially concerned about the possibility that the funding garnered for such potentially flawed studies is detracting from other groups’ ability to obtain funding to perform valid research in the valuable arena of personalized medicine. Additionally, the use of human subjects in ongoing studies involving this methodology is ethically problematic.

I will discuss this issue in greater detail in the attached Appendix to this letter of concern, and provide citations to the literature illustrating the various aspects involved.

Thank you for your consideration of this matter.

Yours sincerely
Steven McKinney

Attachments: Appendix – Details of points of concern regarding the statistical analytical paradigm repeatedly used in personalized medicine research papers published by Duke University investigators.

Appendix: Details of points of concern regarding the statistical analytical paradigm repeatedly used in personalized medicine research papers published by Duke University investigators

In 2001 West et al. [1] published some details of a statistical analytical method involving “Bayesian regression models that provide predictive capability based on gene expression data”. In the Statistical Methods section of this paper they state that the “Analysis uses binary regression models combined with singular value decompositions (SVDs) and with stochastic regularization by using Bayesian analysis (M.W., unpublished work) as discussed and referenced in Experimental Procedures, which are published as supporting information on the PNAS web site.”

Given the current state of affairs, it is of concern that so many papers have been published using this methodology when some undetermined amount of the underlying theory is unpublished.

In the supporting information on the PNAS website, the authors state “Statistical Methods. The analysis uses standard binary regression models combined with singular value decompositions (SVDs), also referred to as singular factor decompositions, and with stochastic regularization using Bayesian analysis (1). It is beyond the scope here to provide full technical details, so the interested reader is referred to ref. 2, which extends ref. 3 from linear to binary regression models; these manuscripts are available at the Duke web site, www.isds.duke.edu/~mw. Some key details are elaborated here.”

It is unclear why it should be “beyond the scope” to include details of the analytical methods in the supporting information materials – typically this is precisely the place to provide such details. Fortunately the reference “ref. 2” cited above (West, M., Nevins, J. R., Marks, J. R., Spang, R. & Zuzan, H. (2000) German conference on Bioinformatics, in press.) is still available as an online publication in the electronic journal In Silico Biology (reference [2] below).

In the online journal article, the authors provide additional details about the analytical method, including the fact that “In a first step we fitted the regression model using the entire set of expression profiles and class assignments” (see the section titled “Probabilistic tumor classification”). This is a key point, and is precisely why the investigators’ continued claims in publications to have “validated” their analysis are false and deserve thorough statistical evaluation as part of the IOM review of these issues. When predictor variables derived from the entire set of data are used, it cannot be claimed that subsequent “validation” exercises are true cross-validation or out-of-sample evaluations of the model’s predictive capabilities, as the Duke investigators repeatedly state in publications.

In the same paragraph, the authors state “Note, that if we draw a decision line at a probability of 0.5 we obtain a perfect classification of all 27 tumors. However the analysis uses the true class assignments z1 … z27 of all the tumors. Hence, although the plot demonstrates a good fit of the model to the data it does not give us reliable indications for a good predictive performance. One might suspect that the method just “stores” the given class assignments in the parameter, pnas 19 formula 1. Indeed this would be the case if one uses binary regression for n samples and n predictors without the additional restrains introduced by the priors. That this suspicion is unjustified with respect to the Bayesian method can be demonstrated by out-of-sample predictions.”

I believe this is the key flaw in the reasoning behind this statistical analytical method. The authors state without proof (via theoretical derivation or simulation study) that this Bayesian method is somehow immune to the issue of overfitting a model to a set of data. This is the aspect of this analytical paradigm that truly needs a sound statistical evaluation, so that a determination as to the true predictive capacity of this method can be scientifically demonstrated.
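
The “storage” concern in the quoted passage is easy to see in a small simulation (Python, pure-noise data; a sketch of the general phenomenon only, not of the West et al. model, and it says nothing about whether particular priors do or do not repair the problem):

# Sketch: with as many predictors as samples, an unpenalized fit can "store"
# arbitrary class labels, so perfect in-sample classification by itself says
# nothing about out-of-sample predictive performance.
import numpy as np

rng = np.random.default_rng(0)
n = 27                               # mirrors the 27 tumors in the quoted example
X = rng.normal(size=(n, n))          # pure-noise "expression" matrix, n samples x n predictors
z = rng.integers(0, 2, size=n)       # arbitrary class assignments

beta = np.linalg.solve(X, z)         # interpolating least-squares fit: no prior, no penalty
in_sample = (X @ beta > 0.5).astype(int)
print("in-sample errors:", np.sum(in_sample != z))   # 0: the labels are stored exactly

X_new = rng.normal(size=(1000, n))   # new samples from the same pure-noise distribution
z_new = rng.integers(0, 2, size=1000)
out_sample = (X_new @ beta > 0.5).astype(int)
print("out-of-sample accuracy:", np.mean(out_sample == z_new))   # roughly chance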

The authors state further in the Discussion section that “Clearly, the methodology is not limited to only this medical context nor is it specialized to diagnostic questions only. We have applied our model to the problem of predicting the nodal status of breast tumors based on expression profiles of tissue samples form the primary tumor. The results are reported in West et al., 2001. Due to the very general setting of our model, we expect it would be successful for a large class of diagnostic problems in various fields of medicine.”

Interestingly, the supporting material cited in [1] actually references [1]. This again is an issue of concern.

Also of concern is that the authors’ prediction has been realized: they expected the method to be applicable to a large class of diagnostic problems in various fields of medicine, and indeed the authors have used this methodology in a widening scope of medical fields, as will be outlined below. That this methodology has been accepted for publication in many journals over many years, before its statistical properties have truly been investigated, is itself an issue of concern.

I believe that part of the reason that journal editors and reviewers have not questioned the methodology is that the method uses primarily Bayesian statistical models, which are not as widely taught or understood in biological and medical higher education. It is difficult for many non-statisticians to follow the statistical logic and mathematical aspects of such complex Bayesian methods.

Thus the authors clearly describe that their paradigm is to fit a model to an entire data set, derive a set of predictors from that model, then use those predictors along with others on subsets of the entire data set. They state that the excellent performance of such models is validated, when it appears that what is actually demonstrated is that a model overfitted to an entire data set performs well on subsets of that entire data set. This issue should in my opinion be a key issue of concern in this IOM review of this omics methodology.

In early 2006, an apparently seminal paper was published by the Duke investigators in the journal Nature (Bild et al. [3]). This was followed by a publication in the New England Journal of Medicine (Potti et al. [4]) and another in Nature Medicine (Potti et al. [5]). All of these papers cite West et al. [1] and use the methodology therein. References [3] and [5] discuss breast cancer, and reference [4] discusses lung cancer. All use the same analytical paradigm: fitting an initial model to all available data to develop predictors (called “metagenes” in [3] and [4], then “gene expression signatures” in [5]) based on a singular value decomposition of the entire data set, then using these predictors on various subsets of the data involved and calling some portion of this subset analysis a “validation” exercise.

At this point researchers at the M.D. Anderson clinic explored the possibility of adapting this analytical paradigm, and asked statisticians Keith Baggerly and Kevin Coombes to review the publications. Their investigations are of course key in shedding light on this issue. In 2007 Baggerly and Coombes published a letter in the Correspondence section of Nature Medicine (Coombes et al. [6]). Coombes et al. state “Their software does not maintain the independence of training and test sets, and the test data alter the model. Specifically, their software uses ‘metagenes’: weighted combinations of individual genes. Weights are assigned through a singular value decomposition (SVD). Their software applies SVD to the training and test data simultaneously, yielding different weights than when SVD is applied only to the training data (Supplementary Report 9). Even using this more extensive model, however, we could not reproduce the reported results.” and further state that “When we apply the same methods but maintain the separation of training and test sets, predictions are poor (Fig. 1 and Supplementary Report 7). Simulations show that the results are no better than those obtained with randomly selected cell lines (Supplementary Report 8).”

Thus Coombes et al. have performed some initial analysis that sheds light on the true type I and type II error rates of this methodology. What is unclear from the work of Coombes et al. is the degree of departure from the null condition of no difference between groups of interest in the data sets used, so that the power of the statistical method can be properly evaluated. This is why a careful systematic study of this methodology, using known null data (data with equivalent distributional characteristics between groups of interest) and known non-null data (data with increasing levels of differential characteristics between groups of interest) is required, so that power characteristics of the methodology can be measured under null and non-null conditions. Further, such analysis needs to properly evaluate model performance on true out-of-sample data.
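
A sketch of the kind of null-data study called for here, in Python (simulated data throughout, and a generic “derive the predictor using all samples, then split” leak standing in for the specific Duke pipeline, to which I do not have access). Repeating the exercise over many null data sets gives an empirical estimate of how often the leaky design appears to “validate” when there is nothing to find:

# Sketch: generate null data (no real group differences), run a "leaky"
# pipeline (gene selection on all samples before splitting) and an honest
# pipeline (selection on the training half only), and compare the apparent
# "validation" accuracies over many simulated data sets.
import numpy as np
from sklearn.linear_model import LogisticRegression

def apparent_accuracy(X, y, leaky, k=20, rng=None):
    n = len(y)
    idx = rng.permutation(n)
    train, test = idx[: n // 2], idx[n // 2:]
    pool = np.arange(n) if leaky else train        # which samples inform gene selection
    corr = np.abs(np.corrcoef(X[pool].T, y[pool])[-1, :-1])
    genes = np.argsort(corr)[-k:]                  # top-k genes most correlated with outcome
    clf = LogisticRegression(max_iter=1000).fit(X[np.ix_(train, genes)], y[train])
    return clf.score(X[np.ix_(test, genes)], y[test])

rng = np.random.default_rng(0)
leaky_acc, honest_acc = [], []
for _ in range(200):                               # 200 simulated null data sets
    X = rng.normal(size=(60, 2000))                # null data: genes carry no label information
    y = rng.integers(0, 2, size=60)
    leaky_acc.append(apparent_accuracy(X, y, leaky=True, rng=rng))
    honest_acc.append(apparent_accuracy(X, y, leaky=False, rng=rng))

print("mean apparent accuracy, leaky pipeline:", np.mean(leaky_acc))    # well above 0.5
print("mean apparent accuracy, honest pipeline:", np.mean(honest_acc))  # near 0.5
print("fraction of null runs 'validating' at the 0.75 level:",
      np.mean(np.array(leaky_acc) >= 0.75))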

In 2007, the Duke group published another heavily cited paper (Hsu et al. [7], recently retracted on November 16, 2010). SVD components developed for this paper were termed “gene expression signatures”. All of these papers share the attribute that excessive claims of model accuracy are repeatedly asserted, with purported evidence from exercises termed “cross-validation”.

Recently, additional papers from the Duke investigators have been published concerning viral infection (Zaas et al. [8]) and bacterial infection (Zaas et al. [9]). Statnikov et al. [10] submitted a letter challenging this methodology once again, stating “We suggest several approaches to improve the analysis protocol that led to discovery of the acute respiratory viral response signature. First, to obtain an unbiased estimate of predictive accuracy, genes should be selected using the training set of subjects as opposed to selecting genes from the entire data set as was done in the study of Zaas et al. (2009). The latter gene selection procedure is known to typically lead to overoptimistic predictive accuracy estimates. Second, the cross-validation procedure employed by Zaas et al. should be modified to prohibit the use of samples from the same subjects both for developing signature and estimating its predictive accuracy, as this is another potential source of over-optimism.”

As with all previous challenges, the Duke investigators offer only verbal rebuttals to these points, with no formal statistical evaluation, via simulation or otherwise, to address the true distributional properties of this method.

More recent papers from the Duke investigators that should be reviewed include Chen et al. [11] and Chen et al. [12]. Additional complex Bayesian methods continue to be combined around the same analytical paradigm, and it is beyond the capability of many journal editors and reviewers to understand and deconstruct the arguments offered by the Duke investigators.

Additionally, with the apparent weight of so many seemingly accurate analysis results, resources such as research grants from federal agencies are being utilized without proper understanding of the value of the returns. Moreover, several studies on humans (the clinical trials currently scheduled for review, and the viral infection studies described in references [8] and [9]) have been conducted based on methodology with as yet unknown statistical properties. This is an issue of major concern, and a review of the statistical properties of the methodology used throughout these studies along with a publication of guidelines for evaluation of whether or not human trials involving this methodology should be permitted would be very valuable to the research community.

[McKinney References: see page 7 of PAF Document 19]

I’ll come back to this in “Potti Training and Test Data” Part 2. Please share your thoughts.

_____________

References:

  • Baggerly & Coombes (2009). Deriving chemosensitivity from cell lines: Forensic bioinformatics and reproducible research in high-throughput biology. Ann. Appl. Stat. 3(4): 1309–1334. [Starter kit webpage supplement for B & C 2009: http://bioinformatics.mdanderson.org/Supplements/ReproRsch-All/Modified/StarterSet/index.html]
  • Baggerly, Coombes & Neeley (2008). Run batch effects potentially compromise the usefulness of genomic signatures for ovarian cancer. JCO, March 1, 2008: 1186–1187.
  • Coombes, Wang & Baggerly (2007). Microarrays: retracing steps. Nat. Med. 13(11): 1276–7.
  • Dressman, Potti, Nevins & Lancaster (2008). In reply. JCO, March 1, 2008: 1187–1188.
  • McShane (2010). NCI address to Institute of Medicine Committee Convened to Review Omics-Based Tests for Predicting Patient Outcomes in Clinical Trials. PAF 20.
  • Micheel et al. (Eds.), Committee on the Review of Omics-Based Tests for Predicting Patient Outcomes in Clinical Trials; Board on Health Care Services; Board on Health Sciences Policy; Institute of Medicine (2012). Evolution of Translational Omics: Lessons Learned and the Path Forward. Nat. Acad. Press.
  • Potti et al. (2006). Genomic signatures to guide the use of chemotherapeutics. Nat. Med. 12(11): 1294–300. Epub 2006 Oct 22.
  • Potti & Nevins (2007). Reply to Coombes, Wang & Baggerly. Nat. Med. 13(11): 1277–8.
  • Spang et al. (2002). Prediction and uncertainty in the analysis of gene expression profiles. In Silico Biology 2, 0033.
Categories: science communication, selection effects, Statistical fraudbusting


39 thoughts on “What have we learned from the Anil Potti training and test data fireworks? Part 1 (draft 2)”

  1. Steven: Thanks for several explanatory e-mails on the case over the past few days. I’m still curious though: has the general methodology and its properties not been analyzed and discussed? Is it so non-standard? (I doubt it). One might distinguish at least two stages: (a) arriving at and validating a model that is adequate for their data, and (b) validating its predictive use.

    Maybe you can suggest a set of distinct concerns/questions to direct discussion.

    • Steven McKinney

      I have not yet found any literature showing any honest, serious evaluation of the statistical capabilities of the methodology employed by the Duke team in the early to mid 2000’s. If anyone else has, please let me know.

If I Google the term “duke genomics virus signatures” I get several initial hits for business news sites discussing this topic, with titles such as

      ###—
      “Test Aims to Better Detect Viral Infections”.
      Thu, 09/19/2013 – 8:30am
      LAURAN NEERGAARD – AP Medical Writer – Associated Press

      “It happens too often: A doctor isn’t sure what’s causing someone’s feverish illness but prescribes antibiotics just in case, drugs that don’t work if a virus is the real culprit.

      Now, Duke University researchers are developing a blood test to more easily tell when a respiratory illness is due to a virus and not a bacterial infection, hoping to cut the dangerous overuse of antibiotics and speed the right diagnosis.

      It works by taking a fingerprint of your immune system — how its genes are revving up to fight the bug. That’s very different from how infections are diagnosed today. And if the experimental test pans out, it also promises to help doctors track brand-new threats, like the next flu pandemic or that mysterious MERS virus that has erupted in the Middle East.

      That viral “signature could be quite powerful, and may be a game-changer,” said Dr. Geoffrey Ginsburg, Duke’s genomic medicine chief. He leads the team that on Wednesday reported that a study involving 102 people provided early evidence that the test can work.”
      ###—

      So after the Duke fiasco, now things have been set straight, and a new “game changing” signature is at hand.

      From the Duke website page

      http://www.pratt.duke.edu/node/4347

      the following article is discussed.

      A Host-Based RT-PCR Gene Expression Signature to Identify Acute Respiratory Viral Infection

      Aimee K. Zaas, Thomas Burke, Minhua Chen, Micah McClain, Bradly Nicholson,
      Timothy Veldman, Ephraim L. Tsalik, Vance Fowler, Emanuel P. Rivers, Ronny Otero,
      Stephen F. Kingsmore, Deepak Voora, Joseph Lucas, Alfred O. Hero, Lawrence Carin,
      Christopher W. Woods, Geoffrey S. Ginsburg

      Science Translational Medicine
      DOI: 10.1126/scitranslmed.3006280

“This is not the definitive study that will be necessary for adoption of this new approach to viral infection diagnosis into clinical care, but it is an important one nonetheless.” (Page 5 of Sci Transl Med PDF)

You might not guess that this study was not definitive from reading the press releases. This kind of statement leaves me concerned:

“It should be emphasized that the probability of illness that is manifested from the model is in terms of traditional probit-regression technology, widely used within statistics. However, the amount of data we have here is too limited for quantitative verification of the probability values. Nevertheless, the performance of the model on real-world data suggests its robustness.”

Amazingly, these studies always take place on small samples. For proposing such a game-changing diagnostic, one would expect better assessment statistics from studies with larger sample sizes. The wording here is chillingly familiar. As in the retracted Blood journal article (doi:10.1182/blood-2005-07-2669):

“The predictive ability of the model developed in the training set also was robust in predicting the thrombotic state in the validation samples (Figure 3C). Based on these results, we conclude that it is possible to develop robust predictive models that can identify patients with aPLAs who are at risk for thrombosis.”

      The authors just say this stuff. No rigorous statistical evidence is given, just a few graphs based on small sample sizes, and an opinion that robustness is suggested.

      This is a pattern of concern.

      Back to the Sci Transl Med paper.
      ### —

“The ENet basic model was first described by Zou and Hastie (29), and this algorithm is state of the art. However, we also note that we obtain similar quantitative results using other related statistical models, for example, the support vector machine (30) and the relevance vector machine (31). All of these algorithms have been used, and the performance reported here is very similar across all of them (that is, the quantitative performance is not sensitive to which of these modern statistical techniques are used).

To validate the performance of the RT-PCR viral signature to accurately classify individuals as being symptomatic or asymptomatic, we trained the model on one data set (either H3N2 or H1N1) and tested it on the other independently acquired data set. To test the performance of the RT-PCR classifier in a real-world setting, we used peripheral blood RNA ascertained from patients presenting to the emergency department with febrile illness. The working procedure of the ENet classifier begins with preprocessing. We first log-transform the 2^(−ΔΔCT) data to obtain the data matrix X of dimension n*p, where n is the number of samples and p is the number of genes. We then subtract the corresponding sample (row) mean from X and append an additional column filled with value 1.0 to account for the shift (intercept) term in the regression. ENet training is performed with the labels of the samples Z and the preprocessed data X. We train a probit regression model Z = sign(Y), Y = X*beta + noise with an ENet prior on the weights (beta) (24). After training, we obtain the posterior distribution of the weights (classifier). Testing was then performed. For an unseen testing sample x, we do the same data preprocessing, and the predictive probability can be obtained by integrating out the classifier using the posterior distribution in the probit model. The above predictive probability is between 0 and 1, and classification decision is made with the threshold 0.5. The final “ENet score” is obtained by multiplying a constant (for example, 40) to the predictive probability. The code for reproducing the results can be found at http://people.ee.duke.edu/~lcarin/reproduce.html.”

      ###—

      No in-depth discussion is offered about the ENet training performed on the preprocessed data X. What is in X? Training data only? Training and test data? Only the computer code and original data can truly provide an answer here. Lack of clear descriptions of algorithm elements is of concern.
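
For what it is worth, here is how the described preprocessing and scoring steps read to me when X contains the training data only, sketched in Python on made-up data. A standard elastic-net-penalized logistic classifier stands in for the paper’s Bayesian ENet-prior probit model, so this is only an illustration of the described workflow, not a reproduction of their analysis:

# Sketch of the quoted preprocessing/scoring steps, with made-up data and a
# standard penalized classifier standing in for the Bayesian ENet probit model.
import numpy as np
from sklearn.linear_model import LogisticRegression

def preprocess(raw):
    """Log-transform the (hypothetical) 2^-ddCt values and center each sample (row).
    The appended column of 1.0 in the quoted description is handled here by the
    classifier's intercept."""
    X = np.log2(raw)
    return X - X.mean(axis=1, keepdims=True)

rng = np.random.default_rng(0)
raw_train = rng.lognormal(size=(50, 30))   # hypothetical RT-PCR data: 50 subjects x 30 genes
z_train = rng.integers(0, 2, size=50)      # symptomatic / asymptomatic labels
raw_test = rng.lognormal(size=(20, 30))    # independently acquired test subjects

clf = LogisticRegression(penalty="elasticnet", solver="saga",
                         l1_ratio=0.5, C=1.0, max_iter=5000)
clf.fit(preprocess(raw_train), z_train)    # training data only: nothing from raw_test enters

p = clf.predict_proba(preprocess(raw_test))[:, 1]   # predictive probability for unseen samples
labels = (p > 0.5).astype(int)             # classification decision at threshold 0.5
enet_score = 40 * p                        # "ENet score" = constant x predictive probability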

The URL at the end of the paragraph offers a glimmer of hope. I look for this Sci Transl Med paper at the URL given in the paper:

      ### — —
      http://people.ee.duke.edu/~lcarin/reproduce.html

      Reproducible Research

      A research team led by Duke University has engaged in research to examine the host (body) response to viruses. In the course of that research we have collected blood samples in human challenge studies. Four different Institutional Review Boards have approved these challenge studies. The host response has been investigated in terms of the gene-expression response. In the interest of encouraging other investigators to reproduce our results, and to build upon them and find new discoveries, on this webpage we provide all data and the software used to produce every figure in our papers. The software is in Matlab, and the data are in the form of associated .mat files. The data have been normalized from the raw expression values, to constitute the data posted here. Interested individuals may contact Lawrence Carin (lcarin@duke.edu) to learn details of how the normalization was done; standard techniques were applied. The raw data are available in GEO (accession no. GSE17156), if one wishes to start from the raw data.

For the following three papers:

      A.K. Zaas, M. Chen, J. Varkey, T. Veldman, A.O. Hero III, J. Lucas, R. Turner, A. Gilbert, C. Oien, B. Nicholson, S. Kingsmore, L. Carin, C.W. Woods, and G.S. Ginsburg, Gene Expression Signatures Diagnose Influenza and Other Symptomatic Respiratory Viral Infections in Humans, Cell Host and Microbe, 2009.

      M. Chen, D. Carlson, A. Zaas, C. Woods, G. Ginsburg, A. O. Hero III, J. Lucas, and L. Carin, Detection of Viruses via Statistical Gene-Expression Analysis, IEEE Transactions on Biomedical Engineering, 2010.

      B. Chen, M. Chen, J. Paisley, A. Zaas, C. Woods, G.S. Ginsburg, A. Hero III, J. Lucas, D. Dunson, and L. Carin, Bayesian inference of the number of factors in gene-expression analysis: application to human virus challenge studies, BMC Bioinformatics, 2010.

      The data and Matlab software needed to reproduce every figure in these papers are here (517 MB).

      We also present an example here, based on toy data, which shows the ability of the model to infer the number of factors present. In this example we consider a case for which the data are pure noise, and demonstrate that the model is able to infer that no factors are present.
      ______

      For the following paper:

      C.W. Woods, M.T. McClain, M. Chen, A.K. Zaas, B.P. Nicholson, J. Varkey, T. Veldman, S.F. Kingsmore, Y. Huang, R. Lambkin-Williams, A.G. Gilbert, A.O. Hero III, E. Ramsburg, S. Glickman, J.E. Lucas, L. Carin, and G.S. Ginsburg, A Host Transcriptional Signature for Presymptomatic Detection of Infection in Humans Exposed to Influenza H1N1 or H3N2, PLOS ONE, 2013

      The data and the code are here (205 MB; the main function to run is Flu_Validate.m ).

      ### — —

      I see no Sci Transl Med paper listed on the web page documented in the Sci Transl Med paper. So the promise in the Sci Transl Med paper is not realized on the website of the given URL in the paper.

      I can write and ask them to rectify this issue, and provide the code.

      I’ll do so and report what happens.

      Looking into the main function to run I see

      ### —-
clc; clear all; close all; randn('state',0); rand('state',0); format long;
version = 4; %% version = 3: H3N2 -> H1N1; version = 4: H1N1 -> H3N2. version = 1: H3N2+H1N1.
load H3N2.mat; load H1N1.mat; load clinic.mat; load(['Flu_FA_' num2str(version) '.mat']);
gene = H3N2.gene1(:,2); H3N2.CL = clinic.H3N2; H1N1.CL = clinic.H1N1;
switch version,
case 3, tt = 41; sgn = 1; Full = Flu_Data(H3N2,H3N2.labelCM~=inf);
Train = Flu_Data(H3N2,H3N2.labelCM~=0); Test = Flu_Data(H1N1,H1N1.labelCM~=0); mu = mean(Train.X,2);
case 4, tt = 25; sgn = 1; Full = Flu_Data(H1N1,H1N1.labelCM~=inf);
Train = Flu_Data(H1N1,H1N1.labelCM~=0); Test = Flu_Data(H3N2,H3N2.labelCM~=0); mu = mean(Train.X,2);
case 1, tt = 42; sgn = -1; Full = Flu_Data(H3N2,H3N2.labelCM~=inf);
Train = Flu_Data(H3N2,H3N2.labelCM~=0); Test = Flu_Data(H1N1,H1N1.labelCM~=0); mu = mean([Train.X,Test.X],2);
otherwise, return;
end;

model = FA_Unit(spl,sgn); [p,n] = size(Train.X);
S0 = FA_S(Train.X,model,mu); S1 = FA_S(Test.X,model,mu); S = FA_S(Full.X,model,mu);

[pval0,auc0,n1,m1] = Flu_View(S0,Train,tt,model.AZ,gene,1);
[pval1,auc1] = Flu_View(S1,Test,tt,model.AZ,gene,1);
Flu_View(S,Full,tt,model.AZ,gene,0);

fid = fopen(['Result_' num2str(version) '.xls'],'W');
fprintf(fid,'%s \t', 'Genes and loadings'); for j = 1:p, fprintf(fid,'%s %f \t', gene{n1(j)},model.AZ(n1(j),tt)); end;
fprintf(fid,'\n\n %s \t', 'Hours'); for t = 1:length(m1), fprintf(fid,'%f \t', m1(t)); end;
fprintf(fid,'\n %s \t', 'Pval'); for t = 1:length(m1), fprintf(fid,'%f \t', pval0(t)); end;
fprintf(fid,'\n %s \t', 'Pval (validation)'); for t = 1:length(m1), fprintf(fid,'%f \t', pval1(t)); end;
fprintf(fid,'\n %s \t', 'AUC'); for t = 1:length(m1), fprintf(fid,'%f \t', auc0(t)); end;
fprintf(fid,'\n %s \t', 'AUC (validation)'); for t = 1:length(m1), fprintf(fid,'%f \t', auc1(t)); end;
fclose(fid);
      ### —-

      Are we clear now?

      In version 1 of switches above, I see

      mu = mean([Train.X,Test.X],2)

      so some parameter estimate is coming from both the training and test set. I’ll have to claw through the code to see what this is about.

      The fact that the main driver code module pulls in both the training and test data is concerning. Lack of any comments or other documentation leaves me guessing and speculating, hardly an ideal state for code and data on a “Reproducible Research” page.

      Directory listing of the downloaded zip file that contains everything:

      -rwxrwxrwx@ 1 stevenmckinney staff 40846083 May 17 2011 2009.09.19.H1N1.(n=378).txt
      -rwxrwxrwx@ 1 stevenmckinney staff 1801 May 17 2011 FA_Lucas.m
      -rwxrwxrwx@ 1 stevenmckinney staff 193 Aug 25 2013 FA_S.m
      -rwxrwxrwx@ 1 stevenmckinney staff 235 May 17 2011 FA_Unit.m
      -rwxrwxrwx@ 1 stevenmckinney staff 647880 May 17 2011 Figures.pdf
      -rwxrwxrwx@ 1 stevenmckinney staff 185 May 23 2011 Flu_Data.m
      -rwxrwxrwx@ 1 stevenmckinney staff 910 May 24 2011 Flu_FA.m
      -rwxrwxrwx@ 1 stevenmckinney staff 5007827 May 24 2011 Flu_FA_1.mat
      -rwxrwxrwx@ 1 stevenmckinney staff 16475961 May 25 2011 Flu_FA_2.mat
      -rwxrwxrwx@ 1 stevenmckinney staff 4916812 May 24 2011 Flu_FA_3.mat
      -rwxrwxrwx@ 1 stevenmckinney staff 4915191 May 24 2011 Flu_FA_4.mat
      -rwxrwxrwx@ 1 stevenmckinney staff 892 Aug 25 2013 Flu_Test.m
      -rwxrwxrwx@ 1 stevenmckinney staff 1928 Aug 25 2013 Flu_Validate.m
      -rwxrwxrwx@ 1 stevenmckinney staff 2359 May 23 2011 Flu_View.m
      -rwxrwxrwx@ 1 stevenmckinney staff 598 Jan 24 2011 Gene_Plot.m
      -rwxrwxrwx@ 1 stevenmckinney staff 311 Dec 16 2010 Gene_ROC.m
      -rwxrwxrwx@ 1 stevenmckinney staff 33053528 May 17 2011 H1N1.mat
      -rwxrwxrwx@ 1 stevenmckinney staff 1540 May 17 2011 H1N1_Collect.m
      -rwxrwxrwx@ 1 stevenmckinney staff 913408 May 17 2011 H1N1_Info.xls
      -rwxrwxrwx@ 1 stevenmckinney staff 66727924 May 17 2011 H3N2.mat
      -rwxrwxrwx@ 1 stevenmckinney staff 1783 May 17 2011 H3N2_Collect.m
      -rwxrwxrwx@ 1 stevenmckinney staff 42496 May 17 2011 H3N2_key.xls
      -rwxrwxrwx@ 1 stevenmckinney staff 14085470 Aug 25 2013 Ramilo.mat
      -rwxrwxrwx@ 1 stevenmckinney staff 12156792 Nov 29 2011 Ramilo.txt
      -rwxrwxrwx@ 1 stevenmckinney staff 1177 Nov 29 2011 Ramilo_Collect.m
      -rwxrwxrwx@ 1 stevenmckinney staff 36352 May 23 2011 Results.xls
      -rwxrwxrwx@ 1 stevenmckinney staff 1448702 May 25 2011 Test.mat
      -rwxrwxrwx@ 1 stevenmckinney staff 501963 May 25 2011 Validation (05.25.2011).pptx
      -rwxrwxrwx@ 1 stevenmckinney staff 553 Dec 22 2010 clinic.mat
      -rwxrwxrwx@ 1 stevenmckinney staff 54133795 May 17 2011 darpa.flu.rma.(267).affy.csv
      -rwxrwxrwx@ 1 stevenmckinney staff 29202292 May 17 2011 darpa.flu.rma.(267).ccdf2.csv

Nothing seems to describe this collection of files; no README.txt or other document appears in the listing as an obvious starting point, just the suggestion from the webpage that “the main function to run is Flu_Validate.m”. This stuff is just never easy.

I certainly had more luck getting decent descriptions of things in the Sweave PDF files on Baggerly and Coombes’ ReproRsch website.

So now I have to claw my way, line by line, through all the Matlab code, to see if I can make any sense of things. Not exactly the most helpful of reproducible research pages. I really cannot judge at this point whether this directory allows me to assess, in a reproducible way, what is currently being done in these studies.

      As we have seen before, patent applications are underway.

      Ginsburg, Zaas, Woods, Hero, Carin and Lucas have filed for a provisional patent on the respiratory viral signature. Fuller disclosures are provided in the study.

      http://www.pratt.duke.edu/news/genomic-test-accurately-sorts-viral-versus-bacterial-infections

      I haven’t found the startup company name yet, but I’ll keep an eye out.

      So these are the initial conditions of an investigation into this purportedly game-changing methodology. Not easy or trivial stuff to sort through by any means.

      It’s unfortunate that this type of research, with such bold claims, is so difficult to penetrate.

      • Steven: Thanks for all this. I’m still keen to better understand the Nevins and Potti methodology. Are you saying this uses the same method? Maybe so, but all the more reason to look at it more carefully.

        • Steven McKinney

          The key feature in the West, Nevins, Potti et al. algorithms was the initial fitting of a principal components type model to all of the available data. The ‘test’ or ‘validation’ data was not kept in a lockbox during model fitting exercises that yielded the final specific form of the classifier.

Their argument that they locked the known class labels for the test or validation set in a lockbox is disingenuous, because there is information about the class labels in the validation or test expression data. If there were no information at all about class labels in the expression data, why would they be attempting to build a classifier using such data? Additionally, in the real world, that expression data would not be available for initial SVD fits, because the data would not even exist yet. That data is in a lockbox, and that lockbox is the future. Patients for whom a treatment decision would need to be made are unknown at the time of the classifier modeling steps that define the classifier. Thus their modeling paradigm did not match the conditions under which the classifier would actually be used.

          Whether the initial data modeling exercise was a Bayesian SVD, or a frequentist principal components analysis, or even perhaps an elastic net, doesn’t really matter. Information regarding the outcome of interest for all of the data modelled is reflected in the final fitted model of this initial part of the algorithm used in the West, Nevins, Potti et al. studies.

So putting the expression data for the test or validation set in the lockbox, alongside the known class labels, was the step that the Duke groups refused to take, and refused to study rigorously with decent statistical methods.

So when I look into the Matlab code currently available from Duke that I discuss above, and I see train and test data read in together at the outset, I cannot rule out that they are still using this manoeuvre of doing some initial modeling on all the data. A reasonable code suite would have an initial classifier estimation procedure that reads in only the training data (expression data and class labels), followed by a prediction procedure that reads in only the locked-down classifier specifications and the test expression and class label data.

• Steven: Thanks for your comment. You say it doesn’t matter what technique they were using, but they claim that it’s the priors that enable them to ignore what they would not be able to ignore otherwise, right? Else the method would just “store” the class assignments, and the inference would be illicit–by their own position. So the priors seem to be taking on an important role. From my reading, they are claimed to be unbiased, conjugate priors. I wonder if anyone can say more about them. I also wonder if there’s any connection to something Stephen Senn has posted about. (I’ll come back to paste it later.)

            • Steven McKinney

Can anyone point me to any mathematical or computational explorations of structured data assessing whether these notions hold, i.e., whether one really can ignore issues such as overfitting a model to a local dataset? In the Spang et al. electronic journal article, they just said that they were protected by their prior restrains [sic]. What does this mean? Protected how? What does the mathematics of protection look like? Forgive me if I am missing something, but so far all I have is people saying that they are protected, that they can ignore situations that the rest of us have to deal with, but I am shown no mathematical or simulation evidence. Spang, West et al. offered only a partial cross-validation on an in-sample data set (their model fitted the data well, the data upon which the model was estimated) as “evidence” of protection.

          • pr

> Their argument that they locked the known class labels classification data for the test or validation set in a lockbox is disingenuous, because there is information about the class labels in the validation or test expression data. If there was no information at all about class labels in the expression data, why would they be attempting to build a classifier using such data?

They were doing SVD across training and test samples because this essentially produces common factors or principal components that are shared between training and test sets. This doesn’t require class labels at all. Try it yourself: combine columns from a set of 100 genes (microarray data) from both the training and test sets. Let’s say you have 50 samples in the training set and 50 samples in the test set. Now you have a matrix with 100 columns (samples) and 100 rows (genes). Now find the first 3 principal components of this matrix. Now use the first 3 principal components of the training set to train the model using only the class labels from the training set. Now apply the model to the test set. See, doesn’t use labels from the test set at all.

            • Steven McKinney

              Try this yourself: Don’t combine columns, just use the 50 samples in the training set. Find the first three principal components of this matrix. Now use these first 3 principal components of the training set to train the model using only the class labels from the training set. Now apply the model to the test set. See, doesn’t use labels, nor any data from the test set at all.

              This is what development of a predictive medical model looks like. The test set in reality is not available, because it is data that will be acquired in the future. So in medical practice, how are you going to fit your SVD across future data not yet available?

              Now you can compare these two methods, and you will of course find that your suggested method performs better on the test set, well, because you used the test set data in the model training, via its influence on the first 3 principal components.
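
A minimal sketch of the two recipes just described, in Python on made-up data (scikit-learn; illustrative only, not the Duke code), showing exactly where the test expression data does and does not enter:

# Recipe 1: components from the combined training + test expression data.
# Recipe 2: components from the training data alone; test samples are merely
# projected onto the already-fixed components, as they would be in practice.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_train = rng.normal(size=(50, 100))    # 50 training samples x 100 genes (made-up data)
y_train = rng.integers(0, 2, size=50)
X_test = rng.normal(size=(50, 100))     # 50 "future" samples x 100 genes
y_test = rng.integers(0, 2, size=50)

pca_both = PCA(n_components=3).fit(np.vstack([X_train, X_test]))
acc_both = (LogisticRegression()
            .fit(pca_both.transform(X_train), y_train)
            .score(pca_both.transform(X_test), y_test))

pca_train_only = PCA(n_components=3).fit(X_train)
acc_train_only = (LogisticRegression()
                  .fit(pca_train_only.transform(X_train), y_train)
                  .score(pca_train_only.transform(X_test), y_test))

print(acc_both, acc_train_only)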

              Replying to this discussion months after its posting, and a good 5 years after the debunking of the whole mismanaged suite of Duke studies, shows that you have learned nothing at all about the improper nature of the above “strategy” you repeatedly hawk.

• Steven: On the idea of developing a test to determine whether it’s bacteria or a virus by looking at your genes—maybe they figured, how much trouble can they get into here? Don’t we already test for bacteria in a few minutes’ time, only culturing it if they need to know the particular infection? And if Duke’s test doesn’t find your “genes revving up,” they still would have to do a test for bacteria. Meanwhile, would you let the folks at Duke inject you with a respiratory virus or bacteria to study their latest predictor?

      • kikidenee

        I just happened across this blog today… I am a former employee of the Duke Cancer Institute and while I was not there during the “Potti fiasco”, I was there in the aftermath. I am not a statistician so I have nothing to offer in that area; however, I had the misfortune of working in that ethical cesspool and now understand how and why things like this happen and continue unquestioned for so long. (unquestioned = by Duke, only).

        I cannot speak to the entire institution but I saw enough from the inside regarding cancer research to know that 1. Duke learned nothing, ethically, from what happened with Dr. Potti, 2. Duke values reputation and profit above all else (Duke would never admit as much but makes it clear through action/inaction) 3. Duke’s internal methods of “investigation” do not equate to what the rest of us consider an “investigation”; they look for what they want to find, purposefully ignore what they do not want to find and if there isn’t enough in either category, they fabricate “evidence” 4. Protecting Duke’s reputation and legal liability is more important than protecting and respecting patients or employees 5. The oversight of cancer research is more than troubling- required “training” of research staff is superficial- the purpose is simply for documentation in the event of an audit, not the acquisition of required knowledge and skills; the environment is anything but professional- it is rife with intimidation, fear, bullying, harassment, discrimination and enabling. Those tasked with overseeing the ethical conduct of research spend most of their time covering their mistakes and intimidating their personnel- on anything but protecting patients. The departments/offices etc that exist to monitor such things fall extremely short- employees are intimidated into silence, anyone who dares speak out about such abuses becomes ostracized/ a scapegoat or simply fired- the rule of thumb seems to be to simply look the other way.

I do not think Duke would have *ever* looked into what was going on if it hadn’t been for the overwhelming concern of the external research community that refused to let the issue go. I think Dr. Potti was made into a scapegoat for the many failings of Duke. I would not be surprised if something like this happens again (and soon). It’s no secret that the institution is highly unethical- in March of this year, they settled a fraud/whistleblower suit, agreeing to pay $1 million for purposefully defrauding the government for 6 years. They blamed it on a software error. I happen to know that they are still wrongly billing their patients and when I questioned this I was told it was on the patients to monitor the bills they receive, identify the errors and then contact Duke to remedy the problem.

        So…my post is not about statistics, just the overall ethical problems that, I think, enable such behavior to occur and continue.

        Please feel free to contact me if you would like to discuss more.

        • Kikidenee: I’ve heard many sides to the story, as a result of this post. Is the court case due to come up soon or is it now over?

2. What McKinney refers to as the “key flaw in the reasoning”, referring to the Spang et al. paper by the Duke group, warrants scrutiny:
    http://www.bioinfo.de/isb/2002020033/main.html

    “Note, that if we draw a decision line at a probability of 0.5 we obtain a perfect classification of all 27 tumors. However the analysis uses the true class assignments z1 … z27 of all the tumors. Hence, although the plot demonstrates a good fit of the model to the data it does not give us reliable indications for a good predictive performance. One might suspect that the method just “stores” the given class assignments in the parameter, pnas 19 formula 1. Indeed this would be the case if one uses binary regression for n samples and n predictors without the additional restrains introduced by the priors. That this suspicion is unjustified with respect to the Bayesian method can be demonstrated by out-of-sample predictions.”

    Without the priors “the method just ‘stores’ the given class assignments”. So without the priors we’d arrive at a model that fits the data well just by chance. But with the (conjugate, unbiased) priors we arrive at a model that does well with out-of-sample predictions. That of course is what B & C, as well as NCI, do not find. Still I’m curious if anyone has even a vague sense of why they might have made these remarks. (I have only a vague hunch.)
    Note: their on-line paper has a number of grammatical and spelling errors.

    • Steven McKinney

      After I wrote my letter and delivered my analysis of the West, Nevins, Potti et al. algorithm to the IOM, a collection of NCI internal documents was made public by the IOM committee. PAF Document 12.pdf described a true train and test exercise.

The NCI had some data known to be unavailable to the Duke group involved in this exercise. The NCI team (led in part by Lisa McShane) provided a training data set to the Duke group, so the Duke team could certainly not claim that their procedure had not been carried out exactly so.

      When the Duke group provided their estimated classifier back to the NCI, the predictive capacity of the classifier was found to be lacking on the true test data that was never used in classifier estimation steps. Patients with predicted high risk showed a survival curve with longer survival times than patients with predicted low risk, though the survival curve differences were not statistically significant as assessed by a logrank or Cox model test (I have not yet found full details of the NCI evaluation exercise, though it is reasonably explained in PAF Document 12).

      So this represents the only verifiable lock-box assessment of the Lung Metagene Score (LMS), and the results of the assessment did not reveal a predictor with capabilities claimed in numerous publications by the Duke groups.
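
For readers unfamiliar with this kind of lock-box assessment, here is a minimal sketch of how predicted risk groups can be compared on survival, in Python with the lifelines package and entirely made-up numbers (this is not the NCI’s analysis, just the general form of a logrank comparison):

# Sketch: compare survival between predicted high-risk and low-risk groups
# with a logrank test (made-up survival times and censoring indicators).
import numpy as np
from lifelines.statistics import logrank_test

rng = np.random.default_rng(0)
t_high = rng.exponential(scale=30, size=40)   # months of follow-up, predicted high risk
t_low = rng.exponential(scale=30, size=40)    # months of follow-up, predicted low risk
e_high = rng.integers(0, 2, size=40)          # 1 = death observed, 0 = censored
e_low = rng.integers(0, 2, size=40)

result = logrank_test(t_high, t_low, event_observed_A=e_high, event_observed_B=e_low)
print(result.test_statistic, result.p_value)
# A useful predictor should separate the groups in the right direction,
# not merely reach "significance".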

• Steven: I’m quite surprised at how the NCI allowed them to essentially try and try again to find an ad hoc maneuver to get the supposedly “blind” sample correct. This is going to come up in part 2, I hope.
That they (the NCI) weren’t more shocked makes me think that clinical trials (to evaluate predictive models) like this go on elsewhere. People should learn what questions to ask, such as: has this predictive model been tested on humans (or at least mice) before? What’s the source of the model? We have professionals to help on purchasing houses and cars; we have wedding planners and how-to-get-into-college advisors. How about a group of independent statisticians to at least appraise the kind of procedure on which the clinical trial is based? They’d have to be rated of course…not so far fetched in the land of “personalized” medicine, where even our doctors might not be sufficiently dubious.

        • Steven McKinney

I think Lisa McShane at the NCI was shocked, and ended up forcing Duke personnel to hand over code and data materials. (Page 72 of the PAF Document 12 PDF shows NCI deconstruction of several Duke lines of argument.) Trials were closely scrutinized, and the NCI ordered that the trials only use Duke LMS scoring as a secondary exercise. Though Duke personnel may have done otherwise, I certainly feel I saw shock in McShane’s and others’ reactions. Those trials were ended on the orders of the NCI.

    • Mayo: Without the priors “the method just ‘stores’ the given class assignments”. So without the priors we’d arrive at a model that fits the data well just by chance. But with the (conjugate, unbiased) priors we arrive at a model that does well with out-of-sample predictions. That of course is what B & C, as well as NCI, do not find. Still I’m curious if anyone has even a vague sense of why they might have made these remarks.

      Yeah, I can field this one. It is not the case that “without the priors we’d arrive at a model that fits the data well just by chance”. Without priors (or with flat priors, which amounts to the same thing in this context) what you get is a *perfect* fit to the data — every time! — and terrible generalization to other data drawn from the same distribution. Key word: overfitting.

      The fix is to add a penalty (aka regularization) term to the likelihood. This term drags the optimal point away from the perfect fit and towards a prespecified point, and then ye olde James-Stein effect takes hold with respect to the predictive mean squared error.

You can evaluate regularized/penalized regression (ridge regression, LASSO, elastic net, and the logistic regression versions of these) from the standard frequentist perspective and show that they have nice frequentist properties like the aforementioned improved predictive MSE and the oracle property and so on and so forth. Alternatively, you can treat the penalty term as (the log of a) prior density, whereupon the maximum penalized likelihood estimator can be seen to be identical to the maximum a posteriori (MAP) estimator. This isn’t fully Bayesian, but it seems to be what these folks had in mind when they used the term “Bayesian”.
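
A quick numerical check of that penalized-likelihood / MAP equivalence for the 2-norm (ridge) penalty in a linear-Gaussian model, in Python on made-up data (a sketch of the general point, not of the Duke model):

# Sketch: the ridge (penalized least-squares) estimate coincides with the
# posterior mode under a zero-mean Gaussian prior on the coefficients.
import numpy as np

rng = np.random.default_rng(0)
n, p, lam, sigma2 = 30, 10, 2.5, 1.0
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + rng.normal(scale=np.sqrt(sigma2), size=n)

# Maximum penalized likelihood: minimize ||y - X b||^2 + lam * ||b||^2.
b_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Posterior mode (= mean, since everything is Gaussian) with prior
# b ~ N(0, (sigma2 / lam) I) and noise variance sigma2.
post_cov = np.linalg.inv(X.T @ X / sigma2 + (lam / sigma2) * np.eye(p))
b_map = post_cov @ (X.T @ y / sigma2)

print(np.allclose(b_ridge, b_map))    # True: the two estimates are identical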

• Corey: Thanks for this, and I just realized you’d be the person who would know and perhaps even use these techniques in your own work. Plus, you’re usually good at explaining these things. How do these differ from AIC, BIC model selection? If you can explain a bit about what the components are in the predictor, etc., I’d appreciate it.

        • Mayo: The key distinction is that in regularized regression, the penalty term involves the values of the parameter estimates, whereas in model selection via information criteria, the penalty term usually involves only the dimension of the parameter and possibly the sample size, and wields no influence over the parameter estimator (which is hence equivalent to maximum likelihood).

          In regularized regression, one basically throws in every predictor that might be worthwhile. (There are limits, of course, but you can put more predictors in than there are data points and still get estimators with reasonable generalization performance.) The concern is more with prediction than with inference, although the oracle property implies that the estimator will (asymptotically with increasing sample size) zero out parameters that are really zero. Not all penalties have the oracle property, though.

          Model selection criteria seem to be applied more for inference than for prediction. One defines a set of models of varying dimensionalities and computes their maximum likelihood estimates. Then the penalized log-likelihoods are calculated to pick the “optimal” model according to the criterion. The various penalty terms have different justifications.
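
          Schematically (my notation), the contrast is between a penalty on the size of the coefficients inside the optimization and a penalty on the dimension k of an already maximum-likelihood-fitted model:

          \[
          \hat{\beta}_{\mathrm{ENet}}
            = \arg\min_{\beta}\Bigl[-\ell(\beta)
              + \lambda_1 \lVert\beta\rVert_1
              + \lambda_2 \lVert\beta\rVert_2^2\Bigr],
          \qquad
          \mathrm{AIC}(M) = -2\,\ell\bigl(\hat{\beta}^{\mathrm{ML}}_M\bigr) + 2k_M,
          \quad
          \mathrm{BIC}(M) = -2\,\ell\bigl(\hat{\beta}^{\mathrm{ML}}_M\bigr) + k_M \log n.
          \]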

          Not sure what you mean by “what the components are in the predictor.”

      • Corey: “Without priors (or with flat priors, which amounts to the same thing in this context) what you get is a *perfect* fit to the data — every time! — and terrible generalization to other data drawn from the same distribution.”

        Not every method that doesn’t use priors gives a perfect fit; some just give pretty terrible ones. Nor do I see why, “in this context” (or any other), flat priors amount to the same thing as a method that does not use priors.

        • The context in question is a generalized linear model fit by maximum likelihood (including straight-up linear modeling as a special case). “Amounting to the same thing” here means leading to the same optimal point in the parameter space. Given these points, my claim is correct.

          • Corey: So the penalty here concerns the estimates of the parameters, but what’s the gist of the penalty: kick out those small enough to be deemed 0? If so, how is the “smallness” measured? What are the model assumptions, do you know? Thanks much.

            • Mayo: The model assumptions are the same as in unregularized regression. The penalty is usually a function of the 1-norm and/or the 2-norm of the parameter vector; it favors norm-smaller parameter vectors directly, and the 1-norm component does set sufficiently small coefficients exactly to zero. Have a look at Wikipedia’s article on elastic net regularization for more details.
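
              A quick sketch in Python/scikit-learn of that zeroing-versus-shrinking contrast (toy data and penalty values are my own assumptions, not anything from the Duke work):

              import numpy as np
              from sklearn.linear_model import Lasso, Ridge

              rng = np.random.default_rng(3)
              n, p = 50, 20
              X = rng.normal(size=(n, p))
              beta = np.zeros(p)
              beta[:3] = [3.0, -2.0, 1.5]          # only 3 real effects
              y = X @ beta + rng.normal(size=n)

              lasso = Lasso(alpha=0.3).fit(X, y)   # 1-norm penalty
              ridge = Ridge(alpha=10.0).fit(X, y)  # squared 2-norm penalty

              print("lasso coefficients set exactly to 0:", int(np.sum(lasso.coef_ == 0)), "of", p)
              print("ridge coefficients set exactly to 0:", int(np.sum(ridge.coef_ == 0)), "of", p)
              # Typically the lasso zeroes out most of the 17 irrelevant coefficients
              # exactly, while ridge leaves them all small but nonzero.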

  3. The Senn post that I thought was possibly relevant here was this:
    https://errorstatistics.com/2013/12/03/stephen-senn-dawids-selection-paradox-guest-post/

    • Steven McKinney

      The Dawid paper that Senn discusses has this passage in Section 4, “Discrete prediction and discrimination”, which is certainly relevant to the Duke predictor scenario:

      “Statisticians tend to work with applied problems, such as medical diagnosis and (more especially) prognosis, where it would not be reasonable to suppose that classification could be done almost perfectly, were only enough variables to be observed. But there are certainly problems (e.g. botanical classification) where the opposite is true — indeed, much of the work on pattern recognition undertaken within the artificial intelligence community seems based on archetypal applications in which perfect classification is possible. The Dirichlet prior appears well suited to such problems, but inappropriate to express the more “statistical” problems in which residual uncertainty can never be eliminated. Other priors expressing this idea need to be developed and explored.”

      Dawid does not appear to me to be supporting the notion that prior constraints provide protection in the medical diagnostics arena.

      Dawid, A. P. (1994). “Selection Paradoxes of Bayesian Inference.” In Multivariate Analysis and Its Applications, IMS Lecture Notes – Monograph Series, Vol. 24.

  4. Stan Young

    Clearly written.

    Just some quick comments:

    “the trials were allowed to continue (after a brief pause to let the Duke internal committee investigate, but they found no problems.) Surely they would have tested and validated a model on which they would be recommending chemo treatments and associated surgery;”

    [SSY: My understanding is that the Duke internal review committee did not examine the actual data sets.]

    “When we apply the same methods but maintain the separation of training and test sets, predictions are poor..”

    [SSY: With complex data sets and methods it is common to hold out some of the data, develop a prediction model on part of the data, then test the prediction method on the hold out set. This is what Baggerly did and the Potti predictions were very poor/chance. Potti does not seem to understand there is a problem.]

    “Their software applies SVD to the training and test data simultaneously, …”

    [SSY: Applying SVD across all samples to define the metagenes means that information has been extracted from the hold-out samples. The hold-out samples are no longer independent.]
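
    A minimal sketch (toy data; not the Duke software) of the procedural difference being flagged here: learning the SVD directions from the training samples alone and then projecting the held-out samples, versus factoring training and held-out samples together, in which case the held-out data help define the very metagene directions used to score them.

    import numpy as np
    from sklearn.decomposition import TruncatedSVD

    rng = np.random.default_rng(4)
    X_train = rng.normal(size=(40, 200))   # 40 training samples, 200 "genes"
    X_hold = rng.normal(size=(20, 200))    # 20 supposedly held-out samples

    # Clean protocol: metagene directions come from the training data only;
    # hold-out samples are merely projected onto those fixed directions.
    svd_train = TruncatedSVD(n_components=3, random_state=0).fit(X_train)
    scores_hold = svd_train.transform(X_hold)   # hold-out metagene scores

    # Criticized protocol: factor everything at once, so the hold-out samples
    # help determine the directions they are later scored on.
    svd_all = TruncatedSVD(n_components=3, random_state=0).fit(
        np.vstack([X_train, X_hold]))

    # The cosines between the two sets of directions fall short of 1:
    # including the hold-out data moved the "metagenes".
    print(np.round(np.abs(svd_train.components_ @ svd_all.components_.T), 2))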

    [SSY: It is down in the weeds, but SVD has been shown to mix mechanisms together, Fogel et al. The American Statistician, 2013. So even if Duke got the logistics correct, their use of SVD was problematic.]

    [SSY: Again the Hayek Nobel prize speech comes to mind. The biology is complicated. The statistical methods are complicated. It takes an expert to evaluate an expert. It was a very favorable set of circumstances that Coombes and Baggerly could take the time to look into this situation. The Duke researchers seem to have been working beyond their statistical training.]

    • Stan:

      [SSY: My understanding is that the Duke internal review committee did not examine the actual data sets.]

      Well isn’t it a little bizarre to call in a committee to decide if the data warrant continuing the trials, and they do not examine the data sets? In reading the IOM report (focusing on the places where they discuss the Duke case, and it’s not a lot) there’s a suggestion that the review committee admitted they could not get the purported results (I’d have to check the wording). One of the B & C articles, I guess the 2009 one, was not given to the review committee lest it bias them. Yet somewhere else there’s a claim that Potti and Nevins have adequately responded to the B&C charges.

      “When we apply the same methods but maintain the separation of training and test sets, predictions are poor..”

      [SSY: With complex data sets and methods it is common to hold out some of the data, develop a prediction model on part of the data, then test the prediction method on the hold out set. This is what Baggerly did and the Potti predictions were very poor/chance. Potti does not seem to understand there is a problem.]

      Yes, and I worry that others also do not seem to understand there is a problem. That’s why McKinney was/is calling for an analysis of the properties of the method. It doesn’t matter that, doubtless, there are new variations on the method; a general point can be made about what goes wrong.

      “Their software applies SVD to the training and test data simultaneously, …”

      [SSY: Applying SVD across all samples to define the metagenes means that information has been extracted from the hold-out samples. The hold-out samples are no longer independent.]

      So you don’t agree with the argument (e.g., in Spang et al., above) that the prior saves the method from merely “storing” the information?

      [SSY: It is down in the weeds, but SVD has been shown to mix mechanisms together, Fogel et al. The American Statistician, 2013. So even if Duke got the logistics correct, their use of SVD was problematic.]

      Mix mechanisms? I hadn’t heard this before. What does it mean?

      [SSY: Again the Hayek Nobel prize speech comes to mind. The biology is complicated. The statistical methods are complicated. It takes an expert to evaluate an expert. It was a very favorable set of circumstances that Coombes and Baggerly could take the time to look into this situation. The Duke researchers seem to have been working beyond their statistical training.]

      Well they deserve medals for persevering, but notice that their article kept getting turned down by medical journals. It’s a good thing that Efron was willing to speedily publish it in the applied stat journal he edits. But even then the Potti and Nevins response was “they didn’t use our method, so no wonder they couldn’t reproduce our results.” (They also denied that any of the errors in the figures were used in any way for the predictions.) A letter from 30+ statisticians and biostatisticians got the trials stopped, but they were then restarted. It appears that it was only the discovery that Potti lied about being a Rhodes scholar that made the NCI clamp down. This of course gave the rest of the Duke people involved an out: he’s a fraud, he must have tampered with the data, it’s not our fault.

      You say it’s beyond their statistical training. Really? This is Duke. How many statisticians and biostatisticians do they have? Many of course hoped to eventually patent the predictor.

      • Down in the weeds. SVD/PCA is a multivariate technique that factors an n by p matrix into two matrices, n by k and k by p, with a diagonal k by k matrix of singular values in between,

        X = LDR, left, diagonal, right.

        The elements of L are often useful for studying how the samples, rows of X, are related to one another. The problem is with R, usually called the loadings. The rows of R are orthogonal to one another by construction of the SVD factorization. In contrast, non-negative matrix factorization, X = LR, does not force the rows of R to be orthogonal, but it does require the elements of X, L and R to be non-negative. The claim by researchers using NMF is that NMF can recover the generating matrices; informally, the pairs of vectors in L and R are called “mechanisms”.

        So how does this relate to the Duke data analysis problem? Using NMF, genes in the same mechanistic pathway can/will appear on one vector in R. With SVD/PCA genes from different pathways will appear in a single vector in R. Stated another way, genes specific to two different types of tumors can/will appear in one right vector of SVD. Pain.
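
        A tiny numerical illustration of this point (mine, not from Fogel et al.; the block sizes and noise level are arbitrary assumptions): build a matrix from two non-negative “mechanisms” whose genes occupy disjoint blocks. The SVD right vectors, forced to be orthogonal, put weight on genes from both blocks, while NMF tends to recover the two blocks on separate vectors.

        import numpy as np
        from sklearn.decomposition import NMF

        rng = np.random.default_rng(1)
        L_true = np.abs(rng.normal(size=(30, 2)))   # 30 samples, 2 mechanisms
        R_true = np.zeros((2, 20))
        R_true[0, :10] = 1.0                        # mechanism 1: genes 1-10
        R_true[1, 10:] = 1.0                        # mechanism 2: genes 11-20
        X = L_true @ R_true + 0.01 * np.abs(rng.normal(size=(30, 20)))

        # SVD: rows of Vt (the "R" above) are orthogonal by construction, so
        # each right vector mixes genes from both blocks.
        U, d, Vt = np.linalg.svd(X, full_matrices=False)
        print("SVD right vectors:\n", np.round(Vt[:2], 2))

        # NMF: non-negative factors, no orthogonality constraint; the recovered
        # rows of R tend to line up with the original block structure.
        nmf = NMF(n_components=2, init="nndsvda", max_iter=1000, random_state=0)
        L_hat = nmf.fit_transform(X)
        print("NMF right vectors:\n", np.round(nmf.components_, 2))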

        See Fogel P, Hawkins DM, Beecher C, Luta G, Young SS. (2013) A tale of two matrix factorizations. The American Statistician 67, 207-218.

        Of course there are wonderful statisticians at Duke, but my understanding is that they were not involved in these papers. University research is typically done in small teams and people are called in as the leader of the team thinks it useful. Within a university, there is generally no university oversight of the research process, unlike industrial research.

        • Stan?

          It is embarrassing that this happened at Duke but many universities are large unconnected entities.

          When I was at the Statistics Department at Duke (2007-8), I never once interacted with or met anyone from the Biostatistics Department or any Statisticians based in the Medical School. There was some discussion in the department about trying to interact more with Biostatistics in the future so that may have changed somewhat.

          But my experience at the University of Toronto, when I was a teaching-hospital-based statistician, might be more relevant. When I left for another university, one of the more active applied people in the Statistics Department told me I was the only hospital-based statistician in Toronto who conferred on my work with members of the Statistics Department.

          Apparently everyone else simply would not discuss their work, likely fearing criticism and potential embarrassment. They got their degree in statistics and simply do what they think they learned was right.

          Unfortunately there are no good data on what percentage of folks working as statisticians (with degrees) are competent in a clinical research setting – my guess/prior is 10%.

    • Steven McKinney

      The Duke internal committee report is still available at The Cancer Letter website Documents page – look towards the bottom for the entry “Documents on Duke University genomics research: Duke report and email obtained by The Cancer Letter; comments by biostatisticians on the report (The Cancer Letter, May 14, 2010).” and select the first hyperlink.

      No author names, redaction ink all over the place – does this report inspire any scientific confidence for anyone? Would you opt for a medical treatment with this document as a testimonial?

      As I read the report, the anonymous advocates appear to state that they did investigate using the Duke data:

      “Summary of Charge 2

      In careful review of the published literature, the methods were not described in sufficient detail to allow replication based on what has been published; however, by studying the R code from Barry we were able to develop a parallel approach using [redacted] software and using the same and different data we were able to show that the approach used in the Duke software is able to generate valid predictors for the Adriamycin predictor and, furthermore, we believe the approach can generate valid predictors in general.”

      The anonymous authors “believed” the approach could work.

      I guess I am an atheist here – belief is not enough. Such statements about the ability of the statistical procedure surely admit testable hypotheses that can be demonstrated mathematically (as Dawid does in the paper referenced by Senn) and via computational simulations. The large swaths of the Duke report that sit under black ink do not persuade me as to the validity of this purported predictive procedure.

      • http://www.cancerletter.com/articles/20131018_2

        Nevins said they were validated:

        “Nevins remained a stalwart supporter of his protégé, and vigorously defended Potti’s work.
        In an interview with The Cancer Letter early in the controversy, Nevins said the Duke results had been confirmed by researchers at the European Organization for Research and Treatment of Cancer and were published by Lancet Oncology in December 2007.

        “Data was made available to us, blinded,” Nevins said to The Cancer Letter at the time. “All we got was the gene expression data. We ran the predictions and sent it back to the EORTC investigators, including the statisticians in the EORTC group.

        “They took the results, analyzed it in the context of the clinical responses in that study, and did further analyses with respect to evaluating developing combined probability measures. Even if you just look at the predictions for single agents in that study that came from our predictions—that was completely blinded to us—there is a clear ability to predict in the two arms of the trial with the individual predictors.” (The Cancer Letter, Oct. 2, 2009).
        The study, it turned out, was not blinded, Duke’s European collaborators of the Duke team said to The Cancer Letter at the time. The data originally sent to Duke had never been masked in any way. (The Cancer Letter, Oct. 23, 2009)”

        “Nevins and Potti were business partners as much as they were colleagues, garnering millions of dollars in grants as well as investments in their company, CancerGuideDX, which controlled the predictor model the team developed.”

        COI?
        http://www.cancerletter.com/articles/20131018_2

    • Steven McKinney

      From the Fogel et al. 2013 “A Tale of Two Matrix Factorizations”:

      “Three SVD vector pairs capture nearly all of the variance in the matrix, 99.7%, with singular values of 60.8, 26.2, and 13.1. NMF also recovers 99.7% of the variance. However, now that “inflammation genes” are turned on for all groups except the control group, NMF does not find the groups well and has a “ghost” in one of the NMF vectors (Figure 12). This clearly illustrates how deviation from D&S’s sufficient conditions cannot always be surmounted, even with an SVD-based initialization scheme.

      4.2.4 Some Discussion of the Synthetic Samples

      “Ghosts” could in principle be removed by a method that zeroed out unimportant terms in the alternating regressions. One familiar method of doing this is the LASSO, which adds an L1 penalty to a regression which has the effect, as the penalty coefficient increases, of getting progressively more sparse models. We implemented and tested a version of LASSO for NMF-LASSO by slightly adapting Lin’s projected gradient algorithm (Lin 2005) to include the L1 penalty in its alternating regressions—see also Hoyer (2004).

      Using our version of NMF-LASSO, the “ghost” of inflammation genes can be made to disappear (data not shown), but finding the right penalty was far from trivial: we increased it progressively until the ghost completely disappeared. Obviously, this strategy only works when you know exactly what you want.”

      Interesting to think on how well this strategy outlined above matches the Duke strategy to try and retry to make their model work on the NCI test data. They certainly knew what they wanted once they had had a good look at the test data.

  5. visitor

    A discussion by Jane Feinmann of a presentation by Stephen Senn, “Is the current system of publishing clinical trials fit for purpose?” (his answer is No!), is relevant here.
    http://blogs.bmj.com/bmj/2014/05/30/jane-feinmann-is-the-current-system-of-publishing-clinical-trials-fit-for-purpose/

    • This indictment of JAMA by Senn reminded me of his “Casting Stones” guest post, which includes a link to Baggerly and Coombes.
      https://errorstatistics.com/2013/03/07/stephen-senn-casting-stones/
      The IOM report has a good section on recommendations for journals. Apparently one journal in biostatistics gives authors the choice to request having their data checked for reproducibility. Seems like a good idea.

  6. Ian Crandell

    I find it odd that P&N claim that using ENet priors eliminates the need to separate testing from training data. As is mentioned in the Fogel paper above:

    “the “ghost” of inflammation genes can be made to disappear (data not shown), but finding the right penalty was far from trivial: we increased it progressively until the ghost completely disappeared. Obviously, this strategy only works when you know exactly what you want.”

    In practice, the ENet penalty (conceptually identical to the LASSO penalty Fogel talks about) is chosen by examining a bunch of different penalty values and seeing which “fits” best. If you do this evaluation on the whole of the data, there are tremendous overfitting problems. ENet doesn’t eliminate the need to hold out test data; it makes it even more essential!
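
    To illustrate that point with a toy sketch (my own; the penalty grid, sample sizes, and scikit-learn calls are illustrative assumptions, not the Duke pipeline): with pure-noise labels, the best cross-validated tuning score tends to drift above chance simply because we report the maximum over penalty values, while a genuinely untouched test set stays near 50%.

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score, train_test_split

    rng = np.random.default_rng(2)
    X = rng.normal(size=(100, 300))          # pure noise: true accuracy is 50%
    y = rng.integers(0, 2, size=100)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=40, random_state=0)

    penalties = [0.001, 0.01, 0.1, 1.0, 10.0]
    cv_scores = []
    for C in penalties:
        enet = LogisticRegression(penalty="elasticnet", l1_ratio=0.5, C=C,
                                  solver="saga", max_iter=5000)
        cv_scores.append(cross_val_score(enet, X_tr, y_tr, cv=5).mean())

    best_C = penalties[int(np.argmax(cv_scores))]
    print("best tuning (CV) score:", round(max(cv_scores), 2))   # optimistic

    final = LogisticRegression(penalty="elasticnet", l1_ratio=0.5, C=best_C,
                               solver="saga", max_iter=5000).fit(X_tr, y_tr)
    print("held-out test accuracy:", round(final.score(X_te, y_te), 2))  # honest
    # The tuned score is selected to look good; only the untouched hold-out set
    # gives an unbiased read, which is why a data-tuned penalty makes holding out
    # test data more essential, not less.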

    • Ian: Good to hear from you (Ian was in the grad seminar I recently taught with Aris Spanos). I don’t know if P&N are alluding to these elastic net priors or not; I haven’t yet seen the Fogel paper (which of course they wouldn’t have known about). I thought they were using conjugate priors. So what are “ghosts”, influences of confounding factors? “ENet doesn’t eliminate the need to hold out test data, it makes it even more essential!”
      Baggerly and Coombes (B&C) demonstrated a straightforward way to check (after the fact) that the P&N models were overfitting, e.g., tests between means. (They did no better than chance, so this was an error statistical critique.) It would be good to identify an overarching criterion behind the thought that it is “even more essential” to hold out test data. Otherwise one feels there are all kinds of different “penalties” without their being connected to a unified goal of what must be avoided (for a legitimate claim). Just because they provide constraints doesn’t mean you end up avoiding pretty awful results. Do you see my point? Thanks for commenting.

  7. Here’s a link to the Fogel et al paper.

  8. Steven McKinney

    The suggested papers by commenters on this topic have been useful reading indeed. Thanks for the references, and I look forward to more.

    The Tale of Two Matrix Factorizations paper by Fogel, Young et al. provides a perfect example outlining the type of study I always thought should have been performed and provided by the Duke groups involved in the above discussions. Thanks to Stanley Young for this reference and his comments above.

    Section 4 of Fogel et al. shows how to generate structured data sets that can be used to assess the ability of such algorithms and statistical methods. Researchers at Duke should read this paper, and do some similar research.

    The Fogel et al. paper references a paper by Brunet, Tamayo et al.

    “Metagenes and molecular pattern discovery using matrix factorization”
    Jean-Philippe Brunet, Pablo Tamayo, Todd R. Golub, and Jill P. Mesirov
    http://www.pnas.org/cgi/doi/10.1073/pnas.0308531101

    I commented on another paper involving Tamayo and colleagues during the Duke fiasco discussions on Retraction Watch, when an anonymous Duke defender mentioned the paper:

    Metagene projection for cross-platform, cross-species characterization of global transcriptional states
    Pablo Tamayo, Daniel Scanfeld, Benjamin L. Ebert, Michael A. Gillette, Charles W. M. Roberts, and Jill P. Mesirov
    http://www.pnas.org/cgi/doi/10.1073/pnas.0701068104

    My comments are at these URLs:

    http://retractionwatch.com/2012/02/10/two-mega-corrections-for-anil-potti-in-the-journal-of-clinical-oncology/#comment-12198
    http://retractionwatch.com/2012/02/10/two-mega-corrections-for-anil-potti-in-the-journal-of-clinical-oncology/#comment-11095

    The paper was a very interesting read, and uses the nonnegative matrix factorization described in the Fogel et al. paper.

    I contacted the authors at the time, Drs. Mesirov and Tamayo. They were extremely helpful, and provided me with a software suite within two days of contact. I was able to run their software within minutes, and reproduce several of the findings described and illustrated in their paper. Truly a proper example of a group striving to do the right thing and practicing reproducible research methods.

    I see a more recent meeting abstract by this group:

    The Journal of Immunology, 2012, 188, 58.13
    Copyright 2012 by The American Association of Immunologists, Inc.
    58.13
    Metagene projection strategies for discovering mechanisms of T cell differentiation in gene expression profiling data

    and papers such as

    doi:10.1371/journal.pone.0040739

    So they are still striving to develop reasonable methods for understanding structure in genomic data, and I look forward to seeing further progress by this group at the Broad Institute.

  9. Pingback: Potti Update: “I suspect that we likely disagree with what constitutes validation” (Nevins and Potti) | Rejected Posts of D. Mayo

  10. Pingback: Findings of the Office of Research Misconduct on the Duke U (Potti/Nevins) cancer trial fraud: No one is punished but the patients | Error Statistics Philosophy

