Your (very own) personalized genomic prediction varies depending on who else was around?


personalized medicine roulette

As if I weren’t skeptical enough about personalized predictions based on genomic signatures, Jeff Leek recently had a surprising post about “A surprisingly tricky issue when using genomic signatures for personalized medicine”. Leek (on his blog Simply Statistics) writes:

My student Prasad Patil has a really nice paper that just came out in Bioinformatics (preprint in case paywalled). The paper is about a surprisingly tricky normalization issue with genomic signatures. Genomic signatures are basically statistical/machine learning functions applied to the measurements for a set of genes to predict how long patients will survive, or how they will respond to therapy. The issue is that usually when building and applying these signatures, people normalize across samples in the training and testing set.

….it turns out that this one simple normalization problem can dramatically change the results of the predictions. In particular, we show that the predictions for the same patient, with the exact same data, can change dramatically if you just change the subpopulations of patients within the testing set.

Here’s an extract from the paper, “Test set bias affects reproducibility of gene signatures”:

Test set bias is a failure of reproducibility of a genomic signature. In other words, the same patient, with the same data and classification algorithm, may be assigned to different clinical groups. A similar failing resulted in the cancellation of clinical trials that used an irreproducible genomic signature to make chemotherapy decisions (Letter (2011)).

This is a reference to the Anil Potti case:

Letter, T. C. (2011). Duke Accepts Potti Resignation; Retraction Process Initiated with Nature Medicine.

But far from the Potti case being some particularly problematic example (see here and here), at least with respect to test set bias, this article makes it appear that test set bias is a threat to be expected much more generally. Going back to the abstract of the paper:

ABSTRACT Motivation: Prior to applying genomic predictors to clinical samples, the genomic data must be properly normalized to ensure that the test set data are comparable to the data upon which the predictor was trained. The most effective normalization methods depend on data from multiple patients. From a biomedical perspective, this implies that predictions for a single patient may change depending on which other patient samples they are normalized with. This test set bias will occur when any cross-sample normalization is used before clinical prediction.

Results: We demonstrate that results from existing gene signatures which rely on normalizing test data may be irreproducible when the patient population changes composition or size using a set of curated, publicly-available breast cancer microarray experiments. As an alternative, we examine the use of gene signatures that rely on ranks from the data and show why signatures using rank-based features can avoid test set bias while maintaining highly accurate classification, even across platforms…..

“The implications of a patient’s classification changing due to test set bias may be important clinically, financially, and legally. … a patient’s classification could affect a treatment or therapy decision. In other cases, an estimation of the patient’s probability of survival may be too optimistic or pessimistic. The fundamental issue is that the patient’s predicted quantity should be fully determined by the patient’s genomic information, and the bias we will explore here is induced completely due to technical steps.”

“DISCUSSION We found that breast cancer tumor subtype predictions varied for the same patient when the data for that patient were processed using differing numbers of patient sets and patient sets had varying distributions of key characteristics (ER* status). This is undesirable behavior for a prediction algorithm, as the same patient should always be assigned the same prediction assuming their genomic data do not change (6)…

“This raises the question of how similar the test set needs to be to the training data for classifications to be trusted when the test data are normalized.”

*Estrogen receptor.

Returning to Leek’s post:

The basic problem is illustrated in this graphic.


[Graphic from Leek’s post]


This seems like a pretty esoteric statistical issue, but it turns out that this one simple normalization problem can dramatically change the results of the predictions. …

In this plot, Prasad made predictions for the exact same set of patients two times when the patient population varied in ER status composition. As many as 30% of the predictions were different for the same patient with the same data if you just varied who they were being predicted with.

This paper highlights how tricky statistical issues can slow down the process of translating ostensibly really useful genomic signatures into clinical practice and lends even more weight to the idea that precision medicine is a statistical field.
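The effect Leek describes can be reproduced in a toy setting. The sketch below is not the paper's analysis: the intensities, the z-score normalization, and the threshold rule are all hypothetical, chosen only to show how the same patient value can land on different sides of a fixed cutoff once it is normalized together with different companions.

```python
from statistics import mean, pstdev

def zscore_across_samples(values):
    """Cross-sample normalization: center and scale one gene over a test set."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

def classify(z, threshold=0.0):
    # Hypothetical locked-down rule applied to the normalized value.
    return "high" if z > threshold else "low"

# Hypothetical log-intensity of a single gene for patient X.
patient_x = 8.0

# Two hypothetical test sets that differ only in who else is in them.
mostly_high = [10.0, 10.5, 11.0, 9.5, 10.2]
mostly_low = [5.0, 5.5, 6.0, 4.5, 5.2]

def predict_with(cohort):
    # Patient X is normalized together with the rest of the cohort.
    z = zscore_across_samples([patient_x] + cohort)
    return classify(z[0])

print(predict_with(mostly_high))  # low
print(predict_with(mostly_low))   # high
```

Patient X's raw measurement never changes; only the mean and spread of the surrounding samples do, and that is enough to flip the call.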

As a complete outsider to this field, I’m wondering, at what point in the determination of the patient’s prediction does the normalization apply? A patient walks into her doctor’s office and is to get a prediction/recommendation?…
As for their recommendation not to normalize but use ranks, can it work? Should we expect these concerns to be well taken care of in the latest rendition of microarrays?


Prasad Patil, Pierre-Olivier Bachant-Winner, Benjamin Haibe-Kains, and Jeffrey T. Leek, “Test set bias affects reproducibility of gene signatures.” Bioinformatics Advance Access published March 18, 2015, OUP.



Categories: Anil Potti, personalized medicine, Statistics


10 thoughts on “Your (very own) personalized genomic prediction varies depending on who else was around?”

  1. vl

    It’s odd that something that should be immediately obvious needs to get published in a top journal within this field. I’ll take a charitable read that this reflects the naivety of the field necessitating a paper and not that Leek’s group believes that they’ve made a novel discovery here.

    In general genomics treats statistics as rituals, arguably even more so than often-criticized fields like epidemiology and economics. Data normalization and significance are thought of as incantations with ‘correct’ answers, not data transformations for which one needs to think through the flow of information. This is what leads to logic along the lines of “we’ve applied the ‘correct’ normalization, so we’re confident to move forward with downstream analyses”.

    Ranks can be useful, but I think too often statistically unsophisticated audiences reach for these non-parametric approaches as a panacea when they can hide uncertainty. What’s the uncertainty/stability of a rank estimate? I think it’s more useful to think of ranking more generally as performing a similar role to a quantile normalization with respect to a reference distribution (e.g. to a normal distribution), in that it’s smoothing out the nonlinearities in the scale of a measurement.

    Normalization can be needed at multiple scales, but ultimately, ranking at the subject level may be the most practical one (to me I don’t see a huge difference between rank normalization and quantile normalization to a reference distribution within a subject). If your technology _needs_ normalization at the batch level (e.g. at different sites or across an entire study population) _and_ your batches change in unpredictable ways between your model generation and application, then that probably means the tech is not reproducible enough to be driving these decisions.
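One way to picture the within-subject normalization vl has in mind is the sketch below. It is an illustration under assumptions, not anyone's published method: each chip's values are replaced by standard normal quantiles of their within-chip ranks, using only that chip's data. The function name `normal_scores`, the plotting position (rank minus 0.5, over n) and the example values are hypothetical, and ties are assumed absent.

```python
from statistics import NormalDist

def normal_scores(chip):
    """Map one chip's values to standard normal quantiles of their ranks."""
    n = len(chip)
    # Indices of the values in increasing order (no ties assumed).
    order = sorted(range(n), key=lambda i: chip[i])
    scores = [0.0] * n
    for position, i in enumerate(order, start=1):
        # (position - 0.5) / n keeps the probability strictly inside (0, 1).
        scores[i] = NormalDist().inv_cdf((position - 0.5) / n)
    return scores

chip = [7.2, 3.1, 9.8, 5.5, 6.4]       # hypothetical intensities
shifted = [v + 2.0 for v in chip]       # same chip, brighter image

print(normal_scores(chip) == normal_scores(shifted))  # True
```

No other patient's data enters the calculation, so the result cannot depend on who else was assayed.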

    • vl:
      I’m not sure I understand your saying: “In general genomics treats statistics as rituals, arguably even more so than often-criticized fields like epidemiology and economics. Data normalization and significance are thought of as incantations with ‘correct’ answers”.

      How should they be doing it?

  2. Steven McKinney

    Well said, vl.

    But this is science, so things do need to be quantified. Thus although it seems to be immediately obvious, even the immediately obvious must be reasonably quantified to yield defensible scientific conclusions. Leek’s student has done just that. Shown that the technology is not reproducible enough to drive the decisions that, for example, the Duke genomics group were claiming to have developed.

    Normalization batch effects can be sizable. So tests based on Affy or Illumina chip assays that require “normalization” are demonstrably problematic. Leek’s student Prasad Patil’s results are valuable indeed. We still have work to do in developing reliable genomic assays.

    • Steven:
      Thanks for writing, especially in that you know the Potti case. I don’t quite get that this is expected behavior for a prediction rule. Can you answer my question (about the sequence of steps to a woman’s prognosis/recommended treatment)?

      • Steven McKinney

        If a classifier is built using absolute values for data coming from a gene chip, for example in a regression model, the normalization of gene chip data (details below) needs to happen before any values from the gene chip are used in a classifier. So then what strategy do you use to collect enough samples so you can run a “batch normalization” exercise?

        Using ranks obviates this issue – the ranks of the intensity values within a gene chip do not change if we add or subtract some value to adjust for brighter or darker gene chip images.
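A quick check of the invariance described above, as a minimal sketch with made-up intensity values and no ties: shifting every value on a chip by the same amount leaves the within-chip ranks untouched.

```python
def ranks(values):
    """Return the 1-based rank of each value within a single chip (no ties assumed)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        result[i] = rank
    return result

chip = [7.2, 3.1, 9.8, 5.5]            # hypothetical probe intensities
brighter = [v + 1.5 for v in chip]     # same chip, brighter image
dimmer = [v - 0.8 for v in chip]       # same chip, dimmer image

print(ranks(chip))      # [3, 1, 4, 2]
print(ranks(brighter))  # [3, 1, 4, 2]
print(ranks(dimmer))    # [3, 1, 4, 2]
```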

        The good thing about the test proposed by Patil et al. is that it is shown in full in the paper (see Fig 6 in the bioRxiv preprint linked above). This is a simple “locked down” model that others can check and assess, not a hand-wavy word salad of fancy statistical mumbo-jumbo that no one else can reproduce. The tree-based model uses eight simple rules (the rank for this gene is less than the rank for that gene, e.g. ESR1 (estrogen) < FOXC1) to determine which breast cancer subtype appears most appropriate for the patient. Assessment by others of the Patil et al. predictive model can now begin. As an aside, I can't double check the accuracy of the model on the breast cancer data we generated at the British Columbia Cancer Agency because these authors used our data in their model development.
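To make the shape of such a rule concrete, here is a sketch of a single pairwise comparison. Only the ESR1 < FOXC1 comparison comes from the description above; the branch labels, the expression values, and the one-node "tree" are placeholders, and the paper's actual eight-rule tree is not reproduced here.

```python
def classify(sample):
    """One node of a pairwise-comparison tree, using only this patient's own values."""
    # Comparing two genes within one sample is the same as comparing their ranks.
    if sample["ESR1"] < sample["FOXC1"]:
        return "branch_A"   # placeholder label, not a subtype from the paper
    return "branch_B"       # placeholder label, not a subtype from the paper

# Hypothetical expression values for one patient.
patient = {"ESR1": 4.0, "FOXC1": 9.0}

# The same patient after a hypothetical additive brightness adjustment.
adjusted = {gene: value + 2.0 for gene, value in patient.items()}

print(classify(patient))   # branch_A
print(classify(adjusted))  # branch_A
```

Nothing about any other patient enters the decision, which is why a rule of this form can be applied to a single assay as soon as it is run.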

        Patil et al's method allows the data from individual assays to be used immediately, without having to wait while you accrue enough additional patient data to run a batch normalization. For cancers that occur often, such as breast cancer, one might devise a reasonable batch normalization strategy, as data accumulates relatively quickly. But for rare cancers, patients do not have the luxury of waiting while other patient data accrues so that a batch normalization exercise can be run. Patients need diagnostic and prognostic information quickly, especially for aggressive cancers where rapid treatment decisions are vital.

        The effort by Patil, Leek et al. is a good model of proper development of a classifier, with full computer code available to allow assessment of the methodology. The property that individual patient data can be used immediately and is not dependent on other patient data is important in a clinical setting.

        Can it work? It certainly stands a better chance than other efforts we have seen to date.

        As for whether this can work with the latest rendition of microarrays – interestingly the microarray materials always come with a disclaimer "For Research Use Only. Not for use in diagnostic procedures." So there is some work to do in getting approval from regulatory and commercial agencies before patient diagnostics can proceed.


        This normalization conundrum arises because the data from a gene chip are the intensity values obtained when a laser beam is scanned across the millions of spots on a gene chip, each spot being a small island of short strands of DNA (20 to 30 bases long, the bases being the four bases that comprise DNA, A, C, T and G, e.g. ACCTGCCAAATGTGCCGGTAGCC). The spots of DNA strands are called probes. Gene chips on a molecular scale look something like a velcro patch, each velcro hair being one of these strands of 20 to 30 bases. Each individual island of strands (each probe) is comprised of the same alphabet strand, and each different island has its own unique code for its strands (e.g. ACCTGCCAAATGTGCCGGTAGCC for probe 1 and say AACTGGGTATTCCGTTAATAGGGC for probe 2 etc.)

        There are several factors that contribute to how brightly each spot will glow when the laser is shined upon the spot. The concentration and amount of reagents in the liquid that is washed over the gene chip along with the patient tissue contribute to a brighter or dimmer glow. We hope that only the quantity of DNA or RNA in the patient sample would change the intensity with which the appropriate probe glows when hit with the laser light, but these other reagent issues also contribute to the strength of the glow. We don't want to change the type of breast cancer we declare for a patient based on variation in batch reagents, so geneticists and bioinformaticians use several "normalization" algorithms to adjust each gene chip's data (adding values to data from chips that yielded dimmer images, subtracting values from data from chips that yielded brighter images) to control for such artefactual changes in data values.
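The add-and-subtract adjustment described above can be caricatured as follows. This is a deliberately simplified stand-in for real normalization algorithms, with made-up values: each chip is shifted so that its median matches the median of the chip medians in its batch. The point of the sketch is that the shift applied to one patient's chip depends on which other chips are in the batch.

```python
from statistics import median

def center_to_batch(chips):
    """Shift each chip so its median equals the batch-wide target median."""
    chip_medians = [median(chip) for chip in chips]
    target = median(chip_medians)
    return [[v - m + target for v in chip] for chip, m in zip(chips, chip_medians)]

patient_chip = [6.0, 8.0, 10.0]                       # hypothetical intensities
batch_a = [[5.0, 7.0, 9.0], [7.0, 9.0, 11.0]]         # similar brightness
batch_b = [[9.0, 11.0, 13.0], [10.0, 12.0, 14.0]]     # brighter companions

print(center_to_batch([patient_chip] + batch_a)[0])   # [6.0, 8.0, 10.0]
print(center_to_batch([patient_chip] + batch_b)[0])   # [9.0, 11.0, 13.0]
```

The patient's adjusted values differ by three units between the two batches, while the ordering of values within the chip is the same in both.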

        • Steven: Thank you for the detailed description. I have seen these microarrays several times and have read a lot about them, but I thought the issue you are referring to toward the end, about the reagents, was problematic when they didn’t take random samples from different batches. I thought the current issue was different in that the composition of other patients altered the algorithm, which isn’t so surprising, if you’re just building the model as you go. But I thought they’d gotten over the earlier problem, so maybe I’m confused.

          • Steven McKinney

            It’s confusing because there are several factors that affect the values that are obtained from a single gene chip assay. To mitigate these factors, batch normalization routines were developed which have proved useful in early studies using data from many gene chips.

            This paper shows quantitatively that another factor, the composition of patients within a batch of chip samples, affects the final absolute values for a given patient X in that patient sample batch. There are different “batch” things going on – batches of reagents, batches of chips, batches of patients, so some of the confusion is about which batch effects are important and how to handle them.

            Randomization is not often employed, as it ideally should be, in spreading patients across reagent batches and chip batches. Ultimately, however, for a method to be useful in the clinic it needs to work independently for each patient, so there is work to be done in reducing variability in reagent batches, chip manufacturing and all the other bits that will comprise a diagnostic or prognostic system. Sometimes manufacturers can reduce variability, and where they cannot, clever statistical methods can sometimes compensate.

            Any test will have some error rate, so the long term goal is to reduce error rates so that the use of genomic diagnostic and prognostic systems can demonstrably show improved outcomes for patients in clinical trials. This paper declares that ranks help to achieve that goal, and that claim now needs to be severely tested.

            • Steven: I’m grateful for your detailed replies, I had a feeling a lot of different things were going on.

  3. @steven fair point, although there’s a part of me that dislikes how well-known names like Leek and Ioannidis can get high-impact publications writing on things-that-should-be-obvious, while a lesser-known group would be dismissed as trying to publish something trivial or ignored altogether. Anyway, I get that that’s not Prasad’s or Leek’s fault but just the nature of the reputation/“star” system in academia.

    @mayo in my opinion part of the scientific method requires putting forth a model of how different aspects of data relate to one another. Usually, I’d consider such a model necessary to claim to have a well-specified hypothesis. In this predictive context an integrated model is not quite a hypothesis as it is an understanding of how information flows between multiple inference and prediction steps.

    This is my problem with the ‘jump in, jump out’ attitude towards statistics. The field of genomics loves this – jump in, do a hypothesis test, jump out and then take an ad hoc extra-statistical approach to the integration of information between inferences (venn diagrams, plots of p-values, qq-plots, ROC curves, hierarchical clustering, gene set enrichment, I could go on….).

    What happens then is that, in isolation, a normalization method may be “correct” in the sense that it improves statistical efficiency and makes an inference more robust. Likewise, in isolation, there may be nothing wrong with some unbiased estimator they chose for the prediction model. However, taken together with the issue of generalizing to a new population, the way in which these isolated inferences interact would lead to incorrect conclusions.

    The way to avoid making these mistakes is to have an integrated model/understanding of the relationship between underlying generative processes and how information propagates through the combined inference procedure.

  4. VL: Interesting. (1) I never meant “jump in, jump out” this way but rather the opposite, yet I see what you mean. What I intend is to recognize the pieces, rather than take any one of the formal statistical measures as directly applying to the overall inference. A statistical method shouldn’t be blamed, in other words, that it doesn’t directly supply some number, like a posterior probability, to some overall substantive claim. If you use it that way, you’re misusing it. I had invoked/introduced an overall judgment of how well or severely tested a claim is precisely to refer to a kind of meta-level appraisal that would demand considering the problem that is to be solved, or relevant hypothesis to be inferred. So it was to be an integrated assessment.

    Explain your “integrated model” [which] “is not quite a hypothesis as it is an understanding of how information flows between multiple inference and prediction steps.”

    (2) “Ioannidis can get high-impact publications writing on things-that-should-be-obvious”. But I don’t think it is obvious that scientists go to press based on a single statistically significant result obtained with the help of significance seeking and selection effects, as he alleges.

    I would be glad to hear that the problem with test set bias is anything like as much of a caricature as are Ioannidis’ depictions. That is, there are easy ways to avoid the biases and too-quick allegations of “definitive findings” of an Ioannidis, which is one of the reasons for deeming his computation of positive predictive values irrelevant to evidential warrant. Is there anything remotely analogous in the case of the problem of test set bias in the genomic cases being discussed? I’d be relieved if there were. You may have an “integrated model/understanding” which would be interesting to hear about, but I doubt that the problem here is at all similar to the fairly glaring bad science behind the Ioannidis argument. Right?

    In this connection, can you explain the “trivially obvious” test set bias problem?
