As if I wasn’t skeptical enough about personalized predictions based on genomic signatures, Jeff Leek recently had a surprising post about “A surprisingly tricky issue when using genomic signatures for personalized medicine”. Leek (on his blog Simply Statistics) writes:
My student Prasad Patil has a really nice paper that just came out in Bioinformatics (preprint in case paywalled). The paper is about a surprisingly tricky normalization issue with genomic signatures. Genomic signatures are basically statistical/machine learning functions applied to the measurements for a set of genes to predict how long patients will survive, or how they will respond to therapy. The issue is that usually when building and applying these signatures, people normalize across samples in the training and testing set.
…it turns out that this one simple normalization problem can dramatically change the results of the predictions. In particular, we show that the predictions for the same patient, with the exact same data, can change dramatically if you just change the subpopulations of patients within the testing set.
Here’s an extract from the paper, “Test set bias affects reproducibility of gene signatures”:
Test set bias is a failure of reproducibility of a genomic signature. In other words, the same patient, with the same data and classification algorithm, may be assigned to different clinical groups. A similar failing resulted in the cancellation of clinical trials that used an irreproducible genomic signature to make chemotherapy decisions (Letter (2011)).
This is a reference to the Anil Potti case:
The Cancer Letter (2011). Duke Accepts Potti Resignation; Retraction Process Initiated with Nature Medicine. (The paper’s bibliography mangles this as “Letter, T. C.”.)
But far from the Potti case being an isolated, uniquely problematic example (see here and here), this article makes it appear that, at least with respect to test set bias, the threat should be expected much more generally. Going back to the abstract of the paper:
ABSTRACT Motivation: Prior to applying genomic predictors to clinical samples, the genomic data must be properly normalized to ensure that the test set data are comparable to the data upon which the predictor was trained. The most effective normalization methods depend on data from multiple patients. From a biomedical perspective, this implies that predictions for a single patient may change depending on which other patient samples they are normalized with. This test set bias will occur when any cross-sample normalization is used before clinical prediction.
Results: We demonstrate that results from existing gene signatures which rely on normalizing test data may be irreproducible when the patient population changes composition or size using a set of curated, publicly-available breast cancer microarray experiments. As an alternative, we examine the use of gene signatures that rely on ranks from the data and show why signatures using rank-based features can avoid test set bias while maintaining highly accurate classification, even across platforms…
“The implications of a patient’s classification changing due to test set bias may be important clinically, financially, and legally. … a patient’s classification could affect a treatment or therapy decision. In other cases, an estimation of the patient’s probability of survival may be too optimistic or pessimistic. The fundamental issue is that the patient’s predicted quantity should be fully determined by the patient’s genomic information, and the bias we will explore here is induced completely due to technical steps.”…
“DISCUSSION We found that breast cancer tumor subtype predictions varied for the same patient when the data for that patient were processed using differing numbers of patient sets and patient sets had varying distributions of key characteristics (ER status). This is undesirable behavior for a prediction algorithm, as the same patient should always be assigned the same prediction assuming their genomic data do not change (6)…
“This raises the question of how similar the test set needs to be to the training data for classifications to be trusted when the test data are normalized.”
Returning to Leek’s post:
The basic problem is illustrated in this graphic.
This seems like a pretty esoteric statistical issue, but it turns out that this one simple normalization problem can dramatically change the results of the predictions. …
In this plot, Prasad made predictions for the exact same set of patients two times when the patient population varied in ER status composition. As many as 30% of the predictions were different for the same patient with the same data if you just varied who they were being predicted with.
This paper highlights how tricky statistical issues can slow down the process of translating ostensibly really useful genomic signatures into clinical practice and lends even more weight to the idea that precision medicine is a statistical field.
Prasad Patil, Pierre-Olivier Bachant-Winner, Benjamin Haibe-Kains, and Jeffrey T. Leek, “Test set bias affects reproducibility of gene signatures.” Bioinformatics Advance Access published March 18, 2015, OUP.
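The mechanism is easy to see in a toy simulation. Here is a minimal sketch (my own construction, not the paper’s pipeline; `quantile_normalize` and the simulated cohorts are illustrative): the same patient’s normalized values shift depending on which other samples they are normalized with, so any threshold-based signature can flip its call.

```python
import numpy as np

def quantile_normalize(X):
    """Quantile-normalize the columns (samples) of a genes x samples matrix:
    each sample's values are replaced, rank-for-rank, by the mean sorted
    profile across all samples -- a standard cross-sample normalization."""
    ranks = np.argsort(np.argsort(X, axis=0), axis=0)  # per-sample ranks, 0..n-1
    mean_sorted = np.sort(X, axis=0).mean(axis=1)      # shared reference profile
    return mean_sorted[ranks]

rng = np.random.default_rng(0)
patient = rng.normal(0.0, 1.0, size=200)          # one fixed patient, 200 genes

# Test set A: a cohort resembling the patient.
cohort_a = rng.normal(0.0, 1.0, size=(200, 20))
# Test set B: a cohort with shifted expression (standing in for, say,
# a different ER-status composition).
cohort_b = rng.normal(2.0, 1.0, size=(200, 20))

norm_a = quantile_normalize(np.column_stack([patient, cohort_a]))[:, 0]
norm_b = quantile_normalize(np.column_stack([patient, cohort_b]))[:, 0]

# Same patient, same raw data -- but the processed values differ substantially
# because the cross-sample reference profile changed with the cohort.
print(float(np.abs(norm_a - norm_b).mean()))
```

Nothing about the patient changed between the two runs; only the company they kept did.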
It’s odd that something that should be immediately obvious needs to get published in a top journal in this field. I’ll take the charitable reading that this reflects the naiveté of the field, which necessitated a paper, rather than that Leek’s group believes they’ve made a novel discovery here.
In general, genomics treats statistics as a set of rituals, arguably even more so than often-criticized fields like epidemiology and economics. Data normalization and significance testing are treated as incantations with ‘correct’ answers, not as data transformations whose flow of information one needs to think through. This is what leads to logic along the lines of “we’ve applied the ‘correct’ normalization, so we’re confident moving forward with downstream analyses”.
Ranks can be useful, but I think too often statistically unsophisticated audiences reach for these non-parametric approaches as a panacea, when they can in fact hide uncertainty. What is the uncertainty/stability of a rank estimate? I find it more useful to think of ranking as performing a role similar to quantile normalization against a reference distribution (e.g. a standard normal), in that it smooths out the nonlinearities in the scale of a measurement.
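To make that correspondence concrete, here is a sketch (the function name is mine) of mapping one subject’s values to normal quantiles via their ranks — van der Waerden scores, i.e. quantile normalization of a single sample against a standard-normal reference:

```python
import numpy as np
from statistics import NormalDist

def rank_to_normal(x):
    """Map one subject's values to standard-normal quantiles via their ranks.
    This is ranking viewed as quantile normalization against a normal
    reference: a monotone transform that discards the measurement scale."""
    n = len(x)
    ranks = np.argsort(np.argsort(x)) + 1            # ranks 1..n (distinct values)
    nd = NormalDist()
    return np.array([nd.inv_cdf(r / (n + 1)) for r in ranks])

x = np.array([5.0, 1.0, 3.0])
out = rank_to_normal(x)   # order is preserved; the scale is gone
```

The ordering of `x` is preserved exactly, while any nonlinearity in the original measurement scale is smoothed away — which is both the appeal and the information loss.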
Normalization can be needed at multiple scales, but ultimately, ranking at the subject level may be the most practical one (I don’t see a huge difference between rank normalization and quantile normalization to a reference distribution within a subject). If your technology _needs_ normalization at the batch level (e.g. at different sites, or across an entire study population) _and_ your batches change in unpredictable ways between model generation and application, then that probably means the technology is not reproducible enough to be driving these decisions.
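Why does subject-level ranking sidestep test set bias? A feature computed only from a patient’s own values cannot depend on who else is in the test set. A generic illustration in the spirit of top-scoring-pair classifiers (this is my toy rule, not the specific signatures from the paper):

```python
import numpy as np

def pair_rule(sample, i, j):
    """A rank-based binary feature: is gene i expressed above gene j
    within this one sample? No cross-sample step is involved."""
    return int(sample[i] > sample[j])

rng = np.random.default_rng(1)
patient = rng.normal(size=100)

# The feature uses only the patient's own values, so the prediction is
# identical no matter which other patients share the test set, and it
# survives any monotone platform effect (here, an affine rescaling).
rescaled = 2.0 * patient + 5.0
assert pair_rule(patient, 3, 7) == pair_rule(rescaled, 3, 7)
```

The trade-off from the previous paragraph still applies: the rule is invariant and portable precisely because it throws away magnitude information, and its stability near `sample[i] ≈ sample[j]` deserves the same scrutiny as any other estimate.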