personalized medicine

S. Senn: Evidence Based or Person-centred? A Statistical debate (Guest Post)


Stephen Senn
Head of Competence Center
for Methodology and Statistics (CCMS)
Luxembourg Institute of Health
Twitter @stephensenn

Evidence Based or Person-centred? A statistical debate

It was hearing Stephen Mumford and Rani Lill Anjum (RLA) speak in January 2017 at the Epistemology of Causal Inference in Pharmacology conference in Munich, organised by Jürgen Landes, Barbara Osimani and Roland Poellinger, that inspired me to buy their book, Causation: A Very Short Introduction[1]. Although I do not agree with everything in it, and could not pretend to understand all of it, I can recommend it highly as an interesting introduction to issues in causality, some of which will be familiar to statisticians and some not at all.

Since I have a long-standing interest in researching into ways of delivering personalised medicine, I was interested to see a reference on Twitter to a piece by RLA, Evidence based or person centered? An ontological debate, in which she claims that the choice between evidence based or person-centred medicine is ultimately ontological[2]. I don’t dispute that thinking about health care delivery in ontological terms might be interesting. However, I do dispute that there is any meaningful choice between evidence based medicine (EBM) and person centred healthcare (PCH). To suggest so is to commit a category mistake by suggesting that means are alternatives to ends.

In fact, EBM will be essential to delivering effective PCH, as I shall now explain.

Blood will have blood

I shall take a rather unglamorous problem: deciding whether a generic form of phenytoin is equivalent in effect to a brand-name version. It may seem that this has little to do with PCH but, unpromising as it may seem, it illuminates many aspects of the often-made but misleading claim that EBM is essentially about averages and is irrelevant to PCH. That view, in my opinion, lies behind the sentiment expressed in RLA’s final sentence: ‘Causal singularism teaches us what PCH already knows: that each person is unique, and that one size does not fit all.’

If you want to prove that a generic formulation is equivalent to a brand-name drug, a common design for getting evidence to that effect is a cross-over, in which a number of subjects are given both formulations on separate occasions and the concentrations in the blood of the two formulations are compared to see if they are similar. Such experiments, referred to as bioequivalence studies[3], may seem simple and trivial, but they exhibit in extreme form several common characteristics of RCTs that contradict standard misconceptions regarding them:

  1. General point. The subjects studied are not like the target population. Local instance. Very often young male healthy volunteers are studied, even though the treatments will be used in old as well as young ill people of either sex.
  2. General point. The outcome variable studied is not directly clinically relevant. Local instance. Analysis will proceed in terms of log area under the concentration-time curve, despite the fact that what is important to the patient is clinical efficacy and tolerability, neither of which will form the object of this study.
  3. General point. In terms of any measure that was clinically relevant, there would be important differences between the sample and the target population. Local instance. Serum concentration in a male healthy volunteer will possibly be much lower than for an elderly female patient. He will weigh more and probably eliminate the drug faster than she would. What would be a safe dose for him might not be for her.
  4. General point. Making use of the results requires judgement and background knowledge and is not automatic. Local instance. Theories of drug elimination and distribution imply that if the concentration over time in the blood of two formulations is similar, their effect-site concentrations, and hence their effects, should be similar. “The blood is a gate through which the drug must pass.”

In fact, the whole use of such experiments is highly theory-laden and employs an implied model partly based on assumptions and partly on experience. The idea is that, first, equality of concentration in the blood implies equality of clinical effects and, second, although the concentration in the blood could be quite different between volunteers and patients, the relative bioavailability of two formulations should be similar from the one group to the other. Hence, analysis takes place on this scale, which is judged portable from volunteers to patients. In other words, one size does not fit all but evidence from a sample is used to make judgements about what would be seen in a population.

Concrete consideration

Consider a concrete example of a trial comparing two formulations of phenytoin reported by Shumaker and Metzler[4]. This was a double cross-over in 26 healthy volunteers. In the first pair of periods each subject was given one of the two formulations, the order being randomised. This was then repeated in a second pair of periods. Figure 1 shows the relative bioavailability, that is to say the ratio of the area under the concentration-time curve for the generic (test) formulation compared to the brand-name (reference), using data from the first pair of periods only. For a philosopher’s recognition of what is necessary to translate results from trials to practice, see Nancy Cartwright’s aptly named article, A philosopher’s view of the long road from RCTs to effectiveness[5], and for a statistician’s, see Added Values[6].
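To make the scale of the analysis concrete, here is a minimal sketch of how a relative bioavailability could be computed for one subject. The sampling times and concentrations are invented purely for illustration and are not taken from the phenytoin trial; the function names and the use of the linear trapezoidal rule are likewise assumptions.

```python
import math


def auc_trapezoid(times, concentrations):
    """Area under the concentration-time curve by the linear trapezoidal rule."""
    if len(times) != len(concentrations) or len(times) < 2:
        raise ValueError("need at least two matching (time, concentration) pairs")
    area = 0.0
    for i in range(1, len(times)):
        width = times[i] - times[i - 1]
        area += width * (concentrations[i] + concentrations[i - 1]) / 2.0
    return area


def relative_bioavailability(auc_test, auc_reference):
    """Ratio of test (generic) AUC to reference (brand-name) AUC."""
    return auc_test / auc_reference


# Hypothetical data for a single volunteer (hours, arbitrary concentration units).
times = [0.0, 1.0, 2.0, 4.0, 8.0]
reference = [0.0, 4.0, 6.0, 4.0, 1.0]
test = [0.0, 3.0, 6.0, 5.0, 1.0]

auc_ref = auc_trapezoid(times, reference)
auc_test = auc_trapezoid(times, test)
ratio = relative_bioavailability(auc_test, auc_ref)
log_ratio = math.log(ratio)  # the analysis itself is carried out on the log scale

print(f"AUC reference = {auc_ref}, AUC test = {auc_test}")
print(f"relative bioavailability = {ratio:.3f}, log ratio = {log_ratio:.3f}")
```

A ratio of 1 corresponds to a log ratio of 0, which is why equality of the formulations shows up as clustering around 1 on the ratio scale.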

Figure 1 Relative bioavailability in 26 volunteers for two formulations of phenytoin.

This plot may seem less than reassuring. It is true that the values seem to cluster around 1 (dashed line), which would imply equality of the formulations, the object of the study, but one value is perilously close to the limit of 0.8 and another is actually above the limit of 1.25, these two boundaries usually being taken to be acceptable limits of similarity.

However, it would be hasty to assume that the differences in relative bioavailability reflect any personal feature of the volunteers. Because the experiment is rather more complex than usual and each volunteer was tested in two cross-overs, we can plot a second determination of the relative bioavailability against the first. This has been done in Figure 2.

Figure 2 Relative bioavailability in the second cross-over plotted against the first.

There are 26 points, one for each volunteer, with the X axis value being relative bioavailability in the first cross-over and the Y axis the corresponding figure for the second. The XY plane can also be divided into two regions: one in which the difference between the second determination and the first is less than the difference between the second and the mean of the first (labelled personal better) and one in which the reverse is the case (labelled mean better). The 8 points that are labelled with blue circles are in the former region and the 18 with black asterisks in the latter. The net result is that for the majority of subjects one would predict the relative bioavailability on the second occasion better using the average value of all subjects rather than using the value for that subject. Note that since much of the prediction error will be due to the inescapable unpredictability of relative bioavailability from occasion to occasion, the superiority of using the average here is plausibly underestimated. In the long run it would do far better.
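The comparison behind the two regions can be sketched in a few lines. The simulation below rests on an idealised assumption that is mine, not a result from the trial: every subject has the same long-run relative bioavailability and the two determinations differ only by independent occasion-to-occasion noise on the log scale. The noise level, seed and function names are hypothetical.

```python
import math
import random
from statistics import mean


def count_better(first, second):
    """Return (personal_better, mean_better) counts.

    A subject counts as 'personal better' when their own first value is
    strictly closer to their second value than the group mean of the first
    values is. Ties are counted for the mean.
    """
    group_mean = mean(first)
    personal_better = 0
    for own_first, own_second in zip(first, second):
        if abs(own_second - own_first) < abs(own_second - group_mean):
            personal_better += 1
    return personal_better, len(first) - personal_better


def simulate_ratios(n_subjects=26, true_ratio=1.0, sd_log=0.1, seed=1):
    """Two determinations per subject sharing one true ratio (assumed)."""
    rng = random.Random(seed)

    def draw():
        return [true_ratio * math.exp(rng.gauss(0.0, sd_log))
                for _ in range(n_subjects)]

    return draw(), draw()


first, second = simulate_ratios()
personal, average = count_better(first, second)
print(f"personal better: {personal}, mean better: {average}")
```

Under this idealisation a subject's first value carries no information about that subject beyond what the group mean already provides, so repeated runs with different seeds should tend to favour the mean for most subjects.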

In fact, judgement of similarity of the two formulations would be based on the average bioavailability, not the individual values, and a suitable analysis of the data from the first pair of periods, fitting subject and period effects in addition to treatment to the log-transformed values, would produce a 90% confidence interval of 0.97 to 1.02.
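As a rough sketch of how an interval of this kind could be obtained for a simple two-period, two-sequence cross-over, the code below uses the within-subject period differences of log AUC, which removes subject effects, and contrasts the two sequence groups, which removes the period effect. This is an illustration under assumptions and is not claimed to be the exact analysis applied to the phenytoin data: the toy numbers are invented, the function names are hypothetical, and the caller must supply the Student t critical value for the chosen confidence level.

```python
import math
from statistics import mean, variance


def crossover_ratio_ci(diff_test_first, diff_ref_first, t_crit):
    """Estimate and CI for the test/reference ratio in a 2x2 cross-over.

    Each input holds, per subject, log(AUC in period 1) - log(AUC in period 2).
    diff_test_first: subjects given test then reference.
    diff_ref_first:  subjects given reference then test.
    t_crit: t quantile with n1 + n2 - 2 degrees of freedom (supplied by caller).
    """
    n1, n2 = len(diff_test_first), len(diff_ref_first)
    # Halving the contrast between sequences cancels the period effect.
    estimate = (mean(diff_test_first) - mean(diff_ref_first)) / 2.0
    pooled_var = ((n1 - 1) * variance(diff_test_first)
                  + (n2 - 1) * variance(diff_ref_first)) / (n1 + n2 - 2)
    std_err = 0.5 * math.sqrt(pooled_var * (1.0 / n1 + 1.0 / n2))
    lower = math.exp(estimate - t_crit * std_err)
    upper = math.exp(estimate + t_crit * std_err)
    return math.exp(estimate), lower, upper


def within_limits(lower, upper, low_limit=0.8, high_limit=1.25):
    """True if the whole interval lies inside the similarity limits."""
    return low_limit <= lower and upper <= high_limit


# Invented log-scale period differences, four subjects per sequence.
test_first = [0.05, -0.02, 0.01, 0.03]
ref_first = [-0.01, 0.02, -0.04, 0.00]
# 1.943 is the 95th percentile of Student's t with 6 degrees of freedom,
# which gives a two-sided 90% interval for these eight subjects.
ratio, lower, upper = crossover_ratio_ci(test_first, ref_first, t_crit=1.943)
print(f"ratio = {ratio:.3f}, 90% CI = ({lower:.3f}, {upper:.3f})")
print("inside 0.8 to 1.25:", within_limits(lower, upper))
```

Applied to the interval quoted above, `within_limits(0.97, 1.02)` returns `True`: the interval sits comfortably inside the limits of 0.8 and 1.25 even though individual values do not.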

Of course one could argue that this is an extreme example. A plausible explanation is that the long-run relative bioavailability is the same for every individual and it could be argued that there are many clinical trials in patients measuring more complex outcomes where effects would not be constant. Nevertheless, doing better than using the average is harder than it seems and trying to do better will require more evidence not less.

The moral

The moral is that if you are not careful you can easily do worse by attempting to go beyond the average. This is well known in quality control circles, where it is understood that if managers attempt to improve the operation of machines, processes and workers without knowing whether or not observed variation has a specific, identifiable and actionable source, they can make quality worse. In choosing ‘one size does not fit all’ RLA has plumped for a misleading analogy. When fitting someone out with clothes, their measurements can be taken with very little error, and it can be assumed that they will not change much in the near future and that what fits now will do so for some time.

Patients and diseases are not like that. The mistake is to assume that the following statement, ‘the effect on you of a given treatment will almost certainly differ from its effect averaged over others,’ justifies the following policy, ‘I am going to ignore the considerable evidence from others and just back my best hunch about you’. The irony is that doing best for the individual may involve making substantial use of the average.

What statisticians know is that where there is much evidence on many patients and a very little on the patient currently presenting, to do best will involve a mixed strategy involving finding some optimal compromise between ignoring all experience and ignoring that of the current patient. To produce such optimal strategies requires careful planning, good analysis and many data[7, 8]. The latter are part of what we call evidence and to claim, therefore, that personalisation involves abandoning evidence based medicine is quite wrong. Less ontology and more understanding of the statistics of prediction is needed.
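One standard form of such a compromise is a shrinkage estimate from a simple random-effects model, sketched below. This is offered as an illustration of the idea, not as the method of the cited papers: it assumes the between-patient and within-patient variances are known, and all names and numbers are hypothetical.

```python
def shrinkage_weight(between_var, within_var, n_obs=1):
    """Weight given to the patient's own mean.

    between_var: variance of true effects from patient to patient.
    within_var:  variance of a single measurement around a patient's true effect.
    n_obs:       number of measurements available on this patient.
    """
    if between_var < 0 or within_var <= 0 or n_obs < 1:
        raise ValueError("variances must be non-negative/positive and n_obs >= 1")
    return between_var / (between_var + within_var / n_obs)


def shrunk_prediction(patient_mean, population_mean,
                      between_var, within_var, n_obs=1):
    """Weighted compromise between the patient's data and the population mean."""
    weight = shrinkage_weight(between_var, within_var, n_obs)
    return weight * patient_mean + (1.0 - weight) * population_mean


# If patients do not truly differ, the population mean is used outright.
print(shrunk_prediction(1.2, 1.0, between_var=0.0, within_var=0.01))
# More data on the patient shifts weight towards the patient's own mean.
print(shrinkage_weight(1.0, 1.0, n_obs=1), shrinkage_weight(1.0, 1.0, n_obs=3))
```

The two extremes named in the text correspond to a weight of 1 (ignore all other experience) and a weight of 0 (ignore the current patient); estimating where between them to sit is what requires the many data.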


[1] Mumford, S. & Anjum, R.L. 2013 Causation: a very short introduction, OUP Oxford.

[2] Anjum, R.L. 2016 Evidence based or person centered? An ontological debate.

[3] Senn, S.J. 2001 Statistical issues in bioequivalence. Statistics in Medicine 20, 2785-2799.

[4] Shumaker, R.C. & Metzler, C.M. 1998 The phenytoin trial is a case study of “individual bioequivalence”. Drug Information Journal 32, 1063-1072.

[5] Cartwright, N. 2011 A philosopher’s view of the long road from RCTs to effectiveness. The Lancet 377, 1400-1401.

[6] Senn, S.J. 2004 Added Values: Controversies concerning randomization and additivity in clinical trials. Statistics in Medicine 23, 3729-3753.

[7] Araujo, A., Julious, S. & Senn, S. 2016 Understanding Variation in Sets of N-of-1 Trials. PLoS ONE 11, e0167167. (doi:10.1371/journal.pone.0167167).

[8] Senn, S. 2017 Sample size considerations for n-of-1 trials. Statistical Methods in Medical Research.


Your (very own) personalized genomic prediction varies depending on who else was around?


personalized medicine roulette

As if I wasn’t skeptical enough about personalized predictions based on genomic signatures, Jeff Leek recently had a surprising post about “A surprisingly tricky issue when using genomic signatures for personalized medicine”. Leek (on his blog Simply Statistics) writes:

My student Prasad Patil has a really nice paper that just came out in Bioinformatics (preprint in case paywalled). The paper is about a surprisingly tricky normalization issue with genomic signatures. Genomic signatures are basically statistical/machine learning functions applied to the measurements for a set of genes to predict how long patients will survive, or how they will respond to therapy. The issue is that usually when building and applying these signatures, people normalize across samples in the training and testing set.

….it turns out that this one simple normalization problem can dramatically change the results of the predictions. In particular, we show that the predictions for the same patient, with the exact same data, can change dramatically if you just change the subpopulations of patients within the testing set.
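A toy version of the problem Leek describes can be sketched as follows. This is not the signature or the normalization used in the paper: the single-gene "signature", the z-score normalization within the test set, the threshold and all values are assumptions made up for illustration.

```python
from statistics import mean, pstdev


def zscore_within_set(values):
    """Normalize values against the mean and spread of their own set."""
    centre = mean(values)
    spread = pstdev(values)
    if spread == 0:
        raise ValueError("cannot normalize a set with no variation")
    return [(v - centre) / spread for v in values]


def classify(patient_value, other_patients, threshold=0.0):
    """Hypothetical one-gene signature applied after test-set normalization."""
    test_set = [patient_value] + list(other_patients)
    patient_z = zscore_within_set(test_set)[0]
    return "high risk" if patient_z > threshold else "low risk"


patient = 5.0                 # same patient, same measurement both times
low_group = [2.0, 3.0, 4.0]   # test set dominated by low expressers
high_group = [6.0, 7.0, 8.0]  # test set dominated by high expressers

print(classify(patient, low_group))
print(classify(patient, high_group))
```

The patient's measurement never changes, yet the normalized value, and hence the predicted class, depends on who else happens to be in the test set.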

Here’s an extract from the paper, “Test set bias affects reproducibility of gene signatures”:

Test set bias is a failure of reproducibility of a genomic signature. In other words, the same patient, with the same data and classification algorithm, may be assigned to different clinical groups. A similar failing resulted in the cancellation of clinical trials that used an irreproducible genomic signature to make chemotherapy decisions (Letter (2011)).

This is a reference to the Anil Potti case:

Letter, T. C. (2011). Duke Accepts Potti Resignation; Retraction Process Initiated with Nature Medicine.

But far from the Potti case being some particularly problematic example (see here and here), at least with respect to test set bias, this article makes it appear that test set bias is a threat to be expected much more generally.

