S. Senn: Evidence Based or Person-centred? A Statistical debate (Guest Post)

Stephen Senn
Head of Competence Center
for Methodology and Statistics (CCMS)
Luxembourg Institute of Health
Twitter @stephensenn

Evidence Based or Person-centred? A statistical debate

It was hearing Stephen Mumford and Rani Lill Anjum (RLA) in January 2017 speaking at the Epistemology of Causal Inference in Pharmacology conference in Munich, organised by Jürgen Landes, Barbara Osimani and Roland Poellinger, that inspired me to buy their book, Causation: A Very Short Introduction[1]. Although I do not agree with all that is said in it and also could not pretend to understand all it says, I can recommend it highly as an interesting introduction to issues in causality, some of which will be familiar to statisticians but some not at all.

Since I have a long-standing interest in researching into ways of delivering personalised medicine, I was interested to see a reference on Twitter to a piece by RLA, Evidence based or person centered? An ontological debate, in which she claims that the choice between evidence based or person-centred medicine is ultimately ontological[2]. I don’t dispute that thinking about health care delivery in ontological terms might be interesting. However, I do dispute that there is any meaningful choice between evidence based medicine (EBM) and person centred healthcare (PCH). To suggest so is to commit a category mistake by suggesting that means are alternatives to ends.

In fact, EBM will be essential to delivering effective PCH, as I shall now explain.

Blood will have blood

I shall take a rather unglamorous problem, that of deciding whether a generic form of phenytoin is equivalent in effect to a brand-name version. It may seem that this has little to do with PCH but in fact, unpromising as it may seem, it illuminates many points of the often-made but misleading claim that EBM is essentially about averages and is irrelevant to PCH, a view, in my opinion, that is behind the sentiment expressed in RLA’s final sentence: “Causal singularism teaches us what PCH already knows: that each person is unique, and that one size does not fit all.”

If you want to prove that a generic formulation is equivalent to a brand-name drug, a common design that is used to get evidence to that effect is a cross-over, in which a number of subjects are given both formulations on separate occasions and the concentrations in the blood of the two formulations are compared to see if they are similar. Such experiments, referred to as bioequivalence studies[3], may seem simple and trivial but they exhibit in extreme form several common characteristics of RCTs that contradict standard misconceptions regarding them:

  1. General point. The subjects studied are not like the target population. Local instance. Very often young male healthy volunteers are studied, even though the treatments will be used in old as well as young ill people of either sex.
  2. General point. The outcome variable studied is not directly clinically relevant. Local instance. Analysis will proceed in terms of log area under the concentration-time curve, despite the fact that what is important to the patient is clinical efficacy and tolerability, neither of which will form the object of this study.
  3. General point. In terms of any measure that was clinically relevant, there would be important differences between the sample and the target population. Local instance. Serum concentration in a male healthy volunteer will possibly be much lower than for an elderly female patient. He will weigh more and probably eliminate the drug faster than she would. What would be a safe dose for him might not be for her.
  4. General point. Making use of the results requires judgement and background knowledge and is not automatic. Local instance. Theories of drug elimination and distribution imply that if the concentration-time profiles in the blood of two formulations are similar, their effect site concentrations and hence effects should be similar. “The blood is a gate through which the drug must pass.”

In fact, the whole use of such experiments is highly theory-laden and employs an implied model partly based on assumptions and partly on experience. The idea is that, first, equality of concentration in the blood implies equality of clinical effects and, second, although the concentration in the blood could be quite different between volunteers and patients, the relative bioavailability of two formulations should be similar from the one group to the other. Hence, analysis takes place on this scale, which is judged portable from volunteers to patients. In other words, one size does not fit all but evidence from a sample is used to make judgements about what would be seen in a population.

Concrete consideration

Consider a concrete example of a trial comparing two formulations of phenytoin reported by Shumaker and Metzler[4]. This was a double cross-over in 26 healthy volunteers. In the first pair of periods each subject was given one of the two formulations, the order being randomised. This was then repeated in a second pair of periods. Figure 1 shows the relative bioavailability, that is to say the ratio of the area under the concentration-time curve for the generic (test) formulation compared to the brand-name (reference), using data from the first pair of periods only. For a philosopher’s recognition of what is necessary to translate results from trials to practice, see Nancy Cartwright’s aptly named article, A philosopher’s view of the long road from RCTs to effectiveness[5], and for a statistician’s see Added Values[6].

Figure 1 Relative bioavailability in 26 volunteers for two formulations of phenytoin.

This plot may seem less than reassuring. It is true that the values seem to cluster around 1 (dashed line), which would imply equality of the formulations, the object of the study, but one value is perilously close to the limit of 0.8 and another is actually above the limit of 1.25, these two boundaries usually being taken to be acceptable limits of similarity.
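The two limits quoted above are reciprocals of one another (1/0.8 = 1.25), so they sit equally far from a ratio of 1 on the log scale, which fits with the analysis being carried out on log-transformed values. The following is a minimal sketch of that check; the function names and the example ratios are hypothetical and are not values read from Figure 1.

```python
import math

# Conventional limits of similarity quoted in the text.
LOWER, UPPER = 0.8, 1.25

def within_limits(ratio, lower=LOWER, upper=UPPER):
    """Return True if a relative bioavailability lies inside the limits."""
    return lower <= ratio <= upper

def log_distance_from_equality(ratio):
    """Absolute distance from a ratio of 1, measured on the log scale."""
    return abs(math.log(ratio))

# The limits are symmetric about equality on the log scale
# because 1 / 0.8 = 1.25.
assert math.isclose(log_distance_from_equality(LOWER),
                    log_distance_from_equality(UPPER))

# Hypothetical ratios for illustration only (not the trial data).
for ratio in (0.81, 1.00, 1.30):
    print(ratio, within_limits(ratio))
```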

However, it would be hasty to assume that the differences in relative bioavailability reflect any personal feature of the volunteers. Because the experiment is rather more complex than usual and each volunteer was tested in two cross-overs, we can plot a second determination of the relative bioavailability against the first. This has been done in Figure 2.

Figure 2 Relative bioavailability in the second cross-over plotted against the first.

There are 26 points, one for each volunteer, with the X axis value being relative bioavailability in the first cross-over and the Y axis the corresponding figure for the second. The XY plane can also be divided into two regions: one in which the difference between the second determination and the first is less than the difference between the second and the mean of the first (labelled personal better) and one in which the reverse is the case (labelled mean better). The 8 points that are labelled with blue circles are in the former region and the 18 with black asterisks in the latter. The net result is that for the majority of subjects one would predict the relative bioavailability on the second occasion better using the average value of all subjects rather than using the value for that subject. Note that since much of the prediction error will be due to the inescapable unpredictability of relative bioavailability from occasion to occasion, the superiority of using the average here is plausibly underestimated. In the long run it would do far better.
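The comparison between the two regions can be mimicked with simulated data. The sketch below is not a re-analysis of the Shumaker and Metzler trial: it assumes, purely for illustration, that every subject has the same true log relative bioavailability (zero) and that each determination adds independent normal noise with a hypothetical standard deviation `sigma`. It then counts the subjects for whom the group mean of the first determination predicts the second determination better than the subject's own first value does.

```python
import random
import statistics

def count_mean_better(n_subjects, sigma=0.1, seed=1):
    """Count simulated subjects for whom the group mean beats the personal value.

    Assumptions (illustrative only): the true log relative bioavailability
    is 0 for every subject, and each determination carries independent
    normal noise with standard deviation sigma.
    """
    rng = random.Random(seed)
    first = [rng.gauss(0.0, sigma) for _ in range(n_subjects)]
    second = [rng.gauss(0.0, sigma) for _ in range(n_subjects)]
    group_mean = statistics.mean(first)

    mean_better = 0
    for y1, y2 in zip(first, second):
        personal_error = abs(y2 - y1)       # predict with the subject's own value
        mean_error = abs(y2 - group_mean)   # predict with the average of all subjects
        if mean_error < personal_error:
            mean_better += 1
    return mean_better

# A trial-sized simulation of 26 subjects; the count varies with the seed.
print(count_mean_better(26))
```

Under these assumptions the group mean wins for a clear majority of subjects once the number of simulated subjects is large, even though nothing distinguishes one subject from another except noise.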

In fact, judgement of similarity of the two formulations would be based on the average bioavailability, not the individual values, and a suitable analysis of the data from the first pair of periods, fitting subject and period effects in addition to treatment to the log-transformed values, would produce a 90% confidence interval of 0.97 to 1.02.

Of course one could argue that this is an extreme example. A plausible explanation is that the long-run relative bioavailability is the same for every individual and it could be argued that there are many clinical trials in patients measuring more complex outcomes where effects would not be constant. Nevertheless, doing better than using the average is harder than it seems and trying to do better will require more evidence not less.

The moral

The moral is that if you are not careful you can easily do worse by attempting to go beyond the average. This is well known in quality control circles, where it is understood that if managers attempt to improve the operation of machines, processes and workers without knowing whether or not observed variation has a specific, identifiable and actionable source, they can make quality worse. In choosing ‘one size does not fit all’ RLA has plumped for a misleading analogy. When fitting someone out with clothes, their measurements can be taken with very little error, it can be assumed that they will not change much in the near future and that what fits now will do so for some time.

Patients and diseases are not like that. The mistake is to assume that the following statement, ‘the effect on you of a given treatment will almost certainly differ from its effect averaged over others,’ justifies the following policy, ‘I am going to ignore the considerable evidence from others and just back my best hunch about you’. The irony is that doing best for the individual may involve making substantial use of the average.

What statisticians know is that where there is much evidence on many patients and a very little on the patient currently presenting, to do best will involve a mixed strategy involving finding some optimal compromise between ignoring all experience and ignoring that of the current patient. To produce such optimal strategies requires careful planning, good analysis and many data[7, 8]. The latter are part of what we call evidence and to claim, therefore, that personalisation involves abandoning evidence based medicine is quite wrong. Less ontology and more understanding of the statistics of prediction is needed.
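One standard way of expressing such a compromise is a shrinkage estimate, in which the patient's own value is pulled towards the average by an amount that depends on how much genuine between-patient variation there is relative to within-patient noise. The sketch below illustrates that general idea only; it is not taken from references [7, 8], and the function name, arguments and numbers are hypothetical.

```python
def shrinkage_estimate(personal_value, group_mean,
                       between_var, within_var, n_personal=1):
    """Weighted compromise between a patient's own value and the group mean.

    between_var: variance of true effects from patient to patient (assumed known).
    within_var:  variance of a single determination within a patient (assumed known).
    n_personal:  number of determinations averaged into personal_value.
    """
    if between_var < 0 or within_var <= 0 or n_personal < 1:
        raise ValueError("variances must be valid and n_personal at least 1")
    # Weight on the patient's own evidence: large when patients truly differ,
    # small when most of the observed variation is noise.
    weight = between_var / (between_var + within_var / n_personal)
    return weight * personal_value + (1 - weight) * group_mean

# Hypothetical numbers on the log scale, for illustration only.
print(shrinkage_estimate(personal_value=0.2, group_mean=0.0,
                         between_var=0.0, within_var=0.01))   # ignores the patient
print(shrinkage_estimate(personal_value=0.2, group_mean=0.0,
                         between_var=0.01, within_var=0.01))  # splits the difference
```

When the between-patient variance is zero the estimate is the group mean, and as the patient's own evidence accumulates (larger `n_personal`) the weight shifts towards the personal value. Estimating the two variances well is itself a task that needs many data.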

References

[1] Mumford, S. & Anjum, R.L. 2013 Causation: a very short introduction, OUP Oxford.

[2] Anjum, R.L. 2016 Evidence based or person centered? An ontological debate.

[3] Senn, S.J. 2001 Statistical issues in bioequivalence. Statistics in Medicine 20, 2785-2799.

[4] Shumaker, R.C. & Metzler, C.M. 1998 The phenytoin trial is a case study of “individual bioequivalence”. Drug Information Journal 32, 1063–1072.

[5] Cartwright, N. 2011 A philosopher’s view of the long road from RCTs to effectiveness. The Lancet 377, 1400-1401.

[6] Senn, S.J. 2004 Added Values: Controversies concerning randomization and additivity in clinical trials. Statistics in Medicine 23, 3729-3753.

[7] Araujo, A., Julious, S. & Senn, S. 2016 Understanding Variation in Sets of N-of-1 Trials. PloS one 11, e0167167. (doi:10.1371/journal.pone.0167167).

[8] Senn, S. 2017 Sample size considerations for n-of-1 trials. Statistical methods in medical research.

7 thoughts on “S. Senn: Evidence Based or Person-centred? A Statistical debate (Guest Post)”

  1. Stephen: Thanks so much for the guest post! I haven’t read her paper, but will, and I’m glad you called my attention to it. I’m a little surprised the debate is regarded as “ontological” rather than methodological (or epistemological), but that term has taken on a new meaning in some circles. Still, my first impression (just from her abstract and quick scan) is that her view of person-based is very different, and nearly opposite, from the way it is regularly used today as in opposition to EBM. Typically, it’s a call for intensive use of statistical associations and correlation data without requiring experimental control and randomization. The “21st century Cures” act, for example, says we needn’t randomize because massive correlational data will do. Use of models in epidemiology takes precedence. This is positivistic and Humean–not the other way around. The controlled experimenters, by contrast, seek manipulational knowledge, and causal hypotheses and theories.

    Granted, an older view, prior to EBM, was to get medical expertise to decide how to treat you. I wouldn’t say that is theory-oriented though. It is “personal” in the sense that it lets your doctor decide, given all she knows about you, rather than looking at a bunch of statistics that tries to pigeon-hole you. So I have a feeling that the argument about the competing views is getting turned on its head here. I would very often prefer to let my doctor of 30 years decide on my treatment, knowing my idiosyncrasies or whatever, rather than be compelled to follow Big Data “personalized” medicine. But this notion of personalized is nearly the opposite of the way “patient–based” medicine is being used here. With respect to the opposition between EBM & letting me & my doctor decide, it’s a matter of very different goals: my decision vs general knowledge of diseases and making policy decisions (for general recommendation and insurance companies).
    I’ll have to read the article more carefully to weigh in.

  2. Another opposition is found in what she says about probability:
    “If a treatment cures 70 percent of a patient group, then one might say that a patient who gets this treatment has a 0.7 probability of being cured.
    This fits with a Humean and empiricist view on probability.”
    Ok, but this is the view Bayesian epistemologists hold in opposition to frequentists. For the frequentist, it’s a fallacious instantiation, but for the Bayesian epistemologist it is a fundamental principle, provided at least nothing else is known.
    So the positions seem topsy turvy.

  3. I am not sure that I agree that this is fallacious instantiation for a frequentist. The frequentist might argue that the true probability is unknown but accept that a prediction might reasonably be 0.7. However, much clinical trial reasoning, whether in a Bayesian or frequentist framework, involves differences between the treated and control groups. The 70% are this or that is more appropriate to sampling rather than experimentation.

    • One must distinguish between giving a prob to an event: the next hyp I randomly select will have the property “true”, and claiming that a particular hyp, so selected, gets a probability by having been randomly selected from an urn k% of which are true. Error statisticians won’t even say the prob that this application of a 95% interval is correct = .95. My point, of course, is that here’s yet another instance in which her caricature of the EBM person and/or the frequentist seems to have things in reverse. Doubtless there are more than 2 positions, but her depiction doesn’t hit home for the EBMer–they are not positivist/Humeans.
      I admit to having only scanned the paper; I look forward to reading it carefully.
      In airport, so I won’t reply until tonight.

  4. Having read Anjum’s paper, it is clear that my suspicions about several things being run together are well founded. I think the author has a view of evidence-based medicine that pits it in contrast to positions it is much closer to than today’s non-EBM perspectives–outside of philosophy. EBM is vague. It can be seen as
    (1) contrasting randomized-controlled studies to reliance on “expertise/anecdote/holistic info”, or
    (2) contrasting randomized controlled studies to observational/epidemiological/ Big Data association studies

    Her interest comes closest to (1), but the language and assumptions she is using mix it up with (2). The Big-Data association studies are closest to the Humeans, not the randomized-controlled studies of EBM. The latter seek to uncover causes. Whether there are causal “laws”, claims, or underlying “causal powers” residing in the person, gene or something else is rather less important than whether your methodology seeks causes at all. I think her criticisms would be better aimed at the focus on prediction and association. Ironically, that is the methodology most associated with “personalized medicine”. So there’s a danger of really confusing things if terms aren’t sorted out.
    Philosophers have as their central job sorting out concepts and terms, and Anjum should try to do so.

    She holds to “causal powers” and “dispositions”, which are always murky, but I don’t see an opposition to learning about causes. She herself claims, “From the dispositionalist perspective, the reason for performing RCTs is to establish that an intervention actually has the causal power to produce the anticipated outcome, and this is what it means to say that an intervention “works”.” (p. 10)

    What I’d said earlier about a wrong-headed view of frequentists is shown clearly: “Such inferences from frequencies to propensities is known as the ecological fallacy, which is a logically invalid inference from group average to the individual. If half of all smokers die from it, it does not automatically follow that the probability of this outcome is exactly 0.5 for each individual smoker.”

    A frequentist statistician agrees this is a fallacy; this is what some diagnostic screeners do in medicine and elsewhere. (Bayesian epistemologists hold to this, provided that nothing else is known.) We’d also agree it cannot be assumed that whatever is assigned to an event is a “propensity”–whatever that is. On the other hand, genuine frequentist probabilities for individual cases, it may be argued, are closest to propensities. You will find it in frequentist Popper (sometimes in frequentist Neyman).

    The danger is that voicing objections to RCTs may well be comprehended as support for precisely the type of large-data correlational studies she doesn’t like.

    Senn already answers the charge of external validity–while RCTs don’t claim to randomly select from an ultimate target population, it doesn’t follow it’s irrelevant to finding out how the treatment or drug behaves (and any differences in subgroups). We use artificial animal models effectively because we also learn how to extrapolate, and there’s plenty of evidence of the validity of doing so. The point of experimentation is to deliberately set up a situation that differs from what one naturally comes across in order to learn by deliberately manipulating and controlling. Is she saying we’d learn more about theories and causes keeping to natural experiments (or field studies)? There’s no reason not to have both, but again, the language in which she frames her opposition seems to overlook the most serious opposition of methodology faced today, which has more to do with (2).

  5. Thanatos Savehn

    Dr. Senn,

    I’ve followed your argument here and on Twitter and have a couple of questions.

    First, would the gap between your position and that of the personalized medicine proponents be narrowed if everyone assumes that disease diagnosed was identical amongst the patients diagnosed with it? Years ago plaintiffs’ experts in benzene cases argued that myelogenous leukemia was a single disease; sometimes it burned slow and in other cases burned fast and often the local oncologist treated them that way. If you had a time machine and could send back the recipe for Gleevec and the technology for BMT, and the doctor gave a Ph chromosome patient the former and another without Ph the latter, wouldn’t that be personalized medicine then but evidence based medicine now? I know this is a stupid hypothetical but I couldn’t think of a better one to make the point: doesn’t the distinction depend on the epistemic (if that’s the right word) status of the diagnosis?

    Second, have you begun to suspect that some diseases have been diced too finely? It might follow from your argument.

    My father, aged 83, was diagnosed with prostate cancer and put on chemical castration plus (thanks to his clever doctor who’d seen it work wonders in some guys my Dad’s age who didn’t like the effects that followed suppression of testosterone production) high doses of dexamethasone. He developed rhabdomyolysis and died of it shortly thereafter. Thus I favor your argument that overly clever doctors and their treatment flourishes are at least as large a problem as those who fret about the implications of the ecological fallacy.

  6. Thanatos Savehn

    Further to my neglected point: http://www.bbc.com/news/health-43246261
