(guest post) When Relevance is Irrelevant, by Stephen Senn
Head of Competence Center for Methodology and Statistics (CCMS)
Applied statisticians tend to perform analyses on additive scales and additivity is an important aspect of an analysis to try to check. Consider survival analysis. The most important model used, the default in many cases, is the proportional hazards model introduced by David Cox in 1972 and sometimes referred to as Cox regression. In fact, from one point of view, analysis takes place on the log-hazard scale and so the model could equally be referred to by the rather clumsier title additive log-hazards model and there is quite a literature on how the proportionality (or equivalently, additivity) assumption can be checked.
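The equivalence between proportionality and additivity can be written out in one line. The notation is an assumption made for illustration and does not come from the post: h_0(t) is the baseline hazard, x a vector of covariates and β the corresponding coefficients.

```latex
h(t \mid x) = h_0(t)\,\exp(\beta^\top x)
\quad\Longleftrightarrow\quad
\log h(t \mid x) = \log h_0(t) + \beta^\top x
```

For two patients with covariates x_1 and x_2 the hazard ratio is exp(β'(x_1 - x_2)), which does not depend on t. That is the proportionality assumption, and on the log scale it is the statement that covariate effects add to the baseline log-hazard.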
Words have a definite power on the mind and you sometimes encounter the nonsensical claim that if the proportionality assumption does not apply you should consider a log-rank test instead. In fact, when testing the null hypothesis that two treatments are identical, neither the log-rank test nor the score test using the proportional hazards model requires the assumption of proportionality: the assumption is trivially satisfied by the fact of two treatments being identical. Furthermore, the log-rank test is just a special case of proportional hazards: the score test for a proportional hazards model without any covariates is the log-rank test. Finally, it is easy to produce examples where proportional hazards would apply in a model with covariates but not in the model without covariates, but very difficult to produce the converse.
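The identity between the two tests can be checked numerically. What follows is a minimal sketch, not anyone's reference implementation: the function name and the data are hypothetical, all times are assumed distinct (no ties), and the only term in the model is the treatment indicator. The log-rank statistic is built from observed minus expected events with the hypergeometric variance, while the Cox score statistic is built from the first and second derivatives of the partial log-likelihood at β = 0.

```python
def logrank_and_cox_score(times, events, groups):
    """Return (log-rank chi-square, Cox score chi-square) for two groups.

    times  : distinct follow-up times (no ties assumed)
    events : 1 if the event was observed, 0 if censored
    groups : 1 for treatment, 0 for control
    """
    data = sorted(zip(times, events, groups))
    o_minus_e = var_logrank = 0.0   # log-rank ingredients
    score = information = 0.0       # Cox partial-likelihood ingredients at beta = 0

    for i, (_, event, group) in enumerate(data):
        if not event:
            continue
        # With distinct times, everyone from position i onwards is still at risk.
        risk_set = [g for _, _, g in data[i:]]
        n = len(risk_set)
        n1 = sum(risk_set)
        n0 = n - n1

        # Log-rank: observed minus expected events in the treatment group,
        # with the hypergeometric variance for a single event (d = 1).
        o_minus_e += group - n1 / n
        if n > 1:
            var_logrank += n1 * n0 * 1 * (n - 1) / (n ** 2 * (n - 1))

        # Cox score at beta = 0: covariate of the failing subject minus the
        # risk-set mean; information is the risk-set variance of the covariate.
        mean_x = n1 / n
        score += group - mean_x
        information += sum((g - mean_x) ** 2 for g in risk_set) / n

    return o_minus_e ** 2 / var_logrank, score ** 2 / information


# Hypothetical data, invented purely for illustration.
times = [6, 7, 10, 15, 19, 25, 3, 4, 8, 12, 14, 20]
events = [1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1]
groups = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

logrank_chi2, cox_chi2 = logrank_and_cox_score(times, events, groups)
print(logrank_chi2, cox_chi2)
```

The two numbers agree up to rounding error. With tied event times the agreement depends on how ties are handled in the partial likelihood, which this sketch deliberately avoids.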
An objection often made regarding such models is that they are very difficult for physicians to understand. My reply is to ask what is preferable: a difficult truth or an easy lie? Ah yes, it is sometimes countered, but surely I agree on the importance of clinical relevance. It is surely far more useful to express the results of a proportional hazards analysis in clinically relevant terms that can be understood, such as difference in median length of survival or the difference in the event rate up to a particular census point (say one year after treatment).
A disturbing paper by Snapinn and Jiang points to a problem, however, and to explain it I can do no better than cite the abstract:
The standard analysis of a time-to-event variable often involves the calculation of a hazard ratio based on a survival model such as Cox regression; however, many people consider such relative measures of effect to be poor expressions of clinical meaningfulness. Two absolute measures of effect are often used to assess clinical meaningfulness: (1) many disease areas frequently use the absolute difference in event rates (or its inverse, the number-needed-to-treat) and (2) oncology frequently uses the difference between the median survival times in the two groups. While both of these measures appear reasonable, they directly contradict each other. This paper describes the basic mathematics leading to the two measures and shows examples. The contradiction described here raises questions about the concept of clinical meaningfulness. (p2341)
To see the problem, consider the following. The more serious the disease, the less a given difference in the rate at which people die will impact on the time survived and hence on differences in median survival. However, generally, the higher the baseline mortality rate the greater the difference in survival at a given time point that will be conferred by a given treatment benefit.
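A small numerical sketch may help. It rests on assumptions that are not in the post: survival times are taken to be exponential (constant hazard), the treatment benefit is a hazard ratio of 0.7, time is measured in years, and the three baseline hazards are arbitrary. The function names are likewise hypothetical.

```python
import math


def median_survival(hazard):
    """Median survival time under a constant hazard (exponential model)."""
    return math.log(2) / hazard


def survival_at(hazard, t):
    """Probability of surviving beyond time t under a constant hazard."""
    return math.exp(-hazard * t)


def compare(baseline_hazard, hazard_ratio, t=1.0):
    """Return (gain in median survival, gain in survival probability at t)."""
    treated_hazard = baseline_hazard * hazard_ratio
    median_gain = median_survival(treated_hazard) - median_survival(baseline_hazard)
    survival_gain = survival_at(treated_hazard, t) - survival_at(baseline_hazard, t)
    return median_gain, survival_gain


# Same hazard ratio, increasingly serious disease (assumed values).
for baseline_hazard in (0.05, 0.2, 0.5):
    median_gain, survival_gain = compare(baseline_hazard, hazard_ratio=0.7)
    print(f"baseline hazard {baseline_hazard:.2f}/yr: "
          f"median gain {median_gain:.2f} yr, "
          f"one-year survival gain {survival_gain:.3f}")
```

With the same hazard ratio throughout, the gain in median survival shrinks as the baseline hazard rises while the gain in one-year survival grows, so the two "clinically relevant" measures rank the same benefit in opposite orders. The word "generally" matters: in this model the survival gain at a fixed time point eventually falls again once the baseline hazard is high enough that few patients in either group survive to that point.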
If you find this less than clear, you have my sympathy. The only solution I can offer is to suggest that you read the paper by Snapinn and Jiang. However, in that case also consider the following point. If the point is so subtle, how many physicians who cannot understand proportional hazards can understand numbers needed to treat or differences in median survival? My opinion is that they can be counted on the fingers of one foot.
Let me explain the point at issue by analogy. If one were to study road traffic accidents one would find that among the very many factors affecting seriousness of the consequences of an accident would be the relative velocity at impact. However, if one looks at Newton’s laws of motion one finds that the second law speaks of the relationship between force, mass, and acceleration but not velocity. Now it is clear that (a) acceleration, being a concept that is defined (or at least understood) in terms of velocity (it is, indeed, the derivative of velocity), is a more complicated concept than velocity; (b) all cars have speedometers that show velocity but not acceleration; (c) the traffic laws are couched in terms of velocity rather than acceleration; and (d) it seems that from the point of view of human health it is velocity that is important.
None of this remotely constitutes an argument for replacing Newton’s second law. On the contrary what it implies is that you might need to work a little to use Newton’s laws to translate the effect of relative velocity into accident survivability. However, any attempt to simplify will run the danger of being an oversimplification.
This point was very well understood by a somewhat neglected scientist, the centenary of whose death falls this year: James Berry (1852-1913), an English executioner or hangman, but one who recognised the value of physics. Ronald Meek, the Marxist economist whose work I was expected to study when a student of Economics and Statistics in the 1970s, devotes a chapter of his entertaining book, Figuring out Society, to Mr Berry. Berry decided that the length of the drop in an execution required scientific study: too short and death was not instantaneous, too long and decapitation ensued. He soon realised that a simple law of hanging, linear in the height of the drop, was wrong and instead came upon the idea of a ‘striking force’. This enabled him to hang criminals along a curve. (I understand that in American universities professors also sometimes execute judgement along a curve.) The striking force required was adjusted according to the weight and neck musculature of the condemned and the height was then determined from the curve.
Many of the current proponents of evidence based medicine could learn from Berry’s example. NNTs derived from clinical trials are misleading indicators as to what will happen in clinical practice. For them to be reliable indicators would require that the patients in the clinical trials we run be a representative sample of the population of patients. They are not, and if they were, the fact that we set such store by concurrent control would be inexplicable. To translate the results of clinical trials into practice may require a lot of work involving modelling and further background information. ‘Additive at the point of analysis but relevant at the point of application’ should be the motto. Sometimes short cuts lead to long delays.
1. Cox DR. Regression models and life-tables (with discussion). Journal of the Royal Statistical Society Series B 1972; 34: 187-220.
2. Snapinn S, Jiang Q. On the clinical meaningfulness of a treatment’s effect on a time-to-event variable. Statistics in Medicine 2011; 30: 2341-2348.
3. Meek RL. Figuring out society. Fontana, 1971.
When I asked Stephen about the intended analogy of the hanging example, he sent me an interesting response:
“The point I am making is that it may take several complex modelling steps to turn a scientific law (or a clinical trial result) into a useful prediction. Many medics seem to find this unreasonable. However, if a humble public executioner can manage it, there is no excuse.
I have a joke that goes like this: ‘What is the difference between a medic and a lumberjack? The latter has no difficulty with logs.’”
I’m put in mind of a curious finding in the heuristics and biases literature. It’s fairly well known that non-experts aren’t good at applying Bayes’ theorem in word problems. (I’m not talking Bayesian statistics now, just standard unobjectionable probability theory.) The usual scenario is a medical test for a rare disease: suppose the base rate in the population is 0.4% and a test for the disease has a 97% chance of giving a correct result. The uninitiated, even doctors, do quite poorly at answering the question of what the probability is that a person has the disease given that the test indicated that the disease is present.
But! When the word problem is given in terms of relative frequencies instead of percentages (4 people in 1,000 have the disease and the test gives a correct result 97 times in 100), people do much better, because the phrasing of the question promotes reasoning about some concrete set of items. In rough numbers: for every 4 disease victims there are 970 true negatives and 30 false positives. That’s 4 true positives per 34 positive results, or a bit less than 12 in 100. Easy.
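The same arithmetic can be done exactly with Bayes’ theorem. This is a minimal sketch under one assumption that the word problem leaves implicit: "a correct result 97 times in 100" is read as both sensitivity and specificity being 0.97. The function name is hypothetical.

```python
def positive_predictive_value(prevalence, sensitivity, specificity):
    """P(disease | positive test) by Bayes' theorem."""
    true_positives = prevalence * sensitivity
    false_positives = (1 - prevalence) * (1 - specificity)
    return true_positives / (true_positives + false_positives)


# Base rate 0.4%; test assumed correct 97% of the time for sick and healthy alike.
ppv = positive_predictive_value(prevalence=0.004, sensitivity=0.97, specificity=0.97)
print(f"P(disease | positive) = {ppv:.3f}")

# The frequency version of the same calculation, per 1,000 people tested.
sick, healthy = 4, 996
true_pos = sick * 0.97
false_pos = healthy * 0.03
print(f"{true_pos:.1f} true positives out of {true_pos + false_pos:.1f} positive results")
```

The exact answer is a little under 12 in 100, in line with the rough count of 4 in 34.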
The basic theory is “easy to visualize” implies “makes humans happy”, and the preference for NNT and median survival time is at least consistent with this notion, since they are easy to visualize. (Absolute event rate difference, not so much…) *If* I’m right to suggest that this is what’s really going on, then there’s a useful and testable implication: NNT’s and MST’s “clinical relevance” is beside the point; explaining a Cox regression analysis — or any analysis, really — to non-statisticians will work best when done on a concrete, visualizable basis.