**(guest post) When Relevance is Irrelevant, by Stephen Senn**

*Head of Competence Center for Methodology and Statistics (CCMS)*

Applied statisticians tend to perform analyses on additive scales and *additivity* is an important aspect of an analysis to try to check. Consider survival analysis. The most important model used, the default in many cases, is the *proportional hazards model* introduced by David Cox in 1972[1] and sometimes referred to as *Cox regression*. In fact, from one point of view, analysis takes place on the log-hazard scale and so the model could equally be referred to by the rather clumsier title *additive log-hazards model* and there is quite a literature on how the proportionality (or equivalently, additivity) assumption can be checked.

Words have a definite power on the mind and you sometimes encounter the nonsensical claim that if the proportionality assumption does not apply you should consider a *log-rank test* instead. In fact, when testing the null hypothesis that two treatments are identical, neither the log-rank test nor the score test using the proportional hazards model requires the *assumption* of proportionality: the assumption is trivially satisfied by the fact of two treatments being identical. Furthermore, the log-rank test is just a special case of proportional hazards: the score test for a proportional hazards model without any covariates other than treatment *is* the log-rank test. Finally, it is easy to produce examples where proportional hazards would apply in a model *with* covariates but *not* in the model *without* covariates, but very difficult to produce the converse.
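The identity between the two tests can be checked numerically. The sketch below is a minimal illustration in plain Python, assuming a single 0/1 treatment indicator, no tied event times, and a small made-up data set (the function name and the data are hypothetical and serve only as an illustration). It computes the log-rank chi-square and the score statistic of the Cox partial likelihood evaluated at a log-hazard ratio of zero.

```python
def logrank_and_cox_score(times, events, groups):
    """Return (log-rank chi-square, Cox score chi-square at beta = 0).

    Assumes no tied event times and groups coded 0/1.
    """
    order = sorted(range(len(times)), key=lambda i: times[i])
    observed = expected = variance = 0.0   # log-rank ingredients
    score = information = 0.0              # Cox partial-likelihood ingredients
    for pos, i in enumerate(order):
        if not events[i]:
            continue                       # censored: contributes only to risk sets
        at_risk = [groups[j] for j in order[pos:]]
        n, n1 = len(at_risk), sum(at_risk)
        # Log-rank: observed minus expected events in group 1, with the
        # hypergeometric variance for a single event at this time.
        observed += groups[i]
        expected += n1 / n
        variance += n1 * (n - n1) / n**2
        # Cox score at beta = 0: covariate of the failing subject minus the
        # risk-set mean; information is the risk-set variance of the covariate.
        mean_x = n1 / n
        score += groups[i] - mean_x
        information += mean_x - mean_x**2
    return (observed - expected) ** 2 / variance, score**2 / information


# Hypothetical toy data: time, event indicator (1 = event, 0 = censored), group.
times = [2, 3, 5, 7, 8, 11, 13, 14]
events = [1, 1, 0, 1, 1, 1, 0, 1]
groups = [1, 0, 1, 0, 1, 0, 1, 0]

logrank_chi2, cox_score_chi2 = logrank_and_cox_score(times, events, groups)
print(logrank_chi2, cox_score_chi2)
```

The two numbers agree up to rounding error. With tied event times the usual variance formulas can differ slightly, which is why the sketch assumes there are none.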

An objection often made regarding such models is that they are very difficult for physicians to understand. My reply is to ask what is preferable: a difficult truth or an easy lie? Ah yes, it is sometimes countered, but surely I agree on the importance of *clinical relevance*. It is surely far more useful to express the results of a proportional hazards analysis in clinically relevant terms that can be understood, such as difference in median length of survival or the difference in the event rate up to a particular census point (say one year after treatment).

A disturbing paper by Snapinn and Jiang[2] points to a problem, however, and to explain it I can do no better than cite the abstract:

“The standard analysis of a time-to-event variable often involves the calculation of a hazard ratio based on a survival model such as Cox regression; however, many people consider such relative measures of effect to be poor expressions of clinical meaningfulness. Two absolute measures of effect are often used to assess clinical meaningfulness: (1) many disease areas frequently use the absolute difference in event rates (or its inverse, the number-needed-to-treat) and (2) oncology frequently uses the difference between the median survival times in the two groups. While both of these measures appear reasonable, they directly contradict each other. This paper describes the basic mathematics leading to the two measures and shows examples. The contradiction described here raises questions about the concept of clinical meaningfulness.” (p. 2341)

To see the problem, consider the following. The more serious the disease, the less a given difference in the rate at which people die will impact on the time survived and hence on differences in median survival. However, generally, the higher the baseline mortality rate the greater the difference in survival at a given time point that will be conveyed by a given treatment benefit.
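A small numerical sketch may make the contradiction concrete. It assumes exponential survival (a constant hazard), a hazard ratio of 0.75, and baseline hazards of 0.1 and 0.5 deaths per patient-year; all of these numbers and the function names are illustrative assumptions, not figures from Snapinn and Jiang.

```python
import math


def median_survival(hazard):
    """Median survival time under a constant hazard (exponential model)."""
    return math.log(2) / hazard


def event_rate(hazard, t):
    """Probability of having had the event by time t under a constant hazard."""
    return 1 - math.exp(-hazard * t)


def absolute_effects(baseline_hazard, hazard_ratio=0.75, t=1.0):
    """Return (gain in median survival, reduction in event rate by time t)."""
    treated_hazard = baseline_hazard * hazard_ratio
    median_gain = median_survival(treated_hazard) - median_survival(baseline_hazard)
    risk_reduction = event_rate(baseline_hazard, t) - event_rate(treated_hazard, t)
    return median_gain, risk_reduction


# Same relative effect, two hypothetical diseases of different severity.
mild = absolute_effects(baseline_hazard=0.1)
severe = absolute_effects(baseline_hazard=0.5)

for label, (gain, reduction) in [("mild", mild), ("severe", severe)]:
    print(f"{label}: median gain = {gain:.2f} years, "
          f"one-year risk reduction = {reduction:.3f}")
```

Under these assumptions the same hazard ratio gives the larger gain in median survival for the milder disease but the larger one-year risk reduction for the more severe one, so the two "clinically relevant" measures rank the benefit in opposite orders.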

If you find this less than clear, you have my sympathy. The only solution I can offer is to suggest that you read the paper by Snapinn and Jiang[2]. However, in that case also consider the following point. If the point is so subtle, how many physicians who cannot understand proportional hazards can understand numbers needed to treat or differences in median survival? My opinion is that they can be counted on the fingers of one foot.

Let me explain the point at issue by analogy. If one were to study road traffic accidents one would find that among the very many factors affecting seriousness of the consequences of an accident would be the relative velocity at impact. However, if one looks at Newton’s laws of motion one finds that the second law speaks of the relationship between force, mass, and *acceleration* but not *velocity*. Now it is clear that a) acceleration, being a concept that is defined (or at least understood) in terms of velocity (it is, indeed, the *derivative* of velocity), is a more complicated concept than velocity; b) all cars have speedometers that show velocity but *not* acceleration; c) the traffic laws are couched in terms of velocity rather than acceleration; and d) it seems that from the point of view of human health it is velocity that is important.

None of this remotely constitutes an argument for replacing Newton’s second law. On the contrary what it implies is that you might need to work a little to use Newton’s laws to translate the effect of relative velocity into accident survivability. However, any attempt to simplify will run the danger of being an oversimplification.

This point was very well understood by a somewhat neglected scientist, the centenary of whose death falls this year: James Berry (1852-1913), an English executioner or hangman, but one who recognised the value of physics. Ronald Meek, the Marxist economist whose work I was expected to study when a student of Economics and Statistics in the 1970s, devotes a chapter of his entertaining book, *Figuring out Society*[3], to Mr Berry. Berry decided that the length of the drop in an execution required scientific study: too short and death was not instantaneous, too long and decapitation ensued. He soon realised that a simple law of hanging, linear in the height of the drop, was wrong and instead came upon the idea of a ‘striking force’. This enabled him to hang criminals along a curve. (I understand that in American universities professors also sometimes execute judgement along a curve.) The striking force required was adjusted according to the weight and neck musculature of the condemned and the height was then determined from the curve.

Many of the current proponents of evidence based medicine could learn from Berry’s example. NNTs derived from clinical trials are misleading indicators as to what will happen in clinical practice. For them to be reliable indicators, the patients in the clinical trials we run would have to be a representative sample of the population of patients. They are not, and if they were, the fact that we set such store on concurrent control would be inexplicable. To translate the results of clinical trials into practice may require a lot of work involving modelling and further background information. ‘Additive at the point of analysis but relevant at the point of application’ should be the motto. Sometimes short cuts lead to long delays.

# References

1. Cox DR. Regression models and life-tables (with discussion). *Journal of the Royal Statistical Society Series B* 1972; **34**: 187-220.

2. Snapinn S, Jiang Q. On the clinical meaningfulness of a treatment’s effect on a time-to-event variable. *Statistics in Medicine* 2011; **30**: 2341-2348.

3. Meek RL. *Figuring out society*. Fontana, 1971.

When I asked Stephen about the intended analogy of the hanging example, he sent me an interesting response:

“The point I am making is that it may take several complex modelling steps to turn a scientific law (or a clinical trial result) into a useful prediction. Many medics seem to find this unreasonable. However, if a humble public executioner can manage it, there is no excuse.

I have a joke that goes like this. ‘What is the difference between a medic and a lumberjack? The latter has no difficulty with logs.’”

I’m put in mind of a curious finding in the heuristics and biases literature. It’s fairly well known that non-experts aren’t good at applying Bayes’ theorem in word problems. (I’m not talking Bayesian statistics now, just standard unobjectionable probability theory.) The usual scenario is a medical test for a rare disease: suppose the base rate in the population is 0.4% and a test for the disease has a 97% chance of giving a correct result. The uninitiated, even doctors, do quite poorly at answering the question of what the probability is that a person has the disease given that the test indicated that the disease is present.

But! When the word problem is given in terms of relative frequencies instead of percentages (4 people in 1,000 have the disease and the test gives a correct result 97 times in 100), people do much better, because the phrasing of the question promotes reasoning about some concrete set of items. In rough numbers: for every 4 disease victims, nearly all of whom test positive, there are about 966 true negatives and 30 false positives. That’s 4 true positives per 34 positive results, or a bit less than 12 in 100. Easy.
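For anyone who wants to check the rough numbers, here is a minimal sketch of the same calculation via Bayes’ theorem. It assumes that “a correct result 97 times in 100” applies equally to people with and without the disease (that is, sensitivity and specificity are both 0.97); the function name is just for illustration.

```python
def positive_predictive_value(base_rate, accuracy):
    """P(disease | positive test), assuming sensitivity = specificity = accuracy."""
    true_positives = base_rate * accuracy
    false_positives = (1 - base_rate) * (1 - accuracy)
    return true_positives / (true_positives + false_positives)


ppv = positive_predictive_value(base_rate=0.004, accuracy=0.97)
print(f"P(disease | positive test) = {ppv:.3f}")
```

The exact answer comes out at roughly 0.115, in line with “a bit less than 12 in 100”.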

The basic theory is “easy to visualize” implies “makes humans happy”, and the preference for NNT and median survival time is at least consistent with this notion, since they are easy to visualize. (Absolute event rate difference, not so much…) *If* I’m right to suggest that this is what’s really going on, then there’s a useful and testable implication: NNT’s and MST’s “clinical relevance” is beside the point; explaining a Cox regression analysis — or any analysis, really — to non-statisticians will work best when done on a concrete, visualizable basis.

The problem is whether what is easy to see is really there or an illusion. I like Philip Dawid’s idea that a parameter is a resting place on the road to a prediction. The problem is that the road may be long and rocky. Sometimes I think that statisticians have made a mistake in sharing the details of what they do with their collaborators. Maybe we should just black box it all.

Stephen: “a parameter is a resting place on the road to a prediction”. Or, I’d say, on the road to understanding, explaining, theorizing. Yes, parameters in a statistical model rarely directly match-up with substantive claims/parameters (the statistical vs substantive distinction).

But I find the rest of your comment cryptic. Do you mean that by sharing the intermediate stages, the collaborators don’t understand the circuitous logic? How then can they be collaborators? Just “black box it all”? How would we be able to check anything? I think I must be missing something…

When I worked in drug development it was mainly in asthma. A popular measure was forced expiratory volume in one second, often calculated as the area under the flow curve between 0 and 1 seconds. Never did a physician ask, “How is it calculated? Using the trapezoidal rule, Simpson’s rule, or some other method?” They didn’t care. Similarly many bloggers these days have no idea how a computer works but they are happy to use it. I was taught in my day (but to a very low standard) to program in assembler but that does not mean that I understand anything about computers. Such understanding as I have is completely black box: I have a working idea of what input is needed to produce an output but no idea as to how it’s done.

The point I am making is that if we organise things appropriately it will not be necessary for physicians to understand how a statistical calculation does what it does. It should yield a usable prediction (but a parameter estimate is rarely this). In a comment on a paper by Lee and Nelder I put it thus:

“For example, in my opinion, and as already stated, estimation and prediction are not the same except by accident. It is misleading that a standard statistical paradigm, to which textbooks often return, is that of estimating a population mean using simple random sampling. For this purpose, the parameter estimate of the simple model is, indeed, the same as the prediction. However, as soon as we turn to more complex sampling schemes, this is not so. Stratified random sampling, for example, yields estimates of stratum means from which the population mean can be predicted using the sampling fractions if one wishes, but there is no immediate connection between any of the parameters estimated and the target quantity.”

http://www.jstor.org/discover/10.2307/4144405?uid=3738488&uid=2134&uid=2&uid=70&uid=4&sid=21102221648517
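The stratified sampling point in that quotation can be illustrated with a tiny sketch. The strata, their sizes and the sampled values below are entirely made up, and the function name is hypothetical; the only assumption is that the population share of each stratum is known.

```python
def predict_population_mean(strata):
    """Weight each estimated stratum mean by the stratum's share of the population."""
    total = sum(s["population_size"] for s in strata)
    return sum(
        s["population_size"] / total * (sum(s["sample"]) / len(s["sample"]))
        for s in strata
    )


# Hypothetical strata: equal sample sizes, very unequal population sizes.
strata = [
    {"population_size": 900, "sample": [10.0, 12.0, 11.0]},
    {"population_size": 100, "sample": [30.0, 32.0, 31.0]},
]

pooled = [y for s in strata for y in s["sample"]]
naive_mean = sum(pooled) / len(pooled)          # ignores the sampling fractions
predicted = predict_population_mean(strata)     # uses them

print(naive_mean, predicted)
```

Neither estimated stratum mean is the population mean, and the unweighted mean of the pooled sample is not either; the prediction only appears once the estimates are combined with the population shares.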

When I say ‘black box it’ I am thinking of most users, not all. Of course I expect statisticians to understand the innards and perhaps philosophers of science too.

Stephen: You say that the very reason “we set such store on concurrent control” in clinical trials is to discern an effect in an artificially created context, not expected to be a representative sample of patients. But the discernment of the effect is not irrelevant to understanding the treatment of interest. Much of scientific learning goes on by circuitous and indirect means, but that doesn’t make it irrelevant to whatever understanding is being sought. I spoze your point is that a lot of work and background have to enter to connect up to intended applications.

Yes. That’s my point. Clinical trials are a scientific device. It is tempting to shortcut the process of turning science into technology but it leads to a mess. The relevance to statistics was understood many years ago by Yates and Cochran. They wrote

“Agronomic experiments are undertaken with two different aims in view, which may roughly be termed the technical and the scientific. Their aim may be regarded as scientific in so far as the elucidation of the underlying laws is attempted, and as technical in so far as empirical rules for the conduct of practical agriculture are sought. The two aims are, of course, not in any sense mutually exclusive, and the results of most well conducted experiments on technique serve to add to the structure of general scientific law, or at least to indicate places where the existing structure is inadequate, while experiments on questions of a more fundamental type will themselves provide the foundation of further technical advances….In this respect scientific research is easier than technical research.” (reference 1, pp. 556-557 & 558)

See also discussion in reference 2.

REFERENCES

1. Yates, F. and W. G. Cochran (1938). “The analysis of groups of experiments.” Journal of Agricultural Science 28(4): 556-580.

2. Christie, M., A. Cliffe, et al. (2011). Issues for modellers. in Simplicity, Complexity and Modelling. M. Christie, A. Cliffe, A. P. Dawid and S. Senn. Chichester, Wiley: 187-192.

Stephen: Yes, but isn’t that the point of experimentation? To create an artificial situation where you can find something out that would not be discernible without the planned intervention?

There’s an indirectness but not an irrelevance.

The point of this should be that differences in survival time or probability aren’t interpretable without knowing what they have changed from. An extra month from 6 months is a lot different from an extra month after 10 years, at least clinically.

I think the key thing is that clinicians want a result expressed in a way that lets them reason about patients and circumstances.

As people generally seem to work well with additive models, this implies that an additive scale is the one on which we have to talk with them, or the one we have to give them to reason with.

But, of course, there is evidence that in some parts of the world (such as forces acting on hanged necks) things are not linear and simply do not work that way.

What should we do? We cannot simply hide it. In the long run we have to teach people how to understand that part of reality…