Also Smith and Jones
by Stephen Senn
Head of Competence Center for Methodology and Statistics (CCMS)
This story is based on a paradox proposed to me by Don Berry. I have my own opinion on this but I find that opinion boring and predictable. The opinion of others is much more interesting and so I am putting this up for others to interpret.
Two scientists working for a pharmaceutical company collaborate in designing and running a clinical trial known as CONFUSE (Clinical Outcomes in Neuropathic Fibromyalgia in US Elderly). One of them, Smith, is going to start another programme of drug development in a little while. The other, Jones, will just be working on the current project. The planned sample size is 6000 patients.
Smith says that he would like to look at the experiment after 3000 patients in order to make an important decision as regards his other project. As far as he is concerned that’s good enough.
Jones is horrified. She considers that for other reasons CONFUSE should continue to recruit all 6000 and that on no account should the trial be stopped early.
Smith says that he is simply going to look at the data to decide whether to initiate a trial of a similar product being studied in the other project he will be working on. The fact that he looks should not affect Jones’s analysis.
Jones is still very unhappy and points out that the integrity of her trial is being compromised.
Smith suggests that all she needs to do is state quite clearly in the protocol that the trial will proceed whatever the result of the interim administrative look. The fact that she states publicly that on no account will she claim significance based on the first 3000 alone will reassure everybody, including the FDA. (In drug development circles, FDA stands for Finally Decisive Argument.)
However, Jones insists. She wants to know what Smith will do if the result after 3000 patients is not significant.
Smith replies that in that case he will not initiate the trial in the parallel project. It will suggest to him that it is not worth going ahead.
Jones wants to know what Smith will do once the results of all 6000 are in, supposing that the results for the first 3000 are not significant.
Smith replies that, of course, in that case he will have a look. If (though it seems to him an unlikely situation) the results based on all 6000 are significant, even though the results based on the first 3000 were not, he may well decide that the treatment works after all and initiate his alternative programme, regretting, of course, the time that has been lost.
Jones points out that Smith will not be controlling his type I error rate by this procedure.
‘OK,’ says Smith, ‘to satisfy you I will use adjusted type I error rates. You, of course, don’t have to.’
The trial is run. Smith looks after 3000 patients and concludes the difference is not significant. The trial continues on its planned course. Jones looks after 6000 and concludes it is significant, P=0.049. Smith looks after 6000 and concludes it is not significant, P=0.052. (A very similar thing happened in the famous TORCH study (1).)
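Jones’s worry about uncorrected interim looks can be illustrated with a quick simulation. This is my own sketch, not part of the original exchange: the two equal-sized looks mirror the 3000/6000 design, but the mechanics (z-statistics, two-sided 5% tests) are purely illustrative. Under the null hypothesis, declaring success if either the unadjusted interim or the unadjusted final p-value falls below 0.05 inflates the overall type I error rate well above 5%.

```python
import numpy as np

rng = np.random.default_rng(2023)
n_sims = 200_000

# Interim z-statistic on the first half of the data (null hypothesis true).
z_half = rng.standard_normal(n_sims)
# Independent increment contributed by the second half of the data.
z_inc = rng.standard_normal(n_sims)
# Final z-statistic on the full data: equally weighted combination.
z_full = (z_half + z_inc) / np.sqrt(2)

crit = 1.96  # two-sided 5% critical value

# Jones's rule: one pre-planned test on the full data.
reject_final_only = np.abs(z_full) > crit
# Unadjusted peeking: "significant" if EITHER look crosses 1.96.
reject_either = (np.abs(z_half) > crit) | (np.abs(z_full) > crit)

print(reject_final_only.mean())  # ~0.05
print(reject_either.mean())      # ~0.083: the classic inflation for two looks
```

This inflation is exactly what group-sequential adjustments (Pocock, O’Brien–Fleming) are designed to pay back, and paying it back is how a nominal 0.049 can end up on the wrong side of an adjusted boundary.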
Shortly after the conclusion of the trial, Smith and Jones are head-hunted and leave the company. The brief is taken over by new recruit Evans.
What does Evans have on her hands: a significant study or not?
1. Calverley PM, Anderson JA, Celli B, Ferguson GT, Jenkins C, Jones PW, et al. Salmeterol and fluticasone propionate and survival in chronic obstructive pulmonary disease. New England Journal of Medicine. 2007;356(8):775-89.
Not to be confused with either Alias Smith and Jones or even Alas Smith and Jones.
I think that this is more a legal than a data-analytic question. I guess that it can be treated as significant, given that following Jones’s original protocol yielded p=0.049. If this was formally connected in advance to any decision, I think that’s the result the decision should be based on, but I really think this depends on the definitions and procedures used in the field, which I don’t know.
Regarding the statistical evidence, the difference between 0.049 and 0.052 is 0.003, which is negligible. This is a borderline result, and it isn’t very meaningful which side of 0.05 we are on. (Of course, a p of about 0.05 with n=6000 may mean that the effect size is minuscule even if significant.)
Evans has a set of results that constitute evidence. Whether the trial result is ‘significant’ or not is irrelevant to the scientific exercise, although I concede that it may not be irrelevant to the FDA.
Courts will not allow presentation of evidence that has been obtained illegally, not because the evidence is in some way turned into non-evidence as a consequence of the illegality of its gathering, but because of meta-consequences, consequences for future trials. For example, there would be an incentive for participants to act inappropriately, or consequences outside of the courts, such as a lowering of the ethical standards of the police.
The requirement to not peek at data during a trial is not necessary to protect the evidence, but to protect a meta-property: the false positive error rate of future studies. That may be worth preserving in some sort of global sense, but it is not something that does, or can, or should, affect the evidence itself.
If the P-value is to be used as an index of evidence then it should not be adjusted in any way to account for the potentially sequential nature of the trial. In the circumstances, I can’t see how Evans can do anything other than deal with the evidence as evidence. The false positive error-rate aspects of the trial have been screwed up. That may be no bad thing scientifically, as the evidence is more important than preservation of notional false positive error rates which can only be achieved at the cost of increased false negative error rates. However, commercially, the loss of control of false positive errors might be costly because of the influence of factors outside of the science, namely the FDA’s requirements.
Just as a side-note, I feel obliged to point out the fact that there is no mention of the size of the observed effect. Evans would need to consider that while weighing the evidence.
I think they got what they deserved!
I’d say it is significant; the non-significance for Smith is only relevant to his decision not to engage in a trial, in order to protect the sanity of error rates for future studies.
My first reaction was that Jones’ result was the correct one, I bought Smith’s rationale because I’ve admittedly used similar rationales. But, alas, in this case I think that Smith’s result is the one that is relevant… Smith’s argument rests entirely on what the company counter-factually would have done had Smith’s interim look been “significant” (assuming of course that Smith used a reasonable sequential analysis boundary). Given that the interim result was “non-significant”, we’ll of course never know, but I have difficulty believing the drug company would have sat on the result had it been otherwise (after all, there are methods for properly conducting unplanned interim looks).
Also, I disagree that the observed effect size would be relevant to Evans’ decision or that it necessarily would have been minuscule. If this were a trial comparing incidence of a rare event (e.g., HIV prevention), then even with 6000 participants the estimated effect size would not be terribly precise, because precision is a function only of the number of events (a typical 95% CI for the hazard ratio might be 0.35 to 0.95). Further, the point estimate, in and of itself, provides no statistical information, so I can’t see how it would be relevant.
I’ll also add that Smith could have used a refresher on statistics. First he over-interprets an interim “non-significant” result as evidence for the null (it suggests to him that it’s not worth going ahead, even though the interim unadjusted p-value may have been as low as 0.001), then he doesn’t seem to understand how the sequential methods that he’s using really work (it “seems to him an unlikely situation” that the interim might not cross the boundary but the final might). For what it’s worth, he would have been better off basing his interim decision on conditional power (and then it might have been easier to argue that Jones’ result was the more relevant in the end).
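The conditional-power point can be made concrete. Under the common “current trend” assumption, conditional power at information fraction t, given interim z-statistic z1 and one-sided critical value z_alpha, is 1 − Φ((z_alpha − z1/√t)/√(1−t)). A minimal sketch (the function name and the worked numbers are mine, not from the comment):

```python
from math import erf, sqrt

def phi(x: float) -> float:
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def conditional_power(z_interim: float, info_frac: float,
                      z_alpha: float = 1.96) -> float:
    """Conditional power under the 'current trend' assumption.

    Probability that the final one-sided test (critical value z_alpha)
    will be significant, given the interim z-statistic at information
    fraction info_frac, if the observed drift continues unchanged.
    """
    # Estimated drift from the interim data: delta_hat = z_interim / sqrt(t);
    # plugging it into the conditional distribution of the final z gives:
    return 1.0 - phi((z_alpha - z_interim / sqrt(info_frac))
                     / sqrt(1.0 - info_frac))

# Halfway through the trial, an interim z of 1.0 is nowhere near significant,
# yet the trial still has roughly a 1-in-5 chance of ending significant.
print(round(conditional_power(1.0, 0.5), 2))  # ~0.22
```

In other words, a “non-significant” interim look is weak evidence for the null, which is the commenter’s point about Smith.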
I had hoped that Stephen would comment at the end and reveal his own opinion…?
By the way, another valid (I think) interpretation, following Neyman’s “inductive behaviour”, is that Jones and Smith have different decision problems (regarding future behaviour) and Jones’s result should be used for Jones’s problem and Smith’s result should be used for Smith’s problem.