Stephen Senn: Also Smith and Jones[1]

by Stephen Senn

Head of Competence Center for Methodology and Statistics (CCMS)


This story is based on a paradox proposed to me by Don Berry. I have my own opinion on this but I find that opinion boring and predictable. The opinion of others is much more interesting and so I am putting this up for others to interpret.

Two scientists working for a pharmaceutical company collaborate in designing and running a clinical trial known as CONFUSE (Clinical Outcomes in Neuropathic Fibromyalgia in US Elderly). One of them, Smith, is going to start another programme of drug development in a little while. The other, Jones, will just be working on the current project. The planned sample size is 6000 patients.

Smith says that he would like to look at the experiment after 3000 patients in order to make an important decision as regards his other project. As far as he is concerned that’s good enough.

Jones is horrified. She considers that for other reasons CONFUSE should continue to recruit all 6000 and that on no account should the trial be stopped early.

Smith says that he is simply going to look at the data to decide whether to initiate a trial of a similar product being studied in the other project he will be working on. The fact that he looks should not affect Jones’s analysis.

Jones is still very unhappy and points out that the integrity of her trial is being compromised.

Smith suggests that all she needs to do is state quite clearly in the protocol that the trial will proceed whatever the result of the interim administrative look. The fact that she states publicly that on no account will she claim significance based on the first 3000 alone will reassure everybody, including the FDA. (In drug development circles, FDA stands for Finally Decisive Argument.)

However, Jones insists. She wants to know what Smith will do if the result after 3000 patients is not significant.

Smith replies that in that case he will not initiate the trial in the parallel project. It will suggest to him that it is not worth going ahead.

Jones wants to know what Smith will do, supposing the results for the first 3000 are not significant, once the results for all 6000 are in.

Smith replies that, of course, in that case he will have a look. If (though it seems to him an unlikely situation) the results based on all 6000 are significant, even though the results based on the first 3000 were not, he may well decide that the treatment works after all and initiate his alternative programme, regretting, of course, the time that has been lost.

Jones points out that Smith will not be controlling his type I error rate by this procedure.

‘OK’, says Smith, ‘to satisfy you I will use adjusted type I error rates. You, of course, don’t have to.’

The trial is run. Smith looks after 3000 patients and concludes the difference is not significant. The trial continues on its planned course. Jones looks after 6000 and concludes it is significant, P=0.049. Smith looks after 6000 and concludes it is not significant, P=0.052. (A very similar thing happened in the famous TORCH study (1).)
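To see why the two analyses of the same data can disagree at all, here is a minimal simulation, my own sketch rather than anything from the trial, of the repeated-looks problem: taking an unadjusted two-sided 5% look at the halfway point and again at the end rejects a true null hypothesis roughly 8% of the time, which is why Smith's adjusted boundaries must be stricter than 1.96.

```python
import numpy as np

# Two equally spaced looks under the null (normal approximation).  The
# interim and final z-statistics share the first half of the data, so the
# final statistic is the interim one plus an independent increment.
rng = np.random.default_rng(0)
n_sims = 200_000
z_half = rng.standard_normal(n_sims)        # z after the first 3000 patients
z_incr = rng.standard_normal(n_sims)        # independent second-half increment
z_full = (z_half + z_incr) / np.sqrt(2)     # z after all 6000 patients

crit = 1.96  # unadjusted two-sided 5% critical value
reject_either = (np.abs(z_half) > crit) | (np.abs(z_full) > crit)
print(round(reject_either.mean(), 3))  # about 0.083, well above the nominal 0.05
```

Because Smith pays for his interim look with a stricter final boundary, his adjusted P-value and Jones’s unadjusted one can land on opposite sides of 0.05.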

Shortly after the conclusion of the trial, Smith and Jones are head-hunted and leave the company.  The brief is taken over by new recruit Evans.

What does Evans have on her hands: a significant study or not?


1.  Calverley PM, Anderson JA, Celli B, Ferguson GT, Jenkins C, Jones PW, et al. Salmeterol and fluticasone propionate and survival in chronic obstructive pulmonary disease. New England Journal of Medicine. 2007;356(8):775-89.

[1] Not to be confused with Alias Smith and Jones, nor even Alas Smith and Jones.


14 thoughts on “Stephen Senn: Also Smith and Jones”

  1. Christian Hennig

    I think that this is a legal rather than a data-analytic question. I guess that it can be treated as significant, given that following Jones’s original protocol yielded p=0.049. If this was formally connected in advance to any decision, I think that’s the result the decision should be based on, but I really think that this depends on definitions and procedures used in the field, which I don’t know.

    Regarding the statistical evidence, the difference between 0.049 and 0.052 is 0.003, which is negligible. This is a borderline result, and it isn’t too meaningful which side of 0.05 we are on. (Of course, p of about 0.05 with n=6000 may mean that the effect size is minuscule even if significant.)
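Hennig’s point is easy to check on the z scale (a quick sketch, assuming the P-values are two-sided and come from an approximately normal test statistic):

```python
from scipy.stats import norm

# Convert two-sided P-values to z-statistics: the gap is tiny on the z scale.
for p in (0.049, 0.052):
    print(p, round(norm.isf(p / 2), 3))  # about 1.968 and 1.943
```

The two results differ by only about 0.025 standard errors, far less than the sampling variability in either estimate.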

  2. Michael Lew

    Evans has a set of results that constitute evidence. Whether the trial result is ‘significant’ or not is irrelevant to the scientific exercise, although I concede that it may not be irrelevant to the FDA.

    Courts will not allow presentation of evidence that has been obtained illegally, not because the evidence is in some way turned into non-evidence as a consequence of the illegality of its gathering, but because of meta-consequences: consequences for future trials. For example, there would be an incentive for participants to act inappropriately, or consequences outside of the courts, such as a lowering of the ethical standards of the police.

    The requirement to not peek at data during a trial is not necessary to protect the evidence, but to protect a meta-property: the false positive error rate of future studies. That may be worth preserving in some sort of global sense, but it is not something that does, or can, or should, affect the evidence itself.

    If the P-value is to be used as an index of evidence then it should not be adjusted in any way to account for the potentially sequential nature of the trial. In the circumstances, I can’t see how Evans can do anything other than deal with the evidence as evidence. The false positive error-rate aspects of the trial have been screwed up. That may be no bad thing scientifically, as the evidence is more important than preservation of notional false positive error rates which can only be achieved at the cost of increased false negative error rates. However, commercially, the loss of control of false positive errors might be costly because of the influence of factors outside of the science, namely the FDA’s requirements.

    Just as a side-note, I feel obliged to point out the fact that there is no mention of the size of the observed effect. Evans would need to consider that while weighing the evidence.

  3. I think they got what they deserved!

  4. I’d say it is significant; the non-significance for Smith is only relevant to his decision not to engage in a trial, in order to protect the sanity of error rates for future studies.

  5. Mark

    My first reaction was that Jones’ result was the correct one; I bought Smith’s rationale because I’ve admittedly used similar rationales. But, alas, in this case I think that Smith’s result is the one that is relevant… Smith’s argument rests entirely on what the company counterfactually would have done had Smith’s interim look been “significant” (assuming, of course, that Smith used a reasonable sequential analysis boundary). Given that the interim result was “non-significant”, we’ll of course never know, but I have difficulty believing the drug company would have sat on the result had it been otherwise (after all, there are methods for properly conducting unplanned interim looks).

    Also, I disagree that the observed effect size would be relevant to Evans’ decision or that it necessarily would have been minuscule. If this were a trial comparing incidence of a rare event (e.g., HIV prevention), then even with 6000 participants the estimated effect size would not be terribly precise, because precision is a function only of the number of events (a typical 95% CI for the hazard ratio might be 0.35 to 0.95). Further, the point estimate, in and of itself, provides no statistical information, so I can’t see how it would be relevant.
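Mark’s rare-event point can be sketched with the standard approximation that the variance of the log hazard ratio is roughly 1/d1 + 1/d2, where d1 and d2 are the event counts in the two arms. The counts and hazard ratio below are invented purely for illustration:

```python
import math

# Approximate 95% CI for a hazard ratio: precision is driven by the number
# of events, not by the 6000 participants enrolled.  Numbers are hypothetical.
def hr_ci(events_treat, events_ctrl, hr, z=1.96):
    se_log_hr = math.sqrt(1 / events_treat + 1 / events_ctrl)
    return hr * math.exp(-z * se_log_hr), hr * math.exp(z * se_log_hr)

lo, hi = hr_ci(30, 52, 0.58)
print(round(lo, 2), round(hi, 2))  # roughly 0.37 to 0.91: wide, despite n=6000
```

With only 82 events between the arms, the interval spans values from a strong effect to almost none, much as the commenter suggests.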

  6. Mark

    I’ll also add that Smith could have used a refresher on statistics. First he over-interprets an interim “non-significant” result as evidence for the null (it suggests to him that it’s not worth going ahead, even though the interim unadjusted p-value may have been as low as 0.001), then he doesn’t seem to understand how the sequential methods that he’s using really work (it “seems to him an unlikely situation” that the interim might not cross the boundary but the final might). For what it’s worth, he would have been better off basing his interim decision on conditional power (and then it might have been easier to argue that Jones’ result was the more relevant in the end).

  7. Christian Hennig

    I had hoped that Stephen would comment at the end and reveal his own opinion…?
    By the way, another valid (I think) interpretation, following Neyman’s “inductive behaviour”, is that Jones and Smith have different decision problems (regarding future behaviour) and Jones’s result should be used for Jones’s problem and Smith’s result should be used for Smith’s problem.

    • Christian: yes, it could be seen this way. My new comment might be relevant regarding what N-P said.

  8. A few comments and thoughts:

    1) I have neither the benefit nor burden of understanding how clinical trials are run.

    2) Anyone who claims there’s a significant difference between p=0.049 and p=0.052 is, at the p=0.00863 level – no, make that the p=0.008625 level, nuts.

    3) What about the fact that p increased from 3000 to 6000 samples? Never mind the increase, why didn’t it decrease? If I were measuring a property, x, of test and control groups I’d calculate x_t +/- dx_t and x_c +/- dx_c, then use (x_t-x_c)/sqrt(dx_t^2 + dx_c^2) to decide if the difference between the test and control groups was significant. If dx_i didn’t scale with 1/sqrt(n) I’d investigate to see why not. Along those lines, what went on in the trial that caused p to increase when the sample size doubled? (For example, could the latter half of the trial inadvertently have been sampling a different population?)

    4) Hey, yeah, what about that control group?

    5) What’s magic about p=0.05? Statistics is an arbitrary way of being reasonable. Doesn’t the reasonableness of p=0.05 depend upon context? For example, is p=0.05 appropriate for fire control decisions? Lower? Higher? I suppose that depends which end of the barrel you’re on. If nothing else, it would sure affect my feelings about Type II error rates.

    6) I think I’d take the trial data and do bootstrap analyses with 3000 and 6000 samples and see what they told me.
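A minimal version of the bootstrap check suggested in point 6, run on invented data since the post gives none:

```python
import numpy as np

# Resample two hypothetical trial arms and compare the spread of the
# treatment-control difference at the 3000- and 6000-patient stages.
rng = np.random.default_rng(2)
treat = rng.normal(0.05, 1.0, 3000)  # invented outcomes, small true effect
ctrl = rng.normal(0.00, 1.0, 3000)

def boot_interval(a, b, n_per_arm, reps=2000):
    diffs = [rng.choice(a, n_per_arm).mean() - rng.choice(b, n_per_arm).mean()
             for _ in range(reps)]
    return np.percentile(diffs, [2.5, 97.5])

for n in (1500, 3000):  # per-arm sizes at the halfway and final stages
    lo, hi = boot_interval(treat, ctrl, n)
    print(n, round(hi - lo, 3))  # the interval narrows as n grows
```

Resampling shows how stable the estimated difference is at each stage, though it does not by itself resolve the adjusted-versus-unadjusted question.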

  9. I appreciate that Senn’s puzzle may actually be an issue in some medical trials, but here are a couple of quotes even from early Egon, and early N-P:

    “Were the action to be taken to be decided automatically by the side of the 5% level on which the observation point fell, it is clear that the method of analysis used would here be of vital importance. But no responsible statistician, faced with an investigation of this character, would follow an automatic probability rule”. (E.Pearson 1936/1966, 192).

    Or, as in the famous joint paper by Neyman and Pearson, “On the Problem of the Most Efficient Tests…”:
    “It is doubtful whether the knowledge that [the observed significance level] was really .03 (or .06) rather than .05…would in fact ever modify our judgment when balancing the probabilities regarding the origin of a single sample” (N-P, 27).

    The Pearson quote is from E.S Pearson’s selected papers, 118-30. (First in Biometrika 28, 1936: 308-20.)
    Everyone knows the second. The passages may be found in chapter 11, EGEK 386-7.


  10. Some comments.

    First, the small difference between the two P-values is, as some have commented, of little practical relevance, but that is beside the point as regards the paradox. If I had had Smith looking frequently throughout the trial, or using a more aggressive approach to ‘spending alpha’ (he appears to have been using something like an O’Brien-Fleming rule), then the difference would have been bigger. The important point is that there is a difference.

    As with all clinical trials the P-value is that associated with the difference to control. It is a function of both the control group and the treatment group. Medical statisticians are only interested in differences hence the quip, ‘If you ask a medical statistician “How’s your wife” he will reply “compared to what?”‘.

    P-values are a position, not a movement. If a P-value says anything at all of inferential value, it says what one should believe based on the data one has. If one knew that the P-value would get smaller on obtaining more data, then anticipated data would have the same value as actual data, which would be a very unsatisfactory situation. See Senn, S. J. (2002). “A comment on replication, p-values and evidence, S. N. Goodman, Statistics in Medicine 1992; 11:875-879.” Statistics in Medicine 21(16): 2437-2444. However, in this example the unadjusted P-value may well have got smaller. It was not small enough for Smith at the interim look. It was small enough for Jones at the final look. It was not small enough for Smith at the final look, but that does not mean it was smaller at the interim look than it was at the end. He carried on because it was too large.

    My view of the ‘paradox’ is that as evidence the data are what they are irrespective of what was done as regards peeking or not. Thus what is important is the treatment contrast and the measure of its precision. This is what Evans should concentrate on. The P-values are only relevant (if at all) to local decisions that Smith and Jones are taking. These decisions are different and may require different approaches. (This is not the same as saying that an adjusted P-value approach is the right one.)

    Finally, all of this is not quite as fictional as it may seem. A lot of ink has been used (I am tempted to say wasted) in describing approaches to flexible designs permitting one to change all sorts of aspects of the trial itself midstream while still preserving a type I error rate. Some even think, quite mistakenly in my opinion, that this is what is going to “save” the pharmaceutical industry.

    • >A lot of ink has been used (I am tempted to say wasted) in describing approaches to flexible designs permitting one to change all sorts of aspects of the trial itself midstream while still preserving a type I error rate.

      Who do I talk to at the FDA about getting such info put on prescription warning labels? Yikes. I don’t like the sound of that.

    • > … as evidence the data are what they are irrespective of what was done as regards peeking or not.

      I concur. And that is, essentially, why I like bootstrapping (for example) as a method for evaluating the robustness of one’s conclusions. Resampling as a consistency check: if resampling the data significantly affects your conclusions, then it seems to me that you have a problem.

      > If one knew that the P-value would get smaller on obtaining more data then anticipated data would have the same value as actual data which would be a very unsatisfactory situation.

      I don’t follow. Can you point me to a reprint of your paper that is not behind a paywall?
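The O’Brien-Fleming rule Senn refers to above can be made concrete. For two equally spaced looks at an overall two-sided alpha of 0.05, the textbook boundaries are roughly z = 2.797 at the interim and z = 1.977 at the end, a nominal final-look level of about 0.048. A simulation sketch (normal approximation; my own illustration, not the trial’s actual design):

```python
import numpy as np
from scipy.stats import norm

# Verify by simulation that the two-look O'Brien-Fleming boundaries hold the
# overall two-sided type I error near 0.05 under the null.
rng = np.random.default_rng(3)
n_sims = 400_000
z1 = rng.standard_normal(n_sims)                       # interim z (first half)
z2 = (z1 + rng.standard_normal(n_sims)) / np.sqrt(2)   # final z (all data)

c1, c2 = 2.797, 1.977  # textbook two-look O'Brien-Fleming boundaries
overall = ((np.abs(z1) > c1) | (np.abs(z2) > c2)).mean()
print(round(overall, 3))           # close to 0.05
print(round(2 * norm.sf(c2), 3))   # about 0.048: the nominal final-look level
```

On such boundaries an unadjusted final P of 0.049 can fail the adjusted test, much as Smith’s adjusted 0.052 sits on the other side of 0.05 from Jones’s unadjusted 0.049.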

  11. Stephen: I thought it was indirect comparisons and network meta-analysis that were going to “save” the pharmaceutical industry 😉

    Less flippantly, I believe that the P-value has little if anything to say of inferential value. Rather, it is simply a rough communal device for keeping low the rate at which things that do not have (positive) effects are attributed evidence that they do have (positive) effects.

    Also, I believe the intentions and purposes of experiments need to be reflected in whatever the inferential value is.

