Screening for attention
There has been much comment on Twitter and other social media about testing for coronavirus and the relationship between a test being positive and the person tested actually having been infected. Some primitive form of Bayesian reasoning is often used to justify concern that an apparent positive may actually be falsely so, with specificity and sensitivity taking the roles of likelihoods and prevalence that of a prior distribution. This way of looking at testing dates back at least to a 1959 paper by Ledley and Lusted[1]. However, as others[2, 3] have pointed out, there is a trap for the unwary here: it is implicitly assumed that specificity and sensitivity are constant values unaffected by prevalence, and it is far from obvious that this should be the case.
In the age of COVID-19 this is a highly suitable subject for a blog. However, I am a highly unsuitable person to blog about it, since what I know about screening programmes could be written on the back of a postage stamp and the matter is very delicate. So, I shall duck the challenge but instead write about something that bears more than a superficial similarity to it, namely, testing for model adequacy prior to carrying out a test of a hypothesis of primary interest. It is an issue that arises in particular in cross-over trials, where there may be concerns that carry-over has taken place. Here, I may or may not be an expert, that is for others to judge, but I can’t claim that I think I am unqualified to write about it, since I once wrote a book on the subject[4, 5]. So this blog will be about testing model assumptions taking the particular example of cross-over trials.
Presuming to assume
The simplest of all cross-over trials, the so-called AB/BA cross-over, is one in which two treatments A and B are compared by randomising patients to one of two sequences: either A followed by B (labelled AB) or B followed by A (labelled BA). Each patient is thus studied in two periods, receiving one of the two treatments in each. There may be a so-called wash-out period between them but, whether or not a wash-out is employed, the assumption will be made that by the time the effect of a treatment comes to be measured in period two, the effect of any treatment given in period one has disappeared. If such a residual effect, referred to as carry-over, existed, it would bias the treatment effect since, for example, the result in period two of the AB sequence would reflect not only the effect of giving B but also the residual effect of having given A.
Everyone is agreed that if the effect of carry-over can be assumed negligible, an efficient estimate of the difference between the effects of B and A can be made by allowing each patient to act as his or her own control. One way of doing this is to calculate a difference for each patient of the period two values minus the period one values. I shall refer to these as the period differences. If the effect of treatment B is the same as that of treatment A, then these period differences will not be expected to differ systematically from one sequence to another. However, if (say) the effect of B was greater than that of A (and higher values were better), then in the AB sequence a positive difference would be added to the period differences and in the BA sequence that difference would be subtracted from them. We should thus expect the means of the period differences for the two sequences to differ. So one way of testing the null hypothesis of no treatment effect is to carry out a two-sample t-test comparing the period differences between one sequence and another. Equivalent from the point of view of testing, but more convenient from the point of view of estimation, is to work with the semi period differences, that is to say the period differences divided by two. I shall refer to the associated estimate as CROS and the t-statistic for this test as CROSt, since they are what the cross-over trial was designed to produce.
Unfortunately, however, these period differences could also reflect carry-over, and if this occurs it would bias the estimate of the treatment effect. The usual effect will be to bias it towards the null, and so there may be a loss of power. Is there a remedy? One possibility is to discard the second period values. After all, the first period values cannot be contaminated by carry-over. On the other hand, single period values are what we have in any parallel group trial. So all we need to do is regard the first period values as coming from a parallel group trial. Patients in the AB sequence yield values under A and patients in the BA sequence yield values under B, so a comparison of the first period values, again using a two-sample t-test, is a test of the null hypothesis of no difference between treatments. I shall refer to this estimate as the PAR statistic and the corresponding t-statistic as PARt.
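To make the two estimators concrete, here is a minimal sketch in Python. The data, the variance parameters and the size of the treatment effect are all made up for illustration; the t-tests themselves are just scipy's standard two-sample test.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n, effect = 12, 1.0  # 12 patients per sequence; hypothetical B - A effect of 1.0

# A random patient effect induces the within-patient correlation.
pat_ab = rng.normal(0, 1, n)
pat_ba = rng.normal(0, 1, n)
# Columns are (period 1, period 2); no carry-over in this sketch.
ab = np.column_stack([pat_ab + rng.normal(0, 0.5, n),             # period 1: A
                      pat_ab + effect + rng.normal(0, 0.5, n)])   # period 2: B
ba = np.column_stack([pat_ba + effect + rng.normal(0, 0.5, n),    # period 1: B
                      pat_ba + rng.normal(0, 0.5, n)])            # period 2: A

# CROS: two-sample t-test on the semi period differences, (period 2 - period 1)/2.
semi_ab = (ab[:, 1] - ab[:, 0]) / 2
semi_ba = (ba[:, 1] - ba[:, 0]) / 2
cros = semi_ab.mean() - semi_ba.mean()            # estimates B - A
cros_t, cros_p = stats.ttest_ind(semi_ab, semi_ba)

# PAR: two-sample t-test on first-period values only, as in a parallel group trial.
par = ba[:, 0].mean() - ab[:, 0].mean()           # also estimates B - A
par_t, par_p = stats.ttest_ind(ba[:, 0], ab[:, 0])

print(f"CROS = {cros:.2f} (p = {cros_p:.3f}); PAR = {par:.2f} (p = {par_p:.3f})")
```

Note that the patient effect cancels in the period differences, so CROS is based on within-patient variation only, whereas PAR carries the full between- plus within-patient variance.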
Note that the PAR statistic is expected to be second best to the CROS statistic. Not only are half the values discarded but, since we can no longer use each patient as his or her own control, the relevant variance is a sum of between- and within-patient variances, unlike for CROS, which only reflects within-patient variation. Nevertheless, PAR may be expected to be unbiased in circumstances where CROS will not be, and since all applied statistics is a bias-variance trade-off, it seems conceivable that there are circumstances under which PAR would be preferable.
Carry-over carry on
However, there is a problem. How shall we know that carry-over has occurred? It turns out that there is a t-test for this too. First, we construct means over the two periods for each patient. In each sequence such means must reflect the effect of both treatments, since each patient receives each treatment, and they must also reflect the effect of both periods, since each patient is treated in each period. However, in the AB sequence the total (and hence the mean) will also reflect the effect of any carry-over from A in the second period, whereas in the BA sequence the total (and hence the mean) will reflect the carry-over of B. Thus, if the two carry-overs differ, which is what matters, these sequence means will differ, and therefore the t-test comparing the totals between the two sequences is a valid test of zero differential carry-over. I shall refer to the estimate as SEQ and the corresponding t-statistic as SEQt.
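The SEQ test can be sketched in the same style as before. The data are again purely hypothetical, here simulated under the null with no treatment effect and no carry-over, so SEQt should only rarely be significant:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n = 12  # patients per sequence

# Null case: no treatment effect and no differential carry-over.
pat_ab = rng.normal(0, 1, n)
pat_ba = rng.normal(0, 1, n)
ab = np.column_stack([pat_ab + rng.normal(0, 0.5, n),
                      pat_ab + rng.normal(0, 0.5, n)])
ba = np.column_stack([pat_ba + rng.normal(0, 0.5, n),
                      pat_ba + rng.normal(0, 0.5, n)])

# SEQ: compare the patient means over the two periods between sequences.
# Treatment and period effects are common to both sequences and so cancel;
# only differential carry-over (plus noise) can make the sequences differ.
seq = ab.mean(axis=1).mean() - ba.mean(axis=1).mean()
seq_t, seq_p = stats.ttest_ind(ab.mean(axis=1), ba.mean(axis=1))
print(f"SEQ = {seq:.2f}, t = {seq_t:.2f}, p = {seq_p:.3f}")
```

Because this contrast is between patients, it carries the between-patient variance and is therefore a low-power test, which is why a 10% level was conventionally used for it.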
The rule of three
These three tests were then formally incorporated in a testing strategy known as the two-stage procedure[6] as follows. First, a test for carry-over was performed using SEQt. Since this was a between-patient test and therefore of low power, a nominal type I error rate of 10% was generally used. If SEQt was not significant, the statistician proceeded to use CROSt to test the principal hypothesis of interest, namely that of the equality of the two treatments. If, however, SEQt was significant, which might be taken as an indication of carry-over, the fallback test PARt was used instead to test equality of the treatments.
The procedure is illustrated in Figure 1.
Of course, to the extent that the three tests are used as prescribed, they can be combined in a single algorithm. In fact in the pharmaceutical company that I joined in 1987, the programming group had written a SAS® macro to do exactly that. You just needed to point the macro at your data and it would calculate SEQ, come to a conclusion, choose either CROS or PAR as appropriate and give you your P-value.
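Such a macro is easy to write. Here is a sketch of the two-stage procedure as a single Python function, offered purely to make the algorithm explicit and emphatically not as a recommendation, since the whole point of what follows is that the procedure is flawed:

```python
import numpy as np
from scipy import stats

def two_stage(ab, ba, carry_alpha=0.10, alpha=0.05):
    """ab, ba: (n_patients, 2) arrays of (period 1, period 2) values.
    Returns the test used, its p-value, and whether it was significant."""
    # Stage 1: SEQ test for carry-over on the patient means, nominal 10% level.
    seq_p = stats.ttest_ind(ab.mean(axis=1), ba.mean(axis=1)).pvalue
    if seq_p >= carry_alpha:
        # SEQ not significant: use the within-patient CROS test.
        p = stats.ttest_ind((ab[:, 1] - ab[:, 0]) / 2,
                            (ba[:, 1] - ba[:, 0]) / 2).pvalue
        test = "CROS"
    else:
        # SEQ significant: fall back on the first-period PAR test.
        p = stats.ttest_ind(ba[:, 0], ab[:, 0]).pvalue
        test = "PAR"
    return test, p, p < alpha

# Illustration on made-up null data.
rng = np.random.default_rng(0)
ab = rng.normal(0, 1, size=(12, 2))
ba = rng.normal(0, 1, size=(12, 2))
print(two_stage(ab, ba))
```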
I hated the procedure as soon as I saw it and never used it. I argued that it was an abuse of testing to assume that, just because SEQ was not significant, no carry-over had occurred. One had to rely on other arguments to justify ignoring carry-over. It was only on hearing a colleague lecture on an example where the test for carry-over had proved significant and where, much to his surprise given its low power, the first period test had also proved significant, that I suddenly realised that SEQ and PAR were highly correlated and that this was therefore only to be expected. In consequence, the procedure would not maintain the Type I error rate. Only a few days later a manuscript from Statistics in Medicine arrived on my desk for review. The paper, by Peter Freeman[7], overturned everything everyone believed about testing for carry-over. In bolting these tests together, statisticians had created a chimeric monster. Far from helping to solve the problem of carry-over, the two-stage procedure had made it worse.
‘How can screening for something be a problem?’, an applied statistician might ask, but in asking that they would be completely forgetting the advice they would give a physician who wanted to know the same thing. The process as a whole, screening plus remedial action, needed to be studied, and statisticians had failed to do so. Peter Freeman completely changed that. He did what statisticians should have done and looked at how the procedure as a whole behaved. In the years since, I have simply asked statisticians who wish to give an opinion on cross-over trials what they think of Freeman’s paper. It has become a litmus paper for me. Their answer tells me everything I need to know.
Correlation is not causation but it can cause trouble
So what is the problem? The problem is illustrated by Figure 2. This shows a simulation from a null case. There is no difference between the treatments and no carry-over. The correlation between periods one and two has been set to 0.7. One thousand trials in 24 patients (12 for each sequence) have been simulated. The figure plots CROSt (blue circles) and PARt (red diamonds) on the Y axis against SEQt on the X axis. The vertical lines show the critical boundaries for SEQt at the 10% level and the horizontal lines show the critical boundaries for CROSt and PARt at the 5% level. Filled circles or diamonds indicate significant results of CROSt and PARt and open circles or diamonds indicate non-significant values.
It is immediately noticeable that CROSt and SEQt are uncorrelated. This is hardly surprising, since given equal variances CROS and SEQ are orthogonal by construction. On the other hand, PARt and SEQt are very strongly correlated. This ought not to be surprising. PAR uses the first period means. SEQ also uses the first period means, with the same sign. Even if the second period means were uncorrelated with the first, the two statistics would be correlated, since the same information is used. However, in practice the second period means will be correlated with the first, and thus a strong correlation can result. In this example the empirical correlation is 0.92.
The consequence is that if SEQt is significant, PARt is likely to be so too. This can be seen from the scatter plot, where there are far more filled diamonds in the regions to the left of the lower critical value or to the right of the higher critical value for SEQt than in the region in between. In this simulation of 1000 trials, 99 values of SEQt are significant at the 10% level; 50 values of CROSt and 53 values of PARt are significant at the 5% level. These figures are close to the expected values. However, 91 values are significant using the two-stage procedure. Of course, this is just a simulation. However, for the theory see http://www.senns.demon.co.uk/ROEL.pdf.
In fact, this overall inflation understates the problem with the two-stage procedure. Either the extra complication is irrelevant (we end up using CROSt) or the conditional type I error rate is massively inflated. In this example, in 48 of the 99 cases where SEQt is significant, PARt is also significant. A nominal 5% significance rate has become nearly a 50% conditional one!
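The null-case simulation is easy to reproduce. The sketch below follows the setup described above (1000 trials, 24 patients per trial, between-period correlation 0.7, no treatment effect, no carry-over); the exact counts will of course differ from Figure 2 because the random draws differ:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2020)
n_trials, n, rho = 1000, 12, 0.7          # n patients per sequence
cov = np.array([[1.0, rho], [rho, 1.0]])  # between-period correlation 0.7

rejections = 0          # two-stage rejections of the (true) null at 5%
seq_sig = 0             # trials where SEQt is significant at 10%
par_sig_given_seq = 0   # of those, trials where PARt is significant at 5%
for _ in range(n_trials):
    ab = rng.multivariate_normal([0.0, 0.0], cov, n)
    ba = rng.multivariate_normal([0.0, 0.0], cov, n)
    seq_p = stats.ttest_ind(ab.mean(axis=1), ba.mean(axis=1)).pvalue
    if seq_p >= 0.10:
        # No 'carry-over' found: use CROS on the semi period differences.
        p = stats.ttest_ind((ab[:, 1] - ab[:, 0]) / 2,
                            (ba[:, 1] - ba[:, 0]) / 2).pvalue
    else:
        # 'Carry-over' found: fall back on the first-period PAR test.
        seq_sig += 1
        p = stats.ttest_ind(ba[:, 0], ab[:, 0]).pvalue
        par_sig_given_seq += p < 0.05
    rejections += p < 0.05

print(f"two-stage type I error rate: {rejections / n_trials:.3f}")
print(f"conditional PAR rejection rate: {par_sig_given_seq / max(seq_sig, 1):.2f}")
```

The two printed rates should come out in the neighbourhood of the figures reported above: an overall rate near 9% instead of the nominal 5%, and a conditional rate approaching 50%.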
What are the lessons?
The first lesson is that, despite what your medical statistics textbook might tell you, you should never use the two-stage procedure. It is completely unacceptable.
Should you test for carry-over at all? That’s a bit more tricky. In principle more evidence is always better than less. The practical problem is that there is no advice that I can offer you as to what to do next on ‘finding’ carry-over except to drop the nominal target significance level. (See The AB/BA cross-over: how to perform the two-stage analysis if you can’t be persuaded that you shouldn’t, but note the warning in the title.)
Should you avoid using cross-over trials? No. They can be very useful on occasion. Their use needs to be grounded in biology and pharmacology. Statistical manipulation is not the cure for carry-over.
Are there more general lessons? Probably. The two-stage analysis is the worst case I know of, but there may be others where testing assumptions is dangerous. Remember, a decision to behave as if something is true is not the same as knowing it is true. Also, beware of recognisable subsets. There are deep waters here.
1. Ledley, R.S. and L.B. Lusted, Reasoning foundations of medical diagnosis; symbolic logic, probability, and value theory aid our understanding of how physicians reason. Science, 1959. 130(3366): p. 9-21.
2. Dawid, A.P., Properties of diagnostic data distributions. Biometrics, 1976. 32: p. 647-658.
3. Guggenmoos-Holzmann, I. and H.C. van Houwelingen, The (in)validity of sensitivity and specificity. Statistics in Medicine, 2000. 19(13): p. 1783-92.
4. Senn, S.J., Cross-over Trials in Clinical Research. First ed. Statistics in Practice, ed. V. Barnett. 1993, Chichester: John Wiley. 257.
5. Senn, S.J., Cross-over Trials in Clinical Research. Second ed. 2002, Chichester: Wiley.
6. Hills, M. and P. Armitage, The two-period cross-over clinical trial. British Journal of Clinical Pharmacology, 1979. 8: p. 7-20.
7. Freeman, P., The performance of the two-stage analysis of two-treatment, two-period cross-over trials. Statistics in Medicine, 1989. 8: p. 1421-1432.
8. Senn, S.J., The AB/BA cross-over: how to perform the two-stage analysis if you can’t be persuaded that you shouldn’t, in Liber Amicorum Roel van Strik, B. Hansen and M. de Ridder, Editors. 1996, Erasmus University: Rotterdam. p. 93-100.
I’m very grateful to Stephen Senn for his guest post on “Testing Times”. The times we are living in are indeed testing people’s ability to cope. But Senn’s post isn’t about that—at least not directly. It deals with an issue that arises in cross-over medical trials, although he tantalizingly hints that the problem he discusses has at least a “superficial similarity” to the problem of diagnostic screening for Covid-19 when “it is implicitly assumed that specificity and sensitivity are constant values unaffected by prevalence”. His hesitancy to come out and reveal the similarity he has in mind, claiming he lacks sufficient expertise (on medical screening), just tantalizes this reader into guessing at the intended analogy. I hope that we can draw him out in the discussion.
In any event, one of the lessons Senn draws is that “testing for model adequacy prior to carrying out a test of a hypothesis of primary interest” can in some cases be “dangerous”. Of course it would be best to ground assumptions about how long the effect of a treatment lasts on biology and pharmacology, as he suggests. Perhaps the lesson is that flawed statistical tests of assumptions, and especially their flawed uses in subsequent statistical tests, can be dangerous because they can wreck, rather than help to ensure, the error probabilities of the primary test. But I definitely lack the expertise to speculate about the particular case.
This is very nice and a very good example of the problems with combined procedures, in which the test to be used is decided by a preliminary test of the validity of model assumptions. I wasn’t aware of this and will need to incorporate it here:
M. Iqbal Shamsudheen, Christian Hennig (2020), Should we test the model assumptions before running a model-based test? https://arxiv.org/abs/1908.02218
By the way, Stephen, if you have any more ideas on what may be missing in that preprint, please tell me!
I can’t resist quoting the last two sentences of Peter Freeman’s paper
“Once they are allowed, a perfectly satisfactory analysis of the two-period, two-treatment crossover trial becomes available, that of Grieve.9 In my opinion this is the only satisfactory analysis and, in the light of the results in this paper, the two-stage analysis is so unsatisfactory as to be ruled out of future use.”
The paper of mine that Peter Freeman was referring to had appeared in Biometrics in 1985. While these two sentences are extremely supportive of the Bayesian approach developed in that paper, I now agree with Stephen that there is a major flaw in the approach which I then espoused.
The flaw is fundamental to most approaches to analysing crossover designs. What is wrong is that most approaches fail to acknowledge that an additive model in which the treatment and carryover effect are handled independently ignores the practical issue that it is extremely unlikely that the carryover is larger in magnitude than the treatment effect.
In our 1998 joint paper, while commenting on my Bayesian approach, Stephen wrote: “I am prepared to stick my neck out and say that when we finally get a Bayesian method of analysing crossover trials that models the dependency between carryover and treatment (and I am prepared to make the Bayesian statement that AG is the person most likely to produce it), the simple CROS analysis will be shown to provide reasonable results in practice, even where carryover is appreciable, unless the sample size of the trial is extremely large.” Unfortunately I haven’t as yet.