**Stephen Senn**
**Consultant Statistician**

Edinburgh, Scotland

**Testing Times**

# Screening for attention

There has been much comment on Twitter and other social media about testing for coronavirus and the relationship between a test being positive and the person tested having been infected. Some primitive form of Bayesian reasoning is often used to justify concern that an apparent positive may actually be falsely so, with specificity and sensitivity taking the roles of likelihoods and prevalence that of a prior distribution. This way of looking at testing dates back at least to a paper of 1959 by Ledley and Lusted[1]. However, as others[2, 3] have pointed out, there is a trap for the unwary in this, in that it is implicitly assumed that specificity and sensitivity are constant values unaffected by prevalence and it is far from obvious that this should be the case.

In the age of COVID-19 this is a highly suitable subject for a blog. However, I am a highly unsuitable person to blog about it, since what I know about screening programmes could be written on the back of a postage stamp and the matter is very delicate. So, I shall duck the challenge but instead write about something that bears more than a superficial similarity to it, namely, testing for model adequacy prior to carrying out a test of a hypothesis of primary interest. It is an issue that arises in particular in cross-over trials, where there may be concerns that carry-over has taken place. Here, I may or may not be an expert, that is for others to judge, but I can’t claim that *I* think I am unqualified to write about it, since I once wrote a book on the subject[4, 5]. So this blog will be about testing model assumptions taking the particular example of cross-over trials.

# Presuming to assume

The simplest of all cross-over trials, the so-called *AB/BA cross-over*, is one in which two treatments A and B are compared by randomising patients to one of two sequences: either A followed by B (labelled AB) or B followed by A (labelled BA). Each patient is thus studied in two periods, receiving one of the two treatments in each. There may be a so-called *wash-out* period between them but, whether or not a wash-out is employed, the assumption will be made that by the time the effect of a treatment comes to be measured in period two, the effect of any treatment given in period one has disappeared. If such a residual effect, referred to as a *carry-over*, existed it would bias the treatment effect since, for example, the result in period two of the AB sequence would reflect not only the effect of giving B but also the previous effect of having given A.

Everyone is agreed that if the effect of carry-over can be assumed negligible, an efficient estimate of the difference between the effect of B and A can be made by allowing each patient to act as his or her own control. One way of doing this is to calculate a difference for each patient of the period two values minus the period one values. I shall refer to these as the *period differences*. If the effect of treatment B is the same as that of treatment A, then these period differences will not be expected to differ systematically from one sequence to another. However, if (say) the effect of B was greater than that of A (and higher values were better), then in the AB sequence a positive difference would be added to the period differences and in the BA sequence that difference would be subtracted from the period differences. We should thus expect the means of the period differences for the two sequences to differ. So one way of testing the null hypothesis of no treatment effect is to carry out a two-sample t-test comparing the period differences between one sequence and another. An approach that is equivalent from the point of view of testing, but more convenient from the point of view of estimation, is to work with the semi period differences, that is to say the period differences divided by two. I shall refer to the associated estimate as CROS and the t-statistic for this test as CROS_{t}, since they are what the cross-over trial was designed to produce.

Unfortunately, however, these period differences could also reflect carry-over and, if this occurs, it would bias the estimate of the treatment effect. The usual effect will be to bias it towards the null and so there may be a loss of power. Is there a remedy? One possibility is to discard the second period values. After all, the first period values cannot be contaminated by carry-over. On the other hand, single period values are what we have in any parallel group trial. So all we need to do is regard the first period values as coming from a parallel group trial. Patients in the AB sequence yield values under A and patients in the BA sequence yield values under B, so a comparison of the first period values, again using a two-sample t-test, is a test of the null hypothesis of no difference between treatments. I shall refer to this estimate as the PAR statistic and the corresponding t-statistic as PAR_{t}.

Note that the PAR statistic is expected to be second best to the CROS statistic. Not only are half the values discarded but, since we can no longer use each patient as his or her own control, the relevant variance is a sum of between- and within-patient variances, unlike for CROS, which only reflects within-patient variation. Nevertheless, PAR may be expected to be unbiased in circumstances where CROS will be biased, and since all applied statistics is a bias-variance trade-off it seems conceivable that there are circumstances under which PAR would be preferable.
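To put a rough number on that penalty, here is a sketch under assumptions that are mine rather than part of the argument above: $n$ patients per sequence, a between-patient variance $\sigma_b^2$ and a within-patient variance $\sigma_w^2$ that are the same in both periods, and no carry-over. Writing $\rho = \sigma_b^2/(\sigma_b^2+\sigma_w^2)$ for the correlation between a patient's two measurements, the variances of the two estimates would then be

$$
\operatorname{Var}(\mathrm{CROS}) = \frac{\sigma_w^2}{n},
\qquad
\operatorname{Var}(\mathrm{PAR}) = \frac{2(\sigma_b^2+\sigma_w^2)}{n},
\qquad
\frac{\operatorname{Var}(\mathrm{PAR})}{\operatorname{Var}(\mathrm{CROS})} = \frac{2}{1-\rho}.
$$

With $\rho = 0.7$, for example, the ratio is $2/0.3 \approx 6.7$, so under these assumptions PAR would need getting on for seven times as many patients to match the precision of CROS.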

# Carry-over carry on

However, there is a problem. How shall we know that carry-over has occurred? It turns out that there is a t-test for this too. First, we construct means over the two periods for each patient. In each sequence such means must reflect the effect of both treatments, since each patient receives each treatment, and they must also reflect the effect of both periods, since each patient will be treated in each period. However, in the AB sequence the total (and hence the mean) will also reflect the effect of any carry-over from A in the second period, whereas in the BA sequence the total (and hence the mean) will reflect the carry-over of B. Thus, if the two carry-overs differ, which is what matters, these sequence means will differ and therefore the t-test of the totals comparing the two sequences is a valid test of zero differential carry-over. I shall refer to the estimate as SEQ and the corresponding t-statistic as SEQ_{t}.
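The three statistics described so far can be sketched in a few lines of Python. This is a minimal illustration, not anybody's production code: the function names, the data layout (a list of period one and period two values per patient) and the toy numbers are all hypothetical, and I have taken each contrast as BA minus AB for PAR and SEQ, and AB minus BA for CROS, so that all three point in the direction of B minus A.

```python
from math import sqrt
from statistics import mean, variance


def two_sample_t(x, y):
    """Return (mean(x) - mean(y), pooled-variance two-sample t statistic)."""
    nx, ny = len(x), len(y)
    pooled = ((nx - 1) * variance(x) + (ny - 1) * variance(y)) / (nx + ny - 2)
    diff = mean(x) - mean(y)
    return diff, diff / sqrt(pooled * (1 / nx + 1 / ny))


def crossover_stats(ab, ba):
    """ab, ba: lists of (period 1, period 2) values, one pair per patient."""
    # CROS: semi period differences; in AB they carry +(B - A)/2, in BA -(B - A)/2
    cros = two_sample_t([(y2 - y1) / 2 for y1, y2 in ab],
                        [(y2 - y1) / 2 for y1, y2 in ba])
    # PAR: first period only; BA patients received B, AB patients received A
    par = two_sample_t([y1 for y1, _ in ba], [y1 for y1, _ in ab])
    # SEQ: patient means over both periods, compared between sequences
    seq = two_sample_t([(y1 + y2) / 2 for y1, y2 in ba],
                       [(y1 + y2) / 2 for y1, y2 in ab])
    return {"CROS": cros, "PAR": par, "SEQ": seq}


# Hypothetical toy data: three patients per sequence
example_ab = [(10, 14), (12, 15), (9, 14)]
example_ba = [(13, 11), (15, 10), (14, 12)]

for name, (estimate, t_stat) in crossover_stats(example_ab, example_ba).items():
    print(f"{name}: estimate = {estimate:.3f}, t = {t_stat:.3f}")
```

Note that PAR and SEQ both use the first period values with the same sign, a point that matters a great deal in what follows.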

# The rule of three

These three tests were then formally incorporated in a testing strategy known as the *two-stage procedure*[6] as follows. First a test for carry-over was performed using SEQ_{t}. Since the test was a between-patient test and therefore of low power, a nominal type I error rate of 10% was generally used. If SEQ_{t} was not significant, the statistician proceeded to use CROS_{t} to test the principal hypothesis of interest, namely that of the equality of the two treatments. If, however, SEQ_{t} was significant, which might be taken as an indication of carry-over, the fallback test PAR_{t} was used instead to test equality of the treatments.
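Purely to make the logic of the procedure concrete (and emphatically not as a recommendation), the decision rule can be written as a short function. The function name and signature are hypothetical. The default critical values are the two-sided 10% and 5% points of a t-distribution with 22 degrees of freedom, which is what a trial with 12 patients per sequence would give; that sample size is an assumption, and other sample sizes would need other values.

```python
def two_stage(seq_t, cros_t, par_t, crit_10=1.717, crit_05=2.074):
    """Return (test chosen, whether it is 'significant') for the two-stage rule.

    crit_10 and crit_05 are two-sided t critical values for 22 degrees of
    freedom (an assumed design with 12 patients per sequence).
    """
    if abs(seq_t) > crit_10:
        # Carry-over 'detected' at the 10% level: fall back on first period data
        return "PAR", abs(par_t) > crit_05
    # No carry-over 'detected': use the within-patient analysis
    return "CROS", abs(cros_t) > crit_05


print(two_stage(seq_t=0.5, cros_t=2.5, par_t=1.0))
print(two_stage(seq_t=2.0, cros_t=2.5, par_t=1.0))
```

In the second call the same CROS_{t} that was declared significant in the first is simply ignored, because SEQ_{t} has crossed its threshold.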

The procedure is illustrated in Figure 1.

# Chimeric catastrophe

Of course, to the extent that the three tests are used as prescribed, they can be combined in a single algorithm. In fact in the pharmaceutical company that I joined in 1987, the programming group had written a SAS® macro to do exactly that. You just needed to point the macro at your data and it would calculate SEQ, come to a conclusion, choose either CROS or PAR as appropriate and give you your P-value.

I hated the procedure as soon as I saw it and never used it. I argued that it was an abuse of testing to assume that, just because SEQ was not significant, no carry-over had occurred. One had to rely on other arguments to justify ignoring carry-over. It was only on hearing a colleague lecture on an example where the test for carry-over had proved significant and, much to his surprise, given its low power, the first period test had also proved significant, that I suddenly realised that SEQ and PAR were highly correlated and therefore this was only to be expected. In consequence, the procedure would not maintain the Type I error rate. Only a few days later a manuscript from *Statistics in Medicine* arrived on my desk for review. The paper by Peter Freeman[7] overturned everything everyone believed on testing for carry-over. In bolting these tests together, statisticians had created a chimeric monster. Far from helping to solve the problem of carry-over, the two-stage procedure had made it worse.

‘How can screening for something be a problem?’, an applied statistician might ask, but in asking that they would be completely forgetting the advice they would give a physician who wanted to know the same thing. The process as a whole of screening plus remedial action needed to be studied and statisticians had failed to do so. Peter Freeman completely changed that. He did what statisticians should have done and looked at how the procedure as a whole behaved. In the years since I have simply asked statisticians who wish to give an opinion on cross-over trials what they think of Freeman’s paper[7]. It has become a litmus paper for me. Their answer tells me everything I need to know.

# Correlation is not causation but it can cause trouble

So what is the problem? The problem is illustrated by Figure 2. This shows a simulation from a null case. There is no difference between the treatments and no carry-over. The correlation between periods one and two has been set to 0.7. One thousand trials in 24 patients (12 for each sequence) have been simulated. The figure plots CROS_{t} (blue circles) and PAR_{t} (red diamonds) on the Y axis against SEQ_{t} on the X axis. The vertical lines show the critical boundaries for SEQ_{t} at the 10% level and the horizontal lines show the critical boundaries for CROS_{t} and PAR_{t} at the 5% level. Filled circles or diamonds indicate significant results of CROS_{t} and PAR_{t} and open circles or diamonds indicate non-significant values.

It is immediately noticeable that CROS_{t} and SEQ_{t} are uncorrelated. This is hardly surprising, since given equal variances CROS and SEQ are orthogonal by construction. On the other hand PAR_{t} and SEQ_{t} are very strongly correlated. This ought not to be surprising. PAR uses the first period means. SEQ also uses the first period with the same sign. Even if the second period means were uncorrelated with the first the two statistics would be correlated, since the same information is used. However, in practice the second period means will be correlated and thus a strong correlation can result. In this example the empirical correlation is 0.92.
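Both observations can be checked with a little algebra. The notation here is my own sketch rather than anything taken from the figure: let $Y_1$ and $Y_2$ be a patient's period one and period two values, each with variance $\sigma^2$ and with correlation $\rho$ between them, and suppose patients are independent with equal numbers per sequence. For a single patient,

$$
\operatorname{Cov}\!\left(\frac{Y_2 - Y_1}{2}, \frac{Y_1 + Y_2}{2}\right) = \frac{\operatorname{Var}(Y_2) - \operatorname{Var}(Y_1)}{4} = 0,
$$

which is the orthogonality of CROS and SEQ, whereas

$$
\operatorname{Cov}\!\left(Y_1, \frac{Y_1 + Y_2}{2}\right) = \frac{\sigma^2(1+\rho)}{2},
\qquad
\operatorname{Var}\!\left(\frac{Y_1 + Y_2}{2}\right) = \frac{\sigma^2(1+\rho)}{2},
$$

so that, since PAR and SEQ are the same between-sequence contrast applied to $Y_1$ and to $(Y_1+Y_2)/2$ respectively,

$$
\operatorname{Corr}(\mathrm{PAR}, \mathrm{SEQ}) = \frac{\sigma^2(1+\rho)/2}{\sqrt{\sigma^2 \cdot \sigma^2(1+\rho)/2}} = \sqrt{\frac{1+\rho}{2}}.
$$

With $\rho = 0$ this is about 0.71, and with $\rho = 0.7$ it is $\sqrt{0.85} \approx 0.92$. Strictly this is the correlation of the estimates rather than of the t-statistics, but the two should be close, and the value agrees with the empirical figure quoted above.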

The consequence is that if SEQ_{t} is significant PAR_{t} is likely to be so. This can be seen from the scatter plot, where there are far more filled diamonds in the regions to the left of the lower critical value or to the right of the higher critical value for SEQ_{t} than in the region in between. In this simulation of 1000 trials, 99 values of SEQ_{t} are significant at the 10% level, and 50 values of CROS_{t} and 53 values of PAR_{t} are significant at the 5% level. These figures are close to the expected values. However, 91 values are significant using the two-stage procedure. Of course, this is just a simulation. However, for the theory[8] see http://www.senns.demon.co.uk/ROEL.pdf .

In fact, this inflation really underestimates the problem with the two-stage procedure. Either the extra complication is irrelevant (we end up using CROS_{t}) or the conditional type-I error rate is massively inflated. In this example, of the 99 cases where SEQ_{t} is significant, 48 of the values of PAR_{t} are significant. A nominal 5% significance rate has become nearly a 50% conditional one!
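A simulation along these lines is easy to sketch. What follows is my own minimal reconstruction in plain Python, assuming standard Normal outcomes, a correlation of 0.7 between periods, 12 patients per sequence and hard-coded two-sided t critical values for 22 degrees of freedom. The function names and the seed are arbitrary, so it will not reproduce the exact counts quoted above, only their general pattern.

```python
import random
from math import sqrt
from statistics import mean, variance

N_PER_SEQ = 12                   # changing this requires new critical values
CRIT_10, CRIT_05 = 1.717, 2.074  # two-sided t critical values, 22 df


def two_sample_t(x, y):
    """Pooled-variance two-sample t statistic for mean(x) - mean(y)."""
    nx, ny = len(x), len(y)
    pooled = ((nx - 1) * variance(x) + (ny - 1) * variance(y)) / (nx + ny - 2)
    return (mean(x) - mean(y)) / sqrt(pooled * (1 / nx + 1 / ny))


def draw_sequence(rng, rho):
    """(period 1, period 2) values for one sequence under the null."""
    patients = []
    for _ in range(N_PER_SEQ):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        patients.append((z1, rho * z1 + sqrt(1 - rho ** 2) * z2))
    return patients


def simulate(n_trials=1000, rho=0.7, seed=2020):
    """Count significant results; no treatment effect and no carry-over."""
    rng = random.Random(seed)
    counts = {"seq": 0, "cros": 0, "par": 0, "two_stage": 0, "par_given_seq": 0}
    for _ in range(n_trials):
        ab, ba = draw_sequence(rng, rho), draw_sequence(rng, rho)
        cros_t = two_sample_t([(y2 - y1) / 2 for y1, y2 in ab],
                              [(y2 - y1) / 2 for y1, y2 in ba])
        par_t = two_sample_t([y1 for y1, _ in ba], [y1 for y1, _ in ab])
        seq_t = two_sample_t([(y1 + y2) / 2 for y1, y2 in ba],
                             [(y1 + y2) / 2 for y1, y2 in ab])
        seq_sig = abs(seq_t) > CRIT_10
        cros_sig = abs(cros_t) > CRIT_05
        par_sig = abs(par_t) > CRIT_05
        counts["seq"] += seq_sig
        counts["cros"] += cros_sig
        counts["par"] += par_sig
        # Two-stage rule: PAR if carry-over is 'detected', otherwise CROS
        counts["two_stage"] += par_sig if seq_sig else cros_sig
        counts["par_given_seq"] += seq_sig and par_sig
    return counts


result = simulate()
print(result)
print("conditional rate of PAR given SEQ significant:",
      result["par_given_seq"] / max(result["seq"], 1))
```

The two numbers to watch are `two_stage` as a proportion of the trials, which is the overall type I error rate of the procedure, and `par_given_seq` as a proportion of `seq`, which is the conditional rate in exactly those cases where the procedure claims to be protecting you.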

# What are the lessons?

The first lesson is that, despite what your medical statistics textbook might tell you, you should *never* use the two-stage procedure. It is completely unacceptable.

Should you test for carry-over at all? That’s a bit more tricky. In principle more evidence is always better than less. The practical problem is that there is no advice that I can offer you as to what to do next on ‘finding’ carry-over except to drop the nominal target significance level. (See *The AB/BA cross-over: how to perform the two-stage analysis if you can’t be persuaded that you shouldn’t*[8], but note the warning in the title.)

Should you avoid using cross-over trials? No. They can be very useful on occasion. Their use needs to be grounded in biology and pharmacology. Statistical manipulation is not the cure for carry-over.

Are there more general lessons? Probably. The two-stage analysis is the worst case I know of but there may be others where testing assumptions is dangerous. Remember, a decision to behave as if something is true is not the same as knowing it is true. Also, beware of recognisable subsets. There are deep waters here.

# References

1. Ledley, R.S. and L.B. Lusted, *Reasoning foundations of medical diagnosis; symbolic logic, probability, and value theory aid our understanding of how physicians reason.* Science, 1959. **130**(3366): p. 9-21.
2. Dawid, A.P., *Properties of Diagnostic Data Distributions.* Biometrics, 1976. **32**: p. 647-658.
3. Guggenmoos-Holzmann, I. and H.C. van Houwelingen, *The (in)validity of sensitivity and specificity.* Statistics in Medicine, 2000. **19**(13): p. 1783-92.
4. Senn, S.J., *Cross-over Trials in Clinical Research*. First ed. Statistics in Practice, ed. V. Barnett. 1993, Chichester: John Wiley. 257.
5. Senn, S.J., *Cross-over Trials in Clinical Research*. Second ed. 2002, Chichester: Wiley.
6. Hills, M. and P. Armitage, *The two-period cross-over clinical trial.* British Journal of Clinical Pharmacology, 1979. **8**: p. 7-20.
7. Freeman, P., *The performance of the two-stage analysis of two-treatment, two-period cross-over trials.* Statistics in Medicine, 1989. **8**: p. 1421-1432.
8. Senn, S.J., *The AB/BA cross-over: how to perform the two-stage analysis if you can’t be persuaded that you shouldn’t*, in *Liber Amicorum Roel van Strik*, B. Hansen and M. de Ridder, Editors. 1996, Erasmus University: Rotterdam. p. 93-100.
