Stephen Senn
Consultant Statistician
Edinburgh
What is this n you boast about?
Failure to understand components of variation is the source of much mischief. It can lead researchers to overlook that they can be rich in datapoints but poor in information. The important thing is always to understand what varies in the data you have, and to what extent your design, and the purpose you have in mind, master it. The result of failing to understand this can be that you mistakenly calculate standard errors of your estimates that are too small because you divide the variance by an n that is too big. In fact, the problems can go further than this, since you may even pick up the wrong covariance and hence use inappropriate regression coefficients to adjust your estimates.
I shall illustrate this point using clinical trials in asthma.
Breathing lessons
Suppose that I design a clinical trial in asthma as follows. I have six centres, each centre has four patients, each patient will be studied in two episodes of seven days and during these seven days the patients will be measured daily, that is to say, seven times per episode. I assume that between the two episodes of treatment there is a period of some days in which no measurements are taken. In the context of a crossover trial, which I may or may not decide to run, such a period is referred to as a washout period.
The block structure is like this:
Centres/Patients/Episodes/Measurements
The / sign is a nesting operator and it shows, for example, that I have Patients ‘nested’ within centres. For example, I could label the patients 1 to 4 in each centre, but I don’t regard patient 3 (say) in centre 1 as being somehow similar to patient 3 in centre 2 and patient 3 in centre 3 and so forth. Patient is a term that is given meaning by referring it to centre.
The block structure is shown in Figure 1, which does not, however, show the seven measurements per episode.
Figure 1. Schematic representation of the block structure for some possible clinical trials. The six centres are shown by black lines. For each centre there are four patients shown by blue lines and each patient is studied in two episodes, shown by red lines.
I now wish to compare two treatments, two so-called beta-agonists. The first of these I shall call Zephyr and the second Mistral. I shall do this using a measure of lung function called forced expiratory volume in one second (FEV$_1$). If there are no dropouts and no missing measurements, I shall have 6 x 4 x 2 x 7 = 336 FEV$_1$ readings. Is this my ‘n’?
I am going to use Genstat®, a package that fully incorporates John Nelder’s ideas of general balance[1, 2] and the analysis of designed experiments and uses, in fact, what I have called the Rothamsted approach to experiments.
I start by declaring the block structure thus
BLOCKSTRUCTURE Centre/Patient/Episode/Measurement
This is the ‘null’ situation: it describes the variation in the experimental material before any treatment is applied. I can ask Genstat® to do a ‘null’ skeleton analysis of variance for me by typing the statement
ANOVA
The output is as given in Table 1.
Analysis of variance
Source of variation  d.f. 
Centre stratum  5 
Centre.Patient stratum  18 
Centre.Patient.Episode stratum  24 
Centre.Patient.Episode.Measurement stratum  288 
Total  335 
Table 1. Degrees of freedom for a null analysis of variance for a nested block structure.
This only gives me possible sources of variation and degrees of freedom associated with them but not the actual variances: that would require data. There are six centres, so five degrees of freedom between centres. There are four patients per centre, so three degrees of freedom per centre between patients but there are six centres and therefore 6 x 3 = 18 in total. There are two episodes per patient and so one degree of freedom between episodes per patient but there are 24 patients and so 24 degrees of freedom in total. Finally, there are seven measurements per episode and hence six degrees of freedom but 48 episodes in total so 48 x 6 = 288 degrees of freedom for measurements.
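The degrees-of-freedom arithmetic above can be checked mechanically. The following is a minimal Python sketch (not Genstat, and not how Genstat computes its output) that assumes a fully nested, balanced structure; the function name is illustrative only.

```python
from math import prod

# Levels of the nested block structure and the number of units of each
# level within one unit of the level above (values taken from the text).
levels = [("Centre", 6), ("Patient", 4), ("Episode", 2), ("Measurement", 7)]

def null_degrees_of_freedom(levels):
    """Return (stratum name, d.f.) pairs for a fully nested, balanced structure."""
    rows = []
    blocks_above = 1            # number of blocks at the level above
    name_so_far = []
    for name, size in levels:
        name_so_far.append(name)
        # (size - 1) d.f. within each block of the level above
        rows.append((".".join(name_so_far), blocks_above * (size - 1)))
        blocks_above *= size
    return rows

strata = null_degrees_of_freedom(levels)
total_units = prod(size for _, size in levels)

for name, df in strata:
    print(f"{name} stratum: {df}")
print(f"Total: {total_units - 1}")
```

The stratum degrees of freedom sum to the total, 335, which is one fewer than the 336 readings.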
Having some actual data would put flesh on the bones of this skeleton by giving me some mean square errors, but to understand the general structure this is not necessary. It tells me that at the highest level I will have variation between centres, next patients within centres, after that episodes within patients and finally measurements within episodes. Which of these are relevant to judging the effect of any treatments I wish to study depends on how I allocate treatments.
Design matters
I now consider three possible approaches to allocating treatments to patients. In each of the three designs, the same number of measurements will be available for each treatment. There will be 168 measurements under Zephyr and 168 measurements under Mistral and thus 336 in total. However, as I shall show, the designs will be very different, and this will lead to different analyses being appropriate and lead us to understand better what our n is.
I shall also suppose that we are interested in causal analysis rather than prediction. That is to say, we are interested in estimating the effect that the treatments did have (actually, the difference in their effects) in the trial that was actually run. The matter of predicting what would happen in future to other patients is much more delicate and raises other issues and I shall not address it here, although I may do so in future. For further discussion see my paper Added Values[3].
In the first experiment, I carry out a so-called cluster-randomised trial. I choose three centres at random and all patients in the three centres chosen receive Zephyr, in both episodes and on all occasions. For the other three centres, all patients on all occasions receive Mistral. I create a factor Treatment (cluster trial) (Cluster for short) that encodes this allocation, so that the pattern of allocation to Zephyr or Mistral reflects this randomised scheme.
In the second experiment, I carry out a parallel group trial blocking by centre. In each centre, I choose two patients to receive Zephyr and two to receive Mistral. Thus, overall, there are 6 x 2 = 12 patients on each treatment. I create a factor Treatment (parallel trial) (Parallel for short) to reflect this.
The third experiment consists of a crossover trial. Each patient is randomised to one of two sequences, either receiving Zephyr in episode one and Mistral in episode two, or vice versa. Each patient receives both treatments so that there will be 6 x 4 = 24 patients given each treatment. I create a factor Treatment (crossover trial) (Crossover for short) to encode this.
Note that the total number of measurements obtained is the same for each of the three schemes. For the cluster randomised trial, a given treatment will be studied in three centres, each of which has four patients, each of whom will be studied in two episodes on seven occasions. Thus, we have 3 x 4 x 2 x 7 = 168 measurements per treatment. For the parallel group trial, 12 patients are studied for a given treatment in two episodes, each providing 7 measurements. Thus, we have 12 x 2 x 7 = 168 measurements per treatment. For the crossover trial we have 24 patients, each of whom will receive a given treatment in one episode (either episode one or two), so we have 24 x 1 x 7 = 168 measurements per treatment.
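The three allocation schemes and the counts just given can be sketched in Python as follows. This is an illustration under stated assumptions, not the Genstat factors themselves: the helper names are invented, the seed is arbitrary, and the crossover sequence is assumed to be chosen by an independent coin flip per patient, since the text does not say whether sequences are balanced.

```python
import random
from collections import Counter
from itertools import product

random.seed(1)  # arbitrary seed, only so that the sketch is reproducible

CENTRES, PATIENTS, EPISODES, MEASUREMENTS = 6, 4, 2, 7
TREATMENTS = ("Zephyr", "Mistral")

def half_and_half(units):
    """Randomly assign half of the units to each treatment."""
    shuffled = random.sample(list(units), len(units))
    half = len(shuffled) // 2
    return {u: TREATMENTS[0] if i < half else TREATMENTS[1]
            for i, u in enumerate(shuffled)}

def other(treatment):
    """Return the treatment that is not the one given."""
    return TREATMENTS[1] if treatment == TREATMENTS[0] else TREATMENTS[0]

# Cluster trial: randomise whole centres.
cluster_by_centre = half_and_half(range(CENTRES))

# Parallel group trial: randomise patients within each centre.
parallel_by_patient = {}
for c in range(CENTRES):
    for p, t in half_and_half(range(PATIENTS)).items():
        parallel_by_patient[(c, p)] = t

# Crossover trial: randomise each patient to a sequence
# (assumption: an independent coin flip per patient).
crossover_first = {(c, p): random.choice(TREATMENTS)
                   for c in range(CENTRES) for p in range(PATIENTS)}

# Count measurements per treatment under each design.
counts = {"Cluster": Counter(), "Parallel": Counter(), "Crossover": Counter()}
for c, p, e, _ in product(range(CENTRES), range(PATIENTS),
                          range(EPISODES), range(MEASUREMENTS)):
    counts["Cluster"][cluster_by_centre[c]] += 1
    counts["Parallel"][parallel_by_patient[(c, p)]] += 1
    first = crossover_first[(c, p)]
    counts["Crossover"][first if e == 0 else other(first)] += 1

for design, tally in counts.items():
    print(design, dict(tally))
```

Whatever the random draw, each design yields 168 measurements per treatment, because every scheme splits its randomised units evenly or, in the crossover case, gives each patient both treatments.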
Thus, from one point of view the n in the data is the same for each of these three designs. However, each of the three designs provides very different amounts of information and this alone should be enough to warn anybody against assuming that all problems of precision can be solved by increasing the number of data.
Controlled Analysis
Before collecting any data, I can analyse this scheme and use Nelder’s approach to tell me where the information is in each scheme.
Using the three factors to encode the corresponding allocation, I now ask Genstat® to prepare a dummy analysis of variance (in advance of having collected any data) as follows. All I need to do is type a statement of the form
TREATMENTSTRUCTURE Design
ANOVA
where Design is set equal to Cluster, Parallel or Crossover, as the case may be. The result is shown in Table 2.
Analysis of variance
Source of variation  d.f. 
Centre stratum  
Treatment (cluster trial)  1 
Residual  4 
Centre.Patient stratum  
Treatment (parallel trial)  1 
Residual  17 
Centre.Patient.Episode stratum  
Treatment (crossover trial)  1 
Residual  23 
Centre.Patient.Episode.Measurement stratum  288 
Total  335 
Table 2. Analysis of variance skeleton for three possible designs using the block structure given in Table 1.
This shows us that the three possible designs will have quite different degrees of precision associated with them. Since, for the cluster trial, any given centre only receives one of the treatments, the variation between centres affects the estimate of the treatment effect and its standard error must reflect this. Since, however, the parallel trial balances treatments by centres, it is unaffected by variation between centres. It is, however, affected by variation between patients. This variation is, in turn, eliminated by the crossover trial, which, in consequence, is only affected by variation between episodes (although this variation will, itself, inherit variation from measurements). Each higher level of variation inherits variation from the lower levels but adds its own.
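One informal way to see where each treatment factor lands in Table 2 is to look for the coarsest block level within which the treatment never varies. The Python sketch below does this for deterministic stand-ins for the three randomised allocations. It is a heuristic that happens to work for these balanced designs, not Nelder's general-balance algorithm, and the function names are illustrative.

```python
from itertools import product

CENTRES, PATIENTS, EPISODES = 6, 4, 2

# Deterministic stand-ins for the randomised allocations (assumed patterns).
def cluster(c, p, e):
    return "Zephyr" if c < 3 else "Mistral"          # whole centres

def parallel(c, p, e):
    return "Zephyr" if p < 2 else "Mistral"          # patients within centres

def crossover(c, p, e):
    return "Zephyr" if (p + e) % 2 == 0 else "Mistral"  # episodes within patients

# (stratum name, depth of the block key, null d.f. from Table 1), coarsest first.
STRATA = [("Centre", 1, 5),
          ("Centre.Patient", 2, 18),
          ("Centre.Patient.Episode", 3, 24)]

def treatment_stratum(allocation):
    """Return the coarsest stratum within whose blocks treatment is constant,
    together with the residual d.f. left after the 1 d.f. treatment contrast."""
    for name, depth, df in STRATA:
        seen = {}
        for unit in product(range(CENTRES), range(PATIENTS), range(EPISODES)):
            seen.setdefault(unit[:depth], set()).add(allocation(*unit))
        if all(len(treatments) == 1 for treatments in seen.values()):
            return name, df - 1
    # Not reached here: at depth 3 every block is a single episode.

for design in (cluster, parallel, crossover):
    print(design.__name__, treatment_stratum(design))
```

The residual degrees of freedom returned (4, 17 and 23) are those shown in Table 2.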
Note, however, that for all three designs the unbiased estimate of the treatment effect is the same. All that is necessary is to average the 168 measurements under Zephyr and the 168 under Mistral and calculate the difference. It is the estimate of the appropriate variation in the estimate that varies.
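A small simulation may make this concrete. The Python sketch below assumes a simple additive model with independent normal effects for centres, patients, episodes and measurements, all with standard deviation 1 (values chosen purely for illustration), and no true treatment difference. The same simulated readings are averaged under fixed stand-ins for each of the three allocations, so only the design differs.

```python
import random
from statistics import pvariance

random.seed(2024)  # arbitrary seed so that the sketch is reproducible

CENTRES, PATIENTS, EPISODES, MEASUREMENTS = 6, 4, 2, 7
PER_ARM = CENTRES * PATIENTS * EPISODES * MEASUREMENTS // 2   # 168
SD_CENTRE = SD_PATIENT = SD_EPISODE = SD_MEASUREMENT = 1.0    # assumed values
DESIGNS = ("Cluster", "Parallel", "Crossover")

def is_zephyr(design, c, p, e):
    """Fixed stand-ins for the randomised allocations of the three designs."""
    if design == "Cluster":
        return c < CENTRES // 2          # whole centres
    if design == "Parallel":
        return p < PATIENTS // 2         # patients within centres
    return (p + e) % 2 == 0              # episodes within patients

def simulate_estimates():
    """Simulate one set of 336 readings with no true treatment difference and
    return the Zephyr-minus-Mistral difference in means for each design."""
    totals = {d: {True: 0.0, False: 0.0} for d in DESIGNS}
    for c in range(CENTRES):
        centre = random.gauss(0, SD_CENTRE)
        for p in range(PATIENTS):
            patient = random.gauss(0, SD_PATIENT)
            for e in range(EPISODES):
                episode = random.gauss(0, SD_EPISODE)
                for _ in range(MEASUREMENTS):
                    y = centre + patient + episode + random.gauss(0, SD_MEASUREMENT)
                    for d in DESIGNS:
                        totals[d][is_zephyr(d, c, p, e)] += y
    return {d: (totals[d][True] - totals[d][False]) / PER_ARM for d in DESIGNS}

runs = [simulate_estimates() for _ in range(1000)]
variances = {d: pvariance([r[d] for r in runs]) for d in DESIGNS}
for d in DESIGNS:
    print(f"{d}: simulated variance of the estimate = {variances[d]:.3f}")
```

Under these assumed variance components the estimator is the same simple difference of two means of 168 readings in every case, yet its simulated variance is largest for the cluster trial, smaller for the parallel group trial and smallest for the crossover trial.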
Suppose that, more generally, we have m centres, with n patients per centre and p episodes per patient, with the number of measurements per episode fixed. Then for the crossover trial the variance of our estimate will be proportional to $\sigma_E^2/(mnp)$, where $\sigma_E^2$ is the variance between episodes. For the parallel group trial, there will be a further term involving $\sigma_P^2/(mn)$, where $\sigma_P^2$ is the variance between patients. Finally, for the cluster randomised trial there will be a further term involving $\sigma_C^2/m$, where $\sigma_C^2$ is the variance between centres.
The consequence of this is that you cannot decrease the variance of a cluster randomised trial indefinitely simply by increasing the number of patients; it is centres you need to increase. Nor can you decrease the variance of a parallel group trial indefinitely by increasing the number of episodes; it is patients you need to increase.
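The floor on the variance can be illustrated numerically. The Python sketch below assumes a balanced design with equal allocation to the two treatments and independent variance components; it also splits out a separate measurement-level component with q measurements per episode, which is an assumption added here for illustration. The factor of 4 arises because each treatment mean is based on half of the relevant units.

```python
def variance_of_difference(design, m, n, p, q,
                           var_centre, var_patient, var_episode, var_measurement):
    """Variance of the difference in treatment means for a balanced design.

    m centres, n patients per centre, p episodes per patient and
    q measurements per episode, with half the relevant units on each treatment.
    """
    v = var_episode / (m * n * p) + var_measurement / (m * n * p * q)
    if design in ("Parallel", "Cluster"):
        v += var_patient / (m * n)       # patients no longer cancel out
    if design == "Cluster":
        v += var_centre / m              # centres no longer cancel out
    return 4 * v

# Cluster trial with all variance components set to 1 (assumed values):
# adding patients per centre cannot push the variance below 4 * var_centre / m.
for n in (4, 40, 400, 4000):
    v = variance_of_difference("Cluster", m=6, n=n, p=2, q=7,
                               var_centre=1, var_patient=1,
                               var_episode=1, var_measurement=1)
    print(f"n = {n:5d} patients per centre: variance = {v:.4f}")

# Adding centres instead does keep reducing it.
for m in (6, 60, 600):
    v = variance_of_difference("Cluster", m=m, n=4, p=2, q=7,
                               var_centre=1, var_patient=1,
                               var_episode=1, var_measurement=1)
    print(f"m = {m:5d} centres: variance = {v:.4f}")
```

The same function can be called with "Parallel" and a growing p to see the analogous floor set by the variance between patients.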
Degrees of Uncertainty
Why should this matter? Why should it matter how certain we are about anything? There are several reasons. Bayesian statisticians need to know what relative weight to give their prior belief and the evidence from the data. If they do not, they do not know how to produce a posterior distribution. If they do not know what the variances of both data and prior are, they don’t know the posterior variance. Frequentists and Bayesians are often required to combine evidence from various sources as, say, in a so-called meta-analysis. They need to know what weight to give to each and again to assess the total information available at the end. Any rational approach to decision-making requires an appreciation of the value of information. If one had to make a decision with no further prospect of obtaining information based on a current estimate it might make little difference how precise it was but if the option of obtaining further information at some cost applies, this is no longer true. In short, estimation of uncertainty is important. Indeed, it is a central task of statistics.
Finally, there is one further point that is important. What applies to variances also applies to covariances. If you are adjusting for a covariate using a regression approach, then the standard estimate of the coefficient of adjustment will involve a covariance divided by a variance. Just as there can be variances at various levels there can be covariances at various levels. It is important to establish which is relevant[4]; otherwise you will calculate the adjustment incorrectly.
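A toy example shows how the slope can depend on the level at which the covariance is taken. The numbers in the Python sketch below are invented purely for illustration: three hypothetical patients, each with a (covariate, outcome) pair in each of two episodes. The slope is computed between patients (from patient means), within patients (from deviations about patient means) and naively from all six points pooled.

```python
from statistics import mean

# Invented (covariate, outcome) pairs: two episodes for each of three patients.
data = {
    "patient A": [(1.0, 2.5), (3.0, 1.5)],
    "patient B": [(4.0, 5.5), (6.0, 4.5)],
    "patient C": [(7.0, 8.5), (9.0, 7.5)],
}

def slope(pairs):
    """Least-squares slope: a covariance divided by a variance."""
    xs, ys = zip(*pairs)
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in pairs)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    return sxy / sxx

# Between-patient level: one point per patient, the patient means.
patient_means = [(mean(x for x, _ in pairs), mean(y for _, y in pairs))
                 for pairs in data.values()]
between_slope = slope(patient_means)

# Within-patient level: deviations of each episode from its patient mean.
deviations = []
for pairs in data.values():
    x_bar = mean(x for x, _ in pairs)
    y_bar = mean(y for _, y in pairs)
    deviations.extend((x - x_bar, y - y_bar) for x, y in pairs)
within_slope = slope(deviations)

# Ignoring the structure altogether.
pooled_slope = slope([pair for pairs in data.values() for pair in pairs])

print(f"between patients: {between_slope:.3f}")
print(f"within patients:  {within_slope:.3f}")
print(f"pooled:           {pooled_slope:.3f}")
```

In this constructed dataset the between-patient slope is positive and the within-patient slope is negative, while the pooled slope is a mixture of the two that corresponds to neither level.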
Consequences
Just because you have many data does not mean that you will come to precise conclusions: the variance of the effect estimate may not, as one might naively suppose, be inversely proportional to the number of data, but to some other much rarer feature in the dataset. Failure to appreciate this has led to excessive enthusiasm for the use of synthetic patients and historical controls as alternatives to concurrent controls. However, the relevant dominating component of variation is that between studies, not between patients. This does not shrink to zero as the number of subjects goes to infinity. It does not even shrink to zero as the number of studies goes to infinity, since, if the current study is the only one in which the new treatment is given, the relevant variance for that arm is at least $\sigma_S^2/1$, where $\sigma_S^2$ is the variance between studies, even if, for the ‘control’ dataset, it may be negligible thanks to data collected from many subjects in many studies.
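The point about historical controls can be written out under a simple random-effects model, assumed here purely for illustration. Suppose the new treatment is studied in one current study with $n_{\text{new}}$ patients, the controls come from $k$ equally weighted historical studies of $n_{\text{hist}}$ patients each, study effects are independent with variance $\sigma_S^2$ and patients within a study vary with variance $\sigma_W^2$. All of these symbols are introduced for this sketch. Then

```latex
\operatorname{Var}\left(\bar{y}_{\text{new}} - \bar{y}_{\text{hist}}\right)
  = \sigma_S^2\left(\frac{1}{1} + \frac{1}{k}\right)
  + \sigma_W^2\left(\frac{1}{n_{\text{new}}} + \frac{1}{k\,n_{\text{hist}}}\right)
  \;\longrightarrow\; \sigma_S^2
  \quad\text{as } n_{\text{new}},\, n_{\text{hist}},\, k \to \infty .
```

However many patients and historical studies are added, the term $\sigma_S^2/1$ contributed by the single current study remains.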
There is a lesson also for epidemiology here. All too often, the argument in the epidemiological, and more recently, the causal literature has been about which effects one should control for or condition on without appreciating that merely stating what should be controlled for does not solve how. I am not talking here about the largely sterile debate, to which I have contributed myself[5] as to how at a given level, adjustment should be made for possible confounders (for example, propensity score or linear model), but to the level at which such adjustment can be made. The usual implicit assumption is that an observational study is somehow a deficient parallel group trial, with maybe complex and perverse allocation mechanisms that must somehow be adjusted for, but that once such adjustments have been made, precision increases as the subjects increase. But suppose the true analogy is a cluster randomised trial. Then, whatever you adjust for, your standard errors will be too small.
Finally, it is my opinion that much of the discussion about Lord’s paradox would have benefitted from an appreciation of the issue of components of variance. I am used to informing medical clients that saying we will analyse the data using analysis of variance is about as useful as saying we will treat the patients with a pill. The varieties of analysis of variance are legion and the same is true of analysis of covariance. So, you conditioned on the baseline values. Bravo! But how did you condition on them? If you used a slope obtained at the wrong level of the data then, except fortuitously, your adjustment will be wrong, as will the precision you claim for it.
Finally, if I may be permitted an autoquote, the price one pays for not using concurrent control is complex and unconvincing mathematics. That complexity may be being underestimated by those touting ‘big data’.
References
1. Nelder JA. The analysis of randomised experiments with orthogonal block structure I. Block structure and the null analysis of variance. Proceedings of the Royal Society of London. Series A 1965; 283: 147-162.
2. Nelder JA. The analysis of randomised experiments with orthogonal block structure II. Treatment structure and the general analysis of variance. Proceedings of the Royal Society of London. Series A 1965; 283: 163-178.
3. Senn SJ. Added Values: Controversies concerning randomization and additivity in clinical trials. Statistics in Medicine 2004; 23: 3729-3753.
4. Kenward MG, Roger JH. The use of baseline covariates in crossover studies. Biostatistics 2010; 11: 1-17.
5. Senn SJ, Graf E, Caputo A. Stratification for the propensity score compared with linear regression techniques to assess the effect of treatment or exposure. Statistics in Medicine 2007; 26: 5529-5544.
Some relevant blogposts
Lord’s Paradox:
Personalized Medicine:
 (01/30/18) S. Senn: Evidence Based or Person-centred? A Statistical debate (Guest Post)
 (7/11/18) S. Senn: Personal perils: are numbers needed to treat misleading us as to the scope for personalised medicine? (Guest Post)
 (07/26/14) S. Senn: “Responder despondency: myths of personalized medicine” (Guest Post)
Randomisation:
 (07/01/17) S. Senn: Fishing for fakes with Fisher (Guest Post)