In a previous post I considered Lord’s paradox from the perspective of the ‘Rothamsted School’ and its approach to the analysis of experiments. I now illustrate this in some detail giving an example.
What I shall do
I have simulated data from an experiment in which two diets have been compared in 20 student halls of residence, each diet having been applied to 10 halls. I shall assume that the halls have been randomly allocated the diet and that in each hall 10 students have been randomly chosen to have their weights recorded at the beginning of the academic year and again at the end.
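The set-up can be sketched in a few lines of code. This is a minimal illustration, not the simulation behind the workbook: the variance components, baseline mean and diet effect used below are my own assumptions, chosen only to give the data the right structure (a between-hall component shared by students in the same hall, plus within-hall noise).

```python
import numpy as np

rng = np.random.default_rng(1)

# Assumed parameters, for illustration only
n_halls, per_hall = 20, 10
rows = []
for hall in range(1, n_halls + 1):
    diet = "A" if hall <= 10 else "B"        # 10 halls per diet (randomised in the real study)
    hall_shift = rng.normal(0, 2.0)          # between-hall component (assumed SD)
    for student in range(per_hall):
        base = 75 + hall_shift + rng.normal(0, 5.0)   # autumn weight, kg
        gain = 1.0 + (2.7 if diet == "B" else 0.0)    # assumed diet effect
        weight = base + gain + rng.normal(0, 3.0)     # end-of-year weight
        rows.append((hall, diet, base, weight))
```

The key feature is that `hall_shift` is drawn once per hall, so students in the same hall share it: this is what makes variation between halls larger than variation within them.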
I shall then compare two approaches to analysing these data and invite the reader to consider which is correct.
I shall then discuss (briefly) what happens to these approaches to analysis when the problem is changed so that we now have not 10 halls with 10 students each per diet but one hall per diet with 100 students each. This is the Lord’s paradox problem (Lord, F. M., 1967) in the form proposed by Wainer and Brown (Wainer, H. & Brown, L. M., 2004). We shall see that one of the philosophies of analysis will indicate that this more difficult case cannot be analysed. The other will produce an analysis that has been previously proposed as the right analysis for Lord’s paradox. I shall then consider what changes are necessary (if any) if we have an observational rather than an experimental set-up.
The data are saved in an Excel workbook here. The first sheet (Experiment_1) gives weights at the beginning and the end of each academic year for each student, as well as the hall they were in (numbered 1 to 20), the diet they were given (A or B) and a unique student identification number (1 to 200). The second sheet (Summary_1) consists of mean weights per hall, averaged over the students enrolled in that hall and included in the study.
The approach to analysis
I shall use Genstat’s approach to analysing designed experiments. This is based on John Nelder’s theory of 1965 (Nelder, J. A., 1965a, 1965b) and declares block structure and treatment structure separately. The analyses will only differ as regards the block structure declared, although in one case I can produce an identical analysis using the so-called summary measures approach.
The first analysis
The code looks like this
BLOCKSTRUCTURE Hall/Student
TREATMENTSTRUCTURE Diet
COVARIATE Base
ANOVA [PRINT=aovtable,effects,covariates; FACT=1; FPROB=yes] Weight
The first statement defines the block structure: students are ‘nested’ within halls, which is written Hall/Student. The second states that the (putative) causal factor, the treatment, is Diet. The third declares a covariate, Base, the baseline weight, to be taken account of, and the fourth says that the outcome variable is Weight, that is to say the weight at the end of the experiment.
The analysis that is produced is now given in Figure 1.
Figure 1 Analysis of the diet data respecting the block structure
I have highlighted the hall stratum and the diet term. Note that there are two residual terms. The first appears in the hall stratum and the second in the students within-halls stratum. Since the diet given is varied between halls but not within, only the former is relevant for judging the effect of diet.
The term v.r. stands for variance ratio. It is the ratio of the mean square (m.s.) for diet (367.3) to the variation term that matters (29.9), the residual for the hall stratum. The ratio is 12.3, so the analysis tells us that the variation between diets is about 12 times what you would expect given the variation between halls given the same diet.
Note also that there are two covariate terms: both between halls and between students-within-halls. The latter is also irrelevant to any analysis of the effect of diet.
Now consider a second, equivalent analysis. This just uses the average at baseline and outcome per hall. In other words, it is based on 20 pairs of values (baseline and outcome), not 200. This analysis produces the table in Figure 2. Note that the result is exactly as before, showing the irrelevance of the variances and covariances within halls. That is to say, although the mean squares change, because they are now based on averages of 10 students per hall, the ratio of the treatment term to its residual is the same, and so are all inferences. The equivalence of summary-measures approaches to more complex models for certain balanced cases is well known (Senn, S. J. et al., 2000).
Note that for both of these equivalent analyses the residual degrees of freedom are 17: there are 20 halls and one degree of freedom has been used for each of grand mean, covariate and treatment, leaving 17. The variation between diets is judged by the variation between halls.
Figure 2 Summary measures analysis respecting block structure
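The summary-measures calculation is easy to reproduce in outline: average baseline and outcome per hall, then fit an analysis of covariance to the 20 hall means, with terms for the grand mean, the baseline covariate and the diet, leaving 20 − 3 = 17 residual degrees of freedom. A sketch with stand-in simulated hall means (the numbers are invented, not those of the workbook):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in hall means: 20 halls, first 10 on diet A, last 10 on diet B
base_mean = 75 + rng.normal(0, 2, 20)           # mean baseline weight per hall
diet = np.repeat([0.0, 1.0], 10)                # 0 = A, 1 = B
weight_mean = 1.0 + 0.6 * base_mean + 2.7 * diet + rng.normal(0, 1, 20)

# ANCOVA on the 20 hall means: grand mean + covariate + treatment
X = np.column_stack([np.ones(20), base_mean, diet])
coef, rss, rank, _ = np.linalg.lstsq(X, weight_mean, rcond=None)
df_resid = 20 - rank                            # 20 halls minus 3 fitted terms
print(df_resid)                                 # 17
```

The residual degrees of freedom come out at 17, matching the count in the text: 20 halls, less one degree of freedom each for grand mean, covariate and treatment.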
The second analysis
This uses a different block structure. We now ignore the fact that the students are in different halls. The code becomes
BLOCKSTRUCTURE Student
TREATMENTSTRUCTURE Diet
COVARIATE Base
ANOVA [PRINT=aovtable,effects,covariates; FACT=1; FPROB=yes] Weight
and the output is as given in Figure 3.
Figure 3 Analysis ignoring the block structure
Note that the residual used to judge the effect of diet is now based on 197 degrees of freedom and it is less than a quarter of what it was before (6.3 as opposed to 29.9). The numerator of the variance ratio is somewhat similar to what it was before (a different covariate term has been used to adjust it so there is some difference) but the variance ratio is now five times what it was. The result is much more impressive.
Which analysis is right?
A long tradition says that the first analysis is right and the second is wrong. In a clinical context, the experiment has a cluster randomised form. The regulators, the EMA and the FDA, will not let drug sponsors analyse cluster randomised trials as if they were parallel group trials but this is what the second analysis will do.
“No causality in, no causality out” is a common slogan but the actual intervention here did not take place independently at the level of students but at the level of halls. It is this variation (between halls) that should be used to judge treatment. Speaking practically, the halls may be situated at different distances from lecture theatres on campus so that exercise effects may be different. Some may be closer to food shops and so forth. One can imagine many effects independent of the diet offered that would vary at the level of hall but not at the level of students within halls.
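The consequence of ignoring the block structure can be shown numerically. Below is a hedged sketch, not the analysis of the workbook data: it simulates cluster-randomised data with an assumed between-hall component, then computes the model-based standard error of the diet effect twice, once treating the 200 students as independent and once from the 20 hall means. (To keep the sketch short it omits the baseline covariate.)

```python
import numpy as np

rng = np.random.default_rng(2)

def se_of_last_coef(y, X):
    """Model-based standard error of the last coefficient from an OLS fit."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    sigma2 = resid @ resid / (len(y) - X.shape[1])
    cov = sigma2 * np.linalg.inv(X.T @ X)
    return np.sqrt(cov[-1, -1])

# Simulated cluster-randomised data: 20 halls x 10 students, diet varied by hall
hall = np.repeat(np.arange(20), 10)
diet = (hall >= 10).astype(float)
hall_shift = rng.normal(0, 2.0, 20)[hall]       # shared between-hall component
y = 80 + 2.7 * diet + hall_shift + rng.normal(0, 3.0, 200)

# Student-level analysis, ignoring halls
se_student = se_of_last_coef(y, np.column_stack([np.ones(200), diet]))

# Hall-level (summary measures) analysis on the 20 hall means
y_hall = y.reshape(20, 10).mean(axis=1)
X_hall = np.column_stack([np.ones(20), (np.arange(20) >= 10).astype(float)])
se_hall = se_of_last_coef(y_hall, X_hall)

print(se_student, se_hall)   # student-level SE is markedly smaller
```

Whenever the between-hall component is non-zero, the student-level standard error is too small, which is exactly the flattering effect seen in the second analysis.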
But isn’t this a red herring?
I have considered a randomised experiment involving many halls. It differs from the situation of the paradox in two respects: first, there were only two halls and, second, the diet was not randomised. We can summarise these differences as ‘two not many’ and ‘observational not experimental’. I consider these in turn.
Two not many
There are only two halls in the Lord’s paradox case. This means that analysis one is impossible and only analysis two is possible, an analysis that has previously been proposed as being right for Lord’s paradox. You cannot estimate the relevant variances and covariances for approach one if you only have two halls. (See my original blog on the subject.) I have no objection to analysts defending approach two on the grounds that this is all that can be done if an analysis is to be done. In fact, I have even given this analysis some (lukewarm) support in the past (Senn, S. J., 2006). However, two points are important. First, it should be recognised that a third choice is being overlooked: that of saying that the data are simply too ambiguous to offer any analysis. Second, it should be made explicit that the analysis is valid only on the assumption that there are no between-hall variance and covariance elements beyond those seen within halls, and it should be made clear that this is a strong but untestable assumption.
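The impossibility has a simple algebraic face: with diet varied between only two halls, the diet contrast and the hall contrast are the same column of the design matrix, so a model with both is rank-deficient and the between-hall residual has zero degrees of freedom. A tiny illustration (the layout is hypothetical, chosen to mirror the two-hall case):

```python
import numpy as np

# Two halls of 100 students each; diet is confounded with hall,
# so the diet and hall indicator columns are identical
diet = np.repeat([0.0, 1.0], 100)
hall = np.repeat([0.0, 1.0], 100)
X = np.column_stack([np.ones(200), diet, hall])

print(np.linalg.matrix_rank(X))   # 2, not 3: diet and hall effects are aliased
```

No amount of within-hall data resolves this aliasing, which is why approach one cannot be carried out and why approach two rests on the untestable assumption above.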
Observational not experimental
I can think of no valid reason why analysis two could become valid for an observational set-up if it was not valid for the experimental one. I can imagine the reverse being the case, but to claim that an invalid analysis of an experiment would suddenly become valid if only it had not been randomised, despite the fact that no different or further data of any kind were available, strikes me as a very unpromising line of defence. Thus, I consider this the real red herring.
Does it make a difference?
In this case the between-halls regression is very similar to the within-halls regression: the slope in the first case is 0.57 and in the second is 0.55. Furthermore, the means at baseline for the two diets are also very similar: 75.1kg and 76.1kg. This means that the estimates of the diet effects (B compared to A) are nearly identical: 2.7kg versus 2.8kg. The situation is illustrated in Figure 4, the estimated treatment effect being the difference between the corresponding pair of parallel lines.
Figure 4 Two analyses of covariance. Red = Diet A, Black = Diet B. Open circles and dashed sloping lines, students. Closed circles and solid sloping lines, halls. Vertical dashed lines indicate mean weights per diet at baseline.
It might be concluded that the distinction is irrelevant. Such a conclusion would be false. Even in this case, where the estimates do not differ, the standard errors of the estimates are radically different. For the between-halls analysis the standard error is 0.78kg. For the within-halls analysis it is 0.36kg. The relative evidential weight of the two, say for updating a prior distribution to a posterior for any Bayesian, or for combining with other evidence in any meta-analysis, is the ratio of the reciprocals of the squared standard errors, that is to say (0.78/0.36)² ≈ 4.7. The analysis based on students rather than halls overstates the evidence considerably.
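The weight calculation is a one-liner. The information carried by an estimate is proportional to 1/SE², so the ratio of the two weights reduces to the square of the ratio of the standard errors:

```python
se_between, se_within = 0.78, 0.36      # SEs from the two analyses, in kg

# Relative weight = (1/se_within^2) / (1/se_between^2) = (se_between/se_within)^2
weight_ratio = (se_between / se_within) ** 2
print(round(weight_ratio, 1))           # 4.7
```

In other words, the within-halls analysis claims nearly five times as much information as the design actually supplies.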
I maintain that thinking carefully about block structure and treatment structure as John Nelder taught us to do is the right way to think about experiments. I also think, mutatis mutandis, that it can help in thinking about some causal questions in an observational set-up. Variation can occur at many levels and this is as true of observational studies as it is of experimental ones. In making this claim it is not my intention to detract from other powerful approaches. It can be helpful to have many tools to attack such problems.
Lord, F. M. (1967). A paradox in the interpretation of group comparisons. Psychological Bulletin, 66, 304-305
Nelder, J. A. (1965a). The analysis of randomised experiments with orthogonal block structure I. Block structure and the null analysis of variance. Proceedings of the Royal Society of London. Series A, 283, 147-162
Nelder, J. A. (1965b). The analysis of randomised experiments with orthogonal block structure II. Treatment structure and the general analysis of variance. Proceedings of the Royal Society of London. Series A, 283, 163-178
Senn, S. J. (2006). Change from baseline and analysis of covariance revisited. Statistics in Medicine, 25(24), 4334–4344
Senn, S. J., Stevens, L., & Chaturvedi, N. (2000). Repeated measures in clinical trials: simple strategies for analysis using summary measures. Statistics in Medicine, 19(6), 861-877
Wainer, H., & Brown, L. M. (2004). Two statistical paradoxes in the interpretation of group differences: Illustrated with medical school admission and licensing data. American Statistician, 58(2), 117-123
Stephen: Thank you so much for the follow-up guest post!
Lord paradox is a causal problem. In the story, each statistician proposes a method of estimating the causal effect of diet on weight gain. To discuss the notion of “valid” as it relates to “causal effect” we must invoke causal vocabulary. We cannot do it in the vocabulary of variances and covariances no matter how intricate.
It is for this reason that I find myself unable to relate Stephen Senn’s post to the original paradox posed by F. Lord. Nor can I relate it to my analysis of the Lord paradox in the Book of Why http://bayes.cs.ucla.edu/WHY/
As a student of paradoxes I would need the following information to get started:
1. What did the statisticians attempt to estimate in the non-experimental case?
2. My analysis ignores completely the Rothamsted School and Nelder’s analysis. What part of the analysis would improve by attending to these sources?
3. We know that any exercise in causal analysis must rest on some untestable causal assumptions. What are the assumptions that Senn wishes us to consider in the non-experimental case?
4. We know that causal assumptions cannot be expressed in the language of statistics. What language does Senn prefer for expressing those assumptions?
5. Senn argues that his analysis “can help in thinking about some causal questions”. We are fortunate to live in an era where we no longer need such help. Causal inference permits us to formulate things mathematically and derive answers to causal questions, rather than “thinking” about them.
To answer your questions
1) The research question is “what is the effect of diet on weight?”.
2) What is missing is a) an explanation as to what conditioning on /adjusting for the baseline weight means (there are at least two possible approaches) b) an explanation as to how the standard error would be calculated.
3) If causal analysis cannot provide standard errors you cannot weight evidence. So much the worse for causal analysis. Suppose two scientists had investigated the same causal question in different studies coming to slightly different answers. How do you propose to combine the evidence?
4) Please provide the valid reason as to why one of the two answers I gave rather than the other is correct. I gave an answer.
5) Do you have a theory of the strength of evidence? John Nelder’s approach provides an answer.
In connection with the latter, I have three questions for you.
1) Can the causal calculus analyse designed experiments?
2) If so, how are the following three cases analysed? In all cases two different training programmes for students are being compared to determine the effect on weight. In all three cases the students are drawn from two halls and in all three cases the length of follow-up is the same:
i) The design is blocked by hall. One hundred students in each hall are randomly assigned to one of the two programmes, 50 students per programme per hall.
ii) The design is confounded with hall. One of the halls is chosen at random and 100 students in that hall receive one programme and 100 students in the other hall receive the other programme.
iii) The design is completely randomised. This leads to some students in each hall being allocated to one programme and some to the other, but with no guarantee that there will be 50 students on each. By the time it comes to analysis, the record of which student was in which hall has been lost, although the other data are available.
Standard statistical theory says that these three experiments provide different amounts of evidence, despite the fact that each involves 200 students. The first is the most precise and the second the least, with the third being somewhere in between the two. John Nelder’s approach, incorporated in Genstat, explains how to analyse such experiments and would award the results different standard errors. Actually, for experiment ii) no standard error can be calculated.
This leads to my third question.
3) Is it the case that in causal analysis all studies are equal? Or is there any way in which one can assess the relative claims of different estimates (each of which has been declared OK by the causal calculus)?