The Many Halls Problem
It’s not that paradox but another
“Generalization is passing…from the consideration of a restricted set to that of a more comprehensive set containing the restricted one…Generalization may be useful in the solution of problems.” George Pólya, How to Solve It, p. 108
In a previous blog https://www.linkedin.com/pulse/cause-concern-stephen-senn/ I considered Lord’s Paradox, applying John Nelder’s calculus of experiments[3, 4]. Lord’s paradox involves two different analyses of the effect of two different diets, one for each of two different student halls, on the weight of students. One statistician compares the so-called change scores or gain scores (final weight minus initial weight) and the other compares final weights, adjusting for initial weights using analysis of covariance. Since the mean initial weights vary between halls, the two analyses will come to different conclusions unless the slope of final on initial weights just happens to be one (in practice, it would usually be less). The fact that two apparently reasonable analyses lead to different conclusions constitutes the paradox. I chose the version of the paradox outlined by Wainer and Brown and also discussed in The Book of Why. I illustrated this by considering two different experiments: one in which, as in the original example, the diet varies between halls, and a further example in which it varies within halls. I simulated some data, which are available in the appendix to that blog but can also be downloaded from here http://www.senns.uk/Lords_Paradox_Simulated.xls so that any reader who wishes to try their hand at analysis can have a go.
I showed that Nelder’s calculus reveals that there is no solution if the diet is varied between halls. This is because replication is needed at the level at which treatment is varied, and two halls, one per treatment, provide insufficient replication. In this blog I shall present a simulated example that provides more replication at the level of hall. In order to keep numbers manageable, in the previous example I simulated data for 40 students in total, 20 per hall. In the new example, in order to underline the point that it is not the total number of students that is the problem, I shall stick to 40 students but will have them in ten halls, so four per hall. Of course, in practice this would be a ludicrously low number of students for any hall, but that is irrelevant to the understanding of the problem. However, if one wishes to make the example seem more practical, assume that resources to carry out measurements are limited and that, to make the study feasible, four students from each hall have been chosen at random to be followed.
I shall present the simulated data in due course but first let me dispose of some red herrings.
A tedious objection that has been raised to the example previously presented is that an observational study was originally proposed by Lord, whereas I have offered an experiment, and that my example is therefore irrelevant. This objection is just silly on several grounds. The first point to note is that Lord’s paradox is not dependent on an observational study having been conducted. The paradox is a numerical phenomenon. It will still arise if an experiment is conducted. Furthermore, the fact that an experiment has been conducted gets rid of many possible distracting factors, revealing the paradox in a purer form. The second point is that whereas one can imagine analyses that would succeed in an experiment but fail in an observational study, the reverse is not the case. Therefore, since I showed that the proposed solution to the paradox would be problematic if an experiment had been carried out, the objection cannot be avoided by switching to an observational study. The third point is that it is surely a fatal weakness of any causal theory that it has nothing to say about experiments. I think highly enough of the theory outlined in The Book of Why to conclude that it does not suffer from this defect, so it is reasonable to measure it against experiments.
A second tedious objection is that by considering a case with many halls I am not looking at Lord’s Paradox. My reply to this is to say that two is a special case (albeit a restricted one) of many. Sometimes to understand problems properly one has to generalise and, indeed, sometimes generalisation can make a solution easier. (See Pólya’s advice in the opening quotation.)
I am going to consider a many halls problem (not to be confused with the famous Monty Hall problem). I simulated data for ten halls, five being allocated diet A and five being allocated diet B. In each hall four students were followed, and their initial weights, at the start of the academic year, and their final weights, at the end, were measured. Other parameters of the simulation are given in my previous blog https://www.linkedin.com/pulse/cause-concern-stephen-senn/; however, unlike in the previous blog, I avoid the complication of a second level of experimentation in which dietary advice is varied within halls. The 40 pairs of measurements are represented by the scatter plot in Figure 1 and are given in the appendix.
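For readers who wish to experiment for themselves, the design can be imitated with a short simulation. This is a sketch in Python rather than the simulation actually used: the between-hall slope of 0.5 and within-hall slope of 0.8 are the values reported below, but the variance components and the 1 kg diet effect are illustrative choices of my own, not the parameters of the original simulation.

```python
import numpy as np

rng = np.random.default_rng(42)

n_halls, n_students = 10, 4          # ten halls, four students followed per hall
diet = np.repeat(["A", "B"], 5)      # diet varies between halls: five halls each

# Initial weight (kg) split into a hall-level and a student-level component
hall_x = rng.normal(0, 3, n_halls)                    # hall effect on initial weight
student_x = rng.normal(0, 4, (n_halls, n_students))   # within-hall variation
X = 70 + hall_x[:, None] + student_x

# Final weight: slope 0.5 on the hall component, slope 0.8 on the
# student component, a diet effect of -1 kg for diet B (illustrative),
# and residual noise.
diet_effect = np.where(diet == "B", -1.0, 0.0)
Y = (70 + diet_effect[:, None]
     + 0.5 * hall_x[:, None]
     + 0.8 * student_x
     + rng.normal(0, 1, (n_halls, n_students)))

print(X.shape, Y.shape)
```

The essential feature is that treatment is applied to whole halls, so the diet effect is entangled with the hall-level component of variation.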
Blue circles represent values for students allocated to halls that used diet A and red squares values for students in halls that used diet B. The mean values per hall are represented by asterisks, with blue used for diet A and red for diet B.
I can now analyse the values using John Nelder’s calculus of experiments as implemented in Genstat®. The first statement required informs the algorithm that the ‘experimental material’ consists of students within halls; the second declares the initial weight X (labelled Initial in the output) as a covariate; the third declares the treatment factor of interest, which I have called Between because it is varied between halls (the output labels it as Diet). If I now instruct the algorithm to carry out an analysis of variance on the outcome variable Y (labelled Final in the output), as follows
ANOVA [FPROBABILITY=yes] Y
the instructions are complete. The output is given in Figure 2. The following points are important. First, the significance of the effect of treatment is judged by using the variation between halls. The output distinguishes two strata: the Hall stratum and the Hall.Student stratum. Only the former is used in judging the effect of treatment because the treatment has been varied between halls. It would be a mistake, analogous to analysing a cluster randomised trial as if it were a parallel group trial, to use the Hall.Student stratum to judge the effect of diet. The second point is that there are different regression coefficients between and within halls. The former is estimated to be 0.42 (I had set it to 0.5 for the simulation) and the latter to 0.957 (I had set it to 0.8 for the simulation). Given the level at which treatment is varied, the first of these, and not the second, is relevant for adjusting final weights. See Yoon and Welsh for a general investigation of correlations for multilevel data and Kenward and Roger for the issue as it occurs in cross-over trials. For earlier discussions of technical matters to do with more complex multi-level experiments see Zelen, Ratcliff et al. and Payne and Tobias.
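The distinction between the two regression coefficients can be seen without any special software. In a minimal Python sketch (this stands in for, and is not, the Genstat analysis), the between-hall coefficient is the slope of hall means of final weight on hall means of initial weight, while the within-hall coefficient is the slope fitted to the pooled deviations from hall means; the slopes of 0.5 and 0.8 match the text, the variance components are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulate 10 halls x 4 students with different between- and within-hall slopes
n_halls, n_students = 10, 4
hall_x = rng.normal(0, 3, n_halls)                    # hall-level component of X
student_x = rng.normal(0, 4, (n_halls, n_students))   # student-level component of X
X = 70 + hall_x[:, None] + student_x
Y = (70 + 0.5 * hall_x[:, None] + 0.8 * student_x
     + rng.normal(0, 1, (n_halls, n_students)))

def slope(x, y):
    """Least-squares slope of y on x."""
    x, y = np.ravel(x), np.ravel(y)
    return np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)

# Between-hall regression: hall means of Y on hall means of X
b_between = slope(X.mean(axis=1), Y.mean(axis=1))

# Within-hall regression: pooled deviations from hall means
b_within = slope(X - X.mean(axis=1, keepdims=True),
                 Y - Y.mean(axis=1, keepdims=True))

print(f"between-hall slope: {b_between:.2f}, within-hall slope: {b_within:.2f}")
```

The within-hall estimate recovers something close to 0.8; the between-hall estimate is attenuated away from 0.5 because hall means of X still carry some student-level variation. The point is that the two estimates answer different questions, and only the between-hall one is at the level at which treatment was varied.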
Lessons for the two-hall problem
There are two obvious lessons for the two-hall problem. The first is that variation has to be judged at the hall level, not the within-hall level. A consequence of this is that, since there are only two halls, there are not enough degrees of freedom to do this, whether a simple change-score analysis or an analysis of covariance is used. The second is that covariation also has to be judged at the between-hall level and again this cannot be estimated. A consequence of the second point is that it is no defence of a proposed solution based on using the within-halls regression coefficient to say that one is not interested in standard errors but only in the asymptotic solution. The estimate itself is wrong, unless it can be assumed that the between-hall regression will be the same as the within-hall one. The latter coefficient is estimated as 0.96 but the former as 0.42.
Note that had the design been that students from across the university had been allocated at random to one of two diets, then the conventional ANCOVA solution, which, as far as I can tell, is the one proposed in The Book of Why, would have been correct. Furthermore, if students had been allocated to one of two diets based on initial weight (with those weighing less being allocated to one diet and those weighing more to another), then, as Holland and Rubin have discussed, provided the regression may be assumed to be constant across the range of initial weight, the conventional ANCOVA solution works.
Can the ANCOVA solution be rescued?
Before discussing this, I want to make it quite clear that the change-score solution is a non-starter. It would rely on the assumption that the regression between halls was one and, in any case, estimation of the standard errors would be incorrect.
To rescue the ANCOVA solution, one device would be to move the Hall effect from the block structure to the treatment structure. One might also require some further assumptions. For example, I could consider a case where students have been randomised to halls, or perhaps allocated based on their weight, as discussed by Holland and Rubin, or I could just assume that, given initial weight, everything else is ignorable. Other assumptions regarding independence might be needed. If I do this and make Hall a fixed effect, then Genstat® will tell me that Hall is confounded with Diet, however many halls I have, provided that the diet is varied between and not within halls.
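The confounding can be checked directly from the design matrix: when diet varies only between halls, the diet column is an exact linear combination of the hall indicator columns, so adding it raises the rank not at all. A short numpy sketch (standing in for what Genstat reports, with hall and diet codings of my own choosing):

```python
import numpy as np

n_halls, n_students = 10, 4
hall = np.repeat(np.arange(n_halls), n_students)   # hall index for each student
diet = np.where(hall < 5, 0.0, 1.0)                # diet varies between halls

# Indicator (dummy) columns for the ten halls
H = np.zeros((n_halls * n_students, n_halls))
H[np.arange(hall.size), hall] = 1.0

rank_halls = np.linalg.matrix_rank(H)
rank_both = np.linalg.matrix_rank(np.column_stack([H, diet]))

# Diet adds no new dimension: it is the sum of the diet-B hall dummies,
# so a fixed-effects model cannot separate Diet from Hall.
assert rank_both == rank_halls == 10
```

The same check gives the same answer for two halls or two hundred: as long as treatment is applied to whole halls, Hall as a fixed effect absorbs Diet completely.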
Let us consider the two-hall case discussed in the previous blog. In the example I simulated, Hall 1 used diet B and Hall 2 used diet A. I might claim that I do not care what caused an effect on weight provided that I can identify there is some causal difference between the two groups. It may be diet B compared to A, it may be Hall 1 compared to 2 or it may be the combination of Hall 1 and diet B compared to Hall 2 and diet A.
This seems to me to be rather unambitious and does not get us very far in answering the Why that causal analysis is all about: it could be one thing, it could be another, or it could be both. At the very least, it seems worth a discussion for anybody claiming that the ANCOVA solution is definitely correct.
I conclude by making one point regarding conditioning on effects. When you have hierarchical datasets you can have variances and covariances at different levels and you can have more than one regression. For an analogy, think of stars moving relative to the centres of galaxies and galaxies being pulled by the Great Attractor. To say that you have conditioned on X under such circumstances is ambiguous. How have you conditioned? Or, to put it another way, “How” have you answered “Why”?
As I have put it before, the consequence is that not only is correlation not necessarily causation, it may not even be correlation.
- Pólya, G., How to Solve It: A New Aspect of Mathematical Method. 2004: Princeton University Press.
- Lord, F.M., A paradox in the interpretation of group comparisons. Psychological Bulletin, 1967. 66: p. 304-305.
- Nelder, J.A., The analysis of randomised experiments with orthogonal block structure I. Block structure and the null analysis of variance. Proceedings of the Royal Society of London. Series A, 1965. 283: p. 147-162.
- Nelder, J.A., The analysis of randomised experiments with orthogonal block structure II. Treatment structure and the general analysis of variance. Proceedings of the Royal Society of London. Series A, 1965. 283: p. 163-178.
- Wainer, H. and L.M. Brown, Two statistical paradoxes in the interpretation of group differences: Illustrated with medical school admission and licensing data. American Statistician, 2004. 58(2): p. 117-123.
- Pearl, J. and D. Mackenzie, The Book of Why. 2018: Basic Books.
- Campbell, M.J. and S.J. Walters, How to Design, Analyse and Report Cluster Randomised Trials in Medicine and Health Related Research. Statistics in Practice, ed. S. Senn. 2014, Chichester: Wiley. 247.
- Yoon, H.-J. and A.H. Welsh, On the effect of ignoring correlation in the covariates when fitting linear mixed models. Journal of Statistical Planning and Inference, 2020. 204: p. 18-34.
- Kenward, M.G. and J.H. Roger, The use of baseline covariates in crossover studies. Biostatistics, 2010. 11(1): p. 1-17.
- Zelen, M., The analysis of incomplete block designs. Journal of the American Statistical Association, 1957. 52(278): p. 204-217.
- Ratcliff, D., E. Williams, and T. Speed, A note on the analysis of covariance in balanced incomplete block designs. Australian Journal of Statistics, 1984. 26(3): p. 337-341.
- Payne, R. and R. Tobias, General balance, combination of information and the analysis of covariance. Scandinavian Journal of Statistics, 1992: p. 3-23.
- Senn, S.J., Change from baseline and analysis of covariance revisited. Statistics in Medicine, 2006. 25(24): p. 4334–4344.
- Holland, P.W. and D.B. Rubin, On Lord’s Paradox, in Principals of Modern Psychological Measurement, H. Wainer and S. Messick, Editors. 1983, Lawrence Erlbaum Associates: Hillsdale, NJ. p. 3-25.
Appendix The Simulated Data
Simulated values. X stands for initial weight and Y for final weight.
The data are available here: http://www.senns.uk/Lords_Paradox_Ten_Halls.xls
Related guest post of Senn’s on this blog: Rothamsted statistics meets Lord’s paradox (Guest Post)
Thank you so much for this intriguing guest post. I wonder if Pearl will respond in the comments as he did to a related post of yours in 2018:
Among his remarks:
“Thus, if the second statistician is not unambiguously right, the first statistician cannot be unambiguously wrong. If we exclude ‘untested assumptions’ then no statistician can be unambiguously right or unambiguously wrong.” (Pearl)
But this isn’t true in general. Your post is on a topic far removed from my expertise, but I thought your point was that John’s inference has a mistaken assumption, whereas ambiguity remains with Jane’s analysis because there are different kinds of causal claims one might be asking about. (Or is that wrong?) As you say in your post:
How have you conditioned? Or, to put it another way “How” have you answered “Why”?
Given how equivocal our “why” questions are–especially when they are to be answered at the statistical, and not the individual, level–it seems a causal analysis should retain (and not claim to eliminate) ambiguity. Neither of the two attempts might give the “right” answer. Anyway, it’s interesting that (I think) you say Pearl’s analysis is in sync with randomly assigning all students to the two diets.
Pearl continues: “On the other hand, if we accept ‘untested assumptions’ they can tell us unambiguously which statistician is right and which is wrong. In either case, one of Senn’s conclusions must be wrong.” (Pearl)
But I thought the point of the ambiguity charge is that there are different “untested assumptions” one might assume.