Previous posts[a],[b],[c] of mine have considered Lord’s Paradox. To recap, this was considered in the form described by Wainer and Brown, in turn based on Lord’s original formulation:
A large university is interested in investigating the effects on the students of the diet provided in the university dining halls… . Various types of data are gathered. In particular, the weight of each student at the time of his arrival in September and his weight the following June are recorded. (p. 304)
The issue is whether the appropriate analysis should be based on change scores (weight in June minus weight in September), as proposed by a first statistician (whom I called John), or on analysis of covariance (ANCOVA), using the September weight as a covariate, as proposed by a second statistician (whom I called Jane). There was a difference in mean weight between halls at the time of arrival in September (baseline), and this difference turned out to be identical to the difference in June (outcome). Since the analysis of change scores is algebraically equivalent to correcting the difference between halls at outcome by the difference between halls at baseline, it returns an estimate of zero. The conclusion is thus that, the estimated difference being zero, diet has no effect.
On the other hand, ANCOVA will correct the difference at outcome by a multiple of the difference at baseline, this multiple being determined by the regression of outcome on baseline. However, in the example, the variances of weights at outcome and baseline are identical, so the regression slope is equal to the correlation coefficient, which, in any practical example, may be expected to be less than 1. Thus, for ANCOVA, the difference at outcome is corrected by only a fraction of the difference at baseline. This results in a non-zero estimated difference, which we are to assume is significant, so the conclusion is that diet does have an effect.
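The arithmetic behind the disagreement can be sketched with hypothetical numbers (the 5 kg difference and the slope of 0.6 are invented for illustration; only the equality of baseline and outcome differences comes from Lord's example):

```python
# Hypothetical numbers illustrating the two analyses; in Lord's example the
# baseline and outcome differences between halls are identical.
baseline_diff = 5.0   # mean weight difference between halls in September (kg)
outcome_diff = 5.0    # mean weight difference between halls in June (kg)
beta = 0.6            # assumed regression slope of outcome on baseline (< 1)

# John: change scores correct the outcome difference by the FULL baseline
# difference, so the estimate is exactly zero here.
change_score_estimate = outcome_diff - 1.0 * baseline_diff   # 0.0 -> "no effect"

# Jane: ANCOVA corrects by only the fraction beta of the baseline difference,
# so a non-zero estimate remains.
ancova_estimate = outcome_diff - beta * baseline_diff        # 2.0 -> "an effect"

print(change_score_estimate, ancova_estimate)  # 0.0 2.0
```

The same data thus yield "no effect" or "an effect" depending solely on whether the correction factor is 1 or the regression slope.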
The fact that these two different analyses lead to different conclusions constitutes the paradox. We may note that each is commonly used in analysing randomised clinical trials, say, and there the expectation of the two approaches would be identical, although results would vary from case to case.
In The Book of Why, the paradox is addressed and the conclusion, based on causal analysis, is that the second statistician is ‘unambiguously correct’ (p. 216) and the first is wrong. In my blogs, however, I applied John Nelder’s experimental calculus[5, 6] as embodied in the statistical software package Genstat® and came to the conclusion that the second statistician’s solution is only correct given an untestable assumption and that, even if the assumption were correct and hence the estimate appropriate, the estimated standard error would almost certainly be wrong.
I had looked at this problem some years ago and concluded that the ANCOVA solution was preferable to the change score one but made this warning comment:
Note that in estimating β an important assumption that makes ANCOVA unbiased is that the regression within groups is the same as that between, the latter being the potential bias and the former that by which the correction factor is estimated. (p 4342)
Here β is the multiple of the baseline difference that is used to correct the difference at outcome. However, at that time I had not appreciated the power of Nelder’s approach to designed experiments, which, when applied, makes the issue crystal clear. The approach has the following key features:
- Recognising the distinction between blocking structure and treatment structure. The former reflects variation in the experimental material that exists logically prior to any experimentation; the latter reflects variation that can in principle be affected by experimentation.
- Defining the block structure.
- Defining the treatment structure.
- Mapping the treatment structure onto the block structure.
- Analysing the results in terms of outcome, block structure and treatment structure.
An extra feature of the approach is that a covariate can also be accommodated in the framework.
I then used this approach in Genstat to check Jane’s solution with the following code applied to a simulated example:
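In outline, and using hypothetical variable names (Hall, Student, Diet, Initial, Final), the Genstat statements are of the following form. This is a sketch, not necessarily the exact code used:

```
BLOCKSTRUCTURE Hall/Student
TREATMENTSTRUCTURE Diet
COVARIATE Initial
ANOVA Final
```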
The BLOCKSTRUCTURE statement reflects that students are ‘nested’, to use the statistical terminology, within halls, an important feature of the data that has not been formally addressed, as far as I am aware, by any of the previous approaches to looking at this problem. Nelder’s approach places establishing this at the top of the agenda.
The analysis then gave a result that is obvious in retrospect. It produced an analysis of variance table that establishes the following:
- There are (potentially) two regression slopes in this problem: between hall and between students within hall.
- The first of these cannot be estimated if there are only two halls.
- The second of these is not relevant.
- Only variation between halls is relevant to estimating the standard error.
My conclusions regarding this are as follows.
- John’s analysis is wrong.
- Jane’s analysis is correct as regards estimating the effect of diet iff the between-halls regression is the same as the within-halls regression.
- She cannot test this assumption with the data.
- Even if this assumption is correct any standard error that she calculates is almost certainly wrong.
Another way of putting this is to say that Lord’s paradox involves pseudoreplication. Commentators have implicitly assumed that they have many replications of the experimental intervention because there are many students. However, intervention is at the level of hall, not at the level of student, and it is the level at which intervention occurs that provides replication.
Others have disagreed and have raised various objections. I consider these objections to be red herrings and address them here.
A diet of red herrings
First red herring
Objection: I describe an experiment, but this is not relevant, since Lord’s problem concerns an observational study.
Answer: This objection would be relevant if I had claimed the solution for the experimental set-up was necessarily adequate for the observational one. For example, in comparing two treatment groups in a simple randomised experiment, I could show that a simple t-test would provide a valid estimate and standard error. This would not be an argument, however, for saying that such an approach would be valid for a quasi-experimental study, for which confounding could be a problem. However, this is not what I did. I showed that the approach allowing for a confounder (baseline) would not work even in a randomised experiment (that is to say, under the best of circumstances) and therefore it could not work in the quasi-experimental analogue.
Second red herring
Objection: I discuss the problem in terms of variances and covariances but these are not relevant to causal thinking.
Answer: Jane’s solution is to use a regression-adjusted comparison. The regression coefficient is a ratio of a covariance to a variance. Thus Lord, and nearly all subsequent commentators, have used variances and covariances. It is true that the authors of The Book of Why make no explicit reference to variances and covariances, but they use the geometry of the bivariate Normal, which is determined by the means, variances and covariance. The issue I have raised is that it has not generally been appreciated that variances and covariances occur at two levels.
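The point that regressions exist at two levels can be made concrete with a small simulation sketch (all parameter values here are invented for illustration, not taken from any real example):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical two-level data: hall-level means follow one regression
# (slope 1.5 here), students within halls another (slope 0.5).
n_halls, n_students = 10, 50
hall_base = rng.normal(0.0, 4.0, n_halls)                     # hall effects on baseline
hall_final = 1.5 * hall_base + rng.normal(0.0, 1.0, n_halls)  # between-hall slope 1.5

e = rng.normal(0.0, 3.0, n_halls * n_students)                # student deviations
base = 70.0 + np.repeat(hall_base, n_students) + e
final = (70.0 + np.repeat(hall_final, n_students) + 0.5 * e
         + rng.normal(0.0, 1.0, n_halls * n_students))        # within-hall slope 0.5

def slope(x, y):
    """Least-squares regression slope of y on x."""
    return np.cov(x, y)[0, 1] / np.var(x, ddof=1)

# Between-hall regression: hall means of final on hall means of base.
bm = base.reshape(n_halls, n_students).mean(axis=1)
fm = final.reshape(n_halls, n_students).mean(axis=1)
between = slope(bm, fm)

# Within-hall regression: deviations from each hall's own means.
within = slope(base - np.repeat(bm, n_students),
               final - np.repeat(fm, n_students))

print(round(between, 2), round(within, 2))  # roughly 1.5 and 0.5
```

The two slopes are estimated from quite different covariances, and nothing forces them to agree.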
Third red herring
Objection. At one point in my analysis I considered an experiment with many halls but in Lord’s example there are only two. This is misleading.
Answer: Solving the general case and considering the relevant special case is a well-known approach in mathematics. For example, Polya includes it as one of his heuristics in his famous book, How to Solve It. The Nelder approach shows that the two-hall example cannot be solved because it is degenerate. This immediately suggests creating a solvable version of the problem to show what the issue is. This is what I did.
The argument in a nutshell
The key and usually overlooked point about Lord’s paradox is that there are only two experimental units, the units in question being halls and not, as has been generally supposed, students. The clinical trials analogue is that of a cluster randomised trial and not a parallel group trial. Of course, we are not to suppose that allocation of diets was made at random but what I showed was that even if it were random, the proposed analysis is inadequate.
The consequences of this are the following:
- Jane’s estimate of the diet effect is only unbiased if a) the regression of final weight on initial weight at the level of hall (averaging over students) is the same as that within halls b) there are no other hidden (conditionally) biasing factors.
- Jane’s standard error is only correct if there is no between-hall variation above and beyond that arising from within-hall variation. This is an extremely strong and often demonstrably false assumption.
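The second point can be made concrete with a back-of-the-envelope calculation; the variance components and hall size below are invented for illustration:

```python
# Hypothetical variance components (invented for illustration).
sigma2_student = 9.0   # within-hall, student-level variance of final weight
sigma2_hall = 4.0      # between-hall variance component, over and above students
m = 100                # students per hall

# Naive calculation: students treated as the replicates, no hall component,
# giving the variance of a difference of two hall means.
var_naive = 2 * sigma2_student / m

# Calculation when each hall mean also carries its own random hall effect.
var_true = 2 * (sigma2_hall + sigma2_student / m)

print(round(var_true / var_naive))  # variance inflation factor, here 45
```

Even a modest between-hall component swamps the student-level term, so a standard error calculated at the level of student can be wrong by an order of magnitude.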
Simulating the problem
Figure 1 Data for 20 simulated pairs of halls showing final weight plotted against initial weight in the presence of a strong random hall effect.
Figure 1 shows the results of simulating a Lord’s-type example using not one pair but twenty pairs of halls. The simulation involves using two bivariate Normal distributions in a hierarchical set-up. In each pair, the hall receiving diet 1 is known as hall A and the hall receiving diet 2 is known as hall B. The true difference between diets is set to be such that, other things being equal, a student receiving diet 2 will be 5 kg lighter than if they had received diet 1.
The scatter plots, however, show a bewildering inconsistency given that all 20 pictures represent simulation from identical parameter settings. (In fact, the simulation is produced in one run generating students within one of two halls in twenty pairs.) On many occasions it looks as if diet 2 produces the slimmer students but on many other occasions it is diet 1. This is illustrated in Figure 2, which gives the t-statistic plotted against the estimate for the analysis of each of the twenty hall pairs, conditioning on baseline (as in an ANCOVA). The figure also includes the critical values for ‘significance’ and classifies the results into one of three categories: “in favour of diet 1”, “in favour of diet 2” and “non-significant”. Here a diet is considered favourable if students weigh less under it than they would, other things being equal, on the alternative diet.
Figure 2 t-statistics of the diet effect plotted against the estimate for twenty hall pairs.
Note that, in judging ‘significance’, a Bonferroni correction has been applied to the standard value of 1/40, so for a one-sided P-value a result is judged ‘significant’ if P < (1/40)/20 or P > 1 − (1/40)/20.
These results show that adjusting for the baseline weight fails to produce any consistent message here, despite producing apparently extremely precise measures. How is this possible? The reason is simple. Allocation to diet is at the level of hall, not of student, yet the analysis attempts to unravel the causal knot at the level of student. This is only possible by making strong assumptions, namely that a) there is no variation between halls above and beyond the variation between students and b) the regression between halls is the same as the regression within. However, I set the simulation parameters to violate both these assumptions.
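A minimal Python sketch of this kind of simulation follows; the parameter values are assumed for illustration and are not the author's actual settings:

```python
import numpy as np

rng = np.random.default_rng(2021)

# 20 pairs of halls; diet 2 makes a student 5 kg lighter, other things equal.
n_pairs, n_students = 20, 50
diet_effect = -5.0
sd_hall, sd_student = 6.0, 3.0   # strong random hall effect (violates assumption a)

estimates = []
for _ in range(n_pairs):
    base_parts, final_parts, diet_parts = [], [], []
    for diet in (0, 1):          # hall A gets diet 1, hall B gets diet 2
        hall_base = rng.normal(0.0, sd_hall)  # hall effect on baseline
        hall_out = rng.normal(0.0, sd_hall)   # independent hall effect on outcome
                                              # (violates assumption b)
        e = rng.normal(0.0, sd_student, n_students)
        base = 70.0 + hall_base + e
        final = (70.0 + hall_out + 0.5 * e + diet * diet_effect
                 + rng.normal(0.0, 1.0, n_students))
        base_parts.append(base)
        final_parts.append(final)
        diet_parts.append(np.full(n_students, diet))
    base = np.concatenate(base_parts)
    final = np.concatenate(final_parts)
    d = np.concatenate(diet_parts)
    # Per-pair ANCOVA: final ~ intercept + diet + baseline
    X = np.column_stack([np.ones_like(base), d, base])
    coef, *_ = np.linalg.lstsq(X, final, rcond=None)
    estimates.append(coef[1])

estimates = np.array(estimates)
# The per-pair estimates scatter widely around the true effect of -5.
print(round(estimates.mean(), 1), round(estimates.std(), 1))
```

With a strong random hall effect and differing between- and within-hall regressions, the per-pair ANCOVA estimates scatter far more widely than their nominal, student-level standard errors suggest.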
Note that I am not saying that these assumptions could never be made. They could be, although I personally find assumption a), at least, unlikely to hold in practice. However, I am claiming that Jane’s solution to the problem is not unambiguously correct. The assumptions should be made explicit so that they are ‘on the table’.
Figure 3 represents the case where I have set the simulation so that there is no random hall effect. Now we can see that there is no problem. Each pair of halls gives the same message: diet 2 leads to a lower weight.
Figure 3 Data for 20 simulated pairs of halls showing final weight plotted against initial weight when there is no hall effect.
Cause fishing with Fisher
One hundred years ago the great statistician R.A. Fisher arrived at Rothamsted Experimental Station in Harpenden to start work as its statistician. The whole purpose of Rothamsted was causal: to discover what factors could affect crop growth and yield, and to what degree. The whole deep and beautiful field of design and analysis of experiments as developed by Fisher, his successors and others working in statistics is devoted to this causal purpose, and it is thus rather puzzling to find statistics referred to as a ‘causality-free enterprise’ in The Book of Why (p. 5). This statement is just wrong.
That statement can be rephrased to make it more reasonable by giving it the qualifier when applied to observational studies, but even here I think this is still something of a calumny. Note that I am not claiming that everything in the causal calculus was anticipated in statistics. Far from it. I consider the development of the calculus and its associated theory by Pearl, his co-workers and those he has inspired to be original, profound and important. However, if it cannot accommodate random effects it is incomplete. In my three previous blogs I showed how application of John Nelder’s experimental calculus produced a solution to Lord’s paradox that, although trivially true in retrospect, is nevertheless evidence of the power of the method.
Finally, I speculate that if the causal calculus can incorporate random effects* it will become even more powerful and useful. Indeed, it is hard to see how the first rung of the ladder of causation, that is to say, association (Book of Why, p. 28), can be used without understanding error structures. It has been claimed that all that statisticians taught was that correlation is not causation. However, the truth is far more shocking. What statistics teaches is that frequently correlation is not even correlation. To say that you are adjusting Y for X is meaningless unless you say how such an adjustment is to take place. Perhaps the world needs a Book of How.
*Since I wrote the first draft of this blog, my attention has been drawn on Twitter to the paper by Kim and Steiner, which may address this issue.
1. Wainer, H. and L.M. Brown, Two statistical paradoxes in the interpretation of group differences: Illustrated with medical school admission and licensing data. American Statistician, 2004. 58(2): p. 117-123.
2. Lord, F.M., A paradox in the interpretation of group comparisons. Psychological Bulletin, 1967. 66: p. 304-305.
3. Senn, S.J., The well-adjusted statistician. Applied Clinical Trials, 2019: p. 2.
4. Pearl, J. and D. Mackenzie, The Book of Why. 2018: Basic Books.
5. Nelder, J.A., The analysis of randomised experiments with orthogonal block structure I. Block structure and the null analysis of variance. Proceedings of the Royal Society of London, Series A, 1965. 283: p. 147-162.
6. Nelder, J.A., The analysis of randomised experiments with orthogonal block structure II. Treatment structure and the general analysis of variance. Proceedings of the Royal Society of London, Series A, 1965. 283: p. 163-178.
7. Senn, S.J., Change from baseline and analysis of covariance revisited. Statistics in Medicine, 2006. 25(24): p. 4334-4344.
8. Hurlbert, S.H., Pseudoreplication and the design of ecological field experiments. Ecological Monographs, 1984. 54(2): p. 187-211.
9. Polya, G., How to Solve It: A New Aspect of Mathematical Method. 2004: Princeton University Press.
10. Campbell, M.J. and S.J. Walters, How to Design, Analyse and Report Cluster Randomised Trials in Medicine and Health Related Research. Statistics in Practice, ed. S. Senn. 2014, Chichester: Wiley. 247.
11. Box, J.F., R.A. Fisher, The Life of a Scientist. 1978, New York: Wiley.
12. Fisher, R.A., The Design of Experiments. 1935, Edinburgh: Oliver and Boyd.
13. Kim, Y. and P.M. Steiner, Causal Graphical Views of Fixed Effects and Random Effects Models, in PsyArXiv. 2019. pp. 34. https://psyarxiv.com/cxd2n/
[a]Rothamsted statistics meets Lord’s Paradox https://errorstatistics.com/2018/11/11/stephen-senn-rothamsted-statistics-meets-lords-paradox-guest-post/
[b]On the level. Why block structure matters and its relevance to Lord’s paradox https://errorstatistics.com/2018/11/22/stephen-senn-on-the-level-why-block-structure-matters-and-its-relevance-to-lords-paradox-guest-post/
[c]To infinity and beyond: how big are your data, really? https://errorstatistics.com/2019/03/09/s-senn-to-infinity-and-beyond-how-big-are-your-data-really-guest-post/
At this point at least, I find nothing to disagree with here (as usual with your analyses), and in fact am learning from it (as you indicated you did). So my thanks for the posting!
The problem as I currently see it lies with drastic differences in goals, formal models, and languages between you and Pearl. Specifically (and I welcome any correction to my take):
You apply the statistically rich Nelder/random-effects (RE) analysis that provides a Fisherian ANOVA treatment, which is steeped in historical referents and technical facts that I fear will not be understood by most of the readers to whom I (and Pearl) am accustomed.
In contrast, Pearl/Book-of-Why is limited to the simpler more accessible analysis using only expectations under causal models, and so does not address random variability/sampling variation. Thus among other things it does not address certain fixed (“unfaithful”) causal design effects that can arise in designed experiments via blocking or matching. Mansournia and I published a pair of articles about this limitation, not as deep as your analysis but perhaps a bit more accessible (with effort) to those without traditional training in design and analysis of experiments:
Mansournia, M.A., Greenland, S. (2015). The relation of collapsibility and confounding to faithfulness and stability. Epidemiology, 26(4), 466-472.
Greenland, S., Mansournia, M.A. (2015). Limitations of individual causal models, causal graphs, and ignorability assumptions, as illustrated by random confounding and design unfaithfulness. European Journal of Epidemiology, 30, 1101-1110.
Your general point I take it is that the theory in The Book of Why (and indeed in most treatments of modern causality theory I see, including my own) is incomplete for incorporating uncertainties about or variability of material and responses. It is thus (as you say) incomplete for statistical practice, and leaves its use open to missteps in subsequent variance calculations. But my teaching experience agrees with Pearl’s insofar as the target audience is in more dire need of first getting causal basics down, like how to recognize and deal with colliders and their often nonintuitive consequences. In doing so we must allow for lack of familiarity with or understanding of design-of-experiment theory, especially that involving ANOVA calculus or random effects. Thus while I agree The Book of Why seriously overlooks the central importance of causality in that theory, its criticism would be amended by saying that the theory buried causality too deeply within a structure largely impenetrable to the kind of researchers we encounter. Our efforts were intended to bring to the fore crucial aspects of causality for those researchers, aspects that do not depend on that theory and are even obscured by it for those not fluent in it (as some of the controversy surrounding Lord’s paradox illustrates).
The more specific point I think you make is how the randomization in Lord’s Paradox is itself almost noninformative: With only two halls randomized, it is only a randomized choice of the direction of the confounding (formally, just one sign-bit of information) in what is otherwise an observational study for the treatment effect. That being so, any statistical identification of the effect must depend on untestable assumptions beyond the barely informative randomization.
My questions are:
Does any of my description fail to align with your analysis?
Even if it does align broadly, what key details does it miss?
Again, thanks for the post!