The answer to the question of my last post is George Barnard, and today is his 100th birthday*. The paragraphs stem from a 1981 conference in honor of his 65th birthday, published in his 1985 monograph: “A Coherent View of Statistical Inference” (Statistics, Technical Report Series, University of Waterloo). **Happy Birthday George!**

[I]t seems to be useful for statisticians generally to engage in retrospection at this time, because there seems now to exist an opportunity for a convergence of view on the central core of our subject. Unless such an opportunity is taken there is a danger that the powerful central stream of development of our subject may break up into smaller and smaller rivulets which may run away and disappear into the sand.

I shall be concerned with the foundations of the subject. But in case it should be thought that this means I am not here strongly concerned with practical applications, let me say right away that confusion about the foundations of the subject is responsible, in my opinion, for much of the misuse of the statistics that one meets in fields of application such as medicine, psychology, sociology, economics, and so forth. It is also responsible for the lack of use of sound statistics in the more developed areas of science and engineering. While the foundations have an interest of their own, and can, in a limited way, serve as a basis for extending statistical methods to new problems, their study is primarily justified by the need to present a coherent view of the subject when teaching it to others. One of the points I shall try to make is, that we have created difficulties for ourselves by trying to oversimplify the subject for presentation to others. It would surely have been astonishing if all the complexities of such a subtle concept as probability in its application to scientific inference could be represented in terms of only three concepts––estimates, confidence intervals, and tests of hypotheses. Yet one would get the impression that this was possible from many textbooks purporting to expound the subject. We need more complexity; and this should win us greater recognition from scientists in developed areas, who already appreciate that inference is a complex business while at the same time it should deter those working in less developed areas from thinking that all they need is a suite of computer programs.

Here’s an excerpt from the following section, “A Little History” (1):

Although I had been interested in statistics at school, in 1932, and first met Fisher in 1933, I came properly into the subject during the Second World War….[I]t was not, I think, recognized until the publication of Joan Box’s book, that the man who, more than any other, was responsible for creating the concepts now central to our subject, was cut off from these developments by some mysterious personal or political agency….

It is idle to speculate on what might have happened had the leaders of the subject, Fisher, Bartlett, Pearson, Neyman, Wald, Wilks, and others, all been engaged to work together during the war. Cynics might suggest that the resulting explosions would have made the Manhattan project redundant. But on an optimistic view we could have been spared the sharp and not particularly fruitful controversies which have beset the foundations over the past thirty years. Only now do we seem to be approaching a consensus on the respective roles of “tests” or P-values, “estimates”, likelihood, Bayes’ theorem, confidence or “fiducial” distributions, and other more complex concepts. ….

It is interesting that Barnard calls for “more complexity” while urging “a coherent view” of statistics. I agree that a “coherent” view is possible at a foundational, philosophical level, if not on a formal level.

I’ll reblog some other posts on Barnard this week.

*There was at least one correct, original answer from Oliver Maclaren.

(1) Barnard’s rivulets remind me of Walt Whitman’s Autumn Rivulets.

I find this quote especially striking: “their study is primarily justified by the need to present a coherent view of the subject when teaching it to others.”

Fine, but the devil is in the details, which vary dramatically with opinions about what is “coherent” (you and I would fail Lindley’s criterion). Beyond that I don’t see advice that addresses today’s core problems in my fields of interest. The notion that more complexity would help does not ring true for me; rather, the problem is to find what everyone would agree is both simple and correct to teach and use.

My view is that stats in soft sciences (medicine, health, social sciences among others) has been a massive educational and ergonomic failure, often self-blinded to the limits on time and capabilities of most teachers, students, and users. I suspect the reason may be that modern stats was developed and promulgated by a British elite whose students and colleagues were selected by their system to be the very best and brightest, a tiny fraction of a percent of the population. Furthermore, it was developed for fields where sequences of experiments leading to unambiguous answers could be carried out relatively quickly (over several years at most, not decades) so that the most serious errors could be detected and controlled, not left as part of the uncertainty surrounding the topic.

The statistical vision of this elite (to whom Barnard belonged) was accurate in its time and fields. It failed as selection weakened dramatically in the explosion of university education after WWII, fueled especially by the “science race” against the Soviet bloc during the next 40 years. For example, the percent of the U.S. population going on to postgraduate training is now on the rough order of 10%; simple logic tells us that few of these could be geniuses, so it is no surprise that methods developed by brilliant minds to teach brilliant and well-educated students in relatively precise fields would fail miserably in modern soft-science contexts. Stat education requires attention to the far less elite audience and far broader applications of these times (in which, to advance in scientific fields, even the elite are increasingly consumed by proposal-writing and research-management responsibilities that leave ever less time for deeper thinking).

Sander: Thank you for your thoughtful comment. I agree that any kind of formal coherence in the style of Lindley can’t work, and frankly, I’m not sure what Barnard meant, but on the more interesting issue of the complexity of inference, I’m inclined to agree with Barnard. Is it really advantageous or plausible to focus on “what everyone would agree is both simple and correct to teach and use”? There may be a danger that this gives a frail, artificial skeleton, limited to cases where there’s an “agreement on numbers,” such that any attempt to add the necessary flesh of interpretation is seen as perilous. At least this is so for the attempts I have seen. I think it’s worth taking to heart Barnard’s remark that we need more complexity, and that “this should win us greater recognition from scientists in developed areas, who already appreciate that inference is a complex business while at the same time it should deter those working in less developed areas from thinking that all they need is a suite of computer programs”.

The computer programs speed things up for your “non-elite audiences,” so maybe they should use the extra time to stop and think about what they’re doing. I find it ironic for people who complain about unthinking, cookbook uses of statistics to turn around and call for easier recipes rather than more thoughtfulness. Scientific inference, in general, isn’t less in need of self-critical, nuanced, and piecemeal progress just because more people go to college. The same is true for fields that employ statistical inference.

It is not surprising that students who have been taught that all they need is a suite of computer programs wind up blaming their tools when no one can replicate their results, and when they discover that finding things out is a bit more complicated than promised. Offered easier routes to publish, they are happy as clams to sign on. But isn’t this exactly the problem?

We are probably not far apart in general sentiments, and who would dispute that real science is a complex business? But so can portrait painting be, even when the tools and basic techniques are simple.

People have been lamenting mindless statistics since before Fisher’s ascendance (e.g., see Boring 1919) and computer aggravation of the problem since before your Barnard quote (I recall David Freedman describing the then-new Cox and logistic regression packages as a plague on epidemiology). Again the details will matter enormously, and our opinions about such practical matters (whether consonant or not) should not be taken too seriously without considerable evidence review. My point was that, as interesting as the “old masters” can be to read, in this matter we should be skeptical about the relevance of their observations in light of the vastly different research environment in which they grew up and practiced.

I think the greatest complexity will be in answering the basic question about what should be done with statistics education, not the least because specifics will vary dramatically across fields. It is possible to do controlled experiments to test at least components of education proposals (there is a literature of such studies), and we should hope these proposals will be subject to the field-testing, selection, adaptation, and synthesis cycles that characterize applied research.

Sander: “It is possible to do controlled experiments to test at least components of education proposals (there is a literature of such studies), and we should hope these proposals will be subject to the field-testing…”

Are such controlled experiments to be analyzed via statistical tests then?

It’s one thing to test certain teaching methods (in grade school) by looking at, say, reading and math scores–and even this is quite controversial because what “works” in one place (e.g., small classrooms) doesn’t work in others. How much more difficult to assess whether simpler or more complex teaching of statistics (presumably in higher ed) results in better statistical inferences.

That said, I don’t think it’s all that far-fetched, if people really wanted to bring about a deeper and more correct understanding of statistics, and without that much effort. I’ll bet almost everyone who has thought about this will have their own suggested pilot program.

Mayo: I think to pursue this topic one would need to examine the stat education literature. There is a whole section on stat education in the ASA. This research topic is outside my expertise – I only know some of the cognitive-biases/behavioral econ literature (Gigerenzer’s experiments on natural frequencies seem most relevant here, both for basic education and for explaining frequentism).
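Gigerenzer’s natural-frequency idea can be sketched with a toy screening example (all numbers here are hypothetical, chosen only for illustration): recasting conditional probabilities as whole-number counts in a reference population makes the Bayesian positive predictive value nearly self-evident, and it agrees exactly with Bayes’ theorem applied to the probabilities.

```python
# Hypothetical screening test, in the spirit of Gigerenzer's
# "natural frequencies": base rate 1%, sensitivity 90%,
# false-positive rate 9%. None of these numbers come from the post.
population = 10_000
base_rate, sensitivity, false_pos_rate = 0.01, 0.90, 0.09

# Natural-frequency bookkeeping: counts instead of conditional probabilities.
sick = population * base_rate            # 100 people have the condition
true_pos = sick * sensitivity            # 90 of them test positive
healthy = population - sick              # 9,900 people are healthy
false_pos = healthy * false_pos_rate     # 891 of them also test positive

ppv_counts = true_pos / (true_pos + false_pos)

# The same answer via Bayes' theorem on the probabilities directly.
ppv_bayes = (sensitivity * base_rate) / (
    sensitivity * base_rate + false_pos_rate * (1 - base_rate)
)

print(round(ppv_counts, 3), round(ppv_bayes, 3))  # → 0.092 0.092
```

The count version ("90 true positives out of 981 positives") is what Gigerenzer's experiments found people handle far better than the equivalent conditional-probability statement.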

As for testing: I think Rubin and others have been saying for decades that the key is design: strong designs basically compel inferences by enforcing powerful controls that translate into powerful assumptions, leading to sound conclusions from any logically sound statistical approach – when applied to the entire body of evidence. The crucial foundations of statistics are thus in design theory (as in Cox’s 1958 book, Planning of Experiments). Alas, my own work deals instead with data as given, often generated without strong design features such as randomization, or lacking adequate precision.

Sander: I certainly agree about the importance of planning, design, data generation and modeling. As Egon Pearson said about N-P:

“We were regarding the ideal statistical procedure as one in which preliminary planning and subsequent interpretation were closely linked together––formed part of a single whole. It was in this connexion that integrals over regions of the sample space were required. Certainly, we were much less interested in dealing with situations where the data are thrown at the statistician and he is asked to draw a conclusion. I have the impression that there is here a point which is often overlooked” (1966, 277-8). (“Some Thoughts on Statistical Inference” in The Selected Papers of E.S. Pearson)

What’s “often overlooked”, as I understand Pearson, refers not merely to downplaying experimental design, but to failing to see the intimate link between the properties of the data generation and the interpretation of the data. The error probabilities are not about future outcomes so much as about outcomes that could have occurred just now, rather than those that did occur.

Mayo: Thanks, that is helpful. I think the focus and reliance on design is why nominal frequentist Cox and nominal Bayesian Box (Egon’s star student) were closer in practical terms than the labels might lead one to think. To put it in modern terms, a good design provides a known data-generating mechanism and thus a distribution for the experiment’s potential outcomes (the sample space), which in turn provides known likelihoods, error rates, and posterior probabilities (if one has a prior distribution). Strong designs lead to concentrated likelihood functions and hence concentrated P-value functions and concentrated posterior distributions.
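The “concentrated likelihood” point can be made concrete with a toy simulation (not anything from Barnard or Box – just the textbook fact that, for a normal mean, the likelihood’s spread scales as σ/√n, so more data and better noise control give a narrower likelihood, and hence narrower P-value functions and posteriors). The effect size and noise levels below are assumed purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
true_effect = 1.0  # assumed value, for illustration only


def estimate_and_se(n, sigma):
    """Draw n outcomes and return (point estimate, standard error).

    For a normal mean the likelihood's spread is sigma / sqrt(n),
    so larger n and smaller sigma yield a more concentrated likelihood.
    """
    data = rng.normal(true_effect, sigma, size=n)
    return data.mean(), data.std(ddof=1) / np.sqrt(n)


# "Strong design": many well-controlled (low-noise) observations.
est_strong, se_strong = estimate_and_se(n=400, sigma=1.0)
# "Weak design": few noisy observations.
est_weak, se_weak = estimate_and_se(n=16, sigma=2.0)

print(f"strong design: {est_strong:.2f} ± {se_strong:.2f}")
print(f"weak design:   {est_weak:.2f} ± {se_weak:.2f}")
```

The strong design’s standard error is roughly a tenth of the weak design’s, which is the sense in which every downstream summary (likelihood, P-value function, posterior) comes out concentrated.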

Again however, my field must often rely on data whose generating mechanism is only known vaguely and is often controversial; much harm is and has been done by naive application of experimental statistics (which amounts to ignoring the extensive design uncertainties in the calculations).

Sander: Still, I wouldn’t take the fact that the data generating mechanism might be unknown to imply that we must be doing something entirely different in those contexts. The error probabilities nearly always refer to hypothetical repetitions, and those are relevant even in “non-experimental” contexts (as we see, for example, with resampling).
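A minimal bootstrap sketch of “hypothetical repetitions” applied to data as given (the data values below are made up): resampling the observed data with replacement stands in for repetitions of an unknown data-generating process, and the spread of the resampled statistic yields a rough interval.

```python
import random

random.seed(1)
# Observational data "as given" (hypothetical measurements).
data = [2.1, 3.4, 1.8, 2.9, 3.7, 2.5, 3.1, 2.2, 3.9, 2.6]


def mean(xs):
    return sum(xs) / len(xs)


# Bootstrap: simulate hypothetical repetitions by resampling the
# observed data with replacement, recording the statistic each time.
n_boot = 5000
boot_means = []
for _ in range(n_boot):
    resample = [random.choice(data) for _ in data]
    boot_means.append(mean(resample))

boot_means.sort()
# A rough 95% percentile interval for the mean.
lo, hi = boot_means[int(0.025 * n_boot)], boot_means[int(0.975 * n_boot)]
print(round(mean(data), 2), round(lo, 2), round(hi, 2))
```

The repetitions here are entirely hypothetical – no new data are collected – which is exactly why resampling is available in “non-experimental” contexts, though it inherits whatever biases the original data collection had.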

Mayo: You give the standard argument to defend use of experimental stats in settings in which their assumed controls are not operating. Barnard’s concerns about the probability of “something else” when the latter is vague apply very well here, especially because (and contrary to Savage) that probability is easily on the order of a half in many of the cases I see, rather than almost negligible as in a tightly controlled experiment. So the error probabilities become conditional on a hypothesis which no one seriously believes. In that case, what good are they? More precisely, do they do more good than the harm they do when (as usual) the conditioning gets overlooked during interpretation and reporting? This question has been in play since before we were born and remains wide open. As far as I can tell, Barnard’s concerns about the conditional nature of hypothesis probabilities carry over to data probabilities in this case.