Thanks to Emrah Aktunc and Christian Hennig for their U-Phils on my September 12 post: “How should ‘prior information’ enter in statistical inference?” and my subsequent deconstruction of Gelman[i] (starting here, and ending with part 3). I’ll begin with some remarks on Emrah Aktunc’s contribution.
First, we need to avoid an ambiguity that clouds prior information and prior probability. In a given experiment, prior information may be stronger than the data: to take but one example, say that we’ve already falsified Newton’s theory of gravity in several domains, but in our experiment the data (e.g., one of the sets of eclipse data from 1919) accords with the Newtonian prediction (of half the amount of deflection as that predicted by Einstein’s general theory of relativity [GTR]). The pro-Newton data, in and of itself, would be rejected because of all that we already know.
Your second point, however, is entirely correct: “Either you use background information in the form of a ‘prior’ (informative or not) or you don’t (or can’t) use background information at all. This is fallacious thinking.” Indeed, that was a central point in the Cox-Mayo exchange to which Gelman refers. I think it points to a huge blunder that is often repeated; unfortunately, I don’t expect our efforts to put a stop to it. The irony is that it increasingly appears that Bayesians aren’t too keen on forcing background information into their formal prior probability distributions. I would argue that the most popular Bayesian applications use “conventional” (e.g., default, reference, non-subjective) priors that are not even seen to represent background beliefs. Instead, they seek methods in which the prior has but a minimal impact.
I was just rereading Jim Berger (2006) (“The case for objective Bayesian analysis.” Bayesian Analysis, 1(3), 385–402):
One only has limited time to elicit models and priors from the experts in a problem, and usually it is most efficient to use the available expert time for modeling, not for prior elicitation (p. 391).
It’s a waste of time, he says, to elicit background beliefs about parameters in models “since the models will likely not even be those used at the end” (p. 392). By the time they have set out their first elicitation, everyone is likely to have moved on to a different model! By contrast, to refer to one of my fashion analogies: frequentist error-statistical methods let you get going with a handful of “ready to wear” (rather than “designer”) methods: find a way to ask one limited question (severely) now, and let feedback dynamically improve things.
[Discussing an elicitation effort] with statistically sophisticated engineers, it was enormously difficult and expensive, with extensive training needed. All the usual elicitation mistakes were encountered: variability was initially much too small (virtually never would different experts give prior distributions that even overlapped); there would be massive confusions over statistical definitions (e.g., what does a positive correlation mean?); etc. Since there was no data available about these parameters, one of the severe problems in elicitation was at least avoided, namely how to elicit the prior when the expert has already seen the data (the usual situation that a statistician faces) (ibid., p. 392).
* * *
With respect to incorporating background information, Christian Hennig finds both the frequentist and the Bayesian approaches seriously wanting.
I began this blog in September 2011 with a deliberate example of a phenomenon that scarcely involved formal statistics. At the time, in another context, I happened to be writing about Kuru and related spongiform diseases like mad cow, specifically about the mysteries of their transmission. (Although, during decades of inquiry, rates of infection had been modeled stochastically, such pursuits occupied but a small corner of the inquiry.) So one might ask: if, in science in general, there is no need to formally introduce prior probability distributions, why require it in statistics?
In the early experiments on Kuru, before anyone understood what the disease was or how it was spread, rudimentary questions were being asked, piecemeal, until background on scrapie in sheep became foreground and ignited a spark: “Gee, we’ve got something vaguely familiar here. Didn’t we see something like this with scrapie in sheep?” The spark was unexceptional. On the contrary, it’s the nature of learning. By piecemeal steps we home in on such analogies. Had the researchers first tried to express, in the form of prior probabilities, everything known (about all diseases or viruses, about bacteria, genetics, statistics . . .), they would never have learned what actually enabled their advances.
This summer we heard Bayesian statistician Tony O’Hagan tell the Higgs physicists that they might have found it valuable to consult with a (subjective) Bayesian. While O’Hagan grants the great difficulty of obtaining priors for all of the background, he believes that, having done so, they could have done as good a job, if not a better one, because, he says, they would have learned whether their rough approximations relying on 5 sigma differences were really warranted.
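(As an aside, for readers who want a number to attach to “5 sigma”: the sketch below computes the corresponding tail area, assuming a one-sided test and a standard Normal test statistic. Both assumptions are mine, made only for illustration, and the function name is hypothetical.)

```python
import math

def normal_upper_tail(z: float) -> float:
    """P(Z >= z) for a standard Normal Z, via the complementary error function."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))

# One-sided tail area at 5 standard deviations (assumes a standard Normal null).
p_five_sigma = normal_upper_tail(5.0)
print(f"one-sided tail area at 5 sigma: {p_five_sigma:.3e}")
```

Under these assumptions the tail area is roughly 3 in 10 million; it is a statement about how improbable so large a difference would be under the null, not a posterior probability.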
If I and other piecemeal learners are correct, then this would have been the wrong way to go. Hennig regrets that we don’t have formal ways of incorporating all background. Philosophers of science, too, have long had a love affair with neat and tidy logics (at least since logical empiricist days and before). Finally accepting the waves of criticism lodged against hypothetico-deductive accounts of science (e.g., in the 1960s and 70s), many sought refuge in Bayesian confirmation theory and inductive logics wherein background B gets a big letter in a Bayesian computation. (The probabilities generally stemmed from choices of a first order language.)
Unfortunately, as a forward-looking method, the formal logics are empty. And even in post-data reconstructions of an episode, when the “historical episode” is complete and no new learning is going on, it is at best a distant “paint-by-number” rational construction activity.
Where Hennig says, “the Bayesian has probably as hard a time constructing the prior as the frequentist has afterwards, making some overall sense of all results,” I want to reply that the non-Bayesian has a much easier time. Also, that the Bayesian would miss the boat if, in setting out to push the frontiers of knowledge, he or she really and truly insisted on forcing all the information, background knowledge, and theories involved into prior probability distributions. (What they actually do is generally quite different. Recall the discussion in relation to Stephen Senn and his critique of what I called “grace and amen” Bayesianism.)
To Hennig’s regret that we don’t have a systematic way of incorporating background information, I say that one reason for this is the failure to try to at least systematize the way in which, at the forefront, background knowledge is used to gain knowledge. I realize that statistical applications are rather special, but as a philosopher of science my concern is with science in general. Yet I still think statistical methods and models are the ideal place in which to look for ideas about systematizing strategies for learning from mistakes. If people would get over their insistence on learning in terms of formal updating of degrees of belief or support or the like, and instead considered that what we need are general ways of assessing how well-tested claims are, what errors have and have not been ruled out, etc., etc., we’d have a shot not only at understanding but at speeding up the process.
According to Hennig:
The problem is that the frequentists don’t have an in-built machinery for combining evidence. Think for example of a question like ‘Is improved street lighting good for reducing crime?’ . . . [T]here is a lot of knowledge, but it is not unified and much of it is only marginally relevant. Still it is more than just an unfounded expert opinion (of which Cox is understandably wary to incorporate it into the analysis) to add to the results of the possibly rather limited amount of data that the local government has collected itself.
First, I’m rather sure that Cox would not object to using information about whether improved street lighting reduces crime. We would not consider the effectiveness of street lighting on crime reduction to be a general causal law or even a regularity: it can and often does happen that techniques suited to general causal claims are used when one really wants to know whether a new streetlight on South Main St. would decrease crime, or some such thing. One may, of course, want to know how often streetlights diminish crime, or whatever, but a frequentist has no problem pursuing this.
In a Philosophy of Science Association seminar in November (with Nancy Cartwright and others), as it happens, we’ll be considering examples in development economics, education, and the like, which may elicit questions such as: Do microloans reduce poverty? or, Does small class size improve education? A big problem, however, is that what “works” in one place may be irrelevant in another; collecting data using randomized controlled trials and aggregating the results may well yield useless predictions for what is likely to work in the case at hand.
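To make the worry about aggregation concrete, here is a toy sketch. Every number and name in it is invented for illustration and comes from no actual study; it simply assumes that a pooled effect is computed as a sample-size-weighted average of per-site estimates.

```python
# Hypothetical per-site effect estimates and sample sizes (made-up numbers).
site_effects = {"X1": 0.30, "X2": 0.25, "X3": 0.20, "Y": -0.10}
site_sizes = {"X1": 400, "X2": 350, "X3": 300, "Y": 50}

def pooled_effect(effects: dict, sizes: dict) -> float:
    """Sample-size-weighted average of the per-site effect estimates."""
    total = sum(sizes[site] for site in effects)
    return sum(effects[site] * sizes[site] for site in effects) / total

pooled = pooled_effect(site_effects, site_sizes)
local = site_effects["Y"]  # the case at hand

print(f"pooled estimate: {pooled:+.3f}")
print(f"site Y estimate: {local:+.3f}")
```

In this invented setup the pooled estimate is positive while the estimate for site Y is negative, so the aggregate by itself says little about what would happen at Y.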
Different kinds of background knowledge are needed for prediction, as opposed to scientific theorizing. But with prediction, again, we’d seek the kind of background information that tells us, say, what really was at work with microloans in country X, and why the attempt to apply them in country Y failed.
During the life of this blog we have seen in different ways how, in scientific practice, the idealized picture of Bayesian updating and aggregating is at odds with practice centered on deliberately piecemeal local probes—jump in quick and jump out fast—which can be dynamically altered by related inquiries from different groups, following different approaches and with different areas of expertise. In every scientific episode I’ve studied, no one would have anticipated the relevance of insights from radically different fields to the growth of knowledge. By trying to map out everything in advance, we distort the picture of learning.
I invite Gelman to respond to my main query in part 3 of my deconstruction.
[i] A Deconstruction of Gelman by Mayo in 3 parts:
(10/5/12) Part 1: “A Bayesian wants everybody else to be a non-Bayesian”
(10/7/12) Part 2: Using prior Information
(10/9/12) Part 3: beauty and the background knowledge