Background Knowledge: Not to Quantify, But To Avoid Being Misled By, Subjective Beliefs

Increasingly, I am discovering that one of the biggest sources of confusion about the foundations of statistics has to do with what it means or should mean to use “background knowledge” and “judgment” in making statistical and scientific inferences. David Cox and I address this in our “Conversation” in RMM (2011); it is one of the three or four topics in that special volume that I am keen to take up.

Insofar as humans conduct science and draw inferences, and insofar as learning about the world is not reducible to a priori deductions, it is obvious that “human judgments” are involved. True enough, but too trivial an observation to help us distinguish among the very different ways judgments should enter according to contrasting inferential accounts. When Bayesians claim that frequentists do not use or are barred from using background information, what they really mean is that frequentists do not use prior probabilities of hypotheses, at least when those hypotheses are regarded as correct or incorrect, if only approximately. So, for example, we would not assign relative frequencies to the truth of hypotheses such as (1) prion transmission is via protein folding without nucleic acid, or (2) the deflection of light is approximately 1.75” (as if, as Peirce puts it, “universes were as plenty as blackberries”). How odd it would be to try to model these hypotheses as themselves having distributions: to us, statistical hypotheses assign probabilities to outcomes or values of a random variable.

However, quite a lot of background information goes into designing, carrying out, and analyzing inquiries into hypotheses regarded as correct or incorrect. For a frequentist, that is where background knowledge enters. There is no reason to suppose that the background required in order sensibly to generate, interpret, and draw inferences about H should—or even can—enter through prior probabilities for H itself! Of course, presumably, Bayesians also require background information in order to determine that “data x” have been observed, to determine how to model and conduct the inquiry, and to check the adequacy of statistical models for the purposes of the inquiry. So the Bayesian prior only purports to add some other kind of judgment, about the degree of belief in H. It does not get away from the other background judgments that frequentists employ.

This relates to a second point that came up in our conversation when Cox asked, “Do we want to put in a lot of information external to the data, or as little as possible?”

As I understand him, he is emphasizing the fact that the frequentist conception of scientific inquiry involves piecemeal inquiries, each of which manages to restrain the probe so as to ask one question at a time reliably. We don’t want our account to demand we list all the possible hypotheses about prions, or relativistic gravity, or whatever—not to mention all the ways in which each can fail—simply to get a single inquiry going! Actual science, actual learning is hardly well captured this way. We will use plenty of background knowledge to design and put together results from multiple inquiries, but we find it neither useful nor even plausible to try to capture all of that with prior degrees of belief in the hypotheses of interest. I see no Bayesian argument otherwise, but I invite them to supply one.[i]

A fundamental tenet of the conception of inductive learning most at home with the frequentist philosophy is that inductive inference requires building up incisive arguments and inferences by putting together several different piece-meal results. Although the complexity of the story makes it more difficult to set out neatly, as, for example, if a single algorithm is thought to capture the whole of inductive inference, the payoff is an account that approaches the kind of full-bodied arguments that scientists build up in order to obtain reliable knowledge and understanding of a field. (Mayo and Cox 2010)

An analogy I used in EGEK (Mayo 1996) is the contrast between “ready-to-wear” and “designer” methods: the former do not require huge treasuries just to get one inference or outfit going!

Some mistakenly infer from the idea of Bayesian latitude toward background opinions, subjective beliefs, and desires, that the Bayesian gives us an account that appreciates the full complexity of inference—but latitude for complexity is very different from latitude for introducing beliefs and desires into the data analysis!  How ironic that it is the Bayesian and not the frequentist who is keen to package inquiry into a neat and tidy algorithm, where all background enters via quantitative sum-ups of prior degrees of belief in the hypothesis under study.  In the same vein, I hear people say time and again that since it is difficult to test the assumptions of models, we should recognize the subjectivity and background and be Bayesians! Wouldn’t it be better to have an account that provides methods for actually testing assumptions? One that insists that any unresolved knowledge gap be communicated in the final report in a way that allows others to critique and extend the inquiry? This is what an error-statistical account requires, and it is at the heart of that account’s objectivity.  The onus of providing such information comes with the requirement to report, at least approximately, whether formally or informally, the error-probing capabilities of the given methods used. We wish to use background information, not to express degrees of subjective belief, but to avoid being misled by our subjective beliefs, biases and desires!

This is part and parcel of an overall philosophy of science that I discuss elsewhere.

[i] Of course, in some cases, a hypothesized parameter can be regarded as the result of a random trial and can be assigned probabilities kosher for a frequentist; however, computing a conditional probability is open to frequentists and Bayesians alike—if that is what is of interest—but even in those cases the prior scarcely exhausts “background information” for inference.


23 thoughts on “Background Knowledge: Not to Quantify, But To Avoid Being Misled By, Subjective Beliefs”

  1. Corey

    “Instead scientists break down their inquiries into piece-meal questions (see for example the PPN framework of parameters), and they are able to assess the error probing abilities for inquiries into well-delineated questions as, the deflection of light is such and such.”

    One can use a Bayesian statistical approach to make piece-meal inquiries of the type you describe. Yes, in principle, one ought to assign a “global prior”, but this turns out to be uncomputable in practice, alas (…. To assess the error-probing abilities of Bayesian models, the typical thing to do is run simulations to find the sampling properties of their answers with respect to a known truth. This isn’t philosophical Bayesianism, but it’s a scientifically fruitful approach anyway.

    “Nor do they assign degrees of truth to these parameter values, whatever that might mean. If you can show me how this is done in cases of actual learning, be it about prions, gravity, or something else, it would help.”

    Sure they do. Check out the arXiv for a sampling.… . To read all of these papers would be an overwhelming task, but if you’re interested in how physicists think about and use Bayes, that’s the link you want.

  2. I deny that any type of Bayesianism—insofar as the word has any meaning—is able to avoid the foundational bankruptcy I discuss in my RMM contribution. Strict personalists may be said to come closest to being pure, but only at the cost of forfeiting relevance to what is at the heart of scientific learning about the world. Of course I know Doogian’s taxonomies well (though I prefer Cox’s “7 faces”). The troubles aren’t solved by adding to the types; and I don’t see the guest using a consistent notion. I don’t see that how well a claim has stood up to testing is well captured by the relative frequency of an event.

  3. Guest

    You write that “we would not assign relative frequencies to the truth of hypotheses” but very few of the 46656++ types of Bayesians are asking you to. For most Bayesians, priors quantify information about the truth, in a manner consistent with a set of particular axioms. The truth itself is a fixed constant, and to write otherwise is unhelpful.

    You also “find it neither useful nor even plausible” to quantify the available information using priors. Fine, but please:

    i) look at the prior-elicitation literature to see the sensible and scientifically-useful procedures that result from such quantification.

    ii) look at the damage done when frequentism is misinterpreted to mean that one should always use off-the-shelf designs and analyses.

    Bayesian methods are just methods; they have frequency and error properties that one is free to calculate if desired. And yes, choosing any method poorly gives bad results. But if Bayesian thinking helps users make better choices, then we should embrace it.

    Going in the other direction, because frequentist analyses are in practice going to be interpreted in a Bayesian manner (in particular, confidence intervals as credible intervals), it seems pragmatic to ask whether they correspond to a Bayesian analysis that might be of any interest. Happily, very often they do. There is thus no need to categorize methods (or people) as Bayesian or frequentist; to paraphrase DR Cox, they may be both, and there is no reason for not viewing them this way.

    Finally, on model-checking. Taking a Bayesian or frequentist approach, the Rumsfeldian problem of “unknown unknowns” means, in realistically complex situations, it is just impossible to give honest statements about what we might have missed. (It’s all too easy to be the bewildered student who “doesn’t know what he doesn’t know”, and is thus incapable of asking a good question, even if he could understand the answer). There is some excellent recent literature comparing the facts vs fiction of inference after model-checking; the extent of the problem is non-intuitive to many.

    • GUEST: I am glad you allow that “truth itself is a fixed constant”. I really don’t see how priors quantify information about the truth of such claims; nor does trying to assign such quantities (and multiplying them, etc) seem to capture how scientists learn new things. How do scientists set out to probe, say, relativistic gravity? By assigning the truth of GTR a quantitative assessment? How? What numbers do they assign? Perhaps they think that, as a whole, GTR will break down—so what number should they give it? Instead scientists break down their inquiries into piece-meal questions (see for example the PPN framework of parameters), and they are able to assess the error probing abilities for inquiries into well-delineated questions as, the deflection of light is such and such. Nor do they assign degrees of truth to these parameter values, whatever that might mean. If you can show me how this is done in cases of actual learning, be it about prions, gravity, or something else, it would help. I hope readers realize that I am drawing attention to a fundamental difference in the conception of science—not trying in the least to misrepresent the Bayesian view. I can articulate how these actual inquiries take place and build on each other. Please tell me how such inquiries are well-captured by assigning degrees of quantitative information for the (fixed) truth, that you have in mind.

      As for your other points, the value of prior elicitation, the useful error probabilities of Bayesian methods, the insistence that frequentist methods are invariably interpreted Bayesianly, the supposition that there are satisfactory Bayesian techniques of model checking —these are all positions that I have questioned, and hope the reader will join me in doing so! I appreciate your listing them, whoever you are, so that we can see just how much of a difference in philosophy is involved. There’s a lot I would (and elsewhere do) say on these points, all of which I promise to take up here (but please see references in the mean

      • Guest

        I am not a physicist; someone like Bill Jeffreys would be better placed to tell you how physicists learn. For prior elicitation, a starting point is this book;

        For e.g. the effect of a drug on mean lifespan in a particular population, there seems little conceptual difficulty in saying that we don’t know the truth, but that the information we do have, a priori, can be summarized by a particular distribution.

        Multiplying this information by what we learn from given data is the process that goes on when we try to interpret results from a trial – whether we label this process as Bayesian, or not.

  4. John Byrd

    I do not agree. If we do not have strong evidence for the prior information, then we are not compelled to multiply anything. We should not be forced to use priors because we are enamored with Bayes’ theorem. Data can answer small questions and be helpful.

  5. John Byrd

    I like the use of the term ‘probe’ to reference how we, in practice, actually make use of statistical analyses. I do not recall ever seeing observations made and numbers crunched to address anything but narrow questions. Most of the Bayesian treatments I have seen in my field (biological anthropology/forensics) have muddied the water by using poorly conceived priors to combine with specific skeletal data (age markers, measurements). So, a narrow concern with a set of measurements gets obscured by the introduction of speculative demographic data. Yes, demographics may be relevant in the big picture, but forcing that info into the measurements-based probe detracts from the clarity of the testing. You see the worst in forensics, where, for example, DNA technicians play the role of amateur demographers with their Bayesian models. Bayesian experts might cringe at naive use of poorly conceived priors, but their rhetoric, which reads like marketing, sometimes encourages it. I pointed to a case in the UK in a recent post where the judge threw out a Bayesian analysis because the data were ‘crap’ (my word, not his/hers). The reaction of the Bayesian experts in the UK who were interviewed fixated on the power of the models and ignored the junk data issue. I see this as potentially dangerous for science.

    • Thanks for your comments John. I’d like to read more about the forensic case you allude to. And what do you mean they fixated on the “power” of the models? What were they powerful for?

      • John Byrd

        Sorry for my delay. I will find the link to the story. What I meant by fixation on power was their insistence that Bayesian models have the ‘power’ to provide the probability of guilt or innocence (in contrast to classical stats).

  6. John Byrd

    O.k., the link is:

    My original point in sharing it was the issue of data quality, and the apparent lack of concern about it (save for the judge, who was awake that day!).

    Another note I will make is that the author/journalist stated that Bayesian reasoning reflects how jurors and all of us think. Presumably, this perspective was taken from the interviews. A lot of classic psych research (e.g. Tversky, Kahneman, etc.) has shown clearly this is not the case.

    Article is misleading.

  7. It will not do to simply assert that in general “the process that goes on when we try to interpret results from a trial” is tantamount to multiplying by a prior. It may be a very, very special kind of case where a legitimate prior distribution exists and further, that we’d want to use it in interpreting results. Is the idea of your example that there is a frequency distribution for mean lifespan effects, over a range of drugs, in the given population? But even there, this particular drug of interest would need to be sufficiently similar to the others about which we had sufficient empirical information to formulate the prior probability distribution. Which other drugs would be included? Treating the same disease? or several diseases? So even in these very atypical cases, it’s not too clear we’d want to try to interpret the bearing of the trial data this way. But if one did, one ought to be able to assess the evidence for the prior, and then one goes back to needing an account to warrant a hypothesis (not to apply yet another prior distribution).

    • Guest

      We obtain the product of our a priori information and whatever it is the data tell us. Hence, multiplication seems like the right term – and the right process. If you want to say “addition” instead, then add the logs of the two quantities I mentioned.
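The multiplication and the equivalent addition of logs described in this comment can be written out as a small worked identity. This is only a sketch in standard notation; the symbols $\theta$ (the parameter of interest), $x$ (the data), $p(\theta)$ (the prior) and $L(\theta; x)$ (the likelihood) are labels assumed here for illustration and do not appear in the thread.

```latex
\[
p(\theta \mid x) \;\propto\; p(\theta)\, L(\theta; x)
\qquad\Longleftrightarrow\qquad
\log p(\theta \mid x) \;=\; \log p(\theta) + \log L(\theta; x) + \text{const},
\]
```

where the constant does not depend on $\theta$. Nothing in the identity says where $p(\theta)$ comes from or whether it is warranted, which is the point under dispute in the surrounding comments.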

      The prior distribution is over the parameter of interest, e.g. the effect of drug on mean lifespan, in a particular population. The prior would be based on what is known about the drug, and the population. To elicit helpful priors, see the link I provided earlier.

      I’m not convinced by your claim that helpful priors are “a very, very special kind of case” or are “very atypical cases”. Again, see that link.

  8. With the case you describe now, it’s not at all clear one would want to use the known data on the drug (and its correlations with lifespan) as a prior for the new trial. What if the prior data are inconsistent with the new data? Then (as D.R. Cox points out), you wouldn’t want to multiply them together. If they can reliably be combined with the new data, an error statistician could do so as well, but not by viewing either data set as supplying a prior probability distribution.

    • Guest

      You might not multiply the prior on the drug’s effect in one population directly by the likelihood featuring the drug’s effect in another. But you can multiply a joint prior on both those parameters by a likelihood that updates them both. Features of the data that tell you about how correlated the information about two parameters is would affect how much you learn about one parameter from a study that tells you most directly about the other.

  9. Corey: I’m not sure if your two remarks are in tension with one another but I’d be interested in understanding how to “run simulations to find the sampling properties of their answers with respect to a known truth”. Does this violate the likelihood principle? One needs a sound foundation for the argument, an underlying core for each method, not a mere grab bag of things that might be done while adhering to the label “Bayesian”. In the frequentist error statistical approach, the computation and use of the sampling properties all hang together in a unified logic.

  10. Corey

    The likelihood principle is a “local” principle — it purports to say what the evidential content of the data is in the context of a given sampling model. So I would say that simulations to evaluate the usefulness of a given data analytic procedure do not violate the likelihood principle, being outside of its ambit. (I did mention this is not an approach prescribed by philosophical Bayesianism, right?) It’s also useful as a sanity check for subtle non-identifiability problems in a model.
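One way such a simulation check might look is sketched below. It is a minimal illustration, not anything the commenters supplied: it assumes a normal model with known standard deviation and a conjugate normal prior on the mean, fixes a "known truth", and estimates how often the 95% posterior interval covers that truth over repeated samples. The function names and all numerical settings are hypothetical.

```python
import math
import random


def posterior_interval(xbar, n, sigma, prior_mean, prior_sd, z=1.96):
    """Central 95% posterior interval for a normal mean (known sigma, normal prior)."""
    prior_prec = 1.0 / prior_sd ** 2      # precision contributed by the prior
    data_prec = n / sigma ** 2            # precision contributed by the sample mean
    post_prec = prior_prec + data_prec
    post_mean = (prior_prec * prior_mean + data_prec * xbar) / post_prec
    post_sd = math.sqrt(1.0 / post_prec)
    return post_mean - z * post_sd, post_mean + z * post_sd


def coverage(true_theta, n, sigma, prior_mean, prior_sd, reps=20000, seed=0):
    """Fraction of simulated samples whose posterior interval contains true_theta."""
    rng = random.Random(seed)
    se = sigma / math.sqrt(n)             # sampling sd of the sample mean
    hits = 0
    for _ in range(reps):
        xbar = rng.gauss(true_theta, se)  # draw a sample mean under the known truth
        lo, hi = posterior_interval(xbar, n, sigma, prior_mean, prior_sd)
        if lo <= true_theta <= hi:
            hits += 1
    return hits / reps


if __name__ == "__main__":
    # Hypothetical settings: truth = 3, n = 10, sigma = 1.
    print("diffuse prior :", coverage(3.0, 10, 1.0, prior_mean=0.0, prior_sd=1e6))
    print("tight prior   :", coverage(3.0, 10, 1.0, prior_mean=0.0, prior_sd=1.0))
```

With a very diffuse prior the interval nearly coincides with the usual confidence interval, so its coverage should sit close to the nominal level; a tight prior centered away from the truth pulls the interval off target and coverage drops. The calculation uses the sampling distribution throughout, which is why the exchange above turns on whether it belongs to Bayesian reasoning proper.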

    As to needing a sound foundation, all I can say is that at the end of the day, science is whatever scientists do, and they manage to get it more-or-less right in aggregate. Many of them no doubt do use statistical methods of whatever philosophical extraction as a mere grab bag. Fish swim, birds fly, biologists do t-tests with n = 3 (….

    • What is philosophical Bayesianism? Perhaps that is a useful term. On the usefulness of considering hypothetical outcomes that could have arisen (with various probabilities under varying hypotheses), of course that is the core of frequentist error statistics. Perhaps you are suggesting that a Bayesian could deny, as they do, the relevance of using the sampling distribution (for evaluating what inferences are warranted to draw from data x) while asserting the relevance of using the sampling distribution “to evaluate the usefulness of a given data analytic procedure”? In the view I favor, evaluating warranted inferences to draw from x depends crucially on evaluating the (relevant) sampling distribution (at least approximately). I don’t see why Bayesians (philosophical or otherwise) would want to rob themselves of this type of reasoning.

      • Corey

        I’m reluctant to put myself forward as the arbiter of what philosophical Bayesianism is. Personally I’d say there are two main, somewhat fuzzy categories of philosophical Bayesianism. Both would argue that only Bayesianism has a shot at coming up with respectable philosophical foundations, but they disagree as to whether it’s personalistic/subjective Bayes or logical/objective Bayes that has the firm foundation.

        Incidentally, if you weren’t aware, Guest’s mention of 46656 types of Bayesians was a reference to I. J. Good’s mini-article on the subject, available here (…. When I refer to practice in line with philosophical Bayesianism, I’m thinking of the practice of statisticians of type 4(a)8(a) and 4(a)8(b).

  11. I would certainly deny that science is whatever scientists do. If I thought there was no such thing as pointing out bad science, pseudoscience, corrupt science, science in the service of enforcing politics/ideology rather than what is the case, and so on—I would have given up doing philosophy of science long ago. I think that standpoint is extremely dangerous! Moreover, what interests me is figuring out how they manage to get it right, that is…what is doing the work. I think this is an answerable question and it is one I have been working on—even though, mostly, in utter exile….

  12. “But you can multiply a joint prior on both those parameters by a likelihood that updates them both.”
    And where does that joint prior come from? The same data?

  13. stat


    Just out of curiosity, using your weighing example suppose the scale you use to weigh yourself gives errors e with a known distribution e~N(0,sigma).

    But before you carry out the frequentist analysis, you stand on a bridge that will collapse if you weigh more than 135lbs. The bridge doesn’t collapse, so you know for certain that you weigh less than 135lbs.

    How would you as a frequentist include this background information in your model? The likelihood is fixed by e~N(0,sigma) and further knowledge about your actual weight doesn’t affect that. So what new model would you use in its place?

    The reason I ask is that this information “weight<135" can trivially be included in the prior for your weight and it is known as a technical fact (the numbers work out correctly) that doing so will improve the estimate.

    So what model would you actually use in this case?
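The calculation this comment alludes to can be sketched as follows, under assumptions that are not stated in the thread: sigma is read as the standard deviation of the scale error, a single reading x is taken, and the prior is flat on weights below the 135 lb bound, so the posterior is a normal centered at x truncated at the bound. The function names, the true weight of 133 lb and sigma = 3 are hypothetical choices for illustration only.

```python
import math
import random


def normal_pdf(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def normal_cdf(z):
    """Standard normal CDF, written with erfc for accuracy in the lower tail."""
    return 0.5 * math.erfc(-z / math.sqrt(2.0))


def truncated_posterior_mean(x, sigma, upper):
    """Posterior mean of the weight given one reading x and a flat prior on weight < upper.

    The posterior is N(x, sigma^2) truncated to (-inf, upper); its mean is
    x - sigma * pdf(a) / cdf(a) with a = (upper - x) / sigma.
    """
    a = (upper - x) / sigma
    return x - sigma * normal_pdf(a) / normal_cdf(a)


def compare_mse(true_weight, sigma, upper, reps=20000, seed=0):
    """Simulated mean squared error of the raw reading and of the truncated posterior mean."""
    rng = random.Random(seed)
    se_raw = 0.0
    se_post = 0.0
    for _ in range(reps):
        x = rng.gauss(true_weight, sigma)  # one scale reading
        se_raw += (x - true_weight) ** 2
        se_post += (truncated_posterior_mean(x, sigma, upper) - true_weight) ** 2
    return se_raw / reps, se_post / reps


if __name__ == "__main__":
    mse_raw, mse_post = compare_mse(true_weight=133.0, sigma=3.0, upper=135.0)
    print("MSE of raw reading            :", mse_raw)
    print("MSE of truncated posterior mean:", mse_post)
```

The comparison is itself a sampling-property calculation: it judges the estimate by its error over repeated readings under a fixed true weight. It shows one setting only and does not address whether the bound is better handled through a prior or by restricting the parameter space.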

  14. Pingback: Entsophy
