Last part (3) of the deconstruction: beauty and background knowledge

Please see parts 1 and 2 and links therein. The background began in my Sept 12 post.

Gelman (2012) considers a case where the overall available evidence, E, is at odds with the indication of the results x from a given study:

Consider the notorious study in which a random sample of a few thousand people was analyzed, and it was found that the most beautiful parents were 8 percentage points more likely to have girls, compared to less attractive parents. The result was statistically significant (p<.05) and published in a reputable journal. But in this case we have good prior information suggesting that the difference in sex ratios in the population, comparing beautiful to less-beautiful parents, is less than 1 percentage point. A (non-Bayesian) design analysis reveals that, with this level of true difference, any statistically-significant observed difference in the sample is likely to be noise. At this point, you might well say that the original analysis should never have been done at all—but, given that it has been done, it is essential to use prior information (even if not in any formal Bayesian way) to interpret the data and generalize from sample to population.

Where did Fisher’s principle go wrong here? The answer is simple—and I think Cox would agree with me here. We’re in a setting where the prior information is much stronger than the data. (p. 3)
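To make the idea of a design analysis concrete, here is a minimal sketch in Python. It is not Gelman's own calculation: the true difference of 0.3 percentage points and the standard error of 3.3 percentage points are illustrative assumptions, and the function name design_analysis is made up for this example. Assuming the estimate is approximately normal, it computes the chance of a statistically significant result, the chance that a significant result has the wrong sign, and the factor by which significant results overstate the true difference on average.

```python
from statistics import NormalDist


def design_analysis(true_diff, se, alpha=0.05):
    """Return (power, wrong_sign_rate, exaggeration) for a two-sided z-test.

    Assumes the estimate is normally distributed around true_diff (> 0)
    with standard deviation se. All inputs are in percentage points.
    """
    z = NormalDist()
    cutoff = z.inv_cdf(1 - alpha / 2) * se  # |estimate| needed for significance

    upper = (cutoff - true_diff) / se    # standardized upper cutoff
    lower = (-cutoff - true_diff) / se   # standardized lower cutoff
    p_upper = 1 - z.cdf(upper)           # significant, correct (positive) sign
    p_lower = z.cdf(lower)               # significant, wrong (negative) sign
    power = p_upper + p_lower

    # Partial expectation of |estimate| over the two significant regions.
    mean_abs = (true_diff * p_upper + se * z.pdf(upper)) + (
        -true_diff * p_lower + se * z.pdf(lower)
    )
    exaggeration = (mean_abs / power) / true_diff
    return power, p_lower / power, exaggeration


# Illustrative (assumed) inputs, not figures taken from the study itself.
power, wrong_sign, exaggeration = design_analysis(true_diff=0.3, se=3.3)
print(f"power={power:.3f}, wrong sign={wrong_sign:.2f}, exaggeration={exaggeration:.0f}x")
```

Under these assumed inputs the power is barely above the 5 percent significance level, and a significant estimate has the wrong sign a sizable fraction of the time, which is the sense in which a significant difference "is likely to be noise."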

Let me simply grant Gelman that this prior information warrants (with severity) the hypothesis H:

H: “difference in sex ratios in the population, comparing beautiful to less-beautiful parents, is less than 1 percentage point,” (ibid.)

especially given my suspicions of the well-testedness of claims to show the effects of “beautiful to less-beautiful” on anything. I will simply take it as a given that it is well-tested background “knowledge.” Presumably, the well-tested claim goes beyond those individuals observed, and is generalizing at least to some degree. So we are given that the hypothesis H is one for which there is strong evidence.

For me this would mean, roughly, that if H were wrong and, in fact, there were as much as (or more than) a 1 percentage point difference in sex ratios in the population, comparing beautiful to less-beautiful parents, then these earlier studies would have indicated this. Or at least there are fairly good grounds for thinking they would have (or with high probability they would have) indicated a greater difference than they did. If, despite giving the effect a good chance to show itself, evidence in sync with a population difference of less than 1 percentage point is regularly found, then there is evidence for the absence of a greater effect.
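A rough numerical sketch of this reasoning, under a simple normal approximation, might look as follows. The pooled estimate of 0.3 percentage points and its standard error of 0.2 are hypothetical stand-ins for "the earlier studies," and severity_of_upper_bound is a name invented here. The function asks: if the true difference were as large as 1 point, how probable is it that the studies would have shown a larger difference than they did?

```python
from statistics import NormalDist


def severity_of_upper_bound(observed, se, bound):
    """Probability of a larger estimate than `observed` if the true
    difference were exactly `bound` (normal approximation)."""
    return 1 - NormalDist(mu=bound, sigma=se).cdf(observed)


# Hypothetical pooled result from earlier studies, in percentage points.
sev = severity_of_upper_bound(observed=0.3, se=0.2, bound=1.0)
print(f"severity for 'difference < 1 point': {sev:.4f}")
```

A value near 1 would correspond to the situation granted above, in which the effect was given a good chance to show itself and did not.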

I don’t think there’s much for Gelman to disagree with in this rough construal of the well-testedness of H (but if not, he will say so). My question is: Is Gelman suggesting that we translate this strong background information into a high prior for H?  I’m guessing that the answer is no, that he imagines using it in an informal, non-Bayesian way. How then might the background knowledge about H be taken account of in reporting  apparently conflicting data x?

I suggest that when it comes to the general hypothesis H, something like the following statement might be given:

If the statistical assumptions underlying this data x are approximately met, we would have an indication of a difference d (on sex ratio), but given what we already know, almost surely this study is in error somewhere (unless we are dealing with a very different population here).

Even without spotting the source of the error, this would be an entirely reasonable and informative report. It would not, of course, be a Bayesian updating to a posterior probability, but it would use the well-corroborated background information about the correctness of H to appraise the new data. Although he does not require it, would Gelman even allow blending the background together with the new information, rather than calling out the conflict?

To combine the presumed background knowledge with the anomalous data, even inferring something like:

as a result of data x, the degree of confirmation has gone down slightly for H

would seem very inadequate. Scientists interpreted anomalous data for Newton (in 1919 and before) as anomalies, and they did this despite the strong belief in Newton. (I once wrote a paper, “Duhem’s Problem, the Bayesian Way, and Error Statistics, or ‘What’s Belief Got to Do with It?’”.) But unlike the standard inductive Bayesian, Gelman seems to favor a testing (Bayesian) philosophy. Thus, I am guessing he would combine the information much as I do in the bold statement above. But I leave off with this question at the end.

My own view would go further to consult a general “repertoire of errors” with respect to the type of study: for example, might there be a difference in what counts as beautiful in this new study? Might a variety of other causal factors be responsible? Of course, there is always the possibility that the influence on sex ratio has changed from past inquiries. But if it is assumed that there is a universal generalization on the order of H, at least for this point in time (and without such an assumption Gelman wouldn’t be regarding the current data as in conflict with the known background), then there must be some explanation, even if there is no interest in discovering it.

If this all seems very obvious, that is exactly my point. We have no trouble (in science and in day-to-day reasoning)  incorporating this prior information. Here’s Gelman:

We’re in a setting where the prior information is much stronger than the data. If one’s only goal is to summarize the data, then taking the difference of 8% (along with a confidence interval and even a p-value) is fine. But if you want to generalize to the population—which was indeed the goal of the researcher in this example—then it makes no sense to stop there. (2012, p. 3)

From the perspective of the frequentist error statistician, the background knowledge in favor of H would be used even at the stage of criticizing the assumptions of the statistical inference (whatever the range of generalization of interest). A statistical inference, if it is an “inference” based on valid statistical premises, always goes strictly beyond the data, even if it is limited to an inference about what is responsible for the particular data x in this study.

So I’m not so sure the reported confidence interval or p-value would “be fine”; instead, we would have stopped even before getting there. Still, granting Gelman that it would “be fine” for “summarizing the data,” this is very different from using the data to infer that there is a genuine effect, even of the statistical variety (even with a limited population). Rather than asking where Fisher’s principle seems to go wrong (at least in the statements by Cox under discussion), we would emphasize that Fisher always denied, even with the lady tasting tea, that an “isolated record” of statistically significant results suffices:

In relation to the test of significance, we may say that a phenomenon is experimentally demonstrable when we know how to conduct an experiment which will rarely fail to give us a statistically significant result. (Fisher 1935, p.14)

It follows that, for Fisher, even showing the (statistical) reality of an experimental effect, much less showing the evidence of a specific explanation of the effect (be it an evolutionary story about beauty and daughters or something else), requires going beyond an “isolated record.” There is clearly a difference between

What this data x seems to indicate about H,

and

What all available scientific evidence E indicates about H,

where “indicates” is being used as a neutral  term, which the reader may substitute with others as they wish.

And, there are still more differences between what E indicates about H, and what is indicated about subsequent explanations for the real effect, once demonstrated (my levels again).

Beyond this, all statistical methods require choices—assumptions, if you will. . . . It’s just not possible to determine or even validate all one’s choices from the data at hand. (Gelman 2012, p. 4)

Now I understand the distinction Gelman has in mind, between merely describing data x and moving to much wider generalizations, but making inferences even from this limited data set x does not take place starting with a blank slate. Even in entirely informal situations outside statistics, interpreting “the data at hand” invokes background information. That doesn’t make all data “theory laden” in any interesting sense. So long as the aspects of the background information used, and the way they are used, do not interfere with what one is trying to learn, there is no reason to worry. (Nor need we even trot out all the background, much less believe in the truth of theories, to reliably use the background.)

But steering back to our main issue: How is David Cox using “prior information” in our conversation?  Cox is clear about using prior knowledge in analyzing the data. When Cox says in our conversation (see Sept 12 post or [i]):

In fact you have very clever ways of making sure that your analysis is valid even if the prior information is totally wrong. If you use the wrong prior information you just got an inefficient design, that’s all. (p. 105),

he is using “prior information” to refer to various uncorroborated beliefs, conjectures, hunches, whether or not they are influenced by policy, ethical, or other values, and for which no evidence is given. That is why he mentions the possibility that “the prior information is totally wrong”.[i]

It doesn’t really matter much whether one describes these background theories and assumptions as part of the design and modeling (as Cox did in the passages Gelman cites), or as part of the various moves: (1) from “an isolated statistical record” to a genuine experimental effect, and separately (2) from genuine effects to explanations and predictions. I can see where Gelman regards the latter as preferable, and I do too. Background information enters at all levels, not in the form of mathematical prior probability distributions[ii], but as a repertoire of existing severely tested background claims (as well as those poorly tested, and why), along with a cluster of current obstacles, flaws and foibles to build upon.

That ends my deconstruction of (the two relevant sections in) Gelman.  I have received two U-Phils which I will post next time*.

But I want to leave off with a general question (to which I invite responses):

Is the recommendation, in relation to his sex ratio example, where we grant him that  “the prior information is much stronger than the data,” to report the background along the lines of my statement above (in bold)? Or is the recommendation to turn the prior information into a prior probability distribution and combine it Bayesianly with the new, anomalous, data x?  The former is in the spirit of error statistical testing, at least for this case. (There may be other cases, obviously, where some kind of average or aggregation is what is wanted.) The latter is in the spirit of Bayesian confirmation theorists, at least in philosophy (given my background knowledge).
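To see what the second option would do with the numbers, here is a minimal sketch of a textbook normal-normal update. The prior (mean 0, standard deviation 0.3 percentage points) and the study's standard error (3.3 points) are assumptions chosen for illustration, and normal_update is a hypothetical helper, not anything from Gelman's paper.

```python
def normal_update(prior_mean, prior_sd, estimate, se):
    """Conjugate normal-normal update: a precision-weighted average."""
    prior_prec = 1 / prior_sd**2
    data_prec = 1 / se**2
    post_var = 1 / (prior_prec + data_prec)
    post_mean = post_var * (prior_prec * prior_mean + data_prec * estimate)
    return post_mean, post_var**0.5


# Assumed prior and standard error; 8.0 is the reported difference in points.
post_mean, post_sd = normal_update(prior_mean=0.0, prior_sd=0.3, estimate=8.0, se=3.3)
print(f"posterior mean={post_mean:.3f}, posterior sd={post_sd:.3f}")
```

With a prior this tight, the combined estimate stays close to the prior mean and the 8 point result all but disappears into the average, with nothing in the output to flag that the study and the background are in conflict.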

*the original U-Phil call was in my Sept 12 post, with a deadline of Sept. 25.


Gelman, A. (2012). Ethics and the statistical use of prior information.
Fisher, R. A. (1935). The Design of Experiments. Edinburgh: Oliver & Boyd.

[i] “A Statistical Scientist Meets a Philosopher of Science: A Conversation between Sir David Cox and Deborah Mayo,” published in Rationality, Markets and Morals.

[ii] We do not deny there are some cases where background comes in via prior probabilities (I discuss this elsewhere); but even then, there will be other pieces of information that would not enter as formal probabilities.

14 thoughts on “Last part (3) of the deconstruction: beauty and background knowledge”

  1. Reader

    A confidence interval outputs an estimate. The estimate goes beyond the observed data to indicate something about a population. It is not just “to summarize the data”. The same goes for a p-value. The adequacy of the statistical model is assumed in both. A separate generalization to an even wider population may be of interest, but the statistical estimate from x is generalizing still to the data generating mechanism giving rise to the data. Yes?

    • guest

      Reader: the bald numeric value of a confidence interval can be seen, if one wishes, as just a numerical operation on the data, and therefore as just a summary of the data. To do so requires no assumptions.

      If the two numbers the interval comprises come with a statement that 95% of the time an interval constructed in the same way will cover the true value of some parameter of interest, then I agree we are going a bit further and some assumptions are required, not least about how the data was generated. But to do so is not mandatory.

  2. Reader

    Guest: ? I can do lots of numerical operations on “the data” (which can be combined in various ways), and that doesn’t make them summaries of the data, or meaningful claims about what was observed. In one confidence interval I might get: m x + e. or maybe e/2- x < m + 69 < x + e/2, or maybe any number of things. Are they all just data summaries? Without the distribution, and names for each term, and statistic, these can't be regarded as summarizing the data; nor can you get the likelihoods of the data.

    • guest

      Reader: yes, numerical operations on the data, including those that spit out two numbers not one, are giving you summaries of the data. They certainly are “claims about what was observed” – they’re *statements* about what was observed.

      But no distributional assumptions or similar are required for one to state that e.g. “the average height in the sample was 162.4cm, and the standard deviation was 10.1cm”. Similarly, one can state, as a simple summary, that “the sample mean deviation of height from 160cm divided by the sample standard deviation over the square root of the sample size was 1.96” or indeed that “Moving out from the sample mean by 1.96 times the sample standard deviation divided by the square root of the sample size in either direction, we reach 159.9cm and 164.9 cm.” I could go on, but I imagine you get the idea.

      > Without the distribution, and names for each term, and statistic,
      > these can’t be regarded as summarizing the data; nor can you
      > get the likelihoods of the data.

      Not true, except for the bit about names. We don’t need to specify distributions in order to report e.g. the sample mean, see above. In some cases (certainly not all) one can use summaries like those above to get a likelihood for the original data – if the model is agreed upon, and the summaries provide sufficient statistics. Finally, it’s quite possible, in some situations, that one is able to use the summaries to provide reasonable inference without any use of likelihood functions.
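
      The kind of summaries described in this comment can be written as plain arithmetic on a list of numbers. The sketch below uses made-up heights (not the sample the comment has in mind) and only the Python standard library; nothing in it refers to a distribution or a parameter, which is the sense in which the outputs are "just" summaries.

```python
from math import sqrt
from statistics import mean, stdev

# Hypothetical heights in cm, invented for illustration.
heights = [150.2, 158.9, 171.3, 162.0, 166.5, 155.7, 169.8, 164.1]

n = len(heights)
m = mean(heights)
s = stdev(heights)          # sample standard deviation (n - 1 denominator)
step = s / sqrt(n)          # sample sd over the square root of the sample size

ratio = (m - 160) / step    # deviation of the mean from 160cm, in units of `step`
low, high = m - 1.96 * step, m + 1.96 * step  # "moving out" 1.96 steps each way

print(f"mean={m:.1f}cm sd={s:.1f}cm ratio={ratio:.2f} endpoints=({low:.1f}, {high:.1f})")
```

      Whether the two endpoints deserve to be called a confidence interval is a separate question, and that one does depend on assumptions about how the data were generated.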

      • Reader

        “If the model is agreed upon, and the summaries provide sufficient statistics” are statistical assumptions. The discussion presumed that summarizing the statistical information was meant; true, there’s nothing stopping any number of descriptions of what someone happens to record, e.g., what time of day, what order, biggest, smallest, and whatever operations one likes. Anyway, this is at right angles to the point that background knowledge in favor of H could be used even at the stage of criticizing the assumptions, to consider, as Hennig remarks, that “a) Something went wrong”.

  3. What I want to know is what Gelman’s Bayesian is recommending for the case that he brings up, where we are presumed to know H: there’s no more than a percentage point difference in sex ratios in the population, comparing beautiful to less-beautiful parents. Do we blend that background knowledge with the anomalous new data, e.g., by multiplying probabilities one might try to attach to H? Or do we communicate that the new data are in conflict, without blending? We might say something akin to my statement:

    If the statistical assumptions underlying this data x are approximately met, we would have an indication of a difference d (on sex ratio), but given what we already know, almost surely this study is in error somewhere (unless we are dealing with a very different population here).

  4. Christian Hennig

    1) I’d in fact be very curious about the nature of the background knowledge in this situation and how Gelman would translate this into a prior. If it’s just more studies of the same kind, the frequentist could run a meta-analysis (and wouldn’t require a prior even before the first study, as opposed to the Bayesian).

    2) Here is another issue with Gelman’s comments on this case. If a study brings forth such a surprising result, there are in principle three possible explanations:
    a) Something went wrong.
    b) Nothing went wrong and the truth is just surprisingly far away from the prior belief.
    c) Nothing went wrong, the truth is about what was believed a priori to be true, but in the study something very unlikely happened by chance, as it sometimes does.
    Now I see how incorporating a prior can deal with b and c, but standard Bayesian updating cannot deal with a. Bayesian updating will just treat the new data as if they are fine, and if there are enough to override the prior belief, this will happen.
    If there is something generally wrong with the study design or data gathering, however, the new data probably shouldn’t be used at all, not even for Bayesian updating. So if results are so outlandish that it looks more likely that something went wrong than that the data point in the wrong direction by a random accident, there should be a possibility to discard them.
    I think that the Bayesian framework in principle allows for this (by introducing a study specific effect which with some large (?) prior probability is zero but may be far away) but I hardly ever see Bayesians do such a thing, and in principle, if Bayesians want to handle potential systematic bias in a study, they probably should always do it. (Actually perhaps Stephen Senn knows more because such things may be used in Bayesian meta-analysis.)
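
    One way the study-specific effect mentioned in this comment could be set up is sketched below. Every number is an assumption made for illustration (prior on the true difference, spread of a possible bias, a 10 percent prior chance that the study is biased), and prob_study_biased is a hypothetical helper rather than an established method from the literature. The bias is zero with high prior probability and otherwise drawn from a wide normal, and the function returns the posterior probability that the bias is nonzero.

```python
from statistics import NormalDist


def prob_study_biased(estimate, se, prior_mean, prior_sd, bias_sd, prior_bias=0.1):
    """Posterior probability that the study carries a nonzero bias.

    Model (all assumed): true difference ~ N(prior_mean, prior_sd^2);
    bias is 0 with probability 1 - prior_bias, else ~ N(0, bias_sd^2);
    estimate ~ N(true difference + bias, se^2).
    """
    sd_ok = (prior_sd**2 + se**2) ** 0.5
    sd_biased = (prior_sd**2 + se**2 + bias_sd**2) ** 0.5
    like_ok = NormalDist(prior_mean, sd_ok).pdf(estimate)          # no bias
    like_biased = NormalDist(prior_mean, sd_biased).pdf(estimate)  # bias present
    num = prior_bias * like_biased
    return num / (num + (1 - prior_bias) * like_ok)


# Units are percentage points; 8.0 mimics a surprising result, 0.5 an unsurprising one.
surprising = prob_study_biased(estimate=8.0, se=3.3, prior_mean=0.0, prior_sd=0.5, bias_sd=10.0)
unsurprising = prob_study_biased(estimate=0.5, se=3.3, prior_mean=0.0, prior_sd=0.5, bias_sd=10.0)
print(f"P(biased | surprising)={surprising:.2f}, P(biased | unsurprising)={unsurprising:.2f}")
```

    Under these assumptions a result far from the prior raises the probability of "something went wrong" above its prior value, while a result near the prior lowers it.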

    • Christian: Yes, I think this is essentially my point. The example to which Gelman appeals, as possibly problematic for a non-Bayesian use of background information, is actually best dealt with in a non-Bayesian manner (something along the lines of my statement, in bold). Moreover, this “background information” if we grant its validity, should point us toward critically examining the assumptions of the new, anomalous data, rather than reporting either a confidence interval estimate or a p-value. There’s no error probability warrant otherwise. But the main issue is the former. The Bayesian at best seems to have to jump through hoops to make sensible scientific use of background information. Maybe this is why Gelman says a Bayesian wants everybody else to be a non-Bayesian. The problem is that this alone does not tell us how we ought to be using the background. Nor are presumed likelihoods enough. Error statistical principles of evidence do better here, even if it’s only in directing a qualitative use, and a building up of a repertoire of design and interpretation considerations (for the general problem area). This is the Cox-Fisher position as well, it seems to me.

    • Christian:

      I will respond to Mayo’s questions later, but just very quickly, in response to your comments above:

      1. In the example I’m discussing, the prior information really is stronger than the data. The background information I’m talking about is a large literature on sex ratios. David Weakliem and I published an article all about that example, and one of the things we did in our paper was demonstrate a non-Bayesian analysis that used the prior information. I don’t think every analysis needs to be Bayesian, nor do I believe that Bayes is the only way to include prior information in an analysis. Cox expressed support for the attitude that it’s ok to use prior information in designing a study but not in the analysis. I disagree; I think it’s a good idea to use prior information at both stages. This use could be Bayesian or non-Bayesian.

      2. The study did not “bring forth such a surprising result.” It was, rather, a borderline-significant finding (depending on how you measure it, the observed difference was somewhere between 1.2 and 2.4 standard errors away from zero) that was consistent with the prior. The result was also consistent with lots of silly claims that were not consistent with scientific knowledge. That’s what happens when you have a very small study: you get noisy estimates and can make all sorts of silly estimates if you ignore the prior information.

      3. I’m sorry that you “hardly ever see Bayesians do such a thing” as check the fit of models to data. I recommend you take a look at chapter 6 of Bayesian Data Analysis. We actually discuss this general idea on the very first page of chapter 1 of our book! It’s been a crusade of mine for twenty years to understand and explain how Bayesians check the fit of their models, so I’m really really really not the guy you want to be criticizing on this point.

      • Mark

        To me, what seems to be missing from this entire discussion, and what would go a long way toward indicating the usefulness, or not, of prior information, is the intended population to which we are intending to make inference (personally, I see this as an issue of design). Are the inferences intended for a (reasonably well-defined) finite population, or are they intended to refer to a law of nature? If the former, and if one could really objectively segregate parents based on their beauty, then prior information really has nothing to do with whether or not the finding could currently be true in the specified population. If the latter, and if the claim is that this law of nature has always been true or has recently been increasing or whatever, then clearly prior information is important (however one would choose to use it). On the other hand, if the claim is that we have evidence that the law is now true, then the prior information wouldn’t seem to matter except by indicating that prior evidence would suggest that it wasn’t true last year, or whenever… Regardless, if this were the claim then, in the spirit, I believe, of Fisher’s quote, I would simply say “prove it, replicate your results in a well-designed study.”

        Anyway, with that ramble behind me (with apologies), I think that saying something akin to Deborah’s proposed statement would be fine (and in line with what I’ve said above).

        • Mark: Thanks for your comment. I think this echoes my main point entirely, and I’m glad someone is getting back to it!

      • Christian Hennig

        Andrew: Regarding your point 3, that seems to be a misunderstanding. I was not talking about “checking the fit of models and data” but rather about designing prior distributions in such a way that they reserve some prior probability for the possibility that the new data may be biased or faulty or something like that, if it deviates too much from what can be expected a priori.

  5. Eileen

    “He imagines using [background] in an informal, non-Bayesian way”. OK, so what’s the Bayesian advantage? I also do not get what Gelman is on about in his article….

    • Hi Eileen: I don’t know that there is a Bayesian advantage at all. As for Gelman’s paper, my deconstruction was an attempt to unravel it. Of course, my interest is in how these issues bear much more generally on Bayesian, in contrast to frequentist or error statistical, accounts. I wouldn’t have looked at it if it didn’t reflect a very common issue (as was discussed in the Cox-Mayo 2011 conversation to which Gelman refers): the idea that frequentists advocate ignoring background information. As I say, “making inferences even from this limited data set x does not take place starting with a blank slate”. Granted, the use of background information is context-dependent and question-dependent. It would be absurd to try to cover all cases with a recipe-like, automatic, crank-out formalism, and one of my points is that (sensible) Bayesians don’t really advocate that either. Gelman seems to go much further away (from standard? Bayesians) in not viewing inference in terms of Bayesian updating via Bayes’ theorem. Well, I’m repeating myself now; try rereading the first 2 deconstruction posts on this.
