Posts Tagged With: double-counting

Going Where the Data Take Us

A reader, Cory J, sent me a question in relation to a talk of mine he once attended:

I have the vague ‘memory’ of an example that was intended to bring out a central difference between broadly Bayesian methodology and broadly classical statistics.  I had thought it involved a case in which a Bayesian would say that the data should be conditionalized on, and supports H, whereas a classical statistician effectively says that the data provides no support to H.  …We know the data, but we also know of the data that only ‘supporting’ data would be given us.  A Bayesian was then supposed to say that we should conditionalize on the data that we have, even if we know that we wouldn’t have been given contrary data had it been available.

That only “supporting” data would be presented need not be problematic in itself; it all depends on how this is interpreted. There might be no negative results to be had (H might be true), and thus none to “be given us”. Your last phrase, however, does describe a pejorative case for a frequentist error statistician: if “we wouldn’t have been given contrary data” to H (in the sense of data in conflict with what H asserts), even “had it been available”, then the procedure had no chance of finding or reporting flaws in H. Thus only data in accordance with H would be presented, even if H is false; so H passes a “test” with minimal stringency or severity. I discuss several examples in the papers below (I think the reader had in mind Mayo and Kruse 2001).
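To make the pejorative case concrete, here is a minimal simulation sketch (in Python; the setup and numbers are purely illustrative and not taken from the talk or the papers mentioned). A reporter collects many batches of data on a treatment that in fact has no effect, and hands over only a batch that appears to support H; the chance that H “passes” is then very high even though H is false, which is just what minimal stringency or severity amounts to.

```python
import numpy as np

rng = np.random.default_rng(0)

# H asserts the treatment improves the outcome (true mean > 0).
# In this simulation the truth is "no effect" (true mean = 0), so H is false.
n_per_batch = 20     # observations per batch
n_batches = 60       # batches available to the selective reporter
n_sims = 5000
t_cutoff = 1.729     # one-sided 5% critical value for t with 19 df

passes = 0
for _ in range(n_sims):
    batches = rng.normal(loc=0.0, scale=1.0, size=(n_batches, n_per_batch))
    means = batches.mean(axis=1)
    ses = batches.std(axis=1, ddof=1) / np.sqrt(n_per_batch)
    t_stats = means / ses
    # Only a batch that appears to support H is ever handed over;
    # contrary batches are simply never shown.
    if np.any(t_stats > t_cutoff):
        passes += 1

print(f"P(H 'passes' | H is false) ≈ {passes / n_sims:.2f}")
# With 60 batches to pick from, roughly 1 - 0.95**60 ≈ 0.95: the procedure
# had essentially no chance of reporting a flaw in H, i.e. minimal severity.
```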


The UN Charter: double-counting and data snooping

John Worrall, 26 Nov. 2011

Last night we went to a 65th birthday party for John Worrall, philosopher of science and guitarist in his band Critique of Pure Rhythm. For the past 20 or more of these years, Worrall and I have been periodically debating one of the most contested principles in philosophy of science: whether evidence in support of a hypothesis or theory should in some sense be “novel.”

A novel fact for a hypothesis H may be: (1) one not already known, (2) one not already predicted (or counter-predicted) by available hypotheses, or (3) one not already used in arriving at or constructing H. The first corresponds to temporal novelty (Popper), the second, to theoretical novelty (Popper, Lakatos), the third, to heuristic or use-novelty. It is the third, use-novelty (UN), best articulated by John Worrall, that seems to be the most promising at capturing a common intuition against the “double use” of evidence:

If data x have been used to construct a hypothesis H(x), then x should not be used again as evidence in support of H(x).

(Note: Writing H(x) in this way emphasizes that, one way or another, the inferred hypothesis was selected or constructed to fit or agree with data x. The particular instantiation can be written as H(x0).)

The UN requirement, or, as Worrall playfully puts it, the “UN Charter,” is this:

Use-novelty requirement (UN Charter): for data x to support hypothesis H (or for x to be a good test of H), H should not only agree with or “fit” the evidence x, but x itself must not have been used in H’s construction.

In practice, the requirement shows up as a general prohibition against data mining, hunting for significance, tuning on the signal, ad hoc hypotheses, and data peeking, and as a preference for predesignated hypotheses and novel predictions.

The intuition underlying the UN requirement seems straightforward: it is no surprise that data x fit H(x) when H(x) was deliberately constructed to accord with x, and yet x is then used once again to support H(x). To use x both to construct and to support a hypothesis is to face the accusation of illicit “double-counting.” In order for x to count as genuine evidence for a hypothesis, we need to be able to say that so good a fit between data x and H would be practically impossible or extremely improbable (or an extraordinary coincidence, or the like) if in fact it were a mistake to regard x as evidence for H.

In short, the epistemological rationale for the UN requirement is essentially the intuition informing the severity demand associated with Popper. The disagreement between Worrall and me has largely turned on whether severity can be satisfied even in cases of UN violation (Worrall 2010).

I deny that UN is necessary (or sufficient) for good tests or warranted inferences: there are severe tests that are non-novel, and novel tests that are not severe. Various types of UN violations do alter severity, by altering the error-probing capacities of tests. Without claiming that it is easy to determine just when this occurs, the severity requirement at least provides a desideratum for discriminating problematic from unproblematic types of double-counting.

The severity requirement also aims to explain why we often have conflicting intuitions about the novelty requirement. On the one hand, it seems clear that were you to search out several factors and report only those that show (apparently) impressive correlations, there would be a high probability of erroneously inferring a real correlation. But it is equally clear that we can and do reliably use the same data both to arrive at and to warrant hypotheses: in forensics, for example, where DNA is used to identify a criminal; in using statistical data to check whether a model’s own assumptions are satisfied; as well as in common realms such as measurement (inferring, say, my weight gain after three days in London). Here, although any inferences (about the criminal, the model assumptions, my weight) are constructed to fit or account for the data, they are deliberately constrained to reflect what is correct, at least approximately. We use the data all right, but we go where they take us (not where we want them to go).
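The first intuition can be given a rough quantitative face with a small simulation sketch (in Python; the setup and numbers are illustrative only): an outcome is correlated with 20 candidate factors that are in fact pure noise, and only the most impressive correlation is reported.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_trials, n_subjects, n_factors, alpha = 5000, 30, 20, 0.05

# Two-sided 5% cutoff for a single sample correlation:
# under independence, r*sqrt((n-2)/(1-r^2)) follows a t distribution with n-2 df.
t_crit = stats.t.ppf(1 - alpha / 2, df=n_subjects - 2)
r_crit = t_crit / np.sqrt(n_subjects - 2 + t_crit**2)

hits = 0
for _ in range(n_trials):
    y = rng.standard_normal(n_subjects)                # outcome: pure noise
    X = rng.standard_normal((n_subjects, n_factors))   # candidate factors: pure noise
    r = [abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(n_factors)]
    if max(r) > r_crit:            # report only the "best" factor found
        hits += 1

print(f"P(at least one 'impressive' correlation | no real correlations) ≈ {hits / n_trials:.2f}")
# Roughly 1 - 0.95**20 ≈ 0.64, far above the nominal 0.05 for a single factor.
```

Had a single factor been predesignated, the corresponding error probability would stay at the nominal 0.05; it is the search-and-report rule that inflates it.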

What matters is not whether H was deliberately constructed to accommodate data x. What matters is how well the data, together with background information, rule out ways in which an inference to H can be in error. Or so I have argued [1].

I claim that if we focus on the variety of “use-construction rules” and the associated mistakes that need to be ruled out or controlled in each case, we can zero in on the problematic cases. Even where UN violations alter the error-probabilistic properties of our tools, recognizing this can lead us to correct the overall severity assessment.
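Continuing the illustrative sketch above: once the use-construction rule (“search 20 factors and report the most impressive correlation”) is made explicit, its error-probing characteristics can be recomputed, here with a simple Bonferroni-style adjustment of the cutoff, and the severity assessment corrected accordingly.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n_trials, n_subjects, n_factors, alpha = 5000, 30, 20, 0.05

def r_cutoff(per_test_alpha, n):
    """Two-sided cutoff for |sample correlation| at the given per-test level."""
    t = stats.t.ppf(1 - per_test_alpha / 2, df=n - 2)
    return t / np.sqrt(n - 2 + t**2)

r_naive = r_cutoff(alpha, n_subjects)                  # pretends no search took place
r_adjusted = r_cutoff(alpha / n_factors, n_subjects)   # accounts for the 20-factor search

naive_errors = adjusted_errors = 0
for _ in range(n_trials):
    y = rng.standard_normal(n_subjects)
    X = rng.standard_normal((n_subjects, n_factors))
    r_max = max(abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(n_factors))
    naive_errors += r_max > r_naive
    adjusted_errors += r_max > r_adjusted

print(f"erroneous report rate, ignoring the search:      ≈ {naive_errors / n_trials:.2f}")
print(f"erroneous report rate, adjusting for the search: ≈ {adjusted_errors / n_trials:.2f}")
# The first sits around 0.64; the second returns to roughly the advertised 0.05.
```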

Despite some differences, there are intriguing parallels between how this debate has arisen in philosophy and in statistics. Traditionally, philosophers who deny that an appraisal of evidence can or should be altered by UN considerations have adhered to “logical theories of confirmation.” As Alan Musgrave notes:

According to modern logical empiricist orthodoxy, in deciding whether hypothesis h is confirmed by evidence e, and how well it is confirmed, we must consider only the statements h and e, and the logical relations between them. It is quite irrelevant whether e was known first and h proposed to explain it, or whether e resulted from testing predictions drawn from h. (Musgrave 1974, 2)

These logical theories of confirmation have an analogy in formal statistical accounts that obey the likelihood principle:

The likelihood principle implies . . . the irrelevance of predesignation, of whether an hypothesis was thought of beforehand or was introduced to explain known effects. (Rosenkrantz 1977, 122)

A prime example of a UN violation is one in which a hypothesis or theory contains an “adjustable” or free parameter, which is then “tied down” on the basis of data (in order to accord with it).
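A toy sketch (in Python, with illustrative numbers only) of why tying down a free parameter can drain the resulting fit of probative value: within a family containing an adjustable parameter, the tuned value fits the data at least as well as any predesignated value, whatever actually generated the data.

```python
import numpy as np

rng = np.random.default_rng(3)

# Data actually generated with mean 0, so a hypothesis asserting mean = 2 is false.
x = rng.normal(loc=0.0, scale=1.0, size=25)

theta_fixed = 2.0        # a predesignated value of the adjustable parameter
theta_tuned = x.mean()   # the parameter "tied down" on the basis of the data

def sum_sq_resid(theta, data):
    """Sum of squared residuals: how well 'mean = theta' fits the data."""
    return float(np.sum((data - theta) ** 2))

print(f"fit of the predesignated hypothesis:   {sum_sq_resid(theta_fixed, x):.1f}")
print(f"fit of the use-constructed hypothesis: {sum_sq_resid(theta_tuned, x):.1f}")
# The tuned hypothesis fits best by construction, no matter what data arise,
# so its good fit, taken by itself, cannot count as surviving a stringent test.
```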

Bayesians looking to justify the preference against such UN violations (without violating the likelihood principle) typically expect it to show up in the prior probability assignments. For instance, Jim Berger, in statistics, and Roger Rosenkrantz, in philosophy of science, maintain that a theory free of adjustable parameters is “simpler” and therefore enjoys a higher prior probability. There is a long history of this type of move based on different kinds of simplicity considerations. Conversely, according to philosopher John Earman (discussing GTR, the general theory of relativity): “On the Bayesian analysis,” the countenancing of parameter fixing that we often see in science “is not surprising, since it is not at all clear that GTR deserves a higher prior than the [use-constructed rivals to GTR]” (Earman 1992, 115). He continues: “Why should the prior likelihood of the evidence depend upon whether it was used in constructing T?” (p. 116).

Given the complexity and the competing intuitions, it’s no surprise that Bayesians appear to hold different positions here. Andrew Gelman tells me that Bayesians have criticized his (Bayesian?) techniques for checking models on the grounds that they commit double-counting (and thereby have problems with power?). I’m unsure exactly what the critical argument involves. Frequentist model-checking techniques are deliberately designed to allow computing error probabilities for questions about the model’s assumptions, distinct from those needed to answer the primary question. Whether this error-statistical distinction is relevant to Gelman’s “double counting” I cannot say.
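As a rough illustration of the frequentist point (a sketch only; it does not purport to represent Gelman’s checks or his critics’ argument): a statistic for checking an assumption can be chosen so that its distribution under the assumed model does not depend on the primary parameter, which is what licenses a separate error probability for the assumption question.

```python
import numpy as np

rng = np.random.default_rng(4)

def lag1_autocorr(series):
    """Lag-1 autocorrelation of residuals from the sample mean,
    used here as a check statistic for the independence assumption."""
    r = series - series.mean()
    return float(np.sum(r[:-1] * r[1:]) / np.sum(r**2))

n = 40

# Null distribution of the check statistic under the assumed IID normal model.
# The statistic is location- and scale-invariant, so this distribution does not
# depend on the unknown mean -- the primary parameter of interest.
null_stats = np.array([lag1_autocorr(rng.standard_normal(n)) for _ in range(10_000)])
cutoff = np.quantile(null_stats, 0.95)   # 5% error probability for the assumption check

# Hypothetical observed series with genuine serial dependence (assumption violated).
eps = rng.standard_normal(n)
y = 10.0 + 0.5 * np.cumsum(eps) + eps
print(f"check statistic = {lag1_autocorr(y):.2f}, 5% cutoff = {cutoff:.2f}")
```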

Earman, J. 1992. Bayes or Bust? A Critical Examination of Bayesian Confirmation Theory. Cambridge, MA: MIT Press.

Musgrave, A. 1974. Logical versus historical theories of confirmation. British Journal for the Philosophy of Science 25:1-23.

Rosenkrantz, R. 1977. Inference, Method and Decision: Towards a Bayesian Philosophy of Science. Dordrecht, The Netherlands: D. Reidel.

Worrall, J. 2010. Theory, confirmation and novel evidence. In Error and Inference: Recent Exchanges on Experimental Reasoning, Reliability, and the Objectivity and Rationality of Science, edited by D. Mayo and A. Spanos, 125-154. Cambridge: Cambridge University Press.

[1] For my discussions on the novelty and severity business (updated Feb. 24, 2021):

