
# Monthly Archives: November 2011

## If you try sometime, you find you get what you need!

## The UN Charter: double-counting and data snooping

John Worrall, 26 Nov. 2011

Last night we went to a 65th birthday party for John Worrall, philosopher of science and guitarist in his band *Critique of Pure Rhythm*. For 20 or more of those years, Worrall and I have been periodically debating one of the most contested principles in philosophy of science: whether evidence in support of a hypothesis or theory should in some sense be “novel.”

A novel fact for a hypothesis *H* may be: (1) one not already known, (2) one not already predicted (or counter-predicted) by available hypotheses, or (3) one not already used in arriving at or constructing *H*. The first corresponds to *temporal novelty* (Popper), the second to *theoretical novelty* (Popper, Lakatos), the third to *heuristic* or *use-novelty*. It is the third, use-novelty (UN), best articulated by John Worrall, that seems the most promising at capturing a common intuition against the “double use” of evidence:

If data x have been used to construct a hypothesis *H(x)*, then x should not be used again as evidence in support of *H(x)*.

(Note: writing *H(x)* in this way emphasizes that, one way or another, the inferred hypothesis was selected or constructed to fit or agree with data x. The particular instantiation can be written as *H(x_{0})*.)

The UN requirement, or, as Worrall playfully puts it, the “UN Charter,” is this:

Use-novelty requirement (UN Charter): for data x to support hypothesis *H* (or for x to be a good test of *H*), *H* should not only agree with or “fit” the evidence x, but x itself must not have been used in *H*’s construction.

The UN requirement has surfaced as a general prohibition against data mining, hunting for significance, tuning on the signal, ad hoc hypotheses, and data peeking, and as a preference for predesignated hypotheses and novel predictions.

The intuition underlying the UN requirement seems straightforward: it is no surprise that data x fits *H(x)*, if *H(x)* was deliberately constructed to accord with data x, and then x is used once again to support *H(x)*. To use x both to construct and to support a hypothesis is to face the accusation of illicit “double-counting.” In order for x to count as genuine evidence for a hypothesis, we need to be able to say that so good a fit between data x and *H* is practically impossible or extremely improbable (or an extraordinary coincidence, or the like) if in fact it is a mistake to regard x as evidence for *H*.

In short, the epistemological rationale for the UN requirement is essentially the intuition informing the severity demand associated with Popper. The disagreement between me and Worrall has largely turned on whether severity can be satisfied even in cases of UN violation (Worrall 2010).

I deny that UN is necessary (or sufficient) for good tests or warranted inferences—there are severe tests that are non-novel, and novel tests that are not severe. Various types of UN violations do alter severity, by altering the error-probing capacities of tests. Without claiming that it is easy to determine just when this occurs, the severity requirement at least provides a desideratum for discriminating problematic from unproblematic types of double-counting.

The severity account also aims to explain why we often have conflicting intuitions about the novelty requirement. On the one hand, it seems clear that were you to search out several factors and report only those that show (apparently) impressive correlations, there would be a high probability of erroneously inferring a real correlation. But it is equally clear that we can and do reliably use the same data both to arrive at and to warrant hypotheses: in forensics, for example, where DNA is used to identify a criminal; in using statistical data to check whether a model satisfies its own assumptions; as well as in common realms such as measurement—inferring, say, my weight gain after three days in London. Here, although any inferences (about the criminal, the model assumptions, my weight) are constructed to fit or account for the data, they are deliberately constrained to reflect what is correct, at least approximately. We use the data all right, but we go where the data take us (not where we want them to go).

*What matters is not whether H was deliberately constructed to accommodate data x. What matters is how well the data, together with background information, rule out ways in which an inference to H can be in error. Or so I have argued.* [1]

I claim that if we focus on the variety of “use-construction rules” and associated mistakes that need to be ruled out or controlled in each case, we can zero in on the problematic cases. Even where UN violations can alter the error-probabilistic properties of tools, this recognition can lead us to correct overall severity assessments.

Despite some differences, there are intriguing parallels between how this debate has arisen in philosophy and in statistics. Traditionally, philosophers who deny that an appraisal of evidence can or should be altered by UN considerations have adhered to “logical theories of confirmation.” As Alan Musgrave notes:

According to modern logical empiricist orthodoxy, in deciding whether hypothesis *h* is confirmed by evidence *e*, and how well it is confirmed, we must consider only the statements *h* and *e*, and the logical relations between them. It is quite irrelevant whether *e* was known first and *h* proposed to explain it, or whether *e* resulted from testing predictions drawn from *h*. (Musgrave 1974, 2)

These logical theories of confirmation have an analogy in formal statistical accounts that obey the likelihood principle:

The likelihood principle implies . . . the irrelevance of predesignation, of whether an hypothesis was thought of beforehand or was introduced to explain known effects. (Rosenkrantz 1977, 122)

A prime example of a UN violation is one in which a hypothesis or theory contains an “adjustable” or free parameter, which is then “tied down” on the basis of data (in order to accord with it).

Bayesians looking to justify the preference against such UN violations (without violating the likelihood principle) typically look for it to show up in prior probability assignments. For instance, Jim Berger, in statistics, and Roger Rosenkrantz, in philosophy of science, maintain that a theory that is free of adjustable parameters is “simpler” and therefore enjoys a higher prior probability. There is a long history of this type of move based on different kinds of simplicity considerations. Conversely, according to philosopher John Earman (discussing GTR): “On the Bayesian analysis,” the countenancing of parameter fixing that we often see in science “is not surprising, since it is not at all clear that GTR deserves a higher prior than the [use-constructed rivals to GTR]” (Earman 1992, 115). He continues: “Why should the prior likelihood of the evidence depend upon whether it was used in constructing T?” (p. 116).

Given the complexity and competing intuitions, it’s no surprise that Bayesians appear to hold different positions here. Andrew Gelman tells me that Bayesians have criticized his (Bayesian?) techniques for checking models on the grounds that they commit double-counting (and thereby have problems with power?). I’m unsure what exactly the critical argument involves. Frequentist model checking techniques are deliberately designed to allow computing error probabilities for the questions about assumptions, distinct from those needed to answer the primary question. Whether this error statistical distinction can be relevant for Gelman’s “double counting” I cannot say.

Earman, J. 1992. *Bayes or Bust? A Critical Examination of Bayesian Confirmation Theory.* Cambridge, MA: MIT Press.

Musgrave, A. 1974. Logical versus historical theories of confirmation.* British Journal for the Philosophy of Science* 25:1-23.

Rosenkrantz, R. 1977. *Inference, Method and Decision: Towards a Bayesian Philosophy of Science.* Dordrecht, The Netherlands: D. Reidel.

Worrall, J. 2010. Theory, confirmation and novel evidence. In *Error and Inference: Recent Exchanges on Experimental Reasoning, Reliability, and the Objectivity and Rationality of Science*, edited by D. Mayo and A. Spanos, 125-154. Cambridge: Cambridge University Press.

[1] For my discussions on the novelty and severity business (updated Feb. 24, 2021):

- 1991, “Novel Evidence and Severe Tests,” *Philosophy of Science* 58: 523-52.
- 1996, *Error and the Growth of Experimental Knowledge*, chs. 8, 9, 10.
- 2001, “Principles of Inference and their Consequences,” with M. Kruse, in *Foundations of Bayesianism*, edited by D. Corfield and J. Williamson, 381-403.
- 2008, “How to Discount Double Counting When It Counts,” *British Journal for the Philosophy of Science* 59: 857-79.
- 2006, “Frequentist Statistics as a Theory of Inductive Inference,” with D. R. Cox, in *Optimality: The Second Erich L. Lehmann Symposium*, edited by J. Rojo, 77-97. Lecture Notes-Monograph Series, vol. 49.
- 2010, “An Ad Hoc Save of a Theory of Adhocness? Exchanges with John Worrall,” in *Error and Inference: Recent Exchanges on Experimental Reasoning, Reliability and the Objectivity and Rationality of Science* (D. Mayo and A. Spanos, eds.), Cambridge: Cambridge University Press, 155-169.
- 2014, “Some surprising facts about (the problem of) surprising facts” (from the Dusseldorf Conference, February 2011), *Studies in History and Philosophy of Science* 45: 79-86.

## RMM-5: Special Volume on Stat Sci Meets Phil Sci

The article “Low Assumptions, High Dimensions” by Larry Wasserman has now been published in our special volume of the on-line journal *Rationality, Markets, and Morals* (Special Topic: Statistical Science and Philosophy of Science: Where Do/Should They Meet?).

**Abstract:**

*These days, statisticians often deal with complex, high dimensional datasets. Researchers in statistics and machine learning have responded by creating many new methods for analyzing high dimensional data. However, many of these new methods depend on strong assumptions. The challenge of bringing low assumption inference to high dimensional settings requires new ways to think about the foundations of statistics. Traditional foundational concerns, such as the Bayesian versus frequentist debate, have become less important.*

## Neyman’s Nursery (NN5): Final Post

I want to complete the Neyman’s Nursery (NN) meanderings while we have some numbers before us, and while there is a particular example, test T+, on the table. Despite my warm and affectionate welcoming of the “power analytic” reasoning I unearthed in those “hidden Neyman” papers (see post from Oct. 22), reasoning admittedly largely lost in the standard decision-behavior model of tests, it still retains an unacceptable coarseness: power is always calculated relative to the cut-off point c_{α} for rejecting H_{0}. But rather than throw out the baby with the bathwater, we should keep the logic and take account of the actual value of the statistically insignificant result.

__________________________________

(For those just tuning in, power analytic reasoning aims to avoid the age-old fallacy of taking a statistically insignificant result as evidence of 0 discrepancy from the null hypothesis, by identifying discrepancies that can and cannot be ruled out. For our test T+, we reason from insignificant results to inferences of the form: μ < μ_{0} + γ.

We are illustrating (as does Neyman) with a one-sided test T+, with μ_{0} = 0, and α=.025. Spoze σ = 1, n = 25, so X is statistically significant only if it exceeds .392.

Power-analytic reasoning says (in relation to our test T+):

If X is statistically insignificant and the POW(T+, μ= μ_{1}) is high, then X indicates, or warrants inferring (or whatever phrase you like) that μ < μ_{1}.)
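Since we have numbers on the table, the cutoff and POW(T+, μ_{1}) can be checked with a few lines of Python (standard library only; the function name `power` is mine, not from any statistics package):

```python
from statistics import NormalDist

N = NormalDist()  # standard normal

# Test T+ as above: H0: mu <= 0, sigma = 1, n = 25, alpha = .025
mu0, sigma, n, alpha = 0.0, 1.0, 25, 0.025
se = sigma / n ** 0.5                      # standard error of the sample mean: 0.2
cutoff = mu0 + N.inv_cdf(1 - alpha) * se   # reject H0 when the sample mean exceeds this

def power(mu1: float) -> float:
    """POW(T+, mu = mu1): probability of a statistically significant result when mu = mu1."""
    return 1 - N.cdf((cutoff - mu1) / se)

print(round(cutoff, 3))      # 0.392, matching the cutoff in the text
print(round(power(0.4), 3))  # 0.516, "hardly more than .5"
```

The same function gives power of about .98 against μ = .8, a case where the power-analytic rule would license inferring μ < .8 from an insignificant result.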

_______________________________

Suppose one had an insignificant result from test T+ and wanted to evaluate the inference: μ < .4

(it doesn’t matter why just now, this is an illustration).

Since POW(T+, μ =.4) is hardly more than .5, Neyman would say “it was a little rash” to regard the observed mean as indicating μ < .4 . He would say this regardless of the actual value of the statistically insignificant result. There’s no place in the power calculation, as defined, to take into account the actual observed value.^{1}

That is why, although high power to detect μ as large as μ_{1} is sufficient for regarding the data as good evidence that μ < μ_{1} , it is too coarse to be a necessary condition. Spoze, for example, that X = 0.

Were μ as large as .4, then with high probability (~.98) we would have observed a larger difference from the null than we did. Therefore, our data provide evidence that μ < .4.^{2}
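The ~.98 can be computed directly. A minimal sketch (standard library only; `sev` is just my label for this probability, not an established function):

```python
from statistics import NormalDist

N = NormalDist()

se = 1.0 / 25 ** 0.5  # sigma = 1, n = 25: standard error 0.2

def sev(x_bar: float, mu1: float) -> float:
    """P(sample mean > x_bar; mu = mu1): the probability of a larger
    difference from the null than the one observed, were mu as large as mu1."""
    return 1 - N.cdf((x_bar - mu1) / se)

print(round(sev(0.0, 0.4), 3))  # 0.977: high, so x_bar = 0 is good evidence that mu < .4
```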

We might say that the severity associated with μ < .4 is high. There are many ways to articulate the associated justification—I have done so at length earlier; and of course it is no different from “power analytic reasoning”. Why consider a miss as good as a mile?

When I first introduced this idea in my Ph.D. dissertation, I assumed researchers already did this in real life, since it introduces no new logic. But I’ve been surprised not to find it.

I was (and am) puzzled to discover under “observed power” the Shpower computation, which we have already considered and (hopefully) gotten past—at least for present purposes, namely, reasoning from insignificant results to inferences of the form: μ < μ_{0} + γ.

Granted, there are some computations which you might say lead to virtually the same results as SEV, e.g., certain confidence limits, but even so there are differences of interpretation.^{3} Let me know if you think I am wrong, there may well be something out there I haven’t seen….

____________

(1) This does not mean the place to enter it is in the hypothesized value of μ under which the power is computed (as with Shpower). This is NOT power, and as we have seen in two posts, it is fallacious to equate it to power or to power analytic reasoning. Note that the Shpower associated with X = 0 is .025—that we are interested in μ < .4 does not enter.

(2) It doesn’t matter here if we use ≤ or < .

(3) For differences between SEV and confidence intervals, see Mayo 1996, Mayo and Spanos 2006, 2011.
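The Shpower figure in note (1) is easy to verify. A sketch (standard library only; `shpower` is my name for the computation): plugging the observed mean in as the hypothesized value of μ gives .025 at X = 0, no matter that the inference of interest is μ < .4.

```python
from statistics import NormalDist

N = NormalDist()

se = 1.0 / 25 ** 0.5              # sigma = 1, n = 25
cutoff = N.inv_cdf(0.975) * se    # ~.392, the alpha = .025 cutoff

def shpower(x_bar: float) -> float:
    """'Observed power': ordinary power evaluated at mu = x_bar,
    the substitution criticized in the posts above."""
    return 1 - N.cdf((cutoff - x_bar) / se)

print(round(shpower(0.0), 3))  # 0.025: the inference of interest never enters
```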

## Logic Takes a Bit of a Hit! (NN4) Continuing: Shpower (“Observed” Power) vs Power

Logic takes a bit of a hit—student driver behind me. Anyway, managed to get to JFK, and meant to explain a bit more clearly the first “shpower” post.

I’m not saying shpower is illegitimate in its own right, or that it could not have uses, only that finding that the logic for power analytic reasoning does not hold for shpower is no skin off the nose of power analytic reasoning. Continue reading

## Neyman’s Nursery (NN3): SHPOWER vs POWER

EGEK weighs 1 pound

Before leaving base again, I have a rule to check on weight gain since the start of my last trip. I put this off til the last minute, especially when, like this time, I know I’ve overeaten while traveling. The most accurate of the 4 scales I generally use (one is at my doctor’s) is actually in Neyman’s Nursery upstairs. To my surprise, none of these scales showed any discernible increase over when I left. At least one of the 4 scales would surely have registered a weight gain of 1 pound or more, had I gained it, and yet none of them do; that is an indication I’ve not gained a pound or more. I check that each scale reliably indicates 1 pound, because I know that is the weight of the book EGEK (you can even see this on the scale shown), and they each show exactly one pound when EGEK is weighed. Having evidence I’ve gained less than 1 pound, there is even less grounds for supposing I’ve gained as much as 5 pounds, right? Continue reading

## Neyman’s Nursery (NN2): Power and Severity [Continuation of Oct. 22 Post]:

Let me pick up where I left off in “Neyman’s Nursery” [built to house Giere’s statistical papers-in-exile]. The main goal of the discussion is to get us to exercise correctly our “will to understand power”, if only little by little. One of the two surprising papers I came across the night our house was hit by lightning has the tantalizing title “The Problem of Inductive Inference” (Neyman 1955). It reveals a use of statistical tests strikingly different from the long-run behavior construal most associated with Neyman. Surprising too, Neyman is talking to none other than the logical positivist philosopher of confirmation, Rudolf Carnap:

I am concerned with the term “degree of confirmation” introduced by Carnap. …We have seen that the application of the locally best one-sided test to the data … failed to reject the hypothesis [that the n observations come from a source in which the null hypothesis is true]. The question is: does this result “confirm” the hypothesis that H_{0} is true of the particular data set? (Neyman, pp. 40-41)

Neyman continues:

The answer … depends very much on the exact meaning given to the words “confirmation,” “confidence,” etc. If one uses these words to describe one’s intuitive feeling of confidence in the hypothesis tested H_{0}, then…. the attitude described is dangerous.… [T]he chance of detecting the presence [of discrepancy from the null], when only [n] observations are available, is extremely slim, even if [the discrepancy is present]. Therefore, the failure of the test to reject H_{0} cannot be reasonably considered as anything like a confirmation of H_{0}. The situation would have been radically different if the power function [corresponding to a discrepancy of interest] were, for example, greater than 0.95. (ibid.)

The general conclusion is that it is a little rash to base one’s intuitive confidence in a given hypothesis on the fact that a test failed to reject this hypothesis. A more cautious attitude would be to form one’s intuitive opinion only after studying the power function of the test applied.

Neyman alludes to a one-sided test of the mean of a Normal distribution with n iid samples, and known standard deviation, call it test T+. (Whether Greek symbols will appear where they should, I cannot say; it’s being worked on back at Elba).

H_{0}: µ ≤ µ_{0} against H_{1}: µ > µ_{0}.

The *test statistic* d(X) is the standardized sample mean.

The test rule: infer a (positive) discrepancy from µ_{0} iff d(x_{0}) > c_{α}, where c_{α} corresponds to a difference statistically significant at the α level.

In Carnap’s example the test could not reject the null hypothesis, i.e., d(x_{0}) ≤ c_{α}, but (to paraphrase Neyman) the problem is that the chance of detecting the presence of discrepancy δ from the null, with so few observations, is extremely slim, even if [δ is present].

We are back to our old friend: interpreting negative results!

“One may be confident in the absence of that discrepancy only if the power to detect it were high.”

The power of the test T+ to detect discrepancy δ:

(1) P(d(X) > c_{α}; µ = µ_{0} + δ)

It is interesting to hear Neyman talk this way since it is at odds with the more behavioristic construal he usually championed. He sounds like a Cohen-style power analyst! Still, power is calculated relative to an outcome just missing the cutoff c_{α}. This is, in effect, the worst case of a negative (non-significant) result, and if the actual outcome corresponds to a larger p-value, that should be taken into account in interpreting the results. It is more informative, therefore, to look at the probability of getting a worse fit (with the null hypothesis) than you did:

(2) P(d(X) > d(x_{0}); µ = µ_{0} + δ)

In this example, this gives a measure of the severity (or degree of corroboration) for the inference µ < µ_{0} + δ.

Although (1) may be low, (2) may be high (For numbers, see Mayo and Spanos 2006).
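To put illustrative numbers on this, here is a sketch borrowing the setup of the NN5 post above (σ = 1, n = 25, α = .025), with an observed d(x_{0}) = 0 and a discrepancy δ = .2; the variable names are my own:

```python
from statistics import NormalDist

N = NormalDist()

c_alpha = N.inv_cdf(0.975)     # ~1.96
d_obs = 0.0                    # observed standardized statistic d(x0)
delta_std = 0.2 * 25 ** 0.5    # delta = .2 in units of sigma/sqrt(n): equals 1.0

# Under mu = mu0 + delta, d(X) is Normal with mean delta_std and unit variance.
power_1 = 1 - N.cdf(c_alpha - delta_std)  # (1) P(d(X) > c_alpha; mu = mu0 + delta)
sev_2 = 1 - N.cdf(d_obs - delta_std)      # (2) P(d(X) > d(x0); mu = mu0 + delta)

print(round(power_1, 3))  # 0.169: (1) is low
print(round(sev_2, 3))    # 0.841: yet (2) is high
```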

Spanos and I (Mayo and Spanos 2006) couldn’t find a term in the literature defined precisely this way–the way I’d defined it in Mayo (1996) and before. We were thinking at first of calling it “attained power” but then came across what some have called “observed power” which is very different (and very strange). Those measures are just like ordinary power but calculated assuming the value of the mean equals the observed mean! (Why anyone would want to do this and then apply power analytic reasoning is unclear. I’ll come back to this in my next post.) Anyway, we refer to it as the Severity Interpretation of “Acceptance” (SIA) in Mayo and Spanos 2006.

The claim in (2) could also be made out by viewing the p-value as a random variable and calculating its distribution under various alternatives (Cox 2006, 25). This reasoning yields a core frequentist principle of evidence (FEV) (Mayo and Cox 2010, 256):

FEV:^{1} A moderate p-value is evidence of the absence of a discrepancy d from H_{0} only if there is a high probability the test would have given a worse fit with H_{0} (i.e., smaller p value) were a discrepancy d to exist.

It is important to see that it is only in the case of a negative result that severity for various inferences is in the same direction as power. In the case of significant results, d(x) in excess of the cutoff, the opposite concern arises—namely, the test is too sensitive. So severity is always relative to the particular inference being entertained: speaking of the “severity of a test” simpliciter is an incomplete statement in this account. These assessments enable sidestepping classic fallacies of tests that are either too sensitive or not sensitive enough.^{2}

________________________________________

The full version of our frequentist principle of evidence FEV corresponds to the interpretation of a small p-value:

x is evidence of a discrepancy d from H_{0} iff, if H_{0} is a correct description of the mechanism generating x, then, with high probability a less discordant result would have occurred.

Severity (SEV) may be seen as a meta-statistical principle that follows the same logic as FEV reasoning within the formal statistical analysis.

By making a SEV assessment relevant to the inference under consideration, we obtain a measure where high (low) values always correspond to good (poor) evidential warrant.

It didn’t have to be done this way, but I decided it was best, even though it means appropriately swapping out the claim H for which one wants to assess SEV.

NOTE: There are 5 Neyman’s Nursery posts (NN1-NN5). NN3 is here. Search this blog for the others.

REFERENCES:

Cohen, J. (1992), “A Power Primer,” *Psychological Bulletin* 112(1): 155-159.

Cohen, J. (1988), *Statistical Power Analysis for the Behavioral Sciences*, 2nd ed. Hillsdale, NJ: Erlbaum.

Mayo, D. and Spanos, A. (2006), “Severe Testing as a Basic Concept in a Neyman-Pearson Philosophy of Induction,” *British Journal for the Philosophy of Science*, 57: 323-357.

Mayo, D. and Cox, D. (2010), “Frequentist Statistics as a Theory of Inductive Inference,” in D. Mayo and A. Spanos (2010), pp. 247-275.

Mayo, D. and Spanos, A. (eds.) (2010), *Error and Inference: Recent Exchanges on Experimental Reasoning, Reliability, and the Objectivity and Rationality of Science*, CUP.

Neyman, J. (1955), “The Problem of Inductive Inference,” *Communications on Pure and Applied Mathematics*, VIII, 13-46.

## Skeleton Key and Skeletal Points for (Esteemed) Ghost Guest

Secret Key

Why attend presentations of interesting papers or go to smashing London sites when you can spend better than an hour racing from here to there because the skeleton key to your rented flat won’t turn the lock (after working fine for days)? [3 other neighbors tried, by the way, it wasn’t just me.] And what are the chances of two keys failing, including the porter’s key, and then a third key succeeding–a spare I’d never used but had placed in a hollowed-out volume of *Error and Inference*, and kept in an office at the London School of Economics? (Yes, that is what the photo is! An anonymous e-mailer guessed it right, so they must have spies!) As I ran back and forth one step ahead of the locksmith, trying to ignore my still-bum knee (I left the knee brace in the flat) and trying not to get run over—not easy, in London, for me—I mulled over the perplexing query from one of my Ghost Guests (who asked for my positive account). Continue reading

## Who is Really Doing the Work?*

A common assertion (of which I was reminded in Leiden*) is that in scientific practice, by and large, the frequentist sampling theorist (error statistician) ends up in essentially the “same place” as Bayesians, as if to downplay the importance of disagreements within the Bayesian family, let alone between the Bayesian and frequentist. Such an utterance, in my experience, is indicative of a frequentist in exile (as described on this blog). [1] Perhaps the claim makes the frequentist feel less in exile; but it also renders any subsequent claims to prefer the frequentist philosophy as just that—a matter of preference, without a pressing foundational imperative. Yet, even if one were to grant an agreement in numbers, it is altogether crucial to ascertain *who or what is really doing the work*. If we don’t understand what is really responsible for success stories in statistical inference, we cannot hope to improve those methods, adjudicate rival assessments when they do arise, or get ideas for extending and developing tools when entering brand new arenas. Clearly, understanding the underlying foundations of one or another approach is crucial for a philosopher of statistics, but practitioners too should care, at least some of the time. Continue reading

## RMM-4: Special Volume on Stat Sci Meets Phil Sci

The article “Foundational Issues in Statistical Modeling: Statistical Model Specification and Validation” by Aris Spanos has now been published in our special volume of the on-line journal *Rationality, Markets, and Morals* (Special Topic: Statistical Science and Philosophy of Science: Where Do/Should They Meet?).

**Abstract: **

*Statistical model specification and validation raise crucial foundational problems whose pertinent resolution holds the key to learning from data by securing the reliability of frequentist inference. The paper questions the judiciousness of several current practices, including the theory-driven approach, and the Akaike-type model selection procedures, arguing that they often lead to unreliable inferences. This is primarily due to the fact that goodness-of-fit/prediction measures and other substantive and pragmatic criteria are of questionable value when the estimated model is statistically misspecified. Foisting one’s favorite model on the data often yields estimated models which are both statistically and substantively misspecified, but one has no way to delineate between the two sources of error and apportion blame. The paper argues that the error statistical approach can address this Duhemian ambiguity by distinguishing between statistical and substantive premises and viewing empirical modeling in a piecemeal way with a view to delineate the various issues more effectively. It is also argued that Hendry’s general-to-specific procedure does a much better job in model selection than the theory-driven and the Akaike-type procedures, primarily because of its error statistical underpinnings.*