# Anything Tests Can Do, CIs Do Better; CIs Do Anything Better than Tests?* (reforming the reformers cont.)

Having reblogged the 5/17/12 post on “reforming the reformers” yesterday, I thought I should reblog its follow-up: 6/2/12.

Consider again our one-sided Normal test T+, with null H0: μ ≤ μ0 vs. H1: μ > μ0, where μ0 = 0, α = .025, and σ = 1, but let n = 25. The standard error is then σ/√n = .2, so M is statistically significant only if it exceeds μ0 + 1.96(.2) = .392. Suppose M (the sample mean) just misses significance, say

M0 = .39.
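The cutoff and the near miss can be checked numerically. What follows is a minimal Python sketch using only the standard library; the language and the variable names (`cutoff`, `m_obs`, `p_value`) are illustrative choices rather than part of the original discussion, and σ is treated as known.

```python
from statistics import NormalDist

# Setup from the text: test T+ with mu0 = 0, sigma = 1, n = 25, alpha = .025.
mu0, sigma, n, alpha = 0.0, 1.0, 25, 0.025

se = sigma / n ** 0.5                          # standard error of M (0.2 here)
z_alpha = NormalDist().inv_cdf(1 - alpha)      # roughly 1.96
cutoff = mu0 + z_alpha * se                    # roughly 0.392

m_obs = 0.39                                   # observed sample mean
p_value = 1 - NormalDist(mu0, se).cdf(m_obs)   # P(M > m_obs; mu = mu0)

print(f"cutoff  = {cutoff:.3f}")
print(f"M exceeds cutoff: {m_obs > cutoff}")
print(f"p-value = {p_value:.4f}")
```

The p-value comes out just above .025, which is what "just misses significance" amounts to here.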

The flip side of a fallacy of rejection (discussed before) is a fallacy of acceptance, or the fallacy of misinterpreting statistically insignificant results.  To avoid the age-old fallacy of taking a statistically insignificant result as evidence of zero (0) discrepancy from the null hypothesis μ =μ0, we wish to identify discrepancies that can and cannot be ruled out.  For our test T+, we reason from insignificant results to inferential claims of the form:

μ < μ0 + γ

Fisher continually emphasized that failure to reject was not evidence for the null.  Neyman, we saw, in chastising Carnap, argued for the following kind of power analysis:

Neymanian Power Analysis (Detectable Discrepancy Size DDS): If data x are not statistically significantly different from H0, and the power to detect discrepancy γ is high (low), then x constitutes good (poor) evidence that the actual effect is < γ. (See 11/9/11 post).

By taking into account the actual x0, a more nuanced post-data reasoning may be obtained.

“In the Neyman-Pearson theory, sensitivity is assessed by means of the power—the probability of reaching a preset level of significance under the assumption that various alternative hypotheses are true. In the approach described here, sensitivity is assessed by means of the distribution of the random variable P, considered under the assumption of various alternatives.” (Cox and Mayo 2010, p. 291)

This may be captured in:

FEV(ii): A moderate p-value is evidence of the absence of a discrepancy d from H0 only if there is a high probability the test would have given a worse fit with H0 (i.e., a smaller p-value) were a discrepancy d to exist. (Mayo and Cox 2005, 2010, 256).

This is equivalently captured in the Rule of Acceptance (Mayo (EGEK) 1996) and in the severity interpretation for acceptance, SIA (Mayo and Spanos 2006, p. 337):

SIA: (a): If there is a very high probability that [the observed difference] would have been larger than it is, were μ > μ1, then μ < μ1 passes the test with high severity,…
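As a rough illustration of SIA for test T+, the probability that the observed difference "would have been larger than it is" can be computed for several candidate values μ1. The sketch below is an assumption-laden reading, not the authors' code: the function name `sev_upper` is hypothetical, σ is treated as known, and the probability is evaluated at the boundary μ = μ1.

```python
from statistics import NormalDist

def sev_upper(mu1, m_obs, sigma=1.0, n=25):
    """Severity for the claim mu < mu1 after an insignificant result.

    Computed as P(M > m_obs; mu = mu1), i.e. evaluated at the boundary
    mu = mu1 (an assumption of this sketch), with sigma treated as known.
    """
    se = sigma / n ** 0.5
    return 1 - NormalDist(mu1, se).cdf(m_obs)

m_obs = 0.39  # the observed mean used in the running example
for mu1 in (0.2, 0.4, 0.6, 0.8):
    print(f"SEV(mu < {mu1:.1f}) = {sev_upper(mu1, m_obs):.3f}")
```

On this reading, with M = .39 the claim μ < .4 passes with severity only a little above .5, while μ < .8 passes with severity close to .98.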

But even taking tests and CIs just as we find them, we see that CIs do not avoid the fallacy of acceptance: they do not block erroneous construals of negative results adequately.

The one-sided CI for the parameter μ in test T+, with M0 the observed sample mean and α = .025, is:

(M0 − 1.96(σ/√n), ∞)

(σ would generally be estimated.) Outcome M = .39 just fails to reject H0 at the .025 level; correspondingly, 0 is included in the one-sided 97.5% interval:

-.002 < μ
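The correspondence between the test and the one-sided interval can be verified directly: T+ fails to reject exactly when μ0 lies above the lower bound. A small Python sketch follows, with σ taken as known and all variable names chosen for illustration.

```python
from statistics import NormalDist

mu0, sigma, n, alpha = 0.0, 1.0, 25, 0.025
m_obs = 0.39

se = sigma / n ** 0.5
z_alpha = NormalDist().inv_cdf(1 - alpha)

lower = m_obs - z_alpha * se            # one-sided 97.5% lower bound
rejects = m_obs > mu0 + z_alpha * se    # does T+ reject H0?
mu0_in_interval = mu0 > lower           # is mu0 inside (lower, infinity)?

print(f"lower bound = {lower:.3f}")
print(f"test rejects H0: {rejects}")
print(f"mu0 in interval: {mu0_in_interval}")
```

The two booleans are always opposites, since both reduce to comparing M with μ0 + 1.96(σ/√n).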

Suppose one had an insignificant result from test T+  and wanted to evaluate the inference:   μ < .4

(It doesn’t matter why just now; this is an illustration.)

Since the power of test T+ to detect  μ =.4 is hardly more than .5, Neyman would say “it was a little rash” to regard the observed mean as indicating μ < .4 , to use his language in chiding Carnap.  So the N-P tester avoids taking the insignificant result as evidence that μ < .4.  Not only has she avoided regarding the insignificant result as evidence of no discrepancy from the null, she immediately and properly denies there is good evidence for ruling out a discrepancy of .4. [i]
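The power figure can be reproduced as follows. This is a sketch under the assumptions already stated in the text (σ = 1 known, n = 25, α = .025); the function name `power` is illustrative.

```python
from statistics import NormalDist

def power(mu1, mu0=0.0, sigma=1.0, n=25, alpha=0.025):
    """P(M > cutoff; mu = mu1) for the one-sided test T+ (sigma known)."""
    se = sigma / n ** 0.5
    cutoff = mu0 + NormalDist().inv_cdf(1 - alpha) * se
    return 1 - NormalDist(mu1, se).cdf(cutoff)

print(f"power against mu = 0.4: {power(0.4):.3f}")
```

The result is roughly .52, since .4 sits only slightly above the cutoff of .392.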

How does the confidence interval -.002 < μ block interpreting the negative result as evidence that the discrepancy is less than .4?

It does not.

Yet many New Reformers declare that once the confidence interval is in hand, there is no additional information to be obtained from a power analysis, by which they are referring to precisely what Neyman recommends (although they have in mind Cohen).[ii]

CI advocates typically hold that anything tests can do, CIs do better, or at any rate that the information is already in the CI. However, in claiming this (regarding test T+), they always compute the upper confidence bound, yet it is the lower bound that corresponds to this test. If we wish to use the upper confidence bound as a kind of adjunct for interpreting intervals, fine, but then CIs must be supplemented with a principle for warranting such a move: it does not come from CI theory. But even granting the use of the corresponding two-sided 95% CI they recommend, we get:

(-.002 < μ < .782)

How does this rule out supposing one has corroboration for μ < .4?  It doesn’t. All of the values in the interval (they tell us) are plausible, so they fail to rule out the erroneous inference. (Some CI advocates even chastise the power analyst for denying there is evidence for μ < μ’, for a value of  μ’ smaller than the upper limit (UL) of the CI. Their grounds are that the data are strong evidence that μ < UL. True. But that does not prevent us from denying there is strong evidence that μ < various μ values less than the upper limit.)

The hypothesis μ < 0.4 is non-rejectable by the test with this outcome. In general, the values within the interval are not excluded; they are survivors, as it were, of the test. But if we wish to block fallacies of acceptance, CIs won’t go far enough. Although M is not sufficiently greater (or less) than any of the values in the confidence interval to reject them at the α-level, this does not imply there is evidence for each of the values in the interval (for discussion see Mayo and Spanos 2006).

By contrast, for each value of μ1 in the confidence interval, there would be a different answer to the question [ii]:

What is the power of the test against μ1?

Thus the power analyst makes distinctions that the CI theorist does not. As we saw, the power analyst blocks as “rash” (Neyman) the inference μ < 0.4, since the power of T+ to detect .4 is not high (about .5). Even worse off would be the inference to μ < 0.2, since the power to detect .2 is only .16. Likewise for a severity analysis, which also avoids the coarseness of a power analysis.[iii]
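To make the contrast concrete, the sketch below tabulates, for a few values of μ1, whether each lies inside the two-sided 95% interval and what the power of T+ against it is. The grid of μ1 values and all names are illustrative assumptions. With the 1.96 cutoff the power against .2 comes out near .17; the .16 quoted in the text would match a cutoff rounded to 2 standard errors, which is a guess about the source of the small difference.

```python
from statistics import NormalDist

mu0, sigma, n, alpha = 0.0, 1.0, 25, 0.025
m_obs = 0.39

se = sigma / n ** 0.5
z = NormalDist().inv_cdf(1 - alpha)
cutoff = mu0 + z * se
ci_low, ci_high = m_obs - z * se, m_obs + z * se   # two-sided 95% CI

def power(mu1):
    """P(M > cutoff; mu = mu1) for T+ with sigma known."""
    return 1 - NormalDist(mu1, se).cdf(cutoff)

# Every value below is a "survivor" of the CI, yet the power differs widely.
for mu1 in (0.0, 0.2, 0.4, 0.6, 0.78):
    inside = ci_low < mu1 < ci_high
    print(f"mu1 = {mu1:.2f}  in CI: {inside}  power: {power(mu1):.3f}")
```

All five values fall inside the interval, so the interval alone treats them alike, whereas the power ranges from .025 up to above .97.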

CIs are swell, and their connection to severity evaluations may be developed, but CI theory requires being supplemented by a principle that will direct their correct interpretation if they are to avoid the fallacies of significance tests.

*The title is to be sung to the tune of “Anything You Can Do I Can Do Better”  from one of my favorite plays, Annie Get Your Gun (‘you’ being replaced by ‘test’).

[i] Recall the language Neyman employed in chiding Rudolf Carnap (see 11/9/11 post):

I am concerned with the term “degree of confirmation” introduced by Carnap. …We have seen that the application of the locally best one-sided test to the data … failed to reject the hypothesis [that the n observations come from a source in which the null hypothesis is true]. The question is: does this result “confirm” the hypothesis that H0 is true of the particular data set? (Neyman, pp. 40-41).

Neyman continues:

The answer … depends very much on the exact meaning given to the words “confirmation,” “confidence,” etc.  If one uses these words to describe one’s intuitive feeling of confidence in the hypothesis tested H0, then…. the attitude described is dangerous.… [T]he chance of detecting the presence [of discrepancy from the null], when only [n] observations are available, is extremely slim, even if [the discrepancy is present].  Therefore, the failure of the test to reject H0 cannot be reasonably considered as anything like a confirmation of H0.  The situation would have been radically different if the power function [corresponding to a discrepancy of interest] were, for example, greater than 0.95.

For more on power, see posts under “Neyman’s Nursery” (1, 2, 3, 4, 5).

[ii] Likewise for the question: what is the SEV associated with a given inference, from a given test with a given outcome.

[iii] If we observe not M = .39 but rather M = -.2, we again fail to reject H0, but the power analyst, looking just at cα = 1.96, is led to the same assessment, again regarding as fallacious the claim to have evidence for μ < 0.2. Although the “prespecified” power is low, .16, we would wish to say, taking into account the actual outcome, that there is a high probability of a more significant result than the one attained, were μ as great as 0.2!
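The point of note [iii] can be sketched numerically: power against μ = .2 is fixed by the cutoff and ignores the outcome, while the probability of a more significant result than the one observed depends on M. The function names `power` and `severity` are illustrative, σ is treated as known, and severity is evaluated at the boundary μ = μ1 as an assumption of this sketch.

```python
from statistics import NormalDist

mu0, sigma, n, alpha = 0.0, 1.0, 25, 0.025
se = sigma / n ** 0.5
cutoff = mu0 + NormalDist().inv_cdf(1 - alpha) * se

def power(mu1):
    """Pre-data: P(M > cutoff; mu = mu1). Does not depend on the outcome."""
    return 1 - NormalDist(mu1, se).cdf(cutoff)

def severity(mu1, m_obs):
    """Post-data: P(M > m_obs; mu = mu1), for the claim mu < mu1."""
    return 1 - NormalDist(mu1, se).cdf(m_obs)

mu1 = 0.2
for m_obs in (0.39, -0.2):
    print(f"M = {m_obs:+.2f}: power = {power(mu1):.3f}, "
          f"SEV(mu < {mu1}) = {severity(mu1, m_obs):.3f}")
```

For M = .39 the two figures nearly coincide (both around .17), but for M = -.2 the second rises to about .98 while the power stays where it was.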

References
Cohen, J. (1988), Statistical Power Analysis for the Behavioral Sciences, 2nd ed. Hillsdale, NJ: Erlbaum.

Mayo, D. and Spanos, A. (2006), “Severe Testing as a Basic Concept in a Neyman-Pearson Philosophy of Induction,” British Journal for the Philosophy of Science, 57: 323-357.

Mayo, D. and Cox, D. (2010), “Frequentist Statistics as a Theory of Inductive Inference,” in D. Mayo and A. Spanos (2010), pp. 247-275.

Mayo, D. and Spanos, A. (eds.) (2010), Error and Inference: Recent Exchanges on Experimental Reasoning, Reliability, and the Objectivity and Rationality of Science, CUP.