# Anything Tests Can do, CIs do Better; CIs Do Anything Better than Tests?* (reforming the reformers cont.)

*The title is to be sung to the tune of “Anything You Can Do I Can Do Better”  from one of my favorite plays, Annie Get Your Gun (‘you’ being replaced by ‘test’).

This post may be seen to continue the discussion in the May 17 post on Reforming the Reformers.

Consider again our one-sided Normal test T+, with null H0: μ ≤ μ0 vs. μ > μ0, where μ0 = 0, α = .025, and σ = 1, but let n = 25. So the sample mean M is statistically significant only if it exceeds .392 (that is, 1.96σ/√n = 1.96 × .2). Suppose M just misses significance, say

Mo = .39.
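For readers who want to check the arithmetic, here is a minimal sketch of where the .392 cutoff comes from. It assumes only the setup stated above (μ0 = 0, σ = 1, n = 25, α = .025); the variable names are my own, chosen for illustration.

```python
from math import sqrt
from statistics import NormalDist

# Setup of test T+ as stated in the post
mu0, sigma, n, alpha = 0.0, 1.0, 25, 0.025

std_err = sigma / sqrt(n)                  # standard error of the mean: 1/5
z_alpha = NormalDist().inv_cdf(1 - alpha)  # upper .025 point of N(0, 1), about 1.96
cutoff = mu0 + z_alpha * std_err           # M must exceed this to be significant

m_obs = 0.39                               # the observed mean considered in the post
significant = m_obs > cutoff
print(f"cutoff = {cutoff:.3f}, M0 = {m_obs}, significant: {significant}")
```

The observed mean falls just short of the cutoff, so the result is (barely) statistically insignificant.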

The flip side of a fallacy of rejection (discussed before) is a fallacy of acceptance, or the fallacy of misinterpreting statistically insignificant results.  To avoid the age-old fallacy of taking a statistically insignificant result as evidence of zero (0) discrepancy from the null hypothesis μ =μ0, we wish to identify discrepancies that can and cannot be ruled out.  For our test T+, we reason from insignificant results to inferential claims of the form:

μ < μ0 + γ

Fisher continually emphasized that failure to reject was not evidence for the null.  Neyman, we saw, in chastising Carnap, argued for the following kind of power analysis:

Neymanian Power Analysis (Detectable Discrepancy Size DDS): If data x are not statistically significantly different from H0, and the power to detect discrepancy γ is high (low), then x constitutes good (poor) evidence that the actual effect is no greater than γ. (See 11/9/11 post)

By taking into account the actual x0, a more nuanced post-data reasoning may be obtained.

“In the Neyman-Pearson theory, sensitivity is assessed by means of the power—the probability of reaching a preset level of significance under the assumption that various alternative hypotheses are true. In the approach described here, sensitivity is assessed by means of the distribution of the random variable P, considered under the assumption of various alternatives. “ (Cox and Mayo 2010, p. 291):

This may be captured in:

FEV(ii): A moderate p-value is evidence of the absence of a discrepancy d from H0 only if there is a high probability the test would have given a worse fit with H0 (i.e., a smaller p-value) were a discrepancy d to exist. (Mayo and Cox 2005, 2010, 256).

This is equivalently captured in the Rule of Acceptance (Mayo (EGEK) 1996) and in the severity interpretation for acceptance, SIA (Mayo and Spanos 2006, p. 337):

SIA: (a): If there is a very high probability that [the observed difference] would have been larger than it is, were μ > μ1, then μ < μ1 passes the test with high severity,…

But even taking tests and CIs just as we find them, we see that CIs do not avoid the fallacy of acceptance: they do not block erroneous construals of negative results adequately.

The one-sided CI for the parameter μ in test T+, with Mo the observed sample mean and α = .025, is:

(Mo - 1.96(σ/√n), ∞)

Outcome M = .39 just fails to reject H0 at the .025 level; correspondingly, 0 is included in the one-sided 97.5% interval:

-.002 < μ

Suppose one had an insignificant result from test T+  and wanted to evaluate the inference:   μ < .4

(It doesn’t matter why just now, this is an illustration).

Since the power of test T+ to detect  μ =.4 is hardly more than .5, Neyman would say “it was a little rash” to regard the observed mean as indicating μ < .4 , to use his language in chiding Carnap.  So the N-P tester avoids taking the insignificant result as evidence that μ < .4.  Not only has she avoided regarding the insignificant result as evidence of no discrepancy from the null, she immediately and properly denies there is good evidence for ruling out a discrepancy of .4. [i]
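To see why the power against μ = .4 is "hardly more than .5", here is a rough sketch of the calculation. It assumes the power of T+ against μ1 is P(M > cutoff; μ = μ1) with M Normal with mean μ1 and standard error σ/√n; the function name `power` is mine, not from any cited source.

```python
from math import sqrt
from statistics import NormalDist

def power(mu1, mu0=0.0, sigma=1.0, n=25, alpha=0.025):
    """P(M > cutoff) when the true mean is mu1, for the one-sided test T+."""
    std_err = sigma / sqrt(n)
    cutoff = mu0 + NormalDist().inv_cdf(1 - alpha) * std_err
    # Under mu = mu1 the sample mean M is Normal(mu1, std_err)
    return 1 - NormalDist(mu1, std_err).cdf(cutoff)

pow_04 = power(0.4)
print(f"power against mu = 0.4: {pow_04:.3f}")
```

Since .4 sits almost exactly on the cutoff, the test rejects only about half the time when μ = .4, which is why the insignificant result is poor grounds for μ < .4.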

How does the confidence interval:       -.002 < μ

block interpreting the negative result as evidence that the discrepancy is less than .4?

It does not.

Yet many New Reformers declare that once the confidence interval is in hand, there is no additional information to be obtained from a power analysis, by which they are referring to precisely what Neyman recommends (although they have in mind Cohen).[ii]

CI advocates typically hold that anything tests can do, CIs do better, or at any rate that the information is already in the CI. However, in claiming this (regarding test T+), they always compute the upper confidence bound, yet it is the lower bound that corresponds to this test. If we wish to use the upper confidence bound as a kind of adjunct for interpreting intervals, fine, but then CIs must be supplemented with a principle for warranting such a move: it does not come from CI theory. But even granting the use of the corresponding 95% CI they recommend, we get:

(-.002 < μ < .782)

How does this rule out supposing one has corroboration for μ < .4?  It doesn’t. All of the values in the interval (they tell us) are plausible, so they fail to rule out the erroneous inference. (Some CI advocates even chastise the power analyst for denying there is evidence for μ < μ’, for a value of  μ’ smaller than the upper limit (UL) of the CI. Their grounds are that the data are strong evidence that μ < UL. True. But that does not prevent us from denying there is strong evidence that μ < various μ values less than the upper limit.)
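The two intervals quoted above can be reproduced with a short sketch, assuming the known σ = 1 and n = 25 of test T+ (variable names are illustrative only). Note that the one-sided 97.5% lower bound and the lower limit of the two-sided 95% interval use the same 1.96 multiplier, so they coincide.

```python
from math import sqrt
from statistics import NormalDist

m_obs, sigma, n = 0.39, 1.0, 25
std_err = sigma / sqrt(n)
z = NormalDist().inv_cdf(0.975)   # about 1.96

lower = m_obs - z * std_err       # one-sided 97.5% lower bound (and 95% lower limit)
upper = m_obs + z * std_err       # upper limit of the two-sided 95% interval

print(f"one-sided 97.5%: {lower:.3f} < mu")
print(f"two-sided 95%:   {lower:.3f} < mu < {upper:.3f}")
print("0.4 inside the two-sided interval:", lower < 0.4 < upper)
```

Nothing in this output distinguishes .4 from any other value in the interval, which is the point at issue.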

The hypothesis μ < 0.4 is non-rejectable by the test with this outcome. In general, the values within the interval are not excluded; they are survivors, as it were, of the test. But if we wish to block fallacies of acceptance, CIs won’t go far enough. Although M is not sufficiently greater (or less) than any of the values in the confidence interval to reject them at the α-level, this does not imply there is evidence for each of the values in the interval (for discussion see Mayo and Spanos 2006).

By contrast, for each value of μ1 in the confidence interval, there would be a different answer to the question [ii]:

What is the power of the test against μ1?

Thus the power analyst makes distinctions that the CI theorist does not. As we saw, the power analyst blocks as “rash” (Neyman) the inference μ < 0.4, since the power of T+ to detect .4 is not high (.5). Even worse off would be the inference to μ < 0.2, since the power to detect .2 is only .16. Likewise for a severity analysis, which also avoids the coarseness of a power analysis.[iii]
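As a rough illustration of the distinctions the power analyst draws, the sketch below tabulates the power of T+ against several values lying inside the interval (-.002, .782). The grid of μ1 values is my own choice. The round figures .5 and .16 quoted above appear to correspond to rounding the cutoff to 2 standard errors; with the 1.96 cutoff they come out near .52 and .17, so both are shown.

```python
from math import sqrt
from statistics import NormalDist

def power(mu1, z_cut, mu0=0.0, sigma=1.0, n=25):
    """P(M > mu0 + z_cut * SE) when the true mean is mu1."""
    std_err = sigma / sqrt(n)
    cutoff = mu0 + z_cut * std_err
    return 1 - NormalDist(mu1, std_err).cdf(cutoff)

# Illustrative values, all inside the 95% interval (-.002, .782)
candidates = [0.0, 0.2, 0.4, 0.6, 0.78]
for mu1 in candidates:
    exact = power(mu1, z_cut=1.96)    # cutoff .392
    rounded = power(mu1, z_cut=2.0)   # cutoff .4
    print(f"mu1 = {mu1:.2f}  power(1.96) = {exact:.3f}  power(2.0) = {rounded:.3f}")
```

Every one of these values survives the test, yet the power against them ranges from near α to well above .9, so the warrant for inferring μ < μ1 differs sharply from one value to the next.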

CIs are swell, and their connection to severity evaluations may be developed, but CI theory requires being supplemented by a principle that will direct their correct interpretation if they are to avoid the fallacies of significance tests.

[i] Recall the language Neyman employed in chiding Rudolf Carnap (see 11/9/11 post):

I am concerned with the term “degree of confirmation” introduced by Carnap. …We have seen that the application of the locally best one-sided test to the data … failed to reject the hypothesis [that the n observations come from a source in which the null hypothesis is true]. The question is: does this result “confirm” the hypothesis that H0 is true of the particular data set? (Neyman, pp 40-41). Neyman continues:

The answer … depends very much on the exact meaning given to the words “confirmation,” “confidence,” etc.  If one uses these words to describe one’s intuitive feeling of confidence in the hypothesis tested H0, then…. the attitude described is dangerous.… [T]he chance of detecting the presence [of discrepancy from the null], when only [n] observations are available, is extremely slim, even if [the discrepancy is present].  Therefore, the failure of the test to reject H0 cannot be reasonably considered as anything like a confirmation of H0.  The situation would have been radically different if the power function [corresponding to a discrepancy of interest] were, for example, greater than 0.95.

[ii] Likewise for the question: what is the SEV associated with a given inference, from a given test with a given outcome?

[iii] If we observe not M = .39 but rather M = -.2, we again fail to reject H0, but the power analyst, looking just at cα = 1.96, is led to the same assessment, again regarding as fallacious the claim to have evidence for μ < 0.2. Although the “prespecified” power is low, .16, we would wish to say, taking into account the actual value, that there is a high probability of a more significant result than the one attained, were μ as great as 0.2!
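The contrast in note [iii] can be sketched numerically. This assumes, following the SIA wording quoted earlier, that the severity for the claim μ < μ1 is computed as P(M > M0; μ = μ1), where M0 is the observed mean; the function names are hypothetical and the code is only an illustration, not the authors' own implementation.

```python
from math import sqrt
from statistics import NormalDist

SIGMA, N, MU0 = 1.0, 25, 0.0
STD_ERR = SIGMA / sqrt(N)

def power(mu1, z_cut=1.96):
    """Pre-data: P(M > cutoff; mu = mu1). Ignores the observed mean."""
    cutoff = MU0 + z_cut * STD_ERR
    return 1 - NormalDist(mu1, STD_ERR).cdf(cutoff)

def severity_upper(mu1, m_obs):
    """Post-data: P(M > m_obs; mu = mu1), the assumed SEV for 'mu < mu1'."""
    return 1 - NormalDist(mu1, STD_ERR).cdf(m_obs)

# Same claim (mu < 0.2), two different insignificant outcomes
results = {m_obs: (power(0.2), severity_upper(0.2, m_obs)) for m_obs in (0.39, -0.2)}
for m_obs, (pw, sev) in results.items():
    print(f"M = {m_obs:+.2f}: power = {pw:.2f}, SEV(mu < 0.2) = {sev:.2f}")
```

The power is identical for both outcomes because it looks only at the cutoff, while the severity assessment changes with the observed mean: low when M = .39, high when M = -.2.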

References
Cohen, J. (1988), Statistical Power Analysis for the Behavioral Sciences, 2nd ed. Hillsdale, NJ: Erlbaum.

Mayo, D. and Spanos, A. (2006), “Severe Testing as a Basic Concept in a Neyman-Pearson Philosophy of Induction,” British Journal for the Philosophy of Science, 57: 323-357.

Mayo, D. and Cox, D. (2010), “Frequentist Statistics as a Theory of Inductive Inference,” in D. Mayo and A. Spanos (2010), pp. 247-275.

Mayo, D. and Spanos, A. (eds.) (2010), Error and Inference: Recent Exchanges on Experimental Reasoning, Reliability, and the Objectivity and Rationality of Science, CUP.

### 8 thoughts on “Anything Tests Can do, CIs do Better; CIs Do Anything Better than Tests?* (reforming the reformers cont.)”

1. Tom Passin

Many years ago, I worked for a company that had gotten a contract to supply a certain device. The device had to have a very high quality level, and we were required to provide an estimated value for the MTBF (Mean Time Between Failure). We had a test unit in a continuous test exposure, and it had not failed in a year.

Hmm, no failures but we have to provide an MTBF, what to do? Well, if failures followed a true interval distribution (constant failure probability per unit time), then the probability of one failure in the test period equaled P(no failure). I argued that therefore we might just as well have had one failure, and if that had happened we would have been able to compute an MTBF, complete with standard deviation (by the properties of the interval distribution).

So I provided an estimated MTBF value based on one failure in one year, and felt that this response was at least safely conservative. A bit strange, but we were required to respond with *some* value, and I wanted it to be as well-supported as I could manage.

2. guest

Surprised you didn’t note that a (better-scanning!) pastiche of an Annie Get Your Gun number is cited in Kadane’s book (pg 44, due to Box)… also McGrayne’s “Would Not Die” book, pg 132.

• Guest: Well my first blog post gives pride of place to the Bayesian use of “There’s no Business…” (Comedy Hour at the Bayesian Retreat) Sept 3, 2011: https://errorstatistics.com/2011/09/03/overheard-at-the-comedy-hour-at-the-bayesian-retreat/
I wrote: “Given the wide latitude with which some critics define ‘controlling long-run error,’ it is not surprising to find them arguing that (i) error statisticians approve of silly methods, and/or (ii) rival (e.g., Bayesian) accounts also satisfy error statistical demands. Absent this sleight of hand, Bayesian celebrants would have to go straight to the finale of their entertainment hour: a rousing rendition of ‘There’s No Theorem Like Bayes’s Theorem’.”

That’s because they really do these jokes and sing that song…. My use of “Anything You Can Do…” (an entirely different song from the play*) goes back to graduate school, but with respect to a selling point used by philosophers to advance the Bayesian Way rather than to CIs. It just occurred to me the other day how perfectly the syllables work (poetry is a sideline of mine).

*There have also been uses of ‘Doing What Comes Naturally’ in statistics.

• guest

Yes, they really do sing (or have sung) that song. But it’s not the finale, and the whole cabaret thing is very much in the tradition of self-derision that is also found in many other conferences.

• Guest: Thanks for pointing to the post on Xi’an’s blog. I think that things must look very different for one who is inside than for one who is not. Interesting that he claims “the contrast between Bayesian and frequent approaches is definitely not ‘philosophical’” and that probability, for him, “is not a belief but a rational construct” (meaning?). It is plain that Bayesians are not in lock step on these very fundamental issues, and so it’s hard to pin down the position.
In any event, I’ve never been to a conference with a cabaret in the “tradition of self derision”; I’m certainly not knocking it*. Were I involved, maybe I’d tap dance my way around the error statistical standpoint.
(By the way, what is the finale? I thought it was the song. I can correct this.)
*For a different approach, see my “statistical theater of the absurd” pieces.

3. Christian Hennig

Hmmmm… this posting seems a little bit unfair to me. I don’t see how having a CI that includes values other than μ < 0.4 should encourage anyone to "accept" μ < 0.4. I'm not sure what kind of "blocking of the fallacy of acceptance" you'd expect, but the CI tells us quite clearly that μ = 0.5 is consistent with the data, so saying that this somehow incurs evidence that μ < 0.4 seems nonsensical to me.

4. Christian: I don’t understand your remark. We need an interpretation of negative results that tells us that inferences “strictly speaking” within the interval are terrible! The form of inference of relevance in negative results (for test T+) is: μ < μ0 + γ.
I wrote: "Some CI advocates even chastise the power analyst for denying there is evidence for μ < μ’, for a value of μ’ smaller than the upper limit (UL) of the CI." The power analyst would do so because the associated power to detect μ’ is low. The critic of power analysis takes this as a conflict or at least a tension between power analysis and CIs.
The relation to the fallacy of acceptance is this: it is only thanks to power-analysis-type reasoning* that we can avoid it. With a negative result, the CI contains the null and values close to it, but we don't want to infer there is evidence for the null or even for ruling out discrepancies from the null against which the test had low power.
*I am putting aside here that I would always work, not with the "worst case" of failing to reject (as does power), but with the actual value of the statistically insignificant result.

• Christian Hennig

I guess that we agree that power analysis adds something to CIs, so no problem here.
My point is that I just don’t believe that they’d really do this:
“Some CI advocates even chastise the power analyst for denying there is evidence for μ < μ’, for a value of μ’ smaller than the upper limit (UL) of the CI."
…because as far as I know how to interpret a CI, it doesn't provide such evidence.
Well, you could prove that you're right by citing somebody.