Repost (5/17/12): Do CIs Avoid Fallacies of Tests? Reforming the Reformers

The one method that enjoys the approbation of the New Reformers is that of confidence intervals (see the May 12, 2012 post and links).

For a reasonably high choice of confidence level, say .95 or .99, values of µ within the observed interval are plausible, those outside implausible.

Geoff Cumming, a leading statistical reformer in psychology, has long been pressing for ousting significance tests (or NHST[1]) in favor of CIs. The level of confidence “specifies how confident we can be that our CI includes the population parameter µ” (Cumming 2012, p. 69). He recommends prespecified confidence levels of .9, .95, or .99:

“We can say we’re 95% confident our one-sided interval includes the true value. We can say the lower limit (LL) of the one-sided CI…is a likely lower bound for the true value, meaning that for 5% of replications the LL will exceed the true value.” (Cumming 2012, p. 112)[2]

For simplicity, I will use the 2-standard-deviation cut-off, corresponding to a one-sided confidence level of ~.98.

However, there is a duality between tests and intervals: the interval contains those parameter values that would not be rejected at the corresponding level, given the data.[3]

Cumming writes: “One-sided CIs are analogous to one-tailed tests but, as usual, the estimation approach is better.”

Is it? Consider a one-sided test of the mean of a Normal distribution based on n iid samples, with known standard deviation σ; call it test T+:

H0: µ ≤ 0 against H1: µ > 0, with σ = 1.

Test T+ at significance level .02 is analogous to forming the one-sided (lower) 98% confidence interval:

µ > M – 2(1/√n),

where M, following Cumming, is the sample mean (thereby avoiding those x-bars). M – 2(1/√n) is the lower limit (LL) of the 98% CI.
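
As a minimal sketch of this construction (assuming σ = 1 and the post’s 2-standard-deviation cutoff; the helper name `lower_limit` is mine, not Cumming’s):

```python
from math import sqrt

def lower_limit(M, n, sigma=1.0, z=2.0):
    """Lower limit (LL) of the one-sided (lower) ~98% CI: M - z*sigma/sqrt(n)."""
    return M - z * sigma / sqrt(n)

print(lower_limit(M=0.2, n=100))  # e.g., 0.2 - 2*(1/10) = 0.0
```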

Central problems with significance tests (whether of the N-P or Fisherian variety) include:

(1) results are too dichotomous (e.g., significant at a pre-set level or not);

(2) two equally statistically significant results from tests with different sample sizes are reported in the same way (whereas the larger the sample size, the smaller the discrepancy the test is able to detect);

(3) significance levels (even observed p-values) fail to indicate the extent of the effect or discrepancy (in the case of test T+ , in the positive direction).

We would like to know for what values of δ it is warranted to infer  µ > µ0 + δ.

Considering problem (2), suppose two tests of type T+ reach the same significance level, .02, and let

(i) n = 100 and  (ii) n = 400.

(With n = 100, M = .2; with n = 400, M = .1)

(i) for n = 100, the .98 (lower) CI is µ > M – 2(1/10), i.e., µ > .2 – .2;

(ii) for n = 400, the .98 (lower) CI is µ > M – 2(1/20), i.e., µ > .1 – .1.

So in both cases, the confidence interval is

µ > 0

or, as he writes them, (0, infinity]. So how do the CIs distinguish the two cases?

The sample means in both cases are just statistically significant. As Cumming states, for a 98% CI, the p-value is .02 if the interval falls so that the LL is at µ0 (p. 103). Here, the LL of the CI is µ0, namely 0.

So the p-value in our case would be .02, and the result is taken to warrant inferring µ > 0. So where is the difference? The construal is dichotomous: in or out, plausible or not; all values within the interval are on a par. But if we wish to avoid fallacies, CIs won’t go far enough.
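
To see the identity concretely, here is a small sketch (mine, not Cumming’s; it assumes scipy and the post’s setup with σ = 1 and a 2-standard-deviation cutoff) confirming that the two cases yield the same interval and the same p-value:

```python
from math import sqrt
from scipy.stats import norm

sigma = 1.0
for n, M in [(100, 0.2), (400, 0.1)]:
    se = sigma / sqrt(n)          # standard error of the sample mean
    ll = M - 2 * se               # LL of the one-sided (lower) ~.98 CI
    p = 1 - norm.cdf(M / se)      # one-sided p-value of T+ against H0: mu <= 0
    print(f"n = {n}: LL = {ll:.2f}, p = {p:.3f}")
# Both cases print LL = 0.00, p = 0.023: the CIs (and p-values) coincide.
```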

To avoid fallacies of rejection, to distinguish between cases (i) and (ii), and to make good on the promise to have solved problem (3), we would need to report the extent of the discrepancies that are well and poorly indicated. Let’s pick an example to illustrate: is there evidence of a discrepancy of .1, i.e., that µ > .1?

For n = 100, we would say that µ > .1 is fairly well indicated (p-value is .16, associated SEV is .84)*.

The reasoning is counterfactual: were µ less than or equal to .1, it is fairly probable, .84, that a smaller M (than the one observed) would have occurred.

For n = 400, µ > .1 is poorly indicated (p-value is .5, associated SEV is .5).

The reasoning, among the many ways it can be put, is that the observed M is scarcely unusual under the assumption that µ is less than or equal to .1. The probability is .5 that we’d observe a sample mean as statistically significant as (or more so than) the one we observed, even if µ were only .1.
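
The severity computations behind both verdicts can be sketched as follows (the function name `severity` is my own shorthand; SEV for inferring µ > µ1 is computed, per the footnote below, as the probability of a less statistically significant result under µ = µ1):

```python
from math import sqrt
from scipy.stats import norm

def severity(M_obs, mu1, n, sigma=1.0):
    """SEV for inferring mu > mu1: the probability of a less statistically
    significant result (M < M_obs), computed under mu = mu1."""
    se = sigma / sqrt(n)
    return norm.cdf((M_obs - mu1) / se)

print(round(severity(0.2, 0.1, 100), 2))  # 0.84: mu > .1 fairly well indicated
print(round(severity(0.1, 0.1, 400), 2))  # 0.5:  mu > .1 poorly indicated
```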

Now it might be said that one is always required to compute a two-sided interval. But one-sided tests cannot simply be ruled out, nor does Cumming rule them out. In fact, he encourages one-sided tests/CIs (although he also says he is happy for people to decide afterwards whether to report a one- or two-sided test (p. 112); I put this to one side).

Or it might be suggested that we perform the usual one-sided test by means of the one-sided (lower) CI, but add the upper CI (at the same level) for purposes of scrutinizing the discrepancy indicated. First, let me be clear that Cumming does not suggest this. Were one to suggest it, a justification would be needed, and that would demand going beyond confidence-interval reasoning to something akin to severe-testing reasoning.

Merely forming the two two-sided intervals would not help:

(i) (0, .4]

(ii) (0, .2]
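
For completeness, a one-line check of those two-sided limits (same assumptions as the sketches above):

```python
from math import sqrt

for n, M in [(100, 0.2), (400, 0.1)]:
    se = 1.0 / sqrt(n)  # sigma = 1
    print(f"n = {n}: ({M - 2*se:.1f}, {M + 2*se:.1f})")
# n = 100: (0.0, 0.4); n = 400: (0.0, 0.2) -- both intervals contain .1
```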

The question is: how well do the data indicate µ > .1? It would scarcely jump out at you that this is poorly warranted in case (ii). In this way, simple severe-testing reasoning distinguishes (i) and (ii), as was wanted.

This was not a very extreme example, in terms of the difference in sample sizes. The converse problem, the inability of standard CIs to avoid fallacies of insignificant results, is even more glaring; whereas avoiding those fallacies is easily and intuitively accomplished by a severity evaluation. (For several computations, see Mayo and Spanos 2006.)

Nor does it suffice to report a series of confidence intervals, as some suggest: the intervals still present parameter values that “fit” the data to varying degrees, without a clear, principled ground for using the series to avoid unwarranted interpretations. The counterfactual reasoning may be made out in terms of (what may be dubbed) a Severity Interpretation of Rejection (SIR) and of Acceptance (SIA) in Mayo and Spanos 2006, or in terms of the frequentist principle of evidence (FEV) in Mayo and Cox (2010, p. 256).

*Here SEV is calculated as the probability of getting a less statistically significant result, computed under the assumption that µ = .1. The SEV would increase if computed under smaller values of µ.
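
To illustrate that last claim numerically (a sketch under the same assumptions as above; `severity` is my label, not a standard function):

```python
from math import sqrt
from scipy.stats import norm

def severity(M_obs, mu1, n, sigma=1.0):
    return norm.cdf((M_obs - mu1) / (sigma / sqrt(n)))

for mu1 in (0.1, 0.05, 0.0):
    print(mu1, round(severity(0.2, mu1, 100), 3))
# 0.1 -> 0.841, 0.05 -> 0.933, 0.0 -> 0.977: SEV rises as mu1 shrinks
```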

Cumming, G. (2012), Understanding the New Statistics, Routledge.

Mayo, D. and Spanos, A. (2006), “Severe Testing as a Basic Concept in a Neyman-Pearson Philosophy of Induction,” British Journal for the Philosophy of Science, 57: 323-357.

Mayo, D. and Cox, D. (2010), “Frequentist Statistics as a Theory of Inductive Inference,” in D. Mayo and A. Spanos (eds.), Error and Inference (2010), Cambridge University Press, pp. 247-275.


[1] Null Hypothesis Significance Tests.

[2] The warrant for the confidence, for Cumming, is the usual one: the interval was arrived at by a method with a high probability of covering the true parameter value. He has also worked out nifty programs to view the “dance” of CIs, as well as other statistics.

[3] “If we think of whether or not the CIs include the null value, there’s a direct correspondence between, respectively, the two- and one-tailed tests, and the two- and one-sided intervals” (Cumming 2012, p. 111).
