Do CIs Avoid Fallacies of Tests? Reforming the Reformers

The one method that enjoys the approbation of the New Reformers is that of confidence intervals (See May 12, 2012, and links). The general recommended interpretation is essentially this:

For a reasonably high choice of confidence level, say .95 or .99, values of µ within the observed interval are plausible, those outside implausible.

Geoff Cumming, a leading statistical reformer in psychology, has long been pressing for ousting significance tests (or NHST[1]) in favor of CIs. The level of confidence “specifies how confident we can be that our CI includes the population parameter µ” (Cumming 2012, p. 69). He recommends prespecified confidence levels .9, .95 or .99:

“We can say we’re 95% confident our one-sided interval includes the true value. We can say the lower limit (LL) of the one-sided CI…is a likely lower bound for the true value, meaning that for 5% of replications the LL will exceed the true value.” (Cumming 2012, p. 112)[2]

For simplicity, I will use the 2-standard deviation cut-off corresponding to the one-sided confidence level of ~.98.

However, there is a duality between tests and intervals (the intervals containing the parameter values not rejected at the corresponding level with the given data).[3]

“One-sided CIs are analogous to one-tailed tests but, as usual, the estimation approach is better.”

Is it?   Consider a one-sided test of the mean of a Normal distribution with n iid samples, and known standard deviation σ, call it test T+.

H0: µ ≤  0 against H1: µ >  0 , and let σ= 1.

Test T+ at significance level .02 is analogous to forming the one-sided (lower) 98% confidence interval:

µ > M – 2(1/ √n ).

where M, following Cumming, is the sample mean (thereby avoiding those x-bars). M – 2(1/ √n ) is the lower limit (LL) of a 98% CI.
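The duality can be checked numerically. What follows is a minimal Python sketch, not anything taken from Cumming; the names `lower_limit` and `rejects_h0` and the three sample means are hypothetical, chosen only for illustration. It assumes σ = 1 and the 2-standard-error cut-off, and it shows that T+ rejects H0 exactly when the lower limit of the one-sided interval lies above 0.

```python
from math import sqrt
from statistics import NormalDist

SIGMA = 1.0  # known standard deviation, as in test T+
K = 2.0      # 2-standard-error cut-off

def lower_limit(m, n):
    """Lower limit of the one-sided CI: M - K * sigma / sqrt(n)."""
    return m - K * SIGMA / sqrt(n)

def rejects_h0(m, n, mu0=0.0):
    """T+ rejects H0: mu <= mu0 when M exceeds mu0 by more than K standard errors."""
    return (m - mu0) / (SIGMA / sqrt(n)) > K

# Exact confidence level attached to the 2-standard-error cut-off (the "~.98").
level = NormalDist().cdf(K)
print(f"confidence level = {level:.4f}")

# Hypothetical sample means with n = 100 (illustrative values only).
for m in (0.10, 0.25, 0.30):
    print(m, round(lower_limit(m, 100), 3), rejects_h0(m, 100))
```

The exact level comes out a little under .98, which is why the text writes ~.98 for the interval and .02 for the test.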

Central problems with significance tests (whether of the N-P or Fisherian variety) include:

(1) results are too dichotomous (e.g., significant at a pre-set level or not);

(2) two equally statistically significant results but from tests with different sample sizes are reported in the same way  (whereas the larger the sample size the smaller the discrepancy the test is able to detect);

(3) significance levels (even observed p-values) fail to indicate the extent of the effect or discrepancy (in the case of test T+ , in the positive direction).

We would like to know for what values of δ it is warranted to infer  µ > µ0 + δ.

Considering problem (2), suppose two tests of type T+ reach the same significance level, .02 and let

(i) n = 100 and  (ii) n = 400.

(With n = 100, M = .2; with n = 400, M = .1)

(i) for n = 100, the .98 (lower) CI = µ > M – 2(1/10)

(ii)  for n = 400, the .98 (lower) CI = µ > M – 2(1/20)

So in both cases, the confidence intervals are

µ > 0

or as he writes them (0, infinity]. So how are the CIs distinguishing them?

The sample means in both cases here are just statistically significant. As Cumming states, for a 98% CI, the p-value is .02 if the interval falls so that the LL is at µ0 (p. 103).  Here, the LL (lower limit) of the CI is µ0–namely, 0.

So the p-value in our case would be .02 and the result is taken to infer µ > 0. So where is the difference? The construal is dichotomous: in or out, plausible or not; all values within the interval are on par.  But if we wish to avoid fallacies, CI’s won’t go far enough.
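To see concretely that the interval and the p-value are identical in the two cases, here is a small Python sketch under the stated assumptions (σ = 1, µ0 = 0, the 2-standard-error cut-off). The variable names are mine, not Cumming's.

```python
from math import sqrt
from statistics import NormalDist

SIGMA = 1.0
cases = {"(i)": (100, 0.2), "(ii)": (400, 0.1)}  # label -> (n, observed M)

results = {}
for label, (n, m) in cases.items():
    se = SIGMA / sqrt(n)               # standard error of M
    ll = m - 2 * se                    # lower limit of the ~.98 one-sided CI
    p = 1 - NormalDist().cdf(m / se)   # one-sided p-value against mu0 = 0
    results[label] = (ll, p)
    print(f"{label} n={n}: LL = {ll:.3f}, p = {p:.4f}")
```

Both cases give a lower limit of 0 and the same p-value (about .023, rounded to .02 in the text), so nothing in the one-sided CI report separates n = 100 from n = 400.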

To avoid fallacies of rejection, distinguish between cases (i) and (ii), and make good on the promise to have solved the problem in (3), we would need to report the extent of discrepancies well and poorly indicated. Let’s just pick an example to illustrate: is there evidence of a discrepancy of .1, i.e., that µ > .1?

For n = 100, we would say that µ > .1 is fairly well indicated (p-value is .16, associated SEV is .84)*.

The reasoning is counterfactual: were µ less than or equal to .1, it is fairly probable, .84, that a smaller M would have occurred.

For n = 400, µ > .1 is poorly indicated (p-value is .5, associated SEV is .5).

The reasoning, among many ways it can be put, is that the M observed is scarcely unusual under the assumption that µ is less than or equal to .1. The probability is .5 that we’d observe sample means as statistically significant as (or more significant than) the one we observed, even if µ were only .1.
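The two severity assessments can be reproduced with a few lines of Python. This is a sketch under the assumptions of test T+ (Normal sampling, σ = 1), using the definition in the starred note: SEV for µ > µ1 is the probability of a less statistically significant result, i.e., a smaller M, computed under µ = µ1. The function name `severity` is my own label.

```python
from math import sqrt
from statistics import NormalDist

def severity(mu1, m_obs, n, sigma=1.0):
    """SEV(mu > mu1) = P(M <= m_obs; mu = mu1) for test T+."""
    se = sigma / sqrt(n)
    return NormalDist(mu=mu1, sigma=se).cdf(m_obs)

sev_i = severity(0.1, m_obs=0.2, n=100)    # case (i)
sev_ii = severity(0.1, m_obs=0.1, n=400)   # case (ii)

print(f"(i)  SEV(mu > .1) = {sev_i:.2f}, p = {1 - sev_i:.2f}")
print(f"(ii) SEV(mu > .1) = {sev_ii:.2f}, p = {1 - sev_ii:.2f}")
```

In case (i) the observed M sits one standard error above .1; in case (ii) it sits exactly at .1, which is where the .84 and .5 come from.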

Now it might be said that it is required to always compute a two-sided interval.  But we cannot just deny one-sided tests, nor does Cumming. In fact, he encourages one-sided tests/CIs (although he also says he is happy for people to decide afterwards whether to report it as a one or two-sided test (p. 112), but I put this to one side).

Or it might be suggested that we do the usual one-sided test by means of the one-sided CI (lower) interval, but we add the CI upper (at the same level) for purposes of scrutinizing the effective discrepancy indicated. First, let me be clear that Cumming does not suggest this.  Were one to suggest this, a justification would be needed, and that would demand going beyond confidence interval reasoning to something akin to severe-testing reasoning.

Merely forming the two two-sided intervals would not help:

(i) (0, .4]

(ii) (0, .2]

The question is: how well do the data indicate µ > .1?  It would scarcely jump out at you that this is poorly warranted by (ii).  By contrast, simple severe-testing reasoning distinguishes (i) and (ii), as was wanted.
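One way to see what the two-sided intervals leave out is to tabulate severity for several discrepancies inside them. The Python sketch below is only illustrative: the grid of µ1 values and the helper names `severity` and `two_sided` are my choices, and it assumes σ = 1 with the 2-standard-error cut-off.

```python
from math import sqrt
from statistics import NormalDist

def severity(mu1, m_obs, n, sigma=1.0):
    """SEV(mu > mu1) = P(M <= m_obs; mu = mu1)."""
    return NormalDist(mu=mu1, sigma=sigma / sqrt(n)).cdf(m_obs)

def two_sided(m_obs, n, sigma=1.0, k=2.0):
    """Two-sided interval M +/- k standard errors."""
    half = k * sigma / sqrt(n)
    return (m_obs - half, m_obs + half)

grid = [0.0, 0.05, 0.1, 0.15, 0.2]  # candidate values of mu1 (illustrative)
for label, n, m in (("(i)", 100, 0.2), ("(ii)", 400, 0.1)):
    lo, hi = two_sided(m, n)
    sevs = [round(severity(mu1, m, n), 2) for mu1 in grid]
    print(label, (round(lo, 2), round(hi, 2)), sevs)
```

Every µ1 in the grid lies inside both intervals, yet the warrant for µ > µ1 falls off much faster in case (ii): for instance, µ > .15 has severity of roughly .69 with n = 100 but only about .16 with n = 400.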

This was not a very extreme example either, in terms of the difference in sample sizes. The converse problem, the inability of standard CIs to avoid fallacies of insignificant results, is even more glaring, whereas avoiding them is easily and intuitively accomplished by a severity evaluation. (For several computations, see Mayo and Spanos 2006.)

Nor does it suffice to have a series of confidence intervals as some suggest: they are still viewed as parameter values that “fit” the data to varying degrees, without a clear principled ground for using the series of intervals to avoid unwarranted interpretations. The counterfactual reasoning may be made out in terms of (what may be dubbed) a Severity Interpretation of Rejection and Acceptance (SIR) and (SIA), in Mayo and Spanos 2006, or in terms of the frequentist principle of evidence (FEV) in Mayo and Cox (2010, 256).

*Here SEV is calculated by the probability of getting a less statistically significant result, computed under the assumption that µ = .1. The SEV would increase if computed under smaller values of µ.


Cumming, G. (2012), Understanding the New Statistics, Routledge.

Mayo, D. and Spanos, A. (2006), “Severe Testing as a Basic Concept in a Neyman-Pearson Philosophy of Induction,” British Journal for the Philosophy of Science, 57: 323-357.

Mayo, D. and Cox, D. (2010), “Frequentist Statistics as a Theory of Inductive Inference,” in D. Mayo and A. Spanos (2011), pp. 247-275.

[1] Null Hypothesis Significance Tests.

[2] The warrant for the confidence, for Cumming, is the usual one: it was arrived at by a method with high probability of covering the true parameter value, and he has worked out nifty programs to view the “dance” of CI’s, as well as other statistics.

[3] “If we think of whether or not the CIs include the null value, there’s a direct correspondence between, respectively, the two- and one-tailed tests, and the two- and one-sided intervals” (Cumming 2012, p. 111).

14 thoughts on “Do CIs Avoid Fallacies of Tests? Reforming the Reformers”

  1. Paul

    The link to ‘earlier post’ comes up ‘Page not found’ so I’m not sure who’s saying “… values of [mu] …[outside a .99 confidence interval] are implausible.”

    Even 0.99 doesn’t seem high enough to bound the plausible region when these days we can evaluate millions (upon millions) of hypotheses. Physicists currently require 5 sigma confidence, which is a 0.9999994 confidence interval.

    • This is my paraphrase of what is recommended in the Reform literature on CIs. The link was just to a previous post, and also others, on the New Reformers from way back, in case a new reader hadn’t a clue what I was talking about.

  2. Corey

    Would you be interested in commenting on the confidence distribution concept? It seems severity has mostly been exposed in the philosophy of science literature, while the confidence distribution concept is known in the statistical literature. So if you care about getting practising statisticians to pay attention to severity, outlining the relationship between them might be a good way to go about it.

    • Corey: As I note, “Nor does it suffice to have a series of confidence intervals” even though they provide useful computations in need of proper interpretations. These are mentioned also in joint work with Cox. But while I like them (and I’ll have to check your wiki article to see if they’re using “confidence distribution” in a similar way), they seem more like reports of different accordances at different levels, and do not obviously underwrite the interpretation I think is needed to avoid fallacies. They’re reporting the data in a way. I referred to some of these attempts (e.g., Kempthorne) early on because at least they seemed close. Spanos emphasized the ways in which they differ from what SEV is after. So, anyway, it’s a good idea and I will look at them, though I virtually never see them in the Reformers’ handbooks.

    • Corey: This article indicates the use of confidence distributions as a kind of computational device for relating different measures, like p-values, to intervals; I don’t really see anything to help the new reformer in his/her use of CIs to replace tests. Maybe there are some other articles out there; I’m not familiar with them. It seems to me that the power analyst, especially if she is a severity analyst*, does better. *I use “severity analyst” here because one cannot use anything like “observed power,” “attained power,” or the like since these terms have sometimes been appropriated by “shpower” analysts. I’m tracing recent work on power at the moment.

      • Corey

        I seem to recall thinking, several months ago when I read your intro to severity, that the severity function was mathematically equal to the confidence distribution for a one-sided test of a mean for the normal distribution (for variance both known and unknown). Did I get that wrong?

    • Corey: I have now researched some of the work on confidence distributions (going far beyond that Wikipedia citation) and have found them quite interesting and in many ways familiar. So I will contact the authors (it seems a small group). Thanks.

  3. I don’t know what the confidence distribution is. Of course many numbers come out the same, even the one-sided Bayesian test with reference priors. The interpretation may still differ.

  4. Eileen

    I thought the Reformers, as a group, wanted power analysis, like Cohen. That was in a post of yours a while back, I think. Doesn’t Cumming also use power analysis? (I don’t have his book, should I?)

    • Eileen: I hadn’t noticed this. They mention power and power analysis, but that literature is a huge mess (in fact I just recently discovered, it’s more of a mess than I ever thought). People still ponder over what I call “shpower” and they call “retrospective” power, but that is NOT the data-dependent analysis called for by SEV. (See my posts on power and shpower.) Others are all over the place confusing power with a posterior probability. The official New Reformers, e.g., in psych, shake their fingers at retrospective power, calling it a “no no!”, but fail to give any valid reason for not using it. That is a product of their not having a clear idea of error statistical reasoning (be it tests or CIs or something else). They have merely set themselves up as a kind of Statistics Police, bossing editors around to get them to accept the programs they themselves create for confidence intervals. Power, if it comes up at all for this group, is back to being a pre-trial planning tool. Cohen himself, father of power analysis, is somewhat the root of the misinterpretation of power. I don’t know why, but he always defined the power of a test as its probability of rejecting the null (even though he’d compute it correctly as “under various alternatives or discrepancies”). Then he seemed to lose it in later work, frustrated that people weren’t computing power, and that they were misconstruing it as a posterior (as in the “world is round” paper cited in my last Sat night comedy). Traveling now, so not looking it up.

  5. Eileen

    Thanks–I’ll look up your posts on power & shpower this weekend. Speaking of weekends–are you going to make the comedy club posts a regular Saturday night feature? (Hope so!).

  6. Eileen: No (not that they ever were regular), perhaps they are a bit too radical for some: they haven’t earned applause (except some privately written e-mails)—even my very favorite task force one from last week. Perhaps it’s too much of an inside joke, but I was cracking up as I was writing it. The comedy hours at B-retreats are not funny–to me– nor are they meant to be. But they’re based on howlers repeated constantly (whereas the task force is a goofy parody).

  7. Anonymous

    I’m not familiar with May or Cumming papers cited above, but as I recall, in the recommendations from the APA task force (referenced in the post from a few days ago, and in many of the various papers on which those recommendations were based), the recommendations for using CIs were typically linked to recommendations to report (hopefully) meaningful effect sizes.

    As I recall (and I admit it’s been a long time since I read any of the articles dealing with this topic), those two things (effect sizes + CIs on effect sizes) were usually recommended not for statistical reasons, but for reasons similar to why people like Andrew Gelman might suggest presenting data in graphs rather than tables with a million decimal points: that is, as a stylistic change to help focus researchers’ attention on arguably important aspects of the data that might otherwise be missed.

    For example, I remember a number of recommendations to use effect sizes and CIs focusing on the idea that, for purposes of reporting results, it might be better/more meaningful to report something like “participants who took part in the training improved their scores by 1-5% (95% CI)” than to report something like, “training participants scored significantly higher (F(whatever, etc.), p < .05)" and hide the separate means in some giant table.

    I believe that sort of thing (IIRC) was the main thrust of many (though not all) of the articles criticizing the use of p-values in psychological research – e.g., that reporting focused too much on the simple significant/not-significant dichotomy, and not enough on interpreting magnitude and ranges of effects in context….

    • Dear Anonymous: Well it used to be a recommendation but as advice for non-statistician practitioners in psych and related social sciences, it has increasingly been a call for editors or rule books to prevent or ban the use of p-values and significance tests (you can search other posts on this blog); so you might have a look at recent work. I like CIs, but they too call for interpretive work, and a philosophical grounding, not adequately reflected in the “new reform” CI mantra. They are interpreted in a dichotomous fashion by the reformers, as I note; and the one-sided interval fails to provide information about the effects or discrepancies that are poorly warranted. The two-sided CI, interpreted as they recommend, scarcely helps: it gives the set of values that are plausible, given a preset CI level, without distinguishing between the discrepancies that are and are not warranted within the CI.
      Anyway, I’m repeating myself. Fortunately, the choice is not between the much-lampooned use of p-values and N-P tests on the one hand, and automatic CIs on the other.
      I can’t comment on a possible analogy with Gelman; so far as I know he’s one of the few Bayesians calling for the use of p-values and simple significance tests (in his own work)! The reasoning, even if done via graphs, is akin to significance test reasoning. You might see Gelman’s paper in the RMM volume this blog has often been discussing (e.g., p. 70).