The one method that enjoys the approbation of the New Reformers is that of confidence intervals. The general recommended interpretation is essentially this:

For a reasonably high choice of confidence level, say .95 or .99, values of µ within the observed interval are plausible, those outside implausible.

Geoff Cumming, a leading statistical reformer in psychology, has long been pressing to oust significance tests (or NHST[1]) in favor of CIs. The level of confidence “specifies how confident we can be that our CI includes the population parameter µ” (Cumming 2012, p. 69). He recommends prespecified confidence levels of .9, .95 or .99:

“We can say we’re 95% confident our one-sided interval includes the true value. We can say the lower limit (LL) of the one-sided CI…is a likely lower bound for the true value, meaning that for 5% of replications the LL will exceed the true value.” (Cumming 2012, p. 112)[2]

For simplicity, I will use the 2-standard deviation cut-off corresponding to the one-sided confidence level of ~.98.

However, there is a duality between tests and intervals (the intervals containing the parameter values not rejected at the corresponding level with the given data).[3]

“One-sided CIs are analogous to one-tailed tests but, as usual, the estimation approach is better.”

Is it? Consider a one-sided test of the mean of a Normal distribution with n iid samples, and known standard deviation σ, call it test T+.

H_{0}: µ ≤ 0 against H_{1}: µ > 0, and let σ = 1.

Test T+ at significance level .02 is analogous to forming the one-sided (lower) 98% confidence interval:

µ > M – 2(1/√n),

where M, following Cumming, is the sample mean (thereby avoiding those x-bars). M – 2(1/√n) is the lower limit (LL) of a 98% CI.
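The duality between test T+ and the one-sided interval can be checked numerically. The following is a minimal sketch, assuming σ = 1 and the 2-standard-deviation cut-off used above; the function names `lower_limit` and `rejects_null` and the sample (M, n) pairs are illustrative and not from Cumming or the original post.

```python
from math import sqrt

def lower_limit(m, n, sigma=1.0, k=2.0):
    """Lower limit M - k*sigma/sqrt(n) of the one-sided (lower) interval."""
    return m - k * sigma / sqrt(n)

def rejects_null(m, n, mu0=0.0, sigma=1.0, k=2.0):
    """Test T+ rejects H0: mu <= mu0 when M exceeds mu0 by more than k standard errors."""
    return (m - mu0) / (sigma / sqrt(n)) > k

# Hypothetical (M, n) pairs, chosen away from the exact cut-off:
# T+ rejects H0 exactly when the lower limit lies above mu0 = 0.
for m, n in [(0.25, 100), (0.15, 100), (0.12, 400), (0.05, 400)]:
    assert rejects_null(m, n) == (lower_limit(m, n) > 0.0)
```

In each pair the test and the interval agree, which is all the duality claims: the interval collects the values of µ that the test would not reject with the given data.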

Central problems with significance tests (whether of the N-P or Fisherian variety) include:

(1) results are too dichotomous (e.g., significant at a pre-set level or not);

(2) two equally statistically significant results but from tests with different sample sizes are reported in the same way (whereas the larger the sample size the smaller the discrepancy the test is able to detect);

(3) significance levels (even observed p-values) fail to indicate the extent of the effect or discrepancy (in the case of test T+ , in the positive direction).

We would like to know for what values of δ it is warranted to infer µ > µ_{0} + δ.

Considering problem (2), suppose two tests of type T+ reach the same significance level, .02, and let

(i) n = 100 and (ii) n = 400.

(With n = 100, M = .2; with n = 400, M = .1)

(i) for n = 100, the .98 (lower) CI = µ > M – 2(1/10)

(ii) for n = 400, the .98 (lower) CI = µ > M – 2(1/20)

So in both cases, the confidence intervals are

µ > 0

or as he writes them (0, infinity]. So how are the CIs distinguishing them?

The sample means in both cases here are *just* statistically significant. As Cumming states, for a 98% CI, the p-value is .02 if the interval falls so that the LL is at µ_{0} (p. 103). Here, the LL (lower limit) of the CI is µ_{0}, namely 0.

So the p-value in our case would be .02 and the result is taken to infer µ > 0. So where is the difference? The construal is dichotomous: in or out, plausible or not; all values within the interval are *on par*. But if we wish to avoid fallacies, CIs won’t go far enough.
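To see concretely that the two cases are reported identically, here is a small sketch computing the standardized difference, the one-sided p-value, and the lower limit for (i) and (ii). It assumes σ = 1 and µ_{0} = 0 as above; the variable names are illustrative. Note that the exact 2-standard-deviation cut-off gives a p-value of about .023, which the text rounds to .02.

```python
from math import sqrt
from statistics import NormalDist

SIGMA = 1.0
MU0 = 0.0
cases = {"(i)": (100, 0.2), "(ii)": (400, 0.1)}  # label -> (n, observed M)

summary = {}
for label, (n, m) in cases.items():
    se = SIGMA / sqrt(n)                    # standard error of M
    z = (m - MU0) / se                      # standardized distance from mu0
    p_value = 1.0 - NormalDist().cdf(z)     # P(M >= observed; mu = mu0)
    ll = m - 2.0 * se                       # lower limit of the one-sided interval
    summary[label] = (round(z, 6), round(p_value, 4), round(ll, 6))
    print(label, "n =", n, "z =", round(z, 3), "p =", round(p_value, 4), "LL =", round(ll, 3))
```

Both rows come out the same (z = 2, p ≈ .023, LL = 0), so neither the p-value nor the one-sided interval registers the difference in sample size.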

To avoid fallacies of rejection, distinguish between cases (i) and (ii), and make good on the promise to have solved the problem in (3), we would need to report the extent of discrepancies well and poorly indicated. Let’s just pick an example to illustrate: Is there evidence of a discrepancy of .1, i.e., that µ > .1?

For n = 100, we would say that µ > .1 is fairly well indicated (p-value is .16, associated SEV is .84)*.

The reasoning is counterfactual: were µ less than or equal to .1, it is fairly probable, .84, that a smaller M would have occurred.

For n = 400, µ > .1 is poorly indicated (p-value is .5, associated SEV is .5).

The reasoning, among many ways it can be put, is that the M observed is scarcely unusual under the assumption that µ is less than or equal to .1. The probability is .5 that we’d observe sample means as statistically significant as (or more significant than) the one we observed, even if µ were only .1.
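The two severity values quoted above can be reproduced with a few lines. This is a minimal sketch, assuming (as in the footnote to the n = 100 case) that SEV for the claim µ > .1 is the probability of a less statistically significant result computed under µ = .1; the function name `severity` is illustrative.

```python
from math import sqrt
from statistics import NormalDist

def severity(m, n, mu1, sigma=1.0):
    """SEV(mu > mu1): probability of a sample mean no larger than the observed m, under mu = mu1."""
    return NormalDist().cdf((m - mu1) / (sigma / sqrt(n)))

sev_i = severity(0.2, 100, 0.1)    # case (i): n = 100, M = .2
sev_ii = severity(0.1, 400, 0.1)   # case (ii): n = 400, M = .1
print(round(sev_i, 2), round(sev_ii, 2))  # prints: 0.84 0.5
```

The same observed significance level thus corresponds to quite different warrant for µ > .1 once the sample size is taken into account.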

Now it might be said that it is required to always compute a two-sided interval. But we cannot just deny one-sided tests, nor does Cumming. In fact, he encourages one-sided tests/CIs (he also says he is happy for people to decide afterwards whether to report a result as a one- or two-sided test (p. 112), but I put this to one side).

Or it might be suggested that we do the usual one-sided test by means of the one-sided CI (lower) interval, but we add the CI upper (at the same level) for purposes of scrutinizing the effective discrepancy indicated. First, let me be clear that Cumming does not suggest this. Were one to suggest this, a justification would be needed, and that would demand going beyond confidence interval reasoning to something akin to severe-testing reasoning.

Merely forming the two two-sided intervals would not help:

(i) (0, .4]

(ii) (0, .2]

The question is: how well do the data indicate µ > .1? It would scarcely jump out at you that this is poorly warranted by (ii). In this way, simple severe testing reasoning distinguishes (i) and (ii) as was wanted.
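One way to make the contrast with the two-sided intervals vivid is to tabulate severity for a range of discrepancies inside those intervals. The sketch below assumes σ = 1 and the same severity calculation as for the µ > .1 example; the grid of discrepancies and the names `severity`, `deltas`, and `rows` are illustrative choices, not part of the original discussion.

```python
from math import sqrt
from statistics import NormalDist

def severity(m, n, mu1, sigma=1.0):
    """SEV(mu > mu1): probability of a sample mean no larger than the observed m, under mu = mu1."""
    return NormalDist().cdf((m - mu1) / (sigma / sqrt(n)))

deltas = [0.0, 0.05, 0.1, 0.15, 0.2]   # hypothetical discrepancies to probe
rows = []
for d in deltas:
    rows.append((d,
                 round(severity(0.2, 100, d), 2),   # case (i): n = 100, M = .2
                 round(severity(0.1, 400, d), 2)))  # case (ii): n = 400, M = .1

for d, s_i, s_ii in rows:
    print(f"mu > {d:<4}  SEV (i) = {s_i:<5} SEV (ii) = {s_ii}")
```

Both cases give about .98 for µ > 0, but the values diverge quickly for larger discrepancies: every value printed lies inside both two-sided intervals, yet the claims µ > .1 and beyond are far less well warranted in case (ii) than in case (i).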

This was not a very extreme example either, in terms of the difference in sample sizes. The converse problem, the inability of standard CIs to avoid fallacies of insignificant results, is even more glaring; whereas it is easily and intuitively accomplished by a severity evaluation. (For several computations, see Mayo and Spanos 2006: “Severe Testing as a Basic Concept in a Neyman-Pearson Philosophy of Induction.”)

Nor does it suffice to have a series of confidence intervals as some suggest: they are still viewed as parameter values that “fit” the data to varying degrees, without a clear principled ground for using the series of intervals to avoid unwarranted interpretations. The counterfactual reasoning may be made out in terms of (what may be dubbed) a Severity Interpretation of Rejection and Acceptance (SIR and SIA), in Mayo and Spanos 2006, or in terms of the frequentist principle of evidence (FEV) in Mayo and Cox (2010, 256): “Frequentist Statistics as a Theory of Inductive Inference.”

*Here SEV is calculated by the probability of getting a less statistically significant result, computed under the assumption that µ = .1. The SEV would increase if computed under smaller values of µ.
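Written out as a formula, the calculation in this note looks as follows. This is a sketch in notation that is not in the original: m_obs stands for the observed sample mean, µ_1 for the discrepancy being probed, and Φ for the standard Normal distribution function.

```latex
\[
\mathrm{SEV}(\mu > \mu_1)
  = P\left(M \le m_{\mathrm{obs}};\ \mu = \mu_1\right)
  = \Phi\!\left(\frac{m_{\mathrm{obs}} - \mu_1}{\sigma/\sqrt{n}}\right)
\]
% Case (i): n = 100, m_obs = .2, mu_1 = .1, sigma = 1
\[
\Phi\!\left(\frac{.2 - .1}{1/\sqrt{100}}\right) = \Phi(1) \approx .84
\]
```

Since Φ is increasing, a smaller µ_1 makes the argument larger, which is why the SEV increases when computed under smaller values of µ.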

Cumming, G. (2012), *Understanding the New Statistics*, Routledge.

Mayo, D. and Spanos, A. (2006), “Severe Testing as a Basic Concept in a Neyman-Pearson Philosophy of Induction,” *British Journal for the Philosophy of Science*, 57: 323-357.

Mayo, D. and Cox, D. (2010), “Frequentist Statistics as a Theory of Inductive Inference,” in D. Mayo and A. Spanos (2011), pp. 247-275.

[1] Null Hypothesis Significance Tests.

[2] The warrant for the confidence, for Cumming, is the usual one: it was arrived at by a method with high probability of covering the true parameter value, and he has worked out nifty programs to view the “dance” of CIs, as well as other statistics.

[3] “If we think of whether or not the CIs include the null value, there’s a direct correspondence between, respectively, the two- and one-tailed tests, and the two- and one-sided intervals” (Cumming 2012, p. 111).

One-sided CIs fail right on the very first definition presented above: “…values of µ within the observed interval are plausible…”

How can a µ value corresponding to infinity be “plausible”? Some serious subversion of language happened here.

We are better off forgetting about one-sided testing, except that, as you said, “But we cannot just deny one-sided tests…”, and I cannot disagree with that either.

R. Moritz: I don’t understand your claim/question: How can a µ value corresponding to infinity be “plausible”?

Corresponding to infinity? I guess there are (uncountably) infinitely many possible values for the deflection of light, but we have excellent evidence it’s approximately the value in the general theory of relativity (GTR), even if GTR winds up being false. I don’t care if this is cashed out “instrumentally” (in terms of the range of values expected in experiments) or in terms of a model of the cause of the outcomes (or some such thing). I prefer the latter in this case, but in other contexts, the instrumental construal may be what we know.

But anyway, I am missing the problem, and if there is one, I don’t think it’s a problem of one-sided tests (which strike me as the most useful in numerous situations).

This may seem slightly off topic, but is it just me or are other people also annoyed about the wording of the interpretation of a CI “we can be 95% confident that the parameter is in the interval”?

In fact, the 95% come from a probability statement. Granted, some find the CI-related probability statement difficult to stomach, but at least people discussed the meaning of probability statements for 300 years and there are some well established (albeit non-unique) explanations. The statement “we are 95% confident” is totally mysterious, though. The only explanation of it can be given by translating it into the original probability statement, because “percentage of confidence” doesn’t have a definition outside the context of CIs. But apparently some people are under the impression that this is easier to understand than the original probability statement. How on earth could that be?

Christian: That is one of the big reasons for using severity (or some notion akin to it). We CAN ascertain which statements about discrepancies are (and are not) well-warranted according to how severely they have passed. The severity principle moves from how improbable it would be to generate outcomes x standard deviations in excess of the mean, under the supposition they were generated from a process where parameter values are smaller than the corresponding 1-sided lower CI bound. It’s this reasoning that I am trying to propose as a way to both make sense of methods, and avoid fallacies. See what I mean?

Recall the section 5.3 from my Stat Science and Phil Science paper:

http://errorstatistics.com/2012/10/20/mayo-section-5-statsci-and-philsci-part-2/

Mayo: Yes, fine by me, although I personally don’t have big problems with the proper probabilistic interpretation of CIs either (only that people are keen to make something else of it than it actually is).

R. Moritz: One-sided testing seems all right to me; at least in the many situations in which we are indeed interested in a one-sided question (“if it’s not significantly better than placebo, we’re not buying it, let alone it being worse.”)

At least one of the standard criticisms against testing (“who can believe in a point null hypothesis anyway?”) goes away in one-sided testing.

Christian: Firstly, let me say it is good to hear from you after a while. Second, you had initially said:

“The statement “we are 95% confident” is totally mysterious, though. The only explanation of it can be given by translating it into the original probability statement, because “percentage of confidence” doesn’t have a definition outside the context of CIs.”

Now the problem that I have is the supposition that 95% strength or strong evidence or the like, must or should have an interpretation in terms of the probability of some event (and probability is of an event). It just doesn’t get at what we want when we demand evidence. Instead, I am arguing, we may USE probabilistic ideas to characterize the testing method (one needn’t call it a test; it’s scarcely limited to hypothesis tests, but holds for any inference).

Nor is it that the testing method is often reliable that lets us use this information to qualify the evidence in the particular case. I claim this is ordinary reasoning from the capability of a method, to what has/has not been shown in the case at hand. It is only in formal statistical contexts, it seems to me, that people forget the ordinary pattern of “uncertain” inference, and imagine it’s all about probabilities of events. I would be hard pressed to find a case where the high probability of an event translated into high evidence for something. We needn’t preclude that way of speaking about events, even though I find it odd, but the main point is that hypotheses are not events.

It’s interesting to look at the discussion from May 17, 2012 (with some of the same people who still weigh in).

http://errorstatistics.com/2012/05/17/4060/

I wrote:

I like CIs, but they too call for interpretive work, and a philosophical grounding, not adequately reflected in the “new reform” CI mantra. They are interpreted in a dichotomous fashion by the reformers, as I note; and the one-sided interval fails to provide information about the effects or discrepancies that are poorly warranted. The two-sided CI, interpreted as they recommend, scarcely helps: it gives the set of values that are plausible at a preset CI level, without distinguishing between the discrepancies that are and are not warranted within the CI.