# Posts Tagged With: confidence intervals

## Anything Tests Can do, CIs do Better; CIs Do Anything Better than Tests?* (reforming the reformers cont.)

Having reblogged the 5/17/12 post on “reforming the reformers” yesterday, I thought I should reblog its follow-up: 6/2/12.

Consider again our one-sided Normal test T+, with null H0: μ < μ0 vs μ >μ0  and  μ0 = 0,  α=.025, and σ = 1, but let n = 25. So M is statistically significant only if it exceeds .392. Suppose M (the sample mean) just misses significance, say

Mo = .39.

The flip side of a fallacy of rejection (discussed before) is a fallacy of acceptance, or the fallacy of misinterpreting statistically insignificant results.  To avoid the age-old fallacy of taking a statistically insignificant result as evidence of zero (0) discrepancy from the null hypothesis μ =μ0, we wish to identify discrepancies that can and cannot be ruled out.  For our test T+, we reason from insignificant results to inferential claims of the form:

μ < μ0 + γ

Fisher continually emphasized that failure to reject was not evidence for the null.  Neyman, we saw, in chastising Carnap, argued for the following kind of power analysis:

Neymanian Power Analysis (Detectable Discrepancy Size DDS): If data x are not statistically significantly different from H0, and the power to detect discrepancy γ is high (low), then x constitutes good (poor) evidence that the actual effect is < γ. (See 11/9/11 post).

By taking into account the actual x0, a more nuanced post-data reasoning may be obtained.

“In the Neyman-Pearson theory, sensitivity is assessed by means of the power—the probability of reaching a preset level of significance under the assumption that various alternative hypotheses are true. In the approach described here, sensitivity is assessed by means of the distribution of the random variable P, considered under the assumption of various alternatives. “ (Cox and Mayo 2010, p. 291):

This may be captured in :

FEV(ii): A moderate p-value is evidence of the absence of a discrepancy d from Ho only if there is a high probability the test would have given a worse fit with H0 (i.e., a smaller p value) were a discrepancy d to exist. (Mayo and Cox 2005, 2010, 256).

This is equivalently captured in the Rule of Acceptance (Mayo (EGEK) 1996, and in the severity interpretation for acceptance, SIA, Mayo and Spanos (2006, p. 337):

SIA: (a): If there is a very high probability that [the observed difference] would have been larger than it is, were μ > μ1, then μ < μ1 passes the test with high severity,…

But even taking tests and CIs just as we find them, we see that CIs do not avoid the fallacy of acceptance: they do not block erroneous construals of negative results adequately. Continue reading

## Do CIs Avoid Fallacies of Tests? Reforming the Reformers (Reblog 5/17/12)

The one method that enjoys the approbation of the New Reformers is that of confidence intervals. The general recommended interpretation is essentially this:

For a reasonably high choice of confidence level, say .95 or .99, values of µ within the observed interval are plausible, those outside implausible.

Geoff Cumming, a leading statistical reformer in psychology, has long been pressing for ousting significance tests (or NHST[1]) in favor of CIs. The level of confidence “specifies how confident we can be that our CI includes the population parameter m (Cumming 2012, p.69). He recommends prespecified confidence levels .9, .95 or .99:

“We can say we’re 95% confident our one-sided interval includes the true value. We can say the lower limit (LL) of the one-sided CI…is a likely lower bound for the true value, meaning that for 5% of replications the LL will exceed the true value. “ (Cumming 2012, p. 112)[2]

For simplicity, I will use the 2-standard deviation cut-off corresponding to the one-sided confidence level of ~.98.

However, there is a duality between tests and intervals (the intervals containing the parameter values not rejected at the corresponding level with the given data).[3]

“One-sided CIs are analogous to one-tailed tests but, as usual, the estimation approach is better.”

Is it?   Consider a one-sided test of the mean of a Normal distribution with n iid samples, and known standard deviation σ, call it test T+.

H0: µ ≤  0 against H1: µ >  0 , and let σ= 1.

Test T+ at significance level .02 is analogous to forming the one-sided (lower) 98% confidence interval:

µ > M – 2(1/ √n ).

where M, following Cumming, is the sample mean (thereby avoiding those x-bars). M – 2(1/ √n ) is the lower limit (LL) of a 98% CI.

Central problems with significance tests (whether of the N-P or Fisherian variety) include:

(1) results are too dichotomous (e.g., significant at a pre-set level or not);

(2) two equally statistically significant results but from tests with different sample sizes are reported in the same way  (whereas the larger the sample size the smaller the discrepancy the test is able to detect);

(3) significance levels (even observed p-values) fail to indicate the extent of the effect or discrepancy (in the case of test T+ , in the positive direction).

We would like to know for what values of δ it is warranted to infer  µ > µ0 + δ. Continue reading

Categories: confidence intervals and tests, reformers, Statistics |

## G. Cumming Response: The New Statistics

Prof. Geoff Cumming [i] has taken up my invite to respond to “Do CIs Avoid Fallacies of Tests? Reforming the Reformers” (May 17th), reposted today as well. (I extend the same invite to anyone I comment on, whether it be in the form of a comment or full post).   He reviews some of the complaints against p-values and significance tests, but he has not here responded to the particular challenge I raise: to show how his appeals to CIs avoid the fallacies and weakness of significance tests. The May 17 post focuses on the fallacy of rejection; the one from June 2, on the fallacy of acceptance. In each case, one needs to supplement his CIs with something along the lines of the testing scrutiny offered by SEV. At the same time, a SEV assessment avoids the much-lampooned uses of p-values–or so I have argued. He does allude to a subsequent post, so perhaps he will address these issues there.

The New Statistics

PROFESSOR GEOFF CUMMING [ii] (submitted June 13, 2012)

I’m new to this blog—what a trove of riches! I’m prompted to respond by Deborah Mayo’s typically insightful post of 17 May 2012, in which she discussed one-sided tests and referred to my discussion of one-sided CIs (Cumming, 2012, pp 109-113). A central issue is:

Cumming (quoted by Mayo): as usual, the estimation approach is better

Mayo: Is it?

Lots to discuss there. In this first post I’ll outline the big picture as I see it.

‘The New Statistics’ refers to effect sizes, confidence intervals, and meta-analysis, which, of course, are not themselves new. But using them, and relying on them as the basis for interpretation, would be new for most researchers in a wide range of disciplines—that for decades have relied on null hypothesis significance testing (NHST). My basic argument for the new statistics rather than NHST is summarised in a brief magazine article (http://tiny.cc/GeoffConversation) and radio talk (http://tiny.cc/geofftalk). The website www.thenewstatistics.com has information about the book (Cumming, 2012) and ESCI software, which is a free download.

Categories: Statistics |

## Repost (5/17/12): Do CIs Avoid Fallacies of Tests? Reforming the Reformers

The one method that enjoys the approbation of the New Reformers is that of confidence intervals (See May 12, 2012, and links). The general recommended interpretation is essentially this:

For a reasonably high choice of confidence level, say .95 or .99, values of µ within the observed interval are plausible, those outside implausible.

Geoff Cumming, a leading statistical reformer in psychology, has long been pressing for ousting significance tests (or NHST[1]) in favor of CIs. The level of confidence “specifies how confident we can be that our CI includes the population parameter m (Cumming 2012, p.69). He recommends prespecified confidence levels .9, .95 or .99:

“We can say we’re 95% confident our one-sided interval includes the true value. We can say the lower limit (LL) of the one-sided CI…is a likely lower bound for the true value, meaning that for 5% of replications the LL will exceed the true value. “ (Cumming 2012, p. 112)[2]

For simplicity, I will use the 2-standard deviation cut-off corresponding to the one-sided confidence level of ~.98.

However, there is a duality between tests and intervals (the intervals containing the parameter values not rejected at the corresponding level with the given data).[3]

“One-sided CIs are analogous to one-tailed tests but, as usual, the estimation approach is better.”

Is it?   Consider a one-sided test of the mean of a Normal distribution with n iid samples, and known standard deviation σ, call it test T+.

H0: µ ≤  0 against H1: µ >  0 , and let σ= 1.

Test T+ at significance level .02 is analogous to forming the one-sided (lower) 98% confidence interval:

µ > M – 2(1/ √n ).

where M, following Cumming, is the sample mean (thereby avoiding those x-bars). M – 2(1/ √n ) is the lower limit (LL) of a 98% CI.

Central problems with significance tests (whether of the N-P or Fisherian variety) include: Continue reading

Categories: Statistics |

## Anything Tests Can do, CIs do Better; CIs Do Anything Better than Tests?* (reforming the reformers cont.)

*The title is to be sung to the tune of “Anything You Can Do I Can Do Better”  from one of my favorite plays, Annie Get Your Gun (‘you’ being replaced by ‘test’).

This post may be seen to continue the discussion in May 17 post on Reforming the Reformers.

Consider again our one-sided Normal test T+, with null H0: μ < μ0 vs μ >μ0  and  μ0 = 0,  α=.025, and σ = 1, but let n = 25. So M is statistically significant only if it exceeds .392. Suppose M just misses significance, say

Mo = .39.

The flip side of a fallacy of rejection (discussed before) is a fallacy of acceptance, or the fallacy of misinterpreting statistically insignificant results.  To avoid the age-old fallacy of taking a statistically insignificant result as evidence of zero (0) discrepancy from the null hypothesis μ =μ0, we wish to identify discrepancies that can and cannot be ruled out.  For our test T+, we reason from insignificant results to inferential claims of the form:

μ < μ0 + γ

Fisher continually emphasized that failure to reject was not evidence for the null.  Neyman, we saw, in chastising Carnap, argued for the following kind of power analysis:

Neymanian Power Analysis (Detectable Discrepancy Size DDS): If data x are not statistically significantly different from H0, and the power to detect discrepancy γ is high(low), then x constitutes good (poor) evidence that the actual effect is no greater than γ. (See 11/9/11 post)

By taking into account the actual x0, a more nuanced post-data reasoning may be obtained.

“In the Neyman-Pearson theory, sensitivity is assessed by means of the power—the probability of reaching a preset level of significance under the assumption that various alternative hypotheses are true. In the approach described here, sensitivity is assessed by means of the distribution of the random variable P, considered under the assumption of various alternatives. “ (Cox and Mayo 2010, p. 291):

## Do CIs Avoid Fallacies of Tests? Reforming the Reformers

The one method that enjoys the approbation of the New Reformers is that of confidence intervals (See May 12, 2012, and links). The general recommended interpretation is essentially this:

For a reasonably high choice of confidence level, say .95 or .99, values of µ within the observed interval are plausible, those outside implausible.

Geoff Cumming, a leading statistical reformer in psychology, has long been pressing for ousting significance tests (or NHST[1]) in favor of CIs. The level of confidence “specifies how confident we can be that our CI includes the population parameter m (Cumming 2012, p.69). He recommends prespecified confidence levels .9, .95 or .99:

“We can say we’re 95% confident our one-sided interval includes the true value. We can say the lower limit (LL) of the one-sided CI…is a likely lower bound for the true value, meaning that for 5% of replications the LL will exceed the true value. “ (Cumming 2012, p. 112)[2]

For simplicity, I will use the 2-standard deviation cut-off corresponding to the one-sided confidence level of ~.98.

However, there is a duality between tests and intervals (the intervals containing the parameter values not rejected at the corresponding level with the given data).[3]

“One-sided CIs are analogous to one-tailed tests but, as usual, the estimation approach is better.”

Is it?   Consider a one-sided test of the mean of a Normal distribution with n iid samples, and known standard deviation σ, call it test T+. Continue reading

Categories: Statistics |

## Philosophy of Statistics: Retraction Watch, Vol. 1, No. 1

This morning I received a paper I have been asked to review (anonymously as is typical). It is to head up a forthcoming issue of a new journal called Philosophy of Statistics: Retraction Watch.  This is the first I’ve heard of the journal, and I plan to recommend they publish the piece, conditional on revisions. I thought I would post the abstract here. It’s that interesting.

“Some Slightly More Realistic Self-Criticism in Recent Work in Philosophy of Statistics,” Philosophy of Statistics: Retraction Watch, Vol. 1, No. 1 (2012), pp. 1-19.

In this paper we delineate some serious blunders that we and others have made in published work on frequentist statistical methods. First, although we have claimed repeatedly that a core thesis of the frequentist testing approach is that a hypothesis may be rejected with increasing confidence as the power of the test increases, we now see that this is completely backwards, and we regret that we have never addressed, or even fully read, the corrections found in Deborah Mayo’s work since at least 1983, and likely even before that.

Second, we have been wrong to claim that Neyman-Pearson (N-P) confidence intervals are inconsistent because in special cases it is possible for a specific 95% confidence interval to be known to be correct. Not only are the examples required to show this absurdly artificial, but the frequentist could simply interpret this “vacuous interval” “as a statement that all parameter values are consistent with the data at a particular level,” which, as Cox and Hinkley note, is an informative statement about the limitations in the data (Cox and Hinkley 1974, 226). Continue reading

Categories: Comedy, Statistics |