A long-standing family feud among frequentists is between hypothesis tests and confidence intervals (CIs). In fact there’s a clear duality between the two: the parameter values within the (1 – α) CI are those that are not rejectable by the corresponding test at level α. Section 3.7 illuminates both CIs and severity by means of this duality. A key idea is arguing from the capabilities of methods to what may be inferred. CIs thereby obtain an inferential rationale (beyond performance), and several benchmarks are reported. Continue reading
confidence intervals and tests
Tour III Capability and Severity: Deeper Concepts
From the itinerary: A long-standing family feud among frequentists is between hypotheses tests and confidence intervals (CIs), but in fact there’s a clear duality between the two. The dual mission of the first stop (Section 3.7) of this tour is to illuminate both CIs and severity by means of this duality. A key idea is arguing from the capabilities of methods to what may be inferred. The severity analysis seamlessly blends testing and estimation. A typical inquiry first tests for the existence of a genuine effect and then estimates magnitudes of discrepancies, or inquires if theoretical parameter values are contained within a confidence interval. At the second stop (Section 3.8) we reopen a highly controversial matter of interpretation that is often taken as settled. It relates to statistics and the discovery of the Higgs particle – displayed in a recently opened gallery on the “Statistical Inference in Theory Testing” level of today’s museum. Continue reading
When the rejection ratio (1 – β)/α turns evidence on its head, for those practicing in an error-statistical tribe (ii)
I’m about to hear Jim Berger give a keynote talk this afternoon at a FUSION conference I’m attending. The conference goal is to link Bayesian, frequentist and fiducial approaches: BFF. (Program is here. See the blurb below.) April 12 update below*. Berger always has novel and intriguing approaches to testing, so I was especially curious about the new measure. It’s based on a 2016 paper by Bayarri, Benjamin, Berger, and Sellke (BBBS 2016): Rejection Odds and Rejection Ratios: A Proposal for Statistical Practice in Testing Hypotheses. They recommend:
“that researchers should report what we call the ‘pre-experimental rejection ratio’ when presenting their experimental design and researchers should report what we call the ‘post-experimental rejection ratio’ (or Bayes factor) when presenting their experimental results.” (BBBS 2016)….
“The (pre-experimental) ‘rejection ratio’ Rpre , the ratio of statistical power to significance threshold (i.e., the ratio of the probability of rejecting under H1 and H0 respectively), is shown to capture the strength of evidence in the experiment for H1 over H0 .”
If you’re seeking a comparative probabilist measure, the ratio of power/size can look like a likelihood ratio in favor of the alternative. To a practicing member of an error statistical tribe, however, whether along the lines of N, P, or F (Neyman, Pearson or Fisher), things can look topsy turvy. Continue reading
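The pre-experimental rejection ratio quoted above is simple arithmetic, and a minimal sketch may make it concrete. The function name and the example designs below are illustrative assumptions, not taken from BBBS 2016; the only thing borrowed from the paper is the definition of Rpre as power divided by the significance threshold.

```python
def pre_rejection_ratio(power: float, alpha: float) -> float:
    """Rpre = P(reject; H1) / P(reject; H0), i.e. power divided by size."""
    if not (0.0 < alpha < 1.0 and 0.0 <= power <= 1.0):
        raise ValueError("alpha must lie in (0, 1) and power in [0, 1]")
    return power / alpha


# Hypothetical designs, chosen only for illustration: (power, alpha).
designs = [(0.80, 0.05), (0.80, 0.005), (0.30, 0.05)]

for power, alpha in designs:
    ratio = pre_rejection_ratio(power, alpha)
    print(f"power = {power:.2f}, alpha = {alpha:.3f}, Rpre = {ratio:.1f}")
```

Note that nothing in the computation refers to the observed data: the ratio is fixed by the design (the chosen alternative, sample size and threshold) before any result is in.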
Nate Silver describes “How we’re forecasting the primaries” using confidence intervals. Never mind that the estimates are a few weeks old, and put entirely to one side any predictions he makes or will make. I’m only interested in this one interpretive portion of the method, as Silver describes it:
In our interactive, you’ll see a bunch of funky-looking curves like the ones below for each candidate; they represent the model’s estimate of the possible distribution of his vote share. The red part of the curve represents a candidate’s 80 percent confidence interval. If the model is calibrated correctly, then he should finish within this range 80 percent of the time, above it 10 percent of the time, and below it 10 percent of the time. (My emphasis.)
Suppose you are reading about a statistically significant result x (just at level α) from a one-sided test T+ of the mean of a Normal distribution with n iid samples, and (for simplicity) known σ: H0: µ ≤ 0 against H1: µ > 0.
I have heard some people say:
A. If the test’s power to detect alternative µ’ is very low, then the statistically significant x is poor evidence of a discrepancy (from the null) corresponding to µ’ (i.e., there’s poor evidence that µ > µ’). ◊ See point on language in notes.
They will generally also hold that if POW(µ’) is reasonably high (at least .5), then the inference to µ > µ’ is warranted, or at least not problematic.
I have heard other people say:
B. If the test’s power to detect alternative µ’ is very low, then the statistically significant x is good evidence of a discrepancy (from the null) corresponding to µ’ (i.e., there’s good evidence that µ > µ’).
They will generally also hold that if POW(µ’) is reasonably high (at least .5), then the inference to µ > µ’ is unwarranted.
Which is correct, from the perspective of the (error statistical) philosophy, within which power and associated tests are defined?
Allow that the test assumptions are adequately met. I have often said on this blog, and I repeat, the most misunderstood and abused (or unused) concept from frequentist statistics is that of a test’s power to reject the null hypothesis under the assumption alternative µ’ is true: POW(µ’). I deliberately write it in this correct manner because it is faulty to speak of the power of a test without specifying against what alternative it’s to be computed. It will also get you into trouble if you define power as in the first premise in a recent post: Continue reading
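To see why power has to be written as POW(µ’), here is a rough sketch for test T+ (H0: µ ≤ 0 against H1: µ > 0, known σ). The values α = .025 and n = 100 and the function names are assumptions made for illustration; only the test setup comes from the discussion above.

```python
from statistics import NormalDist

Z = NormalDist()  # standard Normal


def cutoff(alpha: float, n: int, sigma: float = 1.0, mu0: float = 0.0) -> float:
    """Sample mean that is just significant at level alpha in test T+."""
    return mu0 + Z.inv_cdf(1 - alpha) * sigma / n ** 0.5


def power(mu_alt: float, alpha: float, n: int,
          sigma: float = 1.0, mu0: float = 0.0) -> float:
    """POW(mu_alt) = P(sample mean > cutoff; mu = mu_alt)."""
    se = sigma / n ** 0.5
    return 1 - Z.cdf((cutoff(alpha, n, sigma, mu0) - mu_alt) / se)


alpha, n = 0.025, 100          # illustrative choices, not from the post
m_star = cutoff(alpha, n)      # the just-significant sample mean
se = 1.0 / n ** 0.5            # standard error with sigma = 1

# Power is a function of the alternative, not a single number for the test.
for mu_alt in (0.0, m_star - se, m_star, m_star + se):
    print(f"POW({mu_alt:.3f}) = {power(mu_alt, alpha, n):.3f}")
```

Two benchmarks fall out of the definition: at µ’ = 0 the power equals α, and at µ’ equal to the cutoff itself the power is exactly .5, which is where the "at least .5" threshold in claims A and B sits.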
by Stephen Senn*
Those not familiar with drug development might suppose that showing that a new pharmaceutical formulation (say a generic drug) is equivalent to a formulation that has a licence (say a brand name drug) ought to be simple. However, it can often turn out to be bafflingly difficult. Continue reading
A question came up in our seminar today about how to understand the duality between a simple one-sided test and a lower limit (LL) of a corresponding 1-sided confidence interval estimate. This is also a good route to SEV (i.e., severity). Here’s a quick answer: Continue reading
If your smoke alarm has little capability of triggering unless your house is fully ablaze, then if it has triggered, is that a strong or weak indication of a fire? Compare this insensitive smoke alarm to one that is so sensitive that burning toast sets it off. The answer is that the triggering of the insensitive alarm is a good indication of the presence of (some) fire, while hearing the ultra-sensitive alarm go off is not.[i]
Yet I often hear people say things to the effect that: Continue reading
For a reasonably high choice of confidence level, say .95 or .99, values of µ within the observed interval are plausible, those outside implausible.
Geoff Cumming, a leading statistical reformer in psychology, has long been pressing for ousting significance tests (or NHST) in favor of CIs. The level of confidence “specifies how confident we can be that our CI includes the population parameter µ” (Cumming 2012, p. 69). He recommends prespecified confidence levels .9, .95 or .99:
“We can say we’re 95% confident our one-sided interval includes the true value. We can say the lower limit (LL) of the one-sided CI…is a likely lower bound for the true value, meaning that for 5% of replications the LL will exceed the true value.” (Cumming 2012, p. 112)
For simplicity, I will use the 2-standard deviation cut-off corresponding to the one-sided confidence level of ~.98.
However, there is a duality between tests and intervals (the intervals containing the parameter values not rejected at the corresponding level with the given data).
“One-sided CIs are analogous to one-tailed tests but, as usual, the estimation approach is better.”
Is it? Consider a one-sided test of the mean of a Normal distribution with n iid samples, and known standard deviation σ, call it test T+.
H0: µ ≤ 0 against H1: µ > 0, and let σ = 1.
Test T+ at significance level .02 is analogous to forming the one-sided (lower) 98% confidence interval:
µ > M – 2(1/√n),
where M, following Cumming, is the sample mean (thereby avoiding those x-bars). M – 2(1/√n) is the lower limit (LL) of a 98% CI.
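The duality can be checked numerically. The sketch below assumes σ = 1 and the 2-standard-error cut-off used above; the observed mean M = 0.3, the sample size n = 100, and the function names are hypothetical, chosen only to make the arithmetic easy to follow.

```python
from statistics import NormalDist


def lower_limit(m: float, n: int, sigma: float = 1.0, k: float = 2.0) -> float:
    """LL of the one-sided interval mu > M - k * sigma / sqrt(n)."""
    return m - k * sigma / n ** 0.5


def rejects(m: float, mu_null: float, n: int,
            sigma: float = 1.0, k: float = 2.0) -> bool:
    """Test of mu <= mu_null rejects when M exceeds mu_null by more than k standard errors."""
    return m > mu_null + k * sigma / n ** 0.5


# Exact one-sided level of the 2-standard-error cut-off (about .023, i.e. ~.98 confidence).
level = 1 - NormalDist().cdf(2.0)

m, n = 0.3, 100                # hypothetical observed mean and sample size
ll = lower_limit(m, n)         # 0.3 - 2 * (1 / 10) = 0.1, up to rounding
print(f"level = {level:.4f}, LL = {ll:.2f}")

# A null value is rejected exactly when it falls below LL, i.e. outside the interval.
for mu_null in (0.0, 0.05, 0.15, 0.3):
    print(f"mu_null = {mu_null:.2f}: rejected = {rejects(m, mu_null, n)}, "
          f"below LL = {mu_null < ll}")
```

The two columns agree for every null value tried: the values the test fails to reject are the values inside the interval.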
Central problems with significance tests (whether of the N-P or Fisherian variety) include:
(1) results are too dichotomous (e.g., significant at a pre-set level or not);
(2) two equally statistically significant results but from tests with different sample sizes are reported in the same way (whereas the larger the sample size the smaller the discrepancy the test is able to detect);
(3) significance levels (even observed p-values) fail to indicate the extent of the effect or discrepancy (in the case of test T+ , in the positive direction).
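Point (2) can be illustrated with a few lines of arithmetic. The sketch assumes test T+ with σ = 1 and the 2-standard-error cut-off; the sample sizes and the function name are arbitrary choices for illustration.

```python
def just_significant_mean(n: int, sigma: float = 1.0,
                          k: float = 2.0, mu0: float = 0.0) -> float:
    """Observed mean M that sits exactly k standard errors above mu0."""
    return mu0 + k * sigma / n ** 0.5


# Each result would be reported as "significant at the same level",
# yet the observed mean behind that report shrinks like 1 / sqrt(n).
for n in (25, 100, 10_000):
    print(f"n = {n:>6}: just-significant M = {just_significant_mean(n):.3f}")
```

With n = 25 the just-significant mean is about 0.4; with n = 10,000 it is about 0.02. The significance report alone does not distinguish the two cases.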
We would like to know for what values of δ it is warranted to infer µ > µ0 + δ. Continue reading