If you like Neyman’s confidence intervals then you like N-P tests

Neyman

Neyman, confronted with unfortunate news, would always say “too bad!” At the end of Jerzy Neyman’s birthday week, I cannot help imagining him saying “too bad!” as regards some twists and turns in the statistics wars. First, too bad Neyman-Pearson (N-P) tests aren’t in the ASA Statement (2016) on P-values: “To keep the statement reasonably simple, we did not address alternative hypotheses, error types, or power”. An especially aggrieved “too bad!” would be earned by the fact that those in love with confidence interval estimators don’t appreciate that Neyman developed them (in 1930) as a method with a precise interrelationship with N-P tests. So if you love CI estimators, then you love N-P tests!

Consider a typical N-P test of the mean of a Normal distribution, test T+: H₀: µ ≤ µ₀ vs H₁: µ > µ₀.

Imagine σ is known, since nothing of interest to the logic changes if it is estimated, as is more typical. Notice the null hypothesis is composite (it is not a point), and the alternative is explicit (you can’t jump from a small P-value to some theory that would “explain” it).[i]

The (1 – α) confidence interval (CI) corresponding to test T+ is that µ > the (1 – α) lower bound:

µ > M – c_α(σ/√n).

M is the sample mean, and this is the generic lower confidence bound. Replacing M with the observed sample mean M₀ yields the particular CI lower bound.

Why does µ > M – c_α(σ/√n) correspond to the above test T+? Why is it an inversion of, or dual to, the test?

Consider, said Neyman, that the values of µ that exceed M₀ – c_α(σ/√n) are values of µ that could not be rejected at level α with sample mean M₀. Equivalently, these are values of the parameter µ that M₀ is not statistically significantly greater than at a P-value of α. Yes, CIs correspond to Neyman-Pearson tests and were developed by Neyman in 1930, a bit after Fisher’s Fiducial intervals. Yes, those doing CIs (the so-called “new” statistics) are doing Neyman-Pearson tests, only inverted. Neyman didn’t care if you called them hypothesis tests or significance tests (as we saw in my last post). [ii]
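The inversion can be checked numerically. What follows is only a minimal sketch, assuming the Normal model with known σ described above; the function names and the numbers (M₀ = 0.4, σ = 1, n = 25, α = 0.025) are hypothetical and chosen purely for illustration. It computes the lower bound M₀ – c_α(σ/√n) and then runs test T+ at parameter values at, above, and below that bound.

```python
from statistics import NormalDist

Z = NormalDist()  # standard Normal distribution


def lower_bound(m0, sigma, n, alpha):
    """One-sided (1 - alpha) lower confidence bound: m0 - c_alpha * sigma / sqrt(n)."""
    c_alpha = Z.inv_cdf(1 - alpha)
    return m0 - c_alpha * sigma / n ** 0.5


def p_value_t_plus(m0, mu0, sigma, n):
    """P-value of test T+ (H0: mu <= mu0 vs H1: mu > mu0), computed at mu = mu0."""
    z = (m0 - mu0) / (sigma / n ** 0.5)
    return 1 - Z.cdf(z)


# Hypothetical numbers, for illustration only.
m0, sigma, n, alpha = 0.4, 1.0, 25, 0.025

mu_low = lower_bound(m0, sigma, n, alpha)

# Testing exactly at the bound gives a P-value of alpha (up to rounding).
p_at_bound = p_value_t_plus(m0, mu_low, sigma, n)
# A value of mu above the bound is not rejected at level alpha ...
p_above = p_value_t_plus(m0, mu_low + 0.1, sigma, n)
# ... while a value below the bound is rejected.
p_below = p_value_t_plus(m0, mu_low - 0.1, sigma, n)

print(f"lower bound: {mu_low:.4f}")
print(f"P at bound: {p_at_bound:.4f}, above: {p_above:.4f}, below: {p_below:.4f}")
```

The values of µ inside the one-sided interval are exactly those the test would not reject with this M₀, which is the sense in which the interval is the test, inverted.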

Thanks to the duality between tests and confidence intervals, you could give the information provided by a confidence interval at any level in terms of the corresponding test. For a two-sided, 95% confidence interval [µ_L, µ_U]:

µ_L is the (parameter) value that the sample mean is just statistically significantly greater than at the P = .025 level.

µ_U is the (parameter) value that the sample mean is just statistically significantly lower than at the P = .025 level.
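As a rough illustration of these two readings, here is a small sketch under the same assumptions as before (Normal model, known σ). The helper names `two_sided_ci`, `p_greater` and `p_lower`, and the numbers, are hypothetical.

```python
from statistics import NormalDist

Z = NormalDist()  # standard Normal distribution


def two_sided_ci(m0, sigma, n, level=0.95):
    """Two-sided confidence interval [mu_L, mu_U] for the mean, sigma known."""
    tail = (1 - level) / 2
    half_width = Z.inv_cdf(1 - tail) * sigma / n ** 0.5
    return m0 - half_width, m0 + half_width


def p_greater(m0, mu, sigma, n):
    """P(M >= m0; mu): how significantly m0 exceeds the value mu."""
    return 1 - Z.cdf((m0 - mu) / (sigma / n ** 0.5))


def p_lower(m0, mu, sigma, n):
    """P(M <= m0; mu): how significantly m0 falls below the value mu."""
    return Z.cdf((m0 - mu) / (sigma / n ** 0.5))


# Hypothetical numbers, for illustration only.
m0, sigma, n = 0.4, 1.0, 25
mu_l, mu_u = two_sided_ci(m0, sigma, n)

print(f"95% CI: [{mu_l:.4f}, {mu_u:.4f}]")
print(f"P-value of m0 against mu_L (upper tail): {p_greater(m0, mu_l, sigma, n):.4f}")
print(f"P-value of m0 against mu_U (lower tail): {p_lower(m0, mu_u, sigma, n):.4f}")
```

Both printed P-values should come out at .025 up to rounding: each endpoint of the 95% interval is the parameter value at which the observed mean is just significant in the corresponding one-sided test.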

That means it is wrong to say you cannot ascertain anything about the population effect size using P-value computations. You can. It’s not the only way. You can also use P-value functions (Fraser, Cox), power, and severity, but they are all interrelated.

You ask: Please tell me the value of µ that the sample mean M₀ is just statistically significantly greater than, at the P = .025 level? The answer is the lower confidence bound µ_L.

If the tester is able to determine the P-value corresponding to a specific value of µ you wanted to test, then she is also able to use the observed M₀ to compute the value µ_L.

Likewise for finding µ_U. All the information is there.
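To make the point concrete, the sketch below pretends the tester has nothing but a P-value calculator and recovers µ_L from it by searching for the µ whose P-value equals .025. It assumes the same Normal model with known σ; the bisection routine `invert_for_mu_l` and the numbers are hypothetical, and any root-finder would do equally well.

```python
from statistics import NormalDist

Z = NormalDist()  # standard Normal distribution


def p_value(m0, mu, sigma, n):
    """Upper-tail P-value of the observed mean m0 against the parameter value mu."""
    return 1 - Z.cdf((m0 - mu) / (sigma / n ** 0.5))


def invert_for_mu_l(m0, sigma, n, p_target=0.025, tol=1e-10):
    """Find mu with p_value(m0, mu) == p_target by bisection (needs p_target < 0.5)."""
    se = sigma / n ** 0.5
    # The P-value rises with mu: about 0 at m0 - 10*se and exactly 0.5 at mu = m0.
    lo, hi = m0 - 10 * se, m0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if p_value(m0, mid, sigma, n) < p_target:
            lo = mid  # m0 exceeds mid too significantly; move up
        else:
            hi = mid
    return (lo + hi) / 2


# Hypothetical numbers, for illustration only.
m0, sigma, n = 0.4, 1.0, 25
mu_l_search = invert_for_mu_l(m0, sigma, n)
mu_l_formula = m0 - Z.inv_cdf(0.975) * sigma / n ** 0.5

print(f"mu_L from P-values: {mu_l_search:.6f}")
print(f"mu_L from formula:  {mu_l_formula:.6f}")
```

The search and the closed-form bound should agree to within the tolerance, which is all the duality claims: whoever can compute the P-value can compute the bound.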

But choosing a single confidence level is quite inadequate. Yet that is still what members of today’s “new” CI tribe do, generally choosing .95. They get very upset at your dichotomizing P ≤ 0.05 and P > 0.05, but happily dichotomize on whether µ is in or out of the CI formed.

The severe tester always infers a discrepancy that is well indicated (if any) but also at least one that is poorly indicated. In relation to test T+, the inference µ > M₀, where M₀ is the observed mean, is a good benchmark for a terrible inference! It corresponds to a lower confidence bound at level 0.5! And yet, critics of significance tests (at least, from outside the error statistical family) often advocate inferring

µ ≥ M₀

as either comparatively more likely or probable than the null or test hypothesis. For detailed examples, see SIST Excursion 4 Tour II Rejection Fallacies: Who’s Exaggerating What?
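One way to see why µ > M₀ is such a poor inference is to tabulate the lower bound at several confidence levels rather than one. This is only a sketch under the Normal, known-σ assumptions used throughout; the numbers and the particular list of levels are hypothetical.

```python
from statistics import NormalDist

Z = NormalDist()  # standard Normal distribution


def lower_bound(m0, sigma, n, level):
    """One-sided lower confidence bound for mu at the given confidence level."""
    return m0 - Z.inv_cdf(level) * sigma / n ** 0.5


# Hypothetical numbers, for illustration only.
m0, sigma, n = 0.4, 1.0, 25
levels = [0.5, 0.8, 0.9, 0.95, 0.975, 0.99]

# Map each confidence level to the claim "mu > bound" it licenses.
bounds = {level: lower_bound(m0, sigma, n, level) for level in levels}

for level, bound in bounds.items():
    print(f"confidence level {level:.3f}: mu > {bound:.4f}")
```

At level 0.5 the multiplier is zero, so the bound is M₀ itself: the claim µ > M₀ is one that this method would get wrong half the time. Higher levels push the bound further below M₀.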

So why are members of the Confidence Interval tribe going around misrepresenting hypothesis tests as if they must take the form of Fisherian “simple” significance tests with a point null (nil) hypothesis, usually of 0? (N-P tests were purposely designed to improve upon Fisher’s tests, and it’s that improvement that gives you CIs.) And why do they say what’s inferred with a CI cannot be ascertained with N-P tests? Are they unaware they’re using N-P tests? Or is the simple Fisherian test (no explicit alternative, no consideration of power) just much easier to criticize? If they’re cousins or brothers, why the family feud? Sibling rivalry? Why be a Unitarian? Most testers would supply a P-value as well as a CI. The severe tester combines the two, so that discrepancies are directly reported from test results. For another reason, see [iii].

Critics of tests from outside the family will also take the simple “nil” point null vs a two-sided alternative as their foil, and demonstrate that the P-value ≠ either their Bayes Factor or posterior probability. It serves as a convenient straw test to knock down. If they kept the comparison to one-sided tests, they would not disagree (at least not with any sensible prior). This is shown by Casella and R. Berger (1987) and the reconciliation is agreed to by Berger and Sellke (1987). See SIST Excursion 4 Tour II Rejection Fallacies: Who’s Exaggerating What?

I’m not saying the simple significance test doesn’t have uses; it’s vital for testing assumptions of statistical models. That’s why Bayesians who want to check their models can be found sneaking P-value goodies from the tests that many of them profess to dislike. If a small P-value indicates a discrepancy from the null there, it does so in other uses too. [iv]

Note too the connection between confidence intervals and severity: Taking a sample mean M that is just statistically significant at level α (Mα) as warranting µ > µ₀ with severity 1 – α is the same as inferring µ > M₀ – c_α(σ/√n) at confidence level 1 – α. However, severity improves on CIs by breaking out of the single confidence level, providing an inferential justification (rather than merely a long-run coverage rationale), and avoiding a number of fallacies and paradoxes of ordinary CIs. For a post on CIs and severity see here. Also see: Do CIs Avoid Fallacies of Tests? Reforming the Reformers. For a full discussion, see SIST.
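A small numerical sketch of this connection follows. It assumes, since the post does not spell the formula out, that the severity for the claim µ > µ₁ in test T+ with observed mean M₀ is computed as P(M ≤ M₀; µ = µ₁); the function name `severity_mu_greater`, the numbers, and the list of discrepancies are hypothetical. See SIST for the actual definitions.

```python
from statistics import NormalDist

Z = NormalDist()  # standard Normal distribution


def severity_mu_greater(m0, mu1, sigma, n):
    """Assumed severity for the claim mu > mu1: P(M <= m0; mu = mu1)."""
    return Z.cdf((m0 - mu1) / (sigma / n ** 0.5))


# Hypothetical numbers, for illustration only.
m0, sigma, n, alpha = 0.4, 1.0, 25, 0.025

# Lower confidence bound at level 1 - alpha.
mu_low = m0 - Z.inv_cdf(1 - alpha) * sigma / n ** 0.5

# Severity of "mu > mu_low" matches the confidence level (up to rounding).
sev_at_bound = severity_mu_greater(m0, mu_low, sigma, n)
print(f"SEV(mu > {mu_low:.4f}) = {sev_at_bound:.4f}")

# Severity for a range of claimed discrepancies, not just one level.
for mu1 in [0.0, 0.1, 0.2, 0.3, 0.4, 0.5]:
    print(f"SEV(mu > {mu1:.1f}) = {severity_mu_greater(m0, mu1, sigma, n):.4f}")
```

Under that assumed formula, the claim µ > µ_L passes with severity 1 – α, matching the confidence level, and the assessment falls as the claimed lower value rises, which is the sense in which one reports several benchmarks rather than a single interval.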

[i] The null and alternative would be treated symmetrically. You are to choose the null, or more properly, what Neyman called the test hypothesis, according to which error was more serious. A lot of the agony that has people up in arms regarding the fallacy of taking non-significant results as evidence for a (point) null is immediately scotched by letting the test hypothesis be “an effect exists” (or an effect of a given magnitude is present). For example, T-: H₀: µ ≥ µ₀ vs H₁: µ < µ₀.

A two-sided test, if wanted, may be seen as doing two one-sided tests (Cox and Hinkley 1974).

[ii] Note the equivalences:

µ < M – c_α(σ/√n) iff M > µ + c_α(σ/√n)

So µ < CI lower at confidence level 1 – α iff M reaches statistical significance at P = α in test T+. Since M is continuous we could use ≤ or <.

Iff = if and only if.

[iii] Some prefer CIs to corresponding tests because it’s easier to slide the confidence level onto the interval estimate, viewing it as affording a probability assignment to the interval itself. This of course is, strictly, a fallacy, unless one just stipulates: I assign “probability” .95, say, to the result of applying a method if that method has .95 “coverage probability”. This is/was the Fiducial dream. But one cannot do probability computations with these assignments. For the severe tester’s evidential interpretation of CIs, please see SIST, Excursion 3 Tour III.

[iv] Moving from a discrepancy (from a model assumption) to a particular rival model invites the same risks as when explaining other small P-values by invoking a rival insofar as the null and the rival model do not exhaust the possibilities.

SIST = Mayo, D. (2018), Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars, CUP.

© Deborah G. Mayo, Error Statistics Philosophy, 2011-2018 All Rights Reserved.
