The concept of a test’s power, originating in the early work of Neyman and Pearson, is by and large a pre-data concept, used for specifying a test (notably, for determining a worthwhile sample size) and for choosing between tests. In some papers, however, Neyman lists a third goal for power: to interpret test results post-data, much in the spirit of what is often called “power analysis”. This is to determine the discrepancy from a null hypothesis that may be ruled out, given nonsignificant results. One example is in the paper “The Problem of Inductive Inference” (Neyman 1955). The reason I’m bringing this up is that it has direct bearing on some of today’s most puzzling (and problematic) post-data uses of power. Interestingly, in that 1955 paper, Neyman is talking to none other than the logical positivist philosopher of confirmation, Rudolf Carnap:
I am concerned with the term “degree of confirmation” introduced by Carnap. …We have seen that the application of the locally best one-sided test to the data … failed to reject the hypothesis [that the n observations come from a source in which the null hypothesis is true]. The question is: does this result “confirm” the hypothesis that H0 is true of the particular data set? (Neyman 1955, pp. 40-41).
Neyman continues:
The answer … depends very much on the exact meaning given to the words “confirmation,” “confidence,” etc. If one uses these words to describe one’s intuitive feeling of confidence in the hypothesis tested H0, then…. the attitude described is dangerous.… [T]he chance of detecting the presence [of discrepancy from the null], when only [n] observations are available, is extremely slim, even if [the discrepancy is present]. Therefore, the failure of the test to reject H0 cannot be reasonably considered as anything like a confirmation of H0. The situation would have been radically different if the power function [corresponding to a discrepancy of interest] were, for example, greater than 0.95. (ibid.)
The general conclusion is that it is a little rash to base one’s intuitive confidence in a given hypothesis on the fact that a test failed to reject this hypothesis. A more cautious attitude would be to form one’s intuitive opinion only after studying the power function of the test applied.
Neyman alludes to a one-sided test of the mean µ of a Normal distribution, based on n IID samples with known standard deviation σ; call it test T+.
H0: µ ≤ µ0 against H1: µ > µ0.
The test statistic d(X) is the standardized sample mean: d(X) = √n(X̄ – µ0)/σ.
The test rule: Infer a (positive) discrepancy from µ0 iff d(x0) > cα where cα corresponds to a difference statistically significant at the α level.
In Carnap’s example the test could not reject the null hypothesis, i.e., d(x0) ≤ cα, but (to paraphrase Neyman) the problem is that the chance of detecting the presence of discrepancy δ from the null, with so few observations, is extremely slim, even if [δ is present]. Says Neyman:
“One may be confident in the absence of that discrepancy only if the power to detect it were high.”
The power of the test T+ to detect discrepancy δ:
(1) P(d(X) > cα; µ = µ0 + δ)
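To make (1) concrete, here is a minimal sketch in Python of the power of test T+ against a discrepancy δ. The sample size, σ, α, and δ below are illustrative values of my own, not Neyman’s.

```python
# Quantity (1): power of T+ against a discrepancy delta, i.e.
# P(d(X) > c_alpha; mu = mu0 + delta), with d(X) = sqrt(n)*(Xbar - mu0)/sigma.
# Illustrative values only.
from scipy.stats import norm

def power_Tplus(delta, n, sigma, alpha=0.05):
    """P(d(X) > c_alpha) when mu = mu0 + delta (known sigma)."""
    c_alpha = norm.ppf(1 - alpha)  # cutoff: d(x0) > c_alpha is significant at level alpha
    # Under mu = mu0 + delta, d(X) ~ N(sqrt(n) * delta / sigma, 1)
    return 1 - norm.cdf(c_alpha - delta * n ** 0.5 / sigma)

print(power_Tplus(delta=0.5, n=5, sigma=2))    # about 0.14: slim chance of detection
print(power_Tplus(delta=0.5, n=200, sigma=2))  # about 0.97: high power
```

With only a handful of observations the chance of detecting the discrepancy is slim, which is exactly Neyman’s worry about treating the nonsignificant result as confirmation of H0.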
This is rather different from the more behavioristic construal Neyman usually championed. In fact, Neyman sounds like a Cohen-style power analyst!
Still, in standard power analysis, power is calculated relative to an outcome just missing the cutoff cα. This is, in effect, the worst case of a negative (nonsignificant) result. If the actual outcome corresponds to a larger p-value (an even more negative result), it seems to me that this should be taken into account in interpreting the results. Do you agree? It is more informative, therefore, to look at the probability of getting a worse fit (with the null hypothesis) than you did:
(2) P(d(X) > d(x0); µ = µ0 + δ)
In this example, (2) gives a measure of the severity (or degree of corroboration) for the inference µ < µ0 + δ.
Although (1) may be low, (2) may be high (for numbers, see Mayo and Spanos 2006).
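As a sketch of the contrast, quantity (2) is computed just like the power in (1), except that the cutoff cα is replaced by the observed d(x0). The numbers below are illustrative stand-ins of my own, not those worked out in Mayo and Spanos (2006).

```python
# Quantity (2): attained power / severity, P(d(X) > d(x0); mu = mu0 + delta),
# compared with ordinary power (1). Illustrative numbers only.
from scipy.stats import norm

def attained_power(d_obs, delta, n, sigma):
    """P(d(X) > d(x0); mu = mu0 + delta): severity for inferring mu <= mu0 + delta."""
    return 1 - norm.cdf(d_obs - delta * n ** 0.5 / sigma)

def power_Tplus(delta, n, sigma, alpha=0.05):
    return 1 - norm.cdf(norm.ppf(1 - alpha) - delta * n ** 0.5 / sigma)

n, sigma, delta = 5, 2, 0.5
d_obs = -1.0  # a markedly nonsignificant outcome, well below c_alpha

print(power_Tplus(delta, n, sigma))            # (1) is about 0.14 -- low
print(attained_power(d_obs, delta, n, sigma))  # (2) is about 0.94 -- high
```

The more the observed outcome underfits the alternative, the higher (2) can be, even when the test’s power against that same δ is poor.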
Spanos and I (Mayo and Spanos 2006) couldn’t find a term in the literature defined precisely this way–the way I’d defined it in Mayo (1996) and before. Note that this differs from what some have called “observed power” and I call “shpower” (see this post). Spanos and I called it the severity interpretation of acceptance (SIA); in SIST, it’s cashed out as SIN: the severity interpretation of negative results. With SIA and SIN, we consider the value of the observed statistic, rather than the cutoff for rejection or significance. I also call it attained power. This is a core concept that I claim testers should be using to interpret warranted discrepancies post-data.
The claim in (2) could also be made out by viewing the p-value as a random variable and calculating its distribution under various alternatives (Cox 2006, p. 25). This reasoning yields a core frequentist principle of evidence (FEV) (Mayo and Cox 2010, p. 256):
FEV:[1] A moderate (i.e., non-small) p-value is evidence of the absence of a discrepancy δ from H0 only if there is a high probability the test would have given a worse fit with H0 (i.e., a smaller p-value) were a discrepancy δ to exist.
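A small sketch of the p-value-as-random-variable point: for test T+ the p-value is a monotone decreasing function of d(X), so P(p-value < observed p; µ = µ0 + δ) coincides with quantity (2). The simulation below, with illustrative numbers of my own, checks the equivalence.

```python
# The p-value of T+ is p = 1 - Phi(d(X)), monotone decreasing in d(X),
# so P(p-value < observed p; mu = mu0 + delta) equals quantity (2).
# Quick simulation check, illustrative numbers only.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
n, sigma, mu0, delta = 5, 2, 0.0, 0.5
d_obs = -1.0
p_obs = 1 - norm.cdf(d_obs)

# Draw sample means under mu = mu0 + delta and form d(X) and its p-value
xbar = rng.normal(mu0 + delta, sigma / np.sqrt(n), size=200_000)
d = np.sqrt(n) * (xbar - mu0) / sigma
p = 1 - norm.cdf(d)

print((p < p_obs).mean())                                # simulated: about 0.94
print(1 - norm.cdf(d_obs - delta * np.sqrt(n) / sigma))  # exact (2): about 0.94
```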
It is only in the case of a negative result that severity for various inferences is in the same direction as power. In the case of significant results, with d(x0) in excess of the cutoff, the opposite concern arises: namely, the test may be too sensitive to warrant a claimed discrepancy. So severity is always relative to the particular inference being entertained: speaking of the “severity of a test” simpliciter is an incomplete statement in this account. These assessments enable sidestepping classic fallacies of tests that are either too sensitive or not sensitive enough.[2]
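For completeness, here is a companion sketch in the significant-result direction, following the same logic: the severity for inferring µ > µ0 + δ given an observed d(x0) above the cutoff is P(d(X) ≤ d(x0); µ = µ0 + δ). The formula’s use here and the numbers are my own illustration of the “too sensitive” concern, not figures from the post.

```python
# Severity in the other direction: given a significant d(x0) > c_alpha, the
# severity for inferring mu > mu0 + delta is P(d(X) <= d(x0); mu = mu0 + delta).
# Illustrative numbers: a very large n makes a modest d(x0) significant
# without warranting any sizeable discrepancy.
from scipy.stats import norm

def severity_positive(d_obs, delta, n, sigma):
    """P(d(X) <= d(x0); mu = mu0 + delta): severity for inferring mu > mu0 + delta."""
    return norm.cdf(d_obs - delta * n ** 0.5 / sigma)

n, sigma, d_obs = 10_000, 2, 2.0  # significant at the 0.05 level (c_alpha ~ 1.645)
print(severity_positive(d_obs, delta=0.05, n=n, sigma=sigma))  # about 0.31: poor warrant
print(severity_positive(d_obs, delta=0.01, n=n, sigma=sigma))  # about 0.93: good warrant
```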
________________________________________
[1] The full version of our frequentist principle of evidence FEV corresponds to the interpretation of a small p-value:
x is evidence of a discrepancy from H0 if and only if, were H0 a correct description of the mechanism generating x, then, with high probability, a less discordant result would have occurred.
[2] Severity (SEV) may be seen as a meta-statistical principle that follows the same logic as FEV reasoning within the formal statistical analysis.
By making a SEV assessment relevant to the inference under consideration, we obtain a measure where high (low) values always correspond to good (poor) evidential warrant.
Severity did not have to be defined this way, but I decided it was best to have a concept or measure for which high values are always good, in contrast to type 1 and type 2 error probabilities. However, it means SEV has to be computed relative to what is being inferred. This requires appropriately swapping out the claim H for which one wants to assess SEV.
NOTE: This discussion was part of what I dubbed the Neyman’s Nursery posts (NN1-NN5). This was the second, NN2. Why I used that term is a long story, which you can learn about by searching this blog.
REFERENCES:
Cohen, J. (1992), “A Power Primer,” Psychological Bulletin, 112(1): 155-159.
Cox, D. R. (2006), Principles of Statistical Inference, Cambridge: Cambridge University Press.
Mayo, D. (1996), Error and the Growth of Experimental Knowledge, Chicago: University of Chicago Press.
Mayo, D. and Spanos, A. (2006), “Severe Testing as a Basic Concept in a Neyman-Pearson Philosophy of Induction,” British Journal for the Philosophy of Science, 57(2): 323-357.
Mayo, D. and Cox, D. (2010), “Frequentist Statistics as a Theory of Inductive Inference,” in D. Mayo and A. Spanos (eds.), Error and Inference, Cambridge: Cambridge University Press, pp. 247-275.
Neyman, J. (1955), “The Problem of Inductive Inference,” Communications on Pure and Applied Mathematics, 8: 13-46.
Neyman, J. (1957), “The Use of the Concept of Power in Agricultural Experimentation,” Journal of the Indian Society of Agricultural Statistics, 9: 9-17.


