*In preparation for a new post that takes up some of the recent battles on reforming or replacing p-values, I reblog an older post on power, one of the most misunderstood and abused notions in statistics. (I add a few “notes on howlers”.) The power of a test T in relation to a discrepancy from a test hypothesis H_{0} is the probability T will lead to rejecting H_{0} when that discrepancy is present. Power is sometimes misappropriated to mean something only distantly related to the probability a test leads to rejection; but I’m getting ahead of myself. This post is on a classic fallacy of rejection.* Continue reading

# power

## How to avoid making mountains out of molehills (using power and severity)

## Frequentstein’s Bride: What’s wrong with using (1 – β)/α as a measure of evidence against the null?

ONE YEAR AGO: …and growing more relevant all the time. Rather than leak any of my new book*, I reblog some earlier posts, even if they’re a bit scruffy. This was first blogged here (with a slightly different title). It’s married to posts on “the P-values overstate the evidence against the null fallacy”, such as this, and is wedded to this one on “How to Tell What’s True About Power if You’re Practicing within the Frequentist Tribe”.

In their “Comment: A Simple Alternative to p-values,” (on the ASA P-value document), Benjamin and Berger (2016) recommend researchers report a pre-data Rejection Ratio:

It is the probability of rejection when the alternative hypothesis is true, divided by the probability of rejection when the null hypothesis is true, i.e., the ratio of the power of the experiment to the Type I error of the experiment. The rejection ratio has a straightforward interpretation as quantifying the strength of evidence about the alternative hypothesis relative to the null hypothesis conveyed by the experimental result being statistically significant. (Benjamin and Berger 2016, p. 1)

## How to tell what’s true about power if you’re practicing within the error-statistical tribe

*This is a modified reblog of an earlier post, since I keep seeing papers that confuse this.*

Suppose you are reading about a result *x* that is *just statistically significant* at level α (i.e., P-value = α) in a one-sided test T+ of the mean of a Normal distribution with *n* iid samples, and (for simplicity) known σ: *H*_{0}: µ ≤ 0 against *H*_{1}: µ > 0.

I have heard some people say:

A. If the test’s power to detect alternative µ’ is very low, then the just statistically significant *x* is *poor* evidence of a discrepancy (from the null) corresponding to µ’ (i.e., there’s poor evidence that µ > µ’). (*See point on language in notes.) They will generally also hold that if POW(µ’) is reasonably high (at least .5), then the inference to µ > µ’ is warranted, or at least not problematic.

I have heard other people say:

B. If the test’s power to detect alternative µ’ is very low, then the just statistically significant *x* is *good* evidence of a discrepancy (from the null) corresponding to µ’ (i.e., there’s good evidence that µ > µ’). They will generally also hold that if POW(µ’) is reasonably high (at least .5), then the inference to µ > µ’ is *unwarranted*.

**Which is correct, from the perspective of the (error statistical) philosophy, within which power and associated tests are defined?** Continue reading

## “Nonsignificance Plus High Power Does Not Imply Support for the Null Over the Alternative.”

Taboos about power nearly always stem from misuse of power analysis. Sander Greenland (2012) has a paper called “Nonsignificance Plus High Power Does Not Imply Support for the Null Over the Alternative.” I’m not saying Greenland errs; the error would be made by anyone who interprets power analysis in a manner giving rise to Greenland’s objection. So what’s (ordinary) power analysis?

**(I) Listen to Jacob Cohen (1988) introduce Power Analysis**

“PROVING THE NULL HYPOTHESIS. Research reports in the literature are frequently flawed by conclusions that state or imply that the null hypothesis is true. For example, following the finding that the difference between two sample means is not statistically significant, instead of properly concluding from this failure to reject the null hypothesis that the data do not warrant the conclusion that the population means differ, the writer concludes, at least implicitly, that there is *no* difference. The latter conclusion is always strictly invalid, and is functionally invalid as well unless power is high. The high frequency of occurrence of this invalid interpretation can be laid squarely at the doorstep of the general neglect of attention to statistical power in the training of behavioral scientists. Continue reading

## Frequentstein: What’s wrong with (1 – β)/α as a measure of evidence against the null? (ii)

In their “Comment: A Simple Alternative to p-values,” (on the ASA P-value document), Benjamin and Berger (2016) recommend researchers report a pre-data Rejection Ratio:

It is the probability of rejection when the alternative hypothesis is true, divided by the probability of rejection when the null hypothesis is true, i.e., the ratio of the power of the experiment to the Type I error of the experiment. The rejection ratio has a straightforward interpretation as quantifying the strength of evidence about the alternative hypothesis relative to the null hypothesis conveyed by the experimental result being statistically significant. (Benjamin and Berger 2016, p. 1)

The recommendation is much more fully fleshed out in a 2016 paper by Bayarri, Benjamin, Berger, and Sellke (BBBS 2016): Rejection Odds and Rejection Ratios: A Proposal for Statistical Practice in Testing Hypotheses. Their recommendation is:

…that researchers should report the ‘pre-experimental rejection ratio’ when presenting their experimental design and researchers should report the ‘post-experimental rejection ratio’ (or Bayes factor) when presenting their experimental results. (BBBS 2016, p. 3)….

The (pre-experimental) ‘rejection ratio’ R_{pre}, the ratio of statistical power to significance threshold (i.e., the ratio of the probability of rejecting under H_{1} and H_{0} respectively), is shown to capture the strength of evidence in the experiment for H_{1} over H_{0}. (ibid., p. 2)

*But in fact it does no such thing!* [See my post from the FUSION conference here.] J. Berger, and his co-authors, will tell you the rejection ratio (and a variety of other measures created over the years) are entirely frequentist because they are created out of frequentist error statistical measures. But a creation built on frequentist measures doesn’t mean the resulting animal captures frequentist error statistical reasoning. It might be a kind of Frequentstein monster! [1] Continue reading
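As a concrete reading of the quoted definition, the pre-experimental rejection ratio is simply power divided by α. The sketch below computes it for a one-sided Normal test with known σ. The standard error of 1, α = .025, and the listed alternatives are illustrative assumptions, and `power` and `rejection_ratio` are hypothetical helper names, not anything taken from BBBS.

```python
from statistics import NormalDist

def power(mu_alt, se=1.0, alpha=0.025, mu_null=0.0):
    """Pr(test rejects; mu_alt) for the one-sided test of mu <= mu_null."""
    # Cut-off for the sample mean at significance level alpha.
    cutoff = mu_null + NormalDist().inv_cdf(1 - alpha) * se
    return 1 - NormalDist(mu_alt, se).cdf(cutoff)

def rejection_ratio(mu_alt, se=1.0, alpha=0.025):
    """(1 - beta) / alpha for a chosen point alternative mu_alt."""
    return power(mu_alt, se, alpha) / alpha

# Illustrative alternatives (assumed values, in units of the standard error).
for mu_alt in (0.5, 1.0, 2.0, 2.8, 4.0):
    print(f"mu' = {mu_alt}: power = {power(mu_alt):.3f}, "
          f"ratio = {rejection_ratio(mu_alt):.1f}")
```

The ratio is fixed before any data are seen, it depends entirely on which point alternative is plugged in, and it can never exceed 1/α (here 40) because power cannot exceed 1.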

## When the rejection ratio (1 – β)/α turns evidence on its head, for those practicing in an error-statistical tribe (ii)

I’m about to hear Jim Berger give a keynote talk this afternoon at a FUSION conference I’m attending. The conference goal is to link Bayesian, frequentist and fiducial approaches: BFF. (Program is here. See the blurb below [0]). **April 12 update below**. Berger always has novel and intriguing approaches to testing, so I was especially curious about the new measure. It’s based on a 2016 paper by Bayarri, Benjamin, Berger, and Sellke (BBBS 2016): Rejection Odds and Rejection Ratios: A Proposal for Statistical Practice in Testing Hypotheses. They recommend:

“that researchers should report what we call the ‘pre-experimental rejection ratio’ when presenting their experimental design and researchers should report what we call the ‘post-experimental rejection ratio’ (or Bayes factor) when presenting their experimental results.” (BBBS 2016)….

“The (pre-experimental) ‘rejection ratio’ R_{pre}, the ratio of statistical power to significance threshold (i.e., the ratio of the probability of rejecting under H_{1} and H_{0} respectively), is shown to capture the strength of evidence in the experiment for H_{1} over H_{0}.”

If you’re seeking a comparative probabilist measure, the ratio of power/size can look like a likelihood ratio in favor of the alternative. To a practicing member of an error statistical tribe, however, whether along the lines of N, P, or F (Neyman, Pearson or Fisher), things can look topsy turvy. Continue reading

## How to avoid making mountains out of molehills, using power/severity

A classic fallacy of rejection is taking a statistically significant result as evidence of a discrepancy from a test (or null) hypothesis larger than is warranted. Standard tests do have resources to combat this fallacy, but you won’t see them in textbook formulations. It’s not new statistical method, but new (and correct) interpretations of existing methods, that are needed. One can begin with a companion to the rule in this recent post:

(1) If POW(T+,µ’) is low, then the statistically significant *x* is a *good* indication that µ > µ’.

To have the companion rule also in terms of power, let’s suppose that our result *is just statistically significant* at a level α. (As soon as the observed difference exceeds the cut-off the rule has to be modified.)

Rule (1) was stated in relation to a statistically significant result **x** (at level α) from a one-sided test T+ of the mean of a Normal distribution with *n* iid samples, and (for simplicity) known σ: *H*_{0}: µ ≤ 0 against *H*_{1}: µ > 0. Here’s the companion:

(2) If POW(T+,µ’) is high, then an α statistically significant *x* is a *good* indication that µ < µ’. (The higher the POW(T+,µ’) is, the better the indication that µ < µ’.)

That is, if the test’s power to detect alternative µ’ is *high*, then the statistically significant *x* is a *good* indication (or good evidence) that the discrepancy from null is *not* as large as µ’ (i.e., there’s good evidence that µ < µ’).

An account of severe testing based on error statistics is always keen to indicate inferences that are not warranted by the data, as well as those that are. Not only might we wish to indicate which discrepancies are poorly warranted, we can give upper bounds to warranted discrepancies by using (2).

**POWER**: POW(T+,µ’) = Pr(Test T+ rejects *H*_{0}; µ’) = Pr(M > M*; µ’), where M is the sample mean and M* is the cut-off for rejection. (Since it’s continuous, it doesn’t matter if we write > or ≥.)[i]

**EXAMPLE**. Let σ = 10, *n* = 100, so (σ/√*n*) = 1. Test T+ rejects H_{0} at the .025 level if M > 1.96(1).

Find the power against µ = 2.8. To find Pr(M > 1.96; 2.8), get the standard Normal z = (1.96 – 2.8)/1 = -.84. Find the area to the right of -.84 on the standard Normal curve. It is .8. So POW(T+,2.8) = .8.
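The calculation can be checked numerically. This is a minimal sketch using Python’s standard library; `power_one_sided` is a hypothetical helper name, and the call uses the alternative for which z = -.84, namely µ = 2.8.

```python
from math import sqrt
from statistics import NormalDist

def power_one_sided(mu_alt, sigma, n, alpha, mu_null=0.0):
    """Pr(M > M*; mu_alt) for the one-sided Normal test T+ with known sigma."""
    se = sigma / sqrt(n)                                      # sigma / sqrt(n)
    cutoff = mu_null + NormalDist().inv_cdf(1 - alpha) * se   # M*, about 1.96 here
    # Area to the right of the cut-off when the true mean is mu_alt.
    return 1 - NormalDist(mu_alt, se).cdf(cutoff)

print(round(power_one_sided(2.8, sigma=10, n=100, alpha=0.025), 2))
```

Evaluating the same function at the null value µ = 0 returns α itself, which is a quick sanity check on the cut-off.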

For simplicity in what follows, let the cut-off, M*, be 2. Let the observed mean M_{0} just reach the cut-off 2.

The power against alternatives between the null and the cut-off M* will range from α to .5. Power exceeds .5 only once we consider alternatives greater than M*, for these yield negative z values. Power fact, POW(M* + 1(σ/√*n*)) = .84.

That is, adding one (σ/√*n*) unit to the cut-off M* takes us to an alternative against which the test has power = .84. So, POW(T+, µ = 3) = .84. See this post.

By (2), the (just) significant result *x* is decent evidence that µ < 3, because if µ ≥ 3, we’d have observed a more statistically significant result, with probability .84. The upper .84 confidence limit is 3. The significant result is much better evidence that µ < 4; the upper .975 confidence limit is 4 (approx.), etc.
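The power facts and upper confidence limits in this example can be reproduced with a short sketch. It assumes the rounded cut-off M* = 2 and σ/√n = 1 from the example; `power_at` and `upper_bound` are hypothetical helper names.

```python
from statistics import NormalDist

SE = 1.0       # sigma / sqrt(n) with sigma = 10, n = 100
CUTOFF = 2.0   # M*, rounded from 1.96 as in the example

def power_at(mu_alt, cutoff=CUTOFF, se=SE):
    """POW(T+, mu_alt) = Pr(M > M*; mu_alt)."""
    return 1 - NormalDist(mu_alt, se).cdf(cutoff)

def upper_bound(observed_mean, level, se=SE):
    """One-sided upper confidence limit for mu at the given confidence level."""
    return observed_mean + NormalDist().inv_cdf(level) * se

# Power stays at or below .5 up to the cut-off and exceeds .5 beyond it.
for mu in (0, 1, 2, 3, 4):
    print(f"POW(mu' = {mu}) = {power_at(mu):.3f}")

# Upper limits for an observed mean that just reaches the cut-off.
for level in (0.84, 0.975):
    print(f"upper {level} limit = {upper_bound(CUTOFF, level):.2f}")
```

With these inputs the power against µ = 3 comes out at about .84, and the two upper limits land close to 3 and 4, in line with the figures quoted above.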

Reporting (2) is typically of importance in cases of highly sensitive tests, but I think it should always accompany a rejection to avoid making mountains out of molehills. (However, in my view, (2) should be custom-tailored to the outcome not the cut-off.) In the case of statistical *in*significance, (2) is essentially ordinary *power analysis.* (In that case, the interest may be to avoid making molehills out of mountains.) Power analysis, applied to insignificant results, is especially of interest with low-powered tests. For example, failing to find a statistically significant increase in some risk may at most rule out (substantively) large risk increases. It might not allow ruling out risks of concern. Naturally, what counts as a risk of concern is a context-dependent consideration, often stipulated in regulatory statutes.

NOTES ON HOWLERS: When researchers set a high power to detect µ’, it is not an indication they regard µ’ as plausible, likely, expected, probable or the like. Yet we often hear people say “if statistical testers set .8 power to detect µ = 2.8 (in test T+), they must regard µ = 2.8 as probable in some sense”. No, in no sense. Another thing you might hear is, “when *H*_{0}: µ ≤ 0 is rejected (at the .025 level), it’s reasonable to infer µ > 2.8”, or “testers are comfortable inferring µ ≥ 2.8”. No, they are not comfortable, nor should you be. Such an inference would be wrong with probability ~.8. Given M = 2 (or 1.96), you need to subtract to get a lower confidence bound, if the confidence level is to exceed .5. For example, µ > .5 is a lower confidence bound at confidence level .93.
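To see why the inference to the high-power alternative is so poorly warranted, one can compute the confidence level attached to a one-sided lower bound. A rough sketch, assuming σ/√n = 1 as in the example; `lower_bound_confidence` is a hypothetical name, and µ = 2.8 is the alternative against which the test with cut-off 1.96 has power .8.

```python
from statistics import NormalDist

def lower_bound_confidence(observed_mean, mu_lower, se=1.0):
    """Confidence level of the one-sided claim mu > mu_lower, given the observed mean."""
    return NormalDist().cdf((observed_mean - mu_lower) / se)

# Inferring mu > 2.8 from a result that just reaches the cut-off 1.96.
print(round(lower_bound_confidence(1.96, 2.8), 2))
# Subtracting from the observed mean raises the confidence level.
print(round(lower_bound_confidence(2.0, 0.5), 2))
# Taking the observed mean itself as the bound gives exactly one half.
print(round(lower_bound_confidence(2.0, 2.0), 2))
```

The first level is about .2, so the claim µ > 2.8 would be in error with probability about .8; the second is about .93.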

Rule (2) also provides a way to distinguish values *within* a 1-α confidence interval (instead of choosing a given confidence level and then reporting CIs in the dichotomous manner that is now typical).

At present, power analysis is only used to interpret negative results (and there it is often called “retrospective power”, which is a fine term, but it’s often defined as what I call shpower). Again, confidence bounds could be, but they are not now, used to this end [iii].

**Severity replaces M* in (2) with the actual result, be it significant or insignificant.**

Looking at power means looking at the best case (just reaching a significance level) or the worst case (just missing it). This is way too coarse; we need to *custom tailor* results using the observed data. That’s what severity does, but for this post, I wanted to just illuminate the logic.[ii]
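A minimal sketch of what the custom-tailoring amounts to for claims of the form µ < µ’: take the probability in rule (2) and swap the cut-off M* for the observed mean. The observed means tried below are illustrative assumptions, σ/√n = 1 is carried over from the example, and `severity_upper` is a hypothetical name rather than an established function.

```python
from statistics import NormalDist

def severity_upper(observed_mean, mu_prime, se=1.0):
    """Pr(M > observed_mean; mu_prime): rule (2) with M* replaced by the actual result."""
    return 1 - NormalDist(mu_prime, se).cdf(observed_mean)

# Assessment of the claim mu < 3 for three possible observed means.
for m0 in (2.0, 3.0, 4.0):
    print(f"observed M = {m0}: {severity_upper(m0, mu_prime=3.0):.3f}")
```

Working from the cut-off alone, every rejection would be credited with .84 for µ < 3. Using the observed mean, a result of 4 gives only about .16, which is why the cut-off-based assessment is too coarse.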

*One more thing:*

**Applying (1) and (2) requires the error probabilities to be actual** (approximately correct): Strictly speaking, rules (1) and (2) have a conjunct in their antecedents [iv]: “given the test assumptions are sufficiently well met”. *If background knowledge leads you to deny (1) or (2), it indicates you’re denying the reported error probabilities are the actual ones.* There’s evidence the test fails an “audit”. That, at any rate, is what I would argue.

————

[i] To state power in terms of P-values: POW(µ’) = Pr(P < p*; µ’) where P < p* corresponds to rejecting the null hypothesis at the given level.

[ii] It must be kept in mind that statistical testing inferences are going to be in the form of µ > µ’ = µ_{0} + δ, or µ ≤ µ’ = µ_{0} + δ or the like. They are *not* to point values! (Not even to the point µ = M_{0}.) Take a look at the alternative *H*_{1}: µ > 0. It is not a point value. Although we are going beyond inferring the existence of some discrepancy, we still retain inferences in the form of inequalities.

[iii] That is, upper confidence bounds are too readily viewed as “plausible” bounds, and as values for which the data provide positive evidence. In fact, as soon as you get to an upper bound at confidence levels of around .6, .7, .8, etc. you actually have evidence µ’ < CI-upper. See this post.

[iv] The “antecedent” of a conditional refers to the statement between the “if” and the “then”.

OTHER RELEVANT POSTS ON POWER

- (6/9) U-Phil: Is the Use of Power* Open to a Power Paradox?
- (3/4/14) Power, power everywhere–(it) may not be what you think! [illustration]
- (3/12/14) Get empowered to detect power howlers
- (3/17/14) Stephen Senn: “Delta Force: To what Extent is clinical relevance relevant?”
- (3/19/14) Power taboos: Statue of Liberty, Senn, Neyman, Carnap, Severity
- (12/29/14) To raise the power of a test is to lower (not raise) the “hurdle” for rejecting the null (Ziliak and McCloskey 3 years on)
- (01/03/15) No headache power (for Deirdre)

## Telling What’s True About Power, if practicing within the error-statistical tribe

Suppose you are reading about a statistically significant result *x* (*just* at level α) from a one-sided test T+ of the mean of a Normal distribution with *n* iid samples, and (for simplicity) known σ: *H*_{0}: µ ≤ 0 against *H*_{1}: µ > 0.

I have heard some people say [0]:

A. If the test’s power to detect alternative µ’ is very low, then the statistically significant *x* is *poor* evidence of a discrepancy (from the null) corresponding to µ’ (i.e., there’s poor evidence that µ > µ’). (◊See point on language in notes.) They will generally also hold that if POW(µ’) is reasonably high (at least .5), then the inference to µ > µ’ is warranted, or at least not problematic.

I have heard other people say:

B. If the test’s power to detect alternative µ’ is very low, then the statistically significant *x* is *good* evidence of a discrepancy (from the null) corresponding to µ’ (i.e., there’s good evidence that µ > µ’). They will generally also hold that if POW(µ’) is reasonably high (at least .5), then the inference to µ > µ’ is *unwarranted*.

**Which is correct, from the perspective of the (error statistical) philosophy, within which power and associated tests are defined?**

Allow that the test assumptions are adequately met. I have often said on this blog, and I repeat, the most misunderstood and abused (or unused) concept from frequentist statistics is that of a test’s power to reject the null hypothesis under the assumption alternative µ’ is true: POW(µ’). I deliberately write it in this correct manner because it is faulty to speak of the power of a test without specifying against what alternative it’s to be computed. It will also get you into trouble if you define power as in the first premise in a recent post: Continue reading

## Spot the power howler: α = ß?

Spot the fallacy!

- The power of a test is the probability of correctly rejecting the null hypothesis. Write it as 1 – β.
- So, the probability of incorrectly rejecting the null hypothesis is β.
- But the probability of incorrectly rejecting the null is α (the type 1 error probability).

So α = β.

I’ve actually seen this, and variants on it [1].

[1] Although they didn’t go so far as to reach the final, shocking, deduction.

## No headache power (for Deirdre)

Deirdre McCloskey’s comment leads me to try to give a “no headache” treatment of some key points about the **power of a statistical test**. (Trigger warning: formal stat people may dislike the informality of my exercise.)

We all know that for a given test, **as the probability of a type 1 error goes down the probability of a type 2 error goes up (and power goes down**).

And **as the probability of a type 2 error goes down (and power goes up), the probability of a type 1 error goes up.** Leaving everything else the same. There’s a **trade-off** between the two error probabilities. (No free lunch.) No headache powder called for.

So if someone said, as the power increases, the probability of a type 1 error *decreases*, they’d be saying: **As the type 2 error decreases, the probability of a type 1 error decreases!** **That’s the opposite of a trade-off.** So you’d know automatically they’d made a mistake or were defining things in a way that differs from standard NP statistical tests.

Before turning to my little exercise, I note that power is defined in terms of a test’s cut-off for rejecting the null, whereas a severity assessment always considers the actual value observed (attained power). Here I’m just trying to clarify *regular old* power, as defined in a N-P test.

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Let’s use a familiar oversimple example to fix the trade-off in our minds so that it cannot be dislodged. Our old friend, test T+ : We’re testing the mean of a Normal distribution with *n* iid samples, and (for simplicity) known, fixed σ:

H_{0}: µ ≤ 0 against H_{1}: µ > 0

Let **σ = 2**, *n* = 25, so (σ/√*n*) = .4. To avoid those annoying X-bars, I will use **M for the sample mean**. I will abbreviate (σ/√*n*) as σ_{x}.

- **Test T+** is a rule: reject H_{0} iff M > m*.
- **Power of a test T+** is computed in relation to values of µ > 0.
- The **power** of T+ against alternative µ = µ_{1} = Pr(T+ rejects H_{0}; µ = µ_{1}) = Pr(M > m*; µ = µ_{1}).

We may abbreviate this as: POW(T+,α, µ = µ_{1}) Continue reading
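The trade-off can be made concrete with the numbers just given (σ = 2, n = 25). This is only a sketch: the alternative µ_{1} = 1 and the list of α values are illustrative assumptions, and `cutoff` and `power` are hypothetical helper names.

```python
from math import sqrt
from statistics import NormalDist

SIGMA, N = 2.0, 25
SE = SIGMA / sqrt(N)   # sigma_x = 0.4, as in the text
MU_1 = 1.0             # illustrative alternative (an assumption, not from the post)

def cutoff(alpha):
    """m*: reject H0 iff the sample mean M exceeds this value."""
    return NormalDist().inv_cdf(1 - alpha) * SE

def power(alpha, mu_1=MU_1):
    """POW(T+, alpha, mu = mu_1) = Pr(M > m*; mu = mu_1)."""
    return 1 - NormalDist(mu_1, SE).cdf(cutoff(alpha))

# Shrinking alpha raises the cut-off m* and so lowers the power against mu_1.
for alpha in (0.16, 0.05, 0.025, 0.001):
    print(f"alpha = {alpha}: m* = {cutoff(alpha):.2f}, power = {power(alpha):.3f}")
```

Reading the output from top to bottom, the type 1 error probability falls, the hurdle m* rises, and the power falls with it; reading it from bottom to top gives the other half of the trade-off.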

## To raise the power of a test is to lower (not raise) the “hurdle” for rejecting the null (Ziliak and McCloskey 3 years on)

I said I’d reblog one of the 3-year “memory lane” posts marked in red, with a few new comments (in burgundy), from time to time. So let me comment on one referring to Ziliak and McCloskey on power (from Oct. 2011). I would think they’d *want* to correct some wrong statements, or explain their shifts in meaning. My hope is that, 3 years on, they’ll be ready to do so. By mixing some correct definitions with erroneous ones, they introduce more confusion into the discussion.

From my post 3 years ago: “The Will to Understand Power”: In this post, I will adhere precisely to the text, and offer no new interpretation of tests. Type 1 and 2 errors and power are just formal notions with formal definitions. But we need to get them right (especially if we are giving expert advice). You can hate the concepts; just define them correctly please. They write:

“The error of the second kind is the error of accepting the null hypothesis of (say) zero effect when the null is in fact false, that is, when (say) such and such a positive effect is true.”

So far so good (keeping in mind that “positive effect” refers to a parameter discrepancy, say δ, not an observed difference).

And the power of a test to detect that such and such a positive effect δ is true is equal to the probability of rejecting the null hypothesis of (say) zero effect when the null is in fact false, and a positive effect as large as δ is present.

Fine.

Let this alternative be abbreviated H’(δ):

H’(δ): there is a positive effect as large as δ.

Suppose the test rejects the null when it reaches a significance level of .01.

(1) The power of the test to detect H’(δ) =

P(test rejects null at .01 level; H’(δ) is true).

Say it is 0.85.

“If the power of a test is high, say, 0.85 or higher, then the scientist can be reasonably confident that at minimum the null hypothesis (of, again, zero effect if that is the null chosen) is false and that therefore his rejection of it is highly probably correct”. (Z & M, 132-3).

But this is not so. Perhaps they are slipping into the cardinal error of mistaking (1) as a posterior probability:

(1’) P(H’(δ) is true| test rejects null at .01 level)! Continue reading

## Neyman, Power, and Severity

*Jerzy Neyman: April 16, 1894-August 5, 1981.* This reblogs posts under “The Will to Understand Power” & “Neyman’s Nursery” here & here.

Way back when, although I’d never met him, I sent my doctoral dissertation, *Philosophy of Statistics,* to one person only: Professor Ronald Giere. (And he would read it, too!) I knew from his publications that he was a leading defender of frequentist statistical methods in philosophy of science, and that he’d worked for a time with Birnbaum in NYC.

Some ~~ten~~ 15 years ago, Giere decided to quit philosophy of statistics (while remaining in philosophy of science): I think it had to do with a certain form of statistical exile (in philosophy). He asked me if I wanted his papers—a mass of work on statistics and statistical foundations gathered over many years. Could I make a home for them? I said yes. Then came his caveat: there would be a lot of them.

As it happened, we were building a new house at the time, Thebes, and I designed a special room on the top floor that could house a dozen or so file cabinets. (I painted it pale rose, with white lacquered book shelves up to the ceiling.) Then, for more than 9 months (same as my son!), I waited . . . Several boxes finally arrived, containing hundreds of files—each meticulously labeled with titles and dates. More than that, the labels were hand-typed! I thought, If Ron knew what a slob I was, he likely would not have entrusted me with these treasures. *(Perhaps he knew of no one else who would actually want them!)* Continue reading

## A. Spanos: “Recurring controversies about P values and confidence intervals revisited”

**Aris Spanos**

Wilson E. Schmidt Professor of Economics

*Department of Economics, Virginia Tech*

**Recurring controversies about P values and confidence intervals revisited**

*Ecological Society of America (ESA) ECOLOGY*

Forum—P Values and Model Selection (pp. 609-654)

Volume 95, Issue 3 (March 2014): pp. 645-651

*INTRODUCTION*

The use, abuse, interpretations and reinterpretations of the notion of a *P* value has been a hot topic of controversy since the 1950s in statistics and several applied fields, including psychology, sociology, ecology, medicine, and economics.

The initial controversy between Fisher’s significance testing and the Neyman and Pearson (N-P; 1933) hypothesis testing concerned the extent to which the pre-data Type I error probability α can address the arbitrariness and potential abuse of Fisher’s *post-data threshold* for the *P* value. Continue reading

## Stephen Senn: “Delta Force: To what extent is clinical relevance relevant?” (Guest Post)

**Stephen Senn**

Head, Methodology and Statistics Group,

Competence Center for Methodology and Statistics (CCMS),

Luxembourg

**Delta Force**

*To what extent is clinical relevance relevant?*

**Inspiration**

This note has been inspired by a Twitter exchange with respected scientist and famous blogger David Colquhoun. He queried whether a treatment that had 2/3 of an effect that would be described as *clinically relevant* could be useful. I was surprised at the question, since I would regard it as being pretty obvious that it could but, on reflection, I realise that things that may seem obvious to some who have worked in drug development may not be obvious to others, and if they are *not* obvious to others are either in need of a defence or wrong. I don’t think I am wrong and this note is to explain my thinking on the subject. Continue reading

## Get empowered to detect power howlers

**If a test’s power to detect µ’ is low then a statistically significant result is good/lousy evidence of discrepancy µ’? Which is it?**

If your smoke alarm has little capability of triggering unless your house is fully ablaze, then if it has triggered, is that a strong or weak indication of a fire? Compare this insensitive smoke alarm to one that is so sensitive that burning toast sets it off. The answer is: that the alarm from the insensitive detector is triggered is a good indication of the presence of (some) fire, while hearing the ultra sensitive alarm go off is not.[i]

Yet I often hear people say things to the effect that: Continue reading

## Power taboos: Statue of Liberty, Senn, Neyman, Carnap, Severity

Is it taboo to use a test’s power to assess what may be learned from the data in front of us? (Is it limited to pre-data planning?) If not entirely taboo, some regard power as irrelevant post-data[i], and the reason I’ve heard is along the lines of an analogy Stephen Senn gave today (in a comment discussing his last post here)[ii].

My fire alarm analogy is here. My analogy presumes you are assessing the situation (about the fire) long distance. Continue reading →