Deirdre McCloskey’s comment leads me to try to give a “no headache” treatment of some key points about the power of a statistical test. (Trigger warning: formal stat people may dislike the informality of my exercise.)
We all know that for a given test, as the probability of a type 1 error goes down the probability of a type 2 error goes up (and power goes down).
And as the probability of a type 2 error goes down (and power goes up), the probability of a type 1 error goes up. Leaving everything else the same. There’s a trade-off between the two error probabilities. (No free lunch.) No headache powder called for.
So if someone said, as the power increases, the probability of a type 1 error decreases, they’d be saying: as the probability of a type 2 error decreases, the probability of a type 1 error decreases! That’s the opposite of a trade-off. So you’d know automatically they’d made a mistake or were defining things in a way that differs from standard N-P statistical tests.
Before turning to my little exercise, I note that power is defined in terms of a test’s cut-off for rejecting the null, whereas a severity assessment always considers the actual value observed (attained power). Here I’m just trying to clarify regular old power, as defined in an N-P test.
Let’s use a familiar oversimple example to fix the trade-off in our minds so that it cannot be dislodged. Our old friend, test T+ : We’re testing the mean of a Normal distribution with n iid samples, and (for simplicity) known, fixed σ:
H0: µ ≤ 0 against H1: µ > 0
Let σ = 2, n = 25, so (σ/√n) = .4. To avoid those annoying X-bars, I will use M for the sample mean. I will abbreviate (σ/√n) as σx.
- Test T+ is a rule: reject H0 iff M > m*
- Power of a test T+ is computed in relation to values of µ > 0.
- The power of T+ against alternative µ =µ1 = Pr(T+ rejects H0 ;µ = µ1) = Pr(M > m*; µ = µ1)
We may abbreviate this as: POW(T+, α, µ = µ1)
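For readers who like to check such things numerically, the definition of power just given can be sketched in a few lines. This is only an illustration under the stated assumptions (Normal data, known σ = 2, n = 25); the function name `power` and the cut-off value of 1.0 are my own choices for the sketch, not part of test T+ as defined so far.

```python
from statistics import NormalDist

def power(mu1, cutoff, sigma_x=0.4):
    """Pr(M > cutoff; mu = mu1) when M ~ Normal(mu1, sigma_x)."""
    z = (cutoff - mu1) / sigma_x       # standardized distance from mu1 to the cut-off
    return 1.0 - NormalDist().cdf(z)   # upper-tail area beyond the cut-off

# Hypothetical cut-off of 1.0, chosen only to show the shape of the power function.
for mu1 in (0.6, 1.0, 1.4):
    print(f"mu1 = {mu1}: power = {power(mu1, cutoff=1.0):.2f}")
```

The further the alternative µ1 lies above the cut-off, the larger the rejection probability.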
(1) First test (test 1): T+ with α = .02: To avoid headaches even further, let test 1 use the 2σx cut-off, α = .02 (approx), even though 1.96 is a more familiar cut-off for α = .025.
The cut-off for rejecting: m*.02 = 0 + 2(2)/√25 = 2(.4) = .8. So test T+ rejects H0 at the .02 level iff M > .8.
Cool Fact #1: The power of test T+ to detect an alternative that exceeds the cut-off m* by 1σx is .84.[i]
In test 1, the alternative µ1 that exceeds the cut-off m* by 1σx is µ1 = m* + 1(.4) = .8 + .4 = 1.2.
So test T+ rejects the null with probability .84 under the assumption that µ = 1.2, since Pr(M > .8; µ = 1.2) = Pr(Z > (.8 − 1.2)/.4) = Pr(Z > −1) = .84:
POW(T+,α = .02, µ = 1.2) = .84.
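Here is a quick numerical check of test 1, as a sketch using only Python's standard library. The variable names are mine. It also shows how rough the ".02" for a 2σx cut-off is: the exact tail area is a bit larger.

```python
from statistics import NormalDist

Z = NormalDist()                  # standard Normal
sigma, n = 2.0, 25
sigma_x = sigma / n ** 0.5        # 0.4
cutoff_1 = 0 + 2 * sigma_x        # m* for test 1: 0.8

alpha_1 = 1 - Z.cdf((cutoff_1 - 0) / sigma_x)    # Pr(M > m*; mu = 0)
mu_1 = cutoff_1 + 1 * sigma_x                    # alternative 1 sigma_x above m*: 1.2
pow_1 = 1 - Z.cdf((cutoff_1 - mu_1) / sigma_x)   # Pr(Z > -1)

print(f"alpha = {alpha_1:.4f}")              # about .023, i.e. roughly .02
print(f"POW(mu = {mu_1:.1f}) = {pow_1:.4f}")   # about .84
```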
The red curve below is the alternative µ = 1.2, and the green area is the power of the test under µ = 1.2.
(2) Second test (test 2): Now suppose we are instructed to decrease the type 1 error probability α to .001, but it’s impossible to get more samples. This requires the cut-off to be further away from 0 than when α = .02: the cut-off must be 3σx greater than 0 rather than only 2σx. So now the cut-off is:
m*.001 = 0+ 3(2)/√25 = 3(.4) = 1.2.
We decrease α (the type 1 error probability) from .02 to .001 by moving the hurdle (m*) over to the right by 1 σx unit. Against what value of µ does this test have .84 power? We know from our cool fact:
The power of test T+ to detect an alternative that exceeds the cut-off m* by 1σx is .84. So we can easily fill in the ? in the following:
POW(T+,α = .001, µ = ?) = .84.
To replace the ?, set µ = m* + 1σx
µ = 1.2 + (.4) = 1.6. So, POW(T+,α = .001, µ = 1.6) = .84.
- Decreasing the type 1 error probability (of test 1) by moving the hurdle over to the right by 1 σx unit (making the hurdle for rejection higher) results in the alternative against which we have .84 power also moving over to the right by 1 σx .
- So we see the trade-off very neatly, at least in one direction.
- Of course the alternative against which the test has .84 power is the alternative against which the type 2 error probability is 1 – .84 = .16.
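The same bookkeeping can be done by inverting the power function: for a target power p, the alternative is µ = m* + σx·z_p, where z_p is the p-th quantile of the standard Normal. The sketch below assumes that rearrangement; the helper name `mu_for_power` is hypothetical.

```python
from statistics import NormalDist

Z = NormalDist()
SIGMA_X = 2.0 / 25 ** 0.5          # 0.4

def mu_for_power(cutoff, target):
    """Alternative mu1 at which Pr(M > cutoff; mu = mu1) equals target."""
    return cutoff + SIGMA_X * Z.inv_cdf(target)

for label, cutoff in [("test 1", 2 * SIGMA_X), ("test 2", 3 * SIGMA_X)]:
    alpha = 1 - Z.cdf(cutoff / SIGMA_X)      # Pr(M > m*; mu = 0)
    mu84 = mu_for_power(cutoff, 0.84)        # alternative detected with .84 power
    print(f"{label}: m* = {cutoff:.1f}, alpha = {alpha:.4f}, "
          f".84 power at mu = {mu84:.2f}")
```

Up to rounding (the .84 quantile is a hair under 1), the two alternatives come out at 1.2 and 1.6: moving the hurdle right by one σx moves the .84-power alternative right by the same amount.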
QUESTION: What’s the POW(T+, µ = 1.2) now that we’ve changed the cut-off to m*.001 ?
Pr(M ≥ m*; µ = 1.2) = Pr(Z ≥ (1.2 − 1.2)/.4) = Pr(Z ≥ 0) = .5.
Notice here the cut-off m* happens to equal the value of µ for which we are computing the power. We get a general result in test T+: the power against the alternative µ = m* is always .5.
Cool Fact #2: POW(T+, µ = m*) = .5. This is a very useful benchmark that saves many headaches. Notice:
- Test 1: The power to detect (µ = m*.02 = .8) = .5
- Test 2: The power to detect (µ = m*.001= 1.2) = .5.
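Cool Fact #2 does not depend on where the cut-off sits, which a short check makes plain. This is a sketch under the same setup (σx = .4); the third cut-off, 2.5, is an arbitrary value I added for illustration.

```python
from statistics import NormalDist

Z = NormalDist()
SIGMA_X = 0.4

def power(mu1, cutoff):
    """Pr(M > cutoff; mu = mu1) for test T+."""
    return 1 - Z.cdf((cutoff - mu1) / SIGMA_X)

cutoffs = [0.8, 1.2, 2.5]   # test 1, test 2, and an arbitrary extra cut-off
benchmarks = [power(mu1=c, cutoff=c) for c in cutoffs]

for c, p in zip(cutoffs, benchmarks):
    print(f"m* = {c}: POW(mu = m*) = {p:.2f}")
```

When µ equals the cut-off, the standardized distance is 0, so the rejection probability is Pr(Z > 0), whatever m* is.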
Since test 2 makes it harder to reject the null than test 1, its cut-off m* is bigger, so the value against which it has power .5 is bigger. That means test 2 is less powerful than test 1. That’s because we reduced the probability of a type 1 error.
A power of .5 is rather lousy, but notice the value of µ that test 1 is this lousy at detecting is smaller than the value of µ that test 2 is this lousy at detecting. Again, test 2 has less power than test 1. Compare the power of the two tests for a fixed alternative µ = 1.2:
- Test 1: POW(µ = 1.2) = .84
- Test 2: POW(µ = 1.2) = .5.
By lowering the type 1 error probability (from .02 to .001), we’ve lowered the power to detect µ = 1.2. Raising the threshold for rejecting the null (higher hurdle) => lowering the type 1 error probability => lowering the power of the test against any alternative.[ii]
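To see the "against any alternative" part and not just µ = 1.2, one can tabulate both tests over a few alternatives. A minimal sketch; the grid of µ values is my own choice, and the cut-offs .8 and 1.2 are those of tests 1 and 2.

```python
from statistics import NormalDist

Z = NormalDist()
SIGMA_X = 2.0 / 25 ** 0.5   # 0.4

def power(mu1, cutoff):
    """Pr(M > cutoff; mu = mu1) for test T+."""
    return 1 - Z.cdf((cutoff - mu1) / SIGMA_X)

alternatives = [0.4, 0.8, 1.2, 1.6, 2.0]   # illustrative grid
rows = [(mu1, power(mu1, 0.8), power(mu1, 1.2)) for mu1 in alternatives]

print("mu1    test 1   test 2")
for mu1, p1, p2 in rows:
    print(f"{mu1:<6} {p1:.2f}     {p2:.2f}")
```

In every row the test 2 column is smaller than the test 1 column: the higher hurdle costs power across the board.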
I’m not saying anything in this post about interpreting tests (which I’ve done elsewhere on this blog), nor the desire for high power, nor the problems with some ways power has been misused, or any such thing. I’m just getting at the trade-off in relation to test hurdles.
EXERCISE TWO is next