Get empowered to detect power howlers

If a test’s power to detect µ’ is low, then is a statistically significant result good or lousy evidence of a discrepancy µ’? Which is it?

If your smoke alarm has little capability of triggering unless your house is fully ablaze, then if it has triggered, is that a strong or weak indication of a fire? Compare this insensitive smoke alarm to one that is so sensitive that burning toast sets it off. The answer: the insensitive detector’s having been triggered is a good indication of the presence of (some) fire, while hearing the ultra-sensitive alarm go off is not.[i]

Yet I often hear people say things to the effect that:

if you get a result significant at a low p-value, say ~.03,
but the power of the test to detect alternative µ’ is also low, say .04 (i.e., POW(µ’) = .04), then “the result hasn’t done much to distinguish” the data from that obtained by chance alone.

But wherever that reasoning is coming from, it’s not from statistical hypothesis testing, properly understood. It’s easy to see why.

We can use a variation on the one-sided test T+ from our illustration of power: we’re testing the mean of a Normal distribution with n i.i.d. samples and (for simplicity) known σ:

H0: µ ≤ 0 against H1: µ > 0

Let σ = 1, n = 25, so (σ/√n) = .2.

To avoid those annoying X-bars, I will use M instead. The Excel example has µ ≤ 12, but it’s even easier to have 0, and easy to switch over. Test T+ rejects H0 at the .025 level if M > 1.96(.2). Let’s make it the 2-standard-deviation cut-off:

Test T+ rejects H0 at the ~.025 level if M > 2(.2) = .4. So the cut-off M* = .4.
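For anyone who wants to check the cut-off arithmetic, here is a minimal sketch in Python using scipy. It is my own illustration, not part of the original post; the variable names are just for exposition.

```python
# A small check (illustration only) of the cut-off for T+:
# the exact .025 cut-off is 1.96 * SE, and rounding up to the 2-SE cut-off
# M* = 0.4 gives an actual size of about .023 (hence the "~.025 level").
from scipy.stats import norm

sigma, n = 1.0, 25
se = sigma / n**0.5                 # 0.2
print(norm.ppf(0.975) * se)         # ~0.392, exact .025 cut-off
print(1 - norm.cdf(0.4 / se))       # ~0.0228, size of the test using M* = 0.4
```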

Now we need a µ’ such that POW(µ’) is low.
Power is always defined in terms of the cut-off for rejection, M*.

  • I know the power against alternatives between 0 and the cut-off M* will be less than .5.
  • I’ll get really low power (.16) if µ’ were to exceed 0 by only 1 (σ/√n) unit, which in this case is 1(.2) = .2. (That is, POW(.2) = .16.)
  • I’ll get even lower power if µ’ were to exceed 0 by only .25 (σ/√n) unit, which in this case is .25(.2) = .05.

I’m cutting corners with symbols wherever possible.

So what’s the power of T+ against .05? POW(.05) = ?

P(M > .4; µ = .05) = P(Z > (.4 - .05)(1/.2)) = P(Z > (.35)(5)) = P(Z > 1.75) = .04

So POW(.05) = .04, quite low.
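To check this figure, and the POW(.2) = .16 figure from the bullets above, here is a short sketch. Again it is my own illustration rather than anything from the post, assuming scipy’s normal routines; the helper name `power` is just for exposition.

```python
# Power of test T+ against an alternative mu', with sigma = 1, n = 25, M* = 0.4.
# Illustration only; POW(mu') = P(M > M*; mu = mu') with M ~ N(mu', se^2).
from scipy.stats import norm

se, m_star = 0.2, 0.4

def power(mu_prime):
    return 1 - norm.cdf((m_star - mu_prime) / se)

print(round(power(0.05), 2))   # 0.04
print(round(power(0.2), 2))    # 0.16
```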

[Whether this low chance of triggering when µ = .05 is just what we want is a separate issue.]

My claim is: if the test has triggered (i.e., rejected), say just at the cut-off M* (.4), then there’s a good indication that µ > .05.

You can see this using lower confidence limits (LL) corresponding to test T+.

Find the .96 lower confidence limit (LL) corresponding to test T+, supposing the observed sample mean M = .4. (Never mind that we’d typically estimate σ).

µ > M – (1.75)(1/√25)

µ > M – (1.75)(.2)

µ > M – .35.

Since we’re imagining M reaches the cut-off M*, we have the following one-sided lower .96 confidence limit.

µ > .4 – .35 = .05.

So µ > .05 is certainly warranted.

(This is also given by severity reasoning.)
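The same arithmetic in a short sketch (my own illustration; σ is taken as known, as in the post):

```python
# The .96 one-sided lower confidence limit when the observed mean M sits at the
# cut-off M* = 0.4 (sigma = 1, n = 25). Illustration only.
from scipy.stats import norm

se, M = 0.2, 0.4
z = norm.ppf(0.96)             # ~1.75
print(round(M - z * se, 2))    # 0.05, i.e. mu > .05 at level .96
```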

Here’s another example: What’s the power of T+ against .1? POW(.1) = ?

P(M > .4; µ = .1) = P(Z > (.4 - .1)(1/.2)) = P(Z > (.3)(5)) = P(Z > 1.5) = .07

So POW(.1) = .07.

Correspondingly, µ = .1 is the lower limit of a one-sided confidence interval with confidence level of ______?

Answer .93.

So the statistically significant result is a better indication that µ > .05 than that µ > .1.

You can see the duality between CIs and tests, but I’ll come back to this. The main lesson is:

If a test’s power to detect µ’ is low, then a statistically significant result (i.e., a rejection of the null with a low p-value) is a good indication of a discrepancy µ’ (that is, that µ > µ’).
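Here is a sketch of the duality the lesson rests on (my own illustration, same T+ setup): when M lands exactly at the cut-off M* = .4, the confidence level attached to the inference µ > µ’ is 1 − POW(µ’), so the lower the power against µ’, the better warranted the inference that µ exceeds µ’.

```python
# Duality check (illustration only): with M = M* = 0.4, the one-sided confidence
# level for "mu > mu'" equals 1 - POW(mu').
from scipy.stats import norm

se, m_star = 0.2, 0.4
for mu_prime in (0.05, 0.1, 0.2):
    z = (m_star - mu_prime) / se
    print(mu_prime, round(1 - norm.cdf(z), 2), round(norm.cdf(z), 2))
# 0.05: POW .04, level .96   0.1: POW .07, level .93   0.2: POW .16, level .84
```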

[i] I assume the alarm system shares the obvious properties of good tests for detecting discrepancies; that’s the point of an analogy. In any event, I have delineated those points elsewhere.

34 thoughts on “Get empowered to detect power howlers”

  1. original_guest

    Thanks for posting this. I don’t really buy your smoke detector argument. A closer analogy to what I wrote on low-power m-s tests is:

    You have a kitchen smoke detector that can detect smoke, but is also so crappy it randomly goes off, for reasons totally unconnected to smoke or fire, one day in 20. Also, the *only* potential source of any smoke or fire in your house on any day is someone striking one match inside a metal box in the basement – to trigger the alarm most of this match’s little puff of smoke has to get to the detector, and while that could happen it is highly unlikely. So, you have low power to detect the fire/smoke that the match provides, not much above 0.05.

    One day the alarm goes off. This is a “good indication” of a match being struck? How?

    NB If it helps any, you seem to be entertaining that it’s possible to have much greater power, e.g. that someone could be setting off munitions in the basement. I am not, and wasn’t in my earlier comments.

    • OG: Your example is one that doesn’t even have a proper distance function (as with Kadane’s howler*). The interesting question is: do you concur with the analysis within a proper test as in the test T+ described? If yes, then we agree.

      *e.g., https://errorstatistics.com/2012/09/08/return-to-the-comedy-hour-on-significance-tests/

      • Mayo: Just out of curiosity (i.e., this question isn’t related to my own ongoing investigations of severity), can you give a mathematically precise specification for what does/doesn’t count as a distance? I ask because there are various formalizations of the notion of distance with different properties, and I want to get a bead on the key properties.

      • original_guest

        You wrote here that if a low power test finds flaws (i.e. a significant result, at some fixed alpha) it indicates they’re present (i.e. that the truth really is not the null). If my interpretations in parentheses above are not what you actually meant, I welcome correction.

        My example has valid Type I error rate, and low power. If you want strictly valid p-values for it these can be arranged by trivial use of randomized tests, though I contend that, as used above, it is a “proper” test regardless.

        If you want to build in extra assumptions about what a “proper” test is, go ahead, but the math about distance functions and contiguous alternatives gets non-trivial, I think you’ll likely need different blog software to handle the notation.

        My reaction to the test above is that you’re making statements about whether mu exceeds some level other than that specified in the test we started with – see the part about higher values being “warranted” – and my example does not. So, I don’t see the relevance.

        • OG: Yes, wrt the parenthetical remark. SEV principle is the basis for ruling out randomized tests for inferential purposes. Allegedly good long-run performance is not adequate for apt probative measures for inference. I don’t understand your last paragraph (it’s equivocal).

    • john byrd

      OG: Your example is comparable to assessing results of a validation study, where a large error rate is discovered. Not the same as a p-value, which is always based on a continuous distance measure from a hypothetical parameter.

      • original_guest

        john: it’s just an example of a low power test, I don’t see anything specific to validation studies… though obviously they can have low power.

        Nothing in p-value definitions requires continuity; if you want completely accurate alpha from discrete measures, use a randomized test.

        Parameters are also not required; an example is use of maxT-based permutation testing, where the test statistic doesn’t measure anything directly relevant to a population. Of course, this doesn’t say thinking in terms of parameters isn’t often helpful.

        • john byrd

          OG: Not to run down a rabbit hole (forgive me if it is), but I do not know of any test statistics that are not continuous (F, t, chi-square, etc.). Maybe there are some. I think your analogy does not work for a significance test, but would work for a test of a method where you tally up performance results and can say the method is unreliable if it fails more than some rate (ignoring degree of difference from a standard). In this case the error rate can motivate doubt about a singular result. This is very different from the example in this blog, where a great difference from the null can provide a cogent basis for inferring a difference, even if there is high variance. The low power only means that for a significant result there must be a very large effect size. Why should we not conclude a difference when looking at a large effect size and low p-value? It appears we should.

          • john byrd: Any test on a discrete sample space will generate a discrete p-value distribution. Tests of mean parameters for binomial distribution or Poisson distribution, contingency tables, logistic regression…

            Often the test statistic p-values are approximated using continuous distributions, but make no mistake — if the underlying sample space is discrete, the p-value distribution is discrete.

            • original_guest

              +1. Thanks Corey.

              John: it’d help a lot if you distinguish “effect size” and “estimate of effect size”. Why? Because, at least in regular situations, low power means that for a significant result there must be a large estimate. It doesn’t mean that the true effect size must be large (though it also doesn’t rule this out).

              • OG: I don’t see this at all.

                • original_guest

                  To get a statistically significant result when there is low power, something unusual must have happened. For the i.i.d. normal mean problem with an assumed-correct model and known variance (so the estimate is sufficient) the unusual event that must have happened is getting a larger estimate than would be expected by chance.

                  This does not mean the true effect must be as large as or larger than the estimate.

                  Recall this is all in reaction to John’s absolute statement that there “must be a very large effect size”, which is not justified if we’re being careful about distinguishing estimates from parameters – see Stephen’s statement on careful language, with which I agree.
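A quick simulation, offered here purely as an illustration of this estimate-versus-true-effect distinction (it uses the same test T+ as in the post, and is not part of the thread): with a true µ of .05, rejections occur about 4% of the time, and when they do the estimate M is at least .4, far above the true effect size.

```python
# Simulation (illustration only) of original_guest's point: true mu = 0.05,
# test T+ with se = 0.2 and cut-off 0.4. Rejections are rare (~4%), and the
# average *estimate* among rejections is large (~0.48) even though the true
# effect is only 0.05.
import numpy as np

rng = np.random.default_rng(0)
true_mu, se, m_star, reps = 0.05, 0.2, 0.4, 100_000

means = rng.normal(true_mu, se, size=reps)   # sampling distribution of M
reject = means > m_star

print(round(reject.mean(), 3))               # ~0.04, power against mu = 0.05
print(round(means[reject].mean(), 2))        # ~0.48, mean estimate given rejection
```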

                  • john byrd

                    OG: a stat Sig result under high or low power can be interpreted as “an unusual event” under the null. Power has no bearing on that.

                    • original_guest

                      john: I don’t disagree, but it’s not clear (to me) what you’re responding to – the layout of this blog’s comments is not helpful.

                      I share mayo’s concern that, by now, it’s “hopeless to get clear” on whatever her initial point was.

                      Given that it seemed a startling claim, I hope she’ll revisit it – somewhere else – with a clearer example than her smoke detectors, and a more precise formalization of what is being tested, or estimated, or inferred.

                      Thanks for the discussion, I hope you found some of it helpful.

                  • OG: Sorry I was away from the blog for a while today. Your remark is perplexing, at least without qualification.

                    “To get a statistically significant result when there is low power, something unusual must have happened.”

                    Power is always in relation to some alternative, and since we seem (finally) to be focussing on this one example, we can make it clear. For any stat sig result, there is always going to be an alternative against which T+ has low power. After all, POW(null) = .03 or some other low number, and POW(.05) = .04, etc.
                    So it’s unusual under the null, and unusual under alternatives sufficiently “close”, e.g., between the null and, say, 1 standard deviation unit (sigma/root n) up from the null. Is that what you mean?

  2. As original_guest points out, the fire alarm analogy is far from perfect. Since we are talking about hypothesis tests, we have to have that both alarms (if they are ‘correct’ alarms in the hypothesis test sense) have the same probability of going off when there is nothing at all, neither toast smoking nor house blazing, nor anything in between.

    However, if one looks at it in terms of likelihood ratio, it is certainly the case that for a given P-value, eventually, as the power increases the likelihood ratio favours the null rather than the alternative. See Senn, S. J. (2003). P-Values. Encyclopedia of Biopharmaceutical Statistics. S. C. Chow, Marcel Dekker: 685-695.
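One way to see Senn’s point numerically (my sketch, not his): hold the observed result fixed at a just-significant z = 1.96 and let the standardized alternative δ = µ’√n/σ grow, so that the power against µ’ grows; the likelihood ratio of µ’ to the null at that fixed z eventually falls below 1.

```python
# Illustration (not Senn's code) of the fixed-P-value / rising-power behaviour:
# at z = 1.96, the likelihood ratio L(mu')/L(0) drops below 1 once the
# standardized alternative delta exceeds 2 * 1.96, i.e. once power exceeds ~.975.
from scipy.stats import norm

z_obs = 1.96
for delta in (1.0, 1.96, 3.0, 3.92, 4.5):
    power = 1 - norm.cdf(z_obs - delta)
    lr = norm.pdf(z_obs - delta) / norm.pdf(z_obs)
    print(delta, round(power, 3), round(lr, 2))
```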

    • Stephen: Your point about fixed p-value and increasing power is a distinct issue dealt with elsewhere. The severity assessment takes into account the sensitivity of the test. But here we have low power. And notice, of course, that it’s because POW(null) is very low that a rejection is indicative of a discrepancy from the null. In other words, there’s no new logic in my claim, only the logic of significance testing.

  3. There’s something going on on the blog today resulting in hundreds and hundreds of hits, but it’s not for this post but rather:
    “When Bayesian Inference Shatters” Owhadi, Scovel, and Sullivan (guest post)

    It must be linked someplace. Anyone know? Just curious.

  4. Michael Lew

    It seems to me that any attempt to understand or explain power without mentioning the observed effect size and observed variance is silly and bound for failure.

    If the power of a test to detect a particular effect size is small but the result is significant (or, more usefully, the observed P-value is small) then either the observed effect size is larger than the particular effect size for which power was specified or the observed variance is less than the variance assumed in the power calculation, or both.

    • Michael: The example gives the variance. The observed effect size is also given. Take another look.

      • Michael Lew

        I didn’t mean to say that the example failed to include effect size or variance, but to imply that you present an attempt to understand and explain power without reference to them. I don’t see anything in your explanation that would allow a reader to see the straightforward message of the second sentence of my first comment.

        The use of the unrealistically simple setting where sigma is known precludes any sensible discussion of the role of the observed variance in power.

  5. Michael Last

    Taken to a silly extreme, I think this argument breaks down. I will conduct a test. To do this, I will roll a 20-sided die. On a 1, I’ll declare significance. This has low power (.05), and is a size .05 test.

    Or, for another silly example, see http://xkcd.com/1132/ – this is a test of size 1/36, but a power of 1!

    • Michael: If you reject a null (here I guess a fair die) on grounds of an improbable result, then you will very often if not always reject erroneously. The rest confuses the argument as well as committing Kadane’s fallacy.

  6. deborah, I don’t think that Michael’s null has to be that the die is fair. It could be anything at all. The die coming up 1 is a test that has the same power as size whatever the null hypothesis. (Admittedly, a test that nobody would accept but what is interesting is to know why nobody would accept it.)

    Also one has to be careful with ‘you will very often if not always reject erroneously’. For tests with very small size you will reject very rarely and reject erroneously even more rarely. I assume you mean ‘given that you have rejected, the probability that you will have made an error is high’. (Although one might argue that this leads us down Bayesian paths.)

    • Stephen: No, I mean that if you reject a null because of a result improbable in some respect, one will be practically guaranteed to reject erroneously.
      The irony in the examples people are bringing up, which bring up different issues, is that they are examples that show what’s wrong with looking just at likelihoods—whereas an error statistician doesn’t do this. These are problems for likelihood ratio accounts, Bayes boost accounts, Hacking’s old “support” account.
      My original point in writing the post (when I really should have been posting yours) is to highlight an erroneous understanding of significance testing logic. The mistake involves construing error probabilities in a manner that reflects a striving to use them in a likelihood logic, whereas the testing logic is very different. On a shaky bus!

      The fact that the size is close to the power is precisely what’s illustrated in the example I began with. But it’s hopeless to get clear on that point if people are at the same time bringing up ill-defined tests, everything and the kitchen sink at the same time. Too shaky and someone’s jacket is in my way–can’t see what I’m writing.

  7. Well, I still maintain you have to be careful with your language. If you roll the icosahedral die you will reject at a rate of 1/20 and you will reject erroneously at a rate of less than 1/20 (since some of your null hypotheses will be false). That is to say that fewer than 1/20 of tests will lead to erroneous rejections. This was a point that Fisher made (I could find the quote). I assume, however, that you were talking about a probability having rejected (conditioning on rejection) and not conditioning on testing.

    • Stephen: No, this is Kadane’s fallacy, as I see it. The error rates must be changing because of changes in the underlying hypotheses, and of course, erroneous rejections and erroneous failures to reject matter. Again, I thought we were getting at a question that arises given one has proper significance tests.

  8. OG and everyone else: I gave an example–an explicit one–of test T+. Never mind the analogy. (I have to remind myself that analogies tend not to work well with statisticians; on the other hand, Meng seemed to like this one at the Boston colloquium, so it was in my mind. I can do the same thing with fishnet sizes.) Skip the analogy, resume the given statistical example. Is there agreement there or not?

  9. Cesar Rabak

    I found the idea of the analogy [of fire alarm] compelling, but I think the example fails in the details:

    Your fire alarm has ” little capability of triggering” when the µ’ is large, not when it is small!!

    Your text gets confusing when you proceed with the idea that the cutoff is 0.4 (then it’s the value of the sample mean you got, right?) and then the “alarm sounds” with this threshold.

    Once the alarm sounded (with those 25 samples which gave the >= 0.4 sample mean), what’s the point in going to lower values of a “possible mean” through the exploration of a µ’ < M*?

    Is it not the case that when you go for values lower than M* you've already increased the chance of incurring an error of type I, or in the fire alarm analogy, yelling "fire" where no one is in sight?

    []s

    • I can’t understand your question. M* is the cut-off for the observed sample mean. The null value in this one-sided test is always lower than M*, so one is certainly interested in such values.

      • Cesar Rabak

        M* is a cut-off value determined to decide whether forthcoming sample means agree with a certain null hypothesis (keeping the fire alarm analogy: the house is not on fire).

        There is no such thing as the null value, but a Null Hypothesis. The outcomes of several experiments (in which the Null Hypothesis is true in all) belong to a set of values whose distribution is given by your N(0, 1), which for the sample size chosen gives a standard error of 0.2. As the distribution of the values is assumed Gaussian, there is also an expectation that extreme values are less likely than values around zero (the postulated mean of the distribution under the null hypothesis).

        So: if a random sample of 25 values is drawn and the sample mean is < M*, are you asserting that you'll consider the null hypothesis true or not?

        Power only enters into consideration if you're willing to consider that the null hypothesis is not true…

        Then why would you consider an "µ’" at all?

        []s

        • Sorry, but very little of what you wrote makes sense, except the point about using “null hypothesis” rather than the abbreviation “null value”. If I had time I’d go through it line by line but I’ve a massive deadline just now.
