Spot the fallacy!

- The power of a test is the probability of correctly rejecting the null hypothesis. Write it as 1 – β.
- So, the probability of incorrectly rejecting the null hypothesis is β.
- But the probability of incorrectly rejecting the null is α (the type 1 error probability).

So α = β.

I’ve actually seen this, and variants on it [1].

[1] Although they didn’t go so far as to reach the final, shocking, deduction.

Mayo,

If I were to guess where this came from, I would say Ziliak. Beta is the probability of failing to reject the null when the null is false (type II error), not the probability of incorrectly rejecting the null (type I), or else I am my own grandpa….

The mischief that Ziliak has done in the legal system through the Matrixx Initiatives case is substantial. Some parties and their expert witnesses (including Sander Greenland in a recent report) now misrepresent the statements from the case as though they were holdings of the Supreme Court. As I have explained before, the statements were necessarily dicta (and thus non-binding) because once the Court held that causation was not necessary for materiality of the undisclosed information, anything that might, or might not, be necessary for causation was no longer relevant to the Court’s consideration. And yet the Court plowed on, and stepped in the mule poop. The Court cited three cases for its dictum, two of which were so-called differential etiology cases (ruling in by ruling out specific causes by process of elimination), which had nothing to do with statistical significance. The third case was Wells v. Ortho Pharmaceuticals, in which plaintiffs’ expert witnesses did have some statistically significant studies, but the problem was that the studies were plagued by bias and confounding. Wells is one of the most discredited legal decisions in the federal system, even though it was a case tried to the judge (not a jury).

Nathan

Nate: Thanks for your comment. It’s really too bad the Supreme Court has “stepped into the mule poop”. You should send me updates on this issue for posting.

But I think we must distinguish between what’s all wrong about the logic in that case, as you showed in your guest post:

Interstitial Doubts about the Matrixx

https://errorstatistics.com/2012/02/08/guest-blog-interstitial-doubts-about-the-matrixx-by-n-schachtman/

and as I argued in:

Distortions in the Court (Mayo)

https://errorstatistics.com/2012/02/08/distortions-in-the-court-philstock-feb-8/

as opposed to common abuses of power and related error probabilities. (As you also note, Ziliak and McCloskey go further into mule territory in interpreting error probabilities as posterior probabilities in their brief to the Court.)

So back to my specific power howler: Of course you are correct that β (against H’) is “the probability of failing to reject the null when the null is false (type II error)” and alternative H’ is true. (Note the addition I made to your definition.) But if you start with the ambiguous premise #1, you can see how one could land in erroneous line #2.

The power of a test is the probability of rejecting the null hypothesis when the alternative H’ is true (in the sense of adequately describing the data generation).

It is NOT the probability of correctly rejecting the null—in and of itself–unless this is qualified to give the correct definition.

Premise #1 should be: POW(H’) = Pr(test T rejects Ho; H’). Then it’s clear that the correct complement is

1 – POW(H’) = Pr(test T does not reject Ho; H’)–as you observe.
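To make the corrected premise concrete, here is a minimal numerical sketch. It assumes a one-sided z-test of H0: μ ≤ 0 with known σ; the sample size, the cutoff level 0.05 and the alternative H’: μ = 0.5 are all hypothetical choices for illustration, not taken from the post. The same function gives Pr(test T rejects H0; μ) under either hypothesis, which makes it plain that α and β are computed under different assumptions and have no reason to coincide.

```python
from statistics import NormalDist

def rejection_probability(mu, mu0=0.0, sigma=1.0, n=25, alpha=0.05):
    """Pr(test T rejects H0; mu) for a one-sided z-test of H0: mu <= mu0.

    All default values are hypothetical, chosen only for illustration.
    """
    std_err = sigma / n ** 0.5
    # The test rejects when the sample mean exceeds this cutoff.
    cutoff = mu0 + NormalDist().inv_cdf(1 - alpha) * std_err
    # Sampling distribution of the mean when the true mean is mu.
    return 1 - NormalDist(mu, std_err).cdf(cutoff)

alpha = rejection_probability(0.0)  # Pr(reject H0; H0): type I error probability
power = rejection_probability(0.5)  # POW(H') = Pr(reject H0; H'), with H': mu = 0.5
beta = 1 - power                    # Pr(do not reject H0; H'): type II error probability

print(f"alpha = {alpha:.3f}, POW(H') = {power:.3f}, beta = {beta:.3f}")
```

The complement of POW(H’) is taken under the same H’, so it is the probability of not rejecting; nothing in the calculation ties it to α.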

I have numerous posts on power (look up Neyman for some, Deirdre for others) on this blog.

Someone just twittered me this disaster of misinterpretations of P-values: http://effectivehealthcare.ahrq.gov/index.cfm/glossary-of-terms/?pageaction=showterm&termid=67

Yikes! And from the Agency for Healthcare Research and Quality.

What makes this worse is that in much of their work the comparisons will be between observed groups rather than randomized groups, and there type I and type II errors are either not really defined or have properties that are almost impossible to discern. See this comment: http://andrewgelman.com/2015/07/15/prior-information-not-prior-belief/#comment-226905

Nathan: If not already aware, you might be interested in the estimation of Null distributions present in Madigan’s slides.

Keith O’Rourke

Keith: But I assume they can still compute a P-value. After I read this I began jotting how I’d write it.

Mayo: A P-value can be computed for non-randomized comparisons, but no one can tabulate the distribution of those P-values for any assumed true effect size without assuming the unknown bias of the comparison. As Persi Diaconis once put it, the model is mis-specified, so there is no way to work out the distribution.

What Madigan is doing is empirically estimating these distributions using multiple substances and studies where there is a strong belief that the substances don’t have an effect, and where the studies resemble those done (or to be done) on substances that might have an effect.

In philosophy of science, it is fine to restrict consideration just to randomized studies (but then not generalized to non-random studies.)

Keith

Keith: But even if there’s a random assignment of treatments, can we really suppose an indication of a genuine “treatment” effect? Say the subjects are college men and the treatment is “write down and think about a time your partner succeeded at something” versus the control “think about an ordinary day”, and the observed effect is the subject’s score on a word association test purporting to measure implicit self esteem. If a statistically significant difference is observed (those who thought of their partner’s success had lower mean self-esteem scores than the controls, say), is there an indication there is a genuine effect due to the treatment?

The trouble of course is that we immediately doubt there were no biasing selection effects, but let’s imagine there were none. It’s not implausible that men, on average, might feel slighted at being surpassed by a female partner, but I would question that this study is picking up on that. We don’t even know what they thought about. I also question the self esteem test, especially if they didn’t go back and see how the same subjects score without the treatment.

So I wonder if a causal connection or even a “genuine statistical effect due to the treatment” is automatically warranted in an RCT.

Mayo:

Of course not. All that randomization warrants is that it should be possible to calculate p-values so that, when there are no differences, the distribution of those p-values will be as close as possible to Uniform(0,1).

This in turn supports the statistical significance vocabulary, e.g., “The statistical significance of an observed d_obs is the [assessed] probability of observing results as large as d_obs, even if the null hypothesis is true: Pr(d ≥ d_obs; H0).”

Keith
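The uniformity claim in the comment above can be checked with a small simulation. This is only a sketch under stated assumptions: 40 units with fixed outcomes that the treatment does not change (so there are no differences), repeated random assignment of 20 units to treatment, and a normal approximation to the randomization distribution of the difference in means. The outcome values, group sizes and number of repetitions are all hypothetical.

```python
import random
from statistics import NormalDist, mean, variance

rng = random.Random(0)

# Fixed outcomes for 40 units; the treatment changes nothing, so the null holds.
outcomes = [rng.expovariate(1.0) for _ in range(40)]
n_treated = 20
n_control = len(outcomes) - n_treated

# Standard deviation of the difference in group means induced by random
# assignment alone (finite-population formula: S^2 * N / (n1 * n2)).
diff_sd = (variance(outcomes) * len(outcomes) / (n_treated * n_control)) ** 0.5

def randomized_p_value():
    """One-sided p-value, Pr(d >= d_obs; H0), for one random assignment."""
    shuffled = outcomes[:]
    rng.shuffle(shuffled)
    d_obs = mean(shuffled[:n_treated]) - mean(shuffled[n_treated:])
    return 1 - NormalDist().cdf(d_obs / diff_sd)

p_values = [randomized_p_value() for _ in range(5000)]

def fraction_at_or_below(threshold):
    """Empirical Pr(p <= threshold); close to threshold if p is roughly uniform."""
    return sum(p <= threshold for p in p_values) / len(p_values)

for threshold in (0.05, 0.25, 0.50):
    print(threshold, round(fraction_at_or_below(threshold), 3))
```

If the assignment were not random (say, units with larger outcomes were more likely to be treated), the same p-value formula would no longer give fractions near the thresholds, which is the point about observed groups made earlier in the thread.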

Wow! This could be the basis for a good short essay question for a final exam: Identify all the errors on this agency’s website. And the misinterpretations are not limited to “significance” and p-values. Note that “effects” are observed rather than inferred.

People may say the above fallacy is obvious and the trouble is in line #2, but what about premise #1? While this claim is ambiguous as written, and could be interpreted correctly, it generally is not. So it should never be written this way. It is very often supposed that the higher the power, the higher the probability that the test correctly rejects the null (without further qualification).

So, if a test rejects a null, and it has high power, then the probability is high that that rejection is correct. This is a disaster.

That the rejection is “correct” can mean either that the null is false or, more specifically, that the alternative against which the test has high power, H’ say, is true. Along these same lines it is supposed that a rejection of the null by means of a test with high power (against H’) is better evidence against the null than a rejection by a test with low power (against H’). Worse, it is frequently supposed that a test with high power to detect H’ warrants inferring H’ when the test rejects the null. This is the error Ziliak and McCloskey make*, but it is pervasive.

* See https://errorstatistics.com/2014/12/29/to-raise-the-power-of-a-test-is-to-lower-the-hurdle-for-rejecting-the-null-ziliac-and-mccloskey-3-years-on/

Statisticians think about p-values as the probability of rejecting or not rejecting the null hypothesis given some condition, such as the null being true or not true. The probability always refers to the probability of rejection or non-rejection; it never refers to the probability of the null being true or not. So, statement 1 (correctly rejecting) refers to the probability of rejecting the null given that it is not true, or P(+R|-H) = 1-beta. Statement 2 (incorrectly rejecting) refers to the probability of rejecting the null given that it is true, or P(+R|+H) = alpha. The probabilities in statements 1 and 2 do not add to 1. In this context, two probabilities will only add to 1 if the actions are different, one being rejection and one being non-rejection, and the conditions are identical, i.e., null being true in both cases or null being untrue in both cases.

Putting “correctly” or “incorrectly” in front of “reject” or “not reject” only leads to confusion. It is much better to say reject or not reject given some condition. Then it is clear that we are talking about a conditional probability, and it is clear what the action is and what the condition is. It is even better to use mathematical notation.
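In the commenter’s notation, the four probabilities can be laid out as below. This is just a restatement of the comment above, with -R (non-rejection) added as an assumed symbol by analogy with +R.

```latex
\begin{align*}
P(+R \mid -H) &= 1-\beta, & P(-R \mid -H) &= \beta, \\
P(+R \mid +H) &= \alpha,  & P(-R \mid +H) &= 1-\alpha.
\end{align*}
```

Each row fixes one condition and pairs the two actions, so each row sums to 1. The two entries in the left column, 1-β and α, share an action but not a condition, so they need not sum to 1; that is where line #2 of the fallacy goes wrong.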

Sorry to be so black and white, but statements 1 to 3 are ridiculous, as any first-year undergrad in statistics would know.

Peter: I understand how to interpret these notions. As you can see in my replies, I’ve emphasized that power always has to be wrt an alternative. One should not even leave it as P(+R;-H), one should write it as I indicated. The concern with line #1 isn’t even that someone will take it as a posterior. It’s that someone will take it as it reads: the probability of correctly rejecting the null. This is at the heart of many a power howler (you can search this blog). As far as why I would not put any of these as conditional probabilities, but rather probabilities computed under the assumption of a hypothesis, you can check the blog (or I’ll look it up later, I’m on the road).