In their “Comment: A Simple Alternative to p-values,” (on the ASA P-value document), Benjamin and Berger (2016) recommend researchers report a pre-data Rejection Ratio:
It is the probability of rejection when the alternative hypothesis is true, divided by the probability of rejection when the null hypothesis is true, i.e., the ratio of the power of the experiment to the Type I error of the experiment. The rejection ratio has a straightforward interpretation as quantifying the strength of evidence about the alternative hypothesis relative to the null hypothesis conveyed by the experimental result being statistically significant. (Benjamin and Berger 2016, p. 1)
The recommendation is much more fully fleshed out in a 2016 paper by Bayarri, Benjamin, Berger, and Sellke (BBBS 2016): Rejection Odds and Rejection Ratios: A Proposal for Statistical Practice in Testing Hypotheses. Their recommendation is:
…that researchers should report the ‘pre-experimental rejection ratio’ when presenting their experimental design and researchers should report the ‘post-experimental rejection ratio’ (or Bayes factor) when presenting their experimental results. (BBBS 2016, p. 3)….
The (pre-experimental) ‘rejection ratio’ Rpre , the ratio of statistical power to significance threshold (i.e., the ratio of the probability of rejecting under H1 and H0 respectively), is shown to capture the strength of evidence in the experiment for H1 over H0. (ibid., p. 2)
But in fact it does no such thing! [See my post from the FUSION conference here.] J. Berger, and his co-authors, will tell you the rejection ratio (and a variety of other measures created over the years) are entirely frequentist because they are created out of frequentist error statistical measures. But a creation built on frequentist measures doesn’t mean the resulting animal captures frequentist error statistical reasoning. It might be a kind of Frequentstein monster! 
This brings me to where I left off in my last post: How could people think it plausible to compute comparative strength of evidence this way? The rejection ratio is one of the “new monsters”, but it also appears, without this name, in popular diagnostic screening models of tests. See, for example, this post (“Beware of questionable front page articles telling you to beware…”)
The Law of Comparative Support
It comes from a comparativist support position which has intrinsic plausibility, although I do not hold to it. It is akin to what some likelihoodists call “the law of support”: if H1 makes the observed results probable, while H0 makes them improbable, then the results are strong (or at least better) evidence for H1 compared to H0. It appears to be saying (sensibly) that you have better evidence for a hypothesis that best “explains” the data, only this is not a good measure of explanation. It is not generally required that H0 and H1 be exhaustive. Even if you hold a comparative support position, the “ratio of statistical power to significance threshold” isn’t a plausible measure for this. Now BBBS also object to the Rejection Ratio, but largely only because it’s not sensitive to the actual outcome; so they recommend the Bayes Factor post data. My criticism is much, much deeper. To get around the data-dependent part, let’s assume throughout that we’re dealing with a result just statistically significant at the α level.
I had a post last year called “What’s Wrong with Taking (1 – β)/α as a Likelihood Ratio Comparing H0 and H1?” While it garnered over 80 interesting comments (and a continuation), only one or two concerned the point I really had in mind. So in what follows I’ll take some excerpts from it, interspersed with new remarks.
Take a one-sided Normal test T+ with n iid samples:
H0: µ ≤ 0 against H1: µ > 0
σ = 10, n = 100, σ/√n = σx = 1, α = .025.
So the test would reject H0 iff Z > c.025 = 1.96. (1.96 is the “cut-off”.)
People often talk of a test “having a power” but the test actually specifies a power function that varies with different point values in the alternative H1. The power of test T+ in relation to point alternative µ’ is
Pr(Z > 1.96; µ = µ’).
We can abbreviate this as POW(T+,µ’).
Jacob Cohen’s slips
By the way, Jacob Cohen, a founder of power analysis, makes a few slips in introducing power, even though he correctly computes power throughout the book (so far as I know). Someone recently reminded me of this, and given the confusion about power, maybe it’s had more of an ill effect than I assumed.
In the first sentence on p. 1 of Statistical Power Analysis for the Behavioral Sciences, Cohen says “The power of a statistical test is the probability it will yield statistically significant results.” Also faulty, and for two reasons, is what he says on p. 4: “The power of a statistical test of a null hypothesis is the probability that it will lead to the rejection of the null hypothesis, i.e., the probability that it will result in the conclusion that the phenomenon exists.”
In case you don’t see the two mistakes, I will write them in my first comment.
Examples of alternatives against which T+ has high power:
- If we add σx (i.e., σ/√n) to the cut-off (1.96) we are at an alternative value for µ that test T+ has .84 power to detect. In this example, σx = 1, so this value is µ = 2.96.
- If we add 3σx to the cut-off we are at an alternative value for µ that test T+ has ~ .999 power to detect. We can write this value as µ.999 = 4.96.
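The two benchmarks above can be checked with a minimal Python sketch. It assumes the setup of test T+ as given (σx = 1, cut-off 1.96); the function name `power` is just an illustrative choice, and only the standard library's `statistics.NormalDist` is used.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard Normal, mean 0 and sd 1
SIGMA_X = 1.0              # sigma / sqrt(n) = 10 / sqrt(100)
CUTOFF = 1.96              # reject H0 iff Z > 1.96 (alpha = .025)


def power(mu_alt, cutoff=CUTOFF, sigma_x=SIGMA_X):
    """POW(T+, mu') = Pr(Z > cutoff; mu = mu')."""
    # Under mu = mu', Z = (sample mean)/sigma_x is Normal with mean mu'/sigma_x and sd 1.
    return 1.0 - STD_NORMAL.cdf(cutoff - mu_alt / sigma_x)


for mu_alt in (0.0, 1.96, 2.96, 4.96):
    print(f"POW(T+, {mu_alt}) = {power(mu_alt):.3f}")
```

At µ’ = 0 the function returns the Type I error probability (about .025), and at µ’ equal to the cut-off it returns .5, which is a handy sanity check on the power function.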
Let the observed outcome just reach the cut-off to reject the null, z0 = 1.96.
If we were to form a “rejection ratio” or a “likelihood ratio” of μ = 4.96 compared to μ0 = 0 using (1 – β)/α, i.e., POW(T+, 4.96)/α, it would be 40 (.999/.025).
It is absurd to say the alternative 4.96 is supported 40 times as much as the null, even understanding support as comparative likelihood or something akin. (The data 1.96 are even closer to 0 than to 4.96. The same point can be made with less extreme cases.) What is commonly done next is to assign priors of .5 to the two hypotheses, yielding
Pr(H0|z0) = 1/(1 + 40) = .024, so Pr(H1|z0) = .976.
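For readers who want to retrace the arithmetic being criticized, here is a small Python sketch of the computation. It takes the rounded power .999 and α = .025 from the example; the variable names are illustrative, and the .5 priors are the assumption described above, not something endorsed here.

```python
alpha = 0.025          # Type I error probability of test T+
power_at_alt = 0.999   # POW(T+, 4.96), rounded as in the example

# The "rejection ratio": power divided by the significance threshold.
rejection_ratio = power_at_alt / alpha

# Treat that ratio as if it were a likelihood ratio, with priors of .5 each.
prior_h0 = prior_h1 = 0.5
posterior_h0 = (alpha * prior_h0) / (alpha * prior_h0 + power_at_alt * prior_h1)
posterior_h1 = 1.0 - posterior_h0

print(f"rejection ratio = {rejection_ratio:.2f}")
print(f"Pr(H0|z0) = {posterior_h0:.3f}, Pr(H1|z0) = {posterior_h1:.3f}")
```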
Such an inference is highly unwarranted and would almost always be wrong. Back to our question:
How could people think it plausible to compute comparative evidence this way?
I presume it comes from the comparativist support position noted above. I’m guessing they’re reasoning as follows:
The probability is very high that z > 1.96 under the assumption that μ = 4.96.
The probability is low that z > 1.96 under the assumption that μ = μ0 = 0.
We’ve observed z0 = 1.96 (so you’ve observed z > 1.96).
Therefore, μ = 4.96 makes the observation more probable than does μ = 0.
Therefore the outcome is (comparatively) better evidence for μ= 4.96 than for μ = 0.
But the “outcome” for a likelihood is to be the specific outcome, and the comparative appraisal of which hypothesis accords better with the data only makes sense when one keeps to this.
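To see what keeping to the specific outcome does, one can compare the Normal densities at z0 = 1.96 under the two point hypotheses. The sketch below is only an illustration under the stated setup (σx = 1, so the sample mean has unit standard deviation); the variable names are hypothetical.

```python
from statistics import NormalDist

z0 = 1.96  # observed outcome, just at the cut-off

# Likelihood of the specific outcome z0 under each point hypothesis.
lik_alt = NormalDist(mu=4.96, sigma=1.0).pdf(z0)
lik_null = NormalDist(mu=0.0, sigma=1.0).pdf(z0)

likelihood_ratio = lik_alt / lik_null
print(f"Lik(4.96; z0) / Lik(0; z0) = {likelihood_ratio:.4f}")
print(f"Lik(0; z0) / Lik(4.96; z0) = {1.0 / likelihood_ratio:.1f}")
```

Under these assumptions the ratio comes out well below 1, so a comparative likelihood appraisal based on the observed 1.96 favors μ = 0 over μ = 4.96, the opposite of what the tail-area ratio of 40 suggests.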
I can pick any far away alternative I like for purposes of getting high power, and we wouldn’t want to say that just reaching the cut-off (1.96) is good evidence for it! Power works in the reverse. That is,
If POW(T+,µ’) is high, then z0 = 1.96 is poor evidence that μ > μ’.
That’s because were μ as great as μ’, with high probability we would have observed a larger z value (smaller p-value) than we did. Power may, if one wishes, be seen as a kind of distance measure, but (just like α) it is inverted.
(Note that our inferences take the form μ > μ’, μ < μ’, etc. rather than to a point value.)
If Pr(Z > z0; μ = μ’) is high, then Z = z0 is strong evidence that μ < μ’!
Rather than being evidence for μ’, the statistically significant result is evidence against μ being as high as μ’.
My favorite post by Stephen Senn
In my very favorite post by Stephen Senn here, Senn strengthens a point from his 2008 book (p. 201), namely, that the following is “nonsense”:
[U]pon rejecting the null hypothesis, not only may we conclude that the treatment is effective but also that it has a clinically relevant effect. (Senn 2008, p. 201)
Now the test is designed to have high power to detect a clinically relevant effect (usually .8 or .9). I happen to have chosen an extremely high power (.999) but the claim holds for any alternative that the test has high power to detect. The clinically relevant discrepancy, as he describes it, is one “we should not like to miss”, but obtaining a statistically significant result is not evidence we’ve found a discrepancy that big.
Supposing that it is, is essentially to treat the test as if it were:
H0: μ < 0 vs H1: μ > 4.96
This, he says, is “ludicrous” as it:
would imply that we knew, before conducting the trial, that the treatment effect is either zero or at least equal to the clinically relevant difference. But where we are unsure whether a drug works or not, it would be ludicrous to maintain that it cannot have an effect which, while greater than nothing, is less than the clinically relevant difference. (Senn, 2008, p. 201)
The same holds with H0: μ = 0 as null.
If anything, it is the lower confidence limit that we would look at to see what discrepancies from 0 are warranted. The one-sided lower .975 limit (which is also the lower limit of the two-sided .95 interval) would be 0, and the one-sided lower .95 limit would be about .3. So we would be warranted in inferring from z:
μ > 0 or μ > .3.
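These two lower limits can be reproduced with a short Python sketch. It assumes the observed sample mean sits exactly at the cut-off (1.96, with σx = 1) and computes one-sided lower bounds at confidence levels .975 and .95; `lower_bound` is an illustrative helper name.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()
SIGMA_X = 1.0
observed_mean = 1.96  # z0 * sigma_x, the outcome that just reaches the cut-off


def lower_bound(confidence, mean=observed_mean, sigma_x=SIGMA_X):
    """One-sided lower confidence bound for mu at the given confidence level."""
    return mean - STD_NORMAL.inv_cdf(confidence) * sigma_x


for confidence in (0.975, 0.95):
    print(f"lower {confidence} bound: mu > {lower_bound(confidence):.3f}")
```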
What does the severe tester say?
In sync with the confidence interval, she would say SEV(μ > 0) = .975 (if one-sided), and would also note some other benchmarks, e.g., SEV(μ > .96) = .84.
Equally important for her is a report of what is poorly warranted. In particular the claim that the data indicate
μ > 4.96
would be wrong over 99% of the time!
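The severity benchmarks just mentioned can be sketched numerically. The code below assumes that, for test T+ with observed z0, SEV(μ > μ’) is computed as Pr(Z ≤ z0; μ = μ’), with σx = 1; the function name `severity` is an illustrative choice.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()
SIGMA_X = 1.0
Z0 = 1.96  # observed outcome, just at the cut-off


def severity(mu_prime, z0=Z0, sigma_x=SIGMA_X):
    """SEV(mu > mu') = Pr(Z <= z0; mu = mu') for the one-sided test T+."""
    return STD_NORMAL.cdf(z0 - mu_prime / sigma_x)


for mu_prime in (0.0, 0.96, 4.96):
    print(f"SEV(mu > {mu_prime}) = {severity(mu_prime):.4f}")
```

The value for μ’ = 4.96 is roughly .001, which is the sense in which inferring μ > 4.96 from z0 = 1.96 would be wrong over 99% of the time.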
Of course, I would want to use the actual result, rather than the cut-off for rejection (as with power) but the reasoning is the same, and here I deliberately let the outcome just hit the cut-off for rejection.
The (Type 1, 2 error probability) trade-off vanishes
Notice what happens if we consider the “real Type 1 error” as Pr(H0|z0).
Since Pr(H0|z0) decreases with increasing power, it decreases with decreasing Type 2 error. So we know that to identify “Type 1 error” and Pr(H0|z0) is to use language in a completely different way than the one in which power is defined. For there we must have a trade-off between Type 1 and 2 error probabilities.
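A brief Python sketch makes the vanishing trade-off visible. It assumes the same test T+ (α = .025, cut-off 1.96, σx = 1) and .5 priors, and computes Pr(H0|z0) from the power/α ratio for a few alternatives; the helper names are illustrative.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()
ALPHA = 0.025
CUTOFF = 1.96


def power(mu_alt):
    """POW(T+, mu') with sigma_x = 1."""
    return 1.0 - STD_NORMAL.cdf(CUTOFF - mu_alt)


def posterior_h0(mu_alt, prior_h0=0.5):
    """Pr(H0|z0) obtained by treating power/alpha as a likelihood ratio."""
    pw = power(mu_alt)
    return ALPHA * prior_h0 / (ALPHA * prior_h0 + pw * (1.0 - prior_h0))


# Each row: (alternative, Type 2 error probability, "posterior" of H0).
rows = [(mu, 1.0 - power(mu), posterior_h0(mu)) for mu in (1.96, 2.96, 3.96, 4.96)]
for mu, type2, post in rows:
    print(f"mu' = {mu}: Type 2 error = {type2:.4f}, Pr(H0|z0) = {post:.4f}")
```

With α held fixed, both columns fall together as the alternative moves away from 0, so there is no trade-off between this “Type 1 error” and the Type 2 error probability.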
Upshot (modified 8p.m. 5/23/16)
Using power/size as a likelihood ratio, or as an indication of the pre-data strength of evidence with which to accord a rejection, is a bad idea for anyone who wants to assess the comparative evidence by likelihoods. The error statistician is not in the business of making inferences to point values, nor to comparative appraisals of different point hypotheses (much less do we wish to be required to assign priors to the point hypotheses). Criticisms often start out forming these ratios and then blaming the “tail areas” for exaggerating the evidence against. We don’t form those ratios. My point here, though, is that this gambit also serves very badly for a Bayes ratio or likelihood assessment. (Likelihoodlums* and Bayesians, please weigh in on this.)
This is related to several posts having to do with allegations that p-values overstate the evidence against the null hypothesis, such as this one.
Please alert me to errors.
*Michael Lew’s term.
Bayarri, M., Benjamin, D., Berger, J., & Sellke, T. (2016, in press). “Rejection Odds and Rejection Ratios: A Proposal for Statistical Practice in Testing Hypotheses”, Journal of Mathematical Psychology.
Benjamin, D. & Berger J. 2016. “Comment: A Simple Alternative to P-values,” The American Statistician (online March 7, 2016).
Cohen, J. 1988. Statistical Power Analysis for the Behavioral Sciences. 2nd ed. Hillsdale, NJ: Erlbaum.
Mayo, D. 2016. “Don’t throw out the Error Control Baby with the Error Statistical Bathwater”. (My comment on the ASA document)
Mayo, D. 2003. Comments on J. Berger’s, “Could Jeffreys, Fisher and Neyman have Agreed on Testing?” (pp. 19-24)
Senn, S. 2008. Statistical Issues in Drug Development, 2nd ed. Chichester, West Sussex: Wiley Interscience, John Wiley & Sons.
Wasserstein, R. & Lazar, N. 2016. “The ASA’s Statement on P-values: Context, Process and Purpose”, The American Statistician (online March 7, 2016).
 I don’t say the Rejection Ratio can have no frequentist role. It may arise in a diagnostic screening or empirical Bayesian context.
 It may also be found in Neyman! (Search this blog under Neyman’s Nursery.) However, Cohen uniquely provides massive power computations, before it was all computerized.
For the first error, see my definition of power. (It must be relative to a given alternative hypothesis). For the second, see my criticism of NHST as going directly from statistical significance to a genuine phenomenon. This second, we may forgive as just illustrative of the general idea. The first, committed at least twice (please point out any others) is more serious. It may be tied to Cohen’s tendency to imagine one has fixed the d corresponding to the alternative–but this is troublesome.