Frequentstein’s Bride: What’s wrong with using (1 – β)/α as a measure of evidence against the null?


ONE YEAR AGO: …and growing more relevant all the time. Rather than leak any of my new book*, I reblog some earlier posts, even if they’re a bit scruffy. This was first blogged here (with a slightly different title). It’s married to posts on “the P-values overstate the evidence against the null fallacy”, such as this, and is wedded to this one on “How to Tell What’s True About Power if You’re Practicing within the Frequentist Tribe”. 

In their “Comment: A Simple Alternative to p-values” (on the ASA P-value document), Benjamin and Berger (2016) recommend that researchers report a pre-data Rejection Ratio:

It is the probability of rejection when the alternative hypothesis is true, divided by the probability of rejection when the null hypothesis is true, i.e., the ratio of the power of the experiment to the Type I error of the experiment. The rejection ratio has a straightforward interpretation as quantifying the strength of evidence about the alternative hypothesis relative to the null hypothesis conveyed by the experimental result being statistically significant. (Benjamin and Berger 2016, p. 1)

The recommendation is much more fully fleshed out in a 2016 paper by Bayarri, Benjamin, Berger, and Sellke (BBBS 2016): Rejection Odds and Rejection Ratios: A Proposal for Statistical Practice in Testing Hypotheses. Their recommendation is:

…that researchers should report the ‘pre-experimental rejection ratio’ when presenting their experimental design and researchers should report the ‘post-experimental rejection ratio’ (or Bayes factor) when presenting their experimental results. (BBBS 2016, p. 3)….

The (pre-experimental) ‘rejection ratio’ Rpre, the ratio of statistical power to significance threshold (i.e., the ratio of the probability of rejecting under H1 and H0 respectively), is shown to capture the strength of evidence in the experiment for H1 over H0. (ibid., p. 2)

But it does no such thing! [See my post from the FUSION 2016 conference here.] J. Berger and his co-authors will tell you the rejection ratio (and a variety of other measures created over the years) are entirely frequentist because they are built out of frequentist error statistical measures. But being built out of frequentist measures doesn’t mean the resulting animal captures frequentist error statistical reasoning. It might be a kind of Frequentstein monster! [1]

~~~~~~~~~~~~~~

The Law of Comparative Support

It comes from a comparativist support position, which has intrinsic plausibility, although I do not hold to it. It is akin to what some likelihoodists call “the law of support”: if H1 makes the observed results probable, while H0 makes them improbable, then the results are strong (or at least better) evidence for H1 as compared to H0. It appears to be saying (sensibly) that you have better evidence for a hypothesis that best “explains” the data, only this is not a good measure of explanation. Nor is it generally required that H0 and H1 be exhaustive. Even if you hold a comparative support position, the “ratio of statistical power to significance threshold” isn’t a plausible measure of it. Now BBBS also object to the Rejection Ratio, but largely because it’s not sensitive to the actual outcome; that is why they recommend the Bayes Factor post-data. My criticism is much, much deeper. To get around the data-dependent part, let’s assume throughout that we’re dealing with a result just statistically significant at the α level.

~~~~~~~~~~~~~~

Take a one-sided Normal test T+ with n iid samples:

H0: µ ≤ 0 against H1: µ > 0

σ = 10, n = 100, σ/√n = σx = 1, α = .025.

So the test would reject H0 iff Z > c.025 = 1.96 (1.96 is the “cut-off”).

People often talk of a test “having a power”, but a test actually specifies a power function that varies with the point value µ’ in the alternative H1. The power of test T+ in relation to point alternative µ’ is

Pr(Z > 1.96; µ = µ’).

We can abbreviate this as POW(T+,µ’).
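To make the power function concrete, here is a minimal sketch in Python (an illustration only, using nothing but the setup of T+ above, with σx = 1 and cut-off 1.96; it is not code from BBBS or anyone else):

```python
from scipy.stats import norm

ALPHA = 0.025
CUTOFF = norm.ppf(1 - ALPHA)   # c_.025 = 1.96
SIGMA_XBAR = 1.0               # sigma/sqrt(n) = 10/sqrt(100)

def pow_tplus(mu_prime):
    """POW(T+, mu') = Pr(Z > 1.96; mu = mu').

    Under mu = mu', the standardized statistic Z is Normal with mean
    mu'/sigma_xbar and unit variance, so this probability equals
    1 - Phi(1.96 - mu'/sigma_xbar).
    """
    return 1 - norm.cdf(CUTOFF - mu_prime / SIGMA_XBAR)

print(round(pow_tplus(0.0), 3))    # 0.025: equals alpha at the null boundary
print(round(pow_tplus(2.96), 3))   # 0.841: the cut-off plus 1 sigma_xbar
print(round(pow_tplus(4.96), 3))   # 0.999: the cut-off plus 3 sigma_xbar
```

The last two alternatives, 2.96 and 4.96, are the ones used in the examples below.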

~~~~~~~~~~~~~~

Jacob Cohen’s slips

By the way, Jacob Cohen, a founder of power analysis, makes a few slips in introducing power, even though he correctly computes power throughout the book (so far as I know). [2] Someone recently reminded me of this, and given the confusion about power, maybe it’s had more of an ill effect than I assumed.

In the first sentence on p. 1 of Statistical Power Analysis for the Behavioral Sciences, Cohen says “The power of a statistical test is the probability it will yield statistically significant results.” Also faulty, and for two reasons, is what he says on p. 4: “The power of a statistical test of a null hypothesis is the probability that it will lead to the rejection of the null hypothesis, i.e., the probability that it will result in the conclusion that the phenomenon exists.”

Do you see the two mistakes? 

~~~~~~~~~~~~~~

Examples of alternatives against which T+ has high power:

  • If we add σx (i.e., σ/√n) to the cut-off (1.96), we are at an alternative value of µ that test T+ has .84 power to detect. In this example σx = 1, so this alternative is µ = 2.96.
  • If we add 3σx to the cut-off, we are at an alternative value of µ that test T+ has ~.999 power to detect. This value, which we can write as µ.999, is 4.96.

Let the observed outcome just reach the cut-off for rejecting the null: z0 = 1.96.

If we were to form a “rejection ratio” or a “likelihood ratio” of μ = 4.96 compared to μ0 = 0 using

[POW(T+, 4.96)]/α,

it would be 40 (i.e., .999/.025).

It is absurd to say the alternative 4.96 is supported 40 times as much as the null, even understanding support as comparative likelihood or something akin to it. The observed z0 = 1.96 is even closer to 0 than to 4.96. (The same point can be made with less extreme cases.) What is commonly done next is to assign priors of .5 to the two hypotheses, yielding

Pr(H0|z0) = 1/(1 + 40) = .024, so Pr(H1|z0) = .976.
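Making the arithmetic explicit, here is a minimal sketch of the rejection-ratio recipe (same setup as above; the .5/.5 priors are the conventional assignment just described, not something I endorse):

```python
from scipy.stats import norm

ALPHA = 0.025
CUTOFF = norm.ppf(1 - ALPHA)              # 1.96
POW_4_96 = 1 - norm.cdf(CUTOFF - 4.96)    # POW(T+, 4.96), ~0.999

R_pre = POW_4_96 / ALPHA                  # rejection ratio, ~40

# With Pr(H0) = Pr(H1) = 0.5, the "posterior" given a rejection is:
post_H0 = 1 / (1 + R_pre)                 # ~0.024
post_H1 = R_pre / (1 + R_pre)             # ~0.976

print(round(R_pre, 1), round(post_H0, 3), round(post_H1, 3))
```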

Such an inference is highly unwarranted and would almost always be wrong. Back to our question:

Here’s my explanation for why some think it’s plausible to compute comparative evidence this way:

I presume it stems from the comparativist support position noted above. I’m guessing they’re reasoning as follows:

The probability is very high that z > 1.96 under the assumption that μ = 4.96.

The probability is low that z > 1.96 under the assumption that μ = μ0 = 0.

We’ve observed z0 = 1.96 (so we’ve observed Z ≥ 1.96).

Therefore, μ = 4.96 makes the observation more probable than does μ = 0.

Therefore the outcome is (comparatively) better evidence for μ = 4.96 than for μ = 0.

But the “outcome” in a likelihood is the specific outcome observed, and the comparative appraisal of which hypothesis accords better with the data only makes sense when one keeps to this.

I can pick any far-away alternative I like for purposes of getting high power, and we wouldn’t want to say that just reaching the cut-off (1.96) is good evidence for it! Power works in reverse. That is,

If POW(T+, µ’) is high, then z0 = 1.96 is poor evidence that μ > μ’.

That’s because were μ as great as μ’, with high probability we would have observed a larger z value (smaller p-value) than we did. Power may, if one wishes, be seen as a kind of distance measure, but (just like α) it is inverted.

(Note that our inferences take the form μ > μ’, μ < μ’, etc. rather than to a point value.) 

In fact:

if Pr(Z > z0; μ = μ’) is high, then Z = z0 is strong evidence that μ < μ’!

Rather than being evidence for μ’, the statistically significant result is evidence against μ being as high as μ’.
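The reversal is easy to see numerically. A minimal sketch, again using only the Normal computation with z0 = 1.96 and σx = 1:

```python
from scipy.stats import norm

Z0 = 1.96          # observed z, just reaching the cut-off
CUTOFF = 1.96
SIGMA_XBAR = 1.0

for mu_prime in (2.96, 4.96):
    power = 1 - norm.cdf(CUTOFF - mu_prime / SIGMA_XBAR)     # Pr(Z > 1.96; mu')
    p_at_most_z0 = norm.cdf(Z0 - mu_prime / SIGMA_XBAR)      # Pr(Z <= z0; mu')
    print(mu_prime, round(power, 3), round(p_at_most_z0, 3))
# mu' = 2.96: power 0.841, Pr(Z <= 1.96; mu') = 0.159
# mu' = 4.96: power 0.999, Pr(Z <= 1.96; mu') = 0.001
```

The higher the power against µ’, the smaller the probability of a z as small as 1.96 were µ really that large, which is why the just-significant result points away from µ’, not toward it.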
~~~~~~~~~~~~~~

A post by Stephen Senn:

In my favorite guest post by Stephen Senn here, Senn strengthens a point from his 2008 book (p. 201), namely, that the following is “nonsense”:

[U]pon rejecting the null hypothesis, not only may we conclude that the treatment is effective but also that it has a clinically relevant effect. (Senn 2008, p. 201)

Now the test is designed to have high power to detect a clinically relevant effect (usually .8 or .9). I happen to have chosen an extremely high power (.999) but the claim holds for any alternative that the test has high power to detect. The clinically relevant discrepancy, as he describes it, is one “we should not like to miss”, but obtaining a statistically significant result is not evidence we’ve found a discrepancy that big. 

Supposing that it is, is essentially to treat the test as if it were:

H0: μ < 0 vs H1: μ > 4.96

This, he says, is “ludicrous” as it:

would imply that we knew, before conducting the trial, that the treatment effect is either zero or at least equal to the clinically relevant difference. But where we are unsure whether a drug works or not, it would be ludicrous to maintain that it cannot have an effect which, while greater than nothing, is less than the clinically relevant difference. (Senn, 2008, p. 201)

The same holds with H0: μ = 0 as null.

If anything, it is the lower confidence limit that we would look at to see what discrepancies from 0 are warranted. The one-sided lower .975 bound would be 0, and the one-sided lower .95 bound would be about .3. So we would be warranted in inferring from z0:

μ > 0 or μ > .3.
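For the record, those lower bounds follow directly from the observed mean (x̄ = 1.96, since z0 = 1.96 with σx = 1 and µ0 = 0), reading both as one-sided lower bounds. A minimal sketch:

```python
from scipy.stats import norm

XBAR = 1.96        # observed mean: z0 = (xbar - 0)/sigma_xbar = 1.96
SIGMA_XBAR = 1.0

# One-sided lower confidence bound at level 1 - a: xbar - z_a * sigma_xbar
lower_975 = XBAR - norm.ppf(0.975) * SIGMA_XBAR   # 1.96 - 1.96  = 0.0
lower_95  = XBAR - norm.ppf(0.95)  * SIGMA_XBAR   # 1.96 - 1.645 ~ 0.3

print(round(lower_975, 2), round(lower_95, 2))    # 0.0  0.32
```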

~~~~~~~~~~~~~~

What does the severe tester say?

In sync with the confidence interval, she would say SEV(μ > 0) = .975 (if one-sided), and would also note some other benchmarks, e.g., SEV(μ > .96) = .84.

Equally important for her is a report of what is poorly warranted. In particular, the claim that the data indicate

μ > 4.96

would be wrong over 99% of the time!
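Those severity values come from the same Normal computation: for claims of the form µ > µ’ and the observed z0 = 1.96, SEV is Pr(Z ≤ z0; µ = µ’). A minimal sketch reproducing the benchmarks just stated:

```python
from scipy.stats import norm

Z0 = 1.96
SIGMA_XBAR = 1.0

def sev_mu_greater_than(mu_prime):
    """Severity for the claim mu > mu', given observed z0 = 1.96:
    Pr(Z <= z0; mu = mu'), i.e., the probability of a result no larger
    than the one observed, were mu no greater than mu'."""
    return norm.cdf(Z0 - mu_prime / SIGMA_XBAR)

for mu_prime in (0.0, 0.96, 4.96):
    print(mu_prime, round(sev_mu_greater_than(mu_prime), 3))
# mu > 0    -> 0.975
# mu > 0.96 -> 0.841
# mu > 4.96 -> 0.001  (inferring mu > 4.96 would be wrong ~99.9% of the time)
```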

Of course, I would want to use the actual result, rather than the cut-off for rejection (as with power) but the reasoning is the same, and here I deliberately let the outcome just hit the cut-off for rejection.

~~~~~~~~~~~~~~

The (Type 1, 2 error probability) trade-off vanishes

Notice what happens if we consider the “real Type 1 error” to be Pr(H0|z0).

Since Pr(H0|z0) decreases with increasing power, it decreases with decreasing Type 2 error. So to identify the “Type 1 error” with Pr(H0|z0) is to use language in a completely different way from the one in which power is defined, for there we must have a trade-off between Type 1 and Type 2 error probabilities.
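A small numerical illustration of the vanishing trade-off (a sketch only, holding α fixed at .025 and using the .5/.5 priors from before):

```python
ALPHA = 0.025

# "Real Type 1 error" Pr(H0 | rejection) under the rejection-ratio recipe
# with Pr(H0) = Pr(H1) = 0.5:  1/(1 + power/alpha) = alpha/(alpha + power)
for power in (0.50, 0.80, 0.90, 0.999):
    post_H0 = ALPHA / (ALPHA + power)
    print(power, round(post_H0, 3))
# 0.5 -> 0.048, 0.8 -> 0.03, 0.9 -> 0.027, 0.999 -> 0.024
# Alpha is held fixed throughout, yet this "Type 1 error" keeps falling as
# power rises (Type 2 error falls): no trade-off, so it cannot be the
# test's Type 1 error probability.
```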

Upshot

Using the ratio of power to size as a likelihood ratio, or even as a preregistered estimate of the expected strength of evidence (to be accorded to a rejection), is problematic. The error statistician is not in the business of making inferences to point values, nor of making comparative appraisals of different point hypotheses. It’s not unusual for criticisms to start out by forming these ratios, and then to blame the “tail areas” for exaggerating the evidence against the test hypothesis. We don’t form those ratios. But the pre-data Rejection Ratio is also misleading as an assessment alleged to be akin to a Bayes ratio or likelihood assessment. You can marry frequentist components and end up with something Frequentsteinian.

REFERENCES

Bayarri, M., Benjamin, D., Berger, J. & Sellke, T. 2016 (in press). “Rejection Odds and Rejection Ratios: A Proposal for Statistical Practice in Testing Hypotheses”, Journal of Mathematical Psychology.

Benjamin, D. & Berger J. 2016. “Comment: A Simple Alternative to P-values,” The American Statistician (online March 7, 2016).

Cohen, J. 1988. Statistical Power Analysis for the Behavioral Sciences. 2nd ed. Hillsdale, NJ: Erlbaum.

Mayo, D. 2016. “Don’t throw out the Error Control Baby with the Error Statistical Bathwater“. (My comment on the ASA document)

Mayo, D. 2003. Comments on J. Berger’s “Could Jeffreys, Fisher and Neyman have Agreed on Testing?” (pp. 19-24).

*Mayo, D. Statistical Inference as Severe Testing, forthcoming (2017) CUP.

Senn, S. 2008. Statistical Issues in Drug Development, 2nd ed. Chichester: Wiley-Interscience, John Wiley & Sons.

Wasserstein, R. & Lazar, N. 2016. “The ASA’s Statement on P-values: Context, Process and Purpose”, The American Statistician (online March 7, 2016).

[1] I don’t say there’s no context where the Rejection Ratio has a frequentist role. It may arise in a diagnostic screening or empirical Bayesian context where one has to deal with a dichotomy. See, for example, this post (“Beware of questionable front page articles telling you to beware…”)

[2] It may also be found in Neyman! (Search this blog under Neyman’s Nursery.) However, Cohen uniquely provides massive power computations, before it was all computerized.

Categories: Bayesian/frequentist, fallacy of rejection, J. Berger, power, S. Senn


17 thoughts on “Frequentstein’s Bride: What’s wrong with using (1 – β)/α as a measure of evidence against the null?”

  1. Michael Lew

    Mayo, given that this is a re-airing of an earlier blog, I have to ask if there has been any incorporation of the issues that came up in the commentary conversation sparked by that earlier blog. I recall that there were some important points.

  2. Carlos Ungil

    You argue that the ‘pre-experimental rejection ratio’ of BBBS is inadequate to ‘quantify the strength of evidence about the alternative hypothesis relative to the null hypothesis conveyed by the experimental result being statistically significant’ with an argument derived from a specific outcome (result barely significant). I think that your criticism of the ‘pre-experimental rejection ratio’ is based upon illegitimately interpreting their measure of initial strength of evidence provided by a significant result as a measure of final evidence provided by a significant result.

    • Carlos:
      But it’s not a “measure of initial strength of evidence provided by a significant result”, whatever that means.

      • Carlos Ungil

        I’ve taken the terminology initial (= pre-data) and final (= post-data) from http://www.phil.vt.edu/dmayo/personal_website/(1981)%20In%20Defense%20of%20the%20Neyman-Pearson%20Theory%20of%20Confidence%20Intervals.pdf

        From that paper: “Seidenfeld argues that while the ‘best’ NP confidence interval may be reasonable before observing the data (i.e., on the ‘forward look’) it may no longer be reasonable once the data is observed (i.e., on the ‘backward look’)”

        You argue, reasonably, that he cannot apply post-data arguments to criticize NP intervals. For the same reason, you shouldn’t use post-data arguments to criticize the pre-experimental rejection ratio as defined by BBBS.

        • Carlos: I’m impressed you dug up that old thing, but it has nothing whatsoever to do with this. The Rejection Ratio is not a pre-data frequentist method. There I was saying a confidence coefficient doesn’t apply to the interval estimate.

          • Carlos Ungil

            You may be right, but their claim is about the pre-data properties of this measure. If you make a critique, it cannot be based on the post-data properties. You need a different kind of argument to say that the rejection ratio is not a pre-data frequentist method.

            By the way, my first comment is still awaiting moderation and I think our exchange is not visible to anyone else.

            • Carlos: All the comments should be showing.
              “If you make a critique, it cannot be based on the post-data properties.”
              It’s not, it’s a pre-data animal that could warrant saying a significant result is good evidence of a discrepancy 4 SE greater than the null, say.
              “You need a different kind of argument to say that the rejection ratio is not a pre-data frequentist method.”
              No, the situation is very simple. You can make up all kinds of functions of power and alpha (square them, add and subtract whatever) that simply ARE NOT methods from any frequentist school, just as not all functions of probabilities that you can dream up obey the probability calculus. Now, as I explained in my post, the rejection ratio could look sensible, in an affirming-the-consequent kind of way; i.e., I’m not saying it is just any old arbitrary invention (see footnote 1), but it’s not a method from a frequentist methodology. I don’t know how to say this any more clearly.

              • Mayo: Do you have a concise demarcation of frequentist methodology?
                (Benjamin et al. claimed to have matched expectations, if I recall correctly. NormalDeviate insisted on uniform coverage of confidence intervals as an ideal?)

                Keith O’Rourke

                • Phan: don’t understand the question.

                  • george

                    don’t understand the question

                    I’ll try – because I think Keith has a good question, that’s getting garbled and so not answered.

                    Mayo: do you have a concise definition one can use to define methods as frequentist? If so, what is it?

                    For Larry Wasserman, frequentist inference is defined by the goal of constructing procedures with frequency guarantees, though he notes that attaining this exactly may not lead to useful methods.

                    Somewhat similarly, for Bayarri, Benjamin, Berger and Sellke frequentist procedures are those that, in repeated practical use of a statistical procedure, give a long-run average actual accuracy that should not be less than (and ideally should equal) the long-run average reported accuracy.

                    Without this sort of definition, one is left with no way to determine whether or not methods are frequentist – or error-statistical, or whatever. This is, of course, not satisfactory. Look forward to your answer.

                    • George: Let me be clear, my problem isn’t defining a frequentist procedure; my problem is that putting the issue that way so distorts it that it will certainly render the issue misunderstood. Suppose one is dealing with a case where one doesn’t have a definition, e.g., a geometric point. That wouldn’t prevent our saying that blood, books, bananas, countries, sonatas are not geometric points.
                      But since you bring up some definitions of frequentist requirements, let me comment, though it changes the issue:
                      “Somewhat similarly, for Bayarri, Benjamin, Berger and Sellke frequentist procedures are those that, in repeated practical use of a statistical procedure, give a long-run average actual accuracy that should not be less than (and ideally should equal) the long-run average reported accuracy.”

                      This is a perfect example of how one can appear to be talking about one thing and, through changes of definition, actually not be talking about that thing. This frequentist principle (FP) is actually not an N-P-F principle at all but, with the intended interpretation, is Bayesian. See my comment on Berger 2003:
                      “Berger’s FP, however, does not require controlling errors at small values and is highly dependent on prior probability assignments. So far as I can see, the only “clear practical value” of saddling Neyman with this vaguely worded principle … is to convince us that CEPs satisfy the N–P error statistical philosophy. But they do not.” (p.21)

                      Click to access Berger%20Could%20Fisher%20Jeffreys%20and%20Neyman%20have%20agreed%20on%20testing%20with%20Commentary.pdf

                      Let me repeat, this is a completely different issue.
                      But having said all that, the pre-data rejection ratio has terrible error probabilities, insofar as it’s doing anything. You can have high evidence against, or expected evidence against, the null in favor of the alternative against which the test has high power, when the alternative one chooses for this computation is miles away from the null.

                      So really, you’re taking a simple point and turning it into another one which, while interesting in its own right, is far beyond the issue.

                    • george

                      you’re taking a simple point and turning it into another one

                      This is frustrating. You appear to criticize others for espousing incorrect (or incomplete, or unhelpful) definitions of what it means for a methodology to be frequentist; see, for example, your line that “it’s not a method from a frequentist methodology”. But you are, at length, not supplying your definition of this term.

                      So, I’ll try a final time. Do you have a concise definition one can use to define methods as frequentist? If so, what is it?

                      NB I will interpret further attempts to change the subject (or not answer) as simply a “no”. And be in agreement with Keith’s point about pseudo-science.

                    • George: Please reread my reply. I’d be able to say a thimble is not a geometric point, even if I didn’t have a definition of geometric point. That said, I do have requirements for frequentist methods, e.g., they require error control (you can look at Birnbaum’s “confidence concept” in the 1977 paper in the current blogpost: high probability of rejecting claims if and only if they’re false). Severe testers see error-control performance as necessary but not sufficient. Tell you what: let’s call the pre-data rejection ratio frequentist, or frequentist-for-these-authors. Then it’s a highly misleading way to assess the evidence against a null afforded by a significant result. The appraisal by any other name is still wrong-headed, but I still say you’re missing the point. There are pre-data concepts that “blend” into post-data ones, like pre-data error probabilities and p-values or severity. There’s a general shared reasoning. Not so for the rejection ratio, but I think I’ve said more than enough about this.

                    • George and others: I want to be clear that this reblog was limited to the single point made in the original post. I’ve said nothing about other points of disagreement and problems, even though you could say they’re more important. With respect to those other issues, I see this as another ingenious argument of the sort at which J. Berger excels, as seen in Berger 2003, and perhaps earlier. Namely, it’s an argument for Bayes Factors as a compromise or reconciliation between frequentists and Bayesians. I have elsewhere discussed how the Bayesian foot sneaks in the door. Notice, for example, that power is said to require a prior on alternatives. That was not the point of this post (which is much simpler). Anyway, I linked to my response to Berger (2003) in my earlier comment.
                      My one-time colleague I.J. Good’s use of the Bayes Factor as a Bayes/non-Bayes compromise is different. See also Pratt (1977) in my current blogpost on Birnbaum.

  3. Mayo:

    From Benjamin: “a fully frequentist measure because its frequentist expectation under the null hypothesis precisely equals the frequentist rejection ratio”. You seem to require something more than this.

    Normaldeviate saw uniform confidence interval coverage as an ideal that often was not obtainable https://normaldeviate.wordpress.com/2012/08/10/confidence-intervals-for-misbehaved-functionals/

    This raises the question: what is needed for a method to be a frequentist methodology?

    A demarcation question like what distinguishes science from pseudo-science.

    Keith O’Rourke

    • Phan: I don’t see it as akin to demarcation at all, and it doesn’t even have to be a frequentist complaint. Berger admitted to me it made no sense once I pointed out you can pick an alternative mu’ against which a test has high power and then get a high rejection ratio which says nothing about evidence for that alternative mu’, even if the test rejects. Otherwise it’s wrong with very high probability. But likelihoodists can put their objection in other terms.
