When the rejection ratio (1 – β)/α turns evidence on its head, for those practicing in an error-statistical tribe (ii)



I’m about to hear Jim Berger give a keynote talk this afternoon at a FUSION conference I’m attending. The conference goal is to link Bayesian, frequentist and fiducial approaches: BFF.  (Program is here. See the blurb below [0]).  April 12 update below*. Berger always has novel and intriguing approaches to testing, so I was especially curious about the new measure.  It’s based on a 2016 paper by Bayarri, Benjamin, Berger, and Sellke (BBBS 2016): Rejection Odds and Rejection Ratios: A Proposal for Statistical Practice in Testing Hypotheses. They recommend:

“that researchers should report what we call the ‘pre-experimental rejection ratio’ when presenting their experimental design and researchers should report what we call the ‘post-experimental rejection ratio’ (or Bayes factor) when presenting their experimental results.” (BBBS 2016)….

“The (pre-experimental) ‘rejection ratio’ R_pre, the ratio of statistical power to significance threshold (i.e., the ratio of the probability of rejecting under H1 and H0 respectively), is shown to capture the strength of evidence in the experiment for H1 over H0.”

If you’re seeking a comparative probabilist measure, the ratio of power/size can look like a likelihood ratio in favor of the alternative. To a practicing member of an error statistical tribe, however, whether along the lines of N, P, or F (Neyman, Pearson or Fisher), things can look topsy turvy.

I’ll follow their example that gives an easy to grasp figure involving Normal testing with a point null, µ = 0, vs µ ≠ 0. I will follow them in letting H1 be a particular point alternative.

[Figure from BBBS 2016 (screenshot); image not available]

The red area is the probability of a Type 1 error, α. The Rejection ratio = (red area + blue area)/red area

Without any computations, we see that the further away one chooses the point value for H1, dragging it to the right, the more blue area and hence the higher the rejection ratio (for a fixed α-level), hence the stronger the comparative impact of a statistically significant result, at least on this pre-data measure. But wait, this is at odds with N-P-F (Neyman-Pearson-Fisher) testing reasoning and confidence interval logic. (A related post is here.)

Consider two point alternatives, H’1: µ = µ’ and H”1: µ = µ”, where µ” exceeds µ’.

Then according to BBBS, or so it appears, a positive 0.05 significant result (from the same test T) is stronger evidence for the larger discrepancy µ” than it is for the smaller discrepancy µ’. [1] This gets to one of my pet peeves in contemporary uses of power. I realize they later want to consider priors in the parametric hypotheses and Bayes factors, so maybe this saves the day (for them). But then it’s not clear why they include the pre-data rejection ratio, unless maybe for cases of diagnostic testing (with binary results).

For some arbitrary numbers, let σ = 10, n = 100, so (σ/√n) = 1. Test T+ rejects H0 at the .05 level if d(x) > 1.96. For simplicity, let the cut-off for rejection be 2. Let H’1: µ = .25, and H”1: µ = 3.

In the first case, the rejection ratio is POW(.25)/.05 = .04/.05 = .8. In the second case, POW(3)/.05 = .84/.05 ≈ 17. So according to them, a rejection based on d(x) = 2 is better comparative evidence for µ > 3 than it is for inferring µ > .25. [1]
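The arithmetic above can be reproduced with a minimal Python sketch. It assumes the one-sided test with null value µ = 0, σ/√n = 1, the rounded cut-off of 2, and the nominal .05 level used in the post; the function names `power` and `rejection_ratio` are illustrative, not taken from BBBS.

```python
from statistics import NormalDist

Z = NormalDist()  # standard Normal distribution

SE = 1.0        # sigma / sqrt(n) = 10 / sqrt(100)
CUTOFF = 2.0    # reject H0 when d(x) exceeds this (the post rounds 1.96 up to 2)
ALPHA = 0.05    # nominal significance level used in the post


def power(mu, cutoff=CUTOFF, se=SE):
    """Pr(d(X) > cutoff; mu), assuming the null value is mu = 0."""
    return 1.0 - Z.cdf(cutoff - mu / se)


def rejection_ratio(mu, alpha=ALPHA):
    """Pre-data rejection ratio: power at the point alternative mu, divided by alpha."""
    return power(mu) / alpha


for mu in (0.25, 3.0):
    print(f"mu = {mu}: POW = {power(mu):.2f}, ratio = {rejection_ratio(mu):.1f}")
# mu = 0.25: POW = 0.04, ratio = 0.8
# mu = 3.0: POW = 0.84, ratio = 16.8
```

Because `power` is increasing in `mu`, the ratio grows as the point alternative is moved further from the null, which is the pattern the post objects to.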

Yet, look what happens if we compare the corresponding confidence levels for the inferences[2]:

In the first case, µ > .25 is the lower confidence interval (CI) at level .96.

In the second case, µ > 3 is the lower confidence interval (CI) at level .16! That’s terrible evidence for a discrepancy so large.[3]
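As a rough check on these two confidence levels, here is a minimal Python sketch. It assumes the observed d(x) is exactly 2, σ/√n = 1, and null value µ = 0, and it treats the one-sided lower bound as sample mean minus z standard errors; the name `lower_bound_confidence` is illustrative.

```python
from statistics import NormalDist

Z = NormalDist()  # standard Normal distribution

SE = 1.0      # sigma / sqrt(n)
D_OBS = 2.0   # observed d(x), taken to sit exactly at the cut-off


def lower_bound_confidence(mu_prime, d_obs=D_OBS, se=SE):
    """Confidence level at which 'mu > mu_prime' is the one-sided lower bound.

    The bound is x_bar - z * se with x_bar = d_obs * se (null value 0), so
    solving x_bar - z * se = mu_prime gives z = d_obs - mu_prime / se.
    """
    return Z.cdf(d_obs - mu_prime / se)


for mu_prime in (0.25, 3.0):
    print(f"mu > {mu_prime}: confidence level = {lower_bound_confidence(mu_prime):.2f}")
# mu > 0.25: confidence level = 0.96
# mu > 3.0: confidence level = 0.16
```

With the observed d(x) at the cut-off, each level is one minus the power at the same value of µ, so the alternative with the larger rejection ratio gets the lower confidence level.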

Jim Berger is always coming up with intriguing measures that are to have frequentist (and Bayesian) justifications; but I always find they have similar flaws. I could recommend one or two that might work!

I may come back to this later on.

April 12 update: So Jim Berger granted my point, saying he hadn’t really thought about this untoward result, but noted that power/alpha was used in high throughput testing. He said it only made him more inclined to opt for the post-data Bayes factor discussed in the paper (which I don’t take up here). (Jim will correct me if I’ve got any of this wrong.)


[0]The objective of this two-and-half day workshop is to study the role and foundations of statistical inference in the era of data science and also its applications in fusion learning. Its main foci are:

  • Report new advances in and re-examine the foundations of statistical inferences in the modern era of data science,
  • Develop links to bridge gaps among different statistical paradigms, including Bayesian, frequentist and fiducial (BFF) inferences, and explore the possibility of a unifying statistical theme for scientific learning and research
  • Disseminate new ideas and foster new research approaches in fusion learning from multiple diverse data sources.

Professor Sir David R. Cox (Oxford University) is to deliver a featured address through a video presentation, to be discussed by Professor Nancy Reid (University of Toronto). Professors Jim Berger (Duke University) and Brad Efron (Stanford University)  are to deliver the keynote addresses, to be followed by many more distinguished lecturers and discussions. 

The workshop will bring together statisticians and data scientists across the aisles to address issues related to the foundation of statistical inferences and its applications to combining information and fusion learning. This workshop will help disseminate new developments of coherent BFF inferences and new advances in statistical inferences and their applications to both within the field of statistics and all fields that use statistics. It is sponsored by Rutgers Statistics Department, DIMACS center and the National Science Foundation (NSF).

They sometimes suggest that “BFF” is also supposed to echo the “best friends forever” meme. The program and a link to abstracts is here.

[1] It must be kept in mind that confidence interval (CI) inferences, like severity inferences, are in the form of µ > µ’ = µ0 + δ, or µ < µ’ = µ0 + δ, or the like. They are not to point values!

[2] Power of Test T in relation to alternative µ’: POW(T,µ’) = Pr(Test T rejects H0; µ’)
In terms of P-values: POW(µ’) = Pr(P < p*; µ’) where {P < p*} corresponds to rejecting the null hypothesis at the given level p*.

Pr(d(X) > 2;  µ = .25 ) = Pr(Z > 1.75) = .04; Pr(d(X) > 2;  µ = 3 ) = Pr(Z > -1) = .84.

[3] As Stephen Senn points out (in my favorite of his posts), the alternative against which we set high power is the discrepancy from the null that “we should not like to miss”, delta Δ. Δ is not the discrepancy we may infer from a significant result (in a test where POW(Δ) = .84).



Categories: confidence intervals and tests, power, Statistics


31 thoughts on “When the rejection ratio (1 – β)/α turns evidence on its head, for those practicing in an error-statistical tribe (ii)”

  1. john byrd

    Are these papers/presentations from the workshop available someplace?

  2. Steven McKinney

    Thank you Mayo for setting things straight. Figure 1 is simple and basic – surprising that heavyweight authors fail to run such simple exercises and thus stumble on these points.

    Very minor typo point – at the end of your post I see

    Pr(d(X) > 2; µ = 3 ) = Pr(Z > -1) = .04.

    which should be

    Pr(d(X) > 2; µ = 3 ) = Pr(Z > -1) = .84.

    • Steven: Thanks so much for your comment. I thought maybe people who were thinking straight about tests had disappeared or become disinterested. This is the same proposal Benjamin and Berger give as a recommended replacement or reform in their commentary to the ASA statement on p-values. What I still want to know is how they can go on, on p. 11, to declare that optional stopping (in this example) is so awful for the hypothesis tester as to be tantamount to not bothering to collect data, whereas for the user of their Bayes factor, it’s hunky dory! Now Berger has agreed with my criticism but thinks it’s all the more reason to use the Bayes factor. But the latter is supposed to get its frequentist justification from equalling the expected value of the power/alpha. Having a frequentist meaning doesn’t show it has good control of error probabilities. So this kind of shocks me…. Oh, and thanks for correcting the typo in the footnote.

      • john byrd

        It seems that this suggested “fix” is simply saying life is better when the competing distributions show greater separation, whether due to greater differences in proposed parameters or greater sample size. What is new in this? The interest in this ratio as frequentist qualification seems to really call for severity. What does it offer that is not better given by the severity curve?

        • But the rejection ratio is backwards as an assessment of impact of a rejection–that’s my point.

  3. Mayo – what do you think of Christian Robert’s ‘testing as estimation of a mixture model’?

    • Om: Not familiar with it.

      • Thought you might be interested as he discusses in context of the ‘demise of the bayes factor’ – see eg http://arxiv.org/abs/1506.08292 as well as http://arxiv.org/abs/1412.2044

        • Om: I looked quickly at the first, thank you for letting me know about it. I’m not surprised that people would be having great doubts about the use of Bayes factors, but I’m not sure what this mixture parameter is supposed to do. Really, I’d have to study it much more carefully, and clearly this is just a draft version of the paper. How does this connect with his ABC work, by the way?

          • My impression is there is some Gelman influence – convert a discrete decision problem into a continuous parameter estimation problem ‘within’ an enlarged model. See also Senn’s comments on Laplace vs Jeffreys’ priors and testing. I don’t know if there is any direct connection to ABC, though perhaps ABC might help with the implementation.

            • And while we’re at it – see also Fisher vs NP! The latter don’t seem (to me) to have understood the relation between Fisher’s parameter estimation and testing formulations. My reading is that Fisher was right to be annoyed by them!

              • Om; You might check Fisher vs N-P on this blog, for example, the fiducial posts that are linked at the end of this post. N-P did everything they possibly could to place Fisher’s methods on firm footing, and took his side in battling the “old guard”–including Egon’s father, Karl (see the Spanos post on Fisher). Fisher has no problem with N’s confidence interval estimation as an extension of Fiducial inference (1934). It also gave an interpretation that avoided the flawed probabilistic instantiation in Fisher. Fisher’s anger, aside from the important issues of personality disputes (after 1935), and Neyman’s unwillingness to be obsequious (see “anger management” post), grew out of Neyman’s refusal to accept Fisher’s Fiducial argument, and Fisher’s refusal to accept that there was anything wrong with it. If you want to read just how much Fisher accepted the link between tests and estimation, reflected in the N-P fundamental lemma, look up (on this blog) a post that discusses a passage from Fisher’s “Two New Properties of likelihood”.
                So I really don’t know what you mean by saying Fisher was “right” to be annoyed by them (it was really only Neyman who annoyed him by the way). Do you think Fisher was right about Fiducial probability? or right to insist that Neyman use his book, or else? (Not that Neyman didn’t have his own personality quirks.)

                • That’s actually the precise paper I was just reading! I read a lot of underlying impatience into Fisher explaining the connection between tests and estimation to NP who he quotes as claiming they aren’t connected. Furthermore from memory one of N or P states that they still don’t understand it in the discussion.

                  I think Fisher was essentially right about Fiducial probability though he may have made the occasional slip in his attempts to explain it. I’m glad people are starting to revisit it and I’m glad people like Fraser did their best to keep the ideas alive.

                  • Om: Well you’ll have to explain which part of Fisher’s Fiducial you hold. Fraser turns them into frequentist confidence distributions, as does Cox. I argue for using such assessments evidentially by means of the severity principle, or the like. It’s not an “occasional slip” of Fisher to instantiate as he does, e.g., see Fisher in the 1955 triad, and Neyman’s response in 1956, also part of the “triad”.

                    • Sigh. These arguments are impossible and unproductive. No wonder Fisher got so annoyed

                    • (For the record – as I have stated many times, I believe a fruitful way to interpret the Fiducial argument is through a ‘structural’ lens a la Fraser. The Fiducial argument seems clear *given structural assumptions* – as Fraser et al have pointed out – and I think NP missed this. Whether or not Fiducial inference stands as a general theory of inference I don’t know but again I think NP clearly missed the underlying idea, while Fraser didn’t. I have also pointed out how the structural interpretation appears to avoid the ‘fallacious instantiation’ that you are fond of focusing on. It’s up to you whether you want to seriously consider this point of view but I haven’t seen you give a thoughtful argument against it.)

                    • For the record, you’re not being clear about a ‘structural’ lens, and as it happens I spoke to Fraser about this last week at Rutgers. He doesn’t think N-P “missed” what survives of a Fisherian Fiducial argument, but rather that Bayesians miss it.

                    • I’ve posted the relevant quote from Fraser’s paper (and many others) many times. You have to want to understand. Interesting that Fraser feels this way – did he explicitly mention NP? His earlier work certainly discusses Fisher as widely misunderstood.

                  • The text you quote here: https://errorstatistics.com/2013/02/20/fisher-from-two-new-properties-of-mathematical-likelihood/

                    reads to me as Fisher expressing mild annoyance at NP for ‘not getting it’. He’s explaining how to get their results in the ‘right way’ (his way of course!) and starting from the ‘right concepts’ (here likelihood not power).

                • Michael Lew

                  I agree with omaclaren on that paper. And I think that your disdain for fiducial probability, “the flawed probabilistic instantiation in Fisher”, is, like Neyman’s, based on a mindset that is focussed on dichotomous accounting of errors instead of estimation. The fact that fiducial and confidence intervals for normal means are identical initially disguised differences between Fisher and Neyman. Eventually they both discovered that they meant different things.

                  • Michael: I have no clue where you’re getting this from! I have no disdain for fiducial nor for estimation. I choose to use “testing language” because for every claim based on evidence, one may appraise it this way, i.e., by considering how well probed it is. One might have instead chosen other language. There’s a precise duality with estimation. What are the differences you perceive between Fisher and Neyman on estimation? Whenever Fisher had a leg to stand on, Neyman has no problem, nor do I.

                    • Mayo – the basic point is that Fisher claimed that Neyman misunderstood him and that I (and it seems Michael) agree with Fisher and not Neyman while you (it seems) agree with Neyman.

                      You have argued (I believe) that this disagreement was only superficial – I (and it seems Michael) believe that the disagreement was not superficial.

                      There is clearly room for interpretation here – there is no ‘official Fisher was wrong except where he agreed with Neyman’ reading, unless you appeal to certain authorities (while the opposite reading is given by other authorities).

                      As an exercise, perhaps you could interpret this response by Fisher to Neyman’s discussion of Fisher’s 1935 ‘Logic of inductive inference’ paper. In my reading Fisher is making a clear statement consistent with modern ‘likelihoodist’ approaches. Do you agree?

                      *begin quote*
                      ii) I ought to mention that the theorem that if a sufficient statistic exists, then it is given by the method of maximum likelihood was proved in my paper of 1921, to which Dr. Neyman refers. It was this that led me to attach especial importance to this method. I did not at that time, however, appreciate the cases in which there is no sufficient statistic, or realise that other properties of the likelihood function, in addition to the position of the maximum, could supply what was lacking.

                      ii) In saying “We need statistics with minimum variances, or, what comes to the same thing, with maximum possible amount of information,” Dr. Neyman must be taken as speaking only of the preliminary part of the theory, dealing with the properties of statistics in “large” samples. The concept of amount of information as a measurable quantity, not identical with the invariance, was developed for the theory of finite samples, where the distributions are often skew, and it is only in studying these that the advantage of assessing and utilising the whole of the information available will be fully appreciated.
                      *end quote*

                    • Om: Are you saying, in disagreeing with Neyman on this point that you endorse the probabilistic instantiations on p. 118 (4), (5)

                      (neyman-1977_frequentist-probability-and-frequentist-statistics.pdf)

                      and accept the contradictions there and on p. 119?

                    • I disagree with that one sentence or two of Fisher as stated but based on reading his other sentences and eg Fraser’s ‘structural probability’ paper I think I understand what he meant.

                      Now, what’s your reading of the Fisher quote I posted.

  4. In his discussion of your Birnbaum paper Fraser states “Statistical inference as an alternative to Neyman–Pearson decision theory has a long history in statistical thinking, with strong impetus from Fisher’s research; see, for example, the overview in Fisher (1956)”

    So again I’m confused that Fraser would say Fisher and NP were doing the same thing. From memory he has drawn the distinction between Fisher and NP a number of times. Perhaps he could write a guest post here about Fisher, NP and structural inference.

    • Om: I keep saying, “it’s the method, stupid”, not what particular personalities thought they were doing, nor the formalism which is invariably very different from applications by founders. As in my chapter 12 of EGEK–a Neyman-Pearsonian such as Pearson (and in applications also Neyman!) reject the N-P-Wald decision-theoretic formulation of hypotheses testing which came later. Fraser says “what’s left” (in the evidential construal of N-P methods) are Cox style p-value functions, confidence intervals and distributions. I have my own evidential formulation of N-P tests, and I can’t say just where it is identical/or different to any of the attempts on offer, since they aren’t clear enough, except for Cox’s (along the lines of Mayo and Cox 2006).

      • Absolutely agree the focus should be on the mathematics of the methods. I’m just saying that it seems like NP legitimately misunderstood what Fisher was doing. A lot of the ‘confidence distribution’ folk are just doing Fiducial while trying to avoid the stigma

        • Anyway…that all got off track – I was trying to relate Fisher and NP’s apparent disagreement about whether testing could be done as estimation to a similar shift towards estimation in Bayesian testing (see the original paper I linked). Personally I find the estimation perspective more natural and testing as secondary. Yes I know about confidence interval test inversion duality. Is there a similar duality for confidence distributions? It seems like it all leads back to – represent what information you have pre test/decision, then interpret/make a decision on this basis. It’s unfortunate that classical likelihood requires strong regularity conditions but I think it points in a worthwhile direction (eg empirical/nonparametric/synthetic likelihood, ABC etc)
