Frequentstein: What’s wrong with (1 – β)/α as a measure of evidence against the null? (ii)


In their “Comment: A Simple Alternative to p-values” (on the ASA P-value document), Benjamin and Berger (2016) recommend that researchers report a pre-data Rejection Ratio:

It is the probability of rejection when the alternative hypothesis is true, divided by the probability of rejection when the null hypothesis is true, i.e., the ratio of the power of the experiment to the Type I error of the experiment. The rejection ratio has a straightforward interpretation as quantifying the strength of evidence about the alternative hypothesis relative to the null hypothesis conveyed by the experimental result being statistically significant. (Benjamin and Berger 2016, p. 1)

The recommendation is much more fully fleshed out in a 2016 paper by Bayarri, Benjamin, Berger, and Sellke (BBBS 2016): Rejection Odds and Rejection Ratios: A Proposal for Statistical Practice in Testing Hypotheses. Their recommendation is:

…that researchers should report the ‘pre-experimental rejection ratio’ when presenting their experimental design and researchers should report the ‘post-experimental rejection ratio’ (or Bayes factor) when presenting their experimental results. (BBBS 2016, p. 3)….

The (pre-experimental) ‘rejection ratio’ Rpre, the ratio of statistical power to significance threshold (i.e., the ratio of the probability of rejecting under H1 and H0 respectively), is shown to capture the strength of evidence in the experiment for H1 over H0. (ibid., p. 2)

But in fact it does no such thing! [See my post from the FUSION conference here.] J. Berger, and his co-authors, will tell you the rejection ratio (and a variety of other measures created over the years) are entirely frequentist because they are created out of frequentist error statistical measures. But a creation built on frequentist measures doesn’t mean the resulting animal captures frequentist error statistical reasoning. It might be a kind of Frequentstein monster! [1]

This brings me to where I left off in my last post: How could people think it plausible to compute comparative strength of evidence this way? The rejection ratio is one of the “new monsters”, but it also appears, without this name, in popular diagnostic screening models of tests. See, for example, this post (“Beware of questionable front page articles telling you to beware…”).

The Law of Comparative Support

It comes from a comparativist support position which has intrinsic plausibility, although I do not hold to it. It is akin to what some likelihoodists call “the law of support”: if H1 makes the observed results probable, while H0 makes them improbable, then the results are strong (or at least better) evidence for H1 compared to H0. It appears to be saying (sensibly) that you have better evidence for a hypothesis that best “explains” the data, only this is not a good measure of explanation. It is not generally required that H0 and H1 be exhaustive. Even if you hold a comparative support position, the “ratio of statistical power to significance threshold” isn’t a plausible measure for this. Now BBBS also object to the Rejection Ratio, but largely because it’s not sensitive to the actual outcome; so they recommend the Bayes Factor post data. My criticism is much, much deeper. To get around the data-dependent part, let’s assume throughout that we’re dealing with a result just statistically significant at the α level.

I had a post last year called “What’s Wrong with Taking (1 – β)/α as a Likelihood Ratio Comparing H0 and H1?” While it garnered over 80 interesting comments (and a continuation), only one or two concerned the point I really had in mind. So in what follows I’ll take some excerpts from it, interspersed with new remarks.

Take a one-sided Normal test T+, with n iid samples:

H0: µ ≤ 0 against H1: µ > 0

σ = 10, n = 100, σ/√n = σx = 1, α = .025.

So the test would reject H0 iff Z > c.025 = 1.96 (1.96 is the “cut-off”).

People often talk of a test “having a power” but the test actually specifies a power function that varies with different point values in the alternative H1. The power of test T+ in relation to point alternative µ’ is

Pr(Z > 1.96; µ = µ’).

We can abbreviate this as POW(T+,µ’).
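
For readers who want to check the numbers, here is a minimal sketch (my addition, not the post’s own code) of the power function in Python; the helper name power_Tplus and the use of scipy are my own choices.

```python
# A minimal sketch (my addition): the power function of T+,
# POW(T+, mu') = Pr(Z > 1.96; mu = mu'), with sigma_x = sigma/sqrt(n) = 1.
from scipy.stats import norm

SIGMA_X = 1.0    # sigma/sqrt(n) = 10/sqrt(100)
CUTOFF = 1.96    # cut-off c_.025 for alpha = .025

def power_Tplus(mu_prime, sigma_x=SIGMA_X, cutoff=CUTOFF):
    """POW(T+, mu'): probability of rejecting H0 when mu = mu'."""
    return norm.sf(cutoff - mu_prime / sigma_x)

for mu_prime in (0.0, 2.96, 4.96):
    print(f"POW(T+, {mu_prime}) = {power_Tplus(mu_prime):.3f}")
# POW(T+, 0.0)  = 0.025  (the significance level itself)
# POW(T+, 2.96) = 0.841
# POW(T+, 4.96) = 0.999
```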

~~~~~~~~~~~~~~

Jacob Cohen’s slips

By the way, Jacob Cohen, a founder of power analysis, makes a few slips in introducing power, even though he correctly computes power throughout the book (so far as I know). [2] Someone recently reminded me of this, and given the confusion about power, maybe it’s had more of an ill effect than I assumed.

In the first sentence on p. 1 of Statistical Power Analysis for the Behavioral Sciences, Cohen says “The power of a statistical test is the probability it will yield statistically significant results.” Also faulty, and for two reasons, is what he says on p. 4: “The power of a statistical test of a null hypothesis is the probability that it will lead to the rejection of the null hypothesis, i.e., the probability that it will result in the conclusion that the phenomenon exists.”

In case you don’t see the two mistakes, I will write them in my first comment. 

~~~~~~~~~~~~~~

Examples of alternatives against which T+ has high power:

  • If we add σx (i.e., σ/√n) to the cut-off (1.96), we are at an alternative value for µ that test T+ has .84 power to detect. In this example, σx = 1.
  • If we add 3σx to the cut-off, we are at an alternative value for µ that test T+ has ~.999 power to detect. This value, which we can write as µ.999, is 4.96.

Let the observed outcome just reach the cut-off to reject the null, z0 = 1.96.

If we were to form a “rejection ratio” or a “likelihood ratio” of μ = 4.96 compared to μ0 = 0 using

[POW(T+, 4.96)]/α,

it would be 40 (.999/.025).

It is absurd to say the alternative 4.96 is supported 40 times as much as the null, even understanding support as comparative likelihood or something akin to it. The data, z0 = 1.96, are even closer to 0 than to 4.96. (The same point can be made with less extreme cases.) What is commonly done next is to assign priors of .5 to the two hypotheses, yielding

Pr(H0|z0) = 1/(1 + 40) = .024, so Pr(H1|z0) = .976.
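
Here is a small sketch (my addition, under the setup above) that reproduces these numbers and, for contrast, computes the actual likelihood ratio at z0 = 1.96, which runs the other way:

```python
# A sketch (my addition): the rejection ratio and .5-prior posterior from the text,
# next to the actual likelihood ratio at the just-significant z0 = 1.96 (sigma_x = 1).
from scipy.stats import norm

z0, alpha, mu_alt = 1.96, 0.025, 4.96

power = norm.sf(1.96 - mu_alt)            # POW(T+, 4.96) ~ .999
rejection_ratio = power / alpha           # ~ 40
pr_H0 = 1 / (1 + rejection_ratio)         # ~ .024 with priors of .5 on each hypothesis
print(f"rejection ratio = {rejection_ratio:.1f}, Pr(H0|reject) = {pr_H0:.3f}")

# The actual likelihoods at z0 tell the opposite story: 0 is favoured over 4.96.
lik_ratio = norm.pdf(z0 - 0.0) / norm.pdf(z0 - mu_alt)
print(f"L(mu=0)/L(mu=4.96) at z0 = 1.96: {lik_ratio:.1f}")   # ~ 13
```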

Such an inference is highly unwarranted and would almost always be wrong. Back to our question:

How could people think it plausible to compute comparative evidence this way?

I presume it stems from the comparativist support position noted above. I’m guessing they’re reasoning as follows:

The probability is very high that z > 1.96 under the assumption that μ = 4.96.

The probability is low that z > 1.96 under the assumption that μ = μ0 = 0.

We’ve observed z0 = 1.96 (so we’ve observed z ≥ 1.96).

Therefore, μ = 4.96 makes the observation more probable than does  μ = 0.

Therefore the outcome is (comparatively) better evidence for μ = 4.96 than for μ = 0.

But the “outcome” in a likelihood must be the specific outcome, and the comparative appraisal of which hypothesis accords better with the data only makes sense when one keeps to this.

I can pick any far away alternative I like for purposes of getting high power, and we wouldn’t want to say that just reaching the cut-off (1.96) is good evidence for it! Power works in the reverse. That is,

If POW(T+, µ’) is high, then z = 1.96 is poor evidence that μ > μ’.

That’s because were μ as great as μ’, with high probability we would have observed a larger z value (smaller p-value) than we did. Power may, if one wishes, be seen as a kind of distance measure, but (just like α) it is inverted.

(Note that our inferences take the form μ > μ’, μ < μ’, etc., rather than inferences to a point value.)

In fact:

If Pr(Z > z0; μ = μ’) is high, then Z = z0 is strong evidence that μ < μ’!

Rather than being evidence for μ’, the statistically significant result is evidence against μ being as high as μ’.
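
A quick numerical check of this inversion (again my sketch, not the post’s code): the higher the power against µ’, the smaller the probability of getting a z as small as 1.96 were µ as great as µ’.

```python
# A sketch (my addition): for alternatives the test has high power against,
# a z as small as 1.96 would be improbable, so z0 = 1.96 tells against mu >= mu'.
from scipy.stats import norm

z0, sigma_x = 1.96, 1.0
for mu_prime in (2.96, 4.96, 10.96):
    power = norm.sf(1.96 - mu_prime / sigma_x)       # Pr(Z > 1.96; mu = mu')
    pr_as_small = norm.cdf(z0 - mu_prime / sigma_x)  # Pr(Z <= z0; mu = mu')
    print(f"mu'={mu_prime}: POW = {power:.4f}, Pr(Z <= 1.96; mu') = {pr_as_small:.4f}")
# As power approaches 1, Pr(Z <= 1.96; mu') approaches 0.
```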
~~~~~~~~~~~~~~

My favorite post by Stephen Senn

In my very favorite post by Stephen Senn here, Senn strengthens a point from his 2008 book (p. 201), namely, that the following is “nonsense”:

[U]pon rejecting the null hypothesis, not only may we conclude that the treatment is effective but also that it has a clinically relevant effect. (Senn 2008, p. 201)

Now the test is designed to have high power to detect a clinically relevant effect (usually .8 or .9). I happen to have chosen an extremely high power (.999) but the claim holds for any alternative that the test has high power to detect. The clinically relevant discrepancy, as he describes it, is one “we should not like to miss”, but obtaining a statistically significant result is not evidence we’ve found a discrepancy that big. 

Supposing that it is, is essentially to treat the test as if it were:

H0: μ ≤ 0 vs H1: μ > 4.96

This, he says, is “ludicrous” as it:

would imply that we knew, before conducting the trial, that the treatment effect is either zero or at least equal to the clinically relevant difference. But where we are unsure whether a drug works or not, it would be ludicrous to maintain that it cannot have an effect which, while greater than nothing, is less than the clinically relevant difference. (Senn, 2008, p. 201)

The same holds with H0: μ = 0 as null.

If anything, it is the lower confidence limit that we would look at to see what discrepancies from 0 are warranted. The one-sided lower .975 limit would be 0, and the lower .95 limit would be .3. So we would be warranted in inferring from z0:

μ  > 0 or μ  > .3.
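
A small sketch (my addition) of these lower bounds; the multipliers 1.96 and 1.645 are just the standard Normal quantiles at .975 and .95.

```python
# A sketch (my addition): lower confidence bounds for mu, given observed mean 1.96
# with sigma_x = 1 (the bound is xbar - z_level * sigma_x).
from scipy.stats import norm

xbar, sigma_x = 1.96, 1.0
for level in (0.975, 0.95):
    lower = xbar - norm.ppf(level) * sigma_x
    print(f"lower {level} bound for mu: {lower:.2f}")
# lower 0.975 bound: 0.00  -> warrants mu > 0
# lower 0.95  bound: 0.32  -> warrants mu > .3 (approximately)
```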

~~~~~~~~~~~~~~

What does the severe tester say?

In sync with the confidence interval, she would say SEV(μ > 0) = .975 (if one-sided), and would also note some other benchmarks, e.g., SEV(μ > .96) = .84.

Equally important for her is a report of what is poorly warranted. In particular the claim that the data indicate

μ > 4.96

would be wrong over 99% of the time!
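
These benchmarks can be reproduced with the same ingredients; the helper sev_greater_than below is my own shorthand for Pr(Z < z0; µ = µ1), a minimal sketch rather than a general severity implementation.

```python
# A sketch (my addition): severity for claims mu > mu1, given z0 = 1.96, sigma_x = 1:
# SEV(mu > mu1) = Pr(Z < z0; mu = mu1).
from scipy.stats import norm

z0, sigma_x = 1.96, 1.0

def sev_greater_than(mu1):
    """Severity for the claim mu > mu1 at the observed z0 (my shorthand)."""
    return norm.cdf(z0 - mu1 / sigma_x)

for mu1 in (0.0, 0.96, 4.96):
    print(f"SEV(mu > {mu1}) = {sev_greater_than(mu1):.3f}")
# SEV(mu > 0.00) = 0.975
# SEV(mu > 0.96) = 0.841
# SEV(mu > 4.96) = 0.001  -> inferring mu > 4.96 would be wrong over 99% of the time
```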

Of course, I would want to use the actual result, rather than the cut-off for rejection (as with power) but the reasoning is the same, and here I deliberately let the outcome just hit the cut-off for rejection.

~~~~~~~~~~~~~~

The (Type 1, 2 error probability) trade-off vanishes

Notice what happens if we consider the “real Type 1 error” as Pr(H0|z0).

Since Pr(H0|z0) decreases with increasing power, it decreases with decreasing Type 2 error. So we know that to identify the “Type 1 error” with Pr(H0|z0) is to use language in a completely different way from the one in which power is defined. For there, we must have a trade-off between Type 1 and Type 2 error probabilities.
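
A short sketch (my addition) of the point, using the .5 priors as above: Pr(H0 | reject) = α/(α + power), which only shrinks as power grows.

```python
# A sketch (my addition): with priors of .5 on H0 and H1, the "real Type 1 error"
# Pr(H0 | reject) = alpha/(alpha + power) only shrinks as power grows,
# unlike the genuine alpha-vs-beta trade-off at a fixed sample size.
alpha = 0.025
for power in (0.5, 0.8, 0.999):
    pr_H0_given_reject = alpha / (alpha + power)
    print(f"power = {power}: Pr(H0|reject) = {pr_H0_given_reject:.3f}")
# power = 0.5:   0.048
# power = 0.8:   0.030
# power = 0.999: 0.024
```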

Upshot (modified 8p.m. 5/23/16)

Using the ratio of power to size, (1 – β)/α, as a likelihood ratio, or as an indication of the pre-data strength of evidence to accord a rejection, is a bad idea for anyone who wants to assess the comparative evidence by likelihoods. The error statistician is not in the business of making inferences to point values, nor of making comparative appraisals of different point hypotheses (much less do we wish to be required to assign priors to the point hypotheses). Criticisms often start out by forming these ratios and then blaming the “tail areas” for exaggerating the evidence against. We don’t form those ratios. My point here, though, is that this gambit also serves very badly for a Bayes ratio or likelihood assessment. (Likelihoodlums* and Bayesians, please weigh in on this.)

This is related to several posts having to do with allegations that p-values overstate the evidence against the null hypothesis, such as this one.

Please alert me to errors.

*Michael Lew’s term.

REFERENCES

Bayarri, M., Benjamin, D., Berger, J., & Sellke, T. 2016 (in press). “Rejection Odds and Rejection Ratios: A Proposal for Statistical Practice in Testing Hypotheses,” Journal of Mathematical Psychology.

Benjamin, D. & Berger, J. 2016. “Comment: A Simple Alternative to P-values,” The American Statistician (online March 7, 2016).

Cohen, J. 1988. Statistical Power Analysis for the Behavioral Sciences. 2nd ed. Hillsdale, NJ: Erlbaum.

Mayo, D. 2016. “Don’t throw out the Error Control Baby with the Error Statistical Bathwater“. (My comment on the ASA document)

Mayo, D. 2003. Comments on J. Berger’s “Could Jeffreys, Fisher and Neyman Have Agreed on Testing?” (pp. 19-24).

Senn, S. 2008. Statistical Issues in Drug Development, 2nd ed. Chichester, West Sussex: Wiley Interscience, John Wiley & Sons.

Wasserstein, R. & Lazar, N. 2016. “The ASA’s Statement on P-values: Context, Process and Purpose”, The American Statistician (online March 7, 2016).

[1] I don’t say the Rejection Ratio can have no frequentist role. It may arise in a diagnostic screening or empirical Bayesian context.

[2] It may also be found in Neyman! (Search this blog under Neyman’s Nursery.) However, Cohen uniquely provides massive power computations, before it was all computerized.


36 thoughts on “Frequentstein: What’s wrong with (1 – β)/α as a measure of evidence against the null? (ii)”

  1. For the first error, see my definition of power. (It must be relative to a given alternative hypothesis.) For the second, see my criticism of NHST as going directly from statistical significance to a genuine phenomenon. This second, we may forgive as just illustrative of the general idea. The first, committed at least twice (please point out any others), is more serious. It may be tied to Cohen’s tendency to imagine one has fixed the d corresponding to the alternative–but this is troublesome.

  2. Erkan Buzbas
    Assistant Professor
    Department of Statistical Science
    University of Idaho

  3. Carlos Ungil

    > Now BBBS also object to the Rejection Ratio, but only because it’s not sensitive to the actual outcome; so they recommend the Bayes Factor post data. My criticism is much, much deeper. To get around the data-dependent part, let’s assume throughout that we’re dealing with a result just statistically significant at the α level.

    How do you “get around the data-dependent part” by assuming a particular outcome (the just statistically significant one)?

    BBBS’s recommendation is, as you quoted, to report the pre-experimental rejection ratio when presenting the experimental design (and not when presenting the experimental results). They even mention that “(its) use after seeing the data has been rightly criticised by many”.

    • Carlos: First, thanks for your comment. Were it not for the number of hits, I would have thought no one was out there at all, the way I get no comments these days.

      In response to your comment, yes, they object to it because of the lack of data-specificity. Just to check that I was grasping their view correctly, I asked Berger straight out at the FUSION conference, and why it seemed topsy turvy–expecting him not to agree. It was at a lunch break prior to his talk. But to my surprise he did agree (with my objection), and even seemed surprised they hadn’t noticed it. After the conference we exchanged e-mails and he revised his slides to reflect my point somewhat. Even as a pre-data assessment of evidence against or expected evidence against it won’t do.

      • Carlos Ungil

        > Even as a pre-data assessment of evidence against or expected evidence against it won’t do.

        Could you explain why? Their claim is conditional *only* on the result being statistically significant (and does not apply for any particular value of the test statistic). They propose to report the pre-experimental rejection ratio (or the pre-experimental rejection odds, to include the prior) when presenting the experimental design, but your critique is based on a post-data argument.

    • Carlos: Note I changed “only” to “largely”, in your first sentence.
      Their remark alludes to what would be conveyed by the experimental result being statistically significant:

      “The rejection ratio has a straightforward interpretation as quantifying the strength of evidence about the alternative hypothesis relative to the null hypothesis conveyed by the experimental result being statistically significant.”

      My point is that this isn’t so.

      • Carlos Ungil

        I still don’t understand what is your objection to BBBS.

        They say: “In section 2, we take a pre-experimental perspective (…) The (pre-experimental) ‘rejection ratio’ (…) is shown to capture the strength of evidence in the experiment for H1 over H0″.

        You say:”But in fact it does no such thing!”

        They discuss extensively how, and under what conditions, the rejection ratio can be used. Do you think there is any particular issue with the arguments they present or do you just object to the fact that they characterise the result as “capturing the strength of evidence” in that paragraph?

      • Carlos Ungil

        Their remark seems correct to me (when their assumptions are satisfied: either the null hypothesis or the alternative are true, and it is conditional on the result being statistically significant but in the absence of any additional information about the outcome).

        • Carlos: No, take a look at my example w/alternative 4.96, or take 10.96 (with power = 1).

          • Carlos Ungil

            I don’t see how your example contradicts BBBS. If I know that the true value is either mu=4.96 or mu=0 and you tell me that you got a statistically significant result, then as far as I know the strength of the evidence is indeed 40 times larger for the alternative than for the null. This is basic probability.

            • Michael Lew

              Carlos, your final comment is pretty interesting. However, I think that you have missed two essential aspects.

              First, you suggest that “I know that the true value is either mu = 4.96 or mu = 0” and in such a case I would agree with the sentiment that where the data support one of those parameter values by 40 times more than the other it would be sensible to go with that parameter value. However, remember that the Berger ratio is NOT a likelihood ratio. It seems to me that the ratio of real likelihoods for the example would favour the mu = 0 option.

              Second, only a very small fraction of real world examples would allow you to say that you know that the parameter value is exactly one of two distinct values. More generally we would say that the value could be either (e.g.) 0 or 4.96 for the purposes of pre-data experimental design, but after the data are in hand we pay at least minimal attention to what those data say. In Mayo’s example the data could be shouting fairly loudly that the correct value of the parameter is 1.96, not 0 and not 4.96.

              Your recourse to dismissive “This is basic probability” is thus way off the mark.

              • Carlos Ungil

                I have not missed those aspects. My point is that the remarks of BBBS about the rejection ratio seem correct ***when their assumptions are satisfied***.

                I suggest that either H0 or H1 is true because this is the same assumption that BBBS make. You say that “the ratio of real likelihoods for the example would favour the mu = 0 option.” But that example goes beyond what BBBS claim, their recommendation is to report the pre-experimental rejection ratio when presenting the experimental design. At that point there is no specific outcome to condition on (but you can reason about what information the experiment may provide conditional on the result being statistically significant).

                Your second point is again about post-data analysis (and I agree that the most sensible thing to do may be to ignore all the hypothesis testing stuff and just take the point estimate!). I hope I don’t sound dismissive if I reiterate that I think it is irrelevant to the discussion because we are restricting ourselves to the pre-data situation (i.e. conditioning only on the result being statistically significant). You also touch on the exhaustivity of H0/H1, which is another assumption in the paper (H1 can be a single point or a distribution; if the latter, the average power can be calculated).

                • Michael Lew

                  OK, if you are restricting yourself to exactly the BBS claim then you may be correct. However, I was commenting on the assumption that the BBS claim was being applied to Mayo’s example.

                  Nonetheless, I would dispute the validity of any claim about “strength of evidence” from any pre-data calculation. I would also dispute a claim of “strength of evidence” when the data are represented as only ‘significant’ or ‘not significant’. I note that my reservations come from a respect for the likelihood principle and from a tendency to always want to know ‘how much?’, a tendency that your parenthetical comment suggests that you share.

                  • Carlos Ungil

                    These are valid concerns. The “rejection ratio” is little more than a power calculation. It can have its place when discussing the design of an experiment. Too many experiments are grossly underpowered. For example, http://www.nature.com/nrn/journal/v14/n5/pdf/nrn3475.pdf shows that the median statistical power in neuroscience is 21%. I don’t like hypothesis testing much, but I like badly done hypothesis testing even less.

                    A completely unrelated remark (and I think you will agree on this) is that if one is a hardcore Neyman-Pearson hypothesis tester then the actual outcome is irrelevant and the only evidence obtained is whether the result is statistically significant or not. In that context one cannot do much better than using the rejection ratio to quantify the strength of evidence.

                    • Carlos: Just on your N-P points, the answers are “no” and “no”. I’m not a hardcore N-P tester–whatever that is. The myth that N-P tests tell you to report “reject” or “do not reject” at a pre-fixed level is just that. A myth, that is a lie. They worked out optimality properties of tests, but never applied tests as automatic accept/reject routines, nor do their papers recommend it. N-P started with a third “remain uncertain” category, by the way, but was trying to keep close to Fisher’s tests. N-P recommended computing and reporting the actual significance level attained. See Lehmann. All these things are discussed throughout the blog, you might search Neyman, or check “all she wrote so far” in 4.5 years.

                      From your first remarks, I suspect that you’re falling into the “making mountains out of molehills” fallacy of my last post.

                    • Carlos Ungil

                      Mayo: my N-P points were addressed to Michael Lew. Of course you don’t need my permission to chime in, I just want to reassure you that I didn’t expect you to agree. I was not thinking of you as a hardcore N-P hypothesis tester either (or a believer in the myth of automatic accept/reject routines, if you like this description better).

                • Carlos: I see that Lew’s correct reply to you hasn’t convinced you of the problem: I just picked any old alternative against which the test has high power. I then applied their claim that
                  “the ratio of statistical power to significance threshold (i.e., the ratio of the probability of rejecting under H1 and H0 respectively), is shown to capture the strength of evidence in the experiment for H1 over H0.”

                  But this claim is false, and we know it’s false in advance, because a just significant result, say z = 1.96 doesn’t give comparatively stronger evidence to, say, 10.96 than to 0. The probability is ~1 that you’d have gotten a larger difference than 1.96, were the data generated from a universe where the mean is 10.96. You’d be wrong with probability 1 if you followed such a method in general.
                  For a likelihoodist, the complaint is put in terms of the fact that 0 is far more likely than 10.96, given z.

                  Now as I indicated in my FUSION post,
                  https://errorstatistics.com/2016/04/11/when-the-rejection-ratio-1-%CE%B2%CE%B1-turns-evidence-on-its-head-for-those-practicing-in-the-error-statistical-tribe/
                  but perhaps it bears repeating, Berger immediately agreed with me, and he modified a presentation based on this paper. His grounds for agreeing were the likelihood grounds. But since his privately recognizing the mistake doesn’t change the published paper, it’s worth discussing and preventing more misunderstanding.

                  • Carlos Ungil

                    Mayo: I don’t know if you find that Lew’s later reply (“if you are restricting yourself to exactly the BBS claim then you may be correct”) is also correct. I’ll stress again that their claim is pre-data and if you think your argument shows that the claim is false, let me show you that your claim is false.

                    (I’m not 100% sure, but it seems that) you maintain that “z=1.96 gives comparatively stronger evidence to 0 than to 10.96”

                    If the true value is 10.96 and you say that z=1.96 gives more evidence to 0 than to 10.96 you will be wrong with probability 1.

                    (To be clear: I don’t think the argument is valid, for the very same reason that I don’t think your argument is valid).

                    • Carlos: I guess you don’t understand tests, but leave my claim as merely the denial of theirs, that’s all I need. Remember, you want your argument to go through with minimal assumptions.

                    • Carlos Ungil

                      I don’t understand what “you want your argument to go through with minimal assumptions” means. I guess I don’t understand tests either. Anyway, paraphrasing something you wrote elsewhere, I understand their claim to be:

                      Hence, before the data is observed it is known that the (pre-experimental) ‘rejection ratio’ captures the strength of evidence in the experiment for H1 over H0. This is a measure of initial strength of evidence for a statistically significant result. Once the data is known, this rejection ratio cannot legitimately be attached to the specific outcome (i.e., it cannot be used as a measure of final strength of evidence).

            • Carlos: No, this is the part where Senn points out the nonsensical and the ludicrous. But I see that Michael Lew has replied to you on this.

  4. Michael Lew

    Mayo, I’m not sure what sort of comment you expect from a likelihoodlum. You’ve come to the correct conclusion: “The conclusion is that using size and power as likelihoods is a bad idea for anyone who wants to assess the comparative evidence by likelihoods.” The real reason why one should not use a ratio of non-likelihood “likelihoods” is that it is the ratio of the actual likelihoods that gives a measure of the model-bound evidence in the data about the parameter values under consideration.

    I think that discussions like the one you want to spark require some specification of what the key words mean. Crucially, I do not know what you mean by “evidence” because you seem to exclude at least some of the features that I would expect of “evidence”. For example, you say that it is not the job of an error statistician to make inferences about “comparative appraisals of different point hypotheses”, but I don’t see how one can look at the data and not see that it has something to say in evidential terms about the relative merits of point hypotheses.

    I think we would make more progress if we agreed on terms before arguing about them. Thus, I ask you to please define “evidence” and explain its functional features. Your definition will no doubt differ from mine (and those of Edwards, Royall, early Hacking and Fisher, I would assume), so let’s get it clear.

    My specification of evidence would start by noting that the statistical evaluation of evidence is necessarily model-bound. Choose a different model and the evidence will be different. Next I would say that the evidence is within the data. Details of the experimental design can alter how one should respond to the evidence in much the same way as a criminal alibi provided by the accused’s mother might be discounted compared to an equivalent alibi provided by a disinterested observer.

    Under my definition of “evidence” there is no evidence until the data are available and that would mean that the pre-data ratio of power over test size is a ratio unrelated to the evidence. It might affect how the evidence should be received, but it cannot be evidence itself.

    • Michael: I was, at that point, just saying that possibly likelihoodists disagree with me in saying the pre-data rejection ratio doesn’t work for them either. I am deliberately, in this post, moving away from having the issue be all about formal likelihoods. I was trying to explain why it might look plausible to compare “explanations” in this manner. The problem, to put it most generally, is one of the fallacy of affirming the consequent. High power to detect H1 says: if H1 were a correct description of the data generation, then rejection of H0 would be very probable. Now that you’ve rejected H0, it doesn’t follow that H1 is a good explanation/correct description (or what have you) of the data generation.
      I deny that we need an agreed upon definition of evidence in order to show that a given measure fails in its evidential assessment. I was trying to keep to the terms in BBBS without demanding they first define “strength of evidence”. Since they regard their argument as relevant for a frequentist tester, it’s obviously relevant to show why the result is at odds with that view. However, I was trying to do more, and wrt Berger at least, I did. I was trying to say it couldn’t be what he wants either–since, for example, x can be closer to H0 than to the H1 against which the test has high power. Still, I acknowledged that a likelihood person could disagree, although I didn’t think you would.
      Bottom line, BBBS are trying to give an argument that stands or has some weight for all or most tribes, at least frequentist and Bayesian ones, because they specifically give that as a desideratum at the outset.

      • Michael Lew

        “I deny that we need an agreed upon definition of evidence in order to show that a given measure fails in its evidential assessment.” I’m surprised. Surely we need at least some agreement on what would represent a pass or fail in “evidential assessment”. If you want to use the error statistical considerations as the be-all-and-end-all of evidence, then that is your right, I suppose. However, to do so without being prepared to say that that is what you are doing is disingenuous and unhelpful. And without clear specification of evidential assessment you cannot expect others to comply with its tenets.

        I like the fact that likelihoodlums make explicit what they mean by evidential favouring, and I wish that other participants in the discussions would be equally open and clear.

        • Michael:
          I’m examining, to start with, the claims of Benjamin and Berger, and BBBS. Take a look:

          “The rejection ratio has a straightforward interpretation as quantifying the strength of evidence about the alternative hypothesis relative to the null hypothesis conveyed by the experimental result being statistically significant”. (Benjamin and Berger 2016, p. 1)

          “The (pre-experimental) ‘rejection ratio’ Rpre, the ratio of statistical power to significance threshold (i.e., the ratio of the probability of rejecting under H1 and H0 respectively), is shown to capture the strength of evidence in the experiment for H1 over H0”. (BBBS 2016, p. 2)

          Now my criticism is that on no sensible measure of strength of evidence has this been “shown”–including the conceptions the authors hold–but in fact we can show it’s topsy turvy. It violates error statistical, frequentist, likelihoodist, and confidence concept requirements. A good philosopher will always strive for the strongest criticism, the one resting on as few assumptions as possible. That’s why my criticism is much stronger than saying, “well it all depends on how you define evidence”.

          Quite aside from the current critical task, I’ve defined error statistical evidence.

          • Michael Lew

            Mayo, I think that you have misunderstood my position. I agree with you that the Berger ratio is not a measure of evidence, and I do not wish to say that there is a definition of evidence that would make that ratio into a reasonable measure. I’ll say it again: I agree with you regarding Berger’s ratio.

            Perhaps, if you now understand that I agree with you regarding the ratio of power and size, you could now read my comments again to see if there is something else in them.

            • Michael: I’ve reread your comments, and I’m still not sure what cryptic message is hiding there–why not just tell it straight out.

              • Michael Lew

                Mayo, how about the following bit, for one:

                you say that it is not the job of an error statistician to make inferences about “comparative appraisals of different point hypotheses”, but I don’t see how one can look at the data and not see that it has something to say in evidential terms about the relative merits of point hypotheses.

                Then there’s this:

                Surely we need at least some agreement on what would represent a pass or fail in “evidential assessment”. If you want to use the error statistical considerations as the be-all-and-end-all of evidence, then that is your right, I suppose. However, to do so without being prepared to say that that is what you are doing is disingenuous and unhelpful. And without clear specification of evidential assessment you cannot expect others to comply its tenets.

                • Michael: Firstly, I should have noted that I’m glad you agree about power/alpha not being a good measure on likelihoodist grounds.

                  “you say that it is not the job of an error statistician to make inferences about “comparative appraisals of different point hypotheses”, but I don’t see how one can look at the data and not see that it has something to say in evidential terms about the relative merits of point hypotheses.”

                  I said we’re “not in the business” of merely reporting comparative likelihoods, I didn’t mean we’re unable to see or compute likelihoods, which we do all the time. If the model is adequate and there are no selection effects, they’re fine for comparative fit measures.

                  Your next remark I think I already addressed in my earlier comment.

  5. My previous post got a comment just now from Andrew Gelman:
    Andrew Gelman

    Deborah:

    You’ll love my forthcoming post, “Why the garden-of-forking-paths criticism of p-values is not like a famous Borscht Belt comedy bit.” No kidding.

    Check there for my reply.

  6. Michael Lew

    Mayo, this departs from the main topic of your post, but it is sparked by your suggestion of May 24 that we should “See Lehman” to see how it is a “myth” that N-P tests should be reported as reject or do not reject. You have made similar suggestions in the past and I have complied with your suggestion.

    My copy of Lehman’s book Testing Statistical Hypotheses is not the most recent edition, being from 1959, but I imagine that it represents his opinion on the topic fairly clearly. I can’t see why any reader would form an opinion that Lehman prefers any other approach than accept and reject. In most places where he can write about rejection regions and decisions he does so.

    Lehman does write on page 62 that “It is good practice to determine not only whether the hypothesis is accepted or rejected, but also to determine the smallest significance level,” [symbols I cannot include] “, the critical level, at which the hypothesis would be rejected for the given observation.” Fine, one might interpret that as support for a graded response to the evidence, but he does not say anywhere that I can find how one’s decision should be coloured, affected or influenced as part of that “good practice”. Furthermore, he includes the clear implication that the hypothesis is already accepted or rejected prior to the determination of the “critical level” (which is, of course, the P-value, an object that he does not name anywhere in the book, as far as I can tell). That section offers only very minimal support for the notion that Lehman was not in favour of a generally all or none accept/reject decision as the primary outcome of an N-P test.

    I note that Lehman does in places mention the option of neither accepting nor rejecting the hypotheses, but the implied trichotomy is still a long way shy of using the P-value as a graded index of the strength of evidence. I have consulted more recent editions and the third edition does mention P-values by name in the section I quoted above, but still emphasises accept/reject decisions.

    Where should I be looking to “See Lehman” for evidence that it is a “myth, that is a lie” that the results of N-P tests are reported as accept/reject decisions?

    • Michael: Here’s a relevant quote from Lehmann and Romano (note the spelling of Lehmann):

      [I]t is good practice to determine not only whether the hypothesis is accepted or rejected at the given significance level, but also to determine the smallest significance level…at which the hypothesis would be rejected for the given observation. This number, the so-called p-value gives an idea of how strongly the data contradict the hypothesis. It also enables others to reach a verdict based on the significance level of their choice. (Lehmann and Romano 2005, pp. 63-4).

      Equally important to note are the ways N and P use tests in practice. Then there’s the rather more complex philosophy underlying N-P’s idea of inference as a kind of action. You’ll find the same verbiage in others who wish to distinguish their view of inference from probabilists, e.g., Peirce, Popper.

      Still, on the formal side, it must be admitted that Lehmann’s mathematical formulation of tests did much to promote the view of tests as decisions–much more than N-P. He was immersed in the mathematics of optimality, but if you ask him about the idea of fixing alpha and accepting or rejecting, he’d tell you he finds the prospect horrifying. There was an interview of Lehmann someplace on my blog where he says this, but I know from talking with him many times directly. Also see his paper, linked on this blog, on N-P tests, one theory or two.

      • Michael Lew

        OK, that’s about as I thought. I’ll deal with the substance of your reply in order of appearance.

        That part of Lehmann’s book that you quote (Lehmann & Romano is the third edition of the book that I quoted from) does mention P-values by name, and notes how one might choose to interpret them. However, the section has not changed between editions in any substantive way. Those P-values are included only as an optional accessory to the accept/reject decision in the quote you give, exactly as they are in the original quote that I gave. When I last looked at that third edition of Lehmann’s book (it’s in the library, not on my shelf) that single sentence was the only place that I could find reference to the evidence in a non-dichotomous, non-trichotomous manner. As Lehmann did not include the word P-value until the third edition, an edition with a second author, I will speculate that he did not really want to mention P-values and the graded response to evidence that they enable.

        Next you tell us to look at how Neyman and Pearson used tests in practice to determine how they used them to make, or support, inferences. Fine, I can see how that is sensible. However, a simple reading of the original papers in which they specify the methods should suffice for defining those methods. The mathematics of those papers deal only with a dichotomisation of outcomes.

        Your suggestion that we look at the writings of philosophers to find the true meaning of the N-P framework is probably reasonable in the abstract: philosophical argument and thought is very useful for determining which parts of ideas are well supported. However, at the same time that suggestion is silly, because no statistics student will turn to the philosophical literature for instruction in statistics!

        Finally you agree with me that Lehmann’s maths (like Neyman’s) comes from a dichotomisation of outcomes, but suggest that Lehmann gave a different account in person. Was he working on his book(let) “Fisher, Neyman and the Creation of Classical Statistics” at the time? Is that where we should turn for a definition of the N-P testing framework?

        So, it is not really a myth or a lie that Lehmann’s instruction supports exactly the dichotomous response to data that Carlos attached to “hardcore” N-P testers. You have previously called my description of N-P testing as a “caricature”, but it seems that my account was, in fact, pretty close to Lehmann’s account.

        • Michael: Given how much of the 4.5 years of blogposts has been devoted to very specific arguments and detailed citations and links from original papers on precisely this issue, your remarks baffle me, and make me feel that, at least for some readers, it was all just a waste of time.

          (1) My view is that for our uses of method and solving today’s problems of statistical inference, it doesn’t matter what the founders really, really thought, that it’s the properties of the methods that matter– “it’s the methods, stupid!” (as I argue in numerous posts). The developers of the formal methods, even when they weren’t engaged in personality and professional disputes, rarely stepped back to consider the relationships of their methods to a general philosophy of statistics. I’ve brought out the instances where they do (in this blog), and provided a general philosophy of science for using our OWN BRAINS to bring the methods into the current world of statistical inference and modeling, where conceptual and logical confusion exists.

          Perhaps, take a look at the paper I wrote with Sir David Cox, “Frequentist Statistics as a Theory of Inductive Inference” (2006): http://www.phil.vt.edu/dmayo/personal_website/Ch%207%20mayo%20&%20cox.pdf

          (2) I never said anywhere that people should look to philosophy of statistics, which has been almost exclusively Bayesian, for their interpretation of N-P.

          (3) Yes it is “really a myth” and “a lie that Lehmann’s instruction supports” the dichotomous response to data. As I’ve already said, he developed N-P methods in a more decision-theoretic manner than either Neyman or Pearson. Any readers interested in where Neyman and Pearson, together or separately, reflect on foundations can search Neyman, Pearson, and yes Lehmann on this blog.
          A list of posts are here:
          https://errorstatistics.com/2016/03/22/all-she-wrote-so-far-error-statistics-philosophy-4-5-years-on/

          Just one of dozens of examples is “Performance or Probativeness: E.S. Pearson’s Statistical Philosophy”: https://errorstatistics.com/2015/08/14/performance-or-probativeness-e-s-pearsons-statistical-philosophy/

          Pearson defines three steps in specifying tests:
          “Step 1. We must specify the experimental probability set, the set of results which could follow on repeated application of the random process used in the collection of the data…
          Step 2. We then divide this set [of possible results] by a system of ordered boundaries…such that as we pass across one boundary and proceed to the next, we come to a class of results which makes us more and more inclined on the Information available, to reject the hypothesis tested in favour of alternatives which differ from it by increasing amounts” (Pearson 1966a, 173).
          “Step 3. We then, if possible[i], associate with each contour level the chance that, if [the null] is true, a result will occur in random sampling lying beyond that level” (ibid.).
          Pearson warns that:
          “Although the mathematical procedure may put Step 3 before 2, we cannot put this into operation before we have decided, under Step 2, on the guiding principle to be used in choosing the contour system. That is why I have numbered the steps in this order.” (Ibid. 173).

          Even Neyman (the more behavioristic of the two) declared: “Distinguishing tests of statistical hypotheses and tests of significance might have been a lapse of someone’s pen”
          https://errorstatistics.com/2015/08/05/neyman-distinguishing-tests-of-statistical-hypotheses-and-tests-of-significance-might-have-been-a-lapse-of-someones-pen-2/

          In Pearson’s (1955) response to Fisher:
          http://www.phil.vt.edu/dmayo/personal_website/Pearson%201955.pdf

          “it was not till after the main lines of this theory had taken shape with its necessary formalization in terms of critical regions, the class of admissible hypotheses, the two sources of error, the power function, etc., that the fact that there was a remarkable parallelism of ideas in the field of acceptance sampling became apparent. Abraham Wald’s contributions to decision theory of ten to fifteen years later were perhaps strongly influenced by acceptance sampling problems, but that is another story.” (ibid., 204-5).

          Readers may learn more about the difference between the behavioral and evidential interpretations of N-P statistics in some papers to appear in a post for Allan Birnbaum’s birthday tomorrow—assuming I don’t decide I’m too exasperated to bother posting it.

  7. Michael: I will reply to your question tomorrow. I’ve just returned from a long trip back to the island of Elba, and the ferry ride was more tumultuous than normal.

I welcome constructive comments for 14-21 days. If you wish to have a comment of yours removed during that time, send me an e-mail.
