fallacy of rejection

Frequentstein’s Bride: What’s wrong with using (1 – β)/α as a measure of evidence against the null?

Slide1

.

ONE YEAR AGO: …and growing more relevant all the time. Rather than leak any of my new book*, I reblog some earlier posts, even if they’re a bit scruffy. This was first blogged here (with a slightly different title). It’s married to posts on “the P-values overstate the evidence against the null fallacy”, such as this, and is wedded to this one on “How to Tell What’s True About Power if You’re Practicing within the Frequentist Tribe”. 

In their “Comment: A Simple Alternative to p-values,” (on the ASA P-value document), Benjamin and Berger (2016) recommend researchers report a pre-data Rejection Ratio:

It is the probability of rejection when the alternative hypothesis is true, divided by the probability of rejection when the null hypothesis is true, i.e., the ratio of the power of the experiment to the Type I error of the experiment. The rejection ratio has a straightforward interpretation as quantifying the strength of evidence about the alternative hypothesis relative to the null hypothesis conveyed by the experimental result being statistically significant. (Benjamin and Berger 2016, p. 1)

The recommendation is much more fully fleshed out in a 2016 paper by Bayarri, Benjamin, Berger, and Sellke (BBBS 2016): Rejection Odds and Rejection Ratios: A Proposal for Statistical Practice in Testing Hypotheses. Their recommendation is:

…that researchers should report the ‘pre-experimental rejection ratio’ when presenting their experimental design and researchers should report the ‘post-experimental rejection ratio’ (or Bayes factor) when presenting their experimental results. (BBBS 2016, p. 3)….

The (pre-experimental) ‘rejection ratio’ Rpre , the ratio of statistical power to significance threshold (i.e., the ratio of the probability of rejecting under H1 and H0 respectively), is shown to capture the strength of evidence in the experiment for Hover H0. (ibid., p. 2)

But it does no such thing! [See my post from the FUSION 2016 conference here.] J. Berger, and his co-authors, will tell you the rejection ratio (and a variety of other measures created over the years) are entirely frequentist because they are created out of frequentist error statistical measures. But a creation built on frequentist measures doesn’t mean the resulting animal captures frequentist error statistical reasoning. It might be a kind of Frequentstein monster! [1]

~~~~~~~~~~~~~~

The Law of Comparative Support

It comes from a comparativist support position which has intrinsic plausibility, although I do not hold to it. It is akin to what some likelihoodists call “the law of support”: if H1 make the observed results probable, while H0 make them improbable, then the results are strong (or at least better) evidence for H1 compared to H0 . It appears to be saying (sensibly) that you have better evidence for a hypothesis that best “explains” the data, only this is not a good measure of explanation. It is not generally required H0 and H1 be exhaustive. Even if you hold a comparative support position, the “ratio of statistical power to significance threshold” isn’t a plausible measure for this. Now BBBS also object to the Rejection Ratio, but only largely because it’s not sensitive to the actual outcome; so they recommend the Bayes Factor post data. My criticism is much, much deeper. To get around the data-dependent part, let’s assume throughout that we’re dealing with a result just statistically significant at the α level.

~~~~~~~~~~~~~~

Take a one-sided Normal test T+: with n iid samples:

H0: µ ≤  0 against H1: µ >  0

σ = 10,  n = 100,  σ/√n =σx= 1,  α = .025.

So the test would reject H0 iff Z > c.025 =1.96. (1.96. is the “cut-off”.)

People often talk of a test “having a power” but the test actually specifies a power function that varies with different point values in the alternative H1 . The power of test T+ in relation to point alternative µ’ is

Pr(Z > 1.96; µ = µ’).

We can abbreviate this as POW(T+,µ’).

~~~~~~~~~~~~~~

Jacob Cohen’s slips

By the way, Jacob Cohen, a founder of power analysis, makes a few slips in introducing power, even though he correctly computes power through the book (so far as I know). [2] Someone recently reminded me of this, and given the confusion about power, maybe it’s had more of an ill effect than I assumed.

In the first sentence on p. 1 of Statistical Power Analysis for the Behavioral Sciences, Cohen says “The power of a statistical test is the probability it will yield statistically significant results.” Also faulty, and for two reasons, is what he says on p. 4: “The power of a statistical test of a null hypothesis is the probability that it will lead to the rejection of the null hypothesis, i.e., the probability that it will result in the conclusion that the phenomenon exists.”

Do you see the two mistakes? 

~~~~~~~~~~~~~~

Examples of alternatives against which T+ has high power:

  • If we add σx (i.e.,σ/√n) to the cut-off  (1.96) we are at an alternative value for µ that test T+ has .84 power to detect. In this example, σx = 1.
  • If we add 3σto the cut-off we are at an alternative value for µ that test T+ has ~ .999 power to detect. This value, which we can write as µ.999 = 4.96

Let the observed outcome just reach the cut-off to reject the null, z= 1.96.

If we were to form a “rejection ratio” or a “likelihood ratio” of μ = 4.96 compared to μ0 = 0 using

[POW(T+, 4.96)]/α,

it would be 40.  (.999/.025).

It is absurd to say the alternative 4.96 is supported 40 times as much as the null, even understanding support as comparative likelihood or something akin. The data 1.96 are even closer to 0 than to 4.96. The same point can be made with less extreme cases.) What is commonly done next is to assign priors of .5 to the two hypotheses, yielding

Pr(H0|z0) = 1/(1 + 40) = .024, so Pr(H1|z0) = .976.

Such an inference is highly unwarranted and would almost always be wrong. Back to our question:

Here’s my explanation for why some think it’s plausible to compute comparative evidence this way:

I presume it stems comes from the comparativist support position noted above. I’m guessing they’re reasoning as follows:

The probability is very high that z > 1.96 under the assumption that μ = 4.96.

The probability is low that z > 1.96 under the assumption that μ = μ0 = 0.

We’ve observed z= 1.96 (so you’ve observed z > 1.96).

Therefore, μ = 4.96 makes the observation more probable than does  μ = 0.

Therefore the outcome is (comparatively) better evidence for μ= 4.96 than for μ = 0.

But the “outcome” for a likelihood is to be the specific outcome, and the comparative appraisal of which hypothesis accords better with the data only makes sense when one keeps to this.

I can pick any far away alternative I like for purposes of getting high power, and we wouldn’t want to say that just reaching the cut-off (1.96) is good evidence for it! Power works in the reverse. That is,

If POW(T+,µ’) is high, then z= 1.96 is poor evidence that μ  > μ’.

That’s because were μ as great as μ’, with high probability we would have observed a larger z value (smaller p-value) than we did. Power may, if one wishes, be seen as a kind of distance measure, but (just like α) it is inverted.

(Note that our inferences take the form μ > μ’, μ < μ’, etc. rather than to a point value.) 

In fact:

if Pr(Z > z0;μ =μ’) = high , then Z = z0 is strong evidence that  μ < μ’!

Rather than being evidence for μ’, the statistically significant result is evidence against μ being as high as μ’.
~~~~~~~~~~~~~~

A post by Stephen Senn:

In my favorite guest post by Stephen Senn here, Senn strengthens a point from his 2008 book (p. 201), namely, that the following is “nonsense”:

[U]pon rejecting the null hypothesis, not only may we conclude that the treatment is effective but also that it has a clinically relevant effect. (Senn 2008, p. 201)

Now the test is designed to have high power to detect a clinically relevant effect (usually .8 or .9). I happen to have chosen an extremely high power (.999) but the claim holds for any alternative that the test has high power to detect. The clinically relevant discrepancy, as he describes it, is one “we should not like to miss”, but obtaining a statistically significant result is not evidence we’ve found a discrepancy that big. 

Supposing that it is, is essentially  to treat the test as if it were:

H0: μ < 0 vs H1: μ  > 4.96

This, he says,  is “ludicrous”as it:

would imply that we knew, before conducting the trial, that the treatment effect is either zero or at least equal to the clinically relevant difference. But where we are unsure whether a drug works or not, it would be ludicrous to maintain that it cannot have an effect which, while greater than nothing, is less than the clinically relevant difference. (Senn, 2008, p. 201)

The same holds with H0: μ = 0 as null.

If anything, it is the lower confidence limit that we would look at to see what discrepancies from 0 are warranted. The lower .975 limit (if one-sided) or .95 (if two-sided) would be 0 and .3, respectively. So we would be warranted in inferring from z:

μ  > 0 or μ  > .3.

~~~~~~~~~~~~~~

What does the severe tester say?

In sync with the confidence interval, she would say SEV(μ > 0) = .975 (if one-sided), and would also note some other benchmarks, e.g., SEV(μ > .96) = .84.

Equally important for her is a report of what is poorly warranted. In particular the claim that the data indicate

μ > 4.96

would be wrong over 99% of the time!

Of course, I would want to use the actual result, rather than the cut-off for rejection (as with power) but the reasoning is the same, and here I deliberately let the outcome just hit the cut-off for rejection.

~~~~~~~~~~~~~~

The (Type 1, 2 error probability) trade-off vanishes

Notice what happens if we consider the “real Type 1 error” as Pr(H0|z0)

Since Pr(H0|z0) decreases with increasing power, it decreases with decreasing Type 2 error. So we know that to identify “Type 1 error” and Pr(H0|z0) is to use language in a completely different way than the one in which power is defined. For there we must have a trade-off between Type 1 and 2 error probabilities.

Upshot

Using size/ power as a likelihood ratio, or even as a preregistrated estimate of expected strength of evidence (with which to accord a rejection) is problematic. The error statistician is not in the business of making inferences to point values, nor to comparative appraisals of different point hypotheses. It’s not unusual for criticisms to start out forming these ratios, and then blame the “tail areas” for exaggerating the evidence against the test hypothesis. We don’t form those ratios. But the pre-data Rejection Ratio is also misleading as an assessment alleged to be akin to a Bayes ratio or likelihood assessment. You can marry frequentist components and end up with something frequentsteinian.

REFERENCES

Bayarri, M., Benjamin, D., Berger, J., & Sellke, T. (2016, in press). “Rejection Odds and Rejection Ratios: A Proposal for Statistical Practice in Testing Hypotheses“, Journal of Mathematical Psychology

Benjamin, D. & Berger J. 2016. “Comment: A Simple Alternative to P-values,” The American Statistician (online March 7, 2016).

Cohen, J. 1988. Statistical Power Analysis for the Behavioral Sciences. 2nd ed. Hillsdale, NJ: Erlbaum.

Mayo, D. 2016. “Don’t throw out the Error Control Baby with the Error Statistical Bathwater“. (My comment on the ASA document)

Mayo, D. 2003. Comments on J. Berger’s, “Could Jeffreys, Fisher and Neyman have Agreed on Testing?  (pp. 19-24)

*Mayo, D. Statistical Inference as Severe Testing, forthcoming (2017) CUP.

Senn, S. 2008. Statistical Issues in Drug Development, 2nd ed. Chichster, New Sussex: Wiley Interscience, John Wiley & Sons.

Wasserstein, R. & Lazar, N. 2016. “The ASA’s Statement on P-values: Context, Process and Purpose”, The American Statistician (online March 7, 2016).

[1] I don’t say there’s no context where the Rejection Ratio has a frequentist role. It may arise in a diagnostic screening or empirical Bayesian context where one has to deal with a dichotomy. See, for example, this post (“Beware of questionable front page articles telling you to beware…”)

[2] It may also be found in Neyman! (Search this blog under Neyman’s Nursery.) However, Cohen uniquely provides massive power computations, before it was all computerized.

Categories: Bayesian/frequentist, fallacy of rejection, J. Berger, power, S. Senn | 17 Comments

Slides from the Boston Colloquium for Philosophy of Science: “Severe Testing: The Key to Error Correction”

Slides from my March 17 presentation on “Severe Testing: The Key to Error Correction” given at the Boston Colloquium for Philosophy of Science Alfred I.Taub forum on “Understanding Reproducibility and Error Correction in Science.”

 

Categories: fallacy of rejection, Fisher, fraud, frequentist/Bayesian, Likelihood Principle, reforming the reformers | 16 Comments

The “P-values overstate the evidence against the null” fallacy

3077175-lg

.

The allegation that P-values overstate the evidence against the null hypothesis continues to be taken as gospel in discussions of significance tests. All such discussions, however, assume a notion of “evidence” that’s at odds with significance tests–generally Bayesian probabilities of the sort used in Jeffrey’s-Lindley disagreement (default or “I’m selecting from an urn of nulls” variety). Szucs and Ioannidis (in a draft of a 2016 paper) claim “it can be shown formally that the definition of the p value does exaggerate the evidence against H0” (p. 15) and they reference the paper I discuss below: Berger and Sellke (1987). It’s not that a single small P-value provides good evidence of a discrepancy (even assuming the model, and no biasing selection effects); Fisher and others warned against over-interpreting an “isolated” small P-value long ago.  But the formulation of the “P-values overstate the evidence” meme introduces brand new misinterpretations into an already confused literature! The following are snippets from some earlier posts–mostly this one–and also includes some additions from my new book (forthcoming). 

Categories: Bayesian/frequentist, fallacy of rejection, highly probable vs highly probed, P-values, Statistics | 46 Comments

“Tests of Statistical Significance Made Sound”: excerpts from B. Haig

images-34

.

I came across a paper, “Tests of Statistical Significance Made Sound,” by Brian Haig, a psychology professor at the University of Canterbury, New Zealand. It hits most of the high notes regarding statistical significance tests, their history & philosophy and, refreshingly, is in the error statistical spirit! I’m pasting excerpts from his discussion of “The Error-Statistical Perspective”starting on p.7.[1]

The Error-Statistical Perspective

An important part of scientific research involves processes of detecting, correcting, and controlling for error, and mathematical statistics is one branch of methodology that helps scientists do this. In recognition of this fact, the philosopher of statistics and science, Deborah Mayo (e.g., Mayo, 1996), in collaboration with the econometrician, Aris Spanos (e.g., Mayo & Spanos, 2010, 2011), has systematically developed, and argued in favor of, an error-statistical philosophy for understanding experimental reasoning in science. Importantly, this philosophy permits, indeed encourages, the local use of ToSS, among other methods, to manage error. Continue reading

Categories: Bayesian/frequentist, Error Statistics, fallacy of rejection, P-values, Statistics | 12 Comments

Glymour at the PSA: “Exploratory Research is More Reliable Than Confirmatory Research”

psa-homeI resume my comments on the contributions to our symposium on Philosophy of Statistics at the Philosophy of Science Association. My earlier comment was on Gerd Gigerenzer’s talk. I move on to Clark Glymour’s “Exploratory Research Is More Reliable Than Confirmatory Research.” His complete slides are after my comments.

GLYMOUR’S ARGUMENT (in a nutshell):Glymour_2006_IMG_0965

“The anti-exploration argument has everything backwards,” says Glymour (slide #11). While John Ioannidis maintains that “Research findings are more likely true in confirmatory designs,” the opposite is so, according to Glymour. (Ioannidis 2005, Glymour’s slide #6). Why? To answer this he describes an exploratory research account for causal search that he has been developing:

exploratory-research-is-more-reliable-than-confirmatory-research-13-1024(slide #5)

What’s confirmatory research for Glymour? It’s moving directly from rejecting a null hypothesis with a low P-value to inferring a causal claim. Continue reading

Categories: fallacy of rejection, P-values, replication research | 20 Comments

“P-values overstate the evidence against the null”: legit or fallacious?

The allegation that P-values overstate the evidence against the null hypothesis continues to be taken as gospel in discussions of significance tests. All such discussions, however, assume a notion of “evidence” that’s at odds with significance tests–generally likelihood ratios, or Bayesian posterior probabilities (conventional or of the “I’m selecting hypotheses from an urn of nulls” variety). I’m reblogging the bulk of an earlier post as background for a new post to appear tomorrow.  It’s not that a single small P-value provides good evidence of a discrepancy (even assuming the model, and no biasing selection effects); Fisher and others warned against over-interpreting an “isolated” small P-value long ago.  The problem is that the current formulation of the “P-values overstate the evidence” meme is attached to a sleight of hand (on meanings) that is introducing brand new misinterpretations into an already confused literature! 

 

Categories: Bayesian/frequentist, fallacy of rejection, highly probable vs highly probed, P-values | 3 Comments

In defense of statistical recipes, but with enriched ingredients (scientist sees squirrel)

cropped-imgp587612

Scientist sees squirrel

Evolutionary ecologist, Stephen Heard (Scientist Sees Squirrel) linked to my blog yesterday. Heard’s post asks: “Why do we make statistics so hard for our students?” I recently blogged Barnard who declared “We need more complexity” in statistical education. I agree with both: after all, Barnard also called for stressing the overarching reasoning for given methods, and that’s in sync with Heard. Here are some excerpts from Heard’s (Oct 6, 2015) post. I follow with some remarks.

This bothers me, because we can’t do inference in science without statistics*. Why are students so unreceptive to something so important? In unguarded moments, I’ve blamed it on the students themselves for having decided, a priori and in a self-fulfilling prophecy, that statistics is math, and they can’t do math. I’ve blamed it on high-school math teachers for making math dull. I’ve blamed it on high-school guidance counselors for telling students that if they don’t like math, they should become biology majors. I’ve blamed it on parents for allowing their kids to dislike math. I’ve even blamed it on the boogie**. Continue reading

Categories: fallacy of rejection, frequentist/Bayesian, P-values, Statistics | 20 Comments

How to avoid making mountains out of molehills, using power/severity

images

.

A classic fallacy of rejection is taking a statistically significant result as evidence of a discrepancy from a test (or null) hypothesis larger than is warranted. Standard tests do have resources to combat this fallacy, but you won’t see them in textbook formulations. It’s not new statistical method, but new (and correct) interpretations of existing methods, that are needed. One can begin with a companion to the rule in this recent post:

(1) If POW(T+,µ’) is low, then the statistically significant x is a good indication that µ > µ’.

To have the companion rule also in terms of power, let’s suppose that our result is just statistically significant. (As soon as it exceeds the cut-off the rule has to be modified). 

Rule (1) was stated in relation to a statistically significant result x (at level α) from a one-sided test T+ of the mean of a Normal distribution with n iid samples, and (for simplicity) known σ:   H0: µ ≤  0 against H1: µ >  0. Here’s the companion:

(2) If POW(T+,µ’) is high, then an α statistically significant x is a good indication that µ < µ’.
(The higher the POW(T+,µ’) is, the better the indication  that µ < µ’.)

That is, if the test’s power to detect alternative µ’ is high, then the statistically significant x is a good indication (or good evidence) that the discrepancy from null is not as large as µ’ (i.e., there’s good evidence that  µ < µ’).

Continue reading

Categories: fallacy of rejection, power, Statistics | 20 Comments

All I want for Chrismukkah is that critics & “reformers” quit howlers of testing (after 3 yrs of blogging)! So here’s Aris Spanos “Tallking Back!”

spanos 2014

.

 

This was initially posted as slides from our joint Spring 2014 seminar: “Talking Back to the Critics Using Error Statistics”. (You can enlarge them.) Related reading is Mayo and Spanos (2011)

images-5

Categories: Error Statistics, fallacy of rejection, Phil6334, reforming the reformers, Statistics | 27 Comments

Continued:”P-values overstate the evidence against the null”: legit or fallacious?

.

continued…

Categories: Bayesian/frequentist, CIs and tests, fallacy of rejection, highly probable vs highly probed, P-values, Statistics | 39 Comments

“P-values overstate the evidence against the null”: legit or fallacious? (revised)

0. July 20, 2014: Some of the comments to this post reveal that using the word “fallacy” in my original title might have encouraged running together the current issue with the fallacy of transposing the conditional. Please see a newly added Section 7.

Continue reading

Categories: Bayesian/frequentist, CIs and tests, fallacy of rejection, highly probable vs highly probed, P-values, Statistics | 71 Comments

Fallacy of Rejection and the Fallacy of Nouvelle Cuisine

Any Jackie Mason fans out there? In connection with our discussion of power,and associated fallacies of rejection*–and since it’s Saturday night–I’m reblogging the following post.

In February [2012], in London, criminologist Katrin H. and I went to see Jackie Mason do his shtick, a one-man show billed as his swan song to England.  It was like a repertoire of his “Greatest Hits” without a new or updated joke in the mix.  Still, hearing his rants for the nth time was often quite hilarious.

A sample: If you want to eat nothing, eat nouvelle cuisine. Do you know what it means? No food. The smaller the portion the more impressed people are, so long as the food’s got a fancy French name, haute cuisine. An empty plate with sauce!

As one critic wrote, Mason’s jokes “offer a window to a different era,” one whose caricatures and biases one can only hope we’ve moved beyond: But it’s one thing for Jackie Mason to scowl at a seat in the front row and yell to the shocked audience member in his imagination, “These are jokes! They are just jokes!” and another to reprise statistical howlers, which are not jokes, to me. This blog found its reason for being partly as a place to expose, understand, and avoid them. Recall the September 26, 2011 post “Whipping Boys and Witch Hunters”: [i]

Fortunately, philosophers of statistics would surely not reprise decades-old howlers and fallacies. After all, it is the philosopher’s job to clarify and expose the conceptual and logical foibles of others; and even if we do not agree, we would never merely disregard and fail to address the criticisms in published work by other philosophers.  Oh wait, ….one of the leading texts repeats the fallacy in their third edition: Continue reading

Categories: Comedy, fallacy of rejection, Statistical power | Tags: , , , , | 9 Comments

Blog at WordPress.com.