fallacy of rejection

Memory Lane (4 years ago): Why significance testers should reject the argument to “redefine statistical significance”, even if they want to lower the p-value*

.

An argument that assumes the very thing that was to have been argued for is guilty of begging the question; signing on to an argument whose conclusion you favor even though you cannot defend its premises is to argue unsoundly, and in bad faith. When a whirlpool of “reforms” subliminally alter  the nature and goals of a method, falling into these sins can be quite inadvertent. Start with a simple point on defining the power of a statistical test.

I. Redefine Power?

Given that power is one of the most confused concepts from Neyman-Pearson (N-P) frequentist testing, it’s troubling that in “Redefine Statistical Significance”, power gets redefined too. “Power,” we’re told, is a Bayes Factor BF “obtained by defining H1 as putting ½ probability on μ = ± m for the value of m that gives 75% power for the test of size α = 0.05. This H1 represents an effect size typical of that which is implicitly assumed by researchers during experimental design.” (material under Figure 1).

The Bayes factor discussed is of H1 over H0, in two-sided Normal testing of H0: μ = 0 versus H1: μ ≠ 0.

“The variance of the observations is known. Without loss of generality, we assume that the variance is 1, and the sample size is also 1.” (p. 2 supplementary)

“This is achieved by assuming that μ under the alternative hypothesis is equal to ± (z0.025 + z0.75) = ± 2.63 [1.96. + .63]. That is, the alternative hypothesis places ½ its prior mass on 2.63 and ½ its mass on -2.63”. (p. 2 supplementary)

Putting to one side whether this is “without loss of generality”, the use of “power” is quite different from the correct definition. The power of a test T  (with type I error probability α) to detect a discrepancy μ’ is the probability T generates an observed difference that is statistically significant at level α, assuming μ = μ’. The value z = 2.63 comes from the fact that the alternative against which this test has power .75 is the value .63 SE in excess of the cut-off for rejection. (Since an SE is 1, they add .63 to 1.96.) I don’t really see why it’s advantageous to ride roughshod on the definition of power, and it’s not the main point of this blogpost, but it’s worth noting if you’re to avoid sinking into the quicksand.

Let’s distinguish the appropriateness of the test for a Bayesian, from its appropriateness as a criticism of significance tests. The latter is my sole focus. The criticism is that, at least if we accept these Bayesian assignments of priors, the posterior probability on H0 will be larger than the p-value. So if you were to interpret a p-value as a posterior on H0 (a fallacy) or if you felt intuitively that a .05 (2-sided) statistically significant result should correspond to something closer to a .05 posterior on H0, you should instead use a p-value of .005–or so it is argued. I’m not sure of the posterior on H0, but the BF is between around 14 and 26.[1] That is the argument. If you lower the required p-value, it won’t be so easy to get statistical significance, and irreplicable results won’t be as common. [2]

The alternative corresponding to the preferred p =.005 requirement

“corresponds to a classical, two-sided test of size α = 0.005. The alternative hypothesis for this Bayesian test places ½ mass at 2.81 and ½ mass at -2.81. The null hypothesis for this test is rejected if the Bayes factor exceeds 25.7. Note that this curve is nearly identical to the “power” curve if that curve had been defined using 80% power, rather than 75% power. The Power curve for 80% power would place ½ its mass at ±2.80”. (Supplementary, p. 2)

z = 2.8 comes from adding .84 SE to the cut-off: 1.96 SE +.84 SE = 2.8. This gets to the alternative vs which the α = 0.05 test has 80% power. (See my previous post on power.)

Is this a good form of inference from the Bayesian perspective? (Why are we comparing μ = 0 and μ = 2.8?). As is always the case with “p-values exaggerate” arguments, there’s the supposition that testing should be on a point null hypothesis, with a lump of prior probability given to H0 (or to a region around 0 so small that it’s indistinguishable from 0). I leave those concerns for Bayesians, and I’m curious to hear from you. More importantly, does it constitute a relevant and sound criticism of significance testing? Let’s be clear: a tester might well have her own reasons for preferring z = 2.8 rather than z = 1.96, but that’s not the question. The question is whether they’ve provided a good argument for the significance tester to do so?

II. What might the significance tester say?

For starters, when she sets .8 power to detect a discrepancy, she doesn’t “implicitly assume” it’s a plausible population discrepancy, but simply one she wants the test to detect by producing a statistically significant difference (with probability .8). And if the test does produce a difference that differs statistically significantly from H0, she does not infer the alternative against which the test had high power, call it μ’. (The supposition that she does grows out of fallaciously transposing “the conditional” involved in power.) Such a rule of interpreting data would have a high error probability of erroneously inferring a discrepancy μ’ (here 2.8).

The significance tester merely seeks evidence of some (genuine) discrepancy from 0, and eschews a comparative inference such as the ratio of the probability of the data under the points 0 and 2.63 (or 2.8). I don’t say there’s no role for a comparative inference, nor preclude someone arguing it is comparing how well μ = 2.8 “explains” the data compared to μ = 0 (given the assumptions), but the form of inference is so different from significance testing, it’s hard to compare them. She definitely wouldn’t ignore all the points in between 0 and 2.8. A one-sided test is preferable (unless the direction of discrepancy is of no interest). While one or two-sided doesn’t make that much difference for a significance tester, it makes a big difference for the type of Bayesian analyses that is appealed to in the “p-values exaggerate” literature. That’s because a lump prior, often .5 (but here .9!), is placed on the point 0 null. Without the lump, the p-value tends to be close to the posterior probability for H0, as Casella and Berger (1987a,b) show–even though p-values and posteriors are actually measuring very different things.

“In fact it is not the case that P-values are too small, but rather that Bayes point null posterior probabilities are much too big!….Our concern should not be to analyze these misspecified problems, but to educate the user so that the hypotheses are properly formulated,” (Casella and Berger 1987 b, p. 334, p. 335).

There is a long and old literature on all this (at least since Edwards, Lindman and Savage 1963–let me know if you’re aware of older sources).

Those who lodge the “p-values exaggerate” critique often say, we’re just showing what would happen even if we made the strongest case for the alternative. No they’re not. They wouldn’t be putting the lump prior on 0 were they concerned not to bias things in favor of the null, and they wouldn’t be looking to compare 0 with so far away an alternative as 2.8 either.

The only way a significance tester can appraise or calibrate a measure such as a BF (and these will differ depending on the alternative picked) is to view it as a statistic and consider the probability of an even larger BF under varying assumptions about the value of μ. This is an error probability associated with the method. Accounts that appraise inferences according to the error probability of the method used I call error statistical (which is less equivocal than frequentist or other terms.)

For example, rejecting H0 when z ≥ 1.96 (which is the .05 test, since they make it 2-sided), we said, had .8 power to detect μ = 2.8, but with the .005 test it has only 50% power to do so. If one insists on a fixed .005 cut-off, this is construed as no evidence against the null (or even evidence for it–for a Bayesian). The new test has only 30% probability of finding significance were the data generated by μ = 2.3. So the significance tester is rightly troubled by the raised type II error [3], although the members of an imaginary Toxic Co. (having the risks of their technology probed) might be happy as clams.[4]

Suppose we do attain statistical significance at the recommended .005 level, say z = 2.8. The BF advocate assures us we can infer μ = 2.8, which is now 25 times as likely as μ = 0, (if all the various Bayesian assignments hold). The trouble is, the significance tester doesn’t want to claim good evidence for μ = 2.8. The significance tester merely infers an indication of a discrepancy (an isolated low p-value doesn’t suffice, and the assumptions also must be checked). She’d never ignore all the points other than 0 and ± 2.8. Suppose we were testing μ ≤ 2.7 vs. μ > 2.7, and observed z = 2.8. What is the p-value associated with this observed difference? The answer is ~.46. (Her inferences are not in terms of points but of discrepancies from the null, but I’m trying to relate the criticism to significance tests. ) To obtain μ ≥ 2.7 using one-sided confidence intervals would require a confidence level of .46 .54. An absurdly low confidence level/high error probability.

The one-sided lower .975 bound with z = 2.8 would only entitle inferring μ > .84 (2.8 – 1.96)–quite a bit smaller than inferring μ = 2.8. If confidence levels are altered as well (and I don’t see why they wouldn’t be), the one-sided lower .995 bound would only be μ > 0. Thus, while the lump prior on  Hresults in a bias in favor of a null–increasing the type II error probability– it’s of interest to note that achieving the recommended p-value licenses an inference much larger than what the significance tester would allow.

Note, their inferences remain comparative in the sense of “H1 over H0” on a given measure, it doesn’t actually say there’s evidence against (or for) either (unless it goes on to compute a posterior, not just odds ratios or BFs), nor does it falsify either hypothesis. This just underscores the fact that the BF comparative inference is importantly different from significance tests which seek to falsify a null hypothesis, with a view toward learning if there are genuine discrepancies, and if so, their magnitude.

Significance tests do not assign probabilities to these parametric hypotheses, but even if one wanted to, the spiked priors needed for the criticism are questioned by Bayesians and frequentists alike. Casella and Berger (1987a) say that “concentrating mass on the point null hypothesis is biasing the prior in favor of H0 as much as possible” (p. 111) whether in one or two-sided tests. According to them “The testing of a point null hypothesis is one of the most misused statistical procedures.” (ibid., p. 106)

III. Why significance testers should reject the “redefine statistical significance” argument:

(i) If you endorse this particular Bayesian way of attaining the BF, fine, but then your argument begs the central question against the significance tester (or of the confidence interval estimator, for that matter). The significance tester is free to turn the situation around, as Fisher does, as refuting the assumptions:

Even if one were to imagine that H0  had an extremely high prior probability, says Fisher—never minding “what such a statement of probability a priori could possibly mean”(Fisher, 1973, p.42)—the resulting high posteriori probability to H0 , he thinks, would only show that “reluctance to accept a hypothesis strongly contradicted by a test of significance” (ibid., p. 44) … “…is not capable of finding expression in any calculation of probability a posteriori” (ibid., p. 43). Indeed, if one were to consider the claim about the priori probability to be itself a hypothesis, Fisher says, “it would be rejected at once by the observations at a level of significance almost as great [as reached by H0 ]. …Were such a conflict of evidence, as has here been imagined under discussion… in a scientific laboratory, it would, I suggest, be some prior assumption…that would certainly be impugned.” (p. 44)

(ii) Suppose, on the other hand, you don’t endorse these priors or the Bayesian computation on which the “redefine significance” argument turns. Since lowering the p-value cut-off doesn’t seem too harmful, you might tend to look the other way as to the argument on which it is based. Isn’t that OK? Not unless you’re prepared to have your students compute these BFs and/or posteriors in just the manner upon which the critique of significance tests rests. Will you say, “oh that was just for criticism, not for actual use”? Unless you’re prepared to defend the statistical analysis, you shouldn’t support it. Lowering the p-value that you require for evidence of a discrepancy, or getting more data (should you wish to do so) doesn’t require it.

Moreover, your student might point out that you still haven’t matched p-values and BFs (or posteriors on H0 ): They still differ, with the p-value being smaller. If you wanted to match the p-value and the posterior, you could do so very easily: use the frequency matching priors (which doesn’t use the spike). You could still lower the p-value to .005, and obtain a rejection region precisely identical to the Bayesian. Why isn’t that a better solution than one based on a conflicting account of statistical inference?

Of course, even that is to grant the problem as put before us by the Bayesian argument. If you’re following good error statistical practice you might instead shirk all cut-offs. You’d report attained p-values, and wouldn’t infer a genuine effect until you’ve satisfied Fisher’s requirements: (a) Replicate yourself, show you can bring about results that “rarely fail to give us a statistically significant result” (1947, p. 14) and that you’re getting better at understanding the causal phenomenon involved. (b) Check your assumptions: both the statistical model, the measurements, and the links between statistical measurements and research claims. (c) Make sure you adjust your error probabilities to take account of, or at least report, biasing selection effects (from cherry-picking, trying and trying again, multiple testing, flexible determinations, post-data subgroups)–according to the context. That’s what prespecified reports are to inform you of. The suggestion that these are somehow taken care of by adjusting the pool of hypotheses on which you base a prior will not do. (It’s their plausibility that often makes them so seductive, and anyway, the injury is to how well-tested claims are, not to their prior believability.) The appeal to diagnostic testing computations of “false positive rates” in this paper opens up a whole new urn of worms. Don’t get me started. (see related posts.)

A final word is from a guest post by Senn.  Harold Jeffreys, he says, held that if you use the spike (which he introduced), you are to infer the hypothesis that achieves greater than .5 posterior probability.

Within the Bayesian framework, in abandoning smooth priors for lump priors, it is also necessary to change the probability standard. (In fact I speculate that the 1 in 20 standard seemed reasonable partly because of the smooth prior.) … A parsimony principle is used on the prior distribution. You can’t use it again on the posterior distribution. Once that is calculated, you should simply prefer the more probable model. The error that is made is not only to assume that P-values should be what they are not but that when one tries to interpret them in the way that one should not, the previous calibration survives.

It is as if in giving recommendations in dosing children one abandoned a formula based on age and adopted one based on weight but insisted on using the same number of kg one had used for years.

Error probabilities are not posterior probabilities. Certainly, there is much more to statistical analysis than P-values but they should be left alone rather than being deformed in some way to become second class Bayesian posterior probabilities. (Senn)

Please share your views, and alert me to errors. I will likely update this. Stay tuned for asterisks.
12/17 * I’ve already corrected a few typos.

[1] I do not mean the “false positive rate” defined in terms of α and (1 – β)–a problematic animal I put to one side here (Mayo 2003). Richard Morey notes that using their prior odds of 1:10, even the recommended BF of 26 gives us an unimpressive  posterior odds ratio of 2.6 (email correspondence).

[2] Note what I call the “fallacy of replication”. It’s said to be too easy to get low p-values, but at the same time it’s too hard to get low p-values in replication. Is it too easy or too hard? That just shows it’s not the p-value at fault but cherry-picking and other biasing selection effects. Replicating a p-value is hard–when you’ve cheated or been sloppy  the first time.

[3] They suggest increasing the sample size to get the power where it was with rejection at z = 1.96, and, while this is possible in some cases, increasing the sample size changes what counts as one sample. As n increases the discrepancy indicated by any level of significance decreases.

[4] The severe tester would report attained levels and,in this case, would indicate the the discrepancies indicated and ruled out with reasonable severity. (Mayo and Spanos 2011). Keep in mind that statistical testing inferences are  in the form of µ > µ’ =µ+ δ,  or µ ≤ µ’ =µ+ δ  or the like. They are not to point values. As for the imaginary Toxic Co., I’d put the existence of a risk of interest in the null hypothesis of a one-sided test.

Related Posts

Elements of this post are from Mayo 2018.

References

Benjamin, D. J., Berger, J., Johannesson, M., Nosek, B. A., Wagenmakers, E.-J., Berk, R., 3 … Johnson, V. (2017, July 22), “Redefine statistical significance“, Nature Human Behavior.

Berger, J. O. and Delampady, M. (1987). “Testing Precise Hypotheses” and “Rejoinder“, Statistical Science 2(3), 317-335.

Berger, J. O. and Sellke, T.  (1987). “Testing a point null hypothesis: The irreconcilability of p values and evidence,” (with discussion). J. Amer. Statist. Assoc. 82: 112–139.

Cassella G. and Berger, R. (1987a). “Reconciling Bayesian and Frequentist Evidence in the One-sided Testing Problem,” (with discussion). J. Amer. Statist. Assoc. 82 106–111, 123–139.

Cassella, G. and Berger, R. (1987b). “Comment on Testing Precise Hypotheses by J. O. Berger and M. Delampady”, Statistical Science 2(3), 344–347.

Edwards, W., Lindman, H. and Savage, L. (1963). “Bayesian Statistical Inference for Psychological Research”, Psychological Review 70(3): 193-242.

Fisher, R. A. (1947). The Design of Experiments (4th ed.). Edinburgh: Oliver and Boyd. (First published 1935).

Fisher, R. A. (1973). Statistical Methods and Scientific Inference, 3rd ed,  New York: Hafner Press.

Ghosh, J. Delampady, M., and Samanta, T. (2006). An Introduction to Bayesian Analysis: Theory and Methods. New York: Springer.

Mayo, D. G. (2003). “Could Fisher, Jeffreys and Neyman have Agreed on Testing? Commentary on J. Berger’s Fisher Address,” Statistical Science 18: 19-24.

Mayo (2018), Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars. Cambridge (June 2018)

Mayo, D. G. and Spanos, A. (2011) “Error Statistics” in Philosophy of Statistics , Handbook of Philosophy of Science Volume 7 Philosophy of Statistics, (General editors: Dov M. Gabbay, Paul Thagard and John Woods; Volume eds. Prasanta S. Bandyopadhyay and Malcolm R. Forster.) Elsevier: 1-46.

Why significance testers should reject the argument to “redefine statistical significance”, even if they want to lower the p-value*

.

An argument that assumes the very thing that was to have been argued for is guilty of begging the question; signing on to an argument whose conclusion you favor even though you cannot defend its premises is to argue unsoundly, and in bad faith. When a whirlpool of “reforms” subliminally alter  the nature and goals of a method, falling into these sins can be quite inadvertent. Start with a simple point on defining the power of a statistical test.

I. Redefine Power?

Given that power is one of the most confused concepts from Neyman-Pearson (N-P) frequentist testing, it’s troubling that in “Redefine Statistical Significance”, power gets redefined too. “Power,” we’re told, is a Bayes Factor BF “obtained by defining H1 as putting ½ probability on μ = ± m for the value of m that gives 75% power for the test of size α = 0.05. This H1 represents an effect size typical of that which is implicitly assumed by researchers during experimental design.” (material under Figure 1). Continue reading

Frequentstein’s Bride: What’s wrong with using (1 – β)/α as a measure of evidence against the null?

.

ONE YEAR AGO: …and growing more relevant all the time. Rather than leak any of my new book*, I reblog some earlier posts, even if they’re a bit scruffy. This was first blogged here (with a slightly different title). It’s married to posts on “the P-values overstate the evidence against the null fallacy”, such as this, and is wedded to this one on “How to Tell What’s True About Power if You’re Practicing within the Frequentist Tribe”.

In their “Comment: A Simple Alternative to p-values,” (on the ASA P-value document), Benjamin and Berger (2016) recommend researchers report a pre-data Rejection Ratio:

It is the probability of rejection when the alternative hypothesis is true, divided by the probability of rejection when the null hypothesis is true, i.e., the ratio of the power of the experiment to the Type I error of the experiment. The rejection ratio has a straightforward interpretation as quantifying the strength of evidence about the alternative hypothesis relative to the null hypothesis conveyed by the experimental result being statistically significant. (Benjamin and Berger 2016, p. 1)

Slides from the Boston Colloquium for Philosophy of Science: “Severe Testing: The Key to Error Correction”

Slides from my March 17 presentation on “Severe Testing: The Key to Error Correction” given at the Boston Colloquium for Philosophy of Science Alfred I.Taub forum on “Understanding Reproducibility and Error Correction in Science.”

The “P-values overstate the evidence against the null” fallacy

.

The allegation that P-values overstate the evidence against the null hypothesis continues to be taken as gospel in discussions of significance tests. All such discussions, however, assume a notion of “evidence” that’s at odds with significance tests–generally Bayesian probabilities of the sort used in Jeffrey’s-Lindley disagreement (default or “I’m selecting from an urn of nulls” variety). Szucs and Ioannidis (in a draft of a 2016 paper) claim “it can be shown formally that the definition of the p value does exaggerate the evidence against H0” (p. 15) and they reference the paper I discuss below: Berger and Sellke (1987). It’s not that a single small P-value provides good evidence of a discrepancy (even assuming the model, and no biasing selection effects); Fisher and others warned against over-interpreting an “isolated” small P-value long ago.  But the formulation of the “P-values overstate the evidence” meme introduces brand new misinterpretations into an already confused literature! The following are snippets from some earlier posts–mostly this one–and also includes some additions from my new book (forthcoming).

“Tests of Statistical Significance Made Sound”: excerpts from B. Haig

.

I came across a paper, “Tests of Statistical Significance Made Sound,” by Brian Haig, a psychology professor at the University of Canterbury, New Zealand. It hits most of the high notes regarding statistical significance tests, their history & philosophy and, refreshingly, is in the error statistical spirit! I’m pasting excerpts from his discussion of “The Error-Statistical Perspective”starting on p.7.[1]

The Error-Statistical Perspective

An important part of scientific research involves processes of detecting, correcting, and controlling for error, and mathematical statistics is one branch of methodology that helps scientists do this. In recognition of this fact, the philosopher of statistics and science, Deborah Mayo (e.g., Mayo, 1996), in collaboration with the econometrician, Aris Spanos (e.g., Mayo & Spanos, 2010, 2011), has systematically developed, and argued in favor of, an error-statistical philosophy for understanding experimental reasoning in science. Importantly, this philosophy permits, indeed encourages, the local use of ToSS, among other methods, to manage error. Continue reading

Glymour at the PSA: “Exploratory Research is More Reliable Than Confirmatory Research”

I resume my comments on the contributions to our symposium on Philosophy of Statistics at the Philosophy of Science Association. My earlier comment was on Gerd Gigerenzer’s talk. I move on to Clark Glymour’s “Exploratory Research Is More Reliable Than Confirmatory Research.” His complete slides are after my comments.

GLYMOUR’S ARGUMENT (in a nutshell):

“The anti-exploration argument has everything backwards,” says Glymour (slide #11). While John Ioannidis maintains that “Research findings are more likely true in confirmatory designs,” the opposite is so, according to Glymour. (Ioannidis 2005, Glymour’s slide #6). Why? To answer this he describes an exploratory research account for causal search that he has been developing:

(slide #5)

What’s confirmatory research for Glymour? It’s moving directly from rejecting a null hypothesis with a low P-value to inferring a causal claim. Continue reading

Categories: fallacy of rejection, P-values, replication research | 21 Comments

“P-values overstate the evidence against the null”: legit or fallacious?

The allegation that P-values overstate the evidence against the null hypothesis continues to be taken as gospel in discussions of significance tests. All such discussions, however, assume a notion of “evidence” that’s at odds with significance tests–generally likelihood ratios, or Bayesian posterior probabilities (conventional or of the “I’m selecting hypotheses from an urn of nulls” variety). I’m reblogging the bulk of an earlier post as background for a new post to appear tomorrow.  It’s not that a single small P-value provides good evidence of a discrepancy (even assuming the model, and no biasing selection effects); Fisher and others warned against over-interpreting an “isolated” small P-value long ago.  The problem is that the current formulation of the “P-values overstate the evidence” meme is attached to a sleight of hand (on meanings) that is introducing brand new misinterpretations into an already confused literature!

In defense of statistical recipes, but with enriched ingredients (scientist sees squirrel)

Scientist sees squirrel

Evolutionary ecologist, Stephen Heard (Scientist Sees Squirrel) linked to my blog yesterday. Heard’s post asks: “Why do we make statistics so hard for our students?” I recently blogged Barnard who declared “We need more complexity” in statistical education. I agree with both: after all, Barnard also called for stressing the overarching reasoning for given methods, and that’s in sync with Heard. Here are some excerpts from Heard’s (Oct 6, 2015) post. I follow with some remarks.

This bothers me, because we can’t do inference in science without statistics*. Why are students so unreceptive to something so important? In unguarded moments, I’ve blamed it on the students themselves for having decided, a priori and in a self-fulfilling prophecy, that statistics is math, and they can’t do math. I’ve blamed it on high-school math teachers for making math dull. I’ve blamed it on high-school guidance counselors for telling students that if they don’t like math, they should become biology majors. I’ve blamed it on parents for allowing their kids to dislike math. I’ve even blamed it on the boogie**. Continue reading

How to avoid making mountains out of molehills, using power/severity

.

A classic fallacy of rejection is taking a statistically significant result as evidence of a discrepancy from a test (or null) hypothesis larger than is warranted. Standard tests do have resources to combat this fallacy, but you won’t see them in textbook formulations. It’s not new statistical method, but new (and correct) interpretations of existing methods, that are needed. One can begin with a companion to the rule in this recent post:

(1) If POW(T+,µ’) is low, then the statistically significant x is a good indication that µ > µ’.

To have the companion rule also in terms of power, let’s suppose that our result is just statistically significant at a level α. (As soon as the observed difference exceeds the cut-off the rule has to be modified).

Rule (1) was stated in relation to a statistically significant result x (at level α) from a one-sided test T+ of the mean of a Normal distribution with n iid samples, and (for simplicity) known σ:   H0: µ ≤  0 against H1: µ >  0. Here’s the companion:

(2) If POW(T+,µ’) is high, then an α statistically significant x is a good indication that µ < µ’.
(The higher the POW(T+,µ’) is, the better the indication  that µ < µ’.)

That is, if the test’s power to detect alternative µ’ is high, then the statistically significant x is a good indication (or good evidence) that the discrepancy from null is not as large as µ’ (i.e., there’s good evidence that  µ < µ’).

An account of severe testing based on error statistics is always keen to indicate inferences that are not warranted by the data, as well as those that are. Not only might we wish to indicate which discrepancies are poorly warranted, we can give upper bounds to warranted discrepancies by using (2).

POWER: POW(T+,µ’) = POW(Test T+ rejects H0;µ’) = Pr(M > M*; µ’), where M is the sample mean and M* is the cut-off for rejection. (Since it’s continuous, it doesn’t matter if we write > or ≥.)[i]

EXAMPLE. Let σ = 10, n = 100, so (σ/√n) = 1.  Test T+ rejects Hat the .025 level if  M  > 1.96(1).

Find the power against µ = 2.3. To find Pr(M > 1.96; 2.3), get the standard Normal z = (1.96 – 2.3)/1 = -.84. Find the area to the right of -.84 on the standard Normal curve. It is .8. So POW(T+,2.8) = .8.

For simplicity in what follows, let the cut-off, M*, be 2. Let the observed mean M0 just reach the cut-off  2.

The power against alternatives between the null and the cut-off M* will range from α to .5. Power exceeds .5 only once we consider alternatives greater than M*, for these yield negative z values.  Power fact, POW(M* + 1(σ/√n)) = .84.

That is, adding one (σ/ √n) unit to the cut-off M* takes us to an alternative against which the test has power = .84. So, POW(T+, µ = 3) = .84. See this post.

By (2), the (just) significant result x is decent evidence that µ< 3, because if µ ≥ 3, we’d have observed a more statistically significant result, with probability .84.  The upper .84 confidence limit is 3. The significant result is much better evidence that µ< 4,  the upper .975 confidence limit is 4 (approx.), etc.

Reporting (2) is typically of importance in cases of highly sensitive tests, but I think it should always accompany a rejection to avoid making mountains out of molehills. (However, in my view, (2) should be custom-tailored to the outcome not the cut-off.) In the case of statistical insignificance, (2) is essentially ordinary power analysis. (In that case, the interest may be to avoid making molehills out of mountains.) Power analysis, applied to insignificant results, is especially of interest with low-powered tests. For example, failing to find a statistically significant increase in some risk may at most rule out (substantively) large risk increases. It might not allow ruling out risks of concern. Naturally, what counts as a risk of concern is a context-dependent consideration, often stipulated in regulatory statutes.

NOTES ON HOWLERS: When researchers set a high power to detect µ’, it is not an indication they regard µ’ as plausible, likely, expected, probable or the like. Yet we often hear people say “if statistical testers set .8 power to detect µ = 2.3 (in test T+), they must regard µ = 2.3 as probable in some sense”. No, in no sense. Another thing you might hear is, “when H0: µ ≤  0 is rejected (at the .025 level), it’s reasonable to infer µ > 2.3″, or “testers are comfortable inferring µ ≥ 2.3”.  No, they are not comfortable, nor should you be. Such an inference would be wrong with probability ~.8. Given M = 2 (or 1.96), you need to subtract to get a lower confidence bound, if the confidence level is not to exceed .5 . For example, µ > .5 is a lower confidence bound at confidence level .93.

Rule (2) also provides a way to distinguish values within a 1-α confidence interval (instead of choosing a given confidence level and then reporting CIs in the dichotomous manner that is now typical).

At present, power analysis is only used to interpret negative results–and there it is often called “retrospective power”, which is a fine term, but it’s often defined as what I call shpower). Again, confidence bounds could be, but they are not now, used to this end [iii].

Severity replaces M* in (2) with the actual result, be it significant or insignificant.

Looking at power means looking at the best case (just reaching a significance level) or the worst case (just missing it). This is way too coarse; we need to custom tailor results using the observed data. That’s what severity does, but for this post, I wanted to just illuminate the logic.[ii]

One more thing:

Applying (1) and (2) requires the error probabilities to be actual (approximately correct): Strictly speaking, rules (1) and (2) have a conjunct in their antecedents [iv]: “given the test assumptions are sufficiently well met”. If background knowledge leads you to deny (1) or (2), it indicates you’re denying the reported error probabilities are the actual ones. There’s evidence the test fails an “audit”. That, at any rate, is what I would argue.

————

[i] To state power in terms of P-values: POW(µ’) = Pr(P < p*; µ’) where P < p* corresponds to rejecting the null hypothesis at the given level.

[ii] It must be kept in mind that statistical testing inferences are going to be in the form of µ > µ’ =µ+ δ,  or µ ≤ µ’ =µ+ δ  or the like. They are not to point values! (Not even to the point µ =M0.) Take a look at the alternative H1: µ >  0. It is not a point value. Although we are going beyond inferring the existence of some discrepancy, we still retain inferences in the form of inequalities.

[iii] That is, upper confidence bounds are too readily viewed as “plausible” bounds, and as values for which the data provide positive evidence. In fact, as soon as you get to an upper bound at confidence levels of around .6, .7, .8, etc. you actually have evidence µ’ < CI-upper. See this post.

[iv] The “antecedent” of a conditional refers to the statement between the “if” and the “then”.

OTHER RELEVANT POSTS ON POWER

Categories: fallacy of rejection, power, Statistics | 20 Comments

All I want for Chrismukkah is that critics & “reformers” quit howlers of testing (after 3 yrs of blogging)! So here’s Aris Spanos “Tallking Back!”

.

This was initially posted as slides from our joint Spring 2014 seminar: “Talking Back to the Critics Using Error Statistics”. (You can enlarge them.) Related reading is Mayo and Spanos (2011)

continued…

“P-values overstate the evidence against the null”: legit or fallacious? (revised)

0. July 20, 2014: Some of the comments to this post reveal that using the word “fallacy” in my original title might have encouraged running together the current issue with the fallacy of transposing the conditional. Please see a newly added Section 7.

Fallacy of Rejection and the Fallacy of Nouvelle Cuisine

Any Jackie Mason fans out there? In connection with our discussion of power,and associated fallacies of rejection*–and since it’s Saturday night–I’m reblogging the following post.

In February [2012], in London, criminologist Katrin H. and I went to see Jackie Mason do his shtick, a one-man show billed as his swan song to England.  It was like a repertoire of his “Greatest Hits” without a new or updated joke in the mix.  Still, hearing his rants for the nth time was often quite hilarious.

A sample: If you want to eat nothing, eat nouvelle cuisine. Do you know what it means? No food. The smaller the portion the more impressed people are, so long as the food’s got a fancy French name, haute cuisine. An empty plate with sauce!

As one critic wrote, Mason’s jokes “offer a window to a different era,” one whose caricatures and biases one can only hope we’ve moved beyond: But it’s one thing for Jackie Mason to scowl at a seat in the front row and yell to the shocked audience member in his imagination, “These are jokes! They are just jokes!” and another to reprise statistical howlers, which are not jokes, to me. This blog found its reason for being partly as a place to expose, understand, and avoid them. Recall the September 26, 2011 post “Whipping Boys and Witch Hunters”: [i]

Fortunately, philosophers of statistics would surely not reprise decades-old howlers and fallacies. After all, it is the philosopher’s job to clarify and expose the conceptual and logical foibles of others; and even if we do not agree, we would never merely disregard and fail to address the criticisms in published work by other philosophers.  Oh wait, ….one of the leading texts repeats the fallacy in their third edition: Continue reading