# Posts Tagged With: criticism of frequentist methods

## Power howlers return as criticisms of severity

Suppose you are reading about a statistically significant result x that just reaches a threshold p-value α from a test T+ of the mean of a Normal distribution

H0: µ ≤  0 against H1: µ >  0

with n iid samples, and (for simplicity) known σ.  The test “rejects” H0 at this level & infers evidence of a discrepancy in the direction of H1.

I have heard some people say:

A. If the test’s power to detect alternative µ’ is very low, then the just statistically significant x is poor evidence of a discrepancy (from the null) corresponding to µ’.  (i.e., there’s poor evidence that  µ > µ’ ). See point* on language in notes.

They will generally also hold that if POW(µ’) is reasonably high (at least .5), then the inference to µ > µ’ is warranted, or at least not problematic.

I have heard other people say:

B. If the test’s power to detect alternative µ’ is very low, then the just statistically significant x is good evidence of a discrepancy (from the null) corresponding to µ’ (i.e., there’s good evidence that  µ > µ’).

They will generally also hold that if POW(µ’) is reasonably high (at least .5), then the inference to µ > µ’ is unwarranted.

Which is correct, from the perspective of the frequentist error statistical philosophy? Continue reading
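One way to probe the question numerically is the sketch below. It assumes the severity assessment for a rejection in test T+ takes the form SEV(µ > µ’) = P(sample mean ≤ observed mean; µ = µ’), and it uses hypothetical settings (σ = 1, n = 25, α = .025) that the passage above does not fix. The function names are illustrative only.

```python
from math import sqrt
from statistics import NormalDist

Z = NormalDist()  # standard Normal distribution


def cutoff(mu0, sigma, n, alpha):
    """Smallest sample mean that rejects H0: mu <= mu0 at level alpha."""
    return mu0 + Z.inv_cdf(1 - alpha) * sigma / sqrt(n)


def power(mu_alt, mu0, sigma, n, alpha):
    """POW(mu'): probability the sample mean exceeds the cutoff when mu = mu'."""
    se = sigma / sqrt(n)
    return 1 - Z.cdf((cutoff(mu0, sigma, n, alpha) - mu_alt) / se)


def severity_greater(mu_alt, observed_mean, sigma, n):
    """Assumed SEV(mu > mu'): probability of a sample mean no larger
    than the observed one when mu = mu'."""
    se = sigma / sqrt(n)
    return Z.cdf((observed_mean - mu_alt) / se)


# Hypothetical settings, chosen only for illustration.
mu0, sigma, n, alpha = 0.0, 1.0, 25, 0.025
just_significant = cutoff(mu0, sigma, n, alpha)

for mu_alt in (0.1, 0.2, 0.4, 0.6):
    pow_alt = power(mu_alt, mu0, sigma, n, alpha)
    sev_alt = severity_greater(mu_alt, just_significant, sigma, n)
    print(f"mu'={mu_alt}: POW={pow_alt:.3f}  SEV(mu > mu')={sev_alt:.3f}")
```

Under these assumptions, for a result that just reaches the cutoff, the severity for µ > µ’ equals 1 − POW(µ’), so the two move in opposite directions as µ’ grows.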

## Posts of Christmas Past (1): 13 howlers of significance tests (and how to avoid them)

I’m reblogging a post from Christmas past–exactly 7 years ago. Guess what I gave as the number 1 (of 13) howler well-worn criticism of statistical significance tests, haunting us back in 2012–all of which are put to rest in Mayo and Spanos 2011? Yes, it’s the frightening allegation that statistical significance tests forbid using any background knowledge! The researcher is imagined to start with a “blank slate” in each inquiry (no memories of fallacies past), and then unthinkingly apply a purely formal, automatic, accept-reject machine. What’s newly frightening (in 2019) is the credulity with which this apparition is now being met (by some). I make some new remarks below the post from Christmas past: Continue reading

Categories: memory lane, significance tests, Statistics |

## Fallacies of Rejection, Nouvelle Cuisine, and assorted New Monsters

Jackie Mason

Whenever I’m in London, my criminologist friend Katrin H. and I go in search of stand-up comedy. Since it’s Saturday night (and I’m in London), we’re setting out in search of a good comedy club (I’ll complete this post upon return). A few years ago we heard Jackie Mason do his shtick, a one-man show billed as his swan song to England.  It was like a repertoire of his “Greatest Hits” without a new or updated joke in the mix.  Still, hearing his rants for the nth time was often quite hilarious. It turns out that he has already been back doing another “final shtick tour” in England, but not tonight.

A sample: If you want to eat nothing, eat nouvelle cuisine. Do you know what it means? No food. The smaller the portion the more impressed people are, so long as the food’s got a fancy French name, haute cuisine. An empty plate with sauce!

As one critic wrote, Mason’s jokes “offer a window to a different era,” one whose caricatures and biases one can only hope we’ve moved beyond:

But it’s one thing for Jackie Mason to scowl at a seat in the front row and yell to the shocked audience member in his imagination, “These are jokes! They are just jokes!” and another to reprise statistical howlers, which are not jokes, to me. This blog found its reason for being partly as a place to expose, understand, and avoid them. I had earlier used this Jackie Mason opening to launch into a well-known fallacy of rejection using statistical significance tests. I’m going to go further this time around. I began by needling some leading philosophers of statistics: Continue reading

## Oxford Gaol: Statistical Bogeymen

Memory Lane: Oxford Jail (also called Oxford Castle) is an entirely fitting place to be on (and around) Halloween! Moreover, rooting around this rather lavish set of jail cells (what used to be a single cell is now a dressing room) is every bit as conducive to philosophical reflection as is exile on Elba! (It is now a boutique hotel, though many of the rooms are still too jail-like for me.) My goal (while in this gaol, as the English sometimes spell it) is to try and free us from the bogeymen and bogeywomen often associated with “classical” statistics. As a start, the very term “classical statistics” should, I think, be shelved, not that names should matter.

In appraising statistical accounts at the foundational level, we need to realize the extent to which accounts are viewed through the eyeholes of a mask or philosophical theory.  Moreover, the mask some wear while pursuing this task might well be at odds with their ordinary way of looking at evidence, inference, and learning. In any event, to avoid non-question-begging criticisms, the standpoint from which the appraisal is launched must itself be independently defended.   But for (most) Bayesian critics of error statistics the assumption that uncertain inference demands a posterior probability for claims inferred is thought to be so obvious as not to require support. Critics are implicitly making assumptions that are at odds with the frequentist statistical philosophy. In particular, they assume a certain philosophy about statistical inference (probabilism), often coupled with the allegation that error statistical methods can only achieve radical behavioristic goals, wherein all that matters are long-run error rates (of some sort)

• Error probabilities do not supply posterior probabilities in hypotheses; interpreted as if they do (and some say we just can’t help it), they lead to inconsistencies.
• Methods with good long-run error rates can give rise to counterintuitive inferences in particular cases.

I have proposed an alternative philosophy that replaces these tenets with different ones:

• the role of probability in inference is to quantify how reliably or severely claims (or discrepancies from claims) have been tested;
• the severity goal directs us to the relevant error probabilities, avoiding the oft-repeated statistical fallacies due to tests that are overly sensitive, as well as those insufficiently sensitive to particular errors;
• control of long-run error probabilities, while necessary, is not sufficient for good tests or warranted inferences.

## Fallacy of Rejection and the Fallacy of Nouvelle Cuisine

Any Jackie Mason fans out there? In connection with our discussion of power, and associated fallacies of rejection*–and since it’s Saturday night–I’m reblogging the following post.

In February [2012], in London, criminologist Katrin H. and I went to see Jackie Mason do his shtick, a one-man show billed as his swan song to England.  It was like a repertoire of his “Greatest Hits” without a new or updated joke in the mix.  Still, hearing his rants for the nth time was often quite hilarious.

A sample: If you want to eat nothing, eat nouvelle cuisine. Do you know what it means? No food. The smaller the portion the more impressed people are, so long as the food’s got a fancy French name, haute cuisine. An empty plate with sauce!

As one critic wrote, Mason’s jokes “offer a window to a different era,” one whose caricatures and biases one can only hope we’ve moved beyond: But it’s one thing for Jackie Mason to scowl at a seat in the front row and yell to the shocked audience member in his imagination, “These are jokes! They are just jokes!” and another to reprise statistical howlers, which are not jokes, to me. This blog found its reason for being partly as a place to expose, understand, and avoid them. Recall the September 26, 2011 post “Whipping Boys and Witch Hunters”: [i]

Fortunately, philosophers of statistics would surely not reprise decades-old howlers and fallacies. After all, it is the philosopher’s job to clarify and expose the conceptual and logical foibles of others; and even if we do not agree, we would never merely disregard and fail to address the criticisms in published work by other philosophers.  Oh wait, ….one of the leading texts repeats the fallacy in their third edition: Continue reading

## Comedy hour at the Bayesian (epistemology) retreat: highly probable vs highly probed (vs B-boosts)

Since we’ll be discussing Bayesian confirmation measures in next week’s seminar—the relevant blogpost being here--let’s listen in to one of the comedy hours at the Bayesian retreat as reblogged from May 5, 2012.

Did you hear the one about the frequentist error statistical tester who inferred a hypothesis H passed a stringent test (with data x)?

The problem was, the epistemic probability in H was so low that H couldn’t be believed!  Instead we believe its denial H’!  So, she will infer hypotheses that are simply unbelievable!

So it appears the error statistical testing account fails to serve as an account of knowledge or evidence (i.e., an epistemic account). However severely I might wish to say that a hypothesis H has passed a test, this Bayesian critic assigns a sufficiently low prior probability to H so as to yield a low posterior probability in H[i].  But this is no argument about why this counts in favor of, rather than against, their particular Bayesian computation as an appropriate assessment of the warrant to be accorded to hypothesis H.

To begin with, in order to use techniques for assigning frequentist probabilities to events, their examples invariably involve “hypotheses” that consist of asserting that a sample possesses a characteristic, such as “having a disease” or “being college ready” or, for that matter, “being true.”  This would not necessarily be problematic if it were not for the fact that their criticism requires shifting the probability to the particular sample selected—for example, a student Isaac is college-ready, or this null hypothesis (selected from a pool of nulls) is true.  This was, recall, the fallacious probability assignment that we saw in Berger’s attempt, later (perhaps) disavowed. Also there are just two outcomes, say s and ~s, and no degrees of discrepancy from H. Continue reading

## Oxford Gaol: Statistical Bogeymen

Memory Lane: Oxford Jail (also called Oxford Castle) is an entirely fitting place to be on (and around) Halloween! Moreover, rooting around this rather lavish set of jail cells (what used to be a single cell is now a dressing room) is every bit as conducive to philosophical reflection as is exile on Elba! (I’m serious, it is now a boutique hotel.)  My goal (while in this gaol—as the English sometimes spell it) is to try and free us from the bogeymen and bogeywomen often associated with “classical” statistics. As a start, the very term “classical statistics” should I think be shelved, not that names should matter.

In appraising statistical accounts at the foundational level, we need to realize the extent to which accounts are viewed through the eyeholes of a mask or philosophical theory.  Moreover, the mask some wear while pursuing this task might well be at odds with their ordinary way of looking at evidence, inference, and learning. In any event, to avoid non-question-begging criticisms, the standpoint from which the appraisal is launched must itself be independently defended.   But for (most) Bayesian critics of error statistics the assumption that uncertain inference demands a posterior probability for claims inferred is thought to be so obvious as not to require support. Critics are implicitly making assumptions that are at odds with the frequentist statistical philosophy. In particular, they assume a certain philosophy about statistical inference (probabilism), often coupled with the allegation that error statistical methods can only achieve radical behavioristic goals, wherein all that matters are long-run error rates (of some sort)

• Error probabilities do not supply posterior probabilities in hypotheses; interpreted as if they do (and some say we just can’t help it), they lead to inconsistencies.
• Methods with good long-run error rates can give rise to counterintuitive inferences in particular cases.

I have proposed an alternative philosophy that replaces these tenets with different ones:

• the role of probability in inference is to quantify how reliably or severely claims (or discrepancies from claims) have been tested;
• the severity goal directs us to the relevant error probabilities, avoiding the oft-repeated statistical fallacies due to tests that are overly sensitive, as well as those insufficiently sensitive to particular errors;
• control of long-run error probabilities, while necessary, is not sufficient for good tests or warranted inferences.

What is key on the statistics side of this alternative philosophy is that the probabilities refer to the distribution of a statistic d(x)—the so-called sampling distribution.  Hence such accounts are often called sampling theory accounts. Since the sampling distribution is the basis for error probabilities, another term might be error statistical. Continue reading

Categories: Philosophy of Statistics |

## Comedy hour at the Bayesian (epistemology) retreat: highly probable vs highly probed (vs what ?)

Our favorite high school student, Isaac, gets a better shot at showing his college readiness using one of the comparative measures of support or confirmation discussed last week. Their assessment thus seems more in sync with the severe tester, but they are not purporting that z is evidence for inferring (or even believing) an H to which z affords a high B-boost*. Their measures identify a third category that reflects the degree to which H would predict z (where the comparison might be predicting without z, or under ~H or the like).  At least if we give it an empirical, rather than a purely logical, reading. Since it’s Saturday night let’s listen in to one of the comedy hours at the Bayesian retreat as reblogged from May 5, 2012.

Did you hear the one about the frequentist error statistical tester who inferred a hypothesis H passed a stringent test (with data x)?

The problem was, the epistemic probability in H was so low that H couldn’t be believed!  Instead we believe its denial H’!  So, she will infer hypotheses that are simply unbelievable!

So it appears the error statistical testing account fails to serve as an account of knowledge or evidence (i.e., an epistemic account). However severely I might wish to say that a hypothesis H has passed a test, this Bayesian critic assigns a sufficiently low prior probability to H so as to yield a low posterior probability in H[i].  But this is no argument about why this counts in favor of, rather than against, their particular Bayesian computation as an appropriate assessment of the warrant to be accorded to hypothesis H.

To begin with, in order to use techniques for assigning frequentist probabilities to events, their examples invariably involve “hypotheses” that consist of asserting that a sample possesses a characteristic, such as “having a disease” or “being college ready” or, for that matter, “being true.”  This would not necessarily be problematic if it were not for the fact that their criticism requires shifting the probability to the particular sample selected—for example, a student Isaac is college-ready, or this null hypothesis (selected from a pool of nulls) is true.  This was, recall, the fallacious probability assignment that we saw in Berger’s attempt, later (perhaps) disavowed. Also there are just two outcomes, say s and ~s, and no degrees of discrepancy from H. Continue reading

## First blog: “Did you hear the one about the frequentist…”? and “Frequentists in Exile”

Dear Reader: Tonight marks the 2-year anniversary of this blog; so I’m reblogging my very first posts from 9/3/11 here and here (from the rickety old blog site)*. (One was the “about”.) The current blog was included once again in the top 50 statistics blogs. Amazingly, I have received e-mails from different parts of the world describing experimental recipes for the special concoction we exiles favor!  (Mine is here.) If you can fly over to the Elbar Room, please join us: I’m treating everyone to doubles of Elbar Grease!  Thanks for reading and contributing! D. G. Mayo

(*The old blogspot is a big mix; it was before Rejected blogs. Yes, I still use this old typewriter [ii])

“Overheard at the Comedy Club at the Bayesian Retreat” 9/3/11 by D. Mayo

“Did you hear the one about the frequentist . . .

• “who claimed that observing “heads” on a biased coin that lands heads with probability .05 is evidence of a statistically significant improvement over the standard treatment of diabetes, on the grounds that such an event occurs with low probability (.05)?”

or

• “who defended the reliability of his radiation reading, despite using a broken radiometer, on the grounds that most of the time he uses one that works, so on average he’s pretty reliable?”

Such jests may work for an after-dinner laugh, but if it turns out that, despite being retreads of “straw-men” fallacies, they form the basis of why some reject frequentist methods, then they are not such a laughing matter.   But surely the drubbing of frequentist methods could not be based on a collection of howlers, could it?  I invite the curious reader to stay and find out.

If we are to take the criticisms seriously, and put to one side the possibility that they are deliberate distortions of frequentist statistical methods, we need to identify their sources. To this end I consider two interrelated areas around which to organize foundational issues in statistics: (1) the roles of probability in induction and inference, and (2) the nature and goals of statistical inference in science or learning. Frequentist sampling statistics, which I prefer to call “error statistics,” continues to be raked over the coals in the foundational literature, but with little scrutiny of the presuppositions about goals and methods, without which the criticisms lose all force.

First, there is the supposition that an adequate account must assign degrees of probability to hypotheses, an assumption often called probabilism. Second, there is the assumption that the main, if not the only, goal of error-statistical methods is to evaluate long-run error rates.   Given the wide latitude with which some critics define “controlling long-run error,” it is not surprising to find them arguing that (i) error statisticians approve of silly methods, and/or (ii) rival (e.g., Bayesian) accounts also satisfy error statistical demands. Absent this sleight of hand, Bayesian celebrants would have to go straight to the finale of their entertainment hour: a rousing rendition of “There’s No Theorem Like Bayes’s Theorem.”

Never mind that frequentists have responded to these criticisms, they keep popping up (verbatim) in many Bayesian textbooks and articles on philosophical foundations. The difficulty of articulating a statistical philosophy that fully explains the basis for both (i)  insisting on error-statistical guarantees, while (ii) avoiding pathological examples in practice, has turned many a frequentist away from venturing into foundational battlegrounds.  Some even concede the distorted perspectives drawn from overly literal and radical expositions of what Fisher, Neyman, and Pearson “really thought”.  Many others just find the “statistical wars” distasteful.

Here is where I view my contribution—as a philosopher of science—to the long-standing debate: not merely to call attention to the howlers that pass as legitimate criticisms of frequentist error statistics, but also to sketch the main lines of an alternative statistical philosophy within which to better articulate the roles and value of frequentist tools. Let me be clear that I do not consider this the only philosophical framework for frequentist statistics—different terminology could do as well.  I will consider myself successful if I can provide one way of building, or one standpoint from which to build, a frequentist, error- statistical philosophy.

But given this is a blog, I shall be direct and to the point: I hope to cultivate the interests of others who might want to promote intellectual honesty within a generally very lopsided philosophical debate.  I will begin with the first entry to the comedy routine, as it is put forth by leading Bayesians……

___________________________________________

“Frequentists in Exile” 9/3/11 by D. Mayo

Confronted with the position that “arguments for this personalistic theory were so persuasive that anything to any extent inconsistent with that theory should be discarded” (Cox 2006, 196), frequentists might have seen themselves in a kind of exile when it came to foundations, even those who had been active in the dialogues of an earlier period [i].  Sometime around the late 1990s there were signs that this was changing.  Regardless of the explanation, the fact that it did occur and is occurring is of central importance to statistical philosophy.

Now that Bayesians have stepped off their a priori pedestal, it may be hoped that a genuinely deep scrutiny of the frequentist and Bayesian accounts will occur.  In some corners of practice it appears that frequentist error statistical foundations are being discovered anew.  Perhaps frequentist foundations, never made fully explicit, but at most lying deep below the ocean floor, are finally being disinterred.  But let’s learn from some of the mistakes in the earlier attempts to understand it.  With this goal I invite you to join me in some deep water drilling, here as I cast about on my Isle of Elba.

Cox, D. R. (2006), Principles of Statistical Inference, CUP.

________________________________________________

[i] Yes, that’s the Elba connection: Napoleon’s exile (from which he returned to fight more battles).

[ii] I have discovered a very reliable antique typewriter shop in Oxford that was able to replace the two missing typewriter keys. So long as my “ribbons” and carbon sheets don’t run out, I’m set.

Categories: Comedy, Error Statistics |

## Bad news bears: ‘Bayesian bear’ rejoinder-reblog mashup

Oh No! It’s those mutant bears again. To my dismay, I’ve been sent, for the third time, that silly, snarky, adolescent, clip of those naughty “what the p-value” bears (first posted on Aug 5, 2012), who cannot seem to get a proper understanding of significance tests into their little bear brains. So apparently some people haven’t seen my rejoinder which, as I said then, practically wrote itself. So since it’s Saturday night here at the Elbar Room, let’s listen in to a mashup of both the clip and my original rejoinder (in which p-value bears are replaced with hypothetical Bayesian bears).

These stilted bear figures and their voices are sufficiently obnoxious in their own right, even without the tedious lampooning of p-values and the feigned horror at learning they should not be reported as posterior probabilities.

Mayo’s Rejoinder:

Bear #1: Do you have the results of the study?

Bear #2: Yes. The good news is there is a .996 probability of a positive difference in the main comparison.

Bear #1: Great. So I can be well assured that there is just a .004 probability that such positive results would occur if they were merely due to chance.

Bear #2: Not really, that would be an incorrect interpretation. Continue reading

## Anything Tests Can do, CIs do Better; CIs Do Anything Better than Tests?* (reforming the reformers cont.)

Having reblogged the 5/17/12 post on “reforming the reformers” yesterday, I thought I should reblog its follow-up: 6/2/12.

Consider again our one-sided Normal test T+, with null H0: μ ≤ μ0 vs H1: μ > μ0, and μ0 = 0, α = .025, and σ = 1, but let n = 25. So M (the sample mean) is statistically significant only if it exceeds .392. Suppose M just misses significance, say

M0 = .39.
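The .392 threshold can be checked directly. This is a minimal sketch using only the values stated above (μ0 = 0, σ = 1, n = 25, α = .025); the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

mu0, sigma, n, alpha = 0.0, 1.0, 25, 0.025

standard_error = sigma / sqrt(n)            # 1 / 5 = 0.2
z_alpha = NormalDist().inv_cdf(1 - alpha)   # roughly 1.96
cutoff = mu0 + z_alpha * standard_error     # roughly 0.392

observed_mean = 0.39
is_significant = observed_mean > cutoff

print(round(cutoff, 3), is_significant)     # prints: 0.392 False
```

The observed mean of .39 falls just short of the cutoff, which is why the result counts as statistically insignificant at this level.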

The flip side of a fallacy of rejection (discussed before) is a fallacy of acceptance, or the fallacy of misinterpreting statistically insignificant results. To avoid the age-old fallacy of taking a statistically insignificant result as evidence of zero (0) discrepancy from the null hypothesis μ = μ0, we wish to identify discrepancies that can and cannot be ruled out. For our test T+, we reason from insignificant results to inferential claims of the form:

μ < μ0 + γ

Fisher continually emphasized that failure to reject was not evidence for the null.  Neyman, we saw, in chastising Carnap, argued for the following kind of power analysis:

Neymanian Power Analysis (Detectable Discrepancy Size DDS): If data x are not statistically significantly different from H0, and the power to detect discrepancy γ is high (low), then x constitutes good (poor) evidence that the actual effect is < γ. (See 11/9/11 post).

By taking into account the actual x0, a more nuanced post-data reasoning may be obtained.

“In the Neyman-Pearson theory, sensitivity is assessed by means of the power—the probability of reaching a preset level of significance under the assumption that various alternative hypotheses are true. In the approach described here, sensitivity is assessed by means of the distribution of the random variable P, considered under the assumption of various alternatives.” (Cox and Mayo 2010, p. 291)

This may be captured in:

FEV(ii): A moderate p-value is evidence of the absence of a discrepancy d from H0 only if there is a high probability the test would have given a worse fit with H0 (i.e., a smaller p-value) were a discrepancy d to exist (Mayo and Cox 2005, 2010, 256).

This is equivalently captured in the Rule of Acceptance (Mayo (EGEK) 1996) and in the severity interpretation for acceptance, SIA (Mayo and Spanos 2006, p. 337):

SIA: (a): If there is a very high probability that [the observed difference] would have been larger than it is, were μ > μ1, then μ < μ1 passes the test with high severity,…
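To see how SIA differs from a pre-data power analysis, here is a rough sketch for the test above (μ0 = 0, σ = 1, n = 25, α = .025). It assumes the severity for μ < μ1 is evaluated at the boundary value μ = μ1, i.e. as P(M > observed mean; μ = μ1). The second observed mean (.10) is a hypothetical value added for contrast, and the function names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

Z = NormalDist()  # standard Normal distribution
MU0, SIGMA, N, ALPHA = 0.0, 1.0, 25, 0.025
SE = SIGMA / sqrt(N)
CUTOFF = MU0 + Z.inv_cdf(1 - ALPHA) * SE


def power(mu1):
    """Pre-data: probability the sample mean exceeds the cutoff when mu = mu1."""
    return 1 - Z.cdf((CUTOFF - mu1) / SE)


def severity_less(mu1, observed_mean):
    """Post-data (assumed form): probability the sample mean would have been
    larger than the one observed when mu = mu1."""
    return 1 - Z.cdf((observed_mean - mu1) / SE)


print("mu1   POW    SEV(M0=.39)  SEV(M0=.10)")
for mu1 in (0.2, 0.4, 0.6, 0.8):
    print(
        f"{mu1:<5} {power(mu1):.3f}  "
        f"{severity_less(mu1, 0.39):.3f}        "
        f"{severity_less(mu1, 0.10):.3f}"
    )
```

With the observed mean sitting almost at the cutoff, the severity values track the power closely. With a smaller observed mean, the same claim μ < μ1 passes with higher severity than the power alone would suggest, which is the sense in which taking the actual outcome into account is more nuanced.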

But even taking tests and CIs just as we find them, we see that CIs do not avoid the fallacy of acceptance: they do not block erroneous construals of negative results adequately. Continue reading

## 13 well-worn criticisms of significance tests (and how to avoid them)

2013 is right around the corner, and here are 13 well-known criticisms of statistical significance tests, and how they are addressed within the error statistical philosophy, as discussed in Mayo, D. G. and Spanos, A. (2011) “Error Statistics”.

• (#1) Error statistical tools forbid using any background knowledge.
• (#2) All statistically significant results are treated the same.
• (#3) The p-value does not tell us how large a discrepancy is found.
• (#4) With large enough sample size even a trivially small discrepancy from the null can be detected.
• (#5) Whether there is a statistically significant difference from the null depends on which is the null and which is the alternative.
• (#6) Statistically insignificant results are taken as evidence that the null hypothesis is true.
• (#7) Error probabilities are misinterpreted as posterior probabilities.
• (#8) Error statistical tests are justified only in cases where there is a very long (if not infinite) series of repetitions of the same experiment.
• (#9) Specifying statistical tests is too arbitrary.
• (#10) We should be doing confidence interval estimation rather than significance tests.
• (#11) Error statistical methods take into account the intentions of the scientists analyzing the data.
• (#12) All models are false anyway.
• (#13) Testing assumptions involves illicit data-mining.

You can read how we avoid them in the full paper here.

Mayo, D. G. and Spanos, A. (2011) “Error Statistics” in Philosophy of Statistics, Handbook of Philosophy of Science Volume 7 (General editors: Dov M. Gabbay, Paul Thagard and John Woods; Volume eds. Prasanta S. Bandyopadhyay and Malcolm R. Forster), Elsevier: 1-46.

Categories: Error Statistics, significance tests, Statistics |

## Reblogging: Oxford Gaol: Statistical Bogeymen

Reblogging 1 year ago in Oxford: Oxford Jail is an entirely fitting place to be on Halloween!

Moreover, rooting around this rather lavish set of jail cells (what used to be a single cell is now a dressing room) is every bit as conducive to philosophical reflection as is exile on Elba!  My goal (while in this gaol—as the English sometimes spell it) is to try and free us from the bogeymen and bogeywomen often associated with “classical” statistics. As a start, the very term “classical statistics” should I think be shelved, not that names should matter.

In appraising statistical accounts at the foundational level, we need to realize the extent to which accounts are viewed through the eyeholes of a mask or philosophical theory.  Moreover, the mask some wear while pursuing this task might well be at odds with their ordinary way of looking at evidence, inference, and learning. In any event, to avoid non-question-begging criticisms, the standpoint from which the appraisal is launched must itself be independently defended.   But for Bayesian critics of error statistics the assumption that uncertain inference demands a posterior probability for claims inferred is thought to be so obvious as not to require support. Critics are implicitly making assumptions that are at odds with the frequentist statistical philosophy. In particular, they assume a certain philosophy about statistical inference (probabilism), often coupled with the allegation that error statistical methods can only achieve radical behavioristic goals, wherein all that matters are long-run error rates (of some sort) Continue reading

Categories: Error Statistics, Philosophy of Statistics |

These days, so many theater productions are updated reviews of older standards. Same with the comedy hours at the Bayesian retreat, and task force meetings of significance test reformers. So (on the 1-year anniversary of this blog) let’s listen in to one of the earliest routines (with highest blog hits), but with some new reflections (first considered here and here).

“Did you hear the one about the frequentist . . .

“who claimed that observing “heads” on a biased coin that lands heads with probability .05 is evidence of a statistically significant improvement over the standard treatment of diabetes, on the grounds that such an event occurs with low probability (.05)?”

The joke came from J. Kadane’s Principles of Uncertainty (2011, CRC Press*).

“Flip a biased coin that comes up heads with probability 0.95, and tails with probability 0.05.  If the coin comes up tails reject the null hypothesis.  Since the probability of rejecting the null hypothesis if it is true is 0.05, this is a valid 5% level test.  It is also very robust against data errors; indeed it does not depend on the data at all.  It is also nonsense, of course, but nonsense allowed by the rules of significance testing.” (439)

Much laughter.

___________________

But is it allowed?  I say no. The null hypothesis in the joke can be in any field, perhaps it concerns mean transmission of Scrapie in mice (as in my early Kuru post).  I know some people view significance tests as merely rules that rarely reject erroneously, but I claim this is mistaken. Both in significance tests and in scientific hypothesis testing more generally, data indicate inconsistency with H only by being counter to what would be expected under the assumption that H is correct (as regards a given aspect observed). Were someone to tell Prusiner that the testing methods he follows actually allow any old “improbable” event (a stock split in Apple?) to reject a hypothesis about prion transmission rates, Prusiner would say that person didn’t understand the requirements of hypothesis testing in science. Since the criticism would hold no water in the analogous case of Prusiner’s test, it must equally miss its mark in the case of significance tests**.  That, recall, was Rule #1. Continue reading
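The contrast can be made concrete with a small sketch. It compares the coin “test” from the quoted passage with a data-based one-sided Normal test of H0: μ ≤ 0 at the same 5% level. The Normal test and its settings (σ = 1, n = 25) are assumptions introduced only for comparison; the joke itself names no particular test, and the function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

Z = NormalDist()  # standard Normal distribution


def coin_rejection_prob(mu, tails_prob=0.05):
    """The coin 'test' rejects whenever the coin lands tails.
    The true mean mu plays no role, so the rejection rate never changes."""
    return tails_prob


def normal_test_rejection_prob(mu, mu0=0.0, sigma=1.0, n=25, alpha=0.05):
    """Probability that a one-sided Normal test of H0: mu <= mu0 rejects
    when the true mean is mu (assumed comparison test)."""
    se = sigma / sqrt(n)
    cutoff = mu0 + Z.inv_cdf(1 - alpha) * se
    return 1 - Z.cdf((cutoff - mu) / se)


print("true mu  coin   data-based")
for mu in (0.0, 0.2, 0.4, 0.8):
    print(f"{mu:<8} {coin_rejection_prob(mu):.3f}  {normal_test_rejection_prob(mu):.3f}")
```

Both procedures reject a true null 5% of the time, but the coin rejects at the same rate whatever the truth is. Its rejections therefore say nothing about what would be expected under H, which is the requirement the paragraph above appeals to.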

Categories: Comedy, Philosophy of Statistics, Statistics

## A “Bayesian Bear” rejoinder practically writes itself…

These stilted bear figures and their voices are sufficiently obnoxious in their own right, even without the tedious lampooning of p-values and the feigned horror at learning they should not be reported as posterior probabilities. Coincidentally, I have been sent several different p-value YouTube clips in the past two weeks, rehearsing essentially the same interpretive issues, but this one (“what the p-value”*) was created by some freebie outfit that will apparently set their irritating cartoon bear voices to your very own dialogue (I don’t know the website or outfit).

The presumption is that somehow there would be no questions or confusion of interpretation were the output in the form of a posterior probability. The problem of indicating the extent of discrepancies that are/are not warranted by a given p-value is genuine but easy enough to solve**. What I never understand is why it is presupposed that the most natural and unequivocal way to interpret and communicate evidence (in this case, leading to low p-values) is by means of a (posterior) probability assignment, when it seems clear that the more relevant question the testy-voiced (“just wait a tick”) bear would put to the know-it-all bear would be: how often would this method erroneously declare a genuine discrepancy? A corresponding “Bayesian bear” video practically writes itself, but I’ll let you watch this first. Share any narrative lines that come to mind.

*Reference: Blume, J. and J. F. Peipert (2003). “What your statistician never told you about P-values.” J Am Assoc Gynecol Laparosc 10(4): 439-444.

**See for example, Mayo & Spanos (2011) ERROR STATISTICS

Categories: Statistics

## Stephen Senn: Randomization, ratios and rationality: rescuing the randomized clinical trial from its critics

Stephen Senn
Head of the Methodology and Statistics Group,
Competence Center for Methodology and Statistics (CCMS), Luxembourg

An issue sometimes raised about randomized clinical trials is the problem of indefinitely many confounders. This, for example, is what John Worrall has to say:

Even if there is only a small probability that an individual factor is unbalanced, given that there are indefinitely many possible confounding factors, then it would seem to follow that the probability that there is some factor on which the two groups are unbalanced (when remember randomly constructed) might for all anyone knows be high. (Worrall J. What evidence in evidence-based medicine? Philosophy of Science 2002; 69: S316-S330: see page S324)

It seems to me, however, that this overlooks four matters. The first is that it is not indefinitely many variables we are interested in but only one, albeit one we can’t measure perfectly. This variable can be called ‘outcome’. We wish to see to what extent the difference observed in outcome between groups is compatible with the idea that chance alone explains it. The indefinitely many covariates can help us predict outcome but they are only of interest to the extent that they do so. However, although we can’t measure the difference we would have seen in outcome between groups in the absence of treatment, we can measure how much it varies within groups (where the variation cannot be due to differences between treatments). Thus we can say a great deal about random variation to the extent that group membership is indeed random.

The second point is that in the absence of a treatment effect, where randomization has taken place, the statistical theory predicts probabilistically how the variation in outcome between groups relates to the variation within. Continue reading

Categories: Statistics

## G. Cumming Response: The New Statistics

Prof. Geoff Cumming [i] has taken up my invite to respond to “Do CIs Avoid Fallacies of Tests? Reforming the Reformers” (May 17th), reposted today as well. (I extend the same invite to anyone I comment on, whether it be in the form of a comment or full post).   He reviews some of the complaints against p-values and significance tests, but he has not here responded to the particular challenge I raise: to show how his appeals to CIs avoid the fallacies and weakness of significance tests. The May 17 post focuses on the fallacy of rejection; the one from June 2, on the fallacy of acceptance. In each case, one needs to supplement his CIs with something along the lines of the testing scrutiny offered by SEV. At the same time, a SEV assessment avoids the much-lampooned uses of p-values–or so I have argued. He does allude to a subsequent post, so perhaps he will address these issues there.

The New Statistics

PROFESSOR GEOFF CUMMING [ii] (submitted June 13, 2012)

I’m new to this blog—what a trove of riches! I’m prompted to respond by Deborah Mayo’s typically insightful post of 17 May 2012, in which she discussed one-sided tests and referred to my discussion of one-sided CIs (Cumming, 2012, pp 109-113). A central issue is:

Cumming (quoted by Mayo): as usual, the estimation approach is better

Mayo: Is it?

Lots to discuss there. In this first post I’ll outline the big picture as I see it.

‘The New Statistics’ refers to effect sizes, confidence intervals, and meta-analysis, which, of course, are not themselves new. But using them, and relying on them as the basis for interpretation, would be new for most researchers in a wide range of disciplines—that for decades have relied on null hypothesis significance testing (NHST). My basic argument for the new statistics rather than NHST is summarised in a brief magazine article (http://tiny.cc/GeoffConversation) and radio talk (http://tiny.cc/geofftalk). The website www.thenewstatistics.com has information about the book (Cumming, 2012) and ESCI software, which is a free download.

Categories: Statistics

## Answer to the Homework & a New Exercise

Debunking the “power paradox” allegation from my previous post. The authors consider a one-tailed Z test of the hypothesis H0: μ ≤ 0 versus H1: μ > 0: our Test T+.  The observed sample mean is M = 1.4; in the first case σx = 1, and in the second case σx = 2.

First case: The power against μ = 3.29 is high, .95 (i.e. P(Z > 1.645; μ=3.29) = 1-Φ(-1.645) = .95), and thus the DDS assessor would take the result as a good indication that μ < 3.29.

Second case: For σx = 2, the cut-off for rejection would be 0 + 1.65(2) = 3.30.

So, in the second case (σx = 2) the probability of erroneously accepting H0, even if μ were as high as 3.29, is .5!  (i.e. P(Z ≤ 1.645; μ=3.29)  = Φ(1.645-(3.29/2)) ~.5.)  Although p1 < p2[i] the justifiable upper bound in the first test is smaller (closer to 0) than in the second!  Hence, the DDS assessment is entirely in keeping with the appropriate use of error probabilities in interpreting tests. There is no conflict with p-value reasoning.
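The two power calculations above can be checked with a short sketch. It assumes, as in the text, a cut-off of 1.645 standard errors (α = .05) and treats σx as the standard error of the sample mean; the function name `power` is only illustrative.

```python
from statistics import NormalDist

Z = NormalDist()   # standard Normal distribution
Z_ALPHA = 1.645    # cut-off in standard-error units, as used in the text


def power(mu_alt: float, se: float, mu0: float = 0.0) -> float:
    """Probability that test T+ rejects H0: mu <= mu0 when the mean is mu_alt.

    T+ rejects when the sample mean exceeds mu0 + Z_ALPHA * se.
    """
    cutoff = mu0 + Z_ALPHA * se
    return 1 - Z.cdf((cutoff - mu_alt) / se)


pow_first = power(3.29, se=1.0)    # first case, sigma_x = 1
pow_second = power(3.29, se=2.0)   # second case, sigma_x = 2

print("power against 3.29, first case: ", round(pow_first, 2))
print("power against 3.29, second case:", round(pow_second, 2))
print("P(accept H0; mu = 3.29), second case:", round(1 - pow_second, 2))
```

The probability of erroneously accepting H0 at μ = 3.29 is simply one minus the power against 3.29, which is why the second case supports only a much weaker upper bound.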

NEW PROBLEM

The DDS power analyst always takes the worst case of just missing the cut-off for rejection. Compare instead

SEV(μ < 3.29) for the first test, and SEV(μ < 3.29) for the second (using the actual outcomes as SEV requires).

[i] p1= .081 and p2 = .242.
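For readers who want to check their answer numerically, here is a hedged sketch. It assumes the usual severity computation for a non-significant result in T+, namely SEV(μ < μ1) = P(M > observed mean; μ = μ1), which is not spelled out in this post; the function name `sev_upper` is illustrative.

```python
from statistics import NormalDist

Z = NormalDist()   # standard Normal distribution
M_OBS = 1.4        # observed sample mean in both tests


def sev_upper(mu1: float, se: float, m_obs: float = M_OBS) -> float:
    """Severity for the claim mu < mu1 given a non-significant result.

    Assumed definition: the probability of a sample mean larger than
    the one observed, were mu equal to mu1.
    """
    return 1 - Z.cdf((m_obs - mu1) / se)


sev_first = sev_upper(3.29, se=1.0)    # first test, sigma_x = 1
sev_second = sev_upper(3.29, se=2.0)   # second test, sigma_x = 2

print("SEV(mu < 3.29), first test: ", round(sev_first, 3))
print("SEV(mu < 3.29), second test:", round(sev_second, 3))
```

Unlike the DDS calculation, this uses the actual outcome (1.4) rather than the cut-off for rejection, so the two assessments need not coincide.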

Categories: Statistics

## U-Phil: Is the Use of Power* Open to a Power Paradox?

* to assess Detectable Discrepancy Size (DDS)

In my last post, I argued that DDS type calculations (also called Neymanian power analysis) provide needful information to avoid fallacies of acceptance in the test T+; whereas, the corresponding confidence interval does not (at least not without special testing supplements).  But some have argued that DDS computations are “fundamentally flawed” leading to what is called the “power approach paradox”, e.g., Hoenig and Heisey (2001).

We are to consider two variations on the one-tailed test T+: H0: μ ≤ 0 versus H1: μ > 0 (p. 21).  Following their terminology and symbols:  The Z value in the first, Zp1, exceeds the Z value in the second, Zp2, although the same observed effect size occurs in both[i], and both have the same sample size, implying that σ1 < σ2.  For example, suppose σx1 = 1 and σx2 = 2.  Let observed sample mean M be 1.4 for both cases, so Zp1 = 1.4 and Zp2 = .7. They note that for any chosen power, the computable detectable discrepancy size will be smaller in the first experiment, and for any conjectured effect size, the computed power will always be higher in the first experiment.

“These results lead to the nonsensical conclusion that the first experiment provides the stronger evidence for the null hypothesis (because the apparent power is higher but significant results were not obtained), in direct contradiction to the standard interpretation of the experimental results (p-values).” (p. 21)

But rather than showing the DDS assessment to be “nonsensical”, or revealing any direct contradiction with the interpretation of p-values, this just demonstrates something nonsensical in their interpretation of the two p-value results from tests with different variances.  Since it’s Sunday night and I’m nursing[ii] overexposure to rowing in the Queen’s Jubilee boats in the rain and wind, how about you find the howler in their treatment. (Also please inform us of articles pointing this out in the last decade, if you know of any.)
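As an aid to working through the example, the following sketch computes the quantities in play for the two experiments: the Z value, the p-value, and the detectable discrepancy size for a chosen power. It assumes a cut-off of 1.645 standard errors (α = .05), a chosen power of .95, and that σx is the standard error of the mean; these choices and the function name `summarize` are illustrative, not taken from Hoenig and Heisey.

```python
from statistics import NormalDist

Z = NormalDist()   # standard Normal distribution
M_OBS = 1.4        # same observed sample mean in both experiments
Z_ALPHA = 1.645    # assumed cut-off in standard-error units (alpha = .05)


def summarize(se: float, target_power: float = 0.95):
    """Return (Z value, p-value, detectable discrepancy size) for test T+."""
    z = M_OBS / se                      # observed Z value
    p = 1 - Z.cdf(z)                    # one-sided p-value
    # Discrepancy from 0 that T+ detects with probability target_power
    dds = Z_ALPHA * se + Z.inv_cdf(target_power) * se
    return z, p, dds


z1, p1, dds1 = summarize(se=1.0)   # first experiment
z2, p2, dds2 = summarize(se=2.0)   # second experiment

print("experiment 1:", round(z1, 2), round(p1, 3), round(dds1, 2))
print("experiment 2:", round(z2, 2), round(p2, 3), round(dds2, 2))
```

The first experiment has the larger Z value and the smaller detectable discrepancy size, which is the pattern the authors describe.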

______________________

Hoenig, J. M. and D. M. Heisey (2001), “The Abuse of Power: The Pervasive Fallacy of Power Calculations in Data Analysis,” The American Statistician, 55: 19-24.

[i] The subscript indicates the p-value of the associated Z value.

[ii] With English tea and a cup of strong “Elbar grease”.

Categories: Statistics, U-Phil

## Anything Tests Can do, CIs do Better; CIs Do Anything Better than Tests?* (reforming the reformers cont.)

*The title is to be sung to the tune of “Anything You Can Do I Can Do Better”  from one of my favorite plays, Annie Get Your Gun (‘you’ being replaced by ‘test’).

This post may be seen to continue the discussion in the May 17 post on Reforming the Reformers.

Consider again our one-sided Normal test T+, with null H0: μ ≤ μ0 vs H1: μ > μ0, where μ0 = 0, α = .025, and σ = 1, but let n = 25. So M is statistically significant only if it exceeds .392. Suppose M just misses significance, say

Mo = .39.

The flip side of a fallacy of rejection (discussed before) is a fallacy of acceptance, or the fallacy of misinterpreting statistically insignificant results.  To avoid the age-old fallacy of taking a statistically insignificant result as evidence of zero (0) discrepancy from the null hypothesis μ = μ0, we wish to identify discrepancies that can and cannot be ruled out.  For our test T+, we reason from insignificant results to inferential claims of the form:

μ < μ0 + γ

Fisher continually emphasized that failure to reject was not evidence for the null.  Neyman, we saw, in chastising Carnap, argued for the following kind of power analysis:

Neymanian Power Analysis (Detectable Discrepancy Size DDS): If data x are not statistically significantly different from H0, and the power to detect discrepancy γ is high (low), then x constitutes good (poor) evidence that the actual effect is no greater than γ. (See 11/9/11 post)
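A minimal sketch of this kind of power analysis for the test T+ described above (μ0 = 0, α = .025, σ = 1, n = 25) is given below. The particular discrepancies γ in the loop are arbitrary illustrations, and the function name `power_against` is an assumption of the sketch.

```python
from statistics import NormalDist

Z = NormalDist()                 # standard Normal distribution
MU0, ALPHA, SIGMA, N = 0.0, 0.025, 1.0, 25

SE = SIGMA / N ** 0.5                          # standard error of the sample mean
CUTOFF = MU0 + Z.inv_cdf(1 - ALPHA) * SE       # M is significant only above this


def power_against(gamma: float) -> float:
    """Probability that T+ rejects H0 when mu = MU0 + gamma."""
    return 1 - Z.cdf((CUTOFF - (MU0 + gamma)) / SE)


print("cut-off for significance:", round(CUTOFF, 3))
for gamma in (0.2, 0.4, 0.6, 0.8):   # illustrative discrepancies
    print("gamma =", gamma, "power =", round(power_against(gamma), 3))
```

On the DDS reading, a non-significant result is good evidence that μ < μ0 + γ only for those γ against which the power is high.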

By taking into account the actual x0, a more nuanced post-data reasoning may be obtained.

“In the Neyman-Pearson theory, sensitivity is assessed by means of the power—the probability of reaching a preset level of significance under the assumption that various alternative hypotheses are true. In the approach described here, sensitivity is assessed by means of the distribution of the random variable P, considered under the assumption of various alternatives. “ (Cox and Mayo 2010, p. 291):