I’m reblogging a post from Christmas past–exactly 7 years ago. Guess what I gave as the number 1 (of 13) ~~howler~~ well-worn criticism of statistical significance tests, haunting us back in 2012–all of which are put to rest in Mayo and Spanos 2011? Yes, it’s the frightening allegation that statistical significance tests forbid using any background knowledge! The researcher is imagined to start with a “blank slate” in each inquiry (no memories of fallacies past), and then unthinkingly apply a purely formal, automatic, accept-reject machine. What’s newly frightening (in 2019) is the credulity with which this apparition is now being met (by some). I make some new remarks below the post from Christmas past: Continue reading

# Posts Tagged With: criticism of frequentist methods

## Posts of Christmas Past (1): 13 howlers of significance tests (and how to avoid them)

## Fallacies of Rejection, Nouvelle Cuisine, and assorted New Monsters

Whenever I’m in London, my criminologist friend Katrin H. and I go in search of stand-up comedy. Since it’s Saturday night (and I’m in London), we’re setting out in search of a good comedy club (I’ll complete this post upon return). A few years ago we heard Jackie Mason do his shtick, a one-man show billed as his swan song to England. It was like a repertoire of his “Greatest Hits” without a new or updated joke in the mix. Still, hearing his rants for the n^{th} time was often quite hilarious. It turns out that he has already been back doing another “final shtick tour” in England, but not tonight.

A sample: If you want to eat nothing, eat nouvelle cuisine. Do you know what it means? No food. The smaller the portion the more impressed people are, so long as the food’s got a fancy French name, haute cuisine. An empty plate with sauce!

As one critic wrote, Mason’s jokes “offer a window to a different era,” one whose caricatures and biases one can only hope we’ve moved beyond:

But it’s one thing for Jackie Mason to scowl at a seat in the front row and yell to the shocked audience member in his imagination, “These are jokes! They are just jokes!” and another to reprise statistical howlers, which are not jokes, to me. This blog found its reason for being partly as a place to expose, understand, and avoid them. I had earlier used this Jackie Mason opening to launch into a well-known fallacy of rejection using statistical significance tests. I’m going to go further this time around. I began by needling some leading philosophers of statistics: Continue reading

## Oxford Gaol: Statistical Bogeymen

Memory Lane: 3 years ago. Oxford Jail (also called Oxford Castle) is an entirely fitting place to be on (and around) Halloween! Moreover, rooting around this rather lavish set of jail cells (what used to be a single cell is now a dressing room) is every bit as conducive to philosophical reflection as is exile on Elba! (It is now a boutique hotel, though many of the rooms are still too jail-like for me.) My goal (while in this gaol—as the English sometimes spell it) is to try and free us from the bogeymen and bogeywomen often associated with “classical” statistics. As a start, the very term “classical statistics” should, I think, be shelved, not that names should matter.

In appraising statistical accounts at the foundational level, we need to realize the extent to which accounts are viewed through the eyeholes of a mask or philosophical theory. Moreover, the mask some wear while pursuing this task might well be at odds with their ordinary way of looking at evidence, inference, and learning. In any event, to avoid non-question-begging criticisms, the standpoint from which the appraisal is launched must itself be independently defended. But for (most) Bayesian critics of error statistics the assumption that uncertain inference demands a posterior probability for claims inferred is thought to be so obvious as not to require support. Critics are implicitly making assumptions that are at odds with the frequentist statistical philosophy. In particular, they assume a certain philosophy about statistical inference (probabilism), often coupled with the allegation that error statistical methods can only achieve radical behavioristic goals, wherein all that matters are long-run error rates (of some sort).

Criticisms then follow readily, taking the form of one or both of the following:

- Error probabilities do not supply posterior probabilities in hypotheses; if they are interpreted as if they do (and some say we just can’t help it), they lead to inconsistencies.
- Methods with good long-run error rates can give rise to counterintuitive inferences in particular cases.

I have proposed an alternative philosophy that replaces these tenets with different ones:

- the role of probability in inference is to quantify how reliably or severely claims (or discrepancies from claims) have been tested;
- the severity goal directs us to the relevant error probabilities, avoiding the oft-repeated statistical fallacies due to tests that are overly sensitive, as well as those insufficiently sensitive to particular errors;
- control of long-run error probabilities, while necessary, is not sufficient for good tests or warranted inferences.

## Fallacy of Rejection and the Fallacy of Nouvelle Cuisine

Any Jackie Mason fans out there? In connection with our discussion of power, and associated fallacies of rejection*–and since it’s Saturday night–I’m reblogging the following post.

In February [2012], in London, criminologist Katrin H. and I went to see Jackie Mason do his shtick, a one-man show billed as his swan song to England. It was like a repertoire of his “Greatest Hits” without a new or updated joke in the mix. Still, hearing his rants for the nth time was often quite hilarious.

A sample: If you want to eat nothing, eat nouvelle cuisine. Do you know what it means? No food. The smaller the portion the more impressed people are, so long as the food’s got a fancy French name, haute cuisine. An empty plate with sauce!

As one critic wrote, Mason’s jokes “offer a window to a different era,” one whose caricatures and biases one can only hope we’ve moved beyond: But it’s one thing for Jackie Mason to scowl at a seat in the front row and yell to the shocked audience member in his imagination, “These are jokes! They are just jokes!” and another to reprise statistical howlers, which are not jokes, to me. This blog found its reason for being partly as a place to expose, understand, and avoid them. Recall the September 26, 2011 post “Whipping Boys and Witch Hunters”: [i]

Fortunately, philosophers of statistics would surely not reprise decades-old howlers and fallacies. After all, it is the philosopher’s job to clarify and expose the conceptual and logical foibles of others; and even if we do not agree, we would never merely disregard and fail to address the criticisms in published work by other philosophers. Oh wait, ….one of the leading texts repeats the fallacy in their third edition: Continue reading

## Comedy hour at the Bayesian (epistemology) retreat: highly probable vs highly probed (vs B-boosts)

Since we’ll be discussing Bayesian confirmation measures in next week’s seminar—the relevant blogpost being here—let’s listen in to one of the comedy hours at the Bayesian retreat as reblogged from May 5, 2012.

The problem was, the epistemic probability in H was so low that H couldn’t be believed! Instead we believe its denial H’! So, she will infer hypotheses that are simply unbelievable!

So it appears the error statistical testing account fails to serve as an account of knowledge or evidence (i.e., an epistemic account). However severely I might wish to say that a hypothesis *H* has passed a test, this Bayesian critic assigns a sufficiently low prior probability to *H* so as to yield a low posterior probability in *H*[i]. But this is no argument about why this counts in favor of, rather than against, their particular Bayesian computation as an appropriate assessment of the warrant to be accorded to hypothesis *H*.

To begin with, in order to use techniques for assigning frequentist probabilities to events, their examples invariably involve “hypotheses” that consist of asserting that a sample possesses a characteristic, such as “having a disease” or “being college ready” or, for that matter, “being true.” This would not necessarily be problematic if it were not for the fact that their criticism requires shifting the probability to the particular sample selected—for example, a student Isaac is college-ready, or this null hypothesis (selected from a pool of nulls) is true. This was, recall, the fallacious probability assignment that we saw in Berger’s attempt, later (perhaps) disavowed. Also there are just two outcomes, say s and ~s, and no degrees of discrepancy from H. Continue reading

## Oxford Gaol: Statistical Bogeymen

Memory Lane: 2 years ago. Oxford Jail (also called Oxford Castle) is an entirely fitting place to be on (and around) Halloween! Moreover, rooting around this rather lavish set of jail cells (what used to be a single cell is now a dressing room) is every bit as conducive to philosophical reflection as is exile on Elba! (I’m serious, it is now a boutique hotel.) My goal (while in this gaol—as the English sometimes spell it) is to try and free us from the bogeymen and bogeywomen often associated with “classical” statistics. As a start, the very term “classical statistics” should I think be shelved, not that names should matter.

In appraising statistical accounts at the foundational level, we need to realize the extent to which accounts are viewed through the eyeholes of a mask or philosophical theory. Moreover, the mask some wear while pursuing this task might well be at odds with their ordinary way of looking at evidence, inference, and learning. In any event, to avoid non-question-begging criticisms, the standpoint from which the appraisal is launched must itself be independently defended. But for (most) Bayesian critics of error statistics the assumption that uncertain inference demands a posterior probability for claims inferred is thought to be so obvious as not to require support. Critics are implicitly making assumptions that are at odds with the frequentist statistical philosophy. In particular, they assume a certain philosophy about statistical inference (probabilism), often coupled with the allegation that error statistical methods can only achieve radical behavioristic goals, wherein all that matters are long-run error rates (of some sort).

Criticisms then follow readily, taking the form of one or both of the following:

- Error probabilities do not supply posterior probabilities in hypotheses; if they are interpreted as if they do (and some say we just can’t help it), they lead to inconsistencies.
- Methods with good long-run error rates can give rise to counterintuitive inferences in particular cases.

I have proposed an alternative philosophy that replaces these tenets with different ones:

- the role of probability in inference is to quantify how reliably or severely claims (or discrepancies from claims) have been tested;
- the severity goal directs us to the relevant error probabilities, avoiding the oft-repeated statistical fallacies due to tests that are overly sensitive, as well as those insufficiently sensitive to particular errors;
- control of long-run error probabilities, while necessary, is not sufficient for good tests or warranted inferences.

What is key on the statistics side of this alternative philosophy is that the probabilities refer to the distribution of a statistic d(x)—the so-called sampling distribution. Hence such accounts are often called sampling theory accounts. Since the sampling distribution is the basis for error probabilities, another term might be error statistical. Continue reading

## Comedy hour at the Bayesian (epistemology) retreat: highly probable vs highly probed (vs what ?)

Our favorite high school student, Isaac, gets a better shot at showing his college readiness using one of the comparative measures of support or confirmation discussed last week. Their assessment thus seems more in sync with the severe tester, but they are not purporting that z is evidence for inferring (or even believing) an H to which z affords a high B-boost*. Their measures identify a third category that reflects the degree to which H would predict z (where the comparison might be predicting without z, or under ~H or the like). At least if we give it an empirical, rather than a purely logical, reading. Since it’s Saturday night let’s listen in to one of the comedy hours at the Bayesian retreat as reblogged from May 5, 2012.

The problem was, the epistemic probability in H was so low that H couldn’t be believed! Instead we believe its denial H’! So, she will infer hypotheses that are simply unbelievable!

So it appears the error statistical testing account fails to serve as an account of knowledge or evidence (i.e., an epistemic account). However severely I might wish to say that a hypothesis *H* has passed a test, this Bayesian critic assigns a sufficiently low prior probability to *H* so as to yield a low posterior probability in *H*[i]. But this is no argument about why this counts in favor of, rather than against, their particular Bayesian computation as an appropriate assessment of the warrant to be accorded to hypothesis *H*.

To begin with, in order to use techniques for assigning frequentist probabilities to events, their examples invariably involve “hypotheses” that consist of asserting that a sample possesses a characteristic, such as “having a disease” or “being college ready” or, for that matter, “being true.” This would not necessarily be problematic if it were not for the fact that their criticism requires shifting the probability to the particular sample selected—for example, a student Isaac is college-ready, or this null hypothesis (selected from a pool of nulls) is true. This was, recall, the fallacious probability assignment that we saw in Berger’s attempt, later (perhaps) disavowed. Also there are just two outcomes, say s and ~s, and no degrees of discrepancy from H. Continue reading

## First blog: “Did you hear the one about the frequentist…”? and “Frequentists in Exile”

*Dear Reader*: Tonight marks the 2-year anniversary of this blog; so I’m reblogging my very first posts from 9/3/11 here and here (from the rickety old blog site)*. (One was the “about”.) The current blog was included once again in the top 50 statistics blogs. Amazingly, I have received e-mails from different parts of the world describing experimental recipes for the special concoction we exiles favor! (Mine is here.) If you can fly over to the Elbar Room, please join us: I’m treating everyone to doubles of Elbar Grease! Thanks for reading and contributing! *D. G. Mayo*

(*The old blogspot is a big mix; it was before Rejected blogs. Yes, I still use this old typewriter [ii])

**“Overheard at the Comedy Club at the Bayesian Retreat” 9/3/11 by D. Mayo**

**“Did you hear the one about the frequentist . . .**

- “who claimed that observing “heads” on a biased coin that lands heads with probability .05 is evidence of a statistically significant improvement over the standard treatment of diabetes, on the grounds that such an event occurs with low probability (.05)?”

or

- “who defended the reliability of his radiation reading, despite using a broken radiometer, on the grounds that most of the time he uses one that works, so on average he’s pretty reliable?”

Such jests may work for an after-dinner laugh, but if it turns out that, despite being retreads of “straw-men” fallacies, they form the basis of why some reject frequentist methods, then they are not such a laughing matter. But surely the drubbing of frequentist methods could not be based on a collection of howlers, could it? I invite the curious reader to stay and find out.

If we are to take the criticisms seriously, and put to one side the possibility that they are deliberate distortions of frequentist statistical methods, we need to identify their sources. To this end I consider two interrelated areas around which to organize foundational issues in statistics: (1) the roles of probability in induction and inference, and (2) the nature and goals of statistical inference in science or learning. Frequentist sampling statistics, which I prefer to call “error statistics,” continues to be raked over the coals in the foundational literature, but with little scrutiny of the presuppositions about goals and methods, without which the criticisms lose all force.

First, there is the supposition that an adequate account must assign degrees of probability to hypotheses, an assumption often called *probabilism*. Second, there is the assumption that the main, if not the only, goal of error-statistical methods is to evaluate long-run error rates. Given the wide latitude with which some critics define “controlling long-run error,” it is not surprising to find them arguing that (i) error statisticians approve of silly methods, and/or (ii) rival (e.g., Bayesian) accounts also satisfy error statistical demands. Absent this sleight of hand, Bayesian celebrants would have to go straight to the finale of their entertainment hour: a rousing rendition of “There’s No Theorem Like Bayes’s Theorem.”

Never mind that frequentists have responded to these criticisms; they keep popping up (verbatim) in many Bayesian textbooks and articles on philosophical foundations. The difficulty of articulating a statistical philosophy that fully explains the basis for both (i) insisting on error-statistical guarantees, while (ii) avoiding pathological examples in practice, has turned many a frequentist away from venturing into foundational battlegrounds. Some even concede the distorted perspectives drawn from overly literal and radical expositions of what Fisher, Neyman, and Pearson “really thought”. Many others just find the “statistical wars” distasteful.

Here is where I view my contribution—as a philosopher of science—to the long-standing debate: not merely to call attention to the howlers that pass as legitimate criticisms of frequentist error statistics, but also to sketch the main lines of an alternative statistical philosophy within which to better articulate the roles and value of frequentist tools. Let me be clear that I do not consider this the only philosophical framework for frequentist statistics—different terminology could do as well. I will consider myself successful if I can provide one way of building, or one standpoint from which to build, a frequentist, error-statistical philosophy.

But given this is a blog, I shall be direct and to the point: I hope to cultivate the interests of others who might want to promote intellectual honesty within a generally very lopsided philosophical debate. I will begin with the first entry to the comedy routine, as it is put forth by leading Bayesians……

___________________________________________

**“Frequentists in Exile” 9/3/11 by D. Mayo**

Confronted with the position that “arguments for this personalistic theory were so persuasive that anything to any extent inconsistent with that theory should be discarded” (Cox 2006, 196), frequentists might have seen themselves in a kind of exile when it came to foundations, even those who had been active in the dialogues of an earlier period [i]. Sometime around the late 1990s there were signs that this was changing. Regardless of the explanation, the fact that it did occur and is occurring is of central importance to statistical philosophy.

Now that Bayesians have stepped off their a priori pedestal, it may be hoped that a genuinely deep scrutiny of the frequentist and Bayesian accounts will occur. In some corners of practice it appears that frequentist error statistical foundations are being discovered anew. Perhaps frequentist foundations, never made fully explicit, but at most lying deep below the ocean floor, are finally being disinterred. But let’s learn from some of the mistakes in the earlier attempts to understand it. With this goal I invite you to join me in some deep water drilling, here as I cast about on my Isle of Elba.

Cox, D. R. (2006), *Principles of Statistical Inference*, CUP.

________________________________________________

[i] Yes, that’s the Elba connection: Napoleon’s exile (from which he returned to fight more battles).

[ii] I have discovered a very reliable antique typewriter shop in Oxford that was able to replace the two missing typewriter keys. So long as my “ribbons” and carbon sheets don’t run out, I’m set.

## Bad news bears: ‘Bayesian bear’ rejoinder-reblog mashup

Oh No! It’s those mutant bears again. To my dismay, I’ve been sent, for the *third* time, that silly, snarky, adolescent, clip of those naughty “what the p-value” bears (first posted on Aug 5, 2012), who cannot seem to get a proper understanding of significance tests into their little bear brains. So apparently some people haven’t seen my rejoinder which, as I said then, practically wrote itself. So since it’s Saturday night here at the Elbar Room, let’s listen in to a mashup of both the clip and my original rejoinder (in which p-value bears are replaced with hypothetical Bayesian bears).*

These stilted bear figures and their voices are sufficiently obnoxious in their own right, even without the tedious lampooning of p-values and the feigned horror at learning they should not be reported as posterior probabilities.

*Mayo’s Rejoinder:*

*Bear #1:* Do you have the results of the study?

*Bear #2:* Yes. The good news is there is a .996 probability of a positive difference in the main comparison.

*Bear #1:* Great. So I can be well assured that there is just a .004 probability that such positive results would occur if they were merely due to chance.

*Bear #2:* Not really, that would be an incorrect interpretation. Continue reading

## Anything Tests Can do, CIs do Better; CIs Do Anything Better than Tests?* (reforming the reformers cont.)

Having reblogged the 5/17/12 post on “reforming the reformers” yesterday, I thought I should reblog its follow-up: 6/2/12.

Consider again our one-sided Normal test T+, with null *H*_{0}: μ ≤ μ_{0} vs. alternative *H*_{1}: μ > μ_{0}, where μ_{0} = 0, α = .025, and σ = 1, but let n = 25. Since the standard error of the sample mean M is σ/√n = .2, M is statistically significant only if it exceeds 1.96 × .2 ≈ .392. Suppose M just misses significance, say

M_{0} = .39.
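To make the setup concrete, the cutoff can be checked with a few lines of Python (my own sketch; the post gives only the resulting numbers):

```python
from math import sqrt
from statistics import NormalDist

# Test T+ from the post: one-sided Normal test of mu <= 0 vs mu > 0, known sigma
sigma, n, alpha, mu0 = 1.0, 25, 0.025, 0.0
se = sigma / sqrt(n)                       # standard error of the sample mean M: 0.2
z_crit = NormalDist().inv_cdf(1 - alpha)   # ~1.96
cutoff = mu0 + z_crit * se                 # significance cutoff for M

print(round(cutoff, 3))    # 0.392
print(0.39 > cutoff)       # False: M0 = .39 just misses significance
```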

The flip side of a *fallacy of rejection* (discussed before) is a *fallacy of acceptance*, or the fallacy of misinterpreting statistically insignificant results. To avoid the age-old fallacy of taking a statistically insignificant result as evidence of zero (0) discrepancy from the null hypothesis μ =μ_{0}, we wish to identify discrepancies that can and cannot be ruled out. For our test T+, we reason from insignificant results to inferential claims of the form:

μ < μ_{0} + γ

Fisher continually emphasized that failure to reject was not evidence for the null. Neyman, we saw, in chastising Carnap, argued for the following kind of power analysis:

**Neymanian Power Analysis (Detectable Discrepancy Size DDS)**: If data *x* are not statistically significantly different from *H*_{0}, and the power to detect discrepancy γ is high (low), then *x* constitutes good (poor) evidence that the actual effect is < γ. (See 11/9/11 post.)

By taking into account the actual *x*_{0}, a more nuanced post-data reasoning may be obtained.
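The DDS reasoning can be made concrete with a short computation (my own sketch, using the T+ numbers above: μ_{0} = 0, σ = 1, n = 25, α = .025):

```python
from math import sqrt
from statistics import NormalDist

N = NormalDist()
se = 1.0 / sqrt(25)                 # sigma / sqrt(n) = 0.2
cutoff = N.inv_cdf(0.975) * se      # ~0.392, the alpha = .025 cutoff for M

def power(gamma):
    """Power of T+ against mu = gamma: P(M exceeds the cutoff; mu = gamma)."""
    return 1 - N.cdf((cutoff - gamma) / se)

# High power: an insignificant result is good evidence that mu < 1.
print(round(power(1.0), 3))    # 0.999
# Low power: an insignificant result is poor evidence that mu < 0.2.
print(round(power(0.2), 3))    # 0.169
```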

“In the Neyman-Pearson theory, sensitivity is assessed by means of the power—the probability of reaching a preset level of significance under the assumption that various alternative hypotheses are true. In the approach described here, sensitivity is assessed by means of the distribution of the random variable P, considered under the assumption of various alternatives.” (Cox and Mayo 2010, p. 291)

This may be captured in:

FEV(ii): A moderate p-value is evidence of the absence of a discrepancy *d* from *H*_{0} only if there is a high probability the test would have given a worse fit with *H*_{0} (i.e., a smaller p-value) were a discrepancy *d* to exist. (Mayo and Cox 2005, 2010, p. 256)

This is equivalently captured in the Rule of Acceptance (Mayo (EGEK) 1996), and in the severity interpretation for acceptance, SIA (Mayo and Spanos 2006, p. 337):

SIA: (a): If there is a very high probability that [the observed difference] would have been larger than it is, were μ > μ1, then μ < μ1 passes the test with high severity,…
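The severity assessment behind SIA can be sketched in a few lines (my own illustration, not code from the cited papers): for the just-insignificant observed mean of .39, SEV(μ < μ1) is the probability the test would have produced a larger difference than the one observed, were μ as great as μ1.

```python
from math import sqrt
from statistics import NormalDist

N = NormalDist()
se = 1.0 / sqrt(25)   # sigma / sqrt(n) = 0.2
M0 = 0.39             # observed sample mean, just short of the ~.392 cutoff

def severity_mu_less_than(mu1):
    """SEV(mu < mu1): P(M would have exceeded the observed M0; mu = mu1)."""
    return 1 - N.cdf((M0 - mu1) / se)

print(round(severity_mu_less_than(0.8), 2))   # 0.98: mu < 0.8 passes with high severity
print(round(severity_mu_less_than(0.2), 3))   # 0.171: mu < 0.2 is poorly warranted
```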

But even taking tests and CIs just as we find them, we see that CIs do not avoid the fallacy of acceptance: they do not block erroneous construals of negative results adequately. Continue reading

## 13 well-worn criticisms of significance tests (and how to avoid them)

2013 is right around the corner, and here are 13 well-known criticisms of statistical significance tests, and how they are addressed within the error statistical philosophy, as discussed in Mayo, D. G. and Spanos, A. (2011), “Error Statistics”.

- (#1) Error statistical tools forbid using any background knowledge.
- (#2) All statistically significant results are treated the same.
- (#3) The p-value does not tell us how large a discrepancy is found.
- (#4) With large enough sample size even a trivially small discrepancy from the null can be detected.
- (#5) Whether there is a statistically significant difference from the null depends on which is the null and which is the alternative.
- (#6) Statistically insignificant results are taken as evidence that the null hypothesis is true.
- (#7) Error probabilities are misinterpreted as posterior probabilities.
- (#8) Error statistical tests are justified only in cases where there is a very long (if not infinite) series of repetitions of the same experiment.
- (#9) Specifying statistical tests is too arbitrary.
- (#10) We should be doing confidence interval estimation rather than significance tests.
- (#11) Error statistical methods take into account the intentions of the scientists analyzing the data.
- (#12) All models are false anyway.
- (#13) Testing assumptions involves illicit data-mining.

You can read how we avoid them in the full paper here.

*Mayo, D. G. and Spanos, A. (2011), “Error Statistics,” in Philosophy of Statistics, Handbook of the Philosophy of Science, Volume 7 (general editors: Dov M. Gabbay, Paul Thagard and John Woods; volume eds. Prasanta S. Bandyopadhyay and Malcolm R. Forster), Elsevier: 1–46.*

## Reblogging: Oxford Gaol: Statistical Bogeymen

Reblogging 1 year ago in Oxford: Oxford Jail is an entirely fitting place to be on Halloween!

Moreover, rooting around this rather lavish set of jail cells (what used to be a single cell is now a dressing room) is every bit as conducive to philosophical reflection as is exile on Elba! My goal (while in this gaol—as the English sometimes spell it) is to try and free us from the bogeymen and bogeywomen often associated with “classical” statistics. As a start, the very term “classical statistics” should I think be shelved, not that names should matter.

In appraising statistical accounts at the foundational level, we need to realize the extent to which accounts are viewed through the eyeholes of a mask or philosophical theory. Moreover, the mask some wear while pursuing this task might well be at odds with their ordinary way of looking at evidence, inference, and learning. In any event, to avoid non-question-begging criticisms, the standpoint from which the appraisal is launched must itself be independently defended. But for Bayesian critics of error statistics the assumption that uncertain inference demands a posterior probability for claims inferred is thought to be so obvious as not to require support. Critics are implicitly making assumptions that are at odds with the frequentist statistical philosophy. In particular, they assume a certain philosophy about statistical inference (probabilism), often coupled with the allegation that error statistical methods can only achieve radical behavioristic goals, wherein all that matters are long-run error rates (of some sort). Continue reading

## Return to the comedy hour…(on significance tests)

These days, so many theater productions are updated reviews of older standards. Same with the comedy hours at the Bayesian retreat, and task force meetings of significance test reformers. So (on the 1-year anniversary of this blog) let’s listen in to one of the earliest routines (with highest blog hits), but with some new reflections (first considered here and here).

**“Did you hear the one about the frequentist . . .**

“who claimed that observing “heads” on a biased coin that lands heads with probability .05 is evidence of a statistically significant improvement over the standard treatment of diabetes, on the grounds that such an event occurs with low probability (.05)?”

The joke came from J. Kadane’s Principles of Uncertainty (2011, CRC Press*).

“Flip a biased coin that comes up heads with probability 0.95, and tails with probability 0.05. If the coin comes up tails reject the null hypothesis. Since the probability of rejecting the null hypothesis if it is true is 0.05, this is a valid 5% level test. It is also very robust against data errors; indeed it does not depend on the data at all. It is also nonsense, of course, but nonsense allowed by the rules of significance testing.” (439)

Much laughter.
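A quick simulation (my own sketch, not from Kadane’s book) makes the two halves of the joke explicit: the rule’s size is indeed .05, yet its rejection probability is .05 no matter what is true, so it has no capacity whatsoever to detect any discrepancy.

```python
import random

random.seed(1)

def kadane_test(_data=None):
    """Kadane's 'test': ignore the data entirely; reject H0 iff a biased coin
    lands tails (probability .05)."""
    return random.random() < 0.05   # True means "reject H0"

trials = 200_000
rejection_rate = sum(kadane_test() for _ in range(trials)) / trials
# The size is ~.05, so the rule is "valid" in form; but the rate is ~.05
# whether H0 is true or false, i.e., the "test" has no power.
print(round(rejection_rate, 2))   # ~0.05
```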

___________________

But is it allowed? I say no. The null hypothesis in the joke can be in any field; perhaps it concerns mean transmission of Scrapie in mice (as in my early Kuru post). I know some people view significance tests as merely rules that rarely reject erroneously, but I claim this is mistaken. Both in significance tests and in scientific hypothesis testing more generally, data indicate inconsistency with *H* only by being *counter to what would be expected under the assumption that H is correct (as regards a given aspect observed)*. Were someone to tell Prusiner that the testing methods he follows actually allow any old “improbable” event (a stock split in Apple?) to reject a hypothesis about prion transmission rates, Prusiner would say that person didn’t understand the requirements of hypothesis testing in science. Since the criticism would hold no water in the analogous case of Prusiner’s test, it must equally miss its mark in the case of significance tests**. That, recall, was Rule #1. Continue reading

## A “Bayesian Bear” rejoinder practically writes itself…

These stilted bear figures and their voices are sufficiently obnoxious in their own right, even without the tedious lampooning of p-values and the feigned horror at learning they should not be reported as posterior probabilities. Coincidentally, I have been sent several different p-value YouTube clips in the past two weeks, rehearsing essentially the same interpretive issues, but this one (“what the p-value”*) was created by some freebie outfit that will apparently set their irritating cartoon bear voices to your very own dialogue (I don’t know the website or outfit).

The presumption is that somehow there would be no questions or confusion of interpretation were the output in the form of a posterior probability. The problem of indicating the extent of discrepancies that are/are not warranted by a given p-value is genuine but easy enough to solve**. What I never understand is why it is presupposed that the most natural and unequivocal way to interpret and communicate evidence (in this case, leading to low p-values) is by means of a (posterior) probability assignment, when it seems clear that the more relevant question the testy-voiced (“just wait a tick”) bear would put to the know-it-all bear would be: *how often would this method erroneously declare a genuine discrepancy?* A corresponding “Bayesian bear” video practically writes itself, but I’ll let you watch this first. Share any narrative lines that come to mind.

*Reference: Blume, J. and J. F. Peipert (2003). “What your statistician never told you about P-values.” J Am Assoc Gynecol Laparosc 10(4): 439-444.

**See for example, Mayo & Spanos (2011) ERROR STATISTICS

## Stephen Senn: Randomization, ratios and rationality: rescuing the randomized clinical trial from its critics

**Stephen Senn**

*Head of the Methodology and Statistics Group,*

*Competence Center for Methodology and Statistics (CCMS), Luxembourg*

An issue sometimes raised about randomized clinical trials is the problem of indefinitely many confounders. This, for example, is what John Worrall has to say:

Even if there is only a small probability that an individual factor is unbalanced, given that there are indefinitely many possible confounding factors, then it would seem to follow that the probability that there is some factor on which the two groups are unbalanced (when remember randomly constructed) might for all anyone knows be high. (Worrall, J. “What Evidence in Evidence-Based Medicine?” *Philosophy of Science* 2002; 69: S316–S330; see page S324)

It seems to me, however, that this overlooks four matters. The first is that it is not indefinitely many variables we are interested in but only one, albeit one we can’t measure perfectly. This variable can be called ‘outcome’. We wish to see to what extent the difference observed in outcome between groups is compatible with the idea that chance alone explains it. The indefinitely many covariates can help us predict outcome but they are only of interest to the extent that they do so. However, although we can’t measure the difference we would have seen in outcome between groups in the absence of treatment, we can measure how much it varies within groups (where the variation cannot be due to differences between treatments). Thus we can say a great deal about random variation to the extent that group membership is indeed random.

The second point is that in the absence of a treatment effect, where randomization has taken place, the statistical theory predicts probabilistically how the variation in outcome between groups relates to the variation within. Continue reading
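Senn’s two points can be illustrated by simulation. In this sketch (the 10-covariate setup and all numbers are my own illustration, not Senn’s), each unit’s outcome is the sum of many covariates we never measure, yet a randomization test, calibrated by rerandomizing the (genuinely random) group labels, rejects a true null at close to its nominal rate:

```python
import random

def mean(xs):
    return sum(xs) / len(xs)

def randomization_p(outcomes, n_treat, n_rerand=400, rng=None):
    """Randomization-test p-value for the difference in group means.
    The reference distribution comes from rerandomizing the labels,
    justified exactly because group membership was random."""
    rng = rng or random.Random()
    obs = abs(mean(outcomes[:n_treat]) - mean(outcomes[n_treat:]))
    ys = list(outcomes)
    hits = 0
    for _ in range(n_rerand):
        rng.shuffle(ys)
        if abs(mean(ys[:n_treat]) - mean(ys[n_treat:])) >= obs:
            hits += 1
    return (hits + 1) / (n_rerand + 1)

def type1_rate(trials=300, n=20, n_cov=10, alpha=0.1, seed=2):
    """No treatment effect; outcomes driven by many unmeasured covariates."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(trials):
        # outcome = sum of covariates we never observe individually
        outcomes = [sum(rng.gauss(0, 1) for _ in range(n_cov))
                    for _ in range(n)]
        rng.shuffle(outcomes)  # random assignment; first n//2 are 'treated'
        if randomization_p(outcomes, n // 2, rng=rng) <= alpha:
            rejections += 1
    return rejections / trials

rate = type1_rate()  # hovers near alpha despite the many confounders
```

The indefinitely many covariates appear only through the single variable, outcome, and the within-group shuffling tells us how much it varies by chance, which is Senn’s point.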

## Answer to the Homework & a New Exercise

Debunking the “power paradox” allegation from my previous post. The authors consider a one-tailed Z test of the hypothesis *H*_{0}: μ ≤ 0 versus *H*_{1}: μ > 0: our Test T+. The observed sample mean is M = 1.4; in the first case σ_{x} = 1, and in the second case σ_{x} = 2.

First case: The power against μ = 3.29 is high, .95 (i.e., P(Z > 1.645; μ = 3.29) = 1 – Φ(–1.645) = .95), and thus the DDS assessor would take the result as a good indication that μ < 3.29.

Second case: For σ_{x} = 2, the cut-off for rejection would be 0 + 1.645(2) = 3.29.

So, in the second case (σ_{x} = 2) the probability of erroneously accepting *H*_{0}, even if μ were as high as 3.29, is .5! (i.e., P(Z ≤ 1.645; μ = 3.29) = Φ(1.645 – (3.29/2)) = Φ(0) = .5.) Although p_{1} < p_{2},[i] the justifiable upper bound in the first test is *smaller* (closer to 0) than in the second! Hence, the DDS assessment is entirely in keeping with the appropriate use of error probabilities in interpreting tests. There is no conflict with p-value reasoning.
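The arithmetic in the two cases is easy to check. A quick sketch using the standard normal CDF Φ (cut-off z = 1.645, as in the post):

```python
from math import erf, sqrt

def Phi(z):
    """Standard normal CDF."""
    return 0.5 * (1 + erf(z / sqrt(2)))

def power(mu, sigma_x, z_cut=1.645):
    """Power of Test T+ against alternative mu: P(Z > z_cut; mu)."""
    return 1 - Phi(z_cut - mu / sigma_x)

pow1 = power(3.29, 1)      # first case:  ~.95
pow2 = power(3.29, 2)      # second case: exactly .5
pv1 = 1 - Phi(1.4 / 1)     # p1 ~ .081 for M = 1.4, sigma_x = 1
pv2 = 1 - Phi(1.4 / 2)     # p2 ~ .242 for M = 1.4, sigma_x = 2
```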

NEW PROBLEM

The DDS power analyst always takes the worst case of just missing the cut-off for rejection. Compare instead

SEV(μ < 3.29) for the first test, and SEV(μ < 3.29) for the second (using the actual outcomes as SEV requires).

[i] p_{1}= .081 and p_{2} = .242.

## U-Phil: Is the Use of Power* Open to a Power Paradox?

*To assess Detectable Discrepancy Size (DDS)*

In my last post, I argued that DDS type calculations (also called Neymanian power analysis) provide needful information to avoid fallacies of acceptance in the test T+; whereas, the corresponding confidence interval does not (at least not without special testing supplements). But some have argued that DDS computations are “fundamentally flawed” leading to what is called the “power approach paradox”, e.g., Hoenig and Heisey (2001).

We are to consider two variations on the one-tailed test T+: *H*_{0}: μ ≤ 0 versus *H*_{1}: μ > 0 (p. 21). Following their terminology and symbols: The Z value in the first, Z_{p1}, exceeds the Z value in the second, Z_{p2}, although the same observed effect size occurs in both[i], and both have the same sample size, implying that σ_{1} < σ_{2}. For example, suppose σ_{x1} = 1 and σ_{x2} = 2. Let observed sample mean M be 1.4 for both cases, so Z_{p1} = 1.4 and Z_{p2} = .7. They note that for any chosen power, the computable detectable discrepancy size will be smaller in the first experiment, and for any conjectured effect size, the computed power will always be higher in the first experiment.

“These results lead to the nonsensical conclusion that the first experiment provides the stronger evidence for the null hypothesis (because the apparent power is higher but significant results were not obtained), in direct contradiction to the standard interpretation of the experimental results (p-values).” (p. 21)

But rather than showing the DDS assessment to be “nonsensical”, or any direct contradiction in interpreting p-values, this just demonstrates something nonsensical in their interpretation of the two p-value results from tests with different variances. Since it’s Sunday night and I’m nursing[ii] overexposure to rowing in the Queen’s Jubilee boats in the rain and wind, how about you find the howler in their treatment. (Also please inform us of articles pointing this out in the last decade, if you know of any.)
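To see where the howler lives, it helps to lay the two experiments’ numbers side by side. A sketch of the Hoenig and Heisey setup (my own check, with σ_{x1} = 1, σ_{x2} = 2, and M = 1.4):

```python
from math import erf, sqrt

def Phi(z):
    """Standard normal CDF."""
    return 0.5 * (1 + erf(z / sqrt(2)))

def power(mu, sigma_x, z_cut=1.645):
    """P(Z > z_cut; mu): the power of T+ against alternative mu."""
    return 1 - Phi(z_cut - mu / sigma_x)

# Same observed mean, different standard errors:
z1, z2 = 1.4 / 1, 1.4 / 2            # Z_p1 = 1.4 > Z_p2 = 0.7
pv1, pv2 = 1 - Phi(z1), 1 - Phi(z2)  # so p1 < p2

# For every conjectured mu > 0 the first experiment has higher power --
# the fact H&H read as stronger evidence FOR the null in experiment 1:
higher = all(power(mu, 1) > power(mu, 2) for mu in (0.5, 1, 2, 3.29, 5))
```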

______________________

Hoenig, J. M. and D. M. Heisey (2001), “The Abuse of Power: The Pervasive Fallacy of Power Calculations in Data Analysis,” *The American Statistician*, 55: 19-24.

## Anything Tests Can do, CIs do Better; CIs Do Anything Better than Tests?* (reforming the reformers cont.)

*The title is to be sung to the tune of “Anything You Can Do I Can Do Better” from one of my favorite plays,* Annie Get Your Gun *(‘you’ being replaced by ‘test’).*

This post may be seen to continue the discussion in the May 17 post on Reforming the Reformers.

Consider again our one-sided Normal test T+, with null *H*_{0}: μ ≤ μ_{0} vs. *H*_{1}: μ > μ_{0}, where μ_{0} = 0, α = .025, and σ = 1, but let n = 25. So M is statistically significant only if it exceeds .392. Suppose M just misses significance, say

M_{0} = .39.

The flip side of a *fallacy of rejection* (discussed before) is a *fallacy of acceptance*, or the fallacy of misinterpreting statistically insignificant results. To avoid the age-old fallacy of taking a statistically insignificant result as evidence of zero (0) discrepancy from the null hypothesis μ = μ_{0}, we wish to identify discrepancies that can and cannot be ruled out. For our test T+, we reason from insignificant results to inferential claims of the form:

μ < μ_{0} + γ

Fisher continually emphasized that failure to reject was not evidence for the null. Neyman, we saw, in chastising Carnap, argued for the following kind of power analysis:

**Neymanian Power Analysis (Detectable Discrepancy Size DDS)**: If data *x* are not statistically significantly different from *H*_{0}, and the power to detect discrepancy γ is high (low), then *x* constitutes good (poor) evidence that the actual effect is no greater than γ. (See 11/9/11 post)

By taking into account the actual *x*_{0}, a more nuanced post-data reasoning may be obtained.

“In the Neyman-Pearson theory, sensitivity is assessed by means of the power—the probability of reaching a preset level of significance under the assumption that various alternative hypotheses are true. In the approach described here, sensitivity is assessed by means of the distribution of the random variable P, considered under the assumption of various alternatives.” (Cox and Mayo 2010, p. 291)
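For test T+ above (σ = 1, n = 25, α = .025, cut-off .392), both the pre-data DDS computation and the post-data variant using the actual M_{0} can be sketched as follows. The choice γ = 1 is purely illustrative:

```python
from math import erf, sqrt

def Phi(z):
    """Standard normal CDF."""
    return 0.5 * (1 + erf(z / sqrt(2)))

sigma_M = 1 / sqrt(25)      # standard error of the mean: 0.2
cutoff = 1.96 * sigma_M     # rejection cut-off: 0.392

def power_against(gamma):
    """DDS-style (pre-data) power: P(M > cutoff; mu = gamma)."""
    return 1 - Phi((cutoff - gamma) / sigma_M)

def post_data_against(M0, gamma):
    """Post-data analogue using the actual M0: P(M > M0; mu = gamma),
    gauging how well an insignificant M0 rules out mu >= gamma."""
    return 1 - Phi((M0 - gamma) / sigma_M)

# M0 = .39 just misses significance; high power against gamma = 1
# licenses the inference mu < 1:
pw = power_against(1.0)           # ~.999
pd = post_data_against(0.39, 1.0) # ~.999
```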
