## New SEV calculator (guest app: Durvasula)

## Get empowered to detect power howlers

**If a test’s power to detect µ’ is low then a statistically significant result is good/lousy evidence of discrepancy µ’? Which is it?**

If your smoke alarm has little capability of triggering unless your house is fully ablaze, then if it has triggered, is that a strong or weak indication of a fire? Compare this insensitive smoke alarm to one that is so sensitive that burning toast sets it off. The answer is: that the alarm from the insensitive detector is triggered is a good indication of the presence of (some) fire, while hearing the ultra sensitive alarm go off is not.[i]

Yet I often hear people say things to the effect that:

if you get a result significant at a low p-value, say ~.03,

but the power of the test to detect alternative µ’ is also low, say .04 (i.e., POW(µ’)= .04),then “the result hasn’t done much to distinguish” the data from that obtained by chance alone.

–but wherever that reasoning is coming from it’s not from statistical hypothesis testing, properly understood. It’s easy to see.

We can use a variation on the one-sided test T+ from our illustration of power: We’re testing the mean of a Normal distribution with *n* iid samples, and (for simplicity) known σ:

H_{0}: µ ≤ _{ }0 against H_{1}: µ > _{ }0

Let σ = 1, *n* = 25, so (σ/ √*n*) = .2.

## Phil6334 Day #7: Selection effects, the Higgs and 5 sigma, Power

Below are slides from March 6, 2014: (a) the 2nd half of “Frequentist Statistics as a Theory of Inductive Inference” (Selection Effects),”* and (b) the discussion of the Higgs particle discovery and controversy over 5 sigma.

We spent the rest of the seminar computing significance levels, rejection regions, and power (by hand and with the Excel program). Here is the updated syllabus (3rd installment).

A relevant paper on selection effects on this blog is here.

## Fallacy of Rejection and the Fallacy of Nouvelle Cuisine

Any Jackie Mason fans out there? In connection with our discussion of power,and associated fallacies of rejection*–and since it’s Saturday night–I’m reblogging the following post.

In February [2012], in London, criminologist Katrin H. and I went to see Jackie Mason do his shtick, a one-man show billed as his swan song to England. It was like a repertoire of his “Greatest Hits” without a new or updated joke in the mix. Still, hearing his rants for the nth time was often quite hilarious.

A sample: If you want to eat nothing, eat nouvelle cuisine. Do you know what it means? No food. The smaller the portion the more impressed people are, so long as the food’s got a fancy French name, haute cuisine. An empty plate with sauce!

As one critic wrote, Mason’s jokes “offer a window to a different era,” one whose caricatures and biases one can only hope we’ve moved beyond: But it’s one thing for Jackie Mason to scowl at a seat in the front row and yell to the shocked audience member in his imagination, “These are jokes! They are just jokes!” and another to reprise statistical howlers, which are not jokes, to me. This blog found its reason for being partly as a place to expose, understand, and avoid them. Recall the September 26, 2011 post “Whipping Boys and Witch Hunters”: [i]

Fortunately, philosophers of statistics would surely not reprise decades-old howlers and fallacies. After all, it is the philosopher’s job to clarify and expose the conceptual and logical foibles of others; and even if we do not agree, we would never merely disregard and fail to address the criticisms in published work by other philosophers. Oh wait, ….one of the leading texts repeats the fallacy in their third edition: Continue reading

## Msc kvetch: You are fully dressed (even under your clothes)?

## Power, power everywhere–(it) may not be what you think! [illustration]

Statistical power is one of the neatest [i], yet most misunderstood statistical notions [ii].So here’s a visual illustration (written initially for our 6334 seminar), but worth a look by anyone who wants an easy way to attain *the will to understand power*.(Please see notes below slides.)

[i]I was tempted to say power is one of the “most powerful” notions.It is.True, severity leads us to look, not at the cut-off for rejection (as with power) but the actual observed value, or observed p-value. But the reasoning is the same. Likewise for less artificial cases where the standard deviation has to be estimated. See Mayo and Spanos 2006.

[ii]

- Some say that to compute power requires either knowing the alternative hypothesis (whatever that means), or worse, the alternative’s prior probability! Then there’s the tendency (by reformers no less!) to transpose power in such a way as to get the appraisal of tests exactly backwards. An example is Ziliac and McCloskey (2008). See,for example, the will to understand power: http://errorstatistics.com/2011/10/03/part-2-prionvac-the-will-to-understand-power/
- Many allege that a null hypothesis may be rejected (in favor of alternative H’) with greater warrant, the greater the power of the test against H’, e.g., Howson and Urbach (2006, 154). But this is mistaken. The frequentist appraisal of tests is the reverse, whether Fisherian significance tests or those of the Neyman-Pearson variety. One may find the fallacy exposed back in Morrison and Henkel (1970)! See EGEK 1996, pp. 402-3.
- For a humorous post on this fallacy, see: “The fallacy of rejection and the fallacy of nouvelle cuisine”: http://errorstatistics.com/2012/04/04/jackie-mason/

You can find a link to the Severity Excel Program (from which the pictures came) on the right hand column of this blog, and a link to basic instructions.This corresponds to EXAMPLE SET 1 pdf for Phil 6334.

Howson, C. and P. Urbach (2006). *Scientific Reasoning: The Bayesian Approach*. La Salle, Il: Open Court.

Mayo, D. G. and A. Spanos (2006) “Severe Testing as a Basic Concept in a Neyman-Pearson Philosophy of Induction“ *British Journal of Philosophy of Science*, 57: 323-357.

Morrison and Henkel (1970), *The significance Test controversy.*

Ziliak, Z. and McCloskey, D. (2008), *The Cult of Statistical Significance: How the Standard Error Costs Us Jobs, Justice and Lives*, University of Michigan Press*.*

## capitalizing on chance (ii)

I may have been exaggerating one year ago when I started this post with “Hardly a day goes by”, but now it is literally the case*. (This also pertains to reading for Phil6334 for Thurs. March 6):

Hardly a day goes by where I do not come across an article on the problems for statistical inference based on fallaciously capitalizing on chance: high-powered computer searches and “big” data trolling offer rich hunting grounds out of which apparently impressive results may be “cherry-picked”:

When the hypotheses are tested on the same data that suggested them and when tests of significance are based on such data, then a spurious impression of validity may result. The computed level of significance may have almost no relation to the true level. . . . Suppose that twenty sets of differences have been examined, that one difference seems large enough to test and that this difference turns out to be “significant at the 5 percent level.” Does this mean that differences as large as the one tested would occur by chance only 5 percent of the time when the true difference is zero? The answer is

no,because the difference tested has beenselectedfrom the twenty differences that were examined. The actual level of significance is not 5 percent, but 64 percent! (Selvin 1970, 104)[1]

…Oh wait -this is from a contributor to Morrison and Henkel way back in 1970! But there is one big contrast, I find, that makes current day reports so much more worrisome: critics of the Morrison and Henkel ilk clearly report that to ignore a variety of “selection effects” results in a fallacious computation of the actual significance level associated with a given inference; clear terminology is used to distinguish the “computed” or “nominal” significance level on the one hand, and the actual or warranted significance level on the other. Continue reading

## Significance tests and frequentist principles of evidence: Phil6334 Day #6

Slides (2 sets) from Phil 6334 2/27/14 class (Day#6).

D. Mayo:

“Frequentist Statistics as a Theory of Inductive Inference”

A. Spanos

“Probability/Statistics Lecture Notes 4: Hypothesis Testing”

## Cosma Shalizi gets tenure (at last!) (metastat announcement)

News Flash! Congratulations to Cosma Shalizi who announced yesterday that he’d been granted tenure (Statistics, Carnegie Mellon). Cosma is a leading error statistician, a creative polymath and long-time blogger (at Three-Toad sloth). Shalizi wrote an early book review of EGEK (Mayo 1996)* that people still send me from time to time, in case I hadn’t seen it! You can find it on this blog from 2 years ago (posted by Jean Miller). A discussion of a meeting of the minds between Shalizi and Andrew Gelman is here.

*Error and the Growth of Experimental Knowledge.

## Phil6334: Feb 24, 2014: Induction, Popper and pseudoscience (Day #4)

**Phil 6334* Day #4:** Mayo slides follow the comments below. (Make-up for Feb 13 snow day.) Popper reading is from *Conjectures and Refutations*.

As is typical in rereading any deep philosopher, I discover (or rediscover) different morsals of clues to understanding—whether fully intended by the philosopher or a byproduct of their other insights, and a more contemporary reading. So it is with Popper. A couple of key ideas to emerge from Monday’s (make-up) class and the seminar discussion (my slides are below):

- Unlike the “naïve” empiricists of the day, Popper recognized that observations are not just given unproblematically, but also require an interpretation, an interest, a point of view, a problem. What came first, a hypothesis or an observation? Another hypothesis, if only at a lower level, says Popper. He draws the contrast with Wittgenstein’s “verificationism”. In typical positivist style, the verificationist sees observations as the given “atoms,” and other knowledge is built up out of truth functional operations on those atoms.[1] However, scientific generalizations beyond the given observations cannot be so deduced, hence the traditional philosophical problem of induction isn’t solvable. One is left trying to build a formal “inductive logic” (generally deductive affairs, ironically) that is thought to capture intuitions about scientific inference (a largely degenerating program). The formal probabilists, as well as philosophical Bayesianism, may be seen as descendants of the logical positivists–instrumentalists, verificationists, operationalists (and the corresponding “isms”). So understanding Popper throws a lot of light on current day philosophy of probability and statistics.
- The fact that observations must be interpreted opens the door to interpretations that prejudge the construal of data. With enough interpretive latitude, anything (or practically anything) that is observed can be interpreted as in sync with a general claim H. (Once you opened your eyes, you see confirmations everywhere, as with a gestalt conversion, as Popper put it.) For Popper, positive instances of a general claim H, i.e., observations that agree with or “fit” H, do not even count as evidence for H if virtually any result could be interpreted as according with H.

: Instead of putting the “riskiness” on H itself, it is the method of assessment or testing that bears the burden of showing that something (ideally quite a lot) has been done in order to scrutinize the way the data were interpreted (to avoid “verification bias”). The scrutiny needs to ensure that it would be difficult (rather than easy) to get an accordance between data x and H (as strong as the one obtained) if H were false (or specifiably flawed). Continue reading**Note**a modification of Popper here

## Winner of the Febrary 2014 palindrome contest (rejected post)

*Winner of February 2014 Palindrome Contest
*

**Samuel Dickson**

**Palindrome:
**

**Rot, Cadet A, I’ve droned! Elba, revile deviant, naïve, deliverable den or deviated actor.**

The requirement was: A palindrome with Elba plus deviate with an optional second word: deviant. A palindrome that uses both deviate and deviant tops an acceptable palindrome that only uses deviate.

**Bio:
**Sam Dickson is a regulatory statistician at U.S. Army Medical Research Institute of Infectious Diseases (USAMRIID) with experience in statistical consulting, specializing in design and analysis of biological and genetics/genomics studies.

**Statement**:

“It’s great to get a chance to exercise the mind with something other than statistics, though putting words together to make a palindrome is a puzzle very similar to designing an experiment that answers the right question. Thank you for hosting this contest!”

**Choice of book:**

*Principles of Applied Statistics* (D. R. Cox and C. A. Donnelly 2011, Cambridge: Cambridge University Press)

*Congratulations, Sam! I hope that your opting to do two words (plus Elba) means we can go back to the tougher standard for palindromes, but I’d just as soon raise the level of competence for several months more (sticking to one word). *

## Phil 6334: February 20, 2014 (Spanos): Day #5

PHIL 6334 – “Probability/Statistics Lecture Notes 3 for 2/20/14: Estimation (Point and Interval)”:(Prof. Spanos)*

*This is Day #5 on the Syllabus, as Day #4 had to be made up (Feb 24, 2014) due to snow. Slides for Day #4 will go up Feb. 26, 2014. (See the revised Syllabus Second Installment.)

## Sir Harold Jeffreys’ (tail area) one-liner: Sat night comedy [draft ii]

This headliner appeared last month, but to a sparse audience (likely because it was during winter break), so Management’s giving him another chance…

You might not have thought there could be new material for 2014, but there is, and if you look a bit more closely, you’ll see that it’s actually not Jay Leno who is standing up there at the mike ….

It’s Sir Harold Jeffreys himself! And his (very famous) joke, I admit, is funny. So, since it’s Saturday night, let’s listen in on Sir Harold’s howler* in criticizing the use of p-values.

“Did you hear the one about significance testers rejectingH_{0}because of outcomesH_{0}didn’t predict?

‘What’s unusual about that?’ you ask?

What’s unusual, is that they do it when these unpredicted outcomes haven’t even occurred!”

Much laughter.

[The actual quote from Jeffreys: Using p-values implies that "An hypothesis that may be true is rejected because it has failed to predict observable results that have not occurred. This seems a remarkable procedure.” (Jeffreys 1939, 316)]

I say it’s funny, so to see why I’ll strive to give it a generous interpretation. Continue reading

## STEPHEN SENN: Fisher’s alternative to the alternative

*By: Stephen Senn*

This year [2012] marks the 50th anniversary of RA Fisher’s death. It is a good excuse, I think, to draw attention to an aspect of his philosophy of significance testing. In his extremely interesting essay on Fisher, Jimmie Savage drew attention to a problem in Fisher’s approach to testing. In describing Fisher’s aversion to power functions Savage writes, ‘Fisher says that some tests are *more sensitive* than others, and I cannot help suspecting that that comes to very much the same thing as thinking about the power function.’ (Savage 1976) (P473).

The modern statistician, however, has an advantage here denied to Savage. Savage’s essay was published posthumously in 1976 and the lecture on which it was based was given in Detroit on 29 December 1971 (P441). At that time Fisher’s scientific correspondence did not form part of his available oeuvre but in1990 Henry Bennett’s magnificent edition of Fisher’s statistical correspondence (Bennett 1990) was published and this throws light on many aspects of Fisher’s thought including on significance tests.

The key letter here is Fisher’s reply of 6 October 1938 to Chester Bliss’s letter of 13 September. Bliss himself had reported an issue that had been raised with him by Snedecor on 6 September. Snedecor had pointed out that an analysis using inverse sine transformations of some data that Bliss had worked on gave a different result to an analysis of the original values. Bliss had defended his (transformed) analysis on the grounds that a) if a transformation always gave the same result as an analysis of the original data there would be no point and b) an analysis on inverse sines was a sort of weighted analysis of percentages with the transformation more appropriately reflecting the weight of information in each sample. Bliss wanted to know what Fisher thought of his reply.

Fisher replies with a ‘shorter catechism’ on transformations which ends as follows: Continue reading

## R.A. Fisher: ‘Two New Properties of Mathematical Likelihood’

*Exactly 1 year ago: I find this to be an intriguing discussion–before some of the conflicts with N and P erupted. Fisher links his tests and sufficiency, to the Neyman and Pearson lemma in terms of power. It’s as if we may see them as ending up in a similar place while starting from different origins. I quote just the most relevant portions…the full article is linked below.*

by R.A. Fisher, F.R.S.

Proceedings of the Royal Society, Series A, 144: 285-307 (1934)

The property that where a sufficient statistic exists, the likelihood, apart from a factor independent of the parameter to be estimated, is a function only of the parameter and the sufficient statistic, explains the principle result obtained by Neyman and Pearson in discussing the efficacy of tests of significance. Neyman and Pearson introduce the notion that any chosen test of a hypothesis H_{0} is more powerful than any other equivalent test, with regard to an alternative hypothesis H_{1}, when it rejects H_{0} in a set of samples having an assigned aggregate frequency ε when H_{0} is true, and the greatest possible aggregate frequency when H_{1} is true.

If any group of samples can be found within the region of rejection whose probability of occurrence on the hypothesis H_{1} is less than that of any other group of samples outside the region, but is not less on the hypothesis H_{0}, then the test can evidently be made more powerful by substituting the one group for the other. Continue reading

## Aris Spanos: The Enduring Legacy of R. A. Fisher

More Fisher insights from A. Spanos, this from 2 years ago:

One of R. A. Fisher’s (17 February 1890 — 29 July 1962) most remarkable, but least recognized, achievement was to initiate the recasting of statistical induction. Fisher (1922) pioneered modern frequentist statistics as a model-based approach to statistical induction anchored on the notion of a statistical model, formalized by:

M_{θ}(**x**)={f(**x**;θ); θ∈Θ**}**; **x**∈R^{n };Θ⊂R^{m}; m < n; (1)

where the distribution of the sample f(**x**;θ) ‘encapsulates’ the probabilistic information in the statistical model.

Before Fisher, the notion of a statistical model was vague and often implicit, and its role was primarily conﬁned to the description of the distributional features of the data in hand using the histogram and the ﬁrst few sample moments; implicitly imposing random (IID) samples. The problem was that statisticians at the time would use descriptive summaries of the data to claim generality beyond the data in hand **x**_{0}:=(x_{1},x_{2},…,x_{n}). As late as the 1920s, the problem of statistical induction was understood by Karl Pearson in terms of invoking (i) the ‘stability’ of empirical results for subsequent samples and (ii) a prior distribution for θ.

Fisher was able to recast statistical inference by turning Karl Pearson’s approach, proceeding from data **x**_{0 }in search of a frequency curve f(x;ϑ) to describe its histogram, on its head. He proposed to begin with a prespeciﬁed M_{θ}(**x**) (a ‘hypothetical inﬁnite population’), and view x_{0 }as a ‘typical’ realization thereof; see Spanos (1999).

In my mind, Fisher’s most enduring contribution is his devising a general way to ‘operationalize’ errors by embedding the material experiment into M_{θ}(**x**), and taming errors via probabiliﬁcation, i.e. to deﬁne frequentist error probabilities in the context of a statistical model. These error probabilities are (a) deductively derived from the statistical model, and (b) provide a measure of the ‘eﬀectiviness’ of the inference procedure: how often a certain method will give rise to correct inferences concerning the underlying ‘true’ Data Generating Mechanism (DGM). This cast aside the need for a prior. Both of these key elements, the statistical model and the error probabilities, have been reﬁned and extended by Mayo’s error statistical approach (EGEK 1996). Learning from data is achieved when an inference is reached by an inductive procedure which, with high probability, will yield true conclusions from valid inductive premises (a statistical model); Mayo and Spanos (2011). Continue reading

## R. A. Fisher: how an outsider revolutionized statistics

Today is R.A. Fisher’s birthday and I’m reblogging the post by Aris Spanos which, as it happens, received the highest number of views of 2013.

by **Aris Spanos**

Few statisticians will dispute that R. A. Fisher (February 17, 1890 – July 29, 1962) is the father of modern statistics; see Savage (1976), Rao (1992). Inspired by William Gosset’s (1908) paper on the Student’s t finite sampling distribution, he recast statistics into the modern model-based induction in a series of papers in the early 1920s. He put forward a theory of *optimal estimation* based on the method of maximum likelihood that has changed only marginally over the last century. His significance testing, spearheaded by the p-value, provided the basis for the Neyman-Pearson theory of *optimal testing* in the early 1930s. According to Hald (1998)

“Fisher was a genius who almost single-handedly created the foundations for modern statistical science, without detailed study of his predecessors. When young he was ignorant not only of the Continental contributions but even of contemporary publications in English.” (p. 738)

What is not so well known is that Fisher was the *ultimate outsider* when he brought about this change of paradigms in statistical science. As an undergraduate, he studied mathematics at Cambridge, and then did graduate work in statistical mechanics and quantum theory. His meager knowledge of statistics came from his study of astronomy; see Box (1978). That, however did not stop him from publishing his first paper in statistics in 1912 (still an undergraduate) on “curve fitting”, questioning Karl Pearson’s method of moments and proposing a new method that was eventually to become the likelihood method in his 1921 paper. Continue reading

## Fisher and Neyman after anger management?

Monday is Fisher’s birthday, and to set the stage for some items to appear, I’m posing the anger management question from a year ago post (please also see the comments from then). Here it is:

Would you agree if your (senior) colleague urged you to use his/her book rather than your own –even if you thought doing so would change for the positive the entire history of your field? My guess is that the answer is no (but see “add on”). For that matter, would you ever try to insist that your (junior) colleague use your book in teaching a course rather than his/her own notes or book? Again I guess no. But perhaps you’d be more tactful than were Fisher and Neyman.

It wasn’t just Fisher who seemed to need some anger management training, Erich Lehmann (in conversation and in 2011) points to a number of incidences wherein Neyman is the instigator of gratuitous ill-will. Their substantive statistical and philosophical disagreements, I now think, were minuscule in comparison to the huge animosity that developed over many years. Here’s how Neyman describes a vivid recollection he has of the 1935 book episode to Constance Reid (1998, 126). [i]

A couple of months “after Neyman criticized Fisher’s concept of the complex experiment” Neyman vividly recollects Fisher stopping by his office at University College on his way to a meeting which was to decide on Neyman’s reappointment[ii]: Continue reading

## Phil6334 Statistical Snow Sculpture

No Seminar. Blizzard.