New SEV calculator (guest app: Durvasula)

Unknown-1Karthik Durvasula, a blog follower[i], sent me a highly apt severity app that he created:
I have his permission to post it or use it for pedagogical purposes, so since it’s Saturday night, go ahead and have some fun with it. Durvasula had the great idea of using it to illustrate howlers. Also, I would add, to discover them.
It follows many of the elements of the Excel Sev Program discussed recently, but it’s easier to use.* (I’ll add some notes about the particular claim (i.e, discrepancy) for which SEV is being computed later on).
*If others want to tweak or improve it, he might pass on the source code (write to me on this).
[i] I might note that Durvasula was the winner of the January palindrome contest.
Categories: Severity, Statistics | 12 Comments

Get empowered to detect power howlers

questionmark pinkIf a test’s power to detect µ’ is low then a statistically significant result is good/lousy evidence of discrepancy µ’? Which is it?

If your smoke alarm has little capability of triggering unless your house is fully ablaze, then if it has triggered, is that a strong or weak indication of a fire? Compare this insensitive smoke alarm to one that is so sensitive that burning toast sets it off. The answer is: that the alarm from the insensitive detector is triggered is a good indication of the presence of (some) fire, while hearing the ultra sensitive alarm go off is not.[i]

Yet I often hear people say things to the effect that:

if you get a result significant at a low p-value, say ~.03,
but the power of the test to detect alternative µ’ is also low, say .04 (i.e., POW(µ’)= .04),then “the result hasn’t done much to distinguish” the data from that obtained by chance alone.

–but wherever that reasoning is coming from it’s not from statistical hypothesis testing, properly understood. It’s easy to see.

We can use a variation on the one-sided test T+ from our illustration of power: We’re testing the mean of a Normal distribution with n iid samples, and (for simplicity) known σ:

H0: µ ≤  0 against H1: µ >  0

Let σ = 1, n = 25, so (σ/ √n) = .2.

Continue reading

Categories: confidence intervals and tests, power, Statistics | 34 Comments

Phil6334 Day #7: Selection effects, the Higgs and 5 sigma, Power

SEV CALCULATORBelow are slides from March 6, 2014: (a) the 2nd half of “Frequentist Statistics as a Theory of Inductive Inference” (Selection Effects),”* and (b) the discussion of the Higgs particle discovery and controversy over 5 sigma.physics pic yellow particle burst blue cone

We spent the rest of the seminar computing significance levels, rejection regions, and power (by hand and with the Excel program). Here is the updated syllabus  (3rd installment).

A relevant paper on selection effects on this blog is here.

Categories: Higgs, P-values, Phil6334, selection effects | Leave a comment

Fallacy of Rejection and the Fallacy of Nouvelle Cuisine

Any Jackie Mason fans out there? In connection with our discussion of power,and associated fallacies of rejection*–and since it’s Saturday night–I’m reblogging the following post.

In February [2012], in London, criminologist Katrin H. and I went to see Jackie Mason do his shtick, a one-man show billed as his swan song to England.  It was like a repertoire of his “Greatest Hits” without a new or updated joke in the mix.  Still, hearing his rants for the nth time was often quite hilarious.

A sample: If you want to eat nothing, eat nouvelle cuisine. Do you know what it means? No food. The smaller the portion the more impressed people are, so long as the food’s got a fancy French name, haute cuisine. An empty plate with sauce!

As one critic wrote, Mason’s jokes “offer a window to a different era,” one whose caricatures and biases one can only hope we’ve moved beyond: But it’s one thing for Jackie Mason to scowl at a seat in the front row and yell to the shocked audience member in his imagination, “These are jokes! They are just jokes!” and another to reprise statistical howlers, which are not jokes, to me. This blog found its reason for being partly as a place to expose, understand, and avoid them. Recall the September 26, 2011 post “Whipping Boys and Witch Hunters”: [i]

Fortunately, philosophers of statistics would surely not reprise decades-old howlers and fallacies. After all, it is the philosopher’s job to clarify and expose the conceptual and logical foibles of others; and even if we do not agree, we would never merely disregard and fail to address the criticisms in published work by other philosophers.  Oh wait, ….one of the leading texts repeats the fallacy in their third edition: Continue reading

Categories: Comedy, fallacy of rejection, Statistical power | Tags: , , , , | 9 Comments

Msc kvetch: You are fully dressed (even under your clothes)?

UnknownRejected posts

Categories: rejected post | Leave a comment

Power, power everywhere–(it) may not be what you think! [illustration]

sev-calculatorStatistical power is one of the neatest [i], yet most misunderstood statistical notions [ii].So here’s a visual illustration (written initially for our 6334 seminar), but worth a look by anyone who wants an easy way to attain the will to understand power.(Please see notes below slides.)

[i]I was tempted to say power is one of the “most powerful” notions.It is.True, severity leads us to look, not at the cut-off for rejection (as with power) but the actual observed value, or observed p-value. But the reasoning is the same. Likewise for less artificial cases where the standard deviation has to be estimated. See Mayo and Spanos 2006.


  • Some say that to compute power requires either knowing the alternative hypothesis (whatever that means), or worse, the alternative’s prior probability! Then there’s the tendency (by reformers no less!) to transpose power in such a way as to get the appraisal of tests exactly backwards. An example is Ziliac and McCloskey (2008). See,for example, the will to understand power:
  • Many allege that a null hypothesis may be rejected (in favor of alternative H’) with greater warrant, the greater the power of the test against H’, e.g., Howson and Urbach (2006, 154). But this is mistaken. The frequentist appraisal of tests is the reverse, whether Fisherian significance tests or those of the Neyman-Pearson variety. One may find the fallacy exposed back in Morrison and Henkel (1970)! See EGEK 1996, pp. 402-3.
  •  For a humorous post on this fallacy, see: “The fallacy of rejection and the fallacy of nouvelle cuisine”:

You can find a link to the Severity Excel Program (from which the pictures came)  on the right hand column of this blog, and a link to basic instructions.This corresponds to EXAMPLE SET 1 pdf for Phil 6334.

Howson, C. and P. Urbach (2006). Scientific Reasoning: The Bayesian Approach. La Salle, Il: Open Court.

Mayo, D. G. and A. Spanos (2006) “Severe Testing as a Basic Concept in a Neyman-Pearson Philosophy of Induction“ British Journal of Philosophy of Science, 57: 323-357.

Morrison and Henkel (1970), The significance Test controversy.

Ziliak, Z. and McCloskey, D. (2008), The Cult of Statistical Significance: How the Standard Error Costs Us Jobs, Justice and Lives, University of Michigan Press.

Categories: Phil6334, Statistical power, Statistics | 26 Comments

capitalizing on chance (ii)

Mayo playing the slots

DGM playing the slots

I may have been exaggerating one year ago when I started this post with “Hardly a day goes by”, but now it is literally the case*. (This  also pertains to reading for Phil6334 for Thurs. March 6):

Hardly a day goes by where I do not come across an article on the problems for statistical inference based on fallaciously capitalizing on chance: high-powered computer searches and “big” data trolling offer rich hunting grounds out of which apparently impressive results may be “cherry-picked”:

When the hypotheses are tested on the same data that suggested them and when tests of significance are based on such data, then a spurious impression of validity may result. The computed level of significance may have almost no relation to the true level. . . . Suppose that twenty sets of differences have been examined, that one difference seems large enough to test and that this difference turns out to be “significant at the 5 percent level.” Does this mean that differences as large as the one tested would occur by chance only 5 percent of the time when the true difference is zero? The answer is no, because the difference tested has been selected from the twenty differences that were examined. The actual level of significance is not 5 percent, but 64 percent! (Selvin 1970, 104)[1]

…Oh wait -this is from a contributor to Morrison and Henkel way back in 1970! But there is one big contrast, I find, that makes current day reports so much more worrisome: critics of the Morrison and Henkel ilk clearly report that to ignore a variety of “selection effects” results in a fallacious computation of the actual significance level associated with a given inference; clear terminology is used to distinguish the “computed” or “nominal” significance level on the one hand, and the actual or warranted significance level on the other. Continue reading

Categories: junk science, selection effects, spurious p values, Statistical fraudbusting, Statistics | 4 Comments

Significance tests and frequentist principles of evidence: Phil6334 Day #6

picture-216-1Slides (2 sets) from Phil 6334 2/27/14 class (Day#6).


D. Mayo:
“Frequentist Statistics as a Theory of Inductive Inference”

A. Spanos
“Probability/Statistics Lecture Notes 4: Hypothesis Testing”

Categories: P-values, Phil 6334 class material, Philosophy of Statistics, Statistics | Tags: | Leave a comment

Cosma Shalizi gets tenure (at last!) (metastat announcement)

ShaliziNews Flash! Congratulations to Cosma Shalizi who announced yesterday that he’d been granted tenure (Statistics, Carnegie Mellon). Cosma is a leading error statistician, a creative polymath and long-time blogger (at Three-Toad sloth). Shalizi wrote an early book review of EGEK (Mayo 1996)* that people still send me from time to time, in case I hadn’t seen it! You can find it on this blog from 2 years ago (posted by Jean Miller). A discussion of a meeting of the minds between Shalizi and Andrew Gelman is here.

*Error and the Growth of Experimental Knowledge.

Categories: Announcement, Error Statistics, Statistics | Tags: | Leave a comment

Phil6334: Feb 24, 2014: Induction, Popper and pseudoscience (Day #4)

Phil 6334* Day #4: Mayo slides follow the comments below. (Make-up for Feb 13 snow day.) Popper reading is from Conjectures and Refutations.



As is typical in rereading any deep philosopher, I discover (or rediscover) different morsals of clues to understanding—whether fully intended by the philosopher or a byproduct of their other insights, and a more contemporary reading. So it is with Popper. A couple of key ideas to emerge from Monday’s (make-up) class and the seminar discussion (my slides are below):

  1. Unlike the “naïve” empiricists of the day, Popper recognized that observations are not just given unproblematically, but also require an interpretation, an interest, a point of view, a problem. What came first, a hypothesis or an observation? Another hypothesis, if only at a lower level, says Popper.  He draws the contrast with Wittgenstein’s “verificationism”. In typical positivist style, the verificationist sees observations as the given “atoms,” and other knowledge is built up out of truth functional operations on those atoms.[1] However, scientific generalizations beyond the given observations cannot be so deduced, hence the traditional philosophical problem of induction isn’t solvable. One is left trying to build a formal “inductive logic” (generally deductive affairs, ironically) that is thought to capture intuitions about scientific inference (a largely degenerating program). The formal probabilists, as well as philosophical Bayesianism, may be seen as descendants of the logical positivists–instrumentalists, verificationists, operationalists (and the corresponding “isms”). So understanding Popper throws a lot of light on current day philosophy of probability and statistics.
  2. The fact that observations must be interpreted opens the door to interpretations that prejudge the construal of data. With enough interpretive latitude, anything (or practically anything) that is observed can be interpreted as in sync with a general claim H. (Once you opened your eyes, you see confirmations everywhere, as with a gestalt conversion, as Popper put it.) For Popper, positive instances of a general claim H, i.e., observations that agree with or “fit” H, do not even count as evidence for H if virtually any result could be interpreted as according with H.
    Note a modification of Popper here: Instead of putting the “riskiness” on H itself, it is the method of assessment or testing that bears the burden of showing that something (ideally quite a lot) has been done in order to scrutinize the way the data were interpreted (to avoid “verification bias”). The scrutiny needs to ensure that it would be difficult (rather than easy) to get an accordance between data x and H (as strong as the one obtained) if H were false (or specifiably flawed). Continue reading
Categories: Phil 6334 class material, Popper, Statistics | 7 Comments

Winner of the Febrary 2014 palindrome contest (rejected post)

SamHeadWinner of February 2014 Palindrome Contest
Samuel Dickson

Rot, Cadet A, I’ve droned! Elba, revile deviant, naïve, deliverable den or deviated actor.

The requirement was: A palindrome with Elba plus deviate with an optional second word: deviant. A palindrome that uses both deviate and deviant tops an acceptable palindrome that only uses deviate.

Sam Dickson is a regulatory statistician at U.S. Army Medical Research Institute of Infectious Diseases (USAMRIID) with experience in statistical consulting, specializing in design and analysis of biological and genetics/genomics studies.

“It’s great to get a  chance to exercise the mind with something other than statistics, though putting words together to make a palindrome is a puzzle very similar to designing an experiment that answers the right question.  Thank you for hosting this contest!”

Choice of book:
Principles of Applied Statistics (D. R. Cox and C. A. Donnelly 2011, Cambridge: Cambridge University Press)

Congratulations, Sam! I hope that your opting to do two words (plus Elba) means we can go back to the tougher standard for palindromes, but I’d just as soon raise the level of competence for several months more (sticking to one word). 

Categories: Announcement, Palindrome, Rejected Posts, Statistics | Leave a comment

Phil 6334: February 20, 2014 (Spanos): Day #5

may-4-8-aris-spanos-e2809contology-methodology-in-statistical-modelinge2809dPHIL 6334 – “Probability/Statistics Lecture Notes 3 for 2/20/14: Estimation (Point and Interval)”:(Prof. Spanos)*

*This is Day #5 on the Syllabus, as Day #4 had to be made up (Feb 24, 2014) due to snow. Slides for Day #4 will go up Feb. 26, 2014. (See the revised Syllabus Second Installment.)

Categories: Phil6334, Philosophy of Statistics, Spanos | 5 Comments

Sir Harold Jeffreys’ (tail area) one-liner: Sat night comedy [draft ii]

Comedy hour icon

This headliner appeared last month, but to a sparse audience (likely because it was during winter break), so Management’s giving him another chance… 

You might not have thought there could be new material for 2014, but there is, and if you look a bit more closely, you’ll see that it’s actually not Jay Leno who is standing up there at the mike ….

IMG_1547It’s Sir Harold Jeffreys himself! And his (very famous) joke, I admit, is funny. So, since it’s Saturday night, let’s listen in on Sir Harold’s howler* in criticizing the use of p-values.

“Did you hear the one about significance testers rejecting H0 because of outcomes H0 didn’t predict?

‘What’s unusual about that?’ you ask?

What’s unusual, is that they do it when these unpredicted outcomes haven’t even occurred!”

Much laughter.

[The actual quote from Jeffreys: Using p-values implies that "An hypothesis that may be true is rejected because it has failed to predict observable results that have not occurred. This seems a remarkable procedure.” (Jeffreys 1939, 316)]

I say it’s funny, so to see why I’ll strive to give it a generous interpretation. Continue reading

Categories: Comedy, Fisher, Jeffreys, P-values, Stephen Senn | Leave a comment

STEPHEN SENN: Fisher’s alternative to the alternative

Reblogging 2 years ago:

By: Stephen Senn

This year [2012] marks the 50th anniversary of RA Fisher’s death. It is a good excuse, I think, to draw attention to an aspect of his philosophy of significance testing. In his extremely interesting essay on Fisher, Jimmie Savage drew attention to a problem in Fisher’s approach to testing. In describing Fisher’s aversion to power functions Savage writes, ‘Fisher says that some tests are more sensitive than others, and I cannot help suspecting that that comes to very much the same thing as thinking about the power function.’ (Savage 1976) (P473).

The modern statistician, however, has an advantage here denied to Savage. Savage’s essay was published posthumously in 1976 and the lecture on which it was based was given in Detroit on 29 December 1971 (P441). At that time Fisher’s scientific correspondence did not form part of his available oeuvre but in1990 Henry Bennett’s magnificent edition of Fisher’s statistical correspondence (Bennett 1990) was published and this throws light on many aspects of Fisher’s thought including on significance tests.

The key letter here is Fisher’s reply of 6 October 1938 to Chester Bliss’s letter of 13 September. Bliss himself had reported an issue that had been raised with him by Snedecor on 6 September. Snedecor had pointed out that an analysis using inverse sine transformations of some data that Bliss had worked on gave a different result to an analysis of the original values. Bliss had defended his (transformed) analysis on the grounds that a) if a transformation always gave the same result as an analysis of the original data there would be no point and b) an analysis on inverse sines was a sort of weighted analysis of percentages with the transformation more appropriately reflecting the weight of information in each sample. Bliss wanted to know what Fisher thought of his reply.

Fisher replies with a ‘shorter catechism’ on transformations which ends as follows: Continue reading

Categories: Fisher, Statistics, Stephen Senn | Tags: , , , | 31 Comments

R.A. Fisher: ‘Two New Properties of Mathematical Likelihood’

17 February 1890–29 July 1962

Exactly 1 year ago: I find this to be an intriguing discussion–before some of the conflicts with N and P erupted.  Fisher links his tests and sufficiency, to the Neyman and Pearson lemma in terms of power.  It’s as if we may see them as ending up in a similar place while starting from different origins. I quote just the most relevant portions…the full article is linked below.

by R.A. Fisher, F.R.S.

Proceedings of the Royal Society, Series A, 144: 285-307 (1934)

  The property that where a sufficient statistic exists, the likelihood, apart from a factor independent of the parameter to be estimated, is a function only of the parameter and the sufficient statistic, explains the principle result obtained by Neyman and Pearson in discussing the efficacy of tests of significance.  Neyman and Pearson introduce the notion that any chosen test of a hypothesis H0 is more powerful than any other equivalent test, with regard to an alternative hypothesis H1, when it rejects H0 in a set of samples having an assigned aggregate frequency ε when H0 is true, and the greatest possible aggregate frequency when H1 is true.

If any group of samples can be found within the region of rejection whose probability of occurrence on the hypothesis H1 is less than that of any other group of samples outside the region, but is not less on the hypothesis H0, then the test can evidently be made more powerful by substituting the one group for the other. Continue reading

Categories: Fisher, phil/history of stat, Statistics | Tags: , , , | 1 Comment

Aris Spanos: The Enduring Legacy of R. A. Fisher

spanos 2014

More Fisher insights from A. Spanos, this from 2 years ago:

One of R. A. Fisher’s (17 February 1890 — 29 July 1962) most re­markable, but least recognized, achievement was to initiate the recast­ing of statistical induction. Fisher (1922) pioneered modern frequentist statistics as a model-based approach to statistical induction anchored on the notion of a statistical model, formalized by:

Mθ(x)={f(x;θ); θ∈Θ}; x∈Rn ;Θ⊂Rm; m < n; (1)

where the distribution of the sample f(x;θ) ‘encapsulates’ the proba­bilistic information in the statistical model.

Before Fisher, the notion of a statistical model was vague and often implicit, and its role was primarily confined to the description of the distributional features of the data in hand using the histogram and the first few sample moments; implicitly imposing random (IID) samples. The problem was that statisticians at the time would use descriptive summaries of the data to claim generality beyond the data in hand x0:=(x1,x2,…,xn). As late as the 1920s, the problem of statistical induction was understood by Karl Pearson in terms of invoking (i) the ‘stability’ of empirical results for subsequent samples and (ii) a prior distribution for θ.

Fisher was able to recast statistical inference by turning Karl Pear­son’s approach, proceeding from data x0 in search of a frequency curve f(x;ϑ) to describe its histogram, on its head. He proposed to begin with a prespecified Mθ(x) (a ‘hypothetical infinite population’), and view x0 as a ‘typical’ realization thereof; see Spanos (1999).

In my mind, Fisher’s most enduring contribution is his devising a general way to ‘operationalize’ errors by embedding the material ex­periment into Mθ(x), and taming errors via probabilification, i.e. to define frequentist error probabilities in the context of a statistical model. These error probabilities are (a) deductively derived from the statistical model, and (b) provide a measure of the ‘effectiviness’ of the inference procedure: how often a certain method will give rise to correct in­ferences concerning the underlying ‘true’ Data Generating Mechanism (DGM). This cast aside the need for a prior. Both of these key elements, the statistical model and the error probabilities, have been refined and extended by Mayo’s error statistical approach (EGEK 1996). Learning from data is achieved when an inference is reached by an inductive procedure which, with high probability, will yield true conclusions from valid inductive premises (a statistical model); Mayo and Spanos (2011). Continue reading

Categories: Fisher, phil/history of stat, Statistics | Tags: , , , , , , | 2 Comments

R. A. Fisher: how an outsider revolutionized statistics

A SPANOSToday is R.A. Fisher’s birthday and I’m reblogging the post by Aris Spanos which, as it happens, received the highest number of views of 2013.

by Aris Spanos

Few statisticians will dispute that R. A. Fisher (February 17, 1890 – July 29, 1962) is the father of modern statistics; see Savage (1976), Rao (1992). Inspired by William Gosset’s (1908) paper on the Student’s t finite sampling distribution, he recast statistics into the modern model-based induction in a series of papers in the early 1920s. He put forward a theory of optimal estimation based on the method of maximum likelihood that has changed only marginally over the last century. His significance testing, spearheaded by the p-value, provided the basis for the Neyman-Pearson theory of optimal testing in the early 1930s. According to Hald (1998)

“Fisher was a genius who almost single-handedly created the foundations for modern statistical science, without detailed study of his predecessors. When young he was ignorant not only of the Continental contributions but even of contemporary publications in English.” (p. 738)

What is not so well known is that Fisher was the ultimate outsider when he brought about this change of paradigms in statistical science. As an undergraduate, he studied mathematics at Cambridge, and then did graduate work in statistical mechanics and quantum theory. His meager knowledge of statistics came from his study of astronomy; see Box (1978). That, however did not stop him from publishing his first paper in statistics in 1912 (still an undergraduate) on “curve fitting”, questioning Karl Pearson’s method of moments and proposing a new method that was eventually to become the likelihood method in his 1921 paper. Continue reading

Categories: Fisher, phil/history of stat, Spanos, Statistics | 6 Comments

Fisher and Neyman after anger management?

Photo on 2-15-13 at 11.47 PM

Monday is Fisher’s birthday, and to set the stage for some items to appear, I’m posing the anger management question from a year ago post (please also see the comments from then). Here it is:

Would you agree if your (senior) colleague urged you to use his/her book rather than your own –even if you thought doing so would change for the positive the entire history of your field? My guess is that the answer is no (but see “add on”). For that matter, would you ever try to insist that your (junior) colleague use your book in teaching a course rather than his/her own notes or book?  Again I guess no. But perhaps you’d be more tactful than were Fisher and Neyman.

It wasn’t just Fisher who seemed to need some anger management training, Erich Lehmann (in conversation and in 2011) points to a number of incidences wherein Neyman is the instigator of gratuitous ill-will. Their substantive statistical and philosophical disagreements, I now think, were minuscule in comparison to the huge animosity that developed over many years. Here’s how Neyman describes a vivid recollection he has of the 1935 book episode to Constance Reid (1998, 126). [i]

A couple of months “after Neyman criticized Fisher’s concept of the complex experiment” Neyman vividly recollects  Fisher stopping by his office at University College on his way to a meeting which was to decide on Neyman’s reappointment[ii]: Continue reading

Categories: phil/history of stat, Statistics | 9 Comments

January Blog Table of Contents


January, and the blogging was easy

BLOG Contents: January 2014
Compiled by Jean Miller and Nicole Jinn

(1/2) Winner of the December 2013 Palindrome Book Contest (Rejected Post)
(1/3) Error Statistics Philosophy: 2013
(1/4) Your 2014 wishing well. …
(1/7) “Philosophy of Statistical Inference and Modeling” New Course: Spring 2014: Mayo and Spanos: (Virginia Tech)
(1/11) Two Severities? (PhilSci and PhilStat)
(1/14) Statistical Science meets Philosophy of Science: blog beginnings
(1/16) Objective/subjective, dirty hands and all that: Gelman/Wasserman blogolog (ii)
(1/18) Sir Harold Jeffreys’ (tail area) one-liner: Sat night comedy [draft ii]
(1/22) Phil6334: “Philosophy of Statistical Inference and Modeling” New Course: Spring 2014: Mayo and Spanos (Virginia Tech) UPDATE: JAN 21
(1/24) Phil 6334: Slides from Day #1: Four Waves in Philosophy of Statistics
(1/25) U-Phil (Phil 6334) How should “prior information” enter in statistical inference?
(1/27) Winner of the January 2014 palindrome contest (rejected post)
(1/29) BOSTON COLLOQUIUM FOR PHILOSOPHY OF SCIENCE: Revisiting the Foundations of Statistics
(1/31) Phil 6334: Day #2 Slides

Categories: Metablog | Leave a comment

Phil6334 Statistical Snow Sculpture

Statistical Snow Sculpture

Statistical Snow Sculpture

No Seminar. Blizzard.

Categories: Announcement, Phil6334 | Leave a comment

Blog at Customized Adventure Journal Theme.


Get every new post delivered to your Inbox.

Join 324 other followers