# Posts Tagged With: significance tests

## Power howlers return as criticisms of severity

Suppose you are reading about a statistically significant result x that just reaches a threshold p-value α from a test T+ of the mean of a Normal distribution

H0: µ ≤  0 against H1: µ >  0

with n iid samples, and (for simplicity) known σ.  The test “rejects” H0 at this level & infers evidence of a discrepancy in the direction of H1.

I have heard some people say:

A. If the test’s power to detect alternative µ’ is very low, then the just statistically significant x is poor evidence of a discrepancy (from the null) corresponding to µ’.  (i.e., there’s poor evidence that  µ > µ’ ). See point* on language in notes.

They will generally also hold that if POW(µ’) is reasonably high (at least .5), then the inference to µ > µ’ is warranted, or at least not problematic.

I have heard other people say:

B. If the test’s power to detect alternative µ’ is very low, then the just statistically significant x is good evidence of a discrepancy (from the null) corresponding to µ’ (i.e., there’s good evidence that  µ > µ’).

They will generally also hold that if POW(µ’) is reasonably high (at least .5), then the inference to µ > µ’ is unwarranted.

Which is correct, from the perspective of the frequentist error statistical philosophy? Continue reading

## Happy Birthday R.A. Fisher: ‘Two New Properties of Mathematical Likelihood’

17 February 1890–29 July 1962

Today is R.A. Fisher’s birthday. I’ll reblog some Fisherian items this week with a few new remarks. This paper comes just before the conflicts with Neyman and Pearson (N-P) erupted.  Fisher links his tests and sufficiency, to the Neyman and Pearson lemma in terms of power. It’s as if we may see Fisher and N-P as ending up in a similar place while starting from different origins, as David Cox might say [1]. Unfortunately, the blow-up that occurred soon after is behind today’s misdirected war vs statistical significance tests.* I quote just the most relevant portions…the full article is linked below.** Happy Birthday Fisher! Continue reading

Categories: Fisher, phil/history of stat |

## Happy Birthday R.A. Fisher: ‘Two New Properties of Mathematical Likelihood’

17 February 1890–29 July 1962

Today is R.A. Fisher’s birthday. I will post some Fisherian items this week in recognition of it*. This paper comes just before the conflicts with Neyman and Pearson erupted.  Fisher links his tests and sufficiency, to the Neyman and Pearson lemma in terms of power.  We may see them as ending up in a similar place while starting from different origins. I quote just the most relevant portions…the full article is linked below. Happy Birthday Fisher!

by R.A. Fisher, F.R.S.

Proceedings of the Royal Society, Series A, 144: 285-307 (1934)

The property that where a sufficient statistic exists, the likelihood, apart from a factor independent of the parameter to be estimated, is a function only of the parameter and the sufficient statistic, explains the principle result obtained by Neyman and Pearson in discussing the efficacy of tests of significance.  Neyman and Pearson introduce the notion that any chosen test of a hypothesis H0 is more powerful than any other equivalent test, with regard to an alternative hypothesis H1, when it rejects H0 in a set of samples having an assigned aggregate frequency ε when H0 is true, and the greatest possible aggregate frequency when H1 is true. If any group of samples can be found within the region of rejection whose probability of occurrence on the hypothesis H1 is less than that of any other group of samples outside the region, but is not less on the hypothesis H0, then the test can evidently be made more powerful by substituting the one group for the other. Continue reading

Categories: Fisher, phil/history of stat, Phil6334/ Econ 6614, Statistics |

## Happy Birthday R.A. Fisher: ‘Two New Properties of Mathematical Likelihood’

17 February 1890–29 July 1962

Today is R.A. Fisher’s birthday. I’ll post some Fisherian items this week in honor of it. This paper comes just before the conflicts with Neyman and Pearson erupted.  Fisher links his tests and sufficiency, to the Neyman and Pearson lemma in terms of power.  It’s as if we may see them as ending up in a similar place while starting from different origins. I quote just the most relevant portions…the full article is linked below. Happy Birthday Fisher!

by R.A. Fisher, F.R.S.

Proceedings of the Royal Society, Series A, 144: 285-307  (1934)

The property that where a sufficient statistic exists, the likelihood, apart from a factor independent of the parameter to be estimated, is a function only of the parameter and the sufficient statistic, explains the principle result obtained by Neyman and Pearson in discussing the efficacy of tests of significance.  Neyman and Pearson introduce the notion that any chosen test of a hypothesis H0 is more powerful than any other equivalent test, with regard to an alternative hypothesis H1, when it rejects H0 in a set of samples having an assigned aggregate frequency ε when H0 is true, and the greatest possible aggregate frequency when H1 is true. If any group of samples can be found within the region of rejection whose probability of occurrence on the hypothesis H1 is less than that of any other group of samples outside the region, but is not less on the hypothesis H0, then the test can evidently be made more powerful by substituting the one group for the other. Continue reading

Categories: Fisher, phil/history of stat, Statistics |

## R.A. Fisher: ‘Two New Properties of Mathematical Likelihood’

17 February 1890–29 July 1962

Today is R.A. Fisher’s birthday. I’ll post some different Fisherian items this week in honor of it. This paper comes just before the conflicts with Neyman and Pearson erupted.  Fisher links his tests and sufficiency, to the Neyman and Pearson lemma in terms of power.  It’s as if we may see them as ending up in a similar place while starting from different origins. I quote just the most relevant portions…the full article is linked below. Happy Birthday Fisher!

by R.A. Fisher, F.R.S.

Proceedings of the Royal Society, Series A, 144: 285-307 (1934)

The property that where a sufficient statistic exists, the likelihood, apart from a factor independent of the parameter to be estimated, is a function only of the parameter and the sufficient statistic, explains the principle result obtained by Neyman and Pearson in discussing the efficacy of tests of significance.  Neyman and Pearson introduce the notion that any chosen test of a hypothesis H0 is more powerful than any other equivalent test, with regard to an alternative hypothesis H1, when it rejects H0 in a set of samples having an assigned aggregate frequency ε when H0 is true, and the greatest possible aggregate frequency when H1 is true. Continue reading

Categories: Fisher, phil/history of stat, Statistics |

## Fallacies of Rejection, Nouvelle Cuisine, and assorted New Monsters

Jackie Mason

Whenever I’m in London, my criminologist friend Katrin H. and I go in search of stand-up comedy. Since it’s Saturday night (and I’m in London), we’re setting out in search of a good comedy club (I’ll complete this post upon return). A few years ago we heard Jackie Mason do his shtick, a one-man show billed as his swan song to England.  It was like a repertoire of his “Greatest Hits” without a new or updated joke in the mix.  Still, hearing his rants for the nth time was often quite hilarious. It turns out that he has already been back doing another “final shtick tour” in England, but not tonight.

A sample: If you want to eat nothing, eat nouvelle cuisine. Do you know what it means? No food. The smaller the portion the more impressed people are, so long as the food’s got a fancy French name, haute cuisine. An empty plate with sauce!

As one critic wrote, Mason’s jokes “offer a window to a different era,” one whose caricatures and biases one can only hope we’ve moved beyond:

But it’s one thing for Jackie Mason to scowl at a seat in the front row and yell to the shocked audience member in his imagination, “These are jokes! They are just jokes!” and another to reprise statistical howlers, which are not jokes, to me. This blog found its reason for being partly as a place to expose, understand, and avoid them. I had earlier used this Jackie Mason opening to launch into a well-known fallacy of rejection using statistical significance tests. I’m going to go further this time around. I began by needling some leading philosophers of statistics: Continue reading

## Excerpts from S. Senn’s Letter on “Replication, p-values and Evidence”

.

I first blogged this letter here. Below the references are some more recent blog links of relevance to this issue.

Dear Reader:  I am typing in some excerpts from a letter Stephen Senn shared with me in relation to my April 28, 2012 blogpost.  It is a letter to the editor of Statistics in Medicine  in response to S. Goodman. It contains several important points that get to the issues we’ve been discussing. You can read the full letter here. Sincerely, D. G. Mayo

STATISTICS IN MEDICINE, LETTER TO THE EDITOR

From: Stephen Senn*

Some years ago, in the pages of this journal, Goodman gave an interesting analysis of ‘replication probabilities’ of p-values. Specifically, he considered the possibility that a given experiment had produced a p-value that indicated ‘significance’ or near significance (he considered the range p=0.10 to 0.001) and then calculated the probability that a study with equal power would produce a significant result at the conventional level of significance of 0.05. He showed, for example, that given an uninformative prior, and (subsequently) a resulting p-value that was exactly 0.05 from the first experiment, the probability of significance in the second experiment was 50 per cent. A more general form of this result is as follows. If the first trial yields p=α then the probability that a second trial will be significant at significance level α (and in the same direction as the first trial) is 0.5. Continue reading

Categories: 4 years ago!, reproducibility, S. Senn, Statistics |

## WHIPPING BOYS AND WITCH HUNTERS (ii)

.

At least as apt today as 3 years ago…HAPPY HALLOWEEN! Memory Lane with new comments in blue

In an earlier post I alleged that frequentist hypotheses tests often serve as whipping boys, by which I meant “scapegoats”, for the well-known misuses, abuses, and flagrant misinterpretations of tests (both simple Fisherian significance tests and Neyman-Pearson tests, although in different ways)—as well as for what really boils down to a field’s weaknesses in modeling, theorizing, experimentation, and data collection.  Checking the history of this term however, there is a certain disanalogy with at least the original meaning of a “whipping boy,” namely, an innocent boy who was punished when a medieval prince misbehaved and was in need of discipline.  It was thought that seeing an innocent companion, often a friend, beaten for his own transgressions would supply an effective way to ensure the prince would not repeat the same mistake. But significance tests floggings, rather than a tool for a humbled self-improvement and commitment to avoiding flagrant rule violations, has tended instead to yield declarations that it is the rules that are invalid! The violators are excused as not being able to help it! The situation is more akin to that of witch hunting that in some places became an occupation in its own right.

Now some early literature, e.g., Morrison and Henkel’s Significance Test Controversy (1962), performed an important service over fifty years ago.  They alerted social scientists to the fallacies of significance tests: misidentifying a statistically significant difference with one of substantive importance, interpreting insignificant results as evidence for the null hypothesis—especially problematic with insensitive tests, and the like. Chastising social scientists for applying significance tests in slavish and unthinking ways, contributors call attention to a cluster of pitfalls and fallacies of testing. Continue reading

|

## NEYMAN: “Note on an Article by Sir Ronald Fisher” (3 uses for power, Fisher’s fiducial argument)

Note on an Article by Sir Ronald Fisher

By Jerzy Neyman (1956)

Summary

(1) FISHER’S allegation that, contrary to some passages in the introduction and on the cover of the book by Wald, this book does not really deal with experimental design is unfounded. In actual fact, the book is permeated with problems of experimentation.  (2) Without consideration of hypotheses alternative to the one under test and without the study of probabilities of the two kinds, no purely probabilistic theory of tests is possible.  (3) The conceptual fallacy of the notion of fiducial distribution rests upon the lack of recognition that valid probability statements about random variables usually cease to be valid if the random variables are replaced by their particular values.  The notorious multitude of “paradoxes” of fiducial theory is a consequence of this oversight.  (4)  The idea of a “cost function for faulty judgments” appears to be due to Laplace, followed by Gauss.

1. Introduction

In a recent article (Fisher, 1955), Sir Ronald Fisher delivered an attack on a a substantial part of the research workers in mathematical statistics. My name is mentioned more frequently than any other and is accompanied by the more expressive invectives. Of the scientific questions raised by Fisher many were sufficiently discussed before (Neyman and Pearson, 1933; Neyman, 1937; Neyman, 1952). In the present note only the following points will be considered: (i) Fisher’s attack on the concept of errors of the second kind; (ii) Fisher’s reference to my objections to fiducial probability; (iii) Fisher’s reference to the origin of the concept of loss function and, before all, (iv) Fisher’s attack on Abraham Wald.

THIS SHORT (5 page) NOTE IS NEYMAN’S PORTION OF WHAT I CALL THE “TRIAD”. LET ME POINT YOU TO THE TOP HALF OF p. 291, AND THE DISCUSSION OF FIDUCIAL INFERENCE ON p. 292 HERE.

Categories: Fisher, Neyman, phil/history of stat, Statistics |

## R.A. Fisher: ‘Two New Properties of Mathematical Likelihood’: Just before breaking up (with N-P)

17 February 1890–29 July 1962

In recognition of R.A. Fisher’s birthday tomorrow, I will post several entries on him. I find this (1934) paper to be intriguing –immediately before the conflicts with Neyman and Pearson erupted. It represents essentially the last time he could take their work at face value, without the professional animosities that almost entirely caused, rather than being caused by, the apparent philosophical disagreements and name-calling everyone focuses on. Fisher links his tests and sufficiency, to the Neyman and Pearson lemma in terms of power.  It’s as if we may see them as ending up in a very similar place (no pun intended) while starting from different origins. I quote just the most relevant portions…the full article is linked below. I’d blogged it earlier here.  You may find some gems in it.

‘Two new Properties of Mathematical Likelihood’

by R.A. Fisher, F.R.S.

Proceedings of the Royal Society, Series A, 144: 285-307(1934)

The property that where a sufficient statistic exists, the likelihood, apart from a factor independent of the parameter to be estimated, is a function only of the parameter and the sufficient statistic, explains the principle result obtained by Neyman and Pearson in discussing the efficacy of tests of significance.  Neyman and Pearson introduce the notion that any chosen test of a hypothesis H0 is more powerful than any other equivalent test, with regard to an alternative hypothesis H1, when it rejects H0 in a set of samples having an assigned aggregate frequency ε when H0 is true, and the greatest possible aggregate frequency when H1 is true.

If any group of samples can be found within the region of rejection whose probability of occurrence on the hypothesis H1 is less than that of any other group of samples outside the region, but is not less on the hypothesis H0, then the test can evidently be made more powerful by substituting the one group for the other. Continue reading

Categories: Fisher, phil/history of stat, Statistics |

## 2015 Saturday Night Brainstorming and Task Forces: (4th draft)

TFSI workgroup

Saturday Night Brainstorming: The TFSI on NHST–part reblog from here and here, with a substantial 2015 update!

Each year leaders of the movement to “reform” statistical methodology in psychology, social science, and other areas of applied statistics get together around this time for a brainstorming session. They review the latest from the Task Force on Statistical Inference (TFSI), propose new regulations they would like to see adopted, not just by the APA publication manual any more, but all science journals! Since it’s Saturday night, let’s listen in on part of an (imaginary) brainstorming session of the New Reformers.

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Frustrated that the TFSI has still not banned null hypothesis significance testing (NHST)–a fallacious version of statistical significance tests that dares to violate Fisher’s first rule: It’s illicit to move directly from statistical to substantive effects–the New Reformers have created, and very successfully published in, new meta-level research paradigms designed expressly to study (statistically!) a central question: have the carrots and sticks of reward and punishment been successful in decreasing the use of NHST, and promoting instead use of confidence intervals, power calculations, and meta-analysis of effect sizes? Or not?

Most recently, the group has helped successfully launch a variety of “replication and reproducibility projects”. Having discovered how much the reward structure encourages bad statistics and gaming the system, they have cleverly pushed to change the reward structure: Failed replications (from a group chosen by a crowd-sourced band of replicationistas ) would not be hidden in those dusty old file drawers, but would be guaranteed to be published without that long, drawn out process of peer review. Do these failed replications indicate the original study was a false positive? or that the replication attempt is a false negative?  It’s hard to say.

This year, as is typical, there is a new member who is pitching in to contribute what he hopes are novel ideas for reforming statistical practice. In addition, for the first time, there is a science reporter blogging the meeting for her next free lance “bad statistics” piece for a high impact science journal. Notice, it seems this committee only grows, no one has dropped off, in the 3 years I’ve followed them.

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Pawl: This meeting will come to order. I am pleased to welcome our new member, Dr. Ian Nydes, adding to the medical strength we have recently built with epidemiologist S.C.. In addition, we have a science writer with us today, Jenina Oozo. To familiarize everyone, we begin with a review of old business, and gradually turn to new business.

Franz: It’s so darn frustrating after all these years to see researchers still using NHST methods; some of the newer modeling techniques routinely build on numerous applications of those pesky tests.

Jake: And the premier publication outlets in the social sciences still haven’t mandated the severe reforms sorely needed. Hopefully the new blood, Dr. Ian Nydes, can help us go beyond resurrecting the failed attempts of the past. Continue reading

## Higgs Discovery two years on (1: “Is particle physics bad science?”)

July 4, 2014 was the two year anniversary of the Higgs boson discovery. As the world was celebrating the “5 sigma!” announcement, and we were reading about the statistical aspects of this major accomplishment, I was aghast to be emailed a letter, purportedly instigated by Bayesian Dennis Lindley, through Tony O’Hagan (to the ISBA). Lindley, according to this letter, wanted to know:

“Are the particle physics community completely wedded to frequentist analysis?  If so, has anyone tried to explain what bad science that is?”

Fairly sure it was a joke, I posted it on my “Rejected Posts” blog for a bit until it checked out [1]. (See O’Hagan’s “Digest and Discussion”) Continue reading

## Fallacy of Rejection and the Fallacy of Nouvelle Cuisine

Any Jackie Mason fans out there? In connection with our discussion of power,and associated fallacies of rejection*–and since it’s Saturday night–I’m reblogging the following post.

In February [2012], in London, criminologist Katrin H. and I went to see Jackie Mason do his shtick, a one-man show billed as his swan song to England.  It was like a repertoire of his “Greatest Hits” without a new or updated joke in the mix.  Still, hearing his rants for the nth time was often quite hilarious.

A sample: If you want to eat nothing, eat nouvelle cuisine. Do you know what it means? No food. The smaller the portion the more impressed people are, so long as the food’s got a fancy French name, haute cuisine. An empty plate with sauce!

As one critic wrote, Mason’s jokes “offer a window to a different era,” one whose caricatures and biases one can only hope we’ve moved beyond: But it’s one thing for Jackie Mason to scowl at a seat in the front row and yell to the shocked audience member in his imagination, “These are jokes! They are just jokes!” and another to reprise statistical howlers, which are not jokes, to me. This blog found its reason for being partly as a place to expose, understand, and avoid them. Recall the September 26, 2011 post “Whipping Boys and Witch Hunters”: [i]

Fortunately, philosophers of statistics would surely not reprise decades-old howlers and fallacies. After all, it is the philosopher’s job to clarify and expose the conceptual and logical foibles of others; and even if we do not agree, we would never merely disregard and fail to address the criticisms in published work by other philosophers.  Oh wait, ….one of the leading texts repeats the fallacy in their third edition: Continue reading

## R.A. Fisher: ‘Two New Properties of Mathematical Likelihood’

17 February 1890–29 July 1962

Exactly 1 year ago: I find this to be an intriguing discussion–before some of the conflicts with N and P erupted.  Fisher links his tests and sufficiency, to the Neyman and Pearson lemma in terms of power.  It’s as if we may see them as ending up in a similar place while starting from different origins. I quote just the most relevant portions…the full article is linked below.

by R.A. Fisher, F.R.S.

Proceedings of the Royal Society, Series A, 144: 285-307 (1934)

The property that where a sufficient statistic exists, the likelihood, apart from a factor independent of the parameter to be estimated, is a function only of the parameter and the sufficient statistic, explains the principle result obtained by Neyman and Pearson in discussing the efficacy of tests of significance.  Neyman and Pearson introduce the notion that any chosen test of a hypothesis H0 is more powerful than any other equivalent test, with regard to an alternative hypothesis H1, when it rejects H0 in a set of samples having an assigned aggregate frequency ε when H0 is true, and the greatest possible aggregate frequency when H1 is true.

If any group of samples can be found within the region of rejection whose probability of occurrence on the hypothesis H1 is less than that of any other group of samples outside the region, but is not less on the hypothesis H0, then the test can evidently be made more powerful by substituting the one group for the other. Continue reading

Categories: Fisher, phil/history of stat, Statistics |

## WHIPPING BOYS AND WITCH HUNTERS

This, from 2 years ago, “fits” at least as well today…HAPPY HALLOWEEN! Memory Lane

In an earlier post I alleged that frequentist hypotheses tests often serve as whipping boys, by which I meant “scapegoats”, for the well-known misuses, abuses, and flagrant misinterpretations of tests (both simple Fisherian significance tests and Neyman-Pearson tests, although in different ways).  Checking the history of this term however, there is a certain disanalogy with at least the original meaning of a of “whipping boy,” namely, an innocent boy who was punished when a medieval prince misbehaved and was in need of discipline.   It was thought that seeing an innocent companion, often a friend, beaten for his own transgressions would supply an effective way to ensure the prince would not repeat the same mistake. But significance tests floggings, rather than a tool for a humbled self-improvement and commitment to avoiding flagrant rule violations, has tended instead to yield declarations that it is the rules that are invalid! The violators are excused as not being able to help it! The situation is more akin to that of witch hunting, that in some places became an occupation in its own right.

Now some early literature, e.g., Morrison and Henkel’s Significance Test Controversy (1962), performed an important service over fifty years ago.  They alerted social scientists to the fallacies of significance tests: misidentifying a statistically significant difference with one of substantive importance, interpreting insignificant results as evidence for the null hypothesis–especially problematic with insensitive tests, and the like. Chastising social scientists for applying significance tests in slavish and unthinking ways, contributors call attention to a cluster of pitfalls and fallacies of testing.

The volume describes research studies conducted for the sole purpose of revealing these flaws. Rosenthal and Gaito (1963) document how it is not rare for scientists to mistakenly regard a statistically significant difference, say at level .05, as indicating a greater discrepancy from the null when arising from a large sample size rather than a smaller sample size—even though a correct interpretation of tests indicates the reverse. By and large, these critics are not espousing a Bayesian line but rather see themselves as offering “reforms” e.g., supplementing simple significance tests with power (e.g., Jacob Cohen’s “power analytic movement), and most especially,  replacing tests with confidence interval estimates of the size of discrepancy (from the null) indicated by the data.  Of course, the use of power is central for (frequentist) Neyman-Pearson tests, and (frequentist) confidence interval estimation even has a duality with hypothesis tests!)

But rather than take a temporary job of pointing up some understandable fallacies in the use of newly adopted statistical tools by social scientific practitioners, or lead by example of right-headed statistical analyses, the New Reformers have seemed to settle into a permanent career of showing the same fallacies.  Yes, they advocate “alternative” methods, e.g., “effect size” analysis, power analysis, confidence intervals, meta-analysis.  But never having adequately unearthed the essential reasoning and rationale of significance tests—admittedly something that goes beyond many typical expositions—their supplements and reforms often betray the same confusions and pitfalls that underlie the methods they seek to supplement or replace! (I will give readers a chance to demonstrate this in later posts.)

We all reject the highly lampooned, recipe-like uses of significance tests; I and others insist on interpreting tests to reflect the extent of discrepancy indicated or not (back when I was writing my doctoral dissertation and EGEK 1996).  I never imagined that hypotheses tests (of all stripes) would continue to be flogged again and again, in the same ways!

Frustrated with the limited progress in psychology, apparently inconsistent results, and lack of replication, an imagined malign conspiracy of significance tests is blamed: traditional reliance on statistical significance testing, we hear,

“has a debilitating effect on the general research effort to develop cumulative theoretical knowledge and understanding. However, it is also important to note that it destroys the usefulness of psychological research as a means for solving practical problems in society” (Schmidt 1996, 122)[i].

Meta-analysis was to be the cure that would provide cumulative knowledge to psychology: Lest enthusiasm for revisiting the same cluster of elementary fallacies of tests begin to lose steam, the threats of dangers posed  become ever shriller: just as the witch is responsible for whatever ails a community, the significance tester is portrayed as so powerful as to be responsible for blocking scientific progress. In order to keep the gig alive, a certain level of breathless hysteria is common: “statistical significance is hurting people, indeed killing them” (Ziliak and McCloskey 2008, 186)[ii]; significance testers are members of a “cult” led by R.A. Fisher” whom they call “The Wasp”.  To the question, “What if there were no Significance Tests,” as the title of one book inquires[iii], surely the implication is that once tests are extirpated, their research projects would bloom and thrive; so let us have Task Forces[iv] to keep reformers busy at journalistic reforms to banish the test once and for all!

Harlow, L., Mulaik, S., Steiger, J. (Eds.) What if there were no significance tests? (pp. 37-64). Mahwah, NJ: Lawrence Erlbaum Associates.

Hunter, J.E. (1997), “Needed: A Ban on the Significance Test,”, American Psychological Society 8:3-7.

Morrison, D. and Henkel, R. (eds.) (1970), The Significance Test Controversy, Aldine, Chicago.

MSERA (1998), Research in the Schools, 5(2) “Special Issue: Statistical Significance Testing,” Birmingham, Alabama.

Rosenthal, R. and Gaito, J. (1963), “The Interpretation of Levels of Significance by Psychologicl Researchers,”  Journal of Psychology 55:33-38.

Ziliak, T. and McCloskey, D. (2008), The Cult of Statistical Significance, University of Michigan Press.

[i]Schmidt was the one Erich Lehmann wrote to me about, expressing great concern.

[ii] While setting themselves up as High Priest and Priestess of “reformers” their own nostroms reveal they fall into the same fallacy pointed up by Rosenthal and Gaito (among many others) nearly a half a century ago.  That’s what should scare us!

[iii] In Lisa A. Harlow, Stanley A. Mulaik, and James H. Steiger (Eds.) What if there were no significance tests? (pp. 37-64). Mahwah, NJ: Lawrence Erlbaum Associates.

[iv] MSERA (1998): ‘Special Issue: Statistical Significance Testing,’ Research in the Schools, 5.   See also Hunter (1997). The last I heard, they have not succeeded in their attempt at an all-out “test ban”.  Interested readers might check the status of the effort, and report back.

Related posts:

“What do these share in common: MMs, limbo stick, ovulation, Dale Carnegie?: Sat. night potpourri”

Categories: significance tests, Statistics |

## Is Particle Physics Bad Science? (memory lane)

I suppose[ed] this was somewhat of a joke from the ISBA, prompted by Dennis Lindley, but as I [now] accord the actual extent of jokiness to be only ~10%, I’m sharing it on the blog [i].  Lindley (according to O’Hagan) wonders why scientists require so high a level of statistical significance before claiming to have evidence of a Higgs boson.  It is asked: “Are the particle physics community completely wedded to frequentist analysis?  If so, has anyone tried to explain what bad science that is?”

Bad science?   I’d really like to understand what these representatives from the ISBA would recommend, if there is even a shred of seriousness here (or is Lindley just peeved that significance levels are getting so much press in connection with so important a discovery in particle physics?)

Well, read the letter and see what you think.

On Jul 10, 2012, at 9:46 PM, ISBA Webmaster wrote:

Dear Bayesians,

A question from Dennis Lindley prompts me to consult this list in search of answers.

We’ve heard a lot about the Higgs boson.  The news reports say that the LHC needed convincing evidence before they would announce that a particle had been found that looks like (in the sense of having some of the right characteristics of) the elusive Higgs boson.  Specifically, the news referred to a confidence interval with 5-sigma limits.

Now this appears to correspond to a frequentist significance test with an extreme significance level.  Five standard deviations, assuming normality, means a p-value of around 0.0000005.  A number of questions spring to mind.

1.  Why such an extreme evidence requirement?  We know from a Bayesian  perspective that this only makes sense if (a) the existence of the Higgs  boson (or some other particle sharing some of its properties) has extremely small prior probability and/or (b) the consequences of erroneously announcing its discovery are dire in the extreme.  Neither seems to be the case, so why  5-sigma?

2.  Rather than ad hoc justification of a p-value, it is of course better to do a proper Bayesian analysis.  Are the particle physics community completely wedded to frequentist analysis?  If so, has anyone tried to explain what bad science that is? Continue reading

Categories: philosophy of science, Statistics |

## PhilStatLaw: Reference Manual on Scientific Evidence (3d ed) on Statistical Significance (Schachtman)

Memory Lane: One Year Ago on error statistics.com

A quick perusal of the “Manual” on Nathan Schachtman’sshows it to be chock full of revealing points of contemporary legal statistical philosophy.  The following are some excerpts, read the full blog here.   I make two comments at the end.

July 8th, 2012

Nathan Schachtman

How does the new Reference Manual on Scientific Evidence (RMSE3d 2011) treat statistical significance?  Inconsistently and at times incoherently.

Professor Berger’s Introduction

In her introductory chapter, the late Professor Margaret A. Berger raises the question of the role statistical significance should play in evaluating a study’s support for causal conclusions:

“What role should statistical significance play in assessing the value of a study? Epidemiological studies that are not conclusive but show some increased risk do not prove a lack of causation. Some courts find that they therefore have some probative value, 62 at least in proving general causation. 63”

Margaret A. Berger, “The Admissibility of Expert Testimony,” in RMSE3d 11, 24 (2011).

This seems rather backwards.  Berger’s suggestion that inconclusive studies do not prove lack of causation seems nothing more than a tautology.  And how can that tautology support the claim that inconclusive studies “therefore ” have some probative value? This is a fairly obvious logical invalid argument, or perhaps a passage badly in need of an editor.

…………

Chapter on Statistics

The RMSE’s chapter on statistics is relatively free of value judgments about significance probability, and, therefore, a great improvement upon Berger’s introduction.  The authors carefully describe significance probability and p-values, and explain:

“Small p-values argue against the null hypothesis. Statistical significance is determined by reference to the p-value; significance testing (also called hypothesis testing) is the technique for computing p-values and determining statistical significance.”

David H. Kaye and David A. Freedman, “Reference Guide on Statistics,” in RMSE3d 211, 241 (3ed 2011).  Although the chapter confuses and conflates Fisher’s interpretation of p-values with Neyman’s conceptualization of hypothesis testing as a dichotomous decision procedure, this treatment is unfortunately fairly standard in introductory textbooks.

Kaye and Freedman, however, do offer some important qualifications to the untoward consequences of using significance testing as a dichotomous outcome:

“Artifacts from multiple testing are commonplace. Because research that fails to uncover significance often is not published, reviews of the literature may produce an unduly large number of studies finding statistical significance.111 Even a single researcher may examine so many different relationships that a few will achieve statistical significance by mere happenstance. Almost any large data set—even pages from a table of random digits—will contain some unusual pattern that can be uncovered by diligent search. Having detected the pattern, the analyst can perform a statistical test for it, blandly ignoring the search effort. Statistical significance is bound to follow.

There are statistical methods for dealing with multiple looks at the data, which permit the calculation of meaningful p-values in certain cases.112 However, no general solution is available, and the existing methods would be of little help in the typical case where analysts have tested and rejected a variety of models before arriving at the one considered the most satisfactory (see infra Section V on regression models). In these situations, courts should not be overly impressed with claims that estimates are significant. Instead, they should be asking how analysts developed their models.113 ”

Id. at 256 -57.  This qualification is omitted from the overlapping discussion in the chapter on epidemiology, where it is very much needed. Continue reading

Categories: P-values, PhilStatLaw, significance tests |

## Bad news bears: ‘Bayesian bear’ rejoinder-reblog mashup

Oh No! It’s those mutant bears again. To my dismay, I’ve been sent, for the third time, that silly, snarky, adolescent, clip of those naughty “what the p-value” bears (first posted on Aug 5, 2012), who cannot seem to get a proper understanding of significance tests into their little bear brains. So apparently some people haven’t seen my rejoinder which, as I said then, practically wrote itself. So since it’s Saturday night here at the Elbar Room, let’s listen in to a mashup of both the clip and my original rejoinder (in which p-value bears are replaced with hypothetical Bayesian bears).

These stilted bear figures and their voices are sufficiently obnoxious in their own right, even without the tedious lampooning of p-values and the feigned horror at learning they should not be reported as posterior probabilities.

Mayo’s Rejoinder:

Bear #1: Do you have the results of the study?

Bear #2:Yes. The good news is there is a .996 probability of a positive difference in the main comparison.

Bear #1: Great. So I can be well assured that there is just a .004 probability that such positive results would occur if they were merely due to chance.

Bear #2: Not really, that would be an incorrect interpretation. Continue reading

## Do CIs Avoid Fallacies of Tests? Reforming the Reformers (Reblog 5/17/12)

The one method that enjoys the approbation of the New Reformers is that of confidence intervals. The general recommended interpretation is essentially this:

For a reasonably high choice of confidence level, say .95 or .99, values of µ within the observed interval are plausible, those outside implausible.

Geoff Cumming, a leading statistical reformer in psychology, has long been pressing for ousting significance tests (or NHST[1]) in favor of CIs. The level of confidence “specifies how confident we can be that our CI includes the population parameter m (Cumming 2012, p.69). He recommends prespecified confidence levels .9, .95 or .99:

“We can say we’re 95% confident our one-sided interval includes the true value. We can say the lower limit (LL) of the one-sided CI…is a likely lower bound for the true value, meaning that for 5% of replications the LL will exceed the true value. “ (Cumming 2012, p. 112)[2]

For simplicity, I will use the 2-standard deviation cut-off corresponding to the one-sided confidence level of ~.98.

However, there is a duality between tests and intervals (the intervals containing the parameter values not rejected at the corresponding level with the given data).[3]

“One-sided CIs are analogous to one-tailed tests but, as usual, the estimation approach is better.”

Is it?   Consider a one-sided test of the mean of a Normal distribution with n iid samples, and known standard deviation σ, call it test T+.

H0: µ ≤  0 against H1: µ >  0 , and let σ= 1.

Test T+ at significance level .02 is analogous to forming the one-sided (lower) 98% confidence interval:

µ > M – 2(1/ √n ).

where M, following Cumming, is the sample mean (thereby avoiding those x-bars). M – 2(1/ √n ) is the lower limit (LL) of a 98% CI.

Central problems with significance tests (whether of the N-P or Fisherian variety) include:

(1) results are too dichotomous (e.g., significant at a pre-set level or not);

(2) two equally statistically significant results but from tests with different sample sizes are reported in the same way  (whereas the larger the sample size the smaller the discrepancy the test is able to detect);

(3) significance levels (even observed p-values) fail to indicate the extent of the effect or discrepancy (in the case of test T+ , in the positive direction).

We would like to know for what values of δ it is warranted to infer  µ > µ0 + δ. Continue reading

Categories: confidence intervals and tests, reformers, Statistics |

## R.A. Fisher: ‘Two New Properties of Mathematical Likelihood’

17 February 1890–29 July 1962

I find this to be an intriguing discussion–before some of the conflicts with N and P erupted.  Fisher links his tests and sufficiency, to the Neyman and Pearson lemma in terms of power.  It’s as if we may see them as ending up in a similar place while starting from different origins. I quote just the most relevant portions…the full article is linked below.

by R.A. Fisher, F.R.S.

Proceedings of the Royal Society, Series A, 144: 285-307 (1934)

The property that where a sufficient statistic exists, the likelihood, apart from a factor independent of the parameter to be estimated, is a function only of the parameter and the sufficient statistic, explains the principle result obtained by Neyman and Pearson in discussing the efficacy of tests of significance.  Neyman and Pearson introduce the notion that any chosen test of a hypothesis H0 is more powerful than any other equivalent test, with regard to an alternative hypothesis H1, when it rejects H0 in a set of samples having an assigned aggregate frequency ε when H0 is true, and the greatest possible aggregate frequency when H1 is true.

If any group of samples can be found within the region of rejection whose probability of occurrence on the hypothesis H1 is less than that of any other group of samples outside the region, but is not less on the hypothesis H0, then the test can evidently be made more powerful by substituting the one group for the other.

Consequently, for the most powerful test possible the ratio of the probabilities of occurrence on the hypothesis H0 to that on the hypothesis H1 is less in all samples in the region of rejection than in any sample outside it. For samples involving continuous variation the region of rejection will be bounded by contours for which this ratio is constant. The regions of rejection will then be required in which the likelihood of H0 bears to the likelihood of H1, a ratio less than some fixed value defining the contour. (295)…

It is evident, at once, that such a system is only possible when the class of hypotheses considered involves only a single parameter θ, or, what come to the same thing, when all the parameters entering into the specification of the population are definite functions of one of their number.  In this case, the regions defined by the uniformly most powerful test of significance are those defined by the estimate of maximum likelihood, T.  For the test to be uniformly most powerful, moreover, these regions must be independent of θ showing that the statistic must be of the special type distinguished as sufficient.  Such sufficient statistics have been shown to contain all the information which the sample provides relevant to the value of the appropriate parameter θ . It is inevitable therefore that if such a statistic exists it should uniquely define the contours best suited to discriminate among hypotheses differing only in respect of this parameter; and it is surprising that Neyman and Pearson should lay it down as a preliminary consideration that ‘the tesitng of statistical hypotheses cannot be treated as a problem in estimation.’ When tests are considered only in relation to sets of hypotheses specified by one or more variable parameters, the efficacy of the tests can be treated directly as the problem of estimation of these parameters.  Regard for what has been established in that theory, apart from the light it throws on the results already obtained by their own interesting line of approach, should also aid in treating the difficulties inherent in cases in which no sufficient statistics exists. (296)

Categories: phil/history of stat, Statistics |