Right after our session at the SPSP meeting last Friday, I chaired a symposium on replication that included Brian Earp–an active player in replication research in psychology *(*Replication and Evidence: A tenuous relationship p. 80). One of the first things he said, according to my notes, is that gambits such as cherry picking, p-hacking, hunting for significance, selective reporting, and other QRPs, had ~~been taught as acceptable~~ become standard practice in psychology, without any special need to adjust p-values or alert the reader to their spuriousness [i]. (He will correct me if I’m wrong[2].) It shocked me to hear it, even though it shouldn’t have, given what I’ve learned about statistical practice in social science. It was the Report on Stapel that really pulled back the curtain on this attitude toward QRPs in social psychology–as discussed in this blogpost 3 years ago. (If you haven’t read Section 5 of the report on flawed science, you should.) Many of us assumed that QRPs, even if still committed, were at least recognized to be bad statistical practices since the time of Morrison and Henkel’s (1970) *Significance Test Controversy*. A question now is this: have all the confessions of dirty laundry, the fraudbusting of prominent researchers, the pledges to straighten up and fly right, the years of replication research, done anything to remove the stains? I leave the question open for now. Here’s my* “statistical dirty laundry” post from 2013:* Continue reading

# statistical tests

## Some statistical dirty laundry: have the stains become permanent?

## Fallacies of Rejection, Nouvelle Cuisine, and assorted New Monsters

Whenever I’m in London, my criminologist friend Katrin H. and I go in search of stand-up comedy. Since it’s Saturday night (and I’m in London), we’re setting out in search of a good comedy club (I’ll complete this post upon return). A few years ago we heard Jackie Mason do his shtick, a one-man show billed as his swan song to England. It was like a repertoire of his “Greatest Hits” without a new or updated joke in the mix. Still, hearing his rants for the n^{th} time was often quite hilarious. It turns out that he has already been back doing another “final shtick tour” in England, but not tonight.

A sample: If you want to eat nothing, eat nouvelle cuisine. Do you know what it means? No food. The smaller the portion the more impressed people are, so long as the food’s got a fancy French name, haute cuisine. An empty plate with sauce!

As one critic wrote, Mason’s jokes “offer a window to a different era,” one whose caricatures and biases one can only hope we’ve moved beyond:

But it’s one thing for Jackie Mason to scowl at a seat in the front row and yell to the shocked audience member in his imagination, “These are jokes! They are just jokes!” and another to reprise statistical howlers, which are not jokes, to me. This blog found its reason for being partly as a place to expose, understand, and avoid them. I had earlier used this Jackie Mason opening to launch into a well-known fallacy of rejection using statistical significance tests. I’m going to go further this time around. I began by needling some leading philosophers of statistics: Continue reading

## A. Spanos: Talking back to the critics using error statistics

Given all the recent attention given to kvetching about significance tests, it’s an apt time to reblog Aris Spanos’ overview of the error statistician talking back to the critics [1]. A related paper for your Saturday night reading is Mayo and Spanos (2011).[2] It mixes the error statistical philosophy of science with its philosophy of statistics, introduces severity, and responds to 13 criticisms and howlers.

I’m going to comment on some of the ASA discussion contributions I hadn’t discussed earlier. Please share your thoughts in relation to any of this.

[1]It was first blogged here, as part of our seminar 2 years ago.

[2] For those seeking a bit more balance to the main menu offered in the ASA Statistical Significance Reference list.

See also on this blog:

A. Spanos, “Recurring controversies about p-values and confidence intervals revisited”

A. Spanos, “Lecture on frequentist hypothesis testing

## “A small p-value indicates it’s improbable that the results are due to chance alone” –fallacious or not? (more on the ASA p-value doc)

There’s something about “Principle 2” in the ASA document on p-values that I couldn’t address in my brief commentary, but is worth examining more closely.

*2. P-values do not measure (a) the probability that the studied hypothesis is true , or (b) the probability that the data were produced ** by random chance alone,*

(a) is true, but what about (b)? That’s what I’m going to focus on, because I think it is often misunderstood. It was discussed earlier on this blog in relation to the **Higgs experiments** and deconstructing “the probability the results are ‘statistical flukes'”. So let’s examine: Continue reading

## Hocus pocus! Adopt a magician’s stance, if you want to reveal statistical sleights of hand

When they sought to subject Uri Geller to the scrutiny of scientists, magicians had to be brought in because only they were sufficiently trained to spot the subtle sleight of hand shifts by which the magician tricks by misdirection. We, too, have to be magicians to discern the subtle misdirections and shifts of meaning in the discussions of statistical significance tests (and other methods)—even by the same statistical guide. We needn’t suppose anything deliberately devious is going on at all! Often, the statistical guidebook reflects shifts of meaning that grow out of one or another critical argument. These days, they trickle down quickly to statistical guidebooks, thanks to popular articles on the “statistics crisis in science”. The danger is that their own guidebooks contain inconsistencies. To adopt the magician’s stance is to be on the lookout for standard sleights of hand. There aren’t that many.[0]

I don’t know Jim Frost, but he gives statistical guidance at the minitab blog. The purpose of my previous post is to point out that Frost uses the probability of a Type I error in two incompatible ways in his posts on significance tests. I assumed he’d want to clear this up, but so far he has not. His response to a comment I made on his blog is this: Continue reading

## Stephen Senn: The pathetic P-value (Guest Post) [3]

**Stephen Senn**

Head of Competence Center for Methodology and Statistics (CCMS)

Luxembourg Institute of Health

**The pathetic P-value* [3]**

This is the way the story is now often told. RA Fisher is the villain. Scientists were virtuously treading the Bayesian path, when along came Fisher and gave them P-values, which they gladly accepted, because they could get ‘significance’ so much more easily. Nearly a century of corrupt science followed but now there are signs that there is a willingness to return to the path of virtue and having abandoned this horrible Fisherian complication:

We shall not cease from exploration

And the end of all our exploring

Will be to arrive where we started …

A condition of complete simplicity..

And all shall be well and

All manner of thing shall be wellTS Eliot, Little Gidding

Consider, for example, distinguished scientist David Colquhoun citing the excellent scientific journalist Robert Matthews as follows

“There is an element of truth in the conclusion of a perspicacious journalist:

‘The plain fact is that 70 years ago Ronald Fisher gave scientists a mathematical machine for turning baloney into breakthroughs, and flukes into funding. It is time to pull the plug. ‘

Robert Matthews

Sunday Telegraph, 13 September 1998.” [1]

However, this is not a plain fact but just plain wrong. Even if P-values were the guilty ‘mathematical machine’ they are portrayed to be, it is not RA Fisher’s fault. Putting the historical record right helps one to understand the issues better. As I shall argue, at the heart of this is not a disagreement between Bayesian and frequentist approaches but between two Bayesian approaches: it is a conflict to do with the choice of prior distributions[2].

Fisher did not persuade scientists to calculate P-values rather than Bayesian posterior probabilities; he persuaded them that the probabilities that they *were already calculating* *and interpreting as posterior probabilities* relied for this interpretation on a doubtful assumption. He proposed to replace this interpretation with one that did not rely on the assumption. Continue reading

## The Paradox of Replication, and the vindication of the P-value (but she can go deeper) 9/2/15 update (ii)

**The Paradox of Replication**

Critic 1: It’s much too easy to get small P-values.

Critic 2: We find it very difficult to get small P-values; only 36 of 100 psychology experiments were found to yield small P-values in the recent Open Science collaboration on replication (in psychology).

Is it easy or is it hard?

You might say, there’s no paradox, the problem is that the significance levels in the original studies are often due to cherry-picking, multiple testing, optional stopping and other *biasing selection effects*. *The mechanism by which biasing selection effects blow up P-values is very well understood, and we can demonstrate exactly how it occurs.* In short, many of the initially significant results merely report “nominal” P-values not “actual” ones, and there’s nothing inconsistent between the complaints of critic 1 and critic 2.

The resolution of the paradox attests to what many have long been saying: the problem is not with the statistical methods but with their abuse. Even the P-value, the most unpopular girl in the class, gets to show a little bit of what she’s capable of. She will give you a hard time when it comes to replicating nominally significant results, if they were largely due to biasing selection effects. That is just what is wanted; it is an asset that she feels the strain, and lets you know. It is statistical accounts that can’t pick up on biasing selection effects that should worry us (especially those that deny they are relevant). That is one of the most positive things to emerge from the recent, impressive, replication project in psychology. From an article in the Smithsonian magazine “Scientists Replicated 100 Psychology Studies, and Fewer Than Half Got the Same Results”:

The findings also offered some support for the oft-criticized statistical tool known as the

Pvalue, which measures whether a result is significant or due to chance. …The project analysis showed that a low

Pvalue was fairly predictive of which psychology studies could be replicated. Twenty of the 32 original studies with aPvalue of less than 0.001 could be replicated, for example, while just 2 of the 11 papers with a value greater than 0.04 were successfully replicated. (Link is here.)

## Some statistical dirty laundry: The Tilberg (Stapel) Report on “Flawed Science”

*I had a chance to reread the 2012 Tilberg Report* on “Flawed Science” last night. The full report is now here. The discussion of the statistics is around pp. 17-21 (of course there was so little actual data in this case!) You might find it interesting. Here are some stray thoughts reblogged from 2 years ago…*

*1. Slipping into pseudoscience.*

The authors of the Report say they never anticipated giving a laundry list of “undesirable conduct” by which researchers can flout pretty obvious requirements for the responsible practice of science. It was an accidental byproduct of the investigation of one case (Diederik Stapel, social psychology) that they walked into a culture of “verification bias”[1]. Maybe that’s why I find it so telling. It’s as if they could scarcely believe their ears when people they interviewed “defended the serious and less serious violations of proper scientific method with the words: that is what I have learned in practice; everyone in my research environment does the same, and so does everyone we talk to at international conferences” (Report 48). So they trot out some obvious rules, and it seems to me that they do a rather good job.

One of the most fundamental rules of scientific research is that an investigation must be designed in such a way that facts that might refute the research hypotheses are given at least an equal chance of emerging as do facts that confirm the research hypotheses. Violations of this fundamental rule, such as continuing an experiment until it works as desired, or excluding unwelcome experimental subjects or results, inevitably tends to confirm the researcher’s research hypotheses, and essentially render the hypotheses immune to the facts

…. [T]he use of research procedures in such a way as to ‘repress’ negative results by some means” may be called verification bias. [my emphasis] (Report, 48).

I would place techniques for ‘verification bias’ under the general umbrella of techniques for squelching stringent criticism and repressing severe tests. These gambits make it so easy to find apparent support for one’s pet theory or hypotheses, as to count as no evidence at all (see some from their list ). Any field that regularly proceeds this way I would call a pseudoscience, or non-science, following Popper. “Observations or experiments can be accepted as supporting a theory (or a hypothesis, or a scientific assertion) only if these observations or experiments are severe tests of the theory” (Popper 1994, p. 89). [2] It is unclear at what point a field slips into the pseudoscience realm.

*2. A role for philosophy of science?*

I am intrigued that one of the final recommendations in the Report is this: Continue reading

## Stephen Senn: The pathetic P-value (Guest Post)

**Stephen Senn**

Head of Competence Center for Methodology and Statistics (CCMS)

Luxembourg Institute of Health

**The pathetic P-value**

This is the way the story is now often told. RA Fisher is the villain. Scientists were virtuously treading the Bayesian path, when along came Fisher and gave them P-values, which they gladly accepted, because they could get ‘significance’ so much more easily. Nearly a century of corrupt science followed but now there are signs that there is a willingness to return to the path of virtue and having abandoned this horrible Fisherian complication:

We shall not cease from exploration

And the end of all our exploring

Will be to arrive where we started …

A condition of complete simplicity..

And all shall be well and

All manner of thing shall be wellTS Eliot, Little Gidding

Consider, for example, distinguished scientist David Colquhoun citing the excellent scientific journalist Robert Matthews as follows

“There is an element of truth in the conclusion of a perspicacious journalist:

‘The plain fact is that 70 years ago Ronald Fisher gave scientists a mathematical machine for turning baloney into breakthroughs, and flukes into funding. It is time to pull the plug. ‘

Robert Matthews

Sunday Telegraph, 13 September 1998.” [1]

However, this is not a plain fact but just plain wrong. Even if P-values were the guilty ‘mathematical machine’ they are portrayed to be, it is not RA Fisher’s fault. Putting the historical record right helps one to understand the issues better. As I shall argue, at the heart of this is not a disagreement between Bayesian and frequentist approaches but between two Bayesian approaches: it is a conflict to do with the choice of prior distributions[2].

Fisher did not persuade scientists to calculate P-values rather than Bayesian posterior probabilities; he persuaded them that the probabilities that they *were already calculating* *and interpreting as posterior probabilities* relied for this interpretation on a doubtful assumption. He proposed to replace this interpretation with one that did not rely on the assumption. Continue reading

## What’s wrong with taking (1 – β)/α, as a likelihood ratio comparing H0 and H1?

**Here’s a quick note on something that I often find in discussions on tests, even though it treats “power”, which is a capacity-of-test notion, as if it were a fit-with-data notion…..**

1.** Take a one-sided Normal test T+: with n iid samples:**

**H _{0}: µ ≤ _{ }0 against H_{1}: µ > _{ }0**

σ = 10, n = 100, σ/√n =σ** _{x}**= 1, α = .025.

So the test would reject H_{0} iff Z > c_{.025} =1.96. (1.96. is the “cut-off”.)

~~~~~~~~~~~~~~

**Simple rules for alternatives against which T+ has high power:**

- If we add σ
_{x }(here 1) to the cut-off (here, 1.96) we are at an alternative value for µ that test T+ has .84 power to detect. - If we add 3σ
to the cut-off we are at an alternative value for µ that test T+ has ~ .999 power to detect. This value, which we can write as µ_{x }^{.}^{999}= 4.96

Let the observed outcome just reach the cut-off to reject the null,z_{0 }= 1.96.

If we were to form a “likelihood ratio” of μ = 4.96 compared to μ_{0} = 0 using

[Power(T+, 4.96)]/α,

it would be 40. (.999/.025).

It is absurd to say the alternative 4.96 is supported 40 times as much as the null, understanding support as likelihood or comparative likelihood. (The data 1.96 are even closer to 0 than to 4.96). The same point can be made with less extreme cases.) What is commonly done next is to assign priors of .5 to the two hypotheses, yielding

Pr(H_{0} |z_{0}) = 1/ (1 + 40) = .024, so Pr(H_{1} |z_{0}) = .976.

Such an inference is highly unwarranted and would almost always be wrong. Continue reading

## 2015 Saturday Night Brainstorming and Task Forces: (4th draft)

*Saturday Night Brainstorming: The TFSI on NHST–part reblog from here and here, with a substantial 2015 update!*

*Each year leaders of the movement to “reform” statistical methodology in psychology, social science, and other areas of applied statistics get together around this time for a brainstorming session. They review the latest from the Task Force on Statistical Inference (TFSI), propose new regulations they would like to see adopted, not just by the APA publication manual any more, but all science journals! ***Since it’s Saturday night, let’s listen in on part of an (imaginary) brainstorming session of the New Reformers.**

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

*Frustrated that the TFSI has still not banned null hypothesis significance testing (NHST)–a fallacious version of statistical significance tests that dares to violate Fisher’s first rule: It’s illicit to move directly from statistical to substantive effects–**the New Reformers have created, and **very successfully* published in, new meta-level research paradigms designed expressly to study (statistically!) a central question: have the carrots and sticks of reward and punishment been successful in decreasing the use of NHST, and promoting instead use of confidence intervals, power calculations, and meta-analysis of effect sizes? **Or not? **

*Most recently, the group has helped successfully launch a variety of “replication and reproducibility projects”. Having discovered how much the reward structure encourages bad statistics and gaming the system, they have cleverly pushed to change the reward structure: Failed replications (from a group chosen by a crowd-sourced band of replicationistas ) would not be hidden in those dusty old file drawers, but would be guaranteed to be published without that long, drawn out process of peer review. Do these failed replications indicate the original study was a false positive? or that the replication attempt is a false negative? It’s hard to say. *

*This year, as is typical, there is a new member who is pitching in to contribute what he hopes are novel ideas for reforming statistical practice. In addition, for the first time, there is a science reporter blogging the meeting for her next free lance “bad statistics” piece for a high impact science journal. Notice, it seems this committee only grows, no one has dropped off, in the 3 years I’ve followed them. *

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

** Pawl**: This meeting will come to order. I am pleased to welcome our new member, Dr. Ian Nydes, adding to the medical strength we have recently built with epidemiologist S.C.. In addition, we have a science writer with us today, Jenina Oozo. To familiarize everyone, we begin with a review of old business, and gradually turn to new business.

** Franz**: It’s so darn frustrating after all these years to see researchers still using NHST methods; some of the newer modeling techniques routinely build on numerous applications of those pesky tests.

** Jake**: And the premier publication outlets in the social sciences still haven’t mandated the severe reforms sorely needed. Hopefully the new blood, Dr. Ian Nydes, can help us go beyond resurrecting the failed attempts of the past. Continue reading

## Some statistical dirty laundry

*It’s an apt time to reblog the “statistical dirty laundry” post from 2013 here. I hope we can take up the recommendations from Simmons, Nelson and Simonsohn at the end (Note [5]), which we didn’t last time around.*

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

*I finally had a chance to fully read the 2012 Tilberg Report* on “Flawed Science” last night. Here are some stray thoughts…*

*1. Slipping into pseudoscience.*

The authors of the Report say they never anticipated giving a laundry list of “undesirable conduct” by which researchers can flout pretty obvious requirements for the responsible practice of science. It was an accidental byproduct of the investigation of one case (Diederik Stapel, social psychology) that they walked into a culture of “verification bias”[1]. Maybe that’s why I find it so telling. It’s as if they could scarcely believe their ears when people they interviewed “defended the serious and less serious violations of proper scientific method with the words: that is what I have learned in practice; everyone in my research environment does the same, and so does everyone we talk to at international conferences” (Report 48). So they trot out some obvious rules, and it seems to me that they do a rather good job:

One of the most fundamental rules of scientific research is that an investigation must be designed in such a way that facts that might refute the research hypotheses are given at least an equal chance of emerging as do facts that confirm the research hypotheses. Violations of this fundamental rule, such as continuing an experiment until it works as desired, or excluding unwelcome experimental subjects or results, inevitably tends to confirm the researcher’s research hypotheses, and essentially render the hypotheses immune to the facts

…. [T]he use of research procedures in such a way as to ‘repress’ negative results by some means” may be called verification bias. [my emphasis] (Report, 48).

I would place techniques for ‘verification bias’ under the general umbrella of techniques for squelching stringent criticism and repressing severe tests. These gambits make it so easy to find apparent support for one’s pet theory or hypotheses, as to count as no evidence at all (see some from their list ). Any field that regularly proceeds this way I would call a pseudoscience, or non-science, following Popper. “Observations or experiments can be accepted as supporting a theory (or a hypothesis, or a scientific assertion) only if these observations or experiments are severe tests of the theory” (Popper 1994, p. 89). [2] It is unclear at what point a field slips into the pseudoscience realm.

*2. A role for philosophy of science?*

I am intrigued that one of the final recommendations in the Report is this: Continue reading

## No headache power (for Deirdre)

Deirdre McCloskey’s comment leads me to try to give a “no headache” treatment of some key points about the **power of a statistical test**. (Trigger warning: formal stat people may dislike the informality of my exercise.)

We all know that for a given test, **as the probability of a type 1 error goes down the probability of a type 2 error goes up (and power goes down**).

And **as the probability of a type 2 error goes down (and power goes up), the probability of a type 1 error goes up.** Leaving everything else the same. There’s a **trade-off** between the two error probabilities.(No free lunch.) No headache powder called for.

So if someone said, as the power increases, the probability of a type 1 error *decreases*, they’d be saying: **As the type 2 error decreases, the probability of a type 1 error decreases!** **That’s the opposite of a trade-off.** So you’d know automatically they’d made a mistake or were defining things in a way that differs from standard NP statistical tests.

Before turning to my little exercise, I note that power is defined in terms of a test’s cut-off for rejecting the null, whereas a severity assessment always considers the actual value observed (attained power). Here I’m just trying to clarify *regular old* power, as defined in a N-P test.

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Let’s use a familiar oversimple example to fix the trade-off in our minds so that it cannot be dislodged. Our old friend, test T+ : We’re testing the mean of a Normal distribution with *n* iid samples, and (for simplicity) known, fixed σ:

H_{0}: µ ≤ _{ }0 against H_{1}: µ > _{ }0

Let **σ = 2**, *n* = 25, so (σ/ √*n*) = .4. To avoid those annoying X-bars, I will use **M for the sample mean**. I will abbreviate (σ/ √*n*) as σ_{x .}

**Test T+**is a rule: reject H_{0 }iff M > m***Power of a test T+**is computed in relation to values of µ >_{ }0.- The
**power**of T+ against alternative µ =µ_{1 }= Pr(T+ rejects H_{0};µ = µ_{1}) = Pr(M > m*; µ = µ_{1})

We may abbreviate this as : POW(T+,α, µ = µ_{1}) Continue reading

## A biased report of the probability of a statistical fluke: Is it cheating?

One year ago I reblogged a post from Matt Strassler, “Nature is Full of Surprises” (2011). In it he claims that

[

Statistical debate] “often boils down to this:is the question that you have asked in applying your statistical method the most even-handed, the most open-minded, the most unbiased question that you could possibly ask?

It’s not asking whether someone made a mathematical mistake.It is asking whether they cheated— whether they adjusted the rules unfairly — and biased the answer through the question they chose…”

(Nov. 2014):I am impressed (i.e., struck by the fact) that he goes so far as to call it “cheating”. Anyway, here is the rest of the reblog from Strassler which bears on a number of recent discussions:

“…If there are 23 people in a room, the chance that two of them have the same birthday is 50 percent, while the chance that two of them were born on a particular day, say, January 1st, is quite low, a small fraction of a percent. The more you specify the coincidence, the rarer it is; the broader the range of coincidences at which you are ready to express surprise, the more likely it is that one will turn up. Continue reading

## A. Spanos: Talking back to the critics using error statistics (Phil6334)

Aris Spanos’ overview of error statistical responses to familiar criticisms of statistical tests. Related reading is Mayo and Spanos (2011)

## Reliability and Reproducibility: Fraudulent p-values through multiple testing (and other biases): S. Stanley Young (Phil 6334: Day#13)

**S. Stanley Young, PhD**

Assistant Director for Bioinformatics

National Institute of Statistical Sciences

Research Triangle Park, NC

Here are Dr. Stanley Young’s slides from our April 25 seminar. They contain several tips for unearthing deception by fraudulent p-value reports. Since it’s Saturday night, you might wish to perform an experiment with three 10-sided dice*,recording the results of 100 rolls (3 at a time) on the form on slide 13. An entry, e.g., (0,1,3) becomes an imaginary p-value of .013 associated with the type of tumor, male-female, old-young. You report only hypotheses whose null is rejected at a “p-value” less than .05. Forward your results to me for publication in a peer-reviewed journal.

*Sets of 10-sided dice will be offered as a palindrome prize beginning in May.

## capitalizing on chance (ii)

I may have been exaggerating one year ago when I started this post with “Hardly a day goes by”, but now it is literally the case*. (This also pertains to reading for Phil6334 for Thurs. March 6):

Hardly a day goes by where I do not come across an article on the problems for statistical inference based on fallaciously capitalizing on chance: high-powered computer searches and “big” data trolling offer rich hunting grounds out of which apparently impressive results may be “cherry-picked”:

When the hypotheses are tested on the same data that suggested them and when tests of significance are based on such data, then a spurious impression of validity may result. The computed level of significance may have almost no relation to the true level. . . . Suppose that twenty sets of differences have been examined, that one difference seems large enough to test and that this difference turns out to be “significant at the 5 percent level.” Does this mean that differences as large as the one tested would occur by chance only 5 percent of the time when the true difference is zero? The answer is

no,because the difference tested has beenselectedfrom the twenty differences that were examined. The actual level of significance is not 5 percent, but 64 percent! (Selvin 1970, 104)[1]

…Oh wait -this is from a contributor to Morrison and Henkel way back in 1970! But there is one big contrast, I find, that makes current day reports so much more worrisome: critics of the Morrison and Henkel ilk clearly report that to ignore a variety of “selection effects” results in a fallacious computation of the actual significance level associated with a given inference; clear terminology is used to distinguish the “computed” or “nominal” significance level on the one hand, and the actual or warranted significance level on the other. Continue reading

## “The probability that it be a statistical fluke” [iia]

My rationale for the last post is really just to highlight such passages as:

“Particle physicists have agreed, by convention, not to view an observed phenomenon as a discovery until

the probability that it be a statistical fluke be below 1 in a million, a requirement that seems insanely draconian at first glance.” (Strassler)….

Even before the dust had settled regarding the discovery of a Standard Model-like Higgs particle, the nature and rationale of the 5-sigma discovery criterion began to be challenged. But my interest now is not in the fact that the 5-sigma discovery criterion is a convention, nor with the choice of 5. It is the understanding of “the probability that it be a statistical fluke” that interests me, because if we can get this right, I* think we can understand a kind of equivocation that leads many to suppose that significance tests are being misinterpreted—even when they aren’t!* So given that I’m stuck, unmoving, on this bus outside of London for 2+ hours (because of a car accident)—and the internet works—I’ll try to scratch out my point (expect errors, we’re moving now). Here’s another passage…

“Even when the probability of a particular statistical fluke, of a particular type, in a particular experiment seems to be very small indeed, we must remain cautious. …Is it really unlikely that someone, somewhere, will hit the jackpot, and see in their data an amazing statistical fluke that seems so impossible that it convincingly appears to be a new phenomenon?”

A very sketchy nutshell of the Higgs statistics: There is a general model of the detector, and within that model researchers define a “global signal strength” parameter “such that *H _{0}:* μ = 0 corresponds to the background only hypothesis and μ = 1 corresponds to the Standard Model (SM) Higgs boson signal in addition to the background” (quote from an ATLAS report). The statistical test may be framed as a one-sided test; the test statistic records differences in the positive direction, in standard deviation or sigma units. The interest is not in the point against point hypotheses, but in finding discrepancies from

*H*in the direction of the alternative, and then estimating their values. The improbability of the 5-sigma excess alludes to the sampling Continue reading

_{0}## Probability that it is a statistical fluke [i]

**From another blog:**

“…If there are 23 people in a room, the chance that two of them have the same birthday is 50 percent, while the chance that two of them were born on a particular day, say, January 1st, is quite low, a small fraction of a percent. The more you specify the coincidence, the rarer it is; the broader the range of coincidences at which you are ready to express surprise, the more likely it is that one will turn up.

Humans are notoriously incompetent at estimating these types of probabilities… which is why scientists (including particle physicists), when they see something unusual in their data, always try to quantify *the probability that it is a statistical fluke* — a pure chance event. You would not want to be wrong, and celebrate your future Nobel prize only to receive instead a booby prize. (And nature gives out lots and lots of booby prizes.) So scientists, grabbing their statistics textbooks and appealing to the latest advances in statistical techniques, compute these probabilities as best they can. Armed with these numbers, they then try to infer whether it is likely that they have actually discovered something new or not.

And on the whole, it doesn’t work. Unless the answer is so obvious that no statistical argument is needed, the numbers typically do not settle the question.

Despite this remark, you mustn’t think I am arguing against doing statistics. One has to do something better than guessing. But there is a reason for the old saw: “There are three types of falsehoods: lies, damned lies, and statistics.” It’s not that statistics themselves lie, but that to some extent, unless the case is virtually airtight, you can almost always choose to ask a question in such a way as to get any answer you want. … [For instance, in 1991 the volcano Pinatubo in the Philippines had its titanic eruption while a hurricane (or `typhoon’ as it is called in that region) happened to be underway. Oh, and the collapse of Lehman Brothers on Sept 15, 2008 was followed *within three days* by the breakdown of the Large Hadron Collider (LHC) during its first week of running… Coincidence? I-think-so.] One can draw completely different conclusions, both of them statistically sensible, by looking at the same data from two different points of view, and asking for the statistical answer to two different questions.

To a certain extent, this is just why Republicans and Democrats almost never agree, even if they are discussing the same basic data. The point of a spin-doctor is to figure out which question to ask in order to get the political answer that you wanted in advance. Obviously this kind of manipulation is unacceptable in science. Unfortunately it is also unavoidable. Continue reading

## Bad statistics: crime or free speech (II)? Harkonen update: Phil Stat / Law /Stock

There’s an update (with overview) on the infamous Harkonen case in *Nature* with the dubious title “Uncertainty on Trial“, first discussed in my (11/13/12) post “Bad statistics: Crime or Free speech”, and continued here. The new *Nature* article quotes from Steven Goodman:

“You don’t want to have on the books a conviction for a practice that many scientists do, and in fact think is critical to medical research,” says Steven Goodman, an epidemiologist at Stanford University in California who has filed a brief in support of Harkonen……

Goodman, who was paid by Harkonen to consult on the case, contends that the government’s case is based on faulty reasoning, incorrectly equating an arbitrary threshold of statistical significance with truth. “How high does probability have to be before you’re thrown in jail?” he asks. “This would be a lot like throwing weathermen in jail if they predicted a 40% chance of rain, and it rained.”

I don’t think the case at hand is akin to the exploratory research that Goodman likely has in mind, and the rain analogy seems very far-fetched. (There’s much more to the context, but the links should suffice.) Lawyer Nathan Schachtmen also has an update on his blog today. He and I usually concur, but we largely disagree on this one[i]. I see no new information that would lead me to shift my earlier arguments on the evidential issues. From a Dec. 17, 2012 post on Schachtman (“multiplicity and duplicity”):

So what’s the allegation that the prosecutors are being duplicitous about statistical evidence in the case discussed in my two previous (‘Bad Statistics’) posts? As a non-lawyer, I will ponder only the evidential (and not the criminal) issues involved.

“After the conviction, Dr. Harkonen’s counsel moved for a new trial on grounds of newly discovered evidence. Dr. Harkonen’s counsel hoisted the prosecutors with their own petards, by quoting the government’s amicus brief to the United States Supreme Court in

Matrixx Initiatives Inc. v. Siracusano, 131 S. Ct. 1309 (2011). InMatrixx, the securities fraud plaintiffs contended that they need not plead ‘statistically significant’ evidence for adverse drug effects.” (Schachtman’s part 2, ‘The Duplicity Problem – The Matrixx Motion’)

The Matrixx case is another philstat/law/stock example taken up in this blog here, here, and here. Why are the Harkonen prosecutors “hoisted with their own petards” (a great expression, by the way)? Continue reading