Aris Spanos’ overview of error statistical responses to familiar criticisms of statistical tests. Related reading is Mayo and Spanos (2011)

# statistical tests

## Reliability and Reproducibility: Fraudulent p-values through multiple testing (and other biases): S. Stanley Young (Phil 6334: Day#13)

**S. Stanley Young, PhD**

Assistant Director for Bioinformatics

National Institute of Statistical Sciences

Research Triangle Park, NC

Here are Dr. Stanley Young’s slides from our April 25 seminar. They contain several tips for unearthing deception by fraudulent p-value reports. Since it’s Saturday night, you might wish to perform an experiment with three 10-sided dice*, recording the results of 100 rolls (3 at a time) on the form on slide 13. An entry, e.g., (0,1,3) becomes an imaginary p-value of .013 associated with the type of tumor, male-female, old-young. You report only hypotheses whose null is rejected at a “p-value” less than .05. Forward your results to me for publication in a peer-reviewed journal.

*Sets of 10-sided dice will be offered as a palindrome prize beginning in May.
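For readers without dice to hand, here is a rough sketch of the experiment in Python. It assumes fair, independent dice, and the function names (`roll_p_values`, `reportable`, `chance_of_at_least_one`) are made up for illustration; they are not from Dr. Young’s slides.

```python
import random


def roll_p_values(n_rolls=100, seed=None):
    """Roll three 10-sided dice n_rolls times; (0, 1, 3) becomes the 'p-value' .013."""
    rng = random.Random(seed)
    p_values = []
    for _ in range(n_rolls):
        d1, d2, d3 = (rng.randrange(10) for _ in range(3))
        p_values.append((100 * d1 + 10 * d2 + d3) / 1000)
    return p_values


def reportable(p_values, alpha=0.05):
    """Keep only the 'findings' that would be reported: p-values below alpha."""
    return [p for p in p_values if p < alpha]


def chance_of_at_least_one(n_rolls=100, alpha=0.05):
    """Probability that at least one of n_rolls pure-noise p-values falls below alpha."""
    return 1 - (1 - alpha) ** n_rolls


if __name__ == "__main__":
    ps = roll_p_values(seed=13)  # the seed is arbitrary, only for repeatability
    hits = reportable(ps)
    print(f"{len(hits)} of {len(ps)} null 'hypotheses' rejected at p < .05")
    print(f"chance of at least one such 'finding': {chance_of_at_least_one():.3f}")
```

With 100 rolls, about 5 “significant” results are expected from chance alone, and getting none at all would be the surprising outcome.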

## capitalizing on chance (ii)

I may have been exaggerating one year ago when I started this post with “Hardly a day goes by”, but now it is literally the case*. (This also pertains to reading for Phil6334 for Thurs. March 6):

Hardly a day goes by where I do not come across an article on the problems for statistical inference based on fallaciously capitalizing on chance: high-powered computer searches and “big” data trolling offer rich hunting grounds out of which apparently impressive results may be “cherry-picked”:

When the hypotheses are tested on the same data that suggested them and when tests of significance are based on such data, then a spurious impression of validity may result. The computed level of significance may have almost no relation to the true level. . . . Suppose that twenty sets of differences have been examined, that one difference seems large enough to test and that this difference turns out to be “significant at the 5 percent level.” Does this mean that differences as large as the one tested would occur by chance only 5 percent of the time when the true difference is zero? The answer is *no*, because the difference tested has been *selected* from the twenty differences that were examined. The actual level of significance is not 5 percent, but 64 percent! (Selvin 1970, 104)[1]
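Selvin’s 64 percent can be checked in a line or two. The sketch below assumes the twenty differences are independent and each is tested at the nominal 5 percent level; the function name is my own.

```python
def actual_significance_level(nominal_alpha, n_looks):
    """Chance that at least one of n_looks independent true-null tests
    comes out 'significant' at nominal_alpha."""
    return 1 - (1 - nominal_alpha) ** n_looks


print(round(actual_significance_level(0.05, 20), 2))  # 0.64
```

If the differences are correlated the figure will differ, but the qualitative point stands: the nominal level is not the actual one once selection enters.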

…Oh wait, this is from a contributor to Morrison and Henkel way back in 1970! But there is one big contrast, I find, that makes current day reports so much more worrisome: critics of the Morrison and Henkel ilk clearly report that to ignore a variety of “selection effects” results in a fallacious computation of the actual significance level associated with a given inference; clear terminology is used to distinguish the “computed” or “nominal” significance level on the one hand, and the actual or warranted significance level on the other. Continue reading

## “The probability that it be a statistical fluke” [iia]

My rationale for the last post is really just to highlight such passages as:

“Particle physicists have agreed, by convention, not to view an observed phenomenon as a discovery until *the probability that it be a statistical fluke* be below 1 in a million, a requirement that seems insanely draconian at first glance.” (Strassler)….

Even before the dust had settled regarding the discovery of a Standard Model-like Higgs particle, the nature and rationale of the 5-sigma discovery criterion began to be challenged. But my interest now is not in the fact that the 5-sigma discovery criterion is a convention, nor with the choice of 5. It is the understanding of “the probability that it be a statistical fluke” that interests me, because if we can get this right, *I think we can understand a kind of equivocation that leads many to suppose that significance tests are being misinterpreted—even when they aren’t!* So given that I’m stuck, unmoving, on this bus outside of London for 2+ hours (because of a car accident)—and the internet works—I’ll try to scratch out my point (expect errors, we’re moving now). Here’s another passage…

“Even when the probability of a particular statistical fluke, of a particular type, in a particular experiment seems to be very small indeed, we must remain cautious. …Is it really unlikely that someone, somewhere, will hit the jackpot, and see in their data an amazing statistical fluke that seems so impossible that it convincingly appears to be a new phenomenon?”

A very sketchy nutshell of the Higgs statistics: There is a general model of the detector, and within that model researchers define a “global signal strength” parameter “such that *H*_{0}: μ = 0 corresponds to the background only hypothesis and μ = 1 corresponds to the Standard Model (SM) Higgs boson signal in addition to the background” (quote from an ATLAS report). The statistical test may be framed as a one-sided test; the test statistic records differences in the positive direction, in standard deviation or sigma units. The interest is not in the point against point hypotheses, but in finding discrepancies from *H*_{0} in the direction of the alternative, and then estimating their values. The improbability of the 5-sigma excess alludes to the sampling Continue reading

## Probability that it is a statistical fluke [i]

**From another blog:**

“…If there are 23 people in a room, the chance that two of them have the same birthday is 50 percent, while the chance that two of them were born on a particular day, say, January 1st, is quite low, a small fraction of a percent. The more you specify the coincidence, the rarer it is; the broader the range of coincidences at which you are ready to express surprise, the more likely it is that one will turn up.

Humans are notoriously incompetent at estimating these types of probabilities… which is why scientists (including particle physicists), when they see something unusual in their data, always try to quantify *the probability that it is a statistical fluke* — a pure chance event. You would not want to be wrong, and celebrate your future Nobel prize only to receive instead a booby prize. (And nature gives out lots and lots of booby prizes.) So scientists, grabbing their statistics textbooks and appealing to the latest advances in statistical techniques, compute these probabilities as best they can. Armed with these numbers, they then try to infer whether it is likely that they have actually discovered something new or not.

And on the whole, it doesn’t work. Unless the answer is so obvious that no statistical argument is needed, the numbers typically do not settle the question.

Despite this remark, you mustn’t think I am arguing against doing statistics. One has to do something better than guessing. But there is a reason for the old saw: “There are three types of falsehoods: lies, damned lies, and statistics.” It’s not that statistics themselves lie, but that to some extent, unless the case is virtually airtight, you can almost always choose to ask a question in such a way as to get any answer you want. … [For instance, in 1991 the volcano Pinatubo in the Philippines had its titanic eruption while a hurricane (or ‘typhoon’ as it is called in that region) happened to be underway. Oh, and the collapse of Lehman Brothers on Sept 15, 2008 was followed *within three days* by the breakdown of the Large Hadron Collider (LHC) during its first week of running… Coincidence? I-think-so.] One can draw completely different conclusions, both of them statistically sensible, by looking at the same data from two different points of view, and asking for the statistical answer to two different questions.

To a certain extent, this is just why Republicans and Democrats almost never agree, even if they are discussing the same basic data. The point of a spin-doctor is to figure out which question to ask in order to get the political answer that you wanted in advance. Obviously this kind of manipulation is unacceptable in science. Unfortunately it is also unavoidable. Continue reading
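The birthday figures at the start of the quoted passage are easy to verify. This is a minimal sketch assuming 365 equally likely birthdays, and reading “two of them were born on a particular day” as “at least two of the 23”; both function names are hypothetical.

```python
from math import prod


def p_shared_birthday(n, days=365):
    """Probability that at least two of n people share some birthday."""
    return 1 - prod((days - i) / days for i in range(n))


def p_two_on_given_day(n, days=365):
    """Probability that at least two of n people were born on one pre-specified day."""
    q = 1 / days
    p_none = (1 - q) ** n
    p_exactly_one = n * q * (1 - q) ** (n - 1)
    return 1 - p_none - p_exactly_one


print(f"any shared birthday, 23 people: {p_shared_birthday(23):.3f}")
print(f"two born on January 1st, 23 people: {p_two_on_given_day(23):.4f}")
```

The first comes out just over one half and the second at roughly two tenths of a percent, in line with the passage: the more the coincidence is specified in advance, the rarer it is.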

## Bad statistics: crime or free speech (II)? Harkonen update: Phil Stat / Law / Stock

There’s an update (with overview) on the infamous Harkonen case in *Nature* with the dubious title “Uncertainty on Trial”, first discussed in my (11/13/12) post “Bad statistics: Crime or Free speech”, and continued here. The new *Nature* article quotes from Steven Goodman:

“You don’t want to have on the books a conviction for a practice that many scientists do, and in fact think is critical to medical research,” says Steven Goodman, an epidemiologist at Stanford University in California who has filed a brief in support of Harkonen…

Goodman, who was paid by Harkonen to consult on the case, contends that the government’s case is based on faulty reasoning, incorrectly equating an arbitrary threshold of statistical significance with truth. “How high does probability have to be before you’re thrown in jail?” he asks. “This would be a lot like throwing weathermen in jail if they predicted a 40% chance of rain, and it rained.”

I don’t think the case at hand is akin to the exploratory research that Goodman likely has in mind, and the rain analogy seems very far-fetched. (There’s much more to the context, but the links should suffice.) Lawyer Nathan Schachtman also has an update on his blog today. He and I usually concur, but we largely disagree on this one[i]. I see no new information that would lead me to shift my earlier arguments on the evidential issues. From a Dec. 17, 2012 post on Schachtman (“multiplicity and duplicity”):

So what’s the allegation that the prosecutors are being duplicitous about statistical evidence in the case discussed in my two previous (‘Bad Statistics’) posts? As a non-lawyer, I will ponder only the evidential (and not the criminal) issues involved.

“After the conviction, Dr. Harkonen’s counsel moved for a new trial on grounds of newly discovered evidence. Dr. Harkonen’s counsel hoisted the prosecutors with their own petards, by quoting the government’s amicus brief to the United States Supreme Court in *Matrixx Initiatives Inc. v. Siracusano*, 131 S. Ct. 1309 (2011). In *Matrixx*, the securities fraud plaintiffs contended that they need not plead ‘statistically significant’ evidence for adverse drug effects.” (Schachtman’s part 2, ‘The Duplicity Problem – The Matrixx Motion’)

The Matrixx case is another philstat/law/stock example taken up in this blog here, here, and here. Why are the Harkonen prosecutors “hoisted with their own petards” (a great expression, by the way)? Continue reading

## Phil/Stat/Law: 50 Shades of gray between error and fraud

An update on the Diederik Stapel case: July 2, 2013, *The Scientist*, “Dutch Fraudster Scientist Avoids Jail”.

Two years after being exposed by colleagues for making up data in at least 30 published journal articles, former Tilburg University professor Diederik Stapel will avoid a trial for fraud. Once one of the Netherlands’ leading social psychologists, Stapel has agreed to a pre-trial settlement with Dutch prosecutors to perform 120 hours of community service.

According to Dutch newspaper *NRC Handelsblad*, the Dutch Organization for Scientific Research awarded Stapel $2.8 million in grants for research that was ultimately tarnished by misconduct. However, the Dutch Public Prosecution Service and the Fiscal Information and Investigation Service said on Friday (June 28) that because Stapel used the grant money for student and staff salaries to perform research, he had not misused public funds. …In addition to the community service he will perform, Stapel has agreed not to make a claim on 18 months’ worth of illness and disability compensation that he was due under his terms of employment with Tilburg University. Stapel also voluntarily returned his doctorate from the University of Amsterdam and, according to *Retraction Watch*, retracted 53 of the more than 150 papers he has co-authored.

“I very much regret the mistakes I have made,” Stapel told *ScienceInsider*. “I am happy for my colleagues as well as for my family that with this settlement, a court case has been avoided.”

No surprise he’s not doing jail time, but 120 hours of community service? After over a decade of fraud, and tainting 14 of 21 of the PhD theses he supervised? Perhaps the “community service” should be to actually run the experiments he had designed? What about his innocence of misusing public funds? Continue reading

## Some statistical dirty laundry

*I finally had a chance to fully read the 2012 Tilburg Report on “Flawed Science” last night. The full report is now here. Here are some stray thoughts…*

*1. Slipping into pseudoscience.*

The authors of the Report say they never anticipated giving a laundry list of “undesirable conduct” by which researchers can flout pretty obvious requirements for the responsible practice of science. It was an accidental byproduct of the investigation of one case (Diederik Stapel, social psychology) that they walked into a culture of “verification bias”[1]. Maybe that’s why I find it so telling. It’s as if they could scarcely believe their ears when people they interviewed “defended the serious and less serious violations of proper scientific method with the words: that is what I have learned in practice; everyone in my research environment does the same, and so does everyone we talk to at international conferences” (Report 48). So they trot out some obvious rules, and it seems to me that they do a rather good job.

One of the most fundamental rules of scientific research is that an investigation must be designed in such a way that facts that might refute the research hypotheses are given at least an equal chance of emerging as do facts that confirm the research hypotheses. Violations of this fundamental rule, such as continuing an experiment until it works as desired, or excluding unwelcome experimental subjects or results, inevitably tends to confirm the researcher’s research hypotheses, and essentially render the hypotheses immune to the facts…. [T]he use of research procedures in such a way as to ‘repress’ negative results by some means may be called verification bias. [my emphasis] (Report, 48).

I would place techniques for ‘verification bias’ under the general umbrella of techniques for squelching stringent criticism and repressing severe tests. These gambits make it so easy to find apparent support for one’s pet theory or hypotheses, as to count as no evidence at all (see some from their list). Any field that regularly proceeds this way I would call a pseudoscience, or non-science, following Popper. “Observations or experiments can be accepted as supporting a theory (or a hypothesis, or a scientific assertion) only if these observations or experiments are severe tests of the theory” (Popper 1994, p. 89). [2] It is unclear at what point a field slips into the pseudoscience realm.

*2. A role for philosophy of science?*

I am intrigued that one of the final recommendations in the Report is this:

In the training program for PhD students, the relevant *basic principles of philosophy of science, methodology, ethics and statistics that enable the responsible practice of science* must be covered. Based on these insights, research Master’s students and PhD students must receive practical training from their supervisors in the application of the rules governing proper and honest scientific research, which should include examples of such undesirable conduct as data massage. The Graduate School must explicitly ensure that this is implemented.

A philosophy department could well create an entire core specialization that revolved around “the relevant basic principles of philosophy of science, methodology, ethics and statistics that enable the responsible practice of science” (ideally linked with one or more other departments). That would be both innovative and fill an important gap, it seems to me. Is anyone doing this?

*3. Hanging out some statistical dirty laundry.*

Items in their laundry list include:

- An experiment fails to yield the expected statistically significant results. The experiment is repeated, often with minor changes in the manipulation or other conditions, and the only experiment subsequently reported is the one that did yield the expected results. The article makes no mention of this exploratory method… It should be clear, certainly with the usually modest numbers of experimental subjects, that using experiments in this way can easily lead to an accumulation of chance findings…. Continue reading

## Higgs analysis and statistical flukes (part 2)

Everyone was excited when the Higgs boson results were reported on July 4, 2012 indicating evidence for a Higgs-like particle based on a “5 sigma observed effect”. The observed effect refers to the number of *excess events* of a given type that are “observed” in comparison to the number (or proportion) that would be expected from background alone, and not due to a Higgs particle. This continues my earlier post. This, too, is a rough outsider’s angle on one small aspect of the statistical inferences involved. (Doubtless there will be corrections.) But that, apart from being fascinated by it, is precisely why I have chosen to discuss it: we should be able to employ a general philosophy of inference to get an understanding of what is true about the controversial concepts we purport to illuminate, e.g., significance levels.
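To get a feel for what a “5 sigma” excess amounts to, one can convert sigma units into a one-sided tail area. The sketch below assumes the test statistic is approximately standard Normal under the background-only hypothesis; that is a textbook simplification, not a description of the actual ATLAS computation, and `upper_tail` is a name of my own.

```python
from math import erfc, sqrt


def upper_tail(n_sigma):
    """P(Z >= n_sigma) for a standard Normal Z, i.e. the one-sided tail area."""
    return 0.5 * erfc(n_sigma / sqrt(2))


for k in (2, 3, 5):
    p = upper_tail(k)
    print(f"{k} sigma: tail area {p:.3g} (about 1 in {1 / p:,.0f})")
```

Under that assumption a 5 sigma excess corresponds to a tail area of roughly 1 in 3.5 million. What that probability is a probability of is the philosophical question.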

Following an official report from ATLAS, researchers define a “global signal strength” parameter “such that μ = 0 corresponds to the background only hypothesis and μ = 1 corresponds to the SM Higgs boson signal in addition to the background” (where SM is the Standard Model). The statistical test may be framed as a one-sided test, where the test statistic (which is actually a ratio) records differences in the positive direction, in standard deviation (sigma) units. Reports such as: Continue reading

## Saturday Night Brainstorming and Task Forces: (2013) TFSI on NHST

Saturday Night Brainstorming: The TFSI on NHST–reblogging with a 2013 update. Please see most recent 2015 update.

*Each year leaders of the movement to reform statistical methodology in psychology, social science and other areas of applied statistics get together around this time for a brainstorming session. They review the latest from the Task Force on Statistical Inference (TFSI), propose new regulations they would like the APA publication manual to adopt, and strategize about how to institutionalize improvements to statistical methodology.*

*While frustrated that the TFSI has still not banned null hypothesis significance testing (NHST), since attempts going back to at least 1996, the reformers have created, and* very successfully *published in, new meta-level research paradigms designed expressly to study (statistically!) a central question: have the carrots and sticks of reward and punishment been successful in decreasing the use of NHST, and promoting instead use of confidence intervals, power calculations, and meta-analysis of effect sizes? Or not?*

*This year there are a couple of new members who are pitching in to contribute what they hope are novel ideas for reforming statistical practice. Since it’s Saturday night, let’s listen in on part of an (imaginary) brainstorming session of the New Reformers. This is a 2013 update of an earlier blogpost.* Continue reading

## Severity Calculator

SEV calculator (with comparisons to p-values, power, CIs)

In the illustration in the Jan. 2 post,

H_{0}: μ ≤ 0 vs H_{1}: μ > 0

and the standard deviation SD = 1, n = 25, so σ_{x} = SD/√n = .2

Setting α to .025, the cut-off for rejection is .39 (can round to .4).

Let the observed mean X = .2, a statistically insignificant result (p value = .16)

SEV(μ < .2) = .5

SEV(μ < .3) = .7

SEV(μ < .4) = .84

SEV(μ < .5) = .93

SEV(μ < .6*) = .975

*rounding
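For those who want to crunch these numbers themselves, here is a minimal sketch. It assumes the sample mean is Normal with standard error SD/√n, and it takes SEV(μ < μ′) for this non-rejection to be the probability of a sample mean larger than the one observed, computed under μ = μ′, a reading that agrees with the values listed above. The function names are hypothetical.

```python
from math import erf, sqrt


def normal_cdf(z):
    """Standard Normal cumulative distribution function."""
    return 0.5 * (1 + erf(z / sqrt(2)))


def p_value(x_bar=0.2, sd=1.0, n=25):
    """One-sided p-value for the observed mean, computed under mu = 0."""
    return 1 - normal_cdf(x_bar / (sd / sqrt(n)))


def sev_less_than(mu_prime, x_bar=0.2, sd=1.0, n=25):
    """SEV(mu < mu_prime): P(sample mean > x_bar) computed under mu = mu_prime."""
    return normal_cdf((mu_prime - x_bar) / (sd / sqrt(n)))


print(f"p-value = {p_value():.2f}")
for mu_prime in (0.2, 0.3, 0.4, 0.5, 0.6):
    print(f"SEV(mu < {mu_prime}) = {sev_less_than(mu_prime):.3f}")
```

The unrounded values come out near .5, .69, .84, .93 and .977, so the .7 and .975 in the list reflect rounding.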

Some students asked about crunching some of the numbers, so here’s a rather rickety old SEV calculator*. It is limited, rather scruffy-looking (nothing like the pretty visuals others post) but it is very useful. It also shows the Normal curves, how shaded areas change with changed hypothetical alternatives, and gives contrasts with confidence intervals. Continue reading

## “Bad statistics”: crime or free speech?

Hunting for “nominally” significant differences, trying different subgroups and multiple endpoints, can result in a much higher probability of erroneously inferring evidence of a risk or benefit than the nominal p-value, even in randomized controlled trials. This was an issue that arose in looking at RCTs in development economics (an area introduced to me by Nancy Cartwright), as at our symposium at the Philosophy of Science Association last month[i][ii]. Reporting the results of hunting and dredging in just the same way as if the relevant claims were predesignated can lead to misleading reports of actual significance levels.[iii]

Still, even if reporting spurious statistical results is considered “bad statistics,” is it criminal behavior? I noticed this issue in Nathan Schachtman’s blog over the past couple of days. The case concerns a biotech company, InterMune, and its previous CEO, Dr. Harkonen. Here’s an excerpt from Schachtman’s discussion (part 1). Continue reading

## Type 1 and 2 errors: Frankenstorm

I escaped (to Virginia) from New York just in the nick of time before the threat of Hurricane Sandy led Bloomberg to completely shut things down (a whole day in advance!) in expectation of the looming “Frankenstorm”. Searching for the latest update on the extent of Sandy’s impacts, I noticed an interesting post on statblogs by Dr. Nic: “Which type of error do you prefer?”. She begins:

Mayor Bloomberg is avoiding a Type 2 error

As I write this, Hurricane Sandy is bearing down on the east coast of the United States. Mayor Bloomberg has ordered evacuations from various parts of New York City. All over the region people are stocking up on food and other essentials and waiting for Sandy to arrive. And if Sandy doesn’t turn out to be the worst storm ever, will people be relieved or disappointed? Either way there is a lot of money involved. And more importantly, risk of human injury and death. Will the forecasters be blamed for over-predicting?

Given that my son’s ability to travel back here is on-hold until planes fly again—not to mention that snow is beginning to swirl outside my window—I definitely hope Bloomberg was erring on the side of caution. However, I think that type 1 and 2 errors should generally be put in terms of the extent and/or direction of errors that are or are not indicated or ruled out by test data. Criticisms of tests very often harp on the dichotomous type 1 and 2 errors, as if a user of tests does not have latitude to infer the extent of discrepancies that are/are not likely. At times, attacks on the “culture of dichotomy” reach fever pitch, and lead some to call for the overthrow of tests altogether (often in favor of confidence intervals), as well as to the creation of task forces seeking to reform if not “ban” statistical tests (which I spoof here). Continue reading