# A biased report of the probability of a statistical fluke: Is it cheating?

One year ago I reblogged a post from Matt Strassler, “Nature is Full of Surprises” (2011). In it he claims that

[Statistical debate] “often boils down to this: is the question that you have asked in applying your statistical method the most even-handed, the most open-minded, the most unbiased question that you could possibly ask?

It’s not asking whether someone made a mathematical mistake. It is asking whether they cheated — whether they adjusted the rules unfairly — and biased the answer through the question they chose…”

(Nov. 2014): I am impressed (i.e., struck by the fact) that he goes so far as to call it “cheating”. Anyway, here is the rest of the reblog from Strassler, which bears on a number of recent discussions:

“…If there are 23 people in a room, the chance that two of them have the same birthday is 50 percent, while the chance that two of them were born on a particular day, say, January 1st, is quite low, a small fraction of a percent. The more you specify the coincidence, the rarer it is; the broader the range of coincidences at which you are ready to express surprise, the more likely it is that one will turn up.

Humans are notoriously incompetent at estimating these types of probabilities… which is why scientists (including particle physicists), when they see something unusual in their data, always try to quantify the probability that it is a statistical fluke — a pure chance event. You would not want to be wrong, and celebrate your future Nobel prize only to receive instead a booby prize. (And nature gives out lots and lots of booby prizes.) So scientists, grabbing their statistics textbooks and appealing to the latest advances in statistical techniques, compute these probabilities as best they can. Armed with these numbers, they then try to infer whether it is likely that they have actually discovered something new or not.

And on the whole, it doesn’t work. Unless the answer is so obvious that no statistical argument is needed, the numbers typically do not settle the question.

Despite this remark, you mustn’t think I am arguing against doing statistics. One has to do something better than guessing. But there is a reason for the old saw: “There are three types of falsehoods: lies, damned lies, and statistics.” It’s not that statistics themselves lie, but that to some extent, unless the case is virtually airtight, you can almost always choose to ask a question in such a way as to get any answer you want. … [For instance, in 1991 the volcano Pinatubo in the Philippines had its titanic eruption while a hurricane (or ‘typhoon’ as it is called in that region) happened to be underway. Oh, and the collapse of Lehman Brothers on Sept 15, 2008 was followed within three days by the breakdown of the Large Hadron Collider (LHC) during its first week of running… Coincidence? I-think-so.] One can draw completely different conclusions, both of them statistically sensible, by looking at the same data from two different points of view, and asking for the statistical answer to two different questions.” (my emphasis)
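The birthday figures quoted above can be checked directly. The sketch below is only an illustration, assuming 365 equally likely birthdays and independent people; the function names are made up for this example and are not from Strassler's post.

```python
from math import prod

DAYS = 365  # assumption: leap years ignored, all birthdays equally likely


def p_any_shared_birthday(n: int) -> float:
    """Probability that at least two of n people share some birthday."""
    # Chance that all n birthdays are distinct, then take the complement.
    p_all_distinct = prod((DAYS - i) / DAYS for i in range(n))
    return 1.0 - p_all_distinct


def p_two_on_given_day(n: int) -> float:
    """Probability that at least two of n people were born on one named day."""
    q = 1.0 / DAYS
    p_none = (1 - q) ** n                       # nobody born on that day
    p_exactly_one = n * q * (1 - q) ** (n - 1)  # exactly one person born on it
    return 1.0 - p_none - p_exactly_one


print(f"any shared birthday, 23 people: {p_any_shared_birthday(23):.3f}")
print(f"two born on January 1st:        {p_two_on_given_day(23):.4f}")
```

Under these assumptions the first probability comes out just above one half and the second a bit under 0.2 percent, which matches the contrast drawn in the quotation: the same room, two different questions, two very different answers.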

“To a certain extent, this is just why Republicans and Democrats almost never agree, even if they are discussing the same basic data. The point of a spin-doctor is to figure out which question to ask in order to get the political answer that you wanted in advance. Obviously this kind of manipulation is unacceptable in science. Unfortunately it is also unavoidable.” (my emphasis; but is it unavoidable?)

“Why? It isn’t just politics. One might expect problems in subjects with a direct political consequence, for example in demographic, medical or psychological studies. But even in these subjects, the problem isn’t merely political — it’s inherent in what is being studied, and how. …And the debate often boils down to this: is the question that you have asked in applying your statistical method the most even-handed, the most open-minded, the most unbiased question that you could possibly ask?

It’s not asking whether someone made a mathematical mistake. It is asking whether they cheated — whether they adjusted the rules unfairly — and biased the answer through the question they chose, in just the way that every Republican and Democratic pollster does.

Inevitably, the scientists proposing intelligent but different possible answers to this question end up not seeing eye-to-eye. They may continue to battle, even in public, because much is at stake. Biasing a scientific result is considered a terrible breach of scientific protocol, and it makes scientists very upset when they believe others are doing it. But it is best if the disputing parties come up with a convention that all subscribe to, even if they don’t like it. Because if each experimenter were to choose his or her own preferred statistical technique, in defiance of others’ views, then it would become virtually impossible to compare the results of two experiments, or combine them into a more powerful result.

Yes, the statistics experts at the two main LHC experiments, ATLAS and CMS, have been having such a debate, which has been quite public at times. Both sides are intelligent and make good points. There’s no right answer. Fortunately, they have reached a suitable truce, so in many cases the results from the two experiments can be compared.” [Remember this was 2011].

“But does the precise choice of question actually matter that much? I personally take the point of view that it really doesn’t. That’s because no one should take a hint of the presence (or absence) of a new phenomenon too seriously until it becomes so obvious that we can’t possibly argue about it anymore.   If intelligent, sensible people can have a serious argument about whether a strange sequence of events could be a coincidence, then there’s no way to settle the argument except to learn more.

While my point of view is not shared explicitly by most of my colleagues, I would argue that this viewpoint is embedded in our culture. Particle physicists have agreed, by convention, not to view an observed phenomenon as a discovery until the probability that it be a statistical fluke be below 1 in a million, a requirement that seems insanely draconian at first glance. …”
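For readers who want to relate the “1 in a million” figure to a sigma level, here is a rough sketch. It assumes the convention in question is the 5 sigma standard (mentioned in the comments below), a Gaussian test statistic, and a one-sided tail; none of these details are stated in the quoted passage itself, and the function name is hypothetical.

```python
from math import erfc, sqrt


def one_sided_p(z: float) -> float:
    """Upper-tail probability of a standard normal beyond z sigma."""
    # P(Z > z) = 0.5 * erfc(z / sqrt(2)) for a standard normal Z.
    return 0.5 * erfc(z / sqrt(2.0))


for z in (2.0, 3.0, 5.0):
    p = one_sided_p(z)
    print(f"{z:.0f} sigma: p = {p:.2e} (about 1 in {1 / p:,.0f})")
```

On these assumptions a 5 sigma excess corresponds to a tail probability of roughly 1 in 3.5 million, so “1 in a million” is a round-number description of the same order of magnitude.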

(Nov. 2014): I had taken it that the point not shared by most of his colleagues is requiring that hints of a phenomenon not be taken seriously until it becomes “so obvious that we can’t possibly argue about it anymore.” But now I wonder if it isn’t the subsequent point about requiring a draconian convention. Or perhaps they are the same.

“Even when the probability of a particular statistical fluke, of a particular type, in a particular experiment seems to be very small indeed, we must remain cautious. There are hundreds of different types of experiments going on, collecting millions of data points and looking at the data in thousands of different ways.  Is it really unlikely that someone, somewhere, will hit the jackpot, and see in their data an amazing statistical fluke that seems so impossible that it convincingly appears to be a new phenomenon?  The probability of it happening depends on how many different experiments we include in calculating the probability, just as the probability of a 2011 New York hurriquake depends on whether we include other years, other cities, and other types of disasters.

This is sometimes called the “look-elsewhere effect”; how many other places did you look before you found something that seems exceptional?  It explains how sometimes a seemingly “impossible” coincidence is, despite appearances, perfectly possible. And for the scientist whose earth-shattering “discovery” blows away in the statistical winds, it is the cause of a deeply personal form of natural disaster.”
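The arithmetic behind the look-elsewhere effect can be sketched in a few lines. This is a deliberately simplified illustration: it assumes every “look” is independent and has the same local fluke probability, which real analyses generally do not satisfy, and the numbers are invented for the example.

```python
def p_at_least_one_fluke(p_local: float, n_looks: int) -> float:
    """Chance that at least one of n independent looks shows a fluke.

    p_local is the probability of a fluke in any single look
    (assumed equal and independent across looks).
    """
    return 1.0 - (1.0 - p_local) ** n_looks


# Hypothetical local fluke probability of 1 in 1000 per look.
for n in (1, 10, 100, 1000):
    print(f"{n:>5} looks: {p_at_least_one_fluke(0.001, n):.3f}")
```

With a 1-in-1000 fluke probability per look, a thousand looks give about a 63 percent chance that something “exceptional” turns up somewhere. The reported probability therefore depends on how many looks are counted, which is exactly the choice of question Strassler is worried about.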

The above is taken from Matt Strassler, “Nature is Full of Surprises”. My main rationale for this rereblog is the general discussion of “flukes” in [part ii], and our recent PSA symposium here and here.


### 7 thoughts on “A biased report of the probability of a statistical fluke: Is it cheating?”

1. john byrd

It seems that he has expressed a similar argument to that of DJ Hand in The Improbability Principle. I wonder if the 5 sigma standard reflects partly a desire to protect against statistical flukes, but also against the myriad ways that the experiment can fail to be sensitive enough to measure the real effect?

• John: yes and the multiple testing business.

2. West

An otherwise very intelligent individual writing a lot of dumb about politics and statistics. Not sure where to start…

• I think the analogy between the use of leading and biased questions in the two arenas is apt.

3. Do you know what I really find interesting about this? (Put aside issues of philosophical nitpicking, of which I can list several.)

“scientists (including particle physicists), when they see something unusual in their data, always try to quantify the probability that it is a statistical fluke — a pure chance event. You would not want to be wrong, …. only to receive instead a booby prize.”

You can be wrong in your quantification of the probability that it is a statistical fluke.

Compare that to being wrong about your posterior probability in not-H0.

Note: You will be able to find out (just as Fisher instructed) that you erred in your quantification of the probability that the result is a statistical fluke, i.e., a spuriously significant result, by discovering you cannot “rarely fail to bring about” such a statistically significant effect (to paraphrase Fisher). Instead, you will often fail to replicate the statistically significant result. (Or, equivalently, you will too rarely succeed in bringing it about for it to count as a genuine experimental effect, Fisher would say.)

All these probabilities concern the sampling distribution. There is no illicit assignment of probability to the null or its denial (unlike what the P-value police allege). Nor is it just that the P-Police failed to look as carefully as they might have at how the “probable fluke” term was being used in HEP physics. I think it is a deep and subtle philosophical issue about the nature of statistical tests and statistical inference. One that’s not so easy to clarify.

• West

The first thing that bothers me is the vague character of the phrase “see something unusual in the data.” Does this mean there is a strong null, where deviations induce a “WTF is that” sort of response, or that one expects a signal of unknown character buried within the noise?

I am going to presume he meant the latter, as it more closely resembles the problem of the Higgs search. So now it’s a question of whether the signal+background model, with all its implied uncertainties, matches the data better than the background-only model. Computing the latter reasonably, which I expect is what he meant by “the probability that it is a statistical fluke”, is important but not sufficient on its own.

The “you would not want to be wrong because you would look silly” line is so condescendingly simplistic that I imagine he would have some choice and colorful words to say if I made analogously flippant comments, say about the methods of his own research.

• West

Mayo: The important bit is the 2+ paragraphs that follow your quotation, which basically amount to “I mean no disrespect, but the work of your discipline (statistics) is a pile of useless garbage.” And at that point, I throw up my hands and say goodnight, ’cause this conversation is going nowhere. Why take the pronouncements of someone seriously when they themselves don’t care to take the work of practitioners seriously?

It’s a waste of time and best not to go looking for allies among this sort.