# Probability that it is a statistical fluke [i]

From another blog:
“…If there are 23 people in a room, the chance that two of them have the same birthday is 50 percent, while the chance that two of them were born on a particular day, say, January 1st, is quite low, a small fraction of a percent. The more you specify the coincidence, the rarer it is; the broader the range of coincidences at which you are ready to express surprise, the more likely it is that one will turn up.

Humans are notoriously incompetent at estimating these types of probabilities… which is why scientists (including particle physicists), when they see something unusual in their data, always try to quantify the probability that it is a statistical fluke — a pure chance event. You would not want to be wrong, and celebrate your future Nobel prize only to receive instead a booby prize. (And nature gives out lots and lots of booby prizes.) So scientists, grabbing their statistics textbooks and appealing to the latest advances in statistical techniques, compute these probabilities as best they can. Armed with these numbers, they then try to infer whether it is likely that they have actually discovered something new or not.

And on the whole, it doesn’t work. Unless the answer is so obvious that no statistical argument is needed, the numbers typically do not settle the question.

Despite this remark, you mustn’t think I am arguing against doing statistics. One has to do something better than guessing. But there is a reason for the old saw: “There are three types of falsehoods: lies, damned lies, and statistics.” It’s not that statistics themselves lie, but that to some extent, unless the case is virtually airtight, you can almost always choose to ask a question in such a way as to get any answer you want. … [For instance, in 1991 the volcano Pinatubo in the Philippines had its titanic eruption while a hurricane (or ‘typhoon’ as it is called in that region) happened to be underway. Oh, and the collapse of Lehman Brothers on Sept 15, 2008 was followed within three days by the breakdown of the Large Hadron Collider (LHC) during its first week of running… Coincidence? I-think-so.] One can draw completely different conclusions, both of them statistically sensible, by looking at the same data from two different points of view, and asking for the statistical answer to two different questions.

To a certain extent, this is just why Republicans and Democrats almost never agree, even if they are discussing the same basic data. The point of a spin-doctor is to figure out which question to ask in order to get the political answer that you wanted in advance. Obviously this kind of manipulation is unacceptable in science. Unfortunately it is also unavoidable.

Why? It isn’t just politics. One might expect problems in subjects with a direct political consequence, for example in demographic, medical or psychological studies. But even in these subjects, the problem isn’t merely political — it’s inherent in what is being studied, and how. …And the debate often boils down to this: is the question that you have asked in applying your statistical method the most even-handed, the most open-minded, the most unbiased question that you could possibly ask?

It’s not asking whether someone made a mathematical mistake. It is asking whether they cheated — whether they adjusted the rules unfairly — and biased the answer through the question they chose, in just the way that every Republican and Democratic pollster does.

Inevitably, the scientists proposing intelligent but different possible answers to this question end up not seeing eye-to-eye. They may continue to battle, even in public, because much is at stake. Biasing a scientific result is considered a terrible breach of scientific protocol, and it makes scientists very upset when they believe others are doing it. But it is best if the disputing parties come up with a convention that all subscribe to, even if they don’t like it. Because if each experimenter were to choose his or her own preferred statistical technique, in defiance of others’ views, then it would become virtually impossible to compare the results of two experiments, or combine them into a more powerful result.

Yes, the statistics experts at the two main LHC experiments, ATLAS and CMS, have been having such a debate, which has been quite public at times. Both sides are intelligent and make good points. There’s no right answer.   Fortunately, they have reached a suitable truce, so in many cases the results from the two experiments can be compared.

But does the precise choice of question actually matter that much? I personally take the point of view that it really doesn’t. That’s because no one should take a hint of the presence (or absence) of a new phenomenon too seriously until it becomes so obvious that we can’t possibly argue about it anymore.   If intelligent, sensible people can have a serious argument about whether a strange sequence of events could be a coincidence, then there’s no way to settle the argument except to learn more.

While my point of view is not shared explicitly by most of my colleagues, I would argue that this viewpoint is embedded in our culture. Particle physicists have agreed, by convention, not to view an observed phenomenon as a discovery until the probability that it be a statistical fluke be below 1 in a million, a requirement that seems insanely draconian at first glance. ….

Even when the probability of a particular statistical fluke, of a particular type, in a particular experiment seems to be very small indeed, we must remain cautious. There are hundreds of different types of experiments going on, collecting millions of data points and looking at the data in thousands of different ways.  Is it really unlikely that someone, somewhere, will hit the jackpot, and see in their data an amazing statistical fluke that seems so impossible that it convincingly appears to be a new phenomenon?  The probability of it happening depends on how many different experiments we include in calculating the probability, just as the probability of a 2011 New York hurriquake depends on whether we include other years, other cities, and other types of disasters.

This is sometimes called the “look-elsewhere effect”; how many other places did you look before you found something that seems exceptional?  It explains how sometimes a seemingly “impossible” coincidence is, despite appearances, perfectly possible. And for the scientist whose earth-shattering “discovery” blows away in the statistical winds, it is the cause of a deeply personal form of natural disaster.”

Matt Strassler, “Nature is Full of Surprises”. My main rationale for posting this will be  explained in [draft ii].
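The numbers in the quoted passage are easy to check. The sketch below is only an illustration, not anything from Strassler's post: it assumes 365 equally likely birthdays, and it treats the "looks" in the look-elsewhere calculation as independent, which real analyses generally are not. The function names are made up for this example.

```python
def p_shared_birthday(n, days=365):
    """P(at least two of n people share *some* birthday).

    Assumes birthdays are uniform over `days` and independent.
    """
    p_all_distinct = 1.0
    for k in range(n):
        p_all_distinct *= (days - k) / days
    return 1.0 - p_all_distinct


def p_two_on_given_day(n, days=365):
    """P(at least two of n people were born on one pre-specified day)."""
    p = 1.0 / days
    p_none = (1.0 - p) ** n
    p_exactly_one = n * p * (1.0 - p) ** (n - 1)
    return 1.0 - p_none - p_exactly_one


def p_fluke_somewhere(p_single, n_looks):
    """P(at least one fluke in n_looks looks, each with chance p_single).

    Assumes the looks are independent (a simplifying assumption).
    """
    return 1.0 - (1.0 - p_single) ** n_looks


if __name__ == "__main__":
    print(p_shared_birthday(23))         # just over one half
    print(p_two_on_given_day(23))        # a small fraction of a percent
    print(p_fluke_somewhere(1e-6, 1000)) # a rare fluke becomes less rare
```

The first two functions show how much the answer changes between "any shared day" and "this particular day". The third shows the same effect for experiments: the more places one looks, the more likely it is that a fluke turns up somewhere.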

### 22 thoughts on “Probability that it is a statistical fluke [i]”

1. Michael Lew

Very interesting post.

Surely what the physicists need to do is to distinguish between preliminary exploratory investigations and experiments that are designed to evaluate the questions that turn up as interesting in those explorations. The preliminary explorations will always be vulnerable to the ‘look elsewhere’ effect, but the designed experiments will not.

Frequentist methods will always be vulnerable to problematically high false positive error rates when large datasets are interrogated for information about lots of hypotheses. Presumably a Bayesian approach with a prior that takes the speculative nature of the exploration into account would be better.

• john byrd

Not sure “high false positive error rates” is an appropriate way to view results from frequentist tests. It presumes the researcher is naive about how to assess p-values when many tests are performed. This is not consistent with error-statistical reasoning. A low p-value amongst numerous test results is not necessarily a “positive.” What is the probability I will be fooled if I view each p < 0.05 as a positive when I performed many tests?

• vl

A low p-value amongst a single test result is not necessarily a positive either. Even if a researcher only looked at one result, you have the same problem when you look at the field as a whole because there’s a large number of researchers.

One of the unfortunate consequences of emphasizing p-values is that it has led to a focus on multiple testing rather than the ubiquitous problem of biased “truly significant” results. The former has easy p-value “corrections” (Bonferroni, FDR, which many applied researchers understand as black boxes for giving the “correct” answer), while the latter doesn’t.

• Right.

• john byrd

Fisher gave a clear warning about over-reliance on a singular test result.

• Michael Lew

John

Well, if a method is Frequentist then it is hard to avoid categorising the potential results into errors. I am personally in the camp that sees P-values as indices of evidence and I dislike the practice of dividing them into categories of small enough to claim and not small enough. However, the problem of ‘look elsewhere’ is usually discussed in that context.

You might like to see my recently arXived paper (http://arxiv.org/abs/1311.0081) that shows the intimate relationships between P-values and likelihood functions and how Fisherian significance tests support estimation.

The point relevant here is that while the evidential meaning of the P-value is not affected by multiple comparisons or by stopping rules, the response of an experimenter to the evidence can properly be affected by the context of the evidence.

• No problem with moving from p-values to corresponding estimations (as with a severity assessment) but to say the “evidential meaning” of a p-value is unaffected, is to use p-values as evidential-relation (E-R), or mere “fit” measures, taking away their error statistical function which is essential to employing them as (just one) way to ascertain the capability of a method to distinguish and rule out errors. Here’s a paper of relevance:

Mayo, D. G. and Cox, D. R. (2010). “Frequentist Statistics as a Theory of Inductive Inference,” in Error and Inference. http://www.phil.vt.edu/dmayo/personal_website/Ch%207%20mayo%20&%20cox.pdf

• john byrd

I have read the paper, thanks. Very interesting. I would agree that the response of the researcher to more than the p-value is what is important. That is why I would not presume that the researcher would be naive in interpreting the p-values, and should not suffer higher false positives as a necessary condition of the experiment.

• West

Strassler’s post, which was written almost a year before the Higgs discovery announcement, deals solely with the detection problem. So discussing targeted parameter estimation techniques is beside the point when the particle’s existence anywhere in the mass space had yet to be confirmed.

And when it was found with high confidence, everyone went hooray and then moved on to trying to measure the parameters of the elusive beast. One inadvertently does some estimation in the detection effort, because you find the excursion from the null in a particular mass bin. But for the most part, “physicists” realize that detection and estimation are different problems.

• West: I don’t see anything that disagrees with my points at all. And incidentally, I often find that one learns the most from tracing what scientists say BEFORE an episode is somewhat settled, and this is no exception. As I understand it, please correct me if I’m wrong, it’s partly because the mass hadn’t yet been pinpointed that the “look elsewhere effect” was taken into account in requiring 5 sigma. One might wish to see testing and estimation as distinct, (I won’t stop to qualify my position on this here). I did intend that the estimation concerned various other properties/parameters (I was also following a recent paper by Robert Cousins, which I’ll describe later on).

• West

@Mayo

I was principally replying to the first half of Michael Lew’s first comment about the need for physicists to differentiate between detection and estimation problems. I realize this wasn’t at all clear from what I wrote. The defensiveness was probably unwarranted, but I wanted to make it clear physicists as a whole aren’t unsophisticated rubes when it comes to inference.

You are quite right that the discovery threshold was so stringent because no one was really sure what the mass of the thing was, so we have no quarrel here.

• I concur–it’s the critics who are wrong, not the physicists, and some of them have an agenda: to suggest physicists want/need a posterior (e.g., the Lindley O’Hagan letters). But I’ve no doubt these critics believe it, because they’ve got their Bayesian hats on and won’t take them off. They should try on the error statistical way of thinking, if only just to understand it.

• Michael Lew

@West

I didn’t intend to imply that physicists are unsophisticated rubes. (Well, any more than other flavours of scientists.) What I was hoping to emphasise is the fact that statistical analyses rarely distinguish between preliminary studies and experiments designed to test specific ideas and estimate specific parameters.

I would suggest that frequentist methods cannot deal well with the preliminary experiments because of the decisions implicit in their calibration.

• Michael: I don’t understand your remark about preliminary experiments. I always find that the strongest argument for frequentist error statistical methods is that they don’t require collecting vast resources to get them going: e.g., an exhaustive set of hypotheses and priors. All accounts that use statistics rely on statistical models, but the point of simple significance tests, say, is that they enable a single question to be split off, e.g., is there a real effect? without more than a directional alternative.

So you’ll need to explain what you mean in saying “frequentist methods cannot deal well with the preliminary experiments” because I thought it was fairly well recognized that their comparative advantage is how little one needs to get going with them. Which methods are free of worries about “calibration”?

How does this sit with your other remark (that I’m sure many would question), that statistical analyses rarely distinguish between preliminary studies and experiments to test more full blown ideas and estimate parameters? It is precisely because we have to do a lot of work to get to the latter stage that we need a methodology for the more preliminary stages.

• Michael Lew

Mayo: Sorry for the slow response, I’ve been at a scientific conference and have not been checking this blog.

When you start to investigate a ‘problem’ in science it is difficult to know exactly what type of experimental procedures and designs will best answer the questions that might relate most directly to the interesting issues that relate to or come from that problem. (Yes, the imprecision in that sentence is deliberate. It serves to mirror the imprecisely defined goals of early investigations.) In practice a scientist will usually jump in and start collecting data. If the results look useful and interpretable then the data collection can become an experiment either by simple extension or by providing the requisite information that allows a properly designed experiment. If the results do not ‘work out’ so well then the experimental approach is modified, the design is changed, the theory is revised, or the system to be investigated is changed.

The frequentist approach treats each unit experiment as if it yields a decision. If you don’t have a hypothesis then what is the use of a frequentist hypothesis test? If the preliminary ‘fiddle-fart’ experimentation suggests a “yes” but the more designed experiment dealing with the same hypotheses says “no”, or vice versa, then what can a frequentist do?

• The others generally must begin with full blown models, perhaps priors and even loss functions. I think it’s absurd to suppose “the frequentist approach” starts with a preset hypothesis and reaches a decision. If there’s any approach now on offer that contains the kind of unregimented panoply of tools for probing tentative claims and for building up viable models, it’s the frequentist approach. Experimental design, exploratory analysis, validating assumptions, learning about potentially relevant data generation sources, estimation, building up methods piecemeal–all involve frequentist methods and are not decision-theoretic (although you could trivially say that there’s a decision to report such and such, or a decision to get more data or try a problem). Even in formal “hypothesis testing” the relevant hypothesis inferred is not typically one of the ones given by a formal statistical test, but perhaps claims about discrepancies from them.

Anyway, I didn’t go back to reread the earlier remark that started the discussion on this…rather late here.

2. Mayo: I’m really curious to see what you have to say about this. I want to see how my expectations match up to reality.

• Patience. I must go visit royalty today in London. If you wish to state your expectations in advance feel free to do so.

• I may write them down somewhere to keep myself honest, but I wouldn’t want to create self-fulfilling or self-defeating prophecies.

• Corey: I’m guessing, given your remark on my current post, that you were expecting something else. But this isn’t really any different from what I’ve said all along, even though at first I wasn’t sure how the physicists were using “fluke”. Strassler makes it clear.
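The multiple-testing question raised in the thread above (“What is the probability I will be fooled if I view each p < 0.05 as a positive when I performed many tests?”) and the Bonferroni and FDR “corrections” mentioned in reply can be put in numbers. What follows is a minimal sketch under stated assumptions, not any commenter's method: the tests are treated as independent, all null hypotheses are taken to be true for the family-wise calculation, and the FDR procedure shown is the Benjamini-Hochberg step-up rule. The function names and the example p-values are made up for illustration.

```python
def familywise_error_rate(alpha, m):
    """P(at least one p < alpha among m independent tests of true nulls)."""
    return 1.0 - (1.0 - alpha) ** m


def bonferroni_reject(p_values, alpha=0.05):
    """Reject H_i when p_i <= alpha / m (controls the family-wise error rate)."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def benjamini_hochberg_reject(p_values, alpha=0.05):
    """Benjamini-Hochberg step-up rule (controls the false discovery rate).

    Finds the largest rank k with p_(k) <= (k / m) * alpha and rejects the
    k hypotheses with the smallest p-values.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * alpha:
            k_max = rank
    reject = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= k_max:
            reject[i] = True
    return reject


if __name__ == "__main__":
    # Hypothetical p-values, invented purely for this example.
    ps = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
    print(familywise_error_rate(0.05, 20))
    print(bonferroni_reject(ps))
    print(benjamini_hochberg_reject(ps))
```

With twenty independent tests at the 0.05 level, the chance of at least one nominally significant result under the null is well over one half. Neither correction addresses the separate problem, also raised in the thread, of biased results that are "truly significant".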