Department of Philosophy
Saint Louis University
Regular visitors to Error Statistics Philosophy may recall a discussion that broke out here and on other sites last summer when the CMS and ATLAS collaborations at the Large Hadron Collider announced that they had discovered a new particle in their search for the Higgs boson that had at least some of the properties expected of the Higgs. Both collaborations emphasized that they had results that were significant at the level of “five sigma,” and the press coverage presented this as a requirement in high energy particle physics for claiming a new discovery. Both the use of significance testing and the reliance on the five sigma standard became a matter of debate.
Mayo has already commented on the recent updates to the Higgs search results (here and here); these seem to have further solidified the evidence for a new boson and the identification of that boson with the Higgs of the Standard Model. I have been thinking recently about the five sigma standard of discovery and what we might learn from reflecting on its role in particle physics. (I gave a talk on this at a workshop sponsored by the “Epistemology of the Large Hadron Collider” project at Wuppertal [i], which included both philosophers of science and physicists associated with the ATLAS collaboration.)
- “Why such an extreme evidence requirement? We know from a Bayesian perspective that this only makes sense if (a) the existence of the Higgs boson (or some other particle sharing some of its properties) has extremely small prior probability and/or (b) the consequences of erroneously announcing its discovery are dire in the extreme. Neither seems to be the case, so why 5-sigma?
- “Rather than ad hoc justification of a p-value, it is of course better to do a proper Bayesian analysis. Are the particle physics community completely wedded to frequentist analysis? If so, has anyone tried to explain what bad science that is?
- “We know that given enough data it is nearly always possible for a significance test to reject the null hypothesis at arbitrarily low p-values, simply because the parameter will never be exactly equal to its null value. And apparently the LHC has accumulated a very large quantity of data. So could even this extreme p-value be illusory?”
Note that question 2 (which O’Hagan acknowledges was written to be provocative) begins with the assumption that the use of p-values is “bad science,” and question 3 alludes to an issue which is sometimes taken to support that claim. (A newly published article by Aris Spanos tackles this “Large N” problem squarely, showing how post-data severity analysis deals with the potential fallacies involved. [ii])
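To make the “Large N” worry in question 3 concrete, here is a minimal sketch. It is only an illustration, not anything taken from the ATLAS or CMS analyses or from Spanos's paper. It assumes a simple one-sample z-test with known standard deviation, and it assumes (unrealistically) that the observed sample mean sits exactly at a fixed, tiny distance `delta` from the null value. The function names and all the numbers are hypothetical.

```python
import math


def z_statistic(delta: float, sigma: float, n: int) -> float:
    """z-score of a sample mean lying `delta` away from the null value,
    for n observations with known standard deviation `sigma`."""
    return delta * math.sqrt(n) / sigma


def two_sided_p(z: float) -> float:
    """Two-sided p-value for a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2.0))


# Hypothetical discrepancy: one hundredth of a standard deviation.
delta, sigma = 0.01, 1.0
for n in (100, 10_000, 1_000_000, 100_000_000):
    z = z_statistic(delta, sigma, n)
    print(f"n = {n:>11,d}   z = {z:6.1f}   p = {two_sided_p(z):.3g}")
```

The same fixed discrepancy moves from unremarkable to overwhelmingly “significant” as n grows, which is why the p-value alone cannot tell us whether the discrepancy is of any scientific importance.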
I do not have much to add to what Mayo has already written about the use of p-values in general. I like to quote what David Cox has written on this topic: “significance tests in the first place address the question of whether the data are reasonably consistent with a null hypothesis in the respect tested. This is in many contexts an interesting but limited question” [iii].
Of course, the actual experimental reports from ATLAS and CMS include a considerable amount of additional argumentation, beyond simply citing a p-value, that directly addresses the limitations of p-values to which Cox alludes, including any possible worry one might have that this is an instance of a discrepancy of no scientific significance becoming statistically significant because of the large amount of data collected. [iv]
These additional factors include:
- They defined a parameter μ, the “global signal strength”, such that μ=0 corresponds to the background-only hypothesis and μ=1 corresponds to the background-plus-signal hypothesis and then set confidence limits on the value of μ.
- They found a statistically significant excess of events not only in the overall Higgs search, but also in all of the decay channels that they searched.
- Those decay channels reflect specific physical properties leading to the conclusion that the particle producing the decays is a boson, such as decaying into pairs of W bosons or pairs of Z bosons, with a net charge of zero.
- When they use the data from those individual decay channel searches to estimate the mass of the hypothetical particle producing the decays, they see a clear peak in the estimates.
- Then there are appeals to methodological credibility, such as the use of blinding techniques to prevent biased data selection criteria.
Both ATLAS and CMS, then, used p-values in just the way that they should, to indicate how poorly their data fit the background-only hypothesis. Bayesians may of course continue to believe that using p-values constitutes “bad science,” and I am not attempting here to talk them out of this position, but I do think it is fair to say that particle physicists use significance testing with a pretty clear understanding of what it can and cannot accomplish.
That brings us to O’Hagan’s first question: Why require that results achieve a level of five sigma before announcing a discovery?
Before discussing possible answers to this question, I should note that this five-sigma standard is not any kind of formalized or absolute criterion, but a convention that has only rather recently been adopted in particle physics. Physicist-turned-philosopher-and-historian-of-science Allan Franklin has looked into this, and was kind enough to share with me the prologue to his forthcoming book on twentieth-century particle physics, Shifting Standards. That prologue carefully details the solidification of this practice. The practice of giving a number of standard deviations as an indication of departure from the background hypothesis seems to have begun in the 1960s. Some discovery claims were not accompanied by any probability calculation at all, such as the 1961 discovery of the η, for which the authors considered it sufficient to give just the numbers of expected background events and of observed candidate events, as well as a histogram of the candidates superimposed over the background distribution, showing a clear peak.
Franklin goes on to document how the five-sigma standard seems to have begun to take hold in the mid-1990s, with the discovery of the top quark. As I discuss in detail in my book The Evidence for the Top Quark (Cambridge UP, 2004), the CDF collaboration at Fermilab first published a paper claiming “evidence” for the top quark when they had a 2.8 sigma effect. They chose the word “evidence” deliberately because they considered the evidence insufficiently strong to use the words “discovery” or “observation.” Later, with more data, both the CDF and D0 collaborations published “observation of the top quark” papers that did meet the five-sigma standard. Analyzing the contents of Physical Review Letters and Physics Letters, Franklin shows how this distinction between “evidence” and “observation” and its association with the five-sigma standard has become fixed over the nearly 20 years since.
Why has this happened? Based on O’Hagan’s digest of responses to his post and also my own conversations with particle physicists, we can pin down a number of rationales for the five sigma standard. The following stand out:
- The huge investment of resources required for a result in particle physics makes it imperative to avoid an erroneous discovery announcement, not only for the sake of the self-esteem of the physicists involved, but for the sake of maintaining public support for the continued funding of the enterprise.
- “There is so much that can go wrong”: The analysis of data from the almost inconceivably complex detectors currently used has so many possibilities for error, including in the use of simulations, that an extra margin of caution is prudent.
- In any given search, there will be the possibility of finding something throughout some range of possible masses/cross sections, and not just in the place in that range where something in fact is found. This “look elsewhere effect” (LEE) means that p-values based on where in fact a new particle is found are under-estimates, since if a five-sigma excess over background had been found elsewhere in that range, that would have been reported as the effect.
- At any given time, there are multiple collaborations performing experiments and searching for new phenomena. Moreover, collaborations like ATLAS and CMS are carrying out many searches for new phenomena at any given time. This increases the chances that a fluctuation of background will yield a significant result in one or more of these searches.
- Physicists often point out that previous discovery claims based on results that were reported as significant at the level of three sigma have sometimes turned out to be incorrect. In an often-discussed example, the H1 collaboration at the HERA collider claimed in 2004 to have discovered a new resonance that they interpreted as a pentaquark state, citing a statistical significance of 6.2 sigma. This finding was later judged to be an error. Franklin even cites an example of an erroneous 20 sigma result!
- Then there is the “resilience” argument. Joe Incandela, who is spokesperson for the CMS collaboration, put it this way to me: “It is not so hard to lose a sigma with added data or other changes to an analysis. With 5 going to 4, you still have a pretty solid result, but 4 going to 3 is now pretty marginal.” As I interpret it, this rationale relies on the idea that it is undesirable not only to be completely wrong in claiming a discovery, but also to make a strong claim and then have the evidence for it start to look doubtful.
What I think is really interesting about this philosophically is that these rationales seem to require two different perspectives on significance testing that are generally thought to be at odds.
If we adopt a behavioristic perspective, then the acceptance or rejection of a hypothesis is interpreted as a decision to act a certain way. The properties of a test should allow one to limit the rate at which one chooses the “wrong” action, and hence allow one to limit the costs of such actions.
Mayo has advocated an alternative [v], which she has at different times called a learning or evidential approach, according to which the error probabilities of a test with accept/reject outputs can be used to assess how good the test is at revealing errors of specific kinds. When a test that has good properties for detecting an error of a given kind with the data in hand fails to detect such an error, this is evidence that the error is absent.
The latter approach is suited to addressing the question of how strongly the data from a given experimental test support a particular hypothesis (let’s call them “single case problems”), while the former approach seems suited to the task of limiting costs associated with errors of a specific kind (“repeated case problems”). But the rationales given for the 5 sigma standard in particle physics include problems of both kinds. [vi]
Take the LEE. This is clearly a single case problem in the sense that the worry is that the p-value calculated for the inference at hand may be misleadingly small. Of course, one might then say that one should calculate it differently, by giving the probability that one would get a departure from the null expectation that is as large or larger anywhere in the range that is being searched. In fact, both CMS and ATLAS do exactly this, citing the resulting number as the “global significance.” They both give two global significance calculations, based on both a “wide” and a “narrow” range of masses to which their experiments were sensitive. Oddly, CMS’s wide range is narrower than ATLAS’s narrow range and is just 1/14 the size of ATLAS’s wide range! I doubt that this reflects a correspondingly large objective difference in the physics capabilities of the two detectors.
Although having the global significance calculations is informative, the five sigma standard (which applies to the local significance) bypasses the need to worry about which global significance matters. It ensures that when a discovery is announced, it will be based on a more severe test than would be accomplished with a lower standard. I think similar comments would apply to the “resilience” rationale.
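The relation between local and global significance can be roughed out with a toy calculation. This is a sketch under a strong simplifying assumption, namely that the searched mass range splits into `n_windows` statistically independent windows, each equally likely to produce a background fluctuation. The collaborations' actual global significance calculations are more involved; the function names and the window counts below are hypothetical.

```python
import math
from statistics import NormalDist


def local_p(n_sigma: float) -> float:
    """One-sided Gaussian tail probability for an n_sigma excess."""
    return 0.5 * math.erfc(n_sigma / math.sqrt(2.0))


def global_p(p_local: float, n_windows: int) -> float:
    """Probability that at least one of n_windows independent windows
    shows an excess at least as extreme as the one observed."""
    # expm1/log1p keep precision when p_local is tiny.
    return -math.expm1(n_windows * math.log1p(-p_local))


def to_sigma(p: float) -> float:
    """Convert a one-sided p-value back into a number of sigma."""
    return -NormalDist().inv_cdf(p)


# Hypothetical window counts standing in for "narrow" and "wide" ranges.
for n_windows in (1, 10, 100):
    p_glob = global_p(local_p(5.0), n_windows)
    print(f"windows = {n_windows:>3d}   global p = {p_glob:.3g}   "
          f"global significance = {to_sigma(p_glob):.2f} sigma")
```

Even in this crude model the global significance depends heavily on how wide a range one decides to count, which is the difficulty that a demanding local standard sidesteps.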
Other rationales, however, clearly call for the behavioristic interpretation, because they concern “repeated case” problems. This approach seems needed to make sense of the concerns over investment, multiple searches, and previous erroneous discovery claims. There would seem to be a connection between the latter two of these. You might think that the reason why there have been numerous statistically significant discoveries that have turned out to be errors is simply that so many searches have been carried out. Given all of those searches, the probability is not low that in some of them, an upward fluctuation of background would fool experimenters into thinking they had found something.
It seems unlikely, however, that this is the whole story. The error rate for discovery claims seems much more likely to involve both stochastic fluctuations and problems in the analysis of data. Were there none of the latter, then holding the community to a five sigma standard for making discovery claims would limit the error rate to 3 × 10⁻⁷. But that assumes that everything else is done correctly, and as another of the rationales notes, “there is so much that can go wrong”!
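A short sketch shows where the figure of roughly 3 × 10⁻⁷ comes from and why it is not the whole story. The first function is just the one-sided Gaussian tail probability. The second is a purely illustrative assumption of mine: it treats an analysis error that produces a spurious signal as an independent event with some probability `q_analysis`, a quantity nobody knows and for which the values below are hypothetical.

```python
import math


def one_sided_p(n_sigma: float) -> float:
    """Probability that background alone fluctuates up by n_sigma or more."""
    return 0.5 * math.erfc(n_sigma / math.sqrt(2.0))


def false_claim_rate(n_sigma: float, q_analysis: float) -> float:
    """Chance of a mistaken discovery claim if a statistical fluke and an
    analysis error are independent ways of going wrong (an assumption)."""
    p = one_sided_p(n_sigma)
    return p + q_analysis - p * q_analysis


for n_sigma in (3.0, 4.0, 5.0):
    print(f"{n_sigma:.0f} sigma: fluctuation-only rate = {one_sided_p(n_sigma):.3g}")

# Hypothetical analysis-error probabilities, for illustration only.
for q in (0.0, 1e-4, 1e-2):
    print(f"q_analysis = {q:g}: rate at 5 sigma = {false_claim_rate(5.0, q):.3g}")
```

Once `q_analysis` is anything much larger than the tail probability, it dominates the total, so the nominal rate implied by five sigma holds only on the assumption that the rest of the analysis is sound.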
[i] I have to put in a plug here for the project, which developed out of an already quite remarkable cooperative environment among physicists and philosophers and historians of science at Wuppertal. Koray Karaca, Friedrich Steinle, and Christian Zeitnitz organized a terrific workshop.
[ii] Spanos, A. “Who Should Be Afraid of the Jeffreys-Lindley Paradox?” Philosophy of Science, 80 (2013): 73-93.
[iii] Cox, D. Principles of Statistical Inference. New York: Cambridge University Press, 2006, 42.
[iv] ATLAS Collaboration, “Observation of a New Particle in the Search for the Standard Model Higgs Boson with the ATLAS Detector at the LHC,” Physics Letters B716 (2012): 1–29; CMS Collaboration, “Observation of a New Boson at a Mass of 125 GeV with the CMS Experiment at the LHC,” Physics Letters B716 (2012): 30–61.
[v] Mayo, D. “Behavioristic, Evidentialist, and Learning Models of Statistical Testing,” Philosophy of Science, 52 (1985): 493–516; Error and the Growth of Experimental Knowledge, Chicago: University of Chicago Press, 1996.
[vi] See Mayo, Error and the Growth of Experimental Knowledge, pp. 374–377.
Staley, K. The Evidence for the Top Quark, Cambridge UP, 2004.