Kent Staley
Associate Professor
Department of Philosophy
Saint Louis University
Regular visitors to Error Statistics Philosophy may recall a discussion that broke out here and on other sites last summer when the CMS and ATLAS collaborations at the Large Hadron Collider announced that, in their search for the Higgs boson, they had discovered a new particle with at least some of the properties expected of the Higgs. Both collaborations emphasized that their results were significant at the level of “five sigma,” and the press coverage presented this as a requirement in high energy particle physics for claiming a new discovery. Both the use of significance testing and the reliance on the five sigma standard became matters of debate.
Mayo has already commented on the recent updates to the Higgs search results (here and here); these seem to have further solidified the evidence for a new boson and the identification of that boson with the Higgs of the Standard Model. I have been thinking recently about the five sigma standard of discovery and what we might learn from reflecting on its role in particle physics. (I gave a talk on this at a workshop sponsored by the “Epistemology of the Large Hadron Collider” project at Wuppertal [i], which included both philosophers of science and physicists associated with the ATLAS collaboration.)
Just to refresh our memories, back in July 2012, Tony O’Hagan posted at the ISBA forum (prompted by “a question from Dennis Lindley”) three questions regarding the five-sigma claim:
- “Why such an extreme evidence requirement? We know from a Bayesian perspective that this only makes sense if (a) the existence of the Higgs boson (or some other particle sharing some of its properties) has extremely small prior probability and/or (b) the consequences of erroneously announcing its discovery are dire in the extreme. Neither seems to be the case, so why 5-sigma?
- “Rather than ad hoc justification of a p-value, it is of course better to do a proper Bayesian analysis. Are the particle physics community completely wedded to frequentist analysis? If so, has anyone tried to explain what bad science that is?
- “We know that given enough data it is nearly always possible for a significance test to reject the null hypothesis at arbitrarily low p-values, simply because the parameter will never be exactly equal to its null value. And apparently the LHC has accumulated a very large quantity of data. So could even this extreme p-value be illusory?”
O’Hagan received a lot of responses to this post, and he very helpfully wrote up and posted a digest of those responses, discussed on this blog here and here.
Note that question 2 (which O’Hagan acknowledges was written to be provocative) begins with the assumption that the use of p-values is “bad science,” and question 3 alludes to an issue which is sometimes taken to support that claim. (A newly-published article by Aris Spanos tackles this “Large N” problem squarely, showing how post-data severity analysis deals with the potential fallacies involved. [ii])
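To see the phenomenon behind question 3 in isolation, here is a minimal sketch under invented numbers (a one-sample z-test with a fixed, scientifically negligible discrepancy from the null): the p-value is driven down simply by increasing the sample size.

```python
# A minimal sketch of the "large n" phenomenon behind O'Hagan's third question:
# fix a tiny, scientifically negligible true discrepancy from the null and
# watch the expected p-value collapse as the sample size grows.
# (Illustrative numbers only; nothing here models the LHC analyses.)
from scipy.stats import norm

true_discrepancy = 0.001   # assumed tiny departure from the null value
sigma = 1.0                # assumed known standard deviation of single observations

for n in (10**4, 10**6, 10**8):
    z = true_discrepancy / (sigma / n**0.5)   # expected z-statistic for H0: mu = 0
    p = norm.sf(z)                            # one-sided p-value
    print(f"n = {n:>10,}   expected z = {z:7.2f}   p ~ {p:.2e}")
```

Because the p-value tracks the product of the discrepancy and the sample size, a small p-value by itself does not license an inference to a scientifically important discrepancy, which is just the fallacy the post-data severity analysis is designed to block.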
I do not have much to add to what Mayo has already written about the use of p-values in general. I like to quote what David Cox has written on this topic: “significance tests in the first place address the question of whether the data are reasonably consistent with a null hypothesis in the respect tested. This is in many contexts an interesting but limited question” [iii].
Of course, the actual experimental reports from CMS and ATLAS include a considerable amount of argumentation beyond the mere citation of a p-value, argumentation that directly addresses the limitations of p-values to which Cox alludes, including any worry one might have that this is an instance of a discrepancy of no scientific significance becoming statistically significant because of the large amount of data collected. [iv]
These additional factors include:
- They defined a parameter μ, the “global signal strength,” such that μ = 0 corresponds to the background-only hypothesis and μ = 1 corresponds to the background-plus-signal hypothesis, and then set confidence limits on the value of μ (a toy sketch of such an interval appears just after this list).
- They found a statistically significant excess of events not only in the overall Higgs search but also in all of the decay channels that they searched.
- Those decay channels reflect specific physical properties, such as decaying into pairs of W bosons or pairs of Z bosons with a net charge of zero, leading to the conclusion that the particle producing the decays is a boson.
- When they use the data from those individual decay channel searches to estimate the mass of the hypothetical particle producing the decays, they see a clear peak in the estimates.
- Then there are appeals to methodological credibility, such as the use of blinding techniques to guard against bias in the data selection criteria.
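On the first of these points, the following is a toy sketch of what confidence limits on a signal-strength parameter look like in the simplest possible setting: a single-bin counting experiment with an assumed Poisson background and signal expectation. The numbers are invented, and the real ATLAS and CMS analyses use a full profile-likelihood treatment over many channels, with shape information and nuisance parameters, rather than anything this simple.

```python
# Toy confidence interval for a signal-strength parameter mu in a single-bin
# counting experiment: observed count n_obs ~ Poisson(b + mu * s).
# All numbers are invented; the real analyses profile over many channels,
# shapes, and nuisance parameters.
import numpy as np
from scipy.stats import poisson, chi2

b, s, n_obs = 100.0, 30.0, 140   # assumed background, signal expectation at mu = 1, observed count

mu_grid = np.linspace(0.0, 4.0, 4001)
nll = -poisson.logpmf(n_obs, b + mu_grid * s)   # negative log-likelihood over the grid
mu_hat = mu_grid[np.argmin(nll)]

# Approximate 95% interval: keep the mu values whose likelihood-ratio statistic
# 2 * (nll(mu) - nll(mu_hat)) stays below the chi-square(1) critical value.
inside = mu_grid[2 * (nll - nll.min()) <= chi2.ppf(0.95, df=1)]
print(f"mu_hat ~ {mu_hat:.2f}, approximate 95% interval [{inside.min():.2f}, {inside.max():.2f}]")
```

In this toy setting, μ = 0 falling outside the interval plays the role of excluding the background-only hypothesis, while checking that μ = 1 sits comfortably inside is the analogue of the compatibility check with the expected Standard Model signal.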
Both ATLAS and CMS, then, used p-values in just the way that they should, to indicate how poorly their data fit the background-only hypothesis. Bayesians may of course continue to believe that using p-values constitutes “bad science,” and I am not attempting here to talk them out of this position, but I do think it is fair to say that particle physicists use significance testing with a pretty clear understanding of what it can and cannot accomplish.
That brings us to O’Hagan’s first question: Why require that results achieve a level of five sigma before announcing a discovery?
Before discussing possible answers to this question, I should note that this five-sigma standard is not any kind of formalized or absolute criterion, but a convention that has only rather recently been adopted in particle physics. Physicist-turned-philosopher-and-historian-of-science Allan Franklin has looked into this, and was kind enough to share with me the prologue to his forthcoming book on twentieth-century particle physics, Shifting Standards. That prologue carefully details the solidification of this practice. The practice of giving a number of standard deviations as an indication of departure from the background hypothesis seems to have begun in the 1960s. Some discovery claims were not accompanied by any probability calculation at all, such as the 1961 discovery of the η, for which the authors considered it sufficient to give just the numbers of expected background events and of observed candidate events, as well as a histogram of the candidates superimposed over the background distribution, showing a clear peak.
Franklin goes on to document how the five-sigma standard seems to have begun to take hold in the mid-1990s, with the discovery of the top quark. As I discuss in detail in my book The Evidence for the Top Quark (Cambridge UP, 2004), the CDF collaboration at Fermilab first published a paper claiming “evidence” for the top quark when they had a 2.8 sigma effect. They chose the word “evidence” deliberately because they considered the evidence insufficiently strong to use the words “discovery” or “observation.” Later, with more data both the CDF and D0 collaborations published “observation of the top quark” papers that did meet the five-sigma standard. Analyzing the contents of Physical Review Letters and Physics Letters, Franklin shows how this distinction between “evidence” and “observation” and its association with the five-sigma standard has become fixed over the nearly 20 years since.
Why has this happened? Based on O’Hagan’s digest of responses to his post and also my own conversations with particle physicists, we can pin down a number of rationales for the five sigma standard. The following stand out:
- The huge investment of resources required for a result in particle physics makes it imperative to avoid an erroneous discovery announcement, not only for the sake of the self-esteem of the physicists involved, but for the sake of maintaining public support for the continued funding of the enterprise.
- “There is so much that can go wrong”: The analysis of data from the almost inconceivably complex detectors currently used has so many possibilities for error, including in the use of simulations, that an extra margin of caution is prudent.
- In any given search, there will be the possibility of finding something throughout some range of possible masses/cross sections, and not just at the place in that range where something in fact is found. This “look elsewhere effect” (LEE) means that p-values computed at the location where a new particle is in fact found understate the probability that background alone would produce an excess that large somewhere in the range, since a five-sigma excess over background found elsewhere in that range would also have been reported as the effect.
- At any given time, there are multiple collaborations performing experiments and searching for new phenomena. Moreover, collaborations like ATLAS and CMS are carrying out many searches for new phenomena at any given time. This increases the chances that a fluctuation of background will yield a significant result in one or more of these searches.
- Physicists often point out that previous discovery claims based on results that were reported as significant at the level of three sigma have sometimes turned out to be incorrect. In an often discussed example, the H1 collaboration at the HERA collider claimed in 2004 to have discovered a new resonance that they interpreted as a pentaquark state, citing a statistical significance of 6.2 sigma. This finding was later judged to be an error. Franklin even cites an example of an erroneous 20 sigma result!
- Then there is the “resilience” argument. Joe Incandela, who is spokesperson for the CMS collaboration, put it this way to me: “It is not so hard to lose a sigma with added data or other changes to an analysis. With 5 going to 4, you still have a pretty solid result, but 4 going to 3 is now pretty marginal.” As I interpret it, this rationale relies on the idea that it is undesirable not only to be completely wrong in claiming a discovery, but also to make a strong claim and then have the evidence for it start to look doubtful.
What I think is really interesting about this philosophically is that these rationales seem to require two different perspectives on significance testing that are generally thought to be at odds.
If we adopt a behavioristic perspective, then the acceptance or rejection of a hypothesis is interpreted as a decision to act a certain way. The properties of a test should allow one to limit the rate at which one chooses the “wrong” action, and hence to limit the costs of such actions.
Mayo has advocated an alternative [v], which she has at different times called a learning or evidential approach, according to which the error probabilities of a test with accept/reject outputs can be used to assess how good the test is at revealing errors of specific kinds. When a test that has good properties for detecting an error of a given kind with the data in hand fails to detect such an error, this is evidence that the error is absent.
The latter approach is suited to addressing the question of how strongly the data from a given experimental test support a particular hypothesis (let’s call them “single case problems”), while the former approach seems suited to the task of limiting costs associated with errors of a specific kind (“repeated case problems”). But the rationales given for the 5 sigma standard in particle physics include problems of both kinds. [vi]
Take the LEE. This is clearly a single case problem in the sense that the worry is that the p-value calculated for the inference at hand may be misleadingly small. Of course, one might then say that one should calculate it differently, by giving the probability that one would get a departure from the null expectation that is as large or larger anywhere in the range that is being searched. In fact, both CMS and ATLAS do exactly this, citing the resulting number as the “global significance.” They both give two global significance calculations, based on both a “wide” and a “narrow” range of masses to which their experiments were sensitive. Oddly, CMS’s wide range is narrower than ATLAS’s narrow range and is just 1/14 the size of ATLAS’s wide range! I doubt that this reflects a correspondingly large objective difference in the physics capabilities of the two detectors.
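To get a feel for why the choice of search window matters, here is a toy Monte Carlo sketch of the LEE, with independent Poisson bins standing in for a binned resonance search; the window width and background level are invented, and this is nothing like the collaborations’ actual global-significance calculation. It asks how often background alone produces a local excess of a given significance somewhere in the scanned window.

```python
# Toy Monte Carlo of the look-elsewhere effect: under the background-only
# hypothesis, how often does the *most* significant bin anywhere in a scanned
# window reach a given local significance? Independent Poisson bins are a
# crude stand-in for a binned resonance search; all numbers are invented.
import numpy as np
from scipy.stats import norm, poisson

rng = np.random.default_rng(0)
n_bins, b_per_bin, n_toys = 100, 50.0, 20000   # assumed window width and background level

counts = rng.poisson(b_per_bin, size=(n_toys, n_bins))
local_p = poisson.sf(counts - 1, b_per_bin)    # per-bin P(N >= observed) under background only
max_local_z = norm.isf(local_p.min(axis=1))    # most significant bin in each toy, in sigmas

for z in (2.0, 3.0):
    print(f"local {z:.0f}-sigma excess: local p = {norm.sf(z):.1e}, "
          f"global p (somewhere in the window) ~ {np.mean(max_local_z >= z):.2f}")
```

Widening the window (adding bins) raises the “somewhere in the window” probability while leaving each local p-value untouched, which is one way to see why the two collaborations’ differing range choices show up in their global significances but not in their local ones.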
Although having the global significance calculations is informative, the five sigma standard (which applies to the local significance) bypasses the need to worry about which global significance matters. It ensures that when a discovery is announced, it will be based on a more severe test than would be accomplished with a lower standard. I think similar comments would apply to the “resilience” rationale.
Other rationales, however, clearly call for the behavioristic interpretation, because they concern “repeated case” problems. This approach seems needed to make sense of the concerns over investment, multiple searches, and previous erroneous discovery claims. There would seem to be a connection between the latter two of these. You might think that the reason why there have been numerous statistically significant discoveries that have turned out to be errors is simply that so many searches have been carried out. Given all of those searches, the probability is not low that in some of them, an upward fluctuation of background would fool experimenters into thinking they had found something.
It seems unlikely, however, that this is the whole story. The error rate for discovery claims seems much more likely to involve both stochastic fluctuations and problems in the analysis of data. Were there none of the latter, then holding the community to a five sigma standard for making discovery claims would limit the error rate to 3 × 10⁻⁷. But that assumes that everything else is done correctly, and as another of the rationales notes, “there is so much that can go wrong”!
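For reference, that figure is just the upper tail area of a standard normal distribution beyond five standard deviations, computed one-sided as is the convention in reporting these searches:

```python
# The quoted error rate is the one-sided upper tail of a standard normal
# distribution beyond five standard deviations.
from scipy.stats import norm

p_5sigma = norm.sf(5.0)        # ~2.87e-07, i.e. roughly 3 x 10^-7
z_back = norm.isf(p_5sigma)    # converting the p-value back into "sigmas" recovers 5.0
print(p_5sigma, z_back)
```

Run in reverse, the same conversion translates a reported p-value into the equivalent number of “sigmas.”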
_________
[i] I have to put in a plug here for the project, which developed out of an already quite remarkable cooperative environment among physicists and philosophers and historians of science at Wuppertal. Koray Karaca, Friedrich Steinle, and Christian Zeitnitz organized a terrific workshop.
[ii] Spanos, A. “Who Should Be Afraid of the Jeffreys-Lindley Paradox?” Philosophy of Science, 80 (2013): 73-93.
[iii] Cox, D. Principles of Statistical Inference. New York: Cambridge University Press, 2006, 42.
[iv] ATLAS Collaboration, “Observation of a New Particle in the Search for the Standard Model Higgs Boson with the ATLAS Detector at the LHC,” Physics Letters B716 (2012): 1–29; CMS Collaboration, “Observation of a New Boson at a Mass of 125 GeV with the CMS Experiment at the LHC,” Physics Letters B716 (2012): 30–61.
[v] Mayo, D. “Behavioristic, Evidentialist, and Learning Models of Statistical Testing,” Philosophy of Science, 52 (1985): 493–516; Error and the Growth of Experimental Knowledge, Chicago: University of Chicago Press, 1996.
[vi] See Mayo, Error and the Growth of Experimental Knowledge, pp. 374–377.
Staley, K. The Evidence for the Top Quark. Cambridge: Cambridge University Press, 2004.
That’ll be *Dennis* Lindley.
Oh my. I’m terribly sorry about that. It would be really something if there were an important and accomplished statistician who also had played lap steel guitar for Jackson Browne and Ry Cooder! http://en.wikipedia.org/wiki/David_Lindley_%28musician%29
Perhaps I am thinking about too many things all at once?
Deborah, if you get a chance, I would be grateful if you could fix my error!
Sorry the Elbian editors missed that….they were working on the blog too late at night, Elba time. But it does seem there are a lot of Davids in statistics.
Five sigma is a convention, just as p<0.05 or p<0.01 are conventions; the latter are in many situations not tough enough to deal with potential data-dependent adjustments that could amount to "fishing for significance" to some extent even without evil intentions (and with which Bayes could deal no better). So I'm not much worried about asking for a stronger "signal" than p<0.01.
What I find more interesting about this is what is implied by measuring the "significance" in terms of sigma and not in terms of p, and I wonder why Kent didn't comment on this issue.
I can see two implications of this. One is that the measurement unit of sigma is the measurement unit of the original observations and it may be that physicists (and many scientists) feel more comfortable about interpreting this than about a probability. In fact, one could argue that computing a discrepancy in terms of sigma is probability model-free, and (that's the second implication) one could say that stating the evidence in terms of sigma does not rely on normality.

Now this is problematic, because it is among the main characteristics of the normal distribution that the variation is measured appropriately by averaging squared differences from the mean as in the standard estimation of sigma. Outliers will affect the estimation of sigma strongly, so expressing evidence in terms of sigma is not more robust against outliers than p-values based on the normal distribution (unless the sigma estimate is computed from "better" data than the test itself).

There may be non-normal distributions for which k-sigma gives a more appropriate description of significance than a p-value computed from a normal distribution (of course assuming that k-sigma is not interpreted as equivalent to the corresponding normal p-value anyway), but I don't know how large this class is. Chances are that for large n violations of normality either don't make much of a difference, or the problems affect sigma to the same extent as normal p-values.
So I wonder whether using a k-sigma rule somehow suggests rather misleadingly that one doesn't rely on normality.
Perhaps the expression as k-sigmas originates from W. Edwards Deming (as in ‘six sigma’)?
Based on what I’ve read, it sounds as if the analysis of LHC data does not assume normality; a p-value is first generated then converted to sigmas.
Paul: I too thought of the Deming 6 sigma business catching on. I am definitely an outsider here, but I have seen the analysis in terms of Normal curves with lovely graphs. I don’t know what you mean about the p-values being “first generated then converted”. Can you explain?
Hi Christian,
Just a quick reply before I go off to teach…
They actually do calculate a p-value first. For example, see ATLAS’s July ’12 Physics Letters paper, figure 8:
http://arxiv.org/abs/1207.7214
They go into a little more detail on their statistical procedures in a paper for Physical Review D, see section 5:
http://arxiv.org/abs/1207.0319
The reporting of the significance in terms of sigmas is more prominent in the public presentations because that seems to be the way in which particle physicists typically think, statistically, about excesses beyond background.
Here’s a link to a paper particle physicist Matt Strassler sent me recently, by some who have been designing the methods–although he said it might be a bit self-serving, perhaps in emphasizing the techniques they developed? There’s even a reference to Birnbaum, though I can’t tell how he is being used.
http://indico.cern.ch/getFile.py/access?contribId=46&sessionId=1&resId=0&materialId=slides&confId=202554
Thanks for the interesting post and the above links – this has cleared up some confusion for me about the sigma levels.
As I read it, from section 5 of the second paper, the 5-sigma is indeed calculated post hoc from the p-value. What it shows is the equivalent Z-score, that is, the equivalent statistic you would get if you were doing a test on a normally distributed variable with known standard deviation. If I'm reading this right, the sigma in "5 sigma" refers to the standard deviation of the statistic according to the sampling distribution, *not* the standard deviation of the recorded values. Outside of physics this is usually called the standard error rather than "sigma": the standard deviation (which *is* usually denoted sigma) divided by root N (= number of observations). A potential source of confusion perhaps. Moreover the sigma is the standard deviation of some imagined equivalent sampling distribution from a test that hasn't actually been performed; it isn't a standard deviation taken directly from any observation.
I would guess then that the main use of this is to make a comparison to a well known type of test. This doesn’t really clarify for me any reason to prefer sigma-level over the p-value apart from convention – both effectively serve this purpose. Conventions are useful for communication though, and “5 sigma” has a nice ring to it…
James: I think a relevant reason is that the exact probability doesn’t mean anything and is so often misinterpreted (e.g., my “statistical fluke” post) that it’s preferable to emphasize the evidence of a genuine experimental effect that will not go away, and then, with the new data, that it is evidence of a Higgs particle.
The point is to “detach” an inferred claim, not probabilify anything. For example, to detach or infer a region of values for the various parameters, now known. There’s a qualification, but it concerns the process, and it is recognized that further evidence can change aspects of the interpretation and the overall theories.
Christian: No, as far as I know the normality assumption is there. I think the sigma language is used to emphasize the extent of the inconsistency rather than have yet more misunderstandings of the p-value. Did you see my previous post on this regarding “random flukes”?
Mayo: Yes, I did. Probably you’re right about them using normality anyway (I’m not so familiar with the literature in physics); my point was rather psychological: using k-sigma, people can discuss it without making clear reference to the normal model.
Christian,
Where does your “chances are” comment come from? In essentially all standard situations, one doesn’t need normality of data to get acceptable p-values, if the samples are independent and there are a lot of them. k-sigma rules (i.e. k standard error rules) make a lot of sense anytime a Central Limit Theorem offers a decent approximation to the distribution of the point estimates.
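For a given sampling model that claim is easy to check by simulation. Here is a hypothetical sketch using exponential data, chosen only because it is visibly non-normal, comparing how often the standardized sample mean lands beyond k standard errors with the corresponding normal tail areas.

```python
# Quick simulation check of the CLT claim for one hypothetical non-normal model:
# how often does the standardized mean of n iid Exp(1) observations land beyond
# k standard errors, compared with the normal tail area?
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
n, reps = 100, 1_000_000
# The mean of n iid Exp(1) draws is Gamma(n, 1)/n, so sample the means directly.
means = rng.gamma(shape=n, scale=1.0, size=reps) / n
z = (means - 1.0) * np.sqrt(n)   # standardized means (true mean 1, true sd 1)

for k in (2.0, 3.0):
    print(f"k = {k}: simulated upper-tail frequency {np.mean(z >= k):.5f}, "
          f"normal tail area {norm.sf(k):.5f}")
```

For this model and this n the simulated frequencies sit somewhat above the normal tail areas, and the relative gap widens as k grows, so how far into the tail the approximation is needed matters.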
Original guest: “If samples are independent and there are a lot of them”… Well, what “a lot of them” exactly means depends on the tails of the underlying distribution, and there is always little data to know what goes on in the extreme tails (only how extreme “extreme” is changes, unless you have finite value ranges covered by data even close to the boundaries). And you’ll never have precise independence.
Yes, but you’re not going to have precise Normality either.
And pretty mild data-cleaning steps will take care of extreme heavy-tails; it’s rare for self-respecting scientists to let you analyze data where one of the observations (or a tiny fraction of them) is zillions of miles from the rest.
Does anyone have a sense of the sample sizes that go into the 5 sigma? sorry if this came up somewhere already, I know I had asked in my earlier post.
In the July ’12 ATLAS paper, table 5 gives the number of candidate events across the various channels, along with expected background and signal for a 125 GeV Higgs:
http://arxiv.org/abs/1207.7214
The total number of candidate events is 213, with an expected background of about 168, plus or minus some, and an expected signal of 25, plus or minus some (the background and signal estimates are given separately for analyses requiring different numbers of hadronic ‘jets’ — I’ve added them together without attempting to combine the uncertainties).
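To put those raw totals in perspective, here is a deliberately naive counting-only calculation from the numbers just quoted. It ignores the background uncertainty, the division into channels, and all of the shape information, so it is nothing like the profile-likelihood fit that underlies the reported significance.

```python
# A deliberately naive counting-only estimate from the totals quoted above
# (213 candidates vs. roughly 168 expected background). It ignores the
# background uncertainty, the split into channels, and all shape information.
from scipy.stats import norm, poisson

n_obs, b = 213, 168.0
p_naive = poisson.sf(n_obs - 1, b)   # P(N >= 213) if background alone is at work
print(f"naive counting p ~ {p_naive:.1e} (~{norm.isf(p_naive):.1f} sigma)")
```

The gap between an estimate like this and the reported result is a reminder that the per-channel analyses, with their mass-shape information, carry most of the evidential weight rather than the raw event totals.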
“Oddly, CMS’s wide range is narrower than ATLAS’s narrow range and is just 1/14 the size of ATLAS’s wide range! I doubt that this reflects a correspondingly large objective difference in the physics capabilities of the two detectors.”
I haven’t paid attention to this yet, but it does not seem surprising. Each detector has a different theoretical basis. Consider, for example, thermometers, all of which measure temperature on the same scale. Mercury thermometers, which are based on expansion of liquid mercury, work well from about -30 to 300 (celsius). Resistance thermometers, which do not rely on expansion of a liquid, work well from about -250 to 700.
Their detectors are different, Paul, but not *that* different. More specifically, they claim sensitivities to Higgs with masses across comparable ranges. The ranges are different for different decay modes, but both experiments include ranges up to 600 GeV (H->ZZ and H->WW for CMS, H->ZZ for ATLAS).
Kent: Thank you so much for this interesting guest post! I will study it and comment later on!
Kent: Quick remarks:
First off, I should let everyone know that I’ve known Staley for donkey’s years, even a bit before he participated in my NEH Summer Seminar on Induction and Experiment back in ’99! He’s been at nearly all of my conferences, we’ve co-edited stuff, and he is perhaps the most full-fledged “error statistical philosopher” among all philosophers that I know.
You wrote:
“Other rationales, however, clearly call for the behavioristic interpretation, because they concern ‘repeated case’ problems.”

Concerns about multiple searching, look elsewhere effects, and various other selection effects reflect the goal of avoiding misinterpretations of the data, and so are inferential rather than behavioristic. This is the same as with p-value adjustments for getting an assessment that correctly reflects the actual severity (associated with the inference).
One of the biggest blindspots many harbor–as you know well– is in failing to understand that the sampling distribution is crucial for what you call “single case”* inference. I have heard sophisticated statisticians say things about frequentist sampling theory on the order of “there is evidence”, e.g., of fit between data and hypotheses (I don’t know where from, maybe likelihoods) and then there are sampling distributions used only for assessing performance. Let me be clear, they are supposing this is the frequentist view and it is not. We cannot have evidence or inference without knowing something about the probative capacities of the method. That’s why we’d worry about selection effects, multiple testing, etc. for purposes of avoiding misinterpreting this phenomenon in this case. Thus sampling distributions are integral to inference—even though they need to be relevant sampling distributions.
Granted, there is an interest in purely behavioristic goals at different stages of the Higgs research, e.g., for purposes of “triggering” (essentially getting rid of 99.9% of the data, “deciding” what data to record and analyze now, and what to “park” for later). I will discuss this in a next post.
* I wouldn’t make it a choice between “single case” and long-run decisions, even though I realize you were just looking for a term to distinguish them.
Thanks for your comments, Deborah.
Perhaps the contrast between “single case” and “repeated case” problems is not well drawn. Of course I agree that the sampling distribution is important for evaluating the evidence in a single case, which is why I think that the LEE rationale, for example, for the 5 sigma requirement can be understood using an evidential interpretation. The rationales that I think call for a behavioristic interpretation are those that invoke the standard for goals having to do with the broader practices of particle physicists, such as reducing the number of erroneous discovery claims or maintaining popular support for the funding of particle physics experiments.
Kent: No I don’t see that the practical or economic goals in setting standards render any of the inferences using the standards behavioristic (at least in my sense). The fact that there are pragmatic reasons for getting it right, learning the truth approximately,etc. does not make the process of getting it right (using statistics in this case) itself pragmatic in any way that gets at the distinction of interest.
Kent: It’s interesting that you cite my old “learning interpretation” of N-P tests. I stopped using that word because I kept getting requests from psych people for offprints (remember those old days?), and I’m sure they thought the papers concerned psychology.