The biennial meeting of the Philosophy of Science Association (PSA) starts this week (Nov. 6-9) in Chicago, together with the History of Science Society. I’ll be part of the symposium:
“A 5 sigma effect!” is how the recent Higgs boson discovery was reported. Yet before the dust had settled, the very nature and rationale of the 5 sigma (or 5 standard deviation) discovery criterion began to be challenged and debated, both among scientists and in the popular press. Why 5 sigma? How is it to be interpreted? Do p-values in high-energy physics (HEP) avoid the controversial uses and misuses of p-values seen in the social and other sciences? The goal of our symposium is to combine the insights of philosophers and scientists whose work interrelates philosophy of statistics, data analysis and modeling in experimental physics, with critical perspectives on how discoveries proceed in practice. Our contributions will link questions about the nature of statistical evidence, inference, and discovery with questions about the very creation of standards for interpreting and communicating statistical experiments. We will bring out some unique aspects of discovery in modern HEP. We will also show how the episode illuminates some of the thorniest issues revolving around statistical inference, frequentist and Bayesian methods, and the philosophical, technical, social, and historical dimensions of scientific discovery. Among the questions we will address:
1) How do philosophical problems of statistical inference interrelate with debates about inference and modeling in high energy physics (HEP)?
2) Have standards for scientific discovery in particle physics shifted? And if so, how has this influenced when a new phenomenon is “found”?
3) Can understanding the roles of statistical hypotheses tests in HEP resolve classic problems about their justification in both physical and social sciences?
4) How do pragmatic, epistemic and non-epistemic values and risks influence the collection, modeling, and interpretation of data in HEP?
Abstracts for Individual Presentations
The discovery and characterization of a Higgs boson in 2012-2013 provide multiple examples of statistical inference as practiced in high energy physics (elementary particle physics). The main methods employed have a decidedly frequentist flavor, drawing in a pragmatic way on both Fisher’s ideas and the Neyman-Pearson approach. A physics model being tested typically has a “law of nature” at its core, with parameters of interest representing masses, interaction strengths, and other presumed “constants of nature”. Additional “nuisance parameters” are needed to characterize the complicated measurement processes. The construction of confidence intervals for a parameter of interest θ is dual to hypothesis testing, in that the test of the null hypothesis θ = θ0 at significance level (“size”) α is equivalent to whether θ0 is contained in a confidence interval for θ with confidence level (CL) equal to 1−α. With CL or α specified in advance (“pre-data”), frequentist coverage properties can be assured, at least approximately, although nuisance parameters bring in significant complications. With data in hand, the post-data p-value can be defined as the smallest significance level α at which the null hypothesis would be rejected, had that α been specified in advance. Carefully calculated p-values (not assuming normality) are mapped onto the equivalent number of standard deviations (“σ”) in a one-tailed test of the mean of a normal distribution. For a discovery such as the Higgs boson, experimenters report both p-values and confidence intervals of interest.
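To make these relationships concrete, here is a minimal sketch in Python (my own illustration, not the experiments’ code; it assumes scipy and a simple Gaussian estimate with known standard error) of the test/confidence-interval duality and the one-tailed mapping between p-values and sigma units:

```python
# Minimal sketch of two points in the abstract above (illustrative numbers only):
# (i) the duality between a size-alpha test of theta = theta0 and a 1 - alpha
#     confidence interval, and
# (ii) the one-tailed normal mapping between p-values and "number of sigmas".
from scipy.stats import norm

def ci_contains(theta0, theta_hat, se, alpha=0.05):
    """Does the central 1 - alpha confidence interval for theta contain theta0?"""
    z_crit = norm.isf(alpha / 2)                    # two-sided critical value
    return abs(theta_hat - theta0) <= z_crit * se

def test_rejects(theta0, theta_hat, se, alpha=0.05):
    """Does the two-sided test of H0: theta = theta0 at size alpha reject?"""
    p = 2 * norm.sf(abs(theta_hat - theta0) / se)
    return p < alpha

def sigma_to_p(n_sigma):
    """One-tailed p-value corresponding to an excess of n_sigma standard deviations."""
    return norm.sf(n_sigma)

# Duality: the test rejects exactly when theta0 falls outside the interval.
print(ci_contains(0.0, 1.5, 1.0), test_rejects(0.0, 1.5, 1.0))   # True, False
print(ci_contains(0.0, 2.5, 1.0), test_rejects(0.0, 2.5, 1.0))   # False, True
print(sigma_to_p(5))   # ~2.9e-7: the one-tailed p-value behind "5 sigma"
```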
Such a hypothesis testing procedure can lead to direct conflict with Bayesian hypothesis testing methods, in which the posterior probability of the null hypothesis is calculated. Testing in high energy physics (for example testing for the existence of a new force of nature of unknown strength) unavoidably raises the famous “paradox”, as Lindley called it, when testing the hypothesis of a specific value θ0 (in particular θ0 = 0) of a parameter against a continuous set of alternatives θ. The different scaling of p-values and Bayes factors with sample size, described by Jeffreys and emphasized by Lindley, can lead the frequentist and the Bayesian to opposite inferences. Much of the literature on the Jeffreys-Lindley paradox, based on contexts rather different from high energy physics, might lead one to believe that it is not a serious problem. In contrast, I will give examples from high energy physics where sharp hypotheses such as θ = θ0 are taken seriously within the relevant approximations, and where the core physics models are laws of nature, not just convenient approximations of very limited domain. Thus the paradox must be confronted, and there remain important open issues regarding how best to formulate and communicate hypothesis tests in high energy physics. These issues have been discussed by physicists and statisticians for over a decade in the PhyStat series of workshops without being resolved.
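To see the scaling the paradox turns on, here is a hedged toy calculation (my own illustration, not part of the abstract) for the textbook setup: observations x_i ~ N(θ, σ²), H0: θ = 0, and a N(0, τ²) prior on θ under the alternative. Holding the observed z-statistic fixed while the sample size grows, the p-value stays constant while the Bayes factor swings toward the null:

```python
# Toy Jeffreys-Lindley illustration (all parameter values are made up).
# Model: x_i ~ N(theta, sigma^2); H0: theta = 0; H1: theta ~ N(0, tau^2).
# For an observed z-statistic z = xbar / (sigma / sqrt(n)), the Bayes factor
# in favor of the null is
#   B01 = sqrt(1 + r) * exp(-z^2/2 * r / (1 + r)),  with r = n * tau^2 / sigma^2.
import math

def bayes_factor_null(z, n, sigma=1.0, tau=1.0):
    r = n * tau**2 / sigma**2          # ratio of prior to sampling variance
    return math.sqrt(1.0 + r) * math.exp(-0.5 * z**2 * r / (1.0 + r))

z = 3.0   # a fixed "3 sigma" effect: one-tailed p ~ 0.0013 regardless of n
for n in (10, 1_000, 100_000, 10_000_000):
    print(n, bayes_factor_null(z, n))
# B01 grows roughly like sqrt(n): the same p-value comes to favor the null
# more and more strongly as the sample size increases.
```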
On July 4, 2012 the CMS and ATLAS collaborations at the Large Hadron Collider announced the discovery of the Higgs boson, the last remaining undiscovered piece in the Standard Model, the currently accepted theory of elementary particles. The question of the statistical criterion for a scientific discovery was prominent in both the scientific announcement and in the popular media. Both groups presented five-sigma effects, which, as Dennis Overbye reported in the New York Times, is “the gold standard in physics for a discovery” (New York Times website, 7/24/12). The BBC News website stated, “Both of the Higgs boson-hunting experiments see a level of certainty in their data worthy of a ‘discovery.’” The BBC further commented that, “Particle physics has an accepted definition for a discovery, a ‘five sigma’ or five standard deviation, level of certainty.” They noted that the likelihood of such an effect was the same as tossing more than 20 heads in a row. The grapevine reported that Rolf Heuer, the director-general of CERN, had told the groups that they could not claim a discovery unless each of them had a five-sigma effect.
In this paper I will look at the history of high-energy physics from the 1960s to the present to trace the evolution of this statistical criterion. In the early 1960s there was essentially no such criterion. By the late 1960s a three-standard-deviation criterion was established. This criterion gradually changed from three to four to the currently accepted five standard deviations. The “five-sigma” rule is enforced both by journals and by the experimental groups themselves. This history will demonstrate that the use of standard deviations is not a mechanical application of a statistical formula, but demands knowledge, craft, and judgment. Questions have also been raised concerning the appropriate statistical formulas to use. How does one deal appropriately with statistical noise? Episodes in which a statistically significant effect was initially seen but later disappeared (an unlikely event on statistical grounds), as well as the recent discovery of the Higgs boson, will be discussed.
The data analysis and modeling in HEP employ multiple methods, both Bayesian and frequentist. But the report in terms of standard deviation units relates to statistical significance testing and p-values, leading to questions about whether the Higgs results inherit some of the problems associated with these methods. I argue that the use of statistical methods in the Higgs experiments illuminates how these tools can, and often do, work validly.
A rough sketch of the Higgs statistics: There is a statistical model of the detector, within which researchers define a “global signal strength” parameter such that H0: μ = 0 corresponds to the “background only” hypothesis, and μ = 1 to the Standard Model (SM) Higgs boson signal in addition to the background. The statistical test records differences in the positive direction, in standard deviation or sigma units. The improbability of an excess as large as 5-sigma alludes to the sampling distribution associated with such signal-like results or “bumps”, fortified with much cross-checking. In particular, the probability of observing a result as extreme as 5 sigmas, under the assumption it was generated by background alone, that is, under H0, is approximately 1 in 3,500,000. Some put this as: the probability that the results were just a statistical fluke is 1 in 3,500,000.
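The tail-area reasoning can be mimicked with a toy counting experiment (purely illustrative; the actual analyses use detailed detector simulations and profile-likelihood-ratio statistics, and the background and observed counts below are invented). Under the background-only hypothesis H0: μ = 0 we expect b events, and we ask how improbable an excess at least as large as the observed one would be, converting that tail probability to sigma units:

```python
# Toy counting-experiment sketch (hypothetical numbers, not the Higgs analysis).
from scipy.stats import poisson, norm

b = 100.0      # expected background count under H0: mu = 0 (made up)
n_obs = 155    # observed count (made up)

p_value = poisson.sf(n_obs - 1, b)   # P(N >= n_obs | background alone)
z = norm.isf(p_value)                # equivalent number of one-tailed sigmas

print(p_value, z)   # roughly a 5-sigma excess for these toy numbers --
                    # the kind of tail probability often glossed as
                    # "the chance it is just a statistical fluke"
```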
Many critics have claimed that this fallaciously applies the probability to H0 itself—a posterior probability in H0. A careful look shows this is not so. H0 does not say the observed results are due to background alone, although were H0 true (about what’s generating the data), it follows that various results would occur with specified probabilities. The probability is assigned to “observing such large or larger bumps (at both sites)” supposing they are due to background alone. These computations are based on simulating what it would be like under H0 (given a detector model). Now the inference actually detached from the evidence is something like:
There is strong evidence for (or they have experimentally demonstrated) H: a Higgs (or a Higgs-like) particle.
Granted, this inference relies on an implicit principle of evidence:
Data provide evidence for rejecting H0 (just) to the extent that H0 would (very probably) have survived, were it a reasonably adequate description of the process generating the data (with respect to the question).
This is a variant of the severe or stringent testing requirement for evidence.
Here, with probability .9999997, the test would generate less impressive bumps than those obtained, under H0. So, very probably H0 would have survived, were μ = 0.
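As a quick numerical check of that claim (using the same one-tailed normal mapping as above, not the experiments’ own machinery):

```python
# Under H0 (mu = 0), results less impressive than a 5-sigma excess occur with
# probability 1 - P(Z >= 5); the test would very probably not have produced
# such impressive bumps were the background-only hypothesis true.
from scipy.stats import norm

p_5sigma = norm.sf(5)     # ~2.87e-7
print(1 - p_5sigma)       # ~0.9999997
```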
Some cases do commit the fallacy of “transposing the conditional”, moving from a low significance level to a low posterior probability for the null; but in many other cases, what’s going on is precisely as in the case of the Higgs. The difference is as subtle as it is important, and calls for philosophical illumination.
The announcements by the ATLAS and CMS collaborations of the observation of a Higgs-like boson prompted much discussion of their reliance on a statistical threshold of a five-standard-deviation excess for the announcement of a discovery in particle physics. Just how ironclad a requirement this five-sigma standard (5SS) is, and exactly what role it plays, are not simple questions, but in this paper I wish to focus on one possible interpretation of the 5SS and its relation to a long-standing debate within philosophy of science over the role of value judgments in scientific inquiry.
The argument from inductive risk (AIR), articulated by Richard Rudner in 1953 and C. West Churchman in 1948, and revisited in recent years by Heather Douglas and others, seeks to show that in some cases the conclusion that a scientist draws from her data may properly be influenced by judgments concerning non-epistemic values, such as moral or economic values.
Discussions of the AIR commonly draw on the statistical theories of Neyman and Pearson and Wald, which require the investigator to designate a critical value for the acceptance of an alternative hypothesis and the corresponding rejection of the null hypothesis, thus setting the error probabilities of the test. The investigator, it is argued, may legitimately consider the consequences of different possible errors, including moral or economic consequences, as grounds for the choice of critical threshold, which may in turn make a difference to the outcome of the inference.
The methodological requirements of the AIR are simply that scientists engage in hypothesis acceptance, that they must choose among competing methods for deciding whether to accept a hypothesis, and that different decision methods yield different probabilities for the erroneous acceptance of hypotheses under consideration. Any method that imposes a “standard of evidence” criterion for hypothesis acceptance will qualify, and most discussions of 5SS treat it as just such a criterion.
This talk will examine the role of the 5SS in the Higgs discovery from the perspective of Churchman’s pragmatism as a test case for some of the assumptions about the argument from inductive risk made by both its defenders and its critics. Of particular interest is Isaac Levi’s critical argument, which regards the AIR as premised upon a “behavioristic” view of statistical inference. Levi argued that the AIR did not apply to inquiries that constitute “attempts to seek the truth and nothing but the truth.” Such inquiries immunize statistical inference against the threat posed by the AIR of reducing “the theoretical aspects of science to technology and policy making.” In response to this line of criticism, the nuances of the 5SS show how Churchman’s pragmatism (which, contra Levi, is not reductively behaviorist) can help make sense of the relevance of the broader purposes of the particle physics community to their adoption of the 5SS without reducing their enterprise to “technology and policy-making”.
PSA site for full program: http://www.philsci.org/
Search this blog for several posts on the Higgs.