The biennial meeting of the Philosophy of Science Association (PSA) starts this week (Nov. 6-9) in Chicago, together with the History of Science Society. I’ll be part of the symposium:
Summary
“A 5 sigma effect!” is how the recent Higgs boson discovery was reported. Yet before the dust had settled, the very nature and rationale of the 5 sigma (or 5 standard deviation) discovery criterion began to be challenged and debated, both among scientists and in the popular press. Why 5 sigma? How is it to be interpreted? Do p-values in high-energy physics (HEP) avoid the controversial uses and misuses of p-values seen in the social and other sciences? The goal of our symposium is to combine the insights of philosophers and scientists whose work interrelates philosophy of statistics, data analysis, and modeling in experimental physics, with critical perspectives on how discoveries proceed in practice. Our contributions will link questions about the nature of statistical evidence, inference, and discovery with questions about the very creation of standards for interpreting and communicating statistical experiments. We will bring out some unique aspects of discovery in modern HEP. We will also show how the episode illuminates some of the thorniest issues revolving around statistical inference, frequentist and Bayesian methods, and the philosophical, technical, social, and historical dimensions of scientific discovery.
Questions:
1) How do philosophical problems of statistical inference interrelate with debates about inference and modeling in high energy physics (HEP)?
2) Have standards for scientific discovery in particle physics shifted? And if so, how has this influenced when a new phenomenon is “found”?
3) Can understanding the roles of statistical hypotheses tests in HEP resolve classic problems about their justification in both physical and social sciences?
4) How do pragmatic, epistemic and non-epistemic values and risks influence the collection, modeling, and interpretation of data in HEP?
Abstracts for Individual Presentations
(1) Unresolved Philosophical Issues Regarding Hypothesis Testing in High Energy Physics
Robert D. Cousins.
Professor, Department of Physics and Astronomy, University of California, Los Angeles (UCLA)
The discovery and characterization of a Higgs boson in 2012-2013 provide multiple examples of statistical inference as practiced in high energy physics (elementary particle physics). The main methods employed have a decidedly frequentist flavor, drawing in a pragmatic way on both Fisher’s ideas and the Neyman-Pearson approach. A physics model being tested typically has a “law of nature” at its core, with parameters of interest representing masses, interaction strengths, and other presumed “constants of nature”. Additional “nuisance parameters” are needed to characterize the complicated measurement processes. The construction of confidence intervals for a parameter of interest θ is dual to hypothesis testing, in that the test of the null hypothesis θ=θ0 at significance level (“size”) α rejects θ0 exactly when θ0 lies outside a confidence interval for θ with confidence level (CL) equal to 1-α. With CL or α specified in advance (“pre-data”), frequentist coverage properties can be assured, at least approximately, although nuisance parameters bring in significant complications. With data in hand, the post-data p-value can be defined as the smallest significance level α at which the null hypothesis would be rejected, had that α been specified in advance. Carefully calculated p-values (not assuming normality) are mapped onto the equivalent number of standard deviations (“σ”) in a one-tailed test of the mean of a normal distribution. For a discovery such as the Higgs boson, experimenters report both p-values and confidence intervals of interest.
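For readers who want to see the duality and the p-value-to-sigma mapping concretely, here is a minimal numerical sketch, assuming a single Gaussian measurement with known standard error; the numbers and variable names are illustrative, not taken from the analyses Cousins describes:

```python
# Minimal sketch (illustrative, not from the abstract): one Gaussian measurement
# with known standard error, showing (i) the one-tailed p-value <-> sigma mapping
# and (ii) the duality between testing theta = theta0 at size alpha and checking
# whether theta0 lies inside the 1 - alpha confidence interval.
from scipy.stats import norm

def sigma_to_p(z):
    """One-tailed p-value for a z-sigma excess."""
    return norm.sf(z)                       # survival function, 1 - CDF

def p_to_sigma(p):
    """Equivalent number of standard deviations for a one-tailed p-value."""
    return norm.isf(p)                      # inverse survival function

# Toy measurement: estimate theta_hat with known standard error se
theta_hat, se, theta0, alpha = 2.5, 1.0, 0.0, 0.05

# Two-sided test of H0: theta = theta0
z = (theta_hat - theta0) / se
p_two_sided = 2 * norm.sf(abs(z))
reject = p_two_sided < alpha

# Dual 1 - alpha confidence interval
half_width = norm.isf(alpha / 2) * se
ci = (theta_hat - half_width, theta_hat + half_width)
theta0_outside_ci = not (ci[0] <= theta0 <= ci[1])

print(f"z = {z:.2f}, two-sided p = {p_two_sided:.4f}")
print(f"reject H0 at alpha = {alpha}: {reject}; theta0 outside 95% CI: {theta0_outside_ci}")
print(f"5 sigma one-tailed p-value: {sigma_to_p(5):.3e}")
print(f"p = 2.9e-7 maps back to   : {p_to_sigma(2.9e-7):.2f} sigma")
```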
Such a hypothesis testing procedure can lead to direct conflict with Bayesian hypothesis testing methods, in which the posterior probability of the null hypothesis is calculated. Testing in high energy physics (for example, testing for the existence of a new force of nature of unknown strength) unavoidably raises the famous “paradox”, as Lindley called it, when testing the hypothesis of a specific value θ0 (in particular θ0=0) of a parameter against a continuous set of alternatives θ. The different scaling of p-values and Bayes factors with sample size, described by Jeffreys and emphasized by Lindley, can lead the frequentist and the Bayesian to opposite inferences. Much of the literature on the Jeffreys-Lindley paradox, based on contexts rather different from high energy physics, might lead one to believe that it is not a serious problem. In contrast, I will give examples from high energy physics where sharp hypotheses such as θ=θ0 are taken seriously within the relevant approximations, and where the core physics models are laws of nature, not just convenient approximations of very limited domain. Thus the paradox must be confronted, and there remain important open issues regarding how best to formulate and communicate hypothesis tests in high energy physics. These issues have been discussed by physicists and statisticians for over a decade in the PhyStat series of workshops without being resolved.
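The scaling Lindley emphasized is easy to reproduce in a toy setting. The sketch below uses my own illustrative assumptions (a normal mean with known variance and a normal prior under the alternative, nothing taken from the talk): hold the z-score, and hence the p-value, fixed while the sample size grows, and the Bayes factor swings toward the point null.

```python
# Toy Jeffreys-Lindley illustration (my own illustrative setup, not from the talk):
# test H0: mu = 0 against H1: mu ~ Normal(0, tau^2), given n observations with
# known sigma. For a FIXED z-score (hence a fixed p-value), the Bayes factor in
# favor of the point null grows without bound as n increases.
import numpy as np
from scipy.stats import norm

def bayes_factor_01(z, n, sigma=1.0, tau=1.0):
    """BF_01 = m(xbar | H0) / m(xbar | H1), with xbar = z * sigma / sqrt(n)."""
    xbar = z * sigma / np.sqrt(n)
    m0 = norm.pdf(xbar, loc=0.0, scale=sigma / np.sqrt(n))              # marginal under H0
    m1 = norm.pdf(xbar, loc=0.0, scale=np.sqrt(sigma**2 / n + tau**2))  # marginal under H1
    return m0 / m1

z = 3.0                                   # two-sided p ~ 0.0027, held fixed throughout
for n in [10, 1_000, 100_000, 10_000_000]:
    print(f"n = {n:>10,}   two-sided p = {2 * norm.sf(z):.4f}   "
          f"BF_01 = {bayes_factor_01(z, n):.2f}")
# The p-value never changes, yet BF_01 moves from favoring H1 to strongly favoring H0.
```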
(2) The Rise of the Sigmas
Allan Franklin.
Professor, Department of Physics, University of Colorado
On July 4, 2012 the CMS and ATLAS collaborations at the Large Hadron Collider announced the discovery of the Higgs boson, the last remaining undiscovered piece in the Standard Model, the currently accepted theory of elementary particles. The question of the statistical criterion for a scientific discovery was prominent in both the scientific announcement and in the popular media. Both groups presented five-sigma effects, which is, as Dennis Overbye reported in the New York Times, “the gold standard in physics for a discovery” (New York Times website, 7/24/12). The BBC News website stated, “Both of the Higgs boson-hunting experiments see a level of certainty in their data worthy of a ‘discovery.’” The BBC further commented that, “Particle physics has an accepted definition for a discovery, a ‘five sigma’ or five standard deviation, level of certainty.” They noted that the likelihood of such an effect arising by chance was the same as tossing more than 20 heads in a row. The grapevine reported that Rolf Heuer, the director-general of CERN, had told the groups that they could not claim a discovery unless each of them had a five-sigma effect.
In this paper I will look at the history of high-energy physics from the 1960s to the present to trace the evolution of this statistical criterion. In the early 1960s there was essentially no such criterion. By the late 1960s a three-standard-deviation criterion was established. This criterion gradually changed from three to four to the currently accepted five standard deviations. The “five-sigma” rule is enforced both by journals and by the experimental groups themselves. This history will demonstrate that the use of standard deviations is not a mechanical application of a statistical formula, but demands knowledge, craft, and judgment. Questions have also been raised concerning the appropriate statistical formulas to use. How does one deal appropriately with the statistical noise? I will discuss episodes in which a statistically significant effect was initially seen but later disappeared, an unlikely event on purely statistical grounds, as well as the recent discovery of the Higgs boson.
(3) Statistical flukes, the Higgs discovery, and 5 sigma
Deborah G. Mayo.
Professor, Department of Philosophy, Virginia Tech
The data analysis and modeling in HEP employ multiple methods, both Bayesian and frequentist. But the report in terms of standard deviation units relates to statistical significance testing and p-values, leading to questions about whether the Higgs results inherit some of the problems associated with these methods. I argue that the use of statistical methods in the Higgs experiments illuminates how these tools can, and often do, work validly.
A rough sketch of the Higgs statistics: There is a statistical model of the detector, within which researchers define a “global signal strength” parameter such that H0: μ = 0 corresponds to the “background only” hypothesis, and μ = 1 to the Standard Model (SM) Higgs boson signal in addition to the background. The statistical test records differences in the positive direction, in standard deviation or sigma units. The improbability of an excess as large as 5-sigma alludes to the sampling distribution associated with such signal-like results or “bumps”, fortified with much cross-checking. In particular, the probability of observing a result as extreme as 5 sigmas, under the assumption it was generated by background alone, that is, under H0, is approximately 1 in 3,500,000. Some put this as: the probability that the results were just a statistical fluke is 1 in 3,500,000.
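A quick check of the quoted figures (my own arithmetic, using the one-tailed Gaussian convention described in Cousins’ abstract above), which also recovers the BBC’s coin-toss comparison:

```python
# Quick check of the quoted figures (my own arithmetic, one-tailed Gaussian
# convention): the tail area beyond 5 sigma under the background-only H0,
# its complement, and the equivalent run of fair-coin heads.
from math import log2
from scipy.stats import norm

p_5sigma = norm.sf(5)                       # P(excess >= 5 sigma | H0)
print(f"P(>= 5 sigma | H0)  = {p_5sigma:.3e}  ~ 1 in {1 / p_5sigma:,.0f}")
print(f"complement          = {1 - p_5sigma:.7f}")   # the 0.9999997 used below
print(f"equivalent coin run = {log2(1 / p_5sigma):.1f} heads in a row")
```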
Many critics have claimed that this fallaciously applies the probability to H0 itself—a posterior probability in H0. A careful look shows this is not so. H0 does not say the observed results are due to background alone, although were H0 true (about what’s generating the data), it follows that various results would occur with specified probabilities. The probability is assigned to “observing such large or larger bumps (at both sites)” supposing they are due to background alone. These computations are based on simulating what it would be like under H0 (given a detector model). Now the inference actually detached from the evidence is something like:
There is strong evidence for (or they have experimentally demonstrated) H: a Higgs (or a Higgs-like) particle.
Granted, this inference relies on an implicit principle of evidence:
Data provide evidence for rejecting H0 (just) to the extent that H0 would (very probably) have survived, were it a reasonably adequate description of the process generating the data (with respect to the question).
This is a variant of the severe or stringent testing requirement for evidence.
Here, with probability .9999997, the test would generate less impressive bumps than those obtained, under H0. So, very probably H0 would have survived, were μ = 0.
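To make the “simulating what it would be like under H0” remark concrete, here is a toy pseudo-experiment version with a single counting channel and a known expected background. This is my own toy detector model, not the actual ATLAS/CMS machinery, and it uses a 3 sigma threshold so that a modest number of toys suffices:

```python
# A toy version of "simulating what it would be like under H0" (my own toy
# detector model, not the ATLAS/CMS analysis): background-only pseudo-experiments
# in a single counting channel with known expected background b. A 3 sigma
# threshold is used so a modest number of toys suffices; 5 sigma is the same
# idea with of order 10^7 or more toys.
import numpy as np
from scipy.stats import norm, poisson

rng = np.random.default_rng(0)
b = 100.0                                   # expected background count (illustrative)
n_toys = 2_000_000

counts = rng.poisson(b, size=n_toys)        # background-only "experiments"
p_values = poisson.sf(counts - 1, b)        # P(N >= n_obs | background only)
z_scores = norm.isf(p_values)               # express each p-value in sigma units

threshold = 3.0
frac = np.mean(z_scores >= threshold)
print(f"fraction of background-only toys at >= {threshold} sigma: {frac:.2e}")
print(f"Gaussian tail area norm.sf({threshold})               : {norm.sf(threshold):.2e}")
# The two agree roughly; the discreteness of the counts accounts for the gap.
```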
Some cases do commit the fallacy of “transposing the conditional” from a low significance level to a low posterior probability for the null; but in many other cases, what’s going on is precisely as in the case of the Higgs. The difference is as subtle as it is important, and calls for philosophical illumination.
(4) Inductive Risk and the Higgs Boson
Kent Staley.
Associate Professor, Department of Philosophy, Saint Louis University
The announcements by the ATLAS and CMS collaborations of the observation of a Higgs-like boson prompted much discussion of their reliance on a statistical threshold of a five-standard-deviation excess for the announcement of a discovery in particle physics. Just how ironclad a requirement this five-sigma standard (5SS) is, and exactly what role it plays, are not so simple; in this paper I wish to focus on one possible interpretation of the 5SS and its relation to a long-standing debate within philosophy of science over the role of value judgments in scientific inquiry.
The argument from inductive risk (AIR), articulated by Richard Rudner in 1953 and C. West Churchman in 1948, and revisited in recent years by Heather Douglas and others, seeks to show that in some cases the conclusion that a scientist draws from her data may properly be influenced by judgments concerning non-epistemic values, such as moral or economic values.
Discussions of the AIR commonly draw on the statistical theories of Neyman and Pearson and Wald, which require the investigator to designate a critical value for the acceptance of an alternative hypothesis and the corresponding rejection of the null hypothesis, thus setting the error probabilities of the test. The investigator, it is argued, may legitimately consider the consequences of different possible errors, including moral or economic consequences, as grounds for the choice of critical threshold, which may in turn make a difference to the outcome of the inference.
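The methodological point can be put schematically (toy numbers of my own, not from Staley’s paper): the same observed excess is or is not accepted as a discovery depending on which critical threshold the investigator adopts, and that choice is exactly where weightings of the different error consequences can enter.

```python
# Schematic illustration of the inductive-risk point (toy numbers, my own):
# the same observed excess clears some critical thresholds and not others, and
# the choice among thresholds reflects how the costs of the two error types
# (false discovery vs missed discovery) are weighed.
from scipy.stats import norm

z_observed = 4.2                            # an illustrative excess

thresholds = {
    "alpha = 0.05, one-sided (z ~ 1.64)": norm.isf(0.05),
    "3 sigma ('evidence' convention)":    3.0,
    "5 sigma (discovery convention)":     5.0,
}

for label, z_crit in thresholds.items():
    decision = "reject H0" if z_observed >= z_crit else "do not reject H0"
    print(f"{label:<36} z_crit = {z_crit:.2f} -> {decision}")
# Same data, different standards of evidence, different conclusions: the choice
# of threshold is where non-epistemic considerations can enter, per the AIR.
```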
The methodological requirements of the AIR are simply that scientists engage in hypothesis acceptance, that they must choose among competing methods for deciding whether to accept a hypothesis, and that different decision methods yield different probabilities for the erroneous acceptance of hypotheses under consideration. Any method that imposes a “standard of evidence” criterion for hypothesis acceptance will qualify, and most discussions of 5SS treat it as just such a criterion.
This talk will examine the role of the 5SS in the Higgs discovery from the perspective of Churchman’s pragmatism as a test case for some of the assumptions about the argument from inductive risk made by both its defenders and its critics. Of particular interest is Isaac Levi’s critical argument, which regards the AIR as premised upon a “behavioristic” view of statistical inference. Levi argued that the AIR did not apply to inquiries that constitute “attempts to seek the truth and nothing but the truth.” Such inquiries immunize statistical inference against the threat posed by the AIR of reducing “the theoretical aspects of science to technology and policy making.” In response to this kind of critic, the nuances of the 5SS show how Churchman’s pragmatism (which, contra Levi, is not reductively behaviorist) can help make sense of the relevance of the broader purposes of the particle physics community to their adoption of the 5SS without reducing their enterprise to “technology and policy-making”.
PSA site for full program: http://www.philsci.org/
Search this blog for several posts on the Higgs.
Will the symposium consist of the panel reading submitted papers or giving presentations? And either way, will the materials be available at some point for non-attendees: video, slides, proceedings, discussion transcripts?
West: I can likely arrange to post them on my blog after the conference–good idea!
A technical question I hope the panel members address: At what point can an alternative hypothesis, with its additional parameters, be rejected in favor of the simpler null? In the context of particle physics (and put in intentionally provocative terms), how many null results do there have to be before people stop taking certain BSM scenarios seriously?
West: Good question. It brings up the fact that theoretical issues loom large here. It makes for a very interesting contrast with other fields in how failures to reject nulls are handled. The error statistical explanation is that it is known that there are rival theories (rivals to the SM) that current experiments could not distinguish between.
So your error statistical method does not provide any guidance on when to stop testing a null after repeated failures to reject it? There will always be corners of parameter space that remain untouched, but that is a result of the additional degrees of freedom in the alternative.
West: It gives guidance as to what may be inferred with warrant, and what may not. Where non-trivial discrepancies from what a theory T predicts about observable properties could not have been discriminated by existing tests, it makes sense to deny that we may infer T (in full).
I think we may be talking past each other at this point.
Anon –
[Mayo, pardon if this is all off track, as usual!]
I assume you’re making a distinction in the sense of Jaynes. I read a bit of him a while ago when I was an undergrad and have to say I enjoyed it, though he was a bit much at times. I moved on somewhat for various reasons (which are not necessarily good ones) and never really got to try out his sort of approach properly. It’s been interesting to see it all get quite popular in recent times and I have been giving it another go lately. I still have some concerns but figure I need to get some more hands-on experience to really see.
I largely subscribe to a ‘meaning is in the use’ sort of approach, so it might be entirely reasonable to use ‘non-frequency probabilities’. I’m still just pretty fuzzy on how it all plays out exactly. Do you interpret these in some information-theoretic way? I wonder why emphasise probability distributions at all? The Laurie Davies comment from the other day was interesting, for example, and raised quite a different perspective [Mayo – any chance of a guest post??].
PS
I re-read this paper recently – ‘Prior information and ambiguity in inverse problems’ (bayes.wustl.edu/etj/articles/ambiguity.pdf) which I enjoyed. I’ve also been interested to look back over some snippets of mutual positive reference between him and someone I suspect you might call the ‘best philosopher of the 20th century’. I wonder how much they communicated? Did they ever attempt to reconcile their approaches or work together? I know they both liked Gibbs.
omaclaren: I’ve no idea why comments are requiring approval, as I’ve unchecked that box.
Oops! I meant that for another post! This one – https://errorstatistics.com/2014/10/31/oxford-gaol-statistical-bogeymen-3
omaclaren,
There’s not much point discussing Jaynes around here. This blog is in a kind of timewarp (circa 1975) in which Bayesians are either the subjective or default (ignorance) prior type, or some flavor thereof. Jaynes has repeatedly been lumped in with “default” Bayesians even though he was the extreme opposite of that and definitely never used any kind of “default” rationale to justify his distributions.
With that kind of lack of basic comprehension, communication is impossible.
I’ll say this though. The easiest step one can make to get insight into the foundations of statistics is to simply call every frequency distribution a “frequency distribution”. That basic mental hygiene of calling frequency distributions what they are does wonders for all concerned.
Probability distributions are always for modeling uncertainty in datums. Datums can be any real quantity such as the speed of light, my last bowling score, or the frequency of something. Frequencies are datums: i.e. real physical things that exist in the real world. Probabilities model uncertainty about datums; they are man-made descriptions which reflect how much is known (or assumed) about the real datums.
In some instances where we model uncertainty about some frequency, it turns out that probabilities are approximately equal to frequencies. This isn’t true in general though, even when we’re specifically concerned with modeling frequencies. And it’s never true if the datum in question isn’t a frequency.
Like I said, calling frequencies “frequencies” and leaving the word “probability” for probabilities does wonders.
Anon: Scarcely a throwback. It’s only in the past few years that it’s been dawning on people that traditional Bayesian foundations are questionable. Even Bayes updating as a normative principle has largely been abandoned. Subjective elicitation is mostly gone; reference priors that maximize the influence of the data are now the most appealing option.
Uncertainty for you is a measure of what?
I wouldn’t mind separating formal probabilities from frequencies: in the proper subset of science wherein formal statistics enters, we are keen to critically assess and control misleading interpretations of data. Probability models are relevant only to the extent that they further this goal via measures of hypothetical relative frequencies. We could do what we need with capabilities and frequencies–hence the popularity of resampling and bootstrapping.
But do tell us what uncertainty means for you.
“Uncertainty” means exactly what scientists always mean by uncertainty. The mass of the electron for example is:
9.10938291*10^(-31) kg +/- 0.00000040*10^(-31) kg
Notice the uncertainty attached to the number.
In general, information A doesn’t determine a datum x exactly. So based on A, there is a range of more or less consistent or potential values of x which could be true. That range defines the “uncertainty”. P(x|A) is just the formal specification of this uncertainty range. If the range is big, the uncertainty is large; if it’s small, the uncertainty is small.
The more important point is what’s the goal of modeling uncertainty with P(x|A)?
The goal is to find which claims are insensitive to the uncertainty implied by A. That is to say, A leaves many possibilities for the true x, so which claims are going to be true for almost all those possibilities?
If P(H|A) = .9999 that means almost every reasonable possibility for the true x in that range of uncertainty makes H true. That’s the basis for having confidence in H given A and the probability is a direct measure of that confidence. If 100% of the possible values for the true x make H true then it’s true with certainty.
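As a small numerical sketch of this reading (my own construction, not something Anon supplied): given a distribution P(x|A) over the unknown x, P(H|A) is just the share of that uncertainty on which the claim H holds.

```python
# Small sketch of the reading above (my construction, not Anon's code):
# P(x|A) encodes which values of x remain possible given information A, and
# P(H|A) is the probability mass of that range on which the claim H holds.
import numpy as np
from scipy.stats import norm

# Suppose information A constrains x to roughly 9.0 +/- 0.5 (arbitrary units)
p_x_given_A = norm(loc=9.0, scale=0.5)

# The claim H: "x exceeds 7.5"
H = lambda x: x > 7.5

# Monte Carlo estimate of P(H|A): draw from P(x|A), take the fraction where H holds
rng = np.random.default_rng(1)
draws = p_x_given_A.rvs(size=1_000_000, random_state=rng)
p_H_given_A = np.mean(H(draws))

print(f"P(H|A) ~ {p_H_given_A:.4f}")          # Monte Carlo estimate
print(f"exact    {p_x_given_A.sf(7.5):.4f}")  # P(x > 7.5), close to 0.9987
```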
The sum and product rules of probability (which include Bayes’ theorem) are the unique correct rules for manipulating these uncertainties. Which is why ad hoc methods which deviate from them always have serious failings, even if they seem to work sometimes.
The rationale here is not wildly different from frequentist intuition. That’s because frequentism (at least the parts that make sense) is a special case of this. This is far more general than frequencies though. The problem with frequentism isn’t that it’s wrong as far as it goes. The problem is that frequentists claim their special case is the general case. It isn’t.
Incidentally, this isn’t some new foundation. This was the original viewpoint of Bernoulli, Laplace, Keynes, Jeffreys, and Jaynes. It was around at the start and has always been around. It’s frequentism and subjective Bayes that are the redheaded stepchildren of statistics.
Anon: You do realize that a significant portion of the physics community, particularly those who work in particle physics, describe parameter “uncertainty” using confidence and not credible intervals, right? Ask some astronomers and you are likely to get a different answer. Appealing to “what scientists always mean by uncertainty” is therefore problematic because … opinions differ.
Also the invocation of the precision measurement of the electron’s rest mass is irrelevant to the problem of “how to define uncertainty.”
West:
I’m glad you brought this up. The use of “uncertainty” by Anon and others tends to have “bubbles” of uncertainty around it: we can grasp it as a metaphor that often gets concreteness in terms of error bars and the like. I have no objection to the metaphor; it’s scarcely so different from things like confidence intervals. Nor would I have a problem with his extraordinarily useful idea that ‘H is probable given A’ means ‘H is insensitive to the residual uncertainty inherent in A’. This is the type of claim that we make informally all the time. I think it’s a difference of goals that might divide us, and my goals are certainly not in line with those of the typical frequentist. Jaynes talks the falsificationist’s talk, but where are the rules for falsifying and pinpointing blame correctly? Improbability is not enough to do justice to the idea. The severe testing account and corresponding philosophy of science can at least go a fair way in describing methods to answer these queries.
Anon: “So based on A, there is a range of more or less consistent or potential values of x which could be true. That range defines the ‘uncertainty’. P(x|A) is just the formal specification of this uncertainty range.” That does not seem right: you are interested in a “true” value of x, but are going to model it as a random variable? Didn’t Venn, Fisher, and many others flag that problem going back to the 19th century?
@john Mixed effects models are popular with frequentist practitioners in cases where the “random effect” has no randomization to speak of (eg repeated measurements), likely because it’s a backdoor excuse for Bayesian shrinkage estimation 🙂
@anon
I’m sympathetic to your viewpoint of probability as a mathematical model of underdetermination, something that I’ve found to be inadequately addressed by the usual application of hypothesis testing, particularly for data from non-randomized conditions (Mayo would probably say that severity addresses underdetermination).
Let me play the devil’s advocate here though.
“P(H|A) = .9999 that means almost every reasonable possibility for the true x in that range of uncertainty makes H true.”
If I were to make such a statement about a credible interval and the true value turned out to be outside that interval 99.9999% of the time, could the statement still be considered both valid and useful?
If not, wouldn’t that imply that, even if we aren’t rigid about calibration, there’s implicitly some approximate relation between a notion of a correct probability model and calibration?
I think I know one argument you could make, which is that you can throw frequencies out the window and broaden your prior probabilities arbitrarily; so long as you do a full marginalization when computing, the resulting intervals would encompass the true value at or above the stated credible probability.
However, this only works in one direction: broadening the space of underdetermination. I can’t arbitrarily shrink my prior probabilities and still make reasonable statements. This would imply that while there isn’t an equality constraint between your notion of probability and the calibrated model, there is a notion of correctness that still involves at least a directional relevance of the calibrated model to the correct application of your model, right?
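For what it’s worth, the directional point is easy to check in a toy simulation (my own setup, offered only as a sketch): with a Gaussian location parameter whose true values are generated from a unit-normal distribution, credible intervals built from a prior at least as broad as the truth keep roughly their nominal frequentist coverage, while intervals built from an artificially shrunk prior can badly undercover.

```python
# Simulation sketch of the directional claim (my own toy setup): a Gaussian
# location parameter theta whose true values are generated from N(0, 1), one
# observation x ~ N(theta, 1) per trial, and conjugate-normal 95% credible
# intervals computed under priors of different widths. A prior at least as
# broad as the truth keeps roughly nominal coverage; a shrunk prior does not.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(42)
n_trials, sigma, cred = 200_000, 1.0, 0.95
z = norm.isf((1 - cred) / 2)

theta = rng.normal(0.0, 1.0, n_trials)      # true values (tau_true = 1)
x = rng.normal(theta, sigma)                # one measurement per trial

for tau_prior in [1.0, 10.0, 0.1]:          # correct, broadened, shrunk prior
    post_var = (tau_prior**2 * sigma**2) / (tau_prior**2 + sigma**2)
    post_mean = x * tau_prior**2 / (tau_prior**2 + sigma**2)
    half = z * np.sqrt(post_var)
    covered = np.mean(np.abs(theta - post_mean) <= half)
    print(f"prior sd = {tau_prior:>4}: {cred:.0%} credible intervals cover the truth "
          f"{covered:.1%} of the time")
```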