We had an excellent discussion at our symposium yesterday: “**How Many Sigmas to Discovery? Philosophy and Statistics in the Higgs Experiments**” with Robert Cousins, Allan Franklin and Kent Staley. Slides from my presentation, “Statistical Flukes, the Higgs Discovery, and 5 Sigma,” are posted below (we each only had 20 minutes, so this is clipped, but much came out in the discussion). Even the challenge I read about this morning as to what exactly the Higgs researchers discovered (and I’ve no clue if there’s anything to the idea of a “techni-higgs particle”) would not invalidate* the knowledge of the experimental effects severely tested.

*Although, as always, there may be a reinterpretation of the results. But I think the article is an isolated bit of speculation. I’ll update if I hear more.

The last time the issue of the p-value police came up, it was in the context of a larger argument. I wrote an objection that gave the misimpression that I was objecting to all of the elements of the argument, which led to a great deal of talking past one another. This time I won’t make that mistake.

The claim is that the phrase “there is less than a 1 in 3.5 million chance that the results are a statistical fluke” describes an ordinary error probability. (Certainly the particular numbers quoted are the result of an ordinary p-value calculation, but that’s beside the point.) It requires an enormous twisting of words to make this claim. Let’s examine the part of the construction to which a probability is being ascribed: “the results are a statistical fluke”. A simple and direct reading of this phrase is that it is tantamount to “the null hypothesis is true”. The results being referred to are actually in hand, and those specific results either are (i.e., frequentist probability one), or are not (i.e., frequentist probability zero), a statistical fluke, in the sense that an indefinite repetition of the experiment would see the “bump” disappear. On slide 15 we find the claim that the null hypothesis “does not say that the observed results… are flukes”, but if we hold that the null hypothesis is true, then that is *exactly* what we have to conclude about the observed results! (And on slide 16 we see the substitution of the event “d(X) > 5” for the actual event proper to the p-value calculation, to wit, “d(X) > d(x_obs)”, which supports the argument by dropping “x_obs” from the picture.)

So in my view, the p-value police have the right of it. Insofar as there is some sensible and reasonable thought behind the reported phrase, that thought is best understood as a posterior probability.

Corey: That Ho entails it doesn’t make it equivalent to it. It is not. Moreover, one needs to see how the term is being used in context. (That is, I can imagine “fluke” being defined differently, but in all the HEP lit, I see it used this way.) Take a look at how Overbye describes the appraisal of bumps according to the probability of their being spurious, i.e., the probability of the background rising up to form bumps as large as these. As a matter of fact, the p-value police approve of those, e.g.,

Spiegelhalter, in a quote I cut for length, gives thumbs up to:

“CMS observes an excess of events at a mass of approximately 125 GeV with a statistical significance of five standard deviations (5 sigma) above background expectations. The probability of the background alone fluctuating up by this amount or more is about one in three million.”

I can see how one may equivocate; but one need not and should not. We are always giving probabilities from the sampling distribution (under some hypothesis), and we are always viewing the data as a generic event when assigning it a probability (under some hypothesis). Of course, as I emphasize, there’s an implicit premise that subsequently gets you to the inference to H*. (It’s the severity principle.)
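As a numerical aside (a sketch of my own, not from the slides): the “one in three million”-type figures are just the upper-tail area of a standard normal at 5 sigma, computable with the Python standard library alone:

```python
from math import erfc, sqrt

def upper_tail(z):
    """P(Z > z) for a standard normal Z, via the complementary error function."""
    return 0.5 * erfc(z / sqrt(2))

p = upper_tail(5.0)
print(p)          # ≈ 2.87e-07
print(1.0 / p)    # ≈ 3.5 million
```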

Now why is no one winning the current palindrome contest? Too easy?

Mayo: My argument doesn’t turn on the use of the word “tantamount”; there’s more that needs addressing if the claim is to be defended.

Are we talking about “fluke” as used in the HEP lit, or about how the results are reported in the popular press as filtered through the understanding of journalists?

My standard for a correct description of a p-value is that the probability has to be ascribed (or at least, can be read as being ascribed) to hypothetical observations rather than the data already in hand, and the event to which the probability is being ascribed has to be some variety of “as *or more* extreme” — ascribing the probability to the data simpliciter is incorrect. So I would give a thumbs-up to the quote you cut, just as Spiegelhalter did.

Corey: The issue doesn’t turn on tail areas, nor on the meaning a frequentist properly ascribes to the probability of data under Ho. The issue does not turn on whether one considers P(d(X) = 5; Ho) = 1 in 3 million (or whatever) or P(d(X) > 5; Ho) = 1 in 3 million. Neither gives the probability of Ho.

Mayo: I agree, the issue doesn’t turn on any of those things. The issue is something like: do the journalists who write the articles really understand what has been calculated and why? Do lay readers come away from popular science articles with a correct understanding of what has been calculated and why?

Slide 11 is very interesting to me: it describes what amounts to an optional stopping sampling scheme, which brings to the fore some interesting questions about how to calculate SEV. Consider this simplified abstraction of the HEP setup: we have funding for at most two independent observations from a normal distribution with unknown mean and known variance (without loss of generality, equal to one). If x_1 < 2, then we'll abort the experiment; otherwise, we'll collect both observations. What is the appropriate SEV calculation for the following possible data sets?

Set 1: x_1 = 1.5

Set 2: x_1 = 4, x_2 = 1.5

Set 3: x_1 = 4, x_2 = 6.5
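For concreteness, here is a baseline sketch of my own (not from the slides): the fixed-sample-size tail areas for the three sets under Ho: mu = 0, taking z = sum(x)/sqrt(n) and ignoring the stopping rule entirely. Whether and how the stopping rule should change these numbers is exactly the open question.

```python
from math import erfc, sqrt

def p_upper(z):
    """One-sided tail area P(Z > z) for a standard normal Z."""
    return 0.5 * erfc(z / sqrt(2))

# Under Ho: mu = 0 with known unit variance, z = sum(x) / sqrt(n).
datasets = {
    "Set 1": [1.5],
    "Set 2": [4.0, 1.5],
    "Set 3": [4.0, 6.5],
}
for name, xs in datasets.items():
    z = sum(xs) / sqrt(len(xs))
    print(name, round(z, 3), p_upper(z))
```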

Corey: If you’ll permit me to tweak your hypothetical a little bit, I’d like to use a Poisson distribution rather than a Gaussian. Counting events is essentially what is being done in HEP experiments anyway, so why not use something more applicable when dealing with small-number statistics?

Let’s say we know the background rate (b = 1) with certainty but do not know the signal rate (s = ?). Each observation is of the same length, and n_i is the number of events in the i^th observation.

Set 1: n_1 = 1

Set 2: n_1 = 4, n_2 = 1

Set 3: n_1 = 4, n_2 = 6
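For reference, the background-only tail probabilities P(N ≥ n; b) for counts like these can be computed directly; a stdlib sketch using the b = 1 assumed above (for the two-observation sets, the total count under Ho is Poisson(2b)):

```python
from math import exp, factorial

def poisson_p_upper(n_obs, rate):
    """Background-only p-value: P(N >= n_obs) for N ~ Poisson(rate)."""
    return 1.0 - sum(exp(-rate) * rate**k / factorial(k) for k in range(n_obs))

b = 1.0  # known background rate, as assumed above
for n in (1, 4, 6):
    print(n, poisson_p_upper(n, b))

# For the two-observation sets the total is Poisson(2b) under Ho, e.g. Set 3:
print(poisson_p_upper(4 + 6, 2 * b))
```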

Critics fall into the same confusion in using talk that is more familiar than “p-level flukes”, namely, type 1 error probabilities. If the test’s type 1 error probability is a low value p, then the probability it does not commit a type 1 error is (1 – p). Ho is rejected, so “not-Ho” is inferred. So far, so good.

But now it is (erroneously) assumed (by some critics) that the probability of “not-Ho” is (1 – p). (And the probability of Ho is p: a posterior.) In other words, the same wrong-headed criticism is seen in the more familiar language of type 1 error probabilities. Were it correct, it would be tantamount to asserting that use of a type 1 error probability constitutes a posterior probability assignment. Indeed, I suspect most criticisms of error probabilities boil down to this.
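The slippage can be made vivid with a quick simulation of my own; the 50-50 split between true and false nulls and the unit effect size are purely illustrative assumptions:

```python
import random

random.seed(1)

ALPHA_Z = 1.96        # two-sided 5% test cut-off
PI_TRUE_NULL = 0.5    # assumed fraction of true nulls (illustrative)
EFFECT = 1.0          # effect size when Ho is false (illustrative)
N = 200_000

rejections = true_null_rejections = 0
for _ in range(N):
    null_true = random.random() < PI_TRUE_NULL
    x = random.gauss(0.0 if null_true else EFFECT, 1.0)
    if abs(x) > ALPHA_Z:          # reject Ho at the 5% level
        rejections += 1
        true_null_rejections += null_true

# The type 1 error probability is 0.05, but P(Ho | rejection) is not:
ratio = true_null_rejections / rejections
print(ratio)
```

With these assumptions the fraction of rejections in which Ho was in fact true comes out well above 0.05 (around 0.23): the type 1 error probability was never a posterior.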

Pretty sure most *Bayesian* criticisms of error probabilities boil down to “right answer to the wrong question”, and not a misunderstanding on the part of the critics of what error probabilities are, nor a misunderstanding of what frequentists think error probabilities are.

Corey: How do you know it is the wrong question if what is assumed to be the right question is the wrong question, and the actual right question is the one being answered by the correctly understood error probability?* Why don’t you focus on the particular point I made about the (understandable) slippery slide from a type I error probability to a posterior or an assumed posterior? There is a subtle point here that is missed by such a blanket dismissal.

*Of course, claiming a significance tester is answering the wrong question is very different from the current topic. The current issue is critics alleging that a significance tester THINKS she is answering a different question, an allegation that only goes through by misinterpreting what significance testers are doing.

Mayo: How do *I* know it is the wrong question? As you say, that’s very different from the current topic; I was just responding to your “boil down to” claim. (I will eventually get around to answering this question on my blog, but I have some howlers to disavow first…)

“Why don’t you focus on the particular point I made about [stuff]?”

Okie-dokie. It’s difficult to nail down who you’re talking about in various places, so let’s set a scene with some players so that we can talk about who understands or misunderstands things, who thinks other people are misunderstanding things, etc. Dramatis personae:

(1) error statisticians;

(2) non-statistician scientists using frequentist tools;

(3) Bayesian critics;

(4) pop-sci writers;

(5) lay audience

What are your impressions about how these various groups understand error probability calculations, and how these various groups believe the other groups understand error probabilities?

Talk about off topic: the other was at least close. It would be fun to shoot the breeze on these topics, but I am not a sociologist of statistics or a statistician of statistical understanding.

But our issue is more like one of language:

“get w white balls in a random selection of size k from an urn with p% white”

“randomly selecting k balls from an urn with p% white balls yields w white”

Considering w or more doesn’t alter anything. It’s intended as the same event.

or various other rewrites. Putting the “H” as a subscript in many N-P texts was a good idea.
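For what it’s worth, both phrasings pick out the same hypergeometric event, and “w or more” is just its tail sum; a sketch with illustrative numbers:

```python
from math import comb

def p_w_or_more(w, k, white, total):
    """P(at least w white balls in k draws, without replacement) from an
    urn with `white` white balls out of `total`: a hypergeometric tail."""
    denom = comb(total, k)
    return sum(comb(white, j) * comb(total - white, k - j)
               for j in range(w, min(k, white) + 1)) / denom

# Illustrative: 100 balls, 30% white, 10 drawn, 6 or more white.
print(p_w_or_more(6, 10, 30, 100))
```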

I’m baffled — words don’t have meanings without people for them to have meaning to. The author has a meaning in mind, and the reader takes away a meaning; ideally the author has a correct idea in mind, and this idea is correctly communicated to the reader.

On slide 14 you talk about the p-value police’s assessment of various reports. The p-value police are reading phrases generated by various authors and judging whether they have a correct (according to the p-value police) meaning in mind. I don’t know how we could possibly carry on a conversation about this situation without, as a bare requirement, being specific about who we’re talking about and what we think they think.

‘right answer to the wrong question’ sums up my feelings about quite a few statistical procedures like multiple comparison “adjustments”

I’m hoping we can banish such knee-jerk quips, and try thinking things through from scratch.

The paper on a technicolor Higgs reminds me of what I find most frustrating about certain statistical methods. Some recent papers on Dark Matter detection are filled with this sort of slop.

Gross oversimplification follows:

1. Create new physical model that subsumes accepted phenomena

2. Determine best fit values for all the new parameters

3. Construct likelihood ratio of New Model with best fits to Old Model

4. Claim the observations show hints of something new or, if you are sufficiently brash, claim outright DETECTION of the something.

My knee-jerk reaction: Is it really all that surprising that the new model with however many additional parameters fits the data better? Shouldn’t your analysis take that greater flexibility into account when deciding between models?
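The knee-jerk reaction is arithmetically sound: a model that nests another can never fit worse at its best-fit values. A toy least-squares check (made-up numbers of mine):

```python
# Fit y = a (constant) versus y = a + b*x by least squares to the same data.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.2, 0.7, 1.9, 1.1, 1.6]   # made-up data; any numbers will do

n = len(xs)
mean_x, mean_y = sum(xs) / n, sum(ys) / n

# Constant model: the best-fitting a is the mean of y.
rss_const = sum((y - mean_y) ** 2 for y in ys)

# Linear model: ordinary least-squares slope and intercept.
b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
    / sum((x - mean_x) ** 2 for x in xs)
a = mean_y - b * mean_x
rss_linear = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

print(rss_linear <= rss_const)   # True: the extra parameter can only help
```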

West: Merely comparing best-fit likelihoods is statistical malpractice according to Bayesians and frequentists alike. (Not sure about likelihoodists…) Hypothesis tests take the extra degrees of freedom into account; Bayesians have Occam factors to penalize the extra degrees of freedom. I’m mildly shocked to learn that physics papers can pass peer review with the kind of egregious flaw you describe.

Corey: I thought Bayesians appealed to priors; I’m not sure the philosophical foundation of “penalizing” is clear. Some would say it is frequentist or error statistical.

The prior implies the penalisation; penalizing without a prior, or penalizing differently than the prior implies, would be without a Bayesian foundation.

(Of course, if the prior is a bad representation of what’s unknown, the penalisation will do more harm than good.)

Keith: Don’t non-Bayesian likelihoodists or other modelers also penalize, as in Akaike?

Mayo: I can answer this one. Akaike proposes to use (an estimator of) the expected log-likelihood ratio for model selection (in Spanos’s sense). Since the distribution to use for the expectation is unknown — is, in fact, the implicit target of the inference — he resorts to an estimator, to wit, the observed log-likelihood ratio. Since this estimator is biased, he computes a correction that provides unbiasedness to first order. The bias correction term is the penalty. This approach rests on an entirely frequentist conception of probability (albeit the approach is not error-statistical in the least).

Contrast this approach with Schwarz’s so-called Bayesian Information Criterion (BIC). Schwarz explicitly approximates the log-Bayes-factor and arrives at the observed log-likelihood ratio plus a penalty term that approximates the Occam factor to first order.
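The two penalties can be put side by side in a few lines; the log-likelihoods and sample size below are illustrative numbers of mine, not taken from any of the papers discussed:

```python
from math import log

def aic(log_lik, k):
    """Akaike information criterion: penalty of 2 per parameter
    (the first-order bias correction described above)."""
    return 2 * k - 2 * log_lik

def bic(log_lik, k, n):
    """Schwarz's BIC: penalty of log(n) per parameter, approximating
    the Occam factor in the log-Bayes-factor to first order."""
    return k * log(n) - 2 * log_lik

# Illustrative: the extra parameter improves the log-likelihood by 1.5.
print(aic(-98.5, 4) < aic(-100.0, 3))              # True: AIC prefers the bigger model
print(bic(-98.5, 4, 1000) < bic(-100.0, 3, 1000))  # False: log(1000) ≈ 6.9 outweighs it
```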

Corey: The paper that comes first to my mind (http://arxiv.org/pdf/1402.6703v1.pdf) uses the log-likelihood ratio of best-fit models as a chi-squared test statistic. Then a p-value for that statistic is computed using a chi-squared distribution with the counted number of degrees of freedom.

The major claim of the paper (aka the reason why anyone took notice) is buried on page 7 and surrounded by a load of qualifiers. “For the best-fit spectrum and halo profile, we find that the inclusion of the dark matter template improves the formal fit by ∆χ2 ≃ 1672, corresponding to a statistical preference greater than 40σ.” Which, to me, looks like a giant red flag that someone messed up their statistical inference.

If you are interested in a rather testy discussion of the statistical methods used in this paper, you can find it in the comments of this blog post: http://resonaances.blogspot.com/2014/03/signal-of-wimp-dark-matter.html

Mayo: Keith is right — the prior is a key ingredient of the Occam factor. The Occam factor is, very roughly and approximately, the ratio of the posterior “width” to the prior “width”.

If we’re comparing a base model to an expanded model with one extra parameter, the Bayes factor (of the expanded model relative to the base model) is approximately equal to the ratio of the best fit likelihoods times the Occam factor. The more informative the data is about the extra parameter, the greater the Occam factor penalty. The improvement in the best fit likelihood has to compensate for this penalty or else the Bayes factor will be less than 1 (i.e., the base model gets the B-boost).
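A closed-form toy case makes the prior’s role concrete: one observation x ~ N(mu, 1), a point null mu = 0 against the alternative mu ~ N(0, tau^2). The marginal likelihood under the alternative is N(0, 1 + tau^2), so the Bayes factor is a ratio of two normal densities; all numbers here are illustrative choices of mine:

```python
from math import exp, pi, sqrt

def normal_pdf(x, var):
    """Density of N(0, var) at x."""
    return exp(-x * x / (2.0 * var)) / sqrt(2.0 * pi * var)

def bf10(x, tau2):
    """Bayes factor for H1 (mu ~ N(0, tau2)) over H0 (mu = 0),
    given one observation x ~ N(mu, 1)."""
    return normal_pdf(x, 1.0 + tau2) / normal_pdf(x, 1.0)

x = 2.0
for tau2 in (1.0, 100.0, 10000.0):
    print(tau2, bf10(x, tau2))   # wider prior => harsher Occam penalty
```

With x fixed, widening the prior shrinks the Bayes factor: the same best-fit improvement buys less and less support for the expanded model.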

Corey: Some of these papers do make it into journals like Physical Review D, but it’s the first few weeks after they hit the arXiv when they have the greatest impact. As in other disciplines, peer-reviewed publication can take months, and while it does weed out some, it isn’t perfect.

I don’t mean to give the impression that physics papers are rife with statistical errors. There is a lot of great and careful work going on, but I feel too often some of the turds get attention due to slippery wording and inference practices.

If mu = 1 is the proposed hypothesis, doesn’t merely showing mu > 0 not imply the hypothesis?

Also, how does the “and under various alternatives” actually get incorporated on slide 6? If one doesn’t exhaustively enumerate alternatives, doesn’t that imply that 5 sigma can be achieved even if the hypothesis is false due to confounding? You’ve claimed in the past that an exhaustive enumeration of the space of possibilities is not necessary in your framework, but I don’t see how you can get around confounding, particularly in a case like this where there is no randomization and one is effectively observing nature (albeit under extreme conditions).

“If mu = 1 is the proposed hypothesis, doesn’t merely showing mu > 0 not imply the hypothesis?”

No, and I don’t know what “proposed hypothesis” means. Even rejecting the null doesn’t show mu = 1, since there are other properties that would need to be shown for it to even be “SM Higgs-like”. But even when that is inferred, there are non-SM possibilities.

“You’ve claimed in the past that an exhaustive enumeration of the space of possibilities is not necessary in your framework,”

You’re running two things together: you must exhaust the space of alternatives at the level of the question, e.g., all mu values. Whereas one need not exhaust the space of theories or hypotheses at “higher” or different levels. You should look at chapter 6 of EGEK, or anything I’ve written on underdetermination. That’s what allows piecemeal learning (in error statistics).

Claims about model assumptions and systematic errors, of course, can always come up for questioning. (They are on a distinct (generally lower) level.)