# “The probability that it be a statistical fluke” [iia]

My rationale for the last post is really just to highlight such passages as:

“Particle physicists have agreed, by convention, not to view an observed phenomenon as a discovery until the probability that it be a statistical fluke be below 1 in a million, a requirement that seems insanely draconian at first glance.” (Strassler)….

Even before the dust had settled regarding the discovery of a Standard Model-like Higgs particle, the nature and rationale of the 5-sigma discovery criterion began to be challenged. But my interest now is not in the fact that the 5-sigma discovery criterion is a convention, nor with the choice of 5. It is the understanding of “the probability that it be a statistical fluke” that interests me, because if we can get this right, I think we can understand a kind of equivocation that leads many to suppose that significance tests are being misinterpreted—even when they aren’t! So given that I’m stuck, unmoving, on this bus outside of London for 2+ hours (because of a car accident)—and the internet works—I’ll try to scratch out my point (expect errors, we’re moving now). Here’s another passage…

“Even when the probability of a particular statistical fluke, of a particular type, in a particular experiment seems to be very small indeed, we must remain cautious. …Is it really unlikely that someone, somewhere, will hit the jackpot, and see in their data an amazing statistical fluke that seems so impossible that it convincingly appears to be a new phenomenon?”

A very sketchy nutshell of the Higgs statistics: There is a general model of the detector, and within that model researchers define a “global signal strength” parameter “such that H0: μ = 0 corresponds to the background only hypothesis and μ = 1 corresponds to the Standard Model (SM) Higgs boson signal in addition to the background” (quote from an ATLAS report). The statistical test may be framed as a one-sided test; the test statistic records differences in the positive direction, in standard deviation or sigma units. The interest is not in the point against point hypotheses, but in finding discrepancies from H0 in the direction of the alternative, and then estimating their values.  The improbability of the 5-sigma excess alludes to the sampling distribution associated with such signal-like results or “bumps”, fortified with much cross-checking of results (The observable bumps form from observable excess events):

The probability of observing a result as extreme as 5 sigmas, under the assumption it was generated by background alone, that is, under H0, is approximately 1 in 3,500,000. Alternatively, we hear: the “probability that the results were just a statistical fluke is 1 in 3,500,000”.

Yet many critics[1] have claimed that this is to fallaciously apply the probability “to the explanation” H0. It is a common allegation, but a careful look shows this is not so. H0 does not say the observed results are due to background alone, although were H0 true (about what’s generating the data), it follows that various results would occur with specified probabilities. Thus we get the sampling distribution of d(X) under H0. For example, the probability of a type I error (false positive) is low:

(1) Pr(d(X) > 5; H0) ≤  .0000003.
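As a rough numerical check on (1), the one-sided tail area beyond 5 standard deviations can be computed directly. This is only a sketch: it assumes d(X) is standard normal under H0, which simplifies the detector-model simulations the physicists actually use, and the name `upper_tail` is hypothetical.

```python
import math

def upper_tail(z: float) -> float:
    """One-sided tail area Pr(Z > z) for a standard normal Z.

    Assumption: d(X) is treated as standard normal under H0, a
    simplification of the actual detector-model simulations.
    """
    # erfc keeps precision in the far tail, where 1 - cdf(z) would lose digits.
    return 0.5 * math.erfc(z / math.sqrt(2.0))

p_five_sigma = upper_tail(5.0)       # Pr(d(X) > 5; H0)
one_in = 1.0 / p_five_sigma          # the "1 in N" form of the same number
prob_survive = 1.0 - p_five_sigma    # Pr(d(X) <= 5; H0)

print(f"Pr(d(X) > 5; H0) = {p_five_sigma:.3e}")
print(f"about 1 in {one_in:,.0f}")
```

Under this assumption the tail area comes out near 2.9e-7, roughly 1 in 3.5 million, which is consistent with the bound in (1). Note that the number is a probability of an event, {d(X) > 5}, computed under H0; nothing in the calculation assigns a probability to H0 itself.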

These computations are based on simulating what it would be like were H0 true (given a detector model). So particle physicists are not slipping in a posterior probability on H0. In fact it is an ordinary error probability (of a type I error) or significance level. In terms of the corresponding p-value:

(2) Pr(test T produces a p-value < .0000003; H0) < .0000003.
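Claim (2) reflects a general property: for a continuous test statistic, the p-value falls below a threshold α with probability about α when H0 is true. The sketch below checks this by simulation. Everything in it is an assumption for illustration: the null model is a standard normal, the function names are hypothetical, and the threshold is 0.05 rather than .0000003, since the latter would take far too many trials to check by brute force.

```python
import math
import random

def p_value(d: float) -> float:
    """One-sided p-value Pr(Z > d) for a standard normal Z (assumed null model)."""
    return 0.5 * math.erfc(d / math.sqrt(2.0))

def rejection_rate(alpha: float, n_trials: int, seed: int = 0) -> float:
    """Fraction of background-only trials whose p-value falls below alpha."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    hits = 0
    for _ in range(n_trials):
        d = rng.gauss(0.0, 1.0)      # test statistic generated under H0
        if p_value(d) < alpha:
            hits += 1
    return hits / n_trials

rate = rejection_rate(alpha=0.05, n_trials=100_000)
print(f"Pr(p-value < 0.05; H0) is roughly {rate:.4f}")
```

The simulated frequency should land close to 0.05. That is all an error probability is: the relative frequency with which the test would output such a result in a world where only background is operating.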

Now the inference that is actually detached from the evidence can be put in any number of ways:

(3) There is strong evidence for (or they have experimentally demonstrated) H: a Higgs (or a Higgs-like) particle.

Granted, inferring (3) relies on an implicit principle of evidence beyond (1) or (2): Data provide evidence for rejecting H0 (just) to the extent that H0 would (very probably) have survived, were it a reasonably adequate description of the process generating the data (with respect to the question). [A variant of the severe or stringent testing requirement for evidence.]

Here, with probability .9999997, the test would generate less impressive bumps than these, under H0. So, very probably H0 would have survived, were μ = 0.[2]

So while it is true that some cases in science commit the fallacy of  “transposing the conditional” from a low significance level to a low posterior to the null—in many other cases, what’s going on is precisely as in the case of the Higgs. If you go back to examples where the fallacy is alleged with this in mind, I think you will find they mostly evanesce.

Once the null is rejected, confidence intervals take over to check if various parameters agree with the SM predictions. Now the corresponding null hypothesis is the SM Higgs boson H’0 (Cousins, p. 18), and discrepancies from it are probed. It is here that we actually get to the most important role served by statistical significance tests: affording a standard for denying sufficient evidence of a new discovery.

The basic principle here is that: An observed difference from a test T does not provide evidence for rejecting H’0 if even larger bumps are fairly easily produced where H’0 is a reasonably adequate description of the process generating the data (with respect to the question).

In determining that results do not meet the “new discovery” threshold, it is not merely formal statistics that is involved, but various problems, such as the fact that an anomalous bump shows up at CMS but not at ATLAS, or that the effect dissipates with increasing data.

Thumbs up:

• “The probability of the background alone fluctuating up by this amount or more is about one in three million.”
• “only one experiment in three million would see an apparent signal this strong in a universe without a Higgs.”

The following two are thumbs down (according to the critic)

• Both groups said that the likelihood that their signal was a result of a chance fluctuation was less than one chance in 3.5 million,
• There is less than a one in a million chance that their results are a statistical fluke.

But correctly understood, all four are thumbs up. The incorrect one that he states alludes to [2].

[2] Correction: The correct claim is that the complement of a “false positive”–within significance testing– is a “true negative”. So, Pr(test T does not reject H0 ; H0)= 1 – the corresponding type 1 error probability.

Expect errors–please note corrections, I’ll update it when I reach land—we’re moving!

***
Some previous posts on Higgs & 5 sigma standard:

March 17, 2013: Update on Higgs data analysis: statistical flukes (part 1)
March 27, 2013: Higgs analysis and statistical flukes (part 2)
April 4, 2013: Guest Post. Kent Staley: On the Five Sigma Standard in Particle Physics


### 66 thoughts on ““The probability that it be a statistical fluke” [iia]”

1. I totally disagree. I think you’ve described the reasoning correctly, but I can’t see phrasing like,

“Particle physicists have agreed, by convention, not to view an observed phenomenon as a discovery until the probability that it be a statistical fluke be below 1 in a million, a requirement that seems insanely draconian at first glance,”

as anything other than a failure to correctly describe the error statistical warrant for the claim. The claim “the observed phenomenon is a statistical fluke” is equivalent to “the observed phenomenon is not reproducible” which is a statistical hypothesis, not an event. On its face, the quoted phrasing is talking about the probability of an hypothesis!

To me, your “correctly understood” reads as “if we ignore the plain meaning of the words and look at the math these words attempt (and fail!) to describe”. You yourself are so very careful never to get this mixed up; I don’t know why you would defend this confusion of ideas.

And why do scientists repeatedly make this slip? On this I should not care to dogmatize…

• West

@corey

Apologies if your question was rhetorical, but the most common cause of slips of this sort (and other sorts too) is speaking outside one’s realm of expertise.

Most of the statistics training one gets in a physics degree comes on the job and not via formal training. So if one doesn’t live and breathe p-values and posterior pdfs in one’s work, all of this likely resembles pointless hair splitting.

I personally don’t subscribe to this view, in part because my research involves questions of background estimation and detection confidence. It doesn’t surprise me that this often happens. I wish it wouldn’t, since it would save me a lot of headaches, but the subtleties of statistical inference are irrelevant to many in physics…

• Sorry, these are probabilities to various observable bumps—ordinary error probs, whether by simulation or analysis. But this explains much of the confusion as to what we really want…. Just returned to home country.

• West

What explains the confusion? I am afraid I don’t follow…

• Corey: You’ll remember the discussion not long ago about an error probability not being a conditional probability, but a probability assignment to an event such as {d(X) > 5} (within statistical model M) under the assumption that parameter mu is some value, say 0 (where mu governs the distributions). Whether by analysis or simulation or something else, we may determine P(d(X) > d’; mu = 0). We know the actual data set is statistically unusual in all sorts of ways, and we aren’t impressed by such small likelihoods.

Behaviorist Neyman might say: we adopt a rule to announce “the phenomenon is real”, or “the bumps are not due to background alone”, or “reject null Ho” or any number of “announcements” if and only if d(X) > 5.

He will show that the probability of erroneously rejecting Ho is very small, and further that the test T has uniform power, or good power, against discrepancies from 0.

My thing is to give an inferential rationale to this kind of move, because I think that’s how we actually reason. We always want to know things like: “how frequently would even more impressive bumps be generated even in a system where these could only be produced by background: mu = 0”.
If, for example, P(d(X) > d’; mu = 0) ~ .4, for the observed d’, then we simply deny that these data are good evidence for mu > 0. It’s rather lousy evidence*.

Even if we somehow knew, perhaps from other data or theory, that mu > 0, we would deny that evidence for such an inference may be found in this data.

*And our warrant is not merely or not solely that were we to regard it as evidence against the null, we’d be wrong 40% of the time.

2. Mark

I agree with Corey (not sure I’ve ever said that before on this blog), to a point. Actually, Mayo, I think that this is one of the clearest statements of the problem I’ve ever read, I might share your write up with others that I’m trying to explain this to (with full credit, of course). I especially like the part “Once the null is rejected, confidence intervals take over”… that is exactly how I try to teach things.

However, I do see that “correctly understood” as a serious hedge on your part… that is, assuming that *we* understand what the other person is talking about we can mentally provide the correct interpretation of their words. Problem is, I work with many medical researchers, and I guarantee you that *they*, by and large, don’t understand what is being implied when someone interprets a p-value as “the probability that the results are due to chance”. In their minds, they seem to jump to the (seemingly) logical conclusion that 1 – p is the “probability that the results are not due to chance”, which is really just another way of saying the probability that the alternative hypothesis is true (or that the null hypothesis is false, whichever).

• Mark: Corey’s claim is incorrect. Say for example
Pr(d(X) > d’; Ho) ≤ .4, for d’ a particular observed difference. Say there is a tool that simulates “background alone” and it generates bumps as impressive as, or even more impressive than, d’ 40% of the time. That’s the meaning of the claim. Nothing mysterious. This is an ordinary sampling distribution of statistic d(X). There’s no suggestion whatever of a posterior.
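A toy version of such a “background alone” simulator can make the 40% reading concrete. This is a minimal sketch under stated assumptions: d(X) is taken to be standard normal under Ho, the observed d’ is chosen so that its tail area is exactly .4, and `simulated_exceedance` is a hypothetical helper, not anything used in the actual analyses.

```python
import random
from statistics import NormalDist

# Assumed sampling distribution of d(X) under Ho (background alone).
null = NormalDist(mu=0.0, sigma=1.0)

# Observed difference d' chosen so that Pr(d(X) > d'; Ho) = 0.4.
d_obs = null.inv_cdf(1.0 - 0.4)

def simulated_exceedance(d_prime: float, n_trials: int, seed: int = 1) -> float:
    """Fraction of simulated background-only runs with d(X) > d_prime."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    count = sum(1 for _ in range(n_trials) if rng.gauss(0.0, 1.0) > d_prime)
    return count / n_trials

freq = simulated_exceedance(d_obs, 100_000)
print(f"d' = {d_obs:.3f} sigma; exceedance frequency under Ho = {freq:.3f}")
```

Here d’ is only about a quarter of a standard deviation, and the simulator produces bumps at least that large in roughly 40% of runs. The frequency describes how the simulated statistic behaves, which is why it carries no suggestion of a posterior.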

It’s not even clear what it would be a posterior of anyway–would it be the probability that the simple SM is true? e.g., 40% of possible worlds have the simple SM true, given x? There’s no license for such a thing, especially based on experiments that are producing results perfectly consistent with a world where rivals to the simple SM hold. Nor can I see why someone would want to know this (even those who believe in multiverses). We’re interested in this world, and we can learn about it using error statistical reasoning.

• Mark

Mayo, I think I wasn’t clear enough… I certainly don’t think it’s a posterior probability! I agree with you completely. I was simply commenting on the “correctly understood” comment as a seeming rationale for allowing loose language such as “probability results are due to chance” or “chance that the results are a statistical fluke”. Clearly, to those who understand what a p-value is, it doesn’t really matter if we use loose language. But I think it is a mistake to allow or disregard such loose language in general because, from my experience, I know that many practitioners take these phrases at face value.

• Mark

Actually, thinking more, “chance that it’s a statistical fluke” isn’t all that terrible because it sort of has the null assumption built into it. But “due to chance” does not (which, admittedly, you never wrote in your post). Sorry, Happy Thanksgiving!

• Thanks Mark. Yes, although one doesn’t usually use “fluke”, if you look at the claims carefully, they are used entirely correctly. But many of the “thumbs down” construals use other terms, also entirely correct for frequentist significance testers. The irony is, it shows that what we really want are probabilities that various observable events would/do result under various hypothesized parameter values.

The truth or approximate adequacy of a statistical hypothesis is not equal to the occurrence of an event*.

Probabilities, everyone agrees, are assigned to events (or perhaps propositions about the occurrence of events).

Therefore probabilities are not assigned to statistical hypotheses.

That doesn’t mean we cannot qualify inferences to those hypotheses, we do, by error statistical reasoning about the relationships between events and underlying causes as modeled statistically–that’s what sampling distributions give us.

*There are special cases where the truth of a hypothesis really is or can be seen as the occurrence of an event. It might even have a frequentist “prior” probability assignment.

• Mayo: Which claim of mine is incorrect?

• Corey: I was alluding to your claim that the following involved a probability assignment to a statistical hypothesis–here, the standard model Higgs or the like:
“The probability of observing a result as extreme as 5 sigmas, under the assumption it was generated by background alone, that is, under H0, is approximately 1 in 3,500,000. Alternatively, we hear: the ‘probability that the results were just a statistical fluke is 1 in 3,500,000’.”
Whether it could be reproduced in a repeat trial is yet a different issue…

Anyway, here it’s all Thanksgivukkah….

• Mayo: That’s not the claim I intended to make. I’m criticizing unclear communication, not the substance: my contention is only that on its face the phrasing Strassler used appears to refer to the probability of a statistical hypothesis. I freely admit (and tried to be clear that I recognize) that the referent of the phrase is not actually such a probability.

What all of the phrasings that I deprecate are missing is the idea of accordance/discordance plus the picking out of a subset of the sample space via phrases like “a result more extreme”, “fluctuating up by this amount or more”, “a result this deviant”, etc.

I think the result of the failure to use a precise enough description of what is actually being accomplished provokes confusion in scientists who, as West says, are more concerned with the science at hand than abstruse statistical hairsplitting, and end up, as Mark says, taking the phrasing at face value.

(As my quote of Pearson alludes, I would go further and suggest that this slip is so easy to make and so easy to misread precisely because scientists — and people more generally — find the notion of the probability (read: plausibility) of a (non-random!) claim quite intuitive; and even further, that the notion admits a formalization that makes it useful in science. But that’s just an afterthought.)

Anyway, enjoy Thanksgivukkah! Harvest comes early in the frozen north, so it’s just Hanukkah here.

• Corey:
Probability = plausibility?
The other day you claimed you couldn’t say what was meant by “plausibility”–it was ineffable–so I’m confused.

Think all of it through once more and write again.

• This doesn’t seem so hard to me. When a person talks, informally, about how “probable” they find some non-random claim, it seems to me that we could swap the word “probable” for the word “plausible” without changing their meaning.

Additionally, we can set up a formalization which attempts to capture the informal notion of plausibility. The entire content of such a formalization is to be found in the postulates that specify the relations between these quantities. Cox’s theorem shows that the rules for manipulating any such system of formal “plausibility” numbers are isomorphic to those for manipulating probabilities.

• Corey: (1) This is better than giving plausible no meaning
(2) It is of crucial importance to distinguish the informal uses of such terms and probability, likelihood from their formal counterparts in discussing statistical foundations.
I might say, “I’ve a hunch the Standard Model will break down somewhere, so it’s probably false”. Or, given the plausibility of the Standard Model, physicists might rightly have said, a SM Higgs exists before spending all that money on colliders. This gets to the mammoth difference with error statistics.
(3) We want to evaluate whether some data/information has stringently tested claims about the theory or model H, and by letting prior informal assessments (or beliefs) in the theory/model be part of that assessment, I have failed to do so.
I need to be able to say (among many other things) that the theory is plausible but this test/data has done a terrible, biased, unreliable etc etc. job at probing it. And having lousy evidence for H does not translate into evidence for not-H, or conversely. It is really at the negative, critical job–I keep saying–that the error statistician differs most from the Bayesian.

We’re back to the discussion we once had about capturing terrible tests.

Moreover, ours is a discovery procedure–discovery in pieces–as we do not want to have to trot out all the possible rivals to a theory in order to get the inquiry going. Nor will the “Bayesian catchall-factor”, P(x|notH) be a plausible factor to rely on for forward-looking inquiry.

So it is here, at the very elemental ground of basic aims and goals (and what can achieve them) that discussions of statistical foundations rest and should rest. One needn’t legislate all aims, but one should be clear on them and not mask one for the other. That would be progress.

*I liked Strassler’s remark about cheating although some might find it too strong.

• Mayo: I agree with all this. I’ve got some ideas on this subject that I’ll be writing up at some point — you won’t like them! 😉

• Corey: Well, they don’t accord with a formal posterior probability Pr(H|x) capturing how well tested H is by dint of x and test T. Of course that’s what the severity assessment is intended to supply, yet it doesn’t obey the probability calculus, nor is it arrived at via prior probabilities to statistical hypotheses. For example, both H and ~H might be egregiously poorly tested.

But I think there is another deep difference that we haven’t spoken about directly on this blog*. It’s not amenable to super simple stating, but in a nutshell: We error statisticians view statistical inference as being directed to a (generally) deliberately created data generating mechanism/detector/instrument/experimental design. That’s why those so-called “dangerous” statements of p-values and other error probabilities are not dangerous for us at all.

*Though it’s in my published work.

• Elbafolk

A relevant blogpost quotes Allan Birnbaum:
“It is of course common nontechnical usage to call any proposition probable or likely if it is supported by strong evidence of some kind. .. However such usage is to be avoided as misleading in this problem-area, because each of the terms probability, likelihood and confidence coefficient is given a distinct mathematical and extramathematical usage.” (1969, 139 Note 4).
For my part, I find that I never use probabilities to express degrees of evidence (either in mathematical or extramathematical uses), but I realize others might. Even so, I agree with Birnbaum “that such usage is to be avoided as misleading in” foundational discussions of evidence. We know, infer, accept, and detach from evidence, all kinds of claims without any inclination to add an additional quantity such as a degree of probability or belief arrived at via, and obeying, the formal probability calculus.
https://errorstatistics.com/2012/08/27/knowledgeevidence-are-not-captured-by-mathematical-probability/

• Corey: Given your comment had a number of issues I wanted to address, I neglected to mention that the phrasings I used are not missing reference to “a result more extreme”, “fluctuating up by this amount or more”, “a result this deviant”, etc. The likelihood of the particular result (under Ho) would not provide the observable phenomenon to which we intend to allude. {d(X) > d(x)} does. Of course, ~0 probability is added in considering the “tail area” associated with a 5 sigma difference.*

* It’s ironic, by the way, that the use of tail areas should be of concern to Bayesians who are always claiming the tests make it too easy to reject the null. The tail area makes it more difficult to reject the null. What would make it too easy would be to just consider the specific data set. Nevertheless, this notion was not a part of N-P statistics. For Fisher it arose to ensure a sensible test statistic, and to avoid taking any old improbable result as evidence of a genuine experimental effect. The latter role arises to block rejecting the null. So Jeffreys’ clever refrain actually misses its mark. I will post on this soon.

• Mayo: ‘I neglected to mention that the phrasings I used are not missing reference to “a result more extreme”, “fluctuating up by this amount or more”, “a result this deviant”, etc.’

Indeed; as I said, you yourself are so very careful to avoid the phrasings I dislike that I wonder at your defense of others’ use of them…

• No, the phrasings did not ignore them, that’s my point. They included them.

• visitor

Mark.
Finding the probability of a sequence of coin-toss outcomes under chance to be very low, would they jump to the conclusion that the probability the outcomes are not due to chance is very high?

• Mark

Visitor, I doubt it, but they shouldn’t in the case of coin tossing. I really hate it when people bring up those silly coin tossing examples… It all depends on the sampling distribution. So, yes, any sequence of coin tosses would have very low probability, but not in relation to the sampling distribution (as any sequence of tosses would have the same probability).
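The contrast between the probability of one particular sequence and the tail area of a test statistic is easy to put in numbers. The sketch below uses illustrative values that are not from the discussion (20 tosses, 11 observed heads) and takes the number of heads as the test statistic under a fair-coin null.

```python
from math import comb

n = 20          # number of tosses (illustrative choice)
heads_obs = 11  # hypothetical observed number of heads

# Every specific sequence of n fair tosses has the same probability.
seq_prob = 0.5 ** n

# Tail area of the test statistic (number of heads) under a fair coin.
tail_prob = sum(comb(n, k) for k in range(heads_obs, n + 1)) / 2 ** n

print(f"Pr(this exact sequence; fair coin) = {seq_prob:.2e}")
print(f"Pr(heads >= {heads_obs}; fair coin)  = {tail_prob:.3f}")
```

The exact sequence has probability below one in a million, yet the tail area for 11 or more heads is about 0.41. The tiny sequence probability would hold for any outcome whatever, so it tells us nothing; the tail area is computed from the sampling distribution of the statistic, and here it gives no grounds for doubting the fair-coin hypothesis.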

• john byrd

This would require one to do a terrible thing: Interpret statistical results with no knowledge or concern of the problem at hand and the random process being modeled by the test.

3. vl

If “p is the probability that the results are due to chance”, then by definition 1-p is the “probability that the results are not due to chance”. This is a perfectly logical conclusion.

The mistake they’re making is believing that “p is the probability that the results are due to chance” at all in the first place since it’s A) a non-sensical statement (as if there’s only one possible explanation of the observation that involves randomness and the alternative hypothesis doesn’t have any randomness at all) and B) if it did correspond to something, it would be the probability of the null hypothesis, not the p-value.

• If you look at the equations I wrote, you’ll see the proper interpretation of p-values. From note [2] you’ll see where you’re going wrong, e.g., P(D > d’; H) = 1 – P(D ≤ d’; H).

4. West

The quote below from Prof. Strassler’s linked post should give us pause before spending more time trying to interpret the meaning of ‘statistical fluke.’

“But does the precise choice of question actually matter that much? I personally take the point of view that it really doesn’t. That’s because no one should take a hint of the presence (or absence) of a new phenomenon too seriously until it becomes so obvious that we can’t possibly argue about it anymore. If intelligent, sensible people can have a serious argument about whether a strange sequence of events could be a coincidence, then there’s no way to settle the argument except to learn more.”

Why would a person who believes this to be true put any effort into representing the relevant statistical problem accurately? I personally have a hard time caring about what he means when he doesn’t care about the precise nature of the question. But I do care to understand (and deconstruct) his reasoning as to why all this fuss about significance is a waste of time.

• Well that wasn’t my favorite part of the post, and I’m not at all trying to deconstruct Strassler. The point (mine) is simply the natural and correct use of the expression (in the title of my post). It’s as if some critics (L. Wasserman calls them “the p-value police”) forget the most natural* way of reasoning in their attempt to carp at frequentist tests.

But there are two important implications of Strassler’s cavalier remark: (a) unless we have a really strong argument that it’s not a coincidence, there are grounds to probe further, and (b) whenever frequentist error statisticians get to the point of having a very strong argument, unsurprisingly, Bayesians and just about any other approach do too (even if it’s for a different goal and a different interpretation). But the real significance of test/confidence interval work is performed “along the way” (i.e., before getting to that point): in blocking various claims, indicating which theories won’t work, getting a better understanding of the detector and other instruments/models and uncovering better experiments to discriminate from among the space of possible theories/parameter values. I know very little about HEP in its own right; I’m applying the generality that comes from being a philosopher of science/statistics.

* 12/3/13: Having had this brought to my attention by Christian Hennig, I would replace “natural” with “typical”, “common,” or “useful” or the like. I mean it is common and useful in entirely non-statistical settings, and statistical inference should be continuous both with science and day-to-day inquiry.

5. I’m trying to figure out your position on this, Deborah. Here’s a fairly typical statement from press coverage of the Higgs (this is from Discovery News): “A 5-sigma result represents a one-in-3.5 million chance of the result being noise.” I have tended to read these kinds of statements as instances of what you have elsewhere called the “fallacy of probabilistic instantiation”. But these kinds of statements might also be thought of as ambiguous between a fallacious and a correct statement of error probabilities.

Are you claiming them to be ambiguous and then applying a principle of charity to their interpretation?

• Kent: No, this isn’t an instantiation of the fallacy of probabilistic instantiation, if I correctly understand your claim. But I think you might be focusing on the word “this”. We (frequentists) always regard events as being of a type when we assign probabilities to them*. [Added] Note too that this “observed phenomenon”, as in Strassler’s phrase–is not observed data. If we have a “noise” simulator (like the one that generates bumps from background alone), we might generate bumps to find out how easy (frequent) it is to get a modest “anomalous” bump in the data (at an energy level). (You likely know the experimental physics better than I). If it’s easy (common, frequent) to generate these due to background alone, then we regard such bumps as bad evidence of something other than background as responsible.
I don’t know if this answers your question.

* (One of the issues Strassler raises with the examples of hurricanes or whatnot is, what type? Any outcome can be “unusual” in some respect.)

• john byrd

Kent: I think the problem is with the phrase “chance of the result being noise.” It should say, “chance of a result this deviant given there is no real effect” or the like. I think the problem stems from the way statistics are taught today more as cookbook recipes for data analysis, without the underlying philosophy guiding the approach. For frequency stats, it is imperative to understand the role of the null hypothesis, which requires understanding the underlying philosophy. Most journal submissions I review use frequency stats, and typically do not state the null or justify the test. The authors often put little thought into it, which leads to phrasing in the conclusions like the quote you provided.

• John:
The p-value plays dual roles: you can say,
p = Pr(test or detector T outputs such a pattern of bumps; background alone [as modeled by M])
or
Pr(mu = 0 [in model M] would produce such a pattern of bumps in T)
or
Pr(P < p; Ho) = p,
and many other things besides, and one still isn’t assigning a probability p to the parameter mu = 0.
Nor is one thereby assigning probability (1 – p) to mu = 1.
(which represents the simplest standard model Higgs).
I don’t know what such a probability assignment is supposed to mean. John, maybe you’d want to read the Cousins article I linked to in this post.

• Kent: It occurs to me (in reading something else) that you may have wrongly supposed that the fallacy of prob instantiation refers to assigning a prob that holds for a generic event to a specific instantiation of that event. That’s not the fallacy–in fact that might not be a fallacy, but put that aside. The fallacy is much more serious. Take the example that begins with hypotheses considered as true or not, randomly selects from an urn in which p% of these hypotheses are true, picks out h’, and then assigns probability p to h’. One has taken an h’ that is either true or false and swapped it out for the event of selecting some true hypothesis or other (from the urn) and then goes back to assigning it to h’. Achinstein, recall, had said this was only a fallacy for a frequentist, but I also think it’s fallacious as a “degree of epistemic belief”.
Anyway, this is all by way of saying why that situation isn’t an issue here, where it’s events all the way. The probability of a type 1 (or 2) error, the significance level, p-value, etc. doesn’t go away or change. Now you might go back to what I said about the understanding of “this”–

6. Christian Hennig

Late in this debate and with a rather off-topic question, although it could seem “natural” that I raise it here 😉 because it comes from this quote from Mayo:
“The point (mine) is simply the natural and correct use of the expression (in the title of my post). It’s as if some critics (L. Wasserman calls them “the p-value police”) forget the most natural way of reasoning in their attempt to carp at frequentist tests.”
In discussions like this, can you explain what is meant by the term “natural”, how to diagnose if some kind of reasoning is “natural”, and why is it a good thing to be “natural”?
(I’m asking this because people use the term “natural” in statistics all the time to justify certain methods; usually when this happens in presentations I ask what they mean by this and how this is an argument in favour of their method, and I haven’t got a single good response yet. Usually they say something like “forget it and stick to my other arguments”.)

7. Christian: Maybe my answer sounds like theirs but “natural” is doing a bit of work here (by the way, I say “natural and correct”), and it’s this: I sometimes get the feeling that people regard reasoning about the properties of the method–as in the two versions of the severity principle in this post–as a convoluted way to reason, whereas a posterior probability to the statistical model/hypothesis is somehow natural, or we can’t help it or the like. I’m trying to remind people of their day to day reasoning: you tell me something or someone has passed the test, and I immediately want to know the probability it/or they would NOT have passed so well, were some flaw present. I stand back and ask about what your test would have done if other outcomes had occurred, and I want to know how readily “outcomes that are taken to pass H” (according to your test) would have arisen in a universe where H is flawed, or discrepancies from H are present. This is how I scrutinize your passing result.

I hear people state the defn of a p-value as if it’s such an arcane piece of reasoning that they have to read it off word for word lest they say it wrong, and I’m trying to point out that (i) a number of phrases are altogether equivalent, and (ii) altogether common and altogether just what we want to know, when we’re testing whether the test result should count as evidence for a claim–or not.

• Christian Hennig

Thanks. This is better than what I got from others and works well explaining your specific use in this case. It doesn’t really remove my curiosity regarding the general use of this term for statistical arguments, but of course it’s not your job to do this.

• Christian: And I agree with you that it’s a fuzzy term that I’d never rely on to do work for me either in statistics or philosophy of science. My rationale in this case is what I said above. Thanks much.

• Mayo: ‘the probability it/or they would NOT have passed so well, were some flaw present… I’m trying to point out that (i) a number of phrases are altogether equivalent’

I’m glad you wrote these two statements in relatively close juxtaposition because the first part illuminates exactly why I deny the second part, enabling me to state my objection very succinctly.

Your claim, if I understand it correctly, is that “probability that it be a statistical fluke” is *altogether equivalent* to “the probability that, were some flaw present, the test would not have been passed so well”. My claim is that the former is a more general (a *too* general) phrasing; it admits the latter interpretation and also a Bayesian-style probability-of-an-hypothesis interpretation, precisely because it fails to incorporate the latter phrasing’s notion of “so well” or something equivalent.

• Corey: I’m confused about where you’re getting the phrases. The latter phrase arose in my general response to Christian Hennig about how I construe a report about a result passing H. Here I did not refer back to the Higgs example but kept things entirely general. In the Higgs example the H that passes is the non-null (which can take different forms). If it were just a generic “non-Ho”, and if it passed based on, say, a .5 sigma difference, I would note that the probability of {d(X) > .5} is high even if Ho is all that’s operating. In other cases the “flaw” might be that the purported parametric discrepancy from the null is unwarranted because results as or more impressive than what was observed can readily be produced even with smaller discrepancies from the null. See what I mean? “Discrepancy”, by the way, for me always refers to parametric discrepancy (to avoid confusion). So where’s the Bayesian posterior coming in?

• Mayo: I’m getting my phrases from either direct quotes of you or things you would give a “thumbs up”, per footnote [1].

Consider the claim “the observed phenomenon is a statistical fluke”. On the frequentist view, the mere fact that this claim is being assigned a probability conveys that the claim refers to an event. But many scientists have not thought carefully about these things, and cannot really be said to have a view, much less a frequentist one. (See West’s comments above.)

For such people, the probability of the claim “some statistical hypothesis actually prevails” does not immediately and forcefully strike them as nonsensical. When considering the claim “the observed phenomenon is a statistical fluke”, they could (and I expect do) interpret this as a claim that the observed phenomenon is not consistently reproducible. Notice that the claim is being interpreted as referring to what would be observed in a large number of experimental replications; that is, it is being interpreted as a claim that some statistical hypothesis prevails. That’s not the *intended* interpretation at all, but it is a fair one.

And then the claim is reported to have a probability. Boom! –confusion and misunderstandings result.

• Please disregard the first sentence of my third paragraph and replace it with the following:

For such people, the idea of a probability for the claim “some statistical hypothesis actually prevails” does not immediately and forcefully strike them as a nonsensical notion.

• Corey and others: I’m not sure why West’s remarks are being assigned to me, but I think you’re confusing what may follow from hypothesis H with respect to actual or hypothetical experiments, and what H asserts. For example, it may follow from Ho: coin is fair*, that large observed differences from 50% heads will “not be replicable” in a proposed experiment E consisting of thus and so trials. But to say P(differences > 5 sigma; Ho)= some very low number is not to give a probability to Ho.

*Ho: p(heads on each trial) = .5 (within a model).

Let P(|d(x)| > 1 sigma; Ho) = .32. Ho is an assertion about a standard model without a type of Higgs particle. We don’t assign a probability of .32 to Ho. We do infer that the observed d(x) is a poor indication that Ho is false. Indeed, since you’re talking replicability, this is the kind of observed bump that regularly disappears. I’m reminded of my favorite passage from Fisher:

“[W]e need, not an isolated record, but a reliable method of procedure. In relation to the test of significance, we may say that a phenomenon is experimentally demonstrable when we know how to conduct an experiment which will rarely fail to give us a statistically significant result.” (Fisher 1947, 14).
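The .32 figure for a 1 sigma difference can be checked with a short calculation. This is only a sketch, assuming (as an idealization not stated in the comment) that d(x) is standard normal under Ho; the helper name is hypothetical.

```python
import math

def two_sided_tail(k):
    """P(|Z| > k) for a standard normal Z, i.e. 2 * (1 - Phi(k))."""
    return math.erfc(k / math.sqrt(2))

print(round(two_sided_tail(1.0), 2))  # 0.32
```

A difference this size turns up about a third of the time under Ho alone, which is why it is a poor indication that Ho is false.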

Going back to the 5 sigma difference, it is true that with very high probability (.999…) you would not be able to regularly produce d(x) values as large as 5 SD units or more were Ho an adequate description of the mechanism producing observed decays. So if you manage to do so, finding such bumps at ATLAS and CMS at the same energy levels, etc. I may infer that x indicates a genuine effect—here, SM Higgs particle, with fairly high severity (~.999…). But I do not assign .999..to H’: SM Higgs particle. Not only have I not performed a Bayesian computation which would require P(H’), as a matter of fact there are rivals to H’ that would also produce such Higgs-like bumps. That a Bayesian might assign P(H’) a value on the basis of the data shows, again, that we’re doing very different things.

Finally, and given what you’ve written this may surprise you: frequentist inference is all about reaching claims of the Fisherian sort regarding reproducible effects. (Whether we can actually reproduce them is something else, the colliders may shut down for example–but that’s just a way to cash out a genuine effect.) We make error statistical inferences about genuine effects. Since H is a statistical hypothesis, various probability assignments to outcomes follow. I don’t mind saying, as you do, that “some statistical hypothesis H prevails” although it’s not standard. All kinds of experimental implications may follow from H’s “prevailing”. But I’m not assigning a probability to H. Statistical hypothesis H assigns probabilities to outcomes. To accept H is to accept those probability assignments.
So I think you’re wrong about what we learn when we are warranted in inferring a statistical hypothesis. If people keep this straight, Boom!**–we get just what we need from error statistics.

**to mimic Corey.

• Mayo: In this comment thread I have only ever been concerned with the clarity of communication with scientists who, as West reports, get most of their statistics training on the job and not via formal training, who don’t “live and breathe p-values and posterior pdfs in their work”, and for whom “all of this likely resembles pointless hair splitting”.

Let me be as clear as I know how: I grant for the sake of argument that error statistics “works”, in whatever sense you want to take that. I grant that neither you, nor Strassler, nor any of the statistical literati in this discussion (and I include myself here) labor under the misconception that any of the probabilities we’re discussing are actually probabilities of hypotheses.

Let me correct a miscommunication: like you, I don’t mind saying that “some statistical hypothesis H prevails”. Call that quoted phrase a claim of type H. My concern is simply this: some phrasings that you defend can lead scientists not well-versed in statistical thinking to believe that a probability has been assigned to claims of type H — even though nothing of the sort is going on! — because for those scientists, the phrasings are ambiguous about the type of claim to which they refer.

I’m tapping out now.

• I know what you’re saying, and I’m worried (on the basis of your remarks) that you have a wrong-headed view of what constitutes (implicitly or explicitly) assigning a probability to a statistical or other hypothesis. Thus I’m afraid your view of what it is to be confused about these phrases is rather idiosyncratic and curious. I don’t think I’ve even heard other Bayesians equate the claims you made (about replicable phenomena) to posterior probabilities. Perhaps we can learn a lot from scientists as to the nature of scientific reasoning with limited information, variable phenomena and statistical predictions.*

*Addition: I admit that wrt the use of “fluke”, at first glance (and in my March 27, 2013 post): I wasn’t sure how they were using it, but it became obvious, upon reading them, that they intended observable bumps arising from background alone, and all I’ve explained above.

I’ll tap out too (which is not the same as tap dancing).

• West

@Corey: On the topic of communication, there is an interesting asymmetry that occurs when discussing technical issues whether it involves hypothesis testing or electroweak symmetry breaking. The specialists gnash their teeth and complain about the presentation of their subject to the general public. Non-specialists reply with the generic response, “OK, I wasn’t precisely perfectly accurate with my description, but you get what I meant. So what is the problem?”

While there are a number of sources for this conflict, I think among academics there is a certain lack of courtesy. “Do unto others as you would have them do unto you.” Mayo’s effort in this and other posts on the ATLAS/CMS Higgs search is a laudable example of what we all should do when commenting on topics outside our own specialty.

I also want to note that my office-chair psychologizing is just that, extrapolating from what I have encountered among colleagues. This is admittedly a small sample and prone to any number of biases.

• West: Thanks much*, but I really and truly mean it when I say I learn so much from listening to scientists when they’re in the midst of finding out new things and scrutinizing not-yet-settled results. This is true both in statistical and entirely non-statistical realms.

I only wish I understood more of the science.

*It’s so rare to get any kind of credit around here.

• West: I think there is a rough, fuzzy distinction to be made between telling lies-to-children and fostering misconceptions.

Your extrapolations are borne out by my own experiences.

• Corey: You know that your allegation against West here is unfounded: you have not shown in what way the physicist’s remarks, rather than yours, are erroneous (nor what lies are being told). Nor can you merely assert, as you now do (at least in the comments to this post) that an assessment of plausibility = a Bayesian posterior probability assignment to hypotheses. It’s fine if you want to say the Standard Model Higgs is plausible (it accords well with the data in all kinds of ways), but this was so prior to the recent discovery which was based on stringently rejecting the “no Higgs” hypothesis (using years of collision data).
You are free to assign the (simple) SM Higgs a high posterior degree of belief, or to offer that as a report of how well it is borne out by data, but it wouldn’t follow that that’s what physicists are doing when they reject their nulls, and estimate various quantities by ruling out some values with severity.
Moreover, rivals to the (simple) SM Higgs also predict a Higgs-like particle. Well I’m repeating myself–back to tapping out.

• Visitor

Why does Corey find that statements of “what would be observed in a large number of experimental replications” demand a Bayesian probability in a hypothesis? This is the frequentist’s world!

• West

@Corey: I have to agree with you that this particular instance falls into the “fostering misconceptions” camp, though this is based on my view of the entire post in question. If it had lacked the dismissive attitude towards all of statistical inference, regardless of the philosophical persuasion, I probably would have been more charitable in my assessment.

@Mayo: I can only give credit on subjects I can follow, which I will admit is rather limited here. I am personally a pragmatist when it comes to statistical methods, picking and choosing whichever fits the particular question I need to answer at the moment. I have little to add when the discussion becomes more abstract or philosophical in nature.

If you want to sink your teeth into a good example of a scientific collaboration pondering the problem of “what counts as a confident first detection,” I must recommend “Gravity’s Ghost and Big Dog” by the Cardiff sociologist Harry Collins. I would be interested in your (and anyone else’s) thoughts, as I personally sat through some of the meetings described within.

• Oh My! (on all 3 counts) I take it the first alludes to Strassler and not me, on the second, now you’re sounding like Strassler, on the third, well–I don’t think I want to sink that deeply into dogs, ghosts and Harry Collins’ sociology.

• Mayo: The way you reacted to “lies-to-children” suggests that you didn’t follow my (incorrectly closed) link. A lie-to-children is a simplified explanation that, although not correct in detail, creates a foundation for more complete apprehension; think Wittgenstein’s ladder.

I made no allegation against West. When I make an allegation, you’ll know it — I do so very explicitly. I believe West understood the thrust of my response: West noted that conflict between specialists and non-specialists about the appropriate way to communicate technical concepts is fairly common; I responded that my way of viewing such conflicts is in light of the distinction of lies-to-children and simplifications that foster misconceptions.

“Nor can you merely assert, as you now do (at least in the comments to this post) that an assessment of plausibility = a Bayesian posterior probability assignment to hypotheses.”

My assertion is that a Jaynesian probability distribution over hypotheses/parameters is to be read as a plausibility assessment. I do *not* claim that any and all plausibility assessments are, or can be, equivalent to Jaynesian probability distributions.

• Corey: well I truly don’t know who is allegedly telling lies that foster a foundation for complete understanding.

I’m afraid that your construal of an inference about replication being either what’s meant by, or requiring, a posterior probability to a hypothesis does not foster a foundation for improved understanding.

What I’m trying to do is explain how these notions are to be legitimately and usefully understood by grown-ups. It’s almost as if some grown-ups these days have forgotten the basic ABCs of frequentist error statistics (and imagine they see Bayesian probabilities everywhere).

• “well I truly don’t know who is allegedly telling lies that foster a foundation for complete understanding”

In elementary school I was taught that a neutral atom has a positively charged nucleus with negatively charged electrons *orbiting* it.

You write “my construal”; in a sense it is mine, since I brought it up for discussion, but in another sense it is not mine, because I don’t hold it. I claim only that certain phrasings do not do enough to rule it out. I’m glad we agree that the construal does not foster a foundation for improved understanding of error statistical warrant.

• Corey: But what are we ruling out? Certain phrasings may not do enough to rule out identifying posterior probabilities of hypotheses with Krispy Kreme donut weights–but who ever thought the weights of donuts were like posterior probabilities? Baffled.

• Mayo: If I come to you with some amazing observation I made, and you reply that it’s just a fluke, I put it to you that you are asserting something about both (i) the way the world is, and (ii) how likely an observation as amazing as mine is in light of the way the world is. Specifically, you are saying that my observation is somewhat unusual (the sense (ii) part) but not enough to overturn your current understanding of the way the world is (the sense (i) part). (Donuts, although delicious, don’t enter into it.)

And error-statistically, that is groovy and cool and presents no problems whatsoever. But problems do arise when we start talking about *probabilities of flukes* without being much more specific. In the frequentist view, probabilities can only be attached to fluke events as in sense (ii). Not everyone is aware of this, so simply talking about the probability that a fluke occurred does not do enough to rule out a construal in sense (i). That construal leads to the misapprehension that a probability is being directly attached to a claim about the way the world is.

• Corey: We never spoke of the *probabilities of flukes* but of the probability that an observable phenomenon (e.g., bumps as large as 5 sigma) is a fluke. I wrote: “The probability of observing a result as extreme as 5 sigmas, under the assumption it was generated by background alone, that is, under H0, is approximately 1 in 3,500,000. Alternatively, we hear: the ‘probability that the results were just a statistical fluke is 1 in 3,500,000’.”
We are always speaking about experimental results–e.g., patterns of excess events in the form of bumps–under conjectures or hypotheses about the bump generating procedure. We speak for ex. of the probability that such large bumps are generated by background alone, etc. So again, you’ve not addressed my point.
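The “1 in 3,500,000” figure can be reproduced, approximately, from the one-sided upper tail of a standard normal distribution. This is a sketch under that Gaussian assumption only (the actual ATLAS and CMS analyses are far more involved), and the helper name is hypothetical.

```python
import math

def upper_tail(k):
    """P(Z >= k) for a standard normal Z (one-sided tail area)."""
    return 0.5 * math.erfc(k / math.sqrt(2))

p5 = upper_tail(5.0)
# Report the tail area and its reciprocal ("1 in N").
print(f"p = {p5:.3g}, about 1 in {1 / p5:,.0f}")
```

The number computed is Pr(d(X) >= 5; H0), a probability of an outcome under H0 and not a probability assigned to H0.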

As far as inferring claims “about the way the world is”, that is the frequentist error statistician’s view of inference. The Bayesians I read (of various stripes) keep telling me they wish to make inferences only about what they believe (or how they’d bet), denying we can know about the way the world is, possibly denying there is such a thing. (For the latter, you might look at Hennig’s point about de Finetti in the current post).

• “We are always speaking about experimental results”

Right — *we* are always speaking about experimental results, but the phrasing we use does not rule out an incorrect construal by our listeners. That’s why I view your “properly understood” as overly permissive of imprecise language.

The Bayesians you read are apparently (warning: neologism ahead) “observable-ists”: they regard as meaningful only probabilities for observable quantities; probabilities on quantities that can never be observed are not considered meaningful but are permitted as aids to certain calculations, this permission being justified by de Finetti’s exchangeability theorem. (I wanted to call them “predictivists”, but a quick Googling shows that that term already has a different meaning).

Observablism is not necessarily part of the Jaynesian stance, but is not incompatible with it — I have encountered one Jaynesian who is also an observable-ist. I’m not a doctrinaire observable-ist; I think the idea can be helpful in some contexts, but that’s as far as I go.

8. I should add to my above remarks that the error statistical account always includes an indication about what is not severely indicated, even when a statistical hypothesis H’ passes with severity. We have not ruled out alternative theories of a non-SM sort (beyond the SM some call them), because they too predict SM Higgs-like phenomena.

We recognize that we have not discriminated between substantive theories that “are at a different level” from the statistical ones.

9. Dec 6, 2013: MY CONCLUDING COMMENT: At this point, I can only repeat the point raised by Elbafolk in a comment to this post. Equivocations between informal and formal uses of “probability” (as well as “likelihood” and “confidence”) are responsible for much confusion in statistical foundations. It can only add confusion (precisely of the sort I have been at pains to avoid) to equate by fiat nontechnical and technical uses of “probability”, “likelihood”, “confidence”–especially in a discussion of statistical foundations*. I agree with Birnbaum (1969) “that such usage is to be avoided as misleading in” foundational discussions of evidence.

Therefore, we should add this as a rule #3 to this blog, lest we go round and round in circles.

I have been trying to clarify the role of formal error probabilities in appraising and illuminating assessments of how well tested, and how well warranted, claims are. I have argued that this provides a rationale for error statistical methods, and is the key to avoiding fallacies and misinterpretations. One doesn’t stop with a report of the error probabilities, but uses them in a severity assessment. To simply assume error probabilities are or will be confused with posterior probabilities, perhaps because that’s what appears to follow from someone else’s philosophy of statistics, is to forfeit the possibility of breaking out of those ruts and trying out a new idea in good faith.

For a refresher both in ordinary type 1 and 2 error probabilities and the uses of error probabilities in severity assessments, please see Mayo, D. G. and Spanos, A. (2006), “Severe Testing as a Basic Concept in a Neyman-Pearson Philosophy of Induction,” British Journal for the Philosophy of Science, 57: 323-357. https://errorstatistics.files.wordpress.com/2013/12/2006mayo_spanos_severe_testing.pdf

A. Birnbaum (1969), “Concepts of Statistical Evidence,” in Philosophy, Science, and Method (eds., Morgenbesser, Suppes, and White), NY: St. Martins, 112-143.

*Giving arguments for the view is something else.

• I can’t help but feel that we spent this whole comment thread talking past each other, and it’s my own stupid fault for starting by saying, “I totally disagree” when in fact I don’t totally disagree with the post but only with one small part of it at the end. Mea culpa, mea maxima culpa.

• Corey: Thanks for that. Would you consider, then, as penance, rereading Neyman and Pearson on error probabilities, likelihood ratio tests and the like? I know that immersion in one approach–altogether necessary for getting sharp in it–can lead to overlooking/forgetting some ingredients of another approach. No need to reply.

• Mayo: Point me to something specific online and I’ll read it. I’d have grabbed a bunch of papers already, but I have trouble getting the old stuff — no institutional access.

(You may recall I mentioned having a question I would send to you by email. I’ve been exercising my brain on N-P-style statistical inference and I think I’ve worked up a satisfactory answer on my own using the Karlin-Rubin theorem, but I’ll still be sending you the question at some point.)
