# Higgs analysis and statistical flukes (part 2)

Everyone was excited when the Higgs boson results were reported on July 4, 2012 indicating evidence for a Higgs-like particle based on a “5 sigma observed effect”. The observed effect refers to the number of excess events of a given type that are “observed” in comparison to the number (or proportion) that would be expected from background alone, and not due to a Higgs particle. This continues my earlier post. This, too, is a rough outsider’s angle on one small aspect of the statistical inferences involved. (Doubtless there will be corrections.) But that, apart from being fascinated by it, is precisely why I have chosen to discuss it: we should be able to employ a general philosophy of inference to get an understanding of what is true about the controversial concepts we purport to illuminate, e.g., significance levels.

Following an official report from ATLAS, researchers define a “global signal strength” parameter “such that μ = 0 corresponds to the background only hypothesis and μ = 1 corresponds to the SM Higgs boson signal in addition to the background” (where SM is the Standard Model). The statistical test may be framed as a one-sided test, where the test statistic (which is actually a ratio) records differences in the positive direction, in standard deviation (sigma) units. Reports such as:

Pr(Test T would yield at least a 5 sigma excess; H0: background only) = extremely low

are deduced from the sampling distribution of the test statistic, fortified with much cross-checking of results (e.g., by modeling and simulating relative frequencies of observed excesses generated with “Higgs signal + background” compared to background alone). The inferences, even the formal statistical ones, go beyond p-value reports. For instance, they involve setting lower and upper bounds such that values excluded are ruled out with high severity, to use my term. But the popular report is in terms of the observed 5 sigma excess in an overall test T, and that is mainly what I want to consider here.

Error probabilities

In a Neyman-Pearson setting, a cut-off cα is chosen pre-data so that the probability of a type I error is low. In general,

Pr(d(X) > cα; H0) ≤ α

and in particular, alluding to an overall test T:

(1) Pr(Test T yields d(X) > 5 standard deviations; H0) ≤  .0000003.
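To see where a bound like the one in (1) comes from, here is a minimal sketch that converts a number of sigmas into a one-sided tail probability. It assumes the test statistic is standard Normal under H0, which is a simplification for illustration and not a description of the actual ATLAS or CMS computation; the function name `one_sided_p_value` is my own.

```python
import math

def one_sided_p_value(n_sigma: float) -> float:
    """Upper-tail probability Pr(Z > n_sigma) for a standard Normal Z.

    Assumption (illustrative only): d(X) is standard Normal under H0.
    erfc is used instead of 1 - cdf to keep precision far out in the tail.
    """
    return 0.5 * math.erfc(n_sigma / math.sqrt(2.0))

# Tail probabilities for a few commonly quoted thresholds
for n in (2, 3, 5, 7):
    print(f"{n} sigma -> one-sided p = {one_sided_p_value(n):.3g}")
```

Under this assumption, 5 sigma corresponds to a tail probability of roughly 2.9 in ten million, which is in line with the bound of .0000003 in (1).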

The test at the same time is designed to ensure a reasonably high probability of detecting global strength discrepancies of interest. (I always use “discrepancy” to refer to parameter magnitudes, to avoid confusion with observed differences).

Alternatively, researchers can report observed standard deviations (here, the sigmas), or equivalently, the associated observed statistical significance probability, p0. In general,

Pr(P < p0; H0) < p0

and in particular,

(2) Pr(Test T yields P < .0000003; H0) < .0000003.

For test T to yield a “worse fit” with H0 (smaller p-value) due to background alone is sometimes called “a statistical fluke” or a “random fluke”, and the probability of so statistically significant a random fluke is ~0. With the March 2013 results, the 5 sigma difference has grown to 7 sigmas.

So probabilistic statements along the lines of (1) and (2) are standard. They allude to sampling distributions, either of test statistic d(X), or the P-value viewed as a random variable. They are scarcely illicit or prohibited. (I return to this in the last section of this post).
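The idea of the P-value as a random variable can be made concrete with a small simulation. It is a sketch under stated assumptions: d(X) is drawn as standard Normal under H0, and the thresholds are far larger than .0000003 so that the relative frequencies can be estimated with a modest number of trials. The names `p_value` and `fluke_rate` are hypothetical.

```python
import math
import random

def p_value(d: float) -> float:
    """One-sided p-value for an observed d, assuming d(X) ~ N(0, 1) under H0."""
    return 0.5 * math.erfc(d / math.sqrt(2.0))

def fluke_rate(p0: float, n_trials: int = 100_000, seed: int = 1) -> float:
    """Relative frequency of P < p0 when the data are generated under H0."""
    rng = random.Random(seed)
    hits = sum(p_value(rng.gauss(0.0, 1.0)) < p0 for _ in range(n_trials))
    return hits / n_trials

# Under H0 the relative frequency of P < p0 should be close to p0 itself
for p0 in (0.10, 0.05, 0.01):
    print(f"p0 = {p0}: simulated Pr(P < p0; H0) = {fluke_rate(p0):.4f}")
```

The simulated frequencies track p0, which is all that statements like (2) assert: they are claims about how often the test would produce so small a p-value under H0, not probabilities of H0.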

An implicit principle of inference or evidence

Admittedly, the move to taking the 5 sigma effect as evidence for a genuine effect (of the Higgs-like sort) results from an implicit principle of evidence that I have been calling the severity principle (SEV). Perhaps its weakest form concerns a statistical rejection or falsification of the null. (I will deliberately use a few different variations on statements that can be made.)

Data x from a test T provide evidence for rejecting H0 (just) to the extent that H0 would (very probably) have survived, were it a reasonably adequate description of the process generating the data (with respect to the question).

It is also captured by a general frequentist principle of evidence (FEV) (Mayo and Cox 2010), a variant on the general idea of severity (SEV) (EGEK 1996, Mayo and Spanos 2006, etc.).

The sampling distribution is computed under the assumption that the production of observed results is similar to the “background alone”, with respect to relative frequencies of signal-like events. (Likewise for computations under hypothesized discrepancies.) The relationship between H0 and the probabilities of outcomes is an intimate one: the various statistical nulls live their lives to refer to aspects of general types of data generating procedures (for a taxonomy, see Cox 1958, 1977). “H0 is true” is a shorthand for a very long statement that H0 is an approximately adequate model of a specified aspect of the process generating the data in the context. (This relates to statistical models and hypotheses living “lives of their own”.)

Severity and the detachment of inferences

The sampling distributions serve to give counterfactuals. In this case they tell us what it would be like, statistically, were the mechanism generating the observed signals similar to H0.[i] While one would want to go on to consider the probability test T yields so statistically significant an excess under various alternatives to μ = 0, this suffices for the present discussion. Sampling distributions can be used to arrive at error probabilities that are relevant for understanding the capabilities of the test process, in relation to something we want to find out. Since a relevant test statistic is a function of the data and quantities about which we want to learn, the associated sampling distribution is the key to inference. (This is why bootstrap, and other types of, resampling works when one has a random sample from the process or population of interest.)

The severity principle, put more generally:

Data from a test T[ii] provide good evidence for inferring H (just) to the extent that H passes severely with x0, i.e., to the extent that H would (very probably) not have survived the test so well were H false.

(The severity principle can also be made out just in terms of relative frequencies, as with bootstrap re-sampling.) In this case, what is surviving is minimally the non-null. Regardless of the specification of a statistical inference, to assess the severity associated with a claim H requires considering H’s denial: together they exhaust the answers to a given question.

Without making such a principle explicit, some critics assume the argument is all about the reported p-value. The inference actually detached from the evidence can be put in any number of ways, and no uniformity is to be expected or needed:

(3) There is strong evidence for H: a Higgs (or a Higgs-like) particle.

(3)’ They have experimentally demonstrated  H: a Higgs (or Higgs-like) particle.

Or just, infer H.

Doubtless particle physicists would qualify these statements, but nothing turns on that. ((3) and (3)’ are a bit stronger than merely falsifying the null because certain properties of the particle must be shown. I leave this to one side.)

As always, the mere p-value is a pale reflection of the detailed information about the consistency of results that really fortifies the knowledge of a genuine effect. Nor is the precise improbability level what matters. We care about the inferences to real effects (and estimated discrepancies) that are warranted.

Qualifying claims by how well they have been probed

The inference is qualified by the statistical properties of the test, as in (1) and (2), but that does not prevent detaching (3). This much is shown: they are able to experimentally demonstrate the Higgs particle. They can take that much of the problem as solved and move on to other problems of discerning the properties of the particle, and much else that goes beyond our discussion*. There is obeisance to the strict fallibility of every empirical claim, but there is no probability assigned.  Neither is there in day-to-day reasoning, nor in the bulk of scientific inferences, which are not formally statistical. Having inferred (3), granted, one may say informally, “so probably we have experimentally demonstrated the Higgs”, or “probably, the Higgs exists” (?). Or an informal use of “likely” might arise. But whatever these might mean in informal parlance, they are not formal mathematical probabilities. (As often argued on this blog, discussions on statistical philosophy must not confuse these.)

[We can however write, SEV(H) ~1]
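For readers who want to see a severity number computed, here is a rough sketch for the weaker claim that there is a positive discrepancy from H0. It assumes a toy model in which d(X) is Normal with unit standard deviation and mean equal to the discrepancy measured in sigma units; this is my simplification for illustration, not the physicists' model, and the function name is hypothetical.

```python
from statistics import NormalDist

def severity_greater_than(d_obs: float, delta: float) -> float:
    """SEV(discrepancy > delta) given observed d_obs.

    Toy assumption: d(X) ~ N(delta, 1) when the true discrepancy is delta
    (in sigma units). Severity is then Pr(d(X) <= d_obs; discrepancy = delta),
    the probability of a less impressive result were the claim false.
    """
    return NormalDist(mu=delta, sigma=1.0).cdf(d_obs)

d_obs = 5.0  # a 5 sigma observed excess
for delta in (0.0, 3.0, 5.0, 6.0):
    sev = severity_greater_than(d_obs, delta)
    print(f"SEV(discrepancy > {delta}) = {sev:.7f}")
```

In this toy setting the claim of some positive discrepancy passes with severity near 1, while claims of discrepancies as large as, or larger than, the observed 5 sigma are poorly warranted (severity 0.5 or less).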

The claim in (3) is approximate and limited–as are the vast majority of claims of empirical knowledge and inference–and, moreover, we can say in just what ways. It is recognized that subsequent data will add precision to the magnitudes estimated, and may eventually lead to new and even entirely revised interpretations of the known experimental effects, models and estimates. That is what cumulative knowledge is about. (I sometimes hear people assert, without argument, that modeled quantities, or parameters, used to describe data generating processes are “things in themselves” and are outside the realm of empirical inquiry. This is silly. Else we’d be reduced to knowing only tautologies and maybe isolated instances as to how “I seem to feel now,” attained through introspection.)

Telling what’s true about significance levels

So we grant the critic that something like the severity principle is needed to move from statistical information plus background (theoretical and empirical) to inferences about evidence and inference (and to what levels of approximation). It may be called lots of other things and framed in different ways, and the reader is free to experiment. What we should not grant the critic is any allegation that there should be, or invariably is, a link from a small observed significance level to a small posterior probability assignment to H0. Worse, (1 − the p-value) is sometimes alleged to be the posterior probability accorded to the Standard Model itself! This is neither licensed nor wanted!

If critics (or the p-value police, as Wasserman called them) maintain that Higgs researchers are misinterpreting their significance levels, correct them with the probabilities in (1) and (2). If they say, it is patently obvious that Higgs researchers want to use the p-value as a posterior probability assignment to H0, point out the more relevant and actually attainable [iii] inference that is detached in (3). If they persist that what is really, really wanted is a posterior probability assignment to the inference about the Higgs in (3), ask why? As a formal posterior probability it would require a prior probability on all hypotheses that could explain the data. That would include not just H and H0 but all rivals to the Standard Model, rivals to the data and statistical models, and higher level theories as well. But can’t we just imagine a Bayesian catchall hypothesis?  On paper, maybe, but where will we get these probabilities? What do any of them mean? How can the probabilities even be comparable in different data analyses, using different catchalls and different priors?[iv]

Degrees of belief will not do. Many scientists perhaps had (and have) strong beliefs in the Standard Model before the big collider experiments—given its perfect predictive success. Others may believe (and fervently wish) that it will break down somewhere (showing supersymmetry or whatnot); a major goal of inquiry is learning about viable rivals and how they may be triggered and probed. Research requires an open world not a closed one with all possibilities trotted out and weighed by current beliefs. [v] We need to point up what has not yet been well probed which, by the way, is very different from saying of a theory that it is “not yet probable”.

Those prohibited phrases

One may wish to return to some of the condemned phrases of particular physics reports. Take,

“There is less than a one in a million chance that their results are a statistical fluke”.

This is not to assign a probability to the null, just one of many ways (perhaps not the best) of putting claims about the sampling distribution: The statistical null asserts that H0: background alone adequately describes the process.

H0 does not assert the results are a statistical fluke, but it tells us what we need to determine the probability of observed results “under H0”. In particular, consider all outcomes in the sample space that are further from the null prediction than the observed, in terms of p-values {x: p < p0}. Even when H0 is true, such “signal like” outcomes may occur. They are p0 level flukes. Were such flukes generated even with moderate frequency under H0, they would not be evidence against H0. But in this case, such flukes occur a teeny tiny proportion of the time. Then SEV enters: if we are regularly able to generate such teeny tiny p-values, we have evidence of a genuine discrepancy from H0.

I am repeating myself, I realize, in the hope that at least one phrasing will drive the point home. Nor is it even the improbability that substantiates this; it is the fact that an extraordinary set of coincidences would have to have occurred again and again. To nevertheless retain H0 as the source of the data would block learning. (Moreover, they know that if some horrible systematic mistake was made, it would be detected in later data analyses.)

I will not deny that there have been misinterpretations of p-values, but if a researcher has just described performing a statistical significance test, it would be “ungenerous” to twist probabilistic assertions into posterior probabilities. It would be a kind of “confirmation bias” whereby one insists on finding one sentence among very many that could conceivably be misinterpreted Bayesianly.

Triggering, indicating, inferring

As always, the error statistical philosopher would distinguish different questions at multiple stages of the inquiry. The aim of many preliminary steps is “behavioristic” and performance oriented: the goal being to control error rates on the way toward finding excess events or bumps of interest.

To be continued: See statistical flukes (part 3)

*Fisher insisted that to assert a phenomenon is experimentally demonstrable: [W]e need, not an isolated record, but a reliable method of procedure. In relation to the test of significance, we may say that a phenomenon is experimentally demonstrable when we know how to conduct an experiment which will rarely fail to give us a statistically significant result. (Fisher, The Design of Experiments, 1947, 14).

REFERENCES:

ATLAS Collaboration  (November 14, 2012),  Atlas Note: “Updated ATLAS results on the signal strength of the Higgs-like boson for decays into WW and heavy fermion final states”, ATLAS-CONF-2012-162. http://cds.cern.ch/record/1494183/files/ATLAS-CONF-2012-162.pdf

Cox, D.R. (1958), “Some Problems Connected with Statistical Inference,” Annals of Mathematical Statistics, 29: 357–72.

Cox, D.R. (1977), “The Role of Significance Tests (with Discussion),” Scandinavian Journal of Statistics, 4: 49–70.

Mayo, D.G. (1996), Error and the Growth of Experimental Knowledge, University of Chicago Press, Chicago.

Mayo, D.G. and Cox, D.R. (2010), “Frequentist Statistics as a Theory of Inductive Inference,” in Error and Inference: Recent Exchanges on Experimental Reasoning, Reliability and the Objectivity and Rationality of Science (D. Mayo and A. Spanos eds.), Cambridge: Cambridge University Press: 247–275.

Mayo, D.G., and Spanos, A. (2006), “Severe Testing as a Basic Concept in a Neyman-Pearson Philosophy of Induction,” British Journal for the Philosophy of Science, 57: 323–357.

___________

[i] This is a bit stronger than merely falsifying the null here, because certain features of the particle discerned must also be shown. I leave details to one side.

[ii] Which almost always refers to a set of tests, not just one.

[iii] I sense that some Bayesians imagine P(H) is more “hedged” than to actually infer (3). But the relevant hedging, the type we can actually attain, is  given by an assessment of severity or corroboration or the like. Background enters via a repertoire of information about experimental designs, data analytic techniques, mistakes and flaws to be wary of, and a host of theories and indications about which aspects have/have not been severely probed. Many background claims enter to substantiate the error probabilities; others do not alter them.

[iv] In aspects of the modeling, researchers make use of known relative frequencies of events (e.g., rates of types of collisions) that lead to legitimate, empirically based, frequentist “priors” if one wants to call them that.

[v] After sending out the letter, prompted by Lindley, O’Hagan wrote up a synthesis https://errorstatistics.com/2012/08/25/did-higgs-physicists-miss-an-opportunity-by-not-consulting-more-with-statisticians/


### 33 thoughts on “Higgs analysis and statistical flukes (part 2)”

1. >As a formal posterior probability it would require a prior probability on all hypotheses that could explain the data.

Yes!

> If they say, it is patently obvious that Higgs researchers want to use the p-value as a posterior probability assignment to H0, point out the more relevant and actually attainable [iii] inference that is detached in (3).

At risk of making an unproductive side comment, p-values follow from the distribution of test statistic under H0, i.e., they follow from the prior for t under H0. I’m not clear on how one could legitimately associate a p-value with a posterior of anything.

• Chris. The distribution of the test statistic under Ho is not a prior, maybe that was a slip. As you say, it’s erroneous to infer posterior probabilities of hypotheses such as “the Higgs exists” from a p-value, but I tried to emphasize that it is also not the slightest bit desirable to do so, in any of the interpretations of formal posteriors that I’ve ever seen. We want to make inferences such as (3), (3)’, based on sampling distributions or their counterparts. But I admit a principle that is left implicit is behind the move from warranted p-values to detaching claims about what is and is not learned.

Instead, we have about a zillion articles published each year showing how one can “live with” p-values by finding priors that enable posteriors to be deduced (the blog can be searched for examples).

• What was going through my head was, “It doesn’t make any sense to me to think of a p-value as a posterior probability. Can I imagine a perspective from which I could think of it in terms of a prior?” I thought I had something but then, given a few more minutes to think about it, realized I didn’t. Pardon the distraction.

Back to more important matters…

> We want to make inferences such as (3), (3)’, based on sampling distributions or their counterparts.

Yes, absolutely. One wants (needs) to show not only that the data is more consistent with H_i than the alternatives but that the model associated with H_i is also a good fit to the data. One must be careful that when declaring “It’s H_i!” that you’re not doing so on the basis of the least bad of a bunch of horrendous fits to the data. I’ve seen it done and, unfortunately, have inadvertently done so myself on occasion. Checking a test statistic for your favored hypothesis, e.g., the RSS value for a regression analysis if you believe you know the noise/measurement uncertainty in your data or maybe the Durbin-Watson statistic if you don’t, can help identify lousy fits.

2. Kent Staley

I think these kinds of papers are fascinating for seeing how a good experimental argument can supplement an overall statistical assessment like a p-value with detailed probing of the data for signs of possible trouble. To cite just one example from the Higgs search, based on the ATLAS Higgs results: They not only give limits on the parameter \mu (for a 126 GeV Higgs), but they also show estimates of that parameter based on data from the separate decay channels in order to show that there is (good enough) consistency in the evidence. They did this in the July 2012 paper, and they have updated this table in the more recent update that you link to. The estimates have become more consistent as the data set has grown.

• Hi Kent. Maybe you can answer a question I have. After talking with Matt Strassler last week, I began to wonder if maybe the worry in relation to disappearing “bumps” only (or mostly?) concerns those thought to be indicative of effects in violation of the Standard Model.
Please explain what you think there is “good enough consistency” of. I distinguish the two inferences in part 1. I’m thinking maybe you are referring to the predictions of the simple (Standard Model) Higgs. It sounds too weak as a description of the evidence for a Higgs particle of some type. I may be saying this sloppily, but hopefully you’ll get my drift.

3. Another thing: Whenever I see assertions of an n sigma event (where n is a large number) I wonder how well the sampling distribution is known. Do they have data to support that claim or is the assertion based on an extrapolation from limited observations – say from a best-fit model to the data? Nature seems to love heavy-tailed distributions. Best to tread carefully if attempting to extrapolate from a finite-sample distribution.
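The worry about tails can be illustrated with a small sketch. It compares the probability of exceeding 5 standard deviations under a standard Normal with the same probability under a unit-variance Laplace (double exponential) distribution. The Laplace is chosen purely as a convenient heavier-tailed stand-in with a closed-form tail; nothing here is taken from the Higgs analyses, and the function names are hypothetical.

```python
import math

def normal_tail(x: float) -> float:
    """Pr(Z > x) for a standard Normal Z."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def laplace_tail(x: float) -> float:
    """Pr(X > x), x >= 0, for a Laplace distribution with mean 0 and variance 1.

    A Laplace with scale b has variance 2*b**2, so unit variance means
    b = 1/sqrt(2), and the upper tail is 0.5 * exp(-x / b).
    """
    b = 1.0 / math.sqrt(2.0)
    return 0.5 * math.exp(-x / b)

x = 5.0  # five standard deviations in both cases
ratio = laplace_tail(x) / normal_tail(x)
print(f"Normal tail:  {normal_tail(x):.3g}")
print(f"Laplace tail: {laplace_tail(x):.3g}")
print(f"Ratio Laplace/Normal: {ratio:.0f}")
```

With the same variance, the heavier-tailed distribution makes a "5 sigma" excursion more than a thousand times as probable, which is why the shape of the sampling distribution far from its center matters for claims of this kind.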

4. Kent Staley

The particular point you raised earlier about a disappearing bump concerned the fact that there were more events in the Higgs to two-photon decay channel than predicted for a Standard Model Higgs with mass of 126 GeV. This can be seen, for example, in CMS’s July 2012 Higgs paper in Figure 2 (sorry I can’t embed it in my comment!). It shows the p value of the results in the H>gamma gamma channel as a function of m_H, as well as the expected p value for an SM Higgs. At about 125-126, there is a dip of the observed value way below that expected for SM Higgs. In the ATLAS paper of July 2012 a similar discrepancy is shown in Figure 8. So I assume this is what Strassler was discussing.

There is something interesting about this, in that the peak in the H > gamma gamma distribution was considered by some to be the most compelling plot in favor of the July 2012 announcement. Joe Incandela told me that when he showed that plot “the whole audience gasped.” That might seem to contradict the business about a possible violation of the predictions for a SM Higgs, but it isn’t really. The H > gamma gamma and H > ZZ channels were the “high resolution” channels that were expected to give the cleanest signal. If a greater than expected excess in H > gamma gamma had persisted it would just have meant that they had found something even more interesting than the SM Higgs.

As for “good enough consistency”, what I mean is that ATLAS presents (again in their July 2012 paper) estimates of signal strength based on the different decay channels, and, at least when they combine the 7 TeV and 8 TeV data, they agree within the rather broad errors on those estimates. (The estimate from H > gamma gamma is, not surprisingly, higher, 1.8 +/- 0.5, while H > ZZ and H > WW are 1.4 +/- 0.6 and 1.3 +/- 0.5, respectively.) This is a useful check on the claim that this is evidence of a new boson with Higgs-like properties, which requires that there be good evidence that it decays in just these ways.
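A back-of-the-envelope version of this consistency check can be sketched as follows. It treats the three quoted estimates as independent with symmetric Gaussian errors and combines them by inverse-variance weighting. That is an assumption made for illustration, not the likelihood-based combination the collaboration actually performs, and the function name `combine` is hypothetical.

```python
import math

# (estimate, one-sigma error) per decay channel, as quoted in the comment
channels = {
    "gamma gamma": (1.8, 0.5),
    "ZZ": (1.4, 0.6),
    "WW": (1.3, 0.5),
}

def combine(measurements):
    """Inverse-variance weighted mean, its error, chi-square, and dof.

    Assumes independent measurements with Gaussian errors (illustrative only).
    """
    weights = [1.0 / err**2 for _, err in measurements]
    total = sum(weights)
    mean = sum(w * est for w, (est, _) in zip(weights, measurements)) / total
    chi2 = sum(w * (est - mean) ** 2 for w, (est, _) in zip(weights, measurements))
    return mean, 1.0 / math.sqrt(total), chi2, len(measurements) - 1

mean, err, chi2, dof = combine(list(channels.values()))
print(f"combined signal strength = {mean:.2f} +/- {err:.2f}")
print(f"chi-square = {chi2:.2f} on {dof} degrees of freedom")
if dof == 2:
    # For exactly 2 degrees of freedom the chi-square tail is exp(-chi2 / 2)
    print(f"Pr(chi-square >= observed) = {math.exp(-chi2 / 2.0):.2f}")
```

Under these assumptions the chi-square is well below its degrees of freedom, so the spread among channels is no larger than their stated errors would lead one to expect.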

5. Kent: Maybe you can link it? (or send it with your post?)
Strassler wasn’t talking specifically about my query, but it occurred to me after speaking to him that maybe the disappearing bumps only (or mainly?) concerned those indicative of anomalies for the SM Higgs.
As for your remark about the “most compelling plot in favor of the July 2012 announcement”, do you mean that it was strong evidence for the SM Higgs, but also at the same time showed some possible indications of “something even more interesting than the SM Higgs”? I have only an outsider’s awareness of this work.

6. SEV assignments. Someone asked me why I didn’t give a corresponding SEV assessment. One can, of course, report SEV(not-Ho) ~1,
And also, with the current data SEV(H:Higgs particle) ~1. So I added that to the post.
As always, this is a shorthand for the full SEV(test T, claim H, data x).

7. Kent Staley

The July ’12 Higgs papers are at http://arxiv.org/abs/1207.7214v1 (ATLAS) and http://arxiv.org/abs/1207.7235v1 (CMS).

The most cautious statement of what the H > gamma gamma plot was compelling evidence for (in the context of the other plots being shown) would be: there is a neutral boson that has not previously been observed. I suspect that the excitement at the July ’12 announcement was prompted by a less cautious sense that it was a Higgs boson that they were seeing. Noticing that they were seeing more decays in that channel than predicted by the SM Higgs hypothesis might have been a sign that it was not exactly the SM Higgs.

Which is to say that I think you have correctly interpreted the worry about the “bumps” as potential anomalies for the SM Higgs. It is interesting that this “statistical fluke” (as is now believed based on larger data sets) showed up in the data of both experiments.

In one of his posts on this (http://profmattstrassler.com/articles-and-posts/the-higgs-particle/the-discovery-of-the-higgs/higgs-discovery-is-it-a-simplest-higgs/) Strassler also makes the important observation that the excess relative to the expectations for a SM Higgs assumes that the H production rate in the SM has been correctly estimated, and that this is a very difficult calculation to carry out. I don’t know whether any of the subsequent disappearance of the excess is due to possible revisions in the way that they are modeling the signal for an SM Higgs.

• Kent: Thanks for the link, I missed some of this post, and it’s quite interesting. Do you plan to study, or have you already studied, this case in the detail you gave to the top quark?

• Kent. Do you know the sample sizes?

8. Corey

“As a formal posterior probability it would require a prior probability on all hypotheses that could explain the data. That would include not just H and H0 but all rivals to the Standard Model, rivals to the data and statistical models, and higher level theories as well.”

In principle, yes; in practice, no.

• Corey: Sure, but what do you do in practice?

• Corey

Mayo: Me personally? The problems I’ve spent most of my effort on had lots of data with rich and thoroughly known structure. I mostly needed my priors to constrain inferences in a pretty inarguably reasonable way. I don’t have any experience in physics problems — you’d best ask a Bayesian physicist or two.

• Corey: thanks for the refs, though I’m not at all sure they are using “Bayesian” in the way you are. There are quite a lot of frequentist priors in these many examples of modeling in physics–perhaps all.
Of course, I was alluding to your “in practice” remark. I realize there’s a lot of disingenuity in Bayesian practice (readers who haven’t seen the deconstruction of Stephen Senn, and U-Phils, might search this blog).

But, perhaps more constructively, you were questioning what insights could emerge from illuminating the meaning of “under H” vs “conditional on H” (in part 1), and my discussion here turns on this. On the other hand, maybe you and I shouldn’t go there; but I hope some get the drift. It’s not so obvious, and language, I admit, is ambiguous.

• Corey

Mayo: Your text that I quoted describes an impossible requirement for Bayesian model selection — and yet, physicists find ways to put Bayesian model selection into practice. That’s all I’m pointing out — it’s not really on me (and in fact would be presumptuous of me to attempt) to tell you how they do it.

By “frequentist priors” I’m guessing you mean probability matching priors. Such priors are usually improper, which makes them impossible to use for model selection unless the models nest. Also, they only exist for a fairly limited set of likelihoods, and there’s no reason to suppose that physics models typically induce such likelihoods. At least one paper on the list uses priors not motivated by probability matching.

“Disingenuity”, huh? That’s rich coming from one who supports the philosophical worth of the frequentist perspective by pointing to the fact that practicing scientists use it and seem to find it adequate — ignoring that scientists have relied on predominantly frequentism-indoctrinated statistical consultants to help them grapple with noisy data for the past eighty years or so. What you call “disingenuity” seems to me to stem from the (regrettable) fact that a near-continuum of perspectives and practices use the label. As for Senn’s deconstruction, it doesn’t touch on my Jaynesian foundations; if it were actually directed at those of my school of thought it would be an extended no-true-Scotsman fallacy.

• Corey: No I didn’t mean frequentist matching, I meant actual empirical relative frequency priors. I never “support the philosophical worth of the frequentist perspective by pointing to the fact that practicing scientists use it.” I have developed a philosophy of science that, as a subpart, provides an account that explicates and justifies using error statistical methods to find things out. I move away from all misuses and misinterpretations that have led to so much criticism as well as to 4-8 howlers repeated verbatim in textbooks as if they are knock down refutations of the error statistical method. I seem to even recall your saying (on Normal Deviant’s blog, I believe) that I have provided the best(?) or a plausible (I forget) philosophy for frequentist statistics. It is not justified by an appeal to numbers. But if someone did point out the methods are used to control and evaluate errors in inquiry, it would not show any disingenuity.
The disingenuity referred to a type of Bayesianism that I thought was being discussed, but it is true more generally with the move away from classical subjective/personalistic Bayesianism. This is discussed on this blog. E.g., the conflicting uses and interpretations of priors (to introduce background beliefs into the interpretation of evidence, but also to have the most minimal effects on the interpretation of evidence; priors represent personal beliefs, attained by introspection, but they are also to be ‘tested’ and changed with data; updating and downdating; uncertain hypotheses are to be quantified with a probability, but not so for uncertain models or data which are just accepted or given). These are just off the top of my head; for fuller discussions, readers might check the RMM papers*, and the index to this blog (or ask me).
And you’ve never properly explicated or defended your Jaynesian view, whereas I have explicated and defended my error statistical philosophy. You’ve said your account follows if you accept the axioms….
p. 104

• Corey

Mayo: Empirical relative frequency priors? No, I don’t believe so; certainly not for the one paper I linked.

I have certainly come away from your writing with the impression that pointing to what scientists actually do was a piece of your argument. If my claim turns out to be a tendentious misreading of your text, I will certainly apologize. I need to check.

I think I wrote something to the effect that your philosophy was the only one I have encountered that could possibly put frequentist procedures on a sound footing; I stand by that. Right now I lean negative on the notion that any philosophy can put frequentist procedures on a sound footing. Informal severity/severity-in-the-large is fine; severity qua assessment of simple statistical hypotheses suffers from the same flaw as p-values, to wit, consideration of tail areas.

It’s true that I haven’t laid out my Jaynesian viewpoint properly, but most of it is here in the comments on various posts, piecemeal as it were. Aggregating it might be a worthwhile project for me.

• Caring to explicate why certain applications of an inference account work is not the same as “an appeal to numbers” in justifying that account.

The Bayesian doesn’t like the “tail area” because it considers outcomes other than the one observed, i.e., the sampling distribution; but for error statisticians, the sampling distribution is crucial. For N-P, the tail area falls out from controlling type 1 and 2 errors, with a sensible test statistic. Tail areas were a consequence, not a cause, e.g., in certain tests beginning with likelihood ratios. This is the case for corresponding CIs too. For example, a one-sided test of a Normal mean m, m = m0 vs. m < m0, corresponds to setting a one-sided upper bound for m, CI-upper (a value in the confidence interval is accepted by the corresponding test at the associated level). The test with a “tail area” is simply a consequence of the goal to ensure a minimal probability that CI-upper exceeds m’, where m’ exceeds m, subject to the requirement that, with high probability, CI-upper exceeds the true value of m. (The notation kept disappearing in the comment, so I wrote it the best I could in words; see p. 190 of Cox and Mayo 2010.) E.S. Pearson makes this point in being challenged on tail areas.
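
The correspondence between the one-sided test and CI-upper can be checked numerically. What follows is only a minimal sketch, assuming a Normal mean with known standard deviation; the function names (`ci_upper`, `rejects`) and all the numbers are illustrative assumptions, not taken from Cox and Mayo 2010.

```python
from math import sqrt
from statistics import NormalDist


def ci_upper(xbar, sigma, n, alpha):
    """One-sided upper confidence bound for a Normal mean (sigma assumed known)."""
    z = NormalDist().inv_cdf(1 - alpha)
    return xbar + z * sigma / sqrt(n)


def rejects(xbar, m0, sigma, n, alpha):
    """Test of m = m0 vs. m < m0: reject when the lower tail area is at most alpha."""
    z_obs = (xbar - m0) / (sigma / sqrt(n))
    tail_area = NormalDist().cdf(z_obs)  # probability of a sample mean this low or lower under m0
    return tail_area <= alpha


# Hypothetical data summary: sample mean 9.5, sigma 2, n = 25, level 0.05.
xbar, sigma, n, alpha = 9.5, 2.0, 25, 0.05
bound = ci_upper(xbar, sigma, n, alpha)

# Values of m0 below the bound are accepted by the test; values above it are rejected.
for m0 in (9.0, 10.0, 10.5, 11.0):
    print(m0, "rejected" if rejects(xbar, m0, sigma, n, alpha) else "accepted", m0 < bound)
```

Under these assumptions, the set of m0 values the test fails to reject is exactly the set lying below CI-upper, which is the sense in which the tail area is a consequence of the bound rather than its starting point.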

Freedom from considering the sampling distribution is freedom from considering effects of data-dependent selections, multiple testing, optional stopping, etc. and, in my terms, prevents controlling and evaluating the severity of tests.
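
To see why the sampling distribution matters for something like optional stopping, here is a rough simulation sketch. It assumes standard Normal data with the null hypothesis true, a two-sided 1.96 cutoff, and a hypothetical rule of testing after every observation up to a maximum of 50; the function name and all settings are assumptions made for illustration only.

```python
import random
from math import sqrt


def rejection_rate(max_n, peek, trials=2000, z_crit=1.96, seed=0):
    """Fraction of null-true experiments that reach |z| >= z_crit.

    peek=False: test once, at n = max_n.
    peek=True:  test after every observation and stop at the first rejection.
    """
    rng = random.Random(seed)
    rejections = 0
    for _ in range(trials):
        total = 0.0
        for n in range(1, max_n + 1):
            total += rng.gauss(0.0, 1.0)  # data generated with the null true
            look = peek or n == max_n
            if look and abs(total / sqrt(n)) >= z_crit:
                rejections += 1
                break
    return rejections / trials


fixed = rejection_rate(50, peek=False)
stopped = rejection_rate(50, peek=True)
print("fixed sample size:", fixed)
print("optional stopping:", stopped)
```

The final data in a stopped experiment can look identical to data from a fixed-sample experiment, yet the probability of erroneously reaching the cutoff differs between the two procedures; an account that ignores the sampling distribution cannot register that difference.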

Yes, you should go back and aggregate the points you’ve made on the blog and see what kind of justification for Jaynes you get.

You wrote: “I think I wrote something to the effect that your philosophy was the only one I have encountered that could possibly put frequentist procedures on a sound footing; I stand by that.” I’m curious as to why I deserve this honor, Corey bear*.

*This refers to a dialogue Corey supplied in commenting on a post “bad news bears” (Aug 6, 2012):

• Corey

Mayo: It was always obvious no competent frequentist statistician would use a procedure criticized by the howlers; the problem was that I had never seen a compelling explanation why (beyond “that’s obviously stupid”). So you deserve the honor for putting forth a single principle from which error statistical procedures flow that refutes all of the howlers at once.

• Corey: Wow, that’s a big concession even coupled with your remaining doubts….maybe I should highlight this portion of our exchange for our patient readers, looking for any sign of progress…

• Corey

Mayo: Feel free to highlight it. I will point out that this “concession” shouldn’t be news to you: in an email I sent you on September 11, 2012, I wrote, ‘I now appreciate how the severity-based approach fully addresses all the typical criticisms offered during “Bayesian comedy hour”. Now, when I encounter these canards in Bayesian writings, I feel chagrin that they are being propagated; I certainly shall not be repeating them myself.’

But I understand how a busy person might miss statements like this in an email.

• O.B.

Data in the region of highest expected decay events, as of 2011, are blocked out while they model the background for the 2012 analysis. They call this blinding. Peeking at the effects that tinkering has on the expected region is prohibited until after the background is fixed. A Bayesian would naturally include the high prior (in the expected area) in the data analysis, permitting the bias that is supposed to be avoided by blinding.

9. Nicole Jinn

You say: “a major goal of inquiry is learning about viable rivals and how they may be triggered and probed. Research requires an open world not a closed one with all possibilities trotted out and weighed by current beliefs. [v]”

A question I have is: what role does truth play in this “open world” that you advocate? The reason for asking this question is: as I read papers that address the aim(s) of science, it seems that science seeks ‘important’ truth, which is supposedly more than ‘explanatory’ truth, however these terms are to be interpreted. And I want to know how you interpret ‘important’ (or ‘explanatory’) truth.

• Nicole:

Perhaps the quickest way to respond effectively is with a couple of posts. I’ve talked a fair amount about the “all models are false” refrain. An example is:
One that connects truth/idealization and types of Bayesians vs error statisticians is here:
https://errorstatistics.com/2012/10/18/query/
It resulted in 50 comments, mostly between Hennig and me.

In the current post, too, you see my explication of “H0 is true” as an abbreviation for a detailed claim about adequacy of a model for a purpose. I do agree that trivial truth and ‘probable claims’ are not very interesting, and differ greatly from highly corroborated hypotheses. Not sure how you mean “important”. Many philosophers appeal to a variety of so-called non-cognitive or pragmatic adjectives for aims because highly probable claims are so boring.

• Nicole Jinn

Thank you for answering my question – I will be sure to read the blog posts that you referred to. However, I admit my lack of familiarity with the literature relevant to ‘aims of science’ (or ‘values in science’); hence, I (provisionally) think that ‘important’ truth is meant to be the opposite of (or at least distinguished from) ‘trivial’ truth, thus signaling that there are certain values (e.g., social, political) that are commonplace when performing science.

• Nicole: Are you saying ‘important truth’ indicates social-political values or the opposite? Rather than guessing, try to be clearer in whatever view you are asking about. If there is a citation, indicate that.

• Nicole Jinn

Well, I am actively trying to make myself more familiar with a topic I had not engaged with until a few days ago. For this reason, I sincerely apologize for bringing up the terms/phrases ‘important’ truth and ‘explanatory’ truth in the first place. Please note that I am not holding a specific view (yet) and only wanted to ask what your take is on the role (explanatory or important) truth plays in inquiry – I did not have a certain interpretation of the two terms in mind when I asked you my question. To answer your question, yes, saying ‘important’ truth indicates social/political values. In an attempt to clear up confusion, here is an excerpt from the article I got these terms from:

“Science seeks, not truth *per se*, but rather explanatory truth, truth presupposed to be explanatory. But this is only the first hint of what is wrong with the official view of science. For science does not just seek explanatory truth. More generally, it seeks important truth. The search for explanatory truth is just a special case of the more general search for important truth. Science seeks to acquire knowledge deemed to be *of value*, either of value intellectually or culturally – because it enhances our understanding of the world around us or illuminates matters especially significant to us, such as our origins – or of value practically or technologically, in enabling us to achieve other ends of value, such as health, food, shelter, travel, communications.” (Science Under Attack by Nicholas Maxwell, The Philosopher’s Magazine Issue 31, 3rd Quarter 2005, p. 39)

• Nicole: This assertion of Maxwell’s comes pretty close to being empty & uninformative. What does it rule out? I don’t really know the ‘official’ view he thinks he’s challenging.

• Nicole Jinn

Oh, I see you’re confused even more now. At this point, I ask you to please discard everything I said up to now – I’m sorry for bringing this topic up in the first place. My intention is really not to confuse you too much (nor to waste your time), and I sincerely apologize for my questions/comments producing more confusion. I appreciate the time you took in reading (and replying to) my comments, and will definitely try to find better or more relevant questions to ask.

• Nicole: Don’t be silly, your questions are good ones, and I am glad you brought them up. I gave lazy answers by mostly citing posts because I’m just so tired……

10. Corey: Ok, so you get an Honorable Mention, especially given that I’m always pushing this boulder up a hill (or maybe it’s an egg made of stone)! It will be a miracle if new editions of Bayesian texts reduce or cut the howler repertoire.

But I still don’t understand the hesitancy in coming over to the error statistical side….