Four ~~score~~ years ago (!) we held the conference “Statistical Science and Philosophy of Science: Where Do (Should) They Meet?” at the London School of Economics, Centre for Philosophy of Natural and Social Science (CPNSS), where I’m visiting professor.[1] Many of the discussions on this blog grew out of contributions from the conference, and conversations initiated soon after. The conference site is here; my paper on the general question is here.[2]

*My main contribution was “Statistical Science Meets Philosophy of Science Part 2: Shallow versus Deep Explorations” (SS & POS 2). It begins like this:*

**1. Comedy Hour at the Bayesian Retreat[3]**

Overheard at the comedy hour at the Bayesian retreat: Did you hear the one about the frequentist…

“who defended the reliability of his radiation reading, despite using a broken radiometer, on the grounds that most of the time he uses one that works, so on average he’s pretty reliable?”

or

“who claimed that observing ‘heads’ on a biased coin that lands heads with probability .05 is evidence of a statistically significant improvement over the standard treatment of diabetes, on the grounds that such an event occurs with low probability (.05)?”

Such jests may work for an after-dinner laugh, but if it turns out that, despite being retreads of ‘straw-men’ fallacies, they form the basis of why some statisticians and philosophers reject frequentist methods, then they are not such a laughing matter. But surely the drubbing of frequentist methods could not be based on a collection of howlers, could it? I invite the reader to stay and find out.

If we are to take the criticisms seriously, and put to one side the possibility that they are deliberate distortions of frequentist statistical methods, we need to identify their sources. To this end I consider two interrelated areas around which to organize foundational issues in statistics: (1) the roles of probability in induction and inference, and (2) the nature and goals of statistical inference in science or learning. Frequentist sampling statistics, which I prefer to call ‘error statistics’, continues to be raked over the coals in the foundational literature, but with little scrutiny of the presuppositions about goals and methods, without which the criticisms lose all force.

First, there is the supposition that an adequate account must assign degrees of probability to hypotheses, an assumption often called probabilism. Second, there is the assumption that the main, if not the only, goal of error-statistical methods is to evaluate long-run error rates. Given the wide latitude with which some critics define ‘controlling long-run error’, it is not surprising to find them arguing that (i) error statisticians approve of silly methods, and/or (ii) rival (e.g., Bayesian) accounts also satisfy error statistical demands. Absent this sleight of hand, Bayesian celebrants would have to go straight to the finale of their entertainment hour: a rousing rendition of ‘There’s No Theorem Like Bayes’s Theorem’.

Never mind that frequentists have responded to these criticisms; they keep popping up (verbatim) in every Bayesian and some non-Bayesian textbooks and articles on philosophical foundations. No wonder that statistician Stephen Senn is inclined to “describe a Bayesian as one who has a reverential awe for all opinions except those of a frequentist statistician” (Senn 2011, 59, this special topic of RMM). Never mind that a correct understanding of the error-statistical demands belies the assumption that any method (with good performance properties in the asymptotic long run) succeeds in satisfying error-statistical demands.

The difficulty of articulating a statistical philosophy that fully explains the basis for both (i) insisting on error-statistical guarantees, while (ii) avoiding pathological examples in practice, has turned many a frequentist away from venturing into foundational battlegrounds. Some even concede the distorted perspectives drawn from overly literal and radical expositions of what Fisher, Neyman, and Pearson ‘really thought’. I regard this as a shallow way to do foundations.

Here is where I view my contribution—as a philosopher of science—to the long-standing debate: not merely to call attention to the howlers that pass as legitimate criticisms of frequentist error statistics, but also to sketch the main lines of an alternative statistical philosophy within which to better articulate the roles and value of frequentist tools. Let me be clear that I do not consider this the only philosophical framework for frequentist statistics—different terminology could do as well. I will consider myself successful if I can provide one way of building, or one standpoint from which to build, a frequentist, error-statistical philosophy. Here I mostly sketch key ingredients and report on updates in a larger, ongoing project.

** ****2. Popperians Are to Frequentists as Carnapians Are to Bayesians**

Statisticians do, from time to time, allude to better-known philosophers of science (e.g., Popper). The familiar philosophy/statistics analogy—that Popper is to frequentists as Carnap is to Bayesians—is worth exploring more deeply, most notably the contrast between the popular conception of Popperian falsification and inductive probabilism. Popper himself remarked:

In opposition to [the] inductivist attitude, I assert that C(*H*, **x**) must not be interpreted as the degree of corroboration of *H* by **x**, unless **x** reports the results of our sincere efforts to overthrow *H*. The requirement of sincerity cannot be formalized—no more than the inductivist requirement that **x** must represent our total observational knowledge. (Popper 1959, 418, I replace ‘e’ with ‘x’)

In contrast with the more familiar reference to Popperian falsification, and its apparent similarity to statistical significance testing, here we see Popper alluding to failing to reject, or what he called the “corroboration” of hypothesis *H*. Popper chides the inductivist for making it too easy for agreements between data **x** and *H* to count as giving *H* a degree of confirmation.

Observations or experiments can be accepted as supporting a theory (or a hypothesis, or a scientific assertion) only if these observations or experiments are severe tests of the theory—or in other words, only if they result from serious attempts to refute the theory. (Popper 1994, 89)

(Note the similarity to Peirce in Mayo 2011, 87.)

**2.1 Severe Tests**

Popper did not mean to cash out ‘sincerity’ psychologically, of course, but in some objective manner. Further, high corroboration must be ascertainable: ‘sincerely trying’ to find flaws will not suffice. Although Popper never adequately cashed out this intuition, there is clearly something right in the requirement. It is the gist of an experimental principle presumably accepted by Bayesians and frequentists alike, thereby supplying a minimal basis for philosophically scrutinizing different methods (Mayo 2011, section 2.5, this special topic of RMM). Error-statistical tests lend themselves to the philosophical standpoint reflected in the severity demand. Pretty clearly, evidence is not being taken seriously in appraising hypothesis *H* if it is predetermined that, even if *H* is false, a way would be found to either obtain, or interpret, data as agreeing with (or ‘passing’) hypothesis *H*. Here is one of many ways to state this:

Severity Requirement (weakest): An agreement between data **x** and *H* fails to count as evidence for a hypothesis or claim *H* if the test would yield (with high probability) so good an agreement even if *H* is false.

Because such a test procedure had little or no ability to find flaws in *H*, finding none would scarcely count in *H*’s favor.

*2.1.1 Example: Negative Pressure Tests on the Deep Horizon Rig*

Did the negative pressure readings provide ample evidence that:

H_{0}: leaking gases, if any, were within the bounds of safety (e.g., less than θ_{0})?

Not if the rig workers kept decreasing the pressure until *H_{0}* passed, rather than performing a more stringent test (e.g., a so-called ‘cement bond log’ using acoustics). Such a lowering of the hurdle for passing *H_{0}* made it too easy to pass *H_{0}* even if it was false, i.e., even if in fact:

H_{1}: the pressure build-up was in excess of θ_{0}.

That ‘the negative pressure readings were misinterpreted’ meant that it was incorrect to construe them as indicating H_{0}. If such negative readings would be expected, say, 80 percent of the time, even if *H_{1}* is true, then *H_{0}* might be said to have passed a test with only .2 severity. Using Popper’s nifty abbreviation, it could be said to have low corroboration, .2. So the error probability associated with the inference to *H_{0}* would be .8—clearly high. This is not a posterior probability, but it does just what we want it to do.
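
The arithmetic behind the example can be sketched in a few lines. Only the 80 percent figure is taken from the text; the variable names are mine:

```python
# A minimal numeric sketch of the severity arithmetic in the rig example.
# The 80 percent figure is the one supposed in the text; nothing else is assumed.
p_pass_given_H1 = 0.80   # P(negative readings, i.e., 'H_0 passes'; H_1 true)

# Severity with which H_0 passes: the probability the test would NOT have
# yielded so good an agreement with H_0, were H_1 true.
severity_H0 = round(1 - p_pass_given_H1, 2)

# Error probability associated with inferring H_0:
error_prob = p_pass_given_H1

print(severity_H0, error_prob)   # 0.2 0.8
```

The point of the sketch is that the .2 is a property of the test procedure's capability to detect H_{1}, not a probability assigned to either hypothesis.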

**2.2 Another Egregious Violation of the Severity Requirement**

Too readily interpreting data as agreeing with or fitting hypothesis *H* is not the only way to violate the severity requirement. Using utterly irrelevant evidence, such as the result of a coin flip to appraise a diabetes treatment, would be another way. In order for data **x** to succeed in corroborating *H* with severity, two things are required: (i) **x** must fit *H*, for an adequate notion of fit, and (ii) the test must have a reasonable probability of finding worse agreement with *H*, were *H* false. I have been focusing on (ii), but requirement (i) also falls directly out of error-statistical demands. In general, for *H* to fit **x**, *H* would have to make **x** more probable than its denial. Coin-tossing hypotheses say nothing about hypotheses on diabetes, and so they fail the fit requirement. Note how this immediately scotches the howler in the second opening example.

But note that we can appraise the severity credentials of other accounts by using whatever notion of ‘fit’ they permit. For example, if a Bayesian method assigns high posterior probability to *H* given data **x**, we can appraise how often it would do so even if *H* is false. That is a main reason I do not want to limit what can count as a purported measure of fit: we may wish to entertain different measures for purposes of criticism.

**2.3 The Rationale for Severity is to Find Things Out Reliably**

Although the severity requirement reflects a central intuition about evidence, I do not regard it as a primitive: it can be substantiated in terms of the goals of learning. To flout it would not merely permit being wrong with high probability—a long-run behavior rationale. In any particular case, little if anything will have been done to rule out the ways in which data and hypothesis can ‘agree’, even where the hypothesis is false. The burden of proof on anyone claiming to have evidence for *H* is to show that the claim is not guilty of at least an egregious lack of severity.

Although one can get considerable mileage even with the weak severity requirement, I would also accept the corresponding positive conception of evidence, which will comprise the full severity principle:

Severity Principle (full): Data **x** provide a good indication of or evidence for hypothesis *H* (only) to the extent that test *T* severely passes *H* with **x**.

Degree of corroboration is a useful shorthand for the degree of severity with which a claim passes, and may be used as long as the meaning remains clear.

**2.4 What Can Be Learned from Popper; What Can Popperians Be Taught?**

Interestingly, Popper often crops up as a philosopher to emulate, by both Bayesian and frequentist statisticians. As a philosopher, I am glad to have one of our own taken as useful, but feel I should point out that, despite having the right idea, Popper’s logical computations never gave him an adequate way to implement his severity requirement, and I think I know why: Popper once wrote to me that he regretted never having learned mathematical statistics. Had he made the ‘error probability’ turn, today’s meeting ground between philosophy of science and statistics would likely look very different, at least for followers of Popper, the ‘critical rationalists’.

Consider, for example, Alan Musgrave (1999; 2006). Although he declares that “the critical rationalist owes us a theory of criticism” (2006, 323), this has yet to materialize. Instead, it seems that current-day critical rationalists retain the limitations that emasculated Popper. Notably, they deny that the method they recommend—either to accept or to prefer the hypothesis best-tested so far—is reliable. They are right: the best-tested hypothesis so far may have been poorly probed. But critical rationalists maintain nevertheless that their account is ‘rational’. If asked why, their response is the same as Popper’s: ‘I know of nothing more rational’ than to accept the best-tested hypotheses. It sounds rational enough, but only if the best-tested hypothesis so far is itself well tested (see Mayo 2006; 2010b). So here we see one way in which a philosopher, using methods from statistics, could go back to philosophy and implement an incomplete idea.

On the other hand, statisticians who align themselves with Popper need to show that the methods they favor uphold falsificationist demands: that they are capable of finding claims false, to the extent that they are false; and retaining claims, just to the extent that they have passed severe scrutiny (of ways they can be false). Error probabilistic methods can serve these ends; but it is less clear that Bayesian methods are well-suited for such goals (or if they are, it is not clear they are properly ‘Bayesian’).

__________________

*Here is section 5:*

**5. The Error-Statistical Philosophy**

I recommend moving away, once and for all, from the idea that frequentists must ‘sign up’ for either the Neyman–Pearson or the Fisherian paradigm. As a philosopher of statistics, I am prepared to admit to supplying the tools with an interpretation and an associated philosophy of inference. I am not concerned to prove this is what any of the founders ‘really meant’.

Fisherian simple-significance tests, with their single null hypothesis and at most an idea of a directional alternative (and a corresponding notion of the ‘sensitivity’ of a test), are commonly distinguished from Neyman and Pearson tests, where the null and alternative exhaust the parameter space, and the corresponding notion of power is explicit. On the interpretation of tests that I am proposing, these are just two of the various types of testing contexts appropriate for different questions of interest. My use of a distinct term, ‘error statistics’, frees us from the bogeymen and bogeywomen often associated with ‘classical’ statistics, and it is to be hoped that that term is shelved. (Even ‘sampling theory’, technically correct, does not seem to represent the key point: the sampling distribution matters in order to evaluate error probabilities, and thereby assess corroboration or severity associated with claims of interest.) Nor do I see that my comments turn on whether one replaces frequencies with ‘propensities’ (whatever they are).

**5.1 Error (Probability) Statistics**

*What is key on the statistics side *is that the probabilities refer to the distribution of a statistic *d*(**X**)—the so-called sampling distribution. This alone is at odds with Bayesian methods where consideration of outcomes other than the one observed is disallowed (likelihood principle [LP]), at least once the data are available.

Neyman-Pearson hypothesis testing violates the likelihood principle, because the event either happens or does not; and hence has probability one or zero. (Kadane 2011, 439)

The idea of considering, hypothetically, what other outcomes could have occurred in reasoning from the one that did occur seems so obvious in ordinary reasoning that it will strike many, at least those outside of this specialized debate, as bizarre for an account of statistical inference to banish such considerations. And yet, banish them the Bayesian must—at least if she is being coherent. I come back to the likelihood principle in section 7.

*What is key on the philosophical side *is that error probabilities may be used to quantify the probativeness or severity of tests (in relation to a given inference).

The twin goals of probative tests and informative inferences constrain the selection of tests. But however tests are specified, they are open to an after-data scrutiny based on the severity achieved. Tests do not always or automatically give us relevant severity assessments, and I do not claim one will find just this construal in the literature. Because any such severity assessment is relative to the particular ‘mistake’ being ruled out, it must be qualified in relation to a given inference, and a given testing context. We may write:

SEV(*T*, **x**, *H*) to abbreviate ‘the severity with which test *T* passes hypothesis *H* with data **x**’.

When the test and data are clear, I may just write SEV(*H*). The standpoint of the severe prober, or the severity principle, directs us to obtain error probabilities that are relevant to determining well-testedness, and this, I maintain, is the key to avoiding the counterintuitive inferences at the heart of often-repeated comic criticisms. This makes explicit what Neyman and Pearson implicitly hinted at:

If properly interpreted we should not describe one [test] as more accurate than another, but according to the problem in hand should recommend this one or that as providing information which is more relevant to the purpose. (Neyman and Pearson 1967, 56–57)

For the vast majority of cases we deal with, satisfying the N-P long-run desiderata leads to a uniquely appropriate test that simultaneously satisfies Cox’s (Fisherian) focus on minimally sufficient statistics, and also the severe testing desiderata (Mayo and Cox 2010).

**5.2 Philosophy-Laden Criticisms of Frequentist Statistical Methods**

What is rarely noticed in foundational discussions is that appraising statistical accounts at the foundational level is ‘theory-laden’, and in this case the theory is philosophical. A deep as opposed to a shallow critique of such appraisals must therefore unearth the philosophical presuppositions underlying both the criticisms and the plaudits of methods. To avoid question-begging criticisms, the standpoint from which the appraisal is launched must itself be independently defended.

But for many philosophers, in particular, Bayesians, the presumption that inference demands a posterior probability for hypotheses is thought to be so obvious as not to require support. At any rate, the only way to give a generous interpretation of the critics (rather than assume a deliberate misreading of frequentist goals) is to allow that critics are implicitly making assumptions that are at odds with the frequentist statistical philosophy. In particular, the criticisms of frequentist statistical methods assume a certain philosophy about statistical inference (probabilism), often coupled with the allegation that error-statistical methods can only achieve radical behavioristic goals, wherein long-run error rates alone matter.

Criticisms then follow readily, in the form of one or both:

- Error probabilities do not supply posterior probabilities in hypotheses.
- Methods can satisfy long-run error probability demands while giving rise to counterintuitive inferences in particular cases.

I have proposed an alternative philosophy that replaces these tenets with different ones:

- The role of probability in inference is to quantify how reliably or severely claims have been tested.
- The severity principle directs us to the relevant error probabilities; control of long-run error probabilities, while necessary, is not sufficient for good tests.

The following examples will substantiate and flesh out these claims.

**5.3 Severity as a ‘Metastatistical’ Assessment**

In calling severity ‘metastatistical’, I am deliberately calling attention to the fact that the reasoned deliberation it involves cannot simply be equated to formal quantitative measures, particularly those that arise in recipe-like uses of methods such as significance tests. In applying it, we consider several possible inferences that might be of interest. In the example of test *T+* [this is a one-sided Normal test of H_{0}: μ ≤ μ_{0} against H_{1}: μ > μ_{0}, on p. 81], the data-specific severity evaluation quantifies the extent of the discrepancy (γ) from the null that is (or is not) indicated by data **x**, rather than quantifying a ‘degree of confirmation’ accorded a given hypothesis. Still, if one wants to emphasize a post-data measure, one can write:

SEV(μ < X̄_{0} + γσ_{x}) to abbreviate: the severity with which a test *T+* with a result **x** passes the hypothesis (μ < X̄_{0} + γσ_{x}), with σ_{x} abbreviating (σ/√n).

One might consider a series of benchmarks or upper severity bounds:

SEV(μ < x̄_{0} + 0σ_{x}) = .5

SEV(μ < x̄_{0} + .5σ_{x}) = .7

SEV(μ < x̄_{0} + 1σ_{x}) = .84

SEV(μ < x̄_{0} + 1.5σ_{x}) = .93

SEV(μ < x̄_{0} + 1.98σ_{x}) = .975
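
These benchmarks can be reproduced from the standard Normal distribution: in test *T+* with σ known, SEV(μ < x̄_{0} + γσ_{x}) works out to Φ(γ), the probability of an even larger sample mean were the claim just false. A minimal sketch (the helper `Phi` is mine, built on the error function):

```python
from math import erf, sqrt

def Phi(z):
    """Standard normal CDF (via the error function)."""
    return 0.5 * (1 + erf(z / sqrt(2)))

# For test T+ with sigma known, SEV(mu < x̄_0 + γσ_x) reduces to Φ(γ):
# the probability of a worse fit (a larger sample mean) were the claim false.
for gamma in [0, 0.5, 1, 1.5, 1.98]:
    print(f"SEV(mu < x̄_0 + {gamma}σ_x) = {Phi(gamma):.3f}")
```

The printed values (.500, .691, .841, .933, .976) round to the benchmarks listed above; the last is quoted in the text as .975.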

More generally, one might interpret nonstatistically significant results (i.e., *d*(**x**) ≤ *c_{α}*) in test *T+* above in severity terms:

(μ ≤ X̄_{0} + γ_{ε}(σ/√n)) passes the test *T+* with severity (1 – ε),

for any P(*d*(**X**) > γ_{ε}) = ε.

It is true that I am here limiting myself to a case where σ is known and we do not worry about other possible ‘nuisance parameters’. Here I am doing philosophy of statistics; only once the logic is grasped will the technical extensions be forthcoming.

*5.3.1 Severity and Confidence Bounds in the Case of Test T+*

It will be noticed that these bounds are identical to the corresponding upper confidence interval bounds for estimating μ. There is a duality relationship between confidence intervals and tests: the confidence interval contains the parameter values that would not be rejected by the given test at the specified level of significance. It follows that the (1 – α) one-sided confidence interval (CI) that corresponds to test *T+* is of form:

μ > X̄ − c_{α}(σ/√n)

The corresponding CI, in other words, would not be the assertion of the upper bound, as in our interpretation of statistically insignificant results. In particular, the 97.5 percent CI estimator corresponding to test *T+* is:

μ > X̄ − 1.96(σ/√n)

We were only led to the upper bounds in the midst of a severity interpretation of negative results (see Mayo and Spanos 2006). [See also posts on this blog, e.g., on reforming the reformers.]

Still, applying the severity construal to the application of confidence interval estimation is in sync with the recommendation to consider a series of lower and upper confidence limits, as in Cox (2006). But are not the degrees of severity just another way to say how probable each claim is? No. This would lead to well-known inconsistencies, and gives the wrong logic for ‘how well-tested’ (or ‘corroborated’) a claim is.
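
The duality can be checked numerically. The following sketch uses illustrative values I have assumed (σ = 2, n = 100, observed mean 0.4; none come from the text): the 97.5 percent lower confidence bound and the .975 severity upper bound sit the same distance, 1.96σ_{x}, on either side of the observed mean.

```python
from math import erf, sqrt

def Phi(z):
    """Standard normal CDF."""
    return 0.5 * (1 + erf(z / sqrt(2)))

# Illustrative values, assumed for the sketch (none come from the text):
sigma, n, xbar = 2.0, 100, 0.4
sigma_x = sigma / sqrt(n)            # 0.2

c = 1.96                              # cutoff giving the .975 level
lower_CI = xbar - c * sigma_x         # 97.5% lower confidence bound for mu
upper_sev = xbar + c * sigma_x        # claim 'mu < upper_sev' has SEV = .975

print(f"97.5% CI:  mu > {lower_CI:.3f}")                      # mu > 0.008
print(f"SEV = {Phi(c):.3f} for the claim mu < {upper_sev:.3f}")  # mu < 0.792
```

Both numbers come from the same sampling distribution; only the direction of the bound, and hence the inference licensed, differs.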

A classic misinterpretation of an upper confidence interval estimate is based on the following fallacious instantiation of a random variable by its fixed value:

P(μ < X̄ + 2(σ/√n); μ) = .975,

observe mean x̄,

therefore, P(μ < x̄ + 2(σ/√n); μ) = .975.

While this instantiation is fallacious, critics often argue that we just cannot help it. Hacking (1980) attributes this assumption to our tendency toward ‘logicism’, wherein we assume a logical relationship exists between any data and hypothesis. More specifically, it grows out of the first tenet of the statistical philosophy that is assumed by critics of error statistics, that of probabilism.
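
A short simulation makes the point vivid. The numbers here are my own illustration (and I use 1.96, which the text's ‘2’ rounds, so the level comes out at .975): the probability attaches to the estimating procedure over repeated samples, not to any fixed interval.

```python
import random
from math import sqrt

# Illustrative simulation with assumed numbers (not from the text). The .975
# is a property of the estimating procedure: over repeated samples, the
# random bound X̄ + 1.96σ_x exceeds the true mu about 97.5% of the time.
random.seed(1)
mu_true, sigma, n = 0.0, 1.0, 25
sigma_x = sigma / sqrt(n)

trials = 20_000
covered = sum(
    mu_true < sum(random.gauss(mu_true, sigma) for _ in range(n)) / n
              + 1.96 * sigma_x
    for _ in range(trials)
)
print(covered / trials)   # typically ≈ .975

# Once a particular mean x̄ is observed, 'mu < x̄ + 1.96σ_x' is simply true
# or false of the fixed mu; instantiating x̄ into the probability statement
# is the fallacious step.
```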

*5.3.2 Severity versus Rubbing Off*

The severity construal is different from what I call the ‘rubbing-off construal’, which says: infer from the fact that the procedure is rarely wrong to the assignment of a low probability to its being wrong in the case at hand. This is still dangerously equivocal, since the probability properly attaches to the method, not the inference. Nor will it do merely to replace an error probability associated with an inference to *H* with the phrase ‘degree of severity’ with which *H* has passed. The long-run reliability of the rule is a necessary but not a sufficient condition to infer *H* (with severity).

The reasoning instead is the counterfactual reasoning behind what we agreed was at the heart of an entirely general principle of evidence. Although I chose to couch it within the severity principle, the general frequentist principle of evidence (FEV) or something else could be chosen.

To emphasize another feature of the severity construal, suppose one wishes to entertain the severity associated with the inference:

*H*: μ < (x̄_{0} + 0σ_{x})

on the basis of mean x̄_{0} from test *T+*. *H* passes with low (.5) severity because it is easy, i.e., probable, to have obtained a result that agrees with *H* as well as this one, even if this claim is false about the underlying data generation procedure. Equivalently, if one were calculating the confidence level associated with the one-sided upper confidence limit μ < x̄_{0}, it would have level .5. Without setting a fixed level, one may apply the severity assessment at a number of benchmarks, to infer which discrepancies are, and which are not, warranted by the particular data set. Knowing what fails to be warranted with severity becomes at least as important as knowing what is: it points in the direction of what may be tried next and of how to improve inquiries.

*5.3.3 What’s Belief Got to Do with It?*

Some philosophers profess not to understand what I could be saying if I am prepared to allow that a hypothesis *H* has passed a severe test *T* with **x** without also advocating (strong) belief in *H*. When SEV(*H*) is high there is no problem in saying that **x** warrants *H*, or, if one likes, that **x** warrants believing *H*, even though that would not be the direct outcome of a statistical inference. The reason it is unproblematic in the case where SEV(*H*) is high is:

If SEV(*H*) is high, its denial is low, i.e., SEV(~*H*) is low.

But it does not follow that a severity assessment should obey the probability calculus, or be a posterior probability—it should not, and is not.

After all, a test may poorly warrant both a hypothesis *H* and its denial, violating the probability calculus. That is, SEV(*H*) may be low because its denial was ruled out with severity, i.e., because SEV(~*H*) is high. But SEV(*H*) may also be low because the test is too imprecise to allow us to take the result as good evidence for *H*.

Even if one wished to retain the idea that degrees of belief correspond to (or are revealed by?) bets an agent is willing to take, that degrees of belief are comparable across different contexts, and all the rest of the classic subjective Bayesian picture, this would still not have shown the relevance of a measure of belief to the objective appraisal of what has been learned from data. Even if I strongly believe a hypothesis, I will need a concept that allows me to express whether or not the test with outcome **x** warrants *H*. That is what a severity assessment would provide. In this respect, a dyed-in-the wool subjective Bayesian could accept the severity construal for science, and still find a home for his personalistic conception.

Critics should also welcome this move because it underscores the basis for many complaints: the strict frequentist formalism alone does not prevent certain counterintuitive inferences. That is why I allowed that a severity assessment is on the metalevel in scrutinizing an inference. Granting that, the error-statistical account based on the severity principle does prevent the counterintuitive inferences that have earned so much fame, not only at Bayesian retreats, but throughout the literature.

*5.3.4 Tacking Paradox Scotched*

In addition to avoiding fallacies within statistics, the severity logic avoids classic problems facing both Bayesian and hypothetical-deductive accounts in philosophy. For example, tacking an irrelevant conjunct onto a well-confirmed hypothesis *H* seems magically to allow confirmation of the irrelevant conjunct. Not so in a severity analysis. Suppose the severity for claim *H* (given test *T* and data **x**) is high, i.e., SEV(*T*, **x**, *H*) is high, whereas a claim *J* is not probed in the least by test *T*. Then the severity for the conjunction (*H* & *J*) is very low, if not minimal.

If SEV(test *T*, data **x**, claim *H*) is high, but *J* is not probed in the least by the experimental test *T*, then SEV(*T*, **x**, (*H* & *J*)) = very low or minimal.

For example, consider:

*H*: GTR, and *J*: Kuru is transmitted through funerary cannibalism,

and let data **x**_{0} be a value of the observed deflection of light in accordance with the general theory of relativity, GTR. The two hypotheses do not refer to the same data models or experimental outcomes, so it would be odd to conjoin them; but if one did, the conjunction gets minimal severity from this particular data set. Note that we distinguish **x** severely passing *H* from *H* being severely passed on all evidence in science at a time.

A severity assessment also allows a clear way to distinguish the well-testedness of a portion or variant of a larger theory, as opposed to the full theory. To apply a severity assessment requires exhausting the space of alternatives to any claim to be inferred (i.e., ‘*H* is false’ is a specific denial of *H*). These must be relevant rivals to *H*—they must be at ‘the same level’ as *H*. For example, if *H* asks whether drug Z causes some effect, then a claim at a different (‘higher’) level might be a theory purporting to explain the causal effect. A test that severely passes the former does not allow us to regard the latter as having passed severely. So severity directs us to identify the portion or aspect of a larger theory that passes. We may often need to refine the hypothesis of stated interest so that it is sufficiently local to enable a severity assessment. Background knowledge will clearly play a key role. Nevertheless, we learn a lot from determining that we are not allowed to regard given claims or theories as passing with severity. I come back to this in the next section (and much more elsewhere, e.g., Mayo 2010a, b).

[1] Co-organized with Aris Spanos.

[2] This was a special topic of the on-line journal *Rationality, Markets and Morals (RMM)*, edited by Max Albert—also a conference participant. For more Saturday night reading, check out the page. Authors are: David Cox, Andrew Gelman, David F. Hendry, Deborah G. Mayo, Stephen Senn, Aris Spanos, Jan Sprenger, Larry Wasserman. Search this blog for a number of commentaries on most of these papers.

[3] Long-time blog readers will recognize this from the start of this blog. For some background, and a table of contents for the paper, see my Oct 17 post.

While I realize these stories are strawman statistical arguments made to mock a certain perspective, I can’t help but be confused anyway. Trying to determine whether the measurements from a seemingly malfunctioning instrument can be salvaged is an important and interesting research problem. It comes up all the time. So after a moment’s reflection, the first joke just seems dumb, irrespective of one’s statistical perspective.

West: These are not made-up criticisms but literal criticisms repeated over and over again by important people. I can name names. The first is sometimes called irrelevant censoring, the supposition being that frequentists obey it. The second comes from the supposition that if x is an improbable event in its own right, then rejecting a null hypothesis whenever x occurs is tantamount to rejecting the null with a low p-value. Look up “comedy hours” or just “comedy” on this blog and you’ll find examples of these and more.

Mayo: I was not questioning your claim of having witnessed these jokes in person, or their intended meaning. I have commented on enough posts that I wouldn’t make those mistakes.

My issue is that I just can’t take either of these comments seriously, because they resemble nothing like science as practiced. At least in my experience. So while I can admire the clever usage of language, the pithy statements as rhetoric should be greeted with a yawn.

West: Well perhaps your yawning attitude toward the howlers is the correct one; what flummoxes me is the way they are often used as grounds for utter dismissal and chants of “unsound!” or worse. You’d think more people might wonder (as you do) as to why there’s a felt need to resort to such examples (howlers) in advocating other methods as superior, on philosophical and other grounds.

A good example just came up in my last post, commenting on one of the papers on p-values by Burnham and Anderson:

http://errorstatistics.com/2014/06/11/a-spanos-recurring-controversies-about-p-values-and-confidence-intervals-revisited/#comment-53182

Mayo:

You say that these mistakes “keep popping up (verbatim) in every Bayesian . . . textbook.” Could you please give the verbatim quote (or at least the page number) from our book? I find it hard to believe that I make such a silly mistake but it’s hard for me to say without knowing exactly what the verbatim quote is that you are talking about.

Andrew: I don’t think you’d want me to list all the texts that criticize error probabilities like confidence levels and p-values by invariably interpreting them as posteriors, or insisting we are proved unsound (and in need of “reeducation”, Bernardo) because there can be trivial intervals, or are incoherent because we consider outcomes other than the one observed, or are irrelevant to “evidential meaning” because we don’t give posteriors, or invariably overestimate effects because of disagreement with Bayesian posteriors, or because small discrepancies are detected with large sample sizes—or the howlers I begin with and delineate on this blog, or the tracts from the “task forces” from psych, econ, etc., also on this blog. In the past 2 years I surveyed quite a few. There are of course the sources from Berger, Bernardo, Howson and Urbach, Jaynes, Kadane, and others, but also middle-of-the-road texts like Carlin and Louis (1.4) and Ghosh, Delampady and Samanta (2.4)—present company excluded of course. I’m not criticizing the texts in general at all, just the throwaway rehearsals of why we have to be rescued from our illogical ways.

“…overestimate effects because of disagreement with Bayesian posteriors…”.

I think the point this is referring to isn’t up for philosophical debate: the effect-overestimation issue is a mathematical consequence of Stein’s phenomenon.

However, I don’t think the point is precisely described (at least as I’ve ever heard it): nobody argues that there aren’t methods with frequentist interpretations (e.g., ridge regression or James-Stein estimators) that can address shrinkage and effect overestimation. Bayesian estimation just happens to be one approach to achieving and interpreting shrinkage.

It is true, though, that often-used “default” methodology (unbiased maximum likelihood estimates in linear or logistic regression) often has problems in this regard.
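Stein’s phenomenon, invoked above, can be checked by simulation. A minimal sketch (the numbers and setup are my own illustration, not from the thread): for p ≥ 3 Gaussian means each estimated from one observation, the positive-part James-Stein estimator beats the raw observations (the MLE) in average total squared error.

```python
import random

def js_vs_mle_mse(p=10, sigma=1.0, trials=2000, seed=0):
    """Average total squared error of the MLE (the raw observations)
    versus the positive-part James-Stein estimator, for p Gaussian means."""
    rng = random.Random(seed)
    theta = [rng.gauss(0.0, 1.0) for _ in range(p)]   # fixed true means
    mse_mle = mse_js = 0.0
    for _ in range(trials):
        x = [t + rng.gauss(0.0, sigma) for t in theta]  # one draw per mean
        ss = sum(v * v for v in x)
        # Shrink toward 0 by a data-dependent factor (positive-part JS).
        shrink = max(0.0, 1.0 - (p - 2) * sigma**2 / ss)
        mse_mle += sum((xi - ti) ** 2 for xi, ti in zip(x, theta))
        mse_js += sum((shrink * xi - ti) ** 2 for xi, ti in zip(x, theta))
    return mse_mle / trials, mse_js / trials

mle, js = js_vs_mle_mse()
print(js < mle)   # shrinkage lowers total error, with no prior in sight
```

As the comment says, nothing here is distinctively Bayesian: the shrinkage factor is computed from the data alone and the comparison is in frequentist (repeated-sampling) risk.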

Mayo:

I’m specifically asking about our book, Bayesian Data Analysis. I can’t do anything about other people’s books, but I can do something about ours! If I’ve made a mistake (even if only in how a result is presented), I’d like to correct it. And if I haven’t made a mistake, I don’t want you to think I have. Either way I’d like to see the verbatim quotes!

Andrew: I went too fast, and just noticed you were asking about “present company,” which I said was excluded. I don’t have your text with me, but I did give you an honorable mention for a first chapter that hesitated to jump aboard any of the common arguments for assigning probabilities to hypotheses as events (e.g., betting coherency, decision theory, etc.). Then there’s the valuable meeting of the minds with Shalizi, and several other places where you explicitly come out in the spirit of error statistics. There are around three posts on this blog speculating about “Gel-Bayesianism”: http://errorstatistics.com/2012/07/31/whats-in-a-name-gelmans-blog/ There is still the “whiff of inconsistency” that arises, not so rarely: http://errorstatistics.com/2013/06/14/p-values-cant-be-trusted-except-when-used-to-argue-that-p-values-cant-be-trusted/; or just perplexity: http://errorstatistics.com/2012/10/05/deconstructing-gelman-part-1-a-bayesian-wants-everybody-else-to-be-a-non-bayesian/

Do you consider all of the criticisms above to be “silly mistakes”? That would be all to the good. In any event, I should have said “virtually all” if only because I can’t claim to have read every textbook (but only a large, representative sample). So I’ll modify that.

Thanks.

“or invariably overestimate effects because of disagreement with Bayesian posteriors”

OK. But what is sauce for the goose is sauce for the gander. A common frequentist criticism of Bayesian methods is that they overestimate effects because they don’t take account of stopping rules. But this ignores the point that the sample size plays a role in the Bayesian point estimate that it does not in frequentist methods. Bayesians are already adjusting, and trials that may be stopped before 500 patients have been recruited will on average be adjusted more than trials that must complete recruitment of the 500.

I am far from being uncritical of Bayesian attitudes, but I think that Bayesians are not just sinners but also sinned against.

Stephen: You are dealing with an example that differs from the classic one discussed by Armitage and dozens of others over the years. If they’re adjusting for stopping in recruitment, then they’re not in the classic example that Savage all too happily regarded as showing the restoration of “simplicity and freedom” and not having to worry about optional stopping.

Deborah: My example is a classic one. The Bayesian posterior mean is a weighted average of prior mean and statistic, with the statistic receiving more weight (and hence the prior less) as the sample size increases. Hence the Bayesian adjusts trials with interim looks on average more than trials with no interim looks.

The difference to the frequentist is

a) frequentists do not adjust at all if there is no interim look

b) Bayesians treat trials with interim looks differently from those with none only if they stop early
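Senn’s weighted-average point can be made concrete in the conjugate normal model. A sketch with illustrative numbers of my own choosing (known data variance, normal prior):

```python
def posterior_mean(xbar, n, sigma2, mu0, tau2):
    """Posterior mean of a Normal mean with known data variance sigma2
    and a Normal(mu0, tau2) prior: a precision-weighted average in which
    the data's weight grows with the sample size n."""
    w_data = n / sigma2      # precision contributed by the data
    w_prior = 1.0 / tau2     # precision contributed by the prior
    return (w_data * xbar + w_prior * mu0) / (w_data + w_prior)

# For the same observed mean, a trial stopped early (n = 50) is shrunk
# toward the prior mean more than one completing recruitment (n = 500).
small = posterior_mean(xbar=1.0, n=50, sigma2=4.0, mu0=0.0, tau2=1.0)
large = posterior_mean(xbar=1.0, n=500, sigma2=4.0, mu0=0.0, tau2=1.0)
print(small < large < 1.0)   # more data, less shrinkage
```

This is the mechanism behind (b): a trial that stops early at an interim look simply arrives with a smaller n, and the posterior mean is automatically pulled harder toward the prior.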

Interesting post. I have two comments

1) You rightly criticise the idea that the throwing of dice is a reasonable adjunct to inference. However, it is a fact that many so-called superior NP tests do involve the throwing of dice, albeit dice of a virtual kind. That is to say, they involve (pretty much covert) random and irrelevant orderings of the sample space to classify differently points that likelihood alone would not distinguish. (See Senn, S. (2007). “Drawbacks to noninteger scoring for ordered categorical data.” Biometrics 63(1): 296-298; discussion 298-299, for an example.)

This makes problems that are relatively discrete in terms of the sufficient statistics less so (because there may be many points in the sample space that will map to the same value of the statistic) and permits higher power for a fixed Type I error rate. Ironically, allowing the use of an auxiliary device in addition to the sufficient statistic shows such gains in power to be illusory. Now, we can’t blame frequentist philosophy for this sort of nonsense but it would help if those who defended frequentist philosophy were more vigorous in repudiating it.
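The textbook instance of the auxiliary-randomization device Senn is criticizing is the randomized binomial test (my illustration, not his Biometrics example): with discrete data the achievable non-randomized sizes jump, so hitting exactly alpha = 0.05 requires flipping a (virtual) coin at the boundary outcome.

```python
from math import comb

def binom_tail(n, p, k):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, j) * p**j * (1 - p)**(n - j) for j in range(k, n + 1))

n, p0, alpha = 10, 0.5, 0.05
# Smallest cutoff whose tail fits under alpha: reject whenever X >= k.
k = next(k for k in range(n + 1) if binom_tail(n, p0, k) <= alpha)
size_nonrand = binom_tail(n, p0, k)                    # strictly below 0.05
# Randomize at the boundary outcome X = k - 1 to use the leftover level.
p_boundary = binom_tail(n, p0, k - 1) - size_nonrand   # P(X = k - 1)
gamma = (alpha - size_nonrand) / p_boundary            # reject w.p. gamma there
size_rand = size_nonrand + gamma * p_boundary          # exactly alpha
print(size_nonrand < alpha, abs(size_rand - alpha) < 1e-12)
```

The extra power over the non-randomized test comes entirely from the coin flip at X = k − 1, not from the data, which is exactly the sense in which such gains are illusory.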

2) Being committed to a falsificationist view of inference does not (of itself) commit one to being a frequentist. De Finetti’s view was that Bayesian updating was falsificationist: ‘experience… acts always and only in the way we have just described: suppressing the alternatives that turn out to be no longer possible…’ (Theory of Probability, Vol. 1, p. 141)

This is quoted on P89 of Dicing with Death http://www.senns.demon.co.uk/DICE.html

Stephen: On the eliminative-induction view of “falsificationism”: call it that if you want, but it is clearly anti-Popperian in spirit. For starters, we never get anything new, but have to start out with all possibilities; for seconders, it focuses on naive-positivistic-style observables, which Popper would (rightly) consider at odds with scientific theorizing and ingenious tests. Then of course there’s the subjectivity and instrumentalism.

Sometimes the howlers (championed by certain Bayesians) come dressed in vampire stake-through-the-heart venom against frequentism as a whole, or at least severity.

http://www.entsophy.net/blog/?p=305

I wonder if he thinks the bloody stuff improves his bad arguments?

Bombast aside, what is the problem with Joseph’s example?

vl: I won’t relook (as I saw a few all together), but I recall it involved a familiar chestnut based on truncating the parameter space (so that so-called “trivial” intervals result). Those need not be problematic, but here, as I recall, it’s not a sensible inference.

Mayo: No, this particular case involved a parameter that determines the support of the sampling distribution. The parameter space is all of the positive reals, which is a standard sort of parameter space.

The question to be addressed is: how does one pick test statistics? Pearson’s step two gives the general notion, but I’d like to see how this works in practice using models a bit more complicated than the fixed sample size IID Gaussian that is your go-to example in your expositions of SEV.

Let me point once again to the example of optional stopping. Let’s assume iid Gaussian known variance, unknown mean. For each N_max greater than one, we can define an optional stopping design that stops if the sample mean exceeds nominal significance bounds or if the realized sample size equals N_max. (Real designs always have a finite N_max.) If N_max is large, we can get reasonable performance by ignoring the sample mean and just taking the realized sample size N as the statistic. On the other hand, if N_max is small — say, N_max = 2 — it seems clear that we want a statistic that will somehow combine the realized sample mean with the realized sample size, since discarding either can represent a huge loss of information if the sample mean is near the optional stopping boundary.
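The commenter’s design can be simulated directly; here is a sketch under the stated assumptions (iid N(0, 1) under the null, known variance, a nominal two-sided 5% bound checked after every observation), with trial counts of my own choosing:

```python
import random

def optional_stopping_reject_rate(n_max, trials=4000, z=1.96, seed=1):
    """Proportion of null (mean-0, sd-1) trials that ever cross the
    nominal 5% bound |xbar| > z / sqrt(n) at some look n <= n_max."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(trials):
        total = 0.0
        for n in range(1, n_max + 1):
            total += rng.gauss(0.0, 1.0)
            if abs(total / n) > z / n ** 0.5:   # interim look at the mean
                rejections += 1
                break
    return rejections / trials

# Looking after every observation inflates the overall type I error
# above the nominal 5%, and the inflation grows with N_max.
print(optional_stopping_reject_rate(2), optional_stopping_reject_rate(50))
```

With N_max = 2 the inflation is mild; with N_max = 50 it is substantial, which is the commenter’s point that for small N_max the realized (sample mean, sample size) pair, not either component alone, is the informative statistic.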

Joseph provides another straw-man argument. After forcefully stating that no failure event could happen until after time t0 (the guarantee-time parameter of the two-parameter exponential distribution generating the failure times), Joseph begins discussion of a mean-like “unbiased” estimator of t0 (I presume to insinuate that all frequentists would use such a silly estimate) that produces an estimate for t0 larger than t1, the minimum observed data value. The Jaynes paper he refers to, as well as any sensible statistician examining the likelihood, uses min(t1, t2, …, tn) as the basis for an estimate or bound for t0. Why then does Joseph discuss this odd unbiased estimate? This would not be the first unbiased estimate that sensible statisticians have set aside for something more reasonable. So, vl, that is the problem with Joseph’s bombastic, disingenuous discussion. Did you read the Jaynes paper?
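West’s point is easy to check numerically. A sketch using Jaynes’s shifted-exponential setup with a unit rate (the specific numbers are mine): since no failure can occur before t0, we always have t0 <= min(x), yet the mean-based “unbiased” estimate xbar - 1 can land above the sample minimum, in a region the data have already ruled out.

```python
import random

def estimates(t0=10.0, n=3, seed=0):
    """Draw n failure times from a unit-rate exponential shifted by the
    guarantee time t0, and return (xbar - 1, min(x)). Since E[x] = t0 + 1,
    xbar - 1 is unbiased for t0 -- but t0 can never exceed min(x)."""
    rng = random.Random(seed)
    x = [t0 + rng.expovariate(1.0) for _ in range(n)]
    xbar = sum(x) / n
    return xbar - 1.0, min(x)

# The minimum always bounds t0; the "unbiased" estimate sometimes doesn't.
absurd = 0
for s in range(200):
    unbiased, lo = estimates(seed=s)
    if unbiased > lo:   # estimate exceeds min(x): impossible for t0
        absurd += 1
print(absurd > 0)
```

This is why any sensible analysis of the likelihood starts from the minimum: it is the boundary the data themselves impose on t0, whereas unbiasedness is indifferent to it.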

Yes, _sensible statisticians_

“If such a recognizable subset existed [min(y) in Joseph's example], then Fisher would no doubt find it; however, there does not seem to be any general methodology used.”

Here is a sketch by George Casella of one line of work that has yet to succeed: http://projecteuclid.org/download/pdf_1/euclid.lnms/1215458835

I don’t believe anyone has succeeded (I last talked to George about it in 2008). Stephen Senn might be a good person to ask.