**1. An Assumed Law of Statistical Evidence (law of likelihood)**

Nearly all critical discussions of frequentist error statistical inference (significance tests, confidence intervals, p-values, power, etc.) start with the following general assumption about the nature of inductive evidence or support:

Data **x** are better evidence for hypothesis *H*_{1} than for *H*_{0} if **x** are more probable under *H*_{1} than under *H*_{0}.

Ian Hacking (1965) called this the *logic of support*: **x** supports hypothesis *H*_{1} more than *H*_{0} if *H*_{1} is more **likely**, given **x**, than is *H*_{0}:

Pr(**x**; *H*_{1}) > Pr(**x**; *H*_{0}).

[With likelihoods, the data **x** are fixed, the hypotheses vary.]*

Or,

**x** is evidence for *H*_{1} over *H*_{0} if the **likelihood ratio** **LR** (*H*_{1} over *H*_{0}) is greater than 1.

It is given in other ways besides, but it’s the same general idea. (Some will take the LR as actually quantifying the support, others leave it qualitative.)

In terms of rejection:

“An hypothesis should be rejected if and only if there is some rival hypothesis much better supported [i.e., much more likely] than it is.” (Hacking 1965, 89)

**2. Barnard (British Journal for the Philosophy of Science)**

But this “law” will immediately be seen to fail on our minimal *severity requirement*. Hunting for an impressive fit, or trying and trying again, it’s easy to find a rival hypothesis *H*_{1} much better “supported” than *H*_{0} even when *H*_{0} is true. Or, as Barnard (1972) puts it, “there always is such a rival hypothesis, viz. that things just had to turn out the way they actually did” (1972 p. 129).

*H*_{0}: the coin is fair, gets a small likelihood (.5)^{k} given k tosses of a coin, while *H*_{1}: the probability of heads is 1 just on those tosses that yield a head, renders the sequence of k outcomes maximally likely. This is an example of Barnard’s “things just had to turn out as they did”. Or, to use an example with P-values: a statistically significant difference, being improbable under the null *H*_{0}, will afford high likelihood to any number of explanations that fit the data well.
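The coin example can be made concrete with a few lines of Python. This is only a sketch: the record of tosses is hypothetical, and `make_sure_thing` is an illustrative name for a hypothesis built after the fact to say that each toss had to land as it did.

```python
from fractions import Fraction

def likelihood_fair(outcomes):
    """Pr(outcomes; H0): every toss lands heads with probability 1/2."""
    return Fraction(1, 2) ** len(outcomes)

def make_sure_thing(observed):
    """Build H1 after seeing the data: each toss 'had to' land as it did."""
    def likelihood(outcomes):
        # H1 gives probability 1 to the observed record and 0 to any other.
        return Fraction(1) if outcomes == observed else Fraction(0)
    return likelihood

observed = "HTTHHTHTTH"  # hypothetical record of k = 10 tosses
likelihood_h1 = make_sure_thing(observed)

# Likelihood ratio of H1 over H0 on the data actually observed.
lr = likelihood_h1(observed) / likelihood_fair(observed)
print(lr)  # 1024, i.e. 2**10
```

Whatever the tosses turn out to be, the ratio is 2^k in favor of the rival constructed from them, even though the coin may be perfectly fair.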

**3. Breaking the law (of likelihood) by going to the “second,” error statistical level:**

How does it fail our severity requirement? First look at what the frequentist error statistician must always do to critique an inference: she must consider the capability of the inference method that *purports* to provide evidence for a claim. She goes to a higher level or metalevel, as it were. In this case, the likelihood ratio plays the role of the needed statistic *d*(** X**). To put it informally, she asks:

What’s the probability the method would yield an LR disfavoring *H*_{0} compared to some alternative *H*_{1} even if *H*_{0} is true?

What’s the probability of so small a likelihood for *H*_{0} compared to *H*_{1}, even if *H*_{0} adequately describes the data generating procedure? As Pearson and Neyman put it:

“[I]n order to fix a limit between ‘small’ and ‘large’ values of LR we must know how often such values appear when we deal with a true hypothesis. That is to say we must have knowledge of the chance of obtaining [so small a likelihood ratio] in the case where the hypothesis tested [*H*_{0}] is true” (Pearson and Neyman 1930, 106).

Looking at “how often such values appear” of course turns on the *sampling distribution* of the LR viewed as a statistic. That’s why frequentist error statistical accounts are called *sampling theory* accounts. This requires considering *other values that could have occurred,* not just the one you got.
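To illustrate what “how often such values appear” amounts to, here is a hedged sketch for one simple setup that is my assumption, not Pearson and Neyman’s: a Normal mean with known sigma, *H*_{0}: mu = 0, and the LR written as *H*_{1} over *H*_{0} (so a large value disfavors *H*_{0}). The cutoff `k = 8` and the function names are hypothetical. The code computes the probability, under a true *H*_{0}, of an LR at least as large as `k`, first for an alternative fixed in advance and then for an alternative chosen to match the observed mean.

```python
import math
from statistics import NormalDist

# Under H0: mu = 0, z = sqrt(n) * (sample mean) / sigma is standard Normal.
STD_NORMAL = NormalDist()

def prob_misleading_fixed(k, d):
    """Pr(LR >= k; H0) when the alternative mu1 is fixed in advance.

    d = sqrt(n) * mu1 / sigma > 0 is the alternative in standard-error units.
    Here LR = exp(d*z - d**2/2), so LR >= k iff z >= log(k)/d + d/2.
    """
    return 1.0 - STD_NORMAL.cdf(math.log(k) / d + d / 2.0)

def prob_misleading_hunted(k):
    """Pr(LR >= k; H0) when the alternative is set equal to the observed mean.

    Here LR = exp(z**2 / 2), so LR >= k iff |z| >= sqrt(2 * log(k)).
    """
    return 2.0 * (1.0 - STD_NORMAL.cdf(math.sqrt(2.0 * math.log(k))))

k = 8.0  # hypothetical cutoff for calling the LR "large"
p_fixed = prob_misleading_fixed(k, d=1.0)
p_hunted = prob_misleading_hunted(k)
print(f"fixed alternative:  Pr(LR >= {k:g}; H0) = {p_fixed:.4f}")
print(f"hunted alternative: Pr(LR >= {k:g}; H0) = {p_hunted:.4f}")
```

Both numbers are facts about the sampling distribution of the LR. Neither can be read off from the observed likelihoods alone, and they differ according to how the alternative was arrived at.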

*But this breaks the law of likelihood and so is taboo for the likelihoodist!* (Likewise for anyone holding the *Likelihood Principle*[i].)

Viewing the sampling distribution as taboo (once the data are given) is puzzling in the extreme[ii]. How can it be desirable to block out information about how the data were generated and the hypotheses specified? I fail to see how anyone can evaluate an inference from data **x** to a claim *C* without learning about the capabilities of the method, through the relevant sampling distribution. Readers of this blog know my favorite example to demonstrate the lack of error control if you look only at likelihoods: the case of *optional stopping*. (Keep sampling until you get a nominal p value of .05 against a 0 null hypothesis in two-sided Normal testing of the mean. You can be wrong with maximal probability.)
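A small simulation gives a feel for the optional stopping case. It is a sketch under stated assumptions: the data are standard Normal (so the 0 null is true), the sampler checks after every observation, and, because a simulation has to end, sampling is capped at a hypothetical `max_n`. The function names, the seed and the number of trials are all illustrative.

```python
import math
import random

def hits_nominal_significance(max_n, rng, z_crit=1.96):
    """Draw N(0, 1) observations one at a time, so the null (mean 0) is true.

    Return True if |z| = |sum| / sqrt(n) reaches z_crit at some n <= max_n,
    i.e. if a try-and-try-again sampler gets to stop and report p < .05.
    """
    total = 0.0
    for n in range(1, max_n + 1):
        total += rng.gauss(0.0, 1.0)
        if abs(total) / math.sqrt(n) >= z_crit:
            return True
    return False

def rejection_rate(max_n, trials=1000, seed=0):
    """Fraction of simulated experiments that reach nominal significance."""
    rng = random.Random(seed)
    hits = sum(hits_nominal_significance(max_n, rng) for _ in range(trials))
    return hits / trials

# max_n = 1 is an ordinary fixed-sample test; larger max_n allows more peeking.
rates = {max_n: rejection_rate(max_n) for max_n in (1, 10, 100, 500)}
for max_n, rate in rates.items():
    print(f"up to n = {max_n:3d}: true null rejected in {rate:.1%} of runs")
```

The rejection rate climbs as the cap is raised, although the likelihood function at the stopping point is the same as it would be had the sample size been fixed in advance.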

Just such examples, where the alternative is not a point value, led Barnard to abandon (or greatly restrict) the Likelihood Principle. Interestingly, in raising these criticisms of likelihood, Barnard is reviewing Ian Hacking’s 1965 book: *Logic of Statistical Inference*. Only thing is, by the time of this 1972 review, Hacking had given it up as well! In fact, in the pages immediately following Barnard’s review of Hacking is Hacking’s review of A.W.F. Edwards’ book *Likelihood* (1972), wherein Hacking explains why he’s thrown his own likelihood rule of support overboard.

**4. Hacking (also BJPS)**

A classic case is the normal distribution and a single observation. Reluctantly we will grant Edwards that the observation *x* is the best supported estimate of the unknown mean. But the hypothesis about the variance, with highest likelihood, is the assumption that there is no variance, which strikes us as monstrous. … we must concede that as prior information we take for granted the variance is at least *w*. But even this will not do, for the best supported view on the variance is then that it is exactly *w*.

For a less artificial example, take the ‘tram-car’ or ‘tank’ problem. We capture enemy tanks at random and note the serial numbers on their engines. We know the serial numbers start at 0001. We capture a tank number 2176. How many tanks did the enemy make? On the likelihood analysis, the best supported guess is: 2176. Now one can defend this remarkable result by saying that it does not follow that we should estimate the actual number as 2176, only that comparing individual numbers, 2176 is better supported than any larger figure. My worry is deeper. Let us compare the relative likelihood of the two hypotheses, 2176 and 3000. Now pass to a situation where we are measuring, say, widths of a grating, in which error has a normal distribution with known variance; we can devise data and a pair of hypotheses about the mean which will have the same log-likelihood ratio. I have no inclination to say that the relative support in the tank case is ‘exactly the same as’ that in the normal distribution case, even though the likelihood ratios are the same. Hence even on those increasingly rare days when I will rank hypotheses in order of their likelihoods, I cannot take the actual log-likelihood number as an objective measure of anything. (Hacking 1972, 136-137).
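Hacking’s comparison can be worked through numerically. The sketch below makes assumptions that are mine rather than his: a single captured tank whose serial is equally likely to be any of 1..N, and, for the grating, one observation `x = 0` with `sigma = 1`. The function names are illustrative.

```python
import math
from fractions import Fraction

def tank_likelihood(total_made, observed_serial):
    """Pr(observed serial; N tanks made): one capture, serials 1..N equally likely."""
    if total_made < observed_serial:
        return Fraction(0)
    return Fraction(1, total_made)

def normal_log_lr(x, mu_a, mu_b, sigma):
    """log of Pr(x; mu_a) / Pr(x; mu_b) for one Normal observation, sigma known."""
    return ((x - mu_b) ** 2 - (x - mu_a) ** 2) / (2.0 * sigma ** 2)

serial = 2176

# The likelihood 1/N is largest at the smallest N compatible with the data.
best_guess = max(range(serial - 500, serial + 1000),
                 key=lambda n: tank_likelihood(n, serial))

# Likelihood ratio of "2176 tanks" over "3000 tanks".
tank_lr = tank_likelihood(2176, serial) / tank_likelihood(3000, serial)
tank_log_lr = math.log(float(tank_lr))

# Devise a grating case with the same log-likelihood ratio: observe x = 0
# and compare mu = 0 against mu = delta, which gives log LR = delta**2 / 2.
delta = math.sqrt(2.0 * tank_log_lr)
grating_log_lr = normal_log_lr(x=0.0, mu_a=0.0, mu_b=delta, sigma=1.0)

print(best_guess, tank_lr, round(tank_log_lr, 4), round(grating_log_lr, 4))
```

The two log-likelihood ratios agree by construction, which is exactly the situation in which Hacking declines to say the relative support is the same.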

Hacking appears even more concerned with the fact that likelihood ratios do not enjoy a stable evidential meaning or calibration than with the lack of error control in likelihoodist accounts. But Hacking was still assuming the latter must be cashed out in terms of long run error performance[iii] as opposed to stringency of test.

I say: a method that makes it easy to declare evidence against hypotheses erroneously gives an unwarranted inference each time; methods that fail to directly pick up on optional stopping, data dredging, cherry picking, multiple testing or any of the other gambits that alter the capabilities of tests to avoid mistaken inferences are poor methods, but not because of their behavior in the long run. They license unwarranted or questionable inferences in each and every application. This is so, I aver, even if we happen to know, through other means, that their inferred claim *C* is correct.

**5. Three ways likelihoods arise in inference. Aug. 31 note at end of para.**

Likelihoods are fundamental in all statistical inference accounts. One might separate how they arise in three groups (acknowledging divisions within each):

(1) likelihoods only (pure likelihoodist)

(2) likelihoods + priors (Bayesian)

(3) likelihoods + error probabilities based on sampling distributions (*error statistics, sampling theory*)

Only the error statistician (3) requires breaking the likelihood law.[**See note.**] You can feed us fit measures from (1) and (2), and we will do the same thing: ask about the probability of so good (or poor) a fit between data and some claim C, even if C is false (true). The answer will be based on the *sampling distribution* of the relevant statistic, computed under the falsity of C (or discrepancies from what C asserts).[iv]

**Aug 31 note:**

If someone wanted to describe the addition of the priors under rubric (2) as tantamount to “breaking the likelihood law”, as opposed to merely requiring it to be supplemented, nothing whatever changes in the point of this post. (It would seem to introduce idiosyncrasies in the usual formulation–but these are not germane to my post.) My sentence, in fact, might well have been “Only the error statistician (3) *requires* breaking the likelihood law *and the likelihood principle (by dint of requiring considerations of the sampling distribution to obtain the evidential import of the data).*”

**^^^^^^^^^^^^^^^^^^^^^^^^^**

**Installment (B): an ad hoc clarificatory note, prompted by comments from an anonymous fan**

**6. Of tests and comparative support measures**

The statements of “the” law of likelihood, and likelihood support logics, are not all precisely identical. Some accounts are qualitative, merely indicating prima facie increased support; others will devise quantitative measures of support based on likelihoods. (There are at least 10 of them we covered in our recent seminar, maybe more.) Some will try out corresponding “tests”, others not. One needn’t have anything like a test or a “rejection rule” to be a likelihoodist. I mentioned the construal in terms of tests because it is in the sentence just before the one I quote from Barnard, and I wanted to be true to what he had just said about Hacking’s 1965 book.

*Remember the topic of my post concerns criticisms of error statistical methods,* and a principle (or “law”) of evidence used in those criticisms. *(If you reject that principle, then presumably you wouldn’t use it to criticize error statistical methods, so we have no disagreement on this.)* A clear rationale for connecting tests of hypotheses—be they Fisherian or N-P style—and logics of likelihood is to mount criticisms: to explain what’s wrong with those (Fisherian or N-P) tests, and how they may be cured of their problems.

Hacking lays out an impressive argument that all that is sensible in N-P likelihood ratio tests is captured by his conception of likelihood tests (the one he advanced back in 1965) while all the (apparently) counterintuitive parts are jettisoned. Now that I’ve access to my NYC library, I can quote the portion to which Barnard is alluding in his review of Hacking.

“Our theory of support leads directly to the theory of testing suggested in the last chapter [VI]. An hypothesis should be rejected if and only if there is some rival hypothesis much better supported than it is. Support has already been analysed in terms of ratios of likelihoods. But what shall serve as ‘much better supported’? For the present I leave this in abeyance, and speak merely of tests of different stringency. With each test will be associated a critical ratio. The greater the critical ratio, the more stringent the test. Roughly speaking hypothesis h will be rejected in favour of rival i at critical level alpha, just if the likelihood ratio of i to h exceeds alpha.” (Hacking 1965 p.89)

I don’t want to pursue this discussion of Hacking here. To repeat, my post concerns criticisms of error statistical methods. A foundational critique of a method of inference depends on holding another view or principle or method of inference. This post is an offshoot of the recent posts here and here (7/14/14 and 8/17/14).

Critiques in those posts are based on assuming that it is fair, reasonable, obvious or what have you, to criticize the way p-values arise in inference by means of a different view of inference. (I allude here to genuine or “audited” p-values, not mere nominal or computed p-values.) The p-value, it is reasoned, should be close to either a posterior probability (in the null hypothesis) or a likelihood ratio (or Bayes ratio). Ways to “fix” p-values are proposed to get them closer to these other measures. I don’t think there was anything controversial about this being the basic goal, not just of the particular papers we looked at, but mountains of papers that have been written and are being written this very moment.

I may continue with my intended follow-up (Part C).

*Note: I am not sure whether the powers that be are allowing us to say “data x *is*” nowadays–I read something about this, maybe it was by Pinker. Can somebody please ask Steven Pinker for me? Thanks.

[i] Please search this blog for quite a lot on the likelihood principle and the strong likelihood principle.

[ii] I would say this even if we knew the model was adequate. Likelihood principlers may regard using the sampling distribution to test the model as legitimate.

[iii] Perhaps he still is; I don’t mean to saddle him with my testing construal of error probabilities at all. (Some hints of a shift exist in his 1980 article in the Braithwaite volume.)

[iv] This delineation comes from Cox and Hinkley, but I don’t have it here.

REFERENCES:

Barnard, G. (1972). Review of ‘The Logic of Statistical Inference’ by I. Hacking, *Brit. J. Phil. Sci.*, 23(2): 123-132.

Hacking, I. (1965). *Logic of Statistical Inference*. Cambridge: CUP.

Hacking, I. (1972). “Review of Likelihood. An Account of the Statistical Concept of Likelihood and Its Application to Scientific Inference by A. F. Edwards,” *Brit. J. Phil. Sci.*, 23(2): 132-137.

Hacking, I. (1980). “The Theory of Probable Inference: Neyman, Peirce and Braithwaite.” In D. H. Mellor (ed.), *Science, belief and behavior: Essays in honor of R.B. Braithwaite.* 141-160. Cambridge: CUP.

Pearson, E.S. & Neyman, J. (1930). On the problem of two samples. *Joint Statistical Papers* by J. Neyman & E.S. Pearson, 99-115 (Berkeley: U. of Calif. Press). First published in *Bul. Acad. Pol. Sci.* 73-96.

First, let me second (or third, fourth!) the comments on the last post. It is great to see philosophers of science engaging with scientists, especially philosophers who have deep knowledge of the subjects they’re offering advice on. Too often philosophers are like those McKinsey management consultants, who get hired to proffer advice on running major corporations despite never having run anything.

Such philosophers are like those (apocryphal?) botanists of old who couldn’t recognize a plant in the wild, but could if it was pressed and dried for a year in the leaves of a book. It’s nice to see some philosophers break that mold and know the subjects they’re commenting on.

Back to this post. At the very end you say only (3) breaks the likelihood law, while (2) does not. If I’m understanding this correctly, the likelihood law says data is more favorable to H0 over H1 if, and only if:

Likelihood of x under H0 is greater than likelihood of x under H1.

When you do part (B) are you going to provide a proof that (2) “likelihood+ priors (Bayesians)” adheres to the Likelihood law?

anon-fan*:

For all the faults one might attribute to philosophers, I don’t know any that have been “hired to proffer advice” on science or the other subjects involved in philosophy of x. No one hires philosophers for such things! There are plenty of self-anointed philosophers outside philosophy, but that’s another matter.

On likelihoods, although we should distinguish the law of likelihood from the likelihood principle, the latter follows from inference by way of Bayes’s theorem. I realize lots of Bayesians are reluctant to advocate inference by Bayes’s theorem; most these days keep to Bayes ratios and Bayes boosts (making them more like group (1)). Others say they’re really doing predictive modeling and are “just a little bit Bayesian” in their dealings with nuisance parameters (as against the title of chapter 10 of EGEK). But my point just now doesn’t turn on any of this at all! What I said was:

“Nearly all critical discussions of frequentist error statistical inference (significance tests, confidence intervals, p-values, power, etc.) start with the following general assumption about the nature of inductive evidence or support:” (a version of the law of likelihood).

Can you please give me an example of a different type of principle regarding the nature of inductive evidence that is relied upon in critical discussions of frequentist error statistics? Thanks.

*In my 3 years of blogging, I’ve never had a genuine fan who was anonymous.

I don’t understand. When a Bayesian judges two hypotheses H0, H1 by comparing two posterior probabilities (i.e. data favors whichever one has the higher probability conditional on the data), is that equivalent to them using the law of likelihoods?

The sentence “Only the error statistician (3) requires breaking the likelihood law” seems to be saying yes.

p.s. I don’t want to out myself as a Frequentist since I’ll probably be fired by Bayesians – they’re more of a cult than anything at this point.

p.p.s. If a Bayesian method can’t be given a Frequentist justification then should it be rejected? If so, then you’re “proffering advice”.

Here is my concern. Depending on the prior, it’s possible for a Bayesian to favor a hypothesis which has lower (not higher!) likelihood. So the claim that “likelihood+priors (Bayesian) doesn’t break the law of likelihoods” is trivially false.

Here is my deeper concern. Maximizing the posterior is (after taking logarithms) equivalent not to maximizing the likelihood, which leads to over-fitting, but rather to maximizing:

log(likelihood)+penalty term

Interestingly, all the most common methods for avoiding over-fitting (AIC, DIC, and so on) have that form. The claim that the only way to avoid over-fitting is through sampling theory methods directly contradicts the majority of current statistical practice.

“Depending on the prior, it’s possible for a Bayesian to favor a hypothesis which has lower (not higher!) likelihood. So the claim that “likelihood+priors (Bayesian) doesn’t break the law of likelihoods” is trivially false.”

Doesn’t follow at all, there’d be no reason to add the priors to (1) if they didn’t make a difference. But the data import, with which the priors are combined, would still come through the likelihoods/LRs or the like.

Didn’t say anything about the value or disvalue of “penalties” for overfitting in model selection.

“P(H0|x) is greater than P(H1|x)” can be true even when “Pr(x;H0) is less than Pr(x;H1)”. Therefore Bayesians do break the law of likelihoods.

One of the statements of the law of likelihood directly from the post is:

“Data x are better evidence for hypothesis H1 than for H0 if x are more probable under H1 than under H0.”

If a Bayesian takes “data x are better evidence for H1 than for H0” to be “the posterior for H1 is larger than H0” then Bayesians do not adhere to the “law of likelihood”.

Another way of stating it is that the bayes factor can be greater than 1, even when the likelihood ratio is less than 1.

You call this a break in the law of likelihoods? Obviously, with high P(H0) (e.g., H0 no disease, H1 disease, x a pos result) we can readily have P(H0|x) greater than P(H1|x) even when Pr(x;H0) is less than Pr(x;H1).

The pos result gives comparatively more support to disease than no disease; posterior for disease goes up, even if still lower than its denial.

This is the whole distinction between “making more firm” and having a high posterior. e.g.

http://errorstatistics.com/2013/10/19/bayesian-confirmation-philosophy-and-the-tacking-paradox-in-i/

I’m not sure how this bears on the point I’ve already noted, group (1) differs from group (2).

Will be away the rest of the day.

For Bayesians as for classical statisticians, the import of the data on the hypothesis involves a model. The import of the data is interpreted in terms of everything that is known about the hypothesis/data. “Interpretation” requires the whole model.

Translating “Data x are better evidence for hypothesis H1 than for H0” into Bayesian terms gives:

P(H1|x) is greater than P(H0|x)

directly. The posterior (not the likelihood ratio directly) is what’s used by Bayesians in decisions analysis, hypothesis testing, parameter estimation, you name it.

This isn’t a technicality. As stated before, Bayesian breaking of the “law of likelihood” is directly related to the most common ways of avoiding over-fitting in practice. Over-fitting is the chief negative consequence of adhering religiously to the law of likelihood.

ANON-FAN

Do you see the penalty in model selection as utilizing a Bayesian prior? My likelihood colleagues avidly oppose this. Not that they’re 100% clear on justifying penalties or which one to employ. This does not arise in ordinary testing.

“But the data import, with which the priors are combined, would still come through the likelihoods/LRs or the like.”

You were careful in the post and comments to say you were talking about the “law of likelihood” not the likelihood principle. I’ve been talking about the former, which is what I thought you were talking about as well.

Regardless of anything else, the import of the data x for hypothesis H is judged by Bayesians through P(H|x). So it is possible to adhere to the likelihood principle, and still violate the Law of Likelihood.

It can hardly be called an “assumption” when there are entire books defending it! Of course, there are also good books that defend the error-statistical approach, including your own, so nothing follows from that fact except perhaps that these issues are not simple.

To anon-fan: A Bayesian obeys the Law of Likelihood in the sense that the odds of H1 against H2 increase upon conditioning on E if and only if E favors H1 over H2 according to the Law of Likelihood:

Pr(H1|E)/Pr(H2|E)=[Pr(H1)/Pr(H2)][Pr(E|H1)/Pr(E|H2)]
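A worked instance of this odds identity may help separate the two things being argued about in this thread. The numbers are hypothetical (a 1% prior for H1 = disease, a positive result E that is 18 times more probable under disease than under H2 = no disease), and `posterior_odds` is an illustrative name, not anyone’s official method.

```python
from fractions import Fraction

def posterior_odds(prior_h1, prior_h2, lik_h1, lik_h2):
    """Pr(H1|E)/Pr(H2|E) = [Pr(H1)/Pr(H2)] * [Pr(E|H1)/Pr(E|H2)]."""
    return (prior_h1 / prior_h2) * (lik_h1 / lik_h2)

# Hypothetical inputs: H1 = disease, H2 = no disease, E = positive result.
prior_h1, prior_h2 = Fraction(1, 100), Fraction(99, 100)
lik_h1, lik_h2 = Fraction(9, 10), Fraction(1, 20)

prior_odds = prior_h1 / prior_h2        # 1/99
likelihood_ratio = lik_h1 / lik_h2      # 18, so E favors H1 by the LR
post_odds = posterior_odds(prior_h1, prior_h2, lik_h1, lik_h2)

print(prior_odds, likelihood_ratio, post_odds)  # 1/99 18 2/11
```

With these numbers the odds on H1 rise from 1/99 to 2/11 because the LR exceeds 1, yet they stay below 1, so H2 keeps the higher posterior. The direction of change tracks the likelihood ratio; the final ranking also depends on the prior.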

This post suggests a slavish adherence to the idea that “data favors hypothesis with bigger likelihoods” causes statisticians to over-fit by picking models which maximize the Likelihood.

This post suggests Bayesians are guilty of this.

Bayesians are not. They judge the model in light of the data using P(H|x) which causes them to maximize a different quantity than the likelihood.

The quantity they do maximize over is formally identical to the penalized maximum (log) likelihood procedures which are commonly used in practice to avoid over-fitting.

To anon-fan: Are you disagreeing with the fact that:

“The odds of H1 against H2 increase upon conditioning on E if and only if E favors H1 over H2 according to the Law of Likelihood”?

Greg: Thanks for your note and explanation to anon-fan. Perhaps s/he wants to hold a different view of what it means to follow a law.

“It can hardly be called an “assumption” when there are entire books defending it!” Hmm, like the books reviewed in this post? As philosophers, of course, our job is to question the laws with problematic consequences, especially when carved into stone. And of course I know you know this.

In retrospect, the “law” is an intermediary in this discussion and it’s not important how it’s defined.

Bayesian theory uses P(H|x) in its equations to reason about H in light of x, not the LR in isolation. They aren’t guilty of the crimes you claim for those who do use LR and nothing else.

From a strategic/expository point of view, showing how Frequentist ideas overcome over-fitting is great. Claiming ONLY they do is a communication blunder since any statistician will know it isn’t true either in theory or practice.

The post alleged no crimes of Bayesians. Consider though upon what principle the well known criticism that p-values overstate evidence rests. http://errorstatistics.com/2014/07/14/the-p-values-overstate-the-evidence-against-the-null-fallacy/

From the post:

‘In terms of rejection:

“An hypothesis should be rejected if and only if there is some rival hypothesis much better supported [i.e., much more likely] than it is.” (Hacking 1965, 89)’

Bayes theory doesn’t reject based off likelihood ratios. It uses Bayes factors which can be greater or less than 1 regardless of what the LR is.

If this is an example of the “law of likelihood” then Bayes breaks it.

anon-fan: Just so you’ll know, I tried to free your comments from needing approval, but the ghosts of “anons” past will not let me.

Mayo, this is interesting, but there are two things that you should consider.

First, Edwards specifies that likelihoods and the likelihood principle are model-bound. It seems to me that he is unarguably correct, and if we insist that likelihoods have to be from a single (statistical) model in order to be comparable, and thus subject to the likelihood principle, then many of your objections to the likelihood principle disappear.

The evil demon arguments like “things had to turn out the way they did” in your section 2 have no power if the likelihoods in question have scope limited to their respective models. The “things had to” model is not a statistical model in the same way as the (0.5)^k model is.

Hacking’s concern about the interpretation of likelihood ratios for tanks and diffraction gratings is easily disarmed after recognition that the likelihoods come from two different models. (I would point out the truth that he could obtain a different pair of likelihood ratios simply by choosing a different model for estimation of the tank serial number and for the distribution of diffraction grating sizes.)

Second, according to Royall the likelihood principle and law of likelihood do not say anything about interpretation of the relative evidential meanings of simple versus composite hypotheses. The (0.5)^k coin hypothesis H0 is simple but, if it is anything, the “things had to” hypothesis H1 is composite.

Michael: Not sure why you say the things-had-to hypothesis is composite. Composite hypotheses are usually defined as encompassing a non-singleton set of probability distributions on the observable quantities; it’s pretty clear that any things-had-to hypothesis involves only a single (degenerate) distribution. (In fact, on this view the set of simple hypotheses is the convex hull of the set of things-had-to hypotheses, which makes it pretty clear that a things-had-to hypothesis is simple.)

Corey, I guess you may be correct, but with no clear model within which to determine the simple/composite nature of the “had to happen” hypothesis, the point is probably moot.

Michael: there’s lots I could say here, and I will be touching on Royall next, but let me just say one thing–because I talked to Barnard on just these points and he didn’t think those cases were disarmed by reference to a model. Never mind exotic cases and just consider the favorite example critics allude to (as in the “p-values overestimate” post): a Normal mean with sigma known. The complaint, one of them, is that p = .05 doesn’t correspond to 20-fold comparative likelihood for the max alternative compared to the null.
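For readers who want the arithmetic behind that complaint, here is a minimal sketch assuming a two-sided test of a Normal mean with sigma known. Setting the alternative equal to the observed mean maximizes its likelihood, and the resulting ratio over the null is exp(z²/2), where z is the standardized observed difference. The function name is illustrative.

```python
import math
from statistics import NormalDist

def max_likelihood_ratio(p_two_sided):
    """LR of the best-fitting alternative over the null, Normal mean, sigma known.

    The two-sided p-value fixes |z|, and the ratio at the maximally likely
    alternative (mu equal to the observed mean) is exp(z**2 / 2).
    """
    z = NormalDist().inv_cdf(1.0 - p_two_sided / 2.0)
    return math.exp(z ** 2 / 2.0)

lr_at_05 = max_likelihood_ratio(0.05)
print(f"p = .05 gives a maximum LR of about {lr_at_05:.2f}, versus 1/p = {1 / 0.05:.0f}")
```

Under these assumptions the maximum comparative likelihood at p = .05 comes out a little under 7, well short of 20.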

Mayo, no-one should think that P = 0.05 indicates a maximal likelihood ratio of 20. Who says that is a problem? P-values are not reciprocals of likelihood ratios and likelihood ratios are not reciprocals of P-values.

There is a one-to-one relationship between P-values and likelihood functions for significance tests, but the relationship is between a single number and a function, and is thus not a simple linear relationship. See my draft paper on the properties of P-values for full details of that relationship: http://arxiv.org/abs/1311.0081

The complaints about P-values overstating evidence relate to definition of evidence, not to the relationship between likelihoods and P-values.

Since Barnard’s “things just had to” hypotheses (I prefer to call them Sure Thing hypotheses) have come up for conversation, it’s worth pointing out that Sure Things pose just as much (or as little) a problem for severity accounts as for likelihood and Bayesian accounts. For details, check out (arch-frequentist and, in Jamie Robins’s words, “skeptical conscience of statistics”) David Freedman’s paper, Diagnostics Cannot Have Much Power Against General Alternatives.

For any sufficiently general set of (composite) alternatives, Freedman showed that there exist alternative hypotheses in the set against which the test has power arbitrarily close to the size of the test — even though those alternatives are “far away” from the null hypothesis in some sense. The trick is to consider sets of distributions arbitrarily close to the degenerate distributions specified by Sure Thing hypotheses.

Let me state this more forcefully. Freedman shows that for the assumptions (i) that the complete (i.e., multivariate) data distribution has a density; (ii) assuming it has a density, that the marginal (i.e., data-point-wise) distributions are independent; (iii) assuming the individual data points are independent, that they are identically distributed — *severe tests do not exist*. Ipso facto, null hypotheses that make use of these assumptions cannot pass severe tests of these assumptions; to a certain extent, Freedman says, they have to be taken on faith. (Contrast this with Spanos’s misspecification tests and his claim that these can secure the statistical adequacy of an assumed model!)

My own opinion is that Freedman’s view is overly pessimistic. To the extent that statistical approaches (of any stripe) have been successful in learning about the universe, we can conclude that the lack of severe tests against these general alternatives has not actually impeded our efforts (and presumably will continue to pose no bar in the future). But! — whatever philosophical explanation may exist for this fact, it will be “pre-severity” and hence will be available to likelihoodists too. (And Bayesians, I suppose, but I imagine only “objective Bayesians” will care about that. Subjective Bayesians can just rule them out by fiat without violating the internal consistency of their approach, and we Jaynesians already have good reasons not to worry about most Sure Things.)

Corey: I don’t think I’m familiar with Freedman’s paper, but I fail to see how it’s a problem that tests have low power against these alternatives. The issue for one looking at comparative likelihood is that they find evidence for the perfectly fitting hypothesis, even if they’re false/poorly tested. I call those gellerized hypotheses. But I’m launching in without having read previous comments. Will do upon return from tap.

OK I looked at it, you’re not really mounting this argument based on those low powered diagnostic tests– oy–way off topic, so won’t touch here. I’ll try to be clearer in the next installment.

Here’s what I’m arguing:

Premises:

(Freedman) Severe tests against certain classes of general alternatives that include Sure Things and near-Sure-Things do not exist.

(History) Nevertheless, learning from data has taken place.

Conclusion: Severe tests against these alternatives are not needed to learn from data.

Corollary: Supposing philosophical grounds exist to exclude consideration of these alternatives (which, recall, include Sure Things) a priori, they will not rely on severity arguments. Thus said grounds will be available to likelihoodists to rebut Barnard’s Sure Thing argument.

Corey: where in the world did this argument come from? I’d certainly reject the first premise, and I don’t even recall Freedman, whom I liked a lot, talking about severity (although he wrote me several letters on it, in sync with the idea).

Mayo: The argument came from my brain.

Freedman doesn’t say anything directly about severity, but the premise I’ve labeled “Freedman” is a straightforward consequence of his theorems and the definition of severity. Basically, he shows that for each of the assumptions I listed above, for any possible data set, there exists in the set of alternatives a distribution against which *any* non-randomized test will have arbitrarily low power. The method is essentially to point out that there will always be near-Sure-Things within the set of alternative hypotheses that pass the test with probability arbitrarily close to one. (One might rename his paper “When Misspecification Testing Shatters”.)

So let’s get specific. Suppose we have data and an IID model, and we are doing M-S testing on the “Identically Distributed” part of the model assumptions. The severity criterion then reads, “a passing result is a severe test of hypothesis “Identically Distributed” just to the extent that it is very improbable for such a passing result to occur, were “Identically Distributed” false.” Freedman shows that no matter what test is used and what data have been obtained, there exists a near-Sure-Thing under which “Identically Distributed” is false and yet the passing result has probability arbitrarily close to one. That is, a severe test against the general alternative “not-Identically-Distributed” does not exist.

Corey: Huh? please recheck defns, going out sightseeing for the day.

Mayo: So I rechecked, and the definition I gave is from EGEK, pg 178, with “Identically Distributed” substituted for your general H. You’ve given other definitions which talk about measures of fit/accordance, but I cannot see that these can save the definition of severity from the existence of near-Sure-Things.

Corey: Returned from sight-seeing, despite a downpour.

I’m really not sure how your problem relates. It sounds as if you’re talking about a general underdetermination problem, so since you have EGEK at the ready, please look at chapter 6. This is also sometimes called the “alternative hypothesis” objection. The idea is that a claim H is empirically underdetermined by a method M (and data x) if ~H passes the test just as well with x, according to the method M. I wouldn’t call an alternative against which a test is incapable of discriminating a “sure thing” hypothesis, by the way. If there are alternatives to H (at the same level* as H) against which my test method M cannot distinguish, then I don’t infer I can distinguish them. See EGEK 204. This doesn’t show any kind of general underdetermination for the methodology. Wrt your ex. of model assumptions, I just don’t see what problem (for severity) is supposed to be presented by the fact—supposing it be a fact—that a certain test may have terrible power to detect certain violations. One might at best be able to infer that we’ve got evidence to rule out violations that the test does have a good chance of having alerted us to, if those violations existed. (Assuming no violations were found). So what? We cannot infer we’ve ruled out violations. That may be a problem for certain uses of the model, but severity does just what it should.

But I’m still trying to figure out what this has to do with the topic of this post. Now the topic does link to underdetermination in the way we distinguish “rigged” and data dependent rivals to H. First the rigged (p. 205): if someone says in a completely general way, there is always an alternative that by definition would be found to agree beautifully with any data taken to pass H, then I reject that method as highly unreliable. It does not diminish the severity accorded to H. Blanket rigging could always maintain such a rival exists even when H is true, so the existence of riggings in no way diminished the severity accorded to H. With a data dependent rival constructed or selected to fit the data (I sometimes write it as H(x)), again, the poor error probability of the method scotches the inference to H(x). Fit is not enough. Then there are rivals at “higher levels”, e.g., I’m asking about the mean of a distribution, and someone says there are loads of causal explanations I haven’t ruled out. No change to the severity accorded to claims about the mean, and we deny that tests that probe means are probing causes, so none of those get credit. (e.g., stat sig is not substant sig). I could go on with lots of different cases, and I’ve written papers about this stuff**. It’s the fundamental problem of underdetermination that gives value to looking at error probing capacities.

*i.e., give rival answers to the question under test (if they give the same answers, they are not rivals wrt the test of interest).

**e.g., http://www.phil.vt.edu/dmayo/personal_website/(1997)%20Severe%20Tests%2c%20Arguing%20from%20Error%2c%20and%20Methodological%20Underdetermination.pdf

Mayo: I find your response baffling, as it has virtually nothing to do with the issues I intend to discuss.

I’m getting to the point of repeating myself, so I’m going to leave it at this: I’m not, repeat NOT, saying Sure Things (that is, simple statistical hypotheses that assign probability one to the observed data) pose a problem for severity. Barnard says that the existence of Sure Things poses a problem for likelihoodist accounts. My claim is that YOU should reject Barnard’s reasoning.

Tests that would allow one to rule out all Sure Things with severity don’t exist — by definition, post-data there exists a Sure Thing that gives high probability to the observed data. And yet, for whatever reason, this fact poses no bar to learning from data. All I’m saying is that whatever argument an error statistician might want to make to rule out all possible Sure Things a priori, it cannot be based on severity, and hence will equally be available to a likelihoodist to rule out all possible Sure Things a priori. So even though an “unrestricted” application of the law of likelihood would have the Sure Thing favored over all other statistical hypotheses, an error statistician ought to admit that likelihoodist accounts are “allowed” to rule out all Sure Things a priori, and hence Barnard’s criticism fails.

Corey: Approx 1/3 the things you wrote I’d agree with, but won’t go thru them all. I don’t reject Barnard’s argument. We don’t rule out all sure things a priori, never said anything about a priori principles. Never said anything about “sure things” for that matter. If you mean to say an error stat should admit the pure likelihoodist is allowed to rule out gellerized hypotheses, I’d ask: on what grounds can they? It is intrinsic to the law of likelihood that more likely hypes receive higher support. Bayesians may appeal to priors, and we to error probs.

Mayo: You write:

‘H_0: the coin is fair, gets a small likelihood (.5)^k given k tosses of a coin, while H_1: the probability of heads is 1 just on those tosses that yield a head, renders the sequence of k outcomes maximally likely. This is an example of Barnard’s “things just had to turn out as they did”.’

What Barnard called, “things just had to turn out as they did,” I’m calling “Sure Thing”. Hence you did say something about Sure Things: you identified H_1 as an exemplar.
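The arithmetic behind the quoted coin example can be sketched in a few lines. This is only an illustration: the function names and the particular toss sequence are made up here, and the Sure Thing is encoded as "probability of heads is 1 exactly on the tosses that came up heads, 0 on the rest".

```python
from fractions import Fraction

def likelihood_fair(tosses):
    # H_0: every toss is heads with probability 1/2, so any sequence of k tosses has (1/2)^k.
    return Fraction(1, 2) ** len(tosses)

def likelihood_sure_thing(tosses):
    # H_1 is built from the observed data (hypothetical encoding):
    # P(heads) = 1 on tosses that were heads, P(heads) = 0 on tosses that were tails.
    like = Fraction(1)
    for outcome in tosses:
        p_heads = 1 if outcome == "H" else 0
        like *= p_heads if outcome == "H" else 1 - p_heads
    return like

tosses = "HTTHHTHT"  # an arbitrary illustrative sequence, k = 8
lr = likelihood_sure_thing(tosses) / likelihood_fair(tosses)
print(lr)  # likelihood ratio of H_1 over H_0, equal to 2^k
```

Whatever sequence is observed, the ratio is 2^k, which is the sense in which the fitted hypothesis is "maximally likely".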

How would an error statistician demonstrate, by tests on the observed data, that not-H_1 (or any statistical hypothesis that implies not-H_1, e.g., H_0) is well-warranted?

The N-P test guarantees of error control here are violated for data-dependent hypotheses, e.g., see hunting and snooping, ch. 9 EGEK; Giere’s example, while different, makes the same point, p. 305.

As for the null being well-warranted, we’d generally merely rule out discrepancies with severity.

Mayo: Yeah, that’s what I mean when I say that severe tests to rule out the post-hoc Sure Thing hypothesis don’t exist.

Sorry, you’re being incomprehensible, your sentence doesn’t parse. It’s well known that error probabilistic properties of methods are altered by a variety of selection effects, optional stopping, trying and trying again, etc. The severity requirement demands revised assessments. With a gellerized alternative, the good fit is assured with maximal probability. If what you’re asking about is whether I can go to a meta-level: can I severely pass the claim that “such and such test has poor error probabilities”? The answer is yes.

and by the way, I don’t want to rule out what you’re calling the post-hoc Sure Thing hypothesis. Hypotheses that pass poorly are not thereby “ruled out”. I may even know they are true.

If my answer is a hodge-podge, it’s because yours is. What are you talking about?

Mayo: I asked, how would an error statistician demonstrate, by tests on the observed data, that H: “not-Sure-Thing” is well-warranted? You answered, in short, they can’t (and gave the reason). I replied, yeah, that’s what I was getting at.

My continuation goes: suppose an error statistician wants to argue, in some particular case, that the Sure Thing hypothesis is, for whatever reason, not worth considering. (She may not want to make that argument, and that’s fine; but I’m considering the case in which she does.) Whatever argument she offers cannot be based on the observed data — as we’ve established, tests based on the data simply cannot rule out post-hoc Sure Things with severity. Therefore her argument, whatever it may be, will be available to a likelihoodist too, notwithstanding the fact that the data give the Sure Thing hypothesis the maximum possible likelihood.

This is why I say that the existence of Sure Things poses exactly as much, or as little, problem for a likelihoodist as for an error statistician.

I’m done.

Corey: I’m afraid you’re incorrect. The key difference is that data-dependent hypotheses alter the error probing capacities of tests and these are picked up by the sampling distribution. The sampling distribution is irrelevant to the likelihoodist (or holder of the likelihood principle)–once the data are in hand– on grounds that they involve outcomes other than the ones observed. Unless this central distinction is grasped, the key logic of error statistics is missed. You may be “done” but it’s important for readers to know what’s going on when they read, as they will have, that considering the sampling distribution is unscientific, and could only be relevant for someone concerned with long run error rates of repeating the method. I, for one, read this in brand new articles on a daily basis.
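One way to see what "picked up by the sampling distribution" amounts to is to enumerate every possible outcome under a true fair-coin null and ask how often a rival ends up favored. The sketch below does this exactly for k = 10 tosses. The threshold of 8, the fixed bias of 0.8, and all function names are illustrative assumptions, not anything specified in the discussion.

```python
from itertools import product

def prob_rival_favored(k, threshold, rival_likelihood):
    """P(LR of rival over H_0 >= threshold), computed under H_0 (fair coin)
    by exact enumeration of all 2**k toss sequences."""
    p_seq = 0.5 ** k  # probability of each sequence under H_0
    total = 0.0
    for seq in product("HT", repeat=k):
        if rival_likelihood(seq) / p_seq >= threshold:
            total += p_seq
    return total

def sure_thing(seq):
    # Rival chosen after seeing seq: it assigns probability 1 to whatever occurred.
    return 1.0

def fixed_bias(seq, p=0.8):
    # Rival fixed in advance: P(heads) = p on every toss (p = 0.8 is an assumption).
    heads = seq.count("H")
    return p ** heads * (1 - p) ** (len(seq) - heads)

k = 10
print(prob_rival_favored(k, 8, sure_thing))  # data-dependent rival
print(prob_rival_favored(k, 8, fixed_bias))  # predesignated rival
```

The data-dependent rival is favored by the chosen factor on every possible outcome even though H_0 is true, while the predesignated rival is favored only on the few sequences with nine or ten heads.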

Mayo: Well, it turns out I have something new to say, so I guess I’m not done. You’ve asserted that my argument is incorrect, but your rebuttal doesn’t actually defeat my argument.

You wrote: “The sampling distribution is irrelevant to the likelihoodist (or holder of the likelihood principle)–once the data are in hand.”

And this is perfectly true, but it’s not on point, because what we’re really talking about is the *hypothesis space*, which is needed before anyone brings up sampling distributions or likelihood functions. Barnard, replying to Hacking, took at face value Hacking’s statement of the Law of Likelihood, which makes no mention of restricting the hypothesis space. But as you discuss, by the time Barnard was writing, even Hacking wasn’t defending that claim — by that time he was criticizing Edwards, who explicitly held that likelihoods are model-bound — that is, the Law of Likelihood applies in the context of a specific hypothesis space, which is chosen on non-data/non-likelihood grounds. This is what gives likelihoodists room to avoid absurd applications of the unrestricted Law of Likelihood.

You write: “The key difference is that data-dependent hypotheses alter the error probing capacities of tests and these are picked up by the sampling distribution.”

But it does not follow that a likelihoodist cannot avail herself of these arguments when choosing a hypothesis space or model. If we look at a modern likelihoodist like Royall, we can see that he’s perfectly comfortable calculating pre-data probabilities of error for experimental design purposes, so why not for hypothesis-space design purposes?

Corey:

My intended follow-up is all on Royall, but I didn’t get around to putting it up.

I don’t know what you mean by Royall computing “pre-data probabilities of error …for hypothesis-space design purposes.” Royall was sufficiently perturbed at arguments about his lack of error control–many of which I brought up in EGEK and when we were in a session together–that he highlighted the point against point example (as Savage does). But this only holds for predesignated points.

Mayo: Royall defines and uses what he calls “the probability of obtaining misleading evidence” — a pre-data error probability — for experimental design. I’m saying there’s no reason why he can’t use pre-data error probabilities when choosing which hypotheses to subject to the Law of Likelihood.

Yeah, he could become an error statistician too, which would be great, but he’s vehemently opposed. He further denies it’s possible to do anything more than compare likelihood of hyp H and H’ on given evidence x. Never, for example, can we compare x and x’ wrt a single H. His demonstration that I cannot be relevant for evidence is that I take into account stopping rules. I will post an article from him in Taper and Lele, later in the week.

Mayo: Okay, fine. To sum up: I’ve made two distinct arguments as to why Barnard’s Sure Thing critique of the Law of Likelihood is, shall we say, past its prime. Both arguments take as their point of departure the idea that no one defends an unrestricted application of the Law of Likelihood anymore.

First, I pointed out that whenever an error statistician might want to rule out the post-hoc Sure Thing, whatever argument she might use cannot be based on the data, and hence will be available to a likelihoodist too. Second, I pointed out that a likelihoodist philosophy is allowed to choose a restricted hypothesis space in which to apply the Law of Likelihood, and can use pre-data calculations of the “probability of observing misleading evidence” to do so. In “canonical” probability models such as the binomial model or normal model, such calculations can easily be seen to justify removing the set of Sure Thing hypotheses from the hypothesis space prior to applying the Law of Likelihood.
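For readers who want to see what a pre-data "probability of observing misleading evidence" looks like in the binomial model, here is a minimal sketch for two predesignated simple hypotheses. The values p0 = 0.5, p1 = 0.75, the sample sizes, and the function name are assumptions chosen for illustration. The 1/k bound checked in the comments is the general Markov-inequality bound on P(LR >= k) under the hypothesis in the denominator.

```python
from math import comb

def prob_misleading(n, p0, p1, k):
    """P(LR of p1 over p0 >= k) for n Bernoulli trials, computed under p0.
    Both p0 and p1 are fixed before the data are seen."""
    total = 0.0
    for x in range(n + 1):
        lr = (p1 / p0) ** x * ((1 - p1) / (1 - p0)) ** (n - x)
        if lr >= k:
            total += comb(n, x) * p0 ** x * (1 - p0) ** (n - x)
    return total

# How the pre-data probability changes with the planned sample size.
for n in (5, 10, 20, 40):
    m8 = prob_misleading(n, 0.5, 0.75, 8)
    m32 = prob_misleading(n, 0.5, 0.75, 32)
    # The general bound: each value is at most 1/k.
    print(n, round(m8, 4), round(m32, 4), m8 <= 1 / 8, m32 <= 1 / 32)
```

The calculation needs the alternative to be named in advance. A hypothesis constructed from the observed sequence has no fixed p1 to plug in, which is why it falls outside this kind of design computation.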

For me, Barnard’s Sure Thing argument now has the status of howler. (…And I’m not even a likelihoodist!)

Corey: I don’t know what you can mean by saying “an error statistician might want to rule out the post-hoc Sure Thing, whatever argument she might use cannot be based on the data”. If you’re meaning to say that I lack empirical grounds to object to criticizing ad hoc hypotheses or ignoring selection effects, you’re wrong.

I think you’re confusing the uncontroversial issue here. That accounts obeying the likelihood principle and laws of likelihood have “a problem” controlling and assessing error probabilities has long been known. They’re not in the business of doing that, and they are the first to admit it. Even where they show error control, it’s deemed a nice extra, or something to show to people who care about such things. Even the restricted cases don’t really get beyond the problem as Barnard and Birnbaum knew. (Recall, for a different illustration, that you were happy with finding evidence for irrelevant conjunctions because of the Bayes boost? We’d cast those out as having poor error probabilities.)

But quite aside from that, and more germane to my post, comparative likelihood is the most common basis for the criticisms of error statistics. Maybe it isn’t always obvious that that’s at the heart of it all. There are often a few steps in the assumptions behind these criticisms. (Of relevance here are the recent posts about p-values overstating evidence: evidence is presumed to be measured by either LRs, Bayes boosts, or Bayesian posteriors. The critics know the significance tests aren’t intended to provide those measures, they’re not just confused or blatantly begging the question. They will appeal to an assumed deeper sense of inductive evidence in order to mount the criticisms. Such notions are assumed to be superior or deeper epistemically because, what do error probabilities have to do with inference in the individual case?)

OK, I’ll bite: What do error probabilities have to do with inference in the individual case?

(I suspect that an important part of any answer would be “nothing very much” unless you account for all types of mistaken inferences.)

where have you been?

more another time.

Mayo: I’m out of ways to restate what I mean by, “an error statistician might want to rule out the post-hoc Sure Thing, whatever argument she might use cannot be based on the data”. All I can say is that if you were to give an example of a test statistic that can be used to rule out the post-hoc Sure Thing hypothesis with high severity, that would certainly disprove my claim.

I don’t mean to confuse the “uncontroversial issue”, so lest anyone reading this is unsure as to my goal, it is simply this: Barnard gave a critique of likelihoodism that depended on the existence of Sure Thing hypotheses to argue against unrestricted application of the Law of Likelihood (i.e., the first “if” of Hacking’s “if and only if”, see the original post); I aim to provide reasons why this critique ought to be omitted from discussions of modern likelihoodist statistical philosophy (except possibly for historical purposes).

We rule out claims and general strategies because they FAIL to sustain good severity!

And Barnard’s criticism doesn’t depend on the extreme case–neither does Hacking’s! The criticism is failure to control error probabilities.

Hacking’s additional criticism, also important, is that LRs don’t mean the same things in different contexts, we don’t know what they mean, 10, 50, 200–they are not calibrated. Finally, I should throw in, that H’ is comparatively worse supported than H is scarcely to give me information as to whether H is warranted.

Perhaps the best thing at this point is to wait til I get the Royall post up.

Perhaps it will help to remember that for an error statistician, criticizing the well testedness of a claim (with a given test) is not to deny the plausibility or even truth of the claim.

I have been trying to follow the string of comments and have a couple of comments. First, error statistics explicitly negate the effect of the sure thing hypothesis by dealing with sampling error proactively. There is always an attempt to account for the wobble in the results that is created by the procedure used to generate the data. Probabilities relate to procedures. Lew’s simulations reveal the relationship between the p-values and likelihood functions that seems to me to confirm the intuitions of Fisher about the meaning and evidential import of significance tests. Of course, there can be a “better” hypothesis somewhere when conducting frequentist tests, but what you make of the null depends upon your understanding of the sampling procedure and the result. Other sure things are not a problem. They are a problem when you apply the Law of Likelihood. You can assume that only the hypotheses considered from the start are relevant and this helps a little, but you are left floundering with the knowledge that sometimes– how often depending upon the procedure, sample size, etc.– the true hypothesis does not have the highest likelihood. This is reality in practice, and it does not satisfy the user of statistics to say “well, we only deal in evidential import under the model.” That is worthless. I believe this is why Royall offers the 8:1 and 32:1 standards for likelihood ratios as benchmarks. That helps a lot, but how then does that differ in philosophical principle from relying upon p-values? It seems to me that Royall has defined cutoffs in an attempt to control error. If that is right, then have we not come back around to significance tests?

John: first, those ratios still don’t control error probs adequately, and second, we wouldn’t want to insist all hyps are trotted out at the start. Any stipulation requires a justification. We have justifications for taking into account the ways data and hypotheses selections impinge on the capabilities of tests. We take them into account just when they do so impinge. They need not. Remember novelty/avoidance of double counting etc. need not preclude severity (ch 8 in EGEK is relevant but I’ve got loads of papers on this). Likelihoodists’ only principle is the law of likelihood (and likelihood principle).

John, I agree with most of what you write but have a couple of points in response.

First, you point to the need to consider evidence (via likelihoods or P-values) “under the model” and suggest that that is not satisfactory to users of statistical procedures. You have not captured the whole story because any investigator who is able to deal with minor deviations from the repeated sampling principle can choose a different model at any time to obtain a different perspective on the evidence. It is only a strict ‘error statistician’ who has to choose a model before the experiment and stick with it.

Second, you are correct to think that arbitrary likelihood ratio cutoffs for decisions are interchangeable with arbitrary P-value cutoffs for decisions, and thus it can make no difference whether a dichotomous decision is based on the former or the latter. However, the point of my paper and simulations was to explore the notion that the evidence in data can be much more useful when viewed in a non-dichotomous manner. If the evidence is not particularly compelling then I might choose to make a ‘soft’ decision whereby I am prepared to assume a value for the hypothesis but would be willing to change that decision on the basis of further evidence. (Sounds a bit like what one would expect of a scientist, in my opinion.)
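The interchangeability of cutoffs can be made concrete in the simplest setting. The sketch below assumes a normal model with known sigma, a fixed sample size n, and two predesignated means mu0 < mu1; in that case the likelihood ratio is a monotone function of the sample mean, so a cutoff on the LR maps to a cutoff on the one-sided P-value against mu0. The numbers and function names are illustrative assumptions only.

```python
import math

def likelihood_ratio(xbar, mu0, mu1, sigma, n):
    # LR of mu1 over mu0 given sample mean xbar (normal model, known sigma).
    return math.exp(n * (mu1 - mu0) / sigma ** 2 * (xbar - (mu0 + mu1) / 2))

def lr_cutoff_as_p_value(k, mu0, mu1, sigma, n):
    """Sample mean at which the LR equals k, and the one-sided P-value
    against mu0 at that sample mean. Assumes mu1 > mu0."""
    se = sigma / math.sqrt(n)
    xbar_cut = (mu0 + mu1) / 2 + sigma ** 2 * math.log(k) / (n * (mu1 - mu0))
    z = (xbar_cut - mu0) / se
    p = 0.5 * math.erfc(z / math.sqrt(2))  # upper-tail normal probability
    return xbar_cut, p

for k in (8, 32):
    xbar_cut, p = lr_cutoff_as_p_value(k, mu0=0.0, mu1=1.0, sigma=1.0, n=4)
    print(k, round(xbar_cut, 3), round(p, 4))
```

The particular P-value that corresponds to an LR of 8 or 32 shifts with n and with the choice of mu1, so the mapping is one-to-one only within a fixed design.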

Mayo, please note that that non-dichotomous view has some similarities to a severity curve. I think that you should spend more effort on promoting those curves than on arguments based on decisions between point hypothesis values.

Michael: quick response: I never consider point against point unless I’m dealing with other people’s criticisms of error statistics which usually invoke point against point. I even exclude them because severity requires exhaustion of the answer to the question. I am happy to promote severity curves, and if you look at some of my papers you’d see them, actually the excel thingy on this page shows some. I will have to read the rest of your post later because I’m going to the 3-yr celebration.

Michael: Thanks for your comments. As Mayo stated, what she and Spanos have really promoted are severity curves, as you rightly suggest are most useful in practice. I have been developing curves for some applications in my field. As to the comment about “under the model”, I only brought it up because it has been mentioned in comments above as creating a scenario in which the law of likelihood is useful/valid. I agree with your point and would not advocate for the law of likelihood in any such circumstance. I do not think it is justifiable for practical applications. Royall’s use of the ratio benchmarks suggests to me he would agree, but I am not sure. Back to severity curves, it seems your simulations can be extended to show the relationship between likelihood functions, p-values, and severity.

John, I have thought about the relationships among severity, P-values and likelihoods and, as far as I’ve gotten, severity is equal to one minus the integral of the (scaled) likelihood function. That would suggest a very close relationship between the two, and I suspect it would allow the likelihood principle to apply to severities in some manner. That would probably be necessary if both likelihoods and severities relate to the evidence within the data relative to hypothesised parameter values of a model.
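The suggested relationship can be checked numerically in one simple case. This sketch assumes a normal mean with known standard error, takes severity for the claim mu > mu1 to be P(Xbar <= observed xbar; mu = mu1), and scales the likelihood (as a function of mu) to integrate to one. The observed mean, standard error, and function names are illustrative assumptions, and nothing here speaks to models beyond this one.

```python
import math

def normal_cdf(z):
    return 0.5 * math.erfc(-z / math.sqrt(2))

def severity_mu_greater(mu1, xbar, se):
    # SEV(mu > mu1) = P(Xbar <= xbar_observed; mu = mu1), normal with known se.
    return normal_cdf((xbar - mu1) / se)

def scaled_likelihood_integral(mu1, xbar, se, steps=20000):
    # Trapezoid integral of the normalized likelihood in mu, from far below xbar up to mu1.
    lo = xbar - 10 * se
    h = (mu1 - lo) / steps

    def dens(mu):
        return math.exp(-0.5 * ((xbar - mu) / se) ** 2) / (se * math.sqrt(2 * math.pi))

    total = 0.5 * (dens(lo) + dens(mu1))
    for i in range(1, steps):
        total += dens(lo + i * h)
    return total * h

xbar, se = 0.4, 0.2
for mu1 in (0.0, 0.2, 0.4, 0.6):
    sev = severity_mu_greater(mu1, xbar, se)
    one_minus_integral = 1 - scaled_likelihood_integral(mu1, xbar, se)
    print(mu1, round(sev, 4), round(one_minus_integral, 4))
```

In this symmetric location model the two columns agree, which is consistent with the comment above as far as this case goes.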

I added a note to my remark from the end of part (A), but I don’t know that Bayesians in general would want it described this way. I take it that Gelman-Bayesians and most others are keen to separate out the import of the empirical data (through the likelihoods) from the priors. However, this issue might have been distracting some people from the point of the post.

Aug 31 note:

If someone wanted to describe the addition of the priors under rubric (2) as tantamount to “breaking the likelihood law”, as opposed to merely requiring it to be supplemented, nothing whatever changes in the point of this post….My sentence, in fact, might well have been “Only the error statistician (3) requires breaking the likelihood law and the likelihood principle (by dint of requiring considerations of the sampling distribution to obtain the evidential import of the data).”

A plain vanilla application of Bayes theorem will not pick out, or favor, the “things-had-to hypothesis” because Bayesians will maximize the posterior, not the likelihood.

If you say this has no bearing on your arguments, then I’m afraid I have no idea what you’re arguing.

anon-fan: I didn’t say the Bayesian would favor the best fitting hypothesis. As you say, they will maximize the posterior–if they compute posteriors (not all Bayesians do), and the most probable hyp may well differ from the most likely. That’s why, by the way, most/many? likelihoodists introduce priors and move away from being pure likelihoodists.

More specifically, it refutes any claim that only sampling theory methods can avoid the “things-had-to hypothesis” and secondly, I can’t imagine on what grounds a Bayesian would criticize a sampling theorist for doing the same.

“it refutes any claim that only sampling theory methods can avoid the “things-had-to hypothesis””

Totally agreed here.

“and secondly, I can’t imagine on what grounds a Bayesian would criticize a sampling theorist for doing the same”

Really? their grounds are generally (a) that the appeal to the sampling distribution could only be relevant if your concern was to control long-run error rates, or (b) it violates the likelihood principle by considering outcomes other than the one observed.

What does (a) or (b) have to do with the Law of likelihood?

A side note as well. If I understand Error Statistics correctly, the control of long-run error rates is necessary but not sufficient. You are adding additional conditions on top of the long-run error rates to get valid inferences.

Since controlling long-run error rates is still necessary, however, any criticism of the form “controlling long-run error rates isn’t necessary when making inferences in the individual case” is unaffected by those additional conditions you’ve added.

anon-fan: 3 things. Maybe when you said “you couldn’t imagine…” you meant the Bayesian would accept and expect the error statistician to raise such concerns, even if they’re not their own concerns. I see that.

Are you saying the law of likelihood is unrelated to the likelihood principle? I can see holding the latter but not the former. Doubtless one could also accept the former and not the latter. But even so the grounds for criticism could well be, and often is, the LP. This will come up in my Royall post, still on draft during anniversary posts.

I don’t quite understand what you’re saying in your “Since controlling..” sentence. Maybe you should explain it before I try to answer.

To explain it differently. You’re saying “use long-run rates + other stuff”, while the critics are saying “don’t use long-run rates for inferences in the individual instance”.

This opens the door to you and your critics talking past each other.

For instance, you may say “the other stuff I’ve added makes sense”, while they’ll be saying “maybe, maybe not. Either way you already make a big mistake by controlling the long-run error rates”.

anon-fan: I thought it might be along these lines, but without meaning to rule out this possibility for a futuristic critic, it definitely would not answer current ones: most of the long-run error control they criticize are ones I want to keep. It’s not so much my adding requirements as reinterpreting the uses of error probs, and in fact supplying a rather different logic for statistical inference.

I see you are amidst some pretty intense discussion, but I have a comment about the original post.

I think I can answer this point:

> I fail to see how anyone can evaluate an inference from data x to a claim C without learning about the capabilities of the method, through the relevant sampling distribution.

I think you are essentially right here. On the other hand, a fully Bayesian approach would not attempt to ‘evaluate an inference about claim C’, rather it would predict x_{n+1} using p(x_{n+1}|x). I think if you go over the optimal stopping example you cite and compute the predictive distributions you will find that they are entirely satisfactory regardless of the stopping rule used. If the experiment stops early the predictive distribution will be broad, if not it is a bit narrower. If a lot of data is collected the predictive distribution will have significant uncertainty – but almost all of this is due to the structure of the model rather than posterior uncertainty. It’s easy to see that trying to distort the result by using a strange stopping rule just doesn’t work.

For this example p(x_{n+1}|x) is probably sufficient, but you could also use the richer p(x_{n+1},…,x_{n+k}|x) for some k. If you want I can draw up some plots to illustrate the point.
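In place of plots, a small numerical sketch shows the narrower point that the predictive p(x_{n+1}|x) does not change with the stopping rule. It swaps in a simpler setting than the example under discussion: Bernoulli trials with a uniform prior on the success probability, comparing "stop after n trials" (binomial) with "stop at the t-th tail" (negative binomial). The data values and function names are assumptions for illustration.

```python
from math import comb

def predictive_heads(likelihood, cells=20000):
    """P(next toss is heads | data) under a uniform prior on p,
    by midpoint-rule integration of p * likelihood(p) over (0, 1)."""
    num = den = 0.0
    for i in range(cells):
        p = (i + 0.5) / cells
        w = likelihood(p)
        num += p * w
        den += w
    return num / den

heads, tails = 9, 3  # hypothetical data
n = heads + tails

def binomial_lik(p):
    # Stopping rule: toss exactly n times.
    return comb(n, heads) * p ** heads * (1 - p) ** tails

def neg_binomial_lik(p):
    # Stopping rule: toss until the third tail appears.
    return comb(n - 1, tails - 1) * p ** heads * (1 - p) ** tails

print(predictive_heads(binomial_lik))
print(predictive_heads(neg_binomial_lik))
```

The two likelihoods differ only by a constant factor, so the posterior and the predictive coincide. Whether that invariance is a virtue or a defect is exactly what the thread is disputing; the code only shows that it holds.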

Essentially there is a difference in aims between the Bayesian and error statistical approach.

I think the ‘howlers’ do have some problems and agree with much of what you have written. The ‘howlers’ also tend to mangle the fact that there is a different goal in the two approaches.

I would also add, that while the Bayesian approach tends to win the philosophical arguments, it has a much tougher time in practice. A good example Senn raised was medical trial problems where a decision must be made to either determine if the drug is safe or do further testing. While a Bayesian decision theoretic approach could be formulated it would require computing expectations over very complex objects (outcomes of contemplated experiments) and would probably be overly academic rather than practically useful.

David: Just on your last point, I don’t think the “Bayesian approach tends to win the philosophical arguments” and the reason is related to your point about practice. It is an essential requirement for a valuable account of statistical inference in science that it promote the kind of problem solving and further the kind of knowledge that humans actually attain (and like to attain). It should speed things up, not slow things down; it should capitalize on limitations while ensuring something at least is learned.

Your point (from Senn) about Bayesians requiring “expectations over very complex objects” etc. is why they require supplementation with “forward-looking” methods that let you get going now, (jump in, try something, and jump out) but with built-in means to extract clues for what to try next.

Too late to be writing, but I hope you get the idea.

Addition: I do agree with your point about there being a difference of aims and that’s important.

That’s an interesting point. Would you say Bayesian methods have been less successful in practice than classical statistics? What percentage of those papers being retracted used Bayesian methods?

If not, doesn’t that open the possibility there is a complete, coherent philosophical justification for what Bayesians do in practice?

Or do you believe all Bayesian methods, even when they assign probabilities to non-repeatable hypotheses, are secretly frequentist?

We seem to agree! Although if I add anything I am sure I can shatter that ;)

I think there is a very strong need for Bayesian methods to be able to “jump in, do something, jump out”. In practice this would mean dropping full probabilistic specifications and conditioning. While conditioning is problematic in a partial specification a probabilistic ‘audit’ is possible.

Michael Goldstein and colleagues have developed a formal theory in this space – which I don’t know as much about as I would like.

Alternatively, it’s possible – I would argue mandatory – to wrap subjectivism around other methods. I would argue that when people use either non-Bayesian methods or Bayesian methods without a full elicitation process (i.e. _always_) this is what is really happening. After all, the output (the determination of what you think you should do, or what you think will happen) is subjective; in order to have a subjective output, you need a subjective input.

“Alternatively, it’s possible – I would argue mandatory – to wrap subjectivism around other methods. I would argue that when people use either non-Bayesian methods or Bayesian methods without a full elicitation process (i.e. _always_) this is what is really happening.”

David, I agree. That seems to be exactly what is happening with the following claim:

“Of course, there can be a “better” hypothesis somewhere when conducting frequentist tests, but what you make of the null depends upon your understanding of the sampling procedure and the result.”

For all the talk about not having to enumerate the space of hypotheses and what a wonderful advantage that is, “your understanding of the sampling procedure and the result” is effectively an _implicit_ enumeration of the space of possible explanations for the result of the sampling procedure.