Author Archives: Deborah Mayo

About Deborah Mayo

I am a professor in the Department of Philosophy at Virginia Tech and hold a visiting appointment at the Center for the Philosophy of Natural and Social Science of the London School of Economics. I am the author of Error and the Growth of Experimental Knowledge, which won the 1998 Lakatos Prize, awarded to the most outstanding contribution to the philosophy of science during the previous six years. I have applied my approach toward solving key problems in philosophy of science: underdetermination, the role of novel evidence, Duhem's problem, and the nature of scientific progress. I am also interested in applications to problems in risk analysis and risk controversies, and co-edited Acceptable Evidence: Science and Values in Risk Management (with Rachelle Hollander). I teach courses in introductory and advanced logic (including the metatheory of logic and modal logic), in scientific method, and in philosophy of science. I also teach special topics courses in Science and Technology Studies.

BREAKING THE LAW! (of likelihood): to keep their fit measures in line (A), (B 2nd)



1. An Assumed Law of Statistical Evidence (law of likelihood)

Nearly all critical discussions of frequentist error statistical inference (significance tests, confidence intervals, p-values, power, etc.) start with the following general assumption about the nature of inductive evidence or support:

Data x are better evidence for hypothesis H1 than for H0 if x are more probable under H1 than under H0.

Ian Hacking (1965) called this the logic of support: x supports hypothesis H1 more than H0 if H1 is more likely, given x, than is H0:

Pr(x; H1) > Pr(x; H0).

[With likelihoods, the data x are fixed, the hypotheses vary.]*


x is evidence for H1 over H0 if the likelihood ratio LR (H1 over H0) is greater than 1.

It is given in other ways besides, but it’s the same general idea. (Some will take the LR as actually quantifying the support, others leave it qualitative.)

In terms of rejection:

“An hypothesis should be rejected if and only if there is some rival hypothesis much better supported [i.e., much more likely] than it is.” (Hacking 1965, 89)

2. Barnard (British Journal for the Philosophy of Science)

But this “law” will immediately be seen to fail on our minimal severity requirement. Hunting for an impressive fit, or trying and trying again, it’s easy to find a rival hypothesis H1 much better “supported” than H0 even when H0 is true. Or, as Barnard (1972) puts it, “there always is such a rival hypothesis, viz. that things just had to turn out the way they actually did” (1972 p. 129). H0: the coin is fair, gets a small likelihood (.5)^k given k tosses of a coin, while H1: the probability of heads is 1 just on those tosses that yield a head, renders the sequence of k outcomes maximally likely. This is an example of Barnard’s “things just had to turn out as they did”. Or, to use an example with P-values: a statistically significant difference, being improbable under the null H0, will afford high likelihood to any number of explanations that fit the data well.
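Barnard's coin example can be checked with a few lines of arithmetic. The following is only an illustrative sketch: the toss sequence and the function names are made up, and H1 is built after the fact from whatever sequence was observed.

```python
def likelihood_fair(tosses):
    """Likelihood of the observed sequence under H0: the coin is fair."""
    return 0.5 ** len(tosses)


def likelihood_had_to_happen(tosses):
    """Likelihood under the data-dependent rival H1.

    H1 sets the probability of heads to 1 on exactly the tosses that came
    up heads and to 0 on the others, so the observed sequence gets
    probability 1 whatever it was.
    """
    prob = 1.0
    for outcome in tosses:
        p_heads = 1.0 if outcome == "H" else 0.0   # tailored to the data
        prob *= p_heads if outcome == "H" else 1.0 - p_heads
    return prob


tosses = list("HTTHHTHTTH")   # hypothetical data, k = 10
lr = likelihood_had_to_happen(tosses) / likelihood_fair(tosses)
print(len(tosses), likelihood_fair(tosses), lr)   # LR equals 2**k
```

Any sequence of k tosses gives the same likelihood ratio of 2^k in favor of H1, which is why the ratio by itself says nothing about whether H0 is false.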

3. Breaking the law (of likelihood) by going to the “second,” error statistical level:

How does it fail our severity requirement? First look at what the frequentist error statistician must always do to critique an inference: she must consider the capability of the inference method that purports to provide evidence for a claim. She goes to a higher level or metalevel, as it were. In this case, the likelihood ratio plays the role of the needed statistic d(X). To put it informally, she asks:

What’s the probability the method would yield an LR disfavoring H0 compared to some alternative H1  even if H0 is true?

What’s the probability of so small a likelihood for H0 compared to H1, even if H0 adequately describes the data generating procedure? As Pearson and Neyman put it:

“[I]n order to fix a limit between ‘small’ and ‘large’ values of LR we must know how often such values appear when we deal with a true hypothesis. That is to say we must have knowledge of the chance of obtaining [so small a likelihood ratio] in the case where the hypothesis tested [H0 ] is true” (Pearson and Neyman 1930, 106).

Looking at “how often such values appear” of course turns on the sampling distribution of the LR viewed as a statistic. That’s why frequentist error statistical accounts are called sampling theory accounts. This requires considering other values that could have occurred, not just the one you got.
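What "how often such values appear" amounts to can be sketched for one simple case. The setup below is an assumption chosen for illustration, not something taken from Pearson and Neyman: Normal data with known standard deviation 1, H0: mean = 0, and as rival the best-fitting mean (the sample mean). The threshold of 8 and all function names are hypothetical.

```python
import math
import random


def max_lr(sample):
    """LR of the best-fitting mean (the sample mean) against H0: mu = 0,
    for Normal data with known standard deviation 1."""
    n = len(sample)
    xbar = sum(sample) / n
    return math.exp(n * xbar ** 2 / 2)


def prob_lr_at_least(c):
    """Exact Pr(LR >= c; H0) for c >= 1. Under H0, sqrt(n) * xbar is
    standard Normal, so the event is |Z| >= sqrt(2 ln c)."""
    return math.erfc(math.sqrt(math.log(c)))


def simulate(c, n=20, trials=20000, seed=1):
    """Relative frequency of LR >= c when the data really come from H0."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        if max_lr(sample) >= c:
            hits += 1
    return hits / trials


c = 8.0
exact = prob_lr_at_least(c)
simulated = simulate(c)
print(exact, simulated)
```

The point of the sketch is only that the answer requires the distribution of the LR over samples that could have occurred, which is exactly the information a pure likelihood analysis sets aside.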

But this breaks the law of likelihood and so is taboo for the likelihoodist! (Likewise for anyone holding the Likelihood Principle[i].)

Viewing the sampling distribution as taboo (once the data are given) is puzzling in the extreme[ii]. How can it be desirable to block out information about how the data were generated and the hypotheses specified? I fail to see how anyone can evaluate an inference from data x to a claim C without learning about the capabilities of the method, through the relevant sampling distribution. Readers of this blog know my favorite example to demonstrate the lack of error control if you look only at likelihoods: the case of optional stopping. (Keep sampling until you get a nominal p value of .05 against a 0 null hypothesis in two-sided Normal testing of the mean. You can be wrong with maximal probability.)
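The optional stopping case can be simulated directly. This is a rough sketch under stated assumptions: data are standard Normal (so the 0 null is true), the nominal two-sided .05 cutoff is taken as |z| >= 1.96, and sampling is capped at a hypothetical max_n because a simulation has to stop somewhere. Function names are illustrative.

```python
import math
import random


def rejects_with_optional_stopping(max_n, rng, z_crit=1.96):
    """Sample N(0,1) data one at a time (so H0: mean = 0 is true) and stop
    as soon as the two-sided z-test is nominally significant."""
    total = 0.0
    for n in range(1, max_n + 1):
        total += rng.gauss(0.0, 1.0)
        if abs(total) / math.sqrt(n) >= z_crit:
            return True
    return False


def actual_type1_rate(max_n, trials=2000, seed=0):
    """Proportion of trials in which the try-and-try-again rule rejects H0."""
    rng = random.Random(seed)
    rejections = sum(
        rejects_with_optional_stopping(max_n, rng) for _ in range(trials)
    )
    return rejections / trials


for max_n in (1, 10, 100, 1000):
    print(max_n, actual_type1_rate(max_n))
```

With max_n = 1 the rule is an ordinary fixed-sample test; as the cap grows, the proportion of erroneous rejections climbs well above the nominal .05, even though the likelihood function at the stopping point looks the same as in a fixed-sample experiment.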

Just such examples, where the alternative is not a point value, led Barnard to abandon (or greatly restrict) the Likelihood Principle. Interestingly, in raising these criticisms of likelihood, Barnard is reviewing Ian Hacking’s 1965 book: The Logic of Statistical Inference. Only thing is, by the time of this 1972 review, Hacking had given it up as well! In fact, in the pages immediately following Barnard’s review of Hacking, is Hacking reviewing A.W.F. Edwards’ book Likelihood (1972) wherein Hacking explains why he’s thrown his own likelihood rule of support overboard.

4. Hacking (also BJPS)

A classic case is the normal distribution and a single observation. Reluctantly we will grant Edwards that the observation x is the best supported estimate of the unknown mean. But the hypothesis about the variance, with highest likelihood, is the assumption that there is no variance, which strikes us as monstrous. ... we must concede that as prior information we take for granted the variance is at least w. But even this will not do, for the best supported view on the variance is then that it is exactly w.

For a less artificial example, take the ‘tram-car’ or ‘tank’ problem. We capture enemy tanks at random and note the serial numbers on their engines. We know the serial numbers start at 0001. We capture a tank number 2176. How many tanks did the enemy make? On the likelihood analysis, the best supported guess is: 2176. Now one can defend this remarkable result by saying that it does not follow that we should estimate the actual number as 2176, only that comparing individual numbers, 2176 is better supported than any larger figure. My worry is deeper. Let us compare the relative likelihood of the two hypotheses, 2176 and 3000. Now pass to a situation where we are measuring, say, widths of a grating, in which error has a normal distribution with known variance; we can devise data and a pair of hypotheses about the mean which will have the same log-likelihood ratio. I have no inclination to say that the relative support in the tank case is ‘exactly the same as’ that in the normal distribution case, even though the likelihood ratios are the same. Hence even on those increasingly rare days when I will rank hypotheses in order of their likelihoods, I cannot take the actual log-likelihood number as an objective measure of anything. (Hacking 1972, 136-137).
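Hacking's comparison can be reproduced numerically. The sketch below assumes a single captured tank whose serial number is equally likely to be any of 1 to N, and, for the grating, a single Normal observation with known standard deviation 1 placed exactly at one of the two hypothesized means. These modelling choices and the function names are illustrative assumptions, not Hacking's own.

```python
import math


def tank_likelihood(total_tanks, observed_serial):
    """Likelihood of 'total_tanks' given one captured serial number,
    assuming each serial 1..total_tanks is equally likely to be captured."""
    if total_tanks < observed_serial:
        return 0.0
    return 1.0 / total_tanks


def normal_log_lr(x, mu_a, mu_b, sigma=1.0):
    """Log-likelihood ratio of mean mu_a over mean mu_b for one Normal
    observation x with known standard deviation sigma."""
    return ((x - mu_b) ** 2 - (x - mu_a) ** 2) / (2 * sigma ** 2)


serial = 2176
best_guess = max(range(2000, 4000), key=lambda n: tank_likelihood(n, serial))
tank_log_lr = math.log(
    tank_likelihood(2176, serial) / tank_likelihood(3000, serial)
)

# Devise a Normal case with the same log-LR: put the observation exactly at
# mu_a, so log-LR = d**2 / 2, where d is the distance between the two means.
d = math.sqrt(2 * tank_log_lr)
grating_log_lr = normal_log_lr(x=0.0, mu_a=0.0, mu_b=d)

print(best_guess, tank_log_lr, grating_log_lr)
```

The two log-likelihood ratios agree by construction, yet one comes from a likelihood that is highest at the smallest value not ruled out by the data and the other from a smooth Normal curve, which is the disparity Hacking is pointing to.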

Hacking appears even more concerned with the fact that likelihood ratios do not enjoy a stable evidential meaning or calibration, than the lack of error control in likelihoodist accounts. But Hacking was still assuming the latter must be cashed out in terms of long run error performance[iii] as opposed to stringency of test.

I say: a method that makes it easy to declare evidence against hypotheses erroneously gives an unwarranted inference each time; methods that fail to directly pick up on optional stopping, data dredging, cherry picking, multiple testing or any of the other gambits that alter the capabilities of tests to avoid mistaken inferences are poor methods, but not because of their behavior in the long run. They license unwarranted or questionable inferences in each and every application. This is so, I aver, even if we happen to know, through other means, that their inferred claim C is correct.

5. Three ways likelihoods arise in inference. Aug. 31 note at end of para.

Likelihoods are fundamental in all statistical inference accounts. One might separate how they arise in three groups (acknowledging divisions within each):

(1) likelihoods only (pure likelihoodist)

(2) likelihoods + priors (Bayesian)

(3) likelihoods + error probabilities based on sampling distributions (error statistics, sampling theory)

Only the error statistician (3) requires breaking the likelihood law. [See note.] You can feed us fit measures from (1) and (2), and we will do the same thing: ask about the probability of so good (or poor) a fit between data and some claim C, even if C is false (true). The answer will be based on the sampling distribution of the relevant statistic, computed under the falsity of C (or discrepancies from what C asserts).[iv]

Aug 31 note: 

If someone wanted to describe the addition of the priors under rubric (2) as tantamount to “breaking the likelihood law”, as opposed to merely requiring it to be supplemented, nothing whatever changes in the point of this post. (It would seem to introduce idiosyncrasies in the usual formulation–but these are not germane to my post.) My sentence, in fact, might well have been “Only the error statistician (3) requires breaking the likelihood law and the likelihood principle (by dint of requiring considerations of the sampling distribution to obtain the evidential import of the data).”




Installment (B): an ad hoc clarificatory note, prompted by comments from an anonymous fan

6. Of tests and comparative support measures

The statements of “the” law of likelihood, and likelihood support logics are not all precisely identical. Some accounts are qualitative, merely indicating prima facie increased support; others will devise quantitative measures of support based on likelihoods. (There are at least 10 of them we covered in our recent seminar, maybe more.) Some will try out corresponding “tests”, others not. One needn’t have anything like a test or a “rejection rule” to be a likelihoodist. I mentioned the construal in terms of tests because it is in the sentence just before the one I quote from Barnard, and wanted to be true to what he had just said about Hacking’s 1965 book.

Remember the topic of my post concerns criticisms of error statistical methods, and a principle (or “law”) of evidence used in those criticisms. (If you reject that principle, then presumably you wouldn’t use it to criticize error statistical methods, so we have no disagreement on this.) A clear rationale for connecting tests of hypotheses—be they Fisherian or N-P style—and logics of likelihood is to mount criticisms: to explain what’s wrong with those (Fisherian or N-P) tests, and how they may be cured of their problems.

Hacking lays out an impressive argument that all that is sensible in N-P likelihood ratio tests is captured by his conception of likelihood tests (the one he advanced back in 1965) while all the (apparently) counterintuitive parts are jettisoned. Now that I’ve access to my NYC library, I can quote the portion to which Barnard is alluding in his review of Hacking.

“Our theory of support leads directly to the theory of testing suggested in the last chapter [VI]. An hypothesis should be rejected if and only if there is some rival hypothesis much better supported than it is. Support has already been analysed in terms of ratios of likelihoods. But what shall serve as ‘much better supported’? For the present I leave this in abeyance, and speak merely of tests of different stringency. With each test will be associated a critical ratio. The greater the critical ratio, the more stringent the test. Roughly speaking hypothesis h will be rejected in favour of rival i at critical level alpha, just if the likelihood ratio of i to h exceeds alpha.” (Hacking 1965 p.89)
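The rule Hacking describes, with its critical ratio, is easy to write down. The sketch below uses a binomial example that is not from Hacking: the hypotheses (success probabilities .5 and .7), the data (14 successes in 20 trials), the two critical ratios, and the function names are all made up for illustration.

```python
from math import comb


def binomial_likelihood(p, successes, n):
    """Probability of the observed number of successes if the success
    probability is p."""
    return comb(n, successes) * p ** successes * (1 - p) ** (n - successes)


def rejected_in_favour_of(h_p, i_p, successes, n, critical_ratio):
    """Likelihood test in the style of the quoted passage: reject h in
    favour of rival i when the likelihood ratio of i to h exceeds the
    critical ratio. Returns the verdict and the ratio."""
    lr = binomial_likelihood(i_p, successes, n) / binomial_likelihood(
        h_p, successes, n
    )
    return lr > critical_ratio, lr


lenient, lr = rejected_in_favour_of(0.5, 0.7, successes=14, n=20, critical_ratio=4)
stringent, _ = rejected_in_favour_of(0.5, 0.7, successes=14, n=20, critical_ratio=8)
print(lr, lenient, stringent)
```

The same data reject h at the lower critical ratio and not at the higher one, which is the sense in which a greater critical ratio gives a more stringent test. Nothing in the rule refers to how often such a ratio would arise were h true.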

I don’t want to pursue this discussion of Hacking here. To repeat, my post concerns criticisms of error statistical methods. A foundational critique of a method of inference depends on holding another view or principle or method of inference. This post is an offshoot of the recent posts here and here (7/14/14 and 8/17/14).

Critiques in those posts are based on assuming that it is fair, reasonable, obvious or what have you, to criticize the way p-values arise in inference by means of a different view of inference. (I allude here to genuine or “audited” p-values, not mere nominal or computed p-values.) The p-value, it is reasoned, should be close to either a posterior probability (in the null hypothesis) or a likelihood ratio (or Bayes ratio). Ways to “fix” p-values are proposed to get them closer to these other measures. I don’t think there was anything controversial about this being the basic goal, not just of the particular papers we looked at, but mountains of papers that have been written and are being written this very moment.

I may continue with my intended follow-up (Part C).

*Note: I am not sure whether the powers that be are allowing us to say “data x is” nowadays–I read something about this, maybe it was by Pinker. Can somebody please ask Steven Pinker for me? Thanks.

[i] Please search this blog for quite a lot on the likelihood principle and the strong likelihood principle.

[ii] I would say this even if we knew the model was adequate. Likelihood principlers may regard using the sampling distribution to test the model as legitimate.

[iii] Perhaps he still is; I don’t mean to saddle him with my testing construal of error probabilities at all. (Some hints of a shift exist in his 1980 article in the Braithwaite volume.)

[iv] This delineation comes from Cox and Hinkley, but I don’t have it here.


Barnard, G. (1972). “Review of The Logic of Statistical Inference by I. Hacking,” Brit. J. Phil. Sci., 23(2): 123-132.

Hacking, I. (1965). Logic of Statistical Inference. Cambridge: CUP.

Hacking, I. (1972). “Review of Likelihood. An Account of the Statistical Concept of Likelihood and Its Application to Scientific Inference by A. W. F. Edwards,” Brit. J. Phil. Sci., 23(2): 132-137.

Hacking, I. (1980). “The Theory of Probable Inference: Neyman, Peirce and Braithwaite.” In D. H. Mellor (ed.), Science, belief and behavior: Essays in honor of R.B. Braithwaite, 141-160. Cambridge: CUP.

Pearson, E.S. & Neyman, J. (1930). On the problem of two samples. Joint Statistical Papers by J. Neyman & E.S. Pearson, 99-115 (Berkeley: U. of Calif. Press). First published in Bul. Acad. Pol. Sci. 73-96.



Has Philosophical Superficiality Harmed Science?



I have been asked what I thought of some criticisms of the scientific relevance of philosophy of science, as discussed in the following snippet from a recent Scientific American blog. My title elicits the appropriate degree of ambiguity, I think. 

Quantum Gravity Expert Says “Philosophical Superficiality” Has Harmed Physics

By John Horgan | August 21, 2014

“I interviewed Rovelli by phone in the early 1990s when I was writing a story for Scientific American about loop quantum gravity, a quantum-mechanical version of gravity proposed by Rovelli, Lee Smolin and Abhay Ashtekar[i]

Horgan: What’s your opinion of the recent philosophy-bashing by Stephen Hawking, Lawrence Krauss and Neil deGrasse Tyson?

Rovelli: Seriously: I think they are stupid in this.   I have admiration for them in other things, but here they have gone really wrong.  Look: Einstein, Heisenberg, Newton, Bohr…. and many many others of the greatest scientists of all times, much greater than the names you mention, of course, read philosophy, learned from philosophy, and could have never done the great science they did without the input they got from philosophy, as they claimed repeatedly. You see: the scientists that talk philosophy down are simply superficial: they have a philosophy (usually some ill-digested mixture of Popper and Kuhn) and think that this is the “true” philosophy, and do not realize that this has limitations.

Here is an example: theoretical physics has not done great in the last decades. Why? Well, one of the reasons, I think, is that it got trapped in a wrong philosophy: the idea that you can make progress by guessing new theory and disregarding the qualitative content of previous theories.  This is the physics of the “why not?”  Why not studying this theory, or the other? Why not another dimension, another field, another universe?  Science has never advanced in this manner in the past.  Science does not advance by guessing. It advances by new data or by a deep investigation of the content and the apparent contradictions of previous empirically successful theories.  Quite remarkably, the best piece of physics done by the three people you mention is Hawking’s black-hole radiation, which is exactly this.  But most of current theoretical physics is not of this sort.  Why?  Largely because of the philosophical superficiality of the current bunch of scientists.”

I find it intriguing that Rovelli suggests that “Science does not advance by guessing. It advances by new data or by a deep investigation of the content and the apparent contradictions of previous empirically successful theories.” I think this is an interesting and subtle claim with which I agree. Would this have been brought to light by being better tuned into current philosophy of science? Unclear. I don’t know Hawking’s criticisms, but I think philosophers of science would admit—most of them, at least if they’ve been in the field for awhile—that the promises of 30, 20 and 10 years ago, to be relevant to scientific practice, haven’t really panned out. To be clear, I absolutely think that philosophers of science can and should be at the forefront in any number of methodological, conceptual, and epistemological quagmires across the landscape of the natural and social sciences. I have written about this many times, and have organized forums with likeminded philosophers of science and scientists. With few exceptions, philosophers of science have not been involved in tackling these issues. In philosophy of statistics, philosophers are less of a presence now than when I was in graduate school[ii].

Here was the start of my introduction for a conference in June 2010 at the London School of Economics:

Debates over the philosophical foundations of statistics have a long and fascinating history; the decline of a lively exchange between philosophers of science and statisticians is relatively recent. Is there something special about 2011 (and beyond) that calls for renewed engagement in these fields? I say yes. There are some surprising, pressing, and intriguing new philosophical twists on the long-running controversies that cry out for philosophical analysis, and I hope to galvanize my co-contributors as well as the reader to take up the general cause. It is ironic that statistical science and philosophy of science—so ahead of their time in combining the work of philosophers and practicing scientists—should now find such dialogues rare, especially at a time when philosophy of science has come to see itself as striving to be immersed in, and relevant to, scientific practice. I will have little to say about why this has occurred, although I do not doubt there is a good story there, with strands colored by philosophy, sociology, economics, and trends in other fields. I want instead to take some steps toward answering our question: Where and why should we meet from this point forward?

The on-line volume growing out of that conference, and contributions obtained shortly after, may be found here.

In a week or so, my paper, “On the Birnbaum Argument for the Strong Likelihood Principle” (and my rejoinder to comments) will appear in Statistical Science. The Likelihood Principle and the general topic of inductive-statistical “concepts of evidence” was of keen interest in philosophy when I was starting out, along with philosophy of statistics more generally. Now there may actually be more interest among statisticians than philosophers. We shall see.

[i] A link Horgan gives to read more on Rovelli’s views on physics and philosophy is a 2012 conversation with him on

[ii] There was a post which garnered quite a lot of comment last year: “What should philosophers of science do” (The comments got kind of off topic.)


Are P Values Error Probabilities? or, “It’s the methods, stupid!” (2nd install)



Despite the fact that Fisherians and Neyman-Pearsonians alike regard observed significance levels, or P values, as error probabilities, we occasionally hear allegations (typically from those who are neither Fisherian nor N-P theorists) that P values are actually not error probabilities. The denials tend to go hand in hand with allegations that P values exaggerate evidence against a null hypothesis—a problem whose cure invariably invokes measures that are at odds with both Fisherian and N-P tests. The Berger and Sellke (1987) article from a recent post is a good example of this. When leading figures put forward a statement that looks to be straightforwardly statistical, others tend to simply repeat it without inquiring whether the allegation actually mixes in issues of interpretation and statistical philosophy. So I wanted to go back and look at their arguments. I will post this in installments.

1. Some assertions from Fisher, N-P, and Bayesian camps

Here are some assertions from Fisherian, Neyman-Pearsonian and Bayesian camps: (I make no attempt at uniformity in writing the “P-value”, but retain the quotes as written.)

a) From the Fisherian camp (Cox and Hinkley):

“For given observations y we calculate t = t_obs = t(y), say, and the level of significance p_obs by

p_obs = Pr(T > t_obs; H0).

….Hence p_obs is the probability that we would mistakenly declare there to be evidence against H0, were we to regard the data under analysis as being just decisive against H0.” (Cox and Hinkley 1974, 66).

Thus p_obs would be the Type I error probability associated with the test.
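As a worked instance of the Cox and Hinkley definition, here is a small sketch for a one-sided test of a Normal mean with known standard deviation. The choice of test statistic, the null value, the sample, and the function name are all assumptions made for illustration.

```python
import math


def observed_significance(sample, mu0=0.0, sigma=1.0):
    """Return (t_obs, p_obs) for H0: mean = mu0 against larger means.

    T = sqrt(n) * (sample mean - mu0) / sigma is standard Normal under H0,
    so p_obs = Pr(T > t_obs; H0) is an upper-tail Normal area.
    """
    n = len(sample)
    xbar = sum(sample) / n
    t_obs = math.sqrt(n) * (xbar - mu0) / sigma
    p_obs = 0.5 * math.erfc(t_obs / math.sqrt(2))
    return t_obs, p_obs


sample = [0.9, -0.2, 1.4, 0.3, 0.8, 1.1, -0.5, 0.6, 1.2]   # hypothetical data
t_obs, p_obs = observed_significance(sample)
print(t_obs, p_obs)
```

Read as in the quotation, p_obs is the probability of declaring evidence against H0 erroneously if results at least as extreme as this one were treated as just decisive.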

b) From the Neyman-Pearson (N-P) camp (Lehmann and Romano):

“[I]t is good practice to determine not only whether the hypothesis is accepted or rejected at the given significance level, but also to determine the smallest significance level…at which the hypothesis would be rejected for the given observation. This number, the so-called p-value gives an idea of how strongly the data contradict the hypothesis. It also enables others to reach a verdict based on the significance level of their choice.” (Lehmann and Romano 2005, 63-4) 

Very similar quotations are easily found, and are regarded as uncontroversial—even by Bayesians whose contributions stood at the foot of Berger and Sellke’s argument that P values exaggerate the evidence against the null.

c) Gibbons and Pratt:

“The P-value can then be interpreted as the smallest level of significance, that is, the ‘borderline level’, since the outcome observed would be judged significant at all levels greater than or equal to the P-value[i] but not significant at any smaller levels. Thus it is sometimes called the ‘level attained’ by the sample….Reporting a P-value, whether exact or within an interval, in effect permits each individual to choose his own level of significance as the maximum tolerable probability of a Type I error.” (Gibbons and Pratt 1975, 21).

2. So what’s behind the “P values aren’t error probabilities” allegation?

In their rejoinder to Hinkley, Berger and Sellke assert the following: “The use of the term ‘error rate’ suggests that the frequentist justifications, such as they are, for confidence intervals and fixed α-level hypothesis tests carry over to P values.”

They do not disagree with Cox and Hinkley’s assertion above, but they maintain that:

“This hypothetical error rate does not conform to the usual classical notion of ‘repeated-use’ error rate, since the P-value is determined only once in this sequence of tests. The frequentist justifications of significance tests and confidence intervals are in terms of how these procedures perform when used repeatedly.” (Berger and Sellke 1987, 136)

Keep in mind that Berger and Sellke are using “significance tests” to refer to Neyman-Pearson (N-P) tests in contrast to Fisherian P-value appraisals.

So their point appears to be simply that the P value, as intended by Fisher, is not justified by (or not intended to be justified by) a behavioral appeal to controlling long run error rates. It is assumed that those are the only, or the main, justifications available for N-P significance tests and confidence intervals (thus type 1 and 2 error probabilities and confidence levels are genuine error probabilities). They do not entertain the idea that the P value, as the attained significance level, is important for N-P theorists nor that “a p-value gives an idea of how strongly the data contradict the hypothesis” (Lehmann and Romano)—a construal we find early on in David Cox.

But let’s put that aside, as we pin down Berger and Sellke’s point. Here’s how we might construe them. They grant that the P-value is, mathematically, a frequentist error probability; it is the justification that they think differs from what they take to be the justification of Type 1 and 2 errors in N-P statistics. They think N-P tests and confidence intervals get their justification in terms of (actual?) long run error rates, and P-values do not. To continue with their remarks:

“Can P values be justified on the basis of how they perform in repeated use? We doubt it. For one thing, how would one measure the performance of P values? With significance tests and confidence intervals, they are either right or wrong, so it is possible to talk about error rates. If one introduces a decision rule into the situation by saying that H0 is rejected when the P value < .05, then of course the classical error rate is .05.”[ii] (ibid.)

Thus, P values are error probabilities, but their intended justification (by Fisher?) was not a matter of a behavioristic appeal to low long-run error rates, but rather, something more inferential or evidential. We can actually strengthen their argument in a couple of ways. Firstly, we can remove the business of “actual” versus “hypothetical” repetitions, because the behavioristic justifications that they are trying to call out are also given in terms of hypotheticals. Moreover, behavioristic appeals to controlling error rates are not limited to “reject/do not reject”, but apply even where the inference is in terms of an inferred discrepancy or other test output.

The problem is that the inferential vs behavioristic distinction does not separate Fisherian P-values from confidence levels and type 1 and 2 error probabilities. All of these are amenable to both types of interpretation! More to follow in installment #2.


Installment 2: Mirror Mirror on the Wall, Who’s the More Behavioral of them all?

Granted, the founders did not make out intended inferential construals fully—though representatives from Fisherian and N-P camps took several steps. At the same time, members of both camps also can be found talking like acceptance samplers!

Berger and Sellke had said: “If one introduces a decision rule into the situation by saying that H0 is rejected when the P value < .05, then of course the classical error rate is .05.” Good. Then we can agree that it is mathematically an error probability. They simply don’t think it reflects the Fisherian ideal.
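The mathematical point conceded here can be checked by simulation. The sketch assumes a one-sided test with a standard Normal test statistic and generates data with H0 true; the function names, the number of trials, and the seed are arbitrary choices.

```python
import math
import random


def p_value(z):
    """One-sided P value Pr(Z > z; H0) for a standard Normal test statistic."""
    return 0.5 * math.erfc(z / math.sqrt(2))


def rejection_rate_under_h0(threshold=0.05, trials=100000, seed=0):
    """Apply the rule 'reject H0 when P < threshold' to data generated with
    H0 true, and report how often it rejects."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(trials):
        z = rng.gauss(0.0, 1.0)   # test statistic when H0 is true
        if p_value(z) < threshold:
            rejections += 1
    return rejections / trials


rate = rejection_rate_under_h0()
print(rate)   # should be close to the threshold
```

The relative frequency of erroneous rejections comes out near .05, so the P value threshold behaves as a Type 1 error probability whatever justification one attaches to it.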

3. Fisher as acceptance sampler.

But it was Fisher, after all, who declared that “Every experiment may be said to exist only in order to give the facts a chance of disproving the null hypothesis.” (DOE 15-16)

Or to quote from an earlier article of Fisher (1926):

…we may, if we prefer it, draw the line at one in fifty (the 2 per cent point), or one in a hundred (the 1 per cent point). Personally, the writer prefers to set a low standard of significance at the 5 per cent point, and ignore entirely all results which fail to reach this level. A scientific fact should be regarded as experimentally established only if a properly designed experiment rarely fails to give this level of significance. The very high odds sometimes claimed for experimental results should usually be discounted, for inaccurate methods of estimating error have far more influence than has the particular standard of significance chosen.

The above is a more succinct version of essentially the same points Fisher makes in DOE.[iii]

No wonder Neyman could tell Fisher to look in the mirror (as it were): “Pearson and I were only systematizing your practices for how to interpret data, using those nice charts you made. True, we introduced the alternative hypothesis (and the corresponding type 2 error), but that was only to give a rationale, and apparatus, for the kinds of tests you were using. You never had a problem with the Type 1 error probability, and your concern for how best to increase “sensitivity” was to be reflected in the power assessment. You had no objections—at least at first”. See this post.

The dichotomous “up-down” spirit that Berger and Sellke suggest is foreign to Fisher is not foreign at all. Again from DOE:

Our examination of the possible results of the experiment has therefore led us to a statistical test of significance, by which these results are divided into two classes with opposed interpretation. ….The two classes of results which are distinguished by our test of significance are, on the one hand, those which show a significant discrepancy from a certain hypothesis; …and on the other hand, results which show no significant discrepancy from this hypothesis. (DOE 15)

Even where Fisher is berating Neyman for introducing the Type 2 error–he had no problem with type 1 errors, and both were fine in cases of estimation–Fisher falls into talk of actions, as Neyman points out (Neyman 1956, Triad).

“The worker’s real attitude in such a case might be, according to the circumstances:

(a) “the probable deviation from truth of my working hypothesis, to examine which the test is appropriate, seems not to be of sufficient magnitude to warrant any immediate modification.” (Fisher 1955, Triad, p. 73)

Pearson responds (1955) that this is entirely the type of interpretation they imagined to be associated with the bare mathematics of the test. And Neyman made it clear early on (though I didn’t discover it at first) that he intended “accept” to serve merely as a shorthand for “do not reject”. See this recent post, which includes links to all three papers in the “triad” (by Fisher, Neyman, and Pearson).

Here’s George Barnard (1972):

“In fact Fisher referred approvingly to the concept of the power curve of a test procedure and although he wrote: ‘On the whole the ideas (a) that a test of significance must be regarded as one of a series of similar tests applied to a succession of similar bodies of data, and (b) that the purpose of the test is to discriminate or ‘decide’ between two or more hypotheses, have greatly obscured their understanding’, he was careful to go on and add ‘when taken not as contingent possibilities but as elements essential to their logic’.” (129).

To see how Fisher links power to his own work early on, please check this post.

So we are back to the key question: what is the basis for Berger and Sellke (and others who follow similar lines of criticism) to allow error probabilities in the case of N-P significance tests and confidence intervals, and not in the case of P-values? It cannot be whether the method involves a rule for mapping outcomes to interpretations (be there two or three—the third might be N-P’s initial “remain undecided” or “get more data”), because we’ve just seen that to be true of Fisherian tests as well.

4. Fixing the type 1 error probability

But isn’t the issue that N-P tests fix the type 1 error probability in advance? Firstly, we must distinguish between fixing the P value threshold to be used in each application, and justifying tests solely by reference to a control of long run error (behavioral justification). So what about the first point of predesignating the threshold? Actually, this was more Fisher than N-P:

“Neyman and Pearson followed Fisher’s adoption of a fixed level,” Erich Lehmann tells us (Lehmann 1993, 1244). Lehmann is flummoxed by the accusation of fixed levels of significance since “[U]nlike Fisher, Neyman and Pearson (1933, p. 296) did not recommend a standard level but suggested that ‘how the balance [between the two kinds of error] should be struck must be left to the investigator’” (ibid.). From their earliest papers, they stressed that the tests were to be “used with discretion and understanding” depending on the context. Pearson made it clear that he thought it “irresponsible”, in a matter of importance, to distinguish rejections at the .025 or .05 level.[iv] (See this post.) And as we already saw, Lehmann (who developed N-P tests as decision rules) recommends reporting the attained P value.

In a famous passage,[v] Fisher (1956) raises the criticism—but without naming names:

A man who ‘rejects’ a hypothesis provisionally, as a matter of habitual practice, when the significance is at the 1% level or higher, will certainly be mistaken in not more than 1% of such decisions. For when the hypothesis is correct he will be mistaken in just 1% of these cases, and when it is incorrect he will never be mistaken in rejection….However, the calculation is absurdly academic, for in fact no scientific worker has a fixed level of significance at which from year to year, and in all circumstances, he rejects hypotheses; he rather gives his mind to each particular case in the light of his evidence and his ideas.

It is assumed he is speaking of N-P, or at least Neyman, but I wonder…

Anyway, the point is that the mathematics admits of different interpretations and uses. The “P values are not error rates” argument really boils down to claiming that the justification for using P-values inferentially is not merely that if you repeatedly did this you’d rarely erroneously interpret results (that’s necessary but not sufficient for the inferential warrant). That, of course, is what I (and others) have been arguing for ages—but I’d extend this to N-P significance tests and confidence intervals, at least in contexts of scientific inference. See, for example, Mayo and Cox (2006/2010), Mayo and Spanos (2006). We couldn’t even express the task of how to construe error probabilities inferentially if we could only use the term “error probabilities” to mean something justified only by behavioristic long-runs.
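The long-run property at issue here (if you repeatedly applied a fixed threshold, you would rarely reject a true hypothesis) can be checked with a small simulation. What follows is only a rough sketch under assumptions of my own choosing, none of which come from the papers discussed: a one-sided test of a Normal mean with known standard deviation 1, samples of size 25, a 1% threshold, and an illustrative alternative of 0.5. The function names are hypothetical.

```python
import random
from statistics import NormalDist


def p_value_one_sided(sample_mean, n, sigma=1.0, mu0=0.0):
    """One-sided P-value for H0: mu = mu0 against mu > mu0, sigma known."""
    z = (sample_mean - mu0) / (sigma / n ** 0.5)
    return 1.0 - NormalDist().cdf(z)


def rejection_rate(alpha, true_mu, n=25, trials=20000, seed=1):
    """Relative frequency of P-values at or below alpha over repeated samples."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(trials):
        # The mean of n draws from N(true_mu, 1) is distributed N(true_mu, 1/n).
        sample_mean = rng.gauss(true_mu, 1.0 / n ** 0.5)
        if p_value_one_sided(sample_mean, n) <= alpha:
            rejections += 1
    return rejections / trials


# Assumed settings for illustration only: 1% threshold, alternative mu = 0.5.
null_rate = rejection_rate(alpha=0.01, true_mu=0.0)
alt_rate = rejection_rate(alpha=0.01, true_mu=0.5)
print(f"rejection rate when H0 is true: {null_rate:.4f}")
print(f"rejection rate when mu = 0.5:   {alt_rate:.4f}")
```

The first figure should sit near the chosen threshold, which is all the behavioristic reading guarantees. The simulation says nothing about how a particular attained P-value should be interpreted inferentially, which is the question being pressed in this section.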

5. What about the Famous Blow-ups?

What about the big disagreement between Neyman and Fisher (Pearson is generally left out of it)? Well, I think that as hostilities between Fisher and Neyman heated up, the former got more and more evidential (and even fiducial) and the latter more and more behavioral. Still, what has made a lasting impression on people, understandably, are Fisher’s accusations that Neyman (if not Pearson) converted his tests into acceptance sampling devices, more suitable for making money in the U.S. or Russian 5 year plans, than thoughtful inference. (But remember Pearson’s and Neyman’s responses.) Imagine what he might have said about today’s infatuation with converting P value assessments to dichotomous outputs to compute science-wise error rates: Neyman on steroids.[vi]

By the way, it couldn’t have been too obvious that N-P distorted his tests, since Fisher tells us in 1955 that it was only when Barnard brought it to his attention that “despite agreeing mathematically in very large part”, there is a distinct philosophical position emphasized at least by Neyman. So it took like 20 years to realize this? (Barnard also told me this in person, recounted in this theater production.)

Here’s an enlightening passage from Cox (2006):

Neyman and Pearson “began an influential collaboration initially designed primarily, it would seem, to clarify Fisher’s writing. This led to their theory of testing hypotheses and to Neyman’s development of confidence intervals, aiming to clarify Fisher’s idea of fiducial intervals. As late as 1932 Fisher was writing to Neyman encouragingly about this work, but relations soured, notably when Fisher greatly disapproved of a paper of Neyman’s on experimental design and no doubt partly because their being in the same building at University College London brought them too close to one another!” (195)

Being in the same building, indeed! Recall Fisher declaring that if Neyman teaches in the same building and doesn’t use his book, he would oppose him in all things. See this post for details on some of their anger management problems.

The point is that it is absurd to base conceptions of inferential methods on personality disputes rather than the mathematical properties of tests (and their associated interpretations). These two approaches are best seen as offering clusters of tests appropriate for different contexts within the large taxonomy of tests and estimation methods. We can agree that the radical behavioristic rationale for error rates is not the rationale intended by Fisher in using P-values. I would argue it was not the rationale intended by Pearson, nor, much of the time, by Neyman. Yet we should be beyond worrying about what the founders really thought. It’s the methods, stupid.

Readers should not have to go through this “he said/we said” history again. Enough! Nor should they be misled into thinking there’s a deep inconsistency which renders all standard treatments invalid (by dint of using both N-P and Fisherian tests).

So has pure analytic philosophy, by clarifying terms (along with a bit of history of statistics), solved the apparent disagreement with Berger and Sellke (1987) and others?

It’s gotten us somewhere, yet there’s a big problem that remains. TO BE CONTINUED ON A NEW POST


Barnard, G. (1972). “Review of ‘The Logic of Statistical Inference’ by I. Hacking,” Brit. J. Phil. Sci. 23(2): 123-132.

Berger, J. O. and Sellke, T. (1987). “Testing a point null hypothesis: The irreconcilability of p values and evidence,” (with discussion). J. Amer. Statist. Assoc. 82: 112-139.

Casella, G. and Berger, R. (1987). “Reconciling Bayesian and Frequentist Evidence in the One-sided Testing Problem,” (with discussion). J. Amer. Statist. Assoc. 82: 106–111, 123–139.

Cox, D. R. (2006) Principles of Statistical Inference. Cambridge: Cambridge University Press.

Cox, D. R. & Hinkley, D. V. (1974). Theoretical Statistics, London, Chapman & Hall.

Fisher, R. A. (1926). “The Arrangement of Field Experiments”, J. of Ministry of Agriculture, Vol. XXXIII, 503-513.

Fisher, R. A. (1947). The Design of Experiments (4th Ed.) NY Hafner.

Fisher, R. A. (1955) “Statistical Methods and Scientific Induction,” Journal of The Royal Statistical Society (B) 17: 69-78.

Fisher, R. A. (1956). Statistical Methods and Scientific Inference. Hafner.

Gibbons, J. & Pratt, J. W. (1975). “P-values: Interpretation and Methodology”, The American Statistician 29: 20-25.

Lehmann, E. (1993). “The Fisher, Neyman-Pearson Theories of Testing Hypotheses: One Theory or Two?” J. Amer. Statist. Assoc., 88(424): 1242-1249.

Lehmann and Romano (2005) Testing Statistical Hypotheses (3rd ed.), New York: Springer.

Mayo, D. G. and Cox, D. R. (2006/2010). “Frequentist Statistics as a Theory of Inductive Inference,” Optimality: The Second Erich L. Lehmann Symposium (ed. J. Rojo), Lecture Notes-Monograph series, Institute of Mathematical Statistics (IMS), Vol. 49: 77-97.

Mayo, D. G. and Spanos, A. (2006). “Severe Testing as a Basic Concept in a Neyman-Pearson Philosophy of Induction,” British Journal for the Philosophy of Science, 57: 323-357.

Neyman, J. (1977) “Frequentist Probability and Frequentist Statistics,” Synthese 36: 97-131.

Neyman, J. (1956). “Note on an Article by Sir Ronald Fisher,” Journal of the Royal Statistical Society (B), 18:288-294.

Neyman, J. and Pearson, E.S. (1933). “On the Problem of the Most Efficient Tests of Statistical Hypotheses,” Philosophical Transactions of the Royal Society of London, (A), 231, 289-337.

Pearson, E.S. (1947), “The Choice of Statistical Tests Illustrated on the Interpretation of Data Classed in a 2×2 Table,” Biometrika 34 (1/2): 139-167.

Pearson, E. S. (1955). “Statistical Concepts in Their Relation to Reality,” Journal of the Royal Statistical Society, (B), 17: 204-207.


[i] With the usual inversions.

[ii] They add “but the expected P value given rejection is .025, an average understatement of the error rate by a factor of two.”

[iii] Neyman did put in a plug for developments in empirical Bayesian methods in his 1977 Synthese paper.

[iv] Pearson says,

“Were the action taken to be decided automatically by the side of the 5% level on which the observations fell, it is clear that [the precise p-value]…would here be of vital importance. But no responsible statistician, faced with an investigation of this character, would follow an automatic probability rule.” (1947, 192). See the post:


[v] From the Design of Experiments (DOE):

The test of significance (13):

“It is open to the experimenter to be more or less exacting in respect of the smallness of the probability he would require before he would be willing to admit that his observations have demonstrated a positive result. It is obvious that an experiment would be useless of which no possible result would satisfy him. Thus, if he wishes to ignore results having probabilities as high as 1 in 20–the probabilities being of course reckoned from the hypothesis that the phenomenon to be demonstrated is in fact absent –then it would be useless for him to experiment with only 3 cups of tea…. It is usual and convenient for the experimenters to take 5 per cent as a standard level of significance, in the sense that they are prepared to ignore all results which fail to reach this standard, and, by this means to eliminate from further discussion the greater part of the fluctuations which chance causes have introduced into their experimental results. …we thereby admit that no isolated experiment, however significant in itself, can suffice for the experimental demonstration of any natural phenomenon; for the ‘one chance in a million’ will undoubtedly occur, with no less and no more than its appropriate frequency, however surprised we may be that it should occur to us. In order to assert that a natural phenomenon is experimentally demonstrable we need, not an isolated record, but a reliable method of procedure. In relation to the test of significance we may say that a phenomenon is experimentally demonstrable when we know how to conduct an experiment which will rarely fail to give us a statistically significant result.” (emphasis added)

On 46-7 Fisher clarifies something people often confuse: it’s not the low probability of the event “rather to the fact, very near in this case, that the correctness of the assertion would entail an event of this low probability.”

[vi] It follows a paragraph criticizing Bayesians.




Categories: frequentist/Bayesian, J. Berger, P-values, Statistics | 31 Comments

Egon Pearson’s Heresy

Today is Egon Pearson’s birthday: 11 August 1895-12 June, 1980.
E. Pearson rejected some of the familiar tenets that have come to be associated with Neyman and Pearson (N-P) statistical tests, notably the idea that the essential justification for tests resides in a long-run control of rates of erroneous interpretations–what he termed the “behavioral” rationale of tests. In an unpublished letter E. Pearson wrote to Birnbaum (1974), he talks about N-P theory admitting of two interpretations: behavioral and evidential:

“I think you will pick up here and there in my own papers signs of evidentiality, and you can say now that we or I should have stated clearly the difference between the behavioral and evidential interpretations. Certainly we have suffered since in the way the people have concentrated (to an absurd extent often) on behavioral interpretations”.

(Nowadays, some people concentrate to an absurd extent on “science-wise error rates in dichotomous screening”.)

When Erich Lehmann, in his review of my “Error and the Growth of Experimental Knowledge” (EGEK 1996), called Pearson “the hero of Mayo’s story,” it was because I found in E.S.P.’s work, if only in brief discussions, hints, and examples, the key elements for an “inferential” or “evidential” interpretation of N-P statistics. Granted, these “evidential” attitudes and practices have never been explicitly codified to guide the interpretation of N-P tests. If they had been, I would not be on about providing an inferential philosophy all these years.[i] Nevertheless, “Pearson and Pearson” statistics (both Egon, not Karl) would have looked very different from Neyman and Pearson statistics, I suspect. One of the few sources of E.S. Pearson’s statistical philosophy is his (1955) “Statistical Concepts in Their Relation to Reality”. It begins like this: Continue reading

Categories: phil/history of stat, Philosophy of Statistics, Statistics | 2 Comments

Blog Contents: June and July 2014


Blog Contents: June and July 2014*

(6/5) Stephen Senn: Blood Simple? The complicated and controversial world of bioequivalence (guest post)

(6/9) “The medical press must become irrelevant to publication of clinical trials.”

(6/11) A. Spanos: “Recurring controversies about P values and confidence intervals revisited”

(6/14) “Statistical Science and Philosophy of Science: where should they meet?”

(6/21) Big Bayes Stories? (draft ii)

(6/25) Blog Contents: May 2014

(6/28) Sir David Hendry Gets Lifetime Achievement Award

(6/30) Some ironies in the ‘replication crisis’ in social psychology (4th and final installment) Continue reading

Categories: blog contents | Leave a comment

Winner of July Palindrome: Manan Shah


Winner of July 2014 Contest:

Manan Shah


Trap May Elba, Dr. of Fanatic. I fed naan, deli-oiled naan, deficit an affordable yam part.

The requirements: 

In addition to using Elba, a candidate for a winning palindrome must have used fanatic. An optional second word was: part. An acceptable palindrome with both words would best an acceptable palindrome with just fanatic.


Manan Shah is a mathematician and owner of Think. Plan. Do. LLC. He also maintains the “Math Misery?” blog. He holds a PhD in Mathematics from Florida State University.

Continue reading

Categories: Palindrome, Rejected Posts | Leave a comment

What did Nate Silver just say? Blogging the JSM 2013

Memory Lane: August 6, 2013. My initial post on JSM13 (8/5/13) was here.

Nate Silver gave his ASA Presidential talk to a packed audience (with questions tweeted[i]). Here are some quick thoughts—based on scribbled notes (from last night). Silver gave a list of 10 points that went something like this (turns out there were 11):

1. statistics are not just numbers

2. context is needed to interpret data

3. correlation is not causation

4. averages are the most useful tool

5. human intuitions about numbers tend to be flawed and biased

6. people misunderstand probability

7. we should be explicit about our biases and (in this sense) should be Bayesian?

8. complexity is not the same as not understanding

9. being in the in crowd gets in the way of objectivity

10. making predictions improves accountability Continue reading

Categories: Statistics, StatSci meets PhilSci | 3 Comments

Neyman, Power, and Severity

Jerzy Neyman: April 16, 1894-August 5, 1981. This reblogs posts under “The Will to Understand Power” & “Neyman’s Nursery” here & here.

Way back when, although I’d never met him, I sent my doctoral dissertation, Philosophy of Statistics, to one person only: Professor Ronald Giere. (And he would read it, too!) I knew from his publications that he was a leading defender of frequentist statistical methods in philosophy of science, and that he’d worked for a time with Birnbaum in NYC.

Some 15 years ago, Giere decided to quit philosophy of statistics (while remaining in philosophy of science): I think it had to do with a certain form of statistical exile (in philosophy). He asked me if I wanted his papers—a mass of work on statistics and statistical foundations gathered over many years. Could I make a home for them? I said yes. Then came his caveat: there would be a lot of them.

As it happened, we were building a new house at the time, Thebes, and I designed a special room on the top floor that could house a dozen or so file cabinets. (I painted it pale rose, with white lacquered book shelves up to the ceiling.) Then, for more than 9 months (same as my son!), I waited . . . Several boxes finally arrived, containing hundreds of files—each meticulously labeled with titles and dates. More than that, the labels were hand-typed! I thought, If Ron knew what a slob I was, he likely would not have entrusted me with these treasures. (Perhaps he knew of no one else who would actually want them!) Continue reading

Categories: Neyman, phil/history of stat, power, Statistics | 4 Comments

Blogging Boston JSM2014?



I’m not there. (Several people have asked, I guess because I blogged JSM13.) If you hear of talks (or anecdotes) of interest to error, please comment here (or twitter: @learnfromerror)

Categories: Announcement | 7 Comments

Roger Berger on Stephen Senn’s “Blood Simple” with a response by Senn (Guest posts)

Roger L. Berger

School Director & Professor
School of Mathematical & Natural Science
Arizona State University

Comment on S. Senn’s post: “Blood Simple? The complicated and controversial world of bioequivalence” (*)

First, I do agree with Senn’s statement that “the FDA requires conventional placebo-controlled trials of a new treatment to be tested at the 5% level two-sided but since they would never accept a treatment that was worse than placebo the regulator’s risk is 2.5% not 5%.” The FDA procedure essentially defines a one-sided test with Type I error probability (size) of .025. Why it is not just called this, I do not know. And if the regulators believe .025 is the appropriate Type I error probability, then perhaps it should be used in other situations, e.g., bioequivalence testing, as well.
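The arithmetic behind this point can be sketched in a few lines. This is only an illustration, and it assumes a test statistic that is standard Normal under the null hypothesis, which the comment itself does not specify; the variable names are hypothetical.

```python
from statistics import NormalDist

std_normal = NormalDist()  # assumed reference distribution under the null

two_sided_alpha = 0.05
# Two-sided cutoff: declare significance when |Z| exceeds this value.
cutoff = std_normal.inv_cdf(1 - two_sided_alpha / 2)

# Probability, under the null, of a significant result favoring the treatment.
upper_tail = 1 - std_normal.cdf(cutoff)
# Probability of a significant result in either direction.
both_tails = 2 * upper_tail

print(f"cutoff = {cutoff:.3f}")
print(f"P(Z > cutoff) = {upper_tail:.4f}")
print(f"P(|Z| > cutoff) = {both_tails:.4f}")
```

Under that assumption, only the upper tail can lead to approval, so the chance of wrongly approving a treatment no better than placebo is half the nominal two-sided level.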

Senn refers to a paper by Hsu and me (Berger and Hsu (1996)), and then attempts to characterize what we said. Unfortunately, I believe he has mischaracterized. Continue reading

Categories: bioequivalence, frequentist/Bayesian, PhilPharma, Statistics | 22 Comments

S. Senn: “Responder despondency: myths of personalized medicine” (Guest Post)

Stephen Senn
Head, Methodology and Statistics Group
Competence Center for Methodology and Statistics (CCMS)

Responder despondency: myths of personalized medicine

The road to drug development destruction is paved with good intentions. The 2013 FDA report, Paving the Way for Personalized Medicine, has an encouraging and enthusiastic foreword from Commissioner Hamburg and plenty of extremely interesting examples stretching back decades. Given what the report shows can be achieved on occasion, given the enthusiasm of the FDA and its commissioner, given the amazing progress in genetics emerging from the labs, a golden future of personalized medicine surely awaits us. It would be churlish to spoil the party by sounding a note of caution but I have never shirked being churlish and that is exactly what I am going to do. Continue reading

Categories: evidence-based policy, Statistics, Stephen Senn | 49 Comments

Continued: “P-values overstate the evidence against the null”: legit or fallacious?



Categories: Bayesian/frequentist, CIs and tests, fallacy of rejection, highly probable vs highly probed, P-values, Statistics | 39 Comments

“P-values overstate the evidence against the null”: legit or fallacious? (revised)

0. July 20, 2014: Some of the comments to this post reveal that using the word “fallacy” in my original title might have encouraged running together the current issue with the fallacy of transposing the conditional. Please see a newly added Section 7.

Continue reading

Categories: Bayesian/frequentist, CIs and tests, fallacy of rejection, highly probable vs highly probed, P-values, Statistics | 71 Comments

Higgs discovery two years on (2: Higgs analysis and statistical flukes)

I’m reblogging a few of the Higgs posts, with some updated remarks, on this two-year anniversary of the discovery. (The first was in my last post.) The following was originally “Higgs Analysis and Statistical Flukes: part 2” (from March, 2013).[1]

Some people say to me: “This kind of reasoning is fine for a ‘sexy science’ like high energy physics (HEP)”–as if their statistical inferences are radically different. But I maintain that this is the mode by which data are used in “uncertain” reasoning across the entire landscape of science and day-to-day learning (at least, when we’re trying to find things out).[2] Even with high level theories, the particular problems of learning from data are tackled piecemeal, in local inferences that afford error control. Granted, this statistical philosophy differs importantly from those that view the task as assigning comparative (or absolute) degrees-of-support/belief/plausibility to propositions, models, or theories. Continue reading

Categories: Higgs, highly probable vs highly probed, P-values, Severity, Statistics | 13 Comments

Higgs Discovery two years on (1: “Is particle physics bad science?”)


July 4, 2014 was the two-year anniversary of the Higgs boson discovery. As the world was celebrating the “5 sigma!” announcement, and we were reading about the statistical aspects of this major accomplishment, I was aghast to be emailed a letter, purportedly instigated by Bayesian Dennis Lindley, through Tony O’Hagan (to the ISBA). Lindley, according to this letter, wanted to know:

“Are the particle physics community completely wedded to frequentist analysis?  If so, has anyone tried to explain what bad science that is?”

Fairly sure it was a joke, I posted it on my “Rejected Posts” blog for a bit until it checked out [1]. (See O’Hagan’s “Digest and Discussion”) Continue reading

Categories: Bayesian/frequentist, fallacy of non-significance, Higgs, Lindley, Statistics | 4 Comments

Winner of June Palindrome Contest: Lori Wike



Winner of June 2014 Palindrome Contest: Second* Time Winner! Lori Wike

*Her April win is here


Parsec? I overfit omen as Elba sung “I err on! Oh, honor reign!” Usable, sane motif revoices rap.

The requirement: A palindrome with Elba plus overfit. (The optional second word: “average” was not needed to win.)


Lori Wike is principal bassoonist of the Utah Symphony and is on the faculty of the University of Utah and Westminster College. She holds a Bachelor of Music degree from the Eastman School of Music and a Master of Arts degree in Comparative Literature from UC-Irvine.

Continue reading

Categories: Announcement, Palindrome | Leave a comment

Some ironies in the ‘replication crisis’ in social psychology (4th and final installment)

There are some ironic twists in the way social psychology is dealing with its “replication crisis”, and they may well threaten even the most sincere efforts to put the field on firmer scientific footing–precisely in those areas that evoked the call for a “daisy chain” of replications. Two articles, one from the Guardian (June 14), and a second from The Chronicle of Higher Education (June 23) lay out the sources of what some are calling “Repligate”. The Guardian article is “Physics Envy: Do ‘hard’ sciences hold the solution to the replication crisis in psychology?”

The article in the Chronicle of Higher Education also gets credit for its title: “Replication Crisis in Psychology Research Turns Ugly and Odd”. I’ll likely write this in installments…(2nd, 3rd, 4th)


The Guardian article answers yes to the question “Do ‘hard’ sciences hold the solution”:

Psychology is evolving faster than ever. For decades now, many areas in psychology have relied on what academics call “questionable research practices” – a comfortable euphemism for types of malpractice that distort science but which fall short of the blackest of frauds, fabricating data.
Continue reading

Categories: junk science, science communication, Statistical fraudbusting, Statistics | 53 Comments

Sir David Hendry Gets Lifetime Achievement Award

Sir David Hendry, Professor of Economics at the University of Oxford [1], was given the Celebrating Impact Lifetime Achievement Award on June 8, 2014. Professor Hendry presented his automatic model selection program (Autometrics) at our conference, Statistical Science and Philosophy of Science (June, 2010). (Site is here.) I’m posting an interesting video and related links. I invite comments on the paper Hendry published, “Empirical Economic Model Discovery and Theory Evaluation,” in our special volume of Rationality, Markets, and Morals (abstract below). [2]

One of the world’s leading economists, INET Oxford’s Prof. Sir David Hendry received a unique award from the Economic and Social Research Council (ESRC)…
Continue reading

Categories: David Hendry, StatSci meets PhilSci | Leave a comment

Blog Contents: May 2014


May 2014

(5/1) Putting the brakes on the breakthrough: An informal look at the argument for the Likelihood Principle

(5/3) You can only become coherent by ‘converting’ non-Bayesianly

(5/6) Winner of April Palindrome contest: Lori Wike

(5/7) A. Spanos: Talking back to the critics using error statistics (Phil6334)

(5/10) Who ya gonna call for statistical Fraudbusting? R.A. Fisher, P-values, and error statistics (again)

(5/15) Scientism and Statisticism: a conference* (i) Continue reading

Categories: blog contents, Metablog, Statistics | Leave a comment

Big Bayes Stories? (draft ii)

“Wonderful examples, but let’s not close our eyes,” is David J. Hand’s apt title for his discussion of the recent special issue (Feb 2014) of Statistical Science called “Big Bayes Stories” (edited by Sharon McGrayne, Kerrie Mengersen and Christian Robert). For your Saturday night/weekend reading, here are excerpts from Hand, another discussant (Welsh), scattered remarks of mine, along with links to papers and background. I begin with David Hand:

[The papers in this collection] give examples of problems which are well-suited to being tackled using such methods, but one must not lose sight of the merits of having multiple different strategies and tools in one’s inferential armory. (Hand [1])

…. But I have to ask, is the emphasis on ‘Bayesian’ necessary? That is, do we need further demonstrations aimed at promoting the merits of Bayesian methods? … The examples in this special issue were selected, firstly by the authors, who decided what to write about, and then, secondly, by the editors, in deciding the extent to which the articles conformed to their desiderata of being Bayesian success stories: that they ‘present actual data processing stories where a non-Bayesian solution would have failed or produced sub-optimal results.’ In a way I think this is unfortunate. I am certainly convinced of the power of Bayesian inference for tackling many problems, but the generality and power of the method is not really demonstrated by a collection specifically selected on the grounds that this approach works and others fail. To take just one example, choosing problems which would be difficult to attack using the Neyman-Pearson hypothesis testing strategy would not be a convincing demonstration of a weakness of that approach if those problems lay outside the class that that approach was designed to attack.

Hand goes on to make a philosophical assumption that might well be questioned by Bayesians: Continue reading

Categories: Bayesian/frequentist, Honorary Mention, Statistics | 62 Comments
