Monthly Archives: April 2017

S. Senn: “Automatic for the people? Not quite” (Guest post)

Stephen Senn

Stephen Senn
Head of  Competence Center for Methodology and Statistics (CCMS)
Luxembourg Institute of Health
Twitter @stephensenn

Automatic for the people? Not quite

What caught my eye was the estimable (in its non-statistical meaning) Richard Lehman tweeting about the equally estimable John Ioannidis. For those who don’t know them, the former is a veteran blogger who keeps a very cool and shrewd eye on the latest medical ‘breakthroughs’ and the latter a serial iconoclast of idols of scientific method. This is what Lehman wrote

Ioannidis hits 8 on the Richter scale: … Bayes factors consistently quantify strength of evidence, p is valueless.

Since Ioannidis works at Stanford, which is located in the San Francisco Bay Area, he has every right to be interested in earthquakes but on looking up the paper in question, a faint tremor is the best that I can afford it. I shall now try and explain why, but before I do, it is only fair that I acknowledge the very generous, prompt and extensive help I have been given to understand the paper[1] in question by its two authors Don van Ravenzwaaij and Ioannidis himself.

What van Ravenzwaaij and Ioannidis (R&I) have done is investigate the FDA’s famous two trials rule as a requirement for drug registration. To do this R&I simulated two-armed parallel group clinical trials according to the following combinations of scenarios (p4).

Thus, to sum up, our simulations varied along the following dimensions:
1. Effect size: small (0.2 SD), medium (0.5 SD), and zero (0 SD)
2. Number of total trials: 2, 3, 4, 5, and 20
3. Number of participants: 20, 50, 100, 500, and 1,000

The first setting defines the treatment effect in terms of common within-group standard deviations, the second the total number of trials submitted to the FDA with exactly two of them significant and the third the number of patients per group.

They thus had 3 x 5 x 5 = 75 simulation settings in total. In each case the simulations were run until 500 cases arose for which two trials were significant. For each of these cases they calculated a one-sided Bayes factor and then proceeded to judge the FDA’s rule based on P-values according to the value the Bayes factor indicated.
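As a rough sketch of this kind of simulation (this is not R&I's code; the t-test, the seed, and the choice of the medium effect with 100 patients per group are my illustrative picks from their grid), one cell might be simulated like this:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

def simulate_programme(effect_size, n_per_group, n_trials):
    """Simulate one programme of two-armed parallel-group trials and
    return how many are significant at the two-sided 5% level (and in
    the right direction)."""
    significant = 0
    for _ in range(n_trials):
        control = rng.normal(0.0, 1.0, n_per_group)
        treated = rng.normal(effect_size, 1.0, n_per_group)
        t, p = stats.ttest_ind(treated, control)
        if p < 0.05 and t > 0:
            significant += 1
    return significant

# Draw programmes until 500 satisfy 'two out of two significant',
# mirroring in miniature the selection that R&I condition on.
hits = 0
draws = 0
while hits < 500:
    draws += 1
    if simulate_programme(0.5, 100, 2) == 2:
        hits += 1

print(hits, draws)
```

Conditioning on the two significant results, as R&I do, means discarding every programme that fails the rule, which is the selection effect discussed below.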

In my opinion this is a hopeless mishmash of two systems: the first (frequentist) conditional on the hypotheses and the second (Bayesian) conditional on the data. They cannot be mixed to any useful purpose in the way attempted and the result is not only irrelevant frequentist statistics but irrelevant Bayesian statistics too.

Before proceeding to discuss the inferential problems, however, I am going to level a further charge of irrelevance as regards the simulations. It is true that the ‘two trials rule’ is rather vague in that it is not clear how many trials one is allowed to run to get two significant ones. In my opinion it is reasonable to consider that the FDA might accept two out of three but it is frankly incredible that they would accept two out of twenty unless there were further supporting evidence. For example, if the two large trials were significant and the 18 smaller ones were not individually significant but were significant as a set in a meta-analysis, one could imagine the programme passing. Even this scenario, however, is most unlikely and I would be interested to know of any case of any sort in which the FDA has accepted a ‘two out of twenty’ registration.

Now let us turn to the mishmash. Let us look, first of all, at the set-up in frequentist terms. The simplest common case to take is the ‘two out of two’ significant scenario. Sponsors going into phase III will typically perform calculations to target at least 80% power for the programme as a whole. Thus 90% power for individual trials is a common standard since the product of the powers is just over 80%. For the two effect sizes of 0.2 and 0.5 that R&I consider this would, according to nQuery®, yield 527 and 86 patients per arm respectively. The overall power of the programme would be 81% and the joint two-sided type I error rate would be 2 x (1/40)² = 1/800, reflecting the fact that each of two two-sided tests would have to be significant at the 5% level but in the right direction.
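The normal-approximation version of this calculation can be checked in a few lines; nQuery's exact t-based answers (527 and 86) come out one patient per arm higher than the approximation below:

```python
import math
from scipy.stats import norm

def n_per_arm(delta, alpha=0.05, power=0.90):
    """Normal-approximation sample size per arm for a two-sample
    comparison with common SD 1 and two-sided level alpha."""
    z_a = norm.isf(alpha / 2)    # 1.96 for alpha = 0.05
    z_b = norm.isf(1 - power)    # 1.2816 for 90% power
    return math.ceil(2 * ((z_a + z_b) / delta) ** 2)

print(n_per_arm(0.2))      # 526 (nQuery's exact t-based answer: 527)
print(n_per_arm(0.5))      # 85  (nQuery: 86)
print(0.9 ** 2)            # joint power of the programme, just over 0.81
print(2 * (1 / 40) ** 2)   # joint two-sided type I error rate, 1/800
```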

Now, of course, these are planned characteristics in advance of running a trial. In practice you will get a result and then, in the spirit of what R&I are attempting, it would be of interest to consider the least impressive result that would just give you registration. This, of course, is P=0.05 for each of the two trials. At this point, by the way, I note that a standard frequentist objection can be entered to the two-trials rule. If the designs of two trials are identical, then given that they are of the same size, the sufficient statistic is simply the average of the two results. If conducted simultaneously there would be no reason not to use this. This leads to a critical region for a more powerful test based on the average result from the two providing a 1/1600 type I error rate (one-sided), illustrated in the figure below, that is to the right of and above the blue diagonal line. The corresponding region for the two-trials rule is to the right of the vertical red line and above the horizontal one. The just ‘significant’ value for the two-trials rule has a standardised z-score of 1.96 x √2 = 2.77, whereas the rule based on the average from the two trials would have a critical z-score of 3.23. In other words, evidentially, the value according to the two-trials rule is less impressive[2].
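The z-scores quoted above are easy to verify under the normal approximation:

```python
import math
from scipy.stats import norm

# z for a single two-sided P = 0.05 result in the right direction
z_single = norm.isf(0.025)                      # 1.96

# The average of two just-significant trials, standardised
z_average_of_two = z_single * math.sqrt(2)      # 2.77

# One-sided type I error rate of the 'two out of two' rule
alpha_two_trials = 0.025 ** 2                   # 1/1600

# Critical z for a single test on the average at the same rate
z_critical_average = norm.isf(alpha_two_trials) # about 3.23

print(round(z_average_of_two, 2), round(z_critical_average, 2))
```

The gap between 2.77 and 3.23 is the evidential shortfall of the two-trials rule relative to the average-based test.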

Now, the Bayesian will argue that the frequentist is controlling the behaviour of the procedure if one of two possible realities out of a whole range applies, but has given no prior thought to their likely occurrence or, for that matter, to the occurrence of other values. If, for example, moderate effect sizes are very unlikely, but it is quite plausible that the treatment has no effect at all and the trials are very large, then even though their satisfying the two-trials rule would be a priori unlikely, if it was only minimally satisfied, it might actually imply that the null hypothesis was likely true.

A possible way for the Bayesian to assess the evidential value is to assume, just for argument’s sake, that the null hypothesis and the set of possible alternative hypotheses are equally likely a priori (the prior odds are one) and then calculate the posterior probability and hence odds given the observed data. The ratio of the posterior odds to the prior odds is known as the Bayes factor[3]. Citing a paper[4] by Rouder et al describing this approach R&I then use the BayesFactor package created by Morey and Rouder to calculate the Bayes factor corresponding to every case of two significant trials they generate.

Actually it is not the Bayes factor but a Bayes factor. As Morey and Rouder make admirably clear in a subsequent paper[5], what the Bayes factor turns out to be depends very much on how the probability is smeared over the range of the alternative hypothesis. This can perhaps be understood by looking at the ratios of likelihoods (relative to the value under the null) when P=0.05 for each of the two trials as a function of the true (unknown) effect size for the sample sizes of 527 and 86 that would give 90% power for the values of the effect sizes (0.2 and 0.5) that R&I consider. The logs of these (chosen to make plotting easier) are given in the figure below. The blue curve corresponds to the smaller effect size used in planning (0.2) and hence the larger sample size (527) and the red curve corresponds to the larger effect size (0.5) and hence the smaller sample size (86). Given the large number of degrees of freedom available, the Normal distribution likelihoods have been used. The values of the statistic that would be just significant at the 5% level (0.1207 and 0.2989) for the two cases are given by the vertical dashed lines and, since these are the values that we assume observed in the two cases, each curve reaches its respective maximum at the relevant value.

Wherever a value on the curve is positive, the ratio of likelihoods is greater than one and the posited value of the effect size is supported against the null. Wherever it is negative, the ratio is less than one and the null is supported. Thus, whether the posited values of the treatment effect that make up the alternative are supported as a set or not depends on how you smear the prior probability. The Bayes factor is the ratio of the prior-weighted integrals of the likelihoods under the two hypotheses. In this case the likelihood under the null is a constant, so the conditional prior under the alternative is crucial. There is no automatic solution and careful choice is necessary. So what are you supposed to do? Well, as a Bayesian you are supposed to choose a prior distribution that reflects what you believe. At this point, I want to make it quite clear that if you think you can do it you should do so and I don’t want to argue against that. However, this is really hard and it has serious consequences[6]. Suppose that the sample size of 527 has been used, corresponding to the blue curve. Then any value of the effect size greater than 0 and less than 2 x 0.1207 = 0.2414 has more support than the null hypothesis itself, but any value more than 0.2414 is less supported than the null. How this pans out in your Bayes factor now depends on your prior distribution. If your prior maintains that all possible values of the effect size when the alternative hypothesis is true must be modest (say never greater than 0.2414), then they are all supported and so is the set. On the other hand, if you think that unless the null hypothesis is true, only values greater than 0.2414 are possible, then all such values are unsupported and so is the set. In general, the way the conditional prior smears the probability is crucial.
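The crossover described here can be checked numerically. The sketch below (my own illustration, using the Normal likelihoods as in the figure) reproduces the just-significant observed effects (0.1207 and 0.2989) and confirms that support for the alternative flips sign at exactly twice the observed effect:

```python
import numpy as np
from scipy.stats import norm

def log_lr(delta, n_per_arm, z=1.96):
    """Log likelihood ratio (posited effect delta vs the null) for a
    trial that is just significant: observed difference d = z * SE."""
    se = np.sqrt(2 / n_per_arm)   # SE of the difference, common SD = 1
    d = z * se                    # just-significant observed effect
    return (norm.logpdf(d, loc=delta, scale=se)
            - norm.logpdf(d, loc=0.0, scale=se))

d_527 = 1.96 * np.sqrt(2 / 527)   # 0.1207, the blue curve's maximum
d_86 = 1.96 * np.sqrt(2 / 86)     # 0.2989, the red curve's maximum
print(round(d_527, 4), round(d_86, 4))

# Support switches sign at exactly twice the observed effect:
print(log_lr(2 * d_527 - 1e-6, 527) > 0)   # just inside: supported
print(log_lr(2 * d_527 + 1e-6, 527) < 0)   # just outside: null favoured
```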

Be that as it may, I doubt that choosing ‘a Cauchy distribution with a width of r = √2/2’ as R&I did would stand any serious scrutiny. Bear in mind that these are molecules that have passed a series of in vitro and in vivo pre-clinical screens as well as phase I, IIa and IIb before being put to the test in phase III. However, if R&I were serious about this, they would consider how well the distribution works as a prediction as to what actually happens in phase III and examine some data.
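To see how much hangs on that choice of width, here is a simplified one-sided Bayes factor computed by numerical integration. This is not the BayesFactor package's exact t-based calculation: it uses a Normal approximation to the likelihood and a half-Cauchy prior on the standardised effect, which is enough to show the sensitivity.

```python
import numpy as np
from scipy import integrate, stats

def bf10_one_sided(z_obs, n_per_arm, r):
    """Sketch of a one-sided Bayes factor: half-Cauchy(0, r) prior on
    the standardised effect size, Normal approximation to the
    likelihood of the observed z-statistic."""
    d_obs = z_obs * np.sqrt(2 / n_per_arm)   # observed effect size
    def integrand(delta):
        prior = 2 * stats.cauchy.pdf(delta, 0, r)   # folded to delta > 0
        ncp = delta * np.sqrt(n_per_arm / 2)        # mean of z given delta
        return prior * stats.norm.pdf(z_obs, loc=ncp)
    # The likelihood kills the prior's heavy tail well before delta = 5
    marginal, _ = integrate.quad(integrand, 0, 5, points=[d_obs], limit=200)
    return marginal / stats.norm.pdf(z_obs, loc=0.0)

# A just-significant large trial (z = 1.96, n = 527 per arm) under
# R&I's width sqrt(2)/2 and under a much narrower prior:
print(bf10_one_sided(1.96, 527, np.sqrt(2) / 2))   # near or below 1
print(bf10_one_sided(1.96, 527, 0.1))              # several times larger
```

With the wide Cauchy the just-significant large trial yields a Bayes factor near or below one, while a prior concentrated on modest effects supports the alternative severalfold: the prior-smearing sensitivity described above, in numbers.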

Instead, they assume (as far as I can tell) that the Bayes factor they calculate in this way is some sort of automatic gold standard by which any other inferential statistic can and should be judged, whether or not the distribution on which the Bayes factor is based is reasonable. This is reflected in Richard Lehman’s tweet ‘Bayes factors consistently quantify strength of evidence’ which, in fact, needs to be rephrased: ‘Bayes factors coherently quantify strength of evidence for You if You have chosen coherent prior distributions to construct them.’ It’s a big if.

R&I then make a second mistake of simultaneously conditioning on a result and a hypothesis. Suppose their claim is correct that in each of the cases of two significant trials that they generate the FDA would register the drug without further consideration. Then, for the first two of the three cases ‘Effect size: small (0.2 SD), medium (0.5 SD), and zero (0 SD)’ the FDA has got it right and for the third it has got it wrong. By the same token, wherever any decision based on the Bayes factor would disagree with the FDA it would be wrong in the first two cases and right in the third. However, this is completely useless information. It can’t help us decide between the two approaches. If we want to use true posited values of the effect size, we have to consider all possible outcomes for the two-trials rule, not just the ones that indicate ‘register’. For the cases that indicate ‘register’, it is a foregone conclusion that we will have 100% success (in terms of decision-making) in the first two cases and 100% failure in the third. What we need to consider also is the situation where it is not the case that two trials are significant.

If, on the other hand, R&I wish to look at this in Bayesian terms, then they have also picked this up the wrong way. If they are committed to their adopted prior distribution, then once they have calculated the Bayes factor there is no more to be said, and if they simulate from the prior distribution they have adopted, then their decision-making will, as judged by the simulation, turn out to be truly excellent. If they are not committed to the prior distribution, then they are faced with the sore puzzle that is Bayesian robustness. How far can the prior distribution from which one simulates be from the prior distribution one assumes for inference in order for the simulation to be a) a severe test but b) not totally irrelevant?

In short the R&I paper, in contradistinction to Richard Lehman’s claim, tells us nothing about the reasonableness of the FDA’s rule. That would require an analysis of data. Automatic for the people? Not quite. To be Bayesian, ‘to thine own self be true’. However, as I have put it previously, this is very hard and ‘You may believe you are a Bayesian but you are probably wrong’[7].


I am grateful to Don van Ravenzwaaij and John Ioannidis for helpful correspondence and to Andy Grieve for helpful comments. My research on inference for small populations is carried out in the framework of the IDEAL project and supported by the European Union’s Seventh Framework Programme for research, technological development and demonstration under Grant Agreement no 602552.



  1. van Ravenzwaaij, D. and J.P. Ioannidis, A simulation study of the strength of evidence in the recommendation of medications based on two trials with statistically significant results. PLoS One, 2017. 12(3): p. e0173184.
  2. Senn, S.J., Statistical Issues in Drug Development. Statistics in Practice. 2007, Hoboken: Wiley. 498.
  3. O’Hagan, A., Bayes factors. Significance, 2006(4): p. 184-186.
  4. Rouder, J.N., et al., Bayesian t tests for accepting and rejecting the null hypothesis. Psychonomic bulletin & review, 2009. 16(2): p. 225-237.
  5. Morey, R.D., J.-W. Romeijn, and J.N. Rouder, The philosophy of Bayes factors and the quantification of statistical evidence. Journal of Mathematical Psychology, 2016. 72: p. 6-18.
  6. Grieve, A.P., Discussion of Piegorsch and Gladen (1986). Technometrics, 1987. 29(4): p. 504-505.
  7. Senn, S.J., You may believe you are a Bayesian but you are probably wrong. Rationality, Markets and Morals, 2011. 2: p. 48-66.


Categories: Bayesian/frequentist, Error Statistics, S. Senn

The Fourth Bayesian, Fiducial and Frequentist Workshop (BFF4): Harvard U


May 1-3, 2017
Hilles Event Hall, 59 Shepard St. MA

The Department of Statistics is pleased to announce the 4th Bayesian, Fiducial and Frequentist Workshop (BFF4), to be held on May 1-3, 2017 at Harvard University. The BFF workshop series celebrates foundational thinking in statistics and inference under uncertainty. The three-day event will present talks, discussions and panels that feature statisticians and philosophers whose research interests synergize at the interface of their respective disciplines. Confirmed featured speakers include Sir David Cox and Stephen Stigler.

The program will open with a featured talk by Art Dempster and discussion by Glenn Shafer. The featured banquet speaker will be Stephen Stigler. Confirmed speakers include:

Featured Speakers and Discussants: Arthur Dempster (Harvard); Cynthia Dwork (Harvard); Andrew Gelman (Columbia); Ned Hall (Harvard); Deborah Mayo (Virginia Tech); Nancy Reid (Toronto); Susanna Rinard (Harvard); Christian Robert (Paris-Dauphine/Warwick); Teddy Seidenfeld (CMU); Glenn Shafer (Rutgers); Stephen Senn (LIH); Stephen Stigler (Chicago); Sandy Zabell (Northwestern)

Invited Speakers and Panelists: Jim Berger (Duke); Emery Brown (MIT/MGH); Larry Brown (Wharton); David Cox (Oxford; remote participation); Paul Edlefsen (Hutch); Don Fraser (Toronto); Ruobin Gong (Harvard); Jan Hannig (UNC); Alfred Hero (Michigan); Nils Hjort (Oslo); Pierre Jacob (Harvard); Keli Liu (Stanford); Regina Liu (Rutgers); Antonietta Mira (USI); Ryan Martin (NC State); Vijay Nair (Michigan); James Robins (Harvard); Daniel Roy (Toronto); Donald B. Rubin (Harvard); Peter XK Song (Michigan); Gunnar Taraldsen (NUST); Tyler VanderWeele (HSPH); Vladimir Vovk (London); Nanny Wermuth (Chalmers/Gutenberg); Min-ge Xie (Rutgers)

Deadline for poster submission has been extended to Apr 25.  Click here to submit a poster abstract!  

For questions, please email


Visit the links for the conference schedule by day, or view the whole schedule below.

May 1

May 2

May 3

Registration is required–please click here to register!

Early bird rates available through Tuesday, April 18th at 4:00 pm.

Harvard ID holders can attend daytime events for free, but they must register and pay the registration fee to attend the banquet.

Attendee parking is available at your own expense. Please contact Madeleine for the link and the code.

Monday, May 1

8:00 am – 8:45 am Registration

8:45 am – 9:00 am Opening Remarks, Xiao-Li Meng, Harvard University

9:00 am – 10:15 am Featured Discussion: What Bayes did, and (more to my point) what Bayes did not do
Speaker: Arthur Dempster, Harvard University
Discussant: Glenn Shafer, Rutgers University

10:15 am – 10:30 am Coffee Break

10:30 am – 12:00 noon Invited Session
Ryan Martin, North Carolina State University, “Confidence, probability, and plausibility”
Jan Hannig, University of North Carolina Chapel Hill, “Generalized Fiducial Inference: Current Challenges”
Nanny Wermuth, Chalmers University, “Characterising model classes by prime graphs and by statistical properties”

12:00 noon – 1:30 pm Poster Session with Lunch

1:30 pm – 2:45 pm Featured Discussion: Using rates of incoherence to refresh some old “foundational” debates
Speaker: Teddy Seidenfeld, Carnegie Mellon University
Discussant: Christian Robert, University of Warwick/Paris-Dauphine

2:45 pm – 3:00 pm Coffee Break

3:00 pm – 4:00 pm Invited Session
Alfred Hero, University of Michigan, “Continuum limits of shortest paths”
Daniel Roy, University of Toronto, “On Extended Admissible Procedures and their Nonstandard Bayes Risk”

4:00 pm – 5:30 pm Panel: Views from the Rising Stars
Panelists: Ruobin Gong, Harvard University; Jan Hannig, University of North Carolina Chapel Hill; Keli Liu, Stanford University; Ryan Martin, North Carolina State; Tyler VanderWeele, Harvard TH Chan School of Public Health
Moderator: Pierre Jacob, Harvard University

7:00 pm Evening Banquet
Speaker: Stephen Stigler, University of Chicago

Tuesday, May 2

9:00 am – 10:15 am Featured Discussion: The Secret Life of I.J. Good
Speaker: Sandy Zabell, Northwestern University
Discussant: Cynthia Dwork, Harvard University

10:15 am – 10:30 am Coffee Break

10:30 am – 12:00 noon Invited Session
Vladimir Vovk, University of London, “Nonparametric predictive distributions”
Don Fraser, University of Toronto, “Distributions for theta: Validity and Risks”
Antonietta Mira, Universita della Svizzera Italiana, “Deriving Bayesian and frequentist estimators from time-invariance estimating equations: a unifying approach”

12:00 noon – 1:30 pm Break for Lunch

1:30 pm – 2:45 pm Featured Discussion: BFF Four–Are We Converging?
Speaker: Nancy Reid, University of Toronto
Discussant: Deborah Mayo, Virginia Tech

2:45 pm – 3:00 pm Coffee Break

3:00 pm – 4:00 pm Invited Session
James M. Robins, Harvard TH Chan School of Public Health, “Counterexamples to Bayesian, Pure-Likelihoodist, and Conditional Inference in Biased-Coin Randomized Experiments and Observational Studies: Implications for Foundations and for Practice”
Larry Brown, Wharton School of the University of Pennsylvania

4:00 pm – 5:30 pm Panel: Perspectives of the Pioneers
Panelists: Jim Berger, Duke University; Larry Brown, Wharton School of the University of Pennsylvania; David Cox, Oxford University via remote participation; Don Fraser, Toronto University; Nancy Reid, Toronto University
Moderator: Vijay Nair, University of Michigan

Wednesday, May 3

9:00 am – 10:15 am Featured Discussion: Randomisation isn’t perfect but doing better is harder than you think
Speaker: Stephen Senn, Luxembourg Institute of Health
Discussant: Ned Hall, Harvard University

10:15 am – 10:30 am Coffee Break

10:30 am – 12:00 noon Invited Session
Jim Berger, Duke University, “An Objective Prior for Hyperparameters in Normal Hierarchical Models”
Harry Crane, Rutgers University
Peter Song, University of Michigan, “Confidence Distributions with Estimating Functions: Efficiency and Computing on Spark Platform”

12:00 noon – 1:30 pm Break for Lunch

1:30 pm – 2:45 pm Featured Discussion: Modeling Imprecise Degrees of Belief
Speaker: Susanna Rinard, Harvard University
Discussant: Andrew Gelman, Columbia University

2:45 pm – 3:00 pm Coffee Break

3:00 pm – 4:00 pm Invited Session
Nils Lid Hjort, University of Oslo, “Data Fusion with Confidence Distributions: The II-CC-FF Paradigm”
Gunnar Taraldsen, Norwegian University of Science and Technology, “Improper priors and fiducial inference”

4:00 pm – 5:30 pm Panel: The Scientific Impact of Foundational Thinking
Panelists: Emery Brown, Massachusetts Institute of Technology and Massachusetts General Hospital; Paul Edlefsen, Hutch; Andrew Gelman, Columbia University; Regina Liu, Rutgers University; Donald B. Rubin, Harvard University
Moderator: Min-ge Xie, Rutgers University

Previous BFF Workshops:
BFF3 (Rutgers), BFF2 (East China Normal), and BFF1 (East China Normal)


Categories: Announcement, Bayesian/frequentist

Jerzy Neyman and “Les Miserables Citations” (statistical theater in honor of his birthday)


Neyman April 16, 1894 – August 5, 1981

For my final Jerzy Neyman item, here’s the post I wrote for his birthday last year: 

A local acting group is putting on a short theater production based on a screenplay I wrote: “Les Miserables Citations” (“Those Miserable Quotes”) [1]. The “miserable” citations are those everyone loves to cite from Neyman and Pearson’s early joint 1933 paper:

We are inclined to think that as far as a particular hypothesis is concerned, no test based upon the theory of probability can by itself provide any valuable evidence of the truth or falsehood of that hypothesis.

But we may look at the purpose of tests from another viewpoint. Without hoping to know whether each separate hypothesis is true or false, we may search for rules to govern our behavior with regard to them, in following which we insure that, in the long run of experience, we shall not be too often wrong. (Neyman and Pearson 1933, pp. 290-1).

In this early paper, Neyman and Pearson were still groping toward the basic concepts of tests–for example, “power” had yet to be coined. Taken out of context, these quotes have led to knee-jerk (behavioristic) interpretations which neither Neyman nor Pearson would have accepted. What was the real context of those passages? Well, the paper opens, just five paragraphs earlier, with a discussion of a debate between two French probabilists—Joseph Bertrand, author of “Calculus of Probabilities” (1907), and Emile Borel, author of “Le Hasard” (1914)! According to Neyman, what served “as an inspiration to Egon S. Pearson and myself in our effort to build a frequentist theory of testing hypotheses”(1977, p. 103) initially grew out of remarks of Borel, whose lectures Neyman had attended in Paris. He returns to the Bertrand-Borel debate in four different papers, and circles back to it often in his talks with his biographer, Constance Reid. His student Erich Lehmann (1993), regarded as the authority on Neyman, wrote an entire paper on the topic: “The Bertrand-Borel Debate and the Origins of the Neyman Pearson Theory”.

Since it’s Saturday night, let’s listen in on this one act play, just about to begin at the Elba Dinner Theater. Don’t worry, food and drink are allowed to be taken in. (I’ve also included, in the References, several links to papers for your weekend reading enjoyment!)  There go les trois coups–the curtain’s about to open!


The curtain opens with a young Neyman and Pearson (from 1933) standing mid-stage, lit by a spotlight. (Neyman does the talking, since it’s his birthday.)

Neyman: “Bertrand put into statistical form a variety of hypotheses, as for example the hypothesis that a given group of stars…form a ‘system.’ His method of attack, which is that in common use, consisted essentially in calculating the probability, P, that a certain character, x, of the observed facts would arise if the hypothesis tested were true. If P were very small, this would generally be considered as an indication that…H was probably false, and vice versa. Bertrand expressed the pessimistic view that no test of this kind could give reliable results.

Borel, however, considered…that the method described could be applied with success provided that the character, x, of the observed facts were properly chosen—were, in fact, a character which he terms ‘en quelque sorte remarquable’” (Neyman and Pearson 1933, p.141/290).

The stage fades to black, then a spotlight shines on Bertrand, stage right.

Bertrand: “How can we decide on the unusual results that chance is incapable of producing?…The Pleiades appear closer to each other than one would naturally expect…In order to make the vague idea of closeness more precise, should we look for the smallest circle that contains the group? the largest of the angular distances? the sum of squares of all the distances?…Each of these quantities is smaller for the group of the Pleiades than seems plausible. Which of them should provide the measure of implausibility? …

[He turns to the audience, shaking his head.]

The application of such calculations to questions of this kind is a delusion and an abuse.” (Bertrand, 1907, p. 166; Lehmann 1993, p. 963).

The stage fades to black, then a spotlight appears on Borel, stage left.

Borel: “The particular form that problems of causes often take…is the following: Is such and such a result due to chance or does it have a cause? It has often been observed how much this statement lacks in precision. Bertrand has strongly emphasized this point. But …to refuse to answer under the pretext that the answer cannot be absolutely precise, is to… misunderstand the essential nature of the application of mathematics.” (ibid. p. 964) Bertrand considers the Pleiades. ‘If one has observed a [precise angle between the stars]…in tenths of seconds…one would not think of asking to know the probability [of observing exactly this observed angle under chance] because one would never have asked that precise question before having measured the angle’… (ibid.)

The question is whether one has the same reservations in the case in which one states that one of the angles of the triangle formed by three stars has “une valeur remarquable” [a striking or noteworthy value], and is for example equal to the angle of the equilateral triangle…. (ibid.)

Here is what one can say on this subject: One should carefully guard against the tendency to consider as striking an event that one has not specified beforehand, because the number of such events that may appear striking, from different points of view, is very substantial” (ibid. p. 964).

The stage fades to black, then a spotlight beams on Neyman and Pearson mid-stage. (Neyman does the talking)

Neyman: “We appear to find disagreement here, but are inclined to think that…the two writers [Bertrand and Borel] are not really considering precisely the same problem. In general terms the problem is this: Is it possible that there are any efficient tests of hypotheses based upon the theory of probability, and if so, what is their nature. …What is the precise meaning of the words ‘an efficient test of a hypothesis’?” (1933, p. 140/290)

“[W]e may consider some specified hypothesis, as that concerning the group of stars, and look for a method which we should hope to tell us, with regard to a particular group of stars, whether they form a system, or are grouped ‘by chance,’…their relative movements unrelated.” (ibid.)
“If this were what is required of ‘an efficient test’, we should agree with Bertrand in his pessimistic view. For however small be the probability that a particular grouping of a number of stars is due to ‘chance’, does this in itself provide any evidence of another ‘cause’ for this grouping but ‘chance’? …Indeed, if x is a continuous variable—as for example is the angular distance between two stars—then any value of x is a singularity of relative probability equal to zero. We are inclined to think that as far as a particular hypothesis is concerned, no test based upon the theory of probability can by itself provide any valuable evidence of the truth or falsehood of that hypothesis. But we may look at the purpose of tests from another view-point.” (Emphasis added; ibid. pp. 141-2; 290-1)

Fade to black, spot on narrator mid-stage:

Narrator: We all know our famous (miserable) lines are about to come. But let’s linger on the “as far as a particular hypothesis is concerned” portion. For any particular case, one may identify a data dependent feature x that would be highly improbable “under the particular hypothesis of chance”. We must “carefully guard,” Borel warns, “against the tendency to consider as striking an event that one has not specified beforehand”. But if you are required to set the test’s capabilities ahead of time then you need to specify the type of falsity of Ho, the distance measure or test statistic beforehand. An efficient test should capture Fisher’s concern with tests sensitive to departures of interest. Listen to Neyman over 40 years later, reflecting on the relevance of Borel’s position in 1977.

Fade to black. Spotlight on an older Neyman, stage right.


Neyman: “The question (what is an efficient test of a statistical hypothesis) is about an intelligible methodology for deciding whether the observed difference…contradicts the stochastic model….

This question was the subject of a lively discussion by Borel and others. Borel was optimistic but insisted that: (a) the criterion to test a hypothesis (a ‘statistical hypothesis’) using some observations must be selected not after the examination of the results of observation, but before, and (b) this criterion should be a function of the observations (of some sort remarkable) (Neyman 1977, pp. 102-103).
It is these remarks of Borel that served as an inspiration to Egon S. Pearson and myself in our effort to build a frequentist theory of testing hypotheses.”(ibid. p. 103)

Fade to black. Spotlight on an older Egon Pearson writing a letter to Neyman about the preprint Neyman sent of his 1977 paper. (The letter is unpublished, but I cite Lehmann 1993.)

Pearson: “I remember that you produced this quotation [from Borel] when we began to get our [1933] paper into shape… The above stages [wherein he had been asking ‘Why use that particular test statistic?’] led up to Borel’s requirement of finding…a criterion which was ‘a function of the observations ‘en quelque sorte remarquable’. Now my point is that you and I (perhaps my first leading) had ourselves reached the Borel requirement independently of Borel, because we were serious humane thinkers; Borel’s expression neatly capped our own.”

Fade to black. End Play

Egon has the habit of leaving the most tantalizing claims unpacked, and this is no exception: What exactly is the Borel requirement already reached due to their being “serious humane thinkers”? I can well imagine growing this one act play into something like the expressionist play of Michael Frayn, Copenhagen, wherein a variety of alternative interpretations are entertained based on subsequent work and remarks. I don’t say that it would enjoy a long life on Broadway, but a small handful of us would relish it.

As with my previous attempts at “statistical theatre of the absurd” (e.g., “Stat on a hot-tin roof”), there would be no attempt at all to popularize—only published quotes and closely remembered conversations would be included.

Deconstructions on the Meaning of the Play by Theater Critics

It’s not hard to see that, as far as a particular star grouping is concerned, we cannot expect a reliable inference to just any non-chance effect discovered in the data. The more specific the feature is to these particular observations, the more improbable it is. What’s the probability of 3 hurricanes followed by 2 plane crashes (as occurred last month, say)? Harold Jeffreys put it this way: any sample is improbable in some respect; to cope with this fact statistical method does one of two things: it appeals to prior probabilities of a hypothesis or to error probabilities of a procedure. The former can check our tendency to find a more likely explanation H’ than chance by assigning an appropriately low prior weight to H’. What does the latter approach do? It says we need to consider the problem as of a general type. It’s a general rule, from a test statistic to some assertion about alternative hypotheses, expressing the non-chance effect. Such assertions may be in error, but we can control such erroneous interpretations. We deliberately move away from the particularity of the case at hand to the general type of mistake that could be made.

Isn’t this taken care of by Fisher’s requirement that Pr(P ≤ p0; H0) = p0 (that the test rarely rejects the null if true)? It may be, in practice, Neyman and Pearson thought, but only under certain conditions that were not explicitly codified in Fisher’s simple significance tests. With just the null hypothesis, it is unwarranted to take low P-values as evidence for a specific “cause” or non-chance explanation. Many could be erected post data, but the ways these could be in error would not have been probed. Fisher (1947, p. 182) is well aware that “the same data may contradict the hypothesis in any of a number of different ways,” and that different corresponding tests would be used.
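Fisher’s requirement is readily checked by simulation: if H0 is true and the P-value is computed from a continuous test statistic, then Pr(P ≤ p0; H0) = p0 for any cutoff p0, so a test rejecting when P ≤ 0.05 rejects about 5% of the time under the null. A minimal sketch in Python; the two-sample normal setup, the sample sizes, and the seed are my own illustrative choices, not anything fixed by the discussion above:

```python
import math
import random

def two_sample_z_pvalue(x, y):
    """Two-sided Z-test P-value for equal means, known unit variance."""
    nx, ny = len(x), len(y)
    z = (sum(x) / nx - sum(y) / ny) / math.sqrt(1 / nx + 1 / ny)
    # two-sided P-value via the standard normal survival function
    return math.erfc(abs(z) / math.sqrt(2))

random.seed(1)
n, reps, alpha = 50, 20000, 0.05
rejections = 0
for _ in range(reps):
    x = [random.gauss(0, 1) for _ in range(n)]
    y = [random.gauss(0, 1) for _ in range(n)]  # H0 true: identical means
    if two_sample_z_pvalue(x, y) <= alpha:
        rejections += 1
print(rejections / reps)  # close to alpha = 0.05
```

The guarantee holds, note, only because the test statistic was fixed before the data were drawn; the quotes that follow concern what goes wrong when the criterion is chosen after inspecting the data.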

The notion that different tests of significance are appropriate to test different features of the same null hypothesis presents no difficulty to workers engaged in practical experimentation. [T]he experimenter is aware of what observational discrepancy it is which interests him, and which he thinks may be statistically significant, before he inquires what test of significance, if any, is available appropriate to his needs (ibid., p. 185).

Even if “an experienced experimenter” knows the appropriate test, this doesn’t lessen the importance of NP’s interest in seeking to identify a statistical rationale for the choices made on informal grounds. In today’s world, if not in Fisher’s day, there’s legitimate concern about selecting the alternative that gives the more impressive P-value.

Here’s Egon Pearson writing with Chandra Sekar: In testing if a sample has been drawn from a single normal population, “it is not possible to devise an efficient test if we only bring into the picture this single normal probability distribution with its two unknown parameters. We must also ask how sensitive the test is in detecting failure of the data to comply with the hypotheses tested, and to deal with this question effectively we must be able to specify the directions in which the hypothesis may fail” (p. 121). “It is sometimes held that the criterion for a test can be selected after the data, but it will be hard to be unprejudiced at this point” (Pearson & Chandra Sekar, 1936, p. 129).

To base the choice of the test of a statistical hypothesis upon an inspection of the observations is a dangerous practice; a study of the configuration of a sample is almost certain to reveal some feature, or features, which are exceptions if the hypothesis is true…. By choosing the feature most unfavourable to Ho out of a very large number of features examined it will usually be possible to find some reason for rejecting the hypothesis. It must be remembered, however, that the point now at issue will not be whether it is exceptional to find a given criterion with so unfavourable a value. We shall need to find an answer to the more difficult question. Is it exceptional that the most unfavourable criterion of the n, say, examined should have as unfavourable a value as this? (ibid., p. 127).

Notice, the goal is not behavioristic; it’s a matter of avoiding the glaring fallacies in the test at hand, fallacies we know all too well.
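The fallacy Pearson and Chandra Sekar describe is easy to exhibit numerically: compute n criteria on null data, cherry-pick the most unfavourable one, and refer it to the reference set appropriate for a single prespecified criterion. In this sketch the normal model and the choice of n = 20 independent criteria are my own illustrative assumptions:

```python
import math
import random

def pvalue_two_sided(z):
    """Two-sided P-value for a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2))

random.seed(2)
reps, n_features, alpha = 20000, 20, 0.05
naive_rejections = 0
for _ in range(reps):
    # n_features independent standard-normal criteria, all with H0 true
    zs = [random.gauss(0, 1) for _ in range(n_features)]
    # post data, pick the criterion most unfavourable to H0
    best_p = min(pvalue_two_sided(z) for z in zs)
    if best_p <= alpha:
        naive_rejections += 1
print(naive_rejections / reps)        # roughly 1 - 0.95**20, i.e. about 0.64
print(1 - (1 - alpha) ** n_features)  # the answer to the "more difficult question"
```

Referring the minimum P-value to its own distribution, rather than to the single-criterion reference set, is what restores the nominal error rate.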

“The statistician who does not know in advance with which type of alternative to H0 he may be faced, is in the position of a carpenter who is summoned to a house to undertake a job of an unknown kind and is only able to take one tool with him! Which shall it be? Even if there is an omnibus tool, it is likely to be far less sensitive at any particular job than a specialized one; but the specialized tool will be quite useless under the wrong conditions” (ibid., p. 126).

In a famous result, Neyman (1952) demonstrates that, by dint of a post-data choice of test statistic, a result that leads to rejection in one test yields the opposite conclusion in another, both adhering to the same fixed significance level. [Fisher concedes this as well.] If you are keen to ensure the test is capable of teaching about discrepancies of interest, you should prespecify an alternative hypothesis, where the null and alternative hypotheses exhaust the space, relative to a given question. We can infer discrepancies from the null, as well as corroborate their absence, by considering those discrepancies the test had high power to detect.
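Neyman’s 1952 construction is more elaborate, but the phenomenon is simple to reproduce with a toy dataset of my own devising (it is not his example): the one-sample t-test and the exact sign test, each a valid level-0.05 test of a centre of symmetry at zero, deliver opposite verdicts on the very same data:

```python
import math

data = [-0.1] * 24 + [10.0]  # 24 small negative values, one large positive
n = len(data)

# Test 1: one-sample t-test of H0: mean = 0 (two-sided, alpha = 0.05)
mean = sum(data) / n
s = math.sqrt(sum((x - mean) ** 2 for x in data) / (n - 1))
t = mean / (s / math.sqrt(n))
t_crit = 2.064  # two-sided 5% critical value of Student's t, df = 24
print(f"t = {t:.3f}; reject H0: {abs(t) > t_crit}")  # t is about 0.752: no rejection

# Test 2: exact sign test of H0: median = 0 (two-sided binomial)
below = sum(1 for x in data if x < 0)
k = min(below, n - below)
p_sign = 2 * sum(math.comb(n, i) for i in range(k + 1)) / 2 ** n
print(f"sign-test p = {p_sign:.1e}; reject H0: {p_sign <= 0.05}")  # rejects decisively
```

An analyst free to pick the criterion after seeing the data can thus report whichever verdict suits; prespecifying the test statistic, and the alternatives it is sensitive to, is what removes that freedom.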

Playbill Souvenir

Let’s flesh out Neyman’s conclusion to the Borel-Bertrand debate: if we accept the words, “an efficient test of the hypothesis H” to mean a statistical (methodological) falsification rule that controls the probabilities of erroneous interpretations of data, and ensures the rejection was because of the underlying cause (as modeled), then we agree with Borel that efficient tests are possible. This requires (a) a prespecified test criterion to avoid verification biases while ensuring power (efficiency), and (b) consideration of alternative hypotheses to avoid fallacies of acceptance and rejection. We must steer clear of isolated or particular curiosities to find indications that we are tracking genuine effects.

“Fisher’s the one to be credited,” Pearson remarks, “for his emphasis on planning an experiment, which led naturally to the examination of the power function, both in choosing the size of sample so as to enable worthwhile results to be achieved, and in determining the most appropriate tests” (Pearson 1962, p. 277). If you’re planning, you’re prespecifying, perhaps, nowadays, by means of explicit preregistration.

Nevertheless, prespecifying the question (or test statistic) is distinct from predesignating a cut-off P-value for significance. Discussions of tests often suppose one is somehow cheating if the attained P-value is reported, as if it loses its error probability status. It doesn’t.[2] I claim they are confusing prespecifying the question or hypothesis with fixing the P-value in advance–a confusion whose origin stems from failing to identify the rationale behind the conventions of tests, or so I argue. Nor is predesignation itself essential; rather, it is an excellent way to promote valid error probabilities.

But not just any characteristic of the data affords the relevant error probability assessment. It has got to be pretty remarkable!

Enter those pivotal statistics called upon in Fisher’s fiducial inference. In fact, the story could well be seen to continue in the following two posts: “You can’t take the Fiducial out of Fisher if you want to understand the N-P performance philosophy” and “Deconstructing the Fisher-Neyman conflict wearing fiducial glasses”.

[1] Or, it might have been titled, “A Polish Statistician in Paris”, given the remake of “An American in Paris” is still going strong on Broadway, last time I checked.

[2] We know that Lehmann insisted people report the attained p-value so that others could apply their own preferred error probabilities. N-P felt the same way. (I may add some links to relevant posts later on.)


Bertrand, J. 1888/1907. Calcul des Probabilités. Paris: Gauthier-Villars.

Borel, E. 1914. Le Hasard. Paris: Alcan.

Fisher, R. A. 1947. The Design of Experiments (4th ed.). Edinburgh: Oliver and Boyd.

Lehmann, E.L. 2012. “The Bertrand-Borel Debate and the Origins of the Neyman-Pearson Theory” in J. Rojo (ed.), Selected Works of E. L. Lehmann, 2012, Springer US, Boston, MA, pp. 965-974.

Neyman, J. 1952. Lectures and Conferences on Mathematical Statistics and Probability. 2nd ed. Washington, DC: Graduate School of U.S. Dept. of Agriculture.

Neyman, J. 1977. “Frequentist Probability and Frequentist Statistics“, Synthese 36(1): 97–131.

Neyman, J. & Pearson, E. 1933. “On the Problem of the Most Efficient Tests of Statistical Hypotheses“, Philosophical Transactions of the Royal Society of London 231. Series A, Containing Papers of a Mathematical or Physical Character: 289–337.

Pearson, E. S. 1962. “Some Thoughts on Statistical Inference”, The Annals of Mathematical Statistics, 33(2): 394-403.

Pearson, E. S. & Sekar, C. C. 1936. “The Efficiency of Statistical Tools and a Criterion for the Rejection of Outlying Observations“, Biometrika 28(3/4): 308-320. Reprinted (1966) in The Selected Papers of E. S. Pearson, (pp. 118-130). Berkeley: University of California Press.

Reid, C. 1982. Neyman–From Life. New York: Springer.



Categories: E.S. Pearson, Neyman, Statistics

Neyman: Distinguishing tests of statistical hypotheses and tests of significance might have been a lapse of someone’s pen


April 16, 1894 – August 5, 1981

I’ll continue to post Neyman-related items this week in honor of his birthday. This isn’t the only paper in which Neyman makes it clear he denies a distinction between a test of statistical hypotheses and a significance test. He and E. Pearson also discredit the myth that the former is only allowed to report pre-data, fixed error probabilities, and is justified only by dint of long-run error control. Controlling the “frequency of misdirected activities” in the midst of finding something out, or solving a problem of inquiry, is, on the other hand, an epistemological goal. What do you think?

Tests of Statistical Hypotheses and Their Use in Studies of Natural Phenomena
by Jerzy Neyman

ABSTRACT. Contrary to ideas suggested by the title of the conference at which the present paper was presented, the author is not aware of a conceptual difference between a “test of a statistical hypothesis” and a “test of significance” and uses these terms interchangeably. A study of any serious substantive problem involves a sequence of incidents at which one is forced to pause and consider what to do next. In an effort to reduce the frequency of misdirected activities one uses statistical tests. The procedure is illustrated on two examples: (i) Le Cam’s (and associates’) study of immunotherapy of cancer and (ii) a socio-economic experiment relating to low-income homeownership problems.

I recommend, especially, the example on home ownership. Here are two snippets: Continue reading

Categories: Error Statistics, Neyman, Statistics

A. Spanos: Jerzy Neyman and his Enduring Legacy

Today is Jerzy Neyman’s birthday. I’ll post various Neyman items this week in honor of it, starting with a guest post by Aris Spanos. Happy Birthday Neyman!

A. Spanos

A Statistical Model as a Chance Mechanism
Aris Spanos 

Jerzy Neyman (April 16, 1894 – August 5, 1981), was a Polish/American statistician[i] who spent most of his professional career at the University of California, Berkeley. Neyman is best known in statistics for his pioneering contributions in framing the Neyman-Pearson (N-P) optimal theory of hypothesis testing and his theory of Confidence Intervals. (This article was first posted here.)

Neyman: 16 April 1894 – 5 Aug 1981

One of Neyman’s most remarkable, but least recognized, achievements was his adaptation of Fisher’s (1922) notion of a statistical model to render it pertinent for non-random samples. Fisher’s original parametric statistical model Mθ(x) was based on the idea of ‘a hypothetical infinite population’, chosen so as to ensure that the observed data x0:=(x1,x2,…,xn) can be viewed as a ‘truly representative sample’ from that ‘population’:

“The postulate of randomness thus resolves itself into the question, ‘Of what population is this a random sample?’” (ibid., p. 313), underscoring that “the adequacy of our choice may be tested a posteriori.” (p. 314) Continue reading

Categories: Neyman, Spanos

If you’re seeing limb-sawing in P-value logic, you’re sawing off the limbs of reductio arguments

I was just reading a paper by Martin and Liu (2014) in which they allude to the “questionable logic of proving H0 false by using a calculation that assumes it is true” (p. 1704). They say they seek to define a notion of “plausibility” that

“fits the way practitioners use and interpret p-values: a small p-value means H0 is implausible, given the observed data,” but they seek “a probability calculation that does not require one to assume that H0 is true, so one avoids the questionable logic of proving H0 false by using a calculation that assumes it is true“(Martin and Liu 2014, p. 1704).

Questionable? A very standard form of argument is a reductio (ad absurdum), wherein a claim C is inferred (i.e., detached) by falsifying ~C, that is, by showing that assuming ~C entails something in conflict with (if not logically contradicting) known results or known truths [i]. Actual falsification in science is generally a statistical variant of this argument. Supposing H0 in p-value reasoning plays the role of ~C. Yet some aver it thereby “saws off its own limb”! Continue reading

Categories: P-values, reforming the reformers, Statistics


3 years ago…

MONTHLY MEMORY LANE: 3 years ago: March 2014. I mark in red three posts from each month that seem most apt for general background on key issues in this blog, excluding those reblogged recently[1], and in green up to 4 others I’d recommend[2].  Posts that are part of a “unit” or a group count as one. 3/19 and 3/17 are one, as are  3/19, 3/12 and 3/4, and the 6334 items 3/11, 3/22 and 3/26. So that covers nearly all the posts!

March 2014


  • (3/1) Cosma Shalizi gets tenure (at last!) (metastat announcement)
  • (3/2) Significance tests and frequentist principles of evidence: Phil6334 Day #6
  • (3/3) Capitalizing on Chance (ii)
  • (3/4) Power, power everywhere–(it) may not be what you think! [illustration]
  • (3/8) Msc kvetch: You are fully dressed (even under your clothes)?
  • (3/8) Fallacy of Rejection and the Fallacy of Nouvelle Cuisine
  • (3/11) Phil6334 Day #7: Selection effects, the Higgs and 5 sigma, Power
  • (3/12) Get empowered to detect power howlers
  • (3/15) New SEV calculator (guest app: Durvasula)
  • (3/17) Stephen Senn: “Delta Force: To what extent is clinical relevance relevant?” (Guest Post)
  • (3/19) Power taboos: Statue of Liberty, Senn, Neyman, Carnap, Severity
  • (3/22) Fallacies of statistics & statistics journalism, and how to avoid them: Summary & Slides Day #8 (Phil 6334)
  • (3/25) The Unexpected Way Philosophy Majors Are Changing The World Of Business
  • (3/26) Phil6334: Misspecification Testing: Ordering From A Full Diagnostic Menu (part 1)
  • (3/28) Severe osteometric probing of skeletal remains: John Byrd
  • (3/29) Winner of the March 2014 palindrome contest (rejected post)
  • (3/30) Phil6334: March 26, philosophy of misspecification testing (Day #9 slides)

[1] Monthly memory lanes began at the blog’s 3-year anniversary in Sept, 2014.

[2] New Rule, July 30, 2016; March 30, 2017 (moved to 4)–a very convenient way to allow data-dependent choices.






Categories: 3-year memory lane, Error Statistics, Statistics

Announcement: Columbia Workshop on Probability and Learning (April 8)

I’m speaking on “Probing with Severity” at the “Columbia Workshop on Probability and Learning” on April 8:

Meetings of the Formal Philosophy Group at Columbia

April 8, 2017

Department of Philosophy, Columbia University

Room 716
Philosophy Hall, 1150 Amsterdam Avenue
New York 10027
United States


  • The Formal Philosophy Group (Columbia)

Main speakers:

Gordon Belot (University of Michigan, Ann Arbor)

Simon Huttegger (University of California, Irvine)

Deborah Mayo (Virginia Tech)

Teddy Seidenfeld (Carnegie Mellon University)


Michael Nielsen (Columbia University)

Rush Stewart (Columbia University)


Unfortunately, access to Philosophy Hall is by swipe card on the weekends. However, students and faculty will be entering and exiting the building throughout the day (with relatively high frequency, since there is a popular cafe on the main floor).

Categories: Announcement

Er, about those other approaches, hold off until a balanced appraisal is in

I could have told them that the degree of accordance enabling the ASA’s “6 principles” on p-values was unlikely to be replicated when it came to most of the “other approaches” with which some would supplement or replace significance tests–notably Bayesian updating, Bayes factors, or likelihood ratios (confidence intervals are dual to hypothesis tests). [My commentary is here.] So now they may be advising a “hold off” or “go slow” approach until some consilience is achieved. Is that it? I don’t know. I was tweeted an article about the background chatter taking place behind the scenes; I wasn’t one of the people interviewed for this. Here are some excerpts; I may add more later after it has had time to sink in. (Check back later.)

“Reaching for Best Practices in Statistics: Proceed with Caution Until a Balanced Critique Is In”

J. Hossiason

“[A]ll of the other approaches*, as well as most statistical tools, may suffer from many of the same problems as the p-values do. What level of likelihood ratio in favor of the research hypothesis will be acceptable to the journal? Should scientific discoveries be based on whether posterior odds pass a specific threshold (P3)? Does either measure the size of an effect (P5)?…How can we decide about the sample size needed for a clinical trial—however analyzed—if we do not set a specific bright-line decision rule? 95% confidence intervals or credence intervals…offer no protection against selection when only those that do not cover 0 are selected into the abstract (P4).” (Benjamini, ASA commentary, pp. 3-4)

What’s sauce for the goose is sauce for the gander, right? Many statisticians seconded George Cobb, who urged “the board to set aside time at least once every year to consider the potential value of similar statements” to the recent ASA p-value report. Disappointingly, a preliminary survey of leaders in statistics, many from the original p-value group, aired striking disagreements on best and worst practices with respect to these other approaches. The Executive Board is contemplating a variety of recommendations, minimally, that practitioners move with caution until they can put forward at least a few agreed-upon principles for interpreting and applying Bayesian inference methods. The words we heard ranged from “go slow” to “moratorium” [emphasis mine]. Having been privy to some of the results of this survey, we at Stat Report Watch decided to contact some of the individuals involved. Continue reading

Categories: P-values, reforming the reformers, Statistics
