Author Archives: Mayo

About Mayo

I am a professor in the Department of Philosophy at Virginia Tech and hold a visiting appointment at the Center for the Philosophy of Natural and Social Science of the London School of Economics. I am the author of Error and the Growth of Experimental Knowledge, which won the 1998 Lakatos Prize, awarded to the most outstanding contribution to the philosophy of science during the previous six years. I have applied my approach toward solving key problems in philosophy of science: underdetermination, the role of novel evidence, Duhem's problem, and the nature of scientific progress. I am also interested in applications to problems in risk analysis and risk controversies, and co-edited Acceptable Evidence: Science and Values in Risk Management (with Rachelle Hollander). I teach courses in introductory and advanced logic (including the metatheory of logic and modal logic), in scientific method, and in philosophy of science. I also teach special topics courses in Science and Technology Studies.

Allan Birnbaum: Foundations of Probability and Statistics (27 May 1923 – 1 July 1976)

27 May 1923-1 July 1976


Today is Allan Birnbaum’s birthday. In honor of his birthday, I’m posting the articles in the Synthese volume that was dedicated to his memory in 1977. The editors describe it as their way of “paying homage to Professor Birnbaum’s penetrating and stimulating work on the foundations of statistics”. I paste a few snippets from the articles by Giere and Birnbaum. If you’re interested in statistical foundations, and are unfamiliar with Birnbaum, here’s a chance to catch up. (Even if you are, you may be unaware of some of these key papers.)


Synthese Volume 36, No. 1 Sept 1977: Foundations of Probability and Statistics, Part I

Editorial Introduction:

This special issue of Synthese on the foundations of probability and statistics is dedicated to the memory of Professor Allan Birnbaum. Professor Birnbaum’s essay ‘The Neyman-Pearson Theory as Decision Theory; and as Inference Theory; with a Criticism of the Lindley-Savage Argument for Bayesian Theory’ was received by the editors of Synthese in October, 1975, and a decision was made to publish a special symposium consisting of this paper together with several invited comments and related papers. The sad news about Professor Birnbaum’s death reached us in the summer of 1976, but the editorial project could nevertheless be completed according to the original plan. By publishing this special issue we wish to pay homage to Professor Birnbaum’s penetrating and stimulating work on the foundations of statistics. We are grateful to Professor Ronald Giere who wrote an introductory essay on Professor Birnbaum’s concept of statistical evidence and who compiled a list of Professor Birnbaum’s publications.


Table of Contents

SUFFICIENCY, CONDITIONALITY AND LIKELIHOOD

In December of 1961 Birnbaum presented the paper ‘On the Foundations of Statistical Inference’ (Birnbaum [19]) at a special discussion meeting of the American Statistical Association. Among the discussants was L. J. Savage who pronounced it “a landmark in statistics”. Explicitly denying any “intent to speak with exaggeration or rhetorically”, Savage described the occasion as “momentous in the history of statistics”. “It would be hard”, he said, “to point to even a handful of comparable events” (Birnbaum [19], pp. 307-8). The reasons for Savage’s enthusiasm are obvious. Birnbaum claimed to have shown that two principles widely held by non-Bayesian statisticians (sufficiency and conditionality) jointly imply an important consequence of Bayesian statistics (likelihood).[1]
INTRODUCTION AND SUMMARY

…Two contrasting interpretations of the decision concept are formulated: behavioral, applicable to ‘decisions’ in a concrete literal sense as in acceptance sampling; and evidential, applicable to ‘decisions’ such as ‘reject H’ in a research context, where the pattern and strength of statistical evidence concerning statistical hypotheses is of central interest. Typical standard practice is characterized as based on the confidence concept of statistical evidence, which is defined in terms of evidential interpretations of the ‘decisions’ of decision theory. These concepts are illustrated by simple formal examples with interpretations in genetic research, and are traced in the writings of Neyman, Pearson, and other writers. The Lindley-Savage argument for Bayesian theory is shown to have no direct cogency as a criticism of typical standard practice, since it is based on a behavioral, not an evidential, interpretation of decisions.

[1] By “likelihood” here, Giere means the (strong) Likelihood Principle (SLP). Dotted through the first 3 years of this blog are a number of (formal and informal) posts on his SLP result, and my argument as to why it is unsound. I wrote a paper on this that appeared in Statistical Science 2014. You can find it along with a number of comments and my rejoinder in this post: Statistical Science: The Likelihood Principle Issue is Out. Finding his proof unsound gives a new lease on life to statistical foundations, or so I argue in my rejoinder.

Categories: Birnbaum, Likelihood Principle, Statistics, strong likelihood principle | Tags: | Leave a comment

Frequentstein’s Bride: What’s wrong with using (1 – β)/α as a measure of evidence against the null?



ONE YEAR AGO: …and growing more relevant all the time. Rather than leak any of my new book*, I reblog some earlier posts, even if they’re a bit scruffy. This was first blogged here (with a slightly different title). It’s married to posts on “the P-values overstate the evidence against the null fallacy”, such as this, and is wedded to this one on “How to Tell What’s True About Power if You’re Practicing within the Frequentist Tribe”. 

In their “Comment: A Simple Alternative to p-values,” (on the ASA P-value document), Benjamin and Berger (2016) recommend researchers report a pre-data Rejection Ratio:

It is the probability of rejection when the alternative hypothesis is true, divided by the probability of rejection when the null hypothesis is true, i.e., the ratio of the power of the experiment to the Type I error of the experiment. The rejection ratio has a straightforward interpretation as quantifying the strength of evidence about the alternative hypothesis relative to the null hypothesis conveyed by the experimental result being statistically significant. (Benjamin and Berger 2016, p. 1)

The recommendation is much more fully fleshed out in a 2016 paper by Bayarri, Benjamin, Berger, and Sellke (BBBS 2016): Rejection Odds and Rejection Ratios: A Proposal for Statistical Practice in Testing Hypotheses. Their recommendation is:

…that researchers should report the ‘pre-experimental rejection ratio’ when presenting their experimental design and researchers should report the ‘post-experimental rejection ratio’ (or Bayes factor) when presenting their experimental results. (BBBS 2016, p. 3)….

The (pre-experimental) ‘rejection ratio’ Rpre, the ratio of statistical power to significance threshold (i.e., the ratio of the probability of rejecting under H1 and H0 respectively), is shown to capture the strength of evidence in the experiment for H1 over H0. (ibid., p. 2)

But it does no such thing! [See my post from the FUSION 2016 conference here.] J. Berger and his co-authors will tell you the rejection ratio (and a variety of other measures created over the years) are entirely frequentist because they are created out of frequentist error statistical measures. But a creation built on frequentist measures doesn’t mean the resulting animal captures frequentist error statistical reasoning. It might be a kind of Frequentstein monster! [1]


The Law of Comparative Support

It comes from a comparativist support position which has intrinsic plausibility, although I do not hold to it. It is akin to what some likelihoodists call “the law of support”: if H1 makes the observed results probable, while H0 makes them improbable, then the results are strong (or at least better) evidence for H1 compared to H0. It appears to be saying (sensibly) that you have better evidence for a hypothesis that best “explains” the data, only this is not a good measure of explanation. It is not generally required that H0 and H1 be exhaustive. Even if you hold a comparative support position, the “ratio of statistical power to significance threshold” isn’t a plausible measure for this. Now BBBS also object to the Rejection Ratio, but largely because it’s not sensitive to the actual outcome; so they recommend the Bayes Factor post data. My criticism is much, much deeper. To get around the data-dependent part, let’s assume throughout that we’re dealing with a result just statistically significant at the α level.


Take a one-sided Normal test T+: with n iid samples:

H0: µ ≤  0 against H1: µ >  0

σ = 10,  n = 100,  σ/√n = σx = 1,  α = .025.

So the test would reject H0 iff Z > c.025 = 1.96. (1.96 is the “cut-off”.)

People often talk of a test “having a power” but the test actually specifies a power function that varies with different point values in the alternative H1 . The power of test T+ in relation to point alternative µ’ is

Pr(Z > 1.96; µ = µ’).

We can abbreviate this as POW(T+,µ’).
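For test T+ this power function can be computed directly from the Normal tail area (a minimal sketch in Python; the function name `power` is my shorthand, not notation from the cited papers):

```python
from statistics import NormalDist

def power(mu_prime, cutoff=1.96, sigma_x=1.0):
    """POW(T+, mu'): Pr(Z > cutoff) computed under mu = mu'.

    Under mu = mu', Z is Normal with mean mu'/sigma_x and unit
    variance, so the tail area is 1 - Phi(cutoff - mu'/sigma_x).
    """
    return 1 - NormalDist().cdf(cutoff - mu_prime / sigma_x)

print(power(2.96))  # one sigma_x above the cut-off: ~0.84
print(power(4.96))  # three sigma_x above the cut-off: ~0.999
```

Note that `power(0.0)` returns α = .025, as it must: at the null, the probability of rejecting is just the significance level.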


Jacob Cohen’s slips

By the way, Jacob Cohen, a founder of power analysis, makes a few slips in introducing power, even though he correctly computes power throughout the book (so far as I know). [2] Someone recently reminded me of this, and given the confusion about power, maybe it’s had more of an ill effect than I assumed.

In the first sentence on p. 1 of Statistical Power Analysis for the Behavioral Sciences, Cohen says “The power of a statistical test is the probability it will yield statistically significant results.” Also faulty, and for two reasons, is what he says on p. 4: “The power of a statistical test of a null hypothesis is the probability that it will lead to the rejection of the null hypothesis, i.e., the probability that it will result in the conclusion that the phenomenon exists.”

Do you see the two mistakes? 


Examples of alternatives against which T+ has high power:

  • If we add σx (i.e., σ/√n) to the cut-off (1.96), we are at an alternative value for µ that test T+ has .84 power to detect. In this example, σx = 1.
  • If we add 3σx to the cut-off, we are at an alternative value for µ that test T+ has ~.999 power to detect. This value can be written as µ.999 = 4.96.

Let the observed outcome just reach the cut-off to reject the null, z= 1.96.

If we were to form a “rejection ratio” or a “likelihood ratio” of μ = 4.96 compared to μ0 = 0 using

[POW(T+, 4.96)]/α,

it would be 40.  (.999/.025).

It is absurd to say the alternative 4.96 is supported 40 times as much as the null, even understanding support as comparative likelihood or something akin. The data 1.96 are even closer to 0 than to 4.96. (The same point can be made with less extreme cases.) What is commonly done next is to assign priors of .5 to the two hypotheses, yielding

Pr(H0|z0) = 1/(1 + 40) = .024, so Pr(H1|z0) = .976.
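The arithmetic is easy to reproduce (a sketch; the equal .5 priors are the ones assigned in the move criticized above, not ones I endorse):

```python
from statistics import NormalDist

alpha = 0.025
pow_high = 1 - NormalDist().cdf(1.96 - 4.96)  # POW(T+, 4.96), ~.999

rejection_ratio = pow_high / alpha            # ~40
posterior_H0 = 1 / (1 + rejection_ratio)      # Pr(H0|z0) with .5/.5 priors, ~.024
posterior_H1 = 1 - posterior_H0               # Pr(H1|z0), ~.976

print(round(rejection_ratio), round(posterior_H0, 3), round(posterior_H1, 3))
```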

Such an inference is highly unwarranted and would almost always be wrong. Back to our question:

Here’s my explanation for why some think it’s plausible to compute comparative evidence this way:

I presume it stems from the comparativist support position noted above. I’m guessing they’re reasoning as follows:

The probability is very high that z > 1.96 under the assumption that μ = 4.96.

The probability is low that z > 1.96 under the assumption that μ = μ0 = 0.

We’ve observed z= 1.96 (so you’ve observed z > 1.96).

Therefore, μ = 4.96 makes the observation more probable than does  μ = 0.

Therefore the outcome is (comparatively) better evidence for μ= 4.96 than for μ = 0.

But the “outcome” in a likelihood must be the specific outcome, and the comparative appraisal of which hypothesis accords better with the data only makes sense when one keeps to this.

I can pick any far away alternative I like for purposes of getting high power, and we wouldn’t want to say that just reaching the cut-off (1.96) is good evidence for it! Power works in the reverse. That is,

If POW(T+,µ’) is high, then z= 1.96 is poor evidence that μ  > μ’.

That’s because were μ as great as μ’, with high probability we would have observed a larger z value (smaller p-value) than we did. Power may, if one wishes, be seen as a kind of distance measure, but (just like α) it is inverted.

(Note that our inferences take the form μ > μ’, μ < μ’, etc. rather than to a point value.) 

In fact:

if Pr(Z > z0; μ = μ’) is high, then Z = z0 is strong evidence that μ < μ’!

Rather than being evidence for μ’, the statistically significant result is evidence against μ being as high as μ’.

A post by Stephen Senn:

In my favorite guest post by Stephen Senn here, Senn strengthens a point from his 2008 book (p. 201), namely, that the following is “nonsense”:

[U]pon rejecting the null hypothesis, not only may we conclude that the treatment is effective but also that it has a clinically relevant effect. (Senn 2008, p. 201)

Now the test is designed to have high power to detect a clinically relevant effect (usually .8 or .9). I happen to have chosen an extremely high power (.999) but the claim holds for any alternative that the test has high power to detect. The clinically relevant discrepancy, as he describes it, is one “we should not like to miss”, but obtaining a statistically significant result is not evidence we’ve found a discrepancy that big. 

Supposing that it is, is essentially  to treat the test as if it were:

H0: μ < 0 vs H1: μ  > 4.96

This, he says, is “ludicrous” as it:

would imply that we knew, before conducting the trial, that the treatment effect is either zero or at least equal to the clinically relevant difference. But where we are unsure whether a drug works or not, it would be ludicrous to maintain that it cannot have an effect which, while greater than nothing, is less than the clinically relevant difference. (Senn, 2008, p. 201)

The same holds with H0: μ = 0 as null.

If anything, it is the lower confidence limit that we would look at to see what discrepancies from 0 are warranted. The lower .975 limit and the lower .95 limit would be 0 and .3, respectively. So we would be warranted in inferring from z:

μ  > 0 or μ  > .3.
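These lower limits can be checked as z0 minus the relevant Normal quantile (a sketch; with σx = 1 the limits land directly on the µ scale):

```python
from statistics import NormalDist

z0 = 1.96  # observed outcome, just at the cut-off

# Lower confidence bounds: z0 minus the Normal quantile for each level.
lower_975 = z0 - NormalDist().inv_cdf(0.975)  # ~0.0
lower_95 = z0 - NormalDist().inv_cdf(0.95)    # ~0.3

print(round(lower_975, 2), round(lower_95, 1))
```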


What does the severe tester say?

In sync with the confidence interval, she would say SEV(μ > 0) = .975 (if one-sided), and would also note some other benchmarks, e.g., SEV(μ > .96) = .84.

Equally important for her is a report of what is poorly warranted. In particular the claim that the data indicate

μ > 4.96

would be wrong over 99% of the time!
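These severity benchmarks are easy to reproduce: SEV(µ > µ1), for the just-significant z0 = 1.96, is Pr(Z < z0) computed under µ = µ1 (a sketch; `sev` is my shorthand):

```python
from statistics import NormalDist

def sev(mu_1, z0=1.96, sigma_x=1.0):
    """SEV(mu > mu_1): Pr(Z < z0) computed under mu = mu_1."""
    return NormalDist().cdf(z0 - mu_1 / sigma_x)

print(round(sev(0.0), 3))   # 0.975: mu > 0 is warranted with high severity
print(round(sev(0.96), 2))  # 0.84: a middling benchmark
print(round(sev(4.96), 4))  # ~0.001: mu > 4.96 is very poorly warranted
```

The last value is the point of the example: inferring µ > 4.96 from a just-significant result would be wrong with probability about .999.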

Of course, I would want to use the actual result, rather than the cut-off for rejection (as with power) but the reasoning is the same, and here I deliberately let the outcome just hit the cut-off for rejection.


The (Type 1, 2 error probability) trade-off vanishes

Notice what happens if we consider the “real Type 1 error” as Pr(H0|z0).

Since Pr(H0|z0) decreases with increasing power, it decreases with decreasing Type 2 error. So we know that to identify “Type 1 error” and Pr(H0|z0) is to use language in a completely different way than the one in which power is defined. For there we must have a trade-off between Type 1 and 2 error probabilities.


Using size/power as a likelihood ratio, or even as a preregistered estimate of expected strength of evidence (with which to accord a rejection), is problematic. The error statistician is not in the business of making inferences to point values, nor to comparative appraisals of different point hypotheses. It’s not unusual for criticisms to start out forming these ratios, and then blame the “tail areas” for exaggerating the evidence against the test hypothesis. We don’t form those ratios. But the pre-data Rejection Ratio is also misleading as an assessment alleged to be akin to a Bayes ratio or likelihood assessment. You can marry frequentist components and end up with something frequentsteinian.


Bayarri, M., Benjamin, D., Berger, J., & Sellke, T. 2016. “Rejection Odds and Rejection Ratios: A Proposal for Statistical Practice in Testing Hypotheses”, Journal of Mathematical Psychology (in press).

Benjamin, D. & Berger J. 2016. “Comment: A Simple Alternative to P-values,” The American Statistician (online March 7, 2016).

Cohen, J. 1988. Statistical Power Analysis for the Behavioral Sciences. 2nd ed. Hillsdale, NJ: Erlbaum.

Mayo, D. 2016. “Don’t throw out the Error Control Baby with the Error Statistical Bathwater“. (My comment on the ASA document)

Mayo, D. 2003. Comments on J. Berger’s, “Could Jeffreys, Fisher and Neyman have Agreed on Testing?  (pp. 19-24)

*Mayo, D. Statistical Inference as Severe Testing, forthcoming (2017) CUP.

Senn, S. 2008. Statistical Issues in Drug Development, 2nd ed. Chichester, West Sussex: Wiley-Interscience, John Wiley & Sons.

Wasserstein, R. & Lazar, N. 2016. “The ASA’s Statement on P-values: Context, Process and Purpose”, The American Statistician (online March 7, 2016).

[1] I don’t say there’s no context where the Rejection Ratio has a frequentist role. It may arise in a diagnostic screening or empirical Bayesian context where one has to deal with a dichotomy. See, for example, this post (“Beware of questionable front page articles telling you to beware…”)

[2] It may also be found in Neyman! (Search this blog under Neyman’s Nursery.) However, Cohen uniquely provides massive power computations, before it was all computerized.

Categories: Bayesian/frequentist, fallacy of rejection, J. Berger, power, S. Senn | 8 Comments


3 years ago...


MONTHLY MEMORY LANE: 3 years ago: April 2014. I mark in red three posts from each month that seem most apt for general background on key issues in this blog, excluding those reblogged recently[1], and in green up to 4 others I’d recommend[2].  Posts that are part of a “unit” or a group count as one. For this month, I’ll include all the 6334 seminars as “one”.

April 2014

  • (4/1) April Fool’s. Skeptical and enthusiastic Bayesian priors for beliefs about insane asylum renovations at Dept of Homeland Security: I’m skeptical and unenthusiastic
  • (4/3) Self-referential blogpost (conditionally accepted*)
  • (4/5) Who is allowed to cheat? I.J. Good and that after dinner comedy hour. . ..
  • (4/6) Phil6334: Duhem’s Problem, highly probable vs highly probed; Day #9 Slides
  • (4/8) “Out Damned Pseudoscience: Non-significant results are the new ‘Significant’ results!” (update)
  • (4/12) “Murder or Coincidence?” Statistical Error in Court: Richard Gill (TEDx video)
  • (4/14) Phil6334: Notes on Bayesian Inference: Day #11 Slides
  • (4/16) A. Spanos: Jerzy Neyman and his Enduring Legacy
  • (4/17) Duality: Confidence intervals and the severity of tests
  • (4/19) Getting Credit (or blame) for Something You Didn’t Do (BP oil spill)
  • (4/21) Phil 6334: Foundations of statistics and its consequences: Day#12
  • (4/23) Phil 6334 Visitor: S. Stanley Young, “Statistics and Scientific Integrity”
  • (4/26) Reliability and Reproducibility: Fraudulent p-values through multiple testing (and other biases): S. Stanley Young (Phil 6334: Day #13)
  • (4/30) Able Stats Elba: 3 Palindrome nominees for April! (rejected post)


[1] Monthly memory lanes began at the blog’s 3-year anniversary in Sept, 2014.

[2] New Rule, July 30, 2016; March 30, 2017 (moved to 4): a very convenient way to allow data-dependent choices.






Categories: 3-year memory lane, Statistics | Leave a comment

How to tell what’s true about power if you’re practicing within the error-statistical tribe



This is a modified reblog of an earlier post, since I keep seeing papers that confuse this.

Suppose you are reading about a result x  that is just statistically significant at level α (i.e., P-value = α) in a one-sided test T+ of the mean of a Normal distribution with n iid samples, and (for simplicity) known σ:   H0: µ ≤  0 against H1: µ >  0. 

I have heard some people say:

A. If the test’s power to detect alternative µ’ is very low, then the just statistically significant x is poor evidence of a discrepancy (from the null) corresponding to µ’.  (i.e., there’s poor evidence that  µ > µ’ ).*See point on language in notes.

They will generally also hold that if POW(µ’) is reasonably high (at least .5), then the inference to µ > µ’ is warranted, or at least not problematic.

I have heard other people say:

B. If the test’s power to detect alternative µ’ is very low, then the just statistically significant x is good evidence of a discrepancy (from the null) corresponding to µ’ (i.e., there’s good evidence that  µ > µ’).

They will generally also hold that if POW(µ’) is reasonably high (at least .5), then the inference to µ > µ’ is unwarranted.

Which is correct, from the perspective of the (error statistical) philosophy, within which power and associated tests are defined? Continue reading

Categories: power, reforming the reformers | 17 Comments

“Fusion-Confusion?” My Discussion of Nancy Reid: “BFF Four- Are we Converging?”


Here are the slides from my discussion of Nancy Reid today at BFF4: The Fourth Bayesian, Fiducial, and Frequentist Workshop: May 1-3, 2017 (hosted by Harvard University)

Categories: Bayesian/frequentist, C.S. Peirce, confirmation theory, fiducial probability, Fisher, law of likelihood, Popper | Tags: | 1 Comment

S. Senn: “Automatic for the people? Not quite” (Guest post)

Stephen Senn

Stephen Senn
Head of  Competence Center for Methodology and Statistics (CCMS)
Luxembourg Institute of Health
Twitter @stephensenn

Automatic for the people? Not quite

What caught my eye was the estimable (in its non-statistical meaning) Richard Lehman tweeting about the equally estimable John Ioannidis. For those who don’t know them, the former is a veteran blogger who keeps a very cool and shrewd eye on the latest medical ‘breakthroughs’ and the latter a serial iconoclast of idols of scientific method. This is what Lehman wrote

Ioannidis hits 8 on the Richter scale: … Bayes factors consistently quantify strength of evidence, p is valueless.

Since Ioannidis works at Stanford, which is located in the San Francisco Bay Area, he has every right to be interested in earthquakes but on looking up the paper in question, a faint tremor is the best that I can afford it. I shall now try and explain why, but before I do, it is only fair that I acknowledge the very generous, prompt and extensive help I have been given to understand the paper[1] in question by its two authors Don van Ravenzwaaij and Ioannidis himself. Continue reading

Categories: Bayesian/frequentist, Error Statistics, S. Senn | 18 Comments

The Fourth Bayesian, Fiducial and Frequentist Workshop (BFF4): Harvard U


May 1-3, 2017
Hilles Event Hall, 59 Shepard St., Cambridge, MA

The Department of Statistics is pleased to announce the 4th Bayesian, Fiducial and Frequentist Workshop (BFF4), to be held on May 1-3, 2017 at Harvard University. The BFF workshop series celebrates foundational thinking in statistics and inference under uncertainty. The three-day event will present talks, discussions and panels that feature statisticians and philosophers whose research interests synergize at the interface of their respective disciplines. Confirmed featured speakers include Sir David Cox and Stephen Stigler.

The program will open with a featured talk by Art Dempster and discussion by Glenn Shafer. The featured banquet speaker will be Stephen Stigler. Confirmed speakers include:

Featured Speakers and Discussants: Arthur Dempster (Harvard); Cynthia Dwork (Harvard); Andrew Gelman (Columbia); Ned Hall (Harvard); Deborah Mayo (Virginia Tech); Nancy Reid (Toronto); Susanna Rinard (Harvard); Christian Robert (Paris-Dauphine/Warwick); Teddy Seidenfeld (CMU); Glenn Shafer (Rutgers); Stephen Senn (LIH); Stephen Stigler (Chicago); Sandy Zabell (Northwestern)

Invited Speakers and Panelists: Jim Berger (Duke); Emery Brown (MIT/MGH); Larry Brown (Wharton); David Cox (Oxford; remote participation); Paul Edlefsen (Hutch); Don Fraser (Toronto); Ruobin Gong (Harvard); Jan Hannig (UNC); Alfred Hero (Michigan); Nils Hjort (Oslo); Pierre Jacob (Harvard); Keli Liu (Stanford); Regina Liu (Rutgers); Antonietta Mira (USI); Ryan Martin (NC State); Vijay Nair (Michigan); James Robins (Harvard); Daniel Roy (Toronto); Donald B. Rubin (Harvard); Peter XK Song (Michigan); Gunnar Taraldsen (NUST); Tyler VanderWeele (HSPH); Vladimir Vovk (London); Nanny Wermuth (Chalmers/Gutenberg); Min-ge Xie (Rutgers)

Continue reading

Categories: Announcement, Bayesian/frequentist | 2 Comments

Jerzy Neyman and “Les Miserables Citations” (statistical theater in honor of his birthday)


Neyman April 16, 1894 – August 5, 1981

For my final Jerzy Neyman item, here’s the post I wrote for his birthday last year: 

A local acting group is putting on a short theater production based on a screenplay I wrote:  “Les Miserables Citations” (“Those Miserable Quotes”) [1]. The “miserable” citations are those everyone loves to cite, from their early joint 1933 paper:

We are inclined to think that as far as a particular hypothesis is concerned, no test based upon the theory of probability can by itself provide any valuable evidence of the truth or falsehood of that hypothesis.

But we may look at the purpose of tests from another viewpoint. Without hoping to know whether each separate hypothesis is true or false, we may search for rules to govern our behavior with regard to them, in following which we insure that, in the long run of experience, we shall not be too often wrong. (Neyman and Pearson 1933, pp. 290-1).

In this early paper, Neyman and Pearson were still groping toward the basic concepts of tests–for example, “power” had yet to be coined. Taken out of context, these quotes have led to knee-jerk (behavioristic) interpretations which neither Neyman nor Pearson would have accepted. What was the real context of those passages? Well, the paper opens, just five paragraphs earlier, with a discussion of a debate between two French probabilists—Joseph Bertrand, author of “Calculus of Probabilities” (1907), and Emile Borel, author of “Le Hasard” (1914)! According to Neyman, what served “as an inspiration to Egon S. Pearson and myself in our effort to build a frequentist theory of testing hypotheses”(1977, p. 103) initially grew out of remarks of Borel, whose lectures Neyman had attended in Paris. He returns to the Bertrand-Borel debate in four different papers, and circles back to it often in his talks with his biographer, Constance Reid. His student Erich Lehmann (1993), regarded as the authority on Neyman, wrote an entire paper on the topic: “The Bertrand-Borel Debate and the Origins of the Neyman Pearson Theory”. Continue reading

Categories: E.S. Pearson, Neyman, Statistics | Leave a comment

Neyman: Distinguishing tests of statistical hypotheses and tests of significance might have been a lapse of someone’s pen


April 16, 1894 – August 5, 1981

I’ll continue to post Neyman-related items this week in honor of his birthday. This isn’t the only paper in which Neyman makes it clear he denies a distinction between a test of  statistical hypotheses and significance tests. He and E. Pearson also discredit the myth that the former is only allowed to report pre-data, fixed error probabilities, and are justified only by dint of long-run error control. Controlling the “frequency of misdirected activities” in the midst of finding something out, or solving a problem of inquiry, on the other hand, are epistemological goals. What do you think?

Tests of Statistical Hypotheses and Their Use in Studies of Natural Phenomena
by Jerzy Neyman

ABSTRACT. Contrary to ideas suggested by the title of the conference at which the present paper was presented, the author is not aware of a conceptual difference between a “test of a statistical hypothesis” and a “test of significance” and uses these terms interchangeably. A study of any serious substantive problem involves a sequence of incidents at which one is forced to pause and consider what to do next. In an effort to reduce the frequency of misdirected activities one uses statistical tests. The procedure is illustrated on two examples: (i) Le Cam’s (and associates’) study of immunotherapy of cancer and (ii) a socio-economic experiment relating to low-income homeownership problems.

I recommend, especially, the example on home ownership. Here are two snippets: Continue reading

Categories: Error Statistics, Neyman, Statistics | Tags: | 2 Comments

A. Spanos: Jerzy Neyman and his Enduring Legacy

Today is Jerzy Neyman’s birthday. I’ll post various Neyman items this week in honor of it, starting with a guest post by Aris Spanos. Happy Birthday Neyman!

A. Spanos

A Statistical Model as a Chance Mechanism
Aris Spanos 

Jerzy Neyman (April 16, 1894 – August 5, 1981), was a Polish/American statistician[i] who spent most of his professional career at the University of California, Berkeley. Neyman is best known in statistics for his pioneering contributions in framing the Neyman-Pearson (N-P) optimal theory of hypothesis testing and his theory of Confidence Intervals. (This article was first posted here.)

Neyman: 16 April 1894 – 5 Aug 1981

One of Neyman’s most remarkable, but least recognized, achievements was his adapting of Fisher’s (1922) notion of a statistical model to render it pertinent for  non-random samples. Fisher’s original parametric statistical model Mθ(x) was based on the idea of ‘a hypothetical infinite population’, chosen so as to ensure that the observed data x0:=(x1,x2,…,xn) can be viewed as a ‘truly representative sample’ from that ‘population’:

“The postulate of randomness thus resolves itself into the question, ‘Of what population is this a random sample?’ (ibid., p. 313), underscoring that ‘the adequacy of our choice may be tested a posteriori.’ (p. 314)” Continue reading

Categories: Neyman, Spanos | Leave a comment

If you’re seeing limb-sawing in P-value logic, you’re sawing off the limbs of reductio arguments

I was just reading a paper by Martin and Liu (2014) in which they allude to the “questionable logic of proving H0 false by using a calculation that assumes it is true” (p. 1704). They say they seek to define a notion of “plausibility” that

“fits the way practitioners use and interpret p-values: a small p-value means H0 is implausible, given the observed data,” but they seek “a probability calculation that does not require one to assume that H0 is true, so one avoids the questionable logic of proving H0 false by using a calculation that assumes it is true“(Martin and Liu 2014, p. 1704).

Questionable? A very standard form of argument is a reductio (ad absurdum) wherein a claim C is inferred (i.e., detached) by falsifying ~C, that is, by showing that assuming ~C entails something in conflict with (if not logically contradicting) known results or known truths [i]. Actual falsification in science is generally a statistical variant of this argument. Supposing H0 in p-value reasoning plays the role of ~C. Yet some aver it thereby “saws off its own limb”! Continue reading

Categories: P-values, reforming the reformers, Statistics | 13 Comments


3 years ago...


MONTHLY MEMORY LANE: 3 years ago: March 2014. I mark in red three posts from each month that seem most apt for general background on key issues in this blog, excluding those reblogged recently[1], and in green up to 4 others I’d recommend[2].  Posts that are part of a “unit” or a group count as one. 3/19 and 3/17 are one, as are  3/19, 3/12 and 3/4, and the 6334 items 3/11, 3/22 and 3/26. So that covers nearly all the posts!

March 2014


  • (3/1) Cosma Shalizi gets tenure (at last!) (metastat announcement)
  • (3/2) Significance tests and frequentist principles of evidence: Phil6334 Day #6
  • (3/3) Capitalizing on Chance (ii)
  • (3/4) Power, power everywhere–(it) may not be what you think! [illustration]
  • (3/8) Msc kvetch: You are fully dressed (even under your clothes)?
  • (3/8) Fallacy of Rejection and the Fallacy of Nouvelle Cuisine
  • (3/11) Phil6334 Day #7: Selection effects, the Higgs and 5 sigma, Power
  • (3/12) Get empowered to detect power howlers
  • (3/15) New SEV calculator (guest app: Durvasula)
  • (3/17) Stephen Senn: “Delta Force: To what extent is clinical relevance relevant?” (Guest Post)



  • (3/19) Power taboos: Statue of Liberty, Senn, Neyman, Carnap, Severity
  • (3/22) Fallacies of statistics & statistics journalism, and how to avoid them: Summary & Slides Day #8 (Phil 6334)
  • (3/25) The Unexpected Way Philosophy Majors Are Changing The World Of Business


  • (3/26) Phil6334: Misspecification Testing: Ordering From A Full Diagnostic Menu (part 1)
  • (3/28) Severe osteometric probing of skeletal remains: John Byrd
  • (3/29) Winner of the March 2014 palindrome contest (rejected post)
  • (3/30) Phil6334: March 26, philosophy of misspecification testing (Day #9 slides)

[1] Monthly memory lanes began at the blog’s 3-year anniversary in Sept. 2014.

[2] New Rule, July 30, 2016, March 30, 2017 (moved to 4): a very convenient way to allow data-dependent choices.






Categories: 3-year memory lane, Error Statistics, Statistics | Leave a comment

Announcement: Columbia Workshop on Probability and Learning (April 8)

I’m speaking on “Probing with Severity” at the “Columbia Workshop on Probability and Learning” on April 8:

Meetings of the Formal Philosophy Group at Columbia

April 8, 2017

Department of Philosophy, Columbia University

Room 716
Philosophy Hall, 1150 Amsterdam Avenue
New York 10027
United States


  • The Formal Philosophy Group (Columbia)

Main speakers:

Gordon Belot (University of Michigan, Ann Arbor)

Simon Huttegger (University of California, Irvine)

Deborah Mayo (Virginia Tech)

Teddy Seidenfeld (Carnegie Mellon University)


Michael Nielsen (Columbia University)

Rush Stewart (Columbia University)


Unfortunately, access to Philosophy Hall is by swipe only on the weekends. However, students and faculty will be entering and exiting the building throughout the day (with relatively high frequency, since there is a popular cafe on the main floor).

Categories: Announcement | Leave a comment

Er, about those other approaches, hold off until a balanced appraisal is in

I could have told them that the degree of accordance enabling the ASA’s “6 principles” on p-values was unlikely to be replicated when it came to most of the “other approaches” with which some would supplement or replace significance tests, notably Bayesian updating, Bayes factors, or likelihood ratios (confidence intervals are dual to hypothesis tests). [My commentary is here.] So now they may be advising a “hold off” or “go slow” approach until some consilience is achieved. Is that it? I don’t know. I was tweeted an article about the background chatter taking place behind the scenes; I wasn’t one of the people interviewed for this. Here are some excerpts; I may add more later after it has had time to sink in. (Check back later.)

“Reaching for Best Practices in Statistics: Proceed with Caution Until a Balanced Critique Is In”

J. Hossiason

“[A]ll of the other approaches*, as well as most statistical tools, may suffer from many of the same problems as the p-values do. What level of likelihood ratio in favor of the research hypothesis will be acceptable to the journal? Should scientific discoveries be based on whether posterior odds pass a specific threshold (P3)? Does either measure the size of an effect (P5)?…How can we decide about the sample size needed for a clinical trial—however analyzed—if we do not set a specific bright-line decision rule? 95% confidence intervals or credence intervals…offer no protection against selection when only those that do not cover 0 are selected into the abstract (P4).” (Benjamini, ASA commentary, pp. 3-4)

What’s sauce for the goose is sauce for the gander, right? Many statisticians seconded George Cobb, who urged “the board to set aside time at least once every year to consider the potential value of similar statements” to the recent ASA p-value report. Disappointingly, a preliminary survey of leaders in statistics, many from the original p-value group, aired striking disagreements on best and worst practices with respect to these other approaches. The Executive Board is contemplating a variety of recommendations, minimally, that practitioners move with caution until they can put forward at least a few agreed-upon principles for interpreting and applying Bayesian inference methods. The words we heard ranged from “go slow” to “moratorium” [emphasis mine]. Having been privy to some of the results of this survey, we at Stat Report Watch decided to contact some of the individuals involved. Continue reading
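Benjamini’s selection point (P4) is easy to check numerically. A toy simulation (my illustration, with made-up parameters, not anything from the survey): generate many studies whose true effect is exactly 0, and report only those whose 95% CI excludes 0. The reported estimates are then biased away from 0 by construction:

```python
import random

def ci_selection_demo(n_studies=10000, n=25, seed=1):
    """Selection effect on 95% CIs when the true effect is 0.

    Each 'study' draws a sample mean with sigma = 1 and sample
    size n; only studies whose CI excludes 0 are 'selected into
    the abstract'. Returns the selected share and the smallest
    selected |estimate|.
    """
    random.seed(seed)
    half_width = 1.96 / n ** 0.5   # 95% CI half-width for sigma = 1
    selected = []
    for _ in range(n_studies):
        xbar = random.gauss(0, 1 / n ** 0.5)  # true effect is 0
        if abs(xbar) > half_width:            # CI excludes 0 -> selected
            selected.append(abs(xbar))
    return len(selected) / n_studies, min(selected)

share, smallest = ci_selection_demo()
# share is near the nominal 0.05; every selected |estimate|
# exceeds the CI half-width, so the reported effects are all inflated
```

About 5% of null studies get selected, and every one of them reports an effect at least 1.96 standard errors from the truth, which is exactly why unselected CIs “offer no protection against selection.”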

Categories: P-values, reforming the reformers, Statistics | 6 Comments

Slides from the Boston Colloquium for Philosophy of Science: “Severe Testing: The Key to Error Correction”

Slides from my March 17 presentation on “Severe Testing: The Key to Error Correction” given at the Boston Colloquium for Philosophy of Science Alfred I. Taub forum on “Understanding Reproducibility and Error Correction in Science.”


Categories: fallacy of rejection, Fisher, fraud, frequentist/Bayesian, Likelihood Principle, reforming the reformers | 16 Comments

BOSTON COLLOQUIUM FOR PHILOSOPHY OF SCIENCE: Understanding Reproducibility & Error Correction in Science


57th Annual Program

Download the 57th Annual Program

The Alfred I. Taub forum:


Cosponsored by GMS and BU’s BEST at Boston University.
Friday, March 17, 2017
1:00 p.m. – 5:00 p.m.
The Terrace Lounge, George Sherman Union
775 Commonwealth Avenue

  • Reputation, Variation, & Control: Historical Perspectives
    Jutta Schickore History and Philosophy of Science & Medicine, Indiana University, Bloomington.
  • Crisis in Science: Time for Reform?
    Arturo Casadevall Molecular Microbiology & Immunology, Johns Hopkins
  • Severe Testing: The Key to Error Correction
    Deborah Mayo Philosophy, Virginia Tech
  • Replicate That… Maintaining a Healthy Failure Rate in Science
    Stuart Firestein Biological Sciences, Columbia



Categories: Announcement, Statistical fraudbusting, Statistics | Leave a comment

The ASA Document on P-Values: One Year On


I’m surprised it’s a year already since posting my published comments on the ASA Document on P-Values. Since then, there have been a slew of papers rehearsing the well-worn fallacies of tests (a tad more than the usual rate). Doubtless, the P-value Pow Wow raised people’s consciousness. I’m interested in hearing reader reactions/experiences in connection with the P-Value project (positive and negative) over the past year. (Use the comments, share links to papers, and/or send me something slightly longer for a possible guest post.)
Some people sent me a diagram from a talk by Stephen Senn (on “P-values and the art of herding cats”). He presents an array of different cat commentators, and for some reason Mayo cat is in the middle but way over on the left side, near the wall. I never got the key to interpretation. My contribution is below:

Chart by S. Senn

“Don’t Throw Out The Error Control Baby With the Bad Statistics Bathwater”

D. Mayo*[1]

The American Statistical Association is to be credited with opening up a discussion into p-values; now an examination of the foundations of other key statistical concepts is needed. Continue reading

Categories: Bayesian/frequentist, P-values, science communication, Statistics, Stephen Senn | 14 Comments


3 years ago…

MONTHLY MEMORY LANE: 3 years ago: February 2014. I normally mark in red three posts from each month that seem most apt for general background on key issues in this blog, but I decided just to list these as they are (some are from a seminar I taught with Aris Spanos 3 years ago; several on Fisher were recently reblogged). I hope you find something of interest!    

February 2014

  • (2/1) Comedy hour at the Bayesian (epistemology) retreat: highly probable vs highly probed (vs B-boosts)
  • (2/3) PhilStock: Bad news is bad news on Wall St. (rejected post)
  • (2/5) “Probabilism as an Obstacle to Statistical Fraud-Busting” (draft iii)
  • (2/9) Phil6334: Day #3: Feb 6, 2014
  • (2/10) Is it true that all epistemic principles can only be defended circularly? A Popperian puzzle
  • (2/12) Phil6334: Popper self-test
  • (2/13) Phil 6334 Statistical Snow Sculpture
  • (2/14) January Blog Table of Contents
  • (2/15) Fisher and Neyman after anger management?
  • (2/17) R. A. Fisher: how an outsider revolutionized statistics
  • (2/18) Aris Spanos: The Enduring Legacy of R. A. Fisher
  • (2/20) R.A. Fisher: ‘Two New Properties of Mathematical Likelihood’
  • (2/21) STEPHEN SENN: Fisher’s alternative to the alternative
  • (2/22) Sir Harold Jeffreys’ (tail-area) one-liner: Sat night comedy [draft ii]
  • (2/24) Phil6334: February 20, 2014 (Spanos): Day #5
  • (2/26) Winner of the February 2014 palindrome contest (rejected post)
  • (2/26) Phil6334: Feb 24, 2014: Induction, Popper and pseudoscience (Day #4)



Categories: 3-year memory lane, Statistics | 2 Comments

R.A. Fisher: “It should never be true, though it is still often said, that the conclusions are no more accurate than the data on which they are based”



A final entry in a week of recognizing R.A. Fisher (February 17, 1890 – July 29, 1962). Fisher is among the very few thinkers I have come across who recognized this crucial difference between induction and deduction:

In deductive reasoning all knowledge obtainable is already latent in the postulates. Rigour is needed to prevent the successive inferences growing less and less accurate as we proceed. The conclusions are never more accurate than the data. In inductive reasoning we are performing part of the process by which new knowledge is created. The conclusions normally grow more and more accurate as more data are included. It should never be true, though it is still often said, that the conclusions are no more accurate than the data on which they are based. Statistical data are always erroneous, in greater or less degree. The study of inductive reasoning is the study of the embryology of knowledge, of the processes by means of which truth is extracted from its native ore in which it is infused with much error. (Fisher, “The Logic of Inductive Inference,” 1935, p. 54)

Reading/rereading this paper is very worthwhile for interested readers. Some of the fascinating historical/statistical background may be found in a guest post by Aris Spanos: “R.A. Fisher: How an Outsider Revolutionized Statistics”

Categories: Fisher, phil/history of stat | 30 Comments

Guest Blog: STEPHEN SENN: ‘Fisher’s alternative to the alternative’

“You May Believe You Are a Bayesian But You Are Probably Wrong”


As part of the week of recognizing R.A. Fisher (February 17, 1890 – July 29, 1962), I reblog a guest post by Stephen Senn from 2012. (I will comment in the comments.)

‘Fisher’s alternative to the alternative’

By: Stephen Senn

[2012 marked] the 50th anniversary of RA Fisher’s death. It is a good excuse, I think, to draw attention to an aspect of his philosophy of significance testing. In his extremely interesting essay on Fisher, Jimmie Savage drew attention to a problem in Fisher’s approach to testing. In describing Fisher’s aversion to power functions Savage writes, ‘Fisher says that some tests are more sensitive than others, and I cannot help suspecting that that comes to very much the same thing as thinking about the power function.’ (Savage 1976, p. 473)

The modern statistician, however, has an advantage here denied to Savage. Savage’s essay was published posthumously in 1976 and the lecture on which it was based was given in Detroit on 29 December 1971 (p. 441). At that time Fisher’s scientific correspondence did not form part of his available oeuvre but in 1990 Henry Bennett’s magnificent edition of Fisher’s statistical correspondence (Bennett 1990) was published and this throws light on many aspects of Fisher’s thought including on significance tests. Continue reading

Categories: Fisher, S. Senn, Statistics | 13 Comments
