“Nonsignificance Plus High Power Does Not Imply Support for the Null Over the Alternative.”


Seeing the world through overly rosy glasses

Taboos about power nearly always stem from misuse of power analysis. Sander Greenland (2012) has a paper called “Nonsignificance Plus High Power Does Not Imply Support for the Null Over the Alternative.”  I’m not saying Greenland errs; the error would be made by anyone who interprets power analysis in a manner giving rise to Greenland’s objection. So what’s (ordinary) power analysis?

(I) Listen to Jacob Cohen (1988) introduce Power Analysis

“PROVING THE NULL HYPOTHESIS. Research reports in the literature are frequently flawed by conclusions that state or imply that the null hypothesis is true. For example, following the finding that the difference between two sample means is not statistically significant, instead of properly concluding from this failure to reject the null hypothesis that the data do not warrant the conclusion that the population means differ, the writer concludes, at least implicitly, that there is no difference. The latter conclusion is always strictly invalid, and is functionally invalid as well unless power is high. The high frequency of occurrence of this invalid interpretation can be laid squarely at the doorstep of the general neglect of attention to statistical power in the training of behavioral scientists.

What is really intended by the invalid affirmation of a null hypothesis is not that the population ES is literally zero, but rather that it is negligible, or trivial. This proposition may be validly asserted under certain circumstances. Consider the following: for a given hypothesis test, one defines a numerical value i (or iota) for the ES, where i is so small that it is appropriate in the context to consider it negligible (trivial, inconsequential). Power (1 – b) is then set at a high value, so that b is relatively small. When, additionally, a is specified, n can be found. Now, if the research is performed with this n and it results in nonsignificance, it is proper to conclude that the population ES is no more than i, i.e., that it is negligible; this conclusion can be offered as significant at the b level specified. In much research, “no” effect (difference, correlation) functionally means one that is negligible; “proof” by statistical induction is probabilistic. Thus, in using the same logic as that with which we reject the null hypothesis with risk equal to a, the null hypothesis can be accepted in preference to that which holds that ES = i with risk equal to b. Since i is negligible, the conclusion that the population ES is not as large as i is equivalent to concluding that there is “no” (nontrivial) effect. This comes fairly close and is functionally equivalent to affirming the null hypothesis with a controlled error rate (b), which, as noted above, is what is actually intended when null hypotheses are incorrectly affirmed (J. Cohen 1988, p. 16).

Here Cohen imagines the researcher sets the size of a negligible discrepancy ahead of time–something not always available. Even where a negligible i may be specified, the power to detect that i may be low and not high. Two important points can still be made:

  • First, Cohen doesn’t instruct you to infer there’s no discrepancy from H0, merely that it’s “no more than i”.
  • Second, even if your test doesn’t have high power to detect negligible i, you can infer the population discrepancy is less than whatever γ your test does have high power to detect (given nonsignificance).
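Cohen's recipe (fix a negligible i, choose α and the power 1 – β, then solve for n) can be sketched numerically. What follows is only a minimal Python sketch, assuming a one-sided Normal test of a mean with known σ; the function name `required_n` and the inputs are illustrative assumptions, not anything Cohen specifies.

```python
from math import ceil
from statistics import NormalDist


def required_n(i, sigma, alpha, power):
    """Smallest n giving at least `power` to detect a discrepancy i.

    Assumes a one-sided Normal test of a mean with known sigma
    (an assumption made purely for illustration).
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha)  # cut-off in standard units
    z_beta = NormalDist().inv_cdf(power)       # distance from cut-off to alternative
    return ceil(((z_alpha + z_beta) * sigma / i) ** 2)


# Hypothetical inputs: i = 0.6, sigma = 2, alpha = .025, power = .84
n = required_n(0.6, 2.0, 0.025, 0.84)
print(n)
```

With the z-values rounded to 2 and 1, the same formula gives n = (3 × 2/0.6)² = 100.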

Now to tell what’s true about Greenland’s concern that “Nonsignificance Plus High Power Does Not Imply Support for the Null Over the Alternative.”

(II) The first step is to understand the assertion, giving the most generous interpretation. It deals with nonsignificance, so our ears are perked for a fallacy of nonrejection or nonsignificance. We know that “high power” is an incomplete concept, so he clearly means high power against “the alternative”.

For a simple example of Greenland’s phenomenon, consider the Normal test we’ve discussed a lot on this blog. Let T+: H0: µ ≤ 12 versus H1: µ > 12, σ = 2, n = 100. The test statistic Z is √100(M – 12)/2, where M is the sample mean. With α = .025, the cut-off for declaring .025 significance is M*.025 = 12 + 2(2)/√100 = 12.4 (rounding to 2 rather than 1.96 to use a simple Figure below).
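The cut-off arithmetic can be checked in a few lines. This is a sketch using only Python's standard library; the variable names are mine, and it shows both the exact 1.96 cut-off and the rounded one used in this post.

```python
from math import sqrt
from statistics import NormalDist

mu0, sigma, n, alpha = 12.0, 2.0, 100, 0.025

sigma_m = sigma / sqrt(n)                   # standard error of M: 0.2
z_alpha = NormalDist().inv_cdf(1 - alpha)   # about 1.96
cutoff_exact = mu0 + z_alpha * sigma_m      # about 12.392
cutoff_rounded = mu0 + 2 * sigma_m          # 12.4, the value used in the post

# Actual Type 1 error probability when the rounded cut-off is used
alpha_rounded = 1 - NormalDist(mu0, sigma_m).cdf(cutoff_rounded)

print(round(cutoff_exact, 3), round(cutoff_rounded, 3), round(alpha_rounded, 4))
```

Rounding 1.96 up to 2 makes the test very slightly more stringent than the nominal .025 level.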

[Note: The thick black vertical line in the Figure, which I haven’t gotten to yet, is going to be the observed mean, M0 = 12.35. It’s a bit lower than the cut-off at 12.4.]

Now a title like Greenland’s is supposed to signal some problem. What is it? The statistical part just boils down to noting that the observed mean M0 (e.g., 12.35) may fail to make it to the cut-off M* (here 12.4), and yet be closer to an alternative against which the test has high power (e.g., 12.6) than it is to the null value, here 12. This happens because the Type 2 error probability is allowed to be greater than the Type 1 error probability (here .025).

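[Figure: sampling distribution of M under µ = 12.6 (red curve), with the cut-off M* = 12.4 and the observed mean M0 = 12.35 (thick black vertical line).]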

Abbreviate the alternative against which the test T+ has .84 power as µ.84, as I’ve often done. (See, for example, this post.) That is, the probability that test T+ rejects the null when µ = µ.84 is .84, i.e., POW(T+, µ.84) = .84. One of our power short-cut rules tells us:

µ.84 = M* + 1σM = 12.4 + .2 = 12.6,

where σM := σ/√100 = 2/10 = .2.

Note: the Type 2 error probability in relation to alternative µ = 12.6 is .16. This is the area to the left of 12.4 under the red curve above. Pr(M < 12.4; μ = 12.6) = Pr(Z < -1) = .16 = β(12.6).
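The power and Type 2 error figures can be reproduced directly. A minimal sketch, assuming the rounded cut-off 12.4 and σM = .2 from above; the function name `power` is my own label.

```python
from statistics import NormalDist

SIGMA_M = 0.2   # sigma / sqrt(n) = 2 / 10
CUTOFF = 12.4   # rounded cut-off M* of test T+


def power(mu_alt):
    """Pr(M > M*; mu = mu_alt): probability that T+ rejects under mu_alt."""
    return 1 - NormalDist(mu_alt, SIGMA_M).cdf(CUTOFF)


pow_126 = power(12.6)     # about .84
beta_126 = 1 - pow_126    # Type 2 error probability, about .16

print(round(pow_126, 2), round(beta_126, 2))
```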

µ.84 exceeds the null value by 3σM, so any observed mean that exceeds 12 by more than 1.5σM but less than 2σM gives an example of Greenland’s phenomenon. [Note: the previous sentence corrects an earlier wording.] In T+, values 12.3 < M0 < 12.4 do the job. Pick M0 = 12.35. That value is indicated by the black vertical line in the figure above.
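The range 12.3 < M0 < 12.4 can be checked mechanically. In this sketch the helper `shows_phenomenon` is a hypothetical name, and the likelihood ratio at the end is my own addition, offered as one possible way of cashing out "closer to the alternative", not something computed in the post.

```python
from statistics import NormalDist

MU_NULL, MU_ALT = 12.0, 12.6   # null value and mu_.84
SIGMA_M, CUTOFF = 0.2, 12.4    # standard error of M and rounded cut-off


def shows_phenomenon(m_obs):
    """Nonsignificant, yet closer to the high-power alternative than to the null."""
    nonsignificant = m_obs < CUTOFF
    closer_to_alt = abs(m_obs - MU_ALT) < abs(m_obs - MU_NULL)
    return nonsignificant and closer_to_alt


m0 = 12.35
# Ratio of the densities of M at m0 under mu = 12.6 versus mu = 12
lr = NormalDist(MU_ALT, SIGMA_M).pdf(m0) / NormalDist(MU_NULL, SIGMA_M).pdf(m0)

print(shows_phenomenon(m0), round(lr, 2))
```

Values at or below the midpoint 12.3, or at or above the cut-off 12.4, fail one of the two conditions.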

Having established the phenomenon, your next question is: so what?

It would be problematic if power analysis took the insignificant result as evidence for μ = 12 (i.e., 0 discrepancy from the null). I’ve no doubt some try to construe it as such, and that Greenland has been put in the position of needing to correct them. This is the reverse of the “mountains out of molehills” fallacy. It’s making molehills out of mountains. It’s not uncommon when a nonsignificant observed risk increase is taken as evidence that risks are “negligible or nonexistent” or the like. The data are looked at through overly rosy glasses (or bottle). Power analysis enters to avoid taking no evidence of increased risk as evidence of no risk. Its reasoning only licenses μ < µ.84 where .84 was chosen for “high power”. From what we see in Cohen, he does not give a green light to the fallacious use of power analysis.

(III) Now for how the inference from power analysis is akin to significance testing (as Cohen observes). Let μ1−β be the alternative against which test T+ has high power, (1 – β). Power analysis sanctions the inference that would accrue if we switched the null and alternative, yielding the one-sided test in the opposite direction, T-, we might call it. That is, T- tests H0: μ ≥ μ1−β versus H1: μ < μ1−β at the β level. The test rejects H0 (at level β) when M < μ1−β – zβσM. Such a significant result would warrant inferring μ < μ1−β at significance level β. Using power analysis doesn’t require making this switcheroo, which might seem complicated. The point is that there’s really no new reasoning involved in power analysis, which is why the members of the Fisherian tribe manage it without even mentioning power.

EXAMPLE. Use μ.84 in test T+ (α = .025, n = 100, σ = 2, so σM = .2) to create test T-. Test T+ has .84 power against μ.84 = 12 + 3σM = 12.6 (with our usual rounding). So, test T- is

H0: μ ≥ 12.6 versus H1: μ < 12.6

and a result is statistically significantly smaller than 12.6 at level .16 whenever sample mean M < 12.6 – 1σM = 12.4. To check, note (as when computing the Type 2 error probability of test T+) that

Pr(M < 12.4; μ = 12.6) = Pr(Z < -1) = .16 = β. In test T-, this serves as the Type 1 error probability.
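The switched test T- can be checked the same way. A minimal sketch, with variable names of my own choosing and zβ rounded to 1 as in the example:

```python
from statistics import NormalDist

SIGMA_M = 0.2
MU_ALT = 12.6    # mu_.84, which serves as the null value of T-
Z_BETA = 1.0     # rounded, as in the post

# T- rejects H0: mu >= 12.6 when M falls below this value
cutoff_minus = MU_ALT - Z_BETA * SIGMA_M

# Type 1 error probability of T-: Pr(M < cut-off; mu = 12.6)
level = NormalDist(MU_ALT, SIGMA_M).cdf(cutoff_minus)

m0 = 12.35
rejects = m0 < cutoff_minus   # is the observed mean significantly below 12.6?

print(round(cutoff_minus, 2), round(level, 2), rejects)
```

The rejection region of T- (M < 12.4) is exactly the nonrejection region of T+, which is why the Type 1 error probability here equals β(12.6).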

So ordinary power analysis follows the identical logic as significance testing. [i] Here’s a qualitative version of the logic of ordinary power analysis.

Ordinary Power Analysis: If data x are not statistically significantly different from H0, and the power to detect discrepancy γ is high, then x indicates that the actual discrepancy is no greater than γ.[ii]

Or, another way to put this:

If data x are not statistically significantly different from H0, then x indicates that the underlying discrepancy (from H0) is no greater than γ, just to the extent that the power to detect discrepancy γ is high.


[i] Neyman, we’ve seen, was an early power analyst. See, for example, this post.

[ii] Compare power analytic reasoning with severity reasoning from a negative or insignificant result.

POWER ANALYSIS: If Pr(d > cα; µ’) = high and the result is not significant, then it’s evidence µ < µ’

SEVERITY ANALYSIS (for an insignificant result): If Pr(d > d0; µ’) = high and the result is not significant, then it’s evidence µ < µ’.

Severity replaces the pre-designated cut-off cα with the observed d0. Thus we obtain the same result remaining in the Fisherian tribe. We still abide by power analysis though, since if Pr(d > cα; µ’) = high then Pr(d > d0; µ’) = high for a nonsignificant d0 (d0 ≤ cα), at least in a sensible test like T+. In other words, power analysis is conservative. It gives a sufficient but not a necessary condition for warranting bound: µ < µ’. But why view a miss as good as a mile? Power is too coarse.
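The comparison between the two assessments can be made concrete for test T+ with the observed mean M0 = 12.35. This is a sketch only; the function names `power` and `severity` are my labels for Pr(M > M*; µ’) and Pr(M > M0; µ’), and the grid of µ’ values is arbitrary.

```python
from statistics import NormalDist

SIGMA_M = 0.2   # sigma / sqrt(n) = 2 / 10
CUTOFF = 12.4   # rounded cut-off of test T+


def power(mu_alt):
    """Pr(M > cut-off; mu_alt): uses the pre-designated cut-off."""
    return 1 - NormalDist(mu_alt, SIGMA_M).cdf(CUTOFF)


def severity(m_obs, mu_alt):
    """Pr(M > observed mean; mu_alt), for the inference mu < mu_alt."""
    return 1 - NormalDist(mu_alt, SIGMA_M).cdf(m_obs)


m0 = 12.35   # nonsignificant: below the cut-off
for mu_alt in (12.4, 12.5, 12.6, 12.7):
    print(mu_alt, round(power(mu_alt), 3), round(severity(m0, mu_alt), 3))
```

Because M0 lies below the cut-off, the severity figure is never smaller than the power figure for the same µ’, and it grows as M0 moves further below the cut-off, which is information that power alone ignores.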


Cohen, J. 1988. Statistical Power Analysis for the Behavioral Sciences. 2nd ed. Hillsdale, NJ: Erlbaum. [Link to quote above: p. 16]

Greenland, S. 2012. ‘Nonsignificance Plus High Power Does Not Imply Support for the Null Over the Alternative’, Annals of Epidemiology 22, pp. 364-8. Link to paper: Greenland (2012)














Categories: Cohen, Greenland, power, Statistics | 9 Comments

Philosophy and History of Science Announcements



2016 UK-EU Foundations of Physics Conference

Start Date: 16 July 2016

  • The 6th conference of the European Philosophy of Science Association

    Start Date: September 6, 2016
    End Date: September 9, 2016
    Location: University of Exeter, UK
    Website: http://www.philsci.eu/epsa17

  • 2017 Essay Prize: What is Structure?

    Submission Deadline: December 16, 2016
    Flyer: Structure.pdf

  • L’Inconscio Italian Journal of philosophy and psychoanalysis

    Submission Deadline: September 5, 2016
    Flyer: CFPLinconscio_ENG.pdf

Upcoming Deadlines

  • International Ontology Congress

    Submission Deadline: July 17, 2016
    Start Date: October 3, 2016
    End Date: October 7, 2016
    Location: San Sebastian, Spain
    Flyer: Flier-XIIInternationalOntologyCongress.pdf

  • Boulder Conference on the History and Philosophy of Science
    Submission deadline: August 1, 2016
    Conference date(s): October 28, 2016 – October 30, 2016
    Organizer: Allan Franklin

    Gravity: Its History and Philosophy
    University of Colorado at Boulder

    Invited speakers

    Peter Saulson, Syracuse University, LIGO
    Michel Janssen, University of Minnesota
    Peter Bender, JILA, University of Colorado

     The conference topic is gravity from antiquity to the present. Historical and philosophical papers on both theory and experiment are welcome. 


Events of Interest

Categories: Announcement | Leave a comment


3 years ago...


MONTHLY MEMORY LANE: 3 years ago: June 2013. I mark in red three posts that seem most apt for general background on key issues in this blog, excluding those reblogged recently [1]. Posts that are part of a “unit” or a group of “U-Phils” (you [readers] philosophize) count as one. Here I grouped 6/5 and 6/6.

June 2013

  • (6/1) Winner of May Palindrome Contest
  • (6/1) Some statistical dirty laundry* (recently reblogged)
  • (6/5) Do CIs Avoid Fallacies of Tests? Reforming the Reformers (6/5 and 6/6 are paired as one)
  • (6/6) PhilStock: Topsy-Turvy Game
  • (6/6) Anything Tests Can do, CIs do Better; CIs Do Anything Better than Tests?* (reforming the reformers cont.)
  • (6/8) Richard Gill: “Integrity or fraud… or just questionable research practices?”* (recently reblogged)
  • (6/11) Mayo: comment on the repressed memory research [How a conceptual criticism, requiring no statistics, might go.]
  • (6/14) P-values can’t be trusted except when used to argue that p-values can’t be trusted!
  • (6/19) PhilStock: The Great Taper Caper
  • (6/19) Stanley Young: better p-values through randomization in microarrays
  • (6/22) What do these share in common: m&ms, limbo stick, ovulation, Dale Carnegie? Sat night potpourri* (recently reblogged)
  • (6/26) Why I am not a “dualist” in the sense of Sander Greenland
  • (6/29) Palindrome “contest” contest
  • (6/30) Blog Contents: mid-year

[1] Monthly memory lanes began at the blog’s 3-year anniversary in Sept, 2014.






Categories: 3-year memory lane, Error Statistics, Statistics | Leave a comment

A. Birnbaum: Statistical Methods in Scientific Inference (May 27, 1923 – July 1, 1976)

Allan Birnbaum: May 27, 1923- July 1, 1976

Allan Birnbaum died 40 years ago today. He lived to be only 53 [i]. From the perspective of philosophy of statistics and philosophy of science, Birnbaum is best known for his work on likelihood, the Likelihood Principle [ii], and for his attempts to blend concepts of likelihood with error probability ideas to arrive at what he termed “concepts of statistical evidence”. Failing to find adequate concepts of statistical evidence, Birnbaum called for joining the work of “interested statisticians, scientific workers and philosophers and historians of science”–an idea I have heartily endorsed. While known for a result that the (strong) Likelihood Principle followed from sufficiency and conditionality principles (a result that Jimmy Savage deemed one of the greatest breakthroughs in statistics), a few years after publishing it, he turned away from it, perhaps discovering gaps in his argument. A post linking to a 2014 Statistical Science issue discussing Birnbaum’s result is here. Reference [5] links to the Synthese 1977 volume dedicated to his memory. The editors describe it as their way of “paying homage to Professor Birnbaum’s penetrating and stimulating work on the foundations of statistics”. Ample weekend reading! Continue reading

Categories: Birnbaum, Likelihood Principle, phil/history of stat, Statistics | Tags: | 62 Comments

Richard Gill: “Integrity or fraud… or just questionable research practices?” (Is Gill too easy on them?)

Professor Gill


Professor Richard Gill
Statistics Group
Mathematical Institute
Leiden University

It was statistician Richard Gill who first told me about Diederik Stapel (see an earlier post on Diederik). We were at a workshop on Error in the Sciences at Leiden in 2011. I was very lucky to have Gill be assigned as my commentator/presenter—he was excellent! As I was explaining some data problems to him, he suddenly said, “Some people don’t bother to collect data at all!” That’s when I learned about Stapel.

Committees often turn to Gill when someone’s work is up for scrutiny of bad statistics or fraud, or anything in between. Do you think he’s being too easy on researchers when he says, about a given case:

“data has been obtained by some combination of the usual ‘questionable research practices’ [QRPs] which are prevalent in the field in question. Everyone does it this way, in fact, if you don’t, you’d never get anything published. …People are not deliberately cheating: they honestly believe in their theories and believe the data is supporting them.”

Isn’t that the danger in relying on deeply felt background beliefs?  Have our attitudes changed (toward QRPs) over the past 3 years (harsher or less harsh)? Here’s a talk of his I blogged 3 years ago (followed by a letter he allowed me to post). I reflect on the pseudoscientific nature of the ‘recovered memories’ program in one of the Geraerts et al. papers in a later post. Continue reading

Categories: 3-year memory lane, junk science, Statistical fraudbusting, Statistics | 2 Comments

What do these share in common: m&ms, limbo stick, ovulation, Dale Carnegie? Are we lowering the bar?


For entertainment only

In a post 3 years ago (“What do these share in common: m&m’s, limbo stick, ovulation, Dale Carnegie? Sat night potpourri”), I expressed doubts about expending serious effort to debunk the statistical credentials of studies that most readers without any statistical training would regard as “for entertainment only,” dubious, or pseudoscientific quackery. It needn’t even be that the claim is implausible, what’s implausible is that it has been well probed in the experiment at hand. Given the attention being paid to such examples by some leading statisticians, and scores of replication researchers over the past 3 years–attention that has been mostly worthwhile–maybe the bar has been lowered. What do you think? Anyway, this is what I blogged 3 years ago. (Oh, I decided to put in a home-made cartoon!) Continue reading

Categories: junk science, replication research, Statistics | 2 Comments

Some statistical dirty laundry: have the stains become permanent?



Right after our session at the SPSP meeting last Friday, I chaired a symposium on replication that included Brian Earp–an active player in replication research in psychology (Replication and Evidence: A tenuous relationship p. 80). One of the first things he said, according to my notes, is that gambits such as cherry picking, p-hacking, hunting for significance, selective reporting, and other QRPs, had been taught as acceptable and had become standard practice in psychology, without any special need to adjust p-values or alert the reader to their spuriousness [i]. (He will correct me if I’m wrong[2].) It shocked me to hear it, even though it shouldn’t have, given what I’ve learned about statistical practice in social science. It was the Report on Stapel that really pulled back the curtain on this attitude toward QRPs in social psychology–as discussed in this blogpost 3 years ago. (If you haven’t read Section 5 of the report on flawed science, you should.) Many of us assumed that QRPs, even if still committed, were at least recognized to be bad statistical practices since the time of Morrison and Henkel’s (1970) Significance Test Controversy. A question now is this: have all the confessions of dirty laundry, the fraudbusting of prominent researchers, the pledges to straighten up and fly right, the years of replication research, done anything to remove the stains? I leave the question open for now. Here’s my “statistical dirty laundry” post from 2013: Continue reading

Categories: junk science, reproducibility, spurious p values, Statistics | 2 Comments

Mayo & Parker “Using PhilStat to Make Progress in the Replication Crisis in Psych” SPSP Slides

Here are the slides from our talk at the Society for Philosophy of Science in Practice (SPSP) conference. I covered the first 27, Parker the rest. The abstract is here:

Categories: P-values, reforming the reformers, replication research, Statistics, StatSci meets PhilSci | Leave a comment

“Using PhilStat to Make Progress in the Replication Crisis in Psych” at Society for PhilSci in Practice (SPSP)

I’m giving a joint presentation with Caitlin Parker[1] on Friday (June 17) at the meeting of the Society for Philosophy of Science in Practice (SPSP): “Using Philosophy of Statistics to Make Progress in the Replication Crisis in Psychology” (Rowan University, Glassboro, N.J.)[2] The Society grew out of a felt need to break out of the sterile straightjacket wherein philosophy of science occurs divorced from practice. The topic of the relevance of PhilSci and PhilStat to Sci has often come up on this blog, so people might be interested in the SPSP mission statement below our abstract.

Using Philosophy of Statistics to Make Progress in the Replication Crisis in Psychology

Deborah Mayo Virginia Tech, Department of Philosophy United States
Caitlin Parker Virginia Tech, Department of Philosophy United States

Continue reading

Categories: Announcement, replication research, reproducibility | 8 Comments

“So you banned p-values, how’s that working out for you?” D. Lakens exposes the consequences of a puzzling “ban” on statistical inference



I came across an excellent post on a blog kept by Daniel Lakens: “So you banned p-values, how’s that working out for you?” He refers to the journal that recently banned significance tests, confidence intervals, and a vague assortment of other statistical methods, on the grounds that all such statistical inference tools are “invalid” since they don’t provide posterior probabilities of some sort (see my post). The editors’ charge of “invalidity” could only hold water if these error statistical methods purport to provide posteriors based on priors, which is false. The entire methodology is based on methods in which probabilities arise to qualify the method’s capabilities to detect and avoid erroneous interpretations of data [0]. The logic is of the falsification variety found throughout science. Lakens, an experimental psychologist, does a great job delineating some of the untoward consequences of their inferential ban. I insert some remarks in black. Continue reading

Categories: frequentist/Bayesian, Honorary Mention, P-values, reforming the reformers, science communication, Statistics | 45 Comments

Winner of May 2016 Palindrome Contest: Curtis Williams



Winner of the May 2016 Palindrome contest

Curtis Williams: Inventor, entrepreneur, and professional actor

The winning palindrome (a dialog): 


“Disable preplan?… I, Mon Ami?”


“Calm…Sit, fella.”

“No! I tag. I vandalized Dezi, lad.”

“Navigational leftism lacks aim…a nominal perp: Elba’s id.”

The requirement: A palindrome using “navigate” or “navigation” (and Elba, of course).

Book choice: Error and Inference (D. Mayo & A. Spanos, Cambridge University Press, 2010)



Bio: Curtis Mark Williams is the co-founder of WavHello and the inventor of Bellybuds, who also counts himself as an occasional professional actor who has performed on Broadway [1] and in several television shows and films. 
He currently resides in Los Angeles with his lovely wife, two daughters, his dog, Newton, and his framed New Yorker Caption Contest winning cartoon. [He has been a finalist twice and the one he won is contest #329, by Joe Dator (inspired by his theatrical background. :)] Continue reading

Categories: Palindrome | Leave a comment

“A sense of security regarding the future of statistical science…” Anon review of Error and Inference



Aris Spanos, my colleague (in economics) and co-author, came across this anonymous review of our Error and Inference (2010) [E & I]. Interestingly, the reviewer remarks that “The book gives a sense of security regarding the future of statistical science and its importance in many walks of life.” We’re not sure what the reviewer means–but it’s appreciated regardless. This post was from yesterday’s 3-year memory lane and was first posted here.

2010 American Statistical Association and the American Society for Quality

TECHNOMETRICS, AUGUST 2010, VOL. 52, NO. 3, Book Reviews, 52:3, pp. 362-370.

Error and Inference: Recent Exchanges on Experimental Reasoning, Reliability, and the Objectivity and Rationality of Science, edited by Deborah G. MAYO and Aris SPANOS, New York: Cambridge University Press, 2010, ISBN 978-0-521-88008-4, xvii+419 pp., $60.00.

This edited volume contemplates the interests of both scientists and philosophers regarding gathering reliable information about the problem/question at hand in the presence of error, uncertainty, and with limited data information.

The volume makes a significant contribution in bridging the gap between scientific practice and the philosophy of science. The main contribution of this volume pertains to issues of error and inference, and showcases intriguing discussions on statistical testing and providing alternative strategy to Bayesian inference. In words, it provides cumulative information towards the philosophical and methodological issues of scientific inquiry at large.

The target audience of this volume is quite general and open to a broad readership. With some reasonable knowledge of probability theory and statistical science, one can get the maximum benefit from most of the chapters of the volume. The volume contains original and fascinating articles by eminent scholars (nine, including the editors) who range from names in statistical science to philosophy, including D. R. Cox, a name well known to statisticians. Continue reading

Categories: 3-year memory lane, Review of Error and Inference, Statistics | 3 Comments


3 years ago...


MONTHLY MEMORY LANE: 3 years ago: May 2013. I mark in red three posts that seem most apt for general background on key issues in this blog [1].  Some of the May 2013 posts blog the conference we held earlier that month: “Ontology and Methodology”.  I highlight in burgundy a post on Birnbaum that follows up on my last post in honor of his birthday. New questions or comments can be placed on this post.

May 2013

  • (5/3) Schedule for Ontology & Methodology, 2013
  • (5/6) Professorships in Scandal?
  • (5/9) If it’s called the “The High Quality Research Act,” then ….
  • (5/13) ‘No-Shame’ Psychics Keep Their Predictions Vague: New Rejected post
  • (5/14) “A sense of security regarding the future of statistical science…” Anon review of Error and Inference
  • (5/18) Gandenberger on Ontology and Methodology (May 4) Conference: Virginia Tech
  • (5/19) Mayo: Meanderings on the Onto-Methodology Conference
  • (5/22) Mayo’s slides from the Onto-Meth conference
  • (5/24) Gelman sides w/ Neyman over Fisher in relation to a famous blow-up
  • (5/26) Schachtman: High, Higher, Highest Quality Research Act
  • (5/27) A.Birnbaum: Statistical Methods in Scientific Inference
  • (5/29) K. Staley: review of Error & Inference

 [1]Monthly memory lanes began at the blog’s 3-year anniversary in Sept, 2014.

Categories: 3-year memory lane, Statistics | Leave a comment

Allan Birnbaum: Foundations of Probability and Statistics (27 May 1923 – 1 July 1976)

27 May 1923-1 July 1976


Today is Allan Birnbaum’s birthday. In honor of his birthday this year, I’m posting the articles in the Synthese volume that was dedicated to his memory in 1977. The editors describe it as their way of “paying homage to Professor Birnbaum’s penetrating and stimulating work on the foundations of statistics”. I paste a few snippets from the articles by Giere and Birnbaum. If you’re interested in statistical foundations, and are unfamiliar with Birnbaum, here’s a chance to catch up. (Even if you are, you may be unaware of some of these key papers.)


Synthese Volume 36, No. 1 Sept 1977: Foundations of Probability and Statistics, Part I

Editorial Introduction:

This special issue of Synthese on the foundations of probability and statistics is dedicated to the memory of Professor Allan Birnbaum. Professor Birnbaum’s essay ‘The Neyman-Pearson Theory as Decision Theory; and as Inference Theory; with a Criticism of the Lindley-Savage Argument for Bayesian Theory’ was received by the editors of Synthese in October, 1975, and a decision was made to publish a special symposium consisting of this paper together with several invited comments and related papers. The sad news about Professor Birnbaum’s death reached us in the summer of 1976, but the editorial project could nevertheless be completed according to the original plan. By publishing this special issue we wish to pay homage to Professor Birnbaum’s penetrating and stimulating work on the foundations of statistics. We are grateful to Professor Ronald Giere who wrote an introductory essay on Professor Birnbaum’s concept of statistical evidence and who compiled a list of Professor Birnbaum’s publications.


Continue reading

Categories: Birnbaum, Error Statistics, Likelihood Principle, Statistics, strong likelihood principle | 7 Comments

Frequentstein: What’s wrong with (1 – β)/α as a measure of evidence against the null? (ii)



In their “Comment: A Simple Alternative to p-values,” (on the ASA P-value document), Benjamin and Berger (2016) recommend researchers report a pre-data Rejection Ratio:

It is the probability of rejection when the alternative hypothesis is true, divided by the probability of rejection when the null hypothesis is true, i.e., the ratio of the power of the experiment to the Type I error of the experiment. The rejection ratio has a straightforward interpretation as quantifying the strength of evidence about the alternative hypothesis relative to the null hypothesis conveyed by the experimental result being statistically significant. (Benjamin and Berger 2016, p. 1)

The recommendation is much more fully fleshed out in a 2016 paper by Bayarri, Benjamin, Berger, and Sellke (BBBS 2016): Rejection Odds and Rejection Ratios: A Proposal for Statistical Practice in Testing Hypotheses. Their recommendation is:

…that researchers should report the ‘pre-experimental rejection ratio’ when presenting their experimental design and researchers should report the ‘post-experimental rejection ratio’ (or Bayes factor) when presenting their experimental results. (BBBS 2016, p. 3)….

The (pre-experimental) ‘rejection ratio’ Rpre, the ratio of statistical power to significance threshold (i.e., the ratio of the probability of rejecting under H1 and H0 respectively), is shown to capture the strength of evidence in the experiment for H1 over H0. (ibid., p. 2)

But in fact it does no such thing! [See my post from the FUSION conference here.] J. Berger, and his co-authors, will tell you the rejection ratio (and a variety of other measures created over the years) are entirely frequentist because they are created out of frequentist error statistical measures. But a creation built on frequentist measures doesn’t mean the resulting animal captures frequentist error statistical reasoning. It might be a kind of Frequentstein monster! [1] Continue reading

Categories: J. Berger, power, reforming the reformers, S. Senn, Statistical power, Statistics | 36 Comments

Fallacies of Rejection, Nouvelle Cuisine, and assorted New Monsters


Jackie Mason

Whenever I’m in London, my criminologist friend Katrin H. and I go in search of stand-up comedy. Since it’s Saturday night (and I’m in London), we’re setting out in search of a good comedy club (I’ll complete this post upon return). A few years ago we heard Jackie Mason do his shtick, a one-man show billed as his swan song to England.  It was like a repertoire of his “Greatest Hits” without a new or updated joke in the mix.  Still, hearing his rants for the nth time was often quite hilarious. It turns out that he has already been back doing another “final shtick tour” in England, but not tonight.

A sample: If you want to eat nothing, eat nouvelle cuisine. Do you know what it means? No food. The smaller the portion the more impressed people are, so long as the food’s got a fancy French name, haute cuisine. An empty plate with sauce!

As one critic wrote, Mason’s jokes “offer a window to a different era,” one whose caricatures and biases one can only hope we’ve moved beyond:

But it’s one thing for Jackie Mason to scowl at a seat in the front row and yell to the shocked audience member in his imagination, “These are jokes! They are just jokes!” and another to reprise statistical howlers, which are not jokes, to me. This blog found its reason for being partly as a place to expose, understand, and avoid them. I had earlier used this Jackie Mason opening to launch into a well-known fallacy of rejection using statistical significance tests. I’m going to go further this time around. I began by needling some leading philosophers of statistics: Continue reading

Categories: reforming the reformers, science-wise screening, Statistical power, statistical tests, Statistics | Tags: , , , , | 5 Comments

Excerpts from S. Senn’s Letter on “Replication, p-values and Evidence”



I first blogged this letter here. Below the references are some more recent blog links of relevance to this issue. 

 Dear Reader:  I am typing in some excerpts from a letter Stephen Senn shared with me in relation to my April 28, 2012 blogpost.  It is a letter to the editor of Statistics in Medicine  in response to S. Goodman. It contains several important points that get to the issues we’ve been discussing. You can read the full letter here. Sincerely, D. G. Mayo


From: Stephen Senn*

Some years ago, in the pages of this journal, Goodman gave an interesting analysis of ‘replication probabilities’ of p-values. Specifically, he considered the possibility that a given experiment had produced a p-value that indicated ‘significance’ or near significance (he considered the range p=0.10 to 0.001) and then calculated the probability that a study with equal power would produce a significant result at the conventional level of significance of 0.05. He showed, for example, that given an uninformative prior, and (subsequently) a resulting p-value that was exactly 0.05 from the first experiment, the probability of significance in the second experiment was 50 per cent. A more general form of this result is as follows. If the first trial yields p=α then the probability that a second trial will be significant at significance level α (and in the same direction as the first trial) is 0.5. Continue reading

Categories: 4 years ago!, reproducibility, S. Senn, Statistics | Tags: , , , | 3 Comments

My Slides: “The Statistical Replication Crisis: Paradoxes and Scapegoats”

Below are the slides from my Popper talk at the LSE today (up to slide 70): (post any questions in the comments)


Categories: P-values, replication research, reproducibility, Statistics | 11 Comments

Some bloglinks for my LSE talk tomorrow: “The Statistical Replication Crisis: Paradoxes and Scapegoats”

In my Popper talk today (in London), I will discuss topics in philosophy of statistics in relation to: the 2016 ASA document on P-values, and recent replication research in psychology. For readers interested in links from this blog, see:

I. My commentary on the ASA document on P-values (with links to the ASA document):

“Don’t Throw Out the Error Control Baby with the Bad Statistics Bathwater”

“A Small P-value Indicates that the Results are Due to Chance Alone: Fallacious or not: More on the ASA P-value Doc”

“P-Value Madness: A Puzzle About the Latest Test Ban, or ‘Don’t Ask, Don’t Tell’”

II. Posts on replication research in psychology: Continue reading

Categories: Metablog, P-values, replication research, reproducibility, Statistics | 7 Comments

My Popper Talk at LSE: The Statistical Replication Crisis: Paradoxes and Scapegoats

I’m giving a Popper talk at the London School of Economics next Tuesday (10 May). If you’re in the neighborhood, I hope you’ll stop by.


A somewhat accurate blurb is here. I say “somewhat” because it doesn’t mention that I’ll talk a bit about the replication crisis in psychology, and the issues that crop up (or ought to) in connecting statistical results and the causal claim of interest.



Categories: Announcement | 5 Comments
