My Responses (at the P-value debate)


How did I respond to those 7 burning questions at last week’s (“P-Value”) Statistics Debate? Here’s a fairly close transcript of my (a) general answer, and (b) final remark, for each question–without the in-between responses to Jim and David. The exception is question 5 on Bayes factors, which naturally included Jim in my general answer. 

The questions with the most important consequences, I think, are questions 3 and 5. I’ll explain why I say this in the comments. Please share your thoughts.

Question 1. Given the issues surrounding the misuses and abuse of p-values, do you think they should continue to be used or not? Why or why not?

Yes we should continue to use P-values and statistical significance tests. Uses of P-values are a piece in a rich set of tools for assessing and controlling the probabilities of misleading interpretations of data (error probabilities). They’re “the first line of defense against being fooled by randomness” (Yoav Benjamini). If even larger or more extreme effects than you observed are frequently brought about by chance variability alone (P-value is not small), clearly you don’t have evidence of incompatibility with the “mere chance” hypothesis.

Even those who criticize P-values will employ them at least if they care to check the assumptions of their statistical models—this includes Bayesians George Box, Andrew Gelman, and Jim Berger.       

Critics of P-values often allege it’s too easy to obtain small P-values, but notice the replication crisis is all about how difficult it is to get small P-values with preregistered hypotheses. This shows the problem isn’t P-values but selection effects and data dredging. However, the same data-dredged hypotheses can enter likelihood ratios, Bayes factors, and Bayesian updating, except that we now lose the direct grounds to criticize inferences flouting error statistical control. The introduction of prior probabilities (which may also be data dependent) offers further researcher flexibility.

Those who reject P values are saying we should reject a method because it can be used badly. That’s a very bad argument committing straw person fallacies.

We should reject misuses and abuses of P-values, but there’s a danger of blithely substituting “alternative tools” that throw out the error control baby with the bad statistics bathwater.

Final remark on P-values

What’s missed in the reject P-values movement is that the major reason for calling in statistics in science is that it gives tools to inquire whether an observed phenomenon could be a real effect or just noise in the data. P-values have the intrinsic properties for this task, if used properly. To reject them is to jeopardize this important role of statistics. As Fisher emphasizes, we seek randomized controlled trials in order to ensure the validity of statistical significance tests. To reject P-values because they don’t give posterior probabilities in hypotheses is illicit. The onus is on those claiming we want such posteriors to show, for any way of getting them, why.


Question 2. Should practitioners avoid the use of thresholds (e.g., P-value thresholds) in interpreting data? If so, does this preclude testing?

There’s a lot of confusion about thresholds. What people oppose are dichotomous accept/reject routines. We should move away from them as well as unthinking uses of thresholds like 95% confidence levels or other quantities. Attained P-values should be reported (as all the founders of tests recommended). We should not confuse fixing a threshold to habitually use with prespecifying a threshold beyond which there is evidence of inconsistency with a test hypothesis. I’ll often call it the null for short.

Some think that banishing thresholds would diminish P-hacking and data dredging. It is the opposite. In a world without thresholds, it would be harder to criticize those who fail to meet a small P-value because they engaged in data dredging and multiple testing, and at most have given us a nominally small P-value. Yet that is the upshot of declaring predesignated P-value thresholds should not be used at all in interpreting data. If an account cannot say about any outcomes in advance that they will not count as evidence for a claim, then there is no test of that claim.

Giving up on tests means forgoing statistical falsification. What’s the point of insisting on replications if at no point can you say, the effect has failed to replicate?

You may favor a philosophy of statistics that rejects statistical falsification, but it will not do to declare by fiat that science should reject the falsification or testing view. (The “no thresholds” view also torpedoes common testing uses of confidence intervals and Bayes Factor standards.)

So my answer is NO and YES: don’t abandon thresholds, to do so is to ban tests. 

Final remark on thresholds Q-2

A common fallacy is to suppose that because we have a continuum, we cannot distinguish points at the extremes (fallacy of the beard). We can distinguish results readily produced by random variability from cases where there is evidence of incompatibility with the chance variability hypothesis. We use thresholds throughout science to measure if you’re pre-diabetic, diabetic, etc.

When P-values are banned altogether … the eager researcher does not claim, “I’m simply describing”; they invariably go on to claim evidence for a substantive psych theory, but on results that would be blocked if they’d required a reasonably small P-value threshold.


Question 3. Is there a role for sharp null hypotheses or should we be thinking about interval nulls?

I’d agree with those who regard testing of a point null hypothesis as problematic and often misused. Notice that arguments purporting to show P-values exaggerate evidence are based on this point null and a spiked or lump prior on it. By giving a spike prior to the nil, it’s easy to find the nil more likely than the alternative. This is the Jeffreys-Lindley paradox: the P-value can differ from the posterior probability on the null. But the posterior can also equal the P-value; it can range from p to 1 – p. In other words, the Bayesians differ amongst themselves, because with diffuse priors the P-value can equal the posterior on the null hypothesis.
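To make the spike-prior point concrete, here is a minimal sketch that was not part of the debate remarks. It assumes a Normal mean with known σ, a point null μ = 0 given prior probability 0.5, and a N(0, τ²) prior on μ under the alternative; the function names and the values of σ, τ and n are illustrative assumptions. Holding the result fixed at z = 1.96 (a two-sided P-value of about 0.05), the posterior on the null climbs as the sample size grows.

```python
from math import exp, sqrt
from statistics import NormalDist


def bf01(z, n, sigma=1.0, tau=1.0):
    """Bayes factor for H0: mu = 0 against H1: mu ~ N(0, tau^2)."""
    r = n * tau**2 / sigma**2  # prior variance relative to sampling variance
    return sqrt(1 + r) * exp(-0.5 * z**2 * r / (1 + r))


def posterior_null(z, n, prior_null=0.5, sigma=1.0, tau=1.0):
    """Posterior probability of the point null given the z statistic."""
    b = bf01(z, n, sigma, tau)
    return prior_null * b / (prior_null * b + (1 - prior_null))


def two_sided_p(z):
    """Two-sided P-value for a standard Normal test statistic."""
    return 2 * (1 - NormalDist().cdf(abs(z)))


if __name__ == "__main__":
    z = 1.96
    print(f"two-sided P-value: {two_sided_p(z):.3f}")
    for n in (10, 100, 1000, 100000):
        print(f"n = {n:>6}: posterior on H0 = {posterior_null(z, n):.3f}")
```

Under these assumptions the same just-significant z statistic goes from counting mildly against the null to counting in its favor, purely as a function of n and the lump of prior placed on the null.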

My own work reformulates results of statistical significance tests in terms of discrepancies from the null that are well or poorly tested. A small P-value indicates discrepancy from a null value because with high probability, 1 – p, the test would have produced a larger P-value (less impressive difference) in a world adequately described by H0. Since the null hypothesis would very probably have survived if correct, when it doesn’t survive, it indicates inconsistency with it.
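The claim that a larger P-value would occur with probability 1 – p under H0 can be checked by simulation. The sketch below is an illustration only: it assumes a one-sided test of a Normal mean with known σ = 1, and the sample size, number of trials and seed are arbitrary choices.

```python
import random
from statistics import NormalDist, mean


def one_sided_p(sample, mu0=0.0, sigma=1.0):
    """P-value for H0: mu <= mu0 from a Normal sample with known sigma."""
    z = (mean(sample) - mu0) / (sigma / len(sample) ** 0.5)
    return 1 - NormalDist().cdf(z)


def share_of_larger_p(p_obs, n=25, trials=20000, seed=1):
    """Fraction of H0-generated samples whose P-value exceeds p_obs."""
    rng = random.Random(seed)
    larger = 0
    for _ in range(trials):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]  # data from H0
        if one_sided_p(sample) > p_obs:
            larger += 1
    return larger / trials


if __name__ == "__main__":
    for p_obs in (0.01, 0.05, 0.20):
        share = share_of_larger_p(p_obs)
        print(f"p = {p_obs:.2f}: share with larger P-value = {share:.3f}")
```

Each printed share should sit close to 1 – p, which is the sense in which a small observed P-value would rarely arise if H0 adequately described the data.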

Final remark on sharp nulls Q-3

The move to redefine significance, advanced by a megateam including Jim, rests upon the lump high prior probability on the null as well as evaluating P-values using Bayes factors. It’s not equipoise, it’s biased in favor of the null. The redefiners are prepared to say there’s no evidence against or even evidence for a null hypothesis, even though that point null is entirely excluded from the corresponding 95% confidence interval. This would often erroneously fail to uncover discrepancies.

Whether to use a lower threshold is one thing, to argue we should based on Bayes factor standards lacks legitimate grounds.[1][2]


Question 4. Should we be teaching hypothesis testing anymore, or should we be focusing on point estimation and interval estimation?

Absolutely [we should be teaching hypothesis testing]. The way to understand confidence interval estimation, and to fix its shortcomings, is to understand its duality with tests. The same person, Jerzy Neyman, developed both confidence intervals and tests in the 1930s. The intervals are inversions of tests.

A 95% CI contains the parameter values that are not statistically significantly different from the data at the 5% level.
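The duality can be seen numerically. The following sketch assumes a Normal mean with known σ and a two-sided test; the grid step and the example numbers are arbitrary. It builds the interval twice, once from the textbook formula and once by keeping every candidate value of μ that the 5% test fails to reject.

```python
from statistics import NormalDist

STD = NormalDist()


def two_sided_p(xbar, mu0, sigma, n):
    """Two-sided P-value for H0: mu = mu0, Normal data with known sigma."""
    z = (xbar - mu0) / (sigma / n ** 0.5)
    return 2 * (1 - STD.cdf(abs(z)))


def ci_by_formula(xbar, sigma, n, level=0.95):
    """Usual interval: xbar plus or minus a Normal quantile times the SE."""
    half = STD.inv_cdf(0.5 + level / 2) * sigma / n ** 0.5
    return xbar - half, xbar + half


def ci_by_test_inversion(xbar, sigma, n, level=0.95, step=0.001):
    """Scan a grid of candidate mu0 values and keep those not rejected."""
    alpha = 1 - level
    se = sigma / n ** 0.5
    count = int(round(10 * se / step))  # grid spans xbar +/- 5 standard errors
    grid = [xbar - 5 * se + i * step for i in range(count + 1)]
    kept = [mu0 for mu0 in grid if two_sided_p(xbar, mu0, sigma, n) > alpha]
    return min(kept), max(kept)


if __name__ == "__main__":
    print("formula:  ", ci_by_formula(0.4, 1.0, 100))
    print("inversion:", ci_by_test_inversion(0.4, 1.0, 100))
```

Up to the resolution of the grid, the two intervals coincide.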

While I agree that P-values should be accompanied by CIs, my own preferred reconstruction of tests blends intervals and tests. It reports the discrepancies from a reference value that are well or poorly indicated at different levels—not just 1 level like .95. This improves on current confidence interval use. For example, the justification standardly given for inferring a particular confidence interval estimate is that it came from a method which, with high probability, would cover the true parameter value. This is a performance justification. The testing perspective on CIs gives an inferential justification. I would justify inferring evidence that the parameter exceeds the CI lower bound this way: if the parameter were smaller than the lower bound, then with high probability we would have observed a smaller value of the test statistic than we did.
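One way to read the inferential justification is sketched below, again assuming a Normal mean with known σ; the helper names and numbers are hypothetical. For each confidence level it computes the one-sided lower bound and then the probability of observing a smaller sample mean than the one actually observed, were μ to sit exactly at that bound.

```python
from statistics import NormalDist

STD = NormalDist()


def lower_bound(xbar, sigma, n, level):
    """One-sided lower confidence bound for mu at the given level."""
    return xbar - STD.inv_cdf(level) * sigma / n ** 0.5


def prob_smaller_result(xbar, mu, sigma, n):
    """P(sample mean < observed xbar) if the true mean were mu."""
    return NormalDist(mu, sigma / n ** 0.5).cdf(xbar)


if __name__ == "__main__":
    xbar, sigma, n = 0.4, 1.0, 100
    for level in (0.5, 0.8, 0.9, 0.975, 0.99):
        bound = lower_bound(xbar, sigma, n, level)
        prob = prob_smaller_result(xbar, bound, sigma, n)
        print(f"level {level}: mu > {bound:.3f}, P(smaller result) = {prob:.3f}")
```

Reporting several levels at once shows which discrepancies are well indicated (bounds with a high probability) and which are poorly indicated, rather than fixing on a single level such as .95.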

Amazingly, the last president of the ASA, Karen Kafadar, had to appoint a new task force on statistical significance tests to affirm that statistical hypothesis testing is indeed part of good statistical practice, though much credit goes to her for bringing this about.

Final remark on question 4

Understanding the duality between tests and CIs is the key to improving both. … So it makes no sense for advocates of the “new statistics” to shun tests. The testing interpretation of confidence intervals also scotches criticisms of examples where it can happen that a 95% confidence estimate contains all possible parameter values. Although such an inference is ‘trivially true,’ it is scarcely vacuous in the testing construal. As David Cox remarks, that all parameter values are consistent with the data is an informative statement about the limitations of the data (to detect discrepancies at the particular level).


Question 5. What are your reasons for or against the use of Bayes Factors?

Jim is a leading advocate of Bayes factors and also of the non-subjective interpretation of the Bayesian prior probabilities to be used (2006). ‘Eliciting’ subjective priors, Jim has convincingly argued, is too difficult; experts’ prior beliefs almost never even overlap, he says, and scientists are reluctant for subjective beliefs to overshadow data. Default priors (reference or non-subjective priors) are supposed to prevent prior beliefs from influencing the posteriors; they are data dominant in some sense. But there’s a variety of incompatible ways to go about this job.

(A few are maximum entropy, invariance, maximizing the missing information, coverage matching.) As David Cox points out, it’s unclear how we should interpret these default probabilities. Default priors, we are told, are simply formal devices to obtain default posteriors. “The priors are not to be considered expressions of uncertainty, ignorance, or degree of belief. Conventional priors may not even be probabilities…” (Cox and Mayo 2010, 299), being improper.

Prior probabilities are supposed to let us bring in background information, but this pulls in the opposite direction from the goal of the default prior which is to reflect just the data. The goal of representing your beliefs is very different from the goal of finding a prior that allows the data to be dominant. Yet, current uses of Bayesian methods combine both in the same computation—how do you interpret them? I think this needs to be assessed now that they’re being so widely advocated.

Final remark on Q-5  

BFs give a comparative appraisal, not a test. The appraisal depends on how you assign the priors to the test and alternative hypotheses.

Bayesian testing, Bayesians admit, is a work in progress. My feeling is, we shouldn’t kill a well worked out theory of testing for one that is admitted to be a work in progress.

It might be noted that even default Bayesian Jose Bernardo holds that the difference between the P-value and the BF (the Jeffreys Lindley paradox or Fisher-Jeffreys disagreement) is actually an indictment of the BF because it finds evidence in favor of a null hypothesis even when an alternative is much more likely.

Other Bayesians dislike the default priors because they can lead to improper posteriors and thus to violations of probability theory. This leads some like Dennis Lindley back to subjective Bayesianism.


Question 6. With so much examination of if/why the usual nominal type I error .05 is appropriate, should there be similar questions about the usual nominal type II error?

No, there should not be a similar examination of type 2 error bounds. Rigid bounds for either error should be avoided. N-P themselves urged the specifications be used with discretion and understanding.

It occurs to me, if an examination is wanted it should be done by the new ASA Task Force on Significance Tests and Replicability. Its members aren’t out to argue for rejecting significance tests but to show they are part of proper statistical practice. 

Power, the complement of the type II error probability, I often say is a most abused notion (note it’s only defined in terms of a threshold). Critics of statistical significance tests, I’m afraid to say, often fallaciously take a just statistically significant difference at level α as a better indication of a discrepancy from a null if the test’s power to detect that discrepancy is high rather than low. This is like saying it’s a better indication for a discrepancy of at least 10 than of at least 1 (whatever the parameter is). I call it the Mountains out of Molehill fallacy. It results from trying to use power and alpha as ingredients for a Bayes factor and from viewing non-Bayesian methods through a Bayesian lens.
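A rough numerical illustration of the fallacy follows. It assumes a one-sided Normal test of H0: μ ≤ 0 with known σ = 1, n = 100 and α = 0.05, and a result that just reaches significance; all names and numbers are illustrative. For each candidate discrepancy μ1 it prints the test’s power to detect μ1 alongside the probability of a smaller result than the one observed if μ were μ1, which is the quantity relevant to inferring μ > μ1.

```python
from statistics import NormalDist

STD = NormalDist()


def cutoff(alpha, sigma, n, mu0=0.0):
    """Smallest sample mean that is significant at level alpha."""
    return mu0 + STD.inv_cdf(1 - alpha) * sigma / n ** 0.5


def power(mu1, alpha, sigma, n, mu0=0.0):
    """P(test rejects H0: mu <= mu0) when the true mean is mu1."""
    se = sigma / n ** 0.5
    return 1 - STD.cdf((cutoff(alpha, sigma, n, mu0) - mu1) / se)


def prob_smaller_result(xbar, mu1, sigma, n):
    """P(sample mean < observed xbar) if the true mean were mu1."""
    return STD.cdf((xbar - mu1) / (sigma / n ** 0.5))


if __name__ == "__main__":
    alpha, sigma, n = 0.05, 1.0, 100
    xbar = cutoff(alpha, sigma, n)  # a just statistically significant result
    for mu1 in (0.0, 0.1, 0.2, 0.3, 0.4):
        pw = power(mu1, alpha, sigma, n)
        ps = prob_smaller_result(xbar, mu1, sigma, n)
        print(f"mu1 = {mu1:.1f}: power = {pw:.3f}, P(smaller result) = {ps:.3f}")
```

In this setup the two columns sum to 1 for a just-significant result, so the higher the power against μ1, the weaker the indication that μ exceeds μ1.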

We set a high power to detect population effects of interest, but finding statistical significance doesn’t warrant saying we’ve evidence for those effects.

(The significance tester doesn’t infer points but inequalities, discrepancies at least such and such).

Final remark on Q-6, power

A legitimate criticism of P-values is they don’t give population effect sizes. Neyman developed power analysis for this purpose, in addition to comparing tests pre-data. Yet critics of tests typically keep to Fisherian tests that don’t have explicit alternatives or power. Neyman was keen to avoid misinterpreting non-significant results as evidence for a null hypothesis. He used power analysis post data (like Jacob Cohen much later) to set an upper bound for a discrepancy from the null value.

If a test has high power to detect a population discrepancy, but does not do so, it’s evidence the discrepancy is absent (qualified by the level).

My preference is to use the attained power but it’s the same reasoning.

I see people objecting to post-hoc power as “sinister” but they’re referring to computing power by using the observed effect as the parameter value in its computation. This is not power analysis.
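The distinction between these computations can be sketched as follows, assuming the same kind of one-sided Normal test with known σ; the function names and the example numbers are hypothetical. Ordinary power and attained power are evaluated at a discrepancy μ1 of interest, whereas the criticized “observed power” plugs the observed mean in as if it were the true parameter value.

```python
from statistics import NormalDist

STD = NormalDist()


def power(mu1, alpha, sigma, n, mu0=0.0):
    """P(test rejects H0: mu <= mu0) when the true mean is mu1."""
    se = sigma / n ** 0.5
    crit = mu0 + STD.inv_cdf(1 - alpha) * se
    return 1 - STD.cdf((crit - mu1) / se)


def attained_power(xbar, mu1, sigma, n):
    """P(sample mean > observed xbar) if the true mean were mu1."""
    return 1 - STD.cdf((xbar - mu1) / (sigma / n ** 0.5))


def observed_power(xbar, alpha, sigma, n, mu0=0.0):
    """Power computed by plugging the observed mean in as the true mean."""
    return power(xbar, alpha, sigma, n, mu0)


if __name__ == "__main__":
    alpha, sigma, n = 0.05, 1.0, 100
    xbar = 0.1  # z = 1.0, not statistically significant
    print(f"power at mu1 = 0.4:          {power(0.4, alpha, sigma, n):.3f}")
    print(f"attained power at mu1 = 0.4: {attained_power(xbar, 0.4, sigma, n):.3f}")
    print(f"'observed power':            {observed_power(xbar, alpha, sigma, n):.3f}")
```

With a non-significant result, high power (or high attained power) at μ1 supports setting μ1 as an upper bound on the discrepancy. The “observed power” figure is below 0.5 for any non-significant result in this setup and says nothing about a discrepancy chosen in advance.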


Question 7. What are the problems that lead to the reproducibility crisis and what are the most important things we should do to address it?

Irreplication is due to many factors, from data generation and modeling to problems of measurement and of linking statistics to substantive science. Here I just focus on P-values. The key problem is that in many fields, latitude in collecting and interpreting data makes it too easy to dredge up impressive looking findings even when spurious. The fact that it becomes difficult to replicate effects when features of the tests are tied down shows the problem isn’t P-values but exploiting researcher flexibility and multiple testing. The same flexibility can occur when the p-hacked hypotheses enter methods being promoted as alternatives to significance tests: likelihood ratios, Bayes Factors, or Bayesian updating. But direct grounds to criticize inferences as flouting error statistical control are lost (at least not without adding non-standard stipulations). Since these methods condition on the actual outcome, they don’t consider outcomes other than the one observed. This is embodied in something called the likelihood principle.

Admittedly error control, some think, is only of concern to ensure low error rates in some long run. I argue instead that what bothers us about the P-hacker and data dredger is that they have done a poor job in the case at hand. Their method very probably would have found some such effect even if it is merely noise.
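How probable is it that a dredging procedure finds “some such effect” in pure noise? A small sketch, assuming k independent outcomes are each tested at the 0.05 level and only the smallest P-value is reported (the choice of k, the number of trials and the seed are arbitrary):

```python
import random
from statistics import NormalDist

STD = NormalDist()


def prob_some_hit(k, alpha=0.05):
    """Chance that at least one of k independent null tests has p < alpha."""
    return 1 - (1 - alpha) ** k


def simulate_dredging(k, alpha=0.05, trials=20000, seed=7):
    """Fraction of pure-noise studies where the smallest of k P-values < alpha."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        # Under H0 each z statistic is standard Normal; take two-sided P-values.
        p_values = [2 * (1 - STD.cdf(abs(rng.gauss(0, 1)))) for _ in range(k)]
        if min(p_values) < alpha:
            hits += 1
    return hits / trials


if __name__ == "__main__":
    for k in (1, 5, 20):
        print(f"k = {k:>2}: exact = {prob_some_hit(k):.3f}, "
              f"simulated = {simulate_dredging(k):.3f}")
```

With 20 outcomes to choose from, a nominally significant result turns up in roughly 64% of pure-noise studies, so the reported P-value of the selected outcome no longer reflects the error probability of the method actually used.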

Probability here is to assess how well tested claims are, which is very different from how comparatively believable they are—claims can even be known true while poorly tested. Though there’s room for both types of assessments in different contexts, how plausible and how well tested are very different and this needs to be recognized.

To address replication problems, statistical reforms should be developed together with a philosophy of statistics that properly underwrites them.[3]

Final remark on Q-7

Please see the video here or in this news article.

[1] The following are footnotes 4 and 5 from page 252 of Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars. The relevant section is 4.4 (pp. 246-259).

Casella and Roger (not Jim) Berger (1987b) argue, “We would be surprised if most researchers would place even a 10% prior probability of H0. We hope that the casual reader of Berger and Delampady realizes that the big discrepancies between P-values [and] P(H0|x) . . . are due to a large extent to the large value of [the prior of 0.5 to H0] that was used.” The most common uses of a point null, asserting the difference between means is 0, or a regression coefficient is 0, merely describe a potentially interesting feature of the population, with no special prior believability. “J. Berger and Delampady admit…, P-values are reasonable measures of evidence when there is no a priori concentration of belief about H0” (ibid., p. 345). Thus, “the very argument that Berger and Delampady use to dismiss P-values can be turned around to argue for P-values” (ibid., p. 346).

Harold Jeffreys developed the spiked priors for a very special case: to give high posterior probabilities to well corroborated theories. This is quite different from the typical use of statistical significance tests to detect indications of an observed effect that is not readily due to noise. (Of course isolated small P-values do not suffice to infer a genuine experimental phenomenon.)

In defending spiked priors, J. Berger and Sellke move away from the importance of effect size. “Precise hypotheses . . . ideally relate to, say, some precise theory being tested. Of primary interest is whether the theory is right or wrong; the amount by which it is wrong may be of interest in developing alternative theories, but the initial question of interest is that modeled by the precise hypothesis test” (1987, p. 136).

[2] As Cox and Hinkley explain, most tests of interest are best considered as running two one-sided tests, insofar as we are interested in the direction of departure. (Cox and Hinkley 1974; Cox 2020).

[3] In the error statistical view, the interest is not in measuring how strong your degree of belief in H is but how well you can show why it ought to be believed or not. How well can you put to rest skeptical challenges? What have you done to put to rest my skepticism of your lump prior on “no effect”?



P-Value Statements and Their Unintended(?) Consequences: The June 2019 ASA President’s Corner (b)


Mayo writing to Kafadar

I never met Karen Kafadar, the 2019 President of the American Statistical Association (ASA), but the other day I wrote to her in response to a call in her extremely interesting June 2019 President’s Corner: “Statistics and Unintended Consequences“:

  • “I welcome your suggestions for how we can communicate the importance of statistical inference and the proper interpretation of p-values to our scientific partners and science journal editors in a way they will understand and appreciate and can use with confidence and comfort—before they change their policies and abandon statistics altogether.”

I only recently came across her call, and I will share my letter below. First, here are some excerpts from her June President’s Corner (her December report is due any day).

Neyman vs the ‘Inferential’ Probabilists


We celebrated Jerzy Neyman’s Birthday (April 16, 1894) last night in our seminar: here’s a pic of the cake.  My entry today is a brief excerpt and a link to a paper of his that we haven’t discussed much on this blog: Neyman, J. (1962), ‘Two Breakthroughs in the Theory of Statistical Decision Making‘ [i] It’s chock full of ideas and arguments, but the one that interests me at the moment is Neyman’s conception of “his breakthrough”, in relation to a certain concept of “inference”.  “In the present paper” he tells us, “the term ‘inferential theory’…will be used to describe the attempts to solve the Bayes’ problem with a reference to confidence, beliefs, etc., through some supplementation …either a substitute a priori distribution [exemplified by the so called principle of insufficient reason] or a new measure of uncertainty” such as Fisher’s fiducial probability. So if you hear Neyman rejecting “inferential accounts” you have to understand it in this very specific way: he’s rejecting “new measures of confidence or diffidence”. Here he alludes to them as “easy ways out”. Now Neyman always distinguishes his error statistical performance conception from Bayesian and Fiducial probabilisms [ii]. The surprising twist here is semantical and the culprit is none other than…Allan Birnbaum. Yet Birnbaum gets short shrift, and no mention is made of our favorite “breakthrough” (or did I miss it?).

drawn by his wife, Olga

Note: In this article, “attacks” on various statistical “fronts” refers to ways of attacking problems in one or another statistical research program.

there’s a man at the wheel in your brain & he’s telling you what you’re allowed to say (not probability, not likelihood)

It seems like every week something of excitement in statistics comes down the pike. Last week I was contacted by Richard Harris (and 2 others) about the recommendation to stop saying the data reach “significance level p” but rather simply say

“the p-value is p”.

(For links, see my previous post.) Friday, he wrote to ask if I would comment on a proposed restriction (?) on saying a test had high power! I agreed that we shouldn’t say a test has high power, but only that it has a high power to detect a specific alternative, but I wasn’t aware of any rulings from those in power on power. He explained it was an upshot of a reexamination by a joint group of the boards of statistical associations in the U.S. and UK of the full panoply of statistical terms. Something like that. I agreed to speak with him yesterday. He emailed me the proposed ruling on power:

Neyman vs the ‘Inferential’ Probabilists continued (a)


Today is Jerzy Neyman’s Birthday (April 16, 1894 – August 5, 1981).  I am posting a brief excerpt and a link to a paper of his that I hadn’t posted before: Neyman, J. (1962), ‘Two Breakthroughs in the Theory of Statistical Decision Making‘ [i] It’s chock full of ideas and arguments, but the one that interests me at the moment is Neyman’s conception of “his breakthrough”, in relation to a certain concept of “inference”.  “In the present paper” he tells us, “the term ‘inferential theory’…will be used to describe the attempts to solve the Bayes’ problem with a reference to confidence, beliefs, etc., through some supplementation …either a substitute a priori distribution [exemplified by the so called principle of insufficient reason] or a new measure of uncertainty” such as Fisher’s fiducial probability. Now Neyman always distinguishes his error statistical performance conception from Bayesian and Fiducial probabilisms [ii]. The surprising twist here is semantical and the culprit is none other than…Allan Birnbaum. Yet Birnbaum gets short shrift, and no mention is made of our favorite “breakthrough” (or did I miss it?). [iii] I’ll explain in later stages of this post & in comments…(so please check back); I don’t want to miss the start of the birthday party in honor of Neyman, and it’s already 8:30 p.m in Berkeley!

Note: In this article, “attacks” on various statistical “fronts” refers to ways of attacking problems in one or another statistical research program. HAPPY BIRTHDAY NEYMAN!

Why significance testers should reject the argument to “redefine statistical significance”, even if they want to lower the p-value*


An argument that assumes the very thing that was to have been argued for is guilty of begging the question; signing on to an argument whose conclusion you favor even though you cannot defend its premises is to argue unsoundly, and in bad faith. When a whirlpool of “reforms” subliminally alters the nature and goals of a method, falling into these sins can be quite inadvertent. Start with a simple point on defining the power of a statistical test.

I. Redefine Power?

Given that power is one of the most confused concepts from Neyman-Pearson (N-P) frequentist testing, it’s troubling that in “Redefine Statistical Significance”, power gets redefined too. “Power,” we’re told, is a Bayes Factor BF “obtained by defining H1 as putting ½ probability on μ = ± m for the value of m that gives 75% power for the test of size α = 0.05. This H1 represents an effect size typical of that which is implicitly assumed by researchers during experimental design.” (material under Figure 1).

New venues for the statistics wars

I was part of something called “a brains blog roundtable” on the business of p-values earlier this week–I’m glad to see philosophers getting involved.

Next week I’ll be in a session that I think is intended to explain what’s right about P-values at an ASA Symposium on Statistical Inference: “A World Beyond p < .05”.

Peircean Induction and the Error-Correcting Thesis

C. S. Peirce: 10 Sept, 1839-19 April, 1914

Sunday, September 10, was C.S. Peirce’s birthday. He’s one of my heroes. He’s a treasure chest on essentially any topic, and anticipated quite a lot in statistics and logic. (As Stephen Stigler (2016) notes, he’s to be credited with articulating and applying randomization [1].) I always find something that feels astoundingly new, even rereading him. He’s been a great resource as I complete my book, Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (CUP, 2018) [2]. I’m reblogging the main sections of a (2005) paper of mine. It’s written for a very general philosophical audience; the statistical parts are very informal. I first posted it in 2013. Happy (belated) Birthday Peirce.

Peircean Induction and the Error-Correcting Thesis
Deborah G. Mayo
Transactions of the Charles S. Peirce Society: A Quarterly Journal in American Philosophy, Volume 41, Number 2, 2005, pp. 299-319

Peirce’s philosophy of inductive inference in science is based on the idea that what permits us to make progress in science, what allows our knowledge to grow, is the fact that science uses methods that are self-correcting or error-correcting:

Induction is the experimental testing of a theory. The justification of it is that, although the conclusion at any stage of the investigation may be more or less erroneous, yet the further application of the same method must correct the error. (5.145)

Inductive methods, understood as methods of experimental testing, are justified to the extent that they are error-correcting methods. We may call this Peirce’s error-correcting or self-correcting thesis (SCT):

Can You Change Your Bayesian Prior? The one post whose comments (some of them) will appear in my new book


I blogged this exactly 2 years ago here, seeking insight for my new book (Mayo 2017). Over 100 (rather varied) interesting comments ensued. This is the first time I’m incorporating blog comments into published work. You might be interested to follow the nooks and crannies from back then, or add a new comment to this.

This is one of the questions high on the “To Do” list I’ve been keeping for this blog.  The question grew out of discussions of “updating and downdating” in relation to papers by Stephen Senn (2011) and Andrew Gelman (2011) in Rationality, Markets, and Morals.[i]

“As an exercise in mathematics [computing a posterior based on the client’s prior probabilities] is not superior to showing the client the data, eliciting a posterior distribution and then calculating the prior distribution; as an exercise in inference Bayesian updating does not appear to have greater claims than ‘downdating’.” (Senn, 2011, p. 59)

“If you could really express your uncertainty as a prior distribution, then you could just as well observe data and directly write your subjective posterior distribution, and there would be no need for statistical analysis at all.” (Gelman, 2011, p. 77)

But if uncertainty is not expressible as a prior, then a major lynchpin for Bayesian updating seems questionable. If you can go from the posterior to the prior, on the other hand, perhaps it can also lead you to come back and change it.

Is it legitimate to change one’s prior based on the data?

Frequentstein’s Bride: What’s wrong with using (1 – β)/α as a measure of evidence against the null?



ONE YEAR AGO: …and growing more relevant all the time. Rather than leak any of my new book*, I reblog some earlier posts, even if they’re a bit scruffy. This was first blogged here (with a slightly different title). It’s married to posts on “the P-values overstate the evidence against the null fallacy”, such as this, and is wedded to this one on “How to Tell What’s True About Power if You’re Practicing within the Frequentist Tribe”. 

In their “Comment: A Simple Alternative to p-values,” (on the ASA P-value document), Benjamin and Berger (2016) recommend researchers report a pre-data Rejection Ratio:

It is the probability of rejection when the alternative hypothesis is true, divided by the probability of rejection when the null hypothesis is true, i.e., the ratio of the power of the experiment to the Type I error of the experiment. The rejection ratio has a straightforward interpretation as quantifying the strength of evidence about the alternative hypothesis relative to the null hypothesis conveyed by the experimental result being statistically significant. (Benjamin and Berger 2016, p. 1)

“Fusion-Confusion?” My Discussion of Nancy Reid: “BFF Four- Are we Converging?”


Here are the slides from my discussion of Nancy Reid today at BFF4: The Fourth Bayesian, Fiducial, and Frequentist Workshop: May 1-3, 2017 (hosted by Harvard University)

S. Senn: “Automatic for the people? Not quite” (Guest post)

Stephen Senn
Head of  Competence Center for Methodology and Statistics (CCMS)
Luxembourg Institute of Health
Twitter @stephensenn

Automatic for the people? Not quite

What caught my eye was the estimable (in its non-statistical meaning) Richard Lehman tweeting about the equally estimable John Ioannidis. For those who don’t know them, the former is a veteran blogger who keeps a very cool and shrewd eye on the latest medical ‘breakthroughs’ and the latter a serial iconoclast of idols of scientific method. This is what Lehman wrote

Ioannidis hits 8 on the Richter scale: … Bayes factors consistently quantify strength of evidence, p is valueless.

Since Ioannidis works at Stanford, which is located in the San Francisco Bay Area, he has every right to be interested in earthquakes but on looking up the paper in question, a faint tremor is the best that I can afford it. I shall now try and explain why, but before I do, it is only fair that I acknowledge the very generous, prompt and extensive help I have been given to understand the paper[1] in question by its two authors Don van Ravenzwaaij and Ioannidis himself.

Categories: Bayesian/frequentist, Error Statistics, S. Senn | 18 Comments

The Fourth Bayesian, Fiducial and Frequentist Workshop (BFF4): Harvard U


May 1-3, 2017
Hilles Event Hall, 59 Shepard St. MA

The Department of Statistics is pleased to announce the 4th Bayesian, Fiducial and Frequentist Workshop (BFF4), to be held on May 1-3, 2017 at Harvard University. The BFF workshop series celebrates foundational thinking in statistics and inference under uncertainty. The three-day event will present talks, discussions and panels that feature statisticians and philosophers whose research interests synergize at the interface of their respective disciplines. Confirmed featured speakers include Sir David Cox and Stephen Stigler.

The program will open with a featured talk by Art Dempster and discussion by Glenn Shafer. The featured banquet speaker will be Stephen Stigler. Confirmed speakers include:

Featured Speakers and Discussants: Arthur Dempster (Harvard); Cynthia Dwork (Harvard); Andrew Gelman (Columbia); Ned Hall (Harvard); Deborah Mayo (Virginia Tech); Nancy Reid (Toronto); Susanna Rinard (Harvard); Christian Robert (Paris-Dauphine/Warwick); Teddy Seidenfeld (CMU); Glenn Shafer (Rutgers); Stephen Senn (LIH); Stephen Stigler (Chicago); Sandy Zabell (Northwestern)

Invited Speakers and Panelists: Jim Berger (Duke); Emery Brown (MIT/MGH); Larry Brown (Wharton); David Cox (Oxford; remote participation); Paul Edlefsen (Hutch); Don Fraser (Toronto); Ruobin Gong (Harvard); Jan Hannig (UNC); Alfred Hero (Michigan); Nils Hjort (Oslo); Pierre Jacob (Harvard); Keli Liu (Stanford); Regina Liu (Rutgers); Antonietta Mira (USI); Ryan Martin (NC State); Vijay Nair (Michigan); James Robins (Harvard); Daniel Roy (Toronto); Donald B. Rubin (Harvard); Peter XK Song (Michigan); Gunnar Taraldsen (NUST); Tyler VanderWeele (HSPH); Vladimir Vovk (London); Nanny Wermuth (Chalmers/Gutenberg); Min-ge Xie (Rutgers)

Categories: Announcement, Bayesian/frequentist | 2 Comments

The ASA Document on P-Values: One Year On


I’m surprised it’s a year already since posting my published comments on the ASA Document on P-Values. Since then, there have been a slew of papers rehearsing the well-worn fallacies of tests (a tad bit more than the usual rate). Doubtless, the P-value Pow Wow raised people’s consciousnesses. I’m interested in hearing reader reactions/experiences in connection with the P-Value project (positive and negative) over the past year. (Use the comments, share links to papers; and/or send me something slightly longer for a possible guest post.)
Some people sent me a diagram from a talk by Stephen Senn (on “P-values and the art of herding cats”). He presents an array of different cat commentators, and for some reason Mayo cat is in the middle but way over on the left side, near the wall. I never got the key to interpretation. My contribution is below:

Chart by S.Senn

“Don’t Throw Out The Error Control Baby With the Bad Statistics Bathwater”

D. Mayo*[1]

The American Statistical Association is to be credited with opening up a discussion into p-values; now an examination of the foundations of other key statistical concepts is needed.

Categories: Bayesian/frequentist, P-values, science communication, Statistics, Stephen Senn | 14 Comments


3 years ago…

MONTHLY MEMORY LANE: 3 years ago: January 2014. I mark in red three posts from each month that seem most apt for general background on key issues in this blog, excluding those reblogged recently[1], and in green up to 3 others I’d recommend[2].  Posts that are part of a “unit” or a group count as one. This month, I’m grouping the 3 posts from my seminar with A. Spanos, counting them as 1.

January 2014

  • (1/2) Winner of the December 2013 Palindrome Book Contest (Rejected Post)
  • (1/3) Error Statistics Philosophy: 2013
  • (1/4) Your 2014 wishing well. …
  • (1/7) “Philosophy of Statistical Inference and Modeling” New Course: Spring 2014: Mayo and Spanos: (Virginia Tech)
  • (1/11) Two Severities? (PhilSci and PhilStat)
  • (1/14) Statistical Science meets Philosophy of Science: blog beginnings
  • (1/16) Objective/subjective, dirty hands and all that: Gelman/Wasserman blogolog (ii)
  • (1/18) Sir Harold Jeffreys’ (tail area) one-liner: Sat night comedy [draft ii]
  • (1/22) Phil6334: “Philosophy of Statistical Inference and Modeling” New Course: Spring 2014: Mayo and Spanos (Virginia Tech) UPDATE: JAN 21
  • (1/24) Phil 6334: Slides from Day #1: Four Waves in Philosophy of Statistics
  • (1/25) U-Phil (Phil 6334) How should “prior information” enter in statistical inference?
  • (1/27) Winner of the January 2014 palindrome contest (rejected post)
  • (1/29) BOSTON COLLOQUIUM FOR PHILOSOPHY OF SCIENCE: Revisiting the Foundations of Statistics
  • (1/31) Phil 6334: Day #2 Slides


[1] Monthly memory lanes began at the blog’s 3-year anniversary in Sept, 2014.

[2] New Rule, July 30, 2016-very convenient.

Categories: 3-year memory lane, Bayesian/frequentist, Statistics | 1 Comment

The “P-values overstate the evidence against the null” fallacy



The allegation that P-values overstate the evidence against the null hypothesis continues to be taken as gospel in discussions of significance tests. All such discussions, however, assume a notion of “evidence” that’s at odds with significance tests–generally Bayesian probabilities of the sort used in the Jeffreys-Lindley disagreement (default or “I’m selecting from an urn of nulls” variety). Szucs and Ioannidis (in a draft of a 2016 paper) claim “it can be shown formally that the definition of the p value does exaggerate the evidence against H0” (p. 15) and they reference the paper I discuss below: Berger and Sellke (1987). It’s not that a single small P-value provides good evidence of a discrepancy (even assuming the model, and no biasing selection effects); Fisher and others warned against over-interpreting an “isolated” small P-value long ago. But the formulation of the “P-values overstate the evidence” meme introduces brand new misinterpretations into an already confused literature! The following are snippets from some earlier posts–mostly this one–and also include some additions from my new book (forthcoming).

Categories: Bayesian/frequentist, fallacy of rejection, highly probable vs highly probed, P-values, Statistics | 47 Comments


3 years ago…

MONTHLY MEMORY LANE: 3 years ago: December 2013. I mark in red three posts from each month that seem most apt for general background on key issues in this blog, excluding those reblogged recently[1], and in green up to 3 others I’d recommend[2].  Posts that are part of a “unit” or a group count as one. In this post, that makes 12/27-12/28 count as one.

December 2013

  • (12/3) Stephen Senn: Dawid’s Selection Paradox (guest post)
  • (12/7) FDA’s New Pharmacovigilance
  • (12/9) Why ecologists might want to read more philosophy of science (UPDATED)
  • (12/11) Blog Contents for Oct and Nov 2013
  • (12/14) The error statistician has a complex, messy, subtle, ingenious piece-meal approach
  • (12/15) Surprising Facts about Surprising Facts
  • (12/19) A. Spanos lecture on “Frequentist Hypothesis Testing”
  • (12/24) U-Phil: Deconstructions [of J. Berger]: Irony & Bad Faith 3
  • (12/25) “Bad Arguments” (a book by Ali Almossawi)
  • (12/26) Mascots of Bayesneon statistics (rejected post)
  • (12/27) Deconstructing Larry Wasserman
  • (12/28) More on deconstructing Larry Wasserman (Aris Spanos)
  • (12/28) Wasserman on Wasserman: Update! December 28, 2013
  • (12/31) Midnight With Birnbaum (Happy New Year)

[1] Monthly memory lanes began at the blog’s 3-year anniversary in Sept, 2014.

[2] New Rule, July 30, 2016-very convenient.

Categories: 3-year memory lane, Bayesian/frequentist, Error Statistics, Statistics | 1 Comment

“Tests of Statistical Significance Made Sound”: excerpts from B. Haig



I came across a paper, “Tests of Statistical Significance Made Sound,” by Brian Haig, a psychology professor at the University of Canterbury, New Zealand. It hits most of the high notes regarding statistical significance tests, their history & philosophy and, refreshingly, is in the error statistical spirit! I’m pasting excerpts from his discussion of “The Error-Statistical Perspective” starting on p. 7.[1]

The Error-Statistical Perspective

An important part of scientific research involves processes of detecting, correcting, and controlling for error, and mathematical statistics is one branch of methodology that helps scientists do this. In recognition of this fact, the philosopher of statistics and science, Deborah Mayo (e.g., Mayo, 1996), in collaboration with the econometrician, Aris Spanos (e.g., Mayo & Spanos, 2010, 2011), has systematically developed, and argued in favor of, an error-statistical philosophy for understanding experimental reasoning in science. Importantly, this philosophy permits, indeed encourages, the local use of ToSS, among other methods, to manage error.

Categories: Bayesian/frequentist, Error Statistics, fallacy of rejection, P-values, Statistics | 12 Comments

Gelman at the PSA: “Confirmationist and Falsificationist Paradigms in Statistical Practice”: Comments & Queries

To resume sharing some notes I scribbled down on the contributions to our Philosophy of Science Association symposium on Philosophy of Statistics (Nov. 4, 2016), I’m up to Gelman. Comments on Gigerenzer and Glymour are here and here. Gelman didn’t use slides but gave a very thoughtful, extemporaneous presentation on his conception of “falsificationist Bayesianism”, its relation to current foundational issues, as well as to error statistical testing. My comments follow his abstract.

Confirmationist and Falsificationist Paradigms in Statistical Practice



Andrew Gelman

There is a divide in statistics between classical frequentist and Bayesian methods. Classical hypothesis testing is generally taken to follow a falsificationist, Popperian philosophy in which research hypotheses are put to the test and rejected when data do not accord with predictions. Bayesian inference is generally taken to follow a confirmationist philosophy in which data are used to update the probabilities of different hypotheses. We disagree with this conventional Bayesian-frequentist contrast: We argue that classical null hypothesis significance testing is actually used in a confirmationist sense and in fact does not do what it purports to do; and we argue that Bayesian inference cannot in general supply reasonable probabilities of models being true. The standard research paradigm in social psychology (and elsewhere) seems to be that the researcher has a favorite hypothesis A. But, rather than trying to set up hypothesis A for falsification, the researcher picks a null hypothesis B to falsify, which is then taken as evidence in favor of A. Research projects are framed as quests for confirmation of a theory, and once confirmation is achieved, there is a tendency to declare victory and not think too hard about issues of reliability and validity of measurements.

Categories: Bayesian/frequentist, Gelman, Shalizi, Statistics | 148 Comments

Taking errors seriously in forecasting elections



Science isn’t about predicting one-off events like election results, but that doesn’t mean the way to make election forecasts scientific (which they should be) is to build “theories of voting.” A number of people have sent me articles on statistical aspects of the recent U.S. election, but I don’t have much to say and I like to keep my blog non-political. I won’t violate this rule in making a couple of comments on Faye Flam’s Nov. 11 article: “Why Science Couldn’t Predict a Trump Presidency”[i].

For many people, Donald Trump’s surprise election victory was a jolt to the very idea that humans are rational creatures. It tore away the comfort of believing that science has rendered our world predictable. The upset led two New York Times reporters to question whether data science could be trusted in medicine and business. A Guardian columnist declared that big data works for physics but breaks down in the realm of human behavior.

Categories: Bayesian/frequentist, evidence-based policy | 15 Comments
