Why significance testers should reject the argument to “redefine statistical significance”, even if they want to lower the p-value*


An argument that assumes the very thing that was to have been argued for is guilty of begging the question; signing on to an argument whose conclusion you favor even though you cannot defend its premises is to argue unsoundly, and in bad faith. When a whirlpool of “reforms” subliminally alters the nature and goals of a method, falling into these sins can be quite inadvertent. Start with a simple point on defining the power of a statistical test.

I. Redefine Power?

Given that power is one of the most confused concepts from Neyman-Pearson (N-P) frequentist testing, it’s troubling that in “Redefine Statistical Significance”, power gets redefined too. “Power,” we’re told, is a Bayes Factor BF “obtained by defining H1 as putting ½ probability on μ = ± m for the value of m that gives 75% power for the test of size α = 0.05. This H1 represents an effect size typical of that which is implicitly assumed by researchers during experimental design.” (material under Figure 1).

The Bayes factor discussed is of H1 over H0, in two-sided Normal testing of H0: μ = 0 versus H1: μ ≠ 0.

“The variance of the observations is known. Without loss of generality, we assume that the variance is 1, and the sample size is also 1.” (p. 2 supplementary)

“This is achieved by assuming that μ under the alternative hypothesis is equal to ± (z0.025 + z0.75) = ± 2.63 [1.96 + .67]. That is, the alternative hypothesis places ½ its prior mass on 2.63 and ½ its mass on -2.63”. (p. 2 supplementary)

Putting to one side whether this is “without loss of generality”, the use of “power” is quite different from the correct definition. The power of a test T (with type I error probability α) to detect a discrepancy μ’ is the probability T generates an observed difference that is statistically significant at level α, assuming μ = μ’. The value z = 2.63 comes from the fact that the alternative against which this test has power .75 is the value .67 SE in excess of the cut-off for rejection. (Since an SE is 1, they add .67 to 1.96.) I don’t really see why it’s advantageous to ride roughshod over the definition of power, and it’s not the main point of this blogpost, but it’s worth noting if you’re to avoid sinking into the quicksand.
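To make the definition concrete, here is a minimal sketch in Python (standard library only) that recovers the alternative against which a test has a stated power. It assumes, as in the supplementary material, a single Normal observation with SE = 1, and it ignores the negligible chance of rejecting in the opposite tail. The function name `alternative_with_power` is mine, introduced only for illustration.

```python
from statistics import NormalDist

Z = NormalDist()  # standard Normal: variance 1 and sample size 1, so SE = 1

def alternative_with_power(alpha_two_sided: float, power: float) -> float:
    """Return the mu' against which the upper-tail rejection rule has the stated power.

    The test rejects when z >= the upper alpha/2 cut-off; the tiny probability
    of rejecting in the lower tail is ignored (an assumption of this sketch).
    """
    cutoff = Z.inv_cdf(1 - alpha_two_sided / 2)  # about 1.96 for alpha = .05
    return cutoff + Z.inv_cdf(power)             # cut-off plus z_power standard errors

m75 = alternative_with_power(0.05, 0.75)  # alternative detected with power .75
m80 = alternative_with_power(0.05, 0.80)  # alternative detected with power .80
print(round(m75, 2), round(m80, 2))
```

The two values printed are the alternatives the size .05 test detects with power .75 and .80; no prior is needed to compute them.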

Let’s distinguish the appropriateness of the test for a Bayesian, from its appropriateness as a criticism of significance tests. The latter is my sole focus. The criticism is that, at least if we accept these Bayesian assignments of priors, the posterior probability on H0 will be larger than the p-value. So if you were to interpret a p-value as a posterior on H0 (a fallacy) or if you felt intuitively that a .05 (2-sided) statistically significant result should correspond to something closer to a .05 posterior on H0, you should instead use a p-value of .005–or so it is argued. I’m not sure of the posterior on H0, but the BF is between around 14 and 26.[1] That is the argument. If you lower the required p-value, it won’t be so easy to get statistical significance, and irreplicable results won’t be as common. [2]

The alternative corresponding to the preferred p =.005 requirement

“corresponds to a classical, two-sided test of size α = 0.005. The alternative hypothesis for this Bayesian test places ½ mass at 2.81 and ½ mass at -2.81. The null hypothesis for this test is rejected if the Bayes factor exceeds 25.7. Note that this curve is nearly identical to the “power” curve if that curve had been defined using 80% power, rather than 75% power. The Power curve for 80% power would place ½ its mass at ±2.80”. (Supplementary, p. 2)

z = 2.8 comes from adding .84 SE to the cut-off: 1.96 SE +.84 SE = 2.8. This gets to the alternative vs which the α = 0.05 test has 80% power. (See my previous post on power.)
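For readers who want to check the Bayes factor arithmetic, the following is a rough sketch in Python of the two-point BF described in the quoted passages: H1 puts ½ mass on +m and ½ on -m, H0 is μ = 0, and there is one Normal observation with variance 1. The function name `bayes_factor` and the choice to evaluate it exactly at the two-sided .005 cut-off are my assumptions for illustration, not the paper’s code. The 1:10 prior odds are those mentioned in note [1].

```python
from statistics import NormalDist

Z = NormalDist()  # sampling distribution of z under mu = 0, SE = 1

def bayes_factor(z: float, m: float) -> float:
    """BF of H1 (mass 1/2 on +m, 1/2 on -m) over H0: mu = 0, given one observation z."""
    like_h1 = 0.5 * (Z.pdf(z - m) + Z.pdf(z + m))  # average likelihood over the two points
    like_h0 = Z.pdf(z)                             # likelihood under the point null
    return like_h1 / like_h0

z_005 = Z.inv_cdf(1 - 0.005 / 2)   # two-sided .005 cut-off, roughly 2.81
bf = bayes_factor(z_005, z_005)    # alternative at +/- the cut-off, data at the cut-off
posterior_odds = bf * (1 / 10)     # prior odds of H1 to H0 taken to be 1:10
print(round(bf, 1), round(posterior_odds, 1))
```

The BF should land close to the 25.7 quoted from the supplement. Multiplying by the 1:10 prior odds shows how modest the resulting posterior odds are.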

Is this a good form of inference from the Bayesian perspective? (Why are we comparing μ = 0 and μ = 2.8?) As is always the case with “p-values exaggerate” arguments, there’s the supposition that testing should be on a point null hypothesis, with a lump of prior probability given to H0 (or to a region around 0 so small that it’s indistinguishable from 0). I leave those concerns for Bayesians, and I’m curious to hear from you. More importantly, does it constitute a relevant and sound criticism of significance testing? Let’s be clear: a tester might well have her own reasons for preferring z = 2.8 rather than z = 1.96, but that’s not the question. The question is whether they’ve provided a good argument for the significance tester to do so.

II. What might the significance tester say?

For starters, when she sets .8 power to detect a discrepancy, she doesn’t “implicitly assume” it’s a plausible population discrepancy, but simply one she wants the test to detect by producing a statistically significant difference (with probability .8). And if the test does produce a difference that differs statistically significantly from H0, she does not infer the alternative against which the test had high power, call it μ’. (The supposition that she does grows out of fallaciously transposing “the conditional” involved in power.) Such a rule of interpreting data would have a high error probability of erroneously inferring a discrepancy μ’ (here 2.8).

The significance tester merely seeks evidence of some (genuine) discrepancy from 0, and eschews a comparative inference such as the ratio of the probability of the data under the points 0 and 2.63 (or 2.8). I don’t say there’s no role for a comparative inference, nor preclude someone arguing it is comparing how well μ = 2.8 “explains” the data compared to μ = 0 (given the assumptions), but the form of inference is so different from significance testing, it’s hard to compare them. She definitely wouldn’t ignore all the points in between 0 and 2.8. A one-sided test is preferable (unless the direction of discrepancy is of no interest). While one or two-sided doesn’t make that much difference for a significance tester, it makes a big difference for the type of Bayesian analysis that is appealed to in the “p-values exaggerate” literature. That’s because a lump prior, often .5 (but here .9!), is placed on the point 0 null. Without the lump, the p-value tends to be close to the posterior probability for H0, as Casella and Berger (1987a,b) show–even though p-values and posteriors are actually measuring very different things.

“In fact it is not the case that P-values are too small, but rather that Bayes point null posterior probabilities are much too big!….Our concern should not be to analyze these misspecified problems, but to educate the user so that the hypotheses are properly formulated,” (Casella and Berger 1987 b, p. 334, p. 335).

There is a long and old literature on all this (at least since Edwards, Lindman and Savage 1963–let me know if you’re aware of older sources).

Those who lodge the “p-values exaggerate” critique often say, we’re just showing what would happen even if we made the strongest case for the alternative. No they’re not. They wouldn’t be putting the lump prior on 0 were they concerned not to bias things in favor of the null, and they wouldn’t be looking to compare 0 with so far away an alternative as 2.8 either.

The only way a significance tester can appraise or calibrate a measure such as a BF (and these will differ depending on the alternative picked) is to view it as a statistic and consider the probability of an even larger BF under varying assumptions about the value of μ. This is an error probability associated with the method. Accounts that appraise inferences according to the error probability of the method used I call error statistical (which is less equivocal than frequentist or other terms.)
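Here is one way this calibration might be sketched in Python. It assumes the two-point alternative discussed above (½ mass on ± m) and a single Normal observation with SE = 1. Under those assumptions the BF reduces to exp(-m²/2)·cosh(m·z), which grows with |z|, so the event “BF at least k” is the same as “|z| at least some cut-off”. The function names are hypothetical, chosen for this illustration.

```python
from math import acosh, exp
from statistics import NormalDist

Z = NormalDist()

def z_for_bf(k: float, m: float) -> float:
    """|z| at which the two-point BF (mass 1/2 on +/- m vs. mu = 0) equals k.

    Uses BF(z) = exp(-m**2 / 2) * cosh(m * z), valid for one N(mu, 1) observation.
    Requires k * exp(m**2 / 2) >= 1, which holds for any k >= 1.
    """
    return acosh(k * exp(m * m / 2)) / m

def prob_bf_at_least(k: float, m: float, mu: float) -> float:
    """Probability the method yields BF >= k when the data are generated with mean mu."""
    c = z_for_bf(k, m)
    return (1 - Z.cdf(c - mu)) + Z.cdf(-c - mu)  # upper tail plus lower tail

# Error probabilities of the rule "reject H0 when BF >= 25.7", alternative at +/- 2.81
for mu in (0.0, 1.0, 2.3, 2.8):
    print(mu, round(prob_bf_at_least(25.7, 2.81, mu), 3))
```

Read this way, the BF cut-off is just another test statistic with a type I error probability (the value at μ = 0) and a power curve (the values at the other μ).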

For example, the test that rejects H0 when z ≥ 1.96 (which is the .05 test, since they make it 2-sided) had, we said, .8 power to detect μ = 2.8, but the .005 test has only 50% power to do so. If one insists on a fixed .005 cut-off, a result that falls short of it is construed as no evidence against the null (or even evidence for it–for a Bayesian). The new test has only 30% probability of finding significance were the data generated by μ = 2.3. So the significance tester is rightly troubled by the raised type II error [3], although the members of an imaginary Toxic Co. (having the risks of their technology probed) might be happy as clams.[4]
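The power figures in this paragraph can be checked with a few lines of Python. This is a sketch under the same assumptions as the rest of the post (one Normal observation, SE = 1, rejection in the upper tail only); the helper name `power` is mine.

```python
from statistics import NormalDist

Z = NormalDist()

def power(mu: float, alpha_two_sided: float) -> float:
    """P(z >= cut-off; mu) for the upper-tail rejection rule of a two-sided size-alpha test."""
    cutoff = Z.inv_cdf(1 - alpha_two_sided / 2)  # 1.96 for .05, about 2.81 for .005
    return 1 - Z.cdf(cutoff - mu)

for alpha in (0.05, 0.005):
    for mu in (2.8, 2.3):
        print(f"alpha = {alpha}: power against mu = {mu} is {power(mu, alpha):.2f}")
```

The type II error probability against each alternative is one minus the printed power.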

Suppose we do attain statistical significance at the recommended .005 level, say z = 2.8. The BF advocate assures us we can infer μ = 2.8, which is now 25 times as likely as μ = 0 (if all the various Bayesian assignments hold). The trouble is, the significance tester doesn’t want to claim good evidence for μ = 2.8. The significance tester merely infers an indication of a discrepancy (an isolated low p-value doesn’t suffice, and the assumptions also must be checked). She’d never ignore all the points other than 0 and ± 2.8. Suppose we were testing μ ≤ 2.7 vs. μ > 2.7, and observed z = 2.8. What is the p-value associated with this observed difference? The answer is ~.46. (Her inferences are not in terms of points but of discrepancies from the null, but I’m trying to relate the criticism to significance tests.) To obtain μ ≥ 2.7 using one-sided confidence intervals would require dropping the confidence level to about .54, leaving an error probability of ~.46! An absurdly high error probability.
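As a quick check on that p-value, here is a minimal Python sketch, again assuming one Normal observation with SE = 1. The function name `one_sided_p` is introduced here only for illustration.

```python
from statistics import NormalDist

Z = NormalDist()

def one_sided_p(z_obs: float, mu0: float) -> float:
    """p-value for testing mu <= mu0 vs. mu > mu0: P(Z >= z_obs; mu = mu0), with SE = 1."""
    return 1 - Z.cdf(z_obs - mu0)

p = one_sided_p(2.8, 2.7)  # observed z = 2.8, null boundary at 2.7
error_prob_of_bound = p    # error probability of a lower confidence bound placed at 2.7
print(round(p, 2))
```

The same number is the error probability attached to a one-sided lower bound at 2.7, which is why inferring a discrepancy that large is so poorly warranted.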

The one-sided lower .975 bound with z = 2.8 would only entitle inferring μ > .84 (2.8 – 1.96)–quite a bit smaller than inferring μ = 2.8. If confidence levels are altered as well (and I don’t see why they wouldn’t be), the one-sided lower .9975 bound (2.8 – 2.81) would only be μ > 0. Thus, while the lump prior on H0 results in a bias in favor of a null–increasing the type II error probability–it’s of interest to note that achieving the recommended p-value licenses an inference much larger than what the significance tester would allow.
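The lower bounds can be computed directly. The sketch below assumes SE = 1 and takes the confidence levels to be the one-sided counterparts of the two-sided .05 and .005 cut-offs, namely .975 and .9975; that pairing is my assumption, and `lower_bound` is a hypothetical helper name.

```python
from statistics import NormalDist

Z = NormalDist()

def lower_bound(z_obs: float, level: float) -> float:
    """One-sided lower confidence bound for mu at the given level, with SE = 1."""
    return z_obs - Z.inv_cdf(level)

lb_975 = lower_bound(2.8, 0.975)    # counterpart of the two-sided .05 cut-off
lb_9975 = lower_bound(2.8, 0.9975)  # counterpart of the two-sided .005 cut-off
print(round(lb_975, 2), round(lb_9975, 2))
```

Both bounds sit far below 2.8, the point value that the comparative inference singles out.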

Note, their inferences remain comparative in the sense of “H1 over H0” on a given measure; the comparison doesn’t actually say there’s evidence against (or for) either (unless it goes on to compute a posterior, not just odds ratios or BFs), nor does it falsify either hypothesis. This just underscores the fact that the BF comparative inference is importantly different from significance tests which seek to falsify a null hypothesis, with a view toward learning if there are genuine discrepancies, and if so, their magnitude.

Significance tests do not assign probabilities to these parametric hypotheses, but even if one wanted to, the spiked priors needed for the criticism are questioned by Bayesians and frequentists alike. Casella and Berger (1987a) say that “concentrating mass on the point null hypothesis is biasing the prior in favor of H0 as much as possible” (p. 111) whether in one or two-sided tests. According to them “The testing of a point null hypothesis is one of the most misused statistical procedures.” (ibid., p. 106)

III. Why significance testers should reject the “redefine statistical significance” argument:

(i) If you endorse this particular Bayesian way of attaining the BF, fine, but then your argument begs the central question against the significance tester (or of the confidence interval estimator, for that matter). The significance tester is free to turn the situation around, as Fisher does, as refuting the assumptions:

Even if one were to imagine that H0 had an extremely high prior probability, says Fisher, never minding “what such a statement of probability a priori could possibly mean” (Fisher, 1973, p. 42), the resulting high posterior probability of H0, he thinks, would only show that “reluctance to accept a hypothesis strongly contradicted by a test of significance” (ibid., p. 44) … “…is not capable of finding expression in any calculation of probability a posteriori” (ibid., p. 43). Indeed, if one were to consider the claim about the prior probability to be itself a hypothesis, Fisher says, “it would be rejected at once by the observations at a level of significance almost as great [as reached by H0]. …Were such a conflict of evidence, as has here been imagined under discussion… in a scientific laboratory, it would, I suggest, be some prior assumption…that would certainly be impugned.” (p. 44)

(ii) Suppose, on the other hand, you don’t endorse these priors or the Bayesian computation on which the “redefine significance” argument turns. Since lowering the p-value cut-off doesn’t seem too harmful, you might tend to look the other way as to the argument on which it is based. Isn’t that OK? Not unless you’re prepared to have your students compute these BFs and/or posteriors in just the manner upon which the critique of significance tests rests. Will you say, “oh that was just for criticism, not for actual use”? Unless you’re prepared to defend the statistical analysis, you shouldn’t support it. Lowering the p-value that you require for evidence of a discrepancy, or getting more data (should you wish to do so) doesn’t require it.

Moreover, your student might point out that you still haven’t matched p-values and BFs (or posteriors on H0): they still differ, with the p-value being smaller. If you wanted to match the p-value and the posterior, you could do so very easily: use the frequency matching priors (which don’t use the spike). You could still lower the p-value to .005, and obtain a rejection region precisely identical to the Bayesian’s. Why isn’t that a better solution than one based on a conflicting account of statistical inference?

Of course, even that is to grant the problem as put before us by the Bayesian argument. If you’re following good error statistical practice you might instead shun all cut-offs. You’d report attained p-values, and wouldn’t infer a genuine effect until you’ve satisfied Fisher’s requirements: (a) Replicate yourself, show you can bring about results that “rarely fail to give us a statistically significant result” (1947, p. 14) and that you’re getting better at understanding the causal phenomenon involved. (b) Check your assumptions: the statistical model, the measurements, and the links between statistical measurements and research claims. (c) Make sure you adjust your error probabilities to take account of, or at least report, biasing selection effects (from cherry-picking, trying and trying again, multiple testing, flexible determinations, post-data subgroups)–according to the context. That’s what prespecified reports are to inform you of. The suggestion that these are somehow taken care of by adjusting the pool of hypotheses on which you base a prior will not do. (It’s their plausibility that often makes them so seductive, and anyway, the injury is to how well-tested claims are, not to their prior believability.) The appeal to diagnostic testing computations of “false positive rates” in this paper opens up a whole new urn of worms. Don’t get me started. (See related posts.)

A final word is from a guest post by Senn. Harold Jeffreys, he says, held that if you use the spike (which he introduced), you are to infer the hypothesis that achieves greater than .5 posterior probability.

Within the Bayesian framework, in abandoning smooth priors for lump priors, it is also necessary to change the probability standard. (In fact I speculate that the 1 in 20 standard seemed reasonable partly because of the smooth prior.) … A parsimony principle is used on the prior distribution. You can’t use it again on the posterior distribution. Once that is calculated, you should simply prefer the more probable model. The error that is made is not only to assume that P-values should be what they are not but that when one tries to interpret them in the way that one should not, the previous calibration survives.

It is as if in giving recommendations in dosing children one abandoned a formula based on age and adopted one based on weight but insisted on using the same number of kg one had used for years.

Error probabilities are not posterior probabilities. Certainly, there is much more to statistical analysis than P-values but they should be left alone rather than being deformed in some way to become second class Bayesian posterior probabilities. (Senn)

Please share your views, and alert me to errors. I will likely update this. Stay tuned for asterisks.
12/17 * I’ve already corrected a few typos.

[1] I do not mean the “false positive rate” defined in terms of α and (1 – β)–a problematic animal I put to one side here (Mayo 2003). Richard Morey notes that using their prior odds of 1:10, even the recommended BF of 26 gives us an unimpressive posterior odds ratio of 2.6 (email correspondence).

[2] Note what I call the “fallacy of replication”. It’s said to be too easy to get low p-values, but at the same time it’s too hard to get low p-values in replication. Is it too easy or too hard? That just shows it’s not the p-value at fault but cherry-picking and other biasing selection effects. Replicating a p-value is hard–when you’ve cheated or been sloppy the first time.

[3] They suggest increasing the sample size to get the power where it was with rejection at z = 1.96, and, while this is possible in some cases, increasing the sample size changes what counts as one sample. As n increases the discrepancy indicated by any level of significance decreases.

[4] The severe tester would report attained levels and, in this case, would report the discrepancies indicated and those ruled out with reasonable severity (Mayo and Spanos 2011). Keep in mind that statistical testing inferences are in the form of µ > µ’ = µ0 + δ, or µ ≤ µ’ = µ0 + δ, or the like. They are not to point values. As for the imaginary Toxic Co., I’d put the existence of a risk of interest in the null hypothesis of a one-sided test.

Related Posts

10/26/17: Going round and round again: a roundtable on reproducibility & lowering p-values

10/18/17: Deconstructing “A World Beyond P-values”

1/19/17: The “P-values overstate the evidence against the null” fallacy

8/28/16 Tragicomedy hour: p-values vs posterior probabilities vs diagnostic error rates

12/20/15 Senn: Double Jeopardy: Judge Jeffreys Upholds the Law, sequel to the pathetic p-value.

2/1/14 Comedy hour at the Bayesian epistemology retreat: highly probable vs highly probed vs B-boosts

11/25/14: How likelihoodists exaggerate evidence from statistical tests


Elements of this post are from Mayo 2018.


Benjamin, D. J., Berger, J., Johannesson, M., Nosek, B. A., Wagenmakers, E.-J., Berk, R., … Johnson, V. (2017, July 22), “Redefine statistical significance”, Nature Human Behaviour.

Berger, J. O. and Delampady, M. (1987). “Testing Precise Hypotheses” and “Rejoinder”, Statistical Science 2(3), 317-335.

Berger, J. O. and Sellke, T. (1987). “Testing a point null hypothesis: The irreconcilability of p values and evidence,” (with discussion). J. Amer. Statist. Assoc. 82: 112–139.

Casella, G. and Berger, R. (1987a). “Reconciling Bayesian and Frequentist Evidence in the One-sided Testing Problem,” (with discussion). J. Amer. Statist. Assoc. 82: 106–111, 123–139.

Casella, G. and Berger, R. (1987b). “Comment on Testing Precise Hypotheses by J. O. Berger and M. Delampady”, Statistical Science 2(3), 344–347.

Edwards, W., Lindman, H. and Savage, L. (1963). “Bayesian Statistical Inference for Psychological Research”, Psychological Review 70(3): 193-242.

Fisher, R. A. (1947). The Design of Experiments (4th ed.). Edinburgh: Oliver and Boyd. (First published 1935).

Fisher, R. A. (1973). Statistical Methods and Scientific Inference, 3rd ed,  New York: Hafner Press.

Ghosh, J. Delampady, M., and Samanta, T. (2006). An Introduction to Bayesian Analysis: Theory and Methods. New York: Springer.

Mayo, D. G. (2003). “Could Fisher, Jeffreys and Neyman have Agreed on Testing? Commentary on J. Berger’s Fisher Address,” Statistical Science 18: 19-24.

Mayo (2018), Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars. Cambridge (June 2018)

Mayo, D. G. and Spanos, A. (2011). “Error Statistics” in Philosophy of Statistics, Handbook of Philosophy of Science Volume 7 (General editors: Dov M. Gabbay, Paul Thagard and John Woods; Volume eds. Prasanta S. Bandyopadhyay and Malcolm R. Forster). Elsevier: 1-46.

Categories: Bayesian/frequentist, fallacy of rejection, P-values, reforming the reformers, spurious p values | 1 Comment

Erich Lehmann’s 100 Birthday: Neyman Pearson vs Fisher on P-values

Erich Lehmann 20 November 1917 – 12 September 2009

Erich Lehmann was born 100 years ago today! (20 November 1917 – 12 September 2009). Lehmann was Neyman’s first student at Berkeley (Ph.D 1942), and his framing of Neyman-Pearson (NP) methods has had an enormous influence on the way we typically view them.*


I got to know Erich in 1997, shortly after publication of EGEK (1996). One day, I received a bulging, six-page, handwritten letter from him in tiny, extremely neat scrawl (and many more after that).  He began by telling me that he was sitting in a very large room at an ASA (American Statistical Association) meeting where they were shutting down the conference book display (or maybe they were setting it up), and on a very long, wood table sat just one book, all alone, shiny red.

He said, “I wonder if it might be of interest to me!” So he walked up to it…. It turned out to be my Error and the Growth of Experimental Knowledge (1996, Chicago), which he reviewed soon after[0]. (What are the chances?) Some related posts on Lehmann’s letter are here and here.


Categories: Fisher, P-values, phil/history of stat | 3 Comments

Yoav Benjamini, “In the world beyond p < .05: When & How to use P < .0499…"


These were Yoav Benjamini’s slides, “In the world beyond p<.05: When & How to use P<.0499…” from our session at the ASA 2017 Symposium on Statistical Inference (SSI): A World Beyond p < 0.05. (Mine are in an earlier post.) He begins by asking:

However, it’s mandatory to adjust for selection effects, and Benjamini is one of the leaders in developing ways to carry out the adjustments. Even calling out the avenues for cherry-picking and multiple testing, long known to invalidate p-values, would make replication research more effective (and less open to criticism).

Categories: Error Statistics, P-values, replication research, selection effects | 22 Comments

Going round and round again: a roundtable on reproducibility & lowering p-values


There will be a roundtable on reproducibility Friday, October 27th (noon Eastern time), hosted by the International Methods Colloquium, on the reproducibility crisis in social sciences motivated by the paper, “Redefine statistical significance.” Recall, that was the paper written by a megateam of researchers as part of the movement to require p ≤ .005, based on appraising significance tests by a Bayes Factor analysis, with prior probabilities on a point null and a given alternative. It seems to me that if you’re prepared to scrutinize your frequentist (error statistical) method on grounds of Bayes Factors, then you must endorse using Bayes Factors (BFs) for inference to begin with. If you don’t endorse BFs–and, in particular, the BF required to get the disagreement with p-values–*, then it doesn’t make sense to appraise your non-Bayesian method on grounds of agreeing or disagreeing with BFs. For suppose you assess the recommended BFs from the perspective of an error statistical account–that is, one that checks how frequently the method would uncover or avoid the relevant mistaken inference.[i] Then, if you reach the stipulated BF level against a null hypothesis, you will find the situation is reversed, and the recommended BF exaggerates the evidence!  (In particular, with high probability, it gives an alternative H’ fairly high posterior probability, or comparatively higher probability, even though H’ is false.) Failing to reach the BF cut-off, by contrast, can find no evidence against, and even finds evidence for, a null hypothesis with high probability, even when non-trivial discrepancies exist. They’re measuring very different things, and it’s illicit to expect an agreement on numbers.[ii] We’ve discussed this quite a lot on this blog (2 are linked below [iii]).

If the given list of panelists is correct, it looks to be 4 against 1, but I’ve no doubt that Lakens can handle it.


Categories: Announcement, P-values, reforming the reformers, selection effects | 5 Comments

Deconstructing “A World Beyond P-values”

A world beyond p-values?

I was asked to write something explaining the background of my slides (posted here) in relation to the recent ASA “A World Beyond P-values” conference. I took advantage of some long flight delays on my return to jot down some thoughts:

The contrast between the closing session of the conference “A World Beyond P-values,” and the gist of the conference itself, shines a light on a pervasive tension within the “Beyond P-Values” movement. Two very different debates are taking place. First there’s the debate about how to promote better science. This includes welcome reminders of the timeless demands of rigor and integrity required to avoid deceiving ourselves and others–especially crucial in today’s world of high-powered searches and Big Data. That’s what the closing session was about. [1]

Categories: P-values, Philosophy of Statistics, reforming the reformers | 8 Comments

Statistical skepticism: How to use significance tests effectively: 7 challenges & how to respond to them

Here are my slides from the ASA Symposium on Statistical Inference: “A World Beyond p < .05” in the session, “What are the best uses for P-values?”. (Aside from me, our session included Yoav Benjamini and David Robinson, with chair: Nalini Ravishanker.)


  • Why use a tool that infers from a single (arbitrary) P-value that pertains to a statistical hypothesis H0 to a research claim H*?
  • Why use an incompatible hybrid (of Fisher and N-P)?
  • Why apply a method that uses error probabilities, the sampling distribution, researcher “intentions” and violates the likelihood principle (LP)? You should condition on the data.
  • Why use methods that overstate evidence against a null hypothesis?
  • Why do you use a method that presupposes the underlying statistical model?
  • Why use a measure that doesn’t report effect sizes?
  • Why do you use a method that doesn’t provide posterior probabilities (in hypotheses)?


Categories: P-values, spurious p values, statistical tests, Statistics | Leave a comment

New venues for the statistics wars

I was part of something called “a brains blog roundtable” on the business of p-values earlier this week–I’m glad to see philosophers getting involved.

Next week I’ll be in a session that I think is intended to explain what’s right about P-values at an ASA Symposium on Statistical Inference: “A World Beyond p < .05”.

Categories: Announcement, Bayesian/frequentist, P-values | 3 Comments

Thieme on the theme of lowering p-value thresholds (for Slate)


Here’s an article by Nick Thieme on the same theme as my last blogpost. Thieme, who is Slate’s 2017 AAAS Mass Media Fellow, is the first person to interview me on p-values who (a) was prepared to think through the issue for himself (or herself), and (b) included more than a tiny fragment of my side of the exchange.[i] Please share your comments.

Will Lowering P-Value Thresholds Help Fix Science? P-values are already all over the map, and they’re also not exactly the problem.



Illustration by Slate


Last week a team of 72 scientists released the preprint of an article attempting to address one aspect of the reproducibility crisis, the crisis of conscience in which scientists are increasingly skeptical about the rigor of our current methods of conducting scientific research.

Their suggestion? Change the threshold for what is considered statistically significant. The team, led by Daniel Benjamin, a behavioral economist from the University of Southern California, is advocating that the “probability value” (p-value) threshold for statistical significance be lowered from the current standard of 0.05 to a much stricter threshold of 0.005.

Categories: P-values, reforming the reformers, spurious p values | 14 Comments

“A megateam of reproducibility-minded scientists” look to lowering the p-value


Having discussed the “p-values overstate the evidence against the null fallacy” many times over the past few years, I leave it to readers to disinter the issues (pro and con), and appraise the assumptions, in the most recent rehearsal of the well-known Bayesian argument. There’s nothing intrinsically wrong with demanding everyone work with a lowered p-value–if you’re so inclined to embrace a single, dichotomous standard without context-dependent interpretations, especially if larger sample sizes are required to compensate the loss of power. But lowering the p-value won’t solve the problems that vex people (biasing selection effects), and is very likely to introduce new ones (see my comment). Kelly Servick, a reporter from Science, gives the ingredients of the main argument given by “a megateam of reproducibility-minded scientists” in an article out today:

Categories: Error Statistics, highly probable vs highly probed, P-values, reforming the reformers | 55 Comments


3 years ago...


MONTHLY MEMORY LANE: 3 years ago: July 2014. I mark in red 3-4 posts from each month that seem most apt for general background on key issues in this blog, excluding those reblogged recently[1]. Posts that are part of a “unit” or a group count as one. This month there are three such groups: 7/8 and 7/10; 7/14 and 7/23; 7/26 and 7/31.

July 2014

  • (7/7) Winner of June Palindrome Contest: Lori Wike
  • (7/8) Higgs Discovery 2 years on (1: “Is particle physics bad science?”)
  • (7/10) Higgs Discovery 2 years on (2: Higgs analysis and statistical flukes)
  • (7/14) “P-values overstate the evidence against the null”: legit or fallacious? (revised)
  • (7/23) Continued:”P-values overstate the evidence against the null”: legit or fallacious?
  • (7/26) S. Senn: “Responder despondency: myths of personalized medicine” (Guest Post)
  • (7/31) Roger Berger on Stephen Senn’s “Blood Simple” with a response by Senn (Guest Posts)

[1] Monthly memory lanes began at the blog’s 3-year anniversary in Sept, 2014.







Categories: 3-year memory lane, Higgs, P-values | Leave a comment

If you’re seeing limb-sawing in P-value logic, you’re sawing off the limbs of reductio arguments

I was just reading a paper by Martin and Liu (2014) in which they allude to the “questionable logic of proving H0 false by using a calculation that assumes it is true” (p. 1704). They say they seek to define a notion of “plausibility” that

“fits the way practitioners use and interpret p-values: a small p-value means H0 is implausible, given the observed data,” but they seek “a probability calculation that does not require one to assume that H0 is true, so one avoids the questionable logic of proving H0 false by using a calculation that assumes it is true” (Martin and Liu 2014, p. 1704).

Questionable? A very standard form of argument is a reductio (ad absurdum) wherein a claim C is inferred (i.e., detached) by falsifying ~C, that is, by showing that assuming ~C entails something in conflict with (if not logically contradicting) known results or known truths [i]. Actual falsification in science is generally a statistical variant of this argument. Supposing H0 in p-value reasoning plays the role of ~C. Yet some aver it thereby “saws off its own limb”!

Categories: P-values, reforming the reformers, Statistics | 13 Comments

Er, about those other approaches, hold off until a balanced appraisal is in

I could have told them that the degree of accordance enabling the ASA’s “6 principles” on p-values was unlikely to be replicated when it came to most of the “other approaches” with which some would supplement or replace significance tests–notably Bayesian updating, Bayes factors, or likelihood ratios (confidence intervals are dual to hypotheses tests). [My commentary is here.] So now they may be advising a “hold off” or “go slow” approach until some consilience is achieved. Is that it? I don’t know. I was tweeted an article about the background chatter taking place behind the scenes; I wasn’t one of the people interviewed for this. Here are some excerpts; I may add more later after it has had time to sink in. (Check back later.)

“Reaching for Best Practices in Statistics: Proceed with Caution Until a Balanced Critique Is In”

J. Hossiason

“[A]ll of the other approaches*, as well as most statistical tools, may suffer from many of the same problems as the p-values do. What level of likelihood ratio in favor of the research hypothesis will be acceptable to the journal? Should scientific discoveries be based on whether posterior odds pass a specific threshold (P3)? Does either measure the size of an effect (P5)?…How can we decide about the sample size needed for a clinical trial—however analyzed—if we do not set a specific bright-line decision rule? 95% confidence intervals or credence intervals…offer no protection against selection when only those that do not cover 0, are selected into the abstract (P4).” (Benjamini, ASA commentary, pp. 3-4)

What’s sauce for the goose is sauce for the gander, right? Many statisticians seconded George Cobb who urged “the board to set aside time at least once every year to consider the potential value of similar statements” to the recent ASA p-value report. Disappointingly, a preliminary survey of leaders in statistics, many from the original p-value group, aired striking disagreements on best and worst practices with respect to these other approaches. The Executive Board is contemplating a variety of recommendations, minimally, that practitioners move with caution until they can put forward at least a few agreed upon principles for interpreting and applying Bayesian inference methods. The words we heard ranged from “go slow” to “moratorium” [emphasis mine]. Having been privy to some of the results of this survey, we at Stat Report Watch decided to contact some of the individuals involved. Continue reading

Categories: P-values, reforming the reformers, Statistics | 6 Comments

The ASA Document on P-Values: One Year On


I’m surprised it’s a year already since posting my published comments on the ASA Document on P-Values. Since then, there have been a slew of papers rehearsing the well-worn fallacies of tests (a tad bit more than the usual rate). Doubtless, the P-value Pow Wow raised people’s consciousnesses. I’m interested in hearing reader reactions/experiences in connection with the P-Value project (positive and negative) over the past year. (Use the comments, share links to papers; and/or send me something slightly longer for a possible guest post.)
Some people sent me a diagram from a talk by Stephen Senn (on “P-values and the art of herding cats”). He presents an array of different cat commentators, and for some reason Mayo cat is in the middle but way over on the left side, near the wall. I never got the key to interpretation. My contribution is below:

Chart by S. Senn

“Don’t Throw Out The Error Control Baby With the Bad Statistics Bathwater”

D. Mayo*[1]

The American Statistical Association is to be credited with opening up a discussion into p-values; now an examination of the foundations of other key statistical concepts is needed. Continue reading

Categories: Bayesian/frequentist, P-values, science communication, Statistics, Stephen Senn | 14 Comments

Hocus pocus! Adopt a magician’s stance, if you want to reveal statistical sleights of hand



Here’s the follow-up post to the one I reblogged on Feb 3 (please read that one first). When they sought to subject Uri Geller to the scrutiny of scientists, magicians had to be brought in because only they were sufficiently trained to spot the subtle sleight of hand shifts by which the magician tricks by misdirection. We, too, have to be magicians to discern the subtle misdirections and shifts of meaning in the discussions of statistical significance tests (and other methods)—even by the same statistical guide. We needn’t suppose anything deliberately devious is going on at all! Often, the statistical guidebook reflects shifts of meaning that grow out of one or another critical argument. These days, they trickle down quickly to statistical guidebooks, thanks to popular articles on the “statistics crisis in science”. The danger is that their own guidebooks contain inconsistencies. To adopt the magician’s stance is to be on the lookout for standard sleights of hand. There aren’t that many.[0]

I don’t know Jim Frost, but he gives statistical guidance at the minitab blog. The purpose of my previous post is to point out that Frost uses the probability of a Type I error in two incompatible ways in his posts on significance tests. I assumed he’d want to clear this up, but so far he has not. His response to a comment I made on his blog is this: Continue reading

Categories: frequentist/Bayesian, P-values, reforming the reformers, S. Senn, Statistics | 39 Comments

The “P-values overstate the evidence against the null” fallacy



The allegation that P-values overstate the evidence against the null hypothesis continues to be taken as gospel in discussions of significance tests. All such discussions, however, assume a notion of “evidence” that’s at odds with significance tests–generally Bayesian probabilities of the sort used in the Jeffreys-Lindley disagreement (default or “I’m selecting from an urn of nulls” variety). Szucs and Ioannidis (in a draft of a 2016 paper) claim “it can be shown formally that the definition of the p value does exaggerate the evidence against H0” (p. 15) and they reference the paper I discuss below: Berger and Sellke (1987). It’s not that a single small P-value provides good evidence of a discrepancy (even assuming the model, and no biasing selection effects); Fisher and others warned against over-interpreting an “isolated” small P-value long ago. But the formulation of the “P-values overstate the evidence” meme introduces brand new misinterpretations into an already confused literature! The following are snippets from some earlier posts–mostly this one–and also include some additions from my new book (forthcoming).

Categories: Bayesian/frequentist, fallacy of rejection, highly probable vs highly probed, P-values, Statistics | 47 Comments

Szucs & Ioannidis Revive the Limb-Sawing Fallacy




When logical fallacies of statistics go uncorrected, they are repeated again and again…and again. And so it is with the limb-sawing fallacy I first posted in one of my “Overheard at the Comedy Hour” posts.* It now resides as a comic criticism of significance tests in a paper by Szucs and Ioannidis (posted this week). Here’s their version:

“[P]aradoxically, when we achieve our goal and successfully reject H0 we will actually be left in complete existential vacuum because during the rejection of H0 NHST ‘saws off its own limb’ (Jaynes, 2003; p. 524): If we manage to reject H0 then it follows that pr(data or more extreme data|H0) is useless because H0 is not true” (p. 15).

Here’s Jaynes (p. 524):

“Suppose we decide that the effect exists; that is, we reject [null hypothesis] H0. Surely, we must also reject probabilities conditional on H0, but then what was the logical justification for the decision? Orthodox logic saws off its own limb.”

Ha! Ha! By this reasoning, no hypothetical testing or falsification could ever occur. As soon as H is falsified, the grounds for falsifying disappear! If H: all swans are white, then if I see a black swan, H is falsified. But according to this criticism, we can no longer assume the deduced prediction from H! What? Continue reading

Categories: Error Statistics, P-values, reforming the reformers, Statistics | 14 Comments

“Tests of Statistical Significance Made Sound”: excerpts from B. Haig



I came across a paper, “Tests of Statistical Significance Made Sound,” by Brian Haig, a psychology professor at the University of Canterbury, New Zealand. It hits most of the high notes regarding statistical significance tests, their history & philosophy and, refreshingly, is in the error statistical spirit! I’m pasting excerpts from his discussion of “The Error-Statistical Perspective” starting on p. 7.[1]

The Error-Statistical Perspective

An important part of scientific research involves processes of detecting, correcting, and controlling for error, and mathematical statistics is one branch of methodology that helps scientists do this. In recognition of this fact, the philosopher of statistics and science, Deborah Mayo (e.g., Mayo, 1996), in collaboration with the econometrician, Aris Spanos (e.g., Mayo & Spanos, 2010, 2011), has systematically developed, and argued in favor of, an error-statistical philosophy for understanding experimental reasoning in science. Importantly, this philosophy permits, indeed encourages, the local use of ToSS, among other methods, to manage error. Continue reading

Categories: Bayesian/frequentist, Error Statistics, fallacy of rejection, P-values, Statistics | 12 Comments

Glymour at the PSA: “Exploratory Research is More Reliable Than Confirmatory Research”

I resume my comments on the contributions to our symposium on Philosophy of Statistics at the Philosophy of Science Association. My earlier comment was on Gerd Gigerenzer’s talk. I move on to Clark Glymour’s “Exploratory Research Is More Reliable Than Confirmatory Research.” His complete slides are after my comments.

GLYMOUR’S ARGUMENT (in a nutshell):

“The anti-exploration argument has everything backwards,” says Glymour (slide #11). While John Ioannidis maintains that “Research findings are more likely true in confirmatory designs,” the opposite is so, according to Glymour. (Ioannidis 2005, Glymour’s slide #6). Why? To answer this he describes an exploratory research account for causal search that he has been developing:

(slide #5)

What’s confirmatory research for Glymour? It’s moving directly from rejecting a null hypothesis with a low P-value to inferring a causal claim. Continue reading

Categories: fallacy of rejection, P-values, replication research | 20 Comments

Gigerenzer at the PSA: “How Fisher, Neyman-Pearson, & Bayes Were Transformed into the Null Ritual”: Comments and Queries (ii)



Gerd Gigerenzer, Andrew Gelman, Clark Glymour and I took part in a very interesting symposium on Philosophy of Statistics at the Philosophy of Science Association last Friday. I jotted down lots of notes, but I’ll limit myself to brief reflections and queries on a small portion of each presentation in turn, starting with Gigerenzer’s “Surrogate Science: How Fisher, Neyman-Pearson, & Bayes Were Transformed into the Null Ritual.” His complete slides are below my comments. I may write this in stages, this being (i).



  1. Good scientific practice–bold theories, double-blind experiments, minimizing measurement error, replication, etc.–became reduced in the social science to a surrogate: statistical significance.

I agree that “good scientific practice” isn’t some great big mystery, and that “bold theories, double-blind experiments, minimizing measurement error, replication, etc.” are central and interconnected keys to finding things out in error prone inquiry. Do the social sciences really teach that inquiry can be reduced to cookbook statistics? Or is it simply that, in some fields, carrying out surrogate science suffices to be a “success”? Continue reading

Categories: Fisher, frequentist/Bayesian, Gigerenzer, P-values, spurious p values, Statistics | 11 Comments

For Statistical Transparency: Reveal Multiplicity and/or Just Falsify the Test (Remark on Gelman and Colleagues)



Gelman and Loken (2014) recognize that even without explicit cherry picking there is often enough leeway in the “forking paths” between data and inference so that by artful choices you may be led to one inference, even though it also could have gone another way. In good sciences, measurement procedures should interlink with well-corroborated theories and offer a triangulation of checks– often missing in the types of experiments Gelman and Loken are on about. Stating a hypothesis in advance, far from protecting from the verification biases, can be the engine that enables data to be “constructed” to reach the desired end [1].

[E]ven in settings where a single analysis has been carried out on the given data, the issue of multiple comparisons emerges because different choices about combining variables, inclusion and exclusion of cases…..and many other steps in the analysis could well have occurred with different data (Gelman and Loken 2014, p. 464).

An idea growing out of this recognition is to imagine the results of applying the same statistical procedure, but with different choices at key discretionary junctures–giving rise to a multiverse analysis, rather than a single data set (Steegen, Tuerlinckx, Gelman, and Vanpaemel 2016). One lists the different choices thought to be plausible at each stage of data processing. The multiverse displays “which constellation of choices corresponds to which statistical results” (p. 797). The result of this exercise can, at times, mimic the delineation of possibilities in multiple testing and multiple modeling strategies. Continue reading
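The idea of listing the choices at each discretionary juncture and displaying the result for every constellation can be sketched in a few lines. This is only a toy illustration under stated assumptions, not the procedure of Steegen et al.: the data values, the two junctures (`drop_outliers`, `transform`), the cutoff, and the function names are all hypothetical, and the statistic is a simple standardized difference in group means.

```python
import itertools
import math
import statistics

# Hypothetical toy data: (outcome, group) pairs, invented for illustration.
DATA = [(5.1, "a"), (4.8, "a"), (6.0, "a"), (5.5, "a"), (9.9, "a"),
        (5.9, "b"), (6.4, "b"), (6.1, "b"), (7.0, "b"), (6.6, "b")]

# Hypothetical discretionary junctures and the options deemed plausible at each.
CHOICES = {
    "drop_outliers": [False, True],  # exclude outcomes above a cutoff?
    "transform": ["raw", "log"],     # analyze raw or log outcomes?
}


def analyze(data, drop_outliers, transform, cutoff=9.0):
    """Standardized difference in means (group b minus group a) for one
    constellation of choices."""
    rows = [(y, g) for y, g in data if not (drop_outliers and y > cutoff)]
    f = math.log if transform == "log" else (lambda y: y)
    a = [f(y) for y, g in rows if g == "a"]
    b = [f(y) for y, g in rows if g == "b"]
    se = math.sqrt(statistics.variance(a) / len(a)
                   + statistics.variance(b) / len(b))
    return (statistics.mean(b) - statistics.mean(a)) / se


def multiverse(data, choices):
    """Map every constellation of choices to its statistical result."""
    names = list(choices)
    results = {}
    for combo in itertools.product(*(choices[n] for n in names)):
        results[combo] = analyze(data, **dict(zip(names, combo)))
    return results


if __name__ == "__main__":
    for combo, stat in multiverse(DATA, CHOICES).items():
        print(combo, round(stat, 2))
```

In this toy case the same data yield a small or a large standardized difference depending solely on whether one observation is excluded, which is the kind of dependence on discretionary choices the multiverse is meant to expose.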

Categories: Bayesian/frequentist, Error Statistics, Gelman, P-values, preregistration, reproducibility, Statistics | 9 Comments
