# reforming the reformers

## Memory Lane (4 years ago): Why significance testers should reject the argument to “redefine statistical significance”, even if they want to lower the p-value*

.

An argument that assumes the very thing that was to have been argued for is guilty of begging the question; signing on to an argument whose conclusion you favor even though you cannot defend its premises is to argue unsoundly, and in bad faith. When a whirlpool of “reforms” subliminally alter  the nature and goals of a method, falling into these sins can be quite inadvertent. Start with a simple point on defining the power of a statistical test.

I. Redefine Power?

Given that power is one of the most confused concepts from Neyman-Pearson (N-P) frequentist testing, it’s troubling that in “Redefine Statistical Significance”, power gets redefined too. “Power,” we’re told, is a Bayes Factor BF “obtained by defining H1 as putting ½ probability on μ = ± m for the value of m that gives 75% power for the test of size α = 0.05. This H1 represents an effect size typical of that which is implicitly assumed by researchers during experimental design.” (material under Figure 1).

The Bayes factor discussed is of H1 over H0, in two-sided Normal testing of H0: μ = 0 versus H1: μ ≠ 0.

“The variance of the observations is known. Without loss of generality, we assume that the variance is 1, and the sample size is also 1.” (p. 2 supplementary)

“This is achieved by assuming that μ under the alternative hypothesis is equal to ± (z0.025 + z0.75) = ± 2.63 [1.96. + .63]. That is, the alternative hypothesis places ½ its prior mass on 2.63 and ½ its mass on -2.63”. (p. 2 supplementary)

Putting to one side whether this is “without loss of generality”, the use of “power” is quite different from the correct definition. The power of a test T  (with type I error probability α) to detect a discrepancy μ’ is the probability T generates an observed difference that is statistically significant at level α, assuming μ = μ’. The value z = 2.63 comes from the fact that the alternative against which this test has power .75 is the value .63 SE in excess of the cut-off for rejection. (Since an SE is 1, they add .63 to 1.96.) I don’t really see why it’s advantageous to ride roughshod on the definition of power, and it’s not the main point of this blogpost, but it’s worth noting if you’re to avoid sinking into the quicksand.

Let’s distinguish the appropriateness of the test for a Bayesian, from its appropriateness as a criticism of significance tests. The latter is my sole focus. The criticism is that, at least if we accept these Bayesian assignments of priors, the posterior probability on H0 will be larger than the p-value. So if you were to interpret a p-value as a posterior on H0 (a fallacy) or if you felt intuitively that a .05 (2-sided) statistically significant result should correspond to something closer to a .05 posterior on H0, you should instead use a p-value of .005–or so it is argued. I’m not sure of the posterior on H0, but the BF is between around 14 and 26.[1] That is the argument. If you lower the required p-value, it won’t be so easy to get statistical significance, and irreplicable results won’t be as common. [2]

The alternative corresponding to the preferred p =.005 requirement

“corresponds to a classical, two-sided test of size α = 0.005. The alternative hypothesis for this Bayesian test places ½ mass at 2.81 and ½ mass at -2.81. The null hypothesis for this test is rejected if the Bayes factor exceeds 25.7. Note that this curve is nearly identical to the “power” curve if that curve had been defined using 80% power, rather than 75% power. The Power curve for 80% power would place ½ its mass at ±2.80”. (Supplementary, p. 2)

z = 2.8 comes from adding .84 SE to the cut-off: 1.96 SE +.84 SE = 2.8. This gets to the alternative vs which the α = 0.05 test has 80% power. (See my previous post on power.)

Is this a good form of inference from the Bayesian perspective? (Why are we comparing μ = 0 and μ = 2.8?). As is always the case with “p-values exaggerate” arguments, there’s the supposition that testing should be on a point null hypothesis, with a lump of prior probability given to H0 (or to a region around 0 so small that it’s indistinguishable from 0). I leave those concerns for Bayesians, and I’m curious to hear from you. More importantly, does it constitute a relevant and sound criticism of significance testing? Let’s be clear: a tester might well have her own reasons for preferring z = 2.8 rather than z = 1.96, but that’s not the question. The question is whether they’ve provided a good argument for the significance tester to do so?

II. What might the significance tester say?

For starters, when she sets .8 power to detect a discrepancy, she doesn’t “implicitly assume” it’s a plausible population discrepancy, but simply one she wants the test to detect by producing a statistically significant difference (with probability .8). And if the test does produce a difference that differs statistically significantly from H0, she does not infer the alternative against which the test had high power, call it μ’. (The supposition that she does grows out of fallaciously transposing “the conditional” involved in power.) Such a rule of interpreting data would have a high error probability of erroneously inferring a discrepancy μ’ (here 2.8).

The significance tester merely seeks evidence of some (genuine) discrepancy from 0, and eschews a comparative inference such as the ratio of the probability of the data under the points 0 and 2.63 (or 2.8). I don’t say there’s no role for a comparative inference, nor preclude someone arguing it is comparing how well μ = 2.8 “explains” the data compared to μ = 0 (given the assumptions), but the form of inference is so different from significance testing, it’s hard to compare them. She definitely wouldn’t ignore all the points in between 0 and 2.8. A one-sided test is preferable (unless the direction of discrepancy is of no interest). While one or two-sided doesn’t make that much difference for a significance tester, it makes a big difference for the type of Bayesian analyses that is appealed to in the “p-values exaggerate” literature. That’s because a lump prior, often .5 (but here .9!), is placed on the point 0 null. Without the lump, the p-value tends to be close to the posterior probability for H0, as Casella and Berger (1987a,b) show–even though p-values and posteriors are actually measuring very different things.

“In fact it is not the case that P-values are too small, but rather that Bayes point null posterior probabilities are much too big!….Our concern should not be to analyze these misspecified problems, but to educate the user so that the hypotheses are properly formulated,” (Casella and Berger 1987 b, p. 334, p. 335).

There is a long and old literature on all this (at least since Edwards, Lindman and Savage 1963–let me know if you’re aware of older sources).

Those who lodge the “p-values exaggerate” critique often say, we’re just showing what would happen even if we made the strongest case for the alternative. No they’re not. They wouldn’t be putting the lump prior on 0 were they concerned not to bias things in favor of the null, and they wouldn’t be looking to compare 0 with so far away an alternative as 2.8 either.

The only way a significance tester can appraise or calibrate a measure such as a BF (and these will differ depending on the alternative picked) is to view it as a statistic and consider the probability of an even larger BF under varying assumptions about the value of μ. This is an error probability associated with the method. Accounts that appraise inferences according to the error probability of the method used I call error statistical (which is less equivocal than frequentist or other terms.)

For example, rejecting H0 when z ≥ 1.96 (which is the .05 test, since they make it 2-sided), we said, had .8 power to detect μ = 2.8, but with the .005 test it has only 50% power to do so. If one insists on a fixed .005 cut-off, this is construed as no evidence against the null (or even evidence for it–for a Bayesian). The new test has only 30% probability of finding significance were the data generated by μ = 2.3. So the significance tester is rightly troubled by the raised type II error [3], although the members of an imaginary Toxic Co. (having the risks of their technology probed) might be happy as clams.[4]

Suppose we do attain statistical significance at the recommended .005 level, say z = 2.8. The BF advocate assures us we can infer μ = 2.8, which is now 25 times as likely as μ = 0, (if all the various Bayesian assignments hold). The trouble is, the significance tester doesn’t want to claim good evidence for μ = 2.8. The significance tester merely infers an indication of a discrepancy (an isolated low p-value doesn’t suffice, and the assumptions also must be checked). She’d never ignore all the points other than 0 and ± 2.8. Suppose we were testing μ ≤ 2.7 vs. μ > 2.7, and observed z = 2.8. What is the p-value associated with this observed difference? The answer is ~.46. (Her inferences are not in terms of points but of discrepancies from the null, but I’m trying to relate the criticism to significance tests. ) To obtain μ ≥ 2.7 using one-sided confidence intervals would require a confidence level of .46 .54. An absurdly low confidence level/high error probability.

The one-sided lower .975 bound with z = 2.8 would only entitle inferring μ > .84 (2.8 – 1.96)–quite a bit smaller than inferring μ = 2.8. If confidence levels are altered as well (and I don’t see why they wouldn’t be), the one-sided lower .995 bound would only be μ > 0. Thus, while the lump prior on  Hresults in a bias in favor of a null–increasing the type II error probability– it’s of interest to note that achieving the recommended p-value licenses an inference much larger than what the significance tester would allow.

Note, their inferences remain comparative in the sense of “H1 over H0” on a given measure, it doesn’t actually say there’s evidence against (or for) either (unless it goes on to compute a posterior, not just odds ratios or BFs), nor does it falsify either hypothesis. This just underscores the fact that the BF comparative inference is importantly different from significance tests which seek to falsify a null hypothesis, with a view toward learning if there are genuine discrepancies, and if so, their magnitude.

Significance tests do not assign probabilities to these parametric hypotheses, but even if one wanted to, the spiked priors needed for the criticism are questioned by Bayesians and frequentists alike. Casella and Berger (1987a) say that “concentrating mass on the point null hypothesis is biasing the prior in favor of H0 as much as possible” (p. 111) whether in one or two-sided tests. According to them “The testing of a point null hypothesis is one of the most misused statistical procedures.” (ibid., p. 106)

III. Why significance testers should reject the “redefine statistical significance” argument:

(i) If you endorse this particular Bayesian way of attaining the BF, fine, but then your argument begs the central question against the significance tester (or of the confidence interval estimator, for that matter). The significance tester is free to turn the situation around, as Fisher does, as refuting the assumptions:

Even if one were to imagine that H0  had an extremely high prior probability, says Fisher—never minding “what such a statement of probability a priori could possibly mean”(Fisher, 1973, p.42)—the resulting high posteriori probability to H0 , he thinks, would only show that “reluctance to accept a hypothesis strongly contradicted by a test of significance” (ibid., p. 44) … “…is not capable of finding expression in any calculation of probability a posteriori” (ibid., p. 43). Indeed, if one were to consider the claim about the priori probability to be itself a hypothesis, Fisher says, “it would be rejected at once by the observations at a level of significance almost as great [as reached by H0 ]. …Were such a conflict of evidence, as has here been imagined under discussion… in a scientific laboratory, it would, I suggest, be some prior assumption…that would certainly be impugned.” (p. 44)

(ii) Suppose, on the other hand, you don’t endorse these priors or the Bayesian computation on which the “redefine significance” argument turns. Since lowering the p-value cut-off doesn’t seem too harmful, you might tend to look the other way as to the argument on which it is based. Isn’t that OK? Not unless you’re prepared to have your students compute these BFs and/or posteriors in just the manner upon which the critique of significance tests rests. Will you say, “oh that was just for criticism, not for actual use”? Unless you’re prepared to defend the statistical analysis, you shouldn’t support it. Lowering the p-value that you require for evidence of a discrepancy, or getting more data (should you wish to do so) doesn’t require it.

Moreover, your student might point out that you still haven’t matched p-values and BFs (or posteriors on H0 ): They still differ, with the p-value being smaller. If you wanted to match the p-value and the posterior, you could do so very easily: use the frequency matching priors (which doesn’t use the spike). You could still lower the p-value to .005, and obtain a rejection region precisely identical to the Bayesian. Why isn’t that a better solution than one based on a conflicting account of statistical inference?

Of course, even that is to grant the problem as put before us by the Bayesian argument. If you’re following good error statistical practice you might instead shirk all cut-offs. You’d report attained p-values, and wouldn’t infer a genuine effect until you’ve satisfied Fisher’s requirements: (a) Replicate yourself, show you can bring about results that “rarely fail to give us a statistically significant result” (1947, p. 14) and that you’re getting better at understanding the causal phenomenon involved. (b) Check your assumptions: both the statistical model, the measurements, and the links between statistical measurements and research claims. (c) Make sure you adjust your error probabilities to take account of, or at least report, biasing selection effects (from cherry-picking, trying and trying again, multiple testing, flexible determinations, post-data subgroups)–according to the context. That’s what prespecified reports are to inform you of. The suggestion that these are somehow taken care of by adjusting the pool of hypotheses on which you base a prior will not do. (It’s their plausibility that often makes them so seductive, and anyway, the injury is to how well-tested claims are, not to their prior believability.) The appeal to diagnostic testing computations of “false positive rates” in this paper opens up a whole new urn of worms. Don’t get me started. (see related posts.)

A final word is from a guest post by Senn.  Harold Jeffreys, he says, held that if you use the spike (which he introduced), you are to infer the hypothesis that achieves greater than .5 posterior probability.

Within the Bayesian framework, in abandoning smooth priors for lump priors, it is also necessary to change the probability standard. (In fact I speculate that the 1 in 20 standard seemed reasonable partly because of the smooth prior.) … A parsimony principle is used on the prior distribution. You can’t use it again on the posterior distribution. Once that is calculated, you should simply prefer the more probable model. The error that is made is not only to assume that P-values should be what they are not but that when one tries to interpret them in the way that one should not, the previous calibration survives.

It is as if in giving recommendations in dosing children one abandoned a formula based on age and adopted one based on weight but insisted on using the same number of kg one had used for years.

Error probabilities are not posterior probabilities. Certainly, there is much more to statistical analysis than P-values but they should be left alone rather than being deformed in some way to become second class Bayesian posterior probabilities. (Senn)

Please share your views, and alert me to errors. I will likely update this. Stay tuned for asterisks.
12/17 * I’ve already corrected a few typos.

[1] I do not mean the “false positive rate” defined in terms of α and (1 – β)–a problematic animal I put to one side here (Mayo 2003). Richard Morey notes that using their prior odds of 1:10, even the recommended BF of 26 gives us an unimpressive  posterior odds ratio of 2.6 (email correspondence).

[2] Note what I call the “fallacy of replication”. It’s said to be too easy to get low p-values, but at the same time it’s too hard to get low p-values in replication. Is it too easy or too hard? That just shows it’s not the p-value at fault but cherry-picking and other biasing selection effects. Replicating a p-value is hard–when you’ve cheated or been sloppy  the first time.

[3] They suggest increasing the sample size to get the power where it was with rejection at z = 1.96, and, while this is possible in some cases, increasing the sample size changes what counts as one sample. As n increases the discrepancy indicated by any level of significance decreases.

[4] The severe tester would report attained levels and,in this case, would indicate the the discrepancies indicated and ruled out with reasonable severity. (Mayo and Spanos 2011). Keep in mind that statistical testing inferences are  in the form of µ > µ’ =µ+ δ,  or µ ≤ µ’ =µ+ δ  or the like. They are not to point values. As for the imaginary Toxic Co., I’d put the existence of a risk of interest in the null hypothesis of a one-sided test.

Related Posts

Elements of this post are from Mayo 2018.

References

Benjamin, D. J., Berger, J., Johannesson, M., Nosek, B. A., Wagenmakers, E.-J., Berk, R., 3 … Johnson, V. (2017, July 22), “Redefine statistical significance“, Nature Human Behavior.

Berger, J. O. and Delampady, M. (1987). “Testing Precise Hypotheses” and “Rejoinder“, Statistical Science 2(3), 317-335.

Berger, J. O. and Sellke, T.  (1987). “Testing a point null hypothesis: The irreconcilability of p values and evidence,” (with discussion). J. Amer. Statist. Assoc. 82: 112–139.

Cassella G. and Berger, R. (1987a). “Reconciling Bayesian and Frequentist Evidence in the One-sided Testing Problem,” (with discussion). J. Amer. Statist. Assoc. 82 106–111, 123–139.

Cassella, G. and Berger, R. (1987b). “Comment on Testing Precise Hypotheses by J. O. Berger and M. Delampady”, Statistical Science 2(3), 344–347.

Edwards, W., Lindman, H. and Savage, L. (1963). “Bayesian Statistical Inference for Psychological Research”, Psychological Review 70(3): 193-242.

Fisher, R. A. (1947). The Design of Experiments (4th ed.). Edinburgh: Oliver and Boyd. (First published 1935).

Fisher, R. A. (1973). Statistical Methods and Scientific Inference, 3rd ed,  New York: Hafner Press.

Ghosh, J. Delampady, M., and Samanta, T. (2006). An Introduction to Bayesian Analysis: Theory and Methods. New York: Springer.

Mayo, D. G. (2003). “Could Fisher, Jeffreys and Neyman have Agreed on Testing? Commentary on J. Berger’s Fisher Address,” Statistical Science 18: 19-24.

Mayo (2018), Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars. Cambridge (June 2018)

Mayo, D. G. and Spanos, A. (2011) “Error Statistics” in Philosophy of Statistics , Handbook of Philosophy of Science Volume 7 Philosophy of Statistics, (General editors: Dov M. Gabbay, Paul Thagard and John Woods; Volume eds. Prasanta S. Bandyopadhyay and Malcolm R. Forster.) Elsevier: 1-46.

## Excursion 3 Tour III:

A long-standing family feud among frequentists is between hypotheses tests and confidence intervals (CIs). In fact there’s a clear duality between the two: the parameter values within the (1 – α) CI are those that are not rejectable by the corresponding test at level α. (3.7) illuminates both CIs and severity by means of this duality. A key idea is arguing from the capabilities of methods to what may be inferred. CIs thereby obtain an inferential rationale (beyond performance), and several benchmarks are reported. Continue reading

## Why significance testers should reject the argument to “redefine statistical significance”, even if they want to lower the p-value*

.

An argument that assumes the very thing that was to have been argued for is guilty of begging the question; signing on to an argument whose conclusion you favor even though you cannot defend its premises is to argue unsoundly, and in bad faith. When a whirlpool of “reforms” subliminally alter  the nature and goals of a method, falling into these sins can be quite inadvertent. Start with a simple point on defining the power of a statistical test.

I. Redefine Power?

Given that power is one of the most confused concepts from Neyman-Pearson (N-P) frequentist testing, it’s troubling that in “Redefine Statistical Significance”, power gets redefined too. “Power,” we’re told, is a Bayes Factor BF “obtained by defining H1 as putting ½ probability on μ = ± m for the value of m that gives 75% power for the test of size α = 0.05. This H1 represents an effect size typical of that which is implicitly assumed by researchers during experimental design.” (material under Figure 1). Continue reading

## Going round and round again: a roundtable on reproducibility & lowering p-values

.

There will be a roundtable on reproducibility Friday, October 27th (noon Eastern time), hosted by the International Methods Colloquium, on the reproducibility crisis in social sciences motivated by the paper, “Redefine statistical significance.” Recall, that was the paper written by a megateam of researchers as part of the movement to require p ≤ .005, based on appraising significance tests by a Bayes Factor analysis, with prior probabilities on a point null and a given alternative. It seems to me that if you’re prepared to scrutinize your frequentist (error statistical) method on grounds of Bayes Factors, then you must endorse using Bayes Factors (BFs) for inference to begin with. If you don’t endorse BFs–and, in particular, the BF required to get the disagreement with p-values–*, then it doesn’t make sense to appraise your non-Bayesian method on grounds of agreeing or disagreeing with BFs. For suppose you assess the recommended BFs from the perspective of an error statistical account–that is, one that checks how frequently the method would uncover or avoid the relevant mistaken inference.[i] Then, if you reach the stipulated BF level against a null hypothesis, you will find the situation is reversed, and the recommended BF exaggerates the evidence!  (In particular, with high probability, it gives an alternative H’ fairly high posterior probability, or comparatively higher probability, even though H’ is false.) Failing to reach the BF cut-off, by contrast, can find no evidence against, and even finds evidence for, a null hypothesis with high probability, even when non-trivial discrepancies exist. They’re measuring very different things, and it’s illicit to expect an agreement on numbers.[ii] We’ve discussed this quite a lot on this blog (2 are linked below [iii]).

If the given list of panelists is correct, it looks to be 4 against 1, but I’ve no doubt that Lakens can handle it.

## Deconstructing “A World Beyond P-values”

.A world beyond p-values?

I was asked to write something explaining the background of my slides (posted here) in relation to the recent ASA “A World Beyond P-values” conference. I took advantage of some long flight delays on my return to jot down some thoughts:

The contrast between the closing session of the conference “A World Beyond P-values,” and the gist of the conference itself, shines a light on a pervasive tension within the “Beyond P-Values” movement. Two very different debates are taking place. First there’s the debate about how to promote better science. This includes welcome reminders of the timeless demands of rigor and integrity required to avoid deceiving ourselves and others–especially crucial in today’s world of high-powered searches and Big Data. That’s what the closing session was about. [1] Continue reading

## Thieme on the theme of lowering p-value thresholds (for Slate)

.

Here’s an article by Nick Thieme on the same theme as my last blogpost. Thieme, who is Slate’s 2017 AAAS Mass Media Fellow, is the first person to interview me on p-values who (a) was prepared to think through the issue for himself (or herself), and (b) included more than a tiny fragment of my side of the exchange.[i]. Please share your comments.

## Will Lowering P-Value Thresholds Help Fix Science? P-values are already all over the map, and they’re also not exactly the problem.

Last week a team of 72 scientists released the preprint of an article attempting to address one aspect of the reproducibility crisis, the crisis of conscience in which scientists are increasingly skeptical about the rigor of our current methods of conducting scientific research.

Their suggestion? Change the threshold for what is considered statistically significant. The team, led by Daniel Benjamin, a behavioral economist from the University of Southern California, is advocating that the “probability value” (p-value) threshold for statistical significance be lowered from the current standard of 0.05 to a much stricter threshold of 0.005. Continue reading

Categories: P-values, reforming the reformers, spurious p values | 14 Comments

## “A megateam of reproducibility-minded scientists” look to lowering the p-value

.

Having discussed the “p-values overstate the evidence against the null fallacy” many times over the past few years, I leave it to readers to disinter the issues (pro and con), and appraise the assumptions, in the most recent rehearsal of the well-known Bayesian argument. There’s nothing intrinsically wrong with demanding everyone work with a lowered p-value–if you’re so inclined to embrace a single, dichotomous standard without context-dependent interpretations, especially if larger sample sizes are required to compensate the loss of power. But lowering the p-value won’t solve the problems that vex people (biasing selection effects), and is very likely to introduce new ones (see my comment). Kelly Servick, a reporter from Science, gives the ingredients of the main argument given by “a megateam of reproducibility-minded scientists” in an article out today: Continue reading

## On the current state of play in the crisis of replication in psychology: some heresies

.

The replication crisis has created a “cold war between those who built up modern psychology and those” tearing it down with failed replications–or so I read today [i]. As an outsider (to psychology), the severe tester is free to throw some fuel on the fire on both sides. This is a short update on my post “Some ironies in the replication crisis in social psychology” from 2014.

Following the model from clinical trials, an idea gaining steam is to prespecify a “detailed protocol that includes the study rationale, procedure and a detailed analysis plan” (Nosek et.al. 2017). In this new paper, they’re called registered reports (RRs). An excellent start. I say it makes no sense to favor preregistration and deny the relevance to evidence of optional stopping and outcomes other than the one observed. That your appraisal of the evidence is altered when you actually see the history supplied by the RR is equivalent to worrying about biasing selection effects when they’re not written down; your statistical method should pick up on them (as do p-values, confidence levels and many other error probabilities). There’s a tension between the RR requirements and accounts following the Likelihood Principle (no need to name names [ii]). Continue reading

## How to tell what’s true about power if you’re practicing within the error-statistical tribe

.

This is a modified reblog of an earlier post, since I keep seeing papers that confuse this.

Suppose you are reading about a result x  that is just statistically significant at level α (i.e., P-value = α) in a one-sided test T+ of the mean of a Normal distribution with n iid samples, and (for simplicity) known σ:   H0: µ ≤  0 against H1: µ >  0.

I have heard some people say:

A. If the test’s power to detect alternative µ’ is very low, then the just statistically significant x is poor evidence of a discrepancy (from the null) corresponding to µ’.  (i.e., there’s poor evidence that  µ > µ’ ).*See point on language in notes.

They will generally also hold that if POW(µ’) is reasonably high (at least .5), then the inference to µ > µ’ is warranted, or at least not problematic.

I have heard other people say:

B. If the test’s power to detect alternative µ’ is very low, then the just statistically significant x is good evidence of a discrepancy (from the null) corresponding to µ’ (i.e., there’s good evidence that  µ > µ’).

They will generally also hold that if POW(µ’) is reasonably high (at least .5), then the inference to µ > µ’ is unwarranted.

Which is correct, from the perspective of the (error statistical) philosophy, within which power and associated tests are defined? Continue reading

Categories: power, reforming the reformers | 17 Comments

## If you’re seeing limb-sawing in P-value logic, you’re sawing off the limbs of reductio arguments

I was just reading a paper by Martin and Liu (2014) in which they allude to the “questionable logic of proving H0 false by using a calculation that assumes it is true”(p. 1704).  They say they seek to define a notion of “plausibility” that

“fits the way practitioners use and interpret p-values: a small p-value means H0 is implausible, given the observed data,” but they seek “a probability calculation that does not require one to assume that H0 is true, so one avoids the questionable logic of proving H0 false by using a calculation that assumes it is true“(Martin and Liu 2014, p. 1704).

Questionable? A very standard form of argument is a reductio (ad absurdum) wherein a claim C  is inferred (i.e., detached) by falsifying ~C, that is, by showing that assuming ~C entails something in conflict with (if not logically contradicting) known results or known truths [i]. Actual falsification in science is generally a statistical variant of this argument. Supposing Hin p-value reasoning plays the role of ~C. Yet some aver it thereby “saws off its own limb”! Continue reading

Categories: P-values, reforming the reformers, Statistics | 13 Comments

## Er, about those other approaches, hold off until a balanced appraisal is in

I could have told them that the degree of accordance enabling the ASA’s “6 principles” on p-values was unlikely to be replicated when it came to most of the “other approaches” with which some would supplement or replace significance tests– notably Bayesian updating, Bayes factors, or likelihood ratios (confidence intervals are dual to hypotheses tests). [My commentary is here.] So now they may be advising a “hold off” or “go slow” approach until some consilience is achieved. Is that it? I don’t know. I was tweeted an article about the background chatter taking place behind the scenes; I wasn’t one of people interviewed for this. Here are some excerpts, I may add more later after it has had time to sink in. (check back later)

“Reaching for Best Practices in Statistics: Proceed with Caution Until a Balanced Critique Is In”

J. Hossiason

“[A]ll of the other approaches*, as well as most statistical tools, may suffer from many of the same problems as the p-values do. What level of likelihood ratio in favor of the research hypothesis will be acceptable to the journal? Should scientific discoveries be based on whether posterior odds pass a specific threshold (P3)? Does either measure the size of an effect (P5)?…How can we decide about the sample size needed for a clinical trial—however analyzed—if we do not set a specific bright-line decision rule? 95% confidence intervals or credence intervals…offer no protection against selection when only those that do not cover 0, are selected into the abstract (P4). (Benjamini, ASA commentary, pp. 3-4)

What’s sauce for the goose is sauce for the gander right?  Many statisticians seconded George Cobb who urged “the board to set aside time at least once every year to consider the potential value of similar statements” to the recent ASA p-value report. Disappointingly, a preliminary survey of leaders in statistics, many from the original p-value group, aired striking disagreements on best and worst practices with respect to these other approaches. The Executive Board is contemplating a variety of recommendations, minimally, that practitioners move with caution until they can put forward at least a few agreed upon principles for interpreting and applying Bayesian inference methods. The words we heard ranged from “go slow” to “moratorium [emphasis mine]. Having been privy to some of the results of this survey, we at Stat Report Watch decided to contact some of the individuals involved. Continue reading

Categories: P-values, reforming the reformers, Statistics | 6 Comments

## Slides from the Boston Colloquium for Philosophy of Science: “Severe Testing: The Key to Error Correction”

Slides from my March 17 presentation on “Severe Testing: The Key to Error Correction” given at the Boston Colloquium for Philosophy of Science Alfred I.Taub forum on “Understanding Reproducibility and Error Correction in Science.”

## Hocus pocus! Adopt a magician’s stance, if you want to reveal statistical sleights of hand

.

Here’s the follow-up post to the one I reblogged on Feb 3 (please read that one first). When they sought to subject Uri Geller to the scrutiny of scientists, magicians had to be brought in because only they were sufficiently trained to spot the subtle sleight of hand shifts by which the magician tricks by misdirection. We, too, have to be magicians to discern the subtle misdirections and shifts of meaning in the discussions of statistical significance tests (and other methods)—even by the same statistical guide. We needn’t suppose anything deliberately devious is going on at all! Often, the statistical guidebook reflects shifts of meaning that grow out of one or another critical argument. These days, they trickle down quickly to statistical guidebooks, thanks to popular articles on the “statistics crisis in science”. The danger is that their own guidebooks contain inconsistencies. To adopt the magician’s stance is to be on the lookout for standard sleights of hand. There aren’t that many.[0]

I don’t know Jim Frost, but he gives statistical guidance at the minitab blog. The purpose of my previous post is to point out that Frost uses the probability of a Type I error in two incompatible ways in his posts on significance tests. I assumed he’d want to clear this up, but so far he has not. His response to a comment I made on his blog is this: Continue reading

## High error rates in discussions of error rates: no end in sight

waiting for the other shoe to drop…

“Guides for the Perplexed” in statistics become “Guides to Become Perplexed” when “error probabilities” (in relation to statistical hypotheses tests) are confused with posterior probabilities of hypotheses. Moreover, these posteriors are neither frequentist, subjectivist, nor default. Since this doublespeak is becoming more common in some circles, it seems apt to reblog a post from one year ago (you may wish to check the comments).

Do you ever find yourself holding your breath when reading an exposition of significance tests that’s going swimmingly so far? If you’re a frequentist in exile, you know what I mean. I’m sure others feel this way too. When I came across Jim Frost’s posts on The Minitab Blog, I thought I might actually have located a success story. He does a good job explaining P-values (with charts), the duality between P-values and confidence levels, and even rebuts the latest “test ban” (the “Don’t Ask, Don’t Tell” policy). Mere descriptive reports of observed differences that the editors recommend, Frost shows, are uninterpretable without a corresponding P-value or the equivalent. So far, so good. I have only small quibbles, such as the use of “likelihood” when meaning probability, and various and sundry nitpicky things. But watch how in some places significance levels are defined as the usual error probabilities —indeed in the glossary for the site—while in others it is denied they provide error probabilities. In those other places, error probabilities and error rates shift their meaning to posterior probabilities, based on priors representing the “prevalence” of true null hypotheses.

Begin with one of his kosher posts “Understanding Hypothesis Tests: Significance Levels (Alpha) and P values in Statistics” (blue is Frost): Continue reading

## Szucs & Ioannidis Revive the Limb-Sawing Fallacy

.

When logical fallacies of statistics go uncorrected, they are repeated again and again…and again. And so it is with the limb-sawing fallacy I first posted in one of my “Overheard at the Comedy Hour” posts.* It now resides as a comic criticism of significance tests in a paper by Szucs and Ioannidis (posted this week),  Here’s their version:

“[P]aradoxically, when we achieve our goal and successfully reject Hwe will actually be left in complete existential vacuum because during the rejection of HNHST ‘saws off its own limb’ (Jaynes, 2003; p. 524): If we manage to reject H0then it follows that pr(data or more extreme data|H0) is useless because H0 is not true” (p.15).

Here’s Jaynes (p. 524):

“Suppose we decide that the effect exists; that is, we reject [null hypothesis] H0. Surely, we must also reject probabilities conditional on H0, but then what was the logical justification for the decision? Orthodox logic saws off its own limb.’

Ha! Ha! By this reasoning, no hypothetical testing or falsification could ever occur. As soon as H is falsified, the grounds for falsifying disappear! If H: all swans are white, then if I see a black swan, H is falsified. But according to this criticism, we can no longer assume the deduced prediction from H! What? Continue reading

## Mayo & Parker “Using PhilStat to Make Progress in the Replication Crisis in Psych” SPSP Slides

Here are the slides from our talk at the Society for Philosophy of Science in Practice (SPSP) conference. I covered the first 27, Parker the rest. The abstract is here:

## “So you banned p-values, how’s that working out for you?” D. Lakens exposes the consequences of a puzzling “ban” on statistical inference

.

I came across an excellent post on a blog kept by Daniel Lakens: “So you banned p-values, how’s that working out for you?” He refers to the journal that recently banned significance tests, confidence intervals, and a vague assortment of other statistical methods, on the grounds that all such statistical inference tools are “invalid” since they don’t provide posterior probabilities of some sort (see my post). The editors’ charge of “invalidity” could only hold water if these error statistical methods purport to provide posteriors based on priors, which is false. The entire methodology is based on methods in which probabilities arise to qualify the method’s capabilities to detect and avoid erroneous interpretations of data [0]. The logic is of the falsification variety found throughout science. Lakens, an experimental psychologist, does a great job delineating some of the untoward consequences of their inferential ban. I insert some remarks in black. Continue reading

## Frequentstein: What’s wrong with (1 – β)/α as a measure of evidence against the null? (ii)

.

In their “Comment: A Simple Alternative to p-values,” (on the ASA P-value document), Benjamin and Berger (2016) recommend researchers report a pre-data Rejection Ratio:

It is the probability of rejection when the alternative hypothesis is true, divided by the probability of rejection when the null hypothesis is true, i.e., the ratio of the power of the experiment to the Type I error of the experiment. The rejection ratio has a straightforward interpretation as quantifying the strength of evidence about the alternative hypothesis relative to the null hypothesis conveyed by the experimental result being statistically significant. (Benjamin and Berger 2016, p. 1)

The recommendation is much more fully fleshed out in a 2016 paper by Bayarri, Benjamin, Berger, and Sellke (BBBS 2016): Rejection Odds and Rejection Ratios: A Proposal for Statistical Practice in Testing Hypotheses. Their recommendation is:

…that researchers should report the ‘pre-experimental rejection ratio’ when presenting their experimental design and researchers should report the ‘post-experimental rejection ratio’ (or Bayes factor) when presenting their experimental results. (BBBS 2016, p. 3)….

The (pre-experimental) ‘rejection ratio’ Rpre , the ratio of statistical power to significance threshold (i.e., the ratio of the probability of rejecting under H1 and H0 respectively), is shown to capture the strength of evidence in the experiment for Hover H0. (ibid., p. 2)

But in fact it does no such thing! [See my post from the FUSION conference here.] J. Berger, and his co-authors, will tell you the rejection ratio (and a variety of other measures created over the years) are entirely frequentist because they are created out of frequentist error statistical measures. But a creation built on frequentist measures doesn’t mean the resulting animal captures frequentist error statistical reasoning. It might be a kind of Frequentstein monster! [1] Continue reading

## Fallacies of Rejection, Nouvelle Cuisine, and assorted New Monsters

Jackie Mason

Whenever I’m in London, my criminologist friend Katrin H. and I go in search of stand-up comedy. Since it’s Saturday night (and I’m in London), we’re setting out in search of a good comedy club (I’ll complete this post upon return). A few years ago we heard Jackie Mason do his shtick, a one-man show billed as his swan song to England.  It was like a repertoire of his “Greatest Hits” without a new or updated joke in the mix.  Still, hearing his rants for the nth time was often quite hilarious. It turns out that he has already been back doing another “final shtick tour” in England, but not tonight.

A sample: If you want to eat nothing, eat nouvelle cuisine. Do you know what it means? No food. The smaller the portion the more impressed people are, so long as the food’s got a fancy French name, haute cuisine. An empty plate with sauce!

As one critic wrote, Mason’s jokes “offer a window to a different era,” one whose caricatures and biases one can only hope we’ve moved beyond:

But it’s one thing for Jackie Mason to scowl at a seat in the front row and yell to the shocked audience member in his imagination, “These are jokes! They are just jokes!” and another to reprise statistical howlers, which are not jokes, to me. This blog found its reason for being partly as a place to expose, understand, and avoid them. I had earlier used this Jackie Mason opening to launch into a well-known fallacy of rejection using statistical significance tests. I’m going to go further this time around. I began by needling some leading philosophers of statistics: Continue reading

## A. Spanos: Talking back to the critics using error statistics

.

Given all the recent attention given to kvetching about significance tests, it’s an apt time to reblog Aris Spanos’ overview of the error statistician talking back to the critics [1]. A related paper for your Saturday night reading is Mayo and Spanos (2011).[2] It mixes the error statistical philosophy of science with its philosophy of statistics, introduces severity, and responds to 13 criticisms and howlers.

I’m going to comment on some of the ASA discussion contributions I hadn’t discussed earlier. Please share your thoughts in relation to any of this.

[1]It was first blogged here, as part of our seminar 2 years ago.

[2] For those seeking a bit more balance to the main menu offered in the ASA Statistical Significance Reference list.