Monthly Archives: June 2019

“The 2019 ASA Guide to P-values and Statistical Significance: Don’t Say What You Don’t Mean” (Some Recommendations)(i)

Some have asked me why I haven’t blogged on the recent follow-up to the ASA Statement on P-Values and Statistical Significance (Wasserstein and Lazar 2016)–hereafter, ASA I. They’re referring to the editorial by Wasserstein, R., Schirm, A. and Lazar, N. (2019)–hereafter, ASA II–opening a special on-line issue of over 40 contributions responding to the call to describe “a world beyond P < 0.05”.[1] Am I falling down on the job? Not really. All of the issues are thoroughly visited in my Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars, SIST (2018, CUP). I invite interested readers to join me on the statistical cruise therein.[2] As the ASA II authors observe: “At times in this editorial and the papers you’ll hear deep dissonance, the echoes of ‘statistics wars’ still simmering today (Mayo 2018)”. True, and reluctance to reopen old wounds has only allowed them to fester. However, I will admit that when new attempts at reform are put forward, a philosopher of science who has written on the statistics wars ought to weigh in on the specific prescriptions/proscriptions, especially when a jumble of fuzzy conceptual issues is interwoven through a cacophony of competing reforms. (My published comment on ASA I, “Don’t Throw Out the Error Control Baby With the Bad Statistics Bathwater” is here.)

So I should say something. But the task is delicate. And painful. Very. I should start by asking: What is it (i.e., what is it actually saying)? Then I can offer some constructive suggestions.

The Invitation to Broader Consideration and Debate

The papers in this issue propose many new ideas, ideas that in our determination as editors merited publication to enable broader consideration and debate. The ideas in this editorial are likewise open to debate. (ASAII p. 1)

The questions around reform need consideration and debate. (p. 9)

Excellent! A broad, open, critical debate is sorely needed. Still, we can only debate something when there is a degree of clarity as to what “it” is. I will be very happy to post readers’ meanderings on ASA II (~1000 words) if you send them to me.

My focus here is just on the intended positions of the ASA, not the summaries of articles. This comprises around the first 10 pages. Even from just the first few pages the reader is met with some noteworthy declarations:

♦ Don’t conclude anything about scientific or practical importance based on statistical significance (or lack thereof). (p. 1)

♦ No p-value can reveal the plausibility, presence, truth, or importance of an association or effect. (p. 2)

♦ A declaration of statistical significance is the antithesis of thoughtfulness. (p. 4)

♦ Whether a p-value passes any arbitrary threshold should not be considered at all when deciding which results to present or highlight. (p. 2, my emphasis)

♦ It is time to stop using the term “statistically significant” entirely. Nor should variants such as “significantly different,” “p < 0.05,” and “nonsignificant” survive. (p. 2)

♦ “Statistically significant”– don’t say it and don’t use it. (p. 2)


I am very sympathetic with the concerns about rigid cut-offs, and fallacies of moving from statistical significance to substantive scientific claims. I feel as if I’ve just written a whole book on it! I say, on p. 10 of SIST:

In formal statistical testing, the crude dichotomy of “pass/fail” or “significant or not” will scarcely do. We must determine the magnitudes (and directions) of any statistical discrepancies warranted, and the limits to any substantive claims you may be entitled to infer from the statistical ones.

Since ASA II will still use P-values, you’re bound to wonder why a user wouldn’t just report “the difference is statistically significant at the P-value attained”. (The probability of observing even larger differences, under the assumption of chance variability alone, is p.) Confidence intervals (CIs) are already routinely given alongside P-values. So there is clearly more to the current movement than meets the eye. But for now I’m just trying to decipher what the ASA position is.

What’s the Relationship Between ASA I and ASA II?

I assume, for this post, that ASA II is intended to be an extension of ASA I. In that case, it would subsume the 6 principles of ASA I. There is evidence for this. For one thing, it begins by sketching a “sampling” of “don’ts” from ASA I, for those who are new to the debate. Secondly, it recommends that ASA I be widely disseminated. But some Principles (1, 4) are apparently missing[3], and others are rephrased in ways that alter the initial meanings. Do they really mean these declarations as written? Let us try to take them at their word.

But right away we are struck with a conflict with Principle 1 of ASA I–which happens to be the only positive principle given. (See Note 5 for the six Principles of ASA I.)

Principle 1. P-values can indicate how incompatible the data are with a specified statistical model.

A p-value provides one approach to summarizing the incompatibility between a particular set of data and a proposed model for the data. The most common context is a model, constructed under a set of assumptions, together with a so-called “null hypothesis.” Often the null hypothesis postulates the absence of an effect, such as no difference between two groups, or the absence of a relationship between a factor and an outcome. The smaller the p-value, the greater the statistical incompatibility of the data with the null hypothesis, if the underlying assumptions used to calculate the p-value hold. This incompatibility can be interpreted as casting doubt on or providing evidence against the null hypothesis or the underlying assumptions. (ASA I p. 131)

However, an indication of how incompatible data are with a claim of the absence of a relationship between a factor and an outcome would be an indication of the presence of the relationship; and providing evidence against a claim of no difference between two groups would often be of scientific or practical importance.

So, Principle 1 (from ASA I) doesn’t appear to square with the first bulleted item I listed (from ASA II):

(1) “Don’t conclude anything about scientific or practical importance based on statistical significance (or lack thereof)” (p. 1, ASA II).

Either modify (1) or erase Principle 1. But if you erase all thresholds for finding incompatibility (whether using P-values or other measures), there are no tests, and no falsifications, even of the statistical kind.

My understanding (from Ron Wasserstein) is that this bullet is intended to correspond to Principle 5 in ASA I – that P-values do not give population effect sizes. But it is now saying something stronger (at least to my ears and to everyone else I’ve asked). Do the authors mean to be saying that nothing (of scientific or practical importance) can be learned from statistical significance tests? I think not.

So, my first recommendation is:

Replace (1) with:

“Don’t conclude anything about the scientific or practical importance of the (population) effect size based only on statistical significance (or lack thereof).”

Either that, or simply stick to Principle 5 from ASA I: “A p-value, or statistical significance[4], does not measure the size of an effect or the importance of a result.” (p. 132) This statement is, strictly speaking, a tautology, true by the definitions of terms: probability isn’t itself a measure of the size of a (population) effect. However, you can use statistically significant differences to infer what the data indicate about the size of the (population) effect.[4]

My second friendly amendment concerns the second bulleted item:

(2) No p-value can reveal the plausibility, presence, truth, or importance of an association or effect. (p. 2)

Focus just on “presence”. From this assertion it would seem to follow that no P-values[5], however small, even from well-controlled trials, can reveal the presence of an association or effect–and that is too strong. Again, we get a conflict with Principle 1 from ASA I. But I’m guessing, for now, the authors do not intend to say this. If you don’t mean it, don’t say it.

So, my second recommendation is to replace (2) with:

“No p-value by itself can reveal the plausibility, presence, truth, or importance of an association or effect.”

Without this friendly amendment, ASA II is at loggerheads with ASA I, and they should not be advocating those 6 principles without changing either or both. Without this or a similar modification, moreover, any other statistical quantity or evidential measure is likewise unable to reveal these things. Or so many would argue. These modest revisions might prevent some readers from stopping after the first few pages, and that would be a shame, as they would miss the many right-headed insights about linking statistical and scientific inference.

This leads to my third bulleted item from ASA II:

(3) A declaration of statistical significance is the antithesis of thoughtfulness… it ignores what previous studies have contributed to our knowledge. (p. 4)

Surely the authors do not mean to say that anyone who asserts the observed difference is statistically significant at level p has her hands tied and invariably ignores all previous studies, background information and theories in planning and reaching conclusions, decisions, proposed solutions to problems. I’m totally on board with the importance of backgrounds, and multiple steps relating data to scientific claims and problems. Here’s what I say in SIST:

The error statistician begins with a substantive problem or question. She jumps in and out of piecemeal statistical tests both formal and quasi-formal. The pieces are integrated in building up arguments from coincidence, informing background theory, self-correcting via blatant deceptions, in an iterative movement. The inference is qualified by using error probabilities to determine not “how probable,” but rather, “how well-probed” claims are, and what has been poorly probed. (SIST, p. 162)

But good inquiry is piecemeal: There is no reason to suppose one does everything at once in inquiry, and it seems clear from the ASA II guide that the authors agree. Since I don’t think they literally mean (3), why say it?

Practitioners who use these methods in medicine and elsewhere have detailed protocols for how background knowledge is employed in designing, running, and interpreting tests. When medical researchers specify primary outcomes, for just one example, it’s very explicitly with due regard for the mechanism of drug action. It’s intended as the most direct way to pick up on the drug’s mechanism. Finding incompatibility using P-values inherits the meaning already attached to a sensible test hypothesis. That valid P-values require context is presupposed by the very important Principle 4 of ASA I (see Note 3).

As lawyer Nathan Schachtman observes, in a recent conversation on ASA II:

By the time a phase III clinical trial is being reviewed for approval, there is a mountain of data on pharmacology, pharmacokinetics, mechanism, target organ, etc. If Wasserstein wants to suggest that there are some people who misuse or misinterpret p-values, fine. The principle of charity requires that we give a more sympathetic reading to the broad field of users of statistical significance testing. (Schachtman 2019)

Now it is possible the authors are saying a reported P-value can never be thoughtful because thoughtfulness requires that a statistical measure, at any stage of probing, incorporate everything we know. (SIST dubs this “big picture” inference.) Do we want that? Or maybe (3) is their way of saying a statistical measure must incorporate background beliefs in the manner of Bayesian degree-of-belief (?) priors. Many would beg to differ, including some leading Bayesians. Andrew Gelman (2012) has suggested that ‘Bayesians Want Everybody Else to be Non-Bayesian’:

Bayesian inference proceeds by taking the likelihoods from different data sources and then combining them with a prior (or, more generally, a hierarchical model). The likelihood is key. . .  No funny stuff, no posterior distributions, just the likelihood. . . I don’t want everybody coming to me with their posterior distribution – I’d just have to divide away their prior distributions before getting to my own analysis. (ibid., p. 54)

So, my third recommendation is to replace (3) with (something like):

“failing to report anything beyond a declaration of statistical significance is the antithesis of thoughtfulness.”

There’s much else that bears critical analysis and debate in ASA II; I’ll come back to it. I hope to hear from the authors of ASA II about my very slight, constructive amendments (to avoid a conflict with Principle 1).

Meanwhile, I fear we will see court cases piling up denying that anyone can be found culpable for abusing p-values and significance tests, since the ASA declared that all p-values are arbitrary, and whether predesignated thresholds are honored or breached should not be considered at all. (This was already happening based on ASA I.)[6] 

Please share your thoughts and any errors in the comments. I will indicate later drafts of this post with (i), (ii), … Do send me other articles you find discussing this.




Gelman, A. (2012) “Ethics and the Statistical Use of Prior Information”.

Mayo, D. (2016). “Don’t Throw out the Error Control Baby with the Bad Statistics Bathwater: A Commentary” on R. Wasserstein and N. Lazar: “The ASA’s Statement on P-values: Context, Process, and Purpose”, The American Statistician 70(2).

Mayo, D. (2018). Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars. Cambridge: Cambridge University Press.

Schachtman, N. (2019). (private communication)

Wasserstein, R. and Lazar, N. (2016). “The ASA’s Statement on P-values: Context, Process and Purpose”, (and supplemental materials), The American Statistician 70(2), 129–33. (ASA I)

Wasserstein, R., Schirm, A. and Lazar, N. (2019) Editorial: “Moving to a World Beyond ‘p < 0.05’”, The American Statistician 73(S1): 1-19. (ASA II)

[1]  I gave an invited paper at the conference (“A world Beyond…”) out of which the idea for this volume grew. I was in a session with a few other exiles to describe the contexts where statistical significance tests are of value. I was too much involved in completing my book to write up my paper for this volume, nor did others in our small group. Links are here to: my slides and Yoav Benjamini’s slides. I did post notes to journalists on the Amrhein article here.


[2] Excerpts and mementos from SIST are here.


[3]  Principle 4 ASA I asserts that “proper inference requires full reporting and transparency”: P-values and related analyses should not be reported selectively. Conducting multiple analyses of the data and reporting only those with certain p-values (typically those passing a significance threshold) renders the reported p-values essentially uninterpretable. ….Whenever a researcher chooses what to present based on statistical results, valid interpretation of those results is severely compromised if the reader is not informed of the choice and its basis. Researchers should disclose the number of hypotheses explored during the study, all data collection decisions, all statistical analyses conducted, and all p-values computed. (pp. 131-2)


[4] Consider, for example, a two-sided (symmetric) 95% confidence interval estimate of Normal mean: [a, b]. This information can also be given in terms of observed significance levels.
  • CI-lower is the (parameter) value that the data x are just statistically significantly greater than, at the 0.025 level.
  • CI-upper is the (parameter) value that the data x are just statistically significantly smaller than, at the 0.025 level.
There’s a clear duality between statistical significance tests and confidence intervals. (The CI contains those parameter values that would not be rejected at the corresponding significance level, were they the hypotheses under test.) CIs were developed by the same man who co-developed Neyman-Pearson (N-P) tests in the same years (~1930): Jerzy Neyman. There are other ways to get indicated effect sizes such as with (attained) power analysis and the P-value distribution over different values of the parameter. The goal of assessing how severely tested a claim is serves to direct this analysis (Mayo 2018). However, the mathematical computations are well-known (see Fraser’s article in the collection), and continue to be extended in work on Confidence Distributions. See this blog or SIST for references.
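A minimal sketch of this duality, using only Python's standard library. The data values (observed mean 152, σ = 10, n = 100) and the function names are illustrative assumptions, not taken from the post, and σ is treated as known.

```python
from math import sqrt
from statistics import NormalDist

Z = NormalDist()  # standard Normal distribution


def p_greater(xbar, mu, se):
    """One-sided p-value for testing 'mean <= mu' against 'mean > mu'."""
    return 1 - Z.cdf((xbar - mu) / se)


def p_smaller(xbar, mu, se):
    """One-sided p-value for testing 'mean >= mu' against 'mean < mu'."""
    return Z.cdf((xbar - mu) / se)


# Hypothetical data: observed mean of n draws from a Normal with known sigma.
xbar, sigma, n = 152.0, 10.0, 100
se = sigma / sqrt(n)

# Two-sided 95% confidence interval [a, b].
z = Z.inv_cdf(0.975)
ci_lower, ci_upper = xbar - z * se, xbar + z * se

# The data are "just" significantly greater than CI-lower and
# "just" significantly smaller than CI-upper, each at the 0.025 level.
print(ci_lower, p_greater(xbar, ci_lower, se))
print(ci_upper, p_smaller(xbar, ci_upper, se))

# P-values attached to different parameter values inside the interval.
for mu in (150.5, 151.0, 152.0, 153.0, 153.5):
    print(mu, round(p_greater(xbar, mu, se), 3), round(p_smaller(xbar, mu, se), 3))
```

Under these assumptions each endpoint p-value comes out at about 0.025, and the loop shows how the p-values vary as the hypothesized value moves through the interval, rather than treating every point inside it alike.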
      However, confidence intervals as currently used in reform movements inherit many of the weaknesses of N-P tests: they are dichotomous (inside/outside), adhere to a single confidence level, and are justified merely with a long-run performance (or coverage) rationale. By considering the P-values associated with different hypotheses (corresponding to parameter values in the interval), one can scotch all of these weaknesses. See Souvenir Z, Farewell Keepsake, from SIST.
      It is often claimed that anything tests can do CIs do better (sung to the tune of “Annie Get Your Gun”). Not so. (See SIST p. 356). It is odd and ironic that psychologists urging us to use CIs depict statistical tests as exclusively of the artificial “simple” Fisherian variety, with a “nil” null and no explicit alternative, given how Paul Meehl chastised this tendency donkey’s years ago, and given that Jacob Cohen advanced power analysis.
A salutary effect of power analysis is that it draws one forcibly to consider the magnitude of effects. In psychology, and especially in soft psychology, under the sway of the Fisherian scheme, there has been little consciousness of how big things are. (Cohen 1990, p. 1309)
See SIST, p. 323. For links to all of excursion 5 on power, see this post.
      Of course, the beauty of the simple Fisherian test shows itself when there is no explicit alternative, as when testing assumptions of models–models that all the alternative statistical methods on offer also employ. ASA I also limits itself to the simple Fisherian test: “To keep the statement reasonably simple, we did not address alternative hypotheses, error types, or power…” (p. 130)


[5] I assume they intend to make claims about valid P-values, not those that are discredited by failing “audits” due either to violated assumptions, or to multiple testing and other selection effects given in Principle 4, ASA I. The largely unexceptional six principles of ASA I (2016) are:
    • P-values can indicate how incompatible the data are with a specified statistical model.
    • P-values do not measure the probability that the studied hypothesis is true, or the probability that the data were produced by random chance alone.
    • Scientific conclusions and business or policy decisions should not be based only on whether a p-value passes a specific threshold.
    • Proper inference requires full reporting and transparency.
    • A p-value, or statistical significance, does not measure the size of an effect or the importance of a result.
    • By itself, a p-value does not provide a good measure of evidence regarding a model or hypothesis.
[6] Just because P-values form a continuum, it doesn’t follow that we can’t use very high and very low P-values to distinguish rather lousy from fairly well indicated discrepancies. Beware the “Fallacy of the Continuum”. Would anyone use a confidence level of 0.5 or 0.6?

(Full) Excerpt. Excursion 5 Tour II: How Not to Corrupt Power (Power Taboos, Retro Power, and Shpower)



The concept of a test’s power is still being corrupted in the myriad ways discussed in 5.5, 5.6. I’m excerpting all of Tour II of Excursion 5, as I did with Tour I (of Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars 2018, CUP)*. Originally the two Tours comprised just one, but in finalizing corrections, I decided the two together were too long a slog, and I split them up. Because it was done at the last minute, some of the terms in Tour II rely on their introductions in Tour I. Here’s how it starts:

5.5 Power Taboos, Retrospective Power, and Shpower

Let’s visit some of the more populous tribes who take issue with power – by which we mean ordinary power – at least its post-data uses. Power Peninsula is often avoided due to various “keep out” warnings and prohibitions, or researchers come during planning, never to return. Why do some people consider it a waste of time, if not totally taboo, to compute power once we know the data? A degree of blame must go to N-P, who emphasized the planning role of power, and only occasionally mentioned its use in determining what gets “confirmed” post-data. After all, it’s good to plan how large a boat we need for a philosophical excursion to the Lands of Overlapping Statistical Tribes, but once we’ve made it, it doesn’t matter that the boat was rather small. Or so the critic of post-data power avers. A crucial disanalogy is that with statistics, we don’t know that we’ve “made it there,” when we arrive at a statistically significant result. The statistical significance alarm goes off, but you are not able to see the underlying discrepancy that generated the alarm you hear. The problem is to make the leap from the perceived alarm to an aspect of a process, deep below the visible ocean, responsible for its having been triggered. Then it is of considerable relevance to exploit information on the capability of your test procedure to result in alarms going off (perhaps with different decibels of loudness), due to varying values of the parameter of interest. There are also objections to power analysis with insignificant results.

Exhibit (vi): Non-significance + High Power Does Not Imply Support for the Null over the Alternative. Sander Greenland (2012) has a paper with this title. The first step is to understand the assertion, giving the most generous interpretation. It deals with non-significance, so our ears are perked for a fallacy of non-rejection. Second, we know that “high power” is an incomplete concept, so he clearly means high power against “the alternative.” We have a handy example: alternative μ.84 in T+ (POW(T+, μ.84) = 0.84).

Note to blog reader: μ.84 abbreviates “the alternative against which the test has 0.84 power.” This general abbreviation was introduced in Tour I. 

Use the water plant case, T+: H0: μ ≤ 150 vs. H1: μ > 150, σ = 10, n = 100. With α = 0.025, z0.025 = 1.96, and the corresponding cut-off in terms of x0.025 is [150 + 1.96(10)/√100] = 151.96; μ.84 = 152.96.
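The numbers in the water plant case can be checked with a short sketch using only Python's standard library. The variable and function names are illustrative assumptions; the test is taken to reject when the sample mean reaches the cut-off, with σ known.

```python
from math import sqrt
from statistics import NormalDist

Z = NormalDist()  # standard Normal distribution

# Test T+: H0: mu <= 150 vs. H1: mu > 150, with sigma = 10, n = 100.
mu0, sigma, n, alpha = 150.0, 10.0, 100, 0.025
se = sigma / sqrt(n)              # standard error of the mean: 1.0
z_alpha = Z.inv_cdf(1 - alpha)    # about 1.96
cutoff = mu0 + z_alpha * se       # about 151.96


def power(mu1):
    """Pr(sample mean >= cutoff; mu = mu1)."""
    return 1 - Z.cdf((cutoff - mu1) / se)


# The alternative one standard error above the cut-off has power of about 0.84.
mu_84 = cutoff + se               # about 152.96
beta = 1 - power(mu_84)           # Type II error probability, about 0.16

print(round(cutoff, 2), round(mu_84, 2), round(power(mu_84), 2), round(beta, 2))
```

The same function gives power(mu0) equal to α, since at the null boundary the probability of reaching the cut-off is just the Type I error probability.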

Now a title like this is supposed to signal a problem, a reason for those “keep out” signs. His point, in relation to this example, boils down to noting that an observed difference may not be statistically significant – x may fail to make it to the cut-off x0.025 – and yet be closer to μ.84 than to 0. This happens because the Type II error probability β (here, 0.16)[1] is greater than the Type I error probability (0.025).

For a quick computation let x0.025 = 152 and μ.84 = 153. Halfway between alternative 153 and the 150 null is 151.5. Any observed mean greater than 151.5 but less than the x0.025 cut-off, 152, will be an example of Greenland’s phenomenon. Such values are closer to 153, the alternative against which the test has 0.84 power, than to 150 and thus, by a likelihood measure, support 153 more than 150 – even though POW(μ = 153) is high (0.84). Having established the phenomenon, your next question is: so what?
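A small sketch of the quick computation, assuming a Normal model with standard error σ/√n = 10/√100 = 1 and the rounded values 150, 152 and 153 used above. The function name and the observed mean of 151.8 are hypothetical choices for illustration.

```python
from math import exp


def likelihood_ratio(xbar, mu_alt=153.0, mu_null=150.0, se=1.0):
    """Normal likelihood of mu_alt relative to mu_null, given observed mean xbar."""
    return exp(((xbar - mu_null) ** 2 - (xbar - mu_alt) ** 2) / (2 * se ** 2))


cutoff = 152.0   # rounded significance cut-off for the sample mean
xbar = 151.8     # hypothetical observed mean: above 151.5 but below the cut-off

print(xbar < cutoff)               # the result is not statistically significant
print(likelihood_ratio(xbar))      # exceeds 1, so 153 is better supported than 150
print(likelihood_ratio(151.5))     # exactly at the halfway point the ratio is 1
```

The ratio exceeds 1 for every observed mean above the 151.5 midpoint, which is all the phenomenon amounts to on a likelihood measure.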

It would be problematic if power analysis took the insignificant result as evidence for μ = 150 – maintaining compliance with the ecological stipulation – and I don’t doubt some try to construe it as such, nor that Greenland has been put in the position of needing to correct them. Power analysis merely licenses μ ≤ μ.84 where 0.84 was chosen for “high power.” Glance back at Souvenir X. So at least one of the “keep out” signs can be removed.

All of Excursion 5 Tour II (in proofs) is here.


[1] That is, β(μ.84) = Pr(d < 0.4; μ = 0.6) = Pr(Z < −1) = 0.16.



*This excerpt comes from Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (Mayo, CUP 2018).

It is still valuable to look at the discussions and comments under “power” and “shpower” on this blog.

Earlier excerpts and mementos from SIST (May 2018-May 2019) are here.









Don’t let the tail wag the dog by being overly influenced by flawed statistical inferences


An article [i], “There is Still a Place for Significance Testing in Clinical Trials,” appearing recently in Clinical Trials, while very short, effectively responds to recent efforts to stop error statistical testing [ii]. We need more of this. Much more. The emphasis in this excerpt is mine:

Much hand-wringing has been stimulated by the reflection that reports of clinical studies often misinterpret and misrepresent the findings of the statistical analyses. Recent proposals to address these concerns have included abandoning p-values and much of the traditional classical approach to statistical inference, or dropping the concept of statistical significance while still allowing some place for p-values. How should we in the clinical trials community respond to these concerns? Responses may vary from bemusement, pity for our colleagues working in the wilderness outside the relatively protected environment of clinical trials, to unease about the implications for those of us engaged in clinical trials….

However, we should not be shy about asserting the unique role that clinical trials play in scientific research. A clinical trial is a much safer context within which to carry out a statistical test than most other settings. Properly designed and executed clinical trials have opportunities and safeguards that other types of research do not typically possess, such as protocolisation of study design; scientific review prior to commencement; prospective data collection; trial registration; specification of outcomes of interest including, importantly, a primary outcome; and others. For randomised trials, there is even more protection of scientific validity provided by the randomisation of the interventions being compared. It would be a mistake to allow the tail to wag the dog by being overly influenced by flawed statistical inferences that commonly occur in less carefully planned settings….

The carefully designed clinical trial based on a traditional statistical testing framework has served as the benchmark for many decades. It enjoys broad support in both the academic and policy communities. There is no competing paradigm that has to date achieved such broad support. The proposals for abandoning p-values altogether often suggest adopting the exclusive use of Bayesian methods. For these proposals to be convincing, it is essential their presumed superior attributes be demonstrated without sacrificing the clear merits of the traditional framework. Many of us have dabbled with Bayesian approaches and find them to be useful for certain aspects of clinical trial design and analysis, but still tend to default to the conventional approach notwithstanding its limitations. While attractive in principle, the reality of regularly using Bayesian approaches on important clinical trials has been substantially less appealing – hence their lack of widespread uptake.

The issues that have led to the criticisms of conventional statistical testing are of much greater concern where statistical inferences are derived from observational data. … Even when the study is appropriately designed, there is also a common converse misinterpretation of statistical tests whereby the investigator incorrectly infers and reports that a non-significant finding conclusively demonstrates no effect. However, it is important to recognise that an appropriately designed and powered clinical trial enables the investigators to potentially conclude there is ‘no meaningful effect’ for the principal analysis.[iii]  More generally, these problems are largely due to the fact that many individuals who perform statistical analyses are not sufficiently trained in statistics. It is naive to suggest that banning statistical testing and replacing it with greater use of confidence intervals, or Bayesian methods, or whatever, will resolve any of these widespread interpretive problems. Even the more modest proposal of dropping the concept of ‘statistical significance’ when conducting statistical tests could make things worse. By removing the prespecified significance level, typically 5%, interpretation could become completely arbitrary. It will also not stop data-dredging, selective reporting, or the numerous other ways in which data analytic strategies can result in grossly misleading conclusions.

You can read the full article here.

We may reject, with reasonable severity, that promoting correctly interpreted statistical tests is the real goal of the most powerful leaders of the movement to Stop Statistical Tests. The goal is just to stop (error) statistical tests altogether.[iv] That today’s CI leaders advance this goal is especially unwarranted and self-defeating, in that confidence intervals are just inversions of N-P tests, and were developed at the same time by the same man (Neyman) who developed (with E. Pearson) the theory of error statistical tests. See this recent post.

Reader: I’ve placed on draft a number of posts while traveling in England over the past week, but haven’t had the chance to read them over, or find pictures for them. This will soon change, so stay tuned!


[i] Jonathan A Cook, Dean A Fergusson, Ian Ford, Mithat Gonen, Jonathan Kimmelman, Edward L Korn and Colin B Begg (2019). “There is still a place for significance testing in clinical trials”, Clinical Trials 16(3): 223–224.

[ii] Perhaps we should call those driven to Stop Error Statistical Tests “Obsessed”. I thank Nathan Schachtman for sending me the article.

[iii] It’s disappointing how many critics of tests seem unaware of this simple power analysis point, and how it avoids egregious fallacies of non-rejection, or moderate P-value. It precisely follows simple significance test reasoning. The severity account that I favor gives a more custom-tailored approach that is sensitive to the actual outcome. (See, for example, Excursion 5 of Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (2018, CUP).)
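A hedged sketch of the contrast between a power-based and an outcome-sensitive assessment after a non-significant result. Everything here is an assumption for illustration: the one-sided Normal test (null 150, σ = 10, n = 100, α = 0.025), the discrepancy of interest μ1 = 153, the observed means, and the reading of the severity for “μ ≤ μ1” as Pr(sample mean > observed mean; μ = μ1).

```python
from math import sqrt
from statistics import NormalDist

Z = NormalDist()  # standard Normal distribution

# Assumed one-sided test of H0: mu <= 150 with known sigma.
mu0, sigma, n, alpha = 150.0, 10.0, 100, 0.025
se = sigma / sqrt(n)
cutoff = mu0 + Z.inv_cdf(1 - alpha) * se


def power(mu1):
    """Pr(sample mean >= cutoff; mu = mu1): fixed before the data are seen."""
    return 1 - Z.cdf((cutoff - mu1) / se)


def severity_upper(xbar_obs, mu1):
    """Assumed severity for 'mu <= mu1' given a non-significant xbar_obs:
    Pr(sample mean > xbar_obs; mu = mu1)."""
    return 1 - Z.cdf((xbar_obs - mu1) / se)


mu1 = 153.0  # hypothetical discrepancy of interest
print("power against mu1:", round(power(mu1), 3))
for xbar_obs in (150.0, 151.0, 151.9):  # hypothetical non-significant outcomes
    print(xbar_obs, round(severity_upper(xbar_obs, mu1), 3))
```

On this reading the power is a lower bound: an observed mean sitting exactly at the cut-off gives a severity equal to the power, and smaller observed means give a higher value, which is one way to see what sensitivity to the actual outcome adds.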

[iv] Bayes factors, like other comparative measures, are not “tests”, and do not falsify (even statistically). They can only say one hypothesis or model is better than a selected other hypothesis or model, based on some^ selected criteria. They can both (all) be improbable, unlikely, or terribly tested. One can always add a “falsification rule”, but it must be shown that the resulting test avoids frequently passing/failing claims erroneously. 

^The Anti-Testers would have to say “arbitrary criterion”, to be consistent with their considering any P-value “arbitrary”, and denying that a statistically significant difference, reaching any P-value, indicates a genuine difference from a reference hypothesis.



