1. **New monsters.** One of the bizarre facts of life in the statistics wars is that a method from one school may be criticized on grounds that it conflicts with a conception that is the reverse of what that school intends. How is that even to be deciphered? That was the difficult task I set for myself in writing *Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars* (CUP, 2018) [SIST 2018]. I thought I was done, but new monsters keep appearing. In some cases, rather than see how the notion of severity gets us beyond fallacies, misconstruals are taken to criticize severity! So, for example, in the last couple of posts, here and here, I deciphered some of the better known power howlers (discussed in SIST Ex 5 Tour II). I’m linking to all of this tour (in proofs). Continue reading

# SIST

## A statistically significant result indicates H’ (μ > μ’) when POW(μ’) is low (not the other way round)–but don’t ignore the standard error

## Do “underpowered” tests “exaggerate” population effects? (iv)

You will often hear that if you reach a just statistically significant result “and the discovery study is underpowered, the observed effects are expected to be inflated” (Ioannidis 2008, p. 64), or “exaggerated” (Gelman and Carlin 2014). This connects to what I’m referring to as the second set of concerns about statistical significance tests, power and magnitude errors. Here, the problem does not revolve around erroneously interpreting power as a posterior probability, as we saw in the fallacy in this post. But there are other points of conflict with the error statistical tester, and much that cries out for clarification, or else you will misunderstand the consequences of some of today’s reforms. Continue reading

## Join me in reforming the “reformers” of statistical significance tests

The most surprising discovery about today’s statistics wars is that some who set out shingles as “statistical reformers” themselves are guilty of misdefining some of the basic concepts of error statistical tests—notably power. (See my recent post on power howlers.) A major purpose of my *Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars* (2018, CUP) is to clarify basic notions to get beyond what I call “chestnuts” and “howlers” of tests. The only way that disputing tribes can get beyond the statistics wars is by (at least) understanding correctly the central concepts. But these misunderstandings are *more* common than ever, so I’m asking readers to help. Why are they more common (than before the “new reformers” of the last decade)? I suspect that at least one reason is the popularity of Bayesian variants on tests: if one is looking to find posterior probabilities of hypotheses, then error statistical ingredients may tend to look as if that’s what they supply.

Run a little experiment if you come across a criticism based on the power of a test. Ask: are the critics interpreting the power of a test (with null hypothesis H) against an alternative H’ as if it were a posterior probability on H’? If they are, then it’s fallacious. But it will help understand why some people claim that high power against H’ warrants a stronger indication of a discrepancy H’, upon getting a just statistically significant result. But this is wrong. (See my recent post on power howlers.)

I had a blogpost on Ziliak and McCloskey (2008) (Z & M) on power (from Oct. 2011), following a review of their book by Aris Spanos (2008). They write:

“The error of the second kind is the error of accepting the null hypothesis of (say) zero effect when the null is in fact false, that is, when (say) such and such a positive effect is true.”

So far so good, keeping in mind that “positive effect” refers to a parameter discrepancy, say δ, not an observed difference.

And the power of a test to detect that such and such a positive effect δ is true is equal to the probability of rejecting the null hypothesis of (say) zero effect when the null is in fact false, and a positive effect as large as δ is present.

Fine. Let this alternative be abbreviated H’(δ):

H’(δ): there is a positive (population) effect at least as large as δ.

Suppose the test rejects the null when it reaches a significance level of .01 (nothing turns on the small value chosen).

(1) The power of the test to detect H’(δ) =

Pr(test rejects null at the .01 level| H’(δ) is true).

Say it is 0.85.
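To make (1) concrete, here is a minimal sketch. It assumes a one-sided Normal test of μ ≤ 0 against μ > 0 with known σ, which is my illustrative choice and not something Z & M specify; `power`, `discrepancy_with_power`, and the values n = 100 and σ = 1 are likewise hypothetical. Power is computed at the point δ, the boundary of H’(δ).

```python
from statistics import NormalDist

Z = NormalDist()  # standard Normal distribution

def power(delta, n, sigma=1.0, alpha=0.01):
    """Pr(test rejects at level alpha; mu = delta) for the one-sided test of mu <= 0."""
    cutoff = Z.inv_cdf(1 - alpha)       # reject when sqrt(n) * xbar / sigma >= cutoff
    shift = delta * n ** 0.5 / sigma    # mean of the test statistic when mu = delta
    return 1 - Z.cdf(cutoff - shift)

def discrepancy_with_power(target, n, sigma=1.0, alpha=0.01):
    """The discrepancy delta against which the test has the requested power."""
    return (Z.inv_cdf(1 - alpha) + Z.inv_cdf(target)) * sigma / n ** 0.5

# Illustrative (assumed) setup: n = 100 observations, sigma = 1.
delta_85 = discrepancy_with_power(0.85, n=100)
print(round(delta_85, 3), round(power(delta_85, n=100), 2))
```

Note that the number 0.85 is a property of the test procedure under the assumption μ = δ. Nothing in the computation refers to how probable H’(δ) is.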

According to Z & M:

“[If] the power of a test is high, say, 0.85 or higher, then the scientist can be reasonably confident that at minimum the null hypothesis (of, again, zero effect if that is the null chosen) is false and that therefore his rejection of it is highly probably correct.” (Z & M, 132-3)

But this is not so. They are mistaking (1), defining power, as giving a posterior probability of .85 to H’(δ)! That is, (1) is being transformed to (1′):

(1′) Pr(H’(δ) is true | test rejects null at .01 level) = .85!

(I am using the symbol for conditional probability “|” all the way through for ease in following the argument, even though, strictly speaking, the error statistician would use “;”, abbreviating “under the assumption that”). Or to put this in other words, they argue:

1. Pr(test rejects the null | H’(δ) is true) = 0.85.

2. Test rejects the null hypothesis.

Therefore, the rejection is probably correct, e.g., the probability H’ is true is 0.85.

Oops. Premises 1 and 2 are true, but the conclusion fallaciously replaces premise 1 with 1′.
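A small sketch can show that (1) and (1′) are different quantities. It assumes, purely for illustration, a point null against a point alternative and a prior probability for H’ (something the error statistician need not assign); the function name and the prior values are hypothetical. The only inputs taken from the text are the power of 0.85 and the .01 level.

```python
def posterior_given_rejection(prior, power=0.85, alpha=0.01):
    """Pr(H' | test rejects) by Bayes' theorem, point null vs. point alternative."""
    reject_and_alt = power * prior          # Pr(reject and H' true)
    reject_and_null = alpha * (1 - prior)   # Pr(reject and null true)
    return reject_and_alt / (reject_and_alt + reject_and_null)

# Assumed priors, chosen only to show how much the answer moves.
for prior in (0.01, 0.1, 0.5):
    print(prior, round(posterior_given_rejection(prior), 3))
```

The same power of 0.85 is compatible with very different values of Pr(H’ | rejection), depending on the assumed prior. So even someone who wants a posterior cannot read it off premise 1.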

As Aris Spanos (2008) points out, “They have it *backwards*”. Extracting from a Spanos comment on this blog in 2011:

“When [Ziliak and McCloskey] claim that: ‘What is relevant here for the statistical case is that refutations of the null are trivially easy to achieve if power is low enough or the sample size is large enough.’ (Z & M, p. 152), they exhibit [confusion] about the notion of power and its relationship to the sample size; their two instances of ‘easy rejection’ separated by ‘or’ contradict each other! Rejections of the null are not easy to achieve when the power is ‘low enough’. They are more difficult exactly because the test does not have adequate power (generic capacity) to detect discrepancies from the null; that stems from the very definition of power and optimal tests. [Their second claim] is correct for the wrong reason. Rejections are easy to achieve when the sample size n is large enough due to high not low power. This is because the power of a ‘decent’ (consistent) frequentist test increases monotonically with n!” (Spanos 2011)
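Spanos’s point about sample size can be checked numerically. The sketch below assumes a one-sided Normal test with known σ = 1 at the .01 level and a fixed discrepancy of 0.2; these numbers and the function name are my own illustrative assumptions, not taken from Spanos or Z & M.

```python
from statistics import NormalDist

Z = NormalDist()  # standard Normal distribution

def power(delta, n, sigma=1.0, alpha=0.01):
    """Power of the one-sided level-alpha Normal test against mu = delta."""
    cutoff = Z.inv_cdf(1 - alpha)
    return 1 - Z.cdf(cutoff - delta * n ** 0.5 / sigma)

# Hold the discrepancy fixed and let the sample size grow.
sample_sizes = [10, 25, 100, 400, 1600]
powers = [power(0.2, n) for n in sample_sizes]
for n, p in zip(sample_sizes, powers):
    print(n, round(p, 3))
```

Under these assumptions the power against the fixed discrepancy rises with every increase in n, so rejections become easy with large samples because power is high, not low.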

However, their slippery slides are very illuminating for common misinterpretations behind the criticisms of statistical significance tests–assuming a reader can catch them, because they only make them some of the time. [i] According to Ziliak and McCloskey (2008): “It is the history of Fisher significance testing. One erects little “significance” hurdles, six inches tall, and makes a great show of leaping over them, . . . If a test does a good job of uncovering efficacy, then the test has high power and the hurdles are high not low.” (ibid., p. 133)

They construe “little significance” as little hurdles! It explains how they wound up supposing high power translates into high hurdles. It’s the opposite. The higher the hurdle required before rejecting the null, the more difficult it is to reject, and the lower the power. High hurdles correspond to insensitive tests, like insensitive fire alarms. It might be that using “sensitivity” rather than power would make this abundantly clear. We may coin: The high power = high hurdle (for rejection) fallacy. A powerful test does give the null hypothesis a harder time in the sense that it’s more probable that discrepancies from it are detected. That makes it easier to infer H_{1}. Z & M have their hurdles in a twist.
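The hurdle point can be seen in a few lines. This sketch assumes the same kind of one-sided Normal test with known σ, with an illustrative discrepancy of 0.3 and n = 100; the function name and all numerical settings are hypothetical. The "hurdle" is the cut-off the test statistic must clear for rejection.

```python
from statistics import NormalDist

Z = NormalDist()  # standard Normal distribution

def hurdle_and_power(alpha, delta=0.3, n=100, sigma=1.0):
    """Return (rejection cut-off, power against mu = delta) at significance level alpha."""
    hurdle = Z.inv_cdf(1 - alpha)  # height the test statistic must clear
    pw = 1 - Z.cdf(hurdle - delta * n ** 0.5 / sigma)
    return hurdle, pw

# Smaller alpha means a higher hurdle for rejection.
for alpha in (0.05, 0.01, 0.001):
    hurdle, pw = hurdle_and_power(alpha)
    print(alpha, round(hurdle, 2), round(pw, 3))
```

With everything else held fixed, raising the hurdle lowers the power. High power goes with low hurdles, which is the reverse of what Z & M suppose.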

For a fuller discussion, see this link to Excursion 5 Tour I of SIST (2018). [ii] [iii]

**What power howlers have you found? Share them in the comments.**

Spanos, A. (2008), Review of S. Ziliak and D. McCloskey’s *The Cult of Statistical Significance*, *Erasmus Journal for Philosophy and Economics*, volume 1, issue 1: 154-164.

Ziliak, S. and McCloskey, D. (2008), *The Cult of Statistical Significance: How the Standard Error Costs Us Jobs, Justice and Lives*, University of Michigan Press.

[i] When it comes to raising the power by increasing sample size, they often make true claims, so it’s odd when there’s a switch or mixture, as when they say “refutations of the null are trivially easy to achieve if power is low enough or the sample size is large enough”. (Z & M, p. 152) It is clear that “low” is not a typo here either (as I at first assumed), so it’s mysterious.

[ii] Remember that a power computation is not the probability of data x under some alternative hypothesis, it’s the probability that data fall in the rejection region of a test under some alternative hypothesis. In terms of a test statistic d(X), it is Pr(test statistic d(X) is statistically significant | H’ true), at a given level of significance. So it’s the probability of getting any of the outcomes that would lead to statistical significance at the chosen level, under the assumption that alternative H’ is true. The alternative H’ used to compute power is a point in the alternative region. However, the inference that is made in tests is not to a point hypothesis but to an inequality, e.g., θ > θ’.

[iii] My rendering of their fallacy above sees it as a type of affirming the consequent. To Z & M, “the so-called fallacy of affirming the consequent may not be a fallacy at all in a science that is serious about decisions and belief.” It is, they think, how Bayesians reason. They are right that if inference is by way of a Bayes boost, then affirming the consequent is not a fallacy. A hypothesis H that entails data x will get a “B-boost” from x, unless its probability is already 1. The error statistician objects that the probability of finding an H that perfectly fits x is high, even if H is false–but the Bayesian need not object if she isn’t in the business of error probabilities. The trouble erupts when Z & M take an error statistical concept like power, and construe it Bayesianly. Even more confusing, they only do so some of the time.

## Tom Sterkenburg Reviews Mayo’s “Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars” (2018, CUP)

Tom Sterkenburg, PhD

Postdoctoral Fellow

Munich Center for Mathematical Philosophy

LMU Munich

Munich, Germany

## Deborah G. Mayo: Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars

The foundations of statistics is not a land of peace and quiet. “Tribal warfare” is perhaps putting it too strong, but it is the case that for decades now various camps and subcamps have been exchanging heated arguments about the right statistical methodology. That these skirmishes are not just an academic exercise is clear from the widespread use of statistical methods, and contemporary challenges that cry for more secure foundations: the rise of big data, the replication crisis.

## The Physical Reality of My New Book! Here at the RSS Meeting (2 years ago)

You can find several excerpts and mementos from the book, including whole “tours” (in proofs) updated June 2020 here.

## Final part of B. Haig’s ‘What can psych stat reformers learn from the error-stat perspective?’ (Bayesian stats)

Here’s the final part of Brian Haig’s recent paper ‘What can psychology’s statistics reformers learn from the error-statistical perspective?’ in *Methods in Psychology* 2 (Nov. 2020). The full article, which is open access, is here. I will make some remarks in the comments.

**5. The error-statistical perspective and the nature of science**

As noted at the outset, the error-statistical perspective has made significant contributions to our philosophical understanding of the nature of science. These are achieved, in good part, by employing insights about the nature and place of statistical inference in experimental science. The achievements include deliberations on important philosophical topics, such as the demarcation of science from non-science, the underdetermination of theories by evidence, the nature of scientific progress, and the perplexities of inductive inference. In this article, I restrict my attention to two such topics: The process of falsification and the structure of modeling.

*5.1. Falsificationism* Continue reading

## Part 2 of B. Haig’s ‘What can psych stat reformers learn from the error-stat perspective?’ (Bayesian stats)

Here’s a picture of ripping open the first box of (rush) copies of *Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars*, and here’s a continuation of Brian Haig’s recent paper ‘What can psychology’s statistics reformers learn from the error-statistical perspective?’ in *Methods in Psychology* 2 (Nov. 2020). Haig contrasts error statistics, the “new statistics”, and Bayesian statistics from the perspective of the statistics wars in psychology. The full article, which is open access, is here. I will make several points in the comments.

**4. Bayesian statistics**

Despite its early presence, and prominence, in the history of statistics, the Bayesian outlook has taken an age to assert itself in psychology. However, a cadre of methodologists has recently advocated the use of Bayesian statistical methods as a superior alternative to the messy frequentist practice that dominates psychology’s research landscape (e.g., Dienes, 2011; Kruschke and Liddell, 2018; Wagenmakers, 2007). These Bayesians criticize NHST, often advocate the use of Bayes factors for hypothesis testing, and rehearse a number of other well-known Bayesian objections to frequentist statistical practice. Continue reading

## SIST: All Excerpts and Mementos: May 2018-June 2020 (updated)

The Meaning of My Title: *Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars* 05/19/18

Blurbs of 16 Tours: *Statistical Inference as Severe Testing*: How to Get Beyond the Statistics Wars (SIST) 03/05/19

*Statistical Inference as Severe Testing*: Preface

**Excursion 1**

**EXCERPTS**

**Tour I** Ex1 TI (full proofs)

Excursion 1 Tour I: Beyond Probabilism and Performance: Severity Requirement (1.1) __09/08/18__

Excursion 1 Tour I (2nd stop): Probabilism, Performance, and Probativeness (1.2) __09/11/18__

Excursion 1 Tour I (3rd stop): The Current State of Play in Statistical Foundations: A View From a Hot-Air Balloon (1.3) __09/15/18__

**Tour II** Ex1 TII (full proofs)

Excursion 1 Tour II: Error Probing Tools versus Logics of Evidence-Excerpt __04/04/19__

Souvenir C: A Severe Tester’s Translation Guide (Excursion 1 Tour II) 11/08/18

**MEMENTOS**

Tour Guide Mementos (Excursion 1 Tour II of How to Get Beyond the Statistics Wars) 10/29/18

**Excursion 2**

**EXCERPTS**

**Tour I** (full proofs)

Excursion 2: Taboos of Induction and Falsification: Tour I (first stop) __09/29/18__

“It should never be true, though it is still often said, that the conclusions are no more accurate than the data on which they are based” (Keepsake by Fisher, 2.1) 10/05/18

**Tour II** (full proofs)

Excursion 2 Tour II (3rd stop): Falsification, Pseudoscience, Induction (2.3) __10/10/18__

Ex 2 TII (Full proofs)

**MEMENTOS**

Tour Guide Mementos and Quiz 2.1 (Excursion 2 Tour I Induction and Confirmation) 11/14/18

Mementos for Excursion 2 Tour II Falsification, Pseudoscience, Induction 11/17/18

**Excursion 3**

**EXCERPTS**

**Tour I** Ex3 TI (full proofs)

Where are Fisher, Neyman, Pearson in 1919? Opening of Excursion 3 11/30/18

Neyman-Pearson Tests: An Episode in Anglo-Polish Collaboration: Excerpt from Excursion 3 (3.2) 12/01/18

First Look at N-P Methods as Severe Tests: Water plant accident [Exhibit (i) from Excursion 3] __12/04/18__

**Tour II** Ex3 TII (full proofs)

It’s the Methods, Stupid: Excerpt from Excursion 3 Tour II (Mayo 2018, CUP) 12/11/18

60 Years of Cox’s (1958) Chestnut: Excerpt from Excursion 3 tour II. 12/29/18

**Tour III** Ex3 Tour III (full proofs)

Capability and Severity: Deeper Concepts: Excerpts From Excursion 3 Tour III 12/20/18

**MEMENTOS**

Memento & Quiz (on SEV): Excursion 3, Tour I 12/08/18

Mementos for “It’s the Methods, Stupid!” Excursion 3 Tour II (3.4-3.6) 12/13/18

Tour Guide Mementos From Excursion 3 Tour III: Capability and Severity: Deeper Concepts 12/26/18

**Excursion 4**

**EXCERPTS**

**Tour I (Full Excursion 4 Tour I)**

Excerpt from Excursion 4 Tour I: The Myth of “The Myth of Objectivity” (Mayo 2018, CUP) 12/26/18

**Tour II (Full Excursion 4 Tour II)**

Excerpt from Excursion 4 Tour II: 4.4 “Do P-Values Exaggerate the Evidence?” 01/10/19

**Tour III (Full proofs of Excursion 4 Tour III)**

**Tour IV (Full proofs of Excursion 4 Tour IV)**

Excerpt from Excursion 4 Tour IV: More Auditing: Objectivity and Model Checking 01/27/19

**MEMENTOS**

Mementos from Excursion 4: Blurbs of Tours I-IV 01/13/19

**Excursion 5**

**Tour I** (full proofs)

(Full) Excerpt: Excursion 5 Tour I — Power: Pre-data and Post-data (from “SIST: How to Get Beyond the Stat Wars”) 04/27/19

**Tour II** (full proofs)

(Full) Excerpt. Excursion 5 Tour II: How Not to Corrupt Power (Power Taboos, Retro Power, and Shpower) 06/07/19

**Tour III** (full proofs)

Deconstructing the Fisher-Neyman conflict wearing Fiducial glasses + Excerpt 5.8 from SIST 02/23/19

**Excursion 6**

**Tour I (full proofs)** What Ever Happened to Bayesian Foundations?

**Tour II (full proofs)**

Excerpts: Souvenir Z: Understanding Tribal Warfare + 6.7 Farewell Keepsake from SIST + List of Souvenirs 05/04/19 (full excerpt)

*Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars* (Mayo, CUP 2018).

## SIST: All Excerpts and Mementos: May 2018-May 2019

**Introduction & Overview**

The Meaning of My Title: *Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars* 05/19/18

Blurbs of 16 Tours: Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (SIST) 03/05/19

**Excursion 1**

**EXCERPTS**

**Tour I** (full proofs)

Excursion 1 Tour I: Beyond Probabilism and Performance: Severity Requirement (1.1) __09/08/18__

Excursion 1 Tour I (2nd stop): Probabilism, Performance, and Probativeness (1.2) __09/11/18__

Excursion 1 Tour I (3rd stop): The Current State of Play in Statistical Foundations: A View From a Hot-Air Balloon (1.3) __09/15/18__

**Tour II** (full proofs)

Excursion 1 Tour II: Error Probing Tools versus Logics of Evidence-Excerpt __04/04/19__

Souvenir C: A Severe Tester’s Translation Guide (Excursion 1 Tour II) 11/08/18

**MEMENTOS**

Tour Guide Mementos (Excursion 1 Tour II of How to Get Beyond the Statistics Wars) 10/29/18

**Excursion 2**

**EXCERPTS**

**Tour I** (full proofs)

Excursion 2: Taboos of Induction and Falsification: Tour I (first stop) __09/29/18__

“It should never be true, though it is still often said, that the conclusions are no more accurate than the data on which they are based” (Keepsake by Fisher, 2.1) 10/05/18

**Tour II** (full proofs)

Excursion 2 Tour II (3rd stop): Falsification, Pseudoscience, Induction (2.3) __10/10/18__

**MEMENTOS**

Tour Guide Mementos and Quiz 2.1 (Excursion 2 Tour I Induction and Confirmation) 11/14/18

Mementos for Excursion 2 Tour II Falsification, Pseudoscience, Induction 11/17/18

**Excursion 3**

**EXCERPTS**

**Tour I (full proofs)**

Where are Fisher, Neyman, Pearson in 1919? Opening of Excursion 3 11/30/18

Neyman-Pearson Tests: An Episode in Anglo-Polish Collaboration: Excerpt from Excursion 3 (3.2) 12/01/18

First Look at N-P Methods as Severe Tests: Water plant accident [Exhibit (i) from Excursion 3] __12/04/18__

**Tour II (full proofs)**

It’s the Methods, Stupid: Excerpt from Excursion 3 Tour II (Mayo 2018, CUP) 12/11/18

60 Years of Cox’s (1958) Chestnut: Excerpt from Excursion 3 tour II. 12/29/18

**Tour III (full proofs)**

Capability and Severity: Deeper Concepts: Excerpts From Excursion 3 Tour III 12/20/18

**MEMENTOS**

Memento & Quiz (on SEV): Excursion 3, Tour I 12/08/18

Mementos for “It’s the Methods, Stupid!” Excursion 3 Tour II (3.4-3.6) 12/13/18

Tour Guide Mementos From Excursion 3 Tour III: Capability and Severity: Deeper Concepts 12/26/18

**Excursion 4**

**EXCERPTS**

**Tour I (full proofs)**

Excerpt from Excursion 4 Tour I: The Myth of “The Myth of Objectivity” (Mayo 2018, CUP) 12/26/18

**Tour II (full proofs)**

Excerpt from Excursion 4 Tour II: 4.4 “Do P-Values Exaggerate the Evidence?” 01/10/19

**Tour III (full proofs)**

**Tour IV (full proofs)**

Excerpt from Excursion 4 Tour IV: More Auditing: Objectivity and Model Checking 01/27/19

**MEMENTOS**

Mementos from Excursion 4: Blurbs of Tours I-IV 01/13/19

**Excursion 5**

**Tour I (full proofs)**

(full) Excerpt: Excursion 5 Tour I — Power: Pre-data and Post-data (from “SIST: How to Get Beyond the Stat Wars”) 04/27/19

**Tour II (full proofs)**

**Tour III (full proofs)**

Deconstructing the Fisher-Neyman conflict wearing Fiducial glasses + Excerpt 5.8 from SIST 02/23/19

**Excursion 6**

**Tour I (full proofs)**

**Tour II (full proofs)**

Excerpts: Souvenir Z: Understanding Tribal Warfare + 6.7 Farewell Keepsake from SIST + List of Souvenirs 05/04/19

*Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars* (Mayo, CUP 2018).

## Excerpts: Final Souvenir Z, Farewell Keepsake & List of Souvenirs

We’ve reached our last Tour (of SIST)*: Pragmatic and Error Statistical Bayesians (Excursion 6), marking the end of our reading with Souvenir Z, the final Souvenir, as well as the Farewell Keepsake in 6.7. Our cruise ship Statinfasst, currently here at Thebes, will be back at dock for maintenance for our next launch at the Summer Seminar in Phil Stat (July 28-Aug 11). Although it’s not my preference that new readers begin with the Farewell Keepsake (it contains a few spoilers), I’m excerpting it together with Souvenir Z (and a list of all souvenirs A – Z) here, and invite all interested readers to peer in. There’s a check list on p. 437: If you’re in the market for a new statistical account, you’ll want to test if it satisfies the items on the list. Have fun!

**Souvenir Z: Understanding Tribal Warfare**

We began this tour asking: Is there an overarching philosophy that “matches contemporary attitudes”? More important is changing attitudes. Not to encourage a switch of tribes, or even a tribal truce, but something more modest and actually achievable: to understand and get beyond the tribal warfare. To understand them, at minimum, requires grasping how the goals of probabilism differ from those of probativeness. This leads to a way of changing contemporary attitudes that is bolder and more challenging. Snapshots from the error statistical lens let you see how frequentist methods supply tools for controlling and assessing how well or poorly warranted claims are. All of the links, from data generation to modeling, to statistical inference and from there to substantive research claims, fall into place within this statistical philosophy. If this is close to being a useful way to interpret a cluster of methods, then the change in contemporary attitudes is radical: it has never been explicitly unveiled. Our journey was restricted to simple examples because those are the ones fought over in decades of statistical battles. Much more work is needed. Those grappling with applied problems are best suited to develop these ideas, and see where they may lead. I never promised, when you bought your ticket for this passage, to go beyond showing that viewing statistics as severe testing will let you get beyond the statistics wars.

**6.7 Farewell Keepsake**

Despite the eclecticism of statistical practice, conflicting views about the roles of probability and the nature of statistical inference – holdovers from long-standing frequentist–Bayesian battles – still simmer below the surface of today’s debates. Reluctance to reopen wounds from old battles has allowed them to fester. To assume all we need is an agreement on numbers – even if they’re measuring different things – leads to statistical schizophrenia. Rival conceptions of the nature of statistical inference show up unannounced in the problems of scientific integrity, irreproducibility, and questionable research practices, and in proposed methodological reforms. If you don’t understand the assumptions behind proposed reforms, their ramifications for statistical practice remain hidden from you.

Rival standards reflect a tension between using probability (a) to constrain the probability that a method avoids erroneously interpreting data in a series of applications (*performance*), and (b) to assign degrees of support, confirmation, or plausibility to hypotheses (*probabilism*). We set sail on our journey with an informal tool for telling what’s true about statistical inference: If little if anything has been done to rule out flaws in taking data as evidence for a claim, then that claim has not passed a severe test. From this minimal severe-testing requirement, we develop a statistical philosophy that goes beyond probabilism and performance. The goals of the severe tester (*probativism*) arise in contexts sufficiently different from those of probabilism that you are free to hold both, for distinct aims (Section 1.2). For statistical inference in science, it is severity we seek. A claim passes with severity only to the extent that it is subjected to, and passes, a test that it probably would have failed, if false. Viewing statistical inference as severe testing alters long-held conceptions of what’s required for an adequate account of statistical inference in science. In this view, a normative statistical epistemology – an account of what’s warranted to infer – must be:

• directly altered by biasing selection effects

• able to falsify claims statistically

• able to test statistical model assumptions

• able to block inferences that violate minimal severity

These overlapping and interrelated requirements are disinterred over the course of our travels. This final keepsake collects a cluster of familiar criticisms of error statistical methods. They are not intended to replace the detailed arguments, pro and con, within; here we cut to the chase, generally keeping to the language of critics. Given our conception of evidence, we retain testing language even when the statistical inference is an estimation, prediction, or proposed answer to a question. The concept of severe testing is sufficiently general to apply to any of the methods now in use. It follows that a variety of statistical methods can serve to advance the severity goal, and that they can, in principle, find their foundations in an error statistical philosophy. However, each requires supplements and reformulations to be relevant to real-world learning. Good science does not turn on adopting any formal tool, and yet the statistics wars often focus on whether to use one type of test (or estimation, or model selection) or another. Meta-researchers charged with instigating reforms do not agree, but the foundational basis for the disagreement is left unattended. It is no wonder some see the statistics wars as proxy wars between competing tribe leaders, each keen to advance one or another tool, rather than about how to do better science. Leading minds are drawn into inconsequential battles, e.g., whether to use a prespecified cut-off of 0.025 or 0.0025 – when in fact good inference is not about cut-offs altogether but about a series of small-scale steps in collecting, modeling and analyzing data that work together to find things out. Still, we need to get beyond the statistics wars in their present form. By viewing a contentious battle in terms of a difference in goals – finding highly probable versus highly well probed hypotheses – readers can see why leaders of rival tribes often talk past each other. 
To be clear, the standpoints underlying the following criticisms are open to debate; we’re far from claiming to do away with them. What should be done away with is rehearsing the same criticisms ad nauseam. Only then can we hear the voices of those calling for an honest standpoint about responsible science.

**1. NHST Licenses Abuses.** First, there’s the cluster of criticisms directed at an abusive NHST animal: NHSTs infer from a single P-value below an arbitrary cut-off to evidence for a research claim, and they encourage P-hacking, fishing, and other selection effects. The reply: this ignores crucial requirements set by Fisher and other founders: isolated significant results are poor evidence of a genuine effect and statistical significance doesn’t warrant substantive (e.g., causal) inferences. Moreover, selective reporting invalidates error probabilities. Some argue significance tests are un-Popperian because the higher the sample size, the easier to infer one’s research hypothesis. It’s true that with a sufficiently high sample size any discrepancy from a null hypothesis has a high probability of being detected, but statistical significance does not license inferring a research claim *H*. Unless *H*’s errors have been well probed by merely finding a small P-value, *H* passes an extremely insevere test. No mountains out of molehills (Sections 4.3 and 5.1). Enlightened users of statistical tests have rejected the cookbook, dichotomous NHST, long lampooned: such criticisms are behind the times. When well-intentioned aims of replication research are linked to these retreads, it only hurts the cause. One doesn’t need a sharp dichotomy to identify rather lousy tests – a main goal for a severe tester. Granted, policy-making contexts may require cut-offs, as do behavioristic setups. But in those contexts, a test’s error probabilities measure overall error control, and are not generally used to assess well-testedness. Even there, users need not fall into the NHST traps (Section 2.5). While attention to banning terms is the least productive aspect of the statistics wars, since NHST is not used by Fisher or N-P, let’s give the caricature its due and drop the NHST acronym; “statistical tests” or “error statistical tests” will do. Simple significance tests are a small part of a conglomeration of error statistical methods.

**To continue reading:** Excerpt Souvenir Z, Farewell Keepsake & List of Souvenirs can be found here.

*We are reading *Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars* (2018, CUP).

***

**Where YOU are in the journey.**

## (Full) Excerpt of Excursion 4 Tour I: The Myth of “The Myth of Objectivity”

**A month ago, I excerpted just the very start of Excursion 4 Tour I* on The Myth of the “Myth of Objectivity”. It’s a short Tour, and this continues the earlier post.**

## 4.1 Dirty Hands: Statistical Inference Is Sullied with Discretionary Choices

If all flesh is grass, kings and cardinals are surely grass, but so is everyone else and we have not learned much about kings as opposed to peasants. (Hacking 1965, p.211)

Trivial platitudes can appear as convincingly strong arguments that everything is subjective. Take this one: No human learning is pure so anyone who demands objective scrutiny is being unrealistic and demanding immaculate inference. This is an instance of Hacking’s “all flesh is grass.” In fact, Hacking is alluding to the subjective Bayesian de Finetti (who “denies the very existence of the physical property [of] chance” (ibid.)). My one-time colleague, I. J. Good, used to poke fun at the frequentist as “denying he uses any judgments!” Let’s admit right up front that every sentence can be prefaced with “agent x judges that,” and not sweep it under the carpet (SUTC) as Good (1976) alleges. Since that can be done for any statement, it cannot be relevant for making the distinctions in which we are interested, and we know can be made, between warranted or well-tested claims and those so poorly probed as to be BENT. You’d be surprised how far into the thicket you can cut your way by brandishing this blade alone. Continue reading

## Mementos from Excursion 4: Objectivity & Auditing: Blurbs of Tours I – IV

**Excursion 4: Objectivity and Auditing (blurbs of Tours I – IV)**

**Excursion 4 Tour I: The Myth of “The Myth of Objectivity”**

Blanket slogans such as “all methods are equally objective and subjective” trivialize into oblivion the problem of objectivity. Such cavalier attitudes are at odds with the moves to take back science. The goal of this tour is to identify what there is in objectivity that we won’t give up, and shouldn’t. While knowledge gaps leave room for biases and wishful thinking, we regularly come up against data that thwart our expectations and disagree with predictions we try to foist upon the world. This pushback supplies objective constraints on which our critical capacity is built. Supposing an objective method is to supply formal, mechanical rules to process data is a holdover of a discredited logical positivist philosophy. Discretion in data generation and modeling does not warrant concluding: statistical inference is a matter of subjective belief. It is one thing to talk of our models as objects of belief and quite another to maintain that our task is to model beliefs. For a severe tester, a statistical method’s objectivity requires the ability to audit an inference: check assumptions, pinpoint blame for anomalies, falsify, and directly register how biasing selection effects–hunting, multiple testing and cherry-picking–alter its error probing capacities.

**Keywords**

objective vs. subjective, objectivity requirements, auditing, dirty hands argument, phenomena vs. epiphenomena, logical positivism, verificationism, loss and cost functions, default Bayesians, equipoise assignments, (Bayesian) wash-out theorems, degenerating program, transparency, epistemology: internal/external distinction

**Excursion 4 Tour II: Rejection Fallacies: Who’s Exaggerating What?**

We begin with the *Mountains out of Molehills Fallacy* (large *n* problem): The fallacy of taking a (P-level) rejection of *H*₀ with larger sample size as indicating greater discrepancy from *H*₀ than with a smaller sample size (4.3). The Jeffreys-Lindley paradox shows that with large enough *n*, a .05 significant result can correspond to assigning *H*₀ a high probability (.95). There are family feuds as to whether this is a problem for Bayesians or frequentists! The severe tester takes account of sample size in interpreting the discrepancy indicated. A modification of confidence intervals (CIs) is required.
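A small numerical sketch may help with the Mountains out of Molehills Fallacy. It assumes the one-sided Normal test of *H*₀: μ ≤ 0 with known σ = 1, and it computes the severity for the claim μ > μ₁ as the probability of a sample mean no larger than the one observed, were μ equal to μ₁. The benchmark μ₁ = 0.1 and the function names are illustrative assumptions. Each row uses a result that just reaches the same significance level (z = 1.96) at a different sample size.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def severity_mu_greater(mu1, xbar, n, sigma=1.0):
    """SEV(mu > mu1): P(sample mean <= observed xbar; mu = mu1), sigma known."""
    return STD_NORMAL.cdf((xbar - mu1) * n ** 0.5 / sigma)


def just_significant_mean(n, sigma=1.0, z_obs=1.96):
    """Sample mean that sits exactly z_obs standard errors above 0."""
    return z_obs * sigma / n ** 0.5


# Same P-level at every n, but the observed mean shrinks with n, so the
# claim mu > 0.1 is indicated less and less well.
for n in (100, 1600, 10000):
    xbar = just_significant_mean(n)
    print(n, round(xbar, 4), round(severity_mu_greater(0.1, xbar, n), 3))
```

Under these assumptions the same P-level with a larger *n* warrants a smaller discrepancy from *H*₀, which is the reverse of what the fallacy supposes.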

It is commonly charged that significance levels overstate the evidence against the null hypothesis (4.4, 4.5). What’s meant? One answer, considered here, is that the P-value can be smaller than the posterior probability of the null hypothesis, based on a lump prior (often .5) on a point null hypothesis. There are battles between and within tribes of Bayesians and frequentists. Some argue for lowering the P-value to bring it into line with a particular posterior. Others argue the supposed exaggeration results from an unwarranted lump prior on a wrongly formulated null. We consider how to evaluate reforms based on Bayes factor standards (4.5). Rather than dismiss criticisms of error statistical methods that assume a standard from a rival account, we give them a generous reading. Only once the minimal principle for severity is violated do we reject them. Souvenir R summarizes the severe tester’s interpretation of a rejection in a statistical significance test. At least 2 benchmarks are needed: reports of discrepancies (from a test hypothesis) that are, and those that are not, well indicated by the observed difference.

**Keywords**

significance test controversy, mountains out of molehills fallacy, large n problem, confidence intervals, P-values exaggerate evidence, Jeffreys-Lindley paradox, Bayes/Fisher disagreement, uninformative (diffuse) priors, Bayes factors, spiked priors, spike and slab, equivocating terms, severity interpretation of rejection (SIR)

**Excursion 4 Tour III: Auditing: Biasing Selection Effects & Randomization**

Tour III takes up Peirce’s “two rules of inductive inference”: predesignation (4.6) and randomization (4.7). The Tour opens on a court case transpiring: the CEO of a drug company is being charged with giving shareholders an overly rosy report based on post-data dredging for nominally significant benefits. Auditing a result includes checking for (i) selection effects, (ii) violations of model assumptions, and (iii) obstacles to moving from statistical to substantive claims. We hear it’s too easy to obtain small *P*-values, yet replication attempts find it difficult to get small *P*-values with preregistered results. I call this the *paradox of replication*. The problem isn’t *P*-values but failing to adjust them for cherry picking and other *biasing selection effects*. Adjustments by Bonferroni and false discovery rates are considered. There is a tension between popular calls for preregistering data analysis, and accounts that downplay error probabilities. Worse, in the interest of promoting a methodology that rejects error probabilities, researchers who most deserve lambasting are thrown a handy line of defense. However, data dependent searching need not be pejorative. In some cases, it can improve severity. (4.6)
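To make the point about biasing selection effects concrete, here is a rough sketch. It assumes *k* independent tests of true nulls; the P-values in the list are made up for illustration, and the function names are mine. It shows how quickly searching produces at least one nominally significant result, and how the Bonferroni and Benjamini-Hochberg (false discovery rate) adjustments respond.

```python
def prob_at_least_one_false_positive(k, alpha=0.05):
    """Chance that at least one of k independent true-null tests gives P <= alpha."""
    return 1 - (1 - alpha) ** k


def bonferroni_reject(p_values, alpha=0.05):
    """Reject only P-values at or below alpha / k."""
    k = len(p_values)
    return [p <= alpha / k for p in p_values]


def benjamini_hochberg_reject(p_values, q=0.05):
    """Benjamini-Hochberg step-up procedure.

    Rejects the i smallest P-values, where i is the largest rank with
    p_(i) <= i * q / k.
    """
    k = len(p_values)
    order = sorted(range(k), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * q / k:
            cutoff_rank = rank
    reject = [False] * k
    for idx in order[:cutoff_rank]:
        reject[idx] = True
    return reject


p_values = [0.001, 0.012, 0.025, 0.045, 0.20]  # made-up values for illustration

print(round(prob_at_least_one_false_positive(20), 3))
print(bonferroni_reject(p_values))
print(benjamini_hochberg_reject(p_values))
```

With 20 searches the chance of at least one nominal .05 result is well over one half even when nothing is there, which is why the unadjusted P-value from a dredged result no longer reports the actual error probability.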

Big Data cannot ignore experimental design principles. Unless we take account of the sampling distribution, it becomes difficult to justify resampling and randomization. We consider RCTs in development economics (RCT4D) and genomics. Failing to randomize microarrays is thought to have resulted in a decade lost in genomics. Granted the rejection of error probabilities is often tied to presupposing their relevance is limited to long-run behavioristic goals, which we reject. They are essential for an epistemic goal: controlling and assessing how well or poorly tested claims are. (4.7)

**Keywords**

error probabilities and severity, predesignation, biasing selection effects, paradox of replication, capitalizing on chance, Bayes factors, batch effects, preregistration, randomization: Bayes-frequentist rationale, Bonferroni adjustment, false discovery rates, RCT4D, genome-wide association studies (GWAS)

**Excursion 4 Tour IV:** **More Auditing: Objectivity and Model Checking**

While all models are false, it’s also the case that no useful models are true. Were a model so complex as to represent data realistically, it wouldn’t be useful for finding things out. A statistical model is useful by being *adequate for a problem*, meaning it enables controlling and assessing if purported solutions are well or poorly probed and to what degree. We give a way to define severity in terms of solving a problem. (4.8) When it comes to testing model assumptions, many Bayesians agree with George Box (1983) that “it requires frequentist theory of significance tests” (p. 57). Tests of model assumptions, also called misspecification (M-S) tests, are thus a promising area for Bayes-frequentist collaboration. (4.9) When the model is in doubt, the likelihood principle is inapplicable or violated. We illustrate a non-parametric bootstrap resampling. It works without relying on a theoretical probability distribution, but it still has assumptions. (4.10) We turn to the M-S testing approach of econometrician Aris Spanos. (4.11) I present the high points for unearthing spurious correlations, and assumptions of linear regression, employing 7 figures. M-S tests differ importantly from model selection–the latter uses a criterion for choosing among models, but does not test their statistical assumptions. Such criteria test fit rather than whether a model has captured the systematic information in the data.
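For readers who have not seen a non-parametric bootstrap, here is a minimal sketch of the general idea. It is not the book's example: the data are made up, the percentile interval is just one common choice, and the function names are illustrative. It resamples the observed data with replacement and reads an interval for the mean off the resampled means.

```python
import random
from statistics import mean


def bootstrap_means(data, n_resamples=2000, seed=0):
    """Sorted means of resamples drawn with replacement from the data."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(data)
    return sorted(mean(rng.choices(data, k=n)) for _ in range(n_resamples))


def percentile_interval(sorted_stats, level=0.95):
    """Percentile interval read directly off the sorted bootstrap statistics."""
    lo_idx = int((1 - level) / 2 * len(sorted_stats))
    hi_idx = int((1 + level) / 2 * len(sorted_stats)) - 1
    return sorted_stats[lo_idx], sorted_stats[hi_idx]


data = [2.1, 3.4, 1.9, 5.0, 4.2, 3.3, 2.8, 4.9, 3.7, 2.5]  # made-up sample
boot = bootstrap_means(data)
low, high = percentile_interval(boot)
print(round(mean(data), 2), round(low, 2), round(high, 2))
```

No theoretical sampling distribution is invoked, but the resampling step treats the observations as independent and identically distributed, so the method still rests on assumptions that would themselves need checking.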

**Keywords**

adequacy for a problem, severity (in terms of problem solving), model testing/misspecification (M-S) tests, likelihood principle conflicts, bootstrap, resampling, Bayesian p-value, central limit theorem, nonsense regression, significance tests in model checking, probabilistic reduction, respecification

## Excerpt from Excursion 4 Tour II: 4.4 “Do P-Values Exaggerate the Evidence?”

**Excerpt from Excursion 4 Tour II***

**4.4 Do P-Values Exaggerate the Evidence?**

“Significance levels overstate the evidence against the null hypothesis,” is a line you may often hear. Your first question is:

What do you mean by overstating the evidence against a hypothesis?

Several (honest) answers are possible. Here is one possibility:

What I mean is that when I put a lump of prior weight π₀ of 1/2 on a point null *H*₀ (or a very small interval around it), the P-value is smaller than my Bayesian posterior probability on *H*₀.

More generally, the “*P*-values exaggerate” criticism typically boils down to showing that if inference is appraised via one of the probabilisms – Bayesian posteriors, Bayes factors, or likelihood ratios – the evidence against the null (or against the null and in favor of some alternative) isn’t as big as 1 − *P*. Continue reading
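The lump-prior answer can be checked with a short calculation. This sketch assumes Normal data with known σ = 1, a point null *H*₀: μ = 0 given prior weight 1/2, and the remaining weight spread as a Normal(0, τ²) prior with τ = 1 under the alternative. The choice of τ and the function name are assumptions made for illustration. The observed result is held at z = 1.96 (two-sided P of about .05) while *n* varies.

```python
from statistics import NormalDist


def posterior_on_point_null(z, n, sigma=1.0, tau=1.0, prior_null=0.5):
    """Posterior probability of H0: mu = 0 under a spike-and-slab prior.

    Spike: prior_null on mu = 0. Slab: mu ~ Normal(0, tau^2) otherwise.
    """
    se = sigma / n ** 0.5
    xbar = z * se
    # Density of the sample mean under H0, and its marginal density under
    # the alternative (prior variance plus sampling variance).
    like_null = NormalDist(0.0, se).pdf(xbar)
    like_alt = NormalDist(0.0, (tau ** 2 + se ** 2) ** 0.5).pdf(xbar)
    weighted_null = prior_null * like_null
    return weighted_null / (weighted_null + (1 - prior_null) * like_alt)


two_sided_p = 2 * (1 - NormalDist().cdf(1.96))

for n in (10, 100, 1000, 100000):
    print(n, round(two_sided_p, 3), round(posterior_on_point_null(1.96, n), 3))
```

Under these assumptions the posterior on *H*₀ climbs with *n* while the P-value stays put, so whether the P-value “exaggerates” depends on accepting the lump prior as the standard of evidence.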

## January Invites: Ask me questions (about SIST), Write Discussion Analyses (U-Phils)

**ASK ME.** Some readers say they’re not sure where to ask a question of comprehension on *Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars* (2018, CUP)–SIST– so here’s a special post to park your questions of comprehension (to be placed in the comments) on a little over the first half of the book. That goes up to and includes Excursion 4 Tour I on “The Myth of ‘The Myth of Objectivity’”. However, I will soon post on Tour II: Rejection Fallacies: Who’s Exaggerating What? So feel free to ask questions of comprehension as far as p. 259.

All of the SIST blog posts (Excerpts and Mementos) so far are here.

**WRITE A DISCUSSION NOTE**: Beginning January 16, anyone who wishes to write a discussion note (on some aspect or issue up to p. 259) is invited to do so (<750 words, longer if you wish). Send them to my error email. I will post as many as possible on this blog.

We initially called such notes “U-Phils” as in “You do a Philosophical analysis”, which really only means it’s an analytic exercise that strives to first give the most generous interpretation to positions, and then examines them. See the general definition of a U-Phil.

Some Examples:

Mayo, Senn, and Wasserman on Gelman’s RMM** Contribution

U-Phil: A Further Comment on Gelman by Christian Hennig.

For a whole group of reader contributions, including Jim Berger on Jim Berger, see: Earlier U-Phils and Deconstructions

If you’re writing a note on objectivity, you might wish to compare and contrast Excursion 4 Tour I with a paper by Gelman and Hennig (2017): “Beyond subjective and objective in Statistics”.

These invites extend through January.

## SIST* Blog Posts: Excerpts & Mementos (to Dec 31 2018)

*Excerpts*

- 05/19: The Meaning of My Title: *Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars*
- 09/08: Excursion 1 Tour I: Beyond Probabilism and Performance: Severity Requirement (1.1)
- 09/11: Excursion 1 Tour I (2nd stop): Probabilism, Performance, and Probativeness (1.2)
- 09/15: Excursion 1 Tour I (3rd stop): The Current State of Play in Statistical Foundations: A View From a Hot-Air Balloon (1.3)
- 09/29: Excursion 2: Taboos of Induction and Falsification: Tour I (first stop)
- 10/10: Excursion 2 Tour II (3rd stop): Falsification, Pseudoscience, Induction (2.3)
- 11/30: Where are Fisher, Neyman, Pearson in 1919? Opening of Excursion 3
- 12/01: Neyman-Pearson Tests: An Episode in Anglo-Polish Collaboration: Excerpt from Excursion 3 (3.2)
- 12/04: First Look at N-P Methods as Severe Tests: Water plant accident [Exhibit (i) from Excursion 3]
- 12/11: It’s the Methods, Stupid: Excerpt from Excursion 3 Tour II (Mayo 2018, CUP)
- 12/20: Capability and Severity: Deeper Concepts: Excerpts From Excursion 3 Tour III
- 12/26: Excerpt from Excursion 4 Tour I: The Myth of “The Myth of Objectivity” (Mayo 2018, CUP)
- 12/29: 60 Years of Cox’s (1958) Chestnut: Excerpt from Excursion 3 tour II.

*Mementos, Keepsakes and Souvenirs*

- 10/29: Tour Guide **Mementos** (Excursion 1 Tour II of How to Get Beyond the Statistics Wars)
- 11/8: **Souvenir** C: A Severe Tester’s Translation Guide (Excursion 1 Tour II)
- 10/5: “It should never be true, though it is still often said, that the conclusions are no more accurate than the data on which they are based” (**Keepsake** by Fisher, 2.1)
- 11/14: Tour Guide **Mementos** and Quiz 2.1 (Excursion 2 Tour I Induction and Confirmation)
- 11/17: **Mementos** for Excursion 2 Tour II Falsification, Pseudoscience, Induction
- 12/08: Memento & Quiz (on SEV): Excursion 3, Tour I
- 12/13: Mementos for “It’s the Methods, Stupid!” Excursion 3 Tour II (3.4-3.6)
- 12/26: Tour Guide Mementos From Excursion 3 Tour III: Capability and Severity: Deeper Concepts

\**Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars* (Mayo, CUP 2018).

## Excerpt from Excursion 4 Tour I: The Myth of “The Myth of Objectivity” (Mayo 2018, CUP)

## Tour I The Myth of “The Myth of Objectivity”*

Objectivity in statistics, as in science more generally, is a matter of both aims and methods. Objective science, in our view, aims to find out what is the case as regards aspects of the world [that hold] independently of our beliefs, biases and interests; thus objective methods aim for the critical control of inferences and hypotheses, constraining them by evidence and checks of error. (Cox and Mayo 2010, p. 276)

Whenever you come up against blanket slogans such as “no methods are objective” or “all methods are equally objective and subjective” it is a good guess that the problem is being trivialized into oblivion. Yes, there are judgments, disagreements, and values in any human activity, which alone makes it too trivial an observation to distinguish among very different ways that threats of bias and unwarranted inferences may be controlled. Is the objectivity–subjectivity distinction really toothless, as many will have you believe? I say no. I know it’s a meme promulgated by statistical high priests, but you agreed, did you not, to use a bit of chutzpah on this excursion? Besides, cavalier attitudes toward objectivity are at odds with even more widely endorsed grass roots movements to promote replication, reproducibility, and to come clean on a number of sources behind illicit results: multiple testing, cherry picking, failed assumptions, researcher latitude, publication bias and so on. The moves to take back science are rooted in the supposition that we can more objectively scrutinize results – even if it’s only to point out those that are BENT. The fact that these terms are used equivocally should not be taken as grounds to oust them but rather to engage in the difficult work of identifying what there is in “objectivity” that we won’t give up, and shouldn’t. Continue reading

## Where Are Fisher, Neyman, Pearson in 1919? Opening of Excursion 3

**Excursion 3 Statistical Tests and Scientific Inference**

Tour I Ingenious and Severe Tests

[T]he impressive thing about [the 1919 tests of Einstein’s theory of gravity] is the risk involved in a prediction of this kind. If observation shows that the predicted effect is definitely absent, then the theory is simply refuted. The theory is incompatible with certain possible results of observation – in fact with results which everybody before Einstein would have expected. This is quite different from the situation I have previously described, [where] . . . it was practically impossible to describe any human behavior that might not be claimed to be a verification of these [psychological] theories. (Popper 1962, p. 36)

The 1919 eclipse experiments opened Popper’s eyes to what made Einstein’s theory so different from other revolutionary theories of the day: Einstein was prepared to subject his theory to risky tests.[1] Einstein was eager to galvanize scientists to test his theory of gravity, knowing the solar eclipse was coming up on May 29, 1919. Leading the expedition to test GTR was a perfect opportunity for Sir Arthur Eddington, a devout follower of Einstein as well as a devout Quaker and conscientious objector. Fearing “a scandal if one of its young stars went to jail as a conscientious objector,” officials at Cambridge argued that Eddington couldn’t very well be allowed to go off to war when the country needed him to prepare the journey to test Einstein’s predicted light deflection (Kaku 2005, p. 113). Continue reading

## SIST* Posts: Excerpts & Mementos (to Nov 30, 2018)

**SIST* BLOG POSTS (up to Nov 30, 2018)**

*Excerpts*

- 05/19: The Meaning of My Title: *Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars*
- 09/08: Excursion 1 Tour I: Beyond Probabilism and Performance: Severity Requirement (1.1)
- 09/11: Excursion 1 Tour I (2nd stop): Probabilism, Performance, and Probativeness (1.2)
- 09/15: Excursion 1 Tour I (3rd stop): The Current State of Play in Statistical Foundations: A View From a Hot-Air Balloon (1.3)
- 09/29: Excursion 2: Taboos of Induction and Falsification: Tour I (first stop)
- 10/10: Excursion 2 Tour II (3rd stop): Falsification, Pseudoscience, Induction (2.3)
- 11/30: Where are Fisher, Neyman, Pearson in 1919? Opening of Excursion 3

*Mementos, Keepsakes and Souvenirs*

- 10/29: Tour Guide **Mementos** (Excursion 1 Tour II of How to Get Beyond the Statistics Wars)
- 11/8: **Souvenir** C: A Severe Tester’s Translation Guide (Excursion 1 Tour II)
- 10/5: “It should never be true, though it is still often said, that the conclusions are no more accurate than the data on which they are based” (**Keepsake** by Fisher, 2.1)
- 11/14: Tour Guide **Mementos** and Quiz 2.1 (Excursion 2 Tour I Induction and Confirmation)
- 11/17: **Mementos** for Excursion 2 Tour II Falsification, Pseudoscience, Induction

\**Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars* (Mayo, CUP 2018)

## Tour Guide Mementos and QUIZ 2.1 (Excursion 2 Tour I: Induction and Confirmation)

**Excursion 2 Tour I: Induction and Confirmation** *(Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars)*

*Tour Blurb*. The roots of rival statistical accounts go back to the logical Problem of Induction. (2.1) The logical problem of induction is a matter of finding an argument to justify a type of argument (enumerative induction), so it is important to be clear on arguments, their soundness versus their validity. These are key concepts of fundamental importance to our journey. Given that any attempt to solve the logical problem of induction leads to circularity, philosophers turned instead to building logics that seemed to capture our intuitions about induction. This led to confirmation theory and some projects in today’s formal epistemology. There’s an analogy between contrasting views in philosophy and statistics: Carnapian confirmation is to Bayesian statistics, as Popperian falsification is to frequentist error statistics. Logics of confirmation take the form of probabilisms, either in the form of raising the probability of a hypothesis, or arriving at a posterior probability. (2.2) The contrast between these types of probabilisms, and the problems each is found to have in confirmation theory are directly relevant to the types of probabilisms in statistics. Notably, Harold Jeffreys’ non-subjective Bayesianism, and current spin-offs, share features with Carnapian inductive logics. We examine the problem of irrelevant conjunctions: that if **x** confirms *H*, it confirms (*H* & *J*) for any *J*. This also leads to what’s called the tacking paradox.

**Quiz on 2.1** Soundness vs Validity in Deductive Logic. Let ~*C* be the denial of claim *C*. For each of the following arguments, indicate whether it is **valid and sound**, **valid but unsound**, or **invalid**. Continue reading

## Tour Guide Mementos (Excursion 1 Tour II of How to Get Beyond the Statistics Wars)

**Excursion 1 Tour II: Error Probing Tools vs. Logics of Evidence**

*Blurb.* Core battles revolve around the relevance of a method’s error probabilities. What’s distinctive about the severe testing account is that it uses error probabilities evidentially: to assess how severely a claim has passed a test. Error control is necessary but not sufficient for severity. Logics of induction focus on the relationships between given data and hypotheses–so outcomes other than the one observed drop out. This is captured in the Likelihood Principle (LP). Tour II takes us to the crux of central wars in relation to the Law of Likelihood (LL) and Bayesian probabilism. (1.4) Hypotheses deliberately designed to accord with the data can result in minimal severity. The likelihoodist wishes to oust them via degrees of belief captured in prior probabilities. To the severe tester, such gambits directly alter the evidence by leading to inseverity. (1.5) Stopping rules: If a tester tries and tries again until significance is reached–optional stopping–significance will be attained erroneously with high probability. According to the LP, the stopping rule doesn’t alter evidence. The irrelevance of optional stopping is an asset for holders of the LP; it’s the opposite for a severe tester. The warring sides talk past each other. Continue reading
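The effect of trying and trying again can be seen in a quick simulation. This is a sketch under stated assumptions: data are drawn from a Normal(0, 1) distribution so the null μ = 0 is true, σ is treated as known, and the tester checks a two-sided z-test at the nominal .05 level after every new observation, stopping at the first “significant” result. The function names, the number of trials, and the seed are illustrative.

```python
import random


def reaches_significance(max_n, rng, z_crit=1.96):
    """Sample one observation at a time under a true null and report whether
    the running z-statistic ever reaches nominal significance by max_n."""
    total = 0.0
    for n in range(1, max_n + 1):
        total += rng.gauss(0.0, 1.0)
        # z = (sample mean) * sqrt(n) / sigma, with sigma = 1
        if abs(total) / n ** 0.5 >= z_crit:
            return True
    return False


def false_rejection_rate(max_n, trials=2000, seed=1):
    """Estimated probability of erroneously reaching significance."""
    rng = random.Random(seed)
    hits = sum(reaches_significance(max_n, rng) for _ in range(trials))
    return hits / trials


# One look keeps the error rate near the nominal .05; allowing up to 10 or
# 100 looks inflates it substantially.
for max_n in (1, 10, 100):
    print(max_n, false_rejection_rate(max_n))
```

The nominal level describes the error probability only for the fixed sample size design. Once the stopping rule depends on the data, the actual probability of an erroneous rejection is what the simulation estimates, and that is the quantity a severe tester needs reported.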