# Statistical Inference as Severe Testing

## Aris Spanos Reviews Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars

A. Spanos

Aris Spanos was asked to review my Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (CUP, 2018), combined with a review of the re-issue of Ian Hacking’s classic Logic of Statistical Inference. The journal is OEconomia: History, Methodology, Philosophy. Below are excerpts from his discussion of my book (pp. 843-860). I jump past the Hacking review, occasionally excerpting for length. To read his full article, go to the external journal pdf or the stable internal blog pdf.

….

## 2 Mayo (2018). Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars

The sub-title of Mayo’s (2018) book provides an apt description of its primary aim, in the sense that its focus is on the current discussions pertaining to replicability and trustworthy empirical evidence that revolve around the main fault line in statistical inference: the nature, interpretation and uses of probability in statistical modeling and inference. This underlies not only the form and structure of inductive inference, but also the nature of the underlying statistical reasoning, as well as the nature of the evidence it gives rise to.

A crucial theme in Mayo’s book pertains to the current confusing and confused discussions on reproducibility and replicability of empirical evidence. The book cuts through the enormous level of confusion we see today about basic statistical terms, and in so doing explains why the experts so often disagree about reforms intended to improve statistical science.

Mayo makes a concerted effort to delineate the issues and clear up these confusions by defining the basic concepts accurately and placing many widely held methodological views in the best possible light before scrutinizing them. In particular, the book discusses at length the merits and demerits of the proposed reforms which include: (a) replacing p-values with Confidence Intervals (CIs), (b) using estimation-based effect sizes and (c) redefining statistical significance.

The key philosophical concept employed by Mayo to distinguish between a sound empirical evidential claim for a hypothesis H and an unsound one is the notion of a severe test: if little has been done to rule out flaws (errors and omissions) in pronouncing that data x0 provide evidence for a hypothesis H, then that inferential claim has not passed a severe test, rendering the claim untrustworthy. One has trustworthy evidence for a claim C only to the extent that C passes a severe test; see Mayo (1983; 1996). A distinct advantage of the concept of severe testing is that it is sufficiently general to apply to both frequentist and Bayesian inferential methods.

Mayo makes a case that there is a two-way link between philosophy and statistics. On one hand, philosophy helps in resolving conceptual, logical, and methodological problems of statistical inference. On the other hand, viewing statistical inference as severe testing gives rise to novel solutions to crucial philosophical problems including induction, falsification and the demarcation of science from pseudoscience. In addition, it serves as the foundation for understanding and getting beyond the statistics wars that currently revolve around the replication crisis; hence the title of the book, Statistical Inference as Severe Testing.

Chapter (excursion) 1 of Mayo’s (2018) book sets the scene by scrutinizing the different role of probability in statistical inference, distinguishing between:

(i) Probabilism. Probability is used to assign a degree of confirmation, support or belief in a hypothesis H, given data x0 (Bayesian, likelihoodist, Fisher (fiducial)). An inferential claim H is warranted when it is assigned a high probability, support, or degree of belief (absolute or comparative).
(ii) Performance. Probability is used to ensure the long-run reliability of inference procedures; type I, II, coverage probabilities (frequentist, behavioristic Neyman-Pearson). An inferential claim H is warranted when it stems from a procedure with a low long-run error.
(iii) Probativism. Probability is used to assess the probing capacity of inference procedures, pre-data (type I, II, coverage probabilities), as well as post-data (p-value, severity evaluation). An inferential claim H is warranted when the different ways it can be false have been adequately probed and averted.

Mayo argues that probativism based on the severe testing account uses error probabilities to output an evidential interpretation based on assessing how severely an inferential claim H has passed a test with data x0. Error control and long-run reliability are necessary but not sufficient for probativism. This perspective is contrasted to probabilism (Law of Likelihood (LL) and Bayesian posterior) that focuses on the relationships between data x0 and hypothesis H, and ignores outcomes x∈Rn other than x0 by adhering to the Likelihood Principle (LP): given a statistical model Mθ(x) and data x0, all relevant sample information for inference purposes is contained in L(θ; x0), ∀θ∈Θ. Such a perspective can produce unwarranted results with high probability, by failing to pick up on optional stopping, data dredging and other biasing selection effects. It is at odds with what is widely accepted as the most effective way to improve replication: predesignation, and transparency about how hypotheses and data were generated and selected.
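The effect of optional stopping on error probabilities can be illustrated with a minimal simulation, a sketch under assumed (hypothetical) settings: Normal data with a true null, nominal 5% two-sided cutoff, and "peeking" after every observation up to a maximum sample size. Only the Python standard library is used.

```python
import math
import random

def optional_stopping_rejects(n_max, seed, cutoff=1.96):
    """Simulate one trial: draw N(0,1) data (so H0: mu = 0 is true) one
    point at a time, 'peeking' after each draw and rejecting H0 the
    first time |z| exceeds the nominal 5% cutoff."""
    rng = random.Random(seed)
    total = 0.0
    for n in range(1, n_max + 1):
        total += rng.gauss(0.0, 1.0)
        z = total / math.sqrt(n)   # z-statistic for H0: mu = 0, sigma = 1
        if abs(z) > cutoff:
            return True            # a 'significant' result found by peeking
    return False

trials = 2000
hits = sum(optional_stopping_rejects(50, seed=s) for s in range(trials))
print(f"Actual type I error with peeking up to n = 50: {hits / trials:.2f}")
```

The actual probability of obtaining at least one "significant" result far exceeds the nominal .05, which is exactly what error probabilities register and what an account adhering strictly to the LP ignores, since each final likelihood is unaffected by the stopping rule.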

Chapter (excursion) 2 entitled ‘Taboos of Induction and Falsification’ relates the various uses of probability to draw certain parallels between probabilism, Bayesian statistics and Carnapian logics of confirmation on one side, and performance, frequentist statistics and Popperian falsification on the other. The discussion in this chapter covers a variety of issues in philosophy of science, including, the problem of induction, the asymmetry of induction and falsification, sound vs. valid arguments, enumerative induction (straight rule), confirmation theory (and formal epistemology), statistical affirming the consequent, the old evidence problem, corroboration, demarcation of science and pseudoscience, Duhem’s problem and novelty of evidence. These philosophical issues are also related to statistical conundrums as they relate to significance testing, fallacies of rejection, the cannibalization of frequentist testing known as Null Hypothesis Significance Testing (NHST) in psychology, and the issues raised by the reproducibility and replicability of evidence.

Chapter (excursion) 3 on ‘Statistical Tests and Scientific Inference’ provides a basic introduction to frequentist testing paying particular attention to crucial details, such as specifying explicitly the assumed statistical model Mθ(x) and the proper framing of hypotheses in terms of its parameter space Θ, with a view to providing a coherent account by avoiding undue formalism. The Neyman-Pearson (N-P) formulation of hypothesis testing is explained using a simple example, and then related to Fisher’s significance testing. What is different from previous treatments is that the claimed ‘inconsistent hybrid’ associated with the NHST caricature of frequentist testing is circumvented. The crucial difference often drawn is based on the N-P emphasis on pre-data long-run error probabilities, and the behavioristic interpretation of tests as accept/reject rules. By contrast, the post-data p-value associated with Fisher’s significance tests is thought to provide a more evidential interpretation. In this chapter, the two approaches are reconciled in the context of the error statistical framework. The N-P formulation provides the formal framework in the context of which an optimal theory of frequentist testing can be articulated, but in its current expositions it lacks a proper evidential interpretation. [For the detailed example see his review pdf.]

If a hypothesis H0 passes a test Τα that was highly capable of finding discrepancies from it, were they to be present, then the passing result indicates some evidence for their absence. The resulting evidential result comes in the form of the magnitude of the discrepancy γ from H0 warranted with test Τα and data x0 at different levels of severity. The intuition underlying the post-data severity is that a small p-value or a rejection of H0 based on a test with low power (e.g. a small n) for detecting a particular discrepancy γ provides stronger evidence for the presence of γ than if the test had much higher power (e.g. a large n).

The post-data severity evaluation outputs the discrepancy γ stemming from the testing results and takes the probabilistic form:

SEV(θ ≶ θ1; x0) = P(d(X) ≷ d(x0); θ1 = θ0 + γ), for all θ1∈Θ1,

where the inequalities are determined by the testing result and the sign of d(x0). [Ed Note ≶ is his way of combining the definition of severity for both > and <, in order to abbreviate. It is not used in SIST.] When the relevant N-P test result is ‘accept (reject) H0’ one is seeking the smallest (largest) discrepancy γ, in the form of an inferential claim θ ≶ θ1 = θ0 + γ, warranted by Τα and x0 at a high enough probability, say .8 or .9. The severity evaluations are introduced by connecting them to more familiar calculations relating to observed confidence intervals and p-value calculations. A more formal treatment of the post-data severity evaluation is given in chapter (excursion) 5. [Ed. note: “Excursions” are actually Parts, Tours are chapters]
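For concreteness, a minimal sketch of the severity calculation for the one-sided Normal test (reject H0: µ ≤ µ0 when d(x0) = √n(x̄ − µ0)/σ is large) can be written with the standard library alone. The specific numbers (µ0 = 0, σ = 1, n = 100, observed x̄ = 0.2) are hypothetical, chosen only for illustration.

```python
import math

def phi(z):
    """Standard Normal CDF, via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def severity_gt(mu1, xbar, mu0=0.0, sigma=1.0, n=100):
    """Post-data severity for the claim mu > mu1, following a rejection
    of H0: mu <= mu0 with test statistic d(x0) = sqrt(n)*(xbar - mu0)/sigma:
    SEV(mu > mu1; x0) = Pr(d(X) <= d(x0); mu = mu1)."""
    d0 = math.sqrt(n) * (xbar - mu0) / sigma
    delta1 = math.sqrt(n) * (mu1 - mu0) / sigma
    return phi(d0 - delta1)

# Hypothetical rejection with observed xbar = 0.2 (so d(x0) = 2):
for mu1 in (0.0, 0.1, 0.2):
    print(f"SEV(mu > {mu1}; x0) = {severity_gt(mu1, xbar=0.2):.3f}")
# prints 0.977, 0.841, 0.500: larger claimed discrepancies
# pass with lower severity, and mu > xbar only with severity .5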

Mayo uses the post-data severity perspective to scotch several misinterpretations of the p-value, including the claim that the p-value is not a legitimate error probability. She also calls into question any comparisons of the tail areas of d(X) under H0 that vary with x∈Rn, with posterior distribution tail areas that vary with θ∈Θ, pointing out that this is tantamount to comparing apples and oranges!

The real-life examples of the 1919 eclipse data for testing the General Theory of Relativity, as well as the 2012 discovery of the Higgs particle, are used to illustrate some of the concepts in this chapter.

The discussion in this chapter sheds light on several important problems in statistical inference, including several howlers of statistical testing, Jeffreys’ tail area criticism, weak conditionality principle and the likelihood principle.

Chapter (excursion) 5, entitled ‘Power and Severity’, provides an in-depth discussion of power and its abuses or misinterpretations, as well as scotching several confusions permeating the current discussions on the replicability of empirical evidence.

Confusion 1: The power of a N-P test Τα:= {d(X), C1(α)} is a pre-data error probability that calibrates the generic (for any sample realization x∈Rn ) capacity of the test in detecting different discrepancies from H0, for a given type I error probability α. As such, the power is not a point function one can evaluate arbitrarily at a particular value θ1. It is defined for all values in the alternative space θ1∈Θ1.

Confusion 2: The power function is properly defined for all θ1∈Θ1 only when (Θ0, Θ1) constitute a partition of Θ. This is to ensure that θ is not in a subset of Θ ignored by the comparisons since the main objective is to narrow down the unknown parameter space Θ using hypothetical values of θ. …Hypothesis testing poses questions as to whether a hypothetical value θ0 is close enough to θ in the sense that the difference (θ – θ0) is ‘statistically negligible’; a notion defined using error probabilities.

Confusion 3: Hacking (1965) raised the problem of using pre-data error probabilities, such as the significance level α and power, to evaluate the testing results post-data. As mentioned above, the post-data severity aims to address that very problem, and is extensively discussed in Mayo (2018), excursion 5.

Confusion 4: Mayo and Spanos (2006) define “attained power” by replacing cα with the observed d(x0). But this should not be confused with replacing θ1 with its observed estimate [e.g., x̄n], as in what is often called “observed” or “retrospective” power. To compare the two in example 2, contrast:

Attained power: POW(µ1) = Pr(d(X) > d(x0); µ = µ1), for all µ1 > µ0,

with what Mayo calls Shpower, which is defined at µ = x̄n:

Shpower: POW(x̄n) = Pr(d(X) > d(x0); µ = x̄n).

Shpower makes very little statistical sense, unless point estimation justifies the inferential claim x̄n ≅ µ, which it does not, as argued above. Unfortunately, the statistical literature in psychology is permeated with (implicitly) invoking such a claim when touting the merits of estimation-based effect sizes. The estimate x̄n represents just a single value of X̄n ∼ N(µ, σ²/n), and any inference pertaining to µ needs to take into account the uncertainty described by this sampling distribution; hence the call for using interval estimation and hypothesis testing to account for that sampling uncertainty. The post-data severity evaluation addresses this problem using hypothetical reasoning and taking into account the relevant statistical context (11). It outputs the discrepancy from H0 warranted by test Τα and data x0, with high enough severity, say bigger than .85. Invariably, inferential claims of the form µ ≷ µ1 = x̄n are assigned low severity of .5.
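The contrast between attained power and Shpower can be made concrete with a short sketch for the same one-sided Normal test (hypothetical settings: µ0 = 0, σ = 1, n = 100). The point it illustrates is that evaluating power at µ = x̄n collapses to a fixed, uninformative value.

```python
import math

def phi(z):
    """Standard Normal CDF."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def attained_power(mu1, xbar, mu0=0.0, sigma=1.0, n=100):
    """Attained power Pr(d(X) > d(x0); mu = mu1), i.e. power with the
    observed d(x0) in place of the pre-data cutoff c_alpha."""
    d0 = math.sqrt(n) * (xbar - mu0) / sigma
    delta1 = math.sqrt(n) * (mu1 - mu0) / sigma
    return 1.0 - phi(d0 - delta1)

for xbar in (0.15, 0.2, 0.4):
    shpower = attained_power(mu1=xbar, xbar=xbar)  # evaluated at mu = xbar
    print(f"xbar = {xbar}: shpower = {shpower:.2f}")
# Shpower comes out 0.50 whatever the data say, since
# Pr(d(X) > d(x0); mu = xbar) centres the statistic at its observed value.
```

By contrast, attained power evaluated over a range of hypothetical discrepancies µ1 varies with the data and the discrepancy of interest, which is why it, unlike Shpower, can carry evidential information.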

Confusion 5: Frequentist error probabilities (type I, II, coverage, p-value) are not conditional on H (H0 or H1) since θ=θ0 or θ=θ1 being ‘true or false’ do not constitute legitimate events in the context of Mθ(x); θ is an unknown constant. The clause ‘given H is true’ refers to hypothetical scenarios under which the sampling distribution of the test statistic d(X) is evaluated as in (10).

This confusion undermines the credibility of the Positive Predictive Value (PPV):

PPV = Pr(F|R) = Pr(R|F)Pr(F) / [Pr(R|F)Pr(F) + Pr(R|F̄)Pr(F̄)],

where (i) F = H0 is false, (ii) R = test rejects H0, and (iii) H0: no disease, used by Ioannidis (2005) to make his case that ‘most published research findings are false’ when PPV = Pr(F|R) < .5. His case is based on ‘guessing’ probabilities at a discipline-wide level, such as Pr(F)=.1, Pr(R|F)=.8 and Pr(R|F̄)=.15, and presuming that the last two relate to the power and significance level of a N-P test. He then proceeds to blame the wide-spread abuse of significance testing (p-hacking, multiple testing, cherry-picking, low power) for the high de facto type I error (.15). Granted, such abuses do contribute to untrustworthy evidence, but not via false positive/negative rates, since (i) and (iii) are not legitimate events in the context of Mθ(x), and thus Pr(R|F) and Pr(R|F̄) have nothing to do with the significance level and the power of a N-P test. Hence, the analogical reasoning relating the false positive and false negative rates in medical detecting devices to the type I and II error probabilities in frequentist testing is totally misplaced. These rates are established by the manufacturers of medical devices after running a very large number (say, 10000) of medical ‘tests’ with specimens that are known to be positive or negative; they are prepared in a lab. Known ‘positive’ and ‘negative’ specimens constitute legitimate observable events one can condition upon. In contrast, frequentist error probabilities (i) are framed in terms of θ (which are not observable events in Mθ(x)) and (ii) depend crucially on the particular statistical context (11); there is no statistical context for the false positive and false negative rates.
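The PPV arithmetic itself is just Bayes’ rule over the guessed discipline-wide rates; a minimal sketch using the figures quoted from the review (Pr(F)=.1, Pr(R|F)=.8, Pr(R|not-F)=.15, all assumed, not estimated from any model):

```python
def ppv(prior_f, rej_given_f, rej_given_not_f):
    """Pr(F|R) via Bayes' rule, treating Pr(R|F) and Pr(R|not-F)
    as if they were the power and size of a test -- the very
    assimilation the review disputes."""
    num = rej_given_f * prior_f
    return num / (num + rej_given_not_f * (1.0 - prior_f))

value = ppv(0.1, 0.8, 0.15)
print(f"PPV = {value:.3f}")  # prints PPV = 0.372, i.e. below .5
```

The computation is well defined as soon as F and R are treated as genuine events with base rates, which is precisely the step Spanos argues is illegitimate for frequentist error probabilities framed in terms of the unknown constant θ.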

A stronger case can be made that abuses and misinterpretations of frequentist testing are only symptomatic of a more extensive problem: the recipe-like/uninformed implementation of statistical methods. This contributes in many different ways to untrustworthy evidence, including: (i) statistical misspecification (imposing invalid assumptions on one’s data), (ii) poor implementation of inference methods (insufficient understanding of their assumptions and limitations), and (iii) unwarranted evidential interpretations of their inferential results (misinterpreting p-values and CIs, etc.).

Mayo uses the concept of a post-data severity evaluation to illuminate the above mentioned issues and explain how it can also provide the missing evidential interpretation of testing results. The same concept is also used to clarify numerous misinterpretations of the p-value throughout this book, as well as the fallacies:
(a) Fallacy of acceptance (non-rejection). No evidence against H0 is misinterpreted as evidence for it. This fallacy can easily arise when the power of a test is low (e.g. small n problem) in detecting sizeable discrepancies.
(b) Fallacy of rejection. Evidence against H0 is misinterpreted as evidence for a particular H1. This fallacy can easily arise when the power of a test is very high (e.g. large n problem) and it detects trivial discrepancies; see Mayo and Spanos (2006).
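The fallacy of rejection in its large-n form can be shown numerically: holding a substantively trivial observed discrepancy fixed while letting n grow makes the p-value arbitrarily small. The numbers below (µ0 = 0, σ = 1, observed x̄ = 0.02) are hypothetical, chosen only to exhibit the effect.

```python
import math

def p_value(xbar, mu0=0.0, sigma=1.0, n=100):
    """One-sided p-value for H0: mu <= mu0 under the simple Normal model,
    using d(x0) = sqrt(n)*(xbar - mu0)/sigma."""
    d0 = math.sqrt(n) * (xbar - mu0) / sigma
    return 1.0 - 0.5 * (1.0 + math.erf(d0 / math.sqrt(2.0)))

# The same trivial discrepancy (xbar = 0.02) becomes 'highly significant'
# once n is large enough -- the large-n version of the fallacy of rejection.
for n in (100, 10_000, 1_000_000):
    print(f"n = {n:>9}: p = {p_value(0.02, n=n):.4f}")
```

This is why, in the severity account, a rejection with a very high-powered test warrants only a small discrepancy from H0, so statistical significance alone cannot be read as substantive significance.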

In chapter 5 Mayo returns to a recurring theme throughout the book, the mathematical duality between Confidence Intervals (CIs) and hypothesis testing, with a view to call into question certain claims about the superiority of CIs over p-values. This mathematical duality derails any claims that observed CIs are less vulnerable to the large n problem and more informative than p-values. Where they differ is in terms of their inferential claims stemming from their different forms of reasoning, factual vs. hypothetical. That is, the mathematical duality does not imply inferential duality. This is demonstrated by contrasting CIs with the post-data severity evaluation.

Indeed, a case can be made that the post-data severity evaluation addresses several long-standing problems associated with frequentist testing, including the large n problem, the apparent arbitrariness of the N-P framing that allows for simple vs. simple hypotheses, say H0: µ = µ0 vs. H1: µ = µ1, the arbitrariness of the rejection thresholds, the problem of the sharp dichotomy (e.g. reject H0 at .05 but accept H0 at .0499), and distinguishing between statistical and substantive significance. It also provides a natural framework for evaluating reproducibility/replicability issues and brings out the problems associated with observed CIs and estimation-based effect sizes; see Spanos (2019).

Chapter 5 also includes a retrospective view of the disputes between Neyman and Fisher in the context of the error statistical perspective on frequentist inference, bringing out their common framing and their differences in emphasis and interpretation. The discussion also includes an interesting summary of their personal conflicts, not always motivated by statistical issues; who said the history of statistics is boring?

Chapter (excursion) 6 of Mayo (2018) raises several important foundational issues and problems pertaining to Bayesian inference, including its primary aim, subjective vs. default Bayesian priors and their interpretations, default Bayesian inference vs. the Likelihood Principle, the role of the catchall factor, the role of Bayes factors in Bayesian testing, and the relationship between Bayesian inference and error probabilities. There is also discussion about attempts by ‘default prior’ Bayesians to unify or reconcile frequentist and Bayesian accounts.

A point emphasized in this chapter pertains to model validation. Despite the fact that Bayesian statistics shares the same concept of a statistical model Mθ(x) with frequentist statistics, there is hardly any discussion of validating Mθ(x) to secure the reliability of the posterior distribution,

π(θ|x0) ∝ π(θ)·f(x0; θ), for all θ∈Θ,

upon which all Bayesian inferences are based. The exception is the indirect approach to model validation in Gelman et al. (2013) based on the posterior predictive distribution:

m(x) = ∫Θ f(x; θ)π(θ|x0)dθ. (12)

Since m(x) is parameter free, one can use it as a basis for simulating a number of replications x1, x2, …, xn to be used as indirect evidence for potential departures from the model assumptions vis-à-vis data x0, which is clearly different from frequentist M-S testing of the Mθ(x) assumptions. The reason is that m(x) is a smoothed mixture of f(x; θ) and π(θ|x0), and one has no way of attributing blame to one or the other when any departures are detected. For instance, in the case of the simple Normal model in (9), a highly skewed prior might contribute (indirectly) to departures from the Normality assumption when tested using data simulated from (12). Moreover, the ‘smoothing’ with respect to the parameters in deriving m(x) is likely to render testing departures from the IID assumptions a lot more unwieldy.

On the question posed by the title of this review, Mayo’s answer is that the error statistical framework, a refinement or extension of the original Fisher-Neyman-Pearson framing in the spirit of Peirce, provides a pertinent foundation for frequentist modeling and inference.

## 3 Conclusions

A retrospective view of Hacking (1965) reveals that its main weakness is that its perspective on statistical induction adheres too closely to the philosophy of science framing of that period, and largely ignores the formalism based on the theory of stochastic processes {Xt, t∈N} that revolves around the concept of a statistical model Mθ(x). Retrospectively, its value stems primarily from a number of very insightful arguments and comments that have survived the test of time. The three that stand out are: (i) an optimal point estimator θ̂(X) of θ does not warrant the inferential claim θ̂(x0) ≅ θ, (ii) a statistical inference is very different from a decision, and (iii) the distinction between the pre-data error probabilities and the post-data evaluation of the evidence stemming from testing results; a distinction that permeates Mayo’s (2018) book. Hacking’s change of mind on the aptness of logicism and the problems with the long run frequency is also particularly interesting. Hacking’s (1980) view of the long run frequency is almost indistinguishable from that of Cramer (1946, 332) and Neyman (1952, 27) mentioned above, or Mayo (2018), when he argues: “Probabilities conform to the usual probability axioms which have among their consequences the essential connection between individual and repeated trials, the weak law of large numbers proved by Bernoulli. Probabilities are to be thought of as theoretical properties, with a certain looseness of fit to the observed world. Part of this fit is judged by rules for testing statistical hypotheses along the lines described by Neyman and Pearson. It is a “frequency view of probability” in which probability is a dispositional property…” (Hacking, 1980, 150-151).

‘Probability as a dispositional property’ of a chance set-up alludes to the propensity interpretation of probability associated with Peirce and Popper, which is in complete agreement with the model-based frequentist interpretation; see Spanos (2019).

The main contribution of Mayo’s (2018) book is to put forward a framework and a strategy to evaluate the trustworthiness of evidence resulting from different statistical accounts. Viewing statistical inference as a form of severe testing elucidates the most widely employed arguments surrounding commonly used (and abused) statistical methods. In the severe testing account, probability arises in inference, not to measure degrees of plausibility or belief in hypotheses, but to evaluate and control how severely tested different inferential claims are. Without assuming that other statistical accounts aim for severe tests, Mayo proposes the following strategy for evaluating the trustworthiness of evidence: begin with a minimal requirement that if a test has little or no chance to detect flaws in a claim H, then H’s passing result constitutes untrustworthy evidence. Then, apply this minimal severity requirement to the various statistical accounts as well as to the proposed reforms, including estimation-based effect sizes, observed CIs and redefining statistical significance. Finding that they fail even the minimal severity requirement provides grounds to question the trustworthiness of their evidential claims. One need not reject some of these methods just because they have different aims, but because they give rise to evidence [claims] that fail the minimal severity requirement. Mayo challenges practitioners to be much clearer about their aims in particular contexts and different stages of inquiry. It is in this way that the book ingeniously links philosophical questions about the roles of probability in inference to the concerns of practitioners about coming up with trustworthy evidence across the landscape of the natural and the social sciences.

References

• Barnard, George. 1972. Review article: Logic of Statistical Inference. The British Journal for the Philosophy of Science, 23: 123-190.
• Cramer, Harald. 1946. Mathematical Methods of Statistics, Princeton: Princeton University Press.
• Fisher, Ronald A. 1922. On the Mathematical Foundations of Theoretical Statistics. Philosophical Transactions of the Royal Society A, 222(602): 309-368.
• Fisher, Ronald A. 1925. Statistical Methods for Research Workers. Edinburgh: Oliver & Boyd.
• Gelman, Andrew, John B. Carlin, Hal S. Stern, and Donald B. Rubin. 2013. Bayesian Data Analysis, 3rd ed. London: Chapman & Hall/CRC.
• Hacking, Ian. 1972. Review: Likelihood. The British Journal for the Philosophy of Science, 23(2): 132-137.
• Hacking, Ian. 1980. The Theory of Probable Inference: Neyman, Peirce and Braithwaite. In D. Mellor (ed.), Science, Belief and Behavior: Essays in Honour of R. B. Braithwaite. Cambridge: Cambridge University Press, 141-160.
• Ioannidis, John P. A. 2005. Why Most Published Research Findings Are False. PLoS medicine, 2(8): 696-701.
• Koopman, Bernard O. 1940. The Axioms and Algebra of Intuitive Probability. Annals of Mathematics, 41(2): 269-292.
• Mayo, Deborah G. 1983. An Objective Theory of Statistical Testing. Synthese, 57(3): 297-340.
• Mayo, Deborah G. 1996. Error and the Growth of Experimental Knowledge. Chicago: The University of Chicago Press.
• Mayo, Deborah G. 2018. Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars. Cambridge: Cambridge University Press.
• Mayo, Deborah G. and Aris Spanos. 2004. Methodology in Practice: Statistical Misspecification Testing. Philosophy of Science, 71(5): 1007-1025.
• Mayo, Deborah G. and Aris Spanos. 2006. Severe Testing as a Basic Concept in a Neyman–Pearson Philosophy of Induction. British Journal for the Philosophy of Science, 57(2): 323- 357.
• Mayo, Deborah G. and Aris Spanos. 2011. Error Statistics. In D. Gabbay, P. Thagard, and J. Woods (eds), Philosophy of Statistics, Handbook of Philosophy of Science. New York: Elsevier, 151-196.
• Neyman, Jerzy. 1952. Lectures and Conferences on Mathematical Statistics and Probability, 2nd ed. Washington: U.S. Department of Agriculture.
• Royall, Richard. 1997. Statistical Evidence: A Likelihood Paradigm. London: Chapman & Hall.
• Salmon, Wesley C. 1967. The Foundations of Scientific Inference. Pittsburgh: University of Pittsburgh Press.
• Spanos, Aris. 2013. A Frequentist Interpretation of Probability for Model-Based Inductive Inference. Synthese, 190(9):1555- 1585.
• Spanos, Aris. 2017. Why the Decision-Theoretic Perspective Misrepresents Frequentist Inference. In Advances in Statistical Methodologies and Their Applications to Real Problems. http://dx.doi.org/10.5772/65720, 3-28.
• Spanos, Aris. 2018. Mis-Specification Testing in Retrospect. Journal of Economic Surveys, 32(2): 541-577.
• Spanos, Aris. 2019. Probability Theory and Statistical Inference: Empirical Modeling with Observational Data, 2nd ed. Cambridge: Cambridge University Press.
• Von Mises, Richard. 1928. Probability, Statistics and Truth, 2nd ed. New York: Dover.
• Williams, David. 2001. Weighing the Odds: A Course in Probability and Statistics. Cambridge: Cambridge University Press.

## 61 Years of Cox’s (1958) Chestnut: Excerpt from Excursion 3 Tour II (Mayo 2018, CUP)


2018 marked 60 years since the famous weighing machine example from Sir David Cox (1958)[1]; it is now 61. It’s one of the “chestnuts” in the exhibits of “chestnuts and howlers” in Excursion 3 (Tour II) of my (still) new book Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (SIST, 2018). It’s especially relevant to take this up now, just before we leave 2019, for reasons that will be revealed over the next day or two. For a sneak preview of those reasons, see the “note to the reader” at the end of this post. So, let’s go back to it, with an excerpt from SIST (pp. 170-173).

## The Current State of Play in Statistical Foundations: A View From a Hot-Air Balloon

1.3

Continue to the third, and last stop of Excursion 1 Tour I of Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (2018, CUP)–Section 1.3. It would be of interest to ponder if (and how) the current state of play in the stat wars has shifted in just one year. I’ll do so in the comments. Use that space to ask me any questions.

How can a discipline, central to science and to critical thinking, have two methodologies, two logics, two approaches that frequently give substantively different answers to the same problems? … Is complacency in the face of contradiction acceptable for a central discipline of science? (Donald Fraser 2011, p. 329)

We [statisticians] are not blameless … we have not made a concerted professional effort to provide the scientific world with a unified testing methodology. (J. Berger 2003, p. 4)


## Severity: Strong vs Weak (Excursion 1 continues)

1.2

Marking one year since the appearance of my book: Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (2018, CUP), let’s continue to the second stop (1.2) of Excursion 1 Tour 1. It begins on p. 13 with a quote from statistician George Barnard. Assorted reflections will be given in the comments. Ask me any questions pertaining to the Tour.

• I shall be concerned with the foundations of the subject. But in case it should be thought that this means I am not here strongly concerned with practical applications, let me say right away that confusion about the foundations of the subject is responsible, in my opinion, for much of the misuse of the statistics that one meets in fields of application such as medicine, psychology, sociology, economics, and so forth. (George Barnard 1985, p. 2)

## How My Book Begins: Beyond Probabilism and Performance: Severity Requirement

This week marks one year since the general availability of my book: Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (2018, CUP). Here’s how it begins (Excursion 1 Tour 1 (1.1)). Material from the preface is here. I will sporadically give some “one year later” reflections in the comments. I invite readers to ask me any questions pertaining to the Tour.

The journey begins… (1.1)

I’m talking about a specific, extra type of integrity that is [beyond] not lying, but bending over backwards to show how you’re maybe wrong, that you ought to have when acting as a scientist. (Feynman 1974/1985, p. 387)

It is easy to lie with statistics. Or so the cliché goes. It is also very difficult to uncover these lies without statistical methods – at least of the right kind. Self-correcting statistical methods are needed, and, with minimal technical fanfare, that’s what I aim to illuminate. Since Darrell Huff wrote How to Lie with Statistics in 1954, ways of lying with statistics are so well worn as to have emerged in reverberating slogans:

• Association is not causation.
• Statistical significance is not substantive significance.
• No evidence of risk is not evidence of no risk.
• If you torture the data enough, they will confess.

## SIST: All Excerpts and Mementos: May 2018-July 2019 (updated)

Introduction & Overview

The Meaning of My Title: Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars* 05/19/18

Blurbs of 16 Tours: Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (SIST) 03/05/19

Excursion 1

EXCERPTS

Tour I (full proofs)

Excursion 1 Tour I: Beyond Probabilism and Performance: Severity Requirement (1.1) 09/08/18

Excursion 1 Tour I (2nd stop): Probabilism, Performance, and Probativeness (1.2) 09/11/18

Excursion 1 Tour I (3rd stop): The Current State of Play in Statistical Foundations: A View From a Hot-Air Balloon (1.3) 09/15/18

Tour II

Excursion 1 Tour II: Error Probing Tools versus Logics of Evidence-Excerpt 04/04/19

Souvenir C: A Severe Tester’s Translation Guide (Excursion 1 Tour II) 11/08/18

MEMENTOS

Tour Guide Mementos (Excursion 1 Tour II of How to Get Beyond the Statistics Wars) 10/29/18

Excursion 2

EXCERPTS

Tour I

Excursion 2: Taboos of Induction and Falsification: Tour I (first stop) 09/29/18

“It should never be true, though it is still often said, that the conclusions are no more accurate than the data on which they are based” (Keepsake by Fisher, 2.1) 10/05/18

Tour II

Excursion 2 Tour II (3rd stop): Falsification, Pseudoscience, Induction (2.3) 10/10/18

MEMENTOS

Tour Guide Mementos and Quiz 2.1 (Excursion 2 Tour I Induction and Confirmation) 11/14/18

Mementos for Excursion 2 Tour II Falsification, Pseudoscience, Induction 11/17/18

Excursion 3

EXCERPTS

Tour I

Where are Fisher, Neyman, Pearson in 1919? Opening of Excursion 3 11/30/18

Neyman-Pearson Tests: An Episode in Anglo-Polish Collaboration: Excerpt from Excursion 3 (3.2) 12/01/18

First Look at N-P Methods as Severe Tests: Water plant accident [Exhibit (i) from Excursion 3] 12/04/18

Tour II

It’s the Methods, Stupid: Excerpt from Excursion 3 Tour II (Mayo 2018, CUP) 12/11/18

60 Years of Cox’s (1958) Chestnut: Excerpt from Excursion 3 Tour II 12/29/18

Tour III

Capability and Severity: Deeper Concepts: Excerpts From Excursion 3 Tour III 12/20/18

MEMENTOS

Memento & Quiz (on SEV): Excursion 3, Tour I 12/08/18

Mementos for “It’s the Methods, Stupid!” Excursion 3 Tour II (3.4-3.6) 12/13/18

Tour Guide Mementos From Excursion 3 Tour III: Capability and Severity: Deeper Concepts 12/26/18

Excursion 4

EXCERPTS

Tour I

Excerpt from Excursion 4 Tour I: The Myth of “The Myth of Objectivity” (Mayo 2018, CUP) 12/26/18

Tour II

Excerpt from Excursion 4 Tour II: 4.4 “Do P-Values Exaggerate the Evidence?” 01/10/19
(Full Excursion 4 Tour II)

Tour III
(Full proofs of Excursion 4 Tour III)

Tour IV

Excerpt from Excursion 4 Tour IV: More Auditing: Objectivity and Model Checking 01/27/19

MEMENTOS

Mementos from Excursion 4: Blurbs of Tours I-IV 01/13/19

Excursion 5

Tour I

(Full) Excerpt: Excursion 5 Tour I — Power: Pre-data and Post-data (from “SIST: How to Get Beyond the Stat Wars”) 04/27/19

Tour II

(Full) Excerpt. Excursion 5 Tour II: How Not to Corrupt Power (Power Taboos, Retro Power, and Shpower) 06/07/19

Tour III

Deconstructing the Fisher-Neyman conflict wearing Fiducial glasses + Excerpt 5.8 from SIST 02/23/19

Excursion 6

Tour I: What Ever Happened to Bayesian Foundations?

Tour II

Excerpts: Souvenir Z: Understanding Tribal Warfare +  6.7 Farewell Keepsake from SIST + List of Souvenirs 05/04/19


## (Full) Excerpt. Excursion 5 Tour II: How Not to Corrupt Power (Power Taboos, Retro Power, and Shpower)


The concept of a test’s power is still being corrupted in the myriad ways discussed in 5.5, 5.6. I’m excerpting all of Tour II of Excursion 5, as I did with Tour I (of Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars 2018, CUP)*. Originally the two Tours comprised just one, but in finalizing corrections, I decided the two together made for too long a slog, and I split it up. Because this was done at the last minute, some of the terms in Tour II rely on their introductions in Tour I. Here’s how it starts:

5.5 Power Taboos, Retrospective Power, and Shpower

Let’s visit some of the more populous tribes who take issue with power – by which we mean ordinary power – at least its post-data uses. Power Peninsula is often avoided due to various “keep out” warnings and prohibitions, or researchers come during planning, never to return. Why do some people consider it a waste of time, if not totally taboo, to compute power once we know the data? A degree of blame must go to N-P, who emphasized the planning role of power, and only occasionally mentioned its use in determining what gets “confirmed” post-data. After all, it’s good to plan how large a boat we need for a philosophical excursion to the Lands of Overlapping Statistical Tribes, but once we’ve made it, it doesn’t matter that the boat was rather small. Or so the critic of post-data power avers. A crucial disanalogy is that with statistics, we don’t know that we’ve “made it there,” when we arrive at a statistically significant result. The statistical significance alarm goes off, but you are not able to see the underlying discrepancy that generated the alarm you hear. The problem is to make the leap from the perceived alarm to an aspect of a process, deep below the visible ocean, responsible for its having been triggered. Then it is of considerable relevance to exploit information on the capability of your test procedure to result in alarms going off (perhaps with different decibels of loudness), due to varying values of the parameter of interest. There are also objections to power analysis with insignificant results.


## Excerpts: Final Souvenir Z, Farewell Keepsake & List of Souvenirs


We’ve reached our last Tour (of SIST)*: Pragmatic and Error Statistical Bayesians (Excursion 6), marking the end of our reading with Souvenir Z, the final Souvenir, as well as the Farewell Keepsake in 6.7. Our cruise ship StatInfasST, currently here at Thebes, will be back at dock for maintenance before our next launch at the Summer Seminar in Phil Stat (July 28-Aug 11). Although it’s not my preference that new readers begin with the Farewell Keepsake (it contains a few spoilers), I’m excerpting it together with Souvenir Z (and a list of all souvenirs A – Z) here, and invite all interested readers to peer in. There’s a checklist on p. 437: if you’re in the market for a new statistical account, you’ll want to test whether it satisfies the items on the list. Have fun!

Souvenir Z: Understanding Tribal Warfare

We began this tour asking: Is there an overarching philosophy that “matches contemporary attitudes”? More important is changing attitudes. Not to encourage a switch of tribes, or even a tribal truce, but something more modest and actually achievable: to understand and get beyond the tribal warfare. To understand them, at minimum, requires grasping how the goals of probabilism differ from those of probativeness. This leads to a way of changing contemporary attitudes that is bolder and more challenging. Snapshots from the error statistical lens let you see how frequentist methods supply tools for controlling and assessing how well or poorly warranted claims are. All of the links, from data generation to modeling, to statistical inference and from there to substantive research claims, fall into place within this statistical philosophy. If this is close to being a useful way to interpret a cluster of methods, then the change in contemporary attitudes is radical: it has never been explicitly unveiled. Our journey was restricted to simple examples because those are the ones fought over in decades of statistical battles. Much more work is needed. Those grappling with applied problems are best suited to develop these ideas, and see where they may lead. I never promised, when you bought your ticket for this passage, to go beyond showing that viewing statistics as severe testing will let you get beyond the statistics wars.

6.7 Farewell Keepsake

Despite the eclecticism of statistical practice, conflicting views about the roles of probability and the nature of statistical inference – holdovers from long-standing frequentist–Bayesian battles – still simmer below the surface of today’s debates. Reluctance to reopen wounds from old battles has allowed them to fester. To assume all we need is an agreement on numbers – even if they’re measuring different things – leads to statistical schizophrenia. Rival conceptions of the nature of statistical inference show up unannounced in the problems of scientific integrity, irreproducibility, and questionable research practices, and in proposed methodological reforms. If you don’t understand the assumptions behind proposed reforms, their ramifications for statistical practice remain hidden from you.

Rival standards reflect a tension between using probability (a) to constrain the probability that a method avoids erroneously interpreting data in a series of applications (performance), and (b) to assign degrees of support, confirmation, or plausibility to hypotheses (probabilism). We set sail on our journey with an informal tool for telling what’s true about statistical inference: If little if anything has been done to rule out flaws in taking data as evidence for a claim, then that claim has not passed a severe test. From this minimal severe-testing requirement, we develop a statistical philosophy that goes beyond probabilism and performance. The goals of the severe tester (probativism) arise in contexts sufficiently different from those of probabilism that you are free to hold both, for distinct aims (Section 1.2). For statistical inference in science, it is severity we seek. A claim passes with severity only to the extent that it is subjected to, and passes, a test that it probably would have failed, if false. Viewing statistical inference as severe testing alters long-held conceptions of what’s required for an adequate account of statistical inference in science. In this view, a normative statistical epistemology – an account of what’s warranted to infer – must be:

• directly altered by biasing selection effects
• able to falsify claims statistically
• able to test statistical model assumptions
• able to block inferences that violate minimal severity

1. NHST Licenses Abuses. First, there’s the cluster of criticisms directed at an abusive NHST animal: NHSTs infer from a single P-value below an arbitrary cut-off to evidence for a research claim, and they encourage P-hacking, fishing, and other selection effects. The reply: this ignores crucial requirements set by Fisher and other founders: isolated significant results are poor evidence of a genuine effect, and statistical significance doesn’t warrant substantive (e.g., causal) inferences. Moreover, selective reporting invalidates error probabilities. Some argue significance tests are un-Popperian because the higher the sample size, the easier it is to infer one’s research hypothesis. It’s true that with a sufficiently high sample size any discrepancy from a null hypothesis has a high probability of being detected, but statistical significance does not license inferring a research claim H. Unless H’s errors have been well probed, merely finding a small P-value means H passes an extremely insevere test. No mountains out of molehills (Sections 4.3 and 5.1). Enlightened users of statistical tests have rejected the cookbook, dichotomous NHST, long lampooned: such criticisms are behind the times. When well-intentioned aims of replication research are linked to these retreads, it only hurts the cause. One doesn’t need a sharp dichotomy to identify rather lousy tests – a main goal for a severe tester. Granted, policy-making contexts may require cut-offs, as do behavioristic setups. But in those contexts, a test’s error probabilities measure overall error control, and are not generally used to assess well-testedness. Even there, users need not fall into the NHST traps (Section 2.5). While attention to banning terms is the least productive aspect of the statistics wars, since NHST is not used by Fisher or N-P, let’s give the caricature its due and drop the NHST acronym; “statistical tests” or “error statistical tests” will do.
Simple significance tests are a small part of a conglomeration of error statistical methods.

To continue reading: Excerpt Souvenir Z, Farewell Keepsake & List of Souvenirs can be found here.

*We are reading Statistical Inference as Severe Testing: How to Get beyond the Statistics Wars (2018, CUP)

***


## (full) Excerpt: Excursion 5 Tour I — Power: Pre-data and Post-data (from “SIST: How to Get Beyond the Stat Wars”)


It’s a balmy day today on Ship StatInfasST: An invigorating wind has a salutary effect on our journey. So, for the first time I’m excerpting all of Excursion 5 Tour I (proofs) of Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (2018, CUP).

A salutary effect of power analysis is that it draws one forcibly to consider the magnitude of effects. In psychology, and especially in soft psychology, under the sway of the Fisherian scheme, there has been little consciousness of how big things are. (Cohen 1990, p. 1309)

So how would you use power to consider the magnitude of effects were you drawn forcibly to do so? In with your breakfast is an exercise to get us started on today’s shore excursion.

Suppose you are reading about a statistically significant result x (just at level α) from a one-sided test T+ of the mean of a Normal distribution with IID samples and known σ: H0: μ ≤ 0 against H1: μ > 0. Underline the correct word, from the perspective of the (error statistical) philosophy, within which power is defined.

• If the test’s power to detect μ′ is very low (i.e., POW(μ′) is low), then the statistically significant x is poor/good evidence that μ > μ′.
• Were POW(μ′) reasonably high, the inference to μ > μ′ is reasonably/poorly warranted.
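To try the exercise numerically, here is a minimal sketch (mine, not from the book; the function names and the choices n = 100, σ = 1, z_α = 1.96 are illustrative). For test T+, which rejects when the sample mean exceeds c_α = z_α·σ/√n, the power against μ′ is POW(μ′) = Φ((μ′ − c_α)/(σ/√n)):

```python
from math import erf, sqrt

def Phi(z):
    """Standard Normal CDF."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def power_Tplus(mu_prime, n, sigma=1.0, z_alpha=1.96):
    """POW(mu') for T+: H0: mu <= 0 vs H1: mu > 0, known sigma.
    T+ rejects when the sample mean exceeds c_alpha = z_alpha * sigma / sqrt(n)."""
    se = sigma / sqrt(n)
    c_alpha = z_alpha * se
    return Phi((mu_prime - c_alpha) / se)

# With n = 100 and sigma = 1, the cutoff is c_alpha = 0.196:
for mu_prime in (0.05, 0.196, 0.30, 0.40):
    print(f"POW({mu_prime}) = {power_Tplus(mu_prime, 100):.3f}")
```

These work out to roughly 0.07, 0.50, 0.85 and 0.98; power at the cutoff value itself is exactly one half, and it climbs steeply for μ′ above c_α.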

## Blurbs of 16 Tours: Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (SIST)

### Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (2018, CUP)

Deborah G. Mayo

Abstract for Book

By disinterring the underlying statistical philosophies this book sets the stage for understanding and finally getting beyond today’s most pressing controversies revolving around statistical methods and irreproducible findings. Statistical Inference as Severe Testing takes the reader on a journey that provides a non-technical “how to” guide for zeroing in on the most influential arguments surrounding commonly used, and abused, statistical methods. The book sets sail with a tool for telling what’s true about statistical controversies: If little if anything has been done to rule out flaws in taking data as evidence for a claim, then that claim has not passed a stringent or severe test. In the severe testing account, probability arises in inference, not to measure degrees of plausibility or belief in hypotheses, but to assess and control how severely tested claims are. Viewing statistical inference as severe testing supplies novel solutions to problems of induction, falsification and demarcating science from pseudoscience, and serves as the linchpin for understanding and getting beyond the statistics wars. The book links philosophical questions about the roles of probability in inference to the concerns of practitioners in psychology, medicine, biology, economics, physics and across the landscape of the natural and social sciences.

Keywords for book:

severe testing, Bayesian and frequentist debates, philosophy of statistics, significance testing controversy, statistics wars, replication crisis, statistical inference, error statistics, philosophy and history of Neyman, Pearson and Fisherian statistics, Popperian falsification


## 4.8 All Models Are False

. . . it does not seem helpful just to say that all models are wrong. The very word model implies simplification and idealization. . . . The construction of idealized representations that capture important stable aspects of such systems is, however, a vital part of general scientific analysis. (Cox 1995, p. 456)

A popular slogan in statistics and elsewhere is “all models are false!” Is this true? What can it mean to attribute a truth value to a model? Clearly what is meant involves some assertion or hypothesis about the model – that it correctly or incorrectly represents some phenomenon in some respect or to some degree. Such assertions clearly can be true. As Cox observes, “the very word model implies simplification and idealization.” To declare “all models are false” by dint of their being idealizations or approximations is to stick us with one of those “all flesh is grass” trivializations (Section 4.1). So understood, it follows that all statistical models are false, but we have learned nothing about how statistical models may be used to infer true claims about problems of interest. Since the severe tester’s goal in using approximate statistical models is largely to learn where they break down, their strict falsity is a given. Yet it does make her wonder why anyone would want to place a probability assignment on their truth, unless it was 0. Today’s tour continues our journey into solving the problem of induction (Section 2.7).


## Excursion 4: Objectivity and Auditing (blurbs of Tours I – IV)


Excursion 4 Tour I: The Myth of “The Myth of Objectivity”

Blanket slogans such as “all methods are equally objective and subjective” trivialize into oblivion the problem of objectivity. Such cavalier attitudes are at odds with the moves to take back science. The goal of this tour is to identify what there is in objectivity that we won’t give up, and shouldn’t. While knowledge gaps leave room for biases and wishful thinking, we regularly come up against data that thwart our expectations and disagree with predictions we try to foist upon the world. This pushback supplies objective constraints on which our critical capacity is built. Supposing an objective method is to supply formal, mechanical rules to process data is a holdover of a discredited logical positivist philosophy. Discretion in data generation and modeling does not warrant concluding: statistical inference is a matter of subjective belief. It is one thing to talk of our models as objects of belief and quite another to maintain that our task is to model beliefs. For a severe tester, a statistical method’s objectivity requires the ability to audit an inference: check assumptions, pinpoint blame for anomalies, falsify, and directly register how biasing selection effects – hunting, multiple testing and cherry-picking – alter its error probing capacities.

Keywords

objective vs. subjective, objectivity requirements, auditing, dirty hands argument, phenomena vs. epiphenomena, logical positivism, verificationism, loss and cost functions, default Bayesians, equipoise assignments, (Bayesian) wash-out theorems, degenerating program, transparency, epistemology: internal/external distinction

Excursion 4 Tour II: Rejection Fallacies: Who’s Exaggerating What?

We begin with the Mountains out of Molehills Fallacy (large n problem): The fallacy of taking a (P-level) rejection of H0 with larger sample size as indicating greater discrepancy from H0 than with a smaller sample size. (4.3). The Jeffreys-Lindley paradox shows with large enough n, a .05 significant result can correspond to assigning H0 a high probability .95. There are family feuds as to whether this is a problem for Bayesians or frequentists! The severe tester takes account of sample size in interpreting the discrepancy indicated. A modification of confidence intervals (CIs) is required.

It is commonly charged that significance levels overstate the evidence against the null hypothesis (4.4, 4.5). What’s meant? One answer considered here is that the P-value can be smaller than a posterior probability on the null hypothesis, based on a lump prior (often .5) on a point null hypothesis. There are battles between and within tribes of Bayesians and frequentists. Some argue for lowering the P-value to bring it into line with a particular posterior. Others argue the supposed exaggeration results from an unwarranted lump prior on a wrongly formulated null. We consider how to evaluate reforms based on Bayes factor standards (4.5). Rather than dismiss criticisms of error statistical methods that assume a standard from a rival account, we give them a generous reading. Only once the minimal principle for severity is violated do we reject them. Souvenir R summarizes the severe tester’s interpretation of a rejection in a statistical significance test. At least two benchmarks are needed: reports of discrepancies (from a test hypothesis) that are, and those that are not, well indicated by the observed difference.

Keywords

significance test controversy, mountains out of molehills fallacy, large n problem, confidence intervals, P-values exaggerate evidence, Jeffreys-Lindley paradox, Bayes/Fisher disagreement, uninformative (diffuse) priors, Bayes factors, spiked priors, spike and slab, equivocating terms, severity interpretation of rejection (SIR)

Excursion 4 Tour III: Auditing: Biasing Selection Effects & Randomization

Tour III takes up Peirce’s “two rules of inductive inference”: predesignation (4.6) and randomization (4.7). The Tour opens on a court case transpiring: the CEO of a drug company is being charged with giving shareholders an overly rosy report based on post-data dredging for nominally significant benefits. Auditing a result includes checking for (i) selection effects, (ii) violations of model assumptions, and (iii) obstacles to moving from statistical to substantive claims. We hear it’s too easy to obtain small P-values, yet replication attempts find it difficult to get small P-values with preregistered results. I call this the paradox of replication. The problem isn’t P-values but failing to adjust them for cherry picking and other biasing selection effects. Adjustments by Bonferroni and false discovery rates are considered. There is a tension between popular calls for preregistering data analysis, and accounts that downplay error probabilities. Worse, in the interest of promoting a methodology that rejects error probabilities, researchers who most deserve lambasting are thrown a handy line of defense. However, data dependent searching need not be pejorative. In some cases, it can improve severity. (4.6)

Big Data cannot ignore experimental design principles. Unless we take account of the sampling distribution, it becomes difficult to justify resampling and randomization. We consider RCTs in development economics (RCT4D) and genomics. Failing to randomize microarrays is thought to have resulted in a decade lost in genomics. Granted, the rejection of error probabilities is often tied to presupposing their relevance is limited to long-run behavioristic goals, which we reject. They are essential for an epistemic goal: controlling and assessing how well or poorly tested claims are. (4.7)
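Since the Tour mentions Bonferroni adjustments and false discovery rates, here is a minimal sketch of both (my illustration, not the book’s; the function names and the example p-values are made up). Bonferroni controls the family-wise error rate; the Benjamini-Hochberg step-up procedure controls the false discovery rate and is typically less stringent:

```python
def bonferroni(pvals, alpha=0.05):
    """Reject H_i iff p_i <= alpha/m: controls the family-wise error rate."""
    m = len(pvals)
    return [p <= alpha / m for p in pvals]

def benjamini_hochberg(pvals, alpha=0.05):
    """Step-up BH procedure: reject the k smallest p-values, where k is the
    largest rank i with p_(i) <= (i/m) * alpha. Controls the FDR."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    k = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= rank * alpha / m:
            k = rank
    reject = [False] * m
    for i in order[:k]:
        reject[i] = True
    return reject

pvals = [0.001, 0.008, 0.012, 0.041, 0.20, 0.62]
print(bonferroni(pvals))          # only p <= 0.05/6 survive
print(benjamini_hochberg(pvals))  # one more rejection than Bonferroni here
```

On these p-values Bonferroni rejects two hypotheses while BH rejects three, which is the usual trade-off between the two error-control targets.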

Keywords

error probabilities and severity, predesignation, biasing selection effects, paradox of replication, capitalizing on chance, Bayes factors, batch effects, preregistration, randomization: Bayes-frequentist rationale, Bonferroni adjustment, false discovery rates, RCT4D, genome-wide association studies (GWAS)

Excursion 4 Tour IV: More Auditing: Objectivity and Model Checking

While all models are false, it’s also the case that no useful models are true. Were a model so complex as to represent data realistically, it wouldn’t be useful for finding things out. A statistical model is useful by being adequate for a problem, meaning it enables controlling and assessing if purported solutions are well or poorly probed and to what degree. We give a way to define severity in terms of solving a problem. (4.8) When it comes to testing model assumptions, many Bayesians agree with George Box (1983) that “it requires frequentist theory of significance tests” (p. 57). Tests of model assumptions, also called misspecification (M-S) tests, are thus a promising area for Bayes-frequentist collaboration. (4.9) When the model is in doubt, the likelihood principle is inapplicable or violated. We illustrate a non-parametric bootstrap resampling. It works without relying on a theoretical probability distribution, but it still has assumptions. (4.10) We turn to the M-S testing approach of econometrician Aris Spanos. (4.11) I present the high points for unearthing spurious correlations, and assumptions of linear regression, employing 7 figures. M-S tests differ importantly from model selection: the latter uses a criterion for choosing among models, but does not test their statistical assumptions. Model selection methods test fit rather than whether a model has captured the systematic information in the data.
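A minimal sketch of the nonparametric bootstrap mentioned in (4.10) (my illustration; the data values and function name are made up). No parametric distribution is assumed, yet an assumption survives: resampling with replacement treats the observations as IID:

```python
import random

def bootstrap_se_of_mean(data, B=5000, seed=1):
    """Nonparametric bootstrap: resample the data with replacement B times,
    and use the spread of the resampled means as a standard-error estimate."""
    rng = random.Random(seed)
    n = len(data)
    means = []
    for _ in range(B):
        resample = [data[rng.randrange(n)] for _ in range(n)]
        means.append(sum(resample) / n)
    grand = sum(means) / B
    var = sum((m - grand) ** 2 for m in means) / (B - 1)
    return var ** 0.5

data = [4.2, 5.1, 3.8, 6.0, 5.5, 4.9, 5.3, 4.4, 5.8, 4.6]
print(f"bootstrap SE of the mean: {bootstrap_se_of_mean(data):.3f}")
```

If the observations were dependent (say, a time series), this plain resampling scheme would misstate the standard error, which is one way the remaining assumptions bite.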

Keywords

adequacy for a problem, severity (in terms of problem solving), model testing/misspecification (M-S) tests, likelihood principle conflicts, bootstrap, resampling, Bayesian p-value, central limit theorem, nonsense regression, significance tests in model checking, probabilistic reduction, respecification



## Excerpt from Excursion 4 Tour II: 4.4 “Do P-Values Exaggerate the Evidence?”


### Excerpt from Excursion 4 Tour II*

4.4 Do P-Values Exaggerate the Evidence?

“Significance levels overstate the evidence against the null hypothesis,” is a line you may often hear. Your first question is:

What do you mean by overstating the evidence against a hypothesis?

Several (honest) answers are possible. Here is one possibility:

What I mean is that when I put a lump of prior weight π0 of 1/2 on a point null H0 (or a very small interval around it), the P-value is smaller than my Bayesian posterior probability on H0.

More generally, the “P-values exaggerate” criticism typically boils down to showing that if inference is appraised via one of the probabilisms – Bayesian posteriors, Bayes factors, or likelihood ratios – the evidence against the null (or against the null and in favor of some alternative) isn’t as big as 1 − P.
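The arithmetic behind this comparison can be sketched (my illustration, not an excerpt; the choices σ = 1, τ = 1 and the lump prior π0 = 1/2 are illustrative). With a spiked prior on the point null H0: μ = 0 and μ ~ N(0, τ²) under H1, a result just significant at the two-sided .05 level (z = 1.96) yields a posterior on H0 far above .05, and the posterior grows with n, which is the Jeffreys-Lindley effect discussed in 4.3:

```python
from math import sqrt, exp

def posterior_H0(z, n, tau=1.0, sigma=1.0, pi0=0.5):
    """Posterior probability of the point null H0: mu = 0 given a lump prior
    pi0 on H0 and mu ~ N(0, tau^2) under H1, for observed z = xbar/(sigma/sqrt(n)).
    Marginal of xbar: N(0, sigma^2/n) under H0; N(0, tau^2 + sigma^2/n) under H1."""
    s0 = sigma ** 2 / n            # variance of xbar under H0
    s1 = tau ** 2 + s0             # marginal variance of xbar under H1
    xbar2 = (z ** 2) * s0
    # Bayes factor in favor of H0: ratio of the two marginal densities at xbar
    bf01 = sqrt(s1 / s0) * exp(-0.5 * xbar2 * (1 / s0 - 1 / s1))
    return pi0 * bf01 / (pi0 * bf01 + (1 - pi0))

# The same P-value (.05) yields very different posteriors on H0 as n grows:
for n in (10, 100, 10_000, 1_000_000):
    print(f"n = {n:>9}: P(H0 | x) = {posterior_H0(1.96, n):.3f}")
```

With these illustrative choices the posterior on H0 runs from roughly .37 at n = 10 to above .99 at n = 1,000,000, all at the same P-value of .05: the sense in which the P-value is “smaller than the posterior” depends entirely on the prior setup and the sample size.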


## January Invites: Ask me questions (about SIST), Write Discussion Analyses (U-Phils)


ASK ME. Some readers say they’re not sure where to ask a question of comprehension on Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (2018, CUP) – SIST – so here’s a special post to park your questions of comprehension (to be placed in the comments) on a little over the first half of the book. That goes up to and includes Excursion 4 Tour I on “The Myth of ‘The Myth of Objectivity’”. However, I will soon post on Tour II: Rejection Fallacies: Who’s Exaggerating What? So feel free to ask questions of comprehension as far as p. 259.

All of the SIST BlogPost (Excerpts and Mementos) so far are here.


WRITE A DISCUSSION NOTE: Beginning January 16, anyone who wishes to write a discussion note (on some aspect or issue up to p. 259) is invited to do so (<750 words, longer if you wish). Send them to my error email. I will post as many as possible on this blog.

We initially called such notes “U-Phils”, as in “You do a Philosophical analysis”, which really only means it’s an analytic exercise that strives first to give the most generous interpretation to positions, and then examines them. See the general definition of a U-Phil.

Some Examples:

Mayo, Senn, and Wasserman on Gelman’s RMM** Contribution

U-Phil: A Further Comment on Gelman by Christian Hennig.

For a whole group of reader contributions, including Jim Berger on Jim Berger, see: Earlier U-Phils and Deconstructions

If you’re writing a note on objectivity, you might wish to compare and contrast Excursion 4 Tour I with a paper by Gelman and Hennig (2017): “Beyond subjective and objective in Statistics”.

These invites extend through January.

## SIST* Blog Posts: Excerpts & Mementos (to Dec 31 2018)

Surveying SIST Blog Posts So Far

Excerpts

• 05/19: The Meaning of My Title: Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars
• 09/08: Excursion 1 Tour I: Beyond Probabilism and Performance: Severity Requirement (1.1)
• 09/11: Excursion 1 Tour I (2nd stop): Probabilism, Performance, and Probativeness (1.2)
• 09/15: Excursion 1 Tour I (3rd stop): The Current State of Play in Statistical Foundations: A View From a Hot-Air Balloon (1.3)
• 09/29: Excursion 2: Taboos of Induction and Falsification: Tour I (first stop)
• 10/10: Excursion 2 Tour II (3rd stop): Falsification, Pseudoscience, Induction (2.3)
• 11/30: Where are Fisher, Neyman, Pearson in 1919? Opening of Excursion 3
• 12/01: Neyman-Pearson Tests: An Episode in Anglo-Polish Collaboration: Excerpt from Excursion 3 (3.2)
• 12/04: First Look at N-P Methods as Severe Tests: Water plant accident [Exhibit (i) from Excursion 3]
• 12/11: It’s the Methods, Stupid: Excerpt from Excursion 3 Tour II  (Mayo 2018, CUP)
• 12/20: Capability and Severity: Deeper Concepts: Excerpts From Excursion 3 Tour III
• 12/26: Excerpt from Excursion 4 Tour I: The Myth of “The Myth of Objectivity” (Mayo 2018, CUP)
• 12/29: 60 Years of Cox’s (1958) Chestnut: Excerpt from Excursion 3 Tour II.

Mementos, Keepsakes and Souvenirs

• 10/29: Tour Guide Mementos (Excursion 1 Tour II of How to Get Beyond the Statistics Wars)
• 11/8:   Souvenir C: A Severe Tester’s Translation Guide (Excursion 1 Tour II)
• 10/5:  “It should never be true, though it is still often said, that the conclusions are no more accurate than the data on which they are based” (Keepsake by Fisher, 2.1)
• 11/14: Tour Guide Mementos and Quiz 2.1 (Excursion 2 Tour I Induction and Confirmation)
• 11/17: Mementos for Excursion 2 Tour II Falsification, Pseudoscience, Induction
• 12/08: Memento & Quiz (on SEV): Excursion 3, Tour I
• 12/13: Mementos for “It’s the Methods, Stupid!” Excursion 3 Tour II (3.4-3.6)
• 12/26: Tour Guide Mementos From Excursion 3 Tour III: Capability and Severity: Deeper Concepts

## 60 Years of Cox’s (1958) Chestnut: Excerpt from Excursion 3 Tour II (Mayo 2018, CUP)


2018 marked 60 years since the famous weighing machine example from Sir David Cox (1958)[1]. It’s one of the “chestnuts” in the exhibits of “chestnuts and howlers” in Excursion 3 (Tour II) of my new book Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (SIST). It’s especially relevant to take this up now, just before we leave 2018, for reasons that will be revealed over the next day or two. So, let’s go back to it, with an excerpt from SIST (pp. 170-173).

Exhibit (vi): Two Measuring Instruments of Different Precisions. Did you hear about the frequentist who, knowing she used a scale that’s right only half the time, claimed her method of weighing is right 75% of the time?

She says, “I flipped a coin to decide whether to use a scale that’s right 100% of the time, or one that’s right only half the time, so, overall, I’m right 75% of the time.” (She wants credit because she could have used a better scale, even knowing she used a lousy one.)

Basis for the joke: An N-P test bases error probability on all possible outcomes or measurements that could have occurred in repetitions, but did not.
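The 75% figure is just the unconditional average of the two scales’ accuracies: 0.5 × 1.0 + 0.5 × 0.5 = 0.75. A small simulation (my illustration, not from SIST) shows why that number misleads once you know which scale was actually used: conditional on the scale, the accuracy is either 1.0 or 0.5, never 0.75.

```python
import random

random.seed(1)
N = 100_000
correct = 0
by_scale = {"precise": [0, 0], "imprecise": [0, 0]}  # [n correct, n total]

for _ in range(N):
    # flip a coin to choose the scale
    scale = "precise" if random.random() < 0.5 else "imprecise"
    # the precise scale is always right; the imprecise one only half the time
    right = True if scale == "precise" else (random.random() < 0.5)
    correct += right
    by_scale[scale][0] += right
    by_scale[scale][1] += 1

print(round(correct / N, 3))          # unconditional accuracy: close to 0.75
for s, (c, t) in by_scale.items():
    print(s, round(c / t, 3))         # conditional: close to 1.0 and 0.5
```

The joke lands because the unconditional 75% averages over repetitions that did not occur; once the coin has landed, the relevant error probability is the conditional one.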


## Tour I The Myth of “The Myth of Objectivity”*

Objectivity in statistics, as in science more generally, is a matter of both aims and methods. Objective science, in our view, aims to find out what is the case as regards aspects of the world [that hold] independently of our beliefs, biases and interests; thus objective methods aim for the critical control of inferences and hypotheses, constraining them by evidence and checks of error. (Cox and Mayo 2010, p. 276)

Whenever you come up against blanket slogans such as “no methods are objective” or “all methods are equally objective and subjective” it is a good guess that the problem is being trivialized into oblivion. Yes, there are judgments, disagreements, and values in any human activity, which alone makes it too trivial an observation to distinguish among very different ways that threats of bias and unwarranted inferences may be controlled. Is the objectivity–subjectivity distinction really toothless, as many will have you believe? I say no. I know it’s a meme promulgated by statistical high priests, but you agreed, did you not, to use a bit of chutzpah on this excursion? Besides, cavalier attitudes toward objectivity are at odds with even more widely endorsed grass roots movements to promote replication, reproducibility, and to come clean on a number of sources behind illicit results: multiple testing, cherry picking, failed assumptions, researcher latitude, publication bias and so on. The moves to take back science are rooted in the supposition that we can more objectively scrutinize results – even if it’s only to point out those that are BENT. The fact that these terms are used equivocally should not be taken as grounds to oust them but rather to engage in the difficult work of identifying what there is in “objectivity” that we won’t give up, and shouldn’t.

## Excursion 3 Tour III

A long-standing family feud among frequentists is between hypothesis tests and confidence intervals (CIs). In fact there’s a clear duality between the two: the parameter values within the (1 – α) CI are those that are not rejectable by the corresponding test at level α. Section 3.7 illuminates both CIs and severity by means of this duality. A key idea is arguing from the capabilities of methods to what may be inferred. CIs thereby obtain an inferential rationale (beyond performance), and several benchmarks are reported.
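The duality can be checked directly in the simplest case, a normal mean with known σ: a value μ0 lies inside the 95% CI exactly when the two-sided z-test of H0: μ = μ0 does not reject at α = 0.05. The sketch below (my illustration with made-up numbers, not an example from SIST) verifies this for several candidate values.

```python
import math
from statistics import NormalDist

# toy data: n observations, sigma assumed known so a z-test applies
x = [10.2, 9.8, 10.5, 10.1, 9.9, 10.4, 10.0, 10.3]
n, sigma, alpha = len(x), 0.5, 0.05
xbar = sum(x) / n
z_crit = NormalDist().inv_cdf(1 - alpha / 2)   # about 1.96
half = z_crit * sigma / math.sqrt(n)
ci = (xbar - half, xbar + half)                 # the (1 - alpha) CI

def rejected(mu0):
    """Two-sided z-test of H0: mu = mu0 at level alpha."""
    z = (xbar - mu0) / (sigma / math.sqrt(n))
    return abs(z) > z_crit

# duality: mu0 is inside the CI iff the test does not reject H0: mu = mu0
for mu0 in [9.5, 9.9, 10.0, 10.15, 10.6]:
    inside = ci[0] <= mu0 <= ci[1]
    assert inside == (not rejected(mu0))
```

The same inversion works in general: collecting all the non-rejected parameter values traces out the confidence region, which is why the two methods share their error probabilities.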

## Capability and Severity: Deeper Concepts: Excerpts From Excursion 3 Tour III

Deeper Concepts 3.7, 3.8

## Tour III Capability and Severity: Deeper Concepts

From the itinerary: A long-standing family feud among frequentists is between hypothesis tests and confidence intervals (CIs), but in fact there’s a clear duality between the two. The dual mission of the first stop (Section 3.7) of this tour is to illuminate both CIs and severity by means of this duality. A key idea is arguing from the capabilities of methods to what may be inferred. The severity analysis seamlessly blends testing and estimation. A typical inquiry first tests for the existence of a genuine effect and then estimates magnitudes of discrepancies, or inquires if theoretical parameter values are contained within a confidence interval. At the second stop (Section 3.8) we reopen a highly controversial matter of interpretation that is often taken as settled. It relates to statistics and the discovery of the Higgs particle – displayed in a recently opened gallery on the “Statistical Inference in Theory Testing” level of today’s museum.