The Meaning of My Title: Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars


Excerpts from the Preface:

The Statistics Wars: 

Today’s “statistics wars” are fascinating: They are at once ancient and up to the minute. They reflect disagreements on one of the deepest, oldest, philosophical questions: How do humans learn about the world despite threats of error due to incomplete and variable data? At the same time, they are the engine behind current controversies surrounding high-profile failures of replication in the social and biological sciences. How should the integrity of science be restored? Experts do not agree. This book pulls back the curtain on why.

Methods of statistical inference become relevant primarily when effects are neither totally swamped by noise, nor so clear cut that formal assessment of errors is relatively unimportant. Should probability enter to capture degrees of belief about claims? To measure variability? Or to ensure we won’t reach mistaken interpretations of data too often in the long run of experience? Modern statistical methods grew out of attempts to systematize doing all of these. The field has been marked by disagreements between competing tribes of frequentists and Bayesians that have been so contentious–likened in some quarters to religious and political debates–that everyone wants to believe we are long past them. We now enjoy unifications and reconciliations between rival schools, it will be said, and practitioners are eclectic, prepared to use whatever method “works.” The truth is, long-standing battles still simmer below the surface in questions about scientific trustworthiness and the relationships between Big Data-driven models and theory. The reconciliations and unifications have been revealed to have serious problems, and there’s little agreement on which to use or how to interpret them. As for eclecticism, it’s often not clear what is even meant by “works.” The presumption that all we need is an agreement on numbers–never mind if they’re measuring different things–leads to pandemonium. Let’s brush the dust off the pivotal debates, walk into the museums where we can see and hear such founders as Fisher, Neyman, Pearson, Savage and many others. This is to simultaneously zero in on the arguments between metaresearchers–those doing research on research–charged with statistical reforms.

Statistical Inference as Severe Testing:

Why are some arguing that, in today’s world of high-powered computer searches, statistical findings are mostly false? The problem is that high-powered methods can make it easy to uncover impressive-looking findings even if they are false: spurious correlations and other errors have not been severely probed. We set sail with a simple tool: If little or nothing has been done to rule out flaws in inferring a claim, then it has not passed a severe test. In the severe testing view, probability arises in scientific contexts to assess and control how capable methods are at uncovering and avoiding erroneous interpretations of data. That’s what it means to view statistical inference as severe testing. A claim is severely tested to the extent it has been subjected to and passes a test that probably would have found flaws, were they present. You may be surprised to learn that many methods advocated by experts do not stand up to severe scrutiny, and are even in tension with successful strategies for blocking or accounting for cherry-picking and selective reporting!
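To make the worry about searching concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not something taken from the book: it assumes every hypothesis searched over is a true null, that the tests are independent, and that each p-value is therefore uniform on (0, 1). The function names and the 0.05 threshold are hypothetical choices for the example. It estimates how often a search reports at least one "significant" finding when there is nothing to find, and compares that with the textbook value 1 - (1 - alpha)^k.

```python
import random


def familywise_false_positive_rate(num_tests, alpha=0.05, num_trials=20000, seed=0):
    """Estimate how often at least one of `num_tests` true-null tests has p < alpha."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(num_trials):
        # Assumption: under a true null with a continuous test statistic,
        # each p-value is Uniform(0, 1), and the tests are independent.
        smallest_p = min(rng.random() for _ in range(num_tests))
        if smallest_p < alpha:
            hits += 1
    return hits / num_trials


def analytic_rate(num_tests, alpha=0.05):
    """Probability of at least one p < alpha among independent true-null tests."""
    return 1 - (1 - alpha) ** num_tests


if __name__ == "__main__":
    for k in (1, 5, 20, 100):
        simulated = familywise_false_positive_rate(k)
        print(k, round(simulated, 3), round(analytic_rate(k), 3))
```

With 20 independent looks, the chance of at least one nominally significant result is about 0.64 even though every null is true. Reporting only the winner, as if it were the sole test run, is a case where little has been done to rule out the flaw.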

The severe testing perspective substantiates, using modern statistics, the idea Karl Popper promoted but never cashed out. The goal of highly well-tested claims differs sufficiently from that of highly probable ones that you can have your cake and eat it too: retaining both for different contexts. Claims may be “probable” (in whatever sense you choose) but terribly tested by these data. In saying we may view statistical inference as severe testing, I’m not saying statistical inference is always about formal statistical testing. The testing metaphor grows out of the idea that before we have evidence for a claim, it must have passed an analysis that could have found it flawed. The probability that a method commits an erroneous interpretation of data is an error probability. Statistical methods based on error probabilities I call error statistics. The value of error probabilities, I argue, lies not merely in controlling error in the long run, but in what they teach us about the source of the data in front of us. The concept of severe testing is sufficiently general to apply to any of the methods now in use, whether for exploration, estimation, or prediction.
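One way to see how an error probability can speak to the data in hand is a small worked case. The sketch below is an illustrative assumption, not a formula quoted from this preface: it takes a one-sided test about a Normal mean with known standard deviation, and scores the claim "mu > mu1" by the probability that the sample mean would have been no larger than the one observed, were mu equal to mu1. The function names and the numbers (sigma = 1, n = 100, observed mean 0.2) are hypothetical.

```python
import math


def normal_cdf(z):
    """Standard Normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def severity_mu_greater_than(mu1, observed_mean, sigma, n):
    """Illustrative severity for the claim mu > mu1 (assumed Normal model, known sigma).

    Computes P(sample mean <= observed_mean) when the true mean equals mu1:
    how probable a less impressive result would have been if the claim were false.
    """
    standard_error = sigma / math.sqrt(n)
    return normal_cdf((observed_mean - mu1) / standard_error)


if __name__ == "__main__":
    for mu1 in (0.0, 0.1, 0.2):
        score = severity_mu_greater_than(mu1, observed_mean=0.2, sigma=1.0, n=100)
        print(mu1, round(score, 3))
```

On these assumed numbers the same data warrant "mu > 0" well (about 0.977), "mu > 0.1" less well (about 0.841), and "mu > 0.2" poorly (0.5): a claim can be more or less well tested by the very same result.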

Getting Beyond the Statistics Wars:

Thomas Kuhn’s remark that only in the face of crisis “do scientists behave like philosophers” (1970) holds some truth in the current statistical crisis in science. Leaders of today’s programs to restore scientific integrity have their own preconceptions about the nature of evidence and inference, and about “what we really want” in learning from data. Philosophy of science can also alleviate such conceptual discomforts. Fortunately, you needn’t accept the severe testing view in order to employ it as a tool for bringing into focus the crux of all these issues. It’s a tool for excavation, and for keeping us afloat in the marshes and quicksand that often mark today’s controversies. Nevertheless, important consequences will follow once this tool is used. First, there will be a reformulation of existing tools (tests, confidence intervals and others) so as to avoid misinterpretations and abuses. The debates on statistical inference generally concern inference after a statistical model and data statements are in place, when in fact the most interesting work involves the local inferences needed to get to that point. A primary asset of error statistical methods is their contribution to designing, collecting, modeling, and learning from data. The severe testing view provides the much-needed link between a test’s error probabilities and what’s required for a warranted inference in the case at hand. Second, instead of rehearsing the same criticisms over and over again, challengers on all sides should now begin by grappling with the arguments we trace within. Knee-jerk assumptions about the superiority of one or another method will not do. Although we’ll be excavating the actual history, it’s the methods themselves that matter; they’re too important to be limited by what someone 50, 60 or 90 years ago thought, or by what today’s discussants think they thought……


Join me, then, on a series of 6 excursions and 16 tours, during which we will visit 3 leading museums of statistical science and philosophy of science, and engage with a host of tribes marked by family quarrels, peace treaties, and shifting alliances.[i]

[i] A bit of travel trivia for those who not only read to the end of prefaces, but check their footnotes… [Oops, you’ll have to check the book itself when it’s out. In the meantime, inform me of typos, errors, or other queries by e-mail or in the comments.]

*Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (CUP, July 2018)


