What do I mean by “The Statistics Wars and Their Casualties”? It is the title of the workshop I have been organizing with Roman Frigg at the London School of Economics (CPNSS), which was to have happened in June. It is now the title of a forum I am zooming on Phil Stat that I hope you will want to follow. It’s time that I explain and explore some of the key facets I have in mind with this title.
The Statistics Wars, of course, refers to the wars of ideas between competing tribes of statisticians, data analysts and probabilistic modelers. Some may be surprised to learn that the field of statistics, arid and staid as it seems, has a fascinating and colorful history of philosophical debate, marked by unusual heights of passion, personality, and controversy for at least a century. Others know them all too well and regard them as “merely” philosophical, or perhaps cultural or political. That is why it seemed apt to use “wars” in the title of my Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (2018, CUP)—although I wasn’t at all sure that Cambridge would allow it. Although the wars go back for many years, my interest is in their current emergence within the “crisis of replication”, and in relation to challenges posed by the rise of Big Data, data analytics, machine learning, and bioinformatics. A variety of high-powered methods, despite impressive successes in some contexts, have often led to failed replication, irreproducibility and bias. A number of statistical “reforms” and reformulations of existing methods are welcome (preregistration, replication, calling out cookbook statistics). Others are radical and even obstruct practices known to improve on replication. With the “war” metaphor in place, it is only natural to dub these untoward consequences its “casualties”–whether intended or unintended.
Nowadays, it is not unusual for people to set out shingles, promising to give clarifying explanations of statistical significance tests, P-values, and Type 1 and 2 errors, but it seems to me that the terms are more confused than ever–including, to my dismay, by some of the “experts”. These are among the casualties I have in mind. These issues are sufficiently urgent not to wait until the coronavirus pandemic is adequately controlled in the U.S. so that I can travel abroad. So I’m inviting the workshop participants (and perhaps others) to speak at a remote forum, even if their topic isn’t the one they choose in an eventual in-person workshop. I will encourage them to draw out some of the more contrarian positions I find in their work.
Why the urgency? For one thing, these issues are increasingly being brought to bear on some very public controversies—including some that are reflected in the pandemic itself. The British Medical Journal found all prediction models fail the tests for bias delineated in their guidelines for machine learning models. Playing on Box’s famous remark, the authors declare “all clinical prediction models for covid-19 to date are wrong and none are useful.” (True, this was back in April.) Second, the “classical” statistics wars between Bayesians and frequentist error statisticians still simmer below the surface in assumptions about the very role probability plays in statistical inference. (Whenever people say ‘the issue is not about Bayesian vs frequentist’, I find, the issue turns out to be about Bayesian vs frequentist disagreements–or grows directly out of them.) Most important, what is at stake is a critical standpoint that we may be in danger of losing, if not permanently, then for a sufficiently long time to do real damage.
Of course, what counts as a welcome reform and what counts as a casualty depends on who you ask. But the entire issue is rarely even considered. We should at least point up conflicts and inconsistencies in positions being bandied about. I’m most interested in those that are self-defeating or that, however indirectly, weaken stated goals and aims. For example, there are those who call for greater replication tests of purported findings while denying we should ever use a P-value threshold, or any other threshold, in interpreting data (Wasserstein et al. 2019). How then to pinpoint failed replications? We should move away from unthinking thresholds, not just with P-values but with any other statistical quantity. However, unless you can say ahead of time that some outcomes will not be allowed to count in favor of a claim, you don’t have a test of that claim, whatever it is. If statistical consumers are unaware of assumptions behind proposed changes to standards of evidence, they can’t scrutinize the casualties that affect them (in drug treatments, personalized medicine, psychology, economics and so on). They might jump on a popular bandwagon, only to discover, too late, that important protections in interpreting data are gone. When debates become politicized, as they now often are, warranted criticism is easily blurred with irrational distrust of science. Grappling with these issues requires a mix of philosophical, conceptual, and statistical considerations. I hope that by means of this forum we can have an impact on policies about evidence that are being debated, cancelled, adopted and put into practice across the sciences.
The P-value Wars and Their Casualties
The best known of the statistics wars concern statistical significance tests. Many blame statistical significance tests for making it too easy to find impressive looking effects that do not replicate with predesignated hypotheses and tighter controls. However, the fact that it becomes difficult to replicate effects when features of the tests are tied down gives new understanding and appreciation for the role of statistical significance tests. It vindicates them. Statistical significance tests are a part of a rich set of tools “for systematically appraising and bounding the probabilities … of seriously misleading interpretations of data” (Birnbaum 1970, 1033). These are a method’s error probabilities. Accounts where probability is used to assess and control a method’s error probabilities I call error statistical. This is much more apt than “frequentist”, which fails to identify the core feature of the methodology I have in mind. This error control is nullified by biasing selection effects: cherry picking, multiple testing, selective reporting, data dredging, optional stopping and P-hacking.
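To make the last point concrete, here is a minimal simulation sketch of one biasing selection effect, optional stopping. Everything in it is an illustrative assumption rather than anything from the sources cited: the data are standard normal draws with the null hypothesis (mean zero) true, the test is a two-sided z-test with known variance, the nominal level is 0.05, and the analyst who "peeks" checks the P-value after every observation from the tenth to the two-hundredth. The function names are hypothetical.

```python
import math
import random

def z_test_p_value(sample_sum, n):
    """Two-sided P-value for H0: mean = 0, assuming known sigma = 1."""
    z = sample_sum / math.sqrt(n)
    return math.erfc(abs(z) / math.sqrt(2))

def rejects_null(rng, max_n, alpha, peek):
    """Simulate one study in which H0 is true; report whether H0 is rejected."""
    total = 0.0
    for n in range(1, max_n + 1):
        total += rng.gauss(0.0, 1.0)
        # With peeking, stop as soon as the interim result looks significant.
        if peek and n >= 10 and z_test_p_value(total, n) < alpha:
            return True
    # Without peeking (or if no interim look succeeded), test once at max_n.
    return z_test_p_value(total, max_n) < alpha

def type_one_error_rate(peek, trials=2000, max_n=200, alpha=0.05, seed=1):
    """Estimate the relative frequency of false rejections over many studies."""
    rng = random.Random(seed)
    hits = sum(rejects_null(rng, max_n, alpha, peek) for _ in range(trials))
    return hits / trials

fixed_rate = type_one_error_rate(peek=False)
peeking_rate = type_one_error_rate(peek=True)
print(f"fixed sample size: {fixed_rate:.3f}")
print(f"optional stopping: {peeking_rate:.3f}")
```

Under these assumptions the fixed-sample procedure should reject a true null at roughly the nominal rate, while the try-and-try-again procedure rejects it far more often, even though each reported P-value is below 0.05. The nominal P-value no longer reflects the actual error probability of the method.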
However, the same flexibility can occur when cherry-picked or p-hacked hypotheses enter into methods that are being promoted as alternatives to significance tests: likelihood ratios, Bayes Factors, or Bayesian updating. There is one big difference: The direct grounds to criticize inferences as flouting error statistical control are lost, unless they are supplemented with principles that are not now standard. We hear that “Bayes factors can be used in the complete absence of a sampling plan, or in situations where the analyst does not know the sampling plan that was used” (Bayarri, Benjamin, Berger, & Sellke 2016, 100). But without the sampling plan, you cannot ascertain the altered capabilities of a method to distinguish genuine from spurious effects. Put simply, the Bayesian inference conditions on the actual outcome and so would not consider error probabilities which refer to outcomes other than the one observed. Even outside of cases with biasing selection effects, the very idea that we are uninterested in how a method would behave in general, with different data, seems very strange. Perhaps practitioners are now prepared to supplement these accounts with new stipulations that can pick up on the general behavior of a method. Excellent, then they are in sync with error statistical intuitions. But they should make this clear.
Admittedly, error statisticians haven’t been clear as to the justification for caring about error probabilities, implicitly accepting that it reflects a concern with good long-run performance. Performance matters, but the justification also concerns the inference at hand.
The problem with the data dredger’s inference is not that it uses a method with poor long-run error control. It should be clear from the replication crisis that what bothers us about P-hackers and data dredgers is that they have done a poor job in the case at hand. They have found data to agree with a hypothesized effect, but they did so by means of a method that very probably would have found some such effect (or other) even if spurious. As Popper would say, the inference has passed a weak, and not a severe test. Notice what the critical reader of a registered report is doing, whether pre-data or post-data. She looks, in effect, at the sampling distribution, the probability that one or another hypothesis, stopping point, choice of grouping variables, and so on, could have led to a false positive–even without a formal error probability computation. Psychologist Daniël Lakens, our August 20 presenter, suggests that the “severity argument currently provides one of the few philosophies of science that allows for a coherent conceptual analysis of the value of preregistration” (2019, 225). Links to the video and slides of his excellent talk are below the Notes. Thus, the replication crisis has had the constructive upshot of supplying a rationale never made entirely clear by significance testers. Ironically, however, statistical significance tests are mired in more controversy than ever.
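A back-of-the-envelope calculation shows how a dredging method "very probably would have found some such effect (or other) even if spurious". The sketch below rests on simplifying assumptions of my own choosing: the dredger searches through k hypotheses, every one of them is false (all nulls true), the tests are independent, and each is run at the 0.05 level. The function name is hypothetical.

```python
def prob_some_spurious_effect(k, alpha=0.05):
    """Probability that at least one of k independent tests of true nulls
    comes out nominally significant at level alpha."""
    return 1.0 - (1.0 - alpha) ** k

for k in (1, 5, 20, 50):
    print(f"{k:>2} hypotheses searched: {prob_some_spurious_effect(k):.3f}")
```

With 20 hypotheses searched, the probability of turning up at least one nominally significant effect is about 0.64, not 0.05. This is the sense in which the reported effect has passed only a weak test: the agreement between data and claim was quite probable even if every effect searched were spurious. Real dredging rarely involves independent tests, so the figure is only indicative.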
One final remark: It’s important to see that in many contexts the “same” data can be used to erect a model or claim as well as provide a warranted test of the claim. (I put quotes around “same”, because the data are actually remodeled.) Examples include: using data to test statistical model assumptions, DNA matching, reliable estimation procedures. It may even be guaranteed that a method will output a claim or model in accordance with data. The problem is not guaranteeing agreement between data and a claim; the problem is doing so even though the claim is false or specifiably false. We shouldn’t confuse cases where we’re trying to determine if there even is a real effect that needs explaining (arguably, the key role for statistical significance tests) and cases where we have a known effect, and are seeking to explain it. In the latter case, we must use the known effect in arriving at and testing proposed explanations.
 Alexander Bird (King’s College London), Mark Burgman (Imperial College London), Daniele Fanelli (London School of Economics and Political Science), Roman Frigg (London School of Economics and Political Science), David Hand (Imperial College London), Christian Hennig (University of Bologna), Katrin Hohl (City University London), Daniël Lakens (Eindhoven University of Technology), Deborah Mayo (Virginia Tech), Richard Morey (Cardiff University), Stephen Senn (Edinburgh, Scotland), Jon Williamson (University of Kent)
 It would be interesting to collect those non-actual wars that seem aptly described as “wars”. What is it about them? The mommy wars, the culture wars. I occasionally find a new one that has the essence I have in mind, but I haven’t kept up a list. Please share examples in the comments. What is it they share? Or are they too different to be viewed as sharing an essence? I know one thing they share: they will not be won!
 I can hear some people saying, ‘you see, even the experts can’t understand them’, but I have a different theory. The fallacies have become more prevalent because of a growing tendency to interpret them through the lens of quantities that are measuring very different things, e.g., likelihood ratios or Bayes factors.
 A future statistics casualty to consider: Remember when 2013 was dubbed “the year of statistics”? This was partly to avoid being sidelined in the face of all the attention being given to the new kid on the block called “data science” or “data analytics”. The features that made data science so attractive—it gave answers quickly without all the qualifications and care of statistics—led statisticians to question if it was more of a pragmatic, business occupation, good at finding predictive patterns in data, but not a full-blown profession with principles as enjoyed in statistics. Yet, at the risk of losing resources, the field of statistics has rapidly merged in a variety of forms with data science. So I was surprised to see in the latest issue of Significance that “the problems with applied data science exist because data science currently does not constitute a profession but is instead an occupation” (Steuer 2020). The problems are claimed to be rooted in a failure to incorporate both an ethics and an epistemology of data. A philosophy of data science is needed.
 Andrew Gelman is a Bayesian who also wants to be a falsificationist. In a 2013 joint paper, he and the error statistician Cosma Shalizi say: “[W]hat we are advocating, then, is what Cox and Hinkley (1974) call ‘pure significance testing’, in which certain of the model’s implications are compared directly to the data” (Gelman and Shalizi 2013, 20). But, in recent years, Gelman has thrown in with those keen on “abandoning statistical significance”. I don’t see how his own falsificationist philosophy avoids being a casualty.
Video of D. Lakens’ presentation (3 parts):