Severity and Adversarial Collaborations (i)

In the November/December 2025 issue of American Scientist, a group of authors (Ceci, Clark, Jussim, and Williams 2025) argue in “Teams of rivals” that “adversarial collaborations offer a rigorous way to resolve opposing scientific findings, inform key sociopolitical issues, and help repair trust in science”. With adversarial collaborations, a term coined by Daniel Kahneman (2003), teams of divergent scholars, interested in uncovering what is the case (rather than endlessly making their case), design appropriately stringent tests to understand–and perhaps even resolve–their disagreements. I am pleased to see that, in describing such tests, the authors allude to my notion of severe testing (Mayo 2018)*:

Severe testing is the related idea that the scientific community ought to accept a claim only after it surmounts rigorous tests designed to find its flaws, rather than tests optimally designed for confirmation. The strong motivation each side’s members will feel to severely test the other side’s predictions should inspire greater confidence in the collaboration’s eventual conclusions. (Ceci et al., 2025)

1. Why open science isn’t enough

Open science, while valuable, is not enough, they argue, because “open science practices do not involve implementing severe tests”. While they appreciate how open science reforms (e.g., preregistration) make it more difficult to immunize favored hypotheses, “open science…lacks safeguards against bias. … Bias can result from the typical lack of viewpoint diversity among collaborators … Importantly, bias resulting from such viewpoint ‘bubbles’ can occur even among” users of open science practices, for the simple reason that they do not require that practitioners “include a devil’s advocate who will propose alternative hypotheses, … operationalizations of core constructs …or methodologies. Nor does it demand that someone strive to falsify rather than to confirm”.

I love the concept of “viewpoint bubbles”. It aptly captures the kind of “appeals to numbers” and groupthink we often see surrounding disagreements on controversial issues.[1] If Ceci et al. (2025) are right, as I think they are, all of the attention being paid to open science should be coupled with the severe testing and adversarial collaboration movement.

“In short, unlike adversarial collaborations, open science practices do not involve implementing severe tests or selecting team members to disagree over what evidence would disconfirm a theory. Thus, though laudable, these practices do not address biases or other key shortfalls the way that adversarial collaborations can.”

2. Can there be adversarial collaborations in philosophy of statistics?

I agree with these authors, and would extend their arguments to philosophical controversies, especially those surrounding the statistical methods relied on in social science debates. Avoiding “viewpoint bubbles” would have encouraged the teams that identified statistical reforms over the last decade to engage with the issues raised by rival views. The sources of the “replication crisis” (and thus the appropriate reforms) are hotly debated: Should we blame the use of statistical significance tests? Or is the problem a scarcity of genuine effects and the neglect of “base rates”? Low power? Misinterpreting p-values? Or is it biasing selection effects, data-dredging, and cherry picking? These questions, I claim, can be answered by honest adversarial collaborations. Without answering them, we remain at sea regarding reforms. Even beyond “fixing science”, we want to know how humans learn about the world in error-prone contexts–how we avoid self-deceit when we do–and how to do it better.

Ironically, my strategy in Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (2018, CUP) is to employ severity as a meta-level concept that enables understanding, if not also resolving, some of the old and new disagreements in statistical foundations. I write: “Viewing statistical inference as severe testing lets us stand one level removed from existing accounts, where the air is a bit clearer.”

In other words, I set out the error statistical notion of severity largely to understand and clarify why advocates of rival views so often speak past each other: they hold different conceptions of evidence and of the role of probability in error-prone inference. Accounts that adhere to the severity requirement fall under what I call an “error statistical” approach. Many practitioners and philosophers even presuppose quite different notions of “good tests”. They might, for example, claim H has passed a good test so long as the data are more probable if H is true than if it’s false. This would not satisfy the minimal requirement of evidence for a severe tester. Cherry picking, p-hacking, selection effects, and multiple gaps between what is studied and what is inferred make it easy to find that data “fit” one theory H much better than another–even if little has been done to probe H’s flaws. That’s the difference between looking at a comparison of likelihoods and computing the error probability associated with the test. (Severity requires both good fit and good error probabilities.)
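
To make the contrast concrete, here is a minimal simulation sketch of my own (not from Ceci et al. or from SIST): it cherry-picks the best-looking of 20 unrelated outcomes when no genuine effects exist. The selected outcome’s nominal p-value (its “fit”) often looks impressive, yet the error probability of the search procedure itself is anything but small. The model, sample sizes, and thresholds are illustrative assumptions only.

```python
# Toy illustration (illustrative assumptions only): cherry-picking vs. the
# error probability of the overall search procedure.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_outcomes, n_per_group, n_sims = 20, 30, 5000   # all hypothetical settings

def best_nominal_p(rng):
    """Search 20 null outcomes (two groups, no true effect) and return the smallest p-value."""
    x = rng.normal(size=(n_outcomes, n_per_group))
    y = rng.normal(size=(n_outcomes, n_per_group))
    return stats.ttest_ind(x, y, axis=1).pvalue.min()

# A single "study": the best-fitting, cherry-picked outcome often looks impressive...
print("nominal p-value of the best-fitting outcome:", round(best_nominal_p(rng), 4))

# ...but the probability that the search yields at least one nominal p < 0.05
# when every null is true (the procedure's type I error) is roughly 1 - 0.95**20,
# i.e., about 0.64 -- nowhere near the advertised 0.05.
hits = sum(best_nominal_p(rng) < 0.05 for _ in range(n_sims))
print("estimated type I error of the search procedure:", hits / n_sims)
```

The selected outcome’s nominal p-value (or likelihood comparison) registers the “fit”; only the second calculation reflects how easily the procedure could produce such a fit when nothing is there, which is what a severity assessment asks about.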

3. Upshot

As the authors remark: “research can become ideologically thorny when it lacks a mechanism such as adversarial collaboration to help resolve contradictory conclusions”. Since these contradictory conclusions often turn on disagreements about method, unearthing the assumptions of rival views in philosophy of statistics is directly relevant to their project. Too often, disagreements in statistical foundations reflect the kind of endless “critiques, and rejoinders back and forth” that Kahneman and Ceci et al. (2025) rightly deem unconstructive. What’s needed isn’t another round of responses but more diagnosis. By truly understanding, not dismissing, the sources of disagreement, we learn a lot about the type of “safeguards and protocols” that are required of “a neutral referee to adjudicate internal disputes” at the heart of constructive adversarial collaborations.[2] I’m sufficiently encouraged by Ceci et al. (2025) to invite the Bayesian critics of severity (van Dongen, Sprenger, and Wagenmakers) whom I take up in Mayo (2025) to join me in identifying the sources of our disagreement–and agreement. Will they accept?[3]

The payoff is not so much to get the disputants to agree–the contexts of inquiry are sufficiently varied to admit very different methods for different goals. The real payoff is for the skeptical consumer of data-driven policy to see how flexible–or fragile–data-driven claims can be, and to judge which methods suit inferences about the policies that affect them. That kind of understanding guides people in knowing where their trust in expertise is warranted.

If research teams included scholars with opposing research agendas, the social sciences would likely see stronger inferences and more severe testing, thereby optimizing methods and producing better science. Adversarial collaborations are an idea whose time has come, and their commissioning could lead to a much-needed cultural shift in science. (Ceci et al. 2025)

*I thank my colleague, philosopher of science Joseph Pitt, for alerting me to this work.

[1] “To take one example, various controversies beset the notion of implicit bias (unconscious or unrecognized prejudice), including how strongly such bias predicts discrimination, whether measures of it differ from explicit prejudice, whether implicit trainings do more harm than good, and whether scores of 0 on the Implicit Association Test (an assessment tool for detecting belief associations and implicit bias) in fact correspond to egalitarian attitudes. Similar controversies surround topics such as the role of racism in policing; microaggressions; the relative prevalence of biases among those on the political right versus the political left; the effectiveness of gender-affirming care; the magnitude and importance of sex differences; the predictive validity of standardized achievement tests; and the validity of various low-cost interventions in academic achievement. For such politically charged issues, adversarial collaborations might also help repair ongoing declines in scientists’ credibility among the public. …” (Ceci et al. 2025)

[2] One error statistical method they describe is a multiverse meta-analysis: “the authors used a multiverse meta-analysis, which subjects data to a full range of possible analytical decisions a researcher might make, thereby testing how sensitive that data might be to various analytical choices”. How to operationalize the concepts of interest introduces perhaps the greatest latitude in deciding whether data even “fit” a claim. (A toy sketch of the multiverse idea appears after these notes.)

[3] This is not the same as calling for applying methods from rival approaches to the data (a task that I know has been undertaken at times)–although that might naturally arise. The important goal here is philosophical/methodological–what is warranted to infer (about the question of interest) from the given test? Since the adversarial collaboration in this case would start from a point where our answers disagree, the challenge would be to uncover, explain, and perhaps reduce the disagreement. I propose in Mayo (2025) that Bayes factor testers supplement their results with an error statistical severity analysis. That sort of thing could be the basis for a more nuanced interpretation of inferences.

[4] They consider “a parallel notion common in the tech sector, which has long employed ‘red team’ members in research to play devil’s advocate and to find flaws in computer programming code. Red team members function much like the opposing members of an adversarial collaboration” (Ceci et al. 2025).
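
To give footnote [2] a concrete shape, here is a toy multiverse-style sensitivity check of my own devising (not the meta-analysis the authors describe): the same comparison is rerun under every combination of a few defensible analytic choices, and the spread of results shows how choice-dependent the “finding” is. The data, choices, and cutoffs below are hypothetical.

```python
# Toy multiverse-style sensitivity sketch (all data and analytic choices hypothetical).
import itertools
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n = 200
group = rng.integers(0, 2, n)                       # hypothetical treatment indicator
outcome = 0.1 * group + rng.standard_normal(n)      # weak true effect plus noise

# A few analytic decisions a researcher might defensibly make
outlier_cutoffs = [np.inf, 3.0, 2.5]                # drop |z| above cutoff (inf = keep everything)
transforms = {
    "raw": lambda y: y,
    "winsorized": lambda y: np.clip(y, *np.quantile(y, [0.05, 0.95])),
}

results = []
for cutoff, (tname, tfun) in itertools.product(outlier_cutoffs, transforms.items()):
    y = tfun(outcome)
    z = np.abs((y - y.mean()) / y.std())
    keep = z <= cutoff
    res = stats.ttest_ind(y[keep & (group == 1)], y[keep & (group == 0)])
    results.append((cutoff, tname, round(float(res.pvalue), 3)))

# The spread of p-values across the "multiverse" shows how sensitive the
# conclusion is to choices that are rarely reported, let alone probed.
for row in results:
    print(row)
```

Nothing here reproduces the authors’ meta-analysis; it only makes vivid, in the simplest case, what “testing how sensitive that data might be to various analytical choices” amounts to.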



5 thoughts on “Severity and Adversarial Collaborations (i)”

  1. Thanks, Deborah, for this interesting post. I remember being particularly struck by Popper’s “friendly-hostile co-operation of scientists” when I first read it, but I can no longer remember in which of his works it was. Your “adversarial collaborations” would seem to fit the bill.

    I consider that the existence of different frameworks of statistical analysis is a strength and a richness, but I wonder if, in your third paragraph of Section 2, you are not being a little negative about alternatives to your own favourite.

    As regards footnote [2] and the multiverse approach to meta-analysis, this blog post by Julia Rohrer is interesting. I also recommend Sander Greenland’s paper from 20 years ago, “Multiple-Bias Modelling for Analysis of Observational Data”, in the Journal of the Royal Statistical Society, Series A: Statistics in Society.

    • Stephen:
      Thanks so much for your comment. I think I wasn’t clear enough in the section that leads you to say I am “being a little negative about alternatives to your own favourite”. I make it clear in this post and elsewhere that “the contexts of inquiry are sufficiently varied to admit very different methods for different goals”. (Recall, it’s the frequentist who is often faulted for advocating a hodgepodge, rather than a single, unified approach.) In SIST (2018), I reserve severe testing for rather special contexts:

      I say: “The severity requirement gives a minimal principle based on the fact that highly insevere tests yield bad evidence, no test (BENT). … In addition to our minimal principle for evidence, one more thing is needed, at least during the time we are engaged in this project: the goal of finding things out.
      The desire to find things out is an obvious goal; yet most of the time it is not what drives us. … Often it is entirely proper to gather information to make your case, and ignore anything that fails to support it. Only if you really desire to find out something, or to challenge so-and-so’s (“trust me”) assurances, will you be prepared to stick your (or their) neck out to conduct a genuine “conjecture and refutation” exercise.” (6-7)

      I concur with Bayesians Van Dongen, Sprenger, and Wagenmakers (VSW 2023): “many Bayesians deny that severity should matter at all in inference. They refer to the Likelihood Principle […] According to this line of response, Popper, Mayo and other defenders of severe testing are just mistaken when they believe that severity should enter the (post-experimental) assessment of a theory”. However, VSW maintain that they deliver severe testing without error probabilities, and while still staying “faithful to the principles of subjective Bayesianism” (529). (A link to their paper is https://errorstatistics.com/wp-content/uploads/2024/08/van-dongen-sprenger-wagenmaker-2022.pdf) My discussion is in Mayo 2025.
      Amplifying the puzzle, VSW claim that “the Bayesian ex-post evaluation of the evidence stays the same regardless of whether the test has been conducted in a severe or less severe fashion” (522). Moreover, they rather strongly reject the error statistical notion of severity. All I’m saying is that we’re using the same words in different ways, and that we should clear this up. Otherwise, when they promise, as they do, that severe tests can be had (via Bayes factor tests) while bypassing error probability control, people may be misled (certainly students are). Another payoff, I think, is that they wouldn’t be so quick to reject the relevance of error probabilities. That’s why I go on to invite them, in Section 3, to an adversarial collaboration on concepts and philosophies of evidence and tests.
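
      To illustrate the point with a toy example (a sketch of my own, not VSW’s analysis): in a simple normal model with H0: mu = 0 versus H1: mu ~ N(0, 1), the Bayes factor is a function of the final data alone, so it is identical whether the sample size was fixed in advance or reached by optional stopping; the error probability of the stopping procedure, by contrast, is far from negligible. All numbers and modeling choices below are illustrative assumptions.

      ```python
      # Toy sketch (illustrative assumptions, not VSW's analysis): x_i ~ N(mu, 1);
      # H0: mu = 0 versus H1: mu ~ N(0, 1). The Bayes factor depends only on the
      # final data, but the stopping rule's error probability is another matter.
      import math
      import numpy as np

      rng = np.random.default_rng(3)

      def bf10(n, xbar):
          """Bayes factor for H1 over H0 given n observations with sample mean xbar (sigma = 1 known)."""
          v0, v1 = 1.0 / n, 1.0 + 1.0 / n   # variance of xbar under H0 and marginally under H1
          return math.sqrt(v0 / v1) * math.exp(0.5 * xbar**2 * (1.0 / v0 - 1.0 / v1))

      def stops_under_h0(rng, n_max=200, threshold=3.0):
          """Sample under H0 one observation at a time; report whether BF10 ever exceeds the threshold."""
          total = 0.0
          for n in range(1, n_max + 1):
              total += rng.normal(0.0, 1.0)
              if bf10(n, total / n) > threshold:
                  return True
          return False

      # The BF attached to a given data set is the same under any sampling plan...
      x = rng.normal(0.0, 1.0, size=50)
      print("BF10 for these 50 observations:", round(bf10(50, float(x.mean())), 3))

      # ...but the probability that "sample until BF10 > 3" reports evidence against H0
      # when H0 is true is well above what any single fixed-n look would give.
      n_sims = 2000
      rate = sum(stops_under_h0(rng) for _ in range(n_sims)) / n_sims
      print("P(optional stopping reaches BF10 > 3 under H0):", rate)
      ```

      Of course, Bayesians can and do reply that such pre-data error probabilities are irrelevant to the post-data evaluation; the sketch is only meant to make vivid what the disagreement is about.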

      Nov 4: I did cut a sentence from the last para of section 2 that was redundant, making it sound too emphatic. Thanks for noticing it.

  2. Stan Young

    Deborah: Nice post. A few typos in yellow. Thoughts in green. Stan

  3. Nathan A Schachtman

    Thanks for an interesting post and pointing to the Ceci article.

    George Olah, in his 1994 Nobel Prize (chemistry) acceptance speech, noted that an effective way of addressing errors in science “is to have an enemy” who is willing to devote “a vast amount of time and brain power to ferreting out errors both large and small”. https://www.nobelprize.org/uploads/2018/06/olah-lecture.pdf

    • Nathan:
      Thank you so much for your comment. It fits perfectly with the theme, and in fact the author says “adversary” might be a better term than “enemy”. A favorite quote of mine (attributed to Bohr) is: “An expert is a person who has made all the mistakes that can be made in a very narrow field.” So adversaries, and severe error probing, can speed up the process by which we may become experts in a field.
