Severe testing of deep learning models of cognition (ii)


From time to time I hear of intriguing applications of the severe testing philosophy in fields I know very little about. An example is a recent article by cognitive psychologist Jeffrey Bowers and colleagues (2023): “On the importance of severely testing deep learning models of cognition” (abstract below). Because deep neural networks (DNNs), advanced machine learning models, seem to recognize images of objects at rates similar to or even better than humans, many researchers suppose DNNs learn to recognize objects in a way similar to humans. However, Bowers and colleagues argue that, on closer inspection, the evidence is remarkably weak, and that “in order to address this problem, we argue that the philosophy of severe testing is needed”.

The problem is this. Deep learning models, after all, consist of millions of (largely uninterpretable) parameters. Without understanding how the black-box model moves from inputs to outputs, it’s easy to see how the observed correlations can arise even when the DNN’s output is due to a variety of factors other than a mechanism similar to the human visual system. From the standpoint of severe testing, this is a familiar mistake. For data to provide evidence for a claim, it does not suffice that the claim agrees with the data; the method must have been capable of revealing the claim to be false, just if it is. Here the type of claim of interest is that a given algorithmic model uses features or mechanisms similar to those humans use to categorize images.[1] The problem isn’t the engineering one of getting more accurate algorithmic models; the problem is inferring claim C, that DNNs mimic human cognition in some sense (the authors focus on vision), even though C has not been well probed.

“Contrary to the model comparison approach that is popular in deep learning applications to cognitive/neural modeling it will be argued that the mere advantage of one model over the other in predicting domain-relevant data is wholly insufficient even as the weakest evidentiary standard”

—in particular, “weak severity”.[2] Bowers et al. argue that many celebrated demonstrations of human–DNN similarity fail even this minimal standard of evidence: nothing has been done that would have found C false, even if it is. The authors grant that “many questions arise as we attempt to unfold what severity requirements mean in practice”; still, “current testing does not even come close to any reasonable severity requirement”. [Pursuing their questions about applying severity deserves a separate discussion.]

While the experiments are artificial, as all experiments are, they do seem to replicate known features of human vision, such as identifying objects by shape rather than texture, and a human sensitivity to relations between parts. Although similar patterns may be found in DNNs, once researchers scratch a bit below the surface using genuinely probative tests, the agreement collapses. One example concerns a disagreement in how DNNs and humans perceive relations between parts of objects, such as one part being above another. An experiment goes something like this (I do not claim any expertise): humans and DNNs are trained on labeled examples and must infer what matters to the classification. In fact the classification depends on how the parts are arranged, but no rule is given. When shown a new object whose parts stand in the same relations as before, humans typically regard it as belonging to the same category. Not so, at least for humans, when the object has the same parts but the relations are altered (e.g., what was above is now below).

The DNN appears to do just as well until the relation, but not the parts, is swapped (e.g., what was above is now below). Keeping the parts the same while changing the relation, in other words, the model does not change its classification; humans do. The DNN, it appears, is tracking features (the parts) that suffice for prediction during training, rather than the relational structure that humans infer as explaining the classification. For more examples and discussion (in The Behavioral and Brain Sciences), see Bowers et al. (2022).
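To make the structure of such a probe concrete, here is a minimal sketch of my own (not the authors’ stimuli, models, or code): a toy classifier is trained on two categories whose parts differ, then shown probe images with the same parts but the spatial relation swapped, and we check whether its classification budges. The tiny synthetic shapes, image sizes, and the scikit-learn classifier standing in for a DNN are all illustrative assumptions.

```python
# Toy sketch of a relation-swap probe (illustrative only; not the authors' code).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
H = W = 16  # tiny images

def part(kind):
    """A 4x4 'part'; the shapes are arbitrary placeholders."""
    p = np.zeros((4, 4))
    if kind == "square":
        p[:, :] = 1.0
    elif kind == "bar":
        p[1:3, :] = 1.0
    elif kind == "cross":
        p[1:3, :] = 1.0
        p[:, 1:3] = 1.0
    elif kind == "diag":
        np.fill_diagonal(p, 1.0)
    return p

def make_image(top, bottom):
    """Place one part above the other, with a little horizontal jitter."""
    img = np.zeros((H, W))
    j1 = rng.integers(0, W - 4)
    img[2:6, j1:j1 + 4] = part(top)        # upper part
    j2 = rng.integers(0, W - 4)
    img[10:14, j2:j2 + 4] = part(bottom)   # lower part
    return img.ravel()

# Training categories differ in which parts appear (the relation is held fixed):
# class 0 = square above bar, class 1 = cross above diag.
X = np.array([make_image("square", "bar") for _ in range(200)]
             + [make_image("cross", "diag") for _ in range(200)])
y = np.array([0] * 200 + [1] * 200)
clf = LogisticRegression(max_iter=1000).fit(X, y)

# The probe: same parts as class 0, but the spatial relation is swapped
# (bar above square). If the model's answer doesn't budge, it is tracking
# the parts; a relation-sensitive system would treat these items differently.
probe = np.array([make_image("bar", "square") for _ in range(100)])
print("mean P(class 0) on relation-swapped probes:",
      round(clf.predict_proba(probe)[:, 0].mean(), 2))
```

Whether this toy model keeps or changes its answer is beside the point; the sketch only shows the logic of a test that could have revealed a parts-only strategy, which is what the severity requirement asks for. The actual experiments, of course, use far richer stimuli and real DNNs.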

They ask: “Why is there so little severe testing in this domain? We argue that part of the problem lies with the peer-review system that incentivizes researchers to carry out research designed to highlight DNN-human similarities and minimize differences.” In other words, peer review, publication practices, and the culture of the field reward experiments designed to show agreement rather than ones that seriously risk falsifying claims of DNN-human similarity.

“Indeed, reviewers and editors often claim that “negative results” — i.e., results that falsify strong claims of similarity between humans and DNNs — are not enough and that “solutions” — i.e., models that report DNN-human similarities – are needed for publishing in the top venues…”

We should not equate “negative results” in this context with ordinary “null” results in statistical significance tests. Null results in significance tests merely fail to provide evidence for an effect (or, with a proper use of power, indicate upper bounds on the effect size). In the human/DNN studies, by contrast, the sense of “negative” is closer to a falsification, or at least a serious undermining, of the proposed claim C that DNNs rely on the same or similar mechanisms as humans with regard to a certain process. As they put it, ‘negative results’ are “results that falsify strong claims of similarity between humans and DNNs”:

“the main objective of researchers comparing DNNs to humans is to better understand the brain through DNNs. If apparent DNN-human similarities are mediated by qualitatively different systems, then the claim that DNNs are good models of brains is simply wrong.”

The relational experiments show the two are tracking different things. For that reason, I recommend calling them “falsifying results” rather than “negative” ones, since we are generally entitled to say something much weaker in the case of “null” results in standard statistical tests (with a no-effect null). Even “fixing” the DNN to match the human output does not restore the claim that the two systems are tracking the same things, so far as I understand what’s going on here. (We assume there wasn’t some underlying flaw in the apparently falsifying experiments.)
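To see the contrast in miniature, here is a small illustrative calculation of my own (invented numbers, not from the paper) of what a “null” result in a standard one-sided test can warrant: at best an upper bound on the effect size, namely one the test had high capability (severity) to detect, not a falsification of a substantive claim.

```python
# Illustrative only: what a non-significant ("null") result can warrant.
# One-sided test of H0: mu <= 0 with known sigma; numbers are invented.
from scipy.stats import norm

n, sigma = 100, 1.0
se = sigma / n ** 0.5
xbar = 0.1                      # observed, non-significant sample mean

def sev_upper(mu1):
    """Severity for the claim mu <= mu1: the probability of a result even
    larger than the one observed, were the true effect as large as mu1."""
    return 1 - norm.cdf((xbar - mu1) / se)

for mu1 in (0.1, 0.2, 0.3, 0.4):
    print(f"SEV(mu <= {mu1}): {sev_upper(mu1):.2f}")
# Only claims like "mu <= 0.3" pass with high severity here; the null result
# falsifies nothing, which is the much weaker sense of "negative" noted above.
```

The relational experiments, by contrast, supply results that tell directly against claim C, which is why “falsifying” seems the better label.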

Of most interest is that the authors stress a constructive upshot, “that following the principles of severe testing is likely to steer empirical deep learning approaches to brain and cognitive science onto a more constructive direction”. Encouragingly, the most recent benchmark tests seem to bear this out, but as an outsider I can only speculate. [I will write to the authors and report on their sense of a shift in the field.] Whether this shift will become widespread remains to be seen, but it marks a welcome and interesting move toward more severe and genuinely informative testing in these experiments.

[1] The claim, of course, could be something else, such as that DNNs are useful for understanding the relationships between DNNs and human cognition.

[2] Severity Requirement (weak): If data x agree with a claim C but the method was practically incapable of finding flaws with C even if they exist, then x is poor evidence for C.

Abstract of Bowers et al. (2023): Researchers studying the correspondences between Deep Neural Networks (DNNs) and humans often give little consideration to severe testing when drawing conclusions from empirical findings, and this is impeding progress in building better models of minds. We first detail what we mean by severe testing and highlight how this is especially important when working with opaque models with many free parameters that may solve a given task in multiple different ways. Second, we provide multiple examples of researchers making strong claims regarding DNN-human similarities without engaging in severe testing of their hypotheses. Third, we consider why severe testing is undervalued. We provide evidence that part of the fault lies with the review process. There is now a widespread appreciation in many areas of science that a bias for publishing positive results (among other practices) is leading to a credibility crisis, but there seems less awareness of the problem here.

Reference

Bowers, J. S., Malhotra, G., Dujmović, M., Llera Montero, M., Tsvetkov, C., Biscione, V., Puebla, G., Adolfi, F., Hummel, J. E., Heaton, R. F., Evans, B. D., Mitchell, J., & Blything, R. (2023). Deep problems with neural network models of human vision, commentary & response. Behavioral and Brain Sciences, 46, Article e385. https://doi.org/10.1017/S0140525X22002813

Categories: severity and deep learning models

5 thoughts on “Severe testing of deep learning models of cognition (ii)”

  1. It became clear at least 10 years ago that you could spoof image classifiers with varieties of humanly imperceptible noise – sufficient proof that they do not work the same way as humans. I hope the search for similarity sits within a recognition of that difference, and/or is part of efforts to make them more similar – and therefore more trustworthy, failing in expected ways.

    • ctwardy
      Thanks for your comment. Yes, what is the status of correcting for these errors? It seems that correcting for those could bring the classifier around in a manner not present in the cases Bowers discusses. Or would you say it’s essentially the same? I hope he will comment.

  2. Reines, Brandon Philip

    great!!!

  3. Jeffrey Bowers

    Dear Prof Mayo, thank you for your thoughtful comment on our article on severe testing in NeuroAI. You ask whether the field is starting to appreciate the importance of severe testing. Perhaps there is some progress, such as a recent Brain-Score competition in which researchers were asked to build tests (“benchmarks”) that make models fail (see https://www.youtube.com/watch?v=fYoW8TxUAco). But overall, reviewers and editors are still far more interested in publishing articles that highlight similarities between artificial neural networks (ANNs) and humans, and not much interested in publishing studies that highlight their differences. This incentivises researchers to carry out correlational studies rather than to manipulate variables in experiments that are more likely to reach different conclusions.

    For example, Biscione et al. (2025) can’t get their MindSet Vision toolkit (designed to facilitate this alternative experimental approach to NeuroAI) published: it has been rejected twice at NeurIPS and once at ICLR, with comments like “It remains unclear what we should do or not do to improve the designs of these models even after benchmarking models on these experiments”.

    Or consider the Centaur model recently published in Nature that out-predicts cognitive models in 159 out of 160 experiments. But in all 160 experiments, Centaur was assessed in terms of how well it predicted withheld data in correlational studies, without assessing the impact of a manipulation that tested a hypothesis. Accordingly, Bowers et al. (2025) manipulated materials presented to Centaur and found it can often recall 256 digits correctly in an STM digit span task (humans recall about 7 digits), and can output accurate responses in a serial reaction time task in 1ms (or indeed, at any arbitrary RT we ask it to produce). We submitted this work to Nature, and the editor rejected our commentary, writing:

    “…we have come to the editorial decision that publication of this exchange is not justified. This is because we feel that the authors have acknowledged that Centaur is not a theory, nor should it be expected to replicate in extreme situations.”

    Having a model with limited STM when modelling human STM is not an extreme expectation. We then submitted our work to eLife, and got the following response:

    “…as usually with LLM-stuff, and with benchmarks in general, having the criticism hinge on one result that could change with training data/regime is not ideal”.

    Yes, some other model might succeed, but the point is that the model published in Nature does not. And it will be hard to develop a better model of the human mind if model flaws are difficult to publish and the focus remains on predicting held out samples.

    Falsification is not much appreciated in NeuroAI. Bowers et al. (2023) provided multiple examples of reviewers and editors saying it is necessary to provide “solutions” to publish; that is, show that ANNs are like humans. Or check out my talk at NeurIPS from a few years ago where I go through many more examples of reviewers and editors rejecting falsification because they are more interested in solutions: https://neurips.cc/virtual/2022/63150

    One final example. Bowers (2025) reviews a dozen or more studies all claiming that the success of LLMs undermines the need for innate priors to explain human language acquisition. But in all cases, the authors have not carried out severe tests. They have presented the models with orders of magnitude more data, given models easier tasks to solve, given models additional information not available to the child, given models super-human capacities (e.g., STM of 100s or 1000s of words), etc., making the findings irrelevant to human language acquisition. But the articles draw the conclusion that reviewers and editors want to see (that ANNs are like humans) and are published in the top journals.

    So, there is some progress, but in my view, not much….

    Biscione et al. (2025). MindSet: Vision. A toolbox for testing DNNs on key psychological experiments. https://openreview.net/forum?id=VkPUQJaoO1

    Bowers, J. S. (2025). The successes and failures of artificial neural networks (ANNs) highlight the importance of innate linguistic priors for human language acquisition. Psychological Review.

    Bowers, J. S., Malhotra, G., Adolfi, F., Dujmović, M., Montero, M. L., Biscione, V., … & Heaton, R. F. (2023c). On the importance of severely testing deep learning models of cognition. Cognitive Systems Research, 82, 101158.

    Bowers, J. S., Puebla, G., Thorat, S., Tsetsos, K., & Ludwig, C. J. H. (2025). Centaur: A model without a theory. PsyArxiv https://doi.org/10.31234/osf.io/v9w37_v2

    • Jeffrey:

      Thank you so much for your detailed comment. This issue is of great interest to me. Is the Brain-Score competition ongoing, or was it a single competition? I will watch the video. What was the manipulation by which you tested Centaur? I need to learn more about this to understand what they were objecting to. I’d like your comment to be the basis of a guest post on this blog.
