Severe testing of deep learning models of cognition (i)

From time to time I hear of intriguing applications of the severe testing philosophy in fields I know very little about. An example is a recent article by cognitive psychologist Jeffrey Bowers and colleagues (2023): “On the importance of severely testing deep learning models of cognition” (abstract below). Because deep neural networks (DNNs), advanced machine learning models, seem to recognize images of objects at a similar or even better rate than humans, many researchers suppose DNNs learn to recognize objects in a way similar to humans. However, Bowers and colleagues argue that, on closer inspection, the evidence is remarkably weak, and “in order to address this problem, we argue that the philosophy of severe testing is needed”.

Deep learning models, after all, consist of millions of (largely uninterpretable) parameters. Without understanding how the black-box model moves from inputs to outputs, it is easy to see how observed correlations can arise even where the DNN output is due to a variety of factors other than using a mechanism similar to the human visual system. From the standpoint of severe testing, this is a familiar mistake. For data to provide evidence for a claim, it does not suffice that the claim agrees with the data; the method must have been capable of revealing the claim to be false, (just) if it is. Here the type of claim of interest is that a given algorithmic model uses similar features or mechanisms as humans to categorize images.[1] The problem isn’t the engineering one of getting more accurate algorithmic models; the problem is inferring C: that DNNs mimic human cognition in some sense (the authors focus on vision), even though C has not been well probed.
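
To make the concern concrete, here is a minimal toy sketch of my own (not from Bowers et al., and with made-up feature names): two classifiers that agree perfectly on ordinary stimuli, where shape and texture are correlated, while relying on entirely different features. Agreement on such stimuli could never have revealed that difference, which is exactly why it fails the severity requirement; only a probe that decouples the cues can.

```python
# Toy illustration (my own, not from the paper): two classifiers that agree
# on ordinary stimuli can still rely on entirely different features, so
# agreement alone does not severely test the claim that they share a mechanism.
import random

random.seed(0)

def shape_classifier(stimulus):
    """Stands in for a human-like strategy: classify by shape."""
    return stimulus["shape"]

def texture_classifier(stimulus):
    """Stands in for a shortcut strategy: classify by texture."""
    return stimulus["texture"]

# In "natural" stimuli the cues are correlated (cat-shaped objects have cat
# texture), so the two strategies give identical answers.
natural = [{"shape": c, "texture": c} for c in random.choices(["cat", "dog"], k=100)]

# Cue-conflict probes break the correlation: cat shape with dog texture, etc.
probes = [{"shape": "cat", "texture": "dog"},
          {"shape": "dog", "texture": "cat"}]

def agreement(data):
    return sum(shape_classifier(s) == texture_classifier(s) for s in data) / len(data)

print("agreement on natural stimuli:", agreement(natural))      # 1.0
print("agreement on cue-conflict probes:", agreement(probes))   # 0.0
```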

“Contrary to the model comparison approach that is popular in deep learning applications to cognitive/neural modeling it will be argued that the mere advantage of one model over the other in predicting domain-relevant data is wholly insufficient even as the weakest evidentiary standard”

—in particular, “weak severity”.[2] Bowers et al. argue that many celebrated demonstrations of human–DNN similarity fail even this minimal standard of evidence. They grant that “many questions arise as we attempt to unfold what severity requirements mean in practice. … current testing does not even come close to any reasonable severity requirement”. [Pursuing their questions about applying severity deserves a separate discussion.]

While the experiments are artificial, as all experiments are, they do seem to replicate known features of human vision, such as identifying objects by shape rather than texture, and a human sensitivity to relations between parts. Although similar patterns may be found in DNNs, once researchers scratch a bit below the surface using genuinely probative tests, the agreement collapses. One example concerns a disagreement in how DNNs vs. humans perceive relations between parts of objects, such as one thing being above another. An experiment goes something like this (I do not claim any expertise): humans and DNNs are trained on labeled examples and must infer what matters to the classification. In fact the classification depends on how the parts are arranged, but no rule is given. When shown a new object whose parts stand in the same relations as before, humans typically regard it as belonging to the same category. Not so, at least for humans, when the object has the same parts but with those relations altered (e.g., what was above is now below).

The DNN appears to do just as well until the relation, but not the parts, is swapped (e.g., what was above is now below). In other words, keeping the parts the same while changing the relation, the model does not change its classification; humans do. The DNN, it appears, is tracking features (the parts) that suffice for prediction during training, rather than the relational structure that humans infer as explaining the classification. For more examples and discussion (in The Behavioral and Brain Sciences), see Bowers et al. (2022).
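
To see how such a probe discriminates the two strategies, here is a rough sketch of my own (toy object encodings, not the authors’ stimuli): a model that labels objects by which parts are present keeps its classification when the relation is swapped, while a model that labels by the relation between the parts changes it, as humans do.

```python
# Toy sketch (my own simplification, not the authors' experiment). Suppose
# category "A" was defined during training by a relation: part x ABOVE part y.

def bag_of_parts_model(obj):
    """DNN-like shortcut: label by which parts are present, ignoring relations."""
    return "A" if {"x", "y"} <= set(obj["parts"]) else "B"

def relational_model(obj):
    """Human-like strategy: label by the relation holding between the parts."""
    return "A" if obj["relation"] == ("x", "above", "y") else "B"

training_item = {"parts": ["x", "y"], "relation": ("x", "above", "y")}
swapped_item  = {"parts": ["x", "y"], "relation": ("y", "above", "x")}  # same parts, relation reversed

for name, model in [("bag-of-parts", bag_of_parts_model), ("relational", relational_model)]:
    print(f"{name:12s} training: {model(training_item)}   swapped: {model(swapped_item)}")

# bag-of-parts  training: A   swapped: A   (classification unchanged)
# relational    training: A   swapped: B   (classification changes, as humans do)
```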

They ask: “Why is there so little severe testing in this domain? We argue that part of the problem lies with the peer-review system that incentivizes researchers to carry out research designed to highlight DNN-human similarities and minimize differences.” As a result, researchers are incentivized (by peer review, publication practices, and the culture of the field) to design experiments that show agreement rather than ones that seriously risk falsifying claims of DNN-human similarity.

“Indeed, reviewers and editors often claim that “negative results” — i.e., results that falsify strong claims of similarity between humans and DNNs — are not enough and that “solutions” — i.e., models that report DNN-human similarities – are needed for publishing in the top venues…”

We should not equate the use of “negative results” in this context with common “null” results in statistical significance tests that merely fail to provide evidence for an effect (or allow setting upper bounds on the effect size, based on a proper use of power). In the human/DNN studies, the sense of “negative” is closer to a falsification, or at least an undermining, of the proposed claim C that the DNNs rely on the same or similar mechanisms as humans, at least in regard to a certain process. As they put it, “’negative results’ — i.e., results that falsify strong claims of similarity between humans and DNNs.” They go on to say: “the main objective of researchers comparing DNNs to humans is to better understand the brain through DNNs. If apparent DNN-human similarities are mediated by qualitatively different systems, then the claim that DNNs are good models of brains is simply wrong.”

The relational tests show the two are tracking different things; they supply evidence to falsify claim C. In fact, I recommend they call them “falsifying results” rather than negative, since only something much weaker is warranted in the case of “null” results in standard statistical tests (with a no-effect null). Even “fixing” the DNN to match the human output does not restore the claim that the two systems are tracking the same things, so far as I understand what’s going on here. (We assume there wasn’t some underlying flaw with the apparently falsifying experiments.)

The authors stress a constructive upshot, namely “that following the principles of severe testing is likely to steer empirical deep learning approaches to brain and cognitive science onto a more constructive direction”. Encouragingly, the most recent benchmark tests seem to bear this out, though as an outsider I can only speculate. Whether this shift will become widespread remains to be seen, but it marks a welcome and interesting move toward more severe and genuinely informative testing.

[1] The claim, of course, could be something else, such as that DNNs are useful for understanding the relationships between DNNs and human cognition.

[2] Severity Requirement (weak): If data x agree with a claim C but the method was practically incapable of finding flaws with C even if they exist, then x is poor evidence for C.

Abstract of Bowers et al. (2023): Researchers studying the correspondences between Deep Neural Networks (DNNs) and humans often give little consideration to severe testing when drawing conclusions from empirical findings, and this is impeding progress in building better models of minds. We first detail what we mean by severe testing and highlight how this is especially important when working with opaque models with many free parameters that may solve a given task in multiple different ways. Second, we provide multiple examples of researchers making strong claims regarding DNN-human similarities without engaging in severe testing of their hypotheses. Third, we consider why severe testing is undervalued. We provide evidence that part of the fault lies with the review process. There is now a widespread appreciation in many areas of science that a bias for publishing positive results (among other practices) is leading to a credibility crisis, but there seems less awareness of the problem here.

Reference

Bowers, J. S., Malhotra, G., Dujmović, M., Llera Montero, M., Tsvetkov, C., Biscione, V., Puebla, G., Adolfi, F., Hummel, J. E., Heaton, R. F., Evans, B. D., Mitchell, J., & Blything, R. (2023). Deep problems with neural network models of human vision, commentary & response. The Behavioral and Brain Sciences, 46, Article e385. https://doi.org/10.1017/S0140525X22002813
