The AI/ML Wars: “explain” or test black box models?


I’ve been reading about the artificial intelligence/machine learning (AI/ML) wars revolving around the use of so-called “black-box” algorithms–too complex for humans, even their inventors, to understand. Such algorithms are increasingly used to make decisions that affect you, but if you can’t understand, or aren’t told, why a machine predicted your graduate-school readiness, or which drug a doctor should prescribe for you, etc., you’d likely be dissatisfied and want some kind of explanation. Being told the machine is highly accurate (in some predictive sense) wouldn’t suffice. A new AI field has grown up around the goal of developing (secondary) “white box” models to “explain” the workings of the (primary) black box model. Some call this explainable AI, or XAI. The black box is still used to reach predictions or decisions, but the explainable model is supposed to help explain why the output was reached. (The EU and DARPA in the U.S. have instituted broad requirements and programs for XAI.)

Surprisingly, at least to an outsider like me, there is enormous pushback against the movement to adopt or require explainable AI. “Beware Explanations From AI” declares one critic; “Stop explaining black boxes!” warns another. As is often the case in statistics wars, opponents of explainable AI disagree with each other, and the criticisms revolve around fundamental disagreements as to the nature and roles of statistical inference and modeling.

This is the first time I’m writing on this, so I’ll be grateful to hear from readers about mistakes.[1]

Breiman: The parametric vs algorithmic models battle: I remember the early gauntlet thrown down by Breiman (2001),  challenging statisticians to move away from their tendency to seek probabilistic data models to capture a hypothesized data generating mechanism underlying data (“Statistical Modeling: The Two Cultures”). Breiman describes “algorithmic modeling” this way:

“The approach is that nature produces data in a black box whose insides are complex, mysterious, and, at least, partly unknowable. What is observed is a set of x’s that go in and a subsequent set of y’s that come out. The problem is to find an algorithm f(x) such that for future x in a test set, f(x) will be a good predictor of y.

The theory in this field shifts focus from data models to the properties of algorithms. It characterizes their “strength” as predictors, convergence if they are iterative, and what gives them good predictive accuracy. The one assumption made in the theory is that the data is drawn iid from an unknown multivariate distribution.” (Breiman 2001, 205)

Breiman’s wars are sometimes put as a difference in aims–predictive accuracy vs understanding mechanisms. If the goal is prediction or classification, limited to cases similar to the data, then predictive accuracy might suffice, and clearly there have been impressive successes (e.g., speech, handwriting, facial recognition). But even in fields where the algorithmic vs data modeling war has largely been won by the former, we see a return of the demand to understand, if indeed that goal was ever abandoned. 
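Breiman’s framing can be put in a few lines of code. The sketch below is my own toy illustration (invented data, a bare nearest-neighbor rule, nothing from any real ML library): the algorithm never models the data generating mechanism; its sole report card is predictive accuracy on a held-out test set.

```python
import random

random.seed(0)

# Toy data: y = 1 when x1 + x2 > 1, plus 10% label noise. The "mechanism"
# is unknown to the algorithm; it only sees (x, y) pairs.
def draw(n):
    data = []
    for _ in range(n):
        x = (random.random(), random.random())
        y = int(x[0] + x[1] > 1.0)
        if random.random() < 0.1:   # label noise
            y = 1 - y
        data.append((x, y))
    return data

train, test = draw(200), draw(100)

# An opaque 1-nearest-neighbor predictor: no parametric data model,
# just an algorithm mapping x's that go in to y's that come out.
def predict(x):
    nearest = min(train, key=lambda t: (t[0][0] - x[0])**2 + (t[0][1] - x[1])**2)
    return nearest[1]

# The only yardstick is accuracy on held-out data.
accuracy = sum(predict(x) == y for x, y in test) / len(test)
print(f"test accuracy: {accuracy:.2f}")
```

Nothing here tells us *why* any prediction was reached, which is exactly the gap XAI proposes to fill.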

The new field of “explainable AI” (XAI): Put to one side for now that philosophers, despite volumes devoted to the topic, have never come up with an adequate account of “explanation”. We can explain specifically what goes on–and what seems wanted–here without a general account. A major problem XAI critics have is that explaining black box ML models does not reveal the elements of the primary black box model, nor even the data used to build it. By means of interactions with the primary black box model, a post hoc, supposedly humanly understandable, explanation can arise. Actual decisions are still made using the black box model, generally regarded as more reliable than the explainable model–the latter is only to help various stakeholders understand, question and ideally trust the black box while mostly replicating its predictive behavior.

Use RCTs for high risk cases. What first perked up my ears was Babic et al.’s 2021 article in Science blaring out: “Beware explanations from AI in health care”: 

“Explainable AI/ML (unlike interpretable AI/ML) offers post hoc algorithmically generated rationales of black-box predictions, which are not necessarily the actual reasons behind those predictions or related causally to them. Accordingly, the apparent advantage of explainability is a “fool’s gold” because post hoc rationalizations of a black box are unlikely to contribute to our understanding of its inner workings.” (Babic et al. 2021)

“Interpretable AI/ML” avoids being “fool’s gold,” they claim, and gives the actual reasons behind the predictions. The terms here are not always used in the same way, but the idea is that an “interpretable” model is where the original algorithmic model is already an “understandable” white box, not needing the work of XAI.[2] Although examples are scarce, it’s usually a linear regression model with clear inputs, e.g., education, grades, GRE scores, and an output, like “graduate school ready” or not. By contrast, XAI doesn’t use the original (or what I’m calling “primary”) function that actually generated the prediction, but a white box approximation that mimics it to some degree.

The white box approximation might merely tell you which factors or features seemed to weigh most heavily in the output, or what changes in input would have changed the output. It might, for a hypothetical example, reveal that an AI model that deemed a college student  “non-ready” for graduate school would have deemed her ready if only she had gotten a specified score on a standardized test like the GRE.
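To make the hypothetical concrete, here is a minimal sketch (every name, coefficient, and threshold is invented for illustration): an opaque decision function stands in for the black box, and a counterfactual “explanation” is produced by brute-force search for the smallest GRE score that flips the verdict.

```python
# A stand-in black box: the internals below are hidden from the explainer,
# which may only query inputs and observe outputs. (Invented for
# illustration; real black boxes are far more complex.)
def opaque_model(gre, gpa):
    score = 0.004 * gre + 0.35 * gpa   # hidden from the explainer
    return "ready" if score > 2.4 else "non-ready"

def counterfactual_gre(gre, gpa, max_gre=340):
    """Smallest GRE score that flips a 'non-ready' output, if any."""
    for new_gre in range(gre, max_gre + 1):
        if opaque_model(new_gre, gpa) == "ready":
            return new_gre
    return None  # no achievable GRE would change the output

verdict = opaque_model(300, 3.0)     # "non-ready" for this student
needed = counterfactual_gre(300, 3.0)
print(verdict, "-> ready at GRE", needed)
```

Note the counterfactual search only queries the black box; it reveals what change in input would change the output, not why the model weighs the GRE as it does.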

Babic et al. (2021) think that rather than explain the black box, it should just be used as is, at least when it is considered sufficiently reliable–and the stakes aren’t that high. (Examples are not given.) But in high stakes medical contexts, they aver, we should not try to explain black boxes–in the formal XAI sense–we should instead look to well-designed clinical trials on safety and effectiveness of practices and treatments.

“If explainability should not be a strict requirement for AI/ML in health care, what then? Regulators like the FDA should focus on those aspects of the AI/ML system that directly bear on its safety and effectiveness—in particular, how does it perform in the hands of its intended users? To accomplish this, regulators should place more emphasis on well-designed clinical trials, at least for some higher-risk devices, and less on whether the AI/ML system can be explained”. (Babic et al. 2021)

This, they claim, will help avoid the frequent medical reversals where treatments thought to be beneficial either don’t work or turn out to be harmful. Well-designed RCTs can get beyond the limitations of inferring from observational data, which is what AI is limited to. This takes us to design-based or model-based error statistical methods and models. 

A similar call is made in a recent Lancet article:

“Instead of requiring local explanations from a complicated AI system, we should advocate for thorough and rigorous validation of these systems across as many diverse and distinct populations as possible,…

…Despite competing explanations for how acetaminophen works, we know that it is a safe and effective pain medication because it has been extensively validated in numerous randomised controlled trials (RCTs). RCTs have historically been the gold-standard way to evaluate medical interventions, and it should be no different for AI systems.” (Ghassemi et al. 2021)

With RCTs we can run statistical significance tests and compute error probabilities associated with estimates and inferences. The dangers of biasing selection effects and confounding are blocked or controlled.
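For instance, a two-proportion z-test on trial counts yields exactly the kind of error probability an RCT licenses. The counts below are invented purely for illustration:

```python
from math import sqrt, erfc

def two_prop_ztest(s1, n1, s2, n2):
    """One-sided z-test of H0: p1 <= p2 against p1 > p2,
    where s1/n1 and s2/n2 are observed success proportions."""
    p1, p2 = s1 / n1, s2 / n2
    pooled = (s1 + s2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = 0.5 * erfc(z / sqrt(2))   # upper-tail Normal probability
    return z, p_value

# Hypothetical RCT: 60/100 successes on treatment, 45/100 on control.
z, p = two_prop_ztest(60, 100, 45, 100)
print(f"z = {z:.2f}, one-sided p = {p:.4f}")
```

It is this kind of pre-specified error-probability guarantee that purely observational, algorithmic modeling typically lacks.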

Nevertheless, the choices shouldn’t be either algorithmic models or clinical trials–which are highly restricted in use–but should include the use of testable parametric statistical models, or possibly combinations of AI models with subsequent testing. AI algorithms from observational data might serve to discover brand new risk factors, to be followed up with studies to test these hypotheses.

Use intrinsically interpretable AI models for high risk cases. An earlier critic of XAI for high stakes decisions, Cynthia Rudin, has a different view. Rudin tells us to: “Stop Explaining Black Box Machine Learning Models for High Stakes Decisions and Use Interpretable Models Instead” (2018). Recall, an “interpretable” model is where the original algorithmic model is already an understandable white box, not needing the work of XAI. 

“The lack of transparency and accountability of predictive models can have (and has already had) severe consequences; there have been cases of people incorrectly denied parole, poor bail decisions leading to the release of dangerous criminals, ML-based pollution models stating that highly polluted air was safe to breathe, and generally poor use of limited valuable resources in criminal justice, medicine, energy reliability, finance, and in other domains” (Rudin 2018)

These appear to be examples where black-box algorithms have gone wrong, not necessarily where the second-order attempt to elucidate those models led to problems. Right? Moreover, the black-box nature in many of these cases, I take it, was not due to complexity but proprietary models. So how does her recommendation to use intrinsically interpretable AI help? Perhaps black box models should never be used in high stakes decisions, but what if that’s not possible?

It’s also not clear to me how intrinsically interpretable models necessarily engender trusting or testing black boxes. Some readers will remember the big Anil Potti controversy (discussed on this blog) some years ago, and the big resulting guidebook as to how to avoid dangers of high throughput predictive models. The example concerned predicting which chemotherapy to use on breast cancer patients at Duke University and, shockingly, it was already being applied before being validated. I don’t think the Potti model would be classified as black box, but it had failed utterly to be well-validated.[3] No legitimate type 1 or 2 error probabilities could be vouchsafed. (They conveniently left out data points that didn’t fit their prediction model, along with a series of howlers. Search this blog if you’re interested.) I would have thought the validation requirements in that great big guidebook would be routine, after horror stories like the Potti case.

To be clear: I do not take any side in this battle—at least not yet. I would concur with the call by Babic et al. (2021) and Ghassemi et al. (2021) for well-designed clinical trials where feasible. But it’s not obvious that intrinsically interpretable AI models would afford an error statistical validation.

Can we severely test AI models? My econometrics colleague, Aris Spanos, in a tour de force, detailed, comparative account of different approaches to modeling, argues that algorithmic modeling amounts to an elaborate curve-fitting project which assumes but does not test its key assumption of IID data. In algorithmic modeling, Spanos remarks, “likelihood-based inference procedures are replaced by loss-function based procedures driven by mathematical approximation theory and goodness of fit measures” (Spanos 2021, 25). So it’s not surprising that they don’t quantify error rates. Spanos’ work might be said to be on Breiman’s data modeling side.

David Watson, while he takes the XAI side of the divide, recognizes these shortcomings:

“[XAI] methods do not even bother to quantify expected error rates. This makes it impossible to subject algorithmic explanations to severe tests, as is required of any scientific hypothesis”. (Watson 2020) 

Nevertheless, Watson holds out hope for remedying this. Or again in the Lancet article:

“[XAI] explanations have no performance guarantees. Indeed, the performance of explanations is rarely tested at all, and most tests that are done rely on heuristic measures rather than explicitly scoring the explanation from a human perspective.” (Ghassemi et al. 2021)

Presumably performance here is how good a job the XAI model does at mimicking the primary black box model. But even that won’t suffice to warrant trusting the XAI model, or to fix a faulty black box model.
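Fidelity in that mimicking sense can at least be estimated by querying both models on fresh inputs. A toy sketch (both the black box and the linear “mimic” are invented for illustration):

```python
import random

random.seed(1)

# Invented stand-ins: an opaque model and a white-box surrogate that
# is supposed to mimic it.
def black_box(x):
    return int(x[0] * x[1] > 0.3)               # nonlinear, "opaque"

def surrogate(x):
    return int(0.5 * x[0] + 0.5 * x[1] > 0.55)  # linear mimic

# Fidelity: the rate at which the surrogate reproduces the black box's
# outputs on fresh inputs it was not built from.
fresh = [(random.random(), random.random()) for _ in range(1000)]
agree = sum(surrogate(x) == black_box(x) for x in fresh)
fidelity = agree / len(fresh)

# A rough binomial standard error for the fidelity estimate. High
# fidelity only says the two models agree, not that either one gets
# the world right.
se = (fidelity * (1 - fidelity) / len(fresh)) ** 0.5
print(f"fidelity = {fidelity:.3f} +/- {2 * se:.3f}")
```

This is precisely why high fidelity alone can’t underwrite trust: a surrogate can faithfully mimic a badly biased black box.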

Rudin, who thinks the way to solve the problem is to use only intrinsically interpretable AI models, proposes:

“Let us consider a possible mandate that, for certain high-stakes decisions, no black box should be deployed when there exists an interpretable model with the same level of performance.”

She appeals to the supposed simplicity of nature to argue that reliable interpretable models are generally available. But in any event, if they are available, they should be favored, says Rudin.

“If such a mandate were deployed, organizations that produce and sell black box models could then be held accountable if an equally accurate transparent model exists. It could be considered a form of false advertising …”

This is an interesting proposal. How accuracy and reliability are to be shown needs to be spelled out.[4] Why not also test the AI model as compared to a non-AI model, as with the well-designed clinical trials some advocate, or with validated parametric models? Has statistics moved too far to the algorithmic modeling side?

I’ve said little about the forms that XAI can take, and might come back to this another time. However, since there is ample latitude to what might be included, there’s no reason to preclude tools for testing primary AI models and contesting predictions or decisions based on them. Rather than seek a reliable mimic of the primary model, it might be better to seek techniques that enable severe probing of the black and white boxes: testing the assumptions of the algorithm and contesting decisions based on them. We might want to call this testable or probative AI or some such thing.

XAI models are used to trouble-shoot and audit primary black box models, and this would seem relevant to individuals wishing to contest AI-driven decisions as well. For example, some self-critical XAI techniques can show that an XAI model had no chance of unearthing whether the black box model was biased or unfair in some way. Thus any purported claim that it is unbiased fails to pass with severity. (Perhaps students from Fewready need higher test scores than those from Manyready to be deemed graduate school-ready, say).[5]
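One concrete form such probing might take is a group-wise error audit. In this invented sketch (the school names and records are hypothetical, echoing the Fewready/Manyready example), we compare how often genuinely ready applicants from each school are wrongly deemed non-ready:

```python
# Hypothetical audit data (all invented): for each applicant, the
# school, the model's verdict, and the eventual outcome.
records = [
    # (school, deemed_ready, actually_ready)
    ("Fewready",  False, True), ("Fewready",  False, True),
    ("Fewready",  True,  True), ("Fewready",  False, False),
    ("Manyready", True,  True), ("Manyready", True,  True),
    ("Manyready", False, True), ("Manyready", True,  False),
]

def false_negative_rate(school):
    """Among actually-ready applicants from `school`, the share the
    model wrongly deemed non-ready."""
    ready = [r for r in records if r[0] == school and r[2]]
    missed = [r for r in ready if not r[1]]
    return len(missed) / len(ready)

for school in ("Fewready", "Manyready"):
    print(school, round(false_negative_rate(school), 2))
```

A gap in these rates is the kind of finding that would let an affected individual contest the model, whether or not any XAI surrogate faithfully mimics it.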

So it might be that what’s wanted is not an XAI model that passes with severity but one that lets us critically appraise, improve, and contest black box and XAI models. I would not rule out XAI as serving this role–at least in a qualitative manner or by reintroducing parametric (probabilistic) reasoning at the XAI level.

All this is by way of sticking my neck out—I’m too much of an outsider to the AI/ML wars to really weigh in on the battles. Your constructive remarks and insights in the comments are welcome.

[1] In writing this post, I consulted with Aris Spanos, who has written, critically, on algorithmic modeling, and David Watson, who works in the field of XAI and has often discussed conceptual and philosophical problems. I acknowledge their assistance in helping me grasp what’s going on here, but I do not elaborate on their views.

[2] A word often seen is “intrinsic”. If the original or primary algorithmic model is intrinsically interpretable, it doesn’t require another algorithm to explain how it works. 

[3] While not a black box in the sense used in AI, it might well have been. It took years to sort out the coding errors, and you might say the Bayesian priors on which the model rested are black boxy. The “mutagenes” generated from signatures are also problematic. But I don’t think they use the term that way.

[4] The algorithmic modelers say things like, do you want the machine or the doctor to operate, if the former has 90% success, and the latter 80% success? It depends, for one thing, on where that “success rate” comes from. Maybe the patients with the most difficult conditions go to the human. Confronted with an anomalous case, the human can throw out the formula and avert disaster. These days we do well combining the two.

[5] For readers unfamiliar with this hypothetical example, I discuss a similar one in a number of articles. A quick sum-up is on pp. 368–9, Excursion 5 Tour II (proofs) of my book Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (2018, CUP). The context is not AI but the “diagnostic screening model” of statistical tests.

Babic, B., Gerke, S., Evgeniou, T., and Cohen, I. G. (2021). “Beware Explanations From AI in Health Care.” Science 373(6552): 284–286.

Breiman, L. (2001). “Statistical Modeling: The Two Cultures.” Statistical Science 16(3): 199–231.

Ghassemi, M., Oakden-Rayner, L., and Beam, A. L. (2021). “The False Hope of Current Approaches to Explainable Artificial Intelligence in Health Care.” Lancet Digital Health 3(11): e745–e750.

Rudin, C. (2018). “Stop Explaining Black Box Machine Learning Models for High Stakes Decisions and Use Interpretable Models Instead.”

Spanos, A. (2021). “Statistical Modeling and Inference in the Era of Data Science and Graphical Causal Modeling.” Journal of Economic Surveys.

Watson, D. (2020). “Conceptual Challenges for Interpretable Machine-Learning.”

Categories: machine learning, XAI/ML


15 thoughts on “The AI/ML Wars: “explain” or test black box models?”

  1. Paul D. VanPelt

    This reminds me of an old analogy from childhood. It had to do with letting a Fox guard a chicken house.
    The transparency of such a course is elegant: it is something one just does not do. With my limited understanding of this, I can’t see the utility of a machine which is autonomous in knowing how it works. Nor can I suss the practicality of another one’s explaining how the first functions. Suppose that second interpretive box malfunctions? How would anyone know, save perhaps the first one?

    • I see what might be called “experimental AI”, like experimental economics, to run experiments to see if people interact and predict better with simpler models, say, of apartment prices. These were preregistered reports with p-value computations.
      But of course these are distantly related to interactions with actual algorithms. But it’s interesting how things move full circle.

  2. Here’s a real-life example of the kind of problem I raise with judging “readiness” for college or graduate school employing prevalence of readiness measures in a reference class you belong to. When a major (A-level) exam couldn’t be given during the pandemic in England, they used an algorithm that had the effect of giving lower scores by weighing “the historic performance of individual schools. That had the effect of raising scores for students from private schools and those in wealthy areas and depressing scores for students from less advantaged areas.”
    “Your personal circumstances, your efforts overcoming adversity, it doesn’t matter,” one student said. “Because a person like you, a person from your background, from your socio-economic class, you aren’t expected to do well. That’s the same brush they paint you with.”
    The problem isn’t just with algorithms but with using prevalence in reference classes to evaluate specific claims, as we see in what I call the “diagnostic screening model” of tests. This relies on supposed relative frequencies of true hypotheses in the class from which your hypothesis comes, without considering the specifics of how that hypothesis has been tested. This is now being advocated in “testing” statistical hypotheses, rather than assessing how well tested individual hypotheses are.

  3. People tend to remark on twitter rather than on blogs these days. Here’s just one:

  4. From twitter:

  5. Welcome to xAI!

    The level of debate around fundamental approaches to explanations means we could really use some philosophical brainpower. (I like to think that the benefit also goes the other way — automated explanations forcing philosophers to get concrete about what they mean in a way they didn’t really have to before).

    I suspect that neither white box methods, nor post hoc explanations would pass a test of severity (admittedly, they’re all better viewed as estimation). What is often meant by white box isn’t linear regression, but decision trees or other collections of rules which, while easy to follow, do not replicate across repeat data sets. The same is true of many methods for approximating black boxes, even keeping the black box fixed.

    But in fact there isn’t a lot of clarity in exactly what explanations are intended to represent, and in what situations. Am I trying to explain the workings of this particular black box, however it was obtained? Or am I seeking to describe properties of the world that the black box is supposed to mimic?

    As a case in point: a tree or list of rules is great if what I need to provide is a recipe for making the prediction and that’s it. But if that recipe is also intended as a justification for a decision, something that would be different if given slightly different training data feels problematic.

    These are issues prior even to “how should we use these tools” and discussions of clinical trials. We should know what we are trialing.(Giving randomly generated explanations might indeed improve clinical decision making, but let’s at least understand if that’s what we are doing).

    One can, in principle, provide uncertainty quantification for explanations — although the mathematical machinery for it is only really developed in specialized instances — and there’s a fair amount of development of ways to use black box methods to test hypotheses of interest, which can then also be assessed for severity. But I don’t know any great way of representing variability over tree structures or rules.

    Of course outside of a psychological assist to human decision-making, how much explanations or human-level understanding actually matters is something I would dearly love to see philosophers take up.

    • Thanks for the welcome to AI. Be glad to give some brainpower if I can. Even estimates need to pass with severity, will ponder the rest

  6. I agree with the worth of XAI techniques regarding the analysis of predictive models.
    The emphasis on explainability and trust is based on current hype but the real value is in debugging the model.

    This book:
    Explanatory Model Analysis talks about XAI techniques from a model analysis perspective.
    Similar in spirit is Veridical Data Science by Bin Yu

    Unfortunately like many terms in the AI world more visibility is gained by more colourful terms such as interpretability or explainability.

    • Przemyslaw: I think the value of debugging the model is evidence of its value, but the question of what needs to be “fixed” is likely to differ for the developer and those affected by the algorithm. However, it seems the debugging could be developed for that purpose, as with checks that reveal bias.
      Do the books you list require knowledge of machine languages?

  7. rkenett

    Glad to see Mayo addressing these issues. The two cultures Breiman envisaged are here now and the data-driven AI modeling is overtaking the stochastic modeling.
    Some write ups and presentations on this from an applied statistics perspective are listed below:

    Click to access 667-copia.pdf

    • Ron: Thanks for your links. I read the paper on adversarial AI in insurance, and found it perplexing–which just shows how little I know about this. It is claimed that patients might falsify an image of their benign scan to make it appear malignant in order to fraudulently obtain insurance benefits? I guess I don’t get how a patient could do this, or how it would provide them with benefits. Now maybe I see how a doctor might be able to, but wouldn’t the insurance company check if corresponding treatments were given and check with the patient? Wouldn’t the patient see the record indicating the doctor reported they had a malignancy? It seems there would be several checkable actions associated with such fraud, not just an image. Unless maybe the patient, doctor and pharmacist all collude. What am I missing?
      I’ve heard of fraudulent images used in studies that misrepresent results, but this I hadn’t known.

      • rkenett

        Mayo – Yes, the answer is collusion. In Greece, healthcare is a public funded service. Treating patients is remunerated accordingly. In his acceptance speech of the 2018 ENBIS best manager award, Sotiris Bersimis, the GM of the Hellenic Organization for Health Care Services Provision, described some of the impact of digitization of broken bone claims imaging. They found the same X-ray being presented by orthopedic doctors several thousand times. XAI is taking this one level higher

        • rkenett

          I meant AAI – adversarial AI…

        • Ron: What about offering large rewards for whistleblowing on your doctor, and large punishments–including jail time–for fraud. I mean every insurance claim I know about–car accidents, home damage, fire, prescriptions–is checked to the nth degree by various assessors when there’s no AI involved.

          • rkenett

            Mayo – InsureTech is moving to automated procedures. Some insurance companies adjudicate claims in a few minutes. AAI is actually affecting many industries, especially as regards imaging AI applications. The challenge in validating applications is a huge statistical problem. In general, validating AI embedded systems is posing challenges in many areas. It certainly requires new perspectives. For an edited book on system testing see

            An interesting related paper which is actually implicitly implying severe testing is – If changing a few data points changes conclusions, the analysis has not been severely tested….
