“Are Controversies in Statistics Relevant for Responsible AI/ML?” (My talk at an AI ethics conference) (ii)

Bayesians, frequentists and AI/ML researchers

1. Introduction

I gave a talk on March 8 at an AI, Systems, and Society Conference at the Emory Center for Ethics. The organizer, Alex Tolbert (who had been a student at Virginia Tech), suggested I speak about controversies in statistics, especially P-hacking in statistical significance testing. A question that arises led to my title:
“Are Controversies in Statistics Relevant for Responsible AI/ML?”

Since I was the last speaker, thereby being the only thing separating attendees from their next destination, I decided to give an overview in the first third of my slides. I’ve pasted the slideshare below this post. I want to discuss the main parallel that interests me between P-hacking significance tests in the two fields (sections 1 and 2), as well as some queries raised by my commentator, Ben Jantzen, and another participant Ben Recht (section 3). Let me begin with my abstract:

ABSTRACT. A central debate in statistics concerns probability’s role: is it to control error rates (frequentist) or express belief (Bayesian)? Frequentists prioritize error control, ensuring rare false rejections and strong power to detect falsity. For Bayesians, by contrast, evidence from the data resides in the likelihood ratios, ignoring error probabilities. The replication crisis fueled criticism of statistical significance tests, which are blamed for irreproducibility. Yet, selection biases—p-hacking, optional stopping—are the real issue. Ironically, Bayesian alternatives (e.g., Bayes factors) fail to guard against these biases, leading to reforms that discard crucial error control. The same data-dredged hypothesis can occur in Bayes factors, and Bayesian updating—but without direct grounds to criticize flouting of error statistical control.

AI/ML shifts the focus away from understanding data generation to optimizing prediction. But when it comes to checking whether AI/ML outputs are responsible and trustworthy, I argue, error statistical considerations become relevant. Fairness hacking and explanation (justification) hacking in XAI, like p-hacking and BF hacking, can make it easy to obtain misleading interpretations of data. Reforms from error statistics, such as preregistration and adjustment of error rates, are relevant for critically appraising them. I put forward a general requirement that is operable in both fields: A claim lacks warrant if little if anything has been done to detect its flaws—it fails a severe test.

1.1. AI/ML Auditors think like error statisticians? My talk was limited to contexts in which statistical significance tests are used in AI/ML, granting that these may be regarded as meta-level tasks. The most ordinary of these would be to test whether AI/ML models do statistically significantly better than non-AI methods, and to test whether a given algorithm A does statistically significantly better than another algorithm B. Clearly, if only the cases where A does better than B are selectively reported, then inferring the superiority of method A at a small “nominal” P-value would be misleading. The cases that really interest me, and I’ve only recently been reading about them, concern what might be dubbed “explanation hacking” in explainable AI (XAI), and “fairness hacking” in critically auditing for a method’s avoidance of discrimination.

Consider explanation hacking in “explainable AI”, or XAI. XAI involves developing understandable “white box” models to “explain” the workings of a black box model used to make decisions. So, for example, if the black box model decides you belong in the “high risk” category (for loans, crimes, etc), it is thought that you have a right to know why. The black box is still used to reach predictions or decisions, but the explainable model is supposed to help explain why the output was reached. Creators of a black box decision method have the incentive to defend their system by being prepared to give an acceptable rationale for why the decision was reached.

One method would be to use the data to find a statistically significant correlation between some characteristics of the person classified and the decision reached. (e.g., You were labelled ‘high risk’, say for recidivism, because there’s a statistically significant correlation between 2 or 3 factors you possess and committing a further crime.) (A discussion of XAI on this blog is here.) Strictly speaking, this would only be what’s called a “nominally” significant result because, by and large, the data do not come from a known data model. But what is of main interest is that by exploiting multiple checks for significance, dredging through factors and decisions, it is possible “to Explain and Justify Almost Any Decision” in this way, according to Zhou and Joachims (2023), even if the decision was really because of falling into a racial category, or was merely assigned by chance. The constructor of the decision model has the incentive to show this was not the case, and the significance seeking enables this. In criticizing such a defense of the black box decision scheme, the authors echo what I call error statistical reasoning, and the associated “minimal principle” for evidence in Slide 4.
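To see how easy such dredging makes it, here is a toy simulation sketch (my own illustration, not the authors’ procedure; the attributes, sample size, and seed are invented): even when the “high risk” label is assigned completely at random, searching through pairs of a person’s attributes will typically turn up several “nominally” significant correlates, any one of which could be offered as an explanation.

```python
import itertools

import numpy as np
from scipy.stats import chi2_contingency

rng = np.random.default_rng(0)

n_people, n_features = 500, 20
X = rng.integers(0, 2, size=(n_people, n_features))  # arbitrary binary attributes
decision = rng.integers(0, 2, size=n_people)         # "high risk" label assigned purely at random

hits = []
for i, j in itertools.combinations(range(n_features), 2):
    conj = X[:, i] & X[:, j]                          # "has attribute i AND attribute j"
    table = np.array([[np.sum((conj == a) & (decision == b)) for b in (0, 1)]
                      for a in (0, 1)])
    if table.min() == 0:                              # skip degenerate 2x2 tables
        continue
    _, p, _, _ = chi2_contingency(table)
    if p < 0.05:
        hits.append((i, j, p))

print(f"{len(hits)} of {n_features * (n_features - 1) // 2} dredged attribute pairs "
      f"are 'nominally' significant at the 0.05 level")
if hits:
    i, j, p = min(hits, key=lambda t: t[2])
    print(f"readiest 'explanation': attributes {i} and {j}, p = {p:.4f}")
```

With 190 pairs searched at the 0.05 level, roughly 5% will come out “significant” by chance alone, so a ready-made rationale is virtually guaranteed.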

Something similar happens in “Fairness Hacking” (Meding and Hagendorff 2024). Dredging different sensitive attributes (e.g., age, race, sex), or different fairness metrics, can make it easy either to find apparently statistically significant differences (that don’t exist) or not find them (when they exist). It seems that the idea of a data generating model is introduced at the meta-level of auditing these AI/ML methods. In this case the authors construct synthetic data to satisfy a data model and propose a Bonferroni correction for selection. I don’t know whether such adjustments can operate here, but post-data selection is a growing research area of its own. Recognizing the analogy with P-hacking might stimulate work to connect these efforts. Perhaps that’s already under way. (I did not learn more about these two types of hacking at the conference; if any readers can update me about work in this area, please let me know in the comments.)
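To convey the flavor of such an adjustment, here is a hedged toy sketch of a Bonferroni-style correction (mine, not the authors’ code; the number of checks, group sizes, and rates are invented): if an auditor dredges through m attribute/metric combinations, the per-test threshold is tightened from α to α/m.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)

def disparity_pvalue(rate, n_a, n_b):
    """Two-proportion z-test for a difference in positive-prediction rates
    between two groups, simulated under the truth that there is NO disparity."""
    a = rng.binomial(n_a, rate) / n_a
    b = rng.binomial(n_b, rate) / n_b
    pooled = (a * n_a + b * n_b) / (n_a + n_b)
    se = np.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = 0.0 if se == 0 else (a - b) / se
    return 2 * norm.sf(abs(z))

m, alpha, n_sims = 12, 0.05, 5000          # e.g., 12 attribute/metric combinations audited
naive = bonferroni = 0
for _ in range(n_sims):
    pvals = np.array([disparity_pvalue(0.3, 400, 400) for _ in range(m)])
    naive += pvals.min() < alpha            # declare "unfairness" after dredging all m checks
    bonferroni += pvals.min() < alpha / m   # Bonferroni-adjusted per-test threshold
print(f"P(at least one 'significant' disparity | model is fair): "
      f"unadjusted ≈ {naive / n_sims:.2f}, Bonferroni ≈ {bonferroni / n_sims:.2f}")
```

With 12 independent checks and no real disparity, the chance of at least one unadjusted “finding” is about 1 − 0.95^12 ≈ 0.46, while the Bonferroni-adjusted audit stays near the advertised 0.05.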

I find the criticisms raised by the authors of these papers to be noteworthy. Multiple testing and selective reporting create a misleading justification in the case at hand. It’s misleading because it would have been easy to find such a defense erroneously. The general capability or incapability of the method is the basis for the assessment, as the severe tester proposes. Even without formal error probabilities, the counterfactual inference to the particular claim C is unwarranted if the method makes it easy (e.g., frequent) to find evidence for C by chance alone, even if C is flawed.

Now this kind of error statistical reasoning is anathema for holders of the Likelihood Principle, as explained in Slides 24–29, because it uses the overall error statistical properties of the method in determining whether observed data provide evidence. Thus, it appears that auditors of AI/ML methods are akin to error statisticians, rather than to those who view statistical inference in terms of Bayesian updating or Bayes factors. If so, then, since this disagreement is at the heart of controversies in statistics, it follows that controversies in statistics are relevant for AI/ML. Corresponding reforms (e.g., preregistration, adjustment for selection) might therefore be relevant for the meta-level tasks in AI/ML. But the most important upshot is to substantiate grounds for criticizing and falsifying claims about the model’s warrant. That was the gist of my talk.

2. Slide 29: Clarification of optional stopping: The effect of repeated significance testing

During the talk, Ben Recht (Computer Science, Berkeley) asked for a clarification of the table in slide 29 on the effect of optional stopping in a two-sided test of a point hypothesis that a Normal mean is 0. (It is from SIST, p. 44; computed by Aris Spanos.) Failing to find statistical significance after, say, 10 trials, the tester goes on to 20, then 30, and so on until finally a 2SE difference emerges. The table shows that the probability of finding “nominal” significance accumulates as the researcher tries and tries again. The result follows from the law of the iterated logarithm. As subjective Bayesians Edwards, Lindman, and Savage famously observe:

“if an experimenter uses this [optional stopping] procedure, then with probability 1 he will eventually reject any sharp null hypothesis, even though it be true”. (1963, 239)

Nevertheless, according to the Likelihood Principle (LP), which they champion,

“the import of the…data actually observed will be exactly the same as it would had you planned to take n observations.” (ibid)

See also slide 28.

2.1 “But what is it the probability of?” (Recht asked during the discussion). It is the probability that the test, using this optional stopping rule, rejects H0 with a result “nominally” significant at the 0.05 level at or before n trials, given that H0 is true. (I may add a bit more here in updates.)
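For readers who want to see how such numbers arise, here is a small simulation sketch (mine, under simple assumptions: iid N(0,1) observations, σ = 1 known, a two-sided 2SE cutoff, peeking after every 10 trials). It estimates exactly the probability just described: that the optional-stopping tester rejects H0 at or before n observations when H0 is true.

```python
import numpy as np

rng = np.random.default_rng(2)

n_max, step, n_sims = 1000, 10, 20000
looks = np.arange(step, n_max + 1, step)          # peek after every 10 trials: n = 10, 20, ..., 1000
first_reject = np.full(n_sims, np.inf)

for s in range(n_sims):
    x = rng.normal(0.0, 1.0, size=n_max)           # H0: mu = 0 is true, sigma = 1 known
    means = np.cumsum(x)[looks - 1] / looks        # sample mean at each look
    exceed = np.abs(means) > 2 / np.sqrt(looks)    # "nominal" 2SE (two-sided ~0.05) rejection
    if exceed.any():
        first_reject[s] = looks[np.argmax(exceed)]  # first look at which H0 gets rejected

for n in (10, 50, 100, 500, 1000):
    print(f"P(reject at or before n = {n:4d} | H0 true) ≈ {np.mean(first_reject <= n):.3f}")
```

The cumulative probability starts near 0.05 at the first look and keeps climbing with further looks; with no stopping point it would eventually reach 1, as Edwards, Lindman, and Savage note.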

2.2 Data dredging need not be pejorative. What is sometimes overlooked is that the problem is not that a method is sure to find a hypothesis or model that fits the data; the problem arises only when it does so erroneously. I reserve the term biasing selection effect for gambits that distort the method’s relevant error probing capability so as to injure severity. This includes preventing assessing severity even approximately. However, there are cases where data dredging is not problematic and can even increase severity. The problem of identifying pejorative selection effects is a very old (and unsolved) problem in the history of philosophy of science (e.g., Keynes, Mill, Popper). It is because I found the popular (Popperian) notions (e.g., non-novel data, double counting) inadequate that I was led to the severity requirement in 1991.

In statistics, I find frequentists are sometimes (unfairly) criticized for requiring adjustments for selection in contexts where in fact they do not, because in those cases error rates are improved by data-driven search! Dredging, or using data both to construct and test a claim, in other words, can improve severity. For example, searching for a DNA match with a criminal’s DNA is somewhat akin to finding a statistically significant departure from a null hypothesis: one searches through a full database of DNA and homes in on the one match with the known criminal’s DNA. In illicit dredging and cherry picking, the concern is that of inferring a genuine effect when none exists (e.g., the case of data torturing by the drug CEP in Slide 5). In the DNA case, there’s a known effect or specific event, the criminal’s DNA, and reliable procedures are used to track down the specific source. There are many different types of cases like this (see Mayo 2018, SIST). We should not run together examples of pejorative data-driven hunting with cases that are superficially similar but unproblematic.

3. Comments by Ben Jantzen (and Ben Recht): Benchmarking in AI/ML and Error Statistics

Ben Jantzen (Philosophy, Virginia Tech) opened the discussion of my paper with a provocative question: Can error statistics and severity guide benchmarking practices in AI/ML? He argued that benchmark datasets in AI/ML competitions create opportunities akin to P-hacking for providing faux support for hypotheses regarding the effectiveness of a given algorithm. This allows algorithms to appear more effective than they truly are: they don’t generalize well. Competitions let participants probe what they call leaderboards—rankings based on submissions that predict unknown labels for data—potentially leading to algorithms that score higher on the public benchmark without genuinely improving performance for unseen data. In short: the accepted uses of benchmark data in leaderboards aren’t severe tests, and thus a central practice in AI/ML violates the severity requirement.

For a description of AI/ML competitions, the use of training data and (public vs private) test data, see note (1).

In reply I said something like: Now, come on. Although I’m an outsider to AI/ML, I know that competitions deliberately impose safeguards, such as assessing algorithms on unseen holdout data, advocating cross-validation on the training data, and limiting submissions to prevent participants from tailoring models too closely to the benchmark leaderboard. Given all the handwringing about AI/ML models failing to generalize outside the training sample, it’s clear there’s a concern for severe probing even if the criticisms remain at a qualitative level.

While Jantzen agreed that many practitioners insist on rigorous testing before putting trust in an assessment of an algorithm’s accuracy, he also observed that some communities of practitioners don’t share the current worry about replicability in AI/ML testing. Ben Recht (Computer Science, Berkeley) jumped in, giving the impression that he was one of the “unworried”. He even seemed to deny the existence of (or need for?) restrictions, or holdout data, although I wondered if he was just being provocative. Later conversations seem to confirm he really meant it, even though I’m still not sure of his position. He recommended a look at his blogpost. His argument in the blogpost seems to be that even though, according to his own formal analysis, a contestant should be able to rise in the leaderboard and end up with a model that poorly predicts the unseen holdout data, he claims not to see this happening in practice. In his blog, he admits he “has no idea why”.

On the face of it, this would seem only to show either that his formal analysis is incorrect for actual competitions, or that contestants don’t dangerously game the leaderboard (which is a random sample of the full test data). See also notes (2) and (3). Has any contestant ever been caught using the “wacky” leaderboard climbing linked to in note (3)? I’m just an outsider, and don’t know the history of these AI/ML competitions.

Of course the problem is scarcely limited to accuracy on the benchmark test data. As Ben Jantzen notes in his comment, there’s little reason to think that the benchmark datasets are statistically representative of the kinds of systems to which models are likely to be applied.

Upshot. The entire issue that arose from this discussion was valuable: it led me to identify a new avenue where error statistical probing enters a general practice in AI/ML. Although my talk was restricted to the use of statistical significance tests in AI/ML, since I claim severity is applicable to any context of error-prone learning from data, it should be applicable to assessing claims of algorithm performance. It is, whether or not attempts to climb the leaderboard are akin to P-hacking in statistical significance tests. (I’m not yet convinced.)

I thank both Bens for post-conference communication. I may write more on this later, based on further feedback.

4. The picture on the title page.

It might be seen to depict the Bayesian, the frequentist, and AI/ML researchers, with one researcher scrambling to keep up.

5. Slides. They are below the references.

Please use the comments to share queries or thoughts. I’d like to learn more.

NOTES

(1) AI/ML competitions provide a training set with labeled data for model development and a test set that is split into public and private portions. Participants submit predicted labels for the entire test set without knowing the split. Throughout the contest, the leaderboard shows each submission’s accuracy only on the public test set, while final rankings are based on the private test set, which remains hidden.

(2) In support of his “unworried” stance, Recht points to a paper he co-authored. I don’t have the background to analyze it, but my ears perked up when I saw a statistical significance test. When they tested a null hypothesis that the public and private accuracies were identical, they very often obtained low P-values, which indicates a disparity. Perhaps the disparities are small.

(3) Recht cites an example of brute-force leaderboard climbing that I found easy to understand: Moritz Hardt’s (2015) “wacky boosting“, enabling a contestant to eventually extract the true labels for the public data. Even so, I don’t see that the contestant would be bound to submit this model. There would be many models that could fit the test data.
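For concreteness, here is a rough toy reconstruction of the kind of attack Hardt describes (my own sketch, not his code, with invented sizes and no actual model at all): the contestant submits random label guesses, keeps those the public leaderboard scores above 50%, and majority-votes them. The combined submission climbs the public leaderboard while remaining at chance on the private labels.

```python
import numpy as np

rng = np.random.default_rng(3)

n_test = 4000
y_true = rng.integers(0, 2, size=n_test)             # hidden test labels
public = rng.permutation(n_test) < n_test // 2       # hidden public/private split (half and half)

def public_accuracy(guess):                           # the only feedback the leaderboard gives
    return np.mean(guess[public] == y_true[public])

kept = []
for _ in range(600):                                  # 600 submissions of pure noise
    guess = rng.integers(0, 2, size=n_test)
    if public_accuracy(guess) > 0.5:                  # keep the guesses the leaderboard "likes"
        kept.append(guess)

boosted = (np.mean(kept, axis=0) > 0.5).astype(int)   # majority vote of the kept submissions
print(f"boosted accuracy on public (leaderboard) labels: "
      f"{np.mean(boosted[public] == y_true[public]):.3f}")
print(f"boosted accuracy on private (final) labels:      "
      f"{np.mean(boosted[~public] == y_true[~public]):.3f}")
```

In runs of this sketch the public accuracy lands well above 50% while the private accuracy stays around 50%, which is just the kind of public–private disparity mentioned in note (2).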

REFERENCES

Meding, K. and Hagendorff, T. (2024). Fairness Hacking: The Malicious Practice of Shrouding Unfairness in Algorithms. Philosophy & Technology 37, 4. https://doi.org/10.1007/s13347-023-00679-8

Zhou, J. and Joachims, T. (2023). How to Explain and Justify Almost Any Decision: Potential Pitfalls for Accountability in AI Decision-Making. https://doi.org/10.1145/3593013.3593972

 

COMMENTS

  1. Brandon Reines

    Regarding your 2.2 Data dredging need not be pejorative, it is significant to me that you use forensic DNA matching to exemplify “good” data dredging:  It seems to me that good vs bad dredging really depends on the quality of the substantive biological hypothesis under test. And hypothesis quality may depend on “intimacy” between analytical machinery and subject matter.

    Perhaps anything that creates distance between analyst and analysand–including high variable/sample ratio–will make it less likely that a good substantive hypothesis will be discovered.   Note: A criminal investigation is the kind of process that produces very concrete hypotheses, unlike many airy ideas that are “tested” by search procedures from Glymour on–perhaps because the whole process is quite intimate and there is a very low variables/samples: one DNA match (variable)/one criminal (sample) or 1:1. This is very different from molecular genetic microarrays that have 10,000 variables (genes)/person-sample or 10,000:1.

    Cognitively, I would argue that discovery is a very intimate process, even in physics.  You have to know and directly experience through thought experiment or other procedure the ideas and data you are dealing with (e.g., Einstein imagining free fall in an elevator).  In biomedical science, since we are alive, the scientist experiences and “knows” the data even more directly, and the better they know it, the more likely they are to make a discovery.  

    Have you written more on your general notion that statistical testing is best for ideas that are inherently quantitative?  I think this may be a route to better explaining why there is so much resistance to theory in biomedical science, and why the more trivial ideas tend to get tested, perhaps because they are more easily quantified.  

    We have many uncertainty principles in the biomedical sciences that need to be articulated and analyzed. 

    Brandon Reines, Director

    Theoretical Unit

    Department of Pathology

    JABSOM, Honolulu

    • Brandon:
      Thanks for your comment. I do talk about the types of cases where dredging and double counting lead to severe tests. You’ll find several in my book Statistical Inference as Severe Testing: How to get beyond the statistics wars (CUP 2018). Your view “that discovery is a very intimate process, even in physics. You have to know and directly experience through thought experiment or other procedure the ideas and data you are dealing with (e.g., Einstein imagining free fall in an elevator)” reminds me of a talk I heard in philosophy on Friday about large language models by Robert Smithson. I’m not sure how you understand this intimacy remark. I would agree that knowledge of a genuine effect or the reliability of the method/model are necessary for severe testing despite data dredging. Think of how researchers dredged to find an animal model to show the known birth defects of thalidomide (New England rabbit, I believe). It’s quite different when the context is needing to avoid being fooled by chance. See also “How to discount double counting when it counts: some clarifications” https://errorstatistics.com/wp-content/uploads/2020/10/mayo-2008_double-counting-bjps_red.pdf

      • Brandon Reines

        Thanks Deborah,

        Is there a copy of Stimson’s talk? Regarding thalidomide, I would again say it is not “dredging” per se that was needed but actual subject matter expertise–a good experienced biologist knows that rats will usually eat deformed fetuses. But the wider issue here–as I start to argue in my On the Locus of Medical Discovery J Med Phil 16(2), April 1991–is that the Science does not start until the drug is in human patients. Schioldann documents my thesis for discovery of antimanic action of lithium in his big book, and Chapter 21 on “Reines’s view. . .” Although little understood, biomedical science still has many pre-scientific elements, including overuse of argument per analogiem (from infrahuman species). In his book Scientific Strategies in Human Affairs, I.D.J. Bross contends that the teratogenicity of thalidomide could have been detected at 50 infants–instead of 5,000–if good human surveillance systems had been in place!

        • Brandon:

          I remember Bross.

          On thalidomide, they knew of its teratogenicity at the stage I describe. They were seeking an animal that would display it. I don’t know the details, but it was an example from long ago that represented an important contrast with the context of ruling out chance.

          So what is your theory?

      • Brandon Reines

        Deborah,

        • Brandon Reines

          Which of my theories are you asking about? I have been reading your Severe testing book, and on page 282, you return to the issue of thalidomide causing birth defects. “If an effect is known to be genuine, then a sought-for and found explanation needn’t receive a low severity. Don’t misunderstand: not just any explanation of a known effect passes severely, but one mistake – spurious effect – is already taken care of. Once the deflection effect was found to be genuine, it had to be a constraint on theorizing about its cause. In other cases, the trick is to hunt for a way to make the effect manifest in an experiment. When teratogenicity was found in babies whose mothers had been given thalidomide, it took them quite some time to find an animal in which the effect showed up: finally it was replicated in New Zealand rabbits! It is one thing if you are really going where the data take you, as opposed to subliminally taking the data where you want them to go. Severity makes the needed distinctions.”

          What makes me unsure of your model is using a uniform vocabulary for describing data and theory in physics/astronomy on the one hand and biomedical science on the other. I applaud your search for generality of explanation and more severe ways of testing it, but I think you are comparing fields of inquiry that are separated by vast differences in types of data, epistemological evolution, and theoretical sophistication. For instance, using the same word “effect” to describe both the deflection of light during an eclipse predicted by one of the most sophisticated models of nature ever conceived by a human mind–and occurrence of limbless infants whose mothers took thalidomide–presumes that: 1) biomedical scientists had ANY reason to believe a drug developed as a tranquilizer would cause deformities in offspring, 2) that animal model systems are not so varied in type and biochemistry that virtually ANY result can be “replicated” with appropriate choice of species, strain and other chosen characteristics (they can), and 3) that we have laboratory model systems in biomedical science that provide genuine models of actual human pathological phenomena. Irwin Bross made this point succinctly in a letter to me in the 1980s: “Animal systems derive their prestige and credibility from the germ theory of disease and Koch’s postulates. But they really stood for Teutonic system against chaos.” Bross was involved in the cigarettes and lung cancer controversy, and recalled that the main argument used by the tobacco industry to forestall action against tobacco was that it had not caused consistent tumors in laboratory animals. Even Fisher himself thought the tobacco industry was right. Does anyone still believe that? See the issue of Observational Science in 2018 which reprints Bross’s paper Statistical Criticism from the journal Cancer from 1960, with many laudatory essays. Irwin thought Fisher was a dyed-in-the-wool genius, who had trouble seeing issues outside of narrow quantitative ones. Formalized analysis has a very important role to play in progress of science, but it has to be careful to stay in constant touch with subject matter and background assumptions in the particular fields of science it attempts to evaluate.

  2. Christian Hennig

    In my view, competitions aren’t the biggest problem in AI/ML  benchmarking practice. It is so obvious that generalising the  performance of competition winners is affected by selection bias that  I’d hope nobody in their right mind would do that. This doesn’t mean  that “method X winning competition Y” isn’t informative; it basically  means method X has done well at least once, which is something.

    A bigger issue in my view is that the routine benchmarking experiments  that people do in order to demonstrate the superior performance of their  newly proposed algorithm are almost equally (or even more) heavily  influenced by such selection bias, as people put a lot of effort into  tuning their own algorithm so that it does well in their own benchmark  experiment, and hardly any effort into optimally tuning the competitors.  I have seen and reviewed a large number of papers where this happens.  And of course there is pressure on people to do this, because superior  performance needs to be demonstrated for publication. In fact I’d accept  that people should demonstrate that at least there exist situations in  which their method does better than existing competitors, and in this 
    sense such experiments have a value, but very often generality of such performance is claimed that is not granted in any way.

    Now in most of these papers there isn’t any effort at all to test  whether the performance is significantly better (they just present  misclassification rates or squared error loss or whatever and say that  the new method is clearly better than the competitors). To be fair, it’d  arguably be even worse to run a test that ignores all the  selection/tuning, and this usually does not happen. It may actually be  hopeless to expect people to test the quality of their own method  severely against competitors as part of the designing process is to try  to find an optimal tuning for the kind of data sets that one uses for  benchmarking, so that selection bias quite automatically happens.  Despite that, I have often seen such results cited as if one could take  for granted that they generalise.

    Severe testing would require studies run by people who are neutral and  don’t have an interest in seeing any specific method  winning, and for  sure they should care about severity.

    • Christian:
      Thank you so much for your comment. I’m looking for just such clarification because there seems to be a huge tension in AI/ML research. You write:
      “It may actually be hopeless to expect people to test the quality of their own method severely against competitors as part of the designing process is to try to find an optimal tuning for the kind of data sets that one uses for benchmarking, so that selection bias quite automatically happens. Despite that, I have often seen such results cited as if one could take for granted that they generalise.”
      I thought the key feature that AI/ML advancements are wildly praised for is improvement through trial-and-error competition among equally motivated participants. The high praise for frictionless reproducibility achievements discussed by Donoho recently is of that sort. https://assets.pubpub.org/9bk0194n/Donoho%20(2024)_Just%20Accepted-11706563057147.pdf
      His touting AI/ML as the epitome of good science hinges on the idea that intense competition and open benchmarking naturally drive progress. I imagined each competitor being as critical as possible of someone who claimed their model was better. You seem to be saying this competition may merely favor demonstrating superiority in some self-selected situation rather than genuine testing. Is that correct? If researchers optimize their own models aggressively while not giving competitors’ models the same level of tuning, wouldn’t the competitor do likewise?

      You write: “I have seen and reviewed a large number of papers where this happens.” I’d be very interested in hearing about your experience. What happens when you criticize the papers for selection bias? Are they required to revise their papers? If a researcher A is allowed to craft their own data or evaluation metrics to show “A is better than B, C, D….” then they can easily show superiority. Don’t they have to show A does at least as well as the others do in the contexts of the data and metrics the others use to advance their models? Even that wouldn’t be enough, but I assume that’s a minimal requirement, or not? Imagine drug companies being allowed to petition for FDA approval because there are some patients who are improved, according to their definition of “improved”. The worst sort of data dredging. Surely that can’t be countenanced in AI/ML, right?

      “I’d accept that people should demonstrate that at least there exist situations in which their method does better than existing competitors, and in this sense such experiment have a value, but very often generality of such performance is claimed that is not granted in any way.”

      By the “situation” in which their method is apparently superior, do you mean a particular data set they find where the new method predicts better? Or might it be a specific evaluation metric that highlights the method’s strengths while downplaying its weaknesses?
      It would be interesting to demonstrate that they would do better by requiring not just demonstrating comparative superiority (in some situation they’re allowed to construct) but genuine testing. Do you think this possible? It does not seem to me that it would be so difficult.

      Addition March 24: I meant to comment on your remark that “Severe testing would require studies run by people who are neutral and don’t have an interest in seeing any specific method winning, and for sure they should care about severity.” Why would they need to be neutral? Competitors could self-interestedly point to any unfairness in comparisons made.

      • Christian Hennig

        Well, I’m not really “at home” in that community, and because of my expertise I mainly get papers on unsupervised classification (cluster analysis). For supervised learning, with large enough data sets, chances are that techniques such as cross-validation are good enough to compare different approaches rather reliably regarding the specific problem in hand (if done correctly, which of course isn’t always the case). This means that people don’t need to care much about overly general claims in papers that introduce the methods, because they can see how well or not so well they work for their specific problem (although it may still have an impact on what exactly is chosen for comparison). Competitions basically work like that – the winner will be reliably good for the task in hand. Generalisation is questionable and for sure doesn’t follow automatically, but of course if something works well in one task, this indicates that it may well be a good candidate for similar tasks (which in every single situation of this kind then can be verified or not).

        This is quite different for unsupervised tasks (and to some extent for supervised tasks with small data sets, meaning “small” in relation to the complexity of the situation, dependence structure and the like). If there is no “supervision”, it is in the first place not clear how to evaluate the performance on a given problem. What is usually done is that either data are simulated from some models with control of the “truth” (which is unobserved in the real situation), or that real data are chosen in which the real task is supervised (e.g., with given true classes for the observations), and then the unsupervised method is run without using the true class information, and assessed regarding recovering the “hidden truth”. So algorithms are evaluated on data that differ from the real data that a practitioner wants to analyse, and generalisation matters more. Competitions may also be evaluated based on existing true class information, but data sets with given class information are most likely systematically different from data sets without such information (which are those where unsupervised classification is of real interest), which is an additional problem for generalisation on top of the fact that a typical competition is run on a small number of data sets or even just a single one.

        In principle open benchmarking should be a good thing also in unsupervised learning, but the evaluation and also generalisation are much more tricky. Also different solutions may be equally legitimate for different aims of clustering, but this is incompatible with having a one-dimensional evaluation.

        You are right that it should be rather obvious that one shouldn’t trust “benchmarking” as carried out by a single author/team who has an interest to show their own method as superior, and it is true that better things can be done either by neutral benchmarkers or by involving the authors of all competing methods, but because this is more problematic than in supervised learning, it happens more rarely, and therefore it plays a bigger role that authors “benchmark” their own methods (this happens in supervised learning just as much, but it is easier to go beyond it in a real situation).

        “What happens when you criticize the papers for selection bias? Are they required to revise their papers?” Reviewers have some influence, so such criticism usually leads to either a request for revision or rejection. It depends on the details what exactly I ask for. Sometimes I think that the given work is OK as minimal “proof of concept”, in which case I’d just ask to change the interpretation of results to avoid overgeneralisation. This requires that I have at least the impression that competing methods are treated fairly, i.e., some effort is made to adapt them equally well to the data as the newly proposed method, even though it is never possible to see additional preliminary effort that went into the authors’ method if this is not documented. I don’t think much harm is done saying that the new method does a good job at least for the specific problems treated in the paper. However I (and other reviewers) may also ask for extended simulations, involving more competitors, running them in a better way etc. I also recommend rejection (and editors follow the recommendation more often than not) if experiments don’t look convincing at all, for various reasons (which may include results that don’t look right, usually in connection with a lack of proper explanation what exactly was done, and how come that results look like this). It also depends on whether the method is accompanied by convincing theoretical results; if such results are there, the bar for practical experiments is a bit lower.

        So I (and other reviewers) do request improvements, and I think that that’s a good thing. However I also think that there needs to be awareness that we can only do so much regarding work that is carried out by authors who have the aim to advertise and “sell” their method. I don’t think that ultimately selection bias can be effectively controlled, particularly because it is actually their job to find out (based on experiments) how to tune their own method optimally, and then to come up with experiments in which they beat the competitors. The latter comprises the selection of data, selection of an evaluation metric, and method tuning. I don’t think we could realistically ask for a full documentation of the whole process (which may very well start when the author has the initial idea and starts to play around with it when things haven’t taken proper shape yet). But if the presented data example and evaluation metric look fishy in some way (e.g., if a metric looks very “unnatural” and you suspect that the method would look much worse with a more “natural” one), that’d be a reason to recommend rejection.

        Still, part of the issue is also that people need to learn to read such papers, i.e., to not take the results for granted as having any generality. I think that some people in AI are aware of this but others aren’t.

        • Christian:
          I see, so you’re talking about a process more like exploratory research, and of course your specialty in stat would enter. So dredging is expected and necessary. But it’s unclear how one model performs better than the other, unless some outside evaluation is aware of what to look for, or what is interesting. Maybe that’s what you do as a reviewer.

          • Christian Hennig

            Classifying this as exploratory research makes sense to me. Also for exploratory research there are standards, and as a reviewer that’s what I’m looking for. We can’t demand proper error control regarding papers that introduce methods for the given reasons, and it should be left to more “neutral” studies to establish generalisable statements regarding comparisons. That said, exploratory analysis should be good enough to convince readers and reviewers that the method is worthwhile to use, and to be involved in a more “neutral” comparison. Selection bias needs to be avoided at least beyond optimising the method before publication (which can be seen as the job of the author, and therefore unavoidable); a minimal requirement is that positive results cannot all be explained by selection bias. This is often not fulfilled.

      • Christian Hennig

        By the way, the issue of (severe) testing comes up occasionally, in case a performance advantage of a method looks so slim that it may be due to random variation, but in most situations the selection problem seems much bigger than that, i.e., the win of the new method looks clearly significant (whether a formal test is run or not), and a formal test cannot address the selection issue, as this is far more complex than just “10 tests were run, and we can adjust the tests for this”.

  3. I do some research in post-selection inference (not AI), but the issues you describe are familiar. We try to provide a fair comparison for other methods, but the primary point of the research is to develop our own method. At the same time, I think we try to avoid overgeneralizing. Preregistering comparisons is a good idea, and we’ve found some unexpected things out that way.

    Here’s another suggestion. We call it “cross-examination data-guided simulation”. Let’s say we have methods A and B, and a data set D.

    Algorithm:

    1. Fit A and B to D, to get fitted models M(A,D) and M(B,D).
    2. Generate new response values from these fits, y*(M(A,D)) and y*(M(B,D))
      • The matrix of observations X is kept fixed
    3. Then, re-apply methods A and B to the simulated datasets
      • We now have four fits:
        • M(A, y*(M(A,D))) (home game for A)
        • M(A, y*(M(B,D))) (away game for A)
        • M(B, y*(M(A,D))) (away game for B)
        • M(B, y*(M(B,D))) (home game for B)
    4. repeat the whole thing many times
    5. I’ve used the language “home game” and “away game” to suggest that a method ought to at least do well on data that was simulated from its own fit. “Doing well” just depends on the context – MSEs, coverage, whatever it is.
    6. Ideally, our method would do well in its home game and the away games. There’s no hard and fast way to evaluate this, but it’s more of a heuristic to understand why a method is succeeding or failing in some situations. (A rough code sketch of the idea follows below.)
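    In code, the idea might look roughly like this (a toy sketch with two stand-in methods, ordinary least squares as “A” and a random forest as “B”, on made-up data, using MSE against the generating fit’s predictions as the “doing well” criterion):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(4)

# A stand-in data set D (in practice, the real data under study)
n, p = 300, 5
X = rng.normal(size=(n, p))
y = 1.5 * X[:, 0] - X[:, 1] ** 2 + rng.normal(scale=1.0, size=n)

def fit(method, X, y):
    model = LinearRegression() if method == "A" else RandomForestRegressor(
        n_estimators=100, random_state=0)
    return model.fit(X, y)

def simulate(fitted, X, y):
    """Step 2: new responses y* from a fitted model -- its predictions plus
    resampled residuals (a simple stand-in for 'generating from the fit')."""
    resid = y - fitted.predict(X)
    return fitted.predict(X) + rng.choice(resid, size=len(y), replace=True)

M = {m: fit(m, X, y) for m in ("A", "B")}                    # step 1: M(A,D) and M(B,D)
scores = {(refit, gen): [] for refit in "AB" for gen in "AB"}
for _ in range(20):                                          # step 4: repeat the whole thing
    for gen in ("A", "B"):
        y_star = simulate(M[gen], X, y)                      # X is kept fixed
        target = M[gen].predict(X)                           # the generating fit's mean function
        for refit in ("A", "B"):                             # step 3: re-apply both methods
            refitted = fit(refit, X, y_star)
            scores[(refit, gen)].append(mean_squared_error(target, refitted.predict(X)))

for (refit, gen), s in scores.items():
    game = "home" if refit == gen else "away"
    print(f"method {refit} fit to data from M({gen},D): "
          f"mean MSE = {np.mean(s):.3f} ({game} game for {refit})")
```

    A method that fails badly even in its home game would be suspect; comparing how A and B fare in their away games helps show where each succeeds or fails.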

    Here are two published articles:

    Rolling, C. A., & Yang, Y. (2014). Model selection for estimating treatment effects. Journal of the Royal Statistical Society Series B: Statistical Methodology, 76(4), 749-769.

    Li, K. C., Lue, H. H., & Chen, C. H. (2000). Interactive tree-structured regression via principal Hessian directions. Journal of the American Statistical Association, 95(450), 547-560.

    • Hyneken:
      I’ve tried to forge connections between people working on post-selection inference and AI/ML researchers, but they seem to be in separate worlds. I’m guessing it’s because the AI/ML researchers don’t focus on error probabilities. It seems they ought to connect. If you have a link to something more recent, perhaps you can link it. Thanks

  4. rkenett

    Christian brings up an interesting point, I quote: “which is an additional problem for generalisation on top of the fact that a typical competition is run on a small number of data sets or even just a single one.”

    Most papers on AI/ML that propose new algorithms compare outcomes to another, standard, algorithm. This is a head-to-head comparison. It is focused on the claim that one is better and does not provide information that can be generalised.

    For generalization one needs to conduct designed experiments and sensitivity studies.

    The literature on AI/ML is full of tests on a small number of data sets or just a single one. This is also mostly done at a descriptive level, i.e. bar charts comparing the two methods.

    More seasoned researchers would discuss severity testing of their new algorithm. This has never been done.

    A fundamental dichotomy is between experiments conducted to derive a model and experiments conducted to optimize. Competitions should provide a model explaining the difference between competing methods. Optimization is a sequential path combining exploration and exploitation experiments. In considering sequential experimentation this distinction should be made, see https://papers.ssrn.com/sol3/papers.cfm?abstract_id=5129421

    • Ron:

      Could they assess it severely?

      I haven’t read your paper yet, but I definitely will. My feeling, though, from my outsider’s perspective, is that there’s nothing really Bayesian about Bayesian optimization. Just because one considers the probability of something doesn’t make it Bayesian. It looks to be based on frequentist models. I’ll report back after looking at your paper. Thanks for linking it.

      brief update: I very quickly looked at your paper. The techniques are fascinating. I’m impressed at the number of different types of application you work on. I wish I understood them so that I could explain and justify them at a philosophical level. In particular, I’ve wanted to write something on “back propagation” for many years now. I don’t think any philosophers have.

      But yes, the use of probability is frequentist.

  5. I just posted this comment on a post from a few days ago on Gelman’s blog. It concerns the issue of assigning a probability to a particular event, and connects to those epistemology posts I recently wrote. I park it here to find it later.

    https://statmodeling.stat.columbia.edu/2025/03/26/individual-probability-model-multiplicity-and-multicalibration/#comment-2395082
