Below are slides from 4 of the talks given in our Philosophy of Science Association (PSA) session from last month: the PSA 22 Symposium: Multiplicity, Data-Dredging, and Error Control. It was held in Pittsburgh on November 13, 2022. I will write some reflections in the “comments” to this post. I invite your constructive comments there as well.
SYMPOSIUM ABSTRACT: High powered methods, the big data revolution, and the crisis of replication in medicine and social sciences have prompted new reflections and debates in both statistics and philosophy about the role of traditional statistical methodology in current science. Experts do not agree on how to improve reliability, and these disagreements reflect philosophical battles–old and new– about the nature of inductive-statistical evidence and the roles of probability in statistical inference. We consider three central questions:
•How should we cope with the fact that data-driven processes, multiplicity and selection effects can invalidate a method’s control of error probabilities?
•Can we use the same data to search non-experimental data for causal relationships and also to reliably test them?
•Can a method’s error probabilities both control a method’s performance as well as give a relevant epistemological assessment of what can be learned from data?
As reforms to methodology are being debated, constructed or (in some cases) abandoned, the time is ripe to bring the perspectives of philosophers of science (Glymour, Mayo, Mayo-Wilson) and statisticians (Berger, Thornton) to reflect on these questions.
Deborah Mayo (Philosophy, Virginia Tech)
Error Control and Severity
ABSTRACT: I put forward a general principle for evidence: an error-prone claim C is warranted to the extent it has been subjected to, and passes, an analysis that very probably would have found evidence of flaws in C just if they are present. This probability is the severity with which C has passed the test. When a test’s error probabilities quantify the capacity of tests to probe errors in C, I argue, they can be used to assess what has been learned from the data about C. A claim can be probable or even known to be true, yet poorly probed by the data and model at hand. The severe testing account leads to a reformulation of statistical significance tests: Moving away from a binary interpretation, we test several discrepancies from any reference hypothesis and report those well or poorly warranted. A probative test will generally involve combining several subsidiary tests, deliberately designed to unearth different flaws. The approach relates to confidence interval estimation, but, like confidence distributions (CD) (Thornton), a series of different confidence levels is considered. A 95% confidence interval method, say using the mean M of a random sample to estimate the population mean μ of a Normal distribution, will cover the true, but unknown, value of μ 95% of the time in a hypothetical series of applications. However, we cannot take .95 as the probability that a particular interval estimate (a ≤ μ ≤ b) is correct—at least not without a prior probability to μ. In the severity interpretation I propose, we can nevertheless give an inferential construal post-data, while still regarding μ as fixed. For example, there is good evidence μ ≥ a (the lower estimation limit) because if μ < a, then with high probability .95 (or .975 if viewed as one-sided) we would have observed a smaller value of M than we did. Likewise for inferring μ ≤ b. To understand a method’s capability to probe flaws in the case at hand, we cannot just consider the observed data, unlike in strict Bayesian accounts. We need to consider what the method would have inferred if other data had been observed. For each point μ’ in the interval, we assess how severely the claim μ > μ’ has been probed. I apply the severity account to the problems discussed by earlier speakers in our session. The problem with multiple testing (and selective reporting) when attempting to distinguish genuine effects from noise, is not merely that it would, if regularly applied, lead to inferences that were often wrong. Rather, it renders the method incapable, or practically so, of probing the relevant mistaken inference in the case at hand. In other cases, by contrast, (e.g., DNA matching) the searching can increase the test’s probative capacity. In this way the severe testing account can explain competing intuitions about multiplicity and data-dredging, while blocking inferences based on problematic data-dredging.
Suzanne Thornton (Statistics, Swarthmore College)
The Duality of Parameters and the Duality of Probability
ABSTRACT: Under any inferential paradigm, statistical inference is connected to the logic of probability. Well-known debates among these various paradigms emerge from conflicting views on the notion of probability. One dominant view understands the logic of probability as a representation of variability (frequentism), and another prominent view understands probability as a measurement of belief (Bayesianism). The first camp generally describes model parameters as fixed values, whereas the second camp views parameters as random. Just as calibration (Reid and Cox 2015, “On Some Principles of Statistical Inference,” International Statistical Review 83(2), 293-308)–the behavior of a procedure under hypothetical repetition–bypasses the need for different versions of probability, I propose that an inferential approach based on confidence distributions (CD), which I will explain, bypasses the analogous conflicting perspectives on parameters. Frequentist inference is connected to the logic of probability through the notion of empirical randomness. Sample estimates are useful only insofar as one has a sense of the extent to which the estimator may vary from one random sample to another. The bounds of a confidence interval are thus particular observations of a random variable, where the randomness is inherited by the random sampling of the data. For example, 95% confidence intervals for parameter θ can be calculated for any random sample from a Normal N(θ, 1) distribution. With repeated sampling, approximately 95% of these intervals are guaranteed to yield an interval covering the fixed value of θ. Bayesian inference produces a probability distribution for the different values of a particular parameter. However, the quality of this distribution is difficult to assess without invoking an appeal to the notion of repeated performance. For data observed from a N(θ, 1) distribution to generate a credible interval for θ requires an assumption about the plausibility of different possible values of θ, that is, one must assume a prior. However, depending on the context – is θ the recovery time for a newly created drug? or is θ the recovery time for a new version of an older drug? – there may or may not be an informed choice for the prior. Without appealing to the long-run performance of the interval, how is one to judge a 95% credible interval [a, b] versus another 95% interval [a’, b’] based on the same data but a different prior? In contrast to a posterior distribution, a CD is not a probabilistic statement about the parameter, rather it is a data-dependent estimate for a fixed parameter for which a particular behavioral property holds. The Normal distribution itself, centered around the observed average of the data (e.g. average recovery times), can be a CD for θ. It can give any level of confidence. Such estimators can be derived through Bayesian or frequentist inductive procedures, and any CD, regardless of how it is obtained, guarantees performance of the estimator under replication for a fixed target, while simultaneously producing a random estimate for the possible values of θ.
Clark Glymour (Philosophy, Carnegie Mellon University)
ABSTRACT: “Data dredging”–searching non experimental data for causal and other relationships and taking that same data to be evidence for those relationships–was historically common in the natural sciences–the works of Kepler, Cannizzaro and Mendeleev are examples. Nowadays, “data dredging”–using data to bring hypotheses into consideration and regarding that same data as evidence bearing on their truth or falsity–is widely denounced by both philosophical and statistical methodologists. Notwithstanding, “data dredging” is routinely practiced in the human sciences using “traditional” methods–various forms of regression for example. The main thesis of my talk is that, in the spirit and letter of Mayo’s and Spanos’ notion of severe testing, modern computational algorithms that search data for causal relations severely test their resulting models in the process of “constructing” them. My claim is that in many investigations, principled computerized search is invaluable for reliable, generalizable, informative, scientific inquiry. The possible failures of traditional search methods for causal relations, multiple regression for example, are easily demonstrated by simulation in cases where even the earliest consistent graphical model search algorithms succeed. In real scientific cases in which the number of variables is large in comparison to the sample size, principled search algorithms can be indispensable. I illustrate the first claim with a simple linear model, and the second claim with an application of the oldest correct graphical model search, the PC algorithm, to genomic data followed by experimental tests of the search results. The latter example, due to Steckhoven et al. (“Causal Stability Ranking,” Bioinformatics, 28 (21), 2819-2823) involves identification of (some of the) genes responsible for bolting in A. thaliana from among more than 19,000 coding genes using as data the gene expressions and time to bolting from only 47 plants. I will also discuss Fast Causal Inference (FCI) which gives asymptotically correct results even in the presence of confounders. These and other examples raise a number of issues about using multiple hypothesis tests in strategies for severe testing, notably, the interpretation of standard errors and confidence levels as error probabilities when the structures assumed in parameter estimation are uncertain. Commonly used regression methods, I will argue, are bad data dredging methods that do not severely, or appropriately, test their results. I argue that various traditional and proposed methodological norms, including pre-specification of experimental outcomes and error probabilities for regression estimates of causal effects, are unnecessary or illusory in application. Statistics wants a number, or at least an interval, to express a normative virtue, the value of data as evidence for a hypothesis, how well the data pushes us toward the true or away from the false. Good when you can get it, but there are many circumstances where you have evidence but there is no number or interval to express it other than phony numbers with no logical connection with truth guidance. Kepler, Darwin, Cannizarro, Mendeleev had no such numbers, but they severely tested their claims by combining data dredging with severe testing.
James Berger (Statistics, Duke University)
Comparing Frequentists and Bayesian Control of Multiple Testing
ABSTRACT: A problem that is common to many sciences is that of having to deal with a multiplicity of statistical inferences. For instance, in GWAS (Genome Wide Association Studies), an experiment might consider 20 diseases and 100,000 genes, and conduct statistical tests of the 20×100,000=2,000,000 null hypotheses that a specific disease is associated with a specific gene. The issue is that selective reporting of only the ‘highly significant’ results could lead to many claimed disease/gene associations that turn out to be false, simply because of statistical randomness. In 2007, the seriousness of this problem was recognized in GWAS and extremely stringent standards were employed to resolve it. Indeed, it was recommended that tests for association should be conducted at an error probability of 5 x 10—7. Particle physicists similarly learned that a discovery would be reliably replicated only if the p-value of the relevant test was less than 5.7 x 10—7. This was because they had to account for a huge number of multiplicities in their analyses. Other sciences have continuing issues with multiplicity. In the Social Sciences, p-hacking and data dredging are common, which involve multiple analyses of data. Stopping rules in social sciences are often ignored, even though it has been known since 1933 that, if one keeps collecting data and computing the p-value, one is guaranteed to obtain a p-value less than 0.05 (or, indeed, any specified value), even if the null hypothesis is true. In medical studies that occur with strong oversight (e.g., by the FDA), control for multiplicity is mandated. There is also typically a large amount of replication, resulting in meta-analysis. But there are many situations where multiplicity is not handled well, such as subgroup analysis: one first tests for an overall treatment effect in the population; failing to find that, one tests for an effect among men or among women; failing to find that, one tests for an effect among old men or young men, or among old women or young women; …. I will argue that there is a single method that can address any such problems of multiplicity: Bayesian analysis, with the multiplicity being addressed through choice of prior probabilities of hypotheses. In GWAS, scientists assessed the chance of a disease/gene association to be 1/100,000, meaning that each null hypothesis of no association would be assigned a prior probability of 1-1/100,000. Only tests yielding p-values less than 5 x 10—7 would be able to overcome this strong initial belief in no association. In subgroup analysis, the set of possible subgroups under consideration can be expressed as a tree, with probabilities being assigned to differing branches of the tree to deal with the multiplicity. There are, of course, also frequentist error approaches (such as Bonferroni and FDR) for handling multiplicity of statistical inferences; indeed, these are much more familiar than the Bayesian approach. These are, however, targeted solutions for specific classes of problems and are not easily generalizable to new problems.