Below are slides from 4 of the talks given in our Philosophy of Science Association (PSA) session from last month: the PSA 22 Symposium: *Multiplicity, Data-Dredging, and Error Control*. It was held in Pittsburgh on November 13, 2022. I will write some reflections in the “comments” to this post. I invite your constructive comments there as well.

**SYMPOSIUM ABSTRACT:** High-powered methods, the big data revolution, and the crisis of replication in medicine and social sciences have prompted new reflections and debates in both statistics and philosophy about the role of traditional statistical methodology in current science. Experts do not agree on how to improve reliability, and these disagreements reflect philosophical battles–old and new–about the nature of inductive-statistical evidence and the roles of probability in statistical inference. We consider three central questions:

• How should we cope with the fact that data-driven processes, multiplicity and selection effects can invalidate a method’s control of error probabilities?

• Can we use the same data to search non-experimental data for causal relationships and also to reliably test them?

• Can a method’s error probabilities both control a method’s performance and give a relevant epistemological assessment of what can be learned from data?

As reforms to methodology are being debated, constructed or (in some cases) abandoned, the time is ripe to bring the perspectives of philosophers of science (Glymour, Mayo, Mayo-Wilson) and statisticians (Berger, Thornton) to reflect on these questions.

**Deborah Mayo** (Philosophy, Virginia Tech)

*Error Control and Severity*

**ABSTRACT**: I put forward a general principle for evidence: an error-prone claim C is warranted to the extent it has been subjected to, and passes, an analysis that very probably would have found evidence of flaws in C just if they are present. This probability is the severity with which C has passed the test. When a test’s error probabilities quantify the capacity of tests to probe errors in C, I argue, they can be used to assess what has been learned from the data about C. A claim can be probable or even known to be true, yet poorly probed by the data and model at hand. The severe testing account leads to a reformulation of statistical significance tests: Moving away from a binary interpretation, we test several discrepancies from any reference hypothesis and report those well or poorly warranted. A probative test will generally involve combining several subsidiary tests, deliberately designed to unearth different flaws. The approach relates to confidence interval estimation, but, like confidence distributions (CD) (Thornton), a series of different confidence levels is considered. A 95% confidence interval method, say using the mean M of a random sample to estimate the population mean μ of a Normal distribution, will cover the true, but unknown, value of μ 95% of the time in a hypothetical series of applications. However, we cannot take .95 as the probability that a particular interval estimate (a ≤ μ ≤ b) is correct—at least not without a prior probability to μ. In the severity interpretation I propose, we can nevertheless give an inferential construal post-data, while still regarding μ as fixed. For example, there is good evidence μ ≥ a (the lower estimation limit) because if μ < a, then with high probability .95 (or .975 if viewed as one-sided) we would have observed a smaller value of M than we did. Likewise for inferring μ ≤ b. 
To understand a method’s capability to probe flaws in the case at hand, we cannot just consider the observed data, unlike in strict Bayesian accounts. We need to consider what the method would have inferred if other data had been observed. For each point μ’ in the interval, we assess how severely the claim μ > μ’ has been probed. I apply the severity account to the problems discussed by earlier speakers in our session. The problem with multiple testing (and selective reporting) when attempting to distinguish genuine effects from noise, is not merely that it would, if regularly applied, lead to inferences that were often wrong. Rather, it renders the method incapable, or practically so, of probing the relevant mistaken inference in the case at hand. In other cases, by contrast, (e.g., DNA matching) the searching can increase the test’s probative capacity. In this way the severe testing account can explain competing intuitions about multiplicity and data-dredging, while blocking inferences based on problematic data-dredging.
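The severity assessment described in this abstract can be sketched numerically. The following is a minimal Python illustration, assuming a Normal model with known σ; the function name `severity_greater` and the sample values (observed mean 0.4, σ = 1, n = 25) are hypothetical choices for illustration and do not come from the talk. For a claim μ > μ’, severity is taken here as the probability of observing a sample mean no larger than the one actually observed, were μ equal to μ’.

```python
from math import sqrt
from statistics import NormalDist


def severity_greater(m_obs, mu_prime, sigma, n):
    """Severity for the claim mu > mu_prime, given observed mean m_obs.

    Computed as P(M <= m_obs; mu = mu_prime) under a Normal model with
    known sigma (an assumption of this sketch).
    """
    se = sigma / sqrt(n)
    return NormalDist().cdf((m_obs - mu_prime) / se)


# Hypothetical data: observed mean 0.4, sigma 1, n = 25, so SE = 0.2.
m_obs, sigma, n = 0.4, 1.0, 25
se = sigma / sqrt(n)
lower, upper = m_obs - 1.96 * se, m_obs + 1.96 * se  # 95% interval limits

# Severity for mu > mu' at several points mu' in and around the interval.
for mu_prime in (lower, 0.2, m_obs, 0.6, upper):
    sev = severity_greater(m_obs, mu_prime, sigma, n)
    print(f"SEV(mu > {mu_prime:.3f}) = {sev:.3f}")
```

At the lower limit the severity is about .975, matching the one-sided figure mentioned in the abstract. It falls to .5 at the observed mean and to about .025 at the upper limit, so claims of larger discrepancies are progressively less well warranted.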

**Suzanne Thornton** (Statistics, Swarthmore College)

*The Duality of Parameters and the Duality of Probability*

**ABSTRACT:** Under any inferential paradigm, statistical inference is connected to the logic of probability. Well-known debates among these various paradigms emerge from conflicting views on the notion of probability. One dominant view understands the logic of probability as a representation of variability (frequentism), and another prominent view understands probability as a measurement of belief (Bayesianism). The first camp generally describes model parameters as fixed values, whereas the second camp views parameters as random. Just as calibration (Reid and Cox 2015, “On Some Principles of Statistical Inference,” International Statistical Review 83(2), 293-308)–the behavior of a procedure under hypothetical repetition–bypasses the need for different versions of probability, I propose that an inferential approach based on confidence distributions (CD), which I will explain, bypasses the analogous conflicting perspectives on parameters. Frequentist inference is connected to the logic of probability through the notion of empirical randomness. Sample estimates are useful only insofar as one has a sense of the extent to which the estimator may vary from one random sample to another. The bounds of a confidence interval are thus particular observations of a random variable, where the randomness is inherited from the random sampling of the data. For example, 95% confidence intervals for parameter θ can be calculated for any random sample from a Normal N(θ, 1) distribution. With repeated sampling, approximately 95% of these intervals will cover the fixed value of θ. Bayesian inference produces a probability distribution for the different values of a particular parameter. However, the quality of this distribution is difficult to assess without appealing to the notion of repeated performance.
For data observed from a N(θ, 1) distribution to generate a credible interval for θ requires an assumption about the plausibility of different possible values of θ, that is, one must assume a prior. However, depending on the context – is θ the recovery time for a newly created drug? or is θ the recovery time for a new version of an older drug? – there may or may not be an informed choice for the prior. Without appealing to the long-run performance of the interval, how is one to judge a 95% credible interval [a, b] versus another 95% interval [a’, b’] based on the same data but a different prior? In contrast to a posterior distribution, a CD is not a probabilistic statement about the parameter, rather it is a data-dependent estimate for a fixed parameter for which a particular behavioral property holds. The Normal distribution itself, centered around the observed average of the data (e.g. average recovery times), can be a CD for θ. It can give any level of confidence. Such estimators can be derived through Bayesian or frequentist inductive procedures, and any CD, regardless of how it is obtained, guarantees performance of the estimator under replication for a fixed target, while simultaneously producing a random estimate for the possible values of θ.
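The N(θ, 1) example in this abstract lends itself to a quick simulation. The sketch below is only an illustration under stated assumptions: it takes the Normal distribution centered at the sample mean with standard deviation 1/√n as a CD for θ, reads off central intervals at several confidence levels, and checks their coverage under repeated sampling. The function names, the true value θ = 2, and the sample size n = 10 are hypothetical and not taken from the talk.

```python
from math import sqrt
from statistics import NormalDist

import numpy as np


def cd_interval(xbar, n, level):
    """Central interval at the given level from the CD N(xbar, 1/n) for theta."""
    cd = NormalDist(mu=xbar, sigma=1 / sqrt(n))
    alpha = 1 - level
    return cd.inv_cdf(alpha / 2), cd.inv_cdf(1 - alpha / 2)


def coverage(theta, n, level, reps=5000, seed=0):
    """Fraction of repeated samples whose CD interval covers the fixed theta."""
    rng = np.random.default_rng(seed)
    xbars = rng.normal(theta, 1.0, size=(reps, n)).mean(axis=1)
    hits = 0
    for xbar in xbars:
        lo, hi = cd_interval(float(xbar), n, level)
        hits += lo <= theta <= hi
    return hits / reps


# One CD yields intervals at any confidence level; each should cover the
# fixed theta at roughly its nominal rate under replication.
for level in (0.50, 0.80, 0.95):
    print(level, coverage(theta=2.0, n=10, level=level))
```

The point of the sketch is that a single data-dependent distribution supplies intervals at every level, and the performance guarantee concerns the procedure under replication, with θ held fixed throughout.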

**Clark Glymour** (Philosophy, Carnegie Mellon University)

*Good Data-Dredging*

**ABSTRACT:** “Data dredging”–searching non-experimental data for causal and other relationships and taking that same data to be evidence for those relationships–was historically common in the natural sciences–the works of Kepler, Cannizzaro and Mendeleev are examples. Nowadays, “data dredging”–using data to bring hypotheses into consideration and regarding that same data as evidence bearing on their truth or falsity–is widely denounced by both philosophical and statistical methodologists. Notwithstanding, “data dredging” is routinely practiced in the human sciences using “traditional” methods–various forms of regression for example. The main thesis of my talk is that, in the spirit and letter of Mayo’s and Spanos’ notion of severe testing, modern computational algorithms that search data for causal relations severely test their resulting models in the process of “constructing” them. My claim is that in many investigations, principled computerized search is invaluable for reliable, generalizable, informative, scientific inquiry. The possible failures of traditional search methods for causal relations, multiple regression for example, are easily demonstrated by simulation in cases where even the earliest consistent graphical model search algorithms succeed. In real scientific cases in which the number of variables is large in comparison to the sample size, principled search algorithms can be indispensable. I illustrate the first claim with a simple linear model, and the second claim with an application of the oldest correct graphical model search, the PC algorithm, to genomic data followed by experimental tests of the search results. The latter example, due to Stekhoven et al. (“Causal Stability Ranking,” Bioinformatics, 28 (21), 2819-2823) involves identification of (some of the) genes responsible for bolting in A. thaliana from among more than 19,000 coding genes using as data the gene expressions and time to bolting from only 47 plants.
I will also discuss Fast Causal Inference (FCI) which gives asymptotically correct results even in the presence of confounders. These and other examples raise a number of issues about using multiple hypothesis tests in strategies for severe testing, notably, the interpretation of standard errors and confidence levels as error probabilities when the structures assumed in parameter estimation are uncertain. Commonly used regression methods, I will argue, are bad data dredging methods that do not severely, or appropriately, test their results. I argue that various traditional and proposed methodological norms, including pre-specification of experimental outcomes and error probabilities for regression estimates of causal effects, are unnecessary or illusory in application. Statistics wants a number, or at least an interval, to express a normative virtue, the value of data as evidence for a hypothesis, how well the data pushes us toward the true or away from the false. Good when you can get it, but there are many circumstances where you have evidence but there is no number or interval to express it other than phony numbers with no logical connection with truth guidance. Kepler, Darwin, Cannizzaro, Mendeleev had no such numbers, but they severely tested their claims by combining data dredging with severe testing.
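The claim that multiple regression can fail in a simple linear model is easy to reproduce in a small simulation. The sketch below is not the example from the talk; it is a hypothetical structure chosen for illustration, in which `x1` has no effect on `y`, an unmeasured variable `latent` causes both `x2` and `y`, and `x1` also causes `x2`. All variable names and coefficients are assumptions.

```python
import numpy as np


def simulate(n=20000, seed=0):
    """Regress y on x1 and x2 when x1 has no causal effect on y.

    Hypothetical structure: x1 -> x2 <- latent -> y, with latent unmeasured.
    Returns the regression coefficients on x1 and x2 and the marginal
    correlation between x1 and y.
    """
    rng = np.random.default_rng(seed)
    latent = rng.standard_normal(n)              # unmeasured common cause
    x1 = rng.standard_normal(n)                  # no effect on y
    x2 = x1 + latent + rng.standard_normal(n)    # common effect of x1 and latent
    y = latent + rng.standard_normal(n)          # depends on latent only

    design = np.column_stack([np.ones(n), x1, x2])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    marginal_corr = np.corrcoef(x1, y)[0, 1]
    return coef[1], coef[2], marginal_corr


b1, b2, corr = simulate()
print(f"regression coefficient on x1: {b1:.3f}")
print(f"regression coefficient on x2: {b2:.3f}")
print(f"marginal correlation of x1 and y: {corr:.3f}")
```

Under these assumptions the regression assigns `x1` a clearly nonzero coefficient (about −0.5 in the population) even though it has no influence on `y`, because conditioning on `x2` opens a path through the latent variable. The marginal correlation of `x1` and `y` is near zero, which is the kind of independence fact a constraint-based search such as PC uses to remove the edge between them.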

**James Berger** (Statistics, Duke University)

*Comparing Frequentist and Bayesian Control of Multiple Testing*

**ABSTRACT:** A problem that is common to many sciences is that of having to deal with a multiplicity of statistical inferences. For instance, in GWAS (Genome Wide Association Studies), an experiment might consider 20 diseases and 100,000 genes, and conduct statistical tests of the 20×100,000=2,000,000 null hypotheses that a specific disease is not associated with a specific gene. The issue is that selective reporting of only the ‘highly significant’ results could lead to many claimed disease/gene associations that turn out to be false, simply because of statistical randomness. In 2007, the seriousness of this problem was recognized in GWAS and extremely stringent standards were employed to resolve it. Indeed, it was recommended that tests for association should be conducted at an error probability of 5 × 10⁻⁷. Particle physicists similarly learned that a discovery would be reliably replicated only if the p-value of the relevant test was less than 5.7 × 10⁻⁷. This was because they had to account for a huge number of multiplicities in their analyses. Other sciences have continuing issues with multiplicity. In the social sciences, p-hacking and data dredging are common, which involve multiple analyses of data. Stopping rules in social sciences are often ignored, even though it has been known since 1933 that, if one keeps collecting data and computing the p-value, one is guaranteed to obtain a p-value less than 0.05 (or, indeed, any specified value), even if the null hypothesis is true. In medical studies that occur with strong oversight (e.g., by the FDA), control for multiplicity is mandated. There is also typically a large amount of replication, resulting in meta-analysis.
But there are many situations where multiplicity is not handled well, such as subgroup analysis: one first tests for an overall treatment effect in the population; failing to find that, one tests for an effect among men or among women; failing to find that, one tests for an effect among old men or young men, or among old women or young women; …. I will argue that there is a single method that can address any such problems of multiplicity: Bayesian analysis, with the multiplicity being addressed through choice of prior probabilities of hypotheses. In GWAS, scientists assessed the chance of a disease/gene association to be 1/100,000, meaning that each null hypothesis of no association would be assigned a prior probability of 1-1/100,000. Only tests yielding p-values less than 5 × 10⁻⁷ would be able to overcome this strong initial belief in no association. In subgroup analysis, the set of possible subgroups under consideration can be expressed as a tree, with probabilities being assigned to differing branches of the tree to deal with the multiplicity. There are, of course, also frequentist error approaches (such as Bonferroni and FDR) for handling multiplicity of statistical inferences; indeed, these are much more familiar than the Bayesian approach. These are, however, targeted solutions for specific classes of problems and are not easily generalizable to new problems.
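The arithmetic behind the numbers in this abstract can be sketched in a few lines of Python. This is a rough illustration, not Berger’s calculation: it treats the evidence as simply “the test rejected at level α”, so that the posterior odds of a real association are the prior odds times power/α. The power value of 0.5 is an assumption introduced here and does not appear in the abstract; the function names are likewise hypothetical.

```python
from math import erfc, sqrt


def posterior_odds_of_association(prior_prob, alpha, power):
    """Posterior odds of a real association given rejection at level alpha.

    Uses prior odds * (power / alpha); conditioning only on 'rejected',
    not on the exact p-value, is a simplification of this sketch.
    """
    prior_odds = prior_prob / (1 - prior_prob)
    return prior_odds * power / alpha


def two_sided_p_from_sigma(z):
    """Two-sided Normal tail area beyond z standard deviations."""
    return erfc(z / sqrt(2))


n_tests = 20 * 100_000       # 2,000,000 hypotheses, as in the abstract
prior_prob = 1 / 100_000     # prior chance of an association, as in the abstract
assumed_power = 0.5          # ASSUMPTION: not stated in the abstract

# Expected number of false positives if every null were true and alpha = 0.05.
print(n_tests * 0.05)

# Posterior odds of a real association after rejection at two thresholds.
print(posterior_odds_of_association(prior_prob, 0.05, assumed_power))
print(posterior_odds_of_association(prior_prob, 5e-7, assumed_power))

# Two-sided p-value corresponding to a 5 standard deviation effect.
print(two_sided_p_from_sigma(5))
```

Under the assumed power, rejection at 0.05 leaves the odds of a real association at roughly 1 in 10,000, while rejection at 5 × 10⁻⁷ raises them to about 10 to 1. The last line gives approximately 5.7 × 10⁻⁷, the particle physics figure quoted above.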

Jim Berger’s presentation is intriguing, as it appears to present us with a dramatic turnabout from his earlier position, one that brings him closer to that of the severe tester, or appears to. For a noteworthy example, Berger’s abstract gives a statement that seems to be a criticism of ignoring stopping rules: “Stopping rules in social sciences are often ignored, even though it has been known since 1933 that, if one keeps collecting data and computing the p-value, one is guaranteed to obtain a p-value less than 0.05 (or, indeed, any specified value), even if the null hypothesis is true” (Berger abstract). This appears to be an argument that ignoring optional stopping is problematic because it licenses high error probabilities, in the frequentist error statistical sense. The concern is that there is a high probability of attaining a small P-value, at some stopping point or other, even though the null hypothesis is true. Slide #19 of my PSA slides (also in this post) shows the classic example of how the attained P-value blows up with sequential trials in a two-sided test of the mean of a Normal distribution. Because of the large error probability, the inference to a discrepancy from the null fails to pass a severe test. However, this is the same famous example to which Savage (1961) refers in arguing that optional stopping “is no sin”. The alleged innocence of optional stopping is reflected in the standard Bayesian position embracing the likelihood principle which leads to conditioning on the data. Taking account of “some stopping point or other,” as statistical significance testers do in dealing with stopping rules, is a violation of the likelihood principle. Statistical significance tests are often criticized for taking into account outcomes other than the one observed, but for the severe tester, the problem is ignoring gambits that wreck error probability control. (Cox and Hinkley 1974 take the optional stopping example as a reductio of the likelihood principle.)
Thus, Berger’s position here appears to place him on the same side as the severe tester, when using statistical significance tests.
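The effect of optional stopping on error probabilities in the two-sided Normal test can be checked by simulation. The sketch below is an illustration only, assuming data from N(0, 1) with known σ = 1, a true null of mean zero, and a test at the nominal 0.05 level (|z| ≥ 1.96) applied after every observation. The function name and the particular values of the maximum sample size are hypothetical choices.

```python
import numpy as np


def prob_reject_somewhere(n_max, n_sims=20000, z_crit=1.96, seed=0):
    """Estimate P(|z_n| >= z_crit for some n <= n_max) when the null is true.

    Data are N(0, 1) with known sigma = 1; the z statistic for the mean is
    recomputed after each new observation (trying and trying again).
    """
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((n_sims, n_max))
    n = np.arange(1, n_max + 1)
    z = np.cumsum(x, axis=1) / np.sqrt(n)    # running z statistic
    return float(np.mean(np.any(np.abs(z) >= z_crit, axis=1)))


# The nominal level is 0.05 at each look; the chance of a nominally
# significant result at some look grows with the number of looks allowed.
for n_max in (1, 10, 100):
    print(n_max, prob_reject_somewhere(n_max))
```

With a single look the rejection rate stays near the nominal 0.05, but it climbs well above that as more looks are allowed. This is the sense in which the nominal P-value no longer reflects the actual error probability of the procedure.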

Now you might say Berger is only pointing out that optional stopping is problematic for one who computes P-values. But the same hypothesized effect attained through optional stopping can be assigned a high posterior probability given the data–even where the null hypothesis is true. That is why the FDA requires Bayesian exploratory inferences to adjust for interim analysis, even though, strictly speaking, the Bayesian analysis does not pick up on it. FDA requires, in other words, raising the cut-off posterior probability for inferring evidence of a genuine effect. I don’t think it’s plausible to suppose Berger is declaring in this presentation that optional stopping, and multiplicity in general, are of concern only for non-Bayesians. Otherwise he would not be keen to show that by adjusting one’s prior probabilities one can essentially wind up in the same place as the error statisticians. There is still a move away from adherence to the likelihood principle–which is all to the good.

If so, then Berger is shifting his view, despite being the well-known co-author (with Wolpert) of The Likelihood Principle (1988). I began to sense a shift in hearing Jim give papers around a decade ago. While he might say that optional stopping only affects the prior probability, allowing one to adhere to the likelihood principle, this strikes me as unsatisfactory for several reasons. Notably, if optional stopping is no sin, then why would learning that such a gambit had been used lead a Bayesian to lower their prior probability for the effect in question? The location of the problem instead is that the inference to a discrepancy from the null fails to pass a severe test. Even if the effect is known to be genuine, researchers who present the results of trying and trying again as having done a good job probing the error of relevance (attaining an impressive-looking effect that is actually spurious) are misleading their readers. Their inference to the reality of the effect might be right, but for the wrong reason.

In slide #9, Berger presents himself as still ambivalent, quoting Savage’s famous claim (if you don’t know it, check Berger’s slide #9). I would like to nudge Berger over to the severe tester’s side. After all, as an objective Bayesian, he is already bound to admit “technical violations” of the likelihood principle.