Some claim that no one attends Sunday morning (9am) sessions at the Philosophy of Science Association. But if you’re attending the PSA (in Pittsburgh), we hope you’ll falsify this supposition and come to hear us (Mayo, Thornton, Glymour, Mayo-Wilson, Berger) wrestle with some rival views on the trenchant problems of multiplicity, data-dredging, and error control. Coffee and donuts to all who show up.
Multiplicity, Data-Dredging, and Error Control
November 13, 9:00 – 11:45 AM
(link to symposium on PSA website)
Deborah Mayo (Virginia Tech) abstract Error control and Severity
Suzanne Thornton (Swarthmore College) abstract The Duality of Parameters and the Duality of Probability
Clark Glymour (Carnegie Mellon University) abstract Good Data Dredging
Conor Mayo-Wilson (University of Washington, Seattle) abstract Bamboozled By Bonferroni
James O. Berger (Duke University) abstract Controlling for Multiplicity in Science
High-powered methods, the big data revolution, and the crisis of replication in medicine and social sciences have prompted new reflections and debates in both statistics and philosophy about the role of traditional statistical methodology in current science. Experts do not agree on how to improve reliability, and these disagreements reflect philosophical battles, old and new, about the nature of inductive-statistical evidence and the roles of probability in statistical inference. We consider three central questions:
- How should we cope with the fact that data-driven processes, multiplicity and selection effects can invalidate a method’s control of error probabilities?
- Can we use the same data to search non-experimental data for causal relationships and also to reliably test them?
- Can a method’s error probabilities both control the method’s performance and give a relevant epistemological assessment of what can be learned from data?
As reforms to methodology are being debated, constructed or (in some cases) abandoned, the time is ripe to bring the perspectives of philosophers of science (Glymour, Mayo, Mayo-Wilson) and statisticians (Berger, Thornton) to reflect on these questions.
Multiple testing, replication and error control. The probabilities that a method leads to misinterpreting data in repeated use may be called its error probabilities. It is well known that control of the probability of a Type I error (erroneously rejecting a null hypothesis H0) is invalidated by cherry-picking, p-hacking, and stopping when the data look good. If a medical researcher combs through unblinded data and selectively reports just the endpoints that show impressive drug benefit, there is a high probability of finding some statistically significant effect or other, even if none are genuine—a high error probability. The problem, for a significance tester, is that the probability of getting some small p-value, say .01, under H0, is no longer .01, but can be much greater. From a Bayesian perspective, the problem is that multiple testing results in p-values being low even though the posterior probability of H0 is not low (on a given prior). The former suggests there is evidence against H0, while the latter says there is not.
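To make the inflation concrete, here is a minimal sketch under simplifying assumptions that are ours, not the symposium's: the tests are independent, every null hypothesis is true, and each p-value is therefore uniform on (0, 1). The function names are hypothetical. The sketch compares the analytic probability of at least one p-value at or below a threshold with a simulated estimate.

```python
import random


def familywise_error_rate(n_tests: int, alpha: float) -> float:
    """Chance of at least one p-value <= alpha among n_tests independent
    tests when every null hypothesis is true."""
    return 1.0 - (1.0 - alpha) ** n_tests


def simulate_familywise_error(n_tests: int, alpha: float,
                              n_trials: int = 20000, seed: int = 0) -> float:
    """Monte Carlo estimate of the same quantity.

    Assumption: under a true null with a continuous test statistic,
    each p-value is Uniform(0, 1), so we draw p-values directly.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_trials):
        # A "finding" occurs if any of the n_tests endpoints looks significant.
        if any(rng.random() <= alpha for _ in range(n_tests)):
            hits += 1
    return hits / n_trials


if __name__ == "__main__":
    # 20 endpoints, each tested at the .01 level
    print(familywise_error_rate(20, 0.01))
    print(simulate_familywise_error(20, 0.01))
```

With 20 endpoints each tested at .01, the chance of reporting some nominally significant effect is roughly .18 rather than .01, which is the sense in which selective reporting invalidates the stated error probability.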
Accordingly, the statistical significance tester and the Bayesian propose different ways to solve the problem. Jim Berger will argue that older frequentist solutions, such as Bonferroni and the False Discovery Rate (FDR), are inappropriate for many of today’s complex, high-throughput inquiries. He argues for a unified method that can address any such problems of multiplicity by means of the choice of objective prior probabilities of hypotheses.
Philosophical scrutiny of both older and newer solutions to the multiple test problem reveals challenges to the very assumption that multiplicity must be taken into account and adjusted for. Conor Mayo-Wilson shows that a prevalent argument for the Bonferroni correction, which recommends replacing a p-value threshold α with α/n when testing n independent hypotheses, can violate important axioms of evidence. Correcting error probabilities or p-values for multiple testing, he argues, should be viewed as a value judgment made in deciding which hypotheses or models are worth pursuing.
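For readers who want the Bonferroni rule spelled out, the following is a small sketch of the standard correction: each of the n p-values is compared with the overall threshold (called `alpha` here) divided by n. The function name and the example p-values are made up for illustration and do not come from any of the talks.

```python
def bonferroni_reject(p_values, alpha=0.05):
    """Return, for each p-value, whether its null is rejected after
    replacing the threshold alpha with alpha / n (n = number of tests)."""
    n = len(p_values)
    threshold = alpha / n
    return [p <= threshold for p in p_values]


if __name__ == "__main__":
    # Hypothetical p-values from four tests
    p_values = [0.001, 0.012, 0.03, 0.2]
    unadjusted = [p <= 0.05 for p in p_values]
    adjusted = bonferroni_reject(p_values, alpha=0.05)
    print(unadjusted)  # [True, True, True, False]
    print(adjusted)    # [True, True, False, False]
```

The third hypothesis is rejected at the unadjusted .05 level but not after correction. Whether the same data should count as different evidence depending on how many other hypotheses were tested is the kind of question the argument above puts under scrutiny.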
Using the same data to construct and stringently test causal relationships. Under the guise of fixing the problem of selective reporting, it is increasingly recommended that scientists predesignate all details of experimental procedure, number of tests run, and rules for collecting and analyzing data in advance of the experiment. Clark Glymour asks if predesignation comes at the cost of high Type II error probability—erroneously failing to find effects—and lost opportunities for discovery. In contemporary science, Glymour argues, in which the number of variables is large in comparison to the sample size, principled search algorithms can be invaluable. Some of the leading research areas of machine learning and AI develop “post-selection inferences” that violate the rule against finding one’s hypothesis in the data. These adaptive methods attempt to arrive at reliable results by compensating for the fact that the model was picked in a data-dependent way using methods such as cross validation, simulation, and bootstrapping. Glymour argues that some of these methods are a form of “severe testing” of their output, whereas commonly used regression methods are actually “bad” data dredging methods that do not severely test their results. For both frequentist and Bayesian statistics, search procedures press epistemic issues about how using observational data to try to reach beyond experimental possibilities should be evaluated for accuracy and reliability. We suggest, in each of our contributions, some principled ways to distinguish “bad” from “good” data dredging.
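As a toy illustration of why a search needs compensating, and not an implementation of any algorithm discussed in the talks, the sketch below dredges pure noise for the predictor most correlated with an outcome on one half of the data, then checks that predictor on the held-out half. All names, sizes, and the seed are assumptions chosen for the example.

```python
import random
from math import sqrt


def pearson(x, y):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def dredge_then_check(n_vars=200, n_obs=100, seed=1):
    """Search noise variables on the first half of the data, then evaluate
    the selected variable on the second half.

    Returns (index of selected variable, search-half r, holdout-half r).
    """
    rng = random.Random(seed)
    # Outcome and candidate predictors are all independent noise,
    # so every true correlation is zero.
    y = [rng.gauss(0, 1) for _ in range(n_obs)]
    xs = [[rng.gauss(0, 1) for _ in range(n_obs)] for _ in range(n_vars)]
    half = n_obs // 2
    search_r = [pearson(x[:half], y[:half]) for x in xs]
    best = max(range(n_vars), key=lambda j: abs(search_r[j]))
    holdout_r = pearson(xs[best][half:], y[half:])
    return best, search_r[best], holdout_r


if __name__ == "__main__":
    best, r_search, r_holdout = dredge_then_check()
    print(f"selected variable {best}: search r = {r_search:.2f}, "
          f"holdout r = {r_holdout:.2f}")
```

The correlation found by searching is large purely because it was selected as the largest of many. The holdout correlation was not selected, so it is an honest check on the searched-for hypothesis. Holding out data is only one simple form of compensation, and it spends half the sample on the check.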
Error probabilities and epistemic assessments. Controversies between Bayesian and frequentist methods reflect different answers to the question of the role of probability in inference: to supply a measure of belief or support in hypotheses, or to control a method’s error probabilities? While a criticism often leveled at Type I and II error probabilities is that they do not give direct assessments of epistemic probability, Bayesians are also often keen to show their methods have good performance in repeated sampling. Can the performance of a method under hypothetical uses also supply epistemically relevant measures of belief, confidence or corroboration? Suzanne Thornton presents new developments toward an affirmative answer by means of confidence distributions (CDs), which provide confidence intervals for parameters at any level of confidence, not just the typical .95. Even regarding a parameter as fixed, say the mean deflection of light, we can calibrate how reliably a method enables finding out about its values. In this sense, she argues, parameters play a dual role, a possible key to reconciling approaches.
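A minimal sketch of the idea in the simplest textbook case, assumed here for illustration only: a normal mean with known standard deviation. In that case the confidence distribution is itself a normal distribution centered at the sample mean with the standard error as its spread, and a two-sided interval at any level is read off its quantiles. The function names and numbers are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def confidence_distribution(xbar: float, sigma: float, n: int) -> NormalDist:
    """Confidence distribution for a normal mean with known sigma:
    N(xbar, sigma / sqrt(n)), viewed as a function of the parameter."""
    return NormalDist(mu=xbar, sigma=sigma / sqrt(n))


def interval(cd: NormalDist, level: float):
    """Equal-tailed confidence interval at the requested level."""
    tail = (1.0 - level) / 2.0
    return cd.inv_cdf(tail), cd.inv_cdf(1.0 - tail)


if __name__ == "__main__":
    # Hypothetical data summary: sample mean 1.0, sigma 2.0, n = 100
    cd = confidence_distribution(xbar=1.0, sigma=2.0, n=100)
    for level in (0.50, 0.90, 0.95, 0.99):
        lo, hi = interval(cd, level)
        print(f"{level:.2f}: ({lo:.3f}, {hi:.3f})")
```

The parameter is treated as fixed throughout. The distribution summarizes the coverage properties of the interval procedure at every level at once, not a degree of belief about the parameter.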
Deborah Mayo’s idea is to view a method’s ability to control erroneous interpretations of data as measuring its capability to probe errors. In her view, we have evidence for a claim just to the extent that it has been subjected to and passes a test that would probably have found it false, just if it is. This probability is the stringency or severity with which it has passed the test. On the severity view, the question of whether, and when, to adjust a statistical method’s error probabilities in the face of multiple testing and data-dredging (debated by Berger, Glymour, and Mayo-Wilson) is directly connected to the relevance of error control for qualifying a particular statistical inference (discussed by Thornton). Thus a platform for connecting the five contributions emerges.
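One common textbook formulation of severity, assumed here as an illustration and not taken from the talk itself, concerns a one-sided test of a normal mean with known standard deviation. After observing a sample mean `xbar_obs`, the severity with which the claim "mu > mu1" passes is the probability of a sample mean no larger than the one observed, computed as if mu were equal to mu1. The function name and numbers are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def severity_mu_greater(xbar_obs: float, mu1: float,
                        sigma: float, n: int) -> float:
    """Severity for the claim 'mu > mu1' given the observed mean.

    Computed as P(sample mean <= xbar_obs) when mu = mu1, i.e. the
    probability that the test would have produced a less impressive
    result were the claim false (taken at its boundary mu = mu1).
    """
    se = sigma / sqrt(n)
    return NormalDist().cdf((xbar_obs - mu1) / se)


if __name__ == "__main__":
    # Hypothetical setup: sigma = 10, n = 100 (standard error 1), observed mean 2
    for mu1 in (0.0, 1.0, 2.0, 3.0):
        sev = severity_mu_greater(xbar_obs=2.0, mu1=mu1, sigma=10.0, n=100)
        print(f"claim mu > {mu1}: severity = {sev:.3f}")
```

In this setup the same observed mean warrants "mu > 0" with high severity but "mu > 3" with low severity. The calculation presupposes that the stated sampling distribution holds, which is exactly what selection effects and data-dredging can undermine.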
Our goal is to channel some of the sparks that grow out of our contrasting views to vividly illuminate the issues and to point to directions for new interdisciplinary work.