PSA 2016 Symposium:
Philosophy of Statistics in the Age of Big Data and Replication Crises
Friday November 4th 9-11:45 am (includes coffee break 10-10:15)
Location: Piedmont 4 (12th Floor) Westin Peachtree Plaza
- Deborah Mayo (Professor of Philosophy, Virginia Tech, Blacksburg, Virginia) “Controversy Over the Significance Test Controversy”
- Gerd Gigerenzer (Director of Max Planck Institute for Human Development, Berlin, Germany) “Surrogate Science: How Fisher, Neyman-Pearson, and Bayes Were Transformed into the Null Ritual”
- Andrew Gelman (Professor of Statistics & Political Science, Columbia University, New York) “Confirmationist and Falsificationist Paradigms in Statistical Practice”
- Clark Glymour (Alumni University Professor in Philosophy, Carnegie Mellon University, Pittsburgh, Pennsylvania) “Exploratory Research is More Reliable Than Confirmatory Research”
Key Words: big data, frequentist and Bayesian philosophies, history and philosophy of statistics, meta-research, p-values, replication, significance tests.
Science is undergoing a crisis over reliability and reproducibility. High-powered methods are prone to cherry-picking correlations, significance-seeking, and assorted modes of extraordinary rendition of data. The Big Data revolution may encourage a reliance on statistical methods without sufficient scrutiny of whether they are teaching us about the causal processes of interest. Mounting failures of replication in the social and biological sciences have resulted in new institutes for meta-research and replication research, and in widespread efforts to restore scientific integrity and transparency. Statistical significance test controversies, long raging in the social sciences, have spread to all fields that use statistics. At the same time, foundational debates over frequentist and Bayesian methods have shifted in important ways that are often overlooked in the debates. These problems raise philosophical and methodological questions about probabilistic tools, and about the line between science and pseudoscience, that are intertwined with technical statistics and with the history and philosophy of statistics. Our symposium's goal is to address the foundational issues around which the current crisis in science revolves. We combine the insights of philosophers, psychologists, and statisticians whose work interrelates the philosophy and history of statistics with data analysis and modeling.
Philosophy of statistics tackles conceptual and epistemological problems in using probabilistic methods to collect, model, analyze, and draw inferences from data. The problems concern the nature of uncertain evidence, the role and interpretation of probability, reliability, and robustness—all of which link to a long history of disputes of personality and philosophy among frequentists, Bayesians, and likelihoodists (e.g., Fisher, Neyman, Pearson, Jeffreys, Lindley, Savage). Replication failures have led researchers to reexamine their statistical methods. Although novel statistical techniques now use simulations to detect cherry-picking and p-hacking, we see a striking recapitulation of the Bayesian-frequentist debates of old. New philosophical issues arise from the successes of machine learning and Big Data analysis: How do their predictions succeed when model parameters are merely black boxes? One thing we learned in 2015 is why they fail: a tendency to overlook classic statistical issues: confounders, multiple testing, bias, model assumptions, and overfitting. The time is ripe for a forum that illuminates current developments and points to directions for future work by philosophers and methodologists of science.
The New Statistical Significance Test Controversy. Mechanical, cookbook uses of statistical significance tests have long been lampooned in the social sciences, but once high-profile failures revealed poor rates of replication in medicine and cancer research, the problem took on a new seriousness. Drawing on criticisms from social science, however, the new significance test controversy retains caricatures of a “hybrid” view of significance testing, common in psychology (Gigerenzer). Well-known criticisms—statistical significance is not substantive significance, p-values are invalidated by significance seeking and violated model assumptions—are based on uses of methods warned against by the founders of Fisherian and Neyman-Pearson (N-P) tests. A genuine experimental effect, Fisher insisted, cannot be based on a single, isolated significant result (a single low p-value); low p-values had to be generated in multiple settings. Yet sweeping criticisms and recommended changes of method are often based on the rates of false positives assuming a single, just-significant result, with biasing selection effects to boot!
Foundational controversies are tied up with Fisher’s bitter personal feuds with Neyman, and Neyman’s attempt to avoid inconsistencies in Fisher’s “fiducial” probability by means of confidence levels. Only a combined understanding of the early statistical and historical developments can get beyond the received views of the philosophical differences between Fisherian and N-P tests. People should look at the properties of the methods, independent of what the founders supposedly thought.
Bayesian-Frequentist Debates. The Bayesian-frequentist debates need to be revisited. Many discussants, who only a decade ago argued for the “irreconcilability” of frequentist p-values and Bayesian measures, now call for ways to reconcile the two. In today’s most popular Bayesian accounts, prior probabilities in hypotheses do not express degrees of belief but are given by various formal assignments or “defaults,” ideally with minimal impact on the posterior probability. Advocates of unifications are keen to show that Bayesian methods have good (frequentist) long-run performance; and that it is often possible to match frequentist and Bayesian quantities, despite differences in meaning and goals. Other Bayesians deny the idea that Bayesian updating fits anything they actually do in statistics (Gelman). Statistical methods are being decoupled from the philosophies in which they are traditionally couched, calling for new foundations and insights from philosophers.
Is the “Bayesian revolution,” like the significance test revolution before it, ushering in the latest in a universal method and surrogate science (Gigerenzer)? If the key problems of significance tests occur equally with Bayes ratios, confidence intervals and credible regions, then we need a new statistical philosophy to underwrite alternative, more self-critical methods (Mayo).
The Big Data Revolution. New data acquisition procedures in biology and neuroscience yield enormous quantities of high-dimensional data, which can only be analyzed by computerized search procedures. But the most commonly used search procedures have known liabilities and can often only be validated using computer simulations. Analyses used to find predictors in areas such as medical diagnostics are so new that their statistical properties are often unknown, making them ethically problematic. Questions arise about the very nature of replication and validation, and of reliability and robustness. Without a more critical analysis of these foibles, the current Human Connectome project to understand brain processes may end in the same disappointments as gene regulation discovery, with its so far unfulfilled promise of reliably predicting personalized cancer treatments (Glymour).
The wealth of computational ability allows for the application of countless methods with little handwringing about foundations, but these methods introduce new quandaries. The techniques that Big Data requires to “clean” and process data introduce biases that are difficult to detect. Can sufficient data obviate the need to satisfy long-standing principles of experimental design? Can data-dependent simulations, resampling, and black-box models ever count as valid replications or genuine model validations?
The Contributors: While participants represent diverse statistical philosophies, there is agreement that a central problem concerns the gaps between the outputs of formal statistical methods and research claims of interest. In addition to illuminating problems, each participant will argue for an improved methodology: an error statistical account of inference (Mayo), a heuristic toolbox (Gigerenzer), Bayesian falsification via predictive distributions (Gelman), and a distinct causal-modeling approach (Glymour).
Controversy Over the Significance Test Controversy
(Professor of Philosophy, Virginia Tech, Blacksburg, Virginia)
In the face of misinterpretations and proposed bans of statistical significance tests, the American Statistical Association gathered leading statisticians in 2015 to articulate statistical fallacies and galvanize discussion of statistical principles. I discuss the philosophical assumptions lurking in the background of their recommendations, linking them also to the talks of my co-symposiasts. As is common, probability is assumed to accord with one of two statistical philosophies: (1) probabilism and (2) (long-run) performance. (1) assumes probability should supply degrees of confirmation, support, or belief in hypotheses, e.g., Bayesian posteriors, likelihood ratios, and Bayes factors; (2) limits probability to long-run reliability in a series of applications, e.g., a “behavioristic” construal of N-P type 1 and 2 error probabilities, or false discovery rates in Big Data.
Assuming probabilism, significance levels are relevant to a particular inference only if misinterpreted as posterior probabilities. Assuming performance, they are criticized as relevant only for quality control, and contexts of repeated applications. Performance is just what’s needed in Big Data searching through correlations (Glymour). But for inference, I sketch a third construal: (3) probativeness. In (2) and (3), unlike (1), probability attaches to methods (testing or estimation), not the hypotheses. These “methodological probabilities” report on a method’s ability to control the probability of erroneous interpretations of data: error probabilities. While significance levels (p-values) are error probabilities, the probing construal in (3) directs their evidentially relevant use.
That a null hypothesis of “no effect” or “no increased risk” is rejected at the .01 level (given adequate assumptions) tells us that 99% of the time a smaller observed difference would result from expected variability alone, were the null hypothesis true. If such statistically significant effects are produced reliably, as Fisher required, they indicate a genuine effect. Looking at the entire p-value distribution under various discrepancies from the null allows inferring which discrepancies are well or poorly indicated. This is akin to confidence intervals, except that we do not fix a single confidence level, and we distinguish the warrant for different points in any interval. My construal connects to Birnbaum’s confidence concept, Popperian corroboration, and possibly Fisherian fiducial probability. The probativeness interpretation better meets the goals driving current statistical reforms.
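The idea that error probabilities attach to the method can be made concrete with a minimal simulation sketch (all numbers here are illustrative, not from the talk): a one-sided z-test of H0: mu = 0 at the .01 level, with rejection rates computed under several discrepancies from the null.

```python
import numpy as np

rng = np.random.default_rng(0)
n, sigma = 100, 1.0          # illustrative sample size and known sd
z_crit = 2.326               # approx. 0.99 quantile of N(0,1): the .01-level cutoff

def rejection_rate(mu, trials=100_000):
    """Proportion of simulated samples (size n, true mean mu) whose
    one-sided z-statistic for H0: mu = 0 exceeds the .01-level cutoff."""
    se = sigma / np.sqrt(n)
    z = rng.normal(mu, se, size=trials) / se
    return (z > z_crit).mean()

print(rejection_rate(0.0))   # ~ .01: under H0, chance variability alone rarely rejects
print(rejection_rate(0.5))   # ~ 1: a discrepancy of 0.5 is almost always detected
print(rejection_rate(0.05))  # ~ .03: a small discrepancy is rarely detected at this level
```

Under the null the method rejects about 1% of the time, as the error probability promises; scanning the rejection rate across discrepancies is a crude version of examining the p-value distribution to see which discrepancies a rejection does or does not warrant.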
Much handwringing stems from hunting for an impressive-looking effect and then reporting a statistically significant finding. The actual probability of erroneously finding significance with this gambit is not low but high, so a reported small p-value is invalid. Flexible choices along “forking paths” from data to inference cause the same problem, even if the criticism is informal (Gelman). The same flexibility, however, infects probabilist reforms, be they likelihood ratios, Bayes factors, highest probability density (HPD) intervals, or lowering the p-value (until the maximally likely alternative gets a .95 posterior). But with these reforms we lose the direct grounds to criticize such moves as flouting error-statistical control. I concur with Gigerenzer’s criticisms of ritual uses of p-values, but without understanding their valid (if limited) role, there’s a danger of accepting reforms that throw out the error control baby with the “bad statistics” bathwater.
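How badly hunting inflates the error probability can be shown in a few lines; this is a hypothetical setup (20 independent null outcomes per study, two-sided .05 tests), with none of the numbers taken from the abstract.

```python
import numpy as np

rng = np.random.default_rng(1)
n_studies, n_outcomes = 10_000, 20
z_crit = 1.96  # approx. two-sided .05 critical value for a z-test

# Each simulated "study" measures 20 independent outcomes whose true effects
# are all zero, then reports whichever outcome happens to reach p < .05.
z = rng.normal(0.0, 1.0, size=(n_studies, n_outcomes))  # z-statistics under the null
hunt_rate = (np.abs(z) > z_crit).any(axis=1).mean()

print(hunt_rate)  # ~ 0.64 (i.e., 1 - .95**20), far above the nominal .05
```

A "significant" result obtained this way carries an actual false-positive probability near two-thirds, which is the sense in which the reported small p-value is invalid.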
Surrogate Science: How Fisher, Neyman-Pearson, and Bayes Were Transformed into the Null Ritual
(Director of Max Planck Institute for Human Development, Berlin, Germany)
If statisticians agree on one thing, it is that scientific inference should not be made mechanically. Despite virulent disagreements on other issues, Ronald Fisher and Jerzy Neyman, two of the most influential statisticians of the 20th century, were of one voice on this matter. Good science requires both statistical tools and informed judgment about what model to construct, what hypotheses to test, and what tools to use. Practicing statisticians rely on a “statistical toolbox” and on their expertise to select a proper tool. Social scientists, in contrast, tend to rely on a single tool.
In this talk, I trace the historical transformation of Fisher’s null hypothesis testing, Neyman-Pearson decision theory, and Bayesian statistics into a single mechanical procedure that is performed like compulsive hand washing: the null ritual. In the social sciences, this transformation has fundamentally changed research practice, making statistical inference its centerpiece. The essence of the null ritual is:
1. Set up a null hypothesis of “no mean difference” or “zero correlation.” Do not specify the predictions of your own research hypothesis.
2. Use 5% as a convention for rejecting the null. If significant, accept your research hypothesis. Report the result as p < .05, p < .01, or p < .001, whichever comes next to the obtained p-value.
3. Always perform this procedure.
I use the term “ritual” because this procedure shares the features that define social rituals: (i) the repetition of the same action, (ii) a focus on special numbers or colors, (iii) fears about serious sanctions for rule violations, and (iv) wishful thinking and delusions that virtually eliminate critical thinking. The null ritual has each of these four characteristics: mindless repetition, the magical 5% number, fear of sanctions by editors or advisors, and delusions about what a p-value means, which block researchers’ intelligence.

Starting in the 1940s, writers of bestselling statistical textbooks for the social sciences silently transformed rivaling statistical systems into an apparently monolithic method that could be used mechanically. The idol of a universal method for scientific inference has been worshipped and institutionalized since the “inference revolution” of the 1950s. Because no such method has ever been found, surrogates have been created, most notably the quest for significant p-values. I show that this form of surrogate science fosters delusions, and I argue that it is one of the reasons for the “borderline cheating” that has done so much harm, creating, for one, a flood of irreproducible results in fields such as psychology, cognitive neuroscience, and tumor marker research.
Today, proponents of the “Bayesian revolution” are in danger of chasing the same chimera: an apparently universal inference procedure. A better path would be to promote an understanding of the various devices in the “statistical toolbox.” I discuss possible explanations of why a toolbox approach to statistics has so far been successfully prevented by journal editors, textbook writers, and social scientists.
Confirmationist and Falsificationist Paradigms in Statistical Practice
(Professor of Statistics & Political Science, Columbia University, New York)
There is a divide in statistics between classical frequentist and Bayesian methods. Classical hypothesis testing is generally taken to follow a falsificationist, Popperian philosophy in which research hypotheses are put to the test and rejected when data do not accord with predictions. Bayesian inference is generally taken to follow a confirmationist philosophy in which data are used to update the probabilities of different hypotheses. We disagree with this conventional Bayesian-frequentist contrast: we argue that classical null hypothesis significance testing is actually used in a confirmationist sense and in fact does not do what it purports to do; and we argue that Bayesian inference cannot in general supply reasonable probabilities of models being true.

The standard research paradigm in social psychology (and elsewhere) seems to be that the researcher has a favorite hypothesis A. But rather than trying to set up hypothesis A for falsification, the researcher picks a null hypothesis B to falsify, and its rejection is then taken as evidence in favor of A. Research projects are framed as quests for confirmation of a theory, and once confirmation is achieved, there is a tendency to declare victory and not think too hard about issues of reliability and validity of measurements.
Instead, we recommend a falsificationist Bayesian approach in which models are altered and rejected based on data. The conventional Bayesian confirmation view blinds many Bayesians to the benefits of predictive model checking. The view is that any Bayesian model necessarily represents a subjective prior distribution and as such could never be tested. It is not only Bayesians who avoid model checking. Quantitative researchers in political science, economics, and sociology regularly fit elaborate models without even the thought of checking their fit.
We can perform a Bayesian test by first assuming the model is true, then obtaining the posterior distribution, and then determining the distribution of the test statistic under hypothetical replicated data under the fitted model. A posterior distribution is not the final end, but is part of the derived prediction for testing. In practice, we implement this sort of check via simulation.
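As a sketch of such a simulation-based check, consider a toy conjugate normal model (an illustrative example, not the authors' actual applications): draw the parameter from its posterior, simulate a replicated dataset, and compare a test statistic between replicated and observed data.

```python
import numpy as np

rng = np.random.default_rng(2)
y = rng.normal(0.0, 1.0, size=50)
y[0] = 6.0  # inject an outlier a normal model should fail to reproduce

# Toy conjugate model: y_i ~ N(mu, 1), prior mu ~ N(0, 10^2).
n = len(y)
post_var = 1.0 / (n + 1.0 / 100.0)
post_mean = post_var * y.sum()

# Posterior predictive check with test statistic T(y) = max(y): draw mu from
# the posterior, simulate a replicated dataset y_rep, and record T(y_rep).
T_obs = y.max()
mu_draws = rng.normal(post_mean, np.sqrt(post_var), size=4000)
T_rep = np.array([rng.normal(mu, 1.0, size=n).max() for mu in mu_draws])

ppp = (T_rep >= T_obs).mean()  # posterior predictive p-value
print(ppp)  # near 0: the fitted model almost never reproduces the observed maximum
```

A posterior predictive p-value near 0 or 1 flags a dimension of the data the model fails to capture; here the check falsifies the normal model's treatment of the injected outlier.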
Posterior predictive checks are disliked by some Bayesians because of their low power, said to arise from their “using the data twice.” For us, low power is not a problem: it simply indicates a dimension of the data that is virtually automatically fit by the model.
What can statistics learn from philosophy? Falsification and the notion of scientific revolutions can make us willing to check our model fit and to vigorously investigate anomalies rather than treat prediction as the only goal of statistics. What can the philosophy of science learn from statistical practice? The success of inference using elaborate models, full of assumptions that are certainly wrong, demonstrates the power of deductive inference, and posterior predictive checking demonstrates that ideas of falsification and error statistics can be applied in a fully Bayesian environment with informative likelihoods and prior distributions.
Exploratory Research is More Reliable Than Confirmatory Research
(Alumni University Professor in Philosophy, Carnegie Mellon University, Pittsburgh, Pennsylvania)
Ioannidis (2005) argued that most published research is false, and that “exploratory” research in which many hypotheses are assessed automatically is especially likely to produce false positive relations. Using simulations, Colquhoun (2014) estimates that 30 to 40% of positive results using the conventional .05 cutoff for rejection of a null hypothesis are false. Their explanation is that true relationships in a domain are rare and the selection of hypotheses to test is roughly independent of their truth, so most relationships tested will in fact be false. Conventional use of hypothesis tests, in other words, suffers from a base rate fallacy. I will show that the reverse is true for modern search methods for causal relations because: (a) each hypothesis is tested or assessed multiple times; (b) the methods are biased against positive results; and (c) systems in which true relationships are rare are an advantage for these methods. I will substantiate the claim with both empirical data and with simulations of data from systems with a thousand to a million variables that result in fewer than 5% false positive relationships and in which 90% or more of the true relationships are recovered.
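The base-rate argument reduces to simple arithmetic. The sketch below uses illustrative values (10% of tested hypotheses true, 80% power, alpha = .05; these particular numbers are assumptions, not figures from the abstract) that happen to land in Colquhoun's 30-40% range.

```python
# If only a fraction `prior` of tested hypotheses are true, tests have the
# given power, and nulls are rejected at level alpha, what fraction of
# "significant" results is false?
def false_discovery_proportion(prior, power, alpha):
    false_pos = (1.0 - prior) * alpha   # false nulls wrongly rejected
    true_pos = prior * power            # true effects correctly detected
    return false_pos / (false_pos + true_pos)

# Illustrative: 10% of hypotheses true, 80% power, alpha = .05.
print(false_discovery_proportion(0.10, 0.80, 0.05))  # ~ 0.36
```

With these inputs, roughly a third of results that clear the .05 bar are false positives, even though every individual test controls its error rate at 5%.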