Below are the slides from my June 14 presentation at the X-Phil conference on Reproducibility and Replicability in Psychology and Experimental Philosophy at University College London. What I think must be examined seriously are the “hidden” issues that are going unattended in replication research and related statistics wars. An overview of the “hidden controversies” are on slide #3. Although I was presenting them as “hidden”, I hoped they wouldn’t be quite as invisible as I found them through the conference. (Since my talk was at the start, I didn’t know what to expect–else I might have noted some examples that seemed to call for further scrutiny). Exceptions came largely (but not exclusively) from a small group of philosophers (me, Machery and Fletcher). Then again,there were parallel sessions, so I missed some. However, I did learn something about X-phil, particularly from the very interesting poster session . This new area should invite much, much more scrutiny of statistical methodology from philosophers of science.
 The women who organized and ran the conference did an excellent job: Lara Kirfel, a psychology PhD student at UCL, and Pascale Willemsen from Ruhr University.
However, it’s mandatory to adjust for selection effects, and Benjamini is one of the leaders in developing ways to carry out the adjustments. Even calling out the avenues for cherry-picking and multiple testing, long known to invalidate p-values, would make replication research more effective (and less open to criticism). Continue reading
The replication crisis has created a “cold war between those who built up modern psychology and those” tearing it down with failed replications–or so I read today [i]. As an outsider (to psychology), the severe tester is free to throw some fuel on the fire on both sides. This is a short update on my post “Some ironies in the replication crisis in social psychology” from 2014.
Following the model from clinical trials, an idea gaining steam is to prespecify a “detailed protocol that includes the study rationale, procedure and a detailed analysis plan” (Nosek et.al. 2017). In this new paper, they’re called registered reports (RRs). An excellent start. I say it makes no sense to favor preregistration and deny the relevance to evidence of optional stopping and outcomes other than the one observed. That your appraisal of the evidence is altered when you actually see the history supplied by the RR is equivalent to worrying about biasing selection effects when they’re not written down; your statistical method should pick up on them (as do p-values, confidence levels and many other error probabilities). There’s a tension between the RR requirements and accounts following the Likelihood Principle (no need to name names [ii]). Continue reading
I resume my comments on the contributions to our symposium on Philosophy of Statistics at the Philosophy of Science Association. My earlier comment was on Gerd Gigerenzer’s talk. I move on to Clark Glymour’s “Exploratory Research Is More Reliable Than Confirmatory Research.” His complete slides are after my comments.
GLYMOUR’S ARGUMENT (in a nutshell):
“The anti-exploration argument has everything backwards,” says Glymour (slide #11). While John Ioannidis maintains that “Research findings are more likely true in confirmatory designs,” the opposite is so, according to Glymour. (Ioannidis 2005, Glymour’s slide #6). Why? To answer this he describes an exploratory research account for causal search that he has been developing:
What’s confirmatory research for Glymour? It’s moving directly from rejecting a null hypothesis with a low P-value to inferring a causal claim. Continue reading
Scientific Misconduct and Scientific Expertise
1st Barcelona HPS workshop
November 11, 2016
Departament de Filosofia & Centre d’Història de la Ciència (CEHIC), Universitat Autònoma de Barcelona (UAB)
Location: CEHIC, Mòdul de Recerca C, Seminari L3-05, c/ de Can Magrans s/n, Campus de la UAB, 08193 Bellaterra (Barcelona)
Organized by Thomas Sturm & Agustí Nieto-Galan
Current science is full of uncertainties and risks that weaken the authority of experts. Moreover, sometimes scientists themselves act in ways that weaken their standing: they manipulate data, exaggerate research results, do not give credit where it is due, violate the norms for the acquisition of academic titles, or are unduly influenced by commercial and political interests. Such actions, of which there are numerous examples in past and present times, are widely conceived of as violating standards of good scientific practice. At the same time, while codes of scientific conduct have been developed in different fields, institutions, and countries, there is no universally agreed canon of them, nor is it clear that there should be one. The workshop aims to bring together historians and philosophers of science in order to discuss questions such as the following: What exactly is scientific misconduct? Under which circumstances are researchers more or less liable to misconduct? How far do cases of misconduct undermine scientific authority? How have standards or mechanisms to avoid misconduct, and to regain scientific authority, been developed? How should they be developed?
All welcome – but since space is limited, please register in advance. Write to: Thomas.Sturm@uab.cat
09:30 Welcome (Thomas Sturm & Agustí Nieto-Galan) Continue reading
For entertainment only
In a post 3 years ago (“What do these share in common: m&m’s, limbo stick, ovulation, Dale Carnegie? Sat night potpourri”), I expressed doubts about expending serious effort to debunk the statistical credentials of studies that most readers without any statistical training would regard as “for entertainment only,” dubious, or pseudoscientific quackery. It needn’t even be that the claim is implausible, what’s implausible is that it has been well probed in the experiment at hand. Given the attention being paid to such examples by some leading statisticians, and scores of replication researchers over the past 3 years–attention that has been mostly worthwhile–maybe the bar has been lowered. What do you think? Anyway, this is what I blogged 3 years ago. (Oh, I decided to put in a home-made cartoon!) Continue reading
Here are the slides from our talk at the Society for Philosophy of Science in Practice (SPSP) conference. I covered the first 27, Parker the rest. The abstract is here:
I’m giving a joint presentation with Caitlin Parker on Friday (June 17) at the meeting of the Society for Philosophy of Science in Practice (SPSP): “Using Philosophy of Statistics to Make Progress in the Replication Crisis in Psychology” (Rowan University, Glassboro, N.J.) The Society grew out of a felt need to break out of the sterile straightjacket wherein philosophy of science occurs divorced from practice. The topic of the relevance of PhilSci and PhilStat to Sci has often come up on this blog, so people might be interested in the SPSP mission statement below our abstract.
Using Philosophy of Statistics to Make Progress in the Replication Crisis in Psychology
Deborah Mayo Virginia Tech, Department of Philosophy United States
Caitlin Parker Virginia Tech, Department of Philosophy United States
Below are the slides from my Popper talk at the LSE today (up to slide 70): (post any questions in the comments)
unscrambling soap words clears me of this deed (aosp)
Remember “Repligate”? [“Some Ironies in the Replication Crisis in Social Psychology“] and, more recently, the much publicized attempt to replicate 100 published psychology articles by the Open Science Collaboration (OSC) [“The Paradox of Replication“]? Well, some of the critics involved in Repligate have just come out with a criticism of the OSC results, claiming they’re way, way off in their low estimate of replications in psychology . (The original OSC report is here.) I’ve only scanned the critical article quickly, but some bizarre statistical claims leap out at once. (Where do they get this notion about confidence intervals?) It’s published in Science! There’s also a response from the OSC researchers. Neither group adequately scrutinizes the validity of many of the artificial experiments and proxy variables–an issue I’ve been on about for a while. Without firming up the statistics-research link, no statistical fixes can help. I’m linking to the articles here for your weekend reading. I invite your comments! For some reason a whole bunch of items of interest, under the banner of “statistics and the replication crisis,” are all coming out at around the same time, and who can keep up? March 7 brings yet more! (Stay tuned). Continue reading
Statistical Challenges in Assessing and Fostering the Reproducibility of Scientific Results
I generally find National Academy of Science (NAS) manifestos highly informative. I only gave a quick reading to around 3/4 of this one. I thank Hilda Bastian for twittering the link. Before giving my impressions, I’m interested to hear what readers think, whenever you get around to having a look. Here’s from the intro*:
Questions about the reproducibility of scientific research have been raised in numerous settings and have gained visibility through several high-profile journal and popular press articles. Quantitative issues contributing to reproducibility challenges have been considered (including improper data management and analysis, inadequate statistical expertise, and incomplete data, among others), but there is no clear consensus on how best to approach or to minimize these problems…
The unpopular P-value is invited to dance.
- The Paradox of Replication
Critic 1: It’s much too easy to get small P-values.
Critic 2: We find it very difficult to get small P-values; only 36 of 100 psychology experiments were found to yield small P-values in the recent Open Science collaboration on replication (in psychology).
Is it easy or is it hard?
You might say, there’s no paradox, the problem is that the significance levels in the original studies are often due to cherry-picking, multiple testing, optional stopping and other biasing selection effects. The mechanism by which biasing selection effects blow up P-values is very well understood, and we can demonstrate exactly how it occurs. In short, many of the initially significant results merely report “nominal” P-values not “actual” ones, and there’s nothing inconsistent between the complaints of critic 1 and critic 2.
The resolution of the paradox attests to what many have long been saying: the problem is not with the statistical methods but with their abuse. Even the P-value, the most unpopular girl in the class, gets to show a little bit of what she’s capable of. She will give you a hard time when it comes to replicating nominally significant results, if they were largely due to biasing selection effects. That is just what is wanted; it is an asset that she feels the strain, and lets you know. It is statistical accounts that can’t pick up on biasing selection effects that should worry us (especially those that deny they are relevant). That is one of the most positive things to emerge from the recent, impressive, replication project in psychology. From an article in the Smithsonian magazine “Scientists Replicated 100 Psychology Studies, and Fewer Than Half Got the Same Results”:
The findings also offered some support for the oft-criticized statistical tool known as the P value, which measures whether a result is significant or due to chance. …
The project analysis showed that a low P value was fairly predictive of which psychology studies could be replicated. Twenty of the 32 original studies with a P value of less than 0.001 could be replicated, for example, while just 2 of the 11 papers with a value greater than 0.04 were successfully replicated. (Link is here.)