replication research

On the current state of play in the crisis of replication in psychology: some heresies


The replication crisis has created a “cold war between those who built up modern psychology and those” tearing it down with failed replications–or so I read today [i]. As an outsider (to psychology), the severe tester is free to throw some fuel on the fire on both sides. This is a short update on my post “Some ironies in the replication crisis in social psychology” from 2014.

Following the model from clinical trials, an idea gaining steam is to prespecify a “detailed protocol that includes the study rationale, procedure and a detailed analysis plan” (Nosek et al. 2017). In this new paper, they’re called registered reports (RRs). An excellent start. I say it makes no sense to favor preregistration while denying the relevance to evidence of optional stopping and of outcomes other than the one observed. If your appraisal of the evidence is altered when you actually see the history supplied by the RR, then biasing selection effects matter even when they’re not written down; your statistical method should pick up on them (as do p-values, confidence levels and many other error probabilities). There’s a tension between the RR requirements and accounts that follow the Likelihood Principle (no need to name names [ii]).
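The effect of optional stopping on error probabilities is easy to exhibit. Here is a minimal simulation (my own sketch, not from Nosek et al.): generate data under a true null hypothesis, compute a p-value after each new observation, and stop as soon as nominal significance is reached. The reported “p < 0.05” then occurs far more often than 5% of the time: this is the error probability that a preregistered stopping rule keeps honest.

```python
import math
import random

def optional_stopping_rejects(max_n=100, alpha=0.05, min_n=10, trials=2000, seed=1):
    """Fraction of null experiments that reach nominal p < alpha at SOME interim look."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(trials):
        total, n = 0.0, 0
        while n < max_n:
            total += rng.gauss(0, 1)   # data generated under a true null (mean 0, sd 1)
            n += 1
            if n >= min_n:
                # two-sided p-value for H0: mean = 0, known sigma = 1
                p = math.erfc(abs(total / math.sqrt(n)) / math.sqrt(2))
                if p < alpha:          # peek, and stop at first nominal significance
                    rejections += 1
                    break
    return rejections / trials

print(optional_stopping_rejects())     # well above the nominal 0.05
```

With peeking after every observation from n = 10 to 100, the actual type I error rate lands around 0.3, not 0.05, even though each individual look uses a “significant” cutoff.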

“By reviewing the hypotheses and analysis plans in advance, RRs should also help neutralize P-hacking and HARKing (hypothesizing after the results are known) by authors, and CARKing (critiquing after the results are known) by reviewers with their own investments in the research outcomes, although empirical evidence will be required to confirm that this is the case” (Nosek et al. 2017).

A novel idea is that papers are to be provisionally accepted before the results are in. To the severe tester, that requires the author to explain how she will pinpoint blame for negative results. How will she use them to learn something (improve or falsify claims or methods)? I see nothing in preregistration, in and of itself, so far, to promote that. Existing replication research doesn’t go there. It would be wrong-headed to condemn CARKing, by the way. Post-data criticism of inquiries must be post-data. How else can you check if assumptions were met by the data in hand? [Note 7/12: Of course, what they must not be are ad hoc saves of the original finding, else they are unwarranted–minimal severity.] It would be interesting to see inquiries into potential hidden biases not often discussed. For example, what did the students (experimental subjects) know and when did they know it (the older the effect the more likely they know it)? What’s the attitude toward the finding conveyed (to experimental subjects) by the person running the study? I’ve little reason to point any fingers, it’s just part of the severe tester’s inclination toward cynicism and error probing. (See my “rewards and flexibility hypothesis” in my earlier discussion.)

It’s too soon to see how RRs will fare, but plenty of credit is due to those sticking their necks out to upend the status quo. Research into changing incentives is a field in its own right. The severe tester may, again, appear awfully jaundiced to raise any qualms, but we shouldn’t automatically assume that research into incentivizing researchers to behave in a fashion correlated with good science–data sharing, preregistration–is itself likely to improve the original field. Not without thinking through what would be needed to link statistics up with the substantive hypotheses or problems of interest. (Let me be clear, I love the idea of badges and other carrots; it’s just that the real scientific problems shouldn’t be lost sight of.) We might be incentivizing researchers to study how to incentivize researchers to behave in a fashion correlated with good science.

Surely there are areas where the effects or measurement instruments (or both) simply aren’t genuine. Isn’t it better to falsify them than to keep finding ad hoc ways to save them? Is jumping on the meta-research bandwagon [iii] just another way to succeed in a field that was questionable? Heresies, I know.

To get the severe tester into further hot water, I’ll share with you her view that, in some fields, if they completely ignored statistics and wrote about plausible conjectures about human motivations, prejudices, attitudes etc. they would have been better off. There’s a place for human interest conjectures, backed by interesting field studies rather than experiments on psych students. It’s when researchers try to “test” them using sciency methods that the whole thing becomes pseudosciency.

Please share your thoughts. (I may add to this, calling it (2).)

Nosek, B. A., Ebersole, C. R., DeHaven, A. C., & Mellor, D. T. (2017, July 8). The Preregistration Revolution (PDF). Open Science Framework. Retrieved from

[i] This article mentions a failed replication discussed on Gelman’s blog on July 8, on which I left some comments.

[ii] New readers, please search likelihood principle on this blog

[iii] This must be distinguished from the use of “meta” in describing a philosophical scrutiny of methods (meta-methodology). Statistical meta-researchers do not purport to be doing philosophy of science.

Categories: Error Statistics, preregistration, reforming the reformers, replication research | 9 Comments

Glymour at the PSA: “Exploratory Research is More Reliable Than Confirmatory Research”

I resume my comments on the contributions to our symposium on Philosophy of Statistics at the Philosophy of Science Association. My earlier comment was on Gerd Gigerenzer’s talk. I move on to Clark Glymour’s “Exploratory Research Is More Reliable Than Confirmatory Research.” His complete slides are after my comments.

GLYMOUR’S ARGUMENT (in a nutshell):

“The anti-exploration argument has everything backwards,” says Glymour (slide #11). While John Ioannidis maintains that “Research findings are more likely true in confirmatory designs,” the opposite is so, according to Glymour. (Ioannidis 2005, Glymour’s slide #6). Why? To answer this he describes an exploratory research account for causal search that he has been developing:

(slide #5)

What’s confirmatory research for Glymour? It’s moving directly from rejecting a null hypothesis with a low P-value to inferring a causal claim. Continue reading

Categories: fallacy of rejection, P-values, replication research | 20 Comments

Announcement: Scientific Misconduct and Scientific Expertise

Scientific Misconduct and Scientific Expertise

1st Barcelona HPS workshop

November 11, 2016

Departament de Filosofia & Centre d’Història de la Ciència (CEHIC),  Universitat Autònoma de Barcelona (UAB)

Location: CEHIC, Mòdul de Recerca C, Seminari L3-05, c/ de Can Magrans s/n, Campus de la UAB, 08193 Bellaterra (Barcelona)

Organized by Thomas Sturm & Agustí Nieto-Galan

Current science is full of uncertainties and risks that weaken the authority of experts. Moreover, sometimes scientists themselves act in ways that weaken their standing: they manipulate data, exaggerate research results, do not give credit where it is due, violate the norms for the acquisition of academic titles, or are unduly influenced by commercial and political interests. Such actions, of which there are numerous examples in past and present times, are widely conceived of as violating standards of good scientific practice. At the same time, while codes of scientific conduct have been developed in different fields, institutions, and countries, there is no universally agreed canon of them, nor is it clear that there should be one. The workshop aims to bring together historians and philosophers of science in order to discuss questions such as the following: What exactly is scientific misconduct? Under which circumstances are researchers more or less liable to misconduct? How far do cases of misconduct undermine scientific authority? How have standards or mechanisms to avoid misconduct, and to regain scientific authority, been developed? How should they be developed?

All welcome – but since space is limited, please register in advance. Write to:

09:30 Welcome (Thomas Sturm & Agustí Nieto-Galan) Continue reading

Categories: Announcement, replication research | 7 Comments

What do these share in common: m&ms, limbo stick, ovulation, Dale Carnegie? Are we lowering the bar?


For entertainment only

In a post 3 years ago (“What do these share in common: m&m’s, limbo stick, ovulation, Dale Carnegie? Sat night potpourri”), I expressed doubts about expending serious effort to debunk the statistical credentials of studies that most readers without any statistical training would regard as “for entertainment only,” dubious, or pseudoscientific quackery. It needn’t even be that the claim is implausible; what’s implausible is that it has been well probed by the experiment at hand. Given the attention being paid to such examples by some leading statisticians, and scores of replication researchers over the past 3 years–attention that has been mostly worthwhile–maybe the bar has been lowered. What do you think? Anyway, this is what I blogged 3 years ago. (Oh, I decided to put in a home-made cartoon!) Continue reading

Categories: junk science, replication research, Statistics | 2 Comments

Mayo & Parker “Using PhilStat to Make Progress in the Replication Crisis in Psych” SPSP Slides

Here are the slides from our talk at the Society for Philosophy of Science in Practice (SPSP) conference. I covered the first 27, Parker the rest. The abstract is here:

Categories: P-values, reforming the reformers, replication research, Statistics, StatSci meets PhilSci | Leave a comment

“Using PhilStat to Make Progress in the Replication Crisis in Psych” at Society for PhilSci in Practice (SPSP)

I’m giving a joint presentation with Caitlin Parker[1] on Friday (June 17) at the meeting of the Society for Philosophy of Science in Practice (SPSP): “Using Philosophy of Statistics to Make Progress in the Replication Crisis in Psychology” (Rowan University, Glassboro, N.J.)[2] The Society grew out of a felt need to break out of the sterile straitjacket wherein philosophy of science occurs divorced from practice. The topic of the relevance of PhilSci and PhilStat to Sci has often come up on this blog, so people might be interested in the SPSP mission statement below our abstract.

Using Philosophy of Statistics to Make Progress in the Replication Crisis in Psychology

Deborah Mayo Virginia Tech, Department of Philosophy United States
Caitlin Parker Virginia Tech, Department of Philosophy United States

Continue reading

Categories: Announcement, replication research, reproducibility | 8 Comments

My Slides: “The Statistical Replication Crisis: Paradoxes and Scapegoats”

Below are the slides from my Popper talk at the LSE today (up to slide 70): (post any questions in the comments)


Categories: P-values, replication research, reproducibility, Statistics | 11 Comments

Some bloglinks for my LSE talk tomorrow: “The Statistical Replication Crisis: Paradoxes and Scapegoats”

In my Popper talk tomorrow (in London), I will discuss topics in philosophy of statistics in relation to: the 2016 ASA document on P-values, and recent replication research in psychology. For readers interested in links from this blog, see:

I. My commentary on the ASA document on P-values (with links to the ASA document):

“Don’t Throw Out the Error Control Baby with the Bad Statistics Bathwater”

“A Small P-value Indicates that the Results are Due to Chance Alone: Fallacious or not: More on the ASA P-value Doc”

“P-Value Madness: A Puzzle About the Latest Test Ban, or ‘Don’t Ask, Don’t Tell’”

II. Posts on replication research in psychology: Continue reading

Categories: Metablog, P-values, replication research, reproducibility, Statistics | 7 Comments

Repligate Returns (or, non-significant results are the new significant results)

Sell me that antiseptic!

unscrambling soap words clears me of this deed (aosp)

Remember “Repligate”? [“Some Ironies in the Replication Crisis in Social Psychology“] and, more recently, the much publicized attempt to replicate 100 published psychology articles by the Open Science Collaboration (OSC) [“The Paradox of Replication“]? Well, some of the critics involved in Repligate have just come out with a criticism of the OSC results, claiming they’re way, way off in their low estimate of replications in psychology [1]. (The original OSC report is here.) I’ve only scanned the critical article quickly, but some bizarre statistical claims leap out at once. (Where do they get this notion about confidence intervals?) It’s published in Science! There’s also a response from the OSC researchers. Neither group adequately scrutinizes the validity of many of the artificial experiments and proxy variables–an issue I’ve been on about for a while. Without firming up the statistics-research link, no statistical fixes can help. I’m linking to the articles here for your weekend reading. I invite your comments!  For some reason a whole bunch of items of interest, under the banner of “statistics and the replication crisis,” are all coming out at around the same time, and who can keep up? March 7 brings yet more! (Stay tuned). Continue reading

Categories: replication research, reproducibility, Statistics | 21 Comments

Statistical Challenges in Assessing and Fostering the Reproducibility of Scientific Results




I generally find National Academy of Science (NAS) manifestos highly informative. I only gave a quick reading to around 3/4 of this one. I thank Hilda Bastian for twittering the link. Before giving my impressions, I’m interested to hear what readers think, whenever you get around to having a look. Here’s from the intro*:

Questions about the reproducibility of scientific research have been raised in numerous settings and have gained visibility through several high-profile journal and popular press articles. Quantitative issues contributing to reproducibility challenges have been considered (including improper data management and analysis, inadequate statistical expertise, and incomplete data, among others), but there is no clear consensus on how best to approach or to minimize these problems…
Continue reading

Categories: Error Statistics, replication research, reproducibility, Statistics | 53 Comments

The Paradox of Replication, and the vindication of the P-value (but she can go deeper) 9/2/15 update (ii)


The unpopular P-value is invited to dance.

  1. The Paradox of Replication

Critic 1: It’s much too easy to get small P-values.

Critic 2: We find it very difficult to get small P-values; only 36 of 100 psychology experiments were found to yield small P-values in the recent Open Science Collaboration replication project in psychology.

Is it easy or is it hard?

You might say, there’s no paradox, the problem is that the significance levels in the original studies are often due to cherry-picking, multiple testing, optional stopping and other biasing selection effects. The mechanism by which biasing selection effects blow up P-values is very well understood, and we can demonstrate exactly how it occurs. In short, many of the initially significant results merely report “nominal” P-values not “actual” ones, and there’s nothing inconsistent between the complaints of critic 1 and critic 2.
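For those who want to see the mechanism in action, here is a toy demonstration (mine, not from either critic or the OSC reports): test 20 independent outcomes, all with the null true, and write up only the smallest p-value. The reported “nominal” p-value looks impressive, while the “actual” probability of obtaining one that small by selection alone is about 0.64.

```python
import math
import random

def two_sided_p(z):
    """Two-sided p-value for a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2))

def cherry_pick(n_outcomes=20, trials=5000, seed=7):
    """Every outcome is null; report how often the SMALLEST of 20 p-values
    falls below the nominal 0.05 cutoff."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        ps = [two_sided_p(rng.gauss(0, 1)) for _ in range(n_outcomes)]
        if min(ps) < 0.05:       # only the "best" outcome gets written up
            hits += 1
    return hits / trials

actual = 1 - 0.95 ** 20          # exact error probability, about 0.64
print(cherry_pick(), actual)
```

The simulated rate agrees with the exact calculation 1 − 0.95^20 ≈ 0.64: a “nominal” p below 0.05 is obtained nearly two-thirds of the time even though nothing is going on, which is why the replications, performed without the selection, so often fail.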

The resolution of the paradox attests to what many have long been saying: the problem is not with the statistical methods but with their abuse. Even the P-value, the most unpopular girl in the class, gets to show a little bit of what she’s capable of. She will give you a hard time when it comes to replicating nominally significant results, if they were largely due to biasing selection effects. That is just what is wanted; it is an asset that she feels the strain, and lets you know. It is statistical accounts that can’t pick up on biasing selection effects that should worry us (especially those that deny they are relevant). That is one of the most positive things to emerge from the recent, impressive, replication project in psychology. From an article in the Smithsonian magazine “Scientists Replicated 100 Psychology Studies, and Fewer Than Half Got the Same Results”:

The findings also offered some support for the oft-criticized statistical tool known as the P value, which measures whether a result is significant or due to chance. …

The project analysis showed that a low P value was fairly predictive of which psychology studies could be replicated. Twenty of the 32 original studies with a P value of less than 0.001 could be replicated, for example, while just 2 of the 11 papers with a value greater than 0.04 were successfully replicated. (Link is here.)
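The arithmetic behind those replication rates, computed from the counts quoted above:

```python
# Replication rates by original P-value band (counts from the OSC project
# as quoted in the Smithsonian article above)
low_p_rate = 20 / 32     # originals with p < 0.001 that replicated
high_p_rate = 2 / 11     # originals with p > 0.04 that replicated
print(round(low_p_rate, 3), round(high_p_rate, 3))   # prints 0.625 0.182
```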

Continue reading

Categories: replication research, reproducibility, spurious p values, Statistics | 21 Comments
