David Mellor, from the Center for Open Science, emailed me asking if I’d announce his Preregistration Challenge on my blog, and I’m glad to do so. You win $1,000 if your properly preregistered paper is published. The recent replication effort in psychology showed, despite the common refrain – “it’s too easy to get low P-values” – that in preregistered replication attempts it’s actually very difficult to get small P-values. (I call this the “paradox of replication”[1].) Here’s our e-mail exchange from this morning:
Dear Deborah Mayo,
I’m reaching out to individuals who I think may be interested in our recently launched competition, the Preregistration Challenge (https://cos.io/prereg). Based on your blogging, I thought it could be of interest to you and to your readers.
In case you are unfamiliar with it, preregistration specifies the precise study protocols and analytical decisions before data collection, in order to separate the hypothesis-generating exploratory work from the hypothesis-testing confirmatory work.
Though required by law in clinical trials, it is virtually unknown within the basic sciences. We are trying to encourage this new behavior by offering $1,000 prizes to 1,000 researchers who publish the results of their preregistered work.
Please let me know if this is something you would consider blogging about or sharing in other ways. I am happy to discuss further.
Best,
David
David Mellor, PhD
Project Manager, Preregistration Challenge, Center for Open Science
David: Yes I’m familiar with it, and I hope that it encourages people to avoid data-dependent determinations that bias results. It shows the importance of statistical accounts that can pick up on such biasing selection effects. On the other hand, coupling prereg with some of the flexible inference accounts now in use won’t really help. Moreover, there may, in some fields, be a tendency to pursue non-novel, fairly trivial research questions.
And if they’re going to preregister, why not go blind as well? Will they?
Best,
Mayo
David Mellor 10:45 AM
We’re working now on our evaluation of the effect of preregistration to try to answer those two questions. The question of whether people only register trivial questions will be answerable through content-expert evaluation: asking experts to evaluate a series of research questions drawn from registered publications and from similar, unregistered ones. So far we have seen some replications, but also several novel research ideas come through the submission process.
I believe a similar method could be used to evaluate the degree to which flexibility in inference is affected by preregistration. We are requiring as much of that as possible in the preregistrations, so even if it’s not bulletproof, I think this does get at a big part of the problem.
We’re not requiring blind data analysis at this point, mostly because we already have a lot of requirements for the competition as it stands. We’re hoping to nudge individuals to a greater number of best practices, including blind data analysis and data sharing, but want to meet people where they are as much as possible and encourage better practices from there.
Best, David
David Mellor, PhD
Project Manager, Preregistration Challenge, Center for Open Science
I think David’s reply is interesting, and maybe a little surprising (in a good way) – one reason I’m posting this. Here are some scattered remarks:
The fact that you have a hard time replicating a significant finding attained thanks to hunting and flexible methods is actually a point in favor of significance testing: biasing selection effects show up (in an invalid P-value) and thus are detectable. This can be demonstrated. It is an open question whether methods like Bayes ratios (as currently used) have any similar, built-in alarm mechanism.
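To make this concrete, here is a minimal simulation sketch of my own (it is not part of the exchange, and it assumes Python with numpy and scipy): a researcher hunts across ten outcome measures, all pure noise, and reports whichever gives the smallest P-value. The error-probability calculation registers the bias that the reported “p < .05” hides.

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_sims, n_outcomes, n_per_group = 5000, 10, 30
false_alarms = 0

for _ in range(n_sims):
    # all ten outcomes are pure noise: the null hypothesis is true for each
    treatment = rng.normal(size=(n_outcomes, n_per_group))
    control = rng.normal(size=(n_outcomes, n_per_group))
    p_values = [stats.ttest_ind(treatment[i], control[i]).pvalue
                for i in range(n_outcomes)]
    if min(p_values) < 0.05:  # report whichever outcome looks best
        false_alarms += 1

# The nominal level of the reported test is 0.05, but the actual rate of
# declaring "significance" is about 1 - 0.95**10, i.e. roughly 0.40.
print("actual rate of reporting p < .05:", false_alarms / n_sims)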
In a curmudgeonly mood, let me kvetch some more. A crucial problem with most studies goes beyond the formal statistics to the question of erroneously taking statistical effects as giving information about research hypotheses. I don’t see sufficient attention being paid to this. It’s a well-known fallacy to take statistical significance as substantive significance (when the substantive claim hasn’t been well probed by the statistical effort), so it’s blocked by a non-fallacious use of statistical tests. By contrast, a popular way to block this is to give a high prior to the “no effect” null. The trouble is, this enables (rather than showing what’s fallacious about) the move from statistical to substantive significance. Rather than a methodological criticism, as it should be, it becomes a disagreement about the plausibility or truth of the substantive claim.
I’d like to start seeing rewards given to papers that critically examine, and perhaps falsify, some of the standard experimental and measurement assumptions underlying questionable research in the social sciences. For some discussion on this blog, see [2].
What do readers think?
[1] The paradox of replication
Critic 1: It’s much too easy to get small P-values.
Critic 2: We find it very difficult to get small P-values; only 36 of 100 psychology experiments were found to yield small P-values in the recent Open Science Collaboration replication project in psychology.
For a resolution to the paradox, see this post.
[2] “Some Ironies in the Replication Crisis in Social Psychology,” and “Out Damned Pseudoscience”.
More carrots or more sticks? We keep hearing that the lure of rewards associated with publication tends to encourage researchers to exploit flexibilities and give way, perhaps unconsciously, to a host of verification biases. Let’s suppose this hypothesis is correct. In deciding between the carrots and sticks of reward (for publication) and punishment (for QRPs, questionable research practices), the choice often seems to be more carrots. What do you think? In the replication project in psychology, for example, you were assured of publishing your non-replication of a pre-registered, previously significant finding. Now we have $1,000 for a published, pre-registered finding, to deter reinterpreting or erasing unwelcome results. (Researchers can make changes, if justified.) Granted, the researchers whose work couldn’t be replicated got a stick. But the review was careful not to suggest flaws in the initial study, even claiming they merely found a smaller effect. I understand that no one wants to discourage replication by making failures too painful. It would be better for researchers to be self-critical, and for constructive methodological criticism from others to be commonplace. I happen to come from a field where scholars routinely pick apart each other’s arguments, pointing up gaps and unwarranted assumptions. Is it always constructive? No, maybe only half the time, but it’s something.
I came across your tweet:
Deborah G. Mayo @learnfromerror · 6 hours ago
After winning, maybe tell OSF to keep the $1,000 #PreRegChallenge on the @OSFramework https://cos.io/prereg #OpenScience via @OSFramework
That might be a winner’s brave show of honesty, real or feigned. You do not say that here, unless that’s the message of your carrot-stick comment. Just curious.
If they’re allowed to make alterations afterwards, then it kind of negates the effort, doesn’t it?
e.berk:
Twitter is the perfect way to be provocative in 2 seconds, and I’m trying to avoid it. I recall people saying some time last year that Twitter is ruining blog discussions, and there’s definitely something to that. So I’m coming back to the blog.
But anyway, having said all that, I do think that winners should give the $1,000 back to OSF to spend on future programs. Settle for a badge.
On making alterations post-data, the reason it needn’t negate the goal is that, I assume, they have to be reported as changes. Then people can see, perhaps, how easy it is to mount a highly convincing rationale for post-data adjustments!
I don’t know, I think it will depend greatly on the field.
Preregistration is a good idea and should be basic to science. I referee for a journal that essentially requires preregistration; I think it usually happens for RCTs. Observational studies are another matter, where mostly it is not done. Feinstein (Science, 1988) recognized the problem and, I think, called for preregistration of observational studies. There had been some rather spectacular failures to replicate. So what happened? A new journal, Epidemiology, appeared in 1990, and the founding editor said that it is not necessary to correct for multiple testing! Gresham’s Law won again. The journal was a success, but science was not.
http://icis.ucdavis.edu/?p=826
GAMING METRICS: INNOVATION & SURVEILLANCE IN ACADEMIC MISCONDUCT
Stan:
Thanks for your comment and links. The gaming metrics conference looks intriguing; are you going to it? I only scanned the names quickly and noticed that 2 philosophers are included, which is good. A lot of this meta-research seems to go on in California (far from me), but that’s based on a small sample. Metagaming and metacheating, I love it.
I find it quite interesting that you mention the journal Epidemiology in 1990 with a founding editor saying it’s not necessary to adjust for multiple testing. (I understand his point about false negatives, but if the inference is positive, that’s irrelevant, and the p-value needs adjusting, or at least the multiple testing should be revealed.) But what’s really interesting to me is that this editor, I take it it’s Rothman, also famously banned p-values from the journal and is regarded as some kind of hero for doing this, at least in many groups. He is very often mentioned in relation to the recent test ban in some psych journal; see “Don’t Ask, Don’t Tell” (https://errorstatistics.com/2015/10/10/p-value-madness-a-puzzle-about-the-latest-test-ban-or-dont-ask-dont-tell/).
This links exactly to my last set of points on this post: if you ban p-values, replacing them with Bayes ratios or other measures that do not pick up on things like multiple testing, then you have thrown away the very tool that justifies the critique the preregistrationists are on about. The issue is serious because the latest movement promoted by the American Statistical Association is to downplay, if not abolish, the use of p-values. Their document isn’t out, but that appears to be the sentiment, according to whispers I hear. The meta-research outfit at Stanford is also led by likelihoodists (who deny the relevance of adjusting for biasing selection effects) and p-bashers. I agree with them entirely about the fallacy of blurring statistical and substantive significance, and concur as well in opposing cookbook statistics, but there’s a philosophical tension here that is being overlooked. The tension is between measures that ensure and assess error probabilities and those that don’t. The irony is that the handwringing about statistics revolves around failing to control error probabilities. I plan to write about this, but interested readers should see “Statistical Reforms Without Philosophy Are Blind”: https://errorstatistics.com/2015/10/18/statistical-reforms-without-philosophy-are-blind-ii-update/
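To put a number on the multiple-testing point above, here is a minimal arithmetic sketch of my own (not from Rothman’s journal, the ASA, or anyone quoted here), showing how the familywise error rate grows when many tests are left unadjusted, and what a simple Bonferroni adjustment restores.

# With m independent tests of true null hypotheses, each at level alpha,
# the chance of at least one nominally "significant" result is
# 1 - (1 - alpha)**m. A Bonferroni adjustment (test each at alpha/m)
# brings the familywise error rate back to roughly alpha.
alpha, m = 0.05, 20

familywise_error = 1 - (1 - alpha) ** m              # about 0.64
bonferroni_level = alpha / m                         # 0.0025 per test
controlled_error = 1 - (1 - bonferroni_level) ** m   # about 0.049

print(f"unadjusted familywise error rate: {familywise_error:.2f}")
print(f"Bonferroni per-test level:        {bonferroni_level:.4f}")
print(f"familywise error after adjusting: {controlled_error:.3f}")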
Stan Young paints in rather broad strokes when he appears to claim Rothman never advocates taking account of multiple comparisons. Rothman, Greenland, and Lash (2009, 3rd edition, Chap. 13) have a fairly thoughtful discussion of multiple comparisons, and they make it clear there are situations (“multiple inference problems”) where it’s important to do just this.
It makes a mess of Stan’s attempts to lampoon, but perhaps we can credit Rothman with having added some nuance to his position, or even having changed his mind sometime since 1990?
George: Are you in epidemiology?
Mayo: More than Stan Young is, I think.
Not sure what that means, but OK.
In a Data Colada blog post, “As Predicted: Preregistration Made Easy”
http://datacolada.org/2015/12/01/44_aspredicted/
it is suggested that, so long as one says ahead of time what data-dependent choices are going to be used, they are somehow kosher. For example:
“Benefit 2. Go ahead, data peek
Data peeking, where one decides whether to get more data after analyzing the data, is usually a big no-no. It invalidates p-values and (several aspects of) Bayesian inference. But if researchers pre-register how they will data peek, it becomes kosher again.”
Even if the peeking is announced in advance, the p-value still has to be adjusted. I don’t know if the authors are being serious here, or if I’m missing what they’re saying. It isn’t as if the time at which something is said or written changes its influence on the error probability. So, for example, I could say I’ll keep sampling until I get statistical significance (in an example where it’s known that this will eventually happen even if the null is true). Then the actual p-value still has to be adjusted post-data, and nothing changes merely because you wrote the stopping rule down ahead of time.
Uri Simonsohn explained to me in an e-mail that they were saying it’s still necessary to adjust if one announces in advance that, say, one will take another k samples should the first k not yield significance. However, this suggests it’s better not to announce in advance what one would do should the result not be significant, because one pays the penalty of the adjustment even if it turns out the extra sampling (or whatever the selection effect is) wasn’t needed.
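For readers who want to see the data-peeking point in action, here is a minimal simulation sketch of my own (it is not from Data Colada; it assumes Python with numpy and scipy, and the batch sizes are arbitrary): a researcher tests after every batch of 20 observations and stops as soon as p < .05. The null is true throughout, yet the overall error rate far exceeds the nominal .05, whether or not the stopping rule was announced in advance; only an adjusted analysis changes that.

import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n_sims, batch, max_batches, alpha = 5000, 20, 10, 0.05
rejections = 0

for _ in range(n_sims):
    data = np.empty(0)
    for _ in range(max_batches):
        data = np.append(data, rng.normal(size=batch))    # the null is true: mean 0
        if stats.ttest_1samp(data, 0.0).pvalue < alpha:   # peek after each batch
            rejections += 1
            break

# Whether or not this stopping rule was written down ahead of time, the
# overall type I error rate is roughly 0.2, not the nominal 0.05.
print("type I error rate with data peeking:", rejections / n_sims)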