Going round and round again: a roundtable on reproducibility & lowering p-values


There will be a roundtable on reproducibility Friday, October 27th (noon Eastern time), hosted by the International Methods Colloquium, on the reproducibility crisis in the social sciences, motivated by the paper “Redefine statistical significance.” Recall that this was the paper written by a megateam of researchers as part of the movement to require p ≤ .005, based on appraising significance tests by a Bayes Factor analysis, with prior probabilities on a point null and a given alternative. It seems to me that if you’re prepared to scrutinize your frequentist (error statistical) method on grounds of Bayes Factors, then you must endorse using Bayes Factors (BFs) for inference to begin with. If you don’t endorse BFs–and, in particular, the BF required to get the disagreement with p-values*–then it doesn’t make sense to appraise your non-Bayesian method by whether it agrees or disagrees with BFs. For suppose you assess the recommended BFs from the perspective of an error statistical account–that is, one that checks how frequently the method would uncover or avoid the relevant mistaken inference.[i] Then, if you reach the stipulated BF level against a null hypothesis, you will find the situation is reversed: the recommended BF exaggerates the evidence! (In particular, with high probability, it gives an alternative H’ a fairly high posterior probability, or a comparatively higher probability, even though H’ is false.) Failing to reach the BF cut-off, by contrast, is taken as no evidence against–and may even be taken as evidence for–a null hypothesis, with high probability, even when non-trivial discrepancies exist. The two methods are measuring very different things, and it’s illicit to expect agreement on numbers.[ii] We’ve discussed this quite a lot on this blog (two related posts are linked below [iii]).
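To see the kind of numerical disagreement at issue, here is a minimal sketch (in Python, with made-up numbers) of the familiar point that a result fixed at p = .05 corresponds to increasingly strong Bayes-factor support *for* a point null as sample size grows, when H1 spreads its prior over alternatives. The choice of prior spread tau is an illustrative assumption, not anyone's recommended calibration:

```python
# Sketch of the p-value / Bayes factor disagreement for a Normal mean
# with known sigma. H0: mu = 0 vs H1: mu ~ N(0, tau^2). Hypothetical numbers.
import math

def normal_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def bf01(xbar, n, sigma=1.0, tau=1.0):
    """Bayes factor in favor of H0.
    Marginal of the sample mean: N(0, sigma^2/n) under H0,
    N(0, tau^2 + sigma^2/n) under H1 (prior variance adds)."""
    v0 = sigma ** 2 / n
    return normal_pdf(xbar, 0.0, v0) / normal_pdf(xbar, 0.0, tau ** 2 + v0)

# Hold the (two-sided) p-value fixed at .05: xbar = 1.96 * sigma / sqrt(n).
# BF01 grows with n, i.e., the same "just significant" result counts as
# ever-stronger support for the null on this Bayesian appraisal.
for n in (10, 100, 1000, 10000):
    xbar = 1.96 / math.sqrt(n)
    print(n, round(bf01(xbar, n), 2))
```

This is the Jeffreys-Lindley phenomenon: the two measures answer different questions, so fixing one while varying n moves the other.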

If the given list of panelists is correct, it looks to be 4 against 1, but I’ve no doubt that Lakens can handle it.

  1. Daniel Benjamin, Associate Research Professor of Economics at the University of Southern California and a primary co-author of “Redefine Statistical Significance”.
  2. Daniel Lakens, Assistant Professor in Applied Cognitive Psychology at Eindhoven University of Technology and a primary co-author of a response to ‘Redefine statistical significance’ (under review).
  3. Blake McShane, Associate Professor of Marketing at Northwestern University and a co-author of the recent paper “Abandon Statistical Significance”.
  4. Jennifer Tackett, Associate Professor of Psychology at Northwestern University and a co-author of the recent paper “Abandon Statistical Significance”.
  5. E.J. Wagenmakers, Professor at the Methodology Unit of the Department of Psychology at the University of Amsterdam and a co-author of the paper “Redefine Statistical Significance”.

To tune in to the presentation and participate in the discussion after the talk, visit this site on the day of the talk. To register for the talk in advance, click here.

The paradox for those wishing to abandon significance tests on grounds that there’s “a replication crisis”–and I’m not alleging that everyone under the “lower your p-value” umbrella is advancing this–is that lack of replication is effectively uncovered thanks to statistical significance tests. They are also the basis for fraud-busting and for adjustments for multiple testing and selection effects. Unlike Bayes Factors, they:

  • are directly affected by cherry-picking, data dredging and other biasing selection effects
  • are able to test statistical model assumptions, and may have their own assumptions vouchsafed by appropriate experimental design
  • block inferring a genuine effect when the method had low capability of finding the effect spurious, were it in fact spurious.

In my view, the result of a significance test should be interpreted in terms of the discrepancies that are well or poorly indicated by the result. So we’d avoid the concern that leads some to recommend a .005 cut-off to begin with. But if this does become the standard for testing the existence of risks, I’d make “there’s an increased risk of at least r” the test hypothesis in a one-sided test, as Neyman recommends. Don’t give a gift to the risk producers. In the most problematic areas of social science, the real problems are (a) the questionable relevance of the “treatment” and “outcome” to what is purported to be measured, (b) cherry-picking, data-dependent endpoints, and a host of biasing selection effects, and (c) violated model assumptions. Lowering a p-value will do nothing to help with these problems; forgoing statistical tests of significance will do a lot to make them worse.

 *Added Oct. 27. This is worth noting because in other Bayesian assessments–indeed, in assessments deemed more sensible and less biased in favor of the null hypothesis–the p-value scarcely differs from the posterior probability on H0. This is discussed, for example, in Casella and R. Berger 1987. See links in [iii]. The two are reconciled with one-sided tests, and insofar as the typical study states a predicted direction, that’s what researchers should be using.
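The one-sided reconciliation can be seen in a few lines. This is a sketch (Python, hypothetical numbers) of the textbook fact behind the Casella and Berger point: with an improper flat prior on a Normal mean, the posterior probability of H0: mu ≤ 0 equals the one-sided p-value:

```python
# One-sided testing of a Normal mean with known sigma: the posterior
# probability of H0: mu <= 0 under a flat prior equals the one-sided p-value.
import math

def phi(z):
    """Standard Normal CDF."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def one_sided_p(xbar, n, sigma=1.0):
    # P(X-bar >= observed xbar) computed under mu = 0
    return 1 - phi(xbar * math.sqrt(n) / sigma)

def posterior_h0(xbar, n, sigma=1.0):
    # Flat prior => posterior is mu | data ~ N(xbar, sigma^2/n)
    return phi((0 - xbar) * math.sqrt(n) / sigma)  # P(mu <= 0 | data)

xbar, n = 0.4, 25  # illustrative numbers only
print(one_sided_p(xbar, n), posterior_h0(xbar, n))  # the two coincide
```

Both quantities reduce to Phi(-xbar*sqrt(n)/sigma), which is why in the one-sided setting the p-value and the posterior on H0 scarcely differ.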

[i] Both “frequentist” and “sampling theory” are unhelpful names. Since the key feature is basing inference on error probabilities of methods, I abbreviate by error statistics. The error probabilities are based on the sampling distribution of the appropriate test statistic. A proper subset of error statistical contexts are those that utilize error probabilities to assess and control the severity by which a particular claim is tested.

[ii] See #4 of my recent talk on statistical skepticism, “7 challenges and how to respond to them”.

[iii] Two related posts: p-values overstate the evidence against the null fallacy

How likelihoodists exaggerate evidence from statistical tests (search the blog for others)


Categories: Announcement, P-values, reforming the reformers, selection effects

5 thoughts on “Going round and round again: a roundtable on reproducibility & lowering p-values”

  1. Maybe I’m out of date, but I was always taught that lack of reproducibility tells us that our theory is not adequate to the complexity of the phenomenon.

  2. The whole debate confuses type-I errors and reproducibility and ignores type-II errors.

    There are two reasons why an original result may fail to replicate (not be significant in the replication study). The original study was a false positive and the replication study is a true negative or the original study was a true positive (with inflated effect size) and the replication study is a false negative.

    With a null-hypothesis that is typically specified as no effect (Cohen’s nil-hypothesis), type-I errors are actually a priori relatively unlikely, and the distinction between a false positive and a true positive with a very small effect size is often practically irrelevant.

    We should focus more on the property of a study to produce a significant result in the long run with high frequency (Fisher), which is known as statistical power (Neyman-Pearson).

    The call to abandon statistical significance makes no sense. How do you make any claims beyond a particular sample, and who cares about the convenience samples of MTurk or undergraduate students in psychology?

  3. Thanatos Savehn

    I think you have something vital to contribute to this conversation. I get the irony of p-values-disputing-people using p-values to show that p-values have been hijacked; but what we need is a theory of variability. Why did Student’s barley vary so? Is it emergence? If so, what is it? Is it stochastic? Hopelessly unpredictable? Or something else? If so, why; and how; and, when? This was Fisher’s great quest. I hope it’s in your forthcoming book (which I shall purchase). If not, if you can set one foot forward in the right direction, please do.

  4. Well, I watched the round table. I was impressed with the Zoom technology–now I have to see if I can learn it in order to run webinars. I wasn’t impressed with the discussion, though, except that at least Benjamin was clear enough to admit that their recommendation allows assigning a posterior probability to H1: mu = x-bar (in the one-sided Normal testing example he gave), based on .5 priors on the null and on x-bar. This is terrible! If we follow this we assign a high posterior to H1, but such an inferential procedure has terrible error probabilities: it infers evidence for H1 (as compared to the null) with high (around 50%) probability even when mu < x-bar. That is why I say the Bayes factor analysis they recommend (in any of its forms) overstates the evidence against the null.
    Moreover, Bayesians like Wagenmakers deny the relevance of error probabilities, which renders cherry-picking, multiple testing, optional stopping and the like irrelevant for inference. This is diametrically opposed to what prespecification is supposed to achieve.
    My questions weren't asked, but then again, I only fed them to the person running the session during the talk.
