I came across an excellent post on a blog kept by Daniel Lakens: “So you banned p-values, how’s that working out for you?” He refers to the journal that recently banned significance tests, confidence intervals, and a vague assortment of other statistical methods, on the grounds that all such statistical inference tools are “invalid” since they don’t provide posterior probabilities of some sort (see my post). The editors’ charge of “invalidity” could only hold water if these error statistical methods purported to provide posteriors based on priors, which is false. The entire methodology is based on methods in which probabilities arise to qualify a method’s capability to detect and avoid erroneous interpretations of data. The logic is of the falsification variety found throughout science. Lakens, an experimental psychologist, does a great job delineating some of the untoward consequences of their inferential ban. I insert some remarks in black.
The journal Basic and Applied Social Psychology banned p-values a year ago. I read some of their articles published in the last year. I didn’t like many of them. Here’s why.
First of all, it seems BASP didn’t just ban p-values. They also banned confidence intervals, because God forbid you use that lower bound to check whether or not it includes 0. They also banned reporting sample sizes for between-subject conditions, because God forbid you divide that SD by the square root of N, multiply it by 1.96, subtract it from the mean, and guesstimate whether that value is smaller than 0.
It reminds me of alcoholics who go into detox and have to hand in their perfume, before they are tempted to drink it. Thou shall not know whether a result is significant – it’s for your own good! Apparently, thou shall also not know whether effect sizes were estimated with any decent level of accuracy. Nor shall thou include the effect in future meta-analyses to commit the sin of cumulative science. (my emphasis)
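The back-of-the-envelope check Lakens alludes to — build a rough 95% interval as mean ± 1.96 × SD/√N and see whether its lower bound clears zero — is easy to make concrete. A minimal Python sketch, with the mean, SD, and N chosen as purely hypothetical illustration values:

```python
import math

def rough_ci_excludes_zero(mean, sd, n, z=1.96):
    """Rough 95% interval for a mean: mean +/- z * SD / sqrt(N).
    Returns whether the interval lies entirely above zero, plus the bounds."""
    se = sd / math.sqrt(n)
    lower = mean - z * se
    upper = mean + z * se
    return lower > 0, (lower, upper)

# Hypothetical numbers: mean difference 0.5, SD 1.2, n = 30 per cell
excludes, (lo, hi) = rough_ci_excludes_zero(0.5, 1.2, 30)
print(excludes, round(lo, 2), round(hi, 2))  # True 0.07 0.93
```

With a reported mean, SD, and per-condition N, a reader can always reconstruct this interval, which is why banning it from the printed page changes nothing about the inference actually being made.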
There are some nice papers where the p-value ban has no negative consequences. For example, Swab & Greitemeyer (2015) examine whether indirect (virtual) intergroup contact (seeing you have 1 friend in common with an outgroup member, vs not) would influence intergroup attitudes. It did not, in 8 studies. P-values can’t be used to accept the null-hypothesis, and these authors explicitly note they aimed to control Type 2 errors based on an a-priori power analysis. So, after observing many null-results, they drew the correct conclusion that if there was an effect, it was very unlikely to be larger than what the theory on evaluative conditioning predicted. After this conclusion, they logically switch to parameter estimation, perform a meta-analysis and based on a Cohen’s d of 0.05, suggest that this effect is basically 0. It’s a nice article, and the p-value ban did not make it better or worse.
If the journal is banning reports of inferential notions, then how do power and Type 2 errors slip by the editors’ bloodhounds?
But in many other papers, especially those where sample sizes were small, and experimental designs were used to examine hypothesized differences between conditions, things don’t look good.
In many of the articles published in BASP, researchers make statements about differences between groups. Whether or not these provide support for their hypotheses becomes a moving target, without the need to report p-values. For example, some authors interpret a d of 0.36 as support for an effect, while in the same study, a Cohen’s d < 0.29 (we are not told the exact value) is not interpreted as an effect. You can see how banning p-values solved the problem of dichotomous interpretations (I’m being ironic). Also, with 82 people divided over three conditions, the p-value associated with the d = 0.36 interpreted as an effect is around p = 0.2. If BASP had required authors to report p-values, they might have interpreted this effect a bit more cautiously. And in case you are wondering: No, this is not the only non-significant finding interpreted as an effect. Surprisingly enough, it seems to happen a lot more often than in journals where authors report p-values! Who would have predicted this?! (my emphasis)
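Lakens’ rough figure can be reproduced. Assuming the 82 participants split into roughly 27 per cell for a pairwise contrast (the exact cell sizes are not reported), and using a normal approximation to the t tail, which is reasonable at around 50 degrees of freedom, a sketch:

```python
import math

def p_from_d(d, n1, n2):
    """Approximate two-sided p-value for a two-sample comparison with
    observed Cohen's d: t = d * sqrt(n1*n2/(n1+n2)), then a normal
    approximation to the t tail (adequate for df around 50)."""
    t = d * math.sqrt(n1 * n2 / (n1 + n2))
    # two-sided normal tail probability: erfc(|t| / sqrt(2))
    return math.erfc(abs(t) / math.sqrt(2))

# 82 people over three conditions: roughly 27 per cell for one contrast
p = p_from_d(0.36, 27, 27)
print(round(p, 2))  # 0.19, i.e. around p = 0.2
```

The point is not the third decimal; it is that d = 0.36 at these sample sizes is entirely compatible with noise, which a reported p-value would have made plain.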
Nice work Trafimow and Marks! Just what psychology needs.
Saying one thing is bigger than something else, and reporting an effect size, works pretty well for simple effects. But how would you say there is a statistically significant interaction, if you can’t report inferential statistics and p-values? Here are some of my favorite statements.
“The ANOVA also revealed an interaction between [X] and [Y], η² = 0.03 (small to medium effect).”
How much trust do you have in that interaction from an exploratory ANOVA with a small to medium effect size of .03, partial eta squared? That’s what I thought.
“The main effects were qualified by an [X] by [Y] interaction. See Figure 2 for means and standard errors”
The main effects were qualified, but the interaction was not quantified. What does this author expect I do with the means and standard errors? Look at them while humming ‘ohm’ and wait to become enlightened? Everybody knows these authors calculated p-values, and based their statements on these values.
My predictions on the consequences of this journal’s puzzling policy appear to be true, all too true: They allow error statistical methods for purposes of a paper’s acceptance, but then require their extirpation in the published paper. I call it the “Don’t ask, don’t tell” policy (see this post). See also my commentary on the ASA P-value report.
In normal scientific journals, authors sometimes report a Bonferroni correction. But there’s no way you are going to Bonferroni those means and standard deviations, now is there? With their ban on p-values and confidence intervals, BASP has banned error control. For example, read the following statement:
Willpower theories were also related to participants’ BMI. The more people endorsed a limited theory, the higher their BMI. This finding corroborates the idea that a limited theory is related to lower self-control in terms of dieting and might therefore also correlate with patients’ BMI.
This is based on a two-sided p-value of 0.026, and it was one of 10 calculated correlation coefficients. Would a Bonferroni-adjusted p-value have led to a slightly more cautious conclusion?
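For readers unfamiliar with the adjustment, here is a sketch of what it would have done. The nine other p-values are invented placeholders, since the article does not report them; only the 0.026 and the count of 10 tests come from the source:

```python
def bonferroni(p_values, alpha=0.05):
    """Bonferroni correction: multiply each p-value by the number of
    tests performed (capping at 1) and compare the adjusted value
    against the original alpha."""
    m = len(p_values)
    out = []
    for p in p_values:
        adj = round(min(p * m, 1.0), 4)
        out.append((adj, adj < alpha))
    return out

# The correlation Lakens flags: p = 0.026, one of 10 computed correlations.
# The other nine p-values below are hypothetical placeholders.
ps = [0.026] + [0.30] * 9
adjusted, significant = bonferroni(ps)[0]
print(adjusted, significant)  # 0.26 False: no longer significant
```

An adjusted p of 0.26 would hardly license the confident causal gloss quoted above, which is presumably Lakens’ point.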
Oh, and if you hoped banning p-values would lead anyone to use Bayesian statistics: No. It leads to a surprisingly large number of citations to Trafimow’s articles where he tries to use p-values as measures of evidence, and is disappointed they don’t do what he expects. Which is like going to The Hangover part 4 and complaining it’s really not that funny. Except everyone who publishes in BASP mysteriously agrees that Trafimow’s articles show NHST has been discredited and is illogical. (my emphasis)
This last sentence gets to the most unfortunate consequence of all. In a field increasingly recognized to be driven by “perverse incentives,” and desperately in need of publishing reform, even the appearance of “pay to play” is disturbing, when editors hold so idiosyncratic a view about standard statistical methods.
In their latest editorial, Trafimow and Marks hit down some arguments you could, after a decent bottle of liquor, interpret as straw men against their ban of p-values. They don’t, and have never, discussed the only thing p-values are meant to do: control error rates. Instead, they seem happy to publish articles where some (again, there are some very decent articles in BASP) authors get all the leeway they need to adamantly claim effects are observed, even though these effects look a lot like noise.
I’m guessing that Daniel means they might (after liquor at least) be interpreted as converting the many telling criticisms of their ban into such weak versions as to render them “straw men”. I make some comments on this editorial.
The absence of p-values has not prevented dichotomous conclusions, nor claims that data support theories (which is only possible using Bayesian statistics), nor anything else p-values were blamed for in science. After reading a year’s worth of BASP articles, you’d almost start to suspect p-values are not the real problem. Instead, it looks like researchers find making statistical inferences pretty difficult, and forcing them to ignore p-values didn’t magically make things better.
As far as I can see, all that banning p-values has done is increase the Type 1 error rate in BASP articles. Restoring a correct use of p-values would substantially improve how well the conclusions authors draw actually follow from the data they have collected. The only expense, I predict, is a much lower number of citations to articles written by Trafimow about how useless p-values are. (my emphasis)
Lakens, by dint of this post, certainly deserves an Honorable Mention, and can choose a book prize from the palindrome prize list. He has agreed to answer questions posted in the comments. So share your thoughts.
As I say (slide 26) in my recent Popper talk at the LSE: “To use an eclectic toolbox in statistics, it’s important not to expect an agreement on numbers from methods evaluating different things. A p-value isn’t ‘invalid’ because it does not supply ‘the probability of the null hypothesis, given the finding’ (the posterior probability of H0) (Trafimow and Marks, 2015).”
I checked this editorial. Among at least a half dozen fallacies*, the editors say that the definition of a p-value is “true by definition and hence trivial”. But the definition of the posterior probability of H given x is also “true by definition and hence trivial”. Yet they’re quite sure that P(H|x) is informative. Why? It’s just true by definition.
Another puzzling claim is that “One cannot compute the probability of the finding due to chance unless one knows the population effect size. And if one knows the population effect size, there is no need to do the research.”
Given how they understand “probability of the finding due to chance”, what this says is that you can’t compute P(H|x) unless you know the population effect size. So this nihilistic claim of the editors is that to make a statistical inference about H requires knowing H, but then there’s no need to do the research. So there’s never any reason to do any research!
*I won’t call them statistical howlers (a term I’ve used on this blog) because they really have little to do with statistics and involve rudimentary logical gaffes.
 My one question is what Lakens means in saying that to claim that data support theories “is only possible using Bayesian statistics”. I don’t see how Bayesian statistics infers that data support theories, unless he means they may be used to provide a comparative measure of support, such as a Bayes’ Factor or likelihood ratio. On the other hand, if “supporting a theory” means something like “accepting or inferring a theory is well tested” then it’s outside of Bayesian probabilism (understood as a report of a posterior probability, however defined). An “acceptance” or “rejection” rule could be added to Bayesian updating (e.g., infer H if its posterior is high enough), but I’m not sure Bayesians find this welcome. It’s also possible that Lakens finds authors of this journal claiming their theories are “probable,” and he’s pointing out their error.
Send me your thoughts.
Lakens’ most telling remark is:
“They don’t, and have never, discussed the only thing p-values are meant to do: control error rates. Instead, they seem happy to publish articles where some (again, there are some very decent articles in BASP) authors get all the leeway they need to adamantly claim effects are observed, even though these effects look a lot like noise”.
The role of p-values as error probabilities is often something people are silent about these days, yet Lakens is correct to identify this as the p-value’s central function. Whether one views control of error probabilities in terms of good long-run performance or, as I prefer, enabling claims about the probativeness of the test in question, they are part of a methodology directed to block unwarranted claims about having evidence for genuine, reliable effects. Notice that the replicationist research proceeds by checking for significant p-values (or by means of dual uses of confidence intervals). When only 36% (or however many) of studies were replicable in the recent OSC report, I don’t think anyone maintained that they don’t count as non-replications because they were based on significance testing methodology. More than that, the researchers proceeded to employ significance tests to rule out various explanations for the lack of replication. (The complaints I’ve heard concern the assumptions of the tests, such as “fidelity”.)
Unless I missed it, the (2016) ASA P-value statement doesn’t include mention of the role of p-values in controlling or evaluating error probabilities either. One could easily construe the definition given as merely providing a “nominal” p-value as a distance measure. Yet, their key principles (e.g., about cherry picking leading to spurious p-values) depend entirely on assuming (as is correct) that p-values ought to be measuring error probabilities and not mere “fit”. That is why I emphasized those points in my invited commentary on the p-value statement.
What about this: http://andrewgelman.com/2016/03/07/29212/ ?