Given the excited whispers about the upcoming meeting of the American Statistical Association Committee on P-Values and Statistical Significance, it’s an apt time to reblog my post on the “Don’t Ask Don’t Tell” policy that began the latest brouhaha!
A large number of people have sent me articles on the “test ban” of statistical hypothesis tests and confidence intervals at a journal called Basic and Applied Social Psychology (BASP)[i]. Enough. One person suggested that, since it came so close to my recent satirical Task force post, I either had advance knowledge or some kind of ESP. Oh please, no ESP required. None of this is the slightest bit surprising, and I’ve seen it before; I simply didn’t find it worth blogging about (but Saturday night is a perfect time to read/reread the (satirical) Task force post [ia]). Statistical tests are being banned, say the editors, because they purport to give probabilities of null hypotheses (really?) and do not, hence they are “invalid”.[ii] (Confidence intervals are thrown in the waste bin as well, also claimed “invalid”.) “The state of the art remains uncertain” regarding inferential statistical procedures, say the editors. I don’t know, maybe some good will come of all this.
Yet there’s a part of their proposal that brings up some interesting logical puzzles, and logical puzzles are my thing. In fact, I think there is a mistake the editors should remedy, lest authors be led into disingenuous stances, and strange tangles ensue. I refer to their rule that authors be allowed to submit papers whose conclusions are based on allegedly invalid methods so long as, once accepted, they remove any vestiges of them!
“Question 1. Will manuscripts with p-values be desk rejected automatically?
Answer to Question 1. No. If manuscripts pass the preliminary inspection, they will be sent out for review. But prior to publication, authors will have to remove all vestiges of the NHSTP (p-values, t-values, F-values, statements about “significant” differences or lack thereof, and so on).”
Now if these measures are alleged to be irrelevant and invalid instruments for statistical inference, then why should they be included in the peer review process at all? Will reviewers be told to ignore them? That would only seem fair: papers should not be judged by criteria alleged to be invalid, but how will reviewers blind themselves to them? It would seem the measures should be excluded from the get-go. If they are included in the review, why shouldn’t the readers see what the reviewers saw when they recommended acceptance?
But here’s where the puzzle really sets in. If the authors must free their final papers from such encumbrances as sampling distributions and error probabilities, what will be the basis presented for their conclusions in the published paper? Presumably, from the notice, they are allowed only mere descriptive statistics or non-objective Bayesian reports (added: actually can’t tell which kind of Bayesianism they allow, given the Fisher reference which doesn’t fit*). Won’t this be tantamount to requiring authors support their research in a way that is either (actually) invalid, or has little to do with the error statistical properties that were actually reported and on which the acceptance was based?[ii]
Maybe this won’t happen because prospective authors already know there’s a bias in this particular journal against reporting significance levels, confidence levels, power etc., but the announcement says they are permitted.
Translate P-values into euphemisms
Or might authors be able to describe p-values only using a variety of euphemisms, for instance: “We have consistently observed differences such that, were there no genuine effect, then there is a very high probability we would have observed differences smaller than those we found; yet we kept finding results that could almost never have been produced if we hadn’t got hold of a genuine discrepancy from the null model.” Or some such thing, just as long as the dreaded “P-word” is not mentioned? In one way, that would be good; the genuine basis for why and when small p-values warrant indications of discrepancies should be made explicit. I’m all for that. But, in all likelihood such euphemisms would be laughed at; everyone would know the code for “small p-value” when banned from saying the “p-word”, so what would have been served?
Or, much more likely, rewording p-values wouldn’t be allowed, so authors might opt to:
Find a way to translate error-statistical results to Bayesian posteriors?
They might say something like: “These results make me assign a very small probability to the ‘no effect’ hypothesis”, even though their study actually used p-values and not priors? But a problem immediately arises. If the paper is accepted based on p-values, then if they want to use priors to satisfy the editors in the final publication, they might have to resort to the uninformative priors that the editors have also banned [added: again, on further analysis, it’s unclear which type of Bayesian priors they are permitting as “interesting” enough to be considered on a case by case basis, as the Fisher genetics example supports frequentist priors]. So it would follow that unless authors did a non-objective Bayesian analysis first, the only reasonable thing would be for the authors to give, in their published paper, merely a descriptive report.[iii]
Give descriptive reports and make no inferences
If the only way to translate an error statistical report into a posterior entails illicit uninformative priors, then authors can opt for a purely descriptive report. What kind of descriptive report would convey the basis of the inference if it was actually based on statistical inference methods? Unclear, but there’s something else. Won’t descriptive reports in published papers be a clear tip off for readers that p-values, size, power or confidence intervals were actually used in the original paper? The only way they wouldn’t be is if the papers were merely descriptive from the start. Will readers be able to find out? Will they be able to obtain those error statistics used or will the authors not be allowed to furnish them? If they are allowed to furnish them, then all the test ban would have achieved is the need for a secret middle level source that publishes the outlawed error probabilities. How does this fit with the recent moves toward transparency, shared data, even telling whether variables were selected post hoc, etc.? See “upshot” below. This is sounding more like “don’t ask, don’t tell!”
To sum up this much.
If papers based on error statistics are accepted, then the final published papers must find a different way to justify their results. We have considered three ways, either:
- using euphemisms for error probabilities,
- merely giving a descriptive report without any hint of inference,
- translating what was done so as to give a (non-default? informative? non-nonsubjective?) posterior probability.
But there’s a serious problem with each.
Consider # 3 again. If they’re led to invent priors that permit translating the low p-value into a low posterior for the null, say, then won’t that just create the invalidity that was actually not there at all when p-values were allowed to be p-values? If they’re also led to obey the ban on non-informative priors, mightn’t they be compelled to employ (or assume) information in the form of a prior, say, even though that did not enter their initial argument? You can see how confusing this can get. Will the readers at least be told by the authors that they had to change the justification from the one used in the appraisal of the manuscript? “Don’t ask, don’t tell” doesn’t help if people are trying to replicate the result thinking the posterior probability was the justification when in fact it was based on a p-value. Each generally has different implications for replication. Of course, if it’s just descriptive statistics, it’s not clear what “replication” would even amount to.
What happens to randomization and experimental design?
If we’re ousting error probabilities, be they p-values, type 1 and 2 errors, power, or confidence levels, then shouldn’t authors be free to oust the methods of experimental design and data collection whose justification is in substantiating the “physical basis for the validity of the test” of significance? (Fisher, DOE 17). Why should they go through the trouble of experimental designs whose justification is precisely to support an inference procedure the editors deem illegitimate?
It would have made more sense if the authors were required to make the case without the (alleged) invalid measures from the start. Maybe they should correct this. I’m serious, at least if one is to buy into the test ban. Authors could be encouraged to attend to points almost universally ignored (in social psychology) when the attention is on things like p-values, to wit: what’s the connection between what you’re measuring and your inference or data interpretation? (Remember unscrambling soap words and moral judgments?) [iv] On the other hand, the steps toward progress are at risk of being nullified.
See “out damned pseudoscience”, “Some ironies in the replication crisis in social psychology”, and this more recent “The Paradox of Replication and the Vindication of the P-value (but she can go deeper)”.
The major problems with the uses of NHST in social psych involve the presumption that one is allowed to go from a statistical to a substantive (often causal!) inference, never mind that everyone has known this fallacy for 100 years; invalid statistical assumptions (including questionable proxy variables); and questionable research practices (QRPs): cherry-picking, post-data subgroups, barn-hunting, p-hacking, and so on. That these problems invalidate the method’s error probabilities was the basis for deeming them bad practices!
Everyone can see at a glance (without any statistics) that reporting a lone .05 p-value for green jelly beans and acne (in that cartoon), while failing to report the 19 other colors that showed no association, means that the reported .05 p-value is invalidated! We can grasp immediately that finding 1 of 20 with a nominal p-value of .05 is common, not rare, by chance alone. Therefore, it shows directly that the actual p-value is not low as purported! That’s what an invalid p-value really means. The main reason for the existence of the p-value is that it renders certain practices demonstrably inadmissible (like this one). They provably alter the actual p-value. Without such invalidating moves, the reported p-value is very close to the actual! Pr(p-value < .05; null) ~ .05. But these illicit moves reveal themselves in invalid p-values![v] What grounds will there be for transparency about such cherry-picking now, in that journal?
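For anyone who wants the jelly bean arithmetic spelled out, here is a minimal sketch. It assumes 20 independent tests, every null hypothesis true, and p-values that are uniform on (0, 1) under the null; the function names are hypothetical, made up for illustration. It compares the nominal .05 with the chance that a researcher who reports only the smallest of 20 p-values gets to announce "p < .05".

```python
import random


def prob_at_least_one_nominal_hit(num_tests: int, alpha: float) -> float:
    """Chance that at least one of num_tests independent tests of true
    nulls yields a nominal p-value below alpha."""
    return 1.0 - (1.0 - alpha) ** num_tests


def simulate_cherry_picked_rate(num_tests: int, alpha: float,
                                num_trials: int, seed: int = 0) -> float:
    """Simulated rate at which the smallest of num_tests null p-values
    falls below alpha (assumption: p-values are Uniform(0, 1) under the null)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(num_trials):
        # Hunt through all the colors and keep only the best-looking result.
        smallest_p = min(rng.random() for _ in range(num_tests))
        if smallest_p < alpha:
            hits += 1
    return hits / num_trials


if __name__ == "__main__":
    # One prespecified test: actual error probability matches the nominal .05.
    print(prob_at_least_one_nominal_hit(1, 0.05))
    # Twenty colors, report the lone "significant" one: roughly .64, not .05.
    print(prob_at_least_one_nominal_hit(20, 0.05))
    print(simulate_cherry_picked_rate(20, 0.05, 20000))
```

Under these assumptions the actual probability of reporting a nominal .05 result is about .64, which is the sense in which the cherry-picked p-value is invalid: the reported figure no longer matches the error probability of the procedure actually carried out.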
Remember that bold move by Simmons, Nelson and Simonsohn? (See “statistical dirty laundry” post here). They had called on researchers to “just say it”: “If you determined sample size in advance, say it. If you did not drop any variables, say it. If you did not drop any conditions, say it.”
The new call, for this journal at least, will be: “If you used p-values, confidence intervals, size, power, sampling distributions, just don’t say it”.[vi]
This blog will remain a safe haven for frequentists and error statisticians, in exile or out, as well as all other tribes of statisticians, philosophers, and practitioners.
*See my comment on this blog concerning their Fisher 1973 reference.
A group of more than two-dozen distinguished statistical professionals is developing an ASA statement on p-values and inference that highlights the issues and competing viewpoints. The ASA encourages the editors of this journal and others who might share their concerns to consider what is offered in the ASA statement to appear later this year and not discard the proper and appropriate use of statistical inference.
10/10/15 update: There was a lone comment on this page that I’m giving an Honorable Mention to. Amos Odeley (whom I don’t know):
“Banning p-values? This is a very unintelligent decision by very intelligent people.
Researchers abusing P-values have nothing to do with the P-value. I think they should have consulted with someone that understand the theory of Statistical inference before just throwing a piece of it out of the window.”
Amos, in this day of anti-politician politicians, it’s no surprise we’d see a rise in anti-statistical inference statisticians, not that anyone would ever call it that. Please write to me at email@example.com for a free copy of my Error and the Growth of Experimental Knowledge (Mayo 1996).
[ia]Even Stephen Ziliak gives it a shoutout! Ziliak, an anti-statistical tester and anti-randomisationist, indicates on his CV that he has been appointed to the ASA pow-wow on P-values. This should be interesting.
[ii]Allowing statistical significance to lead directly to substantive significance, as we often see in NHST, is invalid; but there’s nothing invalid in the correct report of a P-value, as used, for instance, in the recent discovery of the Higgs particle (search blog for posts), in showing that hormone replacement therapy increases risks of breast cancer (unlike what observational studies were telling us for years), and in showing that Anil Potti’s prediction model, on which personalized cancer treatments were based, was invalid. Everyone who reads this blog knows I oppose cookbook statistics, and knows I’d insist on indicating discrepancies passed with good or bad severity, insist on taking account of a slew of selection effects and violations of statistical model assumptions, especially links from observed proxy variables in social psych and claims inferred. Alternatives to the null are made explicit, but what’s warranted may not be the alternative posed for purposes of getting a good distance measure, etc etc. (You can search this post for all these issues and more.)
[iii]Can anyone think of an example wherein a warranted low Bayesian probability of the null hypothesis—what the editors seek—would not have corresponded to finding strong evidence of discrepancy from the null hypothesis by means of a low p-value, ideally with corresponding discrepancy size? I can think of cases where a high posterior in a “real effect” claim is shot down by a non-low p-value (once selection effects, and stopping rules are taken account of) but that’s not at issue, apparently.
[iv]I think one of the editors may have had a representative at the Task force meeting I recently posted.
An aside: These groups seem to love evocative terms and acronyms. We’ve got the Test Ban (reminds me of when I was a kid in NYC public schools and we had to get under our desks) of NHSTP at BASP.
[v] Anyone who reads this blog knows that I favor reporting the discrepancies well-warranted and poorly warranted and not merely a p-value. There are some special circumstances where the p-value alone is of value. (See Mayo and Cox 2010).
[vi] Think of how all this would have helped Diederik Stapel.