Given the excited whispers about the upcoming meeting of the American Statistical Association Committee on P-Values and Statistical Significance, it’s an apt time to reblog my post on the “Don’t Ask Don’t Tell” policy that began the latest brouhaha!
A large number of people have sent me articles on the “test ban” of statistical hypothesis tests and confidence intervals at a journal called Basic and Applied Social Psychology (BASP)[i]. Enough. One person suggested that, since it came so close to my recent satirical Task force post, I either had advance knowledge or some kind of ESP. Oh please, no ESP required. None of this is the slightest bit surprising, and I’ve seen it before; I simply didn’t find it worth blogging about (but Saturday night is a perfect time to read or reread the satirical Task force post [ia]). Statistical tests are being banned, say the editors, because they purport to give probabilities of null hypotheses (really?) and do not; hence they are “invalid”.[ii] (Confidence intervals are thrown in the waste bin as well, also claimed “invalid”.) “The state of the art remains uncertain” regarding inferential statistical procedures, say the editors. I don’t know; maybe some good will come of all this.
Yet there’s a part of their proposal that brings up some interesting logical puzzles, and logical puzzles are my thing. In fact, I think there is a mistake the editors should remedy, lest authors be led into disingenuous stances, and strange tangles ensue. I refer to their rule that authors be allowed to submit papers whose conclusions are based on allegedly invalid methods so long as, once accepted, they remove any vestiges of them!
“Question 1. Will manuscripts with p-values be desk rejected automatically?
Answer to Question 1. No. If manuscripts pass the preliminary inspection, they will be sent out for review. But prior to publication, authors will have to remove all vestiges of the NHSTP (p-values, t-values, F-values, statements about “significant” differences or lack thereof, and so on).”
Now if these measures are alleged to be irrelevant and invalid instruments for statistical inference, then why should they be included in the peer review process at all? Will reviewers be told to ignore them? That would only seem fair: papers should not be judged by criteria alleged to be invalid, but how will reviewers blind themselves to them? It would seem the measures should be excluded from the get-go. If they are included in the review, why shouldn’t the readers see what the reviewers saw when they recommended acceptance?
But here’s where the puzzle really sets in. If the authors must free their final papers from such encumbrances as sampling distributions and error probabilities, what will be the basis presented for their conclusions in the published paper? Presumably, from the notice, they are allowed only mere descriptive statistics or non-objective Bayesian reports (added: actually I can’t tell which kind of Bayesianism they allow, given the Fisher reference, which doesn’t fit*). Won’t this be tantamount to requiring authors to support their research in a way that is either (actually) invalid, or has little to do with the error-statistical properties that were actually reported and on which the acceptance was based?[ii]
Maybe this won’t happen, because prospective authors already know there’s a bias at this particular journal against reporting significance levels, confidence levels, power, etc.; but the announcement says such submissions are permitted.
Translate P-values into euphemisms
Or might authors be able to describe p-values only using a variety of euphemisms, for instance: “We have consistently observed differences such that, were there no genuine effect, then there is a very high probability we would have observed differences smaller than those we found; yet we kept finding results that could almost never have been produced if we hadn’t got hold of a genuine discrepancy from the null model.” Or some such thing, just as long as the dreaded “P-word” is not mentioned? In one way, that would be good; the genuine basis for why and when small p-values warrant indications of discrepancies should be made explicit. I’m all for that. But, in all likelihood such euphemisms would be laughed at; everyone would know the code for “small p-value” when banned from saying the “p-word”, so what would have been served?
Or, much more likely, rewording p-values wouldn’t be allowed, so authors might opt to:
Find a way to translate error-statistical results to Bayesian posteriors?
They might say something like: “These results make me assign a very small probability to the ‘no effect’ hypothesis”, even though their study actually used p-values and not priors. But a problem immediately arises. If the paper is accepted based on p-values, then if they want to use priors to satisfy the editors in the final publication, they might have to resort to the uninformative priors that the editors have also banned [added: again, on further analysis, it’s unclear which type of Bayesian priors they are permitting as “interesting” enough to be considered on a case-by-case basis, as the Fisher genetics example supports frequentist priors]. So it would follow that, unless authors did a non-objective Bayesian analysis first, the only reasonable thing would be for the authors to give, in their published paper, merely a descriptive report.[iii]
Give descriptive reports and make no inferences
If the only way to translate an error-statistical report into a posterior entails illicit uninformative priors, then authors can opt for a purely descriptive report. What kind of descriptive report would convey the basis of the inference if it was actually based on statistical inference methods? Unclear, but there’s something else. Won’t descriptive reports in published papers be a clear tip-off for readers that p-values, size, power or confidence intervals were actually used in the original paper? The only way they wouldn’t be is if the papers were merely descriptive from the start. Will readers be able to find out? Will they be able to obtain the error statistics that were used, or will the authors not be allowed to furnish them? If they are allowed to furnish them, then all the test ban would have achieved is the need for a secret middle-level source that publishes the outlawed error probabilities. How does this fit with the recent moves toward transparency: shared data, even telling whether variables were selected post hoc, etc.? See “upshot” below. This is sounding more like “don’t ask, don’t tell!”
To sum up this much:
If papers based on error statistics are accepted, then the final published papers must find a different way to justify their results. We have considered three ways:
- using euphemisms for error probabilities;
- merely giving a descriptive report without any hint of inference;
- translating what was done so as to give a (non-default? informative? subjective?) posterior probability.
But there’s a serious problem with each.
Consider #3 again. If they’re led to invent priors that permit translating the low p-value into a low posterior for the null, say, then won’t that just create the invalidity that was not there at all when p-values were allowed to be p-values? If they’re also led to obey the ban on non-informative priors, mightn’t they be compelled to employ (or assume) information in the form of a prior, say, even though that did not enter their initial argument? You can see how confusing this can get. Will the readers at least be told by the authors that they had to change the justification from the one used in the appraisal of the manuscript? “Don’t ask, don’t tell” doesn’t help if people are trying to replicate the result thinking the posterior probability was the justification when in fact it was based on a p-value. Each generally has different implications for replication. Of course, if it’s just descriptive statistics, it’s not clear what “replication” would even amount to.
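To see how much hangs on the invented prior, here is a minimal sketch (my own illustration, with made-up numbers, not anything the editors or authors propose): for a result that just reaches p = .05 in a two-sided z-test, the posterior probability of a point null swings from roughly .4 to roughly .9 depending on the prior weight placed on the null and the spread assumed for the alternative.

```python
# Minimal sketch (illustrative only): how the posterior probability of a point null
# depends on the prior, for a two-sided z-test result that just reaches p = .05.
# Model: Z | H0 ~ N(0, 1); under H1, mu ~ N(0, tau^2), so Z | H1 ~ N(0, 1 + n*tau^2).
from math import sqrt, exp

def posterior_null(z, n, pi0, tau):
    """P(H0 | z) with prior mass pi0 on the point null and a N(0, tau^2) prior under H1."""
    v = 1 + n * tau ** 2                                       # variance of Z under H1
    bf01 = sqrt(v) * exp(-0.5 * z ** 2 * (n * tau ** 2) / v)   # Bayes factor in favor of H0
    return pi0 * bf01 / (pi0 * bf01 + (1 - pi0))

z, n = 1.96, 50                                                # "just significant" at the .05 level
for pi0, tau in [(0.5, 0.1), (0.5, 1.0), (0.5, 3.0), (0.9, 1.0)]:
    print(f"pi0={pi0}, tau={tau}:  P(H0 | z) = {posterior_null(z, n, pi0, tau):.2f}")
```

The same reported result can thus be turned into a “low”, middling, or high posterior for the null, depending entirely on choices that played no role in the original analysis.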
What happens to randomization and experimental design?
If we’re ousting error probabilities, be they p-values, type 1 and 2 errors, power, or confidence levels, then shouldn’t authors be free to oust the methods of experimental design and data collection whose justification is in substantiating the “physical basis for the validity of the test” of significance (Fisher, DOE, p. 17)? Why should they go through the trouble of experimental designs whose justification is precisely to support an inference procedure the editors deem illegitimate?
Upshot
It would have made more sense if the authors were required to make the case without the (allegedly) invalid measures from the start. Maybe they should correct this. I’m serious, at least if one is to buy into the test ban. Authors could be encouraged to attend to points almost universally ignored (in social psychology) when the attention is on things like p-values, to wit: what’s the connection between what you’re measuring and your inference or data interpretation? (Remember unscrambling soap words and moral judgments?)[iv] On the other hand, the steps toward progress are at risk of being nullified.
See “Out Damned Pseudoscience”, “Some Ironies in the Replication Crisis in Social Psychology”, and the more recent “The Paradox of Replication and the Vindication of the P-value (but she can go deeper)”.
The major problems with the uses of NHST in social psych involve the presumption that one is allowed to go from a statistical to a substantive (often causal!) inference (never mind that this fallacy has been known for 100 years), invalid statistical assumptions (including questionable proxy variables), and questionable research practices (QRPs): cherry-picking, post-data subgroups, barn-hunting, p-hacking, and so on. That these problems invalidate the method’s error probabilities was precisely the basis for deeming them bad practices!
Everyone can see at a glance (without any statistics) that reporting a lone .05 p-value for green jelly beans and acne (in that cartoon), while failing to report the 19 other colors that showed no association, means that the reported .05 p-value is invalidated! We can grasp immediately that finding 1 of 20 tests with a nominal p-value of .05 is common, not rare, by chance alone. Therefore the actual p-value is not low, as purported. That’s what an invalid p-value really means. The main reason for the existence of the p-value is that it renders certain practices demonstrably inadmissible (like this one): they provably alter the actual p-value. Without such invalidating moves, the reported p-value is very close to the actual one: Pr(p-value < .05; null) ~ .05. But these illicit moves reveal themselves in invalid p-values![v] What grounds will there be for transparency about such cherry-picking now, in that journal?
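For anyone who wants the arithmetic behind “common and not rare by chance alone”, here is a minimal sketch (mine, not the cartoon’s): with 20 independent tests of true nulls, the chance that at least one shows a nominal p < .05 is about .64, nowhere near .05.

```python
# Minimal sketch of the jelly-bean arithmetic: with 20 independent tests of true nulls,
# the chance that at least one shows a nominal p < .05 is nowhere near .05.
import random

print(1 - 0.95 ** 20)                       # exact: ~0.64

random.seed(1)                              # simulation: report only the smallest of 20 null p-values
reps = 100_000
hits = sum(min(random.random() for _ in range(20)) < 0.05 for _ in range(reps))
print(hits / reps)                          # again ~0.64; the reported .05 is nominal, not actual
```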
Remember that bold move by Simmons, Nelson and Simonsohn? (See “statistical dirty laundry” post here). They had called on researchers to “just say it”: “If you determined sample size in advance, say it. If you did not drop any variables, say it. If you did not drop any conditions, say it.”
The new call, for this journal at least, will be: “If you used p-values, confidence intervals, size, power, sampling distributions, just don’t say it”.[vi]
__
This blog will remain a safe haven for frequentists and error statisticians, in exile or out, as well as all other tribes of statisticians, philosophers, and practitioners.
______
*See my comment on this blog concerning their Fisher 1973 reference.
[i] An NSF Director asked for my response but I didn’t have one for dissemination. They sent me the ASA response by Ronald Wasserstein (Feb 26 on Amstat Blogs):
A group of more than two-dozen distinguished statistical professionals is developing an ASA statement on p-values and inference that highlights the issues and competing viewpoints. The ASA encourages the editors of this journal and others who might share their concerns to consider what is offered in the ASA statement to appear later this year and not discard the proper and appropriate use of statistical inference.
10/10/15 update: There was a lone comment on this page that I’m giving an Honorable Mention to. Amos Odeley (whom I don’t know):
“Banning p-values? This is a very unintelligent decision by very intelligent people.
Researchers abusing P-values have nothing to do with the P-value. I think they should have consulted with someone that understand the theory of Statistical inference before just throwing a piece of it out of the window.”
Amos, in this day of anti-politician politicians, it’s no surprise we’d see a rise in anti-statistical inference statisticians, not that anyone would ever call it that. Please write to me at error@vt.edu for a free copy of my Error and the Growth of Experimental Knowledge (Mayo 1996).
[ia] Even Stephen Ziliak gives it a shoutout! Ziliak, an anti-statistical tester and anti-randomisationist, indicates on his CV that he has been appointed to the ASA pow-wow on P-values. This should be interesting.
[ii] Allowing statistical significance to lead directly to substantive significance, as we often see in NHST, is invalid; but there’s nothing invalid in the correct report of a P-value, as used, for instance, in the recent discovery of the Higgs particle (search the blog for posts), in showing that hormone replacement therapy increases risks of breast cancer (unlike what observational studies were telling us for years), and in showing that Anil Potti’s prediction model, on which personalized cancer treatments were based, was invalid. Everyone who reads this blog knows I oppose cookbook statistics, knows I’d insist on indicating the discrepancies passed with good or bad severity, and knows I’d insist on taking account of a slew of selection effects and violations of statistical model assumptions, especially the links between observed proxy variables in social psych and the claims inferred. Alternatives to the null are made explicit, but what’s warranted may not be the alternative posed for purposes of getting a good distance measure, etc., etc. (You can search this blog for all these issues and more.)
[iii] Can anyone think of an example wherein a warranted low Bayesian probability of the null hypothesis—what the editors seek—would not have corresponded to finding strong evidence of discrepancy from the null hypothesis by means of a low p-value, ideally with the corresponding discrepancy size? I can think of cases where a high posterior in a “real effect” claim is shot down by a non-low p-value (once selection effects and stopping rules are taken account of), but that’s not at issue, apparently.
[iv] I think one of the editors may have had a representative at the Task force meeting I recently posted about.
An aside: These groups seem to love evocative terms and acronyms. We’ve got the Test Ban (reminds me of when I was a kid in NYC public schools and we had to get under our desks) of NHSTP at BASP.
[v] Anyone who reads this blog knows that I favor reporting which discrepancies are well warranted and which are poorly warranted, not merely a p-value. There are some special circumstances where the p-value alone is of value. (See Mayo and Cox 2010.)
[vi] Think of how all this would have helped Diederik Stapel.
Here is a link to the response given by Stephen Senn, who often guest-blogs here, on the Royal Statistical Society website:
Stephen Senn – Head of Competence Center for Methodology and Statistics at the Luxembourg Institute of Health
http://www.statslife.org.uk/opinion/2114-journal-s-ban-on-null-hypothesis-significance-testing-reactions-from-the-statistical-arena
Regarding multiple-comparisons adjustments:
No one questions the dishonesty of reporting a P-value picked out for being “significant” without explaining how it was picked or describing the ensemble from which it was picked.
And obviously, a ‘P-value’ computed as (say) the minimum from an ensemble of P values is biased toward zero – a valid P-value for this situation would be one computed over the selection process.
Nonetheless, I think you have made two serious oversights in your comments:
1) No matter how many tests a study does, adjustments to single-comparison P-values are biasing in that they destroy the uniformity condition for each single P-value (held by Berger, Robins, and me among others)
– these adjustments force the ensemble of P-values to concentrate toward 1, making the adjusted P-values even worse than posterior predictive P-values (which concentrate toward 1/2).
Honest, valid adjustment applies to the alpha level instead (as in genomics going down to around 10^-8), since the clear goal of the adjustment is to avoid being swamped by false positives when many tests are done, even if those are only 5% of the true nulls tested.
2) This clarification reveals something Neyman knew and most statisticians since seem to forget when adjustments are demanded in risk and safety assessments:
The adjustment breaks the implicit social contract to default to a 5% (or whatever) false-positive rate across tests, instead opting for an ever-shrinking false-positive rate as the number of tests increases. This in turn dramatically increases the false-negative rate.
Yes if I do 100 tests and all 100 nulls were correct, I’d expect 5 false positives; but that is the price I agreed to pay in adopting a 5% global error rate.
Furthermore, I’d expect fewer and fewer false positives the larger the proportion of the 100 tests with false nulls. And I don’t know that all the nulls are true (otherwise I’d have no reason to do a study), so 5 false positives is truly a worst-case scenario.
The desired error rates vary with the stakeholder. The only stakeholders who would desire the situation produced by conventional adjustments (which ignore damage to power) are those with stakes in never rejecting any of the nulls (such as a manufacturer or defense team attacking safety studies), even if all the nulls were false. Conventional adjustments shut out stakeholders on the other side. Everyone who uses tests should read Neyman (Synthese 1977, pp. 104-106) for his discussion of stakeholder issues in single comparisons; the points he raises are amplified in direct proportion to the number of tests at issue. He suggested reversing the test and alternative hypotheses to see the point. Another approach would be to require alpha=beta (or more flexibly, alpha=k*beta) as a fairness criterion applied to all tests when there are opposing stakeholders (including multiple-testing situations).
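A minimal simulation sketch of the commenter’s two points (my own construction, with made-up numbers: 100 tests, 20 of them with a real effect of 3 standard errors): Bonferroni-adjusted p-values for the true nulls pile up at 1, and holding the familywise rate at 5% sharply cuts the chance of detecting the real effects.

```python
# Minimal simulation sketch of the two points above (made-up setup: 100 tests,
# 20 of them with a real effect of 3 standard errors, the rest true nulls).
import random
from math import sqrt, erf

def p_two_sided(z):
    return 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))    # two-sided Normal p-value

random.seed(2)
m, m_false, reps = 100, 20, 2000
null_adj_at_1 = 0
raw_hits = adj_hits = 0
for _ in range(reps):
    zs = [random.gauss(3 if i < m_false else 0, 1) for i in range(m)]
    ps = [p_two_sided(z) for z in zs]
    adj = [min(m * p, 1.0) for p in ps]                    # Bonferroni-adjusted p-values
    null_adj_at_1 += sum(a == 1.0 for a in adj[m_false:])  # (a) nulls' adjusted p-values pushed to 1
    raw_hits += sum(p < 0.05 for p in ps[:m_false])        # (b) detections of the real effects,
    adj_hits += sum(a < 0.05 for a in adj[:m_false])       #     before and after adjustment

print("adjusted null p-values equal to 1:", null_adj_at_1 / (reps * (m - m_false)))   # ~0.99
print("per-effect power, unadjusted alpha = .05:", raw_hits / (reps * m_false))       # ~0.85
print("per-effect power, Bonferroni:", adj_hits / (reps * m_false))                   # ~0.3
```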
It seems to me that the problems with P-values come not only from inadequate understanding of P-values themselves, but also from inadequate consideration of the processes of statistical inference and scientific inference. The poor old P-values are being forced into the position of being a marker of inference without any thought about their actual meaning.
Royall’s three questions are very useful for this type of discussion:
1. What do the data say?
2. What should I believe now that I have these data?
3. What should I do now that I have these data?
These questions are interrelated (obviously any answer to questions 2 and 3 should involve consideration of the answer to question 1, and an answer to question 3 can be informed by an answer to question 2), but they differ in their nature and there are different statistical approaches that inform their answers. P-values are one way to answer question 1 (likelihood functions provide a richer answer); Bayesian approaches are necessary for a formal response to question 2; and decision theory-based methods answer question 3 when a loss function can be supplied. Neyman’s hypothesis test methods deal with 3 (when a loss function is considered when designing the experiment [almost never outside of clinical trials, I suspect]), and the error statistical methods promoted by Mayo deal with 1 and with 3, as far as I can tell.
Given that those questions are usually unstated when inferences are being made and discussed, and given that they are not explicitly mentioned in statistical education, it is unsurprising that P-values are conscripted and distorted into pseudo-answers to them all.
Michael: Surely this is a big part of the problem. I don’t know anyone (surely none of the founders of significance tests) who thinks a single P-value supplies all that the data have to say. For starters, several significance tests are generally needed in order to “audit” a primary one: (a) to see if the assumptions of the model are adequately met*, and (b) to determine if the P-value is actual or merely nominal, e.g., due to cherry-picking, data-dependent selections, fishing and other biases. If the P-value is actual, then, at most, it gives an indication of how well the result could be produced by the expected variability described in a null hypothesis Ho. There’s an indication of inconsistency between a null and the data at a given level. An isolated small P-value isn’t even indicative of a genuine experimental effect, unless small P-values “rarely fail to be” generated (as Fisher said). There are very few cases where (due to background knowledge, or a very limited question) the reported P-value suffices in a statistical report of what the data say, and Cox has given a taxonomy of those few cases. In general, one goes on to estimate the indicated discrepancy (with small P-values), or the discrepancies ruled out (for non-small P-values). A severity analysis essentially combines testing and estimation, but one need not combine them.
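For concreteness, here is a minimal sketch, in the spirit of the one-sided Normal test T+ in Mayo and Spanos, of going beyond the p-value to the discrepancies that are and are not well indicated; the numbers are illustrative only.

```python
# A minimal sketch, in the spirit of the one-sided Normal test T+ (Mayo and Spanos):
# H0: mu <= 0 vs H1: mu > 0, sigma = 1 known, n = 100, observed mean 0.25 (illustrative numbers).
from math import sqrt, erf

def Phi(x):                                   # standard Normal CDF
    return 0.5 * (1 + erf(x / sqrt(2)))

sigma, n, xbar = 1.0, 100, 0.25
se = sigma / sqrt(n)
print("p-value:", round(1 - Phi(xbar / se), 4))            # ~.006, so H0 is rejected

# Severity for inferring mu > mu1: Pr(Xbar <= observed mean; mu = mu1)
for mu1 in (0.0, 0.1, 0.2, 0.3):
    print(f"SEV(mu > {mu1}) = {Phi((xbar - mu1) / se):.2f}")   # high for mu > 0.1, low for mu > 0.3
```

The same rejection that warrants “mu > 0.1” with high severity gives very poor grounds for “mu > 0.3”: the report is about discrepancies, not a bare p-value.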
The psych people speak of NHST, a process that appears to permit going from a statistically significant result to a substantive research hypothesis! This is illicit. If the editors had spoken of such a move being illicit that would be one thing. The fact that they consider it noteworthy that a P-value doesn’t give a posterior probability of some sort (and it’s unclear what sort they desire) is completely puzzling. You only get a posterior by feeding in a prior, and the tests are designed not to employ priors. Of course, if there is an empirical prior, and the posterior is what you want to know, the methodology would enable computing it. This is just a piece of deductive inference really.
It’s puzzling as well that the editors merely assume a high (low) posterior probability in a hypothesis would be indicative of strong (weak) evidence for it in some sense. In many cases, data x lower the probability of H (x “disconfirm” H), but given a high prior to H, Pr(H|x) is still high. Most Bayesians would not say that x was good evidence for H in that case. More generally, the P-value is part of a methodology that aims to use probability to assess and control the capabilities of methods to avoid erroneous interpretations of data. That is, it’s part of what I’d call an error statistical approach. The goals of testing and error probing are rather different from summarizing the support, belief, or plausibility of a hypothesis H. The latter group can be called probabilisms, but it doesn’t matter what they’re called. There’s no reason to suppose the complex process of learning from data limits you to one and only one of these.
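A small worked instance of that point (my numbers, not the editors’): start with Pr(H) = .95 and data x that disconfirm H by a likelihood ratio of 1 to 5; the posterior drops, but only to about .79, which many would still call “high”.

```python
# A small worked instance (my numbers): a high prior keeps the posterior high
# even though the data disconfirm H by a likelihood ratio of 1 to 5.
prior, lr = 0.95, 0.2                      # lr = Pr(x|H) / Pr(x|not-H) < 1
posterior = prior * lr / (prior * lr + (1 - prior))
print(round(posterior, 2))                 # 0.79: lowered from 0.95, yet still high
```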
The bottom line is that the editors should confront these problems in detail. Statisticians have been talking about these issues at a highly sophisticated level for 70+ years (Cox will say 90). They come in and report the prosecutor’s fallacy (that Pr(H|x) differs from Pr(x|H)) and appear to believe they’re saying something brand new. They might be fascinated to discover how far these issues have been taken, at both a technical and a philosophical level, by statisticians.
*In the case of many psych experiments, I’d be much more worried about whether my experiments and measurements were even capable of teaching about the phenomenon of interest. Agonizing over P-values is quite small change in the face of the dubious assumptions required to link statistical inferences with their research hypotheses. But without statistical inference, those gaps will be buried. One wonders: are they trying to improve psych or allow its most serious problems to remain under wraps?
It’s interesting, despite all this, that p-values were excellent predictors of replication success in the recent attempt at replicating 100 psych experiments. See this post: https://errorstatistics.com/2015/08/31/the-paradox-of-replication-and-the-vindication-of-the-p-value-but-she-can-go-deeper-i/
Mayo, I agree with what you say. However, I am not really comfortable with your distinction between “actual” and “merely nominal” P-values. It seems to me that every P-value is model-bound and so they are all “actual” within their respective models. Rather than kvetching about “merely nominal” P-values, things that I might consider to be nothing more than differently conditioned P-values, it would be better to talk (and think) about the relevance of the variously conditioned models to the inferential purposes at hand.
Michael: P-values actually have a meaning as error probabilities, and when the P-value reported is far from the actual error probability, then it’s not the actual P-value. Likewise when the model assumptions it depends on are violated. Now some people recommend considering the P-value as nothing more than a logical distance measure, or summary of distance, as given by the test statistic. I take it that’s what you’re recommending. But then it is no longer serving its inferential purpose. It would be like reporting 5 sigma in particle physics without being able to say that so extreme a result is exceedingly rare under the assumption of mere background fluctuations in the particle accelerator. We need to have
Pr(P < p;Ho) = p.
It's the "second level" probability–the error probability–that's all important, even if only approximate. Maybe I'm popularizing "kvetching" on the blog.
The problem is that your formula Pr(P<p;H0)=p is not a sufficiently specified model to have a fixed meaning. If I do a two-stage optional stopping experiment and condition on the actual sample size then I get a P-value that meets the criterion of your formula, but you would, presumably, prefer to condition on the experimental design rather than the actual sample size, and so you would have a different statistical model which yields a different P-value. Your P-value would reflect the extremeness of the test statistic among all of the test statistics expected by the statistical model that includes both possible sample sizes, whereas mine looks at the extremeness of the test statistic among the test statistics expected by a model where the two possible sample sizes are kept separate. They are both actual P-values and both actual "error probabilities", but their accounting rules differ.
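To make the two-stage point concrete, here is a minimal simulation sketch (my own construction, with n1 = n2 = 20 assumed): each look uses the nominal .05 cutoff, and the p-value reported at whichever look the experiment stops is computed as if that sample size had been fixed. Under the null, the rate of reported “p < .05” comes out near .08, not .05; a p-value computed over the whole two-stage design would have to account for both looks.

```python
# Minimal simulation sketch of the two-stage design above (my construction, n1 = n2 = 20):
# test at the interim look; if not significant, add 20 more observations and test the pooled data.
# Each look treats its sample size as fixed and uses the nominal .05 cutoff.
import random
from math import sqrt, erf

def p_two_sided(z):
    return 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))

random.seed(3)
n1 = n2 = 20
reps = 50_000
reported_sig = 0
for _ in range(reps):
    x1 = [random.gauss(0, 1) for _ in range(n1)]               # H0 (mean 0) is true throughout
    if p_two_sided(sum(x1) / sqrt(n1)) < 0.05:
        reported_sig += 1                                      # stop early, report the interim p
        continue
    x = x1 + [random.gauss(0, 1) for _ in range(n2)]
    if p_two_sided(sum(x) / sqrt(n1 + n2)) < 0.05:
        reported_sig += 1                                      # report the pooled p as if n were fixed

print(reported_sig / reps)    # ~0.08 under the null, not .05: the design changes the error probability
```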