Palavering about Palavering about P-values


Nathan Schachtman (who was a special invited speaker at our recent Summer Seminar in Phil Stat) put up a post on his law blog the other day (“Palavering About P-values”) on an article by a statistics professor at Stanford, Helena Kraemer. “Palavering” is an interesting word choice of Schachtman’s. Its range of meanings is relevant here [i]; in my title, I intend both, in turn. You can read Schachtman’s full post here; it begins like this:

The American Statistical Association’s most recent confused and confusing communication about statistical significance testing has given rise to great mischief in the world of science and science publishing [ASA II 2019]. Take for instance last week’s opinion piece about “Is It Time to Ban the P Value?” Please.

Admittedly, their recent statement, which I refer to as ASA II, has seemed to open the floodgates to some very zany remarks about P-values, their meaning and role in statistical testing. Continuing with Schachtman’s post:

…Kraemer’s eye-catching title creates the impression that the p-value is unnecessary and inimical to valid inference.

Remarkably, Kraemer’s article commits the very mistake that the ASA set out to correct back in 2016 [ASA I], by conflating the probability of the data under a hypothesis of no association with the probability of a hypothesis given the data:

“If P value is less than .05, that indicates that the study evidence was good enough to support that hypothesis beyond reasonable doubt, in cases in which the P value .05 reflects the current consensus standard for what is reasonable.”

The ASA tried to break the bad habit of scientists’ interpreting p-values as allowing us to assign posterior probabilities, such as beyond a reasonable doubt, to hypotheses, but obviously to no avail.

While I share Schachtman’s puzzlement over a number of remarks in her article, this particular claim, while contorted, need not be regarded as giving a posterior probability to “that hypothesis” (the alternative to a test hypothesis). It is perhaps close to being tautological. If a P-value of .05 “reflects the current consensus standard for what is reasonable” evidence of a discrepancy from a test or null hypothesis, then it is reasonable evidence of such a discrepancy. Of course, she would have needed to say it’s a standard for “beyond a reasonable doubt” (BARD), but there’s no reason to suppose that that standard is best seen as a posterior probability.
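One way to make an evidential standard like this concrete is the tail probability attached to a significance threshold. As a rough numerical sketch (my own illustration; the function name is hypothetical), here is the one-sided tail area of a standard normal beyond z sigma, comparing the “consensus” .05 standard with the far more stringent 5-sigma standard used in particle physics:

```python
from math import erfc, sqrt

def one_sided_p(z: float) -> float:
    """One-sided tail probability P(Z >= z) for a standard normal variate."""
    return 0.5 * erfc(z / sqrt(2))

# roughly the .05 consensus standard (one-sided)
print(f"z = 1.645: p ~ {one_sided_p(1.645):.3f}")  # about 0.05
# the 5-sigma discovery threshold
print(f"z = 5.0:   p ~ {one_sided_p(5.0):.1e}")    # about 2.9e-07
```

The point of the comparison is only that “reasonable doubt” thresholds can differ by orders of magnitude across fields while having the same error-probability form.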

I think we should move away from that notion, given how ill-defined and unobtainable it is. That a claim is probable, in any of the manifold senses in which that is meant, is very different from its having been well tested, corroborated, or its truth well-warranted. It might well be that finding statistically significant increased risks 3 or 4 times is sufficient for inferring that a genuine risk exists–beyond a reasonable doubt–given the tests pass audits of their assumptions. The 5 sigma Higgs results warranted claiming a discovery insofar as there was a very high probability of getting less statistically significant results, were the bumps due to background alone. In other words, evidence BARD for H can be supplied by H’s having passed a sufficiently severe test (or set of tests). Its denial may be falsified in the strongest (fallible) manner possible in science. Back to Schachtman:

Perhaps in her most misleading advice, Kraemer asserts that:

“[w]hether P values are banned matters little. All readers (reviewers, patients, clinicians, policy makers, and researchers) can just ignore P values and focus on the quality of research studies and effect sizes to guide decision-making.”

Really? If a high quality study finds an “effect size” of interest, we can now ignore random error?

I agree her claim here is extremely strange, though one can surmise how it’s instigated by some suggested “reforms” in ASA II. It might also be the result of confusing observed or sample effect size with population or parametric effect size (or magnitude of discrepancy). But the real danger in speaking cavalierly about “banning” P-values is not that there aren’t some cases where genuine and spurious effects may be distinguished by eye-balling alone. It is that we lose an entire critical reasoning tool for determining if a statistical claim is based on methods with even moderate capability of revealing mistaken interpretations of data.  The first thing that a statistical consumer needs to ask those who assure them they’re not banning P-values, is whether they’ve so stripped them of their error statistical force as to deprive us of an essential tool for holding the statistical “experts” accountable.

The ASA 2016 Statement, with its “six principles,” has provoked some deliberate or ill-informed distortions in American judicial proceedings, but Kraemer’s editorial creates idiosyncratic meanings for p-values. Even the 2019 ASA “post-modernism” does not advocate ignoring random error and p-values, as opposed to proscribing dichotomous characterization of results as “statistically significant,” or not.

You may have an overly sanguine construal of ASA II (2019 ASA) (as merely “proscribing dichotomous characterization of results”). As I read it, although their actual position is quite vague, their recommended P-values appear to be merely descriptive and do not (or need not) have error probabilistic interpretations, even if assumptions pass audits. Granted, the important Principle 4 in ASA I (that data dredging and multiple testing invalidate P-values) implies that error control matters. But I think this is likely to be just another inconsistency between ASA I and II. Neither mentions Type I or II errors or power (except to say that it is not mentioning them). I think the onus is on the ASA II authors to clarify this and other points I’ve discussed elsewhere on this blog.
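Principle 4’s point about data dredging is easy to see with a little arithmetic. A minimal sketch (my own illustration, not from ASA I): if k independent tests of true null hypotheses are each run at level .05, the chance of at least one spurious “statistically significant” result grows rapidly with k, which is why unreported multiple testing invalidates the nominal P-value:

```python
def familywise_error(k: int, alpha: float = 0.05) -> float:
    """Probability of at least one false positive among k independent
    level-alpha tests when every null hypothesis is true."""
    return 1 - (1 - alpha) ** k

for k in (1, 5, 20, 100):
    print(f"{k:3d} tests: {familywise_error(k):.3f}")
# 20 tests already give about a 0.64 chance of a spurious "finding";
# 100 tests give about 0.99
```

This is the error-statistical rationale behind requirements like preregistration and CONSORT-style reporting: the reported P-value only has its claimed meaning if the searching that produced it is accounted for.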


[i]
  1. chattering, talking unproductively and at length
  2. persuading by flattery, browbeating or bullying


Categories: ASA Guide to P-values, P-values


12 thoughts on “Palavering about Palavering about P-values”

  1. Miodrag Lovric

    Deborah, the paper that you are referring to should have a different title: “Is It Time to Ban the Papers Written by Authors Who Do Not Have a Basic Understanding of the P-value?” Does it seem that JAMA Psychiatry is no longer a peer-reviewed journal? I am sure that almost all my second-year students have a much better understanding of hypothesis testing than H.C.K. It is not politically correct, but maybe her basic lack of understanding is due to some other lurking variable. In any case, her paper proves my thesis that statistical science is in (deep) crisis.

    • Miodrag: I agree with you about statistical science being in a great crisis of understanding; it’s possible that there has never been as much confusion, chaos, and equivocation in high places as there is now. As such, one needn’t look deeper into personal characteristics to explain some of the remarks in this paper. The exemplar at the ASA, I’m afraid, has made it acceptable to upend basic statistical principles and terms.

  2. Mayo, Miodrag,

    Yes, I was perhaps uncharitable in how I construed some of Kraemer’s language, and especially “beyond a reasonable doubt.” In legal circles, BARD is essentially a posterior probability. Under Supreme Court precedent (In re Winship), prosecutors must prove every element of the crime charged BARD. So if “intent” is an element of the crime, then the government must show BARD that the defendant intended to bring about the result which is proscribed by the criminal statute. That is a posterior probability (probability of intent given the evidence), although the law is coy about exactly what the probability is. A moral certainty, but not absolute certainty.

    Some years ago, Judge Jack Weinstein, who used to teach evidence at Columbia Law School, and who is an author of a leading treatise on the federal law of evidence, wrote an opinion about BARD. He took an informal poll of judges in his courthouse (Brooklyn federal courthouse), and he found judges that believed that BARD meant a posterior probability, quantified in the amount of 68% to 95%. Pretty scary in my view.

    Having taken depositions of clinicians in hundreds of cases, I can say informally that most clinicians think of BARD as a posterior probability and that most of them think that the p-value involves a posterior probability BARD. Kraemer may have something else in mind, but her language was at best loose, and it will clearly abet prevalent misunderstandings of p-values among physicians who likely have never and will never read ASA I or ASA II.

    The JAMA family of journals is generally well regarded, but not as much as the New England Journal of Medicine, or the Annals of Internal Medicine. The JAMA journals, including JAMA Psychiatry, are peer reviewed but Kraemer’s piece was an editorial, which means it was likely not peer reviewed other than by the editors, who should have known better.


    • Firstly, “probability” has informal English meanings, and the fact that they might say it means very highly probable still doesn’t entail it’s a posterior obtained by a prior, any more than does the inference to the Higgs particle. It’s embarrassing to think this position could depend on an informal poll: I could “probably” create one that showed moral certainty is not equivalent to having a very high posterior. Moreover, she is talking science and not law; the former involves theories, models and generalizations, the latter, presumably, events. It’s unclear why she invoked the legal phrase. The very fact that it’s unsettled in law is grounds for opening the question anew. We know it doesn’t have that meaning for scientists generally. Of course, once a claim is morally certain, thanks to having passed tests that very probably would have falsified it, we may say that we very strongly believe it and would bet on it. Or some such thing.

  3. Regarding banning p-values, stat sig, etc.,

    Ricker et al. in “Assessing the Statistical Analyses Used in Basic and Applied Social Psychology After Their p-Value Ban” write:

    “In this article, we assess the 31 articles published in Basic and Applied Social Psychology (BASP) in 2016, which is one full year after the BASP editors banned the use of inferential statistics…. We found multiple instances of authors overstating conclusions beyond what the data would support if statistical significance had been considered. Readers would be largely unable to recognize this because the necessary information to do so was not readily available.”


    • Justin: Yes and notice how little attention it received from the authors of ASA II. Have you added anything to your “report” as of late?

  4. rkenett

    Palavra is an interesting term to use here. In Hebrew it has a derogatory sense, but is mostly about “une parole.” Regarding the ASA original statement on P-values: there was a precursor ASA statement on value-added educational models (VAM). We reviewed it from a perspective of information quality in Chapter 6, section 6.3, of our book on information quality. We gave it a score of 57%, i.e., having poor information quality. More in

    • Interesting about “palavra”, I wonder if it’s used much in law. Thanks for the link on the earlier guideline, I guess I’d seen or heard of it a long time ago.

  5. Thanatos Savehn

    The courts have long allowed men to be hanged by p-values. And that alone is enough for me to oppose them. Calling their misinterpretations mere “howlers” is no comfort to those laid to rest in the cold ground.

    • Thanatos: Think of how many die and spend much $ on treatments sold by snake oil salesmen who distort and twist thresholds or violate standards for evidence. FDA and other agencies created regulations to block the “spinning” of treatments and medical devices that give results undesired by the Co. wanting to make $. Data dredging, subgroup analysis, outcome switching, etc. are to be blocked through requirements like CONSORT. If ASA II is taken seriously and to the letter (e.g., no P-value can reveal the presence of a genuine effect–even if repeated studies pass audits), there’s no point at which the snake oil claims are to be found false. I don’t say that ASA really means it, but in that case, why say it? See my post on ASA II.

    • “The courts have long allowed men to be hanged by p-values. And that alone is enough for me to oppose them. Calling their misinterpretations mere “howlers” is no comfort to those laid to rest in the cold ground.”

      Thanatos, that criticism can apply to any method based on probability or reasonable doubt, beyond reasonable doubt, etc., going from incomplete evidence to making a decision with upfront known possibility of making errors. That criticism is not specific to p-values whatsoever. Also, p-values or any other statistic would just be 1 piece out of N pieces used to make a decision.

      I fail to see how other approaches would do better. Did you want to try to establish your apparent claim of how non-p-value methods would solve that issue?


      • John

        Well said, Justin. How many men are about to be hanged by poorly chosen priors, or blind acceptance of likelihood ratios?
