Comments on “The ASA p-value statement 10 years on” (ii)


Given how much I’ve blogged about the 2016 ASA p-value statement, the 2019 Executive Director’s editorial in The American Statistician (TAS), the 2020 ASA (President’s) Task Force, and the various casualties of the related teeth pulling, I thought I should say something about the recent article by Robert Matthews in Significance (March 2026): “The ASA p-value statement 10 years on: An event of statistical significance?” He begins: “Ten years ago this month, the American Statistical Association (ASA) took the unprecedented step of issuing a statement on one of the most controversial issues in statistics: the use and abuse of p-values.” The Statement is here, 2016 ASA Statement on P-Values and Statistical Significance [1]. The Executive Director of the ASA, Ronald Wasserstein, invited me to be a “philosophical observer” at the meeting that gave rise to the 2016 statement. Although the 2016 ASA statement wasn’t radically controversial, at least as compared to the 2019 Executive Director’s editorial, which I’ll get to in a minute, it was met with critical reactions on all sides. Stephen Senn provides a figure displaying relationships between reactions. Here is more from Matthews’ article:

Popularised in the 1920s by the hugely influential English statistician Ronald Fisher, p-values lie at the heart of “significance testing”, widely used by researchers to claim to have found something interesting lurking in data. Yet despite their ubiquity in research journals, p-values have also long been criticised as misunderstood, misleading and open to abuse. The problem lies in their definition. p-values typically give the chances of getting an effect at least as impressive as that seen, assuming it’s actually just a fluke. If these chances are sufficiently low – less than 0.05 is the traditional standard – the finding is then deemed “statistically significant”. For many researchers, this has been taken as implying that their finding is not a fluke, and worth taking seriously. But this overlooks the fact that p-values are calculated on the assumption the result is a fluke. As such, they cannot also be used to decide if this assumption is valid…

Wait a minute. According to Matthews, taking a small p-value as evidence the observed effect is not a fluke “overlooks the fact that p-values are calculated on the assumption the result is a fluke. As such, they cannot also be used to decide if this assumption is valid.” This overlooks the very nature of reductio (or indirect or falsificationist) proofs. Take the proof that there’s no smallest positive rational q: Assume q is the smallest positive rational. If so, q/2 would be a smaller positive rational. From this contradiction, infer there is no smallest positive rational number. It is a deductively valid argument. P-value reasoning is a statistical version of the reductio argument – providing a statistical contradiction to the fluke assumption, with an associated error probability. The small p-value tells us it’s very probable (1-p) that a smaller effect would have resulted, were it due to chance alone. Replicating the small p-value strengthens the contradiction further.[0] So can we please stop saying that assuming a claim C in a reductio argument precludes finding evidence to falsify C?

The assumption of the null hypothesis is just an “implicationary assumption” for purposes of drawing out the consequences of C. Overlooking falsificationist logic is at the heart of today’s confusion over p-value reasoning. If we could run an experiment in which the p-value critics magically became falsificationists for one day, I think the scales would fall from the eyes of a statistically significant proportion of them, at least during that time.[2]
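
To make the logic concrete, here is a minimal simulation sketch (my own, with made-up numbers, not anything from Matthews or the ASA documents): adopt the “fluke” hypothesis purely as an implicationary assumption, work out what data it would typically produce, and ask how often it would produce something at least as impressive as what was observed.

```python
import numpy as np

rng = np.random.default_rng(1)

n = 30                 # hypothetical sample size
observed_mean = 0.6    # hypothetical observed effect, in standard-deviation units

# Implicationary assumption: the effect is "just a fluke", i.e., the data are
# pure noise with mean 0 and standard deviation 1.
sims = rng.normal(loc=0.0, scale=1.0, size=(100_000, n)).mean(axis=1)

# Probability, under that assumption, of an effect at least as impressive as the one observed.
p = float(np.mean(sims >= observed_mean))
print(f"simulated p-value: {p:.4f}")
print(f"probability of a less impressive result under chance alone: {1 - p:.4f}")
# A tiny p says: were the fluke assumption true, we would very probably (1 - p)
# have seen something less impressive -- the statistical analogue of a contradiction.
```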

Admittedly, statistical significance tests are just a small part of a rich set of “techniques for systematically appraising and bounding the probabilities (under respective hypotheses) of seriously misleading interpretations of data” (Birnbaum 1970, p. 1033). The simple Fisherian test that the 2016 Statement restricts itself to–there’s just the single null hypothesis, without considering alternatives or power–is an even smaller part. But even these have important uses, especially in testing the assumptions of statistical models (misspecification testing). In any event, their limited use is not grounds for misinterpreting their logic. Much less is it grounds to abandon or retire them.

Returning to Matthews:

“Finally, in 2021, the ASA issued [3] another statement, this time from a Presidential Task Force whose focus was not promoting the 2016 principles but addressing concerns” that an editorial in TAS – I’ll call it the ASA Executive Director editorial – “might be seen as official ASA policy.” Why the worry that it might be seen as ASA policy? One reason is that one of its authors was the ASA Executive Director, Wasserstein. A second is that it sounded like a continuation of the 2016 ASA statement – which is ASA policy. According to the 2019 Executive Director’s editorial, the 2016 ASA Statement had “stopped just short of recommending that declarations of ‘statistical significance’ be abandoned”, and they announce: “We take that step here….‘statistically significant’—don’t say it and don’t use it”. The use of p-value thresholds is also verboten. “[T]he problem is not that of having only two labels. Results should not be trichotomized, or indeed categorized into any number of groups…” (2019 Executive Director Editorial, p. 2).

Then-ASA president Karen Kafadar (2019) wrote in an ASA Newsletter:

Many of you have written of instances in which authors and journal editors—and even some ASA members—have mistakenly assumed this editorial represented ASA policy. The mistake is understandable: The editorial was co-authored by an official of the ASA.

So she appointed a Task Force in 2019. Its full (one-page) report is in The Annals of Applied Statistics, and also on my blogpost.[4] The report (Benjamini et al. 2021) begins:

In 2019 the President of the American Statistical Association (ASA) established a task force to address concerns that a 2019 editorial in The American Statistician (an ASA journal) might be mistakenly interpreted as official ASA policy. (The 2019 editorial recommended eliminating the use of “p < 0.05” and “statistically significant” in statistical analysis.) This document is the statement of the task force… (Benjamini et al. 2021)

Among its main points:

  • The use of P-values and significance testing, properly applied and interpreted, are important tools that should not be abandoned…
  • P-values are valid statistical measures that provide convenient conventions for communicating the uncertainty inherent in quantitative results. Indeed, P-values and significance tests are among the most studied and best understood statistical procedures in the statistics literature.
  • They are important tools that have advanced science through their proper application. …(Benjamini et al. 2021)

According to Matthews:

“For those who saw improper use and misinterpretation as the key issue in the p-value debate, this seemed to miss the point.”

But defending the scientific value of a tool when an Executive Director’s editorial is calling for its abandonment is exactly to the point. Forgoing predesignated thresholds obstructs error control. If an account cannot say about any outcomes that they will not count as evidence for a claim—if all thresholds are abandoned—then there is no test of that claim. Giving up on tests means forgoing falsification even of the statistical variety. What’s the point of requiring replication if at no point can you say an effect has failed to replicate?

Maybe the ASA should invite 10-year reflections, or maybe they’re out there and I haven’t seen them.

Please share your queries and thoughts in the comments.

References
Birnbaum, A. (1970), “Statistical Methods in Scientific Inference (letter to the Editor),” Nature 225(5237): 1033.
Mayo, D. G. and Cox, D. R. (2006), “Frequentist Statistics as a Theory of Inductive Inference,” in Optimality: The Second Erich L. Lehmann Symposium, ed. J. Rojo, Lecture Notes-Monograph Series, Institute of Mathematical Statistics (IMS), Vol. 49: 77-97.

Some related posts (search this blog for others):

March 7, 2016: “Don’t throw out the error control baby with the bad statistics bathwater”
May 21, 2024: 5-year review: “Les stats, c’est moi”: We take that step here! (Adopt our fav word or phil stat!)(iii)
June 20, 2021: At long last! The ASA President’s Task Force Statement on Statistical Significance and Replicability
May 31, 2024: 2-4 year review: The Statistics Wars and Intellectual Conflicts of Interest
June 17, 2019: The 2019 ASA executive editor’s guide to p-values: Don’t say what you don’t mean
June 4, 2024: 2-4 year review: commentaries on my editorial
May 15, 2022: 2-4 year review: commentaries on my editorial

My editorial: The statistics wars and intellectual conflicts of interest

 


[0] p-value. The significance test arises to test the conformity of the particular data under analysis with H0 in some respect: To do this we find a function t = t(y) of the data, to be called the test statistic, such that

  • the larger the value of t the more inconsistent are the data with H0;
  • the corresponding random variable T = t(Y) has a (numerically) known probability distribution when H0 is true.

…[We define the] p-value corresponding to any t as p = p(t) = P(T ≥ t; H0). (Mayo and Cox 2006, p. 81)
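
A minimal numerical sketch of this definition (made-up numbers, not from Mayo and Cox): take a one-sided z-test for the mean of a Normal(μ, 1) sample, with H0: μ = 0 and test statistic t = √n·ȳ, so that T is standard normal under H0.

```python
import numpy as np
from scipy.stats import norm

def p_value(ybar, n):
    """p = P(T >= t; H0), with T ~ N(0, 1) under H0 and t = sqrt(n) * ybar."""
    t = np.sqrt(n) * ybar
    return norm.sf(t)  # survival function: P(T >= t) for a standard normal

# The larger t is (the more inconsistent the data are with H0), the smaller p gets:
for ybar in (0.1, 0.3, 0.5):
    print(f"ybar = {ybar:.1f}  ->  p = {p_value(ybar, n=25):.4f}")
```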

[1] The 2016 ASA Statement’s six principles:

  1. P-values can indicate how incompatible the data are with a specified statistical model.
  2. P-values do not measure the probability that the studied hypothesis is true, or the probability that the data were produced by random chance alone.
  3. Scientific conclusions and business or policy decisions should not be based only on whether a p-value passes a specific threshold.
  4. Proper inference requires full reporting and transparency.
  5. A p-value, or statistical significance, does not measure the size of an effect or the importance of a result.
  6. By itself, a p-value does not provide a good measure of evidence regarding a model or hypothesis.

[2] There are a few critics who are falsificationists, notably Andrew Gelman.

[3] The 2019 ASA [president’s] task force submitted its statement to the ASA in 2020, and for a long time its contents were shrouded in mystery. It was eventually published in 2021 in the Annals of Applied Statistics, where Kafadar was editor-in-chief.

[4] The 2019 Task Force members: Linda Young (Co-Chair), Xuming He (Co-Chair), Yoav Benjamini, Dick De Veaux, Bradley Efron, Scott Evans, Mark Glickman, Barry Graubard, Xiao-Li Meng, Vijay Nair, Nancy Reid, Stephen Stigler, Stephen Vardeman, Chris Wikle, Tommy Wright, and Karen Kafadar (ex officio). (Kafadar 2020)

 

Categories: abandon statistical significance, ASA Task Force on Significance and Replicability, P-values, significance tests, stat wars and their casualties | 26 Comments


26 thoughts on “Comments on ‘The ASA p-value statement 10 years on’ (ii)”

  1. Henry Wyneken

    Hi, thanks for the post. I’ve been thinking a lot more about these issues at my work lately. Also, you responded to a post I made a couple months ago, and I didn’t get back to you. I’m sorry about that.

    You mentioned falsification. Have you seen this paper by Amrhein, Trafimow and Greenland? https://www.tandfonline.com/doi/full/10.1080/00031305.2018.1543137#d1e919

    I think that they take a strong stance against making qualitative conclusions from any single study (quote below). I don’t agree with this, but I want to understand their point better. I think they’re trying to say: “if your data do not agree with the consensus, you should not claim to have rejected or falsified anything.” In a more limited way I agree with this. But I think if you have a well-conceived project and a statistically significant result, it is appropriate to say “we have rejected or falsified this theory”.

    “So when can we be confident that we know something? This is the topic of the vast domains of epistemology, scientific inference, and philosophy of science, and thus far beyond the scope of the present paper (and its authors). Nonetheless, a successful theory is one that survives decades of scrutiny. If every study claims to provide decisive results (whether from inferential statistics or narrative impressions—or a confusion of the two), there will be ever more replication failures, which in turn will further undermine public confidence in science. We thus believe that decision makers must act based on cumulative knowledge—which means they should preferably not rely solely on single studies or even single lines of research (even if such contributions may determine a decision when all other evidence appears ambiguous or unreliable).”

    • Henry:

      Thanks for your comment.

      Yes, I know their work. In nearly every non-trivial case, it’s absurd to make statistical inferences from a single result. R.A. Fisher was clear from the start that we need, not isolated significant results:

      …but a reliable method of procedure. In relation to the test of significance, we may say that a phenomenon is experimentally demonstrable when we know how to conduct an experiment which will rarely fail to give us a statistically significant result. (Fisher 1947, p. 14)

      The severe testing account limits itself to inferring there’s an indication of a discrepancy or anomaly until the study has passed an audit. Auditing requires scrutinizing statistical assumptions, checking (perhaps adjusting for) biasing selection effects, and appraising any links between statistical and substantive scientific claims. One is directed by the need to evaluate the capability of the method to have probed relevant errors in relation to the claim inferred.

  2. Henry Wyneken

    Thanks for the response. I’m not sure if we disagree or not. I take you to be saying that simply reading the phrase “p < 0.05” is not enough – we need to check assumptions, biasing selection effects, etc. I definitely agree with that. I took Amrhein et al to be saying that no individual paper or trial can demonstrate (or falsify) some claim, even if it did pass an audit. That’s what I disagree with.

    • Henry:
      If it really passed a stringent audit, I agree. But science is all about building on results, and this depends on reporting how the result can be wrong. In general, I think it depends on what’s known, and what’s going to be inferred or done with the results. The Galleri test did not show a statistically significant reduction in diagnoses of stage 3 or 4 cancers, and the stock went crashing, yet this is not being regarded as grounds to give up on its intended use–nor should it be.

      • Henry Wyneken

        I hadn’t heard about the NHS-Galleri trial before – thanks for bringing it up. I spent some time reading the protocol and various reactions tonight. I agree with your point that, from what I’ve read, the test works in the sense of having a PPV of around 0.6. As for the trial, there was certainly some disagreement in the NYT, from Grail’s “we did see a very compelling clinical benefit” to a researcher who was quoted as saying “The study failed. End of story.”

        As I was reading the protocol, I was struck by how complex the primary endpoint was:

        This objective will be evaluated using a fixed-sequence statistical strategy. First, we will evaluate for a statistically significant difference in a prespecified group of 12 cancer types: lung, head and neck, colorectal, pancreas, myeloma/plasma cell neoplasm, liver/bile duct, stomach, oesophagus, anus, lymphoma, ovary, and bladder [17]. If a significant reduction (p < 0.05) in incidence rates is found, we will then evaluate for a difference in all stageable cancer types (defined as invasive solid cancers (excluding basal cell carcinoma and squamous cell carcinoma of the skin) and haematological malignancies) other than prostate cancer; cancers not routinely staged (e.g., brain cancers and leukaemias) will be excluded. If the second evaluation also demonstrates a significant reduction in incidence rates, then all stageable cancers (including prostate) will be evaluated.

        If I’m reading this literally, their evaluation of their primary aim unfolds in three steps. First, they check for a reduction in incidence in twelve(!!) cancer types. It’s not clear to me if they need to see a reduction in all of them or just one of them, or some combination. Then, if this test passes, they test for a difference in a larger group of cancers. Finally, if this second test passes, then they test for a reduction in all cancers.
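
        A toy sketch of that fixed-sequence (gatekeeping) logic as I read the quoted passage (the p-values below are invented): each hypothesis is tested at the full alpha, but only if every earlier one in the prespecified order was significant. One standard rationale for such designs is that testing in a fixed, prespecified order controls the familywise error rate without splitting alpha.

        ```python
        def fixed_sequence(steps, alpha=0.05):
            """Hypotheses declared significant under a fixed-sequence (gatekeeping) procedure."""
            passed = []
            for name, p in steps:    # the order is prespecified and matters
                if p < alpha:
                    passed.append(name)
                else:
                    break            # stop at the first non-significant step; later steps are never tested
            return passed

        # Invented p-values, purely to show the mechanics:
        steps = [
            ("12 prespecified cancer types", 0.03),
            ("all stageable cancers except prostate", 0.20),
            ("all stageable cancers including prostate", 0.01),
        ]
        print(fixed_sequence(steps))  # -> ['12 prespecified cancer types']
        ```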

        Honestly, I’m pretty new to clinical trials. I’ve had some success, and I think SIST was part of that. But I can’t speak with decades of experience in this space.

        But I really do not like this primary aim. It is way too complicated. How are multiple comparisons controlled for, if at all? You also have the issue of selective inference – how do we model the significance test in the second round, conditional on the result in the first? Finally, they plan for 90% power with a two-sided test. I really don’t get that – they have a one-sided hypothesis.

        Their effect size assumption is:

        the microsimulation model predicts a relative reduction in stages III and IV cancers of approximately 20% after three rounds of MCED testing and one year of follow-up after the third round, with a cumulative incidence of stage III and IV cancers of approximately 1% in the control arm during this follow-up period.

        I’m reading this as saying that the expected proportion of the control group with stage III/IV cancer is 1%, and the proportion of the treatment group is 0.8%. If I plug this into G*Power for the difference between two independent proportions, with alpha = 0.05, power = 0.9 and a two-sided test, I get a required sample size of n = 46856 per group, which they well exceeded.
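
        As a rough cross-check of that figure (my own arithmetic, not the protocol’s or G*Power’s), the standard normal-approximation formula for comparing two independent proportions gives about the same n per group:

        ```python
        from scipy.stats import norm

        def n_per_group(p1, p2, alpha=0.05, power=0.90):
            """Normal-approximation sample size per group for comparing two independent proportions."""
            z_a = norm.ppf(1 - alpha / 2)        # two-sided test, as in the protocol
            z_b = norm.ppf(power)
            variance = p1 * (1 - p1) + p2 * (1 - p2)
            return (z_a + z_b) ** 2 * variance / (p1 - p2) ** 2

        print(round(n_per_group(0.01, 0.008)))   # about 46,853 per group
        ```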

        But I have no clear idea of what exactly has to happen for their primary aim to be a success. If their primary test failed, what link in the chain of reasoning was responsible? Their protocol doesn’t say, and since they haven’t published any data, I can’t even guess.

        ESMO 2025_PF2 Initial Results_Presentation_FINAL CLEAN.pptx

        A Cancer Detection Test Fails in Major Study – The New York Times

        Cell-Free DNA–Based Multi-Cancer Early Detection Test in an Asymptomatic Screening Population (NHS-Galleri): Design of a Pragmatic, Prospective Randomised Controlled Trial – PMC

  3. rkenett

    Mayo

    Some context to all this, summarized in three points listed below, and an open question:

    1. The VAM precedent – The ASA appetite for statements started in 2014 with the ASA Statement on Using Value-Added Models for Educational Assessment (https://www.amstat.org/asa/files/pdfs/POL-ASAVAM-Statement.pdf). This was an official statement, and it proved pretty controversial, indicating (to me) that professional associations should not make such statements. We discussed this in chapter 6 of my book with Galit Shmueli on information quality. ASA should have done some retrospective analysis of the VAM statement experience and not take a position on p-values. Facilitate discussions YES, come out with a statement NO.
    2. AI tsunami – The p-value discussions coincided with the AI tsunami. The unfortunate result was that instead of a focus on the role of statistics in the age of AI, attention was diverted elsewhere. Poor timing, with nonproductive discussions, at least judging by their impact on the practice of statistics.
    3. Foundations – I was involved in organizing a conference on the foundations of applied statistics and a follow-up special issue of a journal. The somewhat surprising result is that very, very few contributors showed interest in foundations per se. The good news is that we got contributions on a very wide range of topics.

    And the open question: We are immersed in verbal analysis with LLMs and other such methods. To me, this means that we need to move the search projector to verbal descriptions of problems and their analysis. This requires a shift away from parametrized statements.

    Will be glad to get feedback on this perspective. My contributions to this shift can be seen in

    https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3035070

    https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3862006

    https://papers.ssrn.com/sol3/papers.cfm?abstract_id=5361981

    https://papers.ssrn.com/sol3/papers.cfm?abstract_id=6366221

    bst

    ron

    • Henry Wyneken

      You wrote “To me, this means that we need to move the search projector to verbal descriptions of problems and their analysis”. I haven’t read your work, but I like the way you put this. I think the words we use are really the most important thing. I can do whatever analysis I want to do, but what will be remembered or acted upon are the words.

    • Ron:
      The ASA (as in Matthews’ article) regards the 2016 statement as its first foray into such pronouncements. You say, “ASA should have done some retrospective analysis of the VAM statement experience and not take a position on p-values.” I recall your mentioning this before; I know nothing about that episode. I kind of doubt that, even now, Wasserstein would say it was a mistake. I’m curious to know what others think. I will look at your links. (I had to approve this because WordPress only allows 3 links.)

    • Ron:
      What’s your “boundary of meaning” analysis of the fundamental problem of how to warrant claims by showing they’ve passed severe tests–that the method used was capable of revealing that purported solutions to problems are flawed? Maybe, deliberately punning on one of your titles, you could write on the severity of probing faults with “fault severity estimation in gears”.

      Clearly, most substantive scientific problems require qualitative assessments of severity, but the logic still parallels the more formal analyses. I think most would agree that if nothing has been done that could have revealed that a proposed solution to a problem is wrong, even if it is, then we don’t have good evidence for it.

      I think you’re much too quick in dumping parametrized statements and the methods for probing them. Maybe if formal statistical arguments were grasped (as in the case of simple reductios), we wouldn’t see some of the flaws found in Matthews.

      • rkenett

        A boundary of meaning (BOM) can be based on directional statements. I proposed using the Type S error of Gelman and Carlin to evaluate such BOMs.

  4. David Colquhoun

    It seems to me to be remarkable that statisticians have been unable to agree on how to solve the most basic question that’s asked of them: is the difference between the means of two independent, normally distributed, samples real or just chance? That being the case, it’s no wonder that users of statistics are confused.

    Perhaps what’s surprising is that the answer to that question depends so strongly on the details of exactly how the question is worded. An experiment that results in p = 0.05 may also give a posterior probability of H0 of at least ~0.3 (J. Berger, V. Johnson). All of these statements are true, and all are relevant to the question being asked.

    Everyone agrees that it’s desirable for experimental results to be accompanied by an estimate of their uncertainty. But the uncertainty in how to express the uncertainty is a lot bigger than most users believe.
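
    One common route to a figure like that ~0.3 (a sketch under assumptions, not necessarily the Berger or Johnson calculation): the Sellke–Bayarri–Berger lower bound on the Bayes factor in favour of H0, −e·p·ln(p) for p < 1/e, combined with 50:50 prior odds on H0.

    ```python
    import math

    def min_posterior_h0(p, prior_odds=1.0):
        """Lower bound on P(H0 | data) implied by the -e*p*ln(p) bound, given prior odds on H0."""
        bf = -math.e * p * math.log(p)   # minimum Bayes factor in favour of H0 (valid for p < 1/e)
        post_odds = prior_odds * bf
        return post_odds / (1.0 + post_odds)

    print(f"p = 0.05  ->  P(H0 | data) >= {min_posterior_h0(0.05):.2f}")  # about 0.29
    ```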

    • David:
      You change the question from “real or chance” when you get the .3 posterior. The same result can also give a posterior probability to H0 of .05. Anyway, I get your point. Unfortunately, instead of explaining how these computations can occur with different questions, we have Bayesians declaring this demonstrates that p-values exaggerate evidence, fail to be useful as evidence etc. I call Excursion 1 of SIST (Statistical Inference as Severe Testing: how to get beyond the statistics wars): “How to Tell What’s True About Statistical Inference”. This was going to be the title of the book. That’s what I mean by “how to get beyond the statistics wars”, namely, by understanding them. Those in power opted instead for the “les stats c’est moi” stance – at least during that period – declaring the tools should be abandoned/retired. Of course, p-values require supplements, as I’ve discussed for decades.

      • David Colquhoun

        That’s exactly my point. Different ways of formulating the same question (are my results just chance?) can give very different answers. The uncertainty in the uncertainty is surprisingly big.

        Of course different answers involve different assumptions. The critical assumption for the posterior of ~0.3 is that it’s reasonable to put a lump of probability on the hypothesis that the true effect is near zero. Sometimes this seems to me to be entirely reasonable, sometimes not. In any case, similar conclusions can be drawn from likelihood ratio arguments without the need for Bayes.

        • David Colquhoun

          I guess that most people would find it a bit unsatisfactory to conclude, after doing an experiment that gives p = 0.049, that the risk of the null hypothesis being nonetheless true is between 4.9% and ~30% (or even higher if the hypothesis is very implausible). The only way to avoid such vague statements is to insist on a much lower p value before claiming that an effect is not mere chance.

          The price of doing this would be that we’d fail to detect more real effects, and the judgement about whether it’s worth that price must depend on the costs of being wrong.

          My conclusion from these arguments is that science is harder than most people think.

  5. nathan229

    Mayo,

    Thank you for your thoughts on Matthews’ 10 year retrospective.

    There is a bit of whitewashing going on in Matthews’ article. Perhaps it is politically uncomfortable to describe the extent to which Dr Wasserstein undertook a zealous campaign, based upon his editorial, to influence publishing practice. There was a spate of emails sent by Wasserstein, with the ASA logo, to the editors of many journals, including Clinical Trials, the New England Journal of Medicine and the Journal of the American Medical Ass’n. The thrust of the emails was to encourage the abandonment of p-values and significance testing. Although the stirring of the pot led to some revisions in statistical guidelines, the clinical medical journals were generally unmoved. There was some tightening of practice guidelines, but statistical significance was largely retained as a part of hypothesis testing practice. See Jonathan A. Cook, Dean A. Fergusson, Ian Ford, Mithat Gonen, Jonathan Kimmelman, Edward L. Korn, and Colin B. Begg, “There is still a place for significance testing in clinical trials,” 16 Clin. Trials 223 (2019); David Harrington, Ralph B. D’Agostino, Sr., Constantine Gatsonis, Joseph W. Hogan, David J. Hunter, Sharon-Lise T. Normand, Jeffrey M. Drazen, and Mary Beth Hamel, “New Guidelines for Statistical Reporting in the Journal,” 381 New Engl. J. Med. 285 (2019).

    The NEJM started to encourage the use of confidence intervals in the late 1970s, and their prevalent usage in modern clinical medical journals has nothing to do with Wasserstein’s advocacy.

    The presidential task force, led by then ASA President Kafadar, was both a substantive and political rebuke to Wasserstein’s advocacy campaign.

    NAS

    • Nathan:
      Thanks for your comment: “There is a BIT of whitewashing going on in Matthews’ article”? Don’t get me started. Well, you already did, and I pasted in a few more links below. There must be 40 or more posts on the issue in this blog, and I’m very thankful to readers like you for a lot of help. You’d think the fact that even letters from high priests on letterhead haven’t diminished the use of p-values might convince p-bashers that the tests have value. Or not? I wrote a very short letter to Significance making the point about the nature of reductio arguments. Do you suppose those who argue this way really think a reductio doesn’t allow falsifying the assumed hypothesis? At this point I really don’t know. I think it must be so, because they seem to be arguing in earnest. You will find the 10 commentaries on my “intellectual conflicts of interest” editorial (ultimately, there were 12) in one of the links I just added. This gives me an idea. Will you send your comment as a letter to the editor of Significance? I can paste the link.

      • nathan229

        Well, we have had fuzzy logic, queer logic, etc., but modus tollens still holds.

        Your point about the durability of significance testing is a good one. I tend to see this issue through the lens of health effects claims and litigation, and I spend most of my time looking at epidemiologic and toxicologic studies. As I noted, I have seen some tightening of statistical practice at the important journals, but I’ve not seen any substantial departure from frequentist methods.

        Send me the link, and I will submit a letter.

        NAS

        • Henry Wyneken

          NAS,

          Have you heard this podcast? What I’m taking from this is that some journals in epidemiology are accepting results with p = 0.1 (or maybe more). I guess I wouldn’t mind so much if they would just own up and say “yep, this is the standard” vs making everything an untransparent judgement call.

          Annals On Call – Improving Research Reports: Avoiding P Values | Annals of Internal Medicine

          • nathan229

            I downloaded the mp3 file, and I’ll give it a listen on my dog walk. In my book, the Annals is probably the best of the big general clinical medical journals. Interestingly, its statistical editor was Stephen Goodman for many years. Despite Steve’s preference for Bayesian methods, the Annals has mostly published studies that use frequentist methods.

        • Nathan:
          They say most letters are around 300 words. Send letters to significance@rss.org.uk. I’d really like to hear about your recent experiences with stat in the law. Want to write a blogpost?

          • nathan229

            Mayo,

            Thanks. The ASA issue died out, and I see very few references to the Wasserstein editorial now. The NASEM/NRC and the Federal Judicial Center released a new edition of the Reference Manual on Scientific Evidence (4th ed.) on 12/31/2025. Lots of issues. You may have seen that 27 State Attorneys General requested the retraction of the climate science chapter. Judge Rosenberg, who is the Director of the FJC, withdrew the chapter. The NRC refused.

            I’ve blogged on the first two chapters so far. The first chapter is on the law that governs the admissibility of expert witness testimony. I posted several pieces about that chapter, which are collected here:

            “REVIEW: Expert Witness Testimony Admissibility in the Reference Manual on Scientific Evidence (4th ed.),” Working Paper, SSRN (Mar. 2026), at https://papers.ssrn.com/sol3/papers.cfm?abstract_id=6480438

            I also blogged about the second chapter, How Science Works, which addresses statistical issues rather incompetently. See https://schachtmanlaw.com/2026/03/12/how-science-works-in-the-new-reference-manual-on-scientific-evidence/

            Right now, I am plowing my way through the Manual’s chapter on epidemiology.

            My working paper on the IARC hazard classification system should be out soon.

            I will be working my way back to some statistical issues, and I’d be happy to post here. I think I will send a letter to editor of Significance. I’ll copy you on what I send.

            NAS

            • Nathan:
              Thanks for the ultra-interesting materials. Why didn’t you let me know about the fallacious construal of p-values so I could have saved a philosopher from looking silly: Schachtman: “STATISTICS DONE POORLY. When it comes to explaining and discussing the role of statistical methods in the scientific process, Weisberg and Thanukos go off the rails. The new chapter is an unmitigated disaster, which should have been corrected in the peer review and oversight process. The first sign of trouble became apparent upon checking the definition of “p-value” in the chapter’s glossary:

              “p-value. A statistic that gives the calculated probability that the null hypothesis could be true even given the observed differences between conditions.”[37]

              This definition is the transposition fallacy on steroids.”

            • I hadn’t seen the call for retraction of the climate science chapter. What is the objection?

  6. David Colquhoun

    I see in these comments a couple of justifications of p values based on their longevity. But surely it’s a possibility that their continued popularity is a result of the fact that, of all the methods that have been proposed, p values are the most ‘optimistic’. Any method that makes it easier to claim that you’ve made a discovery will inevitably be popular with experimentalists.

    It will also be popular with some of the less honest journal editors. When I suggested to someone involved with editors of European journals of pharmacology that they should consider asking for a Bayesian test with a skeptical prior, as well as a p value, his answer was “we can’t do that. It might reduce the impact factors of our journals.” That response strikes me as deeply corrupt.

    • David:
      It’s the opposite: demanding error control–the very basis for criticizing data dredging and multiple testing– enables holding accountable those engaging in corrupt practices. Bayesian accounts that obey the likelihood principle and abandon error probabilities lack the grounds to criticize the cheating. It might be thought that an account insensitive to error probabilities escapes the consequences of gambits enabling high error probabilities. It doesn’t. They just allow the corruption to go unchecked. The so-called replication crisis vindicates error statistical methods.
      It’s not an argument from longevity that vindicates error statistical methods. But, as with randomization, when we find scientists unwilling to give up on hard-earned requirements of error control, it is an obligation to learn just what these methods can do if used correctly.
      The fact that it’s possible to give results a hard time with “skeptical priors” does not give confidence in the probative value of the method. It’s not enough to make it hard to find an effect; it has to be hard for the right reason, and it must be warranted. An artifice such as a spiked prior on the null can result in a high posterior probability on the null, but it doesn’t mean it is warranted. The very fact that any significant result can be “outweighed” (as I think Fisher put it) by a spiked prior on the null just shows what’s wrong with such a trick. Such a method can readily have terrible error probabilities. For readers new to these ideas: please search this blog, which also has links to my Statistical Inference as Severe Testing: how to get beyond the statistics wars (CUP, 2018).

      • David Colquhoun

        You say “An artifice such as spiked priors on the null can result in high posterior probability to the null, but it doesn’t mean it is warranted. “

        I think that in some cases, it is certainly warranted. In the initial stages of testing new drug candidates, the failure rate is high. Most candidates don’t work. In such cases it’s overoptimistic to use a spiked prior with half the prior probability on the null. It’s abundantly warranted in such cases.

        Neither is it the case, surely, that such methods lack error control. Simple simulations allow you to count the number of cases that achieve a specified p value in which H0 is true – the false positive risk. They answer the relevant question, which in many cases is this: if, having observed p = 0.02 (or whatever), I claim to have made a discovery, what is the probability that I’m wrong?
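
        A minimal version of the simulation described above (my own sketch, using p < 0.05 as the criterion, with invented values for the prevalence of real effects and for the effect size when one exists):

        ```python
        import numpy as np
        from scipy.stats import ttest_ind

        rng = np.random.default_rng(0)
        n_per_arm, n_sims = 16, 20_000
        prior_real = 0.1   # assumed fraction of experiments in which a real effect exists
        effect = 1.0       # assumed true effect size (in SD units) when it exists
        alpha = 0.05

        false_pos = true_pos = 0
        for _ in range(n_sims):
            real = rng.random() < prior_real
            a = rng.normal(0.0, 1.0, n_per_arm)
            b = rng.normal(effect if real else 0.0, 1.0, n_per_arm)
            if ttest_ind(a, b).pvalue < alpha:
                if real:
                    true_pos += 1
                else:
                    false_pos += 1

        print(f"false positive risk among p < {alpha} results: {false_pos / (false_pos + true_pos):.2f}")
        ```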
