At long last! The ASA President’s Task Force Statement on Statistical Significance and Replicability

The ASA President’s Task Force Statement on Statistical Significance and Replicability has finally been published. It found a home in The Annals of Applied Statistics, after everyone else they looked to, including the ASA itself, refused to publish it. For background, see this post. I’ll comment on it in a later post. There is also an editorial by Karen Kafadar: “Statistical Significance, P-Values, and Replicability.”

THE ASA PRESIDENT’S TASK FORCE STATEMENT ON STATISTICAL SIGNIFICANCE AND REPLICABILITY

BY YOAV BENJAMINI, RICHARD D. DE VEAUX, BRADLEY EFRON, SCOTT EVANS, MARK GLICKMAN, BARRY I. GRAUBARD, XUMING HE, XIAO-LI MENG, NANCY REID, STEPHEN M. STIGLER, STEPHEN B. VARDEMAN, CHRISTOPHER K. WIKLE, TOMMY WRIGHT, LINDA J. YOUNG AND KAREN KAFADAR (for affiliations see the article)

Over the past decade, the sciences have experienced elevated concerns about replicability of study results. An important aspect of replicability is the use of statistical methods for framing conclusions. In 2019 the President of the American Statistical Association (ASA) established a task force to address concerns that a 2019 editorial in The American Statistician (an ASA journal) might be mistakenly interpreted as official ASA policy. (The 2019 editorial recommended eliminating the use of “p < 0.05” and “statistically significant” in statistical analysis.) This document is the statement of the task force, and the ASA invited us to publicize it. Its purpose is two-fold: to clarify that P-values and significance testing, properly applied and interpreted, are important tools that should not be abandoned, and to briefly set out some principles of sound statistical inference that may be useful to the scientific community.

P-values are valid statistical measures that provide convenient conventions for communicating the uncertainty inherent in quantitative results. Indeed, P-values and significance tests are among the most studied and best understood statistical procedures in the statistics literature. They are important tools that have advanced science through their proper application.

Much of the controversy surrounding statistical significance can be dispelled through a better appreciation of uncertainty, variability, multiplicity, and replicability. The following general principles underlie the appropriate use of P-values and the reporting of statistical significance and apply more broadly to good statistical practice.

Capturing the uncertainty associated with statistical summaries is critical. Different measures of uncertainty can complement one another; no single measure serves all purposes. The sources of variation that the summaries address should be described in scientific articles and reports. Where possible, those sources of variation that have not been addressed should also be identified.

Dealing with replicability and uncertainty lies at the heart of statistical science. Study results are replicable if they can be verified in further studies with new data. Setting aside the possibility of fraud, important sources of replicability problems include poor study design and conduct, insufficient data, lack of attention to model choice without a full appreciation of the implications of that choice, inadequate description of the analytical and computational procedures, and selection of results to report. Selective reporting, even the highlighting of a few persuasive results among those reported, may lead to a distorted view of the evidence. In some settings this problem may be mitigated by adjusting for multiplicity. Controlling and accounting for uncertainty begins with the design of the study and measurement process and continues through each phase of the analysis to the reporting of results. Even in well-designed, carefully executed studies, inherent uncertainty remains, and the statistical analysis should account properly for this uncertainty.
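[To make the multiplicity point concrete: below is a minimal sketch, mine and not part of the Task Force statement, of one widely used adjustment, the Benjamini–Hochberg step-up procedure for controlling the false discovery rate across m simultaneous tests. The p-values and the function name are invented for illustration.]

```python
# A minimal illustrative sketch of the Benjamini-Hochberg step-up procedure,
# one common way of "adjusting for multiplicity". The p-values are made up.
import numpy as np

def benjamini_hochberg(pvals, q=0.05):
    """Return a boolean mask of hypotheses rejected at FDR level q."""
    pvals = np.asarray(pvals)
    m = len(pvals)
    order = np.argsort(pvals)                    # sort p-values ascending
    ranked = pvals[order]
    # Find the largest k with p_(k) <= (k/m) * q; reject hypotheses 1..k.
    below = ranked <= (np.arange(1, m + 1) / m) * q
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])         # index of largest qualifying p
        reject[order[:k + 1]] = True
    return reject

pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205]
print(benjamini_hochberg(pvals, q=0.05))         # only the two smallest survive
```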

The theoretical basis of statistical science offers several general strategies for dealing with uncertainty. P-values, confidence intervals and prediction intervals are typically associated with the frequentist approach. Bayes factors, posterior probability distributions and credible intervals are commonly used in the Bayesian approach. These are some among many statistical methods useful for reflecting uncertainty.
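[Again by way of illustration, and not from the statement: a minimal sketch contrasting the two families of uncertainty measures named above on one invented dataset, 37 successes in 120 trials, using a Wald confidence interval on the frequentist side and an equal-tailed credible interval from a Beta posterior, under an assumed uniform prior, on the Bayesian side.]

```python
# Illustrative only: frequentist vs. Bayesian interval summaries of the same
# invented data, x = 37 successes in n = 120 trials.
from scipy import stats

x, n = 37, 120
phat = x / n

# Frequentist: 95% Wald confidence interval for the proportion.
se = (phat * (1 - phat) / n) ** 0.5
ci = (phat - 1.96 * se, phat + 1.96 * se)

# Bayesian: 95% credible interval under a Beta(1, 1) (uniform) prior,
# whose posterior is Beta(x + 1, n - x + 1).
posterior = stats.beta(x + 1, n - x + 1)
cred = (posterior.ppf(0.025), posterior.ppf(0.975))

print(f"estimate {phat:.3f}")
print(f"95% confidence interval ({ci[0]:.3f}, {ci[1]:.3f})")
print(f"95% credible interval   ({cred[0]:.3f}, {cred[1]:.3f})")
```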

Thresholds are helpful when actions are required. Comparing P-values to a significance level can be useful, though P-values themselves provide valuable information. P-values and statistical significance should be understood as assessments of observations or effects relative to sampling variation, and not necessarily as measures of practical significance. If thresholds are deemed necessary as a part of decision-making, they should be explicitly defined based on study goals, considering the consequences of incorrect decisions. Conventions vary by discipline and purpose of analyses.
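[One more illustrative sketch, not from the statement: report the P-value itself as evidence, and apply a pre-specified threshold only because an action must be taken. The data and the choice alpha = 0.01 are invented.]

```python
# Illustrative only: the p-value is reported as evidence; the threshold is a
# separate, pre-specified decision device. Data and alpha are invented.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
diffs = rng.normal(loc=0.4, scale=1.0, size=30)     # e.g. paired differences

t_stat, p_value = stats.ttest_1samp(diffs, popmean=0.0)

ALPHA = 0.01   # fixed in advance from the consequences of a wrong decision
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")       # the evidence itself
print("act" if p_value <= ALPHA else "do not act")  # the decision rule
```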

In summary, P-values and significance tests, when properly applied and interpreted, increase the rigor of the conclusions drawn from data. Analyzing data and summarizing results are often more complex than is sometimes popularly conveyed. Although all scientific methods have limitations, the proper application of statistical methods is essential for interpreting the results of data analyses and enhancing the replicability of scientific results.

“The most reckless and treacherous of all theorists is he who professes to let facts and figures speak for themselves, who keeps in the background the part he has played, perhaps unconsciously, in selecting and grouping them.” (Alfred Marshall, 1885)


11 thoughts on “At long last! The ASA President’s Task Force Statement on Statistical Significance and Replicability”

  1. It’s good that they eventually succeeded in publishing this, but it’s very disappointing that the ASA Board did not publicize the recommendations of the task force that they themselves formed. They’re calling it a President’s task force, but it was the ASA Board that voted to create it. Please see the link to my earlier blogpost that describes the specific charge to this Task Force. Will future task forces be willing to take the considerable time to carry out a requested mission, knowing that unless it meets with the approval of one side of a debate, it will not see the light of day? The other thing I wonder about is this: two recommendations that were included in Kafadar’s presentation at the JSM do not appear here. (I will study it more carefully.) I have been told by Task Force members that they were asked to revise their recommendations numerous times and that there was an extensive back and forth. So I wonder what, if anything, was left out or changed.

  2. I imagine that at the beginning of their deliberations there was a natural tendency to get “down into the weeds” and come up with some bold, clear new recommendations. Then at some point they realized they’d bitten off more than they could chew, and so finessed matters by writing what in essence is an abstract that is, on its surface, very all-embracing and should offend no one.

    One of their charges, I believe, was to address the “problem” of multiplicities, a big topic. Their single sentence on that matter states that “in some settings” [not specified] “adjusting for multiplicity” [by methods unspecified] might be called for. That would seem to imply that in most or even all settings the making of such adjustments is not obligatory, and is not even possible in an objective manner, the giant literature on the topic notwithstanding. Kudos to the authors for a strong point subtly made!

    Similarly clever and nuanced is their statement that “Thresholds [e.g., alphas] are helpful when actions are required,” implicitly by decision-makers or decision-making bodies. The implication is that thresholds are not, or may never be, helpful for researchers who only research, write, and advise but have no authority to take or implement actions. And from that follows the further implication that researchers never need to specify alphas or to use the phrases “statistically significant” or “statistically nonsignificant.”

    The Committee closed off further discussion among themselves with the statement, “Conventions vary by discipline and purpose of analyses.” Fair enough, but they might have added, “… as a result of the very uneven quality of teaching and statistical analysis from one discipline to another, and the lack of familiarity with the historical literature of statistics on the part of most statistics profs.” It’s not as if there are any rational grounds to justify interdisciplinary variations in “conventions.”

    The bull of “Coup de grâce” remains skewered and unrevived. Time to haul him off to the taxidermist! https://www.tandfonline.com/doi/full/10.1080/00031305.2018.1543616

    • Stuart: Your reconstruction and imagination of how they set about their work doesn’t comport with this committee at all. Take a look again at who is on it. They hadn’t bitten off more than they could chew. I’m extremely impressed with how they set about their work. Almost immediately after the group was formed, we had the pandemic, and I had guessed they would delay their work. Instead, they arrived at a report months before the Nov. date, at the end of July 2020, before the JSM.
      It was the ASA Board that dragged its feet for months, continually asking for revisions, only to later refuse to “endorse and share” their recommendations. This is the first I’m seeing of this. The ASA Board created this group [1] in November 2019 “with a charge to develop thoughtful principles and practices that the ASA can endorse and share with scientists and journal editors.” (AMSTATNEWS, 1 February 2020)
      People I know on the committee were concerned to fix the lopsided impression left by the Wasserstein et al. (2019) editorial, but the confusion as to what is endorsed by the ASA as a whole, and what is espoused by those who wish to replace, abandon, or retire statistical significance, will likely never be straightened out. The reason is simple: Ron Wasserstein doesn’t want them to be distinguished. “Les stats, c’est moi” has worked for him so far. https://errorstatistics.com/2019/12/13/les-stats-cest-moi-we-take-that-step-here-adopt-our-fav-word-or-phil-stat/

  3. It is surprising how brief and self-evident these statements are. Without additional explanation and context, I suspect many non-statisticians will be bewildered as to what the big deal is, potentially distancing statisticians even more from the rest of the scientific community.

  4. It’s too bad that Kafadar quotes Sotomayor in her editorial, given the enormous confusion and controversy her “obiter dicta” remarks caused. (Some allege she was derogating statistical significance tests, when all she was doing was saying, in effect, that in some cases anecdotal evidence can suffice to pinpoint harms, and that a lack of statistical significance doesn’t free a company to ignore them.) You can search this blog, especially under Matrixx, for more on this. Also Schachtman.

  5. Stuart Hurlbert

    Mayo: Do you really think that on any specific issue we should be concerned about “what is endorsed by the ASA as a whole”? Most statisticians, like most non-statistician users of statistics, operate much of the time on the “conventional wisdom” of the moment and have little to no knowledge of the historical literature on any specific issue. And when I, as a non-statistician, started looking in some depth at the statistical literature on specific topics, I soon saw evidence that the review process for statistical papers (in stat journals as well as those of other disciplines) seemed not very effective in filtering out even simple sorts of errors.

    The task force may have “set about its work” well, but Steve Ruberg’s assessment of the end product will strike many as being on-target, especially those who miss Benjamini et al.’s strong but nuanced (and perhaps unintentional!) support for the proposal that researchers never have a need to fix alphas or use the phrase “statistically significant.” NeoFisherianism survives unscathed, as, on that point, do Wasserstein, Lazar & Schirm (2019).

    But I do understand your “Alas!” in the title of your introductory notice of this new thread!

    • People are concerned about the latest bandwagon that is approved by the thought leaders, because they get their grants and publications through their largesse.
      It’s wrong to spoze the Task Force’s job was to kill Wasserstein et al. (2019); its job was to produce a fair and balanced report (just not off the rails like the abandoners).

  6. rkenett

    A side effect of this “statement” is that it finally converges on the proper terminology. The initial discussions in many circles mixed up reproducibility, repeatability, and replicability as a matter of loose speech. We pointed this out in 2015: https://www.nature.com/articles/nmeth.3489?proof=t

    • Ron: The ASA put out its own delineation of replicability/reproducibility, but its definitions differ from yours.

  7. Pingback: Should Bayesian Clinical Trialists Wear Error Statistical Hats? | Error Statistics Philosophy

  8. Pingback: Too little, too late? The “Don’t say significance…” editorial gets a disclaimer | Error Statistics Philosophy
