statistical significance tests

The First 2023 Act of Stat Activist Watch: Statistics ‘for the people’

One of the central roles I proposed for “stat activists” (after our recent workshop, The Statistics Wars and Their Casualties) is to critically scrutinize mistaken claims about leading statistical methods–especially when such claims are put forward as permissible viewpoints to help “the people” assess methods in an unbiased manner. The first act of 2023 under this umbrella concerns an article put forward as “statistics for the people” in a journal of radiation oncology. We are talking here about recommendations for analyzing data for treating cancer!  Put forward as a fair-minded, or at least an informative, comparison of Bayesian vs frequentist methods, I find it to be little more than an advertisement for subjective Bayesian methods in favor of a caricature of frequentist error statistical methods. The journal’s “statistics for the people” section would benefit from a full-blown article on frequentist error statistical methods–not just the letter of ours they recently published–but I’m grateful to Chowdhry and other colleagues who joined me in this effort. You will find our letter below, followed by the authors’ response. You can also find a link to their original “statistics for the people” article in the references. Let me admit right off that my criticisms are a bit stronger than my co-authors.

Two quick additional things that I would wish to tell the authors in relation to their paper and response are:

  1. The application of Bayes rule in their example of diagnostic screening to compute the probability of Covid given a positive test, is just an application of conditional probability to events. It is fully carried out by frequentist means. There’s nothing really “Bayesian” about (frequentist!) diagnostic screening, yet it is a main example relied on to argue against frequentist probability.
  2. There’s no such thing as an uninformative prior–this was given up on over a decade ago.

I would never have come across an article in radiation oncology, if it were not for exchanges between members of a session I was in on “why we disagree” in statistical analysis in that field. I hereby invite all readers and the nearly 1000 registrants from our workshop to alert us throughout the year of interesting items under any of the stat activist banner.

Our letter: Bayesian Versus Frequentist Statistics: In Regard to Fornacon-Wood et al. (PDF of letter)

To the Editor:

We appreciate the authors bringing attention to controversies surrounding the use of Bayesian and frequentist statistics.1 [PDF of paper] There are many benefits to frequentist statistics and disadvantages of Bayesian statistics which were not discussed in the referenced article. We write this accompanying letter to aim for a more balanced presentation of Bayesian and frequentist statistics.

With frequentist statistical significance tests, we can learn whether the data indicate there is a genuine effect or difference in a statistical analysis, as they have the ability to control type I and type II error probabilities.2  Posteriors and Bayes factors do not ensure that the method rarely reports one treatment is better or worse than the other erroneously. A well-known threat to reliable results stems from the ease of using high powered methods to data-dredge and try to hunt for impressive-looking results that fail to replicate with new data. However, the Bayesian assessment is not altered by things like stopping rules-at least not without violating inference by Bayes theorem.3  The frequentist account,4  by contrast, is required to take account of such selection effects in reporting error probabilities. Another caution for those unfamiliar with practical Bayesian research is that estimation of a prior distribution is nontrivial. The priors they discuss are subjective degrees of belief, but there is considerable disagreement about which beliefs are warranted, even among experts. Furthermore, should conclusions differ if the prior is chosen by a radiation oncologist or a surgeon?5  These considerations are some of the reasons why most phase 3 studies in oncology rely on frequentist designs. The article equates frequentist methods with simple null hypothesis testing without alternatives, thereby overlooking hypothesis testing methods that control both type I and II errors. The frequentist takes account of type II errors and the corresponding notion of power. If a test has high power to detect a meaningful effect size, then failing to detect a statistically significant difference is evidence against a meaningful effect. Therefore, a  value that is not small is informative.

The authors write that frequentist methods do not use background information, but this is to ignore the field of experimental design and all of the work that goes into specifying the test (eg, sample size, statistical power) and critically evaluating the connection between statistical and substantive results. An effect that corresponds to a clinically meaningful effect, or effect sizes well warranted from previous studies, would clearly influence the design.

Although their article engenders important discussion, these differences between frequentist and Bayesian methods may help readers understand why so many researchers around the world still prefer the frequentist approach.

  • Amit K. Chowdhry, MD, PhD
    Department of Radiation Oncology
    University of Rochester Medical Center
    Rochester, New York
  • Deborah Mayo,
    Department of Philosophy
    Virginia Tech
    Blacksburg, Virginia
  • Stephanie L. Pugh, PhD
    NRG Oncology Statistical and Data Management Center
    American College of Radiology
    Philadelphia, Pennsylvania
  • John Park, MD
    Department of Radiation Oncology
    Kansas City VA Medical Center
    Kansas City, Missouri
  • Clifton David Fuller, MD,
    Department of Radiation Oncology
    MD Anderson Cancer Center
    Houston, Texas
  • John Kang, MD, PhD
    Department of Radiation Oncology
    University of Washington
    Seattle, Washington

References

  1. Fornacon-Wood I, Mistry H, Johnson-Hart C, et al. Understanding the differences between Bayesian and frequentist statistics. Int J Radiat Oncol Biol Phys 2022;112:1076-1082 .(PDF)
  2. Mayo DG. Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars. Cambridge, UK: Cambridge University Press; 2018.
  3. Ryan EG, Brock K, Gates S, Slade D. Do we need to adjust for interim analyses in a Bayesian adaptive trial design? BMC Med Res Methodol 2020;20:150.
  4. Jennison C, Turnbull BW. Group Sequential Methods With Applications to Clinical Trials. Boca Raton, FL: CRC Press; 1999.
  5. Staley K, Park J. Comment on Mayo’s “The statistics wars and intellectual conflicts of interest”. Conserv Biol 2022;36:e13861.

 

Fornacon-Wood Reply: In Reply to Chowdhry et al. (PDF of letter)

To the Editor:

We thank the authors for their response  to our “statistics for the people” article that aimed to introduce perhaps unfamiliar readers to Bayesian statistics and some potential advantages of their use. We agree that frequentist statistics are a useful and widespread statistical analytical approach, and we are not aiming to revisit the frequentist versus Bayesian arguments that have been well articulated in the literature.  However, there are a couple of points we would like to make.

First, we acknowledge that the majority of phase 3 studies use frequentist designs, and this has the advantage of facilitating meta-analyses using established techniques. However, we would argue that the reason such frequentist designs are so prevalent is likely to have as much to do with convention (from funders/regulators as well as from researchers themselves), the relative exposure of the 2 approaches in educational materials, and the historic difficulties in calculating Bayesian posteriors as it does with the arguments the authors make.

Second, although we agree with Chowdhry et al that there are many challenges associated with the estimation of prior probability distributions, we note that similar arguments apply to effect size estimation, which they cite as a strength of the Neyman-Pearson/null hypothesis significance testing approach (ie, the use of power calculations to limit the risk of type II errors).  We would also re-enforce the point we make in the article about the importance of testing the influence of the prior (represented as the divergent beliefs of the hypothetical radiation oncologist and surgeon in the communication by Chowdhry et al) in the analysis results. If the data are strong enough, the posterior distributions will be in close enough agreement to convince both parties. As we noted, it is also possible to undertake Bayesian analyses without prior information, using an uninformative prior, in which case the analysis is driven directly by the data, as for a frequentist calculation. As an aside, there is continued debate about the relative merits and deficiencies of the different frequentist approaches to significance testing, particularly around the widespread use of the hybrid Neyman-Pearson/null hypothesis significance testing approach.

********************************************************

Please share your constructive remarks in the comments to this post.

 

 

 

Categories: stat activist watch 2023, statistical significance tests | 2 Comments

D. Lakens responds to confidence interval crusading journal editors

.

In what began as a guest commentary on my 2021 editorial in Conservation Biology, Daniël Lakens recently published a response to a recommendation against using null hypothesis significance tests by journal editors from the International Society of Physiotherapy Journal. Here are some excerpts from his full article, replies (‘response to Lakens‘), links and a few comments of my own. Continue reading

Categories: stat wars and their casualties, statistical significance tests | 12 Comments

Join me in reforming the “reformers” of statistical significance tests

.

The most surprising discovery about today’s statistics wars is that some who set out shingles as “statistical reformers” themselves are guilty of misdefining some of the basic concepts of error statistical tests—notably power. (See my recent post on power howlers.) A major purpose of my Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (2018, CUP) is to clarify basic notions to get beyond what I call “chestnuts” and “howlers” of tests. The only way that disputing tribes can get beyond the statistics wars is by (at least) understanding correctly the central concepts. But these misunderstandings are more common than ever, so I’m asking readers to help. Why are they more common (than before the “new reformers” of the last decade)? I suspect that at least one reason is the popularity of Bayesian variants on tests: if one is looking to find posterior probabilities of hypotheses, then error statistical ingredients may tend to look as if that’s what they supply.  Continue reading

Categories: power, SIST, statistical significance tests | Tags: , , | 2 Comments

Sir David Cox: Significance tests: rethinking the controversy (September 5, 2018 RSS keynote)

Sir David Cox speaking at the RSS meeting in a session: “Significance Tests: Rethinking the Controversy” on 5 September 2018.

Continue reading

Categories: Sir David Cox, statistical significance tests | Tags:

John Park: Poisoned Priors: Will You Drink from This Well?(Guest Post)

.

John Park, MD
Radiation Oncologist
Kansas City VA Medical Center

Poisoned Priors: Will You Drink from This Well?

As an oncologist, specializing in the field of radiation oncology, “The Statistics Wars and Intellectual Conflicts of Interest”, as Prof. Mayo’s recent editorial is titled, is one of practical importance to me and my patients (Mayo, 2021). Some are flirting with Bayesian statistics to move on from statistical significance testing and the use of P-values. In fact, what many consider the world’s preeminent cancer center, MD Anderson, has a strong Bayesian group that completed 2 early phase Bayesian studies in radiation oncology that have been published in the most prestigious cancer journal —The Journal of Clinical Oncology (Liao et al., 2018 and Lin et al, 2020). This brings about the hotly contested issue of subjective priors and much ado has been written about the ability to overcome this problem. Specifically in medicine, one thinks about Spiegelhalter’s classic 1994 paper mentioning reference, clinical, skeptical, or enthusiastic priors who also uses an example from radiation oncology (Spiegelhalter et al., 1994) to make his case. This is nice and all in theory, but what if there is ample evidence that the subject matter experts have major conflicts of interests (COIs) and biases so that their priors cannot be trusted?  A debate raging in oncology, is whether non-invasive radiation therapy is as good as invasive surgery for early stage lung cancer patients. This is a not a trivial question as postoperative morbidity from surgery can range from 19-50% and 90-day mortality anywhere from 0–5% (Chang et al., 2021). Radiation therapy is highly attractive as there are numerous reports hinting at equal efficacy with far less morbidity. Unfortunately, 4 major clinical trials were unable to accrue patients for this important question. Why could they not enroll patients you ask? Long story short, if a patient is referred to radiation oncology and treated with radiation, the surgeon loses out on the revenue, and vice versa. Dr. David Jones, a surgeon at Memorial Sloan Kettering, notes there was no “equipoise among enrolling investigators and medical specialties… Although the reasons are multiple… I believe the primary reason is financial” (Jones, 2015). I am not skirting responsibility for my field’s biases. Dr. Hanbo Chen, a radiation oncologist, notes in his meta-analysis of multiple publications looking at surgery vs radiation that overall survival was associated with the specialty of the first author who published the article (Chen et al, 2018). Perhaps the pen is mightier than the scalpel! Continue reading

Categories: ASA Task Force on Significance and Replicability, Bayesian priors, PhilStat/Med, statistical significance tests | Tags:

Should Bayesian Clinical Trialists Wear Error Statistical Hats? (i)

 

I. A principled disagreement

The other day I was in a practice (zoom) for a panel I’m in on how different approaches and philosophies (Frequentist, Bayesian, machine learning) might explain “why we disagree” when interpreting clinical trial data. The focus is radiation oncology.[1] An important point of disagreement between frequentist (error statisticians) and Bayesians concerns whether and if so, how, to modify inferences in the face of a variety of selection effects, multiple testing, and stopping for interim analysis. Such multiplicities directly alter the capabilities of methods to avoid erroneously interpreting data, so the frequentist error probabilities are altered. By contrast, if an account conditions on the observed data, error probabilities drop out, and we get principles such as the stopping rule principle. My presentation included a quote from Bayarri and J. Berger (2004): Continue reading

Categories: multiple testing, statistical significance tests, strong likelihood principle

Invitation to discuss the ASA Task Force on Statistical Significance and Replication

.

The latest salvo in the statistics wars comes in the form of the publication of The ASA Task Force on Statistical Significance and Replicability, appointed by past ASA president Karen Kafadar in November/December 2019. (In the ‘before times’!) Its members are:

Linda Young, (Co-Chair), Xuming He, (Co-Chair) Yoav Benjamini, Dick De Veaux, Bradley Efron, Scott Evans, Mark Glickman, Barry Graubard, Xiao-Li Meng, Vijay Nair, Nancy Reid, Stephen Stigler, Stephen Vardeman, Chris Wikle, Tommy Wright, Karen Kafadar, Ex-officio. (Kafadar 2020)

The full report of this Task Force is in the The Annals of Applied Statistics, and on my blogpost. It begins:

In 2019 the President of the American Statistical Association (ASA) established a task force to address concerns that a 2019 editorial in The American Statistician (an ASA journal) might be mistakenly interpreted as official ASA policy. (The 2019 editorial recommended eliminating the use of “p < 0.05” and “statistically significant” in statistical analysis.) This document is the statement of the task force… (Benjamini et al. 2021)

Continue reading

Categories: 2016 ASA Statement on P-values, ASA Task Force on Significance and Replicability, JSM 2020, National Institute of Statistical Sciences (NISS), statistical significance tests

Why hasn’t the ASA Board revealed the recommendations of its new task force on statistical significance and replicability?

something’s not revealed

A little over a year ago, the board of the American Statistical Association (ASA) appointed a new Task Force on Statistical Significance and Replicability (under then president, Karen Kafadar), to provide it with recommendations. [Its members are here (i).] You might remember my blogpost at the time, “Les Stats C’est Moi”. The Task Force worked quickly, despite the pandemic, giving its recommendations to the ASA Board early, in time for the Joint Statistical Meetings at the end of July 2020. But the ASA hasn’t revealed the Task Force’s recommendations, and I just learned yesterday that it has no plans to do so*. A panel session I was in at the JSM, (P-values and ‘Statistical Significance’: Deconstructing the Arguments), grew out of this episode, and papers from the proceedings are now out. The introduction to my contribution gives you the background to my question, while revealing one of the recommendations (I only know of 2). Continue reading

Categories: 2016 ASA Statement on P-values, JSM 2020, replication crisis, statistical significance tests, straw person fallacy

The Statistics Debate! (NISS DEBATE, October 15, Noon – 2 pm ET)

October 15, Noon – 2 pm ET (Website)

Where do YOU stand?

Given the issues surrounding the misuses and abuse of p-values, do you think p-values should be used? Continue reading

Categories: Announcement, J. Berger, P-values, Philosophy of Statistics, reproducibility, statistical significance tests, Statistics | Tags:

My paper, “P values on Trial” is out in Harvard Data Science Review

.

My new paper, “P Values on Trial: Selective Reporting of (Best Practice Guides Against) Selective Reporting” is out in Harvard Data Science Review (HDSR). HDSR describes itself as a A Microscopic, Telescopic, and Kaleidoscopic View of Data Science. The editor-in-chief is Xiao-li Meng, a statistician at Harvard. He writes a short blurb on each article in his opening editorial of the issue. Continue reading

Categories: multiple testing, P-values, significance tests, Statistics

On Some Self-Defeating Aspects of the ASA’s (2019) Recommendations on Statistical Significance Tests (ii)

.

“Before we stood on the edge of the precipice, now we have taken a great step forward”

 

What’s self-defeating about pursuing statistical reforms in the manner taken by the American Statistical Association (ASA) in 2019? In case you’re not up on the latest in significance testing wars, the 2016 ASA Statement on P-Values and Statistical Significance, ASA I, arguably, was a reasonably consensual statement on the need to avoid some well-known abuses of P-values–notably if you compute P-values, ignoring selective reporting, multiple testing, or stopping when the data look good, the computed P-value will be invalid. (Principle 4, ASA I) But then Ron Wasserstein, executive director of the ASA, and co-editors, decided they weren’t happy with their own 2016 statement because it “stopped just short of recommending that declarations of ‘statistical significance’ be abandoned” altogether. In their new statement–ASA II(note)–they announced: “We take that step here….Statistically significant –don’t say it and don’t use it”.

Why do I say it is a mis-take to have taken the supposed next “great step forward”? Why do I count it as unsuccessful as a piece of statistical science policy? In what ways does it make the situation worse? Let me count the ways. The first is in this post. Others will come in following posts, until I become too disconsolate to continue.[i] Continue reading

Categories: P-values, stat wars and their casualties, statistical significance tests

Blog at WordPress.com.