National Academies of Science: Please Correct Your Definitions of P-values


If you were on a committee to highlight issues surrounding P-values and replication, what’s the first definition you would check? Yes, exactly. Apparently, when it came to the recently released National Academies of Science “Consensus Study,” Reproducibility and Replicability in Science (2019), no one did. Continue reading

Categories: ASA Guide to P-values, Error Statistics, P-values | 20 Comments

Hardwicke and Ioannidis, Gelman, and Mayo: P-values: Petitions, Practice, and Perils (and a question for readers)


The October 2019 issue of the European Journal of Clinical Investigation came out today. It includes the PERSPECTIVE article by Tom Hardwicke and John Ioannidis, an invited editorial by Gelman and one by me:

Petitions in scientific argumentation: Dissecting the request to retire statistical significance, by Tom Hardwicke and John Ioannidis

When we make recommendations for scientific practice, we are (at best) acting as social scientists, by Andrew Gelman

P-value thresholds: Forfeit at your peril, by Deborah Mayo

I blogged excerpts from my preprint, and some related posts, here.

All agree to the disagreement on the statistical and metastatistical issues: Continue reading

Categories: ASA Guide to P-values, P-values, stat wars and their casualties | 16 Comments

(Excerpts from) ‘P-Value Thresholds: Forfeit at Your Peril’ (free access)


A key recognition among those who write on the statistical crisis in science is that the pressure to publish attention-getting articles can incentivize researchers to produce eye-catching but inadequately scrutinized claims. We may see much the same sensationalism in broadcasting metastatistical research, especially if it takes the form of scapegoating or banning statistical significance. A lot of excitement was generated recently when Ron Wasserstein, Executive Director of the American Statistical Association (ASA), and co-editors A. Schirm and N. Lazar, updated the 2016 ASA Statement on P-Values and Statistical Significance (ASA I). In their 2019 interpretation, ASA I “stopped just short of recommending that declarations of ‘statistical significance’ be abandoned,” and in their new statement (ASA II) announced: “We take that step here…. ‘statistically significant’ – don’t say it and don’t use it”. To herald ASA II, and the special issue “Moving to a world beyond ‘p < 0.05’”, the journal Nature requisitioned a commentary from Amrhein, Greenland and McShane, “Retire Statistical Significance” (AGM). With over 800 signatories, the commentary received the imposing title “Scientists rise up against statistical significance”! Continue reading

Categories: ASA Guide to P-values, P-values, stat wars and their casualties | 6 Comments

Gelman blogged our exchange on abandoning statistical significance

A. Gelman

I came across this post on Gelman’s blog today:

Exchange with Deborah Mayo on abandoning statistical significance

It was straight out of blog comments and email correspondence back when the ASA, and significant others, were rising up against the concept of statistical significance. Here it is: Continue reading

Categories: Gelman blogs an exchange with Mayo | Tags: | 7 Comments

All She Wrote (so far): Error Statistics Philosophy: 8 years on


Error Statistics Philosophy: Blog Contents (8 years)
By: D. G. Mayo

Dear Reader: I began this blog 8 years ago (Sept. 3, 2011)! A double celebration is taking place at the Elbar Room Friday evening (a smaller one was held earlier in the week), both for the blog and the one-year anniversary of the physical appearance of my book: Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars [SIST] (CUP). A special rush edition made an appearance on Sept 3, 2018 in time for the RSS meeting in Cardiff. If you’re in the neighborhood, stop by for some Elba Grease.

Ship Statinfasst made its most recent journey at the Summer Seminar for Phil Stat from July 28-Aug 11, co-directed with Aris Spanos. It was one of the main events that occupied my time the past academic year, from planning and advertising it to running it. We had 15 fantastic faculty and post-doc participants (from 55 applicants), and plan to continue the movement to incorporate PhilStat in philosophy and methodology, both in teaching and research. You can find slides from the Seminar (zoom videos, including those of special invited speakers, to come) on SummerSeminarPhilStat.com. Slides and other materials from the Spring Seminar co-taught with Aris Spanos (and cross-listed with Economics) can be found on this blog here.

Continue reading

Categories: 8 year memory lane, blog contents, Metablog | 3 Comments

(one year ago) RSS 2018 – Significance Tests: Rethinking the Controversy


Here’s what I posted 1 year ago on Aug 30, 2018.

 

Day 2, Wednesday 05/09/2018

11:20 – 13:20

Keynote 4 – Significance Tests: Rethinking the Controversy Assembly Room

Speakers:
Sir David Cox, Nuffield College, Oxford
Deborah Mayo, Virginia Tech
Richard Morey, Cardiff University
Aris Spanos, Virginia Tech

Intermingled in today’s statistical controversies are some long-standing, but unresolved, disagreements on the nature and principles of statistical methods and the roles for probability in statistical inference and modelling. In reaction to the so-called “replication crisis” in the sciences, some reformers suggest significance tests are a major culprit. To understand the ramifications of the proposed reforms, there is a pressing need for a deeper understanding of the source of the problems in the sciences and a balanced critique of the alternative methods being proposed to supplant significance tests. In this session speakers offer perspectives on significance tests from statistical science, econometrics, experimental psychology and philosophy of science. There will also be a panel discussion.

Categories: memory lane | Tags: | Leave a comment

Palavering about Palavering about P-values


Nathan Schachtman (who was a special invited speaker at our recent Summer Seminar in Phil Stat) put up a post on his law blog the other day (“Palavering About P-values”) on an article by a statistics professor at Stanford, Helena Kraemer. “Palavering” is an interesting word choice of Schachtman’s. Its range of meanings is relevant here [i]; in my title, I intend both, in turn. You can read Schachtman’s full post here; it begins like this:

The American Statistical Association’s most recent confused and confusing communication about statistical significance testing has given rise to great mischief in the world of science and science publishing. [ASA II 2019] Take for instance last week’s opinion piece about “Is It Time to Ban the P Value?” Please.

Admittedly, their recent statement, which I refer to as ASA II, has seemed to open the floodgates to some very zany remarks about P-values, their meaning and role in statistical testing. Continuing with Schachtman’s post: Continue reading

Categories: ASA Guide to P-values, P-values | Tags: | 12 Comments

A. Spanos: Egon Pearson’s Neglected Contributions to Statistics

Continuing with posts on E.S. Pearson in marking his birthday:

Egon Pearson’s Neglected Contributions to Statistics

by Aris Spanos

    Egon Pearson (11 August 1895 – 12 June 1980) is widely known today for his contribution in recasting Fisher’s significance testing into the Neyman-Pearson (1933) theory of hypothesis testing. Occasionally, he is also credited with contributions in promoting statistical methods in industry and in the history of modern statistics; see Bartlett (1981). What is rarely mentioned is Egon’s early pioneering work on:

(i) specification: the need to state explicitly the inductive premises of one’s inferences,

(ii) robustness: evaluating the ‘sensitivity’ of inferential procedures to departures from the Normality assumption, as well as

(iii) Mis-Specification (M-S) testing: probing for potential departures from the Normality assumption.

Arguably, modern frequentist inference began with the development of various finite sample inference procedures, initially by William Gosset (1908) [of the Student’s t fame] and then Fisher (1915, 1921, 1922a-b). These inference procedures revolved around a particular statistical model, known today as the simple Normal model:

Xk ∼ NIID(μ,σ²),  k = 1, 2, …, n,             (1)

where ‘NIID(μ,σ²)’ stands for ‘Normal, Independent and Identically Distributed with mean μ and variance σ²’. These procedures include the ‘optimal’ estimators of μ and σ², Xbar and s², and the pivotal quantities:

(a) τ(X) = [√n(Xbar − μ)/s] ∼ St(n-1),  (2)

(b) v(X) = [(n-1)s²/σ²] ∼ χ²(n-1),        (3)

where St(n-1) and χ²(n-1) denote the Student’s t and chi-square distributions with (n-1) degrees of freedom. Continue reading
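The pivotal character of (2) and (3), namely that their sampling distributions are free of the unknown μ and σ², can be checked numerically. Here is a minimal simulation sketch (my illustration, not part of Spanos’s post):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
mu, sigma, n, reps = 5.0, 2.0, 10, 20000

# Simulate the simple Normal model: X_k ~ NIID(mu, sigma^2), k = 1,...,n
X = rng.normal(mu, sigma, size=(reps, n))
xbar = X.mean(axis=1)                          # 'optimal' estimator of mu
s2 = X.var(axis=1, ddof=1)                     # 'optimal' estimator of sigma^2

tau = np.sqrt(n) * (xbar - mu) / np.sqrt(s2)   # pivot (2): ~ St(n-1)
v = (n - 1) * s2 / sigma**2                    # pivot (3): ~ chi^2(n-1)

# Whatever mu and sigma were used above, the pivots' first moments should
# match the Student's t and chi-square reference distributions:
print(abs(tau.mean()) < 0.05)          # E[tau] = 0 under St(n-1)
print(abs(v.mean() - (n - 1)) < 0.2)   # E[v] = n - 1 under chi^2(n-1)
```

Rerunning with different values of mu and sigma leaves the checks unchanged, which is exactly what makes (2) and (3) usable for inference about μ and σ².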

Categories: Egon Pearson, Statistics | Leave a comment

Statistical Concepts in Their Relation to Reality–E.S. Pearson

11 August 1895 – 12 June 1980

In marking Egon Pearson’s birthday (Aug. 11), I’ll post some Pearson items this week. They will contain some new reflections on older Pearson posts on this blog. Today, I’m posting “Statistical Concepts in Their Relation to Reality” (Pearson 1955). I’ve linked to it several times over the years, but always find a new gem or two, despite its being so short. E. Pearson rejected some of the familiar tenets that have come to be associated with Neyman and Pearson (N-P) statistical tests, notably the idea that the essential justification for tests resides in a long-run control of rates of erroneous interpretations–what he termed the “behavioral” rationale of tests. In an unpublished letter E. Pearson wrote to Birnbaum (1974), he talks about N-P theory admitting of two interpretations: behavioral and evidential:

“I think you will pick up here and there in my own papers signs of evidentiality, and you can say now that we or I should have stated clearly the difference between the behavioral and evidential interpretations. Certainly we have suffered since in the way the people have concentrated (to an absurd extent often) on behavioral interpretations”.

(Nowadays, it might be said that some people concentrate to an absurd extent on “science-wise error rates” in their view of statistical tests as dichotomous screening devices.) Continue reading

Categories: Egon Pearson, phil/history of stat, Philosophy of Statistics | Tags: , , | Leave a comment

Performance or Probativeness? E.S. Pearson’s Statistical Philosophy: Belated Birthday Wish

E.S. Pearson

This is a belated birthday post for E.S. Pearson (11 August 1895-12 June, 1980). It’s basically a post from 2012 which concerns an issue of interpretation (long-run performance vs probativeness) that’s badly confused these days. I’ll post some Pearson items this week to mark his birthday.

HAPPY BELATED BIRTHDAY EGON!

Are methods based on error probabilities of use mainly to supply procedures which will not err too frequently in some long run? (performance). Or is it the other way round: that the control of long-run error properties is of crucial importance for probing the causes of the data at hand? (probativeness). I say no to the former and yes to the latter. This, I think, was also the view of Egon Sharpe (E.S.) Pearson.

Cases of Type A and Type B

“How far then, can one go in giving precision to a philosophy of statistical inference?” (Pearson 1947, 172)

Pearson considers the rationale that might be given to N-P tests in two types of cases, A and B:

“(A) At one extreme we have the case where repeated decisions must be made on results obtained from some routine procedure…

(B) At the other is the situation where statistical tools are applied to an isolated investigation of considerable importance…?” (ibid., 170)

Continue reading

Categories: E.S. Pearson, Error Statistics | Leave a comment

S. Senn: Red herrings and the art of cause fishing: Lord’s Paradox revisited (Guest post)

 

Stephen Senn
Consultant Statistician
Edinburgh

Background

Previous posts[a],[b],[c] of mine have considered Lord’s Paradox. To recap, this was considered in the form described by Wainer and Brown[1], in turn based on Lord’s original formulation:

A large university is interested in investigating the effects on the students of the diet provided in the university dining halls…. Various types of data are gathered. In particular, the weight of each student at the time of his arrival in September and his weight the following June are recorded. [2](p. 304)

The issue is whether the appropriate analysis should be based on change-scores (weight in June minus weight in September), as proposed by a first statistician (whom I called John) or analysis of covariance (ANCOVA), using the September weight as a covariate, as proposed by a second statistician (whom I called Jane). There was a difference in mean weight between halls at the time of arrival in September (baseline) and this difference turned out to be identical to the difference in June (outcome). It thus follows that, since the analysis of change scores is algebraically equivalent to correcting the difference between halls at outcome by the difference between halls at baseline, the analysis of change scores returns an estimate of zero. The conclusion is thus that, there being no difference between diets, diet has no effect. Continue reading
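The algebraic point can be made concrete with a small simulation of my own construction (not from Senn’s post). Two halls differ by the same amount at baseline and at outcome, so John’s change-score estimate is (up to sampling noise) zero, while Jane’s ANCOVA estimate, which corrects by the within-hall regression slope rather than by 1, is not:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500

# Hall B's mean weight exceeds Hall A's by 5 kg at baseline (September)
sep_A = rng.normal(70, 5, n)
sep_B = rng.normal(75, 5, n)
# June weights regress toward each hall's own mean (slope b < 1), so the
# between-hall difference in June is the same 5 kg as in September
b = 0.6
jun_A = 70 + b * (sep_A - 70) + rng.normal(0, 3, n)
jun_B = 75 + b * (sep_B - 75) + rng.normal(0, 3, n)

# John: change scores (June minus September), compared between halls
change_est = (jun_B - sep_B).mean() - (jun_A - sep_A).mean()

# Jane: ANCOVA -- correct the June difference by the pooled within-hall
# regression slope of June weight on September weight
sep_c = np.concatenate([sep_A - sep_A.mean(), sep_B - sep_B.mean()])
jun_c = np.concatenate([jun_A - jun_A.mean(), jun_B - jun_B.mean()])
beta = (sep_c @ jun_c) / (sep_c @ sep_c)      # approx. b = 0.6 here
ancova_est = (jun_B.mean() - jun_A.mean()) - beta * (sep_B.mean() - sep_A.mean())

print(round(change_est, 2))   # near 0: change scores say "no diet effect"
print(round(ancova_est, 2))   # clearly nonzero: ANCOVA disagrees
```

The two statisticians analyze the same data and reach different answers because change-score analysis corrects with a coefficient of exactly 1, whereas ANCOVA corrects with the estimated slope; which is appropriate is precisely what the paradox is about.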

Categories: Stephen Senn | 26 Comments

Summer Seminar in PhilStat Participants and Special Invited Speakers


Participants in the 2019 Summer Seminar in Philosophy of Statistics

Continue reading

Categories: Summer Seminar in PhilStat | Leave a comment

The NEJM Issues New Guidelines on Statistical Reporting: Is the ASA P-Value Project Backfiring? (i)

The New England Journal of Medicine (NEJM) announced new guidelines for authors for statistical reporting yesterday*. The ASA describes the change as “in response to the ASA Statement on P-values and Statistical Significance and subsequent The American Statistician special issue on statistical inference” (ASA I and II, in my abbreviation). If so, it seems to have backfired. I don’t know all the differences in the new guidelines, but those explicitly noted appear to me to move in the reverse direction from where the ASA I and II guidelines were heading.

The most notable point is that the NEJM highlights the need for error control, especially for constraining the Type I error probability, and pays a lot of attention to adjusting P-values for multiple testing and post hoc subgroups. ASA I included an important principle (#4) that P-values are altered and may be invalidated by multiple testing, but they do not call for adjustments for multiplicity, nor do I find a discussion of Type I or II error probabilities in the ASA documents. NEJM gives strict requirements for controlling family-wise error rate or false discovery rates (understood as the Benjamini and Hochberg frequentist adjustments). Continue reading
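The false-discovery-rate adjustment the NEJM refers to is the Benjamini-Hochberg step-up procedure. A minimal sketch of that adjustment (my illustration, not NEJM’s own code):

```python
import numpy as np

def benjamini_hochberg(pvals, q=0.05):
    """Benjamini-Hochberg step-up procedure, controlling the FDR at level q.

    Sort the m p-values, find the largest k with p_(k) <= (k/m) * q, and
    reject the hypotheses belonging to the k smallest p-values.
    """
    p = np.asarray(pvals, dtype=float)
    m = len(p)
    order = np.argsort(p)
    thresholds = (np.arange(1, m + 1) / m) * q
    below = p[order] <= thresholds
    k = np.max(np.nonzero(below)[0]) + 1 if below.any() else 0
    reject = np.zeros(m, dtype=bool)
    reject[order[:k]] = True
    return reject

# Ten tests: a few small p-values among mostly unremarkable ones
pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.36]
print(benjamini_hochberg(pvals).sum())   # prints 2
```

Note that p = 0.039 survives an unadjusted 0.05 cutoff but not the BH threshold of 0.015 for the third-smallest of ten p-values; this is the kind of multiplicity control the NEJM guidelines demand and the ASA documents leave aside.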

Categories: ASA Guide to P-values | 23 Comments

B. Haig: The ASA’s 2019 update on P-values and significance (ASA II)(Guest Post)

Brian Haig, Professor Emeritus
Department of Psychology
University of Canterbury
Christchurch, New Zealand

The American Statistical Association’s (ASA) recent effort to advise the statistical and scientific communities on how they should think about statistics in research is ambitious in scope. It is concerned with an initial attempt to depict what empirical research might look like in “a world beyond p<0.05” (The American Statistician, 2019, 73(S1), 1-401). Quite surprisingly, the main recommendation of the lead editorial article in the Special Issue of The American Statistician devoted to this topic (Wasserstein, Schirm, & Lazar, 2019; hereafter, ASA II) is that “it is time to stop using the term ‘statistically significant’ entirely” (p. 2). ASA II acknowledges the controversial nature of this directive and anticipates that it will be subject to critical examination. Indeed, in a recent post, Deborah Mayo began her evaluation of ASA II by making constructive amendments to three recommendations that appear early in the document (‘Error Statistics Philosophy’, June 17, 2019). These amendments have received numerous endorsements, and I record mine here. In this short commentary, I briefly state a number of general reservations that I have about ASA II. Continue reading

Categories: ASA Guide to P-values, Brian Haig | Tags: | 33 Comments

SIST: All Excerpts and Mementos: May 2018-June 2020 (updated)

Introduction & Overview

The Meaning of My Title: Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars* 05/19/18

Blurbs of 16 Tours: Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (SIST) 03/05/19

Statistical Inference as Severe Testing: Preface

Excursion 1

EXCERPTS

Tour I (full proofs)

Excursion 1 Tour I: Beyond Probabilism and Performance: Severity Requirement (1.1) 09/08/18

Excursion 1 Tour I (2nd stop): Probabilism, Performance, and Probativeness (1.2) 09/11/18

Excursion 1 Tour I (3rd stop): The Current State of Play in Statistical Foundations: A View From a Hot-Air Balloon (1.3) 09/15/18

Tour II (full proofs)

Excursion 1 Tour II: Error Probing Tools versus Logics of Evidence-Excerpt 04/04/19

Souvenir C: A Severe Tester’s Translation Guide (Excursion 1 Tour II) 11/08/18

MEMENTOS

Tour Guide Mementos (Excursion 1 Tour II of How to Get Beyond the Statistics Wars) 10/29/18

 

Excursion 2

EXCERPTS

Tour I (full proofs)

Excursion 2: Taboos of Induction and Falsification: Tour I (first stop) 09/29/18

“It should never be true, though it is still often said, that the conclusions are no more accurate than the data on which they are based” (Keepsake by Fisher, 2.1) 10/05/18

Tour II (full proofs)

Excursion 2 Tour II (3rd stop): Falsification, Pseudoscience, Induction (2.3) 10/10/18

MEMENTOS

Tour Guide Mementos and Quiz 2.1 (Excursion 2 Tour I Induction and Confirmation) 11/14/18

Mementos for Excursion 2 Tour II Falsification, Pseudoscience, Induction 11/17/18

 

Excursion 3

EXCERPTS

Tour I (full proofs)

Where are Fisher, Neyman, Pearson in 1919? Opening of Excursion 3 11/30/18

Neyman-Pearson Tests: An Episode in Anglo-Polish Collaboration: Excerpt from Excursion 3 (3.2) 12/01/18

First Look at N-P Methods as Severe Tests: Water plant accident [Exhibit (i) from Excursion 3] 12/04/18

Tour II (full proofs)

It’s the Methods, Stupid: Excerpt from Excursion 3 Tour II (Mayo 2018, CUP) 12/11/18

60 Years of Cox’s (1958) Chestnut: Excerpt from Excursion 3 tour II. 12/29/18

Tour III (full proofs)

Capability and Severity: Deeper Concepts: Excerpts From Excursion 3 Tour III 12/20/18

MEMENTOS

Memento & Quiz (on SEV): Excursion 3, Tour I 12/08/18

Mementos for “It’s the Methods, Stupid!” Excursion 3 Tour II (3.4-3.6) 12/13/18

Tour Guide Mementos From Excursion 3 Tour III: Capability and Severity: Deeper Concepts 12/26/18

 

Excursion 4

EXCERPTS

Tour I (Full Excursion 4 Tour I)

Excerpt from Excursion 4 Tour I: The Myth of “The Myth of Objectivity” (Mayo 2018, CUP) 12/26/18

Tour II (Full Excursion 4 Tour II)

Excerpt from Excursion 4 Tour II: 4.4 “Do P-Values Exaggerate the Evidence?” 01/10/19

Tour III (Full proofs of Excursion 4 Tour III)

Tour IV (Full proofs of Excursion 4 Tour IV)

Excerpt from Excursion 4 Tour IV: More Auditing: Objectivity and Model Checking 01/27/19

MEMENTOS

Mementos from Excursion 4: Blurbs of Tours I-IV 01/13/19

 

Excursion 5

Tour I (full proofs)

(Full) Excerpt: Excursion 5 Tour I — Power: Pre-data and Post-data (from “SIST: How to Get Beyond the Stat Wars”) 04/27/19

Tour II (full proofs)

(Full) Excerpt. Excursion 5 Tour II: How Not to Corrupt Power (Power Taboos, Retro Power, and Shpower) 06/07/19

Tour III (full proofs)

Deconstructing the Fisher-Neyman conflict wearing Fiducial glasses + Excerpt 5.8 from SIST 02/23/19

 

Excursion 6

Tour I (full proofs): What Ever Happened to Bayesian Foundations?

Tour II (full proofs)

Excerpts: Souvenir Z: Understanding Tribal Warfare +  6.7 Farewell Keepsake from SIST + List of Souvenirs 05/04/19 (full excerpt)

 

*Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (Mayo, CUP 2018).

Categories: SIST, Statistical Inference as Severe Testing | 11 Comments

The Statistics Wars: Errors and Casualties


Had I been scheduled to speak later at the 12th MuST Conference & 3rd Workshop “Perspectives on Scientific Error” in Munich, rather than on day 1, I could have (constructively) illustrated some of the errors and casualties by reference to a few of the conference papers that discussed significance tests. (Most gave illuminating discussions of such topics as replication research, the biases that discredit meta-analysis, statistics in the law, formal epistemology [i]). My slides follow my abstract. Continue reading

Categories: slides, stat wars and their casualties | Tags: | Leave a comment

“The 2019 ASA Guide to P-values and Statistical Significance: Don’t Say What You Don’t Mean” (Some Recommendations)(ii)

Some have asked me why I haven’t blogged on the recent follow-up to the ASA Statement on P-Values and Statistical Significance (Wasserstein and Lazar 2016)–hereafter, ASA I. They’re referring to the editorial by Wasserstein, R., Schirm, A. and Lazar, N. (2019)–hereafter, ASA II–opening a special on-line issue of over 40 contributions responding to the call to describe “a world beyond P < 0.05”.[1] Am I falling down on the job? Not really. All of the issues are thoroughly visited in my Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars, SIST (2018, CUP). I invite interested readers to join me on the statistical cruise therein.[2] As the ASA II authors observe: “At times in this editorial and the papers you’ll hear deep dissonance, the echoes of ‘statistics wars’ still simmering today (Mayo 2018)”. True, and reluctance to reopen old wounds has only allowed them to fester. However, I will admit that when new attempts at reforms are put forward, a philosopher of science who has written on the statistics wars ought to weigh in on the specific prescriptions/proscriptions, especially when a jumble of fuzzy conceptual issues is interwoven through a cacophony of competing reforms. (My published comment on ASA I, “Don’t Throw Out the Error Control Baby With the Bad Statistics Bathwater,” is here.) Continue reading

Categories: ASA Guide to P-values, Statistics | 97 Comments

(Full) Excerpt. Excursion 5 Tour II: How Not to Corrupt Power (Power Taboos, Retro Power, and Shpower)


The concept of a test’s power is still being corrupted in the myriad ways discussed in 5.5 and 5.6. I’m excerpting all of Tour II of Excursion 5, as I did with Tour I (of Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars 2018, CUP)*. Originally the two Tours comprised just one, but in finalizing corrections, I decided the two together made too long a slog, and I split it up. Because this was done at the last minute, some of the terms in Tour II rely on their introductions in Tour I. Here’s how it starts:

5.5 Power Taboos, Retrospective Power, and Shpower

Let’s visit some of the more populous tribes who take issue with power – by which we mean ordinary power – at least its post-data uses. Power Peninsula is often avoided due to various “keep out” warnings and prohibitions, or researchers come during planning, never to return. Why do some people consider it a waste of time, if not totally taboo, to compute power once we know the data? A degree of blame must go to N-P, who emphasized the planning role of power, and only occasionally mentioned its use in determining what gets “confirmed” post-data. After all, it’s good to plan how large a boat we need for a philosophical excursion to the Lands of Overlapping Statistical Tribes, but once we’ve made it, it doesn’t matter that the boat was rather small. Or so the critic of post-data power avers. A crucial disanalogy is that with statistics, we don’t know that we’ve “made it there,” when we arrive at a statistically significant result. The statistical significance alarm goes off, but you are not able to see the underlying discrepancy that generated the alarm you hear. The problem is to make the leap from the perceived alarm to an aspect of a process, deep below the visible ocean, responsible for its having been triggered. Then it is of considerable relevance to exploit information on the capability of your test procedure to result in alarms going off (perhaps with different decibels of loudness), due to varying values of the parameter of interest. There are also objections to power analysis with insignificant results. Continue reading
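The alarm metaphor can be put in numbers. A minimal sketch (my illustration, not an excerpt from the book): for a one-sided Normal test of H0: μ ≤ 0 at α = 0.025, power is the probability the significance alarm sounds, computed as a function of the true discrepancy μ:

```python
import numpy as np
from scipy import stats

alpha, n, sigma = 0.025, 100, 1.0
se = sigma / np.sqrt(n)
# Reject H0: mu <= 0 whenever the sample mean exceeds this cutoff
cutoff = stats.norm.ppf(1 - alpha) * se

def power(mu):
    """Probability the significance 'alarm' goes off when the true mean is mu."""
    return 1 - stats.norm.cdf((cutoff - mu) / se)

for mu in [0.0, 0.1, 0.2, 0.3]:
    print(f"mu = {mu:.1f}: power = {power(mu):.3f}")
```

At μ = 0 the alarm rate is just α; it climbs steeply with the discrepancy. Post-data, this same curve is what lets one reason from a triggered (or silent) alarm back to which underlying discrepancies the test was capable of detecting.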

Categories: fallacy of non-significance, power, Statistical Inference as Severe Testing | Leave a comment

Don’t let the tail wag the dog by being overly influenced by flawed statistical inferences


An article [i],“There is Still a Place for Significance Testing in Clinical Trials,” appearing recently in Clinical Trials, while very short, effectively responds to recent efforts to stop error statistical testing [ii]. We need more of this. Much more. The emphasis in this excerpt is mine: 

Much hand-wringing has been stimulated by the reflection that reports of clinical studies often misinterpret and misrepresent the findings of the statistical analyses. Recent proposals to address these concerns have included abandoning p-values and much of the traditional classical approach to statistical inference, or dropping the concept of statistical significance while still allowing some place for p-values. How should we in the clinical trials community respond to these concerns? Responses may vary from bemusement, pity for our colleagues working in the wilderness outside the relatively protected environment of clinical trials, to unease about the implications for those of us engaged in clinical trials….

However, we should not be shy about asserting the unique role that clinical trials play in scientific research. A clinical trial is a much safer context within which to carry out a statistical test than most other settings. Properly designed and executed clinical trials have opportunities and safeguards that other types of research do not typically possess, such as protocolisation of study design; scientific review prior to commencement; prospective data collection; trial registration; specification of outcomes of interest including, importantly, a primary outcome; and others. For randomised trials, there is even more protection of scientific validity provided by the randomisation of the interventions being compared. It would be a mistake to allow the tail to wag the dog by being overly influenced by flawed statistical inferences that commonly occur in less carefully planned settings….

Furthermore, the research question addressed by clinical trials (comparing alternative strategies) fits well with such an approach and the corresponding decision-making settings (e.g. regulatory agencies, data and safety monitoring committees and clinical guideline bodies) are often ones within which statistical experts are available to guide interpretation. The carefully designed clinical trial based on a traditional statistical testing framework has served as the benchmark for many decades. It enjoys broad support in both the academic and policy communities. There is no competing paradigm that has to date achieved such broad support. The proposals for abandoning p-values altogether often suggest adopting the exclusive use of Bayesian methods. For these proposals to be convincing, it is essential their presumed superior attributes be demonstrated without sacrificing the clear merits of the traditional framework. Many of us have dabbled with Bayesian approaches and find them to be useful for certain aspects of clinical trial design and analysis, but still tend to default to the conventional approach notwithstanding its limitations. While attractive in principle, the reality of regularly using Bayesian approaches on important clinical trials has been substantially less appealing – hence their lack of widespread uptake.

The issues that have led to the criticisms of conventional statistical testing are of much greater concern where statistical inferences are derived from observational data. … Even when the study is appropriately designed, there is also a common converse misinterpretation of statistical tests whereby the investigator incorrectly infers and reports that a non-significant finding conclusively demonstrates no effect. However, it is important to recognise that an appropriately designed and powered clinical trial enables the investigators to potentially conclude there is ‘no meaningful effect’ for the principal analysis.[iii]  More generally, these problems are largely due to the fact that many individuals who perform statistical analyses are not sufficiently trained in statistics. It is naive to suggest that banning statistical testing and replacing it with greater use of confidence intervals, or Bayesian methods, or whatever, will resolve any of these widespread interpretive problems. Even the more modest proposal of dropping the concept of ‘statistical significance’ when conducting statistical tests could make things worse. By removing the prespecified significance level, typically 5%, interpretation could become completely arbitrary. It will also not stop data-dredging, selective reporting, or the numerous other ways in which data analytic strategies can result in grossly misleading conclusions.

These considerations notwithstanding, the field of clinical trials is in rapid evolution and it is entirely possible and appropriate that the statistical framework used for their evaluation must also change. However, such evolution should emerge from careful methodological research and open-minded, self-critical enquiry. We earnestly hope that Clinical Trials will continue to be seen as a natural academic home for exploration and debate about alternative statistical frameworks for making inferences from clinical trials. The Editors welcome articles that evaluate or debate the merits of such alternative paradigms along with the conventional one within the context of clinical trials. Especially welcome are exemplar trial articles and those which are illustrated using practical examples from clinical trials that permit a realistic evaluation of the strengths and weaknesses of the approach.

You can read the full article here.

We may reject, with reasonable severity, the supposition that promoting correctly interpreted statistical tests is the real goal of the most powerful leaders of the movement to Stop Statistical Tests. The goal is just to stop (error) statistical tests altogether.[iv] That today’s CI leaders advance this goal is especially unwarranted and self-defeating, in that confidence intervals are just inversions of N-P tests, and were developed at the same time by the same man (Neyman) who developed (with E. Pearson) the theory of error statistical tests. See this recent post.

Reader: I’ve placed on draft a number of posts while traveling in England over the past week, but haven’t had the chance to read them over, or find pictures for them. This will soon change, so stay tuned!

*****************************************************************

[i] Jonathan A Cook, Dean A Fergusson, Ian Ford , Mithat Gonen, Jonathan Kimmelman, Edward L Korn and Colin B Begg (2019). “There is still a place for significance testing in clinical trials”, Clinical Trials 2019, Vol. 16(3) 223–224.

[ii] Perhaps we should call those driven to Stop Error Statistical Tests “Obsessed”. I thank Nathan Schachtman for sending me the article.

[iii] It’s disappointing how many critics of tests seem unaware of this simple power-analysis point, and how it avoids egregious fallacies of non-rejection, or of moderate P-values. It precisely follows simple significance-test reasoning. The severity account that I favor gives a more custom-tailored approach that is sensitive to the actual outcome. (See, for example, Excursion 5 of Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (2018, CUP).)

[iv] Bayes factors, like other comparative measures, are not “tests”, and do not falsify (even statistically). They can only say one hypothesis or model is better than a selected other hypothesis or model, based on some^ selected criteria. They can both (all) be improbable, unlikely, or terribly tested. One can always add a “falsification rule”, but it must be shown that the resulting test avoids frequently passing/failing claims erroneously. 

^The Anti-Testers would have to say “arbitrary criterion”, to be consistent with their considering any P-value “arbitrary”, and denying that a statistically significant difference, reaching any P-value, indicates a genuine difference from a reference hypothesis.

Categories: statistical tests | Leave a comment

SIST: All Excerpts and Mementos: May 2018-May 2019

view from a hot-air balloon

Introduction & Overview

The Meaning of My Title: Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars* 05/19/18

Blurbs of 16 Tours: Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (SIST) 03/05/19

 

Excursion 1

EXCERPTS

Tour I (full proofs)

Excursion 1 Tour I: Beyond Probabilism and Performance: Severity Requirement (1.1) 09/08/18

Excursion 1 Tour I (2nd stop): Probabilism, Performance, and Probativeness (1.2) 09/11/18

Excursion 1 Tour I (3rd stop): The Current State of Play in Statistical Foundations: A View From a Hot-Air Balloon (1.3) 09/15/18

Tour II (full proofs)

Excursion 1 Tour II: Error Probing Tools versus Logics of Evidence-Excerpt 04/04/19

Souvenir C: A Severe Tester’s Translation Guide (Excursion 1 Tour II) 11/08/18

MEMENTOS

Tour Guide Mementos (Excursion 1 Tour II of How to Get Beyond the Statistics Wars) 10/29/18

 

Excursion 2

EXCERPTS

Tour I (full proofs)

Excursion 2: Taboos of Induction and Falsification: Tour I (first stop) 09/29/18

“It should never be true, though it is still often said, that the conclusions are no more accurate than the data on which they are based” (Keepsake by Fisher, 2.1) 10/05/18

Tour II (full proofs)

Excursion 2 Tour II (3rd stop): Falsification, Pseudoscience, Induction (2.3) 10/10/18

MEMENTOS

Tour Guide Mementos and Quiz 2.1 (Excursion 2 Tour I Induction and Confirmation) 11/14/18

Mementos for Excursion 2 Tour II Falsification, Pseudoscience, Induction 11/17/18

 

Excursion 3

EXCERPTS

Tour I (full proofs)

Where are Fisher, Neyman, Pearson in 1919? Opening of Excursion 3 11/30/18

Neyman-Pearson Tests: An Episode in Anglo-Polish Collaboration: Excerpt from Excursion 3 (3.2) 12/01/18

First Look at N-P Methods as Severe Tests: Water plant accident [Exhibit (i) from Excursion 3] 12/04/18

Tour II (full proofs)

It’s the Methods, Stupid: Excerpt from Excursion 3 Tour II (Mayo 2018, CUP) 12/11/18

60 Years of Cox’s (1958) Chestnut: Excerpt from Excursion 3 tour II. 12/29/18

Tour III (full proofs)

Capability and Severity: Deeper Concepts: Excerpts From Excursion 3 Tour III 12/20/18

MEMENTOS

Memento & Quiz (on SEV): Excursion 3, Tour I 12/08/18

Mementos for “It’s the Methods, Stupid!” Excursion 3 Tour II (3.4-3.6) 12/13/18

Tour Guide Mementos From Excursion 3 Tour III: Capability and Severity: Deeper Concepts 12/26/18

 

Excursion 4

EXCERPTS

Tour I (full proofs)

Excerpt from Excursion 4 Tour I: The Myth of “The Myth of Objectivity” (Mayo 2018, CUP) 12/26/18

Tour II (full proofs)

Excerpt from Excursion 4 Tour II: 4.4 “Do P-Values Exaggerate the Evidence?” 01/10/19

Tour III (full proofs)

Tour IV (full proofs)

Excerpt from Excursion 4 Tour IV: More Auditing: Objectivity and Model Checking 01/27/19

MEMENTOS

Mementos from Excursion 4: Blurbs of Tours I-IV 01/13/19

 

Excursion 5

Tour I (full proofs)

(full) Excerpt: Excursion 5 Tour I — Power: Pre-data and Post-data (from “SIST: How to Get Beyond the Stat Wars”) 04/27/19

Tour II (full proofs)

Tour III (full proofs)

Deconstructing the Fisher-Neyman conflict wearing Fiducial glasses + Excerpt 5.8 from SIST 02/23/19

 

Excursion 6

Tour I (full proofs)

Tour II (full proofs)

Excerpts: Souvenir Z: Understanding Tribal Warfare + 6.7 Farewell Keepsake from SIST + List of Souvenirs 05/04/19

*Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (Mayo, CUP 2018).

Categories: SIST, Statistical Inference as Severe Testing | 1 Comment
