Sir David Cox: An intellectual interview by Nancy Reid

Hinkley, Reid & Cox

Here’s an in-depth interview with Sir David Cox by Nancy Reid that brings out a rare intellectual understanding and appreciation of some of Cox’s work. Only someone truly in the know could have elicited these fascinating reflections. The interview took place in October 1993 and was published in 1994.

Nancy Reid (1994). A Conversation with Sir David Cox, Statistical Science 9(3): 439-455.


Categories: Sir David Cox | Leave a comment

An interview with Sir David Cox by “Statistics Views” (upon turning 90)

Sir David Cox

Sir David Cox: July 15, 1924-January 18, 2022

The original Statistics Views interview is here:

“I would like to think of myself as a scientist, who happens largely to specialise in the use of statistics”– An interview with Sir David Cox

FEATURES

  • Author: Statistics Views
  • Date: 24 Jan 2014

Sir David Cox is arguably one of the world’s leading living statisticians. He has made pioneering and important contributions to numerous areas of statistics and applied probability over the years, of which perhaps the best known is the proportional hazards model, which is widely used in the analysis of survival data. The Cox point process was named after him.

Sir David studied mathematics at St John’s College, Cambridge and obtained his PhD from the University of Leeds in 1949. He was employed from 1944 to 1946 at the Royal Aircraft Establishment, from 1946 to 1950 at the Wool Industries Research Association in Leeds, and from 1950 to 1955 at the Statistical Laboratory at the University of Cambridge. From 1956 to 1966 he was Reader and then Professor of Statistics at Birkbeck College, London. In 1966, he took up the Chair position in Statistics at Imperial College London, where he later became Head of the Department of Mathematics for a period. In 1988 he became Warden of Nuffield College and was a member of the Department of Statistics at Oxford University. He formally retired from these positions in 1994 but continues to work in Oxford.

Sir David has received numerous awards and honours over the years. He has been awarded the Guy Medals in Silver (1961) and Gold (1973) by the Royal Statistical Society. He was elected Fellow of the Royal Society of London in 1973, was knighted in 1985 and became an Honorary Fellow of the British Academy in 2000. He is a Foreign Associate of the US National Academy of Sciences and a foreign member of the Royal Danish Academy of Sciences and Letters. In 1990 he won the Kettering Prize and Gold Medal for Cancer Research for “the development of the Proportional Hazard Regression Model” and in 2010 he was awarded the Copley Medal by the Royal Society.

He has supervised and collaborated with many students over the years, many of whom are now successful statisticians in their own right, such as David Hinkley and Valerie Isham, a past President of the Royal Statistical Society. Sir David has served as President of the Bernoulli Society, the Royal Statistical Society, and the International Statistical Institute.

This year, Sir David is to turn 90*. Here Statistics Views talks to Sir David about his prestigious career in statistics, working with the late Professor Lindley, his thoughts on Jeffreys and Fisher, being President of the Royal Statistical Society during the Thatcher Years, Big Data and the best time of day to think of statistical methods.

1. With an educational background in mathematics at St John’s College, Cambridge and the University of Leeds, when and how did you first become aware of statistics as a discipline?

I was studying at Cambridge during the Second World War and after two years, one was sent either into the Forces or into some kind of military research establishment. There were very few statisticians then, although it was realised there was a need for statisticians. It was assumed that anybody who was doing reasonably well at mathematics could pick up statistics in a week or so! So, aged 20, I went to the Royal Aircraft Establishment in Farnborough, which is enormous and still there to this day if in a different form, and I worked in the Department of Structural and Mechanical Engineering, doing statistical work. So statistics was forced upon me, so to speak, as was the case for many mathematicians at the time because, aside from UCL, there had been very little teaching of statistics in British universities before the Second World War. Afterwards, it all started to expand.

2. From 1944 to 1946 you worked at the Royal Aircraft Establishment and then from 1946 to 1950 at the Wool Industries Research Association in Leeds. Did statistics have any role to play in your first roles out of university?

Totally. In Leeds, it was largely statistics but also, to some extent, applied mathematics, because there were all sorts of problems connected with the wool and textile industry in terms of the physics, chemistry and biology of the wool; some of these problems were mathematical, but the great majority had a statistical component to them. That experience was not totally uncommon at the time, and many who became academic statisticians had, in fact, spent several years working in a research institute first.

3. From 1950 to 1955, you worked at the Statistical Laboratory at Cambridge and would have been there at the same time as Fisher and Jeffreys. The late Professor Dennis Lindley, who was also there at that time, told me that the best people working on statistics were not in the statistics department at that time. What are your memories when you look back on that time and what do you feel were your main achievements?

Lindley was exactly right about Jeffreys and Fisher. They were two great scientists outside statistics – Jeffreys founded modern geophysics and Fisher was a major figure in genetics. Dennis was a contemporary and very impressive and effective. We were colleagues for five years and our children even played together.

The first lectures on statistics I attended as a student consisted of a short course by Harold Jeffreys, who at the time had a massive reputation as virtually the inventor of modern geophysics. His Theory of Probability, published first as a monograph in physics, was and remains of great importance but, amongst other things, his nervousness limited the appeal of his lectures, to put it gently. I met him personally a couple of times – he was friendly but uncommunicative. When I was later at the Statistical Laboratory in Cambridge, relations between the Director, Dr Wishart, and R.A. Fisher had been at a very low ebb for 20 years and contact between the Lab and Fisher was minimal. I heard him speak on three or four occasions, interesting if often rambunctious occasions. To some, Fisher showed great generosity but not to the Statistics Lab, which was sad in view of the towering importance of his work.

“To some, Fisher showed great generosity but not to the Statistics Lab, which was sad in view of the towering importance of his work.”

Continue reading

Categories: Sir David Cox | 2 Comments

Sir David Cox: Significance tests: rethinking the controversy (September 5, 2018 RSS keynote)

Sir David Cox speaking at the RSS meeting in a session: “Significance Tests: Rethinking the Controversy” on 5 September 2018.

Continue reading

Categories: Sir David Cox, statistical significance tests | Tags: | Leave a comment

Sir David Cox

July 15, 1924-January 18, 2022

 

Categories: Error Statistics | 2 Comments

Nathan Schachtman: Of Significance, Error, Confidence, and Confusion – In the Law and In Statistical Practice (Guest Post)


Nathan Schachtman,  Esq., J.D.
Legal Counsel for Scientific Challenges

Of Significance, Error, Confidence, and Confusion – In the Law and In Statistical Practice

The metaphor of law as an “empty vessel” is frequently invoked to describe the law generally, as well as pejoratively to describe lawyers. The metaphor rings true at least in describing how the factual content of legal judgments comes from outside the law. In many varieties of litigation, not only the facts and data, but the scientific and statistical inferences must be added to the “empty vessel” to obtain a correct and meaningful outcome. Continue reading

Categories: ASA Guide to P-values, ASA Task Force on Significance and Replicability, PhilStat Law, Schachtman | 2 Comments

John Park: Poisoned Priors: Will You Drink from This Well?(Guest Post)


John Park, MD
Radiation Oncologist
Kansas City VA Medical Center

Poisoned Priors: Will You Drink from This Well?

As an oncologist specializing in the field of radiation oncology, I find “The Statistics Wars and Intellectual Conflicts of Interest”, as Prof. Mayo’s recent editorial is titled, to be of practical importance to me and my patients (Mayo, 2021). Some are flirting with Bayesian statistics to move on from statistical significance testing and the use of P-values. In fact, what many consider the world’s preeminent cancer center, MD Anderson, has a strong Bayesian group that completed two early-phase Bayesian studies in radiation oncology that have been published in the most prestigious cancer journal, The Journal of Clinical Oncology (Liao et al., 2018 and Lin et al., 2020). This brings about the hotly contested issue of subjective priors, and much ado has been written about the ability to overcome this problem. Specifically in medicine, one thinks of Spiegelhalter’s classic 1994 paper on reference, clinical, skeptical, and enthusiastic priors, which also uses an example from radiation oncology to make its case (Spiegelhalter et al., 1994). This is nice and all in theory, but what if there is ample evidence that the subject matter experts have major conflicts of interest (COIs) and biases, so that their priors cannot be trusted?

A debate raging in oncology is whether non-invasive radiation therapy is as good as invasive surgery for early-stage lung cancer patients. This is not a trivial question, as postoperative morbidity from surgery can range from 19–50% and 90-day mortality anywhere from 0–5% (Chang et al., 2021). Radiation therapy is highly attractive as there are numerous reports hinting at equal efficacy with far less morbidity. Unfortunately, four major clinical trials were unable to accrue patients for this important question. Why could they not enroll patients, you ask? Long story short, if a patient is referred to radiation oncology and treated with radiation, the surgeon loses out on the revenue, and vice versa. Dr. David Jones, a surgeon at Memorial Sloan Kettering, notes there was no “equipoise among enrolling investigators and medical specialties… Although the reasons are multiple… I believe the primary reason is financial” (Jones, 2015). I am not skirting responsibility for my field’s biases. Dr. Hanbo Chen, a radiation oncologist, notes in his meta-analysis of multiple publications looking at surgery vs radiation that overall survival was associated with the specialty of the first author who published the article (Chen et al., 2018). Perhaps the pen is mightier than the scalpel! Continue reading

Categories: ASA Task Force on Significance and Replicability, Bayesian priors, PhilStat/Med, statistical significance tests | Tags: | 3 Comments

Brian Dennis: Journal Editors Be Warned:  Statistics Won’t Be Contained (Guest Post)



Brian Dennis

Professor Emeritus
Dept Fish and Wildlife Sciences,
Dept Mathematics and Statistical Science
University of Idaho

 

Journal Editors Be Warned:  Statistics Won’t Be Contained

I heartily second Professor Mayo’s call, in a recent issue of Conservation Biology, for science journals to tread lightly on prescribing statistical methods (Mayo 2021).  Such prescriptions are not likely to be constructive;  the issues involved are too vast.

The science of ecology has long relied on innovative statistical thinking.  Fisher himself, inventor of P values and a considerable portion of other statistical methods used by generations of ecologists, helped ecologists quantify patterns of biodiversity (Fisher et al. 1943) and understand how genetics and evolution were connected (Fisher 1930).  G. E. Hutchinson, the “founder of modern ecology” (and my professional grandfather), early on helped build the tradition of heavy consumption of mathematics and statistics in ecological research (Slack 2010). Continue reading

Categories: ecology, editors, Likelihood Principle, Royall | Tags: | 2 Comments

Philip Stark (guest post): commentary on “The Statistics Wars and Intellectual Conflicts of Interest” (Mayo Editorial)


Philip B. Stark
Professor
Department of Statistics
University of California, Berkeley

I enjoyed Prof. Mayo’s comment in Conservation Biology Mayo, 2021 very much, and agree enthusiastically with most of it. Here are my key takeaways and reflections.

Error probabilities (or error rates) are essential to consider. If you don’t give thought to what the data would be like if your theory is false, you are not doing science.

Some applications really require a decision to be made. Does the drug go to market or not? Are the girders for the bridge strong enough, or not? Hence, banning “bright lines” is silly. Conversely, no threshold for significance, no matter how small, suffices to prove an empirical claim. In replication lies truth.

Abandoning P-values exacerbates moral hazard for journal editors, although there has always been moral hazard in the gatekeeping function. Absent any objective assessment of evidence, publication decisions are even more subject to cronyism, “taste”, confirmation bias, etc.

Throwing away P-values because many practitioners don’t know how to use them is perverse. It’s like banning scalpels because most people don’t know how to perform surgery. People who wish to perform surgery should be trained in the proper use of scalpels, and those who wish to use statistics should be trained in the proper use of P-values. Throwing out P-values is self-serving to statistical instruction, too: we’re making our lives easier by teaching less instead of teaching better. Continue reading

Categories: ASA Task Force on Significance and Replicability, editorial, multiplicity, P-values | 4 Comments

Kent Staley: Commentary on “The statistics wars and intellectual conflicts of interest” (Guest Post)



Kent Staley

Professor
Department of Philosophy
Saint Louis University

 

Commentary on “The statistics wars and intellectual conflicts of interest” (Mayo editorial)

In her recent Editorial for Conservation Biology, Deborah Mayo argues that journal editors “should avoid taking sides” regarding “heated disagreements about statistical significance tests.” Particularly, they should not impose bans suggested by combatants in the “statistics wars” on statistical methods advocated by the opposing side, such as Wasserstein et al.’s (2019) proposed ban on the declaration of statistical significance and use of p value thresholds. Were journal editors to adopt such proposals, Mayo argues, they would be acting under a conflict of interest (COI) of a special kind: an “intellectual” conflict of interest.

Conflicts of interest are worrisome because of the potential for bias. Researchers will no doubt be all too familiar with the institutional/bureaucratic requirement of declaring financial interests. Whether such disclosures provide substantive protections against bias or simply satisfy a “CYA” requirement of administrators, the rationale is that assessment of research outcomes can incorporate information relevant to the question of whether the investigators have arrived at a conclusion that overstates (or even fabricates) the support for a claim, when the acceptance of that claim would financially benefit them. This in turn ought to reduce the temptation of investigators to engage in such inflation or fabrication of support. The idea obviously applies quite naturally to editorial decisions as well as research conclusions. Continue reading

Categories: conflicts of interest, editors, intellectual COI, significance tests, statistical tests | 5 Comments

Yudi Pawitan: Behavioral aspects in the statistical significance war-game(Guest Post)


Yudi Pawitan
Professor
Department of Medical Epidemiology and Biostatistics
Karolinska Institutet, Stockholm

 

Behavioral aspects in the statistical significance war-game

I remember with fondness the good old days when the only ‘statistical war’-game was fought between the Bayesian and the frequentist. It was simpler – except when the likelihood principle is thrown in, always guaranteed to confound the frequentist – and the participants were for the most part collegial. Moreover, there was a feeling that it was a philosophical debate. Even though the Bayesian-frequentist war is not fully settled, we can see areas of consensus, for example in objective Bayesianism or in conditional inference. However, on the P-value and statistical significance front, the war looks less simple as it is about statistical praxis; it is no longer Bayesian vs frequentist, with no consensus in sight and with wide implications affecting the day-to-day use of statistics. Typically, a persistent controversy between otherwise sensible and knowledgeable people – thus excluding anti-vaxxers and conspiracy theorists – might indicate we are missing some common perspectives or perhaps the big picture. In complex issues there can be genuinely distinct aspects about which different players disagree and, at some point, agree to disagree. I am not sure we have reached that point yet, with each side still working to persuade the other side of the faults of its position. For now, I can only concur with Mayo’s (2021) appeal that at least the umpires – journal editors – recognize (a) the issue at hand and (b) that genuine debates are still ongoing, so it is not yet time to take sides. Continue reading

Categories: Error Statistics | 7 Comments

January 11: Phil Stat Forum (remote): Statistical Significance Test Anxiety

Special Session of the (remote)
Phil Stat Forum:

11 January 2022

“Statistical Significance Test Anxiety”

TIME: 15:00-17:00 (London, GMT); 10:00-12:00 (EST)

Presenters: Deborah Mayo (Virginia Tech) &
Yoav Benjamini (Tel Aviv University)

Moderator: David Hand (Imperial College London)

Deborah Mayo       Yoav Benjamini        David Hand

Continue reading

Categories: Announcement, David Hand, Phil Stat Forum, significance tests, Yoav Benjamini | Leave a comment

The ASA controversy on P-values as an illustration of the difficulty of statistics


Christian Hennig
Professor
Department of Statistical Sciences
University of Bologna

The ASA controversy on P-values as an illustration of the difficulty of statistics

“I work on Multidimensional Scaling for more than 40 years, and the longer I work on it, the more I realise how much of it I don’t understand. This presentation is about my current state of not understanding.” (John Gower, a world-leading expert on Multidimensional Scaling, at a conference in 2009)

“The lecturer contradicts herself.” (Student feedback to an ex-colleague, for teaching methods and then teaching what problems those methods have)

1 Limits of understanding

Statistical tests and P-values are widely used and widely misused. In 2016, the ASA issued a statement on statistical significance and P-values with the intention to curb misuse while acknowledging their proper definition and potential use. In my view the statement did a rather good job saying things that are worthwhile saying while trying to be acceptable to those who are generally critical of P-values as well as those who tend to defend their use. As was predictable, the statement did not settle the issue. A “2019 editorial” by some of the authors of the original statement (recommending “to abandon statistical significance”) and a 2021 ASA task force statement, much more positive on P-values, followed, showing the level of disagreement in the profession. Continue reading

Categories: ASA Task Force on Significance and Replicability, Mayo editorial, P-values | 3 Comments

E. Ionides & Ya’acov Ritov (Guest Post) on Mayo’s editorial, “The Statistics Wars and Intellectual Conflicts of Interest”


Edward L. Ionides


Director of Undergraduate Programs and Professor,
Department of Statistics, University of Michigan

Ya’acov Ritov
Professor
Department of Statistics, University of Michigan

 

Thanks for the clear presentation of the issues at stake in your recent Conservation Biology editorial (Mayo 2021). There is a need for such articles elaborating and contextualizing the ASA President’s Task Force statement on statistical significance (Benjamini et al, 2021). The Benjamini et al (2021) statement is sensible advice that avoids directly addressing the current debate. For better or worse, it has no references, and just speaks what looks to us like plain sense. However, it avoids addressing why there is a debate in the first place, and what are the justifications and misconceptions that drive different positions. Consequently, it may be ineffective at communicating to those swing voters who have sympathies with some of the insinuations in the Wasserstein & Lazar (2016) statement. We say “insinuations” here since we consider that their 2016 statement made an attack on p-values which was forceful, indirect and erroneous. Wasserstein & Lazar (2016) started with a constructive discussion about the uses and abuses of p-values before moving against them. This approach was good rhetoric: “I have come to praise p-values, not to bury them”, to invert Shakespeare’s Antony. Good rhetoric does not always promote good science, but Wasserstein & Lazar (2016) successfully managed to frame and lead the debate, according to Google Scholar. We warned of the potential consequences of that article and its flaws (Ionides et al, 2017) and we refer the reader to our article for more explanation of these issues (it may be found below). Wasserstein, Schirm and Lazar (2019) made their position clearer, and therefore easier to confront. We are grateful to Benjamini et al (2021) and Mayo (2021) for rising to the debate. Rephrasing Churchill in support of their efforts, “Many forms of statistical methods have been tried, and will be tried in this world of sin and woe. No one pretends that the p-value is perfect or all-wise. Indeed (noting that its abuse has much responsibility for the replication crisis) it has been said that the p-value is the worst form of inference except all those other forms that have been tried from time to time”. Continue reading

Categories: ASA Task Force on Significance and Replicability, editors, P-values, significance tests | 2 Comments

B. Haig on questionable editorial directives from Psychological Science (Guest Post)


Brian Haig, Professor Emeritus
Department of Psychology
University of Canterbury
Christchurch, New Zealand

 

What do editors of psychology journals think about tests of statistical significance? Questionable editorial directives from Psychological Science

Deborah Mayo’s (2021) recent editorial in Conservation Biology addresses the important issue of how journal editors should deal with strong disagreements about tests of statistical significance (ToSS). Her commentary speaks to applied fields, such as conservation science, but it is relevant to basic research, as well as other sciences, such as psychology. In this short guest commentary, I briefly remark on the role played by the prominent journal, Psychological Science (PS), regarding whether or not researchers should employ ToSS. PS is the flagship journal of the Association for Psychological Science, and two of its editors-in-chief have offered explicit, but questionable, advice on this matter. Continue reading

Categories: ASA Task Force on Significance and Replicability, Brian Haig, editors, significance tests | Tags: | 1 Comment

D. Lakens (Guest Post): Averting journal editors from making fools of themselves


Daniël Lakens

Associate Professor
Human Technology Interaction
Eindhoven University of Technology

Averting journal editors from making fools of themselves

In a recent editorial, Mayo (2021) warns journal editors to avoid calls for author guidelines to reflect a particular statistical philosophy, and not to go beyond merely enforcing the proper use of significance tests. That such a warning is needed at all should embarrass anyone working in statistics. And yet, a mere three weeks after Mayo’s editorial was published, the need for such warnings was reinforced when a co-editorial by journal editors from the International Society of Physiotherapy (Elkins et al., 2021) titled “Statistical inference through estimation: recommendations from the International Society of Physiotherapy Journal Editors” stated: “[This editorial] also advises researchers that some physiotherapy journals that are members of the International Society of Physiotherapy Journal Editors (ISPJE) will be expecting manuscripts to use estimation methods instead of null hypothesis statistical tests.” Continue reading

Categories: D. Lakens, significance tests | 3 Comments

Midnight With Birnbaum (Remote, Virtual Happy New Year 2021)!


For the second year in a row, unlike the previous 9 years that I’ve been blogging, it’s not feasible to actually revisit that spot in the road, looking to get into a strange-looking taxi, to head to “Midnight With Birnbaum”. Because of the extended pandemic, I am not going out this New Year’s Eve again, so the best I can hope for is a zoom link of the sort I received last year, not long before midnight, that will link me to a hypothetical party with him. (The pic on the left is the only blurry image I have of the club I’m taken to.) I just keep watching my email to see if a zoom link arrives. My book Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (CUP, 2018) doesn’t include the argument from my article in Statistical Science (“On the Birnbaum Argument for the Strong Likelihood Principle”), but you can read it at that link along with commentaries by A. P. Dawid, Michael Evans, Martin and Liu, D. A. S. Fraser (who sadly passed away in 2021), Jan Hannig, and Jan Bjornstad, and there’s much in it that I’d like to discuss with him. The (Strong) Likelihood Principle (LP or SLP) – whether or not it is named – remains at the heart of many of the criticisms of Neyman-Pearson (N-P) statistics and statistical significance testing in general. Continue reading

Categories: Birnbaum, Birnbaum Brakes, strong likelihood principle | Tags: , , , | 1 Comment

“This is the moment” to discount a positive Covid test (after 5 days) (i)


This week’s big controversy concerns the CDC’s decision to cut the recommended days of isolation for people infected with Covid. CDC director Walensky was all over the news explaining that this “was the moment” for a cut, given the whopping number of new Covid cases (over 400,000 on Dec. 28, exceeding the previous record, which was in the 300,000s).

“In the context of the fact that we were going to have so many more cases — many of those would be asymptomatic or mildly symptomatic — people would feel well enough to be at work, they would not necessarily tolerate being home, and that they may not comply with being home, this was the moment that we needed to make that decision,” Walensky told CNN.

The CDC had already explained last week that “health care workers’ isolation period could be cut to five days, or even fewer, in the event of severe staffing shortages at U.S. hospitals”.

Then, on Monday, the CDC announced that individuals who test positive for Covid-19 and are asymptomatic need to isolate for only five days, not 10 days, citing increasing evidence that people are most infectious in the initial days after developing symptoms.

What’s really causing alarm among many health experts is that the new policy has no requirement for a negative test result, with a rapid test, before ending isolation. Even if you test positive on day 5, the CDC says, you can go about your business, so long as you’re asymptomatic or mildly symptomatic or your “symptoms are resolving” and you wear a mask. I don’t suppose the new looser guidance would result in any pressure being put on a pilot or other worker to get back to work even with some mild brain fog or coughing that seemed to be resolving.[1] Continue reading

Categories: covid-19 | 5 Comments

January 11: Phil Stat Forum (remote)

Special Session of the (remote)
Phil Stat Forum:

11 January 2022

“Statistical Significance Test Anxiety”

TIME: 15:00-17:00 (London, GMT); 10:00-12:00 (EST)

Presenters: Deborah Mayo (Virginia Tech) &
Yoav Benjamini (Tel Aviv University)

Moderator: David Hand (Imperial College London)

Deborah Mayo       Yoav Benjamini        David Hand


Focus of the Session: 

Continue reading

Categories: Announcement, David Hand, Phil Stat Forum, significance tests, Yoav Benjamini | Leave a comment

The Statistics Wars and Intellectual Conflicts of Interest


My editorial in Conservation Biology is published (open access): “The Statistics Wars and Intellectual Conflicts of Interest”. Share your comments, here and/or send a separate item (to Error), if you wish, for possible guest posting*. (All readers are invited to a special January 11 Phil Stat Session with Y. Benjamini and D. Hand described here.) Here’s most of the editorial:

The Statistics Wars and Intellectual Conflicts of Interest

How should journal editors react to heated disagreements about statistical significance tests in applied fields, such as conservation science, where statistical inferences often are the basis for controversial policy decisions? They should avoid taking sides. They should also avoid obeisance to calls for author guidelines to reflect a particular statistical philosophy or standpoint. The question is how to prevent the misuse of statistical methods without selectively favoring one side.

The statistical‐significance‐test controversies are well known in conservation science. In a forum revolving around Murtaugh’s (2014) “In Defense of P values,” Murtaugh argues, correctly, that most criticisms of statistical significance tests “stem from misunderstandings or incorrect interpretations, rather than from intrinsic shortcomings of the P value” (p. 611). However, underlying those criticisms, and especially proposed reforms, are often controversial philosophical presuppositions about the proper uses of probability in uncertain inference. Should probability be used to assess a method’s probability of avoiding erroneous interpretations of data (i.e., error probabilities) or to measure comparative degrees of belief or support? Wars between frequentists and Bayesians continue to simmer in calls for reform.

Consider how, in commenting on Murtaugh (2014), Burnham and Anderson (2014 : 627) aver that “P‐values are not proper evidence as they violate the likelihood principle (Royall, 1997).” This presupposes that statistical methods ought to obey the likelihood principle (LP), a long‐standing point of controversy in the statistics wars. The LP says that all the evidence is contained in a ratio of likelihoods (Berger & Wolpert, 1988). Because this is to condition on the particular sample data, there is no consideration of outcomes other than those observed and thus no consideration of error probabilities. One should not write this off because it seems technical: methods that obey the LP fail to directly register gambits that alter their capability to probe error. Whatever one’s view, a criticism based on presupposing the irrelevance of error probabilities is radically different from one that points to misuses of tests for their intended purpose—to assess and control error probabilities.

Error control is nullified by biasing selection effects: cherry‐picking, multiple testing, data dredging, and flexible stopping rules. The resulting (nominal) p values are not legitimate p values. In conservation science and elsewhere, such misuses can result from a publish‐or‐perish mentality and experimenter’s flexibility (Fidler et al., 2017). These led to calls for preregistration of hypotheses and stopping rules – one of the most effective ways to promote replication (Simmons et al., 2012). However, data dredging can also occur with likelihood ratios, Bayes factors, and Bayesian updating, but the direct grounds to criticize inferences as flouting error probability control are lost. This conflicts with a central motivation for using p values as a “first line of defense against being fooled by randomness” (Benjamini, 2016). The introduction of prior probabilities (subjective, default, or empirical) – which may also be data dependent – offers further flexibility.
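To see in miniature why error control is nullified (a rough sketch added for this post, not part of the editorial’s text): if one searches across, say, 20 independent null effects and reports only the smallest p value, a nominal “p < 0.05” turns up roughly 64% of the time, so the reported p value no longer corresponds to any actual error probability.

```python
# Sketch (illustrative settings, not from the editorial): reporting the
# smallest of 20 p values computed on pure noise yields "p < 0.05" about
# 1 - 0.95**20, i.e. roughly 64% of the time, so the nominal p value is
# not a legitimate p value.
import numpy as np
from scipy.stats import ttest_1samp

rng = np.random.default_rng(0)
n_sims, n_tests, n_obs = 2000, 20, 30
hits = 0
for _ in range(n_sims):
    pvals = [ttest_1samp(rng.normal(0.0, 1.0, n_obs), 0.0).pvalue
             for _ in range(n_tests)]
    hits += min(pvals) < 0.05            # cherry-pick the best-looking test
print(hits / n_sims, 1 - 0.95**20)       # both roughly 0.64
```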

Signs that one is going beyond merely enforcing proper use of statistical significance tests are that the proposed reform is either the subject of heated controversy or is based on presupposing a philosophy at odds with that of statistical significance testing. It is easy to miss or downplay philosophical presuppositions, especially if one has a strong interest in endorsing the policy upshot: to abandon statistical significance. Having the power to enforce such a policy, however, can create a conflict of interest (COI). Unlike a typical COI, this one is intellectual and could threaten the intended goals of integrity, reproducibility, and transparency in science.

If the reward structure is seducing even researchers who are aware of the pitfalls of capitalizing on selection biases, then one is dealing with a highly susceptible group. For a journal or organization to take sides in these long-standing controversies—or even to appear to do so—encourages groupthink and discourages practitioners from arriving at their own reflective conclusions about methods.

The American Statistical Association (ASA) Board appointed a President’s Task Force on Statistical Significance and Replicability in 2019 that was put in the odd position of needing to “address concerns that a 2019 editorial [by the ASA’s executive director (Wasserstein et al., 2019)] might be mistakenly interpreted as official ASA policy” (Benjamini et al., 2021)—as if the editorial continues the 2016 ASA Statement on p-values (Wasserstein & Lazar, 2016). That policy statement merely warns against well‐known fallacies in using p values. But Wasserstein et al. (2019) claim it “stopped just short of recommending that declarations of ‘statistical significance’ be abandoned” and announce taking that step. They call on practitioners not to use the phrase statistical significance and to avoid p value thresholds. Call this the no‐threshold view. The 2016 statement was largely uncontroversial; the 2019 editorial was anything but. The President’s Task Force should be commended for working to resolve the confusion (Kafadar, 2019). Their report concludes: “P-values are valid statistical measures that provide convenient conventions for communicating the uncertainty inherent in quantitative results” (Benjamini et al., 2021). A disclaimer that Wasserstein et al., 2019 was not ASA policy would have avoided both the confusion and the slight to opposing views within the Association.

The no‐threshold view has consequences (likely unintended). Statistical significance tests arise “to test the conformity of the particular data under analysis with [a statistical hypothesis] H0 in some respect to be specified” (Mayo & Cox, 2006: 81). There is a function D of the data, the test statistic, such that the larger its value (d), the more inconsistent are the data with H0. The p value is the probability the test would have given rise to a result more discordant from H0 than d is, were the results due to background or chance variability (as described in H0). In computing p, hypothesis H0 is assumed merely for drawing out its probabilistic implications. If even larger differences than d are frequently brought about by chance alone (p is not small), the data are not evidence of inconsistency with H0. Requiring a low p value before inferring inconsistency with H0 controls the probability of a type I error (i.e., erroneously finding evidence against H0).
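To make the computation concrete (a sketch added for this post with purely illustrative numbers, not part of the editorial): for a one-sided test of a normal mean with known standard deviation, the test statistic is the standardized sample mean and the p value is its upper-tail probability under H0.

```python
# Illustrative sketch: one-sided test of H0: mu <= mu0 vs H1: mu > mu0 for a
# normal mean with known sigma; p is the probability, under mu = mu0, of a
# result at least as discordant with H0 as the one observed.
import math
from scipy.stats import norm

def one_sided_p(xbar, mu0, sigma, n):
    d = (xbar - mu0) / (sigma / math.sqrt(n))   # observed test statistic
    return norm.sf(d)                           # upper-tail probability under H0

# Purely illustrative numbers: n = 25, sigma = 10, mu0 = 0, observed mean 3.92
print(one_sided_p(3.92, 0.0, 10.0, 25))         # about 0.025
```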

Whether interpreting a simple Fisherian or an N‐P test, avoiding fallacies calls for considering one or more discrepancies from the null hypothesis under test. Consider testing a normal mean H0: μ ≤ μ0 versus H1: μ > μ0. If the test would fairly probably have resulted in a smaller p value than observed, if μ = μ1 were true (where μ1 = μ0 + γ, for γ > 0), then the data provide poor evidence that μ exceeds μ1. It would be unwarranted to infer evidence of μ > μ1. Tests do not need to be abandoned when the fallacy is easily avoided by computing p values for one or two additional benchmarks (Burgman, 2005; Hand, 2021; Mayo, 2018; Mayo & Spanos, 2006).

The same is true for avoiding fallacious interpretations of nonsignificant results. These are often of concern in conservation, especially when interpreted as no risks exist. In fact, the test may have had a low probability to detect risks. But nonsignificant results are not uninformative. If the test very probably would have resulted in a more statistically significant result were there a meaningful effect, say μ > μ1 (where μ1 = μ0 + γ, for γ > 0), then the data are evidence that μ < μ1. (This is not to infer μ ≤ μ0.) “Such an assessment is more relevant to specific data than is the notion of power” (Mayo & Cox, 2006: 89). This also matches inferring that μ is less than the upper bound of the corresponding confidence interval (at the associated confidence level) or a severity assessment (Mayo, 2018). Others advance equivalence tests (Lakens, 2017; Wellek, 2017). An N‐P test tells one to specify H0 so that the type I error is the more serious (considering costs); that alone can alleviate problems in the examples critics adduce (H0 would be that the risk exists).
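The benchmark assessments in the last two paragraphs can be computed directly (again a sketch with illustrative numbers added for this post, not part of the editorial): for the same one-sided normal-mean test, one asks how probable a smaller p value than the one observed would have been, were μ = μ1 the true value.

```python
# Illustrative sketch of the benchmark (severity-style) assessments above,
# for the one-sided normal-mean test with known sigma.
import math
from scipy.stats import norm

def prob_smaller_p(xbar, mu1, sigma, n):
    """Probability of a smaller p value (a larger sample mean) than the one
    observed, were mu = mu1 true."""
    return norm.sf((xbar - mu1) / (sigma / math.sqrt(n)))

# Significant case (observed mean 3.92, p ~ 0.025 with mu0 = 0, sigma = 10,
# n = 25): under mu1 = 4 a smaller p value is roughly a coin flip, so the data
# are poor evidence that mu > 4 even though H0: mu <= 0 is rejected.
print(prob_smaller_p(3.92, 4.0, 10.0, 25))   # about 0.52

# Nonsignificant case (observed mean 1.0, p ~ 0.31): under mu1 = 5 the test
# would very probably have given a smaller p value, so the data are evidence
# that mu < 5.
print(prob_smaller_p(1.0, 5.0, 10.0, 25))    # about 0.98
```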

Many think the no‐threshold view merely insists that the attained p value be reported. But leading N‐P theorists already recommend reporting p, which “gives an idea of how strongly the data contradict the hypothesis…[and] enables others to reach a verdict based on the significance level of their choice” (Lehmann & Romano, 2005: 63−64). What the no‐threshold view does, if taken strictly, is preclude testing. If one cannot say ahead of time about any result that it will not be allowed to count in favor of a claim, then one does not test that claim. There is no test or falsification, even of the statistical variety. What is the point of insisting on replication if at no stage can one say the effect failed to replicate? One may argue for approaches other than tests, but it is unwarranted to claim by fiat that tests do not provide evidence. (For a discussion of rival views of evidence in ecology, see Taper & Lele, 2004.)

Many sign on to the no‐threshold view thinking it blocks perverse incentives to data dredge, multiple test, and p hack when confronted with a large, statistically nonsignificant p value. Carefully considered, the reverse seems true. Even without the word significance, researchers could not present a large (nonsignificant) p value as indicating a genuine effect. It would be nonsensical to say that even though more extreme results would frequently occur by random variability alone that their data are evidence of a genuine effect. The researcher would still need a small value, which is to operate with a threshold. However, it would be harder to hold data dredgers culpable for reporting a nominally small p value obtained through data dredging. What distinguishes nominal p values from actual ones is that they fail to meet a prespecified error probability threshold.

 

While it is well known that stopping when the data look good inflates the type I error probability, a strict Bayesian is not required to adjust for interim checking because the posterior probability is unaltered. Advocates of Bayesian clinical trials are in a quandary because “The [regulatory] requirement of Type I error control for Bayesian [trials] causes them to lose many of their philosophical advantages, such as compliance with the likelihood principle” (Ryan et al., 2020: 7).
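A small simulation makes the first point vivid (my own sketch with arbitrary settings, not from the editorial): when a true null hypothesis is tested repeatedly as data accumulate, and sampling stops at the first nominal p < 0.05, the actual type I error rate ends up well above 0.05.

```python
# Sketch (illustrative settings): optional stopping under a true null.
# Peeking after every 10 observations, up to 200 in all, and stopping at the
# first one-sided p < 0.05 inflates the overall type I error rate.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
n_trials, max_n, look_every, alpha = 5000, 200, 10, 0.05
false_positives = 0
for _ in range(n_trials):
    x = rng.normal(0.0, 1.0, max_n)        # H0 true: mean exactly 0, sigma = 1
    for n in range(look_every, max_n + 1, look_every):
        z = x[:n].mean() * np.sqrt(n)      # z statistic at this interim look
        if norm.sf(z) < alpha:             # "significant" -- stop and declare
            false_positives += 1
            break
print(false_positives / n_trials)          # several times the nominal 0.05
```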

It may be retorted that implausible inferences will indirectly be blocked by appropriate prior degrees of belief (informative priors), but this misses the crucial point. The key function of statistical tests is to constrain the human tendency to selectively favor views they believe in. There are ample forums for debating statistical methodologies. There is no call for executive directors or journal editors to place a thumb on the scale. Whether in dealing with environmental policy advocates, drug lobbyists, or avid calls to expel statistical significance tests, a strong belief in the efficacy of an intervention is distinct from its having been well tested. Applied science will be well served by editorial policies that uphold that distinction.

For the acknowledgments and references, see the full editorial here.

I will cite as many (constructive) readers’ views as I can at the upcoming forum with Yoav Benjamini and David Hand on January 11 on zoom (see this post). *Authors of articles I put up as guest posts or cite at the Forum will get a free copy of my Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (CUP, 2018).

Categories: significance tests, spurious p values, stat wars and their casualties, strong likelihood principle | 2 Comments

Bickel’s defense of significance testing on the basis of Bayesian model checking


In my last post, I said I’d come back to a (2021) article by David Bickel, “Null Hypothesis Significance Testing Defended and Calibrated by Bayesian Model Checking” in The American Statistician. His abstract begins as follows:

 

Significance testing is often criticized because p-values can be low even though posterior probabilities of the null hypothesis are not low according to some Bayesian models. Those models, however, would assign low prior probabilities to the observation that the p-value is sufficiently low. That conflict between the models and the data may indicate that the models need revision. Indeed, if the p-value is sufficiently small while the posterior probability according to a model is insufficiently small, then the model will fail a model check….(from Bickel 2021)

Continue reading

Categories: Bayesian/frequentist, D. Bickel, Fisher, P-values | 3 Comments

Blog at WordPress.com.