Here’s an in-depth interview of Sir David Cox by Nancy Reid that brings out a rare intellectual understanding and appreciation of Cox’s work. Only someone truly in the know could have elicited these fascinating reflections. The interview took place in October 1993 and was published in 1994.

Nancy Reid (1994). A Conversation with Sir David Cox, *Statistical Science* 9(3): 439-455.

The original *Statistics Views* interview is here:

“I would like to think of myself as a scientist, who happens largely to specialise in the use of statistics” – An interview with Sir David Cox

Author: Statistics Views
Date: 24 Jan 2014

Sir David Cox is arguably one of the world’s leading living statisticians. He has made pioneering and important contributions to numerous areas of statistics and applied probability over the years, of which perhaps the best known is the proportional hazards model, which is widely used in the analysis of survival data. The Cox point process was named after him.

Sir David studied mathematics at St John’s College, Cambridge and obtained his PhD from the University of Leeds in 1949. He was employed from 1944 to 1946 at the Royal Aircraft Establishment, from 1946 to 1950 at the Wool Industries Research Association in Leeds, and from 1950 to 1955 worked at the Statistical Laboratory at the University of Cambridge. From 1956 to 1966 he was Reader and then Professor of Statistics at Birkbeck College, London. In 1966, he took up the Chair position in Statistics at Imperial College London, where he later became Head of the Department of Mathematics for a period. In 1988 he became Warden of Nuffield College and was a member of the Department of Statistics at Oxford University. He formally retired from these positions in 1994 but continues to work in Oxford.

Sir David has received numerous awards and honours over the years. He has been awarded the Guy Medals in Silver (1961) and Gold (1973) by the Royal Statistical Society. He was elected Fellow of the Royal Society of London in 1973, was knighted in 1985 and became an Honorary Fellow of the British Academy in 2000. He is a Foreign Associate of the US National Academy of Sciences and a foreign member of the Royal Danish Academy of Sciences and Letters. In 1990 he won the Kettering Prize and Gold Medal for Cancer Research for “the development of the Proportional Hazard Regression Model” and in 2010 he was awarded the Copley Medal by the Royal Society.

He has supervised and collaborated with many students over the years, many of whom are now successful statisticians in their own right, such as David Hinkley and Valerie Isham, a Past President of the Royal Statistical Society. Sir David has served as President of the Bernoulli Society, the Royal Statistical Society, and the International Statistical Institute.

This year, Sir David is to turn 90*. Here Statistics Views talks to Sir David about his prestigious career in statistics, working with the late Professor Lindley, his thoughts on Jeffreys and Fisher, being President of the Royal Statistical Society during the Thatcher Years, Big Data and the best time of day to think of statistical methods.

1. With an educational background in mathematics at St John’s College, Cambridge and the University of Leeds, when and how did you first become aware of statistics as a discipline?

I was studying at Cambridge during the Second World War and after two years, one was sent either into the Forces or into some kind of military research establishment. There were very few statisticians then, although it was realised there was a need for statisticians. It was assumed that anybody who was doing reasonably well at mathematics could pick up statistics in a week or so! So, aged 20, I went to the Royal Aircraft Establishment in Farnborough, which is enormous and still there to this day, if in a different form, and I worked in the Department of Structural and Mechanical Engineering, doing statistical work. So statistics was forced upon me, so to speak, as was the case for many mathematicians at the time because, aside from UCL, there had been very little teaching of statistics in British universities before the Second World War. Afterwards, it all started to expand.

2. From 1944 to 1946 you worked at the Royal Aircraft Establishment and then from 1946 to 1950 at the Wool Industries Research Association in Leeds. Did statistics have any role to play in your first roles out of university?

Totally. In Leeds, it was largely statistics but also to some extent, applied mathematics because there were all sorts of problems connected with the wool and textile industry in terms of the physics, chemistry and biology of the wool and some of these problems were mathematical but the great majority had a statistical component to them. That experience was not totally uncommon at the time and many who became academic statisticians had, in fact, spent several years working in a research institute first.

3. From 1950 to 1955, you worked at the Statistical Laboratory at Cambridge and would have been there at the same time as Fisher and Jeffreys. The late Professor Dennis Lindley, who was also there at that time, told me that the best people working on statistics were not in the statistics department at that time. What are your memories when you look back on that time and what do you feel were your main achievements?

Lindley was exactly right about Jeffreys and Fisher. They were two great scientists outside statistics – Jeffreys founded modern geophysics and Fisher was a major figure in genetics. Dennis was a contemporary and very impressive and effective. We were colleagues for five years and our children even played together.

The first lectures on statistics I attended as a student consisted of a short course by Harold Jeffreys, who at the time had a massive reputation as virtually the inventor of modern geophysics. His *Theory of Probability*, published first as a monograph in physics, was and remains of great importance but, amongst other things, his nervousness limited the appeal of his lectures, to put it gently. I met him personally a couple of times – he was friendly but uncommunicative. When I was later at the Statistical Laboratory in Cambridge, relations between the Director, Dr Wishart, and R.A. Fisher had been at a very low ebb for 20 years and contact between the Lab and Fisher was minimal. I heard him speak on three or four occasions, interesting if often rambunctious occasions. To some, Fisher showed great generosity but not to the Statistics Lab, which was sad in view of the towering importance of his work.

“To some, Fisher showed great generosity but not to the Statistics Lab, which was sad in view of the towering importance of his work.”

4. You have also taught at many institutions over the years including Princeton, Berkeley, Cambridge, Birkbeck College and Imperial College London before joining Nuffield College here at Oxford. Over the years, how did the teaching of statistics evolve and adapt to meet the changing needs of students?

As I said, when I was a student, there was very little teaching of statistics in British universities. It has evolved over the years and was first primarily a postgraduate subject, taken after reading mathematics if you wished to be a scientific statistician, rather than an economic statistician. You took at least a diploma, or a one-year MA or a doctorate. Then statistics came into mathematics degrees, partly to make them more appealing to a wider audience and that has changed, so nowadays, most statisticians start fairly intensively in an undergraduate course, which has some advantages and some disadvantages.

5. How did your teaching and research motivate and influence each other? Did you get research ideas from statistics and incorporate them into your teaching?

Much of my research has come from talking to scientists. Sometimes ideas come from lecturing because the way to really understand a subject is to give a course of lectures on it, and sometimes that throws up more theoretical issues that might not otherwise have been thought of. The overwhelming majority of my work comes either directly or indirectly from some physical, biological or medical problem, but in many different ways – casual conversation sometimes.

6. You have taught many who have gone on to make their own important contributions towards statistics, such as David Hinkley, who is now renowned for his work on bootstrap methods, and Valerie Isham, who recently served as the President of the Royal Statistical Society. The late Professor Dennis Lindley told me that “One of the joys of life is teaching a really good graduate.” Would you be in agreement?

I would say that one of the joys of life is learning from a good graduate. The first duty of a doctoral student is clearly to educate their supervisor, which my own doctoral students have done. Hopefully, they’ve learnt a bit from me occasionally! I am absolutely certain that I learnt a lot from Valerie, for instance, as we’ve worked together on and off for around forty years. Having such students is fantastic. I have been fortunate and happy as at Birkbeck, I had largely evening students. They were highly motivated and very able. Many of the graduate students at Imperial came from other places, or were international students of high standard. Also rather importantly, my students have been very nice people!

7. You are best known for your innovative work on the proportional hazards model, which is now widely used in the analysis of survival data. What research led to this discovery? What set you on the right path?

Two different things – first of all, I had been interested in reliability in an industrial context since I worked in the textile industry and, to some extent, when I was at the Royal Aircraft Establishment, when strength of materials was important. I had a long interest in testing strength and reliability, which is also related to looking at the duration of life. Then the more specific thing was that at least four or five people from different areas in the US and the UK said that they had a certain kind of data with people’s survival times under various treatments and all sorts of further aspects with regards to the patient, but they did not know how to analyse this data. The work led to one paper but the reason it is so popular is totally accidental. Other people wrote easy-to-use software implementing the method, which is not my speciality at all. I had software to implement it but it was not suitable for general use. In a sense, it became almost too easy and so people just started to use the method because it was painless! The proportion of my life that I spent working on the proportional hazards model is, in fact, very small. I had an idea of how to solve it but I could not complete the argument and so it took me about four years on and off, often thinking about it before I went to bed.

(Editor’s note: I tell Sir David that I now have a picture in my head of him pacing the house in his pyjamas at four o’clock in the morning with a hot chocolate in one hand, thinking statistical thoughts, and he laughs.)

Not quite! It was right before going to bed. There is a well-established literature in mathematics of people who think about a problem, do not know how to solve it, go to bed thinking about it and wake up the next morning with a solution. It’s not easily explicable, but if you’re wide awake, you perhaps argue down the conventional lines of argument, whereas what you need to do is something a bit crazy, which you’re more likely to do if you’re half-awake or asleep. Presumably that’s the explanation!

“The proportion of my life that I spent working on the proportional hazards model is, in fact, very small. I had an idea of how to solve it but I could not complete the argument and so it took me about four years on and off…”

8. The getstats campaign by the Royal Statistical Society focuses on improving the public’s understanding of statistics in everyday life. Would you have any advice for them, and what areas should they focus on that you feel there should be more awareness of in statistics?

Of course, to some extent, the notion that some very simple and non-technical ideas about collecting and analysing data should be taught to children is very good, but then at the other end, there are people who are highly educated but have no sense of statistical arguments, such as many lawyers and senior civil servants. The RSS has done an excellent job in trying to interest MPs in statistical ideas. Both these extremes are important. Sending a very general message to people as far as possible helps, but sending very focussed messages to key groups of people is more important in the short term. You do see on TV, for instance, that basic principles are being ignored in collecting and analysing evidence. Of course, it’s easy for me to stand on the sidelines and criticise.

9. You have served as the President for several societies over the years including the Royal Statistical Society, the Bernoulli Society and the International Statistical Institute. What are your memories of your time at the RSS, for instance, and how did you help the society adapt to the changing needs of the statistical community?

It was a bit different in my time. I was the President of the RSS at the time when Margaret Thatcher was PM and massacring the civil service and in particular, the government’s statistical service and there was a lot of activity going on about that. But it was done more by going to see people, talking to them and trying to influence them than writing them formal letters. While, of course, openness is a good thing, it is not always the best way to get results. People can take up inflexible attitudes but if you talk to them quietly in private, they are perhaps then more open to new ideas.

10. You have received numerous awards, from the Guy Medals in Silver and Gold to the Marvin Zelen Leadership Award. Is there a particular award that you are most proud of?

It would be the Copley Medal from the Royal Society, as it was for general science. It is very nice to receive these awards, of course, perhaps particularly because they represent the fact that your friends have put in efforts on your behalf. Therefore, what I really value is not the award but the appreciation of friends and colleagues. That is what is important, but the awards and degrees are certainly an honour. If you overvalue an award, that can be dangerous.

11. You have written many papers and books. What are the ones that you are most proud of?

The one I’m going to write next, of course! I have flitted about all sorts of different topics, different fields of application, different parts of the subject, and so on. Really, I don’t look back very much.

At the moment, I have just finished a book with a colleague called *Case Control Studies*, which is mainly about epidemiological investigation.

“…what I really value is not the award but the appreciation of friends and colleagues.”

12. What has been the best book on statistics that you have ever read?

I honestly don’t know. The position of books is interesting, as when I first started in my career, there were hardly any books at all that treated statistics in a modern way. Then they very slowly began to appear and now there is a flood of them. The standard of what is published now is, on the whole, very high, but there is too much to keep up with!

13. What has been the most exciting development that you have worked on in statistics during your career?

I’m not sure about exciting (!) but one of the most demanding was being involved in the issues about Bovine TB in badgers, which went on for about ten years. It involved a great deal of work, which was very interesting and instructive in all sorts of ways, and not just in statistics.

I’ve been involved in other government-based topics, such as the group which made the first predictions for the AIDS epidemic, which was also very interesting.

14. At the recent Future of Statistical Sciences workshop, there was much talk about Big Data and a concern that many ‘hot areas’ such as big data/data analytics, which have close connections with statistics and the statistical sciences, are being monopolised by computer scientists and/or engineers. What do statisticians need to do to ensure their work and their profession gets noticed?

Do better quality work, which I don’t mean as a criticism of what is done at the moment but rather, do high quality work that is important in some sense, either intellectually or practically in particular fields. Part of the problem is that relatively speaking, there are not that many statisticians who are trained to the level needed.

15. What do you think the most important recent developments in the field have been? What do you think will be the most exciting and productive areas of research in statistics during the next few years?

The most immediately important is, as you said, Big Data, which will bring forward new ideas, but that does not mean that old ideas from the more traditional part of the subject are useless. It is the most obvious and biggest challenge.

Ideally, we should be looking at very important practical problems in a number of different fields, see some sort of common element, and build the ideas that are necessary in order to tackle any issues that arise. You should not tackle just one issue successfully but a collection of issues – the Big Data aspect is undoubtedly one common theme. It goes beyond statistics – to what extent can Big Data replace small, carefully planned investigations which are much more sharply focussed on a very specific issue?

My intrinsic feeling is that more fundamental progress is more likely to be made by very focused, relatively small scale, intensive investigations than collecting millions of bits of information on millions of people, for example. It’s possible to collect such large data now, but it depends on the quality, which may be very high or not, and if it is not, what do you do about it?

16. Do you think over the years too much research has focussed on less important areas of statistics? Should the gap between research and applications be reduced? How so, and by whom?

In British statistics at the moment, the gap between theory and applications is difficult. Theory has almost disappeared. Almost everyone is working on applications. The issue is whether this has gone a bit too far. Everyone has to find their best way of working in principle, but if you are a theoretician, then to have really serious contact with applications is, for most people, extremely fruitful and indeed almost essential. Some individuals will think that is not true and that it may be better that they sit at their desk and think great thoughts, so to speak! That is another way of working, but the danger then is that the great thoughts may have no bearing on the real world. But for most people, it is the interplay which is crucial. Maybe I am not imaginative enough to just sit there and think to myself of abstract problems which are really important enough to spend time on! Others are undoubtedly much better at that, which may be their better way of thinking.

“My intrinsic feeling is that more fundamental progress is more likely to be made by very focused, relatively small scale, intensive investigations than collecting millions of bits of information on millions of people, for example. It’s possible to collect such large data now, but it depends on the quality, which may be very high or not, and if it is not, what do you do about it?”

17. What do you see as the greatest challenges facing the profession of statisticians in the coming years?

I know the term ‘the profession of statistics’ is widely used but I am not that keen on it. I would like to think of myself as a scientist, who happens largely to specialise in the use of statistics. That is a question of words to some extent. One answer would be that the challenge, preferably for an academic statistician, is to be involved in several fields of application in a non-trivial sense and combine the stimulus and the contribution you can make that way with theoretical contributions that those contacts will suggest. As I said before, I don’t think you can lay down a rule as to what is most productive for everyone.

18. Are there people or events that have been influential in your career? Also, given that you are one of the most well-respected statisticians of your generation and many statisticians look up to you, whose work do you admire (it can be someone working now, or someone whose work you admired greatly earlier on in your career)?

The person who influenced me by far the most was Henry Daniels. I went to work with him at the Wool Research Association and then he went to Cambridge and from there, Birmingham. He was both a very clever mathematician and a very good statistician. He was also actually a very skilful experimental physicist, which is interesting. At the Wool Research Association, he was a statistician but he also ran a measurement lab (what they called a fibre-measurement lab, where he developed all sorts of clever measurement techniques).

Maurice Bartlett, who was at Manchester, UCL and then here in Oxford was another major influence and then in the background were people like R.A. Fisher and Jeffreys. I met Jeffreys a few times and went to his lectures – although he wrote beautifully, his lectures were really rather impossible, which was sad.

Otherwise, I have learnt from almost everybody that I’ve had contact with and I certainly include students amongst them.

19. If you had not got involved in the field of statistics, what do you think you would have done? (Is there another field in which you could have seen yourself making an impact?)

I thought I would go into either theoretical physics or pure mathematics but I’m very glad I didn’t. I’m not clever enough for either of those fields. They are both fascinating subjects, but statistics is a much more easily satisfying life because there are so many different directions in which to go. Whereas in pure mathematics, you are possibly doing things that only two other people in the world may understand, and that requires a certain austerity of spirit which I do not possess! I also find quantum mechanics absolutely fascinating but I am not original enough to do striking things in that field.

*He turned 90 in July 2014.


We were part of a session:

**Speakers:**

**Sir David Cox, Nuffield College, Oxford**

**Deborah Mayo, Virginia Tech**

**Richard Morey, Cardiff University**

**Aris Spanos, Virginia Tech**

All 4 talks are on this post:

It was the same day and conference that my book, *Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars* (2018, CUP), first made its physical appearance:

Blurb for session:


Nathan Schachtman, Esq., J.D.

Legal Counsel for Scientific Challenges

**Of Significance, Error, Confidence, and Confusion – In the Law and In Statistical Practice**

The metaphor of law as an “empty vessel” is frequently invoked to describe the law generally, as well as pejoratively to describe lawyers. The metaphor rings true at least in describing how the factual content of legal judgments comes from outside the law. In many varieties of litigation, not only the facts and data, but the scientific and statistical inferences must be added to the “empty vessel” to obtain a correct and meaningful outcome.

Once upon a time, the expertise component of legal judgments came from so-called expert witnesses, who were free to opine about the claims of causality solely by showing that they had more expertise than the lay jurors. In Pennsylvania, for instance, the standard to qualify witnesses to give “expert opinions” was to show that they had “a reasonable pretense to expertise on the subject.”

In the 19th and the first half of the 20th century, causal claims, whether of personal injuries, discrimination, or whatever, virtually always turned on a conception of causation as necessary and sufficient to bring about the alleged harm. In discrimination claims, plaintiffs pointed to the “inexorable zero,” in cases in which no Black citizen was ever seated on a grand jury, in a particular county, since the demise of Reconstruction. In health claims, the mode of reasoning usually followed something like Koch’s postulates.

The second half of the 20th century was marked by the rise of stochastic models in our understanding of the world. The consequence is that statistical inference made its way into the empty vessel. The rapid introduction of statistical thinking into the law did not always go well. In a seminal discrimination case, *Castaneda v. Partida*, 430 U.S. 482 (1977), in an opinion by Associate Justice Blackmun, the court calculated a binomial probability for observing the sample result (rather than a result at least as extreme as such a result), and mislabeled the measurement “standard deviations” rather than standard errors:

“As a general rule for such large samples, if the difference between the expected value and the observed number is greater than two or three standard deviations, then the hypothesis that the jury drawing was random would be suspect to a social scientist. The 11-year data here reflect a difference between the expected and observed number of Mexican-Americans of approximately 29 standard deviations. A detailed calculation reveals that the likelihood that such a substantial departure from the expected value would occur by chance is less than 1 in 10^140.” *Id*. at 430 U.S. 482, 496 n.17 (1977). Justice Blackmun was graduated from Harvard College, *summa cum laude*, with a major in mathematics.
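The arithmetic behind the opinion’s “approximately 29 standard deviations” can be reconstructed from the figures reported in the case (870 persons summoned for grand jury service over the 11 years, of whom 339 were Mexican-American, against a population share of 79.1%). A minimal sketch in Python, assuming those figures:

```python
import math

# Figures as reported in the Castaneda v. Partida opinion:
# 870 persons summoned over 11 years, 339 Mexican-American,
# against a 79.1% Mexican-American population share.
n, observed, p = 870, 339, 0.791

expected = n * p                  # expected count under random selection (~688)
se = math.sqrt(n * p * (1 - p))  # binomial standard error (~12)
z = (expected - observed) / se   # the "approximately 29 standard deviations"

print(round(expected, 1), round(se, 1), round(z, 1))
```

This is, of course, the standard error of the binomial count; the opinion’s label “standard deviations” is precisely the mislabeling noted above.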

Despite the extreme statistical disparity in the 11-year run of grand juries, Justice Blackmun’s opinion provoked a robust rejoinder, not only on the statistical analysis, but on the Court’s failure to account for obvious omitted confounding variables in its simplistic analysis. And then there were the inconvenient facts that Mr. Partida was a rapist, indicted by a grand jury (50% with “Hispanic” names), which was appointed by jury commissioners (3/5 Hispanic). Partida was convicted by a petit jury (7/12 Hispanic), in front of a trial judge who was Hispanic, and he was denied a writ of habeas corpus by Judge Garza, who went on to be a member of the Court of Appeals. In any event, Justice Blackmun’s dictum about “two or three” standard deviations soon shaped the outcome of many thousands of discrimination cases, and was translated into a necessary p-value of 5%.

Beginning in the early 1960s, statistical inference became an important feature of tort cases that involved claims based upon epidemiologic evidence. In such health-effects litigation, the judicial handling of concepts such as p-values and confidence intervals often went off the rails. In 1989, the United States Court of Appeals for the Fifth Circuit resolved an appeal involving expert witnesses who relied upon epidemiologic studies by concluding that it did not have to resolve questions of bias and confounding because the studies relied upon had presented their results with confidence intervals.[1] Judges and expert witnesses persistently interpreted single confidence intervals from one study as having a 95 percent probability of containing the actual parameter.[2] Similarly, many courts and counsel committed the transposition fallacy in interpreting p-values as posterior probabilities for the null hypothesis.[3]
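The misinterpretation of confidence intervals noted above can be made concrete with a small simulation in a hypothetical normal-mean setting: the “95 percent” describes how often the interval-construction *procedure* covers the true parameter over repeated samples, not the probability that any single realized interval contains it. A sketch, with all numbers chosen purely for illustration:

```python
import random
import statistics

# Hypothetical setting: repeated samples from a normal population.
# We count how often the textbook 95% interval for the mean covers
# the true value across many independent studies.
random.seed(1)
true_mean, sigma, n, trials = 10.0, 2.0, 30, 2000
covered = 0

for _ in range(trials):
    sample = [random.gauss(true_mean, sigma) for _ in range(n)]
    m = statistics.mean(sample)
    half = 1.96 * statistics.stdev(sample) / n ** 0.5
    if m - half <= true_mean <= m + half:
        covered += 1

print(covered / trials)  # close to 0.95 - a long-run frequency of the procedure
```

Any one realized interval either contains the parameter or it does not; the 95 percent attaches to the long-run behaviour of the method, which is exactly the distinction the courts and witnesses described above kept missing.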

Against this backdrop of mistaken and misrepresented interpretation of p-values, the American Statistical Association’s p-value statement was a helpful and understandable restatement of basic principles.[4] Within a few weeks, however, citations to the p-value Statement started to show up in the briefs and examinations of expert witnesses, to support contentions that p-values (or any procedure to evaluate random error) were unimportant, and should be disregarded.[5]

In 2019, Ronald Wasserstein, the ASA executive director, along with two other authors wrote an editorial, which explicitly called for the abandonment of using “statistical significance.”[6] Although the piece was labeled “editorial,” the journal provided no disclaimer that Wasserstein was not speaking *ex cathedra*.

The absence of a disclaimer provoked a great deal of confusion. Indeed, Brian Tarran, the editor of *Significance*, published jointly by the ASA and the Royal Statistical Society, wrote an editorial interpreting the Wasserstein editorial as an official ASA “recommendation.” Tarran ultimately retracted his interpretation, but only in response to a pointed letter to the editor.[7] Tarran adverted to a misleading press release from the ASA as the source of his confusion. Inquiring minds might wonder why the ASA allowed such a press release to go out.

In addition to press releases, some people in the ASA started to send emails to journal editors, to nudge them to abandon statistical significance testing on the basis of what seemed like an ASA recommendation. For the most part, this campaign was unsuccessful in the major biomedical journals.[8]

While this controversy was unfolding, then President Karen Kafadar of the ASA stepped into the breach to state definitively that the Executive Director was not speaking for the ASA.[9] In November 2019, the ASA board of directors approved a motion to create a “Task Force on Statistical Significance and Replicability.”[8] Its charge was “to develop thoughtful principles and practices that the ASA can endorse and share with scientists and journal editors. The task force will be appointed by the ASA President with advice and participation from the ASA Board.”

Professor Mayo’s editorial has done the world of statistics, as well as the legal world of judges, lawyers, and legal scholars, a service in calling attention to the peculiar intellectual conflicts of interest that played a role in the editorial excesses of some of the ASA’s leadership. From a lawyer’s perspective, it is clear that courts have been misled, and distracted by, some of the ASA officials who seem to have worked to undermine a consensus position paper on p-values.[10]

Curiously, the task force’s report did not find a home in any of the ASA’s several scholarly publications. Instead, “The ASA President’s Task Force Statement on Statistical Significance and Replicability”[11] appeared in *The Annals of Applied Statistics*, where it is accompanied by an editorial by former ASA President Karen Kafadar.[12] In November 2021, the ASA’s official “magazine,” *Chance*, also published the Task Force’s Statement.[13]

Judges and litigants who must navigate claims of statistical inference need guidance on the standard of care scientists and statisticians should use in evaluating such claims. Although the Task Force did not elaborate, it advanced five basic propositions, which had been obscured by many of the recent glosses on the ASA 2016 p-value statement and the 2019 editorial discussed above:

- “Capturing the uncertainty associated with statistical summaries is critical.”
- “Dealing with replicability and uncertainty lies at the heart of statistical science. Study results are replicable if they can be verified in further studies with new data.”
- “The theoretical basis of statistical science offers several general strategies for dealing with uncertainty.”
- “Thresholds are helpful when actions are required.”
- “P-values and significance tests, when properly applied and interpreted, increase the rigor of the conclusions drawn from data.”

Although the Task Force’s Statement will not end the debate or the “wars,” it will go a long way to correct the contentions made in court about the insignificance of significance testing, while giving courts a truer sense of the professional standard of care with respect to statistical inference in evaluating claims of health effects.

**REFERENCES**

[1] *Brock v. Merrill Dow Pharmaceuticals, Inc.*, 874 F.2d 307, 311-12 (5th Cir. 1989).

[2] Richard W. Clapp & David Ozonoff, “Environment and Health: Vital Intersection or Contested Territory?” 30 *Am. J. L. & Med*. 189, 210 (2004) (“Thus, a RR [relative risk] of 1.8 with a confidence interval of 1.3 to 2.9 could very likely represent a true RR of greater than 2.0, and as high as 2.9 in 95 out of 100 repeated trials.”) (Both authors testify for claimants in cases involving alleged environmental and occupational harms.); Schachtman, “Confidence in Intervals and Diffidence in the Courts” (Mar. 4, 2012) (collecting numerous examples of judicial offenders).

[3] *See, e.g*., *In re Ephedra Prods. Liab. Litig*., 393 F.Supp. 2d 181, 191, 193 (S.D.N.Y. 2005) (Rakoff, J.) (credulously accepting counsel’s argument that the use of a critical value of less than 5% significance probability increased the “more likely than not” burden of proof upon a civil litigant). The decision has been criticized in the scholarly literature, but it is still widely cited without acknowledgment of its error. *See* Michael O. Finkelstein, *Basic Concepts of Probability and Statistics in the Law* 65 (2009).

[4] Ronald L. Wasserstein & Nicole A. Lazar, “The ASA’s Statement on p-Values: Context, Process, and Purpose,” 70 *The Am. Statistician* 129 (2016); *see* “The American Statistical Association’s Statement on and of Significance” (March 17, 2016). The commentary beyond the “bold faced” principles was at times less helpful in suggesting that there was something inherently inadequate in using p-values. With the benefit of hindsight, this commentary appears to represent editorializing by the authors, and not the sense of the expert committee that agreed to the six principles.

[5] Schachtman, “The American Statistical Association Statement on Significance Testing Goes to Court, Part I” (Nov. 13, 2018), “Part II” (Mar. 7, 2019).

[6] Ronald L. Wasserstein, Allen L. Schirm, and Nicole A. Lazar, “Editorial: Moving to a World Beyond ‘p < 0.05’,” 73 *Am. Statistician* S1, S2 (2019); *see* Schachtman, “Has the American Statistical Association Gone Post-Modern?” (Mar. 24, 2019).

[7] Brian Tarran, “THE S WORD … and what to do about it,” *Significance* (Aug. 2019); Donald Macnaughton, “Who Said What,” *Significance* 47 (Oct. 2019).

[8] *See, e.g*., David Harrington, Ralph B. D’Agostino, Sr., Constantine Gatsonis, Joseph W. Hogan, David J. Hunter, Sharon-Lise T. Normand, Jeffrey M. Drazen, and Mary Beth Hamel, “New Guidelines for Statistical Reporting in the *Journal*,” 381 *New Engl. J. Med*. 285 (2019); Jonathan A. Cook, Dean A. Fergusson, Ian Ford, Mithat Gonen, Jonathan Kimmelman, Edward L. Korn, and Colin B. Begg, “There is still a place for significance testing in clinical trials,” 16 *Clin. Trials* 223 (2019).

[9] Karen Kafadar, “The Year in Review … And More to Come,” *AmStat News* 3 (Dec. 2019); *see also* Kafadar, “Statistics & Unintended Consequences,” *AmStat News* 3, 4 (June 2019).

[10] Deborah Mayo, “The statistics wars and intellectual conflicts of interest,” 36 *Conservation Biology* (2022) (in press; online Dec. 2021).

[11] Yoav Benjamini, Richard D. De Veaux, Bradley Efron, Scott Evans, Mark Glickman, Barry I. Graubard, Xuming He, Xiao-Li Meng, Nancy M. Reid, Stephen M. Stigler, Stephen B. Vardeman, Christopher K. Wikle, Tommy Wright, Linda J. Young, and Karen Kafadar, “The ASA President’s Task Force Statement on Statistical Significance and Replicability,” 15 *Annals of Applied Statistics* (2021) (in press).

[12] Karen Kafadar, “Editorial: Statistical Significance, P-Values, and Replicability,” 15 *Annals of Applied Statistics* (2021).

[13] Yoav Benjamini, Richard D. De Veaux, Bradley Efron, Scott Evans, Mark Glickman, Barry I. Graubard, Xuming He, Xiao-Li Meng, Nancy M. Reid, Stephen M. Stigler, Stephen B. Vardeman, Christopher K. Wikle, Tommy Wright, Linda J. Young & Karen Kafadar, “ASA President’s Task Force Statement on Statistical Significance and Replicability,” 34 *Chance* 10 (2021).

**Previous commentaries on my editorial (more to come*)**

Park

Dennis

Stark

Staley

Pawitan

Hennig

Ionides and Ritov

Haig

Lakens

*Let me know if you wish to write one

**John Park, MD**

Radiation Oncologist

Kansas City VA Medical Center

**Poisoned Priors: Will You Drink from This Well?**

As an oncologist specializing in the field of radiation oncology, I find the topic of Prof. Mayo’s recent editorial, “The Statistics Wars and Intellectual Conflicts of Interest,” to be one of practical importance to me and my patients (Mayo, 2021). Some in the field are flirting with Bayesian statistics as a way to move on from statistical significance testing and the use of P-values. In fact, what many consider the world’s preeminent cancer center, MD Anderson, has a strong Bayesian group that completed two early-phase Bayesian studies in radiation oncology, both published in the most prestigious cancer journal, *The Journal of Clinical Oncology* (Liao et al., 2018; Lin et al., 2020). This raises the hotly contested issue of subjective priors, and much has been written about the ability to overcome this problem. Specifically in medicine, one thinks of Spiegelhalter’s classic 1994 paper on reference, clinical, skeptical, and enthusiastic priors, which also uses an example from radiation oncology to make its case (Spiegelhalter et al., 1994). This is all well and good in theory, but what if there is ample evidence that the subject-matter experts have major conflicts of interest (COIs) and biases, so that their priors cannot be trusted? A debate raging in oncology is whether non-invasive radiation therapy is as good as invasive surgery for early-stage lung cancer patients. This is not a trivial question, as postoperative morbidity from surgery can range from 19–50% and 90-day mortality anywhere from 0–5% (Chang et al., 2021). Radiation therapy is highly attractive, as there are numerous reports hinting at equal efficacy with far less morbidity. Unfortunately, four major clinical trials were unable to accrue patients for this important question. Why could they not enroll patients, you ask? Long story short: if a patient is referred to radiation oncology and treated with radiation, the surgeon loses out on the revenue, and vice versa. Dr. David Jones, a surgeon at Memorial Sloan Kettering, notes there was no “equipoise among enrolling investigators and medical specialties… Although the reasons are multiple… I believe the primary reason is financial” (Jones, 2015). I am not skirting responsibility for my field’s biases. Dr. Hanbo Chen, a radiation oncologist, notes in his meta-analysis of multiple publications comparing surgery with radiation that reported overall survival was associated with the specialty of the first author (Chen et al., 2018). Perhaps the pen is mightier than the scalpel!

Currently, there is one surgery-versus-radiation trial that is accruing well: the VALOR study, a Veterans Affairs (VA)-only trial. Although only 9 VA medical centers were involved in 2020, it had enrolled more participants than all previous major (phase 3) trials combined (Moghanaki and Hagan, 2020). I do not believe it is too bold to say that a major portion of this success is due to the fact that there are no financial incentives for the surgeons or radiation oncologists at the VA (i.e., VA physicians are salaried and do not receive payment per patient).

Here are some clear examples of what I call “poisoned priors” due to COIs. Whether financial or for prestige (would you want to be known as the inferior treatment modality for one of the most common cancers?), the COIs loom large. Many of the specialists in question are highly biased, with exposed COIs. Are we to trust priors constructed from them? Will the errors really be contained within the posteriors built from these biased priors? To overcome this, you say you want to use an uninformative or weakly informative prior as a statistical method to judge incoming data? Then what is the point of having prior knowledge (in this case, the priors of the surgeons and radiation oncologists who are the subject-matter experts) if you are not willing to use it? Indeed, as Prof. Mayo notes, “It may be retorted that implausible inferences will indirectly be blocked by appropriate prior degrees of belief (informative priors), but this misses the crucial point. The key function of statistical tests is to constrain the human tendency to selectively favor views they believe” (Mayo, 2021). If this statement holds for appropriate prior degrees of belief, how much more relevant is it when we can show that those involved have inappropriate prior degrees of belief?
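
The worry about whether errors are really contained in posteriors built from biased priors can be illustrated with a toy conjugate calculation. This is my own sketch with made-up numbers (not data from any trial cited above), showing how Spiegelhalter-style skeptical and enthusiastic priors shift conclusions drawn from the same data:

```python
# Beta-binomial conjugacy: posterior for a response rate under three priors.
# Hypothetical data: 14 responses among 40 patients (observed rate 0.35).
successes, n = 14, 40

# Beta(a, b) priors: flat, skeptical (an expert who doubts the therapy),
# and enthusiastic (an expert with a stake in the therapy).
priors = {"flat": (1, 1), "skeptical": (5, 15), "enthusiastic": (15, 5)}

posterior_mean = {}
for name, (a, b) in priors.items():
    # Conjugate update: Beta(a, b) prior + binomial data -> Beta posterior.
    post_a, post_b = a + successes, b + (n - successes)
    posterior_mean[name] = post_a / (post_a + post_b)
    print(f"{name:12s} posterior mean = {posterior_mean[name]:.3f}")
```

With these numbers the enthusiastic prior pulls the estimate well above the observed rate while the skeptical prior pulls it below; with a “poisoned” prior, the shift is built in before any data arrive.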

These types of poisoned priors are ubiquitous in medicine and must be taken into account — we haven’t even dealt with “Big Pharma” (and don’t get me started)! We must not give up the apparatus of the phase 3 randomized trial, with its randomization, blinding, multiplicity control, and preregistered statistical thresholds for type I and II error control, which is the best form of severe testing we have for our patients.

**References**

- Chang JY, Mehran RJ, Feng L, et al. Stereotactic ablative radiotherapy for operable stage I non-small-cell lung cancer (revised STARS): long-term results of a single-arm, prospective trial with prespecified comparison to surgery. The Lancet Oncology. 2021;22(10):1448-1457. doi:10.1016/S1470-2045(21)00401-0
- Chen H, Laba JM, Boldt RG, et al. Stereotactic Ablative Radiation Therapy Versus Surgery in Early Lung Cancer: A Meta-analysis of Propensity Score Studies. Int J Radiat Oncol Biol Phys. 2018;101(1):186-194. doi:10.1016/j.ijrobp.2018.01.064
- Jones DR. Do we know bad science when we see it? The Journal of Thoracic and Cardiovascular Surgery. 2015;150(3):472-473. doi:10.1016/j.jtcvs.2015.07.032
- Liao Z, Lee JJ, Komaki R, et al. Bayesian Adaptive Randomization Trial of Passive Scattering Proton Therapy and Intensity-Modulated Photon Radiotherapy for Locally Advanced Non-Small-Cell Lung Cancer. J Clin Oncol. 2018;36(18):1813-1822. doi:10.1200/JCO.2017.74.0720
- Lin SH, Hobbs BP, Verma V, et al. Randomized Phase IIB Trial of Proton Beam Therapy Versus Intensity-Modulated Radiation Therapy for Locally Advanced Esophageal Cancer. J Clin Oncol. 2020;38(14):1569-1579. doi:10.1200/JCO.19.02503
- Mayo DG. The statistics wars and intellectual conflicts of interest. Conserv Biol. Published online December 6, 2021. doi:10.1111/cobi.13861
- Moghanaki D, Hagan M. Strategic Initiatives for Veterans with Lung Cancer. Fed Pract. 2020;37(Suppl 4):S76-S80. doi:10.12788/fp.0019
- Razi SS, Kodia K, Alnajar A, Block MI, Tarrazzi F, Nguyen D, Villamizar N. Lobectomy Versus Stereotactic Body Radiotherapy In Healthy Octogenarians With Stage I Lung Cancer. Ann Thorac Surg. 2020. doi:10.1016/j.athoracsur.2020.06.097
- Spiegelhalter DJ, Freedman LS, Parmar MKB. Bayesian Approaches to Randomized Trials. Journal of the Royal Statistical Society Series A (Statistics in Society). 1994;157(3):357-416. doi:10.2307/2983527

**Previous commentaries on Mayo (2021) editorial (more to come*)**

Dennis

Stark

Pawitan

Hennig

Ionides and Ritov

Haig

Lakens

**(*if you wish to contribute a commentary, let me know)**

Brian Dennis

Professor Emeritus

Dept Fish and Wildlife Sciences,

Dept Mathematics and Statistical Science

University of Idaho

**Journal Editors Be Warned: Statistics Won’t Be Contained**

I heartily second Professor Mayo’s call, in a recent issue of *Conservation Biology*, for science journals to tread lightly on prescribing statistical methods (Mayo 2021). Such prescriptions are not likely to be constructive; the issues involved are too vast.

The science of ecology has long relied on innovative statistical thinking. Fisher himself, inventor of P values and a considerable portion of the other statistical methods used by generations of ecologists, helped ecologists quantify patterns of biodiversity (Fisher et al. 1943) and understand how genetics and evolution were connected (Fisher 1930). G. E. Hutchinson, the “founder of modern ecology” (and my professional grandfather), early on helped build the tradition of heavy consumption of mathematics and statistics in ecological research (Slack 2010). Investigators in the early days of the subfield of conservation biology saw the need for stochastic approaches to modeling rare or colonizing populations and for assessing extinction jeopardy (MacArthur and Wilson 1967, Leigh 1981, Lande and Orzack 1988, Dennis 1989, Dennis et al. 1991). Data arising from modern molecular genetics are now a huge cornerstone of conservation, and analyzing such data well often requires considerable statistical sophistication. Other data in ecology are highly nonstandard and require custom-made generalized linear models, generalized additive models, integrated models, state space models, structural equation models, spatial capture-recapture models… an ever-expanding list. Nonstandard data, and the very theories of ecology themselves, require the modern ecologist to master an extensive statistical arsenal (Ellison and Dennis 2010).

Lack of replicability has long been acknowledged in ecology, as ecological systems are severely heterogeneous. Ecologists turned heavily to hierarchical models of various sorts to better capture a fuller picture of the sources of variability in data. The likelihoods involved, for all but the usual normal-based random effects models, are wicked multiple integrals that for many years defied computation. The Bayesian revolution swept through ecology after the discovery in statistics that the posterior distributions for such models could be simulated with MCMC algorithms, bypassing the need to calculate likelihood functions. Most ecologists I talked to had little patience for the philosophical-scientific issues involved in the Bayesian/frequentist choice but rather were enthralled with the quantum leap in the complexity and realism of models that could be handled with these Bayesian methods. Frequentist methods were late to the party, but the development of algorithms for likelihood maximization, such as data cloning (Lele et al. 2007, Lele et al. 2010), has now given investigators a real choice between frequentist and Bayesian inference for hierarchical models. The philosophical issues can no longer be ignored; the choice between frequentist and Bayesian approaches has consequential differences in the types of conclusions to be drawn from data (Mayo 2018, Lele 2020a,b).

It is no wonder that ecologists have long indulged in substantial introspection and questioning of statistical practice. Single papers, single papers with commentary, forums in journals, whole journal issues, and even whole journals are devoted to expounding on and debating statistical methods in ecology. The “null hypothesis” as an ecological-scientific tool rated an entire issue of *The American Naturalist* (November 1983).

In a contemporary example, *Frontiers in Ecology and Evolution* devoted a “research topic” to papers on “evidence statistics.” The evidence project seeks to extend Richard Royall’s (1997) ideas about evidence to statistical cases with unknown parameters and misspecified models and to endow the approach with a frequentist error structure useful for pre-data design and post-data inference (Dennis et al. 2019, Taper et al. 2021). The extension is accomplished with “evidence functions” (Lele 2004). The main structural departure from Neyman-Pearson (NP) hypothesis testing or Fisherian significance testing is that the concepts of evidence and frequentist error are separated.

The quality of inferences should increase with the amount of data available. This presents problems for NP hypothesis testing if inferences are bound to error rates, as Type I error rates (alpha) are constant regardless of sample size. On the other hand, with evidence functions, both error rates (probabilities of misleading evidence, analogous to alpha and beta in NP testing) approach zero asymptotically as sample size increases, *even when models are misspecified*. Results thus far suggest that differences of *consistent* model selection indexes (such as SIC, a.k.a. BIC) retain the properties of evidence functions. AIC differences, by contrast, have error properties similar to NP hypothesis testing (one of the probabilities of misleading evidence does not go to zero but rather becomes constant, similar to alpha in NP hypothesis testing). Evidence functions are for comparing two models; they are point estimates of the differences of the discrepancies of the two models from the true data-generating mechanism. Interval estimates for evidence can be produced with valid coverage properties, *even when models are misspecified*.
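
The contrast between AIC-type and consistent (SIC/BIC-type) error behavior can be sketched analytically in the simplest nested case. This is my own illustration, not code from the evidence-statistics papers, assuming two normal-mean models with known unit variance where the true (smaller) model fixes the mean at zero: the larger model is wrongly preferred exactly when the likelihood-ratio statistic z² exceeds the penalty difference, which is the constant 2 for AIC but log n for BIC.

```python
import math

def phi(z):
    """Standard normal CDF."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def overfit_prob(penalty):
    # P(z^2 > penalty) for z ~ N(0,1): the chance the bigger model wins
    # when the smaller (true) model holds.
    return 2 * (1 - phi(math.sqrt(penalty)))

for n in (10, 100, 10_000, 1_000_000):
    aic = overfit_prob(2)            # constant in n, roughly 0.157
    bic = overfit_prob(math.log(n))  # shrinks toward 0 as n grows
    print(f"n={n:>9}  AIC error={aic:.3f}  BIC error={bic:.4f}")
```

The AIC column behaves like a fixed alpha, while the BIC column is a probability of misleading evidence going to zero, matching the distinction Dennis describes.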

An argument against an evidence-error project is the Likelihood Principle (LP), the concept that experiment outcomes giving equal likelihood to a parameter value must be considered equal evidence for that value. The concept requires, for instance, that 7 successes out of 20 Bernoulli trials is the same evidence for a particular value of the success probability regardless of whether the experiment was a binomial experiment (number of trials fixed) or a negative binomial experiment (trials occur until 7 successes are attained). Mayo (2018) provides an entertaining takedown of the LP on philosophical-scientific grounds. Statistically, the variances of the two success probability estimates would differ between the two experimental designs, and so any assessment of long-run error rates must depend on design as well. Similarly, to consider error rates for evidence functions, the LP must necessarily be left behind.
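
The binomial/negative-binomial example can be checked numerically. In this sketch (my own, taking θ₀ = 0.5 as the null value), the two likelihoods for 7 successes in 20 trials are exactly proportional, so the LP counts them as the same evidence, yet the one-sided p-values differ because the two designs have different sample spaces:

```python
from math import comb

theta0 = 0.5  # null value of the success probability

def lik_binom(theta):     # binomial design: n = 20 trials fixed, 7 successes
    return comb(20, 7) * theta**7 * (1 - theta)**13

def lik_negbinom(theta):  # negative binomial design: 7th success on trial 20
    return comb(19, 6) * theta**7 * (1 - theta)**13

# Proportional likelihoods: the ratio is the same constant for every theta.
ratios = {round(lik_binom(t) / lik_negbinom(t), 9) for t in (0.2, 0.35, 0.5, 0.65)}
assert len(ratios) == 1

# One-sided p-values for evidence that theta < 0.5 under each design:
p_binom = sum(comb(20, k) * theta0**20 for k in range(8))               # P(X <= 7)
p_negbinom = 1 - sum(comb(m - 1, 6) * theta0**m for m in range(7, 20))  # P(N >= 20)
print(f"binomial p = {p_binom:.4f}, negative binomial p = {p_negbinom:.4f}")
```

The likelihoods agree but the p-values do not (about 0.132 versus 0.084), which is exactly why error-rate assessments cannot obey the LP.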

Journal editors can best help ecology by facilitating, promoting, and encouraging such discourse. Prescribing some fixed statistical approach (as agriculture journals once did for multiple comparisons) in the instructions to authors is likely to be ill-informed and harmful to scientific progress. The statistical landscape is growing and changing rapidly, and how statistical approaches can contribute to a particular science is best left to practitioners to sort out on the journal pages.

**References**

- Dennis B. 1989. Allee effects: population growth, critical density, and the chance of extinction. Natural Resource Modeling 3:481-538.
- Dennis B., PL Munholland, JM Scott. 1991. Estimation of growth and extinction parameters for endangered species. Ecological Monographs 61:115-143.
- Dennis B, Ponciano JM, Taper ML, Lele SR. 2019. Errors in statistical inference under model misspeciﬁcation: evidence, hypothesis testing, and AIC. Frontiers in Ecology and Evolution 7:372.
- Ellison AM, Dennis B. 2010. Paths to statistical fluency for ecologists. Frontiers in Ecology and the Environment 8:362-370.
- Fisher RA. 1930. The genetical theory of natural selection. The Clarendon Press, Oxford, UK.
- Fisher RA, AS Corbet, SB Williams. 1943. The relation between the number of species and the number of individuals in a random sample of an animal population. Journal of Animal Ecology 12:42-58.
- Lande R, Orzack SH. 1988. Extinction dynamics of age-structured populations in a structured environment. Proceedings of the National Academy of Sciences (USA) 85:7418-7421.
- Leigh EG. 1981. The average lifetime of a population in a varying environment. Journal of Theoretical Biology 90:213-239.
- Lele SR. 2004. Evidence functions and the optimality of the law of likelihood. In: The nature of scientiﬁc evidence: statistical, philosophical and empirical considerations, eds ML Taper, SR Lele. The University of Chicago Press, Chicago, Illinois.
- Lele SR. 2020a. Consequences of lack of parameterization invariance of non-informative Bayesian analysis for wildlife management: survival of San Joaquin kit fox and declines in amphibian populations. Frontiers in Ecology and Evolution 7:501.
- Lele SR. 2020b. How should we quantify uncertainty in statistical inference? Frontiers in Ecology and Evolution 8:35.
- Lele SR, Dennis B, Lutscher F. 2007. Data cloning: easy maximum likelihood estimation for complex ecological models using Bayesian Markov chain Monte Carlo methods. Ecology Letters 10:551–563.
- Lele SR, Nadeem K, Schmuland B. 2010. Estimability and likelihood inference for generalized linear mixed models using data cloning. Journal of the American Statistical Association 105:1617–1625
- MacArthur RH, Wilson EO. 1967. The theory of island biogeography. Princeton University Press, Princeton, New Jersey.
- Mayo, D. 2018. Statistical inference as severe testing: how to get beyond the statistics wars. Cambridge University Press, Cambridge, UK.
- Mayo, D. 2021. The statistics wars and intellectual conﬂicts of interest. Conservation Biology 2021:1-3.
- Royall R. 1997. Statistical evidence: a likelihood paradigm. Chapman & Hall, London, UK.
- Slack NG. 2010. G. Evelyn Hutchinson and the invention of modern ecology. Yale University Press, New Haven, Connecticut.
- Taper ML, Lele SR, Ponciano JM, Dennis B, Jerde CL. 2021. Assessing the global and local uncertainty of scientific evidence in the presence of model misspecification. Frontiers of Ecology and Evolution 9:679155.

**Previous commentaries on Mayo (2021) editorial (more to come*)**

Stark

Pawitan

Hennig

Ionides and Ritov

Haig

Lakens

**(*if you wish to contribute a commentary, let me know)**


**Philip B. Stark**

Professor

Department of Statistics

University of California, Berkeley

I enjoyed Prof. Mayo’s comment in *Conservation Biology* (Mayo, 2021) very much, and agree enthusiastically with most of it. Here are my key takeaways and reflections.

Error probabilities (or error rates) are essential to consider. If you don’t give thought to what the data would be like if your theory is false, you are not doing science.

Some applications really require a decision to be made. Does the drug go to market or not? Are the girders for the bridge strong enough, or not? Hence, banning “bright lines” is silly. Conversely, no threshold for significance, no matter how small, suffices to prove an empirical claim. In replication lies truth.

Abandoning P-values exacerbates moral hazard for journal editors, although there has always been moral hazard in the gatekeeping function. Absent any objective assessment of evidence, publication decisions are even more subject to cronyism, “taste”, confirmation bias, etc.

Throwing away P-values because many practitioners don’t know how to use them is perverse. It’s like banning scalpels because most people don’t know how to perform surgery. People who wish to perform surgery should be trained in the proper use of scalpels, and those who wish to use statistics should be trained in the proper use of P-values. Throwing out P-values is self-serving to statistical instruction, too: we’re making our lives easier by teaching *less* instead of teaching *better.*

In my opinion, the main problems with P-values are: faulty interpretation, even of genuine P-values; use of nominal P-values that are not genuine P-values; and perhaps most importantly, testing statistical hypotheses that have no connection to the scientific hypotheses.

A P-value is the observed value of any statistic whose probability distribution is dominated by the uniform distribution when the null is true. That is, a P-value is any measurable function *T* of the data that doesn’t depend on any unknown parameters and for which, if the null hypothesis is true, Pr{*T* ≤ *p*} ≤ *p*. Reported P-values often do not have that **defining** property. One reason is that calculating *T* may involve many steps, including data selection, model selection, test selection, and selective reporting, but practitioners ignore all but the final step in making the probability calculation. That is, in reality, *T* is generally the composition of many functions, *T* = *T*_{n} ∘ ⋯ ∘ *T*_{1}, yet the probability calculation typically treats only the last.
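
The defining property, and the way selection effects destroy it, can be seen in a small simulation. This is my own sketch (not code from the commentary), assuming a two-sided z-test of a zero mean with known unit variance: a genuine p-value is (sub)uniform under the null, but the smallest of five such p-values, reported at face value, is not.

```python
import math, random

random.seed(1)

def phi(z):
    """Standard normal CDF."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def p_value(sample):
    # Two-sided z-test of H0: mu = 0 with known sigma = 1.
    z = sum(sample) / math.sqrt(len(sample))
    return 2 * (1 - phi(abs(z)))

n_sims, n_obs = 20_000, 10

# Genuine p-value: Pr{T <= 0.05} should be about 0.05 under the null.
genuine = sum(p_value([random.gauss(0, 1) for _ in range(n_obs)]) <= 0.05
              for _ in range(n_sims)) / n_sims

# "Nominal" p-value: run five independent tests, report only the smallest.
selected = sum(min(p_value([random.gauss(0, 1) for _ in range(n_obs)])
                   for _ in range(5)) <= 0.05
               for _ in range(n_sims)) / n_sims

print(f"genuine: {genuine:.3f}   selected: {selected:.3f}")
```

The first rate sits near 0.05; the second is roughly 1 − 0.95⁵ ≈ 0.23, a p-value in name only.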

In my experience, perhaps the most pernicious error in the use of P-values in applications is a Type III error: answering the wrong question by testing a statistical null hypothesis that has nothing to do with the scientific hypothesis, aside from having some words in common. A statistical null hypothesis needs to capture the science, or testing it sheds no light on the matter. For example, consider a randomized controlled trial with a binary treatment and a binary outcome. The *scientific* null is that the remedy does not improve clinical outcomes (either subject by subject, or on average across the subjects in the trial). A typical *statistical* null is that the responses to treatment and placebo are all IID *N*(*µ*, *σ*²). The scientific null does not involve independence, normal distributions, or equality of variances. A genuine P-value for the statistical null does not say much about the scientific null. Here is a more nuanced example: do academic audiences interrupt female speakers more often than they interrupt male speakers? A typical statistical hypothesis might involve positing a model for interruptions, say a zero-inflated negative binomial regression model with coefficients for gender, speaker’s years since PhD, and other covariates. The statistical hypothesis might be that the coefficient of gender in that model is zero. Even if one computes a genuine P-value for that statistical hypothesis, what does it have to do with the original scientific question?

I close with a comment regarding likelihood-based tests, which are mentioned in the commentary. There are indeed tests that depend only on likelihoods or likelihood ratios — and that allow optional stopping when “the data look favorable” — but that nonetheless rigorously control the probability of a Type I error. Wald’s sequential probability ratio test is the seminal example, but there are a host of other martingale-based methods that give the same protections.
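
As a minimal sketch of that claim (my own code, not from the commentary), here is Wald’s sequential probability ratio test for a Bernoulli stream, testing H0: p = 0.5 against H1: p = 0.7. The stopping boundaries log((1 − β)/α) and log(β/(1 − α)) are Wald’s standard approximations, and simulating under the null shows the Type I error staying at or below roughly α despite the data-dependent stopping:

```python
import math, random

random.seed(2)

def sprt(stream, p0=0.5, p1=0.7, alpha=0.05, beta=0.05, max_n=2000):
    """Wald's SPRT for Bernoulli data, H0: p = p0 vs H1: p = p1."""
    upper = math.log((1 - beta) / alpha)  # cross it: stop, accept H1
    lower = math.log(beta / (1 - alpha))  # cross it: stop, accept H0
    llr = 0.0  # running log-likelihood ratio
    for _ in range(max_n):
        x = next(stream)
        llr += math.log((p1 if x else 1 - p1) / (p0 if x else 1 - p0))
        if llr >= upper:
            return "H1"
        if llr <= lower:
            return "H0"
    return "H0"  # truncate conservatively if no boundary is crossed

def bernoulli(p):
    while True:
        yield random.random() < p

# Optional stopping, yet the Type I error under H0 stays near or below alpha:
type1 = sum(sprt(bernoulli(0.5)) == "H1" for _ in range(4000)) / 4000
print(f"simulated Type I error: {type1:.4f}")
```

The error control comes from the choice of boundaries, not from a fixed sample size, which is the point of contrast with naive optional stopping on p-values.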

**Earlier commentaries on Mayo 2021 Editorial**

Kent Staley

Professor

Department of Philosophy

Saint Louis University

**Commentary on “The statistics wars and intellectual conflicts of interest” (Mayo editorial)**

In her recent Editorial for *Conservation Biology*, Deborah Mayo argues that journal editors “should avoid taking sides” regarding “heated disagreements about statistical significance tests.” In particular, they should not impose bans suggested by combatants in the “statistics wars” on statistical methods advocated by the opposing side, such as Wasserstein et al.’s (2019) proposed ban on the declaration of statistical significance and the use of *p*-value thresholds. Were journal editors to adopt such proposals, Mayo argues, they would be acting under a conflict of interest (COI) of a special kind: an “intellectual” conflict of interest.

Conflicts of interest are worrisome because of the potential for bias. Researchers will no doubt be all too familiar with the institutional/bureaucratic requirement of declaring financial interests. Whether such disclosures provide substantive protections against bias or simply satisfy a “CYA” requirement of administrators, the rationale is that assessment of research outcomes can incorporate information relevant to the question of whether the investigators have arrived at a conclusion that overstates (or even fabricates) the support for a claim, when the acceptance of that claim would financially benefit them. This in turn ought to reduce the temptation of investigators to engage in such inflation or fabrication of support. The idea obviously applies quite naturally to editorial decisions as well as research conclusions.

Mayo’s “intellectual” COIs differ from this familiar case. The relevant interests of (in this case) journal editors are not financial, but concern policies governing the conduct of science itself.

One might object that journal editors are entrusted with decision-making power precisely to adopt and act upon such policies, and this distinguishes intellectual COIs from financial ones. Journal editors, according to this view, are responsible for making informed and reasoned judgments about the standards that distinguish credible research conclusions. They cannot do so if they are barred from adopting standards in accord with their personal judgments. To have an intellectual interest in a policy is simply to think that it is a good idea, and shouldn’t journal editors act on good ideas when they (think that they) have them?

To continue the objection, take an example from the field of particle physics: the editors of *Physical Review D* surely ought to be free to impose the requirement that claims to have “observed” a new phenomenon cannot be published unless the putative signal for that phenomenon constitutes at least a 5σ departure from the null hypothesis prediction. They ought to have the ability to impose such a requirement even though there are some members of the particle physics community who are critical of that policy, or who (perhaps because they prefer Bayesian analyses) reject even the use of significance calculations as a requirement of discovery claims.

Perhaps such an objection might be encouraged by the idea of an intellectual COI, but I think it misses the point of Mayo’s argument. The dispute within the statistical community over significance testing, and the “statistics wars” more generally, is fundamentally a philosophical one, or at least involves, in Mayo’s words, “philosophical presuppositions.” These presuppositions concern such fundamental aspects of scientific inquiry as “what is the purpose of a statistical test?” and “do the beliefs of investigators matter to how the results of inquiry are characterized, and if so, how?” Philosophical disputes tend to have a bad reputation among non-philosophers because they are often thought to be never-ending or even unresolvable in principle. Perhaps some are, but even in those cases (and I don’t think this is one), there is at least the possibility for progress in terms of clarifying what is at stake and eliminating non-viable positions from consideration. In any case, so long as competing methodological approaches in a given field rest upon differing philosophical presuppositions, about which there is legitimate and ongoing disagreement, to preclude the use of one of those approaches as a matter of editorial policy would be to foreclose on the possibility of engaging that philosophical dispute at the level of scientific practice. The consequences of that foreclosure for the scientific discipline itself would be impoverishing.


**Yudi Pawitan**

Professor

Department of Medical Epidemiology and Biostatistics

Karolinska Institutet, Stockholm

**Behavioral aspects in the statistical significance war-game**

I remember with fondness the good old days when the only ‘statistical war’-game was fought between the Bayesian and the frequentist. It was simpler – except when the likelihood principle is thrown in, always guaranteed to confound the frequentist – and the participants were for the most part collegial. Moreover, there was a feeling that it was a philosophical debate. Even though the Bayesian-frequentist war is not fully settled, we can see areas of consensus, for example in objective Bayesianism or in conditional inference. However, on the P-value and statistical significance front, the war looks less simple, as it is about statistical praxis; it is no longer Bayesian vs frequentist, with no consensus in sight and with wide implications affecting the day-to-day use of statistics. Typically, a persistent controversy between otherwise *sensible and knowledgeable* people – thus excluding anti-vaxxers and conspiracy theorists – might indicate we are missing some common perspectives or perhaps the big picture. In complex issues there can be genuinely distinct aspects about which different players disagree and, at some point, agree to disagree. I am not sure we have reached that point yet, with each side still working to persuade the other side of the faults of its position. For now, I can only concur with Mayo’s (2021) appeal that at least the umpires – journal editors – recognize (a) the issue at hand and (b) that genuine debates are still ongoing, so it is not yet time to take sides.

I have previously described my disagreement with the ideas of banning the P-value or just its threshold, or of retiring statistical significance (Pawitan, 2020). Rather than repeating or expanding those arguments here, I want instead to discuss where genuine disagreements can occur and be accepted. In game-theoretic or behavioral-economic analyses, it is accepted that rational, intelligent individuals can act differently, and thus disagree, reflecting different personal preferences or utility functions. In this game framework, the differing parties accept each other’s positions, and there is no need to persuade each other and change each other’s opinions.

So let’s start by assuming we are all rational, intelligent players: we fully understand the correct meaning and usage of the P-value in particular and of statistical inference in general. Excluding deliberate fraud, most objections to the P-value or its threshold seem to reflect at least three concerns: (i) the potential misunderstanding of non-expert practitioners, who then produce misleading statements; (ii) the potential misunderstanding of the public or consumers of the statistical results, leading to poor decisions, confusing public discourse, or both; (iii) the potential for more false-positive or false-negative errors. Since the P-value threshold controls the false-positive rate under the null, we must suppose that the concern regarding false positives is either due to a belief that the reported P-value does not represent the true level of uncertainty, or that the standard threshold – such as 0.05 – is too large. The former occurs when using biased data or analysis procedures, so in principle it can be cured by better data or more rigorous procedures. But reducing the threshold to cure the latter will increase the false-negative rate; vice versa, increasing the threshold reduces the false-negative rate but increases the false-positive rate. It seems to me the attitude is that, rather than trying to balance these two errors, we should just not use the P-value or its threshold. This reflects a preference that is perhaps amenable to further theoretical analysis and discussion, for example in relation to replicability/validation, but in any case it will not affect the other two concerns.
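
The threshold arithmetic above can be made concrete with a small simulation (my illustrative sketch, not part of Pawitan's commentary; the effect size, sample size, and simulation counts are arbitrary choices): under the null the false-positive rate sits near the chosen threshold α, while tightening α drives up the false-negative rate for studies with a real effect.

```python
import random
from statistics import NormalDist

random.seed(1)
Z = NormalDist()  # standard normal

def p_value(effect, n=30):
    """One-sided z-test P-value for the mean of n unit-variance draws."""
    xbar = sum(random.gauss(effect, 1) for _ in range(n)) / n
    z = xbar * n ** 0.5
    return 1 - Z.cdf(z)

def error_rates(alpha, sims=2000, delta=0.4):
    """Monte Carlo false-positive rate (null true) and
    false-negative rate (true effect = delta) at threshold alpha."""
    false_pos = sum(p_value(0.0) <= alpha for _ in range(sims)) / sims
    false_neg = sum(p_value(delta) > alpha for _ in range(sims)) / sims
    return false_pos, false_neg

for alpha in (0.05, 0.005):
    fp, fn = error_rates(alpha)
    print(f"alpha={alpha}: false-positive ~{fp:.3f}, false-negative ~{fn:.3f}")
```

Moving the threshold from 0.05 to 0.005 cuts the false-positive rate by roughly a factor of ten but, in this setup, more than doubles the false-negative rate, which is exactly the tradeoff the paragraph describes.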

The first two concerns are different: they reflect some degree of distrust of non-experts and the gullible public. Although I share these concerns, there is a genuine difference in where I put them on my utility scale relative to the advantages of having the P-value. Furthermore, in game theory for a social setting, we talk about a personal preference and a social preference; on a single issue these can be distinct or may coincide. For instance, personally I would never consider abortion, but I will not impose my preference on other people, so in my social preference abortion is acceptable. But somebody else might not only reject abortion for herself but also want to live in a society that does not allow abortion, and so would militate for its ban. Even in liberal countries, where individual preferences/liberty are supposed to be paramount, there are many such issues where you might want to project your personal preference as the social preference: vaccination, addictive drugs from marijuana to cocaine, pornography, prostitution, open-carry firearms, gambling, the death penalty, euthanasia, etc. These social issues are typically solved by democratic means, directly in referenda or indirectly by decisions of elected representatives. In either case, there is a mechanism – such as voting – and an authority that can impose the agreed decision as a social contract on the whole society.

What kind of social solution is suitable for something like the P-value war? It is indeed a challenging problem, since we have (i) no boundary that defines the legitimate stakeholders (academic statisticians? +applied statisticians? +chartered statisticians? +statistically literate scientists? +…?), (ii) no formal mechanism to express and combine preferences, and (iii) no real authority to impose any agreed decision. As in society in general, social norms that are not formally democratically controlled are dictated by culture. But how cultures evolve and which social rules get adopted are not predictable; in particular they may not be decided by the majority. They may well depend on a small number of influencers, in our case perhaps top-ranked-journal editors, or top-ranked statisticians or scientists. Nassim Taleb (2020) highlighted how social changes can be driven by a small, intolerant/loud minority in the face of a tolerant/quiet majority. For instance, the few editors of the journal *Basic and Applied Social Psychology* banned the P-value and statistical inference, and the numerous authors must acquiesce regardless of their personal views.

I have never seen any rigorous opinion poll on the use of P-values. Formal professional bodies such as the ASA or the RSS could perhaps run such a poll. They would of course still face the boundary problem I mention above, as their members do not represent all users of statistics, but it would be a start. A rigorous poll would be useful, so we can judge the extent of the division within our profession. One may argue strongly that science is not a democratic enterprise: 1000 dissenting but wrong votes cannot beat a single correct vote. But on an issue with no definite right-wrong answer, such as the use of the P-value, large support for banning it or its threshold should encourage all of us to come to a workable consensus. But small support – please do not ask for a threshold! – should give the intolerant/loud minority pause for thought.

**References**

Mayo, D (2021) The statistics wars and intellectual conflicts of interest. *Conservation Biology*.

Pawitan, Y (2020). Defending the P-value. https://arxiv.org/abs/2009.02099

Taleb, N N (2020). *Skin in the Game: Hidden Asymmetries in Daily Life.* New York: Random House.

Phil Stat Forum:

**11 January 2022**

**“Statistical Significance Test Anxiety”**

**TIME: 15:00-17:00 (London, GMT); 10:00-12:00 (EST)**

**Presenters:** Deborah Mayo (Virginia Tech) &

Yoav Benjamini (Tel Aviv University)

**Moderator:** David Hand (Imperial College London)

Benjamini, et al. (2021), “The ASA President’s Task Force Statement on Statistical Significance and Replicability” (Link to article)

Mayo, D. (2021), “The Statistics Wars and Intellectual Conflicts of Interest” (editorial). *Conservation Biology*. (Link to article)

Benjamini, Y. (2020), “Selective Inference: The Silent Killer of Replicability”. Harvard Data Science Review.

Wasserstein, R. & Lazar, N. (2016). “The ASA’s Statement on p-Values: Context, Process, and Purpose,” *The American Statistician* 70(2), 129-133.

Wasserstein, R., Schirm, A., & Lazar, N. (2019). “Moving to a world beyond ‘p < 0.05’” (Editorial). *The American Statistician* 73(S1), 1–19.

For **posts on this topic** see this blog post.

**For a full listing of meetings (including links to videos & slides), see our Phil Stat Forum Schedule page.**

**Christian Hennig**

Professor

Department of Statistical Sciences

University of Bologna

**The ASA controversy on P-values as an illustration of the difficulty of statistics**

“I work on Multidimensional Scaling for more than 40 years, and the longer I work on it, the more I realise how much of it I don’t understand. This presentation is about my current state of not understanding.” (John Gower, world-leading expert on Multidimensional Scaling, at a conference in 2009)

“The lecturer contradicts herself.” (Student feedback to an ex-colleague for teaching methods and then teaching what problems they have)

**1 Limits of understanding**

Statistical tests and P-values are widely used and widely misused. In 2016, the ASA issued a statement on significance and P-values with the intention of curbing misuse while acknowledging their proper definition and potential use. In my view the statement did a rather good job of saying things that are worth saying while trying to be acceptable both to those who are generally critical of P-values and to those who tend to defend their use. As was predictable, the statement did not settle the issue. A “2019 editorial” by some of the authors of the original statement (recommending “to abandon statistical significance”) and a 2021 ASA task force statement, much more positive about P-values, followed, showing the level of disagreement in the profession.

Statistics is hard. Well-trained, experienced and knowledgeable statisticians disagree about standard methods. Statistics is based on probability modelling, and probability modelling in data analysis is essentially about whether and how often things that did not happen could have happened, which can never be verified. The very meaning of probability, and by extension of every probability statement, is controversial.

The 2021 task force statement states: “Indeed, P-values and significance tests are among the most studied and best understood statistical procedures in the statistics literature.” I do not disagree with this. Probability models assign probabilities to sets, and considering the probability of a well-chosen data-dependent set is a very elementary way to assess the compatibility of a model with the data. The likelihood is another way, not requiring the specification of a test statistic that defines a “direction” in which the model may be violated, but instead relying somewhat more on the exact model specification. Still, for something considered “among the best understood,” it is remarkable how much controversy, lack of understanding, and misunderstanding surrounds P-values. Indeed there are issues with tests and P-values about which there is disagreement even among the most proficient experts, such as when and how exactly corrections for multiple testing should be used, or under what exact conditions a model can be taken as “valid.” Such decisions depend on the details of the individual situation, and there is no way around personal judgement.
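
The elementary reading described here, that a P-value is the null-model probability of a well-chosen data-dependent set, can be illustrated with an exact binomial test (my illustrative example, not from the commentary): for "at least k heads in n tosses of a fair coin," the set and its probability are fully explicit.

```python
from math import comb

def binom_p_value(k, n, p0=0.5):
    """Exact one-sided P-value: the probability, under the null p = p0,
    of the data-dependent set {at least k successes in n trials}."""
    return sum(comb(n, j) * p0**j * (1 - p0)**(n - j) for j in range(k, n + 1))

# e.g. 60 heads observed in 100 tosses of a putatively fair coin
print(round(binom_p_value(60, 100), 4))  # prints 0.0284
```

The test statistic (the number of heads) supplies the “direction” the commentary mentions: the set collects outcomes at least as extreme as the data in that one respect, and nothing in the calculation asserts that the null model is true.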

I do not think that this is a specific defect of P-values and tests. The task of quantifying evidence and reasoning under uncertainty is so hard that problems of these or other kinds arise with all alternative approaches as well. The opening quote by John Gower is not about P-values, but it would be heart-warming to see top experts on statistical inference talking this way, too. It is also important to acknowledge that there is agreement when it comes to the mathematics and the basic interpretation (not rejecting the null hypothesis does not mean that it is true, and neither is the P-value a probability that it is true) – agreement from which the general perception may be distracted by focusing too much on philosophical differences.

**2 Tension**

A much bigger problem is the tension between the difficulty of statistics and the demand for it to be simple and readily available. Data analysis is essential for science, industry, and society as a whole. Not all data analysis can be done by highly qualified statisticians, and society cannot wait with analysing data for statisticians to achieve perfect understanding and agreement. On top of this there are incentives for producing headline grabbing results, and society tends to attribute authority to those who convey certainty rather than to those who emphasise uncertainty. Statistics provides standard model based indications of uncertainty, but on top of that there is model uncertainty, uncertainty about the reliability of the data, and uncertainty about appropriate strategies of analysis and their implications. A statistician who emphasises all of these will often meet confusion and disregard.

Another important tension exists between the requirement for individual judgement and decision-making depending on the specifics of a situation, and the demand for automated mechanical procedures that can be easily taught, easily transferred from one situation to another, justified by appealing to simple general rules (even though their applicability to the specific situation of interest may be doubtful), and also investigated by statistical theory and systematic simulation.

P-values are so elementary and apparently simple a tool that they are particularly suitable for mechanical use and misuse. To have the data’s verdict about a scientific hypothesis summarised in a single number is a very tempting perspective, even more so if it comes without the requirement to specify a prior first, which puts many practitioners off a Bayesian approach. As a bonus, there are apparently well established cutoff values so that the number can even be reduced to a binary “accept or reject” statement. Of course all this belies the difficulty of statistics and a proper account of the specifics of the situation.

As said in the 2016 ASA statement, the P-value is an expression of the compatibility of the data with the null model, in a certain respect that is formalised by the test statistic. As such, I have no issues with tests and P-values as long as they are not interpreted as something that they are not. The null model should not be believed to be true (and neither should any other model). A P-value is surely informative; regarding given data, compatibility is the best that models can ever achieve, keeping in mind, of course, that many models can be compatible with the same data. The fact that P-values (and statistical reasoning in general) concern idealised models that are different from reality seems to be hard to stomach and easy to ignore; conversely, this fact is sometimes interpreted as testifying to the uselessness of P-values (or of frequentist statistical inference in general). It seems more difficult to acknowledge how models can help us to handle reality without being true, and how finding an incompatibility between data and model can be the starting point of an investigation into how exactly reality is different and what that means. For this, a test gives a rough direction (such as “the mean looks too large”), which can be useful, but is certainly limited as information.

Alternative statistical approaches have their merits and pitfalls, too, always including the temptation to over-interpret their implications, often by taking the assumed model as a truth rather than a model (a Bayesian model of belief, likewise, should not simply be believed). The pessimistic belief that the general popularity and spread of any statistical approach will correspond to its capacity for being mechanically used, misused, and over-interpreted – making it easy for its opponents to criticise it – seems realistic.

**3 Dilemma**

As statisticians we face the dilemma that we want statistics to be popular, authoritative, and in widespread use, but we also want it to be applied carefully and correctly, avoiding oversimplification and misinterpretation. That these aims are in conflict is in my view a major reason for the trouble with P-values, and if P-values were to be replaced by other approaches, I am convinced that we would see very similar trouble with them, and to some extent we already do.

Ultimately I believe that as statisticians we should stand by the complexity and richness of our discipline, including its plurality of approaches. We should resist the temptation to give those who want a simple device for generating strong claims what they want, yet we also need to teach methods that can be widely applied, with a proper appreciation of pitfalls and limitations, because otherwise much data will be analysed with even less insight. With reference to the second quote above, we need exactly to “contradict ourselves,” in the sense of conveying what can be done together with what the problems of any such approach are.

**4 Conclusion**

When it comes to a representative association such as the ASA, I think that the approach taken in the initial statement followed this ideal and was as such valuable. I would have hoped that the assertions made could be accepted by a vast majority of statisticians despite much existing disagreement, perhaps tolerating disagreement with certain details of the statement. The “2019 editorial” had a different spirit, recommending that we “abandon” methodology that a substantial number of statisticians routinely use and defend. This was obviously not something that could hope for broad agreement, and I think it was quite damaging for the profession. If we see ourselves as flag bearers of the acknowledgement and communication of uncertainty (and I think we should define ourselves in this way), this task alone puts us in a difficult position with a public who expect certainty and quick results. Regarding methodological controversies within our profession, we should be pluralist and open to the arguments of each side, rather than trying to shut one side out.

Unfortunately, for the participants in such controversies it is tempting and easy to hold difficulties and issues against an approach that they do not favour, in order to promote a particular alternative. But the alternative may well be affected, in one way or another, by the same or strongly related issues, as the difficulties with formalising uncertainty run deeper.

What we should like to see is scientists (and other statistics users) who are aware of the many sources of uncertainty and misunderstanding, and interpret their results keeping this in mind. Most of them are not highly trained statisticians, so we cannot expect them to have deep statistical insight or to do very sophisticated things. In any case we should not give them the impression that whether they do things right or wrong is a matter of whether they follow one or the other statistical approach, as long as both find support within the statistics community. Instead it is a matter of awareness of the limitations of whatever they do.

See the Ionides and Ritov commentary here. Prior to that are commentaries by Haig and by Lakens.

Please join us for our special remote Phil Stat Forum on Tuesday Jan 11, 10 AM EST: phil-stat-wars.com (“statistical significance test anxiety”)

Edward L. Ionides

Director of Undergraduate Programs and Professor,

Department of Statistics, University of Michigan

Ya’acov Ritov

Professor

Department of Statistics, University of Michigan

Thanks for the clear presentation of the issues at stake in your recent *Conservation Biology* editorial (Mayo 2021). There is a need for such articles elaborating and contextualizing the ASA President’s Task Force statement on statistical significance (Benjamini et al., 2021). The Benjamini et al. (2021) statement is sensible advice that avoids directly addressing the current debate. For better or worse, it has no references, and simply speaks what looks to us like plain sense. However, it avoids addressing why there is a debate in the first place, and what justifications and misconceptions drive the different positions. Consequently, it may be ineffective at communicating to those swing voters who have sympathies with some of the insinuations in the Wasserstein & Lazar (2016) statement. We say “insinuations” here since we consider that their 2016 statement made an attack on p-values which was forceful, indirect and erroneous. Wasserstein & Lazar (2016) started with a constructive discussion about the uses and abuses of p-values before moving against them. This approach was good rhetoric: “I have come to praise p-values, not to bury them,” to invert Shakespeare’s Antony. Good rhetoric does not always promote good science, but Wasserstein & Lazar (2016) successfully managed to frame and lead the debate, according to Google Scholar. We warned of the potential consequences of that article and its flaws (Ionides et al., 2017) and we refer the reader to our article for more explanation of these issues (it may be found below). Wasserstein, Schirm and Lazar (2019) made their position clearer, and therefore easier to confront. We are grateful to Benjamini et al. (2021) and Mayo (2021) for rising to the debate. Rephrasing Churchill in support of their efforts, “Many forms of statistical methods have been tried, and will be tried in this world of sin and woe. No one pretends that the p-value is perfect or all-wise. Indeed (noting that its abuse has much responsibility for the replication crisis) it has been said that the p-value is the worst form of inference except all those other forms that have been tried from time to time.”

Benjamini, Y., De Veaux, R.D., Efron, B., Evans, S., Glickman, M., Graubard, B.I., He, X., Meng, X.L., Reid, N.M., Stigler, S.M. and Vardeman, S.B., 2021. ASA President’s Task Force Statement on Statistical Significance and Replicability. Annals of Applied Statistics, 15(3), pp. 1084-1085.

Ionides, E.L., Giessing, A., Ritov, Y. and Page, S.E., 2017. Response to the ASA’s statement on p-values: context, process, and purpose. The American Statistician, 71(1), pp. 88-89.

Mayo, D.G., 2021. The statistics wars and intellectual conflicts of interest. Conservation Biology, to appear (online 2021).

Wasserstein, R.L. and Lazar, N.A., 2016. The ASA statement on p-values: context, process, and purpose. The American Statistician, 70(2), pp. 129-133.

Wasserstein, R.L., Schirm, A.L. and Lazar, N.A., 2019. Moving to a world beyond “p< 0.05”. The American Statistician, 73(sup1), pp. 1-19.

******

**THE AMERICAN STATISTICIAN 71(1): 88-89.**

**LETTERS TO THE EDITOR**

Edward L. Ionides^{a}, Alexander Giessing^{a}, Yaacov Ritov^{a}, and Scott E. Page^{b}

^{a}Department of Statistics, University of Michigan, Ann Arbor, MI; ^{b}Departments of Complex Systems, Political Science and Economics, University of Michigan, Ann Arbor, MI

The ASA’s statement on *p*-values: context, process, and purpose (Wasserstein and Lazar 2016) makes several reasonable practical points on the use of *p*-values in empirical scientific inquiry. The statement then goes beyond this mandate, and in opposition to mainstream views on the foundations of scientific reasoning, to advocate that researchers should move away from the practice of frequentist statistical inference and deductive science. Mixed with the sensible advice on how to use *p*-values comes a message that is being interpreted across academia, the business world, and policy communities, as, “Avoid *p*-values. They don’t tell you what you want to know.” We support the idea of an activist ASA that reminds the statistical community of the proper use of statistical tools. However, any tool that is as widely used as the *p*-value will also often be misused and misinterpreted. The ASA’s statement, while warning statistical practitioners against these abuses, simultaneously warns practitioners away from legitimate use of the frequentist approach to statistical inference.

In particular, the ASA’s statement ends by suggesting that other approaches, such as Bayesian inference and Bayes factors, should be used to solve the problems of using and interpreting *p*-values. Many committed advocates of the Bayesian paradigm were involved in writing the ASA’s statement, so perhaps this conclusion should not surprise the alert reader. Other applied statisticians feel that adding priors to the model often does more to obfuscate the challenges of data analysis than to solve them. It is formally true that difficulties in carrying out frequentist inference can be avoided by following the Bayesian paradigm, since the challenges of properly assessing and interpreting the size and power for a statistical procedure disappear if one does not attempt to calculate them. However, avoiding frequentist inference is not a constructive approach to carrying out better frequentist inference.

On closer inspection, the key issue is a fundamental position of the ASA’s statement on the scientific method, related to but formally distinct from the differences between Bayesian and frequentist inference. Let us focus on a critical paragraph from the ASA’s statement: “In view of the prevalent misuses of and misconceptions concerning *p*-values, some statisticians prefer to supplement or even replace *p*-values with other approaches. These include methods that emphasize estimation over testing, such as confidence, credibility, or prediction intervals; Bayesian methods; alternative measures of evidence, such as likelihood ratios or Bayes factors; and other approaches such as decision-theoretical modeling and false discovery rates. All these measures and approaches rely on further assumptions, but they may more directly address the size of an effect (and its associated uncertainty) or whether the hypothesis is correct.”

Some people may want to think about whether it makes scientific sense to “directly address whether the hypothesis is correct.” Some people may have already concluded that usually it does not, and be surprised that a statement on hypothesis testing that is at odds with mainstream scientific thought is apparently being advocated by the ASA leadership. Albert Einstein’s views on the scientific method are paraphrased by the assertion that, “No amount of experimentation can ever prove me right; a single experiment can prove me wrong” (Calaprice 2005). This approach to the logic of scientific progress, that data can serve to falsify scientific hypotheses but not to demonstrate their truth, was developed by Popper (1959) and has broad acceptance within the scientific community. In the words of Popper (1963), “It is easy to obtain confirmations, or verifications, for nearly every theory,” while, “Every genuine test of a theory is an attempt to falsify it, or to refute it. Testability is falsifiability.” The ASA’s statement appears to be contradicting the scientific method described by Einstein and Popper. In case the interpretation of this paragraph is unclear, the position of the ASA’s statement is clarified in their Principle 2: “*p*-values do not measure the probability that the studied hypothesis is true, or the probability that the data were produced by random chance alone. Researchers often wish to turn a *p*-value into a statement about the truth of a null hypothesis, or about the probability that random chance produced the observed data. The *p*-value is neither.” Here, the ASA’s statement misleads through omission: a more accurate end of the paragraph would read, “The *p*-value is neither. Nor is any other statistical test used as part of a deductive argument.” It is implicit in the way the authors have stated this principle that they believe alternative scientific methods may be appropriate to assess more directly the truth of the null hypothesis. Many readers will infer the ASA to imply the inferiority of deductive frequentist methods for scientific reasoning. The ASA statement, in its current form, will therefore make it harder for scientists to defend a choice of frequentist statistical methods during peer review. Frequentist articles will become more difficult to publish, which will create a cascade of effects on data collection, research design, and even research agendas.

Gelman and Shalizi (2013) provided a relevant discussion of the distinction between deductive reasoning (based on deducing conclusions from a hypothesis and checking whether they can be falsified, permitting data to argue against a scientific hypothesis but not directly for it) and inductive reasoning (which permits generalization, and therefore allows data to provide direct evidence for the truth of a scientific hypothesis). It is held widely, though less than universally, that only deductive reasoning is appropriate for generating scientific knowledge. Usually, frequentist statistical analysis is associated with deductive reasoning and Bayesian analysis is associated with inductive reasoning. Gelman and Shalizi (2013) argued that it is possible to use Bayesian analysis to support deductive reasoning, though that is not currently the mainstream approach in the Bayesian community. Bayesian deductive reasoning may involve, for example, refusing to use Bayes factors to support scientific conclusions. The Bayesian deductive methodology proposed by Gelman and Shalizi (2013) is a close cousin to frequentist reasoning, and in particular emphasizes the use of Bayesian p-values.

The ASA probably did not intend to make a philosophical statement on the possibility of acquiring scientific knowledge by inductive reasoning. However, it ended up doing so, by making repeated assertions implying, directly and indirectly, the legitimacy and desirability of using data to directly assess the correctness of a hypothesis. This philosophical aspect of the ASA statement is far from irrelevant for statistical practice, since the ASA position encourages the use of statistical arguments that might be considered inappropriate.

A judgment against the validity of inductive reasoning for generating scientific knowledge does not rule out its utility for other purposes. For example, the demonstrated utility of standard inductive Bayesian reasoning for some engineering applications is outside the scope of our current discussion. This amounts to the distinction Popper (1959) made between “common sense knowledge” and “scientific knowledge.”

Calaprice, A. (2005), *The New Quotable Einstein*, Princeton, NJ: Princeton University Press.

Gelman, A., and Shalizi, C. R. (2013), “Philosophy and the Practice of Bayesian Statistics,” *British Journal of Mathematical and Statistical Psychology*, 66, 8–38.

Popper, K. (1963), *Conjectures and Refutations: The Growth of Scientific Knowledge*, New York: Routledge and Kegan Paul.

Popper, K. R. (1959), *The Logic of Scientific Discovery*, London: Hutchinson.

Wasserstein, R. L., and Lazar, N. A. (2016), “The ASA’s Statement on *p*-Values: Context, Process, and Purpose,” *The American Statistician*, 70, 129–133.

**Brian Haig, Professor Emeritus**

Department of Psychology

University of Canterbury

Christchurch, New Zealand

**What do editors of psychology journals think about tests of statistical significance? Questionable editorial directives from Psychological Science**

Deborah Mayo’s (2021) recent editorial in *Conservation Biology* addresses the important issue of how journal editors should deal with strong disagreements about tests of statistical significance (ToSS). Her commentary speaks to applied fields, such as conservation science, but it is relevant to basic research, as well as other sciences, such as psychology. In this short guest commentary, I briefly remark on the role played by the prominent journal *Psychological Science* (PS) regarding whether or not researchers should employ ToSS. PS is the flagship journal of the Association for Psychological Science, and two of its editors-in-chief have offered explicit, but questionable, advice on this matter.

In the May 2005 issue of PS, the experimental psychologist Peter Killeen (2005) published an article on a new statistic that, he maintained, overcame some important deficiencies of null hypothesis significance testing. The alternative statistic, ‘*p*_{rep}’, he understood as the probability of replicating an experimental effect. In the same issue of PS, the editor-in-chief, James Cutting, opined that Killeen’s article “may change how all psychologists report their statistics”, and he promptly directed prospective contributors to use *p*_{rep} rather than *p* values when analysing their data. Within a few years, a majority of empirical articles published in PS employed *p*_{rep}, along with effect sizes. This quick rise to local prominence of *p*_{rep} was immediately followed by the publication of a number of articles highly critical of the statistic. Among other things, Killeen’s article was criticized for containing mathematical errors, and because *p*_{rep} is not actually a replication probability.

Significantly, none of the articles critical of *p*_{rep} were published in PS, despite the fact that the journal had decided at the time to devote more space to commentaries. One might reasonably fault Cutting’s editorial decision to accord *p*_{rep} favored status before statisticians and research methodologists had time to evaluate its soundness. Instead, he might have used PS as a forum for scrutiny of Killeen’s article. After a few years, and in the face of strong criticism, PS quietly dropped its recommendation that researchers use *p*_{rep}.

In 2014, the first issue of PS contained a tutorial article by Geoff Cumming (2014) on the “new statistics” that was commissioned by the incoming editor-in-chief, Eric Eich. In his accompanying editorial, Eich (2014) explicitly discouraged prospective authors from using null hypothesis significance testing, and invited them to consider using the new statistics of effect sizes, estimation, and meta-analysis. Cumming, now with Bob Calin-Jageman, continues to assiduously promote the new statistics in the form of textbooks, articles, workshops, symposia, tutorials, and a dedicated website. It is fair to say that the new statistics has become the quasi-official position of the Association for Psychological Science, and that PS continues to play a role in the uptake of the new statistics (Giofrè et al., 2017).

To my knowledge, PS has published no major critical evaluations of the new statistics, nor presented alternatives to them for consideration. In keeping with this uncritical, one-sided attitude, the major proponents of the new statistics have been reluctant to engage with published criticisms of their position. However, a strong methodological pluralism is required for the advancement of knowledge. In particular, the regular critical interplay of alternative perspectives on ToSS is crucial for their ongoing development and understanding. By promoting two questionable alternatives to ToSS (*p*_{rep} and the new statistics), and shunning well-founded alternatives to them (notably, the error-statistical and Bayesian perspectives; see Mayo, 2018; Haig, 2020), the attitudes to ToSS highlighted here can fairly be interpreted as forms of editorial negligence. Although journal editors cannot be expected to solve major statistical controversies, the directives they issue to prospective authors about statistical practice should be properly informed by relevant debates in the statistics wars.

See the previous commentary by Daniel Lakens.

**References**

Cumming, G. (2014). The new statistics: Why and how. *Psychological Science,* *25*, 7-29.

Eich, E. (2014). Business not as usual. *Psychological Science, 25*, 3-6.

Giofrè, D., et al. (2017). The influence of journal submission guidelines on authors’ reporting of statistics and use of open research practices. *PLoS ONE, 12*(4): e0175583. https://doi.org/10.1371/journal.pone.0175583

Haig, B. D. (2020). What can psychology’s statistics reformers learn from the error-statistical perspective? *Methods in Psychology*. https://doi.org/10.1016/j.metip.2020.100020

Killeen, P. R. (2005). An alternative to null-hypothesis significance tests. *Psychological Science, 16*, 345-352.

Mayo, D. G. (2018). *Statistical inference as severe testing: How to get beyond the statistics wars*. Cambridge University Press.

Mayo, D. G. (2021). The statistics wars and intellectual conflicts of interest. *Conservation Biology*. https://doi.org/10.1111/cobi.13861


**Daniël Lakens**

Associate Professor

Human Technology Interaction

Eindhoven University of Technology

**Averting journal editors from making fools of themselves**

In a recent editorial, Mayo (2021) warns journal editors to avoid calls for author guidelines to reflect a particular statistical philosophy, and not to go beyond merely enforcing the proper use of significance tests. That such a warning is needed at all should embarrass anyone working in statistics. And yet, a mere three weeks after Mayo’s editorial was published, the need for such warnings was reinforced when a co-editorial by journal editors from the International Society of Physiotherapy Journal Editors (Elkins et al., 2021), titled “Statistical inference through estimation: recommendations from the International Society of Physiotherapy Journal Editors”, stated: “[This editorial] also advises researchers that some physiotherapy journals that are members of the International Society of Physiotherapy Journal Editors (ISPJE) will be expecting manuscripts to use estimation methods instead of null hypothesis statistical tests.”

This co-editorial by journal editors in the field of physiotherapy shows the incompetence that typically underlies bans of p-values – because let’s be honest, it is always the p-value and associated significance tests that are banned, even when empirical research has shown that confidence intervals and Bayes factors are misused and misinterpreted as much, or more (Fricker et al., 2019; Hoekstra et al., 2014; Wong et al., 2021). In the co-editorial, the no-doubt well-intentioned physiotherapy journal editors recommend “estimation as an alternative approach for statistical inference”. At first glance, one might think this means the editors are recommending estimation as an alternative approach to statistical tests. In other words, we would expect to see questions that are answered by effect size estimates, and not by dichotomous claims about the presence or absence of effects. But then the editors write the following (page 3):

“The estimate and its confidence interval should be compared against the ‘smallest worthwhile effect’ of the intervention on that outcome in that population. The smallest worthwhile effect is the smallest benefit from an intervention that patients feel outweighs its costs, risk and other inconveniences. If the estimate and the ends of its confidence interval are all more favourable than the smallest worthwhile effect, then the treatment effect can be interpreted as typically considered worthwhile by patients in that clinical population.”

This is confused advice, at best. The statistical inference the editors want researchers to make is a dichotomous claim, based on whether a confidence interval excludes the smallest effect size of interest. This procedure is mathematically identical to using *p* < alpha. The question whether a treatment effect is worthwhile or not is *logically* answered by a dichotomous ‘yes’ or ‘no’. An estimate of the effect size alone does not tell one whether the effect should be regarded as random noise around a true effect size of zero, or as a non-zero effect.
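The equivalence is easy to verify with a short sketch (hypothetical numbers, assuming a normally distributed estimate with known standard error): checking whether the lower bound of a two-sided 95% confidence interval lies above the smallest worthwhile effect yields exactly the same decision as a one-sided test of H0: effect ≤ SESOI at alpha = 0.025.

```python
from statistics import NormalDist

def ci_vs_test(est, se, sesoi, conf=0.95):
    """Compare two routes to the same dichotomous decision:
    (a) lower bound of a two-sided CI exceeds the smallest effect of interest;
    (b) one-sided test of H0: effect <= sesoi rejects at alpha = (1 - conf)/2."""
    nd = NormalDist()
    z_crit = nd.inv_cdf(1 - (1 - conf) / 2)        # 1.96 for a 95% CI
    ci_lower = est - z_crit * se
    p_one_sided = 1 - nd.cdf((est - sesoi) / se)
    return ci_lower > sesoi, p_one_sided < (1 - conf) / 2

# Hypothetical estimate 0.8 (SE 0.2) against a smallest worthwhile effect of 0.3:
print(ci_vs_test(0.8, 0.2, 0.3))  # (True, True) – both routes say 'worthwhile'
print(ci_vs_test(0.4, 0.2, 0.3))  # (False, False) – both routes say 'not shown'
```

Algebraically, `ci_lower > sesoi` rearranges to `(est - sesoi)/se > z_crit`, which is exactly `p_one_sided < (1 - conf)/2`, so the two decisions can never disagree.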

The editors should clearly have followed Mayo’s (2021) advice not to go beyond enforcing proper use of significance tests. Estimation and significance testing answer two different questions. Estimation can’t, as the physiotherapists hope, replace significance tests. The conflict between the two approaches becomes apparent when we ask ourselves how researchers who want to publish in these physiotherapy journals should deal with situations where they would lower the alpha level to correct for multiple comparisons or sequential analyses. Are authors required to report a 99% confidence interval in cases where they would have used a Bonferroni correction when examining 5 independent test results, because they would otherwise have divided the 5% alpha by five? Or should they ignore error rates and make claims based on a 95% confidence interval, even when this would lead to many more articles claiming treatments are beneficial than we currently find acceptable? Related applied questions facing researchers who want to publish in physiotherapy journals are which confidence interval they should report to begin with (a 95% confidence interval is based on the idea that a maximum 5% error rate is deemed acceptable when making dichotomous claims, whereas a desired accuracy requires a different justification), and how to justify their sample size (will editors accept papers with any sample size, or do they still expect an a priori power analysis based on low Type 1 and Type 2 error rates when making claims about effect sizes?).
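The arithmetic behind the 99% figure in the hypothetical five-comparison scenario above is simple (a minimal sketch, assuming normal-theory intervals): a Bonferroni correction divides the 5% familywise alpha by the number of comparisons, and the matching confidence level is one minus that per-comparison alpha.

```python
from statistics import NormalDist

def bonferroni_ci_level(alpha=0.05, m=5):
    """Confidence level (and z critical value) whose two-sided coverage
    matches a Bonferroni-corrected per-comparison alpha of alpha / m."""
    per_comparison_alpha = alpha / m               # 0.05 / 5 = 0.01
    level = 1 - per_comparison_alpha               # 0.99 -> report 99% CIs
    z_crit = NormalDist().inv_cdf(1 - per_comparison_alpha / 2)
    return level, z_crit

level, z = bonferroni_ci_level()                   # 5 comparisons, familywise alpha 0.05
print(round(level, 2), round(z, 2))                # 0.99 2.58 – wider than the 1.96 of a 95% CI
```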

As Mayo (2021) writes, “The key function of statistical tests is to constrain the human tendency to selectively favor views they believe in.” Fricker and colleagues (2019) show how removing p-values and significance testing from the journal *Basic and Applied Social Psychology* has led to the publication of articles in which claims are made that have a much higher probability of being wrong than was the case before p-values were banned, without transparently communicating this high error rate. Anyone who reads physiotherapy journals that follow the editors’ guideline to use ‘estimation’ needs to be prepared for the same development in those journals. As Mayo (2021) notes in her editorial, banning proper uses of thresholds in significance tests makes it “harder to hold data dredgers culpable for reporting a nominally small *p* value obtained through data dredging”.

The statistical philosophy of estimation is not designed to answer questions about the presence or absence of a beneficial effect. That a large group of journal editors thinks it can shows how rational thought often takes a backseat when journal editors start to make recommendations about how to improve statistical inferences.

What can journal editors do to avert incoherent recommendations that force researchers to use approaches that do not answer the questions they are asking? The answer is simple: they should require a coherent approach to statistical inference, anchored in an epistemology, that answers the question a researcher is interested in. The task of journals is to evaluate the quality of the work that is submitted, not to dictate the questions researchers ask. Of course, a journal can declare that ‘high quality’ means work in which no scientific claims are made, or in which claims are made without any control of the rate at which those claims are wrong – I would look forward to the arguments for such a viewpoint, and doubt they would be convincing. Let’s hope Mayo’s (2021) editorial prevents similar groups of journal editors from making fools of themselves in the future.

**See Brian Haig’s commentary next.**

**References**

- Elkins, M. R., Pinto, R. Z., Verhagen, A., Grygorowicz, M., Söderlund, A., Guemann, M., Gómez-Conesa, A., Blanton, S., Brismée, J.-M., Ardern, C., Agarwal, S., Jette, A., Karstens, S., Harms, M., Verheyden, G., & Sheikh, U. (2021). Statistical inference through estimation: Recommendations from the International Society of Physiotherapy Journal Editors. *Journal of Physiotherapy*. https://doi.org/10.1016/j.jphys.2021.12.001
- Fricker, R. D., Burke, K., Han, X., & Woodall, W. H. (2019). Assessing the statistical analyses used in Basic and Applied Social Psychology after their p-value ban. *The American Statistician*, *73*(sup1), 374–384. https://doi.org/10.1080/00031305.2018.1537892
- Hoekstra, R., Morey, R. D., Rouder, J. N., & Wagenmakers, E.-J. (2014). Robust misinterpretation of confidence intervals. *Psychonomic Bulletin & Review*, *21*(5), 1157–1164. https://doi.org/10.3758/s13423-013-0572-3
- Mayo, D. (2021). The statistics wars and intellectual conflicts of interest. *Conservation Biology*. https://doi.org/10.1111/cobi.13861
- Wong, T. K., Kiers, H., & Tendeiro, J. (2021). *On the potential mismatch between the function of the Bayes factor and researchers’ expectations*.

**Commentaries on my editorial (from Jan 5-Jan 18*)**

Park

Dennis

Stark

Staley

Pawitan

Hennig

Ionides and Ritov

Haig

Lakens

*Let me know if you wish to write one
