2023 Syllabus for Philosophy of Inductive-Statistical Inference

PHIL 6014 (crn: 20919): Spring 2023 

Philosophy of Inductive-Statistical Inference
(This is an IN-PERSON class*)
Wed 4:00-6:30 pm, McBryde 22
*There may be opportunities for zooming half-way through the semester.

Syllabus: First Installment (PDF)

D. Mayo (2018), Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (SIST), CUP (electronic and paper copies provided to those taking the class; proofs are at errorstatistics.com, see below).
Articles from the Captain’s Bibliography (links to new articles will be provided). Other useful information can be found in the SIST Abstracts & Keywords and in this post with SIST Excerpts & Mementos.

Date | Themes/readings
1. 1/18      Introduction to the Course:
How to tell what’s true about statistical inference

(1/18/23 SLIDES here)

Reading: Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (SIST): Preface, Excursion 1 Tour I 1.1-1.3, 9-29

MISC: Souvenir A; SIST Abstracts & Keywords for all excursions and tours
2. 1/25
Q #2
 
Error Probing Tools vs Comparative Evidence: Likelihood & Probability
What counts as cheating?
Intro to Logic: arguments, validity & soundness

(1/25/23 SLIDES here)

Reading: SIST: Excursion 1 Tour II 1.4-1.5, 30-55
Session #2 Questions: (PDF)

MISC: NOTES on Excursion 1, SIST: Souvenirs B, C & D, Logic Primer (PDF)
3. 2/1
   Q #3
UPDATED
Induction and Confirmation: PhilStat & Formal Epistemology
The Traditional Problem of Induction
Is Probability a Good Measure of Confirmation? Tacking Paradox

(2/1/23 SLIDES here)

Reading: SIST: Excursion 2, Tour I: 2.1-2.2, 59-74
Hacking “The Basic Rules of Probability” Hand Out (PDF)
UPDATED: Session #3 Questions: (PDF)

MISC: Excursion 2 Tour I Blurb & notes
4. 2/8 &
5. 2/15
Assign 1 2/15 
Falsification, Science vs Pseudoscience, Induction
Statistical Crises of Replication in Psychology & other sciences

Popper, severity and novelty, array of problems and models
Fallacies of rejection, Duhem’s problem; solving induction now

Reading for 2/8: Popper, Ch 1 from Conjectures and Refutations (PDF), Popper Test
Reading for 2/15: SIST: Excursion 2, Tour II: 2.3-2.7, pages TBA
Optional for 2/15: Gelman & Loken (2014)

ASSIGNMENT 1 (due 2/15) (PDF)

MISC: SIST Souvenirs (E), (F), (G), (H)
Excursion 2 Tour II Blurb & notes
 Fisher Birthday: February 17: Celebration of N-F wars
6. 2/22 & 7. 3/1 | Ingenious and Severe Tests: Fisher, Neyman-Pearson, Cox: Concepts of Tests
Reading: SIST: Excursion 3 Tour I: 3.1-3.3, 119-163 (trade-offs 328-330)

The Triad: Fisher (1955), Pearson (1955); Neyman (1956)
The 1919 eclipse tests; Fisherian and N-P Tests;
Frequentist principle of evidence: FEV

Apps for statistical testing 
MISC: Excursion 3 Tour I Blurb & notes
SPRING BREAK Statistical Exercises While Sunning (March 4-12)

The following is very tentative, and will depend on student interests.
8. 3/15 (Assign 2) | Confidence & Fiducial Intervals and Deeper Concepts:
Higgs Discovery
9. 3/22 | Objectivity in Science: Objectivity in Error Statistics & Bayesian Philosophies
10. 3/29 (Short essay) | Bayes factors and Bayes/Fisher Disagreement, Jeffreys-Lindley Paradox
11. 4/5 | Biasing Selection Effects, P-Hacking, Data Dredging, etc.
12. 4/12
Assign 3
Negative Results: Power vs Severity
13. 4/19 | Should Statistical Significance Tests be Abandoned, Retired, or Replaced?
Other: TBA
Other: TBA
14. 4/26 | Severity, Sensitivity, Safety: PhilStat and Classical Epistemology
15. 5/3 | Current Reforms and Stat Activism: Practicing Our Skills
  Final Paper
Categories: Announcement, new course | 2 Comments

I’m teaching a New Intro to PhilStat Course Starting Wednesday:

Ship StatInfasst (Statistical Inference as Severe Testing: SIST) will set sail on Wednesday January 18 when I begin a weekly seminar on the Philosophy of Inductive-statistical inference. I’m planning to write a new edition and/or companion to SIST (Mayo 2018, CUP), so it will be good to retrace the journey. I’m not requiring a statistics or philosophy background. All materials will be on this blog, and around halfway through there may be an opportunity to zoom, if there’s interest.

 

Categories: Announcement, new course | 2 Comments

The First 2023 Act of Stat Activist Watch: Statistics ‘for the people’

One of the central roles I proposed for “stat activists” (after our recent workshop, The Statistics Wars and Their Casualties) is to critically scrutinize mistaken claims about leading statistical methods–especially when such claims are put forward as permissible viewpoints to help “the people” assess methods in an unbiased manner. The first act of 2023 under this umbrella concerns an article put forward as “statistics for the people” in a journal of radiation oncology. We are talking here about recommendations for analyzing data for treating cancer! Though put forward as a fair-minded, or at least an informative, comparison of Bayesian vs frequentist methods, it is, I find, little more than an advertisement for subjective Bayesian methods set against a caricature of frequentist error statistical methods. The journal’s “statistics for the people” section would benefit from a full-blown article on frequentist error statistical methods–not just the letter of ours they recently published–but I’m grateful to Chowdhry and other colleagues who joined me in this effort. You will find our letter below, followed by the authors’ response. You can also find a link to their original “statistics for the people” article in the references. Let me admit right off that my criticisms are a bit stronger than my co-authors’.

Two quick additional points I would make to the authors in relation to their paper and response:

  1. The application of Bayes rule in their example of diagnostic screening, to compute the probability of Covid given a positive test, is just an application of conditional probability to events; it is fully carried out by frequentist means (see the sketch after this list). There’s nothing really “Bayesian” about (frequentist!) diagnostic screening, yet it is a main example relied on to argue against frequentist probability.
  2. There’s no such thing as an uninformative prior–this was given up on over a decade ago.
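A minimal sketch of the screening computation in point 1, carried out by ordinary (frequentist) conditional probability. The prevalence, sensitivity, and specificity below are made-up numbers for illustration only; they are not taken from the article under discussion.

# Diagnostic screening by ordinary conditional probability (Bayes' rule
# applied to event frequencies). All inputs are hypothetical.
prevalence = 0.02        # P(Covid) among those screened
sensitivity = 0.90       # P(test positive | Covid)
specificity = 0.95       # P(test negative | no Covid)

p_positive = sensitivity * prevalence + (1 - specificity) * (1 - prevalence)
p_covid_given_positive = sensitivity * prevalence / p_positive

print(f"P(test positive)         = {p_positive:.4f}")
print(f"P(Covid | test positive) = {p_covid_given_positive:.4f}")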

I would never have come across an article in radiation oncology if it were not for exchanges among members of a session I was in on “why we disagree” about statistical analysis in that field. I hereby invite all readers, and the nearly 1000 registrants from our workshop, to alert us throughout the year to interesting items under the stat activist banner.

Our letter: Bayesian Versus Frequentist Statistics: In Regard to Fornacon-Wood et al. (PDF of letter)

To the Editor:

We appreciate the authors bringing attention to controversies surrounding the use of Bayesian and frequentist statistics.1 [PDF of paper] There are many benefits to frequentist statistics and disadvantages of Bayesian statistics which were not discussed in the referenced article. We write this accompanying letter to aim for a more balanced presentation of Bayesian and frequentist statistics.

With frequentist statistical significance tests, we can learn whether the data indicate there is a genuine effect or difference in a statistical analysis, as they have the ability to control type I and type II error probabilities.2 Posteriors and Bayes factors do not ensure that the method rarely reports one treatment is better or worse than the other erroneously. A well-known threat to reliable results stems from the ease of using high powered methods to data-dredge and try to hunt for impressive-looking results that fail to replicate with new data. However, the Bayesian assessment is not altered by things like stopping rules–at least not without violating inference by Bayes theorem.3 The frequentist account,4 by contrast, is required to take account of such selection effects in reporting error probabilities. Another caution for those unfamiliar with practical Bayesian research is that estimation of a prior distribution is nontrivial. The priors they discuss are subjective degrees of belief, but there is considerable disagreement about which beliefs are warranted, even among experts. Furthermore, should conclusions differ if the prior is chosen by a radiation oncologist or a surgeon?5 These considerations are some of the reasons why most phase 3 studies in oncology rely on frequentist designs. The article equates frequentist methods with simple null hypothesis testing without alternatives, thereby overlooking hypothesis testing methods that control both type I and II errors. The frequentist takes account of type II errors and the corresponding notion of power. If a test has high power to detect a meaningful effect size, then failing to detect a statistically significant difference is evidence against a meaningful effect. Therefore, a P value that is not small is informative.
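A rough numerical sketch of the power point above, using a one-sided Normal (z) test with hypothetical numbers; this sketch is an illustration added for this post and is not part of the published letter.

# Power of a one-sided z-test of H0: mu <= 0, and what a non-significant
# result indicates when power is high. All numbers are hypothetical.
from scipy.stats import norm

alpha, sigma, n = 0.05, 10.0, 100
se = sigma / n**0.5
z_crit = norm.ppf(1 - alpha)          # cutoff on the standardized scale

def power(delta):
    """P(reject H0 | true mean = delta)."""
    return 1 - norm.cdf(z_crit - delta / se)

print(power(3.0))   # high power against a 'meaningful' effect of 3
# If power(3.0) is high and the test does NOT reject, that failure to reject
# is evidence against the presence of an effect as large as 3.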

The authors write that frequentist methods do not use background information, but this is to ignore the field of experimental design and all of the work that goes into specifying the test (eg, sample size, statistical power) and critically evaluating the connection between statistical and substantive results. An effect that corresponds to a clinically meaningful effect, or effect sizes well warranted from previous studies, would clearly influence the design.

Although their article engenders important discussion, these differences between frequentist and Bayesian methods may help readers understand why so many researchers around the world still prefer the frequentist approach.

  • Amit K. Chowdhry, MD, PhD
    Department of Radiation Oncology
    University of Rochester Medical Center
    Rochester, New York
  • Deborah Mayo,
    Department of Philosophy
    Virginia Tech
    Blacksburg, Virginia
  • Stephanie L. Pugh, PhD
    NRG Oncology Statistical and Data Management Center
    American College of Radiology
    Philadelphia, Pennsylvania
  • John Park, MD
    Department of Radiation Oncology
    Kansas City VA Medical Center
    Kansas City, Missouri
  • Clifton David Fuller, MD,
    Department of Radiation Oncology
    MD Anderson Cancer Center
    Houston, Texas
  • John Kang, MD, PhD
    Department of Radiation Oncology
    University of Washington
    Seattle, Washington

References

  1. Fornacon-Wood I, Mistry H, Johnson-Hart C, et al. Understanding the differences between Bayesian and frequentist statistics. Int J Radiat Oncol Biol Phys 2022;112:1076-1082. (PDF)
  2. Mayo DG. Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars. Cambridge, UK: Cambridge University Press; 2018.
  3. Ryan EG, Brock K, Gates S, Slade D. Do we need to adjust for interim analyses in a Bayesian adaptive trial design? BMC Med Res Methodol 2020;20:150.
  4. Jennison C, Turnbull BW. Group Sequential Methods With Applications to Clinical Trials. Boca Raton, FL: CRC Press; 1999.
  5. Staley K, Park J. Comment on Mayo’s “The statistics wars and intellectual conflicts of interest”. Conserv Biol 2022;36:e13861.

 

Fornacon-Wood Reply: In Reply to Chowdhry et al. (PDF of letter)

To the Editor:

We thank the authors for their response  to our “statistics for the people” article that aimed to introduce perhaps unfamiliar readers to Bayesian statistics and some potential advantages of their use. We agree that frequentist statistics are a useful and widespread statistical analytical approach, and we are not aiming to revisit the frequentist versus Bayesian arguments that have been well articulated in the literature.  However, there are a couple of points we would like to make.

First, we acknowledge that the majority of phase 3 studies use frequentist designs, and this has the advantage of facilitating meta-analyses using established techniques. However, we would argue that the reason such frequentist designs are so prevalent is likely to have as much to do with convention (from funders/regulators as well as from researchers themselves), the relative exposure of the 2 approaches in educational materials, and the historic difficulties in calculating Bayesian posteriors as it does with the arguments the authors make.

Second, although we agree with Chowdhry et al that there are many challenges associated with the estimation of prior probability distributions, we note that similar arguments apply to effect size estimation, which they cite as a strength of the Neyman-Pearson/null hypothesis significance testing approach (ie, the use of power calculations to limit the risk of type II errors). We would also reinforce the point we make in the article about the importance of testing the influence of the prior (represented as the divergent beliefs of the hypothetical radiation oncologist and surgeon in the communication by Chowdhry et al) in the analysis results. If the data are strong enough, the posterior distributions will be in close enough agreement to convince both parties. As we noted, it is also possible to undertake Bayesian analyses without prior information, using an uninformative prior, in which case the analysis is driven directly by the data, as for a frequentist calculation. As an aside, there is continued debate about the relative merits and deficiencies of the different frequentist approaches to significance testing, particularly around the widespread use of the hybrid Neyman-Pearson/null hypothesis significance testing approach.

********************************************************

Please share your constructive remarks in the comments to this post.

 

 

 

Categories: stat activist watch 2023, statistical significance tests | 2 Comments

Midnight With Birnbaum: Happy New Year 2023!


For the last three years, unlike the previous 10 years that I’ve been blogging, it was not feasible to actually revisit that spot in the road, looking to get into a strange-looking taxi, to head to “Midnight With Birnbaum”. But this year I will, and I’m about to leave at 10pm. (The pic on the left is the only blurry image I have of the club I’m taken to.) My book Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (CUP, 2018) doesn’t include the argument from my article in Statistical Science (“On the Birnbaum Argument for the Strong Likelihood Principle”), but you can read it at that link along with commentaries by A. P. Dawid, Michael Evans, Martin and Liu, D. A. S. Fraser, Jan Hannig, and Jan Bjornstad. David Cox, who very sadly died in January 2022, is the one who encouraged me to write and publish it. (The first David R. Cox Foundations of Statistics Prize will be awarded at the JSM 2023.) The (Strong) Likelihood Principle (LP or SLP) remains at the heart of many of the criticisms of Neyman-Pearson (N-P) statistics and of error statistics in general. Continue reading

Categories: Likelihood Principle, optional stopping, P-value | Leave a comment

THE STATISTICS WARS AND THEIR CASUALTIES VIDEOS & SLIDES FROM SESSIONS 3 & 4

Below are the videos and slides from the 7 talks from Session 3 and Session 4 of our workshop The Statistics Wars and Their Casualties held on December 1 & 8, 2022. Session 3 speakers were: Daniele Fanelli (London School of Economics and Political Science), Stephan Guttinger (University of Exeter), and David Hand (Imperial College London). Session 4 speakers were: Jon Williamson (University of Kent), Margherita Harris (London School of Economics and Political Science), Aris Spanos (Virginia Tech), and Uri Simonsohn (Esade Ramon Llull University). Abstracts can be found here. In addition to the talks, you’ll find (1) a Recap of recaps at the beginning of Session 3 that provides a summary of Sessions 1 & 2, and (2) Mayo’s (5 minute) introduction to the final discussion, “Where do we go from here (Part ii)”, at the end of Session 4.

The videos & slides from Sessions 1 & 2 can be found on this post.

Readers are welcome to use the comments section on the PhilStatWars.com workshop blog post here to make constructive comments or to ask questions of the speakers. If you’re asking a question, indicate to which speaker(s) it is directed. We will leave it to speakers to respond. Thank you! Continue reading

Categories: Error Statistics | Leave a comment

Slides from PSA22 symposium: Multiplicity, Data-Dredging, and Error Control


Below are slides from 4 of the talks given in our Philosophy of Science Association (PSA) session from last month: the PSA 22 Symposium: Multiplicity, Data-Dredging, and Error Control. It was held in Pittsburgh on November 13, 2022. I will write some reflections in the “comments” to this post. I invite your constructive comments there as well. Continue reading

Categories: data dredging, multiplicity, PSA | 1 Comment

Final Session: The Statistics Wars and Their Casualties: 8 December, Session 4

Thursday, December 8 will be the Final Session (Session 4) of my workshop, The Statistics Wars and Their Casualties. There will be 4 new speakers. It’s not too late to register:

registration form

At the end of this post is “A recap of recaps”, the short video we showed at the beginning of Session 3 last week that summarizes the presentations from Sessions 1 & 2 back on September 22-23. Continue reading

Categories: Announcement, Statistics Wars and Their Casualties Workshop | Leave a comment

SCHEDULE: The Statistics Wars and Their Casualties: 1 Dec & 8 Dec: Sessions 3 & 4

It’s not too late to register for Sessions #3 and #4 of our online Workshop. There will be 7 new (live) speakers and, for the first time ever, the (short) movie “The Recap of recaps” will be shown at the start of Session #3: registration form

Categories: Announcement, Statistics Wars and Their Casualties Workshop | Leave a comment

Final Sessions: The Statistics Wars and Their Casualties: 1 December and 8 December

The Statistics Wars

and Their Casualties

1 December and 8 December 2022
Sessions #3 and #4

15:00-18:15 London Time / 10:00 am-1:15 pm EST
ONLINE
(London School of Economics, CPNSS)
registration form

For slides and videos of Sessions #1 and #2: see the workshop page

1 December

Session 3 (Moderator: Daniël Lakens, Eindhoven University of Technology)

OPENING 

  • “What Happened So Far”: A medley (20 min) of recaps from Sessions 1 & 2: Deborah Mayo (Virginia Tech), Richard Morey (Cardiff), Stephen Senn (Edinburgh), Daniël Lakens (Eindhoven), Christian Hennig (Bologna) & Yoav Benjamini (Tel Aviv).

SPEAKERS

  • Daniele Fanelli (London School of Economics and Political Science) The neglected importance of complexity in statistics and Metascience  (Abstract)
  • Stephan Guttinger (University of Exeter) What are questionable research practices? (Abstract)
  • David J. Hand (Imperial College, London) What’s the question? (Abstract)

DISCUSSIONS:

  • Closing Panel: “Where Should Stat Activists Go From Here (Part i)?”: Yoav Benjamini, Daniele Fanelli, Stephan Guttinger, David Hand, Christian Hennig, Daniël Lakens, Deborah Mayo, Richard Morey, Stephen Senn

8 December

Session 4 (Moderator: Deborah Mayo, Virginia Tech)

SPEAKERS

  • Jon Williamson (University of Kent) Causal inference is not statistical inference (Abstract)
  • Margherita Harris (London School of Economics and Political Science) On Severity, the Weight of Evidence, and the Relationship Between the Two (Abstract)
  • Aris Spanos (Virginia Tech) Revisiting the Two Cultures in Statistical Modeling and Inference as they relate to the Statistics Wars and Their Potential Casualties (Abstract)
  • Uri Simonsohn (Esade Ramon Llull University) Mathematically Elegant Answers to Research Questions No One is Asking (meta-analysis, random effects models, and Bayes factors) (Abstract)

DISCUSSIONS:

  • Closing Panel: “Where Should Stat Activists Go From Here (Part ii)?”: Workshop Participants: Yoav Benjamini, Alexander Bird, Mark Burgman, Daniele Fanelli, Stephan Guttinger, David Hand, Margherita Harris, Christian Hennig, Daniël Lakens, Deborah Mayo, Richard Morey, Stephen Senn, Uri Simonsohn, Aris Spanos, Jon Williamson

**********************************************************************

  • DESCRIPTION: While the field of statistics has a long history of passionate foundational controversy, the last decade has, in many ways, been the most dramatic. Misuses of statistics, biasing selection effects, and high-powered methods of big-data analysis have helped to make it easy to find impressive-looking but spurious results that fail to replicate. As the crisis of replication has spread beyond psychology and social sciences to biomedicine, genomics, machine learning and other fields, the need for critical appraisal of proposed reforms is growing. Many are welcome (transparency about data, eschewing mechanical uses of statistics); some are quite radical. The experts do not agree on the best ways to promote trustworthy results, and these disagreements often reflect philosophical battles–old and new–about the nature of inductive-statistical inference and the roles of probability in statistical inference and modeling. Intermingled in the controversies about evidence are competing social, political, and economic values. If statistical consumers are unaware of assumptions behind rival evidence-policy reforms, they cannot scrutinize the consequences that affect them. What is at stake is a critical standpoint that we may increasingly be in danger of losing. Critically reflecting on proposed reforms and changing standards requires insights from statisticians, philosophers of science, psychologists, journal editors, economists and practitioners from across the natural and social sciences. This workshop will bring together these interdisciplinary insights–from speakers as well as attendees.

Speakers/Panellists:

Sponsors/Affiliations:

  • The Foundation for the Study of Experimental Reasoning, Reliability, and the Objectivity and Rationality of Science (E.R.R.O.R.S.); Centre for Philosophy of Natural and Social Science (CPNSS), London School of Economics; Virginia Tech Department of Philosophy
  • Organizers: D. Mayo, R. Frigg and M. Harris
  • Logistician (chief logistics and contact person): Jean Miller
  • Executive Planning Committee: Y. Benjamini, D. Hand, D. Lakens, S. Senn
Categories: Announcement, Statistics Wars and Their Casualties Workshop | Leave a comment

S. Senn: Lauding Lord (Guest Post)

 


Stephen Senn
Consultant Statistician
Edinburgh, Scotland

A Diet of Terms

A large university is interested in investigating the effects on the students of the diet provided in the university dining halls and any sex difference in these effects. Various types of data are gathered. In particular, the weight of each student at the time of his arrival in September and their weight the following June are recorded. (p. 304)

This is how Frederic Lord (1912-2000) introduced the paradox (1) that now bears his name. It is justly famous (or notorious). However, the addition of sex as a factor adds nothing to the essence of the paradox and (in my opinion) merely confuses the issue. Furthermore, studying the effect of diet needs some sort of control. Therefore, I shall consider the paradox in the purer form proposed by Wainer and Brown (2), which was subtly modified by Pearl and Mackenzie in The Book of Why (3) (see pp. 212-217). Continue reading

Categories: Lord's paradox, S. Senn | 8 Comments

Multiplicity, Data-Dredging, and Error Control Symposium at PSA 2022: Mayo, Thornton, Glymour, Mayo-Wilson, Berger


Some claim that no one attends Sunday morning (9am) sessions at the Philosophy of Science Association. But if you’re attending the PSA (in Pittsburgh), we hope you’ll falsify this supposition and come to hear us (Mayo, Thornton, Glymour, Mayo-Wilson, Berger) wrestle with some rival views on the trenchant problems of multiplicity, data-dredging, and error control. Coffee and donuts to all who show up.

Multiplicity, Data-Dredging, and Error Control
November 13, 9:00 – 11:45 AM
(link to symposium on PSA website)

Speakers: Continue reading

Categories: Announcement, PSA | Leave a comment

Where should stat activists go from here? (part (i))


From what standpoint should we approach the statistics wars? That’s the question from which I launched my presentation at the Statistics Wars and Their Casualties workshop (phil-stat-wars.com). In my view, it should be, not from the standpoint of technical disputes, but from the non-technical standpoint of the skeptical consumer of statistics (see my slides here). What should we do now as regards the controversies and conundrums growing out of the statistics wars? We should not leave off the discussions of our workshop without at least sketching a future program for answering this question. We still have 2 more sessions, December 1 and 8, but I want to prepare us for the final discussions which should look beyond a single workshop. (The slides and videos from the presenters in Sessions 1 and 2 can be found here.)

I will consider three, interrelated, responsibilities and tasks that we can undertake as statistical activist citizens. In so doing I will refer to presentations from the workshop, limiting myself to session #1. (I will add more examples in part (ii) of this post.) Continue reading

Categories: Error Statistics, significance tests, stat wars and their casualties | Leave a comment

My Slides from the workshop: The statistics wars and their casualties


I will be writing some reflections on our two workshop sessions on this blog soon, but for now, here are just the slides I used on Thursday, 22 September. If you wish to ask a question of any of the speakers, use the blogpost at phil-stat-wars.com. The slides from the other speakers will also be up there on Monday.

Deborah G. Mayo’s slides from the workshop: The Statistics Wars and Their Casualties, Session 1, September 22, 2022.

Categories: Error Statistics | 3 Comments

22-23 September final schedule for workshop: The statistics wars and their casualties ONLINE

The Statistics Wars
and Their Casualties

Final Schedule for September 22 & 23 (Workshop Sessions 1 & 2) Continue reading

Categories: Error Statistics | Leave a comment

22-23 Workshop Schedule: The Statistics Wars and Their Casualties: ONLINE

You can still register: https://phil-stat-wars.com/2022/09/19/22-23-september-workshop-schedule-the-statistics-wars-and-their-casualties/ Continue reading

Categories: Error Statistics | Leave a comment

Upcoming Workshop: The Statistics Wars and Their Casualties workshop

The Statistics Wars
and Their Casualties

22-23 September 2022
15:00-18:00 London Time*
ONLINE
(London School of Economics, CPNSS)

To register for the  workshop,
please fill out the registration form here.

For schedules and updated details, please see the workshop webpage: phil-stat-wars.com.

*These will be Sessions 1 & 2; there will be two more
online sessions (3 & 4) on December 1 & 8.

Continue reading

Categories: Announcement, stat wars and their casualties | 1 Comment

Free access to “Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars” (CUP, 2018) for 1 more week


Thanks to CUP, the electronic version of my book, Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (2018), is available for free for one more week (through August 31) at this link:  https://www.cambridge.org/core/books/statistical-inference-as-severe-testing/D9DF409EF568090F3F60407FF2B973B2  Blurbs of the 16 tours in the book may be found here: blurbs of the 16 tours.

Categories: Announcement, SIST | Leave a comment

Statistical Concepts in Their Relation to Reality–E.S. Pearson

11 August 1895 – 12 June 1980

This is my third and final post marking Egon Pearson’s birthday (Aug. 11). The focus is his little-known paper: “Statistical Concepts in Their Relation to Reality” (Pearson 1955). I’ve linked to it several times over the years, but always find a new gem or two, despite its being so short. E. Pearson rejected some of the familiar tenets that have come to be associated with Neyman and Pearson (N-P) statistical tests, notably the idea that the essential justification for tests resides in their long-run control of rates of erroneous interpretations over repeated applications–what he termed the “behavioral” rationale of tests. In an unpublished letter E. Pearson wrote to Birnbaum (1974), he talks about N-P theory admitting of two interpretations: behavioral and evidential:

“I think you will pick up here and there in my own papers signs of evidentiality, and you can say now that we or I should have stated clearly the difference between the behavioral and evidential interpretations. Certainly we have suffered since in the way the people have concentrated (to an absurd extent often) on behavioral interpretations”.

(Nowadays, it might be said that some people concentrate to an absurd extent on “science-wise error rates” in their view of statistical tests as dichotomous screening devices.) Continue reading

Categories: Egon Pearson, phil/history of stat, Philosophy of Statistics | Tags: , , | 1 Comment

A. Spanos: Egon Pearson’s Neglected Contributions to Statistics

Continuing with posts on E.S. Pearson in marking his birthday, I reblog this guest post by Aris Spanos. 

Egon Pearson’s Neglected Contributions to Statistics

by Aris Spanos

Egon Pearson (11 August 1895 – 12 June 1980) is widely known today for his contribution in recasting Fisher’s significance testing into the Neyman-Pearson (1933) theory of hypothesis testing. Occasionally, he is also credited with contributions in promoting statistical methods in industry and in the history of modern statistics; see Bartlett (1981). What is rarely mentioned is Egon’s early pioneering work on:

(i) specification: the need to state explicitly the inductive premises of one’s inferences,

(ii) robustness: evaluating the ‘sensitivity’ of inferential procedures to departures from the Normality assumption, as well as

(iii) Mis-Specification (M-S) testing: probing for potential departures from the Normality  assumption.

Arguably, modern frequentist inference began with the development of various finite sample inference procedures, initially by William Gosset (1908) [of Student’s t fame] and then Fisher (1915, 1921, 1922a-b). These inference procedures revolved around a particular statistical model, known today as the simple Normal model:

Xk ∽ NIID(μ,σ²), k=1,2,…,n,…             (1)

where ‘NIID(μ,σ²)’ stands for ‘Normal, Independent and Identically Distributed with mean μ and variance σ²’. These procedures include the ‘optimal’ estimators of μ and σ², Xbar and s², and the pivotal quantities:

(a) τ(X) =[√n(Xbar- μ)/s] ∽ St(n-1),  (2)

(b) v(X) =[(n-1)s²/σ²] ∽ χ²(n-1),        (3)

where St(n-1) and χ²(n-1) denote the Student’s t and chi-square distributions with (n-1) degrees of freedom.
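A minimal simulation sketch (added here for illustration; the parameter values are arbitrary) confirming that, under the simple Normal model in (1), the pivotal quantities (2) and (3) follow the stated St(n-1) and chi-square(n-1) distributions:

# Simulate the pivotal quantities (2) and (3) under the simple Normal model (1).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
mu, sigma, n, reps = 5.0, 2.0, 15, 20000

x = rng.normal(mu, sigma, size=(reps, n))
xbar = x.mean(axis=1)
s2 = x.var(axis=1, ddof=1)

tau = np.sqrt(n) * (xbar - mu) / np.sqrt(s2)   # (2): should be Student's t(n-1)
v = (n - 1) * s2 / sigma**2                    # (3): should be chi-square(n-1)

# Compare simulated and theoretical 95th percentiles:
print(np.quantile(tau, 0.95), stats.t.ppf(0.95, df=n - 1))
print(np.quantile(v, 0.95), stats.chi2.ppf(0.95, df=n - 1))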

The question of ‘how these inferential results might be affected when the Normality assumption is false’ was originally raised by Gosset in a letter to Fisher in 1923:

“What I should like you to do is to find a solution for some other population than a normal one.”  (Lehmann, 1999)

He went on to say that he tried the rectangular (uniform) distribution but made no progress, and he was seeking Fisher’s help in tackling this ‘robustness/sensitivity’ problem. In his reply that was unfortunately lost, Fisher must have derived the sampling distribution of τ(X), assuming some skewed distribution (possibly log-Normal). We know this from Gosset’s reply:

“I like the result for z [τ(X)] in the case of that horrible curve you are so fond of. I take it that in skew curves the distribution of z is skew in the opposite direction.”  (Lehmann, 1999)

After this exchange Fisher was not particularly receptive to Gosset’s requests to address the problem of working out the implications of non-Normality for the Normal-based inference procedures; t, chi-square and F tests.

In contrast, Egon Pearson shared Gosset’s concerns about the robustness of Normal-based inference results (a)-(b) to non-Normality, and made an attempt to address the problem in a series of papers in the late 1920s and early 1930s.

This line of research for Pearson began with a review of Fisher’s 2nd edition of the 1925 book, published in Nature, and dated June 8th, 1929.  Pearson, after praising the book for its path breaking contributions, dared raise a mild criticism relating to (i)-(ii) above:

“There is one criticism, however, which must be made from the statistical point of view. A large number of tests are developed upon the assumption that the population sampled is of ‘normal’ form. That this is the case may be gathered from a very careful reading of the text, but the point is not sufficiently emphasised. It does not appear reasonable to lay stress on the ‘exactness’ of tests, when no means whatever are given of appreciating how rapidly they become inexact as the population samples diverge from normality.” (Pearson, 1929a)

Fisher reacted badly to this criticism and was preparing an acerbic reply to the ‘young pretender’ when Gosset jumped into the fray with his own letter in Nature, dated July 20th, in an obvious attempt to moderate the ensuing fight. Gosset succeeded in tempering Fisher’s reply, dated August 17th: instead of addressing the ‘robustness/sensitivity’ issue, Fisher focused primarily on Gosset’s call to address ‘the problem of what sort of modification of my tables for the analysis of variance would be required to adapt that process to non-normal distributions’, which he described as a hopeless task. This is an example of Fisher’s genius when cornered by an insightful argument. He sidestepped the issue of ‘robustness’ to departures from Normality by broadening it – alluding to other possible departures from the ID assumption – and rendering it a hopeless task, by focusing on the call to ‘modify’ the statistical tables for all possible non-Normal distributions; there is an infinity of potential modifications!

Egon Pearson recognized the importance of stating explicitly the inductive premises upon which the inference results are based, and pressed ahead with exploring the robustness issue using several non-Normal distributions within the Pearson family. His probing was based primarily on simulation, relying on tables of pseudo-random numbers; see Pearson and Adyanthaya (1928, 1929), Pearson (1929b, 1931). His broad conclusions were that the t-test:

τ0(X) = |[√n(Xbar - μ0)/s]|, C1:={x: τ0(x) > cα},    (4)

for testing the hypotheses:

H0: μ = μ0 vs. H1: μ ≠ μ0,                                             (5)

is relatively robust to certain departures from Normality, especially when the underlying distribution is symmetric, but the ANOVA test is rather sensitive to such departures! He continued this line of research into his 80s; see Pearson and Please (1975).
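In the spirit of those simulation studies (though using a modern random number generator in place of tables of pseudo-random numbers), here is a minimal sketch estimating the actual type I error of the two-sided t-test (4) when the parent distribution is skewed; the sample size, parent distribution, and replication count are arbitrary choices for illustration.

# Estimate the actual type I error of the nominal 5% two-sided t-test (4)
# when the parent distribution is skewed (log-Normal), with H0 true.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, alpha, reps = 20, 0.05, 20000
mu0 = np.exp(0.5)          # true mean of a standard log-Normal, so H0 holds

rejections = 0
for _ in range(reps):
    x = rng.lognormal(mean=0.0, sigma=1.0, size=n)
    rejections += stats.ttest_1samp(x, popmean=mu0).pvalue < alpha

print("actual type I error:", rejections / reps, " nominal:", alpha)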

Perhaps more importantly, Pearson (1930) proposed a test for the Normality assumption based on the skewness and kurtosis coefficients: a Mis-Specification (M-S) test. Ironically, Fisher (1929) provided the sampling distributions of the sample skewness and kurtosis statistics upon which Pearson’s test was based. Pearson continued sharpening his original M-S test for Normality, and his efforts culminated with the D’Agostino and Pearson (1973) test that is widely used today; see also Pearson et al. (1977). The crucial importance of testing Normality stems from the fact that it renders the ‘robustness/sensitivity’ problem manageable. The test results can be used to narrow down the possible departures one needs to worry about. They can also be used to suggest ways to respecify the original model.
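SciPy’s stats.normaltest implements a version of this skewness-and-kurtosis (D’Agostino-Pearson) test; a quick sketch on simulated data:

# M-S test for Normality based on sample skewness and kurtosis.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
x_normal = rng.normal(size=200)        # Normal data
x_skewed = rng.exponential(size=200)   # clearly non-Normal data

print(stats.normaltest(x_normal))   # large p-value expected: no detected departure
print(stats.normaltest(x_skewed))   # small p-value expected: Normality rejected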

After Pearson’s early publications on the ‘robustness/sensitivity’ problem Gosset realized that simulation alone was not effective enough to address the question of robustness, and called upon Fisher, who initially rejected Gosset’s call by saying ‘it was none of his business’, to derive analytically the implications of non-Normality using different distributions:

“How much does it [non-Normality] matter? And in fact that is your business: none of the rest of us have the slightest chance of solving the problem: we can play about with samples [i.e. perform simulation studies], I am not belittling E. S. Pearson’s work, but it is up to you to get us a proper solution.” (Lehmann, 1999).

In this passage one can discern the high esteem with which Gosset held Fisher for his technical ability. Fisher’s reply was rather blunt:

“I do not think what you are doing with nonnormal distributions is at all my business, and I doubt if it is the right approach. … Where I differ from you, I suppose, is in regarding normality as only a part of the difficulty of getting data; viewed in this collection of difficulties I think you will see that it is one of the least important.”

It’s clear from this that Fisher understood the problem of how to handle departures from Normality more broadly than his contemporaries. His answer alludes to two issues that were not well understood at the time:

(a) departures from the other two probabilistic assumptions (IID) have much more serious consequences for Normal-based inference than Normality, and

(b) deriving the consequences of particular forms of non-Normality on the reliability of Normal-based inference, and proclaiming a procedure enjoys a certain level of ‘generic’ robustness, does not provide a complete answer to the problem of dealing with departures from the inductive premises.

In relation to (a) it is important to note that the role of ‘randomness’, as it relates to the IID assumptions, was not well understood until the 1940s, when the notion of non-IID was framed in terms of explicit forms of heterogeneity and dependence pertaining to stochastic processes. Hence, the problem of assessing departures from IID was largely ignored at the time, focusing almost exclusively on departures from Normality. Indeed, the early literature on nonparametric inference retained the IID assumptions and focused on inference procedures that replace the Normality assumption with indirect distributional assumptions pertaining to the ‘true’ but unknown f(x), like the existence of certain moments, its symmetry, smoothness, continuity and/or differentiability, unimodality, etc. ; see Lehmann (1975). It is interesting to note that Egon Pearson did not consider the question of testing the IID assumptions until his 1963 paper.

In relation to (b), when one poses the question ‘how robust to non-Normality is the reliability of inference based on a t-test?’ one ignores the fact that the t-test might no longer be the ‘optimal’ test under a non-Normal distribution. This is because the sampling distribution of the test statistic and the associated type I and II error probabilities depend crucially on the validity of the statistical model assumptions. When any of these assumptions are invalid, the relevant error probabilities are no longer the ones derived under the original model assumptions, and the optimality of the original test is called into question. For instance, assuming that the ‘true’ distribution is uniform (Gosset’s rectangular):

Xk ∽ U(a-μ,a+μ),   k=1,2,…,n,…        (6)

where f(x;a,μ)=(1/(2μ)), (a-μ) ≤ x ≤ (a+μ), μ > 0,

how does one assess the robustness of the t-test? One might invoke its generic robustness to symmetric non-Normal distributions and proceed as if the t-test is ‘fine’ for testing the hypotheses (5). A more well-grounded answer will be to assess the discrepancy between the nominal (assumed) error probabilities of the t-test based on (1) and the actual ones based on (6). If the latter approximate the former ‘closely enough’, one can justify the generic robustness. These answers, however, raise the broader question of what are the relevant error probabilities? After all, the optimal test for the hypotheses (5) in the context of (6), is no longer the t-test, but the test defined by:

w(X)=|{(n-1)([X[1] +X[n]]-μ0)}/{[X[1]-X[n]]}|∽F(2,2(n-1)),   (7)

with a rejection region C1:={x: w(x) > cα}, where (X[1], X[n]) denote the smallest and the largest element in the ordered sample (X[1], X[2],…, X[n]), and F(2,2(n-1)) the F distribution with 2 and 2(n-1) degrees of freedom; see Neyman and Pearson (1928). One can argue that the relevant comparison error probabilities are no longer the ones associated with the t-test ‘corrected’ to account for the assumed departure, but those associated with the test in (7). For instance, let the t-test have nominal and actual significance levels of .05 and .045, and power at μ1 = μ0 + 1 of .4 and .37, respectively. The conventional wisdom will call the t-test robust, but is it reliable (effective) when compared with the test in (7), whose significance level and power (at μ1) are, say, .03 and .9, respectively?
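A minimal sketch of the first part of this comparison, estimating the actual (as opposed to nominal) size and power of the t-test when the data are in fact uniform as in (6); the sample size and discrepancy are arbitrary illustrations, and the sketch does not implement the competing test in (7).

# Nominal vs. actual error probabilities of the t-test under a uniform parent (6).
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n, alpha, reps = 10, 0.05, 20000
a, half_width = 0.0, 1.0              # uniform on (a - mu, a + mu) with mu = 1

def rejection_rate(shift):
    """Two-sided t-test of H0: mean = 0 when the true mean is `shift`."""
    hits = 0
    for _ in range(reps):
        x = rng.uniform(a - half_width, a + half_width, size=n) + shift
        hits += stats.ttest_1samp(x, popmean=0.0).pvalue < alpha
    return hits / reps

print("actual size :", rejection_rate(0.0))   # compare with nominal 0.05
print("actual power:", rejection_rate(0.5))   # at an arbitrary discrepancy of 0.5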

A strong case can be made that a more complete approach to the statistical misspecification problem is:

(i) to probe thoroughly for any departures from all the model assumptions using trenchant M-S tests, and if any departures are detected,

(ii) proceed to respecify the statistical model by choosing a more appropriate model with a view to account for the statistical information that the original model did not.

Admittedly, this is a more demanding way to deal with departures from the underlying assumptions, but it addresses the concerns of Gosset, Egon Pearson, Neyman and Fisher much more effectively than the invocation of vague robustness claims; see Spanos (2010).

References

Bartlett, M. S. (1981) “Egon Sharpe Pearson, 11 August 1895-12 June 1980,” Biographical Memoirs of Fellows of the Royal Society, 27: 425-443.

D’Agostino, R. and E. S. Pearson (1973) “Tests for Departure from Normality. Empirical Results for the Distributions of b₂ and √(b₁),” Biometrika, 60: 613-622.

Fisher, R. A. (1915) “Frequency distribution of the values of the correlation coefficient in samples from an indefinitely large population,” Biometrika, 10: 507-521.

Fisher, R. A. (1921) “On the “probable error” of a coefficient of correlation deduced from a small sample,” Metron, 1: 3-32.

Fisher, R. A. (1922a) “On the mathematical foundations of theoretical statistics,” Philosophical Transactions of the Royal Society A, 222, 309-368.

Fisher, R. A. (1922b) “The goodness of fit of regression formulae, and the distribution of regression coefficients,” Journal of the Royal Statistical Society, 85: 597-612.

Fisher, R. A. (1925) Statistical Methods for Research Workers, Oliver and Boyd, Edinburgh.

Fisher, R. A. (1929), “Moments and Product Moments of Sampling Distributions,” Proceedings of the London Mathematical Society, Series 2, 30: 199-238.

Neyman, J. and E. S. Pearson (1928) “On the use and interpretation of certain test criteria for purposes of statistical inference: Part I,” Biometrika, 20A: 175-240.

Neyman, J. and E. S. Pearson (1933) “On the problem of the most efficient tests of statistical hypotheses”, Philosophical Transanctions of the Royal Society, A, 231: 289-337.

Lehmann, E. L. (1975) Nonparametrics: statistical methods based on ranks, Holden-Day, San Francisco.

Lehmann, E. L. (1999) “‘Student’ and Small-Sample Theory,” Statistical Science, 14: 418-426.

Pearson, E. S. (1929a) “Review of ‘Statistical Methods for Research Workers,’ 1928, by Dr. R. A. Fisher”, Nature, June 8th, pp. 866-7.

Pearson, E. S. (1929b) “Some notes on sampling tests with two variables,” Biometrika, 21: 337-60.

Pearson, E. S. (1930) “A further development of tests for normality,” Biometrika, 22: 239-49.

Pearson, E. S. (1931) “The analysis of variance in cases of non-normal variation,” Biometrika, 23: 114-33.

Pearson, E. S. (1963) “Comparison of tests for randomness of points on a line,” Biometrika, 50: 315-25.

Pearson, E. S. and N. K. Adyanthaya (1928) “The distribution of frequency constants in small samples from symmetrical populations,” Biometrika, 20: 356-60.

Pearson, E. S. and N. K. Adyanthaya (1929) “The distribution of frequency constants in small samples from non-normal symmetrical and skew populations,” Biometrika, 21: 259-86.

Pearson, E. S. and N. W. Please (1975) “Relations between the shape of the population distribution and the robustness of four simple test statistics,” Biometrika, 62: 223-241.

Pearson, E. S., R. B. D’Agostino and K. O. Bowman (1977) “Tests for departure from normality: comparisons of powers,” Biometrika, 64: 231-246.

Spanos, A. (2010) “Akaike-type Criteria and the Reliability of Inference: Model Selection vs. Statistical Model Specification,” Journal of Econometrics, 158: 204-220.

Student (1908), “The Probable Error of the Mean,” Biometrika, 6: 1-25.

Categories: Egon Pearson, Statistics | Leave a comment

Behavioral vs Evidential Interpretations of N-P tests: E.S. Pearson’s Statistical Philosophy: Belated Birthday Wish

E.S. Pearson

This is a belated birthday post for E.S. Pearson (11 August 1895-12 June, 1980)–one of my statistical heroes. It’s basically a post from 2012 which concerns an issue of interpretation (long-run performance vs probativeness) that’s badly confused these days. Yes, I know I’ve been neglecting this blog as of late, because I’m busy planning our workshop: The Statistics Wars and Their Casualties (22-23 September, online). See phil-stat-wars.com. I will reblog some favorite Pearson posts in the next few days.

HAPPY BELATED BIRTHDAY EGON!

Are methods based on error probabilities of use mainly to supply procedures which will not err too frequently in some long run? (performance). Or is it the other way round: that the control of long run error properties is of crucial importance for probing the causes of the data at hand? (probativeness). I say no to the former and yes to the latter. This, I think, was also the view of Egon Sharpe (E.S.) Pearson.

Cases of Type A and Type B

“How far then, can one go in giving precision to a philosophy of statistical inference?” (Pearson 1947, 172)

Pearson considers the rationale that might be given to N-P tests in two types of cases, A and B:

“(A) At one extreme we have the case where repeated decisions must be made on results obtained from some routine procedure…

(B) At the other is the situation where statistical tools are applied to an isolated investigation of considerable importance…?” (ibid., 170)

In cases of type A, long-run results are clearly of interest, while in cases of type B, repetition is impossible and may be irrelevant:

“In other and, no doubt, more numerous cases there is no repetition of the same type of trial or experiment, but all the same we can and many of us do use the same test rules to guide our decision, following the analysis of an isolated set of numerical data. Why do we do this? What are the springs of decision? Is it because the formulation of the case in terms of hypothetical repetition helps to that clarity of view needed for sound judgment?

Or is it because we are content that the application of a rule, now in this investigation, now in that, should result in a long-run frequency of errors in judgment which we control at a low figure?” (Ibid., 173)

Although Pearson leaves this tantalizing question unanswered, claiming, “On this I should not care to dogmatize”, in studying how Pearson treats cases of type B, it is evident that in his view, “the formulation of the case in terms of hypothetical repetition helps to that clarity of view needed for sound judgment” in learning about the particular case at hand.

“Whereas when tackling problem A it is easy to convince the practical man of the value of a probability construct related to frequency of occurrence, in problem B the argument that ‘if we were to repeatedly do so and so, such and such result would follow in the long run’ is at once met by the commonsense answer that we never should carry out a precisely similar trial again.

Nevertheless, it is clear that the scientist with a knowledge of statistical method behind him can make his contribution to a round-table discussion…” (Ibid., 171).

Pearson gives the following example of a case of type B (from his wartime work), where he claims no repetition is intended:

“Example of type B. Two types of heavy armour-piercing naval shell of the same caliber are under consideration; they may be of different design or made by different firms…. Twelve shells of one kind and eight of the other have been fired; two of the former and five of the latter failed to perforate the plate….”(Pearson 1947, 171) 

“Starting from the basis that, individual shells will never be identical in armour-piercing qualities, however good the control of production, he has to consider how much of the difference between (i) two failures out of twelve and (ii) five failures out of eight is likely to be due to this inevitable variability. …” (Ibid.)

We’re interested in considering what other outcomes could have occurred, and how readily, in order to learn what variability alone is capable of producing. As a noteworthy aside, Pearson shows that treating the observed difference (between the two proportions) in one way yields an observed significance level of 0.052; treating it differently (along Barnard’s lines), he gets 0.025 as the (upper) significance level. But in scientific cases, Pearson insists, the difference in error probabilities makes no real difference to substantive judgments in interpreting the results. Only in an unthinking, automatic, routine use of tests would it matter:
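One way to attach a significance level to such a 2×2 comparison today is Fisher’s exact test; the sketch below is offered only as an illustration and is not necessarily either of the two analyses (0.052 and 0.025) that Pearson compares.

# Pearson's shell example as a 2x2 table: failures vs. perforations.
from scipy.stats import fisher_exact

table = [[2, 10],   # shell type 1: 2 failures out of 12
         [5, 3]]    # shell type 2: 5 failures out of 8
odds_ratio, p_value = fisher_exact(table, alternative="less")
print(odds_ratio, p_value)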

“Were the action taken to be decided automatically by the side of the 5% level on which the observation point fell, it is clear that the method of analysis used would here be of vital importance. But no responsible statistician, faced with an investigation of this character, would follow an automatic probability rule.” (ibid., 192)

The two analyses correspond to the tests effectively asking different questions, and if we recognize this, says Pearson, different meanings may be appropriately attached.

Three Steps in the Original Construction of Tests

After setting up the test (or null) hypothesis, and the alternative hypotheses against which “we wish the test to have maximum discriminating power” (Pearson 1947, 173), Pearson defines three steps in specifying tests:

“Step 1. We must specify the experimental probability set, the set of results which could follow on repeated application of the random process used in the collection of the data…

Step 2. We then divide this set [of possible results] by a system of ordered boundaries…such that as we pass across one boundary and proceed to the next, we come to a class of results which makes us more and more inclined on the Information  available, to reject the hypothesis tested in favour of alternatives which differ from it by increasing amounts”.

“Step 3. We then, if possible[i], associate with each contour level the chance that, if [the null] is true, a result will occur in random sampling lying beyond that level” (ibid.).

Pearson warns that:

“Although the mathematical procedure may put Step 3 before 2, we cannot put this into operation before we have decided, under Step 2, on the guiding principle to be used in choosing the contour system. That is why I have numbered the steps in this order.” (Ibid. 173).

Strict behavioristic formulations jump from step 1 to step 3, after which one may calculate how the test has in effect accomplished step 2.  However, the resulting test, while having adequate error probabilities, may have an inadequate distance measure and may even be irrelevant to the hypothesis of interest. This is one reason critics can construct howlers that appear to be licensed by N-P methods, and which make their way from time to time into this blog.

So step 3 remains crucial, even for cases of type [B]. There are two reasons: pre-data planning—that’s familiar enough—and, second, post-data scrutiny. Post data, step 3 enables determining the capability of the test to have detected various discrepancies, departures, and errors, on which a critical scrutiny of the inferences is based. More specifically, the error probabilities are used to determine how well/poorly corroborated, or how severely tested, various claims are, post-data.
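A rough numerical sketch of this post-data use of Step 3, using a one-sided Normal test of H0: mu <= 0 with hypothetical numbers (an illustration, not Pearson’s own calculation):

# How capable was the test of detecting a discrepancy gamma from the null?
from scipy.stats import norm

sigma, n, alpha = 1.0, 25, 0.05
se = sigma / n**0.5
cutoff = norm.ppf(1 - alpha) * se       # reject H0: mu <= 0 when xbar > cutoff

def detection_capability(gamma):
    """P(xbar > cutoff | mu = gamma)."""
    return 1 - norm.cdf((cutoff - gamma) / se)

for gamma in (0.1, 0.3, 0.5):
    print(gamma, round(detection_capability(gamma), 3))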

If we can readily bring about statistically significantly higher rates of success with the first type of armour-piercing naval shell than with the second (in the above example), we have evidence the first is superior. Or, as Pearson modestly puts it: the results “raise considerable doubts as to whether the performance of the [second] type of shell was as good as that of the [first]….” (Ibid., 192)[ii]

Still, while error rates of procedures may be used to determine how severely claims have/have not passed, they do not automatically do so—hence, again, opening the door to potential howlers that neither Egon nor Jerzy, for that matter, would have countenanced.

Neyman Was the More Behavioristic of the Two

Pearson was (rightly) considered to have rejected the more behaviorist leanings of Neyman.

Here’s a snippet from an unpublished letter he wrote to Birnbaum (1974) about the idea that the N-P theory admits of two interpretations: behavioral and evidential:

“I think you will pick up here and there in my own papers signs of evidentiality, and you can say now that we or I should have stated clearly the difference between the behavioral and evidential interpretations. Certainly we have suffered since in the way the people have concentrated (to an absurd extent often) on behavioral interpretations”.

In Pearson’s (1955) response to Fisher (blogged here):

“To dispel the picture of the Russian technological bogey, I might recall how certain early ideas came into my head as I sat on a gate overlooking an experimental blackcurrant plot….!” (Pearson 1955, 204)

“To the best of my ability I was searching for a way of expressing in mathematical terms what appeared to me to be the requirements of the scientist in applying statistical tests to his data. After contact was made with Neyman in 1926, the development of a joint mathematical theory proceeded much more surely; it was not till after the main lines of this theory had taken shape with its necessary formalization in terms of critical regions, the class of admissible hypotheses, the two sources of error, the power function, etc., that the fact that there was a remarkable parallelism of ideas in the field of acceptance sampling became apparent. Abraham Wald’s contributions to decision theory of ten to fifteen years later were perhaps strongly influenced by acceptance sampling problems, but that is another story.“ (ibid., 204-5).

“It may be readily agreed that in the first Neyman and Pearson paper of 1928, more space might have been given to discussing how the scientific worker’s attitude of mind could be related to the formal structure of the mathematical probability theory….Nevertheless it should be clear from the first paragraph of this paper that we were not speaking of the final acceptance or rejection of a scientific hypothesis on the basis of statistical analysis…. Indeed, from the start we shared Professor Fisher’s view that in scientific enquiry, a statistical test is ‘a means of learning”… (Ibid., 206)

“Professor Fisher’s final criticism concerns the use of the term ‘inductive behavior’; this is Professor Neyman’s field rather than mine.” (Ibid., 207)

These points on Pearson are discussed in more depth in my book Statistical Inference as Severe Testing (SIST): How to Get Beyond the Statistics Wars (CUP 2018). You can read and download the entire book for free during the month of August 2022 at the following link:

https://www.cambridge.org/core/books/statistical-inference-as-severe-testing/D9DF409EF568090F3F60407FF2B973B2

 

References:

Pearson, E. S. (1947), “The Choice of Statistical Tests Illustrated on the Interpretation of Data Classed in a 2×2 Table,” Biometrika 34(1/2): 139-167.

Pearson, E. S. (1955), “Statistical Concepts in Their Relation to Reality,” Journal of the Royal Statistical Society, Series B (Methodological), 17(2): 204-207.

Neyman, J. and Pearson, E. S. (1928), “On the Use and Interpretation of Certain Test Criteria for Purposes of Statistical Inference, Part I,” Biometrika 20(A): 175-240.


[i] In some cases only an upper limit to this error probability may be found.

[ii] Pearson inadvertently switches from number of failures to number of successes in the conclusion of this paper.

Categories: E.S. Pearson, Error Statistics | Leave a comment
