The reading from this session is from Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (Mayo, CUP, 2018)
D. Mayo
Tour I Ingenious and Severe Tests
[T]he impressive thing about [the 1919 tests of Einstein’s theory of gravity] is the risk involved in a prediction of this kind. If observation shows that the predicted effect is definitely absent, then the theory is simply refuted. The theory is incompatible with certain possible results of observation – in fact with results which everybody before Einstein would have expected. This is quite different from the situation I have previously described, [where] . . . it was practically impossible to describe any human behavior that might not be claimed to be a verification of these [psychological] theories. (Popper 1962, p. 36)
The 1919 eclipse experiments opened Popper’s eyes to what made Einstein’s theory so different from other revolutionary theories of the day: Einstein was prepared to subject his theory to risky tests.[1] Einstein was eager to galvanize scientists to test his theory of gravity, knowing the solar eclipse was coming up on May 29, 1919. Leading the expedition to test GTR was a perfect opportunity for Sir Arthur Eddington, a devout follower of Einstein as well as a devout Quaker and conscientious objector. Fearing “a scandal if one of its young stars went to jail as a conscientious objector,” officials at Cambridge argued that Eddington couldn’t very well be allowed to go off to war when the country needed him to prepare the journey to test Einstein’s predicted light deflection (Kaku 2005, p. 113).
The museum ramps up from Popper through a gallery on “Data Analysis in the 1919 Eclipse” (Section 3.1), which then leads to the main gallery on the origins of statistical tests (Section 3.2). Here’s our Museum Guide:
According to Einstein’s theory of gravitation, to an observer on earth, light passing near the sun is deflected by an angle, λ, reaching its maximum of 1.75″ for light just grazing the sun, but the light deflection would be undetectable on earth with the instruments available in 1919. Although the light deflection of stars near the sun (approximately 1 second of arc) would be detectable, the sun’s glare renders such stars invisible, save during a total eclipse, which “by strange good fortune” would occur on May 29, 1919 (Eddington [1920] 1987, p. 113).
There were three hypotheses between which “it was especially desired to discriminate” (Dyson et al. 1920, p. 291). Each is a statement about a parameter, the deflection of light at the limb of the sun (in arc seconds): λ = 0″ (no deflection), λ = 0.87″ (Newton), λ = 1.75″ (Einstein). The Newtonian predicted deflection stems from assuming light has mass and follows Newton’s Law of Gravity. The difference in statistical prediction masks the deep theoretical differences in how each explains gravitational phenomena. Newtonian gravitation describes a force of attraction between two bodies, while for Einstein gravitational effects are actually the result of the curvature of spacetime: a gravitating body like the sun distorts its surrounding spacetime, and other bodies react to those distortions.
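As a back-of-the-envelope aside, one way to see how eclipse data discriminate among the three values of λ is to count how many standard errors each hypothesized deflection sits from an estimate. The figures below are the commonly reported 1919 estimates (Sobral 1.98″ ± 0.12″; Principe 1.61″ ± 0.30″), supplied here as assumptions rather than taken from this excerpt:

```python
# Sketch: distance of each hypothesized deflection from the eclipse
# estimates, in standard-error units. The estimates are the commonly
# reported 1919 values (an assumption here, not from this text).
estimates = {"Sobral": (1.98, 0.12), "Principe": (1.61, 0.30)}
hypotheses = {"no deflection": 0.0, "Newton": 0.87, "Einstein": 1.75}

for site, (xbar, se) in estimates.items():
    for name, lam in hypotheses.items():
        z = abs(xbar - lam) / se
        print(f"{site}: |{xbar} - {lam}| / {se} = {z:.1f} SEs ({name})")
```

On these figures, λ = 0″ and the Newtonian 0.87″ fare very poorly against the more precise Sobral estimate, while Einstein’s 1.75″ lies within about two standard errors of both estimates.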
Where Are Some of the Members of Our Statistical Cast of Characters in 1919? In 1919, Fisher had just accepted a job as a statistician at Rothamsted Experimental Station. He preferred this temporary slot to a more secure offer from Karl Pearson (KP), which had so many strings attached – requiring KP to approve everything Fisher taught or published – that Joan Fisher Box writes: after years during which Fisher “had been rather consistently snubbed” by KP, “It seemed that the lover was at last to be admitted to his lady’s court – on conditions that he first submit to castration” (J. Box 1978, p. 61). Fisher had already challenged the old guard. KP, after working on the problem for over 20 years, had only approximated “the first two moments of the sample correlation coefficient; Fisher derived the relevant distribution, not just the first two moments” in 1915 (Spanos 2013a). Unable to fight in WWI due to poor eyesight, Fisher felt that becoming a subsistence farmer during the war, making food coupons unnecessary, was the best way for him to exercise his patriotic duty.
In 1919, Neyman is living a hardscrabble life in a land alternately part of Russia or Poland, while the civil war between Reds and Whites is raging. “It was in the course of selling matches for food” (C. Reid 1998, p. 31) that Neyman was first imprisoned (for a few days) in 1919. Describing life amongst “roaming bands of anarchists, epidemics” (ibid., p. 32), Neyman tells us “existence” was the primary concern (ibid., p. 31). With little academic work in statistics, and “since no one in Poland was able to gauge the importance of his statistical work (he was ‘sui generis,’ as he later described himself)” (Lehmann 1994, p. 398), Polish authorities sent him to University College London in 1925/1926 to get the great Karl Pearson’s assessment. Neyman and E. Pearson begin work together in 1926. Egon Pearson, son of Karl, gets his B.A. in 1919, and begins studies at Cambridge the next year, including a course by Eddington on the theory of errors. Egon is shy and intimidated, reticent and diffident, living in the shadow of his eminent father, whom he gradually starts to question after Fisher’s criticisms. He describes the psychological crisis he’s going through at the time Neyman arrives in London: “I was torn between conflicting emotions: a. finding it difficult to understand R.A.F., b. hating [Fisher] for his attacks on my paternal ‘god,’ c. realizing that in some things at least he was right” (C. Reid 1998, p. 56). As for appearances amongst the statistical cast: there are the two Pearsons: tall, Edwardian, genteel; there’s hardscrabble Neyman with his strong Polish accent and small, toothbrush mustache; and Fisher: short, bearded, very thick glasses, pipe, and eight children. Let’s go back to 1919, which saw Albert Einstein go from being a little-known German scientist to becoming an international celebrity.
…To read further, see Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (CUP, 2018)
Excursion 3: Statistical Tests and Scientific Inference
Tour I Ingenious and Severe Tests 119
3.1 Statistical Inference and Sexy Science: The 1919 Eclipse Test 121
3.2 N-P Tests: An Episode in Anglo-Polish Collaboration 131
3.3 How to Do All N-P Tests Do (and More) While a Member of the Fisherian Tribe 146
(we only covered portions of this)
Today is R.A. Fisher’s birthday! I am reblogging what I call the “Triad” – an exchange between Fisher, Neyman and Pearson (N-P) published 20 years after the Fisher-Neyman breakup. My seminar on PhilStat is studying these this week, so it’s timely. While my favorite is still the reply by E.S. Pearson, which alone should have shattered Fisher’s allegations that N-P “reinterpret” tests of significance as “some kind of acceptance procedure”, all three are chock full of gems for different reasons. They are short and worth rereading. Neyman’s article pulls back the cover on what is really behind Fisher’s over-the-top polemics, what with Russian 5-year plans and commercialism in the U.S. Not only is Fisher jealous that N-P tests came to overshadow “his” tests, he is furious at Neyman for driving home the fact that Fisher’s fiducial approach had been shown to be inconsistent (by others). The flaw is illustrated by Neyman in his portion of the Triad. I discuss this briefly in my Philosophy of Science Association paper from a few months ago (slides are here*). Further details may be found in my book, SIST (2018), especially pp. 388–392, linked to here. It speaks to a common fallacy seen every day in interpreting confidence intervals. As for Neyman’s “behaviorism”, Pearson’s last sentence is revealing.
HAPPY BIRTHDAY R.A. FISHER!
*Slides from Glymour and J. Berger’s presentations are also there.
“Statistical Methods and Scientific Induction”
by Sir Ronald Fisher (1955)
SUMMARY
The attempt to reinterpret the common tests of significance used in scientific research as though they constituted some kind of acceptance procedure and led to “decisions” in Wald’s sense, originated in several misapprehensions and has led, apparently, to several more.
The three phrases examined here, with a view to elucidating the fallacies they embody, are:
Mathematicians without personal contact with the Natural Sciences have often been misled by such phrases. The errors to which they lead are not only numerical.
Use this link to continue reading Fisher’s paper.
“Note on an Article by Sir Ronald Fisher”
by Jerzy Neyman (1956)
Summary
(1) FISHER’S allegation that, contrary to some passages in the introduction and on the cover of the book by Wald, this book does not really deal with experimental design is unfounded. In actual fact, the book is permeated with problems of experimentation. (2) Without consideration of hypotheses alternative to the one under test and without the study of probabilities of the two kinds, no purely probabilistic theory of tests is possible.
(3) The conceptual fallacy of the notion of fiducial distribution rests upon the lack of recognition that valid probability statements about random variables usually cease to be valid if the random variables are replaced by their particular values. The notorious multitude of “paradoxes” of fiducial theory is a consequence of this oversight. (4) The idea of a “cost function for faulty judgments” appears to be due to Laplace, followed by Gauss.
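Neyman’s point (3) underlies a fallacy still common in reading confidence intervals. A minimal simulation sketch (illustrative numbers only, not from this exchange) shows where the 95% legitimately lives: in the interval-generating procedure, before the random variables are replaced by their observed values:

```python
# Sketch (illustrative, not from the text): the 95% probability attaches
# to the interval-generating procedure, not to any one realized interval.
import random
import statistics

random.seed(1)
theta, n, reps = 10.0, 25, 4000
z = 1.96  # approx 97.5th percentile of the standard normal
covered = 0
for _ in range(reps):
    xs = [random.gauss(theta, 2.0) for _ in range(n)]
    m = statistics.fmean(xs)
    half = z * 2.0 / n ** 0.5   # sigma = 2 assumed known, for simplicity
    covered += (m - half <= theta <= m + half)

print(f"coverage over repeated samples: {covered / reps:.3f}")
# But any single realized interval either contains theta or it does not;
# replacing the random endpoints by their observed values is exactly the
# step Neyman flags as invalid.
```

The printed coverage hovers near 0.95; no comparable probability statement attaches to one realized interval.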
“Statistical Concepts in Their Relation to Reality”
by E.S. Pearson (1955)
Controversies in the field of mathematical statistics seem largely to have arisen because statisticians have been unable to agree upon how theory is to provide, in terms of probability statements, the numerical measures most helpful to those who have to draw conclusions from observational data. We are concerned here with the ways in which mathematical theory may be put, as it were, into gear with the common processes of rational thought, and there seems no reason to suppose that there is one best way in which this can be done. If, therefore, Sir Ronald Fisher recapitulates and enlarges on his views upon statistical methods and scientific induction we can all only be grateful, but when he takes this opportunity to criticize the work of others through misapprehension of their views as he has done in his recent contribution to this Journal (Fisher 1955, “Statistical Methods and Scientific Induction”), it is impossible to leave him altogether unanswered.
In the first place it seems unfortunate that much of Fisher’s criticism of Neyman and Pearson’s approach to the testing of statistical hypotheses should be built upon a “penetrating observation” ascribed to Professor G.A. Barnard, the assumption involved in which happens to be historically incorrect. There was no question of a difference in point of view having “originated” when Neyman “reinterpreted” Fisher’s early work on tests of significance “in terms of that technological and commercial apparatus which is known as an acceptance procedure”. There was no sudden descent upon British soil of Russian ideas regarding the function of science in relation to technology and to five-year plans. It was really much simpler – or worse. The original heresy, as we shall see, was a Pearson one!…
Use this link to continue reading, “Statistical Concepts in Their Relation to Reality”.
My Phil Stat seminar has been meeting for 4 weeks now, and we’re soon to experiment with a small group of outside participants zooming in (write to us if you are interested in joining us). I’ve been so busy with the seminar that I haven’t blogged. Have you been following? All the materials are on a continually updated syllabus on this blog (SYLLABUS). We’re up to Excursion 2, Tour II.
Last week, we did something unusual: we read from Popper’s Conjectures and Refutations. I wanted to do this because scientists often appeal to distorted and unsophisticated accounts of Popper, especially in discussing falsification and what demarcates good science from poor science. While I don’t think Popper made good on his most winning slogans, he gives us many seminal launching-off points for improved accounts of falsification, induction, corroboration, and demarcation.
Do people still assume EI is “rational”? Good science can’t be demarcated from poor, questionable, fringe science and the like by its empirical method, if that method is understood as enumerative induction (EI), says Popper, rightly. While it comes in many forms, EI (taken up in Ex 2 Tour I) moves from observed instances (or frequencies) of A’s that are B’s to claims like: the next A will be a B, or most A’s are B’s, or k% of A’s are B’s, or even that the probability an A is a B is k. Such a method is unreliable, so we shouldn’t be keen to justify it. It permits inferring poorly probed claims and violates the minimal requirement for evidence (weak severity).
Yet we are familiar with claims from epistemologists and others that some version of EI is a “rational” method. (It is the basis for famous quandaries in legal reasoning.) An additional stipulation is generally something like “nothing else is known” (which itself is knowing something else), but even that does not help. Neither do claims about indifference or uninformativeness. The philosopher Carnap called EI “the straight rule” and tried for many years to justify it – unsuccessfully. Lack of randomness, and biasing selection effects in generating the data and in the choice of reference classes, are key issues. Although the data in EI may be seen as relative frequencies, it is very different from frequentist statistics (see SIST, pp. 110–11, on Neyman (1955): “Statistics as the Frequentist Theory of Induction”).
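To see why the straight rule cannot be justified without attending to how the data were generated, here is a small illustrative simulation (all numbers hypothetical): in the population exactly 50% of A’s are B’s, but B-instances are over-reported, and EI dutifully infers the wrong frequency:

```python
# Sketch (assumption-laden illustration): the "straight rule" applied to
# data subject to a biasing selection effect. In the population, 50% of
# A's are B's, but B-cases are three times as likely to be recorded.
import random

random.seed(0)
population = [True] * 5000 + [False] * 5000   # True = "this A is a B"

def biased_sample(pop, k):
    out = []
    while len(out) < k:
        x = random.choice(pop)
        # B-instances are always recorded; non-B only a third of the time
        # (think of publication bias or selective reporting):
        if x or random.random() < 1 / 3:
            out.append(x)
    return out

sample = biased_sample(population, 1000)
straight_rule_estimate = sum(sample) / len(sample)
print(f"EI infers about {straight_rule_estimate:.0%} of A's are B's (truth: 50%)")
```

The recorded relative frequency converges, but to roughly 75% rather than the true 50%: the straight rule is only as good as the sampling behind it.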
Popper also rejected the empiricist assumption that observations are known relatively unproblematically. If they are at the “foundation,” it is only because there are apt methods for testing their validity. In fact, we dub claims observable because or to the extent that they are open to stringent checks. (Popper: “anyone who has learned the relevant technique can test it” (1959, p. 99).) Accounts of hypothesis appraisal that start with “evidence x,” as in confirmation logics, vastly oversimplify how data enter into learning.
Demarcation and Investigating Bad Science. Popper’s right that if using enumerative induction (EI) makes you scientific, then anyone from an astrologer to one who blithely moves from observed associations to full-blown theories is scientific. Yet Popper’s criterion of testability and falsifiability – as it is typically understood – may be nearly as bad. It is both too strong and too weak. Any crazy theory found false would be scientific, and our most impressive theories are not deductively falsifiable. The only theories that deductively prohibit observations are of the sort one mainly finds in philosophy books: All swans are white is falsified by a single non-white swan. There are some statistical claims and contexts, I argue, where it’s possible to achieve deductive falsification: claims such as, these data are independent and identically distributed (IID). Going beyond a mere denial to reliably replacing them, of course, requires more work.
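As an illustration of putting an IID claim to a stringent check (a statistical sketch with assumed numbers, not an example from the text): for IID data the lag-1 sample autocorrelation should be near zero, with standard error roughly 1/√n, so a dependent series yields a reproducible, many-standard-errors deviation:

```python
# Sketch (illustrative): statistically falsifying an IID assumption.
# Under IID, the lag-1 sample autocorrelation is near 0 with standard
# error about 1/sqrt(n); a value many SEs away is the kind of
# reproducible deviation that falsifies "these data are IID".
import random
import statistics

random.seed(2)
n = 2000
x, xs = 0.0, []
for _ in range(n):
    x = 0.9 * x + random.gauss(0, 1)   # strongly dependent AR(1) series
    xs.append(x)

m = statistics.fmean(xs)
num = sum((xs[i] - m) * (xs[i + 1] - m) for i in range(n - 1))
den = sum((v - m) ** 2 for v in xs)
r1 = num / den
print(f"lag-1 autocorrelation: {r1:.2f}, about {r1 * n ** 0.5:.0f} SEs from 0")
```

An IID series run through the same check gives a value within a couple of standard errors of zero; the AR(1) series above does not, no matter how often the check is repeated.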
However, interesting claims about mechanisms and causal generalizations require numerous assumptions (substantive and statistical) and are rarely open to deductive falsification. Their tests can be reconstructed as deductively valid, but warranting the premises requires evidence-transcending (ampliative) inferences. So there’s a “whiff of induction” even in Popper (as some of his critics claim), even though not of the crude EI sort. (Note Popper’s claim about when a statistical hypothesis is falsified below.)
“The Demise of the Demarcation Problem”. Forty years ago, Larry Laudan’s famous (1983) paper declared the demarcation problem taboo. This is a highly unsatisfactory situation for philosophers of science wishing to grapple with today’s statistical replication crisis. Laudan and I generally see eye to eye, so perhaps our disagreement here is just semantics. I share his view that what really matters is determining if a hypothesis is warranted or not, rather than whether the theory is “scientific,” but surely Popper didn’t mean logical falsifiability sufficed. Popper is clear that many unscientific theories (e.g., Marxism, astrology) are falsifiable. It’s clinging to falsified theories that leads to unscientific practices. It’s trying and trying again in the face of unwelcome results, cherry-picking cases that support preferred hypotheses, and all the rest of the biases that make it easy to find apparent support for poorly probed claims.
Following Laudan, philosophers tend to shy away from saying anything general about science versus pseudoscience – the predominant view is that there is no such thing. One gets the impression that the demarcation task is being left to committees investigating allegations of poor science or fraud. They are forced to articulate what to count as fraud, as bad statistics, or as mere questionable research practices (QRPs). People’s careers depend on their rulings: they have “skin in the game,” as Nassim Nicholas Taleb might say (2018).
Free of the qualms that give philosophers of science cold feet, the committees investigating fraudster Diederik Stapel advanced some obvious, yet crucially important, rules with Popperian echoes:
One of the most fundamental rules of scientific research is that an investigation must be designed in such a way that facts that might refute the research hypotheses are given at least an equal chance of emerging as do facts that confirm the research hypotheses. (Levelt Committee, Noort Committee, and Drenth Committee 2012).
This is the gist of our minimal requirement for evidence (the weak severity principle). To scrutinize the scientific credentials of an inquiry is to determine if there was a serious attempt to detect and report mistaken interpretations of data.
Demarcating Inquiries (4 requirements). However, I say Popper confuses things by making it sound as if he’s asking: When is a theory unscientific? What he is actually asking, or should be asking, is: When is an inquiry into a theory, or an appraisal of claim H, unscientific? We want to distinguish meritorious modes of inquiry from those that are BENT. Despite being logically falsifiable, theories can be rendered immune from falsification by means of cavalier methods for their testing. Some areas have so much noise and/or flexibility that they can’t or won’t distinguish warranted from unwarranted explanations of failed predictions. It does not suffice – for an inquiry to be scientific – that there is criticism of methods and models. The criticism must be constrained by what’s actually responsible for any alleged problems. It may be correct to criticize an inference to a hypothesis H, but for the wrong reason. For instance, the problem might be traced to H’s improbability when in fact the flaw is a lack of error control due to data-dredging, optional stopping, and P-hacking.
A scientific inquiry or test must be able:
(a) to block inferences that fail the minimal requirement for severity
(b) to embark on a reliable probe to pinpoint blame for anomalies
(c) (from (a)) to directly pick up on altered error-probing capacities due to biasing selection effects, optional stopping, cherry-picking, data-dredging, etc.
(d) (from (b)) to test and falsify claims.
So we get four requirements for an inquiry to be scientific.
Methodological probability. A valuable idea to take from Popper is that probability in learning attaches to a method: it is methodological probability. An error probability is a special case of a methodological probability.
Popper wrote to me expressing regret that he didn’t learn more statistics, but he referred to Fisher, Neyman and Pearson, and also Peirce in explaining when a statistical hypothesis is to count as falsified. Although extremely rare events may occur, Popper notes:
such occurrences would not be physical effects, because, on account of their immense improbability, they are not reproducible at will … If, however, we find reproducible deviations from a macro effect … deduced from a probability estimate … then we must assume that the probability estimate is falsified. (Popper 1959, p. 203)
In the same vein, we heard Fisher deny that an “isolated record” of statistically significant results suffices to warrant a reproducible or genuine effect (Fisher 1935a, p. 14). Even where a scientific hypothesis is thought to be deterministic, inaccuracies and knowledge gaps involve error-laden predictions; so our methodological rules typically involve inferring a statistical hypothesis. Popper calls it a falsifying hypothesis: a hypothesis inferred in order to falsify some other claim. A first step is often to infer that an anomaly is real, by falsifying a “due to chance” hypothesis. That is the role of statistical significance tests.
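A minimal sketch of that first step, with hypothetical numbers: simulating the “due to chance” null to ask whether it can account for an observed excess.

```python
# Sketch (hypothetical numbers): inferring a "falsifying hypothesis" --
# that an anomaly is real -- by testing a due-to-chance null. Suppose
# we see 62 successes in 100 Bernoulli trials where chance says p = 0.5.
import random

random.seed(3)
observed, n, p0, reps = 62, 100, 0.5, 20000

# Simulate the null many times and count results at least as extreme:
as_extreme = sum(
    sum(random.random() < p0 for _ in range(n)) >= observed
    for _ in range(reps)
)
pvalue = as_extreme / reps
print(f"simulated P-value: {pvalue:.4f}")  # small: chance poorly explains 62/100
```

A small P-value here would not yet warrant a genuine effect by itself; on the Fisher/Popper line above, it is its reproducibility across tests that licenses inferring the anomaly is real.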
Insofar as we falsify general scientific claims, we are all methodological falsificationists. Some people say, “I know my models are false, so I’m done with the job of falsifying before I even begin.” Really? That’s not falsifying. Let’s look at your method: always infer that H is false, or fails to solve its intended problem. Then you’re bound to infer this even when it is erroneous. (Were H a null hypothesis of “no effect,” you’d always be inferring the effect is genuine.) Your method fails the minimal severity requirement.
Philosophy of Inductive-Statistical Inference
(This is an IN-PERSON class*)
Wed 4:00–6:30 pm, McBryde 223
(Office hours: Tuesdays 3–4; Wednesdays 1:30–2:30)
Syllabus: Second Installment (PDF)
D. Mayo (2018), Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (SIST), CUP (electronic and paper copies provided to those taking the class; proofs are at errorstatistics.com, see below).
Supplemental text: Hacking, I. (2001). An introduction to probability and inductive logic. Cambridge University Press.
Articles from the Captain’s Bibliography (links to new articles will be provided). Other useful information can be found on the SIST Abstracts & Keywords and this post with SIST Excerpts & Mementos.
Schedule: Date – Themes/Readings

1. 1/18 – Introduction to the Course: How to tell what’s true about statistical inference (1/18/23 SLIDES here)
Reading: SIST: Preface; Excursion 1 Tour I, 1.1–1.3, pp. 9–29
MISC: Souvenir A; SIST Abstracts & Keywords for all excursions and tours

2. 1/25 (Q #2) – Error Probing Tools vs Comparative Evidence: Likelihood & Probability; What counts as cheating?; Intro to Logic: arguments, validity & soundness (1/25/23 SLIDES here)
Reading: SIST: Excursion 1 Tour II, 1.4–1.5, pp. 30–55
Session #2 Questions (PDF)
MISC: Notes on Excursion 1; SIST Souvenirs B, C & D; Logic Primer (PDF)

3. 2/1 (Q #3, updated) – Induction and Confirmation: PhilStat & Formal Epistemology; The Traditional Problem of Induction; Is Probability a Good Measure of Confirmation?; Tacking Paradox (2/1/23 SLIDES here)
Reading: SIST: Excursion 2 Tour I, 2.1–2.2, pp. 59–74; Hacking, “The Basic Rules of Probability” handout (PDF)
Updated: Session #3 Questions (PDF)
MISC: Excursion 2 Tour I blurb & notes

4. 2/8 & 5. 2/15 (Assignment 1 due 2/15) – Falsification, Science vs Pseudoscience, Induction; Statistical Crises of Replication in Psychology & other sciences; Popper, severity and novelty, array of problems and models; Fallacies of rejection, Duhem’s problem; solving induction now (2/8/23 SLIDES here)
Reading for 2/8: Popper, Ch. 1 from Conjectures and Refutations up to p. 59 (PDF). This class overlaps with the next, so if you have time read Excursion 2 Tour II (pp. 75–82); Exhibit (vi) (p. 82); and p. 108
Session #4 Questions (PDF)
MISC (2/8): Self-quiz on Popper for fun! (PDF); Cartoon Guide to Statistics (VT Library link is here)
Reading for 2/15: SIST: Excursion 2 Tour II: read sections that interest you from those not covered last week. You can choose the example in 2.6 (or one from your field) or the discussion of solving induction in 2.7. Optional for 2/15: Gelman & Loken (2014) (2/15/23 SLIDES here)
Assignment 1 (due 2/15) (PDF)
MISC (2/15): SIST Souvenirs (E), (F), (G), (H); Excursion 2 Tour II blurb & notes

Fisher Birthday, February 17: Celebration on 2/22

6. 2/22 (Q #6) & 7. 3/1 – Ingenious and Severe Tests: Fisher, Neyman-Pearson, Cox: Concepts of Tests
Session #6 Questions (PDF)
Optional: The pathological Fisher (fiducial) and Neyman (performance) battle: SIST pp. 388–391

8. 3/15 (Assignment 2) – Deeper Concepts (2 parts): Stat in the Higgs discovery; Confidence intervals and their duality with tests
Reading (first part): Excursion 3 Tour III, 3.8 Higgs Discovery (see the ASA 6 principles on P-values: Note 4, p. 216; Live Exhibit (ix), p. 200; Souvenir N, p. 201)
Reading (second part): Excursion 3 Tour III, 3.7, pp. 189–195
Assignment 2 (PDF) due 3/17/23 (3/15/23 revised SLIDES here)
MISC: Excursion 3 Tour III blurb & notes

9. 3/22 – Testing Assumptions of Statistical Models (Guest speaker: Aris Spanos on misspecification testing in statistics)
MISC: Excursion 4 Tour IV blurb & notes

10. 3/29 – Who’s Exaggerating What? Bayes factors and Bayes/Fisher Disagreement; Jeffreys-Lindley Paradox (Guest speaker: Richard Morey on Bayes factors)
MISC: Excursion 4 Tour II blurb & notes

11. 4/5 (Mini-essay) – More on Bayes factors and Bayes/Fisher Disagreement; Jeffreys-Lindley Paradox
Reading: Excursion 4 Tour II and Excursion 6 Tour I, pp. 395–423 (we are spending 2 weeks on these; Excursion 6 Tour I will be post-Zoom)
Optional: Objectivity in statistics: Excursion 4 Tour I, 4.1–4.2, pp. 221–238
Peek ahead: 6.7 Farewell Keepsake, pp. 436–444
Mini-essay (PDF)

12. 4/12 – Biasing Selection Effects and Randomization

13. 4/19 – Power: Pre-data and Post-data
Reading: Excursion 5 Tour I

14. 4/26 (Assignment 3) – Positive Predictive Value and Probabilistic Instantiation; Controversies about inferring probabilities from frequencies (in law and epistemology)
Reading: Selection from Section 5.6 (Excursion 5 Tour II); C. Howson (1997)

15. 5/3 – Current Reforms and Stat Activism: Practicing our skills on some well-known papers

Final Paper
One of the central roles I proposed for “stat activists” (after our recent workshop, The Statistics Wars and Their Casualties) is to critically scrutinize mistaken claims about leading statistical methods – especially when such claims are put forward as permissible viewpoints to help “the people” assess methods in an unbiased manner. The first act of 2023 under this umbrella concerns an article put forward as “statistics for the people” in a journal of radiation oncology. We are talking here about recommendations for analyzing data for treating cancer! Put forward as a fair-minded, or at least an informative, comparison of Bayesian vs frequentist methods, I find it to be little more than an advertisement for subjective Bayesian methods, set against a caricature of frequentist error statistical methods. The journal’s “statistics for the people” section would benefit from a full-blown article on frequentist error statistical methods – not just the letter of ours they recently published – but I’m grateful to Chowdhry and other colleagues who joined me in this effort. You will find our letter below, followed by the authors’ response. You can also find a link to their original “statistics for the people” article in the references. Let me admit right off that my criticisms are a bit stronger than my co-authors’.
Two quick additional things that I would wish to tell the authors in relation to their paper and response are:
I would never have come across an article in radiation oncology if it were not for exchanges between members of a session I was in on “why we disagree” in statistical analysis in that field. I hereby invite all readers and the nearly 1000 registrants from our workshop to alert us throughout the year of interesting items under the stat activist banner.
Our letter: Bayesian Versus Frequentist Statistics: In Regard to FornaconWood et al. (PDF of letter)
To the Editor:
We appreciate the authors bringing attention to controversies surrounding the use of Bayesian and frequentist statistics.^{1} [PDF of paper] There are many benefits of frequentist statistics and disadvantages of Bayesian statistics that were not discussed in the referenced article. We write this accompanying letter to aim for a more balanced presentation of Bayesian and frequentist statistics.
With frequentist statistical significance tests, we can learn whether the data indicate there is a genuine effect or difference in a statistical analysis, as they have the ability to control type I and type II error probabilities.^{2} Posteriors and Bayes factors do not ensure that the method rarely reports one treatment is better or worse than the other erroneously. A well-known threat to reliable results stems from the ease of using high-powered methods to data-dredge and hunt for impressive-looking results that fail to replicate with new data. However, the Bayesian assessment is not altered by things like stopping rules – at least not without violating inference by Bayes theorem.^{3} The frequentist account,^{4} by contrast, is required to take account of such selection effects in reporting error probabilities. Another caution for those unfamiliar with practical Bayesian research is that estimation of a prior distribution is nontrivial. The priors they discuss are subjective degrees of belief, but there is considerable disagreement about which beliefs are warranted, even among experts. Furthermore, should conclusions differ if the prior is chosen by a radiation oncologist or a surgeon?^{5} These considerations are some of the reasons why most phase 3 studies in oncology rely on frequentist designs. The article equates frequentist methods with simple null hypothesis testing without alternatives, thereby overlooking hypothesis testing methods that control both type I and type II errors. The frequentist takes account of type II errors and the corresponding notion of power. If a test has high power to detect a meaningful effect size, then failing to detect a statistically significant difference is evidence against a meaningful effect. Therefore, a P-value that is not small is informative.
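The point about stopping rules can be made concrete with a small simulation (illustrative only, not part of the published letter): under a true null, peeking at the data after every batch and stopping at the first nominally significant result drives the actual type I error rate well above 5%.

```python
# Sketch (illustrative simulation): optional stopping. Testing at the
# nominal 5% level after every batch of data, and stopping as soon as
# the z-statistic looks "significant", inflates the actual type I error.
import random

random.seed(4)
reps, batch, max_looks, z_crit = 4000, 20, 10, 1.96
rejections = 0
for _ in range(reps):
    xs = []
    for _ in range(max_looks):
        xs += [random.gauss(0.0, 1.0) for _ in range(batch)]  # null is true
        n = len(xs)
        z = (sum(xs) / n) * n ** 0.5   # z-statistic, sigma = 1 known
        if abs(z) > z_crit:            # stop at the first "significant" look
            rejections += 1
            break

print(f"actual type I error with optional stopping: {rejections / reps:.2%}")
```

With ten looks the realized error rate lands near 20%, roughly quadruple the nominal 5%; a frequentist design must adjust its critical values for the stopping plan, which is the selection-effect accounting described above.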
The authors write that frequentist methods do not use background information, but this is to ignore the field of experimental design and all of the work that goes into specifying the test (eg, sample size, statistical power) and critically evaluating the connection between statistical and substantive results. An effect that corresponds to a clinically meaningful effect, or effect sizes well warranted from previous studies, would clearly influence the design.
Although their article engenders important discussion, these differences between frequentist and Bayesian methods may help readers understand why so many researchers around the world still prefer the frequentist approach.
FornaconWood Reply: In Reply to Chowdhry et al. (PDF of letter)
To the Editor:
We thank the authors for their response to our “statistics for the people” article that aimed to introduce perhaps unfamiliar readers to Bayesian statistics and some potential advantages of their use. We agree that frequentist statistics are a useful and widespread statistical analytical approach, and we are not aiming to revisit the frequentist versus Bayesian arguments that have been well articulated in the literature. However, there are a couple of points we would like to make.
First, we acknowledge that the majority of phase 3 studies use frequentist designs, and this has the advantage of facilitating meta-analyses using established techniques. However, we would argue that the reason such frequentist designs are so prevalent is likely to have as much to do with convention (from funders/regulators as well as from researchers themselves), the relative exposure of the 2 approaches in educational materials, and the historic difficulties in calculating Bayesian posteriors as it does with the arguments the authors make.
Second, although we agree with Chowdhry et al. that there are many challenges associated with the estimation of prior probability distributions, we note that similar arguments apply to effect size estimation, which they cite as a strength of the Neyman-Pearson/null hypothesis significance testing approach (i.e., the use of power calculations to limit the risk of type II errors). We would also reinforce the point we make in the article about the importance of testing the influence of the prior (represented as the divergent beliefs of the hypothetical radiation oncologist and surgeon in the communication by Chowdhry et al.) on the analysis results. If the data are strong enough, the posterior distributions will be in close enough agreement to convince both parties. As we noted, it is also possible to undertake Bayesian analyses without prior information, using an uninformative prior, in which case the analysis is driven directly by the data, as for a frequentist calculation. As an aside, there is continued debate about the relative merits and deficiencies of the different frequentist approaches to significance testing, particularly around the widespread use of the hybrid Neyman-Pearson/null hypothesis significance testing approach.
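The reply's point about uninformative priors can be checked in a small sketch (hypothetical numbers; a Normal model with known σ is assumed): with a flat prior, the Bayesian credible interval coincides numerically with the frequentist confidence interval.

```python
from scipy.stats import norm

# Sketch (hypothetical numbers): Normal data with known sigma.
# With a flat (uninformative) prior, the posterior for mu is
# N(xbar, sigma^2 / n), so the 95% credible interval coincides
# numerically with the frequentist 95% confidence interval.
xbar, sigma, n = 1.2, 1.0, 25
se = sigma / n ** 0.5
z = norm.ppf(0.975)

conf_int = (xbar - z * se, xbar + z * se)            # frequentist CI
cred_int = norm(loc=xbar, scale=se).interval(0.95)   # flat-prior credible interval
```

In this simple conjugate setting the two intervals agree to floating-point precision, which is the sense in which a flat-prior analysis is "driven directly by the data."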
********************************************************
Please share your constructive remarks in the comments to this post.
For the last three years, unlike the previous 10 years that I've been blogging, it was not feasible to actually revisit that spot in the road, looking to get into a strange-looking taxi, to head to "Midnight With Birnbaum". But this year I will, and I'm about to leave at 10pm. (The pic on the left is the only blurry image I have of the club I'm taken to.) My book Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (CUP, 2018) doesn't include the argument from my article in Statistical Science ("On the Birnbaum Argument for the Strong Likelihood Principle"), but you can read it at that link along with commentaries by A. P. Dawid, Michael Evans, Martin and Liu, D. A. S. Fraser, Jan Hannig, and Jan Bjornstad. David Cox, who very sadly died in January 2022, is the one who encouraged me to write and publish it. (The first David R. Cox Foundations of Statistics Prize will be awarded at the JSM 2023.) The (Strong) Likelihood Principle (LP or SLP) remains at the heart of many of the criticisms of Neyman-Pearson (NP) statistics and of error statistics in general.
As Birnbaum emphasized, the “confidence concept” is the “one rock in a shifting scene” of statistical foundations, insofar as there’s interest in controlling the frequency of erroneous interpretations of data. (See my rejoinder to commentators.) Birnbaum bemoaned the lack of an explicit evidential interpretation of NP methods. I purport to give one in SIST 2018. Anyway, let’s see what happens this year. Happy New Year!
BACKGROUND
You know how in that Woody Allen movie, "Midnight in Paris," the main character (I forget who plays it, I saw it on a plane) is a writer finishing a novel, and he steps into a cab that mysteriously picks him up at midnight and transports him back in time where he gets to run his work by such famous authors as Hemingway and Virginia Woolf? (It was a new movie when I began the blog in 2011.) He is wowed when his work earns their approval and he comes back each night in the same mysterious cab…Well, imagine an error statistical philosopher is picked up in a mysterious taxi at midnight on New Year's Eve and lo and behold, finds herself in the company of Allan Birnbaum.[i]
OUR EXCHANGE:
ERROR STATISTICIAN: It’s wonderful to meet you Professor Birnbaum; I’ve always been extremely impressed with the important impact your work has had on philosophical foundations of statistics. I happen to have published on your famous argument about the likelihood principle (LP). (whispers: I can’t believe this!)
BIRNBAUM: Ultimately you know I rejected the LP as failing to control the error probabilities needed for my Confidence concept. But you know all this; I've read it in your book: Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (SIST, 2018, CUP).
ERROR STATISTICIAN: You've read my book? Wow! Then you know I don't think your argument shows that the LP follows from such frequentist concepts as sufficiency S and the weak conditionality principle WCP. I don't rehearse my argument there, but I first found the problem in 2006, when I was writing something on "conditioning" with David Cox. [ii] Sorry,…I know it's famous…
BIRNBAUM: Well, I shall happily invite you to take any case that violates the LP and allow me to demonstrate that the frequentist is led to inconsistency, provided she also wishes to adhere to the WCP and sufficiency (although less than S is needed).
ERROR STATISTICIAN: Well I show that no contradiction follows from holding WCP and S, while denying the LP.
BIRNBAUM: Well, well, well: I'll bet you a bottle of Elbar Grease champagne that I can demonstrate it!
ERROR STATISTICAL PHILOSOPHER: It is a great drink, I must admit that: I love lemons.
BIRNBAUM: OK. (A waiter brings a bottle, they each pour a glass and resume talking). Whoever wins this little argument pays for this whole bottle of vintage Ebar or Elbow or whatever it is Grease.
ERROR STATISTICAL PHILOSOPHER: I really don’t mind paying for the bottle.
BIRNBAUM: Good, you will have to. Take any LP violation. Let x' be a 2-standard deviation difference from the null (asserting μ = 0) in testing a normal mean from the fixed sample size experiment E', say n = 100; and let x" be a 2-standard deviation difference from an optional stopping experiment E", which happens to stop at 100. Do you agree that:
(0) For a frequentist, outcome x’ from E’ (fixed sample size) is NOT evidentially equivalent to x” from E” (optional stopping that stops at n)
ERROR STATISTICAL PHILOSOPHER: Yes, that's a clear case where we reject the strong LP, and it makes perfect sense to distinguish their corresponding p-values (which we can write as p' and p", respectively). The searching in the optional stopping experiment makes the p-value quite a bit higher than with the fixed sample size. For n = 100, data x' yields p' = ~.05, while p" is ~.3. Clearly, p' is not equal to p"; I don't see how you can make them equal.
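(The gap between p' and p" here can be checked by Monte Carlo simulation; the sketch below uses illustrative settings, with an experimenter who tests after every observation up to n = 100, and assumes numpy is available.)

```python
import numpy as np

rng = np.random.default_rng(0)

def prob_2sd_by(n_max, reps=2000):
    """Monte Carlo sketch: under the null (mu = 0, sigma = 1), estimate the
    probability of finding a 2-standard-deviation difference at SOME
    n <= n_max when one tests after every new observation."""
    hits = 0
    for _ in range(reps):
        x = rng.standard_normal(n_max)
        # z-statistic at each interim look n = 1, ..., n_max
        z = np.cumsum(x) / np.sqrt(np.arange(1, n_max + 1))
        if np.any(np.abs(z) >= 2):
            hits += 1
    return hits / reps

# With a single fixed look, the chance of |z| >= 2 is about .05;
# with optional stopping by n = 100 it is several times larger.
```

The simulated "try and try again" probability is several times the fixed-sample .05, which is exactly why the frequentist refuses to treat the two results as evidentially equivalent.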
BIRNBAUM: Suppose you've observed x", a 2-standard deviation difference from an optional stopping experiment E", that finally stops at n = 100. You admit, do you not, that this outcome could have occurred as a result of a different experiment? It could have been that a fair coin was flipped where it is agreed that heads instructs you to perform E' (fixed sample size experiment, with n = 100) and tails instructs you to perform the optional stopping experiment E", stopping as soon as you obtain a 2-standard deviation difference, and you happened to get tails, and performed the experiment E", which happened to stop with n = 100.
ERROR STATISTICAL PHILOSOPHER: Well, that is not how x” was obtained, but ok, it could have occurred that way.
BIRNBAUM: Good. Then you must grant further that your result could have come from a special experiment I have dreamt up, call it a BB-experiment. In a BB-experiment, if the outcome from the experiment you actually performed has a proportional likelihood to one in some other experiment not performed, E', then we say that your result has an "LP pair". For any violation of the strong LP, the outcome observed, let it be x", has an "LP pair", call it x', in some other experiment E'. In that case, a BB-experiment stipulates that you are to report x" as if you had determined whether to run E' or E" by flipping a fair coin.
(They fill their glasses again)
ERROR STATISTICAL PHILOSOPHER: You're saying that if my outcome from trying and trying again, that is, from optional stopping experiment E", has an "LP pair" in the fixed sample size experiment I did not perform, then I am to report x" as if the determination to run E" was by flipping a fair coin (which decides between E' and E")?
BIRNBAUM: Yes, and one more thing. If your outcome had actually come from the fixed sample size experiment E’, it too would have an “LP pair” in the experiment you did not perform, E”. Whether you actually observed x” from E”, or x’ from E’, you are to report it as x” from E”.
ERROR STATISTICAL PHILOSOPHER: So let's see if I understand a Birnbaum BB-experiment: whether my observed 2-standard deviation difference came from E' or E" (with sample size n), the result is reported as x", as if it came from E", and as a result of this strange type of mixture experiment.
BIRNBAUM: Yes, or equivalently you could just report x*: my result is a 2standard deviation difference and it could have come from either E’ (fixed sampling, n= 100) or E” (optional stopping, which happens to stop at the 100^{th} trial). That’s how I sometimes formulate a BBexperiment.
ERROR STATISTICAL PHILOSOPHER: You're saying in effect that if my result has an LP pair in the experiment not performed, I should act as if I accept the strong LP and just report its likelihood; so if the likelihoods are proportional in the two experiments (both testing the same mean), the outcomes are evidentially equivalent.
BIRNBAUM: Well, but since the BB experiment is an imagined “mixture” it is a single experiment, so really you only need to apply the weak LP which frequentists accept. Yes? (The weak LP is the same as the sufficiency principle).
ERROR STATISTICAL PHILOSOPHER: But what is the sampling distribution in this imaginary BB experiment? Suppose I have Birnbaumized my experimental result, just as you describe, and observed a 2-standard deviation difference from optional stopping experiment E". How do I calculate the p-value within a Birnbaumized experiment?
BIRNBAUM: I don’t think anyone has ever called it that.
ERROR STATISTICAL PHILOSOPHER: I just wanted to have a shorthand for the operation you are describing; there's no need to use it, if you'd rather I not. So how do I calculate the p-value within a BB-experiment?
BIRNBAUM: You would report the overall p-value, which would be the average over the two sampling distributions: (p' + p")/2.
Say p' is ~.05, and p" is ~.3; whatever they are, we know they are different; that's what makes this a violation of the strong LP (given in premise (0)).
ERROR STATISTICAL PHILOSOPHER: So you're saying that if I observe a 2-standard deviation difference from E', I do not report the associated p-value p', but instead I am to report the average p-value, averaging over some other experiment E" that could have given rise to an outcome with a proportional likelihood to the one I observed, even though I didn't obtain it this way?
BIRNBAUM: I'm saying that you have to grant that x' from a fixed sample size experiment E' could have been generated through a BB-experiment.
My, this drink is sour!
ERROR STATISTICAL PHILOSOPHER: Yes, I love pure lemon.
BIRNBAUM: Perhaps you're in want of a gene; never mind.
I'm saying you have to grant that x' from a fixed sample size experiment E' could have been generated through a BB-experiment. If you are to interpret your experiment as if you are within the rules of a BB-experiment, then x' is evidentially equivalent to x" (is equivalent to x*). This is premise (1).
ERROR STATISTICAL PHILOSOPHER: But the result would be that the p-value associated with x' (fixed sample size) is reported to be larger than it actually is (.05), because I'd be averaging over fixed and optional stopping experiments; while the p-value of x" (optional stopping) is reported to be smaller than it is, in both cases because of an experiment I did not perform.
BIRNBAUM: Yes, the BB-experiment computes the p-value in an unconditional manner: it takes the convex combination over the 2 ways the result could have come about.
ERROR STATISTICAL PHILOSOPHER: This is just a matter of your definitions; it is an analytical or mathematical result, so long as we grant being within your BB experiment.
BIRNBAUM: True, (1) plays the role of the sufficiency assumption, but one need not even appeal to sufficiency, it is just a matter of mathematical equivalence.
By the way, I am focusing just on LP violations, therefore, the outcome, by definition, has an LP pair. In other cases, where there is no LP pair, you just report things as usual.
ERROR STATISTICAL PHILOSOPHER: OK, but p' still differs from p"; so I still don't see how I'm forced to infer the strong LP, which identifies the two. In short, I don't see the contradiction with my rejecting the strong LP in premise (0). (Also we should come back to the "other cases" at some point….)
BIRNBAUM: Wait! Don’t be so impatient; I’m about to get to step (2). Here, let’s toast to the new year: “To Elbar Grease!”
ERROR STATISTICAL PHILOSOPHER: To Elbar Grease!
BIRNBAUM: So far all of this was step (1).
ERROR STATISTICAL PHILOSOPHER: Oy, what is step 2?
BIRNBAUM: STEP 2 is this: Surely you agree that once you know from which experiment the observed 2-standard deviation difference actually came, you ought to report the p-value corresponding to that experiment. You ought NOT to report the average (p' + p")/2 as you were instructed to do in the BB experiment.
This gives us premise (2a):
(2a) outcome x", once it is known that it came from E", should NOT be analyzed as in a BB experiment where p-values are averaged. The report should instead use the sampling distribution of the optional stopping test E", yielding the p-value p" (~.37). In fact, .37 is the value you give in SIST p. 44 (imagining the experimenter keeps taking 10 more).
ERROR STATISTICAL PHILOSOPHER: So, having first insisted I imagine myself in a Birnbaumized, I mean a BB-experiment, and report an average p-value, I'm now to return to my senses and "condition" in order to get back to the only place I ever wanted to be, i.e., back to where I was to begin with?
BIRNBAUM: Yes, at least if you hold to the weak conditionality principle WCP (of D. R. Cox)—surely you agree to this.
(2b) Likewise, if you knew the 2-standard deviation difference came from E', then
x' should NOT be deemed evidentially equivalent to x" (as in the BB experiment); the report should instead use the sampling distribution of the fixed test E' (.05).
ERROR STATISTICAL PHILOSOPHER: So, having first insisted I consider myself in a BB-experiment, in which I report the average p-value, I'm now to return to my senses and allow that if I know the result came from optional stopping, E", I should "condition" on it and report p".
BIRNBAUM: Yes. There was no need to repeat the whole spiel.
ERROR STATISTICAL PHILOSOPHER: I just wanted to be clear I understood you. Of course, all of this assumes the model is correct or adequate to begin with.
BIRNBAUM: Yes, the LP (or SLP, to indicate it’s the strong LP) is a principle for parametric inference within a given model. So you arrive at (2a) and (2b), yes?
ERROR STATISTICAL PHILOSOPHER: OK, but it might be noted that unlike premise (1), premises (2a) and (2b) are not given by definition, they concern an evidential standpoint about how one ought to interpret a result once you know which experiment it came from. In particular, premises (2a) and (2b) say I should condition and use the sampling distribution of the experiment known to have been actually performed, when interpreting the result.
BIRNBAUM: Yes, and isn’t this weak conditionality principle WCP one that you happily accept?
ERROR STATISTICAL PHILOSOPHER: Well, the WCP originally refers to actual mixtures, where one flipped a coin to determine whether E' or E" is performed, whereas you're requiring I consider an imaginary Birnbaum mixture experiment, where the choice of the experiment not performed varies depending on the outcome that needs an LP pair; I cannot even determine what this might be until after I've observed the result that would violate the LP. I don't know what the sample size will be ahead of time.
BIRNBAUM: Sure, but you admit that your observed x” could have come about through a BBexperiment, and that’s all I need. Notice
(1), (2a) and (2b) yield the strong LP!
Outcome x” from E”(optional stopping that stops at n) is evidentially equivalent to x’ from E’ (fixed sample size n).
ERROR STATISTICAL PHILOSOPHER: Clever, but your “proof” is obviously unsound; and before I demonstrate this, notice that the conclusion, were it to follow, asserts p’ = p”, (e.g., .05 = .3!), even though it is unquestioned that p’ is not equal to p”, that is because we must start with an LP violation (premise (0)).
BIRNBAUM: Yes, it is puzzling, but where have I gone wrong?
(The waiter comes by and fills their glasses; they are so deeply engrossed in thought they do not even notice him.)
ERROR STATISTICAL PHILOSOPHER: There are many routes to explaining a fallacious argument. Here’s one. What is required for STEP 1 to hold, is the denial of what’s needed for STEP 2 to hold:
Step 1 requires us to analyze results in accordance with a BB experiment. If we do so, true enough we get:
premise (1): outcome x” (in a BB experiment) is evidentially equivalent to outcome x’ (in a BB experiment):
That is because in either case, the p-value would be (p' + p")/2.
Step 2 now insists that we should NOT calculate evidential import as if we were in a BB experiment. Instead we should consider the experiment from which the data actually came, E’ or E”:
premise (2a): outcome x" (in a BB experiment) is/should be evidentially equivalent to x" from E" (optional stopping that stops at n): its p-value should be p".
premise (2b): outcome x' (in a BB experiment) is/should be evidentially equivalent to x' from E' (fixed sample size): its p-value should be p'.
If (1) is true, then (2a) and (2b) must be false!
If (1) is true and we keep fixed the stipulation of a BB experiment (which we must do to apply step 2), then (2a) is asserting:
The average p-value (p' + p")/2 = p", which is false.
Likewise, if (1) is true, then (2b) is asserting:
the average p-value (p' + p")/2 = p', which is false.
Alternatively, we can see what goes wrong by realizing:
If (2a) and (2b) are true, then premise (1) must be false.
In short, your famous argument requires us to assess evidence in a given experiment in two contradictory ways: as if we are within a BB experiment (and report the average p-value) and also as if we are not, but rather should report the actual p-value.
I can render it as formally valid, but then its premises can never all be true; alternatively, I can get the premises to come out true, but then the conclusion is false—so it is invalid. In no way does it show the frequentist is open to contradiction (by dint of accepting S, WCP, and denying the LP).
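(The inconsistency traced in this exchange can be spelled out numerically; the toy sketch below just uses the approximate p-values from the dialogue.)

```python
# Toy sketch using the dialogue's approximate p-values.
p_fixed, p_optional = 0.05, 0.3          # p' and p''
p_bb = (p_fixed + p_optional) / 2        # premise (1): the BB (averaged) report

premise_0 = p_fixed != p_optional        # an LP violation to begin with
premise_2a = p_bb == p_optional          # (2a) under (1): average should equal p''
premise_2b = p_bb == p_fixed             # (2b) under (1): average should equal p'

# Given premise (0), premises (2a) and (2b) both fail inside a BB-experiment:
assert premise_0 and not premise_2a and not premise_2b
```

Whenever p' differs from p" (premise (0)), their average can equal neither one, so (1), (2a), and (2b) cannot all be true at once.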
BIRNBAUM: Yet some people still think it is a breakthrough. I never agreed to go as far as Jimmy Savage wanted me to, namely, to be a Bayesian….
ERROR STATISTICAL PHILOSOPHER: I have a much clearer exposition of what goes wrong in your argument than I did in the discussion from 2010. There were still several gaps, and a lack of a clear articulation of the WCP. In fact, I've come to see that clarifying the entire argument turns on defining the WCP. Have you seen my 2014 paper in Statistical Science? The key difference is that in (2014) the WCP is stated as an equivalence, as you intended. Cox's WCP, many claim, was not an equivalence going in 2 directions. Slides from a presentation may be found on this blogpost.
BIRNBAUM: Yes, the "monster of the LP" arises from viewing WCP as an equivalence, instead of going in one direction (from mixtures to the known result).
ERROR STATISTICAL PHILOSOPHER: In my 2014 paper (unlike my earlier treatments) I too construe WCP as giving an “equivalence” but there is an equivocation that invalidates the purported move to the LP.
On the one hand, it’s true that if z is known (and known for example to have come from optional stopping), it’s irrelevant that it could have resulted from either fixed sample testing or optional stopping.
But it does not follow that if z is known, it’s irrelevant whether it resulted from fixed sample testing or optional stopping. It’s the slippery slide into this second statement–which surely sounds the same as the first–that makes your argument such a brain buster.
BIRNBAUM: Yes I have seen your 2014 paper! Your Rejoinder to some of the critics is gutsy, to say the least. I’ve also seen the slides on your blog.
ERROR STATISTICAL PHILOSOPHER: Thank you, I’m amazed you follow my blog! But look I must get your answer to a question before you leave this year.
Sudden interruption by the waiter who, very wisely, is wearing an N95 mask:
WAITER: Who gets the tab? We’re closing a bit early due to Covid.
BIRNBAUM: I do. To Elbar Grease! And to your (still) new book SIST! I have a list of comments and questions right here.
ERROR STATISTICAL PHILOSOPHER: Let me see, I'd love to read your questions and comments. (She takes a long legal-sized yellow sheet from Birnbaum, noticing it is filled with tiny handwritten comments, covering both sides.)
BIRNBAUM: To Elbar Grease! To Severe Testing! Happy New Year!
ERROR STATISTICAL PHILOSOPHER: I have one quick question, Professor Birnbaum, and I swear that whatever you say will be just between us, I won’t tell a soul. In your last couple of papers, you suggest you’d discovered the flaw in your argument for the LP. Am I right? Even in the discussion of your (1962) paper, you seemed to agree with Pratt that WCP can’t do the job you intend.
BIRNBAUM: Savage, you know, never got off my case about remaining at “the halfway house” of likelihood, and not going full Bayesian. Then I wrote the review about the Confidence Concept as the one rock on a shifting scene… Pratt thought the argument should instead appeal to a Censoring Principle (basically, it doesn’t matter if your instrument cannot measure beyond k units if the measurement you’re making is under k units.)
ERROR STATISTICAL PHILOSOPHER: Yes, but who says frequentist error statisticians deny the Censoring Principle? So back to my question, you disappeared before answering last year…I just want to know…you did see the flaw, yes?
WAITER: We’re closing now; shall I call Remote Taxi?
BIRNBAUM: Yes, yes!
ERROR STATISTICAL PHILOSOPHER: ‘Yes’, you discovered the flaw in the argument, or ‘yes’ to the taxi?
MANAGER: We’re closing now; I’m sorry you must leave.
ERROR STATISTICAL PHILOSOPHER: We’re leaving I just need him to clarify his answer….
Large group of people bustle past, mostly unmasked.
Prof. Birnbaum…? Allan? Where did he go? (oy, not again!)
Link to complete discussion:
Mayo, Deborah G. "On the Birnbaum Argument for the Strong Likelihood Principle" (with discussion & rejoinder). Statistical Science 29 (2014), no. 2, 227-266.
[i] Many links on the strong likelihood principle (LP or SLP) and Birnbaum may be found by searching this blog. Good sources for where to start as well as classic background papers may be found in this blogpost. A link to slides and video of a very introductory presentation of my argument from the 2021 Phil Stat Forum is here.
[ii] By the way, Ronald Giere gave me numerous original papers of yours. They’re in files in my attic library. Some are in mimeo, others typed…I mean, obviously for that time that’s what they’d be…now of course, oh never mind, sorry.
Below are the videos and slides from the 7 talks from Session 3 and Session 4 of our workshop The Statistics Wars and Their Casualties held on December 1 & 8, 2022. Session 3 speakers were: Daniele Fanelli (London School of Economics and Political Science), Stephan Guttinger (University of Exeter), and David Hand (Imperial College London). Session 4 speakers were: Jon Williamson (University of Kent), Margherita Harris (London School of Economics and Political Science), Aris Spanos (Virginia Tech), and Uri Simonsohn (Esade Ramon Llull University). Abstracts can be found here. In addition to the talks, you'll find (1) a Recap of recaps at the beginning of Session 3 that provides a summary of Sessions 1 & 2, and (2) Mayo's (5 minute) introduction to the final discussion: "Where do we go from here (Part ii)" at the end of Session 4.
The videos & slides from Sessions 1 & 2 can be found on this post.
Readers are welcome to use the comments section on the PhilStatWars.com workshop blog post here to make constructive comments or to ask questions of the speakers. If you’re asking a question, indicate to which speaker(s) it is directed. We will leave it to speakers to respond. Thank you!
SESSION 3
Recap of recaps summary of Sessions 1 & 2:
Introduction to Session: Daniël Lakens (Eindhoven University of Technology)
Daniele Fanelli (London School of Economics and Political Science)
The neglected importance of complexity in statistics and Metascience
Stephan Guttinger (University of Exeter)
What are questionable research practices?
David Hand (Imperial College London)
What’s the question?
Discussion (Session 3): (a) Panel discussion of speakers; (b) general audience discussion; (c) “Where do we go from here (Part i)” participant discussion.
SESSION 4
Introduction to Session 4: Deborah Mayo (Virginia Tech)
Jon Williamson (University of Kent)
Causal inference is not statistical inference
Margherita Harris (London School of Economics and Political Science)
On Severity, the Weight of Evidence, and the Relationship Between the Two
Aris Spanos (Virginia Tech)
Revisiting the Two Cultures in Statistical Modeling and Inference as they relate to the Statistics Wars and Their Potential Casualties
Uri Simonsohn (Esade Ramon Llull University)
Mathematically Elegant Answers to Research Questions No One is Asking (metaanalysis, random effects models, and Bayes factors)
Where Should Stat Activists Go From Here? Deborah Mayo (Virginia Tech):
Discussion: (a) Panel discussions; (b) General audience discussion; (c) “Where do we go from here (Part ii)” participants and audience.
Below are slides from 4 of the talks given in our Philosophy of Science Association (PSA) session from last month: the PSA 22 Symposium: Multiplicity, Data-Dredging, and Error Control. It was held in Pittsburgh on November 13, 2022. I will write some reflections in the "comments" to this post. I invite your constructive comments there as well.
SYMPOSIUM ABSTRACT: High-powered methods, the big data revolution, and the crisis of replication in medicine and social sciences have prompted new reflections and debates in both statistics and philosophy about the role of traditional statistical methodology in current science. Experts do not agree on how to improve reliability, and these disagreements reflect philosophical battles, old and new, about the nature of inductive-statistical evidence and the roles of probability in statistical inference. We consider three central questions:
•How should we cope with the fact that data-driven processes, multiplicity and selection effects can invalidate a method's control of error probabilities?
•Can we use the same data both to search non-experimental data for causal relationships and to reliably test them?
•Can a method's error probabilities both control a method's performance and give a relevant epistemological assessment of what can be learned from data?
As reforms to methodology are being debated, constructed or (in some cases) abandoned, the time is ripe to bring the perspectives of philosophers of science (Glymour, Mayo, Mayo-Wilson) and statisticians (Berger, Thornton) to reflect on these questions.
Deborah Mayo (Philosophy, Virginia Tech)
Error Control and Severity
ABSTRACT: I put forward a general principle for evidence: an error-prone claim C is warranted to the extent it has been subjected to, and passes, an analysis that very probably would have found evidence of flaws in C just if they are present. This probability is the severity with which C has passed the test. When a test's error probabilities quantify the capacity of tests to probe errors in C, I argue, they can be used to assess what has been learned from the data about C. A claim can be probable or even known to be true, yet poorly probed by the data and model at hand. The severe testing account leads to a reformulation of statistical significance tests: Moving away from a binary interpretation, we test several discrepancies from any reference hypothesis and report those well or poorly warranted. A probative test will generally involve combining several subsidiary tests, deliberately designed to unearth different flaws. The approach relates to confidence interval estimation, but, like confidence distributions (CD) (Thornton), a series of different confidence levels is considered. A 95% confidence interval method, say using the mean M of a random sample to estimate the population mean μ of a Normal distribution, will cover the true, but unknown, value of μ 95% of the time in a hypothetical series of applications. However, we cannot take .95 as the probability that a particular interval estimate (a ≤ μ ≤ b) is correct—at least not without a prior probability for μ. In the severity interpretation I propose, we can nevertheless give an inferential construal post-data, while still regarding μ as fixed. For example, there is good evidence μ ≥ a (the lower estimation limit) because if μ < a, then with high probability .95 (or .975 if viewed as one-sided) we would have observed a smaller value of M than we did. Likewise for inferring μ ≤ b.
To understand a method's capability to probe flaws in the case at hand, we cannot just consider the observed data, unlike in strict Bayesian accounts. We need to consider what the method would have inferred if other data had been observed. For each point μ' in the interval, we assess how severely the claim μ > μ' has been probed. I apply the severity account to the problems discussed by earlier speakers in our session. The problem with multiple testing (and selective reporting) when attempting to distinguish genuine effects from noise is not merely that it would, if regularly applied, lead to inferences that were often wrong. Rather, it renders the method incapable, or practically so, of probing the relevant mistaken inference in the case at hand. In other cases, by contrast, (e.g., DNA matching) the searching can increase the test's probative capacity. In this way the severe testing account can explain competing intuitions about multiplicity and data-dredging, while blocking inferences based on problematic data-dredging.
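The severity computation the abstract describes can be sketched directly (hypothetical numbers; a Normal model with known σ is assumed, with the observed mean a 2-standard-error difference from 0):

```python
from scipy.stats import norm

# Sketch (hypothetical numbers): X ~ N(mu, sigma^2), n observations.
sigma, n = 1.0, 100
se = sigma / n ** 0.5
m_obs = 0.2                    # observed mean: a 2-standard-error difference from 0

def severity(mu_prime):
    """Severity for the claim mu > mu_prime: the probability of observing a
    smaller sample mean than we did, were mu equal to mu_prime."""
    return norm.cdf((m_obs - mu_prime) / se)

# severity(0.1) is about .84: the claim mu > 0.1 passes with fair severity;
# severity(0.2) is .5: the same data are poor evidence that mu > 0.2.
```

Scanning μ' across the interval, as the abstract suggests, yields the series of well and poorly warranted discrepancy claims, mirroring the lower confidence limit reasoning (μ ≥ a) described above.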
Suzanne Thornton (Statistics, Swarthmore College)
The Duality of Parameters and the Duality of Probability
ABSTRACT: Under any inferential paradigm, statistical inference is connected to the logic of probability. Well-known debates among these various paradigms emerge from conflicting views on the notion of probability. One dominant view understands the logic of probability as a representation of variability (frequentism), and another prominent view understands probability as a measurement of belief (Bayesianism). The first camp generally describes model parameters as fixed values, whereas the second camp views parameters as random. Just as calibration (Reid and Cox 2015, "On Some Principles of Statistical Inference," International Statistical Review 83(2), 293-308)–the behavior of a procedure under hypothetical repetition–bypasses the need for different versions of probability, I propose that an inferential approach based on confidence distributions (CD), which I will explain, bypasses the analogous conflicting perspectives on parameters. Frequentist inference is connected to the logic of probability through the notion of empirical randomness. Sample estimates are useful only insofar as one has a sense of the extent to which the estimator may vary from one random sample to another. The bounds of a confidence interval are thus particular observations of a random variable, where the randomness is inherited by the random sampling of the data. For example, 95% confidence intervals for parameter θ can be calculated for any random sample from a Normal N(θ, 1) distribution. With repeated sampling, approximately 95% of these intervals are guaranteed to yield an interval covering the fixed value of θ. Bayesian inference produces a probability distribution for the different values of a particular parameter. However, the quality of this distribution is difficult to assess without invoking an appeal to the notion of repeated performance.
Generating a credible interval for θ from data observed from a N(θ, 1) distribution requires an assumption about the plausibility of different possible values of θ; that is, one must assume a prior. However, depending on the context (is θ the recovery time for a newly created drug? or is θ the recovery time for a new version of an older drug?) there may or may not be an informed choice for the prior. Without appealing to the long-run performance of the interval, how is one to judge a 95% credible interval [a, b] versus another 95% interval [a’, b’] based on the same data but a different prior? In contrast to a posterior distribution, a CD is not a probabilistic statement about the parameter; rather, it is a data-dependent estimate for a fixed parameter for which a particular behavioral property holds. The Normal distribution itself, centered around the observed average of the data (e.g. average recovery times), can be a CD for θ. It can give any level of confidence. Such estimators can be derived through Bayesian or frequentist inductive procedures, and any CD, regardless of how it is obtained, guarantees performance of the estimator under replication for a fixed target, while simultaneously producing a random estimate for the possible values of θ.
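The repeated-sampling guarantee described here is easy to check directly. A minimal stdlib-Python simulation (my own sketch; the true θ and sample size are made up):

```python
import random
from statistics import mean

def coverage(theta=5.0, n=20, trials=2000, seed=1):
    """Fraction of 95% intervals xbar +/- 1.96/sqrt(n), computed from
    N(theta, 1) samples, that cover the fixed true theta."""
    rng = random.Random(seed)
    half = 1.96 / n ** 0.5  # half-width of each interval
    hits = 0
    for _ in range(trials):
        xbar = mean(rng.gauss(theta, 1) for _ in range(n))
        hits += (xbar - half <= theta <= xbar + half)
    return hits / trials

print(coverage())  # close to 0.95
```

Each interval either covers θ or it does not; the 95% is a property of the procedure under repetition, which is exactly the calibration a CD preserves.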
Clark Glymour (Philosophy, Carnegie Mellon University)
Good Data Dredging
ABSTRACT: “Data dredging” (searching non-experimental data for causal and other relationships and taking that same data to be evidence for those relationships) was historically common in the natural sciences; the works of Kepler, Cannizzaro and Mendeleev are examples. Nowadays, “data dredging” (using data to bring hypotheses into consideration and regarding that same data as evidence bearing on their truth or falsity) is widely denounced by both philosophical and statistical methodologists. Notwithstanding, “data dredging” is routinely practiced in the human sciences using “traditional” methods, various forms of regression for example. The main thesis of my talk is that, in the spirit and letter of Mayo’s and Spanos’ notion of severe testing, modern computational algorithms that search data for causal relations severely test their resulting models in the process of “constructing” them. My claim is that in many investigations, principled computerized search is invaluable for reliable, generalizable, informative, scientific inquiry. The possible failures of traditional search methods for causal relations, multiple regression for example, are easily demonstrated by simulation in cases where even the earliest consistent graphical model search algorithms succeed. In real scientific cases in which the number of variables is large in comparison to the sample size, principled search algorithms can be indispensable. I illustrate the first claim with a simple linear model, and the second claim with an application of the oldest correct graphical model search, the PC algorithm, to genomic data followed by experimental tests of the search results. The latter example, due to Stekhoven et al. (“Causal Stability Ranking,” Bioinformatics 28(21), 2819-2823), involves identification of (some of the) genes responsible for bolting in A. thaliana from among more than 19,000 coding genes, using as data the gene expressions and time to bolting from only 47 plants.
I will also discuss Fast Causal Inference (FCI), which gives asymptotically correct results even in the presence of confounders. These and other examples raise a number of issues about using multiple hypothesis tests in strategies for severe testing, notably the interpretation of standard errors and confidence levels as error probabilities when the structures assumed in parameter estimation are uncertain. Commonly used regression methods, I will argue, are bad data dredging methods that do not severely, or appropriately, test their results. I argue that various traditional and proposed methodological norms, including prespecification of experimental outcomes and error probabilities for regression estimates of causal effects, are unnecessary or illusory in application. Statistics wants a number, or at least an interval, to express a normative virtue: the value of data as evidence for a hypothesis, how well the data push us toward the true or away from the false. Good when you can get it, but there are many circumstances where you have evidence and yet there is no number or interval to express it, other than phony numbers with no logical connection with truth guidance. Kepler, Darwin, Cannizzaro, and Mendeleev had no such numbers, but they severely tested their claims by combining data dredging with severe testing.
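The claim that regression failures are “easily demonstrated by simulation” can be illustrated with a collider example (my own sketch, not from the talk): y is causally independent of x, yet regressing y on x together with a common effect z manufactures a strong spurious coefficient on x.

```python
import random

def regress2(y, x, z):
    """OLS coefficients of y on (x, z), via the centered 2x2 normal equations."""
    n = len(y)
    mx, mz, my = sum(x) / n, sum(z) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    szz = sum((c - mz) ** 2 for c in z)
    sxz = sum((a - mx) * (c - mz) for a, c in zip(x, z))
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    szy = sum((c - mz) * (b - my) for c, b in zip(z, y))
    det = sxx * szz - sxz * sxz
    return (szz * sxy - sxz * szy) / det, (sxx * szy - sxz * sxy) / det

rng = random.Random(0)
n = 5000
x = [rng.gauss(0, 1) for _ in range(n)]
y = [rng.gauss(0, 1) for _ in range(n)]                  # y causally independent of x
z = [a + b + rng.gauss(0, 0.5) for a, b in zip(x, y)]     # z is a common effect (collider)
bx, bz = regress2(y, x, z)
print(round(bx, 2), round(bz, 2))  # bx is markedly negative: a spurious "effect" of x
```

A constraint-based search such as PC would notice that x and y are unconditionally independent and orient z as their common effect, rather than reporting the conditional association as causal.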
James Berger (Statistics, Duke University)
Comparing Frequentist and Bayesian Control of Multiple Testing
ABSTRACT: A problem that is common to many sciences is that of having to deal with a multiplicity of statistical inferences. For instance, in GWAS (Genome-Wide Association Studies), an experiment might consider 20 diseases and 100,000 genes, and conduct statistical tests of the 20 × 100,000 = 2,000,000 null hypotheses that a specific disease is associated with a specific gene. The issue is that selective reporting of only the ‘highly significant’ results could lead to many claimed disease/gene associations that turn out to be false, simply because of statistical randomness. In 2007, the seriousness of this problem was recognized in GWAS and extremely stringent standards were employed to resolve it. Indeed, it was recommended that tests for association should be conducted at an error probability of 5 × 10^-7. Particle physicists similarly learned that a discovery would be reliably replicated only if the p-value of the relevant test was less than 5.7 × 10^-7. This was because they had to account for a huge number of multiplicities in their analyses. Other sciences have continuing issues with multiplicity. In the social sciences, p-hacking and data dredging are common, which involve multiple analyses of data. Stopping rules in social sciences are often ignored, even though it has been known since 1933 that, if one keeps collecting data and computing the p-value, one is guaranteed to obtain a p-value less than 0.05 (or, indeed, any specified value), even if the null hypothesis is true. In medical studies that occur with strong oversight (e.g., by the FDA), control for multiplicity is mandated. There is also typically a large amount of replication, resulting in meta-analysis.
But there are many situations where multiplicity is not handled well, such as subgroup analysis: one first tests for an overall treatment effect in the population; failing to find that, one tests for an effect among men or among women; failing to find that, one tests for an effect among old men or young men, or among old women or young women; and so on. I will argue that there is a single method that can address any such problems of multiplicity: Bayesian analysis, with the multiplicity being addressed through choice of prior probabilities of hypotheses. In GWAS, scientists assessed the chance of a disease/gene association to be 1/100,000, meaning that each null hypothesis of no association would be assigned a prior probability of 1 - 1/100,000. Only tests yielding p-values less than 5 × 10^-7 would be able to overcome this strong initial belief in no association. In subgroup analysis, the set of possible subgroups under consideration can be expressed as a tree, with probabilities being assigned to differing branches of the tree to deal with the multiplicity. There are, of course, also frequentist error approaches (such as Bonferroni and FDR) for handling multiplicity of statistical inferences; indeed, these are much more familiar than the Bayesian approach. These are, however, targeted solutions for specific classes of problems and are not easily generalizable to new problems.
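The arithmetic behind the GWAS prior can be sketched as a screening calculation (illustrative only; the 50% power figure is my own assumption):

```python
def post_prob(prior, alpha, power):
    """Posterior probability of a genuine association given a 'significant'
    result, treating the test as a screening instrument."""
    return prior * power / (prior * power + (1 - prior) * alpha)

prior = 1e-5                          # GWAS-style prior: 1 in 100,000 pairs
print(post_prob(prior, 0.05, 0.5))    # tiny: almost all 'hits' would be false
print(post_prob(prior, 5e-7, 0.5))    # large: the stringent threshold works
```

At the conventional .05 threshold nearly every reported association would be spurious; only at the 5 × 10^-7 level does a hit overcome the strong prior of no association.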
At the end of this post is “A recap of recaps”, the short video we showed at the beginning of Session 3 last week that summarizes the presentations from Sessions 1 and 2 back on September 22-23.
The Statistics Wars
and Their Casualties
1 December and 8 December 2022
Sessions #3 and #4
15:00-18:15 London Time / 10:00am-1:15pm EST
ONLINE
(London School of Economics, CPNSS)
registration form
For slides and videos of Sessions #1 and #2: see the workshop page
Session 3 (Moderator: Daniël Lakens, Eindhoven University of Technology)
OPENING
SPEAKERS
DISCUSSIONS:
Session 4 (Moderator: Deborah Mayo, Virginia Tech)
SPEAKERS
DISCUSSIONS:
**********************************************************************
Speakers/Panellists:
Sponsors/Affiliations:
Stephen Senn
Consultant Statistician
Edinburgh, Scotland
A large university is interested in investigating the effects on the students of the diet provided in the university dining halls and any sex difference in these effects. Various types of data are gathered. In particular, the weight of each student at the time of his arrival in September and their weight the following June are recorded. (p. 304)
This is how Frederic Lord (1912-2000) introduced the paradox (1) that now bears his name. It is justly famous (or notorious). However, the addition of sex as a factor adds nothing to the essence of the paradox and (in my opinion) merely confuses the issue. Furthermore, studying the effect of diet needs some sort of control. Therefore, I shall consider the paradox in the purer form proposed by Wainer and Brown (2), which was subtly modified by Pearl and Mackenzie in The Book of Why (3) (see pp. 212-217).
In the Wainer and Brown form, two dining rooms are mentioned, Dining Room A and Dining Room B. Pearl and Mackenzie, however, although they too refer to Dining Room A and Dining Room B in the diagram they present, also refer to two diets. In my discussion below I shall maintain a distinction between Hall (using Lord’s original term) and Diet. This distinction is of causal interest, since I shall assume that if Diet A was given in Hall 1 and Diet B in Hall 2 (say), the alternative arrangement of Diet B in Hall 1 and Diet A in Hall 2 might have been possible, and also that a difference between diets may be of wider general interest than a difference between halls.
A most thorough and penetrating analysis of assumptions made in discussing Lord’s paradox is given by Holland and Rubin (4), and the reader who is interested in learning more is referred to their paper.
I shall now consider four variants for the way that the data to be analysed might have arisen, and I shall illustrate the analysis using John Nelder’s approach to designed experiments (5, 6) as incorporated in Genstat® (7). This requires separate identification of structure in the experimental material that exists prior to experimentation (the block structure) and the nature of the treatments that are subsequently applied (the treatment structure). This, together with a third piece of information, the design matrix, which maps treatments onto units, determines the analysis.
Variant 1a
Students have already decided in which of the two halls they will dine. The university authorities then decide to allocate (at random) Diet A to one hall and Diet B to the other and measure initial and final weights of 100 students in each hall.
The disposition of students looks like this.
Count
Hall   Diet A   Diet B
1         100        0
2           0      100
The Genstat® code for the dummy ANOVA (dummy because ANOVA has not been given an outcome variate) for the experiment looks like this.
BLOCKSTRUCTURE Hall/Student
COVARIATE Initial
TREATMENTSTRUCTURE Diet
ANOVA
Note that the fact that students are ‘nested’ within halls is shown using the / operator. The dummy analysis of variance includes this output:
Analysis of variance (adjusted for covariate)
Covariate: Initial Weight
Source of variation d.f.
Hall stratum
Diet 1
Hall.Student stratum
Covariate 1
Residual 197
Total 199
From this we see that Diet appears in the Hall stratum (that is to say at the higher level) but there are only two halls and so the effect of Diet cannot be separated from the effect of Hall.
Variant 1b
It has been decided to trial Diet A in Hall 1 (say) and Diet B in Hall 2 (say). Students are then randomly allocated in equal numbers to dine in one or the other hall, and in each hall 100 students are chosen to be measured. The disposition of students is as before. It is now accepted that the effects of Diet and Hall cannot be separated, but it is agreed that the joint effect of both will be studied. Hall can now be transferred from the block structure to the treatment structure. The code is now
BLOCKSTRUCTURE Student
COVARIATE Initial
TREATMENTSTRUCTURE Diet+Hall
ANOVA
The output includes
Analysis of variance (adjusted for covariate)
Covariate: Initial Weight
Source of variation d.f.
Student stratum
Diet 1
Covariate 1
Residual 197
Total 199
Information summary
Aliased model terms
Hall
It appears that the effect of Diet can now be studied. In fact, having fitted Diet and the covariate (initial weight), 197 degrees of freedom are left for estimating residual variation. Note, however, that we are warned that Hall is an aliased model term. Since Diet changes whenever Hall is changed, the effect of one cannot be separated from the other. Thus, although nothing can be said about the effect of Diet independently of Hall, their joint effect can be studied. Not only will it be possible to calculate a standard error, but an appropriate covariate adjustment can be made.
It must be conceded, however, that the analysis proposed requires an important assumption. Although the method of assignment (students allocated independently to the hall/diet combination) does make initial weights independent of each other, the same is not necessarily true of final weights. It is possible that living together over the period of the experiment will introduce some sort of correlation. Thus a conventional analysis requires an assumption of independence that the experimental procedure cannot guarantee.
Variant 2
It is decided to vary diets within halls. In each hall an equal number of students will be randomly assigned to follow Diet A and an equal number to follow Diet B. In each hall 100 students (50 on Diet A and 50 on Diet B) will have their initial and final weights measured. The disposition of students looks like this.
Count
Hall   Diet2 A   Diet2 B
1           50        50
2           50        50
The code to analyse this experiment will look like this.
BLOCKSTRUCTURE Hall/Student
COVARIATE Initial
TREATMENTSTRUCTURE Diet2
ANOVA
Here the code is apparently the same as in Variant 1a apart from the fact that Diet is replaced by Diet2. The former factor has a pattern whereby diet is varied between halls; the latter varies it within halls.
The output includes the following.
Analysis of variance (adjusted for covariate)
Covariate: Initial Weight
Source of variation d.f.
Hall stratum
Covariate 1
Hall.Student stratum
Diet2 1
Covariate 1
Residual 196
Total 199
It now becomes possible to estimate the effect of diet on weight.
A possible illustration of the data is given in Figure 1.
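The covariate-adjusted comparison this variant supports can be sketched in Python (my own illustration, not the Genstat analysis, and ignoring the blocking by hall for simplicity; all numbers are made up, with a true diet effect of +2 kg built in):

```python
import random

def ancova_effect(init_a, fin_a, init_b, fin_b):
    """Diet effect estimate: difference in mean final weights, adjusted by
    the pooled within-group regression of final on initial weight."""
    def stats(x, y):
        n = len(x)
        mx, my = sum(x) / n, sum(y) / n
        sxx = sum((a - mx) ** 2 for a in x)
        sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
        return mx, my, sxx, sxy
    mxa, mya, sxxa, sxya = stats(init_a, fin_a)
    mxb, myb, sxxb, sxyb = stats(init_b, fin_b)
    slope = (sxya + sxyb) / (sxxa + sxxb)   # pooled within-group slope
    return (mya - myb) - slope * (mxa - mxb)

# Made-up data: final = 0.8*initial + diet effect + noise; Diet A adds 2 kg
rng = random.Random(5)
ia = [rng.gauss(70, 5) for _ in range(100)]
ib = [rng.gauss(70, 5) for _ in range(100)]
fa = [0.8 * w + 16 + rng.gauss(0, 2) for w in ia]
fb = [0.8 * w + 14 + rng.gauss(0, 2) for w in ib]
eff = ancova_effect(ia, fa, ib, fb)
print(round(eff, 1))  # near the built-in effect of 2 kg
```

Because randomization to diet occurs within halls, the within-group regression slope is the appropriate adjustment, which is the point at issue in the variants that follow.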
Variant 3
It is now decided to assign students independently to a diet. There is no attempt to block this by hall. (This would be a reasonable strategy if one believed the effect of Hall was negligible.) The degree to which numbers are balanced by diet within halls is a matter of chance. As it turns out, the disposition of students is like this.
Count
Hall   Diet3 A   Diet3 B
1           45        55
2           55        45
The code for analysis will look like this.
BLOCKSTRUCTURE Student
COVARIATE Initial
TREATMENTSTRUCTURE Diet3
ANOVA
Here Diet3 is a factor representing how the diet has been allocated.
The output includes the following:
Analysis of variance (adjusted for covariate)
Covariate: Initial Weight
Source of variation d.f.
Student stratum
Diet3 1
Covariate 1
Residual 197
Total 199
“However, for statisticians who are trained in “conventional” (i.e. modelblind) methodology and avoid using causal lenses, it is deeply paradoxical that the correct conclusion in one case would be incorrect in another, even though the data look exactly the same.” The Book of Why (3), p. 217
Well, I am not that straw man, and I suspect very few statisticians are. I was trained in the Rothamsted approach to statistics, which takes design very seriously; elucidating causes is the central objective of experimental design.
Trialists and medical statisticians will recognise Variant 1a as being a cluster randomised design, albeit a rather degenerate example of the class, since there are only two clusters. Variant 2 is a blocked parallel trial, the blocking factor being hall. The clinical trial analogy would be blocking by centre. Variant 3 is a completely randomised parallel group trial. Variant 1b is more unusual. It is theoretically possible, and I would not be surprised to find that some clinical trial analogue has been run, but I know of no examples.
Each of these four cases leads to a different analysis. It seems intuitively right that they do and John Nelder’s approach delivers a different answer for each.
However, I am not sure that Directed Acyclic Graphs (DAGs) are up to the job. I shall be happy to be proved wrong but must conclude for the moment that it is the causal analysts who will find these four cases deeply paradoxical. They may even refuse to recognise that they are different: if DAGs can’t be drawn to illustrate them, the differences don’t exist.
In fact, it is difficult to decide which of these variants the authors of The Book of Why think they are discussing. Variants 2 and 3 ought to be dismissed from their discussion, yet the proposed analysis that they offer adjusts the difference in final weights using the within-halls regression on initial weights. This is the analysis that is appropriate to Variant 3, illustrated in Figure 2. It is very similar to the analysis for Variant 1b, but the interpretation is different. Variant 1b would not permit separation of Hall and Diet effects.
Lord’s Paradox illustrates the well-known statistical phenomenon that how data arose is essential to a correct understanding of their analysis. There are lessons here for causal inference.
Other related guest posts by Senn include:
Please share your comments.
Some claim that no one attends Sunday morning (9am) sessions at the Philosophy of Science Association. But if you’re attending the PSA (in Pittsburgh), we hope you’ll falsify this supposition and come hear us (Mayo, Thornton, Glymour, Mayo-Wilson, Berger) wrestle with some rival views on the trenchant problems of multiplicity, data-dredging, and error control. Coffee and donuts for all who show up.
Multiplicity, Data-Dredging, and Error Control
November 13, 9:00 – 11:45 AM
(link to symposium on PSA website)
Speakers:
Deborah Mayo (Virginia Tech) abstract Error control and Severity
Suzanne Thornton (Swarthmore College) abstract The Duality of Parameters and the Duality of Probability
Clark Glymour (Carnegie Mellon University) abstract Good Data Dredging
Conor Mayo-Wilson (University of Washington, Seattle) abstract Bamboozled By Bonferroni
James O. Berger (Duke University) abstract Controlling for Multiplicity in Science
Summary
High-powered methods, the big data revolution, and the crisis of replication in medicine and social sciences have prompted new reflections and debates in both statistics and philosophy about the role of traditional statistical methodology in current science. Experts do not agree on how to improve reliability, and these disagreements reflect philosophical battles, old and new, about the nature of inductive-statistical evidence and the roles of probability in statistical inference. We consider three central questions:
As reforms to methodology are being debated, constructed or (in some cases) abandoned, the time is ripe to bring the perspectives of philosophers of science (Glymour, Mayo, Mayo-Wilson) and statisticians (Berger, Thornton) to reflect on these questions.
Topic Description
Multiple testing, replication and error control. The probabilities that a method leads to misinterpreting data in repeated use may be called its error probabilities. It is well known that control of the probability of a Type I error (erroneously rejecting a null hypothesis H_{0}) is invalidated by cherry-picking, p-hacking, and stopping when the data look good. If a medical researcher combs through unblinded data and selectively reports just the endpoints that show impressive drug benefit, there is a high probability of finding some statistically significant effect or other, even if none are genuine: a high error probability. The problem, for a significance tester, is that the probability of getting some small p-value, say .01, under H_{0}, is no longer .01, but can be much greater. From a Bayesian perspective, the problem is that multiple testing results in p-values being low even though the posterior probability of H_{0} is not low (on a given prior). The former suggests there is evidence against H_{0}, while the latter says there is not.
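The selective-reporting arithmetic is easy to verify: with 20 independent true-null endpoints, the chance that some test reaches p < .05 is far above .05 (a quick stdlib simulation with made-up settings):

```python
import random

def prob_some_hit(n_tests=20, alpha=0.05, trials=10000, seed=2):
    """Chance of at least one p < alpha among n_tests true-null tests,
    using the fact that p-values are uniform under H0."""
    rng = random.Random(seed)
    hits = sum(
        any(rng.random() < alpha for _ in range(n_tests))
        for _ in range(trials)
    )
    return hits / trials

print(round(prob_some_hit(), 2))  # close to 1 - 0.95**20, about 0.64
```

Reporting only the winning endpoint, the researcher's nominal .05 masks an actual error probability of roughly .64.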
Accordingly, the statistical significance tester and the Bayesian propose different ways to solve the problem. Jim Berger will argue that older frequentist solutions, such as Bonferroni and the False Discovery Rate (FDR), are inappropriate for many of today’s complex, highthroughput inquiries. He argues for a unified method that can address any such problems of multiplicity by means of the choice of objective prior probabilities of hypotheses.
Philosophical scrutiny of both older and newer solutions to the multiple test problem reveals challenges to the very assumptions for the necessity of taking account of, and adjusting for, multiplicity. Conor Mayo-Wilson shows that a prevalent argument for the Bonferroni correction, which recommends replacing a p-value threshold α with α/n when testing n independent hypotheses, can violate important axioms of evidence. Correcting error probabilities or p-values for multiple testing, he argues, should be viewed as a value judgment in deciding which hypotheses or models are worth pursuing.
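For reference, the correction at issue can be stated in one line: for n independent true-null tests at per-test threshold a, the familywise error rate is 1 - (1 - a)^n, which the α/n threshold pulls back to roughly α (illustrative arithmetic only):

```python
def fwer(n, per_test_threshold):
    """Familywise error rate for n independent true-null tests."""
    return 1 - (1 - per_test_threshold) ** n

# Uncorrected vs Bonferroni-corrected thresholds for n = 20 tests
print(round(fwer(20, 0.05), 3))       # about 0.642
print(round(fwer(20, 0.05 / 20), 3))  # about 0.049: FWER capped near alpha
```

Whether that cap is worth its cost in Type II errors is precisely the value judgment Mayo-Wilson highlights.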
Using the same data to construct and stringently test causal relationships. Under the guise of fixing the problem of selective reporting, it is increasingly recommended that scientists predesignate all details of experimental procedure, number of tests run, and rules for collecting and analyzing data in advance of the experiment. Clark Glymour asks if predesignation comes at the cost of high Type II error probability (erroneously failing to find effects) and lost opportunities for discovery. In contemporary science, Glymour argues, in which the number of variables is large in comparison to the sample size, principled search algorithms can be invaluable. Some of the leading research areas of machine learning and AI develop “post-selection inferences” that violate the rule against finding one’s hypothesis in the data. These adaptive methods attempt to arrive at reliable results by compensating for the fact that the model was picked in a data-dependent way, using methods such as cross-validation, simulation, and bootstrapping. Glymour argues that some of these methods are a form of “severe testing” of their output, whereas commonly used regression methods are actually “bad” data dredging methods that do not severely test their results. For both frequentist and Bayesian statistics, search procedures press epistemic issues about how using observational data to try to reach beyond experimental possibilities should be evaluated for accuracy and reliability. We suggest, in each of our contributions, some principled ways to distinguish “bad” from “good” data dredging.
Error probabilities and epistemic assessments. Controversies between Bayesian and frequentist methods reflect different answers to the question of the role of probability in inference—to supply a measure of belief or support in hypotheses? or to control a method’s error probabilities? While a criticism often leveled at Type I and II error probabilities is they do not give direct assessments of epistemic probability, Bayesians are also often keen to show their methods have good performance in repeated sampling. Can the performance of a method under hypothetical uses also supply epistemically relevant measures of belief, confidence or corroboration? Suzanne Thornton presents new developments toward an affirmative answer by means of confidence distributions (CD) which provide confidence intervals for parameters at any level of confidence, not just the typical .95. Even regarding a parameter as fixed, say the mean deflection of light, we can calibrate how reliably a method enables finding out about its values. In this sense, she argues, parameters play a dual role—a possible key to reconciling approaches.
Deborah Mayo’s idea is to view a method’s ability to control erroneous interpretations of data as measuring its capability to probe errors. In her view, we have evidence for a claim just to the extent that it has been subjected to and passes a test that would probably have found it false, if it is. This probability is the stringency or severity with which it has passed the test. On the severity view, the question of whether, and when, to adjust a statistical method’s error probabilities in the face of multiple testing and data-dredging (debated by Berger, Glymour, and Mayo-Wilson) is directly connected to the relevance of error control for qualifying a particular statistical inference (discussed by Thornton). Thus a platform for connecting the five contributions emerges.
Our goal is to channel some of the sparks that grow out of our contrasting views to vividly illuminate the issues, and point to the directions for new interdisciplinary work.
From what standpoint should we approach the statistics wars? That’s the question from which I launched my presentation at the Statistics Wars and Their Casualties workshop (philstatwars.com). In my view, it should be approached not from the standpoint of technical disputes, but from the non-technical standpoint of the skeptical consumer of statistics (see my slides here). What should we do now as regards the controversies and conundrums growing out of the statistics wars? We should not leave off the discussions of our workshop without at least sketching a future program for answering this question. We still have two more sessions, December 1 and 8, but I want to prepare us for the final discussions, which should look beyond a single workshop. (The slides and videos from the presenters in Sessions 1 and 2 can be found here.)
I will consider three interrelated responsibilities and tasks that we can undertake as statistical activist citizens. In so doing I will refer to presentations from the workshop, limiting myself to Session #1. (I will add more examples in part (ii) of this post.)
1. Keep alert to ongoing evidence policy “reforms”. Scrutinize attempts to replace designs and methods that ensure error control with alternatives that actually make it harder to achieve error control. Be on the lookout for methods that presuppose a principle of evidence on which error probabilities drop out: the Likelihood Principle (LP). While they’re unlikely to be described that way, ask journals/authors etc. directly if the LP is being presupposed. Write letters to editors asking how the proposed change in method benefits (rather than hurts) the skeptical statistical consumer.
In slide #64 of my presentation, I proposed that in the context of the skeptical consumer of statistics, methods should be:
directly altered by biasing selection effects
able to falsify claims statistically
able to test statistical model assumptions
able to block inferences that violate minimal severity
If someone is trying to sell you a reform where any of these are lacking, you might wish to hold off buying.
In reacting to proposed Bayesian replacements for error statistical methods, ask how they are arrived at, what they mean, and how to check them. Here’s a slide (#22) from Stephen Senn on the various types of Bayesian approaches (Slides from Senn presentation)
2. Reject the howlers and caricatures of error statistical methods that are the basis of the vast majority of criticisms against them. Typical examples are claims that either P-values must be misinterpreted as posterior probabilities or else they are irrelevant for science. Resist popular mantras that error statistical control is only relevant to ensure ‘quality control’, apt for such contexts as needing to avoid the acceptance of a batch of bolts with too high a proportion of defectives, but not for science. The supposition that the choice is either “belief or performance” is to commit a false dilemma fallacy. Admittedly, what is still needed is a clear articulation of uses of error statistical methods that reflect what Cox, E. Pearson, Birnbaum, Giere (and others, including Mayo) dub the “evidential” vs the “behavioristic” uses of tests. Scientists use error statistical tests to appraise, develop, and answer questions about theories and models (e.g., could the observed effect readily be due to sampling variability? Could the data have been generated by a process approximately represented by model M? Is the data-model fit ‘too good to be true’?).
By the way, founders of error statistical methods never claimed low Type I and Type II error probabilities suffice for warranted inference. Observe that a statistically significant result inseverely passes an alternative H’ against which a test has high power. (Mayo, slide #50):
3. No preferential treatment for one methodology or philosophy. Developing author and journal guidelines for avoiding problematic uses of Bayesian (and other) methods is long overdue. In many journals, authors are warned to avoid classic fallacies of statistical significance: statistical significance is not substantive importance; P-values aren’t effect size measures; a non-statistically significant difference isn’t evidence of no difference; a P-value is not a posterior probability of H_{0}; biasing selection effects (cherry picking, optional stopping, multiple testing, etc.) can alter a method’s error probabilities. Have you ever seen guidelines that alert authors to fallacies of methods that are recommended as replacements for statistical significance tests? (Let me know if you have.) It’s time.[0]
Let’s focus here on one of the rivals that arose in several presentations: Bayes factors (BFs).[1] We can begin with the uses to which they are routinely put, especially in the service of critiques of statistical significance tests. BFs do not satisfy the four requirements I list at the outset. Two main problems arise.
Problem #1: High probability of erroneous claims of evidence against a hypothesis H_{0}. Because error probabilities drop out of Bayes factors, the ability to control them goes by the wayside. Stat activists should uncover how biasing selection effects (e.g., multiple testing, data-dredging, optional stopping) might adversely affect a method’s ability to have uncovered mistaken interpretations of data. Part of what I have in mind is an active research area in genomics, machine learning (ML) and big data science under terms such as ‘post-data selective inference’. The skeptical statistical consumer should be aware of how data-dependent methods can succeed or badly fail in some of the ML algorithms that affect them in medical, legal, and a host of social policies.
One of many well-known examples involves optional stopping in the context of a type of example the BF advocate often recommends: two-sided testing of the mean of a Normal distribution.[2] This example “is enough to refute the strong likelihood principle” (Cox 1978, p. 54), since, with high probability, it will stop with a “nominally” significant result even though the point null hypothesis is true. It contradicts what Cox and Hinkley call “the weak repeated sampling principle” (see SIST 2018, p. 45, or Mayo slides).
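Cox’s point can be checked by simulation (my own sketch, with made-up trial counts): testing after every new N(0,1) observation, the probability of having reached nominal significance at some interim look keeps climbing well past .05, even though the null is true.

```python
import random
from math import sqrt

def p_stop_by(n_max, z_crit=1.96, trials=2000, seed=3):
    """Probability of reaching nominal significance (|z| > z_crit) at SOME
    interim look, sampling N(0,1) data one point at a time (H0 true)."""
    rng = random.Random(seed)
    stopped = 0
    for _ in range(trials):
        s = 0.0
        for n in range(1, n_max + 1):
            s += rng.gauss(0, 1)
            if abs(s) / sqrt(n) > z_crit:   # nominal .05-level test at look n
                stopped += 1
                break
    return stopped / trials

for n_max in (10, 100, 1000):
    print(n_max, p_stop_by(n_max))  # rises well past 0.05 as looks accumulate
```

Sampling long enough, the crossing probability approaches 1: “sampling to a foregone conclusion.”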
Inference by Bayes’ theorem entails the LP, so one must either give up error probability control or accept Bayesian “incoherence”.
Interestingly, Richard Morey, a leading developer of BFs, also focused his presentation on how BFs preclude satisfying error statistical severity. But he does not look at false rejections due to biasing selection effects. Rather, Morey shows that BFs allow erroneously accepting hypothesis H_{0} with high probability (see Morey slides here). Call this problem #2.
Problem #2: Even a statistically significant difference from H_{0}—a low P-value—can, according to a BF computation, become evidence in favor of H_{0} by assigning it a high enough prior degree of belief, especially coupled with a suitable choice of alternative.[3] See Morey’s conclusions in his slides.
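A toy computation illustrates the point (the conjugate setup and the numbers are my sketch, not Morey’s): with a spiked prior on the point null H_{0}: μ = 0 and a N(0, τ²) prior on μ under H_{1}, the Bayes factor in favor of H_{0} at a fixed, just-significant z = 1.96 grows without bound as the sample size grows—the Jeffreys–Lindley effect.

```python
from math import sqrt, pi, exp

def bf01(z, n, tau=1.0, sigma=1.0):
    """Bayes factor in favor of the point null H0: mu = 0 against
    H1: mu ~ N(0, tau^2), given an observed mean xbar = z*sigma/sqrt(n).
    Marginal of xbar: N(0, sigma^2/n) under H0; N(0, sigma^2/n + tau^2) under H1."""
    xbar = z * sigma / sqrt(n)
    se2 = sigma**2 / n
    def norm_pdf(x, var):
        return exp(-x * x / (2 * var)) / sqrt(2 * pi * var)
    return norm_pdf(xbar, se2) / norm_pdf(xbar, se2 + tau**2)

# Same "just significant" result (z = 1.96, p ~ 0.05) at growing n:
for n in (100, 10_000, 1_000_000):
    print(f"n = {n:>9}: BF01 = {bf01(1.96, n):.1f}")  # BF01 grows with n
```

So the very same nominally significant result that rejects H_{0} at the 5% level is converted, at large n, into strong “evidence” for H_{0}—the conversion is driven by the prior assignments, not by anything the data did.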
What needs to be done: If a BF purports to supply evidence for a point null, compute, as does Morey, the probability the assignments used would find as much or even more evidence for H_{0} as it does, even though in fact there’s a discrepancy δ from H_{0}, for δ a discrepancy of interest (a substantive issue). That is, compute the probability of a Type 2 error in relation to alternatives of the form: there’s a discrepancy at least δ from H_{0}. Simple apps can be given for computing this (a twist on Morey’s own severity app would do). If this error probability is not low, then not only should you reject the inference to the point hypothesis H_{0}; even a claim that any discrepancy from H_{0} is less than δ (whether 1- or 2-sided) is unwarranted. Other analogous ways can and have been developed to critically evaluate inferences based on BFs. That is what an adequate metamethodology demands from the standpoint of the skeptical critical consumer of statistics.
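In the same illustrative conjugate setup as above (my sketch, not Morey’s app), this Type 2 error probability has a closed form: a BF of at least k in favor of H_{0} corresponds to |x̄| falling below a cutoff c, so one only needs the Normal probability of that event when the true mean is δ. The choices k = 3, τ = 1, and n = 1000 are illustrative.

```python
from math import sqrt, log, erf

def prob_bf_favors_null(delta, n, k=3.0, tau=1.0, sigma=1.0):
    """Probability that BF01 >= k ('evidence for H0: mu = 0') when the
    true mean is actually delta -- a Type 2 error probability for the
    BF construed as a test.  Model: xbar ~ N(mu, sigma^2/n), with the
    H1 prior mu ~ N(0, tau^2)."""
    se2 = sigma**2 / n
    ratio = sqrt((se2 + tau**2) / se2)   # value of BF01 at xbar = 0 (its maximum)
    if ratio <= k:
        return 0.0                        # BF01 can never reach k
    a = 1 / se2 - 1 / (se2 + tau**2)
    c = sqrt(2 * log(ratio / k) / a)      # BF01 >= k  iff  |xbar| <= c
    se = sqrt(se2)
    Phi = lambda x: 0.5 * (1 + erf(x / sqrt(2)))  # standard Normal CDF
    return Phi((c - delta) / se) - Phi((-c - delta) / se)

for d in (0.05, 0.10):
    p = prob_bf_favors_null(delta=d, n=1000)
    print(f"P(BF01 >= 3 | true mu = {d}) = {p:.2f}")  # substantial for modest delta
```

If this probability is high for a δ of substantive interest, the BF’s “acceptance” of H_{0} has passed no severe test: the method would readily have found such evidence for H_{0} even were the δ discrepancy real.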
Morey begins by remarking that he himself has developed popular methods for computing BFs, so it is especially meaningful that he concedes their inability to sustain severity. He does an excellent job. One thing the skeptical statistical consumer will want to know is whether Morey alerts the user to these consequences. The user needs to see precisely what he so clearly shows us in his presentation: applying the defaults can have seriously problematic error statistical consequences. If he hasn’t already, I propose Morey include examples of this in his next BF computer package.[4]
The bottom line is: we need to erect guidelines to ward off the bad statistics that can easily result from rivals to error statistical methods, and encourage journals to include such guidelines, especially now that these alternatives are becoming so prevalent. That a method is Bayesian should not make it above reproach, as if it’s a protected class.[5]
——–
[0] In the open discussion in Session 1, Mark Burgman, editor of Conservation Biology, seemed surprised at my suggestion that he add to the guidelines in his journal caveats directed at methods other than statistical significance tests and P-values, including confidence intervals and Bayes factors. I am surprised at his surprise (but perhaps I misunderstood his reaction).
[1] As I say in my presentation, there are other goals in statistics. We are not always trying to critically probe what is the case. In some contexts, we might merely be developing hypotheses and models for subsequent severe probing. However, it’s important to see that even weaker claims such as “this model is worth probing further” need to be probed with reasonable severity. Severity provides a minimal requirement of evidence for any type of claim, in other words. Moreover, I should note, there are Bayesians who reject BFs and criticize the spiked priors to point null hypotheses. Their objections should be part of the remedy for problem #2. “To me, Bayes factors correspond to a discrete view of the world, in which we must choose between models A, B, or C” (Gelman 2011, p. 74) or a weighted average of them as in Madigan and Raftery (1994).
[2] The error statistical tester, but also some Bayesians, eschew these two-sided tests as artificial—particularly when paired with the lump of prior placed on the point null hypothesis. However, they are the mainstay of the examples currently relied on in launching criticisms of statistical significance tests.
[3] Strictly speaking, BFs do not supply evidence for or against hypotheses—they yield only comparative claims, e.g., the data support or fit one hypothesis or model better than another. Morey speaks of “accepting H_{0}”, and that is entirely in sync with the way BF advocates purport to use BFs—namely, as tests. Ironically, while BF enthusiasts (like Likelihoodists) are at one in criticizing P-values because they are not comparative (and thus, according to them, cannot supply evidence), BF advocates strive to turn their own comparative accounts into tests, by allowing values of the BF to count as accepting or rejecting statistical hypotheses. The trouble is that it is often forgotten that they were really only entitled to a comparative claim. Construed as tests, error probabilities can be high. Moreover, as Morey points out, both claims under comparison can be poorly warranted. Further, the value of the BF—essentially a likelihood ratio—doesn’t mean the same thing in different contexts, especially if hypotheses are data dependent. As Morey points out, for the BF advocate, the value of the BF is the evidence measure, whereas for error statisticians, such “fit” measures only function as statistics to which we would need to attach a sampling distribution. By contrast, sampling distributions are rejected as irrelevant post-data by the BF advocate.
[4] Such consequences are often hidden by Bayesians behind the cover of: we are warranted in a high spiked prior degree of belief in H_{0} because nature is “simple” or we are being “conservative”. The former is a presumed metaphysics, not evidence. As for the latter, consider where the null hypothesis asserts “no serious toxicity exists”. Assigning H_{0} a high prior is quite the opposite of taking precautions. Even if one would be correct to doubt the existence of the effect, that is very different from having evidential reasons for this. One may reject the effect for the wrong reasons.
[5] All slides and videos from Sessions 1 and 2 can be found on this post.