Error Statistics

Excursion 1 Tour I (2nd Stop): Probabilism, Performance, and Probativeness (1.2)


Readers: Last year at this time I gave the Neyman Seminar at Berkeley and posted on a panel discussion we had. There were lots of great questions and follow-ups. Here’s a link.

“I shall be concerned with the foundations of the subject. But in case it should be thought that this means I am not here strongly concerned with practical applications, let me say right away that confusion about the foundations of the subject is responsible, in my opinion, for much of the misuse of the statistics that one meets in fields of application such as medicine, psychology, sociology, economics, and so forth”. (George Barnard 1985, p. 2)

While statistical science (as with other sciences) generally goes about its business without attending to its own foundations, implicit in every statistical methodology are core ideas that direct its principles, methods, and interpretations. I will call this its statistical philosophy. To tell what’s true about statistical inference, understanding the associated philosophy (or philosophies) is essential. Discussions of statistical foundations tend to focus on how to interpret probability, and much less on the overarching question of how probability ought to be used in inference. Assumptions about the latter lurk implicitly behind debates, but rarely get the limelight. If we put the spotlight on them, we see that there are two main philosophies about the roles of probability in statistical inference: We may dub them performance (in the long run) and probabilism. Continue reading

Categories: Error Statistics | Leave a comment

Response to Ben Recht’s post (“What is Statistics’ Purpose?”) on my Neyman seminar (ii)


There was a very valuable panel discussion after my October 9 Neyman Seminar in the Statistics Department at UC Berkeley. I want to respond to many of the questions put forward by the participants (Ben Recht, Philip Stark, Bin Yu, Snow Zhang) that we did not address during that panel. Slides from my presentation, “Severity as a basic concept of philosophy of statistics,” are at the end of this post (but with none of the animations). I begin in this post by responding to Ben Recht, a professor of Artificial Intelligence and Computer Science at Berkeley, and his recent blogpost on my talk, What is Statistics’ Purpose? On severe testing, regulation, and butter passing. I will consider: (1) a complex or leading question; (2) why I chose to focus on Neyman’s philosophy of statistics; and (3) what the “100 years of fighting and browbeating” were/are all about. Continue reading

Categories: affirming the consequent, Ben Recht, Neyman, P-values, Severity, statistical significance tests, statistics wars | 10 Comments

Excursion 1 Tour I (2nd Stop): Probabilism, Performance, and Probativeness (1.2)


Readers: I gave the Neyman Seminar at Berkeley last Wednesday, October 9, and had been so busy preparing it that I did not update my leisurely cruise for October. This is the second stop. I will shortly post remarks on the panel discussion that followed my Neyman talk (with panelists Ben Recht, Philip Stark, Bin Yu, and Snow Zhang), which was quite illuminating.

“I shall be concerned with the foundations of the subject. But in case it should be thought that this means I am not here strongly concerned with practical applications, let me say right away that confusion about the foundations of the subject is responsible, in my opinion, for much of the misuse of the statistics that one meets in fields of application such as medicine, psychology, sociology, economics, and so forth”. (George Barnard 1985, p. 2)

Continue reading

Categories: Error Statistics | Leave a comment

The leisurely cruise begins: Excerpt from Excursion 1 Tour I of Statistical Inference as Severe Testing (SIST)

Ship Statinfasst

Excerpt from Excursion 1 Tour I: Beyond Probabilism and Performance: Severity Requirement (1.1)

NOTE: The following is an excerpt from my book Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (CUP, 2018). For any new reflections or corrections, I will use the comments. The initial announcement is here.

I’m talking about a specific, extra type of integrity that is [beyond] not lying, but bending over backwards to show how you’re maybe wrong, that you ought to have when acting as a scientist. (Feynman 1974/1985, p. 387)

Continue reading

Categories: Error Statistics | Leave a comment

5-year review: “Les stats, c’est moi”: We take that step here! (Adopt our fav word or phil stat!)(iii)


les stats, c’est moi

This is the last of the selected posts I will reblog from 5 years ago on the 2019 statistical significance controversy. The original post, published on this blog on December 13, 2019, had 85 comments, so you might find them of interest.  I invite readers to share their thoughts as to where the field is now, in relation to that episode, and to alternatives being used as replacements for statistical significance tests. Use the comments and send me guest posts.  Continue reading

Categories: 5-year memory lane, Error Statistics, statistical significance tests | Leave a comment

Preregistration, promises and pitfalls, continued v2


In my last post, I sketched some first remarks I would have made had I been able to travel to London to fulfill my invitation to speak at a Royal Society conference, March 4 and 5, 2024, on “the promises and pitfalls of preregistration.” This is a continuation. It’s a welcome consequence of today’s statistical crisis of replication that some social sciences are taking a page from medical trials and calling for preregistration of sampling protocols and full reporting. In 2018, Brian Nosek and others wrote of the “Preregistration Revolution”, as part of open science initiatives. Continue reading

Categories: Bayesian/frequentist, Likelihood Principle, preregistration, Severity | 3 Comments

Princeton talk: Statistical Inference as Severe Testing: Beyond Probabilism and Performance

On November 14, I gave a talk at the Seminar in Advanced Research Methods for the Department of Psychology, Princeton University.

“Statistical Inference as Severe Testing: Beyond Probabilism and Performance”

The video of my talk is below along with the slides. It reminds me to return to a half-written paper replying to “A Bayesian Perspective on Severity” (van Dongen, Sprenger, and Wagenmakers 2022). These authors claim that Bayesians can satisfy severity “regardless of whether the test has been conducted in a severe or less severe fashion”, but what they mean is that data can be much more probable on hypothesis H1 than on H0: the Bayes factor can be high. However, “severity” can be satisfied in their comparative (subjective) Bayesian sense even for claims that are poorly probed in the error statistical sense (slides 55-6). Share your comments. Continue reading
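To see the worry in miniature (a toy illustration of my own, not an example from their paper): with a single observation x from a Normal(μ, 1) model, the Bayes factor in favor of an alternative chosen after the data, H1: μ = x, over H0: μ = 0 is automatically greater than 1, however the test was conducted.

    from scipy.stats import norm

    x = 2.0  # a single observation from N(mu, 1); the value is illustrative

    # Bayes factor for the data-dredged alternative H1: mu = x over H0: mu = 0.
    # Because H1 is chosen to fit the data, the ratio equals exp(x**2 / 2):
    # it exceeds 1 for every nonzero x, so H0 can never come out ahead.
    bf = norm.pdf(x, loc=x, scale=1) / norm.pdf(x, loc=0, scale=1)
    print(round(bf, 2))  # 7.39 = exp(2.0**2 / 2)

The comparative “support” is high by construction, yet nothing has been done that could have found H1 false, which is exactly the error statistical complaint.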

Categories: Severity, Severity vs Posterior Probabilities | Leave a comment

David R. Cox Foundations of Statistics Award

Link to announcement on ASA website.

First Winner

Nancy Reid


Nancy Reid
University of Toronto

For contributions to the foundations of statistics that significantly advanced the frontiers of statistics and for insight that transformed understanding of parametric statistical inference, Nancy Reid is the inaugural recipient of the David R. Cox Foundations of Statistics Award, presented by the American Statistical Association (ASA). Reid will formally receive the award and deliver a lecture at the Joint Statistical Meetings in Toronto in August. Continue reading

Categories: Error Statistics | Leave a comment

THE STATISTICS WARS AND THEIR CASUALTIES VIDEOS & SLIDES FROM SESSIONS 3 & 4

Below are the videos and slides from the 7 talks from Session 3 and Session 4 of our workshop The Statistics Wars and Their Casualties held on December 1 & 8, 2022. Session 3 speakers were: Daniele Fanelli (London School of Economics and Political Science), Stephan Guttinger (University of Exeter), and David Hand (Imperial College London). Session 4 speakers were: Jon Williamson (University of Kent), Margherita Harris (London School of Economics and Political Science), Aris Spanos (Virginia Tech), and Uri Simonsohn (Esade Ramon Llull University). Abstracts can be found here. In addition to the talks, you’ll find (1) a Recap of recaps at the beginning of Session 3 that provides a summary of Sessions 1 & 2, and (2) Mayo’s (5 minute) introduction to the final discussion, “Where do we go from here (Part ii)”, at the end of Session 4.

The videos & slides from Sessions 1 & 2 can be found on this post.

Readers are welcome to use the comments section on the PhilStatWars.com workshop blog post here to make constructive comments or to ask questions of the speakers. If you’re asking a question, indicate to which speaker(s) it is directed. We will leave it to speakers to respond. Thank you! Continue reading

Categories: Error Statistics | Leave a comment

Where should stat activists go from here? (part (i))


From what standpoint should we approach the statistics wars? That’s the question from which I launched my presentation at the Statistics Wars and Their Casualties workshop (phil-stat-wars.com). In my view, it should be, not from the standpoint of technical disputes, but from the non-technical standpoint of the skeptical consumer of statistics (see my slides here). What should we do now as regards the controversies and conundrums growing out of the statistics wars? We should not leave off the discussions of our workshop without at least sketching a future program for answering this question. We still have 2 more sessions, December 1 and 8, but I want to prepare us for the final discussions which should look beyond a single workshop. (The slides and videos from the presenters in Sessions 1 and 2 can be found here.)

I will consider three interrelated responsibilities and tasks that we can undertake as statistical activist citizens. In so doing I will refer to presentations from the workshop, limiting myself to session #1. (I will add more examples in part (ii) of this post.) Continue reading

Categories: Error Statistics, significance tests, stat wars and their casualties | Leave a comment

My Slides from the workshop: The statistics wars and their casualties


I will be writing some reflections on our two workshop sessions on this blog soon, but for now, here are just the slides I used on Thursday, 22 September. If you wish to ask a question of any of the speakers, use the blogpost at phil-stat-wars.com. The slides from the other speakers will also be up there on Monday.

Deborah G. Mayo’s slides from the workshop: The Statistics Wars and Their Casualties, Session 1, on September 22, 2022.

Categories: Error Statistics | 3 Comments

22-23 September final schedule for workshop: The statistics wars and their casualties ONLINE

The Statistics Wars
and Their Casualties

Final Schedule for September 22 & 23 (Workshop Sessions 1 & 2) Continue reading

Categories: Error Statistics | Leave a comment

22-23 Workshop Schedule: The Statistics Wars and Their Casualties: ONLINE

You can still register: https://phil-stat-wars.com/2022/09/19/22-23-september-workshop-schedule-the-statistics-wars-and-their-casualties/ Continue reading

Categories: Error Statistics | 1 Comment

Behavioral vs Evidential Interpretations of N-P tests: E.S. Pearson’s Statistical Philosophy: Belated Birthday Wish

E.S. Pearson

This is a belated birthday post for E.S. Pearson (11 August 1895 – 12 June 1980), one of my statistical heroes. It’s basically a post from 2012 which concerns an issue of interpretation (long-run performance vs probativeness) that’s badly confused these days. Yes, I know I’ve been neglecting this blog as of late, because I’m busy planning our workshop: The Statistics Wars and Their Casualties (22-23 September, online). See phil-stat-wars.com. I will reblog some favorite Pearson posts in the next few days.

HAPPY BELATED BIRTHDAY EGON!

Are methods based on error probabilities of use mainly to supply procedures which will not err too frequently in some long run (performance)? Or is it the other way round: that the control of long-run error properties is of crucial importance for probing the causes of the data at hand (probativeness)? I say no to the former and yes to the latter. This, I think, was also the view of Egon Sharpe (E.S.) Pearson.

Cases of Type A and Type B

“How far then, can one go in giving precision to a philosophy of statistical inference?” (Pearson 1947, 172)

Pearson considers the rationale that might be given to N-P tests in two types of cases, A and B:

“(A) At one extreme we have the case where repeated decisions must be made on results obtained from some routine procedure…

(B) At the other is the situation where statistical tools are applied to an isolated investigation of considerable importance…?” (ibid., 170)

In cases of type A, long-run results are clearly of interest, while in cases of type B, repetition is impossible and may be irrelevant:

“In other and, no doubt, more numerous cases there is no repetition of the same type of trial or experiment, but all the same we can and many of us do use the same test rules to guide our decision, following the analysis of an isolated set of numerical data. Why do we do this? What are the springs of decision? Is it because the formulation of the case in terms of hypothetical repetition helps to that clarity of view needed for sound judgment?

Or is it because we are content that the application of a rule, now in this investigation, now in that, should result in a long-run frequency of errors in judgment which we control at a low figure?” (Ibid., 173)

Although Pearson leaves this tantalizing question unanswered, claiming “On this I should not care to dogmatize”, studying how he treats cases of type B makes it evident that, in his view, “the formulation of the case in terms of hypothetical repetition helps to that clarity of view needed for sound judgment” in learning about the particular case at hand.

“Whereas when tackling problem A it is easy to convince the practical man of the value of a probability construct related to frequency of occurrence, in problem B the argument that ‘if we were to repeatedly do so and so, such and such result would follow in the long run’ is at once met by the commonsense answer that we never should carry out a precisely similar trial again.

Nevertheless, it is clear that the scientist with a knowledge of statistical method behind him can make his contribution to a round-table discussion…” (Ibid., 171).

Pearson gives the following example of a case of type B (from his wartime work), where he claims no repetition is intended:

“Example of type B. Two types of heavy armour-piercing naval shell of the same caliber are under consideration; they may be of different design or made by different firms…. Twelve shells of one kind and eight of the other have been fired; two of the former and five of the latter failed to perforate the plate….”(Pearson 1947, 171) 

“Starting from the basis that individual shells will never be identical in armour-piercing qualities, however good the control of production, he has to consider how much of the difference between (i) two failures out of twelve and (ii) five failures out of eight is likely to be due to this inevitable variability…” (Ibid.)

We’re interested in considering what other outcomes could have occurred, and how readily, in order to learn what variability alone is capable of producing. As a noteworthy aside, Pearson shows that treating the observed difference (between the two proportions) in one way yields an observed significance level of 0.052; treating it differently (along Barnard’s lines), he gets 0.025 as the (upper) significance level. But in scientific cases, Pearson insists, the difference in error probabilities makes no real difference to substantive judgments in interpreting the results. Only in an unthinking, automatic, routine use of tests would it matter:

“Were the action taken to be decided automatically by the side of the 5% level on which the observation point fell, it is clear that the method of analysis used would here be of vital importance. But no responsible statistician, faced with an investigation of this character, would follow an automatic probability rule.” (ibid., 192)

The two analyses correspond to the tests effectively asking different questions, and if we recognize this, says Pearson, different meanings may be appropriately attached.
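For readers who want to check the arithmetic: the two treatments correspond roughly to a conditional analysis (fixing both margins of the 2×2 table) and an unconditional one along Barnard’s lines. A sketch using scipy’s modern implementations, whose conventions need not reproduce Pearson’s 1947 figures exactly:

    import numpy as np
    from scipy.stats import fisher_exact, barnard_exact

    # Columns: shell type 1, shell type 2; rows: [failed to perforate, perforated]
    table = np.array([[2,  5],    # failures: 2 of 12 vs 5 of 8
                      [10, 3]])   # perforations

    # Conditional analysis (both margins fixed), one-sided alternative that
    # type 1 fails less often; this reproduces Pearson's 0.052.
    _, p_conditional = fisher_exact(table, alternative="less")

    # Unconditional analysis along Barnard's lines; Pearson reports 0.025,
    # though scipy's conventions need not match his calculation exactly.
    p_unconditional = barnard_exact(table, alternative="less").pvalue

    print(round(p_conditional, 3), round(p_unconditional, 3))

The conditional figure is the 0.052 Pearson reports; treat the second number as a stand-in for his Barnard-style 0.025, which used the conventions of the time.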

Three Steps in the Original Construction of Tests

After setting up the test (or null) hypothesis, and the alternative hypotheses against which “we wish the test to have maximum discriminating power” (Pearson 1947, 173), Pearson defines three steps in specifying tests:

“Step 1. We must specify the experimental probability set, the set of results which could follow on repeated application of the random process used in the collection of the data…

Step 2. We then divide this set [of possible results] by a system of ordered boundaries…such that as we pass across one boundary and proceed to the next, we come to a class of results which makes us more and more inclined, on the information available, to reject the hypothesis tested in favour of alternatives which differ from it by increasing amounts”.

“Step 3. We then, if possible[i], associate with each contour level the chance that, if [the null] is true, a result will occur in random sampling lying beyond that level” (ibid.).

Pearson warns that:

“Although the mathematical procedure may put Step 3 before 2, we cannot put this into operation before we have decided, under Step 2, on the guiding principle to be used in choosing the contour system. That is why I have numbered the steps in this order.” (Ibid. 173).
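To make the three steps concrete, here is a deliberately simple sketch of my own (not Pearson’s example): testing H0: p = 0.5 against alternatives p > 0.5 with n = 20 Bernoulli trials.

    import numpy as np
    from scipy.stats import binom

    n, p0 = 20, 0.5

    # Step 1: the experimental probability set -- the possible results
    # (success counts 0..n) of repeated application of the random process.
    outcomes = np.arange(n + 1)

    # Step 2: divide the set by ordered boundaries; for the one-sided
    # alternatives p > 0.5, larger counts incline us more toward rejection,
    # so the ordering is by the count itself.

    # Step 3: associate with each contour level the chance, under H0, that
    # a result lies beyond that level (the tail probability).
    beyond = {int(k): binom.sf(k - 1, n, p0) for k in outcomes}
    print(round(beyond[15], 4))  # P(X >= 15 | H0) = 0.0207

Here step 2’s ordering is obvious; the point of Pearson’s numbering is that the ordering must be settled before the tail areas computed in step 3 mean anything.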

Strict behavioristic formulations jump from step 1 to step 3, after which one may calculate how the test has in effect accomplished step 2.  However, the resulting test, while having adequate error probabilities, may have an inadequate distance measure and may even be irrelevant to the hypothesis of interest. This is one reason critics can construct howlers that appear to be licensed by N-P methods, and which make their way from time to time into this blog.

So step 3 remains crucial, even for cases of type [B]. There are two reasons: the first, pre-data planning, is familiar enough; the second is post-data scrutiny. Post-data, step 3 enables determining the capability of the test to have detected various discrepancies, departures, and errors, on which a critical scrutiny of the inferences is based. More specifically, the error probabilities are used to determine how well/poorly corroborated, or how severely tested, various claims are, post-data.
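To illustrate that post-data use with a stock example of my own (a one-sided Normal test rather than Pearson’s 2×2 table, all numbers illustrative): after observing mean xbar in a test of H0: μ ≤ 0, the severity with which the claim μ > μ1 passes is the probability of a result less extreme than xbar, were μ equal to μ1.

    from scipy.stats import norm

    sigma, n, xbar = 1.0, 25, 0.4   # illustrative values
    se = sigma / n ** 0.5           # standard error = 0.2

    def severity(mu1):
        """SEV(mu > mu1): the probability of a result less extreme than
        xbar were mu only mu1; high values mean the claim passed a
        stringent probe."""
        return norm.cdf((xbar - mu1) / se)

    for mu1 in (0.0, 0.2, 0.4):
        print(mu1, round(severity(mu1), 3))
    # mu > 0.0 passes with severity ~0.977; mu > 0.4 with only 0.5

The claim μ > 0 passes severely, while μ > 0.4 barely passes at all: precisely the sort of post-data discrimination that step 3 supplies.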

If we can readily bring about statistically significantly higher rates of success with the first type of armour-piercing naval shell than with the second (in the above example), we have evidence the first is superior. Or, as Pearson modestly puts it: the results “raise considerable doubts as to whether the performance of the [second] type of shell was as good as that of the [first]….” (Ibid., 192)[ii]

Still, while error rates of procedures may be used to determine how severely claims have/have not passed, they do not automatically do so—hence, again, opening the door to potential howlers that neither Egon nor Jerzy, for that matter, would have countenanced.

Neyman Was the More Behavioristic of the Two

Pearson was (rightly) considered to have rejected the more behaviorist leanings of Neyman.

Here’s a snippet from an unpublished letter he wrote to Birnbaum (1974) about the idea that the N-P theory admits of two interpretations: behavioral and evidential:

“I think you will pick up here and there in my own papers signs of evidentiality, and you can say now that we or I should have stated clearly the difference between the behavioral and evidential interpretations. Certainly we have suffered since in the way the people have concentrated (to an absurd extent often) on behavioral interpretations”.

In Pearson’s (1955) response to Fisher (blogged here):

“To dispel the picture of the Russian technological bogey, I might recall how certain early ideas came into my head as I sat on a gate overlooking an experimental blackcurrant plot….!” (Pearson 1955, 204)

“To the best of my ability I was searching for a way of expressing in mathematical terms what appeared to me to be the requirements of the scientist in applying statistical tests to his data. After contact was made with Neyman in 1926, the development of a joint mathematical theory proceeded much more surely; it was not till after the main lines of this theory had taken shape with its necessary formalization in terms of critical regions, the class of admissible hypotheses, the two sources of error, the power function, etc., that the fact that there was a remarkable parallelism of ideas in the field of acceptance sampling became apparent. Abraham Wald’s contributions to decision theory of ten to fifteen years later were perhaps strongly influenced by acceptance sampling problems, but that is another story.“ (ibid., 204-5).

“It may be readily agreed that in the first Neyman and Pearson paper of 1928, more space might have been given to discussing how the scientific worker’s attitude of mind could be related to the formal structure of the mathematical probability theory….Nevertheless it should be clear from the first paragraph of this paper that we were not speaking of the final acceptance or rejection of a scientific hypothesis on the basis of statistical analysis…. Indeed, from the start we shared Professor Fisher’s view that in scientific enquiry, a statistical test is ‘a means of learning’…” (Ibid., 206)

“Professor Fisher’s final criticism concerns the use of the term ‘inductive behavior’; this is Professor Neyman’s field rather than mine.” (Ibid., 207)

These points on Pearson are discussed in more depth in my book Statistical Inference as Severe Testing (SIST): How to Get Beyond the Statistics Wars (CUP 2018). You can read and download the entire book for free during the month of August 2022 at the following link:

https://www.cambridge.org/core/books/statistical-inference-as-severe-testing/D9DF409EF568090F3F60407FF2B973B2


References:

Pearson, E. S. (1947), “The Choice of Statistical Tests Illustrated on the Interpretation of Data Classed in a 2×2 Table,” Biometrika 34(1/2): 139–167.

Pearson, E. S. (1955), “Statistical Concepts and Their Relationship to Reality,” Journal of the Royal Statistical Society, Series B (Methodological), 17(2): 204–207.

Neyman, J. and Pearson, E. S. (1928), “On the Use and Interpretation of Certain Test Criteria for Purposes of Statistical Inference, Part I,” Biometrika 20(A): 175–240.


[i] In some cases only an upper limit to this error probability may be found.

[ii] Pearson inadvertently switches from number of failures to number of successes in the conclusion of this paper.

Categories: E.S. Pearson, Error Statistics | Leave a comment

The Statistics Wars and Their Casualties Workshop-Now Online

The Statistics Wars
and Their Casualties 

22-23 September 2022
15:00-18:00 London Time*

ONLINE 

To register for the workshop, please fill out the registration form here.

*These will be Sessions 1 & 2; the future online sessions (3 & 4) will be held 15:00-18:00 London Time on December 1 & 8.

Yoav Benjamini (Tel Aviv University), Alexander Bird (University of Cambridge), Mark Burgman (Imperial College London), Daniele Fanelli (London School of Economics and Political Science), Roman Frigg (London School of Economics and Political Science), Stephan Guttinger (University of Exeter), David Hand (Imperial College London), Margherita Harris (London School of Economics and Political Science), Christian Hennig (University of Bologna), Daniël Lakens (Eindhoven University of Technology), Deborah Mayo (Virginia Tech), Richard Morey (Cardiff University), Stephen Senn (Edinburgh, Scotland), Jon Williamson (University of Kent) Continue reading

Categories: Announcement, Error Statistics | Leave a comment

10 years after the July 4 statistical discovery of the Higgs & the value of negative results

Higgs

Today marks a decade since the discovery on July 4, 2012 of evidence for a Higgs particle based on a “5 sigma observed effect”. CERN celebrated with a scientific symposium (webcast here). The observed effect refers to the number of excess events of a given type that are “observed” in comparison to the number that would be expected from background alone—which they can simulate in particle detectors. Because the 5-sigma standard refers to a benchmark from frequentist significance testing, the discovery was immediately imbued with controversies that, at bottom, concerned statistical philosophy. Continue reading
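For reference, the tail area behind the 5-sigma benchmark is easy to compute (a quick sketch of the standard conversion, not the collaborations’ actual analysis):

    from scipy.stats import norm

    # One-sided probability of an excess at least 5 standard deviations
    # above the background expectation, under the background-alone model.
    p = norm.sf(5)
    print(p)  # ~2.87e-07, about 1 in 3.5 million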

Categories: Error Statistics | 2 Comments

Dissent


Continue reading

Categories: Error Statistics | 5 Comments

D. Mayo & D. Hand: “Statistical significance and its critics: practicing damaging science, or damaging scientific practice?”


Prof. Deborah Mayo, Emerita
Department of Philosophy
Virginia Tech


Prof. David Hand
Department of Mathematics
Imperial College London

Statistical significance and its critics: practicing damaging science, or damaging scientific practice?  (Synthese)

[pdf of full paper.] Continue reading

Categories: Error Statistics | 3 Comments

Insevere Tests of Severe Testing (iv)


One does not have evidence for a claim if little if anything has been done to rule out ways the claim may be false. The claim may be said to “pass” the test, but it’s one that utterly lacks stringency or severity. On the basis of this very simple principle, I build a notion of evidence that applies to any error prone inference. In this account, data x are evidence for a claim C only if (and only to the extent that) C has passed a severe test with x.[1] How to apply this simple idea, however, and how to use it to solve central problems of induction and statistical inference requires careful consideration of how it is to be fleshed out. (See this post on strong vs weak severity.) Continue reading

Categories: Error Statistics | 3 Comments

No fooling: The Statistics Wars and Their Casualties Workshop is Postponed to 22-23 September, 2022

The Statistics Wars
and Their Casualties

Postponed to
22-23 September 2022


London School of Economics (CPNSS)

Yoav Benjamini (Tel Aviv University), Alexander Bird (University of Cambridge), Mark Burgman (Imperial College London), Daniele Fanelli (London School of Economics and Political Science), Roman Frigg (London School of Economics and Political Science), Stephan Guttinger (University of Exeter), David Hand (Imperial College London), Margherita Harris (London School of Economics and Political Science), Christian Hennig (University of Bologna), Katrin Hohl* (City University London), Daniël Lakens (Eindhoven University of Technology), Deborah Mayo (Virginia Tech), Richard Morey (Cardiff University), Stephen Senn (Edinburgh, Scotland), Jon Williamson (University of Kent) Continue reading

Categories: Error Statistics | Leave a comment
