# Statistics

## Mayo-Spanos Summer Seminar PhilStat: July 28-Aug 11, 2019: Instructions for Applying Now Available

## You Should Be Binge Reading the (Strong) Likelihood Principle

An essential component of inference based on familiar frequentist notions: p-values, significance and confidence levels, is the relevant sampling distribution (hence the term *sampling theory, *or my preferred *error statistics, *as we get error probabilities from the sampling distribution). This feature results in violations of a principle known as the *strong likelihood principle* (SLP). To state the SLP roughly, it asserts that all the evidential import in the data (for parametric inference within a model) resides in the likelihoods. If accepted, it would render error probabilities irrelevant post data.

**SLP** (We often drop the “strong” and just call it the LP. The “weak” LP just boils down to sufficiency)

For any two experiments E

_{1}and E_{2}with different probability models f_{1}, f_{2}, but with the same unknown parameter θ, if outcomesx* andy* (from E_{1}and E_{2}respectively) determine the same (i.e., proportional) likelihood function (f_{1}(x*; θ) = cf_{2}(y*; θ) for all θ), thenx* andy* are inferentially equivalent (for an inference about θ).

(What differentiates the weak and the strong LP is that the weak refers to a single experiment.)

Continue reading

## Neyman-Pearson Tests: An Episode in Anglo-Polish Collaboration: Excerpt from Excursion 3 (3.2)

**3.2 N-P Tests: An Episode in Anglo-Polish Collaboration***

We proceed by setting up a specific hypothesis to test,

H_{0 }in Neyman’s and my terminology, the null hypothesis in R. A. Fisher’s . . . in choosing the test, we take into account alternatives toH_{0 }which we believe possible or at any rate consider it most important to be on the look out for . . .Three steps in constructing the test may be defined:

Step 1. We must first specify the set of results . . .

Step 2.We then divide this set by a system of ordered boundaries . . .such that as we pass across one boundary and proceed to the next, we come to a class of results which makes us more and more inclined, on the information available, to reject the hypothesis tested in favour of alternatives which differ from it by increasing amounts.

Step 3. We then, if possible, associate with each contour level the chance that, ifH_{0}is true, a result will occur in random sampling lying beyond that level . . .In our first papers [in 1928] we suggested that the likelihood ratio criterion, λ, was a very useful one . . . Thus Step 2 proceeded Step 3. In later papers [1933–1938] we started with a fixed value for the chance, ε, of Step 3 . . . However, although the mathematical procedure may put Step 3 before 2, we cannot put this into operation before we have decided, under Step 2, on the guiding principle to be used in choosing the contour system. That is why I have numbered the steps in this order. (Egon Pearson 1947, p. 173)

In addition to Pearson’s 1947 paper, the museum follows his account in “The Neyman–Pearson Story: 1926–34” (Pearson 1970). The subtitle is “Historical Sidelights on an Episode in Anglo-Polish Collaboration”!

We meet Jerzy Neyman at the point he’s sent to have his work sized up by Karl Pearson at University College in 1925/26. Neyman wasn’t that impressed: Continue reading

## Mementos for Excursion 2 Tour II: Falsiﬁcation, Pseudoscience, Induction (2.3-2.7)

**Excursion 2 Tour II: Falsification, Pseudoscience, Induction***

*Outline of Tour.* Tour II visits Popper, falsification, corroboration, Duhem’s problem (what to blame in the case of anomalies) and the demarcation of science and pseudoscience (2.3). While Popper comes up short on each, the reader is led to improve on Popper’s notions (live exhibit (v)). Central ingredients for our journey are put in place via souvenirs: a framework of models and problems, and a post-Popperian language to speak about inductive inference. Defining a severe test, for Popperians, is linked to when data supply novel evidence for a hypothesis: family feuds about defining novelty are discussed (2.4). We move into Fisherian significance tests and the crucial requirements he set (often overlooked): isolated significant results are poor evidence of a genuine effect, and statistical significance doesn’t warrant substantive, e.g., causal inference (2.5). Applying our new demarcation criterion to a plausible effect (males are more likely than females to feel threatened by their partner’s success), we argue that a real revolution in psychology will need to be more revolutionary than at present. Whole inquiries might have to be falsified, their measurement schemes questioned (2.6). The Tour’s pieces are synthesized in (2.7), where a guest lecturer explains how to solve the problem of induction now, having redefined induction as severe testing.

**Mementos from 2.3** Continue reading

## Tour Guide Mementos and QUIZ 2.1 (Excursion 2 Tour I: Induction and Confirmation)

**Excursion 2 Tour I: Induction and Confirmation ***(Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars)*

*Tour Blurb*. The roots of rival statistical accounts go back to the logical Problem of Induction. (2.1) The logical problem of induction is a matter of finding an argument to justify a type of argument (enumerative induction), so it is important to be clear on arguments, their soundness versus their validity. These are key concepts of fundamental importance to our journey. Given that any attempt to solve the logical problem of induction leads to circularity, philosophers turned instead to building logics that seemed to capture our intuitions about induction. This led to confirmation theory and some projects in today’s formal epistemology. There’s an analogy between contrasting views in philosophy and statistics: Carnapian confirmation is to Bayesian statistics, as Popperian falsification is to frequentist error statistics. Logics of confirmation take the form of probabilisms, either in the form of raising the probability of a hypothesis, or arriving at a posterior probability. (2.2) The contrast between these types of probabilisms, and the problems each is found to have in confirmation theory are directly relevant to the types of probabilisms in statistics. Notably, Harold Jeffreys’ non-subjective Bayesianism, and current spin-offs, share features with Carnapian inductive logics. We examine the problem of irrelevant conjunctions: that if ** x** confirms

*H*, it confirms (

*H*&

*J*) for any

*J*. This also leads to what’s called the tacking paradox.

**Quiz on 2.1** Soundness vs Validity in Deductive Logic. Let ~*C *be the denial of claim *C*. For each of the following argument, indicate whether it is **valid** **and sound**, **valid but unsound**, **invalid**. Continue reading

## Excursion 1 Tour I: Beyond Probabilism and Performance: Severity Requirement (1.1)

I’m talking about a speciﬁc, extra type of integrity that is [beyond] not lying, but bending over backwards to show how you’re maybe wrong, that you ought to have when acting as a scientist. (Feynman 1974/1985, p. 387)

*It is easy to lie with statistics*. Or so the cliché goes. It is also very diﬃcult to uncover these lies without statistical methods – at least of the right kind. Self- correcting statistical methods are needed, and, with minimal technical fanfare, that’s what I aim to illuminate. Since Darrell Huﬀ wrote *How to Lie with Statistics *in 1954, ways of lying with statistics are so well worn as to have emerged in reverberating slogans:

- Association is not causation.
- Statistical signiﬁcance is not substantive signiﬁcamce
- No evidence of risk is not evidence of no risk.
- If you torture the data enough, they will confess.

Exposés of fallacies and foibles ranging from professional manuals and task forces to more popularized debunking treatises are legion. New evidence has piled up showing lack of replication and all manner of selection and publication biases. Even expanded “evidence-based” practices, whose very rationale is to emulate experimental controls, are not immune from allegations of illicit cherry picking, signiﬁcance seeking, *P*-hacking, and assorted modes of extra- ordinary rendition of data. Attempts to restore credibility have gone far beyond the cottage industries of just a few years ago, to entirely new research programs: statistical fraud-busting, statistical forensics, technical activism, and widespread reproducibility studies. There are proposed methodological reforms – many are generally welcome (preregistration of experiments, transparency about data collection, discouraging mechanical uses of statistics), some are quite radical. If we are to appraise these evidence policy reforms, a much better grasp of some central statistical problems is needed.

## A. Spanos: Egon Pearson’s Neglected Contributions to Statistics

*Continuing with the discussion of E.S. Pearson in honor of his birthday:*

**Egon Pearson’s Neglected Contributions to Statistics**

by** Aris Spanos**

**Egon Pearson** (11 August 1895 – 12 June 1980), is widely known today for his contribution in recasting of Fisher’s significance testing into the * Neyman-Pearson (1933) theory of hypothesis testing*. Occasionally, he is also credited with contributions in promoting statistical methods in industry and in the history of modern statistics; see Bartlett (1981). What is rarely mentioned is Egon’s early pioneering work on:

**(i) specification**: the need to state explicitly the inductive premises of one’s inferences,

**(ii) robustness**: evaluating the ‘sensitivity’ of inferential procedures to departures from the Normality assumption, as well as

**(iii) Mis-Specification (M-S) testing**: probing for potential departures from the Normality assumption.

Arguably, modern frequentist inference began with the development of various finite sample inference procedures, initially by William Gosset (1908) [of the **Student’s t** fame] and then **Fisher** (1915, 1921, 1922a-b). These inference procedures revolved around a particular statistical model, known today as *the simple Normal model*: Continue reading

## Egon Pearson’s Heresy

*Today is Egon Pearson’s birthday.* In honor of his birthday, I am posting “Statistical Concepts in Their Relation to Reality” (Pearson 1955). I’ve posted it several times over the years, but always find a new gem or two, despite its being so short. E. Pearson rejected some of the familiar tenets that have come to be associated with Neyman and Pearson (N-P) statistical tests, notably the idea that the essential justification for tests resides in a long-run control of rates of erroneous interpretations–what he termed the “behavioral” rationale of tests. In an unpublished letter E. Pearson wrote to Birnbaum (1974), he talks about N-P theory admitting of two interpretations: behavioral and evidential:

“I think you will pick up here and there in my own papers signs of evidentiality, and you can say now that we or I should have stated clearly the difference between the behavioral and evidential interpretations. Certainly we have suffered since in the way the people have concentrated (to an absurd extent often) on behavioral interpretations”.

## “Intentions (in your head)” is the code word for “error probabilities (of a procedure)”: Allan Birnbaum’s Birthday

**Today is Allan Birnbaum’s Birthday. **Birnbaum’s (1962) classic “On the Foundations of Statistical Inference,” in *Breakthroughs in Statistics (volume I 1993), *concerns a principle that remains at the heart of today’s controversies in statistics–even if it isn’t obvious at first: the Likelihood Principle (LP) (also called the strong likelihood Principle SLP, to distinguish it from the weak LP [1]). According to the LP/SLP, given the statistical model, the information from the data are fully contained in the likelihood ratio. Thus, *properties of the sampling distribution of the test statistic vanish *(as I put it in my slides from this post)! But error probabilities are all properties of the sampling distribution. Thus, embracing the LP (SLP) blocks our error statistician’s direct ways of taking into account “biasing selection effects” (slide #10). [Posted earlier here.] Interesting, as seen in a 2018 post on Neyman, Neyman *did* discuss this paper, but had an odd reaction that I’m not sure I understand. (Check it out.) Continue reading

## Neyman vs the ‘Inferential’ Probabilists continued (a)

**Today is Jerzy Neyman’s Birthday (April 16, 1894 – August 5, 1981). ** I am posting a brief excerpt and a link to a paper of his that I hadn’t posted before: Neyman, J. (1962), ‘Two Breakthroughs in the Theory of Statistical Decision Making‘ [i] It’s chock full of ideas and arguments, but the one that interests me at the moment is Neyman’s conception of “his breakthrough”, in relation to a certain concept of “inference”. “In the present paper” he tells us, “the term ‘inferential theory’…will be used to describe the attempts to solve the Bayes’ problem with a reference to confidence, beliefs, etc., through some supplementation …either a substitute *a priori* distribution [exemplified by the so called principle of insufficient reason] or a new measure of uncertainty” such as Fisher’s fiducial probability. Now Neyman always distinguishes his error statistical performance conception from Bayesian and Fiducial probabilisms [ii]. The surprising twist here is semantical and the culprit is none other than…Allan Birnbaum. Yet Birnbaum gets short shrift, and no mention is made of our favorite “breakthrough” (or did I miss it?). [iii] I’ll explain in later stages of this post & in comments…(so please check back); I don’t want to miss the start of the birthday party in honor of Neyman, and it’s already 8:30 p.m in Berkeley!

Note: In this article,”attacks” on various statistical “fronts” refers to ways of attacking problems in one or another statistical research program. **HAPPY BIRTHDAY NEYMAN!** Continue reading

## Deconstructing the Fisher-Neyman conflict wearing fiducial glasses (continued)

This continues my previous post: “Can’t take the fiducial out of Fisher…” in recognition of Fisher’s birthday, February 17. I supply a few more intriguing articles you may find enlightening to read and/or reread on a Saturday night

Move up 20 years to the famous 1955/56 exchange between Fisher and Neyman. Fisher clearly connects Neyman’s adoption of a behavioristic-performance formulation to his denying the soundness of fiducial inference. When “Neyman denies the existence of inductive reasoning, he is merely expressing a verbal preference. For him ‘reasoning’ means what ‘deductive reasoning’ means to others.” (Fisher 1955, p. 74). Continue reading

## Can’t Take the Fiducial Out of Fisher (if you want to understand the N-P performance philosophy) [i]

Continuing with posts in recognition of R.A. Fisher’s birthday, I post one from a couple of years ago on a topic that had previously not been discussed on this blog: Fisher’s* fiducial probability*.

[Neyman and Pearson] “began an influential collaboration initially designed primarily, it would seem to clarify Fisher’s writing. This led to their theory of testing hypotheses and to Neyman’s development of confidence intervals, aiming to clarify Fisher’s idea of fiducial intervals (D.R.Cox, 2006, p. 195).

The entire episode of fiducial probability is fraught with minefields. Many say it was Fisher’s biggest blunder; others suggest it still hasn’t been understood. The majority of discussions omit the side trip to the Fiducial Forest altogether, finding the surrounding brambles too thorny to penetrate. Besides, a fascinating narrative about the Fisher-Neyman-Pearson divide has managed to bloom and grow while steering clear of fiducial probability–never mind that it remained a centerpiece of Fisher’s statistical philosophy. I now think that this is a mistake. It was thought, following Lehman (1993) and others, that we could take the fiducial out of Fisher and still understand the core of the Neyman-Pearson vs Fisher (or Neyman vs Fisher) disagreements. We can’t. Quite aside from the intrinsic interest in correcting the “he said/he said” of these statisticians, the issue is intimately bound up with the current (flawed) consensus view of frequentist error statistics.

So what’s *fiducial inference*? I follow Cox (2006), adapting for the case of the lower limit: Continue reading

## R. A. Fisher: How an Outsider Revolutionized Statistics (Aris Spanos)

In recognition of R.A. Fisher’s birthday on February 17….

**‘R. A. Fisher: How an Outsider Revolutionized Statistics’**

by **Aris Spanos**

Few statisticians will dispute that R. A. Fisher **(February 17, 1890 – July 29, 1962)** is the father of modern statistics; see Savage (1976), Rao (1992). Inspired by William Gosset’s (1908) paper on the Student’s t finite sampling distribution, he recast statistics into the modern model-based induction in a series of papers in the early 1920s. He put forward a theory of *optimal estimation* based on the method of maximum likelihood that has changed only marginally over the last century. His significance testing, spearheaded by the p-value, provided the basis for the Neyman-Pearson theory of *optimal testing* in the early 1930s. According to Hald (1998)

“Fisher was a genius who almost single-handedly created the foundations for modern statistical science, without detailed study of his predecessors. When young he was ignorant not only of the Continental contributions but even of contemporary publications in English.” (p. 738)

What is not so well known is that Fisher was the *ultimate outsider* when he brought about this change of paradigms in statistical science. As an undergraduate, he studied mathematics at Cambridge, and then did graduate work in statistical mechanics and quantum theory. His meager knowledge of statistics came from his study of astronomy; see Box (1978). That, however did not stop him from publishing his first paper in statistics in 1912 (still an undergraduate) on “curve fitting”, questioning Karl Pearson’s method of moments and proposing a new method that was eventually to become the likelihood method in his 1921 paper. Continue reading

## Guest Blog: STEPHEN SENN: ‘Fisher’s alternative to the alternative’

**As part of the week of recognizing R.A.Fisher (February 17, 1890 – July 29, 1962), I reblog a guest post by Stephen Senn from 2012/2017. The comments from 2017 lead to a troubling issue that I will bring up in the comments today.**

*‘Fisher’s alternative to the alternative’*

*By: Stephen Senn*

[2012 marked] the 50th anniversary of RA Fisher’s death. It is a good excuse, I think, to draw attention to an aspect of his philosophy of significance testing. In his extremely interesting essay on Fisher, Jimmie Savage drew attention to a problem in Fisher’s approach to testing. In describing Fisher’s aversion to power functions Savage writes, ‘Fisher says that some tests are *more sensitive* than others, and I cannot help suspecting that that comes to very much the same thing as thinking about the power function.’ (Savage 1976) (P473).

The modern statistician, however, has an advantage here denied to Savage. Savage’s essay was published posthumously in 1976 and the lecture on which it was based was given in Detroit on 29 December 1971 (P441). At that time Fisher’s scientific correspondence did not form part of his available oeuvre but in 1990 Henry Bennett’s magnificent edition of Fisher’s statistical correspondence (Bennett 1990) was published and this throws light on many aspects of Fisher’s thought including on significance tests. Continue reading

## Happy Birthday R.A. Fisher: ‘Two New Properties of Mathematical Likelihood’

*Today is R.A. Fisher’s birthday. I’ll post some Fisherian items this week in honor of it. This paper comes just before the conflicts with Neyman and Pearson erupted. Fisher links his tests and sufficiency, to the Neyman and Pearson lemma in terms of power. It’s as if we may see them as ending up in a similar place while starting from different origins. I quote just the most relevant portions…the full article is linked below. Happy Birthday Fisher!*

“Two New Properties of Mathematical Likelihood“

by R.A. Fisher, F.R.S.

Proceedings of the Royal Society, Series A, 144: 285-307 (1934)

The property that where a sufficient statistic exists, the likelihood, apart from a factor independent of the parameter to be estimated, is a function only of the parameter and the sufficient statistic, explains the principle result obtained by Neyman and Pearson in discussing the efficacy of tests of significance. Neyman and Pearson introduce the notion that any chosen test of a hypothesis H_{0} is more powerful than any other equivalent test, with regard to an alternative hypothesis H_{1}, when it rejects H_{0} in a set of samples having an assigned aggregate frequency ε when H_{0} is true, and the greatest possible aggregate frequency when H_{1} is true. If any group of samples can be found within the region of rejection whose probability of occurrence on the hypothesis H_{1} is less than that of any other group of samples outside the region, but is not less on the hypothesis H_{0}, then the test can evidently be made more powerful by substituting the one group for the other. Continue reading

## S. Senn: Being a statistician means never having to say you are certain (Guest Post)

**Stephen Senn**

* Head of Competence Center *

*for Methodology and Statistics (CCMS)
Luxembourg Institute of Health
Twitter @stephensenn*

**Being a statistician means never having to say you are certain**

A recent discussion of randomised controlled trials[1] by Angus Deaton and Nancy Cartwright (D&C) contains much interesting analysis but also, in my opinion, does not escape rehashing some of the invalid criticisms of randomisation with which the literatures seems to be littered. The paper has two major sections. The latter, which deals with generalisation of results, or what is sometime called *external validity,* I like much more than the former which deals with *internal validity*. It is the former I propose to discuss.

## 60 yrs of Cox’s (1958) weighing machine, & links to binge-read the Likelihood Principle

2018 will mark 60 years since the famous chestnut from Sir David Cox (1958). The example “is now usually called the ‘weighing machine example,’ which draws attention to the need for conditioning, at least in certain types of problems” (Reid 1992, p. 582). When I describe it, you’ll find it hard to believe many regard it as causing an earthquake in statistical foundations, unless you’re already steeped in these matters. A simple version: If half the time I reported my weight from a scale that’s always right, and half the time use a scale that gets it right with probability .5, would you say I’m right with probability ¾? Well, maybe. But suppose you *knew* that this measurement was made with the scale that’s right with probability .5? The overall error probability is scarcely relevant for giving the warrant of the particular measurement, *knowing* which scale was used. So what’s the earthquake? First a bit more on the chestnut. Here’s an excerpt from Cox and Mayo (2010, 295-8): Continue reading

## The Conversion of Subjective Bayesian, Colin Howson, & the problem of old evidence (i)

“The subjective Bayesian theory as developed, for example, by Savage … cannot solve the deceptively simple but actually intractable old evidence problem, whence as a foundation for a logic of confirmation at any rate, it must be accounted a failure.” (Howson, (2017), p. 674)

What? Did the “old evidence” problem cause Colin Howson to recently abdicate his decades long position as a leading subjective Bayesian? It seems to have. I was so surprised to come across this in a recent perusal of* Philosophy of Science* that I wrote to him to check if it is really true. (It is.) I thought perhaps it was a different Colin Howson, or the son of the one who co-wrote 3 editions of Howson and Urbach: *Scientific Reasoning: The Bayesian Approach* espousing hard-line subjectivism since 1989.[1] I am not sure which of the several paradigms of non-subjective or default Bayesianism Howson endorses (he’d argued for years, convincingly, against any one of them), nor how he handles various criticisms (Kass and Wasserman 1996), I put that aside. Nor have I worked through his, rather complex, paper to the extent necessary, yet. What about the “old evidence” problem, made famous by Clark Glymour 1980? What is it? Continue reading

## Statistical skepticism: How to use significance tests effectively: 7 challenges & how to respond to them

Here are my slides from the ASA Symposium on Statistical Inference : “A World Beyond p < .05” in the session, “What are the best uses for P-values?”. (Aside from me,our session included Yoav Benjamini and David Robinson, with chair: Nalini Ravishanker.)

**7 QUESTIONS**

*Why use a tool that infers***from a single (arbitrary) P-value that pertains to a***statistical*hypothesis*H*_{0}*to a research claim**H***?***Why use an incompatible hybrid (of Fisher and N-P)?**- Why apply a method
**that uses error probabilities, the sampling distribution,**researcher “**intentions**” and**violates the likelihood principle**(LP)? You should**condition**on the data. **Why use methods that***overstate evidence*against a null hypothesis?**Why do you use a method that****presupposes the underlying statistical model?****Why use a measure that doesn’t report effect sizes?****Why do you use a method that doesn’t provide posterior probabilities (in hypotheses)?**

** **

## Egon Pearson’s Heresy

Here’s one last entry in honor of Egon Pearson’s birthday: “Statistical Concepts in Their Relation to Reality” (Pearson 1955). I’ve posted it several times over the years (6!), but always find a new gem or two, despite its being so short. E. Pearson rejected some of the familiar tenets that have come to be associated with Neyman and Pearson (N-P) statistical tests, notably the idea that the essential justification for tests resides in a long-run control of rates of erroneous interpretations–what he termed the “behavioral” rationale of tests. In an unpublished letter E. Pearson wrote to Birnbaum (1974), he talks about N-P theory admitting of two interpretations: behavioral and evidential:

“I think you will pick up here and there in my own papers signs of evidentiality, and you can say now that we or I should have stated clearly the difference between the behavioral and evidential interpretations. Certainly we have suffered since in the way the people have concentrated (to an absurd extent often) on behavioral interpretations”.

(Nowadays, some people concentrate to an absurd extent on “science-wise error rates in dichotomous screening”.) Continue reading