Statistics

My paper, “P values on Trial” is out in Harvard Data Science Review

.

My new paper, “P Values on Trial: Selective Reporting of (Best Practice Guides Against) Selective Reporting” is out in Harvard Data Science Review (HDSR). HDSR describes itself as a A Microscopic, Telescopic, and Kaleidoscopic View of Data Science. The editor-in-chief is Xiao-li Meng, a statistician at Harvard. He writes a short blurb on each article in his opening editorial of the issue.

This is a case where reality proves the parody (or maybe, the proof of the parody is in the reality) or something like that. More specifically, Excursion 4 Tour III of my book Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (2018, CUP) opens with a parody of a legal case, that of Scott Harkonen (in the parody, his name is Paul Hack). You can read it here. A few months after the book came out, the actual case took a turn that went even a bit beyond what I imagined could transpire in my parody. I got cold feet when it came to naming names in the book, but in this article I do.

Below I paste Meng’s blurb, followed by the start of my article.

Meng’s blurb (his full editorial is here):

P values on Trial (and the Beauty and Beast in a Single Number)

Perhaps there are no statistical concepts or methods that have been used and abused more frequently than statistical significance and the p value.  So much so that some journals are starting to recommend authors move away from rigid p value thresholds by which results are classified as significant or insignificant. The American Statistical Association (ASA) also issued a statement on statistical significance and p values in 2016, a unique practice in its nearly 180 years of history.  However, the 2016 ASA statement did not settle the matter, but only ignited further debate, as evidenced by the 2019 special issue of The American Statistician.  The fascinating account by the eminent philosopher of science Deborah Mayo of how the ASA’s 2016 statement was used in a legal trial should remind all data scientists that what we do or say can have completely unintended consequences, despite our best intentions.

The ASA is a leading professional society of the studies of uncertainty and variabilities. Therefore, the tone and overall approach of its 2016 statement is understandably nuanced and replete with cautionary notes. However, in the case of Scott Harkonen (CEO of InterMune), who was found guilty of misleading the public by reporting a cherry-picked ‘significant p value’ to market the drug Actimmune for unapproved uses, the appeal lawyers cited the ASA Statement’s cautionary note that “a p value without context or other evidence provides limited information,” as compelling new evidence that the scientific theory upon which petitioner’s conviction was based [that of statistical significance testing] is demonstrably false.  I doubt the authors of the ASA statement ever anticipated that their warning against the inappropriate use of p value could be turned into arguments for protecting exactly such uses.

To further clarify the ASA’s position, especially in view of some confusions generated by the aforementioned special issue, the ASA recently established a task force on statistical significance (and research replicability) to “develop thoughtful principles and practices that the ASA can endorse and share with scientists and journal editors” within 2020.  As a member of the task force, I’m particularly mindful of the message from Mayo’s article, and of the essentially impossible task of summarizing scientific evidence by a single number.  As consumers of information, we are all seduced by simplicity, and nothing is simpler than conveying everything through a single number, which renders simplicity on multiple fronts, from communication to decision making.  But, again, there is no free lunch.  Most problems are just too complex to be summarized by a single number, and concision in this context can exact a considerable cost. The cost could be a great loss of information or validity of the conclusion, which are the central concerns regarding the p value.  The cost can also be registered in terms of the tremendous amount of hard work it may take to produce a usable single summary.

P-Values on Trial: Selective Reporting of (Best Practice Guides Against) Selective Reporting

Abstract

In an attempt to stem the practice of reporting impressive-looking findings based on data dredging and multiple testing, the American Statistical Association’s (ASA) 2016 guide to interpreting p values (Wasserstein & Lazar) warns that engaging in such practices “renders the reported p-values essentially uninterpretable” (pp. 131-132). Yet some argue that the ASA statement actually frees researchers from culpability for failing to report or adjust for data dredging and multiple testing. We illustrate the puzzle by means of a case appealed to the Supreme Court of the United States: that of Scott Harkonen. In 2009, Harkonen was found guilty of issuing a misleading press report on results of a drug advanced by the company of which he was CEO. Downplaying the high p value on the primary endpoint (and 10 secondary points), he reported statistically significant drug benefits had been shown, without mentioning this referred only to a subgroup he identified from ransacking the unblinded data. Nevertheless, Harkonen and his defenders argued that “the conclusions from the ASA Principles are the opposite of the government’s” conclusion that his construal of the data was misleading (Harkonen v. United States, 2018, p. 16). On the face of it, his defenders are selectively reporting on the ASA guide, leaving out its objections to data dredging. However, the ASA guide also points to alternative accounts to which some researchers turn to avoid problems of data dredging and multiple testing. Since some of these accounts give a green light to Harkonen’s construal, a case might be made that the guide, inadvertently or not, frees him from culpability.

Keywords: statistical significance, p values, data dredging, multiple testing, ASA guide to p values, selective reporting

  1. Introduction

The biggest source of handwringing about statistical inference boils down to the fact it has become very easy to infer claims that have not been subjected to stringent tests. Sifting through reams of data makes it easy to find impressive-looking associations, even if they are spurious. Concern with spurious findings is considered sufficiently serious to have motivated the American Statistical Association (ASA) to issue a guide to stem misinterpretations of p values (Wasserstein & Lazar, 2016; hereafter, ASA guide). Principle 4 of the ASA guide asserts that:

Proper inference requires full reporting and transparency. P-values and related analyses should not be reported selectively. Conducting multiple analyses of the data and reporting only those with certain p-values (typically those passing a significance threshold) renders the reported p-values essentially uninterpretable. (pp. 131–132)

An intriguing example is offered by a legal case that was back in the news in 2018, having made it to the U.S. Supreme Court (Harkonen v. United States, 2018). In 2009, Scott Harkonen (CEO of drug company InterMune) was found guilty of wire fraud for issuing a misleading press report on Phase III results of a drug Actimmune in 2002, successfully pumping up its sales. While Actimmune had already been approved for two rare diseases, it was hoped that the FDA would approve it for a far more prevalent, yet fatal, lung disease (whose treatment would cost patients $50,000 a year). Confronted with a disappointing lack of statistical significance (p = .52)[1] on the primary endpoint—that the drug improves lung function as reflected by progression free survival—and on any of ten prespecified endpoints, Harkonen engaged in postdata dredging on the unblinded data until he unearthed a non-prespecified subgroup with a nominally statistically significant survival benefit. The day after the Food and Drug Administration (FDA) informed him it would not approve the use of the drug on the basis of his post hoc finding, Harkonen issued a press release to doctors and shareholders optimistically reporting Actimmune’s statistically significant survival benefits in the subgroup he identified from ransacking the unblinded data.

What makes the case intriguing is not its offering yet another case of p-hacking, nor that it has found its way more than once to the Supreme Court. Rather, it is because in 2018, Harkonen and his defenders argued that the ASA guide provides “compelling new evidence that the scientific theory upon which petitioner’s conviction was based [that of statistical significance testing] is demonstrably false” (Goodman, 2018, p. 3). His appeal alleges that “the conclusions from the ASA Principles are the opposite of the government’s” charge that his construal of the data was misleading (Harkonen v. United States, 2018, p. 16 ).

Are his defenders merely selectively reporting on the ASA guide, making no mention of Principle 4, with its loud objections to the behavior Harkonen displayed? It is hard to see how one can hold Principle 4 while averring the guide’s principles run counter to the government’s charges against Harkonen. However, if we view the ASA guide in the context of today’s disputes about statistical evidence, things may look topsy turvy. None of the attempts to overturn his conviction succeeded (his sentence had been to a period of house arrest and a fine), but his defenders are given a leg to stand on—wobbly as it is. While the ASA guide does not show that the theory of statistical significance testing ‘is demonstrably false,’ it might be seen to communicate a message that is in tension with itself on one of the most important issues of statistical inference.

Before beginning, some caveats are in order. The legal case was not about which statistical tools to use, but merely whether Harkonen, in his role as CEO, was guilty of intentionally issuing a misleading report to shareholders and doctors. However, clearly, there could be no hint of wrongdoing if it were acceptable to treat post hoc subgroups the same as prespecified endpoints. In order to focus solely on that issue, I put to one side the question whether his press report rises to the level of wire fraud. Lawyer Nathan Schachtman argues that “the judgment in United States v. Harkonen is at odds with the latitude afforded companies in securities fraud cases” even where multiple testing occurs (Schachtman, 2020, p. 48). Not only are the intricacies of legal precedent outside my expertise, the arguments in his defense, at least the ones of interest here, regard only the data interpretation. Moreover, our concern is strictly with whether the ASA guide provides grounds to exonerate Harkonen-like interpretations of data.

I will begin by describing the case in relation to the ASA guide. I then make the case that Harkonen’s defenders mislead by omission of the relevant principle in the guide. I will then reopen my case by revealing statements in the guide that have thus far been omitted from my own analysis. Whether they exonerate Harkonen’s defenders is for you, the jury, to decide.

You can read the full article at HDSR here. The Harkonen case is also discussed on this blog: search Harkonen (and Matrixx).

 

Categories: multiple testing, P-values, significance tests, Statistics | 29 Comments

Posts of Christmas Past (1): 13 howlers of significance tests (and how to avoid them)

.

I’m reblogging a post from Christmas past–exactly 7 years ago. Guess what I gave as the number 1 (of 13) howler well-worn criticism of statistical significance tests, haunting us back in 2012–all of which are put to rest in Mayo and Spanos 2011? Yes, it’s the frightening allegation that statistical significance tests forbid using any background knowledge! The researcher is imagined to start with a “blank slate” in each inquiry (no memories of fallacies past), and then unthinkingly apply a purely formal, automatic, accept-reject machine. What’s newly frightening (in 2019) is the credulity with which this apparition is now being met (by some). I make some new remarks below the post from Christmas past: Continue reading

Categories: memory lane, significance tests, Statistics | Tags: | Leave a comment

How My Book Begins: Beyond Probabilism and Performance: Severity Requirement

This week marks one year since the general availability of my book: Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (2018, CUP). Here’s how it begins (Excursion 1 Tour 1 (1.1)). Material from the preface is here. I will sporadically give some “one year later” reflections in the comments. I invite readers to ask me any questions pertaining to the Tour.

The journey begins..(1.1)

I’m talking about a specific, extra type of integrity that is [beyond] not lying, but bending over backwards to show how you’re maybe wrong, that you ought to have when acting as a scientist. (Feynman 1974/1985, p. 387)

It is easy to lie with statistics. Or so the cliché goes. It is also very difficult to uncover these lies without statistical methods – at least of the right kind. Self- correcting statistical methods are needed, and, with minimal technical fanfare, that’s what I aim to illuminate. Since Darrell Huff wrote How to Lie with Statistics in 1954, ways of lying with statistics are so well worn as to have emerged in reverberating slogans:

  • Association is not causation.
  • Statistical significance is not substantive significamce
  • No evidence of risk is not evidence of no risk.
  • If you torture the data enough, they will confess.

Continue reading

Categories: Statistical Inference as Severe Testing, Statistics | 4 Comments

A. Spanos: Egon Pearson’s Neglected Contributions to Statistics

Continuing with posts on E.S. Pearson in marking his birthday:

Egon Pearson’s Neglected Contributions to Statistics

by Aris Spanos

    Egon Pearson (11 August 1895 – 12 June 1980), is widely known today for his contribution in recasting of Fisher’s significance testing into the Neyman-Pearson (1933) theory of hypothesis testing. Occasionally, he is also credited with contributions in promoting statistical methods in industry and in the history of modern statistics; see Bartlett (1981). What is rarely mentioned is Egon’s early pioneering work on:

(i) specification: the need to state explicitly the inductive premises of one’s inferences,

(ii) robustness: evaluating the ‘sensitivity’ of inferential procedures to departures from the Normality assumption, as well as

(iii) Mis-Specification (M-S) testing: probing for potential departures from the Normality  assumption.

Arguably, modern frequentist inference began with the development of various finite sample inference procedures, initially by William Gosset (1908) [of the Student’s t fame] and then Fisher (1915, 1921, 1922a-b). These inference procedures revolved around a particular statistical model, known today as the simple Normal model:

Xk ∽ NIID(μ,σ²), k=1,2,…,n,…             (1)

where ‘NIID(μ,σ²)’ stands for ‘Normal, Independent and Identically Distributed with mean μ and variance σ²’. These procedures include the ‘optimal’ estimators of μ and σ², Xbar and s², and the pivotal quantities:

(a) τ(X) =[√n(Xbar- μ)/s] ∽ St(n-1),  (2)

(b) v(X) =[(n-1)s²/σ²] ∽ χ²(n-1),        (3)

where St(n-1) and χ²(n-1) denote the Student’s t and chi-square distributions with (n-1) degrees of freedom. Continue reading

Categories: Egon Pearson, Statistics | Leave a comment

“The 2019 ASA Guide to P-values and Statistical Significance: Don’t Say What You Don’t Mean” (Some Recommendations)(ii)

Some have asked me why I haven’t blogged on the recent follow-up to the ASA Statement on P-Values and Statistical Significance (Wasserstein and Lazar 2016)–hereafter, ASA I. They’re referring to the editorial by Wasserstein, R., Schirm, A. and Lazar, N. (2019)–hereafter, ASA II–opening a special on-line issue of over 40 contributions responding to the call to describe “a world beyond P < 0.05”.[1] Am I falling down on the job? Not really. All of the issues are thoroughly visited in my Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars, SIST (2018, CUP). I invite interested readers to join me on the statistical cruise therein.[2] As the ASA II authors observe: “At times in this editorial and the papers you’ll hear deep dissonance, the echoes of ‘statistics wars’ still simmering today (Mayo 2018)”. True, and reluctance to reopen old wounds has only allowed them to fester. However, I will admit, that when new attempts at reforms are put forward, a philosopher of science who has written on the statistics wars ought to weigh in on the specific prescriptions/proscriptions, especially when a jumble of fuzzy conceptual issues are interwoven through a cacophony of competing reforms. (My published comment on ASA I, “Don’t Throw Out the Error Control Baby With the Bad Statistics Bathwater” is here.) Continue reading

Categories: ASA Guide to P-values, Statistics | 95 Comments

Neyman: Distinguishing tests of statistical hypotheses and tests of significance might have been a lapse of someone’s pen

Neyman April 16, 1894 – August 5, 1981

I’ll continue to post Neyman-related items this week in honor of his birthday. This isn’t the only paper in which Neyman makes it clear he denies a distinction between a test of  statistical hypotheses and significance tests. He and E. Pearson also discredit the myth that the former is only allowed to report pre-data, fixed error probabilities, and are justified only by dint of long-run error control. Controlling the “frequency of misdirected activities” in the midst of finding something out, or solving a problem of inquiry, on the other hand, are epistemological goals. What do you think?

Tests of Statistical Hypotheses and Their Use in Studies of Natural Phenomena
by Jerzy Neyman

ABSTRACT. Contrary to ideas suggested by the title of the conference at which the present paper was presented, the author is not aware of a conceptual difference between a “test of a statistical hypothesis” and a “test of significance” and uses these terms interchangeably. A study of any serious substantive problem involves a sequence of incidents at which one is forced to pause and consider what to do next. In an effort to reduce the frequency of misdirected activities one uses statistical tests. The procedure is illustrated on two examples: (i) Le Cam’s (and associates’) study of immunotherapy of cancer and (ii) a socio-economic experiment relating to low-income homeownership problems.

I recommend, especially, the example on home ownership. Here are two snippets: Continue reading

Categories: Error Statistics, Neyman, Statistics | Tags: | Leave a comment

Jerzy Neyman and “Les Miserables Citations” (statistical theater in honor of his birthday yesterday)

images-14

Neyman April 16, 1894 – August 5, 1981

My second Jerzy Neyman item, in honor of his birthday, is a little play that I wrote for Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (2018):

A local acting group is putting on a short theater production based on a screenplay I wrote:  “Les Miserables Citations” (“Those Miserable Quotes”) [1]. The “miserable” citations are those everyone loves to cite, from their early joint 1933 paper:

We are inclined to think that as far as a particular hypothesis is concerned, no test based upon the theory of probability can by itself provide any valuable evidence of the truth or falsehood of that hypothesis.

But we may look at the purpose of tests from another viewpoint. Without hoping to know whether each separate hypothesis is true or false, we may search for rules to govern our behavior with regard to them, in following which we insure that, in the long run of experience, we shall not be too often wrong. (Neyman and Pearson 1933, pp. 290-1).

Continue reading

Categories: E.S. Pearson, Neyman, Statistics | Leave a comment

Deconstructing the Fisher-Neyman conflict wearing fiducial glasses + Excerpt 5.8 from SIST

imgres-4

Fisher/ Neyman

This continues my previous post: “Can’t take the fiducial out of Fisher…” in recognition of Fisher’s birthday, February 17. These 2 posts reflect my working out of these ideas in writing Section 5.8 of Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (SIST, CUP 2018). Here’s all of Section 5.8 (“Neyman’s Performance and Fisher’s Fiducial Probability”) for your Saturday night reading.* 

Move up 20 years to the famous 1955/56 exchange between Fisher and Neyman. Fisher clearly connects Neyman’s adoption of a behavioristic-performance formulation to his denying the soundness of fiducial inference. When “Neyman denies the existence of inductive reasoning, he is merely expressing a verbal preference. For him ‘reasoning’ means what ‘deductive reasoning’ means to others.” (Fisher 1955, p. 74). Continue reading

Categories: fiducial probability, Fisher, Neyman, Statistics | 2 Comments

Can’t Take the Fiducial Out of Fisher (if you want to understand the N-P performance philosophy) [i]

imgres

R.A. Fisher: February 17, 1890 – July 29, 1962

Continuing with posts in recognition of R.A. Fisher’s birthday, I post one from a few years ago on a topic that had previously not been discussed on this blog: Fisher’s fiducial probability

[Neyman and Pearson] “began an influential collaboration initially designed primarily, it would seem to clarify Fisher’s writing. This led to their theory of testing hypotheses and to Neyman’s development of confidence intervals, aiming to clarify Fisher’s idea of fiducial intervals (D.R.Cox, 2006, p. 195).

The entire episode of fiducial probability is fraught with minefields. Many say it was Fisher’s biggest blunder; others suggest it still hasn’t been understood. The majority of discussions omit the side trip to the Fiducial Forest altogether, finding the surrounding brambles too thorny to penetrate. Besides, a fascinating narrative about the Fisher-Neyman-Pearson divide has managed to bloom and grow while steering clear of fiducial probability–never mind that it remained a centerpiece of Fisher’s statistical philosophy. I now think that this is a mistake. It was thought, following Lehmann (1993) and others, that we could take the fiducial out of Fisher and still understand the core of the Neyman-Pearson vs Fisher (or Neyman vs Fisher) disagreements. We can’t. Quite aside from the intrinsic interest in correcting the “he said/he said” of these statisticians, the issue is intimately bound up with the current (flawed) consensus view of frequentist error statistics. Continue reading

Categories: fiducial probability, Fisher, Phil6334/ Econ 6614, Statistics | Leave a comment

Guest Blog: R. A. Fisher: How an Outsider Revolutionized Statistics (Aris Spanos)

A SPANOS

.

In recognition of R.A. Fisher’s birthday on February 17…a week of Fisher posts!

‘R. A. Fisher: How an Outsider Revolutionized Statistics’

by Aris Spanos

Few statisticians will dispute that R. A. Fisher (February 17, 1890 – July 29, 1962) is the father of modern statistics; see Savage (1976), Rao (1992). Inspired by William Gosset’s (1908) paper on the Student’s t finite sampling distribution, he recast statistics into the modern model-based induction in a series of papers in the early 1920s. He put forward a theory of optimal estimation based on the method of maximum likelihood that has changed only marginally over the last century. His significance testing, spearheaded by the p-value, provided the basis for the Neyman-Pearson theory of optimal testing in the early 1930s. According to Hald (1998)

“Fisher was a genius who almost single-handedly created the foundations for modern statistical science, without detailed study of his predecessors. When young he was ignorant not only of the Continental contributions but even of contemporary publications in English.” (p. 738)

What is not so well known is that Fisher was the ultimate outsider when he brought about this change of paradigms in statistical science. As an undergraduate, he studied mathematics at Cambridge, and then did graduate work in statistical mechanics and quantum theory. His meager knowledge of statistics came from his study of astronomy; see Box (1978). That, however did not stop him from publishing his first paper in statistics in 1912 (still an undergraduate) on “curve fitting”, questioning Karl Pearson’s method of moments and proposing a new method that was eventually to become the likelihood method in his 1921 paper. Continue reading

Categories: Fisher, phil/history of stat, Phil6334/ Econ 6614, Spanos, Statistics | 2 Comments

Guest Post: STEPHEN SENN: ‘Fisher’s alternative to the alternative’

“You May Believe You Are a Bayesian But You Are Probably Wrong”

.

As part of the week of posts on R.A.Fisher (February 17, 1890 – July 29, 1962), I reblog a guest post by Stephen Senn from 2012, and 2017. See especially the comments from Feb 2017. 

‘Fisher’s alternative to the alternative’

By: Stephen Senn

[2012 marked] the 50th anniversary of RA Fisher’s death. It is a good excuse, I think, to draw attention to an aspect of his philosophy of significance testing. In his extremely interesting essay on Fisher, Jimmie Savage drew attention to a problem in Fisher’s approach to testing. In describing Fisher’s aversion to power functions Savage writes, ‘Fisher says that some tests are more sensitive than others, and I cannot help suspecting that that comes to very much the same thing as thinking about the power function.’ (Savage 1976) (P473).

The modern statistician, however, has an advantage here denied to Savage. Savage’s essay was published posthumously in 1976 and the lecture on which it was based was given in Detroit on 29 December 1971 (P441). At that time Fisher’s scientific correspondence did not form part of his available oeuvre but in 1990 Henry Bennett’s magnificent edition of Fisher’s statistical correspondence (Bennett 1990) was published and this throws light on many aspects of Fisher’s thought including on significance tests. Continue reading

Categories: Fisher, S. Senn, Statistics | Leave a comment

Happy Birthday R.A. Fisher: ‘Two New Properties of Mathematical Likelihood’

17 February 1890–29 July 1962

Today is R.A. Fisher’s birthday. I will post some Fisherian items this week in recognition of it*. This paper comes just before the conflicts with Neyman and Pearson erupted.  Fisher links his tests and sufficiency, to the Neyman and Pearson lemma in terms of power.  We may see them as ending up in a similar place while starting from different origins. I quote just the most relevant portions…the full article is linked below. Happy Birthday Fisher!

Two New Properties of Mathematical Likelihood

by R.A. Fisher, F.R.S.

Proceedings of the Royal Society, Series A, 144: 285-307 (1934)

  The property that where a sufficient statistic exists, the likelihood, apart from a factor independent of the parameter to be estimated, is a function only of the parameter and the sufficient statistic, explains the principle result obtained by Neyman and Pearson in discussing the efficacy of tests of significance.  Neyman and Pearson introduce the notion that any chosen test of a hypothesis H0 is more powerful than any other equivalent test, with regard to an alternative hypothesis H1, when it rejects H0 in a set of samples having an assigned aggregate frequency ε when H0 is true, and the greatest possible aggregate frequency when H1 is true. If any group of samples can be found within the region of rejection whose probability of occurrence on the hypothesis H1 is less than that of any other group of samples outside the region, but is not less on the hypothesis H0, then the test can evidently be made more powerful by substituting the one group for the other. Continue reading

Categories: Fisher, phil/history of stat, Phil6334/ Econ 6614, Statistics | Tags: , , , | Leave a comment

Mayo-Spanos Summer Seminar PhilStat: July 28-Aug 11, 2019: Instructions for Applying Now Available

INSTRUCTIONS FOR APPLYING ARE NOW AVAILABLE

See the Blog at SummerSeminarPhilStat

Categories: Announcement, Error Statistics, Statistics | Leave a comment

You Should Be Binge Reading the (Strong) Likelihood Principle

 

.

An essential component of inference based on familiar frequentist notions: p-values, significance and confidence levels, is the relevant sampling distribution (hence the term sampling theory, or my preferred error statistics, as we get error probabilities from the sampling distribution). This feature results in violations of a principle known as the strong likelihood principle (SLP). To state the SLP roughly, it asserts that all the evidential import in the data (for parametric inference within a model) resides in the likelihoods. If accepted, it would render error probabilities irrelevant post data.

SLP (We often drop the “strong” and just call it the LP. The “weak” LP just boils down to sufficiency)

For any two experiments E1 and E2 with different probability models f1, f2, but with the same unknown parameter θ, if outcomes x* and y* (from E1 and E2 respectively) determine the same (i.e., proportional) likelihood function (f1(x*; θ) = cf2(y*; θ) for all θ), then x* and y* are inferentially equivalent (for an inference about θ).

(What differentiates the weak and the strong LP is that the weak refers to a single experiment.)
Continue reading

Categories: Error Statistics, Statistics, strong likelihood principle | 2 Comments

Neyman-Pearson Tests: An Episode in Anglo-Polish Collaboration: Excerpt from Excursion 3 (3.2)

Neyman & Pearson

3.2 N-P Tests: An Episode in Anglo-Polish Collaboration*

We proceed by setting up a specific hypothesis to test, Hin Neyman’s and my terminology, the null hypothesis in R. A. Fisher’s . . . in choosing the test, we take into account alternatives to Hwhich we believe possible or at any rate consider it most important to be on the look out for . . .Three steps in constructing the test may be defined:

Step 1. We must first specify the set of results . . .

Step 2. We then divide this set by a system of ordered boundaries . . .such that as we pass across one boundary and proceed to the next, we come to a class of results which makes us more and more inclined, on the information available, to reject the hypothesis tested in favour of alternatives which differ from it by increasing amounts.

Step 3. We then, if possible, associate with each contour level the chance that, if H0 is true, a result will occur in random sampling lying beyond that level . . .

In our first papers [in 1928] we suggested that the likelihood ratio criterion, λ, was a very useful one . . . Thus Step 2 proceeded Step 3. In later papers [1933–1938] we started with a fixed value for the chance, ε, of Step 3 . . . However, although the mathematical procedure may put Step 3 before 2, we cannot put this into operation before we have decided, under Step 2, on the guiding principle to be used in choosing the contour system. That is why I have numbered the steps in this order. (Egon Pearson 1947, p. 173)

In addition to Pearson’s 1947 paper, the museum follows his account in “The Neyman–Pearson Story: 1926–34” (Pearson 1970). The subtitle is “Historical Sidelights on an Episode in Anglo-Polish Collaboration”!

We meet Jerzy Neyman at the point he’s sent to have his work sized up by Karl Pearson at University College in 1925/26. Neyman wasn’t that impressed: Continue reading

Categories: E.S. Pearson, Neyman, Statistical Inference as Severe Testing, statistical tests, Statistics | 1 Comment

Mementos for Excursion 2 Tour II: Falsification, Pseudoscience, Induction (2.3-2.7)

.

Excursion 2 Tour II: Falsification, Pseudoscience, Induction*

Outline of Tour. Tour II visits Popper, falsification, corroboration, Duhem’s problem (what to blame in the case of anomalies) and the demarcation of science and pseudoscience (2.3). While Popper comes up short on each, the reader is led to improve on Popper’s notions (live exhibit (v)). Central ingredients for our journey are put in place via souvenirs: a framework of models and problems, and a post-Popperian language to speak about inductive inference. Defining a severe test, for Popperians, is linked to when data supply novel evidence for a hypothesis: family feuds about defining novelty are discussed (2.4). We move into Fisherian significance tests and the crucial requirements he set (often overlooked): isolated significant results are poor evidence of a genuine effect, and statistical significance doesn’t warrant substantive, e.g., causal inference (2.5). Applying our new demarcation criterion to a plausible effect (males are more likely than females to feel threatened by their partner’s success), we argue that a real revolution in psychology will need to be more revolutionary than at present. Whole inquiries might have to be falsified, their measurement schemes questioned (2.6). The Tour’s pieces are synthesized in (2.7), where a guest lecturer explains how to solve the problem of induction now, having redefined induction as severe testing.

Mementos from 2.3 Continue reading

Categories: Popper, Statistical Inference as Severe Testing, Statistics | 5 Comments

Tour Guide Mementos and QUIZ 2.1 (Excursion 2 Tour I: Induction and Confirmation)

.

Excursion 2 Tour I: Induction and Confirmation (Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars)

Tour Blurb. The roots of rival statistical accounts go back to the logical Problem of Induction. (2.1) The logical problem of induction is a matter of finding an argument to justify a type of argument (enumerative induction), so it is important to be clear on arguments, their soundness versus their validity. These are key concepts of fundamental importance to our journey. Given that any attempt to solve the logical problem of induction leads to circularity, philosophers turned instead to building logics that seemed to capture our intuitions about induction. This led to confirmation theory and some projects in today’s formal epistemology. There’s an analogy between contrasting views in philosophy and statistics: Carnapian confirmation is to Bayesian statistics, as Popperian falsification is to frequentist error statistics. Logics of confirmation take the form of probabilisms, either in the form of raising the probability of a hypothesis, or arriving at a posterior probability. (2.2) The contrast between these types of probabilisms, and the problems each is found to have in confirmation theory are directly relevant to the types of probabilisms in statistics. Notably, Harold Jeffreys’ non-subjective Bayesianism, and current spin-offs, share features with Carnapian inductive logics. We examine the problem of irrelevant conjunctions: that if x confirms H, it confirms (H & J) for any J. This also leads to what’s called the tacking paradox.

Quiz on 2.1 Soundness vs Validity in Deductive Logic. Let ~C be the denial of claim C. For each of the following argument, indicate whether it is valid and sound, valid but unsound, invalid. Continue reading

Categories: induction, SIST, Statistical Inference as Severe Testing, Statistics | 10 Comments

Excursion 1 Tour I: Beyond Probabilism and Performance: Severity Requirement (1.1)

The cruise begins…

I’m talking about a specific, extra type of integrity that is [beyond] not lying, but bending over backwards to show how you’re maybe wrong, that you ought to have when acting as a scientist. (Feynman 1974/1985, p. 387)

It is easy to lie with statistics. Or so the cliché goes. It is also very difficult to uncover these lies without statistical methods – at least of the right kind. Self- correcting statistical methods are needed, and, with minimal technical fanfare, that’s what I aim to illuminate. Since Darrell Huff wrote How to Lie with Statistics in 1954, ways of lying with statistics are so well worn as to have emerged in reverberating slogans:

  • Association is not causation.
  • Statistical significance is not substantive significamce
  • No evidence of risk is not evidence of no risk.
  • If you torture the data enough, they will confess.

Exposés of fallacies and foibles ranging from professional manuals and task forces to more popularized debunking treatises are legion. New evidence has piled up showing lack of replication and all manner of selection and publication biases. Even expanded “evidence-based” practices, whose very rationale is to emulate experimental controls, are not immune from allegations of illicit cherry picking, significance seeking, P-hacking, and assorted modes of extra- ordinary rendition of data. Attempts to restore credibility have gone far beyond the cottage industries of just a few years ago, to entirely new research programs: statistical fraud-busting, statistical forensics, technical activism, and widespread reproducibility studies. There are proposed methodological reforms – many are generally welcome (preregistration of experiments, transparency about data collection, discouraging mechanical uses of statistics), some are quite radical. If we are to appraise these evidence policy reforms, a much better grasp of some central statistical problems is needed.

Continue reading

Categories: Statistical Inference as Severe Testing, Statistics | 9 Comments

A. Spanos: Egon Pearson’s Neglected Contributions to Statistics

Continuing with the discussion of E.S. Pearson in honor of his birthday:

Egon Pearson’s Neglected Contributions to Statistics

by Aris Spanos

    Egon Pearson (11 August 1895 – 12 June 1980), is widely known today for his contribution in recasting of Fisher’s significance testing into the Neyman-Pearson (1933) theory of hypothesis testing. Occasionally, he is also credited with contributions in promoting statistical methods in industry and in the history of modern statistics; see Bartlett (1981). What is rarely mentioned is Egon’s early pioneering work on:

(i) specification: the need to state explicitly the inductive premises of one’s inferences,

(ii) robustness: evaluating the ‘sensitivity’ of inferential procedures to departures from the Normality assumption, as well as

(iii) Mis-Specification (M-S) testing: probing for potential departures from the Normality  assumption.

Arguably, modern frequentist inference began with the development of various finite sample inference procedures, initially by William Gosset (1908) [of the Student’s t fame] and then Fisher (1915, 1921, 1922a-b). These inference procedures revolved around a particular statistical model, known today as the simple Normal model: Continue reading

Categories: E.S. Pearson, Egon Pearson, Statistics | 1 Comment

Egon Pearson’s Heresy

E.S. Pearson: 11 Aug 1895-12 June 1980.

Today is Egon Pearson’s birthday. In honor of his birthday, I am posting “Statistical Concepts in Their Relation to Reality” (Pearson 1955). I’ve posted it several times over the years, but always find a new gem or two, despite its being so short. E. Pearson rejected some of the familiar tenets that have come to be associated with Neyman and Pearson (N-P) statistical tests, notably the idea that the essential justification for tests resides in a long-run control of rates of erroneous interpretations–what he termed the “behavioral” rationale of tests. In an unpublished letter E. Pearson wrote to Birnbaum (1974), he talks about N-P theory admitting of two interpretations: behavioral and evidential:

“I think you will pick up here and there in my own papers signs of evidentiality, and you can say now that we or I should have stated clearly the difference between the behavioral and evidential interpretations. Certainly we have suffered since in the way the people have concentrated (to an absurd extent often) on behavioral interpretations”.

Continue reading

Categories: phil/history of stat, Philosophy of Statistics, Statistics | Tags: , , | 2 Comments

Blog at WordPress.com.