Error Statistics

Fallacies of statistics & statistics journalism, and how to avoid them: Summary & Slides Day #8 (Phil 6334)


We spent the first half of Thursday’s seminar discussing the Fisher, Neyman, and E. Pearson “triad”[i]. So, since it’s Saturday night, join me in rereading for the nth time these three very short articles. The key issues were: error of the second kind, behavioristic vs evidential interpretations, and Fisher’s mysterious fiducial intervals. Although we often hear exaggerated accounts of the differences between the Fisherian and Neyman-Pearson (N-P) methodologies, in fact N-P were simply providing Fisher’s tests with a logical ground (even though other foundations for tests are still possible), and Fisher welcomed this gladly. Notably, N-P showed that with only a single null hypothesis and no explicit alternative, it is possible to have tests where the probability of rejecting the null when it is true exceeds the probability of rejecting it when it is false. Hacking called such tests “worse than useless”, and N-P developed a theory of testing that avoids such problems. Statistical journalists who report on the alleged “inconsistent hybrid” (a term popularized by Gigerenzer) should recognize the extent to which the apparent disagreements on method reflect professional squabbling between Fisher and Neyman after 1935. [A recent example is the Nature article by R. Nuzzo discussed in note [ii] below.] The two types of tests are best seen as asking different questions in different contexts. They both follow error-statistical reasoning.
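
To see concretely what a “worse than useless” test amounts to, here is a minimal sketch (Python, with illustrative numbers, not from the seminar readings): a test of H0: μ = 0 that rejects only for very negative sample means has a greater probability of rejecting the null when it is true than when the positive discrepancy μ = 1 holds.

```python
# Minimal sketch of a biased ("worse than useless") test: reject H0: mu = 0
# whenever the sample mean falls in the far *left* tail, then evaluate the
# test against the positive discrepancy mu = 1.
from scipy import stats
import numpy as np

n, sigma = 25, 1.0
se = sigma / np.sqrt(n)
crit = stats.norm.ppf(0.05) * se      # reject when xbar < crit (left tail)

size = stats.norm.cdf(crit, loc=0.0, scale=se)   # P(reject | mu = 0) = 0.05
power = stats.norm.cdf(crit, loc=1.0, scale=se)  # P(reject | mu = 1), essentially 0

print(f"size  (prob. of rejecting a true null):  {size:.3f}")
print(f"power (prob. of rejecting when mu = 1):  {power:.1e}")
```

Requiring tests to be unbiased against the alternatives of interest, as N-P do, rules out such constructions.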

We then turned to a severity evaluation of tests as a way to avoid classic fallacies and misinterpretations.


“Probability/Statistics Lecture Notes 5 for 3/20/14: Post-data severity evaluation” (Prof. Spanos)

[i] Fisher, Neyman, and E. Pearson.

[ii] In a recent Nature article by Regina Nuzzo, we hear that N-P statistics “was spearheaded in the late 1920s by Fisher’s bitter rivals”. Nonsense. It was Neyman and Pearson who came to Fisher’s defense against the old guard. See for example Aris Spanos’ post here. According to Nuzzo, “Neyman called some of Fisher’s work mathematically ‘worse than useless’”. It never happened. Nor does she reveal (if she is aware of it) the purely technical notion being referred to. Nuzzo’s article doesn’t give the source of the quote; I’m guessing it’s from Gigerenzer quoting Hacking, or Goodman (whom she is clearly following and cites) quoting Gigerenzer quoting Hacking, but that’s a big jumble.

N-P did provide a theory of testing that could avoid the purely technical problem that can theoretically emerge in an account that does not consider alternatives or discrepancies from a null. As for Fisher’s charge against an extreme behavioristic, acceptance sampling approach, there’s something to this, but as Neyman’s response shows, Fisher, in practice, was more inclined toward a dichotomous “thumbs up or down” use of tests than Neyman. Recall Neyman’s “inferential” use of power in my last post.  If Neyman really had altered the tests to such an extreme, it wouldn’t have required Barnard to point it out to Fisher many years later. Yet suddenly, according to Fisher, we’re in the grips of Russian 5-year plans or U.S. robotic widget assembly lines! I’m not defending either side in these fractious disputes, but alerting the reader to what’s behind a lot of writing on tests (see my anger management post). I can understand how Nuzzo’s remark could arise from a quote of a quote, doubly out of context. But I think science writers on statistical controversies have an obligation to try to avoid being misled by whomever they’re listening to at the moment. There are really only a small handful of howlers to take note of. It’s fine to sign on with one side, but not to state controversial points as beyond debate. I’ll have more to say about her article in a later post (and thanks to the many of you who have sent it to me).

Gigerenzer, G. (1993). The superego, the ego, and the id in statistical reasoning. In G. Keren & C. Lewis (Eds.), A handbook for data analysis in the behavioral sciences: Methodological issues (pp. 311-339). Hillsdale: Lawrence Erlbaum Associates.

Hacking, I. (1965). Logic of statistical inference. Cambridge: Cambridge University Press.

Nuzzo, R. (2014). “Scientific method: Statistical errors: P values, the ‘gold standard’ of statistical validity, are not as reliable as many scientists assume”. Nature, 12 February 2014.

Categories: phil/history of stat, Phil6334, science communication, Severity, significance tests, Statistics | Tags: | 35 Comments

New SEV calculator (guest app: Durvasula)

Karthik Durvasula, a blog follower[i], sent me a highly apt severity app that he created: https://karthikdurvasula.shinyapps.io/Severity_Calculator/
I have his permission to post it or use it for pedagogical purposes, so since it’s Saturday night, go ahead and have some fun with it. Durvasula had the great idea of using it to illustrate howlers. Also, I would add, to discover them.
It follows many of the elements of the Excel Sev Program discussed recently, but it’s easier to use.* (I’ll add some notes about the particular claim (i.e., discrepancy) for which SEV is being computed later on).
*If others want to tweak or improve it, he might pass on the source code (write to me on this).
[i] I might note that Durvasula was the winner of the January palindrome contest.
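
For readers who want to see what such an app computes, here is a minimal sketch (Python; illustrative numbers, not Durvasula’s code) of the post-data severity evaluation for the one-sided Normal test T+: H0: μ ≤ μ0 vs H1: μ > μ0 with σ known. For the claim μ > μ1 following a rejection, SEV(μ > μ1) = P(X̄ ≤ x̄_obs; μ = μ1): the probability the test would have produced a less impressive result were μ only μ1.

```python
# Sketch of post-data severity for the one-sided Normal test T+.
from scipy import stats
import numpy as np

def severity(xbar_obs, mu1, sigma=1.0, n=100):
    """SEV(mu > mu1) given observed mean xbar_obs in test T+ of H0: mu <= mu0."""
    se = sigma / np.sqrt(n)
    return stats.norm.cdf((xbar_obs - mu1) / se)

# Example: n = 100, sigma = 1, observed xbar = 0.2 (a 2-sigma rejection of mu0 = 0)
for mu1 in (0.0, 0.1, 0.2, 0.3):
    print(f"SEV(mu > {mu1:.1f}) = {severity(0.2, mu1):.3f}")
# The claim mu > 0 passes with high severity (~0.977), mu > 0.2 only ~0.5, and
# mu > 0.3 poorly (~0.16): the same rejection probes different discrepancies
# very differently.
```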
Categories: Severity, Statistics | 12 Comments

Cosma Shalizi gets tenure (at last!) (metastat announcement)

News Flash! Congratulations to Cosma Shalizi who announced yesterday that he’d been granted tenure (Statistics, Carnegie Mellon). Cosma is a leading error statistician, a creative polymath and long-time blogger (at Three-Toed Sloth). Shalizi wrote an early book review of EGEK (Mayo 1996)* that people still send me from time to time, in case I hadn’t seen it! You can find it on this blog from 2 years ago (posted by Jean Miller). A discussion of a meeting of the minds between Shalizi and Andrew Gelman is here.

*Error and the Growth of Experimental Knowledge.

Categories: Announcement, Error Statistics, Statistics | Tags: | Leave a comment

Phil6334: Popper self-test

Those reading Popper[i] with us might be interested in an (undergraduate) item I came across: Popper Self-Test Questions. It includes multiple choice questions, quotes to ponder, and thumbnail definitions at the end[ii].
[i] Popper reading (for Feb 13, 2014) from Conjectures and Refutations
[ii] I might note the “No-Pain philosophy” (3 part) Popper posts from this blog: parts 1, 2, and 3.

Categories: Error Statistics | 1 Comment

Phil6334: “Philosophy of Statistical Inference and Modeling” New Course: Spring 2014: Mayo and Spanos: (Virginia Tech) UPDATE: JAN 21

FURTHER UPDATED: New course for Spring 2014: Thurs 3:30-6:15 (Randolph 209)

Syllabus (first installment): Phil 6334: Philosophy of Statistical Inference and Modeling


D. Mayo and A. Spanos

Contact: error@vt.edu

This new course, to be jointly taught by Professors D. Mayo (Philosophy) and A. Spanos (Economics), will provide an in-depth introduction to graduate-level research in philosophy of inductive-statistical inference and probabilistic methods of evidence (a branch of formal epistemology). We explore philosophical problems of confirmation and induction, the philosophy and history of frequentist and Bayesian approaches, and key foundational controversies surrounding tools of statistical data analytics, modeling and hypothesis testing in the natural and social sciences, and in evidence-based policy.

We now have some tentative topics and dates:

 


1. 1/23 Introduction to the Course: 4 waves of controversy in the philosophy of statistics
2. 1/30 How to tell what’s true about statistical inference: Probabilism, performance and probativeness
3. 2/6 Induction and Confirmation: Formal Epistemology
4. 2/13 Induction, falsification, severe tests: Popper and Beyond
5. 2/20 Statistical models and estimation: the Basics
6. 2/27 Fundamentals of significance tests and severe testing
7. 3/6 Five sigma and the Higgs Boson discovery: Is it “bad science”?
SPRING BREAK Statistical Exercises While Sunning
8. 3/20 Fraudbusting and Scapegoating: Replicability and big data: are most scientific results false?
9. 3/27 How can we test the assumptions of statistical models?
All models are false; no methods are objective: Philosophical problems of misspecification testing: Spanos method
10. 4/3 Fundamentals of Statistical Testing: Family Feuds and 70 years of controversy
11. 4/10 Error Statistical Philosophy: Highly Probable vs Highly Probed
Some howlers of testing
12. 4/17 Whatever happened to Bayesian philosophical foundations? Dutch books etc. Fundamentals of Bayesian statistics
13. 4/24 Bayesian-frequentist reconciliations, unifications, and O-Bayesians
14. 5/1 Overview: Answering the critics: Should statistical philosophy be divorced from methodology?
(15. TBA) Topic to be chosen (Resampling statistics and new journal policies? Likelihood principle)

Interested in attending? E.R.R.O.R.S.* can fund travel (presumably driving) and provide accommodation for Thurs. night in a conference lodge in Blacksburg for a few people through (or part of) the semester. If interested, write ASAP for details (with a brief description of your interest and background) to error@vt.edu. Several people asked about long-distance hook-ups: we will try to provide some sessions by Skype, and will put each of the seminar items here (also check the Phil6334 page on this blog).

A sample of questions we consider*:

  • What makes an inquiry scientific? objective? When are we warranted in generalizing from data?
  • What is the “traditional problem of induction”?  Is it really insoluble?  Does it matter in practice?
  • What is the role of probability in uncertain inference? (to assign degrees of confirmation or belief? to characterize the reliability of test procedures?) 3P’s: Probabilism, performance and probativeness
  • What is probability? Random variables? Estimates? What is the relevance of long-run error probabilities for inductive inference in science?
  • What did Popper really say about severe testing, induction, falsification? Is it time for a new definition of pseudoscience?
  • Confirmation and falsification: Carnap and Popper, paradoxes of confirmation; contemporary formal epistemology
  • What is the current state of play in the “statistical wars” e.g., between frequentists, likelihoodists, and (subjective vs. “non-subjective”) Bayesians?
  • How should one specify and interpret p-values, type I and II errors, confidence levels?  Can one tell the truth (and avoid fallacies) with statistics? Do the “reformers” themselves need reform?
  • Is it unscientific (ad hoc, degenerating) to use the same data both in constructing and testing hypotheses? When and why?
  • Is it possible to test assumptions of statistical models without circularity?
  • Is the new research on “replicability” well-founded, or an erroneous use of screening statistics for long-run performance?
  • Should randomized studies be the “gold standard” for “evidence-based” science and policy?
  • What’s the problem with big data: cherry-picking, data mining, multiple testing
  • The many faces of Bayesian statistics: Can there be uninformative prior probabilities? (No) Principles of indifference over the years
  • Statistical fraudbusting: psychology, economics, evidence-based policy
  • Applied controversies (selected): Higgs experiments, climate modeling, social psychology, econometric modeling, development economics

D. Mayo (books):

How to Tell What’s True About Statistical Inference, (Cambridge, in progress).

Error and the Growth of Experimental Knowledge, Chicago: University of Chicago Press, 1996. (Winner of 1998 Lakatos Prize).

Acceptable Evidence: Science and Values in Risk Management, co-edited with Rachelle Hollander, New York: Oxford University Press, 1994.

Aris Spanos (books):

Probability Theory and Statistical Inference, Cambridge, 1999.

Statistical Foundations of Econometric Modeling, Cambridge, 1986.

Joint (books): Error and Inference: Recent Exchanges on Experimental Reasoning, Reliability and the Objectivity and Rationality of Science, D. Mayo & A. Spanos (eds.), Cambridge: Cambridge University Press, 2010. [The book includes both papers and exchanges between Mayo and A. Chalmers, A. Musgrave, P. Achinstein, J. Worrall, C. Glymour, A. Spanos, and joint papers with Mayo and Sir David Cox].

Categories: Announcement, Error Statistics, Statistics | 5 Comments

Objective/subjective, dirty hands and all that: Gelman/ Wasserman blogolog (ii)

Andrew Gelman says that as a philosopher, I should appreciate his blog post today, in which he records his frustration: “Against aggressive definitions: No, I don’t think it helps to describe Bayes as ‘the analysis of subjective beliefs’…” Gelman writes:

I get frustrated with what might be called “aggressive definitions,” where people use a restrictive definition of something they don’t like. For example, Larry Wasserman writes (as reported by Deborah Mayo):

“I wish people were clearer about what Bayes is/is not and what frequentist inference is/is not. Bayes is the analysis of subjective beliefs but provides no frequency guarantees. Frequentist inference is about making procedures that have frequency guarantees but makes no pretense of representing anyone’s beliefs.”

I’ll accept Larry’s definition of frequentist inference. But as for his definition of Bayesian inference: No no no no no. The probabilities we use in our Bayesian inference are not subjective, or, they’re no more subjective than the logistic regressions and normal distributions and Poisson distributions and so forth that fill up all the textbooks on frequentist inference.

To quickly record some of my own frustrations*: First, I would disagree with Wasserman’s characterization of frequentist inference, but, as is clear from Larry’s comments on my reaction to him, I think he concurs that he was just giving a broad contrast. Please see Note [1] for a remark from my post: Comments on Wasserman’s “what is Bayesian/frequentist inference?” Also relevant is a Gelman post on the Bayesian name: [2].

Second, Gelman’s “no more subjective than…” evokes  remarks I’ve made before. For example, in “What should philosophers of science do…” I wrote:

Arguments given for some very popular slogans (mostly by non-philosophers), are too readily taken on faith as canon by others, and are repeated as gospel. Examples are easily found: all models are false, no models are falsifiable, everything is subjective, or equally subjective and objective, and the only properly epistemological use of probability is to supply posterior probabilities for quantifying actual or rational degrees of belief. Then there is the cluster of “howlers” allegedly committed by frequentist error statistical methods repeated verbatim (discussed on this blog).

I’ve written a lot about objectivity on this blog, e.g., here, here and here (and in real life), but what’s the point if people just rehearse the “everything is a mixture…” line, without making deeply important distinctions? I really think that, next to the “all models are false” slogan, the most confusion has been engendered by the “no methods are objective” slogan. However much we may aim at objective constraints, it is often urged, we can never have “clean hands” free of the influence of beliefs and interests, and we invariably sully methods of inquiry by the entry of background beliefs and personal judgments in their specification and interpretation. Continue reading

Categories: Bayesian/frequentist, Error Statistics, Gelman, Objectivity, Statistics | 41 Comments

Two Severities? (PhilSci and PhilStat)

The blog “It’s Chancy” (Corey Yanofsky) has a post today about “two severities” which warrants clarification. Two distinctions are being blurred: between formal and informal severity assessments, and between a statistical philosophy (something Corey says he’s interested in) and its relevance to philosophy of science (which he isn’t). I call the latter an error statistical philosophy of science. The former requires formal, semi-formal, and informal severity assessments. Here’s his post:

In the comments to my first post on severity, Professor Mayo noted some apparent and some actual misstatements of her views. To avert misunderstandings, she directed readers to two of her articles, one of which opens by making this distinction:

“Error statistics refers to a standpoint regarding both (1) a general philosophy of science and the roles probability plays in inductive inference, and (2) a cluster of statistical tools, their interpretation, and their justification.”

In Mayo’s writings I see  two interrelated notions of severity corresponding to the two items listed in the quote: (1) an informal severity notion that Mayo uses when discussing philosophy of science and specific scientific investigations, and (2) Mayo’s formalization of severity at the data analysis level.

One of my besetting flaws is a tendency to take a narrow conceptual focus to the detriment of the wider context. In the case of Severity, part one, I think I ended up making claims about severity that were wrong. I was narrowly focused on severity in sense (2) — in fact, on one specific equation within (2) — but used a mish-mash of ideas and terminology drawn from all of my readings of Mayo’s work. When read through a philosophy-of-science lens, the result is a distorted and misstated version of severity in sense (1).

As a philosopher of science, I’m a rank amateur; I’m not equipped to add anything to the conversation about severity as a philosophy of science. My topic is statistics, not philosophy, and so I want to warn readers against interpreting Severity, part one as a description of Mayo’s philosophy of science; it’s more of a wordy introduction to the formal definition of severity in sense (2). [It’s Chancy, Jan 11, 2014]

A needed clarification may be found in a post of mine which begins: 

Error statistics: (1) There is a “statistical philosophy” and a philosophy of science. (a) An error-statistical philosophy alludes to the methodological principles and foundations associated with frequentist error-statistical methods. (b) An error-statistical philosophy of science, on the other hand, involves using the error-statistical methods, formally or informally, to deal with problems of philosophy of science: to model scientific inference (actual or rational), to scrutinize principles of inference, and to address philosophical problems about evidence and inference (the problem of induction, underdetermination, warranting evidence, theory testing, etc.).

I assume the interest here* is on the former, (a). I have stated it in numerous ways, but the basic position is that inductive inference—i.e., data-transcending inference—calls for methods of controlling and evaluating error probabilities (even if only approximate). An inductive inference, in this conception, takes the form of inferring hypotheses or claims to the extent that they have been well tested. It also requires reporting claims that have not passed severely, or have passed with low severity. In the “severe testing” philosophy of induction, the quantitative assessment offered by error probabilities tells us not “how probable” but, rather, “how well probed” hypotheses are.  The local canonical hypotheses of formal tests and estimation methods need not be the ones we entertain post data; but they give us a place to start without having to go “the designer-clothes” route.

The post-data interpretations might be formal, semi-formal, or informal.

See also: Staley’s review of Error and Inference (Mayo and Spanos eds.)

Categories: Review of Error and Inference, Severity, StatSci meets PhilSci | 52 Comments

“Philosophy of Statistical Inference and Modeling” New Course: Spring 2014: Mayo and Spanos: (Virginia Tech)

New course for Spring 2014: Thursday 3:30-6:15

Phil 6334: Philosophy of Statistical Inference and Modeling

D. Mayo and A. Spanos

Contact: error@vt.edu

This new course, to be jointly taught by Professors D. Mayo (Philosophy) and A. Spanos (Economics), will provide an in-depth introduction to graduate-level research in philosophy of inductive-statistical inference and probabilistic methods of evidence (a branch of formal epistemology). We explore philosophical problems of confirmation and induction, the philosophy and history of frequentist and Bayesian approaches, and key foundational controversies surrounding tools of statistical data analytics, modeling and hypothesis testing in the natural and social sciences, and in evidence-based policy.


A sample of questions we consider*:

  • What makes an inquiry scientific? objective? When are we warranted in generalizing from data?
  • What is the “traditional problem of induction”?  Is it really insoluble?  Does it matter in practice?
  • What is the role of probability in uncertain inference? (to assign degrees of confirmation or belief? to characterize the reliability of test procedures?) 3P’s: Probabilism, performance and probativeness
  • What is probability? Random variables? Estimates? What is the relevance of long-run error probabilities for inductive inference in science?
  • What did Popper really say about severe testing, induction, falsification? Is it time for a new definition of pseudoscience?
  • Confirmation and falsification: Carnap and Popper, paradoxes of confirmation; contemporary formal epistemology
  • What is the current state of play in the “statistical wars” e.g., between frequentists, likelihoodists, and (subjective vs. “non-subjective”) Bayesians?
  • How should one specify and interpret p-values, type I and II errors, confidence levels?  Can one tell the truth (and avoid fallacies) with statistics? Do the “reformers” themselves need reform?
  • Is it unscientific (ad hoc, degenerating) to use the same data both in constructing and testing hypotheses? When and why?
  • Is it possible to test assumptions of statistical models without circularity?
  • Is the new research on “replicability” well-founded, or an erroneous use of screening statistics for long-run performance?
  • Should randomized studies be the “gold standard” for “evidence-based” science and policy?
  • What’s the problem with big data: cherry-picking, data mining, multiple testing
  • The many faces of Bayesian statistics: Can there be uninformative prior probabilities? (No) Principles of indifference over the years
  • Statistical fraudbusting: psychology, economics, evidence-based policy
  • Applied controversies (selected): Higgs experiments, climate modeling, social psychology, econometric modeling, development economics

Interested in attending? E.R.R.O.R.S.* can fund travel (presumably driving) and provide lodging for Thurs. night in a conference lodge in Blacksburg for a few people through (or part of)  the semester. Topics will be posted over the next week, but if you might be interested, write ASAP for details (with a brief description of your interest and background) to error@vt.edu. 

*This course will be a brand new version of a related seminar we’ve led in the past, so we don’t have the syllabus set yet. We’re going to try something different this time. I’ll be updating in subsequent installments to the blog.

Dates: January 23, 30; February 6, 13, 20, 27; March 6, [March 8-16 break], 20, 27; April 3,10, 17, 24; May 1

D. Mayo (books):

How to Tell What’s True About Statistical Inference, (Cambridge, in progress).

Error and the Growth of Experimental Knowledge, Chicago: University of Chicago Press, 1996. (Winner of 1998 Lakatos Prize).

Acceptable Evidence: Science and Values in Risk Management, co-edited with Rachelle Hollander, New York: Oxford University Press, 1994.

Aris Spanos (books):

Probability Theory and Statistical Inference, Cambridge, 1999.

Statistical Foundations of Econometric Modeling, Cambridge, 1986.

Joint (books): Error and Inference: Recent Exchanges on Experimental Reasoning, Reliability and the Objectivity and Rationality of Science, D. Mayo & A. Spanos (eds.), Cambridge: Cambridge University Press, 2010. [The book includes both papers and exchanges between Mayo and A. Chalmers, A. Musgrave, P. Achinstein, J. Worrall, C. Glymour, A. Spanos, and joint papers with Mayo and Sir David Cox].

Categories: Announcement, Error Statistics, Statistics | 9 Comments

Your 2014 wishing well….

A reader asks how I would complete the following sentence:
I wish that new articles* written in 2014 would refrain from _______.

Here are my quick answers, in no special order:
(a) rehearsing the howlers of significance tests and other frequentist statistical methods;

(b) misinterpreting p-values, ignoring discrepancy assessments (and thus committing fallacies of rejection and non-rejection, as in the sketch after this list);

(c) confusing an assessment of boosts in belief (or support) in claim H, with assessing what (if anything) has been done to ensure/increase the severity of the tests H passes;

(d) declaring that “what we really want” are posterior probability assignments in statistical hypotheses without explaining what they would mean, and why we should want them;

(e) promoting the myth that frequentist tests (and estimates) form an inconsistent hybrid of incompatible philosophies (from Fisher and Neyman-Pearson);

(f) presupposing that a relevant assessment of the scientific credentials of research would be an estimate of the percentage of null hypotheses that are “true” (selected from an “urn of nulls”) given they are rejectable with a low p-value in an “up-down” use of tests;

(g) sidestepping the main sources of pseudoscience: insevere tests through interpretational and inferential latitude, and violations of statistical model assumptions.
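
As a concrete illustration of (b), the sketch below (Python, invented numbers) shows how a small p-value from a very large sample can fail to warrant any substantively large discrepancy, which a severity-style discrepancy assessment makes explicit.

```python
# Illustration of the fallacy of rejection with a huge sample:
# one-sided Normal test of H0: mu <= 0, sigma = 1, n = 10,000, observed xbar = 0.025.
from scipy import stats
import numpy as np

n, sigma, xbar = 10_000, 1.0, 0.025
se = sigma / np.sqrt(n)                          # 0.01
p_value = stats.norm.sf(xbar / se)               # ~0.006: "statistically significant"
sev_small = stats.norm.cdf((xbar - 0.01) / se)   # SEV(mu > 0.01) ~ 0.93
sev_large = stats.norm.cdf((xbar - 0.05) / se)   # SEV(mu > 0.05) ~ 0.006

print(f"p = {p_value:.4f}, SEV(mu > 0.01) = {sev_small:.3f}, "
      f"SEV(mu > 0.05) = {sev_large:.3f}")
# Reading the low p-value as evidence of a substantively large effect
# (mu > 0.05) would be a fallacy of rejection: that claim is poorly probed.
```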

The  “2014 wishing well” stands ready for your sentence completions.

*The question alluded to articles linked with philosophy & methodology of statistical science.

Categories: Error Statistics, science communication, Statistics | Leave a comment

Error Statistics Philosophy: 2013


I blog ergo I blog

Error Statistics Philosophy: 2013
Organized by Nicole Jinn & Jean Anne Miller* 

January 2013

(1/2) Severity as a ‘Metastatistical’ Assessment
(1/4) Severity Calculator
(1/6) Guest post: Bad Pharma? (S. Senn)
(1/9) RCTs, skeptics, and evidence-based policy
(1/10) James M. Buchanan
(1/11) Aris Spanos: James M. Buchanan: a scholar, teacher and friend
(1/12) Error Statistics Blog: Table of Contents
(1/15) Ontology & Methodology: Second call for Abstracts, Papers
(1/18) New Kvetch/PhilStock
(1/19) Saturday Night Brainstorming and Task Forces: (2013) TFSI on NHST
(1/22) New PhilStock
(1/23) P-values as posterior odds?
(1/26) Coming up: December U-Phil Contributions….
(1/27) U-Phil: S. Fletcher & N.Jinn
(1/30) U-Phil: J. A. Miller: Blogging the SLP

February 2013
(2/2) U-Phil: Ton o’ Bricks
(2/4) January Palindrome Winner
(2/6) Mark Chang (now) gets it right about circularity
(2/8) From Gelman’s blog: philosophy and the practice of Bayesian statistics
(2/9) New kvetch: Filly Fury
(2/10) U-PHIL: Gandenberger & Hennig: Blogging Birnbaum’s Proof
(2/11) U-Phil: Mayo’s response to Hennig and Gandenberger
(2/13) Statistics as a Counter to Heavyweights…who wrote this?
(2/16) Fisher and Neyman after anger management?
(2/17) R. A. Fisher: how an outsider revolutionized statistics
(2/20) Fisher: from ‘Two New Properties of Mathematical Likelihood’
(2/23) Stephen Senn: Also Smith and Jones
(2/26) PhilStock: DO < $70
(2/26) Statistically speaking… Continue reading

Categories: blog contents, Error Statistics, Statistics | Leave a comment

Wasserman on Wasserman: Update! December 28, 2013

Professor Larry Wasserman

I had invited Larry to give an update, and I’m delighted that he has! The discussion relates to the last post (by Spanos), which follows upon my deconstruction of Wasserman*. So, for your Saturday night reading pleasure, join me** in reviewing this and the past two blogs and the links within.

“Wasserman on Wasserman: Update! December 28, 2013”

My opinions have shifted a bit.

My reference to Franken’s joke suggested that the usual philosophical debates about the foundations of statistics were un-important, much like the debate about media bias. I was wrong on both counts.

First, I now think Franken was wrong. CNN and network news have a strong liberal bias, especially on economic issues. FOX has an obvious right wing, and anti-atheist bias. (At least FOX has some libertarians on the payroll.) And this does matter. Because people believe what they see on TV and what they read in the NY times. Paul Krugman’s socialist bullshit parading as economics has brainwashed millions of Americans. So media bias is much more than who makes better hummus.

Similarly, the Bayes-Frequentist debate still matters. And people — including many statisticians — are still confused about the distinction. I thought the basic Bayes-Frequentist debate was behind us. A year and a half of blogging (as well as reading other blogs) convinced me I was wrong here too. And this still does matter. Continue reading

Categories: Error Statistics, frequentist/Bayesian, Statistics, Wasserman | 55 Comments

A. Spanos lecture on “Frequentist Hypothesis Testing”


Aris Spanos

I attended a lecture by Aris Spanos to his graduate econometrics class here at Va Tech last week[i]. This course, which Spanos teaches every fall, gives a superb illumination of the disparate pieces involved in statistical inference and modeling, and affords clear foundations for how they are linked together. His slides follow the intro section. Some examples with severity assessments are also included.

Frequentist Hypothesis Testing: A Coherent Approach

Aris Spanos

1    Inherent difficulties in learning statistical testing

Statistical testing is arguably the most important, but also the most difficult and confusing, chapter of statistical inference, for several reasons, including the following.

(i) The need to introduce numerous new notions, concepts and procedures before one can paint — even in broad brushes — a coherent picture of hypothesis testing.

(ii) The current textbook discussion of statistical testing is both highly confusing and confused. There are several sources of confusion.

  • (a) Testing is conceptually one of the most sophisticated sub-fields of any scientific discipline.
  • (b) Inadequate knowledge by textbook writers who often do not have the technical skills to read and understand the original sources, and have to rely on second-hand accounts of previous textbook writers that are often misleading or just outright erroneous. In most of these textbooks hypothesis testing is poorly explained as an idiot’s guide to combining off-the-shelf formulae with statistical tables like the Normal, the Student’s t, the chi-square, etc., where the underlying statistical model that gives rise to the testing procedure is hidden in the background.
  • (c) The misleading portrayal of Neyman-Pearson testing as essentially decision-theoretic in nature, when in fact the latter has much greater affinity with Bayesian than with frequentist inference.
  • (d) A deliberate attempt to distort and cannibalize frequentist testing by certain Bayesian drumbeaters who revel in (unfairly) maligning frequentist inference in their attempts to motivate their preferred view on statistical inference.

(iii) The discussion of frequentist testing is rather incomplete insofar as it has been beleaguered by serious foundational problems since the 1930s. As a result, different applied fields have generated their own secondary literatures attempting to address these problems, but often making things much worse! Indeed, in some fields like psychology it has reached the stage where one has to correct the ‘corrections’ of those chastising the initial correctors!

In an attempt to alleviate problem (i), the discussion that follows uses a sketchy historical development of frequentist testing. To ameliorate problem (ii), the discussion includes ‘red flag’ pointers (¥) designed to highlight important points that shed light on certain erroneous interpretations or misleading arguments. The discussion will pay special attention to (iii), addressing some of the key foundational problems.

[i] It is based on Ch. 14 of Spanos (1999) Probability Theory and Statistical Inference. Cambridge[ii].

[ii] You can win a free copy of this 700+ page text by creating a simple palindrome! http://errorstatistics.com/palindrome/march-contest/

Categories: Bayesian/frequentist, Error Statistics, Severity, significance tests, Statistics | Tags: | 36 Comments

Surprising Facts about Surprising Facts


A paper of mine on “double-counting” and novel evidence just came out: “Some surprising facts about (the problem of) surprising facts” in Studies in History and Philosophy of Science (2013), http://dx.doi.org/10.1016/j.shpsa.2013.10.005

ABSTRACT: A common intuition about evidence is that if data x have been used to construct a hypothesis H, then x should not be used again in support of H. It is no surprise that x fits H, if H was deliberately constructed to accord with x. The question of when and why we should avoid such ‘‘double-counting’’ continues to be debated in philosophy and statistics. It arises as a prohibition against data mining, hunting for significance, tuning on the signal, and ad hoc hypotheses, and as a preference for predesignated hypotheses and ‘‘surprising’’ predictions. I have argued that it is the severity or probativeness of the test—or lack of it—that should determine whether a double-use of data is admissible. I examine a number of surprising ambiguities and unexpected facts that continue to bedevil this debate.

Categories: double-counting, Error Statistics, philosophy of science, Statistics | 36 Comments

The error statistician has a complex, messy, subtle, ingenious, piece-meal approach

A comment today by Stephen Senn leads me to post the last few sentences of my (2010) paper with David Cox, “Frequentist Statistics as a Theory of Inductive Inference”:

“A fundamental tenet of the conception of inductive learning most at home with the frequentist philosophy is that inductive inference requires building up incisive arguments and inferences by putting together several different piece-meal results; we have set out considerations to guide these pieces[i]. Although the complexity of the issues makes it more difficult to set out neatly, as, for example, one could by imagining that a single algorithm encompasses the whole of inductive inference, the payoff is an account that approaches the kind of arguments that scientists build up in order to obtain reliable knowledge and understanding of a field.” (273)[ii]

A reread for Saturday night?

[i] The pieces hang together by dint of the rationale growing out of a severity criterion (or something akin but using a different term).

[ii] Error and Inference: Recent Exchanges on Experimental Reasoning, Reliability and the Objectivity and Rationality of Science (D. Mayo and A. Spanos, eds.), Cambridge: Cambridge University Press: 1-27. This paper appeared in The Second Erich L. Lehmann Symposium: Optimality, 2006, Lecture Notes-Monograph Series, Volume 49, Institute of Mathematical Statistics, pp. 247-275.

Categories: Bayesian/frequentist, Error Statistics | 20 Comments

Why ecologists might want to read more philosophy of science

Mayo:

Jeremy Fox often publishes interesting blogposts like today’s. I’m “reblogging” straight from his site as an experiment.

Originally posted on Dynamic Ecology:

Someone* once said that scientists need to study philosophy of science about as much as birds need to study ornithology. And there’s definitely some truth to that, as evidenced by the fact that plenty of scientists do plenty of good science without any philosophical training.** But in this post I’ll argue that it’s not entirely true. There are reasons why scientists might want to read some philosophy of science.

Disclaimer: I am by no means a professional philosopher of science. I had several philosophy classes as an undergrad. My favorite two profs were both philosophers, so I took as many classes with them as I could. None of my classes were in philosophy of science, though. Since then, I’ve perhaps read a bit more philosophy of science than the average ecologist has; I’m not sure. But my reading is haphazard, not systematic. I also attend the philosophy seminars at Calgary…

View original 2,542 more words

Categories: Error Statistics | 12 Comments

“The probability that it be a statistical fluke” [iia]

My rationale for the last post is really just to highlight such passages as:

“Particle physicists have agreed, by convention, not to view an observed phenomenon as a discovery until the probability that it be a statistical fluke be below 1 in a million, a requirement that seems insanely draconian at first glance.” (Strassler)….

Even before the dust had settled regarding the discovery of a Standard Model-like Higgs particle, the nature and rationale of the 5-sigma discovery criterion began to be challenged. But my interest now is not in the fact that the 5-sigma discovery criterion is a convention, nor with the choice of 5. It is the understanding of “the probability that it be a statistical fluke” that interests me, because if we can get this right, I think we can understand a kind of equivocation that leads many to suppose that significance tests are being misinterpreted—even when they aren’t! So given that I’m stuck, unmoving, on this bus outside of London for 2+ hours (because of a car accident)—and the internet works—I’ll try to scratch out my point (expect errors, we’re moving now). Here’s another passage…

“Even when the probability of a particular statistical fluke, of a particular type, in a particular experiment seems to be very small indeed, we must remain cautious. …Is it really unlikely that someone, somewhere, will hit the jackpot, and see in their data an amazing statistical fluke that seems so impossible that it convincingly appears to be a new phenomenon?”

A very sketchy nutshell of the Higgs statistics: There is a general model of the detector, and within that model researchers define a “global signal strength” parameter “such that H0: μ = 0 corresponds to the background only hypothesis and μ = 1 corresponds to the Standard Model (SM) Higgs boson signal in addition to the background” (quote from an ATLAS report). The statistical test may be framed as a one-sided test; the test statistic records differences in the positive direction, in standard deviation or sigma units. The interest is not in the point against point hypotheses, but in finding discrepancies from H0 in the direction of the alternative, and then estimating their values.  The improbability of the 5-sigma excess alludes to the sampling Continue reading
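
For reference, the tail area behind the 5-sigma convention for the one-sided test just described is easy to compute; a quick sketch (not from the ATLAS report):

```python
# Probability, under the background-only hypothesis mu = 0, of a test
# statistic at least as large as a 5-sigma excess (one-sided tail area).
from scipy import stats

p_5sigma = stats.norm.sf(5.0)
print(f"P(>= 5 sigma | background only) = {p_5sigma:.2e}")  # ~2.9e-07
print(f"about 1 in {1 / p_5sigma:,.0f}")                    # ~1 in 3.5 million
```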

Categories: Error Statistics, P-values, statistical tests, Statistics | 66 Comments

Probability that it is a statistical fluke [i]

From another blog:
“…If there are 23 people in a room, the chance that two of them have the same birthday is 50 percent, while the chance that two of them were born on a particular day, say, January 1st, is quite low, a small fraction of a percent. The more you specify the coincidence, the rarer it is; the broader the range of coincidences at which you are ready to express surprise, the more likely it is that one will turn up.

Humans are notoriously incompetent at estimating these types of probabilities… which is why scientists (including particle physicists), when they see something unusual in their data, always try to quantify the probability that it is a statistical fluke — a pure chance event. You would not want to be wrong, and celebrate your future Nobel prize only to receive instead a booby prize. (And nature gives out lots and lots of booby prizes.) So scientists, grabbing their statistics textbooks and appealing to the latest advances in statistical techniques, compute these probabilities as best they can. Armed with these numbers, they then try to infer whether it is likely that they have actually discovered something new or not.

And on the whole, it doesn’t work. Unless the answer is so obvious that no statistical argument is needed, the numbers typically do not settle the question.

Despite this remark, you mustn’t think I am arguing against doing statistics. One has to do something better than guessing. But there is a reason for the old saw: “There are three types of falsehoods: lies, damned lies, and statistics.” It’s not that statistics themselves lie, but that to some extent, unless the case is virtually airtight, you can almost always choose to ask a question in such a way as to get any answer you want. … [For instance, in 1991 the volcano Pinatubo in the Philippines had its titanic eruption while a hurricane (or `typhoon' as it is called in that region) happened to be underway. Oh, and the collapse of Lehman Brothers on Sept 15, 2008 was followed within three days by the breakdown of the Large Hadron Collider (LHC) during its first week of running... Coincidence?  I-think-so.] One can draw completely different conclusions, both of them statistically sensible, by looking at the same data from two different points of view, and asking for the statistical answer to two different questions.

To a certain extent, this is just why Republicans and Democrats almost never agree, even if they are discussing the same basic data. The point of a spin-doctor is to figure out which question to ask in order to get the political answer that you wanted in advance. Obviously this kind of manipulation is unacceptable in science. Unfortunately it is also unavoidable. Continue reading
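
The birthday figures in the quoted passage are easy to verify; here is a short check (a standard calculation, not from the quoted blog):

```python
# Birthday coincidences for n = 23 people, ignoring leap years.
import numpy as np
from scipy import stats

n = 23
# P(at least two of the 23 share some birthday)
p_shared = 1 - np.prod([(365 - k) / 365 for k in range(n)])
# P(at least two of the 23 were born on one particular day, say January 1st)
p_jan1 = 1 - stats.binom.cdf(1, n, 1 / 365)

print(f"P(some shared birthday, n=23)     = {p_shared:.3f}")   # ~0.507
print(f"P(>= 2 born on January 1st, n=23) = {p_jan1:.5f}")     # ~0.0017
# Specifying the coincidence in advance makes it far rarer -- the point the
# passage uses to motivate asking exactly which "fluke" probability is meant.
```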

Categories: Error Statistics, Severity vs Posterior Probabilities, spurious p values | 22 Comments

Lucien Le Cam: “The Bayesians hold the Magic”

Nov. 18, 1924 – April 25, 2000

Today is Lucien Le Cam’s birthday. He was an error statistician whose remarks in an article, “A Note on Metastatistics,” in a collection on foundations of statistics (Le Cam 1977)* had some influence on me. A statistician at Berkeley, Le Cam was a co-editor with Neyman of the Berkeley Symposia volumes. I hadn’t mentioned him on this blog before, so here are some snippets from EGEK (Mayo, 1996, 337-8; 350-1) that begin with a passage from Le Cam (1977):

“One of the claims [of the Bayesian approach] is that the experiment matters little, what matters is the likelihood function after experimentation…. It tends to undo what classical statisticians have been preaching for many years: think about your experiment, design it as best you can to answer specific questions, take all sorts of precautions against selection bias and your subconscious prejudices”. (Le Cam 1977, 158)

Why does embracing the Bayesian position tend to undo what classical statisticians have been preaching? Because Bayesian and classical statisticians view the task of statistical inference very differently.

In [chapter 3, Mayo 1996] I contrasted these two conceptions of statistical inference by distinguishing evidential-relationship or E-R approaches from testing approaches, … .

The E-R view is modeled on deductive logic, only with probabilities. In the E-R view, the task of a theory of statistics is to say, for given evidence and hypotheses, how well the evidence confirms or supports hypotheses (whether absolutely or comparatively). There is, I suppose, a certain confidence and cleanness to this conception that is absent from the error-statistician’s view of things. Error statisticians eschew grand and unified schemes for relating their beliefs, preferring a hodgepodge of methods that are truly ampliative. Error statisticians appeal to statistical tools as protection from the many ways they know they can be misled by data as well as by their own beliefs and desires. The value of statistical tools for them is to develop strategies that capitalize on their knowledge of mistakes: strategies for collecting data, for efficiently checking an assortment of errors, and for communicating results in a form that promotes their extension by others. Continue reading

Categories: Error Statistics, frequentist/Bayesian, phil/history of stat, Statistics, strong likelihood principle | 52 Comments

Null Effects and Replication


Categories: Comedy, Error Statistics, Statistics | 3 Comments

Forthcoming paper on the strong likelihood principle

My paper, “On the Birnbaum Argument for the Strong Likelihood Principle” has been accepted by Statistical Science. The latest version is here. (It differs from all versions posted anywhere). If you spot any typos, please let me know (error@vt.edu). If you can’t open this link, please write to me and I’ll send it directly. As always, comments and queries are welcome.

I appreciate the considerable feedback on the SLP on this blog. Interested readers may search this blog for quite a lot of discussion of the SLP (e.g., here and here) including links to the central papers, “U-Phils” (commentaries) by others (e.g., here, here, and here), and amusing notes (e.g., Don’t Birnbaumize that experiment my friend, and Midnight with Birnbaum), and more…..

Abstract: An essential component of inference based on familiar frequentist notions, such as p-values, significance and confidence levels, is the relevant sampling distribution. This feature results in violations of a principle known as the strong likelihood principle (SLP), the focus of this paper. In particular, if outcomes x and y from experiments E1 and E2 (both with unknown parameter θ) have different probability models f1( . ), f2( . ), then even though f1(x; θ) = cf2(y; θ) for all θ, outcomes x and y may have different implications for an inference about θ. Although such violations stem from considering outcomes other than the one observed, we argue, this does not require us to consider experiments other than the one performed to produce the data. David Cox (1958) proposes the Weak Conditionality Principle (WCP) to justify restricting the space of relevant repetitions. The WCP says that once it is known which Ei produced the measurement, the assessment should be in terms of the properties of Ei. The surprising upshot of Allan Birnbaum’s (1962) argument is that the SLP appears to follow from applying the WCP in the case of mixtures, and so uncontroversial a principle as sufficiency (SP). But this would preclude the use of sampling distributions. The goal of this article is to provide a new clarification and critique of Birnbaum’s argument. Although his argument purports that [(WCP and SP) entails SLP], we show how data may violate the SLP while holding both the WCP and SP. Such cases also refute [WCP entails SLP].

Key words: Birnbaumization, likelihood principle (weak and strong), sampling theory, sufficiency, weak conditionality
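
A standard textbook case (not from the paper itself) illustrates the kind of SLP violation the abstract describes: binomial and negative binomial sampling can yield proportional likelihoods, f1(x; θ) = cf2(y; θ), for the same data of 9 successes and 3 failures, yet they give different p-values for H0: θ = 0.5 vs θ > 0.5 because the sampling distributions differ. A sketch:

```python
# Binomial vs negative binomial sampling with proportional likelihoods
# (both proportional to theta^9 (1-theta)^3) but different p-values.
from scipy import stats

# E1: fix n = 12 trials, observe x = 9 successes
p_binom = stats.binom.sf(8, 12, 0.5)     # P(X >= 9 | n = 12, theta = 0.5)

# E2: sample until 3 failures occur, observe y = 9 successes along the way
p_nbinom = stats.nbinom.sf(8, 3, 0.5)    # P(Y >= 9 | r = 3, theta = 0.5)

print(f"binomial p-value:          {p_binom:.4f}")   # ~0.073
print(f"negative-binomial p-value: {p_nbinom:.4f}")  # ~0.033
# Same likelihood function, different error probabilities: the frequentist
# assessments depend on the sampling distribution and so violate the SLP.
```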

 

Categories: Birnbaum Brakes, Error Statistics, Statistics, strong likelihood principle | 24 Comments
