Error Statistics Philosophy

3 YEARS AGO (JULY 2012): MEMORY LANE

Posted on July 22, 2015 by Mayo

: 3 years ago…

MONTHLY MEMORY LANE: 3 years ago: July 2012. I mark in red three posts that seem most apt for general background on key issues in this blog.[1] This new feature, appearing the last week of each month, began at the blog’s 3-year anniversary in Sept, 2014. (Once again it was tough to pick just 3; please check out others which might interest you, e.g., Schachtman on StatLaw, the machine learning conference on simplicity, the story of Lindley and particle physics, Glymour and so on.)

July 2012

(7/1) PhilStatLaw: “Let’s Require Health Claims to Be ‘Evidence Based’” (Schachtman)
(7/2) More from the Foundations of Simplicity Workshop*
(7/3) Elliott Sober Responds on Foundations of Simplicity
(7/4) Comment on Falsification
(7/6) Vladimir Cherkassky Responds on Foundations of Simplicity
(7/8) Metablog: Up and Coming
(7/9) Stephen Senn: Randomization, ratios and rationality: rescuing the randomized clinical trial from its critics
(7/10) PhilStatLaw: Reference Manual on Scientific Evidence (3d ed) on Statistical Significance (Schachtman)
(7/11) Is Particle Physics Bad Science?
(7/12) Dennis Lindley’s “Philosophy of Statistics”
(7/15) Deconstructing Larry Wasserman – it starts like this…
(7/16) Peter Grünwald: Follow-up on Cherkassky’s Comments
(7/19) New Kvetch Posted 7/18/12
(7/21) “Always the last place you look!”
(7/22) Clark Glymour: The Theory of Search Is the Economics of Discovery (part 1)
(7/23) Clark Glymour: The Theory of Search Is the Economics of Discovery (part 2)
(7/27) P-values as Frequentist Measures
(7/28) U-PHIL: Deconstructing Larry Wasserman (connects to 7/15)
(7/31) What’s in a Name? (Gelman’s blog)

[1] excluding those recently reblogged. Posts that are part of a “unit” or a group of “U-Phils” count as one.

Categories: 3-year memory lane, Statistics | Leave a comment

“Statistical Significance” According to the U.S. Dept. of Health and Human Services (ii)

Posted on July 17, 2015 by Mayo

Mayo, frustrated

Someone linked this to me on Twitter. I thought it was a home blog at first. Surely the U.S. Dept of Health and Human Services can give a better definition than this.

U.S. Department of Health and Human Services
Effective Health Care Program
Glossary of Terms

We know that many of the concepts used on this site can be difficult to understand. For that reason, we have provided you with a glossary to help you make sense of the terms used in Comparative Effectiveness Research. Every word that is defined in this glossary should appear highlighted throughout the Web site…..

Statistical Significance

Definition: A mathematical technique to measure whether the results of a study are likely to be true. Statistical significance is calculated as the probability that an effect observed in a research study is occurring because of chance. Statistical significance is usually expressed as a P-value. The smaller the P-value, the less likely it is that the results are due to chance (and more likely that the results are true). Researchers generally believe the results are probably true if the statistical significance is a P-value less than 0.05 (p<.05).

Example: For example, results from a research study indicated that people who had dementia with agitation had a slightly lower rate of blood pressure problems when they took Drug A compared to when they took Drug B. In the study analysis, these results were not considered to be statistically significant because p=0.2. The probability that the results were due to chance was high enough to conclude that the two drugs probably did not differ in causing blood pressure problems.

You can find it here. First of all, one should never use “likelihood” and “probability” in what is to be a clarification of formal terms, as these mean very different things in statistics.Some of the claims given actually aren’t so bad if “likely” takes its statistical meaning, but are all wet if construed as mathematical probability. Continue reading →

Categories: P-values, Statistics | 69 Comments

Spot the power howler: α = ß?

Posted on July 14, 2015 by Mayo

Spot the fallacy!

The power of a test is the probability of correctly rejecting the null hypothesis. Write it as 1 – β.
So, the probability of incorrectly rejecting the null hypothesis is β.
But the probability of incorrectly rejecting the null is α (the type 1 error probability).

So α = β.

I’ve actually seen this, and variants on it [i].

[1] Although they didn’t go so far as to reach the final, shocking, deduction.

Categories: Error Statistics, power, Statistics | 12 Comments

Higgs discovery three years on (Higgs analysis and statistical flukes)

Posted on July 11, 2015 by Mayo

2015: The Large Hadron Collider (LHC) is back in collision mode in 2015[0]. There’s a 2015 update, a virtual display, and links from ATLAS, one of two detectors at (LHC)) here. The remainder is from one year ago. (2014) I’m reblogging a few of the Higgs posts at the anniversary of the 2012 discovery. (The first was in this post.) The following, was originally “Higgs Analysis and Statistical Flukes: part 2″ (from March, 2013).[1]

Some people say to me: “This kind of reasoning is fine for a ‘sexy science’ like high energy physics (HEP)”–as if their statistical inferences are radically different. But I maintain that this is the mode by which data are used in “uncertain” reasoning across the entire landscape of science and day-to-day learning (at least, when we’re trying to find things out)[2] Even with high level theories, the particular problems of learning from data are tackled piecemeal, in local inferences that afford error control. Granted, this statistical philosophy differs importantly from those that view the task as assigning comparative (or absolute) degrees-of-support/belief/plausibility to propositions, models, or theories.

“Higgs Analysis and Statistical Flukes: part 2”

Everyone was excited when the Higgs boson results were reported on July 4, 2012 indicating evidence for a Higgs-like particle based on a “5 sigma observed effect”. The observed effect refers to the number of excess events of a given type that are “observed” in comparison to the number (or proportion) that would be expected from background alone, and not due to a Higgs particle. This continues my earlier post (part 1). It is an outsider’s angle on one small aspect of the statistical inferences involved. But that, apart from being fascinated by it, is precisely why I have chosen to discuss it: we [philosophers of statistics] should be able to employ a general philosophy of inference to get an understanding of what is true about the controversial concepts we purport to illuminate, e.g., significance levels. Continue reading →

Categories: Higgs, highly probable vs highly probed, P-values, Severity | 1 Comment

Winner of the June Palindrome contest: Lori Wike

Posted on July 9, 2015 by Mayo

Winner of June 2015 Palindrome Contest: (a dozen book choices)

Lori Wike: Principal bassoonist of the Utah Symphony; Faculty member at University of Utah and Westminster College

Palindrome: Sir, a pain, a madness! Elba gin in a pro’s tipsy end? I know angst, sir! I taste, I demonstrate lemon omelet arts. Nome diet satirists gnaw on kidneys, pits or panini. Gab less: end a mania, Paris!

Book choice: Conjectures and Refutations (K. Popper 1962, New York: Basic Books)

The requirement: A palindrome using “demonstrate” (and Elba, of course).

Bio: Lori Wike is principal bassoonist of the Utah Symphony and is on the faculty of the University of Utah and Westminster College. She holds a Bachelor of Music degree from the Eastman School of Music and a Master of Arts degree in Comparative Literature from UC-Irvine. Continue reading →

Categories: Palindrome | Leave a comment

Larry Laudan: “When the ‘Not-Guilty’ Falsely Pass for Innocent”, the Frequency of False Acquittals (guest post)

Posted on July 3, 2015 by Mayo

Larry Laudan

Professor Larry Laudan
Lecturer in Law and Philosophy
University of Texas at Austin

“When the ‘Not-Guilty’ Falsely Pass for Innocent” by Larry Laudan

While it is a belief deeply ingrained in the legal community (and among the public) that false negatives are much more common than false positives (a 10:1 ratio being the preferred guess), empirical studies of that question are very few and far between. While false convictions have been carefully investigated in more than two dozen studies, there are virtually no well-designed studies of the frequency of false acquittals. The disinterest in the latter question is dramatically borne out by looking at discussions among intellectuals of the two sorts of errors. (A search of Google Books identifies some 6.3k discussions of the former and only 144 treatments of the latter in the period from 1800 to now.) I’m persuaded that it is time we brought false negatives out of the shadows, not least because each such mistake carries significant potential harms, typically inflicted by falsely-acquitted recidivists who are on the streets instead of in prison.

In criminal law, false negatives occur under two circumstances: when a guilty defendant is acquitted at trial and when an arrested, guilty defendant has the charges against him dropped or dismissed by the judge or prosecutor. Almost no one tries to measure how often either type of false negative occurs. That is partly understandable, given the fact that the legal system prohibits a judicial investigation into the correctness of an acquittal at trial; the double jeopardy principle guarantees that such acquittals are fixed in stone. Thanks in no small part to the general societal indifference to false negatives, there have been virtually no efforts to design empirical studies that would yield reliable figures on false acquittals. That means that my efforts here to estimate how often they occur must depend on a plethora of indirect indicators. With a bit of ingenuity, it is possible to find data that provide strong clues as to approximately how often a truly guilty defendant is acquitted at trial and in the pre-trial process. The resulting inferences are not precise and I will try to explain why as we go along. As we look at various data sources not initially designed to measure false negatives, we will see that they nonetheless provide salient information about when and why false acquittals occur, thereby enabling us to make an approximate estimate of their frequency.

My discussion of how to estimate the frequency of false negatives will fall into two parts, reflecting the stark differences between the sources of errors in pleas and the sources of error in trials. (All the data to be cited here deal entirely with cases of crimes of violence.) Continue reading →

Categories: evidence-based policy, false negatives, PhilStatLaw, Statistics | Tags: false negatives | 9 Comments

Stapel’s Fix for Science? Admit the story you want to tell and how you “fixed” the statistics to support it!

Posted on June 30, 2015 by Mayo

Stapel’s “fix” for science is to admit it’s all “fixed!”

That recent case of the guy suspected of using faked data for a study on how to promote support for gay marriage in a (retracted) paper, Michael LaCour, is directing a bit of limelight on our star fraudster Diederik Stapel (50+ retractions).

The Chronicle of Higher Education just published an article by Tom Bartlett: “Can a Longtime Fraud Help Fix Science? You can read his full interview of Stapel here. A snippet:

You write that “every psychologist has a toolbox of statistical and methodological procedures for those days when the numbers don’t turn out quite right.” Do you think every psychologist uses that toolbox? In other words, is everyone at least a little bit dirty?

Stapel: In essence, yes. The universe doesn’t give answers. There are no data matrices out there. We have to select from reality, and we have to interpret. There’s always dirt, and there’s always selection, and there’s always interpretation. That doesn’t mean it’s all untruthful. We’re dirty because we can only live with models of reality rather than reality itself. It doesn’t mean it’s all a bag of tricks and lies. But that’s where the inconvenience starts. Continue reading →

Categories: junk science, Statistics | 11 Comments

3 YEARS AGO (JUNE 2012): MEMORY LANE

Posted on June 25, 2015 by Mayo

: 3 years ago…

MONTHLY MEMORY LANE: 3 years ago: June 2012. I mark in red three posts that seem most apt for general background on key issues in this blog.[1] It was extremely difficult to pick only 3 this month; please check out others that look interesting to you. This new feature, appearing the last week of each month, began at the blog’s 3-year anniversary in Sept, 2014.

June 2012

(6/2) Anything Tests Can do, CIs do Better; CIs Do Anything Better than Tests?* (reforming the reformers cont.)
(6/6) Review of Error and Inference (Mayo and Spanos) by C. Hennig
(6/9) U-Phil: Is the Use of Power* Open to a Power Paradox?
(6/12) CMU Workshop on Foundations for Ockham’s Razor
(6/14) Answer to the Homework & a New Exercise
(6/15) Scratch Work for a SEV Homework Problem
(6/17) Repost (5/17/12): Do CIs Avoid Fallacies of Tests? Reforming the Reformers
(6/17) G. Cumming Response: The New Statistics
(6/19) The Error Statistical Philosophy and The Practice of Bayesian Statistics: Comments on Gelman and Shalizi
(6/23) Promissory Note
(6/26) Deviates, Sloths, and Exiles: Philosophical Remarks on the Ockham’s Razor Workshop*
(6/29) Further Reflections on Simplicity: Mechanisms

[1]excluding those recently reblogged. Posts that are part of a “unit” or a group of “U-Phils” count as one.

Categories: 3-year memory lane | 1 Comment

Can You change Your Bayesian prior? (ii)

Posted on June 18, 2015 by Mayo

This is one of the questions high on the “To Do” list I’ve been keeping for this blog. The question grew out of discussions of “updating and downdating” in relation to papers by Stephen Senn (2011) and Andrew Gelman (2011) in Rationality, Markets, and Morals.[i]

“As an exercise in mathematics [computing a posterior based on the client’s prior probabilities] is not superior to showing the client the data, eliciting a posterior distribution and then calculating the prior distribution; as an exercise in inference Bayesian updating does not appear to have greater claims than ‘downdating’.” (Senn, 2011, p. 59)

“If you could really express your uncertainty as a prior distribution, then you could just as well observe data and directly write your subjective posterior distribution, and there would be no need for statistical analysis at all.” (Gelman, 2011, p. 77)

But if uncertainty is not expressible as a prior, then a major lynchpin for Bayesian updating seems questionable. If you can go from the posterior to the prior, on the other hand, perhaps it can also lead you to come back and change it.

Is it legitimate to change one’s prior based on the data?

I don’t mean update it, but reject the one you had and replace it with another. My question may yield different answers depending on the particular Bayesian view. I am prepared to restrict the entire question of changing priors to Bayesian “probabilisms”, meaning the inference takes the form of updating priors to yield posteriors, or to report a comparative Bayes factor. Interpretations can vary. In many Bayesian accounts the prior probability distribution is a way of introducing prior beliefs into the analysis (as with subjective Bayesians) or, conversely, to avoid introducing prior beliefs (as with reference or conventional priors). Empirical Bayesians employ frequentist priors based on similar studies or well established theory. There are many other variants.

S. SENN: According to Senn, one test of whether an approach is Bayesian is that while Continue reading →

Categories: Bayesian/frequentist, Gelman, S. Senn, Statistics | 111 Comments

Some statistical dirty laundry: The Tilberg (Stapel) Report on “Flawed Science”

Posted on June 14, 2015 by Mayo

Objectivity 1: Will the Real Junk Science Please Stand Up?

I had a chance to reread the 2012 Tilberg Report* on “Flawed Science” last night. The full report is now here. The discussion of the statistics is around pp. 17-21 (of course there was so little actual data in this case!) You might find it interesting. Here are some stray thoughts reblogged from 2 years ago…

1. Slipping into pseudoscience.
The authors of the Report say they never anticipated giving a laundry list of “undesirable conduct” by which researchers can flout pretty obvious requirements for the responsible practice of science. It was an accidental byproduct of the investigation of one case (Diederik Stapel, social psychology) that they walked into a culture of “verification bias”[1]. Maybe that’s why I find it so telling. It’s as if they could scarcely believe their ears when people they interviewed “defended the serious and less serious violations of proper scientific method with the words: that is what I have learned in practice; everyone in my research environment does the same, and so does everyone we talk to at international conferences” (Report 48). So they trot out some obvious rules, and it seems to me that they do a rather good job.

One of the most fundamental rules of scientific research is that an investigation must be designed in such a way that facts that might refute the research hypotheses are given at least an equal chance of emerging as do facts that confirm the research hypotheses. Violations of this fundamental rule, such as continuing an experiment until it works as desired, or excluding unwelcome experimental subjects or results, inevitably tends to confirm the researcher’s research hypotheses, and essentially render the hypotheses immune to the facts…. [T]he use of research procedures in such a way as to ‘repress’ negative results by some means” may be called verification bias. [my emphasis] (Report, 48).

I would place techniques for ‘verification bias’ under the general umbrella of techniques for squelching stringent criticism and repressing severe tests. These gambits make it so easy to find apparent support for one’s pet theory or hypotheses, as to count as no evidence at all (see some from their list ). Any field that regularly proceeds this way I would call a pseudoscience, or non-science, following Popper. “Observations or experiments can be accepted as supporting a theory (or a hypothesis, or a scientific assertion) only if these observations or experiments are severe tests of the theory” (Popper 1994, p. 89). [2] It is unclear at what point a field slips into the pseudoscience realm.

2. A role for philosophy of science?
I am intrigued that one of the final recommendations in the Report is this: Continue reading →

Categories: junk science, spurious p values | 14 Comments

Evidence can only strengthen a prior belief in low data veracity, N. Liberman & M. Denzler: “Response”

Posted on June 11, 2015 by Mayo

Förster

I thought the criticisms of social psychologist Jens Förster were already quite damning (despite some attempts to explain them as mere QRPs), but there’s recently been some pushback from two of his co-authors Liberman and Denzler. Their objections are directed to the application of a distinct method, touted as “Bayesian forensics”, to their joint work with Förster. I discussed it very briefly in a recent “rejected post“. Perhaps the earlier method of criticism was inapplicable to these additional papers, and there’s an interest in seeing those papers retracted as well as the one that was. I don’t claim to know. A distinct “policy” issue is whether there should be uniform standards for retraction calls. At the very least, one would think new methods should be well-vetted before subjecting authors to their indictment (particularly methods which are incapable of issuing in exculpatory evidence, like this one). Here’s a portion of their response. I don’t claim to be up on this case, but I’d be very glad to have reader feedback.

Nira Liberman, School of Psychological Sciences, Tel Aviv University, Israel

Markus Denzler, Federal University of Applied Administrative Sciences, Germany

June 7, 2015

Response to a Report Published by the University of Amsterdam

The University of Amsterdam (UvA) has recently announced the completion of a report that summarizes an examination of all the empirical articles by Jens Förster (JF) during the years of his affiliation with UvA, including those co-authored by us. The report is available online. The report relies solely on statistical evaluation, using the method originally employed in the anonymous complaint against JF, as well as a new version of a method for detecting “low scientific veracity” of data, developed by Prof. Klaassen (2015). The report concludes that some of the examined publications show “strong statistical evidence for low scientific veracity”, some show “inconclusive evidence for low scientific veracity”, and some show “no evidence for low veracity”. UvA announced that on the basis of that report, it would send letters to the Journals, asking them to retract articles from the first category, and to consider retraction of articles in the second category.

After examining the report, we have reached the conclusion that it is misleading, biased and is based on erroneous statistical procedures. In view of that we surmise that it does not present reliable evidence for “low scientific veracity”.

We ask you to consider our criticism of the methods used in UvA’s report and the procedures leading to their recommendations in your decision.

Let us emphasize that we never fabricated or manipulated data, nor have we ever witnessed such behavior on the part of Jens Förster or other co-authors.

Here are our major points of criticism. Please note that, due to time considerations, our examination and criticism focus on papers co-authored by us. Below, we provide some background information and then elaborate on these points. Continue reading →

Categories: junk science, reproducibility | Tags: Jens Forster | 9 Comments

“Fraudulent until proved innocent: Is this really the new “Bayesian Forensics”? (rejected post)

Posted on June 9, 2015 by Mayo

Fraudulent until proved innocent: Is this really the new “Bayesian Forensics”? (rejected post)

Categories: evidence-based policy, frequentist/Bayesian, junk science, Rejected Posts | 2 Comments

What Would Replication Research Under an Error Statistical Philosophy Be?

Posted on June 4, 2015 by Mayo

Around a year ago on this blog I wrote:

“There are some ironic twists in the way psychology is dealing with its replication crisis that may well threaten even the most sincere efforts to put the field on firmer scientific footing”

That’s philosopher’s talk for “I see a rich source of problems that cry out for ministrations of philosophers of science and of statistics”. Yesterday, I began my talk at the Society for Philosophy and Psychology workshop on “Replication in the Sciences”with examples of two main philosophical tasks: to clarify concepts, and reveal inconsistencies, tensions and ironies surrounding methodological “discomforts” in scientific practice.

Example of a conceptual clarification

Editors of a journal, Basic and Applied Social Psychology, announced they are banning statistical hypothesis testing because it is “invalid” (A puzzle about the latest “test ban”)

It’s invalid because it does not supply “the probability of the null hypothesis, given the finding” (the posterior probability of H₀) (2015 Trafimow and Marks)

Since the methodology of testing explicitly rejects the mode of inference they don’t supply, it would be incorrect to claim the methods were invalid.

Simple conceptual job that philosophers are good at

(I don’t know if the group of eminent statisticians assigned to react to the “test ban” will bring up this point. I don’t think it includes any philosophers.)

____________________________________________________________________________________

Example of revealing inconsistencies and tensions

Critic: It’s too easy to satisfy standard significance thresholds

You: Why do replicationists find it so hard to achieve significance thresholds?

Critic: Obviously the initial studies were guilty of p-hacking, cherry-picking, significance seeking, QRPs

You: So, the replication researchers want methods that pick up on and block these biasing selection effects.

Critic: Actually the “reforms” recommend methods where selection effects and data dredging make no difference.

________________________________________________________________

Whether this can be resolved or not is separate.

We are constantly hearing of how the “reward structure” leads to taking advantage of researcher flexibility

As philosophers, we can at least show how to hold their feet to the fire, and warn of the perils of accounts that bury the finagling

The philosopher is the curmudgeon (takes chutzpah!)

I also think it’s crucial for philosophers of science and statistics to show how to improve on and solve problems of methodology in scientific practice.

My slides are below; share comments.

Categories: Error Statistics, reproducibility, Statistics | 18 Comments

3 YEARS AGO (MAY 2012): Saturday Night Memory Lane

Posted on May 30, 2015 by Mayo

3 years ago…

MONTHLY MEMORY LANE: 3 years ago: May 2012. Lots of worthy reading and rereading for your Saturday night memory lane; it was hard to choose just 3.

I mark in red three posts that seem most apt for general background on key issues in this blog* (Posts that are part of a “unit” or a group of “U-Phils” count as one.) This new feature, appearingthe end of each month, began at the blog’s 3-year anniversary in Sept, 2014.

*excluding any that have been recently reblogged.

May 2012

(5/1) Stephen Senn: A Paradox of Prior Probabilities
(5/5) Comedy Hour at the Bayesian (Epistemology) Retreat: Highly Probable vs Highly Probed
(5/8) LSE Summer Seminar: Contemporary Problems in Philosophy of Statistics
(5/10) Excerpts from S. Senn’s Letter on “Replication, p-values and Evidence,”
(5/12) Saturday Night Brainstorming & Task Forces: The TFSI on NHST (for update see 2015 Task Force)
(5/17) Do CIs Avoid Fallacies of Tests? Reforming the Reformers
(5/20) Betting, Bookies and Bayes: Does it Not Matter?
(5/23) Does the Bayesian Diet Call For Error-Statistical Supplements?
(5/24) An Error-Statistical Philosophy of Evidence (PH500, LSE Seminar)-short intro to error statistics
(5/28) Painting-by-Number #1
(5/31) Metablog: May 31, 2012

Categories: 3-year memory lane | Leave a comment

“Intentions” is the new code word for “error probabilities”: Allan Birnbaum’s Birthday

Posted on May 27, 2015 by Mayo

27 May 1923-1 July 1976

Today is Allan Birnbaum’s Birthday. Birnbaum’s (1962) classic “On the Foundations of Statistical Inference,” in Breakthroughs in Statistics (volume I 1993), concerns a principle that remains at the heart of today’s controversies in statistics–even if it isn’t obvious at first: the Likelihood Principle (LP) (also called the strong likelihood Principle SLP, to distinguish it from the weak LP [1]). According to the LP/SLP, given the statistical model, the information from the data are fully contained in the likelihood ratio. Thus, properties of the sampling distribution of the test statistic vanish (as I put it in my slides from my last post)! But error probabilities are all properties of the sampling distribution. Thus, embracing the LP (SLP) blocks our error statistician’s direct ways of taking into account “biasing selection effects” (slide #10).

Intentions is a New Code Word: Where, then, is all the information regarding your trying and trying again, stopping when the data look good, cherry picking, barn hunting and data dredging? For likelihoodists and other probabilists who hold the LP/SLP, it is ephemeral information locked in your head reflecting your “intentions”! “Intentions” is a code word for “error probabilities” in foundational discussions, as in “who would want to take intentions into account?” (Replace “intentions” (or the “researcher’s intentions”) with “error probabilities” (or the method’s error probabilities”) and you get a more accurate picture.) Keep this deciphering tool firmly in mind as you read criticisms of methods that take error probabilities into account[2]. For error statisticians, this information reflects real and crucial properties of your inference procedure.

Continue reading →

Categories: Birnbaum, Birnbaum Brakes, frequentist/Bayesian, Likelihood Principle, phil/history of stat, Statistics | 48 Comments

From our “Philosophy of Statistics” session: APS 2015 convention

Posted on May 24, 2015 by Mayo

“The Philosophy of Statistics: Bayesianism, Frequentism and the Nature of Inference,” at the 2015 American Psychological Society (APS) Annual Convention in NYC, May 23, 2015:

D. Mayo: “Error Statistical Control: Forfeit at your Peril”

S. Senn: “‘Repligate’: reproducibility in statistical studies. What does it mean and in what sense does it matter?”

A. Gelman: “The statistical crisis in science” (this is not his exact presentation, but he focussed on some of these slides)

For more details see this post.

Categories: Bayesian/frequentist, Error Statistics, P-values, reforming the reformers, reproducibility, S. Senn, Statistics | 10 Comments

Workshop on Replication in the Sciences: Society for Philosophy and Psychology: (2nd part of double header)

Posted on May 19, 2015 by Mayo

2nd part of the double header:

Society for Philosophy and Psychology (SPP): 41st Annual meeting

SPP 2015 Program

Wednesday, June 3rd
1:30-6:30: Preconference Workshop on Replication in the Sciences, organized by Edouard Machery

1:30-2:15: Edouard Machery (Pitt)
2:15-3:15: Andrew Gelman (Columbia, Statistics, via video link)
3:15-4:15: Deborah Mayo (Virginia Tech, Philosophy)
4:15-4:30: Break
4:30-5:30: Uri Simonshon (Penn, Psychology)
5:30-6:30: Tal Yarkoni (University of Texas, Neuroscience)

SPP meeting: 4-6 June 2015 at Duke University in Durham, North Carolina

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

First part of the double header:

The Philosophy of Statistics: Bayesianism, Frequentism and the Nature of Inference, 2015 APS Annual Convention Saturday, May 23 2:00 PM- 3:50 PM in Wilder (Marriott Marquis 1535 B’way)

Andrew Gelman

Stephen Senn

Deborah Mayo

Richard Morey, Session Chair & Discussant

taxi: VA-NYC-NC

See earlier post for Frank Sinatra and more details

Categories: Announcement, reproducibility | Leave a comment

“Error statistical modeling and inference: Where methodology meets ontology” A. Spanos and D. Mayo

Posted on May 16, 2015 by Mayo

A new joint paper….

“Error statistical modeling and inference: Where methodology meets ontology”

Aris Spanos · Deborah G. Mayo

Abstract: In empirical modeling, an important desideratum for deeming theoretical entities and processes real is that they can be reproducible in a statistical sense. Current day crises regarding replicability in science intertwine with the question of how statistical methods link data to statistical and substantive theories and models. Different answers to this question have important methodological consequences for inference, which are intertwined with a contrast between the ontological commitments of the two types of models. The key to untangling them is the realization that behind every substantive model there is a statistical model that pertains exclusively to the probabilistic assumptions imposed on the data. It is not that the methodology determines whether to be a realist about entities and processes in a substantive field. It is rather that the substantive and statistical models refer to different entities and processes, and therefore call for different criteria of adequacy.

Keywords: Error statistics · Statistical vs. substantive models · Statistical ontology · Misspecification testing · Replicability of inference · Statistical adequacy

To read the full paper: “Error statistical modeling and inference: Where methodology meets ontology.”

The related conference.

Reference: Spanos, A. & Mayo, D. G. (2015). “Error statistical modeling and inference: Where methodology meets ontology.” Synthese (online May 13, 2015), pp. 1-23.

Categories: Error Statistics, misspecification testing, O & M conference, reproducibility, Severity, Spanos | 2 Comments

Stephen Senn: Double Jeopardy?: Judge Jeffreys Upholds the Law (sequel to the pathetic P-value)

Posted on May 9, 2015 by Mayo

S. Senn

Stephen Senn
Head of Competence Center for Methodology and Statistics (CCMS)
Luxembourg Institute of Health

Double Jeopardy?: Judge Jeffreys Upholds the Law

“But this could be dealt with in a rough empirical way by taking twice the standard error as a criterion for possible genuineness and three times the standard error for definite acceptance”. Harold Jeffreys(1) (p386)

This is the second of two posts on P-values. In the first, The Pathetic P-Value, I considered the relation of P-values to Laplace’s Bayesian formulation of induction, pointing out that that P-values, whilst they had a very different interpretation, were numerically very similar to a type of Bayesian posterior probability. In this one, I consider their relation or lack of it, to Harold Jeffreys’s radically different approach to significance testing. (An excellent account of the development of Jeffreys’s thought is given by Howie(2), which I recommend highly.)

The story starts with Cambridge philosopher CD Broad (1887-1971), who in 1918 pointed to a difficulty with Laplace’s Law of Succession. Broad considers the problem of drawing counters from an urn containing n counters and supposes that all m drawn had been observed to be white. He now considers two very different questions, which have two very different probabilities and writes:

Note that in the case that only one counter remains we have n = m + 1 and the two probabilities are the same. However, if n > m+1 they are not the same and in particular if m is large but n is much larger, the first probability can approach 1 whilst the second remains small.

The practical implication of this just because Bayesian induction implies that a large sequence of successes (and no failures) supports belief that the next trial will be a success, it does not follow that one should believe that all future trials will be so. This distinction is often misunderstood. This is The Economist getting it wrong in September 2000

The canonical example is to imagine that a precocious newborn observes his first sunset, and wonders whether the sun will rise again or not. He assigns equal prior probabilities to both possible outcomes, and represents this by placing one white and one black marble into a bag. The following day, when the sun rises, the child places another white marble in the bag. The probability that a marble plucked randomly from the bag will be white (ie, the child’s degree of belief in future sunrises) has thus gone from a half to two-thirds. After sunrise the next day, the child adds another white marble, and the probability (and thus the degree of belief) goes from two-thirds to three-quarters. And so on. Gradually, the initial belief that the sun is just as likely as not to rise each morning is modified to become a near-certainty that the sun will always rise.

See Dicing with Death(3) (pp76-78).

The practical relevance of this is that scientific laws cannot be established by Laplacian induction. Jeffreys (1891-1989) puts it thus

Thus I may have seen 1 in 1000 of the ‘animals with feathers’ in England; on Laplace’s theory the probability of the proposition, ‘all animals with feathers have beaks’, would be about 1/1000. This does not correspond to my state of belief or anybody else’s. (P128)

Continue reading →