My Slides: “The Statistical Replication Crisis: Paradoxes and Scapegoats”

Posted on May 10, 2016 by Mayo

Below are the slides from my Popper talk at the LSE today (up to slide 70): (post any questions in the comments)

Categories: P-values, replication research, reproducibility, Statistics | 11 Comments

11 thoughts on “My Slides: “The Statistical Replication Crisis: Paradoxes and Scapegoats””

May 10, 2016

Editor of the Fabius Maximus website

I found this a helpful look at this important issue, clarifying the debate and pointing forward to solutions (i.e., new best practices).

Question: what do you mean by “severe testing” as used by Popper? The only cite I recall is from “Conjectures and Refutations: The Growth of Scientific Knowledge” (1963).

“Confirmations should count only if they are the result of risky predictions; that is to say, if, unenlightened by the theory in question, we should have expected an event which was incompatible with the theory — an event which would have refuted the theory.”

But that does not appear to be what you mean.

June 29, 2016

Mayo

Hadn’t seen your comment. It’s true that Popper didn’t adequately define severity. I think I give a better definition using error statistics.
H passes test T with severity when T accords with x*
and
Pr(a worse accordance would have occurred; H is false) = high

* for an adequate accordance measure
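
To make the second condition concrete, here is a minimal sketch for one textbook case, which is an assumption of this illustration and not something stated in the comment: a Normal sample mean with known standard deviation, and a claim of the form "mu > mu1". In that setting a "worse accordance" is a smaller sample mean, and "H is false" is evaluated at the boundary mu = mu1, so the severity is Pr(Xbar <= observed mean; mu = mu1). The function names and the numbers in the usage lines are hypothetical.

```python
from math import erf, sqrt


def normal_cdf(z: float) -> float:
    """Standard Normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))


def severity_mu_greater(xbar: float, mu1: float, sigma: float, n: int) -> float:
    """Severity for the claim 'mu > mu1' given an observed sample mean xbar.

    Assumes (for illustration only) n independent Normal observations with
    known standard deviation sigma. Computes Pr(Xbar <= xbar; mu = mu1),
    the probability of a worse accordance with the claim if it were
    (just) false.
    """
    standard_error = sigma / sqrt(n)
    return normal_cdf((xbar - mu1) / standard_error)


# Hypothetical numbers: observed mean 0.2, sigma 1, n 100.
for mu1 in (0.0, 0.1, 0.2):
    print(f"SEV(mu > {mu1}) = {severity_mu_greater(0.2, mu1, 1.0, 100):.3f}")
```

The same data pass "mu > 0" with high severity but "mu > 0.2" with severity of only one half, which is the sense in which severity attaches to a particular claim and not to the test as a whole.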

June 29, 2016

Editor of the Fabius Maximus website

Professor Mayo,

Thanks for the alternative perspective on severity, a quantitative rather than qualitative one.

I’ve long used Popper’s definition, finding it analytically useful to sort wheat from chaff.

I’d love to see a post contrasting the two approaches to testing theories, with the relative strengths and weaknesses of each. I suspect that when the stakes are high both are required.

June 29, 2016

Mayo

Check my Error and the Growth of Experimental Knowledge. His severity is satisfied by the first theory to explain a fact, to give just one example of its weakness. (He changed his views, but he affirmed “theoretical novelty” as what he intended by severity, i.e., T entails or fits novel fact x, in the sense that it hasn’t already been explained.)

June 29, 2016

Editor of the Fabius Maximus website

Professor Mayo,

Thanks for the pointer. I’ll check it out.

While I have you here, any thoughts — broadly speaking — on successful predictions as a “gold standard” (an anachronistic economic concept, but vivid) for testing theories?

No matter how sophisticated the math, backtesting models remains problematic, especially when the stakes are high.

May 10, 2016

Enrique Guerra-Pujol

Why lump Stapel (outright fraud) with garden-variety sloppy statistical methods (negligence)?

May 11, 2016

Mayo

I don’t. What the slide notes is that, in investigating him, a whole culture of verification bias emerged as routine.

May 11, 2016

Enrique Guerra-Pujol

The slide (read literally) does do this, but perhaps there is a connection: maybe the general sloppiness of statistical methods made it easier for fakers to engage in such reckless data fabrication.

May 11, 2016

Mayo

Yes, but I was also pointing out that the investigators (the Levelt committee) were seeking to find out about Stapel and his coworkers (to see, for example, if co-authors were guilty) and, to their shock, found themselves in a culture where leaving out results you don’t like, reporting just what looks good, and mixing and matching control and treated groups from different experiments (with the defense that it’s all random) were not only commonplace; the researchers claimed that’s what they were taught to do. I will link to one of my posts on Stapel. Interestingly, the audience yesterday was unfamiliar with this case.

https://errorstatistics.com/2015/06/14/some-statistical-dirty-laundry-the-tilberg-stapel-report-on-flawed-science/

May 14, 2016

Kim Tullar

Is it possible to distinguish a proposition as being believable but not well-tested? It seems somewhat plausible that the optimum way to test a proposition assumes everything we know, or believe with some degree of certainty, and nothing else other than the logical connections between those beliefs. Furthermore, it seems reasonable that our test should, on these assumptions, output some ideally quantitative indication of the validity or error of our proposition. But then it seems like a well-tested proposition is simply one for which we have calculated P(Proposition | Beliefs). Or perhaps, a well-tested proposition is, if we accept the proposition, one with very high probability, and if we reject the proposition, one with very low probability, and a proposition with middling probability is yet to be well-tested. Either way, we end up with some version of Bayesianism, collapsing the notions of believability and well-tested.

I’m playing devil’s advocate above, by the way. You might be interested that, as part of a report I conducted on Simonsohn et al.’s p-curves, I conducted a p-curve analysis of the work of Joshua Knobe (a prominent experimental philosopher), and was able to reject the null of no evidential value with a minute p-value (10^-5 or something). But, unfortunately, in my analysis of p-curve theory I found that p-values which fail to account for multiple comparisons can bias the test for evidential value massively in favour of rejecting, and I believe some of Knobe’s p-values failed to account for multiplicity.
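
One simple way such a bias could arise can be sketched as follows. This is an illustrative toy model only, not the analysis described in the comment above. It assumes the null is true, that each study runs k independent tests, and that only the smallest p-value is reported without adjustment. With no evidential value and a single test, significant p-values are uniform on (0, 0.05), so half fall below 0.025; the function below (a hypothetical name) computes that share for the unadjusted minimum of k p-values.

```python
def share_below_half_alpha(k: int, alpha: float = 0.05) -> float:
    """Share of significant results with p < alpha/2 under a true null,
    when the smallest of k independent p-values is reported unadjusted.

    The minimum of k independent Uniform(0, 1) p-values has CDF
    1 - (1 - p)**k, so the share is CDF(alpha / 2) / CDF(alpha).
    """
    def cdf_min(p: float) -> float:
        return 1.0 - (1.0 - p) ** k

    return cdf_min(alpha / 2.0) / cdf_min(alpha)


# Toy comparison: one test per study versus 5 or 20 tests per study.
for k in (1, 5, 20):
    print(k, round(share_below_half_alpha(k), 3))
```

In this toy model the share rises above one half as k grows, so the significant p-values look right-skewed even though every null is true. Each study contributes only a modest skew, but a test that pools many such p-values could be pushed toward rejecting "no evidential value".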

May 14, 2016

Mayo

Kim: No, in appraising whether test T (with data x) did a good job probing claim C, I wouldn’t consider everything I knew from other tests of C (although I would obviously use the background needed to assess test T). So I might say the deflection of light, with thus and so properties, was well tested in 1960, say, but not well tested in the famous 1919 eclipse experiment. The 1919 experimental test remains as imprecise as it was in 1919, even though in later years radioastronomy was capable of discerning errors not distinguishable in 1919. Of course time doesn’t matter: some of the 1919 experiments were decent, while one from Sobral provided no evidence at all. One needn’t consider anything so highfalutin. We have good evidence x for mad cow, cloning or what have you, but you wouldn’t say that tea leaf reading supplies such evidence. For any well tested empirical claim C, I can find a method/data that does a lousy job in substantiating C. (Of course I can ask a question about overall evidence for C, which is different.)

© Deborah G. Mayo, Error Statistics Philosophy, 2011-2018 All Rights Reserved.