My Slides: “The Statistical Replication Crisis: Paradoxes and Scapegoats”

Below are the slides from my Popper talk at the LSE today (up to slide 70). Post any questions in the comments.

Categories: P-values, replication research, reproducibility, Statistics

11 thoughts on “My Slides: “The Statistical Replication Crisis: Paradoxes and Scapegoats””

1. I found this a helpful look at this important issue, clarifying the debate and pointing forward to solutions (i.e., new best practices).

Question: what do you mean by “severe testing” as used by Popper? The only cite I recall is from “Conjectures and Refutations: The Growth of Scientific Knowledge” (1963).

“Confirmations should count only if they are the result of risky predictions; that is to say, if, unenlightened by the theory in question, we should have expected an event which was incompatible with the theory — an event which would have refuted the theory.”

But that does not appear to be what you mean.

• Hadn’t seen your comment. It’s true that Popper didn’t adequately define severity. I think I give a better definition using error statistics.
H passes test T with severity when the data x accord with H*
and
Pr(a worse accordance would have occurred; H is false) = high

* for an adequate accordance measure
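To make the second condition concrete, here is a minimal sketch under assumptions that are not in the comment above: a one-sided test on a Normal mean with known standard deviation, where the claim H is "mu > mu1" and "accordance" is measured by the sample mean. The function name `severity` and the numbers in the usage note are purely illustrative.

```python
from math import sqrt
from statistics import NormalDist


def severity(xbar, mu1, sigma, n):
    """Severity for the claim 'mu > mu1' given an observed sample mean xbar.

    Assumed setup (illustrative only): n i.i.d. Normal observations with
    known standard deviation sigma. The value returned is the probability
    of a sample mean no larger than xbar (a worse accordance with the
    claim), computed at the boundary mu = mu1 where the claim just fails.
    """
    standard_error = sigma / sqrt(n)
    return NormalDist().cdf((xbar - mu1) / standard_error)


if __name__ == "__main__":
    # Hypothetical numbers: observed mean 0.2, sigma 1, n 100.
    for mu1 in (0.0, 0.1, 0.2, 0.3):
        print(f"claim mu > {mu1}: severity = {severity(0.2, mu1, 1.0, 100):.3f}")
```

With these hypothetical numbers, the same data pass "mu > 0" with high severity but "mu > 0.3" with low severity, so the assessment attaches to the particular claim rather than to the test result as a whole.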

• Professor Mayo,

Thanks for the alternative perspective on severity, a quantitative rather than qualitative one.

I’ve long used Popper’s definition, finding it analytically useful to sort wheat from chaff.

I’d love to see a post contrasting the two approaches to testing theories, with the relative strengths and weaknesses of each. I suspect that when the stakes are high both are required.

• Check my Error and the Growth of Experimental Knowledge. His severity is satisfied by the first theory to explain a fact, to give just one example of its weakness. (He changed his views, but he affirmed “theoretical novelty” as what he intended by severity, i.e., T entails or fits novel fact x, in the sense that it hasn’t already been explained.)

• Professor Mayo,

Thanks for the pointer. I’ll check it out.

While I have you here, any thoughts — broadly speaking — on successful predictions as a “gold standard” (an anachronistic economic concept, but vivid) for testing theories?

No matter how sophisticated the math, backtesting models remains problematic, especially when the stakes are high.

2. Why lump Stapel (outright fraud) with garden-variety sloppy statistical methods (negligence)?

• I don’t. What the slide notes is that the investigation of him revealed a whole culture in which verification bias was routine.

• The slide (read literally) does do this, but perhaps there is a connection: maybe the general sloppiness of statistical methods made it easier for fakers to engage in such reckless data fabrication.

• Yes, but I was also pointing out that the investigators (the Levelt committee) were seeking to find out about Stapel and his coworkers (to see, for example, whether co-authors were guilty). To their shock, they found themselves in a culture where leaving out results you don’t like, reporting just what looks good, and mixing and matching control and treated groups from different experiments (with the defense that it’s all random) were not only commonplace; the researchers claimed that’s what they were taught to do. I will link to one of my posts on Stapel. Interestingly, the audience yesterday was unfamiliar with this case.

https://errorstatistics.com/2015/06/14/some-statistical-dirty-laundry-the-tilberg-stapel-report-on-flawed-science/

3. Kim Tullar

Is it possible to distinguish a proposition as being believable but not well-tested? It seems somewhat plausible that the optimum way to test a proposition assumes everything we know, or believe with some degree of certainty, and nothing else other than the logical connections between those beliefs. Furthermore, it seems reasonable that our test should, on these assumptions, output some ideally quantitative indication of the validity or error of our proposition. But then it seems like a well-tested proposition is simply one for which we have calculated P(Proposition | Beliefs). Or perhaps, a well-tested proposition is, if we accept the proposition, one with very high probability, and if we reject the proposition, one with very low probability, and a proposition with middling probability is yet to be well-tested. Either way, we end up with some version of Bayesianism, collapsing the notions of believability and well-tested.

I’m playing devil’s advocate above, by the way. You might be interested that, as part of a report I wrote on Simonsohn et al.’s p-curves, I conducted a p-curve analysis of the work of Joshua Knobe (a prominent experimental philosopher), and was able to reject the null of no evidential value with a minute p-value (10^-5 or something). But, unfortunately, in my analysis of p-curve theory I found that p-values which fail to account for multiple comparisons can massively bias the test for evidential value in favour of rejecting, and I believe some of Knobe’s p-values failed to account for multiplicity.
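One mechanism consistent with the multiplicity point can be sketched as follows. This is not a reconstruction of the analysis described above; it assumes, for illustration only, that each study runs k independent tests of true nulls and reports the smallest p-value uncorrected. The function names and the choice of "share of significant p-values below alpha/2" as a crude indicator of right skew are hypothetical.

```python
import random


def share_below_half_alpha(k, alpha=0.05):
    """Exact share of significant reported p-values that fall below alpha/2.

    Assumes every null is true, each study runs k independent tests, and
    only the minimum p-value is reported. The minimum of k uniforms has
    CDF 1 - (1 - p)**k, so the share is a ratio of two such terms.
    A flat p-curve corresponds to a share of 0.5.
    """
    return (1 - (1 - alpha / 2) ** k) / (1 - (1 - alpha) ** k)


def simulate_share(k, n_studies=200_000, alpha=0.05, seed=0):
    """Monte Carlo check of the same quantity, using a seeded generator."""
    rng = random.Random(seed)
    significant = low = 0
    for _ in range(n_studies):
        p = min(rng.random() for _ in range(k))  # uncorrected minimum p
        if p < alpha:
            significant += 1
            if p < alpha / 2:
                low += 1
    return low / significant


if __name__ == "__main__":
    for k in (1, 5, 20):
        print(k, round(share_below_half_alpha(k), 3), round(simulate_share(k), 3))
```

Under these assumptions the share is 0.5 for k = 1, about 0.53 for k = 5 and about 0.62 for k = 20, so the significant p-values pile up near zero even though no effect exists. Across many studies, a skew of that size could be enough for a right-skew test to reject.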

• Kim: No, in appraising whether test T (with data x) did a good job probing claim C, I wouldn’t consider everything I knew from other tests of C (although I would obviously use the background needed to assess test T). So I might say the deflection of light, with thus and so properties, was well tested in 1960, say, but not well tested in the famous 1919 eclipse experiment. The 1919 experimental test remains as imprecise as it was in 1919, even though in later years radioastronomy was capable of discerning errors not distinguishable in 1919. Of course, time isn’t what matters: some of the 1919 experiments were decent, while one, from Sobral, provided no evidence at all. One needn’t consider anything so highfalutin. We have good evidence x for mad cow, cloning or what have you, but you wouldn’t say that tea leaf reading supplies such evidence. For any well tested empirical claim C, I can find a method/data that does a lousy job in substantiating C. (Of course, I can ask a question about the overall evidence for C, which is different.)