Gerd Gigerenzer, Andrew Gelman, Clark Glymour and I took part in a very interesting symposium on Philosophy of Statistics at the Philosophy of Science Association last Friday. I jotted down lots of notes, but I’ll limit myself to brief reflections and queries on a small portion of each presentation in turn, starting with Gigerenzer’s “Surrogate Science: How Fisher, Neyman-Pearson, & Bayes Were Transformed into the Null Ritual.” His complete slides are below my comments. I may write this in stages, this being (i).
- Good scientific practice–bold theories, double-blind experiments, minimizing measurement error, replication, etc.–became reduced in the social sciences to a surrogate: statistical significance.
I agree that “good scientific practice” isn’t some great big mystery, and that “bold theories, double-blind experiments, minimizing measurement error, replication, etc.” are central and interconnected keys to finding things out in error-prone inquiry. Do the social sciences really teach that inquiry can be reduced to cookbook statistics? Or is it simply that, in some fields, carrying out surrogate science suffices to be a “success”?
- Instead of teaching a toolbox of statistical methods by Fisher, Neyman-Pearson, Bayes, and others, textbook writers created a hybrid theory with the null ritual at its core, and presented it anonymously as statistics per se.
I’m curious as to how he/we might cash out teaching “a toolbox of statistical methods by Fisher, Neyman-Pearson, Bayes….” Each has been open to caricature, to rival interpretations and philosophies, and each includes several methods. There should be a way to recognize distinct questions and roles, without reinforcing the “received view” that lies behind the guilt and anxiety confronting the researcher in Gigerenzer’s famous “superego-ego-id” metaphor (SLIDE #3):
In this view, N-P demands fixed, long-run performance criteria and is relevant for acceptance sampling only–no inference allowed; Fisherian significance tests countenance moving from small p-values to substantive scientific claims, as in the illicit animal dubbed NHST. The Bayesian “id” is the voice of wishful thinking that tempts some to confuse the question: “How stringently have I tested H?” with “How strongly do I believe H?”
As with all good caricatures, there are several grains of truth in Gigerenzer’s colorful Freudian metaphor, but I say it’s time to move away from exaggerating the differences between N-P and Fisher. I think we must first see why Fisher and N-P statistics do not form an inconsistent hybrid, if we’re to see their overlapping roles in the “toolbox”.
(added 11/11/16: here’s the early, best-known (so far as I’m aware) introduction of the Freudian metaphor.) Link: https://www.mpib-berlin.mpg.de/volltexte/institut/dok/full/gg/ggstehfda/ggstehfda.html
A treatment that is unbiased as well as historically and statistically adequate might be possible, but would have to be created anew (Vic Barnett’s Comparative Statistical Inference comes close). Whether this would be at all practical for a routine presentation of statistics is another issue.
3(a). The null ritual requires delusions about the meaning of the p-value. Its blind spots led to studies with a power so low that throwing a coin would do better. To compensate, researchers engage in bad science to produce significant results which are unlikely to be reproducible.
I put aside my queries about the “required delusions” of the first sentence, and improving power by throwing a coin in the second. The point about power in the last sentence is important. It might help explain the confusion I increasingly see between (i) reaching small significance levels with a test of low power and (ii) engaging in questionable research practices (QRPs) in order to arrive at the small p-value. If you achieve (i) without the QRPs of (ii), you have a good indication of a discrepancy from a null hypothesis. It would not yield an exaggerated effect size as some allege, if correctly interpreted. The problem is when QRPs are used “to compensate” for a test that would otherwise produce bubkes.
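To make the notion of "low power" concrete, here is a minimal sketch of a power calculation for a one-sided z-test with known standard deviation. The function name `power_one_sided_z`, the discrepancy of 0.2, and the sample sizes are illustrative assumptions of mine, not figures from Gigerenzer's slides.

```python
from statistics import NormalDist


def power_one_sided_z(delta, sigma, n, alpha=0.05):
    """Power of a one-sided z-test of H0: mu <= 0 against the alternative mu = delta.

    Assumes a known standard deviation sigma and n independent observations.
    All argument values used below are hypothetical.
    """
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha)   # rejection cutoff, in standard-error units
    shift = delta * n ** 0.5 / sigma         # true discrepancy, in standard-error units
    return 1 - std_normal.cdf(z_crit - shift)


# Power against a hypothetical discrepancy of 0.2 standard deviations.
for n in (10, 25, 100, 400):
    print(n, round(power_one_sided_z(delta=0.2, sigma=1.0, n=n), 3))
```

Under these assumptions the small-sample tests reject less than half the time even though the discrepancy is real, while the largest sample almost always rejects. The calculation describes only the test's capacity to detect the discrepancy; it is silent on whether QRPs were used to reach significance.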
3(b). Researchers’ delusion that the p-value already specifies the probability of replication (1 – p) makes replication studies appear superfluous.
Hmm. They should read Fisher’s denunciation of taking an isolated p-value as indicating a genuine experimental effect:
In order to assert that a natural phenomenon is experimentally demonstrable we need, not an isolated record, but a reliable method of procedure. In relation to the test of significance we may say that a phenomenon is experimentally demonstrable when we know how to conduct an experiment which will rarely fail to give us a statistically significant result (Fisher 1947, p. 14).
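A small worked sketch shows why 1 – p cannot be read as a probability of replication. It makes a generous assumption that is mine, not the source's: the true effect is exactly the effect observed in the first study, and the replication uses the same sample size and a one-sided z-test at the same level. The function name `replication_probability` is hypothetical.

```python
from statistics import NormalDist


def replication_probability(p_observed, alpha=0.05):
    """Chance that a same-size replication is significant at level alpha.

    ASSUMPTION: the true effect equals the effect observed in the first study
    (one-sided z-test, known variance). This is an optimistic simplification.
    """
    std_normal = NormalDist()
    z_observed = std_normal.inv_cdf(1 - p_observed)  # observed test statistic
    z_crit = std_normal.inv_cdf(1 - alpha)           # cutoff for the replication
    return 1 - std_normal.cdf(z_crit - z_observed)


# Columns: observed p, the mistaken "1 - p", probability under the assumption.
for p in (0.05, 0.01, 0.001):
    print(p, round(1 - p, 3), round(replication_probability(p), 3))
```

Even on this favorable assumption, a result that is just significant at 0.05 would be expected to reach significance again only about half the time. That is far from the "rarely fail" standard in Fisher's passage.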
4. The replication crisis in the social and biomedical sciences is typically attributed to wrong incentives. But that is only half the story. Researchers tend to believe in the ritual, and the null ritual also explains why these incentives and not others were set in the first place.
The perverse incentives generally refer to the “publish or perish” mantra, the competition to produce novel and sexy results, and editorial biases in favor of a neat narrative, free of those messy qualifications or careful, critical caveats. Is Gigerenzer saying that these incentives were set because researchers believe that recipe statistics is a good way to do science? What do people think?
It would follow that if researchers rejected statistical rituals as a good way to do science, then incentives would change. To some extent that might be happening. The trouble is that even if researchers recognize the inadequacy of the statistical rituals that have been lampooned for 80+ years, it doesn’t follow that they have returned to the notion of “good scientific practice” described in point #1.
Gerd Gigerenzer (Director of Max Planck Institute for Human Development, Berlin, Germany) “Surrogate Science: How Fisher, Neyman-Pearson, and Bayes Were Transformed into the Null Ritual” (Abstract)
References:
- Barnett, V. (1999). Comparative Statistical Inference (3rd Ed.). Chichester; New York: Wiley.
- Fisher, R. A. (1947). The Design of Experiments (4th Ed.). Edinburgh: Oliver and Boyd.
- Gigerenzer, G. (1993). The Superego, the Ego, and the Id in statistical reasoning. In G. Keren & C. Lewis (Eds.), A handbook for data analysis in the behavioral sciences: Methodological issues (pp. 311-339). Hillsdale, NJ: Erlbaum.