Gigerenzer at the PSA: “How Fisher, Neyman-Pearson, & Bayes Were Transformed into the Null Ritual”: Comments and Queries (ii)


Gerd Gigerenzer, Andrew Gelman, Clark Glymour and I took part in a very interesting symposium on Philosophy of Statistics at the Philosophy of Science Association last Friday. I jotted down lots of notes, but I’ll limit myself to brief reflections and queries on a small portion of each presentation in turn, starting with Gigerenzer’s “Surrogate Science: How Fisher, Neyman-Pearson, & Bayes Were Transformed into the Null Ritual.” His complete slides are below my comments. I may write this in stages, this being (ii).

SLIDE #19


  1. Good scientific practice–bold theories, double-blind experiments, minimizing measurement error, replication, etc.–became reduced in the social science to a surrogate: statistical significance.

I agree that “good scientific practice” isn’t some great big mystery, and that “bold theories, double-blind experiments, minimizing measurement error, replication, etc.” are central and interconnected keys to finding things out in error-prone inquiry. Do the social sciences really teach that inquiry can be reduced to cookbook statistics? Or is it simply that, in some fields, carrying out surrogate science suffices to be a “success”?

  2. Instead of teaching a toolbox of statistical methods by Fisher, Neyman-Pearson, Bayes, and others, textbook writers created a hybrid theory with the null ritual at its core, and presented it anonymously as statistics per se.

I’m curious as to how he/we might cash out teaching “a toolbox of statistical methods by Fisher, Neyman-Pearson, Bayes….” Each has been open to caricature, to rival interpretations and philosophies, and each includes several methods. There should be a way to recognize distinct questions and roles, without reinforcing the “received view” that lies behind the guilt and anxiety confronting the researcher in Gigerenzer’s famous “superego-ego-id” metaphor (SLIDE #3):

In this view, N-P tests demand fixed, long-run performance criteria and are relevant for acceptance sampling only–no inference allowed; Fisherian significance tests countenance moving from small p-values to substantive scientific claims, as in the illicit animal dubbed NHST. The Bayesian “id” is the voice of wishful thinking that tempts some to confuse the question “How stringently have I tested H?” with “How strongly do I believe H?”

As with all good caricatures, there are several grains of truth in Gigerenzer’s colorful Freudian metaphor, but I say it’s time to move away from exaggerating the differences between N-P and Fisher. I think we must first see why Fisher and N-P statistics do not form an inconsistent hybrid, if we’re to see their overlapping roles in the “toolbox”.

(Added 11/11/16: here’s the early, best-known (so far as I’m aware) introduction of the Freudian metaphor; full text: https://www.mpib-berlin.mpg.de/volltexte/institut/dok/full/gg/ggstehfda/ggstehfda.html)

A treatment that is unbiased as well as historically and statistically adequate might be possible, but would have to be created anew (Vic Barnett’s Comparative Statistical Inference comes close). Whether this would be at all practical for a routine presentation of statistics is another issue.
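To make the “stringency vs. belief” contrast in the metaphor concrete, here is a minimal numerical sketch in Python (not anyone’s official analysis; the sample size, observed mean, and especially the prior are hypothetical choices for illustration). The same data feed two different questions: an error-statistical one about how stringently H0 has been probed, and a Bayesian one about how strongly to believe the alternative under an assumed prior.

```python
# Toy contrast (all numbers hypothetical): same data, two different questions.
import numpy as np
from scipy import stats

n, sigma, xbar = 25, 1.0, 0.4          # assumed design and observed mean
se = sigma / np.sqrt(n)
z = xbar / se                          # test statistic for H0: mu <= 0 vs H1: mu > 0

# "How stringently have I tested H0?" -- an error-probability quantity:
p_value = stats.norm.sf(z)             # P(Z >= z; H0), roughly 0.023 here

# "How strongly do I believe mu > 0?" -- a posterior probability, under an
# assumed N(0, 0.1^2) prior chosen purely for illustration:
prior_mu, prior_sd = 0.0, 0.1
post_var = 1.0 / (1.0 / prior_sd**2 + n / sigma**2)
post_mean = post_var * (prior_mu / prior_sd**2 + n * xbar / sigma**2)
posterior_prob = stats.norm.sf(0.0, loc=post_mean, scale=np.sqrt(post_var))

print(f"p-value: {p_value:.3f}")                  # does not depend on the prior
print(f"P(mu > 0 | data): {posterior_prob:.3f}")  # moves with the prior
```

The only point of the sketch is that the two numbers answer different questions: change the assumed prior and the posterior moves, while the error probabilities do not.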

3(a). The null ritual requires delusions about the meaning of the p-value. Its blind spots led to studies with a power so low that throwing a coin would do better. To compensate, researchers engage in bad science to produce significant results which are unlikely to be reproducible.

I put aside my queries about the “required delusions” in the first sentence, and about the claim that throwing a coin would do better than a low-powered test in the second. The point about power in the last sentence is important. It might help explain the confusion I increasingly see between (i) reaching small significance levels with a test of low power and (ii) engaging in questionable research practices (QRPs) in order to arrive at the small p-value. If you achieve (i) without the QRPs of (ii), you have a good indication of a discrepancy from a null hypothesis. It would not yield an exaggerated effect size, as some allege, if correctly interpreted. The problem is when QRPs are used “to compensate” for a test that would otherwise produce bubkes.
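Here is a rough simulation of the two sides of that point, under assumed numbers (the sample size, effect size, and alpha are all hypothetical). Without QRPs, the rejection rate when the null is true stays near alpha, so a small p-value remains a genuine, if fallible, indication of some discrepancy; at the same time, the power against a modest real effect can indeed fall well below a coin flip, which is exactly the situation that tempts researchers “to compensate.”

```python
# Rough simulation of the low-power worry in 3(a); all numbers are hypothetical.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, true_effect, alpha, reps = 10, 0.3, 0.05, 20_000   # a low-powered design

def one_sided_p(sample):
    """p-value for H0: mu <= 0 vs H1: mu > 0 (one-sample t-test)."""
    t = sample.mean() / (sample.std(ddof=1) / np.sqrt(len(sample)))
    return stats.t.sf(t, df=len(sample) - 1)

p_null = np.array([one_sided_p(rng.normal(0.0, 1.0, n)) for _ in range(reps)])
p_alt = np.array([one_sided_p(rng.normal(true_effect, 1.0, n)) for _ in range(reps)])

print("rejection rate with H0 true (stays near alpha, no QRPs):",
      round((p_null < alpha).mean(), 3))
print("power at the assumed effect (well below a coin flip here):",
      round((p_alt < alpha).mean(), 3))
```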

3(b). Researchers’ delusion that the p-value already specifies the probability of replication (1 – p) makes replication studies appear superfluous.

Hmm. They should read Fisher’s denunciation of taking an isolated p-value as indicating a genuine experimental effect:

In order to assert that a natural phenomenon is experimentally demonstrable we need, not an isolated record, but a reliable method of procedure. In relation to the test of significance we may say that a phenomenon is experimentally demonstrable when we know how to conduct an experiment which will rarely fail to give us a statistically significant result (Fisher 1947, p. 14).
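As a concrete check on the delusion in 3(b), here is a small sketch (a one-sided z-test with assumed numbers). Suppose an original study reported p = 0.02; on the “1 – p” reading, replication would be a near certainty (98%). But the actual probability that an exact replication again reaches p < 0.05 is the power at the true effect size, which is typically far lower—one reason Fisher’s “rarely fail” criterion demands more than an isolated small p-value. To keep the em-dash-free register: the fallacy is simply that 1 – p is not the replication probability.

```python
# The "1 - p" replication fallacy, with assumed numbers (one-sided z-test).
import numpy as np
from scipy import stats

alpha, n, sigma = 0.05, 25, 1.0
se = sigma / np.sqrt(n)
z_obs = stats.norm.isf(0.02)            # z that gave the original p = 0.02
z_crit = stats.norm.isf(alpha)          # cutoff a replication must exceed

# Candidate true effects (hypothetical); the last equals the effect estimate
# originally observed.
for true_mu in (0.1, 0.2, z_obs * se):
    replication_power = stats.norm.sf(z_crit - true_mu / se)
    print(f"true mu = {true_mu:.3f}: P(replication significant) = "
          f"{replication_power:.2f}   (vs. 1 - p = 0.98)")
```

Even if the true effect happens to equal the one originally observed, the chance of a significant replication in this sketch is only about 0.66, not 0.98.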

4. The replication crisis in the social and biomedical sciences is typically attributed to wrong incentives. But that is only half the story. Researchers tend to believe in the ritual, and the null ritual also explains why these incentives and not others were set in the first place.

The perverse incentives generally refer to the “publish or perish” mantra, the competition to produce novel and sexy results, and editorial biases in favor of a neat narrative, free of those messy qualifications or careful, critical caveats. Is Gigerenzer saying that the reason these incentives were set is because researchers believe that recipe statistics is a good way to do science? What do people think?

It would follow that if researchers rejected statistical rituals as a good way to do science, then incentives would change. To some extent that might be happening. The trouble is, even if researchers come to recognize the inadequacy of the statistical rituals that have been lampooned for 80+ years, it doesn’t follow that they will return to the notion of “good scientific practice” described in point #1.


Gerd Gigerenzer (Director of Max Planck Institute for Human Development, Berlin, Germany) “Surrogate Science: How Fisher, Neyman-Pearson, and Bayes Were Transformed into the Null Ritual” (Abstract)

Some Relevant Posts:

Erich Lehmann: Neyman Pearson vs Fisher on P-values

Jerzy Neyman and ‘Les Miserables Citations’: Statistical Theater in honor of His Birthday”

Stephen Senn: “The Pathetic P-value” (guest post)

Neyman: Distinguishing Tests of Statistical Hypotheses and Tests of Significance Might Have Been a Lapse of Someone’s Pen

REFERENCES

  • Barnett, V. (1999). Comparative Statistical Inference (3rd ed.). Chichester; New York: Wiley.
  • Fisher, R. A. (1947). The Design of Experiments (4th ed.). Edinburgh: Oliver and Boyd.
  • Gigerenzer, G. (1993). The Superego, the Ego, and the Id in statistical reasoning. In G. Keren & C. Lewis (Eds.), A handbook for data analysis in the behavioral sciences: Methodological issues (pp. 311-339). Hillsdale, NJ: Erlbaum.

 

 

Categories: Fisher, frequentist/Bayesian, Gigerenzer, P-values, spurious p values, Statistics


11 thoughts on “Gigerenzer at the PSA: “How Fisher, Neyman-Pearson, & Bayes Were Transformed into the Null Ritual”: Comments and Queries (ii)”

  1. Robert M. McIntyre

    Sorry. Got it: Questionable Research Practice.


    • Robert: yes, because it was in the previous line, I thought that sufficed, so I let it stay as it was. I have now added the acronym to the previous line.

  2. martin

    Dr. Mayo, thank you for your post.

    Regarding the health sciences, Dr. Gigerenzer might be right when he says that researchers believe in the ritual. From what I’ve seen, recipe statistics is what we’re taught as good science. We have lots of workshops with statistical software and some problem solving; then you pass the final test and you’re ready. Statistical inference practice is not very well connected to the theory, and it is definitely not well taught, where I come from.

    Unluckily, many researchers in my field are going to continue to work with recipe statistics for many years to come because they believe it is the right thing to do and the textbooks they read did not tell them otherwise.

    Common pitfalls and misuses are left for the student to “discover” afterwards. Learning statistics is really hard, and only motivated and “skeptical” researchers will eventually get to “statistical thinking”.

    Then there’s the problem with research ethics. The “publish or perish” pressure has gone from individual researchers to institutions. I know of certain universities in my country that delegate research duties to epidemiologists without the necessary substantive knowledge of the topic they’re working on. The clinical expert comes up with a question and the epidemiologist does the rest. The expert is confident that the epidemiologist is going to get it right but, without substantive knowledge, this is just a waste of resources. It becomes a kind of “research production factory” with very low quality standards, but with the “seal” of the epidemiologist.

    On another issue, I have doubts as to whether the “well-probed” hypothesis has to do with the methods or with the statistical analysis (or both). Fisher says that there has to be a reliable (and unbiased, I suppose) method for which the p-value is consistently significant. One could say something similar for posterior probabilities or likelihood ratios. If the method is unbiased and reliable, then one could be testing a substantive hypothesis severely, regardless of the statistical inference tool. Statistical analysis is very important, but to overcome the replication crisis we should focus on methods (sample selection, accuracy of information) too.

    Best regards.

    • Martin: It’s too bad if cookbook stat is still endorsed. I’m not sure what you mean when you write: “One could say something similar for posterior probabilities or likelihood ratios. If the method is unbiased and reliable, then one could be testing a substantive hypothesis severely, regardless of the statistical inference tool.”

      I don’t rule out assessing the severity of these methods (at the meta-level as it were)–I agree one can–but the question is whether one has to jump out of the Bayes/Likelihood framework in order to warrant the error-statistical (or severe testing) rationale.

  3. Michael Lew

    As you know, I have long struggled with understanding and communicating about the intersections between schools of thought in statistics. I have decided that an important impediment is the fact that different aspects of interest exist within different dimensions. It is hard to avoid switching between dimensions when writing about the issues, and unnoticed switching leads to misunderstanding. You provide a nice example here:

    “The Bayesian “id” is the voice of wishful thinking that tempts some to confuse the question: “How stringently have I tested H?” with “How strongly do I believe H?””

    In that sentence the dimension of the first option (how stringently?) relates to methodological performance, whereas the second (belief) is a Bayesian dimension where the belief is a probability incorporating a prior of some sort. The first relates to evidence combined with methodological stringency but the second relates to evidence combined with prior belief. One cannot substitute for the other for all purposes because they consist of different stuff. They are not rivals in the sense that Hillary Clinton and Donald Trump were rivals, where acceptance of one entails rejection of the other. Instead they are alternatives that might be preferred in some circumstances and not in others–alternatives that not only co-exist, but can be consumed together like ice-cream and cookies.

    Thus your suggestion that there is a confusion regarding the question is false. Instead, you might talk of confusion about which question is more appropriate to the task at hand. Or you might talk about the important issue of how one could (or should) make inferences that are based on the evidence _and_ the methodological performance _and_ prior information or beliefs.

    Gigerenzer’s metaphors are useful, and I think that his concerns are well-founded.

    • Michael: I was sketching a vague but colorful metaphor that Gigerenzer developed over 20 years ago (1993) to caricature competing (mutually inconsistent) tensions running through an imaginary researcher’s psyche. (It’s probably even older than that.) I have an old-fashioned overhead from my Lakatos talk in 1999 with a three-headed researcher, which I’ve used for years and years in teaching. The head is spinning around (in my slide). Now it is only a metaphor, and it’s only an imaginary psychological state, so one shouldn’t be too hard on it, but if one takes seriously the three-headed researcher, he is pulled in these competing directions in applying significance tests. The superego of N-P demands strictly pre-designated rule-following that Gigerenzer compares to compulsive handwashing. The Fisherian ego reflects what the researcher does to get things published, which requires violating the superego’s rules. Thus he feels guilty. And all along (still applying the significance test) he’s craving a statement of degrees of belief in the form of posteriors. As I say, because this is a metaphor, one can’t press it too hard. I think it’s a good metaphor for the caricature presented by critical appraisers of statistics rather than the psyche of an actual researcher, but it’s more colorful to present it as he does. He’s got several other wonderful psychological metaphors to describe cookbook statistics and statisticians seeking automatic methods.

  4. Christian Hennig

    Thanks for this. When you announced the meeting programme I hoped that I’d see something from the talks here.

    “Is Gigerenzer saying that the reason these incentives were set is because researchers believe that recipe statistics is a good way to do science? What do people think?”

    I think that some historical development needs to be taken into account here. I guess (I’m not informed enough to write “I know”) that in the early days many researchers found the new developments very exciting and believed that this gave them a way to do better science. Probably already from the beginning there was quite some variance between appropriate and inappropriate uses and interpretations, but the dangers of inappropriate use and interpretation were not yet well known and discussed. Those researchers, probably pioneers in their own discipline, who learned the material from the original literature, had to teach it within their discipline, often with somewhat less insight and more “spin”. Probably, in any discipline apart perhaps from statistics itself (and philosophy!), there are many students and researchers who are not much interested in understanding such “tools” in detail (even if, in principle, they’d be interested in doing good replicable research), and they learn such tools from presentations that lack understanding and any mention of potential issues. Some of them would even go on to teach it to their students based on the non-understanding that they have (although there will always be a minority who realises that what they learn in their Statistics for Psychologists 1 is not the whole story and would try to achieve a better understanding).

    So I think it’s a mix: originally it wasn’t recipe statistics, and researchers believed it was good (but they were probably pioneers); when it became recipe statistics, many would just believe it not out of understanding but because they were told, and some wouldn’t bother about whether to believe it or not; it was just the thing to do.

    • Christian: Thanks for your interesting comment. Your point may be right, but remember that the warnings against viewing the methods in recipe-like fashion already came with Fisher (his “political principle” for lying with statistics) and before. On the other hand, correctly used, the methods did afford a way to treat statistical questions scientifically: learning with reliability despite errors and variability. So, it would make sense for rewards to be built on employing tools that enable distinguishing signal from noise, & paying attention to the experimental design that this requires. But then it can’t be said, in shock, ‘many researchers actually believe this is a good method’. In other words, it’s not the recipe-style stat that they believe in (contrary to his suggestion). Undoubtedly, there’s a blur.

      But I think it’s time to get beyond that. That might be the main way in which I disagree with Gigerenzer.

      • Christian Hennig

        My idea of how things evolved would amount to thinking that the researchers who adopted these methods originally didn’t use them recipe-style, and probably successfully to a large extent. But as a result of this they were made into recipes, and insight was lost (or not present in the first place for those who learnt them in this fashion). If we think about whether researchers “believe” in them, there are different kinds of belief, belief from understanding (which wouldn’t apply to cookbook-type application) and belief in (teaching) authority and general practice, which may well be a belief in following recipes.

        I think that this is a pattern that applies to many phenomena, not just to significance tests in statistics: a good and well thought through but nontrivial idea comes up; people apply this idea with some understanding and some success, and (partly as a result of this) the idea is transformed into a recipe, the initial understanding dries out, and the initially good idea turns into something dubious and problematic. Notwithstanding that it could still be applied in a good way, and even improved understanding may in principle be available among specialists.
        I believe that Bayesian statistics is equally prone to the same kind of thing.

        For this reason I am skeptical of all tendencies to formalise/automatise statistical analysis, other aspects of science, and even scientific standards in general (although I am at the same time fascinated by them); they can degenerate into recipes and be applied as such, with little understanding.

        • Christian: Yes, I think that’s Gigerenzer’s position too. My point is that we should recover the ways tools may be used to control and assess erroneous interpretations of (statistically modeled) data. By merely saying “mechanically applying any tool is (equally?) bad,” however, or imagining that the user of statistical hypothesis tests is forced into the 3-headed tension, we fail to disinter the correct uses of tests, and lose the differences in the capabilities of tools from different schools.

          • Christian Hennig

            Fair enough. I think a balance has to be struck between discouraging thoughtless mechanical application of any tool and encouraging uses that are thoughtful enough to make sense but not so sophisticated that only a handful of experts can use them. And by “thoughtful enough” I mean with awareness of the specific capabilities and incapabilities of the tool.
