We spent the first half of Thursday’s seminar discussing the Fisher, Neyman, and E. Pearson “triad”[i]. So, since it’s Saturday night, join me in rereading for the nth time these three very short articles. The key issues were: error of the second kind, behavioristic vs. evidential interpretations, and Fisher’s mysterious fiducial intervals. Although we often hear exaggerated accounts of the differences between the Fisherian and Neyman-Pearson (N-P) methodologies, N-P were in fact simply providing Fisher’s tests with a logical ground (even though other foundations for tests are still possible), and Fisher welcomed this gladly. Notably, N-P showed that with only a single null hypothesis, it was possible to have tests where the probability of rejecting the null when true exceeded the probability of rejecting it when false. Hacking called such tests “worse than useless”, and N-P developed a theory of testing that avoids such problems. Statistical journalists who report on the alleged “inconsistent hybrid” (a term popularized by Gigerenzer) should recognize the extent to which the apparent disagreements on method reflect professional squabbling between Fisher and Neyman after 1935 (a recent example is the Nature article by R. Nuzzo discussed in [ii] below). The two types of tests are best seen as asking different questions in different contexts. They both follow error-statistical reasoning.
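To make the “worse than useless” notion concrete, here is a toy sketch in Python. It is my own illustration, not an example taken from the triad papers or from Hacking. It assumes a single observation X ~ N(mu, 1), a null of mu = 0, and a deliberately perverse rejection region that rejects when X falls close to 0. The function name and the choice of alternative (mu = 1) are hypothetical.

```python
from statistics import NormalDist

def central_rejection_prob(mu, c, sigma=1.0):
    """Probability that X ~ N(mu, sigma^2) lands in the region |X| < c,
    i.e. the chance that this (perverse) test rejects the null."""
    dist = NormalDist(mu, sigma)
    return dist.cdf(c) - dist.cdf(-c)

alpha = 0.05

# Pick c so the test has size alpha under the null mu = 0:
# P(|X| < c; mu = 0) = 2 * Phi(c) - 1 = alpha.
c = NormalDist().inv_cdf((1 + alpha) / 2)

size = central_rejection_prob(0.0, c)         # rejection probability when the null is true
power_at_1 = central_rejection_prob(1.0, c)   # rejection probability when mu = 1 (null false)

print(f"size  = {size:.4f}")
print(f"power = {power_at_1:.4f}")
```

Under these assumptions the test rejects a true null more often than a false one: the rejection probability at mu = 1 falls below the size. Nothing in a specification that mentions only the null rules such a region out, which is the technical gap that considering alternatives is meant to close.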
We then turned to a severity evaluation of tests as a way to avoid classic fallacies and misinterpretations.
“Probability/Statistics Lecture Notes 5 for 3/20/14: Post-data severity evaluation” (Prof. Spanos)
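For readers who want to see what a post-data severity assessment looks like numerically, here is a minimal sketch. It is not taken from the lecture notes; it assumes the familiar one-sided Normal test with known sigma (H0: mu ≤ mu0 vs. H1: mu > mu0) and a rejection of the null, in which case the severity for the claim mu > mu1 is P(X̄ ≤ observed mean; mu = mu1). The function name and the numbers are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def severity_mu_greater(xbar, mu1, sigma, n):
    """Severity for the claim mu > mu1 after observing sample mean xbar,
    assuming an i.i.d. Normal sample of size n with known sigma.
    Computed as P(Xbar <= xbar; mu = mu1) = Phi((xbar - mu1) / (sigma / sqrt(n)))."""
    standard_error = sigma / sqrt(n)
    return NormalDist().cdf((xbar - mu1) / standard_error)

# Hypothetical setup: sigma = 2, n = 100 (standard error 0.2), observed mean 0.4,
# which is two standard errors above a null value of mu0 = 0.
sigma, n, xbar = 2.0, 100, 0.4

for mu1 in (0.0, 0.2, 0.4, 0.6):
    sev = severity_mu_greater(xbar, mu1, sigma, n)
    print(f"SEV(mu > {mu1:.1f}) = {sev:.3f}")
```

In this setup the claim mu > 0 passes with high severity (roughly 0.98), while mu > 0.6 passes with low severity (roughly 0.16). This is how the assessment blocks the fallacy of reading a statistically significant result as evidence for a discrepancy larger than the data warrant.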
[ii] In a recent Nature article by Regina Nuzzo, we hear that N-P statistics “was spearheaded in the late 1920s by Fisher’s bitter rivals”. Nonsense. It was Neyman and Pearson who came to Fisher’s defense against the old guard. See for example Aris Spanos’ post here. According to Nuzzo, “Neyman called some of Fisher’s work mathematically ‘worse than useless’”. It never happened. Nor does she reveal, if she is aware of it, the purely technical notion being referred to. Nuzzo’s article doesn’t give the source of the quote; I’m guessing it’s from Gigerenzer quoting Hacking, or Goodman (whom she is clearly following and cites) quoting Gigerenzer quoting Hacking, but that’s a big jumble.
N-P did provide a theory of testing that could avoid the purely technical problem that can theoretically emerge in an account that does not consider alternatives or discrepancies from a null. As for Fisher’s charge against an extreme behavioristic, acceptance sampling approach, there’s something to this, but as Neyman’s response shows, Fisher, in practice, was more inclined toward a dichotomous “thumbs up or down” use of tests than Neyman. Recall Neyman’s “inferential” use of power in my last post. If Neyman really had altered the tests to such an extreme, it wouldn’t have required Barnard to point it out to Fisher many years later. Yet suddenly, according to Fisher, we’re in the grips of Russian 5-year plans or U.S. robotic widget assembly lines! I’m not defending either side in these fractious disputes, but alerting the reader to what’s behind a lot of writing on tests (see my anger management post). I can understand how Nuzzo’s remark could arise from a quote of a quote, doubly out of context. But I think science writers on statistical controversies have an obligation to try to avoid being misled by whomever they’re listening to at the moment. There are really only a small handful of howlers to take note of. It’s fine to sign on with one side, but not to state controversial points as beyond debate. I’ll have more to say about her article in a later post (and thanks to the many of you who have sent it to me).
Gigerenzer, G. (1993). The superego, the ego, and the id in statistical reasoning. In G. Keren & C. Lewis (Eds.), A handbook for data analysis in the behavioral sciences: Methodological issues (pp. 311-339). Hillsdale: Lawrence Erlbaum Associates.
Hacking, I. (1965). Logic of statistical inference. Cambridge: Cambridge University Press.
Nuzzo, R. (2014). “Scientific method: Statistical errors: P values, the ‘gold standard’ of statistical validity, are not as reliable as many scientists assume”. Nature, 12 February 2014.