Prof. Geoff Cumming [i] has taken up my invite to respond to “Do CIs Avoid Fallacies of Tests? Reforming the Reformers” (May 17th), reposted today as well. (I extend the same invite to anyone I comment on, whether in the form of a comment or a full post.) He reviews some of the complaints against p-values and significance tests, but he has not here responded to the particular challenge I raise: to show how his appeals to CIs avoid the fallacies and weaknesses of significance tests. The May 17 post focuses on the fallacy of rejection; the one from June 2, on the fallacy of acceptance. In each case, one needs to supplement his CIs with something along the lines of the testing scrutiny offered by SEV. At the same time, a SEV assessment avoids the much-lampooned uses of p-values, or so I have argued. He does allude to a subsequent post, so perhaps he will address these issues there.
PROFESSOR GEOFF CUMMING [ii] (submitted June 13, 2012)
I’m new to this blog—what a trove of riches! I’m prompted to respond by Deborah Mayo’s typically insightful post of 17 May 2012, in which she discussed one-sided tests and referred to my discussion of one-sided CIs (Cumming, 2012, pp 109-113). A central issue is:
Cumming (quoted by Mayo): as usual, the estimation approach is better
Mayo: Is it?
Lots to discuss there. In this first post I’ll outline the big picture as I see it.
‘The New Statistics’ refers to effect sizes, confidence intervals, and meta-analysis, which, of course, are not themselves new. But using them, and relying on them as the basis for interpretation, would be new for most researchers in a wide range of disciplines that have for decades relied on null hypothesis significance testing (NHST). My basic argument for the new statistics rather than NHST is summarised in a brief magazine article (http://tiny.cc/GeoffConversation) and radio talk (http://tiny.cc/geofftalk). The website www.thenewstatistics.com has information about the book (Cumming, 2012) and the ESCI software, which is a free download.
Over 45 years of teaching NHST and p values I became increasingly unhappy about the weird backward logic, the difficulty of teaching it, and the seeming rarity of accurate understanding even among researchers. I believe Rex Kline (2004) has given the best summary critique of NHST, and how it is typically used, in a book chapter, now made freely available at the APA Style website (http://tinyurl.com/klinechap3). To those who argue that the problem is that we simply need to teach NHST better, I offer the evidence of Haller and Krauss (http://tinyurl.com/nhstohdear) that even a substantial proportion of those who teach NHST don’t understand it correctly.
Two NHST problems are most telling for me. First, it encourages and licenses dichotomous thinking. This impoverishes the research questions we ask, the conclusions we draw, and the theories we are prompted to develop and evaluate. Meehl (1978) and Gigerenzer (1998) argued that the dichotomous decision making of NHST has hobbled how psychologists theorise: they are satisfied with surrogates for theories (Gigerenzer’s term) that limply postulate there will be a difference, maybe in a specified direction. With no quantitative formulation, there is little scope for bold, testable predictions.
I suspect that changing our research aims from ‘Does the therapy improve things?’, or ‘Test the hypothesis that introverts get up earlier in the morning’, to ‘How large is the effect of the therapy?’ or ‘What is the relationship between introversion and time of arising?’ may be the crucial step. Asking ‘how much’ almost demands an answer of ‘this much’, rather than an uninformative statement about whether or not we reject the notion of a zero effect. Estimation thinking, not dichotomous thinking. Lots about that in Chapter 1.
The second main problem arises from a weird blindness to sampling variability in relation to p values. Every intro textbook has, as it should, chapters on the sampling variability of means, and on how we should quantify and try to reduce it. If CIs are discussed, so is the variation of the calculated interval over repeated sampling. Great! In stark contrast, a p value is calculated to 2 or 3 decimal places, with no mention that a replication will give a different value. It turns out that the sampling distribution of the p value is very wide, even astonishingly so. I’ve explored that in Cumming (2008), and there’s an ESCI simulation to illustrate. The YouTube video is at http://tinyurl.com/danceptrial2.
My conclusion is that any p value gives only extremely vague information because, thanks to the vagaries of sampling variability, it could easily have been very different. The 80% p interval for .05 is (.00008, .44): if an initial experiment gives two-tailed p = .05, then a replication of that experiment (exactly the same, but with a new sample of participants) has an 80% chance of giving a one-tailed p within that interval, a full 10% chance of p < .00008, and a 10% chance of p > .44. The p interval depends only on the initial p value, not on the sample size N, because N has already been taken into account in calculating p. More in Cumming (2008) and Chapter 5.
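To make the width of that distribution concrete, here is a minimal simulation sketch (an editorial illustration, not Cumming’s ESCI software). It assumes the standard normal model behind the p interval: the initial and replication experiments are identical, so the replication z statistic differs from the initial z by a Normal amount with standard deviation √2, whatever the true effect size; the figures .00008 and .44 then fall out directly.

```python
# Minimal sketch (not ESCI): spread of replication p values when an initial
# experiment gives two-tailed p = .05.
# Assumption: initial and replication z statistics each have unit SE around
# the same (unknown) true effect, so their difference is Normal(0, sqrt(2)).
import numpy as np
from scipy import stats

p_initial = 0.05
z_initial = stats.norm.ppf(1 - p_initial / 2)            # about 1.96

rng = np.random.default_rng(1)
z_rep = z_initial + np.sqrt(2) * rng.standard_normal(1_000_000)
p_rep = stats.norm.sf(z_rep)                              # one-tailed replication p

lo, hi = np.percentile(p_rep, [10, 90])
print(f"80% p interval: ({lo:.5f}, {hi:.2f})")            # roughly (.00008, .44)
print(f"P(p < .00008): {np.mean(p_rep < .00008):.2f}")    # about .10
print(f"P(p > .44):    {np.mean(p_rep > .44):.2f}")       # about .10
```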
So, I’m a fan of estimation and meta-analysis. I suspect that researchers, once free of the overwhelming expectation to report p and talk in terms of statistical significance, might feel it’s entirely natural to ask ‘how much’ and reply ‘8.5, 95% CI [4, 13]’. Rather as a chemist reports a melting point as 32.4 ± 0.1 degrees; what a nonsense to report it as ‘highly significantly greater than zero’! Yes, I know there is duality between CIs and tests, but there are also big differences. I’ll talk about those in another post.
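For what it’s worth, here is a small sketch of that reporting style, using made-up data chosen purely for illustration (my own example, not from the post or ESCI): a point estimate with a t-based 95% CI, in the ‘8.5, 95% CI [4, 13]’ format.

```python
# Estimation-style report: point estimate plus 95% CI (t-based), with
# hypothetical data chosen only to illustrate the reporting format.
import numpy as np
from scipy import stats

y = np.array([1.0, 16.0, 3.0, 14.0, 5.0, 12.0, 2.0, 15.0, 8.0, 9.0])
m, se = y.mean(), stats.sem(y)
lo, hi = stats.t.interval(0.95, df=len(y) - 1, loc=m, scale=se)
print(f"Mean = {m:.1f}, 95% CI [{lo:.1f}, {hi:.1f}]")   # Mean = 8.5, 95% CI [4.5, 12.5]
```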
Cumming, G. (2008). Replication and p intervals: p values predict the future only vaguely, but confidence intervals do much better. Perspectives on Psychological Science, 3, 286–300.
Cumming, G. (2012). Understanding the new statistics: Effect sizes, confidence intervals, and meta-analysis. New York: Routledge.
Gigerenzer, G. (1998). Surrogates for theories. Theory & Psychology, 8, 195-204.
Kline, R. B. (2004). Beyond significance testing: Reforming data analysis methods in behavioral research. Washington, DC: American Psychological Association.
Meehl, P. E. (1978). Theoretical risks and tabular asterisks: Sir Karl, Sir Ronald, and the slow progress of soft psychology. Journal of Consulting and Clinical Psychology, 46, 806–834.
[i] I first met Geoff at one of those terrific “group grope” workshops at NCEAS, the National Center for Ecological Analysis and Synthesis, in Santa Barbara, CA. There he extended one of his CI Excel programs to also compute SEV for a class of cases, but he does not avail himself of this interpretive aid in Cumming (2012).
[ii] Cumming is Emeritus Professor, Faculty of Science, Technology and Engineering, Department of Psychology, La Trobe University (g.cumming@latrobe.edu.au)
I’m curious: Is there any reason that a CI reformer couldn’t just add the upper CI bound needed to avoid “the fallacy of rejection” identified by Mayo? The result would then be: m > the lower CI bound, and the reformer would additionally advise that m is less than the upper CI bound?
Thanks Eileen. Well, I suggested this…but one needs a justification for such a move. The justification is not supplied by the CI corresponding to the test (nor even, as I also note, by the two-sided CI). Reformers would do well to add something akin to a SEV assessment in order to avoid the fallacies.
I’ve seen confidence intervals used just as tests: if the parameter value is in the interval then it is accepted, otherwise rejected. This is what Cumming also appears to recommend. Another rule of thumb is that if both the null value and some important alternative are in the interval then the result is inconclusive. Can SEV also have an “inconclusive” class?
It’s a very misleading and flawed rule of thumb; I’ll direct you to my discussion of this when I’m not racing out.
Eileen: On the problem with the “rule of thumb”: Suppose, for example, that a discrepancy of .3 s.d. from the 0 null (in our test T+) is of substantive importance. Observing a mean that is only 1.3 s.d. in excess of 0 (a p-value of about .1) would not be deemed statistically significantly greater than 0. Both 0 and .3 s.d. would be in a 95% CI, so the rule of thumb says inconclusive. Yet the hypothesis m > .3 s.d. passes with SEV of .84: were m no greater than .3 s.d., a smaller observed difference than the one we got would occur with probability .84 (equivalently, so large an observed difference would occur only about 16% of the time). This is scarcely conveyed by declaring the results uninformative as regards the discrepancy of interest; the rule of thumb forfeits this information. It’s just silly to ignore or dismiss an enlightened use of p-values and stick to ordinary CIs, yet in some circles that has become an unquestioned, “politically correct” standpoint. CIs require the same “testing” supplement that significance tests do if they are to avoid classic fallacies.
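To make the arithmetic in this example checkable, here is a minimal sketch (an editorial illustration, not Mayo’s or Cumming’s code). It assumes everything is measured in units of the standard error of the sample mean, which is what makes the quoted figures come out.

```python
# Test T+: H0: mu <= 0 vs H1: mu > 0, with all quantities expressed in units
# of the standard error of the sample mean (an assumption made here so that
# the numbers in the comment above match).
from scipy import stats

xbar = 1.3                                   # observed mean, in SE units
p_one_sided = stats.norm.sf(xbar)            # about .097: not significant at .05
ci = (xbar - 1.96, xbar + 1.96)              # two-sided 95% CI, about (-0.66, 3.26)
# Both 0 and the substantively important .3 lie inside the CI, so the
# rule of thumb calls the result "inconclusive".

# Severity for the claim mu > .3: probability, were mu = .3, of a sample
# mean smaller than the 1.3 actually observed.
sev = stats.norm.cdf(xbar - 0.3)             # Phi(1.0), about .84
print(round(p_one_sided, 3), ci, round(sev, 2))
```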