We were reading “Out, Damned Spot: Can the ‘Macbeth effect’ be replicated?” (Earp,B., Everett,J., Madva,E., and Hamlin,J. 2014, in Basic and Applied Social Psychology 36: 91-8) in an informal gathering of our 6334 seminar yesterday afternoon at Thebes. Some of the graduate students are interested in so-called “experimental” philosophy, and I asked for an example that used statistics for purposes of analysis. The example–and it’s a great one (thanks Rory M!)–revolves around priming research in social psychology. Yes the field that has come in for so much criticism as of late, especially after Diederik Stapel was found to have been fabricating data altogether (search this blog, e.g., here). Continue reading
Anything Tests Can do, CIs do Better; CIs Do Anything Better than Tests?* (reforming the reformers cont.)
Consider again our one-sided Normal test T+, with null H0: μ < μ0 vs μ >μ0 and μ0 = 0, α=.025, and σ = 1, but let n = 25. So M is statistically significant only if it exceeds .392. Suppose M (the sample mean) just misses significance, say
Mo = .39.
The flip side of a fallacy of rejection (discussed before) is a fallacy of acceptance, or the fallacy of misinterpreting statistically insignificant results. To avoid the age-old fallacy of taking a statistically insignificant result as evidence of zero (0) discrepancy from the null hypothesis μ =μ0, we wish to identify discrepancies that can and cannot be ruled out. For our test T+, we reason from insignificant results to inferential claims of the form:
μ < μ0 + γ
Fisher continually emphasized that failure to reject was not evidence for the null. Neyman, we saw, in chastising Carnap, argued for the following kind of power analysis:
Neymanian Power Analysis (Detectable Discrepancy Size DDS): If data x are not statistically significantly different from H0, and the power to detect discrepancy γ is high (low), then x constitutes good (poor) evidence that the actual effect is < γ. (See 11/9/11 post).
By taking into account the actual x0, a more nuanced post-data reasoning may be obtained.
“In the Neyman-Pearson theory, sensitivity is assessed by means of the power—the probability of reaching a preset level of significance under the assumption that various alternative hypotheses are true. In the approach described here, sensitivity is assessed by means of the distribution of the random variable P, considered under the assumption of various alternatives. “ (Cox and Mayo 2010, p. 291):
This may be captured in :
FEV(ii): A moderate p-value is evidence of the absence of a discrepancy d from Ho only if there is a high probability the test would have given a worse fit with H0 (i.e., a smaller p value) were a discrepancy d to exist. (Mayo and Cox 2005, 2010, 256).
This is equivalently captured in the Rule of Acceptance (Mayo (EGEK) 1996, and in the severity interpretation for acceptance, SIA, Mayo and Spanos (2006, p. 337):
SIA: (a): If there is a very high probability that [the observed difference] would have been larger than it is, were μ > μ1, then μ < μ1 passes the test with high severity,…
But even taking tests and CIs just as we find them, we see that CIs do not avoid the fallacy of acceptance: they do not block erroneous construals of negative results adequately. Continue reading
For a reasonably high choice of confidence level, say .95 or .99, values of µ within the observed interval are plausible, those outside implausible.
Geoff Cumming, a leading statistical reformer in psychology, has long been pressing for ousting significance tests (or NHST) in favor of CIs. The level of confidence “specifies how confident we can be that our CI includes the population parameter m (Cumming 2012, p.69). He recommends prespecified confidence levels .9, .95 or .99:
“We can say we’re 95% confident our one-sided interval includes the true value. We can say the lower limit (LL) of the one-sided CI…is a likely lower bound for the true value, meaning that for 5% of replications the LL will exceed the true value. “ (Cumming 2012, p. 112)
For simplicity, I will use the 2-standard deviation cut-off corresponding to the one-sided confidence level of ~.98.
However, there is a duality between tests and intervals (the intervals containing the parameter values not rejected at the corresponding level with the given data).
“One-sided CIs are analogous to one-tailed tests but, as usual, the estimation approach is better.”
Is it? Consider a one-sided test of the mean of a Normal distribution with n iid samples, and known standard deviation σ, call it test T+.
H0: µ ≤ 0 against H1: µ > 0 , and let σ= 1.
Test T+ at significance level .02 is analogous to forming the one-sided (lower) 98% confidence interval:
µ > M – 2(1/ √n ).
where M, following Cumming, is the sample mean (thereby avoiding those x-bars). M – 2(1/ √n ) is the lower limit (LL) of a 98% CI.
Central problems with significance tests (whether of the N-P or Fisherian variety) include:
(1) results are too dichotomous (e.g., significant at a pre-set level or not);
(2) two equally statistically significant results but from tests with different sample sizes are reported in the same way (whereas the larger the sample size the smaller the discrepancy the test is able to detect);
(3) significance levels (even observed p-values) fail to indicate the extent of the effect or discrepancy (in the case of test T+ , in the positive direction).
We would like to know for what values of δ it is warranted to infer µ > µ0 + δ. Continue reading
Each year leaders of the movement to reform statistical methodology in psychology, social science and other areas of applied statistics get together around this time for a brainstorming session. They review the latest from the Task Force on Statistical Inference (TFSI), propose new regulations they would like the APA publication manual to adopt, and strategize about how to institutionalize improvements to statistical methodology.
While frustrated that the TFSI has still not banned null hypothesis significance testing (NHST), since attempts going back to at least 1996, the reformers have created, and very successfully published in, new meta-level research paradigms designed expressly to study (statistically!) a central question: have the carrots and sticks of reward and punishment been successful in decreasing the use of NHST, and promoting instead use of confidence intervals, power calculations, and meta-analysis of effect sizes? Or not?
This year there are a couple of new members who are pitching in to contribute what they hope are novel ideas for reforming statistical practice. Since it’s Saturday night, let’s listen in on part of an (imaginary) brainstorming session of the New Reformers. This is a 2013 update of an earlier blogpost. Continue reading