“Out Damned Pseudoscience: Non-significant results are the new ‘Significant’ results!” (update)

Sell me that antiseptic!

We were reading “Out, Damned Spot: Can the ‘Macbeth effect’ be replicated?” (Earp, B., Everett, J., Madva, E., and Hamlin, J., 2014, in Basic and Applied Social Psychology 36: 91–98) in an informal gathering of our 6334 seminar yesterday afternoon at Thebes. Some of the graduate students are interested in so-called “experimental” philosophy, and I asked for an example that used statistics for purposes of analysis. The example–and it’s a great one (thanks Rory M!)–revolves around priming research in social psychology. Yes, the field that has come in for so much criticism of late, especially after Diederik Stapel was found to have been fabricating data altogether (search this blog, e.g., here).[1]

But since then the field has, ostensibly, attempted to clean up its act. On the meta-level, Simmons, Nelson, and Simonsohn (2011) is an excellent example of the kind of self-scrutiny the field needs, and their list of requirements and guidelines offer a much needed start (along with their related work). But the research itself appears to be going on in the same way as before (I don’t claim this one is representative), except that now researchers are keen to show their ability and willingness to demonstrate failure to replicate. So negative results are the new positives! If the new fashion is non-replication, that’s what will be found (following Kahneman‘s call for a “daisy chain” in [1]).

In “Out, Damned Spot,” the authors are unable to replicate what they describe as a famous experiment (Zhong and Liljenquist 2006) wherein participants who read “a passage describing an unethical deed as opposed to an ethical deed, … were subsequently likelier to rate cleansing products as more desirable than other consumer products” (92). There are a variety of protocols, all rather similar. For instance, students are asked to write out a passage to the effect that:

“I shredded a document that I knew my co-worker Harlan was desperately looking for so that I would be the one to get a promotion.”


“I place the much sought-after document in Harlan’s mail box.”

See the article for the exact words. Participants are told, untruthfully, that the study is on handwriting, or on punctuation, or the like. (Aside: Would you feel more desirous of soap products after punctuating a paragraph about shredding a file that your colleague is looking for? More desirous than when…? More desirous than if you put it in his mailbox, I guess.[2]) In another variation on the Zhong et al. studies, when participants are asked to remember an unethical vs. an ethical deed they committed, they tended to pick an antiseptic wipe over a pen as compensation.

Yet these authors declare there is “a robust experimental foundation for the existence of a real-life Macbeth Effect” and therefore are surprised that they are unable to replicate the result. The very fact that the article starts by giving high praise to these earlier studies already raises a big question mark in my mind as to their critical capacities, so I am not too surprised that they do not bring such capacities into their own studies. It’s so nice to have cross-out capability. Given that the field considers this effect solid and important, it is appropriate for the authors to regard it as such. (I think they are just jumping onto the new bandwagon. Admittedly, I’m skeptical, so send me defenses, if you have them. I place this under “fallacies of negative results”.)

I asked the group of seminar participants if they could even identify a way to pick up on the “Macbeth” effect, assuming no limits to where they could look or what kind of imaginary experiment one could run. Hmmm. We were hard pressed to come up with any. Follow evil-doers around (invisibly) and see if they clean up? Follow do-gooders around (invisibly) to see if they don’t wash so much? (Never mind that cleanliness is next to godliness.) Of course if the killer has got blood on her (as in Lady “a little water clears us of this deed” Macbeth) she’s going to wash up, but the whole point is to apply it to moral culpability more generally (seeing if moral impurity cashes out as physical). So the first signal that an empirical study is at best wishy-washy, and at worst pseudoscientific, is the utter vagueness of the effect being studied. There’s little point to a sophisticated analysis of the statistics if you cannot get past this…unless you’re curious as to what other howlers lie in store. Yet with all of these experiments, the “causal” inference of interest is miles and miles away from the artificial exercises subjects engage in… (unless too trivial to bother studying).

Returning to their study: after the writing exercise, the current researchers (Earp et al.) have participants rate various consumer products for their desirability on a scale of 1 to 7.

They found “no significant difference in the mean desirability of the cleansing items between the moral condition (M= 3.09) and immoral condition (M = 3.08)” (94)—a difference that is so small as to be suspect in itself. Their two-sided confidence interval contains 0 so the null is not rejected. (We get a p-value and Cohen’s d, but no data.) Aris Spanos brought out a point we rarely hear (that came up in our criticism of a study on hormesis): it’s easy to get phony results with artificial measurement scales like 1-7. (Send links of others discussing this.) The mean isn’t even meaningful, and anyway, by adjusting the scale, a non-significant difference can become significant. (I don’t think this is mentioned in Simmons, Nelson, and Simonsohn 2011, but I need to reread it.)
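Spanos’s point about artificial 1–7 scales can be illustrated with a toy example (the numbers below are invented for illustration, not taken from the study): since the mean of an ordinal scale depends on the numbers arbitrarily assigned to the categories, a monotone re-spacing of the same responses can turn a zero mean difference into a large t statistic.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(a, b):
    """Two-sample Welch t statistic."""
    na, nb = len(a), len(b)
    return (mean(a) - mean(b)) / sqrt(variance(a) / na + variance(b) / nb)

# Invented ordinal ratings on a 1-7 scale (not the paper's data)
A = [3] * 150 + [5] * 150               # mean 4 under the 1..7 coding
B = [1] * 100 + [4] * 100 + [7] * 100   # also mean 4

t_lin = welch_t(A, B)                   # exactly 0: "no effect"

# Monotone re-spacing of the same categories: stretch the top category
respace = {1: 1, 2: 2, 3: 3, 4: 4, 5: 5, 6: 6, 7: 12}
t_re = welch_t([respace[x] for x in A], [respace[x] for x in B])

print(f"t under 1-7 coding: {t_lin:.2f}; after re-spacing: {t_re:.2f}")
```

The same respondents, the same ordering of categories, yet one coding gives t = 0 and the other a |t| around 6: the “mean desirability” is an artifact of the spacing chosen.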

The authors seem to think that failing to replicate studies restores credibility, and is indicative of taking a hard-nosed line, getting beyond the questionable significant results that have come in for such a drubbing. It does not. You can do just as questionable a job finding no effect as finding one. What they need to do is offer a stringent critique of the other (and their own) studies. A negative result is not a stringent critique. (Kahneman: please issue this further requirement.)

In fact, the scrutiny our seminar group arrived at in a mere one-hour discussion did more to pinpoint the holes in the other studies than all their failures to replicate. As I see it, that’s the kind of meta-level methodological scrutiny that their field needs if they are to lift themselves out of the shadows of questionable science. I could go on for pages and pages on all that is irksome and questionable about their analysis but will not. These researchers don’t seem to get it. (Or so it seems.)

If philosophers are basing philosophical theories on such “experimental” work without tearing them apart methodologically, then they’re not doing their job. Quine was wrong, and Popper was right (on this point): naturalized philosophy (be it ethics, epistemology or other) is not a matter of looking to psychological experiment.

Some proposed labels: We might label as questionable science any inferential inquiry where the researchers have not shown sufficient self-scrutiny of fairly flagrant threats to the inferences of interest. These threats would involve problems all along the route from the data generation and modeling to their interpretation. If an enterprise regularly fails to demonstrate such self-scrutiny, or worse, if its standard methodology revolves around reports that do a poor job at self-scrutiny, then I label the research area pseudoscience. If it regularly uses methods that permit erroneous interpretations of data with high probability, then we might be getting into “fraud” or at least “junk” science. (Some people want to limit “fraud” to a deliberate act. Maybe so, but my feeling is, as professional researchers claiming to have evidence of something, the onus is on them to be self-critical. Unconscious wishful thinking doesn’t get you off the hook.)

[1] In 2012 Kahneman said he saw a train-wreck looming for social psychology and suggested a “daisy chain” of replication.

[2] Correction: I had “less” switched with “more” in the early draft (I wrote this quickly during the seminar).

[3] New reference from Uri Simonsohn:

[4] Addition, April 11, 2014: A commentator wrote that I should read Mook’s classic paper against “external validity”.

In replying, I noted that I agreed with Mook entirely: I entirely agree that “artificial” experiments can enable the most probative and severe tests. But Mook emphasizes the hypothetical-deductive method (which we may assume would be of the statistical and not purely deductive variety). This requires the very entailments that are questionable in these studies. …I have argued (in sync with Mook, I think) as to why “generalizability”, external validity, and the like may miss the point of severe testing, which typically requires homing in on, amplifying, isolating, and even creating effects that would not occur in any natural setting. If our theory T would predict a result or effect with which these experiments conflict, then the argument to the flaw in T holds—as Mook remarks. What is missing from the experiments I criticize is the link that’s needed for testing—the very point that Mook is making.

I especially like Mook’s example of the wire monkeys. Despite the artificiality, we can use that experiment to understand that hunger reduction is not valued more than motherly comfort or the like. That’s the trick of a good experiment: if the theory (e.g., that hunger reduction is primary) were true, then we would not expect those laboratory results. The key question is whether we are gaining an understanding, and that’s what I’m questioning.

I’m emphasizing the meaningfulness of the theory-statistical hypothesis link on purpose. People get so caught up in the statistics that they tend to ignore, at times, the theory-statistics link.

Granted, there are at least two distinct things that might be tested: the effect itself (here the Macbeth effect), and the reliability of the previous positive results. Even if the previous positive results are irrelevant for understanding the actual effect of interest, one may wish to argue that it is or could be picking up something reliably. Even though understanding the effect is of primary importance, one may claim only to be interested in whether the previous results are statistically sound. Yet another interest might be claimed to be learning more about how to trigger it. I think my criticism, in this case, actually gets to all of these, and for different reasons. I’ll be glad to hear other positions.




Simmons, J., Nelson, L., and Simonsohn, U. (2011). False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science 22(11): 1359–1366.

Zhong, C. and Liljenquist, K. (2006). Washing away your sins: Threatened morality and physical cleansing. Science 313: 1451–1452.

Categories: fallacy of non-significance, junk science, reformers, Statistics

Anything Tests Can do, CIs do Better; CIs Do Anything Better than Tests?* (reforming the reformers cont.)

Having reblogged the 5/17/12 post on “reforming the reformers” yesterday, I thought I should reblog its follow-up: 6/2/12.

Consider again our one-sided Normal test T+, with null H0: μ ≤ μ0 vs. H1: μ > μ0, where μ0 = 0, α = .025, and σ = 1, but let n = 25. So M is statistically significant only if it exceeds .392. Suppose M (the sample mean) just misses significance, say

M0 = .39.

The flip side of a fallacy of rejection (discussed before) is a fallacy of acceptance, or the fallacy of misinterpreting statistically insignificant results.  To avoid the age-old fallacy of taking a statistically insignificant result as evidence of zero (0) discrepancy from the null hypothesis μ =μ0, we wish to identify discrepancies that can and cannot be ruled out.  For our test T+, we reason from insignificant results to inferential claims of the form:

μ < μ0 + γ

Fisher continually emphasized that failure to reject was not evidence for the null.  Neyman, we saw, in chastising Carnap, argued for the following kind of power analysis:

Neymanian Power Analysis (Detectable Discrepancy Size DDS): If data x are not statistically significantly different from H0, and the power to detect discrepancy γ is high (low), then x constitutes good (poor) evidence that the actual effect is < γ. (See 11/9/11 post).
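The DDS reasoning can be put to numbers for the test T+ above (μ0 = 0, σ = 1, n = 25, α = .025, cutoff .392); a minimal sketch in Python:

```python
from math import erf, sqrt

def Phi(z):
    """Standard normal CDF."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def power(gamma, mu0=0.0, sigma=1.0, n=25, z_alpha=1.96):
    """P(M > cutoff; mu = mu0 + gamma) for the one-sided test T+."""
    se = sigma / sqrt(n)           # 0.2 when n = 25
    cutoff = mu0 + z_alpha * se    # .392: M is significant beyond this
    return 1.0 - Phi((cutoff - (mu0 + gamma)) / se)

# Power is low against small discrepancies, high against large ones
for g in (0.1, 0.392, 0.8):
    print(f"power against gamma = {g}: {power(g):.3f}")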

By taking into account the actual x0, a more nuanced post-data reasoning may be obtained.

“In the Neyman-Pearson theory, sensitivity is assessed by means of the power—the probability of reaching a preset level of significance under the assumption that various alternative hypotheses are true. In the approach described here, sensitivity is assessed by means of the distribution of the random variable P, considered under the assumption of various alternatives.” (Cox and Mayo 2010, p. 291)

This may be captured in:

FEV(ii): A moderate p-value is evidence of the absence of a discrepancy d from H0 only if there is a high probability the test would have given a worse fit with H0 (i.e., a smaller p-value) were a discrepancy d to exist. (Mayo and Cox 2005, 2010, p. 256)

This is equivalently captured in the Rule of Acceptance (Mayo, EGEK 1996), and in the severity interpretation for acceptance, SIA (Mayo and Spanos 2006, p. 337):

SIA: (a): If there is a very high probability that [the observed difference] would have been larger than it is, were μ > μ1, then μ < μ1 passes the test with high severity,…
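SIA can likewise be computed for the example above (M0 = .39, σ = 1, n = 25): SEV(μ < μ1) is the probability that M would have been larger than the observed .39 were μ = μ1. A minimal sketch:

```python
from math import erf, sqrt

def Phi(z):
    """Standard normal CDF."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def sev_upper(m_obs, mu1, sigma=1.0, n=25):
    """SEV(mu < mu1): P(M > m_obs; mu = mu1) for test T+."""
    se = sigma / sqrt(n)
    return 1.0 - Phi((m_obs - mu1) / se)

# The insignificant M0 = .39 warrants mu < .8 with high severity,
# but gives poor grounds for mu < .2
for mu1 in (0.2, 0.39, 0.8):
    print(f"SEV(mu < {mu1}) = {sev_upper(0.39, mu1):.3f}")
```

So the same insignificant result licenses some upper bounds and not others: exactly the nuance lost in a bare “not significant” report.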

But even taking tests and CIs just as we find them, we see that CIs do not avoid the fallacy of acceptance: they do not block erroneous construals of negative results adequately.

Categories: CIs and tests, Error Statistics, reformers, Statistics

Do CIs Avoid Fallacies of Tests? Reforming the Reformers (Reblog 5/17/12)

The one method that enjoys the approbation of the New Reformers is that of confidence intervals. The general recommended interpretation is essentially this:

For a reasonably high choice of confidence level, say .95 or .99, values of µ within the observed interval are plausible, those outside implausible.

Geoff Cumming, a leading statistical reformer in psychology, has long been pressing for ousting significance tests (or NHST[1]) in favor of CIs. The level of confidence “specifies how confident we can be that our CI includes the population parameter μ” (Cumming 2012, p. 69). He recommends prespecified confidence levels of .90, .95, or .99:

“We can say we’re 95% confident our one-sided interval includes the true value. We can say the lower limit (LL) of the one-sided CI…is a likely lower bound for the true value, meaning that for 5% of replications the LL will exceed the true value.” (Cumming 2012, p. 112)[2]

For simplicity, I will use the 2-standard deviation cut-off corresponding to the one-sided confidence level of ~.98.

However, there is a duality between tests and intervals (the intervals containing the parameter values not rejected at the corresponding level with the given data).[3]

“One-sided CIs are analogous to one-tailed tests but, as usual, the estimation approach is better.”

Is it? Consider a one-sided test of the mean of a Normal distribution with n iid samples and known standard deviation σ; call it test T+.

H0: µ ≤ 0 against H1: µ > 0, and let σ = 1.

Test T+ at significance level .02 is analogous to forming the one-sided (lower) 98% confidence interval:

µ > M − 2(1/√n),

where M, following Cumming, is the sample mean (thereby avoiding those x-bars). M − 2(1/√n) is the lower limit (LL) of the 98% CI.
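The duality can be checked numerically: with the 2-standard-deviation cutoff, test T+ rejects H0: µ ≤ 0 exactly when the lower limit of the one-sided 98% interval excludes 0. A minimal sketch:

```python
from math import sqrt

def lower_limit(m, sigma=1.0, n=25):
    """LL of the ~98% one-sided (lower) CI: M - 2*sigma/sqrt(n)."""
    return m - 2.0 * sigma / sqrt(n)

def rejects(m, sigma=1.0, n=25):
    """Test T+ at the 2-sd cutoff rejects H0: mu <= 0 iff M > 2*sigma/sqrt(n)."""
    return m > 2.0 * sigma / sqrt(n)

# The two verdicts coincide for every sample mean
for m in (0.1, 0.39, 0.40, 0.41, 1.0):
    assert rejects(m) == (lower_limit(m) > 0)
    print(f"M = {m}: reject = {rejects(m)}, LL = {lower_limit(m):+.3f}")
```

Given this duality, whatever dichotomy infects the test infects the corresponding interval endpoint just as much.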

Central problems with significance tests (whether of the N-P or Fisherian variety) include:

(1) results are too dichotomous (e.g., significant at a pre-set level or not);

(2) two equally statistically significant results but from tests with different sample sizes are reported in the same way (whereas the larger the sample size, the smaller the discrepancy the test is able to detect);

(3) significance levels (even observed p-values) fail to indicate the extent of the effect or discrepancy (in the case of test T+ , in the positive direction).

We would like to know for what values of δ it is warranted to infer µ > µ0 + δ.

Categories: confidence intervals and tests, reformers, Statistics

Saturday Night Brainstorming and Task Forces: (2013) TFSI on NHST

Saturday Night Brainstorming: The TFSI on NHST–reblogging with a 2013 update

Each year leaders of the movement to reform statistical methodology in psychology, social science and other areas of applied statistics get together around this time for a brainstorming session. They review the latest from the Task Force on Statistical Inference (TFSI), propose new regulations they would like the APA publication manual to adopt, and strategize about how to institutionalize improvements to statistical methodology. 

While frustrated that the TFSI has still not banned null hypothesis significance testing (NHST), despite attempts going back to at least 1996, the reformers have created, and very successfully published in, new meta-level research paradigms designed expressly to study (statistically!) a central question: have the carrots and sticks of reward and punishment been successful in decreasing the use of NHST, and in promoting instead the use of confidence intervals, power calculations, and meta-analysis of effect sizes? Or not?

This year there are a couple of new members who are pitching in to contribute what they hope are novel ideas for reforming statistical practice. Since it’s Saturday night, let’s listen in on part of an (imaginary) brainstorming session of the New Reformers. This is a 2013 update of an earlier blogpost.

Categories: Comedy, reformers, statistical tests, Statistics
