“Out Damned Pseudoscience: Non-significant results are the new ‘Significant’ results!” (update)

Sell me that antiseptic!

We were reading “Out, Damned Spot: Can the ‘Macbeth effect’ be replicated?” (Earp, B., Everett, J., Madva, E., and Hamlin, J. 2014, in Basic and Applied Social Psychology 36: 91-8) in an informal gathering of our 6334 seminar yesterday afternoon at Thebes. Some of the graduate students are interested in so-called “experimental” philosophy, and I asked for an example that used statistics for purposes of analysis. The example–and it’s a great one (thanks Rory M!)–revolves around priming research in social psychology. Yes, the field that has come in for so much criticism of late, especially after Diederik Stapel was found to have been fabricating data altogether (search this blog, e.g., here).[1]

But since then the field has, ostensibly, attempted to clean up its act. On the meta-level, Simmons, Nelson, and Simonsohn (2011) is an excellent example of the kind of self-scrutiny the field needs, and their list of requirements and guidelines offers a much-needed start (along with their related work). But the research itself appears to be going on in the same way as before (I don’t claim this one is representative), except that now researchers are keen to show their ability and willingness to demonstrate failure to replicate. So negative results are the new positives! If the new fashion is non-replication, that’s what will be found (following Kahneman’s call for a “daisy chain” of replication; see [1]).

In “Out, Damned Spot,” the authors are unable to replicate what they describe as a famous experiment (Zhong and Liljenquist 2006) wherein participants who read “a passage describing an unethical deed as opposed to an ethical deed, … were subsequently likelier to rate cleansing products as more desirable than other consumer products” (92). There are a variety of protocols, all rather similar. For instance, students are asked to write out a passage to the effect that:

“I shredded a document that I knew my co-worker Harlan was desperately looking for so that I would be the one to get a promotion.”

or

“I place the much sought-after document in Harlan’s mail box.”

See the article for the exact words. Participants are told, untruthfully, that the study is on handwriting, or punctuation, or the like. (Aside: Would you feel more desirous of soap products after punctuating a paragraph about shredding a file that your colleague is looking for? More desirous than when…? More desirous than if you put it in his mailbox, I guess.[2]) In another variation on the Zhong et al. studies, when participants were asked to remember an unethical vs. ethical deed they had committed, those recalling the unethical deed tended to pick an antiseptic wipe over a pen as compensation.

Yet these authors declare there is “a robust experimental foundation for the existence of a real-life Macbeth Effect” and therefore are surprised that they are unable to replicate the result. Given that the field considers this effect solid and important–indeed, the project is to address the replicability of well-accepted results–it is appropriate for the authors to regard it as such. (4/26/14) (I think they are just jumping onto the new bandwagon. Admittedly, I’m skeptical, so send me defenses, if you have them. I place this under “fallacies of negative results.”)

I asked the group of seminar participants if they could even identify a way to pick up on the “Macbeth” effect, assuming no limits to where they could look or what kind of imaginary experiment one could run. Hmmm. We were hard pressed to come up with any. Follow evil-doers around (invisibly) and see if they clean up? Follow do-gooders around (invisibly) to see if they don’t wash so much? (Never mind that cleanliness is next to godliness.) Of course if the killer has got blood on her (as in Lady “a little water clears us of this deed” Macbeth) she’s going to wash up, but the whole point is to apply it to moral culpability more generally (seeing if moral impurity cashes out as physical impurity). So the first signal that an empirical study is at best wishy-washy, and at worst pseudoscientific, is the utter vagueness of the effect being studied. There’s little point to a sophisticated analysis of the statistics if you cannot get past this…unless you’re curious as to what other howlers lie in store. Yet with all of these experiments, the “causal” inference of interest is miles and miles away from the artificial exercises subjects engage in…(unless it is too trivial to bother studying).

Returning to their study: after the writing exercise, the current researchers (Earp et al.) have participants rate various consumer products for their desirability on a scale of 1 to 7.

They found “no significant difference in the mean desirability of the cleansing items between the moral condition (M = 3.09) and immoral condition (M = 3.08)” (94)—a difference so small as to be suspect in itself. Their two-sided confidence interval contains 0, so the null is not rejected. (We get a p-value and Cohen’s d, but no data.) Aris Spanos brought out a point we rarely hear (one that came up in our criticism of a study on hormesis): it’s easy to get phony results with artificial measurement scales like 1–7. (Send me links to others discussing this.) The mean isn’t even meaningful for such an ordinal scale, and anyway, by adjusting the scale, a non-significant difference can become significant. (I don’t think this is mentioned in Simmons, Nelson, and Simonsohn 2011, but I need to reread it.)
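To make the scale point concrete, here is a minimal sketch with made-up ratings (these are not the Earp et al. data, which we don’t have): two conditions with identical means under the usual 1–7 coding, where an equally arbitrary monotone relabeling of the same ordered categories (stretching the top box) turns the “null” result into a significant one.

```python
# A rough sketch with hypothetical ratings (not the study's data) of how the
# arbitrary numeric coding of a 1-7 rating scale can drive "significance".
import numpy as np
from scipy import stats

# Made-up desirability ratings, 50 participants per condition.
immoral = np.array([1] * 25 + [7] * 25)   # polarized ratings; mean = 4.0
moral   = np.array([3] * 25 + [5] * 25)   # mid-scale ratings; mean = 4.0

# Under the standard 1..7 coding the means coincide: Welch's t = 0, p = 1.0.
print(stats.ttest_ind(immoral, moral, equal_var=False))

# A monotone relabeling of the same ordered categories: stretch the top box (7 -> 10).
recode = {1: 1, 2: 2, 3: 3, 4: 4, 5: 5, 6: 6, 7: 10}
immoral_r = np.array([recode[x] for x in immoral])
moral_r   = np.array([recode[x] for x in moral])

# Same ordinal responses, different but equally arbitrary numeric labels:
# the means now differ (5.5 vs. 4.0) and the p-value drops to roughly 0.03.
print(stats.ttest_ind(immoral_r, moral_r, equal_var=False))
```

A rank-based test (e.g., Mann–Whitney) would be invariant to such relabelings, which is exactly the point: with ordinal ratings, a difference (or non-difference) of means is hostage to the numbers one happens to attach to the boxes.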

The authors seem to think that failing to replicate studies restores credibility and is indicative of taking a hard-nosed line, getting beyond the questionable significant results that have come in for such a drubbing. It does not. You can do just as questionable a job finding no effect as finding one. What they need to do is offer a stringent critique of the other studies (and their own). A negative result is not, by itself, a stringent critique. (Kahneman: please issue this further requirement.)
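One way to put this in statistical terms: a non-significant result begins to count against the claimed effect only if the replication had high capability (power, or better, severity) to detect an effect of the size originally reported. Here is a rough sketch of that minimal check, using a normal approximation and an assumed effect size of d = 0.5 purely for illustration (I am not claiming this is the effect size from the original studies):

```python
# A rough sketch of a power check for a two-sample replication. The effect
# size d = 0.5 and the sample sizes are illustrative assumptions, not figures
# from Earp et al. or Zhong & Liljenquist.
from scipy.stats import norm

def approx_power(d, n_per_group, alpha=0.05):
    """Normal-approximation power of a two-sided, two-sample test of means
    with standardized effect size d and n observations per group."""
    z_crit = norm.ppf(1 - alpha / 2)
    return norm.cdf(d * (n_per_group / 2) ** 0.5 - z_crit)

for n in (20, 50, 100, 200):
    print(f"n per group = {n:3d}: power to detect d = 0.5 is {approx_power(0.5, n):.2f}")

# With small n the power is low, so a non-significant result from such a
# design says very little against the original claim.
```

Power against the originally claimed effect is only a crude stand-in for a severity assessment, but it is the minimal sort of self-scrutiny a negative result needs before it can pass as a critique.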

In fact, the scrutiny our seminar group arrived at in a mere one-hour discussion did more to pinpoint the holes in the other studies than all of these failures to replicate. As I see it, that’s the kind of meta-level methodological scrutiny their field needs if it is to lift itself out of the shadows of questionable science. I could go on for pages and pages about all that is irksome and questionable in their analysis, but will not. These researchers, or so it seems, don’t get it.

If philosophers are basing philosophical theories on such “experimental” work without tearing it apart methodologically, then they’re not doing their job. Quine was wrong, and Popper was right (on this point): naturalized philosophy (be it ethics, epistemology, or other) is not a matter of looking to psychological experiment.

Some proposed labels: We might label as questionable science any inferential inquiry where the researchers have not shown sufficient self-scrutiny of fairly flagrant threats to the inferences of interest. These threats arise all along the route from data generation and modeling to interpretation. If an enterprise regularly fails to demonstrate such self-scrutiny, or worse, if its standard methodology revolves around reports that do a poor job of self-scrutiny, then I label the research area pseudoscience. If it regularly uses methods that permit erroneous interpretations of data with high probability, then we might be getting into “fraud” or at least “junk” science. (Some people want to limit “fraud” to a deliberate act. Maybe so, but my feeling is that, as professional researchers claiming to have evidence of something, the onus is on them to be self-critical. Unconscious wishful thinking doesn’t get you off the hook.)

[1] In 2012, Kahneman said he saw a train wreck looming for social psychology and suggested a “daisy chain” of replication.

[2] Correction: I had “less” switched with “more” in an early draft (I wrote this quickly during the seminar).

[3] New reference from Uri Simonsohn: http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2259879

[4] Addition, April 11, 2014: A commentator wrote that I should read Mook’s classic paper against “external validity”:

http://www.uoguelph.ca/~psystats/readings_3380/mook%20article.pdf

In replying, I noted that I agreed with Mook entirely: “I entirely agree that “artificial” experiments can enable the most probative and severe tests. But Mook emphasizes the hypothetical-deductive method (which we may assume would be of the statistical and not purely deductive variety). This requires the very entailments that are questionable in these studies. …I have argued (in sync with Mook, I think) as to why “generalizability”, external validity and the like may miss the point of severe testing, which typically requires honing in on, amplifying, isolating, and even creating effects that would not occur in any natural setting. If our theory T would predict a result or effect with which these experiments conflict, then the argument to the flaw in T holds—as Mook remarks. What is missing from the experiments I criticize is the link that’s needed for testing—the very point that Mook is making.”

I especially like Mook’s example of the wire monkeys. Despite the artificiality, we can use that experiment to understand that hunger reduction is not valued more than motherly comfort, or the like. That’s the trick of a good experiment: if the theory (e.g., that hunger reduction is primary) were true, then we would not expect those laboratory results. The key question is whether we are gaining an understanding, and that’s what I’m questioning here.

I’m emphasizing the meaningfulness of the theory-statistical hypothesis link on purpose. People get so caught up in the statistics that they tend to ignore, at times, the theory-statistics link.

Granted, there are at least two distinct things that might be tested: the effect itself (here, the Macbeth effect), and the reliability of the previous positive results. Even if the previous positive results are irrelevant for understanding the actual effect of interest, one may wish to argue that they are, or could be, picking up something reliably. Even though understanding the effect is of primary importance, one may claim only to be interested in whether the previous results are statistically sound. Yet another claimed interest might be learning more about how to trigger the effect. I think my criticism, in this case, actually gets at all of these, and for different reasons. I’ll be glad to hear other positions.

 

 

______

Simmons, J., Nelson, L., and Simonsohn, U. (2011). False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science XX(X): 1-8.

Zhong, C. and Liljenquist, K. (2006). Washing away your sins: Threatened morality and physical cleansing. Science 313: 1451-1452.

Categories: fallacy of non-significance, junk science, reformers, Statistics


15 thoughts on ““Out Damned Pseudoscience: Non-significant results are the new ‘Significant’ results!” (update)”

  1. I think this article on the “chump effect” had it right a couple of years ago: “Reporters are Credulous,” studies show: http://www.weeklystandard.com/articles/chump-effect_610143.html

  2. Sleepy

    “I asked the group of seminar participants if they could even identify a way to pick up on the “Macbeth” effect assuming no limits…We were hard pressed to come up with any. Follow evil-doers around (invisibly) and see if they clean-up? Follow do-gooders around to see if they don’t wash so much?”

    It’s been a while since I’ve had an opportunity to design a dubious psychological experiment, but here goes. Participants are asked to perform a neutral task, in preparation for which they must sit in a waiting room that contains a hand-sanitizer dispenser. After signing their consent forms, each participant is told to pass on some piece of information to another student, who is already sitting in the waiting room (and is actually a confederate).

    In the experimental group, the participant is asked to lie, on behalf of the researcher, to the confederate for a plausible-sounding but morally questionable reason. In the control group, he or she is asked to give the confederate a neutral piece of information. We then record whether or not the participants use hand sanitizer over a five-minute wait.

    Of course, this leaves out the whole issue of verifying whether or not the action actually induced guilt in any of the experimental group members – but folks would argue that this is good because we’re trying to avoid subjective measurements like survey questions.

    • Sleepy: we were trying to come up with an imaginary but authentic way to observe the intended effect, just to see if the phenomenon was even clear.

    • Sleepy: I may have been a little sleepy when I first responded; I didn’t mean to dismiss your protocol out of hand, if you thought this would be ideal. OK. So, I’m curious about the lie that the person was to tell, and the full details of the scenario.
      Also, the thing about those sanitizers is that we keep hearing how they are helping to create superbugs and resistant bacteria, and thus using them might actually be harmful. So, in this respect, the person committing the unethical action might be seen as guilty of a second one if they take the sanitizer! I’m not being facetious, because it is a growing problem.

      http://consciouslifenews.com/handy-dangers-hand-sanitizer/

  3. Is there any research into the ‘Pilate effect’ whereby people who ratify questionable decisions at the behest of others feel a need to wash their hands?

  4. Received a note from Uri Simonsohn with a new reference:

    “Interesting discussion. I have tried to tackle the nonreplication issue, and I happen to mention and re-analyze a replication of the Macbeth effect (by coincidence). I should say I think the issues I worry about are more purely statistical compared to some of the more substantial ones you raised in your blog/discussion.

    In case you are interested:”
    http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2259879

  5. Anonymous

    Students in these experiments know they are lied to and know about the more popular studies being run. It’s a game they are made to play.

    • Expericonomists feel superior to psych researchers on this score. They get real money for economic activities. But it’s not clear how the students in the Macbeth study would use a suspicion of the topic.

  6. Here’s a valid psychological effect of sorts—but is it “priming”? https://twitter.com/search?q=priming&src=typd

  7. Brian Nosek

    Hi Deborah —

    Yes, it does appear that it is becoming possible to get replication efforts published. I take this as a good sign because a lot of this work has been occurring, just never making it into publication. We have had many instances of our own in which we tried to replicate an effect so that we could extend the paradigm to investigate related questions, but got stuck, never able to get the original result. I think it is a good trend that such attempts can become more widely known so that we can improve understanding of the conditions necessary to obtain the effects.

  8. Mavan

    “Yet with all of these experiments, the “causal” inference of interest is miles and miles away from the artificial exercises subjects engage in….”

    I think you should read Mook’s classic paper:

    http://www.uoguelph.ca/~psystats/readings_3380/mook%20article.pdf

    • Mavan:

      Thanks for the Mook link. On a quick read (I will read it more carefully later), I can entirely agree that “artificial” experiments can enable the most probative and severe tests. But, Mook emphasizes the hypothetical deductive method (which we may assume would be of the statistical and not purely deductive variety). This requires the very entailments that are questionable in these studies. I have written quite a lot about experimental testing (links to which can be found on this blog), and have argued (in sync with Mook I think) as to why “generalizability”, external validity and the like may miss the point of severe testing, which typically requires honing in on, amplifying, isolating, and even creating effects that would not occur in any natural setting. If our theory T would predict a result or effect with which these experiments conflict, then the argument to the flaw in T holds—as Mook remarks. What is missing from the experiments I criticize is the link that’s needed for testing—the very point that Mook is making. If you can explain where the good test is, then I have no criticism. I don’t even think this counts as a good test of the original positive results, and that’s one of my main points.

      Granted, “miles and miles” doesn’t, by itself, describe my criticism (think of severe tests of the Higgs boson in a particle accelerator, or even randomized clinical trials). I’m not in the least saying social science, or even psychology, never attains probative tests. Experimental economics, from what I can tell, often does. Anyway, my main point concerned the nature of the self-scrutiny that would be required here, and also why I think philosophers should criticize the methodology behind experiments on which they might seek to test philosophical theory.

  9. Author Earp was defensive in response to my criticisms on Twitter. We still haven’t seen the data–

    • Sleepy

      I’d like to see him join the conversation here. I’m very curious about the data!

  10. Will philosophers be able to prevent bad psych studies from encroaching upon philosophy?
    This comment was written for my March 5, 2016 blog, but accidentally got deposited here. It’s now (also) at:
    https://errorstatistics.com/2016/03/04/repligate-returns-or-the-non-significance-of-non-significant-results-are-the-new-significant-results/comment-page-1/#comment-139513

    My main worries with the replicationist conclusions in psychology are that they harbor many of the same presuppositions that cause problems in (at least some) psychological experiments to begin with, notably the tendency to assume that observed differences–any differences–are due to the “treatments”, and further, that they are measuring the phenomenon of interest. Even nonsignificant observed differences are interpreted as merely indicating smaller effects of the experimental manipulation, when the significance test is shouting disconfirmation, if not falsification.

    It’s particularly concerning to me in philosophy because these types of experiments are becoming increasingly fashionable in “experimental philosophy,” especially ethics. Ethicists are rarely well-versed in statistics, but they’re getting so enamored of introducing an “empirical” component into their work that they rely on just the kinds of psych studies open to the most problems. Worse, they seem to think they are exempt from providing arguments for a position if they can point to a psych study, and don’t realize how easy it is to read your favorite position into the data. This trend, should it grow, may weaken the philosopher’s sharpest set of tools: argumentation and critical scrutiny. Worse still, they act as if they’re in a position to adjudicate long-standing philosophical disagreements by pointing to a toy psych study. One of the latest philosophical “facts” we’re hearing now is that political conservatives have greater “disgust sensitivity” than liberals. The studies are a complete mess, but I never hear any of the speakers who drop this “fact” express any skepticism. (Not to mention that it’s known the majority of social scientists are non-conservatives–by their definition.)

    One of the psych replication studies considered the hypothesis that believing in determinism (vs. free will) makes you more likely to cheat. The “treatment” is reading a single passage on determinism. How do they measure cheating? You’re supposed to answer a math problem, but are told the computer accidentally spews out the correct answer, so you should press a key to make it disappear and work out the problem yourself. The cheating effect is measured by seeing how often you press the button. But the cheater could very well copy down the right answer given by the computer and be sure to press the button often so as to be scored as not cheating. Then there’s the Macbeth effect, tested by unscrambling soap words and getting you to rate how awful it is to eat your just-run-over dog. See this post: https://errorstatistics.com/2014/04/08/out-damned-pseudoscience-non-significant-results-are-the-new-significant-results/ I could go on and on.

    Maybe this new fad is the result of the death of logical positivism and the Quinean push to “naturalize” philosophy; or maybe it’s simply that ethics has run out of steam. Fortunately, I’m not in ethics, but it’s encroaching upon philosophical discussions and courses. It offends me greatly to see hard-nosed philosophers uncritically buying into these results. In fact, I find it triggers my sensitivity to disgust, even though I score high on their 6-point “liberal” scale.

