Some statistical dirty laundry: have the stains become permanent?



Right after our session at the SPSP meeting last Friday, I chaired a symposium on replication that included Brian Earp–an active player in replication research in psychology (Replication and Evidence: A tenuous relationship p. 80). One of the first things he said, according to my notes, is that gambits such as cherry picking, p-hacking, hunting for significance, selective reporting, and other QRPs had become standard practice in psychology, without any special need to adjust p-values or alert the reader to their spuriousness [i]. (He will correct me if I’m wrong[2].) It shocked me to hear it, even though it shouldn’t have, given what I’ve learned about statistical practice in social science. It was the Report on Stapel that really pulled back the curtain on this attitude toward QRPs in social psychology–as discussed in this blogpost 3 years ago. (If you haven’t read Section 5 of the report on flawed science, you should.) Many of us assumed that QRPs, even if still committed, were at least recognized to be bad statistical practices since the time of Morrison and Henkel’s (1970) Significance Test Controversy. A question now is this: have all the confessions of dirty laundry, the fraudbusting of prominent researchers, the pledges to straighten up and fly right, the years of replication research, done anything to remove the stains? I leave the question open for now. Here’s my “statistical dirty laundry” post from 2013:

[i] I assume this is no longer true. 

[2] June 24: Earp’s correction was that QRPs had “become standard practice”. But if they were taught as things a scientist with integrity must avoid, or adjust for (or at least inform the reader about), then how did they become standard practice? In the interviews conducted by the Stapel committee, the interviewees showed a cavalier attitude toward these moves.

I finally had a chance to fully read the 2012 Tilburg Report* on “Flawed Science” last night. Here are some stray thoughts…

1. Slipping into pseudoscience.
The authors of the Report say they never anticipated giving a laundry list of “undesirable conduct” by which researchers can flout pretty obvious requirements for the responsible practice of science. It was an accidental byproduct of the investigation of one case (Diederik Stapel, social psychology) that they walked into a culture of “verification bias”[1]. Maybe that’s why I find it so telling. It’s as if they could scarcely believe their ears when people they interviewed “defended the serious and less serious violations of proper scientific method with the words: that is what I have learned in practice; everyone in my research environment does the same, and so does everyone we talk to at international conferences” (Report 48). So they trot out some obvious rules, and it seems to me that they do a rather good job:

“One of the most fundamental rules of scientific research is that an investigation must be designed in such a way that facts that might refute the research hypotheses are given at least an equal chance of emerging as do facts that confirm the research hypotheses. Violations of this fundamental rule, such as continuing an experiment until it works as desired, or excluding unwelcome experimental subjects or results, inevitably tends to confirm the researcher’s research hypotheses, and essentially render the hypotheses immune to the facts…. [T]he use of research procedures in such a way as to ‘repress’ negative results by some means” may be called verification bias. [my emphasis] (Report, 48).

I would place techniques for ‘verification bias’ under the general umbrella of techniques for squelching stringent criticism and repressing severe tests. These gambits make it so easy to find apparent support for one’s pet theory or hypotheses that the support counts as no evidence at all (see some from their list). Any field that regularly proceeds this way I would call a pseudoscience, or non-science, following Popper. “Observations or experiments can be accepted as supporting a theory (or a hypothesis, or a scientific assertion) only if these observations or experiments are severe tests of the theory” (Popper 1994, p. 89). [2] It is unclear at what point a field slips into the pseudoscience realm.
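A minimal simulation sketch, not taken from the Report, can make the point about "continuing an experiment until it works as desired" concrete. It assumes a two-sided z-test with known unit variance, a true null (mean zero), and a hypothetical peeking schedule (start with 10 subjects, add 5 at a time up to 100); all function names and parameter values are illustrative assumptions.

```python
import math
import random

def p_value_two_sided(total, n):
    """Two-sided z-test p-value for H0: mean = 0, assuming known unit variance."""
    z = total / math.sqrt(n)
    return math.erfc(abs(z) / math.sqrt(2))

def stops_on_significance(rng, start=10, step=5, max_n=100, alpha=0.05):
    """Collect null data in batches, testing after each one.

    Returns True if p < alpha is reached at any look (the researcher
    would stop there and report a 'significant' effect).
    """
    total = sum(rng.gauss(0.0, 1.0) for _ in range(start))
    n = start
    while True:
        if p_value_two_sided(total, n) < alpha:
            return True
        if n >= max_n:
            return False
        total += sum(rng.gauss(0.0, 1.0) for _ in range(step))
        n += step

def rejection_rate(reps=4000, seed=2, **kwargs):
    """Share of simulated null experiments that end in a rejection."""
    rng = random.Random(seed)
    return sum(stops_on_significance(rng, **kwargs) for _ in range(reps)) / reps

if __name__ == "__main__":
    # Fixed design: a single test at n = 100 (start == max_n means one look).
    print("fixed n:", rejection_rate(start=100, max_n=100))
    # Optional stopping: test at n = 10, 15, ..., 100 and stop at the first p < 0.05.
    print("peeking:", rejection_rate())
```

Under these assumptions the fixed design should reject a true null about 5% of the time, while the peeking design should reject it noticeably more often, even though no effect exists. A result reached this way has not passed a severe test in the sense described above.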

2. A role for philosophy of science?
I am intrigued that one of the final recommendations in the Report is this:

In the training program for PhD students, the relevant basic principles of philosophy of science, methodology, ethics and statistics that enable the responsible practice of science must be covered. Based on these insights, research Master’s students and PhD students must receive practical training from their supervisors in the application of the rules governing proper and honest scientific research, which should include examples of such undesirable conduct as data massage. The Graduate School must explicitly ensure that this is implemented.

A philosophy department could well create an entire core specialization that revolved around “the relevant basic principles of philosophy of science, methodology, ethics and statistics that enable the responsible practice of science” (ideally linked with one or more other departments). That would both be innovative and fill an important gap, it seems to me. Is anyone doing this?



3. Hanging out some statistical dirty laundry.
Items in their laundry list include:

  • An experiment fails to yield the expected statistically significant results. The experiment is repeated, often with minor changes in the manipulation or other conditions, and the only experiment subsequently reported is the one that did yield the expected results. The article makes no mention of this exploratory method… It should be clear, certainly with the usually modest numbers of experimental subjects, that using experiments in this way can easily lead to an accumulation of chance findings….
  • A variant of the above method is: a given experiment does not yield statistically significant differences between the experimental and control groups. The experimental group is compared with a control group from a different experiment—reasoning that ‘they are all equivalent random groups after all’—and thus the desired significant differences are found. This fact likewise goes unmentioned in the article….
  • The removal of experimental conditions. For example, the experimental manipulation in an experiment has three values. …Two of the three conditions perform in accordance with the research hypotheses, but a third does not. With no mention in the article of the omission, the third condition is left out….
  • The merging of data from multiple experiments [where data] had been combined in a fairly selective way,…in order to increase the number of subjects to arrive at significant results…
  • Research findings were based on only some of the experimental subjects, without reporting this in the article. On the one hand ‘outliers’…were removed from the analysis whenever no significant results were obtained. (Report, 49-50)

For many further examples, and also caveats [3], see the Report.
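The first item on the list (rerun the experiment, report only the run that "worked") lends itself to a quick check. The sketch below is an illustration under stated assumptions, not anything from the Report: each rerun is an independent experiment on 20 subjects with no real effect, analyzed with a two-sided z-test with known unit variance at the 0.05 level. With k independent tries, the chance of at least one nominally significant result is 1 - 0.95^k, and the simulation is compared against that figure. Function names and parameter values are hypothetical.

```python
import math
import random

def p_value_two_sided(sample):
    """Two-sided z-test p-value for H0: mean = 0, assuming known unit variance."""
    n = len(sample)
    z = sum(sample) / math.sqrt(n)
    return math.erfc(abs(z) / math.sqrt(2))

def any_significant(rng, tries, n=20, alpha=0.05):
    """Run up to `tries` null experiments; True if any one reaches p < alpha."""
    for _ in range(tries):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        if p_value_two_sided(sample) < alpha:
            return True  # this is the only experiment that would be reported
    return False

def false_positive_rate(tries, reps=4000, seed=1):
    """Share of simulated 'research programs' that end with a reportable result."""
    rng = random.Random(seed)
    hits = sum(any_significant(rng, tries) for _ in range(reps))
    return hits / reps

if __name__ == "__main__":
    for k in (1, 5, 10):
        exact = 1 - 0.95 ** k
        print(k, "tries: simulated", false_positive_rate(k), "exact", round(exact, 3))
```

For k = 5 the exact figure is about 0.226 and for k = 10 about 0.401, so a reader who is told only about the one successful experiment is looking at a nominal p-value that badly understates how easily such a result arises by chance.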

4.  Significance tests don’t abuse science, people do.
Interestingly the Report distinguishes the above laundry list from “statistical incompetence and lack of interest found” (52). If the methods used were statistical, then the scrutiny might be called “metastatistical” or the full scrutiny “meta-methodological”. Stapel often fabricated data, but the upshot of these criticisms is that sufficient finagling may similarly predetermine that a researcher’s favorite hypothesis gets support. (There is obviously a big advantage in having the data to scrutinize, as many are now demanding.) Is it a problem of these methods that they are abused? Or does the fault lie with the abusers? Obviously the latter. Statistical methods don’t kill scientific validity, people do.

I have long rejected dichotomous testing, but the gambits in the laundry list create problems even for more sophisticated uses of methods, e.g., for indicating magnitudes of discrepancy and associated confidence intervals. At least the methods admit of tools for mounting a critique.

In “The Mind of a Con Man” (NY Times, April 26, 2013[4]), Diederik Stapel explains how he always read the research literature extensively to generate his hypotheses. “So that it was believable and could be argued that this was the only logical thing you would find.” Rather than report on believability, researchers need to report the properties of the methods they used: What was their capacity to have identified, avoided, or admitted verification bias? The role of probability here would not be to quantify the degree of confidence or believability in a hypothesis, given the background theory or most intuitively plausible paradigms, but rather to check how severely probed or well-tested a hypothesis is, whether the assessment is formal, quasi-formal or informal. Was a good job done in scrutinizing flaws…or a terrible one? Or was there just a bit of data massaging and cherry picking to support the desired conclusion? As a matter of routine, researchers should tell us. Yes, as Joe Simmons, Leif Nelson and Uri Simonsohn suggest in “A 21-word solution”, they should “say it!” I am no longer inclined to regard their recommendation as too unserious: researchers who are “clean” should go ahead and “clap their hands”[5]. (I will consider their article in a later post…)

I recommend reading the Tilburg report!

*The subtitle is “The fraudulent research practices of social psychologist Diederik Stapel.”

[1] “A ‘byproduct’ of the Committees’ inquiries is the conclusion that, far more than was originally assumed, there are certain aspects of the discipline itself that should be deemed undesirable or even incorrect from the perspective of academic standards and scientific integrity.” (Report 54).

[2] Mere falsifiability, by the way, does not suffice for stringency; but there are also methods Popper rejects that could yield severe tests, e.g., double counting. (Search this blog for more entries.)

[3] “It goes without saying that the Committees are not suggesting that unsound research practices are commonplace in social psychology. …although they consider the findings of this report to be sufficient reason for the field of social psychology in the Netherlands and abroad to set up a thorough internal inquiry into the state of affairs in the field” (Report, 48).

[4] Philosopher Janet Stemwedel discusses the NY Times article, noting that Diederik Stapel taught a course on research ethics!

[5] From Simmons, Nelson and Simonsohn:

 Many support our call for transparency, and agree that researchers should fully disclose details of data collection and analysis. Many do not agree. What follows is a message for the former; we begin by preaching to the choir.

Choir: There is no need to wait for everyone to catch up with your desire for a more transparent science. If you did not p-hack a finding, say it, and your results will be evaluated with the greater confidence they deserve.

If you determined sample size in advance, say it.

If you did not drop any variables, say it.

If you did not drop any conditions, say it.

The Fall 2012 Newsletter for the Society for Personality and Social Psychology
Popper, K. 1994, The Myth of the Framework.







4 thoughts on “Some statistical dirty laundry: have the stains become permanent?”

  1. Re: “A philosophy department could well create an entire core specialization … (ideally linked with one or more other departments).”

    This would be a tall order. Maybe there are places where it could happen, but there is the “philosophy of science” that philosophy departments teach, there is the reflection on practice that informs the “working philosophy of scientists”, and the two barely overlap from what I’ve seen. The concrete of the silos is very thick.

    • Jon: I take it that the SPSP was established to combat this non-overlap. But if you think about it, not much would be missing from a philo dept, and we already have specialized depts. as in Carnegie Mellon. Even history of philosophy could be included. More realistically, what wouldn’t be far-fetched is for a set of courses on meta-methodology across a few depts.

  2. How could one replicate something that just could not be replicated?
    In physics, replication of an experiment means the same conditions. Not “similar”, not “almost same”. Same is same. If conditions are different during replication then it is not replication at all. It is an experiment conducted under DIFFERENT conditions.

    In psychology we can never have the same conditions, because the inner state of participants is always different and cannot be controlled, even in principle.

    So “reproducible psychology research” is just an oxymoron and cannot exist in the real world.
    We always have an experiment under different conditions; this is not replication at all.


    • kogdato: I responded to your comment in my “statistical dirty laundry” post. Strictly speaking, conditions are never identical, even in highly controlled experiments in physics, but in the land of statistical inference, it’s a mistake to suppose there are no warranted generalizations on these grounds. That’s a way to let them off too easily. Every die tossing is different, but we can warrant statistical regularities with high accuracy.
