Some statistical dirty laundry: The Tilburg (Stapel) Report on “Flawed Science”

Objectivity 1: Will the Real Junk Science Please Stand Up?


I had a chance to reread the 2012 Tilburg Report* on “Flawed Science” last night. The full report is now here. The discussion of the statistics is around pp. 17-21 (of course there was so little actual data in this case!). You might find it interesting. Here are some stray thoughts reblogged from 2 years ago…

1. Slipping into pseudoscience.
The authors of the Report say they never anticipated giving a laundry list of “undesirable conduct” by which researchers can flout pretty obvious requirements for the responsible practice of science. It was an accidental byproduct of the investigation of one case (Diederik Stapel, social psychology) that they walked into a culture of “verification bias”[1]. Maybe that’s why I find it so telling. It’s as if they could scarcely believe their ears when people they interviewed “defended the serious and less serious violations of proper scientific method with the words: that is what I have learned in practice; everyone in my research environment does the same, and so does everyone we talk to at international conferences” (Report 48). So they trot out some obvious rules, and it seems to me that they do a rather good job.

One of the most fundamental rules of scientific research is that an investigation must be designed in such a way that facts that might refute the research hypotheses are given at least an equal chance of emerging as do facts that confirm the research hypotheses. Violations of this fundamental rule, such as continuing an experiment until it works as desired, or excluding unwelcome experimental subjects or results, inevitably tends to confirm the researcher’s research hypotheses, and essentially render the hypotheses immune to the facts…. [T]he use of research procedures in such a way as to ‘repress’ negative results by some means may be called verification bias. [my emphasis] (Report, 48).

I would place techniques for ‘verification bias’ under the general umbrella of techniques for squelching stringent criticism and repressing severe tests. These gambits make it so easy to find apparent support for one’s pet theory or hypotheses that the support counts as no evidence at all (see some from their list). Any field that regularly proceeds this way I would call a pseudoscience, or non-science, following Popper. “Observations or experiments can be accepted as supporting a theory (or a hypothesis, or a scientific assertion) only if these observations or experiments are severe tests of the theory” (Popper 1994, p. 89). [2] It is unclear at what point a field slips into the pseudoscience realm.
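The Report's example of “continuing an experiment until it works as desired” can be illustrated with a small simulation. This is only a sketch and is not taken from the Report: the batch size of 10, the ceiling of 100 subjects, the 0.05 threshold, and the function names are all assumptions chosen for illustration. The data are generated with no effect at all, so every “significant” result is an error. Testing once at a fixed sample size should err about 5% of the time; testing after every batch and stopping at the first success errs far more often.

```python
import math
import random

def p_value_mean_zero(xs):
    """Two-sided z-test of mean 0 for data with known unit variance."""
    z = sum(xs) / math.sqrt(len(xs))
    return math.erfc(abs(z) / math.sqrt(2))

def stops_with_significance(rng, batch=10, max_n=100, alpha=0.05):
    """Add subjects in batches, test after each batch, stop at the first p < alpha.
    The data are N(0, 1), so the null hypothesis is true by construction."""
    xs = []
    while len(xs) < max_n:
        xs.extend(rng.gauss(0, 1) for _ in range(batch))
        if p_value_mean_zero(xs) < alpha:
            return True
    return False

def error_rate(n_sims=2000, seed=7, **kwargs):
    """Fraction of simulated null studies that end up 'significant'."""
    rng = random.Random(seed)
    return sum(stops_with_significance(rng, **kwargs) for _ in range(n_sims)) / n_sims

if __name__ == "__main__":
    # One look at n = 100 versus a look after every 10 subjects.
    print("fixed n = 100:", error_rate(batch=100, max_n=100))
    print("peek every 10:", error_rate(batch=10, max_n=100))
```

The point of the comparison is that the nominal error probability describes the procedure actually followed, not the final data set taken in isolation; that is why an unreported stopping rule matters.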

2. A role for philosophy of science?
I am intrigued that one of the final recommendations in the Report is this:

In the training program for PhD students, the relevant basic principles of philosophy of science, methodology, ethics and statistics that enable the responsible practice of science must be covered. Based on these insights, research Master’s students and PhD students must receive practical training from their supervisors in the application of the rules governing proper and honest scientific research, which should include examples of such undesirable conduct as data massage. The Graduate School must explicitly ensure that this is implemented.

A philosophy department could well create an entire core specialization that revolved around “the relevant basic principles of philosophy of science, methodology, ethics and statistics that enable the responsible practice of science” (ideally linked with one or more other departments). That would be innovative and would fill an important gap, it seems to me. Is anyone doing this?

3. Hanging out some statistical dirty laundry.
Items in their laundry list include:

  • An experiment fails to yield the expected statistically significant results. The experiment is repeated, often with minor changes in the manipulation or other conditions, and the only experiment subsequently reported is the one that did yield the expected results. The article makes no mention of this exploratory method… It should be clear, certainly with the usually modest numbers of experimental subjects, that using experiments in this way can easily lead to an accumulation of chance findings….
  • A variant of the above method is: a given experiment does not yield statistically significant differences between the experimental and control groups. The experimental group is compared with a control group from a different experiment—reasoning that ‘they are all equivalent random groups after all’—and thus the desired significant differences are found. This fact likewise goes unmentioned in the article….
  • The removal of experimental conditions. For example, the experimental manipulation in an experiment has three values. …Two of the three conditions perform in accordance with the research hypotheses, but a third does not. With no mention in the article of the omission, the third condition is left out….
  • The merging of data from multiple experiments [where data] had been combined in a fairly selective way,…in order to increase the number of subjects to arrive at significant results…
  • Research findings were based on only some of the experimental subjects, without reporting this in the article. On the one hand ‘outliers’…were removed from the analysis whenever no significant results were obtained. (Report, 49-50)
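The first item on the list, rerunning a failed experiment and reporting only the run that “worked”, can be made concrete with a short simulation. This is a minimal sketch under stated assumptions, not anything from the Report: 20 subjects per group, a z-test with known unit variance, a 0.05 threshold, and all function names are hypothetical choices for illustration. Both groups are drawn from the same distribution, so there is no real effect to find.

```python
import math
import random

ALPHA = 0.05  # conventional significance threshold (an assumption of this sketch)

def two_sided_p(z):
    """Two-sided p-value for a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2))

def one_experiment(rng, n_per_group=20):
    """Run one null experiment: both groups come from N(0, 1), so any
    'effect' is pure chance. Returns the two-sided z-test p-value."""
    treat = [rng.gauss(0, 1) for _ in range(n_per_group)]
    ctrl = [rng.gauss(0, 1) for _ in range(n_per_group)]
    diff = sum(treat) / n_per_group - sum(ctrl) / n_per_group
    z = diff / math.sqrt(2 / n_per_group)  # known unit variance in each group
    return two_sided_p(z)

def reports_success(rng, max_tries):
    """Repeat the experiment up to max_tries times; 'publish' if any run is significant."""
    return any(one_experiment(rng) < ALPHA for _ in range(max_tries))

def false_positive_rate(max_tries, n_sims=2000, seed=1):
    """Share of simulated researchers who end up with a publishable chance finding."""
    rng = random.Random(seed)
    hits = sum(reports_success(rng, max_tries) for _ in range(n_sims))
    return hits / n_sims

if __name__ == "__main__":
    # Simulated rate next to the theoretical value 1 - (1 - ALPHA)^k.
    for k in (1, 3, 5, 10):
        print(k, round(false_positive_rate(k), 3), round(1 - (1 - ALPHA) ** k, 3))
```

With independent reruns the chance of at least one nominally significant result in k tries is 1 - (1 - 0.05)^k, which is already about 0.23 at k = 5. A reader who is told only about the successful run has no way to see this.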

For many further examples, and also caveats [3], see the Report.

4.  Significance tests don’t abuse science, people do.
Interestingly, the Report distinguishes the above laundry list from “statistical incompetence and lack of interest found” (52). If the methods used were statistical, then the scrutiny might be called “metastatistical” or the full scrutiny “meta-methodological”. Stapel often fabricated data, but the upshot of these criticisms is that sufficient finagling may similarly predetermine that a researcher’s favorite hypothesis gets support. (There is obviously a big advantage in having the data to scrutinize, as many are now demanding). Is it a problem of these methods that they are abused? Or does the fault lie with the abusers? Obviously the latter. Statistical methods don’t kill scientific validity, people do.

I have long rejected dichotomous testing, but the gambits in the laundry list create problems even for more sophisticated uses of methods, e.g., for indicating magnitudes of discrepancy and associated confidence intervals. At least the methods admit of tools for mounting a critique.

In “The Mind of a Con Man” (NY Times, April 26, 2013[4]), Diederik Stapel explains how he always read the research literature extensively to generate his hypotheses. “So that it was believable and could be argued that this was the only logical thing you would find.” Rather than report on believability, researchers need to report the properties of the methods they used: What was their capacity to have identified, avoided, admitted verification bias? The role of probability here would not be to quantify the degree of confidence or believability in a hypothesis, given the background theory or most intuitively plausible paradigms, but rather to check how severely probed or well-tested a hypothesis is, whether the assessment is formal, quasi-formal or informal. Was a good job done in scrutinizing flaws…or a terrible one? Or was there just a bit of data massaging and cherry picking to support the desired conclusion? As a matter of routine, researchers should tell us. Yes, as Joe Simmons, Leif Nelson and Uri Simonsohn suggest in “A 21-word solution”, they should “say it!” No longer inclined to regard their recommendation as too unserious, researchers who are “clean” should go ahead and “clap their hands”[5]. (I will consider their article in a later post…)


*The subtitle is “The fraudulent research practices of social psychologist Diederik Stapel.”

[1] “A ‘byproduct’ of the Committees’ inquiries is the conclusion that, far more than was originally assumed, there are certain aspects of the discipline itself that should be deemed undesirable or even incorrect from the perspective of academic standards and scientific integrity.” (Report 54).

[2] Mere falsifiability, by the way, does not suffice for stringency; but there are also methods Popper rejects that could yield severe tests, e.g., double counting. (Search this blog for more entries.)

[3] “It goes without saying that the Committees are not suggesting that unsound research practices are commonplace in social psychology. …although they consider the findings of this report to be sufficient reason for the field of social psychology in the Netherlands and abroad to set up a thorough internal inquiry into the state of affairs in the field” (Report, 48).

[4] Philosopher Janet Stemwedel discusses the NY Times article, noting that Diederik taught a course on research ethics!

[5] From Simmons, Nelson and Simonsohn:

 Many support our call for transparency, and agree that researchers should fully disclose details of data collection and analysis. Many do not agree. What follows is a message for the former; we begin by preaching to the choir.

Choir: There is no need to wait for everyone to catch up with your desire for a more transparent science. If you did not p-hack a finding, say it, and your results will be evaluated with the greater confidence they deserve.

If you determined sample size in advance, say it.

If you did not drop any variables, say it.

If you did not drop any conditions, say it.

The Fall 2012 Newsletter for the Society for Personality and Social Psychology
Popper, K. 1994, The Myth of the Framework.

14 thoughts on “Some statistical dirty laundry: The Tilburg (Stapel) Report on “Flawed Science””

  1. See also Stapel’s autobiographical/confessional book at

  2. Tom Kepler

    Dear Deborah,

    Your suggestion regarding “the relevant basic principles of philosophy of science, methodology, ethics and statistics that enable the responsible practice of science” is something that I have been thinking about a great deal, spurred, in part, by some of the exchanges I’ve witnessed and participated in on this blog. I am developing a course for the Program in Biomedical Sciences here at the BU School of Medicine that is nominally a statistics course, but which will emphasize the broader connection to scientific method. The syllabus for _Statistical Reasoning for the Basic Biomedical Sciences_ begins

    “Statistics is a key competency in scientific research—never more so than today—but too often is presented in a dry and detached manner, leaving the impression that statistics is an unfortunate but necessary hurdle to clear after the real science is done. In contrast to this view, we will approach the subject from the broader perspective of reasoning under uncertainty as an integral part of scientific research, and statistics as essential formalizations of foundational scientific methods.”

    I am writing the class notes now for the course to be offered in the Spring of 2016. I am currently analyzing several papers from the contemporary biomedical research literature to determine just what it is that the authors are doing when they say they are doing biomedical research. I certainly welcome suggestions from you and from the readers of this blog.

    Best wishes,

  3. Hi Tom! Haven’t heard from you in a while. Your course sounds exciting. I’m afraid it will necessitate going beyond what you’ll find in the vast majority of philosophical and statistical literatures. If there is one thing to stress (on the statistical side) it would be the importance of error probabilities for appraising and controlling how well tested claims are. The first 9 slides of my recent presentation allude to this:
    Techniques with “long-run properties” are of use because of what they can tell us about the case at hand. In short, unless the attitude toward the value of frequentist ideas is changed, going through the usual definitions of probability and competing methods will leave everyone in the dark. (As most are now). If people care about cherry picking, P-hacking, significance seeking, bad experimental design, invalid statistical assumptions, then they need methods that can check and pick up on them (at the very least). Not methods that hide them or declare they are not part of the evidence at all! [Beyond that would be rationales for the methods, and beyond that, ways to improve upon them.]

    My only kvetch with the common concept: “reasoning under uncertainty”, is that it suggests the goal is to capture how unsure we are in some sense (as in How much would you bet?), when statistical inference in science, as I see it, is to control mistaken interpretations of data (taken broadly), and figure out how capable a given method is, or is not, for the job of uncovering deceptions and biases. (Toward the end of those slides is an example of “bending over backwards” to self-criticize. Better than reporting on negative results would be publishing work on how not to go about studying such and such employing such and such data and instruments.) This is all too brief, obviously; we should talk more.

    I hope this means you were awarded some swell grant!

  4. Tom: I am going to agree with Mayo here in that it will be an insurmountable opportunity, and there are so few with actual experience and understanding in philosophy and statistical practice involving varied empirical studies that you may well be pioneering. (I think it requires enough math to comfortably represent things that are abstract realistically, enough of an understanding of the process of scientific inquiry [how to best get less wrong] and a lot of experience doing realistic empirical studies.)

    A couple past experiences.

    Auditing Ian Hacking’s course at the University of Toronto 1996/7 and proof reading his book An Introduction to Probability and Inductive Logic. The students were a mix of philosophy and statistics undergrads and grad students. They were kept busy for the term and learned about induction abstractly. So Mayo’s comment “the usual definitions of probability and competing methods will leave everyone in the dark” will apply for what goes on in actual empirical work.

    The other was giving a similar course to yours – Scientific Inquiry (1986) Faculty of Medicine, University of Toronto. The students were rehab med undergrads (very bright actually) and it was a one term course spread over two terms (which helped).

    The first half was on scientific method in terms of why and how studies were done, why groups to be compared are randomized (if possible), patients blinded to treatment, outcome assessors blinded, etc., and how studies should be reported and then critically read and assessed when published and systematically reviewed (aka meta-analysis). The major assignment was to quality score published papers from their journals. The students picked papers their faculty members had published and the highest score out of 10 was a 2 or 3. The students really enjoyed this part (learning from others’ mistakes) and the Dean told me they thought I had become the most popular faculty member.

    The next term was statistics and I tried to keep it conceptual and used a lot of bootstrap ideas, but the students just hated it. This material was assessed on an exam and most did very well on it (and I had not made it easy: they had to understand type one error, power, false positives, false negatives, differing importance by context), but they hated it.

    Unfortunately, I only got to give the course once: a new faculty member had been hired in the Rehab department who did not know much about statistics but wanted to learn more about it, so the new Dean gave the course to them.

    Overall, I think “learning from others’ mistakes” does offer the best route forward.

    Best of luck.

    Keith O’Rourke

    • Tom: I want to be clear that I don’t see it as a herculean undertaking in the least. I’ve taught dozens of courses of the sort (though not with biology as focus), including one I developed when I was 50% in Economics: Philosophy of Science and Economic Methodology. If we talk some time, I can send you syllabi (some are on this blog, but mostly graduate).
      For general philo of science that’s also up to date: Alan Chalmers’s What Is This Thing Called Science?, and Kent Staley’s brand new intro to phil sci (that also includes some error statistics along with Bayesian philosophy). Ideally you’d do it with a philosopher. Maybe I should come out there?

  5. I do think “and statistics that enable the responsible practice of science” is a herculean undertaking but maybe the most challenging part is in the statistics.

    The use of toy examples with overly convenient assumptions is acceptable in (initial) philosophical discussions and introductory statistical courses but it simply is not adequate for research in medical sciences. Here even with the simplest two group randomized study with binary outcomes, nuisance parameters have to be understood and dealt with (due to limited sample sizes.) Non-randomized epidemiology studies are far more difficult to understand and deal with and require a good grasp of confounding and distinguishing intrinsic versus apparent similarity of groups being compared.

    Choosing to focus on the scientific method with toy statistical examples with overly convenient assumptions might be wise for a single term course. It would have made the course I gave much easier to do and much more enjoyable for the students.

    An approach worth considering might be to enable responsible interactions with well trained statisticians including how to choose _and_ audit them. Auditing them will require some understanding of the role of statistics in enabling the practice of science. That would be my preference for what to try to teach beyond the toy examples with overly convenient assumptions.

    (I’ll search for the syllabi on your blog but it would be nice to see one for undergrads.)

    Keith O’Rourke

    • The reason that the majority of foundational discussions in stat have evoked simple examples is that so many deep problems arise in regards to them that if one cannot be clear on those, it’s pointless to go into more complex cases. Another reason is to get at the logic of tests. I was reading J. Berger making just this point yesterday. So at least we agree there. I also think that if a person who plans to get into statistical practice, with complex examples, gets clear on the logic of simple cases, then they’ll be able to extend the logic to cover them. Don Fraser is correct (in his quick and dirty confidence paper) to argue that beyond the location example we often use, the Bayesian results don’t sustain the “confidence concept”, but the interesting thing is that Bayesians debate frequentists on just those examples where there could be agreement. Consider the whole “p-values exaggerate the evidence” charge, even with one sided tests no less! Aside from the classic “rigged” examples, the entire Bayesian-frequentist-likelihoodist debates over the years have concerned simple cases. There is a tendency for Bayesian texts to start with simple examples of applying probability to events—of the sort that frequentists would also have no trouble assigning—and then jump to invoking all kinds of (usually conventionality) priors to parameters, as if there’s nothing more questionable than with ordinary events. I have found (in the book I’m currently finishing) that so many of the debates back and forth concern basic cases, and even Bayesians don’t agree amongst themselves about them. (See recent Senn blog posts, and my likelihood posts).

      Anyway, I’m not sure what kind of course on scientific method you have in mind. I like the idea of “auditing” statisticians. An entire course could revolve around what that should require. What happens when you have one side that says auditing should involve checking if their tests have good error probabilities (or whether they are vitiated by biasing selection effects) and the other side says that such things are irrelevant to the assessment of data, and could only matter to contexts where there will be a long run sequence of repetitions. What if one side says, you must audit the model assumptions, and the other says no model checks. What if one side says statistical inference must provide a probabilist assignment to statistical hypotheses (so frequentist methods could only be used by misinterpretation), and the other says they haven’t a clue what posteriors to hypotheses even mean if the goal is evidential assessment rather than subjective degree of belief. I could go on and on.

      The real challenges are: (a) combining a genuine philosophical sensibility with the statistical background (plus some knowledge of the philosophy and history of the foundational debates), and (b) avoiding bias according to the perspective of the instructor’s preferred philosophy. The best would be to jointly teach such a class with a philosopher and a statistician.

      • I think there is a lot we agree about but probably best not to make that claim.

        I am not disagreeing with the claim that most arguments in statistical logic can be illustrated with simple examples (in fact, Jim Berger told me he purposely looked for such for his LP book and some of Don Fraser’s favorite examples are very simple – two sample from Uniform[u-1,u+1]) but experience with just such examples helps little in learning how to deal adequately with the complexity in any actual research.

        Also it’s mainly the too convenient assumptions I am worried about, and for instance it was Don Fraser who was the first to convince me that dealing with nuisance parameters is critical and those problems are often hidden by convenient assumptions. (Many statisticians were rascals by finding and unduly focusing on these.)

        > (a) combining a genuine philosophical sensibility with the statistical background (plus some knowledge of the philosophy and history of the foundational debates)

        To me it should be (a) combining a genuine statistical sensibility with the philosophy of science background (some knowledge of the philosophy and history of science) – which underlines your point b and the desirability of a class with a philosopher and a statistician.

        Your syllabus looks fine for a graduate course to me if at least one third was spent on “Statistics and Scientific Integrity” rather than just one session. For someone else giving a graduate course – it likely is too much Mayo 😉

        • Maybe Tom will tell us more as to what he has in mind. I always include some philosophy of science, e.g., Popper. With luck my book will be out by then (How to Tell What’s True About Statistical Inference).

    • PhaneronO: This was just a recent version of a type of class I’ve done in myriad different ways–usually with a heavier dose of philosophy, but these students all had philosophy.

      You make it sound as if “the scientific method” is something well understood and about which there is clear agreement. There isn’t! You make it sound as if there’s agreement on the alleged “toy” examples, but there isn’t. The entire philosophy and history of statistics has been built upon a handful of examples, and they are used again and again by everyone. In the above course, we did try to focus on issues arising in current problems of reproducibility, with some guest lectures, but I agree that 2 semesters would be best.

      • Tom Kepler

        Dear Deborah,

        First, my apologies for not following the replies in a more timely manner. It still surprises me, returning to find a whole thread to catch up on.

        I fully agree that one has to go beyond available resources in the philosophical and statistical literatures to really do the topic justice, and I am prepared to do so to the extent that I am able and to seek help to the extent that I am not.

        My principal aim in this course is to make my students more successful in the practice of basic biomedical science, and, in so doing, to make the larger pursuit of basic biomedical science more efficient. Relative efficiency may seem prosaic in contrast to the discovery of fundamental truths in, say, physics, but somebody’s got to do it.

        I am particularly interested in following up on your statement that, “statistical inference in science, as I see it, is to control mistaken interpretations of data (taken broadly), and figure out how capable a given method is, or is not, for the job of uncovering deceptions and biases.”

        I agree to a large extent, but think that in the case of biomedical science there are other, more seemingly direct, ways to measure success: how many lives have been saved, how much suffering relieved? (Or, a bit less naively, how much money has been made in the health sector? Just kidding.) I am leaning toward a perspective in which the role of statistics in biomedical science is simply to increase efficiency. Deceptions and bias will always be revealed eventually; statistics done properly shortens the time spent chasing erroneous leads but also ensures that promising leads are in play. I am genuinely struggling over this point, and would indeed look forward to the opportunity to discuss it at greater length with you, either in Boston or in Blacksburg. Let’s make plans.

        Finally, and alas: This new mission does not signal the receipt of a new grant. I had been thinking about applying for support to develop a very short “module” on reproducibility but ultimately felt that a real face-to-face course would be most effective. No money and much more time. The things we do for love.


        • Hi Tom: I think I didn’t (and don’t) quite understand your intended class. You wrote in response to those recommendations in the Stapel report, so I assumed your interest was in exploring some of the foundational issues in statistics that pertain to poor replicability, and especially to “reforms” that are intended to help.
          The assessment I was alluding to (in terms of the questions: How well tested? How self-critical?) is aimed at specific studies or research efforts and their associated methodologies. But you say you consider a “more direct” measure to be How many lives saved? How much suffering relieved? I don’t see how one can measure such things, or criticize methods or inferences that way. So clearly you have something very different in mind.
          Now as far as increasing ‘efficiency’. How to shorten the time spent chasing erroneous leads? Well it would help to criticize those “leads” that haven’t bent over backwards to self-correct, but instead, like Potti, promote alleged “success stories” based on wishful thinking rather than following norms for replication.
          Yes, we should talk.

        • Tom:
          > role of statistics in biomedical science is simply to increase efficiency
          Likely the best way to do that is “to control mistaken interpretations of data (taken broadly), and figure out how capable a given method is, or is not, for the job of uncovering deceptions and biases.”

          And “criticize those “leads” that haven’t bent over backwards to self-correct”.

          My perception/idealization of your course should be to enable the [students for] responsible practice of science. Seems not too different from yours? For this they would need to get something more than experience with the simple examples used to explicate philosophical issues.

          Now, my overall sense of the Report’s phrase “and statistics that enable the responsible practice of science” is actually “we will make our students eat cake”, but reasonable gains are possible.

          Getting across that science is a never ending struggle to get less wrong, or even aspiring to accelerate the process of getting less wrong (my way of saying what I think Mayo is saying, or actually it’s CS Peirce’s way), would be a good start. Enabling responsible interactions with well trained statisticians, including how to choose _and_ audit them, might be feasible and worth attempting, but I believe that would be pioneering.

          Keith O’Rourke

