Richard Gill: “Integrity or fraud… or just questionable research practices?” (Is Gill too easy on them?)

Professor Gill

Professor Richard Gill
Statistics Group
Mathematical Institute
Leiden University

It was statistician Richard Gill who first told me about Diederik Stapel (see an earlier post on Diederik). We were at a workshop on Error in the Sciences at Leiden in 2011. I was very lucky to have Gill be assigned as my commentator/presenter—he was excellent! As I was explaining some data problems to him, he suddenly said, “Some people don’t bother to collect data at all!” That’s when I learned about Stapel.

Committees often turn to Gill when someone’s work is under scrutiny for bad statistics or fraud, or anything in between. Do you think he’s being too easy on researchers when he says, about a given case:

“data has been obtained by some combination of the usual ‘questionable research practices’ [QRPs] which are prevalent in the field in question. Everyone does it this way, in fact, if you don’t, you’d never get anything published. …People are not deliberately cheating: they honestly believe in their theories and believe the data is supporting them.”

Isn’t that the danger in relying on deeply felt background beliefs?  Have our attitudes changed (toward QRPs) over the past 3 years (harsher or less harsh)? Here’s a talk of his I blogged 3 years ago (followed by a letter he allowed me to post). I reflect on the pseudoscientific nature of the ‘recovered memories’ program in one of the Geraerts et al. papers in a later post.

I certainly have been thinking about these issues a lot in recent months. I got entangled in intensive scientific and media discussions – mainly confined to the Netherlands  – concerning the cases of social psychologist Dirk Smeesters and of psychologist Elke Geraerts.  See:

And I recently got asked to look at the statistics in some papers of another … [researcher] ..but this one is still confidential ….
The verdict on Smeesters was that he, like Stapel, actually faked data (though he still denies this).

The Geraerts case is very much open, very much unclear. The senior co-authors, Merckelbach and McNally, of the attached paper, published in the journal “Memory”, have asked the journal editors to withdraw it because they suspect the lead author, Elke Geraerts, of improper conduct. She denies any impropriety. It turns out that none of the co-authors have the data. Legally speaking it belongs to the University of Maastricht, where the research was carried out and where Geraerts was a promising postdoc in Merckelbach’s group. She later got a chair at Erasmus University Rotterdam and presumably has the data herself, but refuses to share it with her old co-authors or any other interested scientists. Just looking at the summary statistics in the paper one sees evidence of “too good to be true”. Average scores in groups supposed in theory to be similar are much closer to one another than one would expect on the basis of the within-group variation (the paper reports averages and standard deviations for each group, so it is easy to compute the F statistic for equality of the three similar groups and use its left tail probability as the test statistic).
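As a rough illustration of the kind of check described above, here is a minimal sketch of computing a one-way F statistic from summary statistics alone and then taking its left tail probability. It is not Gill's actual computation: the function names and the example numbers are hypothetical, it assumes normal data with a common within-group variance, and the closed-form tail formula is valid only when exactly three groups are compared (numerator degrees of freedom equal to 2).

```python
import math


def f_statistic_from_summaries(means, sds, ns):
    """One-way ANOVA F statistic from per-group means, SDs and sample sizes."""
    k = len(means)
    n_total = sum(ns)
    # Weighted grand mean of all observations.
    grand_mean = sum(n * m for n, m in zip(ns, means)) / n_total
    # Between-group and within-group sums of squares.
    ss_between = sum(n * (m - grand_mean) ** 2 for n, m in zip(ns, means))
    ss_within = sum((n - 1) * s ** 2 for n, s in zip(ns, sds))
    df_between = k - 1
    df_within = n_total - k
    f = (ss_between / df_between) / (ss_within / df_within)
    return f, df_between, df_within


def left_tail_three_groups(f, df_within):
    """P(F <= f) for an F(2, df_within) distribution.

    Closed form that holds only when the numerator df is 2,
    i.e. when exactly three groups are compared.
    """
    return 1.0 - (1.0 + 2.0 * f / df_within) ** (-df_within / 2.0)


# Hypothetical summary statistics, NOT taken from any of the papers discussed.
means = [10.0, 10.1, 9.9]
sds = [3.0, 3.0, 3.0]
ns = [30, 30, 30]

f, df_b, df_w = f_statistic_from_summaries(means, sds, ns)
p_left = left_tail_three_groups(f, df_w)
print(f"F({df_b}, {df_w}) = {f:.4f}, left tail probability = {p_left:.4f}")
```

A small left tail probability says the group averages are closer together than chance variation would usually produce. On its own it is only a flag for further scrutiny, not proof of anything.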

The same phenomenon turns up in another unpublished paper by the same authors and moreover in one of the papers contained in Geraerts’s (Maastricht) thesis. I attach the two papers published in Geraerts’s thesis which present results in very much the same pattern as the disputed “Memory” paper. Four groups of subjects, three supposed in theory to be rather similar, one expected to be strikingly different. In one of the two, just as in the Memory paper, the average scores of the three similar groups are much closer to one another than one would expect on the basis of the within-groups variation.

I got involved in the quarrel between Merckelbach and Geraerts, which was being fought out in the media, so various science journalists also consulted me about the statistical issues. I asked Geraerts if I could have the data of the Memory paper so that I could carry out distribution-free versions of the statistical tests of “too good to be true” which are easy to perform if you just have the summary statistics. She claimed that I had to get permission from the University of Maastricht. At some point the presidents of both Maastricht and Erasmus University were involved, and presumably their legal departments too. Finally I got permission and arranged a meeting with Geraerts where she was going to tell me “her side of the story” and give me the data, and we would look at my analyses together. Merckelbach and his other co-authors all enthusiastically supported this too, by the way. However, at the last moment the chair of her department at Erasmus University got worried and stepped in, and now an internal Rotterdam (=Erasmus) committee is investigating the allegations and Geraerts is not allowed to give anyone the data or talk to anyone about the problem.

I think this is totally crazy. First of all, the data set should have been made public years ago. Secondly, the fact that the co-authors of the paper never even saw the data themselves is a sign of poor research practices. Thirdly, getting university lawyers and having high-level university ethics committees involved does not further science. Science is furthered by open discussion. Publish the data, publish the criticism, and let the scientific community come to its own conclusion. Hold a workshop where different points of view are presented about what is going on in these papers, where statisticians and psychologists communicate with one another.

Probably, Geraerts’s data has been obtained by some combination of the usual “questionable research practices” which are prevalent in the field in question. Everyone does it this way, in fact, if you don’t, you’d never get anything published: sample sizes are too small, effects are too small, noise is too large. People are not deliberately cheating: they honestly believe in their theories and believe the data is supporting them and are just doing the best to make this as clear as possible to everyone.


PS summary of my investigation of the papers contained in Geraerts’s PhD thesis:

ch 8 Geraerts et al 2006b BRAT Long term consequences of suppression of intrusive anxious thoughts and repressive coping.

ch 9 Geraerts et al 2006 AJP Suppression of intrusive thoughts and working memory capacity in repressive coping.

These two chapters show the pattern of four groups of subjects, three of which are very similar, while the fourth is strikingly different with respect to certain (but not all) responses.

In the case of chapter 8, the groups which are expected to be similar are (just as in the already disputed Memory and JAb papers) actually much too similar! The average scores are closer to one another than one can expect on the basis of the observed within-group variation (1 over square root of N law).

In the case of chapter 9, nothing odd seems to be going on. The variation between the average scores of similar groups of subjects is just as big as it ought to be, relative to the variation within the groups.
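For readers who want the “1 over square root of N law” spelled out, here is a short sketch. It assumes, for simplicity, that the k similar groups share the same true mean, the same variance \sigma^2 and the same size N; these simplifications are made here for illustration and are not stated in the letter.

```latex
\operatorname{Var}(\bar X_j) = \frac{\sigma^2}{N},
\qquad
\operatorname{SD}(\bar X_j) = \frac{\sigma}{\sqrt{N}},
\qquad
F = \frac{N \, s^2_{\bar X}}{s^2_{\text{within}}} \approx 1
```

Here s^2_{\bar X} is the sample variance of the k group averages and s^2_{\text{within}} is the pooled within-group variance. If the groups really are alike, the averages should scatter with standard deviation about \sigma/\sqrt{N}, so F should be near 1. An F far below 1 means the averages are closer together than that, which is the “too similar” pattern described for chapter 8.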

Geraerts et al (2008 Memory pdf). “Recovered memories of childhood sexual abuse: Current findings and their legal implications” Legal and Criminological Psychology 13, 165–176





Categories: 3-year memory lane, junk science, Statistical fraudbusting, Statistics

4 thoughts on “Richard Gill: “Integrity or fraud… or just questionable research practices?” (Is Gill too easy on them?)”

  1. On the one hand, people are much more aware of how QRPs blow up statistical significance and lead to spurious p-values; on the other, they’re much more inclined to blame the tests rather than the abusers. Criticism of QRPs is more widespread, but Gill’s suggestion that they can’t be helped in some fields seems to have grown as well. To block that, one would have to be prepared to relegate some current fields to the realms of “for entertainment only” or non-science.

  2. Steven McKinney

    When the Emperor is wearing no clothes, someone has to stand up and say that the Emperor is wearing no clothes.

    Just because everyone in the profession is doing it, doesn’t make it science. This type of phenomenon requires eternal vigilance. The problem with science is that it is conducted by people, and people are messy entities, subject to all kinds of biases and peer pressures.

    You are right to label such fields as “for entertainment only”. Psychology currently seems rife with such examples, and when “experts” in such a field opine to leaders and military officials, we end up with a useless torture program, so such pseudoscience can and does end up producing terrible outcomes. We do need to keep calling out the sloppy implementation of programs that wish to be considered as scientific. Intelligent Design easily comes to mind. I also believe much of Economics falls into this pseudoscientific category.

    Thank you for your vigilant reviews of such matters. Please keep it up!

  3. The hard truth is: the vast majority of statistics in such fields as social psychology is a “questionable research practice”, simply because any possible type of statistical inference is heavily based on randomness of samples, and in the social sciences we almost never have a true random sample.

    So it would be better just to stop pretending we can estimate such things as p-values when we really cannot, because our samples are NOT random (and what is worse, it is usually very hard or even impossible to identify the directions in which a particular sample is biased, and by how much. We only know, almost always, that it is biased and not random).

    So almost everything you are talking about here, if you are talking about stats in the social sciences, really just does not make any sense. There is no such thing as good statistical inference for non-random samples with some unidentified systematic shift. Also, it is not possible to do really reproducible research on social human behavior, because the inner state of the subjects is one of the parameters and we can never control it. So any time we try to replicate some social psychology experiment we really have different conditions, not the same conditions, and we can never have truly identical conditions.

    The truth is short and simple: we can not estimate statistical significance in such kind of data.
    Just stop fooling people by pretending we can.

    • Kogdata:
      I too am skeptical of alleged findings in many social sciences, but not for the reasons you give. We don’t have to have identical conditions in order to replicate a statistical effect. Were that true, we would never be able to have evidence for the effectiveness of drugs or most any other area where mere statistical regularities, not universal laws, are the best we can do. In design-based RCTs, if done correctly, it’s possible to say something about the probability of obtaining differences as large as observed by dint of the randomized assignment alone. That’s the basis for the p-value in RCTs. Members of experimental treatment-control trials aren’t usually representative of the full population, nor need they be, in order to identify genuine effects for the experimental group at least. Granted, that means more is required in order to generalize beyond.
      The main problems with the artificial experiments in fields like social psychology, even if we imagine no QRPs–a big if– concern doubts that they’re studying/measuring what they purport to be measuring.
