Shortly afterwards, the Retraction Watch blog posted a (leaked?) copy of an internal report that set out the accusations against Förster.
The report, titled Suspicion of scientific misconduct by Dr. Jens Förster, is anonymous and dated September 2012. Reportedly it came from a statistician(s) at Förster’s own university. It relates to three of Förster’s papers, including the one that the University says should be retracted, plus two others.[Below is the abstract from Retraction Watch].
Here we analyze results from three recent papers (2009, 2011, 2012) by Dr. Jens Förster from the Psychology Department of the University of Amsterdam. These papers report 40 experiments involving a total of 2284 participants (2242 of which were undergraduates). We apply an F test based on descriptive statistics to test for linearity of means across three levels of the experimental design. Results show that in the vast majority of the 42 independent samples so analyzed, means are unusually close to a linear trend. Combined left-tailed probabilities are 0.000000008, 0.0000004, and 0.000000006, for the three papers, respectively. The combined left-tailed p-value of the entire set is p= 1.96 * 10-21, which corresponds to finding such consistent results (or more consistent results) in one out of 508 trillion (508,000,000,000,000,000,000). Such a level of linearity is extremely unlikely to have arisen from standard sampling. We also found overly consistent results across independent replications in two of the papers. As a control group, we analyze the linearity of results in 10 papers by other authors in the same area. These papers differ strongly from those by Dr. Förster in terms of linearity of effects and the effect sizes. We also note that none of the 2284 participants showed any missing data, dropped out during data collection, or expressed awareness of the deceit used in the experiment, which is atypical for psychological experiments. Combined these results cast serious doubt on the nature of the results reported by Dr. Förster and warrant an investigation of the source and nature of the data he presented in these and other papers.
Read the whole report here.
A vigorous discussion of the allegations has been taking place in this Retraction Watch comment thread. The identity and motives of the unknown accuser(s) are one main topic of debate; another is whether Förster’s inability to produce raw data and records relating the studies is suspicious or not.
The actual accusations have been less discussed, and there’s a perception that they are based on complex statistics that ordinary psychologists have no hope of understanding. But as far as I can see, they are really very simple – if poorly explained in the report – so here’s my attempt to clarify the accusations.
First a bit of background.
In the three papers in question, Forster reported a large number of separate experiments. In each experiment, participants (undergraduate students) were randomly assigned to three groups, and each group was given a different ‘intervention’. All participants were then tested on some outcome measure.
In each case, Förster’s theory predicted that one of the intervention groups would test low on the outcome measure, another would be medium, and another would be high (Low < Med < High).
Generally the interventions were various tasks designed to make the participants pay attention to either the ‘local’ or the ‘global’ (gestalt) properties of some visual, auditory, smell or taste stimulus. Local and global formed the low and high groups (though not always in that order). The Medium group either got no intervention, or a balanced intervention with neither a local nor global emphasis. The outcome measures were tests of creative thinking, and others.
The headline accusation is that the results of these experiments were too linear: that the mean outcome scores of the three groups, Low, Medium, and High, tended to be almost evenly spaced. That is to say, the difference between the Low and Medium group means tended to be almost exactly the same as the difference between the Medium and High means.
The report includes six montages, each showing graphs of from one batch of the experiments. Here’s my meta-montage of all of the graphs:
This montage is the main accusation in a nutshell: those lines just seem too good to be true. The trends are too linear, too ‘neat’, to be real data. Therefore, they are… well, the report doesn’t spell it out, but the accusation is pretty clear: they were made up.
The super-linearity is especially stark when you compare Förster’s data to the accuser’s ‘control’ sample of 21 recently published, comparable results from the same field of psychology:
Using a method they call delta-F, the accusers calculated the odds of seeing such linear trends, even assuming that the real psychological effects were perfectly linear. These odds came out as 1 in 179 million, 1 out of 128 million, and 1 out of 2.35 million in each of the three papers individually.
Combined across all three papers, the odds were one out of 508 quintillion: 508,000,000,000,000,000,000. (The report, using the long scale, says 508 ‘trillion’ but in modern English ‘trillion’ refers to a much smaller number.)
So the accusers say
Thus, the results reported in the three papers by Dr. Förster deviate strongly from what is to be expected from randomness in actual psychological data.
Unless the sample size is huge, a perfectly linear observed result is unlikely, even assuming that the true means of the three groups are linearly spaced. This is because there is randomness (‘noise’) in each observation. This noise is measurable as the variance in the scores within each of the three groups.
For a given level of within-group variance, and a given sample size, we can calculate the odds of seeing a given level of linearity in the following way.
delta-F is defined as the difference in the sum of squares accounted for by a linear model (linear regression) and a nonlinear model (one-way ANOVA), divided by the mean squared error (within-group variance.) The killer equation from the report:
If this difference is small, it means that a nonlinear model can’t fit the data any better than a linear one – which is pretty much the definition of ‘linear’.
Assuming that the underlying reality is perfectly linear (independent samples from three distributions with evenly spaced means), this delta-F metric should follow what’s known as an F distribution. We can work out how likely a given delta-F score is to occur, by chance, given this assumption, i.e. we can convert delta-F scores to p-values.
Remember, this is assuming that the underlying psychology is always linear. This is almost certainly implausible, but it’s thebest possible assumption for Förster. If the reality were nonlinear, the odds of getting low delta-F scores would be evenmore unlikely.
The delta-F metric is not new, but the application of it is (I think). Delta-F is a case of the well-known use of F-tests to compare the fit of two statistical models. People normally use this method to see whether some ‘complex’ model fits the data significantly better than a ‘simple’ model (the null hypothesis). In that case, they are looking to see if Delta-F is high enough to be unlikely given the null hypothesis.
But here the whole thing is turned on its head. Random noise means that a complex model will sometimes fit the data better than a simple one, even if the simple model describes reality. In a conventional use of F-tests, that would be regarded as a false positive. But in this case it’s the absence of those false positives that’s unusual.
I’m not a statistician but I think I understand the method (and have bashed together some MATLAB simulations). I find the method convincing. My impression is that delta-F is a valid test of non-linearity and ‘super-linearity’ in three-group designs.
I have been trying to think up a ‘benign’ scenario that could generate abnormally low delta-F scores in a series of studies. I haven’t managed it yet.
But there is one thing that troubles me. All of the statistics above operate on the assumption that data are continuously distributed. However, most of the data in Förster’s studies were categorical i.e. outcome scores were fixed to be (say) 1 2 3 4 or 5, but never 4.5, or any other number.
Now if you simulate categorical data (by rounding all numbers to the nearest integer), the delta-F distribution starts behaving oddly. For example given the null hypothesis, the p-curve should be flat, like it is in the graph on the right. But with rounding, it looks like the graph on the left:
The p-values at the upper end of the range (i.e. at the end of the range corresponding to super-linearity) start to ‘clump’.
The authors of the accusation note this as well (when I replicated the effect, I knew my simulations were working!). They say that it’s irrelevant because the clumping doesn’t make the p-values either higher or lower on average. The high and low clumps average out. My simulations also bear this out: rounding to integers doesn’t introduce bias.
However, a p-value distribution just shouldn’t look like that, so it’s still a bit worrying. Perhaps, if some additional constraints and assumptions are added to the simulations, delta-F might become not just clumped, but also biased – in which case the accusations would fall apart.
Perhaps. Or perhaps the method is never biased. But in my view, if Förster and his defenders want to challenge the statistics of the accusations, this is the only weak spot I can see. Förster’s career might depend on finding a set of conditions that skew those curves.
UPDATE 8th May 2014: The findings of the Dutch scientific integrity commission, LOWI, on Förster, have been released. English translation here. As was already known, LOWI recommended the retraction of the 2012 paper, on grounds that the consistent linearity was so unlikely to have occured by chance that misconduct seems likely. What’s new in the report, however, is the finding that the superlinearity was not present when male and female participants were analysed seperately. This is probably the nail in the coffin for Förster because it shows that there is nothing inherent in the data that creates superlinearity (i.e. it is not a side effect of the categorical data, as I speculated it might be.) Rather, both male and female data show random variation but they always seem to ‘cancel out’ to produce a linear mean. This is very hard to explain in a benign way.
[i]This doesn’t mean that fraud busting charges, even those that rise to a level of concern, should not also be critically evaluated. On the contrary, it is crucial that they be scrupulously criticized.
[ii]Warning to fans of Nate Silver,he shows the greatest disrespect for and misunderstanding of R.A. Fisher and significance tests: “Fisher and his contemporaries …sought to develop a set of statistical methods that they hoped would free us from any possible contamination from bias…[T]he frequentist methods–in striving for immaculate statistical procedures that can’t be contaminated by the researchers’s bias–keep him hermetically sealed off from the real-world.” (Silver 2012, 252-3). Fisher designed methods, relied on to this day, to detect and unearth bias based on understanding of how they arise and, with care, may be controlled and/or discerned objectively. Where does his “immaculate conception” come from? Silver does a great disservice to Fisher and fraud busting (e.g., in his 2012 “The Signal and the Noise”, pp.250-255). I hope he will correct his perception.