Power Analysis and Non-Replicability: If bad statistics is prevalent in your field, does it follow you can’t be guilty of scientific fraud?

Posted on January 18, 2015 by Mayo

If questionable research practices (QRPs) are prevalent in your field, then apparently you can’t be guilty of scientific misconduct or fraud (by mere QRP finagling), or so some suggest. Isn’t that an incentive for making QRPs the norm?

The following is a recent blog discussion (by Ulrich Schimmack) on the Jens Förster scandal; I thank Richard Gill for alerting me. I haven’t fully analyzed Schimmack’s arguments, so please share your reactions. I agree with him on the importance of power analysis, but I’m not sure that the way he’s using it (via his “R index”) shows what he claims. Nor do I see how any of this invalidates, or spares Förster from, the fraud allegations along the lines of Simonsohn[i]. Most importantly, I don’t see that cheating one way vs another changes the scientific status of Förster’s flawed inference. Förster already admitted that, faced with unfavorable results, he’d always find ways to fix things until he got results in sync with his theory (on the social psychology of creativity priming). Fraud by any other name.

The official report, “Suspicion of scientific misconduct by Dr. Jens Förster,” is anonymous and dated September 2012. An earlier post on this blog, “Who ya gonna call for statistical fraud busting” featured a discussion by Neuroskeptic that I found illuminating, from Discover Magazine: On the “Suspicion of Scientific Misconduct by Jens Förster.” Also see Retraction Watch.

Does anyone know the official status of the Förster case?

“How Power Analysis Could Have Prevented the Sad Story of Dr. Förster”

From Ulrich Schimmack’s “Replicability Index” blog January 2, 2015. A January 14, 2015 update is here. (occasional emphasis in bright red is mine)

Background

In 2011, Dr. Förster published an article in Journal of Experimental Psychology: General. The article reported 12 studies and each study reported several hypothesis tests. The abstract reports that “In all experiments, global/local processing in 1 modality shifted to global/local processing in the other modality”.

For a while this article was just another article that reported a large number of studies that all worked and neither reviewers nor the editor who accepted the manuscript for publication found anything wrong with the reported results.

In 2012, an anonymous letter voiced suspicion that Jens Förster had violated rules of scientific conduct. The allegation led to an investigation, but as of today (January 1, 2015) there is no satisfactory account of what happened. Jens Förster maintains that he is innocent (5b. Brief von Jens Förster vom 10. September 2014) and blames the accusations about scientific misconduct on a climate of hypervigilance after the discovery of scientific misconduct by another social psychologist.

The Accusation

The accusation is based on an unusual statistical pattern in three publications. The 3 articles reported 40 experiments with 2284 participants, that is, an average sample size of N = 57 participants in each experiment. The 40 experiments all had a between-subject design with three groups: one group received a manipulation designed to increase scores on the dependent variable. A second group received the opposite manipulation to decrease scores on the dependent variable. And a third group served as a control condition with the expectation that the average of the group would fall in the middle of the two other groups. To demonstrate that both manipulations have an effect, both experimental groups have to show significant differences from the control group.

The accuser noticed that the reported means were unusually close to a linear trend. This means that the two experimental conditions showed markedly symmetrical deviations from the control group. For example, if one manipulation increased scores on the dependent variable by half a standard deviation (d = +.5), the other manipulation decreased scores on the dependent variable by half a standard deviation (d = -.5). Such a symmetrical pattern can be expected when the two manipulations are equally strong AND WHEN SAMPLE SIZES ARE LARGE ENOUGH TO MINIMIZE RANDOM SAMPLING ERROR. However, the sample sizes were small (n = 20 per condition, N = 60 per study). These sample sizes are not unusual and social psychologists often use n = 20 per condition to plan studies. However, these sample sizes have low power to produce consistent results across a large number of studies.

The accuser computed the statistical probability of obtaining the reported linear trend. The probability of obtaining the picture-perfect pattern of means by chance alone was incredibly small.

Based on this finding, the Dutch National Board for Research Integrity (LOWI) started an investigation of the causes for this unlikely finding. An English translation of the final report was published on Retraction Watch. An important question was whether the reported results could have been obtained by means of questionable research practices or whether the statistical pattern can only be explained by data manipulation. The English translation of the final report includes two relevant passages.

According to one statistical expert “QRP cannot be excluded, which in the opinion of the expert is a common, if not “prevalent” practice, in this field of science.” This would mean that Dr. Förster acted in accordance with scientific practices and that his behavior would not constitute scientific misconduct.

Mayo: Note the language: “acted in accordance with”. Not even “acted in a way that, while leading to illicit results, is not so very uncommon in this field, so may not rise to the level of scientific misconduct”. With this definition, there’s no misconduct with Anil Potti and a number of other apparent ‘frauds’ either.

In response to this assessment the Complainant “extensively counters the expert’s claim that the unlikely patterns in the experiments can be explained by QRP.” This led to the decision that scientific misconduct occurred.

Four QRPs were considered.

Improper rounding of p-values. This QRP can only be used rarely when p-values happen to be close to .05. It is correct that this QRP cannot produce highly unusual patterns in a series of replication studies. It can also be easily checked by computing exact p-values from reported test statistics.

Selecting dependent variables from a set of dependent variables. The articles in question reported several experiments that used the same dependent variable. Thus, this QRP cannot explain the unusual pattern in the data.

Collecting additional research data after an initial research finding revealed a non-significant result. This description of a QRP is ambiguous. Presumably it refers to optional stopping: when the data trend in the right direction, data collection continues with repeated checking of p-values and stops when the p-value is significant. This practice leads to random variation in sample sizes. However, studies in the reported articles all have more or less 20 participants per condition. Thus, optional stopping can be ruled out. However, if a condition with 20 participants does not produce a significant result, it could simply be discarded, and another condition with 20 participants could be run. With a false-positive rate of 5%, this procedure will eventually yield the desired outcome while holding sample size constant. It seems implausible that Dr. Förster conducted 20 studies to obtain a single significant result. Thus, it is more plausible that the effect is actually there, but that studies with n = 20 per condition have low power. If power were just 30%, the effect would be significant in roughly every third study, and only 60 participants were used to produce significant results in one out of three studies. The report provides insufficient information to rule out this QRP, although it is well-known that excluding failed studies is a common practice in all sciences.

Selectively and secretly deleting data of participants (i.e., outliers) to arrive at significant results. The report provides no explanation of how this QRP can be ruled out as an explanation. Simmons, Nelson, and Simonsohn (2011) demonstrated that conducting a study with 37 participants and then deleting data from 17 participants can contribute to a significant result when the null-hypothesis is true. However, if an actual effect is present, fewer participants need to be deleted to obtain a significant result. If the original sample size is large enough, it is always possible to delete cases to end up with a significant result. Of course, at some point selective and secretive deletion of observations is just data fabrication. Rather than making up data, actual data from participants are deleted to end up with the desired pattern of results. However, without information about the true effect size, it is difficult to determine whether an effect was present and just embellished (see Fisher’s analysis of Mendel’s famous genetics studies) or whether the null-hypothesis is true.
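The discard-and-rerun scenario described under the third practice above can be made concrete with a small calculation. This is only a sketch: it assumes each study is an independent attempt with a fixed chance of reaching significance, so the number of attempts until the first success follows a geometric distribution with mean 1 / (success rate). The function name is hypothetical and does not come from the report or the blog.

```python
def expected_studies_until_significant(success_rate: float) -> float:
    """Mean number of independent studies run until the first significant one.

    Assumes a constant per-study success rate (geometric distribution).
    """
    if not 0.0 < success_rate <= 1.0:
        raise ValueError("success_rate must be in (0, 1]")
    return 1.0 / success_rate


# Null hypothesis true, 5% false-positive rate: about 20 studies per "hit".
studies_under_null = expected_studies_until_significant(0.05)

# A real effect studied with 30% power: a bit more than 3 studies per "hit".
studies_with_low_power = expected_studies_until_significant(0.30)

print(round(studies_under_null, 1), round(studies_with_low_power, 1))
```

Under these assumptions the contrast in the text falls out directly: roughly twenty attempts per significant result if there is no effect, versus about three if a real effect is studied with 30% power.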

The English translation of the report does not contain any statements about questionable research practices from Dr. Förster. In an email communication on January 2, 2014, Dr. Förster revealed that he in fact ran multiple studies, some of which did not produce significant results, and that he only reported his best studies. He also mentioned that he openly admitted to this common practice to the commission. The English translation of the final report does not mention this fact. Thus, it remains an open question whether QRPs could have produced the unusual linearity in Dr. Förster’s studies.

A New Perspective: The Curse of Low Powered Studies

One unresolved question is why Dr. Förster would manipulate data to produce a linear pattern of means that he did not even mention in his articles. (Discover magazine).

One plausible answer is that the linear pattern is the by-product of questionable research practices to claim that two experimental groups with opposite manipulations are both significantly different from a control group. To support this claim, the articles always report contrasts of the experimental conditions and the control condition (see Table below).

In Table 1 the results of these critical tests are reported with subscripts next to the reported means. As the direction of the effect is theoretically determined, a one-tailed test was used. The null-hypothesis was rejected when p < .05.

Table 1 reports 9 comparisons of global processing conditions and control groups and 9 comparisons of local processing conditions with a control group; a total of 18 critical significance tests. All studies had approximately 20 participants per condition. The average effect size across the 18 studies is d = .71 (median d = .68). An a priori power analysis with d = .7, N = 40, and significance criterion .05 (one-tailed) gives a power estimate of 69%.
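As a rough check on the a priori power figure, here is a minimal sketch that uses a normal approximation to the two-sample test rather than the noncentral t distribution that dedicated power software would typically use. The function name and the approximation are assumptions made for illustration, so the result will not match the 69% quoted above exactly; for these inputs it comes out slightly higher.

```python
from math import sqrt
from statistics import NormalDist


def approx_power_two_groups(d: float, n_per_group: int, alpha: float = 0.05) -> float:
    """Approximate one-tailed power for comparing two independent group means.

    Normal approximation: the test statistic is treated as
    Normal(d * sqrt(n / 2), 1) and compared against the one-tailed
    critical z value for the chosen alpha.
    """
    std_normal = NormalDist()
    expected_z = d * sqrt(n_per_group / 2)      # mean of the test statistic
    critical_z = std_normal.inv_cdf(1 - alpha)  # one-tailed cutoff
    return std_normal.cdf(expected_z - critical_z)


# d = .7 with 20 participants per condition (N = 40), one-tailed alpha = .05
power = approx_power_two_groups(0.7, 20)
print(f"approximate power: {power:.2f}")
```

The approximation ignores the extra uncertainty from estimating the standard deviation in small samples, which is why a t-based calculation gives a somewhat lower value.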

An alternative approach is to compute observed power for each study and to use median observed power (MOP) as an estimate of true power. This approach is more appropriate when effect sizes vary across studies. In this case, it leads to the same conclusion, MOP = 67%.

The MOP estimate of power implies that a set of 100 tests is expected to produce 67 significant results and 33 non-significant results. For a set of 18 tests, the expected values are 12.1 significant results and 5.9 non-significant results.

The actual success rate should be easy to infer from Table 1, but there are some inaccuracies in the subscripts. For example, Study 1a shows no significant difference between means of 38 and 31 (d = .60), but it shows a significant difference between means 31 and 27 (d = .33). Most likely the subscript for the control condition should be c, not a.

Based on the reported means and standard deviations, the actual success rate with N = 40 and p < .05 (one-tailed) is 83% (15 significant and 3 non-significant results).

The actual success rate (83%) is higher than one would expect based on MOP (67%). This inflation in the success rate suggests that the reported results are biased in favor of significant results (the reasons for this bias are irrelevant for the following discussion, but it could be produced by not reporting studies with non-significant results, which would be consistent with Dr. Förster’s account).

The R-Index was developed to correct for this bias. The R-Index subtracts the inflation rate (83% – 67% = 16%) from MOP. For the data in Table 1, the R-Index is 51% (67% – 16%).
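The R-Index arithmetic just described is simple enough to write out. The sketch below follows the definition given in the text (median observed power minus the inflation rate); the function name is a hypothetical label, not code from the Replicability Index blog.

```python
def r_index(median_observed_power: float, success_rate: float) -> float:
    """R-Index as described in the text: MOP minus the inflation rate.

    The inflation rate is the observed success rate minus MOP.
    All quantities are proportions between 0 and 1.
    """
    inflation = success_rate - median_observed_power
    return median_observed_power - inflation


# Values reported for Table 1: MOP = 67%, success rate = 83% (15 of 18 tests)
print(f"R-Index: {r_index(0.67, 0.83):.2f}")
```

Written this way, the R-Index equals 2 × MOP minus the success rate, so any excess of reported successes over estimated power is penalized twice.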

Given the use of a between-subject design and approximately equal sample sizes in all studies, the inflation in power can be used to estimate inflation of effect sizes. A study with N = 40 and p < .05 (one-tailed) has 50% power when d = .50.

Thus, one interpretation of the results in Table 1 is that the true effect size of the manipulations is d = .5, that 9 out of 18 tests should have produced a significant contrast at p < .05 (one-tailed), and that questionable research practices were used to increase the success rate from 50% to 83% (15 vs. 9 successes).

The use of questionable research practices would also explain unusual linearity in the data. Questionable research practices will increase or omit effect sizes that are insufficient to produce a significant result. With a sample size of N = 40, an effect size of d = .5 is insufficient to produce a significant result, d = .5, se = .32, t(38) = 1.58, p = .06 (one-tailed). Random sampling error that works against the hypothesis can only produce non-significant results that have to be dropped or moved upwards using questionable methods. Random error that favors the hypothesis will inflate the effect size and start producing significant results. However, random error is normally distributed around the true effect size and is more likely to produce results that are just significant (d = .8) than to produce results that are very significant (d = 1.5). Thus, the reported effect sizes will be clustered more closely around the median inflated effect size than one would expect based on an unbiased sample of effect sizes.
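The d = .5 example can be retraced step by step. The sketch below assumes the common large-sample shortcut in which the standard error of d for two groups is sqrt(1/n1 + 1/n2), and it uses a normal approximation for the p-value because the Python standard library has no t distribution. Both are simplifications for illustration, so the p-value comes out slightly smaller than the t-based value reported in the text.

```python
from math import sqrt
from statistics import NormalDist


def t_from_effect_size(d: float, n1: int, n2: int) -> tuple[float, float]:
    """Return (standard error, t statistic) for a standardized mean difference.

    Uses se = sqrt(1/n1 + 1/n2), the usual approximation that ignores
    the small extra term depending on d itself.
    """
    se = sqrt(1 / n1 + 1 / n2)
    return se, d / se


se, t_stat = t_from_effect_size(0.5, 20, 20)

# Normal approximation to the one-tailed p-value (an assumption made here;
# the text reports the p-value from a t distribution with 38 df).
p_one_tailed_approx = 1 - NormalDist().cdf(t_stat)

print(f"se = {se:.2f}, t = {t_stat:.2f}, approximate one-tailed p = {p_one_tailed_approx:.3f}")
```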

The clustering of effect sizes will happen for the positive effects in the global processing condition and for the negative effects in the local processing condition. As a result, the pattern of all three means will be more linear than an unbiased set of studies would predict. In a large set of studies, this bias will produce a very low p-value.

One way to test this hypothesis is to examine the variability in the reported results. The Test of Insufficient Variance (TIVA) was developed for this purpose. TIVA first converts p-values into z-scores. The variance of z-scores is known to be 1. Thus, a representative sample of z-scores should have a variance of 1, but questionable research practices lead to a reduction in variance. The probability that a set of z-scores is a representative set of z-scores can be computed with a chi-square test and chi-square is a function of the ratio of the expected and observed variance and the number of studies. For the set of studies in Table 1, the variance in z-scores is .33. The chi-square value is 54. With 17 degrees of freedom, the p-value is 0.00000917 and the odds of this event occurring by chance are 1 out of 109,056 times.
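The first two steps of TIVA, converting p-values to z-scores and computing their variance, can be sketched as follows. The p-values below are hypothetical placeholders, not the values from Table 1, and the function names are made up for this illustration. The chi-square step is left out because the text does not spell out the exact form of the statistic.

```python
from statistics import NormalDist, variance


def z_scores_from_one_tailed_p(p_values: list[float]) -> list[float]:
    """Convert one-tailed p-values into standard normal z-scores."""
    std_normal = NormalDist()
    return [std_normal.inv_cdf(1 - p) for p in p_values]


def z_score_variance(p_values: list[float]) -> float:
    """Sample variance of the z-scores; about 1 is expected without selection."""
    return variance(z_scores_from_one_tailed_p(p_values))


# Hypothetical p-values, all just under .05, to mimic a selected set of results.
hypothetical_p_values = [0.04, 0.03, 0.045, 0.02, 0.01, 0.035, 0.049, 0.025]

observed_variance = z_score_variance(hypothetical_p_values)
print(f"variance of z-scores: {observed_variance:.3f} (expected about 1)")
```

When every reported p-value sits just below the significance threshold, the z-scores are squeezed into a narrow band and their variance drops well below 1, which is the pattern the test is designed to flag.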

Conclusion

Previous discussions about abnormal linearity in Dr. Förster’s studies have failed to provide a satisfactory answer. An anonymous accuser claimed that the data were fabricated or manipulated, which the author vehemently denies. This blog proposes a plausible explanation of what happened. Dr. Förster may have conducted more studies than were reported and included only studies with significant results in his articles. Slight variation in sample sizes suggests that he may also have removed a few outliers selectively to compensate for low power. Importantly, neither of these practices would imply scientific misconduct. The conclusion of the commission that scientific misconduct occurred rests on the assumption that QRPs cannot explain the unusual linearity of means, but this blog points out how selective reporting of positive results may have inadvertently produced this linear pattern of means. Thus, the present analysis supports the conclusion by an independent statistical expert mentioned in the LOWI report: “QRP cannot be excluded, which in the opinion of the expert is a common, if not “prevalent” practice, in this field of science.”

How Unusual is an R-Index of 51?

The R-Index for the 18 statistical tests reported in Table 1 is 51% and TIVA confirms that the reported p-values have insufficient variance. Thus, it is highly probable that questionable research practices contributed to the results, and in a personal communication Dr. Förster confirmed that additional studies with non-significant results exist. This account of events is consistent with other examples.

For example, the R-Index for a set of studies by Roy Baumeister was 49%. Roy Baumeister also explained why his R-Index is so low.

“We did run multiple studies, some of which did not work, and some of which worked better than others. You may think that not reporting the less successful studies is wrong, but that is how the field works.”

Sadly, it is quite common to find an R-Index of 50% or lower for prominent publications in social psychology. This is not surprising because questionable research practices were considered good practices until recently. Even at present, it is not clear whether these practices constitute scientific misconduct (see discussion in Dialogue, Newsletter of the Society for Personality and Social Psychology).

How to Avoid Similar Sad Stories in the Future

One way to avoid accusations of scientific misconduct is to conduct a priori power analyses and to conduct only studies with a realistic chance to produce a significant result when the hypothesis is correct. When random error is small, true patterns in data can emerge without the help of QRPs.

Another important lesson from this story is to reduce the number of statistical tests as much as possible. Table 1 reported 18 statistical tests with the aim to demonstrate significance in each test. Even with a liberal criterion of .1 (one-tailed), it is highly unlikely that so many tests will all produce significant results. Thus, a non-significant result is likely to emerge and researchers should think ahead of time how they would deal with non-significant results.

For the data in Table 1, Dr. Förster could have reported the means of 9 small studies without significance tests and conducted significance tests only once for the pattern in all 9 studies. With a total sample size of 360 participants (9 * 40), this test would have 90% power even if the effect size is only d = .35. With 90% power, the total power to obtain significant differences from the control condition for both manipulations would be 81%. Thus, the same amount of resources that were used for the controversial findings could have been used to conduct a powerful empirical test of theoretical predictions without the need to hide inconclusive, non-significant results in studies with low power.
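The 81% figure, and the contrast with demanding significance in all 18 small tests, both come from multiplying per-test power. The sketch below assumes the tests are independent, which is a simplification (the two contrasts in each study share a control group), and the function name is hypothetical.

```python
def prob_all_significant(power_per_test: float, n_tests: int) -> float:
    """Chance that every one of n independent tests reaches significance."""
    return power_per_test ** n_tests


# Two pooled tests with 90% power each, as proposed in the text.
both_pooled_tests = prob_all_significant(0.90, 2)

# Eighteen separate tests, each with the estimated 67% power (MOP).
all_eighteen_tests = prob_all_significant(0.67, 18)

print(f"two pooled tests: {both_pooled_tests:.2f}")
print(f"eighteen small tests: {all_eighteen_tests:.5f}")
```

Under the independence assumption, requiring success in every one of many low-powered tests is a far more demanding target than succeeding in two well-powered ones.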

Jacob Cohen tried for decades to teach psychologists the importance of statistical power, and psychologists stubbornly ignored his valuable contribution to research methodology until he died in 1998. Methodologists have been mystified by the refusal of psychologists to increase power in their studies (Maxwell, 2004).

Mayo: Here I am in total agreement. Yet well-known critics claim significance tests can say nothing in the case of statistically insignificant results, or that use of power is an “inconsistent hybrid”. It is not. See Mayo, D. G. and Cox, D. R. (2010). “Frequentist Statistics as a Theory of Inductive Inference” in (Mayo and Spanos) Error and Inference.

One explanation is that small samples provided a huge incentive. A non-significant result can be discarded with little cost of resources, whereas a significant result can be published and have the additional benefit of an inflated effect size, which allows boosting the importance of published results.

The R-Index was developed to balance the incentive structure towards studies with high power. A low R-Index reveals that a researcher is reporting biased results that will be difficult to replicate by other researchers. The R-Index reveals this inconvenient truth and lowers excitement about incredible results that are indeed incredible. The R-Index can also be used by researchers to control their own excitement about results that are mostly due to sampling error and to curb the excitement of eager research assistants that may be motivated to bias results to please a professor.

Curbed excitement does not mean that the R-Index makes science less exciting. Indeed, it will be exciting when social psychologists start reporting credible results about social behavior that boast a high R-Index because for a true scientist nothing is more exciting than the truth.

If so, then why would a “prevalent” practice be to bias inferences by selecting results in sync with one’s hypothesis?

Schimmack has a (Jan 15, 2015) update here in which he appears to retract what he said above! Why? As best as I could understand it, it’s because the accused fraudster denies committing any QRPs, and so if he doesn’t want to admit the lesser crime, sparing him from the “fraud” label, then he must be guilty of the more serious crime of fraud after all.

Since Richard Gill alerted me to these blogposts, and I trust Gill’s judgment, there’s bound to be something in all of this reanalysis.

[i] Fake Data Colada. Maybe the author has also changed his mind, given his update.

22 thoughts on “Power Analysis and Non-Replicability: If bad statistics is prevalent in your field, does it follow you can’t be guilty of scientific fraud?”

January 19, 2015

Dr. R

Thank you for comments on my blog (please also post my follow up post).

I think any discussion about QRPs or fraud as explanations for Dr. Forster’s data requires clear definitions of QRPs and fraud.

I draw the line between fraud and QRPs between the addition and deletion of values. Fraud is making up data. QRPs remove undesirable data that were actually collected.

Did Dr. Forster actually collect data? I don’t know.

Could Dr. Forster have collected data and then used QRPs to obtain the results that are reported in the three articles? My blog merely points out that this is possible.

I was concerned about the implication that Dr. Forster’s data are sufficiently improbable to rule out QRPs. The reason is that other data that have been investigated with the R-Index (see my blog for several examples) are statistically even less probable than Dr. Forster’s data that I examined on my blog.

If we would conclude that an R-Index of 51% that I obtained for Dr. Forster’s data is sufficient to infer that data were fabricated, it would have dramatic implications for many other articles published in social psychology.

I DO NOT believe that statistically improbable results automatically imply fraud. I do believe that the reason for small samples is that it allows researchers to collect actual data and to compensate for low power with the help of QRPs.

I think there is other evidence, including the large number of experiments within each article, and most important Dr. Forster’s DENIAL of having used QRPs that rule out QRPs as an explanation and suggest the data are fraudulent.

I think we could have arrived at this conclusion faster if the commission had considered QRPs more carefully and asked Dr. Forster pointed questions about specific practices (deleting cases from the middle group) and if the commission had presented a clearer and more informative report of the investigation for the scientific community.

The personal conversation with Dr. Forster convinced me that he did not use QRPs. As a result, the data should be considered unexplainable and should be retracted because they provide no empirical evidence for or against Dr. Forster’s theoretical claims.

Sincerely, Dr. R

https://replicationindex.wordpress.com/2015/01/14/further-reflections-on-the-linearity-in-dr-forsters-data/

January 19, 2015

Mayo

Dr. R:
Dashing to catch a plane, so this will be quick:

You wrote: “Fraud is making up data. QRPs remove undesirable data that were actually collected.”

Surely you wouldn’t want to limit fraud or misconduct to the Stapel ideal of no data at all.

“Dr. Forster’s DENIAL of having used QRPs that rule out QRPs as an explanation and suggest the data are fraudulent.”

But we know he contradicts himself on this very point when admitting data finagling until he gets the right result. Surely that falls under the QRP banner, even if “everyone does it”. Yes? So, the fact that he doesn’t label what he does a QRP cannot mean it isn’t. He’s using words in a different way. And even if he hadn’t admitted to “trying and trying again” until he got the data in sync with his theory, “merely asking X” would be a bizarre way to criticize X’s scientific inference. Stapel denied he cooked any numbers when asked; it was only after incontrovertible evidence that he confessed.

January 19, 2015

Dr. R

The title of the blog asks “If bad statistics is prevalent in your field, does it follow you can’t be guilty of scientific fraud?”

The answer is simple. You can still be found guilty of scientific fraud even in a field where questionable research practices are prevalent and accepted. Stapel is a case in point. The final report was strongly critical of social psychology as a “sloppy science.” However, sloppy science is just bad science and does not constitute fraud. Fraud implies that data were fabricated. Stapel admitted making up data on his laptop in his kitchen.

The problem with Dr. Forster is that he is not admitting to fraud and that at present there is just a statistical pattern in the data that cannot be explained without QRPs or fraud. But is it just sloppy science or fraud? Statistics alone will not be able to answer this question. However, Dr. Forster stated that he did not use QRPs. This leaves fraud as the only plausible explanation. Dr. Forster denies that he committed fraud and implies that the data were manipulated by somebody else. Whether this is plausible or an OJ Simpson is no longer a statistical question. It would require an actual investigation of the chain of custody of the data.

January 19, 2015

Mayo

Dr. R:
The point of that question in my title is to (ironically) suggest it is absurd that if QRPs are prevalent then they don’t count as misconduct, whereas in a field where QRPs are rare they do. What Anil Potti did would count as a mere QRP under your system, as I understand it. (Please search Potti on this blog). It is more than that in cancer research, fortunately (or at least one hopes).

If a practice requires QRPs to get results, then it’s a questionable science, and the more prevalent, the more it slips into the realm of pseudoscience (as I see it). That’s not the same as having many failed attempts and eventually finding a stringently demonstrated effect. Take any great scientist from Brown (Brownian motion) to the recent Stanley Prusiner (prions, mad cow). Their progress is a history of failed attempts to demonstrate the result–both Nobel prize winners. Or the Higgs. And yes, they discuss the failed attempts in detail in published work because that’s what learning in real science is about: learning how to get good at interacting with one’s subject.

More later, gotta run!

January 19, 2015

Dan Riley

I dispute Dr. R’s line drawing.

As I see it, the sample size is an essential property of the experiment (perhaps the most essential single property, just try calculating the significance or statistical power without considering the sample size).

If I throw out 5 samples out of 25 and report a sample size of 20, then I have replaced the true sample size with a fake value I manipulated by removing samples. That is fraudulent fabrication of data.

If I throw out 5 samples out of 25 and report a sample size of 20 with 5 removed for no adequate reason, that’s a QRP. It isn’t outright fraud since I reported the true sample size, but it is a QRP because I used a different sample size in the analysis without a clear explanation.

If I throw out 5 samples out of 25 and report a sample size of 20 with 5 removed for cause, that may be a borderline QRP depending on the legitimacy and rigor of the reason (bonus points for blinding the data and pre-registering a procedure for developing rejection criteria).

If I include an analysis of the systematics of the removals, that’s responsible science.

January 19, 2015

Dr. R

Dear Dan Riley,

I just want to clarify that my personal view is that most QRPs constitute scientific misconduct because they lead to dishonest reporting of evidence that inflates effect sizes and replicability.

The main purpose of the R-Index and the Test of Insufficient Variance is to reveal QRPs, to correct for the influence of QRPs, and to create a disincentive to use QRPs.

However, at what point the use of QRPs should trigger an investigation of scientific misconduct is not a matter of personal opinions, but has to be decided by scientific organizations, universities, and funding agencies.

Sincerely, Dr. R

January 19, 2015

Richard Gill

Dictionary: “Fraud is a deception deliberately practiced in order to secure unfair or unlawful gain”. Someone who does junk science and publishes junk science in a junk field is not necessarily fraudulent. They can do it with the best of intentions.

Questionable science is not the same as fraudulent science. If scientific publications are questionable, the publications should be retracted. If scientists commit fraud, they should be fired (and their publications retracted).

We need to distinguish between personal integrity of researchers and scientific integrity of scientific works. When a scientist is accused of fraud, the burden of proof is high: the person is innocent, till proven guilty. When a scientific work is exposed as questionable, the burden of proof is the other way round.

January 19, 2015

Dr. R

Well said. I agree 100%.

January 20, 2015

Mayo

In that case, I find the blog position to be incoherent; e.g., the Jan 15, 2015 update retracts the blog’s QRP explanation*, simply because Forster denies committing QRPs, thereby inferring he’s guilty of fraud after all?

“Thus, I now agree with the conclusion of the LOWI commission that the data cannot be explained by using QRPs, mainly because Dr. Förster denies having used any plausible QRPs that could have produced his results.”

Isn’t this going from his ignorance to inferring QRPs can’t explain it after all (even though he may simply not grasp the convoluted stats of the R index) and thereby agreeing with the committee that it’s fraud? Very puzzling, if not incoherent.

*I can’t judge the QRP explanation on the basis of what I’ve read, but it’s far from clear it explains the linearity problem Simonsohn and others argue cannot be due to any QRPs that anyone has come up with.

January 20, 2015

Dr. R

Dear Mayo,

1. My first blog suggested an explanation for the pattern of linearity and it suggested that QRPs could have been used to produce it. Nobody has challenged the explanation that excessive linearity would occur when QRPs are used to show significant contrasts between the control group and both extreme groups. It is easy to show in simulations that this is the case.

2. I contacted Dr. Forster and he admitted to dropping studies with non-significant results. He neither denied nor admitted the use of other QRPs.

3. Additional investigations of the data suggested to me that the extreme groups do not show a statistically abnormal pattern. This cannot be tested with the linearity test because there are only two groups. It was a novel contribution to show that the extreme groups do not show unusual statistical patterns.

4. However, the middle group is clearly fishy because the means are very strongly predicted by the means of the extreme groups, when they should be varying at random. Again, this had not been shown in previous investigations that I could find online.

5. Based on these results, I came up with a QRP that would produce the desired effect (significant contrasts) and lead to linearity in the means. Namely, the researcher could have oversampled the middle group and selectively deleted participants to shift the mean. This is a QRP and was used in the Simmons et al. (2011) paper to create significant results of music on biological age (drop 13 participants). I haven’t seen any simulation or other evidence that this QRP cannot explain linearity in Dr. Forster’s data.

6. I then contacted Dr. Forster and asked him whether this is what he did. He did not directly respond to this question but stated that he did not use QRPs.
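The mechanism in point 5 can be illustrated with a toy simulation. This is only a sketch under stated assumptions: it is not the simulation referred to in point 1, and it is not fitted to Dr. Forster’s data. The group size, the number of extra control participants, the effect size, and the greedy deletion rule (the hypothetical helper `trim_toward`) are all made up for illustration. As a simplification, the deletion rule pulls the control mean toward the midpoint of the two extreme groups directly, as a stand-in for chasing two significant contrasts.

```python
import random
import statistics


def trim_toward(values, target, n_drop):
    """Greedily drop n_drop values so the mean of the rest moves toward target."""
    kept = list(values)
    for _ in range(n_drop):
        total = sum(kept)
        m = len(kept) - 1
        # Drop the participant whose removal leaves the mean closest to target.
        idx = min(range(len(kept)),
                  key=lambda i: abs((total - kept[i]) / m - target))
        kept.pop(idx)
    return kept


def simulate(n_studies=500, n=20, extra=10, effect=0.5, seed=1):
    """Return the average |control mean - midpoint| without and with deletion."""
    rng = random.Random(seed)
    honest_dev, qrp_dev = [], []
    for _ in range(n_studies):
        # Two extreme groups with opposite (assumed) true effects.
        low = [rng.gauss(-effect, 1) for _ in range(n)]
        high = [rng.gauss(effect, 1) for _ in range(n)]
        midpoint = (statistics.fmean(low) + statistics.fmean(high)) / 2
        # Control group is oversampled by `extra` participants.
        control = [rng.gauss(0, 1) for _ in range(n + extra)]
        honest = control[:n]                              # no deletion
        trimmed = trim_toward(control, midpoint, extra)   # selective deletion
        honest_dev.append(abs(statistics.fmean(honest) - midpoint))
        qrp_dev.append(abs(statistics.fmean(trimmed) - midpoint))
    return statistics.fmean(honest_dev), statistics.fmean(qrp_dev)


if __name__ == "__main__":
    honest, qrp = simulate()
    print(f"mean |control - midpoint| without deletion: {honest:.3f}")
    print(f"mean |control - midpoint| with deletion:    {qrp:.3f}")
```

In this toy setup the control mean sits much closer to the midpoint of the extreme groups after selective deletion than it does by chance, which is the pattern described in point 4. It does not show that this is what happened in any actual study.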

So, what is incoherent? That I believe he could have used QRPs or that I believed him when he denied using QRPs? I think it is easy to reconcile these two statements.

QRPs could have produced the results, but they didn’t because no data were collected and the data are fabricated.

Looks like most people agree now that the reported results are not trustworthy and should be retracted.

I still think that Dr. Forster’s best option to demonstrate that we are all wrong about his data is to conduct an open replication study with high power that can produce his published results. Given his ability to hit the target again and again, it shouldn’t be so difficult for him to do so one more time, unless of course he is a Texas sharpshooter. If he dropped studies, he should have a vague idea how many studies had to be dropped to get the significant ones and he can adjust the power analysis accordingly. I would love to see the results of a replication study. Interestingly, a team of researchers just published a replication study of Stapel. The results DID NOT replicate. So, fake data are unlikely to replicate and replication studies can provide valuable information.
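A rough sketch of what such a power analysis might look like, using only the standard normal approximation for a two-group comparison. The function names and all numbers are illustrative assumptions, not values from the Förster studies. The crude adjustment for dropped studies (treating the share of significant studies among all studies run as an estimate of power) is also an assumption made here for illustration.

```python
import math
from statistics import NormalDist


def n_per_group(d, power=0.95, alpha=0.05):
    """Approximate per-group n for a two-sided, two-sample comparison.

    Normal approximation: n = 2 * ((z_{1-alpha/2} + z_{power}) / d) ** 2,
    where d is the standardized mean difference.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_power = z.inv_cdf(power)
    return math.ceil(2 * ((z_alpha + z_power) / d) ** 2)


def implied_power(n_significant, n_dropped):
    """Crude estimate: share of all studies run that reached significance."""
    return n_significant / (n_significant + n_dropped)


def prob_all_significant(power, k):
    """Chance that k independent studies, each with this power, all succeed."""
    return power ** k


if __name__ == "__main__":
    # Hypothetical inputs: 12 reported studies, 3 dropped, d = 0.5.
    p = implied_power(12, 3)
    print("implied power per study:", round(p, 2))
    print("chance of 12 out of 12:", round(prob_all_significant(p, 12), 3))
    print("n per group for 95% power:", n_per_group(0.5, power=0.95))
```

The point of the last function is that even with respectable power per study, a long unbroken run of significant results is improbable, so a high-powered open replication is informative either way.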

Reply

January 20, 2015

Keith O'Rourke

>replication studies can provide valuable information
Agree – as I tried to argue in the link below – it is the easiest for the most to grasp.

http://andrewgelman.com/2007/07/09/how_should_unpr/#comment-42903

Reply

January 21, 2015

Mayo

Keith: I can’t quite tell whether you’re saying (in that comment) that those who claim it’s no problem to selectively report stat sig results are wrong, or that one needs to show empirically they’re wrong, or?

Reply

January 21, 2015

Mayo

Do you have a link to the Stapel replication?

The incoherence is in concluding with the report, which indicates fraud, when you had just commented that it couldn’t be that without intent or whatever.

Reply

January 21, 2015

Dr. R

The link is here:

http://rolfzwaan.blogspot.ca/

also just in the news

http://blogs.discovermagazine.com/neuroskeptic/2015/01/20/how-diederik-stapel-became-fraud/#.VL_Tei7Zb-0

Reply

January 19, 2015

Mayo

Richard: I realize you have a sympathetic standpoint on this, and there’s an admirable generosity about it, but here’s my position. When you undertake a certain profession you pledge yourself to norms that you are obligated to follow. Choosing to be that scientific professional, you have willed not to violate those requirements of validly using the scientific instruments/methods on which your work relies. A surgeon performing an operation is expected to be informed about bad surgical practices of the day. That he or she has the best intentions does not remove the obligation by which the professional has agreed to abide. That is why there are moral, legal, scientific strictures whose violation by a practitioner of the relevant profession renders them culpable. Ignorance, real or feigned, does not remove that culpability. Statistical tools are not just window dressing to make your theory look scienc-ey. Their use has implications, as do scalpels (though fortunately psych folks are less likely to do bodily harm). How much more so now that we’ve had professional treatises and panels (e.g., the Stapel report) that speak very directly to acceptable and unacceptable research practices, and explain exactly WHY. I think the Report could not be clearer, and it’s the obligation of practitioners of a field to inform themselves of best/worst practices. More than that, no one in psychology could be unaware of the replication and reproducibility projects broadcast everywhere, or the many journal issues over the past years not only addressing these problems but explaining why the statistical guarantees go completely out the window if you’re allowed cherry-picking data, and gaming statistics by exploiting other researcher degrees of freedom. How many years until practitioners are held accountable to standards of the legitimate/illegitimate use of statistics? I understand that Forster is/was an extremely prominent leader in the field.
Any deliberate effort to remain ignorant of the rulings being issued on a daily basis in both popular and professional forums is just that: deliberate and willful. I’m no lawyer or ethicist, but I can guess what the ethicists would say.

Reply

January 20, 2015

Keith O'Rourke

I do have to agree with Richard and Dr R, even though I believe QRPs do far more damage than fraud.

We don’t have the “ignorance of the law is no defence” argument as good research practices simply are not as clearly coded and declared applicable.

Even if they were, we would still have the “was not capable of understanding the difference between right and wrong research practices” defence that many currently active faculty researchers could honestly make (from my experience working with them).

The “Criminals in the Citadel…” below is a great overview of what such ignorance encourages and how hard it will be to address it.

Reply

January 20, 2015

Mayo

Keith: If it makes people feel better to call it scientific malpractice or the like, fine, but people are losing their jobs over such misrepresentations and abuses of data, including in psych. If there is no professional responsibility created by the numerous “reports” in the last few years, why does anyone bother to write them?

SPSP on fraud: dialogue_26(2).pdf

Tilburg Report: tilberg-report-stapel-final-report-levelt1.pdf

Reply

January 20, 2015

Mayo

Richard: “Fraudulous” is a good word. If it’s a “junk field” why even call it science? I’m all in favor of considering dubbing some of these areas “for entertainment purposes only”. That would save a lot of time now being spent examining the statistical fine points of a lot of egregious work, and the daily blame game as statistical tests are scapegoated. It’s precisely those criticisms that make it easier to flout the rules in serious applications, e.g., clinical trials (as in the Potti case).

I notice your distinction between fraudulous or fraudulent scientific work vs a fraud scientist. I hadn’t commented on that. I think that’s OK. But what if we countenance “merely” saying the guy produces fraudulent work? (and no matter how many times he’s told he continues to. But he’s no fraud.) Wouldn’t the Humboldt foundation still withdraw the award?

Reply

January 20, 2015

Dr. R

The goal of the R-Index is to provide scientific and objective information about the status of research in any field. There are good social psychologists who are doing good work. It would be unfair to label their field of research junk science because many or even the majority use QRPs. It will be a lot of work, but everybody deserves to be treated fairly and to be considered a good researcher until proven otherwise.

Reply

January 19, 2015

e.berk

Stan Young once cited Tharyan, “Criminals in the Citadel…” on different meanings of fraud vs scientific misconduct, e.g., 33
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3353596/

Reply

January 21, 2015

Mayo

The link Dr. R sent me about people trying to replicate Stapel’s fraudulent studies is a blog post by Rolf Zwaan, “When Replicating Stapel is not an Exercise in Futility”
http://rolfzwaan.blogspot.ca/2015/01/when-replicating-stapel-is-not-exercise.html

Reply

I welcome constructive comments that are of relevance to the post and the discussion, and discourage detours into irrelevant topics, however interesting, or unconstructive declarations that "you (or they) are just all wrong". If you want to correct or remove a comment, send me an e-mail. If readers have already replied to the comment, you may be asked to replace it to retain comprehension.

© Deborah G. Mayo, Error Statistics Philosophy, 2011-2018 All Rights Reserved.