Hunting for “nominally” significant differences, trying different subgroups and multiple endpoints, can result in a probability of erroneously inferring evidence of a risk or benefit that is much higher than the nominal p-value suggests, even in randomized controlled trials. This issue arose in looking at RCTs in development economics (an area introduced to me by Nancy Cartwright), as it did at our symposium at the Philosophy of Science Association last month.[i][ii] Reporting the results of hunting and dredging just as if the relevant claims had been predesignated can lead to misleading reports of the actual significance levels attained.[iii]
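To get a feel for how quickly the nominal 0.05 level gets diluted, here is a minimal simulation sketch (my own illustration, with made-up numbers; nothing here is taken from the InterMune trial). It assumes k independent subgroup or endpoint comparisons, each tested at the 0.05 level when there is in fact no effect, and estimates the chance that at least one comes out “nominally significant”:

```python
import numpy as np

rng = np.random.default_rng(0)

def prob_spurious_hit(n_tests, alpha=0.05, n_sims=100_000):
    """Monte Carlo estimate of the chance that at least one of
    n_tests independent null comparisons yields p < alpha."""
    # Under the null hypothesis, each p-value is uniform on [0, 1].
    p_values = rng.uniform(size=(n_sims, n_tests))
    return (p_values < alpha).any(axis=1).mean()

for k in (1, 5, 10, 20):
    print(f"{k:2d} comparisons: P(at least one nominal hit) = "
          f"{prob_spurious_hit(k):.3f}   (analytic: {1 - 0.95**k:.3f})")
```

With 10 independent comparisons, the chance of at least one spuriously “significant” result is already around 40% (1 − 0.95^10). Real subgroup analyses are correlated, so the exact inflation differs, but the direction of the effect is the same.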
Still, even if reporting spurious statistical results is considered “bad statistics,” is it criminal behavior? I noticed this issue being discussed on Nathan Schachtman’s blog over the past couple of days. The case concerns a biotech company, InterMune, and its former CEO, Dr. Harkonen. Here’s an excerpt from Schachtman’s discussion (part 1).
In August 2002, Dr. Harkonen approved a press release, which carried a headline, “phase III data demonstrating survival benefit of Actimmune in IPF.” A subtitle announced the 70% relative reduction in mortality in patients with mild to moderate disease. ……
….The prosecution asserted that Dr. Harkonen engaged in data dredging, grasping for the right non-prespecified end point that had a low p-value attached. Such data dredging implicates the problem of multiple comparisons or tests, with the result of increasing the risk of a false-positive finding, notwithstanding the p-value below 0.05.
Supported by the testimony of Professor Thomas Fleming, who chaired the Data Safety Monitoring Board for the clinical trial in question, the government claimed that the trial results were “negative” because the p-values for all the pre-specified endpoints exceeded 0.05. Shortly after the press release, Fleming sent InterMune a letter that strongly dissented from the language of the press release, which he characterized as misleading. Because the primary and secondary end points were not statistically significant, and because the reported mortality benefit was found in a non-prespecified subgroup, the interpretation of the trial data required “greater caution,” and the press release was a “serious misrepresentation of results obtained from exploratory data subgroup analyses.”
The district court sentenced Harkonen to six months of home confinement, three years of probation, 200 hours of community service, and a fine of $20,000. Dr. Harkonen appealed on grounds that the federal fraud statutes do not permit the government to prosecute persons for expressing scientific opinions about which reasonable minds can differ. Unless no reasonable expert could find the defendant’s statement to be true, the trial court should dismiss the prosecution. Statements that have support even from a minority of the scientific community should not be the basis for a fraud charge. In Dr. Harkonen’s case, the government did not allege any misstatement of an objectively verifiable fact, but alleged falsity in his characterization of the data’s “demonstration” of an efficacy effect. The government cross-appealed to complain about the leniency of the sentence…..
Read the rest of part 1; there’s also a part 2.
And what about the claim that “Unless no reasonable expert could find the defendant’s statement to be true, the trial court should dismiss the prosecution”?
[i] “Development: Knowing What Works, Evidence, Evaluation, and Experiment” (J. Burke, N. Cartwright, H. Seckinelgin, D. Fennel).
[ii] This has long been argued by Angus Deaton: http://www.nber.org/papers/w14690
[iii] I’m not sure if it’s required, but the FDA at least recommends that analytical plans be submitted prior to trial to avoid the effects of “trying and trying again”.
Cartwright, N. and Hardie, J. (2012), Evidence-Based Policy: A Practical Guide to Doing It Better (Oxford).
Thank you for this great post, Dr Mayo.
U.S. FDA Guidance for Clinical Trial Sponsors recommends (‘mandates’) interim statistical monitoring points, with the following disclaimer: “This guidance represents the Food and Drug Administration’s (FDA’s) current thinking on this topic. It does not create or confer any rights for or on any person and does not operate to bind FDA or the public. You can use an alternative approach if the approach satisfies the requirements of the applicable statutes and regulations. If you want to discuss an alternative approach, contact the appropriate FDA staff.” (GCTS, iii)
Although statistical monitoring plans are mandated by official documents, this mandate is rarely followed. Even when the Data Safety Monitoring Board does follow the mandate, there are often disputes about the appropriate statistical monitoring policy. However, differences in judgments about RCT monitoring policies are important for understanding interim monitoring decisions – e.g., why did they stop the trial halfway rather than continue?
What I find essential in this legal case (and other similar cases) is that having the interim data and the monitoring rule is not sufficient for a proper reconstruction of the DSMB’s interim decisions. Decisions (whether to stop or continue) often violate the prescription of the monitoring rule in the protocol, because statistical monitoring rules generally function only as guidelines for the DSMB, and rightly so. There are numerous cases in which a DSMB acts in violation of its interim monitoring rule when, for instance, deciding to continue the trial despite early evidence of efficacy; there are many other reasons as well, including futility or possible harm. It’s an important area that deserves more attention from philosophers, statisticians and policy makers.
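To illustrate why interim monitoring rules matter at all, here is a small sketch of my own, with assumed numbers (5 equally spaced looks of 50 observations each; none of this is from the Actimmune trial). It estimates the overall type I error a DSMB would incur if it simply stopped for “efficacy” whenever the accumulating z-statistic crossed the fixed 1.96 boundary:

```python
import numpy as np

rng = np.random.default_rng(1)

def naive_interim_error(n_looks=5, n_per_stage=50, z_crit=1.96, n_sims=20_000):
    """Estimate the overall type I error of stopping a null trial whenever
    the accumulating z-statistic exceeds z_crit at any interim look."""
    false_stops = 0
    for _ in range(n_sims):
        data = rng.standard_normal(n_looks * n_per_stage)  # true effect = 0
        for k in range(1, n_looks + 1):
            n = k * n_per_stage
            z = data[:n].mean() * np.sqrt(n)  # z-statistic at the k-th look
            if abs(z) > z_crit:
                false_stops += 1
                break
    return false_stops / n_sims

print(f"Naive 1.96 boundary, 5 looks: overall type I error = {naive_interim_error():.3f}")
```

Repeated looks at a fixed 5% boundary push the overall error rate to roughly 14% in this setup, which is why group-sequential plans use more stringent boundaries at early looks (e.g., O’Brien–Fleming-type rules); yet, as the comment above notes, DSMBs still treat even those boundaries as guidelines rather than mechanical stopping rules.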