“Bad statistics”: crime or free speech?

Hunting for “nominally” significant differences, trying different subgroups and multiple endpoints, can result in a probability of erroneously inferring evidence of a risk or benefit that is much higher than the nominal p-value suggests, even in randomized controlled trials. This issue arose in looking at RCTs in development economics (an area introduced to me by Nancy Cartwright), as at our symposium at the Philosophy of Science Association last month[i][ii]. Reporting the results of hunting and dredging just as if the relevant claims had been predesignated can lead to misleading reports of the actual significance levels.[iii]
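To see how quickly the inflation sets in, here is a minimal simulation (purely illustrative, with made-up numbers, not data from any trial discussed here): if ten independent null endpoints are each tested at the 0.05 level, the chance that at least one comes out “significant” is roughly 40%, not 5%.

```python
import math
import random
import statistics

random.seed(1)

def trial_p_value(n=50):
    """Two-sided z-test p-value for n standard-normal draws
    whose true mean is 0 (i.e., the null hypothesis is true)."""
    xs = [random.gauss(0, 1) for _ in range(n)]
    z = statistics.mean(xs) * math.sqrt(n)
    return math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail

endpoints, trials = 10, 2000
hits = sum(
    any(trial_p_value() < 0.05 for _ in range(endpoints))
    for _ in range(trials)
)
print("nominal level: 0.05")
print(f"chance of at least one p < 0.05: {hits / trials:.2f}")
# theory predicts 1 - 0.95**10, roughly 0.40
```

Dividing the threshold by the number of endpoints (a Bonferroni-style correction) restores the nominal level, at a cost in power — which is exactly why predesignation of endpoints matters.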

Still, even if reporting spurious statistical results is considered “bad statistics,” is it criminal behavior? I noticed this issue in Nathan Schachtman’s blog over the past couple of days. The case concerns a biotech company, InterMune, and its previous CEO, Dr. Harkonen. Here’s an excerpt from Schachtman’s discussion (part 1).

In August 2002, Dr. Harkonen approved a press release, which carried a headline, “phase III data demonstrating survival benefit of Actimmune in IPF.” A subtitle announced the 70% relative reduction in patients with mild to moderate disease. …

…The prosecution asserted that Dr. Harkonen engaged in data dredging, grasping for the right non-prespecified end point that had a low p-value attached. Such data dredging implicates the problem of multiple comparisons or tests, with the result of increasing the risk of a false-positive finding, notwithstanding the p-value below 0.05.

Supported by the testimony of Professor Thomas Fleming, who chaired the Data Safety Monitoring Board for the clinical trial in question, the government claimed that the trial results were “negative” because the p-values for all the pre-specified endpoints exceeded 0.05.  Shortly after the press release, Fleming sent InterMune a letter that strongly dissented from the language of the press release, which he characterized as misleading.  Because the primary and secondary end points were not statistically significant, and because the reported mortality benefit was found in a non-prespecified subgroup, the interpretation of the trial data required “greater caution,” and the press release was a “serious misrepresentation of results obtained from exploratory data subgroup analyses.”

The district court sentenced Harkonen to six months of home confinement, three years of probation, 200 hours of community service, and a fine of $20,000. Dr. Harkonen appealed on grounds that the federal fraud statutes do not permit the government to prosecute persons for expressing scientific opinions about which reasonable minds can differ.  Unless no reasonable expert could find the defendant’s statement to be true, the trial court should dismiss the prosecution.  Statements that have support even from a minority of the scientific community should not be the basis for a fraud charge.  In Dr. Harkonen’s case, the government did not allege any misstatement of an objectively verifiable fact, but alleged falsity in his characterization of the data’s “demonstration” of an efficacy effect.  The government cross-appealed to complain about the leniency of the sentence…

Read the rest of part 1; there’s also a part 2.

And what about the claim that “[u]nless no reasonable expert could find the defendant’s statement to be true, the trial court should dismiss the prosecution”?

[i] “Development: Knowing What Works, Evidence, Evaluation, and Experiment”
 (J. Burke, N. Cartwright, H. Seckinelgin, D. Fennel).

[ii] This has long been argued by Angus Deaton: http://www.nber.org/papers/w14690

[iii] I’m not sure if it’s required, but the FDA at least recommends that analytical plans be submitted prior to trial to avoid the effects of “trying and trying again”.

Cartwright, N. and Hardie, J. (2012), Evidence-Based Policy: A Practical Guide to Doing It Better (Oxford).

Categories: PhilStatLaw, significance tests, spurious p values, Statistics


27 thoughts on ““Bad statistics”: crime or free speech?”

  1. Thank you for this great post Dr Mayo.

    U.S. FDA Guidance for Clinical Trial Sponsors recommends (‘mandates’) statistical interim points, with the following disclaimer: “This guidance represents the Food and Drug Administration’s (FDA’s) current thinking on this topic. It does not create or confer any rights for or on any person and does not operate to bind FDA or the public. You can use an alternative approach if the approach satisfies the requirements of the applicable statutes and regulations. If you want to discuss an alternative approach, contact the appropriate FDA staff.” (GCTS, iii)

    Although statistical monitoring plans are mandated by official documents, the mandate is rarely followed. Even when the Data Safety Monitoring Board does follow it, there are often disputes about the appropriate statistical monitoring policy. However, differences in judgments about RCT monitoring policies are important in understanding interim monitoring decisions: e.g., why did they stop the trial halfway rather than continue?

    What I find essential in this legal case (and other similar cases) is that having the interim data and the monitoring rule is not sufficient for a proper reconstruction of DSMB interim decisions. Many decisions (whether to stop or continue) violate the prescription of the monitoring rule in the protocol. That is because statistical monitoring rules generally function only as guidelines for the DSMB, and rightly so. There are numerous cases in which the DSMB acts in violation of its interim monitoring rule when, for instance, deciding to continue the trial despite early evidence of efficacy. There are many other reasons, including cases of futility or possible harm. It’s an important area that deserves more attention from philosophers, statisticians and policy makers.

    • Thanks Roger. The FDA claim you cite is pretty weaselly, but it seems to refer more to stopping at interim points. There it makes it sound as if you just contact the appropriate staff person to get some kind of informal OK. However, I have seen protocols jointly created by drug companies and the FDA that were nevertheless condemned by FDA panels when the resulting drug came up for approval (this pertains to drug stocks I’ve owned).

  2. Katrin Hohl

    Within the social sciences, hunting for significant results in this manner is standard and rarely problematised. It is considered a normal part of the analysis process to investigate subgroups further and see whether you can find significant results there, in particular if you didn’t find the significant result you expected for the overall sample in the first place.
    There is no real awareness of the issue in the field.

    • Katrin: But the whole issue is that a subgroup identified by post hoc data dredging invalidates the significance level as an assessment of the probability of erroneously declaring evidence (for the dredged effect). It’s one thing to search for interesting correlations (in subgroups) with a view toward further testing, but you seem to be saying that in your experience no distinction is made between predesignated and postdesignated factors. Is this in criminology?

      • Katrin Hohl

        Deborah: Exactly – commonly no distinction is made between pre- and postdesignated factors, we don’t see adjustments of p-values for multiple testing, re-using the very same data to generate and test hypotheses is very common. In my experience this is nothing specific to criminology, but applies to sociological and psychological questionnaire based research more widely – the philosophical debates have yet to trickle down into practical empirical research!

        • Katrin: I never saw this as philosophical, but as a matter of the properties of tests and p-values. A small reported p-value cannot be regarded as at all difficult to achieve assuming the result is due to chance (as we want it to be) if, in fact, the small reported p-value is very easy to achieve even assuming chance. But I get your point. So as a researcher on issues of methodology in criminology, you might expose this. (There’s the idea for a joint paper, at some point perhaps.)

      • Michael Lew

        Your statement about the dredging invalidating the P-value as an assessment of type I errors is unarguably correct. However, the implication that experimental results should be considered only from the standpoint of type I error rates is misguided.
        Even within the error-decision framework, dredging reduces the risk of type II errors just as it increases the risk of type I. However, the most important assessment of experimental evidence should be from the standpoint of evidence. Data dredging does not alter the nature of the evidence. Some would reasonably find the evidence less compelling if it comes from dredging, but the evidence is the evidence. A P-value is a statement of evidence as long as it is not ‘corrected’ for multiplicity of comparisons or for sequential stopping rules. (Yes, it can be a cryptic index, but the likelihood function implied by the P-value is entirely explicit.)

        • Michael: You’re assuming a kind of likelihood notion of evidence which does not distinguish the nominal (or computed), from the actual, p-value, and so is different from the error statistical framework of evidence. It’s not mere semantics, although it is a well-known issue. Even before you get to deciding how to act toward the evidence, it would seem that one wants to know what is indicated (e.g., about the benefits/risks). I would concur here with Schachtman in his part 2: that the “multiple subgroups means that the usual level of statistical significance becomes ineffective for ruling out chance as an explanation for an increased or decreased risk in a subgroup”. If that is its evidential role, then the post-data “nominal” p-value report has an altered evidential meaning—unless of course entirely distinct reasons and qualifications are given to supplement.

          • Michael Lew

            Mayo: Yes, I do assume a likelihood notion of evidence. I am surprised that such an assumption is problematical.
            The risks/benefit issue is real, but my preference is that it be considered by the experimenter, not built into a mindless system of analysis that is based on protecting the false positive error rate at all costs.

            To me, the post-data nominal P-value IS the P-value, and the evidence in the data is entirely post-data. The manner in which one responds to the evidence is what should be affected by pre-data issues like stopping rules and multiplicity of comparisons. The pre-data and post-data information should be assembled in what Abelson called ‘principled argument’, not in the calculation of a P-value.

            You may guess that I do not feel that extending the error-decision framework to graft a notion of evidence onto it is a particularly attractive approach. Likelihood functions seem to provide a perfectly reasonable way of depicting the evidential meaning of data, where they are available.

            • Michael: I too distinguish the error/decision/cost/benefit issues from evidential ones–that’s what my whole approach is about. Speaking generally, the central problem with just looking at likelihoods is that of data-dependent, ad hoc constructions. One can often if not always find/construct a hypothesis J rival to a given one, H, that renders the data maximally likely, and therefore J is considered better evidentially supported (or the like) than is H—even when H is correct. The idea that the choice is between looking just at likelihoods, and looking at cost/benefit/decision criteria is a false dilemma. You can search this blog for a lot on this topic, hopefully put more clearly.

              • Michael Lew

                In a case where J is better supported than H, it is sensible to prefer J. However, if J is only a little better supported, then the degree of preference should be minimal. I don’t think that your argument carries any real weight in situations where there is a continuum of hypotheses. Where you restrict your interest to H0 and Ha, maybe; but where a significance test is analogous to estimation, not at all.

  3. Christian Hennig

    In the linked “rest”-part, Schachtman writes:
    “Clearly, multiplicity was a problem that diluted the meaning of the reported p-value, but the government never presented evidence of what the p-value, corrected for multiple testing, might be.”
    It seems that this is intended as something of a defence of Harkonen, but it misses the point. If you don’t specify in advance what you’re going to do when the data are in, there is no way to compute a “correct” p-value post hoc, because the p-value is a function of a well-defined testing procedure and if the procedure is not well-defined, there is no “true” p-value.
    (Schachtman’s blog would probably have been a better place for this remark but now it’s here…)

    • Christian: Well Schachtman occasionally “guest-blogs” here and has responded to questions on this blog. As a lawyer, I think he feels that having comments on his blog could somehow be problematic, maybe leading to lawsuits, what do I know?

      Anyway, that’s a good point I hadn’t thought of. I assumed they might try to adjust for the predesignated 9 endpoints and some reported # of other subgroups explored post-data.
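For what it’s worth, the arithmetic of such an adjustment is trivial; the hard part, as noted, is knowing what m (the number of comparisons actually performed) should be. A hypothetical sketch with invented numbers, not figures from the trial:

```python
# Hypothetical illustration: Bonferroni-adjusting a single reported
# p-value for the number of endpoints/subgroups actually examined.
# The comparison counts below are assumptions, not facts of the case.

def bonferroni(p, m):
    """Adjusted p-value when m comparisons were performed."""
    return min(1.0, p * m)

reported_p = 0.004          # a hypothetical "nominally significant" result
for m in (1, 9, 9 + 10):    # alone; 9 endpoints; plus 10 explored subgroups
    print(f"m = {m:2d} comparisons -> adjusted p = {bonferroni(reported_p, m):.3f}")
```

With 9 predesignated endpoints the hypothetical result would survive at the 0.05 level (0.004 × 9 = 0.036), but adding even ten post-data subgroups pushes it past 0.05 — and the true m for a post hoc search is, as Christian points out, not well defined at all.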

      The other issue I find intriguing is the one about “free speech”. I’ve no doubt there are highly precise legal stipulations of the sort being raised in Harkonen’s defense (e.g., other scientists don’t consider it unreasonable to say/report what he did).

    • Michael Lew

      The relationship between the P-value and the error rates is altered by dredging and multiple comparisons and experimenter’s intentions etc. However the meaning of a P-value as an index of evidence is NOT dependent on that relationship. The evidence is the evidence. You can choose to act as if it is unimportant or unconvincing, but the evidence is the evidence, and a P-value can be a statement of the evidence.

        • Michael: In your earlier comment you said, “The manner in which one responds to the evidence is what should be affected by pre-data issues like stopping rules and multiplicity of comparisons.” But why is your response changing if it makes no difference to the evidence? Those issues must be relevant to evidential import in order to warrant the different responses you speak of, no?

        • Michael Lew

          I’m not exactly sure what you mean, but this might make my understanding of statistical and scientific evidence clearer.

          The evidence in data, against the null and against any other arbitrary value of effect size that you might like to propose is contained within the likelihood function. The likelihood function is not affected by the stopping rules or the number of comparisons (dredging). The probability of obtaining strong misleading evidence can be increased by those things, though, so it makes sense to be able to respond to the evidence differently in different circumstances. (I have to point out that the probability of strong misleading evidence never becomes particularly high, See Royall’s papers for a clear examination of that issue.)

          For any test type and sample size each P-value corresponds to a particular likelihood function as long as the P-value is not adjusted for things like sequential sampling and multiplicity of testing (and, in practice, as long as the sampling distribution is well defined). Thus an unadjusted P-value is an index of the evidence in the data.

          A scientific response to evidence is, inevitably and properly, affected by things like how reliable the witness is, how well calibrated the reliability of the witness is, the consequences of mistakes, the scientist’s confidence in their own ability to evaluate all aspects of the problem, other relevant evidence, and their pre-existing belief systems. Not all of those factors are relevant to each data-based decision, but they are all relevant some of the time. No system that attempts to adjust the evidence parameter(s) to account for those factors can be well adapted to the various combinations of factors that come into play for different problems. Thus it is better to have an unadjusted, unadorned and consistent evidence parameter(s), and to incorporate those other influences by careful and rational consideration. That is what I think I mean by saying that one’s response to the evidence has to change rather than the evidence itself.

          Scientists should practice that careful and rational reasoning all of the time, and when they need to persuade others they should make principled arguments. The worst aspect of the error-decision framework is that it tends to remove the responsibility from the scientist to do those things, and often even makes it impossible for the scientist to do so.
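The stopping-rule point in dispute can be made concrete with a small simulation (a hypothetical sketch, not from any real trial): under a true null, peeking at the data after every batch of observations and stopping at the first p < 0.05 leaves the formula for each computed p-value unchanged, yet pushes the overall chance of declaring significance far above 5%.

```python
import math
import random
import statistics

random.seed(2)

def p_value(xs):
    """Two-sided z-test p-value, true mean 0 (null is true)."""
    z = statistics.mean(xs) * math.sqrt(len(xs))
    return math.erfc(abs(z) / math.sqrt(2))

trials, looks, batch = 2000, 10, 10
rejections = 0
for _ in range(trials):
    xs = []
    for _ in range(looks):          # peek after every batch of 10 draws
        xs += [random.gauss(0, 1) for _ in range(batch)]
        if p_value(xs) < 0.05:      # stop as soon as "significant"
            rejections += 1
            break

print(f"type I error with optional stopping: {rejections / trials:.2f}")
# a single fixed-n test at the end would give about 0.05
```

The p-value computation is identical at every look; only the sampling plan differs — which is precisely what the error-statistical account counts, and the pure likelihood account does not.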

          • Michael: It’s not clear, but maybe your careful scientist would arrive at the same scrutiny at that later stage. The thing is, the ability to make those judgments will turn on having principled arguments for why and how adjustments are needed to interpret what is, and is not, warranted to infer from the data.

            I think the apparent disagreement here may stem from a much-lampooned conception of significance tests (and Neyman-Pearson tests) as giving automatic, recipe-like procedures that fix error probabilities, run tests, and mechanically accept/reject—where these are identified with “acting” or “behaving” in a given manner (e.g., act as if a theory is true). This is an extreme “behavioristic” construal of tests; it is a caricature never recommended by any of the founders of tests. If the choice was between
            (a) reporting the various likelihoods over test and alternative hypotheses, and
            (b) applying an automatic test that outputs a decision based on pre-designated, arbitrary error probabilities,
            then I’d agree that (a) was to be preferred. There is an old, and not-so-old literature, based on this false dilemma, and you may be reacting to this.
            I, and most other frequentists, would recommend a third way. (Abelson, whom you mention, was also motivated by such overly mechanical uses of tests, but it’s been too long since I’ve looked at his book to speak of similarities or differences.)
            In the view I’d recommend, p-values are post-data, data-dependent measures, and one may use properties of the test, sample size, and so on, to ascertain the discrepancies and effects that are and are not well indicated by a given p-value. On this view, one avoids fallacies of rejection and of acceptance, and also spurious reports of p-values. Error probabilities have an evidential role: to determine the capacity of the test (or other method) to have detected discrepancies, discriminated real effects from artifacts, and discerned a variety of flaws. Knowledge of things like selection effects, stopping rules, data-dependent subgroups and endpoints can alter these evidential capacities. Thus, we need to take them into account in order to scrutinize evidence. (I’m fairly sure you’d agree that if a method is guaranteed to find a nominally “impressive” effect, and would never deem the data consistent with the null of no effect even if it were true, then reporting some nominally significant effect or other is scarcely well warranted.)
            But, as I like to say, rules are made to be broken: the creative violator (as I called her in EGEK 1996) may be able to show that one is getting around the prima facie concern in the particular case at hand. Background knowledge would enter here. I distinguish this “level” of evidential scrutiny from BOTH the statistical report/inference and any subsequent decisions or actions that might be based on what is learned. One is still figuring out what is warranted to infer, in this “middle” (or “metastatistical”) level. Anyway, lots of these issues arise throughout my work. (And I certainly don’t think significance tests are the only or even the primary tool of inference.)

  4. This is scary.
    Prosecutors have way too much power in this country.
    If we are going to jail people for bad statistical reasoning
    let’s also do it for bad economic reasoning.
    We can begin by putting the president and all of
    congress in prison.

    • N.D.: The issue of selectivity is bothersome for sure, but these reports fall under various regulations that obviously don’t exist for other egregiously bad reasoning.
      There’s a separate (legal/semantic) issue being argued in his defense (based on the Data Quality Act); namely, that the government should withdraw its description of his crime as having “falsified” results. But never mind that; the interesting part that I didn’t mention in my post is the link being drawn to the Matrixx decision. In that case, a company’s defense of making no mention of untoward side effects, simply because they were not statistically significant, was shot down (by the Supreme Court, no less!). I discuss that case in a Feb 8, 2012 post: https://errorstatistics.com/2012/02/08/distortions-in-the-court-philstock-feb-8/, and elsewhere on this blog.

  5. Very interesting… First, though, I’d like to say that I find it very immoral to use a placebo when testing medicines; the best available medicine should be used instead of a placebo, to see whether the new medicine does better than existing ones. (Though I don’t know if, in this particular case, there is any other medicine available, since they mention they are treating rare diseases.)

    Anyway, this aside, it worries me that Dr. Harkonen is sentenced to three years of probation, 200 hours of community service, and a fine of $20,000 for his interpretation of the data!!! Not for lying or fabricating results!!!

    To work on the non-prespecified subgroup mild-to-moderate IPF is more than reasonable; it is not a weird, strange subgroup created for the sole purpose of showing spurious results, and, in any case, with p = 0.084 the statement “demonstrated prolonged survival for IPF patients” for which he is condemned holds at that level!

    I don’t know… I want to believe that there is more in the sentence than what the article explains; otherwise we’re going to need a lawyer to do math from now on.

    • Francisco: yes there’s more (not that I know all of it*). Your comment led me to post the controversial report on today’s post.
      *I think Schachtman intends to return to this case at some point soon.

  6. Nathan Schachtman


    Sorry for the delay; yesterday I came down with the flu, and it’s in full bloom today. I will try to write some more about the case when I can keep my eyes open for more than a few minutes, but I think the comments above generally capture my thinking on the case. Harkonen’s press release represents bad statistical practice, but I have serious doubts about whether his improvident use of “demonstrate” should turn a mandatory communication with investors into criminal fraud.

    The press release does use “demonstrate” to describe the results of the Phase III clinical trial, but elsewhere in the press release, Harkonen mentions earlier trial results and clinical experience.

    I have seen such data dredging in observational epidemiologic studies, published or written up for NIH review. In my view, the remedy is open access to protocols, and ultimately underlying data.


    • Nathan:
      You say that Harkonen’s press release was “a mandatory communication with investors”. This suggests that the issue had largely to do with whether it would mislead investors about the drug’s prospects/influence on stock price, as in the Matrixx report.

  7. Schachtman,
    Thank you for covering this fascinating legal case. When should we expect the final verdict on the appeal?

    • Nathan Schachtman


      The case was argued to the appellate court on December 6th. The Ninth Circuit usually hands down its decisions within 3 months, unless there is a dissent or the three judges are trying to find common ground to decide the case. (I’ve seen cases languish on appeal for well over a year, but that’s unlikely.)

      If Harkonen or the government loses, I would expect what’s called a petition for rehearing, with a suggestion of rehearing en banc (which means that the entire bunch of Ninth Circuit judges in active service get together to hear the case). Such rehearings are rarely granted though.

      The next step would be for the loser below to petition the U.S. Supreme Court for a writ of certiorari. This review is discretionary, and the Supreme Court hears about 100 or so cases per year. Because this case involves first amendment arguments and also criminal sanctions, the case may have a heightened profile when one side or the other asks for review in the Supreme Court.


  8. Nathan Schachtman


    A few other comments. The clinical trial was completed. What was preliminary was the company’s disclosure of the results to the public at the end of the trial. Arguably a company has a duty to communicate good and bad news to investors, but just as clearly, it may not make misleading statements on material matters. The government prosecutors moved on this case because the preliminary results at issue here involved a non-approved use. After the Ziesche, et al., trial, published in 1999 in the New England Journal of Medicine, many pulmonary physicians started to use interferon gamma 1-b “off label,” as is their prerogative. The manufacturer, however, may not advertise or promote a drug or biologic for a non-approved indication.

    The DSMB was indeed involved here, although in an arguably gratuitous role. As noted, the trial was concluded. Thomas Fleming, however, objected to the language of the press release. This was probably beyond his contractual role as a member of the DSMB.

    The subsequent history is interesting. After phone conferences with the FDA, it was clear that an approval for idiopathic pulmonary fibrosis would not be forthcoming on the basis of post hoc analysis of a non-prespecified end point. The company started another clinical trial, the INSPIRE trial, which was terminated when the test drug showed lack of benefit at interim analysis. Interestingly, the INSPIRE clinical trial results were still written up and published. T. King, et al., “Effect of interferon gamma-1b on survival in patients with idiopathic pulmonary fibrosis (INSPIRE): a multicentre, randomised, placebo-controlled trial,” 374 Lancet 222 (2009).

    No one here should be surprised that the “demonstration” was ultimately falsified. The only question is whether the overstatement of the study’s conclusion, in a press release, was a crime.


    • Nathan: Presumably the published INSPIRE results were presented uncontroversially, yes? I’m curious as to how the “duplicity” appeal will work out. Can obiter dicta be retracted or clarified to avoid being misused in the manner you have discussed with respect to a number of cases before, and now this? What would you like to see happen (in relation to the remarks on statistical significance from the Matrixx case)? Is it fixed in stone? Hope you’re feeling better (Zicam works).
