“Courts Can and Must Acknowledge Multiple Comparisons in Statistical Analyses”
Nathan Schachtman, Esq., PC * October 14th, 2014
In excluding the proffered testimony of Dr. Anick Bérard, a Canadian perinatal epidemiologist at the Université de Montréal, the Zoloft MDL trial court discussed several methodological shortcomings and failures, including Bérard’s reliance upon claims of statistical significance from studies that conducted dozens and hundreds of multiple comparisons.[i] The Zoloft MDL court was not the first court to recognize the problem of over-interpreting the putative statistical significance of results that were one among many statistical tests in a single study. The court was, however, among a fairly small group of judges who have shown the needed statistical acumen in looking beyond the reported p-value or confidence interval to the actual methods used in a study[1].
A complete and fair evaluation of the evidence in situations such as occurred in the Zoloft birth defects epidemiology required more than the presentation of the size of the random error, or the width of the 95 percent confidence interval. When the sample estimate arises from a study with multiple testing, presenting the sample estimate with the confidence interval, or p-value, can be highly misleading if the p-value is used for hypothesis testing. The fact of multiple testing will inflate the false-positive error rate. Dr. Bérard ignored the context of the studies she relied upon. What was noteworthy was that Bérard encountered a federal judge who adhered to the assigned task of evaluating methodology and its relationship with conclusions.
* * * * * * *
There is no unique solution to the problem of multiple comparisons. Some researchers use Bonferroni or other quantitative adjustments to p-values or confidence intervals, whereas others reject adjustments in favor of qualitative assessments of the data in the full context of the study and its methods. See, e.g., Kenneth J. Rothman, “No Adjustments Are Needed For Multiple Comparisons,” 1 Epidemiology 43 (1990) (arguing that adjustments mechanize and trivialize the problem of interpreting multiple comparisons). Two things are clear from Professor Rothman’s analysis. First, for someone intent upon strict statistical significance testing, the presence of multiple comparisons means that the rejection of the null hypothesis cannot be done without further consideration of the nature and extent of both the disclosed and undisclosed statistical testing. Rothman, of course, has inveighed against strict significance testing under any circumstance, but the multiple testing would only compound the problem.
Second, although failure to adjust p-values or intervals quantitatively may be acceptable, failure to acknowledge the multiple testing is poor statistical practice. The practice is, alas, too prevalent for anyone to say that ignoring multiple testing is fraudulent, and the Zoloft MDL court certainly did not condemn Dr. Bérard as a fraudfeasor[2]. [emphasis mine]
I’m perplexed by this mixture of stances. If you don’t mention the multiple testing for which it is acceptable not to adjust, then you’re guilty of poor statistical practice; but it’s “too prevalent for anyone to say that ignoring multiple testing is fraudulent”. This appears to claim it’s poor statistical practice if you fail to mention that your results are due to multiple testing, but “ignoring multiple testing” (which could mean failing to adjust or, more likely, failing to mention it) is not fraudulent. Perhaps it’s a questionable research practice (QRP). It’s back to “50 shades of grey between QRPs and fraud.”
[…read his full blogpost here]
Previous cases have also acknowledged the multiple testing problem. In litigation claims for compensation for brain tumors from cell phone use, plaintiffs’ expert witness relied upon subgroup analysis, which added to the number of tests conducted within the epidemiologic study at issue. Newman v. Motorola, Inc., 218 F. Supp. 2d 769, 779 (D. Md. 2002), aff’d, 78 Fed. App’x 292 (4th Cir. 2003). The trial court explained:
“[Plaintiff’s expert] puts undue emphasis on the positive findings for isolated subgroups of tumors. As Dr. Stampfer explained, it is not good scientific methodology to highlight certain elevated subgroups as significant findings without having earlier enunciated a hypothesis to look for or explain particular patterns, such as dose-response effect. In addition, when there is a high number of subgroup comparisons, at least some will show a statistical significance by chance alone.”
I’m going to require, as part of its meaning, that a statistically significant difference not be one due to “chance variability” alone. Then to avoid self-contradiction, this last sentence might be put as follows: “when there is a high number of subgroup comparisons, at least some will show purported or nominal or unaudited statistical significance by chance alone. [Which term do readers prefer?] If one hunts down one’s hypothesized comparison in the data, then the actual p-value will not equal, and will generally be greater than, the nominal or unaudited p-value.”
So, I will insert “nominal” where needed below (in red).
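To make the nominal/actual distinction concrete, here is a minimal sketch (my illustration, not from Schachtman’s post), assuming the reported comparison is simply the best-looking of k independent tests:

```python
# Hypothetical numbers for illustration: if the most impressive of k independent
# comparisons is the one reported, the actual p-value of that selected comparison
# is P(min p <= p_nominal | all nulls true) = 1 - (1 - p_nominal)**k.
k = 16            # number of comparisons searched (assumed)
p_nominal = 0.03  # the reported, unaudited p-value of the "winner" (assumed)

p_actual = 1 - (1 - p_nominal) ** k
print(f"nominal p = {p_nominal:.3f}; actual p after hunting among {k} tests = {p_actual:.2f}")
# A nominal 0.03 becomes roughly 0.39 once the search is taken into account.
```

Under dependence among the tests the exact conversion differs, but the direction is the same: the actual p-value of a hunted-down comparison is at least as large as its nominal p-value.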
Texas Sharpshooter fallacy
Id. And shortly after the Supreme Court decided Daubert, the Tenth Circuit faced the reality of data dredging in litigation, and its effect on the meaning of “significance”:
“Even if the elevated levels of lung cancer for men had been [nominally] statistically significant a court might well take account of the statistical “Texas Sharpshooter” fallacy in which a person shoots bullets at the side of a barn, then, after the fact, finds a cluster of holes and draws a circle around it to show how accurate his aim was. With eight kinds of cancer for each sex there would be sixteen potential categories here around which to “draw a circle” to show a [nominally] statistically significant level of cancer. With independent variables one would expect one statistically significant reading in every twenty categories at a 95% confidence level purely by random chance.”
The Texas sharpshooter fallacy is one of my all time favorites. One purports to be testing the accuracy of his aim, when in fact that is not the process that gave rise to the impressive-looking (nominal) cluster of hits. The results do not warrant inferences about his ability to accurately hit a target, since that hasn’t been well-probed.
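For the court’s sixteen categories the arithmetic is easy to make explicit (my worked check, not part of the opinion), assuming independent tests each run at the 0.05 level with no true effects anywhere:

```latex
\[
  E[\text{nominal hits}] = 16 \times 0.05 = 0.8,
  \qquad
  P(\text{at least one nominal hit}) = 1 - (1 - 0.05)^{16} \approx 0.56 .
\]
```

So there are better-than-even odds of being able to “draw a circle” around some nominally significant category even when nothing but chance is at work.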
[…read his full blogpost here]
The notorious Wells[4] case was cited by the Supreme Court in Matrixx Initiatives[5] for the proposition that statistical significance was unnecessary. Ironically, at least one of the studies relied upon by the plaintiffs’ expert witnesses in Wells had some outcomes with p-values below five percent. The problem, addressed by defense expert witnesses and ignored by the plaintiffs’ witnesses and Judge Shoob, was that there were over 20 reported outcomes, and probably many more outcomes analyzed but not reported. Accordingly, some qualitative or quantitative adjustment was required in Wells. See Hans Zeisel & David Kaye, Prove It With Figures: Empirical Methods in Law and Litigation 93 (1997)[6].
Maybe Schachtman will be willing to explain the first sentence of the above para. We’ve discussed the Matrixx case several times on this blog, but I don’t know the notorious Wells case.
Reference Manual on Scientific Evidence
David Kaye’s and the late David Freedman’s chapter on statistics in the third, most recent, edition of the Reference Manual offers some helpful insights into the problem of multiple testing:
“4. How many tests have been done?
Repeated testing complicates the interpretation of significance levels. If enough comparisons are made, random error almost guarantees that some will yield ‘significant’ findings, even when there is no real effect. To illustrate the point, consider the problem of deciding whether a coin is biased. The probability that a fair coin will produce 10 heads when tossed 10 times is (1/2)^10 = 1/1024. Observing 10 heads in the first 10 tosses, therefore, would be strong evidence that the coin is biased. Nonetheless, if a fair coin is tossed a few thousand times, it is likely that at least one string of ten consecutive heads will appear. Ten heads in the first ten tosses means one thing; a run of ten heads somewhere along the way to a few thousand tosses of a coin means quite another. A test—looking for a run of ten heads—can be repeated too often.
Artifacts from multiple testing are commonplace. Because research that fails to uncover significance often is not published, reviews of the literature may produce an unduly large number of studies finding statistical significance.111 Even a single researcher may examine so many different relationships that a few will achieve [nominal] statistical significance by mere happenstance. Almost any large dataset—even pages from a table of random digits—will contain some unusual pattern that can be uncovered by diligent search. Having detected the pattern, the analyst can perform a statistical test for it, blandly ignoring the search effort. [Nominal] statistical significance is bound to follow.
There are statistical methods for dealing with multiple looks at the data, which permit the calculation of meaningful p-values in certain cases.112 However, no general solution is available… . In these situations, courts should not be overly impressed with claims that estimates are [nominally] significant. …”
Reference Manual on Scientific Evidence at 256-57 (3d ed. 2011).
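The Manual’s coin illustration is easy to check by simulation. Here is a quick sketch (my code, not the Manual’s), taking “a few thousand” to be 3,000 tosses:

```python
# Compare the evidential meaning of "10 heads in the first 10 tosses" with
# "a run of 10 heads somewhere in 3,000 tosses" for a fair coin.
import numpy as np

rng = np.random.default_rng(1)
sims, tosses, run_len = 5_000, 3_000, 10

first_ten = 0   # 10 heads in the first 10 tosses
anywhere = 0    # a run of 10 heads somewhere in the sequence
for _ in range(sims):
    seq = rng.integers(0, 2, size=tosses)                      # 1 = heads, fair coin
    first_ten += bool(seq[:run_len].all())
    windows = np.convolve(seq, np.ones(run_len, dtype=int), mode="valid")
    anywhere += bool((windows == run_len).any())                # any all-heads window

print(f"P(10 heads in first 10 tosses)     ≈ {first_ten / sims:.4f} (exact: 1/1024 ≈ 0.001)")
print(f"P(run of 10 heads in 3,000 tosses) ≈ {anywhere / sims:.2f}")
# The second probability comes out around three-quarters, so the same "finding"
# carries very different evidential weight depending on how much searching occurred.
```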
When a lawyer asks a witness whether a sample statistic is “statistically significant,” there is the danger that the answer will be interpreted or argued as a Type I error rate, or worse yet, as a posterior probability for the null hypothesis. When the sample statistic has a p-value below 0.05, in the context of multiple testing, completeness requires the presentation of the information about the number of tests and the distorting effect of multiple testing on preserving a pre-specified Type I error rate. Even a [nominally] statistically significant finding must be understood in the full context of the study. [emphasis mine]
I don’t understand the danger of its being reported as a Type I error, especially when the next sentence correctly notes “the distorting effect of multiple testing on preserving a pre-specified Type I error rate.” The only danger could be reporting the Type I error probability that would have held under the assumption that there would be a predesignated hypothesis and no selection effects, when in fact multiple testing occurred. Knowing there was going to be multiple testing, the person could report, pre-data: “Since we are going to be hunting and searching for nominal significance among k factors, the Type I error rate is quite high”. Or, the predesignated error rate could be low, if each of k tests is adjusted.
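A minimal simulation sketch of that last point (mine, with assumed numbers) contrasts the pre-data error rate when hunting among k factors with the rate when each of the k tests is adjusted:

```python
# Under the null, p-values are uniform on (0, 1). Hunting for the smallest of k
# p-values inflates the family-wise Type I error rate; testing each comparison
# at alpha/k (a Bonferroni-style adjustment) keeps that rate at or below alpha.
import numpy as np

rng = np.random.default_rng(2)
k, alpha, sims = 20, 0.05, 20_000   # assumed: 20 factors searched

p = rng.uniform(size=(sims, k))
unadjusted = (p.min(axis=1) < alpha).mean()
adjusted = (p.min(axis=1) < alpha / k).mean()

print(f"Pre-data error rate, hunting among {k} factors at alpha = {alpha}: ≈ {unadjusted:.2f}")
print(f"Same search, with each test run at alpha/{k}:                    ≈ {adjusted:.3f}")
# Roughly 0.64 versus about 0.05: the pre-specified rate is preserved only by adjusting.
```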
Some texts and journals recommend that the Type I error rate not be modified in the paper, as long as readers can observe the number of multiple comparisons that took place and make the adjustment for themselves. [emphasis mine] Most jurors and judges are not sufficiently knowledgeable to make the adjustment without expert assistance, and so the fact of multiple testing, and its implication, are additional examples of how the rule of completeness may require the presentation of appropriate qualifications and explanations at the same time as the information about “statistical significance.”
This suggestion that readers “make the adjustment for themselves” reminds me of the recommendation that came up in a recent post about taking the stopping rule into account “later on”. If it influences the evidential warrant of the data, then it makes no sense to say, “here’s the evidence but I engaged in various shenanigans, so now you go figure out what the real evidence is.”
* * * * *
Despite the guidance provided by the Reference Manual, some courts have remained resistant to the need to consider multiple comparison issues. Statistical issues arise frequently in securities fraud cases against pharmaceutical companies, involving the need to evaluate and interpret clinical trial data for the benefit of shareholders. In a typical case, joint venturers Aeterna Zentaris Inc. and Keryx Biopharmaceuticals, Inc., were both targeted by investors for alleged Rule 10b-5 violations involving statements of clinical trial results, made in SEC filings, press releases, investor presentations and investor conference calls from 2009 to 2012.[ii] The clinical trial at issue tested perifosine in conjunction with, and without, other therapies, in multiple arms, which examined efficacy for seven different types of cancer. After a preliminary phase II trial yielded promising results for metastatic colon cancer, the colon cancer arm proceeded. According to plaintiffs, the defendants repeatedly claimed that perifosine had demonstrated “statistically significant positive results.” In re Keryx at *2, 3.
The plaintiffs alleged that defendants’ statements omitted material facts, including the full extent of multiple testing in the design and conduct of the phase II trial, without adjustments supposedly “required” by regulatory guidance and generally accepted statistical principles. The plaintiffs asserted that the multiple comparisons involved in testing perifosine in so many different kinds of cancer patients, at various doses, with and against so many different types of other cancer therapies, compounded by multiple interim analyses, inflated the risk of Type I errors such that some statistical adjustment should have been applied before claiming that a statistically significant survival benefit had been found in one arm, with colorectal cancer patients. In re Keryx at *2-3, *10.
The trial court dismissed these allegations given that the trial protocol had been published, although that publication came over two years after the initial press release, which started the class period and which failed to disclose the full extent of multiple testing and the lack of statistical correction. … The trial court was loath to allow securities fraud claims over allegations of improper statistical methodology, which:
“would be equivalent to a determination that if a researcher leaves any of its methodology out of its public statements — how it did what it did or was planning to do — it could amount to an actionable false statement or omission. This is not what the law anticipates or requires.” [emphasis mine]
Talk about an illicit slippery slope. Requiring information on the source of erroneous interpretations of statistical evidence is not “equivalent” to requiring the researcher report every detail about what it was planning to do.
In re Keryx at *10[7]. According to the trial court, providing p-values for comparisons between therapies, without disclosing the extent of unplanned interim analyses or the number of multiple comparisons is “not falsity; it is less disclosure than plaintiffs would have liked.” Id. at *11.
[…read his full blogpost here]
The court’s characterization of the fraud claims as a challenge to trial methodology rather than data interpretation and communication decidedly evaded the thrust of the plaintiffs’ fraud complaint. Data interpretation will often be part of the methodology outlined in a protocol. The Keryx case also confused criticism of the design and execution of a clinical trial with criticism of the communication of the trial results.
Exactly!
I’m not sure I understand at this point what the “Reference Manual”, or Daubert, or its current manifestation, are really requiring (on multiplicity); and as would be expected of any sharp lawyer, Schachtman makes some intricate gradations.
Please see the full blogpost and his extended footnotes here.
One clever gambit I often come across by way of excuse (for QRPs along the lines of selection effects) is that it’s a “philosophical issue”. How can you hold someone accountable for favoring one of rival philosophical positions? If it’s not put as a “free speech” issue, it’s a “freedom of philosophy” issue. How con-veenient!
[i] See In re Zoloft (Sertraline Hydrochloride) Prods. Liab. Litig., MDL No. 2342; 12-md-2342, 2014 U.S. Dist. LEXIS 87592; 2014 WL 2921648 (E.D. Pa. June 27, 2014) (Rufe, J.).
[ii] Abely v. Aeterna Zentaris Inc., No. 12 Civ. 4711(PKC), 2013 WL 2399869 (S.D.N.Y. May 29, 2013); In re Keryx Biopharms, Inc., Sec. Litig., 1307(KBF), 2014 WL 585658 (S.D.N.Y. Feb. 14, 2014).
*Schachtman’s legal practice focuses on the defense of product liability suits, with an emphasis on the scientific and medico-legal issues. He teaches a course in statistics in the law at the Columbia Law School, NYC.
Mayo,
Good morning! Not too much violence to my original post, but you do challenge me on a few topics. As for my attempt to distinguish the Harkonen case and some of the other legal cases, I know I have not, to date, persuaded you. We lawyers reserve “guilty” for those who have committed crimes. For me, when many NIH-funded researchers publish articles with subgroup analyses not prespecified in their protocols and not identified as post hoc analyses, and when federal government researchers at NIH tout non-prespecified outcomes in RCTs, again without notice that the outcomes were not prespecified, and when the US gov’t takes the position in another case before the Supreme Court (Matrixx Initiatives v. Siracusano) that statistical significance is not necessary to “demonstrate” causation, then I think the government loses its moral, legal, and scientific standing to prosecute someone like Dr Harkonen.
My fellow amici may have had different goals, but I was never trying to advance Harkonen’s approach as “best practice” or the like. I would even heartily agree that such a practice should lead to Harkonen’s opinion’s being excluded if offered as testimony in litigation, but I don’t believe it is a fraud.
Why not? First, Harkonen had other information about efficacy. The 1999 Austrian RCT of interferon gamma 1b showed efficacy, and was published in the NEJM. The Austrian RCT was then continued, again with a strong showing of benefit for the therapy arm of the trial. The specific trial that Harkonen had new data on had a showing of survival benefit (a prespecified outcome), with large “effect” size, at p = 0.08, which shrank to 0.055 when the data were analyzed for compliance with the protocol. And when the researchers published their data with time-to-event analyses, the hazard ratio for the entire cohort was well below 1.0, with p < 0.05.
The “offending” subgroup analysis, admittedly non-prespecified, had a p = 0.004, and the gov't never showed, as was its burden, that this p-value would have been inflated above 5% if adjusted. Nor did the gov't show, despite the multiple attempts to define the “right” subgroup, that the reported subgroup of mild to moderate cases of disease was clinically implausible, or that the magnitude of the mortality benefit was clinically insignificant.
Anyway, a longer story than I can bang out here in comments, but I would be happy to share my amicus brief with anyone who wants to read it.
As for the "notorious" Wells case, I have blogged extensively about it. If you like, I can provide links. The case was cited by the gov't in the Matrixx case, and then again by the Supreme Court in Matrixx, as an example of how causation decisions can be made without statistical significance, but as I have pointed out, plaintiffs' expert witnesses in Wells actually had studies that showed at least "nominal" significance. The problem was multiple testing (both announced and covert – but no one was prosecuted), and confounding by multiple exposures (arsenical spermicides, known to be genotoxic, were included in some of the studies, but were not part of the exposure claimed to have caused the birth defects in the specific case).
So there you have it for Sunday morning. A response on Harkonen, and an elaboration on Wells. I will try to respond to other points, and answer questions if I can.
Nathan
Nathan: Thanks so much. I actually was trying hard to keep away from Harkonen, but rather figure out the general standpoint from your post, as given in its title. (Maybe the Wells case inadvertently triggered Dr Hark, but I don’t want to go back to him!)
That’s fine. You might well wonder how I reconcile my urge for best practices with my opposition to a criminal prosecution for someone who fails to live up to those best practices. I spent a lot of time on Wells case in my blog. See http://schachtmanlaw.com/wells-v-ortho-pharmaceutical-corp-reconsidered-part-1/
and parts 2 through 6 of this series, and other posts as well. Professor Gastwirth wrote an article about the case, in which he seemed to try to say it wasn’t so off the wall, but I think he missed some really important points. The remarkable thing, of course, is that the case would not (or should not) have survived Daubert/Rule 702 review if decided today, but the Solicitor General and then the Supreme Court cited the case in 2011 as though it were still good law.
Nathan: OK, I refreshed my memory of the Wells case. I never should have alluded to it because, as I suspected, it has nothing to do with the issue of interest here* which is: what’s the contemporary handling of multiplicity in appraising statistical evidence in legal cases, and what does Schachtman’s “it can and must” boil down to? I don’t have sufficient knowledge of the order in which the legal standpoint on this issue has changed. So if one were to report the current view, would the Freedman chapter be the relevant one?
*It does, however, make it doubly ironic that the case should have been referred to in the goofy Matrixx case. Error upon error. Don’t these Supreme Court members have huge staffs to look up relevant precedents? Perhaps since it was “obiter dicta” (or however that goes) no one cared much, but surely there were examples of egregious side effects shown without resort to statistics. I don’t want to confuse the current discussion with that case, so I withdraw my Wells question, which was just curiosity.
Nathan: Are you saying they use one standard for expert testimony and another for criminal culpability?
I hate to crash this party, but I am quite perplexed. Isn’t Nathan dead right about the multiple testing problem? I’m not certain where the disagreement between Deborah and Nathan is …
Enrique: No party to crash, so thanks for your comment and reblog. Nathan and I don’t disagree on multiple testing, so far as I can see. I accidentally alluded to a certain case where we’ve debated how culpable a given salesman, I mean doctor, was. But that wasn’t testimony, and I take it he didn’t hide the multiple testing. Anyway, that’s an interesting legal case (more for the side precedents) but does not directly bear on this issue. I’m still not exactly sure whether Schachtman’s “can and must” alludes to testimony only rather than culpability of researchers/doctors issuing reports*. I take it his position on the latter is that one needs to consider it on a case by case basis, and I agree with that. Even aside from that, legal contexts really do/can differ from scientific ones.
*I’m more inclined than he seems to be to call violators “fraudfeasors”–a good term!
enrique,
The Harkonen case has a complicated fact set, not easily summarized here. But to make it very short, he was prosecuted for Wire Fraud, for reporting a clinical trial that, in his words, demonstrated a survival benefit in a subgroup that was not prespecified. The overall mortality benefit was p = 0.08, which shrank to 0.055 on a per protocol analysis. So, very close even without data dredging. And, there was a prior independent trial that showed benefit, with a very low p value. I conceded that Harkonen’s practice in not revealing that the analysis was not prespecified was poor practice, but hardly a criminal fraud, especially considering the complexity that he was speaking about multiple trials. In some of our past exchanges, Mayo took the position that Harkonen’s analysis was indefensible, and I had responded, legally with a demurrer: poor practice, which is all too common, but too close to the line to put the man in prison unless we wanted to throw a LOT of scientists in the hoosegow as well.
Nathan
Nathan: I think Enrique was wondering about the (or your) general position on multiple testing in the law. I may have run together stipulations only relevant for scientific testimony, being a legal outsider.
But anyone who does want to read about Harkonen can search this blog. (I wonder what company he’s in nowadays.)
Here’s a link to a 2012 post that linked to the third edition of the reference manual on stat in the law–I don’t know how often these come out.
https://errorstatistics.com/2012/07/10/philstatlaw-reference-manual-on-scientific-evidence-3d-ed-on-statistical-significance-schachtman/
I suspect that Dr Harkonen rarely gets company these days. The Reference Manual (2011) is in its Third edition. The first came out in 1996, and the second edition, in 2000. Multiple testing is often overlooked by courts, and claims of statistical significance are taken at face value despite obvious multiple testing and post hoc analyses. They are taken at face value, in my view, because many social scientists and epidemiologists are improvident in how they interpret significance probabilities in their published articles, and courts are reluctant to go beyond the language that is used in peer-reviewed articles.
Nathan: But the case, through many appeals, was clearly faulting him for multiple testing, post-data endpoints, etc. and in a manner that impressed me with its apparent codification. How advanced the “stat evidence in the law” manual appeared, compared to many social sciences, I thought. Of course I recognized that the issue is also routinely thrown up (in legal settings) as “representing a mere philosophical disagreement”, thereby claiming freedom from culpability. So I guess it’s a mixed bag.