“Courts Can and Must Acknowledge Multiple Comparisons in Statistical Analyses”

Nathan Schachtman, Esq., PC*, October 14th, 2014

In excluding the proffered testimony of Dr. Anick Bérard, a Canadian perinatal epidemiologist at the Université de Montréal, the *Zoloft MDL* trial court discussed several methodological shortcomings and failures, including Bérard’s reliance upon claims of statistical significance from studies that conducted dozens and hundreds of multiple comparisons.[i] The *Zoloft MDL* court was not the first court to recognize the problem of over-interpreting the putative statistical significance of results that were one among many statistical tests in a single study. The court was, however, among a fairly small group of judges who have shown the needed statistical acumen in looking beyond the reported p-value or confidence interval to the actual methods used in a study[1].

A complete and fair evaluation of the evidence in situations as occurred in the Zoloft birth defects epidemiology required more than the presentation of the size of the random error, or the width of the 95 percent confidence interval. When the sample estimate arises from a study with multiple testing, presenting the sample estimate with the confidence interval, or *p*-value, can be highly misleading if the p-value is used for hypothesis testing. The fact of multiple testing will inflate the false-positive error rate. Dr. Bérard ignored the context of the studies she relied upon. What was noteworthy is that Bérard encountered a federal judge who adhered to *the assigned task of evaluating methodology and its relationship with conclusions.*

* * * * * * *

There is no unique solution to the problem of multiple comparisons. Some researchers use Bonferroni or other quantitative adjustments to p-values or confidence intervals, whereas others reject adjustments in favor of qualitative assessments of the data in the full context of the study and its methods. *See, e.g.*, Kenneth J. Rothman, “No Adjustments Are Needed For Multiple Comparisons,” 1 *Epidemiology* 43 (1990) (arguing that adjustments mechanize and trivialize the problem of interpreting multiple comparisons). Two things are clear from Professor Rothman’s analysis. First, for someone intent upon strict statistical significance testing, the presence of multiple comparisons means that the rejection of the null hypothesis cannot be done without further consideration of the nature and extent of both the disclosed and undisclosed statistical testing. Rothman, of course, has inveighed against strict significance testing under any circumstance, but the multiple testing would only compound the problem.

*Second, although failure to adjust p-values or intervals quantitatively may be acceptable, failure to acknowledge the multiple testing is poor statistical practice. The practice is, alas, too prevalent for anyone to say that ignoring multiple testing is fraudulent, and the Zoloft MDL court certainly did not condemn Dr. Bérard as a fraudfeasor[2].* [emphasis mine]

I’m perplexed by this mixture of stances. If you don’t mention the multiple testing for which it is acceptable not to adjust, then you’re guilty of poor statistical practice; but it’s “too prevalent for anyone to say that ignoring multiple testing is fraudulent”. This appears to claim it’s poor statistical practice if you fail to mention your results are due to multiple testing, but “ignoring multiple testing” (which could mean failing to adjust or, more likely, failing to mention it) is not fraudulent. Perhaps it’s a questionable research practice (QRP). It’s back to “50 shades of grey between QRPs and fraud.”

[…read his full blogpost here]

Previous cases have also acknowledged the multiple testing problem. In litigation claims for compensation for brain tumors from cell phone use, plaintiffs’ expert witness relied upon subgroup analysis, which added to the number of tests conducted within the epidemiologic study at issue. *Newman v. Motorola, Inc.*, 218 F. Supp. 2d 769, 779 (D. Md. 2002), *aff’d*, 78 Fed. App’x 292 (4th Cir. 2003). The trial court explained:

“[Plaintiff’s expert] puts undue emphasis on the positive findings for isolated subgroups of tumors. As Dr. Stampfer explained, it is not good scientific methodology to highlight certain elevated subgroups as significant findings without having earlier enunciated a hypothesis to look for or explain particular patterns, such as dose-response effect. In addition, when there is a high number of subgroup comparisons, at least some will show a statistical significance by chance alone.”

I’m going to require, as part of its meaning, that a statistically significant difference not be one due to “chance variability” alone. Then to avoid self-contradiction, this last sentence might be put as follows: *“when there is a high number of subgroup comparisons, at least some will show purported or nominal or unaudited statistical significance by chance alone. [Which term do readers prefer?] If one hunts down one’s hypothesized comparison in the data, then the actual p-value will not equal, and will generally be greater than, the nominal or unaudited p-value.”*

*So, I will insert “nominal” where needed below (in brackets).*

Texas Sharpshooter fallacy

*Id.* And shortly after the Supreme Court decided *Daubert*, the Tenth Circuit faced the reality of data dredging in litigation, and its effect on the meaning of “significance”:

“Even if the elevated levels of lung cancer for men had been [nominally] statistically significant a court might well take account of the statistical “Texas Sharpshooter” fallacy in which a person shoots bullets at the side of a barn, then, after the fact, finds a cluster of holes and draws a circle around it to show how accurate his aim was. With eight kinds of cancer for each sex there would be sixteen potential categories here around which to “draw a circle” to show a [nominally] statistically significant level of cancer. With independent variables one would expect one statistically significant reading in every twenty categories at a 95% confidence level purely by random chance.”

*The Texas sharpshooter fallacy is one of my all time favorites. One purports to be testing the accuracy of his aim, when in fact that is not the process that gave rise to the impressive-looking (nominal) cluster of hits. The results do not warrant inferences about his ability to accurately hit a target, since that hasn’t been well-probed.*
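The arithmetic behind the Tenth Circuit’s remark can be sketched in a few lines of Python. This is only an illustration, under the court’s own stated assumption of independent categories and a nominal 0.05 cutoff for each one; the function names are invented here and come from neither the opinion nor Schachtman’s post. The last function also puts a number on the earlier point that the actual p-value of a hunted-down result generally exceeds the nominal one.

```python
def expected_false_positives(k, alpha=0.05):
    """Expected count of nominally significant results among k true nulls."""
    return k * alpha


def prob_at_least_one(k, alpha=0.05):
    """Chance that at least one of k independent true-null tests falls below alpha."""
    return 1 - (1 - alpha) ** k


def actual_p_of_smallest(p_nominal, k):
    """Actual p-value when p_nominal is the smallest of k independent nominal p-values."""
    return 1 - (1 - p_nominal) ** k


# 16 = the eight cancers for each sex in the court's example (assumed independent).
for k in (1, 16, 20):
    print(k, expected_false_positives(k), round(prob_at_least_one(k), 3))

# A nominal p of 0.04 (a made-up figure) that was hunted down among 16 categories.
print(round(actual_p_of_smallest(0.04, 16), 3))
```

With sixteen categories, the chance of drawing at least one nominally significant “circle” when nothing is going on is about 0.56, and a nominal p-value of 0.04 found by searching all sixteen corresponds to an actual p-value of about 0.48. Dependence among the categories would change these figures, which is part of why no single adjustment is agreed upon.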

[…read his full blogpost here]

The notorious *Wells*[4] case was cited by the Supreme Court in *Matrixx Initiatives*[5] for the proposition that statistical significance was unnecessary. Ironically, at least one of the studies relied upon by the plaintiffs’ expert witnesses in *Wells* had some outcomes with p-values below five percent. The problem, addressed by defense expert witnesses and ignored by the plaintiffs’ witnesses and Judge Shoob, was that there were over 20 reported outcomes, and probably many more outcomes analyzed but not reported. Accordingly, some qualitative or quantitative adjustment was required in *Wells*. *See* Hans Zeisel & David Kaye, *Prove It With Figures: Empirical Methods in Law and Litigation* 93 (1997)[6].

Maybe Schachtman will be willing to explain the first sentence of the above para. We’ve discussed the Matrixx case several times on this blog, but I don’t know the notorious Wells case.

David Kaye’s and the late David Freedman’s chapter on statistics in the third, most recent, edition of the *Reference Manual* offers some helpful insights into the problem of multiple testing:

“4. How many tests have been done? Repeated testing complicates the interpretation of significance levels. If enough comparisons are made, random error almost guarantees that some will yield ‘significant’ findings, even when there is no real effect. To illustrate the point, consider the problem of deciding whether a coin is biased. The probability that a fair coin will produce 10 heads when tossed 10 times is $(1/2)^{10} = 1/1024$. Observing 10 heads in the first 10 tosses, therefore, would be strong evidence that the coin is biased. Nonetheless, if a fair coin is tossed a few thousand times, it is likely that at least one string of ten consecutive heads will appear. Ten heads in the first ten tosses means one thing; a run of ten heads somewhere along the way to a few thousand tosses of a coin means quite another. A test—looking for a run of ten heads—can be repeated too often.

Artifacts from multiple testing are commonplace. Because research that fails to uncover significance often is not published, reviews of the literature may produce an unduly large number of studies finding statistical significance.[111] Even a single researcher may examine so many different relationships that a few will achieve [nominal] statistical significance by mere happenstance. Almost any large dataset—even pages from a table of random digits—will contain some unusual pattern that can be uncovered by diligent search. Having detected the pattern, the analyst can perform a statistical test for it, blandly ignoring the search effort. [Nominal] statistical significance is bound to follow.

There are statistical methods for dealing with multiple looks at the data, which permit the calculation of meaningful *p*-values in certain cases.[112] However, no general solution is available… . In these situations, courts should not be overly impressed with claims that estimates are [nominally] significant. …”

*Reference Manual on Scientific Evidence* at 256-57 (3d ed. 2011).

When a lawyer asks a witness whether a sample statistic is “statistically significant,” there is the danger that the answer will be interpreted or argued as a Type I error rate, or worse yet, as a posterior probability for the null hypothesis. *Even a [nominally] statistically significant finding must be understood in the full context of the study.* [emphasis mine] When the sample statistic has a p-value below 0.05, in the context of multiple testing, completeness requires the presentation of the information about the number of tests and the distorting effect of multiple testing on preserving a pre-specified Type I error rate.

I don’t understand the danger of its being reported as a Type I error rate, especially when the next sentence correctly notes “the distorting effect of multiple testing on preserving a pre-specified Type I error rate.” The only danger could be reporting the Type I error probability that would have held under the assumption there would be a predesignated hypothesis and no selection effects, when in fact multiple testing occurred. Knowing there was going to be multiple testing, the person could report, pre-data: “Since we are going to be hunting and searching for nominal significance among *k* factors, the Type I error rate is quite high”. Or, the predesignated error rate could be low, if each of *k* tests is adjusted.
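To make the second option concrete, here is a minimal sketch of one familiar way of adjusting each of *k* tests, the Bonferroni correction mentioned at the outset. It is offered as an illustration only: the helper names and the three p-values are hypothetical, and Bonferroni is just one of several adjustments, not a method that Schachtman or the *Reference Manual* singles out as required.

```python
def bonferroni_threshold(alpha, k):
    """Per-test cutoff that keeps the overall Type I error rate at or below alpha."""
    return alpha / k


def bonferroni_adjust(p_values):
    """Adjusted p-values: each nominal p multiplied by the number of tests, capped at 1."""
    k = len(p_values)
    return [min(1.0, p * k) for p in p_values]


# Hypothetical nominal p-values from k = 3 comparisons in one study.
nominal = [0.01, 0.04, 0.5]
print(bonferroni_threshold(0.05, len(nominal)))
print(bonferroni_adjust(nominal))
```

On these made-up numbers, the comparison with a nominal p-value of 0.04 no longer clears a 0.05 cutoff once the three looks at the data are counted, while the one at 0.01 still does. The correction needs no independence assumption, at the price of being conservative when the tests are correlated.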

Most jurors and judges are not sufficiently knowledgeable to make the adjustment without expert assistance, and so the fact of multiple testing, and its implication, are additional examples of how the rule of completeness may require the presentation of appropriate qualifications and explanations at the same time as the information about “statistical significance.” Some texts and journals recommend that the Type I error rate not be modified in the paper, as long as readers can observe the number of multiple comparisons that took place and make the adjustment for themselves. [emphasis mine]

This suggestion that readers “make the adjustment for themselves” reminds me of the recommendation that came up in a recent post about taking the stopping rule into account “later on”. If it influences the evidential warrant of the data, then it makes no sense to say, “here’s the evidence but I engaged in various shenanigans, so now you go figure out what the real evidence is.”
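The *Reference Manual*’s coin illustration, quoted above, can be checked exactly. The sketch below assumes a fair coin and independent tosses; `prob_run_of_heads` is a name made up for this example, and 3,000 tosses is only one reading of “a few thousand.” It tracks the length of the current streak of heads, toss by toss, and accumulates the probability that a run of ten has been completed.

```python
def prob_run_of_heads(n_tosses, run_length=10):
    """Exact probability that a fair coin shows at least one run of
    `run_length` consecutive heads somewhere in `n_tosses` tosses."""
    # streak[j] = P(no full run so far, and the current streak of heads is exactly j)
    streak = [0.0] * run_length
    streak[0] = 1.0
    hit = 0.0  # probability that a full run has already occurred
    for _ in range(n_tosses):
        nxt = [0.0] * run_length
        for j, p in enumerate(streak):
            nxt[0] += 0.5 * p          # tails: the streak resets to zero
            if j + 1 == run_length:
                hit += 0.5 * p         # heads: the run is completed
            else:
                nxt[j + 1] += 0.5 * p  # heads: the streak grows by one
        streak = nxt
    return hit


print(prob_run_of_heads(10))    # ten heads in the first ten tosses
print(prob_run_of_heads(3000))  # a run of ten heads somewhere in 3,000 tosses
```

The first call returns 1/1024, the figure in the quoted passage. The second comes out comfortably above one half, which is the sense in which the same “test” means something quite different once the search effort is taken into account.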

* * * * *

Despite the guidance provided by the *Reference Manual*, some courts have remained resistant to the need to consider multiple comparison issues. Statistical issues arise frequently in securities fraud cases against pharmaceutical companies, involving the need to evaluate and interpret clinical trial data for the benefit of shareholders. In a typical case, joint venturers Aeterna Zentaris Inc. and Keryx Biopharmaceuticals, Inc., were both targeted by investors for alleged Rule 10b-5 violations involving statements of clinical trial results, made in SEC filings, press releases, investor presentations and investor conference calls from 2009 to 2012.[ii] The clinical trial at issue tested perifosine in conjunction with, and without, other therapies, in multiple arms, which examined efficacy for seven different types of cancer. After a preliminary phase II trial yielded promising results for metastatic colon cancer, the colon cancer arm proceeded. According to plaintiffs, the defendants repeatedly claimed that perifosine had demonstrated “statistically significant positive results.” In re Keryx at *2, 3.

The plaintiffs alleged that defendants’ statements omitted material facts, including the full extent of multiple testing in the design and conduct of the phase II trial, without adjustments supposedly “required” by regulatory guidance and generally accepted statistical principles. The plaintiffs asserted that the multiple comparisons involved in testing perifosine in so many different kinds of cancer patients, at various doses, with and against so many different types of other cancer therapies, compounded by multiple interim analyses, inflated the risk of Type I errors such that some statistical adjustment should have been applied before claiming that a statistically significant survival benefit had been found in one arm, with colorectal cancer patients. In re Keryx at *2-3, *10.

The trial court dismissed these allegations given that the trial protocol had been published, although over two years after the initial press release, which started the class period, and which failed to disclose the full extent of multiple testing and lack of statistical correction, which omitted this disclosure…. The trial court was loath to allow securities fraud claims over allegations of improper statistical methodology, which:

“would be equivalent to a determination that if a researcher leaves any of its methodology out of its public statements — how it did what it did or was planning to do — it could amount to an actionable false statement or omission. This is not what the law anticipates or requires.” [emphasis mine]

*Talk about an illicit slippery slope. Requiring information on the source of erroneous interpretations of statistical evidence is not “equivalent” to requiring the researcher report every detail about what it was planning to do.*

In re Keryx at *10[7]. According to the trial court, providing p-values for comparisons between therapies, without disclosing the extent of unplanned interim analyses or the number of multiple comparisons is “not falsity; it is less disclosure than plaintiffs would have liked.” Id. at *11.

[…read his full blogpost here]

The court’s characterization of the fraud claims as a challenge to trial methodology rather than data interpretation and communication decidedly evaded the thrust of the plaintiffs’ fraud complaint. Data interpretation will often be part of the methodology outlined in a protocol. The Keryx case also confused criticism of the design and execution of a clinical trial with criticism of the communication of the trial results.

Exactly!

*I’m not sure I understand at this point what the “Reference Manual”, or Daubert, or its current manifestation, are really requiring (on multiplicity); and as would be expected of any sharp lawyer, Schachtman makes some intricate gradations.*

*Please see the full blogpost and his extended footnotes here.*

*One clever gambit I often come across by way of excuse (for QRPs along the lines of selection effects) is that it’s a “philosophical issue”. How can you hold someone accountable for favoring one of rival philosophical positions? If it’s not put as a “free speech” issue, it’s a “freedom of philosophy” issue. How con-veenient!*

[i] See In re Zoloft (Sertraline Hydrochloride) Prods. Liab. Litig., MDL No. 2342; 12-md-2342, 2014 U.S. Dist. LEXIS 87592; 2014 WL 2921648 (E.D. Pa. June 27, 2014) (Rufe, J.).

[ii] Abely v. Aeterna Zentaris Inc., No. 12 Civ. 4711(PKC), 2013 WL 2399869 (S.D.N.Y. May 29, 2013); In re Keryx Biopharms, Inc., Sec. Litig., 1307(KBF), 2014 WL 585658 (S.D.N.Y. Feb. 14, 2014).

*Schachtman’s legal practice focuses on the defense of product liability suits, with an emphasis on the scientific and medico-legal issues. He teaches a course in statistics in the law at the Columbia Law School, NYC.