PhilStat Law

Nathan Schachtman: Of Significance, Error, Confidence, and Confusion – In the Law and In Statistical Practice (Guest Post)

Nathan Schachtman, Esq., J.D.
Legal Counsel for Scientific Challenges

Of Significance, Error, Confidence, and Confusion – In the Law and In Statistical Practice

The metaphor of law as an “empty vessel” is frequently invoked to describe the law generally, as well as pejoratively to describe lawyers. The metaphor rings true at least in describing how the factual content of legal judgments comes from outside the law. In many varieties of litigation, not only the facts and data, but the scientific and statistical inferences must be added to the “empty vessel” to obtain a correct and meaningful outcome. Continue reading

Categories: ASA Guide to P-values, ASA Task Force on Significance and Replicability, PhilStat Law, Schachtman

Larry Laudan: “‘Not Guilty’: The Misleading Verdict” (Guest Post)

Prof. Larry Laudan
Lecturer in Law and Philosophy
University of Texas at Austin

“‘Not Guilty’: The Misleading Verdict and How It Fails to Serve either Society or the Innocent Defendant”

Most legal systems in the developed world share in common a two-tier verdict system: ‘guilty’ and ‘not guilty’.  Typically, the standard for a judgment of guilty is set very high while the standard for a not-guilty verdict (if we can call it that) is quite low. That means any level of apparent guilt less than about 90% confidence that the defendant committed the crime leads to an acquittal (90% being the usual gloss on proof beyond a reasonable doubt, although few legal systems venture a definition of BARD that precise). According to conventional wisdom, the major reason for setting the standard as high as we do is the desire, even the moral necessity, to shield the innocent from false conviction. Continue reading

Categories: L. Laudan, PhilStat Law

96% Error in “Expert” Testimony Based on Probability of Hair Matches: It’s all Junk!

Imagine. The New York Times reported a few days ago that the FBI erroneously identified criminals 96% of the time based on probability assessments using forensic hair samples (up until 2000). Sometimes the hair wasn't even human; it might have come from a dog, a cat, or a fur coat! I posted on the unreliability of hair forensics a few years ago. The forensics of bite marks aren't much better.[i] John Byrd, forensic analyst and reader of this blog, had commented at the time: "At the root of it is the tradition of hiring non-scientists into the technical positions in the labs. They tended to be agents. That explains a lot about misinterpretation of the weight of evidence and the inability to explain the import of lab findings in court." DNA is supposed to cure all that. So does it? I don't know, but apparently the FBI "has agreed to provide free DNA testing where there is either a court order or a request for testing by the prosecution."[ii] See the FBI report.

Here’s the op-ed from the New York Times, April 27, 2015:

“Junk Science at the FBI”

The odds were 10-million-to-one, the prosecution said, against hair strands found at the scene of a 1978 murder of a Washington, D.C., taxi driver belonging to anyone but Santae Tribble. Based largely on this compelling statistic, drawn from the testimony of an analyst with the Federal Bureau of Investigation, Mr. Tribble, 17 at the time, was convicted of the crime and sentenced to 20 years to life.

But the hair did not belong to Mr. Tribble. Some of it wasn’t even human. In 2012, a judge vacated Mr. Tribble’s conviction and dismissed the charges against him when DNA testing showed there was no match between the hair samples, and that one strand had come from a dog.

Mr. Tribble’s case — along with the exoneration of two other men who served decades in prison based on faulty hair-sample analysis — spurred the F.B.I. to conduct a sweeping post-conviction review of 2,500 cases in which its hair-sample lab reported a match.

The preliminary results of that review, which Spencer Hsu of The Washington Post reported last week, are breathtaking: out of 268 criminal cases nationwide between 1985 and 1999, the bureau’s “elite” forensic hair-sample analysts testified wrongly in favor of the prosecution, in 257, or 96 percent of the time. Thirty-two defendants in those cases were sentenced to death; 14 have since been executed or died in prison.

The agency is continuing to review the rest of the cases from the pre-DNA era. The Justice Department is working with the Innocence Project and the National Association of Criminal Defense Lawyers to notify the defendants in those cases that they may have grounds for an appeal. It cannot, however, address the thousands of additional cases where potentially flawed testimony came from one of the 500 to 1,000 state or local analysts trained by the F.B.I. Peter Neufeld, co-founder of the Innocence Project, rightly called this a “complete disaster.”

Law enforcement agencies have long known of the dubious value of hair-sample analysis. A 2009 report by the National Research Council found “no scientific support” and “no uniform standards” for the method’s use in positively identifying a suspect. At best, hair-sample analysis can rule out a suspect, or identify a wide class of people with similar characteristics.

Yet until DNA testing became commonplace in the late 1990s, forensic analysts testified confidently to the near-certainty of matches between hair found at crime scenes and samples taken from defendants. The F.B.I. did not even have written standards on how analysts should testify about their findings until 2012.

Continue reading

Categories: evidence-based policy, junk science, PhilStat Law, Statistics

Msc. Kvetch: What does it mean for a battle to be “lost by the media”?

1.  What does it mean for a debate to be “media driven” or a battle to be “lost by the media”? In my last post, I noted that until a few weeks ago, I’d never heard of a “power morcellator.” Nor had I heard of the AAGL–The American Association of Gynecologic Laparoscopists. In the article “Battle over morcellation lost ‘in the media’” (Nov 26, 2014), Susan London reports on a recent meeting of the AAGL.[i]

The media played a major role in determining the fate of uterine morcellation, suggested a study reported at a meeting sponsored by AAGL.

“How did we lose this battle of uterine morcellation? We lost it in the media,” asserted lead investigator Dr. Adrian C. Balica, director of the minimally invasive gynecologic surgery program at the Robert Wood Johnson Medical School in New Brunswick, N.J.

The “investigation” Balica led consisted of collecting Internet search data using something called the Google Adwords Keyword Planner:

Results showed that the average monthly number of Google searches for the term ‘morcellation’ held steady throughout most of 2013 at about 250 per month, reported Dr. Balica. There was, however, a sharp uptick in December 2013 to more than 2,000 per month, and the number continued to rise to a peak of about 18,000 per month in July 2014. A similar pattern was seen for the terms ‘morcellator,’ ‘fibroids in uterus,’ and ‘morcellation of uterine fibroid.’

The “vitals” of the study are summarized at the start of the article:

Key clinical point: Relevant Google searches rose sharply as the debate unfolded.

Major finding: The mean monthly number of searches for “morcellation” rose from about 250 in July 2013 to 18,000 in July 2014.

Data source: An analysis of Google searches for terms related to the power morcellator debate.

Disclosures: Dr. Balica disclosed that he had no relevant conflicts of interest.

2. Here’s my question: Does a high correlation between Google searches on debate-related terms and the unfolding debate signify that the debate is “media driven”? I suppose you could call it that, but Dr. Balica is clearly suggesting that something not quite kosher, or not fully factual, was responsible for losing “this battle of uterine morcellation”, downplaying the substantial data and real events that drove people (like me) to search the terms upon hearing the FDA announcement in November. Continue reading

Categories: msc kvetch, PhilStat Law, science communication, Statistics

PhilStat/Law: Nathan Schachtman: Acknowledging Multiple Comparisons in Statistical Analysis: Courts Can and Must

The following is from Nathan Schachtman’s legal blog, with various comments and added emphases (by me, in this color). He will try to reply to comments/queries.

“Courts Can and Must Acknowledge Multiple Comparisons in Statistical Analyses”

Nathan Schachtman, Esq., PC * October 14th, 2014

In excluding the proffered testimony of Dr. Anick Bérard, a Canadian perinatal epidemiologist at the Université de Montréal, the Zoloft MDL trial court discussed several methodological shortcomings and failures, including Bérard’s reliance upon claims of statistical significance from studies that conducted dozens and hundreds of multiple comparisons.[i] The Zoloft MDL court was not the first court to recognize the problem of over-interpreting the putative statistical significance of results that were one among many statistical tests in a single study. The court was, however, among a fairly small group of judges who have shown the needed statistical acumen in looking beyond the reported p-value or confidence interval to the actual methods used in a study[1].

A complete and fair evaluation of the evidence in situations such as the Zoloft birth defects epidemiology required more than the presentation of the size of the random error, or the width of the 95 percent confidence interval. When the sample estimate arises from a study with multiple testing, presenting the sample estimate with the confidence interval, or p-value, can be highly misleading if the p-value is used for hypothesis testing. The fact of multiple testing will inflate the false-positive error rate. Dr. Bérard ignored the context of the studies she relied upon. What was noteworthy was that Bérard encountered a federal judge who adhered to the assigned task of evaluating methodology and its relationship with conclusions.
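To see how quickly multiple testing inflates the false-positive error rate, here is a quick back-of-the-envelope sketch (my illustration, not Schachtman's): with k independent tests of true null hypotheses, each at the 0.05 level, the chance of at least one nominally significant result is 1 - (1 - 0.05)^k.

```python
# Family-wise error rate for k independent tests of true nulls,
# each conducted at the 0.05 level.
alpha = 0.05
for k in (1, 5, 20, 100):
    fwer = 1 - (1 - alpha) ** k
    print(f"{k:3d} tests: P(at least one nominal p < {alpha}) = {fwer:.2f}")
# Roughly: 1 test -> 0.05, 5 -> 0.23, 20 -> 0.64, 100 -> 0.99
```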

*   *   *   *   *   *   *

There is no unique solution to the problem of multiple comparisons. Some researchers use Bonferroni or other quantitative adjustments to p-values or confidence intervals, whereas others reject adjustments in favor of qualitative assessments of the data in the full context of the study and its methods. See, e.g., Kenneth J. Rothman, “No Adjustments Are Needed For Multiple Comparisons,” 1 Epidemiology 43 (1990) (arguing that adjustments mechanize and trivialize the problem of interpreting multiple comparisons). Two things are clear from Professor Rothman’s analysis. First, for someone intent upon strict statistical significance testing, the presence of multiple comparisons means that the rejection of the null hypothesis cannot be done without further consideration of the nature and extent of both the disclosed and undisclosed statistical testing. Rothman, of course, has inveighed against strict significance testing under any circumstance, but the multiple testing would only compound the problem.

Second, although failure to adjust p-values or intervals quantitatively may be acceptable, failure to acknowledge the multiple testing is poor statistical practice. The practice is, alas, too prevalent for anyone to say that ignoring multiple testing is fraudulent, and the Zoloft MDL court certainly did not condemn Dr. Bérard as a fraudfeasor[2]. [emphasis mine]

I’m perplexed by this mixture of stances. If you don’t mention the multiple testing for which it is acceptable not to adjust, then you’re guilty of poor statistical practice; but it’s “too prevalent for anyone to say that ignoring multiple testing is fraudulent”. This appears to claim it’s poor statistical practice if you fail to mention that your results are due to multiple testing, but “ignoring multiple testing” (which could mean failing to adjust or, more likely, failing to mention it) is not fraudulent. Perhaps it’s a questionable research practice (QRP). It’s back to “50 shades of grey between QRPs and fraud.”
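For concreteness, the quantitative route Rothman argues against is easy to state: with k comparisons, the Bonferroni adjustment either tests each at alpha/k or, equivalently, multiplies each p-value by k. A toy sketch of mine (the p-values are invented purely for illustration):

```python
# Bonferroni adjustment: multiply each p-value by the number of
# comparisons (capped at 1) and compare to the original alpha.
# These p-values are made up for illustration only.
alpha = 0.05
p_values = [0.003, 0.02, 0.04, 0.30, 0.47]
k = len(p_values)

for p in p_values:
    p_adj = min(p * k, 1.0)
    verdict = "significant" if p_adj < alpha else "not significant"
    print(f"raw p = {p:.3f}  adjusted p = {p_adj:.3f}  -> {verdict}")
# Only 0.003 survives adjustment (0.015 < 0.05); the nominally
# significant 0.02 and 0.04 become 0.10 and 0.20, and do not.
```

Whether one adjusts quantitatively or, following Rothman, simply reports the full extent of the testing, the reader at least has to be told how many comparisons were run.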

  […read his full blogpost here]

Previous cases have also acknowledged the multiple testing problem. In litigation claims for compensation for brain tumors from cell phone use, plaintiffs’ expert witness relied upon subgroup analysis, which added to the number of tests conducted within the epidemiologic study at issue. Newman v. Motorola, Inc., 218 F. Supp. 2d 769, 779 (D. Md. 2002), aff’d, 78 Fed. App’x 292 (4th Cir. 2003). The trial court explained:

“[Plaintiff’s expert] puts overdue emphasis on the positive findings for isolated subgroups of tumors. As Dr. Stampfer explained, it is not good scientific methodology to highlight certain elevated subgroups as significant findings without having earlier enunciated a hypothesis to look for or explain particular patterns, such as dose-response effect. In addition, when there is a high number of subgroup comparisons, at least some will show a statistical significance by chance alone.”

I’m going to require, as part of its meaning, that a statistically significant difference not be one due to “chance variability” alone. Then, to avoid self-contradiction, this last sentence might be put as follows: “when there is a high number of subgroup comparisons, at least some will show purported or nominal or unaudited statistical significance by chance alone. [Which term do readers prefer?] If one hunts down one’s hypothesized comparison in the data, then the actual p-value will not equal, and will generally be greater than, the nominal or unaudited p-value.”
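Here is a small simulation of my own (not from the court record) that makes the nominal-versus-actual distinction concrete: all sixteen subgroup comparisons are truly null, yet if we hunt down and report only the most impressive one, its nominal p-value falls below 0.05 more than half the time.

```python
import numpy as np

# Sixteen truly null subgroup comparisons: each nominal p-value is
# uniform on [0, 1]. We report only the smallest ("best") one.
rng = np.random.default_rng(0)
n_sims, n_subgroups, alpha = 100_000, 16, 0.05

min_p = rng.uniform(size=(n_sims, n_subgroups)).min(axis=1)

# Actual error rate of the "report the best subgroup" procedure:
print((min_p < alpha).mean())          # about 0.56, not 0.05
print(1 - (1 - alpha) ** n_subgroups)  # analytic check, ~0.56
```

The nominal p-value of the selected subgroup badly understates the actual p-value of the procedure that produced it, which is just the sixteen-category Texas Sharpshooter scenario the Tenth Circuit describes below.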

So, I will insert “nominal” where needed below (in red).

Texas Sharpshooter fallacy

Id. And shortly after the Supreme Court decided Daubert, the Tenth Circuit faced the reality of data dredging in litigation, and its effect on the meaning of “significance”:

“Even if the elevated levels of lung cancer for men had been [nominally] statistically significant a court might well take account of the statistical “Texas Sharpshooter” fallacy in which a person shoots bullets at the side of a barn, then, after the fact, finds a cluster of holes and draws a circle around it to show how accurate his aim was. With eight kinds of cancer for each sex there would be sixteen potential categories here around which to “draw a circle” to show a [nominally] statistically significant level of cancer. With independent variables one would expect one statistically significant reading in every twenty categories at a 95% confidence level purely by random chance.”

The Texas sharpshooter fallacy is one of my all-time favorites. One purports to be testing the accuracy of his aim, when in fact that is not the process that gave rise to the impressive-looking (nominal) cluster of hits. The results do not warrant inferences about his ability to accurately hit a target, since that hasn’t been well-probed. Continue reading

Categories: P-values, PhilStat Law, Statistics
