An established probability theory for hair comparison? “is not — and never was”

Hypothesis H: “person S is the source of this hair sample,” if indicated by a DNA match, has passed a more severe test than if it were indicated merely by a visual analysis under a microscope. There is a much smaller probability of an erroneous hair match using DNA testing than using the method of visual analysis used for decades by the FBI.
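
To make the contrast concrete, here is a minimal sketch in Python. It assumes that each person in a pool of possible alternative sources matches by chance independently and with the same random match probability. The function name, the pool size, and both rates are hypothetical values chosen only for illustration; they are not estimates for hair or for DNA (for visual hair comparison, no such rate has been established).

```python
import math

def prob_some_coincidental_match(random_match_prob: float, pool_size: int) -> float:
    """Probability that at least one of `pool_size` people other than the true
    source matches by chance, assuming independent matches with a common rate."""
    # 1 - (1 - p)^n, computed with log1p/expm1 to stay accurate for tiny p.
    return -math.expm1(pool_size * math.log1p(-random_match_prob))

# Hypothetical values, chosen only to show the contrast between two methods.
POOL = 1000
high_rate_method = prob_some_coincidental_match(1e-2, POOL)
low_rate_method = prob_some_coincidental_match(1e-9, POOL)

print(f"high random match rate: {high_rate_method:.5f}")
print(f"low random match rate:  {low_rate_method:.2e}")
```

Under these assumptions, a match reported by the first method would be expected even if S were not the source, so H has passed only a weak test. With the second method a coincidental match is very improbable, which is what makes the test severe.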

The Washington Post reported on its latest investigation into flawed statistics behind hair match testimony. “Thousands of criminal cases at the state and local level may have relied on exaggerated testimony or false forensic evidence to convict defendants of murder, rape and other felonies”. Below is an excerpt of the Post article by Spencer S. Hsu.

I asked John Byrd, forensic anthropologist and follower of this blog, what he thought. It turns out that “hair comparisons do not have a well-supported weight of evidence calculation.” (Byrd). I put Byrd’s note at the end of this post.

Washington Post Published: December 22

In its April investigation, The Post found that Justice Department officials failed to tell many defendants or their attorneys of questionable evidence and that the results of the review remained largely secret.

… How often might the hairs of different people appear to match? The truth is that there was no scientific way to know….

Before DNA profiling, testimony of a hair match was a powerful way for prosecutors to boil down an ambiguous case to a single, incriminating piece of physical evidence left at the scene of a crime…

But The Post’s investigation earlier this year showed how agents, prosecutors or both sometimes exaggerated the significance of the evidence they had.

For example, in a 1980 Indiana robbery case, one agent told jurors that he was unable to distinguish between the hair of different people just once in 1,500 cases he had analyzed.

In one of the District cases, federal prosecutors claimed that the agent had been unable to tell hair samples apart only “eight or 10 times in the past 10 years, while performing thousands of analyses.”

In another, the prosecutor said in closing arguments, “There is one chance, perhaps for all we know, in 10 million that it could [be] someone else’s hair.” That defendant was declared innocent this year.

The problem is, as an expert peer review panel wrote in Melnikoff’s case, “There is not — and never was — a well established probability theory for hair comparison.”

As noted in 2009 by the chief of the FBI hair team, the proper answer to the question of how often hairs from different people might match is, “We do not know.”

Vague standards

The FBI has known for decades that hair found at a crime scene is a valuable piece of evidence. Before DNA testing, agents would use a microscope to compare the evidence with a sample of hair from a suspect.

A visual analysis can tell animal hairs from human hairs; human hairs by race and body part; whether hairs were dyed or otherwise treated; and how hairs were removed from the body. Visual comparison, at its best, also can accurately narrow the pool of criminal suspects to a class or group or definitively rule out a person as a possible source.

But it was not possible to declare an absolute match. So the FBI had a problem. Hair comparisons could yield good evidence. But agents struggled to explain to a jury how good.

Morris Samuel “Sam” Clark was the head of the FBI’s hair unit when it began training state and local analysts in 1973. He said he long believed that examiners could trace hairs from a crime scene to a particular person with a high degree of probability — even though there is no scientific proof that is possible…

The FBI’s training regimen, which required agents to compare hairs side-by-side under high-powered microscopes for a year before working on live cases, gave lab veterans confidence that they could tell the difference between individuals’ hairs just as an ordinary person could distinguish between their faces.

They embraced a set of vague standards. In written lab reports, FBI agents would include the caveat that hair examination was not a basis for positive identification.

In court, however, they could suggest that it would be highly unlikely for an examiner’s match to be wrong. The bureau left it up to individual labs and examiners to explain matters to jurors. Agents were trained to say that in their “personal experience” they had rarely seen hairs from different people that looked alike.

That evolved into jurors’ hearing numbers that had a huge impact even if they lacked scientific grounding. After a slaying in Tennessee in 1980, an FBI agent testified in a capital case that there was one chance in 4,500 or 5,000 that a hair came from someone other than the suspect.

But as experts from around the world would later note, the FBI-taught answer was misleading. In reality, FBI examiners did not compare every hair to every other hair they had ever examined. They simply compared crime-scene hairs and hair samples from individuals relevant in each case.

Examiners kept no “database” of samples, which went back to police evidence files. And differences between hairs are so fine that a person can generally keep only a handful of hairs in mind at any time.

“The claim you could keep all those hairs in your head and sort them in your mind, that would be hard to do,” said Mark R. Wilson, a 23-year FBI veteran who helped develop DNA testing for hair in 1996. “After about three or four [hairs], it gets confusing.”

The claim was called into question at an international conference hosted by the FBI in 1985, but the training was not overhauled for at least a dozen more years…

Robillard, the former hair unit chief, said that he always waited for a defense attorney to challenge his claims about the accuracy of hair analysis but that neither they nor judges usually caught the logical sleight of hand.

“You would expect a defense attorney to say, ‘Wait — are you, Robillard, saying you compared every person’s hair to every other one?’ That’s the screaming question for cross-examination,” Robillard said. “I can’t off the top of my head remember ever having a defense attorney say that.”

….In 2004, Melnikoff lost his crime lab job in Washington because of errors whose discovery led to three overturned convictions in Montana. One of those cases was the child rape conviction of Jimmy Ray Bromgard, who served more than 15 years in prison before DNA tests showed he didn’t commit the crime.

At Bromgard’s 1987 trial, Melnikoff said he found head and pubic hairs “microscopically indistinguishable” from Bromgard’s, and he told the jury that there was less than one chance in 10,000 of a coincidence. He based this assertion on his case experience, multiplying by 100 the 1 in 100 frequency with which he claimed to have seen head and pubic hairs he could not tell apart.
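
[A note on the arithmetic, not part of the Post excerpt: the sketch below reproduces the 1 in 10,000 figure in Python and shows how much it depends on treating the two hair comparisons as independent. Reading the testimony as a 1 in 100 rate for each hair type is an assumption of this sketch, and the conditional probability `pubic_given_head` is invented purely for illustration.]

```python
from fractions import Fraction

# Assumed reading of the testimony: a 1 in 100 rate for each hair type.
head_rate = Fraction(1, 100)
pubic_rate = Fraction(1, 100)

# Product rule as applied in court: valid only if the two events are independent.
claimed = head_rate * pubic_rate

# Hypothetical dependence: P(pubic hairs match | head hairs match).
# This value is invented for illustration, not taken from any study.
pubic_given_head = Fraction(1, 2)
joint_if_dependent = head_rate * pubic_given_head

print(claimed)             # 1/10000
print(joint_if_dependent)  # 1/200
```

[If the two kinds of match tend to occur together, the joint frequency can be far larger than the simple product, and the input rates came from case experience rather than from any measured error rate.]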

After Bromgard was exonerated in 2002, a five-member panel that included Deadman said Melnikoff made “egregious misstatements not only of the science of forensic hair examinations but also of genetics and statistics.”

The full article is here.


Comment (from an e-mail) from John Byrd, forensic analyst:

It is a well-known problem in forensics that has proven difficult for the traditional labs to get past.

At the root of it is the tradition of hiring non-scientists into the technical positions in the labs. They tended to be agents. That explains a lot about misinterpretation of the weight of evidence and the inability to explain the import of lab findings in court.

I should note that we often talk of “weight of evidence” in forensic science. It is addressed by appeal to the frequency of a spurious match in repeated applications of a test. The larger the probability of a random match, the lower the weight ascribed to the evidence. DNA is useful to the extent that the probability of someone else sharing the profile is low.

Hair comparisons do not have a well-supported weight of evidence calculation and we suspect if it did it would not be comparable to DNA, fingerprints, or other tests that are more reputable in the scientific community.

Clarified to mean: Hair comparisons, when made visually (under microscopes), do not have a well-supported weight of evidence calculation, and we suspect that if they did (i.e., if we checked the rate of false visual matches) it would not be comparable to DNA, fingerprints, or other tests that are more reputable in the scientific community.

I am sure you can see the direct relationship between the weight of evidence and severity.

Note that the last person interviewed, Max Houck, is an anthropologist and was the first scientist (non-agent) they hired to do trace evidence. I know Max very well and he has distinguished himself in the scholarly world by pushing science fundamentals to the forensic disciplines. The first paper I saw Max present many years ago at the American Academy of Forensic Sciences was on the philosophy of science that underpins our forensic reasoning. (Forensics were largely born out of anthropology at the turn of the last century.) Just last year, he presented his ideas that the philosophy underlying forensic science is a subsidiary of historical sciences (archaeology, paleontology, astronomy, etc.).

Forensics has turned a corner in any event. The accrediting bodies now follow ISO standards and require science degrees and training of the analysts. The National Academy of Sciences put out a scathing critique of forensics in America in 2009 that recommended that all analysts be trained and mentored to do scientific research before they become analysts.

The FBI is suffering lingering effects of the past… You might be pleased to hear that last time I saw the FBI lab Director in a meeting, he was all abuzz about wanting to hire a full time statistician to work with the staff. That was last year, so I will find out this year how that worked out. Statistics and scientific reasoning cannot be separated. John Byrd

John E. Byrd, Ph.D., D-ABFA
Laboratory Director and Forensic Anthropologist


Categories: Severity, Statistics

14 thoughts on “An established probability theory for hair comparison? “is not — and never was”

  1. e.berk

    This is the danger in encouraging the use of personal degrees of belief and feelings of conviction in reporting how likely it is that one is in error. A reliable method would state the caveat of FBI reports, that hair examination was not a basis for positive identification. In court, by contrast, they were encouraged to be Bayesian. They were to report on their personal experiences, that “they had rarely seen hairs from different people that looked alike”. This assumes the procedure they were using was self-correcting. Maybe they had rarely found themselves to be in error, but this method was weakly powered to detect the error. They strongly believed in their own capacities. Fortunately DNA comes along, for different problem arenas, but how many false convictions in the meantime? They’re not even investigating outside a small proportion of cases.

    • E.B. I don’t know enough about the law or forensics to weigh in, except that obviously they were overconfident in relying on visual matches. I do recall a lot of dispute when I wasted some time listening to that Casey Anthony trial.

  2. Nicole Jinn

    I think that the article on flawed statistics behind hair match testimony that you report reveals (or makes more explicit) the need to *more carefully* investigate the limits of applying statistical methods. *If* researchers are able to better understand the limits of the statistical techniques they use *then* I think misinterpretation will be less likely to occur. But the if portion seems unattainable given the current disputes in philosophy of statistics.

    • Nicole: I don’t really think there were flawed statistics, the guys were simply guessing (in court) when they reported on how likely they were wrong, not backed by statistics. That’s how I’m understanding this. I guess giving impressionistic accounts of likelihoods could be called flawed statistics, but it’s not like they didn’t know they were giving hunches.
      I’m amused by the hair chief saying “he always waited for a defense attorney to challenge his claims about the accuracy of hair analysis but that neither they nor judges usually caught the logical sleight of hand.”
      The only thing that surprises me is Byrd’s claim about the inadequacy of even DNA analysis.

      Update: I see that they did use the term “flawed statistics”, you are right, and this reflects a certain ambiguity in the term “statistics”. Here it alludes to the numbers reported. Anyway, I take it back.

      • john byrd

        DNA is actually quite robust as a test method in comparison to use of hair type, etc. to link a trace to a person. The difference is that DNA is highly variable between individuals (how much so depends on the type of testing being done). Plus, the probability of a random match pertaining to each test has been addressed in a rigorous manner. In other words, DNA expert witnesses are able to testify that there is a match, and that one can expect such a match by chance x times out of y due to random events. I like to point out the relationship to severity: one can judge this match probative just to the extent that it is unlikely that another individual (not the person in question) could have left the trace. The tests, such as DNA, that have very low random match probabilities are severe forensic tests.

        • Hi John Byrd: I may be misunderstanding, in that case, the meaning of your claim that “hair comparisons do not have a well-supported weight of evidence calculation and we suspect if it did it would not be comparable to DNA, fingerprints, or other tests that are more reputable in the scientific community.”

          • john byrd

            That was my mistake– I did not intend to include the word “not” in the sentence. Thanks for pointing it out. DNA is a robust test. My criticisms of DNA testing in forensics would center more on the propensity to report a Bayesian statistic than on the value of the test itself.

            • John: Even more confused, I think. Tell me how the sentence should read to still make sense of “we suspect if it did….”.

              • john byrd

                I think I should have avoided such a long sentence to convey the point. What I meant to say was that hair comparisons have not had the benefit of research to determine the statistical meaning of a “match.” In contrast, DNA tests have well established statistical models for ascertaining the efficacy of a match, especially as pertains to the probability of a random match. Our lab has produced similar research to support the statistical evaluation of a “match” with dental records, optometric prescriptions, and chest radiographs. In these cases, not only has the work been done to address the frequency of a random match, but it has been demonstrated that this frequency is low so as to render the tests powerful. There have been calls for this type of research with hair comparison. Once done, I doubt that hair comparison will be shown to be powerful along the lines of DNA or some of these other tests. I doubt that hair comparison as traditionally performed is a severe test.

                • John: Not to belabor this, but I’d like to correctly correct the sentence. Is this what you mean?
                  “hair comparisons WHEN MADE VISUALLY (UNDER MICROSCOPES) do not have a well-supported weight of evidence calculation and we suspect if it did (I.E., IF WE CHECKED THE RATE OF FALSE VISUAL MATCHES) it would not be comparable to DNA, fingerprints, or other tests that are more reputable in the scientific community.”

                  If that’s right, then I’ll correct it that way. Sorry to have misunderstood.

  3. That is absolutely fascinating. I believe that every behavioral science student should have a well developed series of courses in the philosophy of science and be trained as researchers. Having at least a basic understanding of the philosophy behind introductory statistics is necessary for good practice of the applied behavioral fields.

    Darrin Coe, Ph.D.

  4. To Dr. Coe: in my (short) exposure to philosophy, studying philosophy of science does *not* necessarily imply nor require studying the philosophy behind statistical methods, at least in the way philosophy of science is traditionally formulated or taught. Instead, I find it better to take several courses in statistics, which give the formal/technical component of statistical methods, along with training in philosophy.

    • Hmm, well it may not be standard, but there are several variations on such courses that I have taught, at times with my colleague Dr. Aris Spanos. Often we offered it jointly in Philosophy-Economics (especially when I was 50% in Econ). Also at the LSE. Stat courses are important, but are busy teaching stat tools, and cannot typically be expected to provide the skills for a reflective scrutiny of the tools. Clearly there’s a need for more interdisciplinarity.
