Phil/Stat/Law: What Bayesian prior should a jury have? (Schachtman)

Nathan Schachtman, Esq., PC* emailed me the following interesting query a while ago:

When I was working through some of the Bayesian in the law issues with my class, I raised the problem of priors of 0 and 1 being “out of bounds” for a Bayesian analyst. I didn’t realize then that the problem had a name: Cromwell’s Rule.

My point was then, and even more so now: what is the appropriate prior the jury should have when it is sworn? When it hears opening statements? Just before the first piece of evidence is received?

Do we tell the jury that the defendant is presumed innocent, which means that it’s ok to entertain a very, very small prior probability of guilt, say no more than 1/N, where N is the total population of people? This seems wrong as a matter of legal theory.  But if the prior = 0, then no amount of evidence can move the jury off its prior.

*Schachtman’s legal practice focuses on the defense of product liability suits, with an emphasis on the scientific and medico-legal issues.  He teaches a course in statistics in the law at the Columbia Law School, NYC. He also has a legal blog here.

Categories: PhilStatLaw, Statistics


27 thoughts on “Phil/Stat/Law: What Bayesian prior should a jury have? (Schachtman)”

  1. Nathan: Notice too that a prior of .5 in innocence is rather informative; overly so, it seems to me. Bottom line: the jury should not start with a prior for guilt, but be open to a relevant and unbiased scrutiny of the evidence in accord with “innocent until proved guilty”*.

    * Be it BARD or POE or other legal standard for the case (see comment to Corey)

    • Corey

      What severity threshold ought to count as “proved” guilty?

      • Corey: That statement (of mine) of course is to be qualified:
        “innocent until proven guilty according to the stipulated burden of proof for the case at hand,” be it BARD (beyond a reasonable doubt) or POE (preponderance of evidence) or something else.
        I wrote something on this in my “exchange” with Larry Laudan (Error and Inference, 2010).
        I have always been surprised at the vagueness with which these standards of evidence are left, and at the amount of conflation of different probabilistic terms in the law. For example, POE is sometimes thought to be “more likely than not,” which can refer to statistical likelihood or to a posterior probability exceeding .5. And BARD is sometimes identified with high confidence levels in statistics, other times with posterior probabilities. I’m not saying which are the “correct” interpretations, only that there’s a huge amount of confusion in the law. Larry Laudan introduced me to these issues when he was a colleague of mine years ago.

  2. The first problem I see here is the rule (enforced by the laws of probability) that data should not be used twice. That is, P(H|E,E)=P(H|E). So observing E a second time cannot affect the probability of H.

    This means that the jurors should not in any way take into account the fact that the defendant has been brought up on charges “by probable cause”. The reason is that the evidence that brought those charges in the first place is going to be presented in court, and they are supposed to evaluate that evidence. If they were to assume “probable cause” because charges would not have been brought had there not been probable cause, then they are in danger of using the evidence that brought the case to the level of “probable cause” twice. And they have no way of knowing which of the evidence they are going to hear was considered when bringing the charges (nor which evidence cited by the defense may have been ignored when the charges were brought).

    This argues for a clean slate, and a prior that is very skeptical of guilt. 1/N is reasonable from this point of view. There is one guilty person, and the total population under consideration is of order N.
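
    To see the force of the double-counting worry, here is a minimal sketch with entirely made-up numbers (the population size and the likelihood ratio are my illustrative choices): updating once on evidence E is legitimate, but a juror who both presumes “probable cause” and then weighs the underlying evidence at trial is, in effect, running the second update below.

    ```python
    def posterior_guilt(prior: float, lr: float) -> float:
        """Bayes update on guilt H, with lr = P(E | guilty) / P(E | innocent)."""
        odds = prior / (1 - prior)
        post_odds = odds * lr
        return post_odds / (1 + post_odds)

    N = 100_000                # hypothetical city population
    prior = 1 / N              # the skeptical 1/N starting point
    lr = 1_000.0               # made-up likelihood ratio for the evidence E

    once = posterior_guilt(prior, lr)   # P(H | E): legitimate single update
    twice = posterior_guilt(once, lr)   # "P(H | E, E)": E wrongly counted twice

    print(f"updated once:  {once:.2%}")    # about 0.99%
    print(f"counted twice: {twice:.2%}")   # about 90.9% -- badly inflated
    ```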

    Another point is that the problem is one in decision theory as well. That is, the jury is supposed to find guilt “beyond a reasonable doubt.” This means that the loss function has to take into account the fact that in our system of jurisprudence, it is better to let some number, n, of guilty people go than to unjustly punish one person. And, the injustice is greater, the greater the penalty exacted, up to and including the death penalty. How big n should be has to be assigned by the jury, since the judge and lawyers will not do that for them by defining “beyond a reasonable doubt”.

    In considering the loss function, the possibility of convicting an innocent person has in addition to the loss of punishing an innocent person the loss that the actually guilty person goes free and, since law enforcement considers the case closed, that guilty person may go on to commit other offenses.
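
    To make the loss-function point concrete (the single per-unit loss scale and the sample values of n are my own illustrative assumptions, not anything from the thread): if convicting an innocent person is judged n times as costly as acquitting a guilty one, minimizing expected loss means convicting only when the probability of guilt exceeds n/(n+1).

    ```python
    def conviction_threshold(n: float) -> float:
        """Probability of guilt above which conviction minimizes expected loss,
        when convicting an innocent person costs n times as much as acquitting
        a guilty one:
            E[loss | convict] = n * (1 - p),  E[loss | acquit] = p,
        so convict when n * (1 - p) < p, i.e. p > n / (n + 1)."""
        return n / (n + 1)

    for n in (1, 10, 100):
        print(f"n = {n:>3}: convict only if P(guilt) > {conviction_threshold(n):.3f}")
    # Blackstone's familiar "ten guilty ... one innocent" ratio gives n = 10,
    # a conviction threshold of about 0.909.
    ```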

    Some of these issues are discussed at a very elementary level in Gerd Gigerenzer’s book, “Calculated Risks”, which I recommend and have used in a freshman/sophomore honors course on decision theory for non-science students for many years. I’d be happy to discuss my experiences by email with Prof. Schachtman.

    • Corey

      ‘…jurors should not in any way take into account the fact that the defendant has been brought up on charges “by probable cause”.’

      I don’t agree. The bare fact that a person has been arrested and brought up on charges *is* evidence that ought to raise one’s probability of the defendant’s guilt above the base rate.

      The flip side of this is that an ideal juror must then *reduce* his or her probability of guilt if the prosecution does not present a sufficiently strong case. This is true irrespective of the defense case — even if the defense presented no case, the juror will have a lower probability of the defendant’s guilt after hearing the prosecution than before, in the event that the prosecution underperforms relative to the juror’s prior expectation.

      This is an instance of the martingale property of Bayesian statements discussed by Senn starting at the very bottom of page 3 of this letter on a not-very-related topic. This martingale property has been called the law of Conservation of Expected Evidence.
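
      A quick simulation of that “Conservation of Expected Evidence” property, using an invented prior and invented likelihoods: averaged over what the evidence might turn out to be, the posterior must equal the prior, which is exactly why a prosecution that underperforms expectation must lower the probability of guilt.

      ```python
      import random

      random.seed(1)
      prior = 0.3                  # invented prior probability of guilt
      pe_g, pe_i = 0.8, 0.2        # invented P(E | guilty), P(E | innocent)

      def posterior(e_seen: bool) -> float:
          """Posterior probability of guilt after learning whether E occurred."""
          pg = pe_g if e_seen else 1 - pe_g
          pi = pe_i if e_seen else 1 - pe_i
          return prior * pg / (prior * pg + (1 - prior) * pi)

      posts = []
      for _ in range(100_000):
          guilty = random.random() < prior
          e_seen = random.random() < (pe_g if guilty else pe_i)
          posts.append(posterior(e_seen))

      print(sum(posts) / len(posts))   # ~0.3: expected posterior = prior
      ```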

      Mayo: sorry for polluting your error statistics blog with our sordid Bayesian wrangling. 😉

      • I don’t agree, Corey. The jury, not the police and not the prosecutors, is supposed to be the sole judge of all evidence in the case. To do as you suggest is to contaminate the procedure with the opinions of people who have no business in judging the case. Theoretically, you say, the jury could back this out, but that is not appropriate, since they cannot in principle back out that information without some contamination. The only clean way to proceed is to have the jury start from scratch, and evaluate the evidence that led to the arrest completely anew. And that means a very skeptical prior.

        • Corey

          Bill Jefferys: Let me be clear that I’m discussing ideal Bayesian reasoners, not human beings. The system is set up to try to get to a fair and accurate result, where “fair” maps roughly to decision theory and “accurate” maps roughly to probabilistic inference, using human beings as component parts. I’m not suggesting that it be altered to treat human beings as less fallible than we actually are.

          That said, maybe a toy model will help? The following will give the right intuitions as long as we imagine a smallish N.

          Let’s say that a “crime” has been committed. For each of the N members of the population, three six-sided dice are thrown, and the actual “culprit” is the one with the largest sum, ties broken uniformly at random. One die’s result is completely hidden. Some kind of non-exhaustive search is carried out and a “defendant” may be “brought to trial” if the search finds a sufficiently “suspicious” member of the population. In the “trial”, first the “prosecution” reveals the larger of the two observable results and then the “defense” reveals the smaller of the two observable results. (A mildly more realistic model might have the “defense” possibly discuss the observable dice of other members of the population, but I think adding this complication muddies the waters unnecessarily.)

          Conditional on a “defendant” having been “brought to trial”, what is the probability of “guilt”? It depends on how one models the search process. It seems to me that your suggested prior is reasonable only if the search process ignores the observable dice. I find that… troubling. I’m saying that it’s reasonable to suppose that the fact that a “defendant” is “brought to trial” does convey information about the sum of the observable dice for that member of the population. For example, suppose that in my model, a “defendant” is only “brought to trial” if at least one “suspect” can be found with an observable sum of at least 8. Right away I know something about the probability of “guilt”. I haven’t done the math, but my intuition suggests that if in the “trial” the “prosecution” reveals something less than 6, the probability of “guilt” is going to go down a fair bit; a 6 will make it go up slightly.

          Note that in my toy model, the results on the observable dice are known by the end of the “trial”, so the “verdict”, howsoever determined, is the same whichever of our two views one takes.
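
          For what it’s worth, here is one way to code up the toy model (my reading of the search rule: the member with the largest observable sum is charged, provided that sum is at least 8; N = 20 and the repetition count are arbitrary choices):

          ```python
          import numpy as np

          rng = np.random.default_rng(42)
          N, REPS, THRESHOLD = 20, 200_000, 8   # smallish population, as suggested

          tried = guilty_at_trial = 0
          by_reveal = {v: [0, 0] for v in range(1, 7)}  # revealed die -> [guilty, tried]

          for _ in range(REPS):
              dice = rng.integers(1, 7, size=(N, 3))    # three d6 per member
              totals = dice.sum(axis=1)
              culprit = rng.choice(np.flatnonzero(totals == totals.max()))  # random ties
              obs = dice[:, :2]                         # the third die stays hidden
              obs_sums = obs.sum(axis=1)
              if obs_sums.max() < THRESHOLD:
                  continue                              # nobody "suspicious" enough
              defendant = rng.choice(np.flatnonzero(obs_sums == obs_sums.max()))
              tried += 1
              g = int(defendant == culprit)
              guilty_at_trial += g
              reveal = int(obs[defendant].max())        # what the "prosecution" shows
              by_reveal[reveal][0] += g
              by_reveal[reveal][1] += 1

          print(f"P(guilt | tried) = {guilty_at_trial / tried:.3f} vs base rate {1/N:.3f}")
          for v, (g, t) in by_reveal.items():
              if t:
                  print(f"P(guilt | reveal = {v}) = {g / t:.3f}   ({t} trials)")
          ```

          Under any charging rule that tracks the observable dice, P(“guilt” | “tried”) should come out well above the 1/N base rate, and the conditional probabilities should shift with the revealed die, which is the intuition above.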

      • Corey: We (on Elba) are used to pollution; that is why we have the EPA. No not the usual EPA, the Error Probability Association!

  3. Nathan Schachtman

    Corey,

    There is no common law or legislative answer to your question. The standard jury instructions in every state, and in the federal court system in the U.S., studiously avoid quantifying the posterior probability that the jury may accept as “proof beyond a reasonable doubt” (PBRD). Sometimes you will see a judge refer to PBRD as a “moral certainty,” whatever that is. I recall reading a survey of judges for what level of probability they believed provided minimal PBRD, and many if not most judges weighed in around 80%, which scares me out of my (legal) briefs.

    Nathan Schachtman

    • For what it’s worth, in general (like theft) cases, my decision theory students have usually chosen loss functions that equate to at least 99% probability of guilt to convict… once or twice as low as 90%, but never lower. If it’s a murder case, they will want higher for “life in prison.” A few classes have assigned a finite loss for the death penalty, mostly when I was teaching in Texas (and not even all the time there), but many have chosen an infinite loss or a loss that is so large relative to the other losses that the jury would never choose the death penalty.

      I have to say that I am not a lawyer and my interest in this subject came from my teaching and interest in finding real-world examples of how decision theory can/could be used. The irony is that someone called for jury duty who actually evidenced the sort of thinking that my students learn would probably be dismissed by one side or the other for “knowing too much!”

      • Bill: Thanks for the comments/references. Of course, on “knowing too much” I take it that always refers to knowledge of the case, and not to probability and statistics. It would be funny if they turned down someone for jury duty because they performed too well on a stat question. But maybe it could happen; Nathan will weigh in.

        • Mayo,

          I know of at least one prominent case. Astrophysicist Neil deGrasse Tyson told a story about how he was dismissed from a jury when he explained his work on how humans can be deceived by data.

    • Corey

      Nathan,

      I wasn’t really expecting a definitive answer of any sort — I just wanted to know how Mayo would respond when confronted with the tension between her concept of severity as a sliding scale and the necessity of making a binary go/no-go decision on “proved” guilty.

      Eighty percent!? Holy shit. Remind me to never be arrested in your fine nation…

      • Corey: Well I did respond. The hypothesis test setting is the most apt, and once the socio-legal system indicates the standard of evidence for the case at hand*, the binary inference follows. I never heard anyone say hypothesis tests weren’t binary enough.

        *Which isn’t to say this is straightforward, but rather outside the scope of statistics proper.

        • Corey

          Mayo: Of course you answered! When I wrote that I “wanted to know how [you] would respond,” I chose my phrasing carefully. I hope I didn’t give the impression that I was trying to *stump* you — I know better than that.

          I’m still curious to know how Jury Foreperson Mayo would go about assessing whether the evidence in some hypothetical trial met the burden of proof BARD.

  4. In case anyone’s interested, here’s a link to my “Error and the Law: Exchanges with Larry Laudan,” but you’d need the book to read Laudan.

    Click to access Ch%209%20error%20&%20the%20law%20Laudan.pdf

  5. Nathan Schachtman

    Bill,

    Thank you for your comments. You are certainly correct that the “probable cause” to hold a defendant for trial is not, and must not be, considered by the jury as providing a prior > 0. That would be to accredit the prosecutor’s evidence, and to wipe out the presumption of innocence promised to the defendant.

    Assigning a prior of 1/N is interesting, and has been the usual suggested starting place. Still, it is a small positive prior probability of guilt, and that is not a starting belief in innocence. It is not my understanding that the law requires a prior that is skeptical of guilt; it requires a presumption of innocence. And there may be other hypotheses for the cause of a crime (say an alleged murder) that do not involve the actions of a human agent. In some imagined hypotheticals, the death that is the subject of a murder claim may have been by natural causes, or self-inflicted.

    I will seek out Gigerenzer’s book, and any other references on this issue. I can be reached at the email, below.

    Thanks.

    Nathan
    nathan@schachtmanlaw.com

    • Hi Nathan, thanks for your email.

      Here’s the reason why I think a 1/N prior is justifiable.

      When the authorities find a murdered person (say) they have no evidence to assign guilt to any particular person. All they know is that it was committed by someone, probably someone within the nearby population in the city of N people. It’s only when they start to evaluate evidence (and this includes evidence such as “a significant fraction of murders are committed by people known to the victim”, for example) that they start to narrow the search for suspects to a smaller group.

      And even in the case that I cite above, this is evidence that will certainly be presented to the jury in one way or another, and therefore evidence that they might count twice if they were to use a prior much different from the one I recommend. For if they use a 1/N prior, then when (as is certain) the prosecution brings that information to their attention, the information that the perpetrator probably knew the victim can be taken into account in a principled way.
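
      As a concrete version of this with invented figures (the population size and both conditional probabilities are placeholders of mine): the “knew the victim” fact enters exactly once, as a likelihood ratio applied to the 1/N prior.

      ```python
      N = 100_000                      # hypothetical city population
      prior = 1 / N                    # skeptical starting prior on guilt

      # Invented figures: most murderers knew their victim, but very few
      # members of the city's population knew this particular victim.
      p_knew_given_guilty = 0.60
      p_knew_given_innocent = 0.001

      odds = (prior / (1 - prior)) * (p_knew_given_guilty / p_knew_given_innocent)
      posterior = odds / (1 + odds)
      print(f"P(guilt | knew the victim) = {posterior:.2%}")
      # ~0.60%: raised well above 1/N, but still very skeptical of guilt
      ```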

      • Paul

        It seems like the numerator could in some cases be greater than 1.

        • You can’t do better than to give an order of magnitude estimate. The important point is that the prior be quite skeptical of guilt, and let the evidence speak for itself to (possibly) overcome that skepticism. And as I said, in our system the jury is the sole judge of that evidence.

    • You wrote, “Assigning a prior of 1/N is interesting, and has been the usual suggested starting place. Still, it is a small prior, and that is not a starting belief in innocence.”

      I think I may not be understanding you correctly. A prior of 1/N is for me a prior on guilt. It says that there are N people in the city, one of whom is guilty, so the particular person that you have picked off the street has a probability 1/N of being guilty. It says that it is almost certain, with probability (N-1)/N, that that person is innocent. That is a starting belief in innocence.

  6. This is such an interesting subject; I think this case will be of interest to you (maybe you already know about it):

    Poincaré and Dreyfus

    Where Poincaré (one of the greatest mathematicians in history) et al. say:

    As it is impossible to know the probability a priori, we will not be able to say: this coincidence proves that the ratio of the probability of forgery to the inverse probability has such-and-such a value. We will only be able to say: upon the finding of this coincidence, this ratio becomes that many times larger than it was before the finding.

    Even after being thus restrained, there remain many traps to avoid. One is never sure of having made a complete enumeration of the possible causes, and it is thus that Laplace was carried into a memorable error on the subject of the probable direction of the rotation of the planets.

    And yeah, Poincaré is kicking Laplace’s ass… just saying 😀

    They also conclude:

    To want to eliminate the moral elements and to replace them by numbers is dangerous and futile. In a word, probability theory is not, as people appear to believe, a marvelous science that dispenses the scientist from the need for good sense.

  7. Cromwell’s rule was named by Dennis Lindley. There is some doggerel on the subject here http://www.senns.demon.co.uk/wpoetry.html

  8. Nathan Schachtman

    Stephen,

    Thanks for the poetry. I don’t think I ever expected to see Cromwell’s Rule so well integrated into a poem on statistics!

    Yes; Poincaré wrote a report in support of the defense in the Dreyfus case. Benjamin Peirce, father of Charles, testified in the Howland will case, which may well have been the first U.S. case to involve explicitly probabilistic evidence.

    I doubt that jurors would be dismissed for cause for their knowledge of statistics or decision theory, but many lawyers would use their peremptory challenges to “get rid of” such jurors. I recall a case I tried in New Brunswick, NJ, where jurors are allowed to ask questions. My adversary kept a juror who was an engineer, probably because of the color of his skin; my adversary thought that this juror would be sympathetic to his catastrophically injured client. But the juror kept asking plaintiffs’ expert witnesses technical questions about the statistical analyses in the articles relied upon. Indeed, these were questions that I had asked in a pretrial “Daubert” hearing to challenge the reliability of the expert witnesses’ conclusions, but I had lost that motion, and I was reluctant to spend time in my cross-examination on the most technical points before the jury.

    As for the appropriate prior in Bill’s murder hypothetical, I believe we can rest assured that evidence will not be counted twice, because the trial court will instruct the jury that the mere fact of the defendant’s being charged or indicted is not to be considered evidence of guilt at all. Now many jurors will disregard that instruction, and the defense must find ways to make the jurors skeptical of the police/government’s accuracy. (And this is where race, ethnicity, and political beliefs become such important areas of consideration for jury selection in trials.)

    I am still not persuaded that 1/N is an appropriate starting point. To start there is to assume that there is a finite affirmative belief in a person’s guilt, and that is not the law, which requires a belief in innocence. Suppose the murder took place on an island, and that the population of the island was known by the ferry owner to be 3. One of the 3 is found murdered. Suicide is ruled out by the manner of death. Should the jury be told that they must start with a prior probability of 0.5? Now we are no longer considering just a smidgeon of guilt. Of course, the Bayesian analysis must take into account that there is more than one issue for the jury to assess. It is not just the act that causes death that is the subject of the analysis, but also the state of mind (mens rea). What should we tell the jury is the prior for the analysis of whether the defendant intended the death, was reckless with regard to the death, or was negligent with respect to the death? Can we tell them that there is a prior probability 1/X that is the appropriate starting place for them to analyze the level of criminal intent?

    Nathan

    • Nathan,

      “I am still not persuaded that 1/N is an appropriate starting point. To start there is to assume that there is a finite affirmative belief in a person’s guilt, and that is not the law, which requires a belief in innocence.”

      Welcome to the Null Hypothesis Significance Test world!

      Where you can set a null hypothesis of innocence and calculate a P-value measuring how strange the evidence is under that innocence hypothesis. Total agreement with the law!

      Then we simply need to set a reasonable Type I error rate for each crime and we are done.
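
      A toy version of this with an entirely invented forensic scenario (the per-comparison match probability, the number of comparisons, and their independence are all my assumptions): treat innocence as the null and ask how improbable the observed degree of matching would be if the defendant were innocent.

      ```python
      from math import comb

      def binom_p_value(k: int, n: int, p0: float) -> float:
          """P(X >= k) for X ~ Binomial(n, p0): the probability of evidence at
          least this incriminating if the innocence null is true."""
          return sum(comb(n, i) * p0**i * (1 - p0)**(n - i) for i in range(k, n + 1))

      # 5 independent trace comparisons, each matching an innocent person 10%
      # of the time; the defendant matches on 4 of them.
      p = binom_p_value(k=4, n=5, p0=0.10)
      print(f"p-value under innocence = {p:.5f}")   # ~0.00046
      ```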

      We should only use Bayes when we unarguably can establish a prior.

      Poincaré was right.

      • Fran: I just noticed your comment which echoes mine. Null hypothesis testing does indeed afford the needed representation of the problem here. Of course in gathering up evidence against innocence, the prosecutors may use ordinary frequentist probabilities, as in forensics. Likewise for the defense.

  9. An appropriate way to view this context is by means of hypothesis tests in error statistics. One begins with a hypothetical presumption of innocence, in the same way we conjecture a “no effect” null (or “no positive effect”) in science. Then the prosecutor’s burden of proof is to provide evidence for rejecting the null. We give the alternative hypothesis, “evidence of guilt,” a hard time by controlling the type 1 error to a small number (chosen, in effect, by socio-legal stipulation). Different burdens of proof correspond to setting different error rates.
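
    As a schematic illustration only (the mapping of burdens of proof onto particular alpha levels is my placeholder; the thread leaves that stipulation to the socio-legal system, outside statistics proper): a more demanding burden of proof means a smaller type 1 error rate, and hence a more extreme test statistic before the presumption of innocence is rejected.

    ```python
    from statistics import NormalDist

    # Purely illustrative alphas; nothing in law fixes these numbers.
    burdens = {
        "preponderance of evidence (POE)": 0.10,
        "clear and convincing evidence": 0.01,
        "beyond a reasonable doubt (BARD)": 0.001,
    }

    for burden, alpha in burdens.items():
        z = NormalDist().inv_cdf(1 - alpha)   # one-sided critical value
        print(f"{burden:<34} alpha = {alpha:<5}  reject innocence if z > {z:.2f}")
    ```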
