Larry Laudan: “When the ‘Not-Guilty’ Falsely Pass for Innocent”, the Frequency of False Acquittals (guest post)


Professor Larry Laudan
Lecturer in Law and Philosophy
University of Texas at Austin

“When the ‘Not-Guilty’ Falsely Pass for Innocent” by Larry Laudan

While it is a belief deeply ingrained in the legal community (and among the public) that false negatives are much more common than false positives (a 10:1 ratio being the preferred guess), empirical studies of that question are few and far between. While false convictions have been carefully investigated in more than two dozen studies, there are virtually no well-designed studies of the frequency of false acquittals. The lack of interest in the latter question is dramatically borne out by looking at discussions among intellectuals of the two sorts of errors. (A search of Google Books identifies some 6.3k discussions of the former and only 144 treatments of the latter in the period from 1800 to now.) I’m persuaded that it is time we brought false negatives out of the shadows, not least because each such mistake carries significant potential harms, typically inflicted by falsely-acquitted recidivists who are on the streets instead of in prison.


In criminal law, false negatives occur under two circumstances: when a guilty defendant is acquitted at trial and when an arrested, guilty defendant has the charges against him dropped or dismissed by the judge or prosecutor. Almost no one tries to measure how often either type of false negative occurs. That is partly understandable, given that the legal system prohibits a judicial investigation into the correctness of an acquittal at trial; the double jeopardy principle guarantees that such acquittals are set in stone. Thanks in no small part to the general societal indifference to false negatives, there have been virtually no efforts to design empirical studies that would yield reliable figures on false acquittals. That means that my efforts here to estimate how often they occur must rely on a variety of indirect indicators. With a bit of ingenuity, it is possible to find data that provide strong clues as to approximately how often a truly guilty defendant is acquitted at trial and in the pre-trial process. The resulting inferences are not precise, and I will try to explain why as we go along. As we look at various data sources not initially designed to measure false negatives, we will see that they nonetheless provide salient information about when and why false acquittals occur, thereby enabling us to make an approximate estimate of their frequency.

My discussion of how to estimate the frequency of false negatives will fall into two parts, reflecting the stark differences between the sources of errors in pleas and the sources of error in trials. (All the data to be cited here deal entirely with cases of crimes of violence.)

i). Estimating the frequency of false negatives at trials. Trial acquittals represent a very small subset of overall acquittals. Specifically, of the 232k defendants who were arrested in 2008 for, but not convicted of, a violent crime, only 6% (15k) were acquitted at trial. Conventional wisdom has it that most defendants acquitted at trial are probably factually guilty. After all, so the usual argument goes, these defendants wouldn’t even be going to trial unless the prosecutor believed that he had a strong chance of persuading jurors that these defendants were guilty beyond a reasonable doubt.

While this argument does not rest on any solid data (and we will soon be looking at one that does), it enjoys a prima facie plausibility. Even if the prosecutor sometimes overestimates the strength of his case against the defendant, it seems reasonable to suppose that most defendants winning an acquittal at trial have an apparent guilt in the range from about 70% to 90%. One’s initial inclination in such circumstances is to suppose that at least half of those who are acquitted at trial actually committed the crime(s) they are charged with but the evidence allowed room for rational doubt about defendant’s guilt. Accordingly, one might assume that about half of those acquitted at trial are guilty, giving us some 7.5k false negatives, even though my strong suspicion is that the true figure is higher than that. There are two powerful reasons for thinking that this simplistic assumption understates the frequency of guilt among those acquitted at trial. They are as follows:

a). One potential source for corroborating my hunch involves looking at some interesting data from Scotland. There, the justice system uses BARD as the standard, as in the United States, and trial by jury. However, the Scottish system consists of three verdicts rather than the usual two: ‘guilty’, ‘guilt not proven’ and ‘not guilty’.[1] The intermediate verdict gives us a point of entry for trying to pin down the rate of false acquittals. A guilt-not-proven verdict is called for when i). the jury is persuaded that the defendant is factually guilty (that is, p(guilt)≧0.5) but ii). the jury is not convinced of that guilt beyond a reasonable doubt. Both the not-guilty and the guilt-not-proven verdicts count as official acquittals but they send decidedly different messages. In a study of criminal prosecutions in 2005 and 2006 done by the Scottish government, it turned out that 71% of those defendants tried for homicide and acquitted received a ‘guilt-not-proven’ verdict.[2] That means that about 7-in-10 acquittals in Scotland involve defendants regarded by the jurors as having probably committed the crime.

b). A different way of estimating the frequency of false acquittals at trials emerges from the monumental study by Kalven and Zeisel (The American Jury) of some 3,500+ jury trials in the US. The researchers asked judges in each of the trials that resulted in an acquittal whether, in the opinion of the judge, the case was ‘close’ (meaning the apparent guilt of the acquitted defendant verged on proof beyond a reasonable doubt) or whether it was a ‘clear’ acquittal (meaning that defendant’s apparent guilt was well below the BARD standard). According to the responses to this question (dealing with 1,191 acquittals), judges indicated that, in their opinions, only 5% of the trials resulted in ‘clear’ acquittals; by contrast, 52% of the cases were, in the view of judges, ‘clear for conviction’.[3]

Since about one-third of trials for violent crimes result in an acquittal, the Kalven-Zeisel data would seem to entail that only about 15% of the acquittals are ‘clearly’ acquittals, while some 85% are, in the opinion of the presiding judge, close cases. If, as in our example from 2008, there are some 15k acquittals, more than 12k of them are close enough to warrant an assumption that these are probably factually guilty defendants, even if their apparent guilt fails to eliminate all reasonable doubts.



Putting the two data sets together, it is fair to say that significantly more than half of those acquitted at trial of a violent crime were nonetheless regarded by the jurors and judges as probably guilty and thereby are reasonably assumed to be false negatives.[4] Accordingly, I shall hereafter assume that, among those 15k acquittals that emerged in trials for violent crimes in the US in 2008, some 11.2k of them were false negatives.
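As a quick check, the averaging behind that figure can be reproduced in a few lines (a sketch only; the ~70% Scottish share, the ~85% Kalven-Zeisel share, and the 15k acquittal count all come from the text, and the text rounds the mean down to ~75% as stated in note [4]):

```python
# Share of trial acquittals assumed to be false negatives:
# the mean of the Scottish (~70%) and Kalven-Zeisel (~85%) figures,
# which the post rounds to ~75% (see note [4]).
scottish_rate = 0.70
kalven_zeisel_rate = 0.85
assumed_rate = (scottish_rate + kalven_zeisel_rate) / 2   # ~0.775

trial_acquittals = 15_000            # violent-crime trial acquittals, US, 2008
false_negatives_at_trial = round(0.75 * trial_acquittals)
print(false_negatives_at_trial)      # 11250, i.e. the ~11.2k used above
```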

ii). False negatives in the dropping of charges (pre-trial acquittals). The much more intriguing question concerns the true guilt or innocence not of those 15k defendants acquitted at trial but of those 217k arrestees against whom charges were dropped or dismissed. Such decisions obviously came prior to trial, usually at the initiative of a prosecutor, sometimes at the initiative of a judge. We know that of those arrested by the police and charged with violent crimes in 2008, some 37% never made it to a trial or a plea bargain; the prosecutor or the pre-trial judge, in effect, acquitted them. But how many of those so acquitted were truly innocent? Fortunately, there are two very large studies that shed substantial light on the answer to that crucial question. Both depend on the responses of thousands of prosecutors who were quizzed about the reasons why they dropped the charges that they did. One such study, analyzing FBI-initiated prosecutions nationwide, provides annual data about the reasons why federal prosecutors have dropped (and judges have dismissed) charges against those accused of a violent crime. The second study, undertaken by the Bureau of Justice Statistics, looked at the same issue in state cases, where of course most violent crime adjudications take place.

What emerges from both studies is that many cases were dropped for reasons that may indicate defendant’s innocence, or at least the relative weakness of the prosecutor’s case against the defendant. I shall call these factors innocence-indicators. Both studies show that prosecutors have multiple reasons for the dismissal or dropping of charges against persons charged with a violent crime. Still, both data sets about prosecutorial decisions indicate that the dominant motive for dropping outstanding charges is not, as you might expect, a belief that the defendant is actually innocent.[5]

Sometimes, charges are dropped because of a defendant’s willingness to testify for the state in the separate trial of an accomplice. Occasionally, charges are dropped because the prosecutor discovers that the statute of limitations will expire before the trial can be scheduled, or that the defendant was a minor when the alleged crime occurred and should be tried in juvenile court. Prosecutors will also often drop charges if the rulings in the pre-trial evidence hearing indicate that the judge will exclude what the prosecutors deem to be highly inculpatory evidence of defendants’ guilt. When that occurs, the case against the defendant obviously becomes less compelling than it would have been if the relevant evidence were admitted. In fact, this was reported as the most frequent problem that prosecutors’ offices ran into.[6] Commonly, prosecutors cite limitations of personnel and financial resources to cope with all the cases on their docket as another reason for dropping charges. (So much for the common idea that prosecutors have virtually unlimited resources!) Charges are also likely to be dropped if a key witness for the state vanishes or changes her testimony (as the Bureau of Justice Statistics puts it: “the reason for this reluctance [to testify] was usually fear of reprisal, followed by actual threats against the victim or witness.”[7]), or if the defendant was awarded bail while awaiting trial and vanished, thereby becoming a fugitive at large.[8] Clearly, none of these reasons for dropping a case is, in any sense, an indicator of the defendant’s innocence.

Oftentimes, of course, charges are dropped for reasons that imply the weakness of the case against the defendant. A detailed report about the many decisions made in 2010 by federal prosecutors – in deciding whether to drop charges against some 7.3k detainees arrested by the FBI – claims that in 20.5% of dismissals, there appeared to be a ‘lack of criminal intent’; 7% of dropped charges were a result of the prosecutor’s decision that ‘no crime was committed’; and in another quarter of the dropped cases there were signs of ‘weak or insufficient evidence.’[9] That boils down to saying that, in federal trials for violent crimes, slightly less than half of all dismissals (48%) are motivated by factors other than a worry that defendant’s guilt might not be provable at trial. (Recall, too, that ‘insufficient evidence’ does not mean lack of substantial evidence that defendant committed the crime but rather evidence the prosecution believes is probably insufficient to establish defendant’s guilt beyond a reasonable doubt.)

This already gives us reason to suspect that about half of the cases where charges are dropped involve the abandonment of charges against defendants whom the prosecutor thought were probably factually guilty but was not at all sure that he could prove that guilt beyond a reasonable doubt. That argument becomes much more convincing when we remind ourselves of how defendants came to the prosecutors’ attention in the first place. Typically, a person becomes the object of police investigations initially as nothing more than a suspect, perhaps among several others who strike the police as possible culprits. If, after further inquiries and the analysis of more evidence, police decide to file charges (thereby ‘clearing’ the case as far as the police are concerned), they are required to have grounds to believe that it is more likely than not that defendant committed the crime. To make the arrest official, the police must persuade either a judge or a grand jury (or both) that a rational person, confronted with the available evidence, would conclude that defendant probably committed the crime.

Accordingly, by the time the prosecutor typically gets deeply into the act, he is dealing with a host of arrestees, each of whom is considered by the police, a grand jury and the arraigning judge to be more likely than not to be guilty on the available evidence. As the prosecutor begins assembling his case, some new evidence will often come in or be actively sought. Sometimes, that evidence will be exculpatory, and persuade the prosecutor that defendant really did not commit the crime. Much more often, though, the decision point for the prosecutor arrives when, after having reviewed the evidence, he must decide whether the case against the defendant is strong enough to persuade a trial jury that the defendant is guilty beyond a reasonable doubt. Supposing, with many scholars, that this standard represents roughly a 90+% likelihood of guilt, this means that most of those now charged with a crime have an apparent guilt that falls in the very broad range from 50+% to something close to 100%. The prosecutor will generally cull those defendants in the range of 50-80% apparent guilt out of the class of those he intends to take to trial or to negotiate a plea bargain with.



Why would he do that? When apparent guilt is in that range, the prosecutor knows that it is unlikely that he will be able to persuade the defendant to accept a plea bargain and he also knows that, if he takes the defendant to trial, it will probably result in an acquittal. There are moral reasons as well that lead to the dropping of charges,[10] even against those whom the prosecutor believes to be factually guilty.

The second pertinent study on this vexing issue of the frequency of guilt among those dropped out of the system prior to trial was published in 1992.[11] Unlike the FBI study, this one investigated state (rather than federal) criminal trials. It included some 40k cases. The researchers asked prosecutors why they had dropped charges in the cases (or why judges had dismissed charges) when they did. Three of the reasons given appear to be innocence-indicators: ‘evidence issues’, ‘witness problems’ and ‘the interests of justice’. Some 35% of the dropped/dismissed cases were attributed to these reasons. That left 65% of the abandoned cases involving reasons implying nothing about guilt or innocence.[12] An earlier study of 17,500 arrests in Washington, D.C. federal courts indicates that the prosecutor dropped 3.6k cases but only a third of those dismissals (34%) were attributed to ‘insufficiency of evidence’.[13]

Taking the mean between the FBI probably-guilty rate of 47% and the BJS value of 65%, we arrive at the estimate that about 56% of the dismissed and dropped arrestees were probably factually guilty. Even so, that figure doesn’t take us fully where we want to go. We’re after a reasonable estimate of the number of truly guilty who have the charges against them either dropped or dismissed. The fact that the 56% of arrestees against whom charges were dropped are probably guilty does not yet give us a definite way of determining how many of them were actually guilty.

There is, however, a way of generating the result we seek. Remember that the defendants in this group were dropped or dismissed for reasons that had nothing to do with signs of their innocence. Hence, we can reasonably suppose that the proportion of guilty among them would be about the same as the proportion of guilty among those who go to trial. (After all, there is no perceived evidential weakness in the case against them that distinguished them from those who do go to trial.) Exactly two-thirds of those who went to trial for a violent crime were convicted. We have already explained why we assume that 75% of those acquitted at trial are probably truly guilty.

That seems to provide a plausible rationale for saying that, among those defendants who had the charges against them dropped for non-evidentiary reasons, approximately two-out-of-three (and probably more) are highly likely to be guilty. Hence, we shall assume that about 37% (that is, two-thirds of the 56% of those who were booted out of the trial system for non-evidentiary reasons) are factually guilty (and, if they had gone to trial, would have been convicted). This amounts to 81k false negatives. When added to the estimate of 12k probably guilty defendants among those acquitted at trial, this figure entails that, at a minimum, some 93k of the 595k arrestees are acquitted although truly guilty. This suggests a false negative rate of ~40% (viz., 93k guilty out of 232k acquitted). The false positive rate in this example is 3% (some 11k falsely convicted defendants out of 360k convicted).
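The whole chain of estimates above can be checked with a few lines of arithmetic (a sketch only; every count and rate is taken from the text, and the text's own rounding of the ~11.2k trial figure up to 12k is reproduced):

```python
# Reproducing the post's running arithmetic; all counts and rates are from the text.
convicted            = 360_000   # violent-crime convictions, US, 2008
acquitted            = 232_000   # arrestees not convicted
trial_acquittals     =  15_000   # acquitted at trial
dropped_or_dismissed = 217_000   # charges dropped or dismissed pre-trial

fn_trial = 0.75 * trial_acquittals        # ~11.2k; the post rounds this to 12k
probably_guilty = (0.47 + 0.65) / 2       # mean of the FBI and BJS rates: 0.56
fn_pretrial = (2 / 3) * probably_guilty * dropped_or_dismissed   # ~81k

total_fn = 12_000 + round(fn_pretrial)    # the post's ~93k false negatives
fn_rate = total_fn / acquitted            # ~0.40
fp_rate = 11_000 / convicted              # ~0.03
print(total_fn, round(fn_rate, 2), round(fp_rate, 2))
```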

It remains to be seen whether this pattern of error distribution serves the interests of society. That is the subject of my next book. For now, I will simply note that recidivism data show unambiguously that the ~93k false negatives do vastly more harm to innocent citizens than the 11k false positives do. Quite clearly, the current standard of proof needs drastic re-adjustment.




[1] For a lengthy discussion of the Scottish verdict system, see my “Need Verdicts Come in Pairs?” International Journal of Evidence and Proof, vol. 14 (2010), 1-24.

[2] Scottish Government Statistical Bulletin, Crim/2006/Part 11. The data come from the years 2004-2005.

[3] Kalven & Zeisel, op. cit., Table 32.

[4] Given the Scottish estimate of ~70% false negatives at trial and the Kalven-Zeisel estimate of an 85% false negative rate in trials, I shall assume a false negative rate of ~75% in acquittals at trial.

[5] See especially US Dept. of Justice, United States Attorneys’ Annual Statistical Report, 2010.

[6] Ibid., Table 6.

[7] BJS, Prosecutors in State Courts, 1994 (1996), p.5.

[8] 5% of those on bail awaiting trial on a murder charge become fugitives. BJS, Felony Defendants in Large Urban Counties, 2009 –Statistical Tables, Table 18.

[9] The detailed breakdown of the relevant data can be found in Table 14 of US Dept. of Justice, United States Attorneys’ Annual Statistical Report, 2010. In that year, the FBI declined to prosecute some 7,252 cases of arrested defendants (794 of these cases were violent crimes) (ibid., Table 3).

[10] The ethics manual of the American Bar Association, the ABA Standards for Criminal Justice: Prosecution and Defense Function, insists that prosecutors “should not institute, or cause to be instituted, or permit the continued pendency of criminal charges when the prosecutor knows that the charges are not supported by probable cause.” (Standard 3-3.9) It goes on to say that the prosecutor should drop charges against the defendant if there is “reasonable doubt that the accused is in fact guilty.” (ibid.)

[11] Barbara Boland et al., The Prosecution of Felony Arrests, 1988. Bureau of Justice Statistics, 1992.

[12] Here were the data for some of the cities in their study: Denver (46% dropped because of innocence issue); Los Angeles (50%); Manhattan (43%); St. Louis (20%); San Diego (27%); Seattle (25%); and Washington, D.C. (37%). Id., Table 5.

[13] Brian Forst et al., What Happens after Arrest? Institute for Law and Social Research (1977), Exhibit 5.1.

Earlier guest post by Laudan:

Larry Laudan. Why Presuming Innocence is Not a Bayesian Prior

Truth, Error, and Criminal Law: An Essay in Legal Epistemology (Cambridge Studies in Philosophy and Law) by Larry Laudan (Apr 28, 2008)



Stapel’s Fix for Science? Admit the story you want to tell and how you “fixed” the statistics to support it!




The recent case of Michael LaCour, suspected of faking data in a (now retracted) study on how to promote support for gay marriage, is directing a bit of limelight onto our star fraudster Diederik Stapel (50+ retractions).

The Chronicle of Higher Education just published an article by Tom Bartlett: Can a Longtime Fraud Help Fix Science? You can read his full interview with Stapel here. A snippet:

You write that “every psychologist has a toolbox of statistical and methodological procedures for those days when the numbers don’t turn out quite right.” Do you think every psychologist uses that toolbox? In other words, is everyone at least a little bit dirty?

Stapel: In essence, yes. The universe doesn’t give answers. There are no data matrices out there. We have to select from reality, and we have to interpret. There’s always dirt, and there’s always selection, and there’s always interpretation. That doesn’t mean it’s all untruthful. We’re dirty because we can only live with models of reality rather than reality itself. It doesn’t mean it’s all a bag of tricks and lies. But that’s where the inconvenience starts.

I think the solution is in accepting this and saying these are the tips and tricks, and this is the story I want to tell, and this is how I did it, instead of trying to pose as if it’s real. We should be more open about saying, I’m using this trick, this statistical method, and people can figure out for themselves. It’s the illusion that these models are one-to-one descriptions of reality. That’s what we hope for, but that’s of course not true. 

This is our “dirty hands” argument, so often used these days, coupled with claims of so-called “perverse incentives,” to excuse QRPs (questionable research practices), bias, and flat-out cheating. The leap from “our models are invariably idealizations” to “we all have dirty hands” to “statistical tricks cannot be helped” may inadvertently be encouraged by some articles on how to “fix” science.

Earlier in the interview:

You mention lots of possible reasons for your fraud: laziness, ambition, a short attention span. One of the more intriguing reasons to me — and you mention it twice in the book — is nihilism. Do you mean that? Did you think of yourself as a nihilist? Then or now?

Stapel: I’m not sure I’m a nihilist. ….

Did you think of the work you were doing as meaningful?

Stapel: I was raised in the 1980s, at the height of postmodernism, and that was something I related to. I studied many of the French postmodernists. That made me question meaningfulness. I had a hard time explaining the meaningfulness of my work to students.

I’ll bet.

I agree with Bartlett that you don’t have to have any sympathy with a fraudster to possibly learn from him about preventing doctored statistics, or sharpening fraudbusting skills, except that it turns out Stapel really and truly believes science is a fraud![ii] In his pristine accomplishment of using no data at all, rather than merely subjecting them to extraordinary rendition (leaving others to wrangle over the fine points of statistics), you could say that Stapel is the ultimate, radical, postmodern scientific anarchist. Stapel is a personable guy, and I’ve had some interesting exchanges with him; but on that basis, from his “Fictionfactory,” and autobiography, “Derailment,” I say he’s the wrong person to ask. He still doesn’t get it!


[i] There are several posts on this blog that discuss Stapel:

Some Statistical Dirty Laundry

Derailment: Faking Science: A true story of academic fraud, by Diederik Stapel (translated into English)
Should a “fictionfactory” peepshow be barred from a festival on “Truth and Reality”? Diederik Stapel says no

How to hire a fraudster chauffeur (includes video of Stapel’s TED talk)

50 shades of grey between error and fraud

Thinking of Eating Meat Causes Antisocial behavior

[ii] At least social science, social psychology. He may be right that the effects are small or uninteresting in social psych.



3 years ago…

MONTHLY MEMORY LANE: 3 years ago: June 2012. I mark in red three posts that seem most apt for general background on key issues in this blog.[1]  It was extremely difficult to pick only 3 this month; please check out others that look interesting to you. This new feature, appearing the last week of each month, began at the blog’s 3-year anniversary in Sept, 2014.


June 2012

[1]excluding those recently reblogged. Posts that are part of a “unit” or a group of “U-Phils” count as one.


Can You Change Your Bayesian Prior? (ii)



This is one of the questions high on the “To Do” list I’ve been keeping for this blog.  The question grew out of discussions of “updating and downdating” in relation to papers by Stephen Senn (2011) and Andrew Gelman (2011) in Rationality, Markets, and Morals.[i]

“As an exercise in mathematics [computing a posterior based on the client’s prior probabilities] is not superior to showing the client the data, eliciting a posterior distribution and then calculating the prior distribution; as an exercise in inference Bayesian updating does not appear to have greater claims than ‘downdating’.” (Senn, 2011, p. 59)

“If you could really express your uncertainty as a prior distribution, then you could just as well observe data and directly write your subjective posterior distribution, and there would be no need for statistical analysis at all.” (Gelman, 2011, p. 77)

But if uncertainty is not expressible as a prior, then a major linchpin for Bayesian updating seems questionable. If, on the other hand, you can go from the posterior back to the prior, perhaps the data can also lead you to come back and change the prior itself.
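The updating/"downdating" symmetry can be made concrete with a minimal conjugate Beta-Binomial sketch (the Beta(2,2) prior and the 7-successes-in-10-trials data are illustrative assumptions of mine, not from Senn or Gelman): going from prior to posterior and from posterior back to prior are the same bookkeeping run in opposite directions.

```python
# Conjugate Beta-Binomial: with prior Beta(a, b) and k successes in n trials,
# the posterior is Beta(a + k, b + n - k).
a, b = 2.0, 2.0          # illustrative prior, an assumption for this sketch
k, n = 7, 10             # illustrative data

post_a, post_b = a + k, b + (n - k)      # "updating": prior + data -> posterior

# "Downdating": given the posterior and the data, the prior is recovered
# exactly by subtracting the same counts -- nothing in the arithmetic
# privileges the prior-first order of elicitation.
prior_a, prior_b = post_a - k, post_b - (n - k)
assert (prior_a, prior_b) == (a, b)
print(post_a, post_b)    # 9.0 5.0
```

Of course, this symmetry holds only within the conjugate bookkeeping; whether one may legitimately *replace* a prior after seeing the data is precisely what the quotations that follow dispute.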

Is it legitimate to change one’s prior based on the data?

I don’t mean update it, but reject the one you had and replace it with another. My question may yield different answers depending on the particular Bayesian view. I am prepared to restrict the entire question of changing priors to Bayesian “probabilisms”, meaning the inference takes the form of updating priors to yield posteriors, or to report a comparative Bayes factor. Interpretations can vary. In many Bayesian accounts the prior probability distribution is a way of introducing prior beliefs into the analysis (as with subjective Bayesians) or, conversely, to avoid introducing prior beliefs (as with reference or conventional priors). Empirical Bayesians employ frequentist priors based on similar studies or well established theory. There are many other variants.



S. SENN: According to Senn, one test of whether an approach is Bayesian is that while

“arrival of new data will, of course, require you to update your prior distribution to being a posterior distribution, no conceivable possible constellation of results can cause you to wish to change your prior distribution. If it does, you had the wrong prior distribution and this prior distribution would therefore have been wrong even for cases that did not leave you wishing to change it.” (Senn, 2011, 63)

“If you cannot go back to the drawing board, one seems stuck with priors one now regards as wrong; if one does change them, then what was the meaning of the prior as carrying prior information?” (Senn, 2011, p. 58)

I take it that Senn is referring to a Bayesian prior expressing belief. (He will correct me if I’m wrong.)[ii] Senn takes the upshot to be that priors cannot be changed based on data. Is there a principled ground for blocking such moves?

I.J. GOOD: The traditional idea was that one would have thought very hard about one’s prior before proceeding—that’s what Jack Good always said. Good advocated his device of “imaginary results,” whereby one would envisage all possible results in advance (1971, p. 431) and choose a prior that one could live with whatever happens. This could take a long time! Given how difficult this would be in practice, Good allowed

“that it is possible after all to change a prior in the light of actual experimental results [but] rationality of type II has to be used.” (Good 1971, p. 431)

Maybe this is an example of what Senn calls requiring the informal to come to the rescue of the formal? Good was commenting on D. J. Bartholomew [iii] in the same wonderful volume (edited by Godambe and Sprott).

D. LINDLEY: According to subjective Bayesian Dennis Lindley:

“[I]f a prior leads to an unacceptable posterior then I modify it to cohere with properties that seem desirable in the inference.”(Lindley 1971, p. 436)

This would seem to open the door to all kinds of verification biases, wouldn’t it? This is the same Lindley who famously declared:

“I am often asked if the method gives the right answer: or, more particularly, how do you know if you have got the right prior. My reply is that I don’t know what is meant by “right” in this context. The Bayesian theory is about coherence, not about right or wrong.” (1976, p. 359)

H. KYBURG:  Philosopher Henry Kyburg (who wrote a book on subjective probability, but was or became a frequentist) gives what I took to be the standard line (for subjective Bayesians at least):

“There is no way I can be in error in my prior distribution for μ – unless I make a logical error… . It is that very fact that makes this prior distribution perniciously subjective. It represents an assumption that has consequences, but cannot be corrected by criticism or further evidence.” (Kyburg 1993, p. 147)

It can be updated of course via Bayes rule.

D.R. COX: While recognizing the serious problem of “temporal incoherence” (a violation of diachronic Bayes updating), David Cox writes:

“On the other hand [temporal coherency] is not inevitable and there is nothing intrinsically inconsistent in changing prior assessments” in the light of data; however, the danger is that “even initially very surprising effects can post hoc be made to seem plausible.” (Cox 2006, p. 78)

An analogous worry would arise, Cox notes, if frequentists permit data dependent selections of hypotheses (significance seeking, cherry picking, etc). However, frequentists (if they are not to be guilty of cheating) would need to take into account any adjustments to the overall error probabilities of the test. But the Bayesian is not in the business of computing error probabilities associated with a method for reaching posteriors. At least not traditionally. Would Bayesians even be required to report such shifts of priors? (A principle is needed.)

What if the proposed adjustment of prior is based on the data and resulting likelihoods, rather than an impetus to ensure one’s favorite hypothesis gets a desirable posterior? After all, Jim Berger says that prior elicitation typically takes place after “the expert has already seen the data” (2006, p. 392). Do they instruct them to try not to take the data into account? Anyway, if the prior is determined post-data, then one wonders how it can be seen to reflect information distinct from the data under analysis. All the work to obtain posteriors would have been accomplished by the likelihoods. There’s also the issue of using data twice.

So what do you think is the answer? Does it differ for subjective vs conventional vs other stripes of Bayesian?

[i] Both were contributions to the RMM (2011) volume, Special Topic: Statistical Science and Philosophy of Science: Where Do (Should) They Meet in 2011 and Beyond? (edited by D. Mayo, A. Spanos, and K. Staley). The volume was an outgrowth of a 2010 conference that Spanos and I (and others) ran in London, and conversations that emerged soon after. See full list of participants, talks and sponsors here.

[ii] Senn and I had a published exchange on his paper that was based on my “deconstruction” of him on this blog, followed by his response! The published comments are here (Mayo) and here (Senn).

[iii] At first I thought Good was commenting on Lindley. Bartholomew came up in this blog in discussing when Bayesians and frequentists can agree on numbers.


Gelman, A. 2011. “Induction and Deduction in Bayesian Data Analysis.”
Senn, S. 2011. “You May Believe You Are a Bayesian But You Are Probably Wrong.”
Berger, J. O.  2006. “The Case for Objective Bayesian Analysis.”

Discussions and responses on Senn and Gelman can be found by searching this blog:

Commentary on Berger & Goldstein: Christen, Draper, Fienberg, Kadane, Kass, Wasserman.
Rejoinders: Berger, Goldstein.


Berger, J. O.  2006. “The Case for Objective Bayesian Analysis.” Bayesian Analysis 1 (3): 385–402.

Cox, D. R. 2006. Principles of Statistical Inference. Cambridge, UK: Cambridge University Press.

Gelman, A. 2011. “Induction and Deduction in Bayesian Data Analysis.”  Rationality, Markets and Morals: Studies at the Intersection of Philosophy and Economics 2 (Special Topic: Statistical Science and Philosophy of Science): 67–78.

Godambe, V. P., and D. A. Sprott, eds. 1971. Foundations of Statistical Inference. Toronto: Holt, Rinehart and Winston of Canada.

Good, I. J. 1971. “Comment on Bartholomew.” In Foundations of Statistical Inference, edited by V. P. Godambe and D. A. Sprott, 108–122. Toronto: Holt, Rinehart and Winston of Canada.

Kyburg, H. E. Jr. 1993. “The Scope of Bayesian Reasoning.” In Philosophy of Science Association: PSA 1992, vol 2, 139-152. East Lansing: Philosophy of Science Association.

Lindley, D. V. 1971. “The Estimation of Many Parameters.” In Foundations of Statistical Inference, edited by V. P. Godambe and D. A. Sprott, 435–455. Toronto: Holt, Rinehart and Winston of Canada.

Lindley, D. V. 1976. “Bayesian Statistics.” In Foundations of Probability Theory, Statistical Inference and Statistical Theories of Science, edited by W. L. Harper and C. A. Hooker, 353–362. Dordrecht: D. Reidel.

Senn, S. 2011. “You May Believe You Are a Bayesian But You Are Probably Wrong.” Rationality, Markets and Morals: Studies at the Intersection of Philosophy and Economics 2 (Special Topic: Statistical Science and Philosophy of Science): 48–66.

Categories: Bayesian/frequentist, Gelman, S. Senn, Statistics | 111 Comments

Some statistical dirty laundry: The Tilburg (Stapel) Report on “Flawed Science”



I had a chance to reread the 2012 Tilburg Report* on “Flawed Science” last night. The full report is now here. The discussion of the statistics is around pp. 17-21 (of course there was so little actual data in this case!). You might find it interesting. Here are some stray thoughts reblogged from 2 years ago…

1. Slipping into pseudoscience.
The authors of the Report say they never anticipated giving a laundry list of “undesirable conduct” by which researchers can flout pretty obvious requirements for the responsible practice of science. It was an accidental byproduct of the investigation of one case (Diederik Stapel, social psychology) that they walked into a culture of “verification bias”[1]. Maybe that’s why I find it so telling. It’s as if they could scarcely believe their ears when people they interviewed “defended the serious and less serious violations of proper scientific method with the words: that is what I have learned in practice; everyone in my research environment does the same, and so does everyone we talk to at international conferences” (Report 48). So they trot out some obvious rules, and it seems to me that they do a rather good job.

One of the most fundamental rules of scientific research is that an investigation must be designed in such a way that facts that might refute the research hypotheses are given at least an equal chance of emerging as do facts that confirm the research hypotheses. Violations of this fundamental rule, such as continuing an experiment until it works as desired, or excluding unwelcome experimental subjects or results, inevitably tends to confirm the researcher’s research hypotheses, and essentially render the hypotheses immune to the facts…. [T]he use of research procedures in such a way as to ‘repress’ negative results by some means” may be called verification bias. [my emphasis] (Report, 48).

I would place techniques for ‘verification bias’ under the general umbrella of techniques for squelching stringent criticism and repressing severe tests. These gambits make it so easy to find apparent support for one’s pet theory or hypotheses as to count as no evidence at all (see some from their list). Any field that regularly proceeds this way I would call a pseudoscience, or non-science, following Popper. “Observations or experiments can be accepted as supporting a theory (or a hypothesis, or a scientific assertion) only if these observations or experiments are severe tests of the theory” (Popper 1994, p. 89). [2] It is unclear at what point a field slips into the pseudoscience realm.

2. A role for philosophy of science?
I am intrigued that one of the final recommendations in the Report is this:

In the training program for PhD students, the relevant basic principles of philosophy of science, methodology, ethics and statistics that enable the responsible practice of science must be covered. Based on these insights, research Master’s students and PhD students must receive practical training from their supervisors in the application of the rules governing proper and honest scientific research, which should include examples of such undesirable conduct as data massage. The Graduate School must explicitly ensure that this is implemented.

A philosophy department could well create an entire core specialization that revolved around “the relevant basic principles of philosophy of science, methodology, ethics and statistics that enable the responsible practice of science” (ideally linked with one or more other departments).  That would be both innovative and fill an important gap, it seems to me. Is anyone doing this?

3. Hanging out some statistical dirty laundry.
Items in their laundry list include:

  • An experiment fails to yield the expected statistically significant results. The experiment is repeated, often with minor changes in the manipulation or other conditions, and the only experiment subsequently reported is the one that did yield the expected results. The article makes no mention of this exploratory method… It should be clear, certainly with the usually modest numbers of experimental subjects, that using experiments in this way can easily lead to an accumulation of chance findings….
  • A variant of the above method is: a given experiment does not yield statistically significant differences between the experimental and control groups. The experimental group is compared with a control group from a different experiment—reasoning that ‘they are all equivalent random groups after all’—and thus the desired significant differences are found. This fact likewise goes unmentioned in the article….
  • The removal of experimental conditions. For example, the experimental manipulation in an experiment has three values. …Two of the three conditions perform in accordance with the research hypotheses, but a third does not. With no mention in the article of the omission, the third condition is left out….
  • The merging of data from multiple experiments [where data] had been combined in a fairly selective way,…in order to increase the number of subjects to arrive at significant results…
  • Research findings were based on only some of the experimental subjects, without reporting this in the article. On the one hand ‘outliers’…were removed from the analysis whenever no significant results were obtained. (Report, 49-50)
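The first gambit on this list is easy to simulate (my sketch, not from the Report): even when each attempt uses honest data and the null is true, repeating the experiment until one version reaches nominal significance, and reporting only that one, inflates the rate of spurious findings well beyond 5%.

```python
import random
import statistics

random.seed(1)

def one_experiment(n=20):
    """One two-group experiment under a true null: both groups N(0, 1)."""
    a = [random.gauss(0, 1) for _ in range(n)]
    b = [random.gauss(0, 1) for _ in range(n)]
    se = (statistics.pvariance(a) / n + statistics.pvariance(b) / n) ** 0.5
    z = (statistics.mean(a) - statistics.mean(b)) / se
    return abs(z) > 1.96  # nominally "significant at the 5% level"

def hit_rate(tries, trials=2000):
    """Fraction of simulated researchers who obtain 'significance'
    within `tries` repeated attempts (reporting only the hit)."""
    return sum(any(one_experiment() for _ in range(tries))
               for _ in range(trials)) / trials

# A single attempt stays near the nominal 5%; allowing five attempts
# pushes the rate of spurious findings toward 1 - 0.95**5, roughly 23%.
print(hit_rate(1), hit_rate(5))
```

The unreported attempts are exactly the "exploratory method" the Report says goes unmentioned in the articles.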

For many further examples, and also caveats [3], see the Report.

4.  Significance tests don’t abuse science, people do.
Interestingly, the Report distinguishes the above laundry list from the “statistical incompetence and lack of interest found” (52). If the methods used were statistical, then the scrutiny might be called “metastatistical”, or the full scrutiny “meta-methodological”. Stapel often fabricated data, but the upshot of these criticisms is that sufficient finagling may similarly predetermine that a researcher’s favorite hypothesis gets support. (There is obviously a big advantage in having the data to scrutinize, as many are now demanding.) Is it a problem of these methods that they are abused? Or does the fault lie with the abusers? Obviously the latter. Statistical methods don’t kill scientific validity, people do.

I have long rejected dichotomous testing, but the gambits in the laundry list create problems even for more sophisticated uses of methods, e.g., for indicating magnitudes of discrepancy and associated confidence intervals. At least these methods admit of tools for mounting a critique.

In “The Mind of a Con Man” (NY Times, April 26, 2013[4]), Diederik Stapel explains how he always read the research literature extensively to generate his hypotheses, “so that it was believable and could be argued that this was the only logical thing you would find.” Rather than report on believability, researchers need to report the properties of the methods they used: What was their capacity to have identified, avoided, or admitted verification bias? The role of probability here would not be to quantify the degree of confidence or believability in a hypothesis, given the background theory or most intuitively plausible paradigms, but rather to check how severely probed or well-tested a hypothesis is, whether the assessment is formal, quasi-formal or informal. Was a good job done in scrutinizing flaws, or a terrible one? Or was there just a bit of data massaging and cherry picking to support the desired conclusion? As a matter of routine, researchers should tell us. Yes, as Joe Simmons, Leif Nelson and Uri Simonsohn suggest in “A 21-word solution”, they should “say it!” I am no longer inclined to regard their recommendation as unserious: researchers who are “clean” should go ahead and “clap their hands”[5]. (I will consider their article in a later post…)


*The subtitle is “The fraudulent research practices of social psychologist Diederik Stapel.”

[1] “A ‘byproduct’ of the Committees’ inquiries is the conclusion that, far more than was originally assumed, there are certain aspects of the discipline itself that should be deemed undesirable or even incorrect from the perspective of academic standards and scientific integrity.” (Report 54).

[2] Mere falsifiability, by the way, does not suffice for stringency; but there are also methods Popper rejects that could yield severe tests, e.g., double counting. (Search this blog for more entries.)

[3] “It goes without saying that the Committees are not suggesting that unsound research practices are commonplace in social psychology. …although they consider the findings of this report to be sufficient reason for the field of social psychology in the Netherlands and abroad to set up a thorough internal inquiry into the state of affairs in the field” (Report, 48).

[4] Philosopher Janet Stemwedel discusses the NY Times article, noting that Stapel taught a course on research ethics!

[5] From Simmons, Nelson and Simonsohn:

 Many support our call for transparency, and agree that researchers should fully disclose details of data collection and analysis. Many do not agree. What follows is a message for the former; we begin by preaching to the choir.

Choir: There is no need to wait for everyone to catch up with your desire for a more transparent science. If you did not p-hack a finding, say it, and your results will be evaluated with the greater confidence they deserve.

If you determined sample size in advance, say it.

If you did not drop any variables, say it.

If you did not drop any conditions, say it.

The Fall 2012 Newsletter for the Society for Personality and Social Psychology
Popper, K. 1994. The Myth of the Framework. London: Routledge.
Categories: junk science, spurious p values | 13 Comments

Evidence can only strengthen a prior belief in low data veracity, N. Liberman & M. Denzler: “Response”



I thought the criticisms of social psychologist Jens Förster were already quite damning (despite some attempts to explain them as mere QRPs), but there has recently been some pushback from two of his co-authors, Liberman and Denzler. Their objections are directed at the application of a distinct method, touted as “Bayesian forensics”, to their joint work with Förster. I discussed it very briefly in a recent “rejected post“. Perhaps the earlier method of criticism was inapplicable to these additional papers, and there’s an interest in seeing those papers retracted as well as the one that was; I don’t claim to know. A distinct “policy” issue is whether there should be uniform standards for retraction calls. At the very least, one would think new methods should be well vetted before authors are subjected to their indictment (particularly methods, like this one, that are incapable of issuing in exculpatory evidence). Here’s a portion of their response. I don’t claim to be up on this case, but I’d be very glad to have reader feedback.

Nira Liberman, School of Psychological Sciences, Tel Aviv University, Israel

Markus Denzler, Federal University of Applied Administrative Sciences, Germany

June 7, 2015

Response to a Report Published by the University of Amsterdam

The University of Amsterdam (UvA) has recently announced the completion of a report that summarizes an examination of all the empirical articles by Jens Förster (JF) during the years of his affiliation with UvA, including those co-authored by us. The report is available online. The report relies solely on statistical evaluation, using the method originally employed in the anonymous complaint against JF, as well as a new version of a method for detecting “low scientific veracity” of data, developed by Prof. Klaassen (2015). The report concludes that some of the examined publications show “strong statistical evidence for low scientific veracity”, some show “inconclusive evidence for low scientific veracity”, and some show “no evidence for low veracity”. UvA announced that on the basis of that report, it would send letters to the Journals, asking them to retract articles from the first category, and to consider retraction of articles in the second category.

After examining the report, we have reached the conclusion that it is misleading, biased, and based on erroneous statistical procedures. In view of that, we surmise that it does not present reliable evidence for “low scientific veracity”.

We ask you to consider our criticism of the methods used in UvA’s report and the procedures leading to their recommendations in your decision.

Let us emphasize that we never fabricated or manipulated data, nor have we ever witnessed such behavior on the part of Jens Förster or other co-authors.

Here are our major points of criticism. Please note that, due to time considerations, our examination and criticism focus on papers co-authored by us. Below, we provide some background information and then elaborate on these points. Continue reading

Categories: junk science, reproducibility | Tags: | 9 Comments

Fraudulent until proved innocent: Is this really the new “Bayesian Forensics”? (rejected post)





Categories: evidence-based policy, frequentist/Bayesian, junk science, Rejected Posts | 2 Comments

What Would Replication Research Under an Error Statistical Philosophy Be?

Around a year ago on this blog I wrote:

“There are some ironic twists in the way psychology is dealing with its replication crisis that may well threaten even the most sincere efforts to put the field on firmer scientific footing”

That’s philosopher’s talk for “I see a rich source of problems that cry out for the ministrations of philosophers of science and of statistics”. Yesterday, I began my talk at the Society for Philosophy and Psychology workshop on “Replication in the Sciences” with examples of two main philosophical tasks: to clarify concepts, and to reveal inconsistencies, tensions and ironies surrounding methodological “discomforts” in scientific practice.

Example of a conceptual clarification 

Editors of a journal, Basic and Applied Social Psychology, announced they are banning statistical hypothesis testing because it is “invalid” (A puzzle about the latest “test ban”)

It’s invalid because it does not supply “the probability of the null hypothesis, given the finding” (the posterior probability of H0) (Trafimow and Marks 2015)

  • Since the methodology of testing explicitly rejects the mode of inference it is faulted for not supplying, it would be incorrect to claim the methods are invalid on that ground.
  • Simple conceptual job that philosophers are good at

(I don’t know if the group of eminent statisticians assigned to react to the “test ban” will bring up this point. I don’t think it includes any philosophers.)



Example of revealing inconsistencies and tensions 

Critic: It’s too easy to satisfy standard significance thresholds

You: Why do replicationists find it so hard to achieve significance thresholds?

Critic: Obviously the initial studies were guilty of p-hacking, cherry-picking, significance seeking, QRPs

You: So, the replication researchers want methods that pick up on and block these biasing selection effects.

Critic: Actually the “reforms” recommend methods where selection effects and data dredging make no difference.


Whether this can be resolved or not is separate.

  • We are constantly hearing of how the “reward structure” leads to taking advantage of researcher flexibility
  • As philosophers, we can at least show how to hold their feet to the fire, and warn of the perils of accounts that bury the finagling

The philosopher is the curmudgeon (takes chutzpah!)

I also think it’s crucial for philosophers of science and statistics to show how to improve on and solve problems of methodology in scientific practice.

My slides are below; share comments.

Categories: Error Statistics, reproducibility, Statistics | 18 Comments

3 YEARS AGO (MAY 2012): Saturday Night Memory Lane

3 years ago...


MONTHLY MEMORY LANE: 3 years ago: May 2012. Lots of worthy reading and rereading for your Saturday night memory lane; it was hard to choose just 3. 

I mark in red three posts that seem most apt for general background on key issues in this blog*. (Posts that are part of a “unit” or a group of “U-Phils” count as one.) This new feature, appearing at the end of each month, began at the blog’s 3-year anniversary in Sept. 2014.

*excluding any that have been recently reblogged.


May 2012

Categories: 3-year memory lane | Leave a comment

“Intentions” is the new code word for “error probabilities”: Allan Birnbaum’s Birthday

27 May 1923-1 July 1976


Today is Allan Birnbaum’s birthday. Birnbaum’s (1962) classic “On the Foundations of Statistical Inference,” reprinted in Breakthroughs in Statistics (volume I, 1993), concerns a principle that remains at the heart of today’s controversies in statistics, even if it isn’t obvious at first: the Likelihood Principle (LP), also called the strong Likelihood Principle (SLP) to distinguish it from the weak LP [1]. According to the LP/SLP, given the statistical model, the information from the data is fully contained in the likelihood ratio. Thus, properties of the sampling distribution of the test statistic vanish (as I put it in my slides from my last post)! But error probabilities are all properties of the sampling distribution. Thus, embracing the LP (SLP) blocks our error statistician’s direct ways of taking into account “biasing selection effects” (slide #10).

“Intentions” is a New Code Word: Where, then, is all the information regarding your trying and trying again, stopping when the data look good, cherry picking, barn hunting and data dredging? For likelihoodists and other probabilists who hold the LP/SLP, it is ephemeral information locked in your head, reflecting your “intentions”! “Intentions” is a code word for “error probabilities” in foundational discussions, as in “who would want to take intentions into account?” (Replace “intentions” (or “the researcher’s intentions”) with “error probabilities” (or “the method’s error probabilities”) and you get a more accurate picture.) Keep this deciphering tool firmly in mind as you read criticisms of methods that take error probabilities into account[2]. For error statisticians, this information reflects real and crucial properties of your inference procedure.
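The "trying and trying again" point is easy to exhibit numerically (my sketch, not Birnbaum's): if one peeks at accumulating data and stops as soon as z exceeds 1.96, the actual type I error rate climbs well above the nominal 5%. The stopping rule carries this information, which the LP/SLP deems irrelevant.

```python
import random

random.seed(2)

def peeking_trial(max_n=100):
    """Observe N(0, 1) data (so the null is true); compute z after every
    10th observation and stop at the first nominal |z| > 1.96."""
    total = 0.0
    for i in range(1, max_n + 1):
        total += random.gauss(0, 1)
        if i % 10 == 0:
            z = (total / i) * (i ** 0.5)  # known sigma = 1
            if abs(z) > 1.96:
                return True  # the true null is rejected
    return False

trials = 4000
rate = sum(peeking_trial() for _ in range(trials)) / trials
# With ten interim looks, the realized type I error rate is roughly 0.2,
# not the nominal 0.05 that a single fixed-n test would have.
print(rate)
```

For the error statistician this inflation is a real property of the procedure; for a strict adherent of the LP/SLP, the peeking changes nothing, since the final likelihood is the same.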

Continue reading

Categories: Birnbaum, Birnbaum Brakes, frequentist/Bayesian, Likelihood Principle, phil/history of stat, Statistics | 48 Comments

From our “Philosophy of Statistics” session: APS 2015 convention



“The Philosophy of Statistics: Bayesianism, Frequentism and the Nature of Inference,” at the 2015 American Psychological Society (APS) Annual Convention in NYC, May 23, 2015:


D. Mayo: “Error Statistical Control: Forfeit at your Peril” 


S. Senn: “‘Repligate’: reproducibility in statistical studies. What does it mean and in what sense does it matter?”


A. Gelman: “The statistical crisis in science” (this is not his exact presentation, but he focused on some of these slides)


For more details see this post.

Categories: Bayesian/frequentist, Error Statistics, P-values, reforming the reformers, reproducibility, S. Senn, Statistics | 10 Comments

Workshop on Replication in the Sciences: Society for Philosophy and Psychology: (2nd part of double header)

2nd part of the double header:

Society for Philosophy and Psychology (SPP): 41st Annual meeting

SPP 2015 Program

Wednesday, June 3rd
1:30-6:30: Preconference Workshop on Replication in the Sciences, organized by Edouard Machery

1:30-2:15: Edouard Machery (Pitt)
2:15-3:15: Andrew Gelman (Columbia, Statistics, via video link)
3:15-4:15: Deborah Mayo (Virginia Tech, Philosophy)
4:15-4:30: Break
4:30-5:30: Uri Simonsohn (Penn, Psychology)
5:30-6:30: Tal Yarkoni (University of Texas, Neuroscience)

 SPP meeting: 4-6 June 2015 at Duke University in Durham, North Carolina


First part of the double header:

The Philosophy of Statistics: Bayesianism, Frequentism and the Nature of Inference, 2015 APS Annual Convention, Saturday, May 23, 2:00 PM-3:50 PM in Wilder (Marriott Marquis, 1535 B’way)

Andrew Gelman
Stephen Senn
Deborah Mayo
Richard Morey, Session Chair & Discussant


 See earlier post for Frank Sinatra and more details
Categories: Announcement, reproducibility | Leave a comment

“Error statistical modeling and inference: Where methodology meets ontology” A. Spanos and D. Mayo



A new joint paper….

“Error statistical modeling and inference: Where methodology meets ontology”

Aris Spanos · Deborah G. Mayo

Abstract: In empirical modeling, an important desideratum for deeming theoretical entities and processes real is that they can be reproducible in a statistical sense. Current day crises regarding replicability in science intertwine with the question of how statistical methods link data to statistical and substantive theories and models. Different answers to this question have important methodological consequences for inference, which are intertwined with a contrast between the ontological commitments of the two types of models. The key to untangling them is the realization that behind every substantive model there is a statistical model that pertains exclusively to the probabilistic assumptions imposed on the data. It is not that the methodology determines whether to be a realist about entities and processes in a substantive field. It is rather that the substantive and statistical models refer to different entities and processes, and therefore call for different criteria of adequacy.

Keywords: Error statistics · Statistical vs. substantive models · Statistical ontology · Misspecification testing · Replicability of inference · Statistical adequacy

To read the full paper: “Error statistical modeling and inference: Where methodology meets ontology.”

The related conference.

Mayo & Spanos spotlight

Reference: Spanos, A. & Mayo, D. G. (2015). “Error statistical modeling and inference: Where methodology meets ontology.” Synthese (online May 13, 2015), pp. 1-23.

Categories: Error Statistics, misspecification testing, O & M conference, reproducibility, Severity, Spanos | 2 Comments

Stephen Senn: Double Jeopardy?: Judge Jeffreys Upholds the Law (sequel to the pathetic P-value)

S. Senn

S. Senn

Stephen Senn
Head of Competence Center for Methodology and Statistics (CCMS)
Luxembourg Institute of Health

Double Jeopardy?: Judge Jeffreys Upholds the Law

“But this could be dealt with in a rough empirical way by taking twice the standard error as a criterion for possible genuineness and three times the standard error for definite acceptance.” Harold Jeffreys (1) (p. 386)

This is the second of two posts on P-values. In the first, The Pathetic P-Value, I considered the relation of P-values to Laplace’s Bayesian formulation of induction, pointing out that P-values, whilst they have a very different interpretation, are numerically very similar to a type of Bayesian posterior probability. In this one, I consider their relation, or lack of it, to Harold Jeffreys’s radically different approach to significance testing. (An excellent account of the development of Jeffreys’s thought is given by Howie (2), which I recommend highly.)

The story starts with the Cambridge philosopher C. D. Broad (1887-1971), who in 1918 pointed to a difficulty with Laplace’s Law of Succession. Broad considers the problem of drawing counters from an urn containing n counters, and supposes that all m drawn have been observed to be white. He then considers two very different questions, which have two very different probabilities, and writes:

[C. D. Broad quote]

Note that in the case that only one counter remains we have n = m + 1 and the two probabilities are the same. However, if n > m + 1 they are not the same, and in particular if m is large but n is much larger, the first probability can approach 1 whilst the second remains small.
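For the record, the two probabilities Broad contrasts have simple closed forms under the uniform prior: P(next counter drawn is white) = (m+1)/(m+2), and P(all n counters are white) = (m+1)/(n+1). These are standard results; the snippet below (my illustration, not from Senn's post) checks the two claims in the text.

```python
from fractions import Fraction

def p_next_white(m):
    """Laplace's rule of succession: P(next draw white | m of m white)."""
    return Fraction(m + 1, m + 2)

def p_all_white(m, n):
    """P(all n counters white | m of m drawn white), uniform prior on
    the number of white counters in the urn."""
    return Fraction(m + 1, n + 1)

# When only one counter remains (n = m + 1) the two probabilities agree:
print(p_next_white(9), p_all_white(9, 10))  # both 10/11
# But with m large and n much larger, they come apart sharply:
print(float(p_next_white(999)), float(p_all_white(999, 10**6)))
```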

The practical implication of this is that just because Bayesian induction implies that a large sequence of successes (and no failures) supports the belief that the next trial will be a success, it does not follow that one should believe that all future trials will be successes. This distinction is often misunderstood. Here is The Economist getting it wrong in September 2000:

The canonical example is to imagine that a precocious newborn observes his first sunset, and wonders whether the sun will rise again or not. He assigns equal prior probabilities to both possible outcomes, and represents this by placing one white and one black marble into a bag. The following day, when the sun rises, the child places another white marble in the bag. The probability that a marble plucked randomly from the bag will be white (ie, the child’s degree of belief in future sunrises) has thus gone from a half to two-thirds. After sunrise the next day, the child adds another white marble, and the probability (and thus the degree of belief) goes from two-thirds to three-quarters. And so on. Gradually, the initial belief that the sun is just as likely as not to rise each morning is modified to become a near-certainty that the sun will always rise.

See Dicing with Death(3) (pp76-78).

The practical relevance of this is that scientific laws cannot be established by Laplacian induction. Jeffreys (1891-1989) puts it thus:

Thus I may have seen 1 in 1000 of the ‘animals with feathers’ in England; on Laplace’s theory the probability of the proposition, ‘all animals with feathers have beaks’, would be about 1/1000. This does not correspond to my state of belief or anybody else’s. (p. 128)

Continue reading

Categories: Jeffreys, P-values, reforming the reformers, Statistics, Stephen Senn | 41 Comments

What really defies common sense (Msc kvetch on rejected posts)

Msc Kvetch on my Rejected Posts blog.

Categories: frequentist/Bayesian, msc kvetch, rejected post | Leave a comment

Spurious Correlations: Death by getting tangled in bedsheets and the consumption of cheese! (Aris Spanos)



These days, there are so many dubious assertions about alleged correlations between two variables that an entire website, Spurious Correlations (Tyler Vigen), is devoted to exposing (and creating*) them! A classic problem is that the means of variables X and Y may both be trending in the order the data are observed, invalidating the assumption that their means are constant. In my initial study with Aris Spanos on misspecification testing, the X and Y means were trending in much the same way I imagine a lot of the examples on this site are, like the one on the number of people who die by becoming tangled in their bedsheets and the per capita consumption of cheese in the U.S.

The annual data for 2000-2009 are: xt: per capita consumption of cheese (U.S.) : x = (29.8, 30.1, 30.5, 30.6, 31.3, 31.7, 32.6, 33.1, 32.7, 32.8); yt: Number of people who died by becoming tangled in their bedsheets: y = (327, 456, 509, 497, 596, 573, 661, 741, 809, 717)
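Using the numbers above, the point is easy to check (my sketch, not Spanos's note): the raw series correlate at about 0.95, but most of that is the shared upward trend. Correlating the year-to-year changes, a crude way of removing a common trend, drops the association to roughly a third of that.

```python
def pearson(u, v):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = sum((a - mu) ** 2 for a in u) ** 0.5
    sv = sum((b - mv) ** 2 for b in v) ** 0.5
    return cov / (su * sv)

# Annual data 2000-2009, as given in the post.
cheese = [29.8, 30.1, 30.5, 30.6, 31.3, 31.7, 32.6, 33.1, 32.7, 32.8]
deaths = [327, 456, 509, 497, 596, 573, 661, 741, 809, 717]

raw = pearson(cheese, deaths)
diffs = pearson([b - a for a, b in zip(cheese, cheese[1:])],
                [b - a for a, b in zip(deaths, deaths[1:])])
# Raw correlation is about 0.95; after first-differencing, about 0.3.
print(round(raw, 2), round(diffs, 2))
```

First-differencing is only one rough device; Spanos's note takes the misspecification analysis much further.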

I asked Aris Spanos to have a look, and it took him no time to identify the main problem. He was good enough to write up a short note which I’ve pasted as slides.


Aris Spanos

Wilson E. Schmidt Professor of Economics
Department of Economics, Virginia Tech



*The site says that the server attempts to generate a new correlation every 60 seconds.

Categories: misspecification testing, Spanos, Statistics, Testing Assumptions | 14 Comments

96% Error in “Expert” Testimony Based on Probability of Hair Matches: It’s all Junk!

Imagine. The New York Times reported a few days ago that the FBI erroneously identified criminals 96% of the time based on probability assessments using forensic hair samples (up until 2000). Sometimes the hair wasn’t even human; it might have come from a dog, a cat or a fur coat! I posted on the unreliability of hair forensics a few years ago. The forensics of bite marks aren’t much better.[i] John Byrd, forensic analyst and reader of this blog, had commented at the time: “At the root of it is the tradition of hiring non-scientists into the technical positions in the labs. They tended to be agents. That explains a lot about misinterpretation of the weight of evidence and the inability to explain the import of lab findings in court.” DNA is supposed to cure all that. So is it? I don’t know, but apparently the FBI “has agreed to provide free DNA testing where there is either a court order or a request for testing by the prosecution.”[ii] See the FBI report.

Here’s the op-ed from the New York Times from April 27, 2015:

“Junk Science at the FBI”

The odds were 10-million-to-one, the prosecution said, against hair strands found at the scene of a 1978 murder of a Washington, D.C., taxi driver belonging to anyone but Santae Tribble. Based largely on this compelling statistic, drawn from the testimony of an analyst with the Federal Bureau of Investigation, Mr. Tribble, 17 at the time, was convicted of the crime and sentenced to 20 years to life.

But the hair did not belong to Mr. Tribble. Some of it wasn’t even human. In 2012, a judge vacated Mr. Tribble’s conviction and dismissed the charges against him when DNA testing showed there was no match between the hair samples, and that one strand had come from a dog.

Mr. Tribble’s case — along with the exoneration of two other men who served decades in prison based on faulty hair-sample analysis — spurred the F.B.I. to conduct a sweeping post-conviction review of 2,500 cases in which its hair-sample lab reported a match.

The preliminary results of that review, which Spencer Hsu of The Washington Post reported last week, are breathtaking: out of 268 criminal cases nationwide between 1985 and 1999, the bureau’s “elite” forensic hair-sample analysts testified wrongly in favor of the prosecution in 257, or 96 percent of the time. Thirty-two defendants in those cases were sentenced to death; 14 have since been executed or died in prison.

The agency is continuing to review the rest of the cases from the pre-DNA era. The Justice Department is working with the Innocence Project and the National Association of Criminal Defense Lawyers to notify the defendants in those cases that they may have grounds for an appeal. It cannot, however, address the thousands of additional cases where potentially flawed testimony came from one of the 500 to 1,000 state or local analysts trained by the F.B.I. Peter Neufeld, co-founder of the Innocence Project, rightly called this a “complete disaster.”

Law enforcement agencies have long known of the dubious value of hair-sample analysis. A 2009 report by the National Research Council found “no scientific support” and “no uniform standards” for the method’s use in positively identifying a suspect. At best, hair-sample analysis can rule out a suspect, or identify a wide class of people with similar characteristics.

Yet until DNA testing became commonplace in the late 1990s, forensic analysts testified confidently to the near-certainty of matches between hair found at crime scenes and samples taken from defendants. The F.B.I. did not even have written standards on how analysts should testify about their findings until 2012.

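The 96 percent figure in the excerpt is just the proportion of reviewed cases with flawed testimony. A minimal sketch (counts taken from the excerpt above) checks the arithmetic:

```python
# Counts reported in the FBI post-conviction review excerpted above:
# 268 criminal cases examined (1985-1999), with erroneous
# pro-prosecution hair-match testimony in 257 of them.
cases_reviewed = 268
erroneous_testimony = 257

error_rate = erroneous_testimony / cases_reviewed
print(f"{error_rate:.0%}")  # rounds to 96%
```

Note that this is an error rate among cases the analysts testified in and that were selected for review, not a false-positive rate for hair analysis generally.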

Categories: evidence-based policy, junk science, PhilStat Law, Statistics | 3 Comments


3 years ago…

MONTHLY MEMORY LANE: 3 years ago: April 2012. I mark in red three posts that seem most apt for general background on key issues in this blog.* (Posts that are part of a “unit” or a group of “U-Phils” count as one.) This new feature, appearing the last week of each month, began at the blog’s 3-year anniversary in Sept. 2014.

*excluding those recently reblogged.

April 2012

Contributions from readers in relation to published papers

Two book reviews of Error and the Growth of Experimental Knowledge (EGEK 1996), counted as 1 unit

Categories: 3-year memory lane, Statistics | Tags: | Leave a comment

“Statistical Concepts in Their Relation to Reality” by E.S. Pearson

To complete the last post, here’s Pearson’s portion of the “triad”:

E.S. Pearson on Gate (sketch by D. Mayo)

“Statistical Concepts in Their Relation to Reality”

by E.S. PEARSON (1955)

SUMMARY: This paper contains a reply to some criticisms made by Sir Ronald Fisher in his recent article on “Scientific Methods and Scientific Induction”.

Controversies in the field of mathematical statistics seem largely to have arisen because statisticians have been unable to agree upon how theory is to provide, in terms of probability statements, the numerical measures most helpful to those who have to draw conclusions from observational data.  We are concerned here with the ways in which mathematical theory may be put, as it were, into gear with the common processes of rational thought, and there seems no reason to suppose that there is one best way in which this can be done.  If, therefore, Sir Ronald Fisher recapitulates and enlarges on his views upon statistical methods and scientific induction we can all only be grateful, but when he takes this opportunity to criticize the work of others through misapprehension of their views as he has done in his recent contribution to this Journal (Fisher 1955), it is impossible to leave him altogether unanswered.

In the first place it seems unfortunate that much of Fisher’s criticism of Neyman and Pearson’s approach to the testing of statistical hypotheses should be built upon a “penetrating observation” ascribed to Professor G.A. Barnard, the assumption involved in which happens to be historically incorrect.  There was no question of a difference in point of view having “originated” when Neyman “reinterpreted” Fisher’s early work on tests of significance “in terms of that technological and commercial apparatus which is known as an acceptance procedure”.  There was no sudden descent upon British soil of Russian ideas regarding the function of science in relation to technology and to five-year plans.  It was really much simpler–or worse.  The original heresy, as we shall see, was a Pearson one!


Categories: E.S. Pearson, phil/history of stat, Statistics | Tags: , , | Leave a comment

NEYMAN: “Note on an Article by Sir Ronald Fisher” (3 uses for power, Fisher’s fiducial argument)

Note on an Article by Sir Ronald Fisher

By Jerzy Neyman (1956)


(1) FISHER’S allegation that, contrary to some passages in the introduction and on the cover of the book by Wald, this book does not really deal with experimental design is unfounded. In actual fact, the book is permeated with problems of experimentation.  (2) Without consideration of hypotheses alternative to the one under test and without the study of probabilities of the two kinds, no purely probabilistic theory of tests is possible.  (3) The conceptual fallacy of the notion of fiducial distribution rests upon the lack of recognition that valid probability statements about random variables usually cease to be valid if the random variables are replaced by their particular values.  The notorious multitude of “paradoxes” of fiducial theory is a consequence of this oversight.  (4)  The idea of a “cost function for faulty judgments” appears to be due to Laplace, followed by Gauss.
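Neyman’s point (3) is the familiar caution about confidence procedures. As an illustrative sketch of my own (the numbers are not from Neyman’s note): the pre-data statement “the random interval covers μ with probability 0.95” is valid, but once the endpoints are replaced by their observed values, the particular interval simply does or does not contain μ, and no nontrivial probability statement attaches to it:

```python
import random
import statistics

random.seed(1)
mu, sigma, n, z = 10.0, 2.0, 25, 1.96  # true mean, known sd, sample size, 95% z

def covers(mu):
    """Draw one sample and report whether its 95% interval covers mu."""
    sample = [random.gauss(mu, sigma) for _ in range(n)]
    xbar = statistics.fmean(sample)
    half = z * sigma / n ** 0.5
    return xbar - half <= mu <= xbar + half

# Pre-data: the *procedure* covers mu in about 95% of repetitions...
coverage = sum(covers(mu) for _ in range(10_000)) / 10_000
print(coverage)  # close to 0.95

# ...but any single realized interval either contains mu or it doesn't;
# substituting particular values for the random endpoints changes what
# kind of statement can validly be made.
```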

1. Introduction

In a recent article (Fisher, 1955), Sir Ronald Fisher delivered an attack on a substantial part of the research workers in mathematical statistics. My name is mentioned more frequently than any other and is accompanied by the more expressive invectives. Of the scientific questions raised by Fisher many were sufficiently discussed before (Neyman and Pearson, 1933; Neyman, 1937; Neyman, 1952). In the present note only the following points will be considered: (i) Fisher’s attack on the concept of errors of the second kind; (ii) Fisher’s reference to my objections to fiducial probability; (iii) Fisher’s reference to the origin of the concept of loss function and, before all, (iv) Fisher’s attack on Abraham Wald.
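On point (i), errors of the second kind: a concrete illustration (my numbers, not Neyman’s) of the Neyman–Pearson bookkeeping. For a one-sided z-test of H0: μ = 0 against the alternative μ = 1 at level α = 0.05, the type II error probability β is the chance the statistic falls below the critical value when the alternative is true:

```python
from statistics import NormalDist

norm = NormalDist()
alpha, n, sigma, mu_alt = 0.05, 16, 2.0, 1.0  # illustrative values

z_crit = norm.inv_cdf(1 - alpha)      # reject H0: mu = 0 when z > z_crit
shift = mu_alt * n ** 0.5 / sigma     # mean of z under the alternative mu = 1
beta = norm.cdf(z_crit - shift)       # P(fail to reject | mu = mu_alt)
power = 1 - beta

print(round(beta, 3), round(power, 3))
```

Without an explicit alternative there is no β and no power, which is the substance of Neyman’s point (2) above: a purely probabilistic theory of tests requires consideration of both kinds of error.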


Categories: Fisher, Neyman, phil/history of stat, Statistics | Tags: , , | 2 Comments
