**A quick perusal of the “Manual” on Nathan Schachtman’s legal blog shows it to be chock full of revealing points of contemporary legal statistical philosophy. The following are some excerpts; read the full blog here. I make two comments at the end.**

July 8th, 2012

Nathan Schachtman

How does the new Reference Manual on Scientific Evidence (RMSE3d 2011) treat statistical significance? Inconsistently and at times incoherently.

Professor Berger’s Introduction

In her introductory chapter, the late Professor Margaret A. Berger raises the question of the role statistical significance should play in evaluating a study’s support for causal conclusions:

“What role should statistical significance play in assessing the value of a study? Epidemiological studies that are not conclusive but show some increased risk do not prove a lack of causation. Some courts find that they therefore have some probative value, 62 at least in proving general causation. 63”

Margaret A. Berger, “The Admissibility of Expert Testimony,” in RMSE3d 11, 24 (2011).

This seems rather backwards. Berger’s suggestion that inconclusive studies do not prove lack of causation seems nothing more than a tautology. And how can that tautology support the claim that inconclusive studies “therefore” have some probative value? This is a fairly obvious logically invalid argument, or perhaps a passage badly in need of an editor.

…………

Chapter on Statistics

The RMSE’s chapter on statistics is relatively free of value judgments about significance probability, and, therefore, a great improvement upon Berger’s introduction. The authors carefully describe significance probability and p-values, and explain:

“Small p-values argue against the null hypothesis. Statistical significance is determined by reference to the p-value; significance testing (also called hypothesis testing) is the technique for computing p-values and determining statistical significance.”

David H. Kaye and David A. Freedman, “Reference Guide on Statistics,” in RMSE3d 211, 241 (3d ed. 2011). Although the chapter confuses and conflates Fisher’s interpretation of p-values with Neyman’s conceptualization of hypothesis testing as a dichotomous decision procedure, this treatment is unfortunately fairly standard in introductory textbooks.

Kaye and Freedman, however, do offer some important qualifications to the untoward consequences of using significance testing as a dichotomous outcome:

“Artifacts from multiple testing are commonplace. Because research that fails to uncover significance often is not published, reviews of the literature may produce an unduly large number of studies finding statistical significance.111 Even a single researcher may examine so many different relationships that a few will achieve statistical significance by mere happenstance. Almost any large dataset—even pages from a table of random digits—will contain some unusual pattern that can be uncovered by diligent search. Having detected the pattern, the analyst can perform a statistical test for it, blandly ignoring the search effort. Statistical significance is bound to follow.

There are statistical methods for dealing with multiple looks at the data, which permit the calculation of meaningful p-values in certain cases.112 However, no general solution is available, and the existing methods would be of little help in the typical case where analysts have tested and rejected a variety of models before arriving at the one considered the most satisfactory (see infra Section V on regression models). In these situations, courts should not be overly impressed with claims that estimates are significant. Instead, they should be asking how analysts developed their models.113 ”

Id. at 256-57. This qualification is omitted from the overlapping discussion in the chapter on epidemiology, where it is very much needed.
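Kaye and Freedman’s point about searching a large dataset can be checked with a small simulation. The sketch below is illustrative only: the function names, the sample sizes, and the choice of a z-test with known variance are assumptions made for the example, not anything taken from the Manual.

```python
import random
from statistics import NormalDist

def two_sided_p(z):
    """Two-sided p-value for a standard normal test statistic."""
    return 2 * (1 - NormalDist().cdf(abs(z)))

def count_false_positives(n_tests=100, n_obs=50, alpha=0.05, seed=0):
    """Test n_tests pure-noise samples against a true null (mean 0, sd 1)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_tests):
        sample = [rng.gauss(0, 1) for _ in range(n_obs)]
        z = (sum(sample) / n_obs) * n_obs ** 0.5  # known sd = 1 (assumed)
        if two_sided_p(z) < alpha:
            hits += 1
    return hits

def prob_at_least_one(n_tests, alpha=0.05):
    """Chance of at least one 'significant' result among independent true nulls."""
    return 1 - (1 - alpha) ** n_tests

print(count_false_positives())   # nominally significant results found in pure noise
print(prob_at_least_one(20))     # roughly 0.64
```

With 20 independent tests of true nulls at the 0.05 level, the chance of at least one nominally significant result is about 0.64, which is why the search effort cannot be “blandly ignored.”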

……..

Chapter on Epidemiology

The chapter on epidemiology mostly muddles the discussion set out in Kaye and Freedman’s chapter on statistics.

“The two main techniques for assessing random error are statistical significance and confidence intervals. A study that is statistically significant has results that are unlikely to be the result of random error, although any criterion for “significance” is somewhat arbitrary. A confidence interval provides both the relative risk (or other risk measure) found in the study and a range (interval) within which the risk likely would fall if the study were repeated numerous times.”

Michael D. Green, D. Michal Freedman, and Leon Gordis, “Reference Guide on Epidemiology,” in RMSE3d 549, 573. The suggestion that a statistically significant study has results unlikely due to chance probably crosses the line in committing the transpositional fallacy so nicely described and warned against in the chapter on statistics.
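The gap between the two probabilities can be made concrete with a back-of-the-envelope calculation. The sketch below is an illustration under loudly stated assumptions: the prior share of true nulls and the power are hypothetical inputs that a significance test does not itself supply, and the function name is invented for the example.

```python
def prob_null_given_significant(prior_null, alpha, power):
    """P(null true | result significant), by Bayes' theorem.

    prior_null: assumed share of tested hypotheses for which the null is true
    alpha:      P(significant | null true), the test's significance level
    power:      P(significant | null false), assumed for the example
    """
    false_pos = alpha * prior_null
    true_pos = power * (1 - prior_null)
    return false_pos / (false_pos + true_pos)

# Hypothetical inputs: 90% of tested nulls true, alpha = 0.05, power = 0.5.
print(round(prob_null_given_significant(0.9, 0.05, 0.5), 3))
```

Under these made-up inputs, nearly half of the significant results come from true nulls even though alpha is 0.05. The probability of a significant result given chance alone is simply a different quantity from the probability that a significant result is due to chance.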

The suggestion that alpha is “arbitrary” is “somewhat” correct, but this truncated discussion is distinctly unhelpful to judges who are likely to take “arbitrary” to mean “I will get reversed.” The selection of alpha is conventional to some extent, and arbitrary in the sense that the law’s setting an age of majority or a voting age is arbitrary. Some young adults, age 17.8 years old, may be better educated, better engaged in politics, better informed about current events, than 35 year olds, but the law must set a cut off. Two year olds are demonstrably unfit, and 82 year olds are surely past the threshold of maturity requisite for political participation. A court might admit an opinion based upon a study of rare diseases, with tight control of bias and confounding, when p = 0.051, but that is hardly a justification for ignoring random error altogether, or admitting an opinion based upon a study in which the observed disparity had p = 0.15.

The epidemiology chapter correctly calls out judicial decisions that confuse “effect size” with statistical significance:

“Understandably, some courts have been confused about the relationship between statistical significance and the magnitude of the association.

See Hyman & Armstrong, P.S.C. v. Gunderson, 279 S.W.3d 93, 102 (Ky. 2008) (describing a small increased risk as being considered statistically insignificant and a somewhat larger risk as being considered statistically significant….”

Actually this confusion is not understandable at all, other than to emphasize that the cited courts badly misunderstood significance probability and significance testing. The authors could well have added In re Viagra to the list of courts that confused effect size with statistical significance. See In re Viagra Products Liab. Litig., 572 F. Supp. 2d 1071, 1081 (D. Minn. 2008).

…………….
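That effect size and statistical significance are separate matters can be shown with two invented cohorts. The sketch below is a rough illustration, not a method taken from the Manual or the cited cases: the counts are hypothetical, the function name is made up, and the Wald test on the log relative risk is a large-sample approximation chosen for brevity.

```python
import math
from statistics import NormalDist

def relative_risk_test(exposed_cases, exposed_total, unexposed_cases, unexposed_total):
    """Return (relative risk, two-sided p-value) from a Wald test on log(RR)."""
    rr = (exposed_cases / exposed_total) / (unexposed_cases / unexposed_total)
    # Standard large-sample standard error of log(RR) for cohort data.
    se = math.sqrt(1 / exposed_cases - 1 / exposed_total
                   + 1 / unexposed_cases - 1 / unexposed_total)
    z = math.log(rr) / se
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return rr, p

# Hypothetical cohorts, invented for illustration.
small_effect_big_study = relative_risk_test(1100, 100_000, 1000, 100_000)
large_effect_small_study = relative_risk_test(6, 100, 2, 100)
print(small_effect_big_study)    # RR near 1.1, p below 0.05
print(large_effect_small_study)  # RR near 3.0, p above 0.05
```

The small increased risk is statistically significant because the study is large, and the tripled risk is not because the study is small. Magnitude of association and significance probability answer different questions.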

When they are on message, the authors of the epidemiology chapter are certainly correct that significance probability cannot be translated into an assessment of the probability that the null hypothesis, or the obtained sampling statistic, is correct. What these authors omit, however, is a clear statement that the many courts and counsel who misstate this fact do not create any worthwhile precedent, persuasive or binding.

The epidemiology chapter ultimately offers nothing to help judges in assessing statistical significance:

“There is some controversy among epidemiologists and biostatisticians about the appropriate role of significance testing.85 To the strictest significance testers, any study whose p-value is not less than the level chosen for statistical significance should be rejected as inadequate to disprove the null hypothesis. Others are critical of using strict significance testing, which rejects all studies with an observed p-value below that specified level. Epidemiologists have become increasingly sophisticated in addressing the issue of random error and examining the data from a study to ascertain what information they may provide about the relationship between an agent and a disease, without the necessity of rejecting all studies that are not statistically significant.86 Meta-analysis, as well, a method for pooling the results of multiple studies, sometimes can ameliorate concerns about random error.87

Calculation of a confidence interval permits a more refined assessment of appropriate inferences about the association found in an epidemiologic study.88”

Id. at 578-79.

Mostly true, but again rather unhelpful to judges and lawyers. The authors divide the world up into “strict” testers and those critical of “strict” testing. Where is the boundary? Does criticism of “strict” testing imply embrace of “non-strict” testing, or of no testing at all? I can sympathize with a judge who permits reliance upon a series of studies that all go in the same direction, with each having a confidence interval that just misses excluding the null hypothesis. Meta-analysis in such a situation might not just ameliorate concerns about random error, it might eliminate them. But what of those critical of strict testing? This certainly does not suggest or imply that courts can or should ignore random error; yet that is exactly what happened in In re Viagra Products Liab. Litig., 572 F. Supp. 2d 1071, 1081 (D. Minn. 2008).

The chapter’s reference to confidence intervals is correct in part; they permit a more refined assessment because they permit a more direct assessment of the extent of random error in terms of magnitude of association, as well as the point estimate of the association obtained from the sample. Confidence intervals, however, do not eliminate the need to interpret the extent of random error.

In the final analysis, the epidemiology chapter is unclear and imprecise. I believe it confuses matters more than it clarifies. There is clearly room for improvement in the Fourth Edition.
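The situation described above, a series of studies that all point the same way with each confidence interval just failing to exclude the null, can be sketched numerically. This is a minimal illustration under stated assumptions: the three studies are hypothetical, the helper names are invented, and fixed-effect inverse-variance pooling presumes the studies estimate a common effect and are free of bias and confounding.

```python
import math
from statistics import NormalDist

Z95 = NormalDist().inv_cdf(0.975)  # about 1.96

def ci_from_log_rr(log_rr, se):
    """95% confidence interval on the relative-risk scale."""
    return math.exp(log_rr - Z95 * se), math.exp(log_rr + Z95 * se)

def fixed_effect_pool(studies):
    """Inverse-variance pooling of (log_rr, se) pairs; returns (log_rr, se)."""
    weights = [1 / se ** 2 for _, se in studies]
    pooled = sum(w * est for w, (est, _) in zip(weights, studies)) / sum(weights)
    return pooled, math.sqrt(1 / sum(weights))

# Three hypothetical studies, each with RR = 1.5 and a standard error of 0.22.
studies = [(math.log(1.5), 0.22)] * 3
for est, se in studies:
    print("single study CI:", ci_from_log_rr(est, se))   # lower bound just under 1
print("pooled CI:", ci_from_log_rr(*fixed_effect_pool(studies)))  # lower bound above 1
```

Each interval taken alone includes a relative risk of 1, while the pooled interval does not. The pooling narrows the interval only with respect to random error; it does nothing about bias or confounding in the underlying studies.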

** Two remarks: I very much agree with Schachtman that it is misleading to “divide the world up into ‘strict’ testers and those critical of ‘strict’ testing”. It is time to get beyond the much lampooned image of recipe-like uses of significance tests. One hopes the Fourth Edition recognizes and illustrates how a right-headed use of tests (and CIs) can serve to indicate the extent of discrepancies that are/are not warranted by data.**

**Schachtman chides the Manual for conflating “Fisher’s interpretation of p-values with Neyman’s conceptualization of hypothesis testing as a dichotomous decision procedure.” My own view is that the behavioristic “accept/reject” view of hypothesis testing so often attributed to Neyman is a serious caricature, at least of what Neyman (as well as Pearson) thought (and of what tests are capable of). Fisher’s practice is at least as (or even more) automatic and dichotomous as is Neyman’s theory (recall the posts from the “triad”, e.g., here). Also, in this connection, watch the second half of the address by Sir David R. Cox:**

**ASA President’s Invited Address: “Statistical Analysis: Current Position and Future Prospects”*** (Aug. 1, 2011)*

*http://www.amstat.org/meetings/jsm/2011/webcasts/index.cfm*

[Windows Media Player (.wmv) - 267MB]

[AVI movie (.avi) - 171MB]

I realized that when I suggested that the Reference Manual conflated Fisher with Neyman, I was abridging a good deal of nuanced writing and history. I should have also realized that I was risking caricature. My main point here was that the Manual oversimplified significance testing in a way that has allowed some courts to disregard significance tests or evaluation altogether. To be fair, the Manual’s chapter on statistics did not specify that significance or hypothesis testing was a dichotomous procedure, but most courts regard it as such, and then reject it as a “litmus test” or as “arbitrary.” The result is that there is no serious consideration of random error in the observed data.

Thanks Nathan. I look forward to reading the Manual in full!

There’s a link to the entire Manual at the Federal Judicial Center’s and the National Academies’ websites. My post has a link to one or the other of the download sites.

I was able to get it through your blog’s link.

I think one has to be very careful in claiming that P-values are somehow incompatible with hypothesis testing. No less a person than Lehmann defined them in terms of hypothesis tests. He wrote:

‘In applications, there is usually available a nested family of rejection regions corresponding to different significance levels. It is then good practice to determine not only whether the hypothesis is accepted or rejected at the given significance level, but also to determine the smallest significance level $\hat{\alpha} = \hat{\alpha}(x)$, the significance probability or p-value, at which the hypothesis would be rejected for the given observation.’ P. 70 of Testing Statistical Hypotheses, 1994

The problem with casting hypothesis testing as a simple act of deciding whether to reject or not a null hypothesis is that it assumes that a given scientist is mandated to make that decision on behalf of all scientific posterity. If a community of scientists agree together that hypothesis testing is what they like and if they agree on the test to use on a given occasion but they have different preferences as regards type I error rates, then communicating P-values is an operational solution.

I appreciate, of course, that Nathan was aware that he was risking being accused of simplifying matters and knew more than space permitted him to communicate but feel that nevertheless this point is worth making.
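The definition quoted above, and the idea that readers with different type I error preferences can all work from one reported p-value, can be sketched for a simple case. This is an illustration only: the one-sided z-test, the observed statistic, and the function names are assumptions chosen for the example.

```python
from statistics import NormalDist

def one_sided_p(z):
    """p-value for a one-sided z-test that rejects for large z."""
    return 1 - NormalDist().cdf(z)

def rejects_at(z, alpha):
    """Decision from the level-alpha rejection region {z > critical value}."""
    return z > NormalDist().inv_cdf(1 - alpha)

z_observed = 2.0  # hypothetical test statistic
p = one_sided_p(z_observed)
for alpha in (0.01, 0.05, 0.10):
    # Each reader applies a preferred alpha to the one reported p-value.
    assert rejects_at(z_observed, alpha) == (p < alpha)
    print(alpha, rejects_at(z_observed, alpha))
```

The rejection regions are nested, so comparing the reported p-value with a reader’s own alpha reproduces the decision that reader’s test would have made. The p-value and the test are two views of the same procedure.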

Stephen: Great to hear from you. I was just thinking of writing to you last night about this because I was reading J. Kiefer (the Neyman student) on conditional confidence. He was alleging (e.g., Brownie and Kiefer 1977, p. 693), along with a critique of Bayesian and likelihood methods, that “the statement of the level at which a hypothesis is ‘just rejected (or accepted)’” does not, or at least need not, have “frequentist interpretability”, but perhaps he meant only in certain cases (and he considers all kinds). Are you familiar with him?

I appreciate the point; thanks. Yes, there was much more nuance here than I wanted to take on in my original post, but I will add defensively that my oversimplification was in response to perhaps an even more simplistic presentation in the Reference Manual. The statistics chapter is better than others. Having read other work by both David Kaye and the late David Freedman, I know that they were actively engaged in simplifying many concepts for consumption by the judiciary and lawyers, the intended audience of their chapter.

The problem is that judges are called upon to decide whether expert witnesses have proffered opinions that are based upon “sound science,” including the incorporated statistical analyses, and the Reference Manual is not giving them all the tools that are needed. Another problem I wanted to highlight is that the common law system of precedent is ill suited for dealing with scientific and statistical concepts. Lawyers will collect every misstatement in the caselaw and marshal it in the next case as “persuasive” or binding authority upon the next court to consider the matter.

Given that the Reference Manual has the imprimatur of the National Academies and the Federal Judicial Center, I think it would be worthwhile for you and others to look at it closely, and see how it holds up as a precis for judges and lawyers.

Nathan

Deborah: I am familiar with Kiefer’s fundamental work on experimental design (together with Wolfowitz) but know nothing of his work with Brownie.

Stephen: It’s the same as regards this point; I will look for a different reference. Thanks.