A quick perusal of the “Manual” on Nathan Schachtman’s legal blog shows it to be chock full of revealing points of contemporary legal statistical philosophy. The following are some excerpts; read the full post here. I make two comments at the end.
July 8th, 2012
How does the new Reference Manual on Scientific Evidence (RMSE3d 2011) treat statistical significance? Inconsistently and at times incoherently.
Professor Berger’s Introduction
In her introductory chapter, the late Professor Margaret A. Berger raises the question of the role statistical significance should play in evaluating a study’s support for causal conclusions:
“What role should statistical significance play in assessing the value of a study? Epidemiological studies that are not conclusive but show some increased risk do not prove a lack of causation. Some courts find that they therefore have some probative value, 62 at least in proving general causation. 63”
Margaret A. Berger, “The Admissibility of Expert Testimony,” in RMSE3d 11, 24 (2011).
This seems rather backwards. Berger’s suggestion that inconclusive studies do not prove lack of causation is nothing more than a tautology. And how can that tautology support the claim that inconclusive studies “therefore” have some probative value? This is a fairly obviously invalid argument, or perhaps a passage badly in need of an editor.
Chapter on Statistics
The RMSE’s chapter on statistics is relatively free of value judgments about significance probability, and, therefore, a great improvement upon Berger’s introduction. The authors carefully describe significance probability and p-values, and explain:
“Small p-values argue against the null hypothesis. Statistical significance is determined by reference to the p-value; significance testing (also called hypothesis testing) is the technique for computing p-values and determining statistical significance.”
David H. Kaye and David A. Freedman, “Reference Guide on Statistics,” in RMSE3d 211, 241 (3d ed. 2011). Although the chapter confuses and conflates Fisher’s interpretation of p-values with Neyman’s conceptualization of hypothesis testing as a dichotomous decision procedure, this treatment is unfortunately fairly standard in introductory textbooks.
Kaye and Freedman, however, do offer some important qualifications concerning the untoward consequences of treating significance testing as a dichotomous procedure:
“Artifacts from multiple testing are commonplace. Because research that fails to uncover significance often is not published, reviews of the literature may produce an unduly large number of studies finding statistical significance.111 Even a single researcher may examine so many different relationships that a few will achieve statistical significance by mere happenstance. Almost any large dataset—even pages from a table of random digits—will contain some unusual pattern that can be uncovered by diligent search. Having detected the pattern, the analyst can perform a statistical test for it, blandly ignoring the search effort. Statistical significance is bound to follow.
There are statistical methods for dealing with multiple looks at the data, which permit the calculation of meaningful p-values in certain cases.112 However, no general solution is available, and the existing methods would be of little help in the typical case where analysts have tested and rejected a variety of models before arriving at the one considered the most satisfactory (see infra Section V on regression models). In these situations, courts should not be overly impressed with claims that estimates are significant. Instead, they should be asking how analysts developed their models.113 ”
Id. at 256-57. This qualification is omitted from the overlapping discussion in the chapter on epidemiology, where it is very much needed.
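Kaye and Freedman’s warning about multiple looks at the data is easy to demonstrate. The following sketch is my own illustration, not anything from the Manual: it runs many two-sample comparisons on pure noise, so the null hypothesis is true in every test, yet a diligent search over enough random “relationships” is bound to turn up some that clear the conventional 0.05 threshold.

```python
import math
import random

def two_sample_p(xs, ys):
    """Approximate two-sided p-value from a z-test on the difference in means."""
    nx, ny = len(xs), len(ys)
    mx, my = sum(xs) / nx, sum(ys) / ny
    vx = sum((x - mx) ** 2 for x in xs) / (nx - 1)
    vy = sum((y - my) ** 2 for y in ys) / (ny - 1)
    z = (mx - my) / math.sqrt(vx / nx + vy / ny)
    # two-sided tail area under the standard normal curve
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

random.seed(1)
n_tests = 200
false_positives = 0
for _ in range(n_tests):
    xs = [random.gauss(0, 1) for _ in range(50)]
    ys = [random.gauss(0, 1) for _ in range(50)]  # drawn from the SAME distribution
    if two_sample_p(xs, ys) < 0.05:
        false_positives += 1

# With 200 tests of a true null at alpha = 0.05, roughly 5% (about 10)
# come out "statistically significant" by happenstance alone.
print(false_positives)
```

Reporting only the “significant” comparisons, while “blandly ignoring the search effort,” is exactly the artifact the chapter warns against.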
Chapter on Epidemiology
The chapter on epidemiology mostly muddles the discussion set out in Kaye and Freedman’s chapter on statistics.
“The two main techniques for assessing random error are statistical significance and confidence intervals. A study that is statistically significant has results that are unlikely to be the result of random error, although any criterion for “significance” is somewhat arbitrary. A confidence interval provides both the relative risk (or other risk measure) found in the study and a range (interval) within which the risk likely would fall if the study were repeated numerous times.”
Michael D. Green, D. Michal Freedman, and Leon Gordis, “Reference Guide on Epidemiology,” in RMSE3d 549, 573. The suggestion that a statistically significant study has results unlikely due to chance probably crosses the line into committing the transpositional fallacy so nicely described and warned against in the chapter on statistics.
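The transpositional fallacy can be made concrete with Bayes’ rule. In this sketch every number (the base rate of real effects among tested hypotheses, alpha, and power) is a purely hypothetical assumption of mine; the point is only that P(significant result | null true) = 0.05 does not entail that P(null true | significant result) is anywhere near 0.05.

```python
# Hypothetical inputs, chosen only for illustration
prior_true_effect = 0.10   # assume 10% of tested hypotheses involve a real effect
alpha = 0.05               # P(significant | null true)
power = 0.80               # P(significant | effect real)

# Total probability of observing a "significant" result
p_sig = power * prior_true_effect + alpha * (1 - prior_true_effect)

# Bayes' rule: probability the null is true GIVEN a significant result
p_null_given_sig = alpha * (1 - prior_true_effect) / p_sig
print(round(p_null_given_sig, 3))  # prints 0.36 under these assumed numbers
```

Under these assumptions, more than a third of the “significant” findings come from true nulls, even though each individual test controls the error rate at 5%.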
The suggestion that alpha is “arbitrary” is “somewhat” correct, but this truncated discussion is distinctly unhelpful to judges, who are likely to take “arbitrary” to mean “I will get reversed.” The selection of alpha is conventional to some extent, and arbitrary in the sense that the law’s setting an age of majority or a voting age is arbitrary. Some young adults, at 17.8 years old, may be better educated, more engaged in politics, and better informed about current events than 35-year-olds, but the law must set a cutoff. Two-year-olds are demonstrably unfit, and 82-year-olds are surely past the threshold of maturity requisite for political participation. A court might admit an opinion based upon a study of rare diseases, with tight control of bias and confounding, when p = 0.051, but that is hardly a justification for ignoring random error altogether, or for admitting an opinion based upon a study in which the observed disparity had a p = 0.15.
The epidemiology chapter correctly calls out judicial decisions that confuse “effect size” with statistical significance:
“Understandably, some courts have been confused about the relationship between statistical significance and the magnitude of the association. See Hyman & Armstrong, P.S.C. v. Gunderson, 279 S.W.3d 93, 102 (Ky. 2008) (describing a small increased risk as being considered statistically insignificant and a somewhat larger risk as being considered statistically significant….”
Actually, this confusion is not understandable at all, other than as a sign that the cited courts badly misunderstood significance probability and significance testing. The authors could well have added In re Viagra to the list of courts that confused effect size with statistical significance. See In re Viagra Products Liab. Litig., 572 F. Supp. 2d 1071, 1081 (D. Minn. 2008).
When they are on message, the authors of the epidemiology chapter are certainly correct that significance probability cannot be translated into an assessment of the probability that the null hypothesis, or the obtained sampling statistic, is correct. What these authors omit, however, is a clear statement that the many courts and counsel who misstate this fact do not create any worthwhile precedent, persuasive or binding.
The epidemiology chapter ultimately offers nothing to help judges in assessing statistical significance:
“There is some controversy among epidemiologists and biostatisticians about the appropriate role of significance testing.85 To the strictest significance testers, any study whose p-value is not less than the level chosen for statistical significance should be rejected as inadequate to disprove the null hypothesis. Others are critical of using strict significance testing, which rejects all studies with an observed p-value below that specified level. Epidemiologists have become increasingly sophisticated in addressing the issue of random error and examining the data from a study to ascertain what information they may provide about the relationship between an agent and a disease, without the necessity of rejecting all studies that are not statistically significant.86 Meta-analysis, as well, a method for pooling the results of multiple studies, sometimes can ameliorate concerns about random error.87
Calculation of a confidence interval permits a more refined assessment of appropriate inferences about the association found in an epidemiologic study.88”
Id. at 578-79.
Mostly true, but again rather unhelpful to judges and lawyers. The authors divide the world up into “strict” testers and those critical of “strict” testing. Where is the boundary? Does criticism of “strict” testing imply an embrace of “non-strict” testing, or of no testing at all? I can sympathize with a judge who permits reliance upon a series of studies that all go in the same direction, each with a confidence interval that just misses excluding the null hypothesis. Meta-analysis in such a situation might not just ameliorate concerns about random error; it might eliminate them. But what of those critical of strict testing? Their criticism certainly does not suggest or imply that courts can or should ignore random error; yet that is exactly what happened in In re Viagra Products Liab. Litig., 572 F. Supp. 2d 1071, 1081 (D. Minn. 2008).

The chapter’s reference to confidence intervals is correct in part: intervals permit a more refined assessment because they convey the extent of random error in terms of the magnitude of association, along with the point estimate obtained from the sample. Confidence intervals, however, do not eliminate the need to interpret the extent of random error.
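To make the point about confidence intervals concrete, here is a minimal sketch using made-up 2×2 counts (33 cases among 1,000 exposed versus 20 among 1,000 unexposed; no cited study is intended). The standard log-scale interval for a relative risk shows both the estimated magnitude of association and the reach of random error, and in this example it “just misses excluding” the null value of 1.0.

```python
import math

def rr_confint(a, n1, b, n0, z=1.96):
    """Relative risk and approximate 95% CI from exposed (a/n1)
    vs unexposed (b/n0) counts, computed on the log scale."""
    rr = (a / n1) / (b / n0)
    se_log = math.sqrt(1 / a - 1 / n1 + 1 / b - 1 / n0)
    lo = math.exp(math.log(rr) - z * se_log)
    hi = math.exp(math.log(rr) + z * se_log)
    return rr, lo, hi

# Hypothetical counts: 33/1000 exposed cases vs 20/1000 unexposed cases
rr, lo, hi = rr_confint(33, 1000, 20, 1000)
print(f"RR = {rr:.2f}, 95% CI ({lo:.2f}, {hi:.2f})")  # an interval that spans 1.0
```

The interval conveys far more than “not significant at 0.05” would: a reader sees an elevated point estimate together with a lower bound barely below 1.0, which is exactly the sort of refined assessment the chapter gestures at without explaining.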
In the final analysis, the epidemiology chapter is unclear and imprecise. I believe it confuses matters more than it clarifies. There is clearly room for improvement in the Fourth Edition.
Two remarks: I very much agree with Schachtman that it is misleading to “divide the world up into ‘strict’ testers and those critical of ‘strict’ testing”. It is time to get beyond the much-lampooned image of recipe-like uses of significance tests. One hopes the Fourth Edition recognizes and illustrates how a right-headed use of tests (and CIs) can serve to indicate the extent of discrepancies that are/are not warranted by data.
Schachtman chides the Manual for conflating “Fisher’s interpretation of p-values with Neyman’s conceptualization of hypothesis testing as a dichotomous decision procedure.” My own view is that the behavioristic “accept/reject” view of hypothesis testing so often attributed to Neyman is a serious caricature, at least of what Neyman (as well as Pearson) thought, and of what tests are capable of. Fisher’s practice is at least as automatic and dichotomous as Neyman’s theory, if not more so (recall the posts from the “triad”, e.g., here). Also, in this connection, watch the second half of the address by Sir David R. Cox:
ASA President’s Invited Address: “Statistical Analysis: Current Position and Future Prospects” (Aug. 1, 2011)