Memory Lane: One Year Ago on errorstatistics.com
A quick perusal of the discussion of the “Manual” on Nathan Schachtman’s legal blog shows it to be chock full of revealing points of contemporary legal statistical philosophy. The following are some excerpts; read the full post here. I make two comments at the end.
July 8th, 2012
Nathan Schachtman
How does the new Reference Manual on Scientific Evidence (RMSE3d 2011) treat statistical significance? Inconsistently and at times incoherently.
Professor Berger’s Introduction
In her introductory chapter, the late Professor Margaret A. Berger raises the question of the role statistical significance should play in evaluating a study’s support for causal conclusions:
“What role should statistical significance play in assessing the value of a study? Epidemiological studies that are not conclusive but show some increased risk do not prove a lack of causation. Some courts find that they therefore have some probative value,62 at least in proving general causation.63”
Margaret A. Berger, “The Admissibility of Expert Testimony,” in RMSE3d 11, 24 (2011).
This seems rather backwards. Berger’s suggestion that inconclusive studies do not prove lack of causation seems nothing more than a tautology. And how can that tautology support the claim that inconclusive studies “therefore” have some probative value? This is a fairly obvious logically invalid argument, or perhaps a passage badly in need of an editor.
…………
Chapter on Statistics
The RMSE’s chapter on statistics is relatively free of value judgments about significance probability, and, therefore, a great improvement upon Berger’s introduction. The authors carefully describe significance probability and p-values, and explain:
“Small p-values argue against the null hypothesis. Statistical significance is determined by reference to the p-value; significance testing (also called hypothesis testing) is the technique for computing p-values and determining statistical significance.”
David H. Kaye and David A. Freedman, “Reference Guide on Statistics,” in RMSE3d 211, 241 (3ed 2011). Although the chapter confuses and conflates Fisher’s interpretation of p-values with Neyman’s conceptualization of hypothesis testing as a dichotomous decision procedure, this treatment is unfortunately fairly standard in introductory textbooks.
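To make the quoted definition concrete, here is a minimal sketch in Python (not from the Manual; the data and the one-sided binomial setup are invented for illustration) of how a p-value is computed and then compared with a conventional significance level:

```python
from scipy import stats

# Invented example: 60 "successes" in 100 trials; the null hypothesis
# says the true success rate is 0.5.
k, n, p0 = 60, 100, 0.5

# One-sided p-value: the probability, computed under the null hypothesis,
# of observing k or more successes.
p_value = stats.binom.sf(k - 1, n, p0)

alpha = 0.05  # conventional significance level
print(f"p-value = {p_value:.3f}")
print("statistically significant at alpha = 0.05"
      if p_value < alpha else "not statistically significant at alpha = 0.05")
```

The small p-value counts against the null in the Fisherian sense described in the quotation; treating 0.05 as a hard accept/reject line is the further, decision-style step that the chapter runs together with it.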
Kaye and Freedman, however, do offer some important qualifications concerning the untoward consequences of using significance testing as a dichotomous decision procedure:
“Artifacts from multiple testing are commonplace. Because research that fails to uncover significance often is not published, reviews of the literature may produce an unduly large number of studies finding statistical significance.111 Even a single researcher may examine so many different relationships that a few will achieve statistical significance by mere happenstance. Almost any large data set—even pages from a table of random digits—will contain some unusual pattern that can be uncovered by diligent search. Having detected the pattern, the analyst can perform a statistical test for it, blandly ignoring the search effort. Statistical significance is bound to follow.
There are statistical methods for dealing with multiple looks at the data, which permit the calculation of meaningful p-values in certain cases.112 However, no general solution is available, and the existing methods would be of little help in the typical case where analysts have tested and rejected a variety of models before arriving at the one considered the most satisfactory (see infra Section V on regression models). In these situations, courts should not be overly impressed with claims that estimates are significant. Instead, they should be asking how analysts developed their models.113 ”
Id. at 256-57. This qualification is omitted from the overlapping discussion in the chapter on epidemiology, where it is very much needed.
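The “multiple looks” problem described in the quoted passage is easy to demonstrate by simulation. The following is a small illustrative sketch (not from the Manual; all numbers invented), in which every tested relationship is truly null:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Invented setup: 100 comparisons in which the null hypothesis is true
# for every one (both groups drawn from the same distribution).
n_tests, n_per_group = 100, 50
significant = 0
for _ in range(n_tests):
    a = rng.normal(size=n_per_group)
    b = rng.normal(size=n_per_group)
    _, p = stats.ttest_ind(a, b)
    if p < 0.05:
        significant += 1

# Around 5 of the 100 null comparisons will be "statistically significant"
# by chance alone; reporting only those would be badly misleading.
print(f"nominally significant results: {significant} of {n_tests}")
```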
……..
Chapter on Epidemiology
The chapter on epidemiology mostly muddles the discussion set out in Kaye and Freedman’s chapter on statistics.
“The two main techniques for assessing random error are statistical significance and confidence intervals. A study that is statistically significant has results that are unlikely to be the result of random error, although any criterion for “significance” is somewhat arbitrary. A confidence interval provides both the relative risk (or other risk measure) found in the study and a range (interval) within which the risk likely would fall if the study were repeated numerous times.”
Michael D. Green, D. Michal Freedman, and Leon Gordis, “Reference Guide on Epidemiology,” in RMSE3d 549, 573. The suggestion that a statistically significant study has results unlikely due to chance probably crosses the line in committing the transpositional fallacy so nicely described and warned against in the chapter on statistics.
The suggestion that alpha is “arbitrary” is “somewhat” correct, but this truncated discussion is distinctly unhelpful to judges who are likely to take “arbitrary” to mean “I will get reversed.” The selection of alpha is conventional to some extent, and arbitrary in the sense that the law’s setting an age of majority or a voting age is arbitrary. Some young adults, say 17.8 years old, may be better educated, better engaged in politics, and better informed about current events than 35-year-olds, but the law must set a cutoff. Two-year-olds are demonstrably unfit, and 82-year-olds are surely past the threshold of maturity requisite for political participation. A court might admit an opinion based upon a study of rare diseases, with tight control of bias and confounding, when p = 0.051, but that is hardly a justification for ignoring random error altogether, or for admitting an opinion based upon a study in which the disparity observed had a p = 0.15.
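For readers who want the relationship between the two techniques spelled out, here is a minimal sketch of a 95% confidence interval for a relative risk and its connection to significance at the 0.05 level (invented study numbers, using the usual large-sample formula; it is not an example from the Manual):

```python
import math
from scipy import stats

# Invented 2x2 study data:
# exposed:   30 cases among 1,000 persons
# unexposed: 20 cases among 1,000 persons
a, n1 = 30, 1000
b, n2 = 20, 1000

rr = (a / n1) / (b / n2)                         # point estimate of the relative risk
se_log_rr = math.sqrt(1/a - 1/n1 + 1/b - 1/n2)   # standard error of log(RR)
z = stats.norm.ppf(0.975)                        # ~1.96 for a 95% interval

lower = math.exp(math.log(rr) - z * se_log_rr)
upper = math.exp(math.log(rr) + z * se_log_rr)

print(f"RR = {rr:.2f}, 95% CI ({lower:.2f}, {upper:.2f})")
# If the 95% interval excludes RR = 1.0, the association is statistically
# significant at the (two-sided) 0.05 level; here it does not.
```

The interval displays the magnitude of the association together with the play of random error, rather than reducing the study to a bare significant/not-significant verdict.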
The epidemiology chapter correctly calls out judicial decisions that confuse “effect size” with statistical significance:
“Understandably, some courts have been confused about the relationship between statistical significance and the magnitude of the association. See Hyman & Armstrong, P.S.C. v. Gunderson, 279 S.W.3d 93, 102 (Ky. 2008) (describing a small increased risk as being considered statistically insignificant and a somewhat larger risk as being considered statistically significant….”
Actually, this confusion is not understandable at all, other than to emphasize that the cited courts badly misunderstood significance probability and significance testing. The authors could well have added In re Viagra to the list of courts that confused effect size with statistical significance. See In re Viagra Products Liab. Litig., 572 F. Supp. 2d 1071, 1081 (D. Minn. 2008).
…………….
When they are on message, the authors of the epidemiology chapter are certainly correct that significance probability cannot be translated into an assessment of the probability that the null hypothesis, or the obtained sampling statistic, is correct. What these authors omit, however, is a clear statement that the many courts and counsel who misstate this fact do not create any worthwhile precedent, persuasive or binding.
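One way to see why the translation fails is a simple screening-style calculation; all numbers below are invented, and the “urn of hypotheses” framing is offered only as an illustration of the arithmetic, not as a model of scientific inference:

```python
# Invented screening-style illustration: why a p-value (computed under the
# null) is not the probability that the null hypothesis is correct.
n_studies = 1000
prior_null = 0.9     # suppose 90% of the hypotheses examined are truly null
alpha = 0.05         # chance a truly null hypothesis yields p < 0.05
power = 0.8          # chance a real effect yields p < 0.05

null_studies = n_studies * prior_null          # 900
real_studies = n_studies - null_studies        # 100

false_positives = null_studies * alpha         # 45 significant by chance
true_positives = real_studies * power          # 80 real effects detected

frac_null_among_significant = false_positives / (false_positives + true_positives)
print(f"fraction of 'significant' results where the null is true: "
      f"{frac_null_among_significant:.2f}")    # about 0.36, not 0.05
```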
The epidemiology chapter ultimately offers nothing to help judges in assessing statistical significance:
“There is some controversy among epidemiologists and biostatisticians about the appropriate role of significance testing.85 To the strictest significance testers, any study whose p-value is not less than the level chosen for statistical significance should be rejected as inadequate to disprove the null hypothesis. Others are critical of using strict significance testing, which rejects all studies with an observed p-value below that specified level. Epidemiologists have become increasingly sophisticated in addressing the issue of random error and examining the data from a study to ascertain what information they may provide about the relationship between an agent and a disease, without the necessity of rejecting all studies that are not statistically significant.86 Meta-analysis, as well, a method for pooling the results of multiple studies, sometimes can ameliorate concerns about random error.87
Calculation of a confidence interval permits a more refined assessment of appropriate inferences about the association found in an epidemiologic study.88”
Id. at 578-79.
Mostly true, but again rather unhelpful to judges and lawyers. The authors divide the world up into “strict” testers and those critical of “strict” testing. Where is the boundary? Does criticism of “strict” testing imply embrace of “non-strict” testing, or of no testing at all? I can sympathize with a judge who permits reliance upon a series of studies that all go in the same direction, each having a confidence interval that just misses excluding the null hypothesis. Meta-analysis in such a situation might not just ameliorate concerns about random error; it might eliminate them. But what of those critical of strict testing? Such criticism certainly does not suggest or imply that courts can or should ignore random error; yet that is exactly what happened in In re Viagra Products Liab. Litig., 572 F. Supp. 2d 1071, 1081 (D. Minn. 2008). The chapter’s reference to confidence intervals is correct in part; they permit a more refined assessment because they give a more direct picture of the extent of random error in terms of the magnitude of the association, along with the point estimate obtained from the sample. Confidence intervals, however, do not eliminate the need to interpret the extent of random error.
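Schachtman’s point about a series of same-direction studies, each just missing significance, can be illustrated with a toy fixed-effect (inverse-variance) meta-analysis; the study numbers below are invented, and this is only one standard way of pooling:

```python
import math
from scipy import stats

# Invented example: four studies, each with a 95% CI that just fails to
# exclude RR = 1, pooled by a simple fixed-effect (inverse-variance) method.
studies = [            # (log relative risk, standard error of the log RR)
    (math.log(1.30), 0.14),
    (math.log(1.25), 0.13),
    (math.log(1.35), 0.16),
    (math.log(1.28), 0.15),
]

weights = [1 / se**2 for _, se in studies]
pooled_log_rr = sum(w * lrr for (lrr, _), w in zip(studies, weights)) / sum(weights)
pooled_se = math.sqrt(1 / sum(weights))

z = stats.norm.ppf(0.975)
lower = math.exp(pooled_log_rr - z * pooled_se)
upper = math.exp(pooled_log_rr + z * pooled_se)

print(f"pooled RR = {math.exp(pooled_log_rr):.2f}, "
      f"95% CI ({lower:.2f}, {upper:.2f})")
# Each study alone is borderline, but the pooled interval excludes 1.0,
# which is the sense in which meta-analysis can shrink the role of random error.
```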
In the final analysis, the epidemiology chapter is unclear and imprecise. I believe it confuses matters more than it clarifies. There is clearly room for improvement in the Fourth Edition.
Two remarks: I very much agree with Schachtman that it is misleading to “divide the world up into ‘strict’ testers and those critical of ‘strict’ testing.” It is time to get beyond the much-lampooned image of recipe-like uses of significance tests. One hopes the Fourth Edition recognizes and illustrates how a right-headed use of tests (and CIs) can serve to indicate the extent of discrepancies that are/are not warranted by data.
Schachtman chides the Manual for conflating “Fisher’s interpretation of p-values with Neyman’s conceptualization of hypothesis testing as a dichotomous decision procedure.” My own view is that the behavioristic “accept/reject” view of hypothesis testing so often attributed to Neyman is a serious caricature, at least of what Neyman (as well as Pearson) thought (and of what tests are capable of). Fisher’s practice is at least as automatic and dichotomous as Neyman’s theory, if not more so (recall the posts from the “triad”, e.g., here). Also, in this connection, watch the second half of the address by Sir David R. Cox:
ASA President’s Invited Address: “Statistical Analysis: Current Position and Future Prospects” (Aug. 1, 2011)
http://www.amstat.org/meetings/jsm/2011/webcasts/index.cfm
[Windows Media Player (.wmv) – 267MB]
[AVI movie (.avi) – 171MB]
Note (7-10-13): I can’t access these (though I saw Cox give the talk), but maybe others can. See comments from One Year Ago.
Mayo,
Thanks for the constructive feedback. Perhaps it is fair to say that the divide between N-P and Fisher can be discerned from some, but not all, of their writings, and that the dichotomy may not be entirely fair to either camp when their writings are taken as a whole. I certainly have not read all the historical writings thoroughly enough to offer a view of what these authors actually believed, but it is not difficult to find some supporting statements from each side. My point may need to be rephrased, but the Manual’s presentation is clearly lacking in nuance and historical context.
Do you believe that N-P and Fisher contributed to the caricatures of each other’s views?
Putting aside the historical issues, I believe that the presentation of the controversy in the Reference Manual suffers from its assumption that hypothesis testing is a strict testing procedure and that there is no other way to interpret p-values. I found it particularly distressing that the epidemiology chapter (M. Green, L. Gordis, M. Freedman) cites to Ziliak and McCloskey’s Cult book, with what appears to be approval. This sort of exposition is not going to help the judiciary make sense of the use of statistical analyses in the many contexts that arise in the law – epidemiology, economics, discrimination claims, etc.
Nathan
Nate:
“Do you believe that N-P and Fisher contributed to the caricatures of each other’s views?”
Yes, definitely. For just one example of where this comes up on this blog, see:
https://errorstatistics.com/2013/02/16/fisher-and-neyman-after-anger-management/
Nate:
In the post you aver: “The suggestion that a statistically significant study has results unlikely due to chance probably crosses the line in committing the transpositional fallacy so nicely described and warned against in the chapter on statistics”.
I have discussed why I think this criticism is misplaced in “Telling what’s true about significance levels”
https://errorstatistics.com/2013/03/27/higgs-analysis-and-statistical-flukes-part-2/
See what I mean? I think this understanding of the p-value helps avoid some of the well-known criticisms.
Mayo,
I will read your earlier post. I wasn’t averring so much as just saying that the authors’ statement that “a statistically significant study has results unlikely due to chance” is a short-hand for the results (or results more extreme) being unlikely due to chance, given the model and the assumption of the null hypothesis. I threw in the “probably” before “crosses the line” as a weak attempt at irony.
My limited point is that the authors’ formulation, which is a short-hand for what I believe they stated more accurately and fully earlier, is dangerous in the context of their speaking to judges and lawyers who have repeatedly committed the transposition fallacy despite the good advice of the Manual’s statistics chapter. (Many examples on my website.)
What I suppose I would have the authors do is point out all the assumptions at least once, the first time they introduce the concept of a p-value, and remind the reader that all these assumptions carry along with their discussion even if, for linguistic felicity, they happen to drop one or another in a later sentence.
Wouldn’t that be helpful in your view for the intended audience?
A low p-value does constitute evidence against the null, even if it doesn’t translate directly into a likelihood that the null is incorrect, or into the complement of the likelihood that the observed result (and those more extreme) is correct. So I wasn’t saying that p-values don’t contribute to our understanding of a warrant for rejecting the null hypothesis.
OK; let me go read your earlier post again, more carefully.
Nathan
OK; I see now your discussion of the “principle of charity,” and how it would be ungenerous to interpret what I have characterized as a short-hand for an earlier, correct statement. I agree, with the concern that judges will latch on to the one short-hand formulation because it fits their desire to have a Bayesian statement, and from there, we have statistical chaos.
Nate: I surmised you would jump on that lawyerly term “aver”. But anyway, I really mean something stronger, even though I’m happy to leave it for now as a matter of generosity (since I plan to come back to this). I actually think it is entirely correct to state the p-value as I do in that post, and as they do in the manual, and that it is only by failing to grasp the intimate connection between Ho and what it says of the prob of the statistically significant outcome, that the misreading occurs. This pertains to that discussion we had one time about the difference between a conditional probability and the probability (of a stat significant outcome, say) under the supposition that it was generated by a procedure as described in Ho. Lots of people said I was being picayune, but I think it’s non-trivial. Sorry not to be clearer, dashing off to an appt just now….