One of the first examples I came across of problems in construing statistically insignificant (or “negative”) results was a House Science and Technology investigation of an EPA ruling on formaldehyde in the 1980s. Those investigating the EPA (led by then-Representative Al Gore!) used rather straightforward, day-to-day reasoning: No evidence of risk is not evidence of no risk. Given the growing interest in science and values both in philosophy and in science and technology studies, I made the “principle” explicit. I thought it was pretty obvious, aside from my Popperian leanings. I’m surprised it’s still an issue.
The case involved the Occupational Safety and Health Administration (OSHA), and possible risks of formaldehyde in the workplace. In 1982, the new EPA assistant administrator, who had come in with Ronald Reagan, “reassessed” the data from the previous administration and, reversing an earlier ruling, announced: “There does not appear to be any relationship, based on the existing data base on humans, between exposure [to formaldehyde] and cancer” (Hearing p. 260).
The trouble was that this assertion was based on epidemiological studies that had little ability to produce a statistically significant result even if there were risks worth worrying about (according to OSHA’s standards of risks of concern, which were not in dispute).[i] The EPA’s assertion that the risks ranged from “0 to no concern” had not passed a very stringent or severe test.
In the spirit of keeping the discussion non-technical, I formulated a rather clunky “metastatistical rule” M (in Mayo 1991):
Rule (M): A statistically insignificant difference is a poor indication that an actual increased risk is less than (some amount) d* if it is very improbable that the test would have resulted in a more statistically significant difference, even if the actual increased risk is as large as d*.
Note: this is akin to saying: a statistically insignificant difference is a poor indication that an actual increased risk is less than (some amount) d*, if the power of the test to detect d* is low. The only difference is that M takes account of the actual insignificant p-value, and so is more informative. [ii]
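To see how the power version and rule M can come apart, here is a minimal sketch. It assumes a one-sided test of a Normal mean with a known standard error, which is only a stand-in: the model, the function names, and all the numbers are illustrative assumptions, not the formaldehyde data or the analysis in Mayo 1991.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard Normal: mean 0, sd 1


def power_at(d_star, se, alpha=0.05):
    """Probability the test rejects at level alpha when the
    actual increased risk equals d_star (assumed Normal model)."""
    z_alpha = STD_NORMAL.inv_cdf(1 - alpha)  # cutoff for significance
    return 1 - STD_NORMAL.cdf(z_alpha - d_star / se)


def prob_more_significant(d_obs, d_star, se):
    """Probability of a difference larger than the one observed
    (i.e., a more significant result) when the actual increased
    risk equals d_star. This is the quantity rule M refers to."""
    return 1 - STD_NORMAL.cdf((d_obs - d_star) / se)


# Hypothetical values, chosen only for illustration.
se = 1.0       # standard error of the estimated difference
d_star = 1.0   # increased risk deemed worth worrying about
d_obs = 0.5    # observed, statistically insignificant difference

print("power to detect d*:          ", round(power_at(d_star, se), 2))
print("P(more significant; risk=d*):", round(prob_more_significant(d_obs, d_star, se), 2))
```

With these made-up numbers the power is roughly 0.26 and the rule M probability roughly 0.69. Neither is high, so the insignificant result is a poor indication that the increased risk is below d*. Because an insignificant observed difference always falls short of the cutoff, the second probability is never smaller than the power, and it changes with the actual result, which is the sense in which M is more informative.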
Little did I know at the time, however (not until I found some papers in my attic a decade later), that Jerzy Neyman had made an analogous point in terms of power when he warned us about this common misinterpretation of non-rejections. To be continued at a later time (“Neyman’s Nursery”).
U.S. Congress. House of Representatives. Committee on Science and Technology. May 20, 1982. Formaldehyde: Review of Scientific Basis of EPA’s Carcinogenic Risk Assessment. Hearing before the Subcommittee on Investigations and Oversight. 97th Cong., 2d sess.
Mayo, D. 1991. Sociological Versus Metascientific Views of Risk Assessment. In Acceptable Evidence: Science and Values in Risk Management, edited by D. Mayo and R. Hollander. Oxford: Oxford University Press.
(For a rather scruffy copy: Sociological Versus Metascientific Views of Risk Assessment)
[i] Animal studies had resulted in statistically significant risks; but the studies on humans did not.
[ii] By contrast, when a result is statistically significant, the worry is that the test may be so sensitive as to be picking up on discrepancies from the null that are not substantively important.