Having discussed the “p-values overstate the evidence against the null fallacy” many times over the past few years, I leave it to readers to disinter the issues (pro and con), and appraise the assumptions, in the most recent rehearsal of the well-known Bayesian argument. There’s nothing intrinsically wrong with demanding everyone work with a lowered p-value–if you’re so inclined to embrace a single, dichotomous standard without context-dependent interpretations, especially if larger sample sizes are required to compensate the loss of power. But lowering the p-value won’t solve the problems that vex people (biasing selection effects), and is very likely to introduce new ones (see my comment). Kelly Servick, a reporter from Science, gives the ingredients of the main argument given by “a megateam of reproducibility-minded scientists” in an article out today:
To explain to a broader audience how weak the .05 statistical threshold really is, Johnson joined with 71 collaborators on the new paper (which partly reprises an argument Johnson made for stricter p-values in a 2013 paper). Among the authors are some big names in the study of scientific reproducibility, including psychologist Brian Nosek of the University of Virginia in Charlottesville, who led a replication effort of high-profile psychology studies through the nonprofit Center for Open Science, and epidemiologist John Ioannidis of Stanford University in Palo Alto, California, known for pointing out systemic flaws in biomedical research.
The authors set up a scenario where the odds are one to 10 that any given hypothesis researchers are testing is inherently true—that a drug really has some benefit, for example, or a psychological intervention really changes behavior. (Johnson says that some recent studies in the social sciences support that idea.) If an experiment reveals an effect with an accompanying p-value of .05, that would actually mean that the null hypothesis—no real effect—is about three times more likely than the hypothesis being tested. In other words, the evidence of a true effect is relatively weak.
But under those same conditions (and assuming studies have 100% power to detect a true effect)—requiring a p-value at or below .005 instead of .05 would make for much stronger evidence: It would reduce the rate of false-positive results from 33% to 5%, the paper explains.
Her article is here.
From the perspective of the Bayesian argument on which the proposal is based, the p-value appears to exaggerate evidence, but from the error statistical perspective, it’s the Bayesian inference (to the alternative) that exaggerates the inference beyond what frequentists allow. Greenland, Senn, Rothman, Carlin, Poole, Goodman, Altman (2016, p. 342) observe, correctly, that whether “P-values exaggerate the evidence” “depends on one’s philosophy of statistics and the precise meaning given to the terms involved”. [1]
Share your thoughts.
[1] .”..it has been argued that P values overstate evidence against test hypotheses, based on directly comparing P values against certain quantities (likelihood ratios and Bayes factors) that play a central role as evidence measures in Bayesian analysis … Nonetheless, many other statisticians do not accept these quantities as gold standards” (Greenland et al, p. 342).
There are areas where p-values work well. In industrial experiments with follow up replication, for example. In agriculture, again with follow up replication. In both areas, things are very systematic. The question is posed. Randomization is used. Just re-read any good book on the design of experiments.
The wheels of science fall off with p-hacking and HARKing which is VERY common. I’ve now counted out over 50 observational papers. When you count out outcomes-of-interest, potential-predictors, and covariates, the median number of questions under consideration is on the order of 9,000. Nine followed by three zeros. It is no wonder p-values from that process do not replicate.
Use an honest 0.05 (one question and follow the other rules), then replicate and you get 0.0025, which is close enough to 0.005 for me. Boos and Stefanski suggest 0.001.
None of these suggestions, 0.005, 0.0025, 0.001, work if there is p-hacking and HARKing.
Stan: Thanks for your comment. Yes, cherry-picking, multiple testing, hunting for significance, optional stopping etc. are the main culprits in flawed findings, but equally, if not more serious are violated statistical assumptions–something rarely mentioned, let alone checked. Alas, in some fields, much of this is irrelevant because the problems are deeper than statistical method (as normally understood).It’s rather that statistical inferences, even ones that pass muster, may be quite irrelevant for the actual goal: warranting a scientific research claim, or learning about a substantive phenomenon of interest. Launching these technical reforms (as psych has been doing for decades) might even encourage looking away from the more serious problems of statistical-substantive links, and questionable measurements.
It should be remembered that because the criticism is based on two-sided tests, even when there’s a predicted direction, the .05 level is actually .025 for the predicted direction, as I understand it. (In other words, you don’t typically see people using .05 one-sided tests, which really is weak).
My main problem with this particular argument is that it rests on a questionable Bayesian appraisal wherein a research hypothesis is claimed to have prior probability .1 because it has, allegedly, been selected from an urn of hypotheses only 10% of which are true–90% of the corresponding nulls are true. (It’s kind of a cross between a type of Bayesian prior and a diagnostic screening prevalence.) The real problem with the project-despite its leaders having the best of intentions–is that people may think this is the correct way to appraise evidence of a given hypothesis in science (transposing the “conditional” in an ordinary error probability). I’ve talked about all this before, so, I’ll just leave it to others to weigh in.
P-hacking, experimental biases, etc., aside, you potentially face two practical problems:
1. You don’t really know the value of, say, 2-sigma, but only an estimate of it. In the long run that estimate may be unbiased, but the p-value is highly non-linear in the value of sigma you use. So for *your* particular data set, who knows how off your estimated p-value is? Of course, it might be off in a favorable way, but how can you tell?
2. You cannot, in many cases, really show from your data that the distributions are gaussian. You just can’t get enough points to reliably define the tails. If the tails are fatter, then a two-sigma (or whatever) spread won’t actually give you the p-value you thought you had. Of course, there are bounds (e.g., Chebyshev’s inequality), but they can be much looser than you might like your p-value to be.
It can be useful to do simulations – lots of runs – to get a sense of how these points may affect your particular case. And it is well to be humble about p-values at all close to a magic number like 0.05.
Tom: yes it’s best to be humble…but don’t let Toxic Co. get away too easily maintaining we don’t have darn good evidence of risks of harms*. Don’t forget there’s a second type of error.
Part of taking precautions, of course, is testing statistical assumptions, and the method for such checking is significance tests! The null asserts the given assumption holds, so, again, you might not want to be too demanding before claiming evidence of a model violation, if you’re keen to detect the fat tails you mention.
My own recommendation is never to just infer the difference is significant at such and such level, but to infer the magnitudes that are well and poorly indicated by the data at hand.** We should steer clear of all recipe statistics, whether in the form of Bayes factors (what do they mean?) or significance levels (which at least are calibrated with error probabilities).
*Neyman puts the “risk exists” hypothesis as the test or null hypothesis, by the way–in contrast with the standpoint of the point null.
**After tests of assumptions have shown reasonable model adequacy.
“Neyman puts the “risk exists” hypothesis as the test or null hypothesis, by the way–in contrast with the standpoint of the point null.” That’s interesting … I tend to think that an emphasis on a “null” hypothesis is overblown. It’s like some number on a number line. What is special about that particular point? Why not start from your measured results as the null instead? Or some other values.. The reasoning, if it is correct, should allow you to invert the roles of null hypothesis and the measurement results. And, after all (bias and other non-ideal effects aside), the sample mean is the best estimate of the population mean.
Tom, there is no such thing in statistics as a “best estimate”. There is such a thing as a “best estimator”, and the difference is not just semantics…
This is my reponse to Benjamin et al (page 17 at http://www.biorxiv.org/content/early/2017/07/24/144337 )
“Since this paper was written, a paper (with 72 authors) has appeared [39] which proposes to change the norm for “statistical significance” from P = 0.05 to P = 0.005. Benjamin et al. [39] makes many of the same points that are made here, and in [1]. But there a few points of disagreement,
(1) Benjamin et al. propose changing the threshold for “statistical significance”, whereas I propose dropping the term “statistically significant” altogether: just give the P value and the prior needed to give a specified false positive rate of 5% (or whatever). Or, alternatively, give the P value and the minimum false positive rate (assuming prior odds of 1). Use of fixed thresholds has done much mischief.
(2) The definition of false positive rate in equation 2 of Benjamin et al. [39] is based on the p-less-than interpretation. In [1], and in this paper, I argue that the p-equals interpretation is more appropriate for interpretation of single tests. If this is accepted, the problem with P values is even greater than stated by Benjamin et al. (e.g see Figure 2).
(3) The value of P = 0.005 proposed by Benjamin et al. [39] would, in order to achieve a false positive rate of 5%, require a prior probability of real effect of about 0.4 (from calc-prior.R, with n = 16). It is, therefore, safe only for plausible hypotheses. If the prior probability were only 0.1, the false positive rate would be 24% (from calc-FPR+LR.R, with n = 16). It would still be unacceptably high even with P = 0.005. Notice that this conclusion differs from that of Benjamin et al [39] who state that the P = 0.005 threshold, with prior = 0.1, would reduce the false positive rate to 5% (rather than 24%). This is because they use the p-less-than interpretation which, in my opinion, is not the correct way to look at the problem.”
David: Thank you for sharing your comment on the paper!
David: A short commentary of interest to you: https://errorstatistics.files.wordpress.com/2017/08/casella-berger-comment-on-berger-delampady-stat-sci-1987-1.pdf
Thanks for that. They say “We would be surprised if most researchers would place even a 10%
prior probability on Ho.”
That’s hilarious. Pure hubris. It’s easy to ‘prove’ your hypothesis if you assume that it’s almost certain to be true before you did the experiment. Good luck in getting any paper that presented an analysis based on that premise past editors.
David:
It’s the opposite, the spiked prior begs the question. You erroneously think scientific claims are shown to be warranted or well tested by giving them a large Bayes boost, defined in a very odd way. There’s no error control. C & B (Roger) are simply showing that even if one were to grant the Bayesian point null context, it wouldn’t properly be one with a high spiked prior. Of course, from the start, Edwards, Lindmann and Savage said this.
The ‘prior probability’ is central to discussions about P values. The problem is that it means different things to different people. It could mean the prior probability (i.e. ‘prior’ to analyzing the data) that the methods and results were not described accurately so that any attempt to repeat the study would not include the original hidden biases, cherry picking, P hacking etc. used to boost the result illegally. This means that the study would have a low probability of being replicated when repeated as described and before the data were analysed. This situation could be minimized by ‘registered reports’ (as suggested in an earlier Error Statistics blog on 11th July).
The prior probability of the outcome of a random selection is what most people seem to mean by the ‘prior probability’ however. This may mean taking into account prior data (e.g. from a pilot study or from a similar study or from an imagined study) that were done in exactly the same way as the current study. This can be done (a) by combining such the prior distribution of such data with the likelihood distribution of the current data or (b) performing a simple meta-analysis by combining the ‘prior data’ with the ‘current data’ and assuming a ‘uninformed prior’ from the combined data or (c) assuming a ‘uninformed prior’ from the ‘current data’ and ignoring ‘prior data’.
It seems to me that the ‘base-rate’ prior probabilities of the possible results of a random selection are always uniform (see the following blog: https://blog.oup.com/2017/06/suspected-fake-results-in-science/). It is only when prior data are taken into account to generate a non-base-rate prior probability that the latter are not uniform. The evidence on which these different prior probabilities are based then has to be combined with the evidence of the new data to produce a probability of replication based on all the evidence. This may have to be combined with even more evidence to arrive at the probabilities of various scientific hypotheses being ‘true’.
Changing levels of ‘significance’ from 0.05 to 0.005 does not seem an adequate solution.
OK, I have now read the whole paper, which focuses exclusively on false positive rate and scenario 2 from my earlier post – that I discount because the success rate (% significant results) is too low to support an academic career. The paper defines false positive differently to me: I use alpha for a single hypothesis, but the paper introduces a population of hypotheses in which a certain percentage are true and compute the ratio Pr(H1)/Pr(H0). Fig 2 is misleading on two counts: firstly it is not clearly stated that it is conditional on achieving a statistically significant result; and secondly it uses (for Pr(H1)/Pr(H0)) values of 1:40, 1:10 and 1:5, entirely in line with my discounted scenario 2. I believe that professors of psychology, economics and so forth are much better than this so we should be looking at Pr(H1)/Pr(H0) ratios of 40:1, 10:1, and 5:1 to give a much more realistic picture of the (negligible) effect of false positives. So I stick with the view that the replication problem is caused, not by too many false positives, but by insufficient power.
Of course the prior of 0.1 is entirely made up (just as all priors are), But it isn’t an entirely unreasonable value in some cases so it makes sense to calculate what would be expected if the prior were 0,1 .
It is because the prior is never known that I advocate expressing uncertainty by calculating the prior that you’d have to believe in order to achieve a false positive rate (FPR) of, say, 5%, This iis Matthews’ reverse Bayesian approach -see section 7 in http://www.biorxiv.org/content/biorxiv/early/2017/07/24/144337.full.pdf
The conversation about P-values and replication remains off-course, despite the best attempts of the ASA’s statement on P-values. Even participants of the symposium that drafted the statement seem unable to take to heart its admonition against “bright line thinking”. Changing the bright line from 0.05 to 0.005 is unlikely make for better science. Instead we should be erasing the bright line and requiring scientists to make scientific arguments about the evidence in the context of theory and logic. Statistical analyses should support such argument, but it cannot substitute for it no matter where the bright line is drawn.
The many-authored paper does, belatedly, mention the bright line, but falls down by implying that the impediment to removing the line is a need to find a singular alternative method. Mayo, Stan and David all imply that the common assumption of ‘one size fits all’ for statistical inference is part of the problem. I agree. It is a large part, and it provides most of the impediment to a more reasonable use of statistical inference in scientific inference.
The proposal to reduce the critical threshold for significance from 0.05 to 0.005 may indeed reduce the number of unsupportable claims, and that may be helpful in some fields, but in other fields of research the power and resource costs of such a change would outweigh the benefits.
Michael makes a great point about the bright line thinking being a big part of what was broken. They are reinforcing that mindset.
Peter also made a great point regarding the priors maybe being 10:1 not 1:10. It really depends upon the circumstances. Priors should be subject to rigor.
Cox laid out a variety of species of statistical hypotheses subject to significance testing. They will not all tend to involve “true” nulls. Some will tend to involve false nulls.
Another factor I will raise again is that there is a lot more to replication than statistics. Many failures probably relate to a failure to truly replicate the conditions or to make the observations the same way. I believe these problems are often because the original work was poorly described or maybe even poorly designed. What you call a widget might not be what I call a widget. Maybe my measurement method was obscure in my paper. The replication team took it differently. Or, I forgot to mention a key environmental factor. Significance level will not help with these things. Neither will a Bayes Factor.
All these points argue against the monolithic thinking suggested by the paper.
John: Agree with all you say, but I find it especially troubling to spoze an error statistician (or user of an error statistical method) ought to use a Bayes Factor as the future gold standard for measuring his error statistical tool, forcing it to come in line, if it disagrees––even though Bayes Factors don’t control or measure error probabilities. Why shouldn’t we instead use our error probability measures to evaluate if their Bayes Factors are exaggerating evidence. If we do, we will discover that they greatly exaggerate evidence,licensing a fairly high posterior probability to an alternative (max likely) hypothesis, even though such an assignment would be wrong with very high error probability!
So lowering the p-value by itself isn’t the problem, it’s basing doing so on a foreign argument that presupposes that this is the error statistician’s new master.
I was interviewed by someone at Slate today on this.
by the way anyone going to the JSM?
Some Fisher quotes:
“… no scientific worker has a fixed level of significance at which from year to year, and in all circumstances, he rejects hypotheses; he rather gives his mind to each particular case in the light of his evidence and his ideas.”
“If one in twenty does not seem high enough odds, we may, if we prefer it, draw the line at one in fifty (the 2 per cent. point), or one in a hundred (the 1 per cent. point). Personally, the writer prefers to set a low standard of significance at the 5 per cent. point, and ignore entirely all results which fail to reach this level. A scientific fact should be regarded as experimentally established only if a properly designed experiment rarely fails to give this level of significance.”
“No isolated experiment, however significant in itself, can suffice for the experimental demonstration of any natural phenomenon; for the “one chance in a million” will undoubtedly occur, with no less and no more than its appropriate frequency, however surprised we may be that it should occur to us.”
These are very old concepts. Maybe we should not re-invent the wheel, but rather just roll with it.
John: some seem not to want to learn how to drive, or see it as too dangerous…or something. It is peculiar that today’s debunkers, some of them, sound like they’re reinventing well known howlers–and think they deserve credit.
I agree entirely that, in the end, replication is the only solution. But it takes only a glance at the biomedical literature to see that it’s full of examples of claims made on the basis of P < 0.05' I include the following example in my paper at
Click to access 144337.full.pdf
"Here’s a real life example. A study of transcranial electromagnetic stimulation, published In Science, concluded that it “improved associative memory performance”, P = 0.043 [26]. If we assume that the experiment had adequate power (the sample size of 8 suggests that might be optimistic) then, in order to achieve a false positive risk of 5% when we observe P = 0.043, we would have to assume a prior probability of 0.85 that the effect on memory was genuine (found from calc-prior.R). Most people would think it was less than convincing to present an analysis based on the assumption that you were almost certain (probability 0.85) to be right before you did the experiment. Another way to express the strength of the evidence provided by P = 0.043 is to note that it makes the existence of a real effect only 3.3 times as likely as the existence of no effect (likelihood ratio found by calc-FPR+LR.R [40]). This would correspond to a minimum false positive risk of 23% if we were willing to assume that non-specific electrical zapping of the brain was as likely as not to improve memory (prior odds of a real effect was 1)."
I cannot get past the swag of 1:10 as a prior for the appropriateness of null hypotheses. This is at best a shot from the hip, which makes it ironic that it can be used to refine/calibrate p-values, which have a variety of explicit conditions we are supposed to meet before using the statistic. Seems cavalier, naive, reckless? Is 1:10 replicable? Did it ever have merit to begin with?
It is amazing, I chalk it up to a handful of highly persuasive, smart guys.
There is no doubt that hypotheses vary in their plausibility. In some cases 0.1 would be much too high -for example if you were testing a homeopathic pill against a placebo (the pills would be identical, if you believe Avogadro’s number). And 0.1 might well be in the right ball park for testing a new drug (most of them don’t work).
But rather than guessing at a prior, it seems better to me to calculate the prior that you’d need to achieve a false positive risk of (say) 5%. It’s then up to you to persuade your readers that the prior is reasonable.
See http://www.biorxiv.org/content/biorxiv/early/2017/08/07/144337.full.pdf
The call for smaller significance levels cannot be based only on mathematical arguments that p values tend to be much lower than posterior probabilities, as Andrew Gelman and Christian Robert pointed out in their comment (“Revised evidence for statistical standards”).
In the rejoinder, Valen Johnson made it clear that the call is also based on empirical findings of non-reproducible research results. How many of those findings are significant at the 0.005 level? Should meta-analysis have a less stringent standard?
Irreplicable results can’t possibly add empirical clout to the mathematical argument unless it is already known or assumed to be caused by a given cut-off, and further, that lowering it would diminish those problems. In fact the problems are largely caused by biasing selection effects and the well-known fallacy of moving from a statistically significant result to a research hypothesis. Insofar as the latter hasn’t been well probed by the significance test, let alone the significance seeking, lowering the p-value wouldn’t help: fish for significance just a bit longer. We know the causes and they are easy to demonstrate. Most worrisome, Johnson and other originators of this argument are Bayesians who hold the likelihood principle which entails the irrelevance of error probabilities once data are in hand (the methods condition on the data). It also follows that the methods don’t pick up on gambits that alter error probabilities, as J. Berger others make clear. So if their recommended tools are adopted, error control is at most secondary, and we need to rely on significance tests to tell us if anything has replicated. (How do you replicate a Bayes factor, especially given the Bayes factors used in the argument of the p-value lowerers (point nulls with 2-sided tests. The argument underlying this reform comes first from Edwards, Lindman and Savage (1963)–maybe even earlier–, but even they question the soundness of using point nulls.
I haven’t mentioned another biggies, the problem of violations of statistical models–at least as nefarious as biasing selection effects.
“Irreplicable results can’t possibly add empirical clout to the mathematical argument unless it is already known or assumed to be caused by a given cut-off, and further, that lowering it would diminish those problems.”
The preprint cites empirical results to support its use of the 1:10 prior odds. If that is in fact a reliable estimate of the prior odds for the reference class of previous studies, then, in the absence of other relevant information, it would be reasonable to use as input for Bayes’s theorem.
John Byrd asks, “Is 1:10 replicable?” Is it important to ask whether a 1:1 prior odds can be rejected at the 0.005 significance level?
I’d just add that the point null prior doesn’t seem to be an essential part of my argument. Valen Johnson come up with quite similar estimates of false positive risks using his UMPBT priors.
Sure butI have the same objections to it. See pp 361-70 of SIST. Or search this blog “why significance testers….” Aside from not having those priors, nor the dichotomy, the result readily exaggerates the evidence in favor of the alternative vs which the test has high power.
I’m pleased to see the objections to newest .05 to .005 proposal – some can be found in the 2014 letters in PNAS on Johnson’s 2013 article proposing stricter thresholds.
It’s especially good to see objections to using thresholds instead of just giving the P-value directly to the reader without reference to a threshold, which I have the impression both Cox and Lehmann preferred for basic research reports. Presumably Fisher would wanted those as well for the purposes of combining evidence and we should want them now to see how they cluster near magic thresholds.
Perhaps the most pernicious aspect of thresholds is the publication biases they create (both in terms of whether the results are published and disseminated widely, and the subsequent bias among estimates from those that are published). These biases worsen as the threshold drops. We need complete reporting of what was examined, how it was examined, and how it came to be examined (the trail through the garden of forking paths as Gelman and Loken put it), not statistical methods for intensifying selection bias.
Plus, thresholds feed delusions that a single statistic from a research study should be the basis for some discrete press-release soundbite such as “discovery”, “confirmation”, or “refutation”, with no attention to methodologic problems (such as model-specification error) or other poorly quantified sources of uncertainty that require deep contextual knowledge to evaluate. Finally as Mayo mentions almost all testing focuses on taking a null (independence) assumption as the test hypothesis, when in fact Neyman knew that for some stakeholders or contexts testing the null would make no sense.
Two new articles on these and related problems are:
Amrhein V, Korner-Nievergelt F, Roth T. The earth is flat (p > 0.05): significance thresholds and
the crisis of unreplicable research. PeerJ 2017;5:e3544; DOI 10.7717/peerj.3544
Greenland S. The need for cognitive science in methodology (in press, American Journal of Epidemiology).
I would attach these here but don’t see how so will supply them on request.
In mine I promote the idea that the surprisal transform of P-values to bits of information against the tested hypothesis, -log_2(p), could help avoid the misinterpretations arising from dealing directly with p as a probability (whether or not one uses thresholds).
Anyway, in light of the deeper issues it is shocking to me is to see some of the names on the new proposal. Even more interesting would be to see who was approached but did not sign (I will hazard a guess that Gelman and Rubin were among them).
Sander: Very grateful for your comment. I agree with your sentiments, including:
“Even more interesting would be to see who was approached but did not sign”
Maybe those of you who are good with “missing data” can reconstruct who might have been invited: Come forward all ye non-signers.
Curious about your argument as to why this will exacerbate the problem of publication bias–just, supposedly, harder to get through. Increasing the power, as they recommend, might pick up on smaller population effects.
Mayo: Thanks for inviting me…
To answer your question about why lowering the reporting cut-off would generate a smaller, more biased literature: Consider the extreme in which P>alpha becomes a missing-data (results-censoring) indicator, with perfect randomized experiments large enough to produce unbiased normally distributed estimators. Under the null the expected P at alpha=0.05 becomes .025 since reported Ps become uniform on (0,0.05]. The estimates become correspondingly inflated in absolute magnitude since they must be at least 1.96 standard errors from the null. Often – usually in my field – there are strong directional expectations, or one direction is of overwhelming practical concern and the other isn’t. Estimates in the opposite (“wrong”) direction may then go unreported even if P<0.05; in the extreme the current two-sided standard then becomes unstated one-sided 0.025 for alternatives in the prior direction (I think Stephen Senn has often pointed this out). The consequence is that reported estimates are now inflated in the expected direction.
Of course increasing power is good in theory but it will not eliminate reporting bias, and will make the biased results seem more reliable. Then too, the expense of taking in more subjects may lead to cutting corners in data quality. It will also mean fewer studies will get done by fewer groups (more will fail to get adequate funding). There are reasons for expecting harms from the resulting resource concentration and the reduction in independent replication.
My favorite whipping boy (which Stan may share) is nutritional epidemiology, in which false hypotheses became clinical dogma thanks to the strong influence of large prestigious groups doing enormous, expensive studies. Such groups become the dominant sources of accepted 'facts', and ostracize scientists as 'cranks' for straying too far from their orthodoxy, while becoming quite inventive in interpreting data to fit their own dogma. The fat vs. sugar controversy is shaping up to be a case study which, by some accounts, this kind of error amplification cost untold billions in the form of the obesity and diabetes epidemics that followed the triumph of the low-fat school in the 1960s and 70s (whose dogmas have taken a half-century to unseat, reflecting the time and expense of large diet and nutrition studies).
Thus requiring larger studies with lower thresholds may well amplify important biases even if it reduces certain random errors. So I argue that we need instead complete reporting of methods and data, with conclusions (inferences) reserved for meta-analyses and pooled analyses. But then, my stance goes against real research incentives to maximize funding and publicity (e.g., for having supplied 'important' findings), so I do not expect it to be embraced by researchers!
Judges have to decide what counts as evidence and bright lines make judging such things easier. According to Google Scholar there have been over 400 opinions/decisions published since the ASA statement that contain “statistically significant” or its variant, and not one mentions the statement. Instead they tend to contain pyrite like this: ” … if p is very small, something other than chance must be involved.”
Death penalty cases, securities class actions, property tax disputes, you name it; all are turning on “statistical significance”. To provide evidence of causation a stock price drop following bad news must be “statistically significant”. Some statutes include the phrase (without defining it) and depending on where you live you might be mortified to learn how the tax assessor goes about figuring your house’s value. Confidence intervals have been swept up too. Since “we can be confident” that a defendant’s real IQ isn’t too low because a number less than 70 isn’t within the confidence interval for his IQ it’s ok to execute him. Etc.
It would be better of course if judges and legislatures could be made to understand the niceties of statistical methods but it ain’t gonna happen ’cause they ain’t that smart and they love, love, love bright lines. So please consider the p p<.005 proposal from this perspective: it should at least reduce the harm (the loss of life, liberty and property) currently being done by "statistical significance".
I don’t know what evidence you are using to support your claim that ‘the p<.005 proposal should at least reduce the harm (the loss of life, liberty and property) currently being done by "statistical significance".' In a rare unanimous decision the US Supreme Court ruled that statistical significance was not a prerequisite in proving causation. Where is a study with empirical evidence of the costs and benefits of any cut-off in other judicial situations?
Any universal cut-off will cause great harm in some applications even it helps in others. As Byrd points out, even Fisher knew better than to call for a one size fits all cut-off, and as I have emphasized, so did Neyman (I'll bet that Mayo could attest the same for Egon Pearson). And this insight does not even address the issue of whether or where we as scientists should or should not be using cut-offs: From purely statistical calculations we can say that, whatever their (controversial) benefits, using cut-offs causes inferential biases and entails implicit utility judgments (loss functions), see the citations I gave above and others, e.g.,
Fiedler, K., Kutzner, F., and Krueger, J. I. (2012). The long way from error control
to validity proper: problems with a short-sighted false-positive debate. Perspect.
Psychol. Sci. 7, 661–669. doi: 10.1177/1745691612462587
See also Krueger's excellent post on the .005 proposal at his Psychology Today blog:
https://www.psychologytoday.com/blog/one-among-many/201707/fear-false-positives
These stakeholder utility issues do not arise in machine-learning uses of cross-validated branching and selection rules based on cut-offs, which only underscores the radical difference context and application makes in these matters.
Sander: First, let me say I have no idea how your (and some other) comments were held in moderation. Knowing I was leaving town, I took the moderation requirement entirely off, so I wasn’t alerted to any new comments waiting, and only discovered some when you called it to my attention today. I’m so very sorry.
Now this doesn’t speak to your main point, but the Supreme Court did not rule on whether statistical significance is needed for causal inference (even though, obviously, it is not since, for starters, in fully controlled fields, you wouldn’t even need to be doing statistical experiments to infer causes). But it’s very distorting to suppose that what really transpired in that court case had anything to do with making a judgment about evidential standards. In a nutshell: If no statistical analysis has been done, but anecdotally it seems many patients had a certain side effect when taking a drug, then, GIVEN the drug company just reported they see a very profitable quarter coming up, THEN it’s not warranted for the company to say they needn’t have reported the information on side-effects to shareholders (despite knowing that the mere info would likely alter stock price) SOLELY on the grounds that no statistical analysis (and hence no statistical significance) was available.
By Schachtman:
https://errorstatistics.com/2012/02/08/guest-blog-interstitial-doubts-about-the-matrixx-by-n-schachtman/
By Mayo:
https://errorstatistics.com/2012/02/08/distortions-in-the-court-philstock-feb-8/
Mayo:
You said “the Supreme Court did not rule on whether statistical significance is needed for causal inference”. That is what corporate-defense lawyers like Schachtman want us to think.
I say: On the contrary, the Court most certainly did rule that statistical significance is NOT needed to infer causation, and to claim otherwise is to buy into obfuscations concocted to cover up the biggest court loss ever suffered by null-hypothesis significance testing (NHST). (Disclosure: I was a plaintiff consultant in a different zinc-anosmia case.)
Note that I say that with no sympathy for the Bayesian view of the decision by Kadane that you and Schachtman cite, but rather for the blow against rigid rules in scientific inference that it represents (and note well I do not conflate or condone the widespread confusion of scientific with statistical inference, as exemplified by Howson & Urbach’s book).
Let readers see what they think the court meant with these quotes from the unanimous Court opinion in the Matrixx case (Matrixx Initiatives, Inc. et al. v. Siracusano et al., 131 S. Ct. 1309, No. 09–1156. Argued January 10, 2011—Decided March 22, 2011) upholding the appeal decision of the Ninth Circuit Court:
“A lack of statistically significant data does not mean that medical experts have no reliable basis
for inferring a causal link between a drug and adverse events. … We note that courts frequently permit expert testimony on causation based on evidence other than statistical significance.… It suffices to note that, as these courts have recognized, ‘medical professionals and researchers do not limit the data they consider to the results of randomized clinical trials or to statistically significant evidence.’ ”
– for those who want the real context and full ruling, a full searchable PDF of the decision is available at https://www.supremecourt.gov/opinions/10pdf/09-1156.pdf
But let me repeat the first quoted sentence with emphasis: “A lack of statistically significant data does NOT mean that medical experts have no reliable basis for INFERRING A CAUSAL LINK between a drug and adverse events.” And you say the Court did not rule on whether statistical significance is needed for causal inference? Really???
Sander:
As I noted, it goes without saying that there are much stronger grounds possible for causal inference, so statistical significance cannot be necessary–so it is a truism, but this truism is not what was decided in the court case (In fact the truism was only mentioned as a side point to the main issue.). The case had entirely to do with whether the company was correct to claim it didn’t have to mention side-effects that were not part of a statistical trial to shareholders solely because they were anecdotal and not part of a statistical trial (and so couldn’t be stat significant). The case had nothing to do with how to appraise evidence. And, by the way, had the company not gone ahead and told shareholders they saw a great quarter looming, the Co would have been correct in not having to say anything one way or another about side-effects. I’m totally on the side of requiring side-effects to be reported to shareholders (even though that’s not the law), my point is simply that the case shouldn’t be distorted. It’s been awhile since that discussion, and I gave links before. Since it would be rather cumbersome to repeat all the details, I recommend interested persons read those links.
You said “statistical significance cannot be necessary–so it is a truism, but this truism is not what was decided in the court case (In fact the truism was only mentioned as a side point to the main issue.).” I say: Wrong again! You are repeating the misleading defense spinning of the court’s opinion, instead of going straight to the opinion itself:
The decision revolved entirely around whether statistical significance is necessary for inference and thus for reporting. The ‘truism’ that statistical significance is unnecessary is in fact what is contradicted constantly by defense arguments and defense-expert reports, which routinely claim statistical significance is a necessary condition for inference as part of “the scientific method” and thus for reporting obligations.
Furthermore, it is a distortion to claim that the main issue was about whether there was a ‘statistical trial’ rather than statistical significance. Here’s a quote from the introductory passage of the court’s opinion:
“This case presents the question whether a plaintiff can state a claim for securities fraud under §10(b) of the Securities Exchange Act of 1934…based on a pharmaceutical company’s failure to disclose reports of adverse events associated with a product if the reports do not disclose a STATISTICALLY SIGNIFICANT number of adverse events.” (emphasis added)
– Note well the term used here and in most of the opinion is statistically significant, NOT ‘statistical trial.’ And it’s clear the court knew damn well the difference: The term ‘statistically significant’ appears over and over again from start to finish, whereas the term ‘statistical trial’ never appears at all. Even the term ‘clinical trial’ appears only twice outside of quotes from the defense: once in an Amici quote referring “to randomized clinical trials OR to statistically significant evidence,” and once in “ethical considerations may prohibit researchers from conducting randomized clinical trials to confirm a suspected causal link for the purpose of obtaining statistically significant data.”
Bottom line: Your truism is no truism in the courts, and it is a misrepresentation of the case to claim the non-necessity of statistical significance was a “truism was only mentioned as a side point to the main issue” or that “The case had entirely to do with whether the company was correct to claim it didn’t have to mention side-effects that were not part of a statistical trial to shareholders solely because they were anecdotal and not part of a statistical trial.”
This is not sound reason. Courts/Judges rely on expert conclusions drawn from rigorous methods. Experts should not commit to a bright line because a Judge can understand it easier. This is not why they invite expert testimony.
As to your examples, they really are not about statistics.
Thanatos, you make some valid points, but I do not see them as being particularly compelling in the realm of statistical support of scientific inference. The thing is that most scientific inferences do not need to be dichotomous in the way that your examples are.
I also find your suggestion that we accept the lowered threshold in order to reduce harm as being a little blinkered. A reduction in threshold may reduce harm in the abstract world, but in the real world it leads to a substantial reduction in the power of experiments. The proposed benefits of a lowered threshold do not come without costs.
Thanatos: No it will not reduce the harm. The argument upon which it is based is a Bayesian one, so if you buy this argument (for lowering p-values), you buy the Bayesian way of assessing statistical methods and inference–at least the ones advocated by the leaders of variants of this argument. Since these methods do not prohibit things like optional stopping or other gambits that alter error probabilities, and are prohibited by error statisticians (unless there are appropriate adjustments to p-values or at least making it clear the selections have occurred), they would be allowing much more serious infractions to continue unabated.
I do remember being very hopeful when working on quality scoring guides for randomized clinical trails in the late 1980,s. That they would help folks do better trials and journals reign in author’s omissions regarding weaknesses in their trials. Instead, they likely just mostly helped authors re-write the faulty trials they had already done by providing a list of good things to claim they had done.
Now, John Tukey did warn statisticians to not change the way people do science based on mere technical knowledge – and this paper does seem to ignore that warning.
For instance “Results that do not reach the threshold for statistical significance (whatever it is) can still be important and merit publication in leading journals if they address important research questions with rigorous methods. This proposal should not be used to reject publications of novel findings with 0.005 < P < 0.05 properly labeled as suggestive evidence." seems naive of journal publication reality.
Now, I find Peirce’s three grades of a concept to be helpful in thinking about this, the ability to recognise instances, the ability to define it and the ability to get the upshot of it (how it should change our future thinking and actions). For, instance in statistics the ability to recognise what is or is not a p_value, the ability to define a p_value and the ability to know what to make of a p_value in a given study. Its the third that is primary and paramount to “enabling researchers to be less misled by the observations”. It also always remains open ended.
Almost all the teaching in statistics is about the first two and much (most) of the practice of statistics skips over the third with the usual, this is the p_value in your study and don’t forget its actual definition. Now in this paper they do acknowledge that to some degree context matters but I do think they will just end up setting a new bright-line for folks to chase and they will often do so inappropriately.
The statistical discipline needs to take some accountability for the habits of inference they _set up_ in various areas of scholarship by the methods/explanations they recommend.
Keith O'Rourke
Phan: Peirce is always endlessly wise.
Mayo,
Sander Greenland has given the propaganda line of what the lawsuit industry (plaintiffs’ bar and its hired expert witnesses) takes to be the “ruling” of the Siracusano case. Greenland’s opinion reflects a profound ignorance of legal method and what is, and what is not, the holding (or ruling) of a case.
Courts, unlike legislatures, decide only the cases before them, on the records of those cases. They may articulate general rules, but the holding (or ratio decidendi) of a case turns on the material facts before the court.
So here is the shocker. There were no facts of record in Siracusano. The trial court dismissed the case because the plaintiffs failed to ALLEGE statistical significance as part of their ALLEGATIONS of causation. This is because the plaintiffs claimed securities fraud, and fraud must be PLEADED with particularity.
The intermediate appellate court reversed the dismissal because, as it HELD, you need not plead CAUSATION. The case then went up to the Supreme Court, which affirmed and HELD that you do not need to plead causation in support of a securities fraud case.
The Supreme Court’s decision was unanimous; the case was and is very clear. I would think non-lawyers could get this but apparently not. What the Supreme Court saw was that a company could commit security fraud, within the meaning of the relevant statute, by doing the following things (which the plaintiffs did successfully allege): (1) receiving reports of individual cases of anosmia following use of product, Zycam; (2) being told by some physicians that they thought there might be a causal relation; (3) making bullish projections to the shareholders and the market; and (4) and not disclosing that there MIGHT be a causal relationship.
So, what Greenland has failed to disclose to you, is that the Supreme Court in Siracusano held that causation was not necessary at all, at the level of Zycam and anosmia. (The holding of a case turns on the facts that are necessary to the legal outcome.) In the case, the FDA had already issued an order that the Zycam be withdrawn as formulate (with the zinc salt). Why was causation not necessary? Well, the FDA can order the withdrawal, as it did, WITHOUT adequate evidence to support a causal conclusion. Zycam was an over-the-counter product, the efficacy of which had not been established in a submission to the agency; there were other products available; it did not address a life-threatening malady.
What happens when causation is not required? Well, the issue of whether statistical significance is somehow necessary to establish causation goes out of the case, completely and unambiguously. That doesn’t mean that the Court did not comment on it as suggest; Justice Sotomayor’s opinion waxes on (in a very silly way) about the issue, but what she wrote has a name in legal parlance: obiter dictum — things said along with the way. Dictum is not the ruling in a case; and it is certainly not the holding of the case. Greenland has been involved in a sufficient number of lawsuits that he might have known this. The lawsuit industry certainly knows it, but they forgot since law school. Or maybe they didn’t show up that year in law school.
Now I do agree that the defense was quite misguided in raising the statistical significance issue as a matter of the allegations. If all that existed were case reports, without a quantification of incidence rate or whatnot for these case, against some expected value, it made no sense to talk of statistical significance. I made this point very shortly after the Supreme Court decided the case (much to the chagrin of my cousin who is a partner at O’Melveny, which handled the defense), and David Kaye made the point as well in an article he wrote for the BNA.
Subsequently, the actual factual issue of causation (anosmia by Zycam) has been litigated. Mostly courts have thrown out these cases are entirely speculative.
So statistical significance was never a legitimate issue in the case; and the claimed need for such significance went out of the case when the Supreme Court said that factual causation between Zycam and anosmia was not at issue.
Some statisticians, who want to misrepresent the actual meaning of the Siracusano case, have quoted words, not rulings, not holdings, to support their own views about statistical significance. I assure you that Justice Sotomayor, and probably all the other justices, with the possible exception of Justice Breyer, have no idea what any of this means.
Furthermore, to show how inane some of the dictum is; consider this. Justice Sotomayor cited three lower court cases in support of her claim that statistical significance has not been required in earlier cases to show causation. Two of the three cases involved what lawyers call differential etiology, where the general causation is conceded (this exposure admittedly can cause this disease), but the issue was whether the exposure caused this particular plaintiff/claimant’s disease. Statistical significance NEVER came up in those two cases. In the third case, Wells v. Ortho, statistical significance did come up. The case, which is one of the most notorious epidemiology cases in the federal system, involved a claim that Ortho’s spermicidal jelly causes birth defects. The plaintiffs’ expert witnesses in that case relied upon various studies, some of which did have statistical significance (at least nominally, because there were multiple outcomes, without adjustment). The trial judge, sitting as the finder of fact, found for plaintiff. On appeal, the judges of the 5th Circuit wrote that you did not need statistical significance, but their law clerks (and they) had clearly failed to read the record; because there were statistically significant studies relied upon by plaintiffs’ expert witnesses. (The problem was that these studies looked at arsenical spermicides, not the spermicide made by Ortho. The problem was external validity.)
Sorry to go on so long. Greenland has given his hipshot about what the Matrixx Initiatives v Siracusano case means in various reports he has written for plaintiffs’ lawyers. He of course is entitled to his opinion, but it is not based upon fact or a correct understanding of the case.
Nathan
Nathan:
“Some statisticians, who want to misrepresent the actual meaning of the Siracusano case, have quoted words, not rulings, not holdings, to support their own views about statistical significance.”
Yes, this is a good point, one can find words and side remarks and that’s quite different from what this case was all about. It’s fascinating how roles people take on as expert witnesses intertwine with their statistical philosophies–entirely understandable. I haven’t revisited this case, but I wonder if maybe the way it has been used (and abused) has itself changed its significance for significance testing.
Glad to have finally flushed out Schachtman, whose blog did not allow my critical dissenting comments back when this case first hit. Nice to see him insult the intellect of the Court too, using standard legal obfuscation of the fact that the Court is entitled to consider science, ordinary logic, and common sense outside of that legal framework to form and justify its ruling – that reasoning is what composes the bulk of the opinion I linked. Go read it and see what you think without the smokescreen offered by Schachtman.
Talk about propaganda: Schachtman’s commentary is a skilled attempt by a master legal expert to persuade lower courts and observers that the defense industry’s worst nightmare did not become a reality. Maybe Schachtman really believes his denial, but this DID happen: The U.S. Supreme Court presented a powerful opinion weighing against one of the major litigation-defense industry strategies. The strategy has been to argue vigorously that statistical significance is essential for scientific (not legal) inference of harms, and then use their statistical experts to whittle away any such significance using any means, such as claiming the standard is not met because only a minority of studies had P<0.05 – never mind that it may have been far more than 5% of the studies, or that the combination had P<<0.05. I could go on at length (and have elsewhere) about the strategies I've seen.
None of this should be surprising given that the job of lawyers is to win their cases, not reveal the truth. To expect intellectual honesty from them is as absurd as expecting it from our politicians (who are by no coincidence mostly lawyers). But the main point all readers should bear in mind is that the issue here is corruption of science and scientists in service of lawyers – whether plaintiff or defense. In particular, there is a sewage backflow of cartoonishly wrong descriptions of inference tailored by hired experts to suit their team, then taken up by the same experts as dogma in classrooms and lecture halls as well as courtrooms.
Case study: The public-health community wrung their hands in shock when it came out how the beloved and newly deceased Patricia Buffler had offered dubious defense testimony (http://www.epimonitor.net/PrintVersion/Jan14/Jan-Feb_2014_The_Epidemiology_Monitor.pdf), yet 10 years before I had documented exactly that in an article invited by and then rejected by the American Journal of Public Health (based on one review by an anonymous but clearly impugned and powerful expert witness), which the Wake Forest Law Review picked up when they found out that had happened (Greenland, S. 2004. The need for critical appraisal of expert witnesses in epidemiology and statistics. Wake Forest Law Review, 39, 291-310; see also Schachtman review at http://schachtmanlaw.com/1892/).
A good question Schachtman raises implicitly is: How is it when I appear as a named expert it is for plaintiffs? The reason is simple: I'm approached routinely by both plaintiffs and defense. If the case sounds interesting enough to take and I have the time to take it (which is less often than not), I state my terms: I have a right to form and publish my own opinions on the science of the matter, regardless of which side it helps or hurts, with no obligation to provide such articles for review by the hiring party, and that my goal in participating is not to argue for or against causation, but rather to upgrade the quality of scientific testimony on all sides and aid the court in understanding the science and the problems in expert reports.
Guess what happens when I lay that condition down? Every single time, defense runs away like a cockroach fleeing light; whereas almost every time, plaintiff lawyers say "Yes, that's wonderful!" (the exceptions being when they say "sounds fine but I have to run that by my colleagues," who then assent). Guess what that tells me about my "esteemed colleagues" who show up as the defense experts? That either they made no such demand, or else agreed to make no such demand. Thus I've come to regard the side on which one shows up as at least a weakly informative integrity predictor.
Too bad the defense is so gun-shy of honesty and transparency, as there are plaintiff expert reports as bad as any defense report I've seen (albeit in a different way, of course) that I would have loved to rip. And some plaintiff lawyers have gone on to regret hiring me: after one deposition in which the cross-examining defense lawyer shook my hand effusively, the plaintiff lawyer called me "a loose cannon" (a label I wear proudly); after another depo in which the defense made me aware of truly questionable plaintiff counsel claims, I quit the case.
So I get approached less by both sides these days, which suits me just fine – I don't need the money and don't have that much spare time. But if you see me show up you can bet that I will be ripping into the mischaracterizations of science and statistics that plague expert reports. This problem is systemic: Experts are hired by parties invested in the final conclusions, when they should be hired by the court (there is even a mechanism in place to do so at the federal level although apparently it is used too rarely, and then only as an adjunct to the usual experts). Again, I urge any reader who wants to understand what is going on to examine actual expert reports and court opinions, and contrast those to what is reported by me, Schachtman (or Mayo acting as his echo here), or any secondary source – that is, calibrate the latter secondary reports against data to understand the reliability of the reporters versus the realities of the case.
Sander: Now that I take it we’ve moved beyond the Matrixx case (which I know about from my other life in trading, and whose ruling I agree with entirely, quite aside from the highly flawed construal by Ziliak and McCloskey and others), we can talk about what’s really at issue. By the way, I’ve disagreed with Schachtman on the Harkonen case for years,* despite his great legal acumen, but I understand his taking the Matrixx case as actually helping the bad guy in the Harkonen case.
So the real issue: You wrote: “Experts are hired by parties invested in the final conclusions, when they should be hired by the court (there is even a mechanism in place to do so at the federal level although apparently it is used too rarely, and then only as an adjunct to the usual experts).”
That would be interesting, and seems right–assuming the goal is to get at the truth, as opposed to entitling the accused to any and all defenses, which I thought was how our legal system works.
*Best not to reopen it, but you probably know it already.
The system does indeed “entitle the accused to any and all defenses”, including, it seems, misrepresentations of science and of the consensus within it: I’ve noticed a gross asymmetry in cases in which defense experts will be allowed to get away with misrepresentations (e.g., cherry picking) while plaintiff experts will be thrown out for the same misbehavior. This makes our American legal environment (like our political one) quite hostile to valid science and its noble avowed goal of truth finding. And it has resulted in many distortive reports, with the biases in those creeping into scientific presentations by their authors (see a methodologic case study in Greenland, S. 2012. Transparency and disclosure, neutrality and balance: shared values or just shared words? Journal of Epidemiology and Community Health, 66, 967–970). I’ve been trying to change that in what little way I can by outing some of the bad expert reports I’ve seen, and by pushing for more extensive critical review of reports before filing.
But in the end only the courts have the power to do much about the situation. I have always said that scientific reports for litigation should be held to journal peer-review standards, but the courts lack the expertise for that. Having court-appointed experts for review seems a great idea but has not taken off (perhaps because of the procedural burdens for the court of having to locate and compensate experts acceptable to both sides). Another way courts could improve submissions is to make clear that all scientific reports will be posted as PDFs on public websites for peers to see, something that the court clerk could do. The resulting pressure for quality could be enhanced by requiring extensive keywords from the topic at the start of the report, so that it comes up on searches by scientists working in the area. Perhaps even comments (and responses) could be allowed, as in a blog.
As for the Harkonen case, it was my impression that there really was much more to the conviction than mere misinterpretation of statistics, as described here for example:
https://law.stanford.edu/2013/10/07/lawandbiosciences-2013-10-07-u-s-v-harkonen-should-scientists-worry-about-being-prosecuted-for-how-they-interpret-their-research-results/
Am I missing something in that view?
Sander: Much more in the sense that Harkonen was trying to argue that, since there was disagreement about multiple testing, you couldn’t really find him culpable? I agree with the article you linked and, of course, SCOTUS did not agree with him. Here’s a blogpost of mine on it.
https://errorstatistics.com/2012/12/17/philstatlawstock-multiplicity-and-duplicity/
When I was a very young lawyer, I expertly skewered the nation’s leading courtroom epidemiologist (who’d been flown in by chartered Concorde) in open court, thanks to my Dad (a trial lawyer far better than I’ll ever be), who helped craft my cross-examination. Afterwards he (the expert) called and asked to meet.
We had gin and tonics at the Tremont in Galveston (where he and the lawyer by whom he was employed had just won a $100 million verdict for a single plaintiff in another case – 25 years ago when $100 million was considered to be a lot of money for one person for a week’s work). During the course of the conversation he said that I should be a plaintiffs’ lawyer as, he thought, I’d do well at it.
I asked on what basis he believed that he could perceive the truth of the matter on causality. He replied that (paraphrasing) “we always knew these chemicals caused cancer, but it wasn’t until the Mantel-Haenszel paper that we could show we had evidence that ~ tetra-chloro-death ~ caused cancer.” That launched me on a decades-long quest to understand – ‘Why would you believe such a thing?’
Why should rooting about in a bucket of numbers, rather than conducting experiments, give a keener insight into causality? He had admitted that he had no empirical evidence for his claim. It was confirmation of his bias that lay at the heart of the matter. It took me years to understand.
My eventual conclusion was that a lot of young men, steeped in the belief that we were (when we were young) in the midst of an epidemiologic transition, grasped at every straw that seemed to confirm our belief that man, and not microbes, or viruses, or some emergent phenomena, lay at the heart of humanity’s suffering. I understand now. We fight the dragons that we can conceive; not those we cannot. The rest is just fooling yourself.
But I have great respect for you and Rothman, for yours (or maybe his alone at the time, it was years ago) was the first tome I devoured on my quest. If I hire such an expert, know this: if you say my client’s methods are poor and I do not subsequently hire you, it is not because I don’t like the answer, but rather because I have paid the Piper and my need has dissipated.
Mayo,
Yes, error can be significant, too. And the Matrixx Initiatives v Siracusano case has done some mischief. I can point to a few cases in which state or federal judges were snookered by the overzealous reading of the case, urged by Sander Greenland and others. But I think more recently, the courts have taken to heart a few obvious points.
First, if causation is not necessary, then whether statistical significance is needed for causation becomes irrelevant, and outside the holding. That was my point in the rant above.
Second, the Court itself noted that it was not addressing what the requirements of Federal Rule of Evidence 702 were. This is the rule that trial courts use to evaluate the admissibility of expert witness opinion testimony. Recall that this could not be involved in Siracusano because there was no evidence at all, only allegations.
Third, we see in some recent cases that the courts have tried to articulate exactly what role statistical significance should play in cases in which epidemiology is essential. The US Court of Appeals struggled with this issue in In re Zoloft, but came to the manifestly correct conclusion that the trial court had appropriately booted statistician Nicholas Jewell and epidemiologist Anick Berard for some dubious opinion testimony. In those birth defect cases, there were some dodgy studies with statistical significance, such as Berard’s study on sertraline/Zoloft, but even Jewell could not abide that study, and opined that he had no confidence in it. Instead, Jewell criticized a meta-analysis he had relied upon in other cases (involving fluoxetine/Prozac), and did his own meta-analysis of only two of many studies, without any coherent explanation for why he left out the overwhelming majority of studies on the subject.
I would be the first to agree that there are biological causal pathways identified without epidemiology (as in virology and microbiology, etc.), but for many cases that come up in the judicial system today, there is no identified causal pathway, with biomarkers, or epigenetic markers of high penetrance, etc., etc. All we have are associations, which may or may not qualify in our view as causal associations under some heuristic such as Sir Austin Bradford Hill’s considerations. In such cases, I am hard pressed to think of an example of causality accepted by the scientific community without multiple statistically significant studies. Take a gander at the WHO/IARC website and look at the Group 1 agents (the so-called known carcinogens) for any counterexamples. I don’t think you will find any, but I am prepared to be corrected.
But back to your point. The nuances of the ASA statement are often lost on judges, who would prefer to have a bright-line test. Sometimes the courts have noted that there is no bright line, only to get mired in analytical nihilism.
Nathan
Nathan:
Interesting that you mentioned Jewell and Berard re Zoloft. In that case the trial judge had bought into defense mischaracterization of significance testing hook, line and sinker, and so the plaintiff counsel asked me to write a report explaining the concepts correctly and identifying the mischaracterizations. But this was after their lead expert reports had been filed… It turned out the Court’s misunderstanding of tests wasn’t the pivotal problem: As you noted, among other dodgy things Jewell (as well as Berard) did some inexplicable (except as plaintiff bias) cherry picking of studies for his meta-analysis (his report was one I had in mind in my earlier comments about bad plaintiff reports). Such meta-selection bias is the most common deceptive practice I see on both sides, and is one reason I have long argued against allowing exclusions in meta-analyses (apart from the rare study with no salvageable information). So when the Court threw Jewell out for the obvious problems I told plaintiff counsel I would’ve done the same, and excoriated him for not having someone like me critically review reports before filing.
I have since advised lawyers on both sides to hire critical reviewers to rip into their own expert reports before filing, rather than waiting for the other side to do it for them. Such pre-filing peer review might go a long way toward relieving the court of the burden of adjudicating on technical matters, as well as improve the quality of the reports. To encourage further quality improvement, the Court could inform both parties that all submitted scientific reports will be posted on a public website, and do so (subject of course to blanking as necessary to conform to confidentiality agreements).
As for requiring “multiple statistically significant studies”, as far as basic testing theory goes that is just fallacious (other reasons might be invented for the requirement but they would fall outside that theory and I know of none that are generally accepted) – especially in fields with chronically underpowered studies, like postmarketing surveillance (where sample size is usually fixed by the available data, rather than being a design feature). I give examples showing why in my own report in the case; quoting the latter:
“it is fallacious to attempt to judge the statistical significance of multiple studies based on the significance of the individual studies (e.g., see Greenland, Epidemiologic Reviews 1987). Consider a simple example: If there were 20 studies and the null was true for all of them, we would expect 1 of the 20 to be significant at the 0.05 level (i.e., P<0.05). Suppose however 4 of the 20 studies were significant at the 0.05 level. Most writers [especially defense expert witnesses] would interpret this evidence as favoring the null, since most (80%) of the studies were not significant. But that conclusion would be entirely wrong: It turns out that [using the binomial formula] the exact P-value for getting 4 significant results in 20 when only 1 is expected is 0.03, significant at the 0.05 level. Such examples show that merely counting significant results can be highly misleading, in that it will often miss overall significance.
Worse, it may be that not one individual study is significant, when overall the totality of results is significant. As another simple example, suppose there are only 3 studies, each with P = 0.10, so none is significant. Then, using the classic Fisher formula for combining P-values (Cox & Hinkley, 1974, p. 80; Greenland, 1987, p. 18), the total chi-squared for the 3 studies is -2×3×ln(0.10) = 13.8 on 6 degrees of freedom, so that the overall P-value for getting 3 studies with P = 0.10 is 0.03. Thus the overall P-value is significant at the 0.05 level even though not one of the individual studies is significant at that level. Furthermore, the general formulas cited above for combining study significance show that the disparity between overall and individual-study significance expands as the number of studies increases; for example, with 5 studies each with P = 0.10, the total chi-squared is -2×5×ln(0.10) = 23 on 10 degrees of freedom, so that the overall P-value for getting 5 studies with P = 0.10 is 0.01. Such examples show again that judging overall significance based on individual-study significance is extremely misleading, in that it will likely miss overall significance.
The deceptiveness of basing literature evaluation on individual-study significance is analogous to judging whether a coin is fair by asking 10 different people to toss it twice. Suppose each person reported back to you that they got two heads, and so saw nothing unusual, since there is nothing unusual about getting two heads in a row. But taken as a totality, we would see they got 20 heads in 20 tosses, which would almost certainly indicate something unusual is going on."
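For readers who want to check the arithmetic in the quoted report, here is a minimal sketch (not part of the report itself) that reproduces the calculations under the stated assumptions. It presumes Python with scipy, and the helper function fisher_combined is introduced here purely for illustration.

```python
# A minimal sketch reproducing the calculations described above, assuming
# scipy is available; numbers are rounded roughly as in the quoted report.
from math import log
from scipy.stats import binom, chi2

# 1) Counting significant studies: 20 independent null studies, each with a
#    5% chance of yielding P < 0.05. How surprising are 4 or more of them?
tail = binom.sf(3, 20, 0.05)          # P(X >= 4) under the null, ~0.016
print(f"P(4 or more significant out of 20 under the null) = {tail:.3f}")
# Doubled for a two-sided value this is roughly 0.03; either way it is below
# 0.05, even though 80% of the individual studies were "not significant".

# 2) Fisher's method for combining P-values: chi2 = -2 * sum(ln p_i) on 2k df.
def fisher_combined(p_values):
    stat = -2 * sum(log(p) for p in p_values)
    return stat, chi2.sf(stat, 2 * len(p_values))

for k in (3, 5):
    stat, overall_p = fisher_combined([0.10] * k)
    print(f"{k} studies, each P = 0.10: chi2 = {stat:.1f} on {2*k} df, "
          f"overall P = {overall_p:.3f}")
# -> 3 studies: chi2 = 13.8 on 6 df, overall P ~ 0.03
# -> 5 studies: chi2 = 23.0 on 10 df, overall P ~ 0.01

# 3) The coin analogy: two heads in two tosses is unremarkable (0.25), but
#    20 heads in 20 fair tosses has probability 0.5**20, about 1 in a million.
print(f"P(20 heads in 20 fair tosses) = {0.5**20:.1e}")
```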
Mayo, Sander,
There is too much swirling around in multiple responses for me to respond to everything (and I doubt that you would want me to), but I will try to be responsive a bit.
1. I don’t have comments enabled on my blog because I often do not have time to moderate, and under the Bars’ disciplinary codes, I am responsible for what people post. I have made an open invitation to people to submit a response if they feel so moved.
2. As for insulting the Court in Matrixx, well, it is a time-honored tradition to criticize published opinions, but no one here, or anywhere, has rebutted my analysis. It is not insulting the Court to point out (a) the difference between holding and dictum, and (b) that the Court misunderstood or misrepresented the cases upon which it relied. (As an aside, the argument about which I complained was lifted almost verbatim from the Solicitor General’s brief.) The Court held that causation is not required for the materiality prong of securities fraud, and I absolutely agree. And once causation is out of the case, there was no warrant to discuss the defense argument on the necessity of statistical significance. Sander, that is what defines the holding or ruling of a case; everything else is dicta. This is not legal obfuscation, but legal analysis. As for the three cases cited by the Court in support of its assertion that statistical significance is not required, no one has challenged my reading and interpretation of those lower court opinions. I have blogged about the Wells case at length, and about how it has been almost universally derided as seriously in error.
3. Sander, I am glad that we have found some common ground at least with respect to the methodological sloppiness of the plaintiffs’ expert witnesses challenged in the Zoloft litigation. As an aside, you may want to take a look at a post I wrote in defense of Jewell’s use of the mid-p for a Fisher’s exact test in his work as a plaintiffs’ expert witness in the Lipitor diabetes cases. Although there was much wrong with Jewell’s opinion in the Lipitor cases, I thought that the defense and the trial court had unfairly characterized his offering a mid-p analysis as a “second” post hoc analysis.
4. I think we also agree that science does not always survive the adversarial process. I have had the pleasure of working with Sander’s colleague Tim Lash in the welding-fume litigation, where the claimed outcome was Parkinson’s disease or parkinsonism, and the studies and causal analyses relied upon by plaintiffs and their witnesses were truly abysmal. The multidistrict court refused the defense challenges to plaintiffs’ expert witnesses, and the defense was forced to try two dozen cases. We won 21, and finally the plaintiffs decamped and went home. I am confident that court-appointed expert witnesses would have dispatched these claims earlier in the process. Alas, the transactional costs of court-appointed witnesses can be high, as we saw in the silicone breast implant litigation. So, yes, critical reviewers might well improve the quality of expert witnesses on both sides, but in the silicone cases, the IOM and the court’s own experts came down on the issues much closer to the defense experts than the plaintiffs’, and the cases evaporated.
5. As for the Harkonen case, I fear anything more I would say about the case would provoke Mayo unnecessarily, but along with Ken Rothman and Tim Lash, I have laid out my views in the amicus brief that we filed in the Supreme Court. I have also blogged and written elsewhere about the case at length, but I will say this much at least. The Stanford blog post pointed to above makes the point that the Supreme Court wouldn’t take the case because the jury had found the speech to be fraudulent and there is no First Amendment protection for fraudulent speech. True, true, but circular and immaterial. The Stanford author did not really examine whether the speech was opinion or fact, or whether it was even incorrect as a matter of fact. Our amicus brief set out to do this, along with the affidavits filed by Don Rubin and Steve Goodman in the trial court. The problem, legally, was that Harkonen’s counsel did not call anyone to rebut the 0.05 magic-number testimony of Prof. Fleming at trial. Rubin and Goodman weighed in, too late, on post-verdict motions and for sentencing. Dr Harkonen has been left to challenge whether his trial counsel violated the constitutional standard of care (as in the Sixth Amendment) by failing to call someone like Rubin or Goodman at trial. This is Harkonen’s last go at challenging his conviction, but I don’t see much hope for him.
Nathan
It was an example from law courts that finally persuaded me that ignoring the prior could result in ghastly mistakes. The Island problem shows very clearly that neither P-values nor likelihood ratios (or Bayes factors) are good ways to judge guilt.
See http://www.dcscience.net/2016/03/22/statistics-and-the-law-the-prosecutors-fallacy/
David. Here’s a post by Larry Laudan on why the presumption of innocence is not a Bayesian prior: https://errorstatistics.com/2013/07/20/guest-post-larry-laudan-why-presuming-innocence-is-not-a-bayesian-prior/. I don’t know if it’s relevant to you.
In the Island problem in the post to which I refer, the prior comes from the number of people on the island. It has nothing to do with the presumption of innocence. Of course it’s a somewhat artificial example, but a very telling one.
The fact that P(evidence | guilty) = 0.996, and that the likelihood ratio in favour of guilt was 250, would be enough to get the suspect executed in some parts of the world. The relevant result in this example is P(guilty | evidence) = 0.2.
This example alone is sufficient to persuade me that it would be rash, indeed morally indefensible, to ignore prior probabilities. The fact that we never know them in real life is why I advocate the reverse Bayes approach: calculate the prior that you’d need to believe in order to achieve a 5% (say) false positive risk.
(Link: 144337.full.pdf)
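To make the arithmetic concrete, here is a small sketch of the Bayes calculation behind the Island-problem figures above, plus one simple version of the reverse-Bayes idea. The island population of roughly 1000 and the 80% power used below are assumptions added here for illustration, not numbers taken from the comments.

```python
# A toy sketch of the Island-problem arithmetic and a simple reverse-Bayes
# calculation. The population size (1000) and the power (0.80) are assumptions
# supplied for illustration; they are not stated in the comments above.

def posterior_guilt(prior, p_e_given_guilty, likelihood_ratio):
    """Bayes' theorem for P(guilty | evidence)."""
    p_e_given_innocent = p_e_given_guilty / likelihood_ratio
    joint_guilty = prior * p_e_given_guilty
    joint_innocent = (1 - prior) * p_e_given_innocent
    return joint_guilty / (joint_guilty + joint_innocent)

# One suspect out of roughly 1000 islanders, with the evidence as quoted above.
post = posterior_guilt(prior=1 / 1000, p_e_given_guilty=0.996, likelihood_ratio=250)
print(f"P(guilty | evidence) = {post:.2f}")   # ~0.20, despite a likelihood ratio of 250

# Reverse Bayes (one simple version): what prior probability of a real effect
# would be needed so that P(null | P < 0.05), the false positive risk, is only
# 5%, assuming 80% power? Solve fpr = alpha*(1-pi) / (alpha*(1-pi) + power*pi).
alpha, power, target_fpr = 0.05, 0.80, 0.05
prior_needed = alpha * (1 - target_fpr) / (alpha * (1 - target_fpr) + power * target_fpr)
print(f"Prior P(real effect) needed for a 5% false positive risk: {prior_needed:.2f}")  # ~0.54
```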
David: You’re missing my point. The case of Isaac (and also the case of guilt, but put that aside) is just like your example, and I’m very familiar with it. There’s a Bayes boost, but the prior is still low. The issue is for the Bayesian to say whether she’s a B-booster or a posterior probabilist. You say you’re the latter, so you wouldn’t say Isaac’s high grades are evidence of his college readiness (this was the post to which I linked in my last comment), but the so-so grades of a student selected from Many Ready town are evidence of readiness, thanks to the high prior. My point is that both Bayesian probabilist approaches have counterintuitive results and they are in tension. Whatever choice you make, however, has nothing to do with an account of inference applicable for a different context: science, where the hypothesis’s truth/adequacy is not the occurrence of an event.
David: Predicting guilt is very different from assessing how warranted or well tested a scientific hypothesis is–it’s back to the confusion between a scientific or statistical hypothesis being true, and an event occurring. This comes up in 2 papers of mine in 1997, and elsewhere.
But even restricting ourselves to the context of events, the Bayesian probabilist’s problem is this: should we measure ‘evidence’ (for the occurrence of an event) in terms of whether the posterior probability Pr(H|x) is high, or in terms of a Bayes boost, Pr(H|x) > Pr(H)? If in terms of a boost, which is how most see it, you get a very different answer than the one you are favoring, namely high posterior probability. And it has nothing to do with using error probabilities or not. The hypothesized event may go way up in probability even though, thanks to a low prior, the posterior is low.
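A toy numerical contrast, with numbers invented here purely for illustration, may make the tension concrete: the same evidence question can get opposite answers from the boost measure and the posterior measure.

```python
# Toy illustration of "B-boost" versus posterior probability as measures of
# evidence for an event H; the numbers below are invented for illustration only.

def posterior(prior, likelihood_ratio):
    """P(H | x) from a prior and the likelihood ratio P(x|H)/P(x|not H)."""
    odds = (prior / (1 - prior)) * likelihood_ratio
    return odds / (1 + odds)

cases = {
    "low prior, strong data (Island-style)": (0.001, 250),
    "high prior, weak data": (0.50, 1.5),
}
for label, (prior, lr) in cases.items():
    post = posterior(prior, lr)
    boost = post / prior   # how many times the probability of H went up
    print(f"{label}: prior={prior}, posterior={post:.2f}, boost={boost:.1f}x")
# The first case gets a ~200-fold boost yet a posterior of only ~0.2;
# the second gets a modest 1.2-fold boost yet a posterior of ~0.6.
```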
Playing within the realm of events (and not scientific hypotheses), what do you say about Isaac who comes from a town where few are college ready, but who gets impressive scores? You require Isaac to get a higher set of test scores than a student from a town where many are college ready, in order to say there’s evidence of his college readiness. Is this reverse discrimination?
https://errorstatistics.com/2014/02/01/comedy-hour-at-the-bayesian-epistemology-retreat-highly-probable-vs-highly-probed-vs-b-boosts/