Hardly a day goes by where I do not come across an article on the problems for statistical inference based on fallaciously capitalizing on chance: high-powered computer searches and “big” data trolling offer rich hunting grounds out of which apparently impressive results may be “cherry-picked”:
When the hypotheses are tested on the same data that suggested them and when tests of significance are based on such data, then a spurious impression of validity may result. The computed level of significance may have almost no relation to the true level. . . . Suppose that twenty sets of differences have been examined, that one difference seems large enough to test and that this difference turns out to be “significant at the 5 percent level.” Does this mean that differences as large as the one tested would occur by chance only 5 percent of the time when the true difference is zero? The answer is no, because the difference tested has been selected from the twenty differences that were examined. The actual level of significance is not 5 percent, but 64 percent! (Selvin 1970, 104)[1]
…Oh wait, this is from a contributor to Morrison and Henkel way back in 1970! But there is one big contrast, I find, that makes current-day reports so much more worrisome: critics of the Morrison and Henkel ilk clearly report that to ignore a variety of “selection effects” results in a fallacious computation of the actual significance level associated with a given inference; clear terminology is used to distinguish the “computed” or “nominal” significance level on the one hand, and the actual or warranted significance level on the other. Nowadays, writers make it much less clear that the fault lies with the fallacious use of significance tests and other error statistical methods. Instead, the tests are blamed for permitting or even encouraging such misuses. Criticisms to the effect that we should stop trying to teach these methods correctly have hardly helped. The situation is especially puzzling given that these same statistical fallacies have trickled down to the public sphere, what with Ben Goldacre’s “Bad Pharma”, calls for “all trials” to be registered and reported, and the popular articles on the ills of ‘big data’:
We’re more fooled by noise than ever before, and it’s because of a nasty phenomenon called “big data”. With big data, researchers have brought cherry-picking to an industrial level. (from Taleb 2013)
I say it is puzzling because the very reason these writers are able to raise the antennae of entirely non-technical audiences is that they assume–rightly in my judgment–the intuitive plausibility of the reasoning needed to pinpoint the fallacies being committed. No technical statistics required. Some well-known slogans:
- Statistical significance is not substantive significance
- Association is not causation
- If you torture the data enough they will confess.
The associated statistical fallacies are so antiquated that one is almost embarrassed to take them up in 2013, but there is no way around it if one is to tell the truth about statistical inference. When p-values are reported in the same way regardless of various “selection effects,” testing’s basic principles—whether of the Neyman-Pearson or Fisherian variety—are being twisted, distorted, invalidly used, even if the underlying statistical model has been approximately satisfied—a big if! To those who say, turn tests into decisions with explicit losses, I ask: how does it help in holding “big pharma” accountable to advocate they explicitly mix their cost-benefits into the very analysis of the data? At least a Goldacre-type criticism can call them out on subliminal bias.
To detect or even explain the fallacies is implicitly to recognize how various tactics mischaracterize and may greatly inflate the chance of reporting unwarranted claims. It is not a problem about long-runs either—it is a problem of the capability of the specific test to have done its job. Whenever presented with a statistical report, I always want to audit it by asking: just how frequently would this method have alerted me to erroneous claims of this form? If it would infrequently have alerted me, I deny it provides good evidence for this particular claim (at least without further assurances). When probability arises to describe how frequently methods are capable of detecting and discriminating erroneous interpretations of data, we may call it an error statistical use.
I suspect that the growth of fallacious statistics is due not only to the growth of big data but also to the acceptability of, if not preference for, methods that declare themselves free from such error-probabilistic encumbrances. The popular writers, quite correctly in my judgment, assume the reader will know all too well what is troubling about cherry-picking, hunting and all the rest. But statistical accounts that downplay error probabilities are at odds with these commonplace intuitions! These accounts, in more formal terms, deny or downplay the relevance of the sampling distribution on which pinpointing the trouble depends. Deniers do not take into account the sampling distribution once the data are in hand.
In a view that does not take into account the sampling distribution, inferences are conditional on the realized value x; other values which may have occurred are regarded as irrelevant. . . . No consideration of the sampling distribution of a statistic is entertained; sample space averaging is ruled out. (Barnett 1982, 226)
So if I ask a denier how often the procedure would output nominally significant effects, I might receive the reply:
The question of how often a given situation would arise is utterly irrelevant to the question of how we should reason when it does arise. I don’t know how many times this simple fact will have to be pointed out before statisticians of ‘frequentist’ persuasions will take note of it. (Jaynes 1976, 247).
To us, reasoning from the result that did arise is crucially dependent on how often it would occur erroneously.
But what is the “it”? I admit this needs clarification.
It may be best understood as the inference for which the data have been claimed to provide evidence. Even better, one might consider the particular erroneous interpretation of the data that would be of concern or interest. The critique is directed by the sources of misinterpretation relevant to the inferential problem of interest. One then considers the inference, the data, how it was obtained, etc. as a general process that might make it too easy to find an erroneous positive result. Given how common and vital this type of error-statistical reasoning is, I have found myself thinking that the dismissal of sampling distributions (among, say, Bayesians [2]) is based on a confusion—and that we error statisticians have simply not explained the philosophy behind using these tools. (I still think this!)
Remember that, here, statistical scrutiny means scrutinizing for mistaken or unwarranted interpretations of data. To say that a statistical inference to claim H is warranted is to say that H is, at the least, an adequate interpretation of the data (be it this data or all the data available): that is, that H has passed with reasonable severity. So if the report claims to have evidence that treatment T increased benefits regarding factor F, it matters that the procedure would often have reported that T increases some benefit or other, even if all are spurious. To properly assess these error frequencies or error probabilities, the overall process must be considered, though it requires thought and is rarely automatic.
Some advocate that we view the observed statistical significance level, or p-value, simply as a logical measure between data and hypothesis, with no error-statistical component. One is free to do so (despite apparently violating standard FDA regulations and being at odds with best practices in statistics in the law). But the main problem is that one will still need some way to set about scrutinizing low p-values that can very often be generated, even though the mistaken interpretation has scarcely been well ruled out. We prefer to adhere to the intended and valid use of observed significance levels. But mixing valid and invalid uses has become so prevalent that some new terminology is called for. Any computed or nominal observed significance level (or p-value) will be called unaudited. Until we have vouchsafed its error-statistical credentials, it is at most unaudited. We auditors invariably have an interest that Jaynes would call pathological: viewing the positive result as an instance of a general procedure of testing, with various abilities to mislead (or not)[3].
Frequentist calculations may be used to examine the particular case by describing how well tests can uncover mistakes in inference, or so I have argued. On these grounds, the “hunting procedure” is little able to have alerted us to temper our enthusiasm, even where tempering is warranted. In Selvin’s illustration, because at least one such impressive departure is common even if all are due to chance, the test has scarcely reassured us that it has done a good job of avoiding such a mistake in this case (Mayo 1996). We do not thereby discount the claim, but call attention to the need for further evidence. Being able to describe a test as affording terrible evidence is actually one of the important assets of this account. In some cases we may want to say that the evidence “so far” is terrible [4].
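Selvin's figure is easy to check with a little arithmetic and a small simulation. What follows is only a sketch under assumptions that go beyond Selvin's text: besides independence, I assume each of the twenty differences is summarized by a standard normal test statistic under the null, that the tests are two-sided, and the function names (`chance_of_at_least_one`, `simulate_hunting`) are made up for illustration.

```python
import random
from statistics import NormalDist


def chance_of_at_least_one(num_tests: int, alpha: float) -> float:
    """Probability that at least one of num_tests independent tests is
    nominally significant at level alpha when every null is true."""
    return 1.0 - (1.0 - alpha) ** num_tests


def simulate_hunting(num_tests: int = 20, alpha: float = 0.05,
                     trials: int = 20000, seed: int = 0) -> float:
    """Monte Carlo estimate of how often a hunt through num_tests null
    differences turns up at least one nominally significant result.

    Assumption (illustrative only): each standardized difference is an
    independent N(0, 1) draw, tested two-sided at level alpha.
    """
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    hits = 0
    for _ in range(trials):
        # The hunter reports a finding if ANY of the differences clears the bar.
        if any(abs(rng.gauss(0.0, 1.0)) > z_crit for _ in range(num_tests)):
            hits += 1
    return hits / trials


if __name__ == "__main__":
    print(round(chance_of_at_least_one(20, 0.05), 4))  # analytic value, 0.6415
    print(simulate_hunting())                          # Monte Carlo estimate
```

The nominal level attached to the single reported difference is .05; the relevant error probability for the hunting procedure as a whole is roughly .64, which is the sense in which the reported p-value is unaudited.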
Other unaudited reports of observed significance levels proceed apace. In some cases, though the null or test hypothesis is fixed, the criteria for rejection or choice of distance measure are chosen in order to produce a result that appears impressively far from what is expected under the null. And despite the absence of genuine association, much less causal connection, the frequency or probability of outputting an impressive result may be high. The non-null has “passed” the test, but with terrible severity. It does not really matter whether one alludes to relative frequencies or probabilities[5]. Mounting the critical audit (whether formal or informal) reflects an error-statistical principle of evidence because its application depends on sampling distributions. A statistical account that denies the relevance of sampling distributions to the interpretation of data obstructs such auditing.
REFERENCES
Barnett, V. 1982. Comparative statistical inference. 2nd ed. New York: John Wiley & Sons.
Jaynes, E. T. 1976. Common sense as an interface. In Foundations of probability theory, statistical inference and statistical theories of science. Vol. 2, edited by W. L. Harper and C. A. Hooker, 218-57. Dordrecht, The Netherlands: D. Reidel.
Mayo, D. 1996. Error and the Growth of Experimental Knowledge. Chicago: University of Chicago Press. (see Chapter 9)
Selvin, H. 1970. A critique of tests of significance in survey research. In The significance test controversy, edited by D. Morrison and R. Henkel, 94-106. Chicago: Aldine.
Young, S. http://www.niss.org/sites/default/files/RASSWebinar_121212.pdf
[1] Selvin calculates this approximately by considering the probability of finding at least one statistically significant difference at the .05 level when 20 independent samples are drawn from populations having true differences of zero, 1 – P(no such difference): 1 – (.95)^20 ≈ 1 – .36 = .64. This assumes, unrealistically, independent samples, but without that it may be unclear how to even approximately compute actual p-values.
[2] At least those who accept the (strong) likelihood principle.
[3] But the fact that we consider outcomes other than the one observed does not entail that we consider experiments other than the one performed, in reasoning from the data.
[4] Contrast this with saying that the hypothesis is not probable so far.
[5] Frequentist or propensity notions.
Great post! I did want to point out that Taleb is certainly aware of the reasoning behind the fallacies being committed; he’s written technical papers about it (or did you mean that the non-technical audiences are the ones intuitively assuming? maybe I just need to hone my reading skills).
Mark: Thank you. That was my point! The people* who write for “popular” audiences assume the intuitions needed to expose the fallacies; whereas some of the statistical schools question the role of error probabilities. Some in this class (of deniers) suggest we can rely on prior implausibility of inferences, but that’s not a reliable cure nor does it identify correctly the issue at hand. The particular hypothesis in question might be plausible and certainly on par with lots of other assertions (e.g., treatment T has beneficial effect F), but the particular process and data on which it is based may have poor error probabilities. The error probabilities must be relevant (to evaluating the capability of the test to have alerted us to the misinterpretation of interest). Not all error probs are relevant.
*Of course these writers also may, and usually do, write technical statistics (e.g., Young). It’s not clear if this shows up in the philosophies of statistics behind their technical work, though I think it does. This is a fairly recent phenomenon, so I am not sure about trends.
“Whenever presented with a statistical report, I always want to audit it by asking: just how frequently would this method have alerted me to erroneous claims of this form? If it would infrequently have alerted me, I deny it provides good evidence for this particular claim (at least without further assurances).”
I’m confused as to how to read this. The less often a method alerts one to claims that are erroneous, the better evidence it gives for claims that it does alert one to, all else equal.
Does this mean: for generic claims that it flags as possibly true, the less frequently the method informs me of the actually erroneous nature of those claims, the less valuable its flags marking generic claims are?
Your last paragraph seems right, but not the second one. (The second one might be true, just not directly what I was saying here.) Let me try to clarify what I meant in this case: I ask how frequently would this method have endorsed or inferred claims of this form, without alerting me to the influence of selection effects. Inability or little ability to warn of unwarranted interpretations is bad. Remember that cartoon, I forget which site it was on, that reported all of the null hypotheses that could not be rejected (e.g., we found no stat sig benefit for F’, F”, F’”, etc.), along with the one or two that were cherry-picked. That would be one way to alert me to information that indicates the problem with the original report, which presumably only announced the cherry-picked hypotheses. The information about the many non-significant effects searched indicates that the original report did a poor job in distinguishing spurious from genuine effects. I invite improved wordings.
Here’s a link to the Young and Karr paper that has the cartoon to which I allude in my previous comment (to Brian).
Young%20Karr%20Obs%20Study%20Problem.pdf
Suppose that we lined up students from shortest to tallest, then selected the shortest student and said this represents the population. You would dismiss the result as entirely biased and consider me a fraud or crazy.
Now suppose I lined up the p-values from smallest to largest and said that the smallest p-value represented a real effect. Seen in this light, you would call me a fraud.
Suppose that I did not tell you how many (raw) p-values were computed?
Suppose that the p-values were computed after a lot of regression modeling and I picked the “best” model, the one that gave the smallest raw p-value. You might now consider me not a fraud, but a science crook.
Professional statisticians teach statistical methods to honest students with the idea of finding results that are expected to replicate. P-values can be adjusted to reflect the number of questions under consideration.
P-values can be computed and reported in ways that are fair. They can also be used to commit fraud either through ignorance or by design.
Westfall/Young, Resampling-based multiple testing, Wiley, is a useful introduction to the area.
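To make concrete the remark that p-values can be adjusted to reflect the number of questions under consideration, here is a minimal sketch of three textbook adjustments (Bonferroni, Šidák, and Holm). These are stand-ins chosen for brevity, not the resampling-based methods of the book just mentioned; the Šidák version assumes independent tests, and the function names are purely illustrative.

```python
def bonferroni(p_values):
    """Bonferroni adjustment: multiply each raw p-value by the number of tests."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def sidak(p_values):
    """Sidak adjustment: chance that the smallest of m independent null
    p-values would fall at or below p (assumes independence)."""
    m = len(p_values)
    return [1.0 - (1.0 - p) ** m for p in p_values]


def holm(p_values):
    """Holm step-down adjustment; results are returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        # The k-th smallest p-value (rank = k - 1) is multiplied by m - k + 1.
        candidate = min(1.0, (m - rank) * p_values[i])
        # Enforce monotonicity: an adjusted value never drops below an earlier one.
        running_max = max(running_max, candidate)
        adjusted[i] = running_max
    return adjusted


if __name__ == "__main__":
    # Hypothetical case: the smallest raw p-value is .04, but 20 were computed.
    raw = [0.04] + [0.5] * 19
    print(bonferroni(raw)[0])  # about 0.8
    print(sidak(raw)[0])       # about 0.558
```

None of this can be done, of course, unless the number of p-values computed is reported, which is the point of the question above about not being told how many there were.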
This analogy does not hold.
Stan Young: Thank you so much for your comment. I had only recently heard of your book, and look forward to reading it. I entirely agree with what you’ve written; however, with many selection effects, it’s far from clear how to adjust p-values. Still, it’s progress even to report when they are unable to be assessed.
On a related issue, I am very interested in your proposals for introducing Deming-style assessments into journals, dividing up the data pre publication. Is it facing opposition? I hope to post something on this idea.
If the data do not satisfy statistical assumptions, then the data held back cannot be relied on to check results. At least it isn’t obvious how they can.
Mayo:
We discuss the importance of sampling distributions in chapter 7 of Bayesian Data Analysis (chapter 8 of the third edition). I agree that Bayesians have said and done foolish things regarding the sampling distribution, which is why we thought it important to get things right in our book.
Andrew: Thanks for the comment. I wish you’d make it better known that “Bayesians have said and done foolish things regarding the sampling distribution” because most just assume such considerations are to be rejected as irrelevant post-data at least.
Mayo:
Many thousands of people have read our book, but maybe they don’t make it all the way to chapter 7. The book has a pragmatic focus, to the extent that readers don’t always realize that the philosophy is there. It’s partly our style, that we just explain things rather than formally setting out our explanations as Theorems and Proofs.
Andrew: I’m going to study the chapter and see if I can identify the “philosophy”. Despite commenting on ~3 of your articles, in print and in this blog, I can’t say I’m sure of your philosophy at all (partly because 2 of those were joint papers).
I found this quote from Fisher in a paper by Yates:
“In recent times one often repeated exposition of the tests of significance, by J. Neyman, a writer not closely associated with the development of these tests, seems liable to lead mathematical readers astray, through laying down axiomatically, what is not agreed or generally true, that the level of significance must be equal to the frequency with which the hypothesis is rejected in repeated sampling of any fixed population allowed by hypothesis. This intrusive axiom, which is foreign to the reasoning on which the tests of significance were in fact based seems to be a real bar to progress…” — R.A. Fisher, 1945. The logical inversion of the notion of the random variable. Sankhya 7, 129-32.
The Yates paper is F. Yates, 1964. Fiducial Probability, Recognisable Sub-Sets and Behrens’ Test. Biometrics 20(2), 343-60. Incidentally, section 9 of that paper discusses a concept I have called “luckiness” confidence procedures previously on this blog. It turns out that John Pratt wrote a paper on the notion in 1961. Truly, there’s nothing new under the sun.
That quote should read “lead mathematical readers astray”, not “lead mathematical leaders astray”. (And “asociated” is a typo.)
Corey: Fixed.
Corey: On the Fisher business, yes, as you know we’ve taken up his lambasting Neyman beginning with the break around 1934 or so. Recently:
https://errorstatistics.com/2013/02/16/fisher-and-neyman-after-anger-management/
https://errorstatistics.com/2013/02/17/r-a-fisher-how-an-outsider-revolutionized-statistics/
Last year too, around his birthday.
I don’t take any of it seriously anymore. (Certainly less so than I did in EGEK 1996. )
I still don’t think I am sure about the meaning of your luckiness confidence intervals, but agree most things are rediscovered many times, especially nowadays when few people read items from even a decade ago. In fact they should focus on reading “old” things, that have stood the test of time, whatever that means.
Mayo: This isn’t about the lambasting of Neyman (and anyway, this is more of a side swipe). The substance of the quote is that Fisher is denying that he ever thought it necessary that p-values be what Gelman calls u-values. This stance seems to me to sharply disagree with your desire to audit p-values. (Not that I’m going to defend selection bias or failure to deal adequately with multiplicity — the quote arises in the context of a discussion of fiducial inference and the Behrens-Fisher problem, so there are connections to Bayes…)
Corey: I think you are misunderstanding Fisher, even trying to subtract out his lack of “anger management”. He surely requires being able to compute p-values under a null, but as I recall, gives an example where a vague null can result in error rates much smaller than a type 1 error probability, because it’s so hard to get strong evidence against. So the pre-specified error prob may be very conservative. Fisher requires a point, or non-vague null, as does Cox. But all of this is at right angles to the issue. I don’t have time to look up the paper just now, sorry.