This post picks up, and continues, an exchange that began with comments on my June 14 blogpost (between Sander Greenland, Nicole Jinn, and me). My new response is at the end. The concern is how to expose and ideally avoid some of the well-known flaws and foibles in statistical inference, thanks to gaps between data and statistical inference, and between statistical inference and substantive claims. I am not rejecting the use of multiple methods in the least (they are highly valuable when one method is capable of detecting or reducing flaws in one or more others). Nor am I speaking of classical dualism in metaphysics (which I also do not espouse). I begin with Greenland’s introduction of this idea in his comment… (For various earlier comments, see the post.)
…. I sense some confusion of criticism of the value of tests as popular tools vs. criticism of their logical foundation. I am a critic in the first, practical category, who regards the adoption of testing outside of narrow experimental programs as an unmitigated disaster, resulting in publication bias, prosecutor-type fallacies, and affirming the consequent fallacies throughout the health and social science literature. Even though testing can in theory be used soundly, it just hasn’t done well in practice in these fields. This could be ascribed to human failings rather than failings of received testing theories, but I would require any theory of applied statistics to deal with human limitations, just as safety engineering must do for physical products. I regard statistics as having been woefully negligent of cognitive psychology in this regard. In particular, widespread adoption and vigorous defense of a statistical method or philosophy is no more evidence of its scientific value than widespread adoption and vigorous defense of a religion is evidence of its scientific value. That should bring us to alternatives. I am aware of no compelling data showing that other approaches would have done better, but I do find compelling the arguments that at least some of the problems would have been mitigated by teaching a dualist approach to statistics, in which every procedure must be supplied with both an accurate frequentist and an accurate Bayesian interpretation, if only to reduce prevalent idiocies like interpreting a two-sided P-value as “the” posterior probability of a point null hypothesis.
Nicole Jinn (to Sander Greenland)
What exactly is this ‘dualist’ approach to teaching statistics and why does it mitigate the problems, as you claim? (I am increasingly interested in finding more effective ways to teach/instruct others in various age groups about statistics.) I have a difficult time seeing how effective this ‘dualist’ way of teaching could be for the following reason: the Bayesian and frequentist approaches are vastly different in their aims and the way they see statistics being used in (natural or social) science, especially when one looks more carefully at the foundations of each methodology (e.g., disagreements about where exactly probability enters into inference, or about what counts as relevant information). Hence, it does not make sense (to me) to supply both types of interpretation to the same data and the same research question! Instead, it makes more sense (from a teaching perspective) to demonstrate a Bayesian interpretation for one experiment, and a frequentist interpretation for another experiment, in the hopes of getting at the (major) differences between the two methodologies.
Sander. Thanks for your comment. Interestingly, I think the conglomeration of error statistical tools are the ones most apt at dealing with human limitations and foibles: they give piecemeal methods to ask one question at a time (e.g., would we be mistaken to suppose there is evidence of any effect at all? mistaken about how large? about iid assumptions? about possible causes? about implications for distinguishing any theories?). The standard Bayesian apparatus requires setting out a complete set of hypotheses that might arise, plus prior probabilities in each of them (or in “catchall” hypotheses), as well as priors in the model…and after this herculean task is complete, there is a purely deductive update: being deductive it never goes beyond the givens. Perhaps the data will require a change in your prior—this is what you must have believed before, since otherwise you find your posterior unacceptable—thereby encouraging the very self-sealing inferences we all claim to deplore.
As for your suggestion of requiring a justification using both or various schools, the thing is, the cases you call “disasters” can readily be “corroborated” Bayesianly. Take the case of a spurious p-value regarded as evidence for hypothesis H, let it be a favorite howler (e.g., ovulation and political preferences, repressed memories, or drug x benefits y, or what have you). The researcher believes his hypothesis H fairly plausible to begin with, and he has found data x that seem to be just what is expected were H true; and the low (nominal) p-value leads him to find the data improbable if H is false. P(H) and P(x|H) are fairly high, P(x|not-H) = very low, and a high posterior for P(H|x) may be had. I’m giving a crude reconstruction, but you get the picture. Now we have two methods warranting the original silly inference! Now an inference that is not countenanced by error statistical testing (due to both violated assumptions and the fact that statistical significance does not warrant the substantive “explanation”) is corroborated!
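The crude reconstruction above can be made concrete with Bayes' theorem. The sketch below is only illustrative: the function name and every number in it are hypothetical, chosen to match the verbal description (a fairly high prior and likelihood under H, a very low likelihood under not-H), and are not taken from any study.

```python
def posterior(prior_h: float, lik_h: float, lik_not_h: float) -> float:
    """Return P(H | x) by Bayes' theorem for H versus its complement not-H."""
    numerator = prior_h * lik_h
    return numerator / (numerator + (1.0 - prior_h) * lik_not_h)


# Hypothetical inputs: the researcher finds H plausible, x expected under H,
# and (reading the nominal p-value at face value) x improbable under not-H.
naive = posterior(prior_h=0.6, lik_h=0.8, lik_not_h=0.05)
print(round(naive, 3))  # 0.96

# Same prior and P(x|H), but supposing (again hypothetically) that data like x
# are in fact easy to generate when H is false.
audited = posterior(prior_h=0.6, lik_h=0.8, lik_not_h=0.6)
print(round(audited, 3))  # 0.667
```

The posterior is only as trustworthy as the value fed in for P(x|not-H), which is exactly the quantity the nominal p-value may misstate.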
So, while I’m sorry to shoot down so ecumenical-sounding a suggestion, this would not ring (but instead would mute) the error-statistical alarm bells that we want to hear (wrt the disasters), and which are the basis for mounting criticisms. Besides, even Bayesians cannot reconcile competing approaches to Bayesian inference; even those under the same banner, e.g., default Bayesians, disagree on rudimentary examples, as Bernardo and Berger and others concede. True, there are unifications that show “agreement on numbers” (as Berger puts it), and ways to show that even Bayesian methods have good long-run performance in a long series of repetitions. I happen to think the result is the worst of both worlds (i.e., heralding p-values as giving posterior beliefs, and extolling crude “behavioristic” rationales of long-run performance).
Anyone wanting to see more on these topics here, please search this blog.
Sander Greenland (to Nicole Jinn)
The reasoning behind the dualist approach is that by seeing a correct Bayesian interpretation of a confidence interval, the user is warned that the confidence interval is not necessarily her posterior interval, because the prior required to make it her posterior interval may be far from the prior she would have if she accounted for available background information. In parallel, by seeing a correct Bayesian interpretation of a P-value, the user is warned that the P-value is not necessarily her posterior probability of the model or hypothesis that the P-value “tests”, because the prior required to make it her posterior probability may be far from the prior she would have if she accounted for available background information. This approach requires no elaborate computation on her part, just initial training (I wrote a series of papers for the International Journal of Epidemiology on that, and I give workshops based on those), and then in application some additional reflection on the background literature for the topic.
This approach is intended to provide a brake on the usual misinterpretation of confidence intervals as posterior intervals and P-values as posterior probabilities. They are indeed posterior intervals and probabilities, but only under particular priors that may not represent the opinion of anyone who has actually done a critical reading of the background literature on a topic. A P-value says a little bit more, however: It bounds from below posterior probabilities from a class of priors that is fairly broad. Now, the larger a lower bound, the more informative it is. This means, almost paradoxically under some frequentist perspectives, that for Bayesian bounding purposes the larger the P-value, the more informative it is. All this is discussed in two articles by Greenland & Poole, 2013, who review old theoretical results by Casella & Berger, 1987.
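One textbook case makes the "only under particular priors" point concrete. Assume a single observation x from a normal distribution with unknown mean theta and known standard deviation sigma, and a one-sided test of H0: theta <= 0. Under a flat prior the posterior probability of H0 equals the one-sided P-value; under an informative normal prior it generally does not. The sketch below is an illustration under those assumptions only, and the function names and numbers are hypothetical, not drawn from the cited articles.

```python
import math


def norm_cdf(z: float) -> float:
    """Standard normal cumulative distribution function."""
    return 0.5 * math.erfc(-z / math.sqrt(2.0))


def one_sided_p(x: float, sigma: float) -> float:
    """P-value for H0: theta <= 0 against theta > 0, with x ~ N(theta, sigma^2)."""
    return 1.0 - norm_cdf(x / sigma)


def posterior_prob_null(x: float, sigma: float, prior_mean: float, prior_sd: float) -> float:
    """P(theta <= 0 | x) under a N(prior_mean, prior_sd^2) prior.

    prior_sd = math.inf stands in for the flat (improper uniform) prior.
    """
    if math.isinf(prior_sd):
        post_mean, post_sd = x, sigma
    else:
        w = prior_sd**2 / (prior_sd**2 + sigma**2)  # weight given to the data
        post_mean = w * x + (1.0 - w) * prior_mean
        post_sd = math.sqrt(w) * sigma
    return norm_cdf((0.0 - post_mean) / post_sd)


x, sigma = 1.645, 1.0  # hypothetical observation and known SD
print("one-sided P-value:       ", one_sided_p(x, sigma))
print("posterior, flat prior:   ", posterior_prob_null(x, sigma, 0.0, math.inf))
print("posterior, N(0, 0.5^2):  ", posterior_prob_null(x, sigma, 0.0, 0.5))
```

In this example the prior concentrated near zero yields a posterior probability of the null several times larger than the P-value, which is the kind of gap the dualist reading is meant to expose.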
I have used this ‘dualist’ approach in teaching for over 30 years, inspired by the dualist authors of the 1970s (especially Good, Leamer, and Box). It seems effective in scaling back the misrepresentation of confidence intervals and P-values as the posterior intervals and probabilities implied by the data, and my experience has been corroborated by other colleagues who have tried it. I recommend the dualist approach for this empirical reason, as well as for the following more philosophical reasons: As you note, certain rigid statistical approaches labeled “Bayesian” and “frequentist” appear vastly different in their aims and the way they say statistics should be used in science. Hence, it makes perfect sense to me to supply both types of interpretation to the same data and the same research question, so that the user is aware of potential conflicts and can respond as needed. In my work I have found that much if not most of the time the two perspectives will agree if constructed under the same assumptions (model) and interpreted properly, but if they seem to diverge then one is alerted to a problem and needs to isolate the source of apparent disagreement.
The bottom line is I think it essential to understand both frequentist and Bayesian interpretations in real applications. I regard the disagreements between certain extremist camps within these “schools” as a pseudo-controversy generated by a failure to appreciate the importance of understanding and viewing results through alternative perspectives, at least any perspective held by a large segment of the scientific community. Failing to seek multiple perspectives is gambling (foolishly in my view) that the one perspective you chose gets you the best inference possible. No perspective or philosophy comes with a credible guarantee of optimality or even sufficiency. Furthermore, all the empirical evidence I see suggests to me that using a frequentist perspective alone is frequently disastrous; as Poole pointed out, even Neyman could not get his interpretations correct all the time. Surely we can do better.
I think artificial intelligence (AI) research provides us important clues as to how. That research shows there is no known single formal methodology for inference that can be said to be optimal in any practical sense, as is clear from the fact that we cannot yet build a robot or program that can perform inferences about certain highly complex human systems (like economies) consistently better than the best human experts, who blend statistical tools (often modest ones) with intuitive judgments. To take a highly statistical field as an example, econometrics has made great strides in forecasting but econometricians still cannot use their tools to build investment strategies that beat highly informed but informal experts like Warren Buffett. Consonant with that observation, AI research has found Bayesian structures beneficial in the construction of expert systems, where the priors correspond to the intuition injected into the system along with data to produce inferences.
By the way, you use the word “experiment” but the fields of greatest concern to me are primarily nonexperimental (observational). My own research experience has included a lot of studies using secondary databases, needed because randomized trials were either infeasible (as in occupational or environmental studies) or were too small, short-term, or selective to provide for detection of infrequent or long-term effects. In these studies the entire frequentist foundation often looks far more hypothetical and dubious than carefully constructed priors.
My reply (6-25-13): Sander’s response to Jinn is telling. Let me explain why it does not respond to my concern, though it does help to illuminate it. The concern I raised (with the duality “check”) above is this: ‘The researcher believes his hypothesis H fairly plausible to begin with, and he has found data x that seem to be just what is expected were H true; and the low (nominal) p-value leads him to find the data improbable if H is false. P(H) and P(x|H) are fairly high, P(x|not-H) = very low, and a high posterior for P(H|x) may be had. I’m giving a crude reconstruction, but you get the picture. Now we have two methods warranting the original silly inference!’ (my previous comment)
2. Now Sander tells us: “the user is warned that the P-value is not necessarily her posterior probability of the model or hypothesis that the P-value “tests”, because the prior required to make it her posterior probability may be far from the prior she would have if she accounted for available background information”.
If you look at the main researchers in some of the studies that have come up for questioning in the last few posts (e.g., recovered memories, ovulation-political preferences) you will see them defend these beliefs. Of course they never assigned priors to all the assumptions being met, but they will happily change them so that the posterior fits their beliefs in the reality of the effect. (I am reminded of Senn and others on this blog).
3. But Sander wants us to consider a rational researcher open to correction. Excellent! But what she needs to be shown in the case of the unwarranted p-value, is the available background information that leads US to criticize a p-value as spurious. We need to show her that if she accounted for this available background—the confounding factors, the lack of controls, the alternative explanations having nothing to do with hormones, and whatever else the case demands —she would see that so small a p-value would be very easy to generate even if the proposed effect is not real, much less causal. Does her method have a high probability of warning us that such impressive-looking results could readily have been generated (using her methods) even if H0 were the case? If not, the statistical inference was unwarranted. She would thereby see her reported p-value has not done its job.
What the background information goes to substantiate is that which is needed to mount the error statistical criticism.
By contrast, if you ask whether various error probabilities accord with researchers’ intuitive degrees of belief, they may well say no, simply because they were not setting out to assign any such things, and don’t have a clue how they’d bet either on the mere statistical claim (of a real effect) and much less so as regards a substantive causal claim. But I think the ovulation researcher would say and does say (as with the repressed memories researcher) that she believes in the reality of the effect as well as the causal claim… Your best hope is to demonstrate (perhaps by simulation) that the probability of generating so small a p-value as she got is high, even under H0. Much more effective, and to the point, than saying, “no you don’t really believe that theory”.
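A simulation of the kind suggested above might look like the following. It is a minimal sketch under stated assumptions, not a reconstruction of any particular study: it supposes (hypothetically) that a researcher examines 20 independent outcomes, each truly null, and reports the smallest two-sided p-value from simple z-tests with known unit variance. The function names, the number of outcomes, and the sample sizes are all illustrative choices.

```python
import math
import random


def two_sided_p(sample: list) -> float:
    """Two-sided z-test p-value for H0: mean = 0, assuming known unit variance."""
    z = sum(sample) / math.sqrt(len(sample))
    return math.erfc(abs(z) / math.sqrt(2.0))


def prob_small_p_under_null(n_outcomes: int = 20, n_per_outcome: int = 25,
                            alpha: float = 0.05, n_sims: int = 2000,
                            seed: int = 1) -> float:
    """Estimate P(smallest of n_outcomes p-values <= alpha) when every H0 is true."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        p_min = min(
            two_sided_p([rng.gauss(0.0, 1.0) for _ in range(n_per_outcome)])
            for _ in range(n_outcomes)
        )
        if p_min <= alpha:
            hits += 1
    return hits / n_sims


rate = prob_small_p_under_null()
print("simulated probability of a 'significant' result under H0:", rate)
print("analytic value for independent tests, 1 - 0.95**20:", 1 - 0.95**20)
```

Under these assumptions a nominally significant result turns up in well over half of the simulated null studies, so the reported p-value is far from the actual error probability of the procedure that produced it.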
The bottom line is that there is no reason to beat around the bush: what needs recognition and/or fixing are the problems that cause the illegitimate p-values. Even if a researcher senses a discord between beliefs and error probabilities, this is no argument for why.
4. A point of syntax: I noticed in some of Greenland’s work* a presumption that a frequentist error statistician takes the model, and statistical assumptions of a method, as infallibly given. It’s easy to knock down such an absurd view. But where’s the evidence for this reading? Why would error statisticians have developed a battery of methods for testing assumptions if they regarded them as infallible? Why would they have erected methods, quite deliberately, that are quite robust to violated assumptions were they not aware they may be defeated? Indeed, the “ecumenism” of many, e.g., George Box, stems from an acknowledged need to utilize error statistical methods of some sort when it comes to checking the model (I think Gelman concurs, but I will not speak for him).
But there’s more: in understanding that we regard assumptions as checkable and testable, we are not claiming to assign them degrees of belief. Nor have I seen Bayesians assigning formal priors to things like iid (I’m sure some do, send references). Greenland’s dualist does not do this either. He imagines some qualitative background judgment, thereby going right back to what frequentists do (where formal checks are not needed.) But as already explained, our uses of background are directed to the specific flaws. Others can repeat the checks to corroborate allegations, thereby bringing the researcher around by reasoned argument.
*Here’s one short example in a rejoinder: https://docs.google.com/file/d/0B8ssu_MqjtheX1FwUmlxNnhFTTA/edit?usp=sharing
Some Related Posts:
- (1/26/12) Updating & Downdating: One of the Pieces to Pick up
- (4/25/12) Matching Numbers Across Philosophies
- (10/05/12) Deconstructing Gelman, Part 1: “A Bayesian wants everybody else to be a non-Bayesian.”
- (11/19/12) Comments on Wasserman’s “what is Bayesian/frequentist inference?”
- (11/21/12) Irony and Bad Faith: Deconstructing Bayesians – reblog
Sander Greenland (to Mayo) moved from June 14, 2013
Thanks for your comments. As I mentioned, it seems we live in antiparallel universes in terms of how we see what is going on. All the empirical evidence I see reading through health and medical science journals points to error-statistical methods as being routinely misinterpreted in Bayesian ways. Such observations are widely corroborated in health and social sciences (see Ch. 10 of Modern Epidemiology for many citations) and I think are unsurprising given common cognitive fallacies (such as those reported in the Kahneman et al. 1982 and Gilovich et al. 2002 anthologies). These observations should be alarming. We can theorize all we want about why these problems arise and persist, but I think we do not yet understand human limitations in statistical reasoning half as well as it seems your comments assume.
Thus I opt for an empirical view of the situation: We have had an education and practice disaster on our hands for a half century or more. In recent decades it has abated somewhat thanks to authors emphasizing estimation over testing. But, as is still lamented, improper inversion of conditional data probabilities (error statistics) into hypothesis probabilities remains habitual and is even encouraged by some teachers. See this link for an example and tell me if you don’t see a problem:
Charlie Poole sent us a more subtle example in which failure to reject the null is misinterpreted as no evidence against the null (Hans-Hermann Dubben, Hans-Peter Beck-Bornholdt. Systematic review of publication bias in studies on publication bias. BMJ 2005; 331:433-434). Knowledge of most any branch of the medical literature will show this kind of error remains prevalent, unsurprisingly perhaps since, as Charlie showed, even Neyman fell prey to it.
Your completely hypothetical Bayes example is disconnected from any research context and resembles nothing I see, teach, recommend or apply as Bayesian checks on inferences. Those methods I have found useful are detailed in several articles of mine, the basic introduction being Greenland 2006, Bayesian perspectives for epidemiologic research. I. Foundations and basic methods (with comment and reply), Int J Epidemiol 35:765–778, reprinted as Chapter 18 in the 3rd edition of Modern Epidemiology. The one comment printed with it may be interesting because it shows that some Bayesians feel just as threatened by this Bayes-frequentist connection as do some frequentists – understandably perhaps because it points up the weaknesses of the data models they share.
In nonexperimental epidemiology, error statistics derive from purely hypothetical data-probability models (DPMs) which appear to be nothing more than prior distributions for the data, as they have no basis in the study design, which by definition lacks any known randomization of treatment. Only some of the Bayesian analogs I recommend as checks derive from the same DPM; others derive by relaxation of the sharp constraints that are used to force identification in error statistics, but are not known to be correct by design and may even be highly doubtful, up to a point.
I have explained how I see penalized estimation as a frequentist and Bayesian blend that demonstrates how both can be used to explore proper uncertainty in nonexperimental results that arise from inability to identify the correct or even assuredly adequate model (Greenland 2009, Relaxation penalties and priors for plausible modeling of nonidentified bias sources, Statistical Science 24:195-210.). For real case studies explaining and demonstrating the methods see Greenland 2000, When should epidemiologic regressions use random coefficients? Biometrics 56:915–921; and Greenland 2005, Multiple-bias modeling for analysis of observational data (with discussion), J R Stat Soc Ser A 2005;168:267–308. I have published several other case studies in epidemiology and statistics journals, and have used these methods for subject-matter research articles.
I have endorsed these methods based on my experience with them and that reported by others. Following Good, Leamer, Box and others I recognize that no set of tools should be recommended without empirical evaluation, and that accepted tools should be subject to severe tests indefinitely, just as tests of general relativity continue to this day. Conversely, no one should reject a set of tools without empirical evaluation, as long as the tools have their own theoretical justification that is consistent both internally (free of contradictions) and externally (is not absolutely contradicted by data), and preferably consistent with accepted theory (a theory that consistently predicts accepted facts). To reject tools or facts on philosophical grounds beyond these basics strikes me as being as unscientific as refusing to question one’s accepted tools and facts.
Before we go any further I think it is growing time to dispense with the increasingly obsolete labels of frequentist and Bayesian, except in the history of 20th century statistics (and I suppose 21st century philosophy). Good pointed out there were over 40,000 kinds of Bayesians, and I doubt that the number has diminished or that the number of kinds of frequentist is much smaller. I am not a Bayesian or a frequentist, as I find such labels confining and almost antithetical to how I conceive science, which seeks many tools as well as theories and clings to none when faced with breakdowns (perhaps philosophically I may be a sort of restrained descendant of Feyerabend). As Stan Young said, we have a system out of control and it needs to be fixed somehow. On this and data sharing we agree even though his answer to the problem is multiple-testing adjustment whereas mine sees testing obsession as one of the factors worsening the current situation; hence I focus on sparse-data estimation instead.
My message is not that we need to do Bayesian analyses all or even most of the time; rather it is that if we are claiming to make an inference, we need to understand the priors that we would need to hold to endorse the inference if we were Bayesian, so we can contrast those to actual data information in the field. Senn categorizes this approach as subjunctive Bayes: if I had believed that, I should have come to believe this were I Bayesian. Conversely, I do well to ponder that if I now believed this, it would imply that I must have believed that, were I a Bayesian. As with any diagnostic, this reverse-Bayesian analysis may warn us to take special care to discourage Bayesian interpretation of our error statistics, and alert us to overconfident inferences from our colleagues and ourselves.
This is a far cry from anything I see you allude to as Bayesian. If you are not familiar with the ideas then you might do worse than start with Leamer – his 1978 masterpiece Specification Searches is available as a free download here:
Also Box (his classic 1980 JRSS article) and some articles by Good (in his 1983 anthology). My 2006 IJE article gives other relevant references.
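One simple way to picture the reverse-Bayesian question described above (what prior would one need to hold to endorse, or to resist, an inference) is the conjugate normal case. The sketch below is an assumption-laden illustration and not the specific method of the cited articles: it takes an approximately normal estimate with a known standard error, a normal prior centered on the null value 0, and asks how concentrated that prior must be before the 95% posterior interval just reaches 0. All names and numbers are hypothetical.

```python
import math

Z95 = 1.959964  # standard normal 97.5th percentile


def posterior_interval(estimate: float, se: float, prior_sd: float):
    """95% posterior interval under a N(0, prior_sd^2) prior and normal likelihood."""
    w = prior_sd**2 / (prior_sd**2 + se**2)  # weight given to the data
    post_mean = w * estimate
    post_sd = math.sqrt(w) * se
    return post_mean - Z95 * post_sd, post_mean + Z95 * post_sd


def critical_prior_sd(estimate: float, se: float):
    """Prior SD at which the 95% posterior interval just touches 0.

    Returns None when |estimate|/se <= Z95, since then even the flat-prior
    interval already includes 0 and no such prior needs to be sought.
    """
    z = abs(estimate) / se
    if z <= Z95:
        return None
    # Interval touches 0 when w*|estimate| = Z95*sqrt(w)*se, i.e. w = (Z95/z)^2.
    w = (Z95 / z) ** 2
    return math.sqrt(w * se**2 / (1.0 - w))


estimate, se = 0.5, 0.2  # hypothetical log rate ratio and its standard error
tau = critical_prior_sd(estimate, se)
print("critical prior SD:", tau)
print("posterior interval at that prior:", posterior_interval(estimate, se, tau))
```

The reader can then judge whether a prior that tight (or tighter) around the null is plausible given the background literature, which is the comparison the subjunctive reading calls for.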
Perhaps reflecting my 40 years in epidemiology, I am more concerned with frequencies of events and their relation to antecedent events than with untested theories. Thus, if you want me to consider your claims seriously, what I would ask of you is:
1) Please supply one real example of a published study in which the authors applied the kind of methodology I am talking about, and because of that they made clearly incorrect statements about the hypothesis or parameter or evidence under discussion (apart from computational errors, which all methods are at risk of). You need to supply some case studies which provide at least some test of your claims in real analyses for which we have a sense of where the answers lie.
2) Please comment on Charlie’s example (Dubben and Beck-Bornholdt, 2005) in which he showed the writers were led into an incorrect statement about evidence by misinterpreting a singular test rather than considering the full confidence distribution or P-value function (the usual null P-value and confidence limits comprising 3 points on the graph).
3) Please comment on the examples in the articles I sent you showing a professor of biostatistics misusing tests and power to claim support of the null in data that discriminate nothing within the range of interest (Greenland 2011, Null misinterpretation in statistical testing and its impact on health risk assessment, Prev Med 53:225–228; and Greenland 2012, Nonsignificance plus high power does not imply support for the null over the alternative. Ann Epidemiol 22:364–368).
4) Please explain how you rationalize your promotion of error statistics in the face of documented and chronic failure of researchers to interpret them correctly in vitally important settings.
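The P-value function mentioned in item 2 can be sketched for the simplest case. Assuming an approximately normal estimate with a known standard error, the function gives the two-sided P-value for every hypothesized parameter value rather than for the null alone; the null P-value and the two 95% confidence limits are then just three points on that curve. The estimate, standard error, and function names below are hypothetical.

```python
import math


def p_value(estimate: float, se: float, hypothesized: float) -> float:
    """Two-sided P-value for H: parameter = hypothesized, normal approximation."""
    z = (estimate - hypothesized) / se
    return math.erfc(abs(z) / math.sqrt(2.0))


def p_value_function(estimate: float, se: float, grid):
    """Evaluate the P-value at each hypothesized parameter value in grid."""
    return [(h, p_value(estimate, se, h)) for h in grid]


estimate, se = 0.4, 0.25  # hypothetical estimate and standard error
lower, upper = estimate - 1.959964 * se, estimate + 1.959964 * se

# The three conventional points: the null value and the 95% confidence limits.
for h in (0.0, lower, upper):
    print(f"hypothesized = {h: .3f}   P = {p_value(estimate, se, h):.3f}")

# The rest of the curve, on a coarse grid from -0.2 to 1.0.
for h, p in p_value_function(estimate, se, [i / 10 for i in range(-2, 11)]):
    print(f"hypothesized = {h: .1f}   P = {p:.3f}")
```

With these numbers the null value 0 is not rejected at the 0.05 level, yet a hypothesized value of 0.8 has the same P-value as 0, so the nonsignificant test alone gives no reason to favor the null over an effect of that size.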