This post picks up, and continues, an exchange that began with comments on my June 14 blogpost (between Sander Greenland, Nicole Jinn, and me). My new response is at the end. The concern is how to expose, and ideally avoid, some of the well-known flaws and foibles in statistical inference that arise from gaps between data and statistical inference, and between statistical inference and substantive claims. I am not rejecting the use of multiple methods in the least (they are highly valuable when one method is capable of detecting or reducing flaws in one or more others). Nor am I speaking of classical dualism in metaphysics (which I also do not espouse). I begin with Greenland's introduction of this idea in his comment… (For various earlier comments, see the post.)
…. I sense some confusion of criticism of the value of tests as popular tools vs. criticism of their logical foundation. I am a critic in the first, practical category, who regards the adoption of testing outside of narrow experimental programs as an unmitigated disaster, resulting in publication bias, prosecutor-type fallacies, and affirming the consequent fallacies throughout the health and social science literature. Even though testing can in theory be used soundly, it just hasn’t done well in practice in these fields. This could be ascribed to human failings rather than failings of received testing theories, but I would require any theory of applied statistics to deal with human limitations, just as safety engineering must do for physical products. I regard statistics as having been woefully negligent of cognitive psychology in this regard. In particular, widespread adoption and vigorous defense of a statistical method or philosophy is no more evidence of its scientific value than widespread adoption and vigorous defense of a religion is evidence of its scientific value. That should bring us to alternatives. I am aware of no compelling data showing that other approaches would have done better, but I do find compelling the arguments that at least some of the problems would have been mitigated by teaching a dualist approach to statistics, in which every procedure must be supplied with both an accurate frequentist and an accurate Bayesian interpretation, if only to reduce prevalent idiocies like interpreting a two-sided P-value as “the” posterior probability of a point null hypothesis.
Nicole Jinn (to Sander Greenland)
What exactly is this ‘dualist’ approach to teaching statistics and why does it mitigate the problems, as you claim? (I am increasingly interested in finding more effective ways to teach/instruct others in various age groups about statistics.) I have a difficult time seeing how effective this ‘dualist’ way of teaching could be for the following reason: the Bayesian and frequentist approaches are vastly different in their aims and the way they see statistics being used in (natural or social) science, especially when one looks more carefully at the foundations of each methodology (e.g., disagreements about where exactly probability enters into inference, or about what counts as relevant information). Hence, it does not make sense (to me) to supply both types of interpretation to the same data and the same research question! Instead, it makes more sense (from a teaching perspective) to demonstrate a Bayesian interpretation for one experiment, and a frequentist interpretation for another experiment, in the hopes of getting at the (major) differences between the two methodologies.
Sander: Thanks for your comment. Interestingly, I think the conglomeration of error statistical tools is the one most apt at dealing with human limitations and foibles: these tools give piecemeal methods to ask one question at a time (e.g., would we be mistaken to suppose there is evidence of any effect at all? mistaken about how large? about iid assumptions? about possible causes? about implications for distinguishing any theories?). The standard Bayesian apparatus requires setting out a complete set of hypotheses that might arise, plus prior probabilities in each of them (or in "catchall" hypotheses), as well as priors in the model… and after this herculean task is complete, there is a purely deductive update: being deductive, it never goes beyond the givens. Perhaps the data will require a change in your prior (this is what you must have believed before, since otherwise you find your posterior unacceptable), thereby encouraging the very self-sealing inferences we all claim to deplore.
As for your suggestion of requiring a justification using both or various schools, the thing is, the cases you call "disasters" can readily be "corroborated" Bayesianly. Take the case of a spurious p-value regarded as evidence for hypothesis H; let it be a favorite howler (e.g., ovulation and political preferences, repressed memories, or drug x benefits y, or what have you). The researcher believes his hypothesis H fairly plausible to begin with, and he has found data x that seem to be just what would be expected were H true; and the low (nominal) p-value leads him to find the data improbable if H is false. P(H) and P(x|H) are fairly high, P(x|not-H) is very low, and a high posterior P(H|x) may be had. I'm giving a crude reconstruction, but you get the picture. Now we have two methods warranting the original silly inference! An inference that is not countenanced by error statistical testing (owing both to violated assumptions and to the fact that statistical significance does not warrant the substantive "explanation") is nonetheless corroborated!
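To make the arithmetic of that crude reconstruction concrete, here is a minimal sketch with purely hypothetical numbers (the particular values of P(H), P(x|H), and P(x|not-H) are placeholders of my own, not taken from any actual study):

```python
# A minimal sketch of the crude reconstruction above, with purely
# hypothetical numbers: a researcher who deems H fairly plausible and
# the data very improbable under not-H gets a high posterior by Bayes'
# theorem, whether or not the nominal p-value was spurious.

def posterior(prior_H, lik_given_H, lik_given_notH):
    """P(H | x) for a simple H vs. not-H partition."""
    marginal = prior_H * lik_given_H + (1 - prior_H) * lik_given_notH
    return prior_H * lik_given_H / marginal

# P(H) = 0.5, P(x | H) = 0.8, P(x | not-H) = 0.05 (all hypothetical)
print(posterior(0.5, 0.8, 0.05))   # ~0.94: the "silly inference" is corroborated
```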
So, while I'm sorry to shoot down so ecumenical-sounding a suggestion, this would not ring (but instead would mute) the error-statistical alarm bells that we want to hear (with respect to the disasters), and which are the basis for mounting criticisms. Besides, even Bayesians cannot reconcile competing approaches to Bayesian inference; even those under the same banner, e.g., default Bayesians, disagree on rudimentary examples, as Bernardo and Berger and others concede. True, there are unifications that show "agreement on numbers" (as Berger puts it), and ways to show that even Bayesian methods have good long-run performance in a long series of repetitions. I happen to think the result is the worst of both worlds (i.e., heralding p-values as giving posterior beliefs, and extolling crude "behavioristic" rationales of long-run performance).
Anyone wanting to see more on these topics here, please search this blog.
Sander Greenland (to Nicole Jinn)
The reasoning behind the dualist approach is that by seeing a correct Bayesian interpretation of a confidence interval, the user is warned that the confidence interval is not necessarily her posterior interval, because the prior required to make it her posterior interval may be far from the prior she would have if she accounted for available background information. In parallel, by seeing a correct Bayesian interpretation of a P-value, the user is warned that the P-value is not necessarily her posterior probability of the model or hypothesis that the P-value “tests”, because the prior required to make it her posterior probability may be far from the prior she would have if she accounted for available background information. This approach requires no elaborate computation on her part, just initial training (I wrote a series of papers for the International Journal of Epidemiology on that, and I give workshops based on those), and then in application some additional reflection on the background literature for the topic.
This approach is intended to provide a brake on the usual misinterpretation of confidence intervals as posterior intervals and P-values as posterior probabilities. They are indeed posterior intervals and probabilities, but only under particular priors that may not represent the opinion of anyone who has actually done a critical reading of the background literature on a topic. A P-value says a little bit more, however: It bounds from below posterior probabilities from a class of priors that is fairly broad. Now, the larger a lower bound, the more informative it is. This means, almost paradoxically under some frequentist perspectives, that for Bayesian bounding purposes the larger the P-value, the more informative it is. All this is discussed in two articles by Greenland & Poole, 2013, who review old theoretical results by Casella & Berger, 1987.
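A minimal numeric illustration of the limiting case of the bound just described, assuming a normal mean with known variance, an improper flat prior on theta, and a hypothetical data summary (this sketch is added for illustration only; it is not Greenland's or Casella and Berger's code):

```python
# Limiting case of the bounding result: for a one-sided test of
# H0: theta <= 0 with a normal mean (known sigma) and an improper flat
# prior on theta, the posterior probability of H0 equals the one-sided
# p-value. All numbers below are hypothetical.

from scipy.stats import norm

xbar, sigma, n = 0.30, 1.0, 100          # hypothetical data summary
z = xbar / (sigma / n ** 0.5)

p_one_sided = norm.sf(z)                 # frequentist: P(Z >= z) at theta = 0
post_H0 = norm.cdf(-z)                   # Bayesian: P(theta <= 0 | data), flat prior

print(p_one_sided, post_H0)              # both ~0.00135
```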
I have used this 'dualist' approach in teaching for over 30 years, inspired by the dualist authors of the 1970s (especially Good, Leamer, and Box). It seems effective in scaling back the misrepresentation of confidence intervals and P-values as the posterior intervals and probabilities implied by the data, and my experience has been corroborated by other colleagues who have tried it. I recommend the dualist approach for this empirical reason, as well as for the following more philosophical reasons: As you note, certain rigid statistical approaches labeled "Bayesian" and "frequentist" appear vastly different in their aims and the way they say statistics should be used in science. Hence, it makes perfect sense to me to supply both types of interpretation to the same data and the same research question, so that the user is aware of potential conflicts and can respond as needed. In my work I have found that much if not most of the time the two perspectives will agree if constructed under the same assumptions (model) and interpreted properly, but if they seem to diverge then one is alerted to a problem and needs to isolate the source of the apparent disagreement.
The bottom line is I think it essential to understand both frequentist and Bayesian interpretations in real applications. I regard the disagreements between certain extremist camps within these “schools” as a pseudo-controversy generated by a failure to appreciate the importance of understanding and viewing results through alternative perspectives, at least any perspective held by a large segment of the scientific community. Failing to seek multiple perspectives is gambling (foolishly in my view) that the one perspective you chose gets you the best inference possible. No perspective or philosophy comes with a credible guarantee of optimality or even sufficiency. Furthermore, all the empirical evidence I see suggests to me that using a frequentist perspective alone is frequently disastrous; as Poole pointed out, even Neyman could not get his interpretations correct all the time. Surely we can do better.
I think artificial intelligence (AI) research provides us important clues as to how. That research shows there is no known single formal methodology for inference that can be said to be optimal in any practical sense, as is clear from the fact that we cannot yet build a robot or program that performs inference about certain highly complex human systems (like economies) consistently better than the best human experts, who blend statistical tools (often modest ones) with intuitive judgments. To take a highly statistical field as an example, econometrics has made great strides in forecasting, but econometricians still cannot use their tools to build investment strategies that beat highly informed but informal experts like Warren Buffett. Consonant with that observation, AI research has found Bayesian structures beneficial in the construction of expert systems, where the priors correspond to the intuition injected into the system along with data to produce inferences.
By the way, you use the word "experiment", but the fields of greatest concern to me are primarily nonexperimental (observational). My own research experience has included a lot of studies using secondary databases, needed because randomized trials were either infeasible (as in occupational or environmental studies) or were too small, short-term, or selective to provide for detection of infrequent or long-term effects. In these studies the entire frequentist foundation often looks far more hypothetical and dubious than carefully constructed priors.
My reply (6-25-13): Sander's response to Jinn is telling. Let me explain why it does not respond to my concern, though it does help to illuminate it. 1. The concern I raised (with the duality "check") above is this: 'The researcher believes his hypothesis H fairly plausible to begin with, and he has found data x that seem to be just what would be expected were H true; and the low (nominal) p-value leads him to find the data improbable if H is false. P(H) and P(x|H) are fairly high, P(x|not-H) is very low, and a high posterior P(H|x) may be had. I'm giving a crude reconstruction, but you get the picture. Now we have two methods warranting the original silly inference!' (my previous comment)
2. Now Sander tells us: “the user is warned that the P-value is not necessarily her posterior probability of the model or hypothesis that the P-value “tests”, because the prior required to make it her posterior probability may be far from the prior she would have if she accounted for available background information”.
If you look at the main researchers in some of the studies that have come up for questioning in the last few posts (e.g., recovered memories, ovulation and political preferences), you will see them defend these beliefs. Of course they never gave a prior to all the assumptions being met, but they will happily adjust their priors so that the posterior fits their belief in the reality of the effect. (I am reminded of Senn and others on this blog.)
3. But Sander wants us to consider a rational researcher open to correction. Excellent! But what she needs to be shown, in the case of the unwarranted p-value, is the available background information that leads US to criticize a p-value as spurious. We need to show her that if she accounted for this available background—the confounding factors, the lack of controls, the alternative explanations having nothing to do with hormones, and whatever else the case demands—she would see that so small a p-value would be very easy to generate even if the proposed effect is not real, much less causal. Does her method have a high probability of warning us that such impressive-looking results could readily have been generated (using her methods) even if H0 were the case? If not, the statistical inference was unwarranted. She would thereby see that her reported p-value has not done its job.
What the background information goes to substantiate is precisely what is needed to mount the error-statistical criticism.
By contrast, if you ask whether various error probabilities accord with researchers' intuitive degrees of belief, they may well say no, simply because they were not setting out to assign any such things and haven't a clue how they'd bet on the mere statistical claim (of a real effect), much less on a substantive causal claim. But I think the ovulation researcher would say, and does say (as with the repressed-memories researcher), that she believes in the reality of the effect as well as the causal claim… Your best hope is to demonstrate (perhaps by simulation) that the probability of generating so small a p-value as she got is high, even under H0. That is much more effective, and more to the point, than saying, "no, you don't really believe that theory".
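To illustrate the kind of simulation I have in mind, here is a minimal sketch under assumptions of my own choosing (a hypothetical researcher who examines 20 noise-only comparisons and reports whichever gives the smallest nominal p-value); it is not a reconstruction of any of the actual studies mentioned:

```python
# Chance of an "impressive" nominal p-value when H0 is true for every
# comparison and the smallest of 20 p-values is reported. All settings
# below are hypothetical placeholders.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_subjects, n_comparisons, n_studies = 50, 20, 10_000

hits = 0
for _ in range(n_studies):
    data = rng.normal(size=(n_comparisons, n_subjects))       # pure noise
    t = data.mean(axis=1) / data.std(axis=1, ddof=1) * np.sqrt(n_subjects)
    p = 2 * stats.t.sf(np.abs(t), df=n_subjects - 1)          # two-sided p-values
    hits += p.min() < 0.05                                    # report the best one

print(hits / n_studies)   # roughly 1 - 0.95**20, i.e., about 0.64
```

Even with every null true, a nominally significant result turns up in roughly two-thirds of such studies, which is just the sort of thing the spurious-p-value criticism turns on.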
The bottom line is that there is no reason to beat around the bush: what needs recognition and/or fixing are the problems that cause the illegitimate p-values. Even if a researcher senses a discord between her beliefs and the error probabilities, that sense of discord by itself offers no argument as to why it arises.
4. A point of syntax: I noticed in some of Greenland's work* a presumption that a frequentist error statistician takes the model, and the statistical assumptions of a method, as infallibly given. It is easy to knock down such an absurd view, but where is the evidence for this reading? Why would error statisticians have developed a battery of methods for testing assumptions if they regarded those assumptions as infallible? Why would they have erected methods, quite deliberately, that are robust to violated assumptions were they not aware the assumptions may be defeated? Indeed, the "ecumenism" of many, e.g., George Box, stems from an acknowledged need to utilize error statistical methods of some sort when it comes to checking the model (I think Gelman concurs, but I will not speak for him).
But there's more: in understanding that we regard assumptions as checkable and testable, we are not claiming to assign them degrees of belief. Nor have I seen Bayesians assigning formal priors to things like iid assumptions (I'm sure some do; send references). Greenland's dualist does not do this either. He imagines some qualitative background judgment, thereby going right back to what frequentists do (where formal checks are not needed). But, as already explained, our uses of background are directed to the specific flaws. Others can repeat the checks to corroborate allegations, thereby bringing the researcher around by reasoned argument.
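For a concrete instance of the kind of assumption check at issue, here is a minimal, self-contained sketch with simulated (hypothetical) data, not anything from the studies discussed: a permutation test of the independence part of the iid assumption, using lag-1 autocorrelation of residuals as the test statistic.

```python
# Permutation test of the independence assumption: under H0 the
# residuals are exchangeable, so permuting them should not change the
# lag-1 autocorrelation beyond chance. Data below are simulated
# placeholders.

import numpy as np

rng = np.random.default_rng(1)

def lag1_autocorr(x):
    x = x - x.mean()
    return float(x[:-1] @ x[1:] / (x @ x))

def independence_pvalue(resid, n_perm=5000):
    """P-value for H0: no serial dependence among the residuals."""
    observed = abs(lag1_autocorr(resid))
    perm = np.array([abs(lag1_autocorr(rng.permutation(resid)))
                     for _ in range(n_perm)])
    return float((perm >= observed).mean())

iid_resid = rng.normal(size=200)                        # assumption holds
ar_resid = np.empty(200)                                # assumption violated
ar_resid[0] = rng.normal()
for t in range(1, 200):
    ar_resid[t] = 0.6 * ar_resid[t - 1] + rng.normal()

print(independence_pvalue(iid_resid))   # typically not small
print(independence_pvalue(ar_resid))    # typically ~0: flags the violation
```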
*Here’s one short example in a rejoinder: https://docs.google.com/file/d/0B8ssu_MqjtheX1FwUmlxNnhFTTA/edit?usp=sharing
Some Related Posts:
- (1/26/12) Updating & Downdating: One of the Pieces to Pick up
- (4/25/12) Matching Numbers Across Philosophies
- (10/05/12) Deconstructing Gelman, Part 1: “A Bayesian wants everybody else to be a non-Bayesian.”
- (11/19/12) Comments on Wasserman’s “what is Bayesian/frequentist inference?”
- (11/21/12) Irony and Bad Faith: Deconstructing Bayesians – reblog