This post picks up, and continues, an exchange that began with comments on my June 14 blog post (between Sander Greenland, Nicole Jinn, and me). My new response is at the end. The concern is how to expose and, ideally, avoid some of the well-known flaws and foibles in statistical inference that arise from gaps between data and statistical inference, and between statistical inference and substantive claims. I am not rejecting the use of multiple methods in the least (they are highly valuable when one method is capable of detecting or reducing flaws in one or more others). Nor am I speaking of classical dualism in metaphysics (which I also do not espouse). I begin with Greenland’s introduction of this idea in his comment… (For various earlier comments, see the post.)

…. I sense some confusion of criticism of the value of tests as popular tools vs. criticism of their logical foundation. I am a critic in the first, practical category, who regards the adoption of testing outside of narrow experimental programs as an unmitigated disaster, resulting in publication bias, prosecutor-type fallacies, and affirming the consequent fallacies throughout the health and social science literature. Even though testing can in theory be used soundly, it just hasn’t done well in practice in these fields. This could be ascribed to human failings rather than failings of received testing theories, but I would require any theory of applied statistics to deal with human limitations, just as safety engineering must do for physical products. I regard statistics as having been woefully negligent of cognitive psychology in this regard. In particular, widespread adoption and vigorous defense of a statistical method or philosophy is no more evidence of its scientific value than widespread adoption and vigorous defense of a religion is evidence of its scientific value. That should bring us to alternatives. I am aware of no compelling data showing that other approaches would have done better, but I do find compelling the arguments that at least some of the problems would have been mitigated by teaching a dualist approach to statistics, in which every procedure must be supplied with both an accurate frequentist and an accurate Bayesian interpretation, if only to reduce prevalent idiocies like interpreting a two-sided P-value as “the” posterior probability of a point null hypothesis.

**Nicole Jinn (to Sander Greenland):** What exactly is this ‘dualist’ approach to teaching statistics and why does it mitigate the problems, as you claim? (I am increasingly interested in finding more effective ways to teach/instruct others in various age groups about statistics.) I have a difficult time seeing how effective this ‘dualist’ way of teaching could be for the following reason: the Bayesian and frequentist approaches are vastly different in their aims and the way they see statistics being used in (natural or social) science, especially when one looks more carefully at the foundations of each methodology (e.g., disagreements about where exactly probability enters into inference, or about what counts as relevant information). Hence, it does not make sense (to me) to supply both types of interpretation to the same data and the same research question! Instead, it makes more sense (from a teaching perspective) to demonstrate a Bayesian interpretation for one experiment, and a frequentist interpretation for another experiment, in the hopes of getting at the (major) differences between the two methodologies.

Sander: Thanks for your comment. Interestingly, I think the conglomeration of error statistical tools is the one most apt at dealing with human limitations and foibles: they give piecemeal methods to ask one question at a time (e.g., would we be mistaken to suppose there is evidence of any effect at all? mistaken about how large? about iid assumptions? about possible causes? about implications for distinguishing any theories?). The standard Bayesian apparatus requires setting out a complete set of hypotheses that might arise, plus prior probabilities for each of them (or for “catchall” hypotheses), as well as priors on the model…and after this herculean task is complete, there is a purely deductive update: being deductive, it never goes beyond the givens. Perhaps the data will require a change in your prior—this is what you must have believed before, since otherwise you find your posterior unacceptable—thereby encouraging the very self-sealing inferences we all claim to deplore.

As for your suggestion of requiring a justification using both or various schools, the thing is, the cases you call “disasters” can readily be “corroborated” Bayesianly. Take the case of a spurious p-value regarded as evidence for hypothesis H; let it be a favorite howler (e.g., ovulation and political preferences, repressed memories, or drug x benefits y, or what have you). The researcher believes his hypothesis H fairly plausible to begin with, and he has found data x that seem to be just what is expected were H true; and the low (nominal) p-value leads him to find the data improbable if H is false. P(H) and P(x|H) are fairly high, P(x|not-H) = very low, and a high posterior for P(H|x) may be had. I’m giving a crude reconstruction, but you get the picture. Now we have two methods warranting the original silly inference! Now an inference that is not countenanced by error statistical testing (due to both violated assumptions and the fact that statistical significance does not warrant the substantive “explanation”) is corroborated!
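To make the crude reconstruction concrete, here is a minimal back-of-the-envelope sketch; the particular numbers are illustrative stand-ins of mine, not drawn from any actual study:

```python
def posterior(p_h, p_x_given_h, p_x_given_not_h):
    """Posterior P(H|x) by Bayes' theorem for a simple H vs. not-H split."""
    num = p_h * p_x_given_h
    return num / (num + (1.0 - p_h) * p_x_given_not_h)

# The researcher finds H "fairly plausible" (0.5), the data "just what is
# expected were H true" (0.8), and "improbable if H is false" (0.05):
print(round(posterior(0.5, 0.8, 0.05), 3))  # 0.941
```

A posterior above 0.9 is reached, so the same spurious inference the nominal p-value suggested is now “corroborated” a second way.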

So, while I’m sorry to shoot down so ecumenical-sounding a suggestion, this would not ring (but instead would mute) the error-statistical alarm bells that we want to hear (wrt the disasters), and which are the basis for mounting criticisms. Besides, even Bayesians cannot reconcile competing approaches to Bayesian inference; even those under the same banner, e.g., default Bayesians, disagree on rudimentary examples—as Bernardo and Berger and others concede. True, there are unifications that show “agreement on numbers” (as Berger puts it), and ways to show that even Bayesian methods have good long-run performance in a long series of repetitions. I happen to think the result is the worst of both worlds (i.e., heralding p-values as giving posterior beliefs, and extolling crude “behavioristic” rationales of long-run performance).

Anyone wanting to see more on these topics here, please search this blog.

**Sander Greenland (to Nicole Jinn):** The reasoning behind the dualist approach is that by seeing a correct Bayesian interpretation of a confidence interval, the user is warned that the confidence interval is not necessarily her posterior interval, because the prior required to make it her posterior interval may be far from the prior she would have if she accounted for available background information. In parallel, by seeing a correct Bayesian interpretation of a P-value, the user is warned that the P-value is not necessarily her posterior probability of the model or hypothesis that the P-value “tests”, because the prior required to make it her posterior probability may be far from the prior she would have if she accounted for available background information. This approach requires no elaborate computation on her part, just initial training (I wrote a series of papers for the International Journal of Epidemiology on that, and I give workshops based on those), and then in application some additional reflection on the background literature for the topic.

This approach is intended to provide a brake on the usual misinterpretation of confidence intervals as posterior intervals and P-values as posterior probabilities. They are indeed posterior intervals and probabilities, but only under particular priors that may not represent the opinion of anyone who has actually done a critical reading of the background literature on a topic. A P-value says a little bit more, however: It bounds from below posterior probabilities from a class of priors that is fairly broad. Now, the larger a lower bound, the more informative it is. This means, almost paradoxically under some frequentist perspectives, that for Bayesian bounding purposes the larger the P-value, the more informative it is. All this is discussed in two articles by Greenland & Poole, 2013, who review old theoretical results by Casella & Berger, 1987.
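A minimal sketch of the boundary case of this correspondence (an illustration assuming a one-sided test of a normal mean with known variance; not an example from the cited articles): under an improper flat prior, the one-sided P-value coincides exactly with the posterior probability of the null region, and the bounding results extend this to wider prior classes.

```python
import math

def phi(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Observe z = 1.645 for H0: theta <= 0 vs. theta > 0, with x ~ N(theta, 1).
z = 1.645
p_one_sided = 1.0 - phi(z)  # frequentist one-sided p-value
post_null = phi(-z)         # Pr(theta <= 0 | z) under a flat prior on theta
print(round(p_one_sided, 4), round(post_null, 4))  # both about 0.05
```

The two numbers agree to machine precision here; under the broader prior classes in the bounding theorems, the P-value sits at the bottom of the range of posterior probabilities rather than equaling one of them.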

I have used this ‘dualist’ approach in teaching for over 30 years, inspired by the dualist authors of the 1970s (especially Good, Leamer, and Box). It seems effective in scaling back the misrepresentation of confidence intervals and P-values as the posterior intervals and probabilities implied by the data, and my experience has been corroborated by other colleagues who have tried it. I recommend the dualist approach for this empirical reason, as well as for the following more philosophical reasons: As you note, certain rigid statistical approaches labeled “Bayesian” and “frequentist” appear vastly different in their aims and the way they say statistics should be used in science. Hence, it makes perfect sense to me to supply both types of interpretation to the same data and the same research question, so that the user is aware of potential conflicts and can respond as needed. In my work I have found that much if not most of the time the two perspectives will agree if constructed under the same assumptions (model) and interpreted properly, but if they seem to diverge then one is alerted to a problem and needs to isolate the source of apparent disagreement.

The bottom line is that I think it essential to understand both frequentist and Bayesian interpretations in real applications. I regard the disagreements between certain extremist camps within these “schools” as a pseudo-controversy generated by a failure to appreciate the importance of understanding and viewing results through alternative perspectives, at least any perspective held by a large segment of the scientific community. Failing to seek multiple perspectives is gambling (foolishly in my view) that the one perspective you chose gets you the best inference possible. No perspective or philosophy comes with a credible guarantee of optimality or even sufficiency. Furthermore, all the empirical evidence I see suggests to me that using a frequentist perspective alone is frequently disastrous; as Poole pointed out, even Neyman could not get his interpretations correct all the time. Surely we can do better.

I think artificial intelligence (AI) research provides us with important clues as to how. That research shows there is no known single formal methodology for inference that can be said to be optimal in any practical sense, as is clear from the fact that we cannot yet build a robot or program that can perform inferences about certain highly complex human systems (like economies) consistently better than the best human experts, who blend statistical tools (often modest ones) with intuitive judgments. To take a highly statistical field as an example, econometrics has made great strides in forecasting, but econometricians still cannot use their tools to build investment strategies that beat highly informed but informal experts like Warren Buffett. Consonant with that observation, AI research has found Bayesian structures beneficial in the construction of expert systems, where the priors correspond to the intuition injected into the system along with data to produce inferences.

By the way, you use the word “experiment” but the fields of greatest concern to me are primarily nonexperimental (observational). My own research experience has included a lot of studies using secondary databases, needed because randomized trials were either infeasible (as in occupational or environmental studies) or were too small, short-term, or selective to provide for detection of infrequent or long-term effects. In these studies the entire frequentist foundation often looks far more hypothetical and dubious than carefully constructed priors.

**My reply (6-25-13):** Sander’s response to Jinn is telling. Let me explain why it does not respond to my concerns, though it does help to illuminate them. 1. The concern I raised (with the duality “check”) above is this: ‘The researcher believes his hypothesis H fairly plausible to begin with, and he has found data x that seem to be just what is expected were H true; and the low (nominal) p-value leads him to find the data improbable if H is false. P(H) and P(x|H) are fairly high, P(x|not-H) = very low, and a high posterior for P(H|x) may be had. I’m giving a crude reconstruction, but you get the picture. Now we have two methods warranting the original silly inference!’ (my previous comment)

2. Now Sander tells us: “the user is warned that the P-value is not necessarily her posterior probability of the model or hypothesis that the P-value “tests”, because the prior required to make it her posterior probability may be far from the prior she would have if she accounted for available background information”.

If you look at the main researchers in some of the studies that have come up for questioning in the last few posts (e.g., recovered memories, ovulation-political preferences) you will see them defend these beliefs. Of course they never gave a prior to all the assumptions being met, but they will happily change their priors so that the posterior fits their belief in the reality of the effect. (I am reminded of Senn and others on this blog.)

3. But Sander wants us to consider a rational researcher open to correction. Excellent! But what she needs to be shown in the case of the unwarranted p-value, is the available background information that leads US to criticize a p-value as spurious. We need to show her that if she accounted for this available background—the confounding factors, the lack of controls, the alternative explanations having nothing to do with hormones, and whatever else the case demands —she would see that so small a p-value would be very easy to generate even if the proposed effect is not real, much less causal. Does her method have a high probability of warning us that such impressive-looking results could readily have been generated (using her methods) even if H_{0} were the case? If not, the statistical inference was unwarranted. She would thereby see her reported p-value has not done its job.

What the background information goes to substantiate is that which is needed to mount the error statistical criticism.

By contrast, if you ask whether various error probabilities accord with researchers’ intuitive degrees of belief, they may well say no, simply because they were not setting out to assign any such things, and don’t have a clue how they’d bet on the mere statistical claim (of a real effect), much less on a substantive causal claim. But I think the ovulation researcher would say and does say (as with the repressed-memories researcher) that she believes in the reality of the effect as well as the causal claim… Your best hope is to demonstrate (perhaps by simulation) that the probability of generating so small a p-value as she got is high, even under H_{0}. Much more effective, and to the point, than saying, “no you don’t really believe that theory”.
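Such a simulation is easy to sketch. Here is one hypothetical setup of my own (not a reconstruction of any actual study): twenty independent outcomes are scanned with every null true, and only the smallest p-value is reported.

```python
import math
import random

def two_sided_p(z):
    """Two-sided p-value for a standard normal test statistic z."""
    return 2.0 * (1.0 - 0.5 * (1.0 + math.erf(abs(z) / math.sqrt(2.0))))

random.seed(1)  # fixed seed so the run is reproducible
n_sims, n_tests = 10_000, 20
hits = sum(
    min(two_sided_p(random.gauss(0.0, 1.0)) for _ in range(n_tests)) < 0.05
    for _ in range(n_sims)
)
print(hits / n_sims)  # close to 1 - 0.95**20, about 0.64
```

Even with every null true, an “impressive” p-value below 0.05 turns up in roughly two-thirds of such fishing expeditions; no prior elicitation is needed to make the point.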

The bottom line is that there is no reason to beat around the bush: what needs recognition and/or fixing are the problems that cause the illegitimate p-values. Even if a researcher senses a discord between beliefs and error probabilities, that sense of discord is no argument as to why it arises.

4. A point of syntax: I noticed in some of Greenland’s work* a presumption that a frequentist error statistician takes the model, and statistical assumptions of a method, as infallibly given. It’s easy to knock down such an absurd view. But where’s the evidence for this reading? Why would error statisticians have developed a battery of methods for testing assumptions if they regarded them as infallible? Why would they have erected methods—quite deliberately—that are quite robust to violated assumptions were they not aware the assumptions may be defeated? Indeed, the “ecumenism” of many, e.g., George Box, stems from an acknowledged need to utilize error statistical methods of some sort when it comes to checking the model (I think Gelman concurs, but I will not speak for him).

But there’s more: in understanding that we regard assumptions as checkable and testable, we are not claiming to assign them degrees of belief. Nor have I seen Bayesians assigning formal priors to things like iid (I’m sure some do, send references). Greenland’s dualist does not do this either. He imagines some qualitative background judgment, thereby going right back to what frequentists do (where formal priors are not needed). But as already explained, our uses of background are directed to the specific flaws. Others can repeat the checks to corroborate allegations, thereby bringing the researcher around by reasoned argument.

*Here’s one short example in a rejoinder: https://docs.google.com/file/d/0B8ssu_MqjtheX1FwUmlxNnhFTTA/edit?usp=sharing

_____

**Some Related Posts:**

- (1/26/12) Updating & Downdating: One of the Pieces to Pick up
- (4/25/12) Matching Numbers Across Philosophies
- (10/05/12) Deconstructing Gelman, Part 1: “A Bayesian wants everybody else to be a non-Bayesian.”
- (11/19/12) Comments on Wasserman’s “what is Bayesian/frequentist inference?”
- (11/21/12) Irony and Bad Faith: Deconstructing Bayesians – reblog

**Sander Greenland (to Mayo), moved from June 14, 2013:**

Thanks for your comments. As I mentioned, it seems we live in antiparallel universes in terms of how we see what is going on. All the empirical evidence I see reading through health and medical science journals points to error-statistical methods as being routinely misinterpreted in Bayesian ways. Such observations are widely corroborated in health and social sciences (see Ch. 10 of Modern Epidemiology for many citations) and I think are unsurprising given common cognitive fallacies (such as those reported in the Kahneman et al. 1982 and Gilovich et al. 2002 anthologies). These observations should be alarming. We can theorize all we want about why these problems arise and persist, but I think we do not yet understand human limitations in statistical reasoning half as well as it seems your comments assume.

Thus I opt for an empirical view of the situation: we have had an education and practice disaster on our hands for a half century or more. In recent decades it has abated somewhat thanks to authors emphasizing estimation over testing. But, as is still lamented, the improper inversion of conditional data probabilities (error statistics) into hypothesis probabilities remains habitual and is even encouraged by some teachers. See this link for an example and tell me if you don’t see a problem:

http://abacus.bates.edu/~ganderso/biology/resources/statistics.html

Charlie Poole sent us a more subtle example in which failure to reject the null is misinterpreted as no evidence against the null (Hans-Hermann Dubben, Hans-Peter Beck-Bornholdt. Systematic review of publication bias in studies on publication bias. BMJ 2005; 331:433-434). Knowledge of most any branch of the medical literature will show this kind of error remains prevalent, unsurprisingly perhaps, since as Charlie showed even Neyman fell prey to it.

Your completely hypothetical Bayes example is disconnected from any research context and resembles nothing I see, teach, recommend, or apply as Bayesian checks on inferences. Those methods I have found useful are detailed in several articles of mine, the basic introduction being Greenland 2006, Bayesian perspectives for epidemiologic research. I. Foundations and basic methods (with comment and reply), Int J Epidemiol 35:765–778, reprinted as Chapter 18 in the 3rd edition of Modern Epidemiology. The one comment printed with it may be interesting because it shows that some Bayesians feel just as threatened by this Bayes–frequentist connection as do some frequentists – understandably perhaps, because it points up the weaknesses of the data models they share.

In nonexperimental epidemiology, error statistics derive from purely hypothetical data-probability models (DPMs) which appear to be nothing more than prior distributions for the data, as they have no basis in the study design, which by definition lacks any known randomization of treatment. Only some of the Bayesian analogs I recommend as checks derive from the same DPM; others derive by relaxation of the sharp constraints that are used to force identification in error statistics, but are not known to be correct by design and may even be highly doubtful, up to a point.

I have explained how I see penalized estimation as a frequentist and Bayesian blend that demonstrates how both can be used to explore proper uncertainty in nonexperimental results that arise from inability to identify the correct or even assuredly adequate model (Greenland 2009, Relaxation penalties and priors for plausible modeling of nonidentified bias sources, Statistical Science 24:195-210.). For real case studies explaining and demonstrating the methods see Greenland 2000, When should epidemiologic regressions use random coefficients? Biometrics 56:915–921; and Greenland 2005, Multiple-bias modeling for analysis of observational data (with discussion), J R Stat Soc Ser A 2005;168:267–308. I have published several other case studies in epidemiology and statistics journals, and have used these methods for subject-matter research articles.
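The penalization–prior correspondence invoked above can be illustrated in miniature (a deliberately simple toy case, not the models of the cited papers): ridge-style shrinkage of a sample mean coincides with the posterior mean under a normal prior when the penalty λ equals σ²/τ².

```python
def ridge_mean(ys, lam):
    """Minimizer of sum((y_i - m)^2) + lam * m^2: shrinkage toward 0."""
    return sum(ys) / (len(ys) + lam)

def posterior_mean(ys, sigma2, tau2):
    """Posterior mean of m with y_i ~ N(m, sigma2) and prior m ~ N(0, tau2)."""
    precision = len(ys) / sigma2 + 1.0 / tau2
    return (sum(ys) / sigma2) / precision

ys = [1.2, 0.7, 1.9, 1.1]
sigma2, tau2 = 1.0, 0.5
lam = sigma2 / tau2  # the penalty implied by the prior
assert abs(ridge_mean(ys, lam) - posterior_mean(ys, sigma2, tau2)) < 1e-12
```

The penalty has an exact frequentist reading (a constraint added to the fit) and an exact Bayesian reading (a prior), which is the sense in which penalized estimation blends the two.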

I have endorsed these methods based on my experience with them and that reported by others. Following Good, Leamer, Box and others, I recognize that no set of tools should be recommended without empirical evaluation, and that accepted tools should be subject to severe tests indefinitely, just as tests of general relativity continue to this day. Conversely, no one should reject a set of tools without empirical evaluation, as long as the tools have their own theoretical justification that is consistent both internally (free of contradictions) and externally (not absolutely contradicted by data), and preferably consistent with accepted theory (a theory that consistently predicts accepted facts). To reject tools or facts on philosophical grounds beyond these basics strikes me as unscientific as refusing to question one’s accepted tools and facts.

Before we go any further, I think it is high time to dispense with the increasingly obsolete labels of frequentist and Bayesian, except in the history of 20th-century statistics (and, I suppose, 21st-century philosophy). Good pointed out there were over 40,000 kinds of Bayesians, and I doubt the number has diminished or that the kinds of frequentists are much fewer. I am not a Bayesian or a frequentist, as I find such labels confining and almost antithetical to how I conceive science, which seeks many tools as well as theories and clings to none when faced with breakdowns (perhaps philosophically I am a sort of restrained descendant of Feyerabend). As Stan Young said, we have a system out of control and it needs to be fixed somehow. On this and data sharing we agree, even though his answer to the problem is multiple-testing adjustment whereas mine sees testing obsession as one of the factors worsening the current situation; hence I focus on sparse-data estimation instead.

My message is not that we need to do Bayesian analyses all or even most of the time; rather, it is that if we are claiming to make an inference, we need to understand the priors that we would need to hold to endorse the inference if we were Bayesian, so we can contrast those with actual data information in the field. Senn categorizes this approach as subjunctive Bayes: if I had believed that, I should have come to believe this, were I a Bayesian. Conversely, I do well to ponder that if I now believed this, it would imply that I must have believed that, were I a Bayesian. As with any diagnostic, this reverse-Bayesian analysis may warn us to take special care to discourage Bayesian interpretation of our error statistics, and alert us to overconfident inferences from our colleagues and ourselves.
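In the simplest two-hypothesis case, this reverse-Bayes bookkeeping reduces to odds arithmetic (a schematic sketch with made-up numbers, not the full method): given a stated posterior and the Bayes factor the data supply, one can back out the prior that must have been held.

```python
def implied_prior(posterior_p, bayes_factor):
    """Prior P(H) one must have held for data carrying the given Bayes
    factor (H vs. not-H) to yield the stated posterior P(H)."""
    posterior_odds = posterior_p / (1.0 - posterior_p)
    prior_odds = posterior_odds / bayes_factor
    return prior_odds / (1.0 + prior_odds)

# To end up 90% sure of H when the data favor H by a Bayes factor of 9,
# one must have started with even odds on H:
print(round(implied_prior(0.9, 9.0), 6))  # 0.5
```

If the implied prior is far from anything the background literature would support, that is the warning sign the diagnostic is meant to raise.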

This is a far cry from anything I see you allude to as Bayesian. If you are not familiar with the ideas then you might do worse than start with Leamer – his 1978 masterpiece Specification Searches is available as a free download here:

http://www.anderson.ucla.edu/faculty/edward.leamer/books/specification_searches/specification_searches.htm

Also Box (his classic 1980 JRSS article) and some articles by Good (in his 1983 anthology). My 2006 IJE article gives other relevant references.

Perhaps reflecting my 40 years in epidemiology, I am more concerned with frequencies of events and their relation to antecedent events than with untested theories. Thus, if you want me to consider your claims seriously, what I would ask of you is:

1) Please supply one real example of a published study in which the authors applied the kind of methodology I am talking about and, because of that, made clearly incorrect statements about the hypothesis or parameter or evidence under discussion (apart from computational errors, which all methods are at risk of). You need to supply some case studies which provide at least some test of your claims in real analyses for which we have a sense of where the answers lie.

2) Please comment on Charlie’s example (Dubben and Beck-Bornholdt, 2005) in which he showed the writers were led into an incorrect statement about evidence by misinterpreting a single test rather than considering the full confidence distribution or P-value function (the usual null P-value and confidence limits comprising 3 points on the graph).

3) Please comment on the examples in the articles I sent you showing a professor of biostatistics misusing tests and power to claim support of the null in data that discriminate nothing within the range of interest (Greenland 2011, Null misinterpretation in statistical testing and its impact on health risk assessment, Prev Med 53:225–228; and Greenland 2012, Nonsignificance plus high power does not imply support for the null over the alternative. Ann Epidemiol 22:364–368).

4) Please explain how you rationalize your promotion of error statistics in the face of documented and chronic failure of researchers to interpret them correctly in vitally important settings.

Sander: I’d love to respond to each of your points in depth, but I am writing a book on philosophy of statistical inference just now, and I need to complete it ASAP. Many of these issues may be found by searching this blog and checking its table of contents. A few remarks on points #1–3 are below:

Point #1. You had it right the first time (when you said): “I am aware of no compelling data showing that other approaches would have done better…” [readers can check full comments, this post]. We may grant “at least some of the problems would have been mitigated [if] every procedure must be supplied with both an accurate frequentist and an accurate Bayesian interpretation” (Greenland, this post). But if we had an accurate frequentist application, the problems would not have arisen. It is absurd to imply, as you do in point #1, that method M is valid (and no criticism of M can have any weight) unless one can prove that, because researchers applied methodology M, they made clearly incorrect statements. Any criticism (of an instantiation of M) can readily be dismissed as not caused by the methodology, or not really an application of it (as in the “not a real woman” fallacy).

Point #2. Since my doctoral dissertation, I have advocated looking at a series of intervals or severity benchpoints. See Mayo 1983 and two more recent references below. There is a computational similarity with Poole and other attempts at a “series of CIs” (e.g., by Kempthorne and by Birnbaum), but with an important difference in interpretation, noted in Mayo and Spanos (2006) below. I thought at first it was a mere rule of thumb; only later did it become something deeper.

Point #3. I will (in a later post) comment on your “Nonsignificance plus high power does not imply support for the null over the alternative” as an exercise in how tests and power may be misinterpreted. It is easy to show that the statement can be true, but there is no problem in the least for a proper interpretation of tests! Actually, all the ingredients are already in my recent reblog:

https://errorstatistics.com/2013/06/06/anything-tests-can-do-cis-do-better-cis-do-anything-better-than-tests-reforming-the-reformers-cont/

Also, search for fallacy of acceptance.

Finally, what is your reply to Christian Hennig?

Mayo, D.G. and Cox, D. R. (2006). “Frequentist Statistics as a Theory of Inductive Inference,” Optimality: The Second Erich L. Lehmann Symposium (ed. J. Rojo), Lecture Notes-Monograph series, Institute of Mathematical Statistics (IMS), Vol. 49: 77-97.

http://www.phil.vt.edu/dmayo/personal_website/Ch 7 mayo & cox.pdf

Mayo, D. G. and Spanos, A. (2006). “Severe Testing as a Basic Concept in a Neyman-Pearson Philosophy of Induction,” British Journal for the Philosophy of Science, 57: 323-357.

Mayo, D. (1983). “An Objective Theory of Statistical Testing.” Synthese 57(2): 297-340.

http://www.phil.vt.edu/dmayo/personal_website/(1983) AN OBJECTIVE THEORY OF STATISTICAL Testing.pdf

**Christian Hennig:** I’m in favour of “dual teaching” as far as this means teaching both frequentist and Bayesian approaches (maybe from time to time applied to the same data), and clearly stressing the differences between the interpretations of probability in them. I worry that what Sander Greenland proposes makes matters even more confusing than they are already.

The idea of p-values and tests is based on an interpretation of probability that does *not* assign probabilities to models or parameters being true – period. This is what needs to be taught and if taught properly, people should not interpret a p-value as probability of any set of parameters being true.

Results about how p-values or confidence intervals can have Bayesian interpretations as bounds over certain classes of priors are of some theoretical value, but if they are introduced too early, I worry that their message will be counterproductive, encouraging people to think of p-values as at least some kind of posterior probability of sets of models under some priors, which is what they are not intended to be.

If you want a posterior, come up with a well-justified prior first! Don’t use p-values out of their domain!

Now I think that Sander is right that statistics education has mostly failed to teach the difference between the approaches properly. I, however, see a different reason, namely that the vast majority of statisticians are not much interested in philosophy and interpretations of probability (and, understandably, somewhat annoyed by wars about foundations), and that they tend to teach Bayesian and frequentist methods as two technical devices doing more or less the same thing. And of the remaining group, the majority think that either the frequentist or the Bayesian approach is right and the other one is wrong, so they teach only one of them properly and ridicule the other, instead of facing the fact that these two major approaches in statistics start from different points of view and do essentially different things, which may be legitimate for different research questions (and won’t necessarily lead to the same conclusions about the same data if just applied properly).

Christian: As is so often the case, I appreciate your voice of reason. I think the key point is one you mention: most think that either frequentist or Bayesian is right…so they teach only one of them properly and ridicule the other one. We see from Greenland’s often (but not always) very forceful papers that he is in the group that has at times ridiculed frequentist methods. (7/7/13) The other point you mention is one of the key things that puzzles me about Greenland’s suggestion: while castigating people for identifying error probabilities and posteriors, his recommendation is to try p-values on for size as posteriors, and to try to bring the person around if the person sees a discord.

Mayo, you say “We see from Greenland’s often (but not always) very forceful papers, that he is in the group that tends to ridicule frequentist methods.” I just do not think this is accurate. I think that Greenland has great respect for frequentist methods, albeit applied in certain restricted settings (frankly, I see myself as a die-hard frequentist, but I agree with him here). For example, his 1990 Epidemiology paper emphasizes frequentist inference in randomized experiments. As I commented previously, this paper is one of my favorite papers of all time. Additionally, ironically, from the first in his series of 3 Bayesian tutorial papers published in Int. J. of Epidemiology in 2006: “In the randomized trial and random-sample survey in which they were developed, these frequentist techniques [i.e., those of Fisher, Neyman, and Pearson] appear to be highly effective tools.” His beef seems to be with applying these methods in non-experimental contexts. As I said, I tend to agree with him on this (and I suspect that other regular contributors here, as well as some deceased greats like Fisher and David Freedman, might agree, too, but I certainly don’t want to put words in their mouths…).

You have to keep in mind that Greenland is writing from a public health perspective, and this is a realm where a lot of demonstrably bad inferences have been made from observational data (by “bad” I mean inferences that have later been shown to be false in randomized trials, but not before becoming part of public health policy or recommendations). Stan Young and Ioannidis have both written quite a bit about this.

Where I part ways with Greenland is his embrace of Bayesian and causal modeling methods for non-randomized (i.e., observational) data. I personally just don’t see the point in trying to fix a biased design by making a bunch of strong, completely unverifiable assumptions. I don’t see any difference between this approach and “expert opinion”, and I don’t value expert opinion highly in most contexts (Nassim Taleb has written a lot about the failings of expert opinion, as did Popper). Personally, I think the best approach for non-experimental data (notice that I’m not specifically saying “non-randomized” because there are other ways to do a true experiment, as with certain physical experiments) is another recommendation made by Greenland in his 1990 paper: “deemphasize inferential statistics in favor of pure data descriptors.” Unfortunately, for whatever reason, such a descriptive approach is unlikely to get published.

Mark: I appreciate your comments; they help me to get some perspective, so thanks. My remarks are based on around 8 of Greenland’s papers or joint papers, the ones he has recommended I read (over the past year), and one on which my colleague Aris Spanos and I commented–though these issues are peripheral to the last. (You may search this blog). The points in my comments on this blogpost, especially the semantic point I raise at the end, may explain the tension I find running through nearly all these articles, when discussing these issues. [Granted I was picking up on Hennig’s language, but I don’t think it’s far off, even if it’s unintended.] I will say more about two of these topics on later posts (what he/they say about frequentist model assumptions, and the business of power). I might note that we have at times discussed testing assumptions in non-experimental settings on this blog, focusing on my colleague Aris Spanos’ work. See for example a series of 4 posts:

https://errorstatistics.com/2012/02/22/2294/

https://errorstatistics.com/2012/02/23/misspecification-testing-part-2/

https://errorstatistics.com/2012/02/27/misspecification-testing-part-3-m-s-blog/

https://errorstatistics.com/2012/02/28/m-s-tests-part-4-the-end-of-the-story-and-some-conclusions/

To: Christian Hennig: Yes, I concur that many seem to think that either frequentist or Bayesian is right and the other one is wrong; and that inevitably affects how people are taught statistics as a subject. That is the primary motivation behind my wanting to focus (in my later studies!) on the way statistics is taught, i.e., on methods of instruction! Effective teaching/instruction (to me) really depends on the quality of the instructor’s knowledge of the subject, both in terms of addressing answers to foundational questions and possessing a strong technical or mathematical background.

Greenland applies the frequentist’s weak repeated sampling property when he criticizes frequentist statistics in (non-experimental) contexts where he alleges “using a frequentist perspective alone is frequently disastrous”. Too bad for those contexts, but no skin off my nose. While I’m here, I fail to see how allowing ‘accept h’ as a shorthand for ‘do not reject h’ (at a specified level) means that I could not get my interpretation correct all the time.

I want to address this:

“Perhaps the data will require a change in your prior—this is what you must have believed before, since otherwise you find your posterior unacceptable—thereby encouraging the very self-sealing inferences we all claim to deplore.”

To avoid confusion, I’ll refer to the prior distribution I hypothetically used and the posterior distribution I hypothetically obtained as the analysis prior and posterior.

When I find the analysis posterior unacceptable, there must be some proposition it evaluates as probable that I myself regard as improbable, or vice versa. At this point, I ask myself what one thinker I respect greatly has called the fundamental question of rationality: why do I believe what I believe?

By examining the reasons for my belief, I may conclude that my original reaction was wrong, and that I have no grounds to find the analysis posterior unacceptable. — Or, I may find that I can point to some prior scientific information that warrants my disagreement, and in this case, I will most likely find that the analysis prior failed to encode this prior information. In this situation, I judge it legitimate to change the analysis prior; what’s more, the fact that the analysis posterior is sensitive to this aspect of the prior may itself be of scientific interest.

Corey: I am not sure how your point addresses the issue in the particular context in which my remarks arise here. Take for instance one of the hypotheses that arose in this entire discussion over the past few posts, H: ovulation influences voting preferences. Women who are married or in serious relationships—according to the researchers’ interpretation of the data—feel more liberal sexually during ovulation and thus they “overcompensated” by favoring Romney, the conservative candidate (in the PI’s study). Her reading of the data rests on the statistically based “real effects”.

[The post was https://errorstatistics.com/2013/06/22/what-do-these-share-in-common-mms-limbo-stick-ovulation-dale-carnegie-sat-night-potpourri/

The links to the “CNN pulled a story about a study” item are http://www.wthitv.com/dpps/healthy_living/general_health/study-looks-at-voting-and-hormones_4857701#.Uc80GJWvtRZ

and

http://business.utsa.edu/faculty/kdurante/files/Durante_PresidentialElection_Hormones.pdf].

Now the dualist critic comes along and says “by seeing a correct Bayesian interpretation of a P-value, the user is warned that the P-value is not necessarily her posterior probability of the model or hypothesis that the P-value “tests”, because the prior required to make it her posterior probability may be far from the prior she would have if she accounted for available background information” (Greenland comment).

So I guess Greenland would explain that his posterior in the null is actually not very low; it seems to him that her findings are readily explicable by chance. If this means only that he will show that her controls are terrible, etc. etc., then he is performing an error statistical critique and not beating around the bush. I endorse that. If instead he says a good Bayesian would give a low posterior to her hypothesis “ovulation influences voting preferences” then he is doing no more than those who criticized the study as simply unbelievable, silly and the like. The PI would respond, as she does, saying that she thinks it’s valid. If for some reason she is forced to change her likelihoods, she’ll adjust her prior.

To repeat myself: “The bottom line is that there is no reason to beat around the bush [and doing so is counterproductive]: what needs recognition and/or fixing are the problems that cause the illegitimate p-values” in the first place. One then points out that even with a genuine statistical effect, subsequent interpretations like hers commit the statistical vs substantive fallacy.

Mayo: The point that is generally applicable is that when an analysis of any type leads to an inference one finds implausible, it’s asking (and answering!) “why do I believe what I believe?” that can save one from unwarranted self-sealing inferences. I agree that it won’t help someone who is going to jump from a statistical inference to a substantive inference without even trying to think about ways that such a jump can be in error.

Thanks to all for their discussion. Like Mayo, I don’t have the time to respond to every issue raised. Most of the comments strike me as starting points for side discussions of their own. But one struck me as a good illustration of exactly the problem with these kinds of philosophical discussions: Mayo said “I noticed in some of Greenland’s work a presumption that a frequentist error statistician takes the model, and statistical assumptions of a method, as infallibly given. It’s easy to knock down such an absurd view. But where’s the evidence for this reading?” The evidence is in the applied literature. If you read health-science and medical journals, do a survey of how many times the authors say they checked their model to make sure their inference did not depend on the model they chose, beyond some obvious items like covariate terms. An example is the routine use of binomial-error logistic regression in epidemiology. Few papers outside of contagious-disease studies (including most subject-matter papers I’m on) report a check of, or even give a thought to, whether the binomial-logistic model is correct. And if you go to meetings and ask those on the study, you’ll find that failure to mention means failure to do it. Yet the packaged stats (frequentist or Bayesian) from this model assume iid Bernoulli variation within covariate levels (an extremely strong assumption that is violated when, as usual, important covariates are unmeasured); furthermore, the model is incapable of correctly fitting probabilities that are 0 or 1 (a concern in exposure-propensity modeling when some patients could never have received a treatment other than the one they got – so it is simply assumed under the rubric of “positivity” that such patients don’t exist).
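[To make the overdispersion point concrete, here is a minimal simulation (all numbers illustrative): within a single observed covariate stratum, an unmeasured binary factor shifts the outcome probability, and the counts then vary far more than the iid-Bernoulli (binomial) model assumes.

```python
import random
random.seed(1)

m = 50            # individuals per group (one observed covariate stratum)
n_groups = 4000   # replicate groups
# Unmeasured group-level factor U shifts the outcome probability:
probs = [0.2, 0.6]

counts = []
for _ in range(n_groups):
    p = random.choice(probs)          # U is unobserved by the analyst
    counts.append(sum(random.random() < p for _ in range(m)))

mean = sum(counts) / n_groups
var = sum((c - mean) ** 2 for c in counts) / n_groups
p_bar = mean / m
binom_var = m * p_bar * (1 - p_bar)   # variance the iid-Bernoulli model assumes

# Ratio well above 1: extra-binomial (overdispersed) variation
print(round(var / binom_var, 2))
```

The binomial model sees only the marginal rate p̄ and so badly understates the variance; the analytic variance here is E[mp(1−p)] + m²·Var(p), several times the binomial value, which is the sense in which the iid assumption is “extremely strong.”]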

As for the comment by the ghost of J. Neyman: The confusion of “acceptance of the null” with “failure to reject the null” is no different than confusing “acceptance of god” with agnosticism, or confusing the abstract statements “H” and “H or not-H”. The difference is simple and profound, logically and epistemologically.

A major problem shared by both statistical philosophy and philosophy of statistics is a severe deficiency of agnosticism, fueled by taking theoretical arguments as reasons for accepting or rejecting a methodology. The deep irony I see in philosophical commentaries on statistics (which may explain why most applied statisticians ignore them) is that they tend to be almost data-free. They purport to prescribe what is good for data analysis based on no data from designed studies, and often based on no in-depth experience with scientific research topics. When done by philosophers, they may not even be based on extensive and careful reading of studies representative of most sciences, the kind that don’t get press releases (of course, physics gets massive attention; but ask the relevance of tests of relativity and quantum mechanics to questions like what is your best bet for a diet to optimize your long-term health).

Where are randomized trials of the effects on subsequent research of teaching frequentist vs. Bayesian vs. multiple tools to students? Where are randomized trials of the effects of using frequentist vs. Bayesian vs. multiple tools on the validity of the reporting done by researchers? Where are randomized trials of the effects of seeing frequentist vs. Bayesian vs. multiple outputs on the validity of inferences made by readers of these reports? If one took a genuinely scientific approach to these questions, one would have to do large, long-term trials like these, and repeat them for different fields, since there would be little basis for generalization (e.g., contrast industrial experiments to epidemiology). I expect there are some comparative studies in education or psychology, and certainly there are studies in cognitive psychology bearing on the questions, but I’ll wager the totality of evidence provides no basis for the kind of strong assertions I see in foundational exchanges. Certainly nothing I’ve seen cited in this blog provides such a basis, and if you consider what it would take to mount relevant trials it is easy to see why.

Instead of direct effectiveness studies, we get by with (a) toy examples, often devoid of any context and sometimes devoid even of numbers (no seasoned epidemiologist would give a damn if a method may break down when relative risks are 100 instead of 1 or 10); (b) real examples so extreme that they have little or no generalizability (we know how sex ratios ought to run; how about some tough examples where anyone who claims to know the scientific answers regarding most effects is a charlatan or worse, as in nutritional epidemiology?); or (c) cherry-picked sets of examples presented as if they were representative of an entire field instead of the extreme if perhaps interesting special cases they are (e.g., Young and Karr, Significance Sept. 2011). Examples along the lines of (a)-(c) are useful for teaching and conceptual illustration, but a steady diet of only these kinds of examples appears to degrade thinking by making one feel they have insight into everyday practice when instead one has developed a completely distorted concept of general, practical importance.

For the above reasons, my clues about what works for practice come from uncommitted practitioners, not statistical theorists or philosophers. An exception who spent time as all three: I.J. Good, who made clear that the frequentist-Bayes dichotomy is a gross oversimplification for practice, at best on a par with classifying politicians as conservative or liberal, or as religious or secular. Among the “ecumenical” albeit more Bayesian-oriented (but not exclusively so) statisticians I’ve praised as voices of reason, George Box and Ed Leamer enjoyed wide recognition in science and technology (Box acclaimed in industrial-engineering statistics, Leamer in financial forecasting). D.R. Cox is the one nominal frequentist in my list; his frequentism is replete with cautions against prevalent misinterpretations, and he had no fear of dualist interpretations – e.g., the last sentence of his sadly neglected 1982 paper on significance tests (Br J Clin Pharmacol; 14:325–331) is “in problems in which the direction of an effect is under study and the external information is relatively weak, the one-sided P value is approximately the (Bayesian) probability that the true effect has the opposite sign to the effect in the data and the two approaches are largely reconciled.” The comments in this blog against teaching such interpretations strike me as of the form “we don’t want to confuse people with an inconvenient truth,” that being there is more than one way to look at these commonly offered and usually misinterpreted statistics, much as it pains certain purists (starting with Fisher) to admit this.
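[The Cox reconciliation can be checked exactly in the simplest case: for a normal estimate x of a mean θ with known standard error σ and a flat (improper) prior on θ, the posterior probability that θ has the opposite sign to x equals the one-sided p-value. A sketch; the values x = 1.8 and σ = 1 are illustrative:

```python
from math import erf, sqrt

def Phi(z):
    """Standard normal CDF."""
    return 0.5 * (1 + erf(z / sqrt(2)))

x, sigma = 1.8, 1.0   # observed estimate and its standard error (illustrative)

# One-sided p-value for H0: theta <= 0, computed at the boundary theta = 0:
p_one_sided = 1 - Phi(x / sigma)

# Posterior P(theta < 0 | x) under a flat (improper) prior on theta:
# the posterior is N(x, sigma^2), so its tail below 0 is Phi(-x/sigma).
post_opposite_sign = Phi(-x / sigma)

# The two quantities coincide exactly in this model:
assert abs(p_one_sided - post_opposite_sign) < 1e-12
print(round(p_one_sided, 4))  # -> 0.0359
```

With more informative priors the two numbers drift apart, which is Cox’s qualification that the external information must be “relatively weak.”]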

In summary, I find philosophical and theoretical arguments to be the weakest evidence regarding the effectiveness of a scientific methodology, in the same way as mechanistic arguments are the weakest evidence regarding the effectiveness of a treatment. No amount of elaboration can improve the status of a nonempirical argument, even if the argument fills books on end. Just as the FDA demands studies that compare treatments on human subjects to approve new treatments, so scientists should demand studies that compare methodologies in real subject-matter contexts before buying into methodological claims (and as much as simulations may provide quantitative insight into qualitative claims, they won’t suffice in either situation).

I’ll close by repeating what I said a few replies back (with some grammar edits), since it seems to have been ignored but is relevant to my main point here: “… I recognize that no set of tools should be recommended without empirical evaluation, and that accepted tools should be subject to severe tests indefinitely, just as tests of general relativity continue to this day. Conversely, no one should reject a set of tools without empirical evaluation, as long as the tools have their own theoretical justification that is consistent both internally (free of contradictions) and externally (is not absolutely contradicted by data), and preferably consistent with accepted theory (a plausible theory that consistently predicts accepted facts). To reject tools or facts on philosophical grounds beyond these basics strikes me as unscientific, just as is refusing to question one’s accepted tools and facts.”

Can anyone explain what Professor Greenland means in alleging: “The confusion of ‘acceptance of the null’ with ‘failure to reject the null’ is no different than confusing ‘acceptance of god’ with agnosticism”? I don’t identify the two; I have many times decried such an identification as extremely dangerous. I say the former is simply an abbreviation of the longish assertion in the latter. (Perhaps we should have said the null is “acceptable” at the given level of precision and size.)

The next step demands a detectable-discrepancy-size analysis to ascertain the meaning of the latter claims. This proceeds in terms of discrepancies that would have been detected with high power. As for “confusing the abstract statements ‘H’ and ‘H or not-H’”, I am utterly in the dark as to where I supposedly identify an empirical hypothesis H with a tautologous proposition that contains no empirical information.
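[For a one-sided z test the kind of detectable-discrepancy calculation referred to here can be sketched as follows (the choices n = 100, σ = 1, α = 0.05, and target power 0.9 are illustrative): the discrepancy detectable with power 1 − β is (z_α + z_β)·σ/√n.

```python
from statistics import NormalDist

N01 = NormalDist()  # standard normal

def power(delta, n, sigma=1.0, alpha=0.05):
    """Power of the one-sided z test of H0: mu <= 0 against mu = delta,
    with known sigma and sample size n."""
    z_alpha = N01.inv_cdf(1 - alpha)
    return 1 - N01.cdf(z_alpha - delta * (n ** 0.5) / sigma)

def detectable_discrepancy(n, sigma=1.0, alpha=0.05, target_power=0.9):
    """Smallest discrepancy the test would detect with the target power:
    (z_alpha + z_beta) * sigma / sqrt(n)."""
    z_alpha = N01.inv_cdf(1 - alpha)
    z_beta = N01.inv_cdf(target_power)
    return (z_alpha + z_beta) * sigma / (n ** 0.5)

delta = detectable_discrepancy(n=100)
print(round(delta, 3), round(power(delta, 100), 2))  # -> 0.293 0.9
```

On this reading, a non-rejection licenses at most “discrepancies as large as δ would very probably have been detected, so there is evidence they are absent,” never the bare “accept H0.”]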

Sander: Thanks for your lengthy comments; I can just take up two main themes. Here’s the first:

“Where are randomized trials of the effects on subsequent research of teaching frequentist vs. Bayesian vs. multiple tools to students?”

I would love to see such a study, provided that each methodology was taught with a genuine appreciation and understanding of the overarching philosophy of statistical inference in which they get their meaning and rationale.

Still, I don’t really think such a study is plausible. Besides, we have evidence right now of what is taking place. I hate to say this, but I think a lot of the problems we’re seeing stem from teaching frequentist methods in a disdainful fashion, declaring that everyone knows that what we really want are posterior belief assessments or the like. The textbooks themselves (many of which I have recently surveyed) show this. Some texts for methodology courses in the social sciences are even more cringe-worthy. Leaders in their fields have told me frankly that they have been taught to view statistical analyses as largely window dressing for purposes of publication. No wonder, then, that this is what is learned.

Statistical science, in my view, is continuous with scientific inquiry more generally: they enter when signal and noise are not well distinguishable through informal means. It is bizarre to suppose that in the rest of science strong arguments from coincidence enable severe tests to teach about the world, but where formal statistical methods enter, suddenly the goal is learning about our beliefs, and expressions of beliefs. I began this blog with a non-statistical example on purpose: finding out about prions and mad cow disease.

I want to try to understand what you’re saying in (c): “cherry-picked sets of examples presented as if they were representative of an entire field instead of the extreme if perhaps interesting special cases they are (e.g., Young and Karr, Significance Sept. 2011). … a steady diet of only these kinds of examples appears to degrade thinking by making one feel they have insight into everyday practice…” Are you saying a steady diet of these critical examples may be degrading thinking? I am sympathetic, but want to understand.

Sander: Here’s the second part of my comments to your last:

2. To me, “getting philosophical” about statistical inference is not articulating rarified concepts divorced from statistical practice, but providing tools with which to get clear about concepts, and to avoid obfuscations that are regularly bandied about. At the very least, readers of this blog should be empowered to uncover some of the hidden presuppositions that too often remain conveniently under wraps. Most practitioners will say that they cannot be bothered with philosophical matters, and this is as it should be for relatively uncontroversial methodologies in science; however, one finds this most vociferously claimed by practitioners in fields rife with problems of how to interpret or justify methods. We see this to be the case in your arena, even in the comments to this blog. My remarks are all based on actual and relevant reasons; to dismiss them as “philosophical” is just a way to duck them. A cop-out.

It may be thought that the basic statistical concepts are well understood, as are their weaknesses and strengths. But (to my surprise, if not alarm, at times) this simply is not true! (And for the trivial and unproblematic Neyman point, “accept h” abbreviates “no evidence against h” in relation to a test with various properties; read my Neyman nurseries for what Neyman thought on this.)

The current debates are so often colored by the biases of the debunkers—philosophical, institutional, political—that it is no wonder people become skeptical of the whole business. I do not mean (or expect) to derail a huge cottage industry in reporting on heartbreaking stories of P-value abuse. (Personally I think we should ban further repetitions of the same howlers discussed on this blog.) How many times must we hear such original ideas as the following? P-values and confidence levels are invariably interpreted as posterior probabilities of hypotheses about parameters, so let’s find a way to interpret them as posterior probabilities (search this blog for examples).

Enough! Enough!

Mayo, I have read through some of Greenland’s comments, and frankly I am not sure what he is trying to do. It seems that he does something that I have seen other Bayesians do: criticize the use of p-values because of the potential for bias in the study. Bayesian analyses, irrespective of how priors are obtained, will suffer from the same distortion, but potentially worse, because the bias now affects the likelihood ratio, and thus the posterior, directly. The suggestion that we should teach p-values and CIs as posteriors because there are some priors that would make the interpretation correct (if I got the suggestion correctly) seems downright absurd.

I have to say that Sander’s olive branch looks like poison ivy to me.

Klemens: I don’t know this character, and I can’t check just now (traveling). I don’t mind ghost names so long as the comments are serious, and yours are. I don’t know if he intends his approach to be an olive branch or not. As you know, if you follow this blog, I have made a similar point, and even suggested it stems from a semantic issue: when a claim is conjectured, as opposed to posited as error-free, he construes it as being assigned a posterior of some sort, even if it doesn’t have the properties of a formal probability. I don’t know if this is so, but conceivably when there’s high severity for some directional claim C, there’s good evidence for C, so calling it plausible/probable doesn’t get you into much trouble (given the reasonable agreement on numbers with one-sided tests)–so long as probability computations are avoided. This doesn’t address your point about bias.

I have returned from holiday to make some reply to the comments on my post of June 30. Before launching into a long and perhaps tedious general commentary on the blog exchange, I do want to respond to Mayo’s question: “Are you saying a steady diet of these critical examples may be degrading thinking?” Yes I am, and I regard the Young & Karr article and citations to it as providing a classic example of degradation. Critical examples are important, but the manner in which the examples were selected and evaluated must be spelled out, lest they be mistaken for being representative when in fact they are misleading relative to the literature at large. For more detailed explanation in the Young-Karr case please see my response to Larry Wasserman’s blog post from last year mentioning Young & Karr favorably:

http://normaldeviate.wordpress.com/2012/06/18/48/

It’s a beautiful example of how a brilliant mind (Wasserman) is lulled into passive acceptance of an ostensibly empirical and fun-to-believe claim by Young & Karr that dissolves under close scrutiny when one critically examines their data, their empirical methodology, and – as importantly – the relevant literature that they pointedly do not mention.

Speaking of empirical methodology and data, my delay in replying is largely because the level of argumentation I see on the present thread is nothing I recognize as scientific, and thus I have become discouraged about what can come of it. A few colleagues (working scientists, not philosophers or theorists) who at my request (as a reality check) read this thread have given similarly dismal views, especially in light of nonresponse to the numbered list of requests for evidence, comment and explanation that I gave the week before last.

The core problem I have lamented repeatedly on this blog remains: lack of any experiments, or well-controlled observational comparisons, or even case studies in science (not philosophy or theoretical statistics) – which is to say, lack of citation of scientific empirical evidence – to substantiate the strong opinions given by respondents here. In its place there seems to be a retreat to ideological commitment, devoid of empirical base. Case in point here: Mayo’s comment that Neyman’s widely promulgated misinterpretation of his own methods has been “trivial and unproblematic”. I find this a strikingly offhand dismissal of scores of articles and books stretching back some 70 years that disagree vehemently, and which in doing so offer many real examples of how researchers and sometimes entire literatures were led far astray in interpreting their own studies thanks to following Neyman’s lead (once again see Modern Epidemiology 3rd ed. Ch. 10 and Greenland & Poole 2011, 2013, Greenland 2011, 2012 for examples and citations to many more examples from others). Although, to give Neyman his due, he did warn against some of the most egregious abuses, to no avail that I can see.

The comment from von Metternich is even more discouraging, for simply not understanding what I have written here, as well as not going to the literature I cite (which appears in science and applied-stat journals, not in philosophy or math stat). The 2013 Greenland and Poole articles I cite describe a Bayesian bounding view of P-values that goes far beyond the description he or Mayo gives, which I outlined in an earlier post. Even had he got our argument right (and he didn’t), that an idea strikes von Metternich or anyone as absurd is no evidence and not even a logical argument: witness Kelvin regarding the idea of heavier-than-air flying machines, and Jeffreys regarding continental drift as impossible; their vehemence and previous accomplishments meant nothing at all. The same can be said of the vehemence of those opposed to Bayesian or frequentist methods.

As evidence of the failure to read what I actually wrote: I pointed out in my comments on this blog (as well as in the articles I cited) that traditional Bayesian computations are subject to bias in the same way as traditional frequentist computations, because outside of tightly controlled experiments, both depend on highly misleading probability distributions for data. This fact is a major reason why some smart researchers disdain formal statistics as window dressing. But the utility of modern computational statistics (whether frequentist or Bayesian) is that it can go beyond traditional distributional models to address these biases. In this blog I have also cited detailed approaches to bias problems, including a dualist account of mine (Greenland Stat Sci 2009) which in turn cites both frequentist (Vansteelandt et al., 2006) and Bayesian (Greenland and Gustafson, separately and jointly 2001-2009) approaches. If anyone here is interested in contributing constructively to the debate about how to handle bias problems, they will need to read this literature and to support their proposed alternative methodology with empirical evidence.

Following scientific (apparently, as opposed to philosophical) objectives to improve practice rather than sharpen rhetoric, I have here and earlier cited evidence of how these bias-analysis methods work in interpreting real data from real studies of real scientific controversies, and how prior distributions are an inescapable component of such analyses (whether hidden as in frequentist approaches or laid on the table as in Bayesian approaches). There are whole books on the topic (I reviewed one of the latest in JASA 2010). Inflamed rhetoric such as “cop out” and “poison ivy” is not a fitting substitute for citations of case studies in real scientific topics that examine these methods and the role of Bayesian thinking and computation in them. Mayo adds that her “remarks are all based on actual and relevant reasons,” but as I said before, even moderated, carefully delineated reasons are no substitute for empirical evidence, and are an inadequate response to case studies like those I have cited here and the many others cited elsewhere, including in the citations I gave.

As an example, one can offer as many reasons as one wants as to why residential electromagnetic fields can or cannot affect the risk of childhood leukemia, and there have been reasons offered vehemently on both sides. But scientifically, the issue calls for actually looking to see if risks appear elevated among those with elevated exposure, as was done despite the naysaying, followed by critical examination of bias explanations for the resulting observations (which happened to be nastily persistent associations, sometimes reported as null simply because P>0.05). There are those who attempt to dismiss all this epidemiology based on a priori or theoretical arguments (rather than adopting an agnostic stance and looking directly at the observations). I only note that taking such argumentation as solid evidence is associated with some of the greatest embarrassments in the history of science, e.g., that the age of the earth can’t be more than around 10^8 years, that continents can’t drift, that genes can’t jump, and that diets high in refined carbohydrate can’t cause diabetes. And examples continue to emerge, e.g., most recently that Lamarckian inheritance can’t occur (it now appears to happen, via epigenetics).

The arguments I have been getting back on this blog in response to my citation of and request for empirical evidence strike me as falling firmly in the a priorist category, with no searchable citation of supporting observations. Maybe someone will appreciate the supreme irony in the present debate: That those attacking Bayesian perspectives keep falling back on nonempirical prior arguments whose justification never seems to touch down on real data and real scientific debates. Yet there are a fair number of relevant studies in the cognitive psychology of probability and statistics stretching back many decades, including entire collections by Kahneman and Tversky, Gigerenzer, and Gilovich et al., which I have cited. None of them provide any evidence that an exclusionary approach to inference is productive; on the contrary, their results show that training in how to distinguish and use both Pr(data|H) and Pr(H|data) is essential to avoid common traps of thinking.

Now, these studies have been mostly limited to narrow settings involving very simple and often artificial questions, rather than full research controversies, in part because the investigators could not evaluate effectiveness and defects of strategies without knowing the correct answers. But this literature does demonstrate that even trained scientists are prey to a long list of cognitive biases. Most notable for statistics are various forms of overconfidence, prosecutor-type fallacies, double-counting and so on, which can be ameliorated by understanding and distinguishing both frequentist and Bayesian calculations. Gigerenzer’s experiments further show that both calculations are far better understood in the context of actual data counts than in terms of abstract probability statements; I interpret his findings as showing a primal dependence of valid scientific inference on empirical frequencies, rather than the counterfactual events and nonexistent long-run distributions embodied in conventional frequentism.
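The count-format point can be illustrated with the classic screening example (the prevalence, sensitivity, and false-positive rate below are hypothetical, chosen only to make the arithmetic transparent): the same Bayes calculation that confuses people when stated as conditional probabilities becomes nearly self-evident when phrased as frequencies out of 1,000 cases.

```python
# Natural-frequency framing: reason with counts, not abstract probabilities.
# Hypothetical screening numbers: 1% prevalence, 90% sensitivity,
# roughly 9% false-positive rate among the healthy.
population = 1000
diseased = 10                    # 1% of 1,000 have the condition
healthy = population - diseased  # 990 do not

true_positives = 9    # 90% of the 10 diseased test positive
false_positives = 89  # ~9% of the 990 healthy test positive (as a count)

# Of everyone who tests positive, how many actually have the disease?
positives = true_positives + false_positives
print(f"{true_positives} of {positives} positives are diseased "
      f"({true_positives / positives:.0%})")  # prints 9%
```

Stated as "9 of 98 positives are diseased," the answer is obvious; stated as "what is Pr(disease | positive) given sensitivity 0.90 and prevalence 0.01," the same question routinely elicits answers near 90%.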

As Mayo notes, I am not offering anyone an olive branch, and my embracing of multiple methodologies should not be construed as some kind of naïve ecumenism. On the contrary, I see myself as aggressively attacking narrow mindedness. To borrow from Haack, as a militant agnostic I see multiplicity (not just dualism) as the only methodology I can accept given the empirical evidence and its profound limits. To anyone who attempts to amputate thinking, I am demanding empirical evidence that such amputation is beneficial to scientific research, and offering evidence that the amputation encouraged on this blog is harmful.

Again, no one here or elsewhere has offered me empirical evidence contradicting my position. Again, a priori argument and claims that such evidence exists are not acceptable substitutes for bona fide citations to such evidence. Supposedly, science left a priorism behind with Aristotelian physics when it entered its empirically based golden age at the dawn of the 1600s, and since then supposedly we all agree on the importance of experimental evidence. Where is that evidence for the exclusionary sentiments I see on this blog? That some Bayesians were or even remain as exclusionary in the opposing direction is no excuse, any more than Stalin’s mass murder excuses Hitler’s.

Between my first and rigidly Neymanian introduction to statistics (under Neyman himself and his first generation of trainees) and today, applied statistics (statistical science, as opposed to philosophical debates) has moved past a highly counterproductive frequentist vs. Bayes schism to a far more subtle and broad view of inference. The depth of the schism in the mid-20th century appears to me more a product of the authoritarian, dogmatic personalities that gave birth to “modern” statistics than of any genuine scientific conflict: For each of Neyman, Fisher, and others, theirs was the right way and that was the end of the matter. Their certainty about their methods was a product of their time, when authoritarian governments led by merciless autocrats (whether fascist or Marxist/Stalinist competitors) were being taken seriously, even by intellectuals, as alternatives to ostensibly failing democracies. I for one am pleased that I have witnessed statistics (and more generally inferential methodology) break out of authoritarian strangleholds, into more open, agnostic, and at times even anarchic experimentation with syncretisms and new methodologies. I hope that this is not a mere cycle and that statistics does not sink back into rigid ideology, immune to evidence-based scientific inquiry and policy as some of our governments have become.

The history of statistics, the empirical evidence regarding its practice, and the practical experiences recounted by those I respect as statistical scientists leave me aghast at the hostility expressed on this blog toward Bayesian calculations. So far, this hostility has been given no empirical justification that I can find anywhere on this blog or in what I have seen of Mayo’s writings. When pointed to a place where such support is claimed to be available, all I find is more a priori argument, often tinged with more rhetoric. I have come to expect such ideologically driven exclusivism in what passes for “policy debates” in the American political arena.

Perhaps this blog and debate merely reflect the larger culture in becoming a dialog of the deaf (without the benefit of sign language). On that more emotional lament, I will close with the following even more blunt remarks: My experience here suggests to me a hypothesis about the sense of persecution expressed in the phrase “frequentists in exile.” To me as a statistical scientist, the phrase is bizarre, given that frequentist methods still dominate almost every stat textbook below the graduate level and almost every research article I see, dominate many grad texts as well, and are still the majority of new methods promoted in statistical methods journals (where I have served in editorial and reviewer capacities for many decades, and never noticed a shred of prejudice against new frequentist methodology). So, I think, perhaps the sense of persecution arises from resentment at being left behind by the emerging syncretism. This is a familiar story in the history of science: there are always those who could not surrender the dogmas they adopted in their youth when their field moved on to a new paradigm, one with drifting continents, jumping genes, and now (as de Finetti foretold) quantum Bayesianism. [I am reminded of Planck’s other law, that science progresses from funeral to funeral.]

Lest the above seem too negative, I do think the ongoing challenges of real data will keep statistics open, regardless of fashion swings in philosophical arenas. The closest analogy that I can think of for the emergent frequentist-Bayes fusion is the development of the wave-particle duality in quantum mechanics, which replaced the old schism between the corpuscular and wave theories of light. So it is that we now have a frequentist-Bayesian duality for probability and statistical inference, and to restrict ourselves to anything less is retrograde and crippling. There has been much written on this fusion over the past few decades, largely traceable to Good and Box as I have cited before, and further developed by Hill, Rubin and others since. And, regarding this blog, what I have witnessed here has been ample food for thought: I have responded with only about a third of what I have formulated in reply. But I suspect those who have made it this far will at least agree that I have said enough for now.

To: Sander Greenland: You say, “no one here or elsewhere has offered me empirical evidence contradicting my position”; and I acknowledge your demand for empirical evidence. In fact, I did my Bachelor’s degree in Statistics and Computer Science, with those two fields constantly stressing empirical research as the way to provide evidence for a theory or phenomenon. On the other hand, the basis for *any* methodology (whether Bayesian or frequentist) is *not* scientific in the way you want it to be; rather, the basis for any (Statistical) methodology *is Philosophical*! Indeed, I have first-hand experience in grappling with Philosophical issues that are at the basis of forming various methodologies, which is what my (current) Master’s degree is essentially concentrating on. Nonetheless, I truly think that theory and practice should be more closely related, and will try to highlight that idea throughout my thesis.

In closing, I urge you to think in the following manner: *If* we [researchers] are oblivious to – or do not take time to grapple with – Philosophical questions about concepts used in Statistics (and applications of Statistics) *then* researchers will *never* have a *proper* understanding of the Statistical methods and techniques they are using; and this lack of understanding translates into the subpar teaching we currently see/witness (or even have experienced in the past)! Hence, I repeat – for emphasis – that Philosophy (and *not* necessarily empirical evidence!) forms the basis for any Statistical methodology, which is why empirical evidence “contradicting [your] position” is not readily available: appreciating the role Philosophy (of Statistics) has in understanding the Statistical methods/techniques we regularly use is *the key* to resolving many of the disputes/debates that continue to exist in Statistical practice!

Professor Greenland. There appears to be a confusion here. I thought the issue was the abbreviation Egon and I adopted (“accept H” as a shorthand for “the data do not warrant rejecting H (at this level)”). Is this the “widely promulgated misinterpretation” of my own methods discussed in “scores of articles and books stretching back now some 70 years”? I don’t think so. I’m well aware that hypothesis tests have been criticized by many, but I hope it is not because of the use of an abbreviation. Egon and I have had, on several occasions, the chance to illuminate interested persons on our understanding of the interpretations of test results.

On this blog, a paper of mine, pp. 290-1, is but one example, but because I am there responding to certain accusations of R.A. Fisher, I mostly illustrate that, in practice, he shared my meaning. https://errorstatistics.com/2012/02/11/jerzy-neyman-note-on-an-article-by-sir-ronald-fisher/

Egon, as is well known, has done far more than I to clarify the evidential interpretations of tests. See on this blog:

Egon’s response to Fisher, https://errorstatistics.com/2012/02/12/pearson-statistical-concepts-in-their-relation-to-reality/

as well as:

https://errorstatistics.com/2012/08/16/e-s-pearsons-statistical-philosophy/

Sander: I really appreciate the care, interest and comprehensiveness of your comments to this blog; it is rare that I can start new posts with just comments alone, and in your case, I could have started many more than this one. (A post on the power issue will show up later.) I’ve never regarded the blog as anything like the best place to work out detailed exchanges, so I’m very grateful when someone with so significant a background in the area is willing to react and reflect on matters in an informative and serious way. As much as I’m grateful for your analyses—statistical and psychological—I wouldn’t take quite so seriously “the sense of persecution expressed in the phrase ‘frequentists in exile.’” We exiles don’t. Recall the “about” page indicates “this may now be changing” https://errorstatistics.com/about-2/. What you call the “emerging syncretism” is one of the key reasons I say this. I appreciate the error statistical leanings in such thoughtful meetings-of-the-mind as represented by Gelman and Shalizi (2012):

“Implicit in the best Bayesian practice is a stance that has much in common with the error-statistical approach of Mayo (1996), despite the latter’s frequentist orientation. Indeed, crucial parts of Bayesian data analysis, such as model checking, can be understood as ‘error probes’ in Mayo’s sense” (Gelman and Shalizi, p. 3).

If you read my recent papers, you will find other examples of why I suggest that error statistical ideas are being disinterred or discovered anew.

I am all for multiplicity—and Neyman and Pearson were both driven by a staunch rejection of the idea that there was a single overarching account of rational inference (the quote may be found by searching Neyman on this blog). That’s what they fought against.

Wave-particle duality is excellent, but why must it always come out as a particle? (e.g., your interpreting p-values as posteriors instead of letting them be p-values). In the book I am writing, I think there is a genuine fusion that grows from breaking out of this rut. So that is a promissory note.

I think the most constructive point, for short-term purposes, stems from your answer to Mayo’s question: “Are you saying a steady diet of these critical examples may be degrading thinking?” I too have concerns about some of the metacriticisms, also expressed in a comment on Normal Deviate’s blog:

http://normaldeviate.wordpress.com/2013/04/27/the-perils-of-hypothesis-testing-again/#comment-8491

I don’t know any of the details of the data behind the paper you raised doubts about. Thanks again for your illuminating comments.