I predicted that the degree of agreement behind the ASA’s “6 principles” on p-values, partial as it was, was unlikely to be replicated when it came to most of the “other approaches” with which some would supplement or replace significance tests, notably Bayesian updating, Bayes factors, or likelihood ratios (confidence intervals are dual to hypothesis tests). [My commentary is here.] So now they may be advising a “hold off” or “go slow” approach until some consilience is achieved. Is that it? There’s word that the ASA will hold a meeting where the other approaches are put through their paces. I don’t know when. I was tweeted an article about the background chatter taking place behind the scenes; I wasn’t one of the people interviewed for this. Here are some excerpts; I may add more later after it has had time to sink in.
“Restoring Credibility in Statistical Science: Proceed with Caution Until a Balanced Critique Is In”
“[A]ll of the other approaches*, as well as most statistical tools, may suffer from many of the same problems as the p-values do. What level of likelihood ratio in favor of the research hypothesis will be acceptable to the journal? Should scientific discoveries be based on whether posterior odds pass a specific threshold (P3)? Does either measure the size of an effect (P5)?…How can we decide about the sample size needed for a clinical trial—however analyzed—if we do not set a specific bright-line decision rule? 95% confidence intervals or credence intervals…offer no protection against selection when only those that do not cover 0, are selected into the abstract (P4).” (Benjamini, ASA commentary, pp. 3-4)
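The selection point about intervals can be illustrated with a small simulation. This is only a sketch under assumptions that are ours, not Benjamini's: every study estimates the same true effect (0.5) with a known standard error (1.0), and only intervals that exclude 0 are "reported". With a flat prior the 95% credible interval would coincide numerically with the confidence interval used here.

```python
import random

def simulate_selected_coverage(true_effect=0.5, se=1.0,
                               n_studies=100_000, seed=1):
    """Compare coverage of all 95% intervals with coverage of the
    subset that excludes 0 (the ones 'selected into the abstract')."""
    rng = random.Random(seed)
    z = 1.96  # two-sided 95% normal quantile
    covered_all = 0
    selected = 0
    covered_selected = 0
    for _ in range(n_studies):
        estimate = rng.gauss(true_effect, se)
        lo, hi = estimate - z * se, estimate + z * se
        covers = lo <= true_effect <= hi
        excludes_zero = lo > 0 or hi < 0
        covered_all += covers
        if excludes_zero:
            selected += 1
            covered_selected += covers
    return covered_all / n_studies, covered_selected / selected

overall, among_selected = simulate_selected_coverage()
print(f"coverage, all intervals:      {overall:.3f}")
print(f"coverage, selected intervals: {among_selected:.3f}")
```

Under these assumed settings the unselected intervals cover the true effect at close to the nominal rate, while the selected ones cover it far less often.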
What’s sauce for the goose is sauce for the gander, right? Many statisticians seconded George Cobb who urged “the board to set aside time at least once every year to consider the potential value of similar statements” to the recent ASA p-value report. Disappointingly, a preliminary survey of leaders in statistics, many from the original p-value group, aired striking disagreements on best and worst practices with respect to these other approaches. The Executive Board is contemplating a variety of recommendations, minimally, that practitioners move with caution until they can put forward at least a few agreed-upon principles for interpreting and applying Bayesian inference methods. The words we heard ranged from “go slow” to “moratorium” [emphasis mine]. Having been privy to some of the results of this survey, we at Stat Report Watch decided to contact some of the individuals involved.
Background information via priors. Donald Berry declared that “standard Bayesian data-analytic measures have the same fundamental limitation as p-values” for capturing background information, adding that: “subjective Bayesian approaches have some hope, but exhibiting a full likelihood function for non-quantifiable data [regarding context] may be difficult or impossible”.
Most of the practitioners in machine learning that we interviewed echoed the late Leo Breiman, that “incorporating domain knowledge into the structure of a statistical procedure is a much more immediate…approach than setting up the [full] Bayes’ machinery” (1997, p. 22).
In stark contrast, many Bayesians maintain, with Christian Robert, that “The importance of the prior distribution in a Bayesian statistical analysis is …that the use of a prior distribution is the best way to summarize the available information (or even the lack of information) about this parameter.” Yet at the same time, oddly enough, instead of using prior probability distributions to introduce background beliefs into the analysis, members of the most popular Bayesian school these days, alternatively called default, reference, or O-Bayesianism, are at pains to show their priors are as uninformative as possible! Here, prior probability distributions arise from a variety of formal considerations that in some sense give greatest weight to the data. There is a panoply of variations, all with their own problems; the definitive review is Kass & Wasserman, 1996. So there’s scarce agreement here.
O-Bayesian, Subjective Bayesian, Pseudo-Bayesian. Jim Berger, a leader of the O-Bayesian movement, told us: “The (arguably correct) view that science should embrace subjective statistics falls on deaf ears; they come to statistics in large part because they wish it to provide objective validation of their science”. Not only does using default priors short-circuit arduous elicitation; he thinks the use of conventional priors also combats a new disease that appears to be reaching epic proportions, which he dubs “pseudo-Bayesian” subjectivism.
Berger describes this as “a very adhoc version of objective Bayes, including use of constant priors, vague proper priors, choosing priors to ‘span’ the range of the likelihood, and choosing priors with tuning parameters that are adjusted until the answer ‘looks nice’.”
Perhaps they can agree on a principle prohibiting pseudo-Bayes, but the funny thing is, no one we interviewed gave the same definition as Berger!
We were pointed to a paper by Donald Fraser (2011), where he says only the frequentist prior is really “objective”, but in that case, it “can be viewed as being probability itself not Bayesian.”
Stephen Senn told us he’s irked to often find authors making claims such as “an advantage of the Bayesian approach is that the uncertainty in all parameter estimates is taken into account” while in practice, he finds none of the priors reflect what the authors believe, but are instead various reference or default priors. “… it is hard to see how prior distributions that do not incorporate what one believes can be adequate for the purpose of reflecting certainty and uncertainty.”
Andrew Gelman was frank with us: “If you could really express your uncertainty as a prior distribution, then you could just as well observe data and directly write your subjective posterior distribution, and there would be no need for statistical analysis at all.”
Stephen Senn also had a humorous twist: “As an exercise in mathematics [computing a posterior based on the client’s prior probabilities] is not superior to showing the client the data, eliciting a posterior distribution and then calculating the prior distribution; as an exercise in inference Bayesian updating does not appear to have greater claims than ‘downdating’.”
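Senn's “downdating” is easy to make concrete in a conjugate setting. The sketch below assumes a Beta-binomial model, which is our choice for illustration and not something Senn specifies: updating adds the observed counts to the Beta parameters, and downdating subtracts them from an elicited posterior to recover the prior it implies. The function names are hypothetical.

```python
def update(prior, successes, failures):
    """Bayesian updating: Beta(a, b) prior plus binomial data
    gives a Beta(a + successes, b + failures) posterior."""
    a, b = prior
    return (a + successes, b + failures)

def downdate(posterior, successes, failures):
    """'Downdating': recover the Beta prior implied by an elicited
    Beta posterior and the same binomial data."""
    a, b = posterior
    prior = (a - successes, b - failures)
    if min(prior) <= 0:
        # The elicited posterior is vaguer than the data alone allow.
        raise ValueError("no proper Beta prior matches this posterior")
    return prior

prior = (2, 3)                       # assumed prior, Beta(2, 3)
posterior = update(prior, 7, 3)      # 7 successes, 3 failures
recovered = downdate(posterior, 7, 3)
print(posterior, recovered)
```

As arithmetic the two directions are mirror images, which is Senn's point. The error branch marks the one asymmetry in this toy model: an elicited posterior can be incompatible with any proper prior.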
In psychology, a field that is leading the way with reforms to restore scientific credibility in the face of irreplication, old questions as to whether anything is really being measured are returning with new force.
According to Joel Michell, psychology uses methods based on “the central hypothesis (that psychological attributes are quantitative) …in the absence of supporting evidence and this fact is ignored because psychometricians remain ignorant about the concept of quantity; they accept a definition of measurement [by Stevens] …That is, psychometricians claim to know something that they do not know and have erected barriers preserving their ignorance. This is pathological science.”
A Bayesian wants everyone else to be non-Bayesian. That’s a section straight from Gelman (who says he is a Bayesian). “Bayesian inference proceeds by taking the likelihoods from different data sources and then combining them with a prior distribution (or, more generally, a hierarchical model). The likelihood is key. . . . No funny stuff, no posterior distributions, just the likelihood. . . . I don’t want everybody coming to me with their posterior distribution—I’d just have to divide away their prior distributions before getting to my own analysis. Sort of like a trial, where the judge wants to hear what everybody saw—not their individual inferences, but their raw data.”
We’d heard Gelman wasn’t a friend of the standard subjective Bayesian approach, but we were surprised to hear him say he thinks it “has harmed statistics” by encouraging the view that Bayesian statistics has to do with updating prior beliefs by Bayes’s Theorem. Sure, but wouldn’t that be correct? The consequence of this that really bothers Gelman is that it’s made Bayesians reluctant to test their models. “To them, any Bayesian model necessarily represented a subjective prior distribution and as such could never be tested. The idea of testing and p-values were held to be counter to the Bayesian philosophy.”
He directed us to a co-author, Cosma Shalizi, a statistician at CMU, who echoed Gelman in flatly declaring “most of the standard philosophy of Bayes is wrong”.
Gelman even shows sympathy for frequentists: “Frequentists just took subjective Bayesians at their word and quite naturally concluded that Bayesians had achieved the goal of coherence only by abandoning scientific objectivity. Every time a prominent Bayesian published an article on the unsoundness of p-values, this became confirming evidence of the hypothesis that Bayesian inference operated in a subjective zone bounded by the prior distribution.”
Possibly, the recent ASA p-value forum would fall under that rubric?
Problems of Interpretation. An important goal of the p-value principles was to highlight misinterpretations. What principles do the insiders put forward to guide Bayesian interpretation? Larry Wasserman thinks there are mistakes equally on both sides.
A common “complaint is [that] people naturally interpret p-values as posterior probabilities so we should use posterior probabilities. But that argument falls apart because we can easily make the reverse argument. Teach someone Bayesian methods and then ask them the following question: how often does your 95 percent Bayesian interval contain the true value? Inevitably they say: 95 percent. … In other words: people naturally interpret frequentist statements in a Bayesian way but they also naturally interpret Bayesian statements in a frequentist way.” If you ask people if it means 95 percent of the time the method gets it right, they will say yes, even though the frequentist coverage claim might no longer hold.
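To see how the coverage claim can fail, here is a rough simulation under assumptions of our own, not Wasserman's: a single observation x ~ N(θ, 1) and a N(0, 1) prior on θ. The 95 percent credible interval is computed from the conjugate posterior, and its long-run frequency of containing a fixed true θ is then estimated.

```python
import math
import random

def credible_interval(x, prior_sd=1.0, sigma=1.0, z=1.96):
    """95% posterior interval for theta, given x ~ N(theta, sigma^2)
    and a N(0, prior_sd^2) prior (conjugate normal model)."""
    w = prior_sd**2 / (prior_sd**2 + sigma**2)  # shrinkage weight
    post_mean = w * x
    post_sd = math.sqrt(w) * sigma
    return post_mean - z * post_sd, post_mean + z * post_sd

def coverage(theta, n_sims=50_000, seed=0):
    """Fraction of repeated samples whose credible interval
    contains the fixed true value theta."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        lo, hi = credible_interval(rng.gauss(theta, 1.0))
        hits += lo <= theta <= hi
    return hits / n_sims

for theta in (0.0, 1.0, 3.0):
    print(f"true theta = {theta}: coverage = {coverage(theta):.3f}")
```

In this assumed setup the interval over-covers when the true value sits at the prior mean and falls well below 95 percent when the truth lies far out in the prior's tail, so “95 percent” is not a property of the method across repetitions.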
Each of the statisticians we talked to had a different answer to the question, “what’s the meaning or function of a prior probability?” Answers ranged from “it’s how much you’d bet”, “a hunch or guess”, “a way to add background beliefs to the analysis”, “an invariant reference to avoid bringing beliefs into the analysis”, “a way to regularize messy data”, “a way to get posteriors that match frequentist error probabilities”, to “an undefined concept whose sole use is to compute a posterior probability”.
Bayes hacking and transparency. An important principle (4) in the ASA document on P-values emphasized the biases that arise in “conducting multiple analyses of the data and reporting only those” having results the researcher finds desirable. At least two of the people we interviewed expressed concern that the same selection effects could enable finagling Bayes factors and Bayes updating. The prior could well be changed based on the data, and the hypothesis selected to compare with H can be deliberately chosen to be less (or more) likely than H. Nevertheless, fewer than half the practitioners canvassed thought that a principle on “full reporting of selection effects” (4) could be justified for Bayes factors, likelihood ratios, and Bayes updating. Those who did recommended an adjunct “error control” requirement for Bayesian inference.
Larry Wasserman put it bluntly: “If the Bayes’ estimator has good frequency-error probabilities, then we might as well use the frequentist method. If it has bad frequency behavior then we shouldn’t use it.”
Those who didn’t, explained that the result would still be sound, because of “calibration” and “given what you believed”.
For instance, Jim Berger, even though he worries about pseudo-Bayes, told us that his favorite Bayesian conditional error probabilities “do not depend on the stopping rule” even if it allows stopping only when a credible interval erroneously excludes 0.
In their famous book, The Likelihood Principle (1988), Berger and Wolpert grant that the stopping rule ensures that the Bayes credible interval will erroneously exclude 0, yet they claim any “‘misleading’, however, is solely from a frequentist viewpoint, and will not be of concern” to a Bayesian, given what he believed to begin with (p. 81). Irrelevance of the stopping rule and selection effects follows from the Likelihood Principle, which follows from inference by Bayes’ Theorem, except that it too is violated in default or reference Bayesianism.
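The stopping-rule issue can be sketched in a few lines. The setup is an assumption made for illustration and is not Berger and Wolpert's own example: data are N(0, 1) so the true mean is 0, the prior is flat so the 95% credible interval is the sample mean ± 1.96/√n, and the trial either checks the interval after every observation and stops as soon as it excludes 0, or looks only once at the final sample size.

```python
import math
import random

def run_trial(rng, max_n, peek, z=1.96):
    """Return True if the 95% interval excludes 0 at a permitted look.
    peek=True: look after every observation and stop at the first exclusion.
    peek=False: look only once, at n = max_n."""
    total = 0.0
    for n in range(1, max_n + 1):
        total += rng.gauss(0.0, 1.0)  # the true mean is 0
        if peek or n == max_n:
            if abs(total / n) > z / math.sqrt(n):
                return True
    return False

def exclusion_rate(peek, trials=2000, max_n=500, seed=0):
    """Fraction of simulated trials that end with 0 excluded."""
    rng = random.Random(seed)
    return sum(run_trial(rng, max_n, peek) for _ in range(trials)) / trials

rate_peek = exclusion_rate(peek=True)
rate_fixed = exclusion_rate(peek=False)
print(f"try-and-try-again: {rate_peek:.2f}   fixed n: {rate_fixed:.2f}")
```

With a cap of 500 observations the optional-stopping rule excludes the true value 0 in a large fraction of trials, compared with roughly 5 percent for the fixed sample size; without a cap it would do so eventually with probability 1. The posterior computed at the stopping point is the same either way, which is exactly what is in dispute.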
Coherence. One would think that all Bayesians agree on at least one principle: coherence, in the sense of betting. But, no.
“Betting incoherency thus seems to be too strong a condition to apply to communication of information,” says Jim Berger. Even subjective Bayesianism is not coherent in practice, he says, “except for trivial versions such as always estimate θ ∈ (0, ∞) by 17.35426 (a coherent rule, but not one particularly attractive in practice)…In practice, subjective Bayesians will virtually always experience what could be called practical marginalization paradoxes” where posteriors don’t even sum to 1. But then it cannot be a posterior probability.
Bayes Factors. Most Bayesians appear to think the Bayes factor (essentially, the likelihood ratio of two hypotheses or models), is the cat’s meow. But Jose Bernardo, a leader in the “reference Bayesian” movement, avers that “Bayes factors have no direct foundational meaning to a Bayesian: only posterior probabilities have a proper Bayesian interpretation.” While a main contrast between frequentist tests and Bayes factors occurs with the Bayesian assignment of a “spiked” prior to a point null, Bernardo opposes this distinction with confidence intervals, with their smooth reference priors.
Many people we spoke to tout popular default Bayes factor programs because they enable support for a null hypothesis, unlike in a significance test. However, Uri Simonsohn told us that the default Bayes factor test is “prejudiced against small effects.”
He explained to us that it’s actually incorrect to claim support for the null hypothesis; it’s always just a comparison. “What they actually ought to write is ‘the data support the null more than they support one mathematically elegant alternative hypothesis I compared it to’.” The default Bayes factor test “means the Bayesian test ends up asking: ‘is the effect zero, or is it biggish?’ When the effect is neither, when it’s small, the Bayesian test ends up concluding (erroneously) it’s zero” with high probability!
“Saying a Bayesian test ‘supports the null’ in absolute terms seems as fallacious to me as interpreting the p-value as the probability that the null is false.”
At this point, Gelman reminds Simonsohn that “You can spend your entire life doing Bayesian inference without ever computing these Bayesian Factors”.
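Simonsohn's complaint about small effects can be reproduced with a closed-form Bayes factor. The sketch below is a simplification we are assuming for tractability: the alternative puts a N(0, 1) prior on the standardized effect, whereas the popular default programs use a Cauchy-type prior. It compares a point null with that “biggish” alternative when the observed mean equals a small effect of 0.1 from n = 100 unit-variance observations.

```python
import math

def normal_pdf(x, var):
    """Density at x of a normal distribution with mean 0 and variance var."""
    return math.exp(-0.5 * x * x / var) / math.sqrt(2 * math.pi * var)

def bf01(xbar, n, prior_sd=1.0):
    """Bayes factor for H0: effect = 0 against H1: effect ~ N(0, prior_sd^2),
    given the mean xbar of n observations with unit variance."""
    var_null = 1.0 / n               # sampling variance of xbar under H0
    var_alt = prior_sd**2 + 1.0 / n  # marginal variance of xbar under H1
    return normal_pdf(xbar, var_null) / normal_pdf(xbar, var_alt)

small = bf01(xbar=0.1, n=100)  # data sitting right on a small effect
large = bf01(xbar=0.5, n=100)  # data sitting on a larger effect
print(f"BF01 for observed mean 0.1: {small:.2f}")
print(f"BF01 for observed mean 0.5: {large:.2e}")
```

With these assumed inputs the Bayes factor favors the null by about 6 to 1 even though the data sit exactly on a nonzero effect, while the larger observed mean overwhelmingly favors the alternative. Changing `prior_sd` changes the answer, which is why the result is always a comparison with one particular alternative.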
What’s the forecast for this next step in regulating statistical meaning? The ASA p-value document took over a year; they envision this next step will take at least twice as long. Given the important community service this would provide for those seeking to understand the complexities of the popular Bayesian measures of evidence, it’s well worth it. In the meantime, officials are likely to advise practitioners to check any Bayesian answers against the frequentist solution for the problem and perhaps employ several different Bayesian approaches before putting trust in them. [To read the rest of the article see this link.]
*From the ASA P-value document.
Gelman and Shalizi’s joint article (2013) suggests that “implicit in the best Bayesian practice is a stance that has much in common with the error-statistical approach of Mayo (1996), despite the latter’s frequentist orientation. Indeed, crucial parts of Bayesian data analysis, such as model checking, can be understood as ‘error probes’ in Mayo’s sense”. So maybe the entire exploration will take us full circle and bring us back to error statistics, but a far more sophisticated and inclusive version. That’s all to the good, even if this new “hold off until we get back to you” creates a degree of confusion in the meantime.
This post was written jointly with Jean Miller.
Wasserstein and Lazar 2016, “ASA Statement on P-values”