S. McKinney: On Efron’s “Frequentist Accuracy of Bayesian Estimates” (Guest Post)


Steven McKinney, Ph.D.
Statistician
Molecular Oncology and Breast Cancer Program
British Columbia Cancer Research Centre

On Bradley Efron’s: “Frequentist Accuracy of Bayesian Estimates”

Bradley Efron has produced another fine set of results, yielding a valuable estimate of variability for a Bayesian estimate derived from a Markov Chain Monte Carlo algorithm, in his latest paper “Frequentist accuracy of Bayesian estimates” (J. R. Statist. Soc. B (2015) 77, Part 3, pp. 617–646). I give a general overview of Efron’s brilliance via his Introduction discussion (his words “in double quotes”).

“1. Introduction

The past two decades have witnessed a greatly increased use of Bayesian techniques in statistical applications. Objective Bayes methods, based on neutral or uninformative priors of the type pioneered by Jeffreys, dominate these applications, carried forward on a wave of popularity for Markov chain Monte Carlo (MCMC) algorithms. Good references include Ghosh (2011), Berger (2006) and Kass and Wasserman (1996).”

A nice concise summary, one that should bring joy to anyone interested in Bayesian methods after all the Bayesian-bashing of the middle 20th century. Efron himself has crafted many beautiful results in the Empirical Bayes arena. He has reviewed important differences between Bayesian and frequentist outcomes that point to some as-yet unsettled issues in statistical theory and philosophy, such as his scales-of-evidence work.

“Suppose then that, having observed data x from a known parametric family fμ(x), I wish to estimate t(μ), a parameter of particular interest. In the absence of relevant prior experience, I assign an uninformative prior π(μ), perhaps from the Jeffreys school. Applying Bayes rule yields θ̂, the posterior expectation of t(μ) given x:

θ̂ = E{t(μ) | x}.         (1.1)”

A simple setup, one that arises in many MCMC applications. This issue is certainly relevant in all of the bioinformatics algorithms that use MCMC methods to comb through mountains of genomic data these days, such as those being developed by the wunderkinds here at the British Columbia Cancer Research Centre and elsewhere. Efron’s gift is timely indeed.

“How accurate is θ̂? The obvious answer, and the one that is almost always employed, is to infer the accuracy of θ̂ according to the Bayes posterior distribution of t(μ) given x. This would obviously be correct if π(μ) were based on genuine past experience. It is not so obvious for uninformative priors. I might very well like θ̂ as a point estimate, based on considerations of convenience, coherence, smoothness, admissibility or aesthetic Bayesian preference, but not trust what is after all a self-selected choice of prior as determining θ̂’s accuracy. Berger (2006) made this point at the beginning of his section 4.”

Efron demonstrates his sensibilities here. He’s not dogmatic about any particular methodology and has a keen sense for and appreciation of aesthetics. That’s why I’m always excited when Efron produces new findings.

“As an alternative, this paper proposes computing the frequentist accuracy of θ̂, i.e. regardless of its Bayesian provenance, we consider θ̂ simply as a function of the data x and compute its frequentist variability.”

Efron adeptly dances between and within the Bayesian and frequentist ballrooms.

“Our main result, which is presented in Section 2, is a general accuracy formula for the delta method standard deviation of θ̂: general in the sense that it applies to all prior distributions, uninformative or not. Even in complicated situations the formula is computationally inexpensive: the same MCMC calculations that give the Bayes estimate θ̂ also provide its frequentist standard deviation. A lasso-type example is used for illustration. Many of the examples that follow use Jeffreys priors; this is only for simplified exposition and is not a limitation of the theory.”

This is the brilliance here. Efron has identified an extremely useful general accuracy formula, using results readily at hand, detached from priors and all the handwringing that goes on as people argue endlessly about whose prior is better than whose.

The discovery that the very MCMC results that produced the point estimate can also yield an estimate of its variability gives us extremely valuable information at no extra charge. MCMC algorithms do indeed produce useful point estimates, after much computation. So what were previous ways to obtain an estimate of variability? One could bootstrap the process (again, an amazing statistical methodology from the mind of Efron), but this would involve thousands of repeats of the already computationally intensive process. One could assess the variability from the spread of the posterior distribution, but this depends on the prior, and the handwringing begins.
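
To make the “no extra charge” point concrete, here is a minimal sketch, in Python, of the kind of calculation involved. It is my own illustration with hypothetical function and variable names; Efron’s freqacc code (linked below) is the authoritative implementation. As I read Section 2, the delta-method standard deviation of θ̂ is obtained by taking the posterior covariance between t(μ) and the gradient of log fμ(x) with respect to the data x, estimated from the very MCMC draws that produced θ̂, and combining it with an estimate V of the covariance matrix of x.

    import numpy as np

    def freq_sd_from_mcmc(t_draws, score_draws, V):
        """Delta-method frequentist sd of the Bayes estimate theta_hat = mean(t_draws).

        t_draws     : (B,)   posterior MCMC draws of t(mu)
        score_draws : (B, p) draws of the score grad_x log f_mu(x) at the observed x
        V           : (p, p) estimate of the covariance matrix of x
        """
        t_draws = np.asarray(t_draws, dtype=float)
        score_draws = np.asarray(score_draws, dtype=float)
        # Lemma 1 (my reading): the gradient of theta_hat with respect to x equals the
        # posterior covariance of t(mu) with the score, estimable from the same draws.
        cov = score_draws.T @ (t_draws - t_draws.mean()) / len(t_draws)
        # Theorem 1 (my reading): sd_hat = sqrt(cov' V cov)
        return float(np.sqrt(cov @ V @ cov))

In the exponential-family setting of Section 3, that gradient is, up to a term that does not involve μ, just the natural parameter vector, so the covariance reduces to the posterior covariance between t(μ) and the natural parameter, and V can be taken from the family itself (again, my reading; see the paper for the exact statement).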

“In fact several of our examples will demonstrate near equality between Bayesian and frequentist standard deviations. That does not have to be so: remark 1 in Section 6 discusses a class of reasonable examples where the frequentist accuracy can be less than half of its Bayesian counterpart. Other examples will calculate frequentist standard deviations for situations where there is no obvious Bayesian counterpart, e.g. for the upper end point of a 95% credible interval.”

Efron demonstrates, once again, how sensible Bayesian and frequentist computations so often yield nearly equal answers. This is why so much of the silly bashing of frequentist results by Bayesian zealots, and Bayesian results by frequentist zealots, is pointless. Mostly, we arrive at the same conclusion regardless of which strategy we employ, and Efron is so adept at developing useful and practical solutions from methods in either camp without being a zealot. In those cases where frequentist and Bayesian solutions differ starkly, we still have something to learn.

“The general accuracy formula takes on a particularly simple form when fμ(x) represents a p-parameter exponential family: Section 3. Exponential family structure also allows us to substitute parametric bootstrap sampling for MCMC calculations, at least for uninformative priors. This has computational advantages. More importantly, it helps to connect Bayesian inference with the seemingly superfrequentist bootstrap world, which is a central theme of this paper.”

More bridge building between the Bayesian and frequentist paradigms. Always useful.

“The general accuracy formula provides frequentist standard deviations for Bayes estimators, but nothing more. Better inferences, in the form of second-order-accurate confidence intervals, are developed in Section 4, again in an exponential family bootstrap context. Section 5 uses the accuracy formula to compare hierarchical and empirical Bayes methods. The paper concludes with remarks, details and extensions in Section 6.”

Amazing. This paper is such a rich vein of statistical mathematical development.

“The frequentist properties of Bayes estimates is a venerable topic, that has been nicely reviewed in chapter 4 of Carlin and Louis (2000). Particular attention focuses on large sample behaviour, where ‘the data swamp the prior’ and θ̂ converges to the maximum likelihood estimator (see result 8 in section 4.7 of Berger (1985)), in which case the Bayes and frequentist standard deviations are nearly the same. Our accuracy formula provides some information about what happens before the data swamp the prior, though the present paper offers no proof of its superiority to standard asymptotic methods.”

Efron is always objective about judging superiority. He’s clear about the criteria on which he bases such assessments when he makes them, and it’s never ego-based.

“Some other important Bayesian-cum-frequentist topics are posterior and preposterior model checking as in Little (2006) or chapter 6 of Gelman et al. (1995), Bayesian consistency (Diaconis and Freedman, 1986), confidence matching priors, going back to Welch and Peers (1963), and empirical Bayes analysis as in Morris (1983). Johnstone and Silverman (2004) have provided, among much else, asymptotic bounds for the frequentist accuracy of empirical Bayes estimates.”

More interesting reading to do. Efron’s citations do not include zealots and crackpots, unless he is deconstructing their sillinesses. I appreciate Efron’s vetting of the literature.

“Sensitivity analysis—modifying the prior as a check on the stability of posterior inference—is a staple of Bayesian model selection. The methods of this paper amount to modifying the data as a posterior stability check (see lemma 1 of Section 2). The implied suggestion here is to consider both techniques when the prior is in doubt.”

More useful suggestions about sensitivity-analysis methods, which are always important for ensuring that results are not misleading.

“The data sets and function freqacc are available from http://statweb.stanford.edu/~ckirby/brad/papers/jrss/.”

Efron is always generous about sharing his insights. His results are not squirreled away behind corporate doors in pursuit of vast profit from a software offering or a consulting firm.

So, in a nutshell (okay it’s a big coconut shell) that is my summary of the brilliance I see in this paper. A beautiful presentation of a general result that gives very useful information at no extra charge. It was sitting there all along, and Efron saw it, crafted it and presented it to us succinctly, to use forevermore. This theory (Lemma 1 and Theorem 1) will be in future statistics textbooks, alongside many other Efron discoveries. It’s amazing to live during the times of such a statistical genius.

Note from Mayo: This began as a comment by Steven McKinney on a recent post on Efron. He agreed to let me put it up as a guest blog post. (There was also a comment by Greenland.) To galvanize some conversation, let me pose some queries:  Is it true that “Mostly, we arrive at the same conclusion regardless of which strategy we employ”? and when we do, do they mean the same things? Should we really be striving for an agreement on numbers without an agreement on interpretation? I’m inclined to favor a philosophy of “different tools for different goals”, lest one side get shortchanged. Notably, there’s an essential difference between how believable or plausible claims are and how well tested or probed they are by dint of a given set of inquiries. Distinct aims occupy “science-wise screening” tasks, as in large-throughput testing of a “diagnostic” type, or jobs with a performance-oriented behavioristic flavor. Typically, when I see “matching numbers” used for different ends, one set of goals and philosophy is compromised and/or misunderstood. (Recall the “P-values exaggerate” meme.) Why not try to truly understand just what each can (and cannot) do?

I completely agree with McKinney on the brilliance of Efron’s many innovations over the years, particularly in relation to resampling, and concur about the value of checking error statistical properties of other methods, but we’ve also heard Lindley claim “there’s nothing less Bayesian than empirical Bayes” and even “conventional” Bayesians sometimes bristle at being cross-checked by error statistical means. I’m very grateful to McKinney for this guest post!

References:

Berger, J. O. (1985) Statistical Decision Theory and Bayesian Analysis, 2nd edn. New York: Springer.

Berger, J. (2006) The case for objective Bayesian analysis. Bayesian Anal., 1, 385–402.

Carlin, B. P. and Louis, T. A. (2000) Bayes and Empirical Bayes Methods for Data Analysis, 2nd edn. Boca Raton: Chapman and Hall–CRC.

Diaconis, P. and Freedman, D. (1986) On the consistency of Bayes estimates (with discussion). Ann. Statist., 14, 1–67.

Gelman, A., Carlin, J. B., Stern, H. S. and Rubin, D. B. (1995) Bayesian Data Analysis. New York: Chapman and Hall.

Ghosh, M. (2011) Objective priors: an introduction for frequentists (with discussion). Statist. Sci., 26, 187–202.

Johnstone, I. M. and Silverman, B. W. (2004) Needles and straw in haystacks: Empirical Bayes estimates of possibly sparse sequences. Ann. Statist., 32, 1594–1649.

Kass, R. E. and Wasserman, L. (1996) The selection of prior distributions by formal rules. J. Am. Statist. Ass., 91, 1343–1370.

Little, R. J. (2006) Calibrated Bayes: a Bayes/frequentist roadmap. Am. Statistn, 60, 213–223.

Morris, C. N. (1983) Parametric empirical Bayes inference: theory and applications (with discussion). J. Am. Statist. Ass., 78, 47–65.

Welch, B. L. and Peers, H. W. (1963) On formulae for confidence points based on integrals of weighted likelihoods. J. R. Statist. Soc. B, 25, 318–329.

Categories: Bayesian/frequentist, objective Bayesians, Statistics


44 thoughts on “S. McKinney: On Efron’s “Frequentist Accuracy of Bayesian Estimates” (Guest Post)”

  1. Steven: thanks so much for your comment and post, I had some questions on the webinar that I’ll raise later on. I’m sorry not to have been able to get the theta-hat to look right.

  2. Christian Hennig

    Thanks for this. I was tuned in to this seminar, too, but your summary still adds something valuable for me; it gives me a clearer picture than I had from just following the presentation, particularly of where this sits relative to what others have done.

    Regarding Mayo’s important question “when we do, do they mean the same things?”, I’d like to mention Sections 5.3-5.5 of the paper on objectivity and subjectivity in statistics (or rather on how to avoid these terms) by Andrew Gelman and myself at
    http://arxiv.org/abs/1508.05453
    I think that Efron’s results fall in the category of what we call in Sec. 5.5 “Falsificationist Bayes”, by interpreting the sample model in a Bayesian setup in a frequentist way (i.e., as something generating observations rather than quantifying beliefs etc.). This differs from how Bayesian probabilities are interpreted in both the classical subjectivist and objectivist Bayesian setup (Secs. 5.3 and 5.4), in which there is no such thing as a “true parameter” that could be estimated in a frequentist sense. “Falsificationist Bayes” (which actually goes back to an earlier paper of Gelman and Shalizi, though the “Falsificationist Bayes” name wasn’t used there) provides an interpretation in which (largely) the same things are meant by the frequentist and the Bayesian. There is an issue with how to interpret the parameter prior and one may suspect from this that more work needs to be done to clarify things completely; we discuss this in Sec. 5.5.

    I should take the opportunity to thank Mayo for her extremely detailed and useful comments on the paper given to me at some point. Some of these found their way into the paper but I’m sure there is still quite a bit with which you don’t agree. I should really have written more to you in response to some of the points you raised; somehow I haven’t found a good way of doing this yet. (Who knows, you may discuss the paper here at some point and I can respond “in public”.)

    • Christian: Thanks for thanking me for my detailed comments. No need to write in response. Maybe I’ll let you return the favor by sharing (shortly) my discussion of objectivity (in How to Tell What’s True About Statistical Inference).

  3. Steven McKinney

    Hennig:

    Let me start with a basic question:

    Does the posterior expectation of a distribution of a parameter of interest ever mean something to you?

    Looking at the arxiv paper you link to above (Gelman and Hennig, 2015), I see discussion in section 3.1 of applied examples, discussing pharmacological examples, such as “gamma.1: mean of population distribution of log(BVA(j)latent/50) . . . ”

    Does this mean correspond to something of interest in the pharmacological applied example?

    Would you or any of the chemists involved want to know what this number is, and would this number tell you something?

    • Anonymous

      Steven: I’ve got to say that this is one of Andrew’s application examples in which I was not involved (apart from reading it and not finding anything I could object to regarding putting it in the paper). In any case, one possibility is to use the posterior just as a technical tool for finding a better estimator of a parameter that has a frequentist interpretation. If you want, you can interpret the prior as a device generating “alternative plausible worlds” and the posterior then has a frequentist interpretation over the worlds generated by the prior. This may seem very idealised, but it still should provide the researcher with some kind of impression of the uncertainty about the parameter. Certainly in such a situation one would want to know about the frequentist properties of the resulting estimator, and to what extent posterior probabilities approximate confidence levels.
      However, I can’t go into discussions of pharmacology here because I’m not personally involved in this.

      • Christian: So, for example, Pr(C|x) = .9 would mean C holds in approximately 90% of “plausible worlds”? By considering the fixed parameter in this world to have been generated by a random sample from a superpopulation, it becomes possible to deductively arrive at a posterior probability. As Fisher says, this is deduction and not inductive inference to C. But of course we do not select this world, the one we care about, from an urn of worlds, and since the parameter and model (Gelman agrees) is regarded as fixed, what does it mean to posit a random selection from an urn of worlds? I’m not sure that the frequentist properties of this method, assuming it could be pinned down, would be relevant. There’s only one world, so the fact that a method works in most of the imaginary other worlds, if it did, wouldn’t be much comfort. We want to know how frequently a method works in this world, given the variability of the events in question, and that’s what a frequentist error probability supplies.

        • Christian Hennig

          Mayo: That’s just one way to interpret the posterior. You don’t have to do it. You may also just use the posterior mean (or some other statistic of the posterior) as estimator and evaluate its frequentist properties (if that’s possible, but that’s what Efron’s presentation was about); and also frequentist coverage probabilities for Bayesian credibility regions. If the Bayesian probabilities and the frequentist coverage probabilities are approximately equal, one can use the Bayesian ones as approximation for the frequentist ones. That’s a twist on these Bayesians who try to sell to the frequentists that what really matters are the Bayesian probabilities and the frequentist ones are only fine if they’re about the same. One can play this game the other way round; although one implication may be that in some situations even a committed frequentist is better off doing something Bayesian.

        • Christian Hennig

          Ouch. All my postings turned up now, two days late, which is a bit embarrassing because the one from earlier today is probably better than the two postings from 7 Nov combined.

          • Christian Hennig

            I meant that the one from 8 Nov is better than the two postings from 7 Nov combined. Ignore the earlier ones if you still can.

          • Christian: That’s OK, we can see the development of your thinking. I’m not sure why that happened since you’ve posted comments before; none should have been held. If there’s one you truly hate, send me the link, or better, send it to the bloglady on Elba, asking her to remove it or them. jemille6@vt.edu

    • Steve: I wrote something responding to you already, which has somehow vanished. I may have missed that this computer hadn’t saved my name here so it may be listed as “Anonymous”. Let’s see whether it surfaces in a few hours (like it does sometimes). If it doesn’t, I’ll bite the bullet and write it again.

      I wanted to add something else, namely that one way of interpreting the prior, if it’s seen as a tool to find better frequentist estimates, is that it will bias the estimator in the direction of values that have high density under the prior. So the estimator will be better, in terms of MSE (for example; there are some subtleties if priors are not unimodal and there probably even can be some if they are, but I’m ignoring them for the moment), than a non-Bayesian estimator if the true parameter is about where the prior suggests it is, and elsewhere it will be worse. Not sure whether there is work on this but I’m pretty sure that’ll be the effect. Now if the information encoded in the prior is good and valid, this may be a useful thing to have, although not from a “being as safe as possible in worst case situations” attitude.

    • Steven: I had replied already yesterday but somehow this was apparently lost. (What I write now is actually better in some respect, so I’m not necessarily hoping for recovery of the old text.)

      Anyway, Sec. 3.1 is Andrew’s application; I’m not personally involved in this and can therefore not tell you how the communication with the pharmacologists went and to what extent they found this meaningful. The mean you’re talking about actually is a characteristic of a population/sampling distribution, which has a frequentist interpretation here, so the gamma.1 itself is interpreted in the same way as a frequentist would interpret it. What this bit is about is setting up a prior for gamma.1, from which the data analysis then will produce a posterior. So this is what you’re asking about, I guess.

      The Section doesn’t tell what exactly was done afterwards with the posterior (which means that I don’t know much more about it; I discussed what’s in the Section with Andrew but not all that’s behind it). However, there are various possibilities. First, the posterior can be used as a technical tool to find a frequentist estimator for gamma.1 (which may be the expectation of the posterior or some other statistic). In this case, obviously, the frequentist characteristics of this estimator are of interest. The posterior itself is of interest too, to the extent that it approximates confidence levels and the like.

      In many cases (depending probably on whether the prior is unimodal and/or well-behaved in some sense; I’m not sure whether there is research on this), the prior will bias the estimation in the direction of the prior’s high density region, which means that in terms of MSE (say), chances are that the estimator will be better than a straightforward frequentist estimator if the true parameter is in the high density region of the prior, and otherwise worse. If the information encoded in the prior is valid, this may be something desirable. Andrew used the prior basically as a tool to exclude certain regions of the parameter space which made no sense for the pharmacologist, and by doing this he improved the numerical stability of his estimator.

      I can imagine further interpretations of the posterior; they give some kind of conditional sampling distribution that can be interpreted in a frequentist sense if one imagines a random generator of different possible worlds with different possible values of the parameters. This of course is very idealised and one may wonder how much scientific value it has, but it may give the researcher some idea about the uncertainty in the analysis (frequentists are no strangers to bold idealisations either).

      • I generally interpret Bayesian analysis along the lines of a ‘generator of possible worlds parameterised by different parameter values’ (to paraphrase).

        I think this is quite a natural operational interpretation for many mathematical modellers, esp. those with some background in statistical mechanics and similar subjects.

        Note that this doesn’t mean I interpret the worlds as ‘real’, rather as hypothetical ‘sketches’ to compare the world to. There is obviously much more subtlety here – too much for a blog comment – but worth a note.

        • (So a prior, along with a sampling model, defines an initial predictive ensemble; new info allows a Bayesian to update to a posterior predictive ensemble. Bayes’ theorem is an intermediate step and requires additional closure assumptions, which are themselves falsifiable.)

        • OM: In other words, different worlds correspond to different values of parameters. Thus to say, for example, that a model or other claim C has probability .8 would mean that in approximately 80% of worlds (where data x was observed) claim C holds. Is that it?
          (I don’t suppose you consider only “nearest possible worlds” as some philosophers do.)

          • Say a model has a parameter b with three possible values, b in B = {0,1,2} say. A prior over B tells us how often (relative to the other values) to draw each value.

            So if p(b=0) = 0.8, then 80% of the models considered will have b=0.

            The ‘possible world topology/geometry’ (‘nearest’) is defined by the parameter space and distributions over it. E.g. closeness of worlds might be related to curvature or other metrics, and is ‘user-defined’.

            • Ie the ‘shape of the (prior) possible world space’ is related to the ‘shape’ of the prior distribution over the parameter space. If we can differentiate, then higher derivatives tell you about the local shape. Differential geometry, topology, that sort of thing.

            • OM: I know exactly what you mean. If we had such a frequentist prior, we might wish to use it; it’s just a deductive computation and scarcely makes one “Bayesian”. But the entire reason Fisher, Neyman-Pearson, Cox, Fraser, and many, many others developed statistical inference methods that were not merely “inverse probability” (i.e., non-Bayesian methods) was to handle scientific inference where we don’t have a clue about the relative frequency of worlds where model M holds. (Nor did they want to apply a principle of indifference.) If we did know, it’s not even clear how that would be relevant for reliable inference in this world. Now since you have 2 copies of EGEK, check pp 120-2 where I discuss Reichenbach’s frequentist view (which is related to the frequency of worlds view, but differs. R hoped that one day we might be able to judge how frequently a type of claim was true. See my “natural bridge” 123-4).

              • Mayo – in the context of statistical mechanics and eg stochastic/non equilibrium thermodynamics this is the problem of defining the appropriate ensemble. There are guides and principles that have proven effective but of course it is not – nor probably ever – a fully solved problem.

                Many, many applied problems have been solved by this approach though.

                I have read a few philosophers who take applied math seriously. I think ideas like these and eg renormalization, singular perturbation theory, weak solutions etc are just as deep if not more than verbally phrased concepts. Have you found a description of hierarchical bayes that you understand yet? What do you think about the BBGKY hierarchy in statistical mechanics?

              • How did Fisher put it? Something like ‘of what population is this a random sample?’

                The prior-as-defining-an-ensemble view is another way of answering/conceiving of this question – and as Christian hinted at, no less bold than many frequentist idealistions based on hypothetical populations of comparison.

                It seems ridiculous to simply declare it doesn’t make sense when it makes as much sense as any other method of understanding the world and has been used to much success. People thought Dirac’s delta function made no sense, quantum mechanics makes no sense etc etc. Zeno’s paradox. I could go on about verbally phrased criticisms that can be reasonably successfully answered by operational mathematical theories of the world. I’m all for criticism but it should be productive/constructive.

                • john byrd

                  ‘of what population is this a random sample?’– Fisher used counterfactual reasoning to set up his tests and estimation problems, such as if this sample was randomly drawn from this population with these characteristics, we have xx expectations. Null hypothesis testing uses counterfactuals to properly frame the problem so that the numbers generated can inform the overall Inference. We all understand how significance testing works and why ( I think), to include the counterfactuals. How would a counterfactual look if it involves modelling which population was sampled?

                  If we have a universe of populations and we randomly select one, we expect ….? I believe Fisher objected to this.

                  • Parameterising different possibilities (‘worlds’ if you must) is a general way of carrying out counterfactual reasoning. There are different ways to implement some details.

                    You could ‘manually’ explore different possibilities by directly varying the parameter. That’s more of a frequntist approach. Or you could explore the possibilities through random sampling. That’s more of (one type of) a bayesian approach.

                    Both involve exploring hypothetical populations of comparison. Both seem equally objectionable/unobjectionable to me, though have slightly different practical strengths and weaknesses.

                    • And by ‘random’ – a bayesian prior gives a stochastic recipe for exploring possible parameter values. It need not be – and generally isn’t best, imo (and in Gelman’s view, according to the seminar) – e.g. a uniform or some other ‘uninformative’ sampling.

                      It could be a more informed parameter space sampling. This is where this conception of a prior connects back to regularisation – it defines the parameter space in which we search for solutions and we may choose to exclude pathological values arising in a larger domain and focus more on certain subsets.

                      This is not really speculative – it’s a fairly straightforward interpretation of applied bayesian inference as described in Gelman’s BDA for example.

                    • And again – please everyone distinguish the characteristics of the ensemble as a whole and the characteristics of the members making up the ensemble. This is a necessary condition for understanding this perspective.

                    • omaclaren:

                      Nicely put, but frequentist conventions here are uniform over all parameter values or at least the supremum over all parameter values – so sampling is in principle ruled out (though David Cox would often say a grid of points that covers the region of the parameter space that could be relevant would suffice in practice.)

                      I have not carefully studied the Efron paper, but it seems clear that its main relevance is to address the computational problems involving a grid of points for which it is currently not practical to carry out the simulations.

                      Keith O’Rourke

                      Thanks 🙂 Here is an interpretation you might like – both frequentist and bayesian models are initially purely formal objects which are possibly inaccessible even in principle.

                      We can give concrete computational interpretations to them via sampling algorithms, however. This also inevitably introduces the idea of approximation. Think real analysis compared with numerical analysis.

                      Agree that it’s good to have clever methods of making the inaccessible more accessible. That’s why the advice to think about the underlying Bayesian model even if you use a Frequentist approximation is helpful.

                    • OM: The issue isn’t whether we can envisage a statistical mechanism generating the fixed theta (in this world), often enough we might. The issue is a) how can we know the theta generating distribution, and most importantly, b) what is the justification for appealing to such a thing for qualifying the warrant for a hypothesis about theta in this world. Can one imagine a performance justification maybe?

                    • I prefer ‘stability’ rather than ‘performance’ but they’re similar ideas.

                      I also like parts of Spanos’ ‘inductive step’ view. I take a more hierarchical view of theories tho – see hierarchical bayes or even the phil lit on physicist’s renormalization group theory.

                      This world is not necessarily ‘regular’ so we can’t carry out naive induction. All is not lost – we can embed this observed world in various expanded, ‘regularised’ contexts. For reasons I’ve tried to explain many times I view the parameter embedding in a possibility space as an important task and a prior as one way of achieving this.

                      Because hierarchical theories always use temporary approximate ‘closures’ (think eg moment closure in the BBGKY hierarchy of statistical mechanics) these are always open to expansion when internal contradictions are reached. We have to ‘go to the next level’ in the hierarchy when this happens.

                      Luckily, sequences can have rates of convergence so we can often order and improve our models. ‘Turtles all the way down’ ignores the mathematical concept of a limit and the tools developed to speak about them. Bifurcations and phase transitions challenge global stability, however. Even the simplest of dynamical systems can exhibit complex behaviour; the geometry of the ‘space of scientific theories’ is probably quite complicated – if past experience is anything to go by (induction joke).

                      I have a very simple high school example in my latest blog post but also have links to ideas of regularisation and renormalization. It’s a bit subtle probably since it’s based on a high school math problem so seems straightforward…

                      In summary – generally, these days I tend to take the ideas of well-posed problem, regularisation, stability, bifurcation, formal/constructive dualism, static/dynamic dualism as fundamental (as does Laurie Davies, in my interpretation) tho’ am open to different implementations and interpretations. ‘Performance’ may be (and seems to be) a useful perspective but I can’t take it as fundamental, unless it means the same sort of thing as regularisation or regularity condition. Which it might.

                    • (and ‘hierarchy and closure’, obviously…)

                      Relatedly, I searched the other day to see if there was any good philosophical literature on the physics/applied math ideas of renormalization etc.

                      I found this paper in Synthese (which is pretty reputable, right?). It was an interesting read:

                      ‘The conceptual foundations and the philosophical aspects of renormalization theory’

                      PDF: http://www.researchgate.net/profile/Silvan_Schweber/publication/225936036_The_conceptual_foundations_and_the_philosophical_aspects_of_renormalization_theory/links/54eb442b0cf29a16cbe5ac76.pdf

                    • john byrd

                      Getting back to the “doesn’t make sense” charge, I think that most of us can see sampling from a population as a fair representation of natural processes, but many have a hard time seeing that there is a natural analog to sampling populations randomly. If this approach is proven useful that is good, but it still might not make intuitive sense to many of us. Do you think sampling the population is analogous to natural processes, or merely a useful modelling approach, or other?

                    • Everyone needs to explore the possible parameter space. Eg confidence intervals as inversions of hypothesis tests for different parameter values.

                      An algorithm is a way of giving a concrete interpretation. How would you implement varying parameters (null hypotheses) on a computer? Set up a grid, iterate through them etc right?

                      Then think about numerical integration. You could manually iterate through your integration points or you could randomly sample them. The convenience of a sampling algorithm over parameter space is we can now think of both the ensemble as a whole and of the individual members depending on the perspective of interest. Think sets as wholes vs as collections of elements.

                      There is much more to it, but suffice to say the dual view of an ensemble of possibilities and its particular elements is a nice way of thinking about underdetermined problems.
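
                      A tiny sketch of that grid-versus-sampling contrast (my own illustration, with made-up numbers): the same integral over a parameter space can be approximated either by stepping through a grid or by drawing parameter values at random, and the two agree closely.

                          import numpy as np

                          rng = np.random.default_rng(0)

                          def f(b):
                              # an integrand over the parameter b on [0, 1], e.g. a likelihood-like curve
                              return np.exp(-0.5 * (b - 0.3) ** 2 / 0.01)

                          # 'Manual' exploration: step through an evenly spaced grid of parameter values
                          grid = np.linspace(0.0, 1.0, 1001)
                          grid_estimate = f(grid).mean() * (1.0 - 0.0)   # rectangle-rule approximation

                          # 'Ensemble' exploration: sample parameter values at random (uniform here)
                          draws = rng.uniform(0.0, 1.0, size=100_000)
                          mc_estimate = f(draws).mean() * (1.0 - 0.0)    # Monte Carlo approximation

                          print(grid_estimate, mc_estimate)   # agree to two or three decimal places

                      (Sampling from a non-uniform density and reweighting – importance sampling – is the same idea with a more informed ensemble.)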

                    • So it’s an ‘epistemic modelling’ concept. Another analogy – most physicists don’t intepret the wave function of QM as ‘real’ or ‘natural’ but as a carrier of information from which they can extract concrete answers given well defined questions.

  4. A quick comment. I enjoyed Efron’s seminar (and have a lot of respect for him) but, if I am honest, I was left a little disappointed by it. I am probably missing some things – I haven’t had a chance to work through and think about his calculations in detail. That is usually the only way I can ‘truly’ understand something.

    I note though that at some point he mentioned something about the importance of hierarchical bayes for relating freq. and bayesian approaches, which I wholeheartedly agree with (I’ll need to go back to find exactly what he said when I have proper access).

    Here is a (perhaps) relevant quote from Gelman et al.’s BDA3 (p. 104 I think):


    The analysis using the data to estimate the prior parameters, which is sometimes called empirical Bayes, can be viewed as an approximation to the complete hierarchical Bayesian analysis. We prefer to avoid the term ’empirical Bayes’ because it misleadingly suggests that the full Bayesian method, which we discuss here and use for the rest of the book, is not ’empirical’.

    • Steven McKinney

      MacLaren:

      The webinar was indeed brief. I recommend reading his paper just to get an overview of it, then watch the webinar, then re-read the paper. I found that the webinar filled in some gaps for me, but I could see how just watching the webinar would be confusing, for anyone who has not read the paper.

      I think a big part of the problem here is too much introduction of terminology, and endless debating over the minutia of the structure of sentences describing the terminology.

      I believe that all applied statisticians, whatever they want to self-identify as, often engage in the practice of putting some numbers from some real-world situation into an algorithm that produces a final number that they think should mean something.

      So an applied statistician working with a group of scientists in North America takes values from a series of assays, and runs the values through an algorithm – a well described algorithm found in textbooks and often taught in institutions of higher learning. Of course such algorithms are tweaked this way and that, to try and improve some minor aspect of performance. If a group in Australia working on the same problem put their numbers through some such similar algorithm, they will get a different output. This phenomenon was noticed hundreds of years ago and it doesn’t seem to be going away.

      So neither group is certain that the value they estimated would come out the same if the experiments/assays were repeated. They will want to create some measure of the amount by which this quantity of interest will jiggle about. And neither group would be happy if algorithms of a similar nature gave wildly different answers.

      We can all argue endlessly about what the algorithm should be through which we put our numbers, and what name we should give to how much such a number will jiggle about should we repeat the experiments/assays. But in the end, we all do this. We estimate things, and we make some kind of statement about how precise that estimate is.

      If none of us can even agree on this basic premise, what then is this enterprise we call statistics? If we can’t even agree that if I add up my numbers, and divide by the number of numbers, but you calculate an expected value of a posterior using a gaussian prior and a gaussian model, that those two algorithms generally tell us the same thing, when reasonable quantities of data are available, then this enterprise is doomed. We might as well break out the goat entrails again.

      So this is why I find Efron’s exegeses so valuable – he is able to derive algorithms that do useful things without getting caught up in popularity contests about whose algorithmic flavour-du-jour has the fluffiest texture, or how well coiffed its proponents are. In this paper, Efron describes another algorithmic variation that gives a useful output about how much we can expect a statistic to jiggle about. The mathematics look solid to me, I haven’t found any errors there yet. I haven’t found any gaping philosophical flaws in his statistical philosophical reasoning. I’m certainly no Efron, and would never have derived such beautiful results on my own, but I appreciate sensible development of algorithmic variations that show me how to estimate statistics of interest, and their inherent variability. I don’t care that some of his statistics come from a Bayesian paradigm, and others from a frequentist paradigm, or that by bridging such realms a new label is placed on some of these new algorithmic creatures. All I care about is that the performance of algorithms of a similar nature yields similar conclusions, and if not, what are we not yet understanding?
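
      To put a number on the Gaussian example above (my own sketch, not anything from Efron’s paper): with a normal model with known variance and a normal prior on the mean, the posterior expectation is just a shrunken sample mean, and with a reasonable amount of data the two estimates all but coincide.

          import numpy as np

          rng = np.random.default_rng(1)
          x = rng.normal(loc=2.0, scale=1.0, size=100)   # n = 100 assay values, sigma taken as known

          # "Add up my numbers and divide by the number of numbers"
          xbar = x.mean()

          # Posterior expectation under a N(mu0, tau2) prior and N(mu, sigma2) model
          mu0, tau2, sigma2, n = 0.0, 1.0, 1.0, len(x)
          post_mean = (n * xbar / sigma2 + mu0 / tau2) / (n / sigma2 + 1 / tau2)

          print(xbar, post_mean)   # here post_mean = xbar * 100/101: they differ in the second decimal

      As the prior gets vaguer or the sample grows, the shrinkage factor n/(n + sigma2/tau2) tends to 1 and the two numbers merge; when they differ starkly, as noted above, there is something to learn.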

      • Steven: I guess I’m not so sure we shouldn’t be “debating over the minutia of the structure of sentences” such as your “All I care about is that the performance of algorithms of a similar nature yields similar conclusions”. Evaluating “performance” presumably alludes to how well a method does at something, so it’s good to know what that something is.

        • Steven McKinney

          Mayo:

          I don’t mean to say that minutiae should never be considered – details do matter. Efron does a nice job in this paper of presenting mathematical minutiae such as his discussion of measure theoretic elements for the “sufficient condition for the interchange of integration and differentiation in expression (2.7) . . .” and his discussion for bias correction. So performance is evaluated. Efron notes that for the large choice of B in the presented examples, the bias correction ended up being “insignificant”.

          So this is the kind of thing I am talking about. We could go off on a great argument that bias correction should always be done, or that bias correction is somehow suspect so shouldn’t be done, but the answers, whether bias corrected or not, were nearly identical. So what then is the point of digging in one’s heels for always doing bias correction, or never doing it, when the difference in answer does not materially change the interpretation of results? This is what I am referring to. When a frequentist 95% confidence interval consists of two numbers that differ only in the second or third decimal place from a Bayesian 95% Highest Posterior Density interval, what then is the point of declaring all frequentists as mad, or declaring all Bayesians as misguided?

          • Steven: I don’t think one has to say either is “mad” to point out the difference in meanings, and it’s not typically clear what the .95 in the HPD interval is intended to mean. That doesn’t mean I sign onto standard CIs–I think they’re in need of supplements just as tests are; and I wouldn’t choose a single confidence level.

    • OM: The “empirical” here (in empirical Bayes) refers to its being a frequentist prior of some sort. Clearly it doesn’t mean “evidence-based” or “testable” or the like, so Gelman shouldn’t mind it for that reason. However, one does need to be clear about what it means, if it’s to be testable in a warranted way. (Merely being “testable” isn’t enough, in that there are all kinds of accounts of testing.)

      I did think that Gelman raised good and interesting questions in relation to the Efron talk. They were general, non-technical and philosophical. I would like to know what Efron really thinks to a greater extent. Doubtless, a webinar doesn’t really lend itself to a careful philosophical exchange, unless perhaps by philosophers.

      • In my view the comparison of hierarchical and empirical bayes is clearest when thinking of them as estimation methods. Hierarchical Bayes usually has a more explicit (elaborated) model structure than what is typically considered in the empirical Bayes lit. Empirical Bayes also typically focuses on point estimates (and variability thereof).

        These differences can be strengths or weaknesses depending on circumstances, but they are a useful point of comparison.

      • vl

        ‘The “empirical” here (in empirical Bayes) refers to its being a frequentist prior of some sort.’

        Omaclaren’s view (and mine/Andrew’s and many others’) is that a “frequentist prior” is usually nothing but a special case of a hierarchical model with a middle layer which models the frequency distribution, usually by estimating the mean and variance of the distribution across parameters.

        There’s still an implicit top level prior over the frequency distribution’s parameters that gets swept under the rug (to make matters worse, it can often be an improper prior). I find it disingenuous to pretend like it’s a “separate” form of bayesian inference that is somehow more “empirical” when it’s simply a special case.

        • vl: I’d be grateful if you explained your interesting comment.

          “middle layer which models the frequency distribution, usually by estimating mean and variance of the distribution across parameters.”
          What’s the “middle layer”? What’s the “top level”? Is that the usual distribution over the parameter(s), in a single application of Bayes’s theorem?

          “There’s still an implicit top level prior over the frequency distribution’s parameters that gets swept under the rug (to make matters worse, it can often be an improper prior). I find it disingenuous to pretend like it’s a “separate” form of bayesian inference that is somehow more “empirical” when it’s simply a special case.”
          How is it swept under the rug?
          In many discussions, the parameters are claimed to be fixed, so it’s not clear where a frequentist prior enters. I assumed it was some kind of non-frequentist “weight”.

  5. Slightly off topic, but it was from McKinney that I first learned some of the inside story on the Anil Potti fraud. See the recent update: http://retractionwatch.com/2015/11/07/its-official-anil-potti-faked-data-say-feds/#comment-802865.
    I explain this in my subsequent comment.

  6. Retraction Watch just announced, “It’s official: Anil Potti faked cancer research data, say Feds”. However, tomorrow’s official Federal Register reports that “Respondent neither admits nor denies ORI’s findings of research misconduct”!
    https://www.federalregister.gov/articles/2015/11/09/2015-28437/findings-of-research-misconduct

    There is an interesting confluence between the current post and the recently announced culmination of the Anil Potti (Duke) investigation into the fraudulent basis for predicting best treatments for cancer patients. I learned about the Potti controversy through Steven McKinney*. Although medical journals deemed Baggerly and Coombes’ exposé of the errors and irreproducibility of the Potti/Nevins results “too negative” to publish, Efron, editor of “Annals of Applied Statistics,” gave priority to publishing their (2009) paper in the Annals, which led to a letter-writing campaign, and eventually to stopping the clinical trials already under way at Duke. McKinney has been calling on the various powers that be to analyze the frequentist properties of the Potti/Nevins model. McKinney’s letter appears in the “Omics” report following the scandal. Here are some excerpts:

    “Dec 16, 2010
    Dear Dr. Micheel,
    I have been following with interest and concern the development of events related to the three clinical trials …currently under review by the Institute of Medicine (Review of Omics-Based Tests for Predicting Patient Outcomes in Clinical Trials).
    I have reviewed many of the omics papers related to this issue, and wish to communicate my concerns to the review committee. In brief, my concern is that the methodology employed in the now retracted papers, and many others issued by the Duke group all use a flawed statistical analytical paradigm. …
    …….
    The statistical properties of this analytical paradigm, in particular its type I error rate, have not to my knowledge been reviewed or published. I respectfully request the IOM committee to include this issue in its agenda for the upcoming review, as findings from this committee will provide a broader educational opportunity, allowing journal editors and reviewers to have a better understanding of the statistical properties of the analyses repeatedly developed and submitted for publication by the Duke University investigators.”
    …..
    “Appendix – Details of points of concern regarding the statistical analytical paradigm repeatedly used in personalized medicine research papers published by Duke University investigators
    In 2001 West et al. [1] published some details of a statistical analytical method involving “Bayesian regression models that provide predictive capability based on gene expression data”. In the Statistical Methods section of this paper they state that the ‘Analysis uses binary regression models combined with singular value decompositions (SVDs) and with stochastic regularization by using Bayesian analysis (M.W., unpublished work) as discussed and referenced in Experimental Procedures, which are published as supporting information on the PNAS web site.'”
    …..
    “When predictor variables derived from the entire set of data are used, it cannot be claimed that subsequent “validation” exercises are true cross-validation or out-of-sample evaluations of the model’s predictive capabilities, as the Duke investigators repeatedly state in publications.”

    Other Potti posts:
    https://errorstatistics.com/2014/05/31/what-have-we-learned-from-the-anil-potti-training-and-test-data-fireworks-part-1/

    https://errorstatistics.com/2015/01/12/only-those-samples-which-fit-the-model-best-in-cross-validation-were-included-whistleblower-i-suspect-that-we-likely-disagree-with-what-constitutes-validation-potti-and-nevins/

    *A tip-off on this blog was first in a post by Stephen Senn: “Casting Stones”

    https://errorstatistics.com/2013/03/07/stephen-senn-casting-stones/
