Comments on Wasserman’s “what is Bayesian/frequentist inference?”

What I like best about Wasserman’s blogpost (Normal Deviate) is his clear denial that merely using conditional probability makes the method Bayesian (even if one chooses to call the conditional probability theorem Bayes’s theorem, and even if one is using ‘Bayes’s’ nets). Else any use of probability theory is Bayesian, which trivializes the whole issue. Thus, the fact that conditional probability is used in an application with possibly good results is not evidence of (yet another) Bayesian success story [i].

But I do have serious concerns that in his understandable desire (1) to be even-handed (hammers and screwdrivers are for different purposes, both perfectly kosher tools), as well as (2) to give a succinct sum-up of methods, Wasserman may encourage misrepresenting positions. Speaking only for “frequentist” sampling theorists [ii], I would urge moving away from the recommended quick sum-up of “the goal” of frequentist inference: “Construct procedures with frequency guarantees”. If by this Wasserman means that the direct aim is to have tools with “good long run properties”, that rarely err in some long run series of applications, then I think it is misleading. In the context of scientific inference or learning, such a long-run goal, while necessary, is not at all sufficient; moreover, I claim that satisfying this goal is actually just a byproduct of deeper inferential goals (controlling and evaluating how severely given methods are capable of revealing/avoiding erroneous statistical interpretations of data in the case at hand). (So I deny that it is even the main goal to which frequentist methods direct themselves.) Even arch-behaviorist Neyman used power post-data to ascertain how well corroborated various hypotheses were—never mind long-run repeated applications (see one of my Neyman’s Nursery posts).

It is true that frequentist methods should have good error probabilities, computed with an appropriate sampling distribution (hence, they are often called “sampling theory”). Already this is different from merely saying that their key aim is “frequency guarantees”. Listening to Fisher, to Neyman and Pearson, to Cox and others, one hears of very different goals. (Again, I don’t mean merely that there are other things they care about, but rather, that long-run error goals are a byproduct of satisfying other more central goals regarding the inference at hand!) One will hear from Fisher that there are problems of “distribution” and of “estimation” and that a central goal is “reduction” so as to enable data to be understood and used by the human mind. For Fisher, statistical method aims to capture the “embryology” of human knowledge (Mayo and Cox (2010), “Frequentist Statistics as a Theory of Inductive Inference”)–i.e., pursues the goal of discovering new things; he denied that we start out with the set of hypotheses or models to be reached (much less an exhaustive one). From Neyman and Pearson, one hears of the aims of quantifying, controlling and appraising reliability, precision, accuracy, sensitivity, and power to detect discrepancies and learn piecemeal. One learns that a central goal is to capture uncertainty–using probability, yes, but attached to statistical methods, not statistical hypotheses. A central goal is to model, distinguish and learn from canonical types of variability—and aspects of phenomena that may be probed by means of a cluster of deliberately idealized or “picturesque” (Neyman) models of chance regularity. One hears from David Cox about using frequentist methods for the goal of determining consistency/inconsistency of given data with a deliberately abstract model, so as to get statistical falsification at specified levels–essentially the only kind of falsification possible in actual inquiry (with any but the most trivial kinds of hypotheses).

It is a mistake to try to delimit the goals of frequentist sampling statistics so as to fit it into a twitter entry [iii]. Moreover, to take a vague “low long-run error goal” (which is open to different interpretations) as primary encourages the idea that “unification” is at hand when it might not be. Let Bayesians have their one updating rule–as some purport. If there is one thing Fisher, Neyman, Pearson and all the other “frequentist” founders fought, it was the very idea that there is a single “rational” or “best” account or rule that is to be obeyed: they offered a hodge-podge of techniques which are to be used in a piecemeal fashion to answer a given question so that the answers can be communicated and criticized by others. (This is so even for a given method, e.g., Cox’s taxonomy of different null hypotheses.) They insist that having incomplete knowledge and background beliefs about the world does not mean that the object of study is or ought to be our beliefs. Frequentist sampling methods do embody some fundamental principles, such as: if a procedure had very little capability of finding a flaw in a claim H, then finding no flaw is poor grounds for inferring H. Please see my discussion here (e.g., Severity versus Rubbing Off). I have been raising these points (and hopefully much, much more clearly) for a while on this blog and elsewhere [iv], and it is to be hoped that people interested in “what is Bayesian/frequentist” will take note.

Of course it is possible that Wasserman and I are using terminology in a different manner; I return to this. Regardless, I am very pleased that Wasserman has so courageously decided to wade into the frequentist/Bayesian issue from a contemporary perspective: an Honorable Mention goes to him (11/19/12).

[i]  Some have dubbed this “Bayesian boosterism”.

[ii] The sampling distribution is used to assess/obtain error probabilities of methods. Thus, a more general term for these methods might be “methods that use error probabilities” or “error probability statistics”. I abbreviate the last to “error statistics”. It is not the use of frequentist probability that is essential; it is the use of frequentist sampling distributions in order to evaluate the capabilities of methods to severely probe discrepancies, flaws, and falsehoods in deliberately idealized hypotheses.

[iii] Some people have actually denied there are any frequentist statistical methods, because they have been told that frequentist statistical theory just means evaluating long-run performance of methods. I agree that one can (and often should) explore the frequentist properties of non-frequentist methods, but that’s not all there is to frequentist methodology. Moreover, one often finds the most predominant (“conventional” Bayesian) methods get off the ground by echoing the numbers reached by frequentists. See “matching numbers across philosophies”.

[iv] Mayo publications may be found here.

Categories: Error Statistics, Neyman's Nursery, Philosophy of Statistics, Statistics


21 thoughts on “Comments on Wasserman’s ‘what is Bayesian/frequentist inference?’”

  1. E. Berk

    Defining the goal as frequentist performance has been harmful in my field. I have heard speakers say that since we will all be dead in the long run, all frequentist methods are irrelevant. The replacement methods bandied about give numbers, but there is an air of mystery about their meaning and justification. It’s just that everyone does it.

    • E. Berk: Yes, it’s one of those age-old throwaway lines that completely misunderstand frequentist inference and what enables progress in science. Ironically, it’s the Bayesian betting rate idea that invariably appeals to long runs, whereas we are keenly interested in distinguishing genuine from spurious regularities in the data in front of us. This is managed by counterfactual reasoning about what would have occurred under various scenarios—precisely what frequentist sampling theory provides. No wonder the Bayesians generally strive to match frequentist error probabilities, even those who aren’t sure why. See my deconstruction of J. Berger.

  2. guest

    The Frequentist asks “what will the outcome be?”. The objective Bayesian asks “what should the outcome be based on a given state of knowledge?”.

    The former question cannot be answered. Any experiment can have any outcome, and there is no theorem of probability which can tell you the outcome ahead of time. At any given point, any aspect of any prediction can be wrong in the real world. This is true even for “aggregate” predictions like the frequencies of heads in coin flips. And it is true regardless of what your statistical calculation showed or what assumptions you made.

    On the other hand, the objective Bayesian question can be answered given an explicitly defined state of knowledge. By comparing what “should have been” with “what actually happened” you will learn whether the “state of knowledge” is sufficient or lacking for a given purpose. This is the essence of science and can be highly useful when wielded by a competent scientist.

    In general, there seems to have been a major failure of Frequentists to understand, let alone engage, objective Bayesians. They seem intent on having 50-year-old arguments with subjective Bayesians. This is a shame, because Frequentists have much to learn; the Objective Bayesians can do everything the Frequentists can and quite a bit more.

    The main reason for this is that “frequencies” are separated from “probabilities”. So if your information, or subject of discourse, involves frequencies “f”, then simply consider probabilities P(f). Here frequencies “f” are considered real experimental facts about the universe and not some “limiting frequencies” fantasy.

    Now it sometimes happens that the most likely value (mode of P(f)) of f is approximately equal to some probability p. So that f ~ p. But often they are not the same. It is this very separation and flexibility which will allow an Objective Bayesian to answer the question “what should the outcome be” in far more instances than the Frequentist can. This is true even in problems that explicitly involve frequencies!
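
    As a minimal sketch of that separation (made-up numbers, purely illustrative): take n flips with a known chance p of heads and look at the distribution P(f) of the observed frequency f = k/n. The mode of P(f) sits at p, yet f is an experimental fact that can easily land well away from p.

        # Illustrative sketch (made-up numbers): the distribution P(f) of an
        # observed frequency f = k/n versus a single probability p.
        from scipy.stats import binom

        p, n = 0.3, 20                   # assumed chance of heads, number of flips
        pmf = [(k / n, binom.pmf(k, n, p)) for k in range(n + 1)]

        mode_f = max(pmf, key=lambda fp: fp[1])[0]
        print(mode_f)                    # 0.3: the mode of P(f) sits at p ...

        far = binom.cdf(3, n, p) + 1 - binom.cdf(8, n, p)
        print(far)                       # ~0.22: yet P(f <= 0.15 or f >= 0.45)
                                         # is sizeable, so f is not p.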

    As an added bonus, the entire subject becomes much more down to earth. Instead of focusing all ones attention on phantom experiments that will never be performed, or made up “populations” that don’t exist, you can concentrate on what’s really known and what you would really like to know.

    • guest: “The Frequentist asks “what will the outcome be?”. The objective Bayesian asks “what should the outcome be based on a given state of knowledge?”” What? I don’t know where you’ve learned what frequentists say about probability, let alone statistical inference (which in any event generally concerns queries about the processes resulting in data, not predictions). Not sure where to start with the rest…. Objective Bayesianism tries to avoid being relative to a state of knowledge (search the blog). It is the frequentist sampling theorist who compares the observed to what should/would have been expected under hypothetical scenarios. That is very different from updating states of knowledge. I could continue with each of your remarks….

    • original_guest

      Hello “guest”. As per the board-manager’s request a few days ago, could you use a different board name please? I used to post quite a bit as “guest”, it’s confusing when you also do – our statements do not agree. Thanks

      • original guest: Thanks for the clarification; I did think the Elba Admin was going to issue a reminder to the unoriginal guest, especially as I’m traveling and blogging on the go.

    • Nicole Jinn

      To: guest: Yes, I acknowledge that we often want to know what evidence the *observed* data give of the hypothesis being investigated; however, relying on the observed data is *not* enough *if* a competent scientist wants to better defend their experimental claims against challenges others in the scientific community might put forward. And this (I think) requires looking into other *hypothetical* situations.

      • Nicole: Thanks! I totally agree. Many people imagine that considerations of the reliability and probativeness of a method are “add-ons” and thus somehow separable from an inference. For an error statistician, there is no inference without some consideration of the relevant sampling distribution. As Pearson pointed out, one cannot determine if a “fit” between data and model, say, is really large or small without considering how often one would get so good an agreement, even if the model or claim were (specifiably) false. This is indeed a “how often” claim, but it’s directly relevant to scrutinizing the inference at hand.

  3. I agree with most of this, Deborah.
    I was trying to get to the main points of
    difference between frequentist and Bayes so indeed
    I simplified a bit.
    To me there are really two main goals in frequentist inference.
    The first is validity: for example, correct long run coverage.
    The second is efficiency: short intervals, high power etc.
    I like to think of it in stages. First, find valid procedures.
    Then, try to optimize efficiency.
    But I do put validity first:
    Primum non nocere (first, do no harm).

    Larry

    • Normal Deviate: Thanks for this (nice to see you over here on Elba). The thing is that billing frequentist inference as primarily aiming for low long-run error rates has done harm. That is why I dropped my planned post on our Philosophy of Science Association symposium session on RCTs in development economics when I saw your blog. If Larry Wasserman speaks, people listen, so there is a danger of providing more grist for the mills of those who are happy to dismiss frequentist methods as utterly irrelevant for evaluating the data/hypotheses/model at hand. Never mind that some of the same critics are just as happy to output the same numbers/inferences, only with different (and often quite vague) interpretations.

  4. Anon

    “Moreover, one often finds the most predominant (“conventional” Bayesian) methods get off the ground by echoing the numbers reached by frequentists.”

    The methods you talk about date back to Laplace (who usually assumed uniform priors). So to be historically correct, the frequentist methods were rediscoveries of Laplace’s Bayesian methods from a century before.

    • Aris Spanos

      Claiming that “the frequentist methods were rediscoveries of Laplace’s Bayesian methods from a century before” would make Fisher turn in his grave! Those who care enough about the historical development of statistics should know that Fisher did answer that question in a series of exchanges he had with Jeffreys in the early 1930s with an unequivocal NO!

      • Anon

        In Mayo’s paper ‘Error Statistics’, the one example for which she calculates SEV(mu > mu1) involves the identical calculation that is made for the Bayesian P(mu > mu1 | data) for the same problem.

        This identical calculation (I don’t mean “similar” or “analogous”) almost certainly appears verbatim in Laplace’s “Théorie analytique des probabilités”, which was published in 1812.

        Far be it from me to tell others how to do their job, but if I were philosophizing about statistics and making claims that “Bayesians were just piggybacking off Frequentist successes”, I’d probably want to familiarize myself with seminal works in the field like Laplace’s “Théorie analytique des probabilités”.
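
        To be concrete about what I mean by “identical”, here is a minimal sketch (with made-up numbers) for the standard one-sided Normal test with sigma known: the severity for mu > mu1 and the posterior P(mu > mu1 | data) under a uniform prior reduce to the very same tail area.

            # Minimal sketch (made-up numbers): one-sided Normal test, sigma known.
            # Severity for the claim mu > mu1 versus the flat-prior posterior
            # probability that mu > mu1: both reduce to the same tail area.
            from math import sqrt
            from scipy.stats import norm

            sigma, n = 1.0, 25          # assumed known sigma and sample size
            xbar_obs = 0.4              # hypothetical observed sample mean
            mu1 = 0.2                   # the claim of interest: mu > mu1
            se = sigma / sqrt(n)

            # Severity: probability of a sample mean no larger than the one
            # observed, were mu only mu1.
            sev = norm.cdf((xbar_obs - mu1) / se)

            # Posterior P(mu > mu1 | data) under a uniform (improper) prior:
            # the posterior for mu is Normal(xbar_obs, se^2).
            post = 1 - norm.cdf((mu1 - xbar_obs) / se)

            print(sev, post)            # both ~0.841 = Phi((xbar_obs - mu1)/se)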

  5. Ok I don’t know if I’m repeating really old points here, but I’ll say I agree that “long run properties” is not the best way to describe NP testing for the following reasons: the usual Neyman-Pearson “error rate” is not the “rate at which you would make errors in a long series of trials”, fairly obviously. An error rate is something like “the probability that the test would indicate H *if* H was false” which I would write something like:

    (1) P( T=H | ¬H )

    Whereas the number of times you will make such an error in lots of trials is either:

    (2) P( T=H & ¬H ) = P( T=H | ¬H ) P( ¬H )

    if you are a Bayesian, or (as far as I understand) unknowable (at least, not the “goal”) if you are a frequentist, since a frequentist usually will say it is not sensible to assign a probability to “¬H”.

    You don’t want to confuse the two (I’m not suggesting Wasserman’s post confuses the two – probably he and most of us can understand the difference between these two things! It’s just that the statement “long run probability” can be so easily read as (2)).

    Also, I think Bayesian “long run” arguments tend to be about (2) rather than (1) anyway — certainly an argument of the form “if I bet on the truth of H then I will lose money p% of the time” is talking about (2) — so they aren’t talking about the same thing as the NP notion of an “error rate” or “error probability”.

    Of course, (2) is bounded above by (1), but this means that a test could be awful by the NP standard (1) even though by (2) it seems to make few errors: if you saw a bag being filled with 20 black balls and one white one, then you will make an error in 1/21 of random draws if you always bet on black, but “always bet on black” is surely a poor NP test procedure to determine the color of the next ball, since the probability that you will bet on black given that the next ball is white (1) is one!
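
    To put numbers on that silly example, here is a quick simulation sketch (purely illustrative): quantity (2) comes out around 1/21, while quantity (1), evaluated on the draws where the ball really is white, is 1.

        # Illustrative simulation: a bag with 20 black balls and 1 white ball,
        # and a "test" that always declares the next ball black.
        import random

        random.seed(0)
        N = 200_000
        white_draws = 0    # draws where the ball is actually white
        errors = 0         # draws where the declaration is wrong

        for _ in range(N):
            ball = random.choice(["black"] * 20 + ["white"])
            if ball == "white":
                white_draws += 1
                errors += 1              # "always bet black" errs on every white ball

        print(errors / N)                # ~1/21 ~ 0.048: quantity (2), P(T=black & white)
        print(errors / white_draws)      # = 1.0:          quantity (1), P(T=black | white)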

  6. Aris Spanos

    James: Although I agree with the spirit of what you are saying, the notation you use might give the false impression that frequentist error probabilities are conditional probabilities; they are not! In frequentist statistics conditioning on hypotheses makes no probabilistic sense, because the parameters are NOT random variables but unknown constants. Frequentist error probabilities are defined as tail areas of sampling distributions evaluated under different scenarios relating to the unknown parameter of interest; theta=theta0, theta < theta0, etc.
    A less confusing way to articulate the meaning of the "long run properties" is to distinguish between error probabilities as measures that "calibrate" the capacity of a test, and the long run metaphor that we use to explain that capacity. The power of an alpha-significance level test measures the capacity of a particular test to reject the null when it should [evaluated at different discrepancies from the null], but erring with probability alpha. In this sense, the capacity of the test refers to this particular test applied to this data. The "long-run" is just a metaphor that provides a way to imagine what that means in practical terms where the error probabilities are related to actual relative frequencies.
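
    To illustrate with a minimal numerical sketch (the numbers are made up): the power of a one-sided alpha-level z-test is simply a tail area of the sampling distribution of the sample mean, evaluated under different scenarios for the unknown mean; no actual long run of repetitions enters the calculation.

        # Illustrative sketch: power of a one-sided z-test of H0: mu <= 0 vs
        # H1: mu > 0, Normal data with known sigma (made-up numbers).
        from math import sqrt
        from scipy.stats import norm

        alpha, sigma, n = 0.05, 1.0, 25
        se = sigma / sqrt(n)
        z_alpha = norm.ppf(1 - alpha)    # reject H0 when xbar > z_alpha * se

        for mu1 in [0.0, 0.1, 0.2, 0.4, 0.6]:
            # Power at discrepancy mu1: P(xbar > z_alpha*se; mu = mu1), a tail
            # area of the sampling distribution evaluated under that scenario.
            power = 1 - norm.cdf(z_alpha - mu1 / se)
            print(f"mu = {mu1:.1f}: power = {power:.3f}")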

    • Paul

      In what contexts is it productive to assert that the parameters are unknown constants? This assertion should be useful in some of Physics. But it’s going to underperform in a lot of other areas, especially when human beings are involved.

    • Aris and Mayo: Thanks for your responses. In short, yes I agree that the conditional probability formulation probably isn’t a good reflection of the frequentist approach, but using it helps me to understand the different ways that you could formulate a “long run error rate” — that is, if you do accept (1), then (2) seems to follow quite naturally by the definition of conditional probability, and then you would be led towards thinking of the purpose of the error rate as only to limit e.g. the number of times you get a “wrong” answer over the course of lots of experiments, which I don’t think is really the point, and I think we agree on that at least.

      With regard to silly examples, I thought my silly example was rather like this one as well: https://errorstatistics.com/2011/09/03/overheard-at-the-comedy-hour-at-the-bayesian-retreat/

  7. STJ

    All this interesting discussion reminds me of the paper by David Freedman:

    Some issues in the Foundation of Statistics – Foundations of Science 1 (1) (1995) [1]
    [it is important to also read the comments and the rejoinder]

    One of the key issues is the statistical model, and Profs Mayo and Spanos have written a lot on this.
    [Specification, checking/validation, selection, substantive vs statistical, etc.]

    The uncertainty in the substantive model is rarely if ever discussed. The model is assumed, then the inference is done.
    This leads to Paul’s comment, but also to Prof. Wasserman’s comment about the Higgs:
    “The p-value is not illusory. This is not social science where the null is always false and it is only a matter of time until we reject. In this case, the physics defines the null and alternative clearly.” [2]

    In the post by Prof. Wasserman, “Bayesian” mainly means “subjective” Bayesian statistics [with several quotes from Prof. Goldstein].
    In Freedman’s paper, it is the same. Jim Berger’s comments on Freedman’s paper are thus interesting in this context, as is Freedman’s reply.

    As a HEP physicist, I find it highly difficult to understand statistical inference (beyond blindly running the methods):
    – Frequentist with the hybrid testing or decision-making framework [which is mainly irrelevant in fundamental physics]
    – Bayesian, beyond the usual “turn the crank”. Moreover, most Bayesian books seem to be propaganda against frequentism (for ex: the problem of constraints vs priors [3])
    (and thanks to Profs Spanos and Mayo for deconstructing the propaganda)
    The second difficulty is that Bayesian statistics is presented as “unified” (i.e., Bayes’s theorem) but in reality there are so many faces: “subjective”, “objective”, …
    (see for example: Bayesian Analysis volume 1 number 3 [4] or Prof David Cox’s concluding talk at JSM 2011 [5]).
    – Thirdly, there is the issue wonderfully formulated by Prof Senn: “Mathematical statistics is full of lemmas whereas applied statistics is full of dilemmas”.
    [I don’t count the many times in data analysis that a stat theorem does not apply (because the model is not regular (for the likelihood), etc.)]

    As a scientist, when I start investigating a new field (here statistics), I want to understand the meaning of the methods (what’s the meaning of the number I got at the end), the domain of validity, etc. With statistics, this is a lot more difficult with the standard books (there are some exceptions: Cox & Hinkley 1974, Spanos 1999, Gelman 2003). I wonder, with this state of affairs, how students in statistics can succeed in understanding their field.

    STJ

    [1] http://philpapers.org/rec/FRESII

    [2] http://normaldeviate.wordpress.com/2012/07/11/the-higgs-boson-and-the-p-value-police/
    http://bayesian.org/forums/news/3648

    [3] http://statistics.berkeley.edu/~stark/Preprints/constraintsPriors12.pdf

    [4] http://ba.stat.cmu.edu/vol01is03.php

    [5] http://www.youtube.com/watch?v=LE3rhuD7zhk
    http://www.amstat.org/meetings/jsm/2011/

    • Christian Hennig

      STJ: I don’t know whether you find this helpful but your posting tempts me to repost something that I wrote as a comment on Prof. Wasserman’s text on his blog (with some tiny modifications):

      I think that there is one issue that deserves more focus.
      I think that the meaning of the idea of a “true parameter” is central for frequentism. Even if somebody does Bayesian inference, as long as there is still the idea that there is a true sampling model governed by some unobservable parameter, the analysis is essentially frequentist as far as I see it.
      Bayesian analysis is often done using the standard setup prior(parameter)*sampling_distribution(.|parameter), but the interpretation of probability implied varies a lot. Some valid options are, as far as I see it,
      1) the parameter is only an artificial mathematical device in order to set up predictive distributions for future observation (de Finetti),
      2) *both* prior and sampling distribution have a physical/frequentist/propensity meaning, as can be justified sometimes (e.g. in the physical experiment described in Bayes’s original paper) but not in the vast majority of applications,
      3) all probabilities (including the resulting predictive ones) are not intended to have an operational/physical meaning but are plausibilities resulting from a logical analysis of existing knowledge (Jeffreys/Jaynes).
      However, it seems to me that very often Bayesian analysis is interpreted as if the sampling distribution has a physical meaning, there is a true parameter, and the posterior tells us something about what we know about it given the data and the prior information. But I don’t think that any justification of probability calculus existing in the literature licenses a mixed use of the sampling distribution as something “out there in the real world” whereas the prior formalises a belief about something that is not in any sense “distributed” in nature. Either both should be frequentist and about “data generating processes in the real world”, or both should be logical and about beliefs. Otherwise meanings are confused.
      When interpreting results very carefully, Jaynes’ approach accommodates belief about both “true (unobservable) states of reality” and what will be observed in the future, but even there it is not really taken into account how essentially different the considerations required for setting up the parameter prior and the sampling distribution usually are in practice, and Jaynes himself emphasizes that the sampling model is about logical ideas involving *treating* some things as symmetric/exchangeable, not about a process going on in the real world.
