Oxford Gaol: Statistical Bogeymen

Memory Lane: 3 years ago. Oxford Jail (also called Oxford Castle) is an entirely fitting place to be on (and around) Halloween! Moreover, rooting around this rather lavish set of jail cells (what used to be a single cell is now a dressing room) is every bit as conducive to philosophical reflection as is exile on Elba! (It is now a boutique hotel, though many of the rooms are still too jail-like for me.)  My goal (while in this gaol—as the English sometimes spell it) is to try and free us from the bogeymen and bogeywomen often associated with “classical” statistics. As a start, the very term “classical statistics” should, I think, be shelved, not that names should matter.

In appraising statistical accounts at the foundational level, we need to realize the extent to which accounts are viewed through the eyeholes of a mask or philosophical theory.  Moreover, the mask some wear while pursuing this task might well be at odds with their ordinary way of looking at evidence, inference, and learning. In any event, for criticisms to avoid begging the question, the standpoint from which the appraisal is launched must itself be independently defended.   But for (most) Bayesian critics of error statistics, the assumption that uncertain inference demands a posterior probability for claims inferred is thought to be so obvious as not to require support. Critics are implicitly making assumptions that are at odds with the frequentist statistical philosophy. In particular, they assume a certain philosophy about statistical inference (probabilism), often coupled with the allegation that error statistical methods can only achieve radical behavioristic goals, wherein all that matters are long-run error rates (of some sort).

Criticisms then follow readily, taking the form of one or both of the following:

  • Error probabilities do not supply posterior probabilities in hypotheses; when they are interpreted as if they do (and some say we just can’t help it), they lead to inconsistencies.
  • Methods with good long-run error rates can give rise to counterintuitive inferences in particular cases.

I have proposed an alternative philosophy that replaces these tenets with different ones:

  • The role of probability in inference is to quantify how reliably or severely claims (or discrepancies from claims) have been tested.
  • The severity goal directs us to the relevant error probabilities, avoiding the oft-repeated statistical fallacies due to tests that are overly sensitive, as well as those insufficiently sensitive to particular errors.
  • Control of long-run error probabilities, while necessary, is not sufficient for good tests or warranted inferences.
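To make the severity idea concrete, here is a minimal sketch for the familiar one-sided Normal test with known standard deviation. The numbers are hypothetical, and the function name is my own; the formula is the usual severity assessment for the inference mu > mu1 after a result is taken to indicate a discrepancy from the null:

```python
from math import erf, sqrt

def phi(z):
    """Standard Normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def severity_mu_greater(mu1, xbar, sigma, n):
    """Severity for the claim mu > mu1, given observed mean xbar from
    n draws of a Normal(mu, sigma^2) sample: the probability of a
    result less extreme than xbar were mu only mu1."""
    return phi((xbar - mu1) / (sigma / sqrt(n)))

# Hypothetical numbers: sigma = 2, n = 100, observed xbar = 0.4.
print(severity_mu_greater(0.0, 0.4, 2.0, 100))  # mu > 0 passes severely (~0.98)
print(severity_mu_greater(0.3, 0.4, 2.0, 100))  # mu > 0.3 is only weakly probed (~0.69)
```

Note how the same observed mean warrants mu > 0 with high severity yet gives only weak grounds for the stronger claim mu > 0.3: the severity assessment is relative to the particular inference, not a blanket property of the test.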

What is key on the statistics side of this alternative philosophy is that the probabilities refer to the distribution of a statistic d(x)—the so-called sampling distribution.  Hence such accounts are often called sampling theory accounts. Since the sampling distribution is the basis for error probabilities, another term might be error statistical.
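To make “sampling distribution” concrete, one can simulate it directly. A minimal sketch, where the Normal model, the choice of statistic, and the 1.96 cutoff are illustrative assumptions rather than anything from the post:

```python
import random
import statistics

random.seed(1)

def sampling_distribution(statistic, n, reps, draw):
    """Simulate the distribution of statistic(sample) over `reps`
    repeated samples of size n from the data-generating process `draw`."""
    return [statistic([draw() for _ in range(n)]) for _ in range(reps)]

# Under H0, X ~ Normal(0, 1); take d(x) = sqrt(n) * (sample mean).
n = 25
d = lambda xs: statistics.mean(xs) * n ** 0.5
dist = sampling_distribution(d, n, 10_000, lambda: random.gauss(0, 1))

# The error probability of the rule "reject H0 when d(x) > 1.96"
# is read off the simulated sampling distribution:
alpha = sum(v > 1.96 for v in dist) / len(dist)
print(alpha)  # close to the nominal 0.025
```

The point of the sketch is that an error probability is a property of the procedure, computed over the distribution of d(x) in hypothetical repetitions, not a property of any single observed outcome.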

The very use of the sampling distribution to make inferences from data is at odds with Bayesian methods where consideration of outcomes other than the one observed is disallowed (likelihood principle).

“Neyman-Pearson hypothesis testing violates the likelihood principle, because the event either happens or does not; and hence has probability one or zero.” (Kadane, 2011, Principles of Uncertainty, CRC Press. For non-commercial purposes it can be downloaded from http://uncertainty.stat.cmu.edu/)

The idea of considering, hypothetically, what other outcomes could have occurred in reasoning from the one that did occur seems so obvious in ordinary reasoning that it will strike many as bizarre that an account of statistical inference would wish to banish such considerations.  And yet, banish them the Bayesian must[i]—at least if she is being coherent.  It may be surprising to discover that the Bayesian mask, if you wear it consistently, only has eyeholes for likelihoods (once the data are in front of you). (See earlier posts on the likelihood principle, and my contribution to the special RMM volume.)
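The clash can be made concrete with the classic stopping-rule example (a standard textbook illustration, not taken from this post): 9 successes and 3 failures yield proportional likelihoods whether n = 12 was fixed in advance (binomial) or sampling continued until the 3rd failure (negative binomial), yet the sampling distributions, and hence the error probabilities, differ.

```python
from math import comb

# Data: 9 successes, 3 failures; test H0: p = 0.5 against H1: p > 0.5.
# Both designs give likelihoods proportional to p^9 * (1-p)^3,
# so the likelihood principle deems the evidence identical.

# Design 1: n = 12 fixed. p-value = P(X >= 9 successes | p = 0.5).
p_binom = sum(comb(12, k) for k in range(9, 13)) / 2**12

# Design 2: sample until the 3rd failure. The 3rd failure arrived on
# trial 12, so p-value = P(at most 2 failures in the first 11 trials).
p_negbin = sum(comb(11, k) for k in range(0, 3)) / 2**11

print(round(p_binom, 4))   # 0.073
print(round(p_negbin, 4))  # 0.0327
```

Whoever respects the sampling distribution gets different error-statistical assessments from the two designs; whoever holds the likelihood principle must regard the difference as irrelevant.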

What is key on the philosophical side is that error probabilities may be used to quantify the probativeness or severity of tests (for a given inference).

The twin goals of probative tests and informative inferences constrain the selection of tests.  I am prepared to grant that an overarching philosophy of science and statistics is needed to guide the use and construal of tests (whether of the N-P or Fisherian varieties), and to allow that formal methodology does not automatically give us methods that are adequate for controlling and assessing well-testedness.  (Otherwise, it would be very hard to explain how so many clever people raise those same criticisms and misinterpretations of tests!)

In this philosophy of science, inquirers find things out piecemeal.  Perhaps if scientists had to bet on a theory they could, but that is precisely the difference between such conjecturing (e.g., “I’ll bet GTR will break down somewhere!”) and what must be done to learn from evidence scientifically.  Rather than try to list all possible rivals to a hypothesis of interest, together with degrees of probability for each (however one likes to interpret these), progress is made by splitting off questions and developing the means to probe them by a series of distinct, pointed questions.  (See Oct. 30 post.) An account of inference, as I see it, should also illuminate how new hypotheses are constructed and discovered based on scrutinizing previous results, and by unearthing shortcomings and limitations that are communicated systematically by other researchers. Any account that requires an exhaustive list in advance fails to capture this work.

To allude to the example of prion transmission with which I began early posts to this blog, researchers only start out with vague questions in hand: what is causing the epidemic of kuru among the women and children of the Fore tribe?  Is it witchcraft, as many thought?  Determining that it was due to cannibalism was just a very first step: understanding the mechanism of disease transmission was a step-by-step process.  One can exhaust answers to questions at each step precisely along the lines of the local hypothesis tests and estimation methods offered in standard frequentist error statistics. (There is no difference, at least from the current perspective, if one formulates the inference in terms of estimation.)

I have put, and will continue to put, flesh on the bones of these skeletal claims!

[i] This is true for likelihoodists as well. Since writing this 3 years ago, I’ve learned of Gelman’s brand of Bayes, which does not proceed by inductively obtaining posteriors and which uses a sampling distribution. There are other non-standard accounts. Whether they capture reasoning from severe error probes I cannot say. If they do, I regard them as error statistical.

Categories: 3-year memory lane, Bayesian/frequentist, Philosophy of Statistics, Statistics


30 thoughts on “Oxford Gaol: Statistical Bogeymen”

  1. “This week, several top infectious disease experts ran simulations for The Associated Press that predicted as few as one or two additional infections by the end of 2014 to a worst-case scenario of 130″ in the U.S.

  2. Michael Lew

    Mayo, I agree that a conventional Bayesian account has no mechanism for dealing with aspects of the experiment that do not appear in the likelihood function, and that precludes the incorporation of considerations of stopping rule etc. into the posterior probability distribution. However, I am not sure that we should interpret the likelihood principle as prohibiting consideration of those aspects when we make inferences.

    The likelihood principle says that the evidence in the data relevant to the parameter of interest in the statistical model is entirely contained in the relevant likelihood function. It implies that stopping rules etc. do not affect the evidence, but it does not (well, _should_ not) say that one cannot take extra-evidential aspects of the experiment into account when making inferences.

    All that is needed to bring together error statistics and likelihood analyses is an acceptance that the information in the experimental design that does not affect the likelihood function can play a role in inference that is distinct from that of the evidence.

    • Michael: Some actually say there can be no distinction in inference, conclusion, decision or anything else. Savage is one, but I don’t have a handy quote. So you would, I take it, disagree with him. In any event, there are things that are distinct from the evidence, but the sample space isn’t one of them–at least to error statisticians. So we’re back to the earlier rounds we’ve had. I don’t know how you classify these extra-evidential things. We discuss attempts to place them in the prior (e.g., Mayo and Kruse).

  3. anon

    Dr. Mayo,

    Just out of curiosity, are there any parts of Bayesian statistics which you think are legitimate and condone, other than those that can be given a solid Frequentist foundation where probabilities are interpreted as frequencies? If so, which truly Bayesian methods do you approve of?

    • anon: Is the mere use of conditional probability or inverse inference part of Bayesian statistics? It would be absurd to oppose a theorem. Then of course many people feel it’s altogether appropriate for personal beliefs and decision-making. I can understand and I use the idea of having a “hunch” without reasons, just a guess. I just think that’s different than evaluating evidence for inference when the goal is consciously probing errors. By the way, I don’t particularly like the “frequentist”/Bayesian comparison. It’s foisted on us, but wouldn’t be my way to get at the nitty gritty. What is relevant is (a) the use of error probabilities to control and evaluate the ability of a method to discern misleading/erroneous inferences and (b) the use of that information to reason about the source of the data at hand (not about long-runs).

      I’m sitting here in a hospital, so I don’t know how long my connection will last, or whether this is altogether clear Mr. anon.
      (your name came up with ! for some reason, had to manually approve)

      • anon

        Let’s take it for granted opinions or beliefs have no place in science. Are there any legitimate Bayesian methods that assign probabilities to something like fixed parameters (which thus can’t be given a frequentist interpretation or have any connection to “error probabilities”)?

        • Anon: False to say there can’t be “any connection”: of course fixed parameters can be related to error probabilities. Do we not give error probs of measuring instruments for tables with fixed lengths? Etc., etc.

          • anon

            Those are probabilities for the measurement errors surely. Are there any instances in which it’s legitimate to assign a probability distribution directly to the fixed table length itself?

            Are there any Bayesian methods that do this which you consider ok?

            • Anon: Don’t disregard the error probs so readily before moving on to an entirely different question (like how often do they make 30 inch round tables). Take seriously, for once, the way knowledge of the probe’s capability or the tool’s shape reveals information about the object or process AS measured.

              • anon

                I’m not moving onto a new question, I’m asking the same simple question repeatedly. Forget about “how often do they make 30 inch round tables”. Suppose only one table was ever made or ever will be made by that carpenter.

                Are any Bayesian methods that assign a probability distribution to that table legitimate and Ok, or should such methods be discontinued?

                Your response about “how often they make 30 inch round tables” suggests “only when that distribution can be interpreted as a frequency”.

                • Anon: You think it’s the same question because you think there is only one way, or one best way, that probabilities may be used to reach and qualify inductive inference.
                  You’re the one who needs to answer YOUR question: Are there Bayesian methods that assign a probability distribution to that unique (by definition) table that are useful and warranted? And please tell us how you’re interpreting them.

                  • anon

                    But I already know my answer. I’d like to know yours. What’s the big deal with sharing your views?

                    • What’s the big deal with sharing yours, since you know it?

                    • anon

                      Dr. Mayo, all I’m asking for is basic clarification of your views to avoid misunderstandings.

                      After writing several books, dozens of peer reviewed papers, and an entire career thinking about the foundations of statistics, do you think some Bayesian methods using non-freq prob distributions are legitimate or are they not Ok and should be abandoned?

                    • Anon: I think I answered your question earlier. Now you need to explain the contexts and interpretations where you endorse them, so we can avoid misunderstandings. Thanks.

                    • anon

                      Huh? I have no idea from what you said what your answer to the question is. John Byrd said “no”, I say there are legitimate (for science) Bayesian methods that rely on non-freq prob distributions.

                      What say you?

                    • Anon: I take it my answer was in this comment https://errorstatistics.com/2014/10/31/oxford-gaol-statistical-bogeymen-3/#comment-100731
                      As I noted, “rely on” isn’t the same as “are justifiable by”.

                      If you explain in what way you favor Bayesian stat for science, I can know what you’re talking about, e.g., subjective beliefs, Bayesian updating Bayes boost measures, default Bayes (and which of many), etc.

                      I take it to be a matter of goals.

                    • anon

                      Are you saying there are no instances (in science) in which it’s justifiable to use non-freq prob distributions? Should such methods be dropped in favor of methods that exclusively interpret prob distributions as frequencies?

                    • No, I am not. But you’ve yet to explain what you mean by legitimate, justifiable, non-freq. Saying it’s not something doesn’t tell us what it is. Most of science doesn’t even use formal statistics.
                      Unless you explain your meaning, there’s no point in continuing this conversation, and I’m leaving town….

                    • anon

                    John Byrd, who seems to be a practicing devotee of Error Statistics, denied there were any instances in science in which it’s ok to use a non-freq prob distribution. His understanding of Error Statistics seems to contradict yours.

                      It would be helpful if you could explain to him (and us) why it is sometimes justified in Error Statistics to use probability distributions which aren’t interpretable as frequency distributions.

              • anon

                Even simpler: are there any Bayesian methods using non-frequency probability distributions which are legitimate in your mind and ok to use?

                • john byrd

                  Anon: What is an example of a non-freq probability distribution that is useful with Bayes’ theorem?

                  • anon


                    To take Mayo’s table example. Probabilities like P(table | A) represent the uncertainty in the length of the table given that all we know about the table at first is “A”. It “represents” this uncertainty by specifying a range of plausible values compatible with A.

                    For example, if “A” is the fact that the table is currently in my dining room, I know immediately its length is somewhere between 0 and 25 ft. A prior P(table|A) which places its probability mass over that range then accurately reflects my initial uncertainty in the table’s length (although perhaps I could come up with a better A which narrows it further in real life).

                    Note this prior is not a frequency of anything, even approximately, but it is objective. If the length of the table is not in the range suggested by the prior probability mass, then the prior is objectively wrong and misleading.

                    The role of Bayes theorem and the data is to produce a posterior P(table|data, A) whose probability mass covers a narrower range than 0-25. The initial uncertainty (or range of plausible values) has been reduced by the data to a subrange of 0-25 ft.
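The updating just described can be sketched with a simple grid approximation. The 6.1 ft tape reading and the Normal(0, 0.1 ft) measurement-error model are hypothetical assumptions for illustration:

```python
import math

# Grid over the commenter's prior range of 0-25 ft.
lengths = [i * 0.01 for i in range(2501)]
prior = [1.0 / len(lengths)] * len(lengths)   # uniform: "somewhere in the room"

def grid_posterior(prior, likelihood):
    """Bayes' theorem on a grid: normalize prior * likelihood."""
    unnorm = [p * likelihood(x) for p, x in zip(prior, lengths)]
    z = sum(unnorm)
    return [u / z for u in unnorm]

# Hypothetical measurement: tape reads 6.1 ft, with Normal(0, 0.1 ft) error.
likelihood = lambda x: math.exp(-0.5 * ((6.1 - x) / 0.1) ** 2)
post = grid_posterior(prior, likelihood)

# The 0-25 ft spread collapses to a narrow band around the reading.
supported = [x for x, p in zip(lengths, post) if p > 1e-6]
print(min(supported), max(supported))  # roughly 5.6 to 6.6 ft
```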

                    In a scientific context, we might be trying to measure the mass of a neutrino. If the mass of the neutrino were above a certain value it would radically change the observed properties of the universe we live in. Although this prior information isn’t a frequency it’s very strong (much better confirmed than the experiments used to measure the mass directly).

                    We can use a prior for the neutrino’s mass which narrows the range of possible values based on this (non-freq) information.

                    (Incidentally, it’s hard to see how a Frequentist would use this information. It has no effect on either the measurement model structure or the sampling distribution, so it doesn’t change the Frequentist’s answer at all.)

                    Note this view of probability isn’t an alternative to either frequentism or subjective Bayes; rather, it’s a generalization of both. In this view, frequentism and subjective Bayes are essentially special cases.

                    In particular, this view can get more out of frequencies than frequentism can. Why? Because frequentism admits only one connection between frequencies and probabilities (i.e., prob = freq), while this view allows for more general connections.

                    • john byrd

                      Anon: “Note this prior is not a frequency of anything, even approximately, but it is objective. If the length of the table is not in the range suggested by the prior probability mass, then the prior is objectively wrong and misleading.”

                      I suppose you know how I am reacting to the second sentence: I suggest that where we can verify the prior is wrong and misleading, we did not need a prior (just measure the table and give a tolerance limit on the value), and where we cannot, such practice is some sort of parlor trick.

                    • john byrd

                      Anon: To discuss further, what you are doing is taking a singular fact – the table cannot be larger than the room it resides in – and assigning numbers to the values 1-24 (say) and then calling it a probability mass. I say these are not probabilities unless the distribution is based on something more tangible, namely frequencies established through a reliable procedure. In your example the prior information is a fact that helps you select the right tape measure from the toolbox. Dressing it up as something more and hitching it to a likelihood offers nothing but confusion for the lay reader. I suppose it looks elegant, or something like that.

            • john byrd

              Anon: It appears to me that one can assign a probability to the process of producing a table, provided it is well understood. That is, if I know quite a lot about the method of manufacture, then I can give the probability the product will fall within some specified interval of length. This prob is only useful to the degree that I really understand the process. It is useful, in other words, to the degree I would know it if it is wrong. It makes no sense to state flatly that the already produced table has a prob of being a certain length. It is what it is, and for any measurement value (say, to the closest mm with a rounding rule) the prob is either 0 or 1. Recall that the interval with a prob associated is assigned to the process of producing tables. Perhaps you would assign a prob to my degree of belief that the table is a certain length?

              • John: But the question (to me) is: how can probabilities be used to reach ampliative (non-deductive, fallible) inferences about properties of phenomena or entities? It cannot be assumed that the only or even the best way is the probabilist’s assignments. I frankly don’t know what they mean except as a kind of bookkeeping for rating the evidence for claims – where that evidence has been obtained by other means. So, for example, I can imagine that if H has passed a very severe test, I give it .9. Put aside performing probabilistic computations; just record keeping. Things get problematic with poor evidence. Does .5 mean the evidence is middling, as often wrong as right, ignorance, indifference, never thought about it, etc.? In other words, excellent evidence for H, obtained by stringent and severe tests, can be given a high number in your (0-1) bookkeeping, if you’re so inclined. All the work’s been done by other means. It’s when you have poor evidence, and need to figure out what threats of error you face, etc., that the bookkeeping most obviously comes up short. We want methods that do work for us. That work requires error probs of sampling distributions – at least in those cases where questions are to be tackled using formal statistics. Otherwise, as in the majority of science, one should resort to other means of inductive inference and learning. Here one still uses the principles of evidence and severe tests, but the assessment is qualitative. The strongest inductive arguments, I claim, use only qualitative severity arguments. (And there too, I spoze, the bookkeeping can be done with numbers from (0-1), if one insists, but I doubt one can consistently rate claims this way.) In any event, once again, it’s entering at the point of window-dressing (i.e., I am talking now of inquiry that is not a matter of formal statistics).

              • john byrd

                “Even simpler: are there any Bayesian methods using non-frequency probability distributions which are legitimate in your mind and ok to use?”. I should have noted my answer is No.

                • John: I guess you mean for scientific contexts. One of the big difficulties seems to be the matter of background information. I have often heard people say that using background information and knowledge is somehow Bayesian, as if we want to start with blank slates. Call it the repertoire of background information about experimental design, models, and the subject matter.

                  Now if someone thought, quite incorrectly, that the only or even a good way for background information to enter would be by means of a probability distribution, then one might say, “of course there are cases when we must be Bayesian” and really only mean “of course we must use warranted background information”.

                  Still, there’s the question of how to use it, and I take it there are disagreements here. If another study showed something conflicting the current one, Bayesians seem to want to average it in so as to change the import of the current evidence, whereas we would likely not. Unless it was some kind of ongoing assembly line updating of priors.

                  The other thing about which there is a great deal of unclarity is: what does it mean for there to be a frequentist prior for a hypothesis or a parameter?

  4. john byrd

    It appears to me that the only reason one would be compelled to capture (somehow) the background info in a single number or distribution is because you have compelled yourself to take the entire inference from the posterior number or distribution. Many of us reject these requirements and do not believe it is reasonable to attempt to quantify the background considerations. It seems it will often be incomplete and misleading.
