Return to the Comedy Hour: P-values vs posterior probabilities (1)

Comedy Hour

Did you hear the one about the frequentist significance tester when he was shown the nonfrequentist nature of p-values?

JB [Jim Berger]: I just simulated a long series of tests on a pool of null hypotheses, and I found that among tests with p-values of .05, at least 22%—and typically over 50%—of the null hypotheses are true!(1)

Frequentist Significance Tester: Scratches head: But rejecting the null with a p-value of .05 ensures erroneous rejection no more than 5% of the time!

Raucous laughter ensues!

(Hah, hah,…. I feel I’m back in high school: “So funny, I forgot to laugh!”)

The frequentist tester should retort:

Frequentist Significance Tester: But you assumed 50% of the null hypotheses are true, and computed P(H0|x) (imagining P(H0) = .5), and then assumed my p-value should agree with the number you get, if it is not to be misleading!

Yet, our significance tester is not heard from as they move on to the next joke….

Of course it is well-known that for a fixed p-value, with a sufficiently large n, even a statistically significant result can correspond to a large posterior probability for H0 [i]. Somewhat more recent work generalizes the result, e.g., J. Berger and Sellke, 1987. Although from their Bayesian perspective, it appears that p-values come up short as measures of evidence, the significance testers balk at the fact that use of the recommended priors allows highly significant results to be interpreted as no evidence against the null, or even evidence for it! An interesting twist in recent work is to try to “reconcile” the p-value and the posterior, e.g., Berger 2003 [ii].

The conflict between p-values and Bayesian posteriors is typically illustrated with the two-sided test of the Normal mean, H0: μ = μ0 versus H1: μ ≠ μ0.

“If n = 50 one can classically ‘reject H0 at significance level p = .05,’ although Pr (H0|x) = .52 (which would actually indicate that the evidence favors H0).” (Berger and Sellke, 1987, p. 113).

If n = 1000, a result statistically significant at the .05 level leads to a posterior to the null of .82!


Table 1 (modified) from J.O. Berger and T. Sellke (1987) “Testing a Point Null Hypothesis,” JASA 82(397): 113.
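The two posterior figures quoted above can be reproduced with a few lines of arithmetic. This is only a sketch under an assumed setup: prior mass .5 on H0 and the remaining .5 spread as a N(μ0, σ²) distribution over the alternative, with z = 1.96 standing in for a two-sided p-value of .05. The function name and parameter names are illustrative.

```python
import math

def posterior_null(z, n, prior_null=0.5):
    """P(H0 | x) for H0: mu = mu0, assuming prior mass `prior_null` on H0
    and the rest spread as N(mu0, sigma^2) over the alternative."""
    # Bayes factor in favor of H0 for observed z = sqrt(n) * (xbar - mu0) / sigma
    bayes_factor = math.sqrt(1 + n) * math.exp(-0.5 * z * z * n / (n + 1))
    prior_odds = prior_null / (1 - prior_null)
    posterior_odds = prior_odds * bayes_factor
    return posterior_odds / (1 + posterior_odds)

Z_05 = 1.96  # z-value for a two-sided p-value of .05
for n in (50, 1000):
    print(n, round(posterior_null(Z_05, n), 2))
```

Under these assumptions the output matches the .52 and .82 above. Lowering `prior_null` lowers the posterior, which is the point pressed in the discussion that follows: the result is driven by the spike on H0.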

Many find the example compelling evidence that the p-value “overstates evidence against a null” because it claims to use an “impartial” or “uninformative”(?) Bayesian prior probability assignment of .5 to H0, the remaining .5 being spread out over the alternative parameter space. (“Spike and slab” I’ve heard Gelman call this, derisively.) Others charge that the problem is not p-values but the high prior (Casella and R. Berger, 1987). Casella and R. Berger (1987) show that “concentrating mass on the point null hypothesis is biasing the prior in favor of H0 as much as possible” (p. 111) whether in 1 or 2-sided tests. Note, too, the conflict with confidence interval reasoning since the null value (here it is 0) lies outside the corresponding confidence interval (Mayo 2005). Moreover, the “spiked concentration of belief in the null” is at odds with the prevailing view “we know all nulls are false”. See Senn’s very interesting points on this same issue in his letter (to Goodman).

But often, as in the opening joke, the prior assignment is claimed to be keeping to the frequentist camp and frequentist error probabilities: it is imagined that we sample randomly from a population of hypotheses, some proportion of which are assumed to be true; 50% is a commonly used number. We randomly draw a hypothesis and get this particular one, maybe it concerns the mean deflection of light, or perhaps it is an assertion of bioequivalence of two drugs or whatever. The percentage “initially true” (in this urn of nulls) serves as the prior probability for H0. I see this gambit in statistics, psychology, philosophy and elsewhere, and yet it commits a fallacious instantiation of probabilities:

50% of the null hypotheses in a given pool of nulls are true.

This particular null H0 was randomly selected from this urn (some may wish to add “nothing else is known, or the like”).

Therefore P(H0 is true) = .5.

It isn’t that one cannot play a carnival game of reaching into an urn of nulls (and one can imagine lots of choices for what to put in the urn), and use a Bernoulli model for the chance of drawing a true hypothesis (assuming we could even tell), but this “generic hypothesis” is no longer the particular hypothesis one aims to use in computing the probability of data x0 under hypothesis H0. (In other words, it’s no longer the H0 needed for the likelihood portion of the frequentist computation.) [iii] In any event .5 is not the frequentist probability that the selected null H0 is true. (Note the selected null would get the benefit of being selected from an urn of nulls where few have been shown false yet: “innocence by association”. See my comment on J. Berger 2003, pp. 19-24.)
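For readers who want to see the carnival game itself, here is a rough simulation sketch. It is not Berger’s applet: the pool size, the N(0, 2²) spread of standardized effects for the false nulls, and the p-value window of .04 to .05 are all assumptions chosen for illustration, as are the function and parameter names.

```python
import math
import random

def share_of_true_nulls(prior_true=0.5, effect_sd=2.0, pool_size=200_000,
                        p_window=(0.04, 0.05), seed=1):
    """Among simulated tests whose two-sided p-value lands in p_window,
    return the fraction whose null was 'true' in the urn."""
    rng = random.Random(seed)
    true_hits = false_hits = 0
    for _ in range(pool_size):
        null_is_true = rng.random() < prior_true
        # Standardized effect: 0 for a true null, N(0, effect_sd^2) otherwise (assumed)
        effect = 0.0 if null_is_true else rng.gauss(0.0, effect_sd)
        z = rng.gauss(effect, 1.0)
        p = math.erfc(abs(z) / math.sqrt(2))  # two-sided p-value
        if p_window[0] <= p <= p_window[1]:
            if null_is_true:
                true_hits += 1
            else:
                false_hits += 1
    return true_hits / (true_hits + false_hits)

for prior_true in (0.1, 0.5, 0.9):
    print(prior_true, round(share_of_true_nulls(prior_true), 2))
```

The share of “true” nulls among results near p = .05 moves with whatever proportion was put in the urn and with the assumed spread of effects. The simulation reports a property of the urn, not of the particular H0 under test.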

Yet J. Berger claims his applets are perfectly frequentist, and by adopting his recommended O-priors (now called conventional priors), we frequentists can become more frequentist (than using our flawed p-values) [iv]. We get what he calls conditional p-values (of a special sort). This is a reason for coining a different name, e.g., frequentist error statistician.

Upshot: Berger and Sellke tell us they will cure the significance tester’s tendency to exaggerate the evidence against the null (in two-sided testing) by using some variant on a spiked prior. But the result of their “cure” is that outcomes may too readily be taken as no evidence against, or even evidence for, the null hypothesis, even if it is false. We actually don’t think we need a cure. Faced with conflicts between error probabilities and Bayesian posterior probabilities, the error statistician may well conclude that the flaw lies with the latter measure. This is precisely what Fisher argued:

Discussing a test of the hypothesis that the stars are distributed at random, Fisher takes the low p-value (about 1 in 33,000) to “exclude at a high level of significance any theory involving a random distribution” (Fisher, 1956, page 42). Even if one were to imagine that H0 had an extremely high prior probability, Fisher continues (never minding “what such a statement of probability a priori could possibly mean”), the resulting high posterior probability of H0, he thinks, would only show that “reluctance to accept a hypothesis strongly contradicted by a test of significance” (44) . . . “is not capable of finding expression in any calculation of probability a posteriori” (43). Sampling theorists do not deny that there is ever a legitimate frequentist prior probability distribution for a statistical hypothesis: one may consider hypotheses about such distributions and subject them to probative tests. Indeed, Fisher says, if one were to consider the claim about the a priori probability to be itself a hypothesis, it would be rejected by the data!

UPDATE NOVEMBER 28, 2015: Now I realize that some recent arguments of this sort will bite the bullet and admit they’re assessing the prior probability of the particular hypothesis H* you just tested by considering the % of “true” nulls in an urn from which it is imagined that H* has been randomly selected.  They admit it’s an erroneous instantiation, but declare that they’re just assessing “science wise error rates” of some sort or other. Even bending over backwards to grant these rates, my question is this: Why would it be relevant to how good a job you did in testing H* that it came from an urn of nulls assumed to contain k% “true” nulls? (And think of how many ways you could delineate those urns of nulls, e.g., nulls tested by you, by females, by senior scientists, nulls in social psychology, etc. etc).

(0) If we’re ever going to make progress, or even attain a cumulative understanding, we really need to go back to at least one of the key, earlier criticisms and responses for each classic howler. (This is the first (1) in a “let PBP” series.) Please check comments from this post.

(1) Pratt, commenting on Berger and Sellke (1987), needled them on how he’d shown this long before. I will update this note with references when I return from travels.

[i] A result my late colleague I.J. Good wanted me to call the Jeffreys-Good-Lindley Paradox.

[ii]An applet is available at∼berger

[iii] Bayesian philosophers, e.g., Achinstein, allow this does not yield a frequentist prior, but he claims it yields an acceptable prior for the epistemic  probabilist (e.g., See Error and Inference 2010).

[iv] Does this remind you of how the Bayesian is said to become more subjective by using the Berger O-Bayesian prior? (See Berger deconstruction “irony and bad faith”.)

References & Related articles

Berger, J. O.  (2003). “Could Fisher, Jeffreys and Neyman have Agreed on Testing?” Statistical Science 18: 1-12.

Berger, J. O. and Sellke, T.  (1987). “Testing a point null hypothesis: The irreconcilability of p values and evidence,” (with discussion). J. Amer. Statist. Assoc. 82: 112–139.

Casella, G. and Berger, R. (1987). “Reconciling Bayesian and Frequentist Evidence in the One-sided Testing Problem,” (with discussion). J. Amer. Statist. Assoc. 82: 106–111, 123–139.

Fisher, R. A., (1956). Statistical Methods and Scientific Inference, Edinburgh: Oliver and Boyd.

Jeffreys, H. (1939). Theory of Probability, Oxford: Oxford University Press.

Mayo, D. (2003). Comment on J. O. Berger’s “Could Fisher, Jeffreys and Neyman Have Agreed on Testing?”, Statistical Science 18: 19-24.

Mayo, D. (2004). “An Error-Statistical Philosophy of Evidence,” in M. Taper and S. Lele (eds.) The Nature of Scientific Evidence: Statistical, Philosophical and Empirical Considerations. Chicago: University of Chicago Press: 79-118.

Mayo, D.G. and Cox, D. R. (2006). “Frequentist Statistics as a Theory of Inductive Inference,” Optimality: The Second Erich L. Lehmann Symposium (ed. J. Rojo), Lecture Notes-Monograph series, Institute of Mathematical Statistics (IMS), Vol. 49: 77-97.

Mayo, D. and Kruse, M. (2001). “Principles of Inference and Their Consequences,” in D. Corfield and J. Williamson (eds.) Foundations of Bayesianism. Dordrecht: Kluwer Academic Publishers: 381-403.

Mayo, D. and Spanos, A. (2011). “Error Statistics” in Philosophy of Statistics, Handbook of Philosophy of Science Volume 7 (General editors: Dov M. Gabbay, Paul Thagard and John Woods; Volume eds. Prasanta S. Bandyopadhyay and Malcolm R. Forster.) Elsevier: 1-46.

Pratt, J. (1987). “Testing a point null hypothesis: The irreconcilability of p values and evidence: Comment.” J. Amer. Statist. Assoc. 82: 123-125.

Categories: Bayesian/frequentist, Comedy, PBP, significance tests, Statistics

27 thoughts on “Return to the Comedy Hour: P-values vs posterior probabilities (1)”

  1. Mayo – I’m wondering whether you could clarify your view on something for me. Do you accept that the steps of
    – sampling particular parameter values from a prior p(theta)
    – plugging these into a conditional model p(y|theta) to get a simulated sample

    constitutes a different procedure to what you describe as ‘fallacious instantiation’?

  2. IJ Good’s comment to Berger & Sellke. His suggested standardized P (to take into account sample size) is on p. 127. (This has come up in conversation on the blog and I didn’t have the reference, even though he told me about it often enough.)

    Click to access i-j-good-comment-point-null.pdf

  3. By strange chance, looking up something entirely different, I came across a Gelman post that alludes to an article (by Morris*) that turns out to be a response to Berger and Sellke 1987, and, moreover, a comment on that blog by Jeremy Fox claims the discussion helps him to finally see the connection between my notion of severity and Gelman’s recent analysis of why stat sig with low powered tests “exaggerate” the magnitude of discrepancy. Only, I don’t understand how SEV links to Gelman’s, in that it seems at odds with it. So I must be missing something, and I’m just recording this here as an item to come back to.

    Click to access morris_example.pdf

    This was the Gelman blog:

    This was Fox’s remark:
    Jeremy Fox says:
    October 13, 2015 at 4:47 pm
    Morris’ intuition isn’t just Bayesian, as he himself notes. The final line in Morris’ table 1, which he calls “power @t”, is the (one-tailed) “severity” (sensu Deborah Mayo) of each dataset’s test of the hypothesis that the true parameter is 0.55. 1 minus the p-value is the one-tailed severity of each dataset’s test of the hypothesis that the true parameter is 0.5. All three datasets have the same p-value and so same severity against theta=0.5, but they have radically different severities against values of theta even slightly greater than 0.5. In other words, all three datasets warrant the frequentist, prior-free inference that the election is not a dead heat, and they warrant that inference to the same degree. But only the one with the largest sample size warrants the frequentist, prior-free inference that the candidate of interest is anything more than very slightly favored.
    Just checked to make sure my memory wasn’t off, and yup, the frequentist intuition here concerns severity–see Deborah Mayo’s Error and the Growth of Experimental Knowledge, section 6.4, which uses the binomial distribution as a simple example.
    Not that anyone else has any reason to care, but personally I’m glad that Andrew’s post today finally forced me to figure out how the notion of “severity” maps onto that wonderful “this is what power=0.06 looks like” diagram of Andrew’s. It’s been bugging me.

  4. All discussions about P values take place in the ‘behaving as if true’ mode and in the context of a parametric family of models indexed by \theta. It is assumed that the data were generated by some value \theta, the true value, or perhaps the ‘true’ value, and then given data x the null hypothesis H_0:\theta=\theta_0 for some specified \theta_0 is tested. The alternative H_1 takes various forms such as H_1: \theta \ne \theta_0. Based on \theta_0 a P value is calculated, a small P value being evidence against H_0. The parametric family is not given. At some point a decision is made to base the analysis on this parametric model, but once this decision has been made the manner in which it was arrived at plays no role whatsoever in the ensuing formal statistical inference. Indeed given that the statistician is now in the ‘behaving as if true’ mode it could not be otherwise. How one arrives at the truth, by fair means or foul, is completely irrelevant once one has it. One can object that the P value as a measure of evidence can be no more convincing than the procedure which led to the choice of the parametric model in the first place. Such an objection cannot even be formulated in the ‘behave as if true’ mode.

    A second problem is the use of a parametric model. Unless the model has some theoretical underpinning there are other models which could have been chosen. As an example the first family could be normal with \theta=(\mu,\sigma) and the second log-normal with \theta=(\mu,\sigma) whereby the latter (\mu,\sigma) refer to the log-normal distribution. Suppose the null hypothesis for the first model is H_0:\mu=\mu_0. What is the corresponding hypothesis for the log-normal model?

    Suppose the ‘behave as if true’ mode is replaced by a mode which consistently treats models as approximations to the data. The null hypothesis H_0:\theta=\theta_0 can now be interpreted as the claim that the model with parameter \theta_0 is an adequate approximation to the data. This assumes that \theta_0 fully describes the model. In some cases such as the normal family mentioned above this is not the case. Here H_0:\mu=\mu_0 must be interpreted as: there exists a \sigma such that (\mu_0,\sigma) is an adequate approximation to the data. The alternative H_1:\theta\ne \theta_0 clearly cannot be interpreted as \theta_0 being the only value of \theta which is an adequate approximation so it has no role to play in an analysis based consistently on a concept of approximation.

    The following is an approach based consistently on approximation. Firstly a well-defined concept of approximation is required which will typically include features usually relegated to the choice of model phase and then ignored thereafter. Examples are goodness-of-fit, absence of outliers and the behaviour of certain functionals such as the mean. Given such a definition the set of adequate parameter values can be determined. Given a \theta_0 the ‘evidence’ against \theta_0 can be measured by its distance from the set of adequate values where the scale can be a probability one but need not necessarily be such. The problem of non-commensurability when using different parametric models can be overcome by basing the analysis on a functional. The choice of functional can be guided by, amongst other considerations, equivariance properties and differentiability, to provide a degree of stability.

    • Laurie; how much more skeptical must you be of the supposition in these popular computations that we know the % of true nulls in the urn from which our null was randomly selected? and that we’d want to get the “true” type 1 error probability by the resulting Bayesian (?) computation? Yet this is the basis for what is felt to be the deepest, darkest criticism of significance tests in science!

    • Michael Lew

      I find myself in complete agreement with what Laurie wrote above. I’m surprised and pleased.

      If one takes the model as approximation approach (I think we should in most cases) then the likelihood function — an expression of the evidential content of the data concerning parameters within that model — is an approximation of the evidence provided by the data concerning the real world analogues of the parameters of the approximating model. That’s irresistible, isn’t it? It matches what Laurie wrote:
      “Given a \theta_0 the ‘evidence’ against \theta_0 can be measured by its distance from the set of adequate values where the scale can be a probability one but need not necessarily be such.”
      The appropriate scale is likelihood, which is proportional to probability.

  5. Deborah, you are absolutely right. How do we distinguish balls in an urn? The 50% true ones have a weight of 1gm exactly whereas the false ones have weights of 1.0000000000000ugm for various u in (0,1) but are otherwise indistinguishable from the true ones. A ball is chosen and then 100 observations are taken with mean equal to the weight of the ball and standard deviation 0.001. On the basis of this we have to decide whether the ball is a true one or a false one: clearly no chance. Is this any different from putting a prior probability of 0.5 on \theta=\theta_0=1 and a prior probability of zero on all \theta=1+0.0000000000000u? I find it very strange.

    Stigler, S. M. (1977), “Do robust estimators work with real data? (with discussion)”, Annals of Statistics 5(6): 1055-1098, gives seven data sets of James Short’s determinations of the parallax of the sun based on the 1761 transit of Venus. The modern true value is 8.798. The P value for the seventh data set for the hypothesis H_0:\mu=8.798 is 1.522e-07. If one puts into operation the method I suggested based on approximations then the normal model (8.644,0.192) is consistent with the data. The normalized distance from \mu=8.798 is (8.798-8.644)/0.192=0.802 which can be converted into a probability, for example 1-2*(pnorm(sqrt(21)*0.802)-0.5)=2.37e-04. If one uses M-functionals then (8.638,0.243) is consistent with the data and the standardized distance is now 0.658. This does not of course solve the basic problem with such data, namely the bias of the measurements which can have various causes and can only be corrected by further more precise measurements.
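The arithmetic in this comment can be checked with a short Python sketch. It uses only the numbers quoted above; the sample size of 21 is inferred from the sqrt(21) in the R expression, and the helper name is hypothetical.

```python
import math

def two_sided_normal_tail(z):
    """2 * (1 - Phi(|z|)), the same quantity as 1 - 2*(pnorm(z) - 0.5) in R."""
    return math.erfc(abs(z) / math.sqrt(2))

modern_value = 8.798
n = 21  # sample size implied by sqrt(21) in the comment (an inference)

# Normal model reported as consistent with the data: (8.644, 0.192)
dist_normal = (modern_value - 8.644) / 0.192
# M-functional version, as quoted: (8.638, 0.243)
dist_m = (modern_value - 8.638) / 0.243

tail = two_sided_normal_tail(math.sqrt(n) * dist_normal)
print(round(dist_normal, 3), round(dist_m, 3), tail)
```

The two distances round to the 0.802 and 0.658 given in the comment, and the tail probability lands close to the quoted 2.37e-04.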

  6. Deborah, I have read David Colquhoun’s paper and do indeed have difficulties with it. He writes ‘the test is almost useless anyway’ and further ‘the false discovery rate of 86% is disastrously high’. I read this as claiming that any test with such a high false discovery rate is ‘almost useless’ regardless of the context. After an operation to remove a tumour a patient has regular blood tests including tumour markers. At one point one such marker is double the range given for patients without cancer. To use David Colquhoun’s numbers suppose the false alarm rate is 86.2%. The probability that the increase is due to cancer cells is now 13.8%. How should the oncologist react? If I understand David Colquhoun correctly he should do nothing, perhaps apart from telling the patient not to worry. One alternative would be to repeat the blood test in, say, six weeks’ time. And at what point would the test be useful: a false discovery rate of 80%, 40%, 20%?

    The main part of the paper is concerned with a situation where each sample either has no effect or it has an effect. The model is such that by adjusting the parameter values every possible false positive rate is attainable. The claim is made that this model mimics what is done in real life making the results more persuasive. It is not clear what is meant by ‘mimics’. If it simply means that the model parameters can be so chosen to reflect the false discovery rate in real life then it has no value. This can always be done. The claim may however be much more substantial, namely the false positive rate in real life is (sometimes, mainly?) due to having some samples with no effect and other samples with an effect.

    In my last post I mentioned James Short’s determinations of the parallax of the sun based on the 1761 transit of Venus given by Stephen Stigler. There are 8 data sets and if one calculates the P value based on the modern true value the data sets 7 and 8 have P values of 1.522e-07 and 9.971e-4 respectively. These values cannot be explained by different real effects. The explanation which comes to mind is bias with several possible causes: the instruments used, the calibration of the instruments, undetected bias in the manner of measuring, weather conditions, insufficient care on the part of the scientists taking the measurements. Berger and Sellke quote Jeffreys who stated that astronomers find that effects of two standard deviations and less tend to disappear in the light of more data whilst effects of three standard deviations usually persist. Jeffreys stated that this was consistent with his prior. Berger and Sellke then make up a story about an astronomer who checks previous point tests and finds that half the hypotheses were true and the other half false etc etc. This seems to me to be the same explanation offered by David Colquhoun. If one repeats the calculations for Stigler’s other data sets then the bias explanation seems much more plausible than that of true and false hypotheses or zero and real effects. All 6 of Michelson’s data sets on the speed of light overestimate the present accepted values: five of the P values are less than 6e-06, the sixth one is 0.0504.

    • Laurie: Thank you for looking at the paper. I would like to have a discussion on this blog at some point on the whole issue of appealing to diagnostic screening computations for appraising methods in science, and even for diagnostic screening. (The supposition of just one test is, of course, absurd, as is ignoring the actual p-value.) Even in diagnostics for rare events, say, luggage with a truly dangerous item, ringing the alarm in screening increases the chance of the presence of a dangerous item beyond the initial rate. Thus it warrants a further check.
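      The point about alarms raising the chance above the initial rate can be illustrated with Bayes’ rule. The numbers below are hypothetical, chosen only for illustration; they are not taken from Colquhoun’s paper or from any real screening program.

```python
def prob_item_given_alarm(base_rate, sensitivity, false_alarm_prob):
    """Bayes' rule for a single screening alarm."""
    true_alarms = base_rate * sensitivity
    false_alarms = (1 - base_rate) * false_alarm_prob
    return true_alarms / (true_alarms + false_alarms)

# Hypothetical screening numbers, chosen only for illustration
base_rate = 0.01
posterior = prob_item_given_alarm(base_rate, sensitivity=0.8, false_alarm_prob=0.05)
print(round(posterior, 3), round(posterior / base_rate, 1))
```

      With these assumed inputs most alarms are false, yet the probability after an alarm is many times the initial rate, which is why a further check is warranted.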

      Trying to frame statistical testing as an “up-down” diagnostic affair also makes questionable use of size and power. I don’t think they can even get the numbers to work. I’ll reblog a post that makes this point. (Also relevant are discussions on some posts by Senn, one of which I’ll reblog after mine.)

      I’ll study your comment more carefully later.

  7. Michael, I am also surprised that you are in complete agreement with me but I cannot reciprocate. I will try to put forward my attitude in more detail and if you are then still in agreement I would be even more surprised and of course pleased. At the start of our discussions I seem to remember that you criticized me for using the term model to refer to a single probability measure rather than a family of such. I have since thought about your remark to the extent of devoting a section of a paper now in preparation to why using the term model in the sense of a family of probability measures is a bad idea with bad consequences. I specifically have in mind a standard form of goodness-of-fit tests. The use of such tests in this form may not logically follow from the meaning of the word model in the sense of a family of measures but it is certainly consistent with it. For me and in the rest of this post a model is a single probability measure, such measures being the objects of study in probability theory. The reasons for this terminology are precision and clarity. Thus when I state that a particular model is an adequate approximation for a particular data set the particular model is a single probability measure, it is not a family of probability measures. One can of course restrict the search for an adequate model to a family of models such as the Gaussian family. In a previous post I gave an R programme for determining the set of adequate parameter values for the Gaussian family. The output is either the empty set or a bounded subset of {\mathbb R}\times {\mathbb R}_+. Should you wish to use likelihood and restrict it to the set of adequate models you will not have the full likelihood as you seem to wish to have. You can use maximum likelihood but if the maximum is not restricted to the set of adequate parameter values you may well end up with a model previously rejected as not being an adequate approximation.
Michael, when you state that likelihood is proportional to probability you must be using the word proportional in a non-standard sense.

    • Michael Lew

      Laurie, likelihood has always been defined as proportional to a probability. You can read any source to see that is the case. Try Fisher, Edwards, Royall, Pawitan, or even Mayo.

      • Yes likelihoods in inference come in the form of likelihood ratios, so constants drop out.

    • Michael Lew

      Laurie, I didn’t expect to have to tell you that likelihood has always been defined as a quantity proportional to a probability. Read Fisher, Edwards, Basu, Royall, Pawitan, or even Mayo.

      • john byrd

        Would it be reasonable to say that likelihoods are proportional to probabilities of models, but avoid stating that likelihood is proportional to probability, since each likelihood relates to a single model?

        • Michael Lew

          John, no, it usually would not be reasonable to say that, because it would be a strange way to use the word ‘model’.

          A ‘statistical model’ is the set of equations and assumptions that allow probabilities to be specified for possible observations. The actual observations (i.e. the data) provide the likelihoods which are proportional to the probabilities according to the statistical model of the observations for all possible values of the model parameter(s). We then talk of those likelihoods as being ‘of’ those parameter values. (There is nothing unconventional about my usage of model or proportional to or likelihood.)

          As the likelihoods ‘belong’ to the parameter values within a statistical model, they ‘belong to’ the model that ‘owns’ those parameters in some sense. In that way each likelihood “relates to a single model”, as you say. However, there is usually an infinite set of likelihoods from a single statistical model (one per possible value of each parameter) as the individual likelihoods are just particular points on a (usually) continuous likelihood function that is obtained from the statistical model when the data are specified. The likelihoods only exist within the statistical model and so they cannot be “probabilities of models”.

  8. Michael, how come they all got it wrong? We’ve had a perfectly good and precise definition of proportionality for at least 2300 years so why alter it? A probability is a number of the form P(B) where P is a probability measure and B a Borel set. So if likelihood is proportional to probability we must have P_\lambda(B)=cf(x,\lambda) for all x and \lambda which means that it can only hold for a specific choice of B depending on x, P_\lambda(B(x))=cf(x,\lambda) for all x and \lambda and for a specific choice of the dominating measure. If P is concentrated on the integers then this is possible. Simply take the dominating measure to be counting measure and B(x)={x} to give P_\lambda({x})=f(x,\lambda). This doesn’t work in the continuous case as the left hand side is zero. One could try B(x)=(x,x+\delta) for some fixed \delta > 0 but this doesn’t work either: P_\lambda((x,x+\delta))=\delta f(x+\theta(x,\lambda)\delta,\lambda) with 0 < \theta(x,\lambda) < 1 for a continuous density by the mean value theorem. So what is going on? One could argue that it is approximately proportional but if I mean approximately proportional I say so. In your paper on Birnbaum's example you cite Fisher stating L(\theta|x_obs)=cP(x_obs|\theta) but for Lebesgue densities the right hand side is zero in contrast to the left hand side so I can make no sense of it. I don't know whether you read my remark on Birnbaum's counter-example, namely that it cannot be a counter-example against likelihood as there is no likelihood in the counter-example. The ‘proof’ I gave is incorrect: I thought, trying to recall measure and integration last done many years ago, that you needed sigma-finite measures for Radon-Nikodym but you don't. You need them for Fubini. Nevertheless the claim is correct as one can prove directly without any appeal to sigma-finiteness.
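    The mean value theorem point can be seen numerically in a small sketch. It assumes a N(\mu, 1) model with x fixed at 1; the helper names are illustrative. For each \delta it prints P_\mu((x,x+\delta)) divided by \delta f(x,\mu) at three values of \mu.

```python
import math

def normal_cdf(x, mu):
    return 0.5 * (1 + math.erf((x - mu) / math.sqrt(2)))

def normal_pdf(x, mu):
    return math.exp(-0.5 * (x - mu) ** 2) / math.sqrt(2 * math.pi)

def ratio(x, mu, delta):
    """P_mu((x, x + delta)) divided by delta * f(x, mu) for N(mu, 1)."""
    interval_prob = normal_cdf(x + delta, mu) - normal_cdf(x, mu)
    return interval_prob / (delta * normal_pdf(x, mu))

for delta in (0.5, 0.1, 0.001):
    print(delta, [round(ratio(1.0, mu, delta), 4) for mu in (0.0, 1.0, 2.0)])
```

    For any fixed \delta the ratio changes with \mu, so the interval probability is not an exact constant multiple of the density; the ratios all approach 1 as \delta shrinks, which is the approximate sense of proportionality.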

    • Michael Lew

      Laurie, they didn’t get it wrong. If you read them you would see.

  9. Michael, just read your last comment so once more. Fisher writes L(\theta|x_obs)=cP(x_obs|\theta) which is indeed the correct use of the word proportional. For P concentrated on the integers this is correct if the dominating measure is the counting measure or some constant multiple of it. For continuous models it is simply false.

    • Michael Lew

      Laurie, you wrote “For continuous models it is simply false.” That is an error.

      The problem that you are concerned about is (presumably) the fact that the probability of observing any particular value of a test statistic is zero for a continuous distribution as a consequence of the unity area of the distribution being divided into an infinite set of areas each with a zero area. Given that, the probability of observing any given value of the test statistic is zero for all values of the parameters of the model. The likelihood function would be a flat line at zero if it were proportional to those probabilities, but it usually isn’t. There are two (at least) reasons why that is not an impediment to the truth of my statement that likelihoods are proportional to probabilities.

      First, the conventional response to the problem, one that is convincing for any practical purpose, but one which strikes me as a slight cop-out. We can never know or express the test statistic or parameter values to infinite precision, except for the infinitely small proportion of cases where the relevant values are rationals requiring a suitably small number of numerals. As the proportion of easy rationals among all real numbers is zero for non-god-like creatures, we can ignore those special cases. To deal with the unity divided by infinity equals zero problem, we simply acknowledge that we cannot practically divide the probability distribution into a truly infinite number of slices and deal with the integral of probabilities of the test statistic lying within a small range, t plus and minus epsilon, as the relevant probability. You will read that solution in every book on likelihood that I have read. Basu has a particularly illuminating discussion of the issue in his lecture notes book (also available from JSTOR).

      The second response to the problem is a little more pointed. It comes from Jaynes’s book Probability Theory: the Logic of Science. Jaynes points out that almost every complaint and alleged counter-example to the likelihood principle comes from an inappropriate determination of the infinite limit of likelihoods. He writes:
      “It is very important that our consistency theorems have only been established for probabilities assigned on _finite sets_ of propositions. In principle, every problem must start with such finite-set probabilities; extension to infinite sets is permitted only when this is the result of a well-defined and well-behaved limiting process from a finite set. More generally, in any mathematical operations involving infinite sets, the safe procedure is the finite-sets policy:
      Apply the ordinary process of arithmetic and analysis only to expressions with a finite number of terms. Then, after the calculation is done, observe how the resulting finite expressions behave as the number of terms increases indefinitely.”

      If we apply Jaynes’s finite-set policy to the problem you raise, we see that as the number of distinct values of test statistic or parameters approaches infinity, the probabilities all remain above zero. It is not until the number of terms _reaches_ infinity that the probabilities reach zero and the (still non-zero) likelihoods are no longer ‘proportional’ in the sense that you prefer. As the likelihoods remain proportional to the probabilities for _all_ non-infinite numbers of terms, we take the behavior to be non-problematical.

      Likelihoods are proportional to probabilities in all real-world situations, and in the non-real-world situation with infinite sets of propositions the Jaynes-correct way to interpret the limit-approaching behavior of probabilities and likelihoods is non-problematical.
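The "t plus and minus epsilon" argument above can be checked numerically. The following is only a sketch under assumed values (a Normal model with unit variance, an observed statistic of 1.3, and two candidate means, none of which come from the thread): it compares the ratio of interval probabilities with the ratio of densities as epsilon shrinks. It illustrates the limiting behaviour that the comment describes and does not by itself settle the disagreement about continuous models.

```python
import math

def normal_cdf(x, mu, sigma=1.0):
    """CDF of N(mu, sigma^2), written with the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def normal_pdf(x, mu, sigma=1.0):
    """Density of N(mu, sigma^2)."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def interval_prob(t, eps, mu, sigma=1.0):
    """P(t - eps < T < t + eps) under N(mu, sigma^2)."""
    return normal_cdf(t + eps, mu, sigma) - normal_cdf(t - eps, mu, sigma)

# Hypothetical values chosen for illustration only.
t_obs = 1.3
mu_a, mu_b = 1.0, 0.0

# Ratio of densities at the observed value (the usual likelihood ratio).
density_ratio = normal_pdf(t_obs, mu_a) / normal_pdf(t_obs, mu_b)

# Ratio of finite interval probabilities for shrinking epsilon.
ratios = {}
for eps in (0.5, 0.1, 0.01, 0.001):
    ratios[eps] = interval_prob(t_obs, eps, mu_a) / interval_prob(t_obs, eps, mu_b)
    print(f"eps={eps:<6} probability ratio={ratios[eps]:.6f}")

print(f"density ratio = {density_ratio:.6f}")
```

Each interval probability goes to zero with epsilon, but in this example the ratio of the two stays finite and approaches the density ratio.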

  10. Michael, as I said in a previous post the claim is true for integer valued random variables and it is clear that it applies to any discrete random variable. I use the standard definition of density in measure and integration theory which is the basis of probability theory. Then it is not the case that likelihoods are proportional to probabilities. Perhaps when you state this you should always mention the rider that this applies only for discrete random variables. You quote Jaynes and I agree with him in general but the statement is too vague for it to be applicable. What he states holds for any form of limiting operation in applied mathematics. None of this alters the fact that in continuous models the differential operator is pathologically discontinuous and this cannot be remedied by discretization. You write ´Jaynes points out that almost every complaint and alleged counter-example to the likelihood principle comes from an inappropriate determination of the infinite limit of likelihoods’. So let us go back to the mixture model 0.5N(\mu_1,\sigma_1^2)+0.5N(\mu_2,\sigma_2^2) and generate a sample of size n=200 which is i.i.d. N(0,1) corresponding to \mu_1=\mu_2=0, \sigma_1=\sigma_2=1. Let us discretize everything in sight, the data and the parameter space. Then apply The Strong Law of Likelihood as on page 28 of Basu to conclude that parameter values of \mu_1 near some x_i and \sigma_1 near zero are better supported than the values used to generate the data. Moreover the limiting model well describes the behaviour of the discretized versions. So what has gone wrong? In other cases the properties of the discrete version do not carry over to the limiting version. Examples are entropy, Kullback-Leibler, AIC and BIC. I must completely retract my remarks about the Birnbaum counter-example. I missed the restriction of the parameter mu and the observation x to integer values.
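The mixture example can be tried out with a few lines of code. This is a minimal sketch, not the commenter's own computation: it uses the continuous densities rather than a discretized version, the seed and the particular sigma_1 values are arbitrary assumptions, and the spike is centred on the first data point for convenience.

```python
import math
import random

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2)."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def mixture_loglik(data, mu1, sigma1, mu2, sigma2):
    """Log-likelihood of 0.5 N(mu1, sigma1^2) + 0.5 N(mu2, sigma2^2)."""
    return sum(
        math.log(0.5 * normal_pdf(x, mu1, sigma1) + 0.5 * normal_pdf(x, mu2, sigma2))
        for x in data
    )

# n = 200 i.i.d. N(0,1) observations, as in the comment (seed is an assumption).
rng = random.Random(0)
data = [rng.gauss(0.0, 1.0) for _ in range(200)]

# Log-likelihood at the generating values mu_1 = mu_2 = 0, sigma_1 = sigma_2 = 1.
loglik_generating = mixture_loglik(data, 0.0, 1.0, 0.0, 1.0)

# Put mu_1 on one observation and let sigma_1 shrink towards zero.
spike_logliks = {}
for sigma1 in (1e-1, 1e-50, 1e-100, 1e-150):
    spike_logliks[sigma1] = mixture_loglik(data, data[0], sigma1, 0.0, 1.0)
    print(f"sigma_1={sigma1:g}  log-likelihood={spike_logliks[sigma1]:.1f}")

print(f"generating values  log-likelihood={loglik_generating:.1f}")
```

The sigma_1 values are taken extremely small because every other observation pays a factor of 0.5 once the first component collapses onto a single point; the log-likelihood nevertheless grows without bound as sigma_1 goes to zero.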

    • Michael Lew

      Laurie, I continue to say that likelihoods are proportional to probabilities because that is true for all real-world situations, for the reasons I gave in my previous comment. Your difficulties with the awkward mixture distribution seem to me to be difficulties with the meaning of the word ‘support’ rather than anything more deep.

      If you have real-world applications that need analysis of evidential meaning of data from mixture distributions then I suggest that this section from the recent paper by Keli Liu and Xiao-Li Meng might be helpful:

      “1.3 Frequent Misconceptions: The Meaning of A Statistical Model
      This section aims to clarify what we assume and do not assume when simulating controls according to a probabilistic pattern/distribution. In the literature, this modeling assumption usually takes the form: “The data come from such and such distribution.” This phrasing may give the false impression that…
      Misconception 1. A probability model must describe the generation of the data.
      A more apt description of the model’s job (in inference) is “Such and such probabilistic pattern produces data which resemble ours in important ways.” To create replicas (i.e., controls) of the Mona Lisa, one does not need to bring da Vinci back to life —a camera and printer will suffice for most purposes. Of course, knowledge of da Vinci’s painting style will improve the quality of our replicas, just as scientific knowledge of the true data generating process helps us design more meaningful controls. But for purposes of uncertainty quantification, our model’s job is to specify a set of controls that resemble (D,θ). Nowhere is this point clearer than in applications involving computer experiments where a probabilistic pattern is used to describe data following a known (but highly complicated) deterministic pattern (Kennedy and O’Hagan, 2001; Conti et al., 2009). We need a descriptive model, not necessarily a generative model. See Lehmann (1990), Breiman (2001) and Hansen and Yu (2001) for more on this point.”

      Perhaps your analysis of the mixture model data using a mixture model is akin to reconstruction of da Vinci. You might get better results using an analysis that is designed to perform well with respect to pulling the mixture apart. No version of the law of likelihood (strong, weak, restricted, yours, mine) carries a warranty that any particular statistical model will not be misleading when a combination of model and data are badly behaved for some reason. If analysts require a hands-off (eyes closed) algorithm for inference then there seems little room for characterisation of the evidential support for parameter values and so a likelihood analysis is irrelevant. Likelihood functions are powerful stuff: use with caution.

  11. Michael,
    You write
    ´I continue to say that likelihoods are proportional to probabilities because that is true for all real-world situations, for the reasons I gave in my previous comment’
    but also
    ´It seems to me to be irrelevant to the use of a likelihood function as a depiction of the relative support by the data of parameter values within a model’
    from which one can deduce that a likelihood function is a probability and vice versa. And then you complain.

    I have no difficulties whatsoever with analysing the mixture model and it is in no sense awkward. It is a well-defined and well-posed problem. You write
    ´You might get better results using an analysis that is designed to perform well with respect to pulling the mixture apart’.
    It does exactly that. It will tell you how far the mixtures can be pulled apart and still be consistent with the data. It will tell you if you can take (\mu_1,\sigma_1)=(\mu_2,\sigma_2) and still be consistent with the data. It will tell you if you can take \sigma_1=\sigma_2 and still be consistent with the data. It will tell you if you can put \mu_1=log(1+\mu_2^2)*\sigma_2 and still be consistent with the data etc. So what do you mean by better results? I’ve never shown you the results. Anyway tell me how you read all this off the likelihood probability or function.

    You cite Keli Liu and Xiao-Li Meng in the context of mixture distributions. For a problem with mixture distributions see
    Davies, P. L., Gather, U., Meise, M., Mergel, D. and Mildenberger, T. (2008). Residual Based Localization and Quantification of Peaks in X-Ray Diffractograms. Annals of Applied Statistics, 2(3), 861-886.

    You write
    If analysts require a hands-off (eyes closed) algorithm for inference …

    What do you wish to indicate with the terms ´hands-off’ and ´eyes closed’? Every algorithm I write has an explicit purpose: minimize an expression, locate a zero, etc. I know exactly what it is meant to do and I know how it does it. The programme will give a plot of the distribution function of the closest model and that of the data. You can see how close the two are but only if your eyes are open. But indeed, when I have the results likelihood has nothing more to say. It is worse than irrelevant, it is wrong. Do you understand exactly what my algorithm does? Judging by your remarks I have the feeling that you don’t.

    You write
    No version of the law of likelihood (strong, weak, restricted, yours, mine) carries a warranty that any particular statistical model will not be misleading when a combination of model and data are badly behaved for some reason.

    Once more: neither the model nor the data are badly behaved in this mixture example. The behaviour in both cases is exemplary.

    You write
    The law of likelihood says that the degree to which the data support one parameter value relative to another on the same likelihood function is given by the ratio of the likelihoods of those parameter values.

    and Basu writes
    … the author now suggests a stronger version of Hacking’s law of likelihood.

    The strong law of likelihood: For any two subsets A and B of Omega the data supports the hypothesis omega in A better than the hypothesis omega in B if
    \sum_{ omega in A} L(omega) > \sum_{ omega in B} L(omega) .

    which, given any use of the word ´support’ listed in the OED, has a clear intention. But when it all goes wrong as in the mixture example the problem is resolved by stating

    You(r) difficulties with the awkward mixture distribution seem to me to be difficulties with the meaning of the word ‘support’ rather than anything more deep.

    So the escape routes are (1) to designate the problem as awkward, which it is not, and (2) say it all depends what one means by ´support’.

    Discussions on the law(s) of likelihood are intellectually sterile. The discussions are discussions, they are essays using ill-defined terms such as ´support’, ´relevant’, ´information’ whose lack of precision opens up escape routes which can be taken to explain away any awkward example. There are discussions about whether the ace of Hearts as the first card in the deck better supports the hypothesis that all the cards are Aces of Hearts or some other hypothesis. There are discussions on Birnbaum’s counter-example based on a sample of size n=1. I seem to remember that Jim Durbin once said that the Law of Likelihood is not a mathematical theorem and it is therefore impossible to prove or falsify. Whether he said it or not the statement is true. There are essays (one) which show that the Law of Likelihood follows from A and B, essays (two) which show that this is not the case and essays (three) which show that the essays (two) are wrong and that the essays (one) are correct, and so on until exhaustion sets in. Fortunately all this does not matter as statistics can be done very successfully with no use of any law of likelihood and with very restricted use of likelihood itself. By this latter I mean as a demarcation tool between what is possible and what is not possible but this can only be done under very specific conditions.

    • Michael Lew

      Laurie, again you write a load of bollocks.

      1. Likelihood is not probability, just as a dollar is not a pound, despite a well defined proportionality at any point in time. I would complain when a likelihood is interpreted as a probability because they are different. For example, likelihood functions do not have to sum or integrate to unity.

      2. Yes, your mixture model is a well defined and well posed problem. The model is badly behaved in a likelihood analysis because it provides support for a silly parameter value of sigma = 0. If you are capable of using an open-eyed and hands-on analysis you would see that the zero value of sigma implies a badly behaved analytical model rather than a usable estimate of a real-world value.

      3. If you take the Liu & Meng advice to heart then you would try a better behaved model, such as a model with two mu and a shared sigma. Your insistence on the analytical model matching the generating model is exactly what Liu & Meng call “Misconception 1” concerning statistical models. If you want to continue that argument, I suggest that you contact them, as I am (again) exhausted by your peculiar style of discussion.

      4. The word “support” is extensively described in the likelihood literature. Ian Hacking defined it first in his book “Logic of Statistical Inference”. I use it in a standard technical way. It may or may not match your dictionary, a book that is probably not the best resource for technical definitions.

      5. You seem to be struggling with the principle of charity again. Keep working on it.
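Point 1 in the list above, that a likelihood function need not sum or integrate to unity, is easy to check on a small case. The sketch below assumes a binomial model with n = 10 trials and k = 3 successes (hypothetical numbers, not from the thread) and integrates the likelihood over the parameter p with a simple midpoint rule.

```python
import math

def binomial_likelihood(p, n, k):
    """Binomial probability of k successes in n trials, viewed as a function of p."""
    return math.comb(n, k) * p**k * (1.0 - p) ** (n - k)

# Hypothetical data: 3 successes in 10 trials.
n, k = 10, 3

# Integrate the likelihood over p in (0, 1) with the midpoint rule.
steps = 10000
area_over_p = sum(
    binomial_likelihood((i + 0.5) / steps, n, k) for i in range(steps)
) / steps

# For comparison: as a probability distribution over k (p fixed) it sums to 1.
total_over_k = sum(binomial_likelihood(0.3, n, j) for j in range(n + 1))

print(f"integral over p: {area_over_p:.6f}")
print(f"sum over k:      {total_over_k:.6f}")
```

For this model the integral over p equals 1/(n+1), so the same expression is a probability distribution in k but not in p.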

  12. Michael,
    1) In the course of the argument you referred me to Basu. I read it. He considers only the discrete case, normalizes the likelihood and makes it a probability. As scaling is irrelevant he is perfectly entitled to do this, and whether one talks of a finite measure or a probability is irrelevant as they are the same apart from a scale factor. His Section 8 is entitled Likelihood – A point function or measure?. Arguing against Fisher he comes down firmly on the measure side and never refers to likelihood as a function. You on the other hand, see quotes in my previous comment, use both measure and function. Make up your mind, it can’t be both. I read all this as an outsider, from without so to speak. I regard the whole discussion as intellectually sterile and a waste of time so it doesn’t interest me at all.

    2) What is badly behaved is likelihood. If you insist on using a pathologically discontinuous operation when analysing data, which by the way had to be pointed out to you, what do you expect?

    3) You have referred to several papers in the course of our discussion. I have downloaded all and read them including Liu & Meng.

    Your insistence on the analytical model matching the generating model is exactly what Liu & Meng call “Misconception 1” concerning statistical models. If you want to continue that argument, I suggest that you contact them, as I am (again) exhausted by your peculiar style of discussion.

    I don’t know whether you have read any of my papers but here is one which is easily accessible.
    Davies, P. L. (2008). Approximating data (with discussion). Journal of the Korean Statistical Society, 37, 191–240.

    You can also read Section 2.2 Inference without models in
    Wasserman, L. (2011). Low Assumptions, High Dimensions. Rationality, Markets and Morals, 2, 201-209.
    Larry Wasserman’s account is not quite correct. Contrary to the title of the section I do use models. Also I do not assume that the data is deterministic as I make no, I repeat NO, assumptions about the generating mechanism of the data. But he does understand the basic idea. See also Chapter 1.6 of my book where I write ´The concept of approximation is that of the model to the data. It is not of the model to some unknown true generating mechanism which gave rise to the data’. I do not in any way insist on ´the analytical model matching the generating model’. Exactly the opposite is the case as the above quotes show. You are the only person with whom I have discussed my ideas who has simply not understood this.

    You also write

    you would try a better behaved model, such as a model with two mu and a shared sigma.

    Michael, do you understand what is going on? I have explained what the programme does. It takes a specific model 0.5N(mu_1,sigma_1^2)+0.5N(mu_2,sigma_2^2) and, given a quantile, say the 0.95 quantile of the Kuiper metric, it then lists all those parameter values (mu_1,sigma_1,mu_2,sigma_2) for which the corresponding models are within the Kuiper distance q_{ku}(0.95) of the empirical distribution. If there are none you are told so. This is sufficient information to write your own programme. The only open question is how to specify a suitable grid of values for the parameters but even this is possible with only four parameters. This is a numerical problem not a conceptual one. Your data can be what you want them to be, for example the Old Faithful data obtainable from Larry Wasserman’s webpage.
    The result is the empty set. That is there are no parameter values for the model 0.5N(mu_1,\sigma_1^2)+0.5N(mu_2,sigma_2^2) which give an adequate approximation to the Old Faithful data. Change the model to
    pN(mu_1,sigma_1^2)+(1-p)N(mu_2,sigma_2^2) so we now have a fifth parameter p. You can now try various p. For example there are adequate models with p=0.34. For this value of p the closest model in the sense of the Kuiper metric is (mu_1,sigma_1,mu_2,sigma_2)=(1.974, 0.203, 4.297, 0.435) with a distance of 0.0658. The 0.95 quantile distance is 0.1054. For p=0.34 the lower and upper bounds of acceptable parameter values are 1.921 and 2.118 for mu_1, 0.155 and 0.349 for sigma_1, 4.271 and 4.462 for mu_2 and 0.324 and 0.556 for sigma_2. From this it follows that there are no acceptable models with mu_1=mu_2 but there are acceptable models with sigma_1=sigma_2=sigma for sigma in the range 0.324 to 0.349. The nearest model with p=0.34 and sigma_1=sigma_2 is the model 0.34N(2.026,0.324)+0.66N(4.314,0.324) with a Kuiper distance of 0.0826. The smallest value of p compatible with the data is 0.25 and the highest is 0.38.
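The procedure described in this comment (compute the Kuiper distance between each candidate mixture and the empirical distribution, keep the parameter values whose distance is below a chosen quantile) can be sketched as follows. This is not the commenter's programme. The data are a synthetic stand-in rather than the Old Faithful data, the grid is deliberately tiny, and estimating the 0.95 quantile by Monte Carlo on uniform samples is an assumption made here for illustration; all function names are hypothetical.

```python
import itertools
import math
import random

def normal_cdf(x, mu, sigma):
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def mixture_cdf(x, p, mu1, s1, mu2, s2):
    """CDF of p N(mu1, s1^2) + (1 - p) N(mu2, s2^2)."""
    return p * normal_cdf(x, mu1, s1) + (1.0 - p) * normal_cdf(x, mu2, s2)

def kuiper_distance(data, cdf):
    """Kuiper distance D+ + D- between the empirical CDF and a continuous CDF."""
    xs = sorted(data)
    n = len(xs)
    d_plus = max((i + 1) / n - cdf(x) for i, x in enumerate(xs))
    d_minus = max(cdf(x) - i / n for i, x in enumerate(xs))
    return d_plus + d_minus

def kuiper_quantile(n, level, reps, rng):
    """Monte Carlo quantile of the Kuiper distance for samples of size n
    (assumption: simulated under the uniform distribution)."""
    stats = sorted(
        kuiper_distance([rng.random() for _ in range(n)], lambda u: u)
        for _ in range(reps)
    )
    return stats[min(reps - 1, int(level * reps))]

def adequate_parameters(data, p, grids, threshold):
    """List every grid point whose mixture lies within `threshold` of the data."""
    kept = []
    for mu1, s1, mu2, s2 in itertools.product(*grids):
        d = kuiper_distance(data, lambda x: mixture_cdf(x, p, mu1, s1, mu2, s2))
        if d <= threshold:
            kept.append(((mu1, s1, mu2, s2), d))
    return kept

# Synthetic two-component data (NOT the Old Faithful data).
rng = random.Random(1)
p = 0.34
data = [rng.gauss(2.0, 0.2) if rng.random() < p else rng.gauss(4.3, 0.43)
        for _ in range(200)]

threshold = kuiper_quantile(len(data), 0.95, 500, rng)
grids = ([1.9, 2.0, 2.1], [0.15, 0.2, 0.3], [4.2, 4.3, 4.4], [0.35, 0.43, 0.5])
kept = adequate_parameters(data, p, grids, threshold)

print(f"0.95 quantile (Monte Carlo): {threshold:.4f}")
print(f"{len(kept)} of {3 ** 4} grid points are adequate")
```

If `kept` comes back empty, no parameter value on the grid gives an adequate approximation, which corresponds to the "empty set" outcome the comment reports for the equal-weight model. A practical version would need a much finer grid.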

    You write

    If you are capable of using an open-eyed and hands-on analysis you would see that the zero value of sigma implies a badly behaved analytical model rather than a usable estimate of a real-world value.

    Take all values of (mu_1,sigma_1,mu_2,sigma_2,p) which are compatible with the data. Then you will see that all values of sigma_1 lie in (0.143,0.382) and all values of sigma_2 in (0.289,0.637), neither of which contains the value 0. If you put p=0.25 and consider a model with mu_1=x_i and sigma_1=0 then the Kuiper distance of such a model from the data is at least 0.25. The 0.99999999999 quantile of the Kuiper metric is 0.242 so you can judge how far away 0.25 is. So my eyes-closed hands-off algorithm automatically excludes a zero sigma. I agree with you that such values should not be considered. But then nor should many other parameter values be considered. My programme gives you all those parameter values consistent with the Old Faithful data so I take it you will restrict your likelihood to such values, they are the usable estimates of a real-world value.

    It is completely obvious from the description of the method that it automatically gives you all this information. Why do you need it to be explicitly pointed out? I suggest you download the data from Larry Wasserman’s webpage, analyse it yourself and let me know what the results of your likelihood analysis are.
