I don’t know how to explain to this *economist blogger* that he is erroneously using p-values when he claims that “the odds are” (1 – p)/p that a null hypothesis is false. Maybe others want to jump in here?

On significance and model validation (Lars Syll): Let us suppose that we as educational reformers have a hypothesis that implementing a voucher system would raise mean test results by 100 points (null hypothesis). Instead, when sampling, it turns out it raises them by only 75 points, with a standard error (telling us how much the mean varies from one sample to another) of 20.

Does this imply that the data do not disconfirm the hypothesis? Given the usual normality assumptions on sampling distributions, with a t-value of 1.25 [(100 − 75)/20] the one-tailed p-value is approximately 0.11. Thus, approximately 11% of the time we would expect a score this low or lower if we were sampling from this voucher-system population. That means that, using the ordinary 5% significance level, we would not reject the null hypothesis, although the test has shown that it is likely – the odds are 0.89/0.11, or 8-to-1 – that the hypothesis is false…

And as shown over and over again when it is applied, people have a tendency to read “not disconfirmed” as “probably confirmed.” But looking at our example, standard scientific methodology tells us that since there is only 11% probability that pure sampling error could account for the observed difference between the data and the null hypothesis, it would be more “reasonable” to conclude that we have a case of disconfirmation.

Of course, as we’ve discussed many times, failure to reject a null or test hypothesis is not evidence for the null (search this blog). We would however note that for the hypotheses H0: µ > 100 vs. H1: µ < 100, and a failure to reject the null, one is interested in setting severity bounds such as:

*sev(µ > 75)=.5
sev(µ > 60)=.773
sev(µ > 50)=.894
sev(µ > 30)=.988*

So there’s clearly very poor evidence that µ exceeds 75*. Note too that sev(µ < 100)=.89.**
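For concreteness, the numbers above can be reproduced with a few lines of Python (a sketch assuming the example’s normal model with observed mean 75 and standard error 20; the helper name `sev_greater` is mine, not from the post):

```python
from scipy.stats import norm

x_bar, se = 75.0, 20.0   # observed mean and standard error from the example

# One-tailed p-value under the null mu = 100: t = (100 - 75)/20 = 1.25
p = norm.cdf((x_bar - 100.0) / se)      # approximately 0.106, the "0.11" above

def sev_greater(mu0):
    # Severity for the inference "mu > mu0", given the non-rejection:
    # SEV(mu > mu0) = P(X_bar < observed mean; mu = mu0)
    return norm.cdf((x_bar - mu0) / se)

for mu0 in (75, 60, 50, 30):
    print(f"sev(mu > {mu0}) = {sev_greater(mu0):.3f}")

# sev(mu < 100) = P(X_bar > observed mean; mu = 100), roughly 0.89
print(f"sev(mu < 100) = {1.0 - norm.cdf((x_bar - 100.0) / se):.3f}")
```

Running this gives back the .5, .773, .894, .988 values listed above, and ≈.89 for sev(µ < 100).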

I agree the issue of model validation is always vital – for all statistical approaches. See the unit beginning here.

*As Fisher always emphasized, it requires several tests before regarding an experimental effect as absent or present. One might reserve SEV for such a combined assessment.

**I am very grateful to Aris Spanos for number-crunching for this post, while I’m ‘on the road’.

I think that a big problem to overcome is that many people apparently believe (subconsciously, to some extent) that if you have data and a hypothesis, there is somewhere out there in logical space an objective, unique and well-defined true probability that the hypothesis is true given the data. As significance tests have *something* to do with measuring evidence from data for or against a hypothesis, these people either assume that such tests tell you something about this true underlying probability, or that significance tests are bogus and everyone should be a Bayesian. The idea that significance tests tell you something else that doesn’t give you such a probability but is well-interpretable and useful information apparently doesn’t get into the heads of such people.

It reminds me of this survey

http://andrewgelman.com/2013/01/participate-in-a-short-survey-about-the-weight-of-evidence-provided-by-statistics/

linked on Andrew Gelman’s blog, which implicitly assumes that given two hypotheses and data, there is a true objective relative weight of evidence of the data for the two hypotheses (and then seems to mock p-values because this is not what they deliver).

Christian: I entirely agree with you, maybe it will help. He is clearly reporting it as some kind of standard rule of thumb used somewhere–I hope not. If only people would get over the presupposition that what is even wanted is a posterior of some sort. One of the aims of the book I am writing/struggling over is to argue this. Once one makes the shift, the misinterpretations, the very desire to misinterpret, all disappear!

Christian,

I took the bait on Gelman’s blog and started doing that survey, but had to quit at the second question when I realized that I couldn’t possibly select one of their response options… I honestly had no idea how to respond.

Christian: I haven’t looked at it but now I’m curious about the question you couldn’t answer. I hope you alerted him or whoever is doing the survey.

If you’re curious, look up the survey yourself. To be fair, Andrew is not directly responsible for the survey (though he advertised it), and the survey allows you to tick “I don’t know” and to give comments (which I used to tell the authors what I think).

Christian: sorry, I see the comment was from Mark. I will check out the survey (which my computer keeps trying to write as “surgery”, so I hope it won’t be a dangerous operation!) thanks.

Well I looked at that survey linked on Gelman’s blog but didn’t have the patience to fool with the A, B slider, especially after having just done countably many on-line recommendations, each with their own tricks. So already there’s a bias to the survey.

I, too, started the survey that Christian Hennig referred to (linked on Andrew Gelman’s blog); and I was definitely confused on how to respond to several questions on that survey. (Also, it is not clear to me that the notion of evidence should be captured by a measure, such as a posterior probability!)

Mayo,

Interesting that this is the claim that two other economists, Ziliak & McCloskey, made in their amicus brief filed with the U.S. Supreme Court in Matrixx Initiatives v. Siracusano.

Steve Goodman, in his “A Dirty Dozen: Twelve P-Value Misconceptions,” Semin Hematol 45:135-140 (2008) gives a table of posterior probabilities from various p-values, but of course he assumes a prior of 0.5:

Nathan

Nathan: Thanks for reminding me. I do recall a strange use of power in their book as a way to compute the “true” type 1 error probability. But what can they mean? Are they inadvertently looking for a ratio of sev(~Ho)/sev(Ho)?

Mayo,

Dunno. I haven’t read their book. I am waiting for it to show up on the $1 rack at Strand’s. I was drawing on their brief to the Court. I suspect that they are doing something along the lines you suggested.

Nathan

Wouldn’t the likelihood function showing the probability of the probability of the observation as a continuous function of the effect size be the best way to assess the evidence in this case?

Dividing the conceptual responses to the data into ‘P less than 0.05: the vouchers work’ versus ‘P greater than 0.05: we cannot exclude the possibility that vouchers fail to work’ seems to me to be particularly unhelpful, as you no doubt agree. Severity analysis may provide something better than that, and it seems to be doing so by providing something like a likelihood function. If that is the case, then why not go with a likelihood function?

Michael: What you allude to at the start does sound to be along the lines of severity, but it is not a likelihood (as formally defined) but an error probability stemming from the distribution of the P-value statistic over varying hypothesized discrepancies—if I’m understanding you.

I don’t understand your response. To what does the “it” refer?

I also notice that my comment is incomplete. Does a less than symbol do something in this system? The second sentence should read “Dividing the conceptual responses to the data into ‘P less than 0.05, the vouchers work’ versus ‘P greater than 0.05, we cannot exclude the possibility that the vouchers fail to work’ seems to me …”

Michael. The “it” referred to your “probability of the probability” phrase, which might be cashed out in relation to the distribution of the P-value (as a statistic). I guess I was wrong about your meaning.

OK, so now I’ll ask the question again, taking more care to see that the message comes through.

The likelihood function shows the probability of the observation as a function of all possible values of the effect size. It would seem to be exactly what is needed for assessment of the evidence in this case. Is that not true?

Your severity index seems to be doing something very similar to a likelihood function. Have you explored the relationship between severity and likelihoods?

(I have previously asked whether severity is directly proportional to the integral of the likelihood but you didn’t answer.)

To: Michael Lew: No, I do *not* think that the likelihood is exactly what is needed for assessment of evidence. To clarify, a likelihood function *fixes* the observed data set, and is a function of the unknown parameter(s) of interest. Hence, I see the likelihood as mainly focusing on the *observed* data at hand; whereas severity (in my understanding) *also* looks at cases that were *not* (directly) observed! Simply put, I think that the relationship between severity and likelihoods is *not* as close as what you propose.

Thanks Nicole. I’m still not sure that I agree. The severity function may look at the numbers in a different manner, but if it really yields a function of evidence in the data then it should be commensurable with a likelihood function. Even if the conceptual models of severity functions and likelihood functions differ, their numerical values do not need to differ.

It seems to me that a severity function for a sample from a normal population with known variance (as in section 2.3 of Mayo & Spanos 2011) can be calculated exactly as one minus the integral of the likelihood function. Is that the case?

(If the answer is “yes” then a second question relates to whether it is always the case.)
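For what it’s worth, the conjecture is easy to probe numerically in the normal, known-variance case (a sketch; the grid, the toy numbers x̄ = 75 and SE = 20, and the normalization of the likelihood to unit area are all my assumptions, not anything from Mayo & Spanos):

```python
import numpy as np
from scipy.stats import norm

x_bar, se = 75.0, 20.0                    # toy numbers from the voucher example
mu = np.linspace(-200.0, 350.0, 20001)    # grid over the parameter mu
dmu = mu[1] - mu[0]

lik = norm.pdf(x_bar, loc=mu, scale=se)   # likelihood of each mu given x_bar
lik = lik / (lik.sum() * dmu)             # normalize so it integrates to ~1

def sev_greater(mu0):
    # SEV(mu > mu0) in the normal model: P(X_bar < x_bar; mu = mu0)
    return norm.cdf((x_bar - mu0) / se)

def lik_tail(mu0):
    # integral of the normalized likelihood above mu0 (crude Riemann sum)
    return lik[mu >= mu0].sum() * dmu

for mu0 in (30.0, 50.0, 75.0):
    print(mu0, round(sev_greater(mu0), 3), round(lik_tail(mu0), 3))
```

In this one special case the two columns agree to the accuracy of the grid, i.e. the severity curve coincides with the tail integral of the normalized likelihood; whether that relationship generalizes is a separate question.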

To: Michael Lew: You say, “if it really yields a function of evidence in the data then it should be commensurable with a likelihood function.” That assumption is problematic (to me) because I do *not* think that the notion of evidence should be captured by a measure, whether it is likelihoods or posterior probabilities, or anything along similar lines. In fact, an *adequate* notion of evidence, if there is one, goes *beyond* the formalisms.

It *may* be the case in that particular situation that severity is one minus the integral of the likelihood function (I would need to re-read the section you referred to); but even if that relationship (between severity and likelihoods) holds in that particular case, there is *no* way, in my opinion, to *generalize* that relation to all cases! Besides, I do *not* see how computing the integral of the likelihood would be of interest – why would one want to compute such an integral?

A point of clarification: computing integrals of probability distributions *is* of interest, yet the likelihood function is *not* to be regarded as a probability because it does *not* satisfy the probability axioms. In addition, I have *not* encountered any cases myself where it is of interest to compute an integral of a likelihood function, *as opposed to* an integral of a probability distribution.

Nicole: If you stretch the notion of evidence beyond the scope of any mathematical formalism then you’ve stretched it beyond usefulness. That might put it into a philosopher’s comfort zone, but I’m no philosopher. Consider me as a person who is interested in what could be called “statistical quantitation of experimental evidence” and see if you can humor me.

One might want to compute the integral of a likelihood function for it to serve as the severity curve, perhaps. However, it seems difficult to get a considered response to the question of whether it might serve in that way. My understanding of severity and likelihood functions leads me to suppose that they have similar properties. The example that I gave seems to show the direct relationship, so if you need to re-read that section then you should do so before pronouncing on the issue.

If you have not read Hacking’s book (http://www.amazon.com/Logic-Statistical-Inference-Ian-Hacking/dp/0521290597), then I recommend that you do so. You will find it fascinating even if you don’t find it useful.

Hacking’s book from 1965 is great – for its time – but he renounced the law of likelihood shortly after. The problem is that one can always find a rival hypothesis H2 such that H2 is maximally likely on the data. Since one can readily do so even when H2 is false, this is a terrible rule for evidence. Plus there’s the fact that additional inferences are needed to get the statement comprising evidence “x” and to warrant the underlying statistical model. As Hacking came to see, at least as I read him, the presumption that there’s such a thing as a formal logic of evidence is just a holdover from outdated and rejected logicist philosophies based on the tenets of logical empiricism. If you search for likelihood on this blog, and Hacking’s review of Edwards’ book, you’ll find at least one place.

Mayo: There is no reply link for your comment, so I’m replying here.

Hacking recanting on likelihood is very interesting. Can you give me a reference?

I don’t understand what you mean by H2 and H1. What would H2 be if H1 was ‘the coin is fair’ and the observed sequence was HTTHTHH?

Speaking as an engineer, the supposed null hypothesis that there would be an improvement of 100 points is badly posed, since any measured value needs a tolerance. You can’t expect to measure a value to arbitrary precision and have all those decimal points mean anything. The missing part of the H0 specification is, what is the precision? Also, a null hypothesis ought normally to be the case that nothing changes (“null”, right?), but let’s set that aside.

The proper main conclusion is that, from the reported data, we can’t tell whether the value is 75, 100, or some other value in that general range, because we don’t have enough precision in the measurement. But clearly the effect is unlikely to be as low as 0. The SEV() values reflect this statement and give a stronger numerical sense to what the term “unlikely” indicates, at least assuming a normal distribution.

Even this is incomplete, if only because we don’t have a control. What is the standard error of the old, non-voucher system? Maybe the experiment with the vouchers drastically increased the standard error. Maybe it decreased it. We don’t know. If the original system had a sample mean of 0 but a sample standard error of 55, I wouldn’t be able to conclude much except that the new system had reduced the variance. I certainly wouldn’t like to make a claim about any supposed improvement in scores. Since we don’t know this information, we’re not really in a position to evaluate the new system.

If the old system had a standard error of the mean of, say, 5, I’d be able to conclude that the new system had made a change in the mean (with pretty good severity), but I’d really like to know why the standard error had increased so much (from 5 to 20) before relying on that conclusion.

You may say that the problem statement above was simplified for the sake of illustration, but the simplification has eliminated any chance of drawing sound conclusions about the value of the new system relative to the old.

Tom: Thank you for this. Yes, my first reaction to his toy illustration was, where does one begin? I totally agree about the ill framing, and the fact that much more would typically be known in evaluating the impact of the vouchers… But I thought it worth pointing up the blatant flaw in the alleged posterior odds. I don’t see how it can be recommended by some who bill themselves as Reformers trying to avoid misinterpreting p-values! This kind of thing is a serious obstacle to “how to avoid lying about statistics”!

The author (Syll) has not responded or corrected his mistake, and will doubtless continue repeating it. Puzzling.

The New Yorker has a telling review of Silver: http://www.newyorker.com/online/blogs/books/2013/01/what-nate-silver-gets-wrong.html#ixzz2J6UMfmDG

Here’s the comment I posted, and another reaction:

My comment:

I’m very glad to hear a disavowal of Silver’s evidence-free Bayesian cheerleading. Anyone who could say that Fisher, the man who developed randomization among other ways to deal with threats of bias in experimental design, was searching for “immaculate” statistical procedures and that he excluded human error and bias has obviously never read Fisher. Perhaps in honor of Fisher’s birthday in February, Silver will consider reading Statistical Methods and Scientific Inference and report back. Errorstatistics.com

Posted 1/26/2013, 12:39:02am by deborahmayo

Someone else’s comment:

Yes, Bayesians annoy the hell out of me, too. A cocktail of the well-known and the wrong-headed dressed up as a whole new way to do stats. The real problem is a whole new generation of statisticians who are brought up “in the faith” and whom you then have to teach that some simple concept like likelihood is not an intrinsic component of their ontologically untenable belief system.

Posted 1/26/2013, 5:48:34am by JonesyTheCar

Incidentally, this issue is now getting a fair amount of airplay throughout the blogosphere, and I can’t follow it all – though I’ve written a bunch of comments here and there – starting from the New Yorker, to Gelman’s blog, to Twitter and whatnot. If anyone wants to trace some of it, collect it, and report your findings to us, I’d be glad to post it.

Michael: Here’s a Hacking reference I could think of quickly: Hacking 1972: “Likelihood,” British Journal for the Philosophy of Science 23:132-137. See this blog:

http://errorstatistics.com/2012/07/02/more-from-the-foundations-of-simplicity-workshop/

For the coin toss sequence, the H2 could assert that the probability of ‘H’ (heads) is one just on those trials that resulted in heads. You get the idea.
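A tiny sketch of the point, using the HTTHTHH sequence from the question (the variable names are mine):

```python
seq = "HTTHTHH"

# H1: the coin is fair -- every sequence of 7 tosses has probability (1/2)^7
lik_fair = 0.5 ** len(seq)       # 1/128

# H2: the probability of heads is 1 exactly on the trials that came up heads
# (and 0 on the rest), so H2 assigns the observed sequence probability 1
lik_h2 = 1.0

print(lik_h2 / lik_fair)         # likelihood ratio of 128 in favor of H2
```

So the rigged H2 beats the fair coin by a likelihood ratio of 128, even though H2 was constructed after the fact and is of no evidential interest.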

Such a hypothesis would be outside the scope of acceptable hypotheses if one was interested in the value taken by a single probability of heads covering all tosses. I don’t see how it can possibly invalidate the notion that likelihood functions depict evidence.

I’ll look at Hacking’s paper and see if there is something more to the idea.

OK, I’ve read the Hacking paper (book review) and it is true that he has clearly moved away from likelihood and finds Edwards’ arguments in its favor to be unconvincing in three ways. The first two of Hacking’s complaints seem to apply to likelihood ratios but, in my opinion, not to likelihood functions. I will refute them below. The third of Hacking’s complaints (on the last page of his review, I’ll call it C3) seems to be that Edwards has treated the Bayesian prior distribution of probability as if it were a likelihood and called it ‘prior support’. I do not find that to be objectionable, but Hacking seems to be worried by the fact that prior support does not change its nature or scope after the experimental evidence has been factored in, and he also expresses some (misplaced) concern that the prior is potentially subjective (!). I do not agree with Hacking’s worries in C3, but perhaps I do not understand them properly. However, given that they relate to generation of a Bayesian posterior, they may not be at all important to the evidential aspects of a likelihood function.

Hacking’s first two complaints are: C1, if two models yield the same likelihood ratio for a particular pair of hypotheses, there is no way to know if the ratios mean the same thing; and C2, a single observation appears to support a very unlikely hypothesis (the ‘tank’ example).

C1: It seems desirable that likelihood ratios from different models be directly comparable, and that the comparison be meaningful. However, in many circumstances we are dealing with estimation on a continuous scale, and the relevant likelihoods are continuous functions rather than point values. (Any situation that might allow generation of a severity function would probably imply such a likelihood function.) Those functions have not only a maximal height, but also a location, a width and a shape, and all of those features have a role in specifying the exact evidence provided by the data. Hacking’s two different models would (presumably) yield different likelihood functions that differ not just in height but in one or more of the other aspects. To attempt to compare them on the basis of a single ratio of two points on the continuous axis seems to be almost pointless. Thus complaint C1 seems to me to have been born of a convenient simplification that has led to an unfortunate oversimplification.

C2: Hacking complains that a single observation yields a nonsensical estimate of variance. However, I think it is more reasonable to say that a single observation yields NO estimate of variance. He says that observing tank number 2176 best supports the hypothesis that there are exactly 2176 tanks. That is clearly nonsensical, but it is again dependent on the idea that a single observation allows estimation of both the mean and a range parameter. It clearly cannot do that, so the argument does nothing to weaken claims that likelihood functions encapsulate the evidential meaning of experimental data.
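To make the tank example concrete (a sketch; the serial number 2176 is the one in Hacking’s example, while the helper name is mine):

```python
def tank_likelihood(n, observed=2176):
    # With n tanks numbered 1..n and one serial number observed uniformly at
    # random, the likelihood of n is 1/n when n >= observed, and 0 otherwise.
    return 1.0 / n if n >= observed else 0.0

# The likelihood is zero below 2176 and strictly decreasing above it, so it
# is maximized at exactly n = 2176 -- the "nonsensical" best-supported value.
best = max(range(1, 10001), key=tank_likelihood)
print(best)   # 2176
```

The maximum sits at the observed serial number itself, which is exactly the feature Hacking finds absurd and Lew attributes to asking a single observation to estimate too much.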

All in all, this paper is a very nice book review and it’s written in a plain style that would be approachable by a wider range of readers than his book. However, it doesn’t seem to have the rigor of Hacking’s earlier book (1965), and because of that I find it disappointing. It certainly does not contain a compelling reason to discard or replace likelihood functions.

The hypotheses can be anything at all for Hacking. George Barnard, another early supporter of likelihood, makes the same point incidentally. One can always arrive at a hypothesis that says in effect “whatever happened had to have happened.”

A statistical God hypothesis. It has no explanatory power and is only a hindrance to scientific consideration of evidence. Is that really the reason that you consider likelihood functions to be unusable as pictures of experimental evidence? (Seems like a pretty poor excuse to me.)

Would it be ‘monster barring’ to say that the scope of hypotheses should be restricted to those that predict something useful? To those that resemble those actually contemplated by the experimenter?

Michael Lew: Here is a reconstruction of severity as a kind of likelihood function.

Suppose I decide to carry out a statistical hypothesis test in the following fashion. First I choose some one-dimensional statistic of the data to be collected in such a way that its median (or expectation or some other measure of central tendency) is a monotonic increasing function of a one-dimensional parameter of interest. Next, I choose a threshold and decide to report pass/fail according to whether the statistic was either (i) less than, or (ii) equal to or greater than the threshold. Naturally this pass/fail report is a random variable (well, random element technically, but whatevs), and hence has a sampling distribution that depends on the parameter value.

So far so good. Now I collect my data, and find that by a *remarkable* coincidence, the value of my statistic is exactly equal to the threshold I chose prior to seeing the data. I report the likelihood function of the parameter given — not the complete data, not even the statistic, but the pass/fail variable. *This* likelihood function is mathematically identical to the severity function.

Obviously this “likelihood function” corresponds to a badly mangled description of the experiment that was actually carried out, so I’m not sure what can be learned from this mathematical reconstruction.
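In the voucher example’s numbers, the identity Corey describes can be checked directly (a sketch under my reading of his construction: observed statistic = threshold = 75, SE = 20, and a “fail” report because the statistic is not below the threshold; all helper names are mine):

```python
from scipy.stats import norm

x_bar, se, threshold = 75.0, 20.0, 75.0  # observed statistic equals the threshold

def sev_less(mu0):
    # Severity for "mu < mu0" given the outcome: P(X_bar > x_bar; mu = mu0)
    return 1.0 - norm.cdf((x_bar - mu0) / se)

def lik_fail(mu):
    # Likelihood, as a function of mu, of the dichotomized "fail" report
    # (statistic >= threshold): P(X_bar >= threshold; mu)
    return 1.0 - norm.cdf((threshold - mu) / se)

for m in (50.0, 75.0, 100.0, 120.0):
    print(m, sev_less(m), lik_fail(m))   # the two columns coincide
```

The two curves are the same function of the parameter, and sev_less(100) ≈ .89 reproduces the sev(µ < 100) value reported in the post.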

Corey: Hadn’t noticed this. I understand your trying to get an error probability by grouping outcomes into accept/reject regions, but this will not be equal to a SEV assessment which, for starters, is always relative to a particular inference.