I don’t know how to explain to this economist blogger that he is erroneously using p-values when he claims that “the odds are” (1 – p)/p that a null hypothesis is false. Maybe others want to jump in here?
On significance and model validation (Lars Syll)
Let us suppose that we, as educational reformers, have a hypothesis that implementing a voucher system would raise mean test results by 100 points (null hypothesis). Instead, when sampling, it turns out it only raises them by 75 points, with a standard error (telling us how much the mean varies from one sample to another) of 20.
Does this imply that the data do not disconfirm the hypothesis? Given the usual normality assumptions on sampling distributions, with a t-value of 1.25 [(100-75)/20] the one-tailed p-value is approximately 0.11. Thus, approximately 11% of the time we would expect a score this low or lower if we were sampling from this voucher system population. That means – using the ordinary 5% significance level – we would not reject the null hypothesis, although the test has shown that it is likely – the odds are 0.89/0.11 or 8-to-1 – that the hypothesis is false…
And as shown over and over again when it is applied, people have a tendency to read “not disconfirmed” as “probably confirmed.” But looking at our example, standard scientific methodology tells us that since there is only 11% probability that pure sampling error could account for the observed difference between the data and the null hypothesis, it would be more “reasonable” to conclude that we have a case of disconfirmation.
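The arithmetic in the quoted passage can be checked directly. Here is a minimal sketch, assuming the normal approximation the post itself invokes ("the usual normality assumptions"); the variable names are mine, not the post's:

```python
from math import erf, sqrt

def normal_cdf(x):
    """Standard normal CDF, computed via the error function."""
    return 0.5 * (1 + erf(x / sqrt(2)))

null_mean = 100   # hypothesized effect of the voucher system
observed = 75     # observed mean increase in the sample
se = 20           # standard error of the mean

# t-value as computed in the post: (100 - 75) / 20
t = (null_mean - observed) / se

# One-tailed p-value: probability of a result this low or lower
# under the null, i.e. the lower tail beyond -t
p_one_tailed = 1 - normal_cdf(t)

print(f"t = {t}, one-tailed p ≈ {p_one_tailed:.3f}")
```

This reproduces t = 1.25 and p ≈ 0.106, the "approximately 0.11" in the quote. Note that nothing in this computation yields odds of 8-to-1 that the null is false; (1 − p)/p is not a posterior odds ratio, which is the error the post is pointing out.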
Of course, as we’ve discussed many times, failure to reject a null or test hypothesis is not evidence for the null (search this blog). We would, however, note that for the hypotheses H0: µ > 100 vs. H1: µ < 100, and a failure to reject the null, one is interested in setting severity bounds such as:
sev(µ > 75) = .5
sev(µ > 60) = .773
sev(µ > 50) = .894
sev(µ > 30) = .988
So there’s clearly very poor evidence that µ exceeds 75*. Note too that sev(µ < 100)=.89.**
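Under the normal approximation, these severity bounds follow from a single formula: for the claim µ > µ₁, SEV = Φ((x̄ − µ₁)/SE), where x̄ = 75 and SE = 20. A short sketch (variable names are my own):

```python
from math import erf, sqrt

def normal_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + erf(x / sqrt(2)))

x_bar = 75  # observed mean increase
se = 20     # standard error

def sev_greater(mu1):
    """Severity for the claim µ > mu1, given the observed mean
    (normal approximation): Phi((x_bar - mu1) / se)."""
    return normal_cdf((x_bar - mu1) / se)

for mu1 in (75, 60, 50, 30):
    print(f"sev(µ > {mu1}) = {sev_greater(mu1):.3f}")

# And for the claim µ < 100: Phi((100 - x_bar) / se)
sev_less_100 = normal_cdf((100 - x_bar) / se)
print(f"sev(µ < 100) = {sev_less_100:.3f}")
```

This reproduces the four listed values (.5, .773, .894, .988) and sev(µ < 100) ≈ .894, rounded to .89 in the post. The first line makes the point concrete: severity for µ > 75 is only .5, i.e. very poor evidence that µ exceeds the observed 75.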
I agree the issue of model validation is always vital – for all statistical approaches. See the unit beginning here.
*As Fisher always emphasized, several tests are required before regarding an experimental effect as absent or present. One might reserve SEV for such a combined assessment.
**I am very grateful to Aris Spanos for number-crunching for this post, while I’m ‘on the road’.
I think that a big problem to overcome is that many people apparently believe (subconsciously, to some extent) that if you have data and a hypothesis, there is somewhere out there in logical space an objective, unique and well-defined true probability that the hypothesis is true given the data. Since significance tests have *something* to do with measuring evidence from data for or against a hypothesis, these people either assume that such tests tell you something about this true underlying probability, or conclude that significance tests are bogus and everyone should be a Bayesian. The idea that significance tests tell you something else – not such a probability, but still well-interpretable and useful information – apparently doesn’t get into the heads of such people.
It reminds me of this survey linked on Andrew Gelman’s blog, which implicitly assumes that, given two hypotheses and data, there is a true objective relative weight of evidence of the data for the two hypotheses (and then seems to mock p-values because this is not what they deliver).