The “P-values overstate the evidence against the null” fallacy

Posted on January 19, 2017 by Mayo

The allegation that P-values overstate the evidence against the null hypothesis continues to be taken as gospel in discussions of significance tests. All such discussions, however, assume a notion of “evidence” that’s at odds with significance tests–generally Bayesian probabilities of the sort used in the Jeffreys-Lindley disagreement (default or “I’m selecting from an urn of nulls” variety). Szucs and Ioannidis (in a draft of a 2016 paper) claim “it can be shown formally that the definition of the p value does exaggerate the evidence against H0” (p. 15) and they reference the paper I discuss below: Berger and Sellke (1987). It’s not that a single small P-value provides good evidence of a discrepancy (even assuming the model, and no biasing selection effects); Fisher and others warned against over-interpreting an “isolated” small P-value long ago. But the formulation of the “P-values overstate the evidence” meme introduces brand new misinterpretations into an already confused literature! The following are snippets from some earlier posts–mostly this one–and also include some additions from my new book (forthcoming).

1. What you should ask…

When you hear the familiar refrain, “We all know that P-values overstate the evidence against the null hypothesis”, what you should ask is:

“What do you mean by overstating the evidence against a hypothesis?”

One honest answer is:

“What I mean is that when I put a lump of prior probability π₀ = 1/2 on a point null H₀ (or a very small interval around it), the P-value is smaller than my Bayesian posterior probability on H₀.”

Your reply might then be: (a) P-values are not intended as posteriors in H₀ and (b) P-values can be used to determine whether there is evidence of inconsistency with a null hypothesis at various levels, and to distinguish how well or poorly tested claims are–depending on the type of question asked. A report on the discrepancies “poorly” warranted is what controls any overstatements about discrepancies indicated with large n.

You might toss in the query: Why do you assume that “the” correct measure of evidence (for scrutinizing the P-value) is via the Bayesian posterior?

If you wanted to go even further you might ask: And by the way, what warrants your lump of prior to the null? (See Section 3. A Dialogue.)

^^^^^^^^^^^^^^^^^

2. J. Berger and Sellke and Casella and R. Berger

It is well-known that for a fixed P-value, with a sufficiently large n, even a statistically significant result can correspond to large posteriors in H₀ (Jeffreys-Lindley disagreement). I.J. Good recommended decreasing the required P-value as n increases, and had a formula for it. A more satisfactory route is to ensure the interpretation takes account of the (obvious) fact that with a fixed P-value and increasing n, the test is more and more sensitive to discrepancies–much as is done with lower/upper bounds of confidence intervals. (See Mayo and Spanos 2011: Fallacy #4, p. 174).
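To make the point about sensitivity concrete, here is a minimal sketch, assuming a two-sided z-test of a Normal mean with known σ. The function names, the choice σ = 1, and the discrepancy δ = 0.1 are illustrative assumptions, not part of the original discussion. It computes the observed discrepancy that is just significant at the .05 level, and the power against a fixed discrepancy, as n grows.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard Normal: mean 0, sd 1


def just_significant_discrepancy(n, alpha=0.05, sigma=1.0):
    """Smallest |sample mean - mu0| that reaches two-sided level alpha."""
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    return z_crit * sigma / sqrt(n)


def power(delta, n, alpha=0.05, sigma=1.0):
    """Probability of rejecting H0 when the true mean is mu0 + delta."""
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    shift = delta * sqrt(n) / sigma
    return STD_NORMAL.cdf(-z_crit + shift) + STD_NORMAL.cdf(-z_crit - shift)


# Illustrative sample sizes; delta = 0.1 is a hypothetical fixed discrepancy.
for n in (50, 1000, 100000):
    print(n, round(just_significant_discrepancy(n), 4), round(power(0.1, n), 3))
```

The same P-value thus corresponds to an ever smaller observed discrepancy as n increases, which is why the interpretation, rather than the required P-value, can be adjusted for sample size.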

The Jeffreys-Lindley result is generalized in J. Berger and Sellke (1987). They make out the conflict between P-values and Bayesian posteriors by considering the two-sided test of the Normal mean, H₀: μ = μ₀ versus H₁: μ ≠ μ₀.

“If n = 50…, one can classically ‘reject H₀ at significance level p = .05,’ although Pr (H₀|x) = .52 (which would actually indicate that the evidence favors H₀).” (Berger and Sellke, 1987, p. 113).

If n = 1000, a result statistically significant at the .05 level takes the probability on the null from a prior of .5 up to a posterior of .82!

While from their Bayesian perspective, this appears to supply grounds for denying P-values are adequate for assessing evidence, significance testers rightly balk at the fact that using the recommended priors allows highly significant results to be interpreted as no evidence against the null–or even evidence for it! Note, the probability on the null goes up from .5 to .82 when a statistically significant result at the .025 level (one-sided) is observed.

The following chart records the posterior probability on the null.

From J. Berger and T. Sellke (1987) “Testing a Point Null Hypothesis,” JASA 82(397): 113.
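The two figures quoted above (.52 at n = 50 and .82 at n = 1000) can be reproduced with a short sketch. It assumes a spike of prior probability 1/2 on H₀ and, under H₁, a Normal prior for μ centered at μ₀ with the same standard deviation as a single observation. That prior on the alternative is an assumption made here for illustration, and the function name is hypothetical.

```python
from math import exp, sqrt


def posterior_null(z, n, prior_null=0.5):
    """Posterior probability of H0: mu = mu0, given the z statistic.

    Assumes (for illustration) that under H1 the mean is drawn from a
    Normal centered at mu0 with the same sd as one observation.
    """
    # Bayes factor in favor of H0 under the assumed prior on the alternative.
    bayes_factor_01 = sqrt(1 + n) * exp(-0.5 * z**2 * n / (n + 1))
    prior_odds = prior_null / (1 - prior_null)
    posterior_odds = prior_odds * bayes_factor_01
    return posterior_odds / (1 + posterior_odds)


# z = 1.96 is the two-sided .05 cutoff.
for n in (50, 1000):
    print(n, round(posterior_null(1.96, n), 2))
```

With z = 1.96 this gives roughly .52 for n = 50 and .82 for n = 1000: the Bayes factor grows like the square root of n while the P-value is held fixed.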

Many think this shows that the P-value ‘overstates evidence against a null’ because it claims to use an ‘impartial’ Bayesian prior probability assignment of .5 to H₀, the remaining .5 spread out over the alternative parameter space. (But see the justification Berger and Sellke give in Section 3. A Dialogue.) Casella and R. Berger (1987) charge that the problem is not P-values but the high prior, and that “concentrating mass on the point null hypothesis is biasing the prior in favor of H₀ as much as possible” (p. 111) whether in 1 or 2-sided tests. There’s nothing impartial about these priors. Casella and Berger show that the reason for the wide range of variation of the posterior is the fact that it depends radically on the choice of alternative to the null and its prior.[i] There’s ample latitude (in smearing the alternative) so that the Bayes test only detects (in the sense of favoring probabilistically) discrepancies quite large (for the context).
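How much the posterior depends on the smearing of the alternative can be seen in a small sketch. It keeps the spike of .5 on H₀ but assumes, under H₁, a Normal prior for μ centered at μ₀ with standard deviation τσ. The scale parameter `tau` and the function name are hypothetical, introduced only for illustration.

```python
from math import exp, sqrt


def posterior_null(z, n, tau=1.0, prior_null=0.5):
    """Posterior of a point null when, under H1, mu ~ Normal(mu0, (tau*sigma)^2)."""
    spread = n * tau**2
    bayes_factor_01 = sqrt(1 + spread) * exp(-0.5 * z**2 * spread / (1 + spread))
    odds = prior_null / (1 - prior_null) * bayes_factor_01
    return odds / (1 + odds)


# Same data (z = 1.96, n = 50), different smearing of the alternative.
for tau in (0.1, 0.5, 1.0, 5.0, 20.0):
    print(tau, round(posterior_null(1.96, 50, tau=tau), 3))
```

For the same just-significant result, the posterior on the null runs from under .4 to above .9 across these choices of `tau`, so the "conflict" with the P-value is largely a matter of which alternative prior is chosen.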

Stephen Senn argues, “…the reason that Bayesians can regard P-values as overstating the evidence against the null is simply a reflection of the fact that Bayesians can disagree sharply with each other” (Senn 2002, p. 2442). Riffing on the well-known joke of Jeffreys (1961, p. 385):

It would require that a procedure is dismissed [by significance testers] because, when combined with information which it doesn’t require and which may not exist, it disagrees with a [Bayesian] procedure that disagrees with itself. Senn (ibid. p. 195)

In other words, if Bayesians disagree with each other even when they’re measuring the same thing–posterior probabilities–why be surprised that disagreement is found between posteriors and P-values! See Senn’s interesting points on this same issue in his letter (to Goodman) here, as well as in this post, and its sequel.

Note, too, the conflict with confidence interval reasoning since the null value (here it is 0) lies outside the corresponding confidence interval (Mayo 2005). The same conflict persists between Bayesian “tests” and Bayesian credibility intervals.

^^^^^^^^^^^^^^^^^

3. A Dialogue (ending with a little curiosity in J. Berger and Sellke):

So a guy is fishing in Lake Elba, and a representative from the EPA (Elba Protection Association) points to notices that mean toxin levels in fish were found to exceed the permissible mean concentration, set at 0.

EPA Rep: We’ve conducted two studies (each with a random sample of 100 fish) showing statistically significant concentrations of toxin, at low P-values, e.g., .02.

P-value denier: I deny you’ve shown evidence of high mean toxin levels; P-values exaggerate the evidence against the null.

EPA Rep: Why is that?

P-value denier: If I update the prior of .5 that I give to the null hypothesis (asserting toxin levels are of no concern), my posterior for H₀ is still not all that low, not as low as .05 for sure.

EPA Rep: Why do you assign such a high prior probability to H₀?

P-value denier: If I gave H₀ a value lower than .5, then, if there’s evidence to reject H₀, at most I would be claiming an improbable hypothesis has become more improbable. Who would be convinced by the statement ‘I conducted a Bayesian test of H₀, assigning prior probability .1 to H₀, and my conclusion is that H₀ has posterior probability .05 and should be rejected’?

The last sentence is a quote from Berger and Sellke!

“When giving numerical results, we will tend to present Pr(H₀|x) for π₀ = 1/2. The choice of π₀ = 1/2 has obvious intuitive appeal in scientific investigations as being ‘objective.’ (some might argue that π₀ should even be chosen larger than 1/2 since H₀ is often the ‘established theory.’) …[I]t will rarely be justifiable to choose π₀ < 1/2; who, after all, would be convinced by the statement ‘I conducted a Bayesian test of H₀, assigning prior probability .1 to H₀, and my conclusion is that H₀ has posterior probability .05 and should be rejected’? We emphasize this obvious point because some react to the Bayesian-classical conflict by attempting to argue that π₀ should be made small in the Bayesian analysis so as to force agreement.” (Berger and Sellke 1987, p. 115)

There’s something curious in assigning a high prior to the null H₀–thereby making it harder to reject (or find evidence against) H₀–and then justifying the assignment by saying it ensures that, if you do reject H₀, there will be a meaningful drop in the probability of H₀. What do you think of this?
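The arithmetic behind the quoted example is easy to check. The sketch below assumes only the odds form of Bayes’ theorem (posterior odds equal prior odds times the Bayes factor); the function names are hypothetical. It recovers the Bayes factor implied by moving H₀ from .1 to .05, then applies that same Bayes factor to a prior of .5.

```python
def implied_bayes_factor(prior_null, posterior_null):
    """Bayes factor (H0 over H1) implied by a prior and a posterior on H0."""
    prior_odds = prior_null / (1 - prior_null)
    posterior_odds = posterior_null / (1 - posterior_null)
    return posterior_odds / prior_odds


def update(prior_null, bayes_factor_01):
    """Posterior probability of H0 after applying a Bayes factor."""
    posterior_odds = prior_null / (1 - prior_null) * bayes_factor_01
    return posterior_odds / (1 + posterior_odds)


bf = implied_bayes_factor(0.1, 0.05)  # the prior .1 -> posterior .05 example
print(round(bf, 3), round(update(0.5, bf), 3))
```

The implied Bayes factor is about 0.47, and with a prior of .5 the same data leave H₀ at roughly .32. The data speak equally in both cases; only the starting point has changed.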

^^^^^^^^^^^^^^^^^^^^

4. A puzzle.

I agree with J. Berger and Sellke that we should not “force agreement”. Why should an account that can evaluate how well or poorly tested hypotheses are–as significance tests can do (if correctly used)–want to measure up to an account that can only give a comparative assessment (be they likelihoods, Bayes Factors, or other)?[ii] From the perspective of the significance tester, the disagreements between (audited) P-values and posterior probabilities are an indictment, not of the P-value, but of the posterior, as well as the Bayes ratio leading to the disagreement (as even one or two Bayesians appear to be coming around to realize, e.g., Bernardo 2011, 58-9). Casella and R. Berger show that for sensible priors with one-sided tests, the P-value can be “reconciled” with the posterior, thereby giving an excellent retort to J. Berger and Sellke.
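A minimal sketch of the one-sided reconciliation, under stated assumptions: H₀: μ ≤ 0 versus H₁: μ > 0, known σ, and a Normal prior for μ centered at 0 with standard deviation τσ (so no spike, and prior probability 1/2 on each side). This particular prior family, the parameter `tau`, and the function names are illustrative choices, not taken from Casella and Berger’s paper, which treats broader classes of priors.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard Normal: mean 0, sd 1


def one_sided_p_value(z):
    """P-value for H0: mu <= 0 versus H1: mu > 0."""
    return 1 - STD_NORMAL.cdf(z)


def posterior_null_one_sided(z, n, tau):
    """Pr(mu <= 0 | data) when mu ~ Normal(0, (tau*sigma)^2) a priori."""
    # Conjugate update: posterior mean over posterior sd equals z * shrink.
    shrink = sqrt(n * tau**2 / (1 + n * tau**2))
    return STD_NORMAL.cdf(-z * shrink)


z, n = 1.96, 50
print("p-value:", round(one_sided_p_value(z), 4))
for tau in (0.1, 1.0, 10.0, 100.0):
    print(tau, round(posterior_null_one_sided(z, n, tau), 4))
```

In this family the posterior probability of H₀ approaches the one-sided P-value as the prior becomes more diffuse, and is never below it, which is the sense in which the two can be “reconciled” without a lump of prior on the null.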

Changing the interpretation may satisfy J. Berger’s call for “an agreement on numbers”, but doing so precludes the proper functioning of P-values, confidence levels, and other error probabilities. And “what is the intended interpretation of the prior?” you might ask. Aside from the subjective construals (of betting and belief, or the like), the main one on offer (from the default Bayesians) is that the prior is undefined and is simply a way to compute a posterior. There are several conflicting default priors; there’s no agreement on which to use. You might ask, as does David Cox: “If the prior is only a formal device and not to be interpreted as a probability, what interpretation is justified for the posterior as an adequate summary of information?” (Cox 2006, p. 77)

The most common argument behind the “P-values exaggerate evidence” collapses. It reappears in different forms, also fallacious.[iii] I end with a quote from Senn.

The error that is made is not only to assume that P-values should be what they are not but that when one tries to interpret them in the way that one should not, the previous calibration survives.

It is as if in giving recommendations in dosing children one abandoned a formula based on age and adopted one based on weight but insisted on using the same number of kg one had used for years.

Error probabilities are not posterior probabilities. Certainly, there is much more to statistical analysis than P-values but they should be left alone rather than being deformed in some way to become second class Bayesian posterior probabilities.

[o] In a recent blog, I discussed the “limb-sawing” fallacy in that same paper.

[i] Berger and Sellke try another gambit. “Precise hypotheses…ideally relate to, say, some precise theory being tested. Of primary interest is whether the theory is right or wrong; the amount by which it is wrong may be of interest in developing alternative theories, but the initial question of interest is that modeled by the precise hypothesis test” (1987, p. 136). Are we really not interested in magnitudes?

[ii] To a severe tester, these aren’t really tests, in the sense that they don’t falsify, and they don’t satisfy a minimal severity principle.

[iii] The Bayesians Edwards, Lindman and Savage (1963, p. 235), despite being the first (?) to raise the “P-values exaggerate” argument, aver that for Bayesian statisticians, “no procedure for testing a sharp null hypothesis is likely to be appropriate unless the null hypothesis deserves special initial credence”. See also Pratt 1987, commenting on Berger and Sellke.

Epidemiologists Sander Greenland and Charles Poole give this construal of a spiked prior in the case of a two-sided test:

“[A] null spike represents an assertion that, with prior probability q, we have background data that prove [μ = 0] with absolute certainty; q = 1/2 thus represents a 50-50 bet that there is decisive information literally proving the null. Without such information…a probability spike at the null is an example of ‘spinning knowledge out of ignorance’”. (Greenland and Poole 2013, p. 66).

You might be interested in the comments in the original blog, and the continuation of the discussion comments here.

References (minimalist)

Berger, J. O. and Sellke, T. (1987). “Testing a point null hypothesis: The irreconcilability of p values and evidence,” (with discussion).

Casella G. and Berger, R. (1987). “Reconciling Bayesian and Frequentist Evidence in the One-sided Testing Problem,” (with discussion).

Cox, D. R. (2006), Principles of Statistical Inference: CUP.

Edwards, Lindman and Savage (1963). “Bayesian Statistical Inference for Psychological Research”.

Greenland and Poole (2013), “Living With P-values”.

Mayo, D. Statistical Inference as Severe Testing: CUP.

Mayo, D.G. and Cox, D. R. (2006). “Frequentist Statistics as a Theory of Inductive Inference”.

Mayo, D. and Spanos, A. (2011). “Error Statistics”.

Pratt, J. (1987). “Testing a point null hypothesis: The irreconcilability of p values and evidence: Comment.”

Senn, S. (2001). “Two Cheers for P-values”. Journal of Epidemiology and Biostatistics 6(2): 193-204.

Szucs and Ioannidis (2016), “When null hypothesis significance testing is unsuitable for research: a reassessment”.

Among Related Posts:

Comedy Hour at the Bayesian Retreat: P-values versus Posteriors.

Highly probable vs highly probed: Bayesian/ error statistical differences.

Categories: Bayesian/frequentist, fallacy of rejection, highly probable vs highly probed, P-values, Statistics | 47 Comments

47 thoughts on “The “P-values overstate the evidence against the null” fallacy”

January 20, 2017

Michael Lew

Seems to me that there is a very important mistake in those arguments that P-values overstate evidence. A fatal mistake that might not have been pointed out directly:

In the Bayesian framework the ‘evidence’ is the likelihood function, so any attempt to compare the evidential meaning of the P-values to Bayesian evidence should be based on the likelihood function rather than the posterior!

The criticism that the slab and spike prior is inappropriate is clearly related to mine, but it is not as direct. I suspect that few Bayesians would be able to defend the silliness of treating the posterior as evidence once it has been pointed out.

January 20, 2017

Mayo

Michael: Thanks for the comment. Some do show the mismatch between the p-value and likelihood ratios, others think the evidential sum-up is given in a posterior. But since they will often purport to give .5 to the null and the alternative, they will say it doesn’t matter. But the sneaky secret that is kept under wraps is that blurring the .5 over the alternatives gives low prior to the parameter values most likely. This allows the Jeffreys-Lindley paradox, and familiar attempts these days to claim evidence “for” the null. I take this to be the point on which Bayesian Bernardo (linked in the post) disagrees. A high LR in favor of an alternative, he’d say, should be taken as evidence for a discrepancy, whereas the spike and slab Bayesian can (if she chooses) take the same result as evidence for the null.

The most generous move they make is to consider the max likely alternative–so it’s point against point (null vs max likely alternative). Even then, they say, the p-value looks more impressive than the LR, but of course, this insists stones be pounds as Senn puts it.

January 20, 2017

Michael Lew

Yes, they may take “point against point”, but the problem with comparing a likelihood ratio to the P-value is that the P-value is non-specific with regard to any particular alternative.

The maximum likelihood point is an interesting point, and its likelihood ratio against the null point is undoubtedly the relative weight of evidence in the data about the merits of those points, according to the model. However, there is more to evidence than any single ratio can contain or communicate. The P-value communicates a different blend of evidential components than a likelihood ratio. The attempts to compare P-values with other forms and flavours of evidence are silly, as you know.

January 20, 2017

Mayo

Michael: Of course it’s crazy which is why I’m so amazed that it’s repeated several times a day as “demonstrated”.

January 20, 2017

Enrique Guerra-Pujol

Can we just abolish all talk of “p-values” already?

January 20, 2017

Mayo

Enrique: What we should abolish already are all uncritical repetitions of alleged knock-down demonstrations against significance levels, which were never more than part of a general account for assessing and controlling certain types of mistaken inferences. I know you recommend Bayesian ways, but I’ve never seen a posterior probability that captured well testedness–even when frequentist.

January 21, 2017

Enrique Guerra-Pujol

Thanks for the clarification… Agreed that “well testedness” is what really matters.

January 21, 2017

Waznoc

It is easy to construct an example in which the shift from prior to posterior favors the null while the p-value is less than .05. This gives the more instructive scenario in which the frequentist rejects the null while the Bayesian (regardless of p(H0)) strengthens their belief in it.

January 21, 2017

Mayo

Waznoc: We see this in the Berger Sellke discussion and chart.

January 22, 2017

West

Often my first thought when seeing p-values is “why don’t they just treat this as a parameter estimation problem?” I understand that model comparison is important when the model structures are notably different, but for something like mu_o vs not mu_o, don’t you just want an estimate of the actual mu?

January 22, 2017

Mayo

West: yes, typically once it’s inferred that there’s a genuine effect or discrepancy one wants to indicate its magnitude. In my approach, this is part of testing, but we’d also want to indicate the magnitudes that are unwarranted. Estimation is more similar in Bayesian and frequentist approaches, but the criticisms of the latter tend to focus on tests.
There are other cases, however, where you need tests.

I’m reblogging excerpts from this discussion so that current articles won’t just repeat the exact same claims without thinking them through from scratch with their own brains. I’m not too confident they will, given the echo chamber.

January 22, 2017

omaclaren

Hi Mayo, here’s another similar issue it might be worth clarifying (via twitter).

@JPdeRuiter:
Isn’t it a bit weird that when H0=True NHST Type I error rate is 5% even with infinite N.

[ojm comment: obviously there is no need to fix the type I error rate to 5%, but still]

@lakens replies – if you want evidence to increase with N – duh. But error control??? Jezus.

[ojm interpretation of lakens: yes evidence should increase with N but error control is distinct from evidence]

So my, perhaps naive, question is still: what is the precise relation between evidence, N and error control in the ‘error statistics’ approach? Is the ‘N gets large’ limit directly relevant to your account or a side issue? Is there an explicit quantitative measure of evidence or is this a qualitative notion in your account? Eg for likelihoodists it’s the likelihood ratio, for Evans it’s the ratio of posterior and prior etc.

Note – I’m not saying it’s best to have a quantitative measure of evidence or not, just asking if there is a numerical quantity playing this role in your account.

January 22, 2017

Mayo

Om: I’ve been developing and extending the account of severity based on error statistics, both in statistical and non-statistical cases in science, for a long time. Take a look at “Error Statistics” (Mayo and Spanos 2011).
http://www.phil.vt.edu/dmayo/personal_website/Error_Statistics_2011.pdf

January 22, 2017

omaclaren

That doesn’t directly address my question. Are you saying severity is a numerical measure of evidence?

Isn’t it typically 1 minus the pvalue?

January 22, 2017

omaclaren

Also, say you obtained SEV=0.99 for a claim in a case of N=2 and another in a case of N=10000. I assume this is possible? Would you say these claims have equal evidence because they have equal SEV?

January 22, 2017

Michael Lew

Oliver, for the same reason that makes me point to the fact that the full likelihood function should be inspected to understand the evidence in data, the full severity curve is necessary. A single point severity value might be the same for n=2 and n=1000, but the severity curve would be far steeper for the latter case. No sensible system for statistical evidence should focus on a single point in parameter space, because how rapidly the evidential support changes along the parameter scale should inform any response to the evidence.

January 23, 2017

Mayo

Michael: The account I put forward always requires reporting what has and what has NOT been severely passed. Minimally, that means a benchmark for a quite lousy inference, but I too prefer to report a range.

January 23, 2017

Michael Lew

Mayo, I’m not sure I understand your response. If you prefer report of the whole severity function then you could have said so. Given that you didn’t say so, I have to assume that it is not your preference. Thus I try to interpret your words.

I can report a point that has been tested severely and a point that has not passed the severe test. Is that what you mean by “requires reporting what has and what has NOT been severely passed”?

Perhaps my two point version is what you mean by “Minimally” and thus it would allow only a “quite lousy inference”. I can agree with that, but only if you allow me to point out that ANY inference that is informed only by the test results is a statistical inference that should be considered along with all other information before a scientific inference is made.

“I too prefer to report a range.” I prefer a range in some circumstances, but I insist that the analyst should always see the whole function. Distilling a likelihood function or a severity function down to a dichotomising (i.e. inside/outside) interval reduces the information content substantially. If a scientific inference is to be informed by all of the available information then the interval is inferior to the full function, no matter how convenient it might be for page layout in paper-based reports.

January 23, 2017

Mayo

Michael: It’s crazy to try and recapitulate years of careful writing in a blog comment. I think “Error Statistics” (Mayo and Spanos 2011) should answer your questions.
For more general philosophy of science applications, see the philosophy of science list on my publications page.

January 23, 2017

omaclaren

Mayo – this answers neither my nor Michael’s fairly simple questions.

January 23, 2017

omaclaren

Hi Mayo – so the SEV curve represents evidence in your account? Can I interpret SEV=0.8 evidentially? Or do I need the whole curve? Or what?

January 23, 2017

omaclaren

Hi Michael,
Yes, fair enough. I just want to know if SEV – the whole function or point values or whatever- is intended to represent evidence directly.

January 23, 2017

omaclaren

Also – this still does make me wonder if any philosophical account of evidence explicitly requires a continuous or smoothly differentiable or whatever ‘evidence function’. Why/why not?

While Leibnizians and Peirceans might make such regularity requirements, these issues seem rarely explicitly addressed in philosophy of science/stat. Also fits somewhat awkwardly with the ‘everything is really discrete’ stance people love to claim in statistical discussions.

If a parameterisation is arbitrary then why require continuity or differentiability or whatever? If a parameterisation is not arbitrary and must be, say, smoothly differentiable then where is this requirement made explicit in the various accounts?

January 23, 2017

Michael Lew

Oliver, I think that any attempt to define or evaluate accounts of evidence should step back from the methods. What properties do you think that statistical evidence should take?

I am in my heart a likelihoodlum and so my assumption is that evidence is comparative. That means that any measure of evidence has to relate the support for one thing relative to the support for another thing. The conventional likelihood account of evidential support does so fully when the whole likelihood function is considered, but only partially when the focus is on isolated likelihood ratios. I assume the same is true of severity, as I take it to be a close analogue of likelihood (Mayo will complain about that!).

I also think that statistical evidence can only exist within a statistical model. That means that comparisons of likelihood values from different models is pointless, and it also means that we should not tie the statistical evidence too strongly to real world decisions or beliefs.

As to smooth differentiability, I know that Laurie has some thoughts on that issue, but I am agnostic to that as I am unable to follow those thoughts. Nonetheless, I will advance the possibility that continuity or granularity requirements follow from the properties of real world numbers and statistical models. Thus, why? Because the model is unavoidable and the numerical results are unavoidably discrete.

Is parameterisation arbitrary? Not of itself, but the choice of statistical model determines the parameters and the choice of statistical model is often entirely arbitrary.

January 24, 2017

omaclaren

Hi Michael, there is definitely a lot we could (and have and I hope will continue to!) discuss here. My main goal for the moment is really to understand how Mayo interprets SEV.

Within her account, is a single SEV value interpretable? Do we need to evaluate the whole function or can each point be evaluated separately? Does the derivative or second derivative of the SEV function have an interpretation? Is the SEV function typically differentiable?

Most importantly- if SEV values are the same for different claims do they have the same ‘evidence’ (regardless of N etc)? Is ‘evidence’ directly measured numerically by SEV? What is the interpretation of SEV=0.8, say?

Do the answers to these questions differ from those given by the Edwards-style Likelihoodist?

January 24, 2017

Mayo

Om: Most of these questions are answered in, e.g., “Error Statistics”; some would require examination. Once the underlying reasoning is understood, others can creatively explore the research program.
January 24, 2017

omaclaren

Could you recapitulate a couple of the answers here?

a) Is a single SEV value interpretable or do you need to see the whole function?

b) Is SEV intended as a numerical measure of ‘evidence’ or something else?

c) If the SEV values are the same for different claims do they have the same ‘evidence’ or at least the same interpretation (remember Hacking’s worries about whether likelihood ratios have the same interpretation in different problems)?

January 24, 2017

omaclaren

Can’t resist a brief comment – I’m not sure any of performance, evidence or support are quite what I’m after, generally speaking. I just googled synonyms for ‘image’ – thinking ‘inverse image’ – and ‘likeness’ was the first result. So I still like this term.

But inverse image/likeness of what? Not the raw data since this may be interpreted in multiple ways. You need data features (Laurie) or statistics capturing ‘information’ in the data (Fisher), ie features of interest/’signal’ as opposed to ‘noise’. Then you can consider their inverse image.

But I think ‘likeness’ or whatever can be reasonably naturally taken to imply the need for some ‘semantics’ or ‘information-capturing statistics/features’. So, I’m still thinking likeness is a reasonable term for what I’m after, rather than ‘evidence’.

January 24, 2017

Mayo

Om: Inferring starts with an “indication”, a pointing (rather than evidence). A well tested claim is one you fail to falsify with stringent tests. But to show this requires a demonstration or ‘performance’ of the sort Fisher has in mind: knowing how (to bring about results that rarely fail to be significant). It’s interesting that the root of probability, probare, is really test/prove. The exception “proves” the rule = it tests the rule. If prob were used in the sense of well-probed, it would be more like Popper’s degree of corroboration or severity.
I don’t see distinguishing signal from noise as linked to likeness, because the key is to come out with more than you put in.
January 24, 2017

omaclaren

I think about these things differently to you, based in part on my own experiences trying to do science and mathematics, but before we continue would you be able to answer my a,b,c above? Thanks
January 24, 2017

Mayo

Om: If you’ve read some of my published work on the topic and find no answers, then I guess not; if you haven’t read my published work on the topic, then the answer is also no: I can’t possibly do justice to the whole megillah in a blog comment.
January 24, 2017

omaclaren

Can you offer rough responses to these three questions? I won’t hold you to them, promise!

I find your published work particularly vague on these points. I could guess but why not just get a direct comment from you? Sure they might be subtle points but why no rough response at all? They seem very important yet basic questions to get some indication on. How else could I confidently use SEV in practice?
January 25, 2017

omaclaren

OK I’ll guess that the intended ‘official’* error stat answers are

a) yes a single value is interpretable, though a curve obviously gives more information.

b) yes SEV is intended to represent a measure of evidence for a claim.

c) Yes two claims with the same SEV value are intended to have the same evidence, though other contextual information may alter this.

Given these answers one can decide how well SEV achieves these goals. Correct me if wrong.

(*Michael would, I’m guessing, offer different answers but that’s coz he’s a Likelihoodist! In his approach there is little to distinguish SEV from likelihood.)
January 25, 2017

Mayo

Om: It’s a testing concept. Try: http://www.phil.vt.edu/dmayo/personal_website/Error_Statistics_2011.pdf
January 25, 2017

omaclaren

Ugh, I’ve read that every time you’ve pointed me to it…you surely know this…surely…and of course it’s a testing concept…

You’ve said your aim is to provide an evidential interpretation of error statistical methods/tests and brought up SEV when I asked about an evidence measure. So I could only assume SEV is your measure of evidence. Otherwise there doesn’t seem to be an explicit one in your account, which is fine (as I said!) but I don’t see why you can’t just say so: ‘there is no explicit numerical measure of evidence in my account but there is…’. Easy. And not necessarily a bad thing.

I don’t understand why you can’t just answer the three simple questions…they’re not tricks! Just basic questions. You don’t have to give fully detailed responses, just good faith attempts to convey your point of view.

But at this point, while it’s disappointing, I give up.
January 25, 2017

omaclaren

(And a pvalue is a testing concept but is often described in terms of evidence – weak, strong, no evidence etc. So I really don’t understand your reply)
January 25, 2017

Mayo

Om: I use “evidential” or “inferential” interpretation of N-P methods, in the way that is meant by Birnbaum, Cox and others: i.e., only as opposed to a so-called behavioristic construal of N-P methods. I don’t even think Neyman held the strict behavioristic view, and I know Pearson didn’t.

I think you’re keen to know if I’ve supplied a general formal measure that holds for any hypothesis H and evidence E, possibly with background B, and the answer is this: all such logics of evidence were an outgrowth of logical positivism, logicism and the like–philosophies we’ve gotten beyond, thankfully. I would not only deny there is such a thing, I would deny we’d want such a measure–at least if we were seeking, as I am, an account that captures how we actually learn (gain ampliative information) from data. So, I’m not an evidential-relation (E-R) theorist. I recognize that many of the early attempts at stat foundations, by Carnap (until he rejected it), Jeffreys, Hacking (until he rejected it) and many others (I admit some still crave such a thing), held this to be the holy grail. It was to be like deductive logic only with probabilities. Ideally, purely syntactical and mechanical.
Such a program has degenerated and should be rejected. That doesn’t mean we don’t have systems that learn and that are adaptive, and genuine accounts of learning from error.

Those who fetishize deductive logic, with or without probabilities, forget that deductive arguments are too cheap to be worth having, and can always be had for the asking. The hard part is getting sound arguments, or at least approximately true premises.
But my account is inductive and while it can be equally formulated as deductive, the same inductive work goes into corroborating the premises.

Now in logical positivist days, it was thought the premises were just “given” claims of empirical data! It’s hard for us to believe it now, but it was only overthrown in the last 50 or so years. Unfortunately, the message hasn’t always reached those who revere formal evidential-relation logics. Their heroes can be found upholding views from naive positivistic/operationalist days, and since these were very smart guys, many assume they still hold.
January 26, 2017

omaclaren

No, I just wanted an answer to a,b,c above.
January 27, 2017

Mayo

I gave you one.
January 27, 2017

omaclaren

What was your answer to a?

Also to c?

And I guess the answer to b is ‘SEV is not a numerical measure of evidence but is something else’?
January 27, 2017

omaclaren

(I’m just after less obscurantist answers to the questions I labelled a,b,c)

January 23, 2017

Christian Hennig

“The choice of π0 = 1/2 has obvious intuitive appeal in scientific investigations as being ‘objective.’”
Ah! The joys of objectivity! 😉

Anyway, I think a key issue here is that we should actually not believe that the H0 is ever precisely true. Assuming that the H0 is some parameter a=0 with continuous parameter space and a reasonably smooth parametrisation of the model, this normally refers to potential violations of the model assumptions, but can also refer to a not being exactly zero, but being “practically” indistinguishable from zero – or even if it is distinguishable from zero, the deviation may not be practically meaningful (even if the theory to be tested seems to suggest a=0 precisely, we’re never safe from some minor biases in measurement etc.).

This makes putting a lump of 1/2 or whatever value at zero quite problematic; if one believes the theory behind a=0 is a good one, prior mass should also boost a neighbourhood around zero, not just zero precisely. The significance testing approach is to some extent better off, because it only measures to what extent the data are consistent with the H0 without requiring that the H0 is precisely true. If indeed a=10^{-8} it’s fair enough to not reject H0. On the other hand, as is well known, significance testing will tend to reject the H0 in case of too small and meaningless deviations with large samples; though of course the tester could do something about this if a definition of how large a deviation from a=0 is meaningless is available, and in any case one should look at effect sizes.
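The large-sample point can be made concrete with a short calculation. This is only a sketch under assumed conditions: a two-sided z-test of H0: a = 0 at the 5% level with known sigma = 1. The function name and the particular values of a and n are hypothetical.

```python
from math import erf, sqrt


def normal_cdf(z: float) -> float:
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))


def rejection_probability(a: float, n: int, sigma: float = 1.0,
                          z_crit: float = 1.959964) -> float:
    """P(|Z| > z_crit) for a two-sided z-test of H0: a = 0 when the true value is a."""
    shift = a * sqrt(n) / sigma  # mean of the test statistic under the true a
    return normal_cdf(-z_crit - shift) + 1.0 - normal_cdf(z_crit - shift)


# Compare an exactly true null, a negligible deviation, and a small one.
for a in (0.0, 1e-8, 0.01):
    for n in (100, 10_000, 1_000_000):
        prob = rejection_probability(a, n)
        print(f"a = {a:g}, n = {n:>9,}: P(reject H0) = {prob:.3f}")
```

With a = 10^{-8} the rejection probability stays at about the nominal 5% for all of these sample sizes, while a = 0.01 is rejected almost surely once n reaches a million. Whether 0.01 counts as meaningful is a subject-matter question that the test itself does not answer.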

January 23, 2017

Michael Lew

Christian, the endlessly repeated assertion that the null is never exactly true is both true and false. And it is a thought-killing zombie of misdirection.

Reason 1.
If we are talking about a continuous scale of parameters then yes, a point null is an infinitesimally small point and so we can expect that it will in practice never be exactly true. However, we live in an unavoidably granular world. The number of exactly expressible numeric values is an infinitely small fraction of all numbers so I can take them as being negligible. The value of the parameter under the null hypothesis is therefore not an infinitesimal point, but a range as wide as the decimal column of the last significant figure. It therefore has a non-zero probability of being true.

Reason 2.
Even if the assertion were true it would have no real-world consequence. All of the test statistics that are calculated on the assumption that the null is true retain ALL of their meaning when the null is not true. The assumption need not be satisfied in the real world for their statistical meaning to be valid because they are calculated in a statistical model. Assertions that, for example, P-values lose their meaning when the null is not true are rubbish and cannot be justified.

January 23, 2017

Christian Hennig

Michael:
R1: A model is for thinking, and using a continuous model we take into account infinitely many possibilities indeed. The world might be granular but how do you know that your H0 sits on a granule?
R2: Both the mess that Bayesians get into when assigning a lump of 1/2 to the H0 and the frequentist trouble with large sample sizes when naively testing a point H0 demonstrate that taking the H0 too literally as a “truth”-candidate has consequences.
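To see what the lump of 1/2 does numerically, here is a rough sketch. It assumes a normal mean with known sigma, prior mass pi0 on the point null a = 0, and a N(0, tau^2) prior on a under the alternative with tau = sigma. That choice of prior under the alternative is an assumption made for illustration, and the function names are hypothetical.

```python
from math import erf, exp, sqrt


def normal_cdf(z: float) -> float:
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))


def two_sided_p_value(z: float) -> float:
    """Two-sided P-value for an observed z statistic."""
    return 2.0 * (1.0 - normal_cdf(abs(z)))


def posterior_null(z: float, n: int, pi0: float = 0.5,
                   tau: float = 1.0, sigma: float = 1.0) -> float:
    """Posterior probability of the point null given z, with a N(0, tau^2) prior under H1."""
    rho = n * tau**2 / sigma**2
    # Bayes factor in favour of H0: ratio of the marginal densities of the sample mean.
    b01 = sqrt(1.0 + rho) * exp(-0.5 * z**2 * rho / (1.0 + rho))
    return pi0 * b01 / (pi0 * b01 + (1.0 - pi0))


z = 1.96  # just significant at the two-sided 5% level
print(f"P-value = {two_sided_p_value(z):.3f}")
for n in (10, 50, 1_000, 100_000):
    print(f"n = {n:>7,}: P(H0 | data) = {posterior_null(z, n):.2f}")
```

Holding z fixed at 1.96, the posterior on the point null climbs with n even though the P-value stays at 0.05. The gap between the two numbers comes from the lump prior and the fixed tau, not from the data.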

January 23, 2017

Michael Lew

Christian, how do I know that H0 sits on a granule? Because the probability of it sitting on an exactly expressible value is infinitesimally small and the probability of it sitting in a granule is indistinguishable from unity. That is the meaning of my Reason 1.

For your response to Reason 2, I will simply say that I prefer methods with increasing probability of yielding evidence against the null hypothesis as sample size increases when the null hypothesis is not exactly true. How would you prefer the methods to perform?

January 26, 2017

Christian Hennig

Regarding R2: The null hypothesis is never exactly fulfilled in practice, so I’d like to have increasing evidence against it when the H0 is violated to a practically meaningful extent. A case of interest is where the H0 is not exactly true but something is true that in practice has the same meaning as, or is indistinguishable from, the H0. And whatever method we use, we should be aware of what it does in such cases. (If it piles up evidence against the H0 in such cases, we need to be very cautious with interpretation.)

Pingback: “A megateam of reproducibility-minded scientists” look to lowering the p-value | Error Statistics Philosophy


© Deborah G. Mayo, Error Statistics Philosophy, 2011-2018 All Rights Reserved.