Is it possible, today, to have a fair-minded engagement with debates over statistical foundations? I’m not sure, but I know it is of pressing importance to try. Increasingly, people are getting serious about methodological reforms—some quite welcome, others quite radical. Too rarely do the reformers bring out the philosophical presuppositions of their criticisms and proposed improvements. Today’s (radical?) reform movements are typically launched from criticisms of statistical significance tests and P-values, so I focus on them. Regular readers know how often the P-value (that most unpopular girl in the class) has made her appearance on this blog. Here I quickly jot down some queries. (Look for later installments and links.) What are some key questions we need to ask to tell what’s true about today’s criticisms of P-values?
I. To get at philosophical underpinnings, the single most important question is this:
(1) Do the debaters distinguish different views of the nature of statistical inference and the roles of probability in learning from data?
Three Roles For Probability: Degrees of Confirmation, Degrees of Long-run Error Rates, Degrees of Well-Testedness
A. Probabilism: To assign degrees of probability, confirmation, support or belief to hypotheses and other claims: absolute[a] (e.g., Bayesian posteriors, confirmation measures) or comparative (likelihood ratios, Bayes factors).
B. Performance (inductive behavior philosophy): To ensure long-run reliability of methods (e.g., the Neyman-Pearson (NP) behavioristic construal; high-throughput screening, false discovery rates).
C. Probativeness (falsification/corroboration philosophy): To determine the warrant of claims by assessing how stringently tested or severely probed they are. (Popper, Peirce, Mayo)
Error Probability Methods: In B and C, unlike A, probability attaches to the methods of testing or estimation. These “methodological probabilities” report on their ability to control the probability of erroneous interpretations of data.
The inferences (some call them “actions”) may take several forms: declare there is/is not evidence for a claim or a solution to a problem; infer there’s grounds to modify a model, etc. Since these inferences go beyond the data, they are inductive and thus, open to error. The methodological probabilities are also called error probabilities. They are defined in terms of the sampling distribution of a statistic.[b]
Some spin-off questions:
(2) Do criticisms of P-values assume probabilism?
We often hear: “There is nothing philosophical about our criticism of statistical significance tests. The problem is that a small P-value is invariably, and erroneously, interpreted as giving a small probability to the null hypothesis that the observed difference is mere chance.” Really? P-values are not intended to be used this way; presupposing they ought to be so interpreted grows out of a specific conception of the role of probability in statistical inference. That conception is philosophical.
a. Probabilism says H is not warranted unless it’s true or probable (or increases probability).
b. Performance says H is not warranted unless it stems from a method with low long-run error.
c. Probativism says H is not warranted unless something (a fair amount) has been done to probe, and rule out, ways we can be wrong about H.
Remark. In order to show that a probabilist reform (in the form of posteriors) is adequate for error statistical goals, it must be shown that a high posterior probability in H corresponds to having done a good job ruling out the ways we can be mistaken about H. In this connection, please see Section IV.
It’s not clear how comparative reports (C is favorable relative to C’) reach inferences about evidence for C.
In this connection be sure to ask: Do advocates of posterior probabilities tell us whether their priors will be conventional (default, reference), and if so, which? Frequentist? Or subjective? (Or are they just technical strategies to estimate parameters, justified on error statistical grounds?)
- A very common criticism is that P-values exaggerate the evidence against the null: A statistically significant difference from H0 can correspond to large posteriors in H0. From the Bayesian perspective, it follows that P-values “exaggerate” the evidence; but the significance testers balk at the fact that the recommended priors result in highly significant results being construed as no evidence against the null—or even evidence for it! Nor will it do to simply match numbers.
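For concreteness, here is a minimal numeric sketch of the disagreement just described. It uses the standard spiked-prior setup (the particular prior is my choice for illustration, not one any specific critic is committed to): prior mass .5 on the point null μ = 0, with μ ~ N(0, 1) otherwise, and σ = 1 known.

```python
import math

def phi(x):
    """Standard normal density."""
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

n = 1000
se = 1 / math.sqrt(n)                 # standard error of the sample mean
xbar = 1.96 * se                      # exactly significant at two-sided p = .05

# Bayes factor for H0 (point null) vs H1 (mu ~ N(0,1)); under H1 the marginal
# distribution of xbar is N(0, 1 + se^2).
bf01 = (phi(xbar / se) / se) / (phi(xbar / math.sqrt(1 + se**2)) / math.sqrt(1 + se**2))
post_h0 = bf01 / (1 + bf01)           # posterior on H0 with prior odds 1
print(f"p = .05, yet P(H0 | data) = {post_h0:.2f}")  # ≈ .82 for n = 1000
```

With n = 1000, a result exactly significant at the .05 level yields a posterior probability of roughly .82 on the null under this prior, which is the sense in which the P-value is said to “exaggerate” the evidence, and equally the sense in which the significance tester balks at calling such a result evidence for the null.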
It’s crucial to be able to say that H is highly believable but poorly tested. Even if you’re a probabilist, you can allow for the distinct tasks of stringent testing and error probing (probativism).
Different philosophies of statistics are fine; but assuming one as grounds for criticizing another leads to question-begging and confusion.
(3) Are critics correctly representing tests?
- Do criticisms of P-values distinguish between simple (or “pure”) significance tests, and Neyman-Pearson (NP) tests and confidence intervals (within a model)? The most confused notion of all (often appropriated for unintended tasks) is that of power. (Search this blog for quite a lot on power.)
- Are criticisms just pointing up well-known fallacies of rejection and non-rejection that good practitioners know to avoid? [i] (e.g., confusing nominal P-values and actual P-values.) Do their criticisms relate to an abusive animal (NHST) that permits moving from a statistical inference to a substantive research hypothesis (as we have seen, at least since Paul Meehl, in psychology)?
- Underlying many criticisms is the presupposition that error probabilities must be misinterpreted to be relevant. This follows from assuming that error probabilities are irrelevant to qualifying particular scientific inferences. In fact, error probabilities have a crucial role in appraising well-testedness, which is very different from appraising believability, plausibility, or confirmation. Looking at hypothetical long-runs serves to understand the properties of the methods for this inference.
Notice: the problems with selective reporting, cherry picking, stopping when the data look good, P-hacking, barn hunting, etc., are not problems about long runs. It’s that we cannot say, about the case at hand, that it has done a good job of avoiding the sources of misinterpreting data.
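To make the “case at hand” point vivid, here is a simulation sketch (mine, with assumed numbers, not from the post) of cherry picking: run twenty independent .05-level tests on data where the null is true, and report only the smallest P-value. The nominal .05 then badly understates the actual probability of an erroneous “effect” report.

```python
import random, math

def one_p(n=30):
    """Two-sided P-value for H0: mu = 0 (sigma = 1 known) on fresh null data."""
    z = sum(random.gauss(0, 1) for _ in range(n)) / math.sqrt(n)
    return math.erfc(abs(z) / math.sqrt(2))

random.seed(7)
trials = 1000
# Cherry picking: 20 independent null tests, report only the best (smallest) p.
hits = sum(min(one_p() for _ in range(20)) < 0.05 for _ in range(trials))
rate = hits / trials
print(f"actual error rate when reporting the best of 20 tests: {rate:.2f}")
# theory: 1 - .95**20 ≈ .64, versus the nominal .05
```

The point is not about a literal long run of future studies; it is that this particular reported “significant” result issues from a procedure with almost no capacity to avoid mistaking noise for effect.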
Loads of background information enters informally at all stages of planning, collecting, modeling and interpreting data. (Please search “background information” on this blog.)
I link to some relevant papers, Mayo and Cox (2006), and Mayo and Spanos (2006).
II. Philosophers are especially skilled at pointing up paradoxes, inconsistencies and ironies [ii]
Paradox of Replication:
Critic: It’s too easy to satisfy standard significance thresholds.
You: Why do replicationists find it so hard to achieve significance thresholds?
Critic: Obviously the initial studies were guilty of p-hacking, cherry-picking, significance seeking, QRPs…
You: So, the replication researchers want methods that pick up on and block these biasing selection effects.
Critic: Actually the “reforms” recommend methods where selection effects and data dredging make no difference!
It’s actually an asset of P-values that they are demonstrably altered by biasing selection effects (hunting, fishing, cherry picking, multiple testing, stopping rules, etc.). Likelihood ratios are not altered. This is formalized in the likelihood principle.
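The contrast can be seen in a small simulation (my sketch, with assumed numbers): under a true null, “peeking” after each observation and stopping at the first nominal .05 rejection inflates the actual type I error well above .05, while the likelihood function for the stopped data, and hence any likelihood ratio, is the same whether or not we peeked.

```python
import random, math

def peeks_to_rejection(n_max=50, alpha=0.05):
    """Test H0: mu = 0 (sigma = 1 known) after each new observation; stop at
    the first nominal rejection. Data are generated under the null."""
    s = 0.0
    for n in range(1, n_max + 1):
        s += random.gauss(0, 1)
        p = math.erfc(abs(s / math.sqrt(n)) / math.sqrt(2))  # two-sided p
        if n >= 2 and p < alpha:
            return True      # 'significance' reached by trying and trying again
    return False

random.seed(1)
trials = 1000
rate = sum(peeks_to_rejection() for _ in range(trials)) / trials
print(f"actual type I error with optional stopping: {rate:.2f}")  # well above .05
```

A single fixed-n test would err only 5% of the time here; it is the stopping rule that the error probabilities register, and that the likelihood ratio, by the likelihood principle, cannot.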
(4) Likelihood principle: Do critics assume inference must satisfy the likelihood principle—(all of the evidential import is in the likelihoods, given the model)? This is at odds with the use of error probabilities of methods.
- Probabilist reforms often recommend replacing tests (and CIs) with likelihood ratios, Bayes factors, HPD intervals, or just lowering the P-value (so that the maximally likely alternative gets a .95 posterior).
The problem is, the same p-hacked hypothesis can occur in Bayes factors; optional stopping can exclude true nulls from HPD intervals. With one big difference: Your direct basis for criticism and possible adjustments has just vanished!
- All traditional probabilisms obey the likelihood principle; violating it, however, (as with conventional priors) doesn’t automatically yield good error control.
Some critics are admirably forthcoming about how the likelihood principle surrenders this basis, something entirely apt under the likelihoodist philosophy [iii]. Take epidemiologist Stephen Goodman:
“Two problems that plague frequentist inference: multiple comparisons and multiple looks, or, as they are more commonly called, data dredging and peeking at the data. The frequentist solution to both problems involves adjusting the P-value…But adjusting the measure of evidence because of considerations that have nothing to do with the data defies scientific sense, belies the claim of ‘objectivity’ that is often made for the P-value”(Goodman 1999, p. 1010).
On the error statistical philosophy, it has a lot to do with the data.
III. Conclusion so far: It’s not that I’m keen to defend many common uses of significance tests (at least not without subsequent assessments of the discrepancies indicated); it’s just that highly influential criticisms are based on serious misunderstandings of the nature and role of these methods; consequently, so are many “reforms”.
How can you be sure the reforms are better if you might be mistaken about the existing methods?
Some statisticians employ several different methods, even within a given inquiry, and so long as the correct interpretations are kept in mind, no difficulty results. In some cases, a vital means toward self-correction and triangulation comes about by examining the data from more than one perspective. For example, simple significance tests are often used in order to test statistical assumptions of models, which may then be modified or fed into subsequent inferences.[v] This reminds me:
(5) Is it consistent to criticize P-values for being based on statistical assumptions, while simple significance tests are the primary method for testing the assumptions of statistical models? (Even some Bayesians will use them for this purpose.)
Quick jotting on a huge topic is never as succinct as intended. Send corrections, comments, and questions. I will likely update this. Here’s a promised update:
(IV) Zeroing in on a key point that the reformers leave unacceptably vague:
One of the major sources of hand-wringing is the charge that P-values are often viewed as posterior probabilities in the null or non-null hypotheses. (A) But what is the evidence of this? (B) And how shall we interpret a legitimate posterior ‘hypothesis probability’?
(A) It will be wondered how I can possibly challenge this. Don’t we hear people saying that when a null of “no effect” or “no increased risk” is rejected at the .05 level, or with P-value .05 or .01, this means there’s “probably an effect” or there’s “probably evidence of an increased risk” or some such thing?
Sure, but if you ask most people how they understand the .05 or .01, you’ll find they mean something more like a methodological probability than a hypothesis probability. They mean something more like:
- this inference was the outgrowth of a reliable procedure, e.g., one that erroneously infers an effect with probability .01.
- 95% or 99% of the time, a smaller observed difference would result if in fact the data were due to expected variability, as described under the null hypothesis.
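The bulleted readings can be checked directly: the .05 attaches to the procedure, as the relative frequency with which it would erroneously infer an effect were the null true. A quick sketch (my illustration, with assumed numbers):

```python
import random, math

def infers_effect(n=30, alpha=0.05):
    """One run of a .05-level two-sided test of H0: mu = 0, sigma = 1 known.
    The data are generated with the null true."""
    z = sum(random.gauss(0, 1) for _ in range(n)) / math.sqrt(n)
    return math.erfc(abs(z) / math.sqrt(2)) < alpha

random.seed(3)
trials = 4000
rate = sum(infers_effect() for _ in range(trials)) / trials
print(f"frequency of erroneous 'effect' inferences under the null: {rate:.3f}")
# ≈ .05: a property of the testing method, not a posterior probability of H0
```

Nothing in this reading assigns a probability to the hypothesis itself; the .05 qualifies the reliability of the inferring method.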
Such “methodological probabilities” are akin to either the “performance” or “probativeness” readings above. They are akin to what many call a “confidence concept” or confidence distribution, or what Popper called corroboration. Don Fraser argues (“Quick and dirty confidence” paper, 2011) that this is the more fundamental notion of probability, and blames Lindley for arbitrarily deciding that whenever a posterior disagreed with a confidence distribution notion, only the former would count. Fraser claims that this was a mistake, but never mind. The important point is that no one has indicated why they’re so sure that the “misinterpreters” of the P-value don’t have the confidence or corroboration (or severe testing) notion in mind.
(B) How shall we interpret a legitimate posterior hypothesis probability?
As often as it’s said that the P-value is not the posterior probability that the null hypothesis is true, critics rarely go on to tell us what the posterior probability would mean, and whether and why it should be wanted. There is an implicit suggestion that there’s a better assessment of evidence out there (offered by a posterior hypothesis probability). What kind of prior? Conventional, subjective, frequentist (empirical Bayes)? Reformers will rarely tell us.
The most prevalent view of a posterior probability is in terms of betting. I don’t think the betting notion is terribly clear, but it seems to be the one people fall back on. So if Ann assigns the null hypothesis a .05 posterior probability, it means she views betting on the truth of H0 as if she’s betting on an event with known probability of .05. She’d post odds accordingly, at least in a hypothetical bet. (If you think another notion works better, please comment.)
Is that really what Ann means when she takes a statistically significant result as evidence of a discrepancy from the null, or as evidence of a genuine risk, non-chance result, or the like?
Perhaps this could be put to empirical test. I’ll bet people would be surprised to find that Ann is more inclined to have something like methodological probability in mind, rather than betting probability.
An important point about English and technical notions:
In English, “a strong warrant for claim H” could well be described as saying H is probable or plausible. Being able to reliably bring about statistically significant effects may well warrant inferring genuine experimental effects. Therefore, using the ordinary English notion of “probable”, P-values (regularly produced) do make it “probable” that the effect is real. [I’m not advocating this usage, only suggesting it makes sense of common understandings.]
We must distinguish the ordinary English notions of probability and likelihood from the technical ones, but my point is that we shouldn’t assume that the English notion of “good evidence for” is captured by a formal posterior probability. Likewise if you ask people what they mean by assigning .95 to a .95 highest probability density (HPD) interval they will generally say something like, this method produces true estimates 95% of the time.
(V) [updated 10/24/15, 10/27/15] The improbability or infrequency with which the pattern of observed differences is “due to chance” is thought to be a posterior probability
The other central reason that people suppose a P-value is misinterpreted as a posterior is a misunderstanding as to what is meant by reporting how infrequently such impressive patterns could be generated by expected chance variability. “Due to chance” is not the best term, but in the context of a strong argument for ruling out “flukes” it’s clear what is meant. Contrary to what many suppose, the null hypothesis does not assert the results are due to chance, but at most entails that the results are due to chance. When there’s a strong argument for inferring one has got hold of a genuine experimental effect, as when it’s been reliably produced by independent instruments with known calibrations, any attempt to state how improbable it is that all these procedures show the effect “by chance” simply does not do justice to the inference. It’s more like denying a Cartesian demon could be responsible for deliberately manipulating well-checked measuring devices just to trick me. In methodological falsification, of which statistical tests are examples, we infer the effect is genuine. That is the inference. (We may then set about to estimate its magnitude.) The inference to a “real (non-chance) effect” is qualified by the error probabilities of the test, but they are not assigned to the inference as a posterior would be. I’ve discussed this at length on this blog, notably in relation to the Higgs discovery and probable flukes. See for example here.
[i] R.A. Fisher was quite clear:
“In order to assert that a natural phenomenon is experimentally demonstrable we need, not an isolated record, but a reliable method of procedure. In relation to the test of significance we may say that a phenomenon is experimentally demonstrable when we know how to conduct an experiment which will rarely fail to give us a statistically significant result.” (Fisher 1947, p. 14)
[ii] Two examples: “Irony and Bad Faith: Deconstructing Bayesians”
“Some Ironies in the Replication Crisis in Social Psychology”
[iii] See “Breaking the Law of Likelihood,” “Breaking the Royall Law of Likelihood,” and “How Likelihoodists Exaggerate Evidence From Statistical Tests”.
[iv] See “P-values can’t be trusted except when used to argue that P-values can’t be trusted”.
[v] David Cox gave an excellent “taxonomy” of tests, years ago.
Notes added after 10/18/15
[a] “Absolute” vs. comparative is a common way to distinguish “straight up” posteriors from comparative measures, but it’s not a very good term. What should we call it? Maclaren notes (in a comment) that Gelman doesn’t fit here, and I agree, insofar as I understand his position. The Bayesian tester or Bayesian “falsificationist” may be better placed under the error statistician umbrella, and he calls himself that (e.g., in Gelman and Shalizi, I think it’s 2013). The inference is neither probabilistic updating nor a Bayes boost/Bayes factor measure.
[b] There may well be Bayesians who fall under the error statistical rubric (e.g., Gelman?). But the recommended reforms and reconciliations, to my knowledge, take the form of probabilisms. An eclecticist like Box, as I understand him, still falls under probabilism, insofar as that is the form of his “estimation”, even though he insists on using frequentist significance tests for developing a model. In fact, Box regards significance tests as the necessary engine for discovery. I thank Maclaren for his comment.
Fisher, R. A. 1947. The Design of Experiments (4th ed.). Edinburgh: Oliver and Boyd.
Goodman, S. 1999. ‘Toward Evidence-Based Medical Statistics. 2: The Bayes Factor,’ Annals of Internal Medicine, 130(12): 1005-1013.
Mayo, D.G. and Cox, D.R. (2006). “Frequentist Statistics as a Theory of Inductive Inference,” in Optimality: The Second Erich L. Lehmann Symposium (ed. J. Rojo), Lecture Notes-Monograph Series, Institute of Mathematical Statistics (IMS), Vol. 49: 77-97.
Mayo, D.G. and Spanos, A. (2006). “Severe Testing as a Basic Concept in a Neyman-Pearson Philosophy of Induction,” British Journal for the Philosophy of Science, 57: 323-357.
Under ‘Probabilism’ you list Bayesian posteriors as an ‘absolute’ measure of support for ‘hypotheses’.
In my view this is your central misunderstanding of people like me (and Andrew Gelman, on my personal reading of him). We have had this conversation many times to no success but here is my view (and, I believe, Gelman’s) again:
Continuous Bayesian posterior distributions are a relative measure of support for parameters within a mathematical model structure.
I hope you can have a “fair-minded engagement” with this view.