There’s something about “Principle 2” in the ASA document on p-values that I couldn’t address in my brief commentary, but is worth examining more closely.
2. P-values do not measure (a) the probability that the studied hypothesis is true, or (b) the probability that the data were produced by random chance alone.
(a) is true, but what about (b)? That’s what I’m going to focus on, because I think it is often misunderstood. It was discussed earlier on this blog in relation to the Higgs experiments and deconstructing “the probability the results are ‘statistical flukes'”. So let’s examine:
2(b) P-values do not measure the probability that the data were produced by random chance alone.
We assume here that the p-value is not invalidated by either biasing selection effects or violated statistical model assumptions.
The basis for 2(b) is the denial of a claim we may call claim (1):
Claim (1): A small p-value indicates it’s improbable that the results are due to chance alone as described in H0.
Principle 2(b) asserts that claim (1) is false. Let’s look more closely at the different things that might be meant in teaching or asserting (1). How can we explain the common assertion of claim (1)? Say there is a one-sided test: H0: μ = 0 vs. H1: μ > 0. (Or, we could have H0: μ ≤ 0.)
Explanation #1: A person asserting claim (1) is using an informal notion of probability that is common in English. They mean a small p-value gives grounds (or is evidence) that H1: μ > 0. Under this reading there is no fallacy.
Comment: If H1 has passed a stringent test, a standard principle of inference is to infer H1 is warranted. An informal notion of:
“So probably” H1
is merely qualifying the grounds upon which we assert evidence for H1. When a method’s error probabilities are used to qualify the grounds on which we assert the result of using the method, it is not to assign a posterior probability to a hypothesis. It is important not to confuse informal notions of probability and likelihood in English with technical, formal ones.
Explanation #2: A person asserting claim (1) is interpreting the p-value as a posterior probability of null hypothesis H0 based on a prior probability distribution: p = Pr(H0 | x). Under this reading there is a fallacy.
Comment: Unless the p-value tester has explicitly introduced a prior, this would be a most ungenerous interpretation of what is meant. Given that significance testing is part of a methodology directed at providing statistical inference methods whose validity does not depend on a prior probability distribution, it would be implausible to think a teacher of significance tests would mean that a Bayesian posterior is warranted. Moreover, since a formal posterior probability assigned to a hypothesis doesn’t signal that H1 has been well tested (as opposed to, say, its being strongly believed), it seems an odd construal of what a tester means in asserting (1). The informal construal in explanation #1 is far more plausible.
A third explanation further illuminates why some assume this fallacious reading is intended.
Explanation #3: A person asserting claim (1) intends an ordinary error probability. Letting d(X) be the test statistic:
Pr(Test T produces d(X) > d(x); H0) ≤ p.
(Note the definition of the p-value in my comment on the ASA statement.)
Notice: H0 does not say the observed results are due to chance. It is just H0: μ = 0. H0 entails that the observed results are due to chance, but that is different. Under this reading there is no fallacy.
Comment: R.A. Fisher was clear that we need, not isolated significant results, but “a reliable method of procedure” (see my commentary). We may suppose the tester follows Fisher and that test T consists of a pattern of statistically significant results indicating the effect. The probability that we’d be able to generate {d(X) > d(x)} in these experiments, in a world described by H0, is very low (p). Equivalently:
Pr(Test T produces P-value < p; H0) = p
The probability that test T generates such impressively small p-values under the assumption they are due to chance alone is very small: p. Equivalently, a universe adequately described by H0 would produce such impressively small p-values only 100p% of the time. Or, yet another way:
Pr(Test T would not regularly produce such statistically significant results; were we in a world where H0 is adequate) = 1-p.
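To make the error-probability reading in Explanation #3 concrete, here is a minimal simulation sketch (my illustration, not part of the original post), assuming the simple one-sided Normal test with known sigma. It checks that a process adequately described by H0 yields p-values below any threshold p only about 100p% of the time:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, sigma, n_trials = 25, 1.0, 100_000
threshold = 0.05  # any small p works; the claim is Pr(P < p; H0) = p

pvals = np.empty(n_trials)
for i in range(n_trials):
    x = rng.normal(loc=0.0, scale=sigma, size=n)  # a world adequately described by H0: mu = 0
    d = np.sqrt(n) * x.mean() / sigma             # test statistic d(X)
    pvals[i] = stats.norm.sf(d)                   # one-sided p-value: Pr(d(X) > d(x); H0)

print("Pr(Test T produces P-value < 0.05; H0) is approximately", round((pvals < threshold).mean(), 3))
# prints roughly 0.05: a process described by H0 generates p-values below p about 100p% of the
# time, which is the ordinary error-probability reading of the "due to chance" phrase.
```

The same check works for any valid test with a continuous test statistic, since the p-value is computed from the sampling distribution under H0.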
Severity and the detachment of inferences
Admittedly, the move to inferring evidence of a non-chance discrepancy requires an additional principle of evidence that I have been calling the severity principle (SEV). Perhaps the weakest form applies to a statistical rejection or falsification of the null:
Data x0 from a test T provide evidence for rejecting H0 (just) to the extent that H0 would (very probably) have survived, were it a reasonably adequate description of the process generating the data (with respect to the question).
It is also captured by a general frequentist principle of evidence (FEV) (Mayo and Cox 2010), a variant on the general idea of severity (SEV) (EGEK 1996, Mayo and Spanos 2006, etc.).
The severity principle, put more generally:
Data from a test T (generally understood as a group of individual tests) provide good evidence for inferring H (just) to the extent that H passes severely with x0, i.e., to the extent that H would (very probably) not have survived the test so well were H false.
Here H would be the rather weak claim of some discrepancy, but specific discrepancy sizes can (and should) be evaluated by the same means.
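For concreteness, here is a hedged sketch (my numbers are assumptions, not from the post) of how specific discrepancy sizes might be evaluated in the simplest one-sided Normal test H0: μ = 0 vs. H1: μ > 0 with known σ, using the Mayo and Spanos (2006) severity assessment SEV(μ > μ1) = Pr(d(X) ≤ d(x0); μ = μ1):

```python
import numpy as np
from scipy import stats

# Illustrative assumptions (not from the post): n = 100, sigma = 2, observed mean xbar = 0.4,
# so d(x0) = sqrt(n) * (xbar - 0) / sigma = 2.0, a statistically significant result (p about 0.023).
n, sigma, xbar = 100, 2.0, 0.4
d_obs = np.sqrt(n) * xbar / sigma
print("d(x0) =", d_obs, "  p-value =", round(stats.norm.sf(d_obs), 3))

for mu1 in (0.0, 0.1, 0.2, 0.3, 0.4):
    sev = stats.norm.cdf(np.sqrt(n) * (xbar - mu1) / sigma)  # SEV(mu > mu1) = Pr(d(X) <= d(x0); mu1)
    print(f"SEV(mu > {mu1:.1f}) = {sev:.3f}")
# Severity is high for the weak claim of some positive discrepancy (mu > 0, mu > 0.1)
# but low for mu > 0.4: only the smaller discrepancies are well warranted by x0.
```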
Conclusion. The only explanation under which claim (1) is a fallacy is the non-generous explanation #2. Thus, I would restrict principle 2 to 2(a). That said, I’m not claiming 2(b) is the ideal way to construe p-values. In fact, without being explicit about the additional principle that permits linking to the inference (the principle I call severity), it is open to equivocation. I’m just saying it’s typically meant as an ordinary error probability [2].
Souvenir: Don’t merely repeat what you hear about statistical methods (from any side) but, rather, think it through yourself.
Comments are welcome.[1]
Mayo, D. G. and Cox, D. R. (2006), “Frequentist Statistics as a Theory of Inductive Inference,” in Optimality: The Second Erich L. Lehmann Symposium, ed. J. Rojo, Lecture Notes-Monograph Series, Institute of Mathematical Statistics (IMS), Vol. 49: 77-97.
Mayo, D. G. and Spanos, A. (2006), “Severe Testing as a Basic Concept in a Neyman–Pearson Philosophy of Induction,” British Journal for the Philosophy of Science 57(2): 323–57.
My comment, “Don’t throw out the error control baby with the bad statistics bathwater,” is #17 under the supplementary materials:
[1] I have this old Monopoly game from my father that contains metal pieces like this top hat. There’s also a racing car, a thimble and more.
[2] The error probabilities come from the sampling distribution and are often said to be “hypothetical”. I see no need to repeat “hypothetical” in alluding to error probabilities.
It seems to me that the error probability interpretation of explanation 3 is a violation of the common meaning of English. *Maybe* they mean an error probability, but if I ask “What is the probability that the symptoms are due to heart disease?” I’m asking a straightforward question about the probability that the symptoms are caused by an actual case of heart disease, not the probability that I would see the symptoms assuming I had heart disease. Likewise, the sentence in question seems to ask what the probability is that random variation is the only factor in play, and it seems difficult to interpret that as an error probability. Also — and this is anecdotal — I’ve talked to several people who were confused about principle 2 who could not define a p-value properly.
I guess I’m not saying that there isn’t anyone who would use principle 2’s wording as shorthand for an error probability; it just doesn’t seem common to me, or a good reading of the sentence.
As for explanation 1, I guess since “small” and “improbable” are left undefined — and indeed, for the explanation to work, “probable” needs to be informal — you’re right that it is hard to argue with it as fallacious per se. But it doesn’t seem much comfort to argue that a statement is not fallacious because its meaning is fairly unconstrained.
(On top of that, it seems to imply that the p-value is what determines the “improbability”, apart from sample size.)
Richard: Just on your first point: this reading is in sync with the arguments we use to make causal inferences based on triangulating coincidences. Part of the problem may arise in supposing we’re dealing with a conditional probability when in fact we’re saying: what’s the probability we’d consistently be able to bring about such stat sig results under the supposition they were all due to chance error (or the like)? We do need the additional principle to get from that premise to something like “there’s good evidence the effect is real” (severity). We don’t assign a probability to the effect being genuine. We infer the effect is genuine, but with a qualification. The inference is qualified by the properties of the measurement tools. Take my weighing example where several well-calibrated scales show a 1 lb gain, and EGEK is a known 1 pound book. What’s the prob all these scales show weight gain due to experimental artifacts? It isn’t merely that the answer is practically 0, it’s that you’d have to posit mind-reading scales, or the like, to explain how they behave properly when the weight happens to be known, but conspire against me when the weight is unknown.
Richard: I want to come back to your first comment:
You wrote:
if I ask “What is the probability that the symptoms are due to heart disease?” I’m asking a straightforward question about the probability that the symptoms are caused by an actual case of heart disease, not the probability that I would see the symptoms assuming I had heart disease.
My reply: Stop right after the first comma, before “not the probability.” The real question of interest is: are these symptoms caused by heart disease (not the probability they are).
In setting out to answer the question suppose you found that it’s quite common to have even worse symptoms due to indigestion and no heart disease. This indicates it’s readily explainable without invoking heart disease. That is, in setting out to answer a non-probabilistic question* you frame it as of a general type, and start asking how often would this kind of thing be expected under various scenarios. You appeal to probabilistic considerations to answer your non-probabilistic question, and when you amass enough info, you answer it. Non-probabilistically.
Abstract from the specifics of this question which brings in many other considerations.
*You know the answers are open to error, but that doesn’t make the actual question probabilistic.
I’m a bit disappointed in this post! You appear to be defending something that is just indefensible, namely the idea that it’s somehow ok to say that P-values *do* measure “the probability that the data were produced by random chance alone”. Your defense of interpreting it as “informal language” is belied by the word “measure”, and even informally, it is quite fallacious to infer that if any particular person wins the lottery then “probably” it was not fairly played. On the other hand, your interpretation #3 amounts to defending an incorrect verbal translation of a mathematically correct statement.
Indeed, my main beef with the rejected language is that, “uncharitable” or not, your interpretation #2 is the only one that actually fits with the words being used. So why waste any breath defending them?
The problem with those words can of course be expressed in quite informal terms by pointing out (as I think anyone can understand, without any technical understanding of the word “probability” or of any jargon about posterior and prior probabilities) that “the probability that the data were produced by random chance alone” is different from “the probability that those particular data would have been produced by random chance alone”. Most people will agree that when a die comes up five (for which the probability of happening by chance is just 1/6), that does not make it unlikely that the die was fair and gives one no reason to believe that there is a low probability that the five actually did just happen by chance.
alQpr: I forgot to address your point that “it is quite fallacious to infer that if any particular person wins the lottery then ‘probably’ it was not fairly played”.
We don’t argue from the fact that Pr(win | fair lottery) = low that “probably” the game’s not fair (although it’s worth the likelihoodist noting that an alternative explanation such as “all the tickets have my number” or the like is better supported than that the game is fair). To infer “the game wasn’t fairly played” on the basis of x (there is a winner) would be to make an inference with 0 severity (in this case, that’s the same as a maximal type 1 error). That’s because there will always be a winner (let’s suppose), and thus you are guaranteed to infer erroneously that the lottery is unfair. Mere improbable events do not indicate real effects and are, for that reason, of little interest in science.
alQpr: First, don’t confuse 2(a) with 2(b). I concur with 2(a). Second, it’s not an argument to declare I’m defending something indefensible. You may be new to the site; we’ve discussed this point before. If you follow the Higgs posts, you can see a specific example: What’s the probability we wouldn’t produce such bumps at independent detectors under the assumption they are due to background alone? Answer: extremely high. Bumps that are spurious or flukes disappear with high probability under the assumption of Ho. (I revised this last sentence a bit to be clearer.)
It seems to me that any account of inference that outlaws one of the most powerful arguments in science ought to think again. That most powerful form of argument is often called an argument from coincidence: if we repeatedly produce an effect using several independent, but well understood, experimental tools, deliberately triangulating to ensure that at least one is overwhelmingly likely to detect a flaw if present, then we have evidence of a genuine effect. We don’t even need to say yet what its source is, but it’s a first step. Now significance testers, since they often don’t obey Fisher, may only have a weak argument from coincidence, but that’s irrelevant to my point. A strong argument from coincidence is an essential pattern of argument in science. And the inference is not a posterior probability from a prior; it’s that we’ve got evidence of a genuine effect. We qualify by pointing to the strong argument from coincidence, i.e., to the fact that it’s overwhelmingly improbable we’d continue to produce these results at will if we hadn’t got hold of a genuine effect.
It seems to me like ‘arguments from coincidence’ are a special case of ‘arguments from invariance’, which do indeed have a long history in science, especially physics but also mathematics and the causal inference literature. Why not focus on the invariance aspect? That would also seem to fit best with Fisher’s work (a lot of it seems to have a strong geometric and group theoretic flavour).
Om: Write down your argument from invariance. By the way, I admit arguments from coincidence have other names, like “the ultimate argument”.
An invariant is an object or property of an object that is unchanged under certain classes of transformations. In general these are taken in mathematics and physics to represent ‘intrinsic’ or ‘coordinate independent’ properties – the ‘essence’ of the object under study. The ‘object’ could (and often is) the functional form of a relationship itself (see our ‘structuralist’ discussions).
The ‘invariance’ approach to formulating theories is to try to find statements of the form ‘given this range of circumstances (transformations of a reference object), then these conclusions/these relationships hold’. Lorentz invariance is a famous example from relativity. I’m trying to think of a physical principle not stated as an invariance principle.
The argument from coincidence is, as you say,
‘we repeatedly produce an effect using several independent, but well understood, experimental tools, deliberately triangulating to insure that at least 1 is overwhelmingly likely to detect a flaw if present, then we have evidence of a genuine effect’
This could alternatively be seen as something like
‘given this broad range of experimental tools covering circumstances which only differ in details which we would like to be arbitrary, we repeatedly observe the same thing (effect/relationship etc). We therefore have evidence that this object/relationship/effect is invariant to/independent of arbitrary details of the type considered’.
You also say
‘And the inference is not a posterior probability from a prior, it’s that we’ve got evidence of a genuine effect.’
I’d say that it’s more often that a particular functional relationship holds. E.g. (Einstein) ‘physical laws should take the same mathematical form in all coordinate systems’. Or, more down to earth, the average weight in samples from a population of interest always increases with average height to the power of a number in the interval [2.5,3.5]. (Or something).
Om: As you start out noting: “An invariant is an object or property of an object that is unchanged under certain classes of transformations.” Fine. It is rather a stretch to extend the idea of an argument from coincidence to something about an object. There’s also the fact that “invariance” is used differently in statistics. I’m not saying you couldn’t struggle to extend the metaphor to do the work of an argument from coincidence, but I don’t think it’s a good idea. Your tendency to extend metaphors shows imagination, and I’m not knocking it, but I really think it’s a different animal here. That doesn’t mean I’m disagreeing with your idea of theorizing by seeking invariant properties.
Obviously it’s hard to go beyond metaphors in a blog comment.
From a philosophical point of view, Woodward, Pearl, Glymour and Dawid have all advanced similar views, and related these to statistical contexts. A lot of Bayesian inference is related to invariance assumptions (e.g. Jaynes). I don’t think it’s a stretch to start from the basic concepts that have been most successful in science, rather than (say) p-values.
I find classical statistics an awkward fit to a lot of science, even though some of it could be reformulated in a more compatible way. What’s the classic example? The world is round (p < 0.05).
[BTW I still owe Laurie some working/code/plots from some discussion we’ve been having on regularisation etc. If you would like more mathematics then I can send you the examples too.
Words and metaphors are generally a poor substitute for mathematics but are often the only way to communicate across disciplines. E.g. I don’t really know what your statement ‘It is rather a stretch to extend the idea of an argument from coincidence to something about an object’ means. Is this argument not attempting to say something *about* something? E.g. infer that *‘an effect’* is genuine?]
Om: I certainly never advocate starting from p-values; what’s been most successful in science are arguments from coincidence and stringent tests. I know Woodward’s and Glymour’s views rather well; they’ve been here and in conferences with me at least 8 times. Glymour’s own view of bootstrapping captures a weak severity requirement. It’s one thing to have a notion of causality in terms of invariance, quite another to characterize an argument from coincidence to a cause. None of that group except Dawid are Bayesians, but I really don’t see where that enters here. No clue where you’re going with your comment, Oliver, especially with the final reference to the power analyst Jacob Cohen.
A genuine effect according to your argument is essentially a stable effect. Stability is best formulated in terms of invariance wrt perturbations/transformations (in my view).
I prefer to think in terms of structural relationships as opposed to ‘effects’, as we’ve discussed. Stable structural relationships are more related to the causal inference folk than classical stats, hence the off-hand mention.
I would replace an argument from coincidence about an effect with an argument to stable structural relationships. You seem to find this unusual or an awkward fit; I don’t. Incidentally, this fits with the question of how to define a replication and how much the circumstances can vary to count.
Anyway, best to leave the verbal arguments at this point as they’re not getting us anywhere further.
Om: I’m not describing the goals, e.g., stability, causal knowledge, manipulation, generalizability, etc., but what kinds of strategies can be developed in order to warrant inferences to claims that satisfy those goals. Goals are easy; how to attain them is not.
Om: I’m distinguishing the procedure of testing or argument to a claim C, from the nature of claim C.
I want to say something about your first sentence, that for me a genuine effect is a “stable effect”. Not as that’s typically understood. I’m taking Fisher more literally: it’s a matter of “knowing how” to produce results that rarely fail to be statistically significant (or the analogous in informal cases). Their stability needn’t be “out there”, so to speak; instead we “rein it in”, or rein something in, even if that something is entirely engineered by dint of our clever, ingenious methods and probes. These probes include both modeling and instruments. We get enough “push back” from the world (as modelled) to make it difficult to generate stat sig results “at will” if we’re far off from capturing a real effect. Get my drift?
Yah, I think I do. And I think I mean something similar in the sense of a structuralist and/or operational view of modelling (e.g. the effect is an invariant of our procedures), but understand that others might mean something else.
Thanks for taking the time to reply. I certainly had nothing to say about 2a, and my only concern was with your objection to 2b. In that context, my declaration that the wording rejected in 2b is “indefensible” was not intended as an argument but as a statement of the position that I intended to argue for (which I tried to indicate by my use of “it appears that..”). But getting to the real point, what I think you miss about 2b is that it is *not* of the same form as your (correct) reference to “the probability we *would* find such bumps at independent detectors under the assumption they are due to background alone” (where I have expanded and emphasized the contraction). What the statement pattern rejected in 2b refers to would be “the probability that we *did* find such bumps at independent detectors due to background alone”. That probability (if we can find a way to give it a well defined meaning at all) is probably very small, but I will be surprised if you don’t agree that it is *not* the same as the p-value. [In fact, given that we did see the bumps, isn’t it just basically P(noHiggs), whereas the p-value is P(bumps|noHiggs)?]
al: The last line: no, not the same as P(no Higgs). And I don’t think the “did” vs “would” statements differ, at least as understood by an error statistician. We always ask about the hypothetical “would be”.
Let me try to illustrate what I see as the difference between “did” and “would” with a simpler example. Say I toss a standard die that I am fairly certain from previous testing is a fair one, and that I have no particular skill in manipulating the outcome of a toss. When I next toss the die, the probability that it *will* show a five is 1/6. When I last tossed it, there was a 1/6 probability that it *would* show a five. But after the toss has been observed it does not make sense for someone who knows the outcome to think of the probability that it *did* show a five as other than 1 or 0. And even if it did show a five right after someone had “called” for that outcome, I would still be almost certain that the die was fair and so would assign high value (certainly much more than 1/6) to the “probability” that it *did* show the five by pure chance (ie that future testing would not actually reveal any bias). And I have put scare quotes on “probability” there because some might deny that probability is applicable to such a proposition at all.
But I agree that the Higgs case is more complicated and (cf. the Exercise statement #2) if the “probability that the bump *did* occur by chance” means anything, it’s just the revised P(Ho) given that the bump was observed, or P(Ho|bump) if conditional notation is appropriate in such a context, which I guess is different from (less than) P(noHiggs) because it’s P(noHiggs && noAnythingElse).
No, not a posterior in no-Higgs, whatever that means, but P(Test yields > d bumps; background alone). For a frequentist, again, results of applying methods are always treated generally, whether we actually know the data they did produce or not. We still ask: what’s the prob it would have… We have one outcome, but the sampling distribution speaks of hypothetical outcomes.
Yes, I know that the p-value is P(Test yields > d bumps; background alone), or less formally “the prob it *would* have…”. But the language the ASA is objecting to in 2b says *were*. And I think the reason for their objection to that wording is the clear possibility that it could be interpreted as I (and I think also Carlos) have done. The fact that it could (in my opinion at a stretch!) be interpreted more “generously” does not make it appropriate. The appropriate language for describing scientific results is something that is as clear and accurate as possible, not something that requires a “generous” interpretation in order to be considered correct.
and what’s wrong with “were” again?
As Carlos has said perhaps more clearly than I, the expression “the probability that the data were produced by random chance alone” encourages the non-statistician (or, IMO, anyone who reads plain English correctly rather than “generously”) to hear it as referring to the probability of the means by which the data were produced (what you call a “posterior”?) rather than to the probability of them being produced. To take a non-statistical example, a statement “that the house was painted by me” (especially when we know that the house was actually painted) is naturally seen as referring more to the question of who painted the house than to how I was spending my time.
I do think though that you have shown quite clearly that “P-values do not measure…the probability that the data were produced by random chance alone” is itself capable of misinterpretation, and so should have been better stated as “P-values should not be described as measuring…the probability that the data were produced by random chance alone”.
alQpr: ‘it is quite fallacious to infer that if any particular person wins the lottery then “probably” it was not fairly played.’
I don’t think this works; typically, we would have the knowledge that millions of people played, and *someone* won. That is often a high probability event under the hypothesis that the lottery is fair. If, on the other hand, you knew that only one person played, and they won (in spite of the one in 100 million chance of winning), that would clearly be another thing entirely.
Richard: But the real issue is that we do not go from an improbable event to a posterior, nor even to a genuine effect. There is a meta-level assertion, as it were:
Pr(test T produces P-values < p; Ho) = p.
This is an ordinary statement about p-values, or the associated Type 1 error probability. For the moment, the issue is just the legitimacy of this claim, which is the bedrock of sampling theory. (To repeat, we assume the p-value is not invalidated by selection effects or violated assumptions, those are not the issue here.) SEV provides the linkage to (something on the order of):
So the data indicate a genuine effect.
This parallels an "argument from coincidence" to a real effect (Whewell, Peirce, Popper, Musgrave, Hacking, Cartwright, Chalmers, Mayo) in entirely non-formal contexts.
SATURDAY NIGHT EXERCISE:
When I discussed this issue in relation to the Higgs researchers, we took a look at a post wherein David Spiegelhalter gave “thumbs up or thumbs down” to a number of different reports on the 5 sigma result corresponding to a p-value of 1 in 3 million or 1 in 3.5 million. Without looking it up, which of the following do you think he gave a thumbs up to, and which a thumbs down?
1.
“CMS observes an excess of events at a mass of approximately 125 GeV with a statistical significance of five standard deviations (5 sigma) above background expectations. The probability of the background alone fluctuating up by this amount or more is about one in three million.”
[And therefore the observed excess of events are not due to background alone.]
2. “A statistical combination of these channels and others puts the significance of the signal at 5 sigma, meaning that only one experiment in three million would see an apparent signal this strong in a universe without a Higgs [i.e., where Ho is correct or adequate].”
[And therefore the observed signals are not merely “apparent” but are genuine. ]
3. “They claimed that by combining two data sets, they had attained a confidence level just at the “five-sigma” point – about a one-in-3.5 million chance that the signal they see would appear if there were no Higgs particle [i.e., where Ho is adequate].”
[And so we infer “the signal they see” wasn’t due to background alone, it indicates a genuine effect.]
Re the Exercise: I don’t know David Spiegelhalter so can’t guess what he would say, and I’m going to ignore the question of whether or not the terms “statistical significance” and “confidence level” are being used correctly, but I’d give #1 a thumbs up [with the proviso that the parenthesis should include the word “probably” (but *not* identify the complement of that probability with the 1 in 3 million – which would be the ASA 2(b) error)]
I’d give both versions of #2 a thumb at best sideways because, although the null hypothesis of “background” can fairly be interpreted as “in a standard-model-type theory with no particle at the indicated mass”, there could be other causes of the bumps – including completely different theories or just other particles with similar but not identical properties to the Higgs (which is related to the ASA 2(a) error).
Answers to the Saturday night exercise: Spiegelhalter gives them ALL thumbs up!
http://understandinguncertainty.org/explaining-5-sigma-higgs-how-well-did-they-do
1. Here p = 1 in 3 million
“The probability of the background alone fluctuating up by this amount or more is about one in three million.”
So, thumbs up to:
1. The p-value = probability that a process adequately described by Ho fluctuates up by this amount or more, i.e., produces {D > d}.
2. “A statistical combination of these channels and others puts the significance of the signal at 5 sigma, meaning that only one experiment in three million would see an apparent signal this strong in a universe without a Higgs.”
So, thumbs up to:
2. The p-value = the relative frequency of an apparent signal this strong in a universe adequately described by Ho.
3. Here p = 1 in 3.5 million.
“about a one-in-3.5 million chance that the signal they see would appear if there were no Higgs particle.”
So, thumbs up to
3. The p-value = the chance that the signal they see would appear if Ho is correct.
The “signal” refers to events {D > d}.
In each case, there’s an implicit principle (severity) which leads to the following inferences:
1. And therefore the observed excess of events are not due to background
2. And therefore the observed signals are not merely “apparent” but are genuine. Or, and therefore, the observed signals indicate they were not produced by a universe without a Higgs (with given properties).
3. And therefore we infer “the signal they see” wasn’t due to background alone, it indicates a genuine effect (with such and such properties).
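As a quick numerical cross-check of the figures quoted above (a sketch of mine, not part of the exchange), the one-sided Normal tail area beyond 5 standard deviations is what yields the quoted “one in 3.5 million”:

```python
from scipy import stats

p = stats.norm.sf(5)    # Pr(d(X) > 5; Ho): the one-sided tail area beyond 5 sigma
print(p, round(1 / p))  # about 2.87e-07, i.e. roughly 1 in 3.5 million
```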
I think that in the context of the two sentences preceding “principle 2” [1] and the two sentences explaining it [2] there is not much room to misunderstand what the ASA document means by “the probability that the data were produced by random chance alone.”
[1] “The smaller the p-value, the greater the statistical incompatibility of the data with the null hypothesis, if the underlying assumptions used to calculate the p-value hold. This incompatibility can be interpreted as casting doubt on or providing evidence against the null hypothesis or the underlying assumptions.”
[2] “Researchers often wish to turn a p-value into a statement about the truth of a null hypothesis, or about the probability that random chance produced the observed data. The p-value is neither. It is a statement about data in relation to a specified hypothetical explanation, and is not a statement about the explanation itself.”
Carlos: So you’re saying it’s obvious they only meant 2(a)? OK. The trouble, though, is that I can point you to scads of “gotcha moments” wherein (what Wasserman calls) the P-value police charge that assertions of form 2(b) misinterpret the P-value. That’s why 2(b) is in the ASA doc to begin with.
Now you might be saying, “right, because whenever certain people hear 2(b) they interpret it as a posterior”. Yet many, many scientists do not mean a posterior by 2(b), they mean either explanation #1 or #3. So the difficulty is that there’s a problem with assuming 2(b) must mean 2(a) when in fact it is intended as a perfectly acceptable claim.
I find the two clauses that you call (a) and (b) essentially equivalent. The formulation could be more precise, but my understanding is that the ASA document wants to stress that p-values do not measure the probability that the studied hypothesis (the stated null hypothesis) is true, which is to say p-values do not measure the probability that the data were produced by random chance alone (it seems to me another way of saying that the null hypothesis is true and the data-generating process is just noise without signal).
I agree that “the probability that the data were produced by random chance alone” could also be interpreted as “the probability that the data would have been produced by random chance conditional on the null hypothesis being true.” But I think in this case it’s clear that it should be interpreted as a probability (what you call a posterior) and not as a conditional probability. Maybe in other contexts 2(b) is a perfectly acceptable claim using your alternative interpretations and it has been unjustly attacked by the p-value police.
Carlos: Please reread what I wrote. Ho does not assert that the data in the experiment are due to chance. (I’m not fussing with the “due to chance” metaphor here; I’m allowing that.) The null hypothesis may ENTAIL that the results from such and such an experiment are “due to chance”, but that’s different.
Remember, too, that in the significance testing school, experimental results or data are always regarded as of a type. It is recognized that other outcomes could have occurred. Take a look at the “thumbs up or down” exercise in one of my comments.
I don’t understand your objection and how it relates to what I wrote. If H0 is true, the data is due to chance (and not to a deviation from H0). Saying that the data is due to chance alone is saying the H0 is indeed true. How do you interpret otherwise the statement “the data is due to chance alone”?
In any case, I think most non-statisticians who hear “the probability that the data were produced by random chance alone” understand it as the probability that “the data were produced by random chance alone” being true, not the probability that “the data were produced” being true provided it was “by random chance alone.”
Similarly as “the probability that MH17 was shot down by a missile” is a statement about the cause of that particular incident and not about the plane-destructing capabilities of missiles.
Hypotheses don’t speak about all the possible tests that might be employed to generate data for their appraisal. But really, one needn’t get into acrobatics of language to parse an ordinary error probability. Take a look at the thumbs up or down statements, if you haven’t already.
(I tried to send this comment several times yesterday, but it seems it was getting blocked somewhere…)
I have no issues with the quoted statements about the Higgs discovery (but the conclusions in brackets seem too strong to me: the fact that the results are highly unlikely to be produced by the background alone does not completely rule out that possibility, etc.).
Regarding ASA’s principle 2, however, I think you need to engage in some serious language acrobatics to parse “the probability that the data were produced by random chance alone” as something different than the probability of “the data were produced by random chance alone” being true.
Carlos: That’s exactly what’s meant, and it’s corroborated by the well-known claims about “the probability that the observed signals were ‘flukes’”. Now examine the thumbs up and down examples by a Bayesian critic. I need to remind people of the thumbs down examples, but you can trace the Higgs link. Sorry I’m rushing.
I’m sorry, I don’t understand what’s meant by whom according to your reply. Just to be clear, my comment is about what the ASA Board of Directors means when they state their principle 2: “P-values do not measure the probability that the studied hypothesis is true, or the probability that the data were produced by random chance alone.”
I think “the probability that the data were produced by random chance alone” means the probability that “the data were produced by random chance alone” is true. I think so because it’s consistent with the first half of the sentence (this reiteration seems more likely to me than mixing two unrelated propositions), because it’s consistent with the explanation they give in the following paragraph (“the probability that random chance produced the observed data”) and because it’s indeed correct to say that “P-values do not measure [that probability]”.
Another reason to think they do not mean the conditional probability (“the probability that the data were produced” provided that they were produced “by random chance alone”) is that it is the definition of the p-value, something that I think they would notice, and they already addressed what the p-value actually is in their first principle. Principle 2 is about what the p-value is not.
What the ASA is saying is not that what those who said 2(b) “meant” is wrong, but that what they *said* is wrong (or perhaps “not even wrong”). And it would be the wrong language to use even if it actually expressed a correct statement – because, correct or not, it can easily be read as meaning something that is wrong.
The underpinnings of the “p-value” are simple, and it’s fairly easy to understand them when you apply them to some simple case you can visualize. I always try to go back to this when reading some of these discussions, which can get very hard to understand. Viz:
You have a production line that fabricates metal objects, tens of thousands of them per day, and you measure their dimensions so that you can reject the items that are out of tolerance. Assuming their length distribution is normal, about 95% will be within +/- 2 standard deviations of the mean. You adjust the production process every few days so that the mean equals the design length. Of course, this equality can’t be exact, since it is subject to measurement and sampling errors.
OK, we can visualize this easily, or we can mentally visualize the area of the tails of a normal distribution, which is really equivalent. In the production situation, in addition to rejecting individual parts, you could ask whether the mean length has changed since last week. Nothing esoteric there, again easy to visualize. With tens of thousands of parts, it’s easy to see how we could get good precision from the data. Of course, we’d probably think that if the mean today were more than 2 standard errors away from the design value, something would have changed. If tolerances were in thousandths of an inch and standard errors say 10 or 100 times smaller, it’s easy to see that you could get usable, reliable results.
We would not bother talking about the probability of a hypothesis that the mean length had changed. We would, instead, take action if the measured sample mean shifted by, say, more than 3 sigma. We would not take action if the day’s mean had shifted by less than 1 s.e. We would have some policy to deal with the in-between cases.
The basis of p-values, which basically involves using the area under parts of the probability distribution, is analogous to the above, and for fairly obvious reasons.
I think that where people come to grief is when there aren’t clear-cut populations like the above, perhaps even one-off cases, and they want to support some hypothesis rather than simply to identify cases where some effect may (or may not) be different from some other effect. They don’t have large populations of identically distributed items, they don’t have something readily visualized, they may, without realizing it, be comparing point to range estimates, etc. That’s when you get into wordy, abstruse, hard-to-understand arguments, start talking about the “probability of a hypothesis”, and generally become prone to erroneous thinking.
Let’s get back to basics, it’s easier to straighten out the thinking process!
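To keep Tom’s production-line picture concrete, here is an illustrative sketch with assumed numbers (mine, not his): daily sample means are compared with the design length in standard-error units, which is the same tail-area reasoning a p-value rests on:

```python
import numpy as np

rng = np.random.default_rng(1)
design_length, sd, n_parts = 10.000, 0.001, 20_000  # inches; assumed values for illustration
se = sd / np.sqrt(n_parts)                          # standard error of a day's sample mean

# Three hypothetical days: two in control, one with a tiny real shift in the process mean.
for day, true_mean in enumerate([10.000, 10.000, 10.00003], start=1):
    lengths = rng.normal(true_mean, sd, n_parts)
    shift_se = (lengths.mean() - design_length) / se
    if abs(shift_se) > 3:
        action = "take action"
    elif abs(shift_se) < 1:
        action = "no action"
    else:
        action = "apply the in-between policy"
    print(f"day {day}: mean shifted {shift_se:+.2f} standard errors -> {action}")
```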
Tom: Thanks for your comment. I haven’t come to grief in understanding p-values; I’m trying to explain the legitimacy of Pr(P < p; Ho) = p: the probability that test T reliably brings about statistically significant effects due to chance alone is p. This is questioned or denied in the ASA doc, principle 2(b), as I’ve numbered it in my post.
The kind of situation you invoke is fine, and if you think about the inference method itself as a general procedure, then it's one that can regularly be invoked (even without literal acceptance sampling).
Mayo, yes, I wasn’t really commenting on what you wrote so much as on the ongoing discussions/controversies arising from p-value use/misuse – it seems to be all over the place on the right blogs. I wanted to contrast the complexity of much of what we read with the basic simplicity from which p-values arise… and to remind people how they can be more concrete than they are usually treated.
Tom: Sure, and I wanted you to see that the kind of context you rightly find illuminating is actually the typical case, insofar as we consider the test procedure of a general type.
Tom: Just to be clear, as I reread people’s comments, I do agree with what you wrote.
Thumbs up for 1.
“CMS observes an excess of events at a mass of approximately 125 GeV with a statistical significance of five standard deviations (5 sigma) above background expectations. The probability of the background alone fluctuating up by this amount or more is about one in three million.”
***
Reason:
by analogy, the first sentence is: we found an event with payoff x in set A.
and sentence 2 is: if the event in question with payoff x is not in set A, then the chance of finding an event with payoff x in set A is less than 1/(3mil)”
Kind regards,
@alxpr1c3
Oy!
Mayo, it seems to me that you are excluding the main reason for saying that a P-value should not be interpreted as the probability that the data were obtained by chance alone with your statement that assumptions are OK: “We assume here that the p-value is not invalidated by either biasing selection effects or violated statistical model assumptions.”
If the P-value is the product of a planned comparison where the experimental design was appropriate and the planned design was followed, then your discussion of 2b seems correct to me. However, in the ASA meeting that produced the document in question we were considering the fact that there are many, many occasions where your assumption about assumptions is false. Where the P-value is a product of data-inspired testing or experimental design then it most definitely does not give the probability that the data were due to chance alone.
It does still index the evidence against the null hypothesised value of the parameter of interest. See my commentary on the ASA statement for a slightly more extensive treatment of this topic. File number 13 on this page: http://amstat.tandfonline.com/doi/suppl/10.1080/00031305.2016.1154108
Michael: Absolutely true that if it’s an invalid p-value then there’s no point in discussing whether that invalid p-value warrants the claims that a VALID p-value warrants. However, the classic criticism in 2 is not considered to be pointing up violated assumptions; it’s considered to be highlighting an invalid interpretation. 2(a) is invalid even with the reported p-value being actual. If we were to focus just on cases where cheating occurs, it would be rather pointless to use any statistics.
So I say that you are incorrect to suppose that the problem intended to be captured by principle 2 is that actual p-values may differ from reported p-values because of cherry-picking and other biasing selection effects. That point is raised elsewhere.
In the case of the CERN Higgs boson and LIGO gravitational wave detections, they both claim to have ruled out ALL other explanations. Whether that is true or not is up to the physicists, but the point is they claim either they detected the thing they were looking for or it was a spurious correlation.
To repeat, according to them there is NO other plausible explanation that needs to be considered. This seems to involve a rather large amount of hubris from my perspective, but it is not my impression that the group-think amongst physicists is strong enough to avoid someone publishing an alternative explanation (if it existed).
Granted, there’s quite a lot of background physics. But that goes beyond the statistical inference. (I distinguish by layers.) I happen to know more about LIGO and the huge background knowledge of interferometers. Talk about arguments from coincidence! They readily produce null cancellations at will. So they’ve got a real effect, and it’s of the relativistic gravity sort (no shaking due to passing trucks, since they both pick it up). They then have to rely on the fact that all viable (non-falsified) gravity theories concur on given things. This is a total outsider’s quick take.
This comment by Anoneuoid might be a good opportunity to make explicit some non-trivial confusion (that I think exists) about the difference between a theory and a formalism. First I try to explain the relevance of this distinction, then some questions follow about the assumptions that validate p-values; sorry in advance for the length:
Anoneuoid wrote: “they both claim to have ruled out ALL other explanations.”
Yes, this is possible because physicists use formalised logic (mathematical models of reality) to build and evaluate theories instead of personalised logic (interpretations of reality by individual scientists affiliated with prestigious institutes). The explanatory claims about the phenomena of interest of entire classes of theories can be formally analysed using logic as a tool e.g. using a Boolean algebra to study ontological and epistemological properties of theories, such as holism or supervenience.
In the case of Higgs and Gravitational Waves, the plausible competing theories are well known and likely highly corroborated. Varieties of theories have to compete to explain the phenomena described by Quantum and General Relativity formalisms respectively. A formalism is a set of axioms and postulates that define a domain in reality, an arena in which theories are the gladiators. So it is likely the “ALL other explanations” refers to the derivation of analytical or computational solutions of the formal models for many different parameter settings that specify different preparation, measurement procedures and initial conditions in a hypothetical experiment.
The point is, the possible outcomes, their implications and the conditions that imply an anomaly was observed in LHC or any other large physics experiment are more or less known in advance, because theories can accurately “simulate” the outcomes of hypothetical experiments. Or: The theories produce testable predictions that are precise and empirically accurate.
Theories in social science predict regularities of the form: cov(Y, X) > 0 (i.e., the sign of a correlation), whereas in physics “theory and data speak more for themselves” (Fanelli, 2010). QED predicts an anomalous electron dipole moment (which is anomalous to the prediction by the Dirac equation) and it agrees with the actual measurement outcome to more than 10 significant figures (https://en.wikipedia.org/wiki/Precision_tests_of_QED). This does however not mean that the Dirac equation is “falsified”, the limits of its explanatory domain have been identified.
So, now for the questions:
1. When a psychologist and a physicist report an observation at 5 sigma from the center of a distribution predicted by a null model, the meaning is equivalent. However, the epistemic weight of the predicted sign of a correlation and a number representing an actual measurement outcome of 10 significant figures is anything but equal, even if the theories concern completely different domains in reality.
>> Should a quantifiable notion of epistemic weight be included in hypothesis testing? Or perhaps, should standardisation of deviance between model and observations be considered invalid for variables that do not fluctuate on an intrinsic physical scale? I assume that from the perspective of error statistics there is no principle that prescribes standardisation.
2. About: “We assume here that the p-value is not invalidated by either biasing selection effects or violated statistical model assumptions.”
The selection bias can be addressed by design in most cases, but the second part implies that at least one interpretation of the fact that p-values can be invalidated by violations of statistical model assumptions is that the nature of the process that generated the data is such that it falls outside of the domain in reality defined by some version of the probability axioms that describe the probabilities of random events.
In general any part or whole of a (bio)physical system that violates the ergodic theorems will meet this requirement: the time averaged behaviour of the process will not be the same as its space averaged behaviour. This would disqualify the use of p-values for the scientific study of virtually all living organisms displaying some form of adaptation of behaviour to previously experienced events.
>> If such is the case, shouldn’t the ASA have provided a huge disclaimer defining the subset of phenomena for which the use of p-value testing as a means of inference is valid?
3. Should the different forms of inference, Bayesian, Frequentist, N-P and what not be considered theories in competition within the same arena defined by the probability axioms (the phenomena in reality whose properties and behaviour may be described by the probabilities of random events/fields)? OR: Should Bayesian inference be considered to operate within a different formalism from the others, because it describes a profoundly different aspect of reality?
>> If the first is true then Bayes and the others are likely in competition to provide the most veridical inferences and their diverging predictions should be tested. If the second is true, then it is futile to compare the two, or to discuss which is better, than the other. They will be appropriate for inference on different types of phenomena and these differences should be made explicit.
I have many answers to these questions but not yet any favourites; perhaps they have been resolved, which I’d be happy to learn.
Fred: Great to have this interesting comment from you, it could be a post of its own. I do think that something illuminating (for me) is coming out of this review of the “due to chance” business that I’ve been mulling over for a long time. It comes in my later comments of the thumbs up/down claims. I think it’s important.
Fred:
“If the first is true then Bayes and the others are likely in competition to provide the most veridical inferences and their diverging predictions should be tested. ”
What statistical inference account shall we use to test them severely?
I don’t have any problem with the notion of “the common meaning in English” here. In fact, I suspect that Fisher meant it the same way.
Where things get more delicate, I think, is that the only real way to think of this situation from a frequentist point of view is, ironically, to think of a prior on H0. I certainly don’t mean this as a subjective prior — which is where my objection to Bayesian methods comes from — but a prior none the less. (Priors are used by frequentists all the time, e.g. in linear discriminant analysis.)
To make matters concrete, say I am a pollster, and I always work in two-candidate elections, with my hypothesis being H0: p = 0.50, where p is the true population proportion of people who intend to vote for my client. As I go through my career, conducting many, many polls, there will in the long run be a “distribution” on p. E.g., in the long run, some proportion q of all my polls will be in settings in which p < 0.68, say. That is a prior on p.
There are various issues here. For people who like testing (I do NOT), our prior would need to have an atom at 0.50, some nonzero mass. Though Deborah has excluded it here, for the sake of argument, I would renew the point I made on my own blog recently that composite hypotheses don't make sense, in view of measurement bias — in the poll settings, some people lie, for instance, or they accidentally give an answer opposite to what they had intended. But let's put that aside here.
In such a setting, one can indeed mathematically talk about "our observed data being due to random chance," AND I would contend that this is what people really mean (unconsciously) in "common English."
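For readers who want to see Norm’s point in numbers, here is a hedged sketch in which every quantity is an assumption of mine: over a long run of two-candidate polls in which a fraction q are exact ties (the atom at 0.50), the long-run share of significant results that came from true ties is well defined, and it is not the significance threshold:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n_polls, n_resp, q = 200_000, 1000, 0.8  # assumed: 80% of races are exact ties (atom at p = 0.50)

tie = rng.random(n_polls) < q
true_p = np.where(tie, 0.50, 0.50 + rng.uniform(0.01, 0.05, n_polls))  # small real leads otherwise

votes = rng.binomial(n_resp, true_p)
z = (votes / n_resp - 0.50) / np.sqrt(0.25 / n_resp)  # test of H0: p = 0.50 vs H1: p > 0.50
significant = stats.norm.sf(z) < 0.05

print("Share of significant polls where H0 was exactly true:", round(tie[significant].mean(), 2))
# Well above 0.05 in this long-run setup: the p-value threshold is not
# "the probability that the data were produced by random chance alone".
```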
Norm: Empirical priors are fine, but scientists don’t put a frequentist prior on a theory or law in these contexts, yet their “thumbs up” assertions are thumbs up. Maybe it’s the fact that people don’t get what’s meant by a frequentist error probability that makes this claim seem odd. I allowed that it wasn’t the best possible way to state the results, but together with the severity principle it’s exactly how we reason all the time. I’m not sure how you like to interpret your confidence levels, Norm–pure performance?
Norm: I noticed something weird in your comment; I’ll have to come back to it.
A palindrome for “pi” day, sent by a previous winner of our palindrome contest
(One day I will be able to get this comment through!)
I have no issues with the quoted statements about the Higgs discovery (but the conclusions in brackets seem too strong to me: the fact that the results are highly unlikely to be produced by the background alone does not completely rule out that possibility, etc.).
Regarding ASA’s principle 2, however, I think you need to engage in some serious language acrobatics to parse “the probability that the data were produced by random chance alone” as something different than the probability of “the data were produced by random chance alone” being true.
It’s not logically impossible, but if you demand that, there will be no inductive learning in science. I aim to capture the reasoning and explain when and why it’s warranted. Sorry about difficulties posting; a little glitch on Elba, I think–lots of lightning here, and I have to take a ferry to get to the island to change the settings.
I do not demand anything but I think those statements would be improved with the addition of some qualifying language to indicate that these conclusions are not necessarily true. If it was implied I missed it, hence my comment about them seeming “too strong” to me.
ok, I’m going to try just one more time. I think everyone here understands correctly what p-values are, so we don’t need to keep schooling one another on that score. The question is about how to describe them in colloquial language – and in particular about whether the ASA claim 2b that “P-values do not measure.. the probability that the data were produced by random chance alone” is appropriate.
I think the ASA is correct because the language they reject strongly (and in my opinion *very* strongly) appears to be claiming that the p-value is a “posterior” – which makes it *very* frustrating that every time I say this, Mayo responds with “No, not a posterior” as if I were too stupid to know what a fucking p-value really is!!! (And no, I don’t expect this comment to pass moderation. But *please* read it and respond to what I am saying rather than pretending that I am advocating what I explicitly reject.)
AlQpr: Oh my! We lost electricity here on Elba today–lightning–so I hadn’t seen this. The only person to use the F word on this blog was in something I reblogged by Math Babe, but it was kind of cool within the context of her particular post. Yours is just puzzling/gratuitous, because I replied to your comments, but you’re new to this blog.*
I’m not really looking for colloquial language (in explanation #3, which I take it is of main interest) but looking to explain how a frequentist error statistician (and lots of scientists) understands:
Pr(Test T produces d(X)>d(x); Ho) ≤ p.
You say “the probability that the data were produced by random chance alone” is tantamount to assigning a posterior probability to Ho (based on a prior), and I say it is intended to refer to an ordinary error probability. The reason it matters isn’t whether 2(b) is an ideal way to phrase the type 1 error probability or the attained significance level; I admit it isn’t ideal. But the supposition that it’s a posterior leaves one in the very difficult position of defending murky distinctions, as you’ll see in my next thumbs up and down comment.
You see, for an error statistician, the probability of a test result is virtually always construed in terms of the HYPOTHETICAL frequency with which such results WOULD occur, computed UNDER the assumption of one or another hypothesized claim about the data generation. Those are the three key words.
Any result is viewed as being of a general type, if it is to have any non-trivial probability for a frequentist.
Aside from the importance of the words HYPOTHETICAL and WOULD, there is the word UNDER.
Computing the probability of {d(X) > d(x)} UNDER a hypothesis (here, Ho) is not computing a conditional probability.** This may not matter very much, but I do think it makes it difficult for some to grasp the correct meaning of the intended error probability.
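A minimal simulation sketch (a toy normal testing setup, not anything from the exchange) of reading Pr(d(X) > d(x); Ho) as a HYPOTHETICAL relative frequency computed UNDER Ho:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# One observed sample (toy data) and its test statistic d(x) for Ho: mu = 0.
n, sigma = 25, 1.0
x_obs = rng.normal(0.45, sigma, size=n)
d_obs = np.sqrt(n) * x_obs.mean() / sigma

# HYPOTHETICAL frequency with which d(X) WOULD exceed d(x), computed UNDER Ho.
reps = rng.normal(0.0, sigma, size=(200_000, n))
d_rep = np.sqrt(n) * reps.mean(axis=1) / sigma
print("simulated Pr(d(X) > d(x); Ho):", np.mean(d_rep > d_obs))
print("exact one-sided p-value:     ", stats.norm.sf(d_obs))
```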
OK, well try your hand at my next little quiz.
*I hope I don’t have to put you on the list of disrupters to this blog.
**See double misunderstandings about p-values https://normaldeviate.wordpress.com/2013/03/14/double-misunderstandings-about-p-values/
Thumbs up or down? Assume the p-value of relevance is 1 in 3 million or 1 in 3.5 million. (Hint: there are 2 previous comments of mine in this post of relevance.)
1. only one experiment in three million would see an apparent signal this strong in a universe [where Ho is adequate].
2. the likelihood that their signal was a result of a chance fluctuation was less than one chance in 3.5 million
3. The probability of the background alone fluctuating up by this amount or more is about one in three million.
4. there is only a 1 in 3.5 million chance the signal isn’t real.
5. the likelihood that their signal would result by a chance fluctuation was less than one chance in 3.5 million
6. one in 3.5 million is the likelihood of finding a false positive—a fluke produced by random statistical fluctuation
7. there’s about a one-in-3.5 million chance that the signal they see would appear if there were [Ho adequate].
8. it is 99.99997 per cent likely to be genuine rather than a fluke.
They use “likelihood” when they should say “probability”, but we let that go.
Thanks again for responding – especially given my resort to inappropriate language.
I am almost certain that we agree about the actual facts of the subject except in so far as I am mistaken or am not familiar with technical terms (as may become apparent when I try your next quiz). But what I am deeply frustrated about is my failure to convey exactly what I mean to someone who is probably in many ways more capable than I. I could have been wrong in my comments, but your responses don’t convince me of that because I agree with every word of them. They just don’t address what I thought I was saying.
For example, let me revisit this from your last response:
‘You say “the probability that the data were produced by random chance alone” is tantamount to assigning a posterior probability to Ho (based on a prior), and I say it is intended to refer to an ordinary error probability.’
My point was that while I agree with you about what was *intended* (both as to its correctness and as to the fact that it really was what was intended) I also agree with the ASA that the statement quoted is an incorrect expression of that intent. You agree that it “isn’t ideal” but I think it’s far worse than that because it creates the impression of being a posterior even though it is not intended to be one. That may not be important for an audience that knows what should be intended, but for stuff that is going out to the general public I think it is a disaster.
It’s not that what was intended is wrong – or even that the statement couldn’t be more “generously” interpreted as an (almost) correct expression of that intent. Coming from a first year undergrad I would only give it a very small penalty if any, but I *would* identify it as unacceptable from a professional researcher – especially in the context of any communication with the non-expert public.
I don’t want to be a disruptor on your blog but will continue pressing this issue as long as you let me – either here or via private communication if you will allow that – because I think it is very important for public statements about research results to be both as correct and as clear as humanly possible. And so I want to persuade you to endorse the ASA position that the language rejected in 2b is indeed unacceptable.
PS Actually I’m not so new here. I have been lurking with the very occasional comment for a couple of years – ever since you once responded favourably to something I posted about a discussion over at Briggs’ place.
On the new quiz:
1. If “where Ho is adequate” is equivalent to “under Ho” then I’d give this a thumbs up if it had included the word “expect” somewhere. As is, maybe 60 degrees up (I’d accept “this strong” as meaning “this strong or stronger”).
2. down
3. up
4. down
5. 60 degrees up (“their signal” is too specific – which makes the statement not false but trivial. But I’m not going to object to the use of “likelihood” in place of “probability” in any of these.)
6. up
7. same as 5
8. down
60 comments (now, 61), with no one disagreeing on what a p value is, yet still disagreement on whether certain statements are fallacious. Of course a statement isn’t, by itself, fallacious; what is fallacious is the meaning of the statement, and Mayo is saying that it is possible that the statements that some would find fallacious have a meaning that is non-fallacious. This seems possible. I guess the only way to *know* whether a particular statement had an intended meaning that is fallacious is to ask.
The disagreement seems to come where others (including myself) read a statement and say that the meaning of the statement seems fallacious. There’s more to a statement than whether it is possible to be read in a non-fallacious way; we also have to consider whether it is likely to be misunderstood by a reader.
I was recently co-teaching a stats course with another researcher and I was going over their questions for the exam. One of the questions was something like “…You perform a t test on the data, which yields p=.03. What is the probability that there is an effect?” I asked “What do you think the right answer to this question is?” and they replied “.97”. I asked why, and they said “Because the probability that the results were due to chance is .03.”
Now, had she merely said the latter statement, Mayo would not regard that as (necessarily) fallacious. Fine, *if* the person making the claim has the intent of communicating a typical error probability. But many of the “p value police” are interested not just in whether a writer understands the statement, but also in what readers are likely to understand from a statement.
If I render 2b as “p = Pr(the data were produced by random chance alone)” then one can ask, what is the complement of this probability? It seems like the most reasonable complement is “1-p = Pr(the data were NOT produced by random chance alone)”. This was the mistake that my fellow researcher was making, and it relies on the most obvious reading of 2b. That is, people are likely to understand wording like that which 2b warns about in a fallacious way (I’m, of course, assuming that the utterer understands that “the data” means “a test statistic at least as extreme”).
If instead, I say “p = Pr(a test statistic more extreme, were H0 true)” then we have “were” used in the subjunctive mood (appropriate, given the counterfactual nature of the p-value) and it is obvious that the complement of p is “1-p = Pr(a test statistic less extreme, were H0 true)”. The fallacious reading is blocked. This is why “was/were due to chance” is no good; it does not adequately communicate the counterfactual nature of p.
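A small sketch, using a toy one-sided normal test (nothing here is from the Higgs analysis), of the point about complements: both p and 1 – p remain probabilities computed under Ho:

```python
from scipy import stats

# Toy one-sided test of Ho: mu = 0 with an observed z statistic of 2.17.
z_obs = 2.17
p = stats.norm.sf(z_obs)   # Pr(Z >= z_obs; Ho)

# The legitimate complement is still a statement computed UNDER Ho:
print("p     = Pr(Z >= z_obs; Ho) =", round(p, 3))
print("1 - p = Pr(Z <  z_obs; Ho) =", round(1 - p, 3))
# Neither quantity is Pr(Ho | data) or Pr(there is an effect | data);
# those would require a prior over hypotheses.
```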
All that said, I guess I will answer your 8 “thumbs up/down” statements as “In my dialect, does the most obvious reading of the statement communicate a fallacy?” (your previous 3 from Spiegelhalter all seemed fine to me)
1. Up (assuming some reasonable definition of “adequate”).
2. Down
3. Up
4. Down
5. Maybe (I can see the argument for rendering “their signal” as “a signal this strong [or stronger]”)
6. Up
7. Maybe (I can see the argument for rendering “the signal they see” as “a signal this strong [or stronger]”)
8. Down
Richard has zeroed in on the crucial issue! But I can’t comment until tonight.
Here is my take on 1. I assume that the probability one in three million is calculated in a model. That is, it is not an empirical probability based on thousands of millions of repetitions. I am not sure what the model is, not being acquainted with the problem, but I will assume that it is a normal model N(mu,sigma^2) for some mu and sigma. Is this model part of the physical theory or is it an empirical model, perhaps ‘justified’ by the central limit theorem? How accurate is the model? One way of measuring this is the Kuiper distance between the N(mu,sigma^2) model and the empirical distribution. The claim one in three million is a claim about the world but calculated in the model. For this to be reasonably accurate the Kuiper distance will have to be of the order O(1/(3.5*10^6)). This accuracy requires a sample size of about 10^13. But with this sample size and this accuracy one would have seen millions of events with probability one in three million. Why is this particular event so important? If there were not millions of such events then the accuracy of the approximation must be less than one in three million. If this is so the model is being applied with a precision not warranted by the data. Perhaps a t-distribution with 50 degrees of freedom is consistent with the data. Under this adequate model the probability increases from one in three million to 1 in 250 thousand. Does this matter? The model is applied under background assumptions: all instruments working correctly, no faulty connections, nobody coughing etc. As I understand him Oliver has emphasized this point. If the background assumptions do not hold the data may for example contain outliers. Do the data contain the occasional outlier and what do the physicists do with such data sets? Throw them away? This list could be continued but it is sufficient for the present purpose. The bottom line is that P-values are calculated in a model. The generally accepted meaning of P-values, however, seems to be about the world: there is or is not an effect; this deviation is just noise. My interpretation is that a P-value is a measure of the inadequacy of the model for the data. Why the model is not adequate is then a matter for discussion, whereby one possible explanation could indeed be the presence of an effect. But such a discussion requires saying exactly what is meant by adequate. Without it I see no basis for a serious discussion.
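A quick numerical check of the kind of sensitivity described above, assuming scipy and treating the normal and t(50) models as the two candidates (a toy comparison, not the physicists’ calculation):

```python
from scipy import stats

z = 5.0
p_norm = stats.norm.sf(z)      # one-sided tail area under a normal model
p_t50 = stats.t.sf(z, df=50)   # the same statistic under a t model with 50 df

print(f"normal tail at 5 sigma: {p_norm:.2e}  (about 1 in {1 / p_norm:,.0f})")
print(f"t(50) tail at 5:        {p_t50:.2e}  (about 1 in {1 / p_t50:,.0f})")
# The heavier-tailed model gives a tail probability roughly an order of
# magnitude larger, in line with the contrast drawn above.
```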
I think I agree with everything Laurie has said here, at least in spirit.
Perhaps relatedly, a comment based on reading some of the invited comments. In particular, Philip Stark (#20) in ‘The value of p-values’ (in which he largely defends the use of p-values) lists a few points of disagreement with the main statement. For example
“The statement draws a distinction between “the null hypothesis” and “the underlying assumptions” under which the p-value is calculated. But the null hypothesis *is* the complete set of assumptions under which the p-value is calculated”
I think this is a fair point. Unfortunately, as Laurie often points out, we have to specify what exactly these assumptions are and when they are to be considered met.
For example, one might say “well I’m assuming nobody coughs, I wear my lucky red shirt, there are no outliers, I make n independent measurements…[i.e. *everything* that was true when the experiment was carried out]… then the data are normally distributed and the p-value is 0.05”.
This might seem pedantic – why include all these? But by neglecting them you’re implicitly assuming that they make no difference. Even besides the fact that the p-value is itself a random variable, changing the model to one that might be indistinguishable by the truncated list of ‘relevant’ assumptions (e.g. keep ‘no outliers’, drop ‘my shirt is red’?) can lead to very different p-values.
What is the formal principle that allows you to drop all these extra background variables from the list of assumptions of the null hypothesis? Do you consider severity to do this job?
(and I would of course include ‘adequacy’ or ‘look like’ assumptions in this list – eg a description of how you compare simulated and real data – as well as a list of all the mathematical properties. Are all moments of the distribution to be considered accurate? What if I truncated after the Nth? etc).
Clearly we need to consider how we ‘draw the boundaries’ and what implications that has for ‘within the model’ conclusions. I don’t think these can be relegated to a side issue.
omaclaren, as you imply, it is not possible to include all of the real-world considerations such as coughs and red shirts in the statistical model used to calculate a P-value. It is also undesirable to do so.
We have to keep noting that the statistical aids to human judgement are always, inevitably and unavoidably model-based. Where we suspect the model might ignore important features of the real world we need to consider the results carefully before making inferences.
I prefer to keep statistical evidence separate from inference. Doing so makes space in the process for thoughtful integration of evidence and real-world considerations.
Hi Michael, I largely agree with you too, despite also largely agreeing with Laurie and you two seeming to disagree. At least in the sense of working within models and expanding the hierarchy of models as we suspect things might be relevant or we want to consider a broader invariance class, etc. The interesting thing, though, is that conclusions valid within a restricted model – e.g. my predictions are invariant with respect to the colour of my shirt – can dramatically change for some seemingly small model perturbations. I elaborate more below.
What I take from all this is that the phrase “the probability that the data were produced by random chance alone” is irretrievably ambiguous and should be avoided. Some will understand it as p(H0), the probability that the data are being generated by the null model. Others will understand it as p(x|Ho), the probability that the null model, if true, might produce a test statistic at least as large as that observed.
David: But Ho does not assert the data are being generated by the null model. To say there’s no effect, no deflection, no increase, or whatever Ho asserts, is not to talk about specific experiments. We don’t have a different null for each experiment that might be performed on a given effect or claim. Nor do scientific theories talk about specific tests of them. At most Ho ENTAILS something (statistical) about your particular experiment, but that is different. I’ll explain later why we can’t just cut out talk of the probability of a test producing such impressive results under the supposition of Ho.
3 years ago today: Double misunderstandings about p-values on this blog:
https://errorstatistics.com/2013/03/15/normal-deviate-double-misunderstandings-about-p-values/
Readers: Slight distraction this evening from the so-called super Tues primary results in U.S., but I’m rereading your comments, & will comment tomorrow.
Mayo: I had thought that 9_Greenland_Senn_Rothman_Carlin_Poole_Goodman_Altman.pdf in the supplement had nicely covered the concerns Richard, Laurie, David and others raised.
I have added, in square brackets, to one of their summary comments what I thought they were getting across in the paper –
from page 19: “That the statistical model [a representation that needs to adequately capture all the important aspects of reality for the study in hand] used to obtain the results is correct”.
You say “whatever Ho asserts is not to talk about specific experiments” but how are you escaping from the fallibility of representations of reality when trying to get reality right?
Keith O’Rourke
Phan: I realize people have different conceptions of what to consider the model, and what to consider a hypothesis about the model, here a parameter. So Ho: theta = 0 is a claim about a parameter in the model, which would include the distribution. I follow Spanos’ way of distinguishing the “specification” problem from the “estimation” problem, as Fisher would have called it. But I don’t think this issue turns on one’s preference for breaking down the components of a model and its hypotheses.
Mayo: I would agree that we have a Humpty-Dumpty issue of “when I use a word, I mean it to mean what I want it to mean,” as did the commentators on the ASA statement (e.g. 20_Stark.pdf seems to mean by “H0 is true” that all aspects of the model specification are true/adequate, whereas 9_Greenland_Senn_Rothman_Carlin_Poole_Goodman_Altman.pdf do not).
I do find the “which would include” here very strange “So Ho: theta = 0, is a claim about a parameter in the model which would include the distribution.”
But if you wish to follow Spanos’ way of distinguishing the “specification” problem from the “estimation” problem, then the problems Oliver, Laurie, others and I are raising simply fall into estimation – which is always critical in applications. And the ASA statement is about applications and the pragmatic grade of meaning to be taken from the p_value.
The estimation challenge in applications with respect to p_values is to obtain a uniform distribution under theta = 0 under (conceptual) replication of the actual data generating process. If that is not the case, who cares whether this distribution is uniform under a model specification that is misaligned with how the data were generated.
For instance, in a non-randomized study, if one cannot remove essentially all confounding in the estimation of an effect, it will be misaligned, and the p_value won’t mean anything (i.e. the true error rates cannot be estimated).
Keith O’Rourke
Keith, some of them are covered, in particular the background circumstances, but there remain what are for me considerable differences. I treat all models as approximations and thus the probability that a model is true is zero. The authors operate in the ‘behave as if true’ mode but there are sentences where they seem to accept models as approximations. They refer to assumptions being ‘satisfied’ or ‘verified’ although it is simply impossible to determine whether the data are in truth random. All you can do is to state that they have certain properties enjoyed by random data. Deterministic chaos is deterministic but may well look random; ‘random’ number generators, for example. In misconception 1 we read ‘the P-value assumes that the test hypothesis is true’ but later ‘The P-value simply indicates the degree to which the data conform to the pattern predicted by the test hypothesis and all the other assumptions used in the test (the underlying statistical model).’ The former is in the ‘behave as if true’ mode; the latter can be read as a P-value being a measure of how well the model approximates the data. Their interpretation of confidence intervals is in the ‘behave as if true’ mode, namely the frequency with which the interval contains the true value in a long sequence of repeats. In general they operate in the ‘behave as if true’ mode although there are certain statements which can be read as measures of approximation. I take it that their use of the word ‘model’ is the more or less standard one in this area of statistics, namely a parametric family of single probability measures. The P-value depends on the model under which it is calculated. Although such a model is not true it may be adequate. The P-value therefore depends on your definition of adequacy, and although such a definition depends on the model, one would have thought an attempt could be made for the case of a Gaussian model. One could define a reasonable practice (check-list) for deciding on the adequacy of a Gaussian model. No doubt there will be no agreement on this but the alternative is ‘if it looks Gaussian to me, then Gaussian it is’. There is a great reluctance to specify when a model is adequate. If I write a contribution on likelihood there is a response; all contributions on adequacy or P-values are simply ignored. The same happens on Andrew Gelman’s blog. Why this reluctance? As a concrete example I mentioned Mayo’s ‘1. only one experiment in three million would see an apparent signal this strong in a universe [where Ho is adequate]’. How adequate does Ho have to be for this claim? I suggested a Kolmogorov distance of this order, which requires a sample size of 10^13. There was no response.
Laurie: I think these issues get mixed together with the interpretive ones I’m keen to get at here, as important as they are.
Well I replied to say I agree with you…
I’m happy to give probability statements within a model structure/family – i.e. I simulate parameter values, plug them in as realisations of the parametric family, simulate data under each parameter value, and accept the parameter if the simulated data match the real data for some explicit acceptance criteria. This gives a new distribution of accepted parameters which can be compared to the original parameter simulation distribution to see how much the data have ruled out some values. Probability statements are just related to the relative rates of acceptance.
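A minimal sketch of the simulate-and-accept procedure described above, with a made-up model family and acceptance criterion:

```python
import numpy as np

rng = np.random.default_rng(2)

# Pretend these are the real data; the "model family" is N(theta, 1).
x_obs = rng.normal(0.3, 1.0, size=30)

def acceptable(sim, obs, tol=0.1):
    # Explicit (toy) acceptance criterion: simulated and real data match
    # in mean and spread to within tol.
    return abs(sim.mean() - obs.mean()) < tol and abs(sim.std() - obs.std()) < tol

# Simulate parameter values, simulate data under each, keep the ones that "match".
theta = rng.normal(0.0, 1.0, size=20_000)
accepted = np.array([th for th in theta
                     if acceptable(rng.normal(th, 1.0, size=x_obs.size), x_obs)])

# Comparing the accepted-parameter distribution with the original simulation
# distribution shows how much the data have ruled out.
print("acceptance rate:", len(accepted) / len(theta))
print("original mean/sd: 0.00 / 1.00   accepted mean/sd:",
      f"{accepted.mean():.2f} / {accepted.std():.2f}")
```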
These are within a model family probability statements and will cease to make sense if the model is taken too literally, as you point out.
I could do the same for another model family. To compare different model families to each other I would have to find some functional or whatever that maps the parameter distributions found from each to some common property. Eg number of modes, whatever. These can be considered the desired invariants when transforming between model classes. I think this is consistent with your view.
These transformations are probably only well defined for a subset of possible model classes so we again face a higher-level ‘working within a family of model families’ issue. The hope is that continuing this hierarchical procedure leads to some sort of locally, but not too locally, valid approximation. I keep it local to allow for the possibility that this process may lead to bifurcations where a previously neglected variable becomes relevant near some critical point, a la phase transitions. Scientific revolutions and all that.
Oliver, yes you did, but in my brain you were in the complement of the set I was addressing.
THUMBS UP OR DOWN ACCORDING TO THE P-VALUE POLICE
1. only one experiment in three million would see an apparent signal this strong in a universe [where Ho adequately describes the process].
up
2. the likelihood that their signal was a result of a chance fluctuation was less than one chance in 3.5 million
down
3. The probability of the background alone fluctuating up by this amount or more is about one in three million.
up
4. there is only a 1 in 3.5 million chance the signal isn’t real.
down
5. the likelihood that their signal would result by a chance fluctuation was less than one chance in 3.5 million
up
6. one in 3.5 million is the likelihood of finding a false positive—a fluke produced by random statistical fluctuation
down (or at least “not so good”)
7. there’s about a one-in-3.5 million chance that the signal they see would appear if there were no genuine effect [Ho adequate].
up
8. it is 99.99997 per cent likely to be genuine rather than a fluke.
down
I find #3 as a thumbs up especially interesting.
The real lesson, as I see it, is that even the thumbs up statements are not quite complete in themselves, in the sense that they need to go hand in hand with the INFERENCES I listed in an earlier comment, and repeat below. These incomplete statements are error probability statements, and they serve to justify or qualify the inferences which are not probability assignments.
In each case, there’s an implicit principle (severity) which leads to inferences which can be couched in various ways such as:
Thus, the results (i.e.,the ability to generate d(X) > d(x)) indicate(s):
1. the observed signals are not merely “apparent” but are genuine.
3. the observed excess of events are not due to background
5. “their signal” wasn’t (due to) a chance fluctuation.
7. “the signal they see” wasn’t the result of a process as described by Ho.
If you’re a probabilist (as I use that term), and assume that statistical inference must take the form of a posterior probability*, then unless you’re meticulous about the “was/would” distinction you may fall into the erroneous complement that Richard Morey aptly describes. So I agree with what he says about the concerns. But the error statistical inferences are 1,3,5,7 along with the corresponding error statistical qualification.
For this issue, please put aside the special considerations involved in the Higgs case. Also put to one side, for this exercise at least, the approximations of the models. If we’re trying to make sense out of the actual work statistical tools can perform, and the actual reasoning that’s operative and why, we are already allowing the rough and ready nature of scientific inference. It wouldn’t be interesting to block understanding of what may be learned from rough and ready tools by noting their approximative nature–as important as that is.
*I also include likelihoodists under “probabilists”.
OK, so is your issue with #6 “false positive” or “finding”? I was interpreting this under the oft-used definition of the p value as “the smallest alpha under which one would reject the null” – that is, “in a similarly designed experiment in which the null is true [as implied by *false* positive] the probability of a smaller p is X”.
If one interprets “the likelihood of finding a false positive” as “the likelihood of *having found* a false positive” then it is clearly wrong; but since it said “finding” it seems not so bad.
Richard and everyone: The thumbs up/downs weren’t mine!!! They are Spiegelhalter’s!
http://understandinguncertainty.org/explaining-5-sigma-higgs-how-well-did-they-do
I am not saying I agree with them! I wouldn’t rule #6 thumbs down, but he does. This was an exercise in deconstructing his and similar appraisals (which are behind principle #2) in order to bring out the problem that may be found with 2(b). I can live with all of them except #8.
Please see what I say about “murky distinctions” in the comment from earlier:
https://errorstatistics.com/2016/03/12/a-small-p-value-indicates-its-improbable-that-the-results-are-due-to-chance-alone-fallacious-or-not-more-on-the-asa-p-value-doc/#comment-139716
I didn’t agree with Spiegelhalter about #6 either, and didn’t follow his explanation so have asked him about it. But of course it’s clear from my previous comments that I do agree with him and Richard about #2&4. In fact I am beginning to wonder if there is a transatlantic cultural difference with regard to need for the subjunctive (ie for the “was/would distinction”), because in the language as I learned it there is no grammatical ambiguity about “the likelihood that their signal was a result of a chance fluctuation”.
I’m actually American, so it is unlikely to be a transatlantic issue.
alQ–did he answer? Please send him the link.
I commented on UU and (since the post is old) also sent him an email. And he did take the trouble to reply. I have copied you and Richard on my reply to his reply and will quote more fully and publicly if he gives permission to do so.
Here is a quote from David Spiegelhalter’s response to my email about #6:
“I admit I might have been a bit harsh on Carl. But maybe not. In the phrase ‘Likelihood of finding a false positive’ , the object of the probability statement is ‘false positive’, which can be easily interpreted as a combination of both the observation (‘positive’) and the hypothesis (‘false’). So on second thoughts, I stick to my guns on this one, although fully admit I am being a bit pedantic, and that Carl’s ambiguous statement is ‘thumbs sideways’.”
So I think his concern here is not because the statement is as clearly and unambiguously wrong as #2,4,&8 are (at least to anyone who speaks the same version of English as I do), but because of the ambiguity.
And I do agree with him that there is no good reason for not being completely clear. So, given the importance of not misleading the public, perhaps anything that could reasonably be misinterpreted by a non-expert should actually get a thumbs down.
But their approximative nature blocks the first thumbs up, no? Different models can be adequate and give very different p-values.
Om: I was doing my best, using the particular example of the Higgs–because that’s where these statements arose– but trying to remove that solely to get at one issue! There’s no way to take up all issues at once. The point is, substitute whatever words can allow us to examine the ONE particular issue of interpreting p-values here.
Here is my attempt to substitute appropriate words
1a. only one experiment in three million would see an apparent signal this strong in a universe *exactly* described by a model Ho where Ho happens to *also* be judged appropriate for describing our particular universe according to some finite list of criteria.
1b. Similarly, only one experiment in [some other number] would see an apparent signal this strong in a universe exactly described by a different model Ho’ where Ho’ also happens to be judged appropriate for describing our particular universe according to the same finite list of criteria above.
1c. Therefore, only one experiment in [some number] in [some range] would see an apparent signal this strong in *our* universe if it is exactly described by one of the models in the set of all models which would be judged appropriate for our particular universe according to that finite list of criteria, where [some range] contains all values that would be generated by this set of judged-to-be-appropriate models.
1d. Assuming that the universe cannot be described by a model specified by a finite list of criteria there are always models that would both be judged appropriate according to the same standard but that are arbitrarily far apart according to some criterion not included in the original judgement. In particular, there are models which are both judged appropriate according to the same seemingly reasonable criteria but which give very different p-values. There are also models that give the same p-value but only one of which satisfies some other seemingly reasonable list of criteria.
1e. Therefore it is important to carefully choose our criteria for judging the appropriateness of possible models and what that implies about the ways in which we can be wrong. The assumptions of a model are often more important than the output of a model. More attention should be focused on these than the numerical value of an output from one particular model.
1f. Bonus: assuming 1d, the only way to (possibly) obtain a unique description of the universe is to introduce theoretical principles which assert an infinite number of statements – e.g. models of the universe should satisfy some continuous symmetry or other, or be invariant to all possibilities in some infinite set. These are falsifiable by finite observations but cannot be established by finite observations.
We do not need exact models; to spoze we do is absurd. The precise p-value doesn’t matter either when we have a strong argument from coincidence. But again, this is a bit of a side issue to the one I’m getting at.
You’re missing the point that Laurie and I are trying make, as well as why we see it as central.
Om: That’s fine, I’m glad you’re pursuing the discussion of model approximation, even though it’s distinct from the interpretive issue I’m trying to crystallize.
You’ve posted about ‘when Bayesian inference shatters’ and mentioned results to the effect that two Bayesians can start out agreeing to an arbitrary degree but end up disagreeing to an arbitrary degree.
What ‘sort’ of issue do you see these as being for Bayesians? Do they have no interpretive implications whatsoever? Doesn’t it introduce ambiguity into the meaning of the priors and posteriors if they can diverge in these ways?
If frequentist methods also ‘shattered’ in certain ways, would you be similarly concerned, as you are about Bayesian inference? Wouldn’t it also make the interpretation ambiguous?
(Hopefully my meaning is invariant wrt my typos)
Om: Not sure what your point is, but the Bayesian shattering result was a guest post:
https://errorstatistics.com/2013/09/14/when-bayesian-inference-shatters-owhadi-scovel-and-sullivan-guest-post/
Owhadi explained why and how the frequentist avoids this. I realize some discount their result as mainly of mathematical interest, hardly relevant to practice. I don’t know. The authors think it’s not far from what happens or can happen in today’s modeling practice. The link to the published paper is:
https://errorstatistics.com/2016/01/11/on-the-brittleness-of-bayesian-inference-owhadi-scovel-and-sullivan-published/
My point, also contained in my other comments here re p-values etc., is that frequentists can avoid shattering with respect to some criteria but not others. Bayesians and frequentists each satisfy their own criteria, diverge from the other’s, and both shatter with respect to reality.
Think about #3. Why do the P-value police give it a thumbs up? It might contain a central key.
3. The probability of the background alone fluctuating up by this amount or more is about one in three million.
up
Oliver got in first. We seem to be the only ones who give a thumbs down to 1 as stated. I was intending to write that if [where Ho is adequate] is replaced by [where Ho is true] I would agree with 1 – I have a feeling that there is a self-reference problem in what I have just written, but never mind. Oliver also correctly points out that there may be many models which are adequate in the same sense that Ho is adequate but give very different P-values. In mathematical terms statistical problems are often ill-posed and require some form of regularization. This does not seem to be mentioned in the ASA p-value doc but I have not read everything. I am only prepared to interpret a P-value qualified by the word adequate if I am given the definition of adequacy and the manner of calculation of the P-value. Given this, it is of interest how the physicists defined adequate. The one in three million is not an empirical value. I pointed out in a previous post that this would require an enormous number of experiments to obtain this order of accuracy. The following is based on https://de.wikipedia.org/wiki/Higgs-Boson which is in German. According to this the P-value was calculated by simulation. The corresponding English Wikipedia article https://en.wikipedia.org/wiki/Higgs_boson does not mention this. As I understand it the physicists simulated the decay of sub-atomic particles under two situations, one without the Higgs boson and one with. These simulations are based on the Standard Model of particle physics, and it looks pretty solid to me. The Standard Model of particle physics has a different status as a model compared with the normal model in the ‘if it looks normal it is normal’ approach. It also eliminates problems of ill-posedness and regularization. To account for the empirical variability of the experimental results the simulations made use of random numbers. Suppose the random numbers were generated as follows. Take the first twenty digits of the binary expansion of pi and convert these to a number in [0,1]. This is the first random number. Take the next twenty digits and convert these to give the second random number, and so on. The P-value one in three million is based on these random numbers. But of course they are not random, they are deterministic. How do you talk your way out of this? Would you still accept the one in three million? This is not what the physicists did. They presumably used their own random numbers. But these are also not random, they are deterministic: see https://root.cern.ch/about-root. The P-value one in three million is based on a deterministic but chaotic generating scheme. In principle the CERN random number generators are not much different from the binary expansion of pi. You can prove some useful theorems about them which are not available for pi, but in the last resort they will have to pass ‘tests for randomness’. One advantage of pi is that it is not periodic. What does this say about the use of the words ‘chance’ and ‘random’ when interpreting the deterministic P-value?
Laurie Davies,
For me your posts show up without any paragraphs, making them more difficult than is necessary to comprehend, but what I do get looks interesting. Is that “single-paragraph” format intended, or a bug due to the way this blog makes your posts show up? If the latter, maybe “\n” would do it.
I think all this talk about determinism misses the point. A (deterministic) simulation with, say, pseudo-random numbers is merely a way of estimating an integral. Analytically computing an integral is a deterministic process too. This does not threaten the interpretation of the resulting p value at all.
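A small illustration of that point, assuming a seeded NumPy generator: the simulation is just a numerical estimate of the same tail integral that could be computed analytically:

```python
import numpy as np
from scipy import stats

# A fully deterministic, seeded pseudo-random generator: rerunning the script
# reproduces the output exactly.
rng = np.random.default_rng(12345)

z_obs = 3.0
draws = rng.standard_normal(2_000_000)
p_mc = np.mean(draws >= z_obs)    # Monte Carlo estimate of the tail integral
p_exact = stats.norm.sf(z_obs)    # the same integral computed analytically

print(f"Monte Carlo estimate of the p-value: {p_mc:.5f}")
print(f"Analytic tail area:                  {p_exact:.5f}")
```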
Richard – here is a quote from another Richard (Feynman): “The next great awakening of human intellect may well produce a method of understanding the qualitative content of equations. Today we cannot. Today we cannot see that the water-flow equations contain such things as the barber pole structure of turbulence that one sees between rotating cylinders. Today we cannot see whether Schrödinger’s equation contains frogs, musical composers, or morality – or whether it does not.”
Laurie: I was actually trying to avoid side issues by writing “adequate”; else I would have written “Ho is true”, e.g., “mu = 0”. So let’s change it to “Ho is correct”. Personally, I’m not afraid of speaking of truth. I was trying to abstract away from the Higgs example for the purpose of general lessons, which are many and quite important, and I worry we’re not grasping them.
Suppose one wants to evaluate E(f(X)) for some real-valued function f and a real-valued random variable X. This is simply an integral but it may be too complex to calculate. For example, f(X)=X where X is the exit time of a two-dimensional Brownian motion from the interior of some suitably smooth two-dimensional region. As you state, this may be estimated by generating i.i.d. X_i and then using the law of large numbers, with perhaps an error term. Unfortunately no one knows how to generate random numbers, so instead we use pseudo-random numbers X’_i. These, however, have the property that if I know X’_1 I can calculate the whole process X’_i. As a side remark, this can be very useful in debugging programmes. This property of the X’_i completely negates any concept of randomness, which is required when estimating integrals through simulations. It is the case that integrals can be estimated by taking values on a grid, which is also deterministic. The proof that this works is in some cases simple, say Simpson’s rule. However, as far as I understand it, the use of pseudo-random number generators relies on them mimicking random variables in some sense. Tell me why it works. Your words ‘is merely a way of estimating an integral’ suggest that the answer is simple, as for example in Simpson’s rule. Is this so? There may be theorems to explain why they work and if so I would be interested in references. If you wish to show your students what a two-dimensional Brownian motion looks like how do you do this? Presumably you use a deterministic pseudo-random number generator, but when doing this you are not estimating an integral. There are other uses. Early pseudo-random number generators were poor, three consecutive numbers lying on parallel lines in three dimensions being one famous example. No doubt the physicists put in a lot of work to make sure that their generators had no observable defects. So in this particular case I have no problem in accepting the conclusion on the existence of the Higgs boson. Nevertheless I always keep in mind that pseudo-random variables are not random variables. As another aside, there were reports in the newspapers today that the end digits 1, 3, 7 and 9 of prime numbers are not, so to speak, random. They have different frequencies and the conditional distributions for successive primes are not independent. I suppose I have problems with the ontology of random. I don’t have them, at least not to the same degree, with the ontology of chaos. But that is just my picture of the world.
I don’t know if the PVP would accept me as a member, but the reason *I* gave #3 a thumbs up is because I didn’t see it as encouraging me to believe anything that is false or nonsense. Why do you identify this one in particular as being somehow special?
The insights to take away from this thumb’s up:
3. The probability of the background alone fluctuating up by this amount or more is about one in three million.
Given that the PVP are touchy about assigning probabilities to “the explanation” it is noteworthy that this is doing just that. Isn’t it?*
Abstract away as much as possible from the particularities of the Higgs case, which involves a “background,” in order to get at the issue.
3′ The probability that chance variability alone (or perhaps the random assignment of treatments) produces a difference as large or larger than this is about one in 3 million. (The numbers don’t matter.)
In the case where p is very small, the “or larger” doesn’t really add any probability. The “or larger” is needed for BLOCKING inferences to real effects by producing p-values that are not small. But we can keep it in.
3” The probability that chance alone produces a difference as large or larger than observed is 1 in 3 million (or other very small value).
3”’ The probability that a difference this large or larger is produced by chance alone is 1 in 3 million (or other very small value).
I see no difference between 3” and 3”’.
For a frequentist who follows Fisher in avoiding isolated significant results, the “results” = the ability to produce such statistically significant results.
*Qualification: It’s never what the PVP called “explanation” alone, nor the data alone, at least for a sampling theorist–error statistician. It’s the overall test procedure, or even better: my ability to reliably bring about results that are very improbable under Ho.
See also my comment:
https://errorstatistics.com/2016/03/12/a-small-p-value-indicates-its-improbable-that-the-results-are-due-to-chance-alone-fallacious-or-not-more-on-the-asa-p-value-doc/#comment-139624
The mistake is in thinking we start with the probabilistic question Richard states. I say we don’t. I don’t.
I don’t see 3 as assigning probability to the explanation. Given the reference to “background”, I think I would see it as more like “the probability of a bird flying into my window” which, even if I had just heard a bump, I would still take as the expected fraction of days in which such an event happens rather than “the probability that the bump I just heard was a bird hitting the window” which *would* be assigning probability to the explanation (and would depend very strongly on whether or not I have family members in the habit of forgetting their keys).
Also, I do see a difference between 3” and 3”’, and the key to me is in just one letter.
For 3”, although I don’t think its grammar is perfect, I’d give thumbs almost straight up to “The probability that chance alone produces X”, which, to me, is equivalent to “The probability of chance alone producing X” or “The probability that X *would be* produced by chance alone”.
On the other hand, I’d go all PVP on “The probability that chance alone produced X”, which I see as equivalent to 3”’: “The probability that X *was* produced by chance alone”.
Rejected post
http://rejectedpostsofdmayo.com/2016/03/17/can-todays-nasal-spray-influence-yesterdays-sex-practices-non-replication-isnt-nearly-critical-enough-rejected-post/
Wow! I first learned about p-values over 40 years ago and used them all my working life. I thought I understood them but now I’m not so sure. I think I’m having difficulty with the phrase “the probability that the data were produced by random chance alone,”. Is this the same as P[T|H0]?
The issue is that there is some ambiguity; the phrase could mean:
“assuming that the data were produced by random chance alone, what is the probability of obtaining this result”, which corresponds to your interpretation P(T|Ho). Be careful, because the frequentist police might object to your use of a conditional probability given that H0 is not a random variable; you can also write it P(T;H0) to be safe.
“what is the probability that the data were indeed produced by random chance alone”, which is what I think the ASA meant in their document. This doesn’t make much sense in frequentist terms because (leaving aside discussions about the ability of chance to produce anything) either the data were produced by random chance or they were not, so it is a certain (but unknown) fact. Of course, if you think of the probability as a reflection of your knowledge about this fact, it’s a very sensible question to ask.
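A toy contrast between the two readings (the prior and the point alternative below are invented for illustration, not anything in the ASA document): the first quantity needs only the sampling model, the second needs a prior:

```python
from scipy import stats

z_obs = 2.5   # toy observed test statistic

# Reading 1: P(T >= t_obs; H0) -- an ordinary error probability; no prior needed.
p_value = stats.norm.sf(z_obs)

# Reading 2: Pr(H0 | data) -- only defined once a prior and an alternative are
# specified (the 0.5 prior and the point alternative at 3 are purely illustrative).
prior_H0 = 0.5
lik_H0 = stats.norm.pdf(z_obs, loc=0.0)
lik_H1 = stats.norm.pdf(z_obs, loc=3.0)
post_H0 = prior_H0 * lik_H0 / (prior_H0 * lik_H0 + (1 - prior_H0) * lik_H1)

print("P(T >= t_obs; H0) =", round(p_value, 4))
print("Pr(H0 | data)     =", round(post_H0, 4), "(changes with the assumed prior and alternative)")
```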
Peter: This is what the non-argument is about.
More details here https://errorstatistics.com/2016/03/12/a-small-p-value-indicates-its-improbable-that-the-results-are-due-to-chance-alone-fallacious-or-not-more-on-the-asa-p-value-doc/#comment-139781
But more succinctly: is the phrase “H0 is true” to be taken as “the statistical model is adequate”, or as “the unknown parameter of interest in the test is equal to the null parameter” (common in many statistical models) while the model may be too wrong in other respects?
I believe the first (Mayo’s and others’ choice) makes it a mathematical claim (the model taken as true implies this error rate), and the second is what is needed for an applied claim (the model taken as a possibly too-wrong representation).
Mayo wants to focus on the first and it is their blog.
Keith O’Rourke
Hi Peter, If you are having difficulty accepting the phrase “the probability that the data were produced by random chance alone,” as a good way of expressing the idea of P[T;H0] to the general public, then you are in line with the recent statement of the ASA, along with at least three of the commenters here and the rest of the “P-value Police”. Our objection is that those words quite strongly suggest something different (and either wrong or “not even wrong”) to the general reader – namely “P[Ho|T]”.
Mayo seems to think that the words are ok so long as the user means well, but I am finding that very difficult to accept.
AlQpr: But not for an error statistician, for whom the probability of results or data or events always refers to “what would occur” under such and such a scenario. While we’re at it, what is the interpretation of Pr(x|Ho) for a Bayesian? With x known, Kadane says it’s 1. Not a very useful number, however construed, at least for someone keen to reason about underlying causal processes.
But, but, but… the general reader is *not* an error statistician who automatically inserts the words “would have” in front of every occurrence of “occurred”. So to fail to do it for her (when needed in order to make the statement a grammatically correct expression of the meaning) is wrong, wrong, wrong!
P.S. I suppose I should repeat my earlier acknowledgement (made on March 13 in a comment that is still “awaiting moderation”) that, for the benefit of error statisticians who *do* automatically insert “would”s as needed, the ASA statement would have been better with “do not measure” replaced with “should not be described as measuring”.
P.P.S. There’s another comment of mine from March 13 that also appears to be still “awaiting moderation” (Perhaps not having realized that you hadn’t seen these may explain, if not excuse, my going off the rails on the 14th)
P.P.P.S. I’m afraid I have no idea what a Bayesian would say. I strongly suspect that it would depend on *which* Bayesian (but please don’t take that as my having a position on Objective vs Subjective priors).
alQpr: I suggest the average reasoner, who after all trades in inductive learning every day, if not fed the supposition that inference is to the probability of a claim, rather than just to a claim (qualified), would approach the issue as does an error statistician.
I’ll look into any comments that were held up; it was not intentional. Sorry. Even your risqué comment went through.
This is the crux of any disagreement that I may have with you! And it is an empirical matter that I think would be well worth a serious study. What proportion of the general reading public would be misled by the forbidden statement? My evidence is only anecdotal but quite substantial. I have had dozens if not hundreds of students who would say exactly what Richard Morey’s colleague did; so even if the proportion is less than half, I would still like to see that wording forbidden. But I really would like to know how many people do have a correct instinct that is sufficiently strong to override the incorrect (or at least debased) language.
alQpr: The trouble with Richard’s exchange is that it begins with the presumption that the problem is to assign a posterior probability (e.g., that placebo alone produces the observed outcomes), whereas we understand it as the probability of results under random assignment alone.
Are we referring to the same anecdote? The one I was referring to had Richard’s colleague propose a test question of the form “Given p=.03. What is the probability that there is an effect?” with the intended answer being “.97”. Yes, the problem was to assign a posterior probability, but does that make the answer correct? Was the question even well posed?
Surely the question and proposed answer indicate a deep misunderstanding. And that misunderstanding appears to have been prompted by the colleague’s misconstrual of what should have been meant when told that “p=.03 means the probability that the results were due to chance is .03.” So the point of the anecdote is that that language did lead at least one person astray.
Al: For one thing, “probability of claim C” in English is often used to mean “there’s good, but not conclusive, evidence to rule out C’s falsity”. If asked “given low p-values, how good is the evidence that C’s falsity has been ruled out?”, it might be understood in this informal way. To give a confidence level of .97 to the assertion “mu exceeds the .95 lower confidence bound,” for example, is correct. If asked what posterior is warranted given a p-value of .03, the only correct answer would be “the question makes no sense”, and unless this was one of the choices, the test-taker would assume the informal notion, or the “confidence” notion, was meant. I’m not advocating explanation #1, by the way–even if I think that’s quite common. I think explanation #3 is more plausible.
The onus for the Bayesian reading is to show why a posterior of .97 to C given x corresponds to strong evidence that C’s falsity has been ruled out.
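A minimal sketch, assuming a normal model with known sigma (a toy setup, not a general claim), of the duality behind the “confidence” reading: a one-sided p-value of .03 against mu = 0 corresponds to a 97% lower confidence bound sitting exactly at 0:

```python
import numpy as np
from scipy import stats

# Normal model with known sigma; sample size n.
n, sigma = 25, 1.0

# The sample mean that yields a one-sided p-value of .03 against Ho: mu = 0.
p = 0.03
xbar = stats.norm.isf(p) * sigma / np.sqrt(n)

# The corresponding 97% lower confidence bound for mu sits exactly at 0.
lower_97 = xbar - stats.norm.ppf(0.97) * sigma / np.sqrt(n)
print("one-sided p-value:          ", p)
print("97% lower confidence bound: ", round(lower_97, 10))
```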
Just to clarify. In Richard’s anecdote he was not talking about a student responding to the quoted question and perhaps being forced to make the least wrong choice of responses to a question that makes no sense. The anecdote is about the “researcher” who *posed* the question that makes no sense to a class of beginning students (and wanted to force them into a posterior interpretation).
With regard to the informal language business, I would be happy to accept that interpretation for language such as “claim C is probably true” or just “probably C”, but not for something that assigns a concrete numerical value to “the probability that C is true”.
But that doesn’t matter much, because I have never disagreed with you about “Explanation#3” being almost always the case with regard to what the speaker or writer *intended*. The intent may be fine, but IMO claim(1) is a wrong *expression* of that intent.
al: Well, maybe we’ve made some progress. But now look at the murky distinctions one must hold in order to follow the pattern of thumbs up and down. The error statistician always construes error probabilities as methodological probabilities: e.g., the probability that a method would have resulted in even larger observed differences by chance variability.
OK, I’m back. In the frequentist world, there is nothing wrong with P(T|Ho)=P(T|Ho=TRUE). Ho doesn’t have to be a random variable. And I wasn’t aware that P(T|Ho) was different from P(T;Ho).
As an applied statistician one has to constantly move between a theoretical world and a practical world. So, for example, an experimenter wants help designing an experiment and asks for it. He or she uses lay language that I have to convert to theoretical language so that I can consider the problem using the mathematical foundation of the subject. Having determined an optimal design for the experiment I then have to explain it using lay language.
Now, an experimenter who has just read the ASA statement asks me for an interpretation of the phrase “the probability that the data were produced by random chance alone”. I try to convert the phrase to a theoretical basis and can’t. I have to admit that I do not know what it means. Is there a difference between chance and random chance? Why is the word “random” included in the statement? What is the significance of the word “alone”? Are we talking about a marginal or a conditional probability? The data could have arisen as an extreme outcome under the null or a less extreme outcome under the alternative.
So, in my eyes the ASA document fails and they will have to have another stab at it.
Hi Peter, Are you perhaps missing the intent of the ASA? The description you find unclear and/or wrong (as do I) is not something they are advising people to use. Quite the contrary, in fact. I think the only reason they use that wording is that it is how many authors *do* describe the p-value, and it needs to be stopped because it misleads at least a significant part of the readership. Where I do think they went wrong is in saying “p-values do not measure blah, blah, blah”, where “blah, blah, blah” is ambiguous or meaningless or wrong (at least according to how I read the English language), rather than “p-values should not be described as measuring blah, blah, blah”.
alQpr: If they think it’s misleading, never mind the clear meaning that error statisticians attach to it, they at least owe us an explanation of the posterior probability they construe it as.
I don’t think so. Such a claim is kind of like saying that those who dispute a claim that something was caused by “the Gujji spirit” (or any other meaningless array of syllables) owe it to us to explain what they think “the Gujji spirit” actually looks like.
But as I read the English language, the disputed claim (1) (identified in your OP as what is rejected by the ASA) is *describing* the p-value as a posterior probability of the null hypothesis H0 based on a prior probability distribution: p = Pr(H0|x). Under this reading there is a fallacy, no matter what we construe Pr(H0|x) to be, or even whether we consider it to have no sensible meaning at all. Note: I have changed “interpreting” from your Explanation #2 to “describing”, so I am not alleging a failure of understanding on the part of the claimant but rather a failure of expression. (And this is no longer an “explanation”. The corresponding explanation may well be just that the claimant doesn’t speak English correctly.)
But I agree that we should all yield to the results of an empirical test of how English is really used and understood these days and of how effective the disputed language is (compared to the alternative of a properly phrased subjunctive) in either advancing or retarding a correct understanding among those who don’t already know what a p-value really is.
No, I don’t think I’m missing the intent. The offending phrase is 2b in Deborah’s article. If I read Deborah correctly, she is saying that the wording denied by 2b is OK, and that 2b should be removed from the document, leaving 2a on its own. When I first read 2b my initial reaction was that you can say this, but I was interpreting the phrase as being conditional on H0=TRUE (as Deborah does). But 2b doesn’t refer to H0, so I might be reading something into 2b that is not meant by the authors. So, 2b as it stands defies interpretation and should be removed. If 2b is a phrase commonly used to explain results and the ASA thinks that this is wrong, then the statement should expand on the point and explain why it is wrong.
Peter: I’m not sure, but I think you’re largely agreeing with me.
Yes, I am.
Peter, I agree that 2b “defies interpretation”, in the same sense as saying “The p-value is not the strength of God’s will not to deceive you”. It does so by denying a claim [identified by Mayo as claim (1)] which itself defies interpretation.
I also agree that it might be better stated as follows:
‘P-values should not be described as measuring “the probability that the data were produced by random chance alone” because the quoted phrase lends itself to an incorrect interpretation.’
But I think that to remove 2b rather than just amending it would be irresponsible, because my sense of the popular literature is that claim (1) *is* frequently made, and my sense of the general population, including some “researchers”, is that it *is* often interpreted as some kind of “posterior” claim about the probability of an H0. (I also think that is the only grammatically correct interpretation of the words in question, but we can leave that aspect aside for now.)
And, by the way, the ASA document does “expand on the point and explain why it is wrong” as follows:
‘Researchers often wish to turn a p-value into a statement about the truth of a null hypothesis, or about the probability that random chance produced the observed data. The p-value is neither. It is a statement about data in relation to a specified hypothetical explanation, and is not a statement about the explanation itself.’
I am not claiming that you should find this adequate (though I do), but at least it is there.
No, I don’t find this adequate; it seems to be doing no more than repeating the headline principle no. 2. If you look at the six principles in the ASA statement, five of them address a single issue, but number 2 addresses two issues, 2a and 2b in Deborah’s post. 2a and 2b are quite different statements, so I would either split principle 2 into two different principles or I would remove 2b. My reason for removing 2b is that it is an OK thing to say conditioned on H0=TRUE.
More generally, the sentence “It is a statement about data in relation to a specified hypothetical explanation, and is not a statement about the explanation itself.” has lost me. With language like this there is no way that this statement will have any impact on the intended audience.
Well at least you have managed to convince me to change my mind about the adequacy of that explanation!
I don’t know if any of my own attempts have been any better, but I do still think that the wording denied in 2b *is* frequently used and that it *does* have the effect of misleading many readers, and so that something needs to be said to stop people from using that wording.
al: But it’s ok to say the proportion of experiments that would produce such impressive results due to chance variability is 100p% (as the simulation sketch just below illustrates).
And now you need a principle to move from this to something like: so this indicates the results are not due to chance. Why? Because with very high probability, a less impressive result would occur under Ho.
This is where probability belongs: on how probable it is the method would have alerted me to the falsity of a claim.
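A minimal simulation of this methodological reading, under an assumed one-sided z-test of H0: mu = 0; the sample size, sigma, and p = .03 are illustrative choices only. It shows that in a world where H0 holds, only about 100p% of repeated experiments produce a result as impressive as the observed one, so with probability about 1 - p a less impressive result would occur.

# Sketch: proportion of experiments, in a world where H0: mu = 0 is true,
# that would produce a result at least as "impressive" as an observed
# one-sided p-value of .03 (z_obs ~ 1.88). The setup is assumed for illustration.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
n, sigma, n_experiments = 25, 1.0, 200_000
z_obs = norm.ppf(0.97)                        # observed cutoff corresponding to p = .03

# Simulate the test statistic under H0 (mu = 0)
xbars = rng.normal(0.0, sigma / np.sqrt(n), size=n_experiments)
z_sim = xbars / (sigma / np.sqrt(n))

prop_as_impressive = np.mean(z_sim >= z_obs)
print(prop_as_impressive)        # ~0.03: about 100p% of experiments exceed the cutoff by chance
print(1 - prop_as_impressive)    # ~0.97: with high probability a less impressive result occurs under H0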
Yes, yes, and yes.
I might add more details, but I don’t find any of those statements likely to lead people astray.
Peter: The p-value is always computed under Ho, which of course is merely an implicationary assumption–an assumption for testing.
Yes, the statement about the relationship between data and hypotheses is indeed cryptic. Inference is always about such relationships, and it’s wrong to suppose we don’t learn about H’s truth by studying the relationship between data and H.
Deborah: I am aware that the p-value is always computed under Ho. I am also aware that inference is always about the relationship between data and hypotheses. I didn’t understand the phrase “a p-value … is not a statement about the explanation itself”. Now, we (the readers of this blog) can all agree that a p-value is not a “statement about the explanation itself”, just as we can all agree that a p-value is not a statement about dairy farming in Denmark. So perhaps we should forget it. But the authors of the ASA document clearly believe that some people do (erroneously) believe that a p-value is a “statement about the explanation”. If the ASA did not believe this they would not have mentioned it. So, the ASA should explain (to all of us) why they believe that some people have this misunderstanding and should explain (to them, not to me) why it is not true.
We have a situation here in which a lot of people are using statistical methods and are getting it wrong, for whatever reason. If we are to address this situation we need to do so in language that they can understand, not language that only philosophers can understand.
Peter: You wrote that “the sentence ‘It is a statement about data in relation to a specified hypothetical explanation, and is not a statement about the explanation itself.’ has lost me. With language like this there is no way that this statement will have any impact on the intended audience.” I was simply agreeing that it was a cryptic claim. I had nothing to do with it. I went on to say what I suspected was behind the assertion, and noted it was faulty. So basically I don’t get your last e-mail. I think the ASA assertions in question were just part of the ASA’s attempt to say that a p-value is not a posterior probability. The phrases are awkward because all of the other approaches are also about data-explanation relationships, and few if any of the authors agreed about which is the recommended form of statistical inference.
Al: Why didn’t they allege researchers often wish to turn a p-value into a statement about the probability or probable truth of a null? There’s nothing to block using p-values to learn what is the case about genuine effects, extents of discrepancies indicated or not, and much else. There’s an erroneous assumption that a posterior probability is the way to learn about what is the case, except that the “alternative approaches” don’t obviously yield statements about the truth of a null either. So we have this highly awkward claim about “wishing to turn”.
As for the last few sentences you quote from the ASA doc, …well, I wrote a distinct set of comments on them, perhaps I’ll share them in a different post.
Anoneuoid, no particular reason, I can’t even think of a good excuse.
Mayo, replacing ‘Ho adequate’ by ‘Ho true’ just makes things worse. If you do this in 1, I give it a thumbs up because it becomes a mathematical theorem. I have no option but to give it a thumbs up. But it also becomes vacuous. As Oliver wrote, I think you are missing our point. Given that Ho is not true but only adequate, you can only interpret a P-value if we know exactly what your definition of adequacy is, what Ho is, and how you calculated the P-value.
This may be long and boring but here goes. I take my copper in drinking water data and have to decide whether the true amount of copper in the water exceeds the legal limit of 2.20 mg per litre. To this end I intend to use a location-scale model F((·-mu)/sigma) with i.i.d. random variables. As the precision of the final result depends on F, I intend to regularize by choosing a minimum Fisher information F. As the normal distribution F=Phi is a minimum Fisher information distribution, I will base the analysis in the first instance on the normal model. I will provisionally and speculatively identify the location parameter mu with the true amount of copper in the water. I will now determine those parameter values (mu,sigma) which give adequate approximations to the data. To this end I define an approximation region for (mu,sigma) which involves a number alpha, 0 < alpha < 1, which defines what I mean by ‘typical’, and several statistics which take into account the behaviour of the mean, the variance, the extreme values (possible outliers) and the shape, as measured say by the Kolmogorov metric, under the model with parameters (mu,sigma). I have already provided an R-programme which does just this. The interpretation is that typical samples (100 alpha% of the samples) generated under the model N(mu,sigma^2) will satisfy the inequalities which define the approximation region. On putting alpha=0.9 the range of adequate mu values is [1.945,2.084]. As this does not contain 2.2, the conclusion based on this analysis is that the water sample does satisfy the copper contamination requirements.
To give an idea of how far away mu=2.2 is from being an adequate value of mu, one can increase alpha until the interval for mu includes 2.2. The result is alpha'=0.99997, with P-value 1-alpha'=0.00003. The concept of adequacy based on alpha' is so weak that only 1 in 33,000 normal samples will fail to satisfy it. The P-value is a measure of approximation of Ho for the given data: if the P-value is small then the model is a poor approximation, and the smaller the P-value the worse the approximation. One can look for possible reasons. One reason, but not the only one, could be that the amount of copper in the water is in fact much less than 2.2. Another possible one is that the measurements were biased, perhaps due to poor calibration. (A stripped-down sketch of these calculations appears just below.)
In interpreting the P-value as a measure of approximation of a stochastic model to the given data set, I do not have to put forward explanations such as there being a one in whatever chance of such data, given a true copper content of 2.2, due to random statistical fluctuations, or a one in whatever chance of the signal not being real, etc. For this reason all of 1.-8. get a thumbs down. I have a single data set x_n and a model Ho. The P-value is very small; the model Ho is a very poor approximation to x_n. Now go find out why.
It is a different way of thinking and like Oliver I think it is fundamental and not a side issue.
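A heavily stripped-down sketch of the approximation-region mechanics described above, keeping only the mean-based criterion; Laurie’s actual R-programme also uses statistics for the variance, the extreme values, and the Kolmogorov distance, and the data below are invented for illustration, so the numbers will not reproduce [1.945, 2.084] or 0.00003.

# Simplified sketch of the "approximation region" idea (mean-based criterion only).
# The readings are made-up numbers standing in for copper measurements in mg/l.
import numpy as np
from scipy.stats import t

rng = np.random.default_rng(1)
x = rng.normal(2.015, 0.09, size=10)        # invented copper-style readings
n, xbar, s = len(x), x.mean(), x.std(ddof=1)

def adequate_mu_interval(alpha):
    """Interval of mu values for which the observed mean lies in the central
    region covering a fraction alpha of N(mu, sigma^2) samples (mean part only)."""
    c = t.ppf(0.5 + alpha / 2, df=n - 1)
    half = c * s / np.sqrt(n)
    return xbar - half, xbar + half

print(adequate_mu_interval(0.9))            # adequate mu values at alpha = 0.9

# Increase alpha until mu = 2.2 is just inside the region; 1 - alpha' then plays
# the role of the P-value as a "measure of approximation" of Ho: mu = 2.2.
t_obs = abs(xbar - 2.2) / (s / np.sqrt(n))
alpha_prime = 2 * t.cdf(t_obs, df=n - 1) - 1
print(1 - alpha_prime)                      # small value: mu = 2.2 is a poor approximation

The design choice mirrors the description above: the adequacy region is indexed by alpha, and the reported number is simply the smallest 1 - alpha at which the hypothesized value becomes adequate.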
Laurie: I get your drift, and now I see that it arose because of some very central confusions between how different authors of comments on the ASA doc construed the model. I don’t want to relegate that to a mere problem of nomenclature or convention, but nevertheless, I’d like to take it up another time and return to my allegedly vacuous claim.
(1): The probability that test T yields d(X) > d(x), under Ho, is very small.
(Having trouble with the symbols, so I’m using words.) This may be warranted by simulation or analytically, so it is not mere mathematics, but I’m prepared to grant this.
Or, to allude to Fisher:
The probability of bringing about such statistically significant results “at will”, were Ho the case, is extremely small.
Now for the empirical and non-vacuous part (which we are to spoze is shown):
Spoze
(2): I do reliably bring about stat sig results d(X) > d(x). I’ve shown the capability of bringing about results each of which would occur in, say, 1 in 3 million experiments UNDER the supposition that Ho holds. (A back-of-the-envelope version of this is sketched just below.)
(It’s not even the infrequency that matters, but the use of distinct instruments with different assumptions, where errors in one are known to ramify in at least one other.)
I make the inductive inference (which again, can be put in various terms, but pick any one you like):
(3): There’s evidence of a genuine effect.
The move from (1) and (2) to inferring (3) is based on
(4): Claim (3) has passed a stringent or severe test by dint of (1) and (2). In informal cases, the strongest ones, actually, this is also called a strong argument from coincidence.
(4) is full of content and not at all vacuous, as is (2). I admit it’s “philosophical” but also justifiable on empirical grounds. But I won’t get into those now.
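A back-of-the-envelope calculation of why (1) and (2) together carry force. The figures below (a single-experiment chance of 1 in 3 million, and, for contrast, p = .03) are illustrative assumptions only: if each independent experiment had probability p of yielding d(X) > d(x) in a world where Ho holds, the probability of reliably bringing about such results across several experiments collapses very quickly.

# Sketch: probability, under Ho, of bringing about statistically significant
# results "at will" across k independent experiments, if each has probability p
# of exceeding the cutoff by chance alone. Numbers are illustrative only.
p_single = 1 / 3_000_000          # e.g. a 5-sigma-like single-experiment chance

for k in (1, 2, 3):
    prob_all_by_chance = p_single ** k
    print(f"k = {k}: P(all {k} results exceed cutoff; Ho) = {prob_all_by_chance:.3e}")

# Even for a modest p of .03, reliably reproducing the effect is improbable under Ho:
print(0.03 ** 3)                  # 2.7e-05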
What role does statistical significance play here? By repeatedly bringing about a result (regardless of whether it takes the form of stat sig or not), you’re demonstrating that the result appears to be an invariant of your procedures and nature, i.e. evidence that you’re dealing with a regularity of some sort. You could do the same with Bayesian methods or completely non-statistical methods, couldn’t you?
Om: I’m sure you could do it with non-statistical methods, and doubtless can reconstruct it Bayesianly if you’re keen to, but it’s hard to see the counterfactual claims being cashed out without reference to a sampling distribution or the like. But the deeper point of my argument is that for an error statistician, the goal is not a probability assignment to a hypothesis. Rather, we want to warrant the inductive yet non-probabilistic inference by reference to methodological probabilities (formally or informally). The objection in principle 2(b) is from the perspective of a probabilist. So it’s not clear why they’d be wanting to reformulate it in a non-probabilist mode. Surely the PVP would not. Not that we’re told what “the probability this result is due to chance” actually means for a Bayesian probabilist. Perhaps, “John bets it’s not due to chance”, but that only tells me about John and doesn’t unpack the claim.
Too late to write.
I may be being thick but I still don’t see why the number one in three million has any relevance here. It seems like an arbitrary number within a model that doesn’t really connect to your ability to reliably interact with the external world. Again, because of the disconnect between your model, the world and all the other possible models lying in between.
Oliver, as I understand it the situation is as follows. The number one in three million was based on simulations which in turn were based on the Standard Model for subatomic particles. Thus the Standard Model, with or without the Higgs boson, was the only model under consideration. They no doubt used pseudo random number generators, and these presumably had been subject to severe testing before being used, for example against known empirical results. Nevertheless the simulations were in a range where little (no) empirical data was available. Even before its discovery some properties of the Higgs boson were known (assuming it to exist): spin zero, heavy, short life-time, energy requirements, etc. Given this they more or less knew where to look, in particular at what energy levels, and what to look for, that is, the sort of decay which would occur. The value one in three million (or a five sigma event) under the Standard Model without the Higgs boson was decided upon in advance. Before the discovery was finally announced there were reports of two and three sigma events. From this I conclude that they knew that the decay of a Higgs boson was (assuming it to exist) sufficiently variable for it to cause at least five sigma events under the Standard Model. According to the German Wikipedia they also performed simulations under the Standard Model but now including the Higgs boson. If this is so I assume that the five sigma event under the model without the Higgs boson has a higher probability under the model with the Higgs boson. However nobody seems to mention this. Thus one model was tested against another. Even after a five sigma event had been observed it was not clear that it was due to the decay of a Higgs boson. It took some time until this was established. All this is only my understanding and there are lots of gaps. If anyone knows a physicist who could give a more detailed and yet comprehensible account it would be appreciated.
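For what it is worth, the quoted figure can be checked directly as the one-sided tail area of a five sigma fluctuation under a normal model (a quick sketch, not a reconstruction of the actual ATLAS/CMS simulations); it comes out at roughly 1 in 3.5 million, in line with the “one in three million” shorthand.

# One-sided tail probability of a five-sigma fluctuation under a normal model.
from scipy.stats import norm

p_5sigma = norm.sf(5)        # survival function: P(Z >= 5)
print(p_5sigma)              # ~2.87e-07
print(round(1 / p_5sigma))   # ~3,500,000, i.e. roughly "one in three million"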
This makes sense to me. But then my impression is that all the key regularisation work is done by non-statistical theoretical principles from the standard model (which I’m perfectly happy to accept the physicists’ word on).
Given the regularisation we just compare the fit of the two possible models. A comparison of least squares or any other fit measure would do. As long as one of the two is assumed correct, a likelihood ratio should also work (see the toy sketch just below).
What seems to make the key difference is not that one fits and one doesn’t (measured on a scale of order “one in three million”, etc.), but that we have also “regularised away” all the other possible models based on physical principles. These can’t be established by error statistical means as far as I can see; this also seems to be Chalmers’ major objection, both in his “What is this thing called science?” and in the Mayo and Spanos “Error and Inference” volume.
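To make the likelihood-ratio remark above concrete, here is a toy sketch with two fully specified normal models and invented data; it has nothing to do with the real Higgs analyses and is only meant to show what “comparing the fit of the two possible models” amounts to.

# Toy sketch: comparing two fully specified models by log-likelihood ratio.
# Entirely illustrative; not a reconstruction of any actual physics analysis.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
data = rng.normal(0.4, 1.0, size=100)     # pretend observations

# Model A: "background only" (mean 0); Model B: "background + signal" (mean 0.4)
loglik_A = norm.logpdf(data, loc=0.0, scale=1.0).sum()
loglik_B = norm.logpdf(data, loc=0.4, scale=1.0).sum()

log_lr = loglik_B - loglik_A
print(log_lr)    # positive: the data fit model B better than model A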
Om: What? I think I convinced him entirely about the piecemeal theoretical learning that follows error statistical principles. If you know of an argument of Chalmers’ regarding the limits of my overall severe testing philosophy of science, which goes far beyond formal statistics, let me know. Actually, he’s been one of my leading supporters over the years, including me in the prefaces to his later editions of “What is this thing…”. Take a look at his appendix in chapter 13, “happy meetings…”
Yes I read that appendix. I have the third edition – I think that is one of the more recent ones.
I want to emphasise that I’m not saying your approach has no merit but that I have some objections that I believe others share. Similarly I interpret Chalmers as offering both support and constructive criticism of your approach.
In the epilogue of the same edition he states ‘the new experimentalist account is incomplete insofar as it does not include an adequate account of the various crucial roles played by theory in science’. How theory serves to provide the required regularisation in the case of the Higgs is one example of this, IMO.
He elaborates in his chapter in your book, and I tend to come down on the side of agreeing with his criticisms rather than your rebuttal.
Again, not to say your approach has no merit, but that the roles of theory and regularisation appear to me to be inadequately addressed and that they threaten the conclusions from your approach. Again, in my opinion and based on my own attempts to seriously consider a range of alternatives.
Om: Please tell me of the threat to my approach. The decade after EGEK (1996) was largely focused on the theory question. I returned to PhilStat ~2003. To read that material requires going beyond the philstat papers, which maybe you have. Chalmers’ “What is this thing…” was based on EGEK, and in Error and Inference he only went slightly beyond. He was here, by the way, living in one of my places for a month, and we worked through this over that time. But he’s too far away now to keep up.
Well I’ve tried to get it across but I’m obviously not doing a good job. How about – theory space is so large that *even locally* it is huge. Thus small differences can make a big difference.
So, either your approach sneaks in a regularity assumption that can’t ever be exhaustively (finitely) checked even locally, and hence your conclusions rest on uncheckable assumptions that seem reasonable but may have surprising effects, or you refuse to introduce uncheckable assumptions and your drawing of conclusions is blocked.
I’m not a philosopher obviously so you could probably poke holes in the framing, but I believe there are good mathematical reasons underlying this perspective.
Om: Sorry, I don’t see what you’re getting at. When did I ever say I needed or wanted to encompass theory space? That would be needed by an account that employs the catchall but mine does not. Learning is from the ground up, exhausting, at most, the answers to one question at a time. Part of the inference also includes what’s so far unchecked by reliable probes. Let me direct you to my stuff on underdetermination.
By the way, Chalmers wanted me to allow entire theories to be accepted even though only portions or variants are well probed, so he wanted me to be less stringent than I am.
‘One question at a time’ can still be hopelessly ambiguous. That’s why I emphasized *locally*. There is effectively a ‘local catchall problem’ for you imo.
‘Building things up one question at a time’ works for finite dimensional linear problems but will break in general for either infinite dimensional or nonlinear problems and I take reality to be at least one of these (and there is a sense in which they are very similar concepts anyway).
Anyway I think we are still talking past each other so I’ll give up for now. I still found it a valuable exchange.
Om: Nobody ever said science is guaranteed to succeed. An adequate philosophy of science, I say, should reflect the impressive successes as well as the trials and tribulations of actual science. See, what I’m really interested in is articulating how we learn (or fail) when we do, and I continue to make gradual progress. Anyone who denies we obtain scientific knowledge is holding a very boring view. So if learning doesn’t fit someone’s mathematical purism, clearly they’re barking up the wrong tree. Are you claiming scientific learning has thus far been limited to linear problems? There’s no local catchall problem, because we do in fact exhaust the answers to a given piecemeal question, however much we’re forced to “bring it down to the level” at which squeezing the space is possible, or else we recognize the obstacles. Incidentally, I’m placing a link to my exchange with Chalmers:
Click to access ch-2-warranting-theories-with-severity-chalmers.pdf
I’m not denying scientific progress or claiming we have only learned about finite dimensional linear problems. I’m saying I feel your account misses crucial lessons from the mathematicians, physicists and others who developed the methods to make progress on infinite dimensional and nonlinear systems.
Quantum mechanics requires Hilbert spaces, people who analyze nonlinear systems for a living use centre manifolds and bifurcation theory etc. This is not because of ‘mathematical purism’ but because we need them to make sense of our theories.
The local catchall problem is that ‘localization of questions’ is inevitably an approximation which can fail in surprising ways. If we want to go beyond merely listing past experiments and their results, then the failure of localization, to me, blocks reliable inductive inference.
I basically accept more traditional Popper (and Hume?) – we assert generalizations based on assuming regularity and work with them until they fail, but we never really inductively infer anything.
Om: Popper most certainly held corroboration, which he defines as passing a severe test. He said he was happy to follow Peirce in calling that induction, so long as one didn’t confuse it with Bayesian probabilism. I too follow Peirce, who ironically went further than Popper. As Chalmers said, and I agree with him here, I have a more adequate notion of severity than did Popper. To group Popper with Hume is bizarre because Hume thought people were in the habit of inductive enumeration and therefore followed an irrational method. Popper argued we do no such thing and that furthermore it fails on logical grounds. If you’re keen to allude to the philosophers in a serious way Om, you need to study them. I have several posts on Popper, which isn’t to say one can pick up philosophy by way of Cliff’s notes, as some seem to think.
What’s your argument that my account invariably misses crucial lessons from mathematicians?
Also, I believe it would give a fairer picture of your respective positions to post his original chapter, not just your rebuttal. OK, I will stop posting on this thread from now on! Might be back with more annoying comments on other posts…
Om: Use the newly opened blogspace if you do. I’ll try to get a link to his chapter.
If you could get him to read my and especially Laurie’s comments and post here I’d be interested to hear his take.
I’m not so keen on the insults. I just meant something like
” Hume showed that it is not possible to infer a theory from observation statements; but this does not affect the possibility of refuting a theory by observation statements. The full appreciation of this possibility makes the relation between theories and observations perfectly clear.”
– Popper
This is the part of Popper I accept; I don’t really accept corroboration.
If you haven’t picked up the gist of it from my comments so far then I am probably incapable of bridging the gap between us.
” It is to Hume’s undying credit that he dared to challenge the commonsense view of induction, even though he never doubted that it must be largely true. He believed that induction by repetition was logically untenable – that rationally, or logically, no amount of observed instances can have the slightest bearing upon unobserved instances. This is Hume’s negative solution of the problem of induction, a solution which I fully endorse.”
He then goes on to discuss where he disagrees. I agree with both on the ‘negative solution’.
(And if you feel that I’ve mis-interpreted Chalmers then I’d be happy to hear his take on this comment section. Perhaps he would take your side, I’m not sure. I’ve never seen him unequivocally accept all of your account. Reading some of his critiques was at least a stimulus for my own thoughts, even if mine diverged from his.)