Since we’ll be discussing Bayesian confirmation measures in next week’s seminar—the relevant blogpost being here--let’s listen in to one of the comedy hours at the Bayesian retreat as reblogged from May 5, 2012.

The problem was, the epistemic probability in H was so low that H couldn’t be believed! Instead we believe its denial H’! So, she will infer hypotheses that are simply unbelievable!

So it appears the error statistical testing account fails to serve as an account of knowledge or evidence (i.e., an epistemic account). However severely I might wish to say that a hypothesis *H* has passed a test, this Bayesian critic assigns a sufficiently low prior probability to *H* so as to yield a low posterior probability in *H*.[i] But this is no argument about why this counts in favor of, rather than against, their particular Bayesian computation as an appropriate assessment of the warrant to be accorded to hypothesis *H*.

To begin with, in order to use techniques for assigning frequentist probabilities to events, their examples invariably involve “hypotheses” that consist of asserting that a sample possesses a characteristic, such as “having a disease” or “being college ready” or, for that matter, “being true.” This would not necessarily be problematic if it were not for the fact that their criticism requires shifting the probability to the particular sample selected—for example, a student Isaac is college-ready, or this null hypothesis (selected from a pool of nulls) is true. This was, recall, the fallacious probability assignment that we saw in Berger’s attempt, later (perhaps) disavowed. Also there are just two outcomes, say s and ~s, and no degrees of discrepancy from H.

*Isaac and college readiness*

An example that Peter Achinstein[ii] and I have debated concerns a student, Isaac, who has taken a battery of tests and achieved very high scores, *s*, something given to be highly improbable for those who are not college ready.[iii] We can write the hypothesis:

*H*(I): Isaac is college ready.

And let the denial be *H’*:

*H*’(I): Isaac is not college ready (i.e., he is deficient).

The probability for such good results, given a student is college ready, is extremely high:

P(s | *H*(I)) is practically 1,

while very low assuming he is not college ready. In one computation, the probability that Isaac would get such high test results, given that he is not college ready, is .05:

P(s | *H’*(I)) = .05.

But imagine, continues our critic, that Isaac was randomly selected from the population of students in, let us say, Fewready Town—where college readiness is extremely rare, say one out of one thousand. The critic infers that the prior probability of Isaac’s college-readiness is therefore .001:

(*) P(*H*(I)) = .001.

If so, then the posterior probability that Isaac is college ready, given his high test results, would be very low:

P(*H*(I) | *s*) is very low,

even though the posterior probability has increased from the prior in (*).
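The critic's arithmetic can be checked directly. What follows is only a minimal sketch of Bayes' theorem with the numbers given above; it treats "practically 1" as exactly 1, which is a simplifying assumption, and the function name `posterior` is illustrative rather than anything from the exchange with Achinstein.

```python
def posterior(prior, p_s_given_h, p_s_given_not_h):
    """Bayes' theorem for a binary hypothesis H and an outcome s."""
    numerator = p_s_given_h * prior
    evidence = numerator + p_s_given_not_h * (1.0 - prior)
    return numerator / evidence

prior_ready = 0.001        # (*): one in a thousand in Fewready Town
p_s_if_ready = 1.0         # assumption: "practically 1" treated as exactly 1
p_s_if_not_ready = 0.05    # P(s | H'(I)) from the text

post_ready = posterior(prior_ready, p_s_if_ready, p_s_if_not_ready)
print(f"P(H(I) | s) = {post_ready:.4f}")
```

Under these assumptions the posterior comes out at roughly .02: about twenty times the prior, yet still very low.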

This is supposedly problematic for testers because we’d say this was evidence for H(I) (readiness). Actually I would want degrees of readiness to make my inference, but these are artificially excluded here.

But, even granting his numbers, the main fallacy here is fallacious probabilistic instantiation. Although the probability that a student randomly selected from the high schoolers in Fewready Town is college ready is .001, it does not follow that Isaac, the one we happened to select, has a probability of .001 of being college ready (Mayo 1997, 2005, 117).

Achinstein (2010, 187) says he will grant the fallacy…but only for frequentists:

“My response to the probabilistic fallacy charge is to say that it would be true if the probabilities in question were construed as relative frequencies. However, … I am concerned with epistemic probability.”

He is prepared to grant the following instantiations:

- p% of the hypotheses in a given pool of hypotheses are true (or a character holds for p%).
- The particular hypothesis *H*_{i} was randomly selected from this pool.

*Therefore*, the objective epistemic probability P(*H*_{i} is true) = p.

Of course, epistemic probabilists are free to endorse this road to posteriors—this just being a matter of analytic definition. But the consequences speak loudly against the desirability of doing so.

*No Severity.* The example considers only two outcomes: reaching the high scores *s*, or reaching lower scores, ~*s*. Clearly a lower grade gives even less evidence of readiness; that is, P(*H*’(I) | ~*s*) > P(*H*’(I) | *s*). Therefore, whether Isaac scored as high as *s* or lower, ~*s*, the epistemic probabilist is justified in having high belief that Isaac is not ready. Even if he claims he is merely blocking evidence for Isaac’s readiness (and not saying he believes highly in his unreadiness), the analysis is open to problems: the probability of finding evidence of Isaac’s readiness even if in fact he is ready (*H*(I) is true) is low if not zero. Other Bayesians might interpret things differently, noting that since the posterior for readiness has increased, the test scores provide at least some evidence for *H*(I), but then the invocation of the example to demonstrate a conflict between a frequentist and Bayesian assessment would seem to diminish or evaporate.
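To make the "no severity" point concrete, here is a rough sketch that computes the posterior for unreadiness under both possible outcomes. The value 0.99 stands in for "practically 1" and is purely an assumption (it keeps P(~s | H(I)) from being exactly zero); the function name is likewise hypothetical.

```python
def posterior_not_ready(prior_ready, p_outcome_if_ready, p_outcome_if_not_ready):
    """P(H' | outcome) for a binary hypothesis, via Bayes' theorem."""
    prior_not_ready = 1.0 - prior_ready
    numerator = p_outcome_if_not_ready * prior_not_ready
    return numerator / (numerator + p_outcome_if_ready * prior_ready)

prior_ready = 0.001        # Fewready Town prior from the text
p_s_if_ready = 0.99        # assumption standing in for "practically 1"
p_s_if_not_ready = 0.05    # P(s | H'(I)) from the text

# Outcome s: Isaac reaches the high scores.
post_unready_given_s = posterior_not_ready(
    prior_ready, p_s_if_ready, p_s_if_not_ready
)
# Outcome ~s: Isaac scores lower.
post_unready_given_not_s = posterior_not_ready(
    prior_ready, 1.0 - p_s_if_ready, 1.0 - p_s_if_not_ready
)

print(f"P(H'(I) | s)  = {post_unready_given_s:.4f}")
print(f"P(H'(I) | ~s) = {post_unready_given_not_s:.6f}")
```

With these inputs both posteriors for unreadiness are high, so neither outcome could have counted against unreadiness by this method.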

*Reverse Discrimination?* To push the problem further, suppose that the epistemic probabilist receives a report that Isaac was in fact selected randomly, not from Fewready Town, but from a population where college readiness is common, Fewdeficient Town. The same scores s now warrant the assignment of a strong objective epistemic belief in Isaac’s readiness (i.e., *H(I)*). A high-school student from Fewready Town would need to have scored quite a bit higher on these same tests than a student selected from Fewdeficient Town for his scores to be considered evidence of his readiness. (Reverse discrimination?) When we move from hypotheses like “Isaac is college ready” to scientific generalizations, the difficulties become even more serious.
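A small sketch in odds form may help show how much harder the scores must work in one town than in the other. The Fewdeficient prior of 0.999 and the 0.5 target posterior are hypothetical choices made only for illustration; the likelihood ratio of 20 comes from taking P(s | H(I)) as 1 and P(s | H’(I)) as .05.

```python
def posterior(prior, likelihood_ratio):
    """Posterior P(H | s) from prior odds times LR = P(s | H) / P(s | H')."""
    odds = likelihood_ratio * prior / (1.0 - prior)
    return odds / (1.0 + odds)

def required_lr(prior, target_posterior):
    """Likelihood ratio the scores must supply to reach the target posterior."""
    target_odds = target_posterior / (1.0 - target_posterior)
    prior_odds = prior / (1.0 - prior)
    return target_odds / prior_odds

lr_scores = 1.0 / 0.05          # the same scores s in both towns
prior_fewready = 0.001          # from the text
prior_fewdeficient = 0.999      # hypothetical: readiness is "common"

post_fewready = posterior(prior_fewready, lr_scores)
post_fewdeficient = posterior(prior_fewdeficient, lr_scores)

# Evidence strength needed just to make readiness more likely than not.
lr_needed_fewready = required_lr(prior_fewready, 0.5)
lr_needed_fewdeficient = required_lr(prior_fewdeficient, 0.5)

print(post_fewready, post_fewdeficient)
print(lr_needed_fewready, lr_needed_fewdeficient)
```

On these assumed numbers, a Fewready student needs scores with a likelihood ratio near 999 to reach even odds, while a Fewdeficient student starts well above that threshold before any test is taken.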

We need not preclude that *H*(I) has a legitimate frequentist prior; the frequentist probability that Isaac is college ready might refer to genetic and environmental factors that determine the chance of his deficiency, although I do not have a clue how one might compute it. The main thing is that this probability is not given by the probabilistic instantiation above.

These examples, repeatedly used in criticisms, invariably shift the meaning from one kind of experimental outcome—a randomly selected student has the property “college ready”—to another—a genetic and environmental “experiment” concerning Isaac in which the outcomes are ready or not ready.

This also points out the flaw in trying to glean reasons for epistemic belief with just any conception of “low frequency of error.” If we declared each student from Fewready to be “unready,” we would rarely be wrong, but in each case the “test” has failed to discriminate the particular student’s readiness from his unreadiness. Moreover, were we really interested in the probability of the event that a student randomly selected from a town is college ready, and had the requisite probability model (e.g., Bernoulli), then there would be nothing to stop the frequentist error statistician from inferring the conditional probability. However, there seems to be nothing “Bayesian” in this relative frequency calculation. Bayesians scarcely have a monopoly on the use of conditional probability! But even here it strikes me as a very odd way to talk about evidence.
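The contrast between a low frequency of error and an ability to discriminate can be illustrated with a toy count. The population size of 100,000 is a hypothetical figure chosen for round numbers; only the one-in-a-thousand rate comes from the example.

```python
population = 100_000          # hypothetical town size, for round numbers
prevalence_ready = 0.001      # Fewready Town rate from the example
n_ready = round(population * prevalence_ready)

# Rule under scrutiny: declare every student "unready", ignoring the scores.
n_errors = n_ready                      # the rule errs exactly on the ready students
error_rate = n_errors / population      # long-run frequency of a wrong declaration

# The same rule never declares anyone ready, so it detects no ready student.
n_ready_detected = 0
detection_rate = n_ready_detected / n_ready

print(f"error rate = {error_rate}, detection rate = {detection_rate}")
```

The rule is wrong only once in a thousand declarations, yet its chance of finding a ready student to be ready is zero.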

Bayesian statisticians have analogous versions of this criticism, discussed in my April 28 blogpost: error probabilities (associated with inferences to hypotheses) may conflict with chosen posterior probabilities in hypotheses.

*z “B-boosts” H iff: P(H|z) > P(H). Recommended C-measures vary. I don’t know what counts as a “high” B-boost, and that is a central problem with these measures.
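As a rough illustration of why "high" is hard to pin down, the sketch below applies the B-boost condition to the Isaac numbers and then evaluates two confirmation measures often discussed in the literature, the difference and the ratio of posterior to prior. Choosing these two is an assumption made here for illustration; the post does not say which C-measures are recommended.

```python
def b_boosts(prior, posterior):
    """z B-boosts H iff P(H | z) > P(H)."""
    return posterior > prior

def difference_measure(prior, posterior):
    """One candidate C-measure: P(H | z) - P(H)."""
    return posterior - prior

def ratio_measure(prior, posterior):
    """Another candidate C-measure: P(H | z) / P(H)."""
    return posterior / prior

# Isaac example, with P(s | H(I)) taken as exactly 1 (an assumption).
prior = 0.001
post = 1.0 * prior / (1.0 * prior + 0.05 * (1.0 - prior))

print(b_boosts(prior, post))
print(f"difference = {difference_measure(prior, post):.4f}")
print(f"ratio      = {ratio_measure(prior, post):.1f}")
```

The same boost looks negligible on the difference scale (under .02) and substantial on the ratio scale (close to 20), which is one way to see that the size of a B-boost depends heavily on the measure chosen.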

For a formal statistical analogue, see this post.

**References:**

Achinstein, P. (2001), *The Book of Evidence*, Oxford: Oxford University Press.

Achinstein, P. (2010), “Mill’s Sins or Mayo’s Errors?”, pp. 170-188 in D. G. Mayo and A. Spanos (eds.), *Error and Inference: Recent Exchanges on Experimental Reasoning, Reliability, and the Objectivity and Rationality of Science*, Cambridge: Cambridge University Press.

Achinstein, P. (2011), “Achinstein Replies,” pp. 258-98 in G. Morgan (ed.), *Philosophy of Science Matters: The Philosophy of Peter Achinstein*, Oxford: Oxford University Press.

Howson, C. (1997a), “A Logic of Induction,” *Philosophy of Science* 64, 268–90.

Howson, C. (1997b), “Error Probabilities in Error,” *Philosophy of Science* 64(4), 194.

Mayo, D. G. (1997a), “Response to Howson and Laudan,” *Philosophy of Science* 64, 323-333.

Mayo, D. G. (1997b), “Error Statistics and Learning from Error: Making a Virtue of Necessity,” in L. Darden (ed.), *Supplemental Issue PSA 1996: Symposia Papers, Philosophy of Science* 64, S195-S212.

Mayo, D. G. (2005), “Evidence as Passing Severe Tests: Highly Probed vs. Highly Proved,” pp. 95-127 in P. Achinstein (ed.), *Scientific Evidence*, Johns Hopkins University Press.

[i] e.g., Howson 1997a, b; Achinstein 2001, 2010, 2011.

[ii] Peter Achinstein is Professor of Philosophy at Johns Hopkins University. Among his many publications, he is the author of: The Concept of Evidence (1983); Particles and Waves: Historical Essays in the Philosophy of Science (1991) for which he received the prestigious Lakatos Prize in 1993; and The Book of Evidence (2001).

[iii] I think Peter and I have finally put this particular example to rest at a workshop I held here in April 2011, with grad students from my philosophy of science seminar. When a student inquired as to where we now stood on the example, toward the end of the workshop, my response was to declare, with relief, that Isaac had graduated from college (NYU)! Peter’s response dealt with the movie “Stand and Deliver!” (where I guess reverse discrimination was warranted for a time.)

Added Oct 26, 2013: Moreover, Peter and I concur that evidence is a “threshold” concept.

Right off the bat, something important has been omitted. How did Isaac come to take the test? More than likely, he took it because he believed he was college-ready and also was interested in getting accepted by some college. So Isaac – very likely – wasn’t selected randomly from the set of the town’s high school students. The stereotyped Bayesian response here ought to have included this likelihood in its prior.

If this turns out not to apply, then we would have the lottery paradox. The probability of hitting one of the Megabucks-style lotteries is so small that no one can expect to win it. Therefore no one should be a winner. Yet most of the time, *someone* does.

Perhaps we should call this the stereotype paradox instead.

And in a fit of unexpected coincidence, it happens that this month’s issue of Scientific American has an article on apparent paradoxes arising from trying to apply small probabilities when large numbers of cases are involved.

Please send the link.

I think there is something different going on here. For one thing, I want to deny that a probability of an event is the right quantitative measure of how well probed hypotheses are. Here, for purposes of argument, I consider an event.

There’s this much of a connection with the lottery paradox: if a low posterior for readiness suffices to declare x is evidence for unreadiness, then in cases like Isaac’s (common in testing for rare diseases) there is no chance of providing evidence against unreadiness–by this method. (A lot of nots, but clear enough I hope.) i.e., he scored high, but low scores would count even more against readiness. Then unreadiness passes a test with minimal severity. The dichotomy is an element of the argument as the critics raise it by the way–it’s obviously too simple-minded.

If you follow the links, you’ll come across an analogous criticism by Howson, and a parallel in formal statistics.

I can tell you first hand that the people in the capitalist trenches absolutely do care about base rates and don’t try to market a medical diagnostic test unless its outcome could make a financial (that is, Bayesian decision-theoretic) difference to the prospective purchasing market. (Recall that the entities on the hook for the purchase price of such diagnostic tests are generally insurance companies.)

Corey: Of course they care about increasing/decreasing the chance of an event occurring. And to evaluate the evidence, they seek to probe (severely) whether or not and how much the product will pay off financially, avoid mistakes, etc. To make decisions, of course, various costs enter, but I wouldn’t say I’m led to a decision theory that is “Bayesian”.

No, I’m saying that they care about positive and/or negative predictive value (PPV/NPV). A test’s severity corresponds to either the test’s specificity or its sensitivity, depending on which hypothesis is being deemed warranted. The fact that PPV and NPV depend on population-dependent base rates as well as the strictly test-related quantities of sensitivity and specificity means that the use of the test can be well-justified for (and the test can be sold to) some populations/markets but not others — even if the severity of the test is the same for all populations.

Do you recognize this as a brute fact? If so, how do you square said fact with the severity principle?
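The dependence of PPV and NPV on the base rate described in this comment can be sketched numerically. The sensitivity and specificity of 0.95 and the two prevalences are hypothetical values chosen only to show the effect; they are not figures for any actual diagnostic test.

```python
def ppv(sensitivity, specificity, prevalence):
    """Positive predictive value: P(disease | positive result)."""
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)

def npv(sensitivity, specificity, prevalence):
    """Negative predictive value: P(no disease | negative result)."""
    true_neg = specificity * (1.0 - prevalence)
    false_neg = (1.0 - sensitivity) * prevalence
    return true_neg / (true_neg + false_neg)

sens, spec = 0.95, 0.95                 # hypothetical test characteristics
prevalences = {"high-prevalence market": 0.20, "low-prevalence market": 0.001}

results = {
    name: (ppv(sens, spec, prev), npv(sens, spec, prev))
    for name, prev in prevalences.items()
}
for name, (p, n) in results.items():
    print(f"{name}: PPV = {p:.3f}, NPV = {n:.5f}")
```

With the test's own characteristics held fixed, the PPV in this sketch drops from above .8 to below .02 as the assumed prevalence falls, which is the population dependence the comment points to.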

Corey: I do not think that is quite right. Since severity must be assigned to a result + the test with all its performance characteristics, then it does not use sens/spec to the exclusion of PPV/NPV. Both support a severity argument. The Isaac example is too simplistic to really explore that, and does not deal with how Isaac came to take the test. This latter piece precludes any meaningful estimate of PPV.

I agree the college entrance exam example is not the best test bed — that’s why I’m focusing on the medical diagnosis setting.

You say that PPV/NPV can support a severity argument, implicitly arguing for the relevance of base rates in the interpretation of test results. How would you defend yourself from the charge that such an account would commit fallacious probabilistic instantiation?

The severity is for warranting the probability model from which the probabilities of events, including priors, are obtained. For a discussion (short) on what I say for cases where one is interested in such rates, perhaps see Mayo, D. G. (1997b), which is linked to my article. Yes, that’s what we face with evidence-based policy, and it’s understandable. I remember when Howson first raised this criticism (on which Achinstein based his) I was visiting with Erich Lehmann and discussing my responses (to Howson) with him. He was horrified at how people confused a legitimate frequentist prior for a statistical hypothesis H, with the frequency with which an event “selecting a true H” occurs in a specified experiment consisting of randomly selecting from a population. He thought it was the wrong way to evaluate an individual, be it a patient or student (his wife was at the Educational Testing Service in Princeton at the time). So much the worse for appraising hypotheses. When people start doing this for hypotheses in science—all the rage these days—we get into the business of my “Trouble with ‘Trouble in the Lab’”

https://errorstatistics.com/2013/11/09/beware-of-questionable-front-page-articles-warning-you-to-beware-of-questionable-front-page-articles-i/

“The severity is for warranting the probability model from which the probabilities of events, including priors, are obtained.”

Utterly confused now. Here’s what I want to know: Alisa is a member of a population in which, say, tuberculosis is endemic. Bob is not. Both Alisa and Bob present at their local hospitals with signs of respiratory illness and are given the same highly sensitive and specific blood test for active TB infection. What is the SEV calculation for each person, for each of the two possible test results?

In particular, suppose that Alisa’s and Bob’s tests show the same result. Is the conclusion indicated by the tests equally well-warranted in both cases? Why or why not?

First, I would direct you to the Spanos base-rate fallacy paper. He shows how frequentist models can address these problems, and the severity calculations follow naturally (enough). As to the specific case of Alisa and Bob, then let us suppose there exists a test with associated (freq) statistical model for Alisa’s population, so they perform the test, no controversy. As to Bob, there might not be a test developed for his population.

If not, then you can apply the test used by Alisa’s clinic, but you are breaking with the statistical assumptions with regard to the sample the model was meant to be applicable to. Thus, the performance metrics projected for the model, and the error statistics, might be off by an uncertain magnitude and in an uncertain direction. And likewise for a severity calculation. Spanos emphasizes the nature of frequentist statistics, which includes sampling considerations and error measurements that relate to method protocol and model performance, not base rates in the population. This seemed a subtle distinction to me at first, but it is actually fundamental.

john byrd: No, no, you don’t understand. In this scenario, the same company developed the test and validated its sensitivity and specificity in multiple potential markets. By assumption, the only substantive difference between Alisa and Bob is that Alisa is known to come from a population in which TB is endemic and Bob is known to not come from such a population.

There are many ways to slice up the populations from which individuals come (it’s not just philosophers who disagree on how to solve the “reference class” problem). You will get very different answers, especially if you select the class post hoc.

Also, I might note, we would not want to consider merely whether s/he has it or not, but rather degrees (e.g., from lesser to greater indications). I discuss this briefly in the Howson response.

I think the scenario is growing. If the test is appropriate for Bob’s population then apply it to him. Take it as seriously as the error statistics warrant. You must believe it is possible for him to have the disease or the test would not be applied. As to his population frequency, if you thought that was all you need to know you would not apply the test. If the test is applicable to the population, you should trust the error estimates from the stat model. Keep in mind that perhaps Bob just returned from a dream vacation to Phuket, or maybe he is hosting an international student in his home. Just because his community does not see the disease historically does not mean it cannot come to call. Your use of base rates could kill people, if I understand you correctly.

Mayo and john byrd: I can’t help but feel that there’s an awful lot of squirming going on here to avoid getting down to brass tacks. Yes, in real scenarios, we’d have to be concerned about things like trips to regions where TB is endemic. This isn’t a real scenario — I want to discuss a hypothetical situation in which, somehow, we know that the only substantive difference between the two people being tested is that one is an IV drug user, and hence has one more risk factor for TB. Symptoms? Identical. Socioeconomic status? Identical. Test result? Identical. Performance characteristics of the test? Identical. Etc.

What, if anything, is the difference in the SEV calculation (for the warrant accorded to the disease status indicated by the test result) induced by twiddling *this* *one* *detail*? Consider all other requisite details to be filled in as needed.

Corey: No squirming here. And the scenario has expanded yet again. You do not seem to grasp the relationships between reference data, model, model assumptions, and model performance. I think the Spanos base rate paper clarifies these. The tweaks you make to the scenario affect the requirements for reference data. That is how the information enters the process of inference. Severity calculations follow from the development of the statistical models and have meaning only to the extent that the reference data is appropriate.

john byrd: I think if you review my specifications, you’ll find that I haven’t been “expanding” the scenario so much as clarifying the question in which I am interested in the face of responses which fail to address it. (I don’t have access to the Spanos paper, by the way.)

Anyway, the question as I have most recently posed it does specify that you are to fill in any other necessary details as needed. (That’s why I consider your most recent answer as *still squirming*.) You can, e.g., consider that reference data have been obtained and that well-warranted estimates of the ROC curve are available under all conditions in which the test will be used.

The severity principle purports to give general conditions under which a particular conclusion can be considered well-warranted following a fallible test. It seems like it ought to be relevant to a doctor attempting to interpret a TB blood test result! Why can I not get a straight answer about the SEV calculation in my scenario?

Corey: Every time you have expanded the scenario, it was by adding additional bits of relevant circumstances. These affect our appraisal of the appropriate reference class for a frequentist model. As to severity, I am not giving an example because Spanos gives a detailed example with discussion that will explain this better than I can on the blog. I wonder how I can get a copy of the paper to you?

My email address: firstname dot lastname at gmail dot com

I suppose I would say that Sens/Spec and PPV/NPV can be applied to any method, including those that do not make a Bayesian type use of base rates. These concepts were developed to control (in the accounting sense) error when applying test methods. PPV is applicable to a result and is available only when model performance has been checked and measured. I would not agree that a naive use of base rates, as in Isaac’s case, is a basis for a PPV value. Performance metrics for an appropriate statistical model and associated decision rule could be the basis for a valid PPV, and could be relevant to severity.

Can we please stop talking about Isaac? As far as I know, he’s long since graduated, so let’s focus on the medical testing setting.

“I suppose I would say that Sens/Spec and PPV/NPV can be applied to any method, including those that do not make a Bayesian type use of base rates.”

My point is that PPV and NPV make a Bayesian-type use of base rates. I’m trying to figure out if and how a change in the base rate (a.k.a. prevalence) affects what the severity principle has to say about the warrant provided by a positive (resp. negative) test result to the hypothesis that the tested individual has (resp. doesn’t have) the disease.

Base rates, as I think you use the term, reference the make-up of the population. You use them as the basis for the probabilities in your model (though they are estimates). Model performance is not the same thing, as it includes how samples are taken, the model used with all its assumptions, and the protocol for interpreting a result. In other words, the expectation from applying a test method correctly. The difference becomes clear when you consider whether or not you should view a patient who walks into the clinic because they are concerned about symptoms. The base rate of a suspected disease in the population should not be relevant to the testing that is done. This patient was not selected randomly from the population as part of the test protocol.

“This patient was not selected randomly from the population as part of the test protocol.”

So in my Alisa/Bob scenario, is there simply no SEV calculation at all?

Here’s the Scientific American link –

http://www.scientificamerican.com/article/math-explains-likely-long-shots-miracles-and-winning-the-lottery/

Mayo said: “There’s this much of a connection with the lottery paradox: if a low posterior for readiness suffices to declare x is evidence for unreadiness, then in cases like Isaac’s (common in testing for rare diseases) there is no chance of providing evidence against unreadiness–by this method.”

Right – nearly the same as saying “my mind is made up, so you can’t convince me no matter what you say.”

…which is a trap that will be very hard to climb out of if we make such reasoning commonplace.

I think there are a few things missing from this discussion.

First there’s the cost of type I and type II errors. The posterior predictive probability of a rare disease may be low, but that may be sufficient to trigger action. The probability of being correct isn’t the only thing that matters.
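A minimal expected-cost sketch shows how a low probability can still trigger action. All three numbers below are hypothetical, chosen only to make the asymmetry visible; nothing in the discussion fixes the actual costs.

```python
def expected_costs(p_disease, cost_missed_case, cost_false_alarm):
    """Expected cost of acting vs. not acting, given P(disease | result)."""
    cost_if_action = (1.0 - p_disease) * cost_false_alarm   # acted, but no disease
    cost_if_no_action = p_disease * cost_missed_case        # did nothing, disease present
    return cost_if_action, cost_if_no_action

p_disease = 0.02            # hypothetical low posterior after a positive test
cost_missed_case = 1000.0   # hypothetical cost of leaving a true case untreated
cost_false_alarm = 10.0     # hypothetical cost of needless follow-up

act_cost, wait_cost = expected_costs(p_disease, cost_missed_case, cost_false_alarm)
should_act = act_cost < wait_cost

print(f"act: {act_cost:.1f}, wait: {wait_cost:.1f}, act? {should_act}")
```

With these assumed costs, acting is the cheaper option even though the disease is present only about one time in fifty.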

Furthermore, isn’t this example similar to what Gelman is calling a one-way-street fallacy? We’re worried about undiagnosed diseases, but shouldn’t we also be worried about overdiagnosed rare diseases if we followed the recommendation of looking at evidence in isolation?

Finally, what’s missing in this analysis is that it looks at the problem too narrowly. I actually don’t think this problem is as simple as applying a posterior probability. The reason is not because of a problem with posterior probabilities or Bayes’ theorem, but rather that there’s a society-wide utility that is being ignored in this individual-level calculation. Using a Bayesian decision-theoretic approach will improve statistical efficiency in predicting readiness (from a frequentist PPV/NPV perspective), but it is detrimental with respect to the collective effect on society as a whole. Mass disenfranchisement has consequences that are outside the cost function considered here, but nonetheless matter.

The decision-theoretic setting, obviously important in many of these settings, differs from a hypothesis assessment. I went along with considering “events” because every one of these criticisms (of severity) is based on examples of events. I turned the event into a hypothesis (along the lines of an example in Neyman) for purposes of engaging the criticism. But then its frequentist prior differs from the prior based on rates in one or another reference class.

I missed the entry of mass disenfranchisement–is this because Isaac from Fewready is deemed guilty (unready) by association?

“…unreadiness, then in cases like Isaac’s (common in testing for rare diseases) there is no chance of providing evidence against unreadiness–by this method.

Right – nearly the same as saying ‘my mind is made up, so you can’t convince me no matter what you say.’”

More like “extraordinary claims require extraordinary evidence”. If weak evidence results in the study being too underpowered to make an extraordinary claim, don’t blame Bayes’ theorem.

Again, this is with the caveat that you can factor in cost/benefit in deciding a course of action. And as I mention below, individual level PPV/NPV performance isn’t the only thing being optimized in educational testing.

I didn’t explain the point about disenfranchisement well.

Within some purely descriptive and predictive contexts, base rates should be incorporated as part of regularizing estimates (I’m including class assignment as a form of categorical estimation here).

However, in the context of things like education testing and law I think there are issues that make this problematic. The problem is that there’s an aggregate utility for society as a whole that has to be considered. If any base-rate categorization goes, you’ll soon get the Steve Sailers of the world clamoring to regularize test scores and court decisions on the basis of race, ethnicity, and gender. Even _if_ you could improve NPV/PPV for an individual prediction, the net cost to society as a whole is unacceptable.