Comedy hour at the Bayesian (epistemology) retreat: highly probable vs highly probed (vs what ?)

Our favorite high school student, Isaac, gets a better shot at showing his college readiness using one of the comparative measures of support or confirmation discussed last week. Their assessment thus seems more in sync with the severe tester, but they are not purporting that z is evidence for inferring (or even believing) an H to which z affords a high B-boost*. Their measures identify a third category that reflects the degree to which H would predict z (where the comparison might be predicting without z, or under ~H or the like), at least if we give it an empirical, rather than a purely logical, reading. Since it’s Saturday night, let’s listen in to one of the comedy hours at the Bayesian retreat as reblogged from May 5, 2012.

Did you hear the one about the frequentist error statistical tester who inferred a hypothesis H passed a stringent test (with data x)?

The problem was, the epistemic probability in H was so low that H couldn’t be believed!  Instead we believe its denial H’!  So, she will infer hypotheses that are simply unbelievable!

So it appears the error statistical testing account fails to serve as an account of knowledge or evidence (i.e., an epistemic account). However severely I might wish to say that a hypothesis H has passed a test, this Bayesian critic assigns a sufficiently low prior probability to H so as to yield a low posterior probability in H[i].  But this is no argument about why this counts in favor of, rather than against, their particular Bayesian computation as an appropriate assessment of the warrant to be accorded to hypothesis H.

To begin with, in order to use techniques for assigning frequentist probabilities to events, their examples invariably involve “hypotheses” that consist of asserting that a sample possesses a characteristic, such as “having a disease” or “being college ready” or, for that matter, “being true.”  This would not necessarily be problematic if it were not for the fact that their criticism requires shifting the probability to the particular sample selected—for example, a student Isaac is college-ready, or this null hypothesis (selected from a pool of nulls) is true.  This was, recall, the fallacious probability assignment that we saw in Berger’s attempt, later (perhaps) disavowed. Also there are just two outcomes, say s and ~s, and no degrees of discrepancy from H.

An example that Peter Achinstein[ii] and I have debated concerns a student, Isaac, who has taken a battery of tests and achieved very high scores, s, something given to be highly improbable for those who are not college ready.[iii] We can write the hypothesis:

H(I): Isaac is college ready.

And let the denial be H’:

H’(I): Isaac is not college ready (i.e., he is deficient).

The probability for such good results, given a student is college ready, is extremely high:

P(s | H(I)) is practically 1,

while very low assuming he is not college ready. In one computation, the probability that Isaac would get such high test results, given that he is not college ready, is .05:

P(s | H’(I)) = .05.

But imagine, continues our critic, that Isaac was randomly selected from the population of students in, let us say, Fewready Town—where college readiness is extremely rare, say one out of one thousand. The critic infers that the prior probability of Isaac’s college-readiness is therefore .001:

(*) P(H(I)) = .001.

If so, then the posterior probability that Isaac is college ready, given his high test results, would be very low:

P(H(I) | s) is very low,

even though the posterior probability has increased from the prior in (*).
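To make the critic's arithmetic concrete, here is a minimal Python sketch of the Bayes' theorem computation using the numbers given above. It rests on one assumption not in the text: "practically 1" is rounded to exactly 1. The function and variable names are illustrative only.

```python
def posterior(prior, p_s_given_h, p_s_given_not_h):
    """Return P(H | s) by Bayes' theorem for the partition {H, H'}."""
    joint_h = prior * p_s_given_h
    joint_not_h = (1.0 - prior) * p_s_given_not_h
    return joint_h / (joint_h + joint_not_h)


PRIOR_READY = 0.001        # (*) the critic's prior for Fewready Town
P_S_GIVEN_READY = 1.0      # assumption: "practically 1" rounded to exactly 1
P_S_GIVEN_UNREADY = 0.05   # P(s | H'(I)) as given in the text

post = posterior(PRIOR_READY, P_S_GIVEN_READY, P_S_GIVEN_UNREADY)
boost_ratio = post / PRIOR_READY  # illustrative summary: posterior over prior

print(f"P(H(I) | s) = {post:.4f}")
print(f"posterior / prior = {boost_ratio:.1f}")
```

Under these numbers the posterior comes out at about .02: roughly twenty times the prior, yet still very low.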

This is supposedly problematic for testers because we’d say this was evidence for H(I) (readiness).  Actually I would want degrees of readiness to make my inference, but these are artificially excluded here.

But, even granting his numbers, the main flaw here is fallacious probabilistic instantiation. Although the probability that a student randomly selected from high schoolers in Fewready Town is college ready is .001, it does not follow that Isaac, the one we happened to select, has a probability of .001 of being college ready (Mayo 1997, 2005, 117).

Achinstein (2010, 187) says he will grant the fallacy…but only for frequentists:

“My response to the probabilistic fallacy charge is to say that it would be true if the probabilities in question were construed as relative frequencies. However, … I am concerned with epistemic probability.”

He is prepared to grant the following instantiations:

1. p% of the hypotheses in a given pool of hypotheses are true (or a character holds for p%).
2. The particular hypothesis Hi was randomly selected from this pool.
3. Therefore, the objective epistemic probability P(Hi is true) = p.

Of course, epistemic probabilists are free to endorse this road to posteriors—this just being a matter of analytic definition.  But the consequences speak loudly against the desirability of doing so.

No Severity. The example considers only two outcomes: reaching the high scores s, or reaching lower scores, ~s. Clearly a lower grade gives even less evidence of readiness; that is, P(H’(I)| ~s) > P(H’(I)|s). Therefore, whether Isaac scored as high as s or lower, ~s, the epistemic probabilist is justified in having high belief that Isaac is not ready. Even if he claims he is merely blocking evidence for Isaac’s readiness (and not saying he believes highly in his unreadiness), the analysis is open to problems: the probability of finding evidence of Isaac’s readiness even if in fact he is ready (H(I) is true) is low if not zero. Other Bayesians might interpret things differently, noting that since the posterior for readiness has increased, the test scores provide at least some evidence for H(I)—but then the invocation of the example to demonstrate a conflict between a frequentist and Bayesian assessment would seem to diminish or evaporate.
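The "No Severity" point can be checked numerically. The sketch below is only an illustration: it again rounds P(s | H(I)) to exactly 1, and it introduces a hypothetical cutoff of 0.5 for when a posterior would count as evidence of readiness (the example itself fixes no such cutoff).

```python
PRIOR_READY = 0.001  # the critic's prior for Fewready Town

# P(outcome | hypothesis); "practically 1" is rounded to 1 (assumption)
LIKELIHOODS = {
    "s":     {"ready": 1.0, "unready": 0.05},
    "not_s": {"ready": 0.0, "unready": 0.95},
}

EVIDENCE_THRESHOLD = 0.5  # hypothetical cutoff, not from the text


def posterior_ready(outcome):
    """Return P(H(I) | outcome) by Bayes' theorem."""
    ready = PRIOR_READY * LIKELIHOODS[outcome]["ready"]
    unready = (1.0 - PRIOR_READY) * LIKELIHOODS[outcome]["unready"]
    return ready / (ready + unready)


posteriors = {outcome: posterior_ready(outcome) for outcome in LIKELIHOODS}

# Probability, given that Isaac IS ready, of observing an outcome whose
# posterior would clear the cutoff and so count as evidence of readiness.
p_evidence_if_ready = sum(
    LIKELIHOODS[outcome]["ready"]
    for outcome in LIKELIHOODS
    if posteriors[outcome] > EVIDENCE_THRESHOLD
)

for outcome, value in posteriors.items():
    print(f"P(ready | {outcome}) = {value:.4f}, P(unready | {outcome}) = {1 - value:.4f}")
print(f"P(procedure reports evidence of readiness | ready) = {p_evidence_if_ready}")
```

Neither outcome clears the cutoff, so under these assumptions the procedure has zero probability of reporting evidence of readiness even when Isaac is ready.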

Reverse Discrimination?  To push the problem further, suppose that the epistemic probabilist receives a report that Isaac was in fact selected randomly, not from Fewready Town, but from a population where college readiness is common, Fewdeficient Town. The same scores s now warrant the assignment of a strong objective epistemic belief in Isaac’s readiness (i.e., H(I)). A high-school student from Fewready Town would need to have scored quite a bit higher on these same tests than a student selected from Fewdeficient Town for his scores to be considered evidence of his readiness. (Reverse discrimination?) When we move from hypotheses like “Isaac is college ready” to scientific generalizations, the difficulties become even more serious.
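A rough sketch of the "reverse discrimination" arithmetic follows. Two numbers here are hypothetical and not from the text: the Fewdeficient Town prior is set to .999 (mirroring Fewready Town), and a posterior of 0.5 is used as the target a student's scores must reach. As before, P(s | H(I)) is rounded to 1.

```python
def posterior(prior, p_s_given_h, p_s_given_not_h):
    """Return P(H | s) by Bayes' theorem for the partition {H, H'}."""
    joint_h = prior * p_s_given_h
    joint_not_h = (1.0 - prior) * p_s_given_not_h
    return joint_h / (joint_h + joint_not_h)


def max_false_positive_rate(prior, p_s_given_h, target_posterior):
    """Largest P(s | H') for which P(H | s) still reaches target_posterior."""
    return (prior * p_s_given_h * (1.0 - target_posterior)
            / ((1.0 - prior) * target_posterior))


PRIOR_FEWREADY = 0.001
PRIOR_FEWDEFICIENT = 0.999   # hypothetical mirror-image prior
P_S_GIVEN_READY = 1.0        # assumption: "practically 1" rounded to 1
P_S_GIVEN_UNREADY = 0.05
TARGET_POSTERIOR = 0.5       # hypothetical target, not from the text

post_fewready = posterior(PRIOR_FEWREADY, P_S_GIVEN_READY, P_S_GIVEN_UNREADY)
post_fewdeficient = posterior(PRIOR_FEWDEFICIENT, P_S_GIVEN_READY, P_S_GIVEN_UNREADY)
needed_rate = max_false_positive_rate(PRIOR_FEWREADY, P_S_GIVEN_READY, TARGET_POSTERIOR)

print(f"Same scores s, Fewready Town:     P(ready | s) = {post_fewready:.4f}")
print(f"Same scores s, Fewdeficient Town: P(ready | s) = {post_fewdeficient:.5f}")
print(f"P(s | unready) a Fewready student needs: {needed_rate:.6f}")
```

With these inputs, the Fewready student would need scores that only about one in a thousand unready students could reach, rather than one in twenty, before the same analysis credits him with even a 0.5 posterior.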

We need not preclude that H(I) has a legitimate frequentist prior; the frequentist probability that Isaac is college ready might refer to genetic and environmental factors that determine the chance of his deficiency, although I do not have a clue how one might compute it. The main thing is that this probability is not given by the probabilistic instantiation above.

These examples, repeatedly used in criticisms, invariably shift the meaning from one kind of experimental outcome—a randomly selected student has the property “college ready”—to another—a genetic and environmental “experiment” concerning Isaac in which the outcomes are ready or not ready.

This also points out the flaw in trying to glean reasons for epistemic belief with just any conception of “low frequency of error.” If we declared each student from Fewready to be “unready,” we would rarely be wrong, but in each case the “test” has failed to discriminate the particular student’s readiness from his unreadiness. Moreover, were we really interested in the probability of the event that a student randomly selected from a town is college ready, and had the requisite probability model (e.g., Bernoulli), then there would be nothing to stop the frequentist error statistician from inferring the conditional probability. However, there seems to be nothing “Bayesian” in this relative frequency calculation. Bayesians scarcely have a monopoly on the use of conditional probability! But even here it strikes me as a very odd way to talk about evidence.
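The contrast between a low frequency of error and the capacity to discriminate can be illustrated with a toy population. This is a sketch under stated assumptions: a town of 1000 students with exactly one ready student (matching the one-in-a-thousand rate), and a hypothetical rule, `always_unready`, that ignores the scores altogether.

```python
POPULATION = 1000
READY_COUNT = 1  # one in a thousand, as in Fewready Town


def always_unready(student_is_ready):
    """Hypothetical rule: declare every student unready, whatever the truth."""
    return False  # verdict "ready?" is always False


# True marks a ready student, False an unready one.
students = [True] * READY_COUNT + [False] * (POPULATION - READY_COUNT)
verdicts = [always_unready(student) for student in students]

errors = sum(verdict != truth for verdict, truth in zip(verdicts, students))
error_rate = errors / POPULATION

ready_detected = sum(verdict for verdict, truth in zip(verdicts, students) if truth)
detection_rate = ready_detected / READY_COUNT

print(f"Overall error rate of the rule: {error_rate}")
print(f"Probability a ready student is declared ready: {detection_rate}")
```

The rule errs only once in a thousand cases, yet it has no chance of recognizing the one ready student, which is the sense in which it fails to discriminate.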

Bayesian statisticians have analogous versions of this criticism, discussed in my April 28 blogpost: error probabilities (associated with inferences to hypotheses) may conflict with chosen posterior probabilities in hypotheses.

*z “B-boosts” H iff: P(H|z) > P(H). Recommended C-measures vary. I don’t know what counts as a “high” B-boost, and that is a central problem with these measures.

References:

Achinstein, P. (2001), The Book of Evidence, Oxford: Oxford University Press.

Achinstein, P. (2010), “Mill’s Sins or Mayo’s Errors?”, pp. 170-188 in D. G. Mayo and A. Spanos (eds.), Error and Inference: Recent Exchanges on Experimental Reasoning, Reliability, and the Objectivity and Rationality of Science, Cambridge: Cambridge University Press.

Achinstein, P. (2011), “Achinstein Replies”, pp. 258-98 in G. Morgan (ed.), Philosophy of Science Matters: The Philosophy of Peter Achinstein, Oxford: Oxford University Press.

Howson, C. (1997a), “A Logic of Induction”, Philosophy of Science 64, 268-90.

Howson, C. (1997b), “Error Probabilities in Error”, Philosophy of Science 64(4), 194.

Mayo, D. G. (1997a), “Response to Howson and Laudan”, Philosophy of Science 64, 323-333.

Mayo, D. G. (1997b), “Error Statistics and Learning from Error: Making a Virtue of Necessity”, in L. Darden (ed.), Supplemental Issue PSA 1996: Symposia Papers, Philosophy of Science 64, S195-S212.

Mayo, D. G. (2005), “Evidence as Passing Severe Tests: Highly Probable versus Highly Probed Hypotheses”, pp. 95-127 in P. Achinstein (ed.), Scientific Evidence, Johns Hopkins University Press.

[i] e.g., Howson 1997a, b; Achinstein 2001, 2010, 2011.

[ii] Peter Achinstein is Professor of Philosophy at Johns Hopkins University. Among his many publications, he is the author of: The Concept of Evidence (1983); Particles and Waves: Historical Essays in the Philosophy of Science (1991) for which he received the prestigious Lakatos Prize in 1993; and The Book of Evidence (2001).

[iii] I think Peter and I have finally put this particular example to rest at a workshop I held here in April 2011, with grad students from my philosophy of science seminar. When a student inquired as to where we now stood on the example, toward the end of the workshop, my response was to declare, with relief, that Isaac had graduated from college (NYU)! Peter’s response dealt with the movie “Stand and Deliver!” (where I guess reverse discrimination was warranted for a time.)

Added Oct 26, 2013: Moreover, Peter and I concur that evidence is a “threshold” concept.

20 thoughts on “Comedy hour at the Bayesian (epistemology) retreat: highly probable vs highly probed (vs what ?)”

1. “[…]it does not follow that Isaac, the one we happened to select, has a probability of .001 of being college ready”. But (assuming college readiness is a binary state) either he is college ready or he is not, he doesn’t really *have* a probability, right? The point is that we don’t know whether he is ready or not, we just know that a college student “sampled” like Isaac has a 1/1000 chance of being ready. This is a really small chance, and a test that is as insensitive as the one we are giving Isaac just doesn’t give enough evidence for us to believe he is college ready.

I sort of fail to see how the scenario you describe results in a fallacy 🙂

If we on the other hand wanted a test that is fair, we just assign all students an equal prior probability to be college ready, right?

• Rasmusab: I’m distinguishing between whether the high scores provide evidence of his readiness (I’d always make the inference to a particular extent, but am going along with the dichotomy in the example), as opposed to the probability that a randomly drawn student from Fewready town exhibits the property of “college readiness” (like selecting a ball from an urn with p% red).

• Phil Koop

I think that Rasmusab is interpreting the problem along the lines of the Bernoulli model you mention near the end of your post. That is, if high scores really yield a 5% false positive rate, then obviously it is more likely that we have found a false positive than the 1 in a thousand college-ready inhabitant of Fewtown.

I reckon this is because you have not clearly explained what distinguishes the epistemic approach from the error-probe approach in this example: “there seems to be nothing “Bayesian” in this relative frequency calculation.”

Exactly. Applying a test of severity 95% to a base rate of 0.1% seems to be the frequentist analog of a Bayesian who uses an improper prior. Is this really what you advocate?

What exactly do you object to in Achinstein’s instantiations? They are exactly the ones that any professor would use when giving an example of prosecutor’s fallacy (or base rate fallacy or whatever you like to call it) to an introductory undergraduate class, regardless of whether her personal philosophy were frequentist, Bayesian, epistemic, logical, propensity or whatever.

2. Anon

Mayo,

But how do you determine that 5% of type II error is small enough for the Fewready?

You just assumed that 5% is good enough evidence and, hence, the contradiction with the Bayesian view.

It could very well be that for this situation 5% is not enough evidence.

Let’s assume that I am the college owner and that it is very costly for me to get bad students.

If I do know that Isaac is from Fewready, I could demand a more stringent test so not to risk an error.

Aside from the moral consideration here**, nothing stops me, the college owner, from deciding what risks I want to assume.

And nothing stops me from deciding that for Fewready students I demand a very stringent test, if a bad decision is costly.

Best regards

**We are not dealing with morals here, so I think your example is not really good, because it tries to make “Bayesians” look “immoral” for rejecting Isaac.

• john byrd

Anon: You need to refute the idea that there is a “fallacious probabilistic instantiation” before your point has merit. It seems that Isaac is either college ready or he is not. How can you presume the probability he is college ready is 0.001? In practical applications, this makes no sense. (It is also why racial profiling is unsound, not just unpleasant to think about.)

• I’m trying to understand this notion of “fallacious probabilistic instantiation”; I have never encountered it in the statistical literature.

Suppose I have a container with 1000 balls in it, 10 of which are red and the rest of which are blue. I think everyone would agree that if I select a ball at random, the probability that it is red is 1 in 100. So now I select a ball at random, but I don’t look at it — I cover it instead.

Now I’ve selected the ball, so its color is “non-random”, even though it’s unknown to me. Is it a “fallacious probabilistic instantiation” to say that the probability that I selected a red ball is 1 in 100?

• Corey: The fallacy (which I named) really arises from the cases where it is alleged that if I randomly sample hypotheses from an urn with p% true hypotheses, and obtain specific hypothesis h’, then h’ has a frequentist probability of p. There’s a perfectly legit probability of an outcome, but that differs from P(h’), for a frequentist at least. Colin Howson (in the refs to this article), for example, argues that NP methods, e.g., confidence intervals, are “unsound” because he thinks a particular fixed interval estimate, generated by a .95 interval estimator, has or ought to have a frequentist probability of .95. Achinstein agrees with me that they do not but allows the epistemic probabilist to instantiate. But there is a slippery slide between the probability of an event (e.g., plucking a true hypothesis from the urn), and the probability that the hypothesis plucked is true. Certainly it’s a fallacious slide for a frequentist, but I fail to see why anyone would say if a specific h were plucked from an urn with 95% true hypotheses, then ITS “degree of epistemic probability” or belief-worthiness ought to be .95. If h came from a different urn, with a different % of true h’s, it gets a different evidential weight, if plucked.

• anon

I know that people from Fewready are less likely to be ready, and I am not willing to take a chance, so I will demand a more stringent test.

This example is not good, because it involves moral issues of justice. But the rationale is all right. And people actually do that.

Let’s say that Isaac is a job applicant. And that he graduated from a college in which 99% of the students are technically poor. He may have very well achieved a very high grade in my application test, but I will definitely take into account in my decision that he has graduated from that college, even though he may be the one in a hundred students of that college that is not bad. And it would be irrational for me to disregard this information when hiring.

Why can’t we be ok with p=5% in some hypothesis and for others we demand p=0.00000001%? This is perfectly fine and it actually happens a lot in science.

• Anon: The point is that precisely this type of example is raised as demonstrating the unsoundness (their word) of the error statistical approach and similar approaches. If you look just at my response to Howson in either of the responses in the references (links included), you’ll see what I mean. Howson’s example actually gives a 0 type 1 error, with the null being unreadiness (I used Mary there, in other examples it was a disease absent or present). Clearly setting a very low alpha level means we want erroneous rejections of readiness to be very improbable. By insisting that what we want is the posterior, Howson will assign a high posterior belief to Isaac’s UNreadiness, despite his high scores s. I don’t say the computations are wrong, but I deny that is what we are after in evaluating the evidence for the given hypothesis or the given student. Yet he is saying that I must want that, on pain of unsoundness…I emphasize that he’s raising the criticism of me.

• anon

“but I deny that is what we are after in evaluating the evidence for the given hypothesis or the given student.”

But when you say this, it seems that evaluating evidence is something that has the same meaning despite the prior knowledge of the subject or the costs involved. It seems to me that evidence should be assessed considering costs and prior knowledge.

Let’s change the example again:

Let’s suppose I open a college in a city called Fewready, and by the local law only people from Fewready can apply to the college, so there is no discrimination involved (of course, this is not necessary, I’m using it only to avoid moral issues here). And I know that very few people from Fewready are ready for college. And it is very costly for me to make wrong calls. So I will demand a very stringent test to assess readiness.

Let’s suppose I open a college in a city called Fewdeficient, and by the local law only people from Fewdeficient can apply to the college, so there is no discrimination involved. And I know that almost all people from Fewdeficient are ready for college. And it is very costly for me to apply stringent tests. So I will make an easier test to assess readiness.

I have used both prior knowledge and costs to assess evidence, and actually to design the best tests, and to me this seems fine. Actually, it seems the right way to do it. For me, it would not make sense to say that both colleges should assess college readiness evidence with the same stringency or in the same manner. Why would they?

I may say that, if what you are trying to do is separate two different things, “well-testedness” from “overall evidence”, then you are indeed right.

In our example above, people who pass the test in Fewready were subjected to a more stringent test and thus their ability, we can definitely say, has been put to more stringent scrutiny. And this is one thing worth acknowledging. And I think you are right to point this out. That “theory” has passed a more severe test than the other.

But the ambiguity here happens when we want to assess the “overall evidence”. Even if Isaac, who got into college in Fewready, has been put to a more stringent test, the “overall evidence” about college readiness is more favorable to, say, Jack, who got into college in Fewdeficient. So the “overall evidence” is still in favor of the other theory.

Now, the problem here is how reliable the prior evidence is. And, of course, if the prior evidence is not reliable, maybe the conclusion is not either (but it could be robust to different plausible priors). But that also can happen to the “test”. If the assumptions to derive the error probabilities were wrong, the conclusions could also be wrong (but they could also be robust to different deviations).

• john byrd

I do not think that is the issue. How would you rate the severity of the test of hypothesis H’ = Bubba, born and raised in Fewready, is not college ready after he fails your more rigorous test? I understand where your posterior will be, but what do you think of the severity of the overall test?

• anon

The test is very severe, because if he were ready, he would have passed the test almost surely. That is, in Mayo’s words: P(s | H(I)) is practically 1.

So the failure to reject here is a severe test of not readiness and combining this with the prior evidence, we would be almost sure that who has not passed the test in Fewready is, indeed, not ready.

• anon: it’s important to see that SEV is not a mere likelihood ratio, it’s an error probability. Even playing with an artificial example like this (which I do for sake of argument with Achinstein, Howson), I’d look at degrees or extent of readiness, along the lines of a given scale (hard to pick, but one would have to). We might then set an upper bound on his readiness, if he has gotten lower than a given score, ideally on several tests. The data may “fit” unreadiness (to degree d), but there may be too high a probability of so good a fit (with unreadiness) even if he’s more ready than degree d.

• john byrd

Anon: The severity is near 0 for the inference that Bubba is not college ready after failing the test. You have to see this to understand the error stat reasoning.

• anon

No it is not.

The hypothesis “Bubba is not ready” has passed a severe test insofar as the hypothesis would not have survived the test if it were false.

If “Bubba is not ready” were false, it would mean that “Bubba is ready” is true. And if “Bubba is ready” is true, the probability that we would have had a test that does not accord with “Bubba is not ready” – that is, the probability of getting high grades – would be close to 100%.

So the severity is near 100%, for we would almost surely have had a statistic that accords less to the hypothesis, if it were false.

Now, I would very much like to see why you think the severity would be zero; that does not make any sense.

• john byrd

Anon: Go back to what you said you were doing to the test for the kids from Fewready. You were raising the bar to a very high standard out of concern created by the prior. When you raise the bar for passing, then you increase the proportion of kids who are college ready yet fail the exam. You were taking this risk to mitigate your concerns over the prior.

Now, you have to ask, What is the probability that a college ready kid from Fewready will meet my standards? It is close to 0. This is the severity for the question at issue.

You sacrificed the ability of your system of evaluation to fairly appraise a college ready kid from Fewready solely to preclude ever admitting a kid that is not college ready. And I would say you succeeded in that.

But, when you insist on letting the prior govern how you utilize real evidence, then you create these blind spots for yourself. Most of us want to know that the method that produced the evidence can actually address the question.

3. This is a quick comment, pertaining to Crupi and disconfirmation, which I promised to come back to.