I’m sure I’m not alone in finding it tedious and confusing to search down through 40+ comments to follow the thread of a discussion, as in the last post (“Bad news bears”), especially while traveling as I am (to the 2012 meeting of the Philosophy of Science Association in San Diego; more on that later in the week). So I’m taking a portion of the last round between a reader and me, and placing it here, opening up a new space for comments. (For the full statements, please see the comments.)

(Mayo to Corey*) Cyanabear: … Here’s a query for you: Suppose you have your dreamt-of probabilistic plausibility measure, and think H is a plausible hypothesis, and yet a given analysis has done a terrible job in probing H. Maybe they ignore contrary info, use imprecise tools, or what have you. How do you use your probabilistic measure to convey that you think H is plausible but this evidence is poor grounds for H? Sorry to be dashing…use any example. (*He also goes by Cyan.)

(Corey to Mayo): …Ideally, if I “think H is plausible but this evidence is poor grounds for H,” it’s because I have information warranting that belief. The word “convey” is a bit tricky here. If I’m to communicate the brute fact that I think H is plausible, I’d just state my prior probability for H; likewise, to communicate that I think that the evidence is poor grounds for claiming H, I’d say that the likelihood ratio is 1. But if I’m to *convince* someone of my plausibility assessments, I have to communicate the information that warrants them. (Under certain restrictive assumptions that never hold in practice, other Bayesian agents can treat my posterior distribution as direct evidence. This is Aumann’s agreement theorem.)

New: Mayo to Corey: I’m happy to put aside the agent talk as well as the business of trying to convince. I take it that reporting “the likelihood ratio is 1” conveys roughly that the data have supplied no information as regards H, and one of my big points on this blog is that this does not capture being “a bad test” or “poor evidence”. Recall some of the problems that arose in our recent discussions of ESP experiments (e.g., multiple end points, trying and trying again, ignoring or explaining away disagreements with H, confusing statistical and substantive significance, etc.)

The ESP experiments were quite good for ruling out simple chance as the explanation for the findings, but were nevertheless bad at substantiating ESP as the cause of the results. In this situation I would say that the observed results were just as plausible on the hypothesis of any of {multiple end points, trying and trying again, ignoring or explaining away disagreements with H, outright cheating, some combination, etc., etc.} as they were on the hypothesis of (weak!) ESP.

Corey: My point is that one does not adequately represent poor tests, unwarranted inferences, etc. by reports such as your: “observed results were just as plausible on the hypothesis of any of {multiple end points, trying and trying again, ignoring or explaining away disagreements with H, outright cheating, some combination, etc., etc.} as they were on the hypothesis of (weak!) ESP.” Not increasing one’s favored plausibility measure seems at most to speak to the uninformativeness of the data/test. Of course I want to get at capturing poor tests/unwarranted inferences in general, never minding special aspects of the ESP case.

What else is a poor test but one that provides little information about a hypothesis of interest? What else is an unwarranted inference but one that follows from a misinterpretation of the available information?

A poor test for claim H would be one that has no chance of finding flaws in H, even if they exist. This is very different from merely being uninformative.

I’m trying to imagine what distinction you could possibly be making. For example, in the canards you use to introduce your overheard at the comedy hour post, the straw-man frequentist tests are poor tests exactly because they are uninformative about the hypothesis of interest. The only situation I can come up with where this is not a distinction without a difference is optional stopping.

My position on this situation is more or less the same as that of Andrew Gelman. If I recall correctly, both of us have posted about it before, and neither of us was successful in communicating our positions to you.

Corey: First, the Kadane example of “no test at all” is different* (I’ve added some links), and second, you’re missing my point. On the first, the coin flip, or whatever, has nothing whatsoever to do with the hypothesis H, and so there is no relevant test statistic. Sure, one can fault its power, but that’s not where I’d stymie that example. (Your “appeal to Gelman” doesn’t change anything.)

Second, regarding the issue at hand, the kind of “poor test of H” that occurs when it is predetermined to find evidence for H (by some means or other), thereby violating what I sometimes call a “minimal requirement for evidence” (crossing Popper’s demarcation), is very well and very easily represented by an appeal to error probabilities of the test. It is not well captured merely by employing claims about the probability of H, or so I claim. I think this is at the heart of why we severe testers are so at home with error probabilities of tests. (This feature, incidentally, is not adequately brought out in Normal Deviate’s latest distinctions of frequentist/Bayesian; to be continued.) I’m going to send this aeroblog, even though I have discovered that it’s not recommended for those of us highly prone to nausea and airsickness.
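As a rough illustration of what “an appeal to error probabilities of the test” can look like for one of the practices mentioned earlier (multiple end points), here is a minimal sketch. It assumes, purely for illustration, k independent end points each checked at level 0.05; the function name and the numbers are hypothetical and come from no actual study.

```python
def prob_some_endpoint_significant(k, alpha=0.05):
    """Chance that at least one of k independent end points comes out
    'significant' at level alpha when only chance is operating.
    Independence of the end points is an assumption of this sketch."""
    return 1.0 - (1.0 - alpha) ** k


# The more end points one is free to search through, the closer the
# procedure comes to being predetermined to report "evidence for H".
for k in (1, 5, 20, 100):
    print(k, round(prob_some_endpoint_significant(k), 3))
```

The quantity computed is a property of the search procedure itself, not a probability assigned to H.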

*

http://errorstatistics.com/2011/09/16/getting-it-right-but-for-the-wrong-reason/

http://errorstatistics.com/2011/09/13/in-exile-clinging-to-old-ideas/

“A poor test for claim H would be one that has no chance of finding flaws in H, even if they exist… the issue at hand [is] the kind of “poor test of H” that occurs when it is predetermined to find evidence for H (by some means or other)”

I think I get it. A poor test is one which, for some set of flaws, Pr(H fails T; H holds) = Pr( Pr(H fails T; H has flaw 1) = Pr(H fails T; H has flaw 2) = … = something in the neighborhood of zero. “Uninformative” in the sense of having a likelihood ratio of 1 doesn’t require that Pr(H fails T; H holds) be near zero.

You wanted to see how I would communicate that “a given analysis has done a terrible job in probing H,” with one possible reason being “maybe they… use imprecise tools.” It seems reasonable to interpret this as implying that the test was subject to excessive noise rather than predetermined, so I remain satisfied with my “likelihood ratio = 1” characterization. If I wanted to further communicate that the analysis was poor because it was never able to find some set of flaws in H, then in addition to the likelihood ratio statement I would also communicate “Pr(H fails T; H holds) is near zero”.

(I haven’t been very careful about distinguishing between the likelihood of the data vs. the likelihood of the outcome of some dichotomous test, but it hasn’t caused any misunderstandings in the discussion that I can see — so far.)

Corey: I see some glints of successful communication, but I don’t get the iterated probabilities in your claim, maybe it was unintended? “ A poor test is one which, for some set of flaws, …. Pr( Pr(H fails T; H has flaw 1) = … = something in the neighborhood of zero.”

You then wrote: “If I wanted to further communicate that the analysis was poor because it was never able to find some set of flaws in H, then in addition to the likelihood ratio statement I would also communicate ‘Pr(H fails T; H holds) is near zero’.” But that would not be a criticism. Maybe you mean Pr(H fails T; H is flawed) is near 0? (or “whether or not H holds”?) In any event, there’s still the business of cashing out these intuitions/statements, which is what frequentist sampling theory can readily do. Hopefully you/readers can see* the advantage of an account that goes simply and directly to a critique of an inference (that a flaw is absent) relative to a test T.

*One needn’t preclude, in addition, a “logic of belief” (or the like), if one were so very enamored of having such a thing for different purposes, or as a sum-up of the critique directly obtained by an error statistical analysis (at least for passing severely, but not for passing inseverely).

Corey: I’ll come back to this when I have a minute to read it carefully. Thanks.

“I see some glints of successful communication, but I don’t get the iterated probabilities in your claim, maybe it was unintended?”

Out of curiosity, what price would you pay for a contract that pays out $100 if it was in fact unintended?

Vg jnf havagraqrq — vg’f n pbcl-a-cnfgr reebe.

“But that would not be a criticism. Maybe you mean Pr(H fails T; H is flawed) is near 0? (or “whether or not H holds”?)”

On its own it’s not a criticism. I only need to state that one of the relevant probabilities is near zero; that plus the “likelihood ratio = 1 in some set of flaws F” statement then imply that Pr(H fails T; H has a flaw in F) is near 0. I just picked the probability at the start of the chain of equalities.

“Hopefully you/readers can see* the advantage of an account that goes simply and directly to a critique of an inference (that a flaw is absent) relative to a test T.”

So far, we’ve treated dichotomizing the outcome as a fait accompli, thereby avoiding the sticky issue of tail areas. As a result, we are in apparent agreement about what to communicate, if not how to communicate it. Two things: first, in the dichotomous test scenario, “likelihood ratio = 1” covers both what you’ve called “poor tests” and “not a test at all”. Second, I dispute that tail areas are either direct, or, for the uninitiated, simple.

As I’ve mentioned previously, I’m quite impressed with severity as a lens for making sense of frequentist practice. This is because if dichotomizing the outcome of an experiment is treated as a fait accompli, then severity reasoning plus the Suppean hierarchy of models roughly replicates Jaynes’s notion of probability theory as the logic of science. (Yes, really!) But I continue to execrate the practice of computing tail areas of sampling distributions…

Corey: Maybe $5 max, but that’s only because you asked, and by asking led me to suppose it’s a mistake; prior to your asking, $0: I assumed you meant what you said.

The bottom line is that these attempted computations do not speak about posterior probabilities in hypotheses, or prior probabilities of hypotheses or any such thing. They employ probability to qualify the method (be it a test or other), by assignments to a sampling distribution (on which the inference rule is defined). That is the central mathematical difference corresponding to the difference in goals. (Wasserman might have captured the distinction he had in mind this way too.)

There’s a line of gibberish under the query about the contract reading, “Vg jnf havagraqrq — vg’f n pbcl-a-cnfgr reebe.” It’s encoded in rot13; if you paste it into the rot13 box and click “Cypher”, it comes out, “It was unintended — it’s a copy-n-paste error.”
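For readers who would rather check the decoding themselves than trust a web form, here is a small sketch using Python’s standard `codecs` module. The two clauses of the ciphered line are decoded separately, and the curly apostrophe is typed as a plain one; rot13 only shifts letters, so punctuation passes through unchanged.

```python
import codecs

# The two clauses of the ciphered line from the comment above.
clauses = ["Vg jnf havagraqrq", "vg'f n pbcl-a-cnfgr reebe."]

# rot13 shifts each letter 13 places, so applying it again restores the text.
decoded = [codecs.decode(clause, "rot_13") for clause in clauses]
print("; ".join(decoded))
```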

You updated in the correct direction on some new information. Well done.

Corey: I prefer not to get into rot13 ciphers, whatever they are… if you want to write its translation, please do. But you haven’t really responded to my point about what’s needed here.

I did write the translation — I guess you missed it. It’s, “It was unintended — it’s a copy-n-paste error.” (The business about cyphered text was just to provide verification (if any was desired) that the outcome of the hypothetical contract was determined at the time I posed the question.)

You wrote, “The bottom line is that these attempted computations do not speak about posterior probabilities in hypotheses, or prior probabilities of hypotheses or any such thing.”

True — so far. Your “canonical models for piecemeal investigations” carefully avoid any tricky data-generating mechanisms, e.g., something outside of exponential families with relatively large numbers of both interest and nuisance parameters. As far as I can tell, severity qua well-defined function on hypothesis space is basically useless in such setups, which permit no total order in hypotheses of interest, no total order on statistics, and have assessment of hypotheses of interest inextricably linked with assessment of “nuisance” hypotheses. (I do not deny the applicability of informal severity reasoning about situations in which such models are necessary, although the kinds of inferences such reasoning is capable of producing are necessarily quite coarse-grained.)

Bayesian computations can cope with these situations, but are unacceptable to the error statistician, since no piecemeal error-statistical assessment of sufficiently “fine” resolution is available. Is well-grounded science impossible unless piecemeal investigations can be carried out? Is that “what’s needed here”?

There’s too much here to respond to just now at this hour (London), but granting “True — so far” should be troubling to you. What shall we say of an account of statistical inference/evidence that is unable to well represent a canonical problem of pseudoscience/prejudged inference or the like, even when we can clearly see the problem exists? I don’t get your suggestion that in cases that are so messy that no assessment of the messiness (e.g., in terms of 0 probativeness/0 discrimination, etc) is possible, that it is better to appeal to a statistical account that fell down on the job when the statistical evidence/test was poor and we knew it. Sorry, this isn’t put very clearly, I’ll have to come back to this….But the fact is that the majority of science and of learning is not captured in formal statistics. If, however, nothing of an informal sort can be said even by analogy, then it’s hard to see we’re in any kind of a learning context. Recognizing this should be a positive impetus to start asking some piecemeal questions, or remain in the dark.

“What shall we say of an account of statistical inference/evidence that is unable to well represent a canonical problem of pseudoscience/prejudged inference or the like, even when we can clearly see the problem exists?”

Not troubled — you’ve misunderstood what I’ve been trying to communicate. Granting the dichotomous test scenario, the Bayesian account of “a given analysis [that] has done a terrible job in probing H” only involves sampling distributions (pre-data) or likelihood functions (post-data). The prior and posterior for H don’t enter into the account. The simplified setting of your query doesn’t touch on things like the ir/relevance of tail areas to inference and methods for dealing with nuisance parameters.

Corey: Not getting this (maybe jet lag)? What I meant was that your various attempts to convey the very poor/prejudged test (so readily captured via error probabilistic claims) do not seem to be naturally or adequately put in terms of posterior probabilities. Where does the “irrelevance of tail areas” enter?

I can see why it seems to you that my account of the very poor/prejudged test is not naturally or adequately put in terms of posterior probabilities. What I’m doing is asking myself the question, “What are the necessary and sufficient conditions for the posterior probabilities of hypotheses in some set H to be equal to the prior probabilities?” The necessary condition is that there must be a nonempty set of observable events E that have the same data probability under all of the hypotheses in the set H. The sufficient condition is that one of the elements of E is the actual observed event, i.e., the likelihood ratio is 1 in the set H. My focus on discussing these conditions has masked the method I used to arrive at them; hence the appearance that they have nothing to do with posterior probabilities.

In the dichotomous test scenario, a map from E to {H_0 Passes, H_0 Fails} is proposed. If the probabilities of the two test outcomes are equal for all hypotheses in H, then the test outcome alone (i.e., absent the data) is non-informative in the Bayesian sense above. This scenario covers both the test with predetermined outcome and the test that is not a test at all.
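A toy numerical sketch of this scenario may help. Everything in it is an assumption made for illustration: the three-element hypothesis set, the prior, and the pass probabilities are invented numbers, and `posterior` is a hypothetical helper rather than anything taken from the discussion.

```python
def posterior(prior, pass_prob, passed):
    """Bayes update over a finite hypothesis set, given only the pass/fail
    outcome. pass_prob[h] plays the role of Pr(H_0 passes; h)."""
    like = {h: (p if passed else 1.0 - p) for h, p in pass_prob.items()}
    norm = sum(prior[h] * like[h] for h in prior)
    return {h: prior[h] * like[h] / norm for h in prior}


# Hypothetical prior over H_0 and two ways in which H_0 could be flawed.
prior = {"H0": 0.7, "flaw1": 0.2, "flaw2": 0.1}

rigged = {h: 1.0 for h in prior}   # predetermined to pass H_0
coin = {h: 0.5 for h in prior}     # a coin flip: "not a test at all"
probing = {"H0": 0.95, "flaw1": 0.10, "flaw2": 0.20}  # can detect the flaws

# Only the "pass" outcome is updated on; "fail" has probability 0 for `rigged`.
post_rigged = posterior(prior, rigged, passed=True)
post_coin = posterior(prior, coin, passed=True)
post_probing = posterior(prior, probing, passed=True)

for name, table in [("rigged", rigged), ("coin", coin), ("probing", probing)]:
    post = posterior(prior, table, passed=True)
    fail_given_flaw1 = 1.0 - table["flaw1"]
    print(name, round(post["H0"], 3), fail_given_flaw1)
```

With these made-up numbers, the rigged test and the coin flip both leave the probability of H_0 where it started, while the chance of failing H_0 when flaw 1 is present is 0 for the rigged test and 0.5 for the coin flip. Whether that second number needs reporting separately is the point under dispute in this thread.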

The Bayesian account says that an analysis that is equivalent (or roughly equivalent) to the above scenario has done a terrible job in probing H_0 for a certain class of errors, viz, the other hypotheses in H. It also says that the test outcome is poor grounds for claiming H_0 — any map from probabilities on hypotheses to claims will produce “claim H_0” after the test if and only if it also did so before the test.

As far as I can tell, in the dichotomous test scenario the above Bayesian account and the severity account agree as to when a test outcome is poor grounds for claiming H_0. To highlight where my account will start to diverge from the severity account, it is necessary to add detail not present in the dichotomous test scenario; that’s where the question of the relevance of tail areas to inference, or lack thereof, will enter.

The posteriors being equal to the priors fails to capture my notion (and, I think, the common notion) of a poor test. That essential idea requires considering features of tests that are not picked up on by posteriors/priors. One might try to reconstruct it that way, but it will always miss the essential rationale. That was the insight I first gleaned from C.S. Peirce, later echoed in Popper and Egon Pearson.

And yet in this dichotomous test scenario I have shown above that if the posterior equals the prior then the test lies on a continuum from tests predetermined to pass H_0 through tests that are “not a test at all” to tests predetermined to fail H_0…

That a poor test in the error statistical sense might “pass through” tests you call poor does not give the converse. Thus your scenario does not capture what we have in mind as regards a quite blatant notion of a poor test, or “no test at all”. By contrast, one easily qualifies this (and other) characterizations of tests that are of interest by using probability to qualify the testing method itself. (This same issue is really what’s behind Wasserman and Gelman’s talking past each other in their recent blog discussion of what is frequentist and what is Bayesian.)