# What’s wrong with taking (1 – β)/α, as a likelihood ratio comparing H0 and H1?


Here’s a quick note on something that I often find in discussions on tests, even though it treats “power”, which is a capacity-of-test notion, as if it were a fit-with-data notion…..

1. Take a one-sided Normal test T+ with n iid samples:

H0: µ ≤ 0 against H1: µ > 0

σ = 10, n = 100, σ/√n = σx = 1, α = .025.

So the test would reject H0 iff Z > c.025 = 1.96. (1.96 is the “cut-off”.)

~~~~~~~~~~~~~~

2. Simple rules for alternatives against which T+ has high power:
• If we add σx (here 1) to the cut-off (here, 1.96) we are at an alternative value for µ that test T+ has .84 power to detect.
• If we add 3σx to the cut-off we are at an alternative value for µ that test T+ has ~ .999 power to detect. We can write this value as µ.999 = 4.96.
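The two rules can be checked numerically. Here is a minimal sketch, assuming Z is Normal with standard deviation σx = 1 and using Python's standard-library `statistics.NormalDist`; the function name `power` and the constants are illustrative, not from the post.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()   # standard Normal: mean 0, sd 1
SIGMA_X = 1.0               # sigma / sqrt(n) = 10 / sqrt(100)
CUTOFF = 1.96               # rejection cut-off for alpha = .025


def power(mu: float) -> float:
    """Pr(Z > CUTOFF; mu): probability that T+ rejects H0 when the mean is mu."""
    return 1.0 - STD_NORMAL.cdf((CUTOFF - mu) / SIGMA_X)


# Alternatives located 0, 1 and 3 standard errors above the cut-off.
for k in (0, 1, 3):
    mu = CUTOFF + k * SIGMA_X
    print(f"mu = {mu:.2f}: power = {power(mu):.4f}")
```

At the cut-off itself the power is one half, which is why adding one or three standard errors gives the familiar .84 and roughly .999.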

Let the observed outcome just reach the cut-off to reject the null, z = 1.96.

If we were to form a “likelihood ratio” of μ = 4.96 compared to μ0 = 0 using

[Power(T+, 4.96)]/α,

it would be about 40 (.999/.025).

It is absurd to say the alternative 4.96 is supported 40 times as much as the null, understanding support as likelihood or comparative likelihood. (The data 1.96 are even closer to 0 than to 4.96. The same point can be made with less extreme cases.) What is commonly done next is to assign priors of .5 to the two hypotheses, yielding

Pr(H0 |z0) = 1/ (1 + 40) = .024, so Pr(H1 |z0) = .976.

Such an inference is highly unwarranted and would almost always be wrong.
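To see how far the tail-area ratio is from a genuine likelihood ratio, one can compute both at the observed z = 1.96. This is only a sketch under the post's Normal setup (σx = 1); the variable names are illustrative.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()
CUTOFF = 1.96      # cut-off, and also the observed outcome in this example
ALPHA = 0.025
MU_ALT = 4.96
z0 = CUTOFF

# Ratio of tail areas: Pr(Z > 1.96; mu = 4.96) / Pr(Z > 1.96; mu = 0).
power_alt = 1.0 - STD_NORMAL.cdf(CUTOFF - MU_ALT)
tail_ratio = power_alt / ALPHA

# Likelihood ratio proper: Normal densities evaluated at the specific outcome z0.
likelihood_ratio = NormalDist(mu=MU_ALT).pdf(z0) / NormalDist(mu=0.0).pdf(z0)

# Posterior for H0 if the tail ratio is (mis)used with priors of .5 each.
posterior_h0 = 1.0 / (1.0 + tail_ratio)

print(f"tail-area ratio  = {tail_ratio:.2f}")
print(f"likelihood ratio = {likelihood_ratio:.4f}")
print(f"Pr(H0 | z0) from tail ratio = {posterior_h0:.4f}")
```

The likelihood ratio at z = 1.96 comes out below 1, so in likelihood terms the outcome favors μ = 0 over μ = 4.96, the reverse of what the ratio of 40 suggests.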

~~~~~~~~~~~~~~

3. How could people think it plausible to compute a comparative likelihood this way?

I have been thinking about this for a while because it’s ubiquitous throughout criticisms of error statistical testing, and it comes from a plausible comparativist likelihood position (which I do not hold), namely that data are better evidence for μ than for μ’ if μ is more likely than μ’ given the data. I’m guessing they’re reasoning as follows:

The probability is very high that z > 1.96 under the assumption that μ = 4.96.

The probability is low that z > 1.96 under the assumption that μ = μ0 = 0.

We’ve observed z = 1.96 (so you’ve observed z > 1.96).

Therefore, μ = 4.96 makes the observation more probable than does μ = 0.

Therefore the outcome is (comparatively) better evidence for μ = 4.96 than for μ = 0.

But the “outcome” for a likelihood is to be the specific outcome, and the comparative appraisal of which hypothesis accords better with the data only makes sense when one keeps to this. Power against μ’ concerns the capacity of a test to have produced a larger difference, under μ’. (It refers to all of the outcomes that could have been generated.)

~~~~~~~~~~~~~~

4. That’s not at all how power works.

The result is that power works in the opposite way! If there’s a high probability you should have observed a larger difference than you did, assuming the data came from a world where μ = μ’, then the data indicate you’re not in a world where μ is as high as μ’. In fact:

if Pr(Z > z0; μ = μ’) is high, then Z = z0 is strong evidence that μ < μ’!

Rather than being evidence for μ’, the statistically significant result is evidence against μ being as high as μ’.

~~~~~~~~~~~~~~

5. Stephen Senn

Stephen Senn (2007, p. 201) has correctly said that the following is “nonsense”:

“[U]pon rejecting the null hypothesis, not only may we conclude that the treatment is effective but also that it has a clinically relevant effect.”

Now the test is designed to have high power to detect a clinically relevant effect (usually .8 or .9). I happen to have chosen an extremely high power (.999) but the claim holds for any alternative that the test has high power to detect. The clinically relevant discrepancy, as he describes it, is one “we should not like to miss”, but obtaining a statistically significant result is not evidence we’ve found a discrepancy that big. (See also Senn’s post here.)

Supposing that it is, is essentially to treat the test as if it were:

H0: μ < 0 vs H1: μ > 4.96

This, he says, is “ludicrous” as it:

“would imply that we knew, before conducting the trial, that the treatment effect is either zero or at least equal to the clinically relevant difference. But where we are unsure whether a drug works or not, it would be ludicrous to maintain that it cannot have an effect which, while greater than nothing, is less than the clinically relevant difference.” (Senn, 2007, p. 201).

The same holds with H0:μ = 0 as null.

If anything, it is the lower confidence limit that we would look at to see what discrepancies from 0 are warranted. The one-sided lower .975 limit (which is also the lower end of the two-sided .95 interval) would be 0, and the one-sided lower .95 limit would be about .3. So we would be warranted in inferring from z:

μ  > 0 or μ  > .3.
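As a rough check on these limits, here is a sketch that computes one-sided lower confidence bounds for the observed z = 1.96. It assumes the standard form z − c·σx, with c the standard Normal quantile for the chosen level; the helper name `lower_bound` is illustrative.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()
SIGMA_X = 1.0
Z_OBS = 1.96


def lower_bound(level: float, z_obs: float = Z_OBS) -> float:
    """One-sided lower confidence bound for mu at the given confidence level."""
    quantile = STD_NORMAL.inv_cdf(level)
    return z_obs - quantile * SIGMA_X


for level in (0.975, 0.95):
    print(f"one-sided {level} lower bound: mu > {lower_bound(level):.3f}")
```

The .975 bound sits essentially at 0 and the .95 bound a little above .3, so neither comes anywhere near 4.96.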

~~~~~~~~~~~~~~

6. What does the severe tester say?

In sync with the confidence interval, she would say SEV(μ > 0) = .975 (if one-sided), and would also note some other benchmarks, e.g., SEV(μ > .96) = .84.

Equally important for her is a report of what is poorly warranted. In particular the claim that the data indicate

μ > 4.96

would be wrong over 99% of the time!

Of course, I would want to use the actual result, rather than the cut-off for rejection (as with power) but the reasoning is the same, and here I deliberately let the outcome just hit the cut-off for rejection.
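The benchmarks above can be reproduced with a short sketch. It assumes the severity for a claim μ > μ1 is computed as Pr(Z ≤ z0; μ = μ1), which is the reading that matches the .975 and .84 figures; the function name is illustrative.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()
SIGMA_X = 1.0
Z_OBS = 1.96   # outcome that just reaches the cut-off


def severity_greater(mu1: float, z_obs: float = Z_OBS) -> float:
    """SEV(mu > mu1), taken here to be Pr(Z <= z_obs; mu = mu1) (assumed definition)."""
    return STD_NORMAL.cdf((z_obs - mu1) / SIGMA_X)


for mu1 in (0.0, 0.96, 4.96):
    print(f"SEV(mu > {mu1:.2f}) = {severity_greater(mu1):.4f}")
```

With an actual result rather than the cut-off, one would pass that value as `z_obs`; the reasoning is unchanged.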

~~~~~~~~~~~~~~

7. The (type 1, 2 error probability) trade-off vanishes

Notice what happens if we consider the “real type 1 error” as Pr(H0 |z0).

Since Pr(H0 |z0) decreases with increasing power, it decreases with decreasing type 2 error. So we know that to identify “type 1 error” and Pr(H0 |z0) is to use language in a completely different way than the one in which power is defined. For there we must have a trade-off between type 1 and 2 error probabilities.
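A small sketch makes the point concrete. It assumes, as in the computation criticized earlier, that power/α is used as the likelihood ratio and that each hypothesis gets prior .5; the function name and the power values chosen are illustrative.

```python
ALPHA = 0.025


def posterior_h0(power: float, alpha: float = ALPHA, prior_h0: float = 0.5) -> float:
    """Pr(H0 | z0) when power/alpha is (mis)used as the likelihood ratio of H1 to H0."""
    prior_h1 = 1.0 - prior_h0
    return (alpha * prior_h0) / (alpha * prior_h0 + power * prior_h1)


# With alpha held fixed, raising power (lowering the type 2 error probability)
# also lowers this so-called "real type 1 error".
for power in (0.5, 0.8, 0.9, 0.999):
    type2 = 1.0 - power
    print(f"power = {power:.3f}, type 2 = {type2:.3f}, Pr(H0 | z0) = {posterior_h0(power):.4f}")
```

Both quantities fall together here, so there is no trade-off of the kind that holds between type 1 and type 2 error probabilities.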

The conclusion is that using size and power as likelihoods is a bad idea for anyone who wants to assess the comparative evidence by likelihoods. It’s true that the error statistician is not in the business of making inferences to point values, nor to comparative appraisals of different point hypotheses (much less do we wish to be required to assign priors to the point hypotheses).  Criticisms often start out forming these ratios and then blaming the “tail areas” for exaggerating the evidence against. We don’t form those ratios. My point here, though, is that this gambit serves very badly for a Bayes ratio or likelihood assessment.

Likelihood is a “fit” measure, “power” is not. (Power is a “capacity” measure.)

~~~~~~~~~~~~~~

Send any corrections, I was just scribbling this….

This is related to my “no headache power for Dierdre” post, and several posts having to do with allegations that p-values overstate the evidence against the null hypothesis, such as this one.

Senn, S. (2007), Statistical Issues in Drug Development. Wiley.

### 87 thoughts on “What’s wrong with taking (1 – β)/α, as a likelihood ratio comparing H0 and H1?”

1. Deborah, I don’t think you’ll like my answer, but my first reaction to your post is that the problem is due to obsession with likelihood in the first place. As you said yourself, comparing likelihoods seems fishy, but I would go further and say that I believe too much is made of likelihood in general anyway. In addition to philosophical issues, likelihood can be very sensitive to distributional departures from the model, especially in the tails, the regions of interest. (Recall my related statement about the Higgs boson analysis.)

More generally, my problem with all of the analysis you describe is that it is all aimed at taking the decision making out of the hands of the domain expert. I’d much prefer that the statistics tell the analyst what plausible values μ may have, and then let the analyst make a decision (on policy or whatever) accordingly. As usual, I’m talking about confidence intervals, but in a different way than you seem to view them.

• Matloff: first, surprised to get a comment so quickly, and second, nice to hear from you. I don’t think I get what you’re driving at. I’ve been through criticisms and reforms recently (in writing my book) from at least 4 different approaches, and they all come back to this kind of move. So it can’t just be dismissed. Now anyone who does comparativist inference, which is basically almost everyone it seems, can say, “look we didn’t want to compare these ‘tail areas’, but you made us by giving us ‘tail areas’ to work with.” My answer is that we never told you to do this and, most importantly, if you think about it, it’s not accomplishing YOUR goal in forming the comparison.

But I realize this doesn’t get to your point. Why is this a matter of taking the decision making out of the hands of the domain expert? I thought these ratios were considered staying close to the data. But now on confidence intervals, why do you spoze you’re talking about them in a way different than I? I gave the confidence intervals. They’re what should alert people to the problem. Even if one doesn’t use SEV, surely confidence intervals show the mistake. That’s why I like the Senn discussion here. The important thing about CIs are the limits. Only thing is, if you form one-sided CIs, say lower as here, then you also need to look at the upper CI bound to avoid fallaciously inferring larger discrepancies than are warranted from stat sig results. I wouldn’t want to say all the values in the CI are equally plausible–so maybe that’s why you think I view them in a different way than most people. I’m sure I can convince you why it’s not a good idea to view all the points of a CI as equally plausible, or even plausible (some of them). They are survivors, and some of them (near the boundaries) just by the skin of their teeth.

• Well, I first must again make my disclaimer that I feel I’ve wandered into a party in which I don’t know anyone and know nothing about the topics everyone is discussing. 🙂 Still a fun party, though. 🙂

If I’ve understood your posting of this evening correctly, both you and the people you disagree with want to have an Automatic Decision Making Robot, who/that will input the data and output a crisp, black-and-white, unnuanced decision, e.g. “The drug works.” In such a situation, the ADMR makes the decision, not the domain expert. The former has taken the decision making out of the hands of the latter, which I think is highly undesirable.

As you know, that does NOT imply that I advocate a Bayesian approach. I am definitely a frequentist. But that merely means I don’t want the analyst to incorporate his/her hunches into the quantitative analysis. Instead, the analyst should compute a (frequentist) CI, display for all intended consumers (a scientific community, a corporate board etc.) to see, and then add his/her conclusions, based both on the data and external knowledge.

I didn’t say you look at CIs differently from most people — just differently from me. I must apologize (again) for not having had the time to really delve into the references you’ve given me for “homework,” But it’s my impression that you use CIs as an ersatz significance test, i.e. back to the ADMR. And I don’t think either one of us claims all points within the CI are equally plausible.

• Matloff: I like when you wander into our parties once a year or so (even if we weren’t holding any) where you don’t know anyone: neither do most of the rest of us. Have an Elba Grease with ice and maraschino cherries, relax and discover that few of us are so well acquainted with the topics we are discussing. Your merely saying “I am definitely frequentist” automatically earns you a prime spot where all the action takes place, plus a jumbo shrimp cocktail (my favorite), and a “frequentists in exile” T-shirt, just designed (I’m also an artist of sorts).

Now learning you’ve not done your homework on CIs is disappointing, but I’m not going to give you detention or anything and I agree that the analyst shouldn’t incorporate his or her hunches into the quantitative analysis. Do explain at some point how you view CIs differently from others. Please…
In the meantime, saunter into my Error Statistical Casbah and relax…

• Michael Lew

Matloff, you say that you would “prefer that the statistics tell the analyst what plausible values μ may have, and then let the analyst make a decision (on policy or whatever) accordingly” and then go on to ask for confidence intervals. A properly constructed likelihood function would give you a much more complete picture of the evidence in the data relevant to the plausibility of various values of mu than a confidence interval could ever do.

• I regard the information in a standard confidence interval as being as good as it gets. For reasons I gave last night, I believe that likelihood often carries MISinformation.

• Michael Lew

You might well be mistaken. What makes you think that likelihood functions are misinformation?

• Michael Lew

OK, I didn’t think that your comment included enough criticism for anyone to reasonably decide to discard likelihood functions in favour of confidence intervals. However, you did say that likelihoods are “fishy” in some sense and that they are severely affected by departures from model assumptions. I would suggest that confidence intervals might be equally affected by departures from the model assumptions as are likelihood functions.

I note your point about the tails of the functions, but one is rarely interested in the extremities of a likelihood function because the best supported values of the parameter of interest are near the peak of the likelihood function. The peak is naturally the place of most interest.

• We are, at least in the discussion here, talking about estimation of means, in which case there really is no model to violate. We’re just using the Central Limit Theorem.

I mentioned the tails because the discussion had involved tail probabilities, arising in inference (hypothesis testing). But (to my knowledge) most Bayesians are interested primarily in point estimation, not inference, in which case the peak of the posterior likelihood is the main item of interest, as you say. But really, this is a likelihood issue, not a Bayesian-vs.-frequentist issue, because frequentists do Maximum Likelihood Estimation. And I have the same objections to MLEs.

What happens when we go beyond just estimating means (including conditional ones in the regression setting)? Again, I have the same objections. I don’t like fancy, difficult-to-verify models, again for the same reason: Generation of MISinformation.

• Michael: Likelihoods ignore crucial pieces of information that alter error probabilities such as cherry picking, data-dependent hypotheses, multiple testing, stopping rules, selection effects. And comparative likelihood claims are pretty useless.

• Out of room again. (Deborah, this is a configurable WordPress parameter if you can stomach longer discussions. 🙂 ) I’m replying to rasmusab, who had replied to me.

You and I of course agree on the conditions under which the Bayesians’ famous “the prior eventually washes out” claim fails. But my point was that the Bayesians don’t put an asterisk on that famous slogan, which is why I said the Bayesian approach is not quite as advertised. That’s a really big deal to me.

And more importantly, we’re not talking about some rare case here. On the contrary, the excellent book Bayesian Ideas and Data Analysis, one of whose authors is my former colleague Wes Johnson (a really smart guy and a leading Bayesian), is chock full of examples of priors that assume bound(s) on θ.

The examples in that book — and in every other book I know of on the Bayesian method — show that many, indeed most, Bayesians set up priors exactly in the way you believe that the vast majority don’t: Their priors are chosen, as you say, because “it feels right.” Of course, they also often choose “convenient” priors because they lead to nice posterior distributions, making the priors even more questionable.

I’m not familiar with the Rubin/Jaynes approach. A quick Web search seems to indicate it is aimed at performing “What if?” analyses. I have no problem with that at all (providing, as always, that the ultimate consumers of the analyses are aware of the nature of what was done).

• rasmusab

I actually believe most people that do Bayesian data analysis (those you call Bayesians) actually use convenience priors (such as default priors, or reference priors). And I think that’s fine, as long as you know that you are using a convenience prior. Just like most people use convenience models (like linear regression), it’s quick and easy and hopefully works ok most of the time.

It’s a different case if you were to choose a convenience prior and then stick to it whatever happens. That would be like sticking to linear regression without ever questioning the model assumptions. And that would be questionable.

A useful way of thinking about priors is just as “part of the model”. Just like the assumption of linearity is part of the model, and has to be justified, the priors are also part of the model and have to be justified. But sometimes you use linear regression because you have no better option and sometimes you use convenience priors because you haven’t figured out something better.

What I meant with the Rubin/Jaynes approach was a very pragmatic approach to Bayesian data analysis, like the one described here, for example: http://projecteuclid.org/euclid.aos/1176346785

• But rasmusab, you are ignoring the key point: One can use the data to assess the propriety of frequentist models, such as linearity of a regression function, but one can NOT do that in the (subjective) Bayesian case. In Bayesian settings, since one has only a single realization of θ at hand, one can’t estimate the distribution of θ to verify the prior.

All this changes in the empirical Bayes case. Then there is a real distribution for θ, and one’s model for that distribution can be verified as in any other frequentist method, because it IS a frequentist method. For instance, people use Fisher’s linear discriminant analysis (or for that matter logistic regression) without raising an eyebrow, even though it is an empirical Bayes method.

I skimmed through the first few pages of the Rubin paper (thanks for the interesting link), and immediately noticed that his very first example, on law school grades, uses an empirical Bayes approach, not a subjective one, which makes it frequentist.

• Matloff: There is a group of Bayesians these days who want to see themselves as nonsubjective but not frequentist. Whether their priors are on firm grounds when not based on any frequencies is very unclear. However, they will also refer to them as “regularization devices”, “smoothing devices” or the like. It’s hard to know how to test them if they’re doing different jobs in any particular case.
One question for you: just saying it’s empirical Bayes doesn’t yet say how they get the empirical prior. And, if available, is it always relevant to problems of inference about THIS parameter/hypotheses? The typical screening context seems very different from the inferential one.

2. Alan

Agreed this is absurd. What about the Bayesian solution? Using a uniform prior on mu to get Pr{H0} = Pr{H1}, then for an observed value of z0=1.96 and a stdev=1 you get:

Pr{mu greater than zero | z0} = .975
Pr{mu greater than .96 | z0} = .84
Pr{mu greater than 4.96 | z0} = .0013

In words, these read: there’s strong evidence mu is greater than zero, fair evidence it being greater than .96 and it’s very unlikely to be greater than 4.96.

Are you claiming these results are absurd?

Remembering the observed value is greater than zero by about two standard deviations, these seem spot on to me.
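Alan's three numbers can be reproduced with a short sketch. It assumes that a flat prior on mu gives a Normal posterior centered at z0 = 1.96 with standard deviation 1, which is the standard result for a Normal mean with known variance; the function name is illustrative.

```python
from statistics import NormalDist

Z_OBS = 1.96
SIGMA_X = 1.0

# Posterior for mu under a flat prior (assumed): Normal(z0, sigma_x).
posterior = NormalDist(mu=Z_OBS, sigma=SIGMA_X)


def prob_mu_greater(m: float) -> float:
    """Posterior probability that mu exceeds m, given z0."""
    return 1.0 - posterior.cdf(m)


for m in (0.0, 0.96, 4.96):
    print(f"Pr(mu > {m:.2f} | z0) = {prob_mu_greater(m):.4f}")
```

Numerically these coincide with the severity values SEV(μ > 0) and SEV(μ > .96) quoted in the post, although the two quantities are interpreted differently.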

• When Bayesians use the point .5 priors, though, it doesn’t work. With the uniform, it’s like the error statistical tester. I’ll look at this tomorrow–too late.

• Alan

Great, thank you. Could you possibly sketch the analysis for the case when we have outside knowledge of the kind, for example, where mu has to be greater than 3.

I can see how to add that to the Bayesian analysis, but I’m uncertain how to modify the severity analysis. Under those circumstances the Bayesian answer becomes:

Pr{mu greater than 3.1 |z0, mu has to be greater than 3} = .85

Which seems intuitive. Given the low value of z0 there’s some chance mu is between 3 and 3.1. The old severity is:

SEV(mu greater than 3.1) = .15

Does this change at all? How is severity to be calculated under these circumstances? It would be greatly appreciated and most helpful to see what the correct answer is.

• I’ve never understood this idea that if we know that μ is greater than 3, that means we should adopt a Bayesian approach. What’s wrong with simply truncating our frequentist estimate at 3? And if our pre-truncation estimator does drop below 3, wouldn’t we like to know that, so that we can re-examine our conviction that μ is at least 3? Sometimes convictions, no matter how firm, turn out to be wrong.

• Exactly! You read my mind.

• Alan

Hi matloff,

Most examples I deal with have physical cutoffs which are firm and have to be respected. Simply truncating the old result doesn’t work because the Confidence Interval could be entirely below the cutoff. Or you could get funny 95% Confidence Intervals like (0,3.0000001). Truncating would imply a 95% CI of (3,3.0000001) which is absurd. Even if truncating worked sometimes it clearly wouldn’t always work.

It’s unclear to me how to “truncate” the severity measure. Currently all values of mu greater than or equal to 3 have very low severity. Is that to change? If so, what principle do we use to change it and what’s the result?

For comparison the Bayesian update gives

Pr{mu greater than 3 |z0, mu has to be greater than 3} = 1
Pr{mu greater than 3.1 |z0, mu has to be greater than 3} = .85
Pr{mu greater than 3.5 |z0, mu has to be greater than 3} = .42

which seems fairly intuitive.
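For readers who want to check these figures, here is a sketch of the truncated-posterior calculation. It assumes a flat prior restricted to mu > 3, so the posterior is a Normal(1.96, 1) truncated below at 3; the function name and default bound are illustrative.

```python
from statistics import NormalDist

Z_OBS = 1.96
SIGMA_X = 1.0
LOWER_BOUND = 3.0   # background knowledge: mu must exceed 3

untruncated = NormalDist(mu=Z_OBS, sigma=SIGMA_X)


def prob_mu_greater_given_bound(m: float, lower: float = LOWER_BOUND) -> float:
    """Pr(mu > m | z0, mu > lower) for a Normal posterior truncated below at `lower`."""
    if m <= lower:
        return 1.0
    mass_above_bound = 1.0 - untruncated.cdf(lower)
    return (1.0 - untruncated.cdf(m)) / mass_above_bound


for m in (3.0, 3.1, 3.5):
    print(f"Pr(mu > {m} | z0, mu > 3) = {prob_mu_greater_given_bound(m):.3f}")
```

Under these assumptions the results land close to the quoted values (about .85 for 3.1 and about .41 for 3.5).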

• Granted, the lower bound is firm in some settings. But from my point of view, the “funny” confidence interval is just something we have to live with; we can’t derive more information from the data than the data give us. Unless one is a Bayesian, of course. 🙂

• Alan

If I truncated a CI and reported (3, 3.000001) as a 95% CI, while my Bayesian coworker reported something sensible like (3, 3.4) or so based on the same information, then I’d be fired.

I’m reluctant to live with that.

These are just toy problems in truth. The problems I work on are more sophisticated, and if Bayesians are giving clear, general, and intuitive answers, while Confidence Intervals and severity are already meeting severe difficulties of principle, then that leaves me no choice but to go the Bayesian route.

It’s a basic question of principle. How should the CI’s and severity measures be adapted to that additional information? Why the reluctance to give a thoughtful, clear answer to such a simple question?

• Funny you should mention being fired. I recall opining on this blog once that Bayesians might be fired if their bosses really knew what the Bayesians were up to.

• Alan

Ok I’m done. The severity analysis in the post is giving identical answers to the Bayesian posterior based on the uninformative prior.

In order to understand what’s going on I considered the simplest instance of genuine information which would cause a Bayesian to use a different prior. The Bayesian result is still sensible. The CI truncation idea is ridiculous in general, and I don’t see any way to adapt the severity measure to account for that information (I genuinely don’t have a vague hint of an idea how to do that).

I will use this exchange to re-evaluate the claims of non-Bayesians that they can use background information to the same extent as Bayesians.

Chow.

• What I said wasn’t a quip at all; I was dead serious. A lot of bosses and other consumers of statistical reports would be hopping mad if they were to learn that the analyst had added in his/her own fudge factor, based on the analyst’s hunch. I made exactly the same point in this forum some months ago.

(Note: I am talking about subjective priors; as I’ve said before, I’m fine with empirical priors.)

If you are like most people, you feel strongly about at least one political issue. How would you like it if policy on that issue were to be based on “research” in which the analyst’s personal views on the issue were to be incorporated into the analysis and subsequent policy recommendations? The analyst may even think his/her work is impartial.

We may be talking about “toy cases,” but not at my instigation. Instead, it was the topic at hand from the beginning, and properly so, as simple settings make it easier to clarify the issues.

I should mention a bit more about my personal preference for using concrete quantities like sample means instead of likelihoods. You’ll at least concede that with means, the problem of the estimator violating the physical constraint isn’t going to happen in the first place (again assuming the physical constraint really is valid). One still could have part of the CI in the “wrong” region, but the “(3,3.0000001)” scenario is almost certainly not going to occur. In that sense, “(3,3.0000001)” is the toy case.

• Alan

matloff,

Honestly I’m not interested in Bayesian vs non-Bayesian mud slinging. I don’t care. I’d really just like to know the (non-Bayesian) way to include that information in the non-Bayesian analysis. At this moment I have no idea how to modify the CI in general, or the severity analysis at all, or even what principles I could use to do so. It’s a simple question. Why can’t I just get a straight answer?

• Alan

Matloff,

Two points of clarification. It was not my intention to focus on toy problems. My intention was to see for myself the (non-Bayesian) answers to the sophisticated types of problems I work on. To see them, I have to get them. If I can’t solve this simple toy problem, there’s no hope of solving a real one and examining it.

Second, if you’re measuring mu with measurement errors, it is possible to get point estimates and entire CI’s below the cutoff. Your telling me this won’t happen much provides no help. I want to know the correct-in-principle way to deal with this so that I can carry those (non-Bayesian) principles to more sophisticated problems.

• I’m with vl on this

• Alan: I wouldn’t report the nonsensical one.
Trouble is that probability is not a very good measure of how warranted or well tested claims are, even though you’re interpreting them as if they give you “strength of evidence” (rather than your reconstructing your own givens probabilistically). I don’t rule out trying to relate posteriors (of fixed parameters) to sev measures associated with them, but I know sev doesn’t obey the probability calculus.

• Alan

Mayo, thanks for the response. To be clear, I’m not interested in comparing posteriors to severity measures. I only mentioned the comparison because in the example in your post they yield identical conclusions. In order to understand what’s going on I’d like to see them when they differ, hence the introduction of a simple type of background information I encounter regularly.

I don’t need any analysis of the Bayesian answer. I have done the calculations myself and they yield reasonable interval estimates that I could defend to my boss. I would like to know how to include the same background info into the CI and/or severity analysis using non-Bayesian principles (naturally). It’s a simple question truthfully and it’s a bit of a red flag answers aren’t forthcoming.

• You may be able to find a prior that matches my SEV, but our meanings differ, and the other priors led many people to the different posteriors in my post. It’s the very fact that you could justify nearly any answer you want with a plucked prior that raises a bit of a red flag.

• Alan

Mayo, please forget I ever mentioned the Bayesian analysis. I regret doing so. I would just like to know your (non-Bayesian) way for including that type of background information in the (non-Bayesian) CI’s and severity analysis. Can you give me a hint of the principle used to do so at least? Even just the basic idea? Anything?

• Alan: The observed result is improbably smaller than what would be expected were mu greater than 3. The significance level isn’t terribly small, ~.16, but there’s fairly good evidence that mu < 3. The severity associated with mu > 3.1 given this data?
My view of statistical inference is to indicate how well tested claims are by the data; the idea that this data warrant mu as large as 3.1 is wrong-headed, or would be for an error statistician.

Mark essentially gave you the answer and you’re stamping your feet just as he described.

• Alan

Thank you. At least I can understand this answer.

Mu greater than 3 is a firm assumption. It is realistic in some instances, maybe not others, but that’s beside the point. The same could be said for every other assumption in any problem. The goal was to see what we get with those assumptions.

This clarifies the Error Statistician’s answer considerably, albeit qualitatively. None of the possible values of mu is well warranted so thinking one of the possible values of mu is the true value of mu is “wrong-headed”.

The Bayesian’s answer says that, with a value of z0 below three, mu is most likely three or somewhere not too much bigger.

At least now I can see both answers at least qualitatively.

• I too have run out of room, and thus am replying to Alan via a different message of his.

I think I’ve been very clear as to what I would do about the physical constraints (putting aside for the moment the issue of whether it might be invalid even if I am sure it’s valid). I would truncate the CI, and given my preference for means-based estimation, I dismissed the “(3,3.00001)” toy example.

One aspect that people here have been for the most part dancing around is what one should do if there simply is not enough data to get decent answers. Bayesians (in general, not necessarily you, Alan) “solve” that problem by, in essence, inventing their own data, in the form of quantifying their hunches. Frequentists “solve” the problem by making models that reduce the dimensionality of the problem. The frequentists are somewhat ahead in that at least they can try to use the data to verify their model, but not in the small-n case I’m discussing here, and anyway their model would probably need analysis via likelihood, which again I am wary of.

There is no good solution in that setting. In that light, the honest and useful action is to admit that the data are insufficient to do much with. I think it’s unreasonable for the analyst to give his/her boss an answer just for the sake of having one. The analyst should report what the data say.

Now, if the analyst wishes to then say to the boss, “The sample is too small to say much, but if I supplement it with my hunches, I do get the following estimate, which is pretty nice, don’t you think?”, I don’t have much objection. Similarly if the analyst uses a questionable frequentist model that is not verifiable due to the small sample size. This is basically what I say in my book, by the way.

I continue to be worried about “firm” physical-constraint assumptions. They may well be valid for the physical process itself, but may be wrong once possible problems with, say, measuring instruments are recognized.

• Alan

matloff,

It is possible to get CI’s entirely below the cutoff. Your telling me you dismiss this as a toy problem is fine, but I still have no idea what (non-Bayesian) principle I would use to deal with it correctly.

At the end of the day I can actually look at the Bayesian answer. I can see the numbers generated and evaluate them for myself without ideology. But I can’t do so for CI’s and severity measures if there’s a cutoff to the parameters, because after dozens of comments I still don’t know how to get the (non-Bayesian) answers.

So what’s the point of debating between them? If I can’t get the non-Bayesian numbers, the debate is irrelevant.

• If you work with means, you will NOT get CIs lying entirely below the cutoff.

If you work with likelihood models — which once again, I am wary of — you can easily solve the problem in a standard frequentist way. If the lower bound for the support of the distribution is c, then just model the data as having, say, an exponential distribution shifted rightward by c.
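A minimal sketch of the shifted-exponential idea, assuming the lower bound c is known exactly and the data really are c plus an exponential variable. The function name, the simulated data and the choice of a log-scale Wald interval are illustrative assumptions, not something specified in the comment above.

```python
import math
import random
from statistics import NormalDist, fmean

def shifted_exp_mean_ci(xs, c, level=0.95):
    """Approximate interval for the mean of X = c + Exp(scale theta), c known."""
    n = len(xs)
    theta_hat = fmean(xs) - c            # MLE of the scale when the shift c is known
    z = NormalDist().inv_cdf(0.5 + level / 2)
    half = z / math.sqrt(n)              # delta method: sd of log(theta_hat) is about 1/sqrt(n)
    # Back-transforming from the log scale keeps both limits strictly above c.
    return c + theta_hat * math.exp(-half), c + theta_hat * math.exp(half)

random.seed(0)
c = 3.0                                   # hypothetical lower bound of the support
xs = [c + random.expovariate(1 / 2.0) for _ in range(50)]   # simulated data, true mean c + 2
lo, hi = shifted_exp_mean_ci(xs, c)
print(f"interval for the mean: ({lo:.2f}, {hi:.2f})")
```

Because the interval is built on the scale parameter and then shifted by c, it cannot dip below the bound, which is the property being claimed for this kind of model.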

I’ve been assuming, by the way, that your cutoff is for the random variable X that is being observed. My comments above assume this.

If on the other hand you’re assuming that the population mean μ is known for sure to be greater than or equal to c, you are on very thin ice. This would not count as a physical constraint.

• Alan

matloff,

I don’t know what you’re talking about. You can get CI’s lying below the cutoff. I’m talking about the same problem as the post, except with additional outside information that mu (the parameter) is greater than 3. X the observed value is mu +error and so could potentially be any number for normal errors. Your response is more game playing. It’s a simple question. I can’t judge for myself what’s going on without seeing the numbers and I have no idea how to get the numbers. I’m starting to wonder why two acknowledged experts can’t produce them.

• Mark

Alan, let me get this straight. Your example involves a case where there’s a hard physical constraint on the mean being greater than 3, but no such physical constraint on individual observations? The only possible way to get a CI that lies almost entirely below the cutoff is to have the vast majority of values lying below the cutoff. What’s a Bayesian to do in this case, stamp his feet and say “no, no, no, the mean must be constrained to be greater than 3, so I’ll put the vast majority of my weight on my prior” (that is, acknowledge that the data are noisy and so essentially throw them out)? I’d love to see a Bayesian analysis where a) there is a physical constraint on the mean being greater than 3, b) almost all of the data are sufficiently lower than the cutoff *such that the standard frequentist CI was almost entirely below the cutoff*, and c) the final inference was not based almost exclusively on the prior. If your answer is that your final inference in this case would be essentially the prior, then I frankly don’t see anything less absurd in your approach than claiming that (3, 3.00001) is a reasonable CI. It’s the same argument, as far as I’m concerned, they’re equally concocted.

Now, if there truly is a physical cutoff, such that both the mean and realized values are required to be above this cutoff, then there is a very simple frequentist approach to incorporate this background information. Do a transformation like log(X-3). No need to truncate, your entire CI will be in the required range.

• Alan

Mark,

I don’t want to discuss the Bayesian case. I used a uniform prior for mu greater than 3. I understand this prior and what it means completely. I can work out the numbers trivially. I can see first hand what they do. They are sensible and I can defend them to my boss no problem.

I don’t need people to tell me what engineering problems I should work on. My boss does that and she is very good at her job. She doesn’t need any help.

All I need to know is the principles to use if I were working the same problem as the post except with additional information that mu is greater than three. I need principles which are universally applicable no matter what the exact values of z0 and mu are and tell me how to adapt the CI’s and severity analysis to this case. If I could see those principles in action in this simple case I could potentially use them in realistic cases.

• Alan

I’m flabbergasted and frustrated at how difficult this has been. In the problem in the post x=mu+error, z0=1.96 with sigma=10, n=100 is given together with a classical analysis and a new analysis using severity measures. I just want to know how in principle the classical and severity analysis changes if mu is independently known to be greater than 3. It’s not a trick question. It’s not a difficult question. It reflects the very simplest example of real engineering problems I face. Would someone please just tell me? Because I genuinely have no idea.

• Alan: Again, you miss the point of the post. The point had nothing to do with introducing a new analysis, in fact I am keen to make the case with the corresponding confidence interval. The same point is shown that way in my reference to Senn. I could completely have removed any reference to SEV, except for the fact that it automatically forces you to show what has NOT been well warranted, whereas the one-sided CI just says mu > CI-lower. Now one could use the one-sided CL to describe what’s wrong with inferring mu(.8): ask at what confidence level you could get mu(.8) as a lower bound. The absurdly high error rate immediately is seen.

• rasmusab

When I count, Alan has asked “I just want to know how in principle the classical and severity analysis changes if mu is independently known to be greater than 3.” around eight times. I would just like to add that I’m also curious! If this is not the area to discuss this, so be it. However, I get a bit irked by the general sneering at Bayesian methods, such as:

“we can’t derive more information from the data than the data give us. Unless one is a Bayesian, of course. :-)”

“Funny you should mention being fired. I recall opining on this blog once that Bayesians might be fired if their bosses really knew what the Bayesians were up to.”

“A lot of bosses and other consumers of statistical reports would be hopping mad if they were to learn that the analyst had added in his/her own fudge factor, based on the analyst’s hunch.”

without showing how the methods proposed in this blog can be used to solve Alan’s problem. If you don’t know, that’s fine; not every method can be used everywhere. But if you want to avoid the comments going “off topic”, a clear answer might be the best remedy.

• rasmusab:

I agree, this is annoying.

Let me pretend I am NormalDeviate (sorry Larry) and solve it.

Your data generating model has a location parameter with parameter space (3, infinity).
The full parameter space for location and scale is (3, infinity) x (0, infinity).

Technical note: Your parameter space is not variation independent so don’t expect “similar” confidence intervals but you can search for confidence intervals that uniformly have coverage > 1 – alpha. See Barndorff-Nielsen, O., and Cox, D. R. Inference and asymptotics.

OK, without ruling out any way of coming up with the procedure to obtain intervals, prove that one or more have coverage > 1 – alpha everywhere in the parameter space. All of them are bona fide 1 – alpha confidence intervals, so choose one – any one.

Now, if you used likelihood-shape-based constructions of the interval, it would always exclude 3 or lower (one reason David Cox gives for preferring them). Otherwise they will likely often include some values below three (which are annoying). Remove those and you again have to prove what 1 – alpha* coverage they always exceed in order to call them confidence intervals.

Was this so hard? See Barndorff-Nielsen, O., and Cox, D. R. Inference and asymptotics.

• Mark: Thanks for your reply. I have now had three people e-mail me that one of the standard Bayesian ploys is to raise pseudo-objections and fallacious arguments of the form of “You didn’t answer my question, which proves you’re wrong” even where it’s been answered in the variety of ways the murky question can be interpreted. I guess I hadn’t at first viewed it as a con. Naive.

• rasmusab

Oy, you fell for the oldest trick in the Dutch book. The Bayesian conspiracy has struck again!

Honestly, I don’t see why Alan’s question is a “pseudo-objection”; I actually don’t see how it is an objection at all. What is he objecting against? He’s asking a (not so murky) question about how a data analytic problem can be handled using the methodology you recommend on your blog; he’s not objecting against it. If he was saying something along the lines of “Bayes is the only game in town because you can’t do that or this, ha!”, then that might be a “fallacious arguments” (or perhaps actually just an “argument”), but he’s not doing that.

Reading your blog I get the impression that you want to communicate that Bayesian statistics are worse than useless (but I might be reading you wrong). If someone comes up with a problem that is trivially solved in a Bayesian framework, but where a severity version is not so obvious (to me at least), isn’t it a strange argumentative tactic to just dismiss it as a “ploy”?

• But I fail to see a problem that the Bayesians solve trivially here, that we have trouble with, and I really don’t know what he was getting at. Frankly, if you are told some new information afterwards, it’s the Bayesian that may be stuck since he’s used up all his prior probability. The non-Bayesian has no such problem. As for having it given as an assumption ahead of time, I still see no advantage. Mark and others answered that; however, Matloff brought up the good point that such an assumption might be wrong. We test our assumptions. As I noted elsewhere, mu > 3 would produce data in excess of what we observed ~84% of the time, and thus his assumption is not in sync with the data.
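The tail probability behind that figure can be checked with a short sketch. It assumes the post's setup (σ = 10, n = 100, so the standard error is 1, and an observed mean of 1.96) and evaluates the probability at the boundary value mu = 3; the variable names are illustrative.

```python
from statistics import NormalDist

sigma_x = 1.0     # sigma / sqrt(n) = 10 / sqrt(100), from the post
x_obs = 1.96      # observed sample mean, just at the cut-off
mu_floor = 3.0    # the assumed lower bound for mu

# P(sample mean > x_obs) when mu sits exactly at the assumed floor
p_exceed = 1 - NormalDist(mu=mu_floor, sigma=sigma_x).cdf(x_obs)
print(round(p_exceed, 3))
```

At mu = 3 exactly this comes out near 0.85, and it only grows for larger mu, so the quoted ~84% is, if anything, slightly conservative for the claim mu > 3.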

On your main allegations, you have no basis whatsoever to suggest I’m trying to show Bayesian methods are “worse than useless” (a phrase used to describe biased tests, by the way). You wouldn’t be able to find any such examples, so you should withdraw that. I rarely even discuss Bayesian methods. I only defend against criticisms that betray a misunderstanding of what frequentists are doing. If I wanted to devote space to Bayesian horror stories, I’d use some of the examples from Shalizi, Freedman, and many others, based on problem priors. I’m not very interested in that right now. Nor do you see any posts where I show how the whole Bayesian gig depends on getting the right prior, while we don’t even know what they’re supposed to represent.

If you are honest you will see that I only mention Bayesians when I’m defending error statistics against their criticisms. I never instigate the criticisms. So please stop misrepresenting me. I am simply one of the few people to give voice to the frequentists who are roundly beaten up with howlers which run from the childish to the downright dangerous–and they are certainly not all from Bayesians by any means.
I keep hearing, for instance, that we have it wrong because we don’t condition on the data point! Well if you do that, you are forced to dispense with the basis for criticizing inferences guilty of p-hacking, cherry-picking, optional stopping, etc: error probabilities.

Remember too that I’m a philosopher, and right now the best possible foundational hope for the Bayesians who are not subjectivists is error statistical. So I’m not denouncing them, but defending frequentists and also showing that they need new foundations (as even Gelman admits).

• rasmusab

“it’s the Bayesian that may be stuck since he’s used up all his prior probability”

I didn’t know that you could run out of probability… To me that sounds like running out of integers 🙂

The problem was how to integrate the problem specific information that µ is larger than 3. And generally it’s very hard for me, when working with classical statistics, to see how one can incorporate problem specific knowledge (that doesn’t come in the form of data) in an easy way.

“On your main allegations, you have no basis whatsoever to suggest I’m trying to show Bayesian methods are “worse than useless””

It wasn’t intended as an allegation, but it is honestly how I interpreted a lot that has been written in the comments on this blog. If you believe that Bayesian methods are a useful part of data analysis, then I’m truly sorry for having accused you of thinking otherwise!

“Remember too that I’m a philosopher, and right now the best possible foundational hope for the Bayesians who are not subjectivists is error statistical.”

Well, Jaynes is also pretty good I think 🙂

• Alan

This requires special comment:

“If on the other hand you’re assuming that the population mean μ is known for sure to be greater than or equal to c, you are on very thin ice. This would not count as a physical constraint.”

Imagine this situation. We wish to measure the length of a sofa mu. We use a tape measure with normal errors. We get observations mu+error.

Later on we realize that the sofa sits in a room whose length is precisely known because it was measured using a surveyor’s laser by the architect. That knowledge places an (upper) cutoff on mu. It’s a physical constraint that has to be respected and has nothing to do with the errors from the tape measure or any prior statistical analysis that might have been done. I could multiply this example endlessly since in realistic engineering problems parameters often have firm cutoffs.

• john byrd

It seems obvious to me that you can calculate the CI and truncate off at the value of the length of the room on the upper end of the interval. This truncation is done as a separate step with its own justification and is not a statistical analytical step, but more an appropriate use of additional information. Of course the laser transit has its own error that is not factored in… So maybe stick with the CI…
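The truncation step described above can be sketched as follows, assuming a two-sided normal-theory interval with known standard error. The function name and the use of `None` for an empty intersection are illustrative choices; the numbers reuse the post's standard error of 1 and Alan's lower bound of 3.

```python
from statistics import NormalDist

def truncated_ci(x_bar, sigma_x, lower=float("-inf"), upper=float("inf"), level=0.95):
    """Ordinary normal CI for mu, intersected with the known range [lower, upper]."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    lo, hi = x_bar - z * sigma_x, x_bar + z * sigma_x
    lo, hi = max(lo, lower), min(hi, upper)
    return (lo, hi) if lo <= hi else None    # None marks an empty intersection

print(truncated_ci(1.96, 1.0, lower=3.0))    # interval overlaps the region mu > 3
print(truncated_ci(0.50, 1.0, lower=3.0))    # interval lies wholly below 3
```

If the constraint really holds, the truncated interval covers mu exactly when the untruncated one does, so coverage is unchanged. The second call shows the case the thread keeps returning to, where nothing is left after truncation.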

• Alan, the problem with your sofa example is that you are assuming the errors have mean 0. That’s an assumption, not a physical constraint.

I do believe I have answered all your questions, even though I understand that you don’t think I have. I’m going to leave it at that. Thanks for the interesting interaction.

• I see the conversation is continuing. I have not had time to follow it, but I do have a related question, on which I’d be curious as to the response of the Bayesians in our midst here.

Say the analyst is sure that μ > c, and chooses a prior distribution with support on (c,∞). That guarantees that the resulting estimate is > c. But suppose the analyst is wrong, and μ is actually less than c. (I believe that some here conceded this could happen in some cases in which the analyst is “sure” μ > c.) Doesn’t this violate one of the most cherished (by Bayesians) features of the Bayesian method, namely that the effect of the prior washes out as the sample size n goes to infinity?
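A small numerical sketch of the scenario in this question. It assumes a normal likelihood with known σ = 10, a flat prior on (c, c + 10) as a grid stand-in for a prior on (c, ∞), and a true mean of 2 that lies below c = 3. All names and settings are illustrative.

```python
import math
import random

def posterior_mean_flat_above(x_bar, se, c, width=10.0, steps=20001):
    """Grid posterior mean of mu: flat prior on (c, c + width), normal likelihood."""
    grid = [c + width * i / (steps - 1) for i in range(steps)]
    log_w = [-0.5 * ((x_bar - m) / se) ** 2 for m in grid]
    top = max(log_w)                        # subtract the max for numerical stability
    w = [math.exp(v - top) for v in log_w]
    return sum(m * wi for m, wi in zip(grid, w)) / sum(w)

random.seed(1)
true_mu, sigma, c = 2.0, 10.0, 3.0          # the true mean sits BELOW the assumed floor c
post_means = []
for n in (100, 1_000, 10_000):
    se = sigma / math.sqrt(n)
    x_bar = random.gauss(true_mu, se)       # simulate the sample mean directly
    post_means.append(posterior_mean_flat_above(x_bar, se, c))
    print(n, round(x_bar, 3), round(post_means[-1], 3))
```

However large n becomes, the posterior mean piles up just above c instead of moving toward the true value, since the prior gives zero weight to everything below c.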

• Alan

Matloff,

The short answer is that assuming information such as “mu is greater than c” which isn’t true screws up the analysis. It’s like a mathematician starting a proof by saying “assume 3 is an even number”. If it were possible to consistently get good results from false assumptions, there would be no need to ever get our assumptions right.

The longer answer goes like this. Statisticians can get inferences and their associated uncertainties from probability distributions. If those inferences are true to within those uncertainties, we say the distribution is ‘good’. Statisticians typically do this with posteriors, good posteriors being those that give us interval estimates that jibe with reality. Obviously, though, it can be done for any distribution no matter what its type or purpose.

Therefore, a prior is only ‘good’ if the inferences drawn from it are true to within the implied uncertainties. That’s how Bayesian priors on mu are ‘tested’ even though the prior is modeling the uncertainty in a single value of mu rather than the frequency of multiple mu’s. You simply compare the inferences from the prior and see if it’s consistent with the prior information.

Given the prior with support on (c, infty) we’d infer that “the true mu is greater than c”. If the true mu is less than c, then the prior is ‘bad’ and shouldn’t be used. Using it is equivalent to making a false assumption no different than “assume 3 is an even number”.

• Alan

The moral of the story, Matloff, is that your prior should only say “mu is greater than c” if your prior information guarantees it. If the prior information about mu isn’t strong enough to guarantee it with certainty, you should choose a prior which reflects that and has a larger support than (c, infty).

• rasmusab

Well, using a (c,∞) prior makes a model that “considers” values less than c impossible and is useful when you don’t have the time or need to come up with something more nuanced. But if it seems that the (c,∞) prior is not doing a good job (or if you learn new information) there is nothing stopping you from changing the prior (as you can change other assumptions in the model). So you could say, “All priors are false, but some are useful”.

Of course, if you want to you can put some other prior on μ where you reserve a tiny bit of probability on μ less than 3 and in that model you would have the property that “the effect of the prior washes out as the sample size n goes to infinity”.

• Thanks for the thoughtful comments, Alan and rasmusab. But I think you agree, then, with my point: One of the most famous defenses offered by Bayesians for their methods — that the influence of the prior gradually washes out (“Our answers won’t be much different from those of the frequentists”) — fails in a broad category of situations. The Bayesian philosophy is not quite as advertised.

The other point I’d make in response to your comments (which I’ve mentioned before here and in Andrew Gelman’s blog) is that frequentist methods are robust to bad assumptions, in the sense that one can verify the assumptions via the data (if you have enough of it). By contrast, one can’t do that for a (subjective) prior, by definition, because one is working with only one realization of the parameter θ.

• john byrd

From Alan: “Therefore, a prior is only ‘good’ if the inferences drawn from it are true to within the implied uncertainties. That’s how Bayesian priors on mu are ‘tested’ even though the prior is modeling the uncertainty in a single value of mu rather than the frequency of multiple mu’s. You simply compare the inferences from the prior and see if it’s consistent with the prior information.”

I understand that a Bayesian model– like any model– can be validated by estimating error probabilities that will result from applications of it. That is a good thing and a saving grace. But, consider this need for validation in the context of the toy example of the couch measurement, and it becomes very clear why Mark’s answer was correct, and my suggestion to stick to the CI because a laser transit has its own error makes practical sense for scientists trying to solve problems. If you get a CI with most likely values of mu below 3, you will likely end up having to revise your prior following attempts to validate…

It seems very improbable to me that you can follow the protocol of validating a Bayesian model against real data and end sharply divergent from the CI in a case like that. If you gain advantage by validation in that you obtain more data, then the CI can also be narrowed with the additional data. Two paths to the same end point?

• Those Bayesians who are availing themselves of the “saving grace” of testing assumptions using error probabilities would seem to me to be error statisticians. I don’t see how you can say, condition on the data, hold the likelihood principle, deny use of the sampling distribution post data–as coherent Bayesian have been preaching–and also violate these rules.

• rasmusab

“[…] that the influence of the prior gradually washes out […] fails in a broad category of situations.”

Well, it fails in those situations where you specify a prior with zero support in some parameter region. If you don’t do that, then the influence of the prior will “wash out”. Nothing surprising about that.

“Frequentist methods are robust to bad assumptions, in the sense that one can verify the assumptions via the data (if you have enough of it). By contrast, one can’t do that for a (subjective) prior”.

I agree (I think) that one version of this “subjective” philosophy of Bayesian statistics is surely absurd, where you can just pick and stick with a prior because it feels right, and because it is “your” prior, no need for good arguments or anything. I have never met anyone who actually subscribes to this way of doing Bayesian statistics, but I guess they must have existed. When doing more pragmatic Bayesian statistics, in the sense of, say, Rubin or Jaynes, there is no problem using the data to criticize the model.

• rasmusab

“Those Bayesians who are availing themselves of the “saving grace” of testing assumptions using error probabilities would seem to me to be error statisticians.”

A rose by any other name..?

“I don’t see how you can say, condition on the data, hold the likelihood principle, deny use of the sampling distribution post data–as coherent Bayesian have been preaching–and also violate these rules.”

Well, but you can. *Within the model you are fitting* this is exactly what you do: “condition on the data and hold the likelihood principle”. Outside of the model, you are you, and anything goes, what matters is actual performance. Like the presidential election prediction by Drew Linzer (votamatic.org/). A fully Bayesian model, but that’s not what made the model so good, what made it good was that it predicted so well.

• Alan

Matloff, I’ve never heard anyone claim that if a prior assigns zero probability to the true value of mu that the posterior will settle on the true mu given enough data. Since elementary algebra shows the support of the posterior is a subset of the support of the prior, the claim is trivially false, and I doubt anyone ever did say it was true.

John Byrd, there is no “validated by estimating error probabilities that will result from applications of it” being done. The prior and posterior describe an uncertainty range for a single mu. There are no frequencies to calibrate to. Separately, if x_i = mu+e_i and the measuring instrument gives errors ~N(0,10) as in the post, it’s possible to get a CI entirely below the cutoff. This will happen some small percentage of the time randomly. If we know from other evidence that mu is guaranteed to be greater than the cutoff, then “truncation” will imply the true mu is in the empty set (the intersection of the CI and the interval greater than the cutoff). Is that answer acceptable to you? Mayo seems to indicate it is, and that I’m “stamping [my] feet” over it.

Mayo, for P(mu|A) to do its job it has to faithfully reflect what A says about mu. If it doesn’t, the distribution is “wrong”. If A says “it’s possible mu is less than c” but P(mu|A) says “mu must be bigger than c” then the distribution is bad. P(mu|A) is contradicting what A has to say about mu. That’s the philosophical origin of the ‘test’ and it in no way requires some extra Bayesian ingredient.

Even if it did, in what sense could this secretly be “Error Statistical” when it involves assigning probabilities to hypotheses and uses distributions which aren’t frequency distributions in any way? (This is not a rhetorical question. If everything else is ignored, please answer this one.)
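How often a two-sided 95% interval would land wholly below the cutoff can be worked out directly. The sketch assumes the post's standard error of 1 (σ = 10, n = 100), a cutoff of 3, and a true mu at or above the cutoff; the function name is illustrative.

```python
from statistics import NormalDist

def prob_ci_below_cutoff(mu, cutoff, sigma_x, level=0.95):
    """P(upper CI limit < cutoff) when the true mean is mu."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    # upper limit x_bar + z*sigma_x < cutoff  <=>  x_bar < cutoff - z*sigma_x
    return NormalDist(mu=mu, sigma=sigma_x).cdf(cutoff - z * sigma_x)

for mu in (3.0, 3.5, 4.0):
    print(mu, round(prob_ci_below_cutoff(mu, cutoff=3.0, sigma_x=1.0), 4))
```

At mu equal to the cutoff the probability is 0.025, half the interval's miss rate, and it falls quickly as mu moves above the cutoff. That is the "small percentage of the time" referred to above.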

• john byrd

Alan: It appears that you employ circular reasoning. The prior is to be corrected through “experience” unless it is to be taken as a certainty before application? Makes no sense. This is what I call the self-licking ice cream cone approach to Bayesian philosophy. Establish a prior, take it as meaningful, sell it to others unless the model does not work. If the model performs poorly, change the prior, call it prior information anyway, then repeat process.

You say: “If we know from other evidence that mu is guaranteed to be greater than the cutoff, then “truncation” will imply the true mu is in the empty set.” So, you say we must accept the prior as more important than the data. And also: “Therefore, a prior is only ‘good’ if the inferences drawn from it are true to within the implied uncertainties. That’s how Bayesian priors on mu are ‘tested’ even though the prior is modeling the uncertainty in a single value of mu rather than the frequency of multiple mu’s. You simply compare the inferences from the prior and see if it’s consistent with the prior information.” It appears the latter approach of testing to correct the prior is most reasonable. The latter approach would correct the prior to avoid the empty set.

So, you are faced with a scenario where IF you are willing to allow that your prior is subject to revision when faced with reality, then your Bayesian model will gravitate to the CI solution. Or, you can simply not test it. But then it becomes religion not science.

And, it appears to me that validating a model by comparing its predictions to reality to measure its performance is precisely seeking to minimize error probabilities. Seems obvious to me. I am puzzled that you do not think so.

• John: You bring out a good point: they have to assume something like the single mu that is responsible for the current data itself having been randomly selected from a population of mus. That’s a sample of size 1. We wouldn’t reject a statistical hypothesis on the basis of a sample of size one. So, it’s not clear they can be seen as getting error probabilities, which require a sampling distribution. We’re never just interested in fitting this case, the error probabilities are used to assess the overall capacity of the method to have resulted in erroneous interpretations of data.

And of course, there’s the problem of distinguishing between violated assumptions, like iid, and a violated prior. I note this in my remarks on Gelman and Shalizi’s paper.

• Alan: But the size of the sofa would have to be less than the size of the room it was in; there’s no analogy to the case where the data are indicative of being produced by a smaller mu value than you assume is the minimal size. Maybe this was brought up by someone already; I’m just reviewing the comments I was unable to read earlier.

• e.berk

Alan: If you give equal priors then you’re back to the .5, .5 assignment to the point hypotheses in her example.

• Alan

Assigning a uniform prior does in a sense assign .5 to each hypothesis, but it’s not doing the same as what Mayo does in the post. A trivial Bayesian calculation using a uniform prior gives you numbers for the posterior identical to Mayo’s SEV function. Although they’re being interpreted differently, they do still lead to the same substantive conclusions at the end. One merely says “these are the well warranted values for mu” and the other says “these are the probable values of mu”. The statistician walks away with the identical interval in either case. I can’t see any sense in which they would use them differently.
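The numerical coincidence claimed here can be sketched for the post's example. Two assumptions are made: the uniform prior is taken to be flat over the whole real line (improper), and severity for the claim mu > mu1 is taken as P(sample mean ≤ observed; mu = mu1), which is how it is usually written for test T+. The function names are illustrative.

```python
from statistics import NormalDist

sigma_x, x_obs = 1.0, 1.96    # standard error and observed mean from the post

def sev_mu_greater(mu1):
    # Severity for "mu > mu1": probability of a result no larger than x_obs if mu = mu1
    return NormalDist(mu=mu1, sigma=sigma_x).cdf(x_obs)

def posterior_mu_greater(mu1):
    # Flat prior on the whole line gives mu | x_obs ~ N(x_obs, sigma_x^2)
    return 1 - NormalDist(mu=x_obs, sigma=sigma_x).cdf(mu1)

for mu1 in (0.0, 0.96, 1.96, 2.96, 4.96):
    print(mu1, round(sev_mu_greater(mu1), 3), round(posterior_mu_greater(mu1), 3))
```

The two columns agree because both reduce to Φ((x_obs − mu1)/σx). The agreement is specific to this normal location model with a flat prior and would not be expected to survive a prior restricted to mu > 3.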

3. David Rohde

Do you put the likelihood ratio in quotes because it is not calculated correctly?

4. Michael Lew

What’s wrong with that ratio is that neither alpha nor beta is a likelihood. Therefore the ratio is not a likelihood ratio!

Alpha is a method-related constant that is not data-dependent and so it cannot be a likelihood. Beta is an unknown constant that is dependent on the relationship between the null hypothesised value of the parameter in question and its true but unknown value. Beta cannot be a likelihood either.

If people do assume that (1-beta)/alpha is a likelihood ratio then that is a disaster, but I haven’t seen such a howler.

• Michael: Well you must lead a charmed life, it’s rather old. See the posts on why p-values exaggerate evidence. The test reformers all do it, e.g., Ioannidis. See p. 204 of Senn’s Statistical Issues in Drug Development (2007). He warns you need to be extremely careful in interpreting the resulting posteriors. It’s the basis for a common criticism of using “tail areas”–the evidence against the null is much higher if one computes posteriors this way, which is the way it is presumed they must be computed starting with p-values. Goodman is another source.

• Michael Lew

Mayo, I don’t see how your reply is in any way connected to my comment. I did not mention P-values, tail areas or posteriors so I suspect that you are reading your own expectations into my words. Try reading it again.

• The likelihood ratios that enter into these Bayesian or quasi-Bayesian computations are (1-beta)/alpha or the reverse.

• Michael: In an important context – where all you as the reader have observed is the reported p_value and so your and the other readers’ likelihood is the probability of that observed given a point in the parameter space.

If your assumptions match the paper authors’ assumptions (i.e. you assume the same data generating model for their raw data and the same correct way to calculate the p_value), then your likelihood at the null parameter point would equal their p_value and your likelihood at an alternative parameter point would equal power under those assumptions.

Once you get access to their raw data, you would and should throw all the above away.

All the gory theory details are here http://andrewgelman.com/wp-content/uploads/2010/06/ThesisReprint.pdf

• Keith: Thank you for clarifying this. But if you did that, going to the alternative against which the test has high power, then you get the counterintuitive posterior or Bayes factor, or whatever. So it’s best to use the data itself, as you say, but in numerous examples, people do the former and then blame the “tail area” for exaggerating their posterior. The point is that power was never intended as a “fit” measure, but somewhat the reverse.
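The gap between the two computations can be made concrete for the post's numbers. This sketch assumes the standard error is 1, so the observed mean and z coincide, and compares the ratio based on the event "the test rejects" with the density ratio at the actual observation z = 1.96. Variable names are illustrative.

```python
from statistics import NormalDist

std = NormalDist()
alpha, z_obs, mu_alt = 0.025, 1.96, 4.96

cutoff = std.inv_cdf(1 - alpha)               # about 1.96
power = 1 - std.cdf(cutoff - mu_alt)          # P(Z > cutoff; mu = 4.96)
ratio_from_event = power / alpha              # uses only the fact that the test rejected

# Likelihood ratio of mu = 4.96 to mu = 0 at the observed value itself
lr_from_data = std.pdf(z_obs - mu_alt) / std.pdf(z_obs)
print(round(ratio_from_event, 1), round(lr_from_data, 3))
```

The event-based ratio is close to the post's 40 in favour of mu = 4.96, while the density ratio at z = 1.96 is below 1, favouring mu = 0 over mu = 4.96 by roughly 13 to 1. The two numbers answer different questions, which is the point being made about power not being a fit measure.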

5. Michael Lew

But they are not likelihood ratios! Just saying that they are does not make it so.

• I didn’t say they were; I’ve been saying they’re not, and moreover that they do not properly serve the intended capacity of a comparative fit measure. What I’m saying is that they are used in the places that LRs go in various computations, be they Bayes factors or the like. They are used, also, by the way, in Ioannidis (2005) computations of positive predictive values, with some unhappy consequences. I’ve been saying this for over a year now. I don’t know how typical it is in diagnostics.

• vl

I’m looking at Ioannidis 2005 and the word “likelihood ratio” (or even “likelihood”) is not anywhere in the manuscript. It’s been a while since I read that paper but I remember it being mostly a series of straight product rule probability calculations.

• So you don’t see power mentioned anywhere at all? no alpha? no use of power in the straight product rule to get posteriors and odds? Maybe I should extend my query to what’s wrong with taking the power/size ratio in getting posterior probs, odds or what have you?

• The word “posterior” is not in there at all. I think we’re barking up the wrong tree if we’re thinking that Ioannidis is saying anything about an assessment of evidence regarding H1 vs. H0. He’s doing straight probability calculations using a purely frequency definition of probability; there’s not a “subjective” or “plausible” use of probability anywhere. The results are simulations, after all.

Larry runs through the calculation here:

https://normaldeviate.wordpress.com/2012/12/27/most-findings-are-false/

He doesn’t bring up any qualms about evidential interpretations (because, as far as I can see, there don’t seem to be any).

BTW I’m not defending Ioannidis here, I agree with LW that the calculations seem too obvious to have merited the impact it had, but oh well…

• You’re missing the point. If you get a positive predictive value of, say, .1 for
H*: there’s a relationship
(he only has dichotomous inference) then he’d regard it as poor grounds for H*, low evidence of replicability of the effect or the like. He even slips into speaking of degree of credibility for the hypothesis–I know it’s an easy slide, and anyway, it’s not the point of my post. But it’s relevant to your comment.

H* might be the alternative against which the test has .8 power. The only alternative is the null. (Obviously not exhaustive.)
I disagreed with Wasserman’s readiness to allow this blurring of testing concepts and diagnostic screening constant, even though, of course, it’s right that one shouldn’t confuse the type 1 error probability in tests, with 1- PPV.
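For readers following the PPV discussion, the screening-style computation at issue can be sketched as below. It is the usual product-rule formula, with power in the role of sensitivity and α in the role of the false-positive rate. The prevalence values (the assumed proportion of true effects among hypotheses tested) are illustrative assumptions, not figures from Ioannidis or Wasserman.

```python
def ppv(power, alpha, prevalence):
    """Positive predictive value: Pr(effect is real | test rejects).

    prevalence is the assumed proportion of tested hypotheses with a real effect.
    """
    true_pos = power * prevalence
    false_pos = alpha * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)

# Equal "priors" of .5: PPV reduces to r / (1 + r) with r = power / alpha.
ppv_even = ppv(power=0.8, alpha=0.05, prevalence=0.5)

# Rare true effects (illustrative 1%): the same test gives a much lower PPV.
ppv_rare = ppv(power=0.8, alpha=0.05, prevalence=0.01)

print(f"PPV, prevalence .5:  {ppv_even:.3f}")
print(f"PPV, prevalence .01: {ppv_rare:.3f}")
```

With prevalence .5 the result depends only on the power/size ratio, which is the same place that ratio occupies in the posterior computation in the post.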

6. Matloff: (No room at the bottom of your comment) I share your worry about Bayesian hunches (however well meaning) entering with minimal checkpoints, whereas error probability checks are direct. How could one not be skeptical when we read things like this about the FDA:
http://www.slate.com/articles/health_and_science/science/2015/02/fda_inspections_fraud_fabrication_and_scientific_misconduct_are_hidden_from.html?wpsrc=sh_all_dt_tw_bot

7. All: I’m always interested to hear Matloff’s views, and he’s invited to continue if he wishes, but I have to say that we’ve gone off topic, so far as I can see, and I dislike having just any post become a forum for discussing EVERYTHING when in fact we’ve taken up issues individually. I’m not about to try to teach my entire statistical philosophy in answering a blog comment. Check “all she wrote so far” to see all the posts for 3 years up until Sept. 2014; then the subsequent months under the archives.
I’d still like to go back to the issue at hand which hasn’t really been discussed much here.