With this post, I finally get back to the promised sequel to “Breaking the Law! (of likelihood) (A) and (B)” from a few weeks ago. You might wish to read that one first.* A relevant paper by Royall is here.

Richard Royall is a statistician^{1} who has had a deep impact on recent philosophy of statistics by giving a neat proposal that appears to settle disagreements about statistical philosophy! He distinguishes three questions:

1. **What should I believe?**
2. **How should I act?**
3. **Is this data evidence of some claim? (or How should I interpret this body of observations as evidence?)**

It all sounds quite sensible–*at first*–and, impressively, many statisticians and philosophers of different persuasions have bought into it. At least they appear willing to go this far with him on the 3 questions.

How is each question to be answered? According to Royall’s ~~commandments~~ writings, what to believe is captured by Bayesian posteriors; how to act, by a behavioristic, N-P long-run performance. And what method answers the evidential question? A comparative likelihood approach. You may want to reject all of them (as I do),^{2} but just focus on the last.

Remember, with likelihoods the data *x* are fixed and the hypotheses vary. A great many critical discussions of frequentist error statistical inference (significance tests, confidence intervals, p-values, power, etc.) start with “the law”. But I fail to see why we should obey it.

To begin with, a report of comparative likelihoods isn’t very useful: *H* might be less likely than *H*’, given *x*, but so what? What do I do with that information? It doesn’t tell me I have evidence against or for either.^{3} Recall, as well, Hacking’s points here about the variability in the meanings of a likelihood ratio across problems.

Royall: “the likelihood view is that observations [like **x** and **y**]…have no valid interpretation as evidence in relation to the single hypothesis *H*.” (2004, p. 149). In his view, all attempts to say whether **x** is good evidence for *H* or even if **x** is better evidence for *H* than is **y** are utterly futile. Only comparing a fixed **x** to *H* versus some alternative *H’* can work, according to Royall’s likelihoodist.

Which alternative to use in the comparison? Should it be a specific alternative? A vague catchall hypothesis? (See Barnard post.) A maximally likely alternative? An alternative against which a test has high power? The answer differs greatly based on the choice. Moreover, an account restricted to comparisons cannot answer our fundamental question: is **x** good evidence for *H* or is it a case of BENT evidence (bad evidence no test)?

His likelihood account obeys the Likelihood Principle (LP) or, as he puts it, the “irrelevance of the sample space”. That means ignoring the impact of stopping rules on error probabilities. A 2 s.d. difference from “trying and trying again” (using the two-sided Normal tests in the links) or a fixed sample size registers exactly the same, because the likelihoods are proportional. (On stopping rules, see this post, Mayo and Kruse (2001), EGEK (1996, chapter 10); on the LP see Mayo 2014, and search this blog for quite a lot under SLP).
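To make the stopping-rule point concrete, here is a minimal simulation sketch. It is an illustration under stated assumptions, not a reconstruction of the tests in the linked posts: the data are independent N(0, 1) draws (so the null hypothesis of mean 0 is true), the standard deviation is treated as known, and the function name `prob_nominal_rejection`, the cap of 100 looks, the number of trials and the seed are all hypothetical choices made for the example.

```python
import math
import random


def prob_nominal_rejection(max_n, trials=2000, seed=1):
    """Estimate P(|mean| >= 2 standard errors at some n <= max_n) when H0 is true.

    Data are simulated as independent N(0, 1) draws, so the null hypothesis
    (mean 0, known standard deviation 1) holds in every simulated trial.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        total = 0.0
        for n in range(1, max_n + 1):
            total += rng.gauss(0.0, 1.0)
            # z-statistic for the mean after n observations: sum / sqrt(n)
            if abs(total) / math.sqrt(n) >= 2.0:
                hits += 1
                break  # "try and try again" stops at the first 2 s.d. difference
    return hits / trials


# One look: under H0 the z-statistic is N(0, 1) at any fixed sample size,
# so a single look at n = 1 stands in for the fixed-sample-size case.
fixed_n = prob_nominal_rejection(1)
# Up to 100 looks, stopping as soon as a 2 s.d. difference appears.
optional = prob_nominal_rejection(100)
print(f"single look: {fixed_n:.3f}   up to 100 looks: {optional:.3f}")
```

In both designs the reported result is “a 2 s.d. difference”, and the likelihood function for the mean has the same shape; what differs is how often a true null would produce such a report, and that is exactly the information the LP sets aside.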

When I challenged Royall with the optional stopping case at the ecology conference (that gave rise to the Taper and Lele volume), he looked surprised at first, and responded (in a booming voice): “But it’s a law!” (My contribution to the Taper and Lele volume is here.) Philosopher Roger Rosenkrantz remarks:

“The likelihood principle implies…the irrelevance of predesignation, of whether an hypothesis was thought of beforehand or was introduced to explain known effects.” (Rosenkrantz, p. 122)

[What a blissful life these likelihoodists live, in the face of today’s data plundering.]

Nor does Royall object to the point Barnard made in criticizing Hacking when he was a likelihoodist:

Turning over the top card of a shuffled deck of playing cards, I find an ace of diamonds:

“According to the law of likelihood, the hypothesis that the deck consists of 52 aces of diamonds (H_{1}) is better supported than the hypothesis that the deck is normal (H_{N}) [by the factor 52]…Some find this disturbing.”

But not Royall.

“Furthermore, it seems unfair; no matter what card is drawn, the law implies that the corresponding trick-deck hypothesis (52 cards just like the one drawn) is better supported than the normal-deck hypothesis. Thus even if the deck is normal we will always claim to have found strong evidence that it is not.”
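The arithmetic behind the card example can be sketched in a few lines. This is only an illustration of the quoted comparison; the helper names (`prob_top_card`, `trick_deck`, `lr_trick_vs_normal`) are hypothetical, and the “trick deck” is taken, as in the quotation, to be 52 copies of whichever card is drawn.

```python
from fractions import Fraction
from itertools import product

RANKS = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
SUITS = ["clubs", "diamonds", "hearts", "spades"]
NORMAL_DECK = [(rank, suit) for rank, suit in product(RANKS, SUITS)]


def prob_top_card(deck, card):
    """Probability that the top card of a well-shuffled deck is `card`."""
    return Fraction(deck.count(card), len(deck))


def trick_deck(card):
    """A deck made of 52 copies of the same card."""
    return [card] * 52


def lr_trick_vs_normal(card):
    """Likelihood ratio of the matching trick-deck hypothesis to the normal deck."""
    return prob_top_card(trick_deck(card), card) / prob_top_card(NORMAL_DECK, card)


# Whatever card turns up, the trick-deck hypothesis built around it is
# favored over the normal deck by the same factor.
ratios = {card: lr_trick_vs_normal(card) for card in NORMAL_DECK}
print(set(ratios.values()))
```

The point of looping over all 52 cards is that the ratio never depends on the draw: the hypothesis is constructed after the fact to fit the outcome perfectly, so the comparison comes out the same way every time, even when the deck is normal.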

To Royall, it only shows a confusion between evidence and belief. If you’re not convinced the deck has 52 aces of diamonds “it does not mean that the observation is not strong evidence in favor of *H*_{1} versus *H*_{N}.” It just wasn’t strong enough to overcome your prior beliefs. Now Royall is no Bayesian, at least he doesn’t think a Bayesian computation gives us answers about evidence. (Actually, he alludes to this as a frequentist attempt, at least in Taper and Lele). In his view, evidence comes solely from these (deductively given) comparative likelihoods (1997, 14). (I don’t know if he ever discusses model checking.) An appeal to beliefs enters only to explain any disagreements with his “law”.

Consider Royall’s treatment of the familiar example where a positive diagnostic result is more probable under “disease” than “no disease”. Then, even if the prior probability for disease is small enough to result in a low posterior for disease, “to interpret the positive test result as evidence that the subject does not have the disease is never appropriate––it is simply and unequivocally wrong. Why is it wrong?” (2004, 122).

Well you already know the answer: it violates “the law”.

“[I]t violates the fundamental principle of statistical reasoning. That principle, the basic rule for interpreting statistical evidence, is what Hacking (1965, 70) named the law of likelihood. It states:

If hypothesis A implies that the probability that a random variable X takes the value x is p_{A}(x), while hypothesis B implies that the probability is p_{B}(x), then the observation X = x is evidence supporting A over B if and only if p_{A}(x) > p_{B}(x), and the likelihood ratio, p_{A}(x)/ p_{B}(x), measures the strength of that evidence.” (Royall, 2004, p. 122)

“This says simply that if an event is more probable under hypothesis A than hypothesis B, then the occurrence of that event is evidence supporting A over B––the hypothesis that did the better job of predicting the event is better supported by its occurrence.” Moreover, “the likelihood ratio, is the exact factor by which the probability ratio [ratio of priors in A and B] is changed”. (ibid. 123)
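A small numerical sketch of the diagnostic example may help show how the likelihood ratio and the posterior can pull apart. The numbers are assumptions chosen only for illustration (they are not Royall’s): a test with probability 0.95 of a positive result under “disease”, 0.05 under “no disease”, and a prior probability of disease of 0.001. The function names are likewise hypothetical.

```python
def likelihood_ratio(p_a, p_b):
    """Law-of-likelihood measure: p_A(x) / p_B(x) for the observed x."""
    return p_a / p_b


def posterior_prob(prior, lr):
    """Update a prior probability for A by multiplying the prior odds by the ratio."""
    prior_odds = prior / (1.0 - prior)
    posterior_odds = prior_odds * lr
    return posterior_odds / (1.0 + posterior_odds)


# Assumed, illustrative values (not taken from Royall's text)
p_pos_given_disease = 0.95      # P(positive | disease)
p_pos_given_no_disease = 0.05   # P(positive | no disease)
prior_disease = 0.001           # prior probability of disease

lr = likelihood_ratio(p_pos_given_disease, p_pos_given_no_disease)
post = posterior_prob(prior_disease, lr)

print(f"likelihood ratio (disease vs. no disease): {lr:.1f}")
print(f"prior: {prior_disease:.4f}  ->  posterior: {post:.4f}")
```

With these assumed numbers the ratio is well above 1, so on Royall’s law the positive result is evidence for “disease” over “no disease”, even though the posterior probability of disease stays small. The ratio is the factor by which the prior odds are multiplied, which is the sense of the quoted “exact factor”.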

There are basically two ways to supplement comparative likelihoods: introduce other possible hypotheses (e.g., prior probabilities) or other possible outcomes (e.g., sampling distributions, error probabilities).

**NOTES**

*Like everyone else, I’m incredibly pressed at the moment. It’s either unpolished blog posts, or no posts. So please send corrections. If I update this, I’ll mark it as part (C, 2nd).

^{1} Royall, retired from Johns Hopkins, now serves as Chairman of Advisory Board of Analytical Edge Inc. “Prof. Royall is internationally recognized as the father of modern Likelihood methodology, having largely formulated its foundation and demonstrating its viability for representing, interpreting and communicating statistical evidence via the likelihood function.” (Link is here).

*[Incidentally, I always attempt to contact people I post on; but the last time I tried to contact Royall, I didn’t succeed.]*

^{2} I consider that the proper way to answer questions of evidence is by means of an error statistical account used to assess and control the severity of tests. Comparative likelihoodist accounts fail to provide this.

^{3} Do not confuse an account’s having a rival–which certainly N-P tests and CIs do–with the account being merely comparative. With the latter, you do not detach an inference, it’s always on the order of x “favors” H over H’ or the like. And remember, that’s ALL statistical evidence is in this account.

**REFERENCES**

Mayo, D. G. (2004). “An Error-Statistical Philosophy of Evidence,” 79-118, in M. Taper and S. Lele (eds.) *The Nature of Scientific Evidence: Statistical, Philosophical and Empirical Considerations.* Chicago: University of Chicago Press.

Mayo, D. G. (2014) On the Birnbaum Argument for the Strong Likelihood Principle (with discussion & rejoinder). *Statistical Science* 29, no. 2, 227-266.

Mayo, D. G. and Kruse, M. (2001). “Principles of Inference and Their Consequences,” 381-403, in D. Corfield and J. Williamson (eds.) *Foundations of Bayesianism*. Dordrecht: Kluwer.

Rosenkrantz, R. (1977) *Inference, Method and Decision*. Dordrecht: D. Reidel.

Royall, R. (1997) *Statistical Evidence: A likelihood paradigm*, Chapman and Hall, CRC Press.

Royall, R. (2004), “The Likelihood Paradigm for Statistical Evidence” 119-138; Rejoinder 145-151, in M. Taper, and S. Lele (eds.) *The Nature of Scientific Evidence: Statistical, Philosophical and Empirical Considerations.* Chicago: University of Chicago Press.

It seems to me that your criticisms of Royall’s likelihoodist account on grounds that *you* find important/compelling are akin to Bayesians’ criticisms of frequentist accounts on Bayesian grounds.

Corey: Have you read Royall? He’s the one criticizing Fisherian and N-P tests using a criterion that goes radically against theirs. And the reason it does is precisely the reason I object to his comparative likelihood account–it dismisses the sample space and selection effects. I think I’ve spent many years arguing against that type of account in books and papers. But he’s free to be a likelihoodist (there are few pure likelihoodists.) What puzzles me is why his position would be considered a legitimate criticism of an error statistical account.

You did not mention Royall’s recommended cutoffs for deciding one hypothesis is better supported than another (e.g. 8:1 or 32:1). How can one justify such steps to mitigate making erroneous inferences while not acknowledging sample space? (It seems Lew’s paper showed the relation between likelihoods and p-values, and thus the sample space).

John: I take it that Royall’s cut-offs refer to intuitively chosen weights, along the same lines Bayesians might choose Bayes factors as satisfactory.

I read Royall’s little book about five years ago; it’s gone a little fuzzy at this point.

Corey: I have a link to his paper here.

It seems to me that there are two important issues that are regularly ignored or confused. First, the hypotheses that likelihoods can be usefully calculated for are always parameter values, and for the likelihoods to be usefully compared the parameters in question have to be parameters of the same statistical model.

It is common to end up with a continuous likelihood function, and so discussions that focus on pairs of pre-specified hypotheses preclude consideration of some important aspects of a likelihood analysis: it can show what parameter values are favoured by the evidence, not just the relative favouring of particular hypotheses. Mayo, when you ask for specific hypotheses for the likelihood ratio, you show that you are not thinking about the likelihoods as points on the continuous likelihood function that usually is the product of a likelihood analysis. Notice that Royall, Edwards and Pawitan all fill their likelihood books with graphs of functions rather than ratios of pairs of likelihoods. You don’t have to focus on parameter values (hypotheses) that are pre-specified.

The second issue that is missed is perhaps even more important. Certainly it is more general in scope. It is the fact that a scientist should make inferences on the basis of strength of evidence _and_ possible error rate _and_ gain or loss functions _and_ prior information _and_ an understanding of the strengths and weaknesses of the experimental approaches that yielded the data, and probably also on the basis of some aspects of evidence that are not well described or understood. There is nothing in the likelihood principle that requires inference to be made on the basis of only a likelihood ratio, just as no-one should be making inference only on the basis of a P-value. Yes, I know that the true richness of scientific inference procedures is hard to deal with philosophically, but reality is what it is.

Michael: I’m not sure whether you’re speaking just of Royall’s particular treatment of evidence? (just the EVIDENCE question as he defines it). He specifically denies he can handle compound hypotheses, except in very special cases. Certainly not our general one-sided tests. It would help the conversation to be clear if we are talking of different likelihood accounts. I linked a short article of his.

I never said the LP requires inference to be made on a likelihood ratio, but I am speaking of Royall’s account of evidence. If you add more, then you’re strictly outside his account of evidence. So let’s stick to the restricted topic for the moment, and see if one can live within it or not.

I am not clear on the meaning of your last sentence, but never mind.

Mayo, I’m not restricting my comment to Royall’s account of evidence, although I do agree with most of it. I also agree with most of Edwards’ account, but not all.

(I do not think that Royall has correctly represented P-values. The relationship between likelihood and one-sided tests is explored in my paper that John Byrd refers to. I would be very pleased if you would read it. arxiv.org/abs/1311.0081 My disagreement with Edwards centres on his prior likelihoods idea, which I prefer to see as a way to pretend that prior probabilities are something other than what they are. His prior likelihoods may be a simple mechanism for generating calibrated priors, though.)

My comment refers specifically to the process of making inferences and to the nature of the hypotheses for which likelihood is a useful measure.

My last sentence was intended to refer to the difficulty of exploring the philosophical basis of a set of interrelated and interlocking approaches that can and should be used for scientific inference. It is my opinion that much of the difficulty that you have faced in getting your severity concepts accepted and used relates to the fact that their real novelty and utility comes from them crossing the borders between frequentist (pre-data perspective) and likelihood (post-data, evidence perspective) ideas.

Michael: I’d very much like to restrict the discussion to Royall’s conception at least at first, or all we’ll do is leave people perplexed. That your view might go beyond it will emerge, I would think, as a result of the shortcomings with Royall’s approach.

I of course agree that severity crosses the borders between pre-data and post data–and hopefully people will soon realize that this is what we do every day when we criticize inferences to H for having made it too easy to find evidence favoring H, even if H is specifiably false.

There’s pre-data planning and post-data interpretation, and an account needs both.

Well, I’m reminded of a great remark by Peirce, and if Keith comes around I’ll mention it.

I will look at your paper, I seem to think I did at one point.

Michael: I read your paper, “To P or not to P”, or actually reread it, since it came around at some point last year.

http://arxiv.org/pdf/1311.0081v1.pdf

Is it published somewhere?

It’s always interesting to me to come across a slightly different permutation on the views that differ from my own. It’s just too bad that in defending P-values, you defend a conception quite at odds with Fisher’s and Cox’s (and all others I know, with rare exceptions). So, rather than defending correct uses of P-values, the upshot of your paper inadvertently endorses some of the worst abuses of P-values, i.e., failing to adjust P-values due to stopping rules, cherry-picking, selection effects and the like.

Why do you think Cox insists on adjustments even though he accepts an evidential use of P-values (by the way, have you read Mayo and Cox 2010?)

On your view he’s just wrong to do that. Right?

You claim that the only ground for such adjustments would be low long run error rates of the detested N-P sort. So Cox, according to you, is confused about the function of P-values, when he says adjustments are required (in cases of certain selection effects) to restore the proper function of P-values. N-P is quite a lot closer to Fisher than you think, and both are open to evidential construals that are NOT mere summaries of likelihoods.

It’s a classic, but serious blunder, I argue.

To say you just don’t care if a procedure has a high probability of calling observed differences “rare under Ho”, when they are actually entirely common under Ho, is to forfeit the central value of P-values.

You must have disagreed with my recent post “Are P-values error probabilities?”

https://errorstatistics.com/2014/08/17/are-p-values-error-probabilities-installment-1/

I agree with what you say about P-values and sample size.

Mayo, thanks for your comments. The paper has now been rejected by three journals, but I am almost finished a major re-write. I’m confident that the paper has a novel perspective and valuable insight that is worth the heartache of repeated submission. Maybe.

Yes, I do disagree with some of Cox’s, some of Fisher’s and some of your writings. I also disagree with some of Royall’s writing. However, each of them has a lot to offer. I think that the problem is that the word ‘evidence’ is being used in too many different ways.

The strength of evidence and the parameter values that are supported by data can be had from a likelihood function, and that likelihood function is indexed by a P-value. The likelihood principle is sound, certainly as a first-cut normative principle and probably also as a philosophical consequence of the standard intuitive conception of evidence. (Whatever that is!)

The error rates that might attend various behavioural responses to the P-value (or the likelihood function) are not measurable in terms of strength of evidence, but may well be important considerations for inference and for the direction of further experimentation. ‘Correction’ of a P-value for multiplicity and for stopping rules changes the relationship between the P-value and the evidence, and is thus undesirable. I prefer to consider the error rate consequences of stopping rules separately from the evidence in the data. If we do that then we need not worry so much about the apparent contradictions between likelihood and frequentism and so we might be able to devise a philosophically principled description of evidence and its role in inference.

Your severity curves seem to me to have potential as an approach that integrates the evidential meaning of a likelihood function with error rate considerations that supplement the evidence for the purposes of inference. However, if you continue to battle against the likelihood principle you will not get far, in my opinion. (You must have noticed that likelihood functions and severity curves relate to each other in a one-to-one manner if there is nothing to ‘correct’ in the stopping rules and no multiplicity of testing.)

“I prefer to consider the error rate consequences of stopping rules separately from the evidence in the data. ”

In my book I reply to this stance by saying something like: “There are some people who say we’ll take account of stopping rules and selection effects later. Well this IS later!” We’re making inferences now!

“However, if you continue to battle against the likelihood principle you will not get far, in my opinion. ”

I’m amazed anyone could say this to me (especially after I recently found holes in the classic Birnbaum “proof”.*) And if you really followed it, you wouldn’t have to take account of selection effects “later on”–a cop out. I don’t have to “battle” the LP, anyone who accepts it gives up the incredible power that enables statistical method to solve inductive problems. Outside statistical contexts, the same LP-violating strategies are at the very heart of human knowledge and deception discovery.

*https://errorstatistics.com/2014/09/06/statistical-science-the-likelihood-principle-issue-is-out/

You did not answer my question about Cox being confused on this one point. Saying you don’t agree with all of Cox is, once again, a cop out.

Mayo, I guess I’ll have to say that you might have skimmed my paper too quickly. Your response suggests that you did not notice the idea that evidence has different dimensions. The likelihood function shows the strength and specificity of the evidence and, it seems to me, the error rates show something else: veracity or reliability or something like that.

You ask what I think the reason is that Cox wants ‘adjustments’ while accepting the evidential aspects of P-values? I think that Cox is, like others, trying to accommodate the need to deal with error rates of procedures at the same time as the evidence in the data. Of course. However, as I tried to show in my paper, the two parts of the information should have different types of effects on the inferences and so it is not a good idea to combine them into a single number. Better to deal with them separately.

Consider the task of choosing a new car. You have a certain preference for colour and a preference for style and another preference for fuel economy. You could attempt to come up with an algorithmic function that combines the three properties of the car into a single numerical value (observed colour adjusted for style, style adjusted for reported fuel economy, etc.) and choose the car on that basis. However, you should probably consider each aspect separately and come to a decision on the basis of your hard-to-express and hard-to-quantify preference system. The likelihood function from the data, the error rates of the procedure and the other relevant information (subjective and objective) are all relevant to the choice of inference in the same way that colour, style and fuel economy are considerations in choosing a car.

I’m sorry to have offended you by suggesting that you consider likelihood to be useful. It is though. And your ‘disproof’ of Birnbaum’s ‘proof’ does not really make much difference to how I view likelihood functions because I don’t see that it is necessarily the case that statistical principles like the likelihood principle and the repeated sampling principle need necessarily be subject to mathematical proof. (I haven’t seen a proof of the latter principle, but it nonetheless seems to be a good idea to prefer procedures with low error rates.)

Michael: Even though I read your paper somewhat quickly, I checked the topic of each paragraph. (I thought there was much more than needed on 1-sided v 2-sided by the way). The only place I saw a reference to different dimensions is a sentence on p 17. There you talked about inference being like estimation, and I got no sense of your saying error probs show a different dimension of evidence. What page did you have in mind?

Royall would certainly say they show a different dimension–one that is relevant for action, I guess. If the idea of evidence having different dimensions, akin to different qualities in choosing a car or whatever, is central, you might have put it in the introduction of the paper. To me, ignoring the error probabilities–as I propose to use them, i.e., for severity assessments, not long-runs–would be like saying look at price, color, air-conditioning and radio and ignore whether the car actually drives! I’ll worry if it’s got a gas pedal later on, for now I’m into color and price. There are certain essential things required in testing a car, and the same goes for testing a hypothesis.

So anyway, if the different dimensions of evidence are important to your paper, I’d recommend fleshing the idea out in detail and placing it right at the top, explaining why some properties are more essential for interpreting evidence than others, how you deal with conflicts, etc.

I think I’m doing it better in the revision that is underway right now.

Mayo, in reviewing my recent comments on your blog and your responses it became clear to me that I had misremembered the content of my arXived paper. That is probably because I have spent so long on revising it that I was thinking that it contained some of the new material. I apologise.

I’ve sent you the current draft of my rewrite by email, and will post it on arXiv when I have finished it (as long as the journal that I decide to send it to allows arXiv).

Michael: Thanks for clarifying. I didn’t think I could have missed it, and was interested to know how to locate it. I look forward to reading your revised paper.

Michael: By the way, on your remark: “I’m sorry to have offended you by suggesting that you consider likelihood to be useful.” I don’t know where in the world you got this, you must be confusing my remark that I reject the likelihood principle. I don’t reject the use of likelihoods, which would make no sense. It’s too bad that when N-P were setting out tests they spoke of a “likelihood principle” whereby they meant using likelihood ratios as a basis for building test statistics whose distribution under test hypotheses then needed to be calculated (to get error probs). They later called it the lambda criterion. But I don’t think that’s what confused you.

Michael: I can see the relation between the p-value and the likelihood function that was demonstrated in your simulations. It seems to me that recognition of the relationship would take the air out of the bag of those who claim that the sample space is irrelevant. At least to the extent that they fear that consideration of outcomes that might have happened will send us off in the wrong direction. Seems to me that is an irrational fear, and that the sample space is a fundamental component in understanding how any statistic will vary under use, and can be linked to a likelihood function via the p-value. You say the LP makes sense as a first cut principle, but do you also see the relevance of the sample space in making inferences? At some point, do we ignore the LP because we know that sampling error is a reality to be considered in inference?

John: Explain more about the air out the bag, and I don’t mean appraising a car’s airbags. If it does support the relevance of the sample space, then is your point that Michael should not be denying this? I did like his handling of dealing with tail areas. I have a post on that, but with a different take.

Mayo: I am talking about the hot air bag I see so often in rhetorical writing that criticizes significance testing and other error prob-oriented approaches because they defy the LP. We are told that it is an abuse of reason to make inferences that take heed of outcomes that only could have occurred. Yet, Lew’s simulations show that p-values hold a deep relationship to the likelihood function, and the message cuts both ways. The likelihood function is what Fisher always said it was, a gateway to understanding what the sample says about the various possible parameter values under the model. But, the p-value is the proactive way to take account of uncertainty in the inference while making the inference. And the p-value is not estranged from likelihood functions. Its message tells you about strength of evidence tempered by the shortcomings of the method used to derive the sample data.

John, the sample space is obviously relevant for inference, and yes, I agree that the relationship between P-values and likelihood functions shows that it is a mistake to think that P-values conflict with the likelihood principle because of their reference to the sample space, results that might have been observed but were not.

The likelihood principle doesn’t say that inferences should be based _only_ on the likelihood function. (Or should I say that it shouldn’t be expressed in a way that says that.) Instead, the evidence in the data is in the likelihood function and so the effect of the evidence on the inference comes from consideration of the likelihood function. The effect of experimental design features on the inference should not be by way of adjustment of the strength of evidence, but by way of what Abelson called ‘principled argument’.

Michael: you’ve got some confusions running through this still. Just to mention one: “it is a mistake to think that P-values conflict with the likelihood principle because of their reference to the sample space, results that might have been observed but were not.” But P-values do/would conflict with the LP if inferences were altered by outcomes other than the one observed. You can try to pin the difference on the prior, as some do–which introduces brand new problems I discuss in the Mayo and Kruse link to this blog. Good luck.

I think Michael has a different definition for the LP than I have seen. I think that most presentations of the LP focus on the sample being the evidence in a standalone fashion, which forces the need for direct comparisons of hypotheses. Of course, everyone sees something a little different in a figure, but what I see in the figures resulting from the simulations are clear indicators of the reality of sampling error and its relation to the likelihoods via the p-value. So, I take this as a solid refutation of the LP (defined as stating amongst other things that outcomes that could have occurred are to be ignored). Perhaps I should call that version the Naivete Principle (NP).

Going back to Royall, could it be that his 8:1 and 32:1 business is precisely because he is aware of the relation of the likelihood function to the sample space in practice? The Law of Likelihood is foolhardy because of the effects of sampling error?

John, you are correct to think that there are many versions of the likelihood principle and that it is sometimes taken as a principle that insists that we should ignore all counter-factual outcomes, and thus ignore the sampling space. However, that is a consequence of a possible interpretation of the likelihood principle but not the principle itself. I think that it is a mistaken interpretation.

The likelihood principle at a minimum says that the evidence in the data is in the likelihood function. It doesn’t have to say that inference has to be made only on the basis of that evidence, but it is sometimes stated in a manner that implies exactly that. In my opinion, it should not. If it did then it would rule out Bayesian inferences on the basis of the posterior probabilities and also rule out any consideration of protocol-related error rates.

By showing the relationship between P-values and likelihood functions, my simulations show that the relationship between the data and its sampling space is a determinant of the evidence. However, the error rate characteristics of that sampling space are not relevant to the evidence, as they are not relevant to the likelihood function. That does not mean that error rates are not relevant to inference, just that they are not part of the evidence.

The difficulty in communicating this seems to lie in the word ‘evidence’. The likelihood principle effectively defines ‘evidence’ to mean that aspect of the data that determines the likelihood function. Non-likelihoodlums want to own the word ‘evidence’ as well, but they cannot without disputing the validity of the likelihood principle. Thus in order to move our understanding forward we need to be explicit about what evidence consists of, what it tells us, how to measure or scale it, and what we should mix with it for scientific inference and how. So far I have not seen a clear synthesis of all of that.

Michael:

I hope we have managed to pin down the meaning of the LP (SLP) so that should not enter as a source of confusion. You say:

“However, the error rate characteristics of that sampling space are not relevant to the evidence, as they are not relevant to the likelihood function. That does not mean that error rates are not relevant to inference, just that they are not part of the evidence.”

But this is just to assume the LP and that’s precisely what error statisticians deny. If the method by which you found your “evidence” for H has terrible error probabilities, then it fails to warrant inferring H. If someone wishes to say, nevertheless there’s great evidence for H, or like Royall, great evidence for H compared to some other H* of his choosing with much lower likelihood, then I say he’s got no account of evidence or inference at all.

We can deductively assert a bunch of comparative likelihoods, given the model (and I don’t know how he checks his models) and fail utterly at the main tasks for statistical inference: e.g., is there a real effect, do the data warrant a discrepancy as large as d? Inductive inference must go beyond a report of pairs of deductively entailed likelihoods for various hypotheses one dreams up.

Mayo, I agree entirely that if the evidence comes from a method with bad error rate properties then the evidence does not warrant inference in the same way as evidence of similar strength from a better behaved method would.

Would ‘error statisticians’ deny the likelihood principle if they realised that it does not have to imply that inference should be based ONLY on the evidence as depicted by the likelihood function? I think not, and I don’t see how the likelihood principle has to be taken as implying that inference has to be based only on evidence as depicted in the likelihood function. (Yes, I know that some statements of the likelihood principle say things like ‘proportional likelihood functions support the same inference’, but that is an over-reach in my opinion. So are the corollaries to the LP that say that counter-factual results should not be part of inference. So are the interpretations of the stopping rule principle that assume that it bans consideration of stopping rules from inference: it should just say that the stopping rules do not affect the evidence depicted in the likelihood function.)
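The point about stopping rules and the likelihood function can be sketched with the textbook coin-tossing comparison. This is only an illustration under assumed data (9 heads and 3 tails, a null value of 0.5, and the function names below are all hypothetical, not taken from this discussion): the same record arises either from a fixed number of tosses or from tossing until the third tail. The two likelihood functions differ only by a constant factor, while the one-sided tail probabilities differ.

```python
from math import comb

# Illustrative data (an assumption, not from the discussion): 9 heads, 3 tails.
HEADS, TAILS = 9, 3
N = HEADS + TAILS


def lik_fixed_n(theta):
    """Likelihood of theta = P(heads) when N tosses were fixed in advance."""
    return comb(N, HEADS) * theta**HEADS * (1 - theta)**TAILS


def lik_stop_at_tails(theta):
    """Likelihood when tossing stopped at the TAILS-th tail (last toss is a tail)."""
    return comb(N - 1, TAILS - 1) * theta**HEADS * (1 - theta)**TAILS


def ratio_between_designs(theta):
    """Ratio of the two likelihoods; constant in theta if they are proportional."""
    return lik_fixed_n(theta) / lik_stop_at_tails(theta)


def p_value_fixed_n(theta0=0.5):
    """P(at least HEADS heads in N tosses) under theta0."""
    return sum(comb(N, k) * theta0**k * (1 - theta0)**(N - k)
               for k in range(HEADS, N + 1))


def p_value_stop_at_tails(theta0=0.5):
    """P(at least N tosses are needed) = P(at most TAILS-1 tails in the first N-1 tosses)."""
    return sum(comb(N - 1, j) * (1 - theta0)**j * theta0**(N - 1 - j)
               for j in range(TAILS))


if __name__ == "__main__":
    for theta in (0.2, 0.5, 0.8):
        print(theta, ratio_between_designs(theta))
    print(p_value_fixed_n(), p_value_stop_at_tails())
```

Whether the difference in tail probabilities belongs to "the evidence" or to some further ingredient of inference is exactly what is in dispute here; the sketch only shows that the likelihood function itself does not register the stopping rule.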

There are more considerations for inference than just the strength of evidence, just as there are more considerations than just error rates.

I don’t think that you are being fair to Royall’s approach. An important difference that seems to be lost is that the likelihood functions in his book support estimation at the same time as they support inference. That is not the same as “various hypotheses one dreams up”. The models used by Royall are no more arbitrary or subjective than the models used in a frequentist approach. The difference is that a likelihoodlum can choose the parameter values of interest (the hypotheses) from the likelihood function post experiment whereas the frequentist should choose them from other considerations before the experiment. One approach need not compete with the other because they each bring a different aspect of the experiment into the analysis. In making inference I would like to know both what the evidence points to and its strength (i.e. the likelihood function) and the reliability of that evidence (i.e. the error rate properties of the protocol used for the gathering of the evidence).

> be explicit about what evidence consists of, what it tells us, how to measure or scale it, and what we should mix with it for scientific inference and how. So far I have not seen a clear synthesis of all of that.

Mike Evans is working on that and here is an example http://arxiv.org/abs/1401.4215

I see at least three different issues raised by your post: A) whether Royall’s three-valued distinction is useful; B) whether, if it is accepted, his recipe for what you should apply for each of the three purposes should be accepted; C) whether likelihood shows that stopping rules are always irrelevant.

I would imagine that most statisticians accept A), that many reject B) and also that many reject C).

I am surprised if you do not accept A). It seems to me that unless the distinction between belief and action is accepted it is difficult to explain the widespread phenomenon of insurance. We surely accept that each party in an insurance contract can rationally enter into it despite agreeing on the probabilities. Differences in utilities would simply explain this. You would have to be an extreme follower of DeFinetti to accept that the factorisation of probability and utility was legitimate. Even an extreme Neymanite, who might turn her nose up at the use of the word ‘belief’, would accept that the acceptable type I error rate could be changed according to the losses involved. In short, I would think that there is pretty well universal acceptance of the distinction between the first and second of Royall’s three.

Is there a third thing? Can one ask ‘Are these data evidence of some claim?’ or perhaps ‘What is the evidence of these data?’ and consider that neither 1 ‘what should I believe’ nor 2 ‘what should I do’ is the answer?

I think that there is, and one reason that this is so is that decisions made by some can be revisited by others. Scientists, whatever their statistical philosophy, seem to accept that this is legitimate. The question then is ‘what is the result of the experiment that can be shared?’, where we want to go beyond ‘what does my colleague believe?’ and ‘what did (s)he decide?’. Thus I certainly accept A) above.

One view is that all that needs to be shared is the likelihood. This I think is Royall’s view. I think that this is too narrow. I think that far more needs to be shared, since, for example, if a remote scientist queries the conclusion he might also query the general model. This leads to the view that data should be shared. However, this also soon leads one to realise that the protocol also needs to be shared and this should include the stopping rule. Thus B is too narrow.

As regards C. I don’t accept that the stopping rule is never relevant but I do believe it becomes of little interest once one is faced with summarising the results from a number of experiments and this is precisely because of the way that such summarising should be carried out.

Consider the process of summarising a series of clinical trials via a meta-analysis: an extremely common thing to do now. One way, universally recognised as a possibility but also universally agreed as being a poor way to do so, is to ‘vote count’ on whether the result was significant or not. If you decide to do this, then you should note whether the trial was run to a sequential protocol or not and this should be taken into account when judging whether the trial was significant or not.

If, however, you decide to use a superior approach of weighting estimates by the amount of information they provide, then, for reasons I have explained in a blog post on this website (see https://errorstatistics.com/2012/09/29/stephen-senn-on-the-irrelevance-of-stopping-rules-in-meta-analysis/) and recently outlined in a paper in Pharmaceutical Statistics (see http://onlinelibrary.wiley.com/doi/10.1002/pst.1639/abstract), you don’t.
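For readers unfamiliar with the two ways of summarising trials mentioned above, here is a minimal sketch of vote counting next to weighting by information, using standard fixed-effect inverse-variance weights. The three effect estimates and standard errors are invented for illustration, the function names are hypothetical, and the sketch shows only the weighting step, not the argument about stopping rules made in the linked post and paper.

```python
from math import sqrt


def vote_count(estimates, standard_errors, z_crit=1.96):
    """Number of trials that are individually 'significant' (two-sided, normal approx.)."""
    return sum(abs(e / se) > z_crit for e, se in zip(estimates, standard_errors))


def pooled_estimate(estimates, standard_errors):
    """Fixed-effect pooled estimate: each trial weighted by its information, 1 / se^2."""
    weights = [1.0 / se**2 for se in standard_errors]
    total_information = sum(weights)
    pooled = sum(w * e for w, e in zip(weights, estimates)) / total_information
    return pooled, sqrt(1.0 / total_information)


if __name__ == "__main__":
    # Invented numbers, purely illustrative.
    estimates = [0.30, 0.10, 0.25]
    standard_errors = [0.20, 0.10, 0.25]
    print("significant trials:", vote_count(estimates, standard_errors))
    pooled, pooled_se = pooled_estimate(estimates, standard_errors)
    print("pooled estimate:", pooled, "standard error:", pooled_se)
```

In this approach a trial contributes through its estimate and the information it carries, rather than through a yes/no verdict on significance.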

Thus, although statisticians may disagree regarding B and C above, most, I think, will agree that Royall was onto something useful with A.

Stephen: Just on your first point, I find it very hard to grant acceptance to someone’s claim in (A) if they’ve cashed them out so seriously incorrectly (leading to reject (B)). This isn’t just a game of semantics.

What’s your point here? A statistician may mistakenly think that accepting point A requires acceptance of B. Are you arguing that this mere false reasoning (falsely assuming A -> B) makes acceptance of A wrong? Of course if A necessarily implies B and B is wrong, then A is wrong but that is another matter.

To give another example. Many statisticians have assumed that the conditionality principle leads inexorably to the likelihood principle. You have argued that this assumption is incorrect. Does this require one to reject conditionality?

I assume that you agree not. Therefore it is a reasonable comment to make that Royall’s three part classification A does not have to be rejected because you reject B.

Stephen: I’m afraid with my (A), (B), (C) parts that using A, B might be confusing people. But never mind.

Look, in the post I grant that many people are willing to go with Royall that far, i.e., to your (A), which means merely that one recognizes a distinction between the three terms:

belief, action, evidence,

leaving the terms undefined. I definitely would go along with the idea that subjective or prudential belief differs from decision-making, differs from inferential appraisal of evidence. Always have. But “act” isn’t quite the same as “decide” in my book.

I can imagine others identifying warranted belief with inference, and others identifying acts of detaching inferences as actions such as Neyman, the failure of H is “a confirmation of H0” (when there is high power to have detected failure) or “regard the data as indicating a discrepancy d”, or Fisher:

“The possible deviation from truth of my working hypothesis, to examine which the test is appropriate, seems not to be of sufficient magnitude to warrant any immediate modification.”

Fisher said that calling such utterances “actions” merely reflects Neyman’s preferred way of talking. (One which Popper preferred as well). And I think Fisher has a good point. The bottom line is that this is a semantics game and has little to do with the issues as to the methods and rationales for determining what the data warrant.

P.S. I recall being surprised when I read something of yours that appeared to endorse Royall’s tri-part distinction (it’s one of the reasons I’m posting on it), and now your comment makes me think that all you meant to be endorsing was (A).

I certainly endorse Royall’s three valued distinction without feeling any great need to define them.

A Bayesian interpretation might be that his 1 is prior + likelihood, his 2 is prior + likelihood + utility, and his 3 is likelihood, and that, that being so, it would be more useful to simply deal with prior, utility and likelihood. I would not go that far for a number of reasons, including the fact that various inferential devices we commonly use require different amounts of structure to be already agreed to be in place. For example, pure tests of significance may require less structure than likelihood.

However, in the regulatory context in which I work the practical distinction seems reasonable enough: a regulator might 1) concede that on balance it seems reasonable to suppose that the drug works but that 2) for other reasons a license will not be granted, whilst 3) leaving the door open for further research, in which case in due course the current evidence will need to be combined with that from future trials.
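The Bayesian reading mentioned above (1 is prior + likelihood, 2 adds utility, 3 is likelihood alone) can be made concrete with a toy two-hypothesis calculation. Everything here is an assumption made for illustration: the hypothesis labels, the prior, the likelihood values, the utilities and the function names are invented, and the sketch is not a model of any actual regulatory decision.

```python
def likelihood_ratio(likelihood, h1, h2):
    """Royall's question 3 on this reading: the likelihoods alone, compared."""
    return likelihood[h1] / likelihood[h2]


def posterior(prior, likelihood):
    """Question 1 on this reading: prior combined with likelihood."""
    unnormalised = {h: prior[h] * likelihood[h] for h in prior}
    total = sum(unnormalised.values())
    return {h: value / total for h, value in unnormalised.items()}


def best_action(post, utility):
    """Question 2 on this reading: posterior combined with utilities."""
    expected = {action: sum(post[h] * payoff[h] for h in post)
                for action, payoff in utility.items()}
    return max(expected, key=expected.get), expected


if __name__ == "__main__":
    # Invented numbers, purely illustrative.
    prior = {"works": 0.5, "no_effect": 0.5}
    likelihood = {"works": 0.3, "no_effect": 0.1}
    utility = {
        "license": {"works": 1.0, "no_effect": -5.0},
        "refuse": {"works": 0.0, "no_effect": 0.0},
    }
    post = posterior(prior, likelihood)
    print("likelihood ratio:", likelihood_ratio(likelihood, "works", "no_effect"))
    print("posterior:", post)
    print("decision:", best_action(post, utility))
```

With these made-up inputs the data favour "works" and the posterior leans the same way, yet the preferred action is to refuse, which mirrors the practical separation of the three questions described for the regulator.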

I agree with you that the Neyman-Pearson versus Fisher distinction is partly a semantic one but that is because in the context in which they were talking, utilities were not admitted. (If you like they were arguing about aspects of Royall’s 1 and 3.) But that does not mean that the distinction between his 1 and 2 is mere semantics.

To get to the most contentious issues, there is an explicit denial that priors are part of the evidential appraisal. The evidence is solely through the likelihoods.

Then there’s the assumption that accounts that employ error probabilities of methods (error statistics)–be they Fisherian or N-P–must follow the behavioristic justification of those error probabilities, and thus can at most be relevant for determining how we should act. I don’t know if he thinks all actions we take based on data get their rationale from the desire to avoid making errors too often in the long run. So for example, say that I declare I have gained 1 pound since last month based on a host of reliable and well-calibrated scales–info given by error probabilities. I would say this evidence warrants my inference about my weight right now. I do not say the rationale for declaring “I gained a pound” is that were I to behave this way whenever the scales show these readings then I will rarely be wrong in my weight declarations in the long run of weighing activities. I find that utterly absurd as the rationale for my inferences (yes inferences) about my weight, based on the data and the error probabilities.

Recall, too, the very long set of comments by “anon-fan”

https://errorstatistics.com/2014/08/29/breaking-the-law-of-likelihood-only-way-to-keep-their-fit-measures-in-line-a/#comment-91702

(in the earlier “breaking the law” post) denying it made sense to distinguish evidence ( a la the law of likelihood) and belief, for a Bayesian.

> This leads to the view that data should be shared. However, this also soon leads one to realise that the protocol also needs to be shared and this should include the stopping rule.

Nicely put – I do think Fisher disregarded this in his claim that only the likelihoods need be reported.

One way to deal with this would be to not try to discern what the evidence is before discerning how the data were generated and an acceptable data generating model for that – so it very explicitly depends on that assessment.

I meant to say “You would have to be an extreme follower of DeFinetti NOT to accept that the factorisation of probability and utility was legitimate.”


Royall: “the likelihood view is that observations [like x and y]…have no valid interpretation as evidence in relation to the single hypothesis H.” (2004, p. 149).

Therefore the likelihood view is not an account of evidence.