Given all the recent attention given to kvetching about significance tests, it’s an apt time to reblog Aris Spanos’ overview of the error statistician talking back to the critics [1]. A related paper for your Saturday night reading is Mayo and Spanos (2011).[2] It mixes the error statistical philosophy of science with its philosophy of statistics, introduces severity, and responds to 13 criticisms and howlers.

I’m going to comment on some of the ASA discussion contributions I hadn’t discussed earlier. Please share your thoughts in relation to any of this.

[1] It was first blogged here, as part of our seminar 2 years ago.

[2] For those seeking a bit more balance to the main menu offered in the ASA Statistical Significance Reference list.

See also on this blog:

A. Spanos, “Recurring controversies about p-values and confidence intervals revisited”

A. Spanos, “Lecture on frequentist hypothesis testing”

Section 8.1 is particularly nice. I get tired of hearing that “All models are false.”

Thanks Richard.

Worse, some people say that since all models are false, we don’t need to test them, since we have already falsified them! As if we don’t want to learn about specific shortcomings and discrepancies for clues to a better model.

“All models are false” is useful as a reminder that we cannot verify the model assumptions, and that this is not the task of model checking. The task of model checking is to reveal *critical* deviations from the assumptions (invalidating analyses and error probabilities), not *all* deviations. This requires knowledge of which deviations are critical and which are not (which may depend on what is done with the model).

“All models are false” is bad indeed if it is used to justify using a model without checking.

Christian: I agree with this. How do you determine if violated assumptions invalidate error probabilities, aside from extreme cases?

Theory can help, simulations can help. For example, for which non-Gaussian distributions does the Gaussian distribution give a good approximation of the distribution of, say, the arithmetic mean or the sample variance for given n, and for which distributions is the approximation bad? What is the effect of what amount of positive or negative dependence between observations etc.? One can have a good intuition about some of these, but it can help to simulate things to find out whether there is a problem and how big it is for any specific violation of the assumptions. There are also things such as worst case considerations and influence functions in robust statistics.
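A small simulation along the lines Christian describes; the setup below (nominal 95% intervals for a mean, with exponential and AR(1) data) is my own and purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def coverage(sampler, true_mean, n, reps=5000, z=1.96):
    """Monte Carlo estimate of the actual coverage of the nominal 95%
    interval xbar +/- z*s/sqrt(n) for draws from `sampler`."""
    hits = 0
    for _ in range(reps):
        x = sampler(n)
        half = z * x.std(ddof=1) / np.sqrt(n)
        hits += abs(x.mean() - true_mean) <= half
    return hits / reps

# Skewed data (exponential, true mean 1): the Gaussian approximation
# for the mean is mediocre at n = 10 and improves with n.
cov_skew_small = coverage(lambda n: rng.exponential(1.0, n), 1.0, n=10)
cov_skew_large = coverage(lambda n: rng.exponential(1.0, n), 1.0, n=200)

# Positively dependent data (AR(1) with rho = 0.6): the usual interval
# is far too short, so coverage falls well below 95% even at n = 50.
def ar1(n, rho=0.6):
    e = rng.normal(size=n)
    x = np.empty(n)
    x[0] = e[0]
    for t in range(1, n):
        x[t] = rho * x[t - 1] + np.sqrt(1 - rho**2) * e[t]
    return x

cov_dep = coverage(ar1, 0.0, n=50)
print(cov_skew_small, cov_skew_large, cov_dep)
```

The simulation shows which deviations are critical here: moderate skewness is nearly harmless at large n, while positive dependence invalidates the error probabilities badly.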

As I read or reread the various comments on the ASA p-value document, I notice various points worthy of discussion. For example, Sander Greenland’s commentary rightly complains about the “null bias” wherein the hypothesis to test is a “no effect” claim.

This is linked with the ASA limitation to Fisher-type tests rather than Neyman-Pearson (N-P) tests. For N-P testers, the “null” or test hypothesis is to be chosen as the one whose erroneous rejection is deemed the most “serious”. So, for example, in most settings, we’d place “drug is carcinogenic” as the null hypothesis. As Greenland points out, it’s the tendency for Bayesians to give a spiked prior to a point null that leads to the common but erroneous view that p-values “exaggerate” the evidence against the null.

We’ve discussed this a lot on this blog. I’ll add a link.

Mayo, I’m quite interested in Neyman’s suggestion that we set the null on the basis of which error is most serious. For the “drug is carcinogenic” example that you mention, where exactly does one set the “hypothesis” in the continuous space of anti-carcinogenic — non-carcinogenic — carcinogenic scale. Neyman’s point is fine in discontinuous settings and where the null and alternative hypotheses are naturally mutually exclusive and exhaustive, but most real-world problems, including your carcinogenic drug example, would need a dichotomisation of a graded “hypothesis” space. How does one do that in a principled manner?

Michael: Usually it would be something like the hazard rate is higher among those exposed to Q than not–a one-sided test. Or it can be done with risk ratios.

I think that Spanos does a good job of explaining the meaning of severity curves in this article, but Mayo’s introduction of it would make a reader expect that Spanos was talking about classical significance testing. The severity curve analyses do add usefully to the communicability of the results, but we should not pretend that conventional usage of hypothesis tests survive the criticisms as easily as the severity curve-extended significance tests do.

Michael: Hmm. I don’t know what “conventional usage of hyp tests” that don’t survive refers to–abuses? In that case, fine. Or maybe optimality as primary? There too, the point is to give it a better statistical philosophy.

It seems likely that every chemical is carcinogenic. In fact, a compound declared to be a carcinogen may be a safer alternative than those that aren’t, since it wasn’t otherwise toxic enough to mask the carcinogenic activity.

“These considerations of mechanism suggest that at chronic doses close to the toxic dose, any chemical, whether synthetic or natural, and whether genotoxic or nongenotoxic, is a likely rodent and human carcinogen. Not all chemicals would be expected to be carcinogens at high doses; the MTD may not be reached (101) or the chemical may be toxic without causing cell killing or mitogenesis.”

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC54830/

Anon: No one says you can’t estimate the extent of the harm, even if there are debates about the right extrapolation model to use. That is, the question isn’t whether there is some dose that could cause some problem. The choice of extrapolation model, test animal, sample size, etc. definitely influences the protectiveness of the method, but it can be critiqued along those lines. For example, GMO testing used such small sample sizes as to have little power to pick up on effects on non-targeted species–at least in studies from maybe a decade ago (since I’ve read ecologists on this, I assume things are better now).

I’m thinking of replacing null hyp with test hyp in my book, but it’s so much shorter to allude to the “null”. Any other expressions around for Ho?

You could see the Greenland et al. response to the ASA statement for an attempt to rename the null hypothesis. An attempt that they justify by a strange and ahistorical account of what it means. As far as I know, Fisher coined the phrase and specified that it referred to the hypothesis that the data are given the opportunity to “nullify”.

I think that it is useful to make the distinction between a hypothesis and a statistical model parameter value clear, as I pointed out to you by email a week or two ago (no reply?). This is what I wrote:

I think the general usage of ‘hypothesis’ makes communication difficult. It is used for cases where the quantity in question is the value of the statistical model parameter of interest, and it is used without distinction to denote a more complicated type of ‘object’. I assume that you can see how useful it might be to distinguish between the nature of a hypothesis that the universe revolves around the earth, and a simple hypothesis that mu is equal to zero. Likelihoods can be calculated for the latter type of hypothesis, but I am not at all sure that they can be calculated for the former.

That previous comment is from me. I’m not sure why it is anonymous.

Mayo – a question. Suppose that, instead of minimising the probability of a Type II error for fixed probability of a Type I error as in NP, we choose the test to minimise a linear combination of the two error probabilities. Ie we introduce a parameter determining how much more or less we care about one of the errors.

What does the resulting theory look like?
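For concreteness, a rough numerical sketch of what such a theory delivers in the simplest Normal case (the setup and the weights are my own illustrative assumptions):

```python
import numpy as np
from math import erf, sqrt

def Phi(x):  # standard normal CDF
    return 0.5 * (1 + erf(x / sqrt(2)))

# Simple Normal test: H0: mu = 0 vs H1: mu = 1, X ~ N(mu, 1), n = 25,
# rejecting H0 when the sample mean exceeds a cutoff c.
n = 25

def alpha(c):  # type I error probability at cutoff c
    return 1 - Phi(c * sqrt(n))

def beta(c):   # type II error probability at cutoff c
    return Phi((c - 1) * sqrt(n))

def best_cutoff(w, grid=np.linspace(-1.0, 2.0, 30001)):
    """Cutoff minimising w*alpha + beta; w says how much more costly
    a type I error is than a type II error."""
    costs = np.array([w * alpha(c) + beta(c) for c in grid])
    return float(grid[np.argmin(costs)])

c_equal = best_cutoff(1.0)    # equal weights: cutoff 0.5, alpha = beta
c_strict = best_cutoff(19.0)  # type I errors 19x as costly: larger cutoff
print(c_equal, alpha(c_equal), beta(c_equal))
print(c_strict, alpha(c_strict), beta(c_strict))
```

At the minimum, the rejection region is exactly a likelihood-ratio region with threshold w, so (as I read it) weighing the two errors linearly is equivalent to fixing a critical likelihood ratio rather than a critical alpha.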

Om: Well, as I assume you know, this is often tried, particularly making them equal (by likelihoodists).

I’m curious as to whether you think this is a good/bad approach and why. Why NP and not this? I can see pros and cons (though overall I prefer a different approach altogether).

I believe you can achieve a similar effect by plotting the various cutoffs in a ROC curve and choosing the optimal solution.

According to Cornfield, such a theory is both possible and desirable. Read about it here:

Cornfield, J. (1966). Sequential Trials, Sequential Analysis and the Likelihood Principle. The American Statistician, 20(2), 18–23.

Hi Michael, thanks.

But as Mayo noted I was aware of this…I was hoping that if I didn’t mention ‘likelihood’ then Mayo might address what to me seems like a reasonable question about ‘how’ exactly we should ‘control’ error probabilities – i.e. how we capture the unavoidable trade-offs and why NP’s approach is preferable to the likelihoodist version etc.

One response that seems available is that controlling ‘absolute’ errors is more important than ‘relative’ errors.

This also strikes me as the motivation for things like posterior predictive checks in Gelman’s approach – given a relative ranking of parameter values within a model according to a posterior or likelihood function then one should also check the ‘absolute’ fit (of at least the best model) to obtain a proper scale.

Alternatively, one might be able to set the trade-off parameter based on ‘prior’ knowledge of the specific set up for what the ‘best fit’ is likely to be, but again an additional absolute check is probably desirable.

BTW – the ‘classic’ playing card example Spanos mentions is to me both facile and interesting (if that’s possible). I think you tried to address it in your arxiv manuscript? To me it points to the danger of ‘exact conditioning’ on the data given a model with a large number of parameters relative to the data (in this case allowing number of parameters = number of data points).

The likelihoodist has a clear response in that they have obvious prior reasons for not caring about a relative comparison of models in a model space where deterministic models with the same number of parameters as data points are included. They could ‘regularise’ in a number of ways e.g. by not allowing such pathological cases into the comparison a priori.

However, once one tries to formalise more clearly why they don’t care and what is happening, I think one is led to something that is not really Frequentist or Bayesian but which is more about clearly defining the topology of the data space.

Om: You’d think likelihoodists would discount such things but they do not, & I understand why (after a long exchange with Royall). In fact Royall gives as a prime example the hypothesis that all the cards in the deck are identical to the one you chose. He’d rather imagine at that point you switch to what should be believed. If you want a logicist or evidential-relation account, then it’s to be context free. This was the goal for a very long time w/ logical positivism & remains a goal for some. There should “be” a purely logical relationship between statements, philosophers held, and some still do. Popper was among the first (in contemporary phil sci) to say this. (Philosophers from earlier times knew this.) But just saying there needs to be more than an E-R measure still doesn’t tell you what it should be, and it isn’t so easy to develop. Popper failed, even though he had the right idea.

Not caring (about certain hypotheses) is beside the point, and one can easily substitute many hypotheses and methods that have the exact same effect but aren’t as ad hoc looking, and about which you do care. Birnbaum erected slates of examples, and they almost all come from him. The hypotheses can even be predesignated and you select the best fit. Or it can be due to stopping rules, and assorted rigging.

This is THE core issue.

Om: N-P never said you had to first control alpha; they set out to provide a rationale for the tests promoted by Fisher and others. Fisher only had the type 1 error prob, and intuitions about sensible test statistics for promoting “sensitivity”. They came up with a rationale to justify his tests, but made it clear that different choices were possible and the user should choose one that seems relevant. But there’s a very important point you’re overlooking. Regardless of how I chose the N-P test, you can critically assess the result given my chosen specifications. You can say, for instance, that having insisted on so small an alpha, the test could readily fail to detect discrepancies of such and such magnitude. The specs are part and parcel of the interpretation. One doesn’t blindly report “reject Ho”, say, without indicating what is and is not warranted. So if you make alpha and beta equal, say, I’m going to interpret the test the same way as if you fixed alpha, even if the theory of test generation differs.

Remember too, Neyman invented confidence intervals in connection with tests.

omaclaren, sorry to be slow in responding, but I’m travelling at the moment and not online often.

Yes, I addressed the one card trick issue in my arxived paper http://arxiv.org/abs/1507.08394 (as usual, rejected. It must be a good one!) The central idea is that likelihood ONLY allows comparisons of the support for parameter values that are part of a single statistical model. The model has to provide probabilities for all possible outcomes, so to make the normal deck of cards and all cards the same hypotheses into parameter values for a single model, one ends up with not just number of parameters = number of data points, but number of parameters >> number of data points. One need not be very well versed in data analysis to realise that such a situation is untenable.

I think that a major difficulty that we face is the use of the word “hypothesis” when what we mean is a particular value of the parameter of interest of the statistical model. As soon as you inspect the full likelihood function that shows ALL of the valid parameter values the complaints of anti-likelihoodists seem to become irrelevant. I note, for the information of all readers that both Edwards’s and Royall’s books are filled with complete likelihood functions. Discussions of likelihoods that focus only on particular points in parameter space miss the utility of the likelihood function.

Michael: Doesn’t hold up even restricting to parameter values. There’s still cherry-picking (from predesignated hyps), optional stopping, etc. You have error control only with two predesignated point against point hypotheses, as has long been known. Savage tried to change the discussion to such a case in responding to Armitage in the Savage Forum (1959/1962). That’s why, in the end, Birnbaum restricted his endorsement of the LP to predesignated point against point.

Mayo, why does “error control” trump all other considerations? I know that it might do so when one chooses to look at the world through frequentist lenses, but such lenses do not allow the evidence in the data to be seen with any clarity.

I also note, as a follow on to omaclaren’s point about the balance of type I and type II errors, that the usual discussions of error control (yes, yours included) focus on type I without any consideration of other error types. Optional stopping increases type I errors, but _decreases_ the type II errors. Many reasonable loss function designs lead to the situation where errors are well controlled with optional stopping and costs are reduced.
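The trade-off claimed here is easy to check by simulation; a sketch under my own illustrative assumptions (Normal data with known sigma, a fixed budget of 50 observations, naive unadjusted interim looks):

```python
import numpy as np

rng = np.random.default_rng(1)

def rejection_rate(mu, looks, total_n=50, z=1.96, reps=4000):
    """Rate at which H0: mu = 0 is rejected when |sqrt(n)*xbar| > z is
    checked at `looks` equally spaced interim analyses of the same
    total_n observations (sigma = 1 known); stop at the first rejection."""
    step = total_n // looks
    rejections = 0
    for _ in range(reps):
        x = rng.normal(mu, 1.0, total_n)
        for k in range(1, looks + 1):
            n = k * step
            if abs(x[:n].mean()) * np.sqrt(n) > z:
                rejections += 1
                break
    return rejections / reps

t1_fixed = rejection_rate(0.0, looks=1)
t1_peek = rejection_rate(0.0, looks=5)
t2_fixed = 1 - rejection_rate(0.4, looks=1)
t2_peek = 1 - rejection_rate(0.4, looks=5)
print(t1_fixed, t1_peek)  # peeking inflates the type I error rate
print(t2_fixed, t2_peek)  # ...while lowering the type II error rate
```

Whether that trade counts as “well controlled” then depends on the loss function attached to the two errors, which is exactly the point at issue.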

Michael: I definitely am highly concerned not only with both types of errors (type 1 and 2) but with errors about discrepancies. Power, which is linked to a type 2 error prob, is central for me, even though I prefer a data dependent computation. So I don’t know what made you think I cared only about type I errors, like our illicit NHSTers.

As for, why care about error probabilities–it’s a good question to ask. My answer for inferential contexts rests on employing them in arguments about severity. I don’t deny long-run performance matters in different contexts.

Mayo, I’ve read most of your blogs and followed the comments. I do not recall you ever writing about the fact that optional stopping decreases type II error rates for any average effect size, nor do I recall you writing that adjustment of P-values for multiplicities of testing increases them. Maybe I’m wrong, but my impression is that you do not pay much attention to the effect of frequentist `corrections’ of P-values on the resultant type II error rates.

Michael: My view of testing was never restricted to P-values. Actually, the whole severity idea grew out of fallacious attempts to claim evidence of 0 or low hazards–type 2 errors! With proper tests–not guilty of selection effects–this occurs typically because of low power. With likelihood ratios or Bayes factors, or data-dependent selection in N-P tests, this occurs by such means as comparing the null with an alternative far away, or selecting a characteristic post-data where hazards are absent, or picking a factor on which the treatment in question is advantageous. The same kinds of data-dependent biases, as well as violated assumptions, do damage to the type 2 error. Severity requires interpreting non-significant results in terms of discrepancies ruled out. With low-powered tests, it may be that the only hazard increase that’s ruled out is far higher than set by statutes. Here, it’s the typical null of no or low hazard increase, or the like.

The popular Bayesian idea of giving a point spike prior to 0, of course, exacerbates the problem and makes it easy for risky technologies or substances to pass muster as having no or low harmful effects. That’s what Greenland was on about.

Mayo, I feel that we are not communicating well, as usual. I did not, and do not, intend to say that severity analyses are inappropriate because they fail to take type I and type II errors into account. Severity curves are closely related to likelihood functions, so why would I not like them?

I intended to point to the fact that in your many posts about P-values and likelihood you routinely call the evidentially interpretable observed P-values merely “nominal”. You say or imply that the P-values have to be computed or adjusted or corrected so that they represent the probability of a false positive error conditioned on the actual experimental design. To insist on that is to privilege type I error control over minimisation of type II errors. Proper ‘error control’ should take both into account.

Look again at the comment by omaclaren about minimising a linear combination of type I and type II errors and at the relevant part of the Cornfield paper that I included in my response to omaclaren. They might help you see what I am on about.

Michael – yes I effectively agree with you. I think it’s perfectly possible to use an Edwards-style approach sensibly. I do think there are some subtle issues behind the ‘common-sense’ advice for how to use likelihood functions though. (BTW – I don’t think standard frequentism really addresses these issues in the best way for me either).

First of all – *why* do we not want number of parameters >> data points? That is, does the likelihood theory itself *explain* why or is it a ‘meta’ principle to you?

A related (to me) question is – *why* should we restrict comparisons to point values of the likelihood function? Again, I agree that this is the sensible approach for the Edwards likelihoodist but *only subject to associated ‘common-sense’ prescriptions* such as not having parameters >> data points. Are these prescriptions explained within or externally to the theory?

I’m not just being negative about likelihoodism, I’m trying to point out what I think are key issues to be explicitly and constructively addressed.

[I do in fact have my own idea for the justification of a version of ‘regularised likelihoodism’ that (I believe) ‘derives’ the regularisation principles and likelihood principles from a more basic starting point. It’s not that different from some versions of Bayesianism – or, even more, so structural probability arguments – but has less philosophical baggage imo.

It occurred to me while teaching partial differential equations this semester and discussing the derivation of differential equations from integral equations, i.e. ‘localisation’ of integral equations. Likelihoodism faces the same question – how to properly ‘localise’ probability statements for continuous models, which are based on integrals/measures, to statements valid at point values on a continuum. It turns out applied mathematicians and physicists have been doing this for integral/differential equations for a long time; Fraser is the only one I’ve seen explicitly do effectively the same thing in the context of statistical inference, though (but I haven’t checked comprehensively).]

(well…I should emphasise that people using measure theory do this sort of thing all the time e.g. with Radon-Nikodym or with more functional-analytic approaches…but Fraser is the only one I’ve seen do it as a sort of ‘easy-to-understand-for-undergrads’ analysis of likelihoodism and statistical inference)

omaclaren, thank you for your response. I was beginning to feel very lonely!

Why should we restrict the number of parameters? I only have pragmatic and empirically derived reasons, but I suspect that a more principled reason might be available via information theory and I will insist that a theory of likelihood need not bear the responsibility of providing such a reason by itself. My pragmatic reason is that if the number of parameters is higher than the number of data points then all statistical methods go awry. Likelihood is no different. Would anyone really consider fitting a linear model with two parameters to a single data point? I don’t think so. But does anyone ask the analyst why he or she insists on more data or on reducing the number of fitted parameter values? No. To ask that question of a likelihoodlum is to make likelihood different.

Why restrict the comparisons described by the law of likelihood to likelihood points on a single likelihood function? Three reasons. First, it makes no sense to do otherwise. I see no justification in the likelihood principle for a comparison between likelihoods from different models. The idea that likelihoods from different models can be compared is nothing more than overreach and extrapolation from a logically solid base. If you can tell me how the likelihood principle provides a justification for such comparisons I would be surprised but grateful.

Second, all likelihood functions are as reasonable as the statistical model from which they are derived. Or as unreasonable. However, the likelihood ratio of individual points is a technical null measure of relative support that cancels out the reasonability aspect of the model when the two likelihoods are from the same model so that the ratio provides an uncomplicated measure of the relative support for the parameter values _within_ the statistical model. If the likelihoods are not from the same model the reasonableness considerations do not exactly cancel, and so the ratio is numerically controlled by the relative reasonablenesses of the models. That makes such a ratio something other than a simple numerical measure of relative support. Models that have too many parameters lack reasonableness in my opinion, as do models that include evil demons who are determined to deceive. You have read my Arxiv paper on the topic? http://arxiv.org/abs/1507.08394

(Null measures are used quite a bit in my main discipline, pharmacology. Look at “ratiometric dyes” like Fura-2 to see how factors like dye concentration, path length and instrumentation sensitivity can be eliminated using the ratio as a null measurement. The ratio of concentration-response curve locations used for a classical Schild plot analysis gives another example where a ratio is a null measurement that cancels out a host of confounding factors.)

There is a third way to justify the restriction, and I think that it is one that Fisher had in mind, even though I cannot find anywhere that he makes it explicit. Likelihoods are only defined to a proportionality factor. They are proportional to the probability of the data given the parameter value, even though many authors carelessly or mistakenly present them as being equal to that probability. The proportionality constant is essential to deal with the rounding errors associated with real-world presentation of continuously varying parameter values. The likelihood of a parameter value of 2.0 is much larger than the likelihood of a parameter value of 2.000000 because the former is an integral from 1.95 to 2.05 whereas the latter is an integral only from 1.9999995 to 2.0000005. Different models potentially need different proportionality factors to make the rounding of the parameters equivalent to allow the likelihood ratio to be a comparison of like with like.
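A common textbook illustration of the proportionality point, here applied to rounding of the *observation* rather than the parameter (my own numbers, for a single N(mu, 1) observation):

```python
from math import erf, exp, sqrt

def Phi(z):  # standard normal CDF
    return 0.5 * (1 + erf(z / sqrt(2)))

def binned_likelihood(x, mu, h):
    """P(observation rounds to x | mu) for X ~ N(mu, 1): the integral
    of the density over the rounding interval [x - h/2, x + h/2]."""
    return Phi(x + h / 2 - mu) - Phi(x - h / 2 - mu)

x = 1.3
for h in (0.1, 0.001, 1e-6):
    L1 = binned_likelihood(x, mu=1.0, h=h)
    L0 = binned_likelihood(x, mu=0.0, h=h)
    print(h, L1, L0, L1 / L0)
# The individual bin probabilities shrink in proportion to the rounding
# width h, but the ratio stabilises near exp(0.8): the proportionality
# constant cancels when both likelihoods share the same rounding
# convention, and only then.
```

This is the data-side counterpart of the argument above: the ratio is a comparison of like with like only when the same rounding (proportionality) convention applies to both likelihoods.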

Royall might not but Edwards (Fisher’s student!) does.

I agree the need for regularisation is a/the core issue; I just tend to think about how it arises and how to resolve it in a different way. The frequentist does better on some particular examples by appealing to the sampling distribution, which seems to me to just be a roundabout way of ‘conditioning on a neighborhood’ of the observed data rather than exactly on the data. In more complicated examples it seems like a local approach (based on a neighborhood of the given data) to stability would be preferable to considering all more extreme deviations.

Re confidence intervals – likelihood intervals obtain a similar justification based on choosing a tradeoff parameter and including all parameter values giving likelihood ratios exceeding the threshold. The curvature of the likelihood function should give an approximate measure of the width of the likelihood interval and hence the number of ‘similar’ competitors.

So again – a minor modification of the criterion for choosing a test and a corresponding minor modification of interval estimation based on test inversion gives a likelihood approach rather than NP. If you allow a neighborhood of the data to be relevant rather than exactly that data then they become even closer (in fact I would probably take this as an indication that the key issues are really the geometry and topology of parameter and data space and maps between them)
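The correspondence described here can be made concrete in the known-sigma Normal case; a sketch with illustrative numbers and the conventional 1/8 likelihood-ratio threshold:

```python
import numpy as np

# Observed: a sample mean xbar with known standard error se.
xbar, se = 1.2, 0.3

def rel_likelihood(mu):
    """Likelihood of mu relative to its maximum (attained at mu = xbar)."""
    return np.exp(-0.5 * ((xbar - mu) / se) ** 2)

# 1/8 likelihood interval: all mu whose relative likelihood is >= 1/8.
mus = np.linspace(xbar - 5 * se, xbar + 5 * se, 200001)
inside = mus[rel_likelihood(mus) >= 1 / 8]
li = (inside.min(), inside.max())

# Test inversion gives the same interval: exp(-z^2/2) = 1/8 yields
# z = sqrt(2 ln 8) ~ 2.04, so the likelihood interval coincides with
# the z-based confidence interval at that level (in this Normal case).
z = np.sqrt(2 * np.log(8))
ci = (xbar - z * se, xbar + z * se)
print(li, ci)
```

In this symmetric case the two constructions agree exactly; the curvature of the likelihood function at xbar (here 1/se^2) governs the interval’s width, as noted above.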

Om: Only time to respond to first sentence. Where does Edwards preclude such problematic inferences? Check Hacking’s review of Edwards (hopefully the link works): https://errorstatistics.files.wordpress.com/2014/08/hacking_review-likelihood.pdf

Mayo – several times you’ve pointed me to Hacking’s review and several times I’ve pointed you back to Edwards’ book – although I don’t fully accept Edwards’ approach, I don’t think the review does the approach justice.

I have a 1992 paperback edition of the 1972 version.

In chapter 8 ‘Application in anomalous cases’ he discusses a number of…anomalous cases.

For example in section 8.4 ‘Singularities in the support function’ he discusses an example due to Barnard of a single observation from a Normal distribution with unknown mean and variance. He states:

“In the unlikely event that we would not reject a zero variance on a priori grounds, the situation may still be resolved by taking into account the fact that the fundamental uncertainty of measurement necessitates a non-zero variance in the probability model for the actual observation…we may expect similar difficulties whenever there is the option, under the model, of setting a Normal variance to zero without achieving a zero likelihood thereby. For if the likelihood is not zero, but the variance is, it implies that an observation took a particular values because, under the model, it had no choice; in the case of a continuous distribution, the likelihood is then infinite.”

In section 8.6 ‘Inadequate information in the sample’ he states:

“In some cases the support curve may be quite regular with respect to some function of the parameter of interest, but exhibit peculiarities with respect to the parameter itself. This is no more than an indication that the sample does not contain the desired information.”

In section 8.7 ‘Discussion’ he states:

“The support may, in difficult cases, depend very critically on the precise model adopted, and if this is not an adequate description of the process generating the observations singularities may result. It is fortunate that in most applications the support does not seem to depend very critically on the finer details of the model, but it should come as no surprise that in some instances these details are very important. We are involved in conditional arguments, and if the conditions change we may expect the results to do so, sometimes with marked effect.”

Etc.

Om: Note what Hacking says about stipulating a positive variance. Please point me to a case where selection effects–optional stopping, cherry picking, outcome switching, multiple testing, etc are taken up by Edwards. I have a copy here, and haven’t looked at it in a while. I credit him for one wonderful line–something like, many Bayesians appeal to wash-out theorems to justify priors. ‘The less said about that justification the better.’ I’m sure it’s not the exact quote. So OK, where does he take up the problem of selection effects. I know he’s got something like prior supports, but I’m not sure how they enter here.

Well I don’t wanna search through everything for you but he does say:

“The general principle we should follow is to condition as much as possible without destroying any information about the parameter of interest. Support functions are independent of the rule for stopping the count provided they are not conditioned on any statistic which is itself informative”.

Can you point to examples where we have correctly conditioned (including making sure the result doesn’t depend on conditioning on the data to a precision beyond that realisable in practice)?

So if I am reading the paper correctly, the SEV is one minus the P-Value, at least for the simple models discussed within. Is it really that simple?

Matt: No, but in one case, SEV(mu > mu0) with observed p-value, it can be. That’s never adequate because you must also indicate the SEV for several discrepancies from 0, not merely an indication of some discrepancy. Note: the confidence level associated with (mu > mu0) = SEV for this one inference, but the confidence level is always fixed at 1 – a, whereas we require several benchmarks. Finally, the rationale is not “coverage probability”.

We don’t change the test hypotheses by the way, but rather consider several candidate claims to which to attach SEV values.

It is a simple idea, but results in a very different construal of tests than is commonly thought. For example, SEV for mu > mu’ goes in the opposite direction as does POW(mu’)–in relation to this example.
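To make the computation concrete, a small sketch of SEV for a one-sided Normal test (my own illustrative numbers: H0: mu = 0 vs mu > 0, sigma = 1 known, n = 25):

```python
from math import erf, sqrt

def Phi(z):  # standard normal CDF
    return 0.5 * (1 + erf(z / sqrt(2)))

n = 25
se = 1.0 / sqrt(n)

def p_value(xbar):
    """One-sided p-value for H0: mu = 0 given observed mean xbar."""
    return 1 - Phi(xbar / se)

def sev(xbar, mu1):
    """Severity for the claim mu > mu1 given observed mean xbar:
    P(Xbar <= xbar; mu = mu1)."""
    return Phi((xbar - mu1) / se)

xbar = 1.645 * se  # observed mean right at the alpha = .05 cutoff
print(p_value(xbar))     # ~0.05
print(sev(xbar, 0.0))    # = 1 - p: severity of the bare claim mu > 0
print(sev(xbar, xbar))   # 0.5: "mu > observed mean" is poorly warranted
print(sev(xbar, 0.1))    # severity for the modest discrepancy mu > 0.1
```

In this simple case SEV for the bare claim mu > 0 reproduces 1 minus the p-value, but the severity for larger discrepancies falls away from 1, which is why several benchmark discrepancies, not a single fixed level, are needed.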

MattW, I also arrived at the conclusion that the severity is one minus the p-value for another hypothesis. I think this may be true in general and not just in these simple cases. I would like to see a counterexample if you find one.

https://errorstatistics.com/2015/08/20/how-to-avoid-making-mountains-out-of-molehills-using-powerseverity/#comment-130129

You obtain a p-value of exactly .05 in our one-sided test. That is, the observed test stat reaches the cut-off c. What’s SEV(mu > c)? Is it .95? Absolutely not.

Absolutely not, I agree. The value of SEV(mu>c) is equal to the p-value obtained when you test the null hypothesis mu=c (against the alternative mu≥c) using the data x0, i.e. 0.5.

Note that there is a problem in your notation: you can’t compare mu (which is in the parameter space) with c (which is in the statistic space). I’m assuming n=1, mu0=0 and sigma=1 so d(c)=c.

What you’re describing is not a significance test. Don’t change the hypotheses. And for every inference there are lower and upper discrepancies of interest, some well warranted, others not. On the parameter and sample space, here they are the same and nothing illicit occurs at all. Consider how confidence intervals are dual to tests. The lower bound of the one-sided lower 1 – a CI interval is the parameter value such that mu ≤ CI-lower is rejected at level a.

As far as I can see, the result of your severity calculation SEV(mu>x) is identical to the p-value for the test of *another hypothesis* (as I mentioned in my first comment, but maybe MattW had reached a different conclusion), H0′: mu=x.

As I said, I would be interested to know if this is not always the case.

Here’s a comment I made on Gelman’s blog in reaction to a statement by Senn taken out of context:

http://andrewgelman.com/2016/03/29/bayesians-quite-rightly-so-according-to-the-theory/#comment-267796

Here’s one of Senn’s replies; I may want to allude to it some time, so I’m parking it here:

Stephen John Senn (@stephensenn) says:

March 31, 2016 at 6:56 am

Andrew,

The qualification I gave ‘exchangeable to the extent described by the model’ is important. A simple example of what I mean is given by using a beta-conjugate distribution for the estimate of a binary probability. If this is informative you can replace it by two things: an uninformative paleo-prior and some subsequent pseudo-data. Given the model, these pseudo-data are exchangeable with the real data that you then collect, and you can’t have it both ways. You can’t use the pseudo-data as a stabilising influence for most real data sets you might collect but reject them for some cases you don’t like. If you do, the whole set-up was wrong in the first place. This was the essence of my comment on Gelman and Shalizi some while back

and is also covered in my blog-post on Deborah Mayo’s site.

https://errorstatistics.com/2012/05/01/stephen-senn-a-paradox-of-prior-probabilities/

This comment should not be taken as a denigration of applied Bayesian analysis but rather as a criticism of the argument that a theory of how to remain perfect is necessarily the best recipe for becoming good.

However, I agree with you entirely that the choice of likelihood is also (partly) arbitrary, although in frequentist accounts design is closely related to the model used or to be used and this helps in aligning likelihood and data.


Passing a severe test is defined as follows in page 164:

A hypothesis H passes a severe test T with data x0 if,

(S-1) x0 accords with H, (for a suitable notion of accordance) and

(S-2) with very high probability, test T would have produced a result that accords less well with H than x0 does, if H were false or incorrect.

Could you clarify for me what is the hypothesis H in the example given in page 169?

When discussing the case “d(x0)=2.0” you write:

“Each statistically significant result “accords with” the alternative (μ > 0). So (S-1) is satisfied.”

It seems that H: μ > 0. But to check condition (S-2) you calculate :

SEV (μ > .2) = P (X ≤ 0.4; μ > .2 is false) = P (X ≤ 0.4; μ ≤ .2 is true)

So what is it, “H: μ > 0” or “H: μ > .2”?

If the former, shouldn’t you calculate SEV(μ > 0)?

If the latter, why do you say that (S-1) is satisfied?

Yes, one has an indication of some discrepancy from 0, and now you want to consider specific values. The hypotheses of the test don’t change, but one entertains various claims about discrepancies rather than stopping with significant/non-significant at a level.

Yes to what? I don´t understand if “one entertains various claims” should be interpreted as “one uses different values for H in the different clauses (S-1) and (S-2)”.

Does the symbol H have a constant meaning through the definition or not?

The definition starts with “A hypothesis H…” before introducing (S-1) and (S-2).

If H refers to a unique hypothesis, the examples are not consistent with the definition.

If H can be one thing in (S-1) and something else in (S-2), wouldn’t it be better to change the definition and use different symbols to refer to different entities?

Carlos: This isn’t a very good way to explain these points clearly. Why don’t you read the paper, or a related paper, and email me with questions. Thanks.

I find my questions very clear, but let me try one last time from a slightly different angle with a really simple question:

What is the hypothesis passing a severe test, if any, in the example?

(the example being the bottom half of page 169 in Mayo and Spanos (2011))

Carlos: Your questions might be clear, but so are the answers in my papers. I’m not willing to rewrite them on my blog at the request of readers.

Fair enough. Maybe some fellow reader can offer some insight on this issue and help me to understand how severity is defined.

My understanding is that the severity assessment of a specific hypothesis H, which may differ from the null H0, is relative to the null in the following sense.

If the null hypothesis H0 (mu <= value0) is accepted then for another H of interest (mu = value1) the severity is SEV(H) = P(X_bar > x_obs_bar; H)

If the null hypothesis H0 (mu <= value0) is rejected then for another H of interest (mu = value1) the severity is

SEV(H) = P(X_bar <= x_obs_bar; H)

In the first case the severity is the 'attained power' while in the second it is (1-attained power) and which is used depends on H0 rather than on H.

Ugh same issue with order symbols.

If null H0 is accepted then for another H > H0 of interest the severity for H is

P(X_bar > x_obs_bar; H)

If H0 is rejected then the severity for H is

P(X_bar < x_obs_bar; H)
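The two cases above can be put into code. A sketch under an assumed Normal setup (sigma = 1, n = 25, testing H0: mu ≤ 0); the function names and numbers are illustrative, not from the thread:

```python
from math import erf, sqrt

def Phi(z):
    """Standard normal CDF."""
    return 0.5 * (1 + erf(z / sqrt(2)))

sigma_xbar = 0.2   # assumed: N(mu, 1) data with n = 25, testing H0: mu <= 0

def sev_upper(xbar_obs, mu1):
    """SEV(mu > mu1) after a rejection: P(X-bar <= xbar_obs; mu = mu1)."""
    return Phi((xbar_obs - mu1) / sigma_xbar)

def sev_lower(xbar_obs, mu1):
    """SEV(mu <= mu1) after a non-rejection: P(X-bar > xbar_obs; mu = mu1)."""
    return 1 - Phi((xbar_obs - mu1) / sigma_xbar)

# significant result: xbar = 0.4 (d = 2.0) warrants mu > .2 fairly well
print(round(sev_upper(0.4, 0.2), 3))   # 0.841

# non-significant result: xbar = 0.1 gives high severity to mu <= 0.5
print(round(sev_lower(0.1, 0.5), 3))   # 0.977
```

Which direction the tail probability is taken in depends on whether H0 was accepted or rejected, matching the comment’s two cases.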

Om: You have to give the form of the H’s. It’s never a point hypothesis; inference is by way of discrepancy assertions.

Thanks for your reply, omaclaren. I was trying to go beyond the mathematics (where I think, by the way, that I agree with you and Michael Lew) and look at the intended meaning of these calculations. My problem is that I don’t see how to reconcile the computations with the definition of passing a severe test.

In the example I cited, there is a calculation of SEV (μ > .2). I can’t tell if it is:

a) intended to show that the hypothesis μ > .2 passes a severe test T with data x0

b) intended to show that the hypothesis μ > 0 passes a severe test T with data x0

c) unrelated to the concept of a hypothesis passing a severe test

(I have no issues with the data x0 or the test T, which I understand is constructed to test the null hypothesis mu=0 against the alternative mu>0)

Hi Carlos,

Yes, I agree it seems somewhat ambiguously worded. Is S-1 accepted on the basis that mu > 0.2 implies mu > 0? But I wouldn’t say that x0 necessarily accords with, say, mu > 1000. So now I’m also a bit puzzled. Shouldn’t we check mu > 0.2 itself, i.e., change the H in S-1 and S-2?

It’s obviously (a); that’s what the abbreviation means. You’ve rejected a null, say, and now you want to know what discrepancies are warranted beyond the rather uninformative claim that some (pos) discrepancy is indicated. I prefer to call the “claims” about discrepancies claims. So you consider the prob you’d get a worse fit than you did under .2. If this paper doesn’t help, try ch. 11 of my book (1996), EGEK, under publications on the left.

Thank you. It’s now clear that we’re looking at the claim “the hypothesis μ > .2 passes a severe test T with data x0” which is (obviously) equivalent to the conjunction of the two following claims:

(S-1) x0 accords with μ > .2, (for a suitable notion of accordance) and

(S-2) with very high probability, test T would have produced a result that accords less well with μ > .2 than x0 does, if μ > .2 were false

Does x0 accord with μ > .2 ? We know that mean(x0)=0.4>0.2, so we can check that box if that is a suitable notion of accordance. (As omaclaren points out, the fact that we have rejected H0: μ = 0 does not seem a suitable notion of accordance as we would have to conclude as well that x0 accords with μ > 1000)

The second condition is being measured by the severity SEV(μ > .2)=.841, which looks “high probability” enough.

I think that makes your argument clear, thanks again.

I would like to add something, with the complete understanding that this is not what you are doing, claiming, or endorsing. Other suitable notions of accordance can be envisaged. One could measure how well x0 accords with μ > .2 by using a one-sided test for the null hypothesis μ = .2 against the alternative μ > .2. One would get a p-value .159 which, by fate or coincidence, happens to be one minus the severity above.

Carlos: The main thing is to have the two error probabilities: a high (low) probability of (in)correctly rejecting.

Any single error statistical inference could be reformulated in lots of different ways, but the impetus behind the interpretation and warrant will be absent, unless of course it’s the exact same thing by other names. That’s fine with me. The entire move began as my own way of interpreting tests and avoiding fallacies, but it turns out to have much greater payoffs in both philosophy of science and statistics.

One other thing: often one can take one’s pick and lambast an inference on “fit” grounds or argue appealing to (S-2). I’m totally open to modifications here; I tried to stay close to existing testing practice (as do other suggested reforms, except that the “reforms” we keep hearing about conflict with N-P and severity logic).

Thanks Carlos, your comments have clarified/solidified a few things for me. Given the connections with p-values, likelihood integrals and fiducial arguments it seems to me that severity is essentially how Fisher might have formulated what NP were trying to do. Funny, since my understanding is that NP developed from trying to reformulate Fisher!

Personally I think I prefer Fisher’s approach over NP’s but it seems like it just leads back to quantifying evidence via functionals of the likelihood function, at least in simple cases.

As I think Fisher noted several times, this means that unless you consider all such (or a sufficient number of) functionals then you are leaving information behind that is contained in the whole likelihood function.

On the other hand, you might argue, e.g. along the lines of Laurie Davies, that you only ever effectively have access to e.g. a finite number of such functionals and the likelihood can’t be fully localised. As long as the likelihood is recognised as analogous to a weak derivative (e.g. in the generalised function sense) and hence only considered via its action as part of a functional then the two approaches should be equivalent.

(I’m sorry, I forgot about the issues with “less than” characters)

Could you clarify for me what is the hypothesis H in the example given in page 169?

When discussing the case “d(x0)=2.0” you write:

“Each statistically significant result “accords with” the alternative (μ > 0). So (S-1) is satisfied.”

It would seem that H: μ > 0. But to check condition (S-2) you calculate :

SEV (μ > .2) = P (X ≤ 0.4; μ > .2 is false) = P (X ≤ 0.4; μ ≤ .2 is true).

Which is what you would do for H: μ > .2

So what is it, “H: μ > 0” or “H: μ > .2”?

If the former, shouldn’t you calculate SEV(μ > 0)?

If the latter, why do you say that (S-1) is satisfied?

The first case asks whether H is also accepted along with H0 and the second asks whether H is also rejected along with H0. I interpret this as something like exploring a neighborhood of H0.

Om: again, it makes no sense without specifying the H; in the form of a discrepancy it could be H: mu < mu0, but it needn’t be. Spanos once proposed a notation, but I thought it looked too complicated; still, we did use it in a couple of papers.

I’m assuming the problem has been written in a standard form of H: mu = mu0 + delta, delta >= 0 (see my lazy H > H0 above)

Another attempt at sign conventions. How about

Let

H0: mu < mu0

If x0 > mu0 then

H(delta): mu < mu0 – delta, delta pos.

SEV(H) = P(X > x0; H)

If x0 < mu0 then

H(delta): mu < mu0 + delta, delta pos.

SEV(H) = P(X > x0; H)

I don’t really see a lot of distinction with a likelihood function analysis tbh, other than the likelihood approach localizing to x0 and mu in data and parameter space simultaneously, and sev being more absolute than relative. As long as the limit is taken carefully and an additional absolute check is carried out for the likelihood case, they should be very similar.

Also, SEV is a random variable since it depends on the observed data, right? How exactly is it a meta-assessment while the likelihood function is not?

In a case where a frequentist would make no adjustment to a P-value for multiplicity, optional stopping rules and the like, a severity curve can be calculated as one minus the definite integral of the relevant likelihood function. As far as I can tell, the likelihood analysis and severity analysis are effectively interchangeable in such circumstances. (Mayo has previously told me that I am mistaken, but the maths seems otherwise.)

In cases where a frequentist would adjust a ‘nominal’ P-value the severity curve would (presumably) differ from the integral of the likelihood function. However, I have not seen such a severity curve and I am uncertain as to how one would calculate such a beast.

Yes that seems to be true as far as I can tell too. Interesting that the equivalence appears to be related to Fisher’s fiducial argument – mu and x0 related by the same structural relation; can fix mu and integrate over X or fix x0 and integrate over mu. Same result subject to a few possible technicalities.
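The claimed interchangeability can be checked numerically in the simple Normal-mean case. A sketch under assumed values (sigma = 1, n = 25, observed mean 0.4); it compares SEV(mu > mu1) against one minus the normalized definite integral of the likelihood up to mu1:

```python
from math import erf, exp, sqrt

def Phi(z):
    """Standard normal CDF."""
    return 0.5 * (1 + erf(z / sqrt(2)))

sigma_xbar = 0.2   # assumed Normal-mean example: sigma = 1, n = 25
xbar = 0.4         # observed mean

def lik(mu):
    """Likelihood of mu given xbar, up to a constant."""
    return exp(-0.5 * ((xbar - mu) / sigma_xbar) ** 2)

def lik_frac_below(mu1, lo=-5.0, hi=5.0, n=20_000):
    """Normalized likelihood mass on mu <= mu1 (midpoint rule)."""
    h = (hi - lo) / n
    total = below = 0.0
    for i in range(n):
        mu = lo + (i + 0.5) * h
        w = lik(mu)
        total += w
        if mu <= mu1:
            below += w
    return below / total

for mu1 in [0.0, 0.2, 0.4]:
    sev = Phi((xbar - mu1) / sigma_xbar)   # SEV(mu > mu1)
    frac = lik_frac_below(mu1)
    print(f"mu1={mu1}: SEV={sev:.4f}  1 - lik integral={1 - frac:.4f}")
```

In this no-adjustment case the two columns agree to numerical precision, consistent with Michael’s observation; where P-values would be adjusted for multiplicity or stopping rules, the equivalence would presumably break down.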

Hi Michael, my last comment somewhere above (in response to Carlos) summarises a few of my views on the role of the likelihood function, (at least when I’m putting on my Fisherian hat), if you’re interested.

I think my views are effectively the same as yours except I prefer to only ever ‘weakly localise’ e.g. in the sense of weak derivatives ( https://en.wikipedia.org/wiki/Weak_derivative ) to particular parameters within a model rather than strongly localise (ordinary derivative).

(the comment starting “Thanks Carlos, your comments have clarified/solidified a few things for me…”)