**As part of the week recognizing R.A. Fisher (February 17, 1890 – July 29, 1962), I reblog Senn from 3 years ago.**

*‘Fisher’s alternative to the alternative’*

*By: Stephen Senn*

[2012 marked] the 50th anniversary of RA Fisher’s death. It is a good excuse, I think, to draw attention to an aspect of his philosophy of significance testing. In his extremely interesting essay on Fisher, Jimmie Savage drew attention to a problem in Fisher’s approach to testing. In describing Fisher’s aversion to power functions Savage writes, ‘Fisher says that some tests are *more sensitive* than others, and I cannot help suspecting that that comes to very much the same thing as thinking about the power function.’ (Savage 1976, p. 473)

The modern statistician, however, has an advantage here denied to Savage. Savage’s essay was published posthumously in 1976, and the lecture on which it was based was given in Detroit on 29 December 1971 (p. 441). At that time Fisher’s scientific correspondence did not form part of his available oeuvre, but in 1990 Henry Bennett’s magnificent edition of Fisher’s statistical correspondence (Bennett 1990) was published, and this throws light on many aspects of Fisher’s thought, including on significance tests.

The key letter here is Fisher’s reply of 6 October 1938 to Chester Bliss’s letter of 13 September. Bliss himself had reported an issue that had been raised with him by Snedecor on 6 September. Snedecor had pointed out that an analysis using inverse sine transformations of some data that Bliss had worked on gave a different result to an analysis of the original values. Bliss had defended his (transformed) analysis on the grounds that a) if a transformation always gave the same result as an analysis of the original data there would be no point and b) an analysis on inverse sines was a sort of weighted analysis of percentages with the transformation more appropriately reflecting the weight of information in each sample. Bliss wanted to know what Fisher thought of his reply.

Fisher replies with a ‘shorter catechism’ on transformations which ends as follows:

A…Have not Neyman and Pearson developed a general mathematical theory for deciding what tests of significance to apply?

B…Their method only leads to definite results when mathematical postulates are introduced, which could only be justifiably believed as a result of extensive experience….the introduction of hidden postulates only disguises the tentative nature of the process by which real knowledge is built up. (Bennett 1990, p. 246)

It seems clear that by *hidden postulates* Fisher means *alternative hypotheses*, and I would sum up Fisher’s argument like this. Null hypotheses are more primitive than statistics: to state a null hypothesis immediately carries an implication about an infinity of test statistics. You have to choose one, however. To say that you should choose the one with the greatest *power* gets you nowhere. This *power* depends on the alternative hypothesis, but how will you choose your alternative hypothesis? If you knew that under all circumstances in which the null hypothesis was true you would know which alternative was false, you would already know more than the experiment was designed to find out. All that you can do is apply your experience to use statistics, which when employed in valid tests, reject the null hypothesis most often. Hence statistics are more primitive than alternative hypotheses and the latter cannot be made the justification of the former.

I think that this is an important criticism of Fisher’s but not entirely fair. The experience of any statistician rarely amounts to so much that this can be made the (sure) basis for the choice of test. I think that (s)he uses a mixture of experience and argument. I can give an example from my own practice. In carrying out meta-analyses of binary data I have theoretical grounds (I believe) for a prejudice against the risk difference scale and in favour of odds ratios. I think that this prejudice was originally analytic. To that extent I was being rather Neyman-Pearson. However, some extensive empirical studies of large collections of meta-analyses have shown that there is less heterogeneity on the odds ratio scale compared to the risk-difference scale. To that extent my preference is Fisherian. However, there are some circumstances (for example where it was reasonably believed that only a small proportion of patients would respond) under which I could be persuaded that the odds ratio was not a good scale. This strikes me as veering towards the N-P.
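A hypothetical illustration of why the choice of scale matters (the numbers below are invented, not from any actual meta-analysis): two trials constructed to share an odds ratio will generally not share a risk difference, so apparent heterogeneity depends on the scale one adopts.

```python
# Two hypothetical 2x2 trial summaries: (events, total) for control and
# treated arms. The trials are constructed to share an odds ratio, yet their
# risk differences disagree -- heterogeneity is a property of the scale.

def risk(events, total):
    return events / total

def odds(events, total):
    p = risk(events, total)
    return p / (1 - p)

trials = [
    {"control": (10, 100), "treated": (20, 100)},  # baseline risk 10%
    {"control": (40, 100), "treated": (60, 100)},  # baseline risk 40%
]

for t in trials:
    pc, pt = risk(*t["control"]), risk(*t["treated"])
    odds_ratio = odds(*t["treated"]) / odds(*t["control"])
    risk_diff = pt - pc
    print(f"odds ratio = {odds_ratio:.2f}, risk difference = {risk_diff:.2f}")
# odds ratio = 2.25, risk difference = 0.10
# odds ratio = 2.25, risk difference = 0.20
```

Both trials are perfectly homogeneous on the odds ratio scale (2.25) while the risk differences disagree by a factor of two; with the roles reversed, one could just as easily build trials homogeneous in risk difference and heterogeneous in odds ratio.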

Nevertheless, I have a lot of sympathy with Fisher’s criticism. It seems to me that what the practicing scientist wants to know is what is a good test in practice rather than what would be a good test in theory if this or that could be believed about the world.

**References: **

J. H. Bennett (1990) *Statistical Inference and Analysis: Selected Correspondence of R.A. Fisher*, Oxford: Oxford University Press.

L. J. Savage (1976) On rereading R. A. Fisher. *The Annals of Statistics*, 4, 441–500.

###### Related articles

- JERZY NEYMAN: Note on an Article by Sir Ronald Fisher (errorstatistics.com)
- E.S. PEARSON: Statistical Concepts in Their Relation to Reality (errorstatistics.com)
- Fisher, Statistical Methods and Scientific Inference (errorstatistics.com)

“All that you can do is apply your experience to use statistics, which when employed in valid tests, reject the null hypothesis most often.” This makes it sound like they would both seek a most powerful or sensitive test; it’s just a matter of whether one arrives at it by past experience or some analytic means. So they are both considering directional departures, or test statistics with this goal in mind. Is this the view?

Perhaps two things to keep in mind.

1. In statistics being close, like in horseshoes, is often good enough (if you get something vaguely in the direction of the true unknown alternative, your power will be less poor).

2. The justification for abduction is not very solid, but I would choose past experience over analytic means (unless it is a formulation of past experience). So would Peirce: http://en.wikisource.org/wiki/A_Neglected_Argument_for_the_Reality_of_God

I post this question because if the answer is yes, then the appearance of some kind of conflict between Fisher and N-P about the need to consider at least directional alternatives, or test statistics, kind of disappears. Neyman shows that if you leave the test statistic/alternative completely open then you can define one that will give opposite inferences. So he says no probabilistic account of testing is possible without considering, even subliminally, an alternative. This is shown in several places; it is stated, but not shown, in the Neyman 1956 that is linked.

Stephen, a very interesting post. I hadn’t really thought of the difference in the treatment of the null in that way. However, now that you mention it I realise that Fisher’s eventual conception of the null hypothesis is made clear in his 1955 paper Statistical Methods and Scientific Induction:

“in the theory of estimation we consider a continuum of hypotheses each eligible as null hypothesis, and it is the aggregate of frequencies calculated from each possibility in turn as true […] which supply the likelihood function.” (Fisher, 1955, *J. Royal Stat. Soc. B* 17: 69–78, p. 73)

It seems to me that one area of disagreement between myself and frequentists is the fixity of the null hypothesis. In their conception the null is a special hypothesis whereas in mine it is nothing more than a particular value of the parameter of interest. Of course, the decision process of a Neyman-Pearson hypothesis test requires the null to be an anchor, but a complete assessment of the evidence in the data does not.
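Fisher’s description in the quote above can be sketched directly. In this toy example (hypothetical binomial data, not anything from the 1955 paper), each parameter value on a grid is treated in turn as an eligible null hypothesis, and the frequencies so calculated trace out the likelihood function:

```python
import math

# Hypothetical data: k = 14 successes in n = 20 trials. Treat each candidate
# value of theta in turn as the true (null) hypothesis, compute the frequency
# (probability) of the observed data under it, and aggregate: the resulting
# curve over theta is the likelihood function.

def binom_pmf(k, n, theta):
    return math.comb(n, k) * theta**k * (1 - theta) ** (n - k)

n, k = 20, 14
grid = [i / 100 for i in range(1, 100)]               # candidate null values
likelihood = {theta: binom_pmf(k, n, theta) for theta in grid}

mle = max(likelihood, key=likelihood.get)
print(mle)  # 0.7, i.e. k/n, where the likelihood function peaks
```

On this view the “null” value 0.5, say, has no special status: it is just one point on the curve, which is Michael’s point about fixity.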

Michael: The way I interpret tests, it could well be any claim one wishes to entertain, and thus it is more like what Fisher describes in “2 new properties of likelihood” a couple of posts ago. I just learned Stephen is away for a week, so we’ll likely have to come back to this. By the way, I was writing the section on Royall in my book today and noticed he uses (1 – beta)/alpha as a likelihood ratio. I believe you were one of the ones asking me about this in my post that went off topic the other day.

[Breaking my promise below] speaking of this paper, doesn’t Fisher basically say the whole likelihood must be taken into account to get all the information, including higher derivatives (or a definite integral, say)? See https://errorstatistics.com/2015/02/16/r-a-fisher-two-new-properties-of-mathematical-likelihood-just-before-breaking-up-with-n-p/

omaclaren:

It is interesting but perhaps needs some clarification (I’ll quote from my thesis http://andrewgelman.com/wp-content/uploads/2010/06/ThesisReprint.pdf )

‘[Fisher] concluded that, in general, single estimates will not suffice but that the entire course of the likelihood function would be needed. He then defined the necessary ancillary statistics in addition to the MLE in this case as the second and higher differential coefficients at the MLE (given that these are defined). These would allow one to recover the individual sample log-likelihood functions (although he did not state the conditions under which the Taylor series approximation at a given point recovers the full function – see Bressoud[14]) and, with their addition, the log-likelihood from the combined individual observations from all the samples.’
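The Taylor-series point in the quoted passage can be sketched with a hypothetical binomial sample (my own toy numbers): the log-likelihood’s value and second derivative at the MLE give a quadratic approximation that recovers the function well near the MLE but only approximately further away, which is exactly the unstated-conditions caveat.

```python
import math

# Hypothetical binomial sample: k = 14 successes in n = 20 trials.
k, n = 14, 20

def loglik(theta):                       # log-likelihood, constant dropped
    return k * math.log(theta) + (n - k) * math.log(1 - theta)

mle = k / n                              # 0.7
obs_info = k / mle**2 + (n - k) / (1 - mle) ** 2  # minus 2nd derivative at MLE

def taylor(theta):                       # quadratic Taylor approximation at MLE
    return loglik(mle) - 0.5 * obs_info * (theta - mle) ** 2

for theta in (0.68, 0.72, 0.5):
    print(theta, round(loglik(theta) - taylor(theta), 4))
# the discrepancy is tiny near the MLE and grows further away
```

For the binomial the observed information simplifies to n / (p(1 − p)), so the quadratic term alone already carries the “spread” information the MLE itself discards.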

So it involved a math error; the term ‘ancillary’ was coined here, though Fisher redefined it soon afterwards (according to Stigler). Still, he did point out the need for the full likelihood function to avoid loss of information.

Fisher mentions the math error in Statistical Methods and Scientific Inference

“. . . it is the Likelihood function that must supply all the material for estimation, and that the ancillary statistics obtained by differentiating this function are inadequate only because they do not specify the function fully.”

But claims

“It is usually convenient to tabulate its [the likelihood’s] logarithm, since for independent bodies of data such as might be obtained by different investigators, the “combination of observations” requires only that the log-likelihoods be added.”

I think he was misguided here:

“These suggestions assume complete certainty of the probability model assumed by the investigators – if this should differ between them or at any point come under question (i.e. such as when a consistent mean variance relationship is observed over many studies when assumptions were Normal)” it would be problematic.

Also note the total disregard for possibly different frequency properties of likelihoods from different studies…

Thanks for the opportunity to revisit this.

Fascinating, thanks! I’ve downloaded your thesis, too.

PS I was also similarly sloppy – I did in fact mean that he implied that the full likelihood is required (but that in certain circumstances one could use the Taylor series for local approximations).

Really interesting that he stated this so explicitly, as you note:

“it is the Likelihood function that must supply all the material for estimation, and that the ancillary statistics obtained by differentiating this function are inadequate only because they do not specify the function fully”.

!!!

If we assume sufficient smoothness, where does this leave discussions of the Likelihood Principle?

Relatedly, I now take the (distinct) ‘Law of Likelihood’ to be a simplistic assumption that focuses only on the first derivative of the log-likelihood. Of course this will often give the same max likelihood estimate for a number of different likelihoods, but there will also typically be differing measures of variance (etc.) at that point *according to full knowledge of the relevant likelihoods*, as noted by Fisher himself.

So this at least leaves the *possibility* of purely likelihood-based inference, properly understood, that avoids the typical issues with, say, the ‘Law of Likelihood’, no?
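A toy numerical sketch of the curvature point (hypothetical data): two binomial samples can share a maximum-likelihood estimate while the second derivative of the log-likelihood at that maximum, the observed information, differs by orders of magnitude; a point-against-point comparison at the MLE sees no difference at all.

```python
# Observed information for binomial data: minus the second derivative of the
# log-likelihood at the MLE k/n, which works out to n / (p * (1 - p)).

def observed_information(k, n):
    p = k / n
    return n / (p * (1 - p))

small = (5, 10)       # 5 successes in 10 trials    -> MLE 0.5
large = (500, 1000)   # 500 successes in 1000 trials -> same MLE 0.5

print(observed_information(*small))   # 40.0
print(observed_information(*large))   # 4000.0
```

Any ‘likelihood interval’ built from the curvature would be ten times narrower for the large sample, information that a bare ratio of maximized likelihoods discards.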

You should note, by the way, that “principle of likelihood” and other similar phrases are used in different ways by Fisher and N-P, and neither means the LP as we know it. The same goes for “law of likelihood”.

Mayo –

Of course – obviously I’m not up on all the history and subtleties.

A few points, though:

– Based solely on these quotes Fisher seems to be putting forward the view that – given/conditional on/assuming the validity of a model – the subsequent estimation can/should be carried out based on the whole likelihood function.

– Whether this means extra criteria/operations – e.g. some integration/differentiation or whatever over the sample space – should be used or not is not clear, again based only on these quotes. Some people (e.g. Fraser) seem to have looked at how one can use continuity in the model relation between parameter changes and changes in observables to include additional local sample-space-style information in likelihood-style estimation, but this is a bit beyond me for now. It may relate to Fisher’s fiducial ideas?

– How to combine/compare likelihoods from different experiments/different models may be subtle (Keith’s thesis appears to look at this question) because of the dependence of the likelihood-based estimation on the validity of the model.

– However, the ‘Law of Likelihood’ can be considered misguided from a purely Likelihoodist view even in a pure estimation context and without any frequentist considerations. In particular because it encourages what amounts to simple point-against-point or ‘first derivative’ style comparisons using the likelihood.

– One could supplement the ‘Law’ with another similarly ‘local’ concept such as consideration of the second derivative of the log-likelihood as a measure of ‘variance’ or ‘information’ or even ‘confidence’ at a point. This would (?) naturally lead to the construction of ‘likelihood intervals’, without any frequentist considerations but which converge to classical confidence intervals under certain assumptions.

– In general, however, if one is following the Likelihoodist paradigm it is probably better to consider the whole likelihood function (again, restricting attention to estimation/a valid model) when considering how the ‘evidence’ is distributed over parameter/hypothesis space, rather than any additional/simplistic ‘Law of likelihood’ or similar.

– My personal reading/translation of the general statement of the severity principle, putting aside its specific frequentist implementation (e.g. any operations over sample space), is that it encourages (locally-speaking) considering the sort of ‘higher-order derivatives’ or ‘higher-order counterfactuals’ in parameter/hypothesis space mentioned above. The more global implementation would presumably involve some sort of integral of the ‘fit-measure’ over the alternative hypotheses, and this seems to be the case (I haven’t looked at any detailed calculations).

– These ideas apply similarly to Bayes factors – they are potentially bad for essentially exactly the same reasons that point-against-point likelihood ratios are bad. That is, they ignore the ‘full’ information in the posterior. E.g., speaking locally, they have no consideration of ‘curvature’ or ‘variance-like’ measures (or any other higher-order derivatives). This can again be corrected based purely on Bayesian or Likelihoodist-style considerations – use the whole posterior/likelihood instead!

– Again I don’t know much, but from what I’ve seen it seems that many Bayesians are advocating against Bayes factors for basically the above reasons, and moving towards replacing uses of Bayes factors with consideration of full posterior distributions. Again, this need not be (and doesn’t seem to be) motivated by any frequentist reasons, but rather by a desire to ‘use the full posterior information’.

omaclaren: you say you haven’t seen any severity calculations when we’ve done tons of them over the years you’ve been reading–I think.

On bayes factors. I see them as more popular than posteriors. When they’re given up, it tends to mean the person is not doing anything recognizably Bayesian, nothing with Bayesian foundations.

Mayo –

I just meant that I haven’t carried out detailed calculations comparing severity to other things myself. Being lazy, basically.

I agree that bayes factors seem to be currently popular, but I think that this is a bad idea for the reasons described. In terms of visible bayesians – Gelman has openly argued against bayes factors based on what seem to be similar intuitions about these issues. From a cursory reading of his blog, Robert (X’ian) seems to be moving away from bayes factors too, towards more posterior-style summaries of tests.

I disagree about lack of bayesian foundations for non bayes factor measures – I think bayes factors are only a (generally speaking) poor approximation to what would be ‘proper’ Bayesian inference using a full posterior.

Mayo: “On bayes factors. I see them as more popular than posteriors.”

In philosophical discussions, maybe. But no way in practice; at least one factor in this is, I think, use of MCMC methods to do the calculations. These handle posteriors from smooth priors with relative ease, but with spiked priors, of the sort that lead naturally to Bayes Factors, implementation is far trickier. Gelman et al’s book on Bayesian Data Analysis largely ignores Bayes Factors.

omaclaren:

First, I think Stephen put it best here https://errorstatistics.com/2015/02/19/stephen-senn-fishers-alternative-to-the-alternative-2/#comment-117798 – no one has convinced anyone else that they know how to deal with nuisance parameters adequately.

No end to the vagueness and confusion about the LP, but most everyone, I think, considers it to involve the full likelihood function. When they say sufficiency, as in sufficient statistics, they mean they can generate the full likelihood function from those statistics (it’s the definition of sufficiency these days).

To get outside deductive logic and really get at induction (which should be what is obviously wrong with most advertised versions of Bayesian statistics), one needs to get something more than the posterior. Induction needs to be a process of getting less wrong about representations, not just less wrong about parameter values in a given representation. Mayo may wish to call this error statistical but I would prefer something more like pragmatistic induction or induction of induction.

Mike Evans has some material you might find interesting here http://www.utstat.toronto.edu/mikevans/ and, for the LP in particular, http://projecteuclid.org/euclid.ejs/1382706342 .

A very high level take on Mayo’s and Mike’s approaches to the LP might be usefully characterized as Mayo – “The implications are silly, the assumptions must be wrong” and Mike – “The fully explicated assumptions are silly, therefore any implication will be silly”.

Enjoy the intellectual adventure.

Thanks! I do feel bad for spamming this blog with my thoughts during my ‘intellectual adventure’ though…

Mayo, several times it has been implied that my own use of “likelihood principle” differs from yours. My own is taken pretty much intact from Edwards:

Within a statistical model, all of the evidence in the data relevant to a parameter of interest is contained in the relevant likelihood function.

What is the principle that you would say “as we know it”?

Mayo, I’m glad to hear that you’re writing, but I have to say that I do not understand your first sentence.

On your by the way about the likelihood ratio, the point of my comment on that previous post was that as neither alpha nor beta is a likelihood, that ratio cannot be construed as a likelihood ratio. It doesn’t matter what Royall may or may not have written, a likelihood ratio has to be a ratio of likelihoods.

Also, re: Royall and likelihood (off topic again, sigh) – as you’ve repeatedly mentioned, considering the shape of the likelihood function is as important as any isolated point. Considering the ‘amount of likelihood’ in a neighbourhood of the max likelihood point – ie considering the second derivative of the likelihood function – seems to amount to a (local) severity-style counterfactual check of the max likelihood hypothesis.

‘With high probability the fit under other hypotheses is worse’ can be reinterpreted along the lines of ‘there is a low total likelihood contained in the (by definition) worse-fitting neighbourhood of the max likelihood hypothesis’. More generally the entire likelihood could be inspected. And non-max points etc. This means the classic card example doesn’t work as the likelihood has low curvature at the maximum, right? And if one admits priors then it would be even more similar, in terms of the second derivative of the posterior. Obviously the analogy is not necessarily exact, and is probably obvious to some, but interesting to me. Converts the ‘logical principle’ to a more geometrical notion which helps with the intuition.

They treat it as if the event is the tail area; it’s rather widespread. Do you have Royall’s book?

No, I have to admit I don’t have Royall’s book. Skimming/searching what I can through Google Books seems to indicate he doesn’t mention anything like what I’ve said. He seems to argue for the fairly naive account you attribute to him, though I can’t be sure without reading properly.

However, doing the same with Edwards’ book on likelihood is quite different – he seems to very explicitly discuss such issues. The second derivative is then related to measures of ‘information’ and ‘observed information’ (of course!). So the second derivative evaluated post-data at some parameter theta is the ‘informativeness’ of the data about theta. Or something along those lines (I haven’t read/thought carefully). Seems a similar sort of thing to severity?

I don’t see how, this looks to be the usual info-theoretical notion. Severity assesses how capable a method is to have avoided erroneous interpretations of data. The data are used in statistical inference to infer something about aspects of the process triggering the data (as modeled). We know such inferences are open to error, and the severity assessment concerns the inability (or ability) to alert us to such errors. This is the basis for assessing how well or poorly tested the inference is. It’s rather different from how comparatively likely, how comparatively well supported, or how (non-comparatively) probable claims are.

So it differs from the comparative likelihoodist and from probabilists of all stripes.

By the way, there’s a good short review of Edwards by Hacking in Brit J for the philo of science 1972. Same issue has a review of Hacking (when he was a likelihoodist) by none other than George Barnard.

Thanks for the reference, I’ve had a quick read. The important thing to me is that my comment refers to the second derivative of the log-likelihood as a measure of how confident we are in an estimate that may have been proposed separately eg according to max likelihood (first derivative zero). So these are distinct concepts and I don’t see any mention of this by Hacking. This additional notion is in essence a variance measure, no?

omaclaren, you raise a very important issue by pointing to the similarity of severity and likelihood functions. The severity curve for an observed sample from a normal with known variance seems to be identical to the definite integral of the likelihood function, so they may be more than just similar.

As far as I can work out, likelihood functions and severity curves would support substantially different inferences only in the circumstances where stopping rules, multiplicity of testing etc. make the method unreliable. The likelihood function is unaffected by such things but I assume the severity curve would change. However, I have never seen a severity curve for such situations.
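Michael’s observed identity is easy to check numerically under standard definitions (my own hypothetical numbers; severity here is taken as SEV(mu > mu0) = P(Xbar <= xbar; mu0) for a normal mean with known sigma): the severity curve coincides with the upper-tail area of the normalized likelihood function.

```python
import math

# Hypothetical setting: xbar observed from N(mu, sigma^2/n), sigma known.
xbar, sigma, n = 1.3, 2.0, 25
se = sigma / math.sqrt(n)

def Phi(z):                                # standard normal CDF
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def severity(mu0):                         # SEV(mu > mu0) = P(Xbar <= xbar; mu0)
    return Phi((xbar - mu0) / se)

def likelihood(mu):                        # unnormalized likelihood in mu
    return math.exp(-((xbar - mu) ** 2) / (2 * se**2))

# definite integral of the normalized likelihood above mu0 (crude Riemann sum)
step = 16 * se / 20000
grid = [xbar - 8 * se + i * step for i in range(20001)]
total = sum(likelihood(m) for m in grid)

def likelihood_tail(mu0):
    return sum(likelihood(m) for m in grid if m > mu0) / total

for mu0 in (0.5, 1.0, 1.5):
    print(mu0, round(severity(mu0), 4), round(likelihood_tail(mu0), 4))
# the two columns agree to the accuracy of the numerical integration
```

The agreement here is a symmetry fact about the normal with known variance, not a general theorem, which is consistent with Michael’s hedged “as far as I can work out”.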

Michael – that correspondence makes a lot of sense to me! Re: stopping rules and things, I believe Gelman has examples in his book where he aims to bring those into the probability model directly – which makes sense as those are part of the process by which the data are generated.

So the likelihood would seem to change under these circumstances if you have this info on the ‘full’ data generating mechanism.

Won’t work for Bayesians or likelihoodists because the difference cancels out in the ratio. Look up LP and SLP on this blog. Rejecting the LP is one thing, never mind that it makes Bayesians incoherent, it still doesn’t follow they get meaningful error control, nor that they’ve made the adjustments for the “right reasons”. Or rather, if made for the right reasons, they are error statisticians.

Perhaps you could dissect Gelman’s examples at some point to see. Maybe Royall etc base everything on simplistic ratios but that doesn’t mean all probability modelling that uses likelihood or bayes for the estimation phase (re my comments on predictive distributions on that other post) is subject to the same errors. It may make them ‘error statisticians’ but it may not! Perhaps it could even work the other way, or neither?

omaclaren, on your suggestion that the likelihood function might change when the stopping rules and things are taken into account: no, not at all. But there is something that needs to be added to the likelihood function for inference in such circumstances.

The likelihood function is not affected by stopping rules, multiplicity of comparisons or any of the things that are thought to constitute “P-hacking”. I think of that as being a result of the likelihood function being a product of the data: after an experiment is run the data in hand are not affected by design features of the experiment such as stopping rules and neither are they affected by the existence of analytical comparisons or other datasets. Thus P-hacking does not affect the likelihood function. What P-hacking does affect is the reliability of the method, and we may wish to pay attention to the reliability of the method when making inferences.

What does the reliability of the method do to the evidence? Nothing if the evidence is in the data, but it nonetheless should colour how we respond to the evidence. Evidence can be misleading in that it may point towards parameter values that are far from the true value. An unreliable method is more likely to yield misleading evidence than a more reliable method and so we should be more sceptical of the likelihood function from an experiment or analysis that has a low reliability. However, we need to characterise that unreliability separately from the likelihood function in order to calibrate our scepticism.

Mayo’s severity seems to be an attempt to meld the evidence with the reliability of the method. It’s a good attempt, but I suspect that we might be better served by considering the two things separately.

I should clarify – I think the probability model (generating mechanism, sampling model, whatever you want to call it) specification changes. Given a correct probability model the likelihood is of course defined via this model evaluated at fixed data with varying parameters. So I think messing with how you collect data leads to model misspecification (since you’ve left out processes – tampering!) and so invalidates any estimation (eg by inspecting/using the likelihood) which assumes model validity.

But Gelman discusses this much better than me in his book (I don’t have it on hand)

omaclaren, on your suggestion that the model changes: the stopping rules of the model do not affect the likelihood function. That is easily demonstrated by simulations, and so I am not just saying it because of the principle of the irrelevance of stopping rules. That is exactly why I am convinced that we need to consider the likelihood function and the reliability separately: the likelihood function does not, and cannot, display the consequences of stopping rules.
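Michael’s claim can in fact be seen without simulation in the textbook pairing of designs (hypothetical data, a standard example rather than anything specific to his comment): fixed-sample binomial sampling versus sampling until the k-th success. Given the same observed data, the two likelihood functions differ only by a constant factor, so any inference that respects the likelihood function alone cannot register the stopping rule.

```python
import math

# Same hypothetical data under two designs: 3 heads in 12 flips, where design
# (a) fixed n = 12 in advance, and design (b) flipped until the 3rd head,
# which happened to arrive on flip 12.

def binomial_lik(p, k=3, n=12):          # design (a): fixed sample size
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

def neg_binomial_lik(p, k=3, n=12):      # design (b): stop at the k-th head
    return math.comb(n - 1, k - 1) * p**k * (1 - p) ** (n - k)

ratios = [binomial_lik(p) / neg_binomial_lik(p) for p in (0.1, 0.3, 0.5, 0.7)]
print(ratios)   # constant in p: comb(12, 3) / comb(11, 2) = 220 / 55 = 4
```

Because the ratio is constant in p, the two designs yield identical likelihood-based inferences even though their frequentist error probabilities (and hence severity assessments) differ.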

That’s why the law of likelihood and accounts that obey the Likelihood Principle are bankrupt for the problem of inference (see my post on why the LL is bankrupt). They fail utterly to pick up on essentially everything that has people wringing their hands over these days: barn-hunting, cherry picking, multiple testing, post-data subgroups, verification bias, trying and trying again, optional stopping.

Royall and others will say you’re confusing evidence with belief, and I say, no. These things destroy the evidence! And even if I totally believed the hypothesis in question, I would hold that these destroy the evidence whenever they result in no or poor control of the error probabilities needed for severity.

(There are cases of double counting, that do sustain severe tests. I’ve discussed them in 3 or 4 papers, 1 posted.)

It’s silly to describe the best characterisation of the evidential content of the data as bankrupt. In cases where there has been no P-hacking then the likelihood function tells you all that you need to know. If there has been P-hacking then you need to know in addition the degree of unreliability of the data.

In the case that there has been no P-hacking the severity curve-based inference will be identical to the likelihood function-based inference. Is it also bankrupt?

It’s silly to beg the question in favor of your pet philosophy in the face of trenchant argument against likelihoodism. There’s good reason that there are almost no pure likelihoodists around, and they can only save themselves by adding priors.

The second paragraph is also illicit. You should know by now from this blog: getting similar numbers does not mean you’re doing the same thing, nor that it’s justified in the same way. The error statistician is doing something radically different from the likelihoodist; note how Royall cannot even deal with composite tests.

“That’s why the law of likelihood and accounts that obey the Likelihood Principle are bankrupt … They fail utterly to pick up on essentially everything that has people wringing their hands more than ever these days: barn-hunting, cherry picking, multiple testing, post-data subgroups, verification bias, trying and trying again, optional stopping.”

That’s an interesting theory, but as with all theories it helps to occasionally compare it to reality. Every bad paper you’re describing I’ve personally seen or had to deal with was written by non-Bayesians with practically no exposure to Bayesian statistics. I’ve yet to see one such monstrosity written by a Bayesian. Every non-reproducible paper I’ve seen first hand was pure Fisher/Neyman-Pearson type classical statistics.

Non-Bayesians have had control of curriculum, text books, professorships, positions of power in the statistics community for a very long time. They teach students, and those students go out and do horrible research. It’s not happening because some Bayesian told them it was Ok to do horrible research. They got the idea their horrible research was good from their classical statistical education.

Alan: Your post is a non sequitur as usual. Corrupt uses of a method do not entail that a method with no way to properly pick up the corruption is good. Just the opposite: the corruptions remain invisible there. The unreliable research is fairly easy to explain. In more sophisticated cases, e.g., Potti and Nevins, with bad data and violated assumptions (although they claim to be doing Bayesian regression), it’s more error statistics that serves to fraudbust; wrt social psychology, even though more stat is used in the replication industry, I doubt this is correcting what’s really wrong there.

Michael – I’m not very knowledgeable about the details of this topic, but see here for more discussion and refs – http://andrewgelman.com/2014/02/13/stopping-rules-bayesian-analysis/

“In the case that there has been no P-hacking the severity curve-based inference will be identical to the likelihood function-based inference.” We have been around the bush on this issue many times on this blog. Many of us have a broader view of evidence as information that is capable of proving or disproving something, and of inference as a process for arriving at conclusions. To me, calling the likelihood function from a set of data “THE evidence” from the data, separate from consideration of the process that generated the data, is invalid and cannot support inference. It is invalid precisely because it is vulnerable to misconception through the many ways one can generate data (e.g. p-hacking, cherry-picking etc). Michael clearly declares one should pay attention to these factors, but wants to say they are not part of the evidence. This I think confuses people and can mislead.

My final comment on this thread is that one tries to set up (specify) a probability model that captures the full process of obtaining the data, including collecting and filtering it, and including the possibility of data-dependent stopping rules, etc. This is covered in many places. Given a properly specified model, the estimation can be carried out. This next part is where likelihood is used, severity is used, Bayes is used, etc. They can all give very similar answers; in particular, one need not just consider the maximum likelihood without considering other properties of the likelihood function, such as its curvature or its integral. This seems to give essentially identical answers to severity when used as advocated by more sophisticated likelihoodists such as Edwards. This is not mere numerical correspondence but reflects (speaking in terms of local properties) considering both equilibrium (the maximum of the likelihood, where the derivative is zero) and stability (the variance, via the second derivative).
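Merely to illustrate the last point with a hypothetical binomial example (the data and numbers are mine, not from the thread): the zero of the log-likelihood’s first derivative locates the estimate, and the curvature at that point supplies a normal-approximation standard error, so the two “local properties” together carry much of the inferential content.

```python
import math

# Hypothetical data: y successes in n trials
y, n = 30, 100

def loglik(p):
    # Binomial log-likelihood (constant term dropped)
    return y * math.log(p) + (n - y) * math.log(1 - p)

p_hat = y / n                      # zero of the first derivative ("equilibrium")
info = n / (p_hat * (1 - p_hat))   # minus the second derivative at p_hat ("stability")
se = math.sqrt(1 / info)           # normal-approximation standard error

print(p_hat, round(se, 4))         # 0.3 0.0458
```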

Of course all estimation-related procedures are conditional on a correct model, severity included! Model checking is a different phase from estimation and should be consistently distinguished.

Michael: I appreciate your consistency, instead of trying to wriggle out of the official line. Now I hope the difference isn’t just semantics. I take it you really think that “evidence” is not influenced by the reliability of the method claiming to produce that evidence. Again, your consistency deserves credit, I’m so tired of people fudging what their account plainly says (e.g., we violate the likelihood principle but only in a legalistic sense).

Mayo, yes, I really do think of the evidence and method as being separate. The key thing is that the evidence can point towards the right region of parameter space or the wrong region. If it points wrong then it is misleading evidence. Even the best method will occasionally yield misleading evidence, but a bad method will give misleading evidence more frequently. Therefore we need an index of goodness for the method that is distinct from the misleadingness of the evidence. Thus I propose reliability of the method.

I’m thinking that severity captures the reliability of the method, which is good, but then it mixes it in with the evidence, which may be disadvantageous. We need first to characterise the two things separately before deciding how best to combine them for inference.

I’m glad that you are starting to see my viewpoint. It is not hard for me to be consistent because the consistency does not seem to lead to any problems. The ‘line’ that I am following comes more from Edwards than from Royall, whose book is as much a polemic as it is a definition of likelihood-based inference.

Michael: I appreciate your consistency so that I can pin you down; I know many people feel “evidence” is somehow separate from features which, to me, are essentially determinative of evidence, and thus inseparable from it. I would have to say things like: the data are very strong evidence for claim C, and yet they are terrible evidence, no test at all of C. Thus I find it cannot possibly do as an account of evidence. At best I’d have to say things like: these data WOULD have counted as evidence, say, for Potti and Nevins’ model, if they hadn’t thrown out data, cherry-picked, misrecorded data, and if the assumptions of the model held, but unfortunately…. I think that would be very strange, and I even doubt one could ever be on firm ground in saying it would have been great if only it wasn’t so botched and spurious.

Mayo, the other thing that I am consistent about is that “claims” that might be true or false are not the subject of likelihood functions. Instead, the likelihood function presents the evidence that relates to the values of parameters of the model within which the likelihood has been calculated. The evidence is relative and does not say true/false.

I agree with your take on the results of experiments that have been incompletely and incorrectly reported. However, I don’t really see that any statistical approach can deal with those results because the necessary information is not available. What we should be trying to achieve is a system whereby ordinary users are able to make the right decisions about how best to design experiments and make inferences. The current situation is that ordinary users have no idea. Alan’s comment above is right on the money in that regard, and even though it does not apply to severity and your ‘error statistics’ it is a fair description of the real world. (It deserved a less dismissive response.)

I think that an important advantage of my own proposals is that the explicit separation of the data-dependent evidence from the method-dependent reliability makes it easier to talk and think about the things that will make for good experiments and inference.

That’s the biggest difference with error statistics: it takes error probabilities into account, and these are altered by selection effects. Since the goal is assessing well-testedness, this is altogether crucial.

Now the meeting grounds you find may be explained by considering the likelihood ratio method for generating tests (by N-P).

Although I quite like error statistics I get uncomfortable if too much is claimed for it. The devil is in the detail. For example a standard frequentist approach to analysing cross-over trials, the so-called two-stage analysis, was shown by the Bayesian Peter Freeman to be awful not only in Bayesian terms but also in its failure to control the type I error rate. You can find a discussion here : http://www.senns.demon.co.uk/ROEL.pdf
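The failure of the two-stage crossover analysis to control the type I error rate can be checked by simulation. The sketch below is a rough Monte Carlo in the spirit of Freeman’s analysis, not a reproduction of his paper: under the global null, pretest for carryover at the 10% level on patient totals, then analyse either first-period data (if the pretest “detects” carryover) or within-patient differences. All parameters — 20 patients per sequence, between-patient SD 2, within-patient SD 1 — are illustrative assumptions.

```python
import math
import random

def t_stat(x, y):
    # Pooled two-sample t statistic
    nx, ny = len(x), len(y)
    mx, my = sum(x) / nx, sum(y) / ny
    ssx = sum((v - mx) ** 2 for v in x)
    ssy = sum((v - my) ** 2 for v in y)
    sp2 = (ssx + ssy) / (nx + ny - 2)
    return (mx - my) / math.sqrt(sp2 * (1 / nx + 1 / ny))

random.seed(1)
n = 20                    # patients per sequence (illustrative)
sig_b, sig_e = 2.0, 1.0   # between- and within-patient SDs (illustrative)
T_CAR = 1.686             # two-sided 10% critical value, t with 38 df
T_TRT = 2.024             # two-sided 5% critical value, t with 38 df
nsim, rejections = 10000, 0

def sequence_group():
    # (period 1, period 2) responses under the global null: no treatment
    # effect, no carryover; a shared patient effect induces correlation
    out = []
    for _ in range(n):
        s = random.gauss(0, sig_b)
        out.append((s + random.gauss(0, sig_e), s + random.gauss(0, sig_e)))
    return out

for _ in range(nsim):
    g1, g2 = sequence_group(), sequence_group()
    totals1 = [a + b for a, b in g1]
    totals2 = [a + b for a, b in g2]
    if abs(t_stat(totals1, totals2)) > T_CAR:
        # "carryover detected": fall back to first-period data only
        t = t_stat([a for a, _ in g1], [a for a, _ in g2])
    else:
        # usual crossover analysis on within-patient differences
        t = t_stat([a - b for a, b in g1], [a - b for a, b in g2])
    if abs(t) > T_TRT:
        rejections += 1

print(rejections / nsim)  # exceeds the nominal 0.05
```

The inflation arises because the carryover pretest (on patient totals) and the first-period fallback test are highly correlated, so the procedure switches to the fallback exactly when the fallback is most likely to reject a true null.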

Also severity has strong similarity to fiducial inference and the latter has difficulties with multi-parameter problems (as does likelihood, as does automatic Bayes etc.)

So the reason I give for rejecting an error statistics only approach is that I still have a day job and that involves planning statistical investigations and analysing them for real. (I really have planned and analysed cross-over trials for a living, for example.) It’s exactly the same reason I rejected Bayesian statistics as the only way to go. Unless and until I am convinced that a statistical system can solve all my problems I won’t commit to it.

Stephen: So you’re here! I’ve a query on your post.

But just to note: Absurd to argue that SEV is like fiducial, when it’s just either based on ordinary N-P error probabilities, or informal conceptions of probativeness.

Fiducialists are all over the place these days, by the way: confidence distribution folks, and at least 3 people who discussed my Birnbaum paper.

https://errorstatistics.com/2014/09/06/statistical-science-the-likelihood-principle-issue-is-out/

Some also seem to be using ordinary error probabilities, but it would be silly to pin Fisher’s erroneous “probabilistic instantiation” on them or me. (I don’t know if Fisher really meant to commit that fallacy in his 1955 paper against Neyman in the “triad”.)

One thing on your current post arose when I was rereading the Mayo and Cox (2010) paper. In cases where there is a single null, under embedded nulls, Cox describes formulating an alternative for purposes of getting a suitable test statistic. In those cases, there is no suggestion that rejecting the null would be grounds to infer the alternative, but it’s a useful way to get a test stat to check if the simplifying assumption in the null holds up. p. 258, Mayo and Cox (2010), Error and Inference. It’s interesting that this is a Fisherian idea of formulating a type of alternative for purposes of a test stat, so one is not really operating without one, just without one that is entirely within a statistical model. Do you agree?

I agree partly. I am part way between Fisher and Neyman. I think that you can give theoretical grounds for using a variance-stabilising transformation (like Bliss), but the question is “does it work in practice?”. As regards the former, alternative hypotheses may be useful, but as regards the latter it is statistics, and hence experience, that are important. If not too much stress is given to alternatives, that’s fine. But I don’t approve of ‘worshipping the alternative’.
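As a quick check of the theoretical grounds mentioned here: on the angular (inverse sine) scale the sampling variance of a transformed binomial proportion is approximately 1/(4n) whatever the true proportion — the variance-stabilising property Bliss was relying on. A small Monte Carlo sketch, with sample size and proportions chosen purely for illustration:

```python
import math
import random

random.seed(0)
n, nsim = 100, 10000  # trials per sample, Monte Carlo replicates (illustrative)

def sim_var(p):
    # Monte Carlo variance of arcsin(sqrt(p_hat)) for a Binomial(n, p) proportion
    vals = []
    for _ in range(nsim):
        k = sum(random.random() < p for _ in range(n))
        vals.append(math.asin(math.sqrt(k / n)))
    m = sum(vals) / nsim
    return sum((v - m) ** 2 for v in vals) / nsim

results = {p: sim_var(p) for p in (0.1, 0.3, 0.5)}
for p, v in results.items():
    print(p, round(v, 5))  # each close to 1/(4n) = 0.0025
```

On the raw scale the variance p(1-p)/n would range from 0.0009 to 0.0025 across these proportions; on the angular scale it is nearly constant, which is why samples can be weighted equally after transformation.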

As regards inference more generally, nuisance parameters are the Achilles heel of systems of inference, so until somebody shows me how theirs deals with such problems, I reserve judgement.

Michael: If the entirety of the available data consisted of the fact that a test had been carried out and the outcome was in the rejection region, then (1 – beta) and alpha would indeed be likelihoods.

Corey: Yes that is how many see it. Thanks. The trouble is that it does them a great disservice in trying to get a plausible measure of comparative fit or support or the like (e.g., for use in a Bayesian computation). It comes out backwards.

Corey, when would an experimenter have only the fact that a hypothesis test had taken place, and why would one in such a circumstance try to make likelihoods from that fact?

How is beta ever usable, given that it is never a known probability, since it depends on the unknown true effect size?

Michael:

I pointed out what Corey pointed out already; perhaps the problem lies in your having a different definition of likelihood from the rest of us. Mayo points out that it’s distracting (confuses people), but it’s not wrong (according to how most define likelihood).

(You could read my thesis http://andrewgelman.com/wp-content/uploads/2010/06/ThesisReprint.pdf and raise questions if it’s not clear…)

Keith

Michael, all I wanted to point out was that the assertion that alpha and 1 – beta are not likelihoods is not strictly true, i.e., there exists a state of information for which those quantities are indeed the relevant likelihoods.

That said, such quantities are often treated as likelihoodish sorts of things in Ioannidis-style calculations regarding collections of tests.

(I’m nitpicking, in other words.)
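Corey’s nitpick can be made concrete. Suppose a one-sided z-test, and suppose the only datum reported is “the test rejected”; then each simple hypothesis’s likelihood is its probability of producing that datum — alpha for the null, 1 − beta for the alternative. A sketch with illustrative numbers (n = 25, alternative mean 0.5, unit variance):

```python
import math

def phi(z):
    # Standard normal CDF via the error function
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

# One-sided z-test of H0: mu = 0 vs H1: mu = 0.5, sigma = 1, n = 25
n, mu1, alpha = 25, 0.5, 0.05
z_crit = 1.6449                                # upper 5% point of N(0, 1)
power = 1 - phi(z_crit - mu1 * math.sqrt(n))   # 1 - beta at mu = mu1

# If all we learn is "the test rejected", the likelihood of each simple
# hypothesis is its probability of generating that datum:
lik_H0, lik_H1 = alpha, power
print(round(power, 3), round(lik_H1 / lik_H0, 1))  # 0.804 16.1
```

So in this impoverished state of information, "rejection" supports H1 over H0 by a likelihood ratio of about 16 — which is also the kind of arithmetic used in Ioannidis-style calculations over collections of tests.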

I am travelling, with limited web access and no access to my books. However, from memory, Bahadur and Savage showed that if you want to test against every possible alternative hypothesis power will be no greater than the type I error rate.

My original post was garbled. I meant to say

” If you knew that under all circumstances in which the null hypothesis was false which alternative was true, you would already know more than the experiment was designed to find out.”

As I said in my post, I think that Fisher’s criticism is not entirely fair but it is still important. It is not at all obvious that the way things should work is

1. H0, 2. H1, 3. test statistic. That’s what may seem natural to a mathematician, but maths is not statistics. Asserting a conclusion that is impeccable if the assumptions are correct is not at all the same thing as suggesting a reasonable inference given some observed data.

It is a historical fact that the important tests were all derived without the N-P machinery.

My pretentious quote for this post is

“Par ma foi, il y a plus de quarante ans que je dis de la prose, sans que j’en susse rien.”

Good heavens! For more than forty years I have been speaking prose without knowing it.

Molière, Le Bourgeois Gentilhomme.

Good heavens the t-test is reasonable! It was reasonable all along but I did not know it until Neyman came along.

Stephen:

I found B most interesting

“B…Their method only leads to definite results when mathematical postulates are introduced, which could only be justifiably believed as a result of extensive experience….the introduction of hidden postulates only disguises the tentative nature of the process by which real knowledge is built up. (Bennett 1990) (p246)”

Thinking about it a bit, it might also help discern some of the issues of disagreement between Rubin and Pearl regarding causal inference?

p.s. I thought there was a typo but I was not completely sure 😉

Stephen: Moliere! I hated having to read that pompous thing in high school French class.

Re Bahadur and Savage: The 1956 result is that, for a sufficiently rich family of distributions, it is not possible to have tests (of nulls describing mean parameters) that uniformly control Type I error rate, while also providing nontrivial power, i.e. power greater than the Type I error rate.

The family of distributions involved is large, but not unreasonable in practice. For example, it allows the family of all distributions with finite variance.

The result shows one has to make some level of uncheckable assumptions in order to get anywhere, even for some extremely simple problems. (Note that the data’s not going to tell you, reliably, whether the population has finite variance.) And that those who judge others for making assumptions should perhaps first consider the beam in their own eye.
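A toy construction in the spirit of the Bahadur–Savage difficulty (not their actual proof, and all numbers are illustrative): a distribution with mean exactly zero whose rare, large negative component almost never shows up in a modest sample, so the one-sample t-test “rejects” a true null far more often than its nominal level.

```python
import math
import random

random.seed(42)

def draw():
    # Mean is exactly zero: 0.999 * 0.01 + 0.001 * (-9.99) = 0.
    # The balancing mass is rare and large, so most samples never see it.
    if random.random() < 0.999:
        return 0.01 + random.gauss(0, 0.001)  # tiny noise so variance is nonzero
    return -9.99

def t_stat(x):
    # One-sample t statistic for H0: mean = 0
    n, m = len(x), sum(x) / len(x)
    s2 = sum((v - m) ** 2 for v in x) / (n - 1)
    return m / math.sqrt(s2 / n)

n, nsim = 50, 5000
t_crit = 2.0096  # two-sided 5% critical value, t with 49 df
rejections = sum(
    abs(t_stat([draw() for _ in range(n)])) > t_crit for _ in range(nsim)
)
print(rejections / nsim)  # far above the nominal 0.05
```

With n = 50, a sample contains the rare value with probability of only about 1 − 0.999⁵⁰ ≈ 5%, so roughly 95% of samples look like tight evidence of a positive mean — the t-test’s actual size is near 0.95, not 0.05, exactly because finite-variance (or similar) assumptions were not granted.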

George:

Nice of you to provide this.

I am concerned about implications for practice that are driven by continuity or countable infinities; the things in this universe that will affect us for the next K years are finite.

So “whether the population has finite variance” concerns me, because I am not interested in populations with non-finite variance (which are mathematically fine), so I will have to look up Bahadur and Savage.

I raised this once with NormalDeviate and he was unable to dissuade me. Also an example here at https://xianblog.wordpress.com/2015/01/19/post-grading-weekend/comment-page-1/#comment-94588

Maybe not pretentious, but this is the best quote: “Asserting a conclusion that is impeccable if the assumptions are correct is not at all the same thing as suggesting a reasonable inference given some observed data.”

At the risk of making a fool of myself by ignoring 50+ responses one of which may have either made or ridiculed the following already, just a remark regarding the original posting.

Given a null hypothesis and a test statistic T, I’d say that the test statistic implicitly defines a class of alternatives, namely those distributions for which the probability to reject the null using T is larger than for the null itself. So one could be Neyman-Pearsonian and start from an alternative to derive a T, but also one could choose T first, which then implicitly defines the “direction” against which the null hypothesis is tested, or in other words the alternative. So for me the concept of an alternative is very important for understanding what a test does, even though it doesn’t have to be used for Neyman-Pearson optimality considerations.

I wonder whether Fisher would have agreed with me on this. He might object to my affection for the concept of an alternative, but it may be that he really had more of an issue with power optimisation, which I haven’t involved here, and not so much with the plain idea that any test tests against a certain specific “alternative direction”, and that it is useful to think about what that might be in order to choose an appropriate test (this, I’d think, is Stephen’s concern when thinking about odds ratios vs. risk differences: they point in slightly different alternative directions, one of which may be less appropriate than the other).

One very simple argument turned me away from Fisher’s approach.

If your observations are more probable when the true effect size is zero than they are when the true effect is what you have observed, then the evidence that there is a real effect is surely weak.

Since this can happen when p values are small, I’m persuaded that NHST are not a good idea.