Allan Birnbaum died 40 years ago today. He lived to be only 53 [i]. From the perspective of philosophy of statistics and philosophy of science, Birnbaum is best known for his work on likelihood, the Likelihood Principle [ii], and for his attempts to blend concepts of likelihood with error probability ideas to arrive at what he termed “concepts of statistical evidence”. Failing to find adequate concepts of statistical evidence, Birnbaum called for joining the work of “interested statisticians, scientific workers and philosophers and historians of science”–an idea I have heartily endorsed. While he is known for a result showing that the (strong) Likelihood Principle follows from sufficiency and conditionality principles (a result that Jimmy Savage deemed one of the greatest breakthroughs in statistics), a few years after publishing it he turned away from it, perhaps having discovered gaps in his argument. A post linking to a 2014 *Statistical Science* issue discussing Birnbaum’s result is here. Reference [5] links to the *Synthese* 1977 volume dedicated to his memory. The editors describe it as their way of “paying homage to Professor Birnbaum’s penetrating and stimulating work on the foundations of statistics”. Ample weekend reading!

NATURE VOL. 225 MARCH 14, 1970 (1033)

LETTERS TO THE EDITOR

Statistical Methods in Scientific Inference (posted earlier here)

It is regrettable that Edwards’s interesting article[1], supporting the likelihood and prior likelihood concepts, did not point out the specific criticisms of likelihood (and Bayesian) concepts that seem to dissuade most theoretical and applied statisticians from adopting them. As one whom Edwards particularly credits with having ‘analysed in depth…some attractive properties’ of the likelihood concept, I must point out that I am not now among the ‘modern exponents’ of the likelihood concept. Further, after suggesting that the notion of prior likelihood was plausible as an extension or analogue of the usual likelihood concept (ref. 2, p. 200)[2], I have pursued the matter through further consideration and rejection of both the likelihood concept and various proposed formalizations of prior information and opinion (including prior likelihood). I regret not having expressed my developing views in any formal publication between 1962 and late 1969 (just after ref. 1 appeared). My present views have now, however, been published in an expository but critical article (ref. 3, see also ref. 4)[3] [4], and so my comments here will be restricted to several specific points that Edwards raised.

If there has been ‘one rock in a shifting scene’ of general statistical thinking and practice in recent decades, it has not been the likelihood concept, as Edwards suggests, but rather the concept by which confidence limits and hypothesis tests are usually interpreted, which we may call the confidence concept of statistical evidence. This concept is not part of the Neyman-Pearson theory of tests and confidence region estimation, which denies any role to concepts of statistical evidence, as Neyman consistently insists. The confidence concept takes from the Neyman-Pearson approach techniques for systematically appraising and bounding the probabilities (under respective hypotheses) of seriously misleading interpretations of data. (The absence of a comparable property in the likelihood and Bayesian approaches is widely regarded as a decisive inadequacy.) The confidence concept also incorporates important but limited aspects of the likelihood concept: the sufficiency concept, expressed in the general refusal to use randomized tests and confidence limits when they are recommended by the Neyman-Pearson approach; and some applications of the conditionality concept. It is remarkable that this concept, an incompletely formalized synthesis of ingredients borrowed from mutually incompatible theoretical approaches, is evidently useful continuously in much critically informed statistical thinking and practice [emphasis mine]. While inferences of many sorts are evident everywhere in scientific work, the existence of precise, general and accurate schemas of scientific inference remains a problem. Mendelian examples like those of Edwards and my 1969 paper seem particularly appropriate as case-study material for clarifying issues and facilitating effective communication among interested statisticians, scientific workers and philosophers and historians of science.

Allan Birnbaum

New York University

Courant Institute of Mathematical Sciences,

251 Mercer Street,

New York, NY 10012

Birnbaum’s *confidence concept,* sometimes written (Conf), was his attempt to find in error statistical ideas a concept of statistical evidence–a term that he invented and popularized. In Birnbaum 1977 (24), he states it as follows:

(Conf): A concept of statistical evidence is not plausible unless it finds ‘strong evidence for J as against H’ with small probability (α) when H is true, and with much larger probability (1 – β) when J is true.
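To make the two error probabilities in (Conf) concrete, here is a minimal sketch with my own toy numbers (not Birnbaum’s): a one-sided test of H: μ = 0 against J: μ = 1 based on the mean of a normal sample with known σ. “Strong evidence for J as against H” is read as the sample mean exceeding a cutoff.

```python
import math

def norm_cdf(x):
    # Standard normal CDF via the error function
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

# Toy setup (illustrative numbers only): X ~ N(mu, 1), n = 25,
# test H: mu = 0 against J: mu = 1; declare "strong evidence for J"
# when the sample mean exceeds a cutoff c.
n, mu_H, mu_J, sigma = 25, 0.0, 1.0, 1.0
se = sigma / math.sqrt(n)          # standard error of the mean = 0.2
c = mu_H + 1.645 * se              # cutoff giving alpha ~ 0.05

alpha = 1 - norm_cdf((c - mu_H) / se)   # P(strong evidence for J | H true)
power = 1 - norm_cdf((c - mu_J) / se)   # P(strong evidence for J | J true) = 1 - beta

print(f"alpha = {alpha:.3f}, power = {power:.4f}")
```

The verdict “strong evidence for J” occurs with small probability (about 0.05) when H is true and with much larger probability (near 1) when J is true, which is exactly the pattern (Conf) demands of a concept of evidence.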

Birnbaum questioned whether Neyman-Pearson methods had “concepts of evidence” simply because Neyman talked of “inductive behavior” and Wald and others couched statistical methods in decision-theoretic terms. I have been urging that we consider instead how the tools may actually be used, and not be restricted by the statistical philosophies of founders (not to mention that so many of their statements are tied up with personality disputes, and problems of “anger management”). Recall, as well, E. Pearson’s insistence on an evidential construal of N-P methods, and the fact that Neyman, in practice, spoke of drawing inferences and reaching conclusions (e.g., Neyman’s nursery posts, links in [iii] below).

Still, since Birnbaum’s (Conf) appears to allude to pre-trial error probabilities, I regard (Conf) as still too “behavioristic”. *But I discovered that Pratt, in the link in [5] below, entertains the possibility of viewing Conf in terms of what might be called post-data or “attained” error probabilities.* Some of his papers hint at the possibility that he would have wanted to use Conf for a post-data assessment of how well (or poorly) various claims were tested. I developed the concept of severity and severe testing to provide an “evidential” or “inferential” notion, along with a statistical philosophy and a philosophy of science in which it is to be embedded.

I think that Fisher (1955) is essentially correct in maintaining that “When, therefore, Neyman denies the existence of inductive reasoning he is merely expressing a verbal preference”. It is a verbal preference one can also find in Popper’s view of corroboration. (He, and current day critical rationalists, also hold that probability arises to evaluate degrees of severity, well-testedness or corroboration, not inductive confirmation.) The inference to the severely corroborated claim is still inductive. It goes beyond the premises. It is qualified by the relevant severity assessments.

I have many of Birnbaum’s original drafts of papers and articles here (with carbon copies (!) and hand-written notes in the margins), thanks to the philosopher of science, Ronald Giere, who gave them to me years ago[iii].

***

[i] His untimely death was a suicide.

[ii] A considerable number of posts on the strong likelihood principle (SLP) may be found searching this blog (e.g., here and here). Links or references to the associated literature, perhaps all of it, may also be found here. A post linking to the 2014 Statistical Science issue on my criticism of Birnbaum’s “breakthrough” (to the SLP) is here.

[iii]See posts under “Neyman’s Nursery” (1, 2, 3, 4, 5)

**References**

[3] Birnbaum, A., in *Philosophy, Science and Method: Essays in Honor of Ernest Nagel* (edited by Morgenbesser, S., Suppes, P., and White, M.) (St. Martin’s Press, NY, 1969).

[4] Birnbaum, A., “Likelihood”, in *International Encyclopedia of the Social Sciences* (Crowell-Collier, NY, 1968).

[5] Full contents of the *Synthese* 1977 volume, dedicated to his memory, can be found in this post.

[6] Birnbaum, A. (1977). “The Neyman-Pearson theory as decision theory, and as inference theory; with a criticism of the Lindley-Savage argument for Bayesian theory”. *Synthese* 36(1): 19–49. See links in [5].

I instinctively dislike many aspects of the NP approach – I think, in general, probably due to the (over)use of optimality concepts.

On the other hand, Fraser’s writing has (re-)convinced me that there is something to the ‘Confidence’ concept (and hence p-values) after all, and that Fisher was essentially right, modulo a few details that are usually overemphasised.

The (draft?) Fraser paper I sent you recently “p-values: The insight to modern statistical inference” (http://www.utstat.toronto.edu/dfraser/documents/276-AnnRev-v2.pdf) states:

“In essence a p-value records just where a data value is located relative to a parameter value of interest, or where it is with respect to a hypothesis of interest, and does this in statistical units…

…Our approach here is to describe pragmatically what has happened and thus record just where the data value is with respect to the parameter value of interest, avoiding decision statements or procedural rules, and leaving evaluation to the judgment of the appropriate community of researchers”

I can agree with this wholeheartedly. All that remains is the crucial problem of nuisance parameters and tackling higher-dimensional problems. In this regard I agree with Birnbaum that the key contribution of NP was “techniques for systematically appraising and bounding the probabilities” in these scenarios.

Under simple one-dimensional, no-nuisance parameter (etc) problems Confidence = Fiducial = SEV (and approx = Likelihood). So I’d love to hear more about the philosophical side of Conf/SEV in the presence of nuisance parameters and NP contributions in this regard. This seems where the key contributions lie, where Bayes typically claims to provide the solution (i.e. integrate out using priors), as well as where most practical folk need guidance (every problem has nuisance parameters).
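For the simple case the comment describes (one-dimensional, known variance, no nuisance parameters), the coincidence of Confidence and SEV can be made concrete. The sketch below uses my own toy numbers and the standard severity calculation for a one-sided normal test; here it literally coincides with a one-sided lower confidence bound.

```python
import math

def norm_cdf(x):
    # Standard normal CDF via the error function
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

# Toy setup (assumed, for illustration): X ~ N(mu, 1), n = 100,
# observed sample mean xbar = 0.4; claim C: mu > mu1.
n, sigma, xbar, mu1 = 100, 1.0, 0.4, 0.2
se = sigma / math.sqrt(n)  # = 0.1

# Severity for C: the probability of a sample mean as small as (or
# smaller than) the one observed, were mu equal to mu1.
sev = norm_cdf((xbar - mu1) / se)

# The one-sided lower confidence bound at that same level is exactly
# mu1: the two calculations are one and the same in this simple case.
z = (xbar - mu1) / se
lower_bound = xbar - z * se

print(f"SEV(mu > {mu1}) = {sev:.4f}; the {sev:.1%} lower bound is {lower_bound}")
```

With nuisance parameters or nonlinear models the two calculations come apart, which is precisely where the question raised in the comment gets its bite.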

Omaclaran: Thanks for your comment. My feeling is that if we can get the issue down to how best to deal with nuisance parameters, we’ve made progress. I think that understanding the philosophical side–that is, in particular, the larger picture of scientific learning that goes with severity (that Popper may have had in mind but to which Peirce got closer)– provides a standpoint to probe the “nuisance” parameter issue, for different contexts/problems. Aris Spanos says we shouldn’t be treating them as nuisances and trying to have them disappear, but rather, we should model them. What does Fraser say on this, by the way? Background information plays a key role as to the best way to model them. Spanos has found it appropriate to estimate the parameters and do a severity analysis on each, but he can describe it better than I.

Modelling rather than eliminating nuisance parameters is a fair starting point but we then face potentially very high-dimensional problems (where do you draw the model boundaries and how do you justify this?) and can only make joint statements about all parameters. In many cases we also want to try to make statements about lower-dimensional parameters of interest and in many cases we can.

The details of how this works for different approaches makes a big difference – it seems like it’s what killed Fiducial (potential paradoxes for higher-dimensional problems), what makes people both attracted to and nervous about Bayes (e.g. easy elimination of nuisance parameters vs potential marginalisation paradoxes/curvature effects etc) and confused about Freq (ad-hoc choice of pivots etc).

This seems like a crucial philosophical issue to me – once you work ‘within’ a simple model most approaches are fairly straightforward and give reasonable answers. The biggest issue is usually getting a simple, reasonably justified model that makes systematic and consistent use of contextual information in the first place.

RE: Fraser. It would be best to ask him, obviously, but my rough interpretation – sample/data space probabilities are taken as basic. E.g. p-values. To connect data space to parameter space he usually invokes ‘structural’ assumptions such as continuity (and e.g. differentiability). Given this connection one can ‘transport’ probability statements (integrals) in data space to the parameter space. Bayes-like (marginalisation) and various weighted or profile etc likelihood approximation methods can then be used to make targeted probability statements about parameters of interest that retain data space validity and avoid marginalisation and other paradoxes.

So, not dissimilar to Bayes and Likelihood methods for dealing with nuisance parameters but focussed on retaining validity of probability statements in data space.

Here’s a link to the Fraser paper Maclaran was talking about:

“p-values: The insight to modern statistical inference D A S Fraser”

Best to ask him – but there does seem to be a fairly wide consensus that one wants to remove any dependence of operating characteristics on nuisance parameters so that one gets common p_value distributions, confidence interval coverage and error rates that are the same for all values of possible nuisance parameters.

How else is one to interpret p_values, confidence intervals and error rates if they mean different things for different values of unknown nuisance parameters?

Student got lucky; Neyman brought mathematical insight on how to do this for a couple of convenient problems, and most others have failed except in restricted domains. XL Meng characterized Fraser’s and related approaches as treating the prior as a nuisance parameter so that the prior had little impact. Similar to what motivated Neyman – he was striving to get good interval coverage no matter how bad the prior was and ended up not needing any prior at all.

Keith O’Rourke

Phanerono:

Thanks for your interesting comment:

“there does seem to be a fairly wide consensus that one wants to remove any dependence of operating characteristics on nuisance parameters so that one gets common p_value distributions, ….

How else is one to interpret p_values, confidence intervals and error rates if they mean different things for different values of unknown nuisance parameters?”

David Cox seems to argue that it might be better in some cases to consider allowing minimal dependence on nuisance parameters. (This was something that came up in Cox and Mayo 2010, p. 293, and the 2nd para is his take on the issue.)

“when certain requirements are satisfied rendering the statistic V a “complete” sufficient statistic for nuisance parameter lambda, there is no other way of achieving the Neyman–Pearson goal of an exactly alpha-level rejection region that is fixed regardless of nuisance parameters – exactly similar tests. These requirements are satisfied in many familiar classes of significance tests. In all such cases, exactly similar size alpha rejection regions are equivalent to regions where the conditional probability of Y being significant at level alpha is independent of v…where v is the value of the statistic V that is used to eliminate dependence on the nuisance parameter. Rejection regions where this condition holds are called regions of Neyman structure. …In the most familiar cases, therefore, conditioning on a sufficient statistic for a nuisance parameter may be regarded as an outgrowth of the aim of calculating the relevant p-value independent of unknowns, or, alternatively, as a by-product of seeking to obtain the most powerful similar tests.

However, requiring exactly similar rejection regions precludes tests that merely satisfy the weaker requirement of being able to calculate p approximately, with only minimal dependence on nuisance parameters; and yet these tests may be superior from the perspective of ensuring adequate sensitivity to departures, given the particular data and inference of relevance. This fact is especially relevant when optimal tests are absent. Some examples are considered in Section 10”. (Cox and Mayo 2010, p. 293)
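The conditioning route described in the quote can be illustrated with a textbook example of my own choosing (not one from Cox and Mayo): comparing two Poisson rates by conditioning on the total count, which is a complete sufficient statistic for the nuisance overall rate, so the conditional p-value is entirely free of that nuisance parameter.

```python
import math

def binom_pmf(k, n, p):
    # Binomial probability mass function
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

# Illustration (my example): X ~ Poisson(lam1), Y ~ Poisson(lam2),
# H0: lam1 = lam2. The total T = X + Y is sufficient for the nuisance
# overall rate, and conditionally on T = t, X ~ Binomial(t, 1/2)
# under H0. The conditional p-value therefore does not depend on the
# unknown common rate: a rejection region of "Neyman structure".
x, y = 14, 5
t = x + y

# One-sided conditional p-value: P(X >= x | T = t) under Binomial(t, 1/2)
p_value = sum(binom_pmf(k, t, 0.5) for k in range(x, t + 1))
print(f"conditional p-value = {p_value:.4f}")  # ~ 0.0318
```

Whatever the common Poisson rate actually is, this p-value calculation is exact, which is the "exactly similar tests" goal; Cox's point above is that insisting on this exactness can rule out tests that are only approximately free of the nuisance parameter yet more sensitive.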

By “one wants to remove any dependence ” I meant “would chose to” if possible.

I was going to comment on getting commonness approximately over nuisance parameter values – perhaps I should have – but that runs into your section 10 issues.

Keith O’Rourke

Phaneron0: Can you explain your last sentence about section 10 issues?

That’s the point of Fraser’s and related approaches (including, as Keith points out, Neyman’s original motivation) – approximations such that the nuisance parameter has minimal or bounded or averaged or whatever effect in a given reduction. The question is how to do this systematically and what effect to minimise. Conf folk want to minimise the effect on the data space (p values etc). Bayes folk don’t prioritise data space over parameter space. The answers tend to agree for one dimensional, probably linear problems. For higher dimensions the question of what effect you want to minimise is highlighted. Fraser argues data space is our primary intuition of ‘reproducibility’ and hence we want any method of nuisance parameter elimination to approximately preserve this. Bayes only preserves this in simple situations hence why he calls it quick and dirty confidence. Bayes folk (eg Christian Robert’s response) argue they aren’t trying to preserve the same thing he is so it doesn’t matter that they disagree.

I’d like to understand your comment:

“Conf folk want to minimise the effect on the data space (p values etc). Bayes folk don’t prioritise data space over parameter space.”

This is an interesting thought, but error stat or conf (as I understand it) isn’t prioritizing data space over parameter space (whatever that means), but prioritizing the ability to make inferences from observable data to parameters (and thereby to modelled aspects of the source of data) by finding or creating knowable connections between the two spaces. For Robert to say, as he does, that he doesn’t care about ensuring posteriors have Fraser’s conf property leaves one wondering what purpose his posteriors serve (in that he claims priors are not expressions of belief, but are assignments used to get posteriors).

As for agreement in simple situations, I know what Fraser means, but it’s interesting that the Bayes-freq criticisms occupying many people are about those simple cases. So, for example, there’s the allegation that p-values exaggerate the evidence. When one points out (as do Casella and Berger 1987, among many others) that using a one-sided test and/or a non-spiked prior (on the null) makes the “exaggeration” disappear, there seems to be an insistence on using the assignments that produce the disagreement. Of course we know their numbers mean different things as well. There, “prioritizing” the posterior has a different meaning (namely, it’s a standpoint from which to criticize an error prob).
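The one-sided agreement alluded to here can be checked directly in the simplest normal case. This is a sketch with my own toy numbers: under a flat (improper uniform) prior on μ, the posterior probability of the one-sided null equals the one-sided p-value exactly.

```python
import math

def norm_cdf(x):
    # Standard normal CDF via the error function
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

# Toy numbers (mine): sample mean xbar from N(mu, sigma^2/n),
# testing H0: mu <= 0 against H1: mu > 0, one-sided.
n, sigma, xbar = 25, 1.0, 0.4
se = sigma / math.sqrt(n)  # = 0.2

# One-sided p-value
p_value = 1 - norm_cdf(xbar / se)

# Posterior P(mu <= 0 | xbar) under a flat prior: posterior is N(xbar, se^2)
posterior_H0 = norm_cdf((0 - xbar) / se)

print(f"p-value = {p_value:.4f}, posterior P(mu <= 0 | x) = {posterior_H0:.4f}")
```

The two numbers coincide, so no "exaggeration" arises in this one-sided setup; the celebrated conflict reappears only with a spiked prior on a point null.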

I’ll try to explain what I mean, which may differ from your interpretation. It is based on an interpretation of p-values as simply recording where the observed data (or a computed statistic) is relative to a model instance (probability measure over sample/data space).

It is similar I believe to Fraser’s and Laurie Davies interpretations of p-values but may differ from yours.

First, here is a quote from Fraser that at least hints at the idea (discussing Bayes as an approximation to Conf in ‘Why does statistics have two theories’):

“Thus in this location model context, a sample space integration can routinely be replaced by a parameter space integration, a pure calculus formality. And thus in the location model context there is no frequency-Bayes contradiction, just the matter of choosing the prior that yields the translation property which in turn enables the integration change of variable and thus the transfer of the integration from sample space to parameter space…

…for the Bayes approach above, the arbitrariness can disappear if the need for data dependence is acknowledged and the locally based differential prior is used to examine sample space probability on the parameter space.”

Thus in my interpretation here the sample space/data space probability (e.g. p-value or some other data-space integral) is given primary importance.

Here is a quote from Laurie’s book (p. 19)

“The simple idea behind the whole of this book is to accept a model P if the data generated under the model look like the actual data”

and p. 22

“The values of the parameter are determined solely by the requirement that [the observed data] looks like typical data sets generated under the model”

Again, in my interpretation, the probabilities in data space – i.e. location of observed data relative to simulated data – are given primary importance and parameters evaluated solely on the basis that they generate simulated data that looks similar to the observed data.

The question of model reduction – i.e. nuisance parameter elimination – would in these approaches be based on replacing the full model with a reduced model with similar data space probabilities, that is to say the location of the observed data relative to simulated data is similar. If for example one could arbitrarily set the value of one parameter without giving a different fit then one could claim that at least some of the remaining variables are responsible for the observed fit.

The Bayesian approach is based on a joint probability model (measure) over both parameter space and data space. Thus determining any marginal probability – e.g. that obtained by integrating over a nuisance parameter – is a necessary consequence of probability calculus and requires no more assumptions.

Presumably the point is that what the Conf folk prioritise – which here is invariance of the data-space marginal – is not guaranteed to be invariant in the same way in the Bayesian approach, and depends on the prior. The Bayesian approach enforces different invariance requirements, e.g. that the parameter-only marginal is invariant (again, rather than the data-only marginal).

Most notions of ‘objectivity’ are typically inextricably related to some sort of invariance requirement (look at physics), the question is invariance of what. Under simple linear models invariance in one space gives invariance in the other so there is no issue.

For nonlinear problems or reductions of high-dimensional models (effectively the same as nonlinear problems) the issue comes to the fore and you potentially have to choose one (or the other or come up with an approach preserving both, which exist).

RE: the disagreements for simple problems.

Interestingly this is because Bayesian testing differs significantly (pun) from Bayesian estimation. Most proposals to improve Bayesian testing these days tend to bring it closer to Bayesian estimation and to Frequentist testing (for simple cases, as discussed). E.g. Evans’ work, Aitkin’s work etc. Even those who don’t care so much about bringing Bayesian testing and estimation together (e.g. Gelman) prefer Bayesian estimation to testing, which again agrees with the Frequentist results for simple problems.

So I’d say – ignore traditional Bayesian-Frequentist testing disagreements as most recent proposals are focused on fixing flaws in Bayesian testing or replacing it with Bayesian estimation. Again, these agree for simple problems so the next frontier of disagreements is higher-dimensional or nonlinear problems.

Assuming my other comment comes through (all seem to get stuck in moderation) it seems that your severity principle could be used as a criterion for nuisance parameter elimination, if you accept a p-value as simply a measure of fit (see my comment above re setting a nuisance parameter to an arbitrary value). Probably the main advantage to a p value as a fit measure over likelihood (even ignoring probability vs density issues) is that it can be used to consider multiple data features simultaneously as Laurie has demonstrated (which took me a while to understand). Most equivalent likelihood approaches require knowledge of/assumptions on the joint distribution of these features.

Om:

Firstly, your comments shouldn’t be stuck in any moderation, I have checked that setting several times and it’s open for anyone having had a comment approved. Sorry if you’ve had trouble.

Well, I’m only distantly getting what you’re after.

Looking “like typical data sets generated under the model” goes beyond fit as I would define it.

“Probably the main advantage to a p value as a fit measure over likelihood”.

But I don’t view a p-value as either a fit measure (unless that’s specially defined) or a likelihood, but rather an error probability. As I see it, a likelihood is a (mere) fit measure. Conf for Fraser also requires an error probability assurance; however, he’s more performance-oriented than I am. I give a counterfactual interpretation to error probabilities.

“But I don’t view a p-value as either a fit measure (unless that’s specially defined) or a likelihood, but rather an error probability”

Yes I suspected that. Following Laurie (and Fraser to some extent at least – re the quotes above), I think interpreting it as a fit – or adequacy of fit – measure is the best and simplest interpretation, however.

That is, an (observed) p-value is simply a measure of where the observed data lies relative to a (data space) probability model. That is its literal meaning, anyway.

Besides that, which is really a side point, my longer comment above with quotes from Fraser is probably more important. Which points are you specifically unclear on? Do you have an alternative interpretation of Fraser’s papers or are you also unclear on what he is doing/saying as well as on my interpretation? What do you think about my comment about Bayesian testing?

Om: Before moving on, I just want to say something about “interpreting it as a fit – or adequacy of fit – measure is the best and simplest interpretation”, but under that construal, as I understand it, the idea of a spurious p-value resulting, say from cherry picking, p-hacking and other biasing selection effects loses its meaning–does it not?

A small observed p-value (literally interpreted) means the given data is unusual under the model specified. It doesn’t say why.

It might be because your data is unusual or because your model is misspecified. A way to check if your data is unusual is to repeat the experiment and see if the data comes out similarly.

I.e., Fisher’s “A scientific fact should be regarded as experimentally established only if a properly designed experiment rarely fails to give this level of significance”.

I could interpret this as saying you can repeatedly bring about the same p-value i.e. same level of (here, mis-)fit.

omaclaren:

I think there are two things here.

One: From memory, Fisher once commented, on the necessary choice between making error properties common (invariant) over the unknown parameters versus common (invariant) over observed samples, that to do both was expecting too much. He would have prioritized the latter (to avoid recognizable subsets). I think it was in regard to this example https://en.wikipedia.org/wiki/Behrens%E2%80%93Fisher_problem

(Fraser, in response to a verbal question at a talk he gave, did say he had done both for Behrens-Fisher, but this was not later confirmed in any comment or publication to my knowledge.)

Two: The (pre-data) frequency evaluation of Bayesian machinery (what Rubin coined as Bayesianly relevant considerations) is not targeting uniform error properties over the parameters but rather focuses on these averaged over the parameters using the prior. Now the FDA (I believe appropriately) wants to be informed on these for at least some points in the null and some in the alternative. More generally, large variation over the parameters makes for a large dependence on the prior, which should be a concern (unless one is “sure” of the prior). Post-data frequency evaluation of Bayesian machinery draws some controversy that has yet to be fully sorted out.

Keith O’Rourke

Thanks Keith, very helpful comments.

“but under that construal, as I understand it, the idea of a spurious p-value resulting, say from cherry picking, p-hacking and other biasing selection effects loses its meaning–does it not?”

Laurie’s computations take into account the whole set of adequacy evaluation/tests that one runs, so they are adjusted for multiple testing (at least this was so in his 1995 paper; can’t check the book right now because I’m travelling).

Yes you’re correct in that he apportions probability across the different features, using what appears to be a Boole/Bonferroni inequality type argument. E.g. he might allocate 2.5% to one feature, 2.5% to another to guarantee an overall 5% p-value.

This is what I meant above with :

“Probably the main advantage to a p value as a fit measure over likelihood (even ignoring probability vs density issues) is that it can be used to consider multiple data features simultaneously as Laurie has demonstrated (which took me a while to understand)”

I had to ask him about this explicitly as it wasn’t super clear to me when reading his book, but probably because I wasn’t so familiar with such arguments. I would probably suggest he makes this point even clearer in future editions of the book for simple-minded folk like myself.

I still think that my other comment above re-hashing Fisher’s interpretation is important as it gets at the idea of whether it makes sense to have a ‘spurious’ as opposed to ‘actual’ p-value, regardless of how the ‘feature accounting’ is carried out. Laurie’s interpretation of p-values appears to be consistent with Fisher’s in this sense (again, independently of how multiple features are accounted for).

[Note also that my question to Laurie was in the context of what happens when using multiple correlated features, in which case his – fair enough – response is that this would simply mean you are ‘wasting probability’. So perhaps a general reduction principle is to first find a natural (e.g. orthogonalised in some sense) representation before carrying out the reduction. Idea comes up in likelihood-based method too.]

Christian: But what’s the rationale? There has to be epistemological (or performance) influences due to the selection effects that can be pointed to.

Mayo – that’s why I didn’t make Christian’s point directly in the first place, though I alluded to it. I (personally) interpret the rationale as similar to Fisher’s, as argued above.

I am again somewhat late in joining this discussion. I was deflected by having yet another of my papers being rejected by JRSS B maintaining their 100% record in this respect. I have had an exchange of emails with Oliver about fiducial inference but have not completely understood what is going on so I will concentrate on P-values and start with a simple situation. The statistician has a sample of size 100 and the N(0,1) i.i.d. model. It is pointless ask the undecidable question as to whether the data are i.i.d. N(0,1) but we can ask the decidable question as to whether they look like typical samples of size 100 of i.i.d. N(0,1). typical quantified by alpha say 0.95 so that 95% of all sample generated under the model are typical. Look like requires specifying certain features of samples under the model such that 95% of the samples exhibit these features. What these are will depend on the purpose of the analysis but let us take then to be the mean, the standard deviation, the largest absolute value (outliers) and the distance of the empirical distribution from that of the N(0,1) distribution (shape9. We have 0.05 probability to spend so one possibility is to spread it equally over the four properties. There may be good reasons however for spending more on the mean and the standard deviation. Simulations show that 97% of samples satisfy the requirement rather than 95% so we are wasting 2%. One easy way of doing the adjustment is to replace 95% by 93% and re-spread the probabilities. This nearly always works. For each feature we can calculate a P-value and the N(0,1) model is then accepted if all the P-values are at least 0.0125 (sticking to alpha=0.95). This can be done for all (mu,sigma) and the resulting region is a 0.95-approximation region for (mu,sigma). It is not a confidence region and this can lead to utter confusion as can be seen by some contributions to Andrew Gelman’s blog. So what does one report for any (mu,sigma)? 
Presumably not one P-value but four: each P-value measures how the data stand to the model in that particular respect. Is this necessary? One can take the case of outliers, or rather the lack of them for normal samples. The statistician can just look at the data, decide there are no outliers and then concentrate just on the other three. He/she may go even further, look at a histogram and decide the data are sufficiently normal and now concentrate on the mean and standard deviation. Finally the standard deviation may be ignored and the P-value derived directly from the standard t-distribution. Is this to be regarded as good statistical practice? Or should one have a best-practice protocol for analysing data sets using standard models such as the Gaussian? A previous topic was on QRP. The raw data were not available for reconstruction but sometimes the mean and standard deviation were given. Are the mean and standard deviation sufficient? They allow no conclusions about outliers. Perhaps something like Tukey’s fivenum, but which five numbers should be given? Any thoughts?
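[Editorial sketch.] The procedure described above can be mimicked in a few lines of Python. Everything here is an illustrative assumption rather than Davies’s actual code: the particular feature definitions, the simulation sizes and the seed are all choices made for the sketch. Cutoffs for each feature are calibrated by simulation at level alpha/4 = 0.0125, and the joint acceptance rate under the model is then estimated, showing the “wasted” probability due to dependence between the features.

```python
import numpy as np
from math import erf, sqrt

def phi(v):
    """Standard normal cdf."""
    return 0.5 * (1 + erf(v / sqrt(2)))

def features(x):
    """Four features of a sample, each transformed so that large = atypical:
    |mean|, |sd - 1| (standard deviation), largest |value| (outliers),
    and a KS-type distance from the N(0,1) cdf (shape)."""
    xs = np.sort(x)
    n = len(xs)
    ecdf = np.arange(1, n + 1) / n
    shape = np.max(np.abs(ecdf - np.array([phi(v) for v in xs])))
    return np.array([abs(x.mean()), abs(x.std(ddof=1) - 1.0),
                     np.abs(x).max(), shape])

rng = np.random.default_rng(0)
n, alpha = 100, 0.05
sims = np.array([features(rng.standard_normal(n)) for _ in range(3000)])
# spend alpha equally: per-feature cutoff at the 1 - alpha/4 = 0.9875 quantile
cuts = np.quantile(sims, 1 - alpha / 4, axis=0)

def typical(x):
    """Accept the N(0,1) model iff every feature P-value is at least 0.0125,
    i.e. iff every feature falls below its calibrated cutoff."""
    return bool(np.all(features(x) <= cuts))

# The features are dependent, so the joint acceptance rate under the model
# exceeds 1 - alpha: this is the "wasted" probability mentioned above.
rate = np.mean([typical(rng.standard_normal(n)) for _ in range(1000)])
print(round(rate, 3))
```

As in the comment, the printed acceptance rate comes out above 0.95, which is what motivates re-spreading the probabilities from a smaller starting level.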

Hi Laurie, you might be interested in this essay by Senn on Fisher’s view of significance tests – https://errorstatistics.com/2014/02/21/stephen-senn-fishers-alternative-to-the-alternative/

Elements of Fisher’s approach appear to be (ignoring Fiducial for now)

– Choice of statistics to base analysis on is primary and comes from external considerations not optimising within a model (cf power)

– Significance tests with small p values are to be interpreted as ‘either the model is inadequate or my data is/are unusual’ and are based on observed p-values. To me this means he used it as a measure of fit or adequacy.

– The ‘truth’ of an effect can only be established by actual replications of an experiment giving the same fit, not hypothetical replications (though this may be a useful tool for analysis of potential properties of procedures)

One reason for pointing this out is that you use p-values but do not give them an NP interpretation, which can lead to confusion, as you note. Pointing out the similarity to Fisher’s interpretation perhaps gives you a way to help indicate what you mean, as people are used to the idea that Fisher and NP had disagreements over interpretation.

RE what to report – as I have previously mentioned to you I like the idea of reporting p-value heat maps for each feature (plotted over parameter space). Like in the figure I sent you of the comb data. Or similar.

Laurie:

Sorry to hear your paper was rejected. Dealing with reviewers has got to be the hardest thing, even when they may serve constructively to a greatly improved work.

On your comment, I can’t tell whether you’re talking about testing the model assumptions more or less from scratch, as with misspecification testing. We had a series of posts, based on Spanos, e.g., https://errorstatistics.com/2012/02/23/misspecification-testing-part-2/

Mayo – at the risk of jumping in where I’m not being addressed – your comment

“On your comment, I can’t tell whether you’re talking about testing the model assumptions more or less from scratch, as with misspecification testing”

gets to the point!

Spanos’ approach as I understand it is essentially

– Use Fisherian ‘pure significance’ misspecification testing to test ‘externally’

– Use NP hypothesis testing to carry out estimation ‘within’ a model.

Here is a quote from Spanos’ 1999 book on econometrics:

“We call into question the conventional wisdom that the Fisher approach has been largely superseded by the Neyman–Pearson approach. In chapter 14, it is argued that the two approaches have very different objectives and are largely complementary.

Fisher’s approach is better suited for testing without and the Neyman–Pearson approach is more appropriate for testing within the boundaries of the postulated statistical model.

In chapter 15 we discuss the problem of misspecification testing and argue that the Fisher approach is the procedure of choice with the Neyman–Pearson approach requiring certain crucial modifications to be used for such a purpose.

When testing theoretical restrictions within a statistically adequate model, however, the Neyman–Pearson approach becomes the procedure of choice.”

One of the key features of Laurie’s approach is that he wants to get rid of the within/without distinction. Furthermore he wants to stay in what Spanos would call the ‘without’ mode.

So…this naturally leads to the conclusion that, as interpreted relative to Spanos’ above framework, Laurie discards the NP ‘within’ estimation entirely and operates in a ‘pure Fisherian’ mode. That is to say, in Laurie’s approach but in Spanos’ words, all testing is misspecification testing.

(obviously my quoting of Spanos should end before I start saying ‘One of the key features of Laurie’s approach…)

fixed.

Om: The truth is that no good test can test everything at once. In practice even “testing without” is required to delimit the alternative, be it vague or directional, be it non-parametric or parametric. Fisher said he assumed the researcher knew what effect he was interested in and so didn’t need to be explicit about it, but N-P argued that an adequate test procedure required identifying the type of departure sought in advance.

Recall my “Les Miserable Citation” theater post: https://errorstatistics.com/2016/04/16/jerzy-neyman-and-les-miserables-citations-statistical-theater-in-honor-of-his-birthday/.

“In a famous result, Neyman (1952) demonstrates that by dint of a post-data choice of hypothesis, a result that leads to rejection in one test yields the opposite conclusion in another, both adhering to a fixed significance level. [Fisher concedes this as well.]”

After the Fisher/N-P break-up in ’35, this distinction as to whether to choose a test statistic or whether to choose an alternative was blown way out of proportion, as if N-P didn’t want to recognize that the researcher knew full well what difference he was interested in, and as if their helping Fisher to ensure the sensitivity he wanted was tantamount to advocating a cookbook procedure where discretionary judgment wasn’t needed.

“The truth is that no good test can test everything at once.”

Agreed.

Do you agree or disagree with Spanos’ book and Senn’s post that Fisherian and NP testing serve distinct goals and have distinct interpretations?

I know you like to argue that the disagreement was overblown but nothing I’ve read of Fisher’s actual writing suggests that there weren’t legitimate and substantive disagreements.

Similarly, I find myself tending to like a lot of Fisher’s ideas and much less of NP’s. He himself mentioned something about differences in training in the natural sciences and differences in the temperaments between British and Americans.

It’s possible that this view of mine might be some form of cognitive dissonance but there seems a distinct ‘psychology’ (perhaps a bad choice given the issues of NHST in that field!) to those inclined towards Fisherian concepts and those inclined towards NP’s.

Wonder if we could design a study, do a NSHT and submit it to PNAS? 😉

Om: What’s an NSHT? Oh, you must mean NHST. The trouble is, let’s suppose, as is quite plausible, that there’s a genuine correlation between those who hear Fisher is “evidential” and N-P, or at least N, is merely for acceptance sampling for commercial profit in the U.S. or 5 year plans in Russia. Vote for Fisher. But the point is that that’s BS, so it wouldn’t matter if one had been drilled on the “Neyman says there’s no such thing as statistical inference, only ‘acting’ to reach a conclusion or decision” line. Neyman showed very clearly that Fisher’s talk of proceeding as if there’s no difference (in the face of a non-sig result) is a kind of act or decision about how to appraise a claim, as opposed to a probabilistic assignment or degree of belief in a claim.

In that same post I argued that, ironically, one can trace N’s behaviorism to N-P’s struggles to defend Fisher with his fiducialism + single null. In N-P practice they used error probs evidentially, but weren’t particularly focused on making their philosophical foundations clear.

So the irony in the possible reality of any such real correlation is that it would reflect the entirely wrong lesson about N-P, but would reflect instead a constantly heard meme about them.

Oh, and the business about “the temperaments between British and Americans” in their tendencies toward Brexit vs rugged individualism is hogwash, or should I say balderdash?

I have based all my interpretations on direct quotes and attempts at careful reading, as have the others I refer to – eg Spanos’ book (he may have changed his mind but he still distinguishes misspecification from estimation as far as I am aware), Senn’s post, Fraser’s papers, my reading that Laurie’s interpretation of pvalues seems closer to Fisher than NP and his own statement that he interprets them differently than NP.

My misspelled NHST joke and reference to psychology is completely beside the point. You clearly have an interpretation but defend it by simply asserting that your interpretation is correct.

Again – do you disagree with the quote from Spanos and the post by Senn for example that Fisherian testing should be distinguished in role and interpretation from NP?

My view is that I find Neyman’s ‘defense’ of Fisher to miss the point and prefer Fisherian testing. Taking that route I see a viable approach based on discarding what Spanos calls ‘within’ the model NP estimation and only using what Spanos calls ‘without’ the model Fisherian testing. In fact I see Laurie’s approach as essentially this – all testing is misspecification testing, again using Spanos’ terminology.

You may differ but I’m not sure why you can’t accept that there is room to disagree esp given all the on the record statements by many folk that there is in fact a difference.

Om: “You clearly have an interpretation but defend it by simply asserting that your interpretation is correct.”

So I give no reasons for it? Rather baffled by this.

So far as different kinds of tests go, it’s pretty obvious that there is a need for different kinds of tests. Cox gave a taxonomy a long time ago. The supposition that there’s only one kind of test needed in science is far too restrictive. We review some of the taxonomy in Mayo and Cox 2006.

There’s a different interpretation called for in the different cases. For example, you can’t infer evidence for the alternative in a misspecification test–something Spanos points out.

But all the tests follow the same reasoning. The fact that they do is what tells you what you’re entitled to infer and what you’re not. The reasoning is intended to be captured by considering what has and has not been severely tested in the different cases.

By the way, I don’t think Spanos’ views on these matters today are necessarily identical to what he wrote in his last text. I believe a new version is coming out.

One other thing: I think it’s a very bad idea to let personalities have so strong an influence on how we view methods today. It has been a serious stumbling block in statistics. This is something Fraser emphasizes as well. It’s the methods and their properties that matter, not what some long dead statistician said someplace (or yelled). We should take their work into account by looking at how they apply methods in practice. N uses “Fisherian” tests left and right in building his models, CIs where relevant, and posteriors where frequentist priors are available.

It didn’t even make it to a referee; an AE stopped it in its tracks. My 1995 paper had 12 referees’ reports, 11 negative. Richard Gill, his own man, published it even though he had two negative reports and the one and only positive report. Do you have to mention Brexit?

As Oliver also said, you cannot check everything, so you have to think about what you do want to check. Here Tukey on the same topic:

Thought and debate as to just which aspects are to be denied legitimacy will be both necessary and valuable.

I think the within/without distinction corresponds to my EDA and formal inference distinction. As he writes, I do indeed think that all testing is misspecification testing. One can write a lot about the within/without distinction, and indeed I have. And yet, once again: take my favourite copper in drinking water example. I use it because the situation is very simple. It helps that it is also of considerable practical importance; we all want clean drinking water. So the legal limit for copper in drinking water is say 2.2 mg per litre. To analyse the data we start off with the family of normal distributions. Without testing: outliers – no, ok; shape – yes, ok. Without finished, and now to within. What is H_0? Identify the quantity of copper in the water with the parameter mu of the normal distribution. So H_0: mu < 2.2, or whatever way round it is, I can never remember. Now N-P. What is the optimal test? Say the t-test, and everything seems fine. But then comes Tukey, who at one stage in his career was a chemist, and states bluntly that that sort of data is always skewed. Ok, let us try a skewed model, say the log-normal. Without testing: outliers – no, ok; shape – yes, ok. Without finished, and now to within. But what is H_0? No answer. In my opinion this failure in itself is fatal to a within/without distinction. For the record there are other arguments against it; I mention different topologies.

Oliver. Thanks for the Senn reference and your comments. The thought of reading Fisher fills me with dread. It is much easier for me to rely on you, a form of exploitation I suppose. Nevertheless I defend not reading Fisher. Hammersley is reputed to have said (I have no reference but suspect Geoffrey Grimmett) ‘Never read the literature unless absolutely necessary and not even then’. The point being that reading Fisher makes it more difficult to think for yourself. You run of course the risk that at the end of the day someone (Oliver Mclaren) points out that Fisher said it all before, or at least some of it.

Happy to be exploited 🙂 I agree re:reading the literature but also have picked up the bad habit of reading too much of it anyway (especially things I’m not supposed to). Perhaps it’s a temperament I picked up from being isolated down under – not as many people around to exploit so you have to figure out what the big city folk are doing by reading!

Laurie: It was I who said that one cannot test everything at once, not Oliver.

The distinction you now draw between EDA and formal testing is rather different from the distinctions (within and without) in formal testing–the ones Oliver was making between N-P and Fisher.

According to the quotes from Spanos’ book and reading Laurie’s book I am happy that the distinction stands.

Also Mayo if you are worried about me letting personalities affect my judgement (by using direct quotes!) then by my own judgement I am also happy that the distinction stands.

Om: you’re not getting my drift.

I should have said that the analogy between the two distinctions stands. But this is becoming unproductive again.

Laurie:

Was Hammersley speaking as a mathematician rather than statistician?

(I know his co-author Peter Clifford and I would not say that of him).

> reading makes it more difficult to think for yourself

I would agree reading first makes it more difficult to think for yourself, but believe thinking before and then reading is better.

Academic research should be a communal activity and your lack of reading likely hampers your ability to “plug in” into the statistical community…

(OK, Wittgenstein got away with it, but CS Peirce, who against his wishes was largely unconnected, did not, at least until 50+ years after his death.)

Deborah, apologies, you did indeed say it. I have had it said to me many times but I have never understood why. It seems so obvious

“For example, you can’t infer evidence for the alternative in a misspecification test–something Spanos points out”.

I don’t understand. Can you give me a simple example?

RE:

“For example, you can’t infer evidence for the alternative in a misspecification test–something Spanos points out”.

“I don’t understand.”

My interpretation of Laurie’s approach is that he is not trying to infer an alternative in a misspecification test. Hence why he doesn’t understand.

Mayo – Gelman also made a similar point to you re his own model checking procedures in the second comment to this post: https://errorstatistics.com/2013/09/03/gelmans-response-to-my-comment-on-jaynes/

Gelman:

"I’m not using model checks to “infer a model.” I’m using model checks to examine possible problems with a model that I’d already like to use."

I would again suggest that there is a consistent difference in interpretation here.

Another point of difference (pun, if you read on) occurs to me.

Laurie’s approach as I interpret does not directly consider composite parameter space hypotheses. Eg in a parametric model he would not consider theta =0 directly. The reason being he requires a model to be a single probability measure and each to be tested separately. This may contribute to the difference in understanding of model checking and whether ‘inferring an alternative’ is something he aims to do.

WordPress ate my order symbols.

This:

‘does not directly consider composite parameter space hypotheses. Eg in a parametric model he would not consider theta =0 directly’

Should read:

‘does not directly consider composite parameter space hypotheses. Eg in a parametric model he would not consider theta less than or equal to 0 vs theta greater than or equal to 0 directly.’

Om: Gelman needs to infer evidence of a problem with his model. That too is an inference. In fact, however, he tends to go even further, to inferring a specific type of problem that would explain any observed anomaly. The former inference may be warranted with severity, but the latter requires more.

Laurie: It’s very simple, although often hidden. Say your null is “independence holds” and your alternative specifies ONE type of violation of independence, maybe Markov of a type. Then a statistically significant result indicates a departure from the null, but not necessarily of this particular type. There are other ways to explain the violation of independence. The null and alternative are not exhaustive. To have an indication for the alternative, you’d have to be able to say it’s improbable to get such an observed difference (from the null) if the alternative is false. And it’s not improbable, because even if it’s false there are other types of independence violations (which of course could subsequently be probed).

A failure to reject indicates at most a lack of violation of the sort the test had reasonable power to have detected.
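[Editorial sketch.] The point that a rejection does not license inferring the one alternative the test was aimed at can be illustrated with a small simulation. This is a hypothetical example constructed for illustration, not Mayo’s or Spanos’s own: a permutation test for lag-1 autocorrelation, directed at a Markov-type departure from independence, fires on data whose dependence is actually a shared block effect, a quite different violation.

```python
import numpy as np

rng = np.random.default_rng(4)

def lag1_pvalue(x, reps=2000):
    """Permutation P-value for |lag-1 autocorrelation|: a test directed
    at a Markov-type departure from the independence null."""
    def r1(v):
        return np.corrcoef(v[:-1], v[1:])[0, 1]
    obs = abs(r1(x))
    hits = sum(abs(r1(rng.permutation(x))) >= obs for _ in range(reps))
    return (1 + hits) / (1 + reps)   # add-one permutation P-value

# Data violating independence in a *different* way: a shared block effect
# (10 blocks of 10 observations with a common random level), not a Markov chain.
x = rng.standard_normal(10).repeat(10) + 0.5 * rng.standard_normal(100)
p = lag1_pvalue(x)
print(p < 0.05)   # True: the Markov-directed test fires, yet the truth is not Markov
```

The significant result correctly indicates a departure from independence, but inferring the specific Markov alternative from it would be unwarranted, which is exactly the asymmetry described above.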

Deborah, and an example of a test which is not a misspecification test.

Laurie: I’m not sure of your question.

Deborah, you write

‘For example, you can’t infer evidence for the alternative in a misspecification test’

so I thought that if there are such things as misspecification tests there are probably tests that are not misspecification tests, otherwise all tests are misspecification tests. I just wanted an example.

Davies: Sure, testing within a model already “audited” for adequacy. Say Normal testing mu ≤ mu’ vs mu > mu’ using the sample mean M from 100 iid samples, known variance.
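[Editorial sketch.] The within-the-model test just described is the textbook one-sided Normal test with known variance. The numbers below (mu0 = 0, sigma = 1, an observed mean of 0.25) are illustrative assumptions, not part of the exchange:

```python
import math

def z_test_pvalue(sample_mean, mu0, sigma, n):
    """One-sided P-value for H0: mu <= mu0 vs H1: mu > mu0, sigma known."""
    z = (sample_mean - mu0) / (sigma / math.sqrt(n))
    # P(Z >= z) under the standard normal, via the complementary error function
    return 0.5 * math.erfc(z / math.sqrt(2))

# e.g. n = 100, sigma = 1, observed sample mean M = 0.25 against mu0 = 0
p = z_test_pvalue(0.25, 0.0, 1.0, 100)
print(round(p, 4))  # 0.0062
```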

You have now completed the analogy perfectly.


phaneron0, I have often used the ‘quote’ in conversations but this is the first time I have written it down. I could possibly have got it from Peter Clifford, maybe even David Stirzaker. I never met Hammersley and do not know the context, mathematics or statistics. The meaning I think is clear, it is not meant to be taken literally.

‘your lack of reading likely hampers your ability to “plug in” into the statistical community’

The bibliography of my book has 223 items. I have other theories about my inability to plug in but they are best discussed over a glass of wine, Newton and his baleful influence on British mathematics would be a topic I would bring up. The machine learning people were also unable to plug in and went elsewhere.

Deborah, that is what I thought you meant but I just wanted to make sure. Oliver has already replied but before agreeing with him I decided to await your response.

“a model already “audited” for adequacy”

What is a model? As Oliver pointed out, a model in statistics is very often a parametric family of distributions: the normal model, the Poisson model etc. For me a model is a single probability measure. At the time I started contributing to this blog Michael Lew pointed this out and suggested I use the standard terminology of statistics. So let us suppose the normal model – in your sense – has been successfully audited for adequacy. What does this mean? The only sense I can make of it is that there are some parameter values (mu,sigma) such that the model – in my sense – N(mu,sigma^2) is consistent with the data. I would find it very strange if a model – again in your sense – is successfully audited as adequate but there is no model – in my sense – which is adequate. Suppose therefore you agree that a model successfully audited for adequacy means that there are some individual parameter values whose associated models – in my sense – are consistent with the data. I now ask you, quite reasonably in my opinion, what these parameter values are, or at least to give one such value. I suspect your answer will be that you are unable to specify any such values. Take for simplicity the Poisson family and do a standard chi-squared goodness-of-fit test. This will be (best practice???) based on the mean of the data. Because of this it will be emphasized to students that they have to reduce the degrees of freedom by one. But this test specifies no one particular value of lambda, hence your inability to state the values of lambda which are consistent with the data. Actually it is not an inability to state them – you could if you wanted to – it is a refusal to state them.
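[Editorial sketch.] The chi-squared goodness-of-fit test described here, with lambda estimated by the sample mean and one degree of freedom subtracted for it, can be sketched as follows. The cell boundaries, sample size and seed are illustrative choices, and the code makes exactly the point in the comment: the test audits the Poisson *family* without singling out any particular lambda as consistent with the data.

```python
import numpy as np
from math import exp, factorial

def poisson_chi2(x, kmax=5):
    """Chi-squared GOF statistic for the Poisson family on the cells
    {0}, {1}, ..., {kmax}, {> kmax}. lambda is estimated by the sample
    mean, so the usual classroom prescription is df = (kmax + 2) - 1 - 1."""
    x = np.asarray(x)
    n, lam = len(x), x.mean()
    probs = [exp(-lam) * lam**k / factorial(k) for k in range(kmax + 1)]
    probs.append(1.0 - sum(probs))                      # pooled tail cell
    obs = [int(np.sum(x == k)) for k in range(kmax + 1)]
    obs.append(n - sum(obs))
    return sum((o - n * p)**2 / (n * p) for o, p in zip(obs, probs))

rng = np.random.default_rng(1)
stat = poisson_chi2(rng.poisson(2.0, size=500))
# Compare to the chi-squared 0.95 quantile with df = 5, about 11.07.
# Note that accepting the family says nothing about which lambda values fit.
print(round(stat, 2))
```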


“Normal testing mu ≤ mu’ vs mu > mu’”

This is my interpretation: mu ≤ mu’ means there is a mu ≤ mu’ and a sigma such that the N(mu,sigma^2) model is consistent with the data. Now I have already done this: my approximation region contains all the (mu,sigma) consistent with the data. So all I now do is check whether my approximation region contains a pair (mu,sigma) with mu ≤ mu’. All hypothesis testing is misspecification testing.
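[Editorial sketch.] This region-based answer to the one-sided question can be sketched directly, here with only two features (mean and variance) for brevity; the simulated “copper” data, the grid, and the legal limit of 2.2 are illustrative assumptions keyed to the earlier drinking-water example.

```python
import numpy as np
from math import erfc, sqrt

rng = np.random.default_rng(2)
x = 2.0 + 0.25 * rng.standard_normal(100)   # hypothetical copper readings
n = len(x)

def consistent(mu, sigma, alpha=0.05):
    """Is the single model N(mu, sigma^2) consistent with the data, judged
    on two features (mean and variance), each allotted alpha/2?"""
    z = (x - mu) / sigma
    # mean feature: sqrt(n) * zbar ~ N(0,1) under the model
    p_mean = erfc(abs(sqrt(n) * z.mean()) / sqrt(2))
    # variance feature: sum z_i^2 ~ chi2_n; normal approximation for the sketch
    t = (np.sum(z**2) - n) / sqrt(2 * n)
    p_var = erfc(abs(t) / sqrt(2))
    return min(p_mean, p_var) >= alpha / 2

# approximation region on a grid, then the one-sided question answered inside it
region = [(m, s) for m in np.linspace(1.8, 2.3, 101)
                 for s in np.linspace(0.15, 0.40, 51) if consistent(m, s)]
legal_limit = 2.2
print(any(m <= legal_limit for m, s in region))   # True: some consistent model has mu <= 2.2
```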

“one cannot test everything at once” and “laurie, it’s very simple ….”

Take the Kolmogorov-Smirnov test, which I use when checking the shape of the data against that of a Gaussian model. The test is consistent, in the end (n=infinity) it will pick up any deviation, but its power depends, in your language, on the alternative. See

Global power functions of goodness of fit tests, Arnold Janssen, Ann. Statist. 28(1) (2000), 239-253.

In particular the test has very low power for outliers. One huge outlier will never be picked up. I know this and I am interested in outliers. So what do I do? I include a sort of outlier test. The features I include are the mean, variance, outliers and shape – just four, not everything, and thought has been given to each. I have 0.05 probability to spend and for simplicity I spend 0.0125 on each. Suppose it turns out that some data sets differ from normality in an important aspect but in a direction where the first four tests have low power. I then include a further test which picks up this form of deviation. I now have five features and spend 0.01 on each. I don’t see a problem there. Nor do I see what else one can do.
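[Editorial sketch.] The low power of the KS test against a single outlier is easy to see by simulation. This is an illustrative construction (seed, sample size and the outlier value of 8 are all arbitrary choices): with one huge outlier planted in an otherwise N(0,1) sample, the KS feature rejects at roughly its nominal rate, while a max-|x| outlier feature rejects every time.

```python
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(3)
n = 100

def ks_stat(x):
    """Two-sided KS distance between the empirical cdf and the N(0,1) cdf."""
    xs = np.sort(x)
    cdf = np.array([0.5 * (1 + erf(v / sqrt(2))) for v in xs])
    up = np.arange(1, n + 1) / n
    return max(np.max(up - cdf), np.max(cdf - (up - 1 / n)))

# 0.95 cutoffs for the KS feature and the outlier feature, by simulation
ks_null, out_null = [], []
for _ in range(2000):
    s = rng.standard_normal(n)
    ks_null.append(ks_stat(s))
    out_null.append(np.abs(s).max())
ks_cut, out_cut = np.quantile(ks_null, 0.95), np.quantile(out_null, 0.95)

# contaminate with one huge outlier: KS hardly notices, max|x| always does
hits_ks = hits_out = 0
for _ in range(500):
    x = rng.standard_normal(n)
    x[0] = 8.0
    hits_ks += ks_stat(x) > ks_cut
    hits_out += np.abs(x).max() > out_cut
print(hits_ks / 500, hits_out / 500)
```

One outlier moves the empirical cdf by only 1/n, far below the KS cutoff, which is why a dedicated outlier feature has to be included.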

Laurie: Without taking up the particulars here, we seem to be coming at things from a different direction. (Perhaps mine is more Kantian, and certainly more pragmatic.) The fact is that we learn about the world, and the task is to figure out what enables this, how we can do it better, and why some ways are terrible. The land of formal statistics is a small part of the landscape of learning, but it helps to characterize what’s going on in entirely qualitative and informal contexts. That’s the interest of statistics for a philosopher of science/knowledge.

If there’s ever a procedure for generating iid data, we know about the distribution of the sample mean. We build a strong argument from coincidence to vouchsafe a method for generating data with given properties. Simulation methods check further. That lets us test when we have failed. All learning from error is based on essentially the same reasoning: there are exemplary types of errors; we learn to distinguish their effects, to detect them through amplification, and to make their effects ramify in other measurements based on different assumptions, etc. This is all in EGEK. The statistical point from Neyman is on p. 166 of EGEK. I have no interest in radical skepticism (which is both boring and self-refuting).

I wasn’t going to respond to this comment coz I figured we’d reached the end of productive discussion (funnily enough I found it quite helpful to further clarify my understanding of Laurie’s approach) but then I saw you tweeted a link to this comment as if it was a reasonable response.

So to any who come here via that link I would suggest you read Laurie’s book and Mayo’s EGEK and decide whose is ‘more pragmatic’.

(Arg typos, phone again)

Om: But I take it you had no problem with my being more Kantian? Pragmatism as a philosophical view (in this case, about inductive, error-prone inference), like Kantianism in the same sentence, has a different meaning in philosophy than in ordinary language. I take your point to be that Laurie’s book–in hard core statistics–must surely be more pragmatic than Mayo’s book in philosophy of science/statistics. That’s a given. Few people consider philosophy of science relevant to scientific practice. When it comes to questions about understanding and justifying science and inductive learning–my peculiar interest–the relevance of philosophy can’t automatically be dismissed. Assuming then we are in the land of philosophy of statistics/science (the domain of this blog), my position appears more Kantian than Laurie’s–based on his comment on this blog–and in just the way I indicated.

Deborah, I did ask you some very specific questions in my last posting. I will try again. Your philosophy requires one to audit the model:

Sure, testing within a model already “audited” for adequacy. Say Normal testing mu ≤ mu’ vs mu > mu’ using the sample mean M from 100 iid samples, known variance.

How do you audit the model? When a model say the normal model has been “audited” for adequacy, does this mean that there are some parameter values (mu,sigma) such that the model N(mu,sigma^2) is adequate? Or is your auditing such that the normal model may be successfully audited without there being any parameter values (mu,sigma) for which that particular model N(mu,sigma^2) is adequate?

Laurie: I was including under “auditing” tests of the iid assumptions and assessment of selection effects, cherry-picking, etc.–sources of distortion of the error prob properties of the proposed method.

Your last query is equivocal.

Deborah, and no doubt somewhere along the way you ask yourself whether the normal distribution is adequate. How do you do that?