# Monthly Archives: February 2013

## Statistically speaking…


Statistically speaking, we don’t use calculus By Dave Gammon

A local op-ed piece today (Roanoke Times) claims:

“Quantitative skills are highly sought after by employers, and the best time to learn these skills is in high school and early college. And we all know the best math students should eventually learn calculus.

Or should they? Maybe it’s statistics, not calculus, that is a more worthy pursuit for the vast majority of students.”

This reminds me of the trouble I got into when, as a graduate student at the University of Pennsylvania, I supplemented my fellowship in philosophy by leading some recitation classes in statistics at the Wharton School. It was vaguely suggested that I not assign homework problems that required calculus. But many of the exercises in the sections of the text (on business statistics) that I was to cover required, and were illuminated by, calculus, and the text was written by a Wharton statistics professor (de Cani). So I went ahead and assigned some of them, and was promptly reported by the students[i]. The author of this article appears to have no clue that statistical methods depend on calculus and the “area under a curve”.

Categories: Statistics, Uncategorized | 18 Comments

## Stephen Senn: Also Smith and Jones

Also Smith and Jones[1]
by Stephen Senn

Head of Competence Center for Methodology and Statistics (CCMS)

This story is based on a paradox proposed to me by Don Berry. I have my own opinion on this but I find that opinion boring and predictable. The opinion of others is much more interesting and so I am putting this up for others to interpret.

Two scientists working for a pharmaceutical company collaborate in designing and running a clinical trial known as CONFUSE (Clinical Outcomes in Neuropathic Fibromyalgia in US Elderly). One of them, Smith, is going to start another programme of drug development in a little while. The other one, Jones, will just be working on the current project. The planned sample size is 6000 patients.

Smith says that he would like to look at the experiment after 3000 patients in order to make an important decision as regards his other project. As far as he is concerned that’s good enough.

Jones is horrified. She considers that for other reasons CONFUSE should continue to recruit all 6000 and that on no account should the trial be stopped early.

Smith says that he is simply going to look at the data to decide whether to initiate a trial in a similar product being studied in the other project he will be working on. The fact that he looks should not affect Jones’s analysis.

Jones is still very unhappy and points out that the integrity of her trial is being compromised.

Smith suggests that all she needs to do is state quite clearly in the protocol that the trial will proceed whatever the result of the interim administrative look. The fact that she states publicly that on no account will she claim significance based on the first 3000 alone will reassure everybody, including the FDA. (In drug development circles, FDA stands for Finally Decisive Argument.)

However, Jones insists. She wants to know what Smith will do if the result after 3000 patients is not significant.

Smith replies that in that case he will not initiate the trial in the parallel project. It will suggest to him that it is not worth going ahead.

Jones wants to know: supposing the results for the first 3000 are not significant, what will Smith do once the results of all 6000 are in?

Smith replies that, of course, in that case he will have a look. If the results based on all 6000 are significant, even though the results based on the first 3000 were not (which seems to him an unlikely situation), he may well decide that the treatment works after all and initiate his alternative program, regretting, of course, the time that has been lost.

Jones points out that Smith will not be controlling his type I error rate by this procedure.

‘OK’, says Smith, ‘to satisfy you I will use adjusted type I error rates. You, of course, don’t have to.’

The trial is run. Smith looks after 3000 patients and concludes the difference is not significant. The trial continues on its planned course. Jones looks after 6000 and concludes it is significant, P=0.049. Smith looks after 6000 and concludes it is not significant, P=0.052. (A very similar thing happened in the famous TORCH study (1).)

Shortly after the conclusion of the trial, Smith and Jones are head-hunted and leave the company.  The brief is taken over by new recruit Evans.

What does Evans have on her hands: a significant study or not?
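Jones’s point about Smith’s type I error rate can be illustrated with a small simulation. This is only a sketch under stated assumptions, not an analysis of CONFUSE or TORCH: it assumes a two-sided test at the 0.05 level, a standard normal test statistic, no true treatment effect, and two equally sized halves of the data. The function name `type1_error_rates` and its parameters are hypothetical.

```python
import math
import random


def type1_error_rates(n_sims=100_000, z_crit=1.96, seed=0):
    """Simulate a trial with no true effect and one interim look at the halfway point.

    Returns (single_look_rate, two_look_rate):
      single_look_rate: how often the final analysis alone is "significant"
                        (Jones's procedure);
      two_look_rate:    how often the interim look or the final look is
                        "significant" when neither threshold is adjusted
                        (Smith's procedure before he agrees to adjust).
    """
    rng = random.Random(seed)
    single = 0
    either = 0
    for _ in range(n_sims):
        z_first = rng.gauss(0.0, 1.0)   # standardized statistic, first half
        z_second = rng.gauss(0.0, 1.0)  # independent statistic, second half
        # With equal halves, the statistic on all the data is the scaled sum.
        z_final = (z_first + z_second) / math.sqrt(2.0)
        if abs(z_final) > z_crit:
            single += 1
        if abs(z_first) > z_crit or abs(z_final) > z_crit:
            either += 1
    return single / n_sims, either / n_sims


if __name__ == "__main__":
    single_look, two_looks = type1_error_rates()
    print(f"final look only:         {single_look:.3f}")
    print(f"interim or final, naive: {two_looks:.3f}")
```

Under these assumptions the single-look rate stays near the nominal 0.05, while the unadjusted two-look rate comes out noticeably higher (about 0.08). That is why Smith, who gives himself two chances to declare significance, needs a stricter threshold at each look than Jones, and why the same data can yield P=0.049 for her and P=0.052 for him.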

Reference

1.  Calverley PM, Anderson JA, Celli B, Ferguson GT, Jenkins C, Jones PW, et al. Salmeterol and fluticasone propionate and survival in chronic obstructive pulmonary disease. The New England Journal of Medicine. 2007;356(8):775-89.

[1] Not to be confused with either Alias Smith and Jones or even Alas Smith and Jones.

Categories: Philosophy of Statistics, Statistics | 14 Comments

## Fisher: ‘Two New Properties of Mathematical Likelihood’

17 February 1890–29 July 1962

I find this to be an intriguing discussion, from before some of the conflicts with Neyman and Pearson erupted. Fisher links his tests and sufficiency to the Neyman and Pearson lemma in terms of power. It’s as if we may see them as ending up in a similar place while starting from different origins. I quote just the most relevant portions; the full article is linked below.

by R.A. Fisher, F.R.S.

Proceedings of the Royal Society, Series A, 144: 285-307 (1934)

The property that where a sufficient statistic exists, the likelihood, apart from a factor independent of the parameter to be estimated, is a function only of the parameter and the sufficient statistic, explains the principal result obtained by Neyman and Pearson in discussing the efficacy of tests of significance. Neyman and Pearson introduce the notion that any chosen test of a hypothesis H0 is more powerful than any other equivalent test, with regard to an alternative hypothesis H1, when it rejects H0 in a set of samples having an assigned aggregate frequency ε when H0 is true, and the greatest possible aggregate frequency when H1 is true.

If any group of samples can be found within the region of rejection whose probability of occurrence on the hypothesis H1 is less than that of any other group of samples outside the region, but is not less on the hypothesis H0, then the test can evidently be made more powerful by substituting the one group for the other.

Consequently, for the most powerful test possible the ratio of the probabilities of occurrence on the hypothesis H0 to that on the hypothesis H1 is less in all samples in the region of rejection than in any sample outside it. For samples involving continuous variation the region of rejection will be bounded by contours for which this ratio is constant. The regions of rejection will then be required in which the likelihood of H0 bears to the likelihood of H1, a ratio less than some fixed value defining the contour. (295)…

It is evident, at once, that such a system is only possible when the class of hypotheses considered involves only a single parameter θ, or, what comes to the same thing, when all the parameters entering into the specification of the population are definite functions of one of their number. In this case, the regions defined by the uniformly most powerful test of significance are those defined by the estimate of maximum likelihood, T. For the test to be uniformly most powerful, moreover, these regions must be independent of θ, showing that the statistic must be of the special type distinguished as sufficient. Such sufficient statistics have been shown to contain all the information which the sample provides relevant to the value of the appropriate parameter θ. It is inevitable therefore that if such a statistic exists it should uniquely define the contours best suited to discriminate among hypotheses differing only in respect of this parameter; and it is surprising that Neyman and Pearson should lay it down as a preliminary consideration that ‘the testing of statistical hypotheses cannot be treated as a problem in estimation.’ When tests are considered only in relation to sets of hypotheses specified by one or more variable parameters, the efficacy of the tests can be treated directly as the problem of estimation of these parameters. Regard for what has been established in that theory, apart from the light it throws on the results already obtained by their own interesting line of approach, should also aid in treating the difficulties inherent in cases in which no sufficient statistic exists. (296)
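In modern notation, the argument in the passages quoted above can be sketched as follows. The symbols f_0, f_1, k, g and h are introduced here for illustration and are not Fisher’s own notation; this is one reading of the passage, not a transcription of it.

```latex
% Most powerful test of H_0 against H_1 with assigned frequency \varepsilon:
% reject on the samples where the likelihood ratio falls below a constant k.
R = \left\{ x : \frac{f_0(x)}{f_1(x)} < k \right\},
\qquad \Pr(x \in R \mid H_0) = \varepsilon .

% If a sufficient statistic T exists, the likelihood factorizes as
f(x;\theta) = g\bigl(T(x),\theta\bigr)\, h(x),

% so for H_0 : \theta = \theta_0 against H_1 : \theta = \theta_1 the ratio
\frac{f(x;\theta_0)}{f(x;\theta_1)}
  = \frac{g\bigl(T(x),\theta_0\bigr)}{g\bigl(T(x),\theta_1\bigr)}
% depends on the sample only through T(x), and the rejection region
% can be written in terms of T alone.
```

The factor h(x) is the “factor independent of the parameter” in Fisher’s first sentence; it cancels in the ratio, which is why the best rejection region can be described through the sufficient statistic.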

## R. A. Fisher: how an outsider revolutionized statistics

by Aris Spanos

Few statisticians will dispute that R. A. Fisher (February 17, 1890 – July 29, 1962) is the father of modern statistics; see Savage (1976), Rao (1992). Inspired by William Gosset’s (1908) paper on the Student’s t finite sampling distribution, he recast statistics into the modern model-based induction in a series of papers in the early 1920s. He put forward a theory of optimal estimation based on the method of maximum likelihood that has changed only marginally over the last century. His significance testing, spearheaded by the p-value, provided the basis for the Neyman-Pearson theory of optimal testing in the early 1930s. According to Hald (1998)

“Fisher was a genius who almost single-handedly created the foundations for modern statistical science, without detailed study of his predecessors. When young he was ignorant not only of the Continental contributions but even of contemporary publications in English.” (p. 738)

What is not so well known is that Fisher was the ultimate outsider when he brought about this change of paradigms in statistical science. As an undergraduate, he studied mathematics at Cambridge, and then did graduate work in statistical mechanics and quantum theory. His meager knowledge of statistics came from his study of astronomy; see Box (1978). That, however, did not stop him from publishing his first paper in statistics in 1912 (while still an undergraduate) on “curve fitting”, questioning Karl Pearson’s method of moments and proposing a new method that was eventually to become the likelihood method in his 1921 paper.

After graduating from Cambridge he drifted into a series of jobs, including subsistence farming and teaching high school mathematics and physics, until his temporary appointment as a statistician at Rothamsted Experimental Station in 1919. During the period 1912-1919 his interest in statistics was driven by his passion for eugenics and a realization that his mathematical knowledge of n-dimensional geometry could be put to good use in deriving finite sample distributions for estimators and tests in the spirit of Gosset’s (1908) paper. Encouraged by his early correspondence with Gosset, he derived the finite sampling distribution of the sample correlation coefficient, which he published in 1915 in Biometrika, the only statistics journal at the time, edited by Karl Pearson. To put this result in a proper context, Pearson had been working on this problem for two decades and had published more than a dozen papers with several assistants on approximating the first two moments of the sample correlation coefficient; Fisher derived the relevant distribution, not just the first two moments.

Due to its importance, the 1915 paper provided Fisher’s first skirmish with the ‘statistical establishment’. Karl Pearson would not accept being overrun by a ‘newcomer’ lightly. So he prepared a critical paper with four of his assistants that became known as “the cooperative study”, questioning Fisher’s result as stemming from a misuse of Bayes’ theorem. He proceeded to publish it in Biometrika in 1917 without bothering to let Fisher know before publication. Fisher was furious at K. Pearson’s move and prepared his answer in a highly polemical style, which Pearson promptly refused to publish in his journal. Eventually, after tempering the style, Fisher was able to publish his answer in Metron, a brand new statistics journal. As a result of this skirmish, Fisher pledged never to send another paper to Biometrika, and declared war against K. Pearson’s perspective on statistics. Fisher questioned not only his method of moments, as giving rise to inefficient estimators, but also his derivation of the degrees of freedom of his chi-square test. Several highly critical published papers ensued.[i]

Between 1922 and 1930 Fisher did most of his influential work in recasting statistics, including publishing a highly successful textbook in 1925, but the ‘statistical establishment’ kept him ‘in his place’; a statistician at an experimental station. All his attempts to find an academic position, including a position in Social Biology at the London School of Economics (LSE), were unsuccessful (see Box, 1978, p. 202). Being turned down for the LSE position was not unrelated to the fact that the professor of statistics at the LSE was Arthur Bowley (1869-1957); second only to Pearson in statistical high priesthood.[ii]

Coming of age as a statistician during the 1920s in England meant being awarded the Guy medal in gold, silver or bronze, or at least receiving an invitation to present your work to the Royal Statistical Society (RSS). Despite his fundamental contributions to the field, Fisher’s invitation to the RSS would not come until 1934. To put that in perspective, Jerzy Neyman, his junior by some distance, was invited six months earlier! Indeed, one can make a strong case that the statistical establishment kept Fisher away for as long as they could get away with it. However, by 1933 they must have felt that they had to invite Fisher after he accepted a professorship at University College, London. The position was created after Karl Pearson retired and the College decided to split his chair into a statistics position that went to Egon Pearson (Pearson’s son) and a Galton professorship in Eugenics that was offered to Fisher. To make matters worse, Fisher’s offer came with a humiliating clause that he was forbidden to teach statistics at University College (see Box, 1978, p. 258); the father of modern statistics was explicitly told to keep his views on statistics to himself!

Fisher’s presentation to the Royal Statistical Society, on December 18th, 1934, entitled “The Logic of Inductive Inference”, was an attempt to summarize and explain his published work on recasting the problem of statistical induction since his classic 1922 paper. Bowley was (self?) appointed to move the traditional vote of thanks and open the discussion. After some begrudging thanks for Fisher’s ‘contributions to statistics in general’, he went on to disparage his new approach to statistical inference based on the likelihood function by describing it as abstruse, arbitrary and misleading. His comments were predominantly sarcastic and discourteous, and went as far as to accuse Fisher of plagiarism, by not acknowledging Edgeworth’s priority on the likelihood function idea (see Fisher, 1935, pp. 55-7). The litany of churlish comments continued with the rest of the old guard: Isserlis, Irwin and the philosopher Wolf (1935, pp. 57-64), who was brought in by Bowley to undermine Fisher’s philosophical discussion on induction. Jeffreys complained about Fisher’s criticisms of the Bayesian approach (1935, pp. 70-2).

To Fisher’s support came … Egon Pearson, Neyman and Bartlett. E. Pearson argued that:

“When these ideas [on statistical induction] were fully understood … it would be realized that statistical science owed a very great deal to the stimulus Professor Fisher had provided in many directions.” (Fisher, 1935, pp. 64-5)

Neyman too came to Fisher’s support, praising Fisher’s path-breaking contributions, and explaining Bowley’s reaction to Fisher’s critical review of the traditional view of statistics as an understandable attachment to old ideas (1935, p. 73).

Fisher, in his reply to Bowley and the old guard, was equally contemptuous:

“The acerbity, to use no stronger term, with which the customary vote of thanks has been moved and seconded … does not, I confess, surprise me. From the fact that thirteen years have elapsed between the publication, by the Royal Society, of my first rough outline of the developments, which are the subject of to-day’s discussion, and the occurrence of that discussion itself, it is a fair inference that some at least of the Society’s authorities on matters theoretical viewed these developments with disfavour, and admitted with reluctance. … However true it may be that Professor Bowley is left very much where he was, the quotations show at least that Dr. Neyman and myself have not been left in his company. … For the rest, I find that Professor Bowley is offended with me for “introducing misleading ideas”. He does not, however, find it necessary to demonstrate that any such idea is, in fact, misleading. It must be inferred that my real crime, in the eyes of his academic eminence, must be that of “introducing ideas”. (Fisher, 1935, pp. 76-82)[iii]

In summary, the pioneering work of Fisher, later supplemented by Egon Pearson and Neyman, was largely ignored by the Royal Statistical Society (RSS) establishment until the early 1930s. By 1933 it was difficult to ignore their contributions, published primarily in other journals, and the ‘establishment’ of the RSS decided to display its tolerance of their work by creating ‘the Industrial and Agricultural Research Section’, under the auspices of which the papers by Neyman and Fisher were presented in 1934 and 1935, respectively. [iv]

In 1943, Fisher was offered the Balfour Chair of Genetics at the University of Cambridge. Recognition from the RSS came in 1946 with the Guy medal in gold, and he became its president in 1952-1954, just after he was knighted! Sir Ronald Fisher retired from Cambridge in 1957. The father of modern statistics never held an academic position in statistics!

References

Bowley, A. L. (1902, 1920, 1926, 1937) Elements of Statistics, 2nd, 4th, 5th and 6th editions, Staples Press, London.

Box, J. F. (1978) The Life of a Scientist: R. A. Fisher, Wiley, NY.

Fisher, R. A. (1912), “On an Absolute Criterion for Fitting Frequency Curves,” Messenger of Mathematics, 41, 155-160.

Fisher, R. A. (1915) “Frequency distribution of the values of the correlation coefficient in samples from an indefinitely large population,” Biometrika, 10, 507-21.

Fisher, R. A. (1921) “On the ‘probable error’ of a coefficient deduced from a small sample,” Metron 1, 2-32.

Fisher, R. A. (1922) “On the mathematical foundations of theoretical statistics,” Philosophical Transactions of the Royal Society, A 222, 309-68.

Fisher, R. A. (1922a) “On the interpretation of χ² from contingency tables, and the calculation of P,” Journal of the Royal Statistical Society, 85, 87–94.

Fisher, R. A. (1922b) “The goodness of fit of regression formulae and the distribution of regression coefficients,”  Journal of the Royal Statistical Society, 85, 597–612.

Fisher, R. A. (1924) “The conditions under which χ² measures the discrepancy between observation and hypothesis,” Journal of the Royal Statistical Society, 87, 442-450.

Fisher, R. A. (1925) Statistical Methods for Research Workers, Oliver & Boyd, Edinburgh.

Fisher, R. A. (1935) “The logic of inductive inference,” Journal of the Royal Statistical Society 98, 39-54, discussion 55-82.

Fisher, R. A. (1937), “Professor Karl Pearson and the Method of Moments,” Annals of Eugenics, 7, 303-318.

Gosset, W. S. (1908) “The probable error of a mean,” Biometrika, 6, 1-25.

Hald, A. (1998) A History of Mathematical Statistics from 1750 to 1930, Wiley, NY.

Hotelling, H. (1930) “British statistics and statisticians today,” Journal of the American Statistical Association, 25, 186-90.

Neyman, J. (1934) “On the two different aspects of the representative method: the method of stratified sampling and the method of purposive selection,” Journal of the Royal Statistical Society, 97, 558-625.

Rao, C. R. (1992), “R. A. Fisher: The Founder of Modern Statistics,” Statistical Science, 7, 34-48.

RSS (Royal Statistical Society) (1934) Annals of the Royal Statistical Society 1834-1934, The Royal Statistical Society, London.

Savage, L. J. (1976) “On re-reading R. A. Fisher,” Annals of Statistics, 4, 441-500.

Spanos, A. (2008), “Statistics and Economics,” pp. 1057-1097 in The New Palgrave Dictionary of Economics, Second Edition. Eds. Steven N. Durlauf and Lawrence E. Blume, Palgrave Macmillan.

Tippett, L. H. C. (1931) The Methods of Statistics, Williams & Norgate, London.

[i] Fisher (1937), published a year after Pearson’s death, is particularly acerbic. In Fisher’s mind, Karl Pearson went after a young Indian statistician – totally unfairly – just the way he went after him in 1917.

[ii] Bowley received the Guy Medal in silver from the Royal Statistical Society (RSS) as early as 1895, and became a member of the Council of the RSS in 1898. He was awarded the society’s highest honor, the Guy Medal in gold, in 1935.

[iii] It is important to note that Bowley revised his textbook in statistics for the last time in 1937, and predictably, he missed the whole change of paradigms brought about by Fisher, Neyman and Pearson.

[iv] In their centennial volume published in 1934, the RSS acknowledged the development of ‘mathematical statistics’, referring to Galton, Edgeworth, Karl Pearson, Yule and Bowley as the main pioneers, and listed the most important contributions in this sub-field which appeared in its Journal during the period 1909-33, but the three important papers by Fisher (1922a-b; 1924) are conspicuously absent from that list. The list itself is dominated by contributions in vital, commercial, financial and labour statistics (see RSS, 1934, pp. 208-23). There is a single reference to Egon Pearson.

Categories: phil/history of stat, Statistics | 9 Comments

## Fisher and Neyman after anger management?

Would you agree if your (senior) colleague urged you to use his/her book rather than your own, even if you thought doing so would change for the positive the entire history of your field? My guess is that the answer is no. For that matter, would you ever try to insist that your (junior) colleague use your book in teaching a course rather than his/her own notes or book? Again I guess no. But perhaps you’d be more tactful than were Fisher and Neyman. It wasn’t just Fisher (whose birthday is tomorrow) who seemed to need some anger management training; Erich Lehmann (in conversation and in 2011) points to a number of instances in which Neyman was the instigator of gratuitous ill-will. Their substantive statistical and philosophical disagreements, I now think, were minuscule in comparison to the huge animosity that developed over many years. Here’s how Neyman describes a vivid recollection he has of the 1935 book episode to Constance Reid (1998, 126). [i]

A couple of months “after Neyman criticized Fisher’s concept of the complex experiment” Neyman vividly recollects Fisher stopping by his office at University College on his way to a meeting which was to decide on Neyman’s reappointment[ii]:

“And he said to me that he and I are in the same building… . That, as I know, he has published a book—and that’s Statistical Methods for Research Workers—and he is upstairs from me so he knows something about my lectures—that from time to time I mention his ideas, this and that—and that this would be quite appropriate if I were not here in the College but, say, in California—but if I am going to be at University College, this is not acceptable to him. And then I said, ‘Do you mean that if I am here, I should just lecture using your book?’ And then he gave an affirmative answer. And I said, ‘Sorry, no. I cannot promise that.’ And then he said, ‘Well, if so, then from now on I shall oppose you in all my capacities.’ And then he enumerated—member of the Royal Society and so forth. There were quite a few. Then he left. Banged the door.”

What if Neyman had replied:

“I’d be very pleased to use Statistical Methods for Research Workers in my class, what else?”

Or what if Fisher had said:

“Of course you’ll want to use your own notes in your class, but I hope you will use a portion of my text when mentioning some of its key ideas.”

Very unlikely [iii].

How would you have handled it?

Ironically, Neyman did something very similar to Erich Lehmann at Berkeley, and blocked his teaching graduate statistics after one attempt that may have veered slightly off Neyman’s path. But Lehmann always emphasized that, unlike Fisher, Neyman never created professional obstacles for him. [iv]

[i] At the meeting that followed this exchange, Fisher tried to shoot down Neyman’s reappointment, but did not succeed (Reid, 125).

[ii]This is Neyman’s narrative to Reid. I’m sure Fisher would relate these same episodes differently. Let me know if you have any historical material to add. I met Lehmann for the first time shortly after he had worked with Reid on her book, and he had lots of stories. I should have written them all down at the time.

[iii] I find it hard to believe, however, that Fisher would have thrown some of Neyman’s wooden models onto the floor:

“After the Royal Statistical Society meeting of March 28, relations between workers on the two floors of K.P.’s old preserve became openly hostile. One evening, late that spring, Neyman and Pearson returned to their department after dinner to do some work. Entering they were startled to find strewn on the floor the wooden models which Neyman had used to illustrate his talk on the relative advantages of randomized blocks and Latin squares. They were regularly kept in a cupboard in the laboratory. Both Neyman and Pearson always believed that the models were removed by Fisher in a fit of anger.” (Reid 124, noted in Lehmann 2011, p. 59. K.P. is, of course, Karl Pearson.)

[iv] I didn’t want to relate this anecdote without a citation, and finally found one in Reid (215-16). Actually I would have anyway, since Lehmann separately told it to Spanos and me.

Lehmann, E. (2011). Fisher, Neyman and the Creation of Classical Statistics, Springer.

Reid, C. (1998). Neyman, Springer.

Categories: phil/history of stat, Statistics | 21 Comments

## Statistics as a Counter to Heavyweights…who wrote this?

When any scientific conclusion is supposed to be [shown or disproved] on experimental evidence [or data], critics who still refuse to accept the conclusion are accustomed to take one of two lines of attack. They may claim that the interpretation of the [data] is faulty, that the results reported are not in fact those which should have been expected had the conclusion drawn been justified, or that they might equally well have arisen had the conclusion drawn been false. Such criticisms of interpretation are usually treated as falling within the domain of statistics. They are often made by professed statisticians against the work of others whom they regard as ignorant of or incompetent in statistical technique; and, since the interpretation of any considerable body of data is likely to involve computations it is natural enough that questions involving the logical implications of the results of the arithmetical processes implied should be relegated to the statistician. At least I make no complaint of this convention. The statistician cannot evade the responsibility for understanding the processes he applies or recommends. My immediate point is that the questions involved can be dissociated from all that is strictly technical in the statistician’s craft…..

The other type of criticism to which experimental results [or data] are exposed is that the experiment itself was ill designed or, of course, badly executed….This type of criticism is usually made by what I might call a heavyweight authority. Prolonged experience, or at least the long possession of a scientific reputation, is almost a pre-requisite for developing successfully this line of attack. Technical details are seldom in evidence. The authoritative assertion “His controls are totally inadequate” must have temporarily discredited many a promising line of work; and such an authoritarian method of judgment must surely continue, human nature being what it is, so long as [general methods for data generation, modeling and analysis] are lacking…

[T]he subject matter [of this work] has been regarded from the point of view of an experimenter [or data analyst], who wishes to carry out his work competently, and having done so wishes to safeguard his results, so far as they are validly established, from ignorant criticism by different sorts of superior persons.

Categories: phil/history of stat, Statistics | 9 Comments

## U-Phil: Mayo’s response to Hennig and Gandenberger

brakes on the ‘breakthrough’

“This will be my last post on the (irksome) Birnbaum argument!” she says with her fingers (or perhaps toes) crossed. But really, really it is (at least until midnight 2013). In fact the following brief remarks are all said, more clearly, in my paper, Mayo 2010, Cox & Mayo 2011 (appendix), and in posts connected to this U-Phil: Blogging the likelihood principle, new summary 10/31/12*.

What’s the catch?

In my recent “Ton o’ Bricks” post, many readers were struck by the implausibility of letting the evidential interpretation of x’* be influenced by the properties of experiments known not to have produced x’*. Yet it is altogether common to be told that, should a sampling theorist try to block this, “unfortunately there is a catch” (Ghosh, Delampady, and Samanta 2006, 38): We would be forced to embrace the strong likelihood principle (SLP, or LP, for short), at least according to an infamous argument by Allan Birnbaum (who himself rejected the LP [i]).

It is not uncommon to see statistics texts argue that in frequentist theory one is faced with the following dilemma: either to deny the appropriateness of conditioning on the precision of the tool chosen by the toss of a coin, or else to embrace the strong likelihood principle, which entails that frequentist sampling distributions are irrelevant to inference once the data are obtained. This is a false dilemma. . . . The “dilemma” argument is therefore an illusion. (Cox and Mayo 2010, 298)

In my many detailed expositions, I have explained the source of the illusion and sleight of hand from a number of perspectives (I will not repeat references here). While I appreciate the care that Hennig and Gandenberger have taken in their U-Phils (and wish them all the luck in published outgrowths), it is clear to me that they are not hearing (or are unwittingly blocking) the scre-e-e-e-ching of the brakes!

No revolution, no breakthrough!

Berger and Wolpert, in their famous monograph The Likelihood Principle, identify the core issue:

The philosophical incompatibility of the LP and the frequentist viewpoint is clear, since the LP deals only with the observed x, while frequentist analyses involve averages over possible observations. . . . Enough direct conflicts have been . . . seen to justify viewing the LP as revolutionary from a frequentist perspective. (Berger and Wolpert 1988, 65-66)[ii]

If Birnbaum’s proof does not apply to a frequentist sampling theorist, then there is neither a revolution nor a breakthrough (as Savage called it). The SLP holds just for methodologies in which it holds . . . We are going in circles.

Since Birnbaum’s argument has stood for over fifty years, I’ve given it the maximal run for its money, and haven’t tried to block its premises, however questionable its key moves may appear. Despite such latitude, I’ve shown that the “proof” to the SLP conclusion will not wash, and I’m just a wee bit disappointed that Hennig and Gandenberger haven’t wrestled with my specific argument, or shown just where they think my debunking fails. What would this require?

Since the SLP is a universal generalization, it requires only a single counterexample to falsify it. In fact, every violation of the SLP within frequentist sampling theory, I show, is a counterexample to it! In other words, using the language from the definition of the SLP, the onus is on Birnbaum to show that, for any x’* that is a member of an SLP pair (E’, E”) with given, different probability models f’, f”, x’* and x”* should have identical evidential import for an inference concerning parameter q, on pain of facing “the catch” above, i.e., being forced to allow the import of data known to have come from E’ to be altered by unperformed experiments known not to have produced x’*.

To release the brakes from my screeching halt, defenders of Birnbaum might try to show that the SLP counterexamples lead me to “the catch” as alleged. I have considered two well-known violations of the SLP. Can it be shown that a contradiction with the WCP or SP follows? I say no. Neither Hennig[ii] nor Gandenberger shows otherwise.

In my tracing out of Birnbaum’s arguments, I strived to assume that he would not be giving us circular arguments. To say that “I can prove that your methodology must obey the SLP,” and then to set out to do so by declaring “Hey Presto! Assume sampling distributions are irrelevant (once the data are in hand),” is a neat trick, but it assumes what it purports to prove. All other interpretations are shown to be unsound.

______

[i] Birnbaum himself, soon after presenting his result, rejected the SLP. As Birnbaum puts it, “the likelihood concept cannot be construed so as to allow useful appraisal, and thereby possible control, of probabilities of erroneous interpretations” (Birnbaum 1969, p. 128).

(We use LP and SLP synonymously here.)

[ii] Hennig initially concurred with me, but says a person convinced him to get back on the Birnbaum bus (even though Birnbaum got off it [i]).

Some other, related, posted discussions: Brakes on Breakthrough Part 1 (12/06/11) & Part 2 (12/07/11); Don’t Birnbaumize that experiment (12/08/12); Midnight with Birnbaum re-blog (12/31/12). See also the initial call to this U-Phil, the extension, details here, the post from my 28 Nov. seminar (LSE), and the original post by Gandenberger.

OTHER:

Birnbaum, A. (1962). “On the Foundations of Statistical Inference”, Journal of the American Statistical Association 57 (298), 269-306.

Savage, L. J., Barnard, G., Cornfield, J., Bross, I., Box, G., Good, I., Lindley, D., Clunies-Ross, C., Pratt, J., Levene, H., Goldman, T., Dempster, A., Kempthorne, O., and Birnbaum, A. (1962). On the foundations of statistical inference: “Discussion (of Birnbaum 1962)”, Journal of the American Statistical Association 57 (298), 307-326.

Birnbaum, A. (1970). Statistical Methods in Scientific Inference (letter to the editor). Nature 225, 1033.

Cox, D. R. and Mayo, D. (2010). “Objectivity and Conditionality in Frequentist Inference” in Error and Inference: Recent Exchanges on Experimental Reasoning, Reliability and the Objectivity and Rationality of Science (D. Mayo & A. Spanos eds.), CUP 276-304.

…and if that’s not enough, search this blog.

Categories: Birnbaum Brakes, Likelihood Principle, Statistics | 30 Comments

## U-PHIL: Gandenberger & Hennig: Blogging Birnbaum’s Proof

Defending Birnbaum’s Proof

Greg Gandenberger
PhD student, History and Philosophy of Science
Master’s student, Statistics
University of Pittsburgh

In her 1996 Error and the Growth of Experimental Knowledge, Professor Mayo argued against the Likelihood Principle on the grounds that it does not allow one to control long-run error rates in the way that frequentist methods do.  This argument seems to me the kind of response a frequentist should give to Birnbaum’s proof.  It does not require arguing that Birnbaum’s proof is unsound: a frequentist can accommodate Birnbaum’s conclusion (two experimental outcomes are evidentially equivalent if they have the same likelihood function) by claiming that respecting evidential equivalence is less important than achieving certain goals for which frequentist methods are well suited.

More recently, Mayo has shown that Birnbaum’s premises cannot be reformulated as claims about what sampling distribution should be used for inference while retaining the soundness of his proof.  It does not follow that Birnbaum’s proof is unsound, because Birnbaum’s original premises are not claims about what sampling distribution should be used for inference but are instead sufficient conditions for experimental outcomes to be evidentially equivalent.

Mayo acknowledges that the premises she uses in her argument against Birnbaum’s proof differ from Birnbaum’s original premises in a recent blog post in which she distinguishes between “the Sufficient Principle (general)” and “the Sufficiency Principle applied in sampling theory.”  One could make a similar distinction for the Weak Conditionality Principle.  There is indeed no way to formulate Sufficiency and Weak Conditionality Principles “applied in sampling theory” that are consistent and imply the Likelihood Principle.  This fact is not surprising: sampling theory is incompatible with the Likelihood Principle!

Birnbaum himself insisted that his premises were to be understood as “equivalence relations” rather than as “substitution rules” (i.e., rules about what sampling distribution should be used for inference) and recognized the fact that understanding them in this way was necessary for his proof.  As he put it in his 1975 rejoinder to Kalbfleisch’s response to his proof, “It was the adoption of an unqualified equivalence formulation of conditionality, and related concepts, which led, in my 1972 paper, to the monster of the likelihood axiom” (263).

Because Mayo’s argument against Birnbaum’s proof requires reformulating Birnbaum’s premises, it is best understood as an argument not for the claim that Birnbaum’s original proof is invalid, but rather for the claim that Birnbaum’s proof is valid only when formulated in a way that is irrelevant to a sampling theorist.  Reformulating Birnbaum’s premises as claims about what sampling distribution should be used for inference is the only way for a fully committed sampling theorist to understand them.  Any other formulation of those premises is either false or question-begging.

Mayo’s argument makes good sense when understood in this way, but it requires a strong prior commitment to sampling theory. Whether various arguments for sampling theory such as those Mayo gives in Error and the Growth of Experimental Knowledge are sufficient to warrant such a commitment is a topic for another day.  To those who lack such a commitment, Birnbaum’s original premises may seem quite compelling.  Mayo has not refuted the widespread view that those premises do in fact entail the Likelihood Principle.

Mayo has objected to this line of argument by claiming that her reformulations of Birnbaum’s principles are just instantiations of Birnbaum’s principles in the context of frequentist methods. But they cannot be instantiations in a literal sense because they are imperatives, whereas Birnbaum’s original premises are declaratives.  They are instead instructions that a frequentist would have to follow in order to avoid violating Birnbaum’s principles. The fact that one cannot follow them both is only an objection to Birnbaum’s principles on the question-begging assumption that evidential meaning depends on sampling distributions.

********

Birnbaum’s proof is not wrong but error statisticians don’t need to bother

Christian Hennig
Department of Statistical Science
University College London

I was impressed by Mayo’s arguments in “Error and Inference” when I came across them for the first time. To some extent, I still am. However, I have also seen versions of Birnbaum’s theorem and proof presented in a mathematically sound fashion with which I as a mathematician had no issue.

After having discussed this a bit with Phil Dawid, and having thought and read more on the issue, my conclusion is that
1) Birnbaum’s theorem and proof are correct (apart from small mathematical issues resolved later in the literature), and they are not vacuous (i.e., there are evidence functions that fulfill them without any contradiction in the premises),
2) however, Mayo’s arguments actually do raise an important problem with Birnbaum’s reasoning.

Here is why. Note that Mayo’s arguments are based on the implicit (error statistical) assumption that the sampling distribution of an inference method is relevant. In that case, application of the sufficiency principle to Birnbaum’s mixture distribution enforces the use of the sampling distribution under the mixture distribution as it is, whereas application of the conditionality principle enforces the use of the sampling distribution under the experiment that actually produced the data, which is different in the usual examples. So the problem is not that Birnbaum’s proof is wrong, but that enforcing both principles at the same time in the mixture experiment is in contradiction to the relevance of the sampling distribution (and therefore to error statistical inference). It is a case in which the sufficiency principle suppresses information that is clearly relevant under the conditionality principle. This means that the justification of the sufficiency principle (namely that all relevant information is in the sufficient statistic) breaks down in this case.

Frequentists/error statisticians therefore don’t need to worry about the likelihood principle because they shouldn’t accept the sufficiency principle in the generality that is required for Birnbaum’s proof.

Having understood this, I toyed around with the idea of writing this down as a publishable paper, but I have now come across a paper in which this argument can already be found (although in a less straightforward and more mathematical manner), namely:
M. J. Evans, D. A. S. Fraser and G. Monette (1986) On Principles and Arguments to Likelihood. Canadian Journal of Statistics 14, 181-194, http://www.jstor.org/stable/3314794, particularly Section 7 (the rest is interesting, too).

NOTE: This is the last of this group of U-Phils. Mayo will issue a brief response tomorrow. Background to these U-Phils may be found here.

Categories: Philosophy of Statistics, Statistics, U-Phil | 12 Comments

## New kvetch: Filly Fury

A little humor: rejected post: Filly Fury

## From Gelman’s blog: philosophy and the practice of Bayesian statistics

I hadn’t read Gelman and Shalizi’s response to my comment on their paper in the British Journal of Mathematical and Statistical Psychology. I see the issue is posted on Gelman’s blog. Here’s the issue of the journal:

Philosophy and the practice of Bayesian statistics (with all the discussions!)

Mark Andrews and Thom Baguley

## Mark Chang (now) gets it right about circularity

Mark Chang wrote a comment this evening, but it is buried back on my Nov. 31 post in relation to the current U-Phil. Given all he has written on my attempt to “break through the breakthrough”, I thought to bring it up to the top. Chang ends his comment with the sagacious and entirely correct claim that so many people have missed:

“What Birnbaum actually did was use the SLP to prove the SLP – as simple as that!” (Mark Chang)

It is just too bad that readers of his (2013) book will not have been told this*!  Mark: Can you issue a correction?  I definitely think you should!  If only you’d written to me, I could have pointed this out pre-pub.

That Birnbaum’s argument assumes what it claims to prove is just what I have been arguing all along. It is called a begging-the-question fallacy: An argument that boils down to:

A/therefore A

Such an argument is logically valid, and that is why formal validity does not mean much for getting conclusions accepted. Why? Well, even though such circular arguments are usually dressed up so that the premises do not so obviously repeat the conclusion, they are similarly fallacious: the truth of the premises already assumes the truth of the conclusion. If we are allowed to argue that way, you can argue anything you like! To not-A as well. That is not what the Great “Breakthrough” was supposed to be doing.

Chang’s comment (which is the same one he posted on Xi’an’s og here) also includes his other points, but fortunately, Jean Miller has recently gone through those in depth. In neither of my (generous) construals of Birnbaum do I claim his premises are inconsistent, by the way.

*But instead his readers are led to believe my criticism is flawed because of something about sufficiency having to do with a FAMILY of distributions (his caps on “family”, p. 138). This all came up as well in Xi’an’s og.

Chang, M. (2013) Paradoxes in Scientific Inference.

Categories: strong likelihood principle, U-Phil | 2 Comments

## January Palindrome Winner

Francis Lee
Palindrome*: G.I. bootstrap able to null “ahs” on lie. Neil, no shallu? Not Elba! Parts too big!

See his statement and book prize on my rejected posts. The Elba judges and I quizzed Lee severely: I post the dialogue here before banishing it to my alternative blog:

Elba Judges to Francis: What is “shallu”? Can you send a reference?

Francis: Shallu is a type of grain that I believe originates from Africa, but requires very particular weather conditions in order to successfully grow, hence the rarity of its use. I believe the seeds are rather large though.
1. Here’s a store that sells shallu: http://rareseeds.com/shallu-egyptian-wheat.html
2. And here is a reference by the US Government: http://digital.library.unt.edu/ark:/67531/metadc96470/

Francis: An unscrupulous farmer who wears military boots when he farms, lied to the public and claims to have developed a system to grow it more efficiently than was conceivable in the exact same conditions as is typically allowed, when in reality he was just mixing it with more common crops mixed in, and selling it as the more expensive shallu. Being suspicious, an inquisitive scientist snoops around.
Upon further examination of the dirt attached to his boots, they have concluded that the caked dirt was wildly lacking in some characteristic that soil conditions of shallu typically have. Along with other gaping inconsistencies in his story, this evidence warrants a trial and he is quickly convicted. Any professional fascination stemming from his methodology is soon rebuffed.
Nevertheless, there is still a rampant shallu demand from the public due to the effective marketing strategies of the farmer. The current state of affairs is that a task force appointed by the government is figuring out potential locations to grow shallu and meet public demand. One of them suggests our beloved Elba, which is swiftly denied by another on account of shallu being too unwieldy to cultivate on an island the size of Elba.

Mayo to Francis: You are a candidate for winning the January palindrome contest. Congratulations. But can you please explain the phrase “Null ‘ahs’ on my lie”? Thank you.

Francis: Null “ahs” on my lie was my long way of saying that his revealing of the truth removed any sense of wonderment from the public.

Anyone get this? No matter, congratulations Lee!

*The minimum requirement was to  include Elba plus any one of: bootstrap, demonstrate (demonstrable), null. Using two would beat out candidates using just one, even though there weren’t any.

## U-Phil: Ton o’ Bricks

by Deborah Mayo

Birnbaum’s argument for the SLP involves some equivocations that are at once subtle and blatant. The subtlety makes it hard to translate into symbolic logic (I only partially translated it). Philosophers should have a field day with this, and I should be hearing more reports that it has suddenly hit them between the eyes like a ton of bricks, to use a mixture metaphor. Here are the key bricks. References can be found here, background to the U-Phil here.

Famous (mixture) weighing machine example and the WCP

The main principle of evidence on which Birnbaum’s argument rests is the weak conditionality principle (WCP).  This principle, Birnbaum notes, follows not from mathematics alone but from intuitively plausible views of “evidential meaning.” To understand the interpretation of the WCP that gives it its plausible ring, we consider its development in “what is now usually called the ‘weighing machine example,’ which draws attention to the need for conditioning, at least in certain types of problems” (Reid 1992).

The basis for the WCP

Example 3. Two measuring instruments of different precisions. We flip a fair coin to decide which of two instruments, E’ or E”, to use in observing a normally distributed random sample X to make inferences about mean q. E’ has a known variance of 10⁻⁴, while that of E” is known to be 10⁴. The experiment is a mixture: E-mix. The fair coin or other randomizer may be characterized as observing an indicator statistic J, taking values 1 or 2 with probabilities .5, independent of the process under investigation. The full data indicates first the result of the coin toss, and then the measurement: (Ej, xj).[i]

The sample space of E-mix with components Ej, j = 1, 2, consists of the union of

{(j, x’): j = 1, possible values of X’} and {(j, x”): j = 2, possible values of X”}.

In testing a null hypothesis such as q = 0, the same x measurement would correspond to a much smaller p-value were it to have come from E′ than if it had come from E”: denote them as p′(x) and p′′(x), respectively. However, the overall significance level of the mixture, the convex combination of the p-values, [p′(x) + p′′(x)]/2, would give a misleading report of the precision or severity of the actual experimental measurement (see Cox and Mayo 2010, 296).
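To make the contrast concrete, here is a minimal Python sketch of the three numbers just described. It rests on assumptions not in the text: a single measurement x (the value 0.02 is hypothetical, chosen only for illustration), a one-sided test of q = 0 against q > 0, and standard deviations 0.01 and 100 for E′ and E′′ (the square roots of the two variances in Example 3). The name `upper_tail_p` is just an illustrative label.

```python
import math


def upper_tail_p(x, sigma, theta0=0.0):
    """One-sided p-value P(X >= x; theta0) for one N(theta0, sigma^2) measurement."""
    z = (x - theta0) / sigma
    return 0.5 * math.erfc(z / math.sqrt(2.0))


SIGMA_PRECISE = math.sqrt(1e-4)    # E': variance 10^-4 (assumed from Example 3)
SIGMA_IMPRECISE = math.sqrt(1e4)   # E'': variance 10^4 (assumed from Example 3)

x_obs = 0.02  # hypothetical measurement, for illustration only

p_precise = upper_tail_p(x_obs, SIGMA_PRECISE)      # p'(x)
p_imprecise = upper_tail_p(x_obs, SIGMA_IMPRECISE)  # p''(x)
p_mixture = 0.5 * (p_precise + p_imprecise)         # [p'(x) + p''(x)] / 2

print(f"p'(x)  = {p_precise:.4f}")
print(f"p''(x) = {p_imprecise:.4f}")
print(f"average = {p_mixture:.4f}")
```

With these assumed numbers, p′(x) is roughly 0.02, p′′(x) is just under 0.5, and the average lands near 0.26, a figure that describes neither instrument.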

Suppose that we know we have observed a measurement from E” with its much larger variance:

The unconditional test says that we can assign this a higher level of significance than we ordinarily do, because if we were to repeat the experiment, we might sample some quite different distribution. But this fact seems irrelevant to the interpretation of an observation which we know came from a distribution [with the larger variance] (Cox 1958, 361).

In effect, an individual unlucky enough to use the imprecise tool gains a more informative assessment because he might have been lucky enough to use the more precise tool! (Birnbaum 1962, 491; Cox and Mayo 2010, 296). Once it is known whether E′ or E′′ has produced x, the p-value or other inferential assessment should be made conditional on the experiment actually run.
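The long-run side of this point can be sketched with a small simulation. Everything here is an assumption made for illustration: a single measurement per trial, a one-sided test of q = 0 at level 0.05, and, as a crude stand-in for an unconditional procedure (not Cox's own test), rejecting whenever the averaged p-value falls below 0.05. The sketch compares rejection rates under the null within each component.

```python
import math
import random


def upper_tail_p(x, sigma):
    """One-sided p-value P(X >= x) for one N(0, sigma^2) measurement."""
    return 0.5 * math.erfc(x / (sigma * math.sqrt(2.0)))


SIGMAS = {"E'": 0.01, "E''": 100.0}  # assumed standard deviations
ALPHA = 0.05                          # assumed test level


def simulate(n_trials=200_000, seed=0):
    """Return {component: (conditional rate, averaged rate)} under the null."""
    rng = random.Random(seed)
    counts = {name: 0 for name in SIGMAS}
    cond_rej = {name: 0 for name in SIGMAS}
    avg_rej = {name: 0 for name in SIGMAS}
    for _ in range(n_trials):
        name = rng.choice(("E'", "E''"))    # the fair coin
        x = rng.gauss(0.0, SIGMAS[name])    # measurement with q = 0
        p_cond = upper_tail_p(x, SIGMAS[name])
        p_avg = 0.5 * (upper_tail_p(x, SIGMAS["E'"])
                       + upper_tail_p(x, SIGMAS["E''"]))
        counts[name] += 1
        cond_rej[name] += int(p_cond < ALPHA)
        avg_rej[name] += int(p_avg < ALPHA)
    return {name: (cond_rej[name] / counts[name], avg_rej[name] / counts[name])
            for name in SIGMAS}


rates = simulate()
for name, (cond, avg) in rates.items():
    print(f"{name}: conditional {cond:.3f}, averaged {avg:.3f}")
```

Under these assumptions the conditional test rejects about 5% of the time within each instrument. The averaged rule also rejects about 5% of the time overall, but it gets there by almost never rejecting with the precise tool and rejecting about 10% of the time with the imprecise one.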

Weak Conditionality Principle (WCP): If a mixture experiment is performed, with components E’, E” determined by a randomizer (independent of the parameter of interest), then once (E’, x’) is known, inference should be based on E’ and its sampling distribution, not on the sampling distribution of the convex combination of E’ and E”.

Understanding the WCP

The WCP includes a prescription and a proscription for the proper evidential interpretation of x’, once it is known to have come from E’:

The evidential meaning of any outcome (E’, x’) of any experiment E having a mixture structure is the same as: the evidential meaning of the corresponding outcome x’ of the corresponding component experiment E’, ignoring otherwise the over-all structure of the original experiment E (Birnbaum 1962, 489; Eh and xh replaced with E’ and x’ for consistency).

While the WCP seems obvious enough, it is actually rife with equivocal potential. To avoid this, we spell out its three assertions.

First, it applies once we know which component of the mixture has been observed, and what the outcome was (Ej, xj). (Birnbaum considers mixtures with just two components.)

Second, there is the prescription about evidential equivalence. Once it is known that Ej has generated the data, given that our inference is about a parameter of Ej, inferences are appropriately drawn in terms of the distribution in Ej —the experiment known to have been performed.

Third, there is the proscription. In the case of informative inferences about the parameter of Ej our inference should not be influenced by whether the decision to perform Ej was determined by a coin flip or fixed all along. Misleading informative inferences might result from averaging over the convex combination of Ej and an experiment known not to have given rise to the data. The latter may be called the unconditional (sampling) distribution. ….

______________________________________________

One crucial equivocation:

Casella and R. Berger (2002) write:

The [weak] Conditionality principle simply says that if one of two experiments is randomly chosen and the chosen experiment is done, yielding data x, the information about q depends only on the experiment performed. . . . The fact that this experiment was performed, rather than some other, has not increased, decreased, or changed knowledge of q. (p. 293, emphasis added)

I have emphasized the last line in order to underscore a possible equivocation. Casella and Berger’s intended meaning is the correct claim:

(i) Given that it is known that measurement x’ is observed as a result of using tool E’, then it does not matter (and it need not be reported) whether or not E’ was chosen by a random toss (that might have resulted in using tool E”) or had been fixed all along.

Of course we do not know what measurement would have resulted had the unperformed measuring tool been used.

Compare (i) to a false and unintended reading:

(ii) If some measurement x is observed, then it does not matter (and it need not be reported) whether it came from a precise tool E’ or imprecise tool E”.

The idea of detaching x, and reporting that “x came from somewhere I know not where,” will not do. For one thing, we need to know the experiment in order to compute the sampling inference. For another, E’ and E” may be like our weighing procedures with very different precisions. It is analogous to being given the likelihood of the result in Example 1 (here), withholding whether it came from a negative binomial or a binomial.
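Example 1 is not reproduced in this post, so the following Python sketch uses assumed numbers from a standard textbook setup rather than the linked example: 9 successes and 3 failures, testing p = 0.5 against p > 0.5. Under binomial sampling the number of trials (12) is fixed in advance; under negative binomial sampling one keeps going until the 3rd failure. The sketch checks that the two likelihoods are proportional while the p-values differ.

```python
from math import comb

SUCCESSES, FAILURES = 9, 3     # hypothetical data, for illustration only
N = SUCCESSES + FAILURES       # 12 trials in total


def binomial_likelihood(p):
    """Likelihood when the number of trials N is fixed in advance."""
    return comb(N, SUCCESSES) * p**SUCCESSES * (1 - p)**FAILURES


def negative_binomial_likelihood(p):
    """Likelihood when sampling stops at the FAILURES-th failure."""
    # The last trial must be a failure, so only the first N - 1 are arranged.
    return comb(N - 1, SUCCESSES) * p**SUCCESSES * (1 - p)**FAILURES


# The ratio of the two likelihoods is the same constant for every p.
ratios = [binomial_likelihood(p) / negative_binomial_likelihood(p)
          for p in (0.2, 0.5, 0.8)]

# Binomial p-value: P(at least 9 successes in 12 trials) when p = 0.5.
p_binomial = sum(comb(N, k) for k in range(SUCCESSES, N + 1)) / 2**N

# Negative binomial p-value: at least 9 successes before the 3rd failure,
# i.e. at most 2 failures among the first 11 trials, when p = 0.5.
p_negative_binomial = sum(comb(N - 1, f) for f in range(FAILURES)) / 2**(N - 1)

print("likelihood ratios:", [round(r, 6) for r in ratios])
print(f"binomial p-value          = {p_binomial:.4f}")
print(f"negative binomial p-value = {p_negative_binomial:.4f}")
```

With these assumed data the likelihood ratio is 4 at every value of p, yet the p-values are about 0.073 and 0.033. Someone handed only the likelihood could not tell which of the two sampling inferences applies.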

Claim (i), by contrast, may well be warranted, not on purely mathematical grounds, but as the most appropriate way to report the precision of the result attained, as when the WCP applies. The essential difference in claim (i) is that the outcome (E’, x’) is known, enabling its inferential import to be determined.

The linguistic similarity of (i) and (ii) may explain the equivocation that vitiates the Birnbaum argument.

Now go back and skim 3 short pages of notes here, pp 11-14, and it should hit you like a ton of bricks!  If so, reward yourself with a double Elba Grease, else try again. Report your results in the comments.