R. A. Fisher: How an Outsider Revolutionized Statistics (Aris Spanos)

A SPANOS


In recognition of R.A. Fisher’s birthday on February 17….

‘R. A. Fisher: How an Outsider Revolutionized Statistics’

by Aris Spanos

Few statisticians will dispute that R. A. Fisher (February 17, 1890 – July 29, 1962) is the father of modern statistics; see Savage (1976), Rao (1992). Inspired by William Gosset’s (1908) paper on the finite sampling distribution of Student’s t, he recast statistics into its modern, model-based form of induction in a series of papers in the early 1920s. He put forward a theory of optimal estimation based on the method of maximum likelihood that has changed only marginally over the last century. His significance testing, spearheaded by the p-value, provided the basis for the Neyman-Pearson theory of optimal testing in the early 1930s. According to Hald (1998):

“Fisher was a genius who almost single-handedly created the foundations for modern statistical science, without detailed study of his predecessors. When young he was ignorant not only of the Continental contributions but even of contemporary publications in English.” (p. 738)

What is not so well known is that Fisher was the ultimate outsider when he brought about this change of paradigms in statistical science. As an undergraduate, he studied mathematics at Cambridge, and then did graduate work in statistical mechanics and quantum theory. His meager knowledge of statistics came from his study of astronomy; see Box (1978). That, however, did not stop him from publishing his first paper in statistics in 1912 (still an undergraduate) on “curve fitting”, questioning Karl Pearson’s method of moments and proposing a new method that was eventually to become the likelihood method in his 1921 paper.


After graduating from Cambridge he drifted into a series of jobs, including subsistence farming and teaching high school mathematics and physics, until his temporary appointment as a statistician at Rothamsted Experimental Station in 1919. During the period 1912-1919 his interest in statistics was driven by his passion for eugenics and a realization that his mathematical knowledge of n-dimensional geometry could be put to good use in deriving finite sampling distributions for estimators and tests in the spirit of Gosset’s (1908) paper. Encouraged by his early correspondence with Gosset, he derived the finite sampling distribution of the sample correlation coefficient, which he published in 1915 in Biometrika, the only statistics journal at the time, edited by Karl Pearson. To put this result in its proper context, Pearson had been working on the problem for two decades and, with several assistants, had published more than a dozen papers approximating the first two moments of the sample correlation coefficient; Fisher derived the relevant distribution, not just the first two moments.

Due to its importance, the 1915 paper provided Fisher’s first skirmish with the ‘statistical establishment’. Karl Pearson would not take being overrun by a ‘newcomer’ lightly. So, he prepared a critical paper with four of his assistants that became known as “the cooperative study”, questioning Fisher’s result as stemming from a misuse of Bayes theorem. He proceeded to publish it in Biometrika in 1917 without bothering to let Fisher know before publication. Fisher was furious at K. Pearson’s move and prepared his answer in a highly polemical style, which Pearson promptly refused to publish in his journal. Eventually Fisher was able to publish his answer, after tempering the style, in Metron, a brand new statistics journal. As a result of this skirmish, Fisher pledged never to send another paper to Biometrika, and declared war on K. Pearson’s perspective on statistics. Fisher questioned not only Pearson’s method of moments, as giving rise to inefficient estimators, but also his derivation of the degrees of freedom of his chi-square test. Several highly critical published papers ensued.[i]

Between 1922 and 1930 Fisher did most of his influential work in recasting statistics, including publishing a highly successful textbook in 1925, but the ‘statistical establishment’ kept him ‘in his place’: a statistician at an experimental station. All his attempts to find an academic position, including a position in Social Biology at the London School of Economics (LSE), were unsuccessful (see Box, 1978, p. 202). Being turned down for the LSE position was not unrelated to the fact that the professor of statistics at the LSE was Arthur Bowley (1869-1957), second only to Pearson in the statistical high priesthood.[ii]

Coming of age as a statistician in 1920s England meant being awarded the Guy Medal in gold, silver or bronze, or at least receiving an invitation to present your work to the Royal Statistical Society (RSS). Despite his fundamental contributions to the field, Fisher’s invitation to the RSS would not come until 1934. To put that in perspective, Jerzy Neyman, his junior by some distance, was invited six months earlier! Indeed, one can make a strong case that the statistical establishment kept Fisher away for as long as they could get away with it. However, by 1933 they must have felt that they had to invite Fisher after he accepted a professorship at University College London. The position was created after Karl Pearson retired and the College decided to split his chair into a statistics position that went to Egon Pearson (Pearson’s son) and a Galton professorship in Eugenics that was offered to Fisher. To make matters worse, Fisher’s offer came with a humiliating clause forbidding him to teach statistics at University College (see Box, 1978, p. 258); the father of modern statistics was explicitly told to keep his views on statistics to himself!

Fisher’s presentation to the Royal Statistical Society, on December 18th, 1934, entitled “The Logic of Inductive Inference”, was an attempt to summarize and explain his published work on recasting the problem of statistical induction since his classic 1922 paper. Bowley was (self?) appointed to move the traditional vote of thanks and open the discussion. After some begrudging thanks for Fisher’s ‘contributions to statistics in general’, he went on to disparage the new approach to statistical inference based on the likelihood function, describing it as abstruse, arbitrary and misleading. His comments were predominantly sarcastic and discourteous, and went so far as to accuse Fisher of plagiarism for not acknowledging Edgeworth’s priority on the likelihood function idea (see Fisher, 1935, pp. 55-7). The litany of churlish comments continued with the rest of the old guard: Isserlis, Irwin and the philosopher Wolf (1935, pp. 57-64), who was brought in by Bowley to undermine Fisher’s philosophical discussion of induction. Jeffreys complained about Fisher’s criticisms of the Bayesian approach (1935, pp. 70-2).

To Fisher’s support came … Egon Pearson, Neyman and Bartlett. E. Pearson argued that:

“When these ideas [on statistical induction] were fully understood … it would be realized that statistical science owed a very great deal to the stimulus Professor Fisher had provided in many directions.” (Fisher, 1935, pp. 64-5)

Neyman too came to Fisher’s support, praising Fisher’s path-breaking contributions, and explaining Bowley’s reaction to Fisher’s critical review of the traditional view of statistics as an understandable attachment to old ideas (1935, p. 73).

Fisher, in his reply to Bowley and the old guard, was equally contemptuous:

“The acerbity, to use no stronger term, with which the customary vote of thanks has been moved and seconded … does not, I confess, surprise me. From the fact that thirteen years have elapsed between the publication, by the Royal Society, of my first rough outline of the developments, which are the subject of to-day’s discussion, and the occurrence of that discussion itself, it is a fair inference that some at least of the Society’s authorities on matters theoretical viewed these developments with disfavour, and admitted with reluctance. … However true it may be that Professor Bowley is left very much where he was, the quotations show at least that Dr. Neyman and myself have not been left in his company. … For the rest, I find that Professor Bowley is offended with me for “introducing misleading ideas”. He does not, however, find it necessary to demonstrate that any such idea is, in fact, misleading. It must be inferred that my real crime, in the eyes of his academic eminence, must be that of “introducing ideas”.” (Fisher, 1935, pp. 76-82)[iii]

In summary, the pioneering work of Fisher, later supplemented by Egon Pearson and Neyman, was largely ignored by the Royal Statistical Society (RSS) establishment until the early 1930s. By 1933 it was difficult to ignore their contributions, published primarily in other journals, and the RSS ‘establishment’ decided to display its tolerance of their work by creating ‘the Industrial and Agricultural Research Section’, under the auspices of which the papers by Neyman and Fisher were presented in 1934 and 1935, respectively.[iv]

In 1943, Fisher was offered the Balfour Chair of Genetics at the University of Cambridge. Recognition from the RSS came in 1946 with the Guy Medal in gold, and he served as its president in 1952-1954, just after he was knighted! Sir Ronald Fisher retired from Cambridge in 1957. The father of modern statistics never held an academic position in statistics!

Read more in Spanos 2008 (below)

References

Bowley, A. L. (1902, 1920, 1926, 1937) Elements of Statistics, 2nd, 4th, 5th and 6th editions, Staples Press, London.

Box, J. F. (1978) The Life of a Scientist: R. A. Fisher, Wiley, NY.

Fisher, R. A. (1912), “On an Absolute Criterion for Fitting Frequency Curves,” Messenger of Mathematics, 41, 155-160.

Fisher, R. A. (1915) “Frequency distribution of the values of the correlation coefficient in samples from an indefinitely large population,” Biometrika, 10, 507-21.

Fisher, R. A. (1921) “On the ‘probable error’ of a coefficient deduced from a small sample,” Metron 1, 2-32.

Fisher, R. A. (1922) “On the mathematical foundations of theoretical statistics,” Philosophical Transactions of the Royal Society, A 222, 309-68.

Fisher, R. A. (1922a) “On the interpretation of χ² from contingency tables, and the calculation of P,” Journal of the Royal Statistical Society, 85, 87–94.

Fisher, R. A. (1922b) “The goodness of fit of regression formulae and the distribution of regression coefficients,”  Journal of the Royal Statistical Society, 85, 597–612.

Fisher, R. A. (1924) “The conditions under which χ² measures the discrepancy between observation and hypothesis,” Journal of the Royal Statistical Society, 87, 442-450.

Fisher, R. A. (1925) Statistical Methods for Research Workers, Oliver & Boyd, Edinburgh.

Fisher, R. A. (1935) “The logic of inductive inference,” Journal of the Royal Statistical Society 98, 39-54, discussion 55-82.

Fisher, R. A. (1937), “Professor Karl Pearson and the Method of Moments,” Annals of Eugenics, 7, 303-318.

Gosset, W. S. (1908) “The probable error of the mean,” Biometrika, 6, 1-25.

Hald, A. (1998) A History of Mathematical Statistics from 1750 to 1930, Wiley, NY.

Hotelling, H. (1930) “British statistics and statisticians today,” Journal of the American Statistical Association, 25, 186-90.

Neyman, J. (1934) “On the two different aspects of the representative method: the method of stratified sampling and the method of purposive selection,” Journal of the Royal Statistical Society, 97, 558-625.

Rao, C. R. (1992) “R. A. Fisher: The Founder of Modern Statistics,” Statistical Science, 7, 34-48.

RSS (Royal Statistical Society) (1934) Annals of the Royal Statistical Society 1834-1934, The Royal Statistical Society, London.

Savage, L. J. (1976) “On re-reading R. A. Fisher,” Annals of Statistics, 4, 441-500.

Spanos, A. (2008), “Statistics and Economics,” pp. 1057-1097 in The New Palgrave Dictionary of Economics, Second Edition. Eds. Steven N. Durlauf and Lawrence E. Blume, Palgrave Macmillan.

Tippett, L. H. C. (1931) The Methods of Statistics, Williams & Norgate, London.


[i] Fisher (1937), published a year after Pearson’s death, is particularly acerbic. In Fisher’s mind, Karl Pearson went after a young Indian statistician – totally unfairly – just the way he went after him in 1917.

[ii] Bowley received the Guy Medal in silver from the Royal Statistical Society (RSS) as early as 1895, and became a member of the Council of the RSS in 1898. He was awarded the society’s highest honor, the Guy Medal in gold, in 1935.

[iii] It is important to note that Bowley revised his textbook in statistics for the last time in 1937, and predictably, he missed the whole change of paradigms brought about by Fisher, Neyman and Pearson.

[iv] In their centennial volume published in 1934, the RSS acknowledged the development of ‘mathematical statistics’, referring to Galton, Edgeworth, Karl Pearson, Yule and Bowley as the main pioneers, and listed the most important contributions in this sub-field which appeared in its Journal during the period 1909-33, but the three important papers by Fisher (1922a-b; 1924) are conspicuously absent from that list. The list itself is dominated by contributions in vital, commercial, financial and labour statistics (see RSS, 1934, pp. 208-23). There is a single reference to Egon Pearson.

This was first posted on 17 Feb. 2013 here.

HAPPY BIRTHDAY R.A. FISHER!

Categories: Fisher, phil/history of stat, Spanos, Statistics | 2 Comments

Guest Blog: STEPHEN SENN: ‘Fisher’s alternative to the alternative’

“You May Believe You Are a Bayesian But You Are Probably Wrong”


As part of the week of recognizing R.A. Fisher (February 17, 1890 – July 29, 1962), I reblog a guest post by Stephen Senn from 2012/2017. The comments from 2017 lead to a troubling issue that I will bring up in the comments today.

‘Fisher’s alternative to the alternative’

By: Stephen Senn

[2012 marked] the 50th anniversary of RA Fisher’s death. It is a good excuse, I think, to draw attention to an aspect of his philosophy of significance testing. In his extremely interesting essay on Fisher, Jimmie Savage drew attention to a problem in Fisher’s approach to testing. In describing Fisher’s aversion to power functions Savage writes, ‘Fisher says that some tests are more sensitive than others, and I cannot help suspecting that that comes to very much the same thing as thinking about the power function.’ (Savage 1976, p. 473)

The modern statistician, however, has an advantage here denied to Savage. Savage’s essay was published posthumously in 1976 and the lecture on which it was based was given in Detroit on 29 December 1971 (p. 441). At that time Fisher’s scientific correspondence did not form part of his available oeuvre, but in 1990 Henry Bennett’s magnificent edition of Fisher’s statistical correspondence (Bennett 1990) was published, and this throws light on many aspects of Fisher’s thought, including on significance tests.


The key letter here is Fisher’s reply of 6 October 1938 to Chester Bliss’s letter of 13 September. Bliss himself had reported an issue that had been raised with him by Snedecor on 6 September. Snedecor had pointed out that an analysis using inverse sine transformations of some data that Bliss had worked on gave a different result to an analysis of the original values. Bliss had defended his (transformed) analysis on the grounds that a) if a transformation always gave the same result as an analysis of the original data there would be no point and b) an analysis on inverse sines was a sort of weighted analysis of percentages with the transformation more appropriately reflecting the weight of information in each sample. Bliss wanted to know what Fisher thought of his reply.

Fisher replies with a ‘shorter catechism’ on transformations which ends as follows:

A…Have not Neyman and Pearson developed a general mathematical theory for deciding what tests of significance to apply?

B…Their method only leads to definite results when mathematical postulates are introduced, which could only be justifiably believed as a result of extensive experience….the introduction of hidden postulates only disguises the tentative nature of the process by which real knowledge is built up. (Bennett 1990, p. 246)

It seems clear that by hidden postulates Fisher means alternative hypotheses, and I would sum up Fisher’s argument like this. Null hypotheses are more primitive than statistics: to state a null hypothesis immediately carries an implication about an infinity of test statistics. You have to choose one, however. To say that you should choose the one with the greatest power gets you nowhere. This power depends on the alternative hypothesis, but how will you choose your alternative hypothesis? If you knew, under all circumstances in which the null hypothesis was true, which alternative was false, you would already know more than the experiment was designed to find out. All that you can do is apply your experience to use statistics which, when employed in valid tests, reject the null hypothesis most often. Hence statistics are more primitive than alternative hypotheses and the latter cannot be made the justification of the former.

I think that this is an important criticism of Fisher’s but not entirely fair. The experience of any statistician rarely amounts to so much that this can be made the (sure) basis for the choice of test. I think that (s)he uses a mixture of experience and argument. I can give an example from my own practice. In carrying out meta-analyses of binary data I have theoretical grounds (I believe) for a prejudice against the risk difference scale and in favour of odds ratios. I think that this prejudice was originally analytic. To that extent I was being rather Neyman-Pearson. However some extensive empirical studies of large collections of meta-analyses have shown that there is less heterogeneity on the odds ratio scale compared to the risk-difference scale. To that extent my preference is Fisherian. However, there are some circumstances (for example where it was reasonably believed that only a small proportion of patients would respond) under which I could be persuaded that the odds ratio was not a good scale. This strikes me as veering towards the N-P.
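A small numerical illustration of why the choice of scale matters (my own toy numbers, not drawn from the meta-analyses mentioned above): if the odds ratio really were constant across trials, the risk difference could not be, since it must shrink as the baseline risk approaches 0 or 1.

```python
def risk_from_odds_ratio(baseline_risk, odds_ratio):
    """Risk in the treated group implied by a control-group risk and a constant odds ratio."""
    odds = baseline_risk / (1 - baseline_risk) * odds_ratio
    return odds / (1 + odds)

for p0 in (0.05, 0.20, 0.50):
    p1 = risk_from_odds_ratio(p0, odds_ratio=2.0)
    print(f"baseline {p0:.2f}: treated {p1:.2f}, risk difference {p1 - p0:.2f}")

# With a constant odds ratio of 2, the risk differences are roughly 0.05, 0.13 and 0.17
# at baseline risks of 0.05, 0.20 and 0.50: homogeneous on one scale, heterogeneous on the other.
```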

Nevertheless, I have a lot of sympathy with Fisher’s criticism. It seems to me that what the practicing scientist wants to know is what is a good test in practice rather than what would be a good test in theory if this or that could be believed about the world.

References: 

J. H. Bennett (1990) Statistical Inference and Analysis: Selected Correspondence of R. A. Fisher, Oxford: Oxford University Press.

L. J. Savage (1976) On rereading R. A. Fisher. The Annals of Statistics, 4, 441-500.

Categories: Fisher, S. Senn, Statistics | 1 Comment

Happy Birthday R.A. Fisher: ‘Two New Properties of Mathematical Likelihood’

17 February 1890–29 July 1962

Today is R.A. Fisher’s birthday. I’ll post some Fisherian items this week in honor of it. This paper comes just before the conflicts with Neyman and Pearson erupted. Fisher links his tests and sufficiency to the Neyman and Pearson lemma in terms of power. It’s as if we may see them as ending up in a similar place while starting from different origins. I quote just the most relevant portions…the full article is linked below. Happy Birthday Fisher!

Two New Properties of Mathematical Likelihood

by R.A. Fisher, F.R.S.

Proceedings of the Royal Society, Series A, 144: 285-307 (1934)

The property that where a sufficient statistic exists, the likelihood, apart from a factor independent of the parameter to be estimated, is a function only of the parameter and the sufficient statistic, explains the principal result obtained by Neyman and Pearson in discussing the efficacy of tests of significance. Neyman and Pearson introduce the notion that any chosen test of a hypothesis H0 is more powerful than any other equivalent test, with regard to an alternative hypothesis H1, when it rejects H0 in a set of samples having an assigned aggregate frequency ε when H0 is true, and the greatest possible aggregate frequency when H1 is true. If any group of samples can be found within the region of rejection whose probability of occurrence on the hypothesis H1 is less than that of any other group of samples outside the region, but is not less on the hypothesis H0, then the test can evidently be made more powerful by substituting the one group for the other.

Consequently, for the most powerful test possible the ratio of the probabilities of occurrence on the hypothesis H0 to that on the hypothesis H1 is less in all samples in the region of rejection than in any sample outside it. For samples involving continuous variation the region of rejection will be bounded by contours for which this ratio is constant. The regions of rejection will then be required in which the likelihood of H0 bears to the likelihood of H1, a ratio less than some fixed value defining the contour. (295)…

It is evident, at once, that such a system is only possible when the class of hypotheses considered involves only a single parameter θ, or, what comes to the same thing, when all the parameters entering into the specification of the population are definite functions of one of their number. In this case, the regions defined by the uniformly most powerful test of significance are those defined by the estimate of maximum likelihood, T. For the test to be uniformly most powerful, moreover, these regions must be independent of θ, showing that the statistic must be of the special type distinguished as sufficient. Such sufficient statistics have been shown to contain all the information which the sample provides relevant to the value of the appropriate parameter θ. It is inevitable therefore that if such a statistic exists it should uniquely define the contours best suited to discriminate among hypotheses differing only in respect of this parameter; and it is surprising that Neyman and Pearson should lay it down as a preliminary consideration that ‘the testing of statistical hypotheses cannot be treated as a problem in estimation.’ When tests are considered only in relation to sets of hypotheses specified by one or more variable parameters, the efficacy of the tests can be treated directly as the problem of estimation of these parameters. Regard for what has been established in that theory, apart from the light it throws on the results already obtained by their own interesting line of approach, should also aid in treating the difficulties inherent in cases in which no sufficient statistics exists. (296)
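In modern notation (not Fisher’s own), the criterion described in this passage is the likelihood-ratio construction of the Neyman-Pearson lemma; a compact restatement, with ε the assigned size of the test:

```latex
% Reject H0 where the likelihood ratio is smallest, the cut-off being fixed by the size.
\[
  \text{Reject } H_0 \iff \frac{L(x; H_0)}{L(x; H_1)} \le c_{\varepsilon},
  \qquad \text{where } c_{\varepsilon} \text{ satisfies }
  P\!\left( \frac{L(X; H_0)}{L(X; H_1)} \le c_{\varepsilon} \;\middle|\; H_0 \right) = \varepsilon .
\]
```

The “contours” Fisher refers to are the level sets of this ratio; his point is that when a sufficient statistic exists, those contours depend on the data only through it.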

Categories: Fisher, phil/history of stat, Statistics | 1 Comment

3 YEARS AGO (FEBRUARY 2015): MEMORY LANE


MONTHLY MEMORY LANE: 3 years ago: February 2015 [1]. Here are some items for your Saturday night reading and rereading. Three are in preparation for Fisher’s birthday next week (Feb 17). One is a Saturday night comedy where Jeffreys appears to substitute for Jay Leno. The 2/25 entry lets you go back 6 years, where there’s more on Fisher, a bit of statistical theatre (of the absurd), misspecification tests, and a guest post (by Schachtman) on that infamous Matrixx court case (wherein the Supreme Court is thought to have weighed in on statistical significance tests). The comments are often the most interesting parts of these old posts.

February 2015

  • 02/05 Stephen Senn: Is Pooling Fooling? (Guest Post)
  • 02/10 What’s wrong with taking (1 – β)/α, as a likelihood ratio comparing H0 and H1?
  • 02/13 Induction, Popper and Pseudoscience
  • 02/16 Continuing the discussion on truncation, Bayesian convergence and testing of priors
  • 02/16 R. A. Fisher: ‘Two New Properties of Mathematical Likelihood’: Just before breaking up (with N-P)
  • 02/17 R. A. Fisher: How an Outsider Revolutionized Statistics (Aris Spanos)

  • 02/19 Stephen Senn: Fisher’s Alternative to the Alternative
  • 02/21 Sir Harold Jeffreys’ (tail area) one-liner: Saturday night comedy (b)
  • 02/25 3 YEARS AGO: (FEBRUARY 2012) MEMORY LANE


  • 02/27 Big Data is the New Phrenology?

 

[1] Monthly memory lanes began at the blog’s 3-year anniversary in Sept, 2014.

 


Categories: 3-year memory lane | Leave a comment

S. Senn: Evidence Based or Person-centred? A Statistical debate (Guest Post)


Stephen Senn
Head of  Competence Center
for Methodology and Statistics (CCMS)
Luxembourg Institute of Health
Twitter @stephensenn

Evidence Based or Person-centred? A statistical debate

It was hearing Stephen Mumford and Rani Lill Anjum (RLA) speaking in January 2017 at the Epistemology of Causal Inference in Pharmacology conference in Munich, organised by Jürgen Landes, Barbara Osmani and Roland Poellinger, that inspired me to buy their book, Causation: A Very Short Introduction[1]. Although I do not agree with all that is said in it and also could not pretend to understand all it says, I can recommend it highly as an interesting introduction to issues in causality, some of which will be familiar to statisticians but some not at all.

Since I have a long-standing interest in researching into ways of delivering personalised medicine, I was interested to see a reference on Twitter to a piece by RLA, Evidence based or person centered? An ontological debate, in which she claims that the choice between evidence based or person-centred medicine is ultimately ontological[2]. I don’t dispute that thinking about health care delivery in ontological terms might be interesting. However, I do dispute that there is any meaningful choice between evidence based medicine (EBM) and person centred healthcare (PCH). To suggest so is to commit a category mistake by suggesting that means are alternatives to ends.

In fact, EBM will be essential to delivering effective PCH, as I shall now explain.

Blood will have blood

I shall take a rather unglamorous problem, that of deciding whether a generic form of phenytoin is equivalent in effect to a brand-name version. It may seem that this has little to do with PCH but in fact, unpromising as it may seem, it illuminates many points of the often-made but misleading claim that EBM is essentially about averages and is irrelevant to PCH, a view, in my opinion, that is behind the sentiment expressed in RLA’s final sentence: Causal singularism teaches us what PCH already knows:  that each person is unique, and that one size does not fit all.

If you want to prove that a generic formulation is equivalent to a brand-name drug, a common design used to get evidence to that effect is a cross-over, in which a number of subjects are given both formulations on separate occasions and the concentrations in the blood of the two formulations are compared to see if they are similar. Such experiments, referred to as bioequivalence studies[3], may seem simple and trivial, but they exhibit in extreme form several common characteristics of RCTs that contradict standard misconceptions regarding them:

  1. General point. The subjects studied are not like the target population. Local instance. Very often young male healthy volunteers are studied, even though the treatments will be used in old as well as young ill people of either sex.
  2. General point. The outcome variable studied is not directly clinically relevant. Local instance. Analysis will proceed in terms of log area under the concentration-time curve, despite the fact that what is important to the patient is clinical efficacy and tolerability, neither of which will form the object of this study.
  3. General point. In terms of any measure that was clinically relevant, there would be important differences between the sample and the target population. Local instance. Serum concentration in a male healthy volunteer will possibly be much lower than for an elderly female patient. He will weigh more and probably eliminate the drug faster than she would. What would be a safe dose for him might not be for her.
  4. General point. Making use of the results requires judgement and background knowledge and is not automatic. Local instance. Theories of drug elimination and distribution imply that if time concentration in the blood of two formulations is similar their effect site concentrations and hence effects should be similar. “The blood is a gate through which the drug must pass.”

In fact, the whole use of such experiments is highly theory-laden and employs an implied model partly based on assumptions and partly on experience. The idea is that, first, equality of concentration in the blood implies equality of clinical effects and, second, although the concentration in the blood could be quite different between volunteers and patients, the relative bioavailability of two formulations should be similar from the one group to the other. Hence, analysis takes place on this scale, which is judged portable from volunteers to patients. In other words, one size does not fit all but evidence from a sample is used to make judgements about what would be seen in a population.

Concrete consideration

Consider a concrete example of a trial comparing two formulations of phenytoin reported by Shumaker and Metzler[4]. This was a double cross-over in 26 healthy volunteers. In the first pair of periods each subject was given one of the two formulations, the order being randomised. This was then repeated in a second pair of periods. Figure 1 shows the relative bioavailability, that is to say the ratio of the area under the concentration-time curve for the generic (test) formulation compared to the brand-name (reference), using data from the first pair of periods only. For a philosopher’s recognition of what is necessary to translate results from trials to practice, see Nancy Cartwright’s aptly named article, A philosopher’s view of the long road from RCTs to effectiveness [5], and for a statistician’s see Added Values[6].

Figure 1 Relative bioavailability in 26 volunteers for two formulations of phenytoin.

This plot may seem less than reassuring. It is true that the values seem to cluster around 1 (dashed line), which would imply equality of the formulations, the object of the study, but one value is perilously close to the limit of 0.8 and another is actually above the limit of 1.25, these two boundaries usually being taken to be acceptable limits of similarity.

However, it would be hasty to assume that the differences in relative bioavailability reflect any personal feature of the volunteers. Because the experiment is rather more complex than usual and each volunteer was tested in two cross-overs, we can plot a second determination of the relative bioavailability against the first. This has been done in Figure 2.

Figure 2 Relative bioavailability in the second cross-over plotted against the first.

There are 26 points, one for each volunteer with the X axis value being relative bioavailability in the first cross-over and the Y axis the corresponding figure for the second. The XY plane can also be divided into two regions: one in which the difference between the second determination and the first is less than the difference between the second and the mean of the first (labelled personal better) and one in which the reverse is the case (labelled mean better). The 8 points that are labelled with blue circles are in the former region and the 18 with black asterisks in the second. The net result is that for the majority of subjects one would predict the relative bioavailability on the second occasion better using the average value of all subjects rather than using the value for that subject.  Note that since much of the prediction error will be due to the inescapable unpredictability of relative bioavailability from occasion to occasion, the superiority of using the average here is plausibly underestimated. In the long run it would do far better.
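A small simulation sketch of my own (with made-up log-scale variance components, not the Shumaker-Metzler data) shows why this is to be expected whenever occasion-to-occasion variation dominates any stable between-subject differences:

```python
import numpy as np

rng = np.random.default_rng(1)
n_subjects, sd_between, sd_within = 26, 0.02, 0.10    # assumed log-scale standard deviations

stable = rng.normal(0.0, sd_between, n_subjects)              # stable subject-by-formulation effect
occasion1 = stable + rng.normal(0.0, sd_within, n_subjects)   # log relative bioavailability, 1st cross-over
occasion2 = stable + rng.normal(0.0, sd_within, n_subjects)   # log relative bioavailability, 2nd cross-over

err_personal = np.abs(occasion2 - occasion1)          # predict the 2nd occasion by the subject's own 1st value
err_mean = np.abs(occasion2 - occasion1.mean())       # predict the 2nd occasion by the group mean of the 1st
print((err_mean < err_personal).sum(), "of", n_subjects, "subjects are better predicted by the mean")
# When sd_within dominates sd_between, the group mean usually wins, as in Figure 2.
```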

In fact, judgement of similarity of the two formulations would be based on the average bioavailability, not the individual values, and a suitable analysis of the data from the first pair of periods, fitting subject and period effects in addition to treatment to the log-transformed values, would produce a 90% confidence interval of 0.97 to 1.02.
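The equivalence calculation itself runs roughly as follows; a sketch on hypothetical within-subject log-ratios (not a re-analysis of the trial, which would also fit the subject and period effects mentioned above):

```python
import numpy as np
from scipy import stats

# Hypothetical per-subject ratios AUC_test / AUC_reference (illustrative values only)
ratio = np.array([0.98, 1.03, 0.95, 1.07, 1.00, 0.99, 1.02, 0.96,
                  1.04, 1.01, 0.97, 1.05, 1.00, 0.98, 1.03, 0.99])
log_ratio = np.log(ratio)
n = len(log_ratio)
mean, se = log_ratio.mean(), log_ratio.std(ddof=1) / np.sqrt(n)
t90 = stats.t.ppf(0.95, df=n - 1)                     # two-sided 90% interval
lo, hi = np.exp(mean - t90 * se), np.exp(mean + t90 * se)
print(f"90% CI for the geometric mean ratio: {lo:.3f} to {hi:.3f}")
print("bioequivalent by the usual 0.80-1.25 standard" if lo > 0.80 and hi < 1.25 else "not shown equivalent")
```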

Of course one could argue that this is an extreme example. A plausible explanation is that the long-run relative bioavailability is the same for every individual and it could be argued that there are many clinical trials in patients measuring more complex outcomes where effects would not be constant. Nevertheless, doing better than using the average is harder than it seems and trying to do better will require more evidence not less.

The moral

The moral is that if you are not careful you can easily do worse by attempting to go beyond the average. This is well known in quality control circles where it is understood that if managers attempt to improve the operation of machines, processes and workers without knowing whether or not observed variation has a specific identifiable and actionable source they can make quality worse. In choosing ‘one size does not fit all’ RLA has plumped for a misleading analogy. When fitting someone out with clothes, their measurements can be taken with very little error, it can be assumed that they will not change much in the near future and that what fits now will do so for some time.

Patients and diseases are not like that. The mistake is to assume that the following statement, ‘the effect on you of a given treatment will almost certainly differ from its effect averaged over others,’ justifies the following policy, ‘I am going to ignore the considerable evidence from others and just back my best hunch about you’. The irony is that doing best for the individual may involve making substantial use of the average.

What statisticians know is that where there is much evidence on many patients and a very little on the patient currently presenting, to do best will involve a mixed strategy involving finding some optimal compromise between ignoring all experience and ignoring that of the current patient. To produce such optimal strategies requires careful planning, good analysis and many data[7, 8]. The latter are part of what we call evidence and to claim, therefore, that personalisation involves abandoning evidence based medicine is quite wrong. Less ontology and more understanding of the statistics of prediction is needed.

References

[1] Mumford, S. & Anjum, R.L. 2013 Causation: a very short introduction, OUP Oxford.

[2] Anjum, R.L. 2016 Evidence based or person centered? An ontological debate.

[3] Senn, S.J. 2001 Statistical issues in bioequivalence. Statistics in Medicine 20, 2785-2799.

Shumaker, R.C. & Metzler, C.M. 1998 The phenytoin trial is a case study of “individual bioequivalence”. Drug Information Journal 32, 1063–1072.

[5] Cartwright, N. 2011 A philosopher’s view of the long road from RCTs to effectiveness. The Lancet 377, 1400-1401.

[6] Senn, S.J. 2004 Added Values: Controversies concerning randomization and additivity in clinical trials. Statistics in Medicine 23, 3729-3753.

[7] Araujo, A., Julious, S. & Senn, S. 2016 Understanding Variation in Sets of N-of-1 Trials. PloS one 11, e0167167. (doi:10.1371/journal.pone.0167167).

Senn, S. 2017 Sample size considerations for n-of-1 trials. Statistical Methods in Medical Research.

Categories: personalized medicine, RCTs, S. Senn | 6 Comments

3 YEARS AGO (JANUARY 2015): MEMORY LANE


MONTHLY MEMORY LANE: 3 years ago: January 2015. I mark in red 3-4 posts from each month that seem most apt for general background on key issues in this blog, excluding those reblogged recently[1], and in green 2-3 others of general relevance to philosophy of statistics (in months where I’ve blogged a lot)[2].  Posts that are part of a “unit” or a group count as one.

 

January 2015

  • 01/02 Blog Contents: Oct.- Dec. 2014
  • 01/03 No headache power (for Deirdre)
  • 01/04 Significance Levels are Made a Whipping Boy on Climate Change Evidence: Is .05 Too Strict? (Schachtman on Oreskes)
  • 01/07 “When Bayesian Inference Shatters” Owhadi, Scovel, and Sullivan (reblog)
  • 01/08 On the Brittleness of Bayesian Inference–An Update: Owhadi and Scovel (guest post).
  • 01/12 “Only those samples which fit the model best in cross validation were included” (whistleblower) “I suspect that we likely disagree with what constitutes validation” (Potti and Nevins)
  • 01/16 Winners of the December 2014 Palindrome Contest: TWO!
  • 01/18 Power Analysis and Non-Replicability: If bad statistics is prevalent in your field, does it follow you can’t be guilty of scientific fraud?
  • 01/21 Some statistical dirty laundry.
  • 01/24 What do these share in common: m&ms, limbo stick, ovulation, Dale Carnegie? Sat night potpourri
  • 01/26 Trial on Anil Potti’s (clinical) Trial Scandal Postponed Because Lawyers Get the Sniffles (updated)
  • 01/27 3 YEARS AGO: (JANUARY 2012) MEMORY LANE
  • 01/31 Saturday Night Brainstorming and Task Forces: (4th draft)

[1] Monthly memory lanes began at the blog’s 3-year anniversary in Sept, 2014.

[2] New Rule, July 30, 2016, March 30, 2017 – a very convenient way to allow data-dependent choices (note why it’s legit in selecting blog posts, on severity grounds).


Categories: 3-year memory lane | Leave a comment

S. Senn: Being a statistician means never having to say you are certain (Guest Post)


Stephen Senn
Head of  Competence Center
for Methodology and Statistics (CCMS)
Luxembourg Institute of Health
Twitter @stephensenn

Being a statistician means never having to say you are certain

A recent discussion of randomised controlled trials[1] by Angus Deaton and Nancy Cartwright (D&C) contains much interesting analysis but also, in my opinion, does not escape rehashing some of the invalid criticisms of randomisation with which the literature seems to be littered. The paper has two major sections. The latter, which deals with generalisation of results, or what is sometimes called external validity, I like much more than the former, which deals with internal validity. It is the former I propose to discuss.

The trouble starts, in my opinion, with the discussion of balance. Perfect balance is not, contrary to what is often claimed, a necessary requirement for causal inference, nor is it something that randomisation attempts to provide. Conventional analyses of randomised experiments make an allowance for imbalance, and that allowance is inappropriate if all covariates are balanced. If you analyse a matched-pairs design as if it were completely randomised, you fail that question in Stat 1. (At least if I am marking the exam.) The larger standard error for the completely randomised design is an allowance for the probable imbalance that such a design will have compared to a matched-pairs design.

This brings me on to another criticism. D&C discuss matching as if it were somehow an alternative to randomisation. But Fisher’s motto for designs can be expressed as, “block what you can and randomise what you can’t”. We regularly run cross-over trials, for example, in which there is blocking by patient, since every patient receives each treatment, and also blocking by period, since each treatment appears equally often in each period but we still randomise patients to sequences.

Part of their discussion recognizes this but elsewhere they simply confuse the issue, for example discussing randomisation as if it were an alternative to control. Control makes randomisation possible. Without control, there is no randomisation. Randomisation makes blinding possible, without randomisation there can be no convincing blinding. Thus in order of importance they are, control, randomisation and blinding but to set randomisation up as some alternative to control is simply misleading and unhelpful.

Elsewhere they claim, “the RCT strategy is only successful if we are happy with estimates that are arbitrarily far from the truth, just so long as errors cancel out over a series of imaginary experiments”, but this is not what RCTs rely on. The mistake is in becoming fixated with the point estimate. This will, indeed, be in error, but any decent experiment and analysis will deliver an estimate of that error, as, indeed, they concede elsewhere. Being a statistician is never having to say you are certain. To prove a statistician is a liar you have to prove that the probability statement is wrong. That is harder than it may seem.

They correctly identify that when it comes to hidden covariates it is the totality of their effect that matters. In this, their discussion is far superior to the indefinitely many confounders argument that has been incorrectly proposed by others as being some fatal flaw. (See my previous blog Indefinite Irrelevance). However, they then undermine this by adding “but consider the human genome base pairs. Out of all those billions, only one might be important, and if that one is unbalanced, the result of a single trial can be ‘randomly confounded’ and far from the truth”. To which I answer “so what?”. To see the fallacy in this argument, which simultaneously postulates a rare event and conditions on its having happened, even though it is unobserved, consider the following. I maintain that if a fair die is rolled six times, the probability of six sixes in a row will be 1/46,656 and so rather rare. “Nonsense” say D&C, “suppose that the first five rolls have each produced a six, it will then happen one in six times and so is really not rare at all”.

I also consider that their simulation is irrelevant. They ask us to believe that if 100 samples of size 50 are taken from a log-Normal distribution and then, for each sample, the values are permuted 1000 times into 25 in the control and 25 in the experimental group, the type I error rate for a nominal 5% using the two-sample t-test will be 13.5%. In view of what is known about the robustness of the t-test under the null hypothesis (there is a long literature going back to Egon Pearson in the 1930s), this is extremely surprising, and as soon as I saw it I disbelieved it. I simulated this myself using 2000 permutations, just for good measure, and found the distribution of type I error rates given in the accompanying figure.

Each dot represents the type I error rate over 2000 permutations for one of the 100 samples. It can be seen that for most of the samples the proportion of significant t-tests is less than the nominal 5% and in fact, the average for the simulation is 4%. It is, of course, somewhat regrettable that some of the values are above 5% and, indeed, five of them have got a value of nearly 6% but if this worries you, the cure is at hand. Use a permutation t-test rather than a parametric one. (For a history of this approach, see the excellent book by Mielke et al [2].)  Don’t confuse the technical details of analysis with the randomisation. Whatever you do for analysis, you will be better off for having randomised whatever you haven’t blocked.
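The simulation Senn describes is straightforward to reproduce in outline; a sketch (my own code, so details such as the seed and the equal-variance t-test are assumptions):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_samples, n_per_sample, n_perms = 100, 50, 2000

rates = []
for _ in range(n_samples):
    sample = rng.lognormal(mean=0.0, sigma=1.0, size=n_per_sample)
    hits = 0
    for _ in range(n_perms):
        shuffled = rng.permutation(sample)                    # random split: 25 'control', 25 'treated'
        hits += stats.ttest_ind(shuffled[:25], shuffled[25:]).pvalue < 0.05
    rates.append(hits / n_perms)

print(f"mean type I error rate over the 100 samples: {np.mean(rates):.3f}")   # about 4%, as reported above
```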

Why does my result differ from theirs? It is hard for me to work out exactly what they have done but I suspect that it is because they have assumed an impossible situation. They are allowing that the average treatment effect for the millions of patients that might have been included is zero but then sampling varying effects (that is to say the difference the treatment makes), rather than merely values (that is to say the reading for given patients), from this distribution. For any given sample the mean of the effects will not be zero and so the null-hypothesis will, as they point out, not be true for the sample, only for the population. But in analysing clinical trials we don’t consider this population. We have precise control of the allocation algorithm (who gets what if they are in the trial) and virtually none over the presenting process (who gets into the trial) and the null hypothesis that we test is that the effect is zero in the sample not in some fictional population. It may be that I have misunderstood what they are doing but I think that this is the origin of the difference.

This is an example of the sort of approach that led to Neyman’s famous dispute with Fisher. One can argue about the appropriateness of the Fisherian null hypothesis, “the treatments are the same”, but Neyman’s “the treatments are not the same but on average they are the same” is simply incredible[3]. As D&C’s simulation shows, as soon as you allow this, you will never find a sample for which it is true. If there is no sample for which it is true, what exactly are the remarkable properties of the population for which it is true? D&C refer to magical thinking about RCTs dismissively but this is straight out of some wizard’s wonderland.

My view is that randomisation should not be used as an excuse for ignoring what is known and observed but that it does deal validly with hidden confounders[4]. It does not do this by delivering answers that are guaranteed to be correct; nothing can deliver that. It delivers answers about which valid probability statements can be made and, in an imperfect world, this has to be good enough. Another way I sometimes put it is like this: show me how you will analyse something and I will tell you what allocations are exchangeable. If you refuse to choose one at random I will say, “why? Do you have some magical thinking you’d like to share?”

Acknowledgement

My research on inference for small populations is carried out in the framework of the IDeAl project http://www.ideal.rwth-aachen.de/ and supported by the European Union’s Seventh Framework Programme for research, technological development and demonstration under Grant Agreement no 602552.

References

  1. Deaton A, Cartwright N. Understanding and misunderstanding randomized controlled trials. Social Science and Medicine 2017.
  2. Berry KJ, Johnston JE, Mielke PWJ. A Chronicle of Permutation Statistical Methods. Springer International Publishing Limited Switzerland: Cham, 2014.
  3. Senn SJ. Added Values: Controversies concerning randomization and additivity in clinical trials. Statistics in Medicine 2004; 23: 3729-3753.
  4. Senn SJ. Seven myths of randomisation in clinical trials. Statistics in Medicine 2013; 32: 1439-1450.
Categories: Error Statistics, RCTs, Statistics | 25 Comments

3 YEARS AGO (DECEMBER 2014): MEMORY LANE


MONTHLY MEMORY LANE: 3 years ago: December 2014. I mark in red 3-4 posts from each month that seem most apt for general background on key issues in this blog, excluding those reblogged recently[1], and in green 3- 4 others of general relevance to philosophy of statistics (in months where I’ve blogged a lot)[2].  Posts that are part of a “unit” or a group count as one.

December 2014

  • 12/02 My Rutgers Seminar: tomorrow, December 3, on philosophy of statistics
  • 12/04 “Probing with Severity: Beyond Bayesian Probabilism and Frequentist Performance” (Dec 3 Seminar slides)
  • 12/06 How power morcellators inadvertently spread uterine cancer
  • 12/11 Msc. Kvetch: What does it mean for a battle to be “lost by the media”?
  • 12/13 S. Stanley Young: Are there mortality co-benefits to the Clean Power Plan? It depends. (Guest Post)
  • 12/17 Announcing Kent Staley’s new book, An Introduction to the Philosophy of Science (CUP)

  • 12/21 Derailment: Faking Science: A true story of academic fraud, by Diederik Stapel (translated into English)
  • 12/23 All I want for Chrismukkah is that critics & “reformers” quit howlers of testing (after 3 yrs of blogging)! So here’s Aris Spanos “Talking Back!”
  • 12/26 3 YEARS AGO: MONTHLY (Dec.) MEMORY LANE
  • 12/29 To raise the power of a test is to lower (not raise) the “hurdle” for rejecting the null (Ziliac and McCloskey 3 years on)
  • 12/31 Midnight With Birnbaum (Happy New Year)

 

[1] Monthly memory lanes began at the blog’s 3-year anniversary in Sept, 2014.

[2] New Rule, July 30, 2016, March 30, 2017 – a very convenient way to allow data-dependent choices (note why it’s legit in selecting blog posts, on severity grounds).

 


Categories: 3-year memory lane | Leave a comment

Midnight With Birnbaum (Happy New Year 2017)

 Just as in the past 6 years since I’ve been blogging, I revisit that spot in the road at 11p.m., just outside the Elbar Room, look to get into a strange-looking taxi, to head to “Midnight With Birnbaum”. (The pic on the left is the only blurry image I have of the club I’m taken to.) I wondered if the car would come for me this year, as I waited out in the cold, given that my Birnbaum article has been out since 2014. The (Strong) Likelihood Principle–whether or not it is named–remains at the heart of many of the criticisms of Neyman-Pearson (N-P) statistics (and cognate methods). 2018 will be the 60th birthday of Cox’s “weighing machine” example, which was the start of Birnbaum’s attempted proof. Yet as Birnbaum insisted, the “confidence concept” is the “one rock in a shifting scene” of statistical foundations, insofar as there’s interest in controlling the frequency of erroneous interpretations of data. (See my rejoinder.) Birnbaum bemoaned the lack of an explicit evidential interpretation of N-P methods. Maybe in 2018? Anyway, the cab is finally here…the rest is live. Happy New Year! Continue reading

Categories: Birnbaum Brakes, strong likelihood principle | 3 Comments

60 yrs of Cox’s (1958) weighing machine, & links to binge-read the Likelihood Principle


2018 will mark 60 years since the famous chestnut from Sir David Cox (1958). The example “is now usually called the ‘weighing machine example,’ which draws attention to the need for conditioning, at least in certain types of problems” (Reid 1992, p. 582). When I describe it, you’ll find it hard to believe many regard it as causing an earthquake in statistical foundations, unless you’re already steeped in these matters. A simple version: If half the time I report my weight from a scale that’s always right, and half the time from a scale that gets it right with probability .5, would you say I’m right with probability ¾? Well, maybe. But suppose you knew that this measurement was made with the scale that’s right with probability .5? The overall error probability is scarcely relevant for giving the warrant of the particular measurement, knowing which scale was used. So what’s the earthquake? First a bit more on the chestnut. Here’s an excerpt from Cox and Mayo (2010, 295-8): Continue reading
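The arithmetic behind the ¾, and the conditional point, can be written in a line (a sketch of the toy version above, not Cox’s original formulation):

```latex
% Scale A is always right; scale B is right with probability 1/2; each is used half the time.
\[
  P(\text{right}) = \tfrac{1}{2}(1) + \tfrac{1}{2}\left(\tfrac{1}{2}\right) = \tfrac{3}{4},
  \qquad\text{but}\qquad
  P(\text{right} \mid \text{scale B was used}) = \tfrac{1}{2}.
\]
```

The unconditional ¾ averages over measurements that were never made; once it is known which scale produced this measurement, the conditional value is the relevant one.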

Categories: Sir David Cox, Statistics, strong likelihood principle | 4 Comments

Why significance testers should reject the argument to “redefine statistical significance”, even if they want to lower the p-value*


An argument that assumes the very thing that was to have been argued for is guilty of begging the question; signing on to an argument whose conclusion you favor even though you cannot defend its premises is to argue unsoundly, and in bad faith. When a whirlpool of “reforms” subliminally alters the nature and goals of a method, falling into these sins can be quite inadvertent. Start with a simple point on defining the power of a statistical test.

I. Redefine Power?

Given that power is one of the most confused concepts from Neyman-Pearson (N-P) frequentist testing, it’s troubling that in “Redefine Statistical Significance”, power gets redefined too. “Power,” we’re told, is a Bayes Factor BF “obtained by defining H1 as putting ½ probability on μ = ± m for the value of m that gives 75% power for the test of size α = 0.05. This H1 represents an effect size typical of that which is implicitly assumed by researchers during experimental design.” (material under Figure 1). Continue reading
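To see what that construction involves numerically, here is a minimal sketch of my own (assuming a two-sided z-test with known σ; the paper’s setting may differ in details) that finds the m giving 75% power at α = 0.05:

```python
from scipy.stats import norm

def m_for_power(power=0.75, alpha=0.05, sigma=1.0, n=1):
    """Alternative m at which a two-sided z-test of H0: mu = 0 has the requested power
    (ignoring the negligible far-tail term)."""
    se = sigma / n ** 0.5
    return se * (norm.ppf(1 - alpha / 2) + norm.ppf(power))   # 1.96 + 0.674 standard errors

print(round(m_for_power(), 3))   # ~2.63 standard errors from the null
# "Redefine Statistical Significance" then puts prior probability 1/2 on each of +m and -m
# under H1 and compares H0 to that H1 via a Bayes factor, which it labels the test's "power".
```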

Categories: Bayesian/frequentist, fallacy of rejection, P-values, reforming the reformers, spurious p values | 15 Comments

How to avoid making mountains out of molehills (using power and severity)


In preparation for a new post that takes up some of the recent battles on reforming or replacing p-values, I reblog an older post on power, one of the most misunderstood and abused notions in statistics. (I add a few “notes on howlers”.)  The power of a test T in relation to a discrepancy from a test hypothesis H0 is the probability T will lead to rejecting H0 when that discrepancy is present. Power is sometimes misappropriated to mean something only distantly related to the probability a test leads to rejection; but I’m getting ahead of myself. This post is on a classic fallacy of rejection. Continue reading
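As a concrete instance of that definition, here is a short sketch (my own illustration, with a one-sided z-test and made-up values of σ and n), computing the power of the test against a few discrepancies from H0:

```python
from scipy.stats import norm

def power(discrepancy, sigma=1.0, n=25, alpha=0.025):
    """Probability the one-sided z-test of H0: mu <= 0 rejects when the discrepancy is present."""
    se = sigma / n ** 0.5
    cutoff = norm.ppf(1 - alpha) * se           # reject H0 when the sample mean exceeds this
    return 1 - norm.cdf((cutoff - discrepancy) / se)

for gamma in (0.0, 0.2, 0.4, 0.6):
    print(f"mu = {gamma:.1f}: power = {power(gamma):.2f}")
# At mu = 0 the rejection probability is just alpha; it climbs toward 1 as the discrepancy grows.
```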

Categories: CIs and tests, Error Statistics, power | 4 Comments

The Conversion of Subjective Bayesian, Colin Howson, & the problem of old evidence (i)


“The subjective Bayesian theory as developed, for example, by Savage … cannot solve the deceptively simple but actually intractable old evidence problem, whence as a foundation for a logic of confirmation at any rate, it must be accounted a failure.” (Howson, (2017), p. 674)

What? Did the “old evidence” problem cause Colin Howson to recently abdicate his decades-long position as a leading subjective Bayesian? It seems to have. I was so surprised to come across this in a recent perusal of Philosophy of Science that I wrote to him to check if it is really true. (It is.) I thought perhaps it was a different Colin Howson, or the son of the one who co-wrote 3 editions of Howson and Urbach: Scientific Reasoning: The Bayesian Approach, espousing hard-line subjectivism since 1989.[1] I am not sure which of the several paradigms of non-subjective or default Bayesianism Howson endorses (he’d argued for years, convincingly, against any one of them), nor how he handles various criticisms (Kass and Wasserman 1996); I put that aside. Nor have I worked through his, rather complex, paper to the extent necessary, yet. What about the “old evidence” problem, made famous by Clark Glymour 1980? What is it?

Categories: Bayesian priors, objective Bayesians, Statistics | Tags: | 25 Comments

Erich Lehmann’s 100 Birthday: Neyman Pearson vs Fisher on P-values

Erich Lehmann 20 November 1917 – 12 September 2009

Erich Lehmann was born 100 years ago today! (20 November 1917 – 12 September 2009). Lehmann was Neyman’s first student at Berkeley (Ph.D 1942), and his framing of Neyman-Pearson (NP) methods has had an enormous influence on the way we typically view them.*


I got to know Erich in 1997, shortly after publication of EGEK (1996). One day, I received a bulging, six-page, handwritten letter from him in tiny, extremely neat scrawl (and many more after that).  He began by telling me that he was sitting in a very large room at an ASA (American Statistical Association) meeting where they were shutting down the conference book display (or maybe they were setting it up), and on a very long, wood table sat just one book, all alone, shiny red.

He said, “I wonder if it might be of interest to me!” So he walked up to it…. It turned out to be my Error and the Growth of Experimental Knowledge (1996, Chicago), which he reviewed soon after[0]. (What are the chances?) Some related posts on Lehmann’s letter are here and here.

Continue reading

Categories: Fisher, P-values, phil/history of stat | 3 Comments

3 YEARS AGO (NOVEMBER 2014): MEMORY LANE


MONTHLY MEMORY LANE: 3 years ago: November 2014. I mark in red 3-4 posts from each month that seem most apt for general background on key issues in this blog, excluding those reblogged recently[1], and in green 3- 4 others of general relevance to philosophy of statistics (in months where I’ve blogged a lot)[2].  Posts that are part of a “unit” or a group count as one (11/1/14 & 11/09/14 and 11/15/14 & 11/25/14 are grouped). The comments are worth checking out.

 

November 2014

  • 11/01 Philosophy of Science Assoc. (PSA) symposium on Philosophy of Statistics in the Higgs Experiments “How Many Sigmas to Discovery?”
  • 11/09 “Statistical Flukes, the Higgs Discovery, and 5 Sigma” at the PSA
  • 11/11 The Amazing Randi’s Million Dollar Challenge
  • 11/12 A biased report of the probability of a statistical fluke: Is it cheating?
  • 11/15 Why the Law of Likelihood is bankrupt–as an account of evidence

     

  • 11/18 Lucien Le Cam: “The Bayesians Hold the Magic”
  • 11/20 Erich Lehmann: Statistician and Poet
  • 11/22 Msc Kvetch: “You are a Medical Statistic”, or “How Medical Care Is Being Corrupted”
  • 11/25 How likelihoodists exaggerate evidence from statistical tests
  • 11/30 3 YEARS AGO: MONTHLY (Nov.) MEMORY LANE

[1] Monthly memory lanes began at the blog’s 3-year anniversary in Sept, 2014.

[2] New Rule, July 30, 2016, March 30, 2017 – a very convenient way to allow data-dependent choices (note why it’s legit in selecting blog posts, on severity grounds).

 


Categories: 3-year memory lane | 1 Comment

Yoav Benjamini, “In the world beyond p < .05: When & How to use P < .0499…”


These were Yoav Benjamini’s slides, “In the world beyond p<.05: When & How to use P<.0499…”, from our session at the ASA 2017 Symposium on Statistical Inference (SSI): A World Beyond p < 0.05. (Mine are in an earlier post.) He begins by asking:

However, it’s mandatory to adjust for selection effects, and Benjamini is one of the leaders in developing ways to carry out the adjustments. Even calling out the avenues for cherry-picking and multiple testing, long known to invalidate p-values, would make replication research more effective (and less open to criticism). Continue reading

Categories: Error Statistics, P-values, replication research, selection effects | 22 Comments

Going round and round again: a roundtable on reproducibility & lowering p-values


There will be a roundtable on reproducibility Friday, October 27th (noon Eastern time), hosted by the International Methods Colloquium, on the reproducibility crisis in social sciences motivated by the paper, “Redefine statistical significance.” Recall, that was the paper written by a megateam of researchers as part of the movement to require p ≤ .005, based on appraising significance tests by a Bayes Factor analysis, with prior probabilities on a point null and a given alternative. It seems to me that if you’re prepared to scrutinize your frequentist (error statistical) method on grounds of Bayes Factors, then you must endorse using Bayes Factors (BFs) for inference to begin with. If you don’t endorse BFs–and, in particular, the BF required to get the disagreement with p-values–*, then it doesn’t make sense to appraise your non-Bayesian method on grounds of agreeing or disagreeing with BFs. For suppose you assess the recommended BFs from the perspective of an error statistical account–that is, one that checks how frequently the method would uncover or avoid the relevant mistaken inference.[i] Then, if you reach the stipulated BF level against a null hypothesis, you will find the situation is reversed, and the recommended BF exaggerates the evidence!  (In particular, with high probability, it gives an alternative H’ fairly high posterior probability, or comparatively higher probability, even though H’ is false.) Failing to reach the BF cut-off, by contrast, can find no evidence against, and even finds evidence for, a null hypothesis with high probability, even when non-trivial discrepancies exist. They’re measuring very different things, and it’s illicit to expect an agreement on numbers.[ii] We’ve discussed this quite a lot on this blog (2 are linked below [iii]).
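To see how the two scales can pull apart numerically, here is a small sketch of my own (a normal model with known σ and a N(0, τ²) prior on μ under H1, one common default, not necessarily the Bayes factor recommended in that paper):

```python
from math import exp, sqrt

def bf01(z, n, tau_over_sigma=1.0):
    """Bayes factor in favour of H0: mu = 0 over H1: mu ~ N(0, tau^2),
    given a z-statistic from n observations with known sigma."""
    r = n * tau_over_sigma ** 2       # prior variance of mu relative to the sampling variance of the mean
    return sqrt(1 + r) * exp(-0.5 * z ** 2 * r / (1 + r))

for n in (10, 100, 1000):
    print(n, round(bf01(1.96, n), 2))
# At p = 0.05 (z = 1.96) this Bayes factor moves from mildly against H0 (n = 10)
# to favouring H0 (n = 100, 1000): the two measures answer different questions,
# so agreement on numbers should not be expected.
```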

If the given list of panelists is correct, it looks to be 4 against 1, but I’ve no doubt that Lakens can handle it.

Continue reading

Categories: Announcement, P-values, reforming the reformers, selection effects | 5 Comments

Deconstructing “A World Beyond P-values”

A world beyond p-values?

I was asked to write something explaining the background of my slides (posted here) in relation to the recent ASA “A World Beyond P-values” conference. I took advantage of some long flight delays on my return to jot down some thoughts:

The contrast between the closing session of the conference “A World Beyond P-values,” and the gist of the conference itself, shines a light on a pervasive tension within the “Beyond P-Values” movement. Two very different debates are taking place. First there’s the debate about how to promote better science. This includes welcome reminders of the timeless demands of rigor and integrity required to avoid deceiving ourselves and others–especially crucial in today’s world of high-powered searches and Big Data. That’s what the closing session was about. [1] Continue reading

Categories: P-values, Philosophy of Statistics, reforming the reformers | 8 Comments

Statistical skepticism: How to use significance tests effectively: 7 challenges & how to respond to them

Here are my slides from the ASA Symposium on Statistical Inference: “A World Beyond p < .05”, in the session, “What are the best uses for P-values?”. (Aside from me, our session included Yoav Benjamini and David Robinson, with chair Nalini Ravishanker.)

7 QUESTIONS

  • Why use a tool that infers from a single (arbitrary) P-value that pertains to a statistical hypothesis H0 to a research claim H*?
  • Why use an incompatible hybrid (of Fisher and N-P)?
  • Why apply a method that uses error probabilities, the sampling distribution, researcher “intentions” and violates the likelihood principle (LP)? You should condition on the data.
  • Why use methods that overstate evidence against a null hypothesis?
  • Why do you use a method that presupposes the underlying statistical model?
  • Why use a measure that doesn’t report effect sizes?
  • Why do you use a method that doesn’t provide posterior probabilities (in hypotheses)?

 

Categories: P-values, spurious p values, statistical tests, Statistics | Leave a comment

New venues for the statistics wars

I was part of something called “a brains blog roundtable” on the business of p-values earlier this week–I’m glad to see philosophers getting involved.

Next week I’ll be in a session that I think is intended to explain what’s right about P-values at an ASA Symposium on Statistical Inference: “A World Beyond p < .05”. Continue reading

Categories: Announcement, Bayesian/frequentist, P-values | 3 Comments
