Stephen Senn

S. Senn: “A Vaccine Trial from A to Z” with a Postscript (guest post)


Stephen Senn
Consultant Statistician
Edinburgh, Scotland

Alpha and Omega (or maybe just Beta)

Well actually, not from A to Z but from AZ. That is to say, the trial I shall consider is the placebo-controlled trial of the Oxford University vaccine for COVID-19 currently being run by AstraZeneca (AZ) under protocol AZD1222 – D8110C00001 and which I considered in a previous blog, Heard Immunity. A summary of the design features is given in Table 1. The purpose of this blog is to look a little deeper at features of the trial and the way I am going to do so is with the help of geometric representations of the sample space, that is to say the possible results the trial could produce. However, the reader is warned that I am only an amateur in all this. The true professionals are the statisticians at AZ who, together with their life science colleagues in AZ and Oxford, designed the trial.

Whereas in an October 20 post (on PHASTAR) I considered the sequential nature of the trial, here I am going to ignore that feature and look at the trial as if it had a single look. Note that the trial employs a two to one randomisation, twice as many subjects being given vaccine as placebo.

However, first I shall draw attention to one interesting feature. Like the two other trials that I also previously considered (one by BioNTech and Pfizer and the other by Moderna) the null hypothesis that is being tested is not that the vaccine has no efficacy but that its efficacy does not exceed 30%. Vaccine Efficacy (VE) is defined as

VE = 1 − Rvaccine/Rplacebo

where Rplacebo and Rvaccine are the ‘true’ rates of infection under placebo and vaccine respectively.

Obviously, if the vaccine were completely ineffective, the value of VE would be 0. Presumably the judgement is that a vaccine will be of no practical use unless it has an efficacy of at least 30%. Perhaps a lower value than this could not really help to control the epidemic. The trial is designed to show that the efficacy exceeds this threshold. In what follows, you can take it as read that the probability of the trial failing when the efficacy is some value less than 30% (27%, say) is even greater than when it is exactly 30%. Therefore, it becomes of interest to consider the way the trial will behave if the value is exactly 30%.

Figuring it out

Figure 1 gives a representation of what might happen in terms of cases of infected subjects in both arms of the trial based on its design. It’s a complicated diagram and I shall take some time to explain it. For the moment I invite the reader to ignore the concentric circles and the shading. I shall get to those in due course.

Figure 1 Possible and expected outcomes for the trial plotted in the two dimensional space of vaccine and placebo cases of infection. The contour plot applies when the null hypothesis is true.

The X axis gives the number of cases in the vaccine group and the Y axis the number of cases under placebo. It is important to bear in mind that twice as many subjects are being treated with vaccine as with placebo. The line of equality of infection rates is given by the dashed white diagonal line towards the bottom right hand side of the plot, labelled ‘0% efficacy’. This joins (for example) the points (80, 40) and (140, 70), corresponding to twice as many cases under vaccine as under placebo and reflecting the 2:1 allocation ratio. Other diagonal lines correspond to 30%, 50% and 60% VE respectively.

The trial is designed to stop once 150 cases of infection have occurred. This boundary is represented by the solid red diagonal line descending from the upper left (30 cases in the vaccine group and 120 cases in the placebo group) towards the bottom right (120 cases in the vaccine group and 30 cases in the placebo group). Thus, we know in advance that the combination of results we shall see must lie on this line.

Note that the diagram is slightly misleading since, where the space concerned refers to numbers of cases, it is neither continuous in X nor continuous in Y. The only possible values are those given by the whole numbers, that is to say the positive integers together with zero. However, the same is not true for expected numbers, and this is a common difference between parameters and random variables in statistics. For example, if we have a Poisson random variable with a given mean, the only possible values of the random variable are the whole numbers 0, 1, 2, … but the mean can be any positive real number.

Ripples in the pond

Figure 2 is the same diagram as Figure 1 as regards every feature except that which I invited the reader to ignore. The concentric circles are contour plots that represent features of the trial that are suitable for planning. In order to decide how many subjects to recruit, the scientists at AZ and Oxford had to decide what infection rate was likely. They chose an infection rate of 0.8% per 6 months under placebo. This in turn implies that of 10,000 subjects treated with placebo, we might expect 80 to get COVID. On the other hand, a vaccine efficacy of 30% would imply an infection rate of 0.56%, since (1 − 0.3) × 0.8% = 0.56%.

For 20,000 subjects treated with vaccine we would expect (0.56/100) × 20,000 = 112 of them to be infected with COVID, and if the vaccine efficacy were 60%, the value assumed for the power calculation, then the expected infection rate would be 0.32% and we would expect 64 of the subjects to be infected.
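For readers who like to check the planning arithmetic, here is a minimal sketch in Python. The rates and numbers of subjects are those given above; the function and variable names are mine, purely for illustration.

```python
# Sketch of the planning arithmetic, not the protocol's own calculations.
def expected_cases(n_subjects, placebo_rate, vaccine_efficacy=0.0):
    """Expected infections over 6 months, given the placebo rate and VE = 1 - Rvaccine/Rplacebo."""
    return n_subjects * placebo_rate * (1 - vaccine_efficacy)

placebo_rate = 0.008  # 0.8% per 6 months, the planning assumption

print(expected_cases(10_000, placebo_rate))        # placebo arm: 80.0
print(expected_cases(20_000, placebo_rate, 0.30))  # vaccine arm if VE = 30% (H0 boundary): 112.0
print(expected_cases(20_000, placebo_rate, 0.60))  # vaccine arm if VE = 60% (H1, used for power): 64.0
```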

Since the infection rates are small, a Poisson distribution is a possible simple model for the probability of seeing particular combinations of infections. This is what the contour plots illustrate. In both figures, the expected number of cases under placebo is assumed to be 80 and this is illustrated by a dashed horizontal white line. However, the lower infection rate under H1 has the effect of shifting the contour plot to the left. Thus, in Figure 1 the dashed vertical line indicating the expected number in the vaccine arm is at 112 and in Figure 2 it is at 64. Nothing else changes between the figures.

Figure 2 Possible and expected outcomes for the trial plotted in the two dimensional space of vaccine and placebo cases of infection. The contour plot applies when the value under the alternative hypothesis assumed for power calculations is true.

Test bed

How should we carry out a significance test? One way of doing so is to condition on the total number of infected cases. The issue of whether to condition or not is a notorious controversy in statistics. Here the total of 150 is fixed, but I think that there is a good argument for conditioning whether or not it is fixed. Such conditioning in this case leads to a binomial distribution describing the number of cases of infection, out of the 150 observed, that are in the vaccine group. Ignoring any covariates, therefore, a simple analysis is to compare the proportion of cases we see to the proportion we would expect to see under the null hypothesis. This proportion is given by 112/(112 + 80) = 0.583. (Note a subtle but important point here. The total number of cases expected is 192 but we know the trial will stop at 150. That is irrelevant. It is the expected proportion that matters here.)

By trial and error, or by some other means, we can now discover that the probability of 75 or fewer cases given vaccine out of 150 in total, when the probability is 0.583, is 0.024. The AZ protocol requires a two-sided P-value less than or equal to 4.9%, which is to say 0.0245 one-sided assuming the usual doubling rule, so this is just low enough. On the other hand, the probability of 76 or fewer cases under vaccine is 0.035 and thus too high. This establishes the point X = 75, Y = 75 as a critical value of the test. This is shown by the small red circle labelled ‘critical value’ on both figures. It just so happens that this lies along the 50% efficacy line. Thus an observed 50% efficacy will be (just) enough to reject the hypothesis that the true efficacy is 30% or lower.
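Those who would rather let the computer do the trial and error can reproduce the tail probabilities with a few lines of Python (a sketch using scipy; the variable names are mine):

```python
from scipy.stats import binom

p0 = 112 / (112 + 80)  # expected proportion of the cases falling in the vaccine arm under H0
n_cases = 150          # the trial stops once 150 cases have accrued

print(binom.cdf(75, n_cases, p0))  # P(75 or fewer vaccine cases) ~ 0.024, just below 0.0245
print(binom.cdf(76, n_cases, p0))  # P(76 or fewer vaccine cases) ~ 0.035, too high

# With the 2:1 allocation, a 75:75 split corresponds to an observed efficacy of
# 1 - (75/20000)/(75/10000) = 50%, which is why the critical value sits on the 50% line.
```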

Reading the tea-leaves 

There are many other interesting features of this trial I could discuss, in particular what alternative analyses might be tried (the protocol refers to a ‘modified Poisson regression approach’ due to Zou, 2004), but I shall just consider one other issue here. That is that, in theory, when the trial stops might give some indication as to vaccine efficacy, a point that might be of interest to avid third-party trial-watchers. If you look at Figure 3, which combines Figure 1 and Figure 2, you will note that the expected number of cases under H0, if the values used for planning are correct, is at least 80 + 112 = 192 (the figure when vaccine efficacy is exactly 30%). For zero efficacy the figure is 80 + 160 = 240. However, the trial will stop once 150 cases of infection have been observed. Thus, under H0, the trial is expected to stop before all 30,000 subjects have had six months of follow-up.

On the other hand, for an efficacy of 60%, as given in Figure 3, the value is 80 + 64 = 144 and so slightly less than the figure required. Thus, under H1, the trial might not be big enough. Taken together, these figures imply that, other things being equal, the earlier the trial stops the more likely the result is to be negative and the longer it continues, the more likely it is to be positive.
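Under the planning assumptions the expected total number of cases over six months is a simple linear function of the true efficacy, so the comparison with the stopping boundary of 150 is easy to tabulate. A sketch follows; only the 150-case rule comes from the protocol, the rest is my arithmetic.

```python
def expected_total_cases(vaccine_efficacy, placebo_rate=0.008,
                         n_placebo=10_000, n_vaccine=20_000):
    """Expected cases in six months of follow-up under the planning assumptions."""
    return n_placebo * placebo_rate + n_vaccine * placebo_rate * (1 - vaccine_efficacy)

for ve in (0.0, 0.3, 0.6):
    total = expected_total_cases(ve)
    print(f"VE = {ve:.0%}: {total:.0f} expected cases, "
          f"{'more' if total > 150 else 'fewer'} than the 150 needed to stop")
# VE = 0%  -> 240 and VE = 30% -> 192: the trial is expected to stop early.
# VE = 60% -> 144: slightly fewer than the 150 required.
```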

Of course, this raises the issue as to whether one can judge what is early and what is late. To make some guesses as to background rates of infection is inevitable when planning a trial. One would be foolish to rely on them when interpreting it.

Figure 3 Combination of Figures 1 and 2 showing contour plots for the joint density for the number of cases when the vaccine efficacy is 30% (H0) and the value under H1 of 60% used for planning.

Reference

Zou G. A modified Poisson regression approach to prospective studies with binary data. Am J Epidemiol. 2004;159(7):702-706.

POSTSCRIPT: Needlepoint

Pressing news

Extract of a press-release from Pfizer, 9 November 2020:

“I am happy to share with you that Pfizer and our collaborator, BioNTech, announced positive efficacy results from our Phase 3, late-stage study of our potential COVID-19 vaccine. The vaccine candidate was found to be more than 90% effective in preventing COVID-19 in participants without evidence of prior SARS-CoV-2 infection in the first interim efficacy analysis.” Albert Bourla (Chairman and CEO, Pfizer.)

Naturally, this had Twitter agog and calculations were soon produced to try to reconstruct the basis on which the claim was being made: how many cases of COVID-19 infection had been seen under vaccine in order for this claim to be possible? In the end these amateur calculations don’t matter. It’s what Pfizer calculates, and what the regulators decide about the calculation, that matters. I note by the by that a fair proportion of Twitter seemed to think that journal publication and peer review are essential. I don’t share this point of view, which I tend to think of as “quaint”. It’s the regulator’s view I am interested in, but we shall have to wait for that.

Nevertheless, calculation can be fun and if I didn’t think so, I would be in the wrong profession. So here goes. However, first I should acknowledge that Jen Rogers’s interesting blog on the subject has been very useful in preparing this note.

The back of the envelope

To do the calculation properly, this is what one would have to know (each item followed by a brief discussion):

Disposition of subjects. Randomisation was one to one but, strictly speaking, we want to know the exact achieved proportions. BusinessWire describes a total of “43,538 participants to date, 38,955 of whom have received a second dose of the vaccine candidate as of November 8, 2020”.

Number of cases of infection. According to BusinessWire, 94 were seen.

Method of analysis. The Pfizer protocol states that a Bayesian analysis will be used. I shall not attempt this but use a very simple frequentist one, conditioning on the total number infected.

Aim of the claim. Is the point estimate the basis of the claim, or is it the lower bound of some confidence interval?

Level of confidence to be used. Pfizer planned to look five times but it seems that the first look was abandoned. The reported look is the second, but at a number of cases (94) slightly greater than the number originally planned for the third (92). I shall assume that the confidence level for look three of an O’Brien-Fleming boundary is appropriate.

Missingness. A simple analysis would assume no missing data, or at least that any missing data are missing completely at random.

Other matters. Two doses are required. Were there any cases arising between the two doses and, if so, what was done with them?

 

If I condition on the total number of infected cases, and assume equal numbers of subjects on each arm, then by varying the number of cases in the vaccine group and subtracting them from the total of 94 to get those on the control group arm, I can calculate the vaccine efficacy. This has been done in the figure below.

The solid blue circles are the estimates of the vaccine efficacy. The ‘whiskers’ below indicate a confidence limit of 99.16%, which (I think) is the level appropriate for the third look in an O’Brien-Fleming scheme with an overall type I error rate of 5%. Horizontal lines have been drawn at 30% efficacy (the value used in the protocol for the null hypothesis) and 90% efficacy (the claimed effect in the press release). Three cases on the vaccine arm would give a lower confidence limit for vaccine efficacy of about 91.3%, whereas four would give a value of 89.2%. Eight cases would give a point estimate of 90.7%. So, depending on what exactly the claim of “more than 90% effective” might mean (and a whole host of other assumptions), we could argue that between three and eight cases of infection were seen.
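A minimal sketch of this back-of-the-envelope scan is given below. It conditions on the 94 cases, assumes exactly 1:1 allocation (so that the estimated efficacy is 1 − x/(94 − x) for x cases on the vaccine arm), and attaches an exact Clopper–Pearson limit at the 99.16% level purely as one possible choice of interval; with a different interval procedure the lower limits would differ somewhat from the whiskers read off the figure.

```python
from scipy.stats import beta

N_CASES = 94    # total cases reported by Pfizer/BioNTech
LEVEL = 0.9916  # two-sided confidence level assumed for the third look

for x in range(1, 11):                      # hypothetical cases on the vaccine arm
    point_ve = 1 - x / (N_CASES - x)        # point estimate of vaccine efficacy
    alpha = 1 - LEVEL
    # Exact (Clopper-Pearson) upper limit for the proportion of cases on vaccine ...
    p_upper = beta.ppf(1 - alpha / 2, x + 1, N_CASES - x)
    lower_ve = 1 - p_upper / (1 - p_upper)  # ... converted to a lower limit for VE
    print(f"{x:2d} vaccine cases: VE = {point_ve:.1%}, lower limit about {lower_ve:.1%}")
# Eight cases give a point estimate of about 90.7%, matching the figure quoted above.
```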

Safety second

Of course safety is often described as being first in terms of priorities but it usually takes longer to see the results that are necessary to judge it than to see those for efficacy. According to BusinessWire “Pfizer and BioNTech are continuing to accumulate safety data and currently estimate that a median of two months of safety data following the second (and final) dose of the vaccine candidate – the amount of safety data specified by the FDA in its guidance for potential Emergency Use Authorization – will be available by the third week of November.”

The world awaits the results with interest.

A Dec. 2, 2020 update by Senn

https://www.linkedin.com/pulse/needlework-guesswork-stephen-senn/

Reference

  1. P. C. O’Brien and T. R. Fleming (1979). A multiple testing procedure for clinical trials. Biometrics 35, 549-556.

 

Categories: covid-19, RCTs, Stephen Senn | 9 Comments

S. Senn: “Error point: The importance of knowing how much you don’t know” (guest post)


Stephen Senn
Consultant Statistician
Edinburgh

‘The term “point estimation” made Fisher nervous, because he associated it with estimation without regard to accuracy, which he regarded as ridiculous.’ Jimmy Savage [1, p. 453] 

First things second

The classic text by David Cox and David Hinkley, Theoretical Statistics (1974), has two extremely interesting features as regards estimation. The first is in the form of an indirect, implicit message and the second is explicit; both teach that point estimation is far from being an obvious goal of statistical inference. The indirect message is that the chapter on point estimation (chapter 8) comes after that on interval estimation (chapter 7). This may puzzle the reader, who may anticipate that the complications of interval estimation would be handled after the apparently simpler point estimation rather than before. However, with the start of chapter 8, the reasoning is made clear. Cox and Hinkley state: Continue reading

Categories: Fisher, randomization, Stephen Senn | Tags: | 8 Comments

S. Senn: Red herrings and the art of cause fishing: Lord’s Paradox revisited (Guest post)

 

Stephen Senn
Consultant Statistician
Edinburgh

Background

Previous posts[a],[b],[c] of mine have considered Lord’s Paradox. To recap, this was considered in the form described by Wainer and Brown[1], in turn based on Lord’s original formulation:

A large university is interested in investigating the effects on the students of the diet provided in the university dining halls… . Various types of data are gathered. In particular, the weight of each student at the time of his arrival in September and his weight the following June are recorded. [2](p. 304)

The issue is whether the appropriate analysis should be based on change-scores (weight in June minus weight in September), as proposed by a first statistician (whom I called John) or analysis of covariance (ANCOVA), using the September weight as a covariate, as proposed by a second statistician (whom I called Jane). There was a difference in mean weight between halls at the time of arrival in September (baseline) and this difference turned out to be identical to the difference in June (outcome). It thus follows that, since the analysis of change score is algebraically equivalent to correcting the difference between halls at outcome by the difference between halls at baseline, the analysis of change scores returns an estimate of zero. The conclusion is thus, there being no difference between diets, diet has no effect. Continue reading

Categories: Stephen Senn | 26 Comments

Stephen Senn: On the level. Why block structure matters and its relevance to Lord’s paradox (Guest Post)


Stephen Senn
Consultant Statistician
Edinburgh

Introduction

In a previous post I considered Lord’s paradox from the perspective of the ‘Rothamsted School’ and its approach to the analysis of experiments. I now illustrate this in some detail giving an example.

What I shall do

I have simulated data from an experiment in which two diets have been compared in 20 student halls of residence, each diet having been applied to 10 halls. I shall assume that the halls have been randomly allocated the diet and that in each hall 10 students have been randomly chosen to have their weights recorded at the beginning of the academic year and again at the end. Continue reading

Categories: Lord's paradox, Statistical Inference as Severe Testing, Stephen Senn | 34 Comments

S. Senn: Fishing for fakes with Fisher (Guest Post)


 

Stephen Senn
Head of  Competence Center
for Methodology and Statistics (CCMS)
Luxembourg Institute of Health
Twitter @stephensenn

Fishing for fakes with Fisher

 Stephen Senn

The essential fact governing our analysis is that the errors due to soil heterogeneity will be divided by a good experiment into two portions. The first, which is to be made as large as possible, will be completely eliminated, by the arrangement of the experiment, from the experimental comparisons, and will be as carefully eliminated in the statistical laboratory from the estimate of error. As to the remainder, which cannot be treated in this way, no attempt will be made to eliminate it in the field, but, on the contrary, it will be carefully randomised so as to provide a valid estimate of the errors to which the experiment is in fact liable. R. A. Fisher, The Design of Experiments, (Fisher 1990) section 28.

Fraudian analysis?

John Carlisle must be a man endowed with exceptional energy and determination. A recent paper of his is entitled ‘Data fabrication and other reasons for non-random sampling in 5087 randomised, controlled trials in anaesthetic and general medical journals’ (Carlisle 2017) and has created quite a stir. The journals examined include the Journal of the American Medical Association and the New England Journal of Medicine. What Carlisle did was examine 29,789 variables using 72,261 means to see if they were ‘consistent with random sampling’ (by which, I suppose, he means ‘randomisation’). The papers chosen had to report either standard deviations or standard errors of the mean. P-values as measures of balance, or lack of it, were then calculated using each of three methods and the method that gave the value closest to 0.5 was chosen. For a given trial the P-values chosen were then back-converted to z-scores, combined by summing them, and then re-converted to P-values using a method that assumes the summed z-scores to be independent. As Carlisle writes, ‘All p values were one-sided and inverted, such that dissimilar means generated p values near 1’. Continue reading

Categories: Fisher, RCTs, Stephen Senn | 5 Comments

The ASA Document on P-Values: One Year On


I’m surprised it’s a year already since posting my published comments on the ASA Document on P-Values. Since then, there have been a slew of papers rehearsing the well-worn fallacies of tests (a tad bit more than the usual rate). Doubtless, the P-value Pow Wow raised people’s consciousnesses. I’m interested in hearing reader reactions/experiences in connection with the P-Value project (positive and negative) over the past year. (Use the comments, share links to papers; and/or send me something slightly longer for a possible guest post.)
Some people sent me a diagram from a talk by Stephen Senn (on “P-values and the art of herding cats”). He presents an array of different cat commentators, and for some reason Mayo cat is in the middle but way over on the left side, near the wall. I never got the key to interpretation. My contribution is below:

Chart by S.Senn

“Don’t Throw Out The Error Control Baby With the Bad Statistics Bathwater”

D. Mayo*[1]

The American Statistical Association is to be credited with opening up a discussion into p-values; now an examination of the foundations of other key statistical concepts is needed. Continue reading

Categories: Bayesian/frequentist, P-values, science communication, Statistics, Stephen Senn | 14 Comments

S. Senn: “Placebos: it’s not only the patients that are fooled” (Guest Post)

Stephen Senn

Placebos: it’s not only the patients that are fooled

Stephen Senn
Head of  Competence Center for Methodology and Statistics (CCMS)
Luxembourg Institute of Health

In my opinion a great deal of ink is wasted to little purpose in discussing placebos in clinical trials. Many commentators simply do not understand the nature and purpose of placebos. To start with the latter, their only purpose is to permit blinding of treatments and, to continue to the former, this implies that their nature is that they are specific to the treatment studied.

Consider an example. Suppose that Pannostrum Pharmaceuticals wishes to prove that its new treatment for migraine, Paineaze® (which is in the form of a small red circular pill), is superior to the market leader offered by Allexir Laboratories, Kalmer® (which is a large purple lozenge). Pannostrum decides to do a head-to-head comparison and, of course, will therefore require placebos. Every patient will have to take a red pill and a purple lozenge. In the Paineaze arm what is red will be Paineaze and what is purple ‘placebo to Kalmer’. In the Kalmer arm what is red will be ‘placebo to Paineaze’ and what is purple will be Kalmer.


Continue reading

Categories: PhilPharma, PhilStat/Med, Statistics, Stephen Senn | 6 Comments

S. Senn: “Painful dichotomies” (Guest Post)


Stephen Senn
Head of  Competence Center for Methodology and Statistics (CCMS)
Luxembourg Institute of Health
Twitter @stephensenn

Painful dichotomies

The tweet read “Featured review: Only 10% people with tension-type headaches get a benefit from paracetamol” and immediately I thought, ‘how would they know?’ and almost as quickly decided, ‘of course they don’t know, they just think they know’. Sure enough, on following up the link to the Cochrane Review in the tweet it turned out that, yet again, the deadly mix of dichotomies and numbers needed to treat had infected the brains of researchers to the extent that they imagined that they had identified personal response. (See Responder Despondency for a previous post on this subject.)

The bare facts they established are the following:

The International Headache Society recommends the outcome of being pain free two hours after taking a medicine. The outcome of being pain free or having only mild pain at two hours was reported by 59 in 100 people taking paracetamol 1000 mg, and in 49 out of 100 people taking placebo.

and the false conclusion they immediately asserted is the following

This means that only 10 in 100 or 10% of people benefited because of paracetamol 1000 mg.

To understand the fallacy, look at the accompanying graph. Continue reading

Categories: junk science, PhilStat/Med, Statistics, Stephen Senn | 27 Comments

Stephen Senn: Double Jeopardy?: Judge Jeffreys Upholds the Law (sequel to the pathetic P-value)[4]

S. Senn

Stephen Senn
Head of Competence Center for Methodology and Statistics (CCMS)
Luxembourg Institute of Health

Double Jeopardy?: Judge Jeffreys Upholds the Law*[4]

“But this could be dealt with in a rough empirical way by taking twice the standard error as a criterion for possible genuineness and three times the standard error for definite acceptance”. Harold Jeffreys(1) (p386)

This is the second of two posts on P-values. In the first, The Pathetic P-Value, I considered the relation of P-values to Laplace’s Bayesian formulation of induction, pointing out that P-values, whilst they had a very different interpretation, were numerically very similar to a type of Bayesian posterior probability. In this one, I consider their relation, or lack of it, to Harold Jeffreys’s radically different approach to significance testing. (An excellent account of the development of Jeffreys’s thought is given by Howie(2), which I recommend highly.)

The story starts with Cambridge philosopher CD Broad (1887-1971), who in 1918 pointed to a difficulty with Laplace’s Law of Succession. Broad considers the problem of drawing counters from an urn containing n counters and supposes that all m drawn had been observed to be white. He now considers two very different questions, which have two very different probabilities and writes: Continue reading

Categories: Jeffreys, P-values, reforming the reformers, Stephen Senn | Tags: | 11 Comments

Stephen Senn: Double Jeopardy?: Judge Jeffreys Upholds the Law (sequel to the pathetic P-value)

S. Senn

Stephen Senn
Head of Competence Center for Methodology and Statistics (CCMS)
Luxembourg Institute of Health

Double Jeopardy?: Judge Jeffreys Upholds the Law

“But this could be dealt with in a rough empirical way by taking twice the standard error as a criterion for possible genuineness and three times the standard error for definite acceptance”. Harold Jeffreys(1) (p386)

This is the second of two posts on P-values. In the first, The Pathetic P-Value, I considered the relation of P-values to Laplace’s Bayesian formulation of induction, pointing out that P-values, whilst they had a very different interpretation, were numerically very similar to a type of Bayesian posterior probability. In this one, I consider their relation, or lack of it, to Harold Jeffreys’s radically different approach to significance testing. (An excellent account of the development of Jeffreys’s thought is given by Howie(2), which I recommend highly.)

The story starts with Cambridge philosopher CD Broad (1887-1971), who in 1918 pointed to a difficulty with Laplace’s Law of Succession. Broad considers the problem of drawing counters from an urn containing n counters and supposes that all m drawn had been observed to be white. He now considers two very different questions, which have two very different probabilities and writes:

[C. D. Broad quote]

Note that in the case that only one counter remains we have n = m + 1 and the two probabilities are the same. However, if n > m + 1 they are not the same and in particular if m is large but n is much larger, the first probability can approach 1 whilst the second remains small.

The practical implication of this is that, just because Bayesian induction implies that a large sequence of successes (and no failures) supports belief that the next trial will be a success, it does not follow that one should believe that all future trials will be so. This distinction is often misunderstood. This is The Economist getting it wrong in September 2000:

The canonical example is to imagine that a precocious newborn observes his first sunset, and wonders whether the sun will rise again or not. He assigns equal prior probabilities to both possible outcomes, and represents this by placing one white and one black marble into a bag. The following day, when the sun rises, the child places another white marble in the bag. The probability that a marble plucked randomly from the bag will be white (ie, the child’s degree of belief in future sunrises) has thus gone from a half to two-thirds. After sunrise the next day, the child adds another white marble, and the probability (and thus the degree of belief) goes from two-thirds to three-quarters. And so on. Gradually, the initial belief that the sun is just as likely as not to rise each morning is modified to become a near-certainty that the sun will always rise.

See Dicing with Death(3) (pp76-78).

The practical relevance of this is that scientific laws cannot be established by Laplacian induction. Jeffreys (1891-1989) puts it thus

Thus I may have seen 1 in 1000 of the ‘animals with feathers’ in England; on Laplace’s theory the probability of the proposition, ‘all animals with feathers have beaks’, would be about 1/1000. This does not correspond to my state of belief or anybody else’s. (P128)
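The arithmetic behind both Broad’s observation and Jeffreys’s feathers example is easy to check. Here is a small sketch, assuming the standard uniform prior over the composition of the urn, under which the probability that the next counter drawn is white is (m + 1)/(m + 2) and the probability that all n counters are white is (m + 1)/(n + 1):

```python
def prob_next_white(m):
    """P(next draw is white | m white seen in m draws) - Laplace's rule of succession."""
    return (m + 1) / (m + 2)

def prob_all_white(m, n):
    """P(all n counters are white | m white seen in m draws) - Broad's result."""
    return (m + 1) / (n + 1)

m, n = 1_000, 1_000_000  # seen 1 in 1000 of the 'animals with feathers'
print(prob_next_white(m))    # ~0.999: the next one examined will almost certainly have a beak
print(prob_all_white(m, n))  # ~0.001: yet the universal law itself remains highly improbable
```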

Continue reading

Categories: Jeffreys, P-values, reforming the reformers, Statistics, Stephen Senn | 42 Comments

Sir Harold Jeffreys’ (tail area) one-liner: Saturday night comedy (b)


This headliner appeared before, but to a sparse audience, so Management’s giving him another chance… His joke relates to both Senn’s post (about alternatives), and to my recent post about using (1 – β)/α as a likelihood ratio--but for very different reasons. (I’ve explained at the bottom of this “(b) draft”.)

 ….If you look closely, you’ll see that it’s actually not Jay Leno who is standing up there at the mike, (especially as he’s no longer doing the Tonight Show) ….


It’s Sir Harold Jeffreys himself! And his (very famous) joke, I admit, is funny. So, since it’s Saturday night, let’s listen in on Sir Harold’s howler joke* in criticizing the use of p-values.

“Did you hear the one about significance testers rejecting H0 because of outcomes H0 didn’t predict?

‘What’s unusual about that?’ you ask?

What’s unusual is that they do it when these unpredicted outcomes haven’t even occurred!”

Much laughter.

[The actual quote from Jeffreys: Using p-values implies that “An hypothesis that may be true is rejected because it has failed to predict observable results that have not occurred. This seems a remarkable procedure.” (Jeffreys 1939, 316)]

I say it’s funny, so to see why I’ll strive to give it a generous interpretation. Continue reading

Categories: Comedy, Discussion continued, Fisher, Jeffreys, P-values, Statistics, Stephen Senn | 7 Comments

Stephen Senn: Fisher’s Alternative to the Alternative


As part of the week of recognizing R. A. Fisher (February 17, 1890 – July 29, 1962), I reblog Senn from 3 years ago.

‘Fisher’s alternative to the alternative’

By: Stephen Senn

[2012 marked] the 50th anniversary of RA Fisher’s death. It is a good excuse, I think, to draw attention to an aspect of his philosophy of significance testing. In his extremely interesting essay on Fisher, Jimmie Savage drew attention to a problem in Fisher’s approach to testing. In describing Fisher’s aversion to power functions Savage writes, ‘Fisher says that some tests are more sensitive than others, and I cannot help suspecting that that comes to very much the same thing as thinking about the power function.’ (Savage 1976) (P473).

The modern statistician, however, has an advantage here denied to Savage. Savage’s essay was published posthumously in 1976 and the lecture on which it was based was given in Detroit on 29 December 1971 (P441). At that time Fisher’s scientific correspondence did not form part of his available oeuvre but in 1990 Henry Bennett’s magnificent edition of Fisher’s statistical correspondence (Bennett 1990) was published and this throws light on many aspects of Fisher’s thought including on significance tests.


The key letter here is Fisher’s reply of 6 October 1938 to Chester Bliss’s letter of 13 September. Bliss himself had reported an issue that had been raised with him by Snedecor on 6 September. Snedecor had pointed out that an analysis using inverse sine transformations of some data that Bliss had worked on gave a different result to an analysis of the original values. Bliss had defended his (transformed) analysis on the grounds that a) if a transformation always gave the same result as an analysis of the original data there would be no point and b) an analysis on inverse sines was a sort of weighted analysis of percentages with the transformation more appropriately reflecting the weight of information in each sample. Bliss wanted to know what Fisher thought of his reply.

Fisher replies with a ‘shorter catechism’ on transformations which ends as follows: Continue reading

Categories: Fisher, Statistics, Stephen Senn | Tags: , , , | 59 Comments

What’s wrong with taking (1 – β)/α, as a likelihood ratio comparing H0 and H1?


Here’s a quick note on something that I often find in discussions on tests, even though it treats “power”, which is a capacity-of-test notion, as if it were a fit-with-data notion…..

1. Take a one-sided Normal test T+ with n iid samples:

H0: µ ≤ 0 against H1: µ > 0

σ = 10, n = 100, σ/√n = σx = 1, α = .025.

So the test would reject H0 iff Z > c.025 = 1.96. (1.96 is the “cut-off”.)

~~~~~~~~~~~~~~

  1. Simple rules for alternatives against which T+ has high power:
  • If we add σx (here 1) to the cut-off (here, 1.96) we are at an alternative value for µ that test T+ has .84 power to detect.
  • If we add 3σx to the cut-off we are at an alternative value for µ that test T+ has ~.999 power to detect. This value, which we can write as µ.999, is 4.96.

Let the observed outcome just reach the cut-off to reject the null, z = 1.96.

If we were to form a “likelihood ratio” of μ = 4.96 compared to μ0 = 0 using

[Power(T+, 4.96)]/α,

it would be 40.  (.999/.025).

It is absurd to say the alternative 4.96 is supported 40 times as much as the null, understanding support as likelihood or comparative likelihood. (The data, 1.96, are even closer to 0 than to 4.96. The same point can be made with less extreme cases.) What is commonly done next is to assign priors of .5 to the two hypotheses, yielding

Pr(H0 | z0) = 1/(1 + 40) = .024, so Pr(H1 | z0) = .976.
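These numbers are easy to verify, and so is the contrast with the actual likelihoods at z = 1.96 that makes the ratio misleading. A sketch follows; scipy’s Normal distribution stands in for the test above, with σx = 1.

```python
from scipy.stats import norm

alpha = 0.025
cutoff = norm.ppf(1 - alpha)          # 1.96
mu_alt = cutoff + 3                   # sigma_x = 1, so mu_.999 = 4.96
power = 1 - norm.cdf(cutoff - mu_alt) # ~ .999

ratio = power / alpha                 # ~ 40
print(ratio, 1 / (1 + ratio))         # ~ 40 and ~ .024, as above

# The genuine likelihood ratio at the observed z = 1.96 tells the opposite story:
z = 1.96
print(norm.pdf(z - 0) / norm.pdf(z - mu_alt))  # ~ 13 in favour of the null
```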

Such an inference is highly unwarranted and would almost always be wrong. Continue reading

Categories: Bayesian/frequentist, law of likelihood, Statistical power, statistical tests, Statistics, Stephen Senn | 87 Comments

3 YEARS AGO: (JANUARY 2012) MEMORY LANE

3 years ago…

MONTHLY MEMORY LANE: 3 years ago: January 2012. I mark in red three posts that seem most apt for general background on key issues in this blog.

January 2012

This new, once-a-month feature began at the blog’s 3-year anniversary in Sept. 2014. I will count U-Phils on a single paper as one of the three I highlight (else I’d have to choose between them). I will comment on 3-year-old posts from time to time.

This Memory Lane needs a bit of explanation. This blog began largely as a forum to discuss a set of contributions from a conference I organized (with A. Spanos and J. Miller*), “Statistical Science and Philosophy of Science: Where Do (Should) They meet?”, at the London School of Economics, Center for the Philosophy of Natural and Social Science, CPNSS, in June 2010 (where I am a visitor). Additional papers grew out of conversations initiated soon after (with Andrew Gelman and Larry Wasserman). The conference site is here. My reflections in this general arena (Sept. 26, 2012) are here.

As articles appeared in a special topic of the on-line journal, Rationality, Markets and Morals (RMM), edited by Max Albert[i]—also a conference participant —I would announce an open invitation to readers to take a couple of weeks to write an extended comment.  Each “U-Phil”–which stands for “U philosophize”- was a contribution to this activity. I plan to go back to that exercise at some point.  Generally I would give a “deconstruction” of the paper first, followed by U-Phils, and then the author gave responses to U-Phils and me as they wished. You can readily search this blog for all the U-Phils and deconstructions**.

I was also keeping a list of issues that we either haven’t taken up, or need to return to. One example here is: Bayesian updating and down dating. Further notes about the origins of this blog are here. I recommend everyone reread Senn’s paper.** 

For newcomers, here’s your chance to catch up; for old timers, this is philosophy: rereading is essential!

[i] Along with Hartmut Kliemt and Bernd Lahno.

*For a full list of collaborators, sponsors, logisticians, and related collaborations, see the conference page. The full list of speakers is found there as well.

**The U-Phil exchange between Mayo and Senn was published in the same special topic of RIMM. But I still wish to know how we can cultivate “Senn’s-ability.” We could continue that activity as well, perhaps.

Previous 3 YEAR MEMORY LANES:

Dec. 2011
Nov. 2011
Oct. 2011
Sept. 2011 (Within “All She Wrote (so far)”)

Categories: 3-year memory lane, blog contents, Statistics, Stephen Senn, U-Phil | 2 Comments

S. Senn: “Responder despondency: myths of personalized medicine” (Guest Post)

Stephen Senn


Stephen Senn
Head, Methodology and Statistics Group
Competence Center for Methodology and Statistics (CCMS)
Luxembourg

Responder despondency: myths of personalized medicine

The road to drug development destruction is paved with good intentions. The 2013 FDA report, Paving the Way for Personalized Medicine  has an encouraging and enthusiastic foreword from Commissioner Hamburg and plenty of extremely interesting examples stretching back decades. Given what the report shows can be achieved on occasion, given the enthusiasm of the FDA and its commissioner, given the amazing progress in genetics emerging from the labs, a golden future of personalized medicine surely awaits us. It would be churlish to spoil the party by sounding a note of caution but I have never shirked being churlish and that is exactly what I am going to do. Continue reading

Categories: evidence-based policy, Statistics, Stephen Senn | 50 Comments

Stephen Senn: Blood Simple? The complicated and controversial world of bioequivalence (guest post)

Blood Simple?
The complicated and controversial world of bioequivalence

by Stephen Senn*


Those not familiar with drug development might suppose that showing that a new pharmaceutical formulation (say a generic drug) is equivalent to a formulation that has a licence (say a brand name drug) ought to be simple. However, it can often turn out to be bafflingly difficult[1]. Continue reading

Categories: bioequivalence, confidence intervals and tests, PhilPharma, Statistics, Stephen Senn | 22 Comments

Power taboos: Statue of Liberty, Senn, Neyman, Carnap, Severity

Is it taboo to use a test’s power to assess what may be learned from the data in front of us? (Is it limited to pre-data planning?) If not entirely taboo, some regard power as irrelevant post-data[i], and the reason I’ve heard is along the lines of an analogy Stephen Senn gave today (in a comment discussing his last post here)[ii].

Senn comment: So let me give you another analogy to your (very interesting) fire alarm analogy. (My analogy is imperfect but so is the fire alarm.) If you want to cross the Atlantic from Glasgow you should do some serious calculations to decide what boat you need. However, if several days later you arrive at the Statue of Liberty, the fact that you see it is more important than the size of the boat for deciding that you did, indeed, cross the Atlantic.

My fire alarm analogy is here. My analogy presumes you are assessing the situation (about the fire) long distance. Continue reading

Categories: exchange with commentators, Neyman's Nursery, P-values, Phil6334, power, Stephen Senn | 6 Comments

Stephen Senn: “Delta Force: To what extent is clinical relevance relevant?” (Guest Post)

Stephen Senn

Stephen Senn
Head, Methodology and Statistics Group,
Competence Center for Methodology and Statistics (CCMS),
Luxembourg

Delta Force
To what extent is clinical relevance relevant?

Inspiration
This note has been inspired by a Twitter exchange with respected scientist and famous blogger  David Colquhoun. He queried whether a treatment that had 2/3 of an effect that would be described as clinically relevant could be useful. I was surprised at the question, since I would regard it as being pretty obvious that it could but, on reflection, I realise that things that may seem obvious to some who have worked in drug development may not be obvious to others, and if they are not obvious to others are either in need of a defence or wrong. I don’t think I am wrong and this note is to explain my thinking on the subject. Continue reading

Categories: power, Statistics, Stephen Senn | 39 Comments

Sir Harold Jeffreys’ (tail area) one-liner: Sat night comedy [draft ii]


This headliner appeared last month, but to a sparse audience (likely because it was during winter break), so Management’s giving him another chance… 

You might not have thought there could be new material for 2014, but there is, and if you look a bit more closely, you’ll see that it’s actually not Jay Leno who is standing up there at the mike ….

It’s Sir Harold Jeffreys himself! And his (very famous) joke, I admit, is funny. So, since it’s Saturday night, let’s listen in on Sir Harold’s howler* in criticizing the use of p-values.

“Did you hear the one about significance testers rejecting H0 because of outcomes H0 didn’t predict?

‘What’s unusual about that?’ you ask?

What’s unusual is that they do it when these unpredicted outcomes haven’t even occurred!”

Much laughter.

[The actual quote from Jeffreys: Using p-values implies that “An hypothesis that may be true is rejected because it has failed to predict observable results that have not occurred. This seems a remarkable procedure.” (Jeffreys 1939, 316)]

I say it’s funny, so to see why I’ll strive to give it a generous interpretation. Continue reading

Categories: Comedy, Fisher, Jeffreys, P-values, Stephen Senn | Leave a comment

STEPHEN SENN: Fisher’s alternative to the alternative

Reblogging 2 years ago:

By: Stephen Senn

This year [2012] marks the 50th anniversary of RA Fisher’s death. It is a good excuse, I think, to draw attention to an aspect of his philosophy of significance testing. In his extremely interesting essay on Fisher, Jimmie Savage drew attention to a problem in Fisher’s approach to testing. In describing Fisher’s aversion to power functions Savage writes, ‘Fisher says that some tests are more sensitive than others, and I cannot help suspecting that that comes to very much the same thing as thinking about the power function.’ (Savage 1976) (P473).

The modern statistician, however, has an advantage here denied to Savage. Savage’s essay was published posthumously in 1976 and the lecture on which it was based was given in Detroit on 29 December 1971 (P441). At that time Fisher’s scientific correspondence did not form part of his available oeuvre but in 1990 Henry Bennett’s magnificent edition of Fisher’s statistical correspondence (Bennett 1990) was published and this throws light on many aspects of Fisher’s thought including on significance tests.

The key letter here is Fisher’s reply of 6 October 1938 to Chester Bliss’s letter of 13 September. Bliss himself had reported an issue that had been raised with him by Snedecor on 6 September. Snedecor had pointed out that an analysis using inverse sine transformations of some data that Bliss had worked on gave a different result to an analysis of the original values. Bliss had defended his (transformed) analysis on the grounds that a) if a transformation always gave the same result as an analysis of the original data there would be no point and b) an analysis on inverse sines was a sort of weighted analysis of percentages with the transformation more appropriately reflecting the weight of information in each sample. Bliss wanted to know what Fisher thought of his reply.

Fisher replies with a ‘shorter catechism’ on transformations which ends as follows: Continue reading

Categories: Fisher, Statistics, Stephen Senn | Tags: , , , | 31 Comments
