# Stephen Senn

## Stephen Senn: Double Jeopardy?: Judge Jeffreys Upholds the Law (sequel to the pathetic P-value)

S. Senn

Stephen Senn
Head of Competence Center for Methodology and Statistics (CCMS)
Luxembourg Institute of Health

Double Jeopardy?: Judge Jeffreys Upholds the Law

“But this could be dealt with in a rough empirical way by taking twice the standard error as a criterion for possible genuineness and three times the standard error for definite acceptance”. Harold Jeffreys(1) (p386)

This is the second of two posts on P-values. In the first, The Pathetic P-Value, I considered the relation of P-values to Laplace’s Bayesian formulation of induction, pointing out that P-values, whilst they have a very different interpretation, are numerically very similar to a type of Bayesian posterior probability. In this one, I consider their relation, or lack of it, to Harold Jeffreys’s radically different approach to significance testing. (An excellent account of the development of Jeffreys’s thought is given by Howie(2), which I recommend highly.)

The story starts with the Cambridge philosopher CD Broad (1887-1971), who in 1918 pointed to a difficulty with Laplace’s Law of Succession. Broad considers the problem of drawing counters from an urn containing n counters and supposes that all m counters drawn have been observed to be white. He now considers two very different questions, which have two very different probabilities, and writes:
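In modern notation (a reconstruction under Laplace’s uniform prior; the symbols here are mine, not Broad’s), the two probabilities are:

```latex
% After drawing m counters, all white, from an urn of n counters,
% with a uniform prior over the urn's composition:
P(\text{next counter drawn is white}) = \frac{m+1}{m+2},
\qquad
P(\text{all } n \text{ counters are white}) = \frac{m+1}{n+1}.
```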

Note that in the case that only one counter remains we have n = m + 1 and the two probabilities are the same. However, if n > m+1 they are not the same and in particular if m is large but n is much larger, the first probability can approach 1 whilst the second remains small.

The practical implication of this is that just because Bayesian induction implies that a long sequence of successes (and no failures) supports the belief that the next trial will be a success, it does not follow that one should believe that all future trials will be successes. This distinction is often misunderstood. Here is The Economist getting it wrong in September 2000:

The canonical example is to imagine that a precocious newborn observes his first sunset, and wonders whether the sun will rise again or not. He assigns equal prior probabilities to both possible outcomes, and represents this by placing one white and one black marble into a bag. The following day, when the sun rises, the child places another white marble in the bag. The probability that a marble plucked randomly from the bag will be white (ie, the child’s degree of belief in future sunrises) has thus gone from a half to two-thirds. After sunrise the next day, the child adds another white marble, and the probability (and thus the degree of belief) goes from two-thirds to three-quarters. And so on. Gradually, the initial belief that the sun is just as likely as not to rise each morning is modified to become a near-certainty that the sun will always rise.

See Dicing with Death(3) (pp76-78).
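The child’s arithmetic, and what it does and does not license, can be checked directly (a minimal stdlib sketch; the function names are mine):

```python
from fractions import Fraction

def p_next_success(m):
    """Laplace's rule of succession: P(next success) after m successes, no failures."""
    return Fraction(m + 1, m + 2)

def p_all_future(m, N):
    """P(the next N trials ALL succeed) after m successes (uniform prior)."""
    return Fraction(m + 1, m + N + 1)

# After one sunrise, the child's marble calculation gives 2/3, as in the quote.
print(p_next_success(1))
# After 10,000 sunrises the NEXT sunrise is near-certain...
print(float(p_next_success(10_000)))           # ~0.9999
# ...but "the sun rises on every one of the next 10 million days" is not:
print(float(p_all_future(10_000, 10_000_000)))  # ~0.001
```

This is exactly Broad’s point: near-certainty about the next instance does not translate into near-certainty about the law.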

The practical relevance of this is that scientific laws cannot be established by Laplacian induction. Jeffreys (1891-1989) puts it thus

Thus I may have seen 1 in 1000 of the ‘animals with feathers’ in England; on Laplace’s theory the probability of the proposition, ‘all animals with feathers have beaks’, would be about 1/1000. This does not correspond to my state of belief or anybody else’s. (P128)

Here Jeffreys is using Broad’s formula with the ratio of m to n of 1:1000.

To Harold Jeffreys the situation was unacceptable. Scientific laws had to be capable of being, if not proved, at least made more likely by a process of induction. The solution he found was to place a lump of probability on the simpler model that any particular scientific law would imply, compared to some vaguer and more general alternative. In hypothesis testing terms we can say that Jeffreys moved from testing

H0A: θ ≤ 0 v. H1A: θ > 0  &  H0B: θ ≥ 0 v. H1B: θ < 0

to testing

H0: θ = 0 v. H1: θ ≠ 0

As he put it

The essential feature is that we express ignorance of whether the new parameter is needed by taking half the prior probability for it as concentrated in the value indicated by the null hypothesis, and distributing the other half over the range possible.(1) (p249)

Now, the interesting thing is that for frequentist purposes these two formulations make very little difference. The P-value calculated in the second case is the same as that in the first, although its interpretation is slightly different. In the second case it is in a sense exact, since the null hypothesis is ‘simple’. In the first case it is a maximum, since for a given statistic one calculates the probability of a result as extreme as, or more extreme than, that observed for the value of the null hypothesis for which this probability is maximised.

In the Bayesian case the answers are radically different, as is shown by the attached figure, which gives one-sided P-values and posterior probabilities (calculated from a simulation for fun rather than by necessity) for smooth and lump prior distributions. If we allow that θ may vary smoothly over some range, which is, in a sense, case 1 and is the Laplacian formulation, we get a very different result to allowing it to have a lump of probability at 0, which is the innovation of Jeffreys. The origin of the difference is to do with the prior probability. It may seem that we are still in the world of uninformative prior probabilities but this is far from so. In the Laplacian formulation every value of θ is equally likely. However, in the Jeffreys formulation the value under the null is infinitely more likely than any other value. This fact is partly hidden by the approach. First make H0 & H1 equally likely. Then make every value of θ under H1 equally likely. The net result is that all values of θ are far from being equally likely.
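The smooth-versus-lump contrast can be sketched numerically without any simulation (a stdlib sketch, not the calculation behind the figure; the function names, the choice of a Normal alternative under H1, and τ = 1 are assumptions of mine):

```python
import math

def phi(z):
    """Standard Normal cumulative distribution function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def npdf(x, sd=1.0):
    """Normal density with mean 0 and standard deviation sd."""
    return math.exp(-0.5 * (x / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

def posterior_probs(z, tau=1.0):
    """For a standardized statistic z (standard error 1), return the one-sided
    P-value, the smooth-prior posterior P(theta <= 0 | z) (flat prior on theta),
    and the lump-prior posterior P(H0 | z) with P(H0) = 1/2 and, under H1,
    theta ~ N(0, tau^2)."""
    p_one_sided = 1 - phi(z)
    smooth = phi(-z)                          # flat prior: identical to the P-value
    f0 = npdf(z)                              # marginal likelihood under H0: theta = 0
    f1 = npdf(z, sd=math.sqrt(1 + tau ** 2))  # marginal likelihood under H1
    lump = f0 / (f0 + f1)
    return p_one_sided, smooth, lump

p, smooth, lump = posterior_probs(1.96)
# p and smooth are both ~.025, but lump is ~.35: the lump of prior
# probability at 0 radically changes the posterior assessment.
```

The same data that look damning under the smooth prior leave H0 with roughly a one-in-three posterior probability under the lump prior.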

This simple case is a paradigm for a genuine issue in Bayesian inference that arises over and over again. How you set up the problem when specifying prior distributions is crucially important. (Note that this is not in itself a criticism of the Bayesian approach. It is a warning that care is necessary.)

Has Jeffreys’s innovation been influential? Yes and no. A lot of Bayesian work seems to get by without it. For example, an important paper on Bayesian Approaches to Randomized Trials by Spiegelhalter, Freedman and Parmar(4), which has been cited more than 450 times according to Google Scholar (as of 7 May 2015), considers four types of smooth priors, reference, clinical, sceptical and enthusiastic, none of which involves a lump of probability.

However, there is one particular area where an analogous approach is very common and that is model selection. The issue here is that if one just uses likelihood as a guide between models, one will always prefer a more complex model to a simpler one. A practical likelihood-based solution is to use a penalised form whereby the likelihood is handicapped by a function of the number of parameters. The most famous of these is the AIC criterion. It is sometimes maintained that this deals with a very different sort of problem from that addressed by hypothesis/significance testing, but its originator, Akaike (1927-2009)(5), clearly did not think so, writing

So, it is clear that Akaike regarded this as a unifying approach to estimation and hypothesis testing: that which was primarily an estimation tool, likelihood, was now a model selection tool also.

However, as Murtaugh(6) points out, there is a strong similarity between using AIC and P-values to judge the adequacy of a model. The AIC criterion involves the log-likelihood, which is also what is involved in the analysis of deviance, where one uses the fact that, asymptotically, minus twice the difference in log-likelihoods between two nested models has a chi-square distribution with mean equal to the difference in the number of parameters fitted. The net result is that if you use AIC to choose between a simpler model and a more complex model within which it is nested, and the more complex model has one extra parameter, choosing or rejecting the more complex model is equivalent to using a significance threshold of 16%.
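The 16% figure can be checked in a few lines (a stdlib sketch; the helper name is mine). AIC prefers the bigger model exactly when the log-likelihood gain exceeds the penalty for the extra parameter, i.e. when the chi-square statistic exceeds 2:

```python
import math

def chi2_sf_1df(x):
    """Survival function of chi-square with 1 df: P(X > x) = 2*(1 - Phi(sqrt(x)))."""
    phi = 0.5 * (1 + math.erf(math.sqrt(x) / math.sqrt(2)))
    return 2 * (1 - phi)

# AIC with one extra parameter selects the bigger model iff
# 2 * (log-likelihood gain) > 2, so its implicit significance level is:
print(chi2_sf_1df(2.0))   # ~0.157, i.e. roughly 16%
# Compare the conventional 5% threshold:
print(chi2_sf_1df(3.84))  # ~0.05
```

So, judged as a significance test, AIC is far more liberal than the 5% convention, which is the point Senn develops next.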

For those who are used to the common charge levelled by, for example, Berger and Sellke(7) and, more recently, David Colquhoun(8) in his 2014 paper, that the P-value approach gives significance too easily, this is a baffling result: here significance tests are too conservative rather than too liberal. Of course, Bayesians will prefer the BIC to the AIC, and this means that there is a further influence of sample size on inference that is not captured by any function depending on likelihood and number of parameters only. Nevertheless, it is hard to argue that, whatever advantages the AIC may have in terms of flexibility, for the purpose of comparing nested models it somehow represents a more rigorous approach than significance testing.

However, it is easily understood if one appreciates the following. Within the Bayesian framework, in abandoning smooth priors for lump priors, it is also necessary to change the probability standard. (In fact I speculate that the 1 in 20 standard seemed reasonable partly because of the smooth prior.) In formulating the hypothesis-testing problem the way that he does, Jeffreys has already used up any preference for parsimony in terms of prior probability. Jeffreys made it quite clear that this was his view, stating

I maintain that the only ground that we can possibly have for not rejecting the simple law is that we believe that it is quite likely to be true (p119)

He then proceeds to express this in terms of a prior probability. Thus there can be no double jeopardy. A parsimony principle is used on the prior distribution; you can’t use it again on the posterior distribution. Once that is calculated, you should simply prefer the more probable model. The error that is made is not only to assume that P-values should be what they are not, but also to assume that, when one interprets them in the way that one should not, the previous calibration survives.

It is as if, in giving recommendations for dosing children, one abandoned a formula based on age and adopted one based on weight, but insisted on using the same numbers one had used for years.

Error probabilities are not posterior probabilities. Certainly, there is much more to statistical analysis than P-values but they should be left alone rather than being deformed in some way to become second class Bayesian posterior probabilities.

ACKNOWLEDGEMENT

My research on inference for small populations is carried out in the framework of the IDEAL project http://www.ideal.rwth-aachen.de/ and supported by the European Union’s Seventh Framework Programme for research, technological development and demonstration under Grant Agreement no 602552.

REFERENCES

1. Jeffreys H. Theory of Probability. Third ed. Oxford: Clarendon Press; 1961.
2. Howie D. Interpreting Probability: Controversies and Developments in the Early Twentieth Century. Skyrms B, editor. Cambridge: Cambridge University Press; 2002. 262 p.
3. Senn SJ. Dicing with Death. Cambridge: Cambridge University Press; 2003.
4. Spiegelhalter DJ, Freedman LS, Parmar MKB. Bayesian Approaches to Randomized Trials. Journal of the Royal Statistical Society Series a-Statistics in Society. 1994;157:357-87.
5. Akaike H. Information theory and an extension of the maximum likelihood principle. In: Petrov BN, Czáki F, editors. Second International Symposium on Information Theory. Budapest: Akademiai Kiadó; 1973. p. 267-81.
6. Murtaugh PA. In defense of P values. Ecology. 2014;95(3):611-7.
7. Berger JO, Sellke T. Testing a point null hypothesis: the irreconcilability of P values and evidence (with discussion). Journal of the American Statistical Association. 1987;82:112-39.
8. Colquhoun D. An investigation of the false discovery rate and the misinterpretation of p-values. Royal Society Open Science. 2014;1(3):140216.

Also Relevant

Related Posts

P-values overstate the evidence?

P-values versus posteriors (comedy hour)

Spanos: Recurring controversies about P-values and confidence intervals (paper in relation to Murtaugh)

## Sir Harold Jeffreys’ (tail area) one-liner: Saturday night comedy (b)


This headliner appeared before, but to a sparse audience, so Management’s giving him another chance… His joke relates to both Senn’s post (about alternatives), and to my recent post about using (1 – β)/α as a likelihood ratio--but for very different reasons. (I’ve explained at the bottom of this “(b) draft”.)

If you look closely, you’ll see that it’s actually not Jay Leno who is standing up there at the mike (especially as he’s no longer doing the Tonight Show)….


It’s Sir Harold Jeffreys himself! And his (very famous) joke, I admit, is funny. So, since it’s Saturday night, let’s listen in on Sir Harold’s howler joke* in criticizing the use of p-values.

“Did you hear the one about significance testers rejecting H0 because of outcomes H0 didn’t predict?

What’s unusual is that they do it when these unpredicted outcomes haven’t even occurred!”

Much laughter.

[The actual quote from Jeffreys: Using p-values implies that “An hypothesis that may be true is rejected because it has failed to predict observable results that have not occurred. This seems a remarkable procedure.” (Jeffreys 1939, 316)]

I say it’s funny, so to see why I’ll strive to give it a generous interpretation. Continue reading

## Stephen Senn: Fisher’s Alternative to the Alternative


As part of the week of recognizing R.A. Fisher (February 17, 1890 – July 29, 1962), I reblog Senn from 3 years ago.

‘Fisher’s alternative to the alternative’

By: Stephen Senn

[2012 marked] the 50th anniversary of RA Fisher’s death. It is a good excuse, I think, to draw attention to an aspect of his philosophy of significance testing. In his extremely interesting essay on Fisher, Jimmie Savage drew attention to a problem in Fisher’s approach to testing. In describing Fisher’s aversion to power functions Savage writes, ‘Fisher says that some tests are more sensitive than others, and I cannot help suspecting that that comes to very much the same thing as thinking about the power function.’ (Savage 1976) (P473).

The modern statistician, however, has an advantage here denied to Savage. Savage’s essay was published posthumously in 1976 and the lecture on which it was based was given in Detroit on 29 December 1971 (P441). At that time Fisher’s scientific correspondence did not form part of his available oeuvre but in 1990 Henry Bennett’s magnificent edition of Fisher’s statistical correspondence (Bennett 1990) was published and this throws light on many aspects of Fisher’s thought including on significance tests.


The key letter here is Fisher’s reply of 6 October 1938 to Chester Bliss’s letter of 13 September. Bliss himself had reported an issue that had been raised with him by Snedecor on 6 September. Snedecor had pointed out that an analysis using inverse sine transformations of some data that Bliss had worked on gave a different result to an analysis of the original values. Bliss had defended his (transformed) analysis on the grounds that a) if a transformation always gave the same result as an analysis of the original data there would be no point and b) an analysis on inverse sines was a sort of weighted analysis of percentages with the transformation more appropriately reflecting the weight of information in each sample. Bliss wanted to know what Fisher thought of his reply.

Fisher replies with a ‘shorter catechism’ on transformations which ends as follows: Continue reading

Categories: Fisher, Statistics, Stephen Senn | 59 Comments

## What’s wrong with taking (1 – β)/α, as a likelihood ratio comparing H0 and H1?


Here’s a quick note on something that I often find in discussions on tests, even though it treats “power”, which is a capacity-of-test notion, as if it were a fit-with-data notion…..

1. Take a one-sided Normal test T+ with n iid samples:

H0: µ ≤  0 against H1: µ >  0

σ = 10,  n = 100,  σ/√n = σx = 1,  α = .025.

So the test would reject H0 iff Z > c.025 = 1.96. (1.96 is the “cut-off”.)

~~~~~~~~~~~~~~

2. Simple rules for alternatives against which T+ has high power:
• If we add σx (here 1) to the cut-off (here 1.96), we are at an alternative value for µ that test T+ has .84 power to detect.
• If we add 3σx to the cut-off, we are at an alternative value for µ that test T+ has ~ .999 power to detect. This value can be written as µ.999 = 4.96.

Let the observed outcome just reach the cut-off to reject the null, z = 1.96.

If we were to form a “likelihood ratio” of μ = 4.96 compared to μ0 = 0 using

[Power(T+, 4.96)]/α,

it would be 40 (.999/.025).

It is absurd to say that the alternative 4.96 is supported 40 times as much as the null, understanding support as likelihood or comparative likelihood. (The observed z = 1.96 is even closer to 0 than to 4.96. The same point can be made with less extreme cases.) What is commonly done next is to assign priors of .5 to the two hypotheses, yielding

Pr(H0 | z0) = 1/(1 + 40) = .024, so Pr(H1 | z0) = .976.

Such an inference is highly unwarranted and would almost always be wrong. Continue reading
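The arithmetic above can be checked directly, and contrasted with the actual likelihood ratio at z = 1.96 (a stdlib sketch; the variable and function names are mine, the numbers are from the post):

```python
import math

def phi(z):
    """Standard Normal cumulative distribution function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def npdf(x, mean=0.0):
    """Standard Normal density at x with the given mean (sd = 1)."""
    return math.exp(-0.5 * (x - mean) ** 2) / math.sqrt(2 * math.pi)

alpha, cutoff, z_obs, mu1 = 0.025, 1.96, 1.96, 4.96

power = 1 - phi(cutoff - mu1)   # P(Z > 1.96; mu = 4.96) = Phi(3) ~ .999
print(power / alpha)            # ~40: the would-be "likelihood ratio"

# The actual likelihood ratio at z = 1.96 compares densities, not tail areas:
print(npdf(z_obs, 0.0) / npdf(z_obs, mu1))  # ~13: the data favor the NULL
```

The genuine likelihood comparison at the observed z = 1.96 favors the null over µ = 4.96 by roughly 13 to 1, the opposite direction from the 40-to-1 verdict of (1 – β)/α.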

## 3 YEARS AGO: (JANUARY 2012) MEMORY LANE

3 years ago…

MONTHLY MEMORY LANE: 3 years ago: January 2012. I mark in red three posts that seem most apt for general background on key issues in this blog.

January 2012

• (1/3) Model Validation and the LLP-(Long Playing Vinyl Record)
• (1/8) Don’t Birnbaumize that Experiment my Friend*
• (1/10) Bad-Faith Assertions of Conflicts of Interest?*
• (1/13) U-PHIL: “So you want to do a philosophical analysis?”
• (1/14) “But You Are Probably Wrong” (Extract from Senn RMM article)
• (1/15) U-Phil: “How Can We Cultivate Senn’s-Ability?”
• (1/17) “Philosophy of Statistics”: Nelder on Lindley
• (1/19) RMM-6 Special Volume on Stat Sci Meets Phil Sci (Sprenger)
• (1/22) U-Phil: Stephen Senn (1): C. Robert, A. Jaffe, and Mayo (brief remarks)
• (1/23): Andrew Gelman
• (1/24) U-Phil (3): Stephen Senn on Stephen Senn!
• (1/26) Updating & Downdating: One of the Pieces to Pick up
• (1/29)  Skepticism, Rationality, Popper, and All That: First of 3 Parts

This new, once-a-month feature began at the blog’s 3-year anniversary in September 2014. I will count U-Phils on a single paper as one of the three I highlight (else I’d have to choose between them). I will comment on 3-year-old posts from time to time.

This Memory Lane needs a bit of explanation. This blog began largely as a forum to discuss a set of contributions from a conference I organized (with A. Spanos and J. Miller*), “Statistical Science and Philosophy of Science: Where Do (Should) They Meet?”, at the London School of Economics, Center for the Philosophy of Natural and Social Science (CPNSS), in June 2010 (where I am a visitor). Additional papers grew out of conversations initiated soon after (with Andrew Gelman and Larry Wasserman). The conference site is here. My reflections in this general arena (Sept. 26, 2012) are here.

As articles appeared in a special topic of the on-line journal Rationality, Markets and Morals (RMM), edited by Max Albert[i], also a conference participant, I would announce an open invitation to readers to take a couple of weeks to write an extended comment. Each “U-Phil”, which stands for “U philosophize”, was a contribution to this activity. I plan to go back to that exercise at some point. Generally I would give a “deconstruction” of the paper first, followed by U-Phils, and then the author gave responses to the U-Phils and to me as they wished. You can readily search this blog for all the U-Phils and deconstructions.**

I was also keeping a list of issues that we either haven’t taken up or need to return to. One example here is Bayesian updating and downdating. Further notes about the origins of this blog are here. I recommend everyone reread Senn’s paper.**

For newcomers, here’s your chance to catch up; for old timers, this is philosophy: rereading is essential!

[i] Along with Hartmut Kliemt and Bernd Lahno.

*For a full list of collaborators, sponsors, logisticians, and related collaborations, see the conference page. The full list of speakers is found there as well.

**The U-Phil exchange between Mayo and Senn was published in the same special topic of RMM. But I still wish to know how we can cultivate “Senn’s-ability.” We could continue that activity as well, perhaps.

Previous 3 YEAR MEMORY LANES:

Categories: 3-year memory lane, blog contents, Statistics, Stephen Senn, U-Phil | 2 Comments

## S. Senn: “Responder despondency: myths of personalized medicine” (Guest Post)


Stephen Senn
Competence Center for Methodology and Statistics (CCMS)
Luxembourg

Responder despondency: myths of personalized medicine

The road to drug development destruction is paved with good intentions. The 2013 FDA report, Paving the Way for Personalized Medicine, has an encouraging and enthusiastic foreword from Commissioner Hamburg and plenty of extremely interesting examples stretching back decades. Given what the report shows can be achieved on occasion, given the enthusiasm of the FDA and its commissioner, given the amazing progress in genetics emerging from the labs, a golden future of personalized medicine surely awaits us. It would be churlish to spoil the party by sounding a note of caution but I have never shirked being churlish and that is exactly what I am going to do. Continue reading

Categories: evidence-based policy, Statistics, Stephen Senn | 49 Comments

## Stephen Senn: Blood Simple? The complicated and controversial world of bioequivalence (guest post)

Blood Simple?
The complicated and controversial world of bioequivalence

by Stephen Senn*

Those not familiar with drug development might suppose that showing that a new pharmaceutical formulation (say a generic drug) is equivalent to a formulation that has a licence (say a brand name drug) ought to be simple. However, it can often turn out to be bafflingly difficult[1]. Continue reading

## Power taboos: Statue of Liberty, Senn, Neyman, Carnap, Severity

Is it taboo to use a test’s power to assess what may be learned from the data in front of us? (Is it limited to pre-data planning?) If not entirely taboo, some regard power as irrelevant post-data[i], and the reason I’ve heard is along the lines of an analogy Stephen Senn gave today (in a comment discussing his last post here)[ii].

Senn comment: So let me give you another analogy to your (very interesting) fire alarm analogy. (My analogy is imperfect, but so is the fire alarm.) If you want to cross the Atlantic from Glasgow you should do some serious calculations to decide what boat you need. However, if several days later you arrive at the Statue of Liberty, the fact that you see it is more important than the size of the boat for deciding that you did, indeed, cross the Atlantic.

My fire alarm analogy is here. My analogy presumes you are assessing the situation (about the fire) long distance. Continue reading

## Stephen Senn: “Delta Force: To what extent is clinical relevance relevant?” (Guest Post)


Stephen Senn
Competence Center for Methodology and Statistics (CCMS),
Luxembourg

Delta Force
To what extent is clinical relevance relevant?

Inspiration
This note has been inspired by a Twitter exchange with respected scientist and famous blogger David Colquhoun. He queried whether a treatment that had 2/3 of an effect that would be described as clinically relevant could be useful. I was surprised at the question, since I would regard it as being pretty obvious that it could but, on reflection, I realise that things that may seem obvious to some who have worked in drug development may not be obvious to others, and if they are not obvious to others they are either in need of a defence or wrong. I don’t think I am wrong and this note is to explain my thinking on the subject. Continue reading

Categories: power, Statistics, Stephen Senn | 38 Comments

## Sir Harold Jeffreys’ (tail area) one-liner: Sat night comedy [draft ii]

This headliner appeared last month, but to a sparse audience (likely because it was during winter break), so Management’s giving him another chance…

You might not have thought there could be new material for 2014, but there is, and if you look a bit more closely, you’ll see that it’s actually not Jay Leno who is standing up there at the mike ….

It’s Sir Harold Jeffreys himself! And his (very famous) joke, I admit, is funny. So, since it’s Saturday night, let’s listen in on Sir Harold’s howler* in criticizing the use of p-values.

“Did you hear the one about significance testers rejecting H0 because of outcomes H0 didn’t predict?

What’s unusual, is that they do it when these unpredicted outcomes haven’t even occurred!”

Much laughter.

[The actual quote from Jeffreys: Using p-values implies that “An hypothesis that may be true is rejected because it has failed to predict observable results that have not occurred. This seems a remarkable procedure.” (Jeffreys 1939, 316)]

I say it’s funny, so to see why I’ll strive to give it a generous interpretation. Continue reading

## STEPHEN SENN: Fisher’s alternative to the alternative

Reblogging 2 years ago:

By: Stephen Senn

This year [2012] marks the 50th anniversary of RA Fisher’s death. It is a good excuse, I think, to draw attention to an aspect of his philosophy of significance testing. In his extremely interesting essay on Fisher, Jimmie Savage drew attention to a problem in Fisher’s approach to testing. In describing Fisher’s aversion to power functions Savage writes, ‘Fisher says that some tests are more sensitive than others, and I cannot help suspecting that that comes to very much the same thing as thinking about the power function.’ (Savage 1976) (P473).

The modern statistician, however, has an advantage here denied to Savage. Savage’s essay was published posthumously in 1976 and the lecture on which it was based was given in Detroit on 29 December 1971 (P441). At that time Fisher’s scientific correspondence did not form part of his available oeuvre but in 1990 Henry Bennett’s magnificent edition of Fisher’s statistical correspondence (Bennett 1990) was published and this throws light on many aspects of Fisher’s thought including on significance tests.

The key letter here is Fisher’s reply of 6 October 1938 to Chester Bliss’s letter of 13 September. Bliss himself had reported an issue that had been raised with him by Snedecor on 6 September. Snedecor had pointed out that an analysis using inverse sine transformations of some data that Bliss had worked on gave a different result to an analysis of the original values. Bliss had defended his (transformed) analysis on the grounds that a) if a transformation always gave the same result as an analysis of the original data there would be no point and b) an analysis on inverse sines was a sort of weighted analysis of percentages with the transformation more appropriately reflecting the weight of information in each sample. Bliss wanted to know what Fisher thought of his reply.

Fisher replies with a ‘shorter catechism’ on transformations which ends as follows: Continue reading

Categories: Fisher, Statistics, Stephen Senn | 31 Comments

## Sir Harold Jeffreys’ (tail area) one-liner: Sat night comedy [draft ii]

You might not have thought there could be new material for 2014, but there is, and if you look a bit more closely, you’ll see that it’s actually not Jay Leno who is standing up there at the mike ….

It’s Sir Harold Jeffreys himself! And his (very famous) joke, I admit, is funny. So, since it’s Saturday night, let’s listen in on Sir Harold’s howler* in criticizing the use of p-values.

“Did you hear the one about significance testers rejecting H0 because of outcomes H0 didn’t predict?

Well, what’s unusual, is that they do it when these unpredicted outcomes haven’t even occurred!”

Much laughter.

[The actual quote from Jeffreys: Using p-values implies that “An hypothesis that may be true is rejected because it has failed to predict observable results that have not occurred. This seems a remarkable procedure.” (Jeffreys 1939, 316)]

I say it’s funny, so to see why I’ll strive to give it a generous interpretation.

We can view p-values in terms of rejecting H0, as in the joke: there’s a test statistic D such that H0 is rejected if its observed value d0 reaches or exceeds a cut-off d* where Pr(D > d*; H0) is small, say .025.
Reject H0 if Pr(D > d0; H0) < .025.
The report might be “reject H0 at level .025”.
Example: H0: the mean light deflection effect is 0. So if we observe a 1.96 standard deviation difference (in one-sided Normal testing) we’d reject H0.

Now it’s true that if the observation were further into the rejection region, say 2, 3 or 4 standard deviations, it too would result in rejecting the null, and with an even smaller p-value. It’s also true that H0 “has not predicted” a 2, 3, 4, 5 etc. standard deviation difference in the sense that differences so large are “far from” or improbable under the null. But wait a minute. What if we’ve only observed a 1 standard deviation difference (p-value = .16)? It is unfair to count it against the null that 1.96, 2, 3, 4 etc. standard deviation differences would have diverged seriously from the null, when we’ve only observed the 1 standard deviation difference. Yet the p-value tells you to compute Pr(D > 1; H0), which includes these more extreme outcomes! This is “a remarkable procedure” indeed! [i]

So much for making out the howler. The only problem is that significance tests do not do this, that is, they do not reject with, say, D = 1 because larger D values might have occurred (but did not). D = 1 does not reach the cut-off, and does not lead to rejecting H0. Moreover, looking at the tail area makes it harder, not easier, to reject the null (although this isn’t the only function of the tail area): since it requires not merely that Pr(D = d0 ; H0 ) be small, but that Pr(D > d0 ; H0 ) be small. And this is well justified because when this probability is not small, you should not regard it as evidence of discrepancy from the null. Before getting to this …. Continue reading
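The tail areas in question are quick to compute (a stdlib sketch; the function name is mine):

```python
import math

def one_sided_p(d):
    """P(D > d; H0) for a standard Normal test statistic D."""
    return 1 - 0.5 * (1 + math.erf(d / math.sqrt(2)))

print(one_sided_p(1.0))   # ~0.16: does NOT reach the .025 cut-off, so no rejection
print(one_sided_p(1.96))  # ~0.025: just reaches the cut-off
```

As the post says, D = 1 yields a p-value of about .16 and does not lead to rejection; requiring the whole tail area Pr(D > d0; H0) to be small makes rejection harder, not easier.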

Categories: Comedy, Fisher, Jeffreys, P-values, Statistics, Stephen Senn | 12 Comments

## Stephen Senn: Dawid’s Selection Paradox (guest post)

Stephen Senn
Competence Center for Methodology and Statistics (CCMS),
Luxembourg

You can protest, of course, that Dawid’s Selection Paradox is no such thing but then those who believe in the inexorable triumph of logic will deny that anything is a paradox. In a challenging paper published nearly 20 years ago (Dawid 1994), Philip Dawid drew attention to a ‘paradox’ of Bayesian inference. To describe it, I can do no better than to cite the abstract of the paper, which is available from Project Euclid, here: http://projecteuclid.org/DPubS/Repository/1.0/Disseminate?

When the inference to be made is selected after looking at the data, the classical statistical approach demands — as seems intuitively sensible — that allowance be made for the bias thus introduced. From a Bayesian viewpoint, however, no such adjustment is required, even when the Bayesian inference closely mimics the unadjusted classical one. In this paper we examine more closely this seeming inadequacy of the Bayesian approach. In particular, it is argued that conjugate priors for multivariate problems typically embody an unreasonable determinism property, at variance with the above intuition.

I consider this to be an important paper not only for Bayesians but also for frequentists, yet it has only been cited 14 times as of 15 November 2013 according to Google Scholar. In fact I wrote a paper about it in the American Statistician a few years back (Senn 2008) and have also referred to it in a previous blogpost (12 May 2012). That I think it is important and neglected is excuse enough to write about it again.

Philip Dawid is not responsible for my interpretation of his paradox, but the way that I understand it can be explained by considering what it means to have a prior distribution. As a reminder: if you are going to be 100% Bayesian, which is to say that all you will do by way of inference is turn a prior into a posterior distribution using the likelihood and the operation of Bayes theorem, then your prior distribution has to satisfy two conditions. First, it must be what you would use to bet now (that is to say, at the moment it is established) and, second, no amount of subsequent data will change your prior qua prior. It will, of course, be updated by Bayes theorem to form a posterior distribution once further data are obtained, but that is another matter. The relevant time here is your observation time, not the time when the data were collected, so data that were available in principle but only came to your attention after you established your prior distribution count as further data.

Now suppose that you are going to make an inference about a population mean, θ, using a random sample from the population, and choose the standard conjugate prior distribution. In that case you will use a Normal distribution with parameters μ and σ² that are known (to you). If σ² is large compared to the random variation you might expect for the means in your sample, then the prior distribution is fairly uninformative; if it is small, then it is fairly informative. But being uninformative is not in itself a virtue. Being not informative enough runs the risk that your prior distribution is not one you would wish to use to bet now, and being too informative, that your prior distribution is one you might be tempted to change given further information. In either of these two cases your prior distribution will be wrong. Thus the task is to be neither too informative nor not informative enough.
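
To make the conjugate machinery concrete, here is a minimal sketch (in Python, with illustrative numbers of my own, not Senn's) of how a Normal prior with parameters μ and σ² is turned into a posterior for a mean, and of how the choice of σ² governs how informative the prior is:

```python
def normal_posterior(mu0, sigma0_sq, xbar, se_sq):
    """Posterior mean and variance for a Normal mean theta under a
    conjugate Normal(mu0, sigma0_sq) prior, given a sample mean xbar
    whose sampling variance se_sq (sigma^2/n) is treated as known."""
    w = (1 / sigma0_sq) / (1 / sigma0_sq + 1 / se_sq)  # weight on the prior
    post_mean = w * mu0 + (1 - w) * xbar
    post_var = 1 / (1 / sigma0_sq + 1 / se_sq)
    return post_mean, post_var

# A vague prior (large sigma0_sq) barely moves the data's answer...
print(normal_posterior(0.0, 100.0, 2.0, 1.0))
# ...while an informative one (small sigma0_sq) pulls it towards mu0:
print(normal_posterior(0.0, 0.25, 2.0, 1.0))
```

The posterior mean is a precision-weighted average of prior mean and sample mean, which is exactly why a too-informative prior is one you may later regret and a too-vague one is not what you would bet on now.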

## Highly probable vs highly probed: Bayesian/ error statistical differences

A reader asks: “Can you tell me about disagreements on numbers between a severity assessment within error statistics, and a Bayesian assessment of posterior probabilities?” Sure.

There are differences between Bayesian posterior probabilities and formal error statistical measures, as well as between the latter and a severity (SEV) assessment, which differs from the standard type 1 and 2 error probabilities, p-values, and confidence levels—despite the numerical relationships. Here are some random thoughts that will hopefully be relevant for both types of differences. (Please search this blog for specifics.)

1. The most noteworthy difference is that error statistical inference makes use of outcomes other than the one observed, even after the data are available: there’s no other way to ask things like, how often would you find 1 nominally statistically significant difference in a hunting expedition over k or more factors?  Or to distinguish optional stopping with sequential trials from fixed sample size experiments.  Here’s a quote I came across just yesterday:

“[S]topping ‘when the data looks good’ can be a serious error when combined with frequentist measures of evidence. For instance, if one used the stopping rule [above]…but analyzed the data as if a fixed sample had been taken, one could guarantee arbitrarily strong frequentist ‘significance’ against H0.” (Berger and Wolpert, 1988, 77).

The worry about being guaranteed to erroneously exclude the true parameter value here is an error statistical affliction that the Bayesian is spared (even though I don’t think they can be too happy about it, especially when HPD intervals are assured of excluding the true parameter value). See this post for an amusing note; Mayo and Kruse (2001) below; and, if interested, search the (strong) likelihood principle, and Birnbaum.
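
The Berger and Wolpert point about stopping "when the data looks good" is easy to verify by simulation. The sketch below (my own illustration; the sample sizes, seed, and function name are all arbitrary choices) applies a z-test after every new observation under a true null and counts how often nominal significance is ever reached:

```python
import math
import random

def optional_stopping_rate(n_max=200, n_sims=2000, z_crit=1.96, seed=1):
    """Fraction of experiments with a true null (theta = 0) that reach
    nominal significance at SOME interim look when a z-test is applied
    after every new observation."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        total = 0.0
        for n in range(1, n_max + 1):
            total += rng.gauss(0.0, 1.0)            # one more observation
            if abs(total / math.sqrt(n)) > z_crit:  # test H0: theta = 0
                hits += 1
                break
    return hits / n_sims

# far above the 5% that a single fixed-sample test would give
print(optional_stopping_rate())
```

With 200 looks the rate comes out well above the nominal 5%, and letting `n_max` grow without bound drives it towards 1, which is the "arbitrarily strong frequentist significance" of the quotation.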

2. Highly probable vs. highly probed. SEV doesn’t obey the probability calculus: for any test T and outcome x, the severity for both H and ~H might be horribly low. Moreover, an error statistical analysis is not in the business of probabilifying hypotheses but evaluating and controlling the capabilities of methods to discern inferential flaws (problems with linking statistical and scientific claims, problems of interpreting statistical tests and estimates, and problems of underlying model assumptions). This is the basis for applying what may be called the Severity principle.

## Stephen Senn: Open Season (guest post)

Stephen Senn
Competence Center for Methodology and Statistics (CCMS),
Luxembourg

“Open Season”

The recent joint statement(1) by the Pharmaceutical Research and Manufacturers of America (PhRMA) and the European Federation of Pharmaceutical Industries and Associations (EFPIA) represents a further step in what has been a slow journey towards what (one assumes) will be the eventually achieved goal of sharing clinical trial data. In my inaugural lecture of 1997 at University College London I called for all pharmaceutical companies to develop a policy for sharing trial results, and I have repeated this call in many places since(2-5). Thus I can hardly complain if what I have been calling for for over 15 years is now close to being achieved.

However, I have recently been thinking about it again, and it seems to me that there are some problems that need to be addressed. One is the issue of patient confidentiality. Ideally, covariate information should be exploitable, since such information often increases the precision of inferences and also the utility of decisions based upon them, because it (potentially) increases the possibility of personalising medical interventions. However, providing patient-level data increases the risk of breaching confidentiality. This is a complicated and difficult issue about which, however, I have nothing useful to say. Instead I want to consider another matter: what will be the influence on the quality of the inferences we make of enabling many subsequent researchers to analyse the same data?

One of the reasons that many researchers have called for all trials to be published is that trials that are missing tend to be different from those that are present. Thus there is a bias in summarising evidence from published trials only, and it can be a difficult task, with no guarantee of success, to identify those that have not been published. This is a wider reflection of the problem of missing data within trials. Such data have long worried trialists, and the Food and Drug Administration (FDA) itself has commissioned a report on the subject from leading experts(6). On the European side, the Committee for Medicinal Products for Human Use (CHMP) has a guideline dealing with it(7).

However, the problem is really a particular example of data filtering and it also applies to statistical analysis. If the analyses that are present have been selected from a wider set, then there is a danger that they do not provide an honest reflection of the message that is in the data. This problem is known as that of multiplicity and there is a huge literature dealing with it, including regulatory guidance documents(8, 9).

Within drug regulation this is dealt with by having pre-specified analyses. The broad outlines of these are usually established in the trial protocol, and the approach is then specified in some detail in the statistical analysis plan, which is required to be finalised before un-blinding of the data. The strategies used to control for multiplicity will involve some combination of defining a significance testing route (an order in which tests must be performed and associated decision rules) and reduction of the required level of significance for declaring an effect.
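
The arithmetic behind such reduction of the significance level is simple. As a sketch (my own illustration, not part of any guidance document): with k independent tests of true null hypotheses, the chance of at least one false positive is 1 − (1 − α)^k, and the Bonferroni device of testing each at α/k pulls the familywise rate back below α:

```python
def familywise_error(k, alpha=0.05):
    """Chance of at least one false positive among k independent tests
    of true null hypotheses, each carried out at level alpha."""
    return 1 - (1 - alpha) ** k

# unadjusted, the familywise rate grows quickly with the number of tests
print(familywise_error(10))                   # about 0.40
# the Bonferroni reduction alpha/k pulls it back below the nominal level
print(familywise_error(10, alpha=0.05 / 10))  # about 0.049
```

Real regulatory strategies are usually more elaborate (gatekeeping sequences, Holm or Hochberg procedures), but this is the inflation they are designed to control.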

I am not a great fan of these manoeuvres, which can be extremely complex. One of my objections is that it is effectively assumed that the researchers who chose them are mandated to circumscribe the inferences that scientific posterity can make(10). I take the rather more liberal view that provided that everything that is tested is reported one can test as much as one likes. The problem comes if there is selective use of results and in particular selective reporting. Nevertheless, I would be the first to concede the value of pre-specification in clarifying the thinking of those about to embark on conducting a clinical trial and also in providing a ‘template of trust’ for the regulator when provided with analyses by the sponsor.

However, what should be our attitude to secondary analyses? From one point of view these should be welcome. There is always value in looking at data from different perspectives, and indeed this can be one way of strengthening inferences, in the way suggested nearly 50 years ago by Platt(11). There are two problems, however. First, not all perspectives are equally valuable. Some analyses in the future, no doubt, will be carried out by those with little expertise and, in some cases, perhaps, by those with a particular viewpoint to justify. There is also the danger that some will carry out multiple analyses (of which, when one considers the possibility of changing endpoints, performing transformations, choosing covariates and modelling frameworks, there are usually a great number) but then only present those that are ‘interesting’. It is precisely to avoid this danger that the ritual of pre-specified analysis is insisted upon by regulators. Must we also insist upon it for those seeking to reanalyse?

To do so would require such persons to do two things. First, they would have to register the analysis plan before being granted access to the data. Second, they would have to promise to make the analysis results available; otherwise we will have a problem of missing analyses to go with the problem of missing trials. I think it is true to say that we are just beginning to feel our way with this. It may be that the chance has been lost and that the whole of clinical research will be ‘world wide webbed’: there will be a mass of information out there but we just won’t know what to believe. Whatever happens, the era of privileged statistical analyses by the original data collectors is disappearing fast.

[Ed. note: Links to some earlier related posts by Prof. Senn are:  “Casting Stones” 3/7/13, “Also Smith & Jones” 2/23/13, and “Fooling the Patient: An Unethical Use of Placebo?” 8/2/12 .]

References

1. PhRMA, EFPIA. Principles for Responsible Clinical Trial Data Sharing. PhRMA; 2013 [cited 31 August 2013]; Available from: http://phrma.org/sites/default/files/pdf/PhRMAPrinciplesForResponsibleClinicalTrialDataSharing.pdf.

2. Senn SJ. Statistical quality in analysing clinical trials. Good Clinical Practice Journal. [Research Paper]. 2000;7(6):22-6.

3. Senn SJ. Authorship of drug industry trials. Pharm Stat. [Editorial]. 2002;1:5-7.

4. Senn SJ. Sharp tongues and bitter pills. Significance. [Review]. 2006 September 2006;3(3):123-5.

5. Senn SJ. Pharmaphobia: fear and loathing of pharmaceutical research [pdf]. 1997 [updated 31 August 2013; cited 31 August 2013]. Updated version of a paper originally published on PharmInfoNet.

6. Little RJ, D’Agostino R, Cohen ML, Dickersin K, Emerson SS, Farrar JT, et al. The prevention and treatment of missing data in clinical trials. N Engl J Med. 2012 Oct 4;367(14):1355-60.

7. Committee for Medicinal Products for Human Use (CHMP). Guideline on Missing Data in Confirmatory Clinical Trials London: European Medicine Agency; 2010. p. 1-12.

8. Committee for Proprietary Medicinal Products. Points to consider on multiplicity issues in clinical trials. London: European Medicines Evaluation Agency; 2002.

9. International Conference on Harmonisation. Statistical principles for clinical trials (ICH E9). Statistics in Medicine. 1999;18:1905-42.

10. Senn S, Bretz F. Power and sample size when multiple endpoints are considered. Pharm Stat. 2007 Jul-Sep;6(3):161-70.

11. Platt JR. Strong Inference: Certain systematic methods of scientific thinking may produce much more rapid progress than others. Science. 1964 Oct 16;146(3642):347-53.

## Stephen Senn: Indefinite irrelevance

Stephen Senn
Competence Center for Methodology and Statistics (CCMS),
Luxembourg

At a workshop on randomisation I attended recently I was depressed to hear what I regard as hackneyed untruths treated as if they were important objections. One of these is that of indefinitely many confounders. The argument goes that although randomisation may make it probable that some confounders are reasonably balanced between the arms, since there are indefinitely many of these, the chance that at least some are badly confounded is so great as to make the procedure useless.

This argument is wrong for several related reasons. The first is to do with the fact that the total effect of these indefinitely many confounders is bounded. The argument put forward is analogously false to one in which it were claimed that the infinite series ½, ¼, ⅛ … did not sum to a limit because there were infinitely many terms. The fact is that the outcome value one wishes to analyse places a limit on the possible influence of the covariates. Suppose that we were able to measure a number of covariates on a set of patients prior to randomisation (in fact this is usually not possible, but that does not matter here). Now construct principal components, C1, C2, … based on these covariates. We suppose that each of these predicts to a greater or lesser extent the outcome, Y (say). In a linear model we could put coefficients on these components, k1, k2, … (say). However, one is not free to postulate anything at all by way of values for these coefficients, since it has to be the case for any set of m such coefficients that k1²V(C1) + k2²V(C2) + … + km²V(Cm) ≤ V(Y), where V( ) indicates the variance of its argument. (Because principal components are mutually uncorrelated, the left-hand side is simply the variance of the prediction, and a prediction cannot vary more than the thing being predicted.) Thus variation in outcome bounds variation in prediction. This total variation in outcome has to be shared between the predictors, and the more predictors you postulate, the smaller, on average, the influence per predictor.
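
The bound is easy to check numerically. The following sketch (my own, using NumPy; the data are simulated and all settings illustrative) regresses a simulated outcome on the principal components of twenty correlated covariates and confirms that the summed contributions k_i²V(C_i) cannot exceed V(Y):

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate 20 correlated baseline covariates for 500 patients and an
# outcome Y that they partly predict (all numbers illustrative).
n, p = 500, 20
X = rng.standard_normal((n, p)) @ rng.standard_normal((p, p))
beta = rng.standard_normal(p) * 0.1
Y = X @ beta + rng.standard_normal(n)

# Principal components are mutually uncorrelated, so the regression
# coefficient on each component is just cov(Y, C_i) / var(C_i).
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
C = Xc @ Vt.T                       # principal components C_1 .. C_p
k = [np.cov(Y, C[:, i])[0, 1] / C[:, i].var(ddof=1) for i in range(p)]

# Senn's bound: the summed predictive contributions of the components
# cannot exceed the variance of the outcome itself.
explained = sum(k[i] ** 2 * C[:, i].var(ddof=1) for i in range(p))
assert explained <= Y.var(ddof=1)
print(explained, Y.var(ddof=1))
```

However many components one adds, `explained` can only creep up towards `Y.var()` and never past it, which is exactly the sense in which the influence of indefinitely many confounders is bounded.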

The second error is to ignore the fact that statistical inference does not proceed on the basis of signal alone but also on noise. It is the ratio of these that is important. If there are indefinitely many predictors, there is no reason to suppose that their influence on the variation between treatment groups will be bigger than their influence on the variation within groups, and both of these are used to make the inference.
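
A small simulation illustrates the point (my own sketch; the settings and function name are illustrative): even when the outcome is built entirely from many unmeasured covariates, a standard two-sample test after randomisation still rejects a true null at close to its nominal rate, because the covariates inflate the within-group noise in step with the between-group signal:

```python
import math
import random
import statistics

def null_rejection_rate(n_per_arm=30, n_cov=20, n_sims=1000, seed=7):
    """Type I error rate of a two-sample z-test under a true null when
    the outcome is the sum of many unmeasured covariate effects."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_sims):
        # each patient's outcome is driven by n_cov small covariate effects
        outcomes = [sum(rng.gauss(0, 1) for _ in range(n_cov))
                    for _ in range(2 * n_per_arm)]
        rng.shuffle(outcomes)  # randomise patients to the two arms
        a, b = outcomes[:n_per_arm], outcomes[n_per_arm:]
        se = math.sqrt(statistics.variance(a) / n_per_arm +
                       statistics.variance(b) / n_per_arm)
        if abs(statistics.mean(a) - statistics.mean(b)) / se > 1.96:
            rejections += 1
    return rejections / n_sims

# stays close to the nominal 5% despite the many "confounders"
print(null_rejection_rate())
```

The unmeasured covariates do make individual trials imbalanced, but they feed the very error term against which the treatment difference is judged, which is the sense in which the signal-to-noise ratio, not the signal alone, carries the inference.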


## Gelman sides w/ Neyman over Fisher in relation to a famous blow-up

blog-o-log

Andrew Gelman had said he would go back to explain why he sided with Neyman over Fisher in relation to a big, famous argument discussed on my Feb. 16, 2013 post: “Fisher and Neyman after anger management?”, and I just received an e-mail from Andrew saying that he has done so: “In which I side with Neyman over Fisher”. (I’m not sure what Senn’s reply might be.) Here it is:

“In which I side with Neyman over Fisher” Posted by Andrew Gelman on 24 May 2013, 9:28 am

As a data analyst and a scientist, Fisher > Neyman, no question. But as a theorist, Fisher came up with ideas that worked just fine in his applications but can fall apart when people try to apply them too generally.

Here’s an example that recently came up.

Deborah Mayo pointed me to a comment by Stephen Senn on the so-called Fisher and Neyman null hypotheses. In an experiment with n participants (or, as we used to say, subjects or experimental units), the Fisher null hypothesis is that the treatment effect is exactly 0 for every one of the n units, while the Neyman null hypothesis is that the individual treatment effects can be negative or positive but have an average of zero.

Senn explains why Neyman’s hypothesis in general makes no sense—the short story is that Fisher’s hypothesis seems relevant in some problems (sometimes we really are studying effects that are zero or close enough for all practical purposes), whereas Neyman’s hypothesis just seems weird (it’s implausible that a bunch of nonzero effects would exactly cancel). And I remember a similar discussion as a student, many years ago, when Rubin talked about that silly Neyman null hypothesis.


## Guest post: Bad Pharma? (S. Senn)

Professor Stephen Senn*