S. Senn

S. Senn: To infinity and beyond: how big are your data, really? (guest post)


 

Stephen Senn
Consultant Statistician
Edinburgh

What is this you boast about?

Failure to understand components of variation is the source of much mischief. It can lead researchers to overlook that they can be rich in data-points but poor in information. The important thing is always to understand what varies in the data you have, and to what extent your design, and the purpose you have in mind, master it. The result of failing to understand this can be that you mistakenly calculate standard errors of your estimates that are too small because you divide the variance by an n that is too big. In fact, the problems can go further than this, since you may even pick up the wrong covariance and hence use inappropriate regression coefficients to adjust your estimates.

I shall illustrate this point using clinical trials in asthma.

Breathing lessons

Suppose that I design a clinical trial in asthma as follows. I have six centres, each centre has four patients, each patient will be studied in two episodes of seven days and during these seven days the patients will be measured daily, that is to say, seven times per episode. I assume that between the two episodes of treatment there is a period of some days in which no measurements are taken. In the context of a cross-over trial, which I may or may not decide to run, such a period is referred to as a washout period.

The block structure is like this:

Centres/Patients/Episodes/Measurements

The / sign is a nesting operator and it shows, for example, that I have Patients ‘nested’ within centres. For example, I could label the patients 1 to 4 in each centre, but I don’t regard patient 3 (say) in centre 1 as being somehow similar to patient 3 in centre 2 and patient 3 in centre 3 and so forth. Patient is a term that is given meaning by referring it to centre.

The block structure is shown in Figure 1, which does not, however, show the seven measurements per episode.

Figure 1. Schematic representation of the block structure for some possible clinical trials. The six centres are shown by black lines. For each centre there are four patients shown by blue lines and each patient is studied in two episodes, shown by red lines.

I now wish to compare two treatments, two so-called beta-agonists. The first of these I shall call Zephyr and the second Mistral. I shall do this using a measure of lung function called forced expiratory volume in one second (FEV1). If there are no dropouts and no missing measurements, I shall have 6 x 4 x 2 x 7 = 336 FEV1 readings. Is this my ‘n’?
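To make the layout concrete, here is a minimal sketch in Python (rather than Genstat®) that simply enumerates the nested structure; the column names are mine, introduced purely for illustration.

```python
import itertools
import pandas as pd

# Enumerate the nested layout: 6 centres x 4 patients x 2 episodes x 7 measurements.
layout = pd.DataFrame(
    list(itertools.product(range(1, 7), range(1, 5), range(1, 3), range(1, 8))),
    columns=["centre", "patient", "episode", "measurement"],
)
print(len(layout))  # 336 rows -- but are these 336 independent pieces of information?
```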

I am going to use Genstat®, a package that fully incorporates John Nelder’s ideas of general balance [1, 2] and the analysis of designed experiments and uses, in fact, what I have called the Rothamsted approach to experiments.

I start by declaring the block structure thus

BLOCKSTRUCTURE Centre/Patient/Episode/Measurement

This is the ‘null’ situation: it describes the variation in the experimental material before any treatment is applied. If I ask Genstat® to produce a ‘null’ skeleton analysis of variance for me by typing the statement

ANOVA

then the output is as given in Table 1.

Analysis of variance

Source of variation                                 d.f.
Centre stratum                                         5
Centre.Patient stratum                                18
Centre.Patient.Episode stratum                        24
Centre.Patient.Episode.Measurement stratum           288
Total                                                335

Table 1. Degrees of freedom for a null analysis of variance for a nested block structure.

This only gives me possible sources of variation and degrees of freedom associated with them but not the actual variances: that would require data. There are six centres, so five degrees of freedom between centres. There are four patients per centre, so three degrees of freedom per centre between patients but there are six centres and therefore 6 x 3 = 18 in total. There are two episodes per patient and so one degree of freedom between episodes per patient but there are 24 patients and so 24 degrees of freedom in total. Finally, there are seven measurements per episode and hence six degrees of freedom but 48 episodes in total so  48 x 6 = 288 degrees of freedom for measurements.
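The degree-of-freedom bookkeeping can be checked mechanically. A small sketch (Python, with the trial dimensions as plain numbers) that reproduces Table 1:

```python
m, n, p, k = 6, 4, 2, 7  # centres, patients per centre, episodes per patient, measurements per episode

df = {
    "Centre stratum": m - 1,                                             # 5
    "Centre.Patient stratum": m * (n - 1),                               # 18
    "Centre.Patient.Episode stratum": m * n * (p - 1),                   # 24
    "Centre.Patient.Episode.Measurement stratum": m * n * p * (k - 1),   # 288
}
for stratum, d in df.items():
    print(f"{stratum:45s}{d:5d}")
print(f"{'Total':45s}{sum(df.values()):5d}")                             # 335 = 336 - 1
```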

Having some actual data would put flesh on the bones of this skeleton by giving me some mean square errors, but to understand the general structure this is not necessary. It tells me that at the highest level I will have variation between centres, next patients within centres, after that episodes within patients and finally measurements within episodes. Which of these are relevant to judging the effect of any treatments I wish to study depends how I allocate treatments.

Design matters

I now consider three possible approaches to allocating treatments to patients. In each of the three designs, the same number of measurements will be available for each treatment. There will be 168 measurements under Zephyr and 168 measurements under Mistral and thus 336 in total. However, as I shall show, the designs will be very different, and this will lead to different analyses being appropriate and lead us to understand better what our n really is.

I shall also suppose that we are interested in causal analysis rather than prediction. That is to say, we are interested in estimating the effect that the treatments did have (actually, the difference in their effects) in the trial that was actually run. The matter of predicting what would happen in future to other patients is much more delicate and raises other issues and I shall not address it here, although I may do so in future. For further discussion see my paper Added Values[3].

In the first experiment, I carry out a so-called cluster-randomised trial. I choose three centres at random and all patients in the three centres chosen, in both episodes and on all occasions, receive Zephyr. For the other three centres, all patients on all occasions receive Mistral. I create a factor Treatment (cluster trial) (Cluster for short) which encodes this allocation so that the pattern of allocation to Zephyr or Mistral reflects this randomised scheme.

In the second experiment, I carry out a parallel group trial blocking by centre. In each centre, I choose two patients to receive Zephyr and two to receive Mistral. Thus, overall, there are 6 x 2 = 12 patients on each treatment. I create a factor Treatment (parallel trial) (Parallel for short) to reflect this.

The third experiment consists of a cross-over trial. Each patient is randomised to one of two sequences, either receiving Zephyr in episode one and Mistral in episode two, or vice versa. Each patient receives both treatments so that there will be 6 x 4 = 24 patients given each treatment. I create a factor Treatment (cross-over trial) (Cross-over for short) to encode this.

Note that the total number of measurements obtained is the same for each of the three schemes. For the cluster randomised trial, a given treatment will be studied in three centres, each of which has four patients, each of whom will be studied in two episodes on seven occasions. Thus, we have 3 x 4 x 2 x 7 = 168 measurements per treatment. For the parallel group trial, 12 patients are studied for a given treatment in two episodes, each providing 7 measurements. Thus, we have 12 x 2 x 7 = 168 measurements per treatment. For the cross-over trial, we have 24 patients, each of whom will receive a given treatment in one episode (either episode one or two), so we have 24 x 1 x 7 = 168 measurements per treatment.
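As a check on this arithmetic, the three allocation factors can be encoded on the 336-row layout sketched earlier; the particular centres, patients and episode order used here are fixed purely for illustration, whereas in the actual designs they would be randomised.

```python
import itertools
import pandas as pd

layout = pd.DataFrame(
    list(itertools.product(range(1, 7), range(1, 5), range(1, 3), range(1, 8))),
    columns=["centre", "patient", "episode", "measurement"],
)

# True = Zephyr, False = Mistral; one (unrandomised) instance of each allocation scheme.
layout["cluster"] = layout.centre.isin([1, 2, 3])   # three whole centres per treatment
layout["parallel"] = layout.patient.isin([1, 2])    # two patients per centre per treatment
layout["crossover"] = layout.episode.eq(1)          # one episode per patient per treatment

print(layout[["cluster", "parallel", "crossover"]].sum())  # 168 Zephyr measurements under each design
```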

Thus, from one point of view the n in the data is the same for each of these three designs. However, each of the three designs provides very different amounts of information and this alone should be enough to warn anybody against assuming that all problems of precision can be solved by increasing the number of data.

Controlled Analysis

Before collecting any data, I can analyse this scheme and use Nelder’s approach to tell me where the information is in each scheme.

Using the three factors to encode the corresponding allocation, I now ask Genstat® to prepare a dummy analysis of variance (in advance of having collected any data) as follows. All I need to do is type a statement of the form

TREATMENTSTRUCTURE Design
ANOVA

where Design is set equal to Cluster, Parallel or Cross-over, as the case may be. The result is shown in Table 2.

 

Analysis of variance

Source of variation                                 d.f.
Centre stratum
  Treatment (cluster trial)                            1
  Residual                                             4
Centre.Patient stratum
  Treatment (parallel trial)                           1
  Residual                                            17
Centre.Patient.Episode stratum
  Treatment (cross-over trial)                         1
  Residual                                            23
Centre.Patient.Episode.Measurement stratum           288
Total                                                335

Table 2. Analysis of variance skeleton for three possible designs using the block structure given in Table 1.

This shows us that the three possible designs will have quite different degrees of precision associated with them. Since, for the cluster trial, any given centre only receives one of the treatments, the variation between centres affects the estimate of the treatment effect and its standard error must reflect this. Since, however, the parallel trial balances treatments by centres, it is unaffected by variation between centres. It is, however, affected by variation between patients. This variation is, in turn, eliminated by the cross-over trial which, in consequence, is only affected by variation between episodes (although this variation will, itself, inherit variation from measurements). Each higher level of variation inherits variation from the lower levels but adds its own.

Note, however, that for all three designs the unbiased estimate of the treatment effect is the same. All that is necessary is to average the 168 measurements under Zephyr and the 168 under Mistral and calculate the difference. What varies is the appropriate estimate of the variance of that treatment estimate.

Suppose that, more generally, we have m centres, with n patients per centre and p episodes per patient, with the number of measurements per episode fixed. Then for the cross-over trial the variance of our estimate will be proportional to \sigma_E^2/(mnp), where \sigma_E^2 is the variance between episodes. For the parallel group trial, there will be a further term involving \sigma_P^2/(mn), where \sigma_P^2 is the variance between patients. Finally, for the cluster randomised trial there will be a further term involving \sigma_C^2/m, where \sigma_C^2 is the variance between centres.

The consequences of this are that you cannot decrease the variance of a cluster randomised trial indefinitely simply by increasing the number of patients; it is the number of centres you need to increase. Likewise, you cannot decrease the variance of a parallel group trial indefinitely by increasing the number of episodes; it is the number of patients you need to increase.
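A small Monte Carlo sketch makes the same point without any algebra. It simulates the null nested model with assumed (purely illustrative) variance components, forms the same simple difference of 168-measurement means under each design, and compares the spread of that estimate across repeated trials. The variance components and the fixed allocations below are my assumptions, not values from the post.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, p, k = 6, 4, 2, 7                  # centres, patients/centre, episodes/patient, measurements/episode
s_C, s_P, s_E, s_M = 2.0, 1.5, 1.0, 0.5  # illustrative SDs for centre, patient, episode, measurement effects

def estimate(design):
    # Null model: no treatment effect, so the spread of the estimate is pure design-driven noise.
    centre = rng.normal(0, s_C, m)
    patient = rng.normal(0, s_P, (m, n))
    episode = rng.normal(0, s_E, (m, n, p))
    meas = rng.normal(0, s_M, (m, n, p, k))
    y = centre[:, None, None, None] + patient[:, :, None, None] + episode[:, :, :, None] + meas

    z = np.zeros((m, n, p, k), dtype=bool)   # True = Zephyr; one fixed allocation per design
    if design == "cluster":
        z[: m // 2] = True                   # three whole centres
    elif design == "parallel":
        z[:, : n // 2] = True                # two patients in every centre
    else:
        z[:, :, 0] = True                    # one episode for every patient (cross-over)
    return y[z].mean() - y[~z].mean()        # same simple difference of 168-measurement means

for design in ("cluster", "parallel", "crossover"):
    ests = [estimate(design) for _ in range(4000)]
    print(f"{design:10s} empirical variance of the estimate: {np.var(ests):.3f}")
```

With these settings the cluster trial's estimate is by far the most variable and the cross-over's the least, even though all three use exactly 336 measurements; increasing k or n does almost nothing for the cluster trial because its variance is dominated by the centre component.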

Degrees of Uncertainty

Why should this matter? Why should it matter how certain we are about anything? There are several reasons. Bayesian statisticians need to know what relative weight to give their prior belief and the evidence from the data. If they do not, they do not know how to produce a posterior distribution. If they do not know what the variances of both data and prior are, they don’t know the posterior variance. Frequentists and Bayesians are often required to combine evidence from various sources as, say, in a so-called meta-analysis. They need to know what weight to give to each and again to assess the total information available at the end. Any rational approach to decision-making requires an appreciation of the value of information. If one had to make a decision with no further prospect of obtaining information based on a current estimate it might make little difference how precise it was but if the option of obtaining further information at some cost applies, this is no longer true. In short, estimation of uncertainty is important. Indeed, it is a central task of statistics.

Finally, there is one further point that is important. What applies to variances also applies to covariances. If you are adjusting for a covariate using a regression approach, then the standard estimate of the coefficient of adjustment will involve a covariance divided by a variance. Just as there can be variances at various levels, there can be covariances at various levels. It is important to establish which is relevant [4], otherwise you will calculate the adjustment incorrectly.
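To see why the level matters for a regression adjustment, here is a toy simulation (my own construction, not from the post) in which the relationship between a covariate and the outcome runs one way within centres and the opposite way between centres; a pooled slope that ignores the block structure then gives a quite different adjustment from the within-centre slope.

```python
import numpy as np

rng = np.random.default_rng(2)
centres, per_centre = 6, 40

xs, ys = [], []
for c in range(centres):
    cx = float(c)                      # centre mean of the covariate
    cy = 10.0 - 2.0 * cx               # centre means fall as the covariate rises (between-centre slope -2)
    x = cx + rng.normal(0, 1, per_centre)
    y = cy + 0.5 * (x - cx) + rng.normal(0, 0.5, per_centre)   # within-centre slope +0.5
    xs.append(x)
    ys.append(y)

x_all, y_all = np.concatenate(xs), np.concatenate(ys)
pooled_slope = np.polyfit(x_all, y_all, 1)[0]                           # mixes the two levels of covariance
within_slope = np.mean([np.polyfit(x - x.mean(), y - y.mean(), 1)[0]    # centre-by-centre, then averaged
                        for x, y in zip(xs, ys)])
print(f"pooled slope: {pooled_slope:+.2f}   within-centre slope: {within_slope:+.2f}")
```

The pooled coefficient is pulled towards the between-centre relationship, so using it to adjust a within-centre treatment comparison would, except fortuitously, be wrong.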

Consequences

Just because you have many data does not mean that you will come to precise conclusions: the variance of the effect estimate may not, as one might naively suppose, be inversely proportional to the number of data, but rather to the number of some much rarer feature in the data-set. Failure to appreciate this has led to excessive enthusiasm for the use of synthetic patients and historical controls as alternatives to concurrent controls. However, the relevant dominating component of variation is that between studies, not between patients. This does not shrink to zero as the number of subjects goes to infinity. It does not even shrink to zero as the number of studies goes to infinity, since if the current study is the only one in which the new treatment has been studied, the relevant variance for that arm is at least \sigma_{St}^2/1, where \sigma_{St}^2 is the variance between studies, even if, for the ‘control’ data-set, it may be negligible, thanks to data collected from many subjects in many studies.
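As a rough formalisation of this last point (my own sketch, under a simple model in which study effects are exchangeable with variance \sigma_{St}^2, the new treatment is observed only in the single current study, and the control arm is assembled from K historical studies):

$$\operatorname{Var}\big(\hat{\Delta}\big) \;\geq\; \sigma_{St}^2\left(1 + \frac{1}{K}\right) \;\longrightarrow\; \sigma_{St}^2 \quad \text{as } K \to \infty,$$

however many subjects each study contributes: the between-study component puts a floor under the achievable precision of the comparison.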

There is a lesson also for epidemiology here. All too often, the argument in the epidemiological and, more recently, the causal literature has been about which effects one should control for or condition on, without appreciating that merely stating what should be controlled for does not settle how it should be done. I am not talking here about the largely sterile debate, to which I have contributed myself [5], as to how, at a given level, adjustment should be made for possible confounders (for example, propensity score or linear model), but about the level at which such adjustment can be made. The usual implicit assumption is that an observational study is somehow a deficient parallel group trial, with maybe complex and perverse allocation mechanisms that must somehow be adjusted for, but that once such adjustments have been made, precision increases as the number of subjects increases. But suppose the true analogy is a cluster randomised trial. Then, whatever you adjust for, your standard errors will be too small.

Finally, it is my opinion, that much of the discussion about Lord’s paradox would have benefitted from an appreciation of the issue of components of variance. I am used to informing medical clients that saying we will analyse the data using analysis of variance is about as useful as saying we will treat the patients with a pill. The varieties of analysis of variance are legion and the same is true of analysis of covariance. So, you conditioned on the baseline values. Bravo! But how did you condition on them? If you used a slope obtained at the wrong level of the data then, except fortuitously, your adjustment will be wrong, as will the precision you claim for it.

Finally, if I may be permitted an auto-quote, the price one pays for not using concurrent control is complex and unconvincing mathematics. That complexity may be being underestimated by those touting ‘big data’.


References

  1. Nelder JA. The analysis of randomised experiments with orthogonal block structure I. Block structure and the null analysis of variance. Proceedings of the Royal Society of London. Series A 1965; 283: 147-162.
  2. Nelder JA. The analysis of randomised experiments with orthogonal block structure II. Treatment structure and the general analysis of variance. Proceedings of the Royal Society of London. Series A 1965; 283: 163-178.
  3. Senn SJ. Added Values: Controversies concerning randomization and additivity in clinical trials. Statistics in Medicine 2004; 23: 3729-3753.
  4. Kenward MG, Roger JH. The use of baseline covariates in crossover studies. Biostatistics 2010; 11: 1-17.
  5. Senn SJ, Graf E, Caputo A. Stratification for the propensity score compared with linear regression techniques to assess the effect of treatment or exposure. Statistics in Medicine 2007; 26: 5529-5544.

Some relevant blogposts

Lord’s Paradox:

    • (11/11/18) Stephen Senn: Rothamsted Statistics meets Lord’s Paradox (Guest Post)
    • (11/22/18) Stephen Senn: On the level. Why block structure matters and its relevance to Lord’s paradox (Guest Post)

Personalized Medicine:

    • (01/30/18) S. Senn: Evidence Based or Person-centred? A Statistical debate (Guest Post)
    • (7/11/18) S. Senn: Personal perils: are numbers needed to treat misleading us as to the scope for personalised medicine? (Guest Post)
    • (07/26/14) S. Senn: “Responder despondency: myths of personalized medicine” (Guest Post)

Randomisation:

    • (07/01/17) S. Senn: Fishing for fakes with Fisher (Guest Post)
Categories: Lord's paradox, S. Senn | 4 Comments

Guest Post: STEPHEN SENN: ‘Fisher’s alternative to the alternative’

“You May Believe You Are a Bayesian But You Are Probably Wrong”


As part of the week of posts on R.A. Fisher (February 17, 1890 – July 29, 1962), I reblog a guest post by Stephen Senn from 2012 and 2017. See especially the comments from Feb 2017.

‘Fisher’s alternative to the alternative’

By: Stephen Senn

[2012 marked] the 50th anniversary of RA Fisher’s death. It is a good excuse, I think, to draw attention to an aspect of his philosophy of significance testing. In his extremely interesting essay on Fisher, Jimmie Savage drew attention to a problem in Fisher’s approach to testing. In describing Fisher’s aversion to power functions Savage writes, ‘Fisher says that some tests are more sensitive than others, and I cannot help suspecting that that comes to very much the same thing as thinking about the power function.’ (Savage 1976) (P473).

The modern statistician, however, has an advantage here denied to Savage. Savage’s essay was published posthumously in 1976 and the lecture on which it was based was given in Detroit on 29 December 1971 (P441). At that time Fisher’s scientific correspondence did not form part of his available oeuvre but in 1990 Henry Bennett’s magnificent edition of Fisher’s statistical correspondence (Bennett 1990) was published and this throws light on many aspects of Fisher’s thought including on significance tests. Continue reading

Categories: Fisher, S. Senn, Statistics | Leave a comment

S. Senn: Personal perils: are numbers needed to treat misleading us as to the scope for personalised medicine? (Guest Post)

Personal perils: are numbers needed to treat misleading us as to the scope for personalised medicine?

A common misinterpretation of Numbers Needed to Treat is causing confusion about the scope for personalised medicine.

Stephen Senn
Consultant Statistician,
Edinburgh

Introduction

Thirty years ago, Laupacis et al1 proposed an intuitively appealing way that physicians could decide how to prioritise health care interventions: they could consider how many patients would need to be switched from an inferior treatment to a superior one in order for one to have an improved outcome. They called this the number needed to be treated. It is now more usually referred to as the number needed to treat (NNT).

Within fifteen years, NNTs were so well established that the then editor of the British Medical Journal, Richard Smith, could write: ‘Anybody familiar with the notion of “number needed to treat” (NNT) knows that it’s usually necessary to treat many patients in order for one to benefit’2. Fifteen years further on, bringing us up to date, Wikipedia makes a similar point: ‘The NNT is the average number of patients who need to be treated to prevent one additional bad outcome (e.g. the number of patients that need to be treated for one of them to benefit compared with a control in a clinical trial).’3
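For readers who want the arithmetic behind the term: the NNT is conventionally computed as the reciprocal of the absolute risk reduction. A minimal sketch with made-up event rates (not data from any trial Senn discusses):

```python
# Hypothetical event rates: 30% of control patients and 20% of treated patients have a bad outcome.
risk_control, risk_treated = 0.30, 0.20
arr = risk_control - risk_treated   # absolute risk reduction
nnt = 1 / arr                       # number needed to treat
print(nnt)                          # 10.0 -- which, as the post goes on to argue, does NOT mean
                                    # that exactly 1 patient in 10 'responds'
```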

This common interpretation is false, as I have pointed out previously in two blogs on this site: Responder Despondency and  Painful Dichotomies. Nevertheless, it seems to me the point is worth making again and the thirty-year anniversary of NNTs provides a good excuse. Continue reading

Categories: personalized medicine, PhilStat/Med, S. Senn | 7 Comments

Guest Blog: STEPHEN SENN: ‘Fisher’s alternative to the alternative’

“You May Believe You Are a Bayesian But You Are Probably Wrong”


As part of the week of recognizing R.A. Fisher (February 17, 1890 – July 29, 1962), I reblog a guest post by Stephen Senn from 2012/2017. The comments from 2017 lead to a troubling issue that I will bring up in the comments today.

‘Fisher’s alternative to the alternative’

By: Stephen Senn

[2012 marked] the 50th anniversary of RA Fisher’s death. It is a good excuse, I think, to draw attention to an aspect of his philosophy of significance testing. In his extremely interesting essay on Fisher, Jimmie Savage drew attention to a problem in Fisher’s approach to testing. In describing Fisher’s aversion to power functions Savage writes, ‘Fisher says that some tests are more sensitive than others, and I cannot help suspecting that that comes to very much the same thing as thinking about the power function.’ (Savage 1976) (P473).

The modern statistician, however, has an advantage here denied to Savage. Savage’s essay was published posthumously in 1976 and the lecture on which it was based was given in Detroit on 29 December 1971 (P441). At that time Fisher’s scientific correspondence did not form part of his available oeuvre but in 1990 Henry Bennett’s magnificent edition of Fisher’s statistical correspondence (Bennett 1990) was published and this throws light on many aspects of Fisher’s thought including on significance tests. Continue reading

Categories: Fisher, S. Senn, Statistics | 1 Comment

S. Senn: Evidence Based or Person-centred? A Statistical debate (Guest Post)


Stephen Senn
Head of  Competence Center
for Methodology and Statistics (CCMS)
Luxembourg Institute of Health
Twitter @stephensenn

Evidence Based or Person-centred? A statistical debate

It was hearing Stephen Mumford and Rani Lill Anjum (RLA) in January 2017 speaking at the Epistemology of Causal Inference in Pharmacology conference in Munich, organised by Jürgen Landes, Barbara Osmani and Roland Poellinger, that inspired me to buy their book, Causation: A Very Short Introduction [1]. Although I do not agree with all that is said in it and also could not pretend to understand all it says, I can recommend it highly as an interesting introduction to issues in causality, some of which will be familiar to statisticians but some not at all.

Since I have a long-standing interest in researching into ways of delivering personalised medicine, I was interested to see a reference on Twitter to a piece by RLA, Evidence based or person centered? An ontological debate, in which she claims that the choice between evidence based or person-centred medicine is ultimately ontological[2]. I don’t dispute that thinking about health care delivery in ontological terms might be interesting. However, I do dispute that there is any meaningful choice between evidence based medicine (EBM) and person centred healthcare (PCH). To suggest so is to commit a category mistake by suggesting that means are alternatives to ends.

In fact, EBM will be essential to delivering effective PCH, as I shall now explain. Continue reading

Categories: personalized medicine, RCTs, S. Senn | 7 Comments

Frequentstein’s Bride: What’s wrong with using (1 – β)/α as a measure of evidence against the null?


ONE YEAR AGO: …and growing more relevant all the time. Rather than leak any of my new book*, I reblog some earlier posts, even if they’re a bit scruffy. This was first blogged here (with a slightly different title). It’s married to posts on “the P-values overstate the evidence against the null fallacy”, such as this, and is wedded to this one on “How to Tell What’s True About Power if You’re Practicing within the Frequentist Tribe”. 

In their “Comment: A Simple Alternative to p-values,” (on the ASA P-value document), Benjamin and Berger (2016) recommend researchers report a pre-data Rejection Ratio:

It is the probability of rejection when the alternative hypothesis is true, divided by the probability of rejection when the null hypothesis is true, i.e., the ratio of the power of the experiment to the Type I error of the experiment. The rejection ratio has a straightforward interpretation as quantifying the strength of evidence about the alternative hypothesis relative to the null hypothesis conveyed by the experimental result being statistically significant. (Benjamin and Berger 2016, p. 1)
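The pre-data quantity being recommended is just power divided by the Type I error rate, so a quick illustration with conventional design values (my numbers, not the authors'):

```python
power, alpha = 0.80, 0.05          # illustrative design values
rejection_ratio = power / alpha    # Benjamin & Berger's pre-data rejection ratio
print(rejection_ratio)             # 16.0
```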

Continue reading

Categories: Bayesian/frequentist, fallacy of rejection, J. Berger, power, S. Senn | 17 Comments

S. Senn: “Automatic for the people? Not quite” (Guest post)

Stephen Senn
Head of  Competence Center for Methodology and Statistics (CCMS)
Luxembourg Institute of Health
Twitter @stephensenn

Automatic for the people? Not quite

What caught my eye was the estimable (in its non-statistical meaning) Richard Lehman tweeting about the equally estimable John Ioannidis. For those who don’t know them, the former is a veteran blogger who keeps a very cool and shrewd eye on the latest medical ‘breakthroughs’ and the latter a serial iconoclast of idols of scientific method. This is what Lehman wrote

Ioannidis hits 8 on the Richter scale: http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0173184 … Bayes factors consistently quantify strength of evidence, p is valueless.

Since Ioannidis works at Stanford, which is located in the San Francisco Bay Area, he has every right to be interested in earthquakes but on looking up the paper in question, a faint tremor is the best that I can afford it. I shall now try and explain why, but before I do, it is only fair that I acknowledge the very generous, prompt and extensive help I have been given to understand the paper[1] in question by its two authors Don van Ravenzwaaij and Ioannidis himself. Continue reading

Categories: Bayesian/frequentist, Error Statistics, S. Senn | 18 Comments

Guest Blog: STEPHEN SENN: ‘Fisher’s alternative to the alternative’

“You May Believe You Are a Bayesian But You Are Probably Wrong”


As part of the week of recognizing R.A. Fisher (February 17, 1890 – July 29, 1962), I reblog a guest post by Stephen Senn from 2012. (I will comment in the comments.)

‘Fisher’s alternative to the alternative’

By: Stephen Senn

[2012 marked] the 50th anniversary of RA Fisher’s death. It is a good excuse, I think, to draw attention to an aspect of his philosophy of significance testing. In his extremely interesting essay on Fisher, Jimmie Savage drew attention to a problem in Fisher’s approach to testing. In describing Fisher’s aversion to power functions Savage writes, ‘Fisher says that some tests are more sensitive than others, and I cannot help suspecting that that comes to very much the same thing as thinking about the power function.’ (Savage 1976) (P473).

The modern statistician, however, has an advantage here denied to Savage. Savage’s essay was published posthumously in 1976 and the lecture on which it was based was given in Detroit on 29 December 1971 (P441). At that time Fisher’s scientific correspondence did not form part of his available oeuvre but in 1990 Henry Bennett’s magnificent edition of Fisher’s statistical correspondence (Bennett 1990) was published and this throws light on many aspects of Fisher’s thought including on significance tests. Continue reading

Categories: Fisher, S. Senn, Statistics | 13 Comments

Hocus pocus! Adopt a magician’s stance, if you want to reveal statistical sleights of hand


Here’s the follow-up post to the one I reblogged on Feb 3 (please read that one first). When they sought to subject Uri Geller to the scrutiny of scientists, magicians had to be brought in because only they were sufficiently trained to spot the subtle sleight of hand shifts by which the magician tricks by misdirection. We, too, have to be magicians to discern the subtle misdirections and shifts of meaning in the discussions of statistical significance tests (and other methods)—even by the same statistical guide. We needn’t suppose anything deliberately devious is going on at all! Often, the statistical guidebook reflects shifts of meaning that grow out of one or another critical argument. These days, they trickle down quickly to statistical guidebooks, thanks to popular articles on the “statistics crisis in science”. The danger is that their own guidebooks contain inconsistencies. To adopt the magician’s stance is to be on the lookout for standard sleights of hand. There aren’t that many.[0]

I don’t know Jim Frost, but he gives statistical guidance at the minitab blog. The purpose of my previous post is to point out that Frost uses the probability of a Type I error in two incompatible ways in his posts on significance tests. I assumed he’d want to clear this up, but so far he has not. His response to a comment I made on his blog is this: Continue reading

Categories: frequentist/Bayesian, P-values, reforming the reformers, S. Senn, Statistics | 39 Comments

Frequentstein: What’s wrong with (1 – β)/α as a measure of evidence against the null? (ii)


In their “Comment: A Simple Alternative to p-values,” (on the ASA P-value document), Benjamin and Berger (2016) recommend researchers report a pre-data Rejection Ratio:

It is the probability of rejection when the alternative hypothesis is true, divided by the probability of rejection when the null hypothesis is true, i.e., the ratio of the power of the experiment to the Type I error of the experiment. The rejection ratio has a straightforward interpretation as quantifying the strength of evidence about the alternative hypothesis relative to the null hypothesis conveyed by the experimental result being statistically significant. (Benjamin and Berger 2016, p. 1)

The recommendation is much more fully fleshed out in a 2016 paper by Bayarri, Benjamin, Berger, and Sellke (BBBS 2016): Rejection Odds and Rejection Ratios: A Proposal for Statistical Practice in Testing Hypotheses. Their recommendation is:

…that researchers should report the ‘pre-experimental rejection ratio’ when presenting their experimental design and researchers should report the ‘post-experimental rejection ratio’ (or Bayes factor) when presenting their experimental results. (BBBS 2016, p. 3)….

The (pre-experimental) ‘rejection ratio’ Rpre, the ratio of statistical power to significance threshold (i.e., the ratio of the probability of rejecting under H1 and H0 respectively), is shown to capture the strength of evidence in the experiment for H1 over H0. (ibid., p. 2)

But in fact it does no such thing! [See my post from the FUSION conference here.] J. Berger, and his co-authors, will tell you the rejection ratio (and a variety of other measures created over the years) are entirely frequentist because they are created out of frequentist error statistical measures. But a creation built on frequentist measures doesn’t mean the resulting animal captures frequentist error statistical reasoning. It might be a kind of Frequentstein monster! [1] Continue reading

Categories: J. Berger, power, reforming the reformers, S. Senn, Statistical power, Statistics | 36 Comments

Excerpts from S. Senn’s Letter on “Replication, p-values and Evidence”


I first blogged this letter here. Below the references are some more recent blog links of relevance to this issue. 

 Dear Reader:  I am typing in some excerpts from a letter Stephen Senn shared with me in relation to my April 28, 2012 blogpost.  It is a letter to the editor of Statistics in Medicine  in response to S. Goodman. It contains several important points that get to the issues we’ve been discussing. You can read the full letter here. Sincerely, D. G. Mayo

 STATISTICS IN MEDICINE, LETTER TO THE EDITOR

From: Stephen Senn*

Some years ago, in the pages of this journal, Goodman gave an interesting analysis of ‘replication probabilities’ of p-values. Specifically, he considered the possibility that a given experiment had produced a p-value that indicated ‘significance’ or near significance (he considered the range p=0.10 to 0.001) and then calculated the probability that a study with equal power would produce a significant result at the conventional level of significance of 0.05. He showed, for example, that given an uninformative prior, and (subsequently) a resulting p-value that was exactly 0.05 from the first experiment, the probability of significance in the second experiment was 50 per cent. A more general form of this result is as follows. If the first trial yields p=α then the probability that a second trial will be significant at significance level α (and in the same direction as the first trial) is 0.5. Continue reading
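The quoted 50 per cent figure is easy to check numerically. A minimal sketch under the stated assumptions (a flat prior on the true effect on the z scale and a second trial of equal power); the simulation set-up is mine:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
alpha = 0.05
z1 = norm.ppf(1 - alpha / 2)        # first trial comes out at exactly p = alpha (two-sided)

theta = rng.normal(z1, 1, 200_000)  # flat prior => posterior for the true effect is N(z1, 1)
z2 = rng.normal(theta, 1)           # z statistic of an equally powered second trial
print(np.mean(z2 > z1))             # ~0.5: probability of significance at alpha, in the same direction
```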

Categories: 4 years ago!, reproducibility, S. Senn, Statistics | Tags: , , , | 3 Comments

Stephen Senn: The pathetic P-value (Guest Post) [3]


Stephen Senn
Head of Competence Center for Methodology and Statistics (CCMS)
Luxembourg Institute of Health

The pathetic P-value* [3]

This is the way the story is now often told. RA Fisher is the villain. Scientists were virtuously treading the Bayesian path, when along came Fisher and gave them P-values, which they gladly accepted, because they could get ‘significance’ so much more easily. Nearly a century of corrupt science followed but now there are signs that there is a willingness to return to the path of virtue and having abandoned this horrible Fisherian complication:

We shall not cease from exploration
And the end of all our exploring
Will be to arrive where we started …

A condition of complete simplicity..

And all shall be well and
All manner of thing shall be well

TS Eliot, Little Gidding

Consider, for example, distinguished scientist David Colquhoun citing the excellent scientific journalist Robert Matthews as follows

“There is an element of truth in the conclusion of a perspicacious journalist:

‘The plain fact is that 70 years ago Ronald Fisher gave scientists a mathematical machine for turning baloney into breakthroughs, and flukes into funding. It is time to pull the plug. ‘

Robert Matthews Sunday Telegraph, 13 September 1998.” [1]

However, this is not a plain fact but just plain wrong. Even if P-values were the guilty ‘mathematical machine’ they are portrayed to be, it is not RA Fisher’s fault. Putting the historical record right helps one to understand the issues better. As I shall argue, at the heart of this is not a disagreement between Bayesian and frequentist approaches but between two Bayesian approaches: it is a conflict to do with the choice of prior distributions[2].

Fisher did not persuade scientists to calculate P-values rather than Bayesian posterior probabilities; he persuaded them that the probabilities that they were already calculating and interpreting as posterior probabilities relied for this interpretation on a doubtful assumption. He proposed to replace this interpretation with one that did not rely on the assumption. Continue reading

Categories: P-values, S. Senn, statistical tests, Statistics | 27 Comments

Stephen Senn: Randomization, ratios and rationality: rescuing the randomized clinical trial from its critics


Stephen Senn
Head of Competence Center for Methodology and Statistics (CCMS)
Luxembourg Institute of Health

This post first appeared here. An issue sometimes raised about randomized clinical trials is the problem of indefinitely many confounders. This, for example is what John Worrall has to say:

Even if there is only a small probability that an individual factor is unbalanced, given that there are indefinitely many possible confounding factors, then it would seem to follow that the probability that there is some factor on which the two groups are unbalanced (when remember randomly constructed) might for all anyone knows be high. (Worrall J. What evidence is evidence-based medicine? Philosophy of Science 2002; 69: S316-S330: see p. S324 )

It seems to me, however, that this overlooks four matters. The first is that it is not indefinitely many variables we are interested in but only one, albeit one we can’t measure perfectly. This variable can be called ‘outcome’. We wish to see to what extent the difference observed in outcome between groups is compatible with the idea that chance alone explains it. The indefinitely many covariates can help us predict outcome but they are only of interest to the extent that they do so. However, although we can’t measure the difference we would have seen in outcome between groups in the absence of treatment, we can measure how much it varies within groups (where the variation cannot be due to differences between treatments). Thus we can say a great deal about random variation to the extent that group membership is indeed random. Continue reading

Categories: RCTs, S. Senn, Statistics | Tags: , | 6 Comments

Can You change Your Bayesian prior? (ii)


This is one of the questions high on the “To Do” list I’ve been keeping for this blog.  The question grew out of discussions of “updating and downdating” in relation to papers by Stephen Senn (2011) and Andrew Gelman (2011) in Rationality, Markets, and Morals.[i]

“As an exercise in mathematics [computing a posterior based on the client’s prior probabilities] is not superior to showing the client the data, eliciting a posterior distribution and then calculating the prior distribution; as an exercise in inference Bayesian updating does not appear to have greater claims than ‘downdating’.” (Senn, 2011, p. 59)

“If you could really express your uncertainty as a prior distribution, then you could just as well observe data and directly write your subjective posterior distribution, and there would be no need for statistical analysis at all.” (Gelman, 2011, p. 77)

But if uncertainty is not expressible as a prior, then a major lynchpin for Bayesian updating seems questionable. If you can go from the posterior to the prior, on the other hand, perhaps it can also lead you to come back and change it.

Is it legitimate to change one’s prior based on the data?

I don’t mean update it, but reject the one you had and replace it with another. My question may yield different answers depending on the particular Bayesian view. I am prepared to restrict the entire question of changing priors to Bayesian “probabilisms”, meaning the inference takes the form of updating priors to yield posteriors, or to report a comparative Bayes factor. Interpretations can vary. In many Bayesian accounts the prior probability distribution is a way of introducing prior beliefs into the analysis (as with subjective Bayesians) or, conversely, to avoid introducing prior beliefs (as with reference or conventional priors). Empirical Bayesians employ frequentist priors based on similar studies or well established theory. There are many other variants.


S. SENN: According to Senn, one test of whether an approach is Bayesian is that while Continue reading

Categories: Bayesian/frequentist, Gelman, S. Senn, Statistics | 111 Comments

From our “Philosophy of Statistics” session: APS 2015 convention


“The Philosophy of Statistics: Bayesianism, Frequentism and the Nature of Inference,” at the 2015 American Psychological Society (APS) Annual Convention in NYC, May 23, 2015:

 

D. Mayo: “Error Statistical Control: Forfeit at your Peril” 

 

S. Senn: “‘Repligate’: reproducibility in statistical studies. What does it mean and in what sense does it matter?”

 

A. Gelman: “The statistical crisis in science” (this is not his exact presentation, but he focussed on some of these slides)

 

For more details see this post.

Categories: Bayesian/frequentist, Error Statistics, P-values, reforming the reformers, reproducibility, S. Senn, Statistics | 10 Comments

Stephen Senn: The pathetic P-value (Guest Post)


Stephen Senn
Head of Competence Center for Methodology and Statistics (CCMS)
Luxembourg Institute of Health

The pathetic P-value

This is the way the story is now often told. RA Fisher is the villain. Scientists were virtuously treading the Bayesian path, when along came Fisher and gave them P-values, which they gladly accepted, because they could get ‘significance’ so much more easily. Nearly a century of corrupt science followed but now there are signs that there is a willingness to return to the path of virtue and having abandoned this horrible Fisherian complication:

We shall not cease from exploration
And the end of all our exploring
Will be to arrive where we started …

A condition of complete simplicity..

And all shall be well and
All manner of thing shall be well

TS Eliot, Little Gidding

Consider, for example, distinguished scientist David Colquhoun citing the excellent scientific journalist Robert Matthews as follows

“There is an element of truth in the conclusion of a perspicacious journalist:

‘The plain fact is that 70 years ago Ronald Fisher gave scientists a mathematical machine for turning baloney into breakthroughs, and flukes into funding. It is time to pull the plug. ‘

Robert Matthews Sunday Telegraph, 13 September 1998.” [1]

However, this is not a plain fact but just plain wrong. Even if P-values were the guilty ‘mathematical machine’ they are portrayed to be, it is not RA Fisher’s fault. Putting the historical record right helps one to understand the issues better. As I shall argue, at the heart of this is not a disagreement between Bayesian and frequentist approaches but between two Bayesian approaches: it is a conflict to do with the choice of prior distributions[2].

Fisher did not persuade scientists to calculate P-values rather than Bayesian posterior probabilities; he persuaded them that the probabilities that they were already calculating and interpreting as posterior probabilities relied for this interpretation on a doubtful assumption. He proposed to replace this interpretation with one that did not rely on the assumption. Continue reading

Categories: P-values, S. Senn, statistical tests, Statistics | 148 Comments

Stephen Senn: Is Pooling Fooling? (Guest Post)


Stephen Senn
Head, Methodology and Statistics Group,
Competence Center for Methodology and Statistics (CCMS), Luxembourg

Is Pooling Fooling?

‘And take the case of a man who is ill. I call two physicians: they differ in opinion. I am not to lie down, and die between them: I must do something.’ Samuel Johnson, in Boswell’s A Journal of a Tour to the Hebrides

A common dilemma facing meta-analysts is what to put together with what? One may have a set of trials that seem to be approximately addressing the same question but some features may differ. For example, the inclusion criteria might have differed with some trials only admitting patients who were extremely ill but with other trials treating the moderately ill as well. Or it might be the case that different measurements have been taken in different trials. An even more extreme case occurs when different, if presumed similar, treatments have been used.

It is helpful to make a point of terminology here. In what follows I shall be talking about pooling results from various trials. This does not involve naïve pooling of patients across trials. I assume that each trial will provide a valid within-trial comparison of treatments. It is these comparisons that are to be pooled (appropriately).

A possible way to think of this is in terms of a Bayesian model with a prior distribution covering the extent to which results might differ as features of trials are changed. I don’t deny that this is sometimes an interesting way of looking at things (although I do maintain that it is much more tricky than many might suppose[1]) but I would also like to draw attention to the fact that there is a frequentist way of looking at this problem that is also useful.

Suppose that we have k ‘null’ hypotheses that we are interested in testing, each being capable of being tested in one of k trials. We can label these Hn1, Hn2, … Hnk. We are perfectly entitled to test the null hypothesis Hjoint that they are all jointly true. In doing this we can use appropriate judgement to construct a composite statistic based on all the trials whose distribution is known under the null. This is a justification for pooling. Continue reading
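One classical instance of such a composite statistic, offered purely as an illustration of the idea (it is not necessarily the statistic Senn has in mind), is Fisher's combination of the k per-trial p-values:

```python
import numpy as np
from scipy.stats import chi2

p_values = np.array([0.09, 0.04, 0.20, 0.11])   # hypothetical p-values from k = 4 trials
stat = -2 * np.log(p_values).sum()              # Fisher's combination statistic
p_joint = chi2.sf(stat, df=2 * len(p_values))   # chi-square with 2k d.f. under the joint null
print(round(stat, 2), round(p_joint, 4))
```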

Categories: evidence-based policy, PhilPharma, S. Senn, Statistics | 19 Comments
