1) A convenience sample is drawn. Its probabilistic relationship to any population of patients is not known. Making inferences about such a population requires the sort of reasoning that forms the second part of Deaton and Cartwright’s two-part paper, and I specifically excluded that from my discussion. It is the matter of Q5, but I simply listed all the questions covered in my Added Values paper. Q5 is not the subject of discussion here. You will, however, find some comments on it in that paper.

2) The group of patients so sampled is allocated at random to either treatment (T) or control (C). If we could, under identical conditions, treat every patient with both T and C, then what you say about using the observed difference and not needing statistical tests would be true.

3) However, we have allocated the patients at random. This means that the 25 T and 25 C split that we chose is one of 50!/(25!25!) ≈ 1.264 × 10^14 possible such splits. We consider the relationship of the actual allocation we had to the population of all the allocations we could have made, since the average over all these allocations would give us the average causal effect in the 50 patients: over all such allocations, every patient would be treated half the time with T and half the time with C.
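To make the arithmetic concrete, the number of equally likely splits can be checked directly; this is a minimal sketch using only Python’s standard library:

```python
from math import comb

# Number of ways to split 50 patients into a treated group of 25
# and a control group of 25: 50! / (25! * 25!)
n_splits = comb(50, 25)
print(n_splits)  # 126410606437752, i.e. about 1.264 * 10^14
```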

4) Thus we

The whole theory of randomisation and permutation tests is related to this. The book I cited previously is a good reference. For a simple discussion related to clinical trials see

1. Ludbrook J, Dudley H. Issues in biomedical statistics: statistical inference. Aust N Z J Surg 1994;64(9):630-36.

2. Ludbrook J, Dudley H. Why permutation tests are superior to t and F tests in biomedical research. American Statistician 1998;52(2):127-32.
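The logic of such a permutation (re-randomisation) test can be sketched as follows. The function and the data below are invented purely for illustration: the pooled observations are repeatedly re-split into groups of the original sizes, and we ask how often the absolute mean difference is at least as large as the one actually observed.

```python
import random

def permutation_test(treated, control, n_perm=10_000, seed=1):
    """Two-sided Monte Carlo permutation test for a difference in means."""
    rng = random.Random(seed)
    observed = sum(treated) / len(treated) - sum(control) / len(control)
    pooled = list(treated) + list(control)
    n_t = len(treated)
    count = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        diff = (sum(pooled[:n_t]) / n_t
                - sum(pooled[n_t:]) / (len(pooled) - n_t))
        if abs(diff) >= abs(observed):
            count += 1
    return count / n_perm

# Invented example data, not from any real trial
treated = [5.1, 4.8, 6.0, 5.5, 5.9]
control = [4.2, 4.0, 4.7, 4.4, 4.1]
print(permutation_test(treated, control))
```

Because only the random re-allocation is used, no distributional assumption beyond the randomisation itself is needed, which is exactly the point of the references above.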

First, I’ve invited Deaton and Cartwright to comment or link us to relevant replies, as I always do when people are discussed, but I don’t know whether we’ll hear from them (Deaton suggested he was too old for blogs).

Second, I’m traveling back to the states tomorrow, so, while I’ll leave the comments open, anyone who hasn’t been approved before is likely to be held in moderation.

Third, I love your remark:

“RCTs bring great strength to reinforcing the weakest link of the argument. That’s the whole point. I fully agree, that this does not make the argument as strong as this link is made. However, it does make it stronger than those arguments that ignore this weakness.”

There’s an important thing to say about the charge that an argument is only as strong as its weakest link. That position is based on the kind of notion philosophers tend to like, of an argument as a kind of tower, where finding a weak spot causes it all to tumble down. It is used by critics of regulatory positions, who ignore the strong evidence that’s not threatened by the piece they chip away at. That gambit works for a tower or “linked” argument, but not for what’s called a “convergent” argument, wherein many distinct strands work together to build a strong argument from coincidence, based on distinct pieces and self-correcting checks.

Anyway, I’m disappointed that you think the issue is analogous to the Neyman-Fisher issue on that experimental design matter, in that Neyman never questioned randomization.

The SPRINT RCT recruited patients with systolic BPs between 130 and 180 mmHg and found a ‘statistically significant’ reduction in adverse outcomes. This prompted a conclusion that everyone with a systolic BP above 130 should be treated. Unsurprisingly, this gave rise to controversy.

There was no attempt to estimate the absolute proportion benefitting at different levels of initial BP. It would probably be minimal at 130 but substantial at 180. I teach this in the Oxford Handbook of Clinical Diagnosis, but it seems to be lost on those who prepare ‘evidence-based’ guidelines and who criticise pharmaceutical organisations for promoting overdiagnosis.

There is an awareness of relative and absolute risk reduction for risk scores based on multiple risk factors. However, these concepts are not applied routinely to RCTs based on single entry criteria. There is also a need for confidence or credibility intervals on these absolute proportions benefitting, as opposed to P values based on a null hypothesis.
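The distinction can be made concrete with a hedged sketch. The baseline risks and the relative risk reduction below are invented for illustration (they are not SPRINT estimates): a fixed relative effect translates into very different absolute proportions benefitting, and hence numbers needed to treat, depending on baseline risk.

```python
def absolute_benefit(baseline_risk, relative_risk_reduction):
    """Absolute risk reduction implied by a constant relative effect."""
    return baseline_risk * relative_risk_reduction

rrr = 0.25  # hypothetical 25% relative risk reduction
for baseline in (0.02, 0.08, 0.20):  # invented baseline risks, low to high BP
    arr = absolute_benefit(baseline, rrr)
    nnt = 1 / arr
    print(f"baseline {baseline:.2f}: ARR {arr:.3f}, NNT {nnt:.0f}")
```

Under these invented numbers, the same relative benefit gives an NNT of 200 at the lowest baseline risk but only 20 at the highest, which is the commenter’s point about minimal benefit at 130 and substantial benefit at 180.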

As far as application to individual patients is concerned, these findings have to be combined with the probability of improved overall well-being, also bearing in mind adverse effects etc. Decision analysis has been proposed for doing this, of course, but the usual approach is discussion with or without ‘shared decision aids’.

With respect to Mayo’s response: my understanding is that it’s the design of and measurement(s) used in an RCT that enable us to identify causal processes, not the statistical tests used to analyze the data.

With respect to Stephen Senn’s response: Qs 1-4 are sample-specific, right? These can be answered by estimating appropriate statistics (means, SDs) in the sample, and so don’t require statistical tests, as far as I can tell. By way of contrast, Q5 seems to be explicitly about drawing inferences about some population of interest, so it seems inconsistent with the text I quoted and emphasized from the post.

“not with the randomization and the control—which do give us causal identification, albeit subject to sampling variation and relative to a particular local treatment effect. So really we’re saying that all empirical trials have problems, a point which has arisen many times in discussions of experiments and causal reasoning in political science”.

The supposition that because you’ve randomized “treatments,” enabling a statistical significance assessment, the results are relevant for a research claim is wrong; still less may one extrapolate to other populations. Granted too, in some cases, as you say, once “randomization (or other identification strategies) go in, researchers often seem to turn off their brains”, or at least think they’re protected from fallacies. Reading into the significant or non-significant result with one or another presupposed theory is problematic, which is why Fisher demanded more for genuine effects (replication), and more still for causal inference (making your theories elaborate and varied).

However, I think it would be highly problematic to argue that since randomized studies are also open to fallacies that we might as well do non-randomized studies. I’m not saying they argue this (I will study their paper when I return from travels). Even Savage (whom they mention as denying the necessity for randomization) and other subjective Bayesians (Kadane) have worked very hard to find ways to justify it, despite the conflict with the Likelihood Principle.

In today’s world, where statistical inference methods are often blamed for non-replication rather than violations of experimental design, data-dependent selection effects, and well-known fallacies of statistical inference, and where the “21st Century Cures” Act* says, or seems to say, that we do not need randomization, it’s very important to be clear on just what’s being blamed, and to show why alternative methods will do better.

*https://rejectedpostsofdmayo.com/2017/11/08/you-are-no-longer-bound-to-traditional-clinical-trials-21st-century-cures/

To me, and I think to Stephen, the definition of the Neyman null model is that the mean is exactly zero or some constant. If by the Neyman model you take the restriction off the mean, I don’t think there is much, if any, disagreement?

Keith O’Rourke

It’s only ever the first step. Step 1: ‘prove’ that the treatment worked in the patients actually studied.

These are the questions I suggested in my paper ‘Added Values’ (1) were relevant to thinking about data from clinical trials:

Q1. Was there an effect of treatment in this trial?

Q2. What was the average effect of treatment in this trial?

Q3. Was the treatment effect identical for all patients in the trial?

Q4. What was the effect of treatment for different subgroups of patients?

Q5. What will be the effect of treatment when used more generally (outside of the trial)?

For a practical application of this philosophy see Araujo, Julious and Senn (2).

References

1. Senn SJ. Added Values: Controversies concerning randomization and additivity in clinical trials. Statistics in Medicine 2004;23(24):3729-53.

2. Araujo A, Julious S, Senn SJ. Understanding variation in sets of n-of-1 trials. PLoS ONE 2016;11(12):e0167167.

However, to allow for treatment-by-unit interaction in estimation is eminently sensible. Even here, however, there are problems with Neyman’s approach, and they carry over into the apparently endless discussion of random-effects meta-analysis, a point that I discussed here (2):

http://onlinelibrary.wiley.com/doi/10.1002/sim.2639/abstract, where I claimed that the combination of large variation in effects and small average effects is rarely very credible.

So, to pick up your statement ‘as it has never made sense to me to think of treatment effects to be exactly the same for all people’: in my opinion, that’s really not the point at issue. It is rather that it can be dangerous to allow that treatment effects can be large individually and small on average, and the extreme combination of this is to allow that the former are large and the latter is zero.

To put it another way, allowing for an interaction and having the main effect null is also a marginality violation.
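A minimal numerical sketch of the situation being warned against, using invented individual effects purely for illustration: treatment effects that are large for every patient yet cancel exactly on average, so that Neyman’s null holds while Fisher’s fails.

```python
# Hypothetical individual treatment effects for six patients:
# large for each patient, yet averaging exactly zero.
effects = [3.0, -3.0, 2.0, -2.0, 4.0, -4.0]

average_effect = sum(effects) / len(effects)
print(average_effect)                 # 0.0  -> Neyman's null holds
print(any(e != 0 for e in effects))   # True -> Fisher's null fails
```

The exact cancellation required here is the point: nothing in nature enforces it, which is why this combination is rarely credible.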

References

1. Schwartz D, Lellouch J. Explanatory and pragmatic attitudes in therapeutic trials. Journal of Chronic Diseases 1967;20:637-48.

2. Senn SJ. Trying to be precise about vagueness. Statistics in Medicine 2007;26:1417-30.

(Emphasis mine.) I don’t understand this statement. Isn’t the null hypothesis a statement about a population? And isn’t an effect trivially either zero or non-zero in a particular sample? If you’re only concerned with the sample at hand, what is the point of any kind of statistical test? What are you drawing inferences about?

80 years ago #Statistics giants Yates & Cochran nailed it: RCTs do not need representative samples to get generalizable estimates of efficacy. See also https://t.co/K17ku2SrX0 https://t.co/1eNcskhxVc

— Frank Harrell (@f2harrell) January 16, 2018


I’ve always thought the Neyman model makes more sense than the Fisher model, as it has never made sense to me to think of treatment effects to be exactly the same for all people. See here for why I say this: http://andrewgelman.com/2013/05/24/in-which-i-side-with-neyman-over-fisher/

“In words, Fisher’s null hypothesis can be described as being, ‘all treatments are equal’, whereas Neyman’s is, ‘on average all treatments are equal’. The first hypothesis necessarily implies the second but the converse is not true. Neyman developed a model in which on average over the field, the yields of different treatments could be the same (if the null hypothesis were true) but they could actually differ on given plots. Although it seems that this is more general than Fisher’s null hypothesis it is, in fact, not sensible. Anyone who doubts this should imagine themselves faced with the following task: it is known that Fisher’s null hypothesis is false and the treatments are not identical; find a field for which Neyman’s hypothesis is true.” (p 3733)

Reference

1. Senn SJ. Added Values: Controversies concerning randomization and additivity in clinical trials. Statistics in Medicine 2004;23(24):3729-53.