S. Senn: Personal perils: are numbers needed to treat misleading us as to the scope for personalised medicine? (Guest Post)


A common misinterpretation of Numbers Needed to Treat is causing confusion about the scope for personalised medicine.

Stephen Senn
Consultant Statistician,


Thirty years ago, Laupacis et al1 proposed an intuitively appealing way that physicians could decide how to prioritise health care interventions: they could consider how many patients would need to be switched from an inferior treatment to a superior one in order for one to have an improved outcome. They called this the number needed to be treated. It is now more usually referred to as the number needed to treat (NNT).

Within fifteen years, NNTs were so well established that the then editor of the British Medical Journal, Richard Smith could write:  ‘Anybody familiar with the notion of “number needed to treat” (NNT) knows that it’s usually necessary to treat many patients in order for one to benefit’2. Fifteen years further on, bringing us up to date,  Wikipedia makes a similar point ‘The NNT is the average number of patients who need to be treated to prevent one additional bad outcome (e.g. the number of patients that need to be treated for one of them to benefit compared with a control in a clinical trial).’3

This common interpretation is false, as I have pointed out previously in two blogs on this site: Responder Despondency and  Painful Dichotomies. Nevertheless, it seems to me the point is worth making again and the thirty-year anniversary of NNTs provides a good excuse.

NNTs based on dichotomies, as opposed to those based on true binary outcomes (which are very rare), do not measure the proportion of patients who benefit from the drug and even when not based on such dichotomies, they say less about differential response than many suppose. Common false interpretations of NNTs are creating confusion about the scope for personalised medicine.

Not necessarily true

To illustrate the problem, consider a 2015 Nature comment piece by Nicholas Schork4 calling for N-of-1 trials to be used more often in personalising medicine. These are trials in which, as a guide to treatment, patients are repeatedly randomised in different episodes to the therapies being compared5.

NNTs are commonly used in health economics. Other things being equal, a drug with a larger NNT ought to have a lower cost per patient day than one with a smaller NNT if it is to justify its place in the market. Here, however, they were used to make the case for the scope for personalised medicine, and hence the need for N-of-1 trials, a potentially very useful approach to personalising treatment. Schork claimed, ‘The top ten highest-grossing drugs in the United States help between 1 in 25 and 1 in 4 of the people who take them’ (p609). This claim may or may not be correct (it is almost certainly wrong) but the argument made for it is false.

The figure, Imperfect medicine, is based on Schork’s figure Imprecision medicine and shows the NNTs for the ten best-selling drugs in the USA at the time of his comment. The NNTs range, for example, from 4 for Humira® in arthritis to 25 for Nexium in heartburn. This is then interpreted as meaning that since, for example, on average 4 patients would have to be treated with Humira rather than placebo in order to get one more response, only one in 4 patients responds to Humira.

Imperfect medicine: Numbers Needed to Treat, based on a figure in Schork (2015). The total number of dots represents how many patients you would have to switch to the treatment mentioned to get one additional response (blue dot). The red dots are supposed to represent the patients for whom it would make no difference.

Take the example of Nexium. The figure quoted by Schork is taken from a meta-analysis carried out by Gralnek et al6 based on several studies comparing Esomeprazole (Nexium) to other proton pump inhibitors. The calculation of the NNT may be illustrated by taking one of the studies that comprise the meta-analysis, the EXPO study reported by Labenz et al7, in which a clinical trial with more than 3000 patients compared Esomeprazole to Pantoprazole. Patients with erosive oesophagitis were treated with one or the other treatment and then evaluated at 8 weeks.

Of those treated with Esomeprazole 92.1% were healed. Of those treated with Pantoprazole 87.3% were healed. The difference of 4.8% is the risk difference. Expressed as a proportion this is 0.048 and the reciprocal of this figure is 21, rounded up to the nearest whole number. This figure is the NNT and an interpretation is that on average you would need to treat 21 patients with Esomeprazole rather than with Pantoprazole to have one extra healed case at 8 weeks. For the meta-analysis as a whole, Gralnek et al6 found a risk difference of 4% and this yields an NNT of 25, the figure quoted by Schork. (See Box for further discussion.)
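The reciprocal arithmetic is easily sketched; the following minimal illustration (the `nnt` helper is hypothetical, not from any of the cited papers) uses the percentages quoted above:

```python
import math

def nnt(risk_difference: float) -> int:
    """Number needed to treat: the reciprocal of the risk difference,
    rounded up to the nearest whole number."""
    return math.ceil(1 / risk_difference)

# EXPO (Labenz et al.): 92.1% vs 87.3% healed at 8 weeks
expo_nnt = nnt(0.921 - 0.873)   # risk difference of 4.8 points -> 21
# Gralnek et al. meta-analysis: risk difference of 4%
meta_nnt = nnt(0.04)            # -> 25
```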

 Two different interpretations of the EXPO oesophageal ulcer data


It is impossible for us to observe the ulcers that were studied in the EXPO trial under both treatments. Each patient was treated with either Esomeprazole or Pantoprazole. We can imagine what the response would have been on either but we can only observe it on one. Table 1 and Table 2 have the same observable marginal probabilities of ulcer healing but different postulated joint ones.

                          Esomeprazole
                Not healed    Healed     Total
Pantoprazole
  Not healed        7.9         4.8       12.7
  Healed            0.0        87.3       87.3
  Total             7.9        92.1      100.0

Table 1. Possible joint distribution of response (percentages) for the EXPO trial: the case where no patient who did not respond on Esomeprazole would have responded on Pantoprazole.

In the case of Table 1, no patient that would not have been healed by Esomeprazole could have been healed by Pantoprazole. In consequence, the total percentage of patients who could have been healed is that healed with Esomeprazole, that is to say 92.1%. In the case of Table 2, all patients who were not healed by Esomeprazole, that is to say 7.9%, could have been healed by Pantoprazole: in principle it becomes possible to heal all patients. Of course, intermediate situations are possible, but all such tables have the same NNT of 21. The NNT cannot tell us which is true.

                          Esomeprazole
                Not healed    Healed     Total
Pantoprazole
  Not healed        0.0        12.7       12.7
  Healed            7.9        79.4       87.3
  Total             7.9        92.1      100.0

Table 2. Possible joint distribution of response (percentages) for the EXPO trial: the case where all patients who did not respond on Esomeprazole would have responded on Pantoprazole.
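The point of the box can be checked mechanically. The sketch below (percentages from Tables 1 and 2; the dictionary layout and helper names are just for illustration) confirms that both joint distributions have identical margins, hence the same risk difference and NNT, while implying very different upper limits on how many patients could be healed if each received whichever drug worked for them:

```python
import math

# Joint distributions of response (percent of patients), keyed by
# (healed on Pantoprazole?, healed on Esomeprazole?) -- Tables 1 and 2
table1 = {(False, False): 7.9, (False, True): 4.8,
          (True,  False): 0.0, (True,  True): 87.3}
table2 = {(False, False): 0.0, (False, True): 12.7,
          (True,  False): 7.9, (True,  True): 79.4}

def margins(joint):
    """Observable marginal healing rates (Esomeprazole, Pantoprazole)."""
    eso = sum(v for (p, e), v in joint.items() if e)
    panto = sum(v for (p, e), v in joint.items() if p)
    return round(eso, 9), round(panto, 9)

def nnt(joint):
    eso, panto = margins(joint)
    return math.ceil(100 / (eso - panto))

def max_healable(joint):
    """Percent healable if each patient got their better drug:
    everyone except the (not healed, not healed) cell."""
    return round(100 - joint[(False, False)], 9)

# Identical margins, hence the same risk difference and the same NNT ...
assert margins(table1) == margins(table2) == (92.1, 87.3)
assert nnt(table1) == nnt(table2) == 21
# ... but very different scope for personalisation:
print(max_healable(table1))  # Table 1: 92.1 -- no further patients could be helped
print(max_healable(table2))  # Table 2: 100.0 -- in principle everyone could be healed
```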


A number of points can be made taking this example. First, the NNT is comparator-specific. Proton pump inhibitors as a class are highly effective and one would get quite a different figure if placebo rather than Pantoprazole had been used as the control for Esomeprazole. Second, the figure, of itself, does not tell us the scope for personalising medicine. It is quite compatible with the two extreme positions given in the Box. In the first case, every single patient who was helped by Pantoprazole would have been helped by Esomeprazole. If there are no cost or tolerability advantages to the former, the optimal policy would be to give all patients the latter. In the second case, every single patient who was not helped by Esomeprazole would have been helped by Pantoprazole. If a suitable means can be found of identifying such patients, all patients can be treated successfully. Third, healing is a process that takes time. The eight-week time-point is partly arbitrary. The careful analysis presented by Labenz et al7 shows healing rates rising with time, with the Esomeprazole rate always above that for Pantoprazole. Perhaps with time either would heal all ulcers, the difference between them being one of speed. Fourth, although it is not directly related to this discussion, it should be appreciated that a given drug can have many NNTs. The NNT will vary according to the comparator, the outcome chosen, the cut point for any dichotomy and the follow-up8. (The original article proposing NNTs by Laupacis et al1 discusses a number of such caveats.) Indeed, for the EXPO study the risk difference at 4 weeks is 8.7%, with an NNT of 12 rather than the 21 found at 8 weeks. This shows the importance of not mixing NNTs for different follow-ups in a meta-analysis.
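The first and fourth points are simple arithmetic to check. A sketch (the risk differences are those quoted for EXPO; the 30% placebo healing rate is an invented number, used purely to illustrate comparator dependence):

```python
import math

def nnt(risk_difference: float) -> int:
    """Reciprocal of the risk difference, rounded up."""
    return math.ceil(1 / risk_difference)

# Follow-up matters: EXPO risk differences at 4 weeks and 8 weeks
nnt_4_weeks = nnt(0.087)   # -> 12
nnt_8_weeks = nnt(0.048)   # -> 21

# The comparator matters too: against a hypothetical placebo healing
# rate of 30% (an assumption, not a trial result), the same 92.1%
# Esomeprazole rate would give a far smaller NNT
nnt_vs_placebo = nnt(0.921 - 0.30)   # -> 2
```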

An easy lie or a difficult truth?

There are no shortcuts to finding evidence for variation in response9. Dichotomising continuous measures not only has the capacity to exaggerate unimportant differences, it is also inefficient and needlessly increases trial sizes10.

Rather than becoming simpler, the ways that clinical trials are reported need to become more nuanced. In a previous blog I showed how an NNT of 10 for headache had been misinterpreted as meaning that only 1 in 10 benefitted from paracetamol. It is, or ought to be, obvious that in order to understand the extent to which patients respond to paracetamol you should study them more than once under treatment and under control. For example, a design could be employed in which each patient was treated for four headaches, twice with placebo and twice with paracetamol. This is an example of the n-of-1 trials that Schork calls for4. We hardly ever run these. Of course, for some diseases they are not practical, but where we cannot run them we should not pretend to have identified what we cannot.
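To see what such a repeated-treatment design can reveal that a single parallel-group comparison cannot, here is a small simulation (all numbers are invented: a 50% relief rate on placebo, and two hypothetical mechanisms that both produce a risk difference of 0.1, i.e. an NNT of 10). The two mechanisms are indistinguishable from their average benefit alone, but the between-patient spread of individual effect estimates under repeated observation separates them:

```python
import random
import statistics

random.seed(42)

N_PATIENTS = 100_000
REPS = 2            # two headaches on each of placebo and drug, per patient
P_PLACEBO = 0.5     # assumed relief probability on placebo

def simulate(treatment_probs):
    """Per-patient difference: mean(drug relief) - mean(placebo relief)."""
    diffs = []
    for p_drug in treatment_probs():
        drug = sum(random.random() < p_drug for _ in range(REPS)) / REPS
        plac = sum(random.random() < P_PLACEBO for _ in range(REPS)) / REPS
        diffs.append(drug - plac)
    return diffs

# Mechanism A: 20% of patients always respond on drug; the rest gain nothing
def responders():
    for _ in range(N_PATIENTS):
        yield 1.0 if random.random() < 0.2 else P_PLACEBO

# Mechanism B: every patient's relief probability rises from 0.5 to 0.6
def uniform_gain():
    for _ in range(N_PATIENTS):
        yield 0.6

d_a = simulate(responders)
d_b = simulate(uniform_gain)

# Both mechanisms show the same average benefit (risk difference ~0.1) ...
print(statistics.mean(d_a), statistics.mean(d_b))
# ... but repeated within-patient observation reveals more between-patient
# spread under mechanism A (true effects mixed 0 and 0.5) than under B
print(statistics.variance(d_a), statistics.variance(d_b))
```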

The role for n-of-1 trials is indeed there but not necessarily to personalise treatment. More careful analysis of response may simply reveal that this is less variable than supposed11. In some cases such trials may simply deliver the message that we need to do better for everybody12.

In his editorial of 2003 Smith referred to pharmacogenetics as providing ‘hopes that greater understanding of genetics will mean that we will be able to identify with a “simple genetic test” people who will respond to drugs and design drugs for individuals rather than populations.’ and added, ‘We have, however, been hearing this tune for a long time’2.

Smith’s complaint about an old tune is as true today as it was in 2003. However, the message for the pharmaceutical industry may simply be that we need better drugs not better diagnosis.


I am grateful to Andreas Laupacis and Jennifer Deevy for helpfully providing me with a copy of the 1988 paper.


  1. Laupacis A, Sackett DL, Roberts RS. An Assessment of Clinically Useful Measures of the Consequences of Treatment. New England Journal of Medicine 1988;318(26):1728-33.
  2. Smith R. The drugs don’t work. British Medical Journal 2003;327(7428).
  3. Wikipedia. Number needed to treat 2018 [Available from: https://en.wikipedia.org/wiki/Number_needed_to_treat.
  4. Schork NJ. Personalized medicine: Time for one-person trials. Nature 2015;520(7549):609-11.
  5. Araujo A, Julious S, Senn S. Understanding Variation in Sets of N-of-1 Trials. PloS one 2016;11(12):e0167167.
  6. Gralnek IM, Dulai GS, Fennerty MB, et al. Esomeprazole versus other proton pump inhibitors in erosive esophagitis: a meta-analysis of randomized clinical trials. Clin Gastroenterol Hepatol 2006;4(12):1452-8.
  7. Labenz J, Armstrong D, Lauritsen K, et al. A randomized comparative study of esomeprazole 40 mg versus pantoprazole 40 mg for healing erosive oesophagitis: the EXPO study. Alimentary pharmacology & therapeutics 2005;21(6):739-46.
  8. Suissa S. Number needed to treat: enigmatic results for exacerbations in COPD. The European respiratory journal : official journal of the European Society for Clinical Respiratory Physiology 2015;45(4):875-8.
  9. Senn SJ. Mastering variation: variance components and personalised medicine. Statistics in Medicine 2016;35(7):966-77.
  10. Royston P, Altman DG, Sauerbrei W. Dichotomizing continuous predictors in multiple regression: a bad idea. Stat Med 2006;25(1):127-41.
  11. Churchward-Venne TA, Tieland M, Verdijk LB, et al. There are no nonresponders to resistance-type exercise training in older men and women. Journal of the American Medical Directors Association 2015;16(5):400-11.
  12. Senn SJ. Individual response to treatment: is it a valid assumption? BMJ 2004;329(7472):966-68.
Categories: personalized medicine, PhilStat/Med, S. Senn | 7 Comments

Statistics and the Higgs Discovery: 5-6 yr Memory Lane


I’m reblogging a few of the Higgs posts at the 6th anniversary of the 2012 discovery. (The first was in this post.) The following was originally “Higgs Analysis and Statistical Flukes: part 2” (from March, 2013).[1]

Some people say to me: “This kind of [severe testing] reasoning is fine for a ‘sexy science’ like high energy physics (HEP)”–as if their statistical inferences are radically different. But I maintain that this is the mode by which data are used in “uncertain” reasoning across the entire landscape of science and day-to-day learning (at least, when we’re trying to find things out)[2]. Even with high level theories, the particular problems of learning from data are tackled piecemeal, in local inferences that afford error control. Granted, this statistical philosophy differs importantly from those that view the task as assigning comparative (or absolute) degrees-of-support/belief/plausibility to propositions, models, or theories.

“Higgs Analysis and Statistical Flukes: part 2”

Everyone was excited when the Higgs boson results were reported on July 4, 2012, indicating evidence for a Higgs-like particle based on a “5 sigma observed effect”. The observed effect refers to the number of excess events of a given type that are “observed” in comparison to the number (or proportion) that would be expected from background alone, and not due to a Higgs particle. This continues my earlier post (part 1). It is an outsider’s angle on one small aspect of the statistical inferences involved. But that, apart from being fascinated by it, is precisely why I have chosen to discuss it: we [philosophers of statistics] should be able to employ a general philosophy of inference to get an understanding of what is true about the controversial concepts we purport to illuminate, e.g., significance levels.

Here I keep close to an official report from ATLAS, in which researchers define a “global signal strength” parameter “such that μ = 0 corresponds to the background only hypothesis and μ = 1 corresponds to the SM Higgs boson signal in addition to the background” (where SM is the Standard Model). The statistical test may be framed as a one-sided test, where the test statistic (which is actually a ratio) records differences in the positive direction, in standard deviation (sigma) units. Reports such as

Pr(Test T would yield at least a 5 sigma excess; H0: background only) = extremely low

are deduced from the sampling distribution of the test statistic, fortified with much cross-checking of results (e.g., by modeling and simulating relative frequencies of observed excesses generated with “Higgs signal +background” compared to background alone).  The inferences, even the formal statistical ones, go beyond p-value reports. For instance, they involve setting lower and upper bounds such that values excluded are ruled out with high severity, to use my term. But the popular report is in terms of the observed 5 sigma excess in an overall test T, and that is mainly what I want to consider here.

Error probabilities

In a Neyman-Pearson setting, a cut-off cα is chosen pre-data so that the probability of a type I error is low. In general,

Pr(d(X) > cα; H0) ≤ α

and in particular, alluding to an overall test T:

(1) Pr(Test T yields d(X) > 5 standard deviations; H0) ≤  .0000003.

The test at the same time is designed to ensure a reasonably high probability of detecting global strength discrepancies of interest. (I always use “discrepancy” to refer to parameter magnitudes, to avoid confusion with observed differences).

[Notice these are not likelihoods.] Alternatively, researchers can report observed standard deviations (here, the sigmas), or equivalently, the associated observed statistical significance probability, p0. In general,

Pr(P < p0; H0) < p0

and in particular,

(2) Pr(Test T yields P < .0000003; H0) ≤ .0000003.

For test T to yield a “worse fit” with H0 (smaller p-value) due to background alone is sometimes called “a statistical fluke” or a “random fluke”, and the probability of so statistically significant a random fluke is ~0. With the March 2013 results, the 5 sigma difference has grown to 7 sigmas.
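The quoted numbers are one-sided Normal tail areas; assuming a Gaussian test statistic, they can be checked directly:

```python
import math

def upper_tail(sigma: float) -> float:
    """P(Z >= sigma) for a standard Normal Z (one-sided tail area)."""
    return 0.5 * math.erfc(sigma / math.sqrt(2))

p5 = upper_tail(5.0)   # about 2.9e-7 -- the .0000003 in (1) and (2)
p7 = upper_tail(7.0)   # about 1.3e-12
```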

So probabilistic statements along the lines of (1) and (2) are standard. They allude to sampling distributions, either of test statistic d(X), or of the p-value viewed as a random variable. They are scarcely illicit or prohibited. (I return to this in the last section of this post.)

An implicit principle of inference or evidence

Admittedly, the move to taking the 5 sigma effect as evidence for a genuine effect (of the Higgs-like sort) results from an implicit principle of evidence that I have been calling the severity principle (SEV). Perhaps the weakest form applies to a statistical rejection or falsification of the null. (I will deliberately use a few different variations on statements that can be made.)

Data x from a test T provide evidence for rejecting H0 (just) to the extent that H0 would (very probably) have survived, were it a reasonably adequate description of the process generating the data (with respect to the question).

It is also captured by a general frequentist principle of evidence (FEV) (Mayo and Cox 2010) and a variant on the general idea of severity (SEV) (EGEK 1996, Mayo and Spanos 2006).[3]

The sampling distribution is computed under the assumption that the production of observed results is similar to the “background alone”, with respect to relative frequencies of signal-like events. (Likewise for computations under hypothesized discrepancies.) The relationship between H0 and the probabilities of outcomes is an intimate one: the various statistical null hypotheses refer to aspects of general types of data generating procedures (for a taxonomy, see Cox 1958, 1977). “H0 is true” is a shorthand for a very long statement that H0 is an approximately adequate model of a specified aspect of the process generating the data in the context. (This relates to statistical models and hypotheses living “lives of their own”.)

Severity and the detachment of inferences

The sampling distributions serve to give counterfactuals. In this case, they tell us what it would be like, statistically, were the mechanism generating the observed signals similar to H0.[i] While one would want to go on to consider the probability that test T yields so statistically significant an excess under various alternatives to μ = 0, this suffices for the present discussion. Sampling distributions can be used to arrive at error probabilities that are relevant for understanding the capabilities of the test process, in relation to something we want to find out. Since a relevant test statistic is a function of the data and quantities about which we want to learn, the associated sampling distribution is the key to inference. (This is why the bootstrap, and other types of re-sampling, work when one has a random sample from the process or population of interest.)

The severity principle, put more generally:

Data from a test T[ii] provide good evidence for inferring H (just) to the extent that H passes severely with x0, i.e., to the extent that H would (very probably) not have survived the test so well were H false.

(The severity principle can also be made out just in terms of relative frequencies, as with bootstrap re-sampling.) In this case, what is surviving is minimally the non-null. Regardless of the specification of a statistical inference, to assess the severity associated with a claim H requires considering H’s denial: together they exhaust the answers to a given question.
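For a concrete picture of how a severity assessment is computed, here is a toy calculation for a one-sided test of a Normal mean with known standard deviation, following the style of the Mayo and Spanos severity computation; the numbers are invented and have nothing to do with the Higgs analysis:

```python
import math

def phi(z: float) -> float:
    """Standard Normal cumulative distribution function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def severity(mu1: float, xbar: float, sigma: float, n: int) -> float:
    """SEV(mu > mu1): the probability of a result no larger than the
    observed mean, were mu actually equal to mu1."""
    return phi((xbar - mu1) / (sigma / math.sqrt(n)))

# Invented data: n = 25, sigma = 1, observed mean 0.4 (a 2-sigma result)
sev = {mu1: severity(mu1, xbar=0.4, sigma=1.0, n=25) for mu1 in (0.0, 0.2, 0.4)}
# mu > 0 passes with severity ~0.977, mu > 0.2 with ~0.841, but mu > 0.4
# only with 0.5: the data warrant inferring small discrepancies, not large ones
```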

Without making such a principle explicit, some critics assume the argument is all about the reported p-value. The inference actually detached from the evidence can be put in any number of ways, and no uniformity is to be expected or needed:

(3) There is strong evidence for H: a Higgs (or a Higgs-like) particle.

(3)’ They have experimentally demonstrated  H: a Higgs (or Higgs-like) particle.

Or just, infer H.

Doubtless particle physicists would qualify these statements, but nothing turns on that. ((3) and (3)’ are a bit stronger than merely falsifying the null because certain properties of the particle must be shown. I leave this to one side.)

As always, the mere p-value is a pale reflection of the detailed information about the consistency of results that really fortifies the knowledge of a genuine effect. Nor is the precise improbability level what matters. We care about the inferences to real effects (and estimated discrepancies) that are warranted.

Qualifying claims by how well they have been probed

The inference is qualified by the statistical properties of the test, as in (1) and (2), but that does not prevent detaching (3). This much is shown: they are able to experimentally demonstrate the Higgs particle. They can take that much of the problem as solved and move on to other problems of discerning the properties of the particle, and much else that goes beyond our discussion*. There is obeisance to the strict fallibility of every empirical claim, but there is no probability assigned.  Neither is there in day-to-day reasoning, nor in the bulk of scientific inferences, which are not formally statistical. Having inferred (3), granted, one may say informally, “so probably we have experimentally demonstrated the Higgs”, or “probably, the Higgs exists” (?). Or an informal use of “likely” might arise. But whatever these might mean in informal parlance, they are not formal mathematical probabilities. (As often argued on this blog, discussions on statistical philosophy must not confuse these.)

[We can however write, SEV(H) ~1]

The claim in (3) is approximate and limited–as are the vast majority of claims of empirical knowledge and inference–and, moreover, we can say in just what ways. It is recognized that subsequent data will add precision to the magnitudes estimated, and may eventually lead to new and even entirely revised interpretations of the known experimental effects, models and estimates. That is what cumulative knowledge is about. (I sometimes hear people assert, without argument, that modeled quantities, or parameters, used to describe data generating processes are “things in themselves” and are outside the realm of empirical inquiry. This is silly. Else we’d be reduced to knowing only tautologies and maybe isolated instances as to how “I seem to feel now,” attained through introspection.)

Telling what’s true about significance levels

So we grant the critic that something like the severity principle is needed to move from statistical information plus background (theoretical and empirical) to inferences about evidence (and to what levels of approximation). It may be called lots of other things and framed in different ways, and the reader is free to experiment. What we should not grant the critic is any allegation that there should be, or invariably is, a link from a small observed significance level to a small posterior probability assignment to H0. Worse, (1 − the p-value) is sometimes alleged to be the posterior probability accorded to the Standard Model itself! This is neither licensed nor wanted!

If critics (or the p-value police, as Wasserman called them) maintain that Higgs researchers are misinterpreting their significance levels, correct them with the probabilities in (1) and (2). If they say, it is patently obvious that Higgs researchers want to use the p-value as a posterior probability assignment to H0, point out the more relevant and actually attainable [iii] inference that is detached in (3). If they persist that what is really, really wanted is a posterior probability assignment to the inference about the Higgs in (3), ask why? As a formal posterior probability it would require a prior probability on all hypotheses that could explain the data. That would include not just H and H0 but all rivals to the Standard Model, rivals to the data and statistical models, and higher level theories as well. But can’t we just imagine a Bayesian catchall hypothesis?  On paper, maybe, but where will we get these probabilities? What do any of them mean? How can the probabilities even be comparable in different data analyses, using different catchalls and different priors?[iv]

Degrees of belief will not do. Many scientists perhaps had (and have) strong beliefs in the Standard Model before the big collider experiments—given its perfect predictive success. Others may believe (and fervently wish) that it will break down somewhere (showing supersymmetry or whatnot); a major goal of inquiry is learning about viable rivals and how they may be triggered and probed. Research requires an open world not a closed one with all possibilities trotted out and weighed by current beliefs. [v] We need to point up what has not yet been well probed which, by the way, is very different from saying of a theory that it is “not yet probable”.

Those prohibited phrases

One may wish to return to some of the condemned phrases of particular physics reports. Take,

“There is less than a one in a million chance that their results are a statistical fluke”.

This is not to assign a probability to the null, just one of many ways (perhaps not the best) of putting claims about the sampling distribution: the statistical null asserts that H0: background alone adequately describes the process.

H0 does not assert the results are a statistical fluke, but it tells us what we need to determine the probability of observed results “under H0”. In particular, consider all outcomes in the sample space that are further from the null prediction than the observed, in terms of p-values: {x: p < p0}. Even when H0 is true, such “signal like” outcomes may occur. They are p0-level flukes. Were such flukes generated even with moderate frequency under H0, they would not be evidence against H0. But in this case, such flukes occur a teeny tiny proportion of the time. Then SEV enters: if we are regularly able to generate such teeny tiny p-values, we have evidence of a genuine discrepancy from H0.
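That p0-level flukes occur under H0 with frequency no greater than p0, i.e. statement (2) above, can be checked by simulation. A sketch, assuming a Gaussian test statistic under background alone (p0 is set to .001 rather than .0000003 only so that a modest number of simulated experiments suffices):

```python
import math
import random

random.seed(0)

def p_value(z: float) -> float:
    """One-sided p-value for a standard Normal test statistic."""
    return 0.5 * math.erfc(z / math.sqrt(2))

# Draw the test statistic under H0 ("background alone") many times and
# count how often a fluke at least as significant as p0 turns up.
N = 1_000_000
p0 = 0.001
flukes = sum(p_value(random.gauss(0.0, 1.0)) < p0 for _ in range(N))
fluke_rate = flukes / N   # close to p0, as (2) requires
```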

I am repeating myself, I realize, in the hope that at least one phrasing will drive the point home. Nor is it even the improbability that substantiates this; it is the fact that an extraordinary set of coincidences would have to have occurred again and again. To nevertheless retain H0 as the source of the data would block learning. (Moreover, they know that if some horrible systematic mistake was made, it would be detected in later data analyses.)

I will not deny that there have been misinterpretations of p-values, but if a researcher has just described performing a statistical significance test, it would be “ungenerous” to twist probabilistic assertions into posterior probabilities. It would be a kind of “confirmation bias” whereby one insists on finding one sentence among very many that could conceivably be misinterpreted Bayesianly.

Triggering, indicating, inferring

As always, the error statistical philosopher would distinguish different questions at multiple stages of the inquiry. The aim of many preliminary steps is “behavioristic” and performance oriented: the goal being to control error rates on the way toward finding excess events or bumps of interest.

If interested: See statistical flukes (part 3)

The original posts of parts 1 and 2 had around 30 comments each; you might want to look at them:

Part 1: https://errorstatistics.com/2013/03/17/update-on-higgs-data-analysis-statistical-flukes-1/

Part 2 https://errorstatistics.com/2013/03/27/higgs-analysis-and-statistical-flukes-part-2/

*Fisher insisted that to assert a phenomenon is experimentally demonstrable: “[W]e need, not an isolated record, but a reliable method of procedure. In relation to the test of significance, we may say that a phenomenon is experimentally demonstrable when we know how to conduct an experiment which will rarely fail to give us a statistically significant result.” (Fisher, Design of Experiments, 1947, p. 14)

2018/2015/2014 Notes

[0] Physicists manage to learn quite a lot from negative results. They’d love to find something more exotic, but the negative results will not go away. A recent article from CERN, “We need to talk about the Higgs”, says: “While there are valid reasons to feel less than delighted by the null results of searches for physics beyond the Standard Model (SM), this does not justify a mood of despondency.”

“Physicists aren’t just praying for hints of new physics, Strassler stresses. He says there is very good reason to believe that the LHC should find new particles. For one, the mass of the Higgs boson, about 125.09 billion electron volts, seems precariously low if the census of particles is truly complete. Various calculations based on theory dictate that the Higgs mass should be comparable to a figure called the Planck mass, which is about 17 orders of magnitude higher than the boson’s measured heft.” The article is here.

[1]My presentation at a Symposium on the Higgs discovery at the Philosophy of Science Association (Nov. 2014) is here.

[2] I have often noted that there are other times where we are trying to find evidence to support a previously held position.

[3] Aspects of the statistical controversy in the Higgs episode occur in Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (Mayo 2018).


Original notes:

[i] This is a bit stronger than merely falsifying the null here, because certain features of the particle discerned must also be shown. I leave details to one side.

[ii] Which almost always refers to a set of tests, not just one.

[iii] I sense that some Bayesians imagine P(H) is more “hedged” than to actually infer (3). But the relevant hedging, the type we can actually attain, is  given by an assessment of severity or corroboration or the like. Background enters via a repertoire of information about experimental designs, data analytic techniques, mistakes and flaws to be wary of, and a host of theories and indications about which aspects have/have not been severely probed. Many background claims enter to substantiate the error probabilities; others do not alter them.

[iv] In aspects of the modeling, researchers make use of known relative frequencies of events (e.g., rates of types of collisions) that lead to legitimate, empirically based, frequentist “priors” if one wants to call them that.

[v] After sending out the letter, prompted by Lindley, O’Hagan wrote up a synthesis https://errorstatistics.com/2012/08/25/did-higgs-physicists-miss-an-opportunity-by-not-consulting-more-with-statisticians/

REFERENCES (from March, 2013 post):

ATLAS Collaboration  (November 14, 2012),  Atlas Note: “Updated ATLAS results on the signal strength of the Higgs-like boson for decays into WW and heavy fermion final states”, ATLAS-CONF-2012-162. http://cds.cern.ch/record/1494183/files/ATLAS-CONF-2012-162.pdf

Cox, D.R. (1958), “Some Problems Connected with Statistical Inference,” Annals of Mathematical Statistics, 29: 357–72.

Cox, D.R. (1977), “The Role of Significance Tests (with Discussion),” Scandinavian Journal of Statistics, 4: 49–70.

Mayo, D.G. (1996), Error and the Growth of Experimental Knowledge, University of Chicago Press, Chicago.

Mayo, D. G. and Cox, D. R. (2010). “Frequentist Statistics as a Theory of Inductive Inference” in Error and Inference: Recent Exchanges on Experimental Reasoning, Reliability and the Objectivity and Rationality of Science (D Mayo and A. Spanos eds.), Cambridge: Cambridge University Press: 247-275.

Mayo, D.G., and Spanos, A. (2006), “Severe Testing as a Basic Concept in a Neyman-Pearson Philosophy of Induction,” British Journal of Philosophy of Science, 57: 323–357.

Categories: Higgs, highly probable vs highly probed, P-values | Leave a comment

Replication Crises and the Statistics Wars: Hidden Controversies


Below are the slides from my June 14 presentation at the X-Phil conference on Reproducibility and Replicability in Psychology and Experimental Philosophy at University College London. What I think must be examined seriously are the “hidden” issues that are going unattended in replication research and related statistics wars. An overview of the “hidden controversies” is on slide #3. Although I was presenting them as “hidden”, I hoped they wouldn’t be quite as invisible as I found them throughout the conference. (Since my talk was at the start, I didn’t know what to expect–else I might have noted some examples that seemed to call for further scrutiny.) Exceptions came largely (but not exclusively) from a small group of philosophers (me, Machery and Fletcher). Then again, there were parallel sessions, so I missed some. However, I did learn something about X-phil, particularly from the very interesting poster session [1]. This new area should invite much, much more scrutiny of statistical methodology from philosophers of science.

[1] The women who organized and ran the conference did an excellent job: Lara Kirfel, a psychology PhD student at UCL, and Pascale Willemsen from Ruhr University.

Categories: Philosophy of Statistics, replication research, slides | Leave a comment

Your data-driven claims must still be probed severely

Vagelos Education Center

Below are the slides from my talk today at Columbia University at a session, Philosophy of Science and the New Paradigm of Data-Driven Science, at an American Statistical Association Conference on Statistical Learning and Data Science/Nonparametric Statistics. Todd was brave to sneak in philosophy of science in an otherwise highly mathematical conference.

Philosophy of Science and the New Paradigm of Data-Driven Science: (Room VEC 902/903)
Organizer and Chair: Todd Kuffner (Washington U)

  1. Deborah Mayo (Virginia Tech) “Your Data-Driven Claims Must Still be Probed Severely”
  2.  Ian McKeague (Columbia) “On the Replicability of Scientific Studies”
  3.  Xiao-Li Meng (Harvard) “Conducting Highly Principled Data Science: A Statistician’s Job and Joy”


Categories: slides, Statistics and Data Science | 5 Comments

“Intentions (in your head)” is the code word for “error probabilities (of a procedure)”: Allan Birnbaum’s Birthday

27 May 1923-1 July 1976


Today is Allan Birnbaum’s Birthday. Birnbaum’s (1962) classic “On the Foundations of Statistical Inference,” in Breakthroughs in Statistics (volume I, 1993), concerns a principle that remains at the heart of today’s controversies in statistics–even if it isn’t obvious at first: the Likelihood Principle (LP) (also called the strong Likelihood Principle, SLP, to distinguish it from the weak LP [1]). According to the LP/SLP, given the statistical model, the information from the data is fully contained in the likelihood ratio. Thus, properties of the sampling distribution of the test statistic vanish (as I put it in my slides from this post)! But error probabilities are all properties of the sampling distribution. Thus, embracing the LP (SLP) blocks our error statistician’s direct ways of taking into account “biasing selection effects” (slide #10). [Posted earlier here.] Interestingly, as seen in a 2018 post on Neyman, Neyman did discuss this paper, but had an odd reaction that I’m not sure I understand. (Check it out.) Continue reading

Categories: Birnbaum, Birnbaum Brakes, frequentist/Bayesian, Likelihood Principle, phil/history of stat, Statistics | 6 Comments

The Meaning of My Title: Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars


Excerpts from the Preface:

The Statistics Wars: 

Today’s “statistics wars” are fascinating: They are at once ancient and up to the minute. They reflect disagreements on one of the deepest, oldest, philosophical questions: How do humans learn about the world despite threats of error due to incomplete and variable data? At the same time, they are the engine behind current controversies surrounding high-profile failures of replication in the social and biological sciences. How should the integrity of science be restored? Experts do not agree. This book pulls back the curtain on why. Continue reading

Categories: Announcement, SIST | Leave a comment

Getting Up to Speed on Principles of Statistics


“If a statistical analysis is clearly shown to be effective … it gains nothing from being … principled,” according to Terry Speed in an interesting IMS article (2016) that Harry Crane tweeted about a couple of days ago [i]. Crane objects that you need principles to determine if it is effective, else it “seems that a method is effective (a la Speed) if it gives the answer you want/expect.” I suspected that what Speed was objecting to was an appeal to “principles of inference” of the type to which Neyman objected in my recent post. This turns out to be correct. Here are some excerpts from Speed’s article (emphasis is mine): Continue reading

Categories: Likelihood Principle, Philosophy of Statistics | 4 Comments

3 YEARS AGO (May 2015): Monthly Memory Lane

3 years ago…

MONTHLY MEMORY LANE: 3 years ago: May 2015. I mark in red 3-4 posts from each month that seem most apt for general background on key issues in this blog, excluding those reblogged recently[1]. Posts that are part of a “unit” or a group count as one, as in the case of 5/16, 5/19 and 5/24.

May 2015

  • 05/04 Spurious Correlations: Death by getting tangled in bedsheets and the consumption of cheese! (Aris Spanos)
  • 05/08 What really defies common sense (Msc kvetch on rejected posts)
  • 05/09 Stephen Senn: Double Jeopardy?: Judge Jeffreys Upholds the Law (sequel to the pathetic P-value)
  • 05/16 “Error statistical modeling and inference: Where methodology meets ontology” A. Spanos and D. Mayo
  • 05/19 Workshop on Replication in the Sciences: Society for Philosophy and Psychology: (2nd part of double header)
  • 05/24 From our “Philosophy of Statistics” session: APS 2015 convention
  • 05/27 “Intentions” is the new code word for “error probabilities”: Allan Birnbaum’s Birthday
  • 05/30 3 YEARS AGO (MAY 2012): Saturday Night Memory Lane

[1] Monthly memory lanes began at the blog’s 3-year anniversary in Sept, 2014.

I regret being away from blogging as of late (yes, the last bit of proofing on the book): I shall return soon! Send me stuff of yours to post, or items of interest, in the meantime.


Categories: 3-year memory lane | 1 Comment

Neyman vs the ‘Inferential’ Probabilists continued (a)


Today is Jerzy Neyman’s Birthday (April 16, 1894 – August 5, 1981). I am posting a brief excerpt and a link to a paper of his that I hadn’t posted before: Neyman, J. (1962), ‘Two Breakthroughs in the Theory of Statistical Decision Making‘ [i] It’s chock full of ideas and arguments, but the one that interests me at the moment is Neyman’s conception of “his breakthrough”, in relation to a certain concept of “inference”. “In the present paper” he tells us, “the term ‘inferential theory’…will be used to describe the attempts to solve the Bayes’ problem with a reference to confidence, beliefs, etc., through some supplementation …either a substitute a priori distribution [exemplified by the so called principle of insufficient reason] or a new measure of uncertainty” such as Fisher’s fiducial probability. Now Neyman always distinguishes his error statistical performance conception from Bayesian and Fiducial probabilisms [ii]. The surprising twist here is semantical and the culprit is none other than…Allan Birnbaum. Yet Birnbaum gets short shrift, and no mention is made of our favorite “breakthrough” (or did I miss it?). [iii] I’ll explain in later stages of this post & in comments…(so please check back); I don’t want to miss the start of the birthday party in honor of Neyman, and it’s already 8:30 p.m. in Berkeley!

Note: In this article,”attacks” on various statistical “fronts” refers to ways of attacking problems in one or another statistical research program. HAPPY BIRTHDAY NEYMAN! Continue reading

Categories: Bayesian/frequentist, Error Statistics, Neyman, Statistics | Leave a comment


3 years ago…

MONTHLY MEMORY LANE: 3 years ago: April 2015. I mark in red 3-4 posts from each month that seem most apt for general background on key issues in this blog, excluding those reblogged recently[1], and in green up to 3 others of general relevance to philosophy of statistics (in months where I’ve blogged a lot)[2].  Posts that are part of a “unit” or a group count as one.

April 2015

  • 04/01 Are scientists really ready for ‘retraction offsets’ to advance ‘aggregate reproducibility’? (let alone ‘precautionary withdrawals’)
  • 04/04 Joan Clarke, Turing, I.J. Good, and “that after-dinner comedy hour…”
  • 04/08 Heads I win, tails you lose? Meehl and many Popperians get this wrong (about severe tests)!
  • 04/13 Philosophy of Statistics Comes to the Big Apple! APS 2015 Annual Convention — NYC
  • 04/16 A. Spanos: Jerzy Neyman and his Enduring Legacy
  • 04/18 Neyman: Distinguishing tests of statistical hypotheses and tests of significance might have been a lapse of someone’s pen
  • 04/22 NEYMAN: “Note on an Article by Sir Ronald Fisher” (3 uses for power, Fisher’s fiducial argument)
  • 04/24 “Statistical Concepts in Their Relation to Reality” by E.S. Pearson
  • 04/27 3 YEARS AGO (APRIL 2012): MEMORY LANE
  • 04/30 96% Error in “Expert” Testimony Based on Probability of Hair Matches: It’s all Junk!


[1] Monthly memory lanes began at the blog’s 3-year anniversary in Sept, 2014.

[2] New Rule, July 30, 2016, March 30, 2017 - a very convenient way to allow data-dependent choices (note why it’s legit in selecting blog posts, on severity grounds).

Categories: 3-year memory lane | Leave a comment

New Warning: Proceed With Caution Until the “Alt Stat Approaches” are Evaluated

I predicted that the degree of agreement behind the ASA’s “6 principles” on p-values, partial as it was, was unlikely to be replicated when it came to most of the “other approaches” with which some would supplement or replace significance tests–notably Bayesian updating, Bayes factors, or likelihood ratios (confidence intervals are dual to hypothesis tests). [My commentary is here.] So now they may be advising a “hold off” or “go slow” approach until some consilience is achieved. Is that it? There’s word that the ASA will hold a meeting where the other approaches are put through their paces. I don’t know when. I was tweeted an article about the background chatter taking place behind the scenes; I wasn’t one of the people interviewed for this. Here are some excerpts; I may add more later after it has had time to sink in.

“Restoring Credibility in Statistical Science: Proceed with Caution Until a Balanced Critique Is In”

J. Hossiason Continue reading

Categories: Announcement | 2 Comments

February Palindrome Winner: Lucas Friesen

Winner of the February 2018 Palindrome Contest: (a dozen book choice)


Lucas Friesen: a graduate student in Measurement, Evaluation, and Research Methodology at the University of British Columbia


Ares, send a mere vest set? Bagel-bag madness.

Able! Elbas! Send AM: “Gable-Gab test severe. Madness era.”

The requirement: A palindrome using “madness*” (+ Elba, of course). (Statistical, philosophical, and scientific themes are awarded more points.) *Sorry, the editor got ahead of herself in an earlier post, listing March’s word.
Book choice: This is horribly difficult, but I think I have to go with the allure of the unknown: Statistical Inference as Severe Testing: How to get beyond the statistics wars.

Continue reading

Categories: Palindrome | Leave a comment

J. Pearl: Challenging the Hegemony of Randomized Controlled Trials: Comments on Deaton and Cartwright


Judea Pearl

Judea Pearl* wrote to me to invite readers of Error Statistics Philosophy to comment on a recent post of his (from his Causal Analysis blog here) pertaining to a guest post by Stephen Senn (“Being a Statistician Means never Having to Say You Are Certain”.) He has added a special addendum for us.[i]

Challenging the Hegemony of Randomized Controlled Trials: Comments on Deaton and Cartwright

Judea Pearl

I was asked to comment on a recent article by Angus Deaton and Nancy Cartwright (D&C), which touches on the foundations of causal inference. The article is titled: “Understanding and misunderstanding randomized controlled trials,” and can be viewed here: https://goo.gl/x6s4Uy

My comments are a mixture of a welcome and a puzzle; I welcome D&C’s stand on the status of randomized trials, and I am puzzled by how they choose to articulate the alternatives. Continue reading

Categories: RCTs | 26 Comments


3 years ago…

MONTHLY MEMORY LANE: 3 years ago: March 2015. I mark in red 3-4 posts from each month that seem most apt for general background on key issues in this blog, excluding those reblogged recently[1], and in green up to 3 others of general relevance to philosophy of statistics (in months where I’ve blogged a lot)[2].  Posts that are part of a “unit” or a group count as one.

March 2015

  • 03/01 “Probabilism as an Obstacle to Statistical Fraud-Busting”
  • 03/05 A puzzle about the latest test ban (or ‘don’t ask, don’t tell’)
  • 03/12 All She Wrote (so far): Error Statistics Philosophy: 3.5 years on
  • 03/16 Stephen Senn: The pathetic P-value (Guest Post)
  • 03/21 Objectivity in Statistics: “Arguments From Discretion and 3 Reactions”
  • 03/24 3 YEARS AGO (MARCH 2012): MEMORY LANE
  • 03/28 Your (very own) personalized genomic prediction varies depending on who else was around?

[1] Monthly memory lanes began at the blog’s 3-year anniversary in Sept, 2014.

[2] New Rule, July 30, 2016, March 30, 2017 - a very convenient way to allow data-dependent choices (note why it’s legit in selecting blog posts, on severity grounds).

Categories: 3-year memory lane | Leave a comment

Cover/Itinerary of Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars

SNEAK PREVIEW: Here’s the cover of Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars:

It should be out in July 2018. The “Itinerary”, generally known as the Table of Contents, is below. Note that this is not the actual pagination: I don’t have the page proofs yet. These are the pages of the draft I submitted. It should be around 50 pages shorter in the actual page proofs, maybe 380 pages.



Continue reading

Categories: Announcement | 9 Comments

Deconstructing the Fisher-Neyman conflict wearing fiducial glasses (continued)


Fisher/ Neyman

This continues my previous post: “Can’t take the fiducial out of Fisher…” in recognition of Fisher’s birthday, February 17. I supply a few more intriguing articles you may find enlightening to read and/or reread on a Saturday night.

Move up 20 years to the famous 1955/56 exchange between Fisher and Neyman. Fisher clearly connects Neyman’s adoption of a behavioristic-performance formulation to his denying the soundness of fiducial inference. When “Neyman denies the existence of inductive reasoning, he is merely expressing a verbal preference. For him ‘reasoning’ means what ‘deductive reasoning’ means to others.” (Fisher 1955, p. 74). Continue reading

Categories: fiducial probability, Fisher, Neyman, Statistics | 4 Comments

Can’t Take the Fiducial Out of Fisher (if you want to understand the N-P performance philosophy) [i]


R.A. Fisher: February 17, 1890 – July 29, 1962

Continuing with posts in recognition of R.A. Fisher’s birthday, I post one from a couple of years ago on a topic that had previously not been discussed on this blog: Fisher’s fiducial probability.

[Neyman and Pearson] “began an influential collaboration initially designed primarily, it would seem, to clarify Fisher’s writing. This led to their theory of testing hypotheses and to Neyman’s development of confidence intervals, aiming to clarify Fisher’s idea of fiducial intervals” (D.R. Cox, 2006, p. 195).

The entire episode of fiducial probability is fraught with minefields. Many say it was Fisher’s biggest blunder; others suggest it still hasn’t been understood. The majority of discussions omit the side trip to the Fiducial Forest altogether, finding the surrounding brambles too thorny to penetrate. Besides, a fascinating narrative about the Fisher-Neyman-Pearson divide has managed to bloom and grow while steering clear of fiducial probability–never mind that it remained a centerpiece of Fisher’s statistical philosophy. I now think that this is a mistake. It was thought, following Lehmann (1993) and others, that we could take the fiducial out of Fisher and still understand the core of the Neyman-Pearson vs Fisher (or Neyman vs Fisher) disagreements. We can’t. Quite aside from the intrinsic interest in correcting the “he said/he said” of these statisticians, the issue is intimately bound up with the current (flawed) consensus view of frequentist error statistics.

So what’s fiducial inference? I follow Cox (2006), adapting for the case of the lower limit: Continue reading

Categories: fiducial probability, Fisher, Statistics | Leave a comment

R.A. Fisher: “Statistical methods and Scientific Induction”

I continue a week of Fisherian posts in honor of his birthday (Feb 17). This is his contribution to the “Triad”–an exchange between Fisher, Neyman and Pearson 20 years after the Fisher-Neyman break-up. The other two are below. They are each very short and bear rereading.

17 February 1890 — 29 July 1962

“Statistical Methods and Scientific Induction”

by Sir Ronald Fisher (1955)


The attempt to reinterpret the common tests of significance used in scientific research as though they constituted some kind of acceptance procedure and led to “decisions” in Wald’s sense, originated in several misapprehensions and has led, apparently, to several more.

The three phrases examined here, with a view to elucidating the fallacies they embody, are:

  1. “Repeated sampling from the same population”,
  2. Errors of the “second kind”,
  3. “Inductive behavior”.

Mathematicians without personal contact with the Natural Sciences have often been misled by such phrases. The errors to which they lead are not only numerical.

To continue reading Fisher’s paper.


Note on an Article by Sir Ronald Fisher

by Jerzy Neyman (1956)




(1) FISHER’S allegation that, contrary to some passages in the introduction and on the cover of the book by Wald, this book does not really deal with experimental design is unfounded. In actual fact, the book is permeated with problems of experimentation.  (2) Without consideration of hypotheses alternative to the one under test and without the study of probabilities of the two kinds, no purely probabilistic theory of tests is possible.  (3) The conceptual fallacy of the notion of fiducial distribution rests upon the lack of recognition that valid probability statements about random variables usually cease to be valid if the random variables are replaced by their particular values.  The notorious multitude of “paradoxes” of fiducial theory is a consequence of this oversight.  (4)  The idea of a “cost function for faulty judgments” appears to be due to Laplace, followed by Gauss.


E.S. Pearson

“Statistical Concepts in Their Relation to Reality”.

by E.S. Pearson (1955)

Controversies in the field of mathematical statistics seem largely to have arisen because statisticians have been unable to agree upon how theory is to provide, in terms of probability statements, the numerical measures most helpful to those who have to draw conclusions from observational data.  We are concerned here with the ways in which mathematical theory may be put, as it were, into gear with the common processes of rational thought, and there seems no reason to suppose that there is one best way in which this can be done.  If, therefore, Sir Ronald Fisher recapitulates and enlarges on his views upon statistical methods and scientific induction we can all only be grateful, but when he takes this opportunity to criticize the work of others through misapprehension of their views as he has done in his recent contribution to this Journal (Fisher 1955, “Statistical Methods and Scientific Induction”), it is impossible to leave him altogether unanswered.

In the first place it seems unfortunate that much of Fisher’s criticism of Neyman and Pearson’s approach to the testing of statistical hypotheses should be built upon a “penetrating observation” ascribed to Professor G.A. Barnard, the assumption involved in which happens to be historically incorrect.  There was no question of a difference in point of view having “originated” when Neyman “reinterpreted” Fisher’s early work on tests of significance “in terms of that technological and commercial apparatus which is known as an acceptance procedure”. There was no sudden descent upon British soil of Russian ideas regarding the function of science in relation to technology and to five-year plans.  It was really much simpler–or worse.  The original heresy, as we shall see, was a Pearson one!…

To continue reading, “Statistical Concepts in Their Relation to Reality” click HERE

Categories: E.S. Pearson, fiducial probability, Fisher, Neyman, phil/history of stat | 3 Comments

R. A. Fisher: How an Outsider Revolutionized Statistics (Aris Spanos)



In recognition of R.A. Fisher’s birthday on February 17….

‘R. A. Fisher: How an Outsider Revolutionized Statistics’

by Aris Spanos

Few statisticians will dispute that R. A. Fisher (February 17, 1890 – July 29, 1962) is the father of modern statistics; see Savage (1976), Rao (1992). Inspired by William Gosset’s (1908) paper on the Student’s t finite sampling distribution, he recast statistics into the modern model-based induction in a series of papers in the early 1920s. He put forward a theory of optimal estimation based on the method of maximum likelihood that has changed only marginally over the last century. His significance testing, spearheaded by the p-value, provided the basis for the Neyman-Pearson theory of optimal testing in the early 1930s. According to Hald (1998)

“Fisher was a genius who almost single-handedly created the foundations for modern statistical science, without detailed study of his predecessors. When young he was ignorant not only of the Continental contributions but even of contemporary publications in English.” (p. 738)

What is not so well known is that Fisher was the ultimate outsider when he brought about this change of paradigms in statistical science. As an undergraduate, he studied mathematics at Cambridge, and then did graduate work in statistical mechanics and quantum theory. His meager knowledge of statistics came from his study of astronomy; see Box (1978). That, however, did not stop him from publishing his first paper in statistics in 1912 (still an undergraduate) on “curve fitting”, questioning Karl Pearson’s method of moments and proposing a new method that was eventually to become the likelihood method in his 1921 paper. Continue reading

Categories: Fisher, phil/history of stat, Spanos, Statistics | 3 Comments

Guest Blog: STEPHEN SENN: ‘Fisher’s alternative to the alternative’

“You May Believe You Are a Bayesian But You Are Probably Wrong”


As part of the week of recognizing R.A. Fisher (February 17, 1890 – July 29, 1962), I reblog a guest post by Stephen Senn from 2012/2017. The comments from 2017 led to a troubling issue that I will bring up in the comments today.

‘Fisher’s alternative to the alternative’

By: Stephen Senn

[2012 marked] the 50th anniversary of RA Fisher’s death. It is a good excuse, I think, to draw attention to an aspect of his philosophy of significance testing. In his extremely interesting essay on Fisher, Jimmie Savage drew attention to a problem in Fisher’s approach to testing. In describing Fisher’s aversion to power functions Savage writes, ‘Fisher says that some tests are more sensitive than others, and I cannot help suspecting that that comes to very much the same thing as thinking about the power function.’ (Savage 1976) (P473).

The modern statistician, however, has an advantage here denied to Savage. Savage’s essay was published posthumously in 1976 and the lecture on which it was based was given in Detroit on 29 December 1971 (P441). At that time Fisher’s scientific correspondence did not form part of his available oeuvre but in 1990 Henry Bennett’s magnificent edition of Fisher’s statistical correspondence (Bennett 1990) was published and this throws light on many aspects of Fisher’s thought including on significance tests. Continue reading

Categories: Fisher, S. Senn, Statistics | 1 Comment
