Performance or Probativeness? E.S. Pearson’s Statistical Philosophy

E.S. Pearson (11 Aug, 1895-12 June, 1980)

This is a belated birthday post for E.S. Pearson (11 August 1895-12 June, 1980). It’s basically a post from 2012 which concerns an issue of interpretation (long-run performance vs probativeness) that’s badly confused these days. I’ll blog some E. Pearson items this week, including my latest reflection on a historical anecdote regarding Egon and the woman he wanted to marry, and surely would have, were it not for his father Karl!


Are methods based on error probabilities of use mainly to supply procedures which will not err too frequently in some long run? (performance). Or is it the other way round: that the control of long run error properties is of crucial importance for probing the causes of the data at hand? (probativeness). I say no to the former and yes to the latter. This, I think, was also the view of Egon Sharpe (E.S.) Pearson.

Cases of Type A and Type B

“How far then, can one go in giving precision to a philosophy of statistical inference?” (Pearson 1947, 172)

Pearson considers the rationale that might be given to N-P tests in two types of cases, A and B:

“(A) At one extreme we have the case where repeated decisions must be made on results obtained from some routine procedure…

(B) At the other is the situation where statistical tools are applied to an isolated investigation of considerable importance…?” (ibid., 170)

In cases of type A, long-run results are clearly of interest, while in cases of type B, repetition is impossible and may be irrelevant:

“In other and, no doubt, more numerous cases there is no repetition of the same type of trial or experiment, but all the same we can and many of us do use the same test rules to guide our decision, following the analysis of an isolated set of numerical data. Why do we do this? What are the springs of decision? Is it because the formulation of the case in terms of hypothetical repetition helps to that clarity of view needed for sound judgment?

Or is it because we are content that the application of a rule, now in this investigation, now in that, should result in a long-run frequency of errors in judgment which we control at a low figure?” (Ibid., 173)

Although Pearson leaves this tantalizing question unanswered, claiming, “On this I should not care to dogmatize”, in studying how Pearson treats cases of type B, it is evident that in his view, “the formulation of the case in terms of hypothetical repetition helps to that clarity of view needed for sound judgment” in learning about the particular case at hand.

“Whereas when tackling problem A it is easy to convince the practical man of the value of a probability construct related to frequency of occurrence, in problem B the argument that ‘if we were to repeatedly do so and so, such and such result would follow in the long run’ is at once met by the commonsense answer that we never should carry out a precisely similar trial again.

Nevertheless, it is clear that the scientist with a knowledge of statistical method behind him can make his contribution to a round-table discussion…” (Ibid., 171).

Pearson gives the following example of a case of type B (from his wartime work), where he claims no repetition is intended:

“Example of type B. Two types of heavy armour-piercing naval shell of the same caliber are under consideration; they may be of different design or made by different firms…. Twelve shells of one kind and eight of the other have been fired; two of the former and five of the latter failed to perforate the plate….” (Pearson 1947, 171)

“Starting from the basis that, individual shells will never be identical in armour-piercing qualities, however good the control of production, he has to consider how much of the difference between (i) two failures out of twelve and (ii) five failures out of eight is likely to be due to this inevitable variability…” (Ibid.)

We’re interested in considering what other outcomes could have occurred, and how readily, in order to learn what variability alone is capable of producing. As a noteworthy aside, Pearson shows that treating the observed difference (between the two proportions) in one way yields an observed significance level of 0.052; treating it differently (along Barnard’s lines), he gets 0.025 as the (upper) significance level. But in scientific cases, Pearson insists, the difference in error probabilities makes no real difference to substantive judgments in interpreting the results. Only in an unthinking, automatic, routine use of tests would it matter:

“Were the action taken to be decided automatically by the side of the 5% level on which the observation point fell, it is clear that the method of analysis used would here be of vital importance. But no responsible statistician, faced with an investigation of this character, would follow an automatic probability rule.” (ibid., 192)

The two analyses correspond to the tests effectively asking different questions, and if we recognize this, says Pearson, different meanings may be appropriately attached.
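Readers who want to check the first of these figures can do so with a minimal sketch of a conditional (hypergeometric) tail calculation for the shell data: 2 failures out of 12 against 5 out of 8, with the total number of failures held fixed. That this is the calculation behind the 0.052 is an assumption here, though the result agrees with it to three decimals. The function name is illustrative, and the Barnard-style analysis giving 0.025 is not reproduced.

```python
from math import comb

def lower_tail_failures(fail_a, n_a, fail_b, n_b):
    """P(first group shows <= fail_a failures) when both margins are fixed.

    Under the hypothesis of no difference between the two shell types, the
    number of failures in the first group follows a hypergeometric law.
    """
    total = n_a + n_b
    total_fail = fail_a + fail_b
    denom = comb(total, n_a)
    numer = sum(
        comb(total_fail, k) * comb(total - total_fail, n_a - k)
        for k in range(fail_a + 1)
    )
    return numer / denom

# Shell data from the example: 2 of 12 failed versus 5 of 8 failed.
P_SHELLS = lower_tail_failures(2, 12, 5, 8)
print(round(P_SHELLS, 3))  # 0.052
```

The sum runs over outcomes at least as favourable to the first type of shell as the one observed, which is one way of asking what variability alone could readily produce.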

Three Steps in the Original Construction of Tests

After setting up the test (or null) hypothesis, and the alternative hypotheses against which “we wish the test to have maximum discriminating power” (Pearson 1947, 173), Pearson defines three steps in specifying tests:

“Step 1. We must specify the experimental probability set, the set of results which could follow on repeated application of the random process used in the collection of the data…

Step 2. We then divide this set [of possible results] by a system of ordered boundaries…such that as we pass across one boundary and proceed to the next, we come to a class of results which makes us more and more inclined on the information available, to reject the hypothesis tested in favour of alternatives which differ from it by increasing amounts”.

“Step 3. We then, if possible[i], associate with each contour level the chance that, if [the null] is true, a result will occur in random sampling lying beyond that level” (ibid.).

Pearson warns that:

“Although the mathematical procedure may put Step 3 before 2, we cannot put this into operation before we have decided, under Step 2, on the guiding principle to be used in choosing the contour system. That is why I have numbered the steps in this order.” (Ibid. 173).
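The three steps can be sketched on a toy case. The example below is hypothetical and not Pearson's: it assumes n independent trials, a null success probability p0, and alternatives under which larger counts are more likely. Step 1 lists the possible counts, step 2 orders them by increasing count, and step 3 attaches to each boundary the chance, under the null, of a result at or beyond it.

```python
from math import comb

def tail_probabilities(n, p0):
    """Map each possible count x to P(X >= x) for X ~ Binomial(n, p0).

    Step 1: the possible results are the counts 0..n.
    Step 2: they are ordered by increasing x (an assumed distance measure).
    Step 3: each boundary x gets the null probability of landing beyond it.
    """
    pmf = [comb(n, x) * p0**x * (1 - p0) ** (n - x) for x in range(n + 1)]
    tails = {}
    running = 0.0
    for x in range(n, -1, -1):  # accumulate from the most extreme result inward
        running += pmf[x]
        tails[x] = running
    return tails

# Hypothetical illustration: 10 trials, null probability 0.5.
TAILS = tail_probabilities(10, 0.5)
```

The ordering in step 2 is chosen before any probabilities are attached. Changing it (say, to two-sided distance from n*p0) changes which results count as "beyond" a boundary, and hence every number produced in step 3.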

Strict behavioristic formulations jump from step 1 to step 3, after which one may calculate how the test has in effect accomplished step 2. However, the resulting test, while having adequate error probabilities, may have an inadequate distance measure and may even be irrelevant to the hypothesis of interest. This is one reason critics can construct howlers that appear to be licensed by N-P methods, and which make their way from time to time into this blog.

So step 3 remains crucial, even for cases of type [B]. There are two reasons: first, pre-data planning (that’s familiar enough); second, post-data scrutiny. Post-data, step 3 enables determining the capability of the test to have detected various discrepancies, departures, and errors, on which a critical scrutiny of the inferences is based. More specifically, the error probabilities are used to determine how well or poorly corroborated, or how severely tested, various claims are, post-data.

If we can readily bring about statistically significantly higher rates of success with the first type of armour-piercing naval shell than with the second (in the above example), we have evidence the first is superior. Or, as Pearson modestly puts it: the results “raise considerable doubts as to whether the performance of the [second] type of shell was as good as that of the [first]….” (Ibid., 192)[ii]

Still, while error rates of procedures may be used to determine how severely claims have or have not passed, they do not automatically do so, hence, again, opening the door to potential howlers that neither Egon nor Jerzy, for that matter, would have countenanced.

Neyman Was the More Behavioristic of the Two

Pearson was (rightly) considered to have rejected the more behaviorist leanings of Neyman.

Here’s a snippet from an unpublished letter he wrote to Birnbaum (1974) about the idea that the N-P theory admits of two interpretations: behavioral and evidential:

“I think you will pick up here and there in my own papers signs of evidentiality, and you can say now that we or I should have stated clearly the difference between the behavioral and evidential interpretations. Certainly we have suffered since in the way the people have concentrated (to an absurd extent often) on behavioral interpretations”.

In Pearson’s (1955) response to Fisher (blogged here):

“To dispel the picture of the Russian technological bogey, I might recall how certain early ideas came into my head as I sat on a gate overlooking an experimental blackcurrant plot….!” (Pearson 1955, 204)

“To the best of my ability I was searching for a way of expressing in mathematical terms what appeared to me to be the requirements of the scientist in applying statistical tests to his data. After contact was made with Neyman in 1926, the development of a joint mathematical theory proceeded much more surely; it was not till after the main lines of this theory had taken shape with its necessary formalization in terms of critical regions, the class of admissible hypotheses, the two sources of error, the power function, etc., that the fact that there was a remarkable parallelism of ideas in the field of acceptance sampling became apparent. Abraham Wald’s contributions to decision theory of ten to fifteen years later were perhaps strongly influenced by acceptance sampling problems, but that is another story.” (ibid., 204-5).

“It may be readily agreed that in the first Neyman and Pearson paper of 1928, more space might have been given to discussing how the scientific worker’s attitude of mind could be related to the formal structure of the mathematical probability theory….Nevertheless it should be clear from the first paragraph of this paper that we were not speaking of the final acceptance or rejection of a scientific hypothesis on the basis of statistical analysis…. Indeed, from the start we shared Professor Fisher’s view that in scientific enquiry, a statistical test is ‘a means of learning’”… (Ibid., 206)

“Professor Fisher’s final criticism concerns the use of the term ‘inductive behavior’; this is Professor Neyman’s field rather than mine.” (Ibid., 207)




Pearson, E. S. (1947), “The Choice of Statistical Tests Illustrated on the Interpretation of Data Classed in a 2×2 Table,” Biometrika 34(1/2): 139-167.

Pearson, E. S. (1955), “Statistical Concepts in Their Relation to Reality,” Journal of the Royal Statistical Society, Series B (Methodological), 17(2): 204-207.

Neyman, J. and Pearson, E. S. (1928), “On the Use and Interpretation of Certain Test Criteria for Purposes of Statistical Inference, Part I.” Biometrika 20(A): 175-240.

[i] In some cases only an upper limit to this error probability may be found.

[ii] Pearson inadvertently switches from number of failures to number of successes in the conclusion of this paper.

Categories: highly probable vs highly probed, phil/history of stat, Statistics

Thieme on the theme of lowering p-value thresholds (for Slate)


Here’s an article by Nick Thieme on the same theme as my last blogpost. Thieme, who is Slate’s 2017 AAAS Mass Media Fellow, is the first person to interview me on p-values who (a) was prepared to think through the issue for himself (or herself), and (b) included more than a tiny fragment of my side of the exchange.[i] Please share your comments.

Will Lowering P-Value Thresholds Help Fix Science? P-values are already all over the map, and they’re also not exactly the problem.



Illustration by Slate

Last week a team of 72 scientists released the preprint of an article attempting to address one aspect of the reproducibility crisis, the crisis of conscience in which scientists are increasingly skeptical about the rigor of our current methods of conducting scientific research.

Their suggestion? Change the threshold for what is considered statistically significant. The team, led by Daniel Benjamin, a behavioral economist from the University of Southern California, is advocating that the “probability value” (p-value) threshold for statistical significance be lowered from the current standard of 0.05 to a much stricter threshold of 0.005.

P-values are tricky business, but here’s the basics on how they work: Let’s say I’m conducting a drug trial, and I want to know if people who take drug A are more likely to go deaf than if they take drug B. I’ll state that my hypothesis is “drugs A and B are equally likely to make someone go deaf,” administer the drugs, and collect the data. The data will show me the number of people who went deaf on drugs A and B, and the p-value will give me an indication of how likely it is that the difference in deafness was due to random chance rather than the drugs. If the p-value is lower than 0.05, it means that the chance this happened randomly is very small—it’s a 5 percent chance of happening, meaning it would only occur 1 out of 20 times if there wasn’t a difference between the drugs. If the threshold is lowered to 0.005 for something to be considered significant, it would mean that the chances of it happening without a meaningful difference between the treatments would be just 1 in 200.
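The “1 in 20” and “1 in 200” figures can be made concrete with a small sketch. It assumes, purely for illustration, that the comparison of the two drugs is summarized by a test statistic that is standard normal when there is no difference between them; the function name is made up for the example. The probabilities it returns are computed under that no-difference assumption.

```python
from math import erfc, sqrt

def two_sided_p(z):
    """Two-sided p-value for a statistic assumed standard normal under the null:
    the probability of a value at least as far from zero as |z|."""
    return erfc(abs(z) / sqrt(2))

# Cutoffs on the z scale that correspond (approximately) to the two thresholds.
P_AT_1_96 = two_sided_p(1.96)    # close to 0.05, i.e. about 1 in 20
P_AT_2_807 = two_sided_p(2.807)  # close to 0.005, i.e. about 1 in 200

print(round(1 / P_AT_1_96), round(1 / P_AT_2_807))
```

Moving the threshold from 0.05 to 0.005 thus amounts, in this idealized setting, to demanding a statistic of roughly 2.8 rather than roughly 2 standard errors from zero.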

On its face, this doesn’t seem like a bad idea. If this change requires scientists to have more robust evidence before they can come to conclusions, it’s easy to think it’s a step in the right direction. But one of the issues at the heart of making this change is that it seems to assume there’s currently a consensus around how p-value ought to be used and this consensus could just be tweaked to be stronger.

P-value use already varies by scientific field and by journal policies within those fields. Several journals in epidemiology, where the stakes of bad science are perhaps higher than in, say, psychology (if they mess up, people die), have discouraged the use of p-values for years. And even psychology journals are following suit: In 2015, Basic and Applied Social Psychology, a journal that has been accused of bad statistical (and experimental) practice, banned the use of p-values. Many other journals, including PLOS Medicine and Journal of Allergy and Clinical Immunology, actively discourage the use of p-values and significance testing already.

On the other hand, the New England Journal of Medicine, one of the most respected journals in that field, codes the 0.05 threshold for significance into its author guidelines, saying “significant differences between or among groups (i.e P<.05) should be identified in a table.” That may not be an explicit instruction to treat p-values less than 0.05 as significant, but an author could be forgiven for reading it that way. Other journals, like the Journal of Neuroscience and the Journal of Urology, do the same.

Another group of journals, including Science, Nature, and Cell, avoid giving specific advice on exactly how to use p-values; rather, they caution against common mistakes and emphasize the importance of scientific assumptions, trusting the authors to respect the nuance of any statistics tools. Deborah Mayo, award-winning philosopher of statistics and professor at Virginia Tech, thinks this approach to statistical significance, where various fields have different standards, is the most appropriate. Strict cutoffs, regardless of where they fall, are generally bad science.

Mayo was skeptical that it would have the kind of widespread benefit the authors assumed. Their assessment suggested tightening the threshold would reduce the rate of false positives—results that look true but aren’t—by a factor of two. But she questioned the assumption they had used to assess the reduction of false positives—that only 1 in 10 hypotheses a scientist tests is true. (Mayo said that if that were true, perhaps researchers should spend more time on their hypotheses.)

But more broadly, she was skeptical of the idea that lowering the informal p-value threshold will help fix the problem, because she’s doubtful such a move will address “what almost everyone knows is the real cause of nonreproducibility”: the cherry-picking of subjects, testing hypothesis after hypothesis until one of them is proven correct, and selective reporting of results and methodology.

There are plenty of other ways that scientists are testing to help address the replication crisis. There’s the move toward pre-registration of studies before analyzing data, in order to avoid fishing for significance. Researchers are also now encouraged to make data and code public so a third party can rerun analyses efficiently and check for discrepancies. More negative results are being published. And, perhaps most importantly, researchers are actually conducting studies to replicate research that has already been published. Tightening standards around p-values might help, but the debate about reproducibility is more than just a referendum on the p-value. The solution will need to be more than that as well.


[i] We did not discuss that recent test ban (“Don’t ask don’t tell”). If we had, I might have pointed him to my post on “P-value madness”.

Link to Nick Thieme’s Slate article: “Will Lowering P-Value Thresholds Help Fix Science? P-values are already all over the map, and they’re also not exactly the problem.”

Categories: P-values, reforming the reformers, spurious p values

“A megateam of reproducibility-minded scientists” look to lowering the p-value


Having discussed the “p-values overstate the evidence against the null fallacy” many times over the past few years, I leave it to readers to disinter the issues (pro and con), and appraise the assumptions, in the most recent rehearsal of the well-known Bayesian argument. There’s nothing intrinsically wrong with demanding everyone work with a lowered p-value, if you’re so inclined to embrace a single, dichotomous standard without context-dependent interpretations, especially if larger sample sizes are required to compensate for the loss of power. But lowering the p-value won’t solve the problems that vex people (biasing selection effects), and is very likely to introduce new ones (see my comment). Kelly Servick, a reporter from Science, gives the ingredients of the main argument given by “a megateam of reproducibility-minded scientists” in an article out today:

To explain to a broader audience how weak the .05 statistical threshold really is, Johnson joined with 71 collaborators on the new paper (which partly reprises an argument Johnson made for stricter p-values in a 2013 paper). Among the authors are some big names in the study of scientific reproducibility, including psychologist Brian Nosek of the University of Virginia in Charlottesville, who led a replication effort of high-profile psychology studies through the nonprofit Center for Open Science, and epidemiologist John Ioannidis of Stanford University in Palo Alto, California, known for pointing out systemic flaws in biomedical research.

The authors set up a scenario where the odds are one to 10 that any given hypothesis researchers are testing is inherently true—that a drug really has some benefit, for example, or a psychological intervention really changes behavior. (Johnson says that some recent studies in the social sciences support that idea.) If an experiment reveals an effect with an accompanying p-value of .05, that would actually mean that the null hypothesis—no real effect—is about three times more likely than the hypothesis being tested. In other words, the evidence of a true effect is relatively weak.

But under those same conditions (and assuming studies have 100% power to detect a true effect)—requiring a p-value at or below .005 instead of .05 would make for much stronger evidence: It would reduce the rate of false-positive results from 33% to 5%, the paper explains.
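The 33% and 5% figures follow from simple arithmetic under the scenario just quoted, and a short sketch makes the dependence on its assumptions visible. The function and parameter names are illustrative; the prior odds of 1 to 10 and the 100% power are the article's stated assumptions, not facts about any field.

```python
def false_positive_rate(alpha, prior_odds_true=1 / 10, power=1.0):
    """Share of 'significant' results that come from true nulls.

    prior_odds_true is the assumed ratio of true effects to true nulls
    among the hypotheses tested (1 to 10 in the quoted scenario).
    """
    true_positives = prior_odds_true * power  # per unit of true nulls
    false_positives = 1.0 * alpha             # true nulls that reach significance
    return false_positives / (false_positives + true_positives)

RATE_AT_05 = false_positive_rate(0.05)    # 0.05 / 0.15
RATE_AT_005 = false_positive_rate(0.005)  # 0.005 / 0.105

print(round(RATE_AT_05, 2), round(RATE_AT_005, 2))  # 0.33 0.05
```

Changing `prior_odds_true` or `power` changes the answer considerably, which is one reason the assumptions of the argument deserve scrutiny.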

Her article is here.

From the perspective of the Bayesian argument on which the proposal is based, the p-value appears to exaggerate evidence, but from the error statistical perspective, it’s the Bayesian inference (to the alternative) that exaggerates the inference beyond what frequentists allow. Greenland, Senn, Rothman, Carlin, Poole, Goodman, Altman (2016, p. 342) observe, correctly, that whether “P-values exaggerate the evidence” “depends on one’s philosophy of statistics and the precise meaning given to the terms involved”. [1]

Share your thoughts.

[1] “…[it] has been argued that P values overstate evidence against test hypotheses, based on directly comparing P values against certain quantities (likelihood ratios and Bayes factors) that play a central role as evidence measures in Bayesian analysis … Nonetheless, many other statisticians do not accept these quantities as gold standards” (Greenland et al, p. 342).

Categories: Error Statistics, highly probable vs highly probed, P-values, reforming the reformers


3 years ago…

MONTHLY MEMORY LANE: 3 years ago: July 2014. I mark in red 3-4 posts from each month that seem most apt for general background on key issues in this blog, excluding those reblogged recently[1]. Posts that are part of a “unit” or a group count as one. This month there are three such groups: 7/8 and 7/10; 7/14 and 7/23; 7/26 and 7/31.

July 2014

  • (7/7) Winner of June Palindrome Contest: Lori Wike
  • (7/8) Higgs Discovery 2 years on (1: “Is particle physics bad science?”)
  • (7/10) Higgs Discovery 2 years on (2: Higgs analysis and statistical flukes)
  • (7/14) “P-values overstate the evidence against the null”: legit or fallacious? (revised)
  • (7/23) Continued: “P-values overstate the evidence against the null”: legit or fallacious?
  • (7/26) S. Senn: “Responder despondency: myths of personalized medicine” (Guest Post)
  • (7/31) Roger Berger on Stephen Senn’s “Blood Simple” with a response by Senn (Guest Posts)

[1] Monthly memory lanes began at the blog’s 3-year anniversary in Sept, 2014.







Categories: 3-year memory lane, Higgs, P-values

On the current state of play in the crisis of replication in psychology: some heresies


The replication crisis has created a “cold war between those who built up modern psychology and those” tearing it down with failed replications–or so I read today [i]. As an outsider (to psychology), the severe tester is free to throw some fuel on the fire on both sides. This is a short update on my post “Some ironies in the replication crisis in social psychology” from 2014.

Following the model from clinical trials, an idea gaining steam is to prespecify a “detailed protocol that includes the study rationale, procedure and a detailed analysis plan” (Nosek 2017). In this new paper, they’re called registered reports (RRs). An excellent start. I say it makes no sense to favor preregistration and deny the relevance to evidence of optional stopping and outcomes other than the one observed. That your appraisal of the evidence is altered when you actually see the history supplied by the RR is equivalent to worrying about biasing selection effects when they’re not written down; your statistical method should pick up on them (as do p-values, confidence levels and many other error probabilities). There’s a tension between the RR requirements and accounts following the Likelihood Principle (no need to name names [ii]).

“By reviewing the hypotheses and analysis plans in advance, RRs should also help neutralize P-hacking and HARKing (hypothesizing after the results are known) by authors, and CARKing (critiquing after the results are known) by reviewers with their own investments in the research outcomes, although empirical evidence will be required to confirm that this is the case” (Nosek et al.)

A novel idea is that papers are to be provisionally accepted before the results are in. To the severe tester, that requires the author to explain how she will pinpoint blame for negative results. How will she use them to learn something (improve or falsify claims or methods)? I see nothing in preregistration, in and of itself, so far, to promote that. Existing replication research doesn’t go there. It would be wrong-headed to condemn CARKing, by the way. Post-data criticism of inquiries must be post-data. How else can you check if assumptions were met by the data in hand? [Note 7/12: Of course, what they must not be are ad hoc saves of the original finding, else they are unwarranted–minimal severity.] It would be interesting to see inquiries into potential hidden biases not often discussed. For example, what did the students (experimental subjects) know and when did they know it (the older the effect the more likely they know it)? What’s the attitude toward the finding conveyed (to experimental subjects) by the person running the study? I’ve little reason to point any fingers, it’s just part of the severe tester’s inclination toward cynicism and error probing. (See my “rewards and flexibility hypothesis” in my earlier discussion.)

It’s too soon to see how RRs will fare, but plenty of credit is due to those sticking their necks out to upend the status quo. Research into changing incentives is a field in its own right. The severe tester may, again, appear awfully jaundiced to raise any qualms, but we shouldn’t automatically assume that research into incentivizing researchers to behave in a fashion correlated with good science (data sharing, preregistration) is itself likely to improve the original field. Not without thinking through what would be needed to link statistics up with the substantive hypotheses or problem of interest. (Let me be clear, I love the idea of badges and other carrots; it’s just that the real scientific problems shouldn’t be lost sight of.) We might be incentivizing researchers to study how to incentivize researchers to behave in a fashion correlated with good science.

Surely there are areas where the effects or measurement instruments (or both) genuinely aren’t genuine. Isn’t it better to falsify them than to keep finding ad hoc ways to save them? Is jumping on the meta-research bandwagon[iii] just another way to succeed in a field that was questionable? Heresies, I know.

To get the severe tester into further hot water, I’ll share with you her view that, in some fields, if they completely ignored statistics and wrote about plausible conjectures about human motivations, prejudices, attitudes etc. they would have been better off. There’s a place for human interest conjectures, backed by interesting field studies rather than experiments on psych students. It’s when researchers try to “test” them using sciency methods that the whole thing becomes pseudosciency.

Please share your thoughts. (I may add to this, calling it (2).)

Nosek, B. A., Ebersole, C. R., DeHaven, A. C., & Mellor, D. T. (2017, July 8). The Preregistration Revolution (PDF). Open Science Framework. Retrieved from

[i] This article mentions a failed replication discussed on Gelman’s blog on July 8, on which I left some comments.

[ii] New readers, please search likelihood principle on this blog

[iii] This must be distinguished from the use of “meta” in describing a philosophical scrutiny of methods (meta-methodology). Statistical meta-researchers do not purport to be doing philosophy of science.

Categories: Error Statistics, preregistration, reforming the reformers, replication research

S. Senn: Fishing for fakes with Fisher (Guest Post)



Stephen Senn
Head of Competence Center
for Methodology and Statistics (CCMS)
Luxembourg Institute of Health
Twitter @stephensenn

Fishing for fakes with Fisher

The essential fact governing our analysis is that the errors due to soil heterogeneity will be divided by a good experiment into two portions. The first, which is to be made as large as possible, will be completely eliminated, by the arrangement of the experiment, from the experimental comparisons, and will be as carefully eliminated in the statistical laboratory from the estimate of error. As to the remainder, which cannot be treated in this way, no attempt will be made to eliminate it in the field, but, on the contrary, it will be carefully randomised so as to provide a valid estimate of the errors to which the experiment is in fact liable. R. A. Fisher, The Design of Experiments, (Fisher 1990) section 28.

Fraudian analysis?

John Carlisle must be a man endowed with exceptional energy and determination. A recent paper of his is entitled, ‘Data fabrication and other reasons for non-random sampling in 5087 randomised, controlled trials in anaesthetic and general medical journals,’ (Carlisle 2017) and has created quite a stir. The journals examined include the Journal of the American Medical Association and the New England Journal of Medicine. What Carlisle did was examine 29,789 variables using 72,261 means to see if they were ‘consistent with random sampling’ (by which, I suppose, he means ‘randomisation’). The papers chosen had to report either standard deviations or standard errors of the mean. P-values as measures of balance or lack of it were then calculated using each of three methods and the method that gave the value closest to 0.5 was chosen. For a given trial the P-values chosen were then back-converted to Z-scores, combined by summing them, and then re-converted to P-values using a method that assumes the summed Z-scores to be independent. As Carlisle writes, ‘All p values were one-sided and inverted, such that dissimilar means generated p values near 1’.
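The combination step described above resembles what is usually called Stouffer's method, and a minimal sketch may help fix ideas. It is not Carlisle's code: the function name, the sign convention (larger P-values map to larger Z-scores) and the division by the square root of the number of variables are assumptions made here for illustration.

```python
from math import sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # mean 0, standard deviation 1

def combine_p_values(p_values):
    """Combine one-sided P-values by summing their Z-scores.

    Each P-value (strictly between 0 and 1) is converted to a Z-score, the
    Z-scores are summed and scaled by sqrt(k), and the result is converted
    back. The scaling is only correct if the Z-scores are independent.
    """
    z_scores = [_STD_NORMAL.inv_cdf(p) for p in p_values]
    combined_z = sum(z_scores) / sqrt(len(z_scores))
    return _STD_NORMAL.cdf(combined_z)

# Four hypothetical baseline variables, each mildly on the 'dissimilar' side.
COMBINED = combine_p_values([0.9, 0.9, 0.9, 0.9])
```

Four unremarkable P-values of 0.9 combine to something above 0.99, which shows how quickly modest evidence accumulates when the independence assumption is taken for granted.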

He then used a QQ plot, which is to say he plotted the empirical distribution of his P-values against the theoretical one. For the latter he assumed that the P-values would have a uniform distribution, which is the distribution that ought to apply for P-values for baseline tests of 1) randomly chosen baseline variates, in 2) randomly chosen RCTs, 3) when analysed as randomised. The third condition is one I shall return to and the first is one many commentators have picked up. However, I am ashamed to say that the second is one I overlooked, despite the fact that every statistician should always ask ‘how did I get to see what I see?’, and it took a discussion with my daughter to reveal it to me.

Little Ps have lesser Ps etc.

Carlisle finds, from the QQ plot, that the theoretical distribution does not fit the empirical one at all well. There is an excess of P-values near 1, indicating far too frequent poorer-than-expected imbalance and an excess of P-values near 0 indicating balance that is too-good-to-be-true. He then calculates a P-value of P-values and finds that this is 1.2 × 10⁻⁷.

Before going any further, I ought to make clear that I consider that the community of those working on and using the results of randomised clinical trials (RCTs), whether as practitioners or theoreticians, owe Carlisle a debt of gratitude. Even if I don’t agree with all that he has done, the analysis raises disturbing issues and not necessarily the ones he was interested in. (However, it is also only fair to note that despite a rather provocative title, Carlisle has been much more circumspect in his conclusions than some commentators.) I also wish to make clear that I am dismissing neither error nor fraud as an explanation for some of these findings. The former is a necessary condition of being human and the latter far from incompatible with it. Carlisle disarmingly admits that he may have made errors and I shall follow him and confess likewise. Now to the three problems.

Three coins in the fountain

First, there is one decision that Carlisle made, which almost every statistical commentator has recognised as inappropriate. (See, for example, Nick Brown for a good analysis.) In fact, Carlisle himself even raised the difficulty, but I think he probably underestimated the problem. The method he uses for combining P-values only works if the baseline variables are independent. In general, they are not: sex and height, height and baseline forced expiratory volume in one second (FEV1), baseline FEV1 and age are simple examples from the field of asthma and similar ones can be found for almost every indication. The figure shows the Z-score inflation that attends combining correlated values as if they were independent. Each line gives the ratio of the falsely calculated Z-score to what it should be given a common positive correlation between covariates. (This correlation model is implausible but sufficient to illustrate the problem and simplifies both theory and exposition (Senn and Bretz 2007).) Given the common correlation coefficient assumption, this ratio only depends on the correlation coefficient itself and the number of variates combined. It can be seen that unless either the correlation is zero or the trivial case of 1 covariate is considered, Z-inflation occurs and it can easily be considerable. This phenomenon could be one explanation for the excess of P-values close to 0.
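The size of the inflation can be worked out directly under the common-correlation model. This is a sketch under stated assumptions rather than a reproduction of the figure: k unit-variance Z-scores with common correlation ρ are summed and divided by √k, as if independent. The sum actually has variance k(1 + (k − 1)ρ), so the correct divisor is √(k(1 + (k − 1)ρ)), and the ratio of the falsely calculated Z-score to the correct one is √(1 + (k − 1)ρ). The function name is illustrative.

```python
from math import sqrt

def z_inflation(k, rho):
    """Ratio of the naive combined Z-score to the correctly scaled one.

    Assumes k unit-variance Z-scores with a common correlation rho >= 0,
    combined as sum / sqrt(k) when the true standard deviation of the sum
    is sqrt(k * (1 + (k - 1) * rho)).
    """
    if k < 1 or not 0.0 <= rho <= 1.0:
        raise ValueError("need k >= 1 and 0 <= rho <= 1")
    return sqrt(1 + (k - 1) * rho)

# Hypothetical case: ten baseline variables with a common correlation of 0.5.
INFLATION_10_HALF = z_inflation(10, 0.5)
```

As the text says, the ratio depends only on ρ and k, and equals 1 exactly when ρ is zero or k is 1. With ten variables and ρ = 0.5 the naive Z-score is more than twice what it should be.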

I shall leave the second issue until last. The third issue is subtler than the first but is one Fisher was always warning researchers about and is reflected in the quotation in the rubric. If you block by a factor in an experiment but don’t eliminate it in analysis, the following are the consequences on the analysis of variance. 1) All the contribution in variability of the blocks is removed from the ‘treatment’ sum of squares. 2) That which is removed is added to the ‘error’ sum of squares. 3) The combined effect of 1) and 2)  means that the ratio of the two no longer has the assumed distribution under the null hypothesis (Senn 2004). In particular, excessively moderate Z scores may be the result.
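A small simulation may make the three consequences concrete. It is only a sketch under stated assumptions: blocks of two (matched pairs), Normal block effects and errors, no treatment effect, and an analysis that ignores the blocks by using an ordinary two-sample t statistic. The names `unpaired_t` and `rejection_rate` are illustrative, and 2.024 is the approximate two-sided 5% critical value of t with 38 degrees of freedom.

```python
import random
import statistics


def unpaired_t(a, b):
    """Two-sample t statistic with pooled variance (ignores any blocking)."""
    na, nb = len(a), len(b)
    pooled = ((na - 1) * statistics.variance(a)
              + (nb - 1) * statistics.variance(b)) / (na + nb - 2)
    diff = statistics.fmean(a) - statistics.fmean(b)
    return diff / (pooled * (1 / na + 1 / nb)) ** 0.5


def rejection_rate(block_sd, n_blocks=20, n_sims=4000, crit=2.024, seed=1):
    """Share of null trials with |t| > crit when the blocks are not fitted."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        treated, control = [], []
        for _ in range(n_blocks):
            block = rng.gauss(0.0, block_sd)  # effect shared within the block
            treated.append(block + rng.gauss(0.0, 1.0))
            control.append(block + rng.gauss(0.0, 1.0))
        if abs(unpaired_t(treated, control)) > crit:
            hits += 1
    return hits / n_sims


# Without block effects the test behaves as advertised (about 5%);
# with strong block effects it almost never rejects.
print(rejection_rate(0.0), rejection_rate(3.0))
```

The block effects cancel in the treatment difference but stay in the pooled variance, so the statistic is far too small: the nominal 5% test rejects much less often than 5% of the time.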

Now, it is a fact that very few trials are completely randomised. For example, many pharma-industry trials use randomly permuted blocks and many trials run by the UK Medical Research Council (MRC) or the European Organisation for Research and Treatment of Cancer (EORTC) use minimisation (Taves 1974). As regards the former, this tends to balance trials by centre. If there are strong differences between centres, the balancing alone will eliminate them from the treatment sum of squares but not from the error sum of squares, which, in fact, will increase. Since centre effects are commonly fitted in pharma-industry trials when analysing outcomes, this will not be a problem: in fact, much sharper inferences will result. It is interesting to note that Marvin Zelen, who was very familiar with public-sector trials but less so with pharmaceutical industry trials, does not appear to have been aware that this was so, and in a paper with Zheng recommended that centre effects ought to be eliminated in future (Zheng and Zelen 2008), unaware that in many cases they already were. Similar problems arise with minimised trials if covariates involved in minimisation are not fitted (Senn, Anisimov, and Fedorov 2010). Even if centre and covariate effects are fitted, if there are any time trends, both of the above methods of allocation will tend to balance by them (since typically the block size is smaller than the centre size and minimisation forces balance not only by the end of the trial but at any intermediate stage), and if so, this will inflate the error variance unless the time trend is fitted. The problem of time trends is one Carlisle himself alluded to.

Now, tests of baseline balance are nothing if not tests of the randomisation procedure itself (Berger and Exner 1999, Senn 1994). (They are useless for determining what covariates to fit.) Hence, if we are to test the randomisation procedure, we require that the distribution of the test-statistic under the null hypothesis has the required form and Fisher warned us it wouldn’t, except by luck, if we blocked and didn’t eliminate. The result would be to depress the Z-statistic. Thus, this is a plausible explanation of the excess of P-values near one that Carlisle noted, since he could not adjust for such randomisation approaches.

Now to the second and (given my order of exposition) last of my three issues. I had assumed, like most other commentators, that the distribution of covariates at baseline ought to be random to the degree specified by the randomisation procedure (which is covered by issue three). This is true for each and every trial looking forward. It is not true for published trials looking backward. The question my daughter put to me was, ‘what about publication bias?’, and stupidly I replied, ‘but this is not about outcomes’. However, as I ought to know, the conditional Type I error rate of an outcome variable varies with the degree of balance and correlation with a baseline variable. What applies one way applies the other and since journals have a bias in favour of positive results (often ascribed to the process of submission only (Goldacre 2012) but very probably part of the editorial process also (Senn 2012, 2013)), published trials do not provide a representative sample of trials undertaken. Now, although the relationship between balance and the Type I error rate is simple (Senn 1989), the relationship between being published and balance is much more complex, depending as it does on two further difficult-to-study things: 1) the distribution of real treatment effects (if I can be permitted a dig at a distinguished scientist and ‘blogging treasure’, only David Colquhoun thinks this is easy); 2) the extent of publication bias.

However, even without the information we need, it is clear that one cannot simply expect the baseline distribution of published trials to be random.
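A toy simulation can show the direction of the effect. It deliberately fixes the two hard-to-study quantities at extreme values, which is an assumption made purely for illustration: every treatment effect is null, and a trial is 'published' only if its unadjusted outcome Z-statistic is significant. The baseline and outcome Z-statistics are taken to be standard bivariate Normal with correlation rho; the function name `baseline_z_of_published` is hypothetical.

```python
import random
import statistics


def baseline_z_of_published(rho, n_trials=200_000, crit=1.96, seed=7):
    """Baseline Z-scores of null trials kept because |outcome Z| > crit."""
    rng = random.Random(seed)
    root = (1.0 - rho * rho) ** 0.5
    kept = []
    for _ in range(n_trials):
        z_base = rng.gauss(0.0, 1.0)  # standardised baseline imbalance
        # Unadjusted outcome statistic, correlated with the baseline one.
        z_out = rho * z_base + root * rng.gauss(0.0, 1.0)
        if abs(z_out) > crit:  # crude stand-in for publication bias
            kept.append(z_base)
    return kept


for rho in (0.0, 0.7):
    kept = baseline_z_of_published(rho)
    print(rho, len(kept), round(statistics.pvariance(kept), 2))
```

In this toy model, when the baseline variable is uncorrelated with the outcome the published baseline Z-scores keep a variance near one, but with a correlation of 0.7 the variance is well above one: selection on outcomes has produced an excess of apparently imbalanced trials without any error or fraud.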

Two out of three is bad

Which of Carlisle’s findings turn out to be fraud, which error and which explicable by one of these three (or other) mechanisms, remains to be seen. The first one is easily dealt with. This is just an inappropriate analysis. Things should not be looked at this way. However, pace Meat Loaf, two out of three is bad when the two are failures of the system.

As regards issue two, publication bias is a problem and we need to deal with it. Relying on journals to publish trials is hopeless: self-publication by sponsors or trialists is the answer.

However, issue three is a widespread problem: Fisher warned us to analyse as we randomise. If we block or balance by factors that we don’t include in our models, we are simply making trials bigger than they should be and producing standard errors in the process that are larger than necessary. This is sometimes defended on the grounds that it produces conservative inference but in that respect I can’t see how it is superior to multiplying all standard errors by two. Most of us, I think, would regard it as a grave sin to analyse a matched pairs design as a completely randomised one. Failure to attract any marks is a common punishment in stat 1 examinations when students make this error. Too many of us, I fear, fail to truly understand why this implies there is a problem with minimised trials as commonly analysed. (See Indefinite Irrelevance for a discussion.)

As ye randomise so shall ye analyse (although ye may add some covariates) we were warned by the master. We ignore him at our peril. MRC & EORTC, please take note.


I thank Dr Helen Senn for useful conversations. My research on inference for small populations is carried out in the framework of the IDeAL project and supported by the European Union’s Seventh Framework Programme for research, technological development and demonstration under Grant Agreement no 602552.


Berger, V. W., and D. V. Exner. 1999. “Detecting selection bias in randomized clinical trials.” Controlled Clinical Trials no. 20 (4):319-327.

Carlisle, J. B. 2017. “Data fabrication and other reasons for non-random sampling in 5087 randomised, controlled trials in anaesthetic and general medical journals.” Anaesthesia. doi: 10.1111/anae.13938.

Fisher, Ronald Aylmer, ed. 1990. The Design of Experiments. Edited by J.H. Bennett, Statistical Methods, Experimental Design and Scientific Inference. Oxford: Oxford.

Goldacre, B. 2012. Bad Pharma: How Drug Companies Mislead Doctors and Harm Patients. London: Fourth Estate.

Senn, S., and F. Bretz. 2007. “Power and sample size when multiple endpoints are considered.” Pharm Stat no. 6 (3):161-70.

Senn, S.J. 1989. “Covariate imbalance and random allocation in clinical trials [see comments].” Statistics in Medicine no. 8 (4):467-75.

Senn, S.J. 1994. “Testing for baseline balance in clinical trials.” Statistics in Medicine no. 13 (17):1715-26.

Senn, S.J. 2004. “Added Values: Controversies concerning randomization and additivity in clinical trials.” Statistics in Medicine no. 23 (24):3729-3753.

Senn, S.J., V. V. Anisimov, and V. V. Fedorov. 2010. “Comparisons of minimization and Atkinson’s algorithm.” Statistics in Medicine no. 29 (7-8):721-30.

Senn, Stephen. 2012. “Misunderstanding publication bias: editors are not blameless after all.” F1000Research no. 1.

Senn, Stephen. 2013. “Authors are also reviewers: problems in assigning cause for missing negative studies.”

Taves, D. R. 1974. “Minimization: a new method of assigning patients to treatment and control groups.” Clinical Pharmacology and Therapeutics no. 15 (5):443-53.

Zheng, L., and M. Zelen. 2008. “Multi-center clinical trials: randomization and ancillary statistics.” Annals of Applied Statistics no. 2 (2):582-600. doi: 10.1214/07-aoas151.

Categories: Fisher, RCTs, Stephen Senn | 5 Comments


3 years ago...

MONTHLY MEMORY LANE: 3 years ago: June 2014. I mark in red 3-4 posts from each month that seem most apt for general background on key issues in this blog, excluding those reblogged recently[1], and in green up to 4 others of general relevance to philosophy of statistics [2].  Posts that are part of a “unit” or a group count as one.

June 2014

  • (6/5) Stephen Senn: Blood Simple? The complicated and controversial world of bioequivalence (guest post)
  • (6/9) “The medical press must become irrelevant to publication of clinical trials.”
  • (6/11) A. Spanos: “Recurring controversies about P values and confidence intervals revisited”
  • (6/14) “Statistical Science and Philosophy of Science: where should they meet?”
  • (6/21) Big Bayes Stories? (draft ii)
  • (6/25) Blog Contents: May 2014
  • (6/28) Sir David Hendry Gets Lifetime Achievement Award
  • (6/30) Some ironies in the ‘replication crisis’ in social psychology (4th and final installment)

[1] Monthly memory lanes began at the blog’s 3-year anniversary in Sept, 2014.

[2] New Rule, July 30, 2016, March 30, 2017 (moved to 4) - very convenient way to allow data-dependent choices.






Categories: 3-year memory lane | Leave a comment

Can You Change Your Bayesian Prior? The one post whose comments (some of them) will appear in my new book


I blogged this exactly 2 years ago here, seeking insight for my new book (Mayo 2017). Over 100 (rather varied) interesting comments ensued. This is the first time I’m incorporating blog comments into published work. You might be interested to follow the nooks and crannies from back then, or add a new comment to this.

This is one of the questions high on the “To Do” list I’ve been keeping for this blog.  The question grew out of discussions of “updating and downdating” in relation to papers by Stephen Senn (2011) and Andrew Gelman (2011) in Rationality, Markets, and Morals.[i]

“As an exercise in mathematics [computing a posterior based on the client’s prior probabilities] is not superior to showing the client the data, eliciting a posterior distribution and then calculating the prior distribution; as an exercise in inference Bayesian updating does not appear to have greater claims than ‘downdating’.” (Senn, 2011, p. 59)

“If you could really express your uncertainty as a prior distribution, then you could just as well observe data and directly write your subjective posterior distribution, and there would be no need for statistical analysis at all.” (Gelman, 2011, p. 77)

But if uncertainty is not expressible as a prior, then a major lynchpin for Bayesian updating seems questionable. If you can go from the posterior to the prior, on the other hand, perhaps it can also lead you to come back and change it.

Is it legitimate to change one’s prior based on the data? Continue reading

Categories: Bayesian priors, Bayesian/frequentist | 14 Comments

Performance or Probativeness? E.S. Pearson’s Statistical Philosophy

E.S. Pearson (11 Aug, 1895-12 June, 1980)

E.S. Pearson died on this day in 1980. Aside from being co-developer of Neyman-Pearson statistics, Pearson was interested in philosophical aspects of statistical inference. A question he asked is this: Are methods with good error probabilities of use mainly to supply procedures which will not err too frequently in some long run? (performance). Or is it the other way round: that the control of long-run error properties is of crucial importance for probing the causes of the data at hand? (probativeness). I say no to the former and yes to the latter. But how exactly does it work? It’s not just the frequentist error statistician who faces this question, but also some contemporary Bayesians who aver that the performance or calibration of their methods supplies an evidential (or inferential or epistemic) justification (e.g., Robert Kass 2011). The latter generally tie the reliability of the method that produces the particular inference C to degrees of belief in C. The inference takes the form of a probabilism, e.g., Pr(C|x), equated, presumably, to the reliability (or coverage probability) of the method. But why? The frequentist inference is C, which is qualified by the reliability of the method, but there’s no posterior assigned to C. Again, what’s the rationale? I think existing answers (from both tribes) come up short in non-trivial ways.

I’ve recently become clear (or clearer) on a view I’ve been entertaining for a long time. There’s more than one goal in using probability, but when it comes to statistical inference in science, I say, the goal is not to infer highly probable claims (in the formal sense)* but claims which have been highly probed and have passed severe probes. Even highly plausible claims can be poorly tested (and I require a bit more of a test than informal uses of the word). The frequency properties of a method are relevant in those contexts where they provide assessments of a method’s capabilities and shortcomings in uncovering ways C may be wrong. Knowledge of the method’s capabilities is used, in turn, to ascertain how well or severely C has been probed. C is warranted only to the extent that it survived a severe probe of ways it can be incorrect. There’s poor evidence for C when little has been done to rule out C’s flaws. The most important role of error probabilities is in blocking inferences to claims that have not passed severe tests, but also to falsify (statistically) claims whose denials pass severely. This view is in the spirit of E.S. Pearson, Peirce, and Popper–though none fully worked it out. That’s one of the things I do, or try to do, in my latest work. Each supplied important hints. The following remarks of Pearson, earlier blogged here, contain some of his hints.

*Nor to give a comparative assessment of the probability of claims

From Pearson, E. S. (1947)

“How far then, can one go in giving precision to a philosophy of statistical inference?” (Pearson 1947, 172)

Continue reading

Categories: E.S. Pearson, highly probable vs highly probed, phil/history of stat | Leave a comment


3 years ago...

MONTHLY MEMORY LANE: 3 years ago: May 2014. I leave them unmarked this month; read whatever looks interesting.

May 2014

  • (5/1) Putting the brakes on the breakthrough: An informal look at the argument for the Likelihood Principle
  • (5/3) You can only become coherent by ‘converting’ non-Bayesianly
  • (5/6) Winner of April Palindrome contest: Lori Wike
  • (5/7) A. Spanos: Talking back to the critics using error statistics (Phil6334)
  • (5/10) Who ya gonna call for statistical Fraudbusting? R.A. Fisher, P-values, and error statistics (again)
  • (5/15) Scientism and Statisticism: a conference* (i)
  • (5/17) Deconstructing Andrew Gelman: “A Bayesian wants everybody else to be a non-Bayesian.”
  • (5/20) The Science Wars & the Statistics Wars: More from the Scientism workshop
  • (5/25) Blog Table of Contents: March and April 2014
  • (5/27) Allan Birnbaum, Philosophical Error Statistician: 27 May 1923 – 1 July 1976
  • (5/31) What have we learned from the Anil Potti training and test data frameworks? Part 1 (draft 2)

[1] Monthly memory lanes began at the blog’s 3-year anniversary in Sept, 2014.







Categories: 3-year memory lane | 1 Comment

Allan Birnbaum: Foundations of Probability and Statistics (27 May 1923 – 1 July 1976)

27 May 1923-1 July 1976

Today is Allan Birnbaum’s birthday. In honor of his birthday, I’m posting the articles in the Synthese volume that was dedicated to his memory in 1977. The editors describe it as their way of  “paying homage to Professor Birnbaum’s penetrating and stimulating work on the foundations of statistics”. I paste a few snippets from the articles by Giere and Birnbaum. If you’re interested in statistical foundations, and are unfamiliar with Birnbaum, here’s a chance to catch up. (Even if you are, you may be unaware of some of these key papers.)


Synthese Volume 36, No. 1 Sept 1977: Foundations of Probability and Statistics, Part I

Editorial Introduction:

This special issue of Synthese on the foundations of probability and statistics is dedicated to the memory of Professor Allan Birnbaum. Professor Birnbaum’s essay ‘The Neyman-Pearson Theory as Decision Theory; and as Inference Theory; with a Criticism of the Lindley-Savage Argument for Bayesian Theory’ was received by the editors of Synthese in October, 1975, and a decision was made to publish a special symposium consisting of this paper together with several invited comments and related papers. The sad news about Professor Birnbaum’s death reached us in the summer of 1976, but the editorial project could nevertheless be completed according to the original plan. By publishing this special issue we wish to pay homage to Professor Birnbaum’s penetrating and stimulating work on the foundations of statistics. We are grateful to Professor Ronald Giere who wrote an introductory essay on Professor Birnbaum’s concept of statistical evidence and who compiled a list of Professor Birnbaum’s publications.


Continue reading

Categories: Birnbaum, Likelihood Principle, Statistics, strong likelihood principle | 1 Comment

Frequentstein’s Bride: What’s wrong with using (1 – β)/α as a measure of evidence against the null?



ONE YEAR AGO: …and growing more relevant all the time. Rather than leak any of my new book*, I reblog some earlier posts, even if they’re a bit scruffy. This was first blogged here (with a slightly different title). It’s married to posts on “the P-values overstate the evidence against the null fallacy”, such as this, and is wedded to this one on “How to Tell What’s True About Power if You’re Practicing within the Frequentist Tribe”. 

In their “Comment: A Simple Alternative to p-values,” (on the ASA P-value document), Benjamin and Berger (2016) recommend researchers report a pre-data Rejection Ratio:

It is the probability of rejection when the alternative hypothesis is true, divided by the probability of rejection when the null hypothesis is true, i.e., the ratio of the power of the experiment to the Type I error of the experiment. The rejection ratio has a straightforward interpretation as quantifying the strength of evidence about the alternative hypothesis relative to the null hypothesis conveyed by the experimental result being statistically significant. (Benjamin and Berger 2016, p. 1)
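To see what the quoted definition computes, here is a minimal sketch for one concrete case. The setting is an assumption chosen for illustration, not something specified in the quotation: a one-sided Normal test of µ ≤ 0 against µ > 0 with known σ, evaluated at a single point alternative. The function name `rejection_ratio` and its parameters are hypothetical.

```python
from statistics import NormalDist


def rejection_ratio(alpha, effect, n, sigma=1.0):
    """Pre-data rejection ratio, power / alpha, for a one-sided z-test.

    Power is computed at the point alternative mu = effect for n iid
    Normal observations with known standard deviation sigma.
    """
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1.0 - alpha)  # rejection cutoff in z units
    power = 1.0 - std_normal.cdf(z_alpha - effect * n ** 0.5 / sigma)
    return power / alpha


# alpha = 0.05, alternative half a standard deviation above 0, n = 25.
print(round(rejection_ratio(0.05, 0.5, 25), 2))
```

Note that the ratio is fixed before any data are seen: it depends only on α, the chosen alternative, n and σ, so every statistically significant result from the same design receives the same number.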

Continue reading

Categories: Bayesian/frequentist, fallacy of rejection, J. Berger, power, S. Senn | 17 Comments


3 years ago...

MONTHLY MEMORY LANE: 3 years ago: April 2014. I mark in red three posts from each month that seem most apt for general background on key issues in this blog, excluding those reblogged recently[1], and in green up to 4 others I’d recommend[2].  Posts that are part of a “unit” or a group count as one. For this month, I’ll include all the 6334 seminars as “one”.

April 2014

  • (4/1) April Fool’s. Skeptical and enthusiastic Bayesian priors for beliefs about insane asylum renovations at Dept of Homeland Security: I’m skeptical and unenthusiastic
  • (4/3) Self-referential blogpost (conditionally accepted*)
  • (4/5) Who is allowed to cheat? I.J. Good and that after dinner comedy hour. . ..
  • (4/6) Phil6334: Duhem’s Problem, highly probable vs highly probed; Day #9 Slides
  • (4/8) “Out Damned Pseudoscience: Non-significant results are the new ‘Significant’ results!” (update)
  • (4/12) “Murder or Coincidence?” Statistical Error in Court: Richard Gill (TEDx video)
  • (4/14) Phil6334: Notes on Bayesian Inference: Day #11 Slides
  • (4/16) A. Spanos: Jerzy Neyman and his Enduring Legacy
  • (4/17) Duality: Confidence intervals and the severity of tests
  • (4/19) Getting Credit (or blame) for Something You Didn’t Do (BP oil spill)
  • (4/21) Phil 6334: Foundations of statistics and its consequences: Day#12
  • (4/23) Phil 6334 Visitor: S. Stanley Young, “Statistics and Scientific Integrity”
  • (4/26) Reliability and Reproducibility: Fraudulent p-values through multiple testing (and other biases): S. Stanley Young (Phil 6334: Day #13)
  • (4/30) Able Stats Elba: 3 Palindrome nominees for April! (rejected post)


[1] Monthly memory lanes began at the blog’s 3-year anniversary in Sept, 2014.

[2] New Rule, July 30, 2016, March 30, 2017 (moved to 4) - very convenient way to allow data-dependent choices.






Categories: 3-year memory lane, Statistics | Leave a comment

How to tell what’s true about power if you’re practicing within the error-statistical tribe



This is a modified reblog of an earlier post, since I keep seeing papers that confuse this.

Suppose you are reading about a result x  that is just statistically significant at level α (i.e., P-value = α) in a one-sided test T+ of the mean of a Normal distribution with n iid samples, and (for simplicity) known σ:   H0: µ ≤  0 against H1: µ >  0. 

I have heard some people say:

A. If the test’s power to detect alternative µ’ is very low, then the just statistically significant x is poor evidence of a discrepancy (from the null) corresponding to µ’ (i.e., there’s poor evidence that µ > µ’).* See point on language in notes.

They will generally also hold that if POW(µ’) is reasonably high (at least .5), then the inference to µ > µ’ is warranted, or at least not problematic.

I have heard other people say:

B. If the test’s power to detect alternative µ’ is very low, then the just statistically significant x is good evidence of a discrepancy (from the null) corresponding to µ’ (i.e., there’s good evidence that  µ > µ’).

They will generally also hold that if POW(µ’) is reasonably high (at least .5), then the inference to µ > µ’ is unwarranted.

Which is correct, from the perspective of the (error statistical) philosophy, within which power and associated tests are defined? Continue reading
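For readers who want to put numbers on POW(µ’) while weighing A against B, the power of test T+ can be sketched as follows. This only computes power; it does not settle which claim is correct. The function name `power_t_plus` and the example values (n = 100, σ = 10, α = 0.025) are illustrative assumptions.

```python
from statistics import NormalDist


def power_t_plus(mu_alt, n, sigma, alpha):
    """POW(mu') for T+: H0: mu <= 0 vs H1: mu > 0, known sigma, n iid Normals.

    The test rejects when the sample mean exceeds the cutoff, which is also
    the value of a sample mean that is just significant at level alpha.
    """
    std_normal = NormalDist()
    se = sigma / n ** 0.5
    cutoff = std_normal.inv_cdf(1.0 - alpha) * se
    return 1.0 - std_normal.cdf((cutoff - mu_alt) / se)


# Power at a few alternatives, measured in standard-error units.
for k in (0.0, 1.0, 2.0, 3.0):
    print(k, round(power_t_plus(k * 1.0, n=100, sigma=10.0, alpha=0.025), 3))
```

Two fixed points are worth keeping in mind: the power at µ’ = 0 equals α, and the power at µ’ equal to the cutoff itself is exactly 0.5.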

Categories: power, reforming the reformers | 17 Comments

“Fusion-Confusion?” My Discussion of Nancy Reid: “BFF Four- Are we Converging?”


Here are the slides from my discussion of Nancy Reid today at BFF4: The Fourth Bayesian, Fiducial, and Frequentist Workshop: May 1-3, 2017 (hosted by Harvard University)

Categories: Bayesian/frequentist, C.S. Peirce, confirmation theory, fiducial probability, Fisher, law of likelihood, Popper | 1 Comment

S. Senn: “Automatic for the people? Not quite” (Guest post)

Stephen Senn
Head of  Competence Center for Methodology and Statistics (CCMS)
Luxembourg Institute of Health
Twitter @stephensenn

Automatic for the people? Not quite

What caught my eye was the estimable (in its non-statistical meaning) Richard Lehman tweeting about the equally estimable John Ioannidis. For those who don’t know them, the former is a veteran blogger who keeps a very cool and shrewd eye on the latest medical ‘breakthroughs’ and the latter a serial iconoclast of idols of scientific method. This is what Lehman wrote:

Ioannidis hits 8 on the Richter scale: … Bayes factors consistently quantify strength of evidence, p is valueless.

Since Ioannidis works at Stanford, which is located in the San Francisco Bay Area, he has every right to be interested in earthquakes but on looking up the paper in question, a faint tremor is the best that I can afford it. I shall now try and explain why, but before I do, it is only fair that I acknowledge the very generous, prompt and extensive help I have been given to understand the paper[1] in question by its two authors Don van Ravenzwaaij and Ioannidis himself. Continue reading

Categories: Bayesian/frequentist, Error Statistics, S. Senn | 18 Comments

The Fourth Bayesian, Fiducial and Frequentist Workshop (BFF4): Harvard U


May 1-3, 2017
Hilles Event Hall, 59 Shepard St. MA

The Department of Statistics is pleased to announce the 4th Bayesian, Fiducial and Frequentist Workshop (BFF4), to be held on May 1-3, 2017 at Harvard University. The BFF workshop series celebrates foundational thinking in statistics and inference under uncertainty. The three-day event will present talks, discussions and panels that feature statisticians and philosophers whose research interests synergize at the interface of their respective disciplines. Confirmed featured speakers include Sir David Cox and Stephen Stigler.

The program will open with a featured talk by Art Dempster and discussion by Glenn Shafer. The featured banquet speaker will be Stephen Stigler. Confirmed speakers include:

Featured Speakers and Discussants: Arthur Dempster (Harvard); Cynthia Dwork (Harvard); Andrew Gelman (Columbia); Ned Hall (Harvard); Deborah Mayo (Virginia Tech); Nancy Reid (Toronto); Susanna Rinard (Harvard); Christian Robert (Paris-Dauphine/Warwick); Teddy Seidenfeld (CMU); Glenn Shafer (Rutgers); Stephen Senn (LIH); Stephen Stigler (Chicago); Sandy Zabell (Northwestern)

Invited Speakers and Panelists: Jim Berger (Duke); Emery Brown (MIT/MGH); Larry Brown (Wharton); David Cox (Oxford; remote participation); Paul Edlefsen (Hutch); Don Fraser (Toronto); Ruobin Gong (Harvard); Jan Hannig (UNC); Alfred Hero (Michigan); Nils Hjort (Oslo); Pierre Jacob (Harvard); Keli Liu (Stanford); Regina Liu (Rutgers); Antonietta Mira (USI); Ryan Martin (NC State); Vijay Nair (Michigan); James Robins (Harvard); Daniel Roy (Toronto); Donald B. Rubin (Harvard); Peter XK Song (Michigan); Gunnar Taraldsen (NUST); Tyler VanderWeele (HSPH); Vladimir Vovk (London); Nanny Wermuth (Chalmers/Gutenberg); Min-ge Xie (Rutgers)

Continue reading

Categories: Announcement, Bayesian/frequentist | 2 Comments

Jerzy Neyman and “Les Miserables Citations” (statistical theater in honor of his birthday)


Neyman April 16, 1894 – August 5, 1981

For my final Jerzy Neyman item, here’s the post I wrote for his birthday last year: 

A local acting group is putting on a short theater production based on a screenplay I wrote:  “Les Miserables Citations” (“Those Miserable Quotes”) [1]. The “miserable” citations are those everyone loves to cite, from their early joint 1933 paper:

We are inclined to think that as far as a particular hypothesis is concerned, no test based upon the theory of probability can by itself provide any valuable evidence of the truth or falsehood of that hypothesis.

But we may look at the purpose of tests from another viewpoint. Without hoping to know whether each separate hypothesis is true or false, we may search for rules to govern our behavior with regard to them, in following which we insure that, in the long run of experience, we shall not be too often wrong. (Neyman and Pearson 1933, pp. 290-1).

In this early paper, Neyman and Pearson were still groping toward the basic concepts of tests–for example, “power” had yet to be coined. Taken out of context, these quotes have led to knee-jerk (behavioristic) interpretations which neither Neyman nor Pearson would have accepted. What was the real context of those passages? Well, the paper opens, just five paragraphs earlier, with a discussion of a debate between two French probabilists—Joseph Bertrand, author of “Calculus of Probabilities” (1907), and Emile Borel, author of “Le Hasard” (1914)! According to Neyman, what served “as an inspiration to Egon S. Pearson and myself in our effort to build a frequentist theory of testing hypotheses”(1977, p. 103) initially grew out of remarks of Borel, whose lectures Neyman had attended in Paris. He returns to the Bertrand-Borel debate in four different papers, and circles back to it often in his talks with his biographer, Constance Reid. His student Erich Lehmann (1993), regarded as the authority on Neyman, wrote an entire paper on the topic: “The Bertrand-Borel Debate and the Origins of the Neyman Pearson Theory”. Continue reading

Categories: E.S. Pearson, Neyman, Statistics | Leave a comment

Neyman: Distinguishing tests of statistical hypotheses and tests of significance might have been a lapse of someone’s pen


April 16, 1894 – August 5, 1981

I’ll continue to post Neyman-related items this week in honor of his birthday. This isn’t the only paper in which Neyman makes it clear he denies a distinction between a test of statistical hypotheses and significance tests. He and E. Pearson also discredit the myth that the former is only allowed to report pre-data, fixed error probabilities, and is justified only by dint of long-run error control. Controlling the “frequency of misdirected activities” in the midst of finding something out, or solving a problem of inquiry, on the other hand, is an epistemological goal. What do you think?

Tests of Statistical Hypotheses and Their Use in Studies of Natural Phenomena
by Jerzy Neyman

ABSTRACT. Contrary to ideas suggested by the title of the conference at which the present paper was presented, the author is not aware of a conceptual difference between a “test of a statistical hypothesis” and a “test of significance” and uses these terms interchangeably. A study of any serious substantive problem involves a sequence of incidents at which one is forced to pause and consider what to do next. In an effort to reduce the frequency of misdirected activities one uses statistical tests. The procedure is illustrated on two examples: (i) Le Cam’s (and associates’) study of immunotherapy of cancer and (ii) a socio-economic experiment relating to low-income homeownership problems.

I recommend, especially, the example on home ownership. Here are two snippets: Continue reading

Categories: Error Statistics, Neyman, Statistics | 2 Comments

A. Spanos: Jerzy Neyman and his Enduring Legacy

Today is Jerzy Neyman’s birthday. I’ll post various Neyman items this week in honor of it, starting with a guest post by Aris Spanos. Happy Birthday Neyman!

A. Spanos

A Statistical Model as a Chance Mechanism
Aris Spanos 

Jerzy Neyman (April 16, 1894 – August 5, 1981), was a Polish/American statistician[i] who spent most of his professional career at the University of California, Berkeley. Neyman is best known in statistics for his pioneering contributions in framing the Neyman-Pearson (N-P) optimal theory of hypothesis testing and his theory of Confidence Intervals. (This article was first posted here.)

Neyman: 16 April 1894 – 5 Aug 1981

One of Neyman’s most remarkable, but least recognized, achievements was his adapting of Fisher’s (1922) notion of a statistical model to render it pertinent for  non-random samples. Fisher’s original parametric statistical model Mθ(x) was based on the idea of ‘a hypothetical infinite population’, chosen so as to ensure that the observed data x0:=(x1,x2,…,xn) can be viewed as a ‘truly representative sample’ from that ‘population’:

“The postulate of randomness thus resolves itself into the question, ‘Of what population is this a random sample?’” (ibid., p. 313), underscoring that “the adequacy of our choice may be tested a posteriori.” (p. 314) Continue reading

Categories: Neyman, Spanos | Leave a comment
