Neyman: Distinguishing tests of statistical hypotheses and tests of significance might have been a lapse of someone’s pen


Neyman, drawn by ?

“Tests of Statistical Hypotheses and Their Use in Studies of Natural Phenomena” by Jerzy Neyman

ABSTRACT. Contrary to ideas suggested by the title of the conference at which the present paper was presented, the author is not aware of a conceptual difference between a “test of a statistical hypothesis” and a “test of significance” and uses these terms interchangeably. A study of any serious substantive problem involves a sequence of incidents at which one is forced to pause and consider what to do next. In an effort to reduce the frequency of misdirected activities one uses statistical tests. The procedure is illustrated on two examples: (i) Le Cam’s (and associates’) study of immunotherapy of cancer and (ii) a socio-economic experiment relating to low-income homeownership problems.

I hadn’t posted this paper of Neyman’s before, so here’s something for your weekend reading: “Tests of Statistical Hypotheses and Their Use in Studies of Natural Phenomena”. I recommend, especially, the example on home ownership. Here are two snippets:


The title of the present session involves an element that appears mysterious to me. This element is the apparent distinction between tests of statistical hypotheses, on the one hand, and tests of significance, on the other. If this is not a lapse of someone’s pen, then I hope to learn the conceptual distinction.

Particularly with reference to applied statistical work in a variety of domains of Science, my own thoughts of tests of significance, or EQUIVALENTLY of tests of statistical hypotheses, are that they are tools to reduce the frequency of errors.

(iv) A similar remark applies to the use of the words “decision” or “conclusion”. It seems to me that at our discussion, these particular words were used to designate only something like a final outcome of complicated analysis involving several tests of different hypotheses. In my own way of speaking, I do not hesitate to use the words “decision” or “conclusion” every time they come handy. For example, in the analysis of the follow-up data for the [home ownership] experiment, Mark Eudey and I started by considering the importance of bias in forming the experimental and control groups of families. As a result of the tests we applied, we decided to act on the assumption (or concluded) that the two groups are not random samples from the same population. Acting on this assumption (or having reached this conclusion), we sought for ways to analyze that data other than by comparing the experimental and the control groups. The analyses we performed led us to “conclude” or “decide” that the hypotheses tested could be rejected without excessive risk of error. In other words, after considering the probability of error (that is, after considering how frequently we would be in error if in conditions of our data we rejected the hypotheses tested), we decided to act on the assumption that “high” scores on “potential” and on “education” are indicative of better chances of success in the drive to home ownership. (750-1; the emphasis is Neyman’s)

To read the full (short) paper: Tests of Statistical Hypotheses and Their Use in Studies of Natural Phenomena.

Following Neyman, I’ve “decided” to use the terms ‘tests of hypotheses’ and ‘tests of significance’ interchangeably in my book.[1] Now it’s true that Neyman was more behavioristic than Pearson, and it’s also true that tests of statistical hypotheses or tests of significance need an explicit reformulation and statistical philosophy to explicate the role of error probabilities in inference. My way of providing this has been in terms of severe tests. However, in Neyman-Pearson applications, more than in their theory, you can find many examples as well. Recall Neyman’s paper, “The Problem of Inductive Inference” (Neyman 1955) wherein Neyman is talking to none other than the logical positivist philosopher of confirmation, Rudolf Carnap:

I am concerned with the term “degree of confirmation” introduced by Carnap.  …We have seen that the application of the locally best one-sided test to the data … failed to reject the hypothesis [that the n observations come from a source in which the null hypothesis is true].  The question is: does this result “confirm” the hypothesis that H0 is true of the particular data set? (Neyman, pp 40-41).

Neyman continues:

The answer … depends very much on the exact meaning given to the words “confirmation,” “confidence,” etc.  If one uses these words to describe one’s intuitive feeling of confidence in the hypothesis tested H0, then…. the attitude described is dangerous.… [T]he chance of detecting the presence [of discrepancy from the null], when only [n] observations are available, is extremely slim, even if [the discrepancy is present].  Therefore, the failure of the test to reject H0 cannot be reasonably considered as anything like a confirmation of H0.  The situation would have been radically different if the power function [corresponding to a discrepancy of interest] were, for example, greater than 0.95. (ibid.)

The general conclusion is that it is a little rash to base one’s intuitive confidence in a given hypothesis on the fact that a test failed to reject this hypothesis. A more cautious attitude would be to form one’s intuitive opinion only after studying the power function of the test applied.
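To make Neyman’s point about power concrete, here is a minimal sketch. It is not Neyman’s own example (his was a locally best one-sided test); it assumes a one-sided test of H0: µ ≤ 0 on n IID Normal observations with known σ, and the function name `power_one_sided` and the numbers plugged in are purely illustrative.

```python
from statistics import NormalDist


def power_one_sided(delta, sigma, n, alpha=0.05):
    """Power of the one-sided z-test of H0: mu <= 0 vs H1: mu > 0,
    evaluated at a true mean mu = delta (known sigma, n IID Normal draws)."""
    z = NormalDist()                      # standard Normal
    cutoff = z.inv_cdf(1 - alpha)         # reject when sqrt(n) * mean / sigma > cutoff
    shift = delta * n ** 0.5 / sigma      # mean of the test statistic when mu = delta
    return 1 - z.cdf(cutoff - shift)


# Hypothetical discrepancy of interest: 0.1 standard deviations.
low_power = power_one_sided(delta=0.1, sigma=1.0, n=10)
high_power = power_one_sided(delta=0.1, sigma=1.0, n=1500)
print(f"n=10:   power = {low_power:.3f}")
print(f"n=1500: power = {high_power:.3f}")
```

With the small sample the test would rarely detect a discrepancy of this size, so a failure to reject says little; only in the large-sample case is the power above the 0.95 mark Neyman mentions.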

Neyman, like Peirce, Popper and many others, holds that the only “logic” is deductive logic. “Confirmation” for Neyman is akin to Popperian “corroboration”–you could corroborate a hypothesis H only to the extent that it passed a severe test–one with a high probability of having found flaws in H, if they existed. Of course, Neyman puts this in terms of having high power to reject H, if H is false, and high probability of finding no evidence against H if true, but it’s the same idea. (Their weakness is in being predesignated error probabilities, but severity fixes this.) Unlike Popper, however, Neyman actually provides a methodology that can be shown to accomplish the task reliably.

Still, Fisher was correct to claim that Neyman is merely recording his preferred way of speaking. One could choose a different way. For example, Peirce defined induction as passing a severe test, and Popper said you could define it that way if you wanted to. But the main thing is that Neyman is attempting to distinguish the “inductive” or “evidence transcending” conclusions that statistics affords, on his approach [2] from assigning to hypotheses degrees of belief, probability, support, plausibility or the like.

De Finetti gets it right when he says that the expression “inductive behavior…that was for Neyman simply a slogan underlining and explaining the difference between his own, the Bayesian and the Fisherian formulations” became, with Wald’s work, “something much more substantial” (de Finetti 1972, p.176). De Finetti called this “the involuntarily destructive aspect of Wald’s work” (ibid.).

For related papers, see:

[1] That really is a decision though it’s based on evidence that doing so is in sync with what both Neyman and Pearson thought. There’s plenty of evidence, by the way, that Fisher is more behavioristic and less evidential than is Neyman, and certainly less than E. Pearson. I think this “he said/she said” route to understanding statistical methods is a huge mistake. I keep saying, “It’s the methods, stupid!”

[2] And, Neyman rightly assumed at first, from Fisher’s approach. Fisher’s loud rants, later on, that Neyman turned his tests into crude acceptance sampling affairs akin to Russian 5-year plans, and money-making goals of U.S. commercialism, all occurred after the break in 1935 which registered a conflict of egos, not statistical philosophies. Look up “anger management” on this blog.

Fisher is the arch anti-Bayesian, whereas Neyman experimented with using priors at the start. The problem wasn’t so much viewing parameters as random variables, but lacking knowledge of what their frequentist distributions could possibly be. Thus he sought methods whose validity held up regardless of priors.  Here E. Pearson was closer to Fisher, but unlike the two others, he was a really nice guy. (I hope everyone knows I’m talking of Egon here, not his mean daddy.) See chapter 11 of EGEK (1996):

[3] Who drew the picture of Neyman above? Anyone know?


de Finetti, B. 1972. Probability, Induction and Statistics: The Art of Guessing. Wiley.

Neyman, J. 1976. “Tests of Statistical Hypotheses and Their Use in Studies of Natural Phenomena.” Commun. Statist. Theor. Meth. A5(8), 737-751.


A. Spanos: Jerzy Neyman and his Enduring Legacy


A Statistical Model as a Chance Mechanism
Aris Spanos 

Today is the birthday of Jerzy Neyman (April 16, 1894 – August 5, 1981). Neyman was a Polish/American statistician[i] who spent most of his professional career at the University of California, Berkeley. Neyman is best known in statistics for his pioneering contributions in framing the Neyman-Pearson (N-P) optimal theory of hypothesis testing and his theory of Confidence Intervals. (This article was first posted here.)

Neyman: 16 April 1894 – 5 Aug 1981

One of Neyman’s most remarkable, but least recognized, achievements was his adapting of Fisher’s (1922) notion of a statistical model to render it pertinent for non-random samples. Fisher’s original parametric statistical model $M_\theta(x)$ was based on the idea of ‘a hypothetical infinite population’, chosen so as to ensure that the observed data $x_0 := (x_1, x_2, \ldots, x_n)$ can be viewed as a ‘truly representative sample’ from that ‘population’:


Fisher and Neyman

“The postulate of randomness thus resolves itself into the question, ‘Of what population is this a random sample?’” (ibid., p. 313), underscoring that “the adequacy of our choice may be tested a posteriori” (p. 314).

In cases where data $x_0$ come from sample surveys, or can be viewed as a typical realization of a random sample $X := (X_1, X_2, \ldots, X_n)$, i.e. Independent and Identically Distributed (IID) random variables, the ‘population’ metaphor can be helpful in adding some intuitive appeal to the inductive dimension of statistical inference, because one can imagine using a subset of a population (the sample) to draw inferences pertaining to the whole population.

This ‘infinite population’ metaphor, however, is of limited value in most applied disciplines relying on observational data. To see how inept this metaphor is, consider the question: what is the hypothetical ‘population’ when modeling the gyrations of stock market prices? More generally, what is observed in such cases is a certain on-going process and not a fixed population from which we can select a representative sample. For that very reason, most economists in the 1930s considered Fisher’s statistical modeling irrelevant for economic data!

Due primarily to Neyman’s experience with empirical modeling in a number of applied fields, including genetics, agriculture, epidemiology, biology, astronomy and economics, his notion of a statistical model evolved beyond Fisher’s ‘infinite populations’ in the 1930s into Neyman’s frequentist ‘chance mechanisms’ (see Neyman, 1950, 1952):

Guessing and then verifying the ‘chance mechanism’, the repeated operation of which produces the observed frequencies. This is a problem of ‘frequentist probability theory’. Occasionally, this step is labeled ‘model building’. Naturally, the guessed chance mechanism is hypothetical. (Neyman, 1977, p. 99)

From my perspective, this was a major step forward for several reasons, including the following.

First, the notion of a statistical model as a ‘chance mechanism’ extended the intended scope of statistical modeling to include dynamic phenomena that give rise to data from non-IID samples, i.e. data that exhibit both dependence and heterogeneity, like stock prices.

Second, the notion of a statistical model as a ‘chance mechanism’ is not only of metaphorical value, but it can be operationalized in the context of a statistical model, formalized by:

$M_\theta(x) = \{ f(x;\theta),\ \theta \in \Theta \},\quad x \in \mathbb{R}^n,\quad \Theta \subset \mathbb{R}^m,\quad m \ll n,$

where the distribution of the sample f(x;θ) describes the probabilistic assumptions of the statistical model. This takes the form of a statistical Generating Mechanism (GM), stemming from  f(x;θ), that can be used to generate simulated data on a computer. An example of such a Statistical GM is:

$X_t = \alpha_0 + \alpha_1 X_{t-1} + \sigma \varepsilon_t, \quad t = 1, 2, \ldots, n$

This indicates how one can use pseudo-random numbers for the error term $\varepsilon_t \sim \mathrm{NIID}(0,1)$ to simulate data for the Normal, AutoRegressive [AR(1)] Model. One can generate numerous sample realizations, say N=100000, of sample size n in nanoseconds on a PC.
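As a rough illustration of how such a statistical GM can be run on a computer, here is a minimal sketch of the AR(1) recursion using only Python’s standard library. The function name `simulate_ar1`, the starting value `x0`, and the parameter values are assumptions made for the example; the post does not specify them.

```python
import random


def simulate_ar1(n, alpha0, alpha1, sigma, x0=0.0, rng=None):
    """Generate one realization of X_t = alpha0 + alpha1 * X_{t-1} + sigma * eps_t,
    t = 1, ..., n, with eps_t ~ NIID(0, 1).

    x0 is an assumed initial value (not given in the post)."""
    rng = rng or random.Random()
    xs = []
    prev = x0
    for _ in range(n):
        # One step of the generating mechanism: deterministic part plus Normal shock.
        prev = alpha0 + alpha1 * prev + sigma * rng.gauss(0.0, 1.0)
        xs.append(prev)
    return xs


# Illustrative parameter values; the seed only makes the run reproducible.
rng = random.Random(12345)
sample = simulate_ar1(n=100, alpha0=0.5, alpha1=0.7, sigma=1.0, rng=rng)
print(len(sample), sample[:3])
```

Calling `simulate_ar1` repeatedly with the same `rng` yields as many independent sample realizations as one likes.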

J. Neyman and E. Pearson

Third, the notion of a statistical model as a ‘chance mechanism’ puts a totally different spin on another metaphor widely used by uninformed critics of frequentist inference. This is the ‘long-run’ metaphor associated with the relevant error probabilities used to calibrate frequentist inferences. The operationalization of the statistical GM reveals that the temporal aspect of this metaphor is totally irrelevant for frequentist inference; remember Keynes’s catch phrase “In the long run we are all dead”? Instead, what matters in practice is its repeatability in principle, not over time! For instance, one can use the above statistical GM to generate the empirical sampling distribution of any test statistic, and thus render operational not only the pre-data error probabilities, such as the type I and type II error probabilities and the power of a test, but also the post-data probabilities associated with the severity evaluation; see Mayo (1996).
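The following sketch shows one way this operationalization might look. It repeatedly runs an AR(1) generating mechanism and records how often a test rejects, which approximates the type I error probability (when the null is true) and the power (when it is false). The choice of test is an assumption for illustration: the null is α1 = 0, and the statistic is √n times the lag-1 sample autocorrelation, compared with the approximate Normal cutoff 1.96. All function names and parameter values are hypothetical.

```python
import random


def simulate_ar1(n, alpha0, alpha1, sigma, rng):
    """One realization of X_t = alpha0 + alpha1 * X_{t-1} + sigma * eps_t (start at 0)."""
    xs, prev = [], 0.0
    for _ in range(n):
        prev = alpha0 + alpha1 * prev + sigma * rng.gauss(0.0, 1.0)
        xs.append(prev)
    return xs


def lag1_statistic(xs):
    """sqrt(n) times the lag-1 sample autocorrelation; roughly N(0, 1) when alpha1 = 0."""
    n = len(xs)
    mean = sum(xs) / n
    num = sum((xs[t] - mean) * (xs[t - 1] - mean) for t in range(1, n))
    den = sum((x - mean) ** 2 for x in xs)
    return n ** 0.5 * num / den


def rejection_rate(alpha1, n=100, reps=2000, cutoff=1.96, seed=0):
    """Relative frequency of |statistic| > cutoff over reps simulated samples."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(reps):
        xs = simulate_ar1(n, 0.0, alpha1, 1.0, rng)
        if abs(lag1_statistic(xs)) > cutoff:
            rejections += 1
    return rejections / reps


type_one = rejection_rate(alpha1=0.0)   # null true: approximates the type I error
power = rejection_rate(alpha1=0.5)      # null false: approximates the power at alpha1 = 0.5
print(f"empirical type I error: {type_one:.3f}, empirical power: {power:.3f}")
```

Nothing here unfolds “over time”: the error probabilities are relative frequencies over hypothetical repetitions of the same mechanism, all generated at once.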



For further discussion on the above issues see:

Spanos, A. (2012), “A Frequentist Interpretation of Probability for Model-Based Inductive Inference,” Synthese.

Fisher, R. A. (1922), “On the mathematical foundations of theoretical statistics,” Philosophical Transactions of the Royal Society A, 222: 309-368.

Mayo, D. G. (1996), Error and the Growth of Experimental Knowledge, The University of Chicago Press, Chicago.

Neyman, J. (1950), First Course in Probability and Statistics, Henry Holt, NY.

Neyman, J. (1952), Lectures and Conferences on Mathematical Statistics and Probability, 2nd ed. U.S. Department of Agriculture, Washington.

Neyman, J. (1977), “Frequentist Probability and Frequentist Statistics,” Synthese, 36, 97-131.

[i] He was born in an area that was part of Russia.


Philosophy of Statistics Comes to the Big Apple! APS 2015 Annual Convention — NYC

Start Spreading the News…..



 The Philosophy of Statistics: Bayesianism, Frequentism and the Nature of Inference,
2015 APS Annual Convention
Saturday, May 23  
2:00 PM- 3:50 PM in Wilder

(Marriott Marquis 1535 B’way)





Andrew Gelman

Professor of Statistics & Political Science
Columbia University



Stephen Senn

Head of Competence Center
for Methodology and Statistics (CCMS)

Luxembourg Institute of Health



D.G. Mayo, Philosopher



Richard Morey, Session Chair & Discussant

Senior Lecturer
School of Psychology
Cardiff University

Heads I win, tails you lose? Meehl and many Popperians get this wrong (about severe tests)!


bending of starlight.

“[T]he impressive thing about [the 1919 tests of Einstein’s theory of gravity] is the risk involved in a prediction of this kind. If observation shows that the predicted effect is definitely absent, then the theory is simply refuted. The theory is incompatible with certain possible results of observation—in fact with results which everybody before Einstein would have expected. This is quite different from the situation I have previously described, [where it] was practically impossible to describe any human behavior that might not be claimed to be a verification of these [psychological] theories.” (Popper, CR, p. 36)


Popper lauds Einstein’s General Theory of Relativity (GTR) as sticking its neck out, bravely being ready to admit its falsity were the deflection effect not found. The truth is that even if no deflection effect had been found in the 1919 experiments, it would have been blamed on the sheer difficulty in discerning so small an effect (the results that were found were quite imprecise). This would have been entirely correct! Yet many Popperians, perhaps Popper himself, get this wrong.[i] Listen to Popperian Paul Meehl (with whom I generally agree).

“The stipulation beforehand that one will be pleased about substantive theory T when the numerical results come out as forecast, but will not necessarily abandon it when they do not, seems on the face of it to be about as blatant a violation of the Popperian commandment as you could commit. For the investigator, in a way, is doing…what astrologers and Marxists and psychoanalysts allegedly do, playing heads I win, tails you lose.” (Meehl 1978, 821)

No, there is a confusion of logic. A successful result may rightly be taken as evidence for a real effect H, even though failing to find the effect need not be taken to refute the effect, or even as evidence against H. This makes perfect sense if one keeps in mind that a test might have had little chance to detect the effect, even if it existed. The point really reflects the asymmetry of falsification and corroboration. Popperian Alan Chalmers, whose book What is This Thing Called Science? (1999) had at first criticized severity on this point, added an appendix to one of its chapters once I made my case. [i]

For example, one of the sets of eclipse plates from Sobral (the controversial astrographic plates) was so blurred by a change of focus in the telescope that it precluded any decent estimate of the standard error. If all the eclipse results had been like that, the astronomers would have announced that no deflection had been found. But this would not constitute evidence that the deflection effect didn’t exist, much less that GTR was false. Even if the deflection effect exists, the probability of failing to detect it with the crude 1919 instruments is high. So discerning no effect would not be evidence of no effect. To think otherwise would be an example of what we may call a fallacy of negative or non-significant results. The eclipse tests, not just those of 1919, but all eclipse tests of the deflection effect, failed to give very precise results. Nothing like a stringent estimate of λ emerged until the field was rescued by radioastronomical data from quasars in the 1960s.

If one wants to go through the gymnastics of how the severity requirement cashes this out: Let H assert the Einstein deflection effect, and not-H, the Einstein effect is absent, or it is smaller than the predicted amount.[ii] Now to have evidence against H here is to have evidence for not-H. So we can just apply the severity criterion to not-H and see what happens. The observed failure to detect H is in accordance with not-H, so the first severity requirement holds, but there’s a high probability of this occurring even if H is true (not-H is false). So there’s poor evidence for not-H, on severity grounds.

By contrast, once instruments were available to powerfully detect any deflection effects, a no-show would have to be taken against its existence, and thus against GTR.

Popperian requirements are upheld: you are not free to readily interpret any result as consistent with H, much less as counting in favor of the Einstein prediction. However, failure to find data in accord with prediction H isn’t evidence against H if such a no show is easy to explain even if the predicted effect holds. (See also this post on Popper and pseudoscience)

See also chapter 8 of EGEK: Severe Tests and Novel Evidence.

[i] Alan Chalmers at first had this kind of case as a criticism of my account of severity in his “What is this thing called science?” After arguing my case, he did the very rare thing of amending his book before publication. It came as an Appendix.

Appendix: Happy meetings of theory and experiment. Many agree that the merit of a theory is demonstrated by the extent to which it survives severe tests. However, there is a wide class of cases of confirmation in science that do not fit readily into this picture, unless great care is taken in characterising severity of tests. The cases I have in mind involve significant matches between theory and observation in circumstances where a lack of match would not tell against the theory.

[Several examples follow.]

One common kind of situation in science involves making a novel prediction from a theory in conjunction with some complicated and perhaps dubious auxiliary assumptions. [If the theory] is not confirmed, the problem could as well lie with the auxiliary hypotheses as with the theory. Consequently, it might appear that testing the prediction did not constitute a severe test of the theory. ….

Deborah Mayo’s characterisation of severity is able to accommodate these examples. She will ask whether the confirmation would have been likely to occur if the theory were false. Both in the case of my Copernican example and the dislocations example the answer is that they would be very unlikely to occur [were the theories false]…..Mayo’s conception of severity is in line with scientific practice. (Chalmers 1999, 210-212).

I heartily recommend Chalmers’ introductory text!



[ii] The famous 1919 eclipse expeditions purported to test Einstein’s new account of gravity against the long-reigning Newtonian theory. According to Einstein’s theory of gravitation, to an observer on earth, light passing near the sun is deflected by an angle, λ, reaching its maximum of 1.75″ for light just grazing the sun, but light deflection would be undetectable on earth with the instruments available in 1919. Although the light deflection of stars near the sun (approximately 1 second of arc) would be detectable, the sun’s glare renders such stars invisible, save during a total eclipse, which “by strange good fortune” would occur on May 29, 1919 (Eddington [1920] 1987, 113).

There were three hypotheses that “it was especially desirable to discriminate between” (Dyson et al. 1923, 291). Each is a statement about a parameter, the deflection of light at the limb of the sun, λ (in arc seconds): λ = 0 (no deflection); λ = 0.87″ (Newton); λ = 1.75″ (Einstein). The Newtonian predicted deflection stems from assuming light has mass and follows Newton’s law of gravity.


Chalmers, A. 1999. What is This Thing Called Science? 3rd edition. Hackett.

Dyson, F. W., A. S. Eddington, and C. Davidson. 1923. “A Determination of the Deflection of Light by the Sun’s Gravitational Field, from Observations Made at the Total Eclipse of May 29, 1919.” Memoirs of the Royal Astronomical Society LXII (1917-1923): 291–333.

Eddington, Arthur. 1987. Space, Time and Gravitation: An Outline of the General Relativity Theory. Cambridge Science Classics Series. Cambridge: Cambridge University Press.

Mayo, Deborah. 2010. “Learning from Error: The Theoretical Significance of Experimental Knowledge.” Edited by Kent Staley. The Modern Schoolman 87 (Experimental and Theoretical Knowledge) (The Ninth Henle Conference in the History of Philosophy) (May): 191–217.

Meehl, Paul. 1978. “Theoretical Risks and Tabular Asterisks: Sir Karl, Sir Ronald, and the Slow Progress of Soft Psychology.” Journal of Consulting and Clinical Psychology 46: 806-834.

Popper, Karl. 1962. Conjectures and Refutations: The Growth of Scientific Knowledge. New York: Basic Books.




Joan Clarke, Turing, I.J. Good, and “that after-dinner comedy hour…”

I finally saw The Imitation Game about Alan Turing and code-breaking at Bletchley Park during WWII. This short clip of Joan Clarke, who was engaged to Turing, includes my late colleague I.J. Good at the end (he’s not second as the clip lists him). Good used to talk a great deal about Bletchley Park and his code-breaking feats while asleep there (see note[a]), but I never imagined Turing’s code-breaking machine (which, by the way, was called the Bombe and not Christopher as in the movie) was so clunky. The movie itself has two tiny scenes including Good. Below I reblog: “Who is Allowed to Cheat?”—one of the topics he and I debated over the years. Links to the full “Savage Forum” (1962) may be found at the end (creaky, but better than nothing.)

[a] “Some sensitive or important Enigma messages were enciphered twice, once in a special variation cipher and again in the normal cipher. …Good dreamed one night that the process had been reversed: normal cipher first, special cipher second. When he woke up he tried his theory on an unbroken message – and promptly broke it.” This, and further examples, may be found in this obituary.

[b] Pictures comparing the movie cast and the real people may be found here.


Who is allowed to cheat? I.J. Good and that after dinner comedy hour….

I. J. Good

It was from my Virginia Tech colleague I.J. Good (in statistics), who died six years ago (April 5, 2009), at 93, that I learned most of what I call “howlers” on this blog. His favorites were based on the “paradoxes” of stopping rules. (I had posted this last year here.)

“In conversation I have emphasized to other statisticians, starting in 1950, that, in virtue of the ‘law of the iterated logarithm,’ by optional stopping an arbitrarily high sigmage, and therefore an arbitrarily small tail-area probability, can be attained even when the null hypothesis is true. In other words if a Fisherian is prepared to use optional stopping (which usually he is not) he can be sure of rejecting a true null hypothesis provided that he is prepared to go on sampling for a long time. The way I usually express this ‘paradox’ is that a Fisherian [but not a Bayesian] can cheat by pretending he has a plane to catch like a gambler who leaves the table when he is ahead” (Good 1983, 135) [*]

This paper came from a conference where we both presented, and he was extremely critical of my error statistical defense on this point. (I was like a year out of grad school, and he a University Distinguished Professor.) 

One time, years later, after hearing Jack give this howler for the nth time, “a Fisherian [but not a Bayesian] can cheat, etc.,” I was driving him to his office, and suddenly blurted out what I really thought:

“You know Jack, as many times as I have heard you tell this, I’ve always been baffled as to its lesson about who is allowed to cheat. Error statisticians require the overall and not the ‘computed’ significance level be reported. To us, what would be cheating would be reporting the significance level you got after trying and trying again in just the same way as if the test had a fixed sample size. True, we are forced to fret about how stopping rules alter the error probabilities of tests, while the Bayesian is free to ignore them, but why isn’t the real lesson that the Bayesian is allowed to cheat?” (A published version of my remark may be found in EGEK p. 351: “As often as my distinguished colleague presents this point…”)

 To my surprise, or actually shock, after pondering this a bit, Jack said something like, “Hmm, I never thought of it this way.”

By the way, the story of the “after dinner Bayesian comedy hour” on this blog did not allude to Jack but to someone who gave a much more embellished version. Since it’s Saturday night, let’s once again listen in to the comedy hour that unfolded at my dinner table at an academic conference:

 Did you hear the one about the researcher who gets a phone call from the guy analyzing his data? First the guy congratulates him and says, “The results show a statistically significant difference at the .05 level—p-value .048.” But then, an hour later, the phone rings again. It’s the same guy, but now he’s apologizing. It turns out that the experimenter intended to keep sampling until the result was 1.96 standard deviations away from the 0 null—in either direction—so they had to reanalyze the data (n=169), and the results were no longer statistically significant at the .05 level. 

Much laughter.

So the researcher is tearing his hair out when the same guy calls back again. “Congratulations!” the guy says. “I just found out that the experimenter actually had planned to take n=169 all along, so the results are statistically significant.”

 Howls of laughter.

 But then the guy calls back with the bad news . . .

It turns out that failing to score a sufficiently impressive effect after n’ trials, the experimenter went on to n” trials, and so on and so forth until finally, say, on trial number 169, he obtained a result 1.96 standard deviations from the null.

It continues this way, and every time the guy calls in and reports a shift in the p-value, the table erupts in howls of laughter! From everyone except me, sitting in stunned silence, staring straight ahead. The hilarity ensues from the idea that the experimenter’s reported psychological intentions about when to stop sampling are altering the statistical results.

The allegation that letting stopping plans matter to the interpretation of data is tantamount to letting psychological intentions matter may be called the argument from intentions. When stopping rules matter, however, we are looking not at “intentions” but at real alterations to the probative capacity of the test, as picked up by a change in the test’s corresponding error probabilities. The analogous problem occurs if there is a fixed null hypothesis and the experimenter is allowed to search for maximally likely alternative hypotheses (Mayo and Kruse 2001; Cox and Hinkley 1974). Much the same issue is operating in what physicists call the look-elsewhere effect (LEE), which arose in the context of “bump hunting” in the Higgs results.

The optional stopping effect often appears in illustrations of how error statistics violates the Likelihood Principle (LP), alluding to a two-sided test from a Normal distribution:

$X_i \sim N(\mu, \sigma^2)$ and we test $H_0: \mu = 0$ vs. $H_1: \mu \neq 0$.

The stopping rule might take the form:

Keep sampling until $|m| \geq 1.96\,\sigma/\sqrt{n}$,

with m the sample mean. When n is fixed the type 1 error probability is .05, but with this stopping rule the actual significance level may differ from, and will be greater than, .05. In fact, ignoring the stopping rule allows a high or maximal probability of error. For a sampling theorist, this example alone “taken in the context of examining consistency with θ = 0, is enough to refute the strong likelihood principle” (Cox 1977, p. 54), since, with probability 1, it will stop with a “nominally” significant result even though θ = 0 (here, µ = 0). As Birnbaum (1969, 128) puts it, “the likelihood concept cannot be construed so as to allow useful appraisal, and thereby possible control, of probabilities of erroneous interpretations.” From the error-statistical standpoint, ignoring the stopping rule allows readily inferring that there is evidence for a non-null hypothesis even though it has passed with low if not minimal severity.
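A small simulation can make the inflation of the error probability visible. The sketch below assumes µ = 0 and σ = 1 (so the null is true), checks the stopping rule after every observation, and caps the experiment at a hypothetical maximum of 100 observations, since a real “try and try again” experimenter cannot sample forever. The function names and the cap are assumptions for illustration only.

```python
import random


def stops_with_rejection(max_n, rng, cutoff=1.96):
    """Sample N(0, 1) data one at a time; return True as soon as
    |mean| >= cutoff / sqrt(n) for some n <= max_n (a 'nominal' rejection)."""
    total = 0.0
    for n in range(1, max_n + 1):
        total += rng.gauss(0.0, 1.0)
        if abs(total / n) >= cutoff / n ** 0.5:
            return True
    return False


def fixed_n_rejects(n, rng, cutoff=1.96):
    """Ordinary fixed-sample test: look once, after exactly n observations."""
    total = sum(rng.gauss(0.0, 1.0) for _ in range(n))
    return abs(total / n) >= cutoff / n ** 0.5


def rate(trial, reps=2000, seed=1):
    """Relative frequency with which trial(rng) reports a rejection."""
    rng = random.Random(seed)
    return sum(trial(rng) for _ in range(reps)) / reps


optional = rate(lambda rng: stops_with_rejection(100, rng))
fixed = rate(lambda rng: fixed_n_rejects(100, rng))
print(f"fixed n=100: {fixed:.3f}   optional stopping (up to n=100): {optional:.3f}")
```

The fixed-sample rate stays near the nominal .05, while the optional-stopping rate is several times larger even with this modest cap; raising the cap pushes it higher still, in line with the “probability 1” result cited above.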

Peter Armitage, in his comments on Savage at the 1959 forum (“Savage Forum” 1962), put it thus:

I think it is quite clear that likelihood ratios, and therefore posterior probabilities, do not depend on a stopping rule. . . . I feel that if a man deliberately stopped an investigation when he had departed sufficiently far from his particular hypothesis, then “Thou shalt be misled if thou dost not know that.” If so, prior probability methods seem to appear in a less attractive light than frequency methods, where one can take into account the method of sampling. (Savage 1962, 72; emphasis added; see [ii])

H is not being put to a stringent test when a researcher allows trying and trying again until the data are far enough from H0 to reject it in favor of H.

Stopping Rule Principle

Picking up on the effect appears evanescent—locked in someone’s head—if one has no way of taking error probabilities into account:

In general, suppose that you collect data of any kind whatsoever — not necessarily Bernoullian, nor identically distributed, nor independent of each other . . . — stopping only when the data thus far collected satisfy some criterion of a sort that is sure to be satisfied sooner or later, then the import of the sequence of n data actually observed will be exactly the same as it would be had you planned to take exactly n observations in the first place. (Edwards, Lindman, and Savage 1963, 238-239)

This is called the irrelevance of the stopping rule or the Stopping Rule Principle (SRP), and is an implication of the (strong) likelihood principle (LP), which is taken up elsewhere in this blog.[i]

To the holder of the LP, the intuition is that the stopping rule is irrelevant; to the error statistician the stopping rule is quite relevant because the probability that the persistent experimenter finds data against the no-difference null is increased, even if the null is true. It alters the well-testedness of claims inferred. (Error #11 of Mayo and Spanos 2011 “Error Statistics“.)

A Funny Thing Happened at the Savage Forum[i]

While Savage says he was always uncomfortable with the argument from intentions, he is reminding Barnard of the argument that Barnard promoted years before. He’s saying, in effect, Don’t you remember, George? You’re the one who so convincingly urged in 1952 that to take stopping rules into account is like taking psychological intentions into account:

The argument then was this: The design of a sequential experiment is, in the last analysis, what the experimenter actually intended to do. His intention is locked up inside his head. (Savage 1962, 76)

But, alas, Barnard had changed his mind. Still, the argument from intentions is repeated again and again by Bayesians. Howson and Urbach think it entails dire conclusions for significance tests:

A significance test inference, therefore, depends not only on the outcome that a trial produced, but also on the outcomes that it could have produced but did not.  And the latter are determined by certain private intentions of the experimenter, embodying his stopping rule.  It seems to us that this fact precludes a significance test delivering any kind of judgment about empirical support. . . . For scientists would not normally regard such personal intentions as proper influences on the support which data give to a hypothesis. (Howson and Urbach 1993, 212)

It is fallacious to insinuate that regarding optional stopping as relevant is in effect to make private intentions relevant. Although the choice of stopping rule (as with other test specifications) is determined by the intentions of the experimenter, it does not follow that taking account of its influence is to take account of subjective intentions. The allegation is a non sequitur.

We often hear things like:

[I]t seems very strange that a frequentist could not analyze a given set of data, such as (x1,…, xn) [in Armitage’s example] if the stopping rule is not given. . . . [D]ata should be able to speak for itself. (Berger and Wolpert 1988, 78)

But data do not speak for themselves, unless sufficient information is included to correctly appraise relevant error probabilities. The error statistician has a perfectly nonpsychological way of accounting for the impact of stopping rules, as well as other aspects of experimental plans. The impact is on the stringency or severity of the test that the purported “real effect” has passed. In the optional stopping plan, there is a difference in the set of possible outcomes; certain outcomes available in the fixed sample size plan are no longer available. If a stopping rule is truly open-ended (it need not be), then the possible outcomes do not contain any that fail to reject the null hypothesis. (The above rule stops in a finite number of trials; it is “proper”.)

Does the difference in error probabilities corresponding to a difference in sampling plans correspond to any real difference in the experiment? Yes. The researchers really did do something different in the try-and-try-again scheme and, as Armitage says, thou shalt be misled if your account cannot report this.

We have banished the argument from intentions, the allegation that letting stopping plans matter to the interpretation of data is tantamount to letting psychological intentions matter. So if you’re at my dinner table, can I count on you not to rehearse this one…?

One last thing….

 The Optional Stopping Effect with Bayesian (Two-sided) Confidence Intervals

The equivalent stopping rule can be framed in terms of the corresponding 95% “confidence interval” method, given the normal distribution above (their term and quotes):

Keep sampling until the 95% confidence interval excludes 0.

Berger and Wolpert concede that using this stopping rule “has thus succeeded in getting the [Bayesian] conditionalist to perceive that μ ≠ 0, and has done so honestly” (pp. 80-81). This seems to be a striking admission, especially as the Bayesian interval assigns a probability of .95 to the truth of the interval estimate (using a “noninformative prior density”):

µ = m ± 1.96(σ/√n)

But, they maintain (or did back then) that the LP only “seems to allow the experimenter to mislead a Bayesian. The ‘misleading,’ however, is solely from a frequentist viewpoint, and will not be of concern to a conditionalist.” Does this mean that while the real error probabilities are poor, Bayesians are not impacted, since, from the perspective of what they believe, there is no misleading?
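The claim that the interval version of the stopping rule is "equivalent" to the test version can be checked mechanically. The sketch below is mine, with hypothetical function names, and assumes a known σ and the usual interval m ± 1.96(σ/√n); I treat the interval as open at its endpoints so that the boundary case matches the "≥" in the test rule.

```python
import math


def ci95(m, n, sigma=1.0, z=1.96):
    """Two-sided 95% interval m +/- z*sigma/sqrt(n) for a known sigma."""
    half_width = z * sigma / math.sqrt(n)
    return (m - half_width, m + half_width)


def z_rule_stops(m, n, sigma=1.0, z=1.96):
    """Test version of the rule: stop when |m| >= z*sigma/sqrt(n)."""
    return abs(m) >= z * sigma / math.sqrt(n)


def interval_rule_stops(m, n, sigma=1.0, z=1.96):
    """Interval version: stop when 0 is not strictly inside the interval."""
    lower, upper = ci95(m, n, sigma, z)
    return not (lower < 0.0 < upper)


if __name__ == "__main__":
    # Compare the two rules on a grid of sample means and sample sizes.
    for n in (1, 4, 25, 100):
        for k in range(-30, 31):
            m = k / 10
            assert z_rule_stops(m, n) == interval_rule_stops(m, n)
    print("the two stopping rules agree on every grid point")
```

So whatever one says about the error probabilities of the test under optional stopping carries over, unchanged, to the interval that is guaranteed to exclude 0.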

[*] It was because of these “conversations” that Jack thought his name should be included in the “Jeffreys-Lindley paradox”, so I always call it the Jeffreys-Good-Lindley paradox. I discuss this in EGEK 1996, Chapter 10, and in Mayo and Kruse (2001). See a recent paper by my colleague Aris Spanos (2013) on the Jeffreys-Lindley paradox.

[i] There are certain exceptions where the stopping rule may be “informative”. Many other posts may be found on LP violations, and an informal version of my critique of Birnbaum’s LP argument. On optional stopping, see also Irony and Bad Faith. For my latest, and final (I hope) post on the (strong) likelihood principle, see the post with the link to my paper with discussion in Statistical Science.

Link to complete discussion: 

Mayo, Deborah G. On the Birnbaum Argument for the Strong Likelihood Principle (with discussion & rejoinder). Statistical Science 29 (2014), no. 2, 227-266.


[ii] I found, on an old webpage of mine, (a pale copy of) the “Savage forum”:


Armitage, P. (1962), “Discussion”, in The Foundations of Statistical Inference: A Discussion, (G. A. Barnard and D. R. Cox eds.), London: Methuen, 72.

Berger J. O. and Wolpert, R. L. (1988), The Likelihood Principle: A Review, Generalizations, and Statistical Implications 2nd edition, Lecture Notes-Monograph Series, Vol. 6, Shanti S. Gupta, Series Editor, Hayward, California: Institute of Mathematical Statistics.

Birnbaum, A. (1969), “Concepts of Statistical Evidence” In Philosophy, Science, and Method: Essays in Honor of Ernest Nagel, S. Morgenbesser, P. Suppes, and M. White (eds.): New York: St. Martin’s Press, 112-43.

Cox, D. R. (1977), “The Role of Significance Tests (with discussion)”, Scandinavian Journal of Statistics 4, 49–70.

Cox, D. R. and D. V. Hinkley (1974), Theoretical Statistics, London: Chapman & Hall.

Edwards, W., H. Lindman, and L. Savage (1963), “Bayesian Statistical Inference for Psychological Research”, Psychological Review 70, 193-242.

Good, I. J. (1983), Good Thinking: The Foundations of Probability and its Applications, Minnesota.

Howson, C., and P. Urbach (1993[1989]), Scientific Reasoning: The Bayesian Approach, 2nd ed., La Salle: Open Court.

Mayo, D. (1996), [EGEK] Error and the Growth of Experimental Knowledge, Chapter 10: “Why You Cannot Be Just a Little Bayesian”. Chicago.

Mayo, D. G. and Kruse, M. (2001), “Principles of Inference and Their Consequences,” in D. Corfield and J. Williamson (eds.) Foundations of Bayesianism. Dordrecht: Kluwer Academic Publishers: 381-403.

Savage, L. (1962), “Discussion”, in The Foundations of Statistical Inference: A Discussion, (G. A. Barnard and D. R. Cox eds.), London: Methuen, 76.

Spanos, A. “Who Should Be Afraid of the Jeffreys-Lindley Paradox?” Philosophy of Science, 80 (2013): 73-93.

Categories: Bayesian/frequentist, optional stopping, Statistics, strong likelihood principle | 6 Comments

Are scientists really ready for ‘retraction offsets’ to advance ‘aggregate reproducibility’? (let alone ‘precautionary withdrawals’)



Given recent evidence of the irreproducibility of a surprising number of published scientific findings, the White House’s Office of Science and Technology Policy (OSTP) sought ideas for “leveraging its role as a significant funder of scientific research to most effectively address the problem”, and announced funding for projects to “reset the self-corrective process of scientific inquiry”. (First noted in this post.)

I was sent some information this morning with a rather long description of the project that received the top government award thus far (and it’s in the millions). I haven’t had time to read the proposal*, which I’ll link to shortly, but for a clear and quick description, you can read the excerpt of an interview of the OSTP representative by the editor of the Newsletter for Innovation in Science Journals (Working Group), Jim Stein, who took the lead in writing the author check list for Nature.

Stein’s queries are in burgundy, OSTP’s are in blue. Occasional comments from me are in black, which I’ll update once I study the fine print of the proposal itself.

Your office has been inundated with proposals since announcing substantial funding for projects aimed at promoting the self-corrective process of scientific inquiry, and ensuring aggregate scientific reproducibility. I understand the top funded endeavor has been approved by over 40 stakeholders, including editors of leading science journals, science agencies, patient-advocacy groups, on line working groups, post publication reviewers, and researchers themselves. Can you describe this funded initiative and what it will mean for scientists?

OSTP:  Not all the details are in place, but the idea is quite simple. We will start with a pilot program to control aggregate retractions of articles due to problematic or irreproducible results across the sciences. In Phase 1, each journal is given the rights to anywhere from 8-18 retractions every 4 years.

I believe that Nature had around 14 retractions total in barely two years in 2013-14. What do they do when they’ve come close to the capped limit?

OSTP: They can purchase retraction rights from a journal that hasn’t used up its limit. In other words, journals with solid publications and low retractions can sell their retraction rights (or “offsets”) to journals throughout the period. We think this is a sound way to ensure aggregate replicability and instill researcher responsibility.

So this is something like trading carbon emissions?

OSTP: Some aspects are redolent of those enterprises, and we have leading economists on the pilot Phase 1, but the outcome is entirely in the control of the journal, authors and reviewers. Emissions are largely necessary consequences of industrial activities, retractions are not.


Oh my, and I had thrown out an idea like this facetiously. Here’s a picture that just came to mind of a retraction offset certificate:


Mayo’s depiction of what they might have in mind….

 I don’t mean to poke fun at this effort. It’s a dramatic step, and I certainly endorse the last couple of ideas below.

OSTP: As an alternative to purchasing retraction rights from another journal, an editor may seek to withdraw articles as a precautionary measure… [with some algorithm as yet to be worked out]. Given the low rates of reproducibility (for instance, scientists at Amgen found that “scientific findings were confirmed in only 11% cases”), we think something like 30%-50% precautionary withdrawals might be the norm, at least at the outset.

Can papers accepted for publication be withdrawn by journal editors even without grounds that reach the level of retraction, or perhaps are even problem-free? How will authors cope with this uncertainty?

OSTP: According to the new rules, if papers are accepted and approved in final form, authors grant withdrawal rights to journals for a probationary period of 4 years, alongside other rights they currently bestow publishers. Accepted papers are cross-validated against holdout data, and other means, by independent monitoring groups. Once the paper is written and accepted, an independent group uses the holdout data set to verify the claim or not and add an appendix to the paper. Failed replication is not grounds for retraction, but a journal might seek precautionary withdrawal of such a paper. On-line sources, post-publication peer review, for example, could also provide sources of information leading to a PW [precautionary withdrawal].

Precautionary withdrawal is an entirely new concept for all of us. Might a paper be withdrawn by a journal even lacking grounds to suspect the integrity of the work?

OSTP: Strictly speaking yes, but only during the (4 year) probationary period. Let me be clear, editors from leading journals are fully behind this, they themselves told the NIH that having a holdout data set is the key. The nice thing about it is that papers subject to precautionary withdrawal are not subject to the stigma often associated with retractions, whether for honest error, flawed information, incomplete methodology, or scientific misconduct.

On the other hand, a retraction, under this new plan, in serious cases, could well lead researchers and institutions to be required to refund federal funds involved. Impact factors are also likely to be diminished following retractions but not following withdrawals [with an algorithm to be developed by members of Retraction Watch].[See Time for a Retraction Penalty? i]

Are there carrots as well as sticks?

OSTP: Quite a few. For one thing, journals will be rewarded for publishing articles critically appraising failed methodologies, biases, and dead-ends, what didn’t work and why? ‘How not to go about research of a given type’ will be a newly funded area in NSF, NIH and other agencies and societies. This kind of “meta” research will be a new field in which young scientists are well placed to achieve respectable publication credentials. We’re really excited about this.

Yes, journals have often been loath to publish failed results and negative data; hence the file-drawer problem.

OSTP: They will now, because the number of retraction offsets a journal is granted goes up dramatically with the percentage of critical meta-research. There will be other perks to incentivize this critical meta-research as well. We’re not talking merely of negative results mind you, but substantive and systematic analyses of deceptive results and a demonstration of how they’ve misled subsequent research, led to flawed clinical trials (and even deaths), and created obstacles to cumulative knowledge.

Ah, the meta-research on failed attempts is a great idea! (I will see if I can apply to contribute to this. Seriously.)

OSTP: The truth is, we currently have whole fields that have been created around results that have been retracted!

Nobody knew?

OSTP: Apparently not, and nobody bothered to check before “building” on piles of sand [see i]. Now the truth will out, but in a positive, constructive, truth-seeking manner. In this connection, another idea in the works is to incentivize the public to improve science and recover wasted federal funds by allowing a percentage of recovered funds to be paid to those who first provide on-line evidence of scientific misconduct leading to retraction (by whatever explicit definition).

This is really going to shake up the system.

OSTP: Frankly, we are left with no option. Of course this is just one of several pilot projects to be explored through government funding over several years.

Is there a general name for the new project?

OSTP: The word “paradigm” is overworked, so we have adopted “framework”; a new Framework to Ensure Aggregate Reproducibility.

Given that there are a few more sticks than carrots, this seems an apt name: Framework to Ensure Aggregate Reproducibility (FEAR)?  Was it on purpose, do you suppose?


*I’m about to board a plane, and won’t get to update this for several hours. 

[i] The footnote in the interview was to a paper by Adam Marcus and Ivan Oransky (from Retraction Watch), in Labtimes (03/2014) Time for a Retraction Penalty? “There’s evidence, in fact, that Cell, Nature and Science would suffer the most from such penalties, since journals with high impact factors tend to have higher rates of retraction, as Arturo Casadevall and Ferric Fang showed in a 2011 paper in Infection and Immunity…Journals might also get points for raising awareness of their retractions, in the hope that authors wouldn’t continue to cite such papers as if they’d never been withdrawn – an alarming phenomenon that John Budd and colleagues have quantified and that seems to echo the 1930s U.S. Works Progress Administration employees, being paid to build something that another crew is paid to tear down. After all, if those citations don’t count toward the impact factor, journals wouldn’t have an incentive to let them slide”.


April 2 addition: Check date!

Many people wrote that they were utterly convinced for most of the day—but my 4/1 posts are always inclined to be almost true, true in part, and they may well even become true. I’m still rooting for critical meta-research. You’re still free to comment on why the FEAR would or would not work.



Categories: junk science, reproducibility, science communication, Statistics | 9 Comments

Your (very own) personalized genomic prediction varies depending on who else was around?


personalized medicine roulette

As if I wasn’t skeptical enough about personalized predictions based on genomic signatures, Jeff Leek recently had a surprising post about “A surprisingly tricky issue when using genomic signatures for personalized medicine”. Leek (on his blog Simply Statistics) writes:

My student Prasad Patil has a really nice paper that just came out in Bioinformatics (preprint in case paywalled). The paper is about a surprisingly tricky normalization issue with genomic signatures. Genomic signatures are basically statistical/machine learning functions applied to the measurements for a set of genes to predict how long patients will survive, or how they will respond to therapy. The issue is that usually when building and applying these signatures, people normalize across samples in the training and testing set.

….it turns out that this one simple normalization problem can dramatically change the results of the predictions. In particular, we show that the predictions for the same patient, with the exact same data, can change dramatically if you just change the subpopulations of patients within the testing set.
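A toy sketch may help show why normalizing across the test set can change one patient's prediction. Everything in it is made up for illustration: the one-gene "signature", the centering step, the threshold, and the numbers are my assumptions and are not the procedure or data from the Patil et al. paper.

```python
from statistics import mean


def predict_with_test_set_centering(patient_value, test_set, threshold=0.0):
    """Hypothetical one-gene signature: center the patient's measurement on
    the mean of whatever test set it arrives with, then threshold it."""
    centered = patient_value - mean(test_set)
    return "high risk" if centered > threshold else "low risk"


# The same patient, with exactly the same measurement, in two test sets.
patient = 5.0
among_low_expressers = [patient, 2.0, 3.0, 2.5]
among_high_expressers = [patient, 8.0, 7.5, 9.0]

if __name__ == "__main__":
    print(predict_with_test_set_centering(patient, among_low_expressers))
    print(predict_with_test_set_centering(patient, among_high_expressers))
```

The patient's own data never change; only the company they keep in the test set does, and with it the centered value that the signature sees.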

Here’s an extract from the paper, “Test set bias affects reproducibility of gene signatures”:

Test set bias is a failure of reproducibility of a genomic signature. In other words, the same patient, with the same data and classification algorithm, may be assigned to different clinical groups. A similar failing resulted in the cancellation of clinical trials that used an irreproducible genomic signature to make chemotherapy decisions (Letter (2011)).

This is a reference to the Anil Potti case:

Letter, T. C. (2011). Duke Accepts Potti Resignation; Retraction Process Initiated with Nature Medicine.

But far from the Potti case being some particularly problematic example (see here and here), at least with respect to test set bias, this article makes it appear that test set bias is a threat to be expected much more generally. Going back to the abstract of the paper: Continue reading

Categories: Anil Potti, personalized medicine, Statistics | 10 Comments


3 years ago…

MONTHLY MEMORY LANE: 3 years ago: March 2012. I mark in red three posts that seem most apt for general background on key issues in this blog. (Posts that are part of a “unit” or a group of “U-Phils” count as one.) This new feature, appearing the last week of each month, began at the blog’s 3-year anniversary in Sept, 2014.

Since the 3/14 and 3/18 posts on objectivity (part of a 5-part unit on objectivity) were recently reblogged, they are marked in burgundy, not bright red. The 3/18 comment on Barnard and Copas includes two items: A paper by Barnard and Copas, which happens to cite Stephen Senn twice, and a note by Aris Spanos pertaining to that paper.

March 2012

Categories: 3-year memory lane | Leave a comment

Objectivity in Statistics: “Arguments From Discretion and 3 Reactions”

dirty hands

We constantly hear that procedures of inference are inescapably subjective because of the latitude of human judgment as it bears on the collection, modeling, and interpretation of data. But this is seriously equivocal: Being the product of a human subject is hardly the same as being subjective, at least not in the sense we are speaking of—that is, as a threat to objective knowledge. Are all these arguments about the allegedly inevitable subjectivity of statistical methodology rooted in equivocations? I argue that they are! [This post combines this one and this one, as part of our monthly “3 years ago” memory lane.]

“Argument from Discretion” (dirty hands)

Insofar as humans conduct science and draw inferences, it is obvious that human judgments and human measurements are involved. True enough, but too trivial an observation to help us distinguish among the different ways judgments should enter, and how, nevertheless, to avoid introducing bias and unwarranted inferences. The issue is not that a human is doing the measuring, but whether we can reliably use the thing being measured to find out about the world.

Remember the dirty-hands argument? In the early days of this blog (e.g., October 13, 16), I deliberately took up this argument as it arises in evidence-based policy because it offered a certain clarity that I knew we would need to come back to in considering general “arguments from discretion”. To abbreviate:

  1. Numerous human judgments go into specifying experiments, tests, and models.
  2. Because there is latitude and discretion in these specifications, they are “subjective.”
  3. Whether data are taken as evidence for a statistical hypothesis or model depends on these subjective methodological choices.
  4. Therefore, statistical inference and modeling is invariably subjective, if only in part.

We can spot the fallacy in the argument much as we did in the dirty hands argument about evidence-based policy. It is true, for example, that a very insensitive test for detecting a positive discrepancy d’ from a 0 null has low probability of finding statistical significance even if a discrepancy as large as d’ exists. But that doesn’t prevent us from determining, objectively, that an insignificant difference from that test fails to warrant inferring evidence of a discrepancy less than d’.
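To make "insensitive" concrete, here is a small sketch of the probability that a test finds statistical significance when a discrepancy d’ really exists. The setup is my assumption for illustration, not something specified in the post: a one-sided z-test of a 0 null with known σ, with the hypothetical function name `power_one_sided_z`.

```python
import math
from statistics import NormalDist


def power_one_sided_z(d, n, sigma=1.0, alpha=0.05):
    """Probability that a one-sided z-test of H0: mu <= 0 (known sigma,
    n observations) rejects at level alpha when the true mean is d."""
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha)  # cutoff for the z statistic
    # Under mu = d the z statistic is normal with mean d*sqrt(n)/sigma, sd 1.
    return 1 - std_normal.cdf(z_alpha - d * math.sqrt(n) / sigma)


if __name__ == "__main__":
    for n in (4, 25, 400):
        print(n, power_one_sided_z(0.2, n))
```

Whoever chose the small sample, and for whatever reason, the resulting power against d’ is a fact about the test that anyone can compute; with low power, a non-significant result is just what we would expect even if the discrepancy were as large as d’.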

Test specifications may well be a matter of  personal interest and bias, but, given the choices made, whether or not an inference is warranted is not a matter of personal interest and bias. Setting up a test with low power against d’ might be a product of your desire not to find an effect for economic reasons, of insufficient funds to collect a larger sample, or of the inadvertent choice of a bureaucrat. Or ethical concerns may have entered. But none of this precludes our critical evaluation of what the resulting data do and do not indicate (about the question of interest). The critical task need not itself be a matter of economics, ethics, or what have you. Critical scrutiny of evidence reflects an interest all right—an interest in not being misled, an interest in finding out what the case is, and others of an epistemic nature. Continue reading

Categories: Objectivity, Statistics | 6 Comments

Stephen Senn: The pathetic P-value (Guest Post)

S. Senn

Stephen Senn
Head of Competence Center for Methodology and Statistics (CCMS)
Luxembourg Institute of Health

The pathetic P-value

This is the way the story is now often told. RA Fisher is the villain. Scientists were virtuously treading the Bayesian path, when along came Fisher and gave them P-values, which they gladly accepted, because they could get ‘significance’ so much more easily. Nearly a century of corrupt science followed but now there are signs that there is a willingness to return to the path of virtue and having abandoned this horrible Fisherian complication:

We shall not cease from exploration
And the end of all our exploring
Will be to arrive where we started …

A condition of complete simplicity..

And all shall be well and
All manner of thing shall be well

TS Eliot, Little Gidding

Consider, for example, distinguished scientist David Colquhoun citing the excellent scientific journalist Robert Matthews as follows

“There is an element of truth in the conclusion of a perspicacious journalist:

‘The plain fact is that 70 years ago Ronald Fisher gave scientists a mathematical machine for turning baloney into breakthroughs, and flukes into funding. It is time to pull the plug. ‘

Robert Matthews Sunday Telegraph, 13 September 1998.” [1]

However, this is not a plain fact but just plain wrong. Even if P-values were the guilty ‘mathematical machine’ they are portrayed to be, it is not RA Fisher’s fault. Putting the historical record right helps one to understand the issues better. As I shall argue, at the heart of this is not a disagreement between Bayesian and frequentist approaches but between two Bayesian approaches: it is a conflict to do with the choice of prior distributions[2].

Fisher did not persuade scientists to calculate P-values rather than Bayesian posterior probabilities; he persuaded them that the probabilities that they were already calculating and interpreting as posterior probabilities relied for this interpretation on a doubtful assumption. He proposed to replace this interpretation with one that did not rely on the assumption. Continue reading

Categories: P-values, S. Senn, statistical tests, Statistics | 147 Comments

All She Wrote (so far): Error Statistics Philosophy: 3.5 years on


metablog old fashion typewriter

D.G. Mayo with typewriter

Error Statistics Philosophy: Blog Contents (3.5 years)
By: D. G. Mayo [i]

September 2011

October 2011

Continue reading

Categories: blog contents, Metablog, Statistics | 1 Comment

A puzzle about the latest test ban (or ‘don’t ask, don’t tell’)



A large number of people have sent me articles on the “test ban” of statistical hypotheses tests and confidence intervals at a journal called Basic and Applied Social Psychology (BASP)[i]. Enough. One person suggested that since it came so close to my recent satirical Task force post, I either had advance knowledge or some kind of ESP. Oh please, no ESP required. None of this is the slightest bit surprising, and I’ve seen it before; I simply didn’t find it worth blogging about. Statistical tests are being banned, say the editors, because they purport to give probabilities of null hypotheses (really?) and do not, hence they are “invalid”.[ii] (Confidence intervals are thrown in the waste bin as well, also claimed “invalid”.) “The state of the art remains uncertain” regarding inferential statistical procedures, say the editors. I don’t know, maybe some good will come of all this.

Yet there’s a part of their proposal that brings up some interesting logical puzzles, and logical puzzles are my thing. In fact, I think there is a mistake the editors should remedy, lest authors be led into disingenuous stances, and strange tangles ensue. I refer to their rule that authors be allowed to submit papers whose conclusions are based on allegedly invalid methods so long as, once accepted, they remove any vestiges of them!

Question 1. Will manuscripts with p-values be desk rejected automatically?

Answer to Question 1. No. If manuscripts pass the preliminary inspection, they will be sent out for review. But prior to publication, authors will have to remove all vestiges of the NHSTP (p-values, t-values, F-values, statements about “significant” differences or lack thereof, and so on).

Now if these measures are alleged to be irrelevant and invalid instruments for statistical inference, then why should they be included in the peer review process at all? Will reviewers be told to ignore them? That would only seem fair: papers should not be judged by criteria alleged to be invalid, but how will reviewers blind themselves to them? It would seem the measures should be excluded from the get-go. If they are included in the review, why shouldn’t the readers see what the reviewers saw when they recommended acceptance?

But here’s where the puzzle really sets in. If the authors must free their final papers from such encumbrances as sampling distributions and error probabilities, what will be the basis presented for their conclusions in the published paper? Presumably, from the notice, they are allowed only mere descriptive statistics or non-objective Bayesian reports (added: actually can’t tell which kind of Bayesianism they allow, given the Fisher reference which doesn’t fit*). Won’t this be tantamount to requiring authors support their research in a way that is either (actually) invalid, or has little to do with the error statistical properties that were actually reported and on which the acceptance was based?[ii] Continue reading

Categories: P-values, reforming the reformers, Statistics | 72 Comments

“Probabilism as an Obstacle to Statistical Fraud-Busting”

Boston Colloquium 2013-2014


“Is the Philosophy of Probabilism an Obstacle to Statistical Fraud Busting?” was my presentation at the 2014 Boston Colloquium for the Philosophy of Science: “Revisiting the Foundations of Statistics in the Era of Big Data: Scaling Up to Meet the Challenge.”

As often happens, I never put these slides into a stand-alone paper. But I have incorporated them into my book (in progress*), “How to Tell What’s True About Statistical Inference”. Background and slides were posted last year.

Slides (draft from Feb 21, 2014) 

Download the 54th Annual Program

Cosponsored by the Department of Mathematics & Statistics at Boston University.

Friday, February 21, 2014
10 a.m. – 5:30 p.m.
Photonics Center, 9th Floor Colloquium Room (Rm 906)
8 St. Mary’s Street

*Seeing a light at the end of the tunnel, finally.
Categories: P-values, significance tests, Statistical fraudbusting, Statistics | 7 Comments

Big Data Is The New Phrenology?




It happens I’ve been reading a lot lately about the assumption in social psychology and psychology in general that what they’re studying is measurable, quantifiable. Addressing the problem has been shelved to the back burner for decades thanks to some redefinitions of what it is to “measure” in psych (anything for which there’s a rule to pop out a number, says Stevens, an operationalist in the naive positivist spirit). This at any rate is what I’m reading, thanks to papers sent by a colleague of Meehl’s (N. Waller). (Here’s one by Mitchell.) I think it’s time to reopen the question. The measures I see of “severity of moral judgment”, “degree of self-esteem” and much else in psychology appear to fall into this behavior in a very non-self-critical manner. No statistical window-dressing (nor banning of statistical inference) can help them become more scientific. So when I saw this on Math Babe’s twitter I decided to try the “reblog” function and see what happened. Here it is (with her F word included). (The article to which she alludes is “Recruiting Better Talent Through Brain Games”.)

Originally posted on mathbabe:

Have you ever heard of phrenology? It was, once upon a time, the “science” of measuring someone’s skull to understand their intellectual capabilities.

This sounds totally idiotic but was a huge fucking deal in the mid-1800’s, and really didn’t stop getting some credit until much later. I know that because I happen to own the 1911 edition of the Encyclopedia Britannica, which was written by the top scholars of the time but is now horribly and fascinatingly outdated.

For example, the entry for “Negro” is famously racist. Wikipedia has an excerpt: “Mentally the negro is inferior to the white… the arrest or even deterioration of mental development [after adolescence] is no doubt very largely due to the fact that after puberty sexual matters take the first place in the negro’s life and thoughts.”

But really that one line doesn’t tell the whole story. Here’s the whole thing…

View original 351 more words

Categories: msc kvetch, scientism, Statistics | 3 Comments


3 years ago...


MONTHLY MEMORY LANE: 3 years ago: February 2012. I am to mark in red three posts (or units) that seem most apt for general background on key issues in this blog. Given our Fisher reblogs, we’ve already seen many this month. So, I’m marking in red (1) The Triad, and (2) the Unit on Spanos’ misspecification tests. Please see those posts for their discussion. The two posts from 2/8 are apt if you are interested in a famous case involving statistics at the Supreme Court. Beyond that it’s just my funny theatre of the absurd piece with Barnard. (Gelman’s is just a link to his blog.)


February 2012


  • (2/11) R.A. Fisher: Statistical Methods and Scientific Inference
  • (2/11)  JERZY NEYMAN: Note on an Article by Sir Ronald Fisher
  • (2/12) E.S. Pearson: Statistical Concepts in Their Relation to Reality





This new, once-a-month, feature began at the blog’s 3-year anniversary in Sept, 2014.


Jan. 2012

Dec. 2011

Nov. 2011

Oct. 2011

Sept. 2011 (Within “All She Wrote (so far)”)

Categories: 3-year memory lane, Statistics | 1 Comment

Sir Harold Jeffreys’ (tail area) one-liner: Saturday night comedy (b)



This headliner appeared before, but to a sparse audience, so Management’s giving him another chance… His joke relates to both Senn’s post (about alternatives), and to my recent post about using (1 – β)/α as a likelihood ratio--but for very different reasons. (I’ve explained at the bottom of this “(b) draft”.)

 ….If you look closely, you’ll see that it’s actually not Jay Leno who is standing up there at the mike, (especially as he’s no longer doing the Tonight Show) ….



It’s Sir Harold Jeffreys himself! And his (very famous) joke, I admit, is funny. So, since it’s Saturday night, let’s listen in on Sir Harold’s howler joke* in criticizing the use of p-values.

“Did you hear the one about significance testers rejecting H0 because of outcomes H0 didn’t predict?

‘What’s unusual about that?’ you ask?

What’s unusual is that they do it when these unpredicted outcomes haven’t even occurred!”

Much laughter.

[The actual quote from Jeffreys: Using p-values implies that “An hypothesis that may be true is rejected because it has failed to predict observable results that have not occurred. This seems a remarkable procedure.” (Jeffreys 1939, 316)]

I say it’s funny, so to see why I’ll strive to give it a generous interpretation. Continue reading

Categories: Comedy, Discussion continued, Fisher, Jeffreys, P-values, Statistics, Stephen Senn | 5 Comments

Stephen Senn: Fisher’s Alternative to the Alternative


As part of the week of recognizing R.A.Fisher (February 17, 1890 – July 29, 1962), I reblog Senn from 3 years ago.  

‘Fisher’s alternative to the alternative’

By: Stephen Senn

[2012 marked] the 50th anniversary of RA Fisher’s death. It is a good excuse, I think, to draw attention to an aspect of his philosophy of significance testing. In his extremely interesting essay on Fisher, Jimmie Savage drew attention to a problem in Fisher’s approach to testing. In describing Fisher’s aversion to power functions Savage writes, ‘Fisher says that some tests are more sensitive than others, and I cannot help suspecting that that comes to very much the same thing as thinking about the power function.’ (Savage 1976) (P473).

The modern statistician, however, has an advantage here denied to Savage. Savage’s essay was published posthumously in 1976 and the lecture on which it was based was given in Detroit on 29 December 1971 (P441). At that time Fisher’s scientific correspondence did not form part of his available oeuvre but in 1990 Henry Bennett’s magnificent edition of Fisher’s statistical correspondence (Bennett 1990) was published and this throws light on many aspects of Fisher’s thought including on significance tests.


The key letter here is Fisher’s reply of 6 October 1938 to Chester Bliss’s letter of 13 September. Bliss himself had reported an issue that had been raised with him by Snedecor on 6 September. Snedecor had pointed out that an analysis using inverse sine transformations of some data that Bliss had worked on gave a different result to an analysis of the original values. Bliss had defended his (transformed) analysis on the grounds that a) if a transformation always gave the same result as an analysis of the original data there would be no point and b) an analysis on inverse sines was a sort of weighted analysis of percentages with the transformation more appropriately reflecting the weight of information in each sample. Bliss wanted to know what Fisher thought of his reply.

Fisher replies with a ‘shorter catechism’ on transformations which ends as follows: Continue reading

Categories: Fisher, Statistics, Stephen Senn | 59 Comments

R. A. Fisher: How an Outsider Revolutionized Statistics (Aris Spanos)



In recognition of R.A. Fisher’s birthday….

‘R. A. Fisher: How an Outsider Revolutionized Statistics’

by Aris Spanos

Few statisticians will dispute that R. A. Fisher (February 17, 1890 – July 29, 1962) is the father of modern statistics; see Savage (1976), Rao (1992). Inspired by William Gosset’s (1908) paper on the Student’s t finite sampling distribution, he recast statistics into the modern model-based induction in a series of papers in the early 1920s. He put forward a theory of optimal estimation based on the method of maximum likelihood that has changed only marginally over the last century. His significance testing, spearheaded by the p-value, provided the basis for the Neyman-Pearson theory of optimal testing in the early 1930s. According to Hald (1998)

“Fisher was a genius who almost single-handedly created the foundations for modern statistical science, without detailed study of his predecessors. When young he was ignorant not only of the Continental contributions but even of contemporary publications in English.” (p. 738)

What is not so well known is that Fisher was the ultimate outsider when he brought about this change of paradigms in statistical science. As an undergraduate, he studied mathematics at Cambridge, and then did graduate work in statistical mechanics and quantum theory. His meager knowledge of statistics came from his study of astronomy; see Box (1978). That, however, did not stop him from publishing his first paper in statistics in 1912 (still an undergraduate) on “curve fitting”, questioning Karl Pearson’s method of moments and proposing a new method that was eventually to become the likelihood method in his 1921 paper. Continue reading

Categories: Fisher, phil/history of stat, Spanos, Statistics | 6 Comments

R.A. Fisher: ‘Two New Properties of Mathematical Likelihood’: Just before breaking up (with N-P)

17 February 1890–29 July 1962

In recognition of R.A. Fisher’s birthday tomorrow, I will post several entries on him. I find this (1934) paper intriguing: it appeared immediately before the conflicts with Neyman and Pearson erupted. It represents essentially the last time he could take their work at face value, without the professional animosities that almost entirely caused, rather than being caused by, the apparent philosophical disagreements and name-calling everyone focuses on. Fisher links his tests and sufficiency to the Neyman and Pearson lemma in terms of power. It’s as if we may see them as ending up in a very similar place (no pun intended) while starting from different origins. I quote just the most relevant portions…the full article is linked below. I’d blogged it earlier here. You may find some gems in it.

‘Two new Properties of Mathematical Likelihood’

by R.A. Fisher, F.R.S.

Proceedings of the Royal Society, Series A, 144: 285-307 (1934)

  The property that where a sufficient statistic exists, the likelihood, apart from a factor independent of the parameter to be estimated, is a function only of the parameter and the sufficient statistic, explains the principal result obtained by Neyman and Pearson in discussing the efficacy of tests of significance.  Neyman and Pearson introduce the notion that any chosen test of a hypothesis H0 is more powerful than any other equivalent test, with regard to an alternative hypothesis H1, when it rejects H0 in a set of samples having an assigned aggregate frequency ε when H0 is true, and the greatest possible aggregate frequency when H1 is true.

If any group of samples can be found within the region of rejection whose probability of occurrence on the hypothesis H1 is less than that of any other group of samples outside the region, but is not less on the hypothesis H0, then the test can evidently be made more powerful by substituting the one group for the other. Continue reading
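Fisher's substitution argument can be checked by brute force on a small discrete sample space. The sketch below is only an illustration under assumptions that are not in Fisher's paper: a binomial model with n = 10, H0: p = 0.5, H1: p = 0.7, and the rejection region {8, 9, 10} picked out by the likelihood ratio. It enumerates every possible rejection region whose frequency under H0 does not exceed that of the likelihood-ratio region and confirms that none is more powerful.

```python
from itertools import combinations
from math import comb

# Illustrative setup (assumed, not from Fisher's text): X ~ Binomial(10, p).
n, p0, p1 = 10, 0.5, 0.7
outcomes = range(n + 1)
pmf0 = [comb(n, x) * p0**x * (1 - p0) ** (n - x) for x in outcomes]
pmf1 = [comb(n, x) * p1**x * (1 - p1) ** (n - x) for x in outcomes]

def size(region):
    """Aggregate frequency of rejection when H0 is true."""
    return sum(pmf0[x] for x in region)

def power(region):
    """Aggregate frequency of rejection when H1 is true."""
    return sum(pmf1[x] for x in region)

# The likelihood ratio pmf1/pmf0 increases with x, so the region it
# selects is an upper tail; here we take the three largest outcomes.
lr_region = (8, 9, 10)
alpha = size(lr_region)

# Every subset of outcomes whose size does not exceed alpha
# (a tiny tolerance guards against floating-point rounding).
candidates = (
    region
    for k in range(n + 2)
    for region in combinations(outcomes, k)
    if size(region) <= alpha + 1e-12
)
best = max(candidates, key=power)

print("size of likelihood-ratio region:", alpha)
print("most powerful region of at most that size:", best)
print("its power:", power(best))
```

Regions such as (0, 1, 8) have exactly the same frequency under H0 as (8, 9, 10), but swapping the outcomes 9 and 10 for 0 and 1 lowers the frequency under H1, which is the kind of exchange Fisher describes.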

Categories: Fisher, phil/history of stat, Statistics | 3 Comments

Continuing the discussion on truncation, Bayesian convergence and testing of priors



My post “What’s wrong with taking (1 – β)/α, as a likelihood ratio comparing H0 and H1?” gave rise to a set of comments that were mostly off topic but interesting in their own right. Since the thread had grown too long to follow, I put what appears to be the last group of comments here, starting with Matloff’s query. Please feel free to continue the discussion here; we may want to come back to the topic. Feb 17: Please note one additional voice at the end. (Check back to that post if you want to see the history.)


I see the conversation is continuing. I have not had time to follow it, but I do have a related question, on which I’d be curious as to the response of the Bayesians in our midst here.

Say the analyst is sure that μ > c, and chooses a prior distribution with support on (c,∞). That guarantees that the resulting estimate is > c. But suppose the analyst is wrong, and μ is actually less than c. (I believe that some here conceded this could happen in some cases in which the analyst is “sure” μ > c.) Doesn’t this violate one of the most cherished (by Bayesians) features of the Bayesian method — that the effect of the prior washes out as the sample size n goes to infinity?


(to Matloff),

The short answer is that assuming information such as “mu is greater than c” which isn’t true screws up the analysis. It’s like a mathematician starting a proof by saying “assume 3 is an even number”. If it were possible to consistently get good results from false assumptions, there would be no need to ever get our assumptions right. Continue reading
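To make the exchange concrete, here is a minimal numerical sketch of Matloff's scenario. Everything specific in it is an assumption chosen for illustration, not something stated by the commenters: normal data with known σ = 1, a flat prior on (c, ∞) with c = 0, and a sample mean fixed at -0.5 (standing in for a true μ below c). The function name `posterior_mean_truncated` is hypothetical. The posterior is then a normal centred at the sample mean, truncated to (c, ∞), and its mean is computed on a grid.

```python
import math

def posterior_mean_truncated(xbar, n, c, sigma=1.0, grid_points=20001):
    """Posterior mean of mu for N(mu, sigma^2) data with sample mean xbar,
    under a flat prior restricted to (c, infinity)."""
    s = sigma / math.sqrt(n)             # posterior scale before truncation
    upper = max(c, xbar) + 10.0 * s      # mass beyond this point is negligible
    h = (upper - c) / (grid_points - 1)
    grid = [c + k * h for k in range(grid_points)]
    # Unnormalised log posterior; subtracting the maximum avoids underflow.
    logw = [-0.5 * ((mu - xbar) / s) ** 2 for mu in grid]
    top = max(logw)
    w = [math.exp(lw - top) for lw in logw]
    w[0] *= 0.5                          # trapezoid rule end weights
    w[-1] *= 0.5
    return sum(mu * wi for mu, wi in zip(grid, w)) / sum(w)

# The analyst is "sure" that mu > 0, but the data centre on -0.5.
for n in (10, 100, 1000, 10000):
    print(n, posterior_mean_truncated(xbar=-0.5, n=n, c=0.0))
```

Under these assumptions the posterior mean shrinks toward c from above as n grows, but it can never cross below c, so the effect of this prior does not wash out in the sense Matloff asks about.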

Categories: Discussion continued, Statistics | 60 Comments
