“Tests of Statistical Hypotheses and Their Use in Studies of Natural Phenomena” by Jerzy Neyman

ABSTRACT. Contrary to ideas suggested by the title of the conference at which the present paper was presented, the author is not aware of a conceptual difference between a “test of a statistical hypothesis” and a “test of significance” and uses these terms interchangeably. A study of any serious substantive problem involves a sequence of incidents at which one is forced to pause and consider what to do next. In an effort to reduce the frequency of misdirected activities one uses statistical tests. The procedure is illustrated on two examples: (i) Le Cam’s (and associates’) study of immunotherapy of cancer and (ii) a socio-economic experiment relating to low-income homeownership problems.

I hadn’t posted this paper of Neyman’s before, so here’s something for your weekend reading: “Tests of Statistical Hypotheses and Their Use in Studies of Natural Phenomena.” I recommend, especially, the example on home ownership. Here are two snippets:

1. INTRODUCTION

The title of the present session involves an element that appears mysterious to me. This element is the apparent distinction between tests of statistical hypotheses, on the one hand, and tests of significance, on the other. If this is not a lapse of someone’s pen, then I hope to learn the conceptual distinction.

Particularly with reference to applied statistical work in a variety of domains of Science, my own thoughts of tests of significance, or EQUIVALENTLY of tests of statistical hypotheses, are that they are tools to reduce the frequency of errors.…

(iv) A similar remark applies to the use of the words “decision” or “conclusion”. It seems to me that at our discussion, these particular words were used to designate only something like a final outcome of complicated analysis involving several tests of different hypotheses. In my own way of speaking, I do not hesitate to use the words “decision” or “conclusion” every time they come handy. For example, in the analysis of the follow-up data for the [home ownership] experiment, Mark Eudey and I started by considering the importance of bias in forming the experimental and control groups of families. As a result of the tests we applied, we decided to act on the assumption (or concluded) that the two groups are not random samples from the same population. Acting on this assumption (or having reached this conclusion), we sought for ways to analyze the data other than by comparing the experimental and the control groups. The analyses we performed led us to “conclude” or “decide” that the hypotheses tested could be rejected without excessive risk of error. In other words, after considering the probability of error (that is, after considering how frequently we would be in error if in conditions of our data we rejected the hypotheses tested), we decided to act on the assumption that “high” scores on “potential” and on “education” are indicative of better chances of success in the drive to home ownership. (750-1; the emphasis is Neyman’s)

To read the full (short) paper: Tests of Statistical Hypotheses and Their Use in Studies of Natural Phenomena.

Following Neyman, I’ve “decided” to use the terms ‘tests of hypotheses’ and ‘tests of significance’ interchangeably in my book.[1] Now it’s true that Neyman was more behavioristic than Pearson, and it’s also true that tests of statistical hypotheses or tests of significance need an explicit reformulation and statistical philosophy to explicate the role of error probabilities in inference. My way of providing this has been in terms of severe tests. However, in Neyman-Pearson applications, more than in their theory, you can find many examples as well. Recall Neyman’s paper, “The Problem of Inductive Inference” (Neyman 1955) wherein Neyman is talking to none other than the logical positivist philosopher of confirmation, Rudolf Carnap:

I am concerned with the term “degree of confirmation” introduced by Carnap. …We have seen that the application of the locally best one-sided test to the data … failed to reject the hypothesis [that the n observations come from a source in which the null hypothesis is true]. The question is: does this result “confirm” the hypothesis that H_{0} is true of the particular data set? (Neyman, pp. 40-41)

Neyman continues:

The answer … depends very much on the exact meaning given to the words “confirmation,” “confidence,” etc. If one uses these words to describe one’s intuitive feeling of confidence in the hypothesis tested H_{0}, then…. the attitude described is dangerous.… [T]he chance of detecting the presence [of discrepancy from the null], when only [n] observations are available, is extremely slim, even if [the discrepancy is present]. Therefore, the failure of the test to reject H_{0} cannot be reasonably considered as anything like a confirmation of H_{0}. The situation would have been radically different if the power function [corresponding to a discrepancy of interest] were, for example, greater than 0.95. (ibid.)

The general conclusion is that it is a little rash to base one’s intuitive confidence in a given hypothesis on the fact that a test failed to reject this hypothesis. A more cautious attitude would be to form one’s intuitive opinion only after studying the power function of the test applied.
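Neyman’s point about power can be made concrete with a minimal sketch (my own illustration, not from his paper; the one-sided z-test, the 0.3σ discrepancy, and the sample sizes are assumptions chosen for the example): with few observations the chance of detecting a modest discrepancy is slim, so a non-rejection confirms little, whereas with power above 0.95 the situation is, as Neyman says, radically different.

```python
from statistics import NormalDist

def power_one_sided_z(delta, n, alpha=0.05):
    """Power of a one-sided z-test of H0: mu = 0 vs H1: mu > 0,
    against a true standardized discrepancy delta (in sigma units),
    with n observations and significance level alpha."""
    z_crit = NormalDist().inv_cdf(1 - alpha)     # fixed rejection cutoff
    shift = delta * n ** 0.5                     # mean of z under the alternative
    return 1 - NormalDist().cdf(z_crit - shift)  # P(reject H0 | discrepancy delta)

# Slim power: failing to reject H0 here is no "confirmation" of H0.
print(power_one_sided_z(delta=0.3, n=5))
# High power: now a non-rejection carries real weight.
print(power_one_sided_z(delta=0.3, n=150))
```

On this toy model the first power is well under 0.2 while the second exceeds 0.95, reproducing Neyman’s contrast between the two situations.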

Neyman, like Peirce, Popper and many others, held that the only “logic” is deductive logic. “Confirmation” for Neyman is akin to Popperian “corroboration”: you could corroborate a hypothesis *H* only to the extent that it passed a severe test, one with a high probability of having found flaws in *H*, if they existed. Of course, Neyman puts this in terms of having high power to reject *H* if *H* is false, and high probability of finding no evidence against *H* if it is true, but it’s the same idea. (Their weakness is in being predesignated error probabilities, but severity fixes this.) Unlike Popper, however, Neyman actually provides a methodology that can be shown to accomplish the task reliably.

Still, Fisher was correct to claim that Neyman is merely recording his preferred way of speaking. One could choose a different way. For example, Peirce defined induction as passing a severe test, and Popper said you could define it that way if you wanted to. But the main thing is that Neyman is attempting to distinguish the “inductive” or “evidence transcending” conclusions that statistics affords, on his approach,[2] from assigning to hypotheses degrees of belief, probability, support, plausibility or the like.

De Finetti gets it right when he says that the expression “inductive behavior…that was for Neyman simply a slogan underlining and explaining the difference between his own, the Bayesian and the Fisherian formulations” became, with Wald’s work, “something much more substantial” (de Finetti 1972, p.176). De Finetti called this “the involuntarily destructive aspect of Wald’s work” (ibid.).

For related papers, see:

- Mayo, D.G. and Cox, D.R. (2006). “Frequentist Statistics as a Theory of Inductive Inference,” in *Optimality: The Second Erich L. Lehmann Symposium* (ed. J. Rojo), Lecture Notes-Monograph Series, Institute of Mathematical Statistics (IMS), Vol. 49: 77-97.
- Mayo, D.G. and Spanos, A. (2006). “Severe Testing as a Basic Concept in a Neyman-Pearson Philosophy of Induction,” *British Journal for the Philosophy of Science*, 57: 323-357.

[1] That really is a decision, though it’s based on evidence that doing so is in sync with what both Neyman and Pearson thought. There’s plenty of evidence, by the way, that Fisher is more behavioristic and less evidential than Neyman, and certainly less than E. Pearson. I think this “he said/she said” route to understanding statistical methods is a huge mistake. I keep saying, “It’s the methods, stupid!”

[2] And, Neyman rightly assumed at first, from Fisher’s approach. Fisher’s loud rants, later on, that Neyman turned his tests into crude acceptance sampling affairs akin to Russian five-year plans and the money-making goals of U.S. commercialism all came after the break in 1935, which registered a conflict of egos, not of statistical philosophies. Look up “anger management” on this blog.

Fisher is the arch anti-Bayesian, whereas Neyman experimented with using priors at the start. The problem wasn’t so much viewing parameters as random variables, but lacking knowledge of what their frequentist distributions could possibly be. Thus he sought methods whose validity held up regardless of priors. Here E. Pearson was closer to Fisher, but unlike the other two, he was a really nice guy. (I hope everyone knows I’m talking of Egon here, not his mean daddy.) See chapter 11 of EGEK (1996):

- Mayo, D. 1996. “Why Pearson rejected the Neyman-Pearson (behavioristic) philosophy and a note on objectivity in statistics.”

[3] Who drew the picture of Neyman above? Anyone know?

**References**

de Finetti, B. 1972. *Probability, Induction and Statistics: The Art of Guessing*. Wiley.

Neyman, J. 1976. “Tests of Statistical Hypotheses and Their Use in Studies of Natural Phenomena.” *Commun. Statist. Theor. Meth*. A5(8), 737-751.

Under a harmonization of Neyman’s and Fisher’s views, how does one explain fiducial inference? The fiducial solution (e.g., a fiducial interval) and Neyman’s solution (an optimal CI) can differ from one another (and given the symmetry of CIs and tests, this would seem to raise some questions about hypothesis tests as well). Even if it truly is “the methods, stupid”, it seems reasonable to infer that the two had some real differences, and that one can’t just chalk it up to a battle of egos.

I was also going to add relevant subsets to this, but I see that Stephen Senn mentioned it on a previous post: (https://errorstatistics.com/2014/08/11/egon-pearsons-heresy/#comment-90063)

Richard:

I recently found out that Neyman made a math error in that particular argument that Stephen refers to.

Fisher did claim there was an error but I guess did not think he needed to walk people (who thought they were better mathematicians than him) through it.

https://projecteuclid.org/euclid.ss/1408368581

Keith O’Rourke

Phanerono: I didn’t think there was a math error (and there’s a large exchange on it) but rather a difference in modeling. Gelman appears to agree with Neyman:

https://errorstatistics.com/2013/05/24/gelman-sides-w-neyman-over-fisher-in-relation-to-a-famous-blow-up/

Perhaps, but Sabbaghi and Rubin seem quite clear (and it came a year after those comments by Andrew):

“Neyman in fact made a crucial algebraic mistake in his appendix, and his expressions for the expected mean residual sum of squares for both designs are generally incorrect. We present the correct expressions in Sections”

Keith O’Rourke

Mayo: Sabbaghi and Rubin are unequivocal, there was a math error. Though this doesn’t rule out discussion of which null one might more usefully test.

Maybe so, but this wasn’t Senn’s argument, nor (to the opposite conclusion) Gelman’s.

I read Gelman as being more generous than that. He does say “go with Fisher” in some situations but “go with Neyman” in others.

Similarly, Senn’s comment, linked from Gelman’s discussion, says how “weird” Neyman’s null is in one setting, but it’s hardly a conclusive argument that one should never consider it.

Mayo:

I was pointing to the math error – it does not necessarily mean Fisher was right about everything but it does mean Neyman was in error about something. (I thought error statisticians wanted to know about errors rather than not know.)

It does however suggest that Fisher had some justification for being cranky – he pointed to where the error was and that should have been enough for a good mathematician to realize they should verify their derivation to ensure that there wasn’t an error.

Keith O’Rourke

Phanerono: First, I don’t recall Fisher pointing out a math error, as opposed to making, essentially, Senn’s point. But Fisher was wrong (and blatantly so) in some/many of his probabilistic instantiations of his fiducial intervals (read Fisher’s 1955 from the triad!), and in other things besides. (I put it as “some/many” since I know people are holding out on fixing it for certain cases.) If you know the history, you’d know that this didn’t make Fisher “cranky” for decades, however long it was. Check out the anger management post. If Neyman had said he’d use Fisher’s book, they would have been friends. Instead, Fisher ran to try to get him fired as soon as Neyman said he wanted to use his own book. As Cox put it, the problem was owing to their being under the same roof (or words to that effect, in his 2006 book).

Let’s stick with 1935.

From http://arxiv.org/pdf/1409.2675v1.pdf (which may be more accessible):

In fact, Fisher was the sole discussant who identified an incorrect equation (27), in Neyman’s appendix:

Then how had Dr. Neyman been led by his symbolism to deceive himself on so simple a question? . . . Equations (13) and (27) of his appendix showed that the quantity which Dr. Neyman had chosen to call σ^2 did not contain the same components of error as those which affected the actual treatment means, or as those which contributed to the estimate of error. (Fisher, 1935, page 156)

Neyman in fact made a crucial algebraic mistake in his appendix.

Fisher, R. A. (1935). Comment on “Statistical problems in agricultural experimentation (with discussion).” Suppl. J. Roy. Statist. Soc. Ser. B 2 154–157, 173.

If you do find this is a misquote, I will apologize and personally ask the authors to retract it.

Keith O’Rourke

Richard: I don’t say there weren’t real differences, but I doubt it’s possible to disentangle the differences due to views of science or inferential philosophy, as opposed to those created in the midst of in-fighting. For example, N-P and Fisher were on the same side in the early development of tests, and they worked in order to justify Fisher’s tests and other tests then in use that lacked a clear rationale. They did it “his” way, and then later Fisher went back on his own views. Fisher said it was only around 20 years after the development of tests that Barnard told him that N-P had turned significance tests into acceptance procedures. I’ve often said that the history of stat would look very different if it had been Pearson and Pearson (both Egons).

Anyway, I’ve been through all this. My point is that we, living today, should want to understand the methods themselves. (At least some of the criticisms could actually be on target rather than non sequiturs.) There are all sorts of reasons that the notions were chosen. Pearson and Neyman actually have different recollections of where the appeal to the alternative hypothesis came from, and whether the notion of power came from one of Pearson’s early jobs (as Egon said), or something Lola said, or Student, or Borel, or when they were eating shrimp on a summer holiday (see E. Pearson’s “the N-P Story”). By placing weight on these matters of idiosyncrasy and personality (“discovery” vs. “justification” considerations), all sorts of criticisms have been launched of the form: “I know a P-value isn’t an error probability because Fisher lambasted Neyman and called him a Russian.” These are fallacious arguments, and yet they free people from the much harder job of erecting appropriate, perhaps different, rationales for the various methods in use. It also lets them declare falsely that the methods are inconsistent with each other and so should be dumped entirely.

Pearson always said that N-P statistics were not to be considered a static system but were to be upgraded and modified as problems and technologies changed. Read Lehmann’s recent, slim book on Neyman, Fisher and the creation of classical statistics. (It can be won for free in my palindrome contest.)

There is no doubt that each recommended different methods at times, but so what? This in no way hampers communication or critical appraisal, as if scientists don’t generally tackle a problem in completely different ways quite deliberately. The idea that there is one and only one way to skin a cat is very distant from real science. Different methods are good at unearthing errors that others miss. If one remembers that statistical inference is distinct from scientific inference, the various moves at the statistical level don’t preclude shared inferences at the substantive level.

As for fiducial intervals, everyone knows it’s an intriguing, deep/dark mystery just what Fisher meant, and trillions of pages have been written. It’s the house of the rising sun of statistics, and many have spent hours in misery trying to figure out how to make sense of Fisher. I find it very interesting that it’s enjoying a resurgence and several of the people who commented on my Likelihood Principle article in Statistical Science are advancing modern variants of fiducialism (Fraser is one). Sometimes Fisher even seemed to have severity in mind!

ADDITION: I should note that the infelicities of the fiducial argument, appearing to countenance an illicit probabilistic instantiation, are definitely among the reasons Neyman emphasized the behavioristic interpretation of CIs. He wanted to be clear that he wasn’t agreeing with Fisher on fiducial intervals. See the Fisher/Neyman exchange in the “triad” of 1955/56 on this blog.

Mayo: “Read Lehmann’s recent, slim book on Neyman, Fisher and the creation of classical statistics.” I already own it; it’s a nice book. I tweeted it as a recommended read for everyone a few weeks ago. Palindromes aren’t really my thing.

part stats trap?

To me, the main distinction to be drawn between Neyman-Pearson hypothesis tests and Fisherian significance tests is that the N-P framework focusses on the error probabilities of the method whereas Fisher’s was more focussed on determining what the data in hand allowed one to infer about the hypothesis in question.

Fisher wrote: “The appreciation of such more complicated cases will be much aided by a clear view of the nature of a test of significance applied to a single hypothesis by a unique body of observations.” (1956, p. 46). In marked contrast, Neyman & Pearson wrote that their framework was the result of having decided that “no test based on the theory of probability can by itself provide any valuable evidence of the truth or falsehood of [a particular] hypothesis” and that “[w]ithout hoping to know whether each separate hypothesis is true or false, we may search for rules to govern our behaviour…” (1933, pp. 290-291). Neyman’s focus was on global probabilities and Fisher’s was on local probabilities, and where the local probabilities were unavailable via his fiducial argument, he suggested that we look at the likelihoods.

To a first approximation, a hypothesis test yields a decision regarding the null hypothesis that preserves global error probabilities, and a significance test yields a P-value that has a claim to being related to local evidence.
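That “global” property is a claim about the method over repetitions, and it can be checked directly by simulation. The sketch below is a toy illustration of my own (the normal model, n = 10, and α = 0.05 are assumptions of the example, not anything from the discussion): under a true null, the fixed-level N-P rule rejects in roughly a 5% fraction of repetitions, whatever any single data set happens to say.

```python
import random
from statistics import NormalDist

Z_CRIT = NormalDist().inv_cdf(0.95)  # cutoff for a one-sided test at alpha = 0.05

def np_test_rejects(sample):
    """Neyman-Pearson rule: reject H0: mu = 0 (known sigma = 1) iff z > Z_CRIT."""
    n = len(sample)
    z = sum(sample) / n * n ** 0.5
    return z > Z_CRIT

random.seed(1)
reps = 20_000
# Long-run check: under H0 the fixed-level rule errs (rejects) in about
# an alpha = 5% fraction of repetitions.
rejections = sum(np_test_rejects([random.gauss(0, 1) for _ in range(10)])
                 for _ in range(reps))
print(rejections / reps)  # close to 0.05
```

The rule controls this long-run error rate by construction; what it does not do, by itself, is grade the local evidence in the particular sample, which is the significance-test side of the contrast.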

I agree with your position that it doesn’t much matter what either Neyman or Fisher wrote and did, we should worry about the properties of the methods. However, it is important that we do not lose sight of the distinction that should be drawn between long-run error rates and evidence.

Michael: If you remember

https://errorstatistics.com/2014/08/17/are-p-values-error-probabilities-installment-1/

and other posts like it, you’ll remember that it was Fisher who was using tests as automatic rules to decide whether to pay any more attention to an effect or not. The justification for such an automatic rule was a 5% error rate in the long run. N-P formulated tests to try to justify what Fisher was doing, but without some of the latitude for bad tests. The remark you cite is from 20 years after the break-up (see the anger management post), when Fisher took off his behavioristic hat, etc. He was livid that people everywhere were using N-P statistics more than his tests. Barnard gave Fisher this basis for beating up on Neyman (Barnard told me this personally, but it’s also in Fisher’s writing in the “triad”).

As for that passage, it’s incorrectly understood if viewed as an assertion about long runs AS OPPOSED to an assertion (one that comes quite early, in the ’33 paper) about the NEED FOR AN ALTERNATIVE. I analyze that passage over 10 pages in my book.

I agree there’s a whiff more of behaviorism in Neyman than in Fisher, but just barely, and only later on. They both used tests inferentially and behavioristically. And in practice Neyman is more “evidential” than Fisher. Pearson was somewhere in the middle, but clearly hinted at an evidential twist to the error probabilities. Birnbaum gave another hint. I propose a full-fledged evidential construal.

I certainly think that whether testing leads to inferences or decisions is not the difference between significance and hypothesis testing. Certainly, P-values can be used in either the Fisherian or the Neyman-Pearson system. In (1) I quoted Lehmann (2) as follows:

‘In applications, there is usually available a nested family of rejection regions corresponding to different significance levels. It is then good practice to determine not only whether the hypothesis is accepted or rejected at the given significance level, but also to determine the smallest significance level α̂ = α̂(x), the significance probability or p-value, at which the hypothesis would be rejected for the given observation.’
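Lehmann’s smallest significance level can be illustrated with a toy one-sided z-test (a minimal sketch of my own; the test and the observed z = 2.0 are assumptions for illustration): scanning the nested family of rejection regions from small α upward, the smallest level at which the observation is rejected coincides, to grid precision, with the usual P-value.

```python
from statistics import NormalDist

def rejects(z_obs, alpha):
    """Fixed-level one-sided test: reject H0 iff z exceeds the alpha cutoff."""
    return z_obs > NormalDist().inv_cdf(1 - alpha)

def smallest_rejecting_alpha(z_obs, grid_size=100_000):
    """Scan the nested rejection regions from small alpha upward and return
    the first (smallest) grid level at which H0 is rejected."""
    return next((k / grid_size for k in range(1, grid_size)
                 if rejects(z_obs, k / grid_size)), 1.0)

z_obs = 2.0
p_value = 1 - NormalDist().cdf(z_obs)             # the Fisherian P-value
print(round(p_value, 4))                          # 0.0228
print(round(smallest_rejecting_alpha(z_obs), 4))  # agrees to grid precision
```

This is why the same P-value can serve both masters: read against the nested family it reports the smallest rejecting level, read on its own it is offered as a measure of local evidence.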

It seems to me that the differences between Fisher and Neyman are more to do with the following:

A) Is the alternative more primitive than the test statistic?

B) Should the property of the tests be judged by relative frequencies over all repetitions of the experiment as performed or should one condition on relevant subsets?

C) Can it be the case that in certain complex cases the program of controlling the type I error rate at a fixed level while maximising power is inappropriate?

I think that it is certainly true that in the Latin square case Neyman’s and Fisher’s hypotheses are different. I was unaware that it was also the case that Neyman’s algebra was wrong.

(However, it is interesting to note that in the discussion Fisher refers to a mathematician, who must have been Wilks, having proved that his method was correct.)

1. Senn S. A comment on replication, p-values and evidence, S.N.Goodman, Statistics in Medicine 1992; 11:875-879. Statistics in Medicine 2002; 21: 2437-2444; author reply 2445-2437.

2 Lehmann EL. Testing Statistical Hypotheses. Chapman and Hall: New York, 1994.

Stephen: That you were unaware of any math error, assuming there is one, supports what I wrote to Phanerono: Fisher hadn’t been arguing about a math error that Neyman made and denied.

I agree with your points about ostensive differences, but Neyman scarcely thought he was proposing one way of doing anything. He said, “here are some exemplars; see if any of your cases could use something like this, else don’t use it”.

In this connection, I mention the housing example in the paper in the above post. (I’m also curious whether any weight was ever given to the radical hypothesis about cancer in Le Cam’s study.) So what is it that surprised Spanos and me about the housing example? Take a look at how he invents a technique to check for lack of randomness. While I forget the details, the fascinating thing is the attitude of inventing a test post-data to pose this question, a question that could be posed in zillions of ways and didn’t follow from any preset optimality. He’s quite satisfied that it shows him what he needs to know. Neyman in practice speaks louder than Neyman in theory.