“Tests of Statistical Hypotheses and Their Use in Studies of Natural Phenomena” by Jerzy Neyman

ABSTRACT. Contrary to ideas suggested by the title of the conference at which the present paper was presented, the author is not aware of a conceptual difference between a “test of a statistical hypothesis” and a “test of significance” and uses these terms interchangeably. A study of any serious substantive problem involves a sequence of incidents at which one is forced to pause and consider what to do next. In an effort to reduce the frequency of misdirected activities one uses statistical tests. The procedure is illustrated on two examples: (i) Le Cam’s (and associates’) study of immunotherapy of cancer and (ii) a socio-economic experiment relating to low-income homeownership problems.

Neyman died on August 5, 1981. Here’s an unusual paper of his, “Tests of Statistical Hypotheses and Their Use in Studies of Natural Phenomena.” I have been reading a fair amount by Neyman this summer in writing about the origins of his philosophy, and have found further corroboration of the position that the behavioristic view attributed to him, while not entirely without substance*, is largely a fable that has been steadily built up and accepted as gospel. This has justified ignoring Neyman-Pearson statistics (as resting solely on long-run performance and irrelevant to scientific inference) and turning to crude variations of significance tests that Fisher wouldn’t have countenanced for a moment (so-called NHSTs), lacking alternatives, incapable of learning from negative results, and permitting all sorts of P-value abuses–notably going from a small p-value to claiming evidence for a substantive research hypothesis. The upshot is to reject all of frequentist statistics, even though P-values are a teeny tiny part. *This represents a change in my perception of Neyman’s philosophy since EGEK (Mayo 1996). I still say that, for our uses of method, it doesn’t matter what anybody thought, that “it’s the methods, stupid!” Anyway, I recommend, in this very short paper, the general comments and the example on home ownership. Here are two snippets:

1. INTRODUCTION

The title of the present session involves an element that appears mysterious to me. This element is the apparent distinction between tests of statistical hypotheses, on the one hand, and tests of significance, on the other. If this is not a lapse of someone’s pen, then I hope to learn the conceptual distinction. Particularly with reference to applied statistical work in a variety of domains of Science, my own thoughts of tests of significance, or EQUIVALENTLY of tests of statistical hypotheses, are that they are tools to reduce the frequency of errors.…

(iv) A similar remark applies to the use of the words “decision” or “conclusion”. It seems to me that at our discussion, these particular words were used to designate only something like a final outcome of complicated analysis involving several tests of different hypotheses. In my own way of speaking, I do not hesitate to use the words ‘decision’ or “conclusion” every time they come handy. For example, in the analysis of the follow-up data for the [home ownership] experiment, Mark Eudey and I started by considering the importance of bias in forming the experimental and control groups of families. As a result of the tests we applied, we decided to act on the assumption (or concluded) that the two groups are not random samples from the same population. Acting on this assumption (or having reached this conclusion), we sought ways to analyze the data other than by comparing the experimental and the control groups. The analyses we performed led us to “conclude” or “decide” that the hypotheses tested could be rejected without excessive risk of error. In other words, after considering the probability of error (that is, after considering how frequently we would be in error if in conditions of our data we rejected the hypotheses tested), we decided to act on the assumption that “high” scores on “potential” and on “education” are indicative of better chances of success in the drive to home ownership. (750-1; the emphasis is Neyman’s)

To read the full (short) paper: Tests of Statistical Hypotheses and Their Use in Studies of Natural Phenomena.

Following Neyman, I’ve “decided” to use the terms ‘tests of hypotheses’ and ‘tests of significance’ interchangeably in my new book.[1] Now it’s true that Neyman was more behavioristic than Pearson, and it’s also true that tests of statistical hypotheses or tests of significance need an explicit reformulation and statistical philosophy to explicate the role of error probabilities in inference. My way of providing this has been in terms of severe tests. However, in Neyman-Pearson applications, more than in their theory, you can find many examples as well. Recall Neyman’s paper, “The Problem of Inductive Inference” (Neyman 1955) wherein Neyman is talking to none other than the logical positivist philosopher of confirmation, Rudolf Carnap:

I am concerned with the term “degree of confirmation” introduced by Carnap. …We have seen that the application of the locally best one-sided test to the data … failed to reject the hypothesis [that the n observations come from a source in which the null hypothesis is true]. The question is: does this result “confirm” the hypothesis that H_{0} is true of the particular data set? (Neyman, pp 40-41).

Neyman continues:

The answer … depends very much on the exact meaning given to the words “confirmation,” “confidence,” etc. If one uses these words to describe one’s intuitive feeling of confidence in the hypothesis tested H_{0}, then…. the attitude described is dangerous.… [T]he chance of detecting the presence [of discrepancy from the null], when only [n] observations are available, is extremely slim, even if [the discrepancy is present]. Therefore, the failure of the test to reject H_{0} cannot be reasonably considered as anything like a confirmation of H_{0}. The situation would have been radically different if the power function [corresponding to a discrepancy of interest] were, for example, greater than 0.95. (ibid.)

The general conclusion is that it is a little rash to base one’s intuitive confidence in a given hypothesis on the fact that a test failed to reject this hypothesis. A more cautious attitude would be to form one’s intuitive opinion only after studying the power function of the test applied.
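Neyman’s warning can be made concrete with a quick calculation. The sketch below is purely illustrative (the sample sizes, discrepancy delta = 0.5, and level 0.05 are invented, not from his exchange with Carnap): with few observations the power against the discrepancy is slim, so a failure to reject says almost nothing.

```python
from math import erf, sqrt

def norm_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def power_one_sided_z(n, delta, sigma=1.0):
    """Power of the one-sided z-test of H0: mu = 0 vs H1: mu = delta > 0,
    at significance level 0.05 (critical value 1.645)."""
    z_crit = 1.645
    # Under H1 the standardized test statistic is N(delta*sqrt(n)/sigma, 1),
    # so power = P(Z > z_crit - delta*sqrt(n)/sigma).
    return 1.0 - norm_cdf(z_crit - delta * sqrt(n) / sigma)

# With n = 5 the chance of detecting delta = 0.5 is slim, so non-rejection
# "confirms" nothing; with n = 100 the power exceeds Neyman's 0.95 benchmark.
for n in (5, 20, 100):
    print(n, round(power_one_sided_z(n, delta=0.5), 3))
```

This is the arithmetic behind “form one’s intuitive opinion only after studying the power function.”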

Neyman, like Peirce, Popper and many others, holds that the only “logic” is deductive logic. “Confirmation” for Neyman is akin to Popperian “corroboration”–you could corroborate a hypothesis *H* only to the extent that it passed a severe test–one with a high probability of having found flaws in *H*, if they existed. Neyman puts this in terms of having high power to reject *H*, if *H* is false and alternative *H’* true, and high probability of finding no evidence against *H* if true, but it’s the same idea. (Their weakness is in being predesignated error probabilities, but severity fixes this.) Unlike Popper, however, Neyman actually provides a methodology that can be shown to accomplish the task reliably.

Still, Fisher was correct to claim that Neyman is merely recording his preferred way of speaking. One could choose a different way. For example, Peirce defined induction as passing a severe test, and Popper said you could define it that way if you wanted to. But the main thing is that Neyman is attempting to distinguish the “inductive” or “evidence transcending” conclusions that statistics affords, on his approach,[2] from assigning to hypotheses degrees of belief, probability, support, plausibility or the like.

De Finetti gets it right when he says that the expression “inductive behavior…that was for Neyman simply a slogan underlining and explaining the difference between his own, the Bayesian and the Fisherian formulations” became, with Wald’s work, “something much more substantial” (de Finetti 1972, p.176). De Finetti called this “the involuntarily destructive aspect of Wald’s work” (ibid.).

For related papers, see:

- Mayo, D. G. and Cox, D. R. (2006). “Frequentist Statistics as a Theory of Inductive Inference,” *Optimality: The Second Erich L. Lehmann Symposium* (ed. J. Rojo), Lecture Notes-Monograph Series, Institute of Mathematical Statistics (IMS), Vol. 49: 77-97.
- Mayo, D. G. and Spanos, A. (2006). “Severe Testing as a Basic Concept in a Neyman-Pearson Philosophy of Induction,” *British Journal for the Philosophy of Science*, 57: 323-357.

[1] That really is a decision, though it’s based on evidence that doing so is in sync with what both Neyman and Pearson thought. There’s plenty of evidence, by the way, that Fisher is more behavioristic and less evidential than is Neyman, and certainly less than E. Pearson. I think this “he said/she said” route to understanding statistical methods is a huge mistake. I keep saying, “It’s the methods, stupid!”

[2] And, Neyman rightly assumed at first, from Fisher’s approach. Fisher’s loud rants, later on, that Neyman turned his tests into crude acceptance sampling affairs akin to Russian 5-year plans and the money-making goals of U.S. commercialism, all occurred after the break in 1935, which registered a conflict of egos, not statistical philosophies. Look up “anger management” on this blog.

Fisher is the arch anti-Bayesian, whereas Neyman experimented with using priors at the start. The problem wasn’t so much viewing parameters as random variables, but lacking knowledge of what their frequentist distributions could possibly be. Thus he sought methods whose validity held up regardless of priors. Here E. Pearson was closer to Fisher, but, unlike the other two, he was a really nice guy. (I hope everyone knows I’m talking of Egon here, not his mean daddy.) See chapter 11 of EGEK (1996):

- Mayo, D. 1996. “Why Pearson rejected the Neyman-Pearson (behavioristic) philosophy and a note on objectivity in statistics”.

**References**

de Finetti, B. 1972. *Probability, Induction and Statistics: The Art of Guessing*. Wiley.

Neyman, J. 1976. “Tests of Statistical Hypotheses and Their Use in Studies of Natural Phenomena.” *Commun. Statist. Theor. Meth*. A5(8), 737-751.

Mayo wrote: I have been reading a lot by Neyman this summer in writing about the origins of his philosophy, and have found further corroboration of the position that the behavioristic view attributed to him, while not entirely without substance, is largely a fable that has been steadily built up and accepted as gospel.

Here’s a bit of substance: quotes from two of Neyman’s most famous papers. First, the Neyman-Pearson lemma paper (On the Problem of the Most Efficient Tests of Statistical Hypotheses, 1933):

“…We are inclined to think that as far as a particular hypothesis is concerned, no test based upon the theory of probability can by itself provide any valuable evidence of the truth or falsehood of that hypothesis.

But we may look at the purpose of tests from another view-point. Without hoping to know whether each separate hypothesis is true or false, we may search for rules to govern our behaviour with regard to them, in following which we insure that, in the long run of experience, we shall not be too often wrong.”

Second, the confidence interval paper (Outline of a Theory of Statistical Estimation Based on the Classical Theory of Probability, 1937):

“The justification [for the confidence interval procedure] lies in the character of probabilities as used here, and in the law of great numbers. According to this empirical law, which has been confirmed by numerous experiments, whenever we frequently and independently repeat a random experiment with a constant probability, α, of a certain result, A, then the relative frequency of the occurrence of this result approaches α. [] It follows that if the practical statistician applies permanently the [confidence interval procedure for interval estimation], in the long run he will be correct in about 99 per cent. of all cases.”

(Curiously, Neyman doesn’t set α = 0.99; he had a brain fart in this passage.)
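The long-run claim in that 1937 passage is easy to check by simulation. The sketch below uses an invented normal model with 95% z-intervals rather than Neyman’s 99% (mu, sigma, n, and the trial count are all arbitrary choices for illustration):

```python
import random
from math import sqrt

def covers(mu, n, z=1.96, sigma=1.0):
    """Draw one normal sample of size n and check whether the
    95% z-interval for the mean covers the true mu."""
    xs = [random.gauss(mu, sigma) for _ in range(n)]
    xbar = sum(xs) / n
    half = z * sigma / sqrt(n)
    return xbar - half <= mu <= xbar + half

random.seed(0)
trials = 20000
hits = sum(covers(mu=3.0, n=10) for _ in range(trials))
print(hits / trials)  # relative frequency of coverage, close to 0.95
```

The “statistician who applies the procedure permanently” is the loop: the relative frequency of correct intervals approaches the stated confidence coefficient, which is exactly the justification Neyman offers.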

In Neyman’s reply to Carnap, he says that power is relevant *if* one is forming an “intuitive feeling of confidence” regarding a hypothesis. But it seems clear to me from the above quotes that he thought that the only possible *justification* for a procedure was that it would perform well in the long run. Maybe you’ve read other statements of his philosophy that soften this, but I don’t think you can blame people reading his most famous papers for coming away with that impression.

Corey: Well it’s funny, I have a very long discussion of that one passage from a very early N-P paper in my new book. Looking at the passages before and after gives the depth of meaning. But I’ve discussed this a million times before, and I am not blaming anyone, except that if you actually read their work, you get a different emphasis. Surely you remember many posts of mine on how the philosophy is a lot like Popper and others. I can link to them. That is, if you reject probabilism, as they did, but you’re doing something that employs data to learn, what should you call it? They called it learning how to act. So instead of inductive inference it becomes inductive behavior. As I’ve also said many times, it really doesn’t matter what anyone thought, it’s the methods, stupid.

We can see, in particular, the value of considering a method’s behavior under hypothetical repetitions for understanding the data/inference in front of us. It teaches us the capabilities of the method, giving us a statistical connection between our data and some aspect of the data generation, as modeled.

If the question is about Neyman’s philosophy, then it definitely matters what he thought he was doing when he employed his methods. Did he think he was employing data to learn? Even if you hold that this is what he was actually doing, it’s still possible that he would not agree.

This reminds me a bit of quantum mechanics: proponents of different interpretations agree about the calculations to be performed but not about the calculations’ proper meaning. I would argue that interpretations (whether of QM or of frequentist statistics) do matter, because they guide new developments. For instance, it seems doubtful that someone who thought they were using data to learn would propose randomized tests.

Corey: Yes, I agree it matters for getting at his philosophy, I was thinking of other readers who might think it all turned on Neyman and Fisher’s in-fighting. I mention that because the more they fought, the more Fisher withdrew his previous behavioristic attitude and the more Neyman stressed his.

Randomized tests, for those who don’t know, involve doing something like tossing a coin for borderline outcomes in order that, overall, you achieve the right error rate. Granted, that’s a performance idea that almost everyone rejects including NP who had an “undecided” category.
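For concreteness, here is a sketch of such a randomized test for a one-sided binomial problem (the choices n = 10, p0 = 0.5, alpha = 0.05 are illustrative): because the distribution is discrete, no fixed cutoff gives size exactly alpha, so the test rejects outright above a cutoff and tosses a suitably biased coin at the borderline outcome.

```python
from math import comb
import random

def binom_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def randomized_test(n=10, p0=0.5, alpha=0.05):
    """One-sided randomized binomial test of H0: p = p0 vs H1: p > p0.
    Returns (c, gamma): reject outright if X > c; if X == c, reject with
    probability gamma, chosen so the size is exactly alpha."""
    tail = 0.0  # P(X > c) under H0
    c = n
    while c >= 0 and tail + binom_pmf(c, n, p0) <= alpha:
        tail += binom_pmf(c, n, p0)
        c -= 1
    gamma = (alpha - tail) / binom_pmf(c, n, p0)
    return c, gamma

c, gamma = randomized_test()
print(c, round(gamma, 3))  # prints 8 0.893 for n=10, p0=0.5, alpha=0.05

def reject(x, c=c, gamma=gamma):
    """Apply the test to an observed count x; the coin toss at x == c
    is what makes the overall error rate come out exactly right."""
    return x > c or (x == c and random.random() < gamma)
```

The coin toss buys exact size at the price of letting an extraneous randomizer decide borderline cases, which is precisely why almost everyone rejects the idea as an account of evidence.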

Now here’s the irony. These days it’s all about determining replication rates, throwing all significant “findings” in a great big urn, and figuring out how often false findings would occur–ignoring the particularities of the individual test. We’ve had many discussions on this blog about that, notably, recently with Senn. Neyman’s behavioristic side wouldn’t be nearly behavioristic enough in today’s world.

Nice to hear from you, by the way.

You can read Pearson’s emphasis on using tests for “learning” in his part of the “triad”.

https://errorstatistics.com/2015/04/24/statistical-concepts-in-their-relation-to-reality-by-e-s-pearson/

Sometimes the coin-tossing in tests is so cleverly disguised that even those who propose it may not have noticed it. For example, certain “more powerful” alternatives to Fisher’s exact test use the fact that a number of different points in the sample space map to the same statistic. By fixing things so that some of the points mapping to the same statistic declare significance and some do not, you can increase the power of the test for a pre-defined type I error rate. Effectively the sample space is being used as a cunning auxiliary randomiser. Of course the attained significance level will be less impressive. See [1] for a discussion.

I agree with Mayo as regards the tedious distinction made between significance and hypothesis testing. I think that this is unimportant except as regards one thing: Neyman used alternative hypotheses, Fisher did not. (See https://errorstatistics.com/2015/02/19/stephen-senn-fishers-alternative-to-the-alternative-2/)

But in Neyman’s system, hypothesis tests are everything. Confidence intervals are obtained by inverting hypothesis tests, for example. Fisher has further systems: he uses likelihood as a criterion in itself rather than as a justification via power, and also fiducial inference. The latter, as we know, ran into problems, but so do confidence intervals, and there seems to be some similarity between fiducial inference and severity.

For this and other reasons, I don’t think “lacking alternatives, incapable of learning from negative results, and permitting all sorts of P-value abuses” is a fair description of what Fisher proposed.

Reference

[1] Senn, S. (2007). “Drawbacks to noninteger scoring for ordered categorical data.” Biometrics 63(1): 296-298; discussion 298-299.

Stephen: First, thanks for noting the erroneous year for Neyman’s death (I had 1991 accidentally). Second, I worried that that sentence, which I wrote too quickly, would be misunderstood. That is, I intended to say, as I have said, that the animal called NHST is a distortion of Fisherian tests. However, by limiting themselves to a single hypothesis, and neglecting the fact that statistical significance alone doesn’t warrant inferences to various research hypotheses, critics of frequentist methods have an easier time. To take a small p-value (an isolated one will do) as evidence for your typical psych research hypothesis is distant from anything Fisher endorsed. However, if you’re going to be unthinking about the whole matter, as critics, and perhaps some p-value users, tend to be, at least N-P tests are explicit that rejecting the null at most allows an inference to its statistical denial. You’d at least avoid the crass fallacy of going from stat sig -> research hypothesis. Going with the iconic N-P tests, with required predesignated test specifications, would be a big plus given today’s chicanery (even though I claim it’s overly rigid). Likewise, if you’re going to allow formulating a flabby alternative along with all manner of discretionary steps to get a small p-value, you’re going to have an easier time discrediting tests than if you stayed within the classic N-P fold.

So, let me go back to that sentence to make sure that the faulty reading is not taken by readers, sorry.

Readers might also check “p-values exaggerate the evidence” and “are p-values error probabilities?”, in several posts on this blog. Being able to trash crude NHSTs and declare all of frequentism wrong, simply because Neyman spoke of behavior and not inference, has been a gift from the gods for those wishing to condemn frequentist methods and downplay the protections offered by error statistical accounts.

I’d very much like to get your take on the connection between fiducial inference and severity.

As regards flabby hypothesis testing, I think that there are two different issues here.

1) Whether to specify a particular test you need to have a precise alternative hypothesis in mind

2) Whether you should pre-specify your test.

The answer to 1) is “no” and, quite compatibly with that, the answer to 2) is “yes”. In fact, in the world in which I work, pharmaceutical statistics, 2) is mandatory, and extremely detailed statistical analysis plans are written in advance of decoding the data, in which the precision of the recipe for all aspects of the test far exceeds the sophistication with which any alternative hypotheses are stated (if they are stated at all). In fact, you can pretty much bet that no regulator would accept a statement of alternative hypotheses as a substitute for the precise algorithm for testing that the analysis plan will contain.

Fisher’s criticism of the NP system is that: 1) historically, it did not predate the tests it claimed to justify; 2) it’s pretty much redundant; 3) in practice it is experience with statistics and tests that justifies choice of alternative, and not vice versa.

It is true, however, that Fisher himself was somewhat relaxed about pre-specification (a point about which Student famously criticised him), but I see that as being somewhat orthogonal to the issue of whether pre-specification is important.

Hi Stephen: I’m curious about “the precise algorithm for testing that the analysis plan will contain”. Clearly it’s not the founders who were responsible. I take it you’re saying this is a good thing, then?

It’s interesting because I thought N-P were always castigated for supposedly requiring predesignation of everything. For example, I was recently rereading Barnard on testing GTR–a topic I spoke to him about in person–where he points out that there was no way to know how many eclipse plates would be salvageable. He regarded this as a rebuke of Neyman. I can only guess Neyman gave that impression, but you can see in his examples many a time where he invents and performs tests of assumptions post data (as with the housing example). Then again, Barnard was clearly on Fisher’s side in the N-P/Fisher debates and reminded me more than once that he was the one to tell Fisher what Neyman had done to his tests (many years after the break-up).

The last time I spoke to Barnard was March of 1999 when I was in England. This is just a side remark. I’d love to ask him a few more things now that I have thought so much more about them.

Dear Deborah, I work a lot on analysis plans. And this is what they will contain

1) Definition of primary endpoint

2) How to handle multiplicity if more than one primary endpoint is involved (e.g. Bonferroni-Holm, Hochberg, etc.)

3) How to handle multiplicity if there are more than two treatment arms

4) How to handle missing data

5) What sensitivity analyses are envisaged for missing data

6) What transformations if any of the data will be used

7) How centre effects will be modelled

8) What covariates will be put in the model

9) Major approach (e.g mixed model, survival analysis, logistic regression, non-parametric)

10) Significance level

11) If sequential, the number of looks and the strategy for adjustment. For example, 5 looks at approximately equal information fractions using O’Brien-Fleming via a Lan-DeMets spending function, with stopping for futility.

12) Analysis sets

13) Which effects are fixed and what are random

etc

I don’t think anybody cares about the alternative hypothesis, and if they did, it would be one hell of a challenge to formulate it in such a way that it inexorably produced the recipe for analysis.
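The multiplicity adjustments in item 2 of the list above are mechanical once stated. Here is a sketch of the Holm step-down procedure (the three p-values are invented for illustration; they are not from any trial):

```python
def holm(pvalues, alpha=0.05):
    """Holm step-down procedure: compare the k-th smallest p-value (0-based)
    to alpha / (m - k); once one comparison fails, retain all remaining
    hypotheses. Controls the familywise error rate at alpha."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    reject = [False] * m
    for k, i in enumerate(order):
        if pvalues[i] <= alpha / (m - k):
            reject[i] = True
        else:
            break  # step-down: all later (larger) p-values are retained too
    return reject

# Hypothetical p-values from three primary endpoints.
print(holm([0.03, 0.04, 0.001]))  # → [False, False, True]
```

Note that the procedure needs nothing but the p-values and alpha, which illustrates the point: the analysis plan can specify the testing algorithm completely without ever writing down an alternative hypothesis.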

See http://www.fda.gov/downloads/drugs/guidancecomplianceregulatoryinformation/guidances/ucm073137.pdf

There are 3 kinds of statistician: those who can count and those who can’t.

I seem to have got point 5) twice.

Stephen: I made one of them 6 and recounted the rest, is that OK?

Stephen: Thanks for this. But surely you know what you’re inferring from all this, e.g., increased survival or whatever. Now in your great Delta force post, your list included power, and I think your discussion of that same topic in your book is also great!

Stephen: I meant to mention something on your last point. It’s scarcely at odds with N-P tests to recognize that “in practice it is experience with statistics and tests that justifies choice of alternative and not vice versa.” They said: here are some ways to formulate and specify classes of methods (so as to capture many tests in use, while preventing unintended pathologies); see if any of these offer suggestive ways to tackle your problem and balance threats of errors as seems apt.

I agree with Corey. The quoted bit from the early Neyman & Pearson paper is explicit and impossible to misinterpret. At least at that time Neyman was clearly thinking that evidence from the data usable for probabilistic evaluation of specific hypotheses relating to the state of the world existing in the experimental system was unavailable. “as far as a particular hypothesis is concerned, no test based upon the theory of probability can by itself provide any valuable evidence of the truth or falsehood of that hypothesis.”

The fact that Neyman may have changed his tune later on to be a bit more Fisherian, does not alter the nature of “hypothesis testing”. The lack of a consistently used distinction between hypothesis testing using pre-specified critical values of alpha and significance testing where P-values are assumed to have evidential meaning is a real problem. In my opinion it is the most important contributor to the chronic misunderstanding and misuse of P-values. I’ve published a paper on the topic that you may not have seen, as it is published in a pharmacology journal: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3419900/

They were referring to the need to consider alternative hypotheses.

I’ve re-read the paper and I have to say that I think you are entirely mistaken. Read it again yourself.

Here’s a snippet from an unpublished letter Egon wrote to Birnbaum (1974) about the idea that the N-P theory admits of two interpretations: behavioral and evidential:

“I think you will pick up here and there in my own papers signs of evidentiality, and you can say now that we or I should have stated clearly the difference between the behavioral and evidential interpretations. Certainly we have suffered since in the way the people have concentrated (to an absurd extent often) on behavioral interpretations”.

https://errorstatistics.com/2012/08/16/e-s-pearsons-statistical-philosophy/#more-6425

It turns out that like apparently everyone else in history who attempted to usefully interpret a p-value, Neyman makes this error (from your post above): “permitting all sorts of P-value abuses–notably going from a small p-value to claiming evidence for a substantive research hypothesis.”

Here is what he wrote regarding the cancer research that is discussed[1]: “The application of the chosen test left little doubt that the lymphocytes from household contact persons are, by and large, better breast cancer cell killers than those from the controls.”

Nope, all you can say is that the data in column A was higher on average than the data in column B. That is it. Nothing more than that.

From the paper where the same data is reported [2], we see the method they used to measure cytotoxicity was the Chromium-51 release assay, which is reported to underestimate the cytotoxicity by ~50%: “As described above, the BLI assay resulted in ∼2-fold more cytotoxicity than observed with the chromium release assay” [3]. Interestingly, that is exactly the size of the effect in the data inspected by Neyman: “Ci values of HHCbr-D, if divided by two, are superimposable on the Ci values of the normal population” [2].

So another possible explanation is that the cells from household contacts were somehow preventing the sequestering of chromium 51, which would appear to cause increased levels of cell death in that assay. That is just one possibility. It is also notable that this research was motivated by one of the most difficult to reproduce and controversial ideas of the last 60 years. It appears to have eventually just faded into obscurity as people got fed up with the conflicting evidence:

Byers et al write: “One of the purposes of this study was to identify a population of individuals who could be used as donors of tumor specific transfer factor in a large-scale clinical trial.”[2] There is not much current information regarding this transfer factor idea, but wikipedia claims: “Transfer factor has some promising findings as a target of immune research, but can only be considered an incompletely investigated field that has been essentially lost to history.”[4]

Reminds me of what Paul Meehl noted regarding psychology: “Perhaps the easiest way to convince yourself is by scanning the literature of soft psychology over the last 30 years and noticing what happens to theories. Most of them suffer the fate that General MacArthur ascribed to old generals—They never die, they just slowly fade away.”[5]

[1] Neyman, J. 1976. “Tests of Statistical Hypotheses and Their Use in Studies of Natural Phenomena.” Commun. Statist. Theor. Meth. A5(8), 737-751.

[2] Byers et al. Identification of Human Populations with a High Incidence of Cellular Immunity against Breast Carcinoma. Cancer Immunology, Immunotherapy July 1977, Volume 2, Issue 3, pp 163-172

http://link.springer.com/article/10.1007%2FBF00205433

[3] Karimi et al. Measuring Cytotoxicity by Bioluminescence Imaging Outperforms the Standard Chromium-51 Release Assay. PLoS One. 2014; 9(2): e89357. Published online 2014 Feb 19. doi: 10.1371/journal.pone.0089357

http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3929704/

[4] https://en.wikipedia.org/wiki/Transfer_factor

[5] Paul E. Meehl (1978) Theoretical Risks and Tabular Asterisks: Sir Karl, Sir Ronald, and the Slow Progress of Soft Psychology Journal of Consulting and Clinical Psychology, Vol. 46, pp. 806–834.

http://www.psych.umn.edu/people/meehlp/113TheoreticalRisks.pdf

Anon: I guess I don’t consider immunology as on quite the same level as psych, but anyway, Neyman was just reporting on Le Cam’s research. It is a surprising thesis.