Monthly Archives: April 2019

Neyman vs the ‘Inferential’ Probabilists

We celebrated Jerzy Neyman’s Birthday (April 16, 1894) last night in our seminar: here’s a pic of the cake. My entry today is a brief excerpt and a link to a paper of his that we haven’t discussed much on this blog: Neyman, J. (1962), ‘Two Breakthroughs in the Theory of Statistical Decision Making’ [i]. It’s chock full of ideas and arguments, but the one that interests me at the moment is Neyman’s conception of “his breakthrough”, in relation to a certain concept of “inference”. “In the present paper,” he tells us, “the term ‘inferential theory’…will be used to describe the attempts to solve the Bayes’ problem with a reference to confidence, beliefs, etc., through some supplementation …either a substitute a priori distribution [exemplified by the so-called principle of insufficient reason] or a new measure of uncertainty” such as Fisher’s fiducial probability. So if you hear Neyman rejecting “inferential accounts,” you have to understand it in this very specific way: he’s rejecting “new measures of confidence or diffidence”. Here he alludes to them as “easy ways out”. Now Neyman always distinguishes his error statistical performance conception from Bayesian and Fiducial probabilisms [ii]. The surprising twist here is semantical, and the culprit is none other than…Allan Birnbaum. Yet Birnbaum gets short shrift, and no mention is made of our favorite “breakthrough” (or did I miss it?) [iii].

drawn by his wife, Olga

Note: In this article, “attacks” on various statistical “fronts” refer to ways of attacking problems in one or another statistical research program.
HAPPY BIRTHDAY WEEK FOR NEYMAN!

 

 

What doesn’t Neyman like about Birnbaum’s advocacy of a Principle of Sufficiency S (p. 25)? He doesn’t like that it is advanced as a normative principle (e.g., about when evidence is or ought to be deemed equivalent) rather than as a criterion that does something for you, such as control errors. (Presumably it is relevant to a type of context, say parametric inference within a model.) S is put forward as a kind of principle of rationality, rather than as one with a rationale in solving some statistical problem.

“The principle of sufficiency (S): If E is a specified experiment, with outcomes x; if t = t(x) is any sufficient statistic; and if E’ is the experiment, derived from E, in which any outcome x of E is represented only by the corresponding value t = t(x) of the sufficient statistic; then for each x, Ev(E, x) = Ev(E’, t), where t = t(x)… (S) may be described informally as asserting the ‘irrelevance of observations independent of a sufficient statistic’.”

Ev(E, x) is a metalogical symbol referring to the evidence from experiment E with result x. The very idea that there is such a thing as an evidence function is never explained, but to Birnbaum “inferential theory” required such things. (At least that’s how he started out.) The view is very philosophical and it inherits much from logical positivism and logics of induction. The principle S, and also other principles of Birnbaum, have a normative character: Birnbaum considers them “compellingly appropriate”.
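To make the content of (S) concrete, here is a minimal sketch (my own illustration, not Birnbaum’s or Neyman’s), using Bernoulli trials, where the number of successes is a sufficient statistic: two samples with the same value of that statistic yield identical likelihood functions, so any appraisal that depends on the data only through the likelihood treats them as evidentially equivalent.

```python
import numpy as np

def bernoulli_likelihood(x, theta):
    """Likelihood of success probability theta given a 0/1 sample x."""
    s = sum(x)                      # sufficient statistic: number of successes
    f = len(x) - s
    return theta**s * (1 - theta)**f

# Two hypothetical samples with the same sufficient statistic t = 3
x1 = [1, 0, 1, 1, 0]
x2 = [0, 1, 1, 0, 1]

thetas = np.linspace(0.05, 0.95, 19)
L1 = np.array([bernoulli_likelihood(x1, th) for th in thetas])
L2 = np.array([bernoulli_likelihood(x2, th) for th in thetas])

# Identical likelihood functions: in Birnbaum's notation, Ev(E, x1) = Ev(E, x2)
assert np.allclose(L1, L2)
```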

“The principles of Birnbaum appear as a kind of substitutes for known theorems,” Neyman says. For example, various authors proved theorems to the general effect that the use of sufficient statistics will minimize the frequency of errors. But if you just start with the rationale (minimizing the frequency of errors, say), you wouldn’t need these “principles” from on high, as it were. That’s what Neyman seems to be saying in his criticism of them in this paper. Do you agree? He has the same gripe concerning Cornfield’s conception of a default-type Bayesian account akin to Jeffreys. Why?

[i] I thank @omaclaran for reminding me of this paper on twitter in 2018.

[ii] Or so I argue in my Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars, 2018, CUP.

[iii] Do you think Neyman is using “breakthrough” here in reference to Savage’s description of Birnbaum’s “proof” of the (strong) Likelihood Principle? Or is it the other way round? Or neither? Please weigh in.

REFERENCES

Neyman, J. (1962), ‘Two Breakthroughs in the Theory of Statistical Decision Making‘, Revue De l’Institut International De Statistique / Review of the International Statistical Institute, 30(1), 11-27.

Categories: Bayesian/frequentist, Error Statistics, Neyman | Leave a comment

Jerzy Neyman and “Les Miserables Citations” (statistical theater in honor of his birthday yesterday)

Neyman April 16, 1894 – August 5, 1981

My second Jerzy Neyman item, in honor of his birthday, is a little play that I wrote for Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (2018):

A local acting group is putting on a short theater production based on a screenplay I wrote:  “Les Miserables Citations” (“Those Miserable Quotes”) [1]. The “miserable” citations are those everyone loves to cite, from their early joint 1933 paper:

We are inclined to think that as far as a particular hypothesis is concerned, no test based upon the theory of probability can by itself provide any valuable evidence of the truth or falsehood of that hypothesis.

But we may look at the purpose of tests from another viewpoint. Without hoping to know whether each separate hypothesis is true or false, we may search for rules to govern our behavior with regard to them, in following which we insure that, in the long run of experience, we shall not be too often wrong. (Neyman and Pearson 1933, pp. 290-1).

In this early paper, Neyman and Pearson were still groping toward the basic concepts of tests–for example, “power” had yet to be coined. Taken out of context, these quotes have led to knee-jerk (behavioristic) interpretations which neither Neyman nor Pearson would have accepted. What was the real context of those passages? Well, the paper opens, just five paragraphs earlier, with a discussion of a debate between two French probabilists—Joseph Bertrand, author of “Calculus of Probabilities” (1907), and Emile Borel, author of “Le Hasard” (1914)! According to Neyman, what served “as an inspiration to Egon S. Pearson and myself in our effort to build a frequentist theory of testing hypotheses” (1977, p. 103) initially grew out of remarks of Borel, whose lectures Neyman had attended in Paris. He returns to the Bertrand-Borel debate in four different papers, and circles back to it often in his talks with his biographer, Constance Reid. His student Erich Lehmann (1993), regarded as the authority on Neyman, wrote an entire paper on the topic: “The Bertrand-Borel Debate and the Origins of the Neyman-Pearson Theory”.

Since it’s Saturday night, let’s listen in on this one act play, just about to begin at the Elba Dinner Theater. Don’t worry, food and drink are allowed to be taken in. (I’ve also included, in the References, several links to papers for your weekend reading enjoyment!)  There go les trois coups–the curtain’s about to open!

The curtain opens with a young Neyman and Pearson (from 1933) standing mid-stage, lit by a spotlight. (Neyman does the talking, since it’s his birthday).

Neyman: “Bertrand put into statistical form a variety of hypotheses, as for example the hypothesis that a given group of stars…form a ‘system.’ His method of attack, which is that in common use, consisted essentially in calculating the probability, P, that a certain character, x, of the observed facts would arise if the hypothesis tested were true. If P were very small, this would generally be considered as an indication that…H was probably false, and vice versa. Bertrand expressed the pessimistic view that no test of this kind could give reliable results.

Borel, however, considered…that the method described could be applied with success provided that the character, x, of the observed facts were properly chosen—were, in fact, a character which he terms ‘en quelque sorte remarquable’” (Neyman and Pearson 1933, p.141/290).

The stage fades to black, then a spotlight shines on Bertrand, stage right.

Bertrand: “How can we decide on the unusual results that chance is incapable of producing?…The Pleiades appear closer to each other than one would naturally expect…In order to make the vague idea of closeness more precise, should we look for the smallest circle that contains the group? the largest of the angular distances? the sum of squares of all the distances?…Each of these quantities is smaller for the group of the Pleiades than seems plausible. Which of them should provide the measure of implausibility? …

[He turns to the audience, shaking his head.]

The application of such calculations to questions of this kind is a delusion and an abuse.” (Bertrand, 1907, p. 166; Lehmann 1993, p. 963).

The stage fades to black, then a spotlight appears on Borel, stage left.

Borel: “The particular form that problems of causes often take…is the following: Is such and such a result due to chance or does it have a cause? It has often been observed how much this statement lacks in precision. Bertrand has strongly emphasized this point. But …to refuse to answer under the pretext that the answer cannot be absolutely precise, is to… misunderstand the essential nature of the application of mathematics.” (ibid. p. 964) Bertrand considers the Pleiades. ‘If one has observed a [precise angle between the stars]…in tenths of seconds…one would not think of asking to know the probability [of observing exactly this observed angle under chance] because one would never have asked that precise question before having measured the angle’… (ibid.)

The question is whether one has the same reservations in the case in which one states that one of the angles of the triangle formed by three stars has “une valeur remarquable” [a striking or noteworthy value], and is for example equal to the angle of the equilateral triangle…. (ibid.)

Here is what one can say on this subject: One should carefully guard against the tendency to consider as striking an event that one has not specified beforehand, because the number of such events that may appear striking, from different points of view, is very substantial” (ibid. p. 964).

The stage fades to black, then a spotlight beams on Neyman and Pearson mid-stage. (Neyman does the talking)

Neyman: “We appear to find disagreement here, but are inclined to think that…the two writers [Bertrand and Borel] are not really considering precisely the same problem. In general terms the problem is this: Is it possible that there are any efficient tests of hypotheses based upon the theory of probability, and if so, what is their nature. …What is the precise meaning of the words ‘an efficient test of a hypothesis’?” (1933, p. 140/290)

“[W]e may consider some specified hypothesis, as that concerning the group of stars, and look for a method which we should hope to tell us, with regard to a particular group of stars, whether they form a system, or are grouped ‘by chance,’…their relative movements unrelated.” (ibid.)

“If this were what is required of ‘an efficient test’, we should agree with Bertrand in his pessimistic view. For however small be the probability that a particular grouping of a number of stars is due to ‘chance’, does this in itself provide any evidence of another ‘cause’ for this grouping but ‘chance’? …Indeed, if x is a continuous variable—as for example is the angular distance between two stars—then any value of x is a singularity of relative probability equal to zero. We are inclined to think that as far as a particular hypothesis is concerned, no test based upon the theory of probability can by itself provide any valuable evidence of the truth or falsehood of that hypothesis. But we may look at the purpose of tests from another view-point.” (Emphasis added; ibid. pp. 141-2; 290-1)

Fade to black, spot on narrator mid-stage:

Narrator: We all know our famous (miserable) lines are about to come. But let’s linger on the “as far as a particular hypothesis is concerned” portion. For any particular case, one may identify a data dependent feature x that would be highly improbable “under the particular hypothesis of chance”. We must “carefully guard,” Borel warns, “against the tendency to consider as striking an event that one has not specified beforehand”. But if you are required to set the test’s capabilities ahead of time then you need to specify the type of falsity of Ho, the distance measure or test statistic beforehand. An efficient test should capture Fisher’s concern with tests sensitive to departures of interest. Listen to Neyman over 40 years later, reflecting on the relevance of Borel’s position in 1977.

Fade to black. Spotlight on an older Neyman, stage right.

Neyman: “The question (what is an efficient test of a statistical hypothesis) is about an intelligible methodology for deciding whether the observed difference…contradicts the stochastic model….

This question was the subject of a lively discussion by Borel and others. Borel was optimistic but insisted that: (a) the criterion to test a hypothesis (a ‘statistical hypothesis’) using some observations must be selected not after the examination of the results of observation, but before, and (b) this criterion should be a function of the observations (of some sort remarkable) (Neyman 1977, pp. 102-103).
It is these remarks of Borel that served as an inspiration to Egon S. Pearson and myself in our effort to build a frequentist theory of testing hypotheses.”(ibid. p. 103)

Fade to black. Spotlight on an older Egon Pearson writing a letter to Neyman about the preprint Neyman sent of his 1977 paper. (The letter is unpublished, but I cite Lehmann 1993.)

Pearson: “I remember that you produced this quotation [from Borel] when we began to get our [1933] paper into shape… The above stages [wherein he had been asking ‘Why use that particular test statistic?’] led up to Borel’s requirement of finding…a criterion which was ‘a function of the observations ‘en quelque sorte remarquable’. Now my point is that you and I (perhaps my first leading) had ourselves reached the Borel requirement independently of Borel, because we were serious humane thinkers; Borel’s expression neatly capped our own.”

Fade to black. End Play

Egon has the habit of leaving the most tantalizing claims unpacked, and this is no exception: What exactly is the Borel requirement already reached due to their being “serious humane thinkers”? I can well imagine growing this one-act play into something like Michael Frayn’s expressionist play Copenhagen, wherein a variety of alternative interpretations are entertained based on subsequent work and remarks. I don’t say that it would enjoy a long life on Broadway, but a small handful of us would relish it.

As with my previous attempts at “statistical theatre of the absurd” (e.g., “Stat on a hot-tin roof”), there would be no attempt at all to popularize—only published quotes and closely remembered conversations would be included.

Deconstructions on the Meaning of the Play by Theater Critics

It’s not hard to see that “as far as a particular” star grouping is concerned, we cannot expect a reliable inference to just any non-chance effect discovered in the data. The more specific the feature is to these particular observations, the more improbable it is. What’s the probability of 3 hurricanes followed by 2 plane crashes (as occurred last month, say)? Harold Jeffreys put it this way: any sample is improbable in some respect; to cope with this fact, statistical method does one of two things: it appeals to prior probabilities of a hypothesis or to error probabilities of a procedure. The former can check our tendency to find a more likely explanation H’ than chance by giving an appropriately low prior weight to H’. What does the latter approach do? It says we need to consider the problem as one of a general type: a general rule, from a test statistic to some assertion about alternative hypotheses expressing the non-chance effect. Such assertions may be in error, but we can control such erroneous interpretations. We deliberately move away from the particularity of the case at hand to the general type of mistake that could be made.

Isn’t this taken care of by Fisher’s requirement that Pr(P ≤ p0; H0) = p0—that the test rarely rejects the null if it is true? It may be, in practice, Neyman and Pearson thought, but only under certain conditions that were not explicitly codified in Fisher’s simple significance tests. With just the null hypothesis, it is unwarranted to take low P-values as evidence for a specific “cause” or non-chance explanation. Many could be erected post data, but the ways these could be in error would not have been probed. Fisher (1947, p. 182) is well aware that “the same data may contradict the hypothesis in any of a number of different ways,” and that different corresponding tests would be used.

The notion that different tests of significance are appropriate to test different features of the same null hypothesis presents no difficulty to workers engaged in practical experimentation. [T]he experimenter is aware of what observational discrepancy it is which interests him, and which he thinks may be statistically significant, before he inquires what test of significance, if any, is available appropriate to his needs (ibid., p. 185).

Even if “an experienced experimenter” knows the appropriate test, this doesn’t lessen the importance of NP’s interest in seeking to identify a statistical rationale for the choices made on informal grounds. In today’s world, if not in Fisher’s day, there’s legitimate concern about selecting the alternative that gives the more impressive P-value.
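As for Fisher’s validity requirement itself, a minimal simulation sketch (my own illustration, with hypothetical numbers) shows what it asserts: when the null hypothesis of a one-sided z-test is true and the model is correct, P-values are uniformly distributed, so the event {P ≤ p0} occurs with probability p0.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, reps, p0 = 30, 50_000, 0.05

pvals = np.empty(reps)
for i in range(reps):
    x = rng.normal(loc=0.0, scale=1.0, size=n)   # H0: mu = 0 is true (sigma = 1 known)
    z = np.sqrt(n) * x.mean()                    # test statistic
    pvals[i] = stats.norm.sf(z)                  # one-sided P-value

print((pvals <= p0).mean())   # close to 0.05: Pr(P <= p0; H0) = p0
```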

Here’s Egon Pearson writing with Chandra Sekar: In testing if a sample has been drawn from a single normal population, “it is not possible to devise an efficient test if we only bring into the picture this single normal probability distribution with its two unknown parameters. We must also ask how sensitive the test is in detecting failure of the data to comply with the hypotheses tested, and to deal with this question effectively we must be able to specify the directions in which the hypothesis may fail” (p. 121). “It is sometimes held that the criterion for a test can be selected after the data, but it will be hard to be unprejudiced at this point” (Pearson & Chandra Sekar, 1936, p. 129).

To base the choice of the test of a statistical hypothesis upon an inspection of the observations is a dangerous practice; a study of the configuration of a sample is almost certain to reveal some feature, or features, which are exceptions if the hypothesis is true….By choosing the feature most unfavourable to Ho out of a very large number of features examined it will usually be possible to find some reason for rejecting the hypothesis. It must be remembered, however, that the point now at issue will not be whether it is exceptional to find a given criterion with so unfavourable a value. We shall need to find an answer to the more difficult question. Is it exceptional that the most unfavourable criterion of the n, say, examined should have as unfavourable a value as this? (ibid., p. 127).

Notice, the goal is not behavioristic; it’s a matter of avoiding the glaring fallacies in the test at hand, fallacies we know all too well.

“The statistician who does not know in advance with which type of alternative to H0 he may be faced, is in the position of a carpenter who is summoned to a house to undertake a job of an unknown kind and is only able to take one tool with him! Which shall it be? Even if there is an omnibus tool, it is likely to be far less sensitive at any particular job than a specialized one; but the specialized tool will be quite useless under the wrong conditions” (ibid., p. 126).

In a famous result, Neyman (1952) demonstrates that, by dint of a post-data choice of hypothesis, a result that leads to rejection in one test yields the opposite conclusion in another, both adhering to a fixed significance level. [Fisher concedes this as well.] If you are keen to ensure the test is capable of teaching about discrepancies of interest, you should prespecify an alternative hypothesis, where the null and alternative hypotheses exhaust the space, relative to a given question. We can infer discrepancies from the null, as well as corroborate their absence, by considering those discrepancies the test had high power to detect.
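To illustrate the kind of reversal at issue, here is a minimal sketch (my own construction, not Neyman’s 1952 example): two perfectly valid 0.05-level tests of the same simple null hypothesis, applied to the same hypothetical data, reach opposite verdicts because they are sensitive to different departures. Choosing between them after seeing the data is what opens the door to the fallacy.

```python
import numpy as np
from scipy import stats

# Two valid 0.05-level tests of the simple null H0: X_1, ..., X_n ~ N(0, 1) iid.
x = np.array([3.0, -3.0] * 10)   # hypothetical data: mean exactly 0, but large spread
n = len(x)

# Test A: sensitive to a shift in the mean
z = np.sqrt(n) * x.mean()
reject_A = abs(z) > stats.norm.ppf(0.975)

# Test B: sensitive to an inflated variance
reject_B = np.sum(x**2) > stats.chi2.ppf(0.95, df=n)

print(reject_A, reject_B)   # False, True: same data, same level, opposite conclusions
```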

Playbill Souvenir

Let’s flesh out Neyman’s conclusion to the Borel-Bertrand debate: if we accept the words, “an efficient test of the hypothesis H” to mean a statistical (methodological) falsification rule that controls the probabilities of erroneous interpretations of data, and ensures the rejection was because of the underlying cause (as modeled), then we agree with Borel that efficient tests are possible. This requires (a) a prespecified test criterion to avoid verification biases while ensuring power (efficiency), and (b) consideration of alternative hypotheses to avoid fallacies of acceptance and rejection. We must steer clear of isolated or particular curiosities to find indications that we are tracking genuine effects.

“Fisher’s the one to be credited,” Pearson remarks, “for his emphasis on planning an experiment, which led naturally to the examination of the power function, both in choosing the size of sample so as to enable worthwhile results to be achieved, and in determining the most appropriate tests” (Pearson 1962, p. 277). If you’re planning, you’re prespecifying, perhaps, nowadays, by means of explicit preregistration.

Nevertheless, prespecifying the question (or test statistic) is distinct from predesignating a cut-off P-value for significance. Discussions of tests often suppose one is somehow cheating if the attained P-value is reported, as if it thereby loses its error probability status. It doesn’t.[2] Such discussions, I argue, confuse prespecifying the question or hypothesis with fixing the P-value in advance–a confusion that stems from failing to identify the rationale behind the conventions of tests. Nor is predesignating the cut-off essential; rather, it is an excellent way to promote valid error probabilities.

But not just any characteristic of the data affords the relevant error probability assessment. It has got to be pretty remarkable!

Enter those pivotal statistics called upon in Fisher’s Fiducial inference. In fact, the story could well be seen to continue in the following two posts: “You can’t take the Fiducial out of Fisher if you want to understand the N-P performance philosophy”, and “Deconstructing the Fisher-Neyman conflict wearing fiducial glasses”.

[1] Or, it might have been titled, “A Polish Statistician in Paris”, given the remake of “An American in Paris” is still going strong on Broadway, last time I checked.

[2] We know that Lehmann insisted people report the attained p-value so that others could apply their own preferred error probabilities. N-P felt the same way. (I may add some links to relevant posts later on.)

REFERENCES

Bertrand, J. (1888/1907). Calcul des Probabilités. Paris: Gauthier-Villars.

Borel, E. 1914. Le Hasard. Paris: Alcan.

Fisher, R. A. 1947. The Design of Experiments (4th ed.). Edinburgh: Oliver and Boyd.

Lehmann, E.L. 2012. “The Bertrand-Borel Debate and the Origins of the Neyman-Pearson Theory” in J. Rojo (ed.), Selected Works of E. L. Lehmann, 2012, Springer US, Boston, MA, pp. 965-974.

Mayo, D. 2018. Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars. Cambridge: Cambridge University Press.

Neyman, J. 1952. Lectures and Conferences on Mathematical Statistics and Probability. 2nd ed. Washington, DC: Graduate School of U.S. Dept. of Agriculture.

Neyman, J. 1977. “Frequentist Probability and Frequentist Statistics“, Synthese 36(1): 97–131.

Neyman, J. & Pearson, E. 1933. “On the Problem of the Most Efficient Tests of Statistical Hypotheses“, Philosophical Transactions of the Royal Society of London 231. Series A, Containing Papers of a Mathematical or Physical Character: 289–337.

Pearson, E. S. 1962. “Some Thoughts on Statistical Inference”, The Annals of Mathematical Statistics, 33(2): 394-403.

Pearson, E. S. & Sekar, C. C. 1936. “The Efficiency of Statistical Tools and a Criterion for the Rejection of Outlying Observations“, Biometrika 28(3/4): 308-320. Reprinted (1966) in The Selected Papers of E. S. Pearson, (pp. 118-130). Berkeley: University of California Press.

Reid, C. 1982. Neyman–From Life. New York: Springer-Verlag.

 

 

Categories: E.S. Pearson, Neyman, Statistics | Leave a comment

A. Spanos: Jerzy Neyman and his Enduring Legacy

Today is Jerzy Neyman’s birthday. I’ll post various Neyman items this week in recognition of it, starting with a guest post by Aris Spanos. Happy Birthday Neyman!

A. Spanos

A Statistical Model as a Chance Mechanism
Aris Spanos 

Jerzy Neyman (April 16, 1894 – August 5, 1981), was a Polish/American statistician[i] who spent most of his professional career at the University of California, Berkeley. Neyman is best known in statistics for his pioneering contributions in framing the Neyman-Pearson (N-P) optimal theory of hypothesis testing and his theory of Confidence Intervals. (This article was first posted here.)

Neyman: 16 April 1894 – 5 Aug 1981

One of Neyman’s most remarkable, but least recognized, achievements was his adapting of Fisher’s (1922) notion of a statistical model to render it pertinent for  non-random samples. Fisher’s original parametric statistical model Mθ(x) was based on the idea of ‘a hypothetical infinite population’, chosen so as to ensure that the observed data x0:=(x1,x2,…,xn) can be viewed as a ‘truly representative sample’ from that ‘population’:

“The postulate of randomness thus resolves itself into the question, ‘Of what population is this a random sample?’” (ibid., p. 313), underscoring that “the adequacy of our choice may be tested a posteriori.” (p. 314)

In cases where the data x0 come from sample surveys, or can be viewed as a typical realization of a random sample X:=(X1,X2,…,Xn), i.e. Independent and Identically Distributed (IID) random variables, the ‘population’ metaphor can be helpful in adding some intuitive appeal to the inductive dimension of statistical inference, because one can imagine using a subset of a population (the sample) to draw inferences pertaining to the whole population.

This ‘infinite population’ metaphor, however, is of limited value in most applied disciplines relying on observational data. To see how inept this metaphor is consider the question: what is the hypothetical ‘population’ when modeling the gyrations of stock market prices? More generally, what is observed in such cases is a certain on-going process and not a fixed population from which we can select a representative sample. For that very reason, most economists in the 1930s considered Fisher’s statistical modeling irrelevant for economic data!

Due primarily to Neyman’s experience with empirical modeling in a number of applied fields, including genetics, agriculture, epidemiology, biology, astronomy and economics, his notion of a statistical model evolved beyond Fisher’s ‘infinite populations’ in the 1930s into Neyman’s frequentist ‘chance mechanisms’ (see Neyman, 1950, 1952):

Guessing and then verifying the ‘chance mechanism’, the repeated operation of which produces the observed frequencies. This is a problem of ‘frequentist probability theory’. Occasionally, this step is labeled ‘model building’. Naturally, the guessed chance mechanism is hypothetical. (Neyman, 1977, p. 99)

From my perspective, this was a major step forward for several reasons, including the following.

First, the notion of a statistical model as a ‘chance mechanism’ extended the intended scope of statistical modeling to include dynamic phenomena that give rise to data from non-IID samples, i.e. data that exhibit both dependence and heterogeneity, like stock prices.

Second, the notion of a statistical model as a ‘chance mechanism’ is not only of metaphorical value, but it can be operationalized in the context of a statistical model, formalized by:

Mθ(x) = {f(x;θ), θ∈Θ}, x∈R^n, Θ⊂R^m; m << n,

where the distribution of the sample f(x;θ) describes the probabilistic assumptions of the statistical model. This takes the form of a statistical Generating Mechanism (GM), stemming from  f(x;θ), that can be used to generate simulated data on a computer. An example of such a Statistical GM is:

X_t = α_0 + α_1X_{t-1} + σε_t,  t = 1, 2, …, n

This indicates how one can use pseudo-random numbers for the error term ε_t ~ NIID(0,1) to simulate data for the Normal, AutoRegressive [AR(1)] Model. One can generate numerous sample realizations, say N = 100,000, of sample size n in a matter of seconds on a PC.
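A minimal sketch of such a simulation (with hypothetical parameter values chosen only for illustration, and a reduced number of realizations to keep memory modest):

```python
import numpy as np

rng = np.random.default_rng(1)
alpha0, alpha1, sigma = 0.0, 0.6, 1.0   # hypothetical parameter values
n, N = 100, 10_000                      # sample size and number of realizations
                                        # (scale N up toward 100,000 if memory allows)

# N realizations of X_t = alpha0 + alpha1*X_{t-1} + sigma*eps_t, eps_t ~ NIID(0, 1)
eps = rng.standard_normal((N, n))
X = np.zeros((N, n))
for t in range(1, n):
    X[:, t] = alpha0 + alpha1 * X[:, t - 1] + sigma * eps[:, t]
```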

Third, the notion of a statistical model as a ‘chance mechanism’ puts a totally different spin on another metaphor widely used by uninformed critics of frequentist inference. This is the ‘long-run’ metaphor associated with the relevant error probabilities used to calibrate frequentist inferences. The operationalization of the statistical GM reveals that the temporal aspect of this metaphor is totally irrelevant for frequentist inference; remember Keynes’s catch phrase “In the long run we are all dead”? Instead, what matters in practice is its repeatability in principle, not over time! For instance, one can use the above statistical GM to generate the empirical sampling distributions for any test statistic, and thus render operational not only the pre-data error probabilities, such as the type I and type II error probabilities and the power of a test, but also the post-data probabilities associated with the severity evaluation; see Mayo (1996).
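For instance, a minimal, self-contained sketch along these lines (hypothetical numbers, with α_0 = 0 taken as known and α_1 estimated by least squares) approximates the sampling distribution of the estimator of α_1 by repeated simulation from the chance mechanism, and reads off a probability of interest as a relative frequency over those repetitions:

```python
import numpy as np

rng = np.random.default_rng(2)

def simulate_ar1(alpha1, n, sigma=1.0):
    """One realization of X_t = alpha1*X_{t-1} + sigma*eps_t with NIID(0, 1) errors."""
    x = np.zeros(n)
    eps = rng.standard_normal(n)
    for t in range(1, n):
        x[t] = alpha1 * x[t - 1] + sigma * eps[t]
    return x

def ls_estimate(x):
    """Least-squares estimate of alpha1 (intercept taken as known and equal to 0)."""
    return np.sum(x[1:] * x[:-1]) / np.sum(x[:-1] ** 2)

# Empirical sampling distribution of the estimator when alpha1 = 0.6
n, reps = 100, 5_000
est = np.array([ls_estimate(simulate_ar1(0.6, n)) for _ in range(reps)])

# A probability rendered operational as a relative frequency over repetitions:
# how often the estimate would fall at or below 0.5 if alpha1 were really 0.6.
print(est.mean(), (est <= 0.5).mean())
```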

HAPPY BIRTHDAY NEYMAN!

For further discussion on the above issues see:

Spanos, A. (2012), “A Frequentist Interpretation of Probability for Model-Based Inductive Inference,” in Synthese:

http://www.econ.vt.edu/faculty/2008vitas_research/Spanos/1Spanos-2011-Synthese.pdf

Fisher, R. A. (1922), “On the mathematical foundations of theoretical statistics,” Philosophical Transactions of the Royal Society A, 222: 309-368.

Mayo, D. G. (1996), Error and the Growth of Experimental Knowledge, The University of Chicago Press, Chicago.

Neyman, J. (1950), First Course in Probability and Statistics, Henry Holt, NY.

Neyman, J. (1952), Lectures and Conferences on Mathematical Statistics and Probability, 2nd ed. U.S. Department of Agriculture, Washington.

Neyman, J. (1977), “Frequentist Probability and Frequentist Statistics,” Synthese, 36, 97-131.

[i] He was born in an area that was then part of Russia.

Categories: Neyman, Spanos | Leave a comment

Several reviews of Deborah Mayo’s new book, Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars « Statistical Modeling, Causal Inference, and Social Science

Source: Several reviews of Deborah Mayo’s new book, Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars « Statistical Modeling, Causal Inference, and Social Science

Categories: Error Statistics | Leave a comment

Excursion 1 Tour II: Error Probing Tools versus Logics of Evidence-Excerpt

For the first time, I’m excerpting all of Excursion 1 Tour II from SIST (2018, CUP).

1.4 The Law of Likelihood and Error Statistics

If you want to understand what’s true about statistical inference, you should begin with what has long been a holy grail–to use probability to arrive at a type of logic of evidential support–and in the first instance you should look not at full-blown Bayesian probabilism, but at comparative accounts that sidestep prior probabilities in hypotheses. An intuitively plausible logic of comparative support was given by the philosopher Ian Hacking (1965)–the Law of Likelihood. Fortunately, the Museum of Statistics is organized by theme, and the Law of Likelihood and the related Likelihood Principle is a big one.

Law of Likelihood (LL): Data x are better evidence for hypothesis H1 than for H0 if x is more probable under H1 than under H0: Pr(x; H1) > Pr(x; H0), that is, the likelihood ratio LR of H1 over H0 exceeds 1.

H0 and H1 are statistical hypotheses that assign probabilities to the values of the random variable X. A fixed value of X is written x0, but we often want to generalize about this value, in which case, following others, I use x. The likelihood of the hypothesis H, given data x, is the probability of observing x, under the assumption that H is true or adequate in some sense. Typically, the ratio of the likelihood of H1 over H0 also supplies the quantitative measure of comparative support. Note that when X is continuous, the probability is assigned over a small interval around x to avoid probability 0.

Does the Law of Likelihood Obey the Minimal Requirement for Severity?

Likelihoods are vital to all statistical accounts, but they are often misunderstood because the data are fixed and the hypothesis varies. Likelihoods of hypotheses should not be confused with their probabilities. Two ways to see this. First, suppose you discover all of the stocks in Pickrite’s promotional letter went up in value (x)–all winners. A hypothesis H to explain this is that their method always succeeds in picking winners. H entails x, so the likelihood of H given x is 1. Yet we wouldn’t say H is therefore highly probable, especially without reason to put to rest that they culled the winners post hoc. For a second way, at any time, the same phenomenon may be perfectly predicted or explained by two rival theories; so both theories are equally likely on the data, even though they cannot both be true.

Suppose Bristol-Roach, in our Bernoulli tea tasting example, got two correct guesses followed by one failure. The observed data can be represented as x = <1,1,0>. Let the hypotheses be different values for θ, the probability of success on each independent trial. The likelihood of the hypothesis H0: θ = 0.5, given x0, which we may write as Lik(0.5), equals (½)(½)(½) = 1/8. Strictly speaking, we should write Lik(θ; x0), because it’s always computed given data x0; I will do so later on. The likelihood of the hypothesis θ = 0.2 is Lik(0.2) = (0.2)(0.2)(0.8) = 0.032. In general, the likelihood in the case of Bernoulli independent and identically distributed trials takes the form: Lik(θ) = θ^s(1 − θ)^f, 0 < θ < 1, where s is the number of successes and f the number of failures. Infinitely many values for θ between 0 and 1 yield positive likelihoods; clearly, then, likelihoods do not sum to 1, or any number in particular. Likelihoods do not obey the probability calculus.
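A minimal sketch of these computations (my own illustration of the formulas just given):

```python
import numpy as np

def lik(theta, s, f):
    """Bernoulli likelihood: theta^s * (1 - theta)^f for s successes and f failures."""
    return theta**s * (1 - theta)**f

# x = <1, 1, 0>: two successes, one failure
print(lik(0.5, 2, 1))   # 0.125 = 1/8
print(lik(0.2, 2, 1))   # 0.032

# Likelihoods over a grid of theta values do not sum to 1:
thetas = np.linspace(0.01, 0.99, 99)
print(lik(thetas, 2, 1).sum())   # nowhere near 1 -- likelihoods are not probabilities of theta
```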

The Law of Likelihood (LL) will immediately be seen to fail on our minimal severity requirement – at least if it is taken as an account of inference. Why? There is no onus on the Likelihoodist to predesignate the rival hypotheses – you are free to search, hunt, and post-designate a more likely, or even maximally likely, rival to a test hypothesis H0.

Consider the hypothesis that θ = 1 on trials one and two and 0 on trial three. That makes the probability of x maximal. For another example, hypothesize that the observed pattern would always recur in three trials of the experiment (I. J. Good said that in his cryptanalysis work these were called “kinkera”). Hunting for an impressive fit, or trying and trying again, one is sure to find a rival hypothesis H1 much better “supported” than H0, even when H0 is true. As George Barnard puts it, “there always is such a rival hypothesis, viz. that things just had to turn out the way they actually did” (1972, p. 129).

Note that for any outcome of the Bernoulli trials, the likelihood of H0: θ = 0.5 is (0.5)^n, which is quite small. The likelihood ratio (LR) of a best-supported alternative compared to H0 would be quite high. Since one could always erect such an alternative,

(*) Pr(LR in favor of H1 over H0; H0) = maximal.

Thus the LL permits BENT evidence. The severity for H1 is minimal, though the particular H1 is not formulated until the data are in hand. I call such maximally fitting, but minimally severely tested, hypotheses Gellerized, since Uri Geller was apt to erect a way to explain his results in ESP trials. Our Texas sharpshooter is analogous because he can always draw a circle around a cluster of bullet holes, or around each single hole. One needn’t go to such an extreme rival, but it suffices to show that the LL does not control the probability of erroneous interpretations.

What do we do to compute (*)? We look beyond the specific observed data to the behavior of the general rule or method, here the LL. The output is always a comparison of likelihoods. We observe one outcome, but we can consider that for any outcome, unless it makes H0 maximally likely, we can find an H1 that is more likely. This lets us compute the relevant properties of the method: its inability to block erroneous interpretations of data. As always, a severity assessment is one level removed: you give me the rule, and I consider its latitude for erroneous outputs. We’re actually looking at the probability distribution of the rule, over outcomes in the sample space. This distribution is called a sampling distribution. It’s not a very apt term, but nothing has arisen to replace it. For those who embrace the LL, once the data are given, it’s irrelevant what other outcomes could have been observed but were not. Likelihoodists say that such considerations make sense only if the concern is the performance of a rule over repetitions, but not for inference from the data.  Likelihoodists hold to “the irrelevance of the sample space” (once the data are given). This is the key contrast between accounts based on error probabilities (error statistical) and logics of statistical inference.
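A minimal sketch of the computation behind (*) (my own illustration, using the three-trial Bernoulli setup above): for every possible outcome, a “Gellerized” rival that sets the per-trial success probability equal to whatever was observed makes the data maximally probable, so under H0 the LR favors some rival with probability 1.

```python
from itertools import product

n = 3
lik_H0 = 0.5**n                          # likelihood of H0: theta = 0.5 (same for every outcome)

for x in product([0, 1], repeat=n):      # every possible outcome of n Bernoulli trials
    # "Gellerized" rival: on each trial, set the success probability equal to what was observed.
    thetas = x                           # theta_t = 1 if trial t succeeded, 0 if it failed
    lik_rival = 1.0
    for theta_t, x_t in zip(thetas, x):
        lik_rival *= theta_t if x_t == 1 else 1 - theta_t
    assert lik_rival == 1.0              # the rival makes the observed outcome maximally probable
    assert lik_rival / lik_H0 == 2**n    # so the LR favors the rival for every possible outcome

# Hence Pr(LR in favor of some rival over H0; H0) = 1: the rule cannot control this error probability.
```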

To continue reading Excursion 1 Tour II, go here.

__________

This excerpt comes from Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (Mayo, CUP 2018).

Earlier excerpts and mementos from SIST up to Dec 31, 2018 are here.

Jan 10, 2019 Excerpt from SIST is here, Jan 27 is here, and Feb 23 here.

Jan 13, 2019 Mementos from SIST (Excursion 4) are here. These are summaries of all 4 tours.

March 5, 2019 Blurbs of all 16 Tours can be found here.

 

Where YOU are in the journey

Categories: Error Statistics, law of likelihood, SIST | 2 Comments

there’s a man at the wheel in your brain & he’s telling you what you’re allowed to say (not probability, not likelihood)

It seems like every week something of excitement in statistics comes down the pike. Last week I was contacted by Richard Harris (and 2 others) about the recommendation to stop saying the data reach “significance level p” but rather simply say

“the p-value is p”.

(For links, see my previous post.) Friday, he wrote to ask if I would comment on a proposed restriction (?) on saying a test had high power! I agreed that we shouldn’t say a test has high power, only that it has high power to detect a specific alternative, but I wasn’t aware of any rulings from those in power on power. He explained it was an upshot of a reexamination, by a joint group of the boards of statistical associations in the U.S. and UK, of the full panoply of statistical terms. Something like that. I agreed to speak with him yesterday. He emailed me the proposed ruling on power:

Do not say a test has high power. Don’t believe that if a test has high power to produce a low p-value when an alternative H’ is true, finding a low p-value is good evidence for H’. This is wrong. Any effect, no matter how tiny, can produce a small p-value if the power of the test is high enough.

Recommendation: Report the complement of the power in relation to H’: the probability of a type II error β(H’). For instance, instead of saying the power of the test against H’ is .8, say “β(H’) = 0.2.”
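For a concrete sense of that recommendation, here is a minimal sketch (hypothetical numbers of my own, chosen so the power comes out near 0.8) computing the power against H’ and its complement β(H’) for a one-sided z-test; note that, as remarked just below, a cut-off for declaring evidence against the test hypothesis is needed before any power can be computed.

```python
import numpy as np
from scipy import stats

# One-sided z-test of H0: mu = 0 vs the alternative H': mu = 0.5,
# with sigma = 1 known, n = 25, and significance level alpha = 0.05.
n, mu_alt, sigma, alpha = 25, 0.5, 1.0, 0.05

z_crit = stats.norm.ppf(1 - alpha)            # the cut-off needed before power can be computed
power = stats.norm.sf(z_crit - np.sqrt(n) * mu_alt / sigma)
beta = 1 - power                              # type II error probability against H'

print(round(power, 2), round(beta, 2))        # 0.8 and 0.2
```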

“So what do you think?” he began the conversation. Giggling just a little, I told him I basically felt the same way about this as the ban on significance/significant. I didn’t see why people couldn’t just stop abusing power, and especially stop using it in the backwards fashion that is now common (and is actually encouraged by using power as a kind of likelihood). I spend the entire Excursion 5 on power in my Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars 2018, CUP. Readers of this blog can merely search for power to find quite a lot! That said, I told him the recommendation seemed OK, but noted that you need a minimal threshold (for declaring evidence against a test hypothesis) in order to compute power.

After talking about power, we moved on to some other statistical notions under review. He’d told me last time he lacked a statistical background, and asked me to point out any flagrant mistakes (in his sum-ups, not the Board’s items, which were still under embargo, whatever that means.) I was glad to see that, apparently, the joint committees were subjecting some other notions to scrutiny (for once). According to his draft, the Board doesn’t want people saying a hypothesis or model receives a high probability, say .95, because it is invariably equivocal. 

Do not say a hypothesis or model has a high posterior probability (e.g., 0.95) given the data. The statistical and ordinary language meanings of “probability” are now so hopelessly confused; the term should be avoided for that reason alone. 

Don’t base your scientific conclusion or practical decision solely on whether a claim gets a high posterior probability. That a hypothesis is given a .95 posterior does not by itself mean it has a “truth probability” of 0.95, nor that H is practically certain (while it’s very improbable that H is false), nor that the posterior was arrived at by a method that is correct 95% of the time, nor that it is rational to bet that the probability of H is 0.95, nor that H will replicate 95% of the time, nor that the falsity of H would have produced a lower posterior on H with probability .95. The posterior can reflect empirical priors, default or data dominant priors, priors from an elicitation of beliefs, conjugate priors, regularisation, prevalence of true effects in a field, or many, many others.

A Bayesian posterior report doesn’t tell you how uncertain that report is.

A posterior of .95 depends on just one way of exhausting the space of possible hypotheses or models (invariably excluding those not thought of). This can considerably distort the scientific process which is always open ended. 

Recommendation: If you’re doing a Bayesian posterior assessment, just report “a posterior on H is .95” (or other value), or a posterior distribution over parameters. Don’t say probable.

At this point I began to wonder if he was for real. Was he the Richard Harris who wrote that article last week? I was approached by 3 different journals, and never questioned them. Was this some kind of a backlash to the p-value pronouncements from Stat Report Watch? Or maybe he was that spoofer Justin Smith (whom I don’t know in the least) who recently started that blog on the P-value police. My caller assured me he was on the level, and he did have the official NPR logo. So we talked for around 2 hours!

Comparative measures don’t get off scot-free, according to this new report of the joint Boards:

Don’t say one hypothesis H is more likely than another H’ because this is likely to be interpreted as H is more probable than H’. 

Don’t believe that because H is more likely than H’, given data x, that H is probable, well supported or plausible, while H’ is not. This is wrong. It just means H makes the data x more probable than does H’. A high likelihood ratio LR can occur when both H and H’ are highly unlikely, and when some other hypothesis H” is even more likely. Two incompatible hypotheses can both be maximally likely. Being incompatible, they cannot both be highly probable. Don’t believe a high LR in favor of H over H’ means the effect size is large or practically important. 

Recommendation: Report “the value of the LR (of H over H’) = k” rather than “H is k times as likely as H'”. As likelihoods enter the computation as a ratio, the word “likelihood” is not necessary and should be dropped wherever possible. The LR level can be reported. The statistical and ordinary language meanings of “likely” are sufficiently confused to avoid the term.

Odds ratios and Bayes Factors (BFs), surprisingly, are treated almost the same way as the LR. (Don’t say H is more probable than H’. Just report BFs and prior odds. A BF doesn’t tell you if an effect size is scientifically or practically important. There’s no BF value between H and H’ that tells you there’s good evidence for H.)

But maybe I shouldn’t be so surprised. As in the initial ASA statement, the newest Report avers

Nothing in this statement is new. Statisticians and others have been sounding the alarm about these matters for decades,

In support of their standpoint on posteriors as well as on Bayes Factors they cite Andrew Gelman:

“I do not trust Bayesian induction over the space of models because the posterior probability of a continuous-parameter model depends crucially on untestable aspects of its prior distribution” (Gelman 2011, p. 70). He’s also cited as regards their new rule on Bayes Factors. “To me, Bayes factors correspond to a discrete view of the world, in which we must choose between models A, B, or C” (Gelman 2011, p. 74) or a weighted average of them as in Madigan and Raftery (1994).

Also familiar is the mistake of taking a high BF in favor of a null or test hypothesis H over an alternative H’ as if it supplies evidence in favor of H. It’s always just a comparison; there’s never a falsification, unlike statistical tests (unless supplemented with a falsification rule [i]).

The Board warns: “Don’t believe a Bayes Factor in favor of H over H’, using a ‘default’ Bayesian prior, means the results are neutral, uninformative, or bias free.” Here the report quotes Uri Simonsohn:

“Saying a Bayesian test ‘supports the null’ in absolute terms seems as fallacious to me as interpreting the p-value as the probability that the null is false.

What they actually ought to write is ‘the data support the null more than they support one mathematically elegant alternative hypothesis I compared it to’.”

The default Bayes factor test “means the Bayesian test ends up asking: ‘is the effect zero, or is it biggish?’ When the effect is neither, when it’s small, the Bayesian test ends up concluding (erroneously) it’s zero” with high probability.

Scanning the rest of Harris’s article, which was merely in rough draft yesterday, I could see that next in line to face the axe are: confidence, credible, coherent, and probably other honorifics. Maybe now people can spend more time thinking, or so they tell you [ii]!

Check date! 

[i] For example, the rule might be: falsify H in favor of H’ whenever H’ is k times as likely or probable as H, or whenever the posterior of H’ exceeds .95, or whenever the p-value against H is less than .05.

 

 

Categories: Bayesian/frequentist | 5 Comments
