Are methods based on error probabilities of use mainly to supply procedures which will not err too frequently in some long run? (*performance*). Or is it the other way round: that the control of long run error properties are of crucial importance for probing the causes of the data at hand? (*probativeness*). I say no to the former and yes to the latter. This, I think, was also the view of Egon Sharpe (E.S.) Pearson (11 Aug, 1895-12 June, 1980). I reblog a relevant post from 2012.

*Cases of Type A and Type B*

“How far then, can one go in giving precision to a philosophy of statistical inference?” (Pearson 1947, 172)

Pearson considers the rationale that might be given to N-P tests in two types of cases, A and B:

“(A) At one extreme we have the case where repeated decisions must be made on results obtained from some routine procedure…

(B) At the other is the situation where statistical tools are applied to an isolated investigation of considerable importance…?” (ibid., 170)

In cases of type A, long-run results are clearly of interest, while in cases of type B, repetition is impossible and may be irrelevant:

“In other and, no doubt, more numerous cases there is no repetition of the same type of trial or experiment, but all the same we can and many of us do use the same test rules to guide our decision, following the analysis of an isolated set of numerical data. Why do we do this? What are the springs of decision? Is it because the formulation of the case in terms of hypothetical repetition helps to that clarity of view needed for sound judgment?

Or is it because we are content that the application of a rule, now in this investigation, now in that, should result in a long-run frequency of errors in judgment which we control at a low figure?” (Ibid., 173)

Although Pearson leaves this tantalizing question unanswered, claiming, “On this I should not care to dogmatize”, his treatment of cases of type B makes it evident that, in his view, “the formulation of the case in terms of hypothetical repetition helps to that clarity of view needed for sound judgment” in learning about the particular case at hand.

“Whereas when tackling problem A it is easy to convince the practical man of the value of a probability construct related to frequency of occurrence, in problem B the argument that ‘if we were to repeatedly do so and so, such and such result would follow in the long run’ is at once met by the commonsense answer that we never should carry out a precisely similar trial again.

Nevertheless, it is clear that the scientist with a knowledge of statistical method behind him can make his contribution to a round-table discussion…” (Ibid., 171).

Pearson gives the following example of a case of type B (from his wartime work), where he claims no repetition is intended:

“Example of type B. Two types of heavy armour-piercing naval shell of the same caliber are under consideration; they may be of different design or made by different firms…. Twelve shells of one kind and eight of the other have been fired; two of the former and five of the latter failed to perforate the plate….” (Pearson 1947, 171)

“Starting from the basis that individual shells will never be identical in armour-piercing qualities, however good the control of production, he has to consider how much of the difference between (i) two failures out of twelve and (ii) five failures out of eight is likely to be due to this inevitable variability…” (Ibid.)

*We’re interested in considering what other outcomes could have occurred, and how readily, in order to learn what variability alone is capable of producing.* As a noteworthy aside, Pearson shows that treating the observed difference (between the two proportions) in one way yields an observed significance level of 0.052; treating it differently (along Barnard’s lines), he gets 0.025 as the (upper) significance level. But in scientific cases, Pearson insists, the difference in error probabilities makes no real difference to substantive judgments in interpreting the results. Only in an unthinking, automatic, routine use of tests would it matter:

“Were the action taken to be decided automatically by the side of the 5% level on which the observation point fell, it is clear that the method of analysis used would here be of vital importance. But no responsible statistician, faced with an investigation of this character, would follow an automatic probability rule.” (ibid., 192)

The two analyses correspond to the tests effectively asking different questions, and if we recognize this, says Pearson, different meanings may be appropriately attached.
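For the curious, the first of Pearson’s two figures appears to be recoverable from a conditional (hypergeometric) analysis of the 2×2 table, i.e., conditioning on the seven total failures among the twenty shells. This is only one of the several treatments Pearson compares; a minimal sketch:

```python
from math import comb

# Pearson's shells data as a 2x2 table:
# type I: 2 failures out of 12; type II: 5 failures out of 8.
# Conditional (Fisher-style) analysis: given 7 failures among 20 shells,
# the number of failures among the 8 type-II shells is hypergeometric
# under the null of no difference between the two types.

def hypergeom_pmf(k, total, fails, draws):
    """P(k failures among `draws` shells, given `fails` failures in `total`)."""
    return comb(fails, k) * comb(total - fails, draws - k) / comb(total, draws)

# Chance of 5 or more failures among the 8 type-II shells, under the null
p_upper = sum(hypergeom_pmf(k, total=20, fails=7, draws=8) for k in range(5, 8))
print(round(p_upper, 3))  # 0.052
```

The unconditional analysis along Barnard’s lines, which yields the 0.025 figure, is more involved and is not reproduced here.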

*Three Steps in the Original Construction of Tests*

After setting up the test (or null) hypothesis, and the alternative hypotheses against which “we wish the test to have maximum discriminating power” (Pearson 1947, 173), Pearson defines three steps in specifying tests:

“Step 1. We must specify the experimental probability set, the set of results which could follow on repeated application of the random process used in the collection of the data…

Step 2. We then divide this set [of possible results] by a system of ordered boundaries…such that as we pass across one boundary and proceed to the next, we come to a class of results which makes us more and more inclined on the Information available, to reject the hypothesis tested in favour of alternatives which differ from it by increasing amounts” (Pearson 1966a, 173).

“Step 3. We then, if possible[i], associate with each contour level the chance that, if [the null] is true, a result will occur in random sampling lying beyond that level” (ibid.).

Pearson warns that:

“Although the mathematical procedure may put Step 3 before 2, we cannot put this into operation before we have decided, under Step 2, on the guiding principle to be used in choosing the contour system. That is why I have numbered the steps in this order.” (Ibid. 173).
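As a toy rendition of the three steps (my own construction for a simple one-sided binomial test, not an example from Pearson):

```python
from math import comb

# Testing H0: p = 0.5 against p > 0.5 with n = 10 Bernoulli trials.
n, p0 = 10, 0.5

# Step 1: the experimental probability set -- every result that could
# follow on repeated application of the random process.
sample_space = list(range(n + 1))  # possible numbers of successes

# Step 2: order the results by how strongly they incline us toward
# alternatives p > 0.5; here, simply by the number of successes, so the
# ordered boundaries (contours) are {0}, {1}, ..., {10}.
ordered = sorted(sample_space)

# Step 3: attach to each contour the chance, under H0, of a result in
# random sampling lying beyond that level.
def null_prob(k):
    return comb(n, k) * p0**k * (1 - p0)**(n - k)

beyond = {k: sum(null_prob(j) for j in sample_space if j >= k) for k in ordered}
print(round(beyond[9], 4))  # chance of 9 or more successes under H0: 0.0107
```

Note that step 3 cannot be carried out until step 2 has fixed the ordering, which is exactly Pearson’s point about the numbering.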

Strict behavioristic formulations jump from step 1 to step 3, after which one may calculate how the test has in effect accomplished step 2. However, the resulting test, while having adequate error probabilities, may have an inadequate distance measure and may even be irrelevant to the hypothesis of interest. This is one reason critics can construct howlers that appear to be licensed by N-P methods, and which make their way from time to time into this blog.

So step 3 remains crucial, even for cases of type [B]. There are two reasons: pre-data planning, which is familiar enough, and, secondly, post-data scrutiny. Post-data, step 3 enables determining the capability of the test to have detected various discrepancies, departures, and errors, on which a critical scrutiny of the inferences is based. More specifically, the error probabilities are used to determine how well or poorly corroborated, or how severely tested, various claims are, post-data.
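A minimal sketch of such a post-data severity calculation, with hypothetical numbers of my own choosing (a one-sided normal test with known σ; the sample size and observed mean are inventions for illustration):

```python
from math import erf, sqrt

# One-sided test of H0: mu <= 0 vs mu > 0, sigma = 1, n = 25.
# Post-data, we ask of a claim "mu > mu1": how probable is a result less
# extreme than the one observed, were mu actually equal to mu1?

def Phi(z):
    """Standard normal CDF."""
    return 0.5 * (1 + erf(z / sqrt(2)))

n, sigma, xbar = 25, 1.0, 0.4  # hypothetical observed mean

def severity(mu1):
    """SEV(mu > mu1) given the observed mean xbar."""
    return Phi((xbar - mu1) / (sigma / sqrt(n)))

print(round(severity(0.0), 3))  # 0.977: "mu > 0" has passed severely
print(round(severity(0.4), 3))  # 0.5: "mu > 0.4" is poorly warranted
```

The same observed result thus warrants some discrepancies from the null well and others hardly at all, which is the post-data use of step 3 being described.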

If we can readily bring about statistically significantly higher rates of success with the first type of armour-piercing naval shell than with the second (in the above example), we have evidence the first is superior. Or, as Pearson modestly puts it: the results “raise considerable doubts as to whether the performance of the [second] type of shell was as good as that of the [first]….” (Ibid., 192)[ii]

Still, while error rates of procedures may be used to determine how severely claims have or have not passed, they do not automatically do so; hence, again, the door opens to potential howlers that neither Egon nor, for that matter, Jerzy would have countenanced.

*Neyman Was the More Behavioristic of the Two*

Pearson was (rightly) considered to have rejected the more behaviorist leanings of Neyman.

Here’s a snippet from an unpublished letter he wrote to Birnbaum (1974) about the idea that the N-P theory admits of two interpretations: behavioral and evidential:

“I think you will pick up here and there in my own papers signs of evidentiality, and you can say now that we or I should have stated clearly the difference between the behavioral and evidential interpretations. Certainly we have suffered since in the way the people have concentrated (to an absurd extent often) on behavioral interpretations”.

In Pearson’s (1955) response to Fisher (blogged here):

“To dispel the picture of the Russian technological bogey, I might recall how certain early ideas came into my head as I sat on a gate overlooking an experimental blackcurrant plot….!” (Pearson 1955, 204)

“To the best of my ability I was searching for a way of expressing in mathematical terms what appeared to me to be the requirements of the scientist in applying statistical tests to his data. After contact was made with Neyman in 1926, the development of a joint mathematical theory proceeded much more surely; it was not till after the main lines of this theory had taken shape with its necessary formalization in terms of critical regions, the class of admissible hypotheses, the two sources of error, the power function, etc., that the fact that there was a remarkable parallelism of ideas in the field of acceptance sampling became apparent. Abraham Wald’s contributions to decision theory of ten to fifteen years later were perhaps strongly influenced by acceptance sampling problems, but that is another story.“ (ibid., 204-5).

“It may be readily agreed that in the first Neyman and Pearson paper of 1928, more space might have been given to discussing how the scientific worker’s attitude of mind could be related to the formal structure of the mathematical probability theory….Nevertheless it should be clear from the first paragraph of this paper that we were not speaking of the final acceptance or rejection of a scientific hypothesis on the basis of statistical analysis…. Indeed, from the start we shared Professor Fisher’s view that in scientific enquiry, a statistical test is ‘a means of learning”… (Ibid., 206)

“Professor Fisher’s final criticism concerns the use of the term ‘inductive behavior’; this is Professor Neyman’s field rather than mine.” (Ibid., 207)

__________________________

Aside: It is interesting, given these non-behavioristic leanings, that Pearson had earlier worked in acceptance sampling and quality control (from which he claimed to have obtained the term “power”). From the Cox-Mayo “conversation” (2011, 110):

COX: It is relevant that Egon Pearson had a very strong interest in industrial design and quality control.

MAYO: Yes, that’s surprising, given his evidential leanings and his apparent distaste for Neyman’s behavioristic stance. I only discovered that around 10 years ago; he wrote a small book.[iii]

COX: He also wrote a very big book, but all copies were burned in one of the first air raids on London.

Some might find it surprising to learn that it is from this early acceptance sampling work that Pearson obtained the notion of “power”, but I don’t have the quote handy where he said this…


**References:**

Cox, D. and Mayo, D. G. (2011), “Statistical Scientist Meets a Philosopher of Science: A Conversation,” *Rationality, Markets and Morals: Studies at the Intersection of Philosophy and Economics*, 2: 103-114.

Pearson, E. S. (1935), *The Application of Statistical Methods to Industrial Standardization and Quality Control*, London: British Standards Institution.

Pearson, E. S. (1947), “The Choice of Statistical Tests Illustrated on the Interpretation of Data Classed in a 2×2 Table,” *Biometrika* 34(1/2): 139-167.

Pearson, E. S. (1955), “Statistical Concepts and Their Relationship to Reality,” *Journal of the Royal Statistical Society, Series B (Methodological)*, 17(2): 204-207.

Pearson, E. S. (1966a), *The Selected Papers of E. S. Pearson*, Berkeley: University of California Press.

Neyman, J. and Pearson, E. S. (1928), “On the Use and Interpretation of Certain Test Criteria for Purposes of Statistical Inference, Part I,” *Biometrika* 20(A): 175-240.

“Twelve shells of one kind and eight of the other have been fired; two of the former and five of the latter failed to perforate the plate.”

“As a noteworthy aside, Pearson shows that treating the observed difference (between the two proportions) in one way yields an observed significance level of 0.052; treating it differently (along Barnard’s lines), he gets 0.025 as the (upper) significance level.”

I’d be interested to see SEV functions for the two tests that Pearson carried out.

Corey: Pearson gives a very long and detailed discussion of many (not just two) different ways of applying statistical tests to this problem. The two of relevance to the Barnard contrast, as I recall, concerned the difference between a design-based random assignment, where the inference is just to these shells, and a model-based difference between means when there are two random selections, from the “treated” and untreated populations. There is a difference in the questions being asked. You should have a look at Pearson’s paper, and see what you think.

The likelihood functions for these selection methods are proportional, right? Since I’m not an error statistician, I’m not concerned with data that could have been observed but were not.

But if I were an error statistician, I might be interested in determining which discrepancies from the null hypothesis (of equal proportions, I assume) are well-warranted and which are not. How would you go about that?

Corey: As I say, you should look up Pearson’s discussion; these are rather different examples, one concerning just the n observed, and the other two samples, each selected randomly. Several other variations are also considered, corresponding to different questions, and thus different answers. Of course, if you don’t care about error probabilities, your answer is bound to differ from all of them.

I can’t read Pearson’s paper in its entirety — I don’t have academic access. Parts of it can be read in a volume of Pearson’s collected works on Google Books.

I can’t help but notice that you have not addressed the issue of the SEV function in the context of these tests. Do you have any comment on that subject?

Not at the moment, I’m finishing the last chapter of my book and traveling.

It’s too bad that Pearson was so cowed by his father. He wrote that Fisher opened his eyes to some flaws in his father’s work (notably, having to do with Karl’s Bayesian assumptions) which resulted in a love-hate attitude toward Fisher. It was traumatic for Egon that Fisher dared to find flaws in “my God”, as Pearson put it.

One day, Egon fell in love with a woman engaged to his cousin George, and even though she returned the ring to George the next day, Egon waited around two years to give George a chance to win her back. Even then, Karl’s disapproval, claiming Egon wrecked his cousin’s engagement, was enough to kill the relationship. Very wimpy of Egon.

I’m happy with Pearson’s advice (presumably) to let the P=0.052 calculation yield the same conclusion or action as the P=0.025 calculation, as it is (presumably) in accordance with a likelihood analysis*. However, where does it leave the error rate properties of the test(s)? Would it be correct to say that Pearson’s relaxed attitude leads to a loss of definition of the error probabilities associated with the procedure(s)? Wouldn’t that be a violation of the basic idea of the repeated sampling principle?

Note that I’m not trying to be provocative, I actually want to know what an error statistician would think.

* I note that for the exercise with cannon shells, as presented, it would be silly to do a statistical analysis because without a proper loss function the only reasonable response to the results would be to prefer the manufacturing process that yielded the higher proportion of successes.

A paper by Barnard: A worthy weekend read, and tantalizing grist for the mills of my likelihood friends. https://errorstatistics.files.wordpress.com/2015/08/barnard1967.pdf

But he changed his view at the time I knew/talked with him in the 90s.

“I would suggest, it appears that the central and most valuable concept of the Neyman-Pearson approach to statistical inference, that of the power curve or operating characteristic, can be regarded as essentially equivalent to that of a likelihood function.” (p. 245)

“The principal advantage of power functions over the corresponding likelihoods lies in the fact that, if the maximum of the power function is low, we may be inclined to doubt the adequacy of the mathematical model, whereas if we use likelihood we are taking the model as gospel.”

If the principal advantage of the power function is that one is inclined to doubt the underlying mathematical (statistical) model, then surely all we need to do is to think about the adequacy of the underlying model. The principal advantage of the likelihood function lies in its clear depiction of the evidence in the data relevant to the model parameter of interest. That evidence can be interpreted in light of the satisfactoriness of the underlying model and the biases of the experimental design. Varying degrees of suspicion of the statistical model seems a very tangential reason for preferring the repeated sampling principle over the likelihood principle, or for preferring the frequency interpretation of probability over the fractional belief interpretation.

If anyone is interested in why Barnard, Birnbaum and Hacking changed their minds about the desirability of conformity with the likelihood principle, they should read my paper that shows their mistaken interpretation of the single observation class of alleged counter-examples to the principle. Birnbaum referred to the alleged counter-example as “specific criticisms of likelihood (and Bayesian) concepts that seem to dissuade most theoretical and applied statisticians from adopting them”.

http://arxiv.org/abs/1507.08394

(For anyone who doesn’t want to read my paper, its punch-line is that the idea, often presented on this blog, that the likelihood principle leads necessarily to the domination by a “whatever happened had to happen” hypothesis over the more meaningful and interesting hypotheses is just wrong.)

Not sure what to make of the assertion of “essential equivalence” between power curve and likelihood: the power curve doesn’t depend on the data. What did Barnard actually mean?

Corey: I think he describes it best in the paper, perhaps it holds when there’s just one point in the rejection region.

Beyond that, I’m not sure, which is why I described the paper as grist for the likelihoodist’s mill. But he changed his position.

I scanned the paper but I missed the detail that explains the equivalence Barnard was drawing: if one were performing a test with a singleton rejection region and the observed datum coincided with it, then the power curve would be the likelihood function — or something like that. The data enter by specifying the rejection region of the test for which the equivalence holds. That has a rather SEV feel to it…
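That construal can be checked directly in a toy binomial setting (my illustration, not an example from Barnard’s paper):

```python
from math import comb

# If the rejection region is the single point {x_obs}, pinned to the
# observed datum, the power function theta -> P(X in R; theta) coincides
# with the likelihood function of theta given x_obs.

n, x_obs = 10, 7  # hypothetical data

def pmf(x, theta):
    return comb(n, x) * theta**x * (1 - theta)**(n - x)

rejection_region = {x_obs}  # singleton region fixed by the observed datum

def power(theta):
    """Probability, under theta, of landing in the rejection region."""
    return sum(pmf(x, theta) for x in rejection_region)

def likelihood(theta):
    """Likelihood of theta given the observed x_obs."""
    return pmf(x_obs, theta)

thetas = [i / 20 for i in range(1, 20)]
assert all(abs(power(t) - likelihood(t)) < 1e-15 for t in thetas)
print("power curve equals likelihood function on this grid")
```

For any larger rejection region the power sums probabilities over unobserved outcomes as well, and the equivalence breaks down, which is why the data must enter by fixing the region.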

Corey, there is an intimate relationship between a likelihood function and a family of power functions, but it runs sideways to how people usually think of power curves. In my ongoing (endless) rewrite of my P-values and likelihood paper (old version at http://arxiv.org/abs/1311.0081) I have written an appendix on the derivation of likelihood functions from power functions that will probably explain what Barnard had in mind. I’ve uploaded that section to my Dropbox:

https://www.dropbox.com/s/hhdfnczz28n13s0/Likelihood%20function%20and%20power%20curves.pdf?dl=0

Likelihoods are “fit” measures whereas power is not. Power is a capacity measure (of a test) that runs in the opposite direction to fit.

‘Likelihoods are “fit” measures’ – Mayo

While true, there is more subtlety here than you seem to acknowledge. For example, when the (log-) likelihood is twice differentiable (say) then it is trivial to form likelihood intervals, with the second derivative as a variance measure, and which in many cases correspond asymptotically to confidence intervals. Note that a second derivative is similar to a higher-order counterfactual in the sense of evaluating how the relative fit (first derivative) varies (second derivative) with locally different hypotheses (parameter values). And this is only using up to second derivative information.

The full likelihood function thus in general contains much more information than just the relative fit of two points (and also more than just interval estimates). I have yet to see a philosophical discussion of likelihood which acknowledges this – other than the pro-likelihood view of Edwards (again, these points are not acknowledged in Hacking’s review).
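The curvature point can be made concrete with a small binomial example (the numbers are my own, chosen only for illustration):

```python
from math import sqrt

# The curvature (observed information) of the log-likelihood at its
# maximum yields an approximate standard error, and hence an interval
# that in many cases matches the asymptotic confidence interval.

n, x = 100, 30  # hypothetical binomial data
phat = x / n    # maximum-likelihood estimate

# Observed information: minus the second derivative of the binomial
# log-likelihood at phat, which works out to n / (phat * (1 - phat)).
info = n / (phat * (1 - phat))
se = 1 / sqrt(info)

interval = (phat - 1.96 * se, phat + 1.96 * se)
print(tuple(round(v, 2) for v in interval))  # (0.21, 0.39)
```

So the second derivative alone already carries interval information beyond the relative fit of two points, which is the commenter’s claim in miniature.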

“We then divide this set [of possible results] by a system of ordered boundaries…such that as we pass across one boundary and proceed to the next, we come to a class of results which makes us more and more inclined on the Information available, to reject the hypothesis tested in favour of alternatives which differ from it by increasing amounts” (Pearson 1966a, 173).

This sounds like using the hypothesis observation grid (HOG) as in Wrighton (1953).

John: It’s just the N-P criterion for generating tests. I don’t know what HOG is.

The HOG simply gave the cutoff values at a series of p-values to help the reader and user of a test method quickly assess their result. It also provides a quick view of how effect sizes relate to significance levels.

John: Can you give a link? Anyway, Pearson is talking about formulating tests, not interpreting them. He’s discussing how N-P arrive at tests to begin with.

The citation is: Wrighton RF. 1953. The theoretical basis of the therapeutic trial. *Acta Genetica et Statistica Medica* 4:312-43. I do not have a link to it. It seems to me that Pearson is saying something equivalent.

Pearson was talking about how to choose/construct the test statistic; HOG sounds like the quantile function of the test statistic, which can’t be computed until the test statistic has been chosen.

Can you give an example that might exemplify what you mean?

Sure. Correlation coefficient of a bivariate normal distribution, all parameters unknown:

Step 1: Suppose you’re looking at N/2 bivariate normal samples, so that the sample space is N-dimensional.

Step 2: At each point in sample space, consider infinitesimal shifts in all possible directions and select the set of directions which leaves you neither more nor less inclined to reject the hypothesis of zero correlation. Each set of points that contains only those points equivalent with respect to your inclination to reject the hypothesis of zero correlation is one of Pearson’s “system of boundaries”: a contour of indifference in data space. By travelling in a direction orthogonal to each contour you will travel at the fastest possible rate towards (away from) sets of results that make you more and more (less and less) inclined to accept the hypothesis of zero correlation. For example, you will likely find that your inclination to accept the hypothesis of zero correlation reaches a maximum on the contour of zero sample correlation coefficient. You may even find that the level curves of the sample correlation coefficient function match your contours. When there’s a nice function on the data space whose level curves match your contours, that simplifies your life considerably, but the contours are the primitive notion, not the function.

Step 3: If possible, calculate under the hypothesis of zero correlation the probability that the data will lie outside of each of your contours.

Corey: Thanks, but of course it’s not just your inclination, but is based on likelihoods or other distance measure. Then, of course, step 3.

I chose the word “inclined” because it’s the one Pearson uses:

“…as we pass across one boundary and proceed to the next, we come to a class of results which makes us more and more *inclined*, on the Information available, to reject the hypothesis tested in favour of alternatives which differ from it by increasing amounts” (Pearson 1966a, 173, emphasis added).

Yeah, I know he does.

Click to access neyman-1977_frequentist-probability-and-frequentist-statistics.pdf