Deconstructing the Fisher-Neyman conflict wearing fiducial glasses + Excerpt 5.8 from SIST

Fisher/ Neyman

This continues my previous post: “Can’t take the fiducial out of Fisher…” in recognition of Fisher’s birthday, February 17. These 2 posts reflect my working out of these ideas in writing Section 5.8 of Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (SIST, CUP 2018). Here’s all of Section 5.8 (“Neyman’s Performance and Fisher’s Fiducial Probability”) for your Saturday night reading.* 

Move up 20 years to the famous 1955/56 exchange between Fisher and Neyman. Fisher clearly connects Neyman’s adoption of a behavioristic-performance formulation to his denying the soundness of fiducial inference. When “Neyman denies the existence of inductive reasoning, he is merely expressing a verbal preference. For him ‘reasoning’ means what ‘deductive reasoning’ means to others.” (Fisher 1955, p. 74).

Fisher was right that Neyman’s calling the outputs of statistical inferences “actions” merely expressed Neyman’s preferred way of talking. Nothing earth-shaking turns on the choice to dub every inference “an act of making an inference”.[i] The “rationality” or “merit” goes into the rule. Neyman, much like Popper, had a good reason for drawing a bright red line between his use of probability (for corroboration or probativeness) and its use by ‘probabilists’ (who assign probability to hypotheses). Fisher’s fiducial probability was in danger of blurring this very distinction. Popper said, and Neyman would have agreed, that he had no problem with our using the word induction so long as it was kept clear that it meant testing hypotheses severely.

In Fisher’s next few sentences, things get very interesting. In reinforcing his choice of language, Fisher continues, Neyman “seems to claim that the statement (a) “μ has a probability of 5 per cent. of exceeding M” is a different statement from (b) “M has a probability of 5 per cent. of falling short of μ”. There’s no problem about equating these two so long as M is a random variable. But watch what happens in the next sentence. [I’m using M rather than X ; Fisher’s paper uses lower case x in the following, though clearly he means X in [1].] According to Fisher,

Neyman violates ‘the principles of deductive logic [by accepting a] statement such as

[1]                      Pr{(M – ts) < μ < (M  + ts)} = α,

as rigorously demonstrated, and yet, when numerical values are available for the statistics M and s, so that on substitution of these and use of the 5 per cent. value of t, the statement would read

[2]                   Pr{92.99 < μ < 93.01} = 95 per cent.,

to deny to this numerical statement any validity. This evidently is to deny the syllogistic process of making a substitution in the major premise of terms which the minor premise establishes as equivalent (Fisher 1955, p. 75).

But the move from (1) to (2) is fallacious! Could Fisher really be committing this fallacious probabilistic instantiation? I.J. Good (1971) describes how many felt, and often still feel:

…if we do not examine the fiducial argument carefully, it seems almost inconceivable that Fisher should have made the error which he did in fact make. It is because (i) it seemed so unlikely that a man of his stature should persist in the error, and (ii) because he modestly says (…[1959], p. 54) his 1930 explanation ‘left a good deal to be desired’, that so many people assumed for so long that the argument was correct. They lacked the daring to question it.

In responding to Fisher, Neyman (1956, p.292) declares himself at his wit’s end in trying to find a way to convince Fisher of the inconsistencies in moving from (1) to (2).

When these explanations did not suffice to convince Sir Ronald of his mistake, I was tempted to give up. However, in a private conversation David Blackwell suggested that Fisher’s misapprehension may be cleared up by the examination of several simple examples. They illustrate the general rule that valid probability statements regarding relations involving random variables may cease and usually do cease to be valid if random variables are replaced by their observed particular values. (p. 292)[ii]

“Thus if X is a normal random variable with mean zero and an arbitrary variance greater than zero, we may agree” [that Pr(X < 0) = .5. But observing, say, X = 1.7 yields Pr(1.7 < 0) = .5, which is clearly illicit]. “It is doubtful whether the chaos and confusion now reigning in the field of fiducial argument were ever equaled in any other doctrine. The source of this confusion is the lack of realization that equation (1) does not imply (2)” (Neyman 1956).
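
Neyman’s point can be made concrete with a quick simulation (a sketch of my own, not from the 1955/56 exchange; the numbers are chosen only so that a typical realized interval resembles Fisher’s (92.99, 93.01)). The probability in (1) is a property of the interval-generating procedure over repetitions; it does not survive substituting the observed values of M and s:

```python
import numpy as np
from scipy.stats import t

rng = np.random.default_rng(7)          # arbitrary seed
mu, sigma, n = 93.0, 0.05, 100          # illustrative "true" values, not Fisher's data
tcrit = t.ppf(0.975, df=n - 1)          # the 5 per cent (two-sided) value of t

reps, covered = 10_000, 0
for _ in range(reps):
    sample = rng.normal(mu, sigma, n)
    M = sample.mean()
    se = sample.std(ddof=1) / np.sqrt(n)         # estimated standard error of M
    covered += (M - tcrit * se < mu < M + tcrit * se)

# Statement (1): pre-data, the random interval (M - ts, M + ts) covers mu ~95% of the time.
print(covered / reps)                   # approximately 0.95

# Statement (2): a realized interval such as (92.99, 93.01) is a pair of fixed numbers;
# mu either lies inside it or it does not, so no 95% probability attaches to that claim.
```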

For decades scholars have tried to figure out what Fisher might have meant, and while the matter remains unsettled, this much is agreed: the instantiation Fisher is yelling about, 20 years after the creation of N-P tests and the break with Neyman, is fallacious. Fiducial probabilities can only properly attach to the method. Keeping to “performance” language is a sure way to avoid the illicit slide from (1) to (2). Once the intimate tie-ins with Fisher’s fiducial argument are recognized, the rhetoric of the Neyman-Fisher dispute takes on a completely new meaning. When Fisher says “Neyman only cares for acceptance sampling contexts,” as he does from around 1950 on, he’s really saying Neyman thinks fiducial inference is contradictory unless it’s viewed in terms of properties of the method in (actual or hypothetical) repetitions. The fact that Neyman (with the contributions of Wald, and later Robbins) went overboard in his behaviorism [iii], to the extent that even Egon wanted to divorce him—ending his 1955 reply to Fisher with the claim that inductive behavior was “Neyman’s field rather than mine”—is a different matter.

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

[i] Fisher also commonly spoke of the output of tests as actions. Neyman rightly says that he is only following Fisher. As the years went by, Fisher came to renounce things he himself had said earlier in the midst of polemics against Neyman.

[ii] But surely this is the kind of simple example that would have been brought forward right off the bat, before the more elaborate, infamous cases (Fisher-Behrens). Did Fisher ever say “oh now I see my mistake” as a result of these simple examples? Not to my knowledge. So I find this statement of Neyman’s about the private conversation with Blackwell a little curious. Anyone know more about it?

[iii] At least in his theory, but not in his practice. A relevant post is “distinguishing tests of statistical hypotheses and tests of significance might have been a lapse of someone’s pen”.

Fisher, R.A. (1955). “Statistical Methods and Scientific Induction”.

Good, I.J. (1971b). In reply to comments on his “The probabilistic explication of information, evidence, surprise, causality, explanation and utility”. In Godambe and Sprott (1971).

Neyman, J. (1956). “Note on an Article by Sir Ronald Fisher”.

Pearson, E.S. (1955). “Statistical Concepts in Their Relation to Reality”.

 

____________________________

 

*Earlier excerpts and mementos from SIST up to Dec 31, 2018 are here.

Jan 10, 2019 Excerpt from SIST is here.

Jan 13, 2019 Mementos from SIST (Excursion 4) are here. These are summaries of all 4 tours.

Categories: fiducial probability, Fisher, Neyman, Statistics

Can’t Take the Fiducial Out of Fisher (if you want to understand the N-P performance philosophy) [i]

R.A. Fisher: February 17, 1890 – July 29, 1962

Continuing with posts in recognition of R.A. Fisher’s birthday, I post one from a few years ago on a topic that had previously not been discussed on this blog: Fisher’s fiducial probability.

[Neyman and Pearson] “began an influential collaboration initially designed primarily, it would seem, to clarify Fisher’s writing. This led to their theory of testing hypotheses and to Neyman’s development of confidence intervals, aiming to clarify Fisher’s idea of fiducial intervals” (D. R. Cox 2006, p. 195).

The entire episode of fiducial probability is fraught with minefields. Many say it was Fisher’s biggest blunder; others suggest it still hasn’t been understood. The majority of discussions omit the side trip to the Fiducial Forest altogether, finding the surrounding brambles too thorny to penetrate. Besides, a fascinating narrative about the Fisher-Neyman-Pearson divide has managed to bloom and grow while steering clear of fiducial probability–never mind that it remained a centerpiece of Fisher’s statistical philosophy. I now think that this is a mistake. It was thought, following Lehmann (1993) and others, that we could take the fiducial out of Fisher and still understand the core of the Neyman-Pearson vs Fisher (or Neyman vs Fisher) disagreements. We can’t. Quite aside from the intrinsic interest in correcting the “he said/he said” of these statisticians, the issue is intimately bound up with the current (flawed) consensus view of frequentist error statistics.

So what’s fiducial inference? I follow Cox (2006), adapting for the case of the lower limit:

We take the simplest example,…the normal mean when the variance is known, but the considerations are fairly general. The lower limit, [with Z the standard Normal variate, and M the sample mean]:

M0 – zc σ/√n

derived from the probability statement

Pr(μ > M – zc σ/√n ) = 1 – c

is a particular instance of a hypothetical long run of statements a proportion 1 – c of which will be true, assuming the model is sound. We can, at least in principle, make such a statement for each c and thereby generate a collection of statements, sometimes called a confidence distribution. (Cox 2006, p. 66).

For Fisher it was a fiducial distribution. Once M0 is observed, M0 – zc σ/√n is what Fisher calls the fiducial c per cent limit for μ. Making such statements for different c’s yields his fiducial distribution.
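
The collection of statements Cox describes can be generated mechanically by varying c for a single observed M0. A minimal sketch of my own (σ is assumed known; all numbers are illustrative):

```python
import numpy as np
from scipy.stats import norm

sigma, n = 1.0, 25          # assumed known sigma and sample size (illustrative)
M0 = 0.42                   # the one observed sample mean

# For each level c, the fiducial c per cent limit for mu is M0 - z_c * sigma / sqrt(n).
for c in (0.50, 0.25, 0.10, 0.05, 0.01):
    z_c = norm.ppf(1 - c)                        # upper-tail z value
    limit = M0 - z_c * sigma / np.sqrt(n)
    print(f"fiducial {c:.0%} limit for mu: {limit:.3f}")

# Reading off these limits for every c traces out the fiducial (confidence) distribution.
# Pre-data, Pr(mu > M - z_c * sigma/sqrt(n)) = 1 - c, so a proportion 1 - c of such
# statements would be true in hypothetical repetitions, as in the Cox passage above.
```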

In Fisher’s earliest paper on fiducial inference in 1930, he sets 1 – c at .95 (95 per cent). Start from the significance test of μ (e.g., μ ≤ μ0 vs. μ > μ0) with significance level .05. He defines the 95 per cent value of the sample mean M, M.95, such that in 95% of samples M < M.95. In the Normal testing case, M.95 = μ0 + 1.65σ/√n. Notice M.95 is the cut-off for rejection in a .05 one-sided test T+ (of μ ≤ μ0 vs. μ > μ0).

We have a relationship between the statistic [M] and the parameter μ such that M.95 is the 95 per cent value corresponding to a given μ. This relationship implies the perfectly objective fact that in 5 per cent of samples M > M.95. (Fisher 1930, p. 533; I use μ for his θ, M in place of T).
That is, Pr(M < μ + 1.65σ/√n) = .95.

The event M > M.95 occurs just in case μ0 < M − 1.65σ/√n.[i]

For a particular observed M0, M0 − 1.65σ/√n is the fiducial 5 per cent value of μ.

We may know as soon as M is calculated what is the fiducial 5 per cent value of μ, and that the true value of μ will be less than this value in just 5 per cent of trials. This then is a definite probability statement about the unknown parameter μ which is true irrespective of any assumption as to its a priori distribution. (Fisher 1930, p. 533; emphasis is mine).

This seductively suggests that μ < μ.05 gets the probability .05! But we know we cannot say that Pr(μ < μ.05) = .05.[ii]

However, Fisher’s claim that we obtain “a definite probability statement about the unknown parameter μ” can be interpreted in another way. There’s a kosher probabilistic statement about the pivot Z; it’s just not a probabilistic assignment to a parameter. Instead, a particular substitution is, to paraphrase Cox, “a particular instance of a hypothetical long run of statements 95% of which will be true.” After all, Fisher was abundantly clear that the fiducial bound should not be regarded as an inverse inference to a posterior probability. We could only obtain an inverse inference, Fisher explains, by considering μ to have been selected from a superpopulation of μ’s with known distribution. But then the inverse inference (posterior probability) would be a deductive inference and not properly inductive. Here, Fisher is quite clear, the move is inductive.

People are mistaken, Fisher says, when they try to find priors so that they would match the fiducial probability:

In reality the statements with which we are concerned differ materially in logical content from inverse probability statements, and it is to distinguish them from these that we speak of the distribution derived as a fiducial frequency distribution, and of the working limits, at any required level of significance, ….as the fiducial limits at this level. (Fisher 1936, p. 253).

So, what is being assigned the fiducial probability? It is, Fisher tells us, the “aggregate of all such statements…” Or, to put it another way, it’s the method of reaching claims to which the probability attaches. Because M and S (using the Student’s t pivot) or M alone (where σ is assumed known) are sufficient statistics, “we may infer, without any use of probabilities a priori, a frequency distribution for μ which shall correspond with the aggregate of all such statements … to the effect that the probability that μ is less than M – 1.65σ/√n is .05.” (Fisher 1936, p. 253)[iii]

Suppose you’re Neyman and Pearson aiming to clarify and justify Fisher’s methods.

“I see what’s going on,” we can imagine Neyman declaring. There’s a method for outputting statements of the general form

μ > M – zc σ/√n

Some would be in error, others not. The method outputs statements with a probability of 1 – c of being correct. The outputs are instances of the general form of statement, and the probability alludes to the relative frequency with which they would be correct, as given by the chosen significance or fiducial level c. Voila! “We may look at the purpose of tests from another viewpoint,” as Neyman and Pearson (1933) put it. Probability qualifies (and controls) the performance of a method.

There is leeway here for different interpretations and justifications of that probability, from actual to hypothetical performance, and from behavioristic to more evidential–I’m keen to develop the latter. But my main point here is that in struggling to extricate Fisher’s fiducial limits without slipping into fallacy, they are led to the N-P performance construal. Is there an efficient way to test hypotheses based on probabilities? ask Neyman and Pearson in the opening of the 1933 paper.

Without hoping to know whether each separate hypothesis is true or false, we may search for rules to govern our behavior with regard to them, in following which we insure that, in the long run of experience, we shall not be too often wrong (Neyman and Pearson 1933, pp. 141-2/290-1).

At the time, Neyman thought his development of confidence intervals (in 1930) was essentially the same as Fisher’s fiducial intervals. Fisher’s talk of assigning fiducial probability to a parameter, Neyman thought at first, was merely the result of accidental slips of language, altogether expected in explaining a new concept. There was evidence that Fisher accepted Neyman’s reading. When Neyman gave a paper in 1934 discussing confidence intervals, seeking to generalize fiducial limits, but making it clear that the term “confidence coefficient” is not synonymous with the term probability, Fisher didn’t object. In fact he bestowed high praise, saying Neyman “had every reason to be proud of the line of argument he had developed for its perfect clarity. The generalization was a wide and very handsome one,” the only problem being that there wasn’t a single unique confidence interval, as Fisher had wanted (for fiducial intervals).[iv] Slight hints of the two in a mutual admiration society are heard, with Fisher demurring that “Dr Neyman did him too much honor” in crediting him for the revolutionary insight of Student’s t pivot. Neyman responds that of course in calling it Student’s t he is crediting Student, but “this does not prevent me from recognizing and appreciating the work of Professor Fisher concerning the same distribution.” (Fisher comments on Neyman 1934, p. 137). For more on Neyman and Pearson being on Fisher’s side in these early years, see Spanos’s post.

So how does this relate to the current consensus view of Neyman-Pearson vs Fisher? Stay tuned.[v] In the meantime, share your views.

The next installment is here.

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

[i] (μ < M – zc σ/√n) iff M > M1–c, where M1–c = μ + zc σ/√n.

[ii] In terms of the pivot Z, the inequality Z >zc is equivalent to the inequality

μ < M –zc σ/√n

“so that this last inequality must be satisfied with the same probability as the first.” But the fiducial value replaces M with M0 and then Fisher’s assertion

Pr(μ > M0 –zc σ/√n ) = 1 – c

no longer holds. (Fallacy of probabilistic instantiation.) In this connection, see my previous post on confidence intervals in polling.

[iii] If we take a number of samples of size n from the same or from different populations, and for each calculate the fiducial 5 percent value for μ, then in 5 per cent of cases the true value of μ will be less than the value we have found. There is no contradiction in the fact that this may differ from a posterior probability. “The fiducial probability is more general and, I think, more useful in practice, for in practice our samples will all give different values, and therefore both different fiducial distributions and different inverse probability distributions. Whereas, however, the fiducial values are expected to be different in every case, and our probability statements are relative to such variability, the inverse probability statement is absolute in form and really means something different for each different sample, unless the observed statistic actually happens to be exactly the same.” (Fisher 1930, p. 535)

[iv] Fisher restricts fiducial distributions to special cases where the statistics exhaust the information. He recognizes “The political principle that ‘Anything can be proved with statistics’ if you don’t make use of all the information. This is essential for fiducial inference” (1936, p. 255). There are other restrictions to the approach as he developed it; many have extended it. There are a number of contemporary movements to revive fiducial and confidence distributions. For references, see the discussants on my likelihood principle paper.

[v] For background, search Fisher on this blog. Some of the material here is from my forthcoming book, Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (CUP).

REFERENCES

Cox, D. R. (2006), Principles of Statistical Inference. Cambridge.

Fisher, R.A. (1930), “Inverse Probability,” Mathematical Proceedings of the Cambridge Philosophical Society, 26(4): 528-535.

Fisher, R.A. (1936), “Uncertain Inference,” Proceedings of the American Academy of Arts and Sciences 71: 248-258.

Lehmann, E. (1993), “The Fisher, Neyman-Pearson Theories of Testing Hypotheses: One Theory or Two?” Journal of the American Statistical Association 88(424): 1242–1249.

Neyman, J. (1934), “On the Two Different Aspects of the Representative Method: The Method of Stratified Sampling and the Method of Purposive Selection,” Early Statistical Papers of J. Neyman: 98-141. [Originally published (1934) in The Journal of the Royal Statistical Society 97(4): 558-625.]

This material is now part of Section 5.8 in Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (Mayo 2018, CUP).

Categories: fiducial probability, Fisher, Phil6334/ Econ 6614, Statistics

Guest Blog: R. A. Fisher: How an Outsider Revolutionized Statistics (Aris Spanos)


In recognition of R.A. Fisher’s birthday on February 17…a week of Fisher posts!

‘R. A. Fisher: How an Outsider Revolutionized Statistics’

by Aris Spanos

Few statisticians will dispute that R. A. Fisher (February 17, 1890 – July 29, 1962) is the father of modern statistics; see Savage (1976), Rao (1992). Inspired by William Gosset’s (1908) paper on the Student’s t finite sampling distribution, he recast statistics into the modern model-based induction in a series of papers in the early 1920s. He put forward a theory of optimal estimation based on the method of maximum likelihood that has changed only marginally over the last century. His significance testing, spearheaded by the p-value, provided the basis for the Neyman-Pearson theory of optimal testing in the early 1930s. According to Hald (1998)

“Fisher was a genius who almost single-handedly created the foundations for modern statistical science, without detailed study of his predecessors. When young he was ignorant not only of the Continental contributions but even of contemporary publications in English.” (p. 738)

What is not so well known is that Fisher was the ultimate outsider when he brought about this change of paradigms in statistical science. As an undergraduate, he studied mathematics at Cambridge, and then did graduate work in statistical mechanics and quantum theory. His meager knowledge of statistics came from his study of astronomy; see Box (1978). That, however, did not stop him from publishing his first paper in statistics in 1912 (still an undergraduate) on “curve fitting”, questioning Karl Pearson’s method of moments and proposing a new method that was eventually to become the likelihood method in his 1921 paper. Continue reading

Categories: Fisher, phil/history of stat, Phil6334/ Econ 6614, Spanos, Statistics

R.A. Fisher: “Statistical Methods and Scientific Induction”

I continue a week of Fisherian posts begun on his birthday (Feb 17). This is his contribution to the “Triad”–an exchange between  Fisher, Neyman and Pearson 20 years after the Fisher-Neyman break-up. The other two are below. They are each very short and are worth your rereading.

17 February 1890 — 29 July 1962

“Statistical Methods and Scientific Induction”

by Sir Ronald Fisher (1955)

SUMMARY

The attempt to reinterpret the common tests of significance used in scientific research as though they constituted some kind of  acceptance procedure and led to “decisions” in Wald’s sense, originated in several misapprehensions and has led, apparently, to several more.

The three phrases examined here, with a view to elucidating the fallacies they embody, are:

  1. “Repeated sampling from the same population”,
  2. Errors of the “second kind”,
  3. “Inductive behavior”.

Mathematicians without personal contact with the Natural Sciences have often been misled by such phrases. The errors to which they lead are not only numerical.

To continue reading Fisher’s paper.

 

Note on an Article by Sir Ronald Fisher

by Jerzy Neyman (1956)

Neyman

Summary

(1) FISHER’S allegation that, contrary to some passages in the introduction and on the cover of the book by Wald, this book does not really deal with experimental design is unfounded. In actual fact, the book is permeated with problems of experimentation.  (2) Without consideration of hypotheses alternative to the one under test and without the study of probabilities of the two kinds, no purely probabilistic theory of tests is possible.  (3) The conceptual fallacy of the notion of fiducial distribution rests upon the lack of recognition that valid probability statements about random variables usually cease to be valid if the random variables are replaced by their particular values.  The notorious multitude of “paradoxes” of fiducial theory is a consequence of this oversight.  (4)  The idea of a “cost function for faulty judgments” appears to be due to Laplace, followed by Gauss.

 

E.S. Pearson

“Statistical Concepts in Their Relation to Reality”.

by E.S. Pearson (1955)

Controversies in the field of mathematical statistics seem largely to have arisen because statisticians have been unable to agree upon how theory is to provide, in terms of probability statements, the numerical measures most helpful to those who have to draw conclusions from observational data. We are concerned here with the ways in which mathematical theory may be put, as it were, into gear with the common processes of rational thought, and there seems no reason to suppose that there is one best way in which this can be done. If, therefore, Sir Ronald Fisher recapitulates and enlarges on his views upon statistical methods and scientific induction we can all only be grateful, but when he takes this opportunity to criticize the work of others through misapprehension of their views as he has done in his recent contribution to this Journal (Fisher 1955, “Statistical Methods and Scientific Induction”), it is impossible to leave him altogether unanswered.

In the first place it seems unfortunate that much of Fisher’s criticism of Neyman and Pearson’s approach to the testing of statistical hypotheses should be built upon a “penetrating observation” ascribed to Professor G.A. Barnard, the assumption involved in which happens to be historically incorrect.  There was no question of a difference in point of view having “originated” when Neyman “reinterpreted” Fisher’s early work on tests of significance “in terms of that technological and commercial apparatus which is known as an acceptance procedure”. There was no sudden descent upon British soil of Russian ideas regarding the function of science in relation to technology and to five-year plans.  It was really much simpler–or worse.  The original heresy, as we shall see, was a Pearson one!…

To continue reading, “Statistical Concepts in Their Relation to Reality” click HERE

Categories: E.S. Pearson, fiducial probability, Fisher, Neyman, phil/history of stat, Phil6334/ Econ 6614

Guest Post: STEPHEN SENN: ‘Fisher’s alternative to the alternative’

“You May Believe You Are a Bayesian But You Are Probably Wrong”

As part of the week of posts on R.A. Fisher (February 17, 1890 – July 29, 1962), I reblog a guest post by Stephen Senn from 2012 and 2017. See especially the comments from Feb 2017.

‘Fisher’s alternative to the alternative’

By: Stephen Senn

[2012 marked] the 50th anniversary of RA Fisher’s death. It is a good excuse, I think, to draw attention to an aspect of his philosophy of significance testing. In his extremely interesting essay on Fisher, Jimmie Savage drew attention to a problem in Fisher’s approach to testing. In describing Fisher’s aversion to power functions Savage writes, ‘Fisher says that some tests are more sensitive than others, and I cannot help suspecting that that comes to very much the same thing as thinking about the power function.’ (Savage 1976, p. 473).

The modern statistician, however, has an advantage here denied to Savage. Savage’s essay was published posthumously in 1976 and the lecture on which it was based was given in Detroit on 29 December 1971 (p. 441). At that time Fisher’s scientific correspondence did not form part of his available oeuvre, but in 1990 Henry Bennett’s magnificent edition of Fisher’s statistical correspondence (Bennett 1990) was published, and this throws light on many aspects of Fisher’s thought, including on significance tests. Continue reading

Categories: Fisher, S. Senn, Statistics

Happy Birthday R.A. Fisher: ‘Two New Properties of Mathematical Likelihood’

17 February 1890–29 July 1962

Today is R.A. Fisher’s birthday. I will post some Fisherian items this week in recognition of it*. This paper comes just before the conflicts with Neyman and Pearson erupted. Fisher links his tests and sufficiency to the Neyman and Pearson lemma in terms of power. We may see them as ending up in a similar place while starting from different origins. I quote just the most relevant portions…the full article is linked below. Happy Birthday Fisher!

Two New Properties of Mathematical Likelihood

by R.A. Fisher, F.R.S.

Proceedings of the Royal Society, Series A, 144: 285-307 (1934)

  The property that where a sufficient statistic exists, the likelihood, apart from a factor independent of the parameter to be estimated, is a function only of the parameter and the sufficient statistic, explains the principal result obtained by Neyman and Pearson in discussing the efficacy of tests of significance.  Neyman and Pearson introduce the notion that any chosen test of a hypothesis H0 is more powerful than any other equivalent test, with regard to an alternative hypothesis H1, when it rejects H0 in a set of samples having an assigned aggregate frequency ε when H0 is true, and the greatest possible aggregate frequency when H1 is true. If any group of samples can be found within the region of rejection whose probability of occurrence on the hypothesis H1 is less than that of any other group of samples outside the region, but is not less on the hypothesis H0, then the test can evidently be made more powerful by substituting the one group for the other.

Consequently, for the most powerful test possible the ratio of the probabilities of occurrence on the hypothesis H0 to that on the hypothesis H1 is less in all samples in the region of rejection than in any sample outside it. For samples involving continuous variation the region of rejection will be bounded by contours for which this ratio is constant. The regions of rejection will then be required in which the likelihood of H0 bears to the likelihood of H1, a ratio less than some fixed value defining the contour. (295)…

It is evident, at once, that such a system is only possible when the class of hypotheses considered involves only a single parameter θ, or, what comes to the same thing, when all the parameters entering into the specification of the population are definite functions of one of their number.  In this case, the regions defined by the uniformly most powerful test of significance are those defined by the estimate of maximum likelihood, T.  For the test to be uniformly most powerful, moreover, these regions must be independent of θ showing that the statistic must be of the special type distinguished as sufficient.  Such sufficient statistics have been shown to contain all the information which the sample provides relevant to the value of the appropriate parameter θ. It is inevitable therefore that if such a statistic exists it should uniquely define the contours best suited to discriminate among hypotheses differing only in respect of this parameter; and it is surprising that Neyman and Pearson should lay it down as a preliminary consideration that ‘the testing of statistical hypotheses cannot be treated as a problem in estimation.’ When tests are considered only in relation to sets of hypotheses specified by one or more variable parameters, the efficacy of the tests can be treated directly as the problem of estimation of these parameters.  Regard for what has been established in that theory, apart from the light it throws on the results already obtained by their own interesting line of approach, should also aid in treating the difficulties inherent in cases in which no sufficient statistics exist. (296)
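
A small numerical sketch of my own (not part of Fisher’s paper) may help fix ideas; it assumes a normal mean with known σ and a simple alternative, where “likelihood ratio of H0 to H1 below a fixed value” reduces to “sample mean above a cut-off”:

```python
import numpy as np
from scipy.stats import norm

n, eps = 25, 0.05                      # sample size and assigned aggregate frequency (size)
mu0, mu1, sigma = 0.0, 1.0, 1.0        # H0 and H1 means, known sigma (illustrative)

# For a normal mean, the ratio of the H0 to the H1 likelihood is a decreasing function of
# the sample mean M, so the best rejection region is simply {M > cutoff}.
cutoff = mu0 + norm.ppf(1 - eps) * sigma / np.sqrt(n)

size  = 1 - norm.cdf((cutoff - mu0) / (sigma / np.sqrt(n)))   # frequency of rejection under H0
power = 1 - norm.cdf((cutoff - mu1) / (sigma / np.sqrt(n)))   # frequency of rejection under H1
print(round(size, 3), round(power, 4))                        # ~0.05 and a power near 1

# By the Neyman-Pearson argument Fisher summarizes, no other region with aggregate
# frequency eps under H0 can have a larger aggregate frequency under H1.
```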

*I’ve posted several of these items, in different forms, during the years of writing Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (2018, CUP): this is the first year I can point to the discussions of Fisher therein. The current post emerges in Excursion 5 Tour III. However, I still think it’s crucial to read and reread the original articles!

Categories: Fisher, phil/history of stat, Phil6334/ Econ 6614, Statistics

American Phil Assoc Blog: The Stat Crisis of Science: Where are the Philosophers?

The Statistical Crisis of Science: Where are the Philosophers?

This was published today on the American Philosophical Association blog. 

“[C]onfusion about the foundations of the subject is responsible, in my opinion, for much of the misuse of the statistics that one meets in fields of application such as medicine, psychology, sociology, economics, and so forth.” (George Barnard 1985, p. 2)

“Relevant clarifications of the nature and roles of statistical evidence in scientific research may well be achieved by bringing to bear in systematic concert the scholarly methods of statisticians, philosophers and historians of science, and substantive scientists…” (Allan Birnbaum 1972, p. 861).

“In the training program for PhD students, the relevant basic principles of philosophy of science, methodology, ethics and statistics that enable the responsible practice of science must be covered.” (p. 57, Committee Investigating fraudulent research practices of social psychologist Diederik Stapel)

I was the lone philosophical observer at a special meeting convened by the American Statistical Association (ASA) in 2015 to construct a non-technical document to guide users of statistical significance tests–one of the most common methods used to distinguish genuine effects from chance variability across a landscape of social, physical and biological sciences.

It was, by the ASA Director’s own description, “historical”, but it was also highly philosophical, and its ramifications are only now being discussed and debated. Today, introspection on statistical methods is rather common due to the “statistical crisis in science”. What is it? In a nutshell: high powered computer methods make it easy to arrive at impressive-looking ‘findings’ that too often disappear when others try to replicate them with hypotheses and data analysis protocols fixed in advance.

How should scientific integrity be restored? Experts do not agree and the disagreement is intertwined with fundamental disagreements regarding the nature, interpretation, and justification of methods and models used to learn from incomplete and uncertain data. Today’s reformers, fraudbusters, and replication researchers increasingly call for more self-critical scrutiny on philosophical foundations. Philosophers should take this seriously. While philosophers of science are interested in helping to clarify, if not also to resolve, matters of evidence and inference, they are rarely consulted in practice for this end. The assumptions behind today’s competing evidence reforms–issues of what I will call evidence-policy–are largely hidden to those outside the loop of the philosophical foundations of statistics and data analysis, or Phil Stat. This is a crucial obstacle to scrutinizing the consequences to science policy, clinical trials, personalized medicine, and across a wide landscape of Big Data modeling.

Statistics has a fascinating and colorful history of philosophical debate, marked by unusual heights of passion, personality, and controversy for at least a century. Wars between frequentists and Bayesians have been so contentious that everyone wants to believe we are long past them: we now have unifications and reconciliations, and practitioners only care about what works. The truth is that both brand new and long-standing battles simmer below the surface in questions about scientific trustworthiness. They show up unannounced in the current problems of scientific integrity, questionable research practices, and in the swirl of methodological reforms and guidelines that spin their way down from journals and reports, the ASA Statement being just one. There isn’t even agreement as to what is to be meant by the method “works”. These are key themes in my Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (2018, CUP).

Many of the key problems in today’s evidence-policy disputes inherit the conceptual confusions of the underlying methods for evidence and inference. They are intertwined with philosophical terms that often remain vague, such as inference, reliability, testing, rationality, explanation, induction, confirmation, and falsification. This hampers communication among various stakeholders, making it difficult to even recognize and articulate where they agree. The philosopher’s penchant for laying bare presuppositions of claims and arguments would let us cut through the unclarity that blocked the experts at the ASA meeting from clearly pinpointing where and why they agree or disagree. (As a mere “observer”, I rarely intervened.) We should put philosophy to work on the popular memes: “All models are false”, “Everything is equally subjective and objective”, “P-values exaggerate evidence”, and “most published research findings are false”.

So am I calling on my fellow philosophers (at least some of them) to learn formal statistics? That would be both too much and too little. Too much because it would be impractical; too little because despite technical sophistication, basic concepts of statistical testing and inference are more unsettled than ever. Debates about P-values–whether to redefine them, lower them, or ban them altogether–are all the subject of heated discussion and journalistic debates. Megateams of seventy or more authors array themselves on either side of the debate (e.g., Benjamin 2017, Lakens 2018), including some philosophers (I was a co-author in Lakens, arguing that redefining significance would not help with the problem of replication). The deepest problems underlying the replication crisis go beyond formal statistics–into measurement, experimental design, communication of uncertainty. Yet these rarely occupy center stage in all the brouhaha. By focusing just on the formal statistical issues, the debates give short shrift to the need to tie formal methods to substantive inferences, to a general account of collecting and learning from data, and to entirely non-statistical types of inference. The goal becomes: who can claim to offer the highest proportion of “true” effects among those outputted by a formal method?

You might say my project is only relevant for philosophers of science, logic, formal epistemology and the like. While they are the obvious suspects, it goes further. Despite the burgeoning of discussions of ethics in research and in data science, the work is generally done by practitioners apart from philosophy, or by philosophers apart from the nitty-gritty details of the data sciences themselves. Without grasping the basic statistics, informed by understanding contrasting views of the nature and goals of using probability in learning, it’s impossible to see where the formal issues leave off and informal, value-laden issues arise or intersect. Philosophers in research ethics can wind up building arguments that forfeit a stronger stance that a critical assessment of the methods would afford (e.g., arguing for a precautionary stance when there is evidence of genuine risk increase in the data, despite non-significant results). Interest in experimental philosophy is another area that underscores the importance of a critical assessment of the statistical methods on which it is based. Formal methods, logic and probability are staples of philosophy; why not methods of inference based on probabilistic methods? That’s what statistics is.

Not only is PhilStat relevant to addressing some long-standing philosophical problems of evidence, inference and knowledge, it offers a superb avenue for philosophers to genuinely impact scientific practice and policy. Even a sufficient understanding of the inference methods together with a platform for raising questions about fallacies and pitfalls could be extremely effective. What is at stake is a critical standpoint that we may be in danger of losing. Without it, we forfeit the ability to communicate with, and hold accountable, the “experts,” the agencies, the quants, and all those data handlers increasingly exerting power over our lives. It goes beyond philosophical outreach–as important as that is–to becoming citizen scholars and citizen scientists.

I have been pondering how to overcome these obstacles, and am keen to engage fellow philosophers in the project. I am going to take one step toward exploring and meeting this goal, together with a colleague, Aris Spanos, in economics. We are running a two-week immersive seminar on PhilStat for philosophy faculty and post-docs who wish to acquire or strengthen their background in PhilStat as it relates to philosophical problems of evidence and inference, to today’s statistical crisis of replication, and to associated evidence-policy debates. The logistics are modeled on the NEH Summer Seminars for college faculty that I directed in 1999 (on Philosophy of Experiment: Induction, Reliability, and Error). The content reflects Mayo (2018), which is written as a series of Excursions and Tours in a “Philosophical Voyage” to illuminate statistical inference. Consider joining me. In the meantime, I would like to hear from philosophers interested or already involved in this arena. Do you have references to existing efforts in this direction? Please share them.

 

Barnard, G. (1985). A Coherent View of Statistical Inference, Statistics Technical Report Series. Department of Statistics & Actuarial Science, University of Waterloo, Canada.

Benjamin, D. et al (2017). Redefine Statistical Significance, Nature Human Behaviour 2, 6-10.

Birnbaum, A. (1972). More on concepts of statistical evidence. J. Amer. Statist. Assoc. 67 858–861. MR0365793

Lakens et al (2018). Justify Your Alpha, Nature Human Behaviour 2, 168-71.

Levelt Committee, Noort Committee, Drenth Committee (2012). Flawed Science: The Fraudulent Research Practices of Social Psychologist Diederik Stapel (www.commissielevelt.nl/).

Mayo, D. (2018). Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (CUP). (The first chapter [Excursion 1 Tour I ] is here.)

Wasserstein & Lazar (2016). The ASA’s Statement on P-values: Context, Process and Purpose, (and supplemental materials), The American Statistician 70(2), 129–33.

 

Credit for the ‘statistical cruise ship’ artwork goes to Mickey Mayo of Mayo Studios, Inc.

 

Deborah Mayo is Professor Emerita in the Department of Philosophy at Virginia Tech. She’s the author of Error and the Growth of Experimental Knowledge (1996, Chicago), which won the 1998 Lakatos Prize awarded to the most outstanding contribution to the philosophy of science during the previous six years. She co-edited, with Aris Spanos, Error and Inference: Recent Exchanges on Experimental Reasoning, Reliability, and the Objectivity and Rationality of Science (2010, CUP), and co-edited, with Rochelle Hollander, Acceptable Evidence: Science and Values in Risk Management (1991, Oxford). Other publications are available here.

Many thanks to Nathan Oserloff for inviting me to submit this blogpost to the APA blog.
Categories: Error Statistics, Philosophy of Statistics, Summer Seminar in PhilStat

Summer Seminar in PhilStat: July 28-Aug 11

Please See New Information for Summer Seminar in PhilStat

Categories: Announcement, Summer Seminar in PhilStat

Little Bit of Logic (5 mini problems for the reader)

Little bit of logic (5 little problems for you)[i]

Deductively valid arguments can readily have false conclusions! Yes, deductively valid arguments allow drawing their conclusions with 100% reliability but only if all their premises are true. For an argument to be deductively valid means simply that if the premises of the argument are all true, then the conclusion is true. For a valid argument to entail  the truth of its conclusion, all of its premises must be true.  In that case the argument is said to be (deductively) sound.

Equivalently, using the definition of deductive validity that I prefer: A deductively valid argument is one where the truth of all its premises together with the falsity of its conclusion leads to a logical contradiction (A & ~A).
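
The definition can be checked mechanically: a form is valid just in case no assignment of truth values makes all its premises true and its conclusion false. Here is a small sketch of my own that runs this check for disjunctive syllogism (it verifies the form; it does not answer the problems below):

```python
from itertools import product

def valid(premises, conclusion):
    """Valid iff no truth assignment makes every premise true and the conclusion false,
    i.e., the premises plus the negation of the conclusion are jointly contradictory."""
    for A, B in product([True, False], repeat=2):
        if all(p(A, B) for p in premises) and not conclusion(A, B):
            return False
    return True

# Disjunctive syllogism: (A or B), not-A, therefore B.
print(valid([lambda A, B: A or B, lambda A, B: not A], lambda A, B: B))   # True: valid

# Validity says nothing about whether the premises are in fact true, so a valid form fed
# a false premise can still deliver a false conclusion -- the argument is then unsound.
```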

Show that an argument with the form of disjunctive syllogism can have a false conclusion. Such an argument takes the form (where A, B are statements): Continue reading

Categories: Error Statistics

Mayo Slides Meeting #1 (Phil 6334/Econ 6614, Mayo & Spanos)

Slides from Meeting #1 (Phil 6334/Econ 6614: Current Debates on Statistical Inference and Modeling, D. Mayo and A. Spanos)

 

Categories: Phil6334/ Econ 6614

Excerpt from Excursion 4 Tour IV: More Auditing: Objectivity and Model Checking

4.8 All Models Are False

. . . it does not seem helpful just to say that all models are wrong. The very word model implies simplification and idealization. . . . The construction of idealized representations that capture important stable aspects of such systems is, however, a vital part of general scientific analysis. (Cox 1995, p. 456)

A popular slogan in statistics and elsewhere is “all models are false!” Is this true? What can it mean to attribute a truth value to a model? Clearly what is meant involves some assertion or hypothesis about the model – that it correctly or incorrectly represents some phenomenon in some respect or to some degree. Such assertions clearly can be true. As Cox observes, “the very word model implies simplification and idealization.” To declare, “all models are false” by dint of their being idealizations or approximations, is to stick us with one of those “all flesh is grass” trivializations (Section 4.1). So understood, it follows that all statistical models are false, but we have learned nothing about how statistical models may be used to infer true claims about problems of interest. Since the severe tester’s goal in using approximate statistical models is largely to learn where they break down, their strict falsity is a given. Yet it does make her wonder why anyone would want to place a probability assignment on their truth, unless it were 0. Today’s tour continues our journey into solving the problem of induction (Section 2.7). Continue reading

Categories: Statistical Inference as Severe Testing

Protected: Participants in 6334/6614 Meeting place Jan-Feb

Categories: SIST

6334/6614: Captain’s Library: Biblio With Links

Mayo and A. Spanos
PHIL 6334/ ECON 6614: Spring 2019: Current Debates on Statistical Inference and Modeling

Bibliography (this includes a selection of articles with links; numbers 1-15 after the item refer to seminar meeting number.)

See Syllabus (first) for class meetings, and the page PhilStat19 menu up top for other course items.

Achinstein (2010). Mill’s Sins or Mayo’s Errors? (E&I: 170-188). (11)

Bacchus, Kyburg, & Thalos (1990). Against Conditionalization, Synthese (85): 475-506. (15)

Barnett (1999). Comparative Statistical Inference (Chapter 6: Bayesian Inference), John Wiley & Sons. (1), (15)

Begley & Ellis (2012) Raise standards for preclinical cancer research. Nature 483: 531-533. (10)

Continue reading

Categories: SIST

(Full) Excerpt of Excursion 4 Tour I: The Myth of “The Myth of Objectivity”

A month ago, I excerpted just the very start of Excursion 4 Tour I* on The Myth of the “Myth of Objectivity”. It’s a short Tour, and this continues the earlier post.

4.1    Dirty Hands: Statistical Inference Is Sullied with Discretionary Choices

If all flesh is grass, kings and cardinals are surely grass, but so is everyone else and we have not learned much about kings as opposed to peasants. (Hacking 1965, p.211)

Trivial platitudes can appear as convincingly strong arguments that everything is subjective. Take this one: No human learning is pure so anyone who demands objective scrutiny is being unrealistic and demanding immaculate inference. This is an instance of Hacking’s “all flesh is grass.” In fact, Hacking is alluding to the subjective Bayesian de Finetti (who “denies the very existence of the physical property [of] chance” (ibid.)). My one-time colleague, I. J. Good, used to poke fun at the frequentist as “denying he uses any judgments!” Let’s admit right up front that every sentence can be prefaced with “agent x judges that,” and not sweep it under the carpet (SUTC) as Good (1976) alleges. Since that can be done for any statement, it cannot be relevant for making the distinctions in which we are interested, and we know can be made, between warranted or well-tested claims and those so poorly probed as to be BENT. You’d be surprised how far into the thicket you can cut your way by brandishing this blade alone. Continue reading

Categories: objectivity, SIST

New Course Starts Tomorrow: Current Debates on Statistical Inference and Modeling: Joint Phil and Econ

I will post items on a new PhilStat Spring 19 page on this blog.

Categories: Announcement

A letter in response to the ASA’s Statement on p-Values by Ionides, Giessing, Ritov and Page

I came across an interesting letter in response to the ASA’s Statement on p-values that I hadn’t seen before. It’s by Ionides, Giessing, Ritov and Page, and it’s very much worth reading. I make some comments below. Continue reading

Categories: ASA Guide to P-values, P-values

Mementos from Excursion 4: Objectivity & Auditing: Blurbs of Tours I – IV

Excursion 4: Objectivity and Auditing (blurbs of Tours I – IV)

 

Excursion 4 Tour I: The Myth of “The Myth of Objectivity”

Blanket slogans such as “all methods are equally objective and subjective” trivialize into oblivion the problem of objectivity. Such cavalier attitudes are at odds with the moves to take back science. The goal of this tour is to identify what there is in objectivity that we won’t give up, and shouldn’t. While knowledge gaps leave room for biases and wishful thinking, we regularly come up against data that thwart our expectations and disagree with predictions we try to foist upon the world. This pushback supplies objective constraints on which our critical capacity is built. Supposing an objective method is to supply formal, mechanical rules to process data is a holdover of a discredited logical positivist philosophy. Discretion in data generation and modeling does not warrant concluding: statistical inference is a matter of subjective belief. It is one thing to talk of our models as objects of belief and quite another to maintain that our task is to model beliefs. For a severe tester, a statistical method’s objectivity requires the ability to audit an inference: check assumptions, pinpoint blame for anomalies, falsify, and directly register how biasing selection effects–hunting, multiple testing and cherry-picking–alter its error probing capacities.

Keywords

objective vs. subjective, objectivity requirements, auditing, dirty hands argument, phenomena vs. epiphenomena, logical positivism, verificationism, loss and cost functions, default Bayesians, equipoise assignments, (Bayesian) wash-out theorems, degenerating program, transparency, epistemology: internal/external distinction

 

Excursion 4 Tour II: Rejection Fallacies: Who’s Exaggerating What?

We begin with the Mountains out of Molehills Fallacy (large n problem): the fallacy of taking a (P-level) rejection of H0 with larger sample size as indicating greater discrepancy from H0 than with a smaller sample size (4.3). The Jeffreys-Lindley paradox shows that with large enough n, a .05 significant result can correspond to assigning H0 a high probability .95. There are family feuds as to whether this is a problem for Bayesians or frequentists! The severe tester takes account of sample size in interpreting the discrepancy indicated. A modification of confidence intervals (CIs) is required.

It is commonly charged that significance levels overstate the evidence against the null hypothesis (4.4, 4.5). What’s meant? One answer considered here is that the P-value can be smaller than a posterior probability on the null hypothesis, based on a lump prior (often .5) on a point null hypothesis. There are battles between and within tribes of Bayesians and frequentists. Some argue for lowering the P-value to bring it into line with a particular posterior. Others argue the supposed exaggeration results from an unwarranted lump prior on a wrongly formulated null. We consider how to evaluate reforms based on Bayes factor standards (4.5). Rather than dismiss criticisms of error statistical methods that assume a standard from a rival account, we give them a generous reading. Only once the minimal principle for severity is violated do we reject them. Souvenir R summarizes the severe tester’s interpretation of a rejection in a statistical significance test. At least 2 benchmarks are needed: reports of discrepancies (from a test hypothesis) that are, and those that are not, well indicated by the observed difference.

Keywords

significance test controversy, mountains out of molehills fallacy, large n problem, confidence intervals, P-values exaggerate evidence, Jeffreys-Lindley paradox, Bayes/Fisher disagreement, uninformative (diffuse) priors, Bayes factors, spiked priors, spike and slab, equivocating terms, severity interpretation of rejection (SIR)

 

Excursion 4 Tour III: Auditing: Biasing Selection Effects & Randomization

Tour III takes up Peirce’s “two rules of inductive inference”: predesignation (4.6) and randomization (4.7). The Tour opens on a court case in progress: the CEO of a drug company is being charged with giving shareholders an overly rosy report based on post-data dredging for nominally significant benefits. Auditing a result includes checking for (i) selection effects, (ii) violations of model assumptions, and (iii) obstacles to moving from statistical to substantive claims. We hear it’s too easy to obtain small P-values, yet replication attempts find it difficult to get small P-values with preregistered results. I call this the paradox of replication. The problem isn’t P-values but failing to adjust them for cherry picking and other biasing selection effects. Adjustments by Bonferroni and false discovery rates are considered. There is a tension between popular calls for preregistering data analysis, and accounts that downplay error probabilities. Worse, in the interest of promoting a methodology that rejects error probabilities, researchers who most deserve lambasting are thrown a handy line of defense. However, data dependent searching need not be pejorative. In some cases, it can improve severity (4.6).

Big Data cannot ignore experimental design principles. Unless we take account of the sampling distribution, it becomes difficult to justify resampling and randomization. We consider RCTs in development economics (RCT4D) and genomics. Failing to randomize microarrays is thought to have resulted in a decade lost in genomics. Granted, the rejection of error probabilities is often tied to presupposing that their relevance is limited to long-run behavioristic goals, which we reject. They are essential for an epistemic goal: controlling and assessing how well or poorly tested claims are (4.7).
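
For the mechanics behind the Bonferroni and false discovery rate adjustments mentioned above, here is a bare-bones sketch of my own (the P-values are made up, not an example from the book):

```python
import numpy as np

pvals = np.array([0.001, 0.012, 0.020, 0.041, 0.30, 0.62])   # made-up P-values
m, alpha = len(pvals), 0.05

# Bonferroni: control the family-wise error rate by testing each P-value against alpha/m.
bonferroni_reject = pvals <= alpha / m

# Benjamini-Hochberg: control the false discovery rate; find the largest k such that the
# k-th smallest P-value is <= (k/m)*alpha, then reject the k smallest P-values.
order = np.argsort(pvals)
sorted_p = pvals[order]
below = sorted_p <= (np.arange(1, m + 1) / m) * alpha
k = np.max(np.nonzero(below)[0]) + 1 if below.any() else 0
bh_reject = np.zeros(m, dtype=bool)
bh_reject[order[:k]] = True

print(bonferroni_reject)   # only the very smallest P-value survives here
print(bh_reject)           # BH is less stringent: the three smallest are rejected
```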

Keywords

error probabilities and severity, predesignation, biasing selection effects, paradox of replication, capitalizing on chance, Bayes factors, batch effects, preregistration, randomization: Bayes-frequentist rationale, Bonferroni adjustment, false discovery rates, RCT4D, genome-wide association studies (GWAS)

 

Excursion 4 Tour IV: More Auditing: Objectivity and Model Checking

While all models are false, it’s also the case that no useful models are true. Were a model so complex as to represent data realistically, it wouldn’t be useful for finding things out. A statistical model is useful by being adequate for a problem, meaning it enables controlling and assessing if purported solutions are well or poorly probed and to what degree. We give a way to define severity in terms of solving a problem (4.8). When it comes to testing model assumptions, many Bayesians agree with George Box (1983) that “it requires frequentist theory of significance tests” (p. 57). Tests of model assumptions, also called misspecification (M-S) tests, are thus a promising area for Bayes-frequentist collaboration (4.9). When the model is in doubt, the likelihood principle is inapplicable or violated. We illustrate non-parametric bootstrap resampling. It works without relying on a theoretical probability distribution, but it still has assumptions (4.10). We turn to the M-S testing approach of econometrician Aris Spanos (4.11). I present the high points for unearthing spurious correlations and checking the assumptions of linear regression, employing 7 figures. M-S tests differ importantly from model selection–the latter uses a criterion for choosing among models, but does not test their statistical assumptions. They test fit rather than whether a model has captured the systematic information in the data.
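
As a reminder of what the resampling step involves, here is a minimal non-parametric bootstrap for a mean (a sketch of my own with simulated data, not the book’s illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.exponential(scale=2.0, size=50)     # simulated skewed data; no normality assumed

B = 5000
boot_means = np.empty(B)
for b in range(B):
    resample = rng.choice(data, size=data.size, replace=True)   # draw values with replacement
    boot_means[b] = resample.mean()

print(boot_means.std(ddof=1))                  # bootstrap standard error of the mean
print(np.percentile(boot_means, [2.5, 97.5]))  # simple percentile interval
# No theoretical distribution was used, but assumptions remain: the observations are
# treated as i.i.d. draws from a single stable distribution.
```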

Keywords

adequacy for a problem, severity (in terms of problem solving), model testing/misspecification (M-S) tests, likelihood principle conflicts, bootstrap, resampling, Bayesian p-value, central limit theorem, nonsense regression, significance tests in model checking, probabilistic reduction, respecification

 

Where you are in the Journey 

Categories: SIST, Statistical Inference as Severe Testing

Excerpt from Excursion 4 Tour II: 4.4 “Do P-Values Exaggerate the Evidence?”

Excerpt from Excursion 4 Tour II*

 

4.4 Do P-Values Exaggerate the Evidence?

“Significance levels overstate the evidence against the null hypothesis,” is a line you may often hear. Your first question is:

What do you mean by overstating the evidence against a hypothesis?

Several (honest) answers are possible. Here is one possibility:

What I mean is that when I put a lump of prior weight π0 of 1/2 on a point null H0 (or a very small interval around it), the P-value is smaller than my Bayesian posterior probability on H0.

More generally, the “P-values exaggerate” criticism typically boils down to showing that if inference is appraised via one of the probabilisms – Bayesian posteriors, Bayes factors, or likelihood ratios – the evidence against the null (or against the null and in favor of some alternative) isn’t as big as 1 − P. Continue reading
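
The arithmetic behind this comparison can be sketched in a few lines (my own sketch, assuming a normal model, a N(0, τ²) prior under the alternative, and a lump prior of 1/2 on the point null; the numbers are illustrative, not the book’s):

```python
import numpy as np
from scipy.stats import norm

sigma, tau, pi0 = 1.0, 1.0, 0.5           # known sd, prior sd under H1, lump prior on H0
for n in (10, 100, 1000, 10000):
    se = sigma / np.sqrt(n)
    xbar = 1.96 * se                       # observed mean sitting right at the .05 cut-off
    p_value = 2 * (1 - norm.cdf(abs(xbar) / se))              # two-sided P, ~.05 throughout
    # Bayes factor in favour of H0: N(xbar; 0, se^2) / N(xbar; 0, tau^2 + se^2)
    bf01 = norm.pdf(xbar, 0, se) / norm.pdf(xbar, 0, np.sqrt(tau**2 + se**2))
    post_H0 = pi0 * bf01 / (pi0 * bf01 + (1 - pi0))
    print(n, round(p_value, 3), round(post_H0, 3))

# The P-value stays at .05 while the posterior on H0 grows with n (the Jeffreys-Lindley
# effect), which is what the "P-values overstate the evidence" charge trades on.
```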

Categories: SIST, Statistical Inference as Severe Testing

January Invites: Ask me questions (about SIST), Write Discussion Analyses (U-Phils)

ASK ME. Some readers say they’re not sure where to ask a question of comprehension on Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (2018, CUP)–SIST–so here’s a special post to park your questions of comprehension (to be placed in the comments) on a little over the first half of the book. That goes up to and includes Excursion 4 Tour I on “The Myth of ‘The Myth of Objectivity'”. However, I will soon post on Tour II: Rejection Fallacies: Who’s Exaggerating What? So feel free to ask questions of comprehension as far as p. 259.

All of the SIST BlogPost (Excerpts and Mementos) so far are here.

WRITE A DISCUSSION NOTE: Beginning January 16, anyone who wishes to write a discussion note (on some aspect or issue up to p. 259) is invited to do so (<750 words, longer if you wish). Send them to my error email. I will post as many as possible on this blog.

We initially called such notes “U-Phils” as in “You do a Philosophical analysis”, which really only means it’s an analytic exercise that strives to first give the most generous interpretation to positions, and then examines them. See the general definition of a U-Phil.

Some Examples:

Mayo, Senn, and Wasserman on Gelman’s RMM** Contribution

U-Phil: A Further Comment on Gelman by Christian Hennig.

For a whole group of reader contributions, including Jim Berger on Jim Berger, see: Earlier U-Phils and Deconstructions

If you’re writing a note on objectivity, you might wish to compare and contrast Excursion 4 Tour I with a paper by Gelman and Hennig (2017): “Beyond subjective and objective in Statistics”.

These invites extend through January.

Categories: SIST, Statistical Inference as Severe Testing

SIST* Blog Posts: Excerpts & Mementos (to Dec 31 2018)

Surveying SIST Blog Posts So Far

Excerpts

  • 05/19: The Meaning of My Title: Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars
  • 09/08: Excursion 1 Tour I: Beyond Probabilism and Performance: Severity Requirement (1.1)
  • 09/11: Excursion 1 Tour I (2nd stop): Probabilism, Performance, and Probativeness (1.2)
  • 09/15: Excursion 1 Tour I (3rd stop): The Current State of Play in Statistical Foundations: A View From a Hot-Air Balloon (1.3)
  • 09/29: Excursion 2: Taboos of Induction and Falsification: Tour I (first stop)
  • 10/10: Excursion 2 Tour II (3rd stop): Falsification, Pseudoscience, Induction (2.3)
  • 11/30: Where are Fisher, Neyman, Pearson in 1919? Opening of Excursion 3
  • 12/01: Neyman-Pearson Tests: An Episode in Anglo-Polish Collaboration: Excerpt from Excursion 3 (3.2)
  • 12/04: First Look at N-P Methods as Severe Tests: Water plant accident [Exhibit (i) from Excursion 3]
  • 12/11: It’s the Methods, Stupid: Excerpt from Excursion 3 Tour II  (Mayo 2018, CUP)
  • 12/20: Capability and Severity: Deeper Concepts: Excerpts From Excursion 3 Tour III
  • 12/26: Excerpt from Excursion 4 Tour I: The Myth of “The Myth of Objectivity” (Mayo 2018, CUP)
  • 12/29: 60 Years of Cox’s (1958) Chestnut: Excerpt from Excursion 3 tour II.

Mementos, Keepsakes and Souvenirs

  • 10/29: Tour Guide Mementos (Excursion 1 Tour II of How to Get Beyond the Statistics Wars)
  • 11/8:   Souvenir C: A Severe Tester’s Translation Guide (Excursion 1 Tour II)
  • 10/5:  “It should never be true, though it is still often said, that the conclusions are no more accurate than the data on which they are based” (Keepsake by Fisher, 2.1)
  • 11/14: Tour Guide Mementos and Quiz 2.1 (Excursion 2 Tour I Induction and Confirmation)
  • 11/17: Mementos for Excursion 2 Tour II Falsification, Pseudoscience, Induction
  • 12/08: Memento & Quiz (on SEV): Excursion 3, Tour I
  • 12/13: Mementos for “It’s the Methods, Stupid!” Excursion 3 Tour II (3.4-3.6)
  • 12/26: Tour Guide Mementos From Excursion 3 Tour III: Capability and Severity: Deeper Concepts

*Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (Mayo, CUP 2018).

Categories: SIST, Statistical Inference as Severe Testing
