phil/history of stat

Neyman, Power, and Severity

Jerzy Neyman: April 16, 1894-August 5, 1981. This reblogs posts under “The Will to Understand Power” & “Neyman’s Nursery” here & here.

Way back when, although I’d never met him, I sent my doctoral dissertation, Philosophy of Statistics, to one person only: Professor Ronald Giere. (And he would read it, too!) I knew from his publications that he was a leading defender of frequentist statistical methods in philosophy of science, and that he’d worked for a time with Birnbaum in NYC.

Some ten or fifteen years ago, Giere decided to quit philosophy of statistics (while remaining in philosophy of science): I think it had to do with a certain form of statistical exile (in philosophy). He asked me if I wanted his papers—a mass of work on statistics and statistical foundations gathered over many years. Could I make a home for them? I said yes. Then came his caveat: there would be a lot of them.

As it happened, we were building a new house at the time, Thebes, and I designed a special room on the top floor that could house a dozen or so file cabinets. (I painted it pale rose, with white lacquered book shelves up to the ceiling.) Then, for more than 9 months (same as my son!), I waited . . . Several boxes finally arrived, containing hundreds of files—each meticulously labeled with titles and dates.  More than that, the labels were hand-typed!  I thought, If Ron knew what a slob I was, he likely would not have entrusted me with these treasures. (Perhaps he knew of no one else who would  actually want them!) Continue reading

Categories: Neyman, phil/history of stat, power, Statistics | Tags: , , , | 5 Comments

Who ya gonna call for statistical Fraudbusting? R.A. Fisher, P-values, and error statistics (again)

If there’s somethin’ strange in your neighborhood. Who ya gonna call? (Fisherian Fraudbusters!)*

*[adapted from R. Parker’s “Ghostbusters”]

When you need to warrant serious accusations of bad statistics, if not fraud, where do scientists turn? Answer: to frequentist error-statistical reasoning and to p-value scrutiny, first articulated by R.A. Fisher[i]. The latest accusations of big time fraud in social psychology concern the case of Jens Förster. As Richard Gill notes:

The methodology here is not new. It goes back to Fisher (founder of modern statistics) in the 30’s. Many statistics textbooks give as an illustration Fisher’s re-analysis (one could even say: meta-analysis) of Mendel’s data on peas. The tests of goodness of fit were, again and again, too good. There are two ingredients here: (1) the use of the left-tail probability as p-value instead of the right-tail probability. (2) combination of results from a number of independent experiments using a trick invented by Fisher for the purpose, and well known to all statisticians. (Richard D. Gill)
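Gill’s two ingredients can be sketched in a few lines of Python. The following is a minimal illustration of the Mendel-style “too good to be true” check (my own toy numbers, not the Förster analysis): left-tail p-values for goodness-of-fit statistics, combined across independent experiments by Fisher’s method.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Each of k independent experiments yields a chi-square goodness-of-fit statistic;
# under the null hypothesis its p-value is Uniform(0, 1).
k, df = 20, 5

# Simulate fits that are systematically "too good" (chi-square values shrunk toward 0),
# the pattern Fisher noticed in Mendel's data.
chi2_obs = 0.4 * rng.chisquare(df, size=k)

# Ingredient (1): left-tail p-values -- probability of agreement at least this good by chance.
p_left = stats.chi2.cdf(chi2_obs, df)

# Ingredient (2): Fisher's combination -- under the null, -2*sum(log p) ~ chi-square(2k).
fisher_stat = -2.0 * np.log(p_left).sum()
p_combined = stats.chi2.sf(fisher_stat, 2 * k)

print(f"Fisher combination statistic = {fisher_stat:.1f}, combined p = {p_combined:.2e}")
# A tiny combined p-value says the agreement with expectation is collectively too good
# to be attributable to chance variation alone.
```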

Continue reading

Categories: Error Statistics, Fisher, significance tests, Statistical fraudbusting, Statistics | 42 Comments

A. Spanos: Jerzy Neyman and his Enduring Legacy

A Statistical Model as a Chance Mechanism
Aris Spanos 

Jerzy Neyman (April 16, 1894 – August 5, 1981), was a Polish/American statistician[i] who spent most of his professional career at the University of California, Berkeley. Neyman is best known in statistics for his pioneering contributions in framing the Neyman-Pearson (N-P) optimal theory of hypothesis testing and his theory of Confidence Intervals. (This article was first posted here.)

Neyman: 16 April 1894 – 5 Aug 1981

One of Neyman’s most remarkable, but least recognized, achievements was his adaptation of Fisher’s (1922) notion of a statistical model to render it pertinent for non-random samples. Continue reading

Categories: phil/history of stat, Spanos, Statistics | Tags: , | 4 Comments

Fallacies of statistics & statistics journalism, and how to avoid them: Summary & Slides Day #8 (Phil 6334)

We spent the first half of Thursday’s seminar discussing the Fisher, Neyman, and E. Pearson “triad”[i]. So, since it’s Saturday night, join me in rereading for the nth time these three very short articles. The key issues were: error of the second kind, behavioristic vs evidential interpretations, and Fisher’s mysterious fiducial intervals. Although we often hear exaggerated accounts of the differences in the Fisherian vs Neyman-Pearson (N-P) methodology, in fact, N-P were simply providing Fisher’s tests with a logical ground (even though other foundations for tests are still possible), and Fisher welcomed this gladly. Notably, when only a single null hypothesis is specified, N-P showed that it was possible to have tests where the probability of rejecting the null when true exceeded the probability of rejecting it when false. Hacking called such tests “worse than useless”, and N-P developed a theory of testing that avoids such problems. Statistical journalists who report on the alleged “inconsistent hybrid” (a term popularized by Gigerenzer) should recognize the extent to which the apparent disagreements on method reflect professional squabbling between Fisher and Neyman after 1935 [a recent example is a Nature article by R. Nuzzo in ii below]. The two types of tests are best seen as asking different questions in different contexts. They both follow error-statistical reasoning. Continue reading
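A toy numerical example (my own construction, not from the triad readings) of a test that is “worse than useless” in Hacking’s sense, sketched in Python:

```python
from scipy.stats import norm

# H0: mu = 0 vs H1: mu = 1, with a single observation X ~ N(mu, 1).
# A perversely chosen rejection region: reject H0 when X < -1.645.
cutoff = -1.645

size = norm.cdf(cutoff, loc=0)    # P(reject | H0 true)  ~ 0.05
power = norm.cdf(cutoff, loc=1)   # P(reject | H1 true)  ~ 0.004

print(f"P(reject | H0 true) = {size:.3f}")
print(f"P(reject | H1 true) = {power:.4f}")
# size > power: the test is more likely to reject the null when it is true than when it is
# false -- Hacking's "worse than useless". Requiring power >= size (unbiasedness) rules such
# tests out, which is part of what the Neyman-Pearson framing supplies.
```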

Categories: phil/history of stat, Phil6334, science communication, Severity, significance tests, Statistics | Tags: | 35 Comments

R.A. Fisher: ‘Two New Properties of Mathematical Likelihood’

17 February 1890–29 July 1962

Exactly 1 year ago: I find this to be an intriguing discussion—before some of the conflicts with N and P erupted. Fisher links his tests and sufficiency to the Neyman and Pearson lemma in terms of power. It’s as if we may see them as ending up in a similar place while starting from different origins. I quote just the most relevant portions…the full article is linked below.

by R.A. Fisher, F.R.S.

Proceedings of the Royal Society, Series A, 144: 285-307 (1934)

The property that where a sufficient statistic exists, the likelihood, apart from a factor independent of the parameter to be estimated, is a function only of the parameter and the sufficient statistic, explains the principal result obtained by Neyman and Pearson in discussing the efficacy of tests of significance. Neyman and Pearson introduce the notion that any chosen test of a hypothesis H0 is more powerful than any other equivalent test, with regard to an alternative hypothesis H1, when it rejects H0 in a set of samples having an assigned aggregate frequency ε when H0 is true, and the greatest possible aggregate frequency when H1 is true.

If any group of samples can be found within the region of rejection whose probability of occurrence on the hypothesis H1 is less than that of any other group of samples outside the region, but is not less on the hypothesis H0, then the test can evidently be made more powerful by substituting the one group for the other. Continue reading
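To make Fisher’s description concrete, here is a minimal numerical sketch (my construction, using a simple Normal example rather than anything in the 1934 paper): two rejection regions with the same aggregate frequency ε under H0, where the region concentrating on samples relatively more probable under H1 has the greater power.

```python
from scipy.stats import norm

# Single observation X ~ N(mu, 1); H0: mu = 0 vs H1: mu = 2; size eps = 0.05.
eps = 0.05

# Region A (likelihood-ratio region): reject when X > 1.645.
a = norm.ppf(1 - eps)
size_A = norm.sf(a, loc=0)
power_A = norm.sf(a, loc=2)

# Region B: reject when |X| > 1.96 -- same aggregate frequency under H0, but part of its
# mass sits on samples (X very negative) that are LESS probable under H1.
b = norm.ppf(1 - eps / 2)
size_B = norm.sf(b, loc=0) + norm.cdf(-b, loc=0)
power_B = norm.sf(b, loc=2) + norm.cdf(-b, loc=2)

print(f"Region A: size = {size_A:.3f}, power = {power_A:.3f}")
print(f"Region B: size = {size_B:.3f}, power = {power_B:.3f}")
# Swapping B's far-left tail (low probability under H1) for samples just above 1.645 keeps
# the size fixed but raises the power -- the substitution Fisher describes.
```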

Categories: Fisher, phil/history of stat, Statistics | Tags: , , , | 1 Comment

Aris Spanos: The Enduring Legacy of R. A. Fisher

More Fisher insights from A. Spanos, this from 2 years ago:

One of R. A. Fisher’s (17 February 1890 — 29 July 1962) most remarkable, but least recognized, achievements was to initiate the recasting of statistical induction. Fisher (1922) pioneered modern frequentist statistics as a model-based approach to statistical induction anchored on the notion of a statistical model, formalized by:

Mθ(x) = {f(x;θ); θ ∈ Θ}, x ∈ R^n, Θ ⊂ R^m, m < n.   (1)

where the distribution of the sample f(x;θ) ‘encapsulates’ the probabilistic information in the statistical model.

Before Fisher, the notion of a statistical model was vague and often implicit, and its role was primarily confined to the description of the distributional features of the data in hand using the histogram and the first few sample moments; implicitly imposing random (IID) samples. The problem was that statisticians at the time would use descriptive summaries of the data to claim generality beyond the data in hand x0:=(x1,x2,…,xn). As late as the 1920s, the problem of statistical induction was understood by Karl Pearson in terms of invoking (i) the ‘stability’ of empirical results for subsequent samples and (ii) a prior distribution for θ.

Fisher was able to recast statistical inference by turning Karl Pearson’s approach, proceeding from data x0 in search of a frequency curve f(x;ϑ) to describe its histogram, on its head. He proposed to begin with a prespecified Mθ(x) (a ‘hypothetical infinite population’), and view x0 as a ‘typical’ realization thereof; see Spanos (1999).
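A minimal sketch of what this recasting amounts to, using the simple Normal IID model as an illustrative instance of (1) (my choice of example, not Spanos’s): a prespecified parametric family Mθ(x), with the data x0 viewed as one ‘typical’ realization from it.

```python
import numpy as np

rng = np.random.default_rng(1)

# A simple instance of M_theta(x) = {f(x; theta), theta in Theta}:
# X_1, ..., X_n independent N(mu, sigma^2), with theta = (mu, sigma).
def log_density(x, mu, sigma):
    """Joint log-density f(x; theta) of the sample under the Normal IID model."""
    x = np.asarray(x, dtype=float)
    return np.sum(-0.5 * np.log(2 * np.pi * sigma**2) - (x - mu) ** 2 / (2 * sigma**2))

# Fisher's view: prespecify the model, regard the observed x0 as a typical realization,
# rather than starting from the data and fitting a frequency curve to its histogram.
n, mu_true, sigma_true = 50, 10.0, 2.0
x0 = rng.normal(mu_true, sigma_true, size=n)   # the data in hand

# Error probabilities are then deduced from the model, e.g. the sampling distribution of
# the mean, X-bar ~ N(mu, sigma^2 / n), regardless of the particular x0 observed.
print("observed mean:", round(float(x0.mean()), 2), " model SE of mean:", sigma_true / np.sqrt(n))
print("log f(x0; theta):", round(float(log_density(x0, mu_true, sigma_true)), 1))
```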

In my mind, Fisher’s most enduring contribution is his devising a general way to ‘operationalize’ errors by embedding the material experiment into Mθ(x), and taming errors via probabilification, i.e. to define frequentist error probabilities in the context of a statistical model. These error probabilities are (a) deductively derived from the statistical model, and (b) provide a measure of the ‘effectiveness’ of the inference procedure: how often a certain method will give rise to correct inferences concerning the underlying ‘true’ Data Generating Mechanism (DGM). This cast aside the need for a prior. Both of these key elements, the statistical model and the error probabilities, have been refined and extended by Mayo’s error statistical approach (EGEK 1996). Learning from data is achieved when an inference is reached by an inductive procedure which, with high probability, will yield true conclusions from valid inductive premises (a statistical model); Mayo and Spanos (2011). Continue reading

Categories: Fisher, phil/history of stat, Statistics | Tags: , , , , , , | 2 Comments

R. A. Fisher: how an outsider revolutionized statistics

Today is R.A. Fisher’s birthday and I’m reblogging the post by Aris Spanos which, as it happens, received the highest number of views of 2013.

by Aris Spanos

Few statisticians will dispute that R. A. Fisher (February 17, 1890 – July 29, 1962) is the father of modern statistics; see Savage (1976), Rao (1992). Inspired by William Gosset’s (1908) paper on the Student’s t finite sampling distribution, he recast statistics into the modern model-based induction in a series of papers in the early 1920s. He put forward a theory of optimal estimation based on the method of maximum likelihood that has changed only marginally over the last century. His significance testing, spearheaded by the p-value, provided the basis for the Neyman-Pearson theory of optimal testing in the early 1930s. According to Hald (1998)

“Fisher was a genius who almost single-handedly created the foundations for modern statistical science, without detailed study of his predecessors. When young he was ignorant not only of the Continental contributions but even of contemporary publications in English.” (p. 738)

What is not so well known is that Fisher was the ultimate outsider when he brought about this change of paradigms in statistical science. As an undergraduate, he studied mathematics at Cambridge, and then did graduate work in statistical mechanics and quantum theory. His meager knowledge of statistics came from his study of astronomy; see Box (1978). That, however, did not stop him from publishing his first paper in statistics in 1912 (while still an undergraduate) on “curve fitting”, questioning Karl Pearson’s method of moments and proposing a new method that was eventually to become the likelihood method in his 1921 paper. Continue reading

Categories: Fisher, phil/history of stat, Spanos, Statistics | 6 Comments

Fisher and Neyman after anger management?

Monday is Fisher’s birthday, and to set the stage for some items to appear, I’m posing the anger management question from a post of a year ago (please also see the comments from then). Here it is:


Would you agree if your (senior) colleague urged you to use his/her book rather than your own—even if you thought doing so would change the entire history of your field for the better? My guess is that the answer is no (but see “add on”). For that matter, would you ever try to insist that your (junior) colleague use your book in teaching a course rather than his/her own notes or book? Again I guess no. But perhaps you’d be more tactful than were Fisher and Neyman.

It wasn’t just Fisher who seemed to need some anger management training: Erich Lehmann (in conversation and in 2011) points to a number of incidents wherein Neyman was the instigator of gratuitous ill-will. Their substantive statistical and philosophical disagreements, I now think, were minuscule in comparison to the huge animosity that developed over many years. Here’s how Neyman describes a vivid recollection he has of the 1935 book episode to Constance Reid (1998, 126). [i]

A couple of months “after Neyman criticized Fisher’s concept of the complex experiment” Neyman vividly recollects Fisher stopping by his office at University College on his way to a meeting which was to decide on Neyman’s reappointment[ii]: Continue reading

Categories: phil/history of stat, Statistics | 9 Comments

“Probabilism as an Obstacle to Statistical Fraud-Busting” (draft iii)

Update: Feb. 21, 2014 (slides at end). Ever find when you begin to “type” a paper to which you gave an off-the-cuff title months and months ago that you scarcely know just what you meant or feel up to writing a paper with that (provocative) title? But then, pecking away at the outline of a possible paper crafted to fit the title, you discover it’s just the paper you’re up to writing right now? That’s how I feel about “Is the Philosophy of Probabilism an Obstacle to Statistical Fraud Busting?” (the impromptu title I gave for my paper for the Boston Colloquium for the Philosophy of Science):

The conference is called: “Revisiting the Foundations of Statistics in the Era of Big Data: Scaling Up to Meet the Challenge.”  

 Here are some initial chicken-scratchings (draft (i)). Share comments, queries. (I still have 2 weeks to come up with something*.) Continue reading

Categories: P-values, significance tests, Statistical fraudbusting, Statistics | Leave a comment

A. Spanos lecture on “Frequentist Hypothesis Testing”

I attended a lecture by Aris Spanos to his graduate econometrics class here at Va Tech last week[i]. This course, which Spanos teaches every fall, gives a superb illumination of the disparate pieces involved in statistical inference and modeling, and affords clear foundations for how they are linked together. His slides follow the intro section. Some examples with severity assessments are also included.

Frequentist Hypothesis Testing: A Coherent Approach

Aris Spanos

1    Inherent difficulties in learning statistical testing

Statistical testing is arguably the most important, but also the most difficult and confusing chapter of statistical inference for several reasons, including the following.

(i) The need to introduce numerous new notions, concepts and procedures before one can paint — even in broad brushes — a coherent picture of hypothesis testing.

(ii) The current textbook discussion of statistical testing is both highly confusing and confused. There are several sources of confusion.

  • (a) Testing is conceptually one of the most sophisticated sub-fields of any scientific discipline.
  • (b) Inadequate knowledge by textbook writers who often do not have the technical skills to read and understand the original sources, and have to rely on second-hand accounts of previous textbook writers that are often misleading or just outright erroneous. In most of these textbooks hypothesis testing is poorly explained as an idiot’s guide to combining off-the-shelf formulae with statistical tables like the Normal, the Student’s t, the chi-square, etc., where the underlying statistical model that gives rise to the testing procedure is hidden in the background.
  • (c) The misleading portrayal of Neyman-Pearson testing as essentially decision-theoretic in nature, when in fact the latter has much greater affinity with Bayesian rather than frequentist inference.
  • (d) A deliberate attempt to distort and cannibalize frequentist testing by certain Bayesian drumbeaters who revel in (unfairly) maligning frequentist inference in their attempts to motivate their preferred view on statistical inference.

(iii) The discussion of frequentist testing is rather incomplete insofar as it has been beleaguered by serious foundational problems since the 1930s. As a result, different applied fields have generated their own secondary literatures attempting to address these problems, but often making things much worse! Indeed, in some fields like psychology it has reached the stage where one has to correct the ‘corrections’ of those chastising the initial correctors!

In an attempt to alleviate problem (i), the discussion that follows uses a sketchy historical development of frequentist testing. To ameliorate problem (ii), the discussion includes ‘red flag’ pointers (¥) designed to highlight important points that shed light on certain erroneous interpretations or misleading arguments. The discussion will pay special attention to (iii), addressing some of the key foundational problems.

[i] It is based on Ch. 14 of Spanos (1999) Probability Theory and Statistical Inference. Cambridge[ii].

[ii] You can win a free copy of this 700+ page text by creating a simple palindrome! https://errorstatistics.com/palindrome/march-contest/

Categories: Bayesian/frequentist, Error Statistics, Severity, significance tests, Statistics | Tags: | 36 Comments

Lucien Le Cam: “The Bayesians hold the Magic”

Nov. 18, 1924 – April 25, 2000

Today is Lucien Le Cam’s birthday. He was an error statistician whose remarks in an article, “A Note on Metastatistics,” in a collection on foundations of statistics (Le Cam 1977)* had some influence on me. A statistician at Berkeley, Le Cam was a co-editor with Neyman of the Berkeley Symposia volumes. I hadn’t mentioned him on this blog before, so here are some snippets from EGEK (Mayo, 1996, 337-8; 350-1) that begin with a passage from Le Cam (1977) (here I have fleshed it out):

“One of the claims [of the Bayesian approach] is that the experiment matters little, what matters is the likelihood function after experimentation. Whether this is true, false, unacceptable or inspiring, it tends to undo what classical statisticians have been preaching for many years: think about your experiment, design it as best you can to answer specific questions, take all sorts of precautions against selection bias and your subconscious prejudices. It is only at the design stage that the statistician can help you.

Another claim is the very curious one that if one follows the neo-Bayesian theory strictly one would not randomize experiments….However, in this particular case the injunction against randomization is a typical product of a theory which ignores differences between experiments and experiences and refuses to admit that there is a difference between events which are made equiprobable by appropriate mechanisms and events which are equiprobable by virtue of ignorance. …

In spite of this the neo-Bayesian theory places randomization on some kind of limbo, and thus attempts to distract from the classical preaching that double blind randomized experiments are the only ones really convincing.

There are many other curious statements concerning confidence intervals, levels of significance, power, and so forth. These statements are only confusing to an otherwise abused public”. (Le Cam 1977, 158)

Back to EGEK:

Why does embracing the Bayesian position tend to undo what classical statisticians have been preaching? Because Bayesian and classical statisticians view the task of statistical inference very differently.

In [chapter 3, Mayo 1996] I contrasted these two conceptions of statistical inference by distinguishing evidential-relationship or E-R approaches from testing approaches, … .

The E-R view is modeled on deductive logic, only with probabilities. In the E-R view, the task of a theory of statistics is to say, for given evidence and hypotheses, how well the evidence confirms or supports hypotheses (whether absolutely or comparatively). There is, I suppose, a certain confidence and cleanness to this conception that is absent from the error-statistician’s view of things. Error statisticians eschew grand and unified schemes for relating their beliefs, preferring a hodgepodge of methods that are truly ampliative. Error statisticians appeal to statistical tools as protection from the many ways they know they can be misled by data as well as by their own beliefs and desires. The value of statistical tools for them is to develop strategies that capitalize on their knowledge of mistakes: strategies for collecting data, for efficiently checking an assortment of errors, and for communicating results in a form that promotes their extension by others.

Given the difference in aims, it is not surprising that information relevant to the Bayesian task is very different from that relevant to the task of the error statistician. In this section I want to sharpen and make more rigorous what I have already said about this distinction.

…. the secret to solving a number of problems about evidence, I hold, lies in utilizing—formally or informally—the error probabilities of the procedures generating the evidence. It was the appeal to severity (an error probability), for example, that allowed distinguishing among the well-testedness of hypotheses that fit the data equally well… .

A few pages later in a section titled “Bayesian Freedom, Bayesian Magic” (350-1):

 A big selling point for adopting the LP (strong likelihood principle), and with it the irrelevance of stopping rules, is that it frees us to do things that are sinful and forbidden to an error statistician.

“This irrelevance of stopping rules to statistical inference restores a simplicity and freedom to experimental design that had been lost by classical emphasis on significance levels (in the sense of Neyman and Pearson). . . . Many experimenters would like to feel free to collect data until they have either conclusively proved their point, conclusively disproved it, or run out of time, money or patience … Classical statisticians … have frowned on [this]”. (Edwards, Lindman, and Savage 1963, 239)1

Breaking loose from the grip imposed by error probabilistic requirements returns to us an appealing freedom.

Le Cam, … hits the nail on the head:

“It is characteristic of [Bayesian approaches] [2] . . . that they … tend to treat experiments and fortuitous observations alike. In fact, the main reason for their periodic return to fashion seems to be that they claim to hold the magic which permits [us] to draw conclusions from whatever data and whatever features one happens to notice”. (Le Cam 1977, 145)

In contrast, the error probability assurances go out the window if you are allowed to change the experiment as you go along. Repeated tests of significance (or sequential trials) are permitted, are even desirable for the error statistician; but a penalty must be paid for perseverance—for optional stopping. Before-trial planning stipulates how to select a small enough significance level to be on the lookout for at each trial so that the overall significance level is still low. …. Wearing our error probability glasses—glasses that compel us to see how certain procedures alter error probability characteristics of tests—we are forced to say, with Armitage, that “Thou shalt be misled if thou dost not know that” the data resulted from the try and try again stopping rule. To avoid having a high probability of following false leads, the error statistician must scrupulously follow a specified experimental plan. But that is because we hold that error probabilities of the procedure alter what the data are saying—whereas Bayesians do not. The Bayesian is permitted the luxury of optional stopping and has nothing to worry about. The Bayesians hold the magic.

Or is it voodoo statistics?
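A minimal simulation (my own sketch, not from EGEK) of the “try and try again” stopping rule described above: test a true null at the nominal .05 level after every new observation and stop at the first rejection. The probability of eventually declaring “significance” is far above .05, which is the penalty for perseverance the error statistician must pay.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)

def tries_until_significant(max_n=1000, alpha=0.05):
    """Sample from a TRUE null N(0,1), testing after each observation; stop at first 'significance'."""
    crit = norm.ppf(1 - alpha / 2)
    total = 0.0
    for n in range(1, max_n + 1):
        total += rng.normal()
        z = total / np.sqrt(n)          # z-statistic for H0: mu = 0 with known sigma = 1
        if abs(z) > crit:
            return True
    return False

n_sim = 2000
hits = sum(tries_until_significant() for _ in range(n_sim))
print(f"Proportion of true nulls 'rejected' under optional stopping: {hits / n_sim:.2f}")
# Far above the nominal 0.05 -- and it grows toward 1 as max_n increases, which is why the
# error statistician insists the stopping rule alters what the data are saying.
```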

When I sent him a note, saying his work had inspired me, he modestly responded that he doubted he could have had all that much of an impact.
_____________

*I had forgotten that this Synthese (1977) volume on foundations of probability and statistics is the one dedicated to the memory of Allan Birnbaum after his suicide: “By publishing this special issue we wish to pay homage to professor Birnbaum’s penetrating and stimulating work on the foundations of statistics” (Editorial Introduction). In fact, I somehow had misremembered it as being in a Harper and Hooker volume from 1976. The Synthese volume contains papers by Giere, Birnbaum, Lindley, Pratt, Smith, Kyburg, Neyman, Le Cam, and Kiefer.

REFERENCES:

Armitage, P. (1961). Contribution to discussion in Consistency in statistical inference and decision, by C. A. B. Smith. Journal of the Royal Statistical Society (B) 23:1-37.

_______(1962). Contribution to discussion in The foundations of statistical inference, edited by L. Savage. London: Methuen.

_______(1975). Sequential Medical Trials. 2nd ed. New York: John Wiley & Sons.

Edwards, W., H. Lindman & L. Savage (1963) Bayesian statistical inference for psychological research. Psychological Review 70: 193-242.

Le Cam, L. (1974). J. Neyman: on the occasion of his 80th birthday. Annals of Statistics, Vol. 2, No. 3 , pp. vii-xiii, (with E.L. Lehmann).

Le Cam, L. (1977). A note on metastatistics or “An essay toward stating a problem in the doctrine of chances.”  Synthese 36: 133-60.

Le Cam, L. (1982). A remark on empirical measures in Festschrift in the honor of E. Lehmann. P. Bickel, K. Doksum & J. L. Hodges, Jr. eds., Wadsworth  pp. 305-327.

Le Cam, L. (1986). The central limit theorem around 1935. Statistical Science, Vol. 1, No. 1,  pp. 78-96.

Le Cam, L. (1988) Discussion of “The Likelihood Principle,” by J. O. Berger and R. L. Wolpert. IMS Lecture Notes Monogr. Ser. 6 182–185. IMS, Hayward, CA

Le Cam, L. (1996) Comparison of experiments: A short review. In Statistics, Probability and Game Theory. Papers in Honor of David Blackwell 127–138. IMS, Hayward, CA.

Le Cam, L.,  J. Neyman and E. L. Scott (Eds). (1973). Proceedings of the Sixth Berkeley Symposium on Mathematical Statistics and Probability, Vol. l: Theory of Statistics, Vol. 2: Probability Theory, Vol. 3: Probability Theory. Univ. of Calif. Press, Berkeley Los Angeles.

Mayo, D. (1996). [EGEK] Error Statistics and the Growth of Experimental Knowledge. Chicago: University of Chicago Press. (Chapter 10; Chapter 3)

Neyman, J. and L. Le Cam (Eds). (1967).  Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Vol. I: Statistics, Vol. II: Probability Part I & Part II. Univ. of Calif. Press, Berkeley and Los Angeles.

[1] For some links on optional stopping on this blog: “Highly probable vs highly probed: Bayesian/error statistical differences”; “Who is allowed to cheat? I.J. Good and that after dinner comedy hour…”; “New Summary”; Mayo: (section 7) “StatSci and PhilSci: part 2”; “After dinner Bayesian comedy hour…”. Search for more, if interested.

[2] Le Cam is alluding mostly to Savage, and (what he called) the “neo-Bayesian” accounts.

Categories: Error Statistics, frequentist/Bayesian, phil/history of stat, strong likelihood principle | 58 Comments

WHIPPING BOYS AND WITCH HUNTERS

This, from 2 years ago, “fits” at least as well today…HAPPY HALLOWEEN! Memory Lane

In an earlier post I alleged that frequentist hypothesis tests often serve as whipping boys, by which I meant “scapegoats”, for the well-known misuses, abuses, and flagrant misinterpretations of tests (both simple Fisherian significance tests and Neyman-Pearson tests, although in different ways). Checking the history of this term, however, there is a certain disanalogy with at least the original meaning of a “whipping boy,” namely, an innocent boy who was punished when a medieval prince misbehaved and was in need of discipline. It was thought that seeing an innocent companion, often a friend, beaten for his own transgressions would supply an effective way to ensure the prince would not repeat the same mistake. But significance test floggings, rather than a tool for humbled self-improvement and commitment to avoiding flagrant rule violations, have tended instead to yield declarations that it is the rules that are invalid! The violators are excused as not being able to help it! The situation is more akin to that of witch hunting, which in some places became an occupation in its own right.

Now some early literature, e.g., Morrison and Henkel’s Significance Test Controversy (1962), performed an important service over fifty years ago.  They alerted social scientists to the fallacies of significance tests: misidentifying a statistically significant difference with one of substantive importance, interpreting insignificant results as evidence for the null hypothesis–especially problematic with insensitive tests, and the like. Chastising social scientists for applying significance tests in slavish and unthinking ways, contributors call attention to a cluster of pitfalls and fallacies of testing.

The volume describes research studies conducted for the sole purpose of revealing these flaws. Rosenthal and Gaito (1963) document how it is not rare for scientists to mistakenly regard a statistically significant difference, say at level .05, as indicating a greater discrepancy from the null when arising from a large sample size rather than a smaller sample size—even though a correct interpretation of tests indicates the reverse. By and large, these critics are not espousing a Bayesian line but rather see themselves as offering “reforms”, e.g., supplementing simple significance tests with power (e.g., Jacob Cohen’s “power analytic” movement), and most especially, replacing tests with confidence interval estimates of the size of discrepancy (from the null) indicated by the data. Of course, the use of power is central for (frequentist) Neyman-Pearson tests, and (frequentist) confidence interval estimation even has a duality with hypothesis tests!
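A quick calculation in the spirit of the Rosenthal and Gaito point (the numbers are mine): a result just significant at the .05 level corresponds to a much smaller observed discrepancy from the null when the sample is large than when it is small.

```python
from math import sqrt
from scipy.stats import norm

# One-sided test of H0: mu = 0 vs mu > 0 with known sigma = 1; reject when the sample mean
# exceeds 1.645 * sigma / sqrt(n).
z_crit = norm.ppf(0.95)
for n in (10, 100, 10_000):
    xbar_at_05 = z_crit / sqrt(n)      # observed mean just reaching significance at .05
    print(f"n = {n:>6}: a just-significant result corresponds to xbar ~ {xbar_at_05:.3f}")
# The larger the sample, the SMALLER the discrepancy indicated by a result significant at the
# same level -- the reverse of the common misreading Rosenthal and Gaito documented.
```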

But rather than take a temporary job of pointing up some understandable fallacies in the use of newly adopted statistical tools by social scientific practitioners, or lead by example of right-headed statistical analyses, the New Reformers have seemed to settle into a permanent career of showing the same fallacies.  Yes, they advocate “alternative” methods, e.g., “effect size” analysis, power analysis, confidence intervals, meta-analysis.  But never having adequately unearthed the essential reasoning and rationale of significance tests—admittedly something that goes beyond many typical expositions—their supplements and reforms often betray the same confusions and pitfalls that underlie the methods they seek to supplement or replace! (I will give readers a chance to demonstrate this in later posts.)

We all reject the highly lampooned, recipe-like uses of significance tests; I and others insist on interpreting tests to reflect the extent of discrepancy indicated or not (back when I was writing my doctoral dissertation and EGEK 1996). I never imagined that hypothesis tests (of all stripes) would continue to be flogged again and again, in the same ways!

Frustrated with the limited progress in psychology, apparently inconsistent results, and lack of replication, an imagined malign conspiracy of significance tests is blamed: traditional reliance on statistical significance testing, we hear,

“has a debilitating effect on the general research effort to develop cumulative theoretical knowledge and understanding. However, it is also important to note that it destroys the usefulness of psychological research as a means for solving practical problems in society” (Schmidt 1996, 122)[i].

Meta-analysis was to be the cure that would provide cumulative knowledge to psychology: Lest enthusiasm for revisiting the same cluster of elementary fallacies of tests begin to lose steam, the threats of dangers posed become ever shriller: just as the witch is responsible for whatever ails a community, the significance tester is portrayed as so powerful as to be responsible for blocking scientific progress. In order to keep the gig alive, a certain level of breathless hysteria is common: “statistical significance is hurting people, indeed killing them” (Ziliak and McCloskey 2008, 186)[ii]; significance testers are members of a “cult” led by R.A. Fisher, whom they call “The Wasp”. To the question, “What if there were no Significance Tests,” as the title of one book inquires[iii], surely the implication is that once tests are extirpated, their research projects would bloom and thrive; so let us have Task Forces[iv] to keep reformers busy at journalistic reforms to banish the test once and for all!

Harlow, L., Mulaik, S., and Steiger, J. (Eds.) (1997), What if there were no significance tests? Mahwah, NJ: Lawrence Erlbaum Associates.

Hunter, J.E. (1997), “Needed: A Ban on the Significance Test,” Psychological Science 8: 3-7.

Morrison, D. and Henkel, R. (eds.) (1970), The Significance Test Controversy, Aldine, Chicago.

MSERA (1998), Research in the Schools, 5(2) “Special Issue: Statistical Significance Testing,” Birmingham, Alabama.

Rosenthal, R. and Gaito, J. (1963), “The Interpretation of Levels of Significance by Psychological Researchers,” Journal of Psychology 55: 33-38.

Ziliak, S. and McCloskey, D. (2008), The Cult of Statistical Significance, University of Michigan Press.


[i] Schmidt was the one Erich Lehmann wrote to me about, expressing great concern.

[ii] While setting themselves up as High Priest and Priestess of “reformers,” their own nostrums reveal they fall into the same fallacy pointed up by Rosenthal and Gaito (among many others) nearly a half a century ago. That’s what should scare us!

[iii] In Lisa A. Harlow, Stanley A. Mulaik, and James H. Steiger (Eds.) What if there were no significance tests? (pp. 37-64). Mahwah, NJ: Lawrence Erlbaum Associates.

[iv] MSERA (1998): ‘Special Issue: Statistical Significance Testing,’ Research in the Schools, 5.   See also Hunter (1997). The last I heard, they have not succeeded in their attempt at an all-out “test ban”.  Interested readers might check the status of the effort, and report back.

Related posts:

“Saturday night brainstorming and taskforces”

“What do these share in common: MMs, limbo stick, ovulation, Dale Carnegie?: Sat. night potpourri”

Categories: significance tests, Statistics | Tags: , , | 3 Comments

Sir David Cox: a comment on the post, “Was Hosiasson pulling Jeffreys’ leg?”

Sir David Cox

David Cox sent me a letter relating to my post of Oct. 5, 2013. He has his own theory as to who might have been doing the teasing! I’m posting it here, with his permission:

Dear Deborah

I was interested to see the correspondence about Jeffreys and the possible teasing by Neyman’s associate. It brought a number of things to mind.

  1. While I am not at all convinced that any teasing was involved, if there was it seems to me much more likely that Jeffreys was doing the teasing. He, correctly surely, disapproved of that definition and was putting up a highly contrived illustration of its misuse.
  2. In his work he was not writing about a subjective view of probability but about objective degree of belief. He did not disapprove of more physical definitions, such as needed to describe radioactive decay; he preferred to call them chances.
  3. In assessing his work it is important that the part on probability was perhaps 10% of what he did. He was most famous for The Earth (1924), which is said to have started the field of geophysics. (The first edition of his 1939 book on probability was in a series of monographs in physics.) The later book with his wife, Bertha, Methods of Mathematical Physics, is a masterpiece.
  4. I heard him speak from time to time and met him personally on a couple of occasions. He was superficially very mild and said very little. He was involved in various controversies but, and I am not sure about this, I don’t think they ever degenerated into personal bitterness. He lived to be 98 and, a mark of his determination is that in his early 90’s he cycled in Cambridge having a series of minor accidents. He was stopped only when Bertha removed the tires from his bike. Bertha was a highly respected teacher of mathematics.
  5.  He and R.A.Fisher were not only towering figures in statistics in the first part of the 20th century but surely among the major applied mathematicians of that era in the world.
  6. Neyman was not at all Germanic, in the sense that one of your correspondents described. He could certainly be autocratic but not in personal manner. While all the others at Berkeley were Professor this or Dr that, he insisted on being called Mr Neyman.
  7. The remarks [i] about how people addressed one another 50 plus years ago in the UK are  broadly accurate, although they were not specific to Cambridge and certainly could be varied. From about age 11 boys in school, students and men in the workplace addressed one another by surname only. Given names were for family and very close friends. Women did use given names or  were Miss or Mrs, certainly never Madam unless they were French aristocrats. Thus in 1950 or so I worked with, published with and was very friendly with two physical scientists, R.C. Palmer and S.L. Anderson. I have no idea what their given names were; it was irrelevant. To address someone you did not know by name you used Sir or Madam. It would be very foolish to think that meant unfriendliness or that the current practice of calling absolutely everyone by their given name means universal benevolence.

Best wishes

David

D.R.Cox
Nuffield College
Oxford
UK

[i] In comments to this post.

Categories: phil/history of stat, Statistics | Tags: | 13 Comments

Was Janina Hosiasson pulling Harold Jeffreys’ leg?

Hosiasson 1899-1942

The very fact that Jerzy Neyman considers she might have been playing a “mischievous joke” on Harold Jeffreys (concerning probability) is enough to intrigue and impress me (with Hosiasson!). I’ve long been curious about what really happened. Eleonore Stump, a leading medieval philosopher and friend (and one-time colleague), and I pledged to travel to Vilnius to research Hosiasson. I first heard her name from Neyman’s dedication of Lectures and Conferences in Mathematical Statistics and Probability: “To the memory of: Janina Hosiasson, murdered by the Gestapo,” along with around 9 other “colleagues and friends lost during World War II.” (He doesn’t mention her husband Lindenbaum, shot alongside her*.) Hosiasson is responsible for Hempel’s Raven Paradox, and I definitely think we should be calling it Hosiasson’s (Raven) Paradox, given how much credit she has lost for her contributions to Carnapian confirmation theory[i].

But what about this mischievous joke she might have pulled off with Harold Jeffreys? Or did Jeffreys misunderstand what she intended to say about this howler, or?  Since it’s a weekend and all of the U.S. monuments and parks are shut down, you might read this snippet and share your speculations…. The following is from Neyman 1952:

“Example 6.—The inclusion of the present example is occasioned by certain statements of Harold Jeffreys (1939, 300) which suggest that, in spite of my insistence on the phrase, “probability that an object A will possess the property B,” and in spite of the five foregoing examples, the definition of probability given above may be misunderstood.

Jeffreys is an important proponent of the subjective theory of probability designed to measure the “degree of reasonable belief.” His ideas on the subject are quite radical. He claims (1939, 303) that no consistent theory of probability is possible without the basic notion of degrees of reasonable belief. His further contention is that proponents of theories of probabilities alternative to his own forget their definitions “before the ink is dry.” In Jeffreys’ opinion, they use the notion of reasonable belief without ever noticing that they are using it and, by so doing, contradict the principles which they have laid down at the outset.

The necessity of any given axiom in a mathematical theory is something which is subject to proof. …

However, Dr. Jeffreys’ contention that the notion of degrees of reasonable belief and his Axiom 1 are necessary for the development of the theory of probability is not backed by any attempt at proof. Instead, he considers definitions of probability alternative to his own and attempts to show by example that, if these definitions are adhered to, the results of their application would be totally unreasonable and unacceptable to anyone. Some of the examples are striking. On page 300, Jeffreys refers to an article of mine in which probability is defined exactly as it is in the present volume. Jeffreys writes:
 Jeffreys writes:

The first definition is sometimes called the “classical” one, and is stated in much modern work, notably that of J. Neyman.

However, Jeffreys does not quote the definition that I use but chooses to reword it as follows:

If there are n possible alternatives, for m of which p is true, then the probability of p is defined to be m/n.


He goes on to say:

The first definition appears at the beginning of De Moivre’s book (Doctrine of Chances, 1738). It often gives a definite value to a probability; the trouble is that the value is one that its user immediately rejects. Thus suppose that we are considering two boxes, one containing one white and one black ball, and the other one white and two black. A box is to be selected at random and then a ball at random from that box. What is the probability that the ball will be white? There are five balls, two of which are white. Therefore, according to the definition, the probability is 2/5. But most statistical writers, including, I think, most of those that professedly accept the definition, would give (1/2)•(1/2) + (1/2)•(1/3) = 5/12. This follows at once on the present theory, the terms representing two applications of the product rule to give the probability of drawing each of the two white balls. These are then added by the addition rule. But the proposition cannot be expressed as the disjunction of five alternatives out of twelve. My attention was called to this point by Miss J. Hosiasson.


The solution, 2/5, suggested by Jeffreys as the result of an allegedly strict application of my definition of probability is obviously wrong. The mistake seems to be due to Jeffreys’ apparently harmless rewording of the definition. If we adhere to the original wording (p. 4) and, in particular, to the phrase “probability of an object A having the property B,” then, prior to attempting a solution, we would probably ask ourselves the questions: “What are the ‘objects A’ in this particular case?” and “What is the ‘property B,’ the probability of which it is desired to compute?” Once these questions have been asked, the answer to them usually follows and determines the solution.

In the particular example of Dr. Jeffreys, the objects A are obviously not balls, but pairs of random selections, the first of a box and the second of a ball. If we like to state the problem without dangerous abbreviations, the probability sought is that of a pair of selections ending with a white ball. All the conditions of there being two boxes, the first with two balls only and the second with three, etc., must be interpreted as picturesque descriptions of the F.P.S. of pairs of selections. The elements of this set fall into four categories, conveniently described by pairs of symbols (1,w), (1,b), (2,w), (2,b), so that, for example, (2,w) stands for a pair of selections in which the second box was selected in the first instance, and then this was followed by the selection of the white ball. Denote by n1,w, n1,b, n2,w, and n2,b the (unknown) numbers of the elements of the F.P.S. belonging to each of the above categories, and by n their sum. Then the probability sought is” (Neyman 1952, 10-11).

Then there are the detailed computations from which Neyman gets the right answer (entered 10/9/13):

P{w|pair of selections} = (n1,w + n2,w)/n.

The conditions of the problem imply

P{1|pair of selections} = (n1,w + n1,b)/n = ½,

P{2|pair of selections} = (n2,w + n2,b)/n = ½,

P{w| pair of selections beginning with box No. 1} = n1,w/(n1,w + n1,b) = ½,

P{w| pair of selections beginning with box No. 2} = n2,w/(n2,w + n2,b) = 1/3.

It follows

n1,w = 1/2(n1,w + n1,b) = n/4,

n2,w = 1/3(n2,w + n2,b)  = n/6,

P{w|pair of selections} = 5/12.

The method of computing probability used here is a direct enumeration of elements of the F.P.S. For this reason it is called the “direct method.” As we can see from this particular example, the direct method is occasionally cumbersome and the correct solution is more easily reached through the application of certain theorems basic in the theory of probability. These theorems, the addition theorem and the multiplication theorem, are very easy to apply, with the result that students frequently manage to learn the machinery of application without understanding the theorems. To check whether or not a student does understand the theorems, it is advisable to ask him to solve problems by the direct method. If he cannot, then he does not understand what he is doing.
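As a check by the direct method, here is a brute-force enumeration of Neyman’s pairs of selections (my sketch, not Neyman’s notation); it returns 5/12, not the 2/5 obtained by counting balls.

```python
from fractions import Fraction

# The 'objects A' are pairs of selections: first a box (each with probability 1/2),
# then a ball at random from that box. Box 1: {w, b}; Box 2: {w, b, b}.
boxes = {1: ["w", "b"], 2: ["w", "b", "b"]}

p_white = Fraction(0)
for box, balls in boxes.items():
    p_box = Fraction(1, 2)                        # random selection of a box
    for ball in balls:
        p_pair = p_box * Fraction(1, len(balls))  # random selection of a ball from that box
        if ball == "w":
            p_white += p_pair

print(p_white)            # 5/12, Neyman's (and Jeffreys') answer
print(Fraction(2, 5))     # the 'count the balls' answer Jeffreys attributes to the definition
```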

Checks of this kind were part of the regular program of instruction in Warsaw where Miss Hosiasson was one of my assistants. Miss Hosiasson was a very talented lady who has written several interesting contributions to the theory of probability. One of these papers deals specifically with various misunderstandings which, under the high sounding name of paradoxes, still litter the scientific books and journals. Most of these paradoxes originate from lack of precision in stating the conditions of the problems studied. In these circumstances, it is most unlikely that Miss Hosiasson could fail in the application of the direct method to a simple problem like the one described by Dr. Jeffreys. On the other hand, I can well imagine Miss Hosiasson making a somewhat mischievous joke.


Some of the paradoxes solved by Miss Hosiasson are quite amusing….” (Neyman 1952, 10-13)

What think you? I will offer a first speculation in a comment.

The entire book Neyman (1952) may be found here, in plain text, here.

*June 2017: I read somewhere today that her husband was killed in ’41, so before she was, but all refs I know are sketchy.

[i] Of course there are many good, recent sources on the philosophy and history of Carnap, some of which mention her, but obviously do not touch on this matter. I read that Hosiasson was trying to build a Carnapian-style inductive logic setting out axioms (which to my knowledge Carnap never did). That was what some of my fledgling graduate school attempts had tried, but the axioms always seemed to admit counterexamples (if non-trivial). So much for the purely syntactic approach. But I wish I’d known of her attempts back then, and especially her treatment of paradoxes of confirmation. (I’m sometimes tempted to give a logic for severity, but I fight the temptation.)

REFERENCES

Hosiasson, J. (1931). Why do we prefer probabilities relative to many data? Mind 40 (157): 23-36.

Hosiasson-Lindenbaum, J. (1940). On confirmation. Journal of Symbolic Logic 5 (4): 133-148.

Hosiasson, J. (1941). Induction et analogie: Comparaison de leur fondement. Mind 50 (200): 351-365.

Hosiasson-Lindenbaum, J. (1948). Theoretical Aspects of the Advancement of Knowledge. Synthese 7 (4/5): 253-261.

Jeffreys, H. (1939). Theory of Probability (1st ed.). Oxford: The Clarendon Press.

Neyman, J. (1952). Lectures and Conferences in Mathematical Statistics and Probability. Graduate School, U.S. Dept. of Agriculture.

Categories: Hosiasson, phil/history of stat, Statistics | Tags: , | 22 Comments

Barnard’s Birthday: background, likelihood principle, intentions

G.A. Barnard: 23 Sept. 1915 – 9 Aug. 2002

Reblog (from a year ago): G.A. Barnard’s birthday is today, so here’s a snippet of his discussion with Savage (1962) (link below [i]) that connects to some earlier issues: stopping rules, likelihood principle, and background information here and here (at least of one type). (A few other Barnard links on this blog are below*.) Happy Birthday George!

Barnard: I have been made to think further about this issue of the stopping rule since I first suggested that the stopping rule was irrelevant (Barnard 1947a,b). This conclusion does not follow only from the subjective theory of probability; it seems to me that the stopping rule is irrelevant in certain circumstances. Since 1947 I have had the great benefit of a long correspondence—not many letters because they were not very frequent, but it went on over a long time—with Professor Bartlett, as a result of which I am considerably clearer than I was before. My feeling is that, as I indicated [on p. 42], we meet with two sorts of situation in applying statistics to data. One is where we want to have a single hypothesis with which to confront the data. Do they agree with this hypothesis or do they not? Now in that situation you cannot apply Bayes’s theorem because you have not got any alternatives to think about and specify—not yet. I do not say they are not specifiable—they are not specified yet. And in that situation it seems to me the stopping rule is relevant.

In particular, suppose somebody sets out to demonstrate the existence of extrasensory perception and says ‘I am going to go on until I get a one in ten thousand significance level’. Knowing that this is what he is setting out to do would lead you to adopt a different test criterion. What you would look at would not be the ratio of successes obtained, but how long it took him to obtain it. And you would have a very simple test of significance which said if it took you so long to achieve this increase in the score above the chance fraction, this is not at all strong evidence for E.S.P., it is very weak evidence. And the reversing of the choice of test criteria would I think overcome the difficulty.

This is the answer to the point Professor Savage makes; he says why use one method when you have vague knowledge, when you would use a quite different method when you have precise knowledge. It seems to me the answer is that you would use one method when you have precisely determined alternatives, with which you want to compare a given hypothesis, and you use another method when you do not have these alternatives.

Savage: May I digress to say publicly that I learned the stopping-rule principle from professor Barnard, in conversation in the summer of 1952. Frankly I then thought it a scandal that anyone in the profession could advance an idea so patently wrong, even as today I can scarcely believe that some people resist an idea so patently right. I am particularly surprised to hear Professor Barnard say today that the stopping rule is irrelevant in certain circumstances only, for the argument he first gave in favour of the principle seems quite unaffected by the distinctions just discussed. The argument then was this: The design of a sequential experiment is, in the last analysis, what the experimenter actually intended to do. His intention is locked up inside his head and cannot be known to those who have to judge the experiment. Never having been comfortable with that argument, I am not advancing it myself. But if Professor Barnard still accepts it, how can he conclude that the stopping-rule principle is only sometimes valid? (emphasis added) Continue reading

Categories: Background knowledge, Likelihood Principle, phil/history of stat, Philosophy of Statistics | Leave a comment

Statistical Theater of the Absurd: “Stat on a Hot Tin Roof”

Memory lane: Did you ever consider how some of the colorful exchanges among better-known names in statistical foundations could be the basis for high literary drama in the form of one-act plays (even if appreciated by only 3-7 people in the world)? (Think of the expressionist exchange between Bohr and Heisenberg in Michael Frayn’s play Copenhagen, except here there would be no attempt at all to popularize—only published quotes and closely remembered conversations would be included, with no attempt to create a “story line”.)  Somehow I didn’t think so. But rereading some of Savage’s high-flown praise of Birnbaum’s “breakthrough” argument (for the Likelihood Principle) today, I was swept into a “(statistical) theater of the absurd” mindset.

The first one came to me in autumn 2008 while I was giving a series of seminars on philosophy of statistics at the LSE. Modeled on a disappointing (to me) performance of The Woman in Black, “A Funny Thing Happened at the [1959] Savage Forum” relates Savage’s horror at George Barnard’s announcement of having rejected the Likelihood Principle!

The current piece taking shape also features George Barnard and since tomorrow (9/23) is his birthday, I’m digging it out of “rejected posts”. It recalls our first meeting in London in 1986. I’d sent him a draft of my paper “Why Pearson Rejected the Neyman-Pearson Theory of Statistics” (later adapted as chapter 11 of EGEK) to see whether I’d gotten Pearson right. He’d traveled quite a ways, from Colchester, I think. It was June and hot, and we were up on some kind of a semi-enclosed rooftop. Barnard was sitting across from me looking rather bemused.

The curtain opens with Barnard and Mayo on the roof, lit by a spot mid-stage. He’s drinking (hot) tea; she, a Diet Coke. The dialogue (what I recall from the time[i]):

 Barnard: I read your paper. I think it is quite good.  Did you know that it was I who told Fisher that Neyman-Pearson statistics had turned his significance tests into little more than acceptance procedures?

Mayo:  Thank you so much for reading my paper.  I recall a reference to you in Pearson’s response to Fisher, but I didn’t know the full extent.

Barnard: I was the one who told Fisher that Neyman was largely to blame. He shouldn’t be too hard on Egon.  His statistical philosophy, you are aware, was different from Neyman’s. Continue reading

Categories: Barnard, phil/history of stat, rejected post, Statistics | Tags: , , , , | 6 Comments

(Part 3) Peircean Induction and the Error-Correcting Thesis

C. S. Peirce: 10 Sept. 1839 – 19 April 1914

Last third of “Peircean Induction and the Error-Correcting Thesis”

Deborah G. Mayo
Transactions of the Charles S. Peirce Society 41(2) 2005: 299-319

Part 2 is here.

8. Random sampling and the uniformity of nature

We are now in a position to address the final move in warranting Peirce’s SCT. The severity or trustworthiness assessment, on which the error correcting capacity depends, requires an appropriate link (qualitative or quantitative) between the data and the data generating phenomenon, e.g., a reliable calibration of a scale in a qualitative case, or a probabilistic connection between the data and the population in a quantitative case. Establishing such a link, however, is regarded as assuming observed regularities will persist, or making some “uniformity of nature” assumption—the bugbear of attempts to justify induction.

But Peirce contrasts his position with those favored by followers of Mill, and “almost all logicians” of his day, who “commonly teach that the inductive conclusion approximates to the truth because of the uniformity of nature” (2.775). Inductive inference, as Peirce conceives it (i.e., severe testing) does not use the uniformity of nature as a premise. Rather, the justification is sought in the manner of obtaining data. Justifying induction is a matter of showing that there exist methods with good error probabilities. For this it suffices that randomness be met only approximately, that inductive methods check their own assumptions, and that they can often detect and correct departures from randomness.

… It has been objected that the sampling cannot be random in this sense. But this is an idea which flies far away from the plain facts. Thirty throws of a die constitute an approximately random sample of all the throws of that die; and that the randomness should be approximate is all that is required. (1.94)

Peirce backs up his defense with robustness arguments. For example, in an (attempted) Binomial induction, Peirce asks, “what will be the effect upon inductive inference of an imperfection in the strictly random character of the sampling” (2.728). What if, for example, a certain proportion of the population had twice the probability of being selected? He shows that “an imperfection of that kind in the random character of the sampling will only weaken the inductive conclusion, and render the concluded ratio less determinate, but will not necessarily destroy the force of the argument completely” (2.728). This is particularly so if the sample mean is near 0 or 1. In other words, violating experimental assumptions may be shown to weaken the trustworthiness or severity of the proceeding, but this may only mean we learn a little less.
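
To fix ideas, here is a minimal numerical sketch of Peirce’s point. The two-to-one selection weighting and all the numbers below are illustrative assumptions, not Peirce’s own calculation; the upshot is that the expected sample ratio is distorted from p to 2p/(1+p), a distortion that is largest for middling p and vanishes as p approaches 0 or 1, so the conclusion is weakened and rendered less determinate, not destroyed.

```python
import numpy as np

# Illustrative sketch (not Peirce's calculation): suppose trait-bearers have
# twice the chance of being drawn.  The expected sampled ratio is then
#   r = 2p / (2p + (1 - p)) = 2p / (1 + p),
# so the distortion r - p = p(1 - p)/(1 + p) is largest for middling p and
# vanishes as p -> 0 or p -> 1.
for p in [0.01, 0.1, 0.3, 0.5, 0.7, 0.9, 0.99]:
    r = 2 * p / (1 + p)
    print(f"true ratio {p:.2f} -> expected sampled ratio {r:.3f} (distortion {r - p:+.3f})")

# Quick Monte Carlo check of the formula for p = 0.3.
rng = np.random.default_rng(0)
N, p_true = 200_000, 0.3
trait = rng.random(N) < p_true
weights = np.where(trait, 2.0, 1.0)
weights /= weights.sum()
sample = rng.choice(N, size=5_000, replace=True, p=weights)
print("simulated sampled ratio:", round(trait[sample].mean(), 3))
```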

Yet a further safeguard is at hand:

Nor must we lose sight of the constant tendency of the inductive process to correct itself. This is of its essence. This is the marvel of it. …even though doubts may be entertained whether one selection of instances is a random one, yet a different selection, made by a different method, will be likely to vary from the normal in a different way, and if the ratios derived from such different selections are nearly equal, they may be presumed to be near the truth. (2.729)

Here, the marvel is an inductive method’s ability to correct the attempt at random sampling. Still, Peirce cautions, we should not depend so much on the self-correcting virtue that we relax our efforts to get a random and independent sample. But if our effort is not successful, and neither is our method robust, we will probably discover it. “This consideration makes it extremely advantageous in all ampliative reasoning to fortify one method of investigation by another” (ibid.).

“The Supernal Powers Withhold Their Hands And Let Me Alone”

Peirce turns the tables on those skeptical about satisfying random sampling—or, more generally, satisfying the assumptions of a statistical model. He declares himself “willing to concede, in order to concede as much as possible, that when a man draws instances at random, all that he knows is that he tried to follow a certain precept” (2.749). There might be a “mysterious and malign connection between the mind and the universe” that deliberately thwarts such efforts. He considers betting on the game of rouge et noire: “could some devil look at each card before it was turned, and then influence me mentally” to bet or not, the ratio of successful bets might differ greatly from 0.5. But, as Peirce is quick to point out, this would equally vitiate deductive inferences about the expected ratio of successful bets.

Consider our informal example of weighing with calibrated scales. If I check the properties of the scales against known, standard weights, then I can check if my scales are working in a particular case. Were the scales infected by systematic error, I would discover this by finding systematic mismatches with the known weights; I could then subtract it out in measurements. That scales have given properties where I know the object’s weight indicates they have the same properties when the weights are unknown, lest I be forced to assume that my knowledge or ignorance somehow influences the properties of the scale. More generally, Peirce’s insightful argument goes, the experimental procedure thus confirmed where the measured property is known must work as well when it is unknown unless a mysterious and malign demon deliberately thwarts my efforts. Continue reading
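
The weighing example can be put in a few lines of code. Everything below is a hypothetical setup (the bias, noise level, and standard weights are invented for illustration); the point is only the logic: estimate the systematic error on objects of known weight, then subtract it out when weighing unknowns.

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical scale: an unknown systematic offset plus random error.
true_bias, noise_sd = 1.8, 0.5        # grams (invented for illustration)

def weigh(true_weight, n=1):
    return true_weight + true_bias + rng.normal(0, noise_sd, size=n)

# Step 1: check the scale against known standard weights.
standards = np.array([50.0, 100.0, 200.0, 500.0])
readings = np.array([weigh(w, n=20).mean() for w in standards])
estimated_bias = (readings - standards).mean()
print(f"estimated systematic error: {estimated_bias:.2f} g")

# Step 2: apply the correction to an object of unknown weight.
unknown = 137.2                       # unknown to the weigher
corrected = weigh(unknown, n=20).mean() - estimated_bias
print(f"corrected reading: {corrected:.2f} g (true value {unknown} g)")
```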

Categories: C.S. Peirce, Error Statistics, phil/history of stat | 6 Comments

Gelman’s response to my comment on Jaynes

Gelman responds to the comment[i] I made on my 8/31/13 post:
Popper and Jaynes
Posted by Andrew on 3 September 2013
Deborah Mayo quotes me as saying, “Popper has argued (convincingly, in my opinion) that scientific inference is not inductive but deductive.” She then follows up with:

Gelman employs significance test-type reasoning to reject a model when the data sufficiently disagree.

Now, strictly speaking, a model falsification, even to inferring something as weak as “the model breaks down,” is not purely deductive, but Gelman is right to see it as about as close as one can get, in statistics, to a deductive falsification of a model. But where does that leave him as a Jaynesian?

My reply:

I was influenced by reading a toy example from Jaynes’s book where he sets up a model (for the probability of a die landing on each of its six sides) based on first principles, then presents some data that contradict the model, then expands the model.

I’d seen very little of this sort of reasoning before in statistics! In physics it’s the standard way to go: you set up a model based on physical principles and some simplifications (for example, in a finite-element model you assume the various coefficients aren’t changing over time, and you assume stability within each element), then if the model doesn’t quite work, you figure out what went wrong and you make it more realistic.

But in statistics we weren’t usually seeing this. Instead, model checking typically was placed in the category of “hypothesis testing,” where the rejection was the goal. Models to be tested were straw men, built up only to be rejected. You can see this, for example, in social science papers that list research hypotheses that are not the same as the statistical “hypotheses” being tested. A typical research hypothesis is “Y causes Z,” with the corresponding statistical hypothesis being “Y has no association with Z after controlling for X.” Jaynes’s approach—or, at least, what I took away from Jaynes’s presentation—was more simpatico to my way of doing science. And I put a lot of effort into formalizing this idea, so that the kind of modeling I talk and write about can be the kind of modeling I actually do.

I don’t want to overstate this—as I wrote earlier, Jaynes is no guru—but I do think this combination of model building and checking is important. Indeed, just as a chicken is said to be an egg’s way of making another egg, we can view inference as a way of sharpening the implications of an assumed model so that it can better be checked.

P.S. In response to Larry’s post here, let me give a quick +1 to this comment and also refer to this post, which remains relevant 3 years later.
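
The kind of pure significance-test check being described can be sketched with a chi-square goodness-of-fit test on die-roll counts. The counts below are invented for illustration and are not Jaynes’s; the point is only the pattern of reasoning: a small p-value indicates the equal-probability model is contradicted, after which one expands or respecifies the model rather than stopping at “reject”.

```python
import numpy as np
from scipy import stats

# Hypothetical counts from 600 rolls of a die (invented numbers).
observed = np.array([78, 84, 92, 104, 116, 126])
expected = np.full(6, observed.sum() / 6)

chi2, p = stats.chisquare(observed, expected)
print(f"chi-square = {chi2:.2f}, p = {p:.4f}")
# A small p indicates the equal-probability model is inconsistent with the data;
# the follow-up move described above is to expand the model (e.g., allow a
# trend in the face probabilities), not merely to record a rejection.
```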

I still don’t see how one learns about falsification from Jaynes when he alleges that the entailment of x from H disappears once H is rejected. But put that aside. In my quote from Gelman 2011, he was alluding to simple significance tests–without an alternative–for checking consistency of a model; whereas he’s now saying what he wants is to infer an alternative model, and furthermore suggests one doesn’t see this in statistical hypothesis tests. But of course Neyman-Pearson testing always has an alternative, and even Fisherian simple significance tests generally indicate a direction of departure. However, neither type of statistical test method would automatically license going directly from a rejection of one statistical hypothesis to inferring an alternative model that was constructed to account for the misfit. A parametric discrepancy, δ, from a null may be indicated if the test very probably would not have resulted in so large an observed difference, were such a discrepancy absent (i.e., when the inferred alternative passes severely). But I’m not sure Gelman is limiting himself to such alternatives.
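
For a concrete (if toy) version of that δ-discrepancy assessment, here is how severity is often computed for a one-sided Normal test with σ known. All the numbers are illustrative, and nothing here is meant to represent Gelman’s procedure.

```python
import math
from scipy.stats import norm

# Toy setting: H0: mu <= mu0, sigma known, n observations (illustrative numbers).
mu0, sigma, n = 0.0, 1.0, 100
xbar_obs = 0.3                    # observed sample mean
se = sigma / math.sqrt(n)

def severity(delta):
    """SEV(mu > mu0 + delta): the probability of a result *less* discordant with
    H0 than the one observed, were the discrepancy exactly delta."""
    return norm.cdf((xbar_obs - (mu0 + delta)) / se)

for delta in [0.0, 0.1, 0.2, 0.3, 0.4]:
    print(f"SEV(mu > {mu0 + delta:.1f}) = {severity(delta):.3f}")
# Small discrepancies pass with high severity here; the claim mu > 0.4 does
# not, even though the null itself is rejected.
```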

As I wrote in a follow-up comment: “there’s no warrant to infer a particular model that happens to do a better job fitting the data x–at least on x alone. Insofar as there are many alternatives that could patch things up, an inference to one particular alternative fails to pass with severity. I don’t understand how it can be that some of the critics of the (bad) habit of some significance testers to move from rejecting the null to a particular alternative, nevertheless seem prepared to allow this in Bayesian model testing. But maybe they carry out further checks down the road; I don’t claim to really get the methods of correcting Bayesian priors (as part of a model).”

A published discussion of Gelman and Shalizi on this matter is here.

[i] My comment was:

” If followers of Jaynes agree with [one of the commentators] (and Jaynes, apparently) that as soon as H is falsified, the grounds on which the test was based disappear!—a position that is based on a fallacy– then I’m confused as to how Andrew Gelman can claim to follow Jaynes at all. 
“Popper has argued (convincingly, in my opinion) that scientific inference is not inductive but deductive…” (Gelman, 2011, bottom p. 71).
Gelman employs significance test-type reasoning to reject a model when the data sufficiently disagree.
 Now, strictly speaking, a model falsification, even to inferring something as weak as “the model breaks down,” is not purely deductive, but Gelman is right to see it as about as close as one can get, in statistics, to a deductive falsification of a model. But where does that leave him as a Jaynesian? Perhaps he’s not one of the ones in Paul’s Jaynes/Bayesian audience who is laughing, but is rather shaking his head?”
Categories: Error Statistics, significance tests, Statistics | 9 Comments

A. Spanos: Egon Pearson’s Neglected Contributions to Statistics

With permission from my colleague Aris Spanos, I reblog his (8/18/12): “Egon Pearson’s Neglected Contributions to Statistics“. It illuminates a different area of E.S.P.’s work than my posts here and here.

Egon Pearson (11 August 1895 – 12 June 1980) is widely known today for his contribution in recasting Fisher’s significance testing into the Neyman-Pearson (1933) theory of hypothesis testing. Occasionally, he is also credited with contributions in promoting statistical methods in industry and in the history of modern statistics; see Bartlett (1981). What is rarely mentioned is Egon’s early pioneering work on:

(i) specification: the need to state explicitly the inductive premises of one’s inferences,

(ii) robustness: evaluating the ‘sensitivity’ of inferential procedures to departures from the Normality assumption, as well as

(iii) Mis-Specification (M-S) testing: probing for potential departures from the Normality  assumption.

Arguably, modern frequentist inference began with the development of various finite sample inference procedures, initially by William Gosset (1908) [of the Student’s t fame] and then Fisher (1915, 1921, 1922a-b). These inference procedures revolved around a particular statistical model, known today as the simple Normal model:

Xk ∽ NIID(μ,σ²), k=1,2,…,n,…             (1)

where ‘NIID(μ,σ²)’ stands for ‘Normal, Independent and Identically Distributed with mean μ and variance σ²’. These procedures include the ‘optimal’ estimators of μ and σ², Xbar and s², and the pivotal quantities:

(a) τ(X) = [√n(Xbar - μ)/s] ∽ St(n-1),  (2)

(b) v(X) = [(n-1)s²/σ²] ∽ χ²(n-1),        (3)

where St(n-1) and χ²(n-1) denote the Student’s t and chi-square distributions with (n-1) degrees of freedom.
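
What makes (2)-(3) useful is that their distributions do not depend on the unknown μ and σ². A quick simulation (illustrative settings only, not part of the original post) makes the pivotal property of (2) concrete:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
mu, sigma, n, reps = 5.0, 2.0, 10, 50_000

# Simulate tau(X) = sqrt(n)(Xbar - mu)/s under the NIID Normal model (1).
x = rng.normal(mu, sigma, size=(reps, n))
xbar, s = x.mean(axis=1), x.std(axis=1, ddof=1)
tau = np.sqrt(n) * (xbar - mu) / s

# Its distribution matches Student's t with n-1 degrees of freedom, whatever
# the values of mu and sigma; that is what makes it a pivotal quantity.
for q in [-2.0, -1.0, 0.0, 1.0, 2.0]:
    print(f"P(tau <= {q:+.0f}): simulated {np.mean(tau <= q):.3f}, "
          f"St({n - 1}) gives {stats.t.cdf(q, df=n - 1):.3f}")
```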

The question of ‘how these inferential results might be affected when the Normality assumption is false’ was originally raised by Gosset in a letter to Fisher in 1923:

“What I should like you to do is to find a solution for some other population than a normal one.”  (Lehmann, 1999)

He went on to say that he tried the rectangular (uniform) distribution but made no progress, and he was seeking Fisher’s help in tackling this ‘robustness/sensitivity’ problem. In his reply that was unfortunately lost, Fisher must have derived the sampling distribution of τ(X), assuming some skewed distribution (possibly log-Normal). We know this from Gosset’s reply:

“I like the result for z [τ(X)] in the case of that horrible curve you are so fond of. I take it that in skew curves the distribution of z is skew in the opposite direction.”  (Lehmann, 1999)

After this exchange Fisher was not particularly receptive to Gosset’s requests to address the problem of working out the implications of non-Normality for the Normal-based inference procedures; t, chi-square and F tests.

In contrast, Egon Pearson shared Gosset’s concerns about the robustness of Normal-based inference results (a)-(b) to non-Normality, and made an attempt to address the problem in a series of papers in the late 1920s and early 1930s. This line of research for Pearson began with a review of Fisher’s 2nd edition of the 1925 book, published in Nature, and dated June 8th, 1929.  Pearson, after praising the book for its path breaking contributions, dared raise a mild criticism relating to (i)-(ii) above:

“There is one criticism, however, which must be made from the statistical point of view. A large number of tests are developed upon the assumption that the population sampled is of ‘normal’ form. That this is the case may be gathered from a very careful reading of the text, but the point is not sufficiently emphasised. It does not appear reasonable to lay stress on the ‘exactness’ of tests, when no means whatever are given of appreciating how rapidly they become inexact as the population samples diverge from normality.” (Pearson, 1929a)

Fisher reacted badly to this criticism and was preparing an acerbic reply to the ‘young pretender’ when Gosset jumped into the fray with his own letter in Nature, dated July 20th, in an obvious attempt to moderate the ensuing fight. Gosset succeeded in tempering Fisher’s reply, dated August 17th, but instead of addressing the ‘robustness/sensitivity’ issue, Fisher focused primarily on Gosset’s call to address ‘the problem of what sort of modification of my tables for the analysis of variance would be required to adapt that process to non-normal distributions’. He described that as a hopeless task. This is an example of Fisher’s genius when cornered by an insightful argument. He sidestepped the issue of ‘robustness’ to departures from Normality by broadening it – alluding to other possible departures from the IID assumptions – and rendering it a hopeless task by focusing on the call to ‘modify’ the statistical tables for all possible non-Normal distributions; there is an infinity of potential modifications!

Egon Pearson recognized the importance of stating explicitly the inductive premises upon which the inference results are based, and pressed ahead with exploring the robustness issue using several non-Normal distributions within the Pearson family. His probing was based primarily on simulation, relying on tables of pseudo-random numbers; see Pearson and Adyanthaya (1928, 1929), Pearson (1929b, 1931). His broad conclusions were that the t-test:

τ0(X) = |[√n(Xbar - μ0)/s]|, C1:={x: τ0(x) > cα},    (4)

for testing the hypotheses:

H0: μ = μ0 vs. H1: μ ≠ μ0,                                             (5)

is relatively robust to certain departures from Normality, especially when the underlying distribution is symmetric, but the ANOVA test is rather sensitive to such departures! He continued this line of research into his 80s; see Pearson and Please (1975).
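
A modern re-run of the kind of sampling experiment Pearson relied on is easy to set up. The sketch below (distributions, sample size, and number of replications are illustrative choices, not Pearson’s) estimates the actual type I error rate of the two-sided t-test in (4)-(5) at nominal α = 0.05 when the data are Normal, symmetric non-Normal, and skewed:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(11)

n, alpha, reps = 10, 0.05, 50_000
t_crit = stats.t.ppf(1 - alpha / 2, df=n - 1)

def actual_size(sampler, true_mean):
    """Monte Carlo estimate of the t-test's rejection rate under a true null."""
    x = sampler((reps, n))
    t = np.sqrt(n) * (x.mean(axis=1) - true_mean) / x.std(axis=1, ddof=1)
    return np.mean(np.abs(t) > t_crit)

cases = {
    "Normal":               (lambda s: rng.normal(0.0, 1.0, s), 0.0),
    "Uniform (symmetric)":  (lambda s: rng.uniform(-1.0, 1.0, s), 0.0),
    "Exponential (skewed)": (lambda s: rng.exponential(1.0, s), 1.0),
}
for name, (sampler, mean) in cases.items():
    print(f"{name:22s} actual size = {actual_size(sampler, mean):.3f}")
```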

Perhaps more importantly, Pearson (1930) proposed a test for the Normality assumption based on the skewness and kurtosis coefficients: a Mis-Specification (M-S) test. Ironically, Fisher (1929) provided the sampling distributions of the sample skewness and kurtosis statistics upon which Pearson’s test was based. Pearson continued sharpening his original M-S test for Normality, and his efforts culminated with the D’Agostino and Pearson (1973) test that is widely used today; see also Pearson et al. (1977). The crucial importance of testing Normality stems from the fact that it renders the ‘robustness/sensitivity’ problem manageable. The test results can be used to narrow down the possible departures one needs to worry about. They can also be used to suggest ways to respecify the original model.
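
That test is readily available today: for instance, scipy’s stats.normaltest implements the D’Agostino-Pearson combination of sample skewness and kurtosis. A minimal sketch with invented data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)

# Invented samples: one Normal, one skewed (log-normal).
normal_data = rng.normal(loc=10, scale=2, size=200)
skewed_data = rng.lognormal(mean=2, sigma=0.5, size=200)

for label, x in [("Normal sample", normal_data), ("Log-normal sample", skewed_data)]:
    k2, p = stats.normaltest(x)   # D'Agostino-Pearson statistic (skewness + kurtosis)
    print(f"{label:18s} statistic = {k2:7.2f}, p = {p:.4f}")
# A small p flags a departure from Normality; as discussed below, the result
# is then used to narrow down the departures that matter and to guide
# respecification, not just to register a rejection.
```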

After Pearson’s early publications on the ‘robustness/sensitivity’ problem Gosset realized that simulation alone was not effective enough to address the question of robustness, and called upon Fisher, who initially rejected Gosset’s call by saying ‘it was none of his business’, to derive analytically the implications of non-Normality using different distributions:

“How much does it [non-Normality] matter? And in fact that is your business: none of the rest of us have the slightest chance of solving the problem: we can play about with samples [i.e. perform simulation studies], I am not belittling E. S. Pearson’s work, but it is up to you to get us a proper solution.” (Lehmann, 1999).

In this passage one can discern the high esteem with which Gosset held Fisher for his technical ability. Fisher’s reply was rather blunt:

“I do not think what you are doing with nonnormal distributions is at all my business, and I doubt if it is the right approach. … Where I differ from you, I suppose, is in regarding normality as only a part of the difficulty of getting data; viewed in this collection of difficulties I think you will see that it is one of the least important.”

It’s clear from this that Fisher understood the problem of how to handle departures from Normality more broadly than his contemporaries. His answer alludes to two issues that were not well understood at the time:

(a) departures from the other two probabilistic assumptions (IID) have much more serious consequences for Normal-based inference than Normality, and

(b) deriving the consequences of particular forms of non-Normality on the reliability of Normal-based inference, and proclaiming a procedure enjoys a certain level of ‘generic’ robustness, does not provide a complete answer to the problem of dealing with departures from the inductive premises.

In relation to (a) it is important to note that the role of ‘randomness’, as it relates to the IID assumptions, was not well understood until the 1940s, when the notion of non-IID was framed in terms of explicit forms of heterogeneity and dependence pertaining to stochastic processes. Hence, the problem of assessing departures from IID was largely ignored at the time, focusing almost exclusively on departures from Normality. Indeed, the early literature on nonparametric inference retained the IID assumptions and focused on inference procedures that replace the Normality assumption with indirect distributional assumptions pertaining to the ‘true’ but unknown f(x), like the existence of certain moments, its symmetry, smoothness, continuity and/or differentiability, unimodality, etc. ; see Lehmann (1975). It is interesting to note that Egon Pearson did not consider the question of testing the IID assumptions until his 1963 paper.

In relation to (b), when one poses the question ‘how robust to non-Normality is the reliability of inference based on a t-test?’ one ignores the fact that the t-test might no longer be the ‘optimal’ test under a non-Normal distribution. This is because the sampling distribution of the test statistic and the associated type I and II error probabilities depend crucially on the validity of the statistical model assumptions. When any of these assumptions are invalid, the relevant error probabilities are no longer the ones derived under the original model assumptions, and the optimality of the original test is called into question. For instance, assuming that the ‘true’ distribution is uniform (Gosset’s rectangular):

Xk ∽ U(μ-a, μ+a),   k=1,2,…,n,…        (6)

where f(x;μ,a)=(1/(2a)), (μ-a) ≤ x ≤ (μ+a), a > 0,

how does one assess the robustness of the t-test? One might invoke its generic robustness to symmetric non-Normal distributions and proceed as if the t-test is ‘fine’ for testing the hypotheses (5). A more well-grounded answer will be to assess the discrepancy between the nominal (assumed) error probabilities of the t-test based on (1) and the actual ones based on (6). If the latter approximate the former ‘closely enough’, one can justify the generic robustness. These answers, however, raise the broader question of what are the relevant error probabilities? After all, the optimal test for the hypotheses (5) in the context of (6), is no longer the t-test, but the test defined by:

w(X)=|{(n-1)([X[1]+X[n]]-2μ0)}/{[X[n]-X[1]]}|∽F(2,2(n-1)),   (7)

with a rejection region C1:={x: w(x) > cα},  where (X[1], X[n]) denote the smallest and the largest element in the ordered sample (X[1], X[2],…, X[n]), and F(2,2(n-1)) the F distribution with 2 and 2(n-1) degrees of freedom; see Neyman and Pearson (1928). One can argue that the relevant comparison error probabilities are no longer the ones associated with the t-test ‘corrected’ to account for the assumed departure, but those associated with the test in (7). For instance, let the t-test have nominal and actual significance levels of .05 and .045, and power at μ1 = μ0+1 of .4 and .37, respectively. The conventional wisdom will call the t-test robust, but is it reliable (effective) when compared with the test in (7) whose significance level and power (at μ1) are, say, .03 and .9, respectively?
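
The comparison just described is easy to carry out by simulation. In the sketch below (sample size, shift, and the uniform half-width are illustrative choices), both tests are applied to data from (6); the t-test’s actual size stays close to the nominal α, while the test in (7) has considerably higher power at the alternative used here.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Data follow the uniform model (6); compare the Normal-based t-test in (4)
# with the test in (7).  All settings are illustrative.
n, a, mu0, mu1, alpha, reps = 20, 1.0, 0.0, 0.3, 0.05, 100_000

t_crit = stats.t.ppf(1 - alpha / 2, df=n - 1)
w_crit = stats.f.ppf(1 - alpha, dfn=2, dfd=2 * (n - 1))

def rejection_rates(mu):
    x = rng.uniform(mu - a, mu + a, size=(reps, n))
    t = np.abs(np.sqrt(n) * (x.mean(axis=1) - mu0) / x.std(axis=1, ddof=1))
    lo, hi = x.min(axis=1), x.max(axis=1)
    w = (n - 1) * np.abs(lo + hi - 2 * mu0) / (hi - lo)
    return np.mean(t > t_crit), np.mean(w > w_crit)

t_size, w_size = rejection_rates(mu0)          # actual sizes under H0
t_power, w_power = rejection_rates(mu1)        # powers at mu1
print(f"t-test  : actual size {t_size:.3f} (nominal {alpha}), power at mu1 {t_power:.3f}")
print(f"test (7): actual size {w_size:.3f} (nominal {alpha}), power at mu1 {w_power:.3f}")
```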

A strong case can be made that a more complete approach to the statistical misspecification problem is:

(i) to probe thoroughly for any departures from all the model assumptions using trenchant M-S tests, and if any departures are detected,

(ii) proceed to respecify the statistical model by choosing a more appropriate model with a view to account for the statistical information that the original model did not.

Admittedly, this is a more demanding way to deal with departures from the underlying assumptions, but it addresses the concerns of Gosset, Egon Pearson, Neyman and Fisher much more effectively than the invocation of vague robustness claims; see Spanos (2010).

References

Bartlett, M. S. (1981) “Egon Sharpe Pearson, 11 August 1895-12 June 1980,” Biographical Memoirs of Fellows of the Royal Society, 27: 425-443.

D’Agostino, R. and E. S. Pearson (1973) “Tests for Departure from Normality. Empirical Results for the Distributions of b₂ and √(b₁),” Biometrika, 60: 613-622.

Fisher, R. A. (1915) “Frequency distribution of the values of the correlation coefficient in samples from an indefinitely large population,” Biometrika, 10: 507-521.

Fisher, R. A. (1921) “On the “probable error” of a coefficient of correlation deduced from a small sample,” Metron, 1: 3-32.

Fisher, R. A. (1922a) “On the mathematical foundations of theoretical statistics,” Philosophical Transactions of the Royal Society A, 222, 309-368.

Fisher, R. A. (1922b) “The goodness of fit of regression formulae, and the distribution of regression coefficients,” Journal of the Royal Statistical Society, 85: 597-612.

Fisher, R. A. (1925) Statistical Methods for Research Workers, Oliver and Boyd, Edinburgh.

Fisher, R. A. (1929), “Moments and Product Moments of Sampling Distributions,” Proceedings of the London Mathematical Society, Series 2, 30: 199-238.

Neyman, J. and E. S. Pearson (1928) “On the use and interpretation of certain test criteria for purposes of statistical inference: Part I,” Biometrika, 20A: 175-240.

Neyman, J. and E. S. Pearson (1933) “On the problem of the most efficient tests of statistical hypotheses”, Philosophical Transactions of the Royal Society, A, 231: 289-337.

Lehmann, E. L. (1975) Nonparametrics: statistical methods based on ranks, Holden-Day, San Francisco.

Lehmann, E. L. (1999) “‘Student’ and Small-Sample Theory,” Statistical Science, 14: 418-426.

Pearson, E. S. (1929a) “Review of ‘Statistical Methods for Research Workers,’ 1928, by Dr. R. A. Fisher”, Nature, June 8th, pp. 866-7.

Pearson, E. S. (1929b) “Some notes on sampling tests with two variables,” Biometrika, 21: 337-60.

Pearson, E. S. (1930) “A further development of tests for normality,” Biometrika, 22: 239-49.

Pearson, E. S. (1931) “The analysis of variance in cases of non-normal variation,” Biometrika, 23: 114-33.

Pearson, E. S. (1963) “Comparison of tests for randomness of points on a line,” Biometrika, 50: 315-25.

Pearson, E. S. and N. K. Adyanthaya (1928) “The distribution of frequency constants in small samples from symmetrical populations,” Biometrika, 20: 356-60.

Pearson, E. S. and N. K. Adyanthaya (1929) “The distribution of frequency constants in small samples from non-normal symmetrical and skew populations,” Biometrika, 21: 259-86.

Pearson, E. S. and N. W. Please (1975) “Relations between the shape of the population distribution and the robustness of four simple test statistics,” Biometrika, 62: 223-241.

Pearson, E. S., R. B. D’Agostino and K. O. Bowman (1977) “Tests for departure from normality: comparisons of powers,” Biometrika, 64: 231-246.

Spanos, A. (2010) “Akaike-type Criteria and the Reliability of Inference: Model Selection vs. Statistical Model Specification,” Journal of Econometrics, 158: 204-220.

Student (1908), “The Probable Error of the Mean,” Biometrika, 6: 1-25.

Categories: phil/history of stat, Statistics, Testing Assumptions | Tags: , , , | 5 Comments

Blogging E.S. Pearson’s Statistical Philosophy

E.S. Pearson photo

E.S. Pearson

For a bit more on the statistical philosophy of Egon Sharpe (E.S.) Pearson (11 Aug, 1895-12 June, 1980), I reblog a post from last year. It gets to the question I now call: performance or probativeness?

Are frequentist methods mainly useful to supply procedures which will not err too frequently in some long run? (performance) Or is it the other way round: that the control of long-run error properties is of crucial importance for probing causes of the data at hand? (probativeness). I say no to the former and yes to the latter. This I think was also the view of Egon Pearson.

(i) Cases of Type A and Type B

“How far then, can one go in giving precision to a philosophy of statistical inference?” (Pearson 1947, 172)

Pearson considers the rationale that might be given to N-P tests in two types of cases, A and B:

“(A) At one extreme we have the case where repeated decisions must be made on results obtained from some routine procedure…

(B) At the other is the situation where statistical tools are applied to an isolated investigation of considerable importance…?” (ibid., 170)

In cases of type A, long-run results are clearly of interest, while in cases of type B, repetition is impossible and may be irrelevant:

“In other and, no doubt, more numerous cases there is no repetition of the same type of trial or experiment, but all the same we can and many of us do use the same test rules to guide our decision, following the analysis of an isolated set of numerical data. Why do we do this? What are the springs of decision? Is it because the formulation of the case in terms of hypothetical repetition helps to that clarity of view needed for sound judgment?

Or is it because we are content that the application of a rule, now in this investigation, now in that, should result in a long-run frequency of errors in judgment which we control at a low figure?” (Ibid., 173)

Although Pearson leaves this tantalizing question unanswered, claiming, “On this I should not care to dogmatize”, in studying how Pearson treats cases of type B, it is evident that in his view, “the formulation of the case in terms of hypothetical repetition helps to that clarity of view needed for sound judgment” in learning about the particular case at hand.

“Whereas when tackling problem A it is easy to convince the practical man of the value of a probability construct related to frequency of occurrence, in problem B the argument that ‘if we were to repeatedly do so and so, such and such result would follow in the long run’ is at once met by the commonsense answer that we never should carry out a precisely similar trial again.

Nevertheless, it is clear that the scientist with a knowledge of statistical method behind him can make his contribution to a round-table discussion…” (Ibid., 171).

Pearson gives the following example of a case of type B (from his wartime work), where he claims no repetition is intended:

“Example of type B. Two types of heavy armour-piercing naval shell of the same caliber are under consideration; they may be of different design or made by different firms…. Twelve shells of one kind and eight of the other have been fired; two of the former and five of the latter failed to perforate the plate….”(Pearson 1947, 171) 

“Starting from the basis that, individual shells will never be identical in armour-piercing qualities, however good the control of production, he has to consider how much of the difference between (i) two failures out of twelve and (ii) five failures out of eight is likely to be due to this inevitable variability…” (Ibid.)

As a noteworthy aside, Pearson shows that treating the observed difference (between the two proportions) in one way yields an observed significance level of 0.052; treating it differently (along Barnard’s lines), he gets 0.025 as the (upper) significance level. But in scientific cases, Pearson insists, the difference in error probabilities makes no real difference to substantive judgments in interpreting the results. Only in an unthinking, automatic, routine use of tests would it matter:

“Were the action taken to be decided automatically by the side of the 5% level on which the observation point fell, it is clear that the method of analysis used would here be of vital importance. But no responsible statistician, faced with an investigation of this character, would follow an automatic probability rule.” (ibid., 192)

The two analyses correspond to the tests effectively asking different questions, and if we recognize this, says Pearson, different meanings may be appropriately attached.
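
The general point, that the same 2×2 table can reasonably be treated in more than one way, with the resulting significance level landing on different sides of a conventional cut-off, is easy to reproduce. The sketch below applies two standard treatments to the shell data; they are not necessarily the two analyses Pearson himself used.

```python
import math
from scipy import stats

# Pearson's type B example as a 2x2 table: rows are the two shell types,
# columns are (failed to perforate, perforated).
table = [[2, 10],   # first type:  2 failures out of 12
         [5, 3]]    # second type: 5 failures out of  8

# Treatment 1: conditional exact (hypergeometric) one-sided p-value.
_, p_exact = stats.fisher_exact(table, alternative="less")

# Treatment 2: pooled two-proportion z approximation, one-sided.
n1, n2 = 12, 8
p1, p2, pooled = 2 / n1, 5 / n2, (2 + 5) / (n1 + n2)
z = (p1 - p2) / math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
p_approx = stats.norm.cdf(z)

print(f"conditional exact one-sided p : {p_exact:.3f}")
print(f"pooled z approximation        : {p_approx:.3f}")
```

As the quote above stresses, the difference matters only if the conclusion is to be read off automatically from which side of the 5% line the result falls.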

(ii) Three Steps in the Original construction of Tests

After setting up the test (or null) hypothesis, and the alternative hypotheses against which “we wish the test to have maximum discriminating power” (Pearson 1947, 173), Pearson defines three steps in specifying tests:

“Step 1. We must specify the experimental probability set, the set of results which could follow on repeated application of the random process used in the collection of the data…

Step 2. We then divide this set [of possible results] by a system of ordered boundaries…such that as we pass across one boundary and proceed to the next, we come to a class of results which makes us more and more inclined on the Information  available, to reject the hypothesis tested in favour of alternatives which differ from it by increasing amounts” (Pearson 1966a, 173).

“Step 3. We then, if possible[i], associate with each contour level the chance that, if [the null] is true, a result will occur in random sampling lying beyond that level” (ibid.).

Pearson warns that:

“Although the mathematical procedure may put Step 3 before 2, we cannot put this into operation before we have decided, under Step 2, on the guiding principle to be used in choosing the contour system. That is why I have numbered the steps in this order.” (Ibid. 173).

Strict behavioristic formulations jump from step 1 to step 3, after which one may calculate how the test has in effect accomplished step 2.  However, the resulting test, while having adequate error probabilities, may have an inadequate distance measure and may even be irrelevant to the hypothesis of interest. This is one reason critics can construct howlers that appear to be licensed by N-P methods, and which make their way from time to time into this blog.

So step 3 remains crucial, even for cases of type [B]. There are two reasons: pre-data planning—that’s familiar enough—but secondly, for post-data scrutiny. Post data, step 3 enables determining the capability of the test to have detected various discrepancies, departures, and errors, on which a critical scrutiny of the inferences is based. More specifically, the error probabilities are used to determine how well/poorly corroborated, or how severely tested, various claims are, post-data.

If we can readily bring about statistically significantly higher rates of success with the first type of armour-piercing naval shell than with the second (in the above example), we have evidence the first is superior. Or, as Pearson modestly puts it: the results “raise considerable doubts as to whether the performance of the [second] type of shell was as good as that of the [first]….” (Ibid., 192)[ii]

Still, while error rates of procedures may be used to determine how severely claims have/have not passed, they do not automatically do so—hence, again, opening the door to potential howlers that neither Egon nor Jerzy, for that matter, would have countenanced.

(iii) Neyman Was the More Behavioristic of the Two

Pearson was (rightly) considered to have rejected the more behaviorist leanings of Neyman.

Here’s a snippet from an unpublished letter he wrote to Birnbaum (1974) about the idea that the N-P theory admits of two interpretations: behavioral and evidential:

“I think you will pick up here and there in my own papers signs of evidentiality, and you can say now that we or I should have stated clearly the difference between the behavioral and evidential interpretations. Certainly we have suffered since in the way the people have concentrated (to an absurd extent often) on behavioral interpretations”.

In Pearson’s (1955) response to Fisher (blogged last time):

“To dispel the picture of the Russian technological bogey, I might recall how certain early ideas came into my head as I sat on a gate overlooking an experimental blackcurrant plot….!” (Pearson 1955, 204)

“To the best of my ability I was searching for a way of expressing in mathematical terms what appeared to me to be the requirements of the scientist in applying statistical tests to his data. After contact was made with Neyman in 1926, the development of a joint mathematical theory proceeded much more surely; it was not till after the main lines of this theory had taken shape with its necessary formalization in terms of critical regions, the class of admissible hypotheses, the two sources of error, the power function, etc., that the fact that there was a remarkable parallelism of ideas in the field of acceptance sampling became apparent. Abraham Wald’s contributions to decision theory of ten to fifteen years later were perhaps strongly influenced by acceptance sampling problems, but that is another story.” (ibid., 204-5).

“It may be readily agreed that in the first Neyman and Pearson paper of 1928, more space might have been given to discussing how the scientific worker’s attitude of mind could be related to the formal structure of the mathematical probability theory….Nevertheless it should be clear from the first paragraph of this paper that we were not speaking of the final acceptance or rejection of a scientific hypothesis on the basis of statistical analysis…. Indeed, from the start we shared Professor Fisher’s view that in scientific enquiry, a statistical test is ‘a means of learning”… (Ibid., 206)

“Professor Fisher’s final criticism concerns the use of the term ‘inductive behavior’; this is Professor Neyman’s field rather than mine.” (Ibid., 207)

__________________________

Aside: It is interesting, given these non-behavioristic leanings, that Pearson had earlier worked in acceptance sampling and quality control (from which he claimed to have obtained the term “power”).  From the Cox-Mayo “conversation” (2011, 110):

COX: It is relevant that Egon Pearson had a very strong interest in industrial design and quality control.

MAYO: Yes, that’s surprising, given his evidential leanings and his apparent distaste for Neyman’s behavioristic stance. I only discovered that around 10 years ago; he wrote a small book.[iii]

COX: He also wrote a very big book, but all copies were burned in one of the first air raids on London.

Some might find it surprising to learn that it is from this early acceptance sampling work that Pearson obtained the notion of “power”, but I don’t have the quote handy where he said this……

 

References:

Cox, D. and Mayo, D. G. (2011), “Statistical Scientist Meets a Philosopher of Science: A Conversation,” Rationality, Markets and Morals: Studies at the Intersection of Philosophy and Economics, 2: 103-114.

Pearson, E. S. (1935), The Application of Statistical Methods to Industrial Standardization and Quality Control, London: British Standards Institution.

Pearson, E. S. (1947), “The choice of Statistical Tests illustrated on the Interpretation of Data Classed in a 2×2 Table,” Biometrika 34(1/2): 139-167.

Pearson, E. S. (1955), “Statistical Concepts and Their Relationship to Reality,” Journal of the Royal Statistical Society, Series B (Methodological), 17(2): 204-207.

Neyman, J. and Pearson, E. S. (1928), “On the Use and Interpretation of Certain Test Criteria for Purposes of Statistical Inference, Part I,” Biometrika 20(A): 175-240.


[i] In some cases only an upper limit to this error probability may be found.

[ii] Pearson inadvertently switches from number of failures to number of successes in the conclusion of this paper.

[iii] I thank Aris Spanos for locating this work of Pearson’s from 1935.

Categories: phil/history of stat, Statistics | Tags: | Leave a comment
