Where Are Fisher, Neyman, Pearson in 1919? Opening of Excursion 3, snippets from 3.1

November Cruise

This first excerpt for November is really just the preface to 3.1. Remember, our abbreviated cruise this fall is based on my LSE Seminars in 2020, and since there are only 5, I had to cut. So those seminars skipped 3.1 on the eclipse tests of GTR. But I want to share snippets from 3.1 with current readers, along with reflections in the comments. (I promise, I’ve even numbered them below.)

Excursion 3 Statistical Tests and Scientific Inference

Tour I Ingenious and Severe Tests

[T]he impressive thing about [the 1919 tests of Einstein’s theory of gravity] is the risk involved in a prediction of this kind. If observation shows that the predicted effect is definitely absent, then the theory is simply refuted. The theory is incompatible with certain possible results of observation – in fact with results which everybody before Einstein would have expected. This is quite different from the situation I have previously described, [where] . . . it was practically impossible to describe any human behavior that might not be claimed to be a verification of these [psychological] theories. (Popper 1962, p. 36)

The 1919 eclipse experiments opened Popper’s eyes to what made Einstein’s theory so different from other revolutionary theories of the day: Einstein was prepared to subject his theory to risky tests.[1] Einstein was eager to galvanize scientists to test his theory of gravity, knowing the solar eclipse was coming up on May 29, 1919. Leading the expedition to test GTR was a perfect opportunity for Sir Arthur Eddington, a devout follower of Einstein as well as a devout Quaker and conscientious objector. Fearing “a scandal if one of its young stars went to jail as a conscientious objector,” officials at Cambridge argued that Eddington couldn’t very well be allowed to go off to war when the country needed him to prepare the journey to test Einstein’s predicted light deflection (Kaku 2005, p. 113).

The museum ramps up from Popper through a gallery on “Data Analysis in the 1919 Eclipse” (Section 3.1) which then leads to the main gallery on origins of statistical tests (Section 3.2). Here’s our Museum Guide:

According to Einstein’s theory of gravitation, to an observer on earth, light passing near the sun is deflected by an angle, λ, reaching its maximum of 1.75″ for light just grazing the sun, but the light deflection would be undetectable on earth with the instruments available in 1919. Although the light deflection of stars near the sun (approximately 1 second of arc) would be detectable, the sun’s glare renders such stars invisible, save during a total eclipse, which “by strange good fortune” would occur on May 29, 1919 (Eddington [1920] 1987, p. 113).

There were three hypotheses for which “it was especially desired to discriminate between” (Dyson et al. 1920, p. 291). Each is a statement about a parameter, the deflection of light at the limb of the sun (in arc seconds): λ = 0″ (no deflection), λ = 0.87″ (Newton), λ = 1.75″ (Einstein). The Newtonian predicted deflection stems from assuming light has mass and follows Newton’s Law of Gravity. The difference in statistical prediction masks the deep theoretical differences in how each explains gravitational phenomena. Newtonian gravitation describes a force of attraction between two bodies, while for Einstein gravitational effects are actually the result of the curvature of spacetime. A gravitating body like the sun distorts its surrounding spacetime, and other bodies are reacting to those distortions.

 

Where Are Some of the Members of Our Statistical Cast of Characters in 1919? In 1919, Fisher had just accepted a job as a statistician at Rothamsted Experimental Station. He preferred this temporary slot to a more secure offer by Karl Pearson (KP), which had so many strings attached – requiring KP to approve everything Fisher taught or published – that Joan Fisher Box writes: After years during which Fisher “had been rather consistently snubbed” by KP, “It seemed that the lover was at last to be admitted to his lady’s court – on conditions that he first submit to castration” (J. Box 1978, p. 61). Fisher had already challenged the old guard. Whereas KP, after working on the problem for over 20 years, had only approximated “the first two moments of the sample correlation coefficient; Fisher derived the relevant distribution, not just the first two moments” in 1915 (Spanos 2013a). Unable to fight in WWI due to poor eyesight, Fisher felt that becoming a subsistence farmer during the war, making food coupons unnecessary, was the best way for him to exercise his patriotic duty.

In 1919, Neyman is living a hardscrabble life in a land alternately part of Russia or Poland, while the civil war between Reds and Whites is raging. “It was in the course of selling matches for food” (C. Reid 1998, p. 31) that Neyman was first imprisoned (for a few days) in 1919. Describing life amongst “roaming bands of anarchists, epidemics” (ibid., p. 32), Neyman tells us, “existence” was the primary concern (ibid., p. 31). With little academic work in statistics, and “since no one in Poland was able to gauge the importance of his statistical work (he was ‘sui generis,’ as he later described himself)” (Lehmann 1994, p. 398), Polish authorities sent him to University College in London in 1925/1926 to get the great Karl Pearson’s assessment. Neyman and E. Pearson begin work together in 1926. Egon Pearson, son of Karl, gets his B.A. in 1919, and begins studies at Cambridge the next year, including a course by Eddington on the theory of errors. Egon is shy and intimidated, reticent and diffident, living in the shadow of his eminent father, whom he gradually starts to question after Fisher’s criticisms. He describes the psychological crisis he’s going through at the time Neyman arrives in London: “I was torn between conflicting emotions: a. finding it difficult to understand R.A.F., b. hating [Fisher] for his attacks on my paternal ‘god,’ c. realizing that in some things at least he was right” (C. Reid 1998, p. 56). As far as appearances amongst the statistical cast: there are the two Pearsons: tall, Edwardian, genteel; there’s hardscrabble Neyman with his strong Polish accent and small, toothbrush mustache; and Fisher: short, bearded, very thick glasses, pipe, and eight children. Let’s go back to 1919, which saw Albert Einstein go from being a little known German scientist to becoming an international celebrity.

As noted at the start, my LSE seminars skipped Section 3.1, but there are things in it I’d like to talk to readers about.

3.1 Statistical Inference and Sexy Science: The 1919 Eclipse Test

p. 121 …….I get the impression that statisticians consider there to be a world of difference between statistical inference and appraising large-scale theories in “glamorous” or “sexy science.” The way it actually unfolds, which may not be what you find in philosophical accounts of theory change, revolves around local data analysis and statistical inference. Even large-scale, sexy theories are made to connect with actual data only by intermediate hypotheses and models. To falsify, or even provide anomalies, for a large-scale theory like Newton’s, we saw, is to infer “falsifying hypotheses,” which are statistical in nature….

p. 122 …There are two key stages of inquiry corresponding to two questions within the broad umbrella of auditing an inquiry:

(i) is there a deflection effect of the amount predicted by Einstein as against Newton (the “Einstein effect”)?

(ii) is it attributable to the sun’s gravitational field as described in Einstein’s hypothesis?

A distinct third question, “higher” in our hierarchy, in the sense of being more theoretical and more general, is: is GTR an adequate account of gravity as a whole?…… Comment 3.1.1.

p. 123 …The problem in (i) is reduced to a statistical one: the observed mean deflections (from sets of photographs) are Normally distributed around the predicted mean deflection μ.

The proper way to frame this as a statistical test is to choose one of the values as H0 and define composite H1 to include alternative values of interest. For instance, the Newtonian “half deflection” can specify H0: μ ≤ 0.87, and the H1: μ > 0.87 includes the Einsteinian value of 1.75.
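To make the framing concrete, here is a minimal sketch of the one-sided Normal test of H0: μ ≤ 0.87 against H1: μ > 0.87. It assumes a known standard error for the observed mean deflection. The function name and the observed mean and standard error fed to it are hypothetical placeholders chosen for illustration; they are not the 1919 measurements.

```python
from statistics import NormalDist

NEWTON = 0.87    # Newtonian "half deflection" in arc seconds (from the text)
EINSTEIN = 1.75  # Einsteinian deflection in arc seconds (from the text)


def one_sided_normal_test(observed_mean, standard_error, null_value=NEWTON):
    """Test H0: mu <= null_value against H1: mu > null_value.

    Returns the standardized distance d of the observed mean from the
    null boundary, and the p-value Pr(D >= d) computed at mu = null_value.
    """
    d = (observed_mean - null_value) / standard_error
    p_value = 1.0 - NormalDist().cdf(d)
    return d, p_value


# Hypothetical inputs for illustration only; NOT the 1919 eclipse results.
d, p = one_sided_normal_test(observed_mean=1.60, standard_error=0.30)
print(f"d = {d:.2f} standard errors above the Newtonian value, p-value = {p:.4f}")
```

A small p-value here would indicate a deflection in excess of the Newtonian value; it would not, by itself, license inferring the precise Einsteinian value of 1.75.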

p. 124 …A text by Ghosh et al. (2010, p. 48) presents the Eddington results as a two-sided Normal test of H0: μ = 1.75 (the Einstein value) vs. H1: μ ≠ 1.75, with a lump of prior probability given to the point null. If any theoretical prediction were to get a lump at this stage, it is Newton’s. …Comment 3.1.2

p. 125 Some Popperian Confusions About Falsification and Severe Tests

Popper lauds GTR as sticking its neck out, bravely being ready to admit its falsity were the deflection effect not found (1962, pp. 36-7). Even if no deflection effect had been found in the 1919 experiments, it would have been blamed on the sheer difficulty in discerning so small an effect. This would have been entirely correct. Yet many Popperians, perhaps Popper himself, get this wrong. Listen to Popperian Meehl:

[T]he stipulation beforehand that one will be pleased about substantive theory T when the numerical results come out as forecast, but will not necessarily abandon it when they do not, seems on the face of it to be about as blatant a violation of the Popperian commandment as you could commit. For the investigator, in a way, is doing … what astrologers and Marxists and psychoanalysts allegedly do, playing ‘heads I win, tails you lose.’ (Meehl 1978, p. 821)

There is a confusion here, and it’s rather common. …Here’s how the severity requirement cashes this out…Comment 3.1.3

p. 127 Big Picture Inference: Can Other Hypotheses Explain the Observed Deflection?

Even to the extent that they had found a deflection effect, it would have been fallacious to infer the effect “attributable to the sun’s gravitational field.” The question (ii) must be tackled: A statistical effect is not a substantive effect. Addressing the causal attribution demands the use of the eclipse data as well as considerable background information. Here we’re in the land of “big picture” inference: the inference is “given everything we know”. In this sense, the observed effect is used and is “non-novel” (in the use-novel sense). Once the deflection effect was known, imprecise as it was, it had to be used. Deliberately seeking a way to explain the eclipse effect while saving Newton’s Law of Gravity from falsification isn’t the slightest bit pejorative – so long as each conjecture is subject to severe test. …
It’s Not How Plausible, but How Well Probed…

p. 129 Souvenir I: So What Is a Statistical Test, Really?

So what’s in a statistical test? First there is a question or problem, a piece of which is to be considered statistically, either because of a planned experimental design, or by embedding it in a formal statistical model. There are (A) hypotheses, and a set of possible outcomes or data; (B) a measure of accordance or discordance, fit, or misfit, between possible answers (hypotheses) and data; and (C) an appraisal of a relevant distribution associated with that measure. Since we want to tell what’s true about tests now in existence, we need an apparatus to capture them, while also offering latitude to diverge from their straight and narrow paths. Comment 3.1.4

…To read further, see Tour I Ex3 TI (full proofs) of Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (CUP, 2018)

Where you are in the journey:

Excursion 3: Statistical Tests and Scientific Inference

Tour I Ingenious and Severe Tests (p. 119)

YOU

3.1 Statistical Inference and Sexy Science: The 1919 Eclipse Test (p. 121)

3.2 N-P Tests: An Episode in Anglo-Polish Collaboration (p. 131)

3.3 How to Do All N-P Tests Do (and more) While a Member of the Fisherian Tribe (p. 146)

  • The full Itinerary is here.

Interested in joining us? Please email Jean Miller (jemille6@vt.edu), with your info, and she will send you a clean copy of the monthly materials.

2 thoughts on “Where Are Fisher, Neyman, Pearson in 1919? Opening of Excursion 3, snippets from 3.1”

  1. My comments on numbered comments in this post:

    Comment 3.1.1. Skepticism about the relationship between local statistical inquiries and scientific questions and theories is related to the supposition that there is a single, unified logic relating data to inferences and theories, e.g., Bayesian updating. Formal statistical inference generally relates to deliberately simplified, toy hypotheses, and does not automatically connect to substantive claims. Viewing learning from data as a complex series of piecemeal problems, and iterated checks on earlier, provisional inferences, however, gives homes to deliberately simplified probes (if carried out intelligently). It’s not that formal statistical methods operate at each level, but rather that statistical reasoning enters, at different levels, to tackle problems due to variability, errors and biases.

    Comment 3.1.2. compares an error statistical vs. Bayesian reconstruction of the eclipse tests. A spike prior on the GTR value in 1919 does not align with the tests carried out. In general, standard error statistical methodology would not use the artificial point null (popular among Bayesians) unless there was no interest in the direction and a statistically significant difference is to be interpreted as “there’s some effect,” without saying anything about the direction. If there isn’t a direction of interest, David Cox recommends two one-sided tests. Point values can be inferred with severity, however, by the manner of improving discrepancies ruled out, together with background information (see the case of inferring the equivalence principle in relativistic physics, p. 161). Another approach is via equivalence tests.

  2. Comment 3.1.3:

    I quote Popperian Meehl:

    [T]he stipulation beforehand that one will be pleased about substantive theory T when the numerical results come out as forecast, but will not necessarily abandon it when they do not, seems on the face of it to be about as blatant a violation of the Popperian commandment as you could commit. For the investigator, in a way, is doing . . . what astrologers and Marxists and psychoanalysts allegedly do, playing ‘heads I win, tails you lose.’ (Meehl 1978, p. 821)

    There is a confusion here, and it’s rather common. A successful result may rightly be taken as evidence for a real effect H, even though failing to find the effect would not, and should not, be taken to refute the effect, or as evidence against H. This makes perfect sense if one keeps in mind that a test might have had a low probability of detecting the effect, even if it exists.

    I count myself an enthusiastic member of the Paul Meehl fan club: He was brilliant, and few of today’s criticisms of simple Fisherian tests are not already spelled out in his works. But he also erred in some of his arguments, such as this one. It would not be “unPopperian” in 1919 to have denied that failing to detect the deflection effect was evidence against GTR. The probability of a no-show is high even if the deflection effect exists, at least with the tools of the day. In fact, all eclipse tests gave highly imprecise results. The point really reflects the asymmetry of falsification and corroboration. (See the discussion in SIST, p. 125).

    One of the values of philosophy of science is its providing tools to critically evaluate arguments even if made by high priests.
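    The point about a test having a low probability of detecting an effect, even if it exists, can be illustrated numerically. Below is a minimal sketch, assuming a one-sided Normal test of "no deflection" (μ ≤ 0) with a known standard error, that computes the probability of a statistically significant result when the true deflection is the Einstein value of 1.75. The function name, the cutoff level, and both standard errors are hypothetical choices for illustration, not figures from the 1919 expeditions.

```python
from statistics import NormalDist


def detection_power(mu_alt, mu_null, standard_error, alpha=0.025):
    """Probability that a one-sided Normal test of H0: mu <= mu_null
    rejects at level alpha when the true mean is mu_alt."""
    z = NormalDist()
    cutoff = z.inv_cdf(1.0 - alpha)              # rejection threshold in SE units
    shift = (mu_alt - mu_null) / standard_error  # true effect in SE units
    return 1.0 - z.cdf(cutoff - shift)


EINSTEIN = 1.75  # arc seconds (from the post)

# Hypothetical standard errors: one very imprecise instrument, one precise one.
for se in (1.5, 0.3):
    power = detection_power(EINSTEIN, 0.0, se)
    print(f"SE = {se}: probability of detecting the effect = {power:.3f}")
```

    With the imprecise (hypothetical) instrument, a no-show is more likely than not even though the effect is real, so failing to find the effect would be poor evidence against it. A successful detection by the same test could still count as evidence for the effect.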
