As you enjoy the weekend discussion & concert in the Captain’s Central Limit Library & Lounge, your Tour Guide has prepared a brief overview of Excursion 3 Tour I, and a short (semi-severe) quiz on severity, based on exhibit (i).*
We move from Popper to a gallery on "Data Analysis in the 1919 Eclipse Tests of the General Theory of Relativity (GTR)" (3.1), which leads to the main gallery on the origin of statistical tests (3.2) by way of a look at where the main members of our statistical cast are in 1919: Fisher, Neyman, and Pearson. From the GTR episode, we identify the key elements of a statistical test: the steps in E.S. Pearson's opening description of tests in 3.2. The classical testing notions (Type I and II errors, power, consistent tests) are shown to grow out of requiring probative tests. The typical (behavioristic) formulation of N-P tests came later. The severe tester breaks out of the behavioristic prison. A first look at the severity construal of N-P tests is in Exhibit (i). Viewing statistical inference as severe testing shows how to do all that N-P tests do (and more) while a member of the Fisherian Tribe (3.3). We consider the frequentist principle of evidence FEV and the divergent interpretations called for by Cox's taxonomy of null hypotheses. The last member of the taxonomy, substantively based null hypotheses, returns us to the opening episode of GTR.
Key terms (incomplete; please send me yours)
GTR, eclipse test, ether effect, corona effect, PPN framework, statistical test ingredients, Anglo-Polish collaboration, Lambda criterion; Type I error, Type II error, power, P-value, unbiased tests, consistent tests, uniformly most powerful (UMP) tests; severity interpretation of tests, severity function, water plant accident; sufficient statistic; frequentist principle of evidence FEV; sensitivity achieved [same as attained power (att power)]; Cox's taxonomy (embedded, nested, dividing, testing assumptions), Nordtvedt effect, equivalence principle (strong and weak)
Semi-Severe Severity Quiz, based on the example in Exhibit (i) of Excursion 3
- Keeping to Test T+ with H0: μ ≤ 150 vs. H1: μ > 150, σ = 10, and n = 100, observed x = 152 (i.e., d = 2), find the severity associated with μ > 150.5 .
i.e., SEV100(μ > 150.5) = ________
- Compute 3 or more of the severity assessments for Table 3.2, with x = 153.
- Comparing n = 100 with n = 10,000: Keeping to Test T+ with H0: μ ≤ 150 vs. H1: μ > 150, σ = 10, change the sample size so that n = 10,000.
The 2SE rejection rule would now be: reject (i.e., “infer evidence against H0”) whenever X_bar > _____.
Assume x_bar just reaches this 2SE cut-off. (Added the previous sentence Dec. 10; I thought it was clear.) What’s the severity associated with inferring μ > 150.5 now?
i.e., SEV10,000(μ > 150.5) = ____
Compare with SEV100(μ > 150.5).
4. NEW. I realized I needed to include a “negative” result. Assume x = 151.5. Keeping to the same test with n = 100, find SEV100(μ ≤ 152).
5. If you’re following the original schedule, you’ll have read Tour II of Excursion 3, so here’s an easy question: Why does Souvenir M tell you to “relax”?
6. Extra Credit: supply some key terms from this Tour that I left out in the above list.
*The reference is to Mayo (2018, CUP): Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars.
1. SEV(μ > 150.5) is the worst-case (i.e., lowest) probability, under μ ≤ 150.5, of getting a test statistic that accords less well with “μ > 150.5” than does the observed x_bar = 152; the worst case is at the boundary μ = 150.5. So,
SEV(μ > 150.5) = Pr(X_bar ≤ 152; μ = 150.5) = 0.93
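These one-sided severity assessments are just normal tail areas, so they can be checked mechanically. A minimal sketch in Python (the helper name and default arguments are mine, not the book’s):

```python
from statistics import NormalDist

# Hypothetical helper (name/defaults are illustrative): severity for
# inferring mu > mu1 in Test T+, given observed x_bar, sigma, and n.
def sev_greater(mu1, x_bar, sigma=10, n=100):
    se = sigma / n ** 0.5
    # The worst case over mu <= mu1 is at the boundary mu = mu1:
    # Pr(X_bar <= observed x_bar; mu = mu1)
    return NormalDist(mu=mu1, sigma=se).cdf(x_bar)

print(round(sev_greater(150.5, 152), 2))  # 0.93
```

The same helper reproduces the entries of a table like Table 3.2 by varying `mu1` with `x_bar = 153`.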
3. In Q1 the standard error was 1; now the standard error is 0.1 and the 2SE rejection threshold is 150.2. Since x_bar = 150.2 sits below 150.5, SEV(μ > 150.5) = Pr(X_bar ≤ 150.2; μ = 150.5) = tiny ε (about 0.001): the just-significant result from the much larger sample gives far weaker grounds for inferring μ > 150.5 than in Q1.
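As a check on the direction of that tail area, the calculation in plain Python (stdlib only; nothing here is book-specific code):

```python
from statistics import NormalDist

se = 10 / 10_000 ** 0.5              # standard error = 0.1
cutoff = 150 + 2 * se                # 2SE rejection threshold = 150.2
# SEV(mu > 150.5) when x_bar just reaches the cutoff:
# Pr(X_bar <= 150.2; mu = 150.5)
sev = NormalDist(mu=150.5, sigma=se).cdf(cutoff)
print(round(sev, 5))  # 0.00135
```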
4. Back to SE = 1. SEV(μ ≤ 152) is the worst-case (i.e., lowest) probability, under μ > 152, of getting a test statistic that accords less well with “μ ≤ 152” than does the observed x_bar = 151.5; the worst case is at the boundary μ = 152. So,
SEV(μ ≤ 152) = Pr(X_bar > 151.5; μ = 152) = 0.69
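The negative-result case is the same kind of tail area, just in the other direction; a quick stdlib check:

```python
from statistics import NormalDist

se = 10 / 100 ** 0.5                 # back to standard error = 1
# SEV(mu <= 152) = Pr(X_bar > 151.5; mu = 152)
sev = 1 - NormalDist(mu=152, sigma=se).cdf(151.5)
print(round(sev, 2))  # 0.69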
5. I’m not there yet. Let’s see… probably something to do with how severity clears up confusions and makes equivocations that paper over disagreements between various tribes unnecessary.
So I have a question — a genuine question, mind, not a trick question or a rhetorical point-making one — regarding the application of severity reasoning in discrete settings. Suppose I have a measurement device that answers some binary yes/no question. The device is known to give correct answers 7 times in 10 and to give answers selected uniformly at random 3 times in 10.
The severity with which H passes a test with some test statistic, we are told, is the (worst-case, but that’s not relevant here) probability of getting a test statistic that accords less well with H than the one observed, supposing H to be false.
I have yet to see how to measure accordance in the discrete setting (perhaps it’s described in sections of the book I haven’t read yet). The question is: does accordance encompass being correct by happy chance in the 3 times in 10 that the measurement device gives a random answer? If so, severity is the proportion of times the answer is correct, 17 times in 20. But I suspect not, so I guess my real question is: is severity just the proportion of times the measurement device operates correctly?
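Not an answer to the accordance question, but the 17-in-20 arithmetic for the device as described can at least be checked by simulation. A sketch under my reading of the setup (all names are mine):

```python
import random

random.seed(0)

# Device as described: correct with probability 0.7; otherwise it answers
# uniformly at random, so it is still right half of that remaining time.
def device_answer(truth):
    if random.random() < 0.7:
        return truth
    return random.choice([True, False])

trials = 100_000
correct = sum(device_answer(True) for _ in range(trials))
print(correct / trials)  # close to 0.7 + 0.3 * 0.5 = 0.85, i.e., 17/20
```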