I will continue to post mementos and, at times, short excerpts following the pace of one “Tour” a week, in sync with some book clubs reading Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (SIST or Statinfast 2018, CUP), e.g., Lakens. This puts us at Excursion 2 Tour I, but first, here’s a quick Souvenir (Souvenir C) from Excursion 1 Tour II:
Souvenir C: A Severe Tester’s Translation Guide
Just as in ordinary museum shops, our souvenir literature often probes treasures that you didn’t get to visit at all. Here’s an example of that, and you’ll need it going forward. There’s a confusion about what’s being done when the significance tester considers the set of all of the outcomes leading to a d(x) greater than or equal to 1.96, i.e., {x: d(x) ≥ 1.96}, or just d(x) ≥ 1.96. This is generally viewed as throwing away the particular x, and lumping all these outcomes together. What’s really happening, according to the severe tester, is quite different. What’s actually being signified is that we are interested in the method, not just the particular outcome. Those who embrace the LP make it very plain that data-dependent selections and stopping rules drop out. To get them to drop in, we signal an interest in what the test procedure would have yielded. This is a counterfactual and is altogether essential in expressing the properties of the method, in particular, the probability it would have yielded some nominally significant outcome or other.
When you see Pr(d(X) ≥ d(x0); H0), or Pr(d(X) ≥ d(x0); H1), for any particular alternative of interest, insert:
“the test procedure would have yielded”
just before the d(X). In other words, this expression, with its inequality, is a signal of interest in, and an abbreviation for, the error probabilities associated with a test.
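A quick illustration may help: a minimal Python sketch, assuming (the assumption is mine, not from the text) that d(X) is approximately standard Normal under H0, as with the {x: d(x) ≥ 1.96} example above. The error probability is then just the tail area that the test procedure would have yielded beyond the observed d(x0):

```python
from scipy.stats import norm

# Assumed for illustration: d(X) ~ N(0, 1) under H0, as in the
# {x: d(x) >= 1.96} example above. Then Pr(d(X) >= d(x0); H0) --
# read "the probability the test procedure would have yielded
# d(X) >= d(x0), were H0 true" -- is the upper tail area beyond
# the observed d(x0).
d_x0 = 1.96
error_prob = norm.sf(d_x0)   # survival function: Pr(Z >= 1.96)
print(round(error_prob, 3))  # ~0.025
```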
Applying the Severity Translation. In Exhibit (i), Royall described a significance test with a Bernoulli(θ) model, testing H0: θ ≤ 0.2 vs. H1: θ > 0.2. We blocked an inference from the observed difference d(x0) ≃ 3.3 to θ = 0.8 as follows. (Recall that x̄ = 0.53 and d(x0) ≃ 3.3.)
We computed Pr(d(X) > 3.3; θ = 0.8) ≃ 1.
We translate it as Pr(The test would yield d(X) > 3.3; θ = 0.8) ≃ 1.
We then reason as follows:
Statistical inference: If θ = 0.8, then the method would virtually always give a difference larger than what we observed. Therefore, the data indicate θ < 0.8.
(This follows for rejecting H0 in general.) When we ask: “How often would your test have found such a significant effect even if H0 is approximately true?” we are asking about the properties of the experiment that did happen. The counterfactual “would have” refers to how the procedure would behave in general, not just with these data, but with other possible data sets in the sample space.
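For anyone who wants to check that computation numerically, here is a sketch. The sample size isn't stated in this excerpt; n = 16 is an assumption of mine, chosen because it reproduces d(x0) = √n(x̄ − 0.2)/√(0.2(1 − 0.2)) ≃ 3.3 when x̄ = 0.53:

```python
from math import sqrt
from scipy.stats import binom

# Assumed for illustration: n = 16, since it reproduces the reported
# d(x0) = sqrt(n) * (0.53 - 0.2) / sqrt(0.2 * 0.8) ~= 3.3.
n, theta0, xbar = 16, 0.2, 0.53
d_x0 = sqrt(n) * (xbar - theta0) / sqrt(theta0 * (1 - theta0))  # = 3.3

# Pr(d(X) > 3.3; theta = 0.8): d(X) > 3.3 iff the success count
# exceeds n * 0.53 = 8.48, i.e., count >= 9.
prob = binom.sf(8, n, 0.8)  # sf(8) = Pr(count >= 9) under theta = 0.8
print(round(d_x0, 2), round(prob, 3))  # 3.3, ~0.993 -- i.e., ~1
```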
I have a question about the Birnbaum example, in which θ can take values in {0, 1, …, 100} and for which the probability model is X = θ with probability 1 if θ ≠ 0, and X uniformly distributed on {1, …, 100} if θ = 0.
Q1. If I pre-designate H1: θ = 37 to test against H2: θ = 0 and then observe r = 37, has H1 passed a severe test?
Q2. Does it matter how I made the choice to test θ = 37?
Q3. What if I pre-designate five values chosen at random, (say, 4, 24, 46, 63, 83; I literally just chose them at random) and declare that my test procedure is to claim H2: θ = 0 if I observe one of them — and I do. Has H2 passed a severe test? Why or why not?
Q3 is incomplete; the procedure is to claim H2: θ = 0 if I observe one of the 5 values and to claim “θ = r” if I don’t.
As the example says, there is error control with 2 predesignated hypotheses. But moving to the severe testing context, the inferences aren’t to points. It would fall under a non-exhaustive case, from the sound of it, so the answer is no.
So here are the severity criteria I’m looking at:
“A hypothesis H passes a severe test T with data x0 if,
(S-1) x0 accords with H, (for a suitable notion of accordance) and
(S-2) with very high probability, test T would have produced a result that accords less well with H than x0 does, if H were false or incorrect.
Equivalently, (S-2) can be stated:
(S-2)*: with very low probability, test T would have produced a result that accords as well as or better with H than x0 does, if H were false or incorrect.”
So in Q1, the inference is to one of the two possible values of θ that survive post-data. I look at the severity criteria and I see (with change of notation) that a severe test has been passed if:
(S-1): r = 37 accords with H1: θ = 37, (for a suitable notion of accordance) and
(S-2): with very low probability, my test procedure would have produced a result that accords as well as or better with H1 than r = 37 does, if H1 were false or incorrect.
(S-1) seems clearly satisfied. For (S-2), I consider ways that H1 could have been incorrect. The data rule out all possibilities except θ = 37 or θ = 0; under θ = 0, the probability that my test procedure would have produced a result that accords as well with H1 as r = 37 does (better isn’t possible) is 1/100 = 0.01, which is indeed very low.
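To check my arithmetic on that 0.01 figure, here’s a quick simulation sketch of the model (the helper name draw_x is just my own):

```python
import random

# The Birnbaum model: X = theta with probability 1 if theta != 0;
# X uniform on {1, ..., 100} if theta = 0.
def draw_x(theta, rng=random):
    return theta if theta != 0 else rng.randint(1, 100)

# Under theta = 0 (the only way H1: theta = 37 can be false once
# r = 37 rules out everything but 37 and 0), how often would the
# procedure yield the H1-according result r = 37?
trials = 100_000
hits = sum(draw_x(0) == 37 for _ in range(trials))
print(hits / trials)  # ~0.01 = 1/100, matching (S-2) above
```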
In Q3, I have r in the set of 5 values that I have pre-designated (uniformly at random) as indicating H2: θ = 0. For ease of notation, call that set of 5 values S. I look at the severity criteria and I see that a severe test has been passed if:
(S-1): the result “r is in S” accords with H2: θ = 0, (for a suitable notion of accordance) and
(S-2): with very low probability, my test procedure would have produced a result that accords as well as or better with H2 than does the result “r is in S”, if H2 were false or incorrect.
For (S-1) I’m not sure what a suitable notion of accordance might be; it seems like all possible values of r are equally in accord with θ = 0. For (S-2), I consider ways that H2 could have been incorrect. The data rule out all possibilities except θ = r or θ = 0; under θ = r, the probability that my test procedure (which, recall, is to choose 5 values at random for S) would have produced a result that accords as well with H2 as “r is in S” does (better isn’t possible) is 5/100 = 0.05, which is indeed very low.
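Likewise, a quick sketch to check the 0.05 figure, simulating the random choice of S (the helper name q3_indicates_h2 is mine):

```python
import random

# The Q3 procedure: pick S, a set of 5 values drawn at random from
# {1, ..., 100}, and claim H2: theta = 0 if X lands in S. If H2 is
# false (theta = r != 0), then X = r with probability 1, so the
# procedure wrongly indicates H2 exactly when r happens to fall in S.
def q3_indicates_h2(r, rng=random):
    S = rng.sample(range(1, 101), 5)
    return r in S

trials = 100_000
r = 37  # any fixed nonzero theta gives the same probability
rate = sum(q3_indicates_h2(r) for _ in range(trials)) / trials
print(rate)  # ~0.05 = 5/100, matching (S-2) above
```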
Just on Q1, other values for the parameter aren’t ruled out by the outcome. With the predesignated points there’s no trouble in inferring the likelihoodist’s comparative evidence claim. With one observation, it’s a weak indication and there’s no way to check assumptions. As for adding more and more predesignated hypotheses, it’s discussed in sections 6.3 and 6.4 of Mayo and Kruse: yes, there can be error control, though it diminishes as the hypotheses increase. https://www.phil.vt.edu/dmayo/personal_website/Mayo%20&%20Kruse%20Principles%20of%20inference%20and%20their%20consequences%20B.pdf
Q2: yes. On Q3, that doesn’t look like a good test statistic.
I don’t really see what you’re getting at. About to shut down tonight.
On Q1, other parameter values are indeed ruled out by the outcome. The probability of getting r = 37 when, say, θ = 44, is nil, because the model specifies Pr(X = 44; θ = 44) = 1, which immediately implies Pr(X ≠ 44; θ = 44) = 0.
Likelihood theory is not the topic I’m interested in here — just severity.
On Q2, could you maybe go into a little more detail?
On Q3, it absolutely is not a good statistic. The question is basically on what grounds an error statistician can say so.
Re: Q3,
This is why I think (I think!) S1 should be (as I mentioned in a recent post):
S1’: The test indicates H with high probability under H.
Your choice of stat has a fairly low prob of indicating H, under repeated sampling under H, no?
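Spelling out the arithmetic as I see it: under H2 (θ = 0), X is uniform on {1, …, 100}, and the Q3 procedure indicates H2 only when X lands in the 5-member set S, so Pr(test indicates H2; θ = 0) = 5/100 = 0.05. The test indicates H2 with low, not high, probability when H2 is true, so it fails S1’.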