I attended a lecture by Aris Spanos to his graduate econometrics class here at Va Tech last week[i]. This course, which Spanos teaches every fall, gives a superb illumination of the disparate pieces involved in statistical inference and modeling, and affords clear foundations for how they are linked together. His slides follow the intro section. Some examples with severity assessments are also included.

**Frequentist Hypothesis Testing: A Coherent Approach**

Aris Spanos

1 Inherent difficulties in learning statistical testing

Statistical testing is arguably the most important, but also the most difficult and confusing chapter of statistical inference for several reasons, including the following.

(i) The need to introduce numerous new notions, concepts and procedures before one can paint — even in broad brushes — a coherent picture of hypothesis testing.

(ii) The current textbook discussion of statistical testing is both highly confusing and confused. There are several sources of confusion.

- (a) Testing is conceptually one of the most sophisticated sub-fields of any scientific discipline.
- (b) Inadequate knowledge by textbook writers who often do not have the technical skills to read and understand the original sources, and have to rely on second hand accounts of previous textbook writers that are often misleading or just outright erroneous. In most of these textbooks hypothesis testing is poorly explained as an idiot’s guide to combining off-the-shelf formulae with statistical tables like the Normal, the Student’s t, the chi-square, etc., where the underlying statistical model that gives rise to the testing procedure is hidden in the background.
- (c) The misleading portrayal of Neyman-Pearson testing as essentially decision-theoretic in nature, when in fact the latter has much greater affinity with the Bayesian rather than the frequentist inference.
- (d) A deliberate attempt to distort and cannibalize frequentist testing by certain Bayesian drumbeaters who revel in (unfairly) maligning frequentist inference in their attempts to motivate their preferred view on statistical inference.

(iii) The discussion of frequentist testing is rather incomplete in so far as it has been beleaguered by serious foundational problems since the 1930s. As a result, different applied fields have generated their own secondary literatures attempting to address these problems, but often making things much worse! Indeed, in some fields like psychology it has reached the stage where one has to correct the ‘corrections’ of those chastising the initial correctors!

In an attempt to alleviate problem (i), the discussion that follows uses a sketchy historical development of frequentist testing. To ameliorate problem (ii), the discussion includes ‘red flag’ pointers (¥) designed to highlight important points that shed light on certain erroneous interpretations or misleading arguments. The discussion will pay special attention to (iii), addressing some of the key foundational problems.

[i] It is based on Ch. 14 of Spanos (1999) *Probability Theory and Statistical Inference*. Cambridge[ii].

[ii] You can win a free copy of this 700+ page text by creating a simple palindrome! https://errorstatistics.com/palindrome/march-contest/

What can be done to establish a distance measure when the sample space is a Cartesian product of discrete and continuous spaces?

Thanks Aris for letting me post your slides. I always found it interesting how much of the work on probability and statistical modeling in econometrics takes place before formal hypothesis testing. I’m glad you found the severity idea a bit more than a useful ‘rule of thumb’ in 1999. I’m looking forward to our seminar on philosophy of statistics in the spring.

I might mention to readers that these were quite informal notes to which Aris alluded over a couple of classes, as he spoke to a group of graduate students with whom he’d already covered fairly advanced econometric modeling over 14 weeks. (You might read it with a Greek accent.)

Readers: I think people forget the 2 central reasons N-P went beyond the likelihood ratio to consider its sampling distribution under various alternatives (something that comes out clearly in Spanos’ slides): (1) The LR, by itself, doesn’t mean the same thing in different cases. (2) In order to understand if a high ratio “in support” of an alternative parameter value, over a null value, really means evidence for it, one needs to consider the probability LR > g (or LR < g) under various alternatives. (Consider for example data-dependent alternatives chosen to maximize the likelihood).
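Point (2) can be made concrete with a small sketch. It assumes a Normal mean with known sigma, a point null mu = 0, and an alternative chosen after the fact to equal the observed mean (the maximally likely alternative). Under those assumptions the maximized ratio is exp(z²/2), so its sampling distribution under the null follows directly. The function name and the values of g are illustrative only.

```python
import math
from statistics import NormalDist

def prob_lr_exceeds(g: float) -> float:
    """P(LR > g) under the null mu = 0 when the alternative is set equal to
    the observed mean. For a Normal mean with known sigma,
    LR = exp(z**2 / 2) with z = sqrt(n) * x_bar / sigma, so the event
    LR > g is the same as |z| > sqrt(2 * ln g)."""
    if g <= 1:
        return 1.0  # the maximized ratio is (almost surely) above 1
    z_cut = math.sqrt(2 * math.log(g))
    return 2 * (1 - NormalDist().cdf(z_cut))

for g in (2, 5, 10, 20):
    print(f"g = {g:>2}: P(LR > g; H0) = {prob_lr_exceeds(g):.3f}")
```

The same ratio "in support" of the alternative thus arises with quite different frequencies under the null depending on g, which is the kind of information the ratio alone does not carry.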

As an example, for the past several weeks, a number of people had been sending me a paper by someone named Valen Johnson on “Uniformly Most Powerful Bayesian Tests”.

Click to access 1309.4656.pdf

(Well Berger ~~tried to steal~~ succeeded in redefining "error probabilities" so it stands to reason someone would try to snatch "UMP" from the frequentist's cookie jar. But just as with Berger's error probability, UMP means something quite different here.) So now that our term is complete, I’ve had a look at the paper, and I see he wants to hold the Bayes ratio fixed. (I forgot to mention that he gives .5 prior to the point(?) null and the arrived at alternative.) It reminds me of Good's Bayes-non-Bayes compromise, but for the moment, I'm just wondering how he interprets a rejection (assuming we have set up one of the tests he recommends). Take his example of a one sided (positive) Normal test (Ex. 3.2 p. 15) with sigma known. Here's my question:

Does one take a rejection as evidence for the specific alternative against which the Bayes Factor reaches his chosen gamma? Or does one just infer evidence for the composite non-null? I need to study it more carefully….[ANSWER BELOW]

I felt let down to see him say, on p. 3, that his approach “provides a remedy to the two primary deficiencies of classical significance tests—their inability to quantify evidence in favor of the null hypothesis when the null hypothesis is not rejected, and their tendency to exaggerate evidence against the null when it is.” I would deny these, as several posts on this blog argue, but I’ll put off reacting until I get clearer on his interpretation of reject the null. Insights are very welcome.

[Yes (I asked him) he will take it as evidence for the alternative equal to the cut-off for rejection, call it mu’. But he’s not interested in the actual discrepancy indicated, saying it’s enough to have rejected the null. Odd. Anyway, the inference that the (population) discrepancy is as large as or larger than the alternative mu’ has passed a very insevere test, with severity .5; i.e., even if the true population mu is less than mu’, an observed difference as large as or larger than what is observed would occur ~50% of the time. I have now commented more fully below on Johnson.]

Johnson’s paper has made a splash, but I haven’t bothered to pick it up — I’ve been much more involved in figuring out how SEV works when the sample space is as described in my question above.

Corey: I don’t get the question really…You might have to define the test you’re talking about.

An example would be optional stopping with the normal distribution (variance known). From the fact that the likelihood doesn’t depend on the sampling plan, it follows that the sufficient statistic is (x_bar, N) where x_bar is the data mean at stopping and N is the sample size at stopping. Presuming that the test will depend on the data only through the sufficient statistics, how should SEV be calculated?

Corey: We’d base it on the p-value from sequential trials. Armitage has various computations, as do I in Mayo and Kruse.

Mayo: SEV is a function in which the data are notionally fixed and the (boundary of a one-sided) hypothesis varies. That’s the computation I’m looking for.

In Mayo and Kruse I see a computation of the actual type I error rate when the trial stops once a nominal 0.05 rate boundary has been exceeded. That doesn’t really help me. Likewise, most of the literature on sequential testing concentrates on sequential stopping boundaries that minimize expected sample size and the like. The info I need to do a post-data SEV calculation isn’t there.
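For readers who have not seen that kind of computation, here is a rough Monte Carlo sketch of it. The stopping rule, the cap `max_n`, and all names are assumptions made for illustration: observations are N(mu, 1), sampling stops as soon as the nominal two-sided 0.05 boundary is crossed, and the trial is abandoned at `max_n`. It is not the calculation in Mayo and Kruse, only a stand-in with the same flavour.

```python
import math
import random

def optional_stopping_rejection_rate(max_n=100, z_nominal=1.96, mu=0.0,
                                     trials=2000, seed=1):
    """Estimate how often a 'keep sampling until significant' design rejects.
    Draw N(mu, 1) observations one at a time and stop as soon as
    |sqrt(n) * x_bar| exceeds z_nominal, or give up when n reaches max_n."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(trials):
        total = 0.0
        for n in range(1, max_n + 1):
            total += rng.gauss(mu, 1.0)
            # sqrt(n) * x_bar equals total / sqrt(n) when sigma = 1
            if abs(total) / math.sqrt(n) > z_nominal:
                rejections += 1
                break
    return rejections / trials

print("actual rejection rate under mu = 0:",
      optional_stopping_rejection_rate())
```

Passing a nonzero `mu` gives the rejection rate under a chosen discrepancy from the null, which shows how the sampling distribution at stopping depends on the plan and not only on the likelihood.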

In a fixed sample size setting, N-P optimality principles recommend the use of a particular statistic — the mean — and the SEV calculation has the nice property it only depends on the statistic used in the test; that nothing else about the test (i.e., alpha, null hypothesis) matters. But imagine for a moment that we’re *forced* by circumstances beyond our control to use an optional stopping design with a particular fixed stopping boundary. As far as I can tell, that “nice property” of SEV in the fixed sample size setting no longer holds.

I need a test procedure that I can invert to yield a rule that takes the observed statistic, (x_bar, N), and returns an entire rejection region (i.e., over all possible sample sizes) such that the observed statistic lies right on the boundary. Then — and only then — will I be able to do the SEV computation.

Why would that be the relevant SEV computation?

Maybe it isn’t; I’m still sorting this out. I do need (or want, anyway) a notion of “accordance with a given one-sided hypothesis” that provides a total order on the sample space of the sufficient statistic. The test procedure I referred to in my previous comment would provide such a total order; a Spanos-style distance measure would do the job too.

Corey: the severity evaluation is based on the distribution of the sample, and not the likelihood function. In the case of optional stopping the random variables making up the sample are no longer independent, and the distribution of the sample does reflect that [see Berger, 1985]. However, the term of the distribution of the sample that accounts for whether to draw one more observation after the k-th, does not depend on the unknown parameters and thus, it does not appear explicitly in the likelihood function; it’s part of the constant.

Spanos: Yes, I am aware of that. I brought up the likelihood function only for the purpose of determining the sufficient statistic via the Fisher-Neyman factorization theorem. The fact that the term accounting for optional stopping is free of the unknown parameter (only the mean is unknown; variance is assumed known) makes it immediately obvious that the sufficient statistic is still (x_bar, N), just as with the fixed sample size sampling plan.

I still don’t understand all the details of how to carry out the SEV calculation in optional stopping, but I am (deeply!) aware that the sampling distribution of the sufficient statistic is not the same as with the fixed sample size sampling plan.

Corey: you just answered your own question. The severity evaluation is based on the sampling distribution of x-bar under the optional stopping scheme for different discrepancies from the null.

Spanos: The sampling distribution of x_bar marginal of N? Okie-doke.

Spanos: Another scenario for you: normal deviate (call it X), unknown mean, variance chosen at random according to a two-point mixture at Var(X) = 1, Var(X) = 9, as follows: if abs(E(X)) < 10, Pr(Var(X) = 9) = 0.99; otherwise, Pr(Var(X) = 1) =0.99. The data is the random vector (X, Var(X)).

What is the distance measure? How is SEV calculated upon observing (x, var(X)) = (9, 9)?

Talk about cannibalization, Johnson’s game, very roughly, (Spanos and I have ascertained) is essentially this: the one-sided cut-off in a UMP Normal (positive) test of a mean

mu= 0 vs mu> 0

is of form: reject Ho iff x-bar > d*. If you reach d*, the max likely alternative is d* . Give it and the null a prior of .5, and get the Bayes factor in favor of d* associated with the rejection. But SEV(mu > x-bar) = .5.
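A minimal sketch of the arithmetic just described, assuming a Normal mean with known sigma and an observed mean sitting exactly at the cut-off d*. The function names and the illustrative values (n = 100, a 1.645 standard error cut-off) are not from Johnson's paper.

```python
import math
from statistics import NormalDist

def bayes_factor_at_cutoff(z_cut: float) -> float:
    """Bayes factor for the point alternative mu = d* against mu = 0 when
    x_bar = d* = z_cut * sigma / sqrt(n); for the Normal case this is
    exp(z_cut**2 / 2)."""
    return math.exp(z_cut ** 2 / 2)

def posterior_of_alternative(bf: float, prior_alt: float = 0.5) -> float:
    """Posterior probability of the point alternative given its prior."""
    post_odds = bf * prior_alt / (1 - prior_alt)
    return post_odds / (1 + post_odds)

def severity_mu_greater(x_bar, mu1, sigma, n):
    """SEV(mu > mu1) = P(X_bar <= x_bar; mu = mu1) for the one-sided test."""
    return NormalDist().cdf((x_bar - mu1) * math.sqrt(n) / sigma)

z_cut, sigma, n = 1.645, 1.0, 100      # illustrative values only
d_star = z_cut * sigma / math.sqrt(n)
bf = bayes_factor_at_cutoff(z_cut)
print("Bayes factor for d*:", round(bf, 2))
print("posterior of d* with .5 prior:", round(posterior_of_alternative(bf), 3))
print("SEV(mu > d*):", severity_mu_greater(d_star, d_star, sigma, n))
```

Whatever cut-off is chosen, the severity for inferring mu > d* from an outcome at d* stays at .5, while the Bayes factor grows with the cut-off.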

He cites Mayo and Spanos (2006) but completely overlooks how failure to reject a null provides evidence for ruling out various departures from the null.

The fact that his PNAS paper (revised standard for statistical evidence) begins by lumping together recent discussions of scientific misconduct, retractions of papers, and non reproducibility should make one wary. And the scapegoat for all of these is—you guessed it!–the significance test! Amazing.

‘Berger tried to steal “error probabilities”’

Mayo: I don’t know exactly what Berger did that you are characterizing as “steal[ing]”, but I will take a moment to object to any one person or collection of people laying claim to a turn of phrase as generic as “error probability”. I acknowledge that it is likely bizarre for you to hear the phrase used to refer to a Bayesian posterior probability that a particular claim is in error rather than a sampling probability of an hypothesis passing a test in error. Despite this, I know of no reason to privilege your particular meaning of the phrase as the only correct one.

Corey: It’s not a matter of ownership of words–and of course there was a jestiness to my remark in the comments– it’s a matter of causing confusion in terminology in an arena majorly beset by confusion. It isn’t problematic if someone says I’m going to define things this way, which is distinct from some other definitions. But if a leading statistician comes along and says, I will give you real frequentist error probabilities, I will give you what Neyman really wanted (but didn’t quite realize it), then it’s more than confusing (e.g., in J. Berger’s 2003 paper which I commented on). J. Berger is shrewd in his choice of words.*

*I changed “tried to steal” to “succeeded in redefining” 12/24/13

Click to access Berger%20Could%20Fisher%20Jeffreys%20and%20Neyman%20have%20agreed%20on%20testing%20with%20Commentary.pdf

What I mainly object to, and I think you’ll agree this is a fair objection, is when people announce that they are rescuing A’s from themselves by redefining their terms in a manner that A’s reject**. I don’t see frequentist error statisticians saying we are giving you real posterior probabilities. I’m no linguist. But it’s clearly an argument strategy….

I know there are many terms used in all sorts of ways, probability, likelihood, sampling distribution…(Carnap had like 3 different subscripts on probability to distinguish them, maybe we need that–though it’s ugly). If we wish to keep even moderately clear the contrasting features of two distinct systems (even granting they each have their uses) and we discover there are no words whose meaning isn’t shifting all about, then we have continual confusion.

**It’s back to the importance of giving “a generous interpretation” to a different view.

“What I mainly object to, and I think you’ll agree this is a fair objection, is when people announce that they are rescuing A’s from themselves by redefining their terms in a manner that A’s reject**.”

Mayo: I’d actually be cool with that provided that it comes with a very clear statement from the Rescuer acknowledging that A’s would not agree with the R’s approach and redefinition.

(If I recall correctly, Berger abstracted away from the historical Neyman and Fisher by stating that he was using those names to refer to a stylized characterization of those principals’ views. It seems to me that the point at which valid criticism can be leveled is at this attempted stylized characterization; the high-handed claim of rescue falls down because this characterization gets it wrong. — And if I recall correctly, this is the tack you took in your response.)

Part 1 on Johnson:

According to Valen Johnson, tests have no way to use a non-rejection of a null to provide information about a parameter! Not so. He next asserts that p-values overstate the evidence. Well, there’s an argument to that effect by Bayesians in relation to two-sided tests. I’ve argued against it (it requires a lump of prior on the point null, for starters, search this blog), but never mind, he refers to one-sided tests. Yet, when it comes to one-sided tests, it’s been shown in several places over many years that the p-value is the same as or much the same as the posterior probability (using a uniform prior)[i]. So it is rather strange that Johnson will deny those posteriors and instead give .5 to the null and to the data-dependent maximally likely alternative.

To be continued in part 2.

[i]Quoting from an earlier post: “As noted in Ghosh, Delampady, and Samanta (2006, p. 35), if we wish to reject a null value when “the posterior odds against it are 19:1 or more, i.e., if posterior probability of H0 is < .05” then the rejection region matches that of the corresponding test of H0, (at the .05 level) if that were the null hypothesis."

[i]https://errorstatistics.com/2012/04/25/3594/
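The one-sided agreement mentioned in Part 1 is easy to check numerically in the simplest case. The sketch below assumes a Normal mean with known sigma and an improper flat prior on mu, so that the posterior for mu is N(x_bar, sigma²/n). The function names and the example numbers are illustrative.

```python
import math
from statistics import NormalDist

def one_sided_p_value(x_bar, mu0, sigma, n):
    """P(X_bar >= x_bar; mu = mu0) for the test of mu <= mu0 vs mu > mu0."""
    z = (x_bar - mu0) * math.sqrt(n) / sigma
    return 1 - NormalDist().cdf(z)

def posterior_prob_null(x_bar, mu0, sigma, n):
    """P(mu <= mu0 | x_bar) under a flat prior, where the posterior for mu
    is Normal with mean x_bar and standard deviation sigma / sqrt(n)."""
    return NormalDist(mu=x_bar, sigma=sigma / math.sqrt(n)).cdf(mu0)

x_bar, mu0, sigma, n = 0.25, 0.0, 1.0, 50   # illustrative values only
print("p-value:           ", one_sided_p_value(x_bar, mu0, sigma, n))
print("posterior of null: ", posterior_prob_null(x_bar, mu0, sigma, n))
```

The two numbers coincide for any outcome in this setup, which is why a posterior cut-off of .05 for the null reproduces the .05-level rejection region.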

Part 2: Continuation of my comment on Johnson:

Never mind these problematic priors for now.

Johnson advocates moving up the sample size to satisfy his .005 p-level rather than .05, while ensuring a power of ~.8 for an “effect size” of .3 standard deviation units.

It’s our one-sided Normal iid test T+ of H0: μ ≤ μ0 against H1: μ > μ0; let μ0 = 0, σ = 1.

For p = .05, n = 69, whereas for p = .005, n = 130*.

(σ /√n) = .12, for n = 69 and (σ /√n) = .09 for n = 130.
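These sample sizes can be reproduced with the standard power formula for a one-sided Normal test, n = ((z_alpha + z_power)·σ/effect)². This is offered as a sketch of one way to get the figures, not necessarily the rule of thumb alluded to in the footnote. The function name is illustrative.

```python
import math
from statistics import NormalDist

def sample_size(alpha, power, effect, sigma=1.0):
    """Smallest n giving the requested power against mu = effect in a
    one-sided Normal test of mu <= 0 at level alpha (sigma known)."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha)   # cut-off in standard error units
    z_power = z.inv_cdf(power)       # quantile matching the desired power
    return math.ceil(((z_alpha + z_power) * sigma / effect) ** 2)

for alpha in (0.05, 0.005):
    n = sample_size(alpha, power=0.8, effect=0.3)
    print(f"alpha = {alpha}: n = {n}, sigma/sqrt(n) = {1 / math.sqrt(n):.2f}")
```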

Compare 2 results, one from each experiment, both statistically significant at the same .001 level (any level will do).

This is the ~3 standard deviation cut-off (.36 for n = 69 vs .27 for n = 130). This is a one-sided upper test T+, so take the .95 lower limit for each (any confidence level will do):

For the first, smaller experiment, the lower limit is μ > .16, but for the second, larger experiment, it’s μ > .12. Note: SEV(μ > .16) = .95 for the smaller test, and SEV(μ > .12) = .95 for the larger test (using an outcome just at the 3 standard deviation cut-off).

Thus, finding a result significant at a given p level indicates a larger discrepancy if it arose from the smaller experiment. As sample size increases, sensitivity increases, so smaller discrepancies are discerned.
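The comparison can be sketched as follows, assuming test T+ with σ = 1 and an outcome exactly 3 standard errors above the null in each experiment. Function names are illustrative. With unrounded standard errors the .95 lower bounds come out near .16 (n = 69) and .12 (n = 130); bounds one standard error above the null (about .12 and .09) would carry severity near .98.

```python
import math
from statistics import NormalDist

def severity(x_bar, mu1, sigma, n):
    """SEV(mu > mu1) for test T+: P(X_bar <= x_bar; mu = mu1)."""
    return NormalDist().cdf((x_bar - mu1) * math.sqrt(n) / sigma)

def lower_bound(x_bar, sigma, n, level=0.95):
    """The value mu1 for which SEV(mu > mu1) equals `level`."""
    return x_bar - NormalDist().inv_cdf(level) * sigma / math.sqrt(n)

for n in (69, 130):
    se = 1.0 / math.sqrt(n)
    x_bar = 3 * se                      # outcome just at the 3 SE cut-off
    mu_low = lower_bound(x_bar, 1.0, n)
    print(f"n = {n:>3}: x_bar = {x_bar:.3f}, "
          f".95 lower bound = {mu_low:.3f}, "
          f"SEV = {severity(x_bar, mu_low, 1.0, n):.3f}")
```

Whatever level is used, the bound for the smaller experiment sits further from the null, since the same number of standard errors corresponds to a larger observed difference.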

True, a 3 standard deviation effect with the larger test indicates a larger discrepancy than a 2 or 1.65 standard deviation difference with the smaller test.

And the bigger the discrepancy indicated by a given statistically significant result, the larger the effect it speaks of. Whether that would make it more replicable at the new, higher standard is not obvious. (I think it’s still ~.5.)** Most importantly, the error statistician already has a means to indicate the extent of discrepancy indicated by a statistically significant result, and that is what she should use.

* I can show you a rule of thumb for computing his sample sizes.

** Here is the blogpost where Stephen Senn notes:

“A more general form of this result is as follows. If the first trial yields p=α then the probability that a second trial will be significant at significance level α (and in the same direction as the first trial) is 0.5.”

https://errorstatistics.com/2012/05/10/excerpts-from-s-senns-letter-on-replication-p-values-and-evidence/
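One way to see where the 0.5 comes from is sketched below, assuming two trials of equal size, one-sided testing, and a uniform (flat) prior on the effect. Under those assumptions the predictive distribution of the second z-statistic given the first is Normal with mean z1 and variance 2. The function name is illustrative, and this is only a sketch of the result quoted, not Senn's own derivation.

```python
import math
from statistics import NormalDist

def replication_probability(p_first, alpha):
    """Predictive probability that a second, equal-sized trial is significant
    at one-sided level alpha, given that the first trial gave one-sided
    p-value p_first, under a flat prior on the effect."""
    z = NormalDist()
    z1 = z.inv_cdf(1 - p_first)       # z-statistic of the first trial
    z_alpha = z.inv_cdf(1 - alpha)    # cut-off for the second trial
    return z.cdf((z1 - z_alpha) / math.sqrt(2))

print(replication_probability(0.05, 0.05))    # first trial exactly at alpha
print(replication_probability(0.005, 0.05))   # a stronger first result
```

When the first trial lands exactly at α the numerator is zero, so the probability is one half for any α.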

Excellent! Adopting the higher standard offers a handy excuse for non-replication–they changed the standards on me! Could have met the older one.

Visitor: Well, it may merely be just as difficult (rather than harder)–at least for the next try. I’ll have to ask Senn about the probability of success in subsequent attempts to replicate.

Readers: Stephen Senn just confirmed this, and also sent the full letter which I didn’t have:

Click to access goodman.pdf

Mayo:

This last bit in your “part 2” comment is far too optimistic. I think in reality the probability of replicating at the observed alpha is far less than 50%. I clicked back to the Senn quote and he appears to be reporting a derivation assuming an (unrealistic) uniform prior distribution on effect sizes.

Readers: Having read Senn’s 2002 letter in full, it is evident that he exposes a serious and pervasive blunder in some of the best known of today’s horror stories about significance levels and lack of replication–at least one very big one of many. Questions 1 and 2 (p. 2440) are confused with question 3 (whose answer is indeed .5). The answers to Q1 and Q2 however, which are the ones of interest to us, are quite satisfactory for a significance tester. (They concern future results corroborating the observed result.)

Further Senn shows that if we are Bayesians and really do expect future p-values to be at least as small as an observed one, it would mean, not that the p-value overstates the evidence, but that it understates it! (p. 2442). Computations are given. Thanks so much Stephen!

Even though Stephen couldn’t have predicted in 2002 that this same fallacy would continue, in ever new forms, he is exactly spot on.

You write: “Senn shows that if we are Bayesians and really do expect future p-values to be at least as small as an observed one . . .”

Again, I’m not blaming Senn here, but a reasonable Bayesian calculation would show that future p-values are most likely *not* at least as small as observed ones, not if you condition on p<0.05 or any such threshold. It's the basic "regression to the mean" story. In our replication, the joint distribution of (p-value at time 1, p-value at time 2) is symmetric. Thus if you condition on low values of "p-value at time 1," you will find, on average, higher (less statistically significant) values of "p-values at time 2."

Andrew: Your point here seems in sync with the assignment of .5 probability to a “replication” of the observed p-value. No? What do you think a reasonable Bayesian calculation would be? (Of course Senn was trying to say why we ought not to regard the “evidential meaning” of the current data in terms of future replication probability.)
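The regression-to-the-mean point can be illustrated with a small simulation. Everything specific here is an assumption for the sake of the sketch: true effects are drawn from a N(0, tau²) prior in standard error units, both trials have the same size, and p-values are one-sided. The function name is hypothetical.

```python
import random
from statistics import NormalDist

def fraction_replications_as_small(tau=1.0, alpha=0.05,
                                   trials=50_000, seed=0):
    """Among first trials with one-sided p < alpha, estimate the fraction
    whose replication has a p-value at least as small (z2 >= z1)."""
    rng = random.Random(seed)
    z_alpha = NormalDist().inv_cdf(1 - alpha)
    selected = as_small = 0
    for _ in range(trials):
        theta = rng.gauss(0.0, tau)      # true effect drawn from the prior
        z1 = rng.gauss(theta, 1.0)       # first trial
        if z1 <= z_alpha:
            continue                     # keep only "significant" first trials
        z2 = rng.gauss(theta, 1.0)       # replication with the same effect
        selected += 1
        as_small += z2 >= z1
    return as_small / selected

print(fraction_replications_as_small(tau=1.0))
print(fraction_replications_as_small(tau=10.0))  # closer to a flat prior
```

With a concentrated prior the fraction falls well below one half, and it moves toward one half as the prior flattens, so the disagreement over ".5" is a matter of which prior is in play.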

A big source of confusion here is that attaining an “isolated p-value” is distinct from Fisher’s requirement for having genuine knowledge of an experimental effect. If the latter is a “warranted p-value”, then our expectations would differ. By imagining a scientist announces a result on the basis of a single small p-value, and suggesting it’s fine to continue such a robotic practice, so long as you lower the p-value, use these priors, & increase your sample size a bit, –all the wrong messages are given.

To clarify. Evidence is what it is. It would be most undesirable if evidence brought with it a guarantee of future evidence. It would then have a miraculous property of providing twice what you paid for and so on ad infinitum. We would not accept a lawyer in court producing a witness and then saying, ‘it is superfluous for me to produce any further witnesses since clearly, in view of what this witness has said, they will merely back the witness up. Therefore you can take it from me that we effectively have the testimony of several witnesses.’

My complaint with Steve Goodman’s original Statistics in Medicine article was the suggestion that the replication probability was central to understanding inferential meaning. This would require the P-value to be a) self referential (to be a probability statement about itself as a probability statement) and b) to apply to a future experiment that happens to be of exactly the same size as the current one. These are, quite frankly, bizarre requirements and as I pointed out in my note, Bayesian posterior probabilities are statements about future experiments of infinite size; they are not the same as Bayesian predictive statements. If you are going to make a Bayesian predictive statement why not about a minimal future experiment?

Stephen: thanks for the clarification, it’s really a great “letter” which we should discuss some more*. Now Gelman says .5 is actually too high, at least for his Bayesian assessment.

*The most important thing is the crucial distinction people miss between corroboration of a finding (or its having been well-tested), and predicting a probability assignment. I keep reading complaints that p-values “are irrelevant to inferential meaning” –alluding to replication probabilities–when in fact, “replication probabilities are not of direct relevance to inferential meaning.” (Senn 2002, p. 2439).

I think the idea of requiring frequentist statements to have Bayesian properties and vice versa should not be overdone. Many Bayesians would object to frequentists criticising Bayesian statements because they don’t have frequentist properties and I don’t see why frequentists can’t be just as annoyed with complaints that they don’t fulfill Bayesian properties.

Consider the following. If you have a meta-analysis of a very large number of trials measuring the same treatment effect and half of them are significant at the 5% level and half are not, you will have a minute P-value overall at the end. So a replication probability of 0.5 is perhaps not so bad after all.

Note however that future trials will not independently be significant with 50% probability. Further discussion of this point is given in my letter in Statistics in Medicine.
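A back-of-the-envelope sketch of the meta-analysis remark, conditional on a fixed true effect. It assumes k equal-sized trials, each with power one half at one-sided level α (so each z-statistic is centred on the cut-off), pooled with equal weights. It reports the p-value at the expected pooled z, not a simulation of actual trials, and the function name is illustrative.

```python
import math
from statistics import NormalDist

def pooled_p_value(k, alpha=0.05):
    """One-sided p-value at the expected pooled z-statistic when each of k
    equal-sized trials has expected z equal to the level-alpha cut-off."""
    z_alpha = NormalDist().inv_cdf(1 - alpha)
    z_pooled = z_alpha * math.sqrt(k)    # equal-weight pooling of k trials
    # erfc keeps precision in the far tail, where 1 - cdf would round to 0
    return 0.5 * math.erfc(z_pooled / math.sqrt(2))

for k in (1, 5, 20, 50):
    print(f"k = {k:>2}: pooled p-value ~ {pooled_p_value(k):.2e}")
```

Even a modest number of such trials, about half of them individually significant, drives the overall p-value to a minute value.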

Of course if Gelman has a lump of probability that the null is true then this will (in many cases) downweight significance from a Bayesian perspective, so the replication probability will not even be 50%, but there will be another Bayesian who disagrees with him so it ought to be clear that the difference is a property of the different prior distributions and not P-values per se.

Stephen:

I’d really like to know whether 50% is about what’s expected, and even pretty decent, for meta-analysis, as you say here and in your letter. Given the tiny overall p-value, it’s further proof that replication rates are quite distinct from evidence for an effect.

“Many Bayesians would object to frequentists criticising Bayesian statements because they don’t have frequentist properties and I don’t see why frequentists can’t be just as annoyed with complaints that they don’t fulfill Bayesian properties.”

Well I certainly am….but the strange thing about these newish Bayesian-style critiques of significance tests is that they’re sensible neither on Bayesian nor frequentist error statistical grounds. For the former: if you believe the Bayesian way is the way to go, why aren’t you giving posterior probabilities rather than cannibalizing (Spanos’s word) significance tests, maintaining that what they really ought to be doing is supplying reports of the % of nulls that are true given they are rejected? It’s a bizarre screening effort, wherein apparently each single isolated stat sig result is reported–as against Fisherian requirements–and then at the end of the year we somehow look at the % that have not been replicated (it’s assumed 50% of the nulls are true). It’s so remote from any real science that the whole thing would be dismissed as a bad joke if it weren’t actually taken seriously by the powers that be. (The gatekeepers who apparently control significance levels for robotic testers)*

Have you read Valen Johnson’s (2) papers?** I was actually assuming there was something there, given all the attention, but …well you can read my comments, and I’ll have more at some point.

*Since it’s becoming so common, I need a good label for these attempts for my book. Readers: send any that come to mind.

**Valen E. Johnson Uniformly Most Powerful Bayesian Tests, https://errorstatistics.files.wordpress.com/2013/12/johnson_unifromaly-most-powerful-bayesian-tests.pdf

Revised standards for statistical evidence:

Click to access pnas20131.pdf

Another point I sometimes make is that since P-values and posterior probabilities are not identical concepts one should not require their values to be identical. For instance if a P-value of 0.05 is the highest which you will consider gives evidence against the null one might argue that it should correspond to a posterior probability of just under 50% of the null being true (or perhaps a movement from being more probable to less probable).

I can make an analogy. A French chief of police favours a minimum height requirement for officers on the force of 175 cms. A British chief of police favours a minimum weight requirement of 175lbs. The British chief of police says this height requirement is ridiculous. What Jean-Pierre Frenchman does not realise is that 175cms doesn’t correspond to 175lbs and it is quite possible to be 175cms tall and weigh less than 140lbs.

It seems to me that Bayesians who criticise P-values because the number assigned to a Bayesian posterior probability (making various assumptions) does not correspond to the number assigned to a corresponding P-value are making a similar error.

Stephen: Yes, “agreement on numbers” isn’t all it’s cracked up to be. Your analogy is apt. The situation wouldn’t be so bad if, like pounds, the height measures in cms used by the French police meant the same thing, even within their own scrutiny of officers. Trouble is it can mean lots of different things: the number Pierre believes the person’s height to be, a measure of what estimate of height a rational agent would bet on, given all relevant information (having recently visited the Eiffel Tower), a function of the boost in height over the height of the chief of police prior to the time of hiring, the height that maximizes the missing information in an expected police employment questionnaire, the height judgment estimated to give at least .5 chance that the officer can frighten off a criminal while minimizing the cost of not squeezing into the average window in order to rescue someone from a fire, or, most commonly of all, merely an undefined mathematical concept that is plugged into a formula in order to determine whether the officer meets the height requirement.

I note that Johnson responds to Gelman and Robert on Gelman’s blog today: http://andrewgelman.com/2013/12/26/statistical-evidence-revised-standards/#respond

From one year ago today: Mayo and Spanos on the 13 well-worn criticisms of significance tests https://errorstatistics.com/2012/12/24/13-well-worn-criticisms-of-significance-tests-and-how-to-avoid-them/