I attended a lecture by Aris Spanos to his graduate econometrics class here at Va Tech last week[i]. This course, which Spanos teaches every fall, gives a superb illumination of the disparate pieces involved in statistical inference and modeling, and affords clear foundations for how they are linked together. His slides follow the intro section. Some examples with severity assessments are also included.
Frequentist Hypothesis Testing: A Coherent Approach
1 Inherent difficulties in learning statistical testing
Statistical testing is arguably the most important, but also the most difficult and confusing chapter of statistical inference for several reasons, including the following.
(i) The need to introduce numerous new notions, concepts and procedures before one can paint, even in broad strokes, a coherent picture of hypothesis testing.
(ii) The current textbook discussion of statistical testing is both highly confusing and confused. There are several sources of confusion.
- (a) Testing is conceptually one of the most sophisticated sub-fields of any scientific discipline.
- (b) Inadequate knowledge by textbook writers, who often do not have the technical skills to read and understand the original sources and have to rely on second-hand accounts by previous textbook writers that are often misleading or just outright erroneous. In most of these textbooks hypothesis testing is poorly explained as an idiot’s guide to combining off-the-shelf formulae with statistical tables like the Normal, the Student’s t, the chi-square, etc., where the underlying statistical model that gives rise to the testing procedure is hidden in the background.
- (c) The misleading portrayal of Neyman-Pearson testing as essentially decision-theoretic in nature, when in fact the decision-theoretic approach has much greater affinity with Bayesian than with frequentist inference.
- (d) A deliberate attempt to distort and cannibalize frequentist testing by certain Bayesian drumbeaters who revel in (unfairly) maligning frequentist inference in their attempts to motivate their preferred view on statistical inference.
(iii) The discussion of frequentist testing is rather incomplete in so far as it has been beleaguered by serious foundational problems since the 1930s. As a result, different applied fields have generated their own secondary literatures attempting to address these problems, but often making things much worse! Indeed, in some fields like psychology it has reached the stage where one has to correct the ‘corrections’ of those chastising the initial correctors!
In an attempt to alleviate problem (i), the discussion that follows uses a sketchy historical development of frequentist testing. To ameliorate problem (ii), the discussion includes ‘red flag’ pointers (¥) designed to highlight important points that shed light on certain erroneous interpretations or misleading arguments. The discussion will pay special attention to (iii), addressing some of the key foundational problems.
[i] It is based on Ch. 14 of Spanos (1999) Probability Theory and Statistical Inference. Cambridge[ii].
[ii] You can win a free copy of this 700+ page text by creating a simple palindrome! https://errorstatistics.com/palindrome/march-contest/
What can be done to establish a distance measure when the sample space is a Cartesian product of discrete and continuous spaces?
Thanks Aris for letting me post your slides. I always found it interesting how much of the work on probability and statistical modeling in econometrics takes place before formal hypothesis testing. I’m glad you found the severity idea a bit more than a useful ‘rule of thumb’ in 1999. I’m looking forward to our seminar on philosophy of statistics in the spring.
I might mention to readers that these were quite informal notes to which Aris alluded over a couple of classes, as he spoke to a group of graduate students with whom he’d already covered fairly advanced econometric modeling over 14 weeks. (You might read it with a Greek accent.)
Readers: I think people forget the 2 central reasons N-P went beyond the likelihood ratio to consider its sampling distribution under various alternatives (something that comes out clearly in Spanos’ slides): (1) The LR, by itself, doesn’t mean the same thing in different cases. (2) In order to understand whether a high ratio “in support of” an alternative parameter value over a null value really means evidence for it, one needs to consider the probability LR > g (or LR < g) under various alternatives. (Consider, for example, data-dependent alternatives chosen to maximize the likelihood.)
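To see point (2) concretely, here is a small simulation (my own sketch, not from the slides; the Normal model, the sample size n = 25, the cutoff g = 3, and the fixed alternative 0.5 are all just illustrative choices). With X1,…,Xn ~ N(mu, 1) and H0: mu = 0, it asks how often the same ratio g is exceeded under the null when the alternative is fixed in advance versus when it is picked by the data as the MLE:

```python
# Sketch: how often does LR > g under H0, for a fixed vs. a
# data-dependent (likelihood-maximizing) alternative?
# Model: X_1,...,X_n ~ N(mu, 1), H0: mu = 0. All settings illustrative.
import numpy as np

rng = np.random.default_rng(0)
n, reps, g = 25, 100_000, 3.0

# Sample means of reps datasets generated under the null:
xbar = rng.normal(0.0, 1.0, size=(reps, n)).mean(axis=1)

# LR for a fixed alternative mu1 = 0.5 against mu0 = 0 (sigma = 1):
mu1 = 0.5
lr_fixed = np.exp(n * (mu1 * xbar - mu1**2 / 2))

# LR when the alternative is chosen after the fact as the MLE (mu1 = xbar):
lr_max = np.exp(n * xbar**2 / 2)

print("P(LR > g | H0), fixed alternative:    ", (lr_fixed > g).mean())  # ~0.05
print("P(LR > g | H0), maximized alternative:", (lr_max > g).mean())    # ~0.14
```

The ratio says “3 to 1” in both cases, but the probability of its arising erroneously roughly triples once the alternative is allowed to be picked by the data, which is exactly why N-P look at the sampling distribution of the LR rather than at the ratio alone.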
As an example, for the past several weeks a number of people have been sending me a paper by someone named Valen Johnson on “Uniformly Most Powerful Bayesian Tests”.
https://arxiv.org/abs/1309.4656
(Berger tried to steal, or rather succeeded in redefining, “error probabilities”, so it stands to reason someone would try to snatch “UMP” from the frequentist’s cookie jar. But just as with Berger’s error probability, UMP means something quite different here.) So now that our term is complete, I’ve had a look at the paper, and I see he wants to hold the Bayes ratio fixed (I forgot to mention that he gives .5 prior to the point(?) null and to the arrived-at alternative.)
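For concreteness, here is the algebra as I read it (a sketch only, assuming a point null mu0 against a point alternative mu1 > mu0, sigma known, and the .5/.5 priors just mentioned, so the posterior odds reduce to the Bayes factor, which here is just a likelihood ratio):

```latex
\mathrm{BF}_{10}(x)
  = \frac{L(\mu_1; x)}{L(\mu_0; x)}
  = \exp\!\left\{\frac{n}{\sigma^2}\left[(\mu_1-\mu_0)\bar{x}
      - \tfrac{1}{2}\left(\mu_1^2-\mu_0^2\right)\right]\right\},
\qquad
\mathrm{BF}_{10}(x) \ge \gamma
\;\Longleftrightarrow\;
\bar{x} \ge \frac{\mu_0+\mu_1}{2} + \frac{\sigma^2 \ln\gamma}{n(\mu_1-\mu_0)}.
```

Minimizing that cutoff over mu1, which as I understand it is how the “uniformly most powerful” alternative gets arrived at, gives mu1 = mu0 + sigma*sqrt(2 ln(gamma)/n), and at that choice the rejection cutoff for the sample mean coincides with mu1 itself; this mu1 is the mu′ in my answer below.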
It reminds me of Good’s Bayes/non-Bayes compromise, but for the moment I’m just wondering how he interprets a rejection (assuming we have set up one of the tests he recommends). Take his example of a one-sided (positive) Normal test (Ex. 3.2, p. 15) with sigma known. Here’s my question:
Does one take a rejection as evidence for the specific alternative against which the Bayes factor reaches his chosen gamma? Or does one just infer evidence for the composite non-null? I need to study it more carefully…. [ANSWER BELOW]
I felt let down to see him say, on p. 3, that his approach “provides a remedy to the two primary deficiencies of classical significance tests—their inability to quantify evidence in favor of the null hypothesis when the null hypothesis is not rejected, and their tendency to exaggerate evidence against the null when it is.” I would deny these, as several posts on this blog argue, but I’ll put off reacting until I get clearer on his interpretation of reject the null. Insights are very welcome.
[Yes (I asked him), he will take it as evidence for the alternative equal to the cut-off for rejection, call it mu′. But he’s not interested in the actual discrepancy indicated, saying it’s enough to have rejected the null. Odd. Anyway, the inference that the (population) discrepancy is as large as or larger than the alternative mu′ has passed a very insevere test: severity ~.5. That is, even if the true population mu is less than mu′, an observed difference as large as or larger than the one observed would occur ~50% of the time. I have now commented more fully below on Johnson.]
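Here is a quick numerical check of that severity claim (a sketch with illustrative values mu0 = 0, sigma = 1, n = 25, gamma = 10; the cutoff formula is the one sketched in my comment above, and scipy supplies the Normal cdf):

```python
# Severity of inferring "mu >= mu'" when the one-sided Normal test
# rejects with xbar right at the cutoff mu'. Illustrative values only.
import math
from scipy.stats import norm

mu0, sigma, n, gamma = 0.0, 1.0, 25, 10.0

# Cutoff (= the 'arrived-at' alternative) for rejecting when BF >= gamma:
mu_prime = mu0 + sigma * math.sqrt(2 * math.log(gamma) / n)
se = sigma / math.sqrt(n)

# SEV("mu >= mu'") = P(Xbar < xbar_obs ; mu = mu'), with xbar_obs = mu':
xbar_obs = mu_prime
sev = norm.cdf((xbar_obs - mu_prime) / se)

print(f"cutoff mu' = {mu_prime:.3f} (z = {mu_prime / se:.2f})")
print(f"severity of inferring mu >= mu': {sev:.2f}")  # 0.50
```

With the observed mean right at the cutoff, a mean that large or larger occurs half the time even when mu is no bigger than mu′, i.e., the inference to mu ≥ mu′ passes with severity only about .5, the ~50% figure in my answer above.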