I am grateful to A. Spanos for letting me post a link to his comments on a paper S. Senn shared last week. You can find a pdf of his comments here.
You can read the original Barnard and Copas (2002) article here.
@Aris; could you define d() please? (My guess is it’s the t-test t-statistic, but I’d like to be sure). Also, can you indicate why the probability in your table does not condition on the variance of observations X?
Thanks Aris, for this interesting comment. Severity is a probability that is a function of the parameter. So it is a probability that has somehow become attached to the parameter while fixing the statistic. This is very reminiscent of Fisher’s fiducial probability. What we know, however, about the history of fiducial probability is that it became difficult and controversial to apply when having to deal with nuisance parameters. So I think that Guest’s question is interesting. What is the severity approach to nuisance parameters?
Stephen Senn sent his take on computations relating to Spanos’s post: http://www.phil.vt.edu/dmayo/personal_website/Mathcad%20-%20Cushny%20and%20Spanos.pdf “To Investigate a Blogpost of Aris Spanos’s”. I haven’t gone through the calculations, but I note that Senn introduces a term I had never heard before: veresity. Maybe I should use this*. One thing I have to remind readers about severity is that it must be understood and computed relative to a particular hypothesis/claim (not limited to those initially set out in tests, say), a test and an outcome of the test. My own recommendation for identifying these parts correctly is to consider how the claim one is entertaining might be false, or how it may be an error to regard the data as evidence for/against a given claim. The notion grew as a way to avoid the classic howlers against significance tests (noted in earlier posts), and also to adequately capture the Popperian notion (as Popper did not). (See my Popper posts.)
*The reason I don’t think so is that, among other things, if H is accorded low severity, it does not entail that not-H gets high severity, and certainly it is not accorded high truthiness (whatever that means). It really does call for a distinct logic; probabilistic considerations enter to evaluate how severely hypotheses have and have not passed, but we have no need to imagine this yields a probability (or degree of belief or support) attached to the hypotheses themselves. This is actually in sync with ordinary claims of appraising warranted inference and well-testedness. But it requires a bit of patience to suspend the standpoint of “probabilists” and “logicists” (Hacking).
An interesting article at: http://www.significancemagazine.org/details/magazine/1475273/The-formula-that-killed-Wall-Street.html. My question to Aris is how might M-S testing have contributed to averting problems in the financial markets described in the article. It appears to me that naive acceptance of assumptions was costly. It also seems that the approach advocated by Aris could have led to a different outcome.
In the early 1990s I applied model validation, using thorough M-S testing, to most of the fundamental theories of financial economics, including the Capital Asset Pricing Model (CAPM), the Arbitrage Pricing Theory (APT), the Efficient Market Hypothesis (EMH), the Black-Scholes pricing etc., etc. The general conclusion was that almost every theory or hypothesis making up modern financial economics was based on statistically misspecified models, calling seriously into question the validity of these theories. When I respecified the original statistical models these theories were implicitly assuming, with a view to finding statistically adequate models, it became clear that the Student’s t distribution was a lot more appropriate than the Normal upon which the original statistical models were based. This led me to introduce several new statistical models stemming from the multivariate Student’s t distribution that gave rise to statistically adequate models for a lot of the financial data. The main reason for that move was that most such data exhibit leptokurticity (fat tails) and second order temporal dependence (periods of small changes in volatility are followed by periods of large changes); the Student’s t distribution can model both regularities. That enabled me to use the respecified and statistically adequate models based on the Student’s t distribution to test the validity of several key financial theories. Note that no evidence for or against a theory can be inferred on the basis of a statistically misspecified model. The general result was that almost all these theories turned out to be seriously false when tested against real-world data! To get some idea of what I’m talking about, one should look at my 1995 paper where I use exchange rate data to falsify the Efficient Market Hypothesis:
Spanos, A. (1995), “On theory testing in Econometrics: modeling with nonexperimental data”, Journal of Econometrics, 67: 189-226.
In addition to this, I published several other papers where I falsified various other financial theories.
Thank you for your interesting comments. To me, Li’s model looks like another case of basing a ‘probability’ as an answer on a weak or erroneous foundation of data. Once the model was presented, no responsibility was taken for ensuring sound method. Was Li’s model rightly viewed as a Bayesian approach?
Li’s Gaussian copula function is another model (evaluation device for the joint probability of default) that was based on totally erroneous empirical foundations, because it was never validated properly! As I argued above, the Gaussian (Normal) distribution is clearly the wrong distribution to “model” the regularities in asset price data because it cannot account for the two key statistical regularities that determine volatility (risk): the fat tails and the second order temporal dependence!
From my perspective, whether it was modeled in the context of a Bayesian or a frequentist framework is not particularly important because both approaches invoke statistical models whose inadequacy vis-a-vis the data will call into question both Bayesian and frequentist inferences!
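The fat-tails point can be made concrete with a small simulation. The sketch below is only an illustration, not the modeling procedure described in these comments: it compares how often a standard Normal and a variance-matched Student's t exceed three standard deviations. The degrees of freedom (nu = 5), the threshold, the seed and all function names are assumptions chosen for the example.

```python
import math
import random


def student_t_draw(rng, nu):
    """Draw from Student's t with nu degrees of freedom as Z / sqrt(chi2_nu / nu)."""
    z = rng.gauss(0.0, 1.0)
    chi2 = rng.gammavariate(nu / 2.0, 2.0)  # chi-square(nu) = Gamma(shape=nu/2, scale=2)
    return z / math.sqrt(chi2 / nu)


def tail_fraction(draws, threshold):
    """Fraction of draws whose absolute value exceeds the threshold."""
    return sum(abs(x) > threshold for x in draws) / len(draws)


rng = random.Random(0)   # fixed seed so the illustration is repeatable
n_draws = 200_000
nu = 5                   # hypothetical degrees of freedom (must exceed 2 here)

# Rescale the t draws to unit variance so both samples share the same scale.
t_scale = math.sqrt(nu / (nu - 2.0))
normal_draws = [rng.gauss(0.0, 1.0) for _ in range(n_draws)]
t_draws = [student_t_draw(rng, nu) / t_scale for _ in range(n_draws)]

normal_tail = tail_fraction(normal_draws, 3.0)
t_tail = tail_fraction(t_draws, 3.0)
print(f"P(|X| > 3 sd): Normal ~ {normal_tail:.4f}, Student's t({nu}) ~ {t_tail:.4f}")
```

Under these assumptions the Normal fraction should land near the familiar 0.0027, while the Student's t fraction comes out several times larger even though both samples have the same variance. A Normal-based model would therefore understate how often extreme changes occur.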
Aris: Given these points, and given what I’ve studied of your PR approach to m-s testing over the past years, I’m especially puzzled by the earlier comments by Hennig and Guest and perhaps other readers, alleging that m-s tests have little probative capacity to find flaws (comments to March 14, 2011 post). Of course the only relevant assessment of a general m-s testing procedure for detecting a given violation would be to actually evaluate how good a job it does in leading to the inferred model, at least with respect to the given violations–as opposed to looking at a single test, to see if it works this time. If a method contains within it self-critical constraints, it is misleading to evaluate its reliability imagining these constraints were absent. In those cases where an m-s test wasn’t very capable of discriminating the presence or absence of a given violation, then a method that discerns this is a point in its favor. Investigating the impact on intended use is of further value. It is silly to talk as if scientifically valuable tools are one-shot and one-size, and when they don’t get the answer they are looking for right away, are open to a warranted challenge. (Prove to me I could have done a better job than I did!) But everyone knows that good science is never done that way, so why should it be different in the case of statistical model checking? You doubtless have more specific responses.
I totally agree with your point concerning allegations that M-S tests have little probative capacity to find departures! Like other frequentist tests, some M-S tests have low power to detect discrepancies from the null, but have wider probing capacity. These tests are often referred to as omnibus tests and they are usually nonparametric. In practice, one should never rely exclusively on such tests because of their limited capacity to detect departures from the model assumptions. Such tests should always be supplemented with directional (parametric) M-S tests that have high power in specific directions of departure. In a previous reply to a “Guest” [March 14 post] I explained the role of graphical techniques in aiding the choice of such directional tests.
As you indicate, slogans like “all M-S tests are created equal” and “one size fits all” will be calamitous for the effectiveness of M-S testing in practice!
In general, to avoid being misled one should adopt a number of common sense strategies with a view to enhance the effectiveness and reliability of the M-S diagnosis. These strategies include:
1. Judicious combinations of omnibus (nonparametric), directional (parametric) and simulation-based M-S tests, probing as broadly and as far away from the null as possible, and upholding dissimilar combinations of model assumptions.
2. Astute ordering of M-S tests so as to exploit the interrelationship among the model assumptions with a view to “correct” each other’s diagnosis.
3. Joint M-S tests (probing several assumptions simultaneously) designed to avoid ‘erroneous’ diagnoses as well as minimize the maintained assumptions.
Perhaps I should start by saying that I agree that the severity assessment here delivers the most useful information.
1) I, too, miss an explanation of how you handled the unknown variance in your original posting. In principle the severity is a function of the pair of location and variance, not the location alone, isn’t it? (Or am I blind and you addressed this somewhere here already?)
2) Regarding M-S tests, I’m all in favour of using “judiciously combined common sense strategies” but if people who like objectivity want to evaluate how well this works and what a good strategy actually looks like, it is necessary to specify precisely (and pre-data) what is recommended to be done including later conditional decisions. Only then can it be simulated or even evaluated theoretically. (Admittedly, afterwards one can probably still do better incorporating graphics and more flexibility.)
Aris: 1) sigma is not known here, is it? The noncentrality parameter then depends on sigma. So the question was, how did you choose that? Maybe the answer is hidden somehow in the definition of d, which is used but not defined (as the guest already had pointed out, but he didn’t get a reply from you).
2) OK, conditional decisions only have to be taken if at some point the model is rejected and one goes on to choose and test another one. Still I find it disappointing that you are reluctant to unambiguously define the whole procedure because as we hopefully all know, analyses conditionally on the results of prior M-S tests are rarely equivalent (I know they sometimes are!) to unconditional analyses and I still insist that one would like to analyse the impact and the error probabilities.
Is this so problematic? Couldn’t you just write down for example in the case above with 10 observations assumed to be i.i.d. normally distributed, which tests you would carry out at what level so that one could formally define the set of datasets for which you’d finally stick to this assumption (i.e., properly define the overall combined M-S test)?
Of course there are issues to address here such as multiple testing – it is for example interesting how much better it really is to run a series of Bonferroni-corrected (say) tests against parametric alternatives compared with running a single omnibus test.
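To show what "formally defining the overall combined M-S test" could look like, here is a minimal sketch for 10 observations assumed i.i.d. Normal. It is an illustration only and not the procedure anyone in this thread recommends: the two component statistics (sample skewness to probe Normality, lag-1 autocorrelation to probe independence), the Bonferroni split of the 5% level, the simulated critical values and all function names are assumptions made for the example.

```python
import random


def skewness(x):
    """Sample skewness, invariant to location and scale."""
    n = len(x)
    m = sum(x) / n
    m2 = sum((v - m) ** 2 for v in x) / n
    m3 = sum((v - m) ** 3 for v in x) / n
    return m3 / m2 ** 1.5


def lag1_autocorr(x):
    """Lag-1 sample autocorrelation, invariant to location and scale."""
    n = len(x)
    m = sum(x) / n
    num = sum((x[i] - m) * (x[i + 1] - m) for i in range(n - 1))
    den = sum((v - m) ** 2 for v in x)
    return num / den


def null_sample(rng, n):
    """One dataset generated under the assumed model (i.i.d. Normal)."""
    return [rng.gauss(0.0, 1.0) for _ in range(n)]


def critical_value(stat, rng, n, level, n_sim):
    """Simulated upper critical value of |stat| under the assumed model."""
    values = sorted(abs(stat(null_sample(rng, n))) for _ in range(n_sim))
    return values[int((1.0 - level) * n_sim)]


def combined_ms_test(x, stats, crits):
    """Reject the model if any component statistic exceeds its critical value."""
    return any(abs(s(x)) > c for s, c in zip(stats, crits))


rng = random.Random(1)
n, alpha = 10, 0.05
stats = [skewness, lag1_autocorr]
# Bonferroni: each component test is run at level alpha / (number of tests).
crits = [critical_value(s, rng, n, alpha / len(stats), 20_000) for s in stats]

# Estimate the overall false-rejection rate of the combined procedure on fresh data.
n_check = 20_000
overall_rate = sum(
    combined_ms_test(null_sample(rng, n), stats, crits) for _ in range(n_check)
) / n_check
print(f"critical values: {crits}, overall rejection rate under the model: {overall_rate:.3f}")
```

Once the rule is written down like this, the set of datasets for which one sticks to the assumption is exactly the set where `combined_ms_test` returns `False`. Replacing `null_sample` with data from a specific departure would then give the power of the combined procedure, which could be compared against a single omnibus test.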
On your question 1): The sampling distribution of the test statistic [the famous t-test] is Student’s t with n-1 degrees of freedom; that distribution is central when the evaluation is under the null and non-central when the evaluation is under the alternative. When sigma is known, both of these sampling distributions are Normal.
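For reference, the standard textbook results behind this statement can be written out as follows. The notation is chosen here for illustration and is an assumption: τ(X) stands for the t-statistic (the d() of the original post is not defined in this thread), μ0 is the null value, μ1 a value under the alternative, s the sample standard deviation and δ the noncentrality parameter.

```latex
% t-statistic for H_0: \mu = \mu_0 from n i.i.d. Normal observations
\tau(\mathbf{X}) = \frac{\sqrt{n}\,(\bar{X}_n - \mu_0)}{s}

% evaluated under the null (\mu = \mu_0): central Student's t
\tau(\mathbf{X}) \sim \mathrm{St}(n-1)

% evaluated under \mu = \mu_1: non-central Student's t
\tau(\mathbf{X}) \sim \mathrm{St}(\delta;\, n-1),
\qquad \delta = \frac{\sqrt{n}\,(\mu_1 - \mu_0)}{\sigma}

% if \sigma is known and replaces s in the statistic
\frac{\sqrt{n}\,(\bar{X}_n - \mu_0)}{\sigma} \sim \mathrm{N}(\delta,\, 1)
```

Written this way, it is visible that any evaluation under μ = μ1 involves σ through δ, which is the dependence the questions in this thread are asking about.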
On your question 2): The statistical model and its probabilistic assumptions are pre-specified. Hence, all the hypotheses pertaining to M-S testing are totally objectively defined.
In light of the fact that a complete set of probabilistic assumptions comprising the particular statistical model is clearly defined at the outset, anyone can probe independently these assumptions using their own choice of M-S tests to confirm or deny one’s results.
In relation to the three strategies I mentioned in my previous reply, I have published several papers illustrating them with both real-world as well as simulated data. There is nothing subjective about the choice of the particular M-S tests applied in a particular case. There are, however, effective and ineffective choices in practice. I particularly recommend joint M-S tests based on auxiliary regressions for their effectiveness. A few elementary illustrations of such tests are given in ch. 15 of Spanos, A. (1999), Probability Theory and Statistical Inference: Econometric Modeling with Observational Data, Cambridge University Press, Cambridge.
In relation to your question concerning “conditional decisions”, my answer is that there are no such decisions in M-S testing. The objective is to probe as effectively as possible with a view to detect possible departures from the model assumptions. If any departures are found, the model, as a whole, is rendered misspecified and a new model should be specified afresh. The primary aim in selecting the respecified model is to account for the statistical regularities the original model did not. A secondary aim is to parameterize the new model in a way that enables one to pose the original substantive questions of interest.
@Aris; The sampling distribution of the t-statistic (or indeed any statistic) does *not* change depending on whether we know sigma. Perhaps you mean that different test statistics will be used if we (assume we) know sigma? I’m still unclear as to what these calculations actually involve, regarding assumptions on sigma – and so are Christian and Stephen, apparently – maybe you can clarify them for us?
Also, you may view conditional decisions as taking place outside of m-s testing, but in practice they do not. Your procedure in which models are respecified, possibly many times, seems to run into widely studied problems of data-dredging and over-fitting, as users try model after model, until, either by luck or judgement, they get one that doesn’t appear to violate any parametric assumptions. Accurate calibration of the long-run properties of this ad hoc procedure is impossible – because it’s ad hoc – and naive use of intervals, tests, etc, from the final model will in many cases produce notably anti-conservative statements. These are well-known problems; I’m alarmed to see them dismissed as lightly as you seem to do.
Dear Guest: Relax! Your alarm is misplaced. As I understand it, any model that eventually passes through the Spanos’ mill is statistically adequate. It is best to look at his published work on this.
Granted the naive frequentist and the traditional Bayesian perspective on statistical modeling and inference do have most of the problems the “guest” mentions and then some! I agree that “These are well-known problems”, and the difficulty in beginning a discussion in a forum like this is that one cannot use real data, graphs, technical arguments, etc. to substantiate one’s views.
The whole point of the different perspective that I have developed since the mid 1980s, joining forces with Mayo around 2000 (what we now call Error Statistics), is exactly to address all these foundational and methodological problems (and a lot more besides) by recasting statistical model specification, Mis-Specification (M-S) testing and Respecification with a view to secure statistical adequacy. For example, statistical adequacy addresses the underdetermination problem at the statistical model level.
Unfortunately, a blog like this is not the proper forum to explain all the issues that a commentator might raise. Having engaged in discussions like these for the last 30 years, I know well that each statistician, econometrician, psychometrician, biometrician etc., etc., has his/her own perspective on every one of these issues and arguing in a blog like this will usually deteriorate rapidly into name calling.
I have no intention to go down that road, but let me reiterate that the error statistical perspective is not vulnerable to any of the problems raised by the “guest”.
Over the years, my answer to referees of econometric journals questioning the process of establishing statistical adequacy due to “data-dredging and over-fitting, as users try model after model”, has been to show that it is exactly what one is not doing in error statistics. Goodness-of-fit/prediction has nothing to do with statistical adequacy, and respecification is not about trying one model after another relying on luck; that will be the most inefficient route, since there is an infinite number of possible models.
Statistical model specification is about partitioning the set of all possible models. Respecification is about repartitioning the same set, every time starting with a clean slate, but hopefully better educated guessing thanks to “learning from data”. It usually takes 1-2 iterations to whittle things down in order to reach a statistically adequate model if one is using thorough M-S testing in conjunction with graphical techniques effectively. To those who dismiss the value of a statistically adequate model because it was reached through a process where one applies these techniques, my reply is that by the same token Kepler’s empirical regularity concerning the motion of the planets was unwarranted [tell that to Newton], and could provide no basis for inference [it produces notably anti-conservative statements], because he was playing around with the same data for 6 years before it dawned on him that an elliptical motion accounts for the regularities in the data much better than circular motion!
On all the above claims I elaborate and illustrate using several data sets, including Kepler’s data, in several published papers. For those who are interested in more detailed answers see some of my most recent papers:
Spanos, A. (2006), “Where Do Statistical Models Come From? Revisiting the Problem of Specification,” pp. 98-119 in Optimality: The Second Erich L. Lehmann Symposium, edited by J. Rojo, Lecture Notes-Monograph Series, vol. 49, Institute of Mathematical Statistics.
Spanos, A. (2007), “Curve-Fitting, the Reliability of Inductive Inference and the Error-Statistical Approach,” Philosophy of Science, 74: 1046–1066.
Spanos, A. (2010a), “Akaike-type Criteria and the Reliability of Inference: Model Selection vs. Statistical Model Specification,” Journal of Econometrics, 158: 204-220.
Spanos, A. (2010b), “Statistical Adequacy and the Trustworthiness of Empirical Evidence: Statistical vs. Substantive Information,” Economic Modelling, 27: 1436–1452.
@Aris, I appreciate the determination not to get into name calling. In the “model after model” point, I was not suggesting that users rely on luck alone. Instead, when the modeling strategy is very far from transparent (as described nicely here; http://andrewgelman.com/2009/07/that_modeling_f/ ) it’s extremely hard to tell the extent to which a final model is being deemed adequate because it’s close to the truth (good) or because it’s the last step in a process of “sampling to a foregone conclusion”, in which *some* model will always be deemed adequate (bad).
My April 1 comment, while not a joke, seems to have accidentally gotten us off the topic of this post. But it might be noted that any statistical model arising from Spanos’s approach must have all its assumptions tested–quite different from what some consider pejorative data mining. We may return to these issues with Hendry’s paper.
Here are two things that I don’t quite buy. First, I get the whole severity thing, it seems to be very closely related to Poole’s “p-value” plots, which is pointed out in Mayo 96. However, from an inferential point of view, it seems like one would say in this example that 0.75 is more severely tested than 1.04, say… But this goes against the interpretation of a frequentist confidence interval and sampling distribution which would tell us that *based on this single sample* we should be no more confident in a value of 1.04 or even 1.50 than we should in 0.75.
Second, I honestly don’t see how the concept of “statistical adequacy” is any less subjective than a subjective posterior distribution. It seems that what is “adequate” for you might be completely inadequate for me, and vice versa. I know that you want to define it in terms of “correct error probabilities”, but how could you ever possibly assess that based on a single observed sample? For example, in the US population growth example from a few weeks ago, I honestly find the first model to be completely adequate for the extent of inferences that I believe to be warranted from those data. If I wanted to estimate the population *within the observed range of the data* (which would be a reasonable use of a descriptive regression model), then I’d be perfectly happy to use a model based on Aris’ grandmother’s shoes as a proxy for time. However, if I wanted to predict population growth well beyond the range of the observed data, then I frankly wouldn’t have much more faith in your assumed Markov AR model than I would in the first model. This requires a level of inductive reasoning that I can’t see either Fisher, Neyman, or Popper supporting.
Mark: Severity is importantly different from Poole’s idea, though I don’t make much of the differences in EGEK. What you say of confidence intervals isn’t so, at least as I construe them. I think I’ve discussed objectivity/subjectivity a lot on this blog, please do a search. If you insist that “what is “adequate” for you might be completely inadequate for me” then, what can I say?
Deborah, I apologize for raising the whole subjectivity/objectivity thing, I’m happy to let that drop. But confidence intervals? Based on fundamental frequentist theory, the point estimate is nothing more than a single random draw from an underlying distribution of unknown variability. In fact, in about 1 of every 3 95% CIs, one of the bounds will be closer to the true value than is the point estimate. Thus, one should really only be more confident in values close to the point estimate to the same extent that one would be confident in obtaining a value greater than 2 on a single roll of a die. But it seems to me that taking severity at face value seems to miss this point.
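The "1 of every 3" figure can be checked directly in a simplified setting. The sketch below assumes a known-sigma z-interval for a Normal mean (an assumption made here; for t-intervals the fraction will differ somewhat), and the function names, sample size and seed are hypothetical choices for the example.

```python
import random
from statistics import NormalDist


def prob_bound_closer(confidence=0.95):
    """Exact probability, for a known-sigma z-interval, that one CI bound is
    closer to the true mean than the point estimate is."""
    z_crit = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    # With Z = (xbar - mu) / se, a bound is closer exactly when |Z| > z_crit / 2.
    return 2.0 * (1.0 - NormalDist().cdf(z_crit / 2.0))


def simulate_bound_closer(n_sim=20_000, n=10, mu=0.0, sigma=1.0,
                          confidence=0.95, seed=0):
    """Monte Carlo check of the same probability using simulated samples."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    se = sigma / n ** 0.5
    hits = 0
    for _ in range(n_sim):
        xbar = sum(rng.gauss(mu, sigma) for _ in range(n)) / n
        lower, upper = xbar - z_crit * se, xbar + z_crit * se
        # Count the interval if either bound is nearer to mu than xbar is.
        if min(abs(lower - mu), abs(upper - mu)) < abs(xbar - mu):
            hits += 1
    return hits / n_sim


exact = prob_bound_closer()
simulated = simulate_bound_closer()
print(f"exact: {exact:.3f}, simulated: {simulated:.3f}")
```

Under these assumptions the exact value is about 0.33, in line with the "1 of every 3" figure and with its complement, the 4-in-6 chance of rolling more than 2 on a die. The calculation counts all intervals, including the roughly 5% that miss the true value.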
Mark: I don’t understand your point about confidence intervals at all, sorry. On the particular example in this guest post, as it was by Spanos and based on his analysis, I’ll let him respond if he wishes. I’m not familiar with the Barnard-Copas example; Barnard generally seems too focussed on point estimates for my inclinations.
Unauthorized use and/or duplication of this material without express and written permission from this site’s author and/or owner is strictly prohibited.
Excerpts and links may be used, provided that full and clear credit is given to Deborah G. Mayo and Error Statistics Philosophy with appropriate and specific direction to the original content.