“Some Thoughts Prompted by David Hendry’s Essay* (RMM) Special Topic: Statistical Science and Philosophy of Science,” by Professor Clark Glymour

**Part 2 (of 2) (Please begin with part 1)**

The first thing one wants to know about a search method is what it is searching for, what would count as getting it right. One might want to estimate a probability distribution, or get correct forecasts of some probabilistic function of the distribution (e.g., out-of-sample means), or a causal structure, or some probabilistic function of the distribution resulting from some class of interventions. Second, one wants to know about what decision theorists call a loss function, or, less precisely, the comparative importance of various errors of measurement; in other terms, what makes some approximations better than others. Third, one wants a limiting consistency proof: sufficient conditions for the search to reach the goal in the large sample limit. There are various kinds of consistency (pointwise versus uniform, for example), and one wants to know which of those, if any, hold for a search method under what assumptions about the hypothesis space and the sampling distribution. Fourth, one wants to know as much as possible about the behavior of the search method on finite samples. In simple cases of statistical estimation there are analytic results; more often for search methods only simulation results are possible, but if so, one wants them to explore the bounds of failure, not just easy cases. And, of course, one wants a rationale for limiting the search space, as well as some sense of how wrong the search can be if those limits are violated in various ways.

There are other important economic features of search procedures. Probability distributions (or likelihood functions) can instantiate any number of constraints: vanishing partial correlations, for example, or inequalities of correlations. Suppose the hypothesis space delimits some big class of probability distributions. Suppose the search proceeds by testing constraints (the points that follow apply as well if the procedure computes posterior probabilities for particular hypotheses and applies a decision rule). There is a natural partial ordering of classes of constraints: B is weaker than A if and only if every distribution that satisfies class A satisfies class B. Other things equal, a weakest class might be preferred because it requires fewer tests. But more important is what the test of a constraint does in efficiently guiding the search. A test that eliminates a particular hypothesis is not much help. A test that eliminates a big class of hypotheses is a lot of help.

Other factors: the power of the requisite tests; the numbers of tests (or posterior probability assessments) required; the computational requirements of individual tests (or posterior probability assessments). And so on. And, finally, search algorithms have varying degrees of generality. For example, there are general algorithms, such as the widely used PC search algorithm for graphical causal models, that are essentially search schemata: stick in whatever decision procedure for conditional independence you like, and PC becomes a search procedure using that conditional independence oracle. By contrast, some searches are so embedded in a particular hypothesis space that it is difficult to see the generality.
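To make the "search schema" idea concrete, here is a minimal sketch of only the adjacency phase of a PC-style search, written so that the conditional independence decision procedure is a plug-in argument. The function names `pc_skeleton` and `chain_oracle`, and the three-variable chain X -> Y -> Z, are illustrative assumptions and are not taken from Hendry's essay or from any particular software package; edge orientation, the other phase of PC, is omitted.

```python
from itertools import combinations


def pc_skeleton(variables, indep):
    """Adjacency phase of a PC-style search.

    `indep(x, y, cond)` is any decision procedure for conditional
    independence (a statistical test, or a perfect oracle as below).
    """
    # Start from the complete undirected graph and remove edges.
    adj = {v: set(variables) - {v} for v in variables}
    sepset = {}
    depth = 0
    # Grow the conditioning-set size until no neighbourhood can supply one.
    while any(len(adj[x]) - 1 >= depth for x in variables):
        for x in variables:
            for y in sorted(adj[x]):
                # Condition only on current neighbours of x other than y.
                for cond in combinations(sorted(adj[x] - {y}), depth):
                    if indep(x, y, set(cond)):
                        adj[x].discard(y)
                        adj[y].discard(x)
                        sepset[frozenset((x, y))] = set(cond)
                        break
        depth += 1
    edges = {frozenset((x, y)) for x in variables for y in adj[x]}
    return edges, sepset


def chain_oracle(x, y, cond):
    # Hypothetical oracle for the chain X -> Y -> Z:
    # the only independence is X independent of Z given Y.
    return {x, y} == {"X", "Z"} and "Y" in cond


edges, sepset = pc_skeleton(["X", "Y", "Z"], chain_oracle)
print(sorted(tuple(sorted(e)) for e in edges))  # [('X', 'Y'), ('Y', 'Z')]
```

Swapping `chain_oracle` for a statistical test of conditional independence changes nothing in `pc_skeleton` itself, which is the sense in which the search is a schema rather than a single procedure.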

I am sure I am not qualified to comment on the details of Hendry’s search procedure, and even if I were, for reasons of space his presentation is too compressed for that. Still, I can make some general remarks. I do not know from his essay the answers to many of the questions pertinent to evaluating a search procedure that I raised above. For example, his success criterion is “congruence” and I have no idea what that is. That is likely my fault, since I have read only one of his books, and that long ago.

David Hendry dismisses “priors,” meaning, I think, Bayesian methods, with an argument from language acquisition. Kids don’t need priors to learn a language. I am not sure of Hendry’s logic. Particular grammars within a parametric “universal grammar” could in principle be learned by a Bayesian procedure, although I have no reason to think they are. But one way or the other, that has no import for whether Bayesian procedures are the most advantageous for various search problems by any of the criteria I have noted above. Sometimes they may be, sometimes not; there is no uniform answer, in part because computational requirements vary. I could give examples, but space forbids.

Abstractly, one could think there are two possible ways of searching when the set of relationships to be uncovered may form a complex web: start by positing all possible relationships and eliminate from there, or start by positing no relationships and build up. Hendry dismisses the latter, with what generality I do not know. What I do know is that the relations between “bottom-up” and “top-down” or “forward” and “backward” search can be intricate, and in some cases one may need both for consistency. Sometimes either will do. Graphical models, for example, can be searched starting with the assumption that every variable influences every other and eliminating, or starting with the assumption that no variable influences any other and adding. There are pointwise consistent searches in both directions. The real difference is in complexity.

Finally, I am struck by a substantive assumption that seems to vary with discipline, and I am not sure why. Like Spanos, from whom I learned what little I know of econometrics, Hendry wants to Normalize distributions, at least enough so that tests of hypotheses based on the Normal distribution assumption can be used. In linear systems, when searching for causal relations, Normal distributions are the worst case, not the best. In non-Gaussian distributions for linear systems, higher moments provide information about causal direction that cannot be recovered from Gaussian distributions. The residuals and the recorded variables in psychological studies with magnetic resonance imaging time series are not Gaussian, and that fact has proved critical in reliably estimating directions of influence.
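A toy simulation can illustrate why the Gaussian case is the hard one. The sketch below is not the method used in the imaging studies mentioned above; it assumes a two-variable linear model y = x + e with independent x and e, and uses a hypothetical helper, `dependence_after_regression`, that regresses one variable on the other and then checks a higher-moment dependence (the correlation between the squared regressor and the squared residual). The sample size, seed, and uniform noise are arbitrary choices made for illustration.

```python
import numpy as np


def dependence_after_regression(cause, effect):
    """Regress `effect` on `cause` by least squares and return
    |corr(cause^2, residual^2)|, a crude higher-moment check of whether
    the residual is independent of the regressor (near 0 if it is)."""
    cause = cause - cause.mean()
    effect = effect - effect.mean()
    slope = (cause @ effect) / (cause @ cause)
    resid = effect - slope * cause
    return abs(np.corrcoef(cause**2, resid**2)[0, 1])


rng = np.random.default_rng(0)
n = 20000

# Non-Gaussian (uniform) case: the true direction is x -> y.
x_u = rng.uniform(-1, 1, n)
y_u = x_u + rng.uniform(-1, 1, n)
forward_u = dependence_after_regression(x_u, y_u)   # correct direction
backward_u = dependence_after_regression(y_u, x_u)  # reversed direction

# Gaussian case: same structure, Normal cause and noise.
x_g = rng.normal(size=n)
y_g = x_g + rng.normal(size=n)
forward_g = dependence_after_regression(x_g, y_g)
backward_g = dependence_after_regression(y_g, x_g)

print("uniform :", round(forward_u, 3), round(backward_u, 3))
print("gaussian:", round(forward_g, 3), round(backward_g, 3))
```

With uniform variables the residual is independent of the regressor only in the true direction, so the reversed regression leaves a clear dependence in the squares. With Normal variables an uncorrelated residual is also independent, so both directions look alike and the data carry no information about direction.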

In general, I think it is hard to make sound informative generalizations about search strategies. Just as with ordinary statistical estimation, judgements of appropriateness or optimality (by whatever criteria) require a careful analysis of the very structure of the search problem. Hendry’s generalizations may be right about search in econometrics, and indeed about some other domains as well, but deciding one way or the other would require a very different forum than this. Until then, I congratulate him for helping defeat the dogmas of the context distinction, and urge him to recognize that he has done that very thing.

*Hendry, D. (2011) “Empirical Economic Model Discovery and Theory Evaluation”, in *Rationality, Markets and Morals*, Volume 2, Special Topic: *Statistical Science and Philosophy of Science*, edited by Deborah G. Mayo, Aris Spanos and Kent W. Staley: 115-145.

Clark: Thanks for your post, it links in numerous ways with what we’ve been talking about. Here’s my take on discovery/justification:

In the olden days, when philosophers viewed justification in terms of an “evidential-relationship” between given observation statements and given hypotheses (perhaps some still do), the classic discovery-justification distinction was fairly clear cut. The discovery context was everything that went into getting the H, and x, and the background. Carnap, at least when espousing this sort of “evidential-relation” (E-R) logic of confirmation, could not have been clearer in stating the goal: Given a logic of confirmation, the scientist would go to the philosopher, hand her the data and hypothesis statements, and the philosopher would compute the E-R measure.

Popper, and the Lakatosians after him, moved (a bit) away from the E-R picture: Even the data x were recognized as error prone and the result of various earlier hypotheses and inferences. Moreover, and most important, was the recognition that the E-R logics conflicted with traditional concerns about ad hocness, saving hypotheses from refutation, data dependent saves, and the like. (I can find almost an exact quote of this.) You can’t care about those things and be an E-R theorist, in other words.

Then there was Kuhn and the post-logical positivists who also recognized how much error and interpretation, and so on, goes into hypothesis testing and theory appraisal. In fact, for Kuhn testing was always within a full paradigm with background theories and methods and aims all rolled up into one, and not questioned while engaging in the “mopping up” exercises that go under “normal testing”. (When the fundamental paradigm is in question, of course, the Kuhnian is in no-man’s land of irrational conversion.) So contexts of discovery/justification are really blurred.

Far more extreme were the post-modernists, dadaists, social constructivists, and anarchists that came after: it’s all a matter of (choose your favorite) social negotiation, political power, subjective whims of a rich enough social group. The very idea of any kind of objectively constrained appraisal of claims goes out the window. (These movements sometimes have analogues in statistical philosophies.) My point is that one can go to the Popperian level of taking into account aspects of the selection and generation of evidence and hypotheses, and background hypotheses and threats of error (all of which Peirce and others knew long before), without skidding over the slippery slide to the land of no distinction between the idiosyncratic features of discovery and the empirically constrained aspects of “justification” or warrant or the like. In formal statistical settings the latter features can be captured, in part at least, by the goals of controlling and appraising the ability of tests to detect errors and flaws (error probabilities, formally or informally captured).

To that extent, and understood in that way, I still retain the discovery-justification distinction!

It may be argued, as Imre Lakatos does, that the relativity of appraisal to features such as theoretical novelty entails that justification has a historical dimension. Thus the distinction collapses.

E. Berk: Not at all, that was precisely my point. These considerations may enter into the appraisal of claims, and we have a clear criterion for when and why they do. Other features of discovery do not.

“Certainly, a general language system seems to be hard wired in the human brain but that hardly constitutes a prior. Thus, in one of the most complicated tasks imaginable, which computers still struggle to emulate, priors are not needed”

I’d say this is a ridiculous argument – the amount of information encapsulated by human physiology is easily many orders of magnitude greater than the amount of information required to learn a language.

Why doesn’t a rock learn “whatever native tongue is prevalent around it”? Because it lacks the prior information encoded in human physiology which is necessary to learn a language.

rv: Language acquisition may be a weak analogy for the discovery context that Hendry is discussing, but his point is clear enough: “‘Prior distributions’ widely used in Bayesian analyses, whether subjective or ‘objective’, cannot be formed in such a setting either, absent a falsely assumed crystal ball.” The issue is one we’ve considered many times: in setting sail to discover novel theories and models, one does not start out with a list of all possible rivals that might arise, much less do we want to have to assign degrees of probability to each, or to a “catchall hypothesis”. This is the point made by Wesley Salmon and others that to compute likelihoods under the “catchall” would be tantamount to predicting the future course of science.

I take Hendry’s “breaks” to represent shifts that, in effect, would result in temporal incoherence. “Imposing a prior distribution that is consistent with an assumed model when breaks are not included is a recipe for a bad analysis in macroeconomics.” So perhaps he is saying that since these “breaks” would result in temporal incoherence, the Bayesian would effectively start out anticipating such incoherence (see earlier posts on Dutch books).

Anyway, Hendry’s goal is to provide a model discovery procedure that does not require such assumptions. To the extent that he succeeds, I can’t imagine a complaint that it didn’t compel us to start out with a certain kind of bounded universe and prior probability distributions or whatever. So the pertinent question is the extent to which he succeeds.

Hendry’s colleague and co-author, Dr Jennifer Castle, addresses some of these points and queries in such papers as:

• Forecasting breaks and forecasting during breaks (2011)
• Model Selection in Under-specified Equations Facing Breaks (2010)
• http://www.economics.ox.ac.uk/index.php/papers/details/department_wp_538/
• http://www.economics.ox.ac.uk/index.php/staff/castle/

As always, I find Glymour’s writings insightful, well-articulated and provocative. Although I appreciate his mention of our past interactions, I would like to correct his assertion concerning “Normalization”. Contrary to the assertion, a large component of my published research is concerned with the restrictiveness of the Normal distribution, especially in the case of financial data. Indeed, I have introduced into applied econometrics several non-Normal distributions, including the multivariate Student’s t, Pearson type II, the Elliptically Symmetric family, etc. Using numerous financial and macro-data I have demonstrated the inappropriateness of the Normal distribution on empirical grounds and criticized severely the current key theories of finance, including the CAPM and the Efficient Market Hypothesis, for their unwarranted reliance on the Normal distribution.

Aris: I apologize for misrepresenting your view based on memory from your lectures at Virginia Tech.

Clark’s replies to comments: Apparently his replies were eaten by the black hole of the internet. I’ve had this happen as have many others; please write comments on a separate document so they won’t be lost. As for Glymour’s replies, well, I think we’ll just have to imagine what he intended to say.

What I wrote was more or less this: In many cases, a favorable result of a hypothesis test is no justification for a theory in the absence of a search. Take social science models, say a model of the effects of Head Start, or the long term effects of watching violent television in childhood (chosen only because they are examples where I have analyzed some data sets). The standard asymptotic chi square test for such models takes the model (or its implied covariance matrix, to be more precise) as the null hypothesis. Standard practice is to take the data to be evidence for the model if the null hypothesis cannot be rejected at some conventional level. But there are myriad hypotheses, generally unknown or unarticulated, that may, for all the investigator knows (i.e., very little), not be rejected by the same test. Indeed, there may be hypotheses that dominate the model, in the sense that for any alpha level at which the model is not rejected, the alternative hypothesis is not rejected, but not conversely.

Clark: Are you saying they take non-statistically significant results as evidence for a null such as:

Ho: no long term effects of watching violent television in childhood?

Ho would have passed with terrible severity (for the reasons you give). The fallacy of insignificant results is a common theme on this blog (https://errorstatistics.com/2011/11/18/neymans-nursery-nn5-final-post/).

So perhaps your point is that to get around this they should perform a search of ways they could have erroneously failed to find a difference? But then, of course, if some subgroup does show an association, one is in danger of committing the reverse fallacy (i.e., rejecting from hunting). Granted, that would be if one did it in an unthinking fashion as opposed to, say, using the searching to get ideas for what to probe separately. By the way, this connects to the discussion that emerged in the post just before yours, in relation to testing the equivalence of brand-name and generic drugs: https://errorstatistics.com/2012/07/21/always-the-last-place-you-look/