*Here are a few comments on your recent blog about my ideas on parsimony. Thanks for inviting me to contribute!*

You write that in model selection, “‘parsimony fights likelihood,’ while, in adequate evolutionary theory, the two are thought to go hand in hand.” The second part of this statement isn’t correct. There are sufficient conditions (i.e., models of the evolutionary process) that entail that parsimony and maximum likelihood are ordinally equivalent, but there are cases in which they are not. Biologists often have data sets in which maximum parsimony and maximum likelihood disagree about which phylogenetic tree is best.

You also write that “error statisticians view hypothesis testing as between exhaustive hypotheses H and not-H (usually within a model).” I think that the criticism of Bayesianism that focuses on the problem of assessing the likelihoods of “catch-all hypotheses” applies to this description of your error statistical philosophy. The General Theory of Relativity, for example, may tell us how probable a set of observations is, but its negation does not. I note that you have “usually within a model” in parentheses. In many such cases, two alternatives within a model will not be exhaustive even within the confines of a model and of course they won’t be exhaustive if we consider a wider domain.

Under your entry on Falsification, you write that “Sober made a point of saying that his account does not falsify models or hypotheses. We are to start out with all the possible models to be considered (hopefully including one that is true or approximately true), akin to the “closed universe” of standard Bayesian accounts, but do we not get rid of any as falsified, given data? It seems not.” My view is that we rarely start out with all possible models. In addition, I agree that we can get rid of models that deductively entail (perhaps with the help of auxiliary assumptions) observational outcomes that do not happen. But as soon as the relation is nondeductive, is there “falsification”? I do think that there are restricted, special, contexts in which Bayesians are right to say that we can discover that a given statement is very probably false (where the probabilities are objective). In that kind of case, there is a kind of falsification. But I have been critical of significance tests and of Neyman-Pearson testing, which of course attempt to describe a kind of nonBayesian falsification.

On the Law of Likelihood, you correctly point out that it is easy to find hypotheses H1 and H2, and observations O, where the Law of Likelihood says that O favors H1 over H2, and yet we think that H1 is in some sense less satisfactory than H2. Bayesians bring in prior probabilities here. NonBayesians of course bring in other considerations. I agree that such situations exist and that epistemological ideas not provided by the Law of Likelihood are needed. But this, by itself, doesn’t show that the Law of Likelihood is false. The LoL doesn’t tell you what to accept or reject, and it doesn’t tell you what is most plausible, everything considered. It simply describes what the evidence at hand says.
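As a minimal sketch of what the Law of Likelihood does and does not say, consider a hypothetical coin example (the scenario, the numbers, and the function names below are illustrative assumptions, not taken from the discussion above). The law compares only P(O | H1) with P(O | H2), so a hypothesis tailored to the data wins the comparison even if we find it less satisfactory on other grounds.

```python
from math import comb


def binomial_likelihood(p, heads, tosses):
    """P(exactly `heads` heads in `tosses` tosses | heads-probability p)."""
    return comb(tosses, heads) * p**heads * (1 - p) ** (tosses - heads)


def likelihood_ratio(p1, p2, heads, tosses):
    """Likelihood of H1 (p = p1) divided by likelihood of H2 (p = p2)."""
    return binomial_likelihood(p1, heads, tosses) / binomial_likelihood(p2, heads, tosses)


# Hypothetical observation O: 7 heads in 10 tosses.
# H1: p = 0.7 (chosen to match the data); H2: p = 0.5 (a fair coin).
ratio = likelihood_ratio(0.7, 0.5, heads=7, tosses=10)

# The Law of Likelihood says O favors H1 over H2 exactly when ratio > 1.
favors_h1 = ratio > 1
print(f"likelihood ratio = {ratio:.2f}, O favors H1: {favors_h1}")
```

Here the ratio is about 2.28, so the observation favors the tailored hypothesis p = 0.7 over the fair coin. Nothing in the ratio says which hypothesis to accept or which is more plausible, everything considered.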

Elliott Sober

Philosophy Department

University of Wisconsin – Madison

I think Sober’s comment on General Relativity is a bit of a straw man because it totally ignores the fact that for error statisticians testing is always a piecemeal affair. One would not try to test the entire (large-scale) theory, but rather would attempt to break off parts of it that can be “severely” tested, e.g., your example of Eddington’s tests on the deflection of light, Perrin’s experiments on Brownian motion, etc. So to me his criticism that error statistics suffers from a catch-all hypothesis problem akin to Bayesianism when dealing with GTR seems to miss the whole point of the error statistical approach to testing. I take his point that often two hypotheses within a model will not be exhaustive, but then it seems the trick (experimental skill) is to devise a test where one can exhaust some space of alternatives and still learn something of interest (not saying it’s easy or that it can always be done).

Jean: Exactly!

Intuitively, I can understand the notion that a low Bayesian probability could support the argument for falsification (but would have to trust the underlying methodology to buy it). Would it not be a stronger argument to state that the data are weak evidence for H, and it is highly probable that stronger evidence would obtain were H true (with this probability being the one that carries us to falsification, not the prob of H)?

Sober writes: You also write that “error statisticians view hypothesis testing as between exhaustive hypotheses H and not-H (usually within a model).” I think that the criticism of Bayesianism that focuses on the problem of assessing the likelihoods of “catch-all hypotheses” applies to this description of your error statistical philosophy. The General Theory of Relativity, for example, may tell us how probable a set of observations is, but its negation does not.

Time to test my error statistics philosophy comprehension! Let me just don my Mayo wig…

Theoretical predictions flow down a three-tiered hierarchy from (top-level) scientific theory to (mid-level) experimental setup to (bottom-level) a probability distribution for the data. A typical experiment tests only one or a small number of ways in which a scientific theory can be in error. Any single experiment can only be a severe test of a subset of all of a theory’s predictions. Formal statistical hypothesis testing happens on this scale — it is used to understand the data generating mechanism writ small. Evidence for a scientific theory is built up piecemeal by severely testing many ways in which the theory can be in error. Any one experiment isn’t a severe test of a whole theory, but a corpus of experiments can be — if the scientific theory were false, it would be an incredible coincidence if all the experiments passed its specific predictions.

In the case of GTR, we don’t need a specific not-GTR. Each experiment tests a specific prediction of GTR; it suffices that our test procedure (1) passes the prediction with high probability under the statistical model implied by GTR, and (2) fails it with high probability under a set of statistical models over the sample space. These alternative hypotheses *are* the ways in which GTR can be in error, but they don’t need to come from an overarching not-GTR theory.
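Conditions (1) and (2) can be illustrated with a toy calculation. This is only a sketch under assumptions that are not in the comment above: a one-sided test of a Normal mean with known standard deviation, where the hypothetical value `mu0` stands in for the prediction under test and `mu1` for one particular way of being in error.

```python
from math import sqrt
from statistics import NormalDist


def pass_and_fail_probabilities(mu0, mu1, sigma, n, alpha=0.05):
    """Return (P(pass | mean = mu0), P(fail | mean = mu1)).

    The prediction "fails" when the sample mean exceeds a cutoff chosen so
    that a true prediction fails with probability alpha.
    """
    se = sigma / sqrt(n)  # standard error of the sample mean
    cutoff = mu0 + NormalDist().inv_cdf(1 - alpha) * se
    p_pass_if_true = NormalDist(mu0, se).cdf(cutoff)       # condition (1)
    p_fail_if_error = 1 - NormalDist(mu1, se).cdf(cutoff)  # condition (2)
    return p_pass_if_true, p_fail_if_error


# Hypothetical numbers: predicted mean 0, erroneous mean 0.5, sigma 1, n = 100.
p_pass, p_fail = pass_and_fail_probabilities(mu0=0.0, mu1=0.5, sigma=1.0, n=100)
print(f"P(pass | prediction true) = {p_pass:.3f}")
print(f"P(fail | mean is 0.5)     = {p_fail:.4f}")
```

With these numbers the prediction passes with probability 0.95 when it is true and fails with probability above 0.99 under the chosen alternative. Repeating the second calculation over a range of `mu1` values is one way to cover a set of alternatives rather than a single one.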

How’d I do?

Corey: I would have to pass you*; if you’d keep on the wig for a while, or better, make it your own hair, it could even be a pass with high severity.

*It would seem mean-spirited to nitpick.

Mayo: I’m glad I pass! Nevertheless, I seek to learn from my errors. I declare Crocker’s rules with respect to the defects in the above construal.

Those rules sound like a “crock” to me!

Well, ok, I’d start with (1) and (2)–they don’t sound right. (2) would be improved by replacing “sample space” with “parameter space”.

Erf. Quite right. The phrase “set of statistical models over the sample space” in (2) was intended as a short form of “set of statistical hypotheses associated with the sample space of the experiment”. Likewise, “statistical model” in (1) should be “statistical hypothesis”.

As a self-taught Bayesian, sharp distinctions between simple statistical hypotheses, sets of the latter, and models don’t come naturally to me, since I want to splash probability density and/or mass around on all of them like whitewash.

Well whitewashing comes in handy for a Bayesian.

Oh snap!

So glad to see others weighing in already; I was busy moving within and between places with electricity and other facilities. Terrific! Here’s what I jotted before noticing:

Elliott: Thanks so much for accepting my invitation to post. Just to deal with the business of testing theories like GTR, I argue that even in dealing with such large-scale theories, the actual testing and learning takes place in a piecemeal fashion. Within each piecemeal test, the answers to the one question being asked exhaust the space. A lot of my work in the past decade has dealt with how this breakdown and testing occurs. The case of experimentally testing GTR is my prime example! This is touched on in slides 35-6 from the conference [i], and in much more detail elsewhere, with an explicit contrast to what the “comparativist” recommends [ii]. Notably, the parameters in the PPN (parameterized post-Newtonian) framework were tested by means of a familiar error statistical method. The GTR value for the PPN parameter under test serves as the null hypothesis from which discrepancies are sought by means of “high precision null experiments”, and the inferences are in the form of upper and lower bounds that are/are not ruled out with severity.

[i] My slides: http://www.phil.vt.edu/dmayo/personal_website/June24,12MayoCMU-SIMP.pdf

[ii] Mayo, D. (2010). “Error, Severe Testing, and the Growth of Theoretical Knowledge”
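To give a rough sense of what it means for a bound to be warranted with severity, here is a minimal sketch. It assumes a simple Normal model with known standard deviation rather than the actual PPN analyses, and the function name and numbers are hypothetical. In this setting, the severity of the claim "the discrepancy is no larger than `bound`" is taken to be the probability of observing a larger sample mean than the one actually observed, were the true discrepancy equal to `bound`.

```python
from math import sqrt
from statistics import NormalDist


def severity_of_upper_bound(xbar, bound, sigma, n):
    """Severity for the claim 'mean <= bound', given observed sample mean xbar.

    Computed as P(sample mean > xbar) when the true mean equals `bound`,
    under a Normal model with known sigma (an illustrative assumption).
    """
    se = sigma / sqrt(n)  # standard error of the sample mean
    return 1 - NormalDist(bound, se).cdf(xbar)


# Hypothetical null experiment: observed mean discrepancy 0.1, sigma 1, n = 100.
for bound in (0.1, 0.2, 0.3):
    sev = severity_of_upper_bound(xbar=0.1, bound=bound, sigma=1.0, n=100)
    print(f"severity for 'discrepancy <= {bound}': {sev:.3f}")
```

In this toy case the bound 0.3 passes with severity of roughly 0.98, while the bound 0.1 (equal to the observed mean) gets only 0.5, so the data warrant the looser bound but not the tighter one.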

As you know, Deborah, this is a topic (and an example) that interests me a great deal. I anticipate a rejoinder going back to Elliott’s comment that “of course [the hypotheses under test] won’t be exhaustive if we consider a wider domain.” So I’d like to respond to that rejoinder on behalf of the error statistician.

In the present case this would be a question about the domain of hypotheses that lie outside of those consistent with the PPN assumptions. Those, of course, are substantive assumptions (essentially Local Lorentz Invariance, Local Position Invariance, and Weak Equivalence).

Although these are substantive assumptions of the inferences carried out within PPN, they are not mere assumptions. A severe tester strives to be in a position to warrant the model assumptions that serve to delimit the class of hypotheses under consideration. As I understand it, the assumptions of PPN have themselves been subjected to ongoing testing of various sorts (often within other, broader parametric frameworks), to a point where the possibilities of violations of these principles that are compatible with the tests that have been performed are of a sort that would not undermine inferences carried out within PPN (e.g., they would involve violations at scales where the validity of unmodified GTR is not to be assumed in any case and where “new physics” is to be expected). In other words, PPN-guided inferences are robust under the various possible violations of the precise versions of PPN assumptions.

Does that seem about right?

Kent: Thanks so much Kent. Yes, I would concur with your remarks. It’s welcome to hear from a philosopher of science who understands and is in sync with the error statistical philosophy, and I assume you’ll be on the evening ferry to Elba! It’s a lovely place, really.