# Philosophy of Statistics

## Oxford Gaol: Statistical Bogeymen

Memory Lane: Oxford Jail (also called Oxford Castle) is an entirely fitting place to be on (and around) Halloween! Moreover, rooting around this rather lavish set of jail cells (what used to be a single cell is now a dressing room) is every bit as conducive to philosophical reflection as is exile on Elba! (I’m serious, it is now a boutique hotel.)  My goal (while in this gaol—as the English sometimes spell it) is to try and free us from the bogeymen and bogeywomen often associated with “classical” statistics. As a start, the very term “classical statistics” should I think be shelved, not that names should matter.

In appraising statistical accounts at the foundational level, we need to realize the extent to which accounts are viewed through the eyeholes of a mask or philosophical theory. Moreover, the mask some wear while pursuing this task might well be at odds with their ordinary way of looking at evidence, inference, and learning. In any event, to avoid question-begging criticisms, the standpoint from which the appraisal is launched must itself be independently defended. But for (most) Bayesian critics of error statistics, the assumption that uncertain inference demands a posterior probability for claims inferred is thought to be so obvious as not to require support. Critics are implicitly making assumptions that are at odds with the frequentist statistical philosophy. In particular, they assume a certain philosophy about statistical inference (probabilism), often coupled with the allegation that error statistical methods can only achieve radical behavioristic goals, wherein all that matters are long-run error rates (of some sort).

Two criticisms then follow:

• Error probabilities do not supply posterior probabilities in hypotheses; interpreted as if they do (and some say we just can’t help it), they lead to inconsistencies.
• Methods with good long-run error rates can give rise to counterintuitive inferences in particular cases.

I have proposed an alternative philosophy that replaces these tenets with different ones:

• The role of probability in inference is to quantify how reliably or severely claims (or discrepancies from claims) have been tested.
• The severity goal directs us to the relevant error probabilities, avoiding the oft-repeated statistical fallacies due to tests that are overly sensitive, as well as those insufficiently sensitive to particular errors.
• Control of long-run error probabilities, while necessary, is not sufficient for good tests or warranted inferences.

What is key on the statistics side of this alternative philosophy is that the probabilities refer to the distribution of a statistic d(x), the so-called sampling distribution. Hence such accounts are often called sampling theory accounts. Since the sampling distribution is the basis for error probabilities, another term might be error statistical.

## Bayesian Confirmation Philosophy and the Tacking Paradox (iv)*

A long-running research program in philosophy is to seek a quantitative measure

C(h,x)

to capture intuitive ideas about “confirmation” and about “confirmational relevance”. The components of C(h,x) are allowed to be any statements; no reference to a probability model or to joint distributions is required. Then h is “confirmed” or supported by x if P(h|x) > P(h), and disconfirmed (or undermined) if P(h|x) < P(h); otherwise x is confirmationally irrelevant to h. This is the generally accepted view of philosophers of confirmation (or Bayesian formal epistemologists) up to the present. There is generally a background “k” included, but to avoid a blinding mass of symbols I omit it. (We are rarely told how to get the probabilities anyway; but I’m going to leave that to one side, as it will not really matter here.)

A test of any purported philosophical confirmation theory is whether it elucidates or is even in sync with intuitive methodological principles about evidence or testing. One of the first problems that arises stems from asking…

Is Probability a good measure of confirmation?

A natural move then would be to identify the degree of confirmation of h by x with probability P(h|x), (which philosophers sometimes write as P(h,x)). Statement x affords hypothesis h higher confirmation than it does h’ iff P(h|x) > P(h’|x).

Some puzzles immediately arise. Hypothesis h can be confirmed by x, while h’ disconfirmed by x, and yet P(h|x) < P(h’|x).  In other words, we can have P(h|x) > P(h) and P(h’|x) < P(h’) and yet P(h|x) < P(h’|x).

Popper (The Logic of Scientific Discovery, 1959, 390) gives this example, (I quote from him, only changing symbols slightly):

Consider the next toss with a homogeneous die.

h: 6 will turn up

h’: 6 will not turn up

x: an even number will turn up.

P(h) = 1/6, P(h’) = 5/6, P(x) = 1/2

The probability of h is raised by information x, while h’ is undermined by x. (Its probability goes from 5/6 to 4/6.) If we identify probability with degree of confirmation, x confirms h and disconfirms h’ (i.e., P(h|x) > P(h) and P(h’|x) < P(h’)). Yet because P(h|x) < P(h’|x), h is less well confirmed given x than is h’. (This happens because P(h) is sufficiently low.) So P(h|x) cannot just be identified with the degree of confirmation that x affords h.
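Popper’s numbers can be checked by brute-force counting over the six faces. The sketch below is only an illustration: the helper name `prob` and the use of lambdas for events are my own choices, and the die is assumed fair, as “homogeneous” suggests.

```python
from fractions import Fraction

FACES = range(1, 7)  # assumed: a fair six-sided die


def prob(event, given=lambda face: True):
    """P(event | given), obtained by counting equally likely faces."""
    conditioning = [f for f in FACES if given(f)]
    favourable = [f for f in conditioning if event(f)]
    return Fraction(len(favourable), len(conditioning))


h = lambda face: face == 6            # h: 6 will turn up
h_prime = lambda face: face != 6      # h': 6 will not turn up
x = lambda face: face % 2 == 0        # x: an even number will turn up

p_h, p_h_given_x = prob(h), prob(h, x)
p_hp, p_hp_given_x = prob(h_prime), prob(h_prime, x)

print("P(h)  =", p_h, "  P(h|x)  =", p_h_given_x)
print("P(h') =", p_hp, "  P(h'|x) =", p_hp_given_x)
print("x raises h:", p_h_given_x > p_h)
print("x lowers h':", p_hp_given_x < p_hp)
print("yet P(h|x) < P(h'|x):", p_h_given_x < p_hp_given_x)
```

Exact fractions are used so that the comparisons are not affected by floating-point rounding.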

Note, these are not real statistical hypotheses but statements of events.

Obviously there needs to be a way to distinguish between some absolute confirmation for h, and a relative measure of how much it has increased due to x. From the start, Rudolf Carnap noted that “the verb ‘to confirm’ is ambiguous” but thought it had “the connotation of ‘making firmer’ even more often than that of ‘making firm’.” (Carnap, Logical Foundations of Probability (2nd ed.), xviii). x can increase the firmness of h, and yet C(h,x) < C(~h,x) (h is less firm, given x, than is ~h). Following Carnap, it’s the ‘making firmer’ sense that is generally assumed in Bayesian confirmation theory.

But there are many different measures of making firmer (Popper, Carnap, Fitelson).  Referring to Popper’s example, we can report the ratio R: P(h|x)/P(h) = 2.

(In this case h’ = ~h).

Or we use the likelihood ratio LR: P(x|h)/P(x|~h) = (1/.4) = 2.5.
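For readers who want the arithmetic behind these two numbers, here it is written out for Popper’s die. Nothing beyond the stated probabilities is assumed; the intermediate values P(h|x) = 1/3 and P(x|~h) = 2/5 come from counting faces.

$$
\begin{aligned}
R &= \frac{P(h \mid x)}{P(h)} = \frac{1/3}{1/6} = 2, \\
LR &= \frac{P(x \mid h)}{P(x \mid \lnot h)} = \frac{1}{2/5} = 2.5 .
\end{aligned}
$$

Here P(x|h) = 1 because a six is even, and P(x|~h) = 2/5 because two of the five non-six faces (2 and 4) are even.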

Many other ways of measuring the increase in confirmation x affords h could do as well. But what shall we say about the numbers like 2, 2.5? Do they mean the same thing in different contexts? What happens if we get beyond toy examples to scientific hypotheses where ~h would allude to all possible theories not yet thought of. What’s P(x|~h) where ~h is “the catchall” hypothesis asserting “something else”? (see, for example, Mayo 1997)

Perhaps this point won’t prevent confirmation logics from accomplishing the role of capturing and justifying intuitions about confirmation. So let’s consider the value of confirmation theories to that role. One of the early leaders of philosophical Bayesian confirmation, Peter Achinstein (2001), began to have doubts about the value of the philosopher’s a priori project. He even claims, rather provocatively, that “scientists do not and should not take … philosophical accounts of evidence seriously” (p. 9) because they give us formal syntactical (context-free) measures, whereas scientists look to empirical grounds for confirmation. Philosophical accounts, moreover, make it too easy to confirm. He rejects confirmation as increased firmness, denying it is either necessary or sufficient for evidence. As far as making it too easy to get confirmation, there is the classic problem: it appears we can get everything to confirm everything, so long as one thing is confirmed. This is a famous argument due to Glymour (1980).

We now switch to emphasizing that the hypotheses may be statistical hypotheses or substantive theories. Both for this reason and because I think they look better, I move away from Popper and Carnap’s lower case letters for hypotheses.

The problem of irrelevant conjunctions (the “tacking paradox”) is this: If x confirms H, then x also confirms (H & J), even if hypothesis J is just “tacked on” to H. As with most of these chestnuts, there is a long history (e.g., Earman 1992, Rosenkrantz 1977), but consider just a leading contemporary representative, Branden Fitelson. Fitelson has importantly emphasized how many different C functions there are for capturing “makes firm”.  Fitelson defines:

J is an irrelevant conjunct to H, with respect to x just in case P(x|H) = P(x|J & H).

For instance, x might be radioastronomic data in support of:

H: the deflection of light effect (due to gravity) is as stipulated in the General Theory of Relativity (GTR), 1.75” at the limb of the sun.

and the irrelevant conjunct:

J: the radioactivity of the Fukushima water being dumped in the Pacific ocean is within acceptable levels.

(1)   Bayesian (Confirmation) Conjunction: If x Bayesian confirms H, then x Bayesian-confirms (H & J), where P(x| H & J ) = P(x|H) for any J consistent with H.

The reasoning is as follows:

P(x|H) /P(x) > 1     (x Bayesian confirms H)

P(x|H & J) = P(x|H)  (given)

So [P(x|H & J) /P(x)]> 1

Therefore x Bayesian confirms (H & J)

However, it is also plausible to hold:

(2) Entailment condition: If x confirms T, and T entails J, then x confirms J.

In particular, if x confirms (H & J), then x confirms J.

(3) From (1) and (2), if x confirms H, then x confirms J for any irrelevant J consistent with H.

(Assume neither H nor J has probability 0 or 1.)

It follows that if x confirms any H, then x confirms any J.

Branden Fitelson’s solution

Fitelson (2002), and Fitelson and Hawthorne (2004), offer this “solution”: allow that x confirms (H & J), but deny the entailment condition. So, in particular, x confirms the conjunction although x does not confirm the irrelevant conjunct. Moreover, Fitelson shows, even though (H & J) is confirmed by x, (H & J) gets less of a confirmation (firmness) boost than does H, so long as one doesn’t measure the confirmation boost using R: P(H|x)/P(H). If one does use R, then (H & J) is just as well confirmed as is H, which is disturbing.
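A small numerical model makes these claims easy to inspect. The sketch below is an illustration under assumptions of my own, not anything taken from Fitelson: the probabilities P(H) = 1/2, P(J) = 3/10, P(x|H) = 9/10 and P(x|~H) = 3/10 are invented, and J is stipulated to be independent of both H and x, so that P(x|H & J) = P(x|H) as the irrelevance condition requires.

```python
from fractions import Fraction as F

# Invented numbers for illustration only.
P_H = F(1, 2)
P_J = F(3, 10)             # J is independent of H and of x
P_X_GIVEN_H = F(9, 10)
P_X_GIVEN_NOT_H = F(3, 10)

WORLDS = [(h, j, x) for h in (True, False)
          for j in (True, False) for x in (True, False)]


def joint(h, j, x):
    """Joint probability of one truth assignment to (H, J, x)."""
    p_hj = (P_H if h else 1 - P_H) * (P_J if j else 1 - P_J)
    p_x = P_X_GIVEN_H if h else P_X_GIVEN_NOT_H   # J plays no role here
    return p_hj * (p_x if x else 1 - p_x)


def prob(event, given=lambda h, j, x: True):
    """P(event | given), by summing the joint over matching worlds."""
    num = sum(joint(*w) for w in WORLDS if given(*w) and event(*w))
    den = sum(joint(*w) for w in WORLDS if given(*w))
    return num / den


H = lambda h, j, x: h
J = lambda h, j, x: j
HJ = lambda h, j, x: h and j
X = lambda h, j, x: x


def negate(event):
    return lambda h, j, x: not event(h, j, x)


def ratio(hyp):
    """Ratio measure R: P(hyp | x) / P(hyp)."""
    return prob(hyp, X) / prob(hyp)


def likelihood_ratio(hyp):
    """LR: P(x | hyp) / P(x | ~hyp)."""
    return prob(X, hyp) / prob(X, negate(hyp))


for name, hyp in [("H", H), ("H & J", HJ), ("J", J)]:
    print(name, " R =", ratio(hyp), " LR =", likelihood_ratio(hyp))
```

In this model R gives H and (H & J) the same boost, LR gives the conjunction a smaller boost than H (though still above 1), and both measures leave J exactly where it started.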

But even if we use the LR as our firmness boost, I would agree with Glymour that the solution scarcely solves the real problem. Paraphrasing him, we would not be assured by an account that tells us deflection of light data (x) confirms the conjunction of GTR (H) and the claim that the radioactivity of the Fukushima water is within acceptable levels (J), while assuring us that x does not confirm the Fukushima water having acceptable levels of radiation (31).

The tacking paradox is to be expected if confirmation is taken as a variation on probabilistic affirming the consequent. Hypothetico-deductivists had the same problem, which is why Popper said we need to supplement each of the measures of confirmation boost with the condition of “severity”. However, he was unable to characterize severity adequately, and ultimately denied it could be formalized. He left it as an intuitive requirement that before applying any C-function, the confirming evidence must be the result of “a sincere (and ingenious) attempt to falsify the hypothesis” in question. I try to supply a more adequate account of severity (e.g., Mayo 1996, 2/3/12 post (no-pain philosophy III)).

How would the tacking method fare on the severity account? We’re not given the details we’d want for an error statistical appraisal, but let’s do the best with their stipulations. From our necessary condition, we have that x cannot be taken as evidence for (H and J) if x results from a highly insevere test of (H and J). The “test process” with tacking is something like this: having confirmed H, tack on any consistent but irrelevant J to obtain (H & J). (Sentence was amended on 10/21/13.)

A scrutiny of well-testedness may proceed by denying either condition for severity. To follow the confirmation theorists, let’s grant the fit requirement (since H fits or entails x). But this does not constitute having done anything to detect the falsity of H & J. The conjunction has been subjected to a radically non-risky test. (See also 1/2/13 post, esp. 5.3.4 Tacking Paradox Scotched.)

What they call confirmation we call mere “fit”

In fact, all their measures of confirmation C, be it the ratio measure R: P(H|x)/P(H) or the (so-called[1]) likelihood ratio LR: P(x|H)/P(x|~H), or one of the others, count merely as “fit” or “accordance” measures to the error statistician. There is no problem allowing each to be relevant for different problems and different dimensions of evidence. What we need to add in each case are the associated error probabilities:

P([H & J] is Bayesian confirmed; ~(J&H)) = maximal, so x is “bad evidence, no test” (BENT) for the conjunction.

We read “;” as “under the assumption that”.

In fact, all their measures of confirmation C are mere “fit” measures, be it the ratio measure R: P(H|x)/P(H) or the LR or other.

The following was added on 10-21-13: The above probability stems from taking the “fit measure” as a statistic, and assessing error probabilities by taking account of the test process, as in error statistics. The result is

SEV[(H & J), tacking test, x] is minimal
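One way to picture the error probability above is to simulate the tacking procedure in a world where H is true and ask how often (H & J) gets declared confirmed, once with J true and once with J false. This is only a sketch under my own assumptions: the value P(x|H) = 0.9 is invented, “x confirms H” is reduced to “x is observed”, and “maximal” is read as “no lower than it would be were the conjunction true”.

```python
import random

P_X_GIVEN_H = 0.9       # invented for illustration
P_X_GIVEN_NOT_H = 0.3   # invented for illustration


def tacking_test_confirms(h_true, j_true, rng):
    """One run of the tacking procedure.

    x is generated from H alone; j_true is accepted as an argument
    but never consulted, which is exactly the point of the example.
    """
    p_x = P_X_GIVEN_H if h_true else P_X_GIVEN_NOT_H
    x_observed = rng.random() < p_x
    # Having "confirmed" H by x, the procedure tacks on J and
    # reports (H & J) as confirmed.
    return x_observed


def confirmation_rate(h_true, j_true, runs=100_000, seed=0):
    rng = random.Random(seed)
    hits = sum(tacking_test_confirms(h_true, j_true, rng) for _ in range(runs))
    return hits / runs


rate_j_true = confirmation_rate(h_true=True, j_true=True)
rate_j_false = confirmation_rate(h_true=True, j_true=False)  # conjunction false

print("P(H & J declared confirmed; H and J true): ", rate_j_true)
print("P(H & J declared confirmed; H true, J false):", rate_j_false)
```

Because the same seed is used and J never enters the procedure, the two rates coincide exactly: the test does nothing to detect the falsity of the conjunction when the flaw lies in J.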

I have still further problems with these inductive logic paradigms: an adequate philosophical account should answer questions and explicate principles about the methodology of scientific inference. Yet the Bayesian inductivist starts out assuming the intuition or principle, the task then being the homework problem of assigning priors and likelihoods that mesh with the principles. This often demands beating a Bayesian analysis into line, while still not getting at its genuine rationale. “The idea of putting probabilities over hypotheses delivered to philosophy a godsend, and an entire package of superficiality.” (Glymour 2010, 334). Perhaps philosophers are moving away from analytic reconstructions. Enough tears have been shed. But does an analogous problem crop up in Bayesian logic more generally?

I may update this post, and if I do I will alter the number following the title.

Oct. 20, 2013: I am updating this to reflect corrections pointed out by James Hawthorne, for which I’m very grateful. I will call this draft (ii).

Oct. 21, 2013 (updated in blue). I think another sentence might have accidentally got moved around.

Oct. 23, 2013. Given some issues that cropped up in the discussion (and the fact that certain symbols didn’t always come out right in the comments), I’m placing the point below in Note [2]:

[1] I say “so-called” because there’s no requirement of a proper statistical model here.

[2] Can P = C?

Spoze there’s a case where z confirms hh’ more than z confirms h’:  C(hh’,z) > C(h’,z)

Now h’ = (~hh’ or hh’)
So,
(i) C(hh’,z) > C(~hh’ or hh’,z)

Since ~hh’ and hh’ are mutually exclusive, we have from special addition rule
(ii) P(hh’,z) < P(~hh’ or hh’,z)

So if P = C, (i) and (ii) yield a contradiction.
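A concrete instance of Note [2] can be built on a fair die. The events below are my own choices for illustration (h’ = “even”, hh’ = “six”, z = “at least four”), and C is taken to be the ratio measure R, which is just one of the candidate incremental measures.

```python
from fractions import Fraction

FACES = frozenset(range(1, 7))  # assumed: a fair six-sided die


def p(event, given=FACES):
    """P(event | given) by counting equally likely faces."""
    return Fraction(len(event & given), len(given))


def ratio_measure(hyp, evidence):
    """C taken as the ratio measure: P(hyp | evidence) / P(hyp)."""
    return p(hyp, evidence) / p(hyp)


h_prime = frozenset({2, 4, 6})   # h': an even number turns up
hh_prime = frozenset({6})        # h & h': a six turns up
z = frozenset({4, 5, 6})         # z: at least a four turns up

c_conj = ratio_measure(hh_prime, z)
c_hp = ratio_measure(h_prime, z)

print("C(hh',z) =", c_conj, "  C(h',z) =", c_hp)
print("P(hh'|z) =", p(hh_prime, z), "  P(h'|z) =", p(h_prime, z))
print("(i)  C(hh',z) > C(h',z):", c_conj > c_hp)
print("(ii) P(hh'|z) < P(h'|z):", p(hh_prime, z) < p(h_prime, z))
```

Both (i) and (ii) hold here, so the incremental measure and the posterior probability order the two hypotheses in opposite ways, which is why they cannot be the same function.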

REFERENCES

Achinstein, P. (2001). The Book of Evidence. Oxford: Oxford University Press.

Carnap, R. (1962). Logical Foundations of Probability. Chicago: University of Chicago Press.

Earman, J. (1992). Bayes or Bust? A Critical Examination of Bayesian Confirmation Theory. Cambridge, MA: MIT Press.

Fitelson, B.  (2002). Putting the Irrelevance Back Into the Problem of Irrelevant Conjunction. Philosophy of Science 69(4), 611–622.

Fitelson, B. & Hawthorne, J.  (2004). Re-Solving Irrelevant Conjunction with Probabilistic Independence,  Philosophy of Science, 71: 505–514.

Glymour, C. (1980). Theory and Evidence. Princeton: Princeton University Press.

_____. (2010). Explanation and Truth. In D. G. Mayo & A. Spanos (Eds.), Error and Inference: Recent Exchanges on Experimental Reasoning, Reliability, and the Objectivity and Rationality of Science, 305–314. Cambridge: Cambridge University Press.

Mayo, D. (1996). Error and the Growth of Experimental Knowledge. Chicago: University of Chicago Press.

_____. (1997). “Duhem’s Problem, The Bayesian Way, and Error Statistics, or ‘What’s Belief got To Do With It?’” and “Response to Howson and Laudan,” Philosophy of Science 64(1): 222-244 and 323-333.

_____. (2010). Explanation and Testing Exchanges with Clark Glymour. In D. G. Mayo & A. Spanos (Eds.), Error and Inference: Recent Exchanges on Experimental Reasoning, Reliability, and the Objectivity and Rationality of Science, 305–314. Cambridge: Cambridge University Press.

Popper, K. (1959). The Logic of Scientific Discovery. New York: Basic Books.

Rosenkrantz, R. (1977). Inference, Method and Decision: Towards a Bayesian Philosophy of Science. Dordrecht, The Netherlands: D. Reidel.

## Highly probable vs highly probed: Bayesian/ error statistical differences

A reader asks: “Can you tell me about disagreements on numbers between a severity assessment within error statistics, and a Bayesian assessment of posterior probabilities?” Sure.

There are differences between Bayesian posterior probabilities and formal error statistical measures, as well as between the latter and a severity (SEV) assessment, which differs from the standard type 1 and 2 error probabilities, p-values, and confidence levels—despite the numerical relationships. Here are some random thoughts that will hopefully be relevant for both types of differences. (Please search this blog for specifics.)

1. The most noteworthy difference is that error statistical inference makes use of outcomes other than the one observed, even after the data are available: there’s no other way to ask things like, how often would you find 1 nominally statistically significant difference in a hunting expedition over k or more factors?  Or to distinguish optional stopping with sequential trials from fixed sample size experiments.  Here’s a quote I came across just yesterday:

“[S]topping ‘when the data looks good’ can be a serious error when combined with frequentist measures of evidence. For instance, if one used the stopping rule [above]…but analyzed the data as if a fixed sample had been taken, one could guarantee arbitrarily strong frequentist ‘significance’ against H0.” (Berger and Wolpert, 1988, 77).

The worry about being guaranteed to erroneously exclude the true parameter value here is an error statistical affliction that the Bayesian is spared (even though I don’t think they can be too happy about it, especially when HPD intervals are assured of excluding the true parameter value.) See this post for an amusing note; Mayo and Kruse (2001) below; and, if interested, search the (strong)  likelihood principle, and Birnbaum.

2. Highly probable vs. highly probed. SEV doesn’t obey the probability calculus: for any test T and outcome x, the severity for both H and ~H might be horribly low. Moreover, an error statistical analysis is not in the business of probabilifying hypotheses but evaluating and controlling the capabilities of methods to discern inferential flaws (problems with linking statistical and scientific claims, problems of interpreting statistical tests and estimates, and problems of underlying model assumptions). This is the basis for applying what may be called the Severity principle.
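To make the optional stopping worry in point 1 concrete, here is a small simulation in which H0 is true throughout. It is a sketch under assumptions of my own: the data are standard normal, the test is a two-sided z-test at the nominal 0.05 level, and the experimenter checks after every observation up to a cap of 100. The function names are hypothetical.

```python
import math
import random


def stops_with_significance(n_max, rng, z_crit=1.96):
    """Sample one observation at a time under H0 (mean 0, sd 1) and
    stop as soon as the running z-statistic looks 'significant'."""
    total = 0.0
    for n in range(1, n_max + 1):
        total += rng.gauss(0.0, 1.0)
        if abs(total) / math.sqrt(n) > z_crit:
            return True   # stopped 'when the data looked good'
    return False          # hit the cap without nominal significance


def fixed_sample_significant(n, rng, z_crit=1.96):
    """Same test, but applied once to a sample of fixed size n."""
    total = sum(rng.gauss(0.0, 1.0) for _ in range(n))
    return abs(total) / math.sqrt(n) > z_crit


def rejection_rate(trial, trials=2000, seed=1):
    rng = random.Random(seed)
    return sum(trial(rng) for _ in range(trials)) / trials


optional_rate = rejection_rate(lambda rng: stops_with_significance(100, rng))
fixed_rate = rejection_rate(lambda rng: fixed_sample_significant(100, rng))

print("rate of nominal significance with optional stopping:", optional_rate)
print("rate of nominal significance with fixed n = 100:    ", fixed_rate)
```

The fixed-sample rate should sit near the nominal 0.05, while the optional stopping rate comes out several times larger and keeps growing as the cap is raised. Reporting the latter as if it were the former is the error the quotation describes; an error statistical analysis has to take the stopping rule into account because it changes the sampling distribution.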

## Barnard’s Birthday: background, likelihood principle, intentions

G.A. Barnard: 23 Sept.1915 – 9 Aug.2002

Reblog (year ago) : G.A. Barnard’s birthday is today, so here’s a snippet of his discussion with Savage (1962) (link below [i]) that connects to some earlier issues: stopping rules, likelihood principle, and background information here and here (at least of one type). (A few other Barnard links on this blog are below* .) Happy Birthday George!

Barnard: I have been made to think further about this issue of the stopping rule since I first suggested that the stopping rule was irrelevant (Barnard 1947a,b). This conclusion does not follow only from the subjective theory of probability; it seems to me that the stopping rule is irrelevant in certain circumstances. Since 1947 I have had the great benefit of a long correspondence—not many letters because they were not very frequent, but it went on over a long time—with Professor Bartlett, as a result of which I am considerably clearer than I was before. My feeling is that, as I indicated [on p. 42], we meet with two sorts of situation in applying statistics to data. One is where we want to have a single hypothesis with which to confront the data. Do they agree with this hypothesis or do they not? Now in that situation you cannot apply Bayes’s theorem because you have not got any alternatives to think about and specify—not yet. I do not say they are not specifiable—they are not specified yet. And in that situation it seems to me the stopping rule is relevant.

In particular, suppose somebody sets out to demonstrate the existence of extrasensory perception and says ‘I am going to go on until I get a one in ten thousand significance level’. Knowing that this is what he is setting out to do would lead you to adopt a different test criterion. What you would look at would not be the ratio of successes obtained, but how long it took him to obtain it. And you would have a very simple test of significance which said if it took you so long to achieve this increase in the score above the chance fraction, this is not at all strong evidence for E.S.P., it is very weak evidence. And the reversing of the choice of test criteria would I think overcome the difficulty.

This is the answer to the point Professor Savage makes; he says why use one method when you have vague knowledge, when you would use a quite different method when you have precise knowledge. It seems to me the answer is that you would use one method when you have precisely determined alternatives, with which you want to compare a given hypothesis, and you use another method when you do not have these alternatives.

Savage: May I digress to say publicly that I learned the stopping-rule principle from Professor Barnard, in conversation in the summer of 1952. Frankly I then thought it a scandal that anyone in the profession could advance an idea so patently wrong, even as today I can scarcely believe that some people resist an idea so patently right. I am particularly surprised to hear Professor Barnard say today that the stopping rule is irrelevant in certain circumstances only, for the argument he first gave in favour of the principle seems quite unaffected by the distinctions just discussed. The argument then was this: The design of a sequential experiment is, in the last analysis, what the experimenter actually intended to do. His intention is locked up inside his head and cannot be known to those who have to judge the experiment. Never having been comfortable with that argument, I am not advancing it myself. But if Professor Barnard still accepts it, how can he conclude that the stopping-rule principle is only sometimes valid? (emphasis added)

## Gandenberger: How to Do Philosophy That Matters (guest post)

Greg Gandenberger
Philosopher of Science
University of Pittsburgh
gandenberger.org

Genuine philosophical problems are always rooted in urgent problems outside philosophy,
and they die if these roots decay
Karl Popper (1963, 72)

My concern in this post is how we philosophers can use our skills to do work that matters to people both inside and outside of philosophy.

Philosophers are highly skilled at conceptual analysis, in which one takes an interesting but unclear concept and attempts to state precisely when it applies and when it doesn’t.

What is the point of this activity? In many cases, this question has no satisfactory answer. Conceptual analysis becomes an end in itself, and philosophical debates become fruitless arguments about words. The pleasure we philosophers take in such arguments hardly warrants scarce government and university resources. It does provide good training in critical thinking, but so do many other activities that are also immediately useful, such as doing science and programming computers.

Conceptual analysis does not have to be pointless. It is often prompted by a real-world problem. In Plato’s Euthyphro, for instance, the character Euthyphro thought that piety required him to prosecute his father for murder. His family thought on the contrary that for a son to prosecute his own father was the height of impiety. In this situation, the question “what is piety?” took on great urgency. It also had great urgency for Socrates, who was awaiting trial for corrupting the youth of Athens.

In general, conceptual analysis often begins as a response to some question about how we ought to regulate our beliefs or actions. It can be a fruitful activity as long as the questions that prompted it are kept in view. It tends to degenerate into merely verbal disputes when it becomes an end in itself.

The kind of goal-oriented view of conceptual analysis I aim to articulate and promote is not teleosemantics: it is a view about how philosophy should be done rather than a theory of meaning. It is consistent with Carnap’s notion of explication (one of the desiderata of which is fruitfulness) (Carnap 1963, 5), but in practice Carnapian explication seems to devolve into idle word games just as easily as conceptual analysis. Our overriding goal should not be fidelity to intuitions, precision, or systematicity, but usefulness.

How I Became Suspicious of Conceptual Analysis

When I began working on proofs of the Likelihood Principle, I assumed that following my intuitions about the concept of “evidential equivalence” would lead to insights about how science should be done. Birnbaum’s proof showed me that my intuitions entail the Likelihood Principle, which frequentist methods violate. Voila! Scientists shouldn’t use frequentist methods. All that remained to be done was to fortify Birnbaum’s proof, as I do in “A New Proof of the Likelihood Principle”, by defending it against objections and buttressing it with an alternative proof. [Editor: For a number of related materials on this blog see Mayo’s JSM presentation, and note [i].]

After working on this topic for some time, I realized that I was making simplistic assumptions about the relationship between conceptual intuitions and methodological norms. At most, a proof of the Likelihood Principle can show you that frequentist methods run contrary to your intuitions about evidential equivalence. Even if those intuitions are true, it does not follow immediately that scientists should not use frequentist methods. The ultimate aim of science, presumably, is not to respect evidential equivalence but (roughly) to learn about the world and make it better. The demand that scientists use methods that respect evidential equivalence is warranted only insofar as it is conducive to achieving those ends. Birnbaum’s proof says nothing about that issue.

• In general, a conceptual analysis–even of a normatively freighted term like “evidence”–is never enough by itself to justify a normative claim. The questions that ultimately matter are not about “what we mean” when we use particular words and phrases, but rather about what our aims are and how we can best achieve them.

How to Do Conceptual Analysis Teleologically

This is not to say that my work on the Likelihood Principle or conceptual analysis in general is without value. But it is nothing more than a kind of careful lexicography. This kind of work is potentially useful for clarifying normative claims with the aim of assessing and possibly implementing them. To do work that matters, philosophers engaged in conceptual analysis need to take enough interest in the assessment and implementation stages to do their conceptual analysis with the relevant normative claims in mind.

So what does this kind of teleological (goal-oriented) conceptual analysis look like?

It can involve personally following through on the process of assessing and implementing the relevant norms. For example, philosophers at Carnegie Mellon University working on causation have not only provided a kind of analysis of the concept of causation but also developed algorithms for causal discovery, proved theorems about those algorithms, and applied those algorithms to contemporary scientific problems (see e.g. Spirtes et al. 2000).

I have great respect for this work. But doing conceptual analysis does not have to mean going so far outside the traditional bounds of philosophy. A perfect example is James Woodward’s related work on causal explanation, which he describes as follows (2003, 7-8, original emphasis):

My project…makes recommendations about what one ought to mean by various causal and explanatory claims, rather than just attempting to describe how we use those claims. It recognizes that causal and explanatory claims sometimes are confused, unclear, and ambiguous and suggests how those limitations might be addressed…. we introduce concepts…and characterize them in certain ways…because we want to do things with them…. Concepts can be well or badly designed for such purposes, and we can evaluate them accordingly.

Woodward keeps his eye on what the notion of causation is for, namely distinguishing between relationships that do and relationships that do not remain invariant under interventions. This distinction is enormously important because only relationships that remain invariant under interventions provide “handles” we can use to change the world.

Here are some lessons about teleological conceptual analysis that we can take from Woodward’s work. (I’m sure this list could be expanded.)

1. Teleological conceptual analysis puts us in charge. In his wonderful presidential address at the 2012 meeting of the Philosophy of Science Association, Woodward ended a litany of metaphysical arguments against regarding mental events as causes by asking “Who’s in charge here?” There is no ideal form of Causation to which we must answer. We are free to decide to use “causation” and related words in the ways that best serve our interests.
2. Teleological conceptual analysis can be revisionary. If ordinary usage is not optimal, we can change it.
3. The product of a teleological conceptual analysis need not be unique. Some philosophers reject Woodward’s account because they regard causation as a process rather than as a relationship among variables. But why do we need to choose? There could just be two different notions of causation. Woodward’s account captures one notion that is very important in science and everyday life. If it captures all of the causal notions that are important, then so much the better. But this kind of comprehensiveness is not essential.
4. Teleological conceptual analysis can be non-reductive. Woodward characterizes causal relations as (roughly) correlation relations that are invariant under certain kinds of interventions. But the notion of an intervention is itself causal. Woodward’s account is not circular because it characterizes what it means for a causal relationship to hold between two variables in terms of different causal processes involving different sets of variables. But it is non-reductive in the sense that it does not allow us to replace causal claims with equivalent non-causal claims (as, e.g., counterfactual, regularity, probabilistic, and process theories purport to do). This fact is a problem if one’s primary concern is to reduce one’s ultimate metaphysical commitments, but it is not necessarily a problem if one’s primary concern is to improve our ability to assess and use causal claims.

Conclusion

Philosophers rarely succeed in capturing all of our intuitions about an important informal concept. Even if they did succeed, they would have more work to do in justifying any norms that invoke that concept. Conceptual analysis can be a first step toward doing philosophy that matters, but it needs to be undertaken with the relevant normative claims in mind.

Question: What are your best examples of philosophy that matters? What can we learn from them?

Citations

• Birnbaum, Allan. “On the Foundations of Statistical Inference.” Journal of the American Statistical Association 57.298 (1962): 269-306.
• Carnap, Rudolf. Logical Foundations of Probability. U of Chicago Press, 1963.
• Gandenberger, Greg. “A New Proof of the Likelihood Principle.” The British Journal for the Philosophy of Science (forthcoming).
• Plato. Euthyphro. http://classics.mit.edu/Plato/euthyfro.html.
• Popper, Karl. Conjectures and Refutations. London: Routledge & Kegan Paul, 1963.
• Spirtes, Peter, Clark Glymour, and Richard Scheines. Causation, Prediction, and Search. Vol. 81. The MIT Press, 2000.
• Woodward, James. Making Things Happen: A Theory of Causal Explanation. Oxford University Press, 2003.

[i] Earlier posts are here and here. Some U-Phils are here, here, and here. For some amusing notes (e.g., Don’t Birnbaumize that experiment my friend, and Midnight with Birnbaum).

Some related papers:

• Cox D. R. and Mayo. D. G. (2010). “Objectivity and Conditionality in Frequentist Inference” in Error and Inference: Recent Exchanges on Experimental Reasoning, Reliability and the Objectivity and Rationality of Science (D Mayo and A. Spanos eds.), Cambridge: Cambridge University Press: 276-304.

## E.S. Pearson: “Ideas came into my head as I sat on a gate overlooking an experimental blackcurrant plot”

E.S. Pearson on a Gate, Mayo sketch

Today is Egon Pearson’s birthday (11 Aug., 1895-12 June, 1980); and here you see my scruffy sketch of him, at the start of my book, “Error and the Growth of Experimental Knowledge” (EGEK 1996). As Erich Lehmann put it in his EGEK review, Pearson is “the hero of Mayo’s story” because I found in his work, if only in brief discussions, hints, and examples, the key elements for an “inferential” or “evidential” interpretation of Neyman-Pearson theory of statistics. “Pearson and Pearson” statistics (both Egon, not Karl) would have looked very different from Neyman and Pearson statistics, I suspect. One of the few sources of E.S. Pearson’s statistical philosophy is his (1955) “Statistical Concepts in Their Relation to Reality”. It begins like this:

Controversies in the field of mathematical statistics seem largely to have arisen because statisticians have been unable to agree upon how theory is to provide, in terms of probability statements, the numerical measures most helpful to those who have to draw conclusions from observational data.  We are concerned here with the ways in which mathematical theory may be put, as it were, into gear with the common processes of rational thought, and there seems no reason to suppose that there is one best way in which this can be done.  If, therefore, Sir Ronald Fisher recapitulates and enlarges on his views upon statistical methods and scientific induction we can all only be grateful, but when he takes this opportunity to criticize the work of others through misapprehension of their views as he has done in his recent contribution to this Journal (Fisher 1955 “Scientific Methods and Scientific Induction” ), it is impossible to leave him altogether unanswered.

In the first place it seems unfortunate that much of Fisher’s criticism of Neyman and Pearson’s approach to the testing of statistical hypotheses should be built upon a “penetrating observation” ascribed to Professor G.A. Barnard, the assumption involved in which happens to be historically incorrect.  There was no question of a difference in point of view having “originated” when Neyman “reinterpreted” Fisher’s early work on tests of significance “in terms of that technological and commercial apparatus which is known as an acceptance procedure”.  There was no sudden descent upon British soil of Russian ideas regarding the function of science in relation to technology and to five-year plans.  It was really much simpler–or worse.  The original heresy, as we shall see, was a Pearson one!…
Indeed, to dispel the picture of the Russian technological bogey, I might recall how certain early ideas came into my head as I sat on a gate overlooking an experimental blackcurrant plot…!

Happy Birthday E.S. Pearson!

## Background Knowledge: Not to Quantify, But To Avoid Being Misled By, Subjective Beliefs

A low-powered statistical analysis of this blog—nearing its 2-year anniversary!—reveals that the topic to crop up most often—either front and center, or lurking in the bushes–is that of “background information”. The following was one of my early posts, back on Oct. 30, 2011:

October 30, 2011 (London). Increasingly, I am discovering that one of the biggest sources of confusion about the foundations of statistics has to do with what it means or should mean to use “background knowledge” and “judgment” in making statistical and scientific inferences. David Cox and I address this in our “Conversation” in RMM (2011); it is one of the three or four topics in that special volume that I am keen to take up.

Insofar as humans conduct science and draw inferences, and insofar as learning about the world is not reducible to a priori deductions, it is obvious that “human judgments” are involved. True enough, but too trivial an observation to help us distinguish among the very different ways judgments should enter according to contrasting inferential accounts. When Bayesians claim that frequentists do not use or are barred from using background information, what they really mean is that frequentists do not use prior probabilities of hypotheses, at least when those hypotheses are regarded as correct or incorrect, if only approximately. So, for example, we would not assign relative frequencies to the truth of hypotheses such as (1) prion transmission is via protein folding without nucleic acid, or (2) the deflection of light is approximately 1.75” (as if, as Peirce puts it, “universes were as plenty as blackberries”). How odd it would be to try to model these hypotheses as themselves having distributions: to us, statistical hypotheses assign probabilities to outcomes or values of a random variable.

However, quite a lot of background information goes into designing, carrying out, and analyzing inquiries into hypotheses regarded as correct or incorrect. For a frequentist, that is where background knowledge enters. There is no reason to suppose that the background required in order sensibly to generate, interpret, and draw inferences about H should—or even can—enter through prior probabilities for H itself! Of course, presumably, Bayesians also require background information in order to determine that “data x” have been observed, to determine how to model and conduct the inquiry, and to check the adequacy of statistical models for the purposes of the inquiry. So the Bayesian prior only purports to add some other kind of judgment, about the degree of belief in H. It does not get away from the other background judgments that frequentists employ.

This relates to a second point that came up in our conversation when Cox asked, “Do we want to put in a lot of information external to the data, or as little as possible?”

## A.Birnbaum: Statistical Methods in Scientific Inference

Birnbaum: born May 27, 1923

Today is (statistician) Allan Birnbaum’s birthday. He lived to be only 53 [i]. From the perspective of philosophy of statistics and philosophy of science, Birnbaum is best known for his work on likelihood, the Likelihood Principle [ii], and for his attempts to blend concepts of likelihood with error probability ideas to obtain what he called “concepts of statistical evidence”. Failing to find adequate concepts of statistical evidence, Birnbaum called for joining the work of “interested statisticians, scientific workers and philosophers and historians of science”–an idea I would heartily endorse!  While known for attempts to argue that the (strong) Likelihood Principle followed from sufficiency and conditionality principles, a few years after publishing this result, he seems to have turned away from it, perhaps discovering gaps in his argument.

NATURE VOL. 225 MARCH 14, 1970 (1033)

LETTERS TO THE EDITOR

Statistical methods in Scientific Inference

It is regrettable that Edwards’s interesting article[1], supporting the likelihood and prior likelihood concepts, did not point out the specific criticisms of likelihood (and Bayesian) concepts that seem to dissuade most theoretical and applied statisticians from adopting them. As one whom Edwards particularly credits with having ‘analysed in depth…some attractive properties’ of the likelihood concept, I must point out that I am not now among the ‘modern exponents’ of the likelihood concept. Further, after suggesting that the notion of prior likelihood was plausible as an extension or analogue of the usual likelihood concept (ref.2, p. 200)[2], I have pursued the matter through further consideration and rejection of both the likelihood concept and various proposed formalizations of prior information and opinion (including prior likelihood). I regret not having expressed my developing views in any formal publication between 1962 and late 1969 (just after ref. 1 appeared). My present views have now, however, been published in an expository but critical article (ref. 3, see also ref. 4)[3] [4], and so my comments here will be restricted to several specific points that Edwards raised.

If there has been ‘one rock in a shifting scene’ of general statistical thinking and practice in recent decades, it has not been the likelihood concept, as Edwards suggests, but rather the concept by which confidence limits and hypothesis tests are usually interpreted, which we may call the confidence concept of statistical evidence. This concept is not part of the Neyman-Pearson theory of tests and confidence region estimation, which denies any role to concepts of statistical evidence, as Neyman consistently insists. The confidence concept takes from the Neyman-Pearson approach techniques for systematically appraising and bounding the probabilities (under respective hypotheses) of seriously misleading interpretations of data. (The absence of a comparable property in the likelihood and Bayesian approaches is widely regarded as a decisive inadequacy.) The confidence concept also incorporates important but limited aspects of the likelihood concept: the sufficiency concept, expressed in the general refusal to use randomized tests and confidence limits when they are recommended by the Neyman-Pearson approach; and some applications of the conditionality concept. It is remarkable that this concept, an incompletely formalized synthesis of ingredients borrowed from mutually incompatible theoretical approaches, is evidently useful continuously in much critically informed statistical thinking and practice [emphasis mine].

While inferences of many sorts are evident everywhere in scientific work, the existence of precise, general and accurate schemas of scientific inference remains a problem. Mendelian examples like those of Edwards and my 1969 paper seem particularly appropriate as case-study material for clarifying issues and facilitating effective communication among interested statisticians, scientific workers and philosophers and historians of science.

Allan Birnbaum
New York University
Courant Institute of Mathematical Sciences,
251 Mercer Street,
New York, NY 10012

Birnbaum’s confidence concept, sometimes written (Conf), was his attempt to find in error statistical ideas a concept of statistical evidence–a term that he invented and popularized. In Birnbaum 1977 (24), he states it as follows:

(Conf): A concept of statistical evidence is not plausible unless it finds ‘strong evidence for J as against H’ with small probability (α) when H is true, and with much larger probability (1 – β) when J is true.
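To see what (Conf) asks for in a concrete case, here is a minimal sketch. Everything specific in it is my own assumption for illustration and not Birnbaum's: a one-sided Normal test with known σ, point hypotheses H: μ = 0 and J: μ = 1, a sample of n = 16, and the rule "report strong evidence for J as against H when the sample mean exceeds a cutoff fixed by α". The function name `conf_probabilities` is hypothetical.

```python
from statistics import NormalDist


def conf_probabilities(mu_h=0.0, mu_j=1.0, sigma=1.0, n=16, alpha=0.05):
    """Return (probability under H, probability under J) of reporting
    'strong evidence for J as against H'.

    Illustrative assumption: the report is made whenever the sample mean
    exceeds the cutoff that gives probability `alpha` under H.
    """
    se = sigma / n ** 0.5                           # standard error of the mean
    cutoff = NormalDist(mu_h, se).inv_cdf(1 - alpha)
    prob_under_h = 1 - NormalDist(mu_h, se).cdf(cutoff)   # alpha
    prob_under_j = 1 - NormalDist(mu_j, se).cdf(cutoff)   # 1 - beta
    return prob_under_h, prob_under_j


if __name__ == "__main__":
    a, power = conf_probabilities()
    print(f"P(report | H true) = {a:.3f}")
    print(f"P(report | J true) = {power:.3f}")
```

With these assumed numbers the rule satisfies (Conf): the report is rare under H and very probable under J. Shrinking `n` or moving `mu_j` toward `mu_h` pulls the two probabilities together, which is when (Conf) would call the concept of evidence implausible.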

Birnbaum questioned whether Neyman-Pearson methods had “concepts of evidence” simply because Neyman talked of “inductive behavior” and Wald and others couched statistical methods in decision-theoretic terms. I have been urging that we consider instead how the tools may actually be used, and not be restricted by the statistical philosophies of founders (not to mention that so many of their statements are tied up with personality disputes, and problems of “anger management”). Recall, as well, E. Pearson’s insistence on an evidential construal of N-P methods, and the fact that Neyman, in practice, spoke of drawing inferences and reaching conclusions (e.g., Neyman’s nursery posts, links in [iii] below).


## Stephen Senn: Also Smith and Jones

Also Smith and Jones[1]
by Stephen Senn

Head of Competence Center for Methodology and Statistics (CCMS)

This story is based on a paradox proposed to me by Don Berry. I have my own opinion on this but I find that opinion boring and predictable. The opinion of others is much more interesting and so I am putting this up for others to interpret.

Two scientists working for a pharmaceutical company collaborate in designing and running a clinical trial known as CONFUSE (Clinical Outcomes in Neuropathic Fibromyalgia in US Elderly). One of them, Smith, is going to start another programme of drug development in a little while. The other one, Jones, will just be working on the current project. The planned sample size is 6000 patients.

Smith says that he would like to look at the experiment after 3000 patients in order to make an important decision as regards his other project. As far as he is concerned that’s good enough.

Jones is horrified. She considers that for other reasons CONFUSE should continue to recruit all 6000 and that on no account should the trial be stopped early.

Smith says that he is simply going to look at the data to decide whether to initiate a trial in a similar product being studied in the other project he will be working on. The fact that he looks should not affect Jones’s analysis.

Jones is still very unhappy and points out that the integrity of her trial is being compromised.

Smith suggests that all that she needs to do is to state quite clearly in the protocol that the trial will proceed whatever the result of the interim administrative look and she should just write that this is so in the protocol. The fact that she states publicly that on no account will she claim significance based on the first 3000 alone will reassure everybody including the FDA. (In drug development circles, FDA stands for Finally Decisive Argument.)

However, Jones insists. She wants to know what Smith will do if the result after 3000 patients is not significant.

Smith replies that in that case he will not initiate the trial in the parallel project. It will suggest to him that it is not worth going ahead.

Jones wants to know what Smith will do once the results of all 6000 are in, supposing that the results for the first 3000 are not significant.

Smith replies that, of course, in that case he will have a look. If (but it seems to him an unlikely situation) the results based on all 6000 will be significant, even though the results based on the first 3000 were not, he may well decide that the treatment works after all and initiate his alternative program, regretting, of course, the time that has been lost.

Jones points out that Smith will not be controlling his type I error rate by this procedure.

‘OK’, says Smith, ‘to satisfy you I will use adjusted type I error rates. You, of course, don’t have to.’

The trial is run. Smith looks after 3000 patients and concludes the difference is not significant. The trial continues on its planned course. Jones looks after 6000 and concludes it is significant, P=0.049. Smith looks after 6000 and concludes it is not significant, P=0.052. (A very similar thing happened in the famous TORCH study(1).)

Shortly after the conclusion of the trial, Smith and Jones are head-hunted and leave the company.  The brief is taken over by new recruit Evans.

What does Evans have on her hands: a significant study or not?
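Jones's worry about Smith's type I error rate can be made concrete with a small calculation. This is only a sketch under assumptions the story does not state: a two-sided z-test, a true null, two equally sized stages (3000 then 6000), and a procedure that declares significance if either look crosses the same critical value. The function name and the numerical integration scheme are mine; the story does not say how Smith actually adjusted.

```python
from statistics import NormalDist

STD = NormalDist()  # standard Normal


def overall_type1(c, steps=4000):
    """P(|Z1| > c or |Z2| > c) under the null hypothesis.

    Z1 is the z-statistic on the first half of the data and Z2 the
    z-statistic on all of it, so Z2 = (Z1 + W) / sqrt(2) with W an
    independent standard Normal (the second half's z-statistic).
    Uses a midpoint rule over z1 in (-c, c).
    """
    root2 = 2 ** 0.5
    h = 2 * c / steps
    p_never_reject = 0.0
    for i in range(steps):
        z1 = -c + (i + 0.5) * h
        # |Z2| < c  <=>  -c*sqrt(2) - z1 < W < c*sqrt(2) - z1
        p_z2_inside = STD.cdf(c * root2 - z1) - STD.cdf(-c * root2 - z1)
        p_never_reject += STD.pdf(z1) * p_z2_inside * h
    return 1.0 - p_never_reject


if __name__ == "__main__":
    print(f"two looks at c = 1.96 : {overall_type1(1.96):.4f}")
    print(f"two looks at c = 2.178: {overall_type1(2.178):.4f}")
```

Under these assumptions, two unadjusted looks at the nominal 5% level give an overall rate noticeably above 5%, while raising the critical value at both looks (2.178 is the usual Pocock constant for two looks) brings it back to about 5%. That is the sense in which the same data can sit on different sides of 0.05 for Smith and for Jones.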

Reference

1.  Calverley PM, Anderson JA, Celli B, Ferguson GT, Jenkins C, Jones PW, et al. Salmeterol and fluticasone propionate and survival in chronic obstructive pulmonary disease. The New England journal of medicine. 2007;356(8):775-89.

[1] Not to be confused with either Alias Smith and Jones or even Alas Smith and Jones.


## U-Phil: Mayo’s response to Hennig and Gandenberger

brakes on the ‘breakthrough’

“This will be my last post on the (irksome) Birnbaum argument!” she says with her fingers (or perhaps toes) crossed. But really, really it is (at least until midnight 2013). In fact the following brief remarks are all said, more clearly, in my (old) paper, new paper, Mayo 2010, Cox & Mayo 2011 (appendix), and in posts connected to this U-Phil: Blogging the likelihood principle, new summary 10/31/12*.

What’s the catch?

In my recent “Ton o’ Bricks” post, many readers were struck by the implausibility of letting the evidential interpretation of x’* be influenced by the properties of experiments known not to have produced x’*. Yet it is altogether common to be told that, should a sampling theorist try to block this, “unfortunately there is a catch” (Ghosh, Delampady, and Samanta 2006, 38): We would be forced to embrace the strong likelihood principle (SLP, or LP, for short), at least according to an infamous argument by Allan Birnbaum (who himself rejected the LP [i]).

It is not uncommon to see statistics texts argue that in frequentist theory one is faced with the following dilemma: either to deny the appropriateness of conditioning on the precision of the tool chosen by the toss of a coin, or else to embrace the strong likelihood principle, which entails that frequentist sampling distributions are irrelevant to inference once the data are obtained. This is a false dilemma. . . . The “dilemma” argument is therefore an illusion. (Cox and Mayo 2010, 298)

In my many detailed expositions, I have explained the source of the illusion and sleight of hand from a number of perspectives (I will not repeat references here). While I appreciate the care that Hennig and Gandenberger have taken in their U-Phils (and wish them all the luck in published outgrowths), it is clear to me that they are not hearing (or are unwittingly blocking) the scre-e-e-e-ching of the brakes!

No revolution, no breakthrough!

Berger and Wolpert, in their famous monograph The Likelihood Principle, identify the core issue:

The philosophical incompatibility of the LP and the frequentist viewpoint is clear, since the LP deals only with the observed x, while frequentist analyses involve averages over possible observations. . . . Enough direct conflicts have been . . . seen to justify viewing the LP as revolutionary from a frequentist perspective. (Berger and Wolpert 1988, 65-66)[ii]

If Birnbaum’s proof does not apply to a frequentist sampling theorist, then there is neither a revolution nor a breakthrough (as Savage called it). The SLP holds just for methodologies in which it holds . . . We are going in circles.

Since Birnbaum’s argument has stood for over fifty years, I’ve given it the maximal run for its money, and haven’t tried to block its premises, however questionable its key moves may appear. Despite such latitude, I’ve shown that the “proof” to the SLP conclusion will not wash, and I’m just a wee bit disappointed that Hennig and Gandenberger haven’t wrestled with my specific argument, or shown just where they think my debunking fails. What would this require?

Since the SLP is a universal generalization, it requires only a single counterexample to falsify it. In fact, every violation of the SLP within frequentist sampling theory, I show, is a counterexample to it! In other words, using the language from the definition of the SLP, the onus is on Birnbaum to show that for any x’* that is a member of an SLP pair (E’, E”) with given, different probability models f’, f”, x’* and x”* should have the identical evidential import for an inference concerning parameter θ, on pain of facing “the catch” above, i.e., being forced to allow the import of data known to have come from E’ to be altered by unperformed experiments known not to have produced x’*.

If one is to release the brakes from my screeching halt, defenders of Birnbaum might try to show that the SLP counterexamples lead me to “the catch” as alleged. I have considered two well-known violations of the SLP. Can it be shown that a contradiction with the WCP or SP follows? I say no. Neither Hennig[ii] nor Gandenberger shows otherwise.
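For readers who want an SLP pair in front of them, here is a minimal sketch of the textbook binomial versus negative binomial case. It is my own choice of illustration, not necessarily one of the two violations referred to above, and the numbers (9 successes, 3 failures, null value θ = 0.5, one-sided alternative θ > 0.5) and function names are assumptions made for the example. In E’ the number of trials is fixed at 12; in E” trials continue until the 3rd failure.

```python
from math import comb


def binomial_pvalue(successes, n, theta0=0.5):
    """E': n fixed in advance. P(X >= successes) under theta0."""
    return sum(comb(n, k) * theta0 ** k * (1 - theta0) ** (n - k)
               for k in range(successes, n + 1))


def negbinomial_pvalue(successes, failures, theta0=0.5):
    """E'': sample until the `failures`-th failure. P(X >= successes).

    X >= successes exactly when the first successes + failures - 1 trials
    contain at most failures - 1 failures.
    """
    m = successes + failures - 1
    return sum(comb(m, j) * (1 - theta0) ** j * theta0 ** (m - j)
               for j in range(failures))


def likelihood_ratio(theta, successes=9, failures=3):
    """Ratio of the E' likelihood to the E'' likelihood at theta."""
    n = successes + failures
    kernel = theta ** successes * (1 - theta) ** failures
    return (comb(n, successes) * kernel) / (comb(n - 1, successes) * kernel)


if __name__ == "__main__":
    print("ratio at 0.3:", likelihood_ratio(0.3))
    print("ratio at 0.7:", likelihood_ratio(0.7))
    print("binomial p-value:         ", binomial_pvalue(9, 12))
    print("negative binomial p-value:", negbinomial_pvalue(9, 3))
```

The likelihood ratio is the same constant at every θ, so the two outcomes form an SLP pair, yet the two sampling distributions give different p-values. That difference is the kind of SLP violation at issue: the sampling theorist reports it, and the SLP says it should not matter.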

In my tracing out of Birnbaum’s arguments, I strived to assume that he would not be giving us circular arguments. To say that “I can prove that your methodology must obey the SLP,” and then to set out to do so by declaring “Hey Presto! Assume sampling distributions are irrelevant (once the data are in hand),” is a neat trick, but it assumes what it purports to prove. All other interpretations are shown to be unsound.

______

[i] Birnbaum himself, soon after presenting his result, rejected the SLP. As Birnbaum puts it, “the likelihood concept cannot be construed so as to allow useful appraisal, and thereby possible control, of probabilities of erroneous interpretations.” (Birnbaum 1969, p. 128.)

(We use LP and SLP synonymously here.)

[ii] Hennig initially concurred with me, but says a person convinced him to get back on the Birnbaum bus (even though Birnbaum got off it [i]).

Some other, related, posted discussions: Brakes on Breakthrough Part 1 (12/06/11)  & Part 2 (12/07/11); Don’t Birnbaumize that experiment (12/08/12); Midnight with Birnbaum re-blog (12/31/12). The initial call to this U-Phil, the extension, details here,  the post from my 28 Nov. seminar, (LSE), and the original post by Gandenberger,

OTHER :

Birnbaum, A. (1962), “On the Foundations of Statistical Inference”, Journal of the American Statistical Association 57 (298), 269-306.

Savage, L. J., Barnard, G., Cornfield, J., Bross, I, Box, G., Good, I., Lindley, D., Clunies-Ross, C., Pratt, J., Levene, H., Goldman, T., Dempster, A., Kempthorne, O, and Birnbaum, A. (1962). On the foundations of statistical inference: “Discussion (of Birnbaum 1962)”,  Journal of the American Statistical Association 57 (298), 307-326.

Birnbaum, A. (1970). Statistical Methods in Scientific Inference (letter to the editor). Nature 225, 1033.

Cox D. R. and Mayo. D. (2010). “Objectivity and Conditionality in Frequentist Inference” in Error and Inference: Recent Exchanges on Experimental Reasoning, Reliability and the Objectivity and Rationality of Science (D Mayo & A. Spanos eds.), CUP 276-304.

…and if that’s not enough, search this blog.


## U-PHIL: Gandenberger & Hennig: Blogging Birnbaum’s Proof

Defending Birnbaum’s Proof

Greg Gandenberger
PhD student, History and Philosophy of Science
Master’s student, Statistics
University of Pittsburgh

In her 1996 Error and the Growth of Experimental Knowledge, Professor Mayo argued against the Likelihood Principle on the grounds that it does not allow one to control long-run error rates in the way that frequentist methods do.  This argument seems to me the kind of response a frequentist should give to Birnbaum’s proof.  It does not require arguing that Birnbaum’s proof is unsound: a frequentist can accommodate Birnbaum’s conclusion (two experimental outcomes are evidentially equivalent if they have the same likelihood function) by claiming that respecting evidential equivalence is less important than achieving certain goals for which frequentist methods are well suited.

More recently, Mayo has shown that Birnbaum’s premises cannot be reformulated as claims about what sampling distribution should be used for inference while retaining the soundness of his proof. It does not follow that Birnbaum’s proof is unsound, because Birnbaum’s original premises are not claims about what sampling distribution should be used for inference but are instead sufficient conditions for experimental outcomes to be evidentially equivalent.

Mayo acknowledges that the premises she uses in her argument against Birnbaum’s proof differ from Birnbaum’s original premises in a recent blog post in which she distinguishes between “the Sufficiency Principle (general)” and “the Sufficiency Principle applied in sampling theory.” One could make a similar distinction for the Weak Conditionality Principle. There is indeed no way to formulate Sufficiency and Weak Conditionality Principles “applied in sampling theory” that are consistent and imply the Likelihood Principle. This fact is not surprising: sampling theory is incompatible with the Likelihood Principle!

Birnbaum himself insisted that his premises were to be understood as “equivalence relations” rather than as “substitution rules” (i.e., rules about what sampling distribution should be used for inference) and recognized the fact that understanding them in this way was necessary for his proof.  As he put it in his 1975 rejoinder to Kalbfleisch’s response to his proof, “It was the adoption of an unqualified equivalence formulation of conditionality, and related concepts, which led, in my 1972 paper, to the monster of the likelihood axiom” (263).

Because Mayo’s argument against Birnbaum’s proof requires reformulating Birnbaum’s premises, it is best understood as an argument not for the claim that Birnbaum’s original proof is invalid, but rather for the claim that Birnbaum’s proof is valid only when formulated in a way that is irrelevant to a sampling theorist.  Reformulating Birnbaum’s premises as claims about what sampling distribution should be used for inference is the only way for a fully committed sampling theorist to understand them.  Any other formulation of those premises is either false or question-begging.

Mayo’s argument makes good sense when understood in this way, but it requires a strong prior commitment to sampling theory. Whether various arguments for sampling theory such as those Mayo gives in Error and the Growth of Experimental Knowledge are sufficient to warrant such a commitment is a topic for another day.  To those who lack such a commitment, Birnbaum’s original premises may seem quite compelling.  Mayo has not refuted the widespread view that those premises do in fact entail the Likelihood Principle.

Mayo has objected to this line of argument by claiming that her reformulations of Birnbaum’s principles are just instantiations of Birnbaum’s principles in the context of frequentist methods. But they cannot be instantiations in a literal sense because they are imperatives, whereas Birnbaum’s original premises are declaratives. They are instead instructions that a frequentist would have to follow in order to avoid violating Birnbaum’s principles. The fact that one cannot follow them both is only an objection to Birnbaum’s principles on the question-begging assumption that evidential meaning depends on sampling distributions.

********

Birnbaum’s proof is not wrong but error statisticians don’t need to bother

Christian Hennig
Department of Statistical Science
University College London

I was impressed by Mayo’s arguments in “Error and Inference” when I came across them for the first time. To some extent, I still am. However, I have also seen versions of Birnbaum’s theorem and proof presented in a mathematically sound fashion with which I as a mathematician had no issue.

After having discussed this a bit with Phil Dawid, and having thought and read more on the issue, my conclusion is that
1) Birnbaum’s theorem and proof are correct (apart from small mathematical issues resolved later in the literature), and they are not vacuous (i.e., there are evidence functions that fulfill them without any contradiction in the premises),
2) however, Mayo’s arguments actually do raise an important problem with Birnbaum’s reasoning.

Here is why. Note that Mayo’s arguments are based on the implicit (error statistical) assumption that the sampling distribution of an inference method is relevant. In that case, application of the sufficiency principle to Birnbaum’s mixture distribution enforces the use of the sampling distribution under the mixture distribution as it is, whereas application of the conditionality principle enforces the use of the sampling distribution under the experiment that actually produced the data, which is different in the usual examples. So the problem is not that Birnbaum’s proof is wrong, but that enforcing both principles at the same time in the mixture experiment is in contradiction to the relevance of the sampling distribution (and therefore to error statistical inference). It is a case in which the sufficiency principle suppresses information that is clearly relevant under the conditionality principle. This means that the justification of the sufficiency principle (namely that all relevant information is in the sufficient statistic) breaks down in this case.

Frequentists/error statisticians therefore don’t need to worry about the likelihood principle because they shouldn’t accept the sufficiency principle in the generality that is required for Birnbaum’s proof.
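The gap between "the sampling distribution under the mixture" and "the sampling distribution under the experiment that actually produced the data" can be seen in a toy version of the familiar two-instrument example. This is a sketch under my own assumptions, not Birnbaum's mixture of E’ and E” itself: a fair coin picks one of two measuring instruments with standard deviations 1 and 10, the null is μ = 0, and we look at a one-sided p-value for an observed reading. The function names are hypothetical.

```python
from statistics import NormalDist

STD = NormalDist()  # standard Normal


def conditional_pvalue(x, sigma):
    """p-value using the sampling distribution of the instrument
    that was actually used (null: mu = 0, one-sided)."""
    return 1 - STD.cdf(x / sigma)


def unconditional_pvalue(x, sigmas=(1.0, 10.0)):
    """p-value averaged over the coin toss, i.e. computed from the
    sampling distribution of the whole mixture experiment."""
    return sum(conditional_pvalue(x, s) for s in sigmas) / len(sigmas)


if __name__ == "__main__":
    x = 2.0  # reading known to come from the precise instrument
    print(f"conditional on sigma = 1: {conditional_pvalue(x, 1.0):.4f}")
    print(f"averaged over the coin  : {unconditional_pvalue(x):.4f}")
```

A reading of 2 from the precise instrument looks quite different depending on which of the two sampling distributions is used. If the sampling distribution is taken to be relevant, a principle that forces the mixture version and a principle that forces the conditional version cannot both be applied to the same outcome, which is the tension described above.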

Having understood this, I toyed around with the idea of writing this down as a publishable paper, but I now came across a paper in which this argument can already be found (although in a less straightforward and more mathematical manner), namely:
M. J. Evans, D. A. S. Fraser and G. Monette (1986) On Principles and Arguments to Likelihood. Canadian Journal of Statistics 14, 181-194, http://www.jstor.org/stable/3314794, particularly Section 7 (the rest is interesting, too).

NOTE: This is the last of this group of U-Phils. Mayo will issue a brief response tomorrow. Background to these U-Phils may be found here.


## From Gelman’s blog: philosophy and the practice of Bayesian statistics

I hadn’t read Gelman and Shalizi’s response to my comment on their paper in the British Journal of Mathematical and Statistical Psychology. I see the issue is posted on Gelman’s blog. Here’s the issue of the journal:

Philosophy and the practice of Bayesian statistics (with all the discussions!)

Mark Andrews and Thom Baguley

## Coming up: December U-Phil Contributions….

Dear Reader: You were probably* wondering about the December U-Phils (blogging the strong likelihood principle (SLP)). They will be posted, singly or in pairs, over the next few blog entries. Here is the initial call, and the extension. The details of the specific U-Phil may be found here, but also look at the post from my 28 Nov. seminar at the London School of Economics (LSE), which was on the SLP. Posts were to be in relation to either the guest graduate student post by Gandenberger, and/or my discussion/argument and reactions to it. Earlier U-Phils may be found here; and more by searching this blog. “U-Phil” is short for “you philosophize”.

If you have ideas for future “U-Phils,” post them as comments to this blog or send them to error@vt.edu.

*This is how I see “probability” mainly used in ordinary English, namely as expressing something like “here’s a pure guess made without evidence or with little evidence,” be it sarcastic or quite genuine.

## Severity as a ‘Metastatistical’ Assessment

Some weeks ago I discovered an error* in the upper severity bounds for the one-sided Normal test in section 5 of: “Statistical Science Meets Philosophy of Science Part 2” SS & POS 2. The published article has been corrected. The error was in section 5.3, but I am blogging all of 5.

(* μo was written where xo should have been!)

5. The Error-Statistical Philosophy

I recommend moving away, once and for all, from the idea that frequentists must ‘sign up’ for either Neyman and Pearson, or Fisherian paradigms. As a philosopher of statistics I am prepared to admit to supplying the tools with an interpretation and an associated philosophy of inference. I am not concerned to prove this is what any of the founders ‘really meant’.

Fisherian simple-significance tests, with their single null hypothesis and at most an idea of a directional alternative (and a corresponding notion of the ‘sensitivity’ of a test), are commonly distinguished from Neyman and Pearson tests, where the null and alternative exhaust the parameter space, and the corresponding notion of power is explicit. On the interpretation of tests that I am proposing, these are just two of the various types of testing contexts appropriate for different questions of interest. My use of a distinct term, ‘error statistics’, frees us from the bogeymen and bogeywomen often associated with ‘classical’ statistics, and it is to be hoped that that term is shelved. (Even ‘sampling theory’, technically correct, does not seem to represent the key point: the sampling distribution matters in order to evaluate error probabilities, and thereby assess corroboration or severity associated with claims of interest.) Nor do I see that my comments turn on whether one replaces frequencies with ‘propensities’ (whatever they are).

## Don’t Birnbaumize that experiment my friend*–updated reblog

Our current topic, the strong likelihood principle (SLP), was recently mentioned by blogger Christian Robert (nice diagram). So, since it’s Saturday night, and given the new law just passed in the state of Washington*, I’m going to reblog an earlier post along with a new UPDATE (following a video we include as an experiment). The new material will be in red (slight differences in notation are explicated within links).

(A) “It is not uncommon to see statistics texts argue that in frequentist theory one is faced with the following dilemma: either to deny the appropriateness of conditioning on the precision of the tool chosen by the toss of a coin[i], or else to embrace the strong likelihood principle which entails that frequentist sampling distributions are irrelevant to inference once the data are obtained. This is a false dilemma … The ‘dilemma’ argument is therefore an illusion”. (p. 298)

The “illusion” stems from the sleight of hand I have been explaining in the Birnbaum argument: it starts with Birnbaumization.


## Announcement: U-Phil Extension: Blogging the Likelihood Principle

U-Phil: I am extending to Dec. 19, 2012 the date for sending me responses to the “U-Phil” call (see initial call), given some requests for more time. The details of the specific U-Phil may be found here, but you might also look at the post relating to my 28 Nov. seminar at the LSE, which is directly on the topic: the infamous (strong) likelihood principle (SLP). “U-Phil,” which is short for “you ‘philosophize’,” is really just an opportunity to write something .5-1 notch above an ordinary comment (focussed on one or more specific posts/papers, as described in each call): it can be longer (~500-1000 words), and it appears in the regular blog area rather than as a comment. Your remarks can relate to the guest graduate student post by Gregory Gandenberger, and/or my discussion/argument. Graduate student posts (e.g., attendees of my 28 Nov. LSE seminar?) are especially welcome*. Earlier exemplars of U-Phils may be found here; and more by searching this blog.

Thanks to everyone who sent me names of vintage typewriter repair shops in London, after the airline damage: the “x” is fixed, but the “z” key is still misbehaving.

*Another post of possible relevance to graduate students comes up when searching this blog for  “sex”.

## Error Statistics (brief overview)

In view of some questions about “behavioristic” vs “evidential” construals of frequentist statistics (from the last post), and how the error statistical philosophy tries to improve on Birnbaum’s attempt at providing the latter, I’m reblogging a portion of a post from Nov. 5, 2011 when I also happened to be in London. (The beginning just records a goofy mishap with a skeletal key, and so I leave it out in this reblog.) Two papers with much more detail are linked at the end.

Error Statistics

(1) There is a “statistical philosophy” and a philosophy of science. (a) An error-statistical philosophy alludes to the methodological principles and foundations associated with frequentist error-statistical methods. (b) An error-statistical philosophy of science, on the other hand, involves using the error-statistical methods, formally or informally, to deal with problems of philosophy of science: to model scientific inference (actual or rational), to scrutinize principles of inference, and to address philosophical problems about evidence and inference (the problem of induction, underdetermination, warranting evidence, theory testing, etc.).


## Blogging Birnbaum: on Statistical Methods in Scientific Inference

I said I’d make some comments on Birnbaum’s letter (to Nature), (linked in my last post), which is relevant to today’s Seminar session (at the LSE*), as well as to (Normal Deviate’s) recent discussion of frequentist inference–in terms of constructing procedures with good long-run “coverage”. (Also to the current U-Phil.)

NATURE VOL. 225 MARCH 14, 1970 (1033)

LETTERS TO THE EDITOR

Statistical methods in Scientific Inference

It is regrettable that Edwards’s interesting article[1], supporting the likelihood and prior likelihood concepts, did not point out the specific criticisms of likelihood (and Bayesian) concepts that seem to dissuade most theoretical and applied statisticians from adopting them. As one whom Edwards particularly credits with having ‘analysed in depth…some attractive properties’ of the likelihood concept, I must point out that I am not now among the ‘modern exponents’ of the likelihood concept. Further, after suggesting that the notion of prior likelihood was plausible as an extension or analogue of the usual likelihood concept (ref.2, p. 200)[2], I have pursued the matter through further consideration and rejection of both the likelihood concept and various proposed formalizations of prior information and opinion (including prior likelihood). I regret not having expressed my developing views in any formal publication between 1962 and late 1969 (just after ref. 1 appeared). My present views have now, however, been published in an expository but critical article (ref. 3, see also ref. 4)[3] [4], and so my comments here will be restricted to several specific points that Edwards raised.


## Likelihood Links [for 28 Nov. Seminar and Current U-Phil]

Dear Reader: We just arrived in London[i][ii]. Jean Miller has put together some materials for Birnbaum LP aficionados in connection with my 28 November seminar. It is great to have ready links to some of the early comments and replies by Birnbaum, Durbin, Kalbfleisch and others, which may be of interest to those planning contributions to the current “U-Phil”. I will try to make some remarks on Birnbaum’s 1970 letter to the editor tomorrow.

## Announcement: 28 November: My Seminar at the LSE (Contemporary PhilStat)

28 November: (10 – 12 noon):
Mayo: “On Birnbaum’s argument for the Likelihood Principle: A 50-year old error and its influence on statistical foundations”
PH500 Seminar, Room: Lak 2.06 (Lakatos building).
London School of Economics and Political Science (LSE)