“Frequentist Statistics as a Theory of Inductive Inference”
“Probability/Statistics Lecture Notes 4: Hypothesis Testing”
*This is Day #5 on the Syllabus, as Day #4 had to be made up (Feb 24, 2014) due to snow. Slides for Day #4 will go up Feb. 26, 2014. (See the revised Syllabus Second Installment.)
Day #1 slides are here.
REVISITING THE FOUNDATIONS OF STATISTICS IN THE ERA OF BIG DATA: SCALING UP TO MEET THE CHALLENGE
Cosponsored by the Department of Mathematics & Statistics at Boston University.
Friday, February 21, 2014
10 a.m. – 5:30 p.m.
Photonics Center, 9th Floor Colloquium Room (Rm 906)
8 St. Mary’s Street
On weekends this spring (in connection with Phil 6334, but not limited to seminar participants) I will post relevant “comedy hours”, invites to analyze short papers or blogs (“U-Phils”, as in “U-philosophize”), and some of my “deconstructions” of articles. To begin with a “U-Phil”, consider a note by Andrew Gelman: “Ethics and the statistical use of prior information,”[i].
U-Phil (2/10/14): In section 3 Gelman comments on some of David Cox’s remarks in a (highly informal and non-scripted) conversation we recorded:
“A Statistical Scientist Meets a Philosopher of Science: A Conversation between Sir David Cox and Deborah Mayo,” published in Rationality, Markets and Morals [iii] (Section 2 has some remarks on Larry Wasserman, by the way.)
Here’s the relevant portion of the conversation:
COX: Deborah, in some fields foundations do not seem very important, but we both think foundations of statistical inference are important; why do you think that is?
MAYO: I think because they ask about fundamental questions of evidence, inference, and probability. I don’t think that foundations of different fields are all alike; because in statistics we’re so intimately connected to the scientific interest in learning about the world, we invariably cross into philosophical questions about empirical knowledge and inductive inference.
COX: One aspect of it is that it forces us to say what it is that we really want to know when we analyze a situation statistically. Do we want to put in a lot of information external to the data, or as little as possible. It forces us to think about questions of that sort.
MAYO: But key questions, I think, are not so much a matter of putting in a lot or a little information. …What matters is the kind of information, and how to use it to learn. This gets to the question of how we manage to be so successful in learning about the world, despite knowledge gaps, uncertainties and errors. To me that’s one of the deepest questions and it’s the main one I care about. I don’t think a (deductive) Bayesian computation can adequately answer it.…..
COX: There’s a lot of talk about what used to be called inverse probability and is now called Bayesian theory. That represents at least two extremely different approaches. How do you see the two? Do you see them as part of a single whole? Or as very different?
I will post seminar slides here (they will generally be ragtag affairs), links to the papers are in the syllabus.
This follows up on yesterday’s deconstruction:
I’m happy to play devil’s advocate in commenting on Larry’s very interesting and provocative (in a good way) paper on ‘how recent developments in statistical modeling and inference have [a] changed the intended scope of data analysis, and [b] raised new foundational issues that rendered the ‘older’ foundational problems more or less irrelevant’.
The new intended scope, ‘low assumptions, high dimensions’, is delimited by three characteristics:
“1. The number of parameters is larger than the number of data points.
2. Data can be numbers, images, text, video, manifolds, geometric objects, etc.
3. The model is always wrong. We use models, and they lead to useful insights but the parameters in the model are not meaningful.” (p. 1)
In the discussion that follows I focus almost exclusively on the ‘low assumptions’ component of the new paradigm. The discussion by David F. Hendry (2011), “Empirical Economic Model Discovery and Theory Evaluation,” RMM, 2: 115-145, is particularly relevant to some of the issues raised by the ‘high dimensions’ component in a way that complements the discussion that follows.
My immediate reaction to the demarcation based on 1-3 is that the new intended scope, although interesting in itself, excludes the overwhelming majority of scientific fields where restriction 3 seems unduly limiting. In my own field of economics the substantive information comes primarily in the form of substantively specified mechanisms (structural models), accompanied with theory-restricted and substantively meaningful parameters.
In addition, I consider the assertion “the model is always wrong” an unhelpful truism when ‘wrong’ is used in the sense that “the model is not an exact picture of the ‘reality’ it aims to capture”. Worse, if ‘wrong’ refers to ‘the data in question could not have been generated by the assumed model’, then any inference based on such a model will be dubious at best!
Larry Wasserman (“Normal Deviate”) has announced he will stop blogging (for now at least). That means we’re losing one of the wisest blog-voices on issues relevant to statistical foundations (among many other areas in statistics). Whether this lures him back or reaffirms his decision to stay away, I thought I’d reblog my (2012) “deconstruction” of him (in relation to a paper linked below)[i]
Deconstructing Larry Wasserman [i] by D. Mayo
The temptation is strong, but I shall refrain from using the whole post to deconstruct Al Franken’s 2003 quip about media bias (from Lies and the Lying Liars Who Tell Them: A Fair and Balanced Look at the Right), with which Larry Wasserman begins his paper “Low Assumptions, High Dimensions” (2011) in his contribution to Rationality, Markets and Morals (RMM) Special Topic: Statistical Science and Philosophy of Science:
Wasserman: There is a joke about media bias from the comedian Al Franken:
‘To make the argument that the media has a left- or right-wing, or a liberal or a conservative bias, is like asking if the problem with Al-Qaeda is: do they use too much oil in their hummus?’
According to Wasserman, “a similar comment could be applied to the usual debates in the foundations of statistical inference.”
Although it’s not altogether clear what Wasserman means by his analogy with comedian (now senator) Franken, it’s clear enough what Franken meant if we follow up the quip with the next sentence in his text (which Wasserman omits): “The problem with al Qaeda is that they’re trying to kill us!” (p. 1). The rest of Franken’s opening chapter is not about al Qaeda but about bias in media. Conservatives, he says, decry what they claim is a liberal bias in mainstream media. Franken rejects their claim.
The mainstream media does not have a liberal bias. And for all their other biases . . . , the mainstream media . . . at least try to be fair. …There is, however, a right-wing media. . . . They are biased. And they have an agenda…The members of the right-wing media are not interested in conveying the truth… . They are an indispensable component of the right-wing machine that has taken over our country… . We have to be vigilant. And we have to be more than vigilant. We have to fight back… . Let’s call them what they are: liars. Lying, lying, liars. (Franken, pp. 3-4)
When I read this in 2004 (when Bush was in office), I couldn’t have agreed more. How things change*. Now, of course, any argument that swerves from the politically correct is by definition unsound, irrelevant, and/or biased. [ii]
But what does this have to do with Bayesian-frequentist foundations? What is Wasserman, deep down, really trying to tell us by way of this analogy (if only subliminally)? Such are my ponderings—and thus this deconstruction. (I will invite your “U-Phils” at the end[a].) I will allude to passages from my contribution to RMM (2011) http://www.rmm-journal.de/htdocs/st01.html (in red).
A. What Is the Foundational Issue?
Wasserman: To me, the most pressing foundational question is: how do we reconcile the two most powerful needs in modern statistics: the need to make methods assumption free and the need to make methods work in high dimensions… . The Bayes-Frequentist debate is not irrelevant but it is not as central as it once was. (p. 201)
One may wonder why he calls this a foundational issue, as opposed to, say, a technical one. I will assume he means what he says and attempt to extract his meaning by looking through a foundational lens.
A paper of mine on “double-counting” and novel evidence just came out: “Some surprising facts about (the problem of) surprising facts” in Studies in History and Philosophy of Science (2013), http://dx.doi.org/10.1016/j.shpsa.2013.10.005
ABSTRACT: A common intuition about evidence is that if data x have been used to construct a hypothesis H, then x should not be used again in support of H. It is no surprise that x fits H, if H was deliberately constructed to accord with x. The question of when and why we should avoid such ‘‘double-counting’’ continues to be debated in philosophy and statistics. It arises as a prohibition against data mining, hunting for significance, tuning on the signal, and ad hoc hypotheses, and as a preference for predesignated hypotheses and ‘‘surprising’’ predictions. I have argued that it is the severity or probativeness of the test—or lack of it—that should determine whether a double-use of data is admissible. I examine a number of surprising ambiguities and unexpected facts that continue to bedevil this debate.
Memory Lane: 2 years ago. Oxford Jail (also called Oxford Castle) is an entirely fitting place to be on (and around) Halloween! Moreover, rooting around this rather lavish set of jail cells (what used to be a single cell is now a dressing room) is every bit as conducive to philosophical reflection as is exile on Elba! (I’m serious, it is now a boutique hotel.) My goal (while in this gaol—as the English sometimes spell it) is to try and free us from the bogeymen and bogeywomen often associated with “classical” statistics. As a start, the very term “classical statistics” should I think be shelved, not that names should matter.
In appraising statistical accounts at the foundational level, we need to realize the extent to which accounts are viewed through the eyeholes of a mask or philosophical theory. Moreover, the mask some wear while pursuing this task might well be at odds with their ordinary way of looking at evidence, inference, and learning. In any event, to avoid question-begging criticisms, the standpoint from which the appraisal is launched must itself be independently defended. But for (most) Bayesian critics of error statistics, the assumption that uncertain inference demands a posterior probability for claims inferred is thought to be so obvious as not to require support. Critics are implicitly making assumptions that are at odds with the frequentist statistical philosophy. In particular, they assume a certain philosophy about statistical inference (probabilism), often coupled with the allegation that error statistical methods can only achieve radical behavioristic goals, wherein all that matters are long-run error rates (of some sort).
Criticisms then follow readily: the form of one or both:
What is key on the statistics side of this alternative philosophy is that the probabilities refer to the distribution of a statistic d(x)—the so-called sampling distribution. Hence such accounts are often called sampling theory accounts. Since the sampling distribution is the basis for error probabilities, another term might be error statistical.
A long-running research program in philosophy is to seek a quantitative measure C(h,x) to capture intuitive ideas about “confirmation” and about “confirmational relevance”. The components of C(h,x) are allowed to be any statements; no reference to a probability model or to joint distributions is required. Then h is “confirmed” or supported by x if P(h|x) > P(h), disconfirmed (or undermined) if P(h|x) < P(h), (else x is confirmationally irrelevant to h). This is the generally accepted view of philosophers of confirmation (or Bayesian formal epistemologists) up to the present. There is generally a background “k” included, but to avoid a blinding mass of symbols I omit it. (We are rarely told how to get the probabilities anyway; but I’m going to leave that to one side, as it will not really matter here.)
A test of any purported philosophical confirmation theory is whether it elucidates or is even in sync with intuitive methodological principles about evidence or testing. One of the first problems that arises stems from asking…
Is Probability a good measure of confirmation?
A natural move then would be to identify the degree of confirmation of h by x with probability P(h|x), (which philosophers sometimes write as P(h,x)). Statement x affords hypothesis h higher confirmation than it does h’ iff P(h|x) > P(h’|x).
Some puzzles immediately arise. Hypothesis h can be confirmed by x, while h’ disconfirmed by x, and yet P(h|x) < P(h’|x). In other words, we can have P(h|x) > P(h) and P(h’|x) < P(h’) and yet P(h|x) < P(h’|x).
Popper (The Logic of Scientific Discovery, 1959, 390) gives this example, (I quote from him, only changing symbols slightly):
Consider the next toss with a homogeneous die.
h: 6 will turn up
h’: 6 will not turn up
x: an even number will turn up.
P(h) = 1/6, P(h’) = 5/6, P(x) = 1/2
The probability of h is raised by information x, while h’ is undermined by x. (Its probability goes from 5/6 to 4/6.) If we identify probability with degree of confirmation, x confirms h and disconfirms h’ (i.e., P(h|x) > P(h) and P(h’|x) < P(h’)). Yet because P(h|x) < P(h’|x), h is less well confirmed given x than is h’. (This happens because P(h) is sufficiently low.) So P(h|x) cannot just be identified with the degree of confirmation that x affords h.
Note, these are not real statistical hypotheses but statements of events.
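Popper's numbers are easy to check by brute-force counting. Here is a minimal sketch, assuming a fair die with six equally likely outcomes; the helper `prob` and the lambda names are mine, introduced only for illustration.

```python
from fractions import Fraction

# Sample space: one toss of a fair six-sided die.
OUTCOMES = range(1, 7)

def prob(event, given=lambda o: True):
    """P(event | given), computed by counting equally likely outcomes."""
    cond = [o for o in OUTCOMES if given(o)]
    return Fraction(sum(1 for o in cond if event(o)), len(cond))

h = lambda o: o == 6          # h: 6 will turn up
h_prime = lambda o: o != 6    # h': 6 will not turn up
x = lambda o: o % 2 == 0      # x: an even number will turn up

print(prob(h), prob(h, x))              # 1/6 1/3
print(prob(h_prime), prob(h_prime, x))  # 5/6 2/3
```

So x raises h from 1/6 to 1/3 and lowers h’ from 5/6 to 2/3, yet h’ still ends up with the higher posterior.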
Obviously there needs to be a way to distinguish between some absolute confirmation for h, and a relative measure of how much it has increased due to x. From the start, Rudolf Carnap noted that “the verb ‘to confirm’ is ambiguous” but thought it had “the connotation of ‘making firmer’ even more often than that of ‘making firm’.” (Carnap, Logical Foundations of Probability (2nd ed.), xviii). x can increase the firmness of h, and yet C(h,x) < C(~h,x) (h is less firm, given x, than is ~h). As with Carnap, it’s the ‘making firmer’ sense that is generally assumed in Bayesian confirmation theory.
But there are many different measures of making firmer (Popper, Carnap, Fitelson). Referring to Popper’s example, we can report the ratio R: P(h|x)/P(h) = 2.
(In this case h’ = ~h).
Or we use the likelihood ratio LR: P(x|h)/P(x|~h) = (1/.4) = 2.5.
Many other ways of measuring the increase in confirmation x affords h could do as well. But what shall we say about the numbers like 2, 2.5? Do they mean the same thing in different contexts? What happens if we get beyond toy examples to scientific hypotheses where ~h would allude to all possible theories not yet thought of. What’s P(x|~h) where ~h is “the catchall” hypothesis asserting “something else”? (see, for example, Mayo 1997)
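For anyone who wants to verify the 2 and the 2.5 in Popper's die example, here is a small sketch using exact fractions. The variable names are mine; the probabilities are those of a fair die, with h as "a 6 turns up" and x as "an even number turns up".

```python
from fractions import Fraction

p_h = Fraction(1, 6)               # P(h): a 6 turns up
p_x = Fraction(1, 2)               # P(x): an even number turns up
p_x_given_h = Fraction(1)          # a 6 is certainly even
p_x_given_not_h = Fraction(2, 5)   # of the outcomes 1..5, two are even

p_h_given_x = p_x_given_h * p_h / p_x              # Bayes' theorem
ratio_R = p_h_given_x / p_h                        # R: P(h|x)/P(h)
likelihood_ratio = p_x_given_h / p_x_given_not_h   # LR: P(x|h)/P(x|~h)

print(ratio_R, likelihood_ratio)  # 2 5/2
```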
Perhaps this point won’t prevent confirmation logics from accomplishing the role of capturing and justifying intuitions about confirmation. So let’s consider the value of confirmation theories to that role. One of the early leaders of philosophical Bayesian confirmation, Peter Achinstein (2001), began to have doubts about the value of the philosopher’s a priori project. He even claims, rather provocatively, that “scientists do not and should not take … philosophical accounts of evidence seriously” (p. 9) because they give us formal syntactical (context-free) measures; whereas, scientists look to empirical grounds for confirmation. Philosophical accounts, moreover, make it too easy to confirm. He rejects confirmation as increased firmness, denying it is either necessary or sufficient for evidence. As far as making it too easy to get confirmation, there is the classic problem: it appears we can get everything to confirm everything, so long as one thing is confirmed. This is a famous argument due to Glymour (1980).
Paradox of irrelevant conjunctions
We now switch to emphasizing that the hypotheses may be statistical hypotheses or substantive theories. Both for this reason and because I think they look better, I move away from Popper and Carnap’s lower case letters for hypotheses.
The problem of irrelevant conjunctions (the “tacking paradox”) is this: If x confirms H, then x also confirms (H & J), even if hypothesis J is just “tacked on” to H. As with most of these chestnuts, there is a long history (e.g., Earman 1992, Rosenkrantz 1977), but consider just a leading contemporary representative, Branden Fitelson. Fitelson has importantly emphasized how many different C functions there are for capturing “makes firm”. Fitelson defines:
J is an irrelevant conjunct to H, with respect to x just in case P(x|H) = P(x|J & H).
For instance, x might be radioastronomic data in support of:
H: the deflection of light effect (due to gravity) is as stipulated in the General Theory of Relativity (GTR), 1.75” at the limb of the sun.
and the irrelevant conjunct:
J: the radioactivity of the Fukushima water being dumped in the Pacific ocean is within acceptable levels.
(1) Bayesian (Confirmation) Conjunction: If x Bayesian confirms H, then x Bayesian-confirms (H & J), where P(x| H & J ) = P(x|H) for any J consistent with H.
The reasoning is as follows:
P(x|H) /P(x) > 1 (x Bayesian confirms H)
P(x|H & J) = P(x|H) (given)
So [P(x|H & J) /P(x)]> 1
Therefore x Bayesian confirms (H & J)
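To see the tacking step at work numerically, here is a toy sketch. All the numbers are hypothetical, chosen by me only for illustration: P(H) = 0.3, P(J) = 0.6, P(x|H) = 0.9, P(x|~H) = 0.2, with J probabilistically independent of both H and x so that P(x|H & J) = P(x|H).

```python
from fractions import Fraction as F
from itertools import product

# Hypothetical numbers, for illustration only.
P_H = F(3, 10)
P_J = F(6, 10)                                   # J is independent of H and of x
P_X_GIVEN_H = {True: F(9, 10), False: F(2, 10)}  # P(x|H), P(x|~H)

# Joint distribution over the truth values of (H, J, x).
joint = {}
for H, J, x in product([True, False], repeat=3):
    p = (P_H if H else 1 - P_H) * (P_J if J else 1 - P_J)
    p *= P_X_GIVEN_H[H] if x else 1 - P_X_GIVEN_H[H]
    joint[(H, J, x)] = p

def prob(event, given=lambda w: True):
    """P(event | given) over the joint distribution."""
    den = sum(p for w, p in joint.items() if given(w))
    num = sum(p for w, p in joint.items() if given(w) and event(w))
    return num / den

is_H = lambda w: w[0]
is_HJ = lambda w: w[0] and w[1]
is_x = lambda w: w[2]

print(prob(is_x, is_H) / prob(is_x))    # 90/41, greater than 1
print(prob(is_x, is_HJ) / prob(is_x))   # 90/41, the same ratio
print(prob(is_HJ, is_x) > prob(is_HJ))  # True: x "confirms" (H & J)
```

The irrelevant J rides along for free: the ratio P(x|H & J)/P(x) is identical to P(x|H)/P(x).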
However, it is also plausible to hold :
(2) Entailment condition: If x confirms T, and T entails J, then x confirms J.
In particular, if x confirms (H & J), then x confirms J.
(3) From (1) and (2) , if x confirms H, then x confirms J for any irrelevant J consistent with H.
(Assume neither H nor J have probabilities 0 or 1).
It follows that if x confirms any H, then x confirms any J.
Branden Fitelson’s solution
Fitelson (2002), and Fitelson and Hawthorne (2004), offer this “solution”: Fitelson will allow that x confirms (H & J), but deny the entailment condition. So, in particular, x confirms the conjunction although x does not confirm the irrelevant conjunct. Moreover, Fitelson shows, even though (H & J) is confirmed by x, (H & J) gets less of a confirmation (firmness) boost than does H—so long as one doesn’t measure the confirmation boost using R: P(H|x)/P(H). If one does use R, then (H & J) is just as well confirmed as is H, which is disturbing.
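A quick numerical sketch of the contrast, taking R as the ratio measure P(H|x)/P(H) and LR as P(x|H)/P(x|~H). The probabilities are hypothetical values of my own choosing (P(H) = 0.3, P(J) = 0.6, P(x|H) = 0.9, P(x|~H) = 0.2), and J is assumed independent of H and x.

```python
from fractions import Fraction as F

# Hypothetical inputs, for illustration only.
p_H, p_J = F(3, 10), F(6, 10)
a, b = F(9, 10), F(2, 10)        # a = P(x|H) = P(x|H & J), b = P(x|~H)

p_x = p_H * a + (1 - p_H) * b    # total probability of x
p_HJ = p_H * p_J                 # P(H & J), by independence

# Ratio measure: P(H|x)/P(H) equals P(x|H)/P(x) by Bayes' theorem.
R_H = a / p_x
R_HJ = a / p_x                   # uses P(x|H & J) = a

# Likelihood-ratio measure.
p_x_given_not_HJ = (p_x - p_HJ * a) / (1 - p_HJ)   # P(x | ~(H & J))
LR_H = a / b
LR_HJ = a / p_x_given_not_HJ

print(R_HJ == R_H)        # True: R gives the conjunction the same boost
print(LR_HJ < LR_H)       # True: LR gives the conjunction a smaller boost
print(float(LR_H), round(float(LR_HJ), 3))
```

On these made-up numbers the LR boost drops from 4.5 for H to roughly 3 for (H & J), while R cannot tell them apart.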
But even if we use the LR as our firmness boost, I would agree with Glymour that the solution scarcely solves the real problem. Paraphrasing him, we would not be assured by an account that tells us deflection of light data (x) confirms both GTR (H) and the radioactivity of the Fukushima water is within acceptable levels (J), while assuring us that x does not confirm the Fukushima water having acceptable levels of radiation (31).
The tacking paradox is to be expected if confirmation is taken as a variation on probabilistic affirming the consequent. Hypothetico-deductivists had the same problem, which is why Popper said we need to supplement each of the measures of confirmation boost with the condition of “severity”. However, he was unable to characterize severity adequately, and ultimately denied it could be formalized. He left it as an intuitive requirement that before applying any C-function, the confirming evidence must be the result of “a sincere (and ingenious) attempt to falsify the hypothesis” in question. I try to supply a more adequate account of severity (e.g., Mayo 1996, 2/3/12 post (no-pain philosophy III)).
How would the tacking method fare on the severity account? We’re not given the details we’d want for an error statistical appraisal, but let’s do the best with their stipulations. From our necessary condition, we have that (H and J) cannot warrant taking x as evidence for (H and J) if x counts as a highly insevere test of (H and J). The “test process” with tacking is something like this: having confirmed H, tack on any consistent but irrelevant J to obtain (H & J).(Sentence was amended on 10/21/13)
A scrutiny of well-testedness may proceed by denying either condition for severity. To follow the confirmation theorists, let’s grant the fit requirement (since H fits or entails x). This does not constitute having done anything to detect the falsity of H & J. The conjunction has been subjected to a radically non-risky test. (See also 1/2/13 post, esp. 5.3.4 Tacking Paradox Scotched.)
What they call confirmation we call mere “fit”
In fact, all their measures of confirmation C, be it the ratio measure R: P(H|x)/P(H) or the (so-called) likelihood ratio LR: P(x|H)/P(x|~H), or one of the others, count merely as “fit” or “accordance” measures to the error statistician. There is no problem allowing each to be relevant for different problems and different dimensions of evidence. What we need to add in each case are the associated error probabilities:
P([H & J] is Bayesian confirmed; ~(J&H)) = maximal, so x is “bad evidence, no test” (BENT) for the conjunction.
We read “;” as “under the assumption that”.
In fact, all their measures of confirmation C are mere “fit” measures, be it the ratio measure R: P(H|x)/P(H) or the LR or other.
The following was added on 10-21-13: The above probability stems from taking the “fit measure” as a statistic, and assessing error probabilities by taking account of the test process, as in error statistics. The result is
SEV[(H & J), tacking test, x] is minimal
I have still further problems with these inductive logic paradigms: an adequate philosophical account should answer questions and explicate principles about the methodology of scientific inference. Yet the Bayesian inductivist starts out assuming the intuition or principle, the task then being the homework problem of assigning priors and likelihoods that mesh with the principles. This often demands beating a Bayesian analysis into line, while still not getting at its genuine rationale. “The idea of putting probabilities over hypotheses delivered to philosophy a godsend, and an entire package of superficiality.” (Glymour 2010, 334). Perhaps philosophers are moving away from analytic reconstructions. Enough tears have been shed. But does an analogous problem crop up in Bayesian logic more generally?
I may update this post, and if I do I will alter the number following the title.
Oct. 20, 2013: I am updating this to reflect corrections pointed out by James Hawthorne, for which I’m very grateful. I will call this draft (ii).
Oct. 21, 2013 (updated in blue). I think another sentence might have accidentally got moved around.
Oct. 23, 2013. Given some issues that cropped up in the discussion (and the fact that certain symbols didn’t always come out right in the comments), I’m placing the point below in a note:
 I say “so-called” because there’s no requirement of a proper statistical model here.
 Can P = C?
Spoze there’s a case where z confirms hh’ more than z confirms h’: C(hh’,z) > C(h’,z)
Now h’ = (~hh’ or hh’)
(i) C(hh’,z) > C(~hh’ or hh’,z)
Since ~hh’ and hh’ are mutually exclusive, we have from special addition rule
(ii) P(hh’,z) < P(~hh’ or hh’,z)
So if P = C, (i) and (ii) yield a contradiction.
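A concrete case may help. Here is a sketch with a fair die, where I take C to be the difference measure P(·|z) − P(·); that choice is my own assumption, just one of the many "making firmer" measures. Let h’ be "even", hh’ be "a 6", and z be "at least 3".

```python
from fractions import Fraction

OUTCOMES = range(1, 7)   # one toss of a fair die

def prob(event, given=lambda o: True):
    """P(event | given), computed by counting equally likely outcomes."""
    cond = [o for o in OUTCOMES if given(o)]
    return Fraction(sum(1 for o in cond if event(o)), len(cond))

def boost(event, evidence):
    """Difference measure P(event|evidence) - P(event): an assumed choice of C."""
    return prob(event, evidence) - prob(event)

h_prime = lambda o: o % 2 == 0   # h': the toss is even
hh_prime = lambda o: o == 6      # h & h': the toss is a 6 (hence even)
z = lambda o: o >= 3             # z: the toss is at least 3

print(boost(hh_prime, z), boost(h_prime, z))  # 1/12 0
print(prob(hh_prime, z), prob(h_prime, z))    # 1/4 1/2
```

So z boosts the conjunction more than the conjunct, as in (i), while the posterior of the conjunction is lower, as in (ii). A boost measure and a posterior probability cannot be the same function.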
Achinstein, P. (2001). The Book of Evidence. Oxford: Oxford University Press.
Carnap, R. (1962). Logical Foundations of Probability. Chicago: University of Chicago Press.
Earman, J. (1992). Bayes or Bust? A Critical Examination of Bayesian Confirmation Theory. Cambridge, MA: MIT Press.
Fitelson, B. (2002). Putting the Irrelevance Back Into the Problem of Irrelevant Conjunction. Philosophy of Science 69(4), 611–622.
Fitelson, B. & Hawthorne, J. (2004). Re-Solving Irrelevant Conjunction with Probabilistic Independence, Philosophy of Science, 71: 505–514.
Glymour, C. (1980). Theory and Evidence. Princeton: Princeton University Press.
_____. (2010). Explanation and Truth. In D. G. Mayo & A. Spanos (Eds.), Error and Inference: Recent Exchanges on Experimental Reasoning, Reliability, and the Objectivity and Rationality of Science, 305–314. Cambridge: Cambridge University Press.
Mayo, D. (1996). Error and the Growth of Experimental Knowledge. Chicago: University of Chicago Press.
_____. (1997). “Duhem’s Problem, The Bayesian Way, and Error Statistics, or ‘What’s Belief got To Do With It?’” and “Response to Howson and Laudan,” Philosophy of Science 64(1): 222-244 and 323-333.
_____. (2010). Explanation and Testing Exchanges with Clark Glymour. In D. G. Mayo & A. Spanos (Eds.), Error and Inference: Recent Exchanges on Experimental Reasoning, Reliability, and the Objectivity and Rationality of Science, 305–314. Cambridge: Cambridge University Press.
Popper, K. (1959). The Logic of Scientific Discovery. New York: Basic Books.
Rosenkrantz, R. (1977). Inference, Method and Decision: Towards a Bayesian Philosophy of Science. Dordrecht, The Netherlands: D. Reidel.
There are differences between Bayesian posterior probabilities and formal error statistical measures, as well as between the latter and a severity (SEV) assessment, which differs from the standard type 1 and 2 error probabilities, p-values, and confidence levels—despite the numerical relationships. Here are some random thoughts that will hopefully be relevant for both types of differences. (Please search this blog for specifics.)
1. The most noteworthy difference is that error statistical inference makes use of outcomes other than the one observed, even after the data are available: there’s no other way to ask things like, how often would you find 1 nominally statistically significant difference in a hunting expedition over k or more factors? Or to distinguish optional stopping with sequential trials from fixed sample size experiments. Here’s a quote I came across just yesterday:
“[S]topping ‘when the data looks good’ can be a serious error when combined with frequentist measures of evidence. For instance, if one used the stopping rule [above]…but analyzed the data as if a fixed sample had been taken, one could guarantee arbitrarily strong frequentist ‘significance’ against H0.” (Berger and Wolpert, 1988, 77).
The worry about being guaranteed to erroneously exclude the true parameter value here is an error statistical affliction that the Bayesian is spared (even though I don’t think they can be too happy about it, especially when HPD intervals are assured of excluding the true parameter value.) See this post for an amusing note; Mayo and Kruse (2001) below; and, if interested, search the (strong) likelihood principle, and Birnbaum.
2. Highly probable vs. highly probed. SEV doesn’t obey the probability calculus: for any test T and outcome x, the severity for both H and ~H might be horribly low. Moreover, an error statistical analysis is not in the business of probabilifying hypotheses but evaluating and controlling the capabilities of methods to discern inferential flaws (problems with linking statistical and scientific claims, problems of interpreting statistical tests and estimates, and problems of underlying model assumptions). This is the basis for applying what may be called the Severity principle.
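The optional stopping point in item 1 can be illustrated with a rough simulation. This is only a sketch under assumptions of my own: standard normal data with known standard deviation, a true null of mean 0, a nominal two-sided 0.05 cutoff of 1.96, and a cap of 100 observations. Because of the cap, it shows an inflated error rate rather than the guaranteed rejection that Berger and Wolpert describe for unbounded sampling.

```python
import math
import random

def rejects(max_n, peek, rng, z_crit=1.96):
    """Simulate one study under H0 (standard normal data, known sd = 1).

    If peek is True, test after every observation and stop at the first
    nominally significant result; otherwise test once, at n = max_n.
    """
    total = 0.0
    for n in range(1, max_n + 1):
        total += rng.gauss(0.0, 1.0)
        if peek and abs(total) / math.sqrt(n) > z_crit:
            return True
    return abs(total) / math.sqrt(max_n) > z_crit

rng = random.Random(0)   # fixed seed so the run is reproducible
trials = 2000
fixed_rate = sum(rejects(100, False, rng) for _ in range(trials)) / trials
peek_rate = sum(rejects(100, True, rng) for _ in range(trials)) / trials
print(f"fixed n: {fixed_rate:.3f}  optional stopping: {peek_rate:.3f}")
```

The fixed-sample rate should sit near the nominal 0.05, while the try-and-try-again rate comes out several times larger. An account that ignores outcomes other than the one observed cannot register this difference.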
Reblog (year ago) : G.A. Barnard’s birthday is today, so here’s a snippet of his discussion with Savage (1962) (link below [i]) that connects to some earlier issues: stopping rules, likelihood principle, and background information here and here (at least of one type). (A few other Barnard links on this blog are below* .) Happy Birthday George!
Barnard: I have been made to think further about this issue of the stopping rule since I first suggested that the stopping rule was irrelevant (Barnard 1947a,b). This conclusion does not follow only from the subjective theory of probability; it seems to me that the stopping rule is irrelevant in certain circumstances. Since 1947 I have had the great benefit of a long correspondence—not many letters because they were not very frequent, but it went on over a long time—with Professor Bartlett, as a result of which I am considerably clearer than I was before. My feeling is that, as I indicated [on p. 42], we meet with two sorts of situation in applying statistics to data. One is where we want to have a single hypothesis with which to confront the data. Do they agree with this hypothesis or do they not? Now in that situation you cannot apply Bayes’s theorem because you have not got any alternatives to think about and specify—not yet. I do not say they are not specifiable—they are not specified yet. And in that situation it seems to me the stopping rule is relevant.
In particular, suppose somebody sets out to demonstrate the existence of extrasensory perception and says ‘I am going to go on until I get a one in ten thousand significance level’. Knowing that this is what he is setting out to do would lead you to adopt a different test criterion. What you would look at would not be the ratio of successes obtained, but how long it took him to obtain it. And you would have a very simple test of significance which said if it took you so long to achieve this increase in the score above the chance fraction, this is not at all strong evidence for E.S.P., it is very weak evidence. And the reversing of the choice of test criteria would I think overcome the difficulty.
This is the answer to the point Professor Savage makes; he says why use one method when you have vague knowledge, when you would use a quite different method when you have precise knowledge. It seems to me the answer is that you would use one method when you have precisely determined alternatives, with which you want to compare a given hypothesis, and you use another method when you do not have these alternatives.
Savage: May I digress to say publicly that I learned the stopping-rule principle from Professor Barnard, in conversation in the summer of 1952. Frankly I then thought it a scandal that anyone in the profession could advance an idea so patently wrong, even as today I can scarcely believe that some people resist an idea so patently right. I am particularly surprised to hear Professor Barnard say today that the stopping rule is irrelevant in certain circumstances only, for the argument he first gave in favour of the principle seems quite unaffected by the distinctions just discussed. The argument then was this: The design of a sequential experiment is, in the last analysis, what the experimenter actually intended to do. His intention is locked up inside his head and cannot be known to those who have to judge the experiment. Never having been comfortable with that argument, I am not advancing it myself. But if Professor Barnard still accepts it, how can he conclude that the stopping-rule principle is only sometimes valid? (emphasis added)
Philosopher of Science
University of Pittsburgh
Genuine philosophical problems are always rooted in urgent problems outside philosophy,
and they die if these roots decay
Karl Popper (1963, 72)
My concern in this post is how we philosophers can use our skills to do work that matters to people both inside and outside of philosophy.
Philosophers are highly skilled at conceptual analysis, in which one takes an interesting but unclear concept and attempts to state precisely when it applies and when it doesn’t.
What is the point of this activity? In many cases, this question has no satisfactory answer. Conceptual analysis becomes an end in itself, and philosophical debates become fruitless arguments about words. The pleasure we philosophers take in such arguments hardly warrants scarce government and university resources. It does provide good training in critical thinking, but so do many other activities that are also immediately useful, such as doing science and programming computers.
Conceptual analysis does not have to be pointless. It is often prompted by a real-world problem. In Plato’s Euthyphro, for instance, the character Euthyphro thought that piety required him to prosecute his father for murder. His family thought on the contrary that for a son to prosecute his own father was the height of impiety. In this situation, the question “what is piety?” took on great urgency. It also had great urgency for Socrates, who was awaiting trial for corrupting the youth of Athens.
In general, conceptual analysis often begins as a response to some question about how we ought to regulate our beliefs or actions. It can be a fruitful activity as long as the questions that prompted it are kept in view. It tends to degenerate into merely verbal disputes when it becomes an end in itself.
The kind of goal-oriented view of conceptual analysis I aim to articulate and promote is not teleosemantics: it is a view about how philosophy should be done rather than a theory of meaning. It is consistent with Carnap’s notion of explication (one of the desiderata of which is fruitfulness) (Carnap 1963, 5), but in practice Carnapian explication seems to devolve into idle word games just as easily as conceptual analysis. Our overriding goal should not be fidelity to intuitions, precision, or systematicity, but usefulness.
How I Became Suspicious of Conceptual Analysis
When I began working on proofs of the Likelihood Principle, I assumed that following my intuitions about the concept of “evidential equivalence” would lead to insights about how science should be done. Birnbaum’s proof showed me that my intuitions entail the Likelihood Principle, which frequentist methods violate.
Voila! Scientists shouldn’t use frequentist methods. All that remained to be done was to fortify Birnbaum’s proof, as I do in “A New Proof of the Likelihood Principle,” by defending it against objections and buttressing it with an alternative proof. [Editor: For a number of related materials on this blog see Mayo’s JSM presentation, and note [i].]
After working on this topic for some time, I realized that I was making simplistic assumptions about the relationship between conceptual intuitions and methodological norms. At most, a proof of the Likelihood Principle can show you that frequentist methods run contrary to your intuitions about evidential equivalence. Even if those intuitions are true, it does not follow immediately that scientists should not use frequentist methods. The ultimate aim of science, presumably, is not to respect evidential equivalence but (roughly) to learn about the world and make it better. The demand that scientists use methods that respect evidential equivalence is warranted only insofar as it is conducive to achieving those ends. Birnbaum’s proof says nothing about that issue.
How to Do Conceptual Analysis Teleologically
This is not to say that my work on the Likelihood Principle or conceptual analysis in general is without value. But it is nothing more than a kind of careful lexicography. This kind of work is potentially useful for clarifying normative claims with the aim of assessing and possibly implementing them. To do work that matters, philosophers engaged in conceptual analysis need to take enough interest in the assessment and implementation stages to do their conceptual analysis with the relevant normative claims in mind.
So what does this kind of teleological (goal-oriented) conceptual analysis look like?
It can involve personally following through on the process of assessing and implementing the relevant norms. For example, philosophers at Carnegie Mellon University working on causation have not only provided a kind of analysis of the concept of causation but also developed algorithms for causal discovery, proved theorems about those algorithms, and applied those algorithms to contemporary scientific problems (see e.g. Spirtes et al. 2000).
I have great respect for this work. But doing conceptual analysis does not have to mean going so far outside the traditional bounds of philosophy. A perfect example is James Woodward’s related work on causal explanation, which he describes as follows (2003, 7-8, original emphasis):
My project…makes recommendations about what one ought to mean by various causal and explanatory claims, rather than just attempting to describe how we use those claims. It recognizes that causal and explanatory claims sometimes are confused, unclear, and ambiguous and suggests how those limitations might be addressed…. we introduce concepts…and characterize them in certain ways…because we want to do things with them…. Concepts can be well or badly designed for such purposes, and we can evaluate them accordingly.
Woodward keeps his eye on what the notion of causation is for, namely distinguishing between relationships that do and relationships that do not remain invariant under interventions. This distinction is enormously important because only relationships that remain invariant under interventions provide “handles” we can use to change the world.
Here are some lessons about teleological conceptual analysis that we can take from Woodward’s work. (I’m sure this list could be expanded.)
Philosophers rarely succeed in capturing all of our intuitions about an important informal concept. Even if they did succeed, they would have more work to do in justifying any norms that invoke that concept. Conceptual analysis can be a first step toward doing philosophy that matters, but it needs to be undertaken with the relevant normative claims in mind.
Question: What are your best examples of philosophy that matters? What can we learn from them?
Some related papers:
Today is Egon Pearson’s birthday (11 Aug., 1895-12 June, 1980); and here you see my scruffy sketch of him, at the start of my book, “Error and the Growth of Experimental Knowledge” (EGEK 1996). As Erich Lehmann put it in his EGEK review, Pearson is “the hero of Mayo’s story” because I found in his work, if only in brief discussions, hints, and examples, the key elements for an “inferential” or “evidential” interpretation of Neyman-Pearson theory of statistics. “Pearson and Pearson” statistics (both Egon, not Karl) would have looked very different from Neyman and Pearson statistics, I suspect. One of the few sources of E.S. Pearson’s statistical philosophy is his (1955) “Statistical Concepts in Their Relation to Reality”. It begins like this:
Controversies in the field of mathematical statistics seem largely to have arisen because statisticians have been unable to agree upon how theory is to provide, in terms of probability statements, the numerical measures most helpful to those who have to draw conclusions from observational data. We are concerned here with the ways in which mathematical theory may be put, as it were, into gear with the common processes of rational thought, and there seems no reason to suppose that there is one best way in which this can be done. If, therefore, Sir Ronald Fisher recapitulates and enlarges on his views upon statistical methods and scientific induction we can all only be grateful, but when he takes this opportunity to criticize the work of others through misapprehension of their views as he has done in his recent contribution to this Journal (Fisher 1955 “Statistical Methods and Scientific Induction”), it is impossible to leave him altogether unanswered.
In the first place it seems unfortunate that much of Fisher’s criticism of Neyman and Pearson’s approach to the testing of statistical hypotheses should be built upon a “penetrating observation” ascribed to Professor G.A. Barnard, the assumption involved in which happens to be historically incorrect. There was no question of a difference in point of view having “originated” when Neyman “reinterpreted” Fisher’s early work on tests of significance “in terms of that technological and commercial apparatus which is known as an acceptance procedure”. There was no sudden descent upon British soil of Russian ideas regarding the function of science in relation to technology and to five-year plans. It was really much simpler–or worse. The original heresy, as we shall see, was a Pearson one!…
Indeed, to dispel the picture of the Russian technological bogey, I might recall how certain early ideas came into my head as I sat on a gate overlooking an experimental blackcurrant plot…!
To continue reading, “Statistical Concepts in Their Relation to Reality” click HERE.
See also Aris Spanos: “Egon Pearson’s Neglected Contributions to Statistics“.
Happy Birthday E.S. Pearson!
A low-powered statistical analysis of this blog—nearing its 2-year anniversary!—reveals that the topic to crop up most often—either front and center, or lurking in the bushes—is that of “background information”. The following was one of my early posts, back on Oct. 30, 2011:
October 30, 2011 (London). Increasingly, I am discovering that one of the biggest sources of confusion about the foundations of statistics has to do with what it means or should mean to use “background knowledge” and “judgment” in making statistical and scientific inferences. David Cox and I address this in our “Conversation” in RMM (2011); it is one of the three or four topics in that special volume that I am keen to take up.
Insofar as humans conduct science and draw inferences, and insofar as learning about the world is not reducible to a priori deductions, it is obvious that “human judgments” are involved. True enough, but too trivial an observation to help us distinguish among the very different ways judgments should enter according to contrasting inferential accounts. When Bayesians claim that frequentists do not use or are barred from using background information, what they really mean is that frequentists do not use prior probabilities of hypotheses, at least when those hypotheses are regarded as correct or incorrect, if only approximately. So, for example, we would not assign relative frequencies to the truth of hypotheses such as (1) prion transmission is via protein folding without nucleic acid, or (2) the deflection of light is approximately 1.75” (as if, as Peirce puts it, “universes were as plenty as blackberries”). How odd it would be to try to model these hypotheses as themselves having distributions: to us, statistical hypotheses assign probabilities to outcomes or values of a random variable.
However, quite a lot of background information goes into designing, carrying out, and analyzing inquiries into hypotheses regarded as correct or incorrect. For a frequentist, that is where background knowledge enters. There is no reason to suppose that the background required in order sensibly to generate, interpret, and draw inferences about H should—or even can—enter through prior probabilities for H itself! Of course, presumably, Bayesians also require background information in order to determine that “data x” have been observed, to determine how to model and conduct the inquiry, and to check the adequacy of statistical models for the purposes of the inquiry. So the Bayesian prior only purports to add some other kind of judgment, about the degree of belief in H. It does not get away from the other background judgments that frequentists employ.
This relates to a second point that came up in our conversation when Cox asked, “Do we want to put in a lot of information external to the data, or as little as possible?”
Today is (statistician) Allan Birnbaum’s birthday. He lived to be only 53 [i]. From the perspective of philosophy of statistics and philosophy of science, Birnbaum is best known for his work on likelihood, the Likelihood Principle [ii], and for his attempts to blend concepts of likelihood with error probability ideas to obtain what he called “concepts of statistical evidence”. Failing to find adequate concepts of statistical evidence, Birnbaum called for joining the work of “interested statisticians, scientific workers and philosophers and historians of science”–an idea I would heartily endorse! While known for attempts to argue that the (strong) Likelihood Principle followed from sufficiency and conditionality principles, a few years after publishing this result, he seems to have turned away from it, perhaps discovering gaps in his argument.
NATURE VOL. 225 MARCH 14, 1970 (1033)
LETTERS TO THE EDITOR
Statistical methods in Scientific Inference
It is regrettable that Edwards’s interesting article, supporting the likelihood and prior likelihood concepts, did not point out the specific criticisms of likelihood (and Bayesian) concepts that seem to dissuade most theoretical and applied statisticians from adopting them. As one whom Edwards particularly credits with having ‘analysed in depth…some attractive properties’ of the likelihood concept, I must point out that I am not now among the ‘modern exponents’ of the likelihood concept. Further, after suggesting that the notion of prior likelihood was plausible as an extension or analogue of the usual likelihood concept (ref. 2, p. 200), I have pursued the matter through further consideration and rejection of both the likelihood concept and various proposed formalizations of prior information and opinion (including prior likelihood). I regret not having expressed my developing views in any formal publication between 1962 and late 1969 (just after ref. 1 appeared). My present views have now, however, been published in an expository but critical article (ref. 3, see also ref. 4), and so my comments here will be restricted to several specific points that Edwards raised.
If there has been ‘one rock in a shifting scene’ of general statistical thinking and practice in recent decades, it has not been the likelihood concept, as Edwards suggests, but rather the concept by which confidence limits and hypothesis tests are usually interpreted, which we may call the confidence concept of statistical evidence. This concept is not part of the Neyman-Pearson theory of tests and confidence region estimation, which denies any role to concepts of statistical evidence, as Neyman consistently insists. The confidence concept takes from the Neyman-Pearson approach techniques for systematically appraising and bounding the probabilities (under respective hypotheses) of seriously misleading interpretations of data. (The absence of a comparable property in the likelihood and Bayesian approaches is widely regarded as a decisive inadequacy.) The confidence concept also incorporates important but limited aspects of the likelihood concept: the sufficiency concept, expressed in the general refusal to use randomized tests and confidence limits when they are recommended by the Neyman-Pearson approach; and some applications of the conditionality concept. It is remarkable that this concept, an incompletely formalized synthesis of ingredients borrowed from mutually incompatible theoretical approaches, is evidently useful continuously in much critically informed statistical thinking and practice [emphasis mine].
While inferences of many sorts are evident everywhere in scientific work, the existence of precise, general and accurate schemas of scientific inference remains a problem. Mendelian examples like those of Edwards and my 1969 paper seem particularly appropriate as case-study material for clarifying issues and facilitating effective communication among interested statisticians, scientific workers and philosophers and historians of science.
New York University
Courant Institute of Mathematical Sciences,
251 Mercer Street,
New York, NY 10012
Birnbaum’s confidence concept, sometimes written (Conf), was his attempt to find in error statistical ideas a concept of statistical evidence–a term that he invented and popularized. In Birnbaum 1977 (24), he states it as follows:
(Conf): A concept of statistical evidence is not plausible unless it finds ‘strong evidence for J as against H’ with small probability (α) when H is true, and with much larger probability (1 – β) when J is true.
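One way to write (Conf) symbolically, offered only as a sketch: the event notation Ev(J, H), read as “the method reports strong evidence for J as against H”, is an assumption introduced here and is not Birnbaum’s own notation. Reading his “with small probability” and “with much larger probability” as bounds rather than exact values is likewise an assumption.

```latex
\Pr\bigl(\mathrm{Ev}(J, H);\, H\bigr) \le \alpha,
\qquad
\Pr\bigl(\mathrm{Ev}(J, H);\, J\bigr) \ge 1 - \beta,
\qquad \text{with } \alpha \text{ small and } 1 - \beta \gg \alpha .
```

On this reading, α plays the role of a type I error probability and 1 – β the role of power, which is what ties (Conf) to error statistical ideas.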
Birnbaum questioned whether Neyman-Pearson methods had “concepts of evidence” simply because Neyman talked of “inductive behavior” and Wald and others couched statistical methods in decision-theoretic terms. I have been urging that we consider instead how the tools may actually be used, and not be restricted by the statistical philosophies of founders (not to mention that so many of their statements are tied up with personality disputes, and problems of “anger management”). Recall, as well, E. Pearson’s insistence on an evidential construal of N-P methods, and the fact that Neyman, in practice, spoke of drawing inferences and reaching conclusions (e.g., Neyman’s nursery posts, links in [iii] below).
“This will be my last post on the (irksome) Birnbaum argument!” she says with her fingers (or perhaps toes) crossed. But really, really it is (at least until midnight 2013). In fact the following brief remarks are all said, more clearly, in my (old) PAPER, new paper, Mayo 2010, Cox & Mayo 2011 (appendix), and in posts connected to this U-Phil: Blogging the likelihood principle, new summary 10/31/12*.
What’s the catch?
In my recent “Ton o’ Bricks” post, many readers were struck by the implausibility of letting the evidential interpretation of x’* be influenced by the properties of experiments known not to have produced x’*. Yet it is altogether common to be told that, should a sampling theorist try to block this, “unfortunately there is a catch” (Ghosh, Delampady, and Samanta 2006, 38): We would be forced to embrace the strong likelihood principle (SLP, or LP, for short), at least according to an infamous argument by Allan Birnbaum (who himself rejected the LP [i]).
It is not uncommon to see statistics texts argue that in frequentist theory one is faced with the following dilemma: either to deny the appropriateness of conditioning on the precision of the tool chosen by the toss of a coin, or else to embrace the strong likelihood principle, which entails that frequentist sampling distributions are irrelevant to inference once the data are obtained. This is a false dilemma. . . . The “dilemma” argument is therefore an illusion. (Cox and Mayo 2010, 298)
In my many detailed expositions, I have explained the source of the illusion and sleight of hand from a number of perspectives (I will not repeat references here). While I appreciate the care that Hennig and Gandenberger have taken in their U-Phils (and wish them all the luck in published outgrowths), it is clear to me that they are not hearing (or are unwittingly blocking) the scre-e-e-e-ching of the brakes!
No revolution, no breakthrough!
Berger and Wolpert, in their famous monograph The Likelihood Principle, identify the core issue:
The philosophical incompatibility of the LP and the frequentist viewpoint is clear, since the LP deals only with the observed x, while frequentist analyses involve averages over possible observations. . . . Enough direct conflicts have been . . . seen to justify viewing the LP as revolutionary from a frequentist perspective. (Berger and Wolpert 1988, 65-66)[ii]
If Birnbaum’s proof does not apply to a frequentist sampling theorist, then there is neither a revolution nor a breakthrough (as Savage called it). The SLP holds just for methodologies in which it holds . . . We are going in circles.
Block my counterexamples, please!
Since Birnbaum’s argument has stood for over fifty years, I’ve given it the maximal run for its money, and haven’t tried to block its premises, however questionable its key moves may appear. Despite such latitude, I’ve shown that the “proof” to the SLP conclusion will not wash, and I’m just a wee bit disappointed that Hennig and Gandenberger haven’t wrestled with my specific argument, or shown just where they think my debunking fails. What would this require?
Since the SLP is a universal generalization, it requires only a single counterexample to falsify it. In fact, every violation of the SLP within frequentist sampling theory, I show, is a counterexample to it! In other words, using the language from the definition of the SLP, the onus is on Birnbaum to show that, for any x’* that is a member of an SLP pair (E’, E”) with given, different probability models f’, f”, x’* and x”* should have the identical evidential import for an inference concerning parameter θ, on pain of facing “the catch” above, i.e., being forced to allow the import of data known to have come from E’ to be altered by unperformed experiments known not to have produced x’*.
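For readers who want a concrete SLP pair in hand, here is a minimal sketch using the standard textbook case of binomial versus negative binomial sampling. The specific numbers (9 heads and 3 tails, null value θ = 0.5, one-sided alternative θ > 0.5) and all function names are illustrative assumptions, and this is not necessarily one of the two violations discussed in this post. E’ fixes the number of tosses at 12; E” tosses until the 3rd tail.

```python
from math import comb

def binom_lik(theta, n=12, heads=9):
    """Likelihood of `heads` successes in a fixed number n of tosses (E')."""
    return comb(n, heads) * theta**heads * (1 - theta)**(n - heads)

def negbinom_lik(theta, tails=3, heads=9):
    """Likelihood of `heads` heads before the `tails`-th tail (E'')."""
    # The final toss must be the last tail, so only the first
    # heads + tails - 1 tosses can be arranged freely.
    return comb(heads + tails - 1, tails - 1) * theta**heads * (1 - theta)**tails

def binom_pvalue(n=12, heads=9, theta0=0.5):
    """P(X >= heads) under theta0 when n is fixed in advance."""
    return sum(comb(n, k) * theta0**k * (1 - theta0)**(n - k)
               for k in range(heads, n + 1))

def negbinom_pvalue(tails=3, heads=9, theta0=0.5):
    """P(Y >= heads) under theta0 when sampling stops at the `tails`-th tail."""
    below = sum(comb(y + tails - 1, tails - 1) * theta0**y * (1 - theta0)**tails
                for y in range(heads))
    return 1 - below

# The two likelihood functions differ only by a constant factor in theta ...
ratios = [binom_lik(t) / negbinom_lik(t) for t in (0.2, 0.5, 0.8)]
# ... yet the two sampling distributions give different tail areas.
p_binom = binom_pvalue()
p_negbinom = negbinom_pvalue()
print(ratios, p_binom, p_negbinom)
```

The likelihood ratio is the same constant at every θ, so the two outcomes form an SLP pair, while the tail areas are 299/4096 (about 0.073) and 67/2048 (about 0.033). That difference is what a sampling theorist means by saying the sampling distribution matters to the evidential import.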
If one is to release the brakes from my screeching halt, defenders of Birnbaum might try to show that the SLP counterexamples lead me to “the catch” as alleged. I have considered two well-known violations of the SLP. Can it be shown that a contradiction with the WCP or SP follows? I say no. Neither Hennig[ii] nor Gandenberger shows otherwise.
In my tracing out of Birnbaum’s arguments, I strived to assume that he would not be giving us circular arguments. To say that “I can prove that your methodology must obey the SLP,” and then to set out to do so by declaring “Hey Presto! Assume sampling distributions are irrelevant (once the data are in hand),” is a neat trick, but it assumes what it purports to prove. All other interpretations are shown to be unsound.
[i] Birnbaum himself, soon after presenting his result, rejected the SLP. As Birnbaum puts it, “the likelihood concept cannot be construed so as to allow useful appraisal, and thereby possible control, of probabilities of erroneous interpretations.” (Birnbaum 1969, p. 128.)
(We use LP and SLP synonymously here.)
[ii] Hennig initially concurred with me, but says a person convinced him to get back on the Birnbaum bus (even though Birnbaum got off it [i]).
Some other, related, posted discussions: Brakes on Breakthrough Part 1 (12/06/11) & Part 2 (12/07/11); Don’t Birnbaumize that experiment (12/08/12); Midnight with Birnbaum re-blog (12/31/12). The initial call to this U-Phil, the extension, details here, the post from my 28 Nov. seminar (LSE), and the original post by Gandenberger.
Birnbaum, A. (1962), “On the Foundations of Statistical Inference“, Journal of the American Statistical Association 57 (298), 269-306.
Savage, L. J., Barnard, G., Cornfield, J., Bross, I., Box, G., Good, I., Lindley, D., Clunies-Ross, C., Pratt, J., Levene, H., Goldman, T., Dempster, A., Kempthorne, O., and Birnbaum, A. (1962). On the foundations of statistical inference: “Discussion (of Birnbaum 1962)”, Journal of the American Statistical Association 57 (298), 307-326.
Birnbaum, A. (1970). Statistical Methods in Scientific Inference (letter to the editor). Nature 225, 1033.
Cox, D. R. and Mayo, D. (2010). “Objectivity and Conditionality in Frequentist Inference” in Error and Inference: Recent Exchanges on Experimental Reasoning, Reliability and the Objectivity and Rationality of Science (D. Mayo & A. Spanos eds.), CUP 276-304.
…and if that’s not enough, search this blog.
PhD student, History and Philosophy of Science
Master’s student, Statistics
University of Pittsburgh
In her 1996 Error and the Growth of Experimental Knowledge, Professor Mayo argued against the Likelihood Principle on the grounds that it does not allow one to control long-run error rates in the way that frequentist methods do. This argument seems to me the kind of response a frequentist should give to Birnbaum’s proof. It does not require arguing that Birnbaum’s proof is unsound: a frequentist can accommodate Birnbaum’s conclusion (two experimental outcomes are evidentially equivalent if they have the same likelihood function) by claiming that respecting evidential equivalence is less important than achieving certain goals for which frequentist methods are well suited.
More recently, Mayo has shown that Birnbaum’s premises cannot be reformulated as claims about what sampling distribution should be used for inference while retaining the soundness of his proof. It does not follow that Birnbaum’s proof is unsound, because Birnbaum’s original premises are not claims about what sampling distribution should be used for inference but are instead sufficient conditions for experimental outcomes to be evidentially equivalent.
Mayo acknowledges that the premises she uses in her argument against Birnbaum’s proof differ from Birnbaum’s original premises in a recent blog post in which she distinguishes between “the Sufficiency Principle (general)” and “the Sufficiency Principle applied in sampling theory.” One could make a similar distinction for the Weak Conditionality Principle. There is indeed no way to formulate Sufficiency and Weak Conditionality Principles “applied in sampling theory” that are consistent and imply the Likelihood Principle. This fact is not surprising: sampling theory is incompatible with the Likelihood Principle!
Birnbaum himself insisted that his premises were to be understood as “equivalence relations” rather than as “substitution rules” (i.e., rules about what sampling distribution should be used for inference) and recognized the fact that understanding them in this way was necessary for his proof. As he put it in his 1975 rejoinder to Kalbfleisch’s response to his proof, “It was the adoption of an unqualified equivalence formulation of conditionality, and related concepts, which led, in my 1972 paper, to the monster of the likelihood axiom” (263).
Because Mayo’s argument against Birnbaum’s proof requires reformulating Birnbaum’s premises, it is best understood as an argument not for the claim that Birnbaum’s original proof is invalid, but rather for the claim that Birnbaum’s proof is valid only when formulated in a way that is irrelevant to a sampling theorist. Reformulating Birnbaum’s premises as claims about what sampling distribution should be used for inference is the only way for a fully committed sampling theorist to understand them. Any other formulation of those premises is either false or question-begging.
Mayo’s argument makes good sense when understood in this way, but it requires a strong prior commitment to sampling theory. Whether various arguments for sampling theory such as those Mayo gives in Error and the Growth of Experimental Knowledge are sufficient to warrant such a commitment is a topic for another day. To those who lack such a commitment, Birnbaum’s original premises may seem quite compelling. Mayo has not refuted the widespread view that those premises do in fact entail the Likelihood Principle.
Mayo has objected to this line of argument by claiming that her reformulations of Birnbaum’s principles are just instantiations of Birnbaum’s principles in the context of frequentist methods. But they cannot be instantiations in a literal sense because they are imperatives, whereas Birnbaum’s original premises are declaratives. They are instead instructions that a frequentist would have to follow in order to avoid violating Birnbaum’s principles. The fact that one cannot follow them both is only an objection to Birnbaum’s principles on the question-begging assumption that evidential meaning depends on sampling distributions.
Department of Statistical Science
University College London
I was impressed by Mayo’s arguments in “Error and Inference” when I came across them for the first time. To some extent, I still am. However, I have also seen versions of Birnbaum’s theorem and proof presented in a mathematically sound fashion with which I as a mathematician had no issue.
After having discussed this a bit with Phil Dawid, and having thought and read more on the issue, my conclusion is that
1) Birnbaum’s theorem and proof are correct (apart from small mathematical issues resolved later in the literature), and they are not vacuous (i.e., there are evidence functions that fulfill them without any contradiction in the premises),
2) however, Mayo’s arguments actually do raise an important problem with Birnbaum’s reasoning.
Here is why. Note that Mayo’s arguments are based on the implicit (error statistical) assumption that the sampling distribution of an inference method is relevant. In that case, application of the sufficiency principle to Birnbaum’s mixture distribution enforces the use of the sampling distribution under the mixture distribution as it is, whereas application of the conditionality principle enforces the use of the sampling distribution under the experiment that actually produced the data, which is different in the usual examples. So the problem is not that Birnbaum’s proof is wrong, but that enforcing both principles at the same time in the mixture experiment is in contradiction to the relevance of the sampling distribution (and therefore to error statistical inference). It is a case in which the sufficiency principle suppresses information that is clearly relevant under the conditionality principle. This means that the justification of the sufficiency principle (namely that all relevant information is in the sufficient statistic) breaks down in this case.
Frequentists/error statisticians therefore don’t need to worry about the likelihood principle because they shouldn’t accept the sufficiency principle in the generality that is required for Birnbaum’s proof.
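The contrast drawn above, between the sampling distribution under the mixture and the sampling distribution under the experiment actually performed, can be sketched numerically. Everything specific here is an assumption made for illustration: a fair coin chooses between a binomial experiment (12 tosses) and a negative binomial experiment (stop at the 3rd tail), 9 heads are observed, the null value is θ = 0.5, and the function names are hypothetical. As a further simplification, the statistic used for the mixture is just the head count with the experiment label discarded, whereas Birnbaum’s construction merges only the single pair of outcomes with proportional likelihoods.

```python
from math import comb

THETA0 = 0.5  # assumed null value of the success probability

def p_binomial(heads=9, n=12):
    """Tail area P(X >= heads) when the number of tosses n is fixed."""
    return sum(comb(n, k) * THETA0**k * (1 - THETA0)**(n - k)
               for k in range(heads, n + 1))

def p_negbinomial(heads=9, tails=3):
    """Tail area P(Y >= heads) when tossing stops at the `tails`-th tail."""
    below = sum(comb(y + tails - 1, tails - 1) * THETA0**y * (1 - THETA0)**tails
                for y in range(heads))
    return 1 - below

def p_mixture(heads=9):
    """Tail area of the head count under the 50/50 mixture of both experiments,
    i.e. with the label of the experiment actually performed thrown away."""
    return 0.5 * p_binomial(heads) + 0.5 * p_negbinomial(heads)

# Conditional assessments use the experiment that produced the data;
# the mixture assessment averages over an experiment that was not performed.
conditional_binomial = p_binomial()
conditional_negbinomial = p_negbinomial()
unconditional = p_mixture()
print(conditional_binomial, conditional_negbinomial, unconditional)
```

Under these assumptions the three tail areas are 299/4096, 67/2048, and 433/8192, so no single sampling distribution satisfies both prescriptions at once. This illustrates the tension between the two principles once sampling distributions are taken to be relevant; it does not by itself show which principle should give way.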
Having understood this, I toyed around with the idea of writing this down as a publishable paper, but I have now come across a paper in which this argument can already be found (although in a less straightforward and more mathematical manner), namely:
M. J. Evans, D. A. S. Fraser and G. Monette (1986) On Principles and Arguments to Likelihood. Canadian Journal of Statistics 14, 181-194, http://www.jstor.org/stable/3314794, particularly Section 7 (the rest is interesting, too).
NOTE: This is the last of this group of U-Phils. Mayo will issue a brief response tomorrow. Background to these U-Phils may be found here.