
- 28 November (10–12 noon): Mayo: On Birnbaum’s argument for the Likelihood Principle: A 50-year-old error and its influence on statistical foundations (see my blog and links within.)
5 December and 12 December: Statistical Science meets philosophy of science: Mayo and guests:
- 5 Dec (12 noon–2 p.m.): Sir David Cox
- 12 Dec (10–12): Dr. Stephen Senn; Dr. Christian Hennig: TBA
Topics, activities, readings: TBA. (Two 2012 summer seminars may be found here.)
Blurb: Debates over the philosophical foundations of statistical science have a long and fascinating history, marked by deep and passionate controversies that intertwine with fundamental notions of the nature of statistical inference and the role of probabilistic concepts in inductive learning. Progress in resolving the decades-old controversies that still shake the foundations of statistics demands both philosophical and technical acumen, but gaining entry into the current state of play requires a roadmap that zeroes in on core themes and current standpoints. While the seminar will attempt to minimize technical details, it will be important to clarify key notions in order to contribute fully to the debates. Relevance for general philosophical problems will be emphasized. Because the contexts in which statistical methods are most needed are ones that compel us to be most aware of the strategies scientists use to cope with threats to reliability, considering the nature of statistical method in the collection, modeling, and analysis of data is an effective way to articulate and warrant general principles of evidence and inference.
Room 2.06 Lakatos Building; Centre for Philosophy of Natural and Social Science
London School of Economics
Houghton Street
London WC2A 2AE
Administrator: T. R. Chivers@lse.ac.uk
For updates, details, and associated readings: please check the LSE Ph500 page on my blog or write to me.
*It is not necessary to have attended the 2 sessions held during the summer of 2012.
All: I learned that the T building at the LSE is once again called the Lakatos Building, as it used to be, and the room number has a dot following the 2: Lak 2.06.
Dr. Mayo,
I was looking at your paper “Error Statistics” and ran into a problem I hoped you or Dr. Spanos could help me with. Throughout the paper there is the example SEV(mu>mu1), which is calculated by P(d(X)<d(x0);mu=mu1). It’s a great example and I love how it puts to rest so many of the howlers thrown against Frequentist Statistics. Using SEV does seem to eliminate many, if not all, of the problems for Frequentists.
The problem I’m having is that by a simple change of variables it’s easy to show P(d(X)<d(x0);mu=mu1) is numerically equal to the Bayesian posterior P(mu>mu1|d(x0)) (for a uniform prior on mu). This is a simple change of variables (essentially d(X) = d(x0)+mu1-mu, where d(X) and mu are the integration variables) and it holds for any constants d(x0), mu1; so please don’t dismiss my concern.
Given this numerical identity, all the great arguments for how SEV solves the problems of Frequentist Statistics apply just as easily to the Bayesian calculation (at least in this example). This can’t possibly be right, so I wanted to find an example that clearly shows the superiority of SEV.
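For concreteness, here is a minimal numerical sketch of the identity being claimed, assuming a one-sample normal model with known sigma (my assumption for illustration; the specific numbers are not from the paper):

```python
# A minimal numerical check of the claimed identity, assuming a one-sample
# normal model with known sigma; all numbers below are purely illustrative.
from scipy.stats import norm

sigma, n = 1.0, 25                  # known sd and sample size (assumed)
se = sigma / n ** 0.5               # standard error of the sample mean
xbar0, mu1 = 0.4, 0.2               # observed mean and benchmark (assumed)

# SEV(mu > mu1) = P(d(X) < d(x0); mu = mu1), which for this model is
# P(Xbar < xbar0) computed under mu = mu1:
sev = norm.cdf((xbar0 - mu1) / se)

# Bayesian posterior under a uniform (flat) prior: mu | xbar0 ~ N(xbar0, se^2),
# so P(mu > mu1 | d(x0)) is:
posterior = 1 - norm.cdf((mu1 - xbar0) / se)

print(sev, posterior)               # both equal Phi((xbar0 - mu1)/se)
```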
Since the Bayesian result depended on a uniform prior for mu, what happens when we have prior knowledge that mu lies within a certain range [a,b]? The Bayesian could then restrict their uniform prior for mu to this interval. To show the Bayesians up I wanted to get the equivalent SEV answer, but I’m having a lot of trouble.
From you I learned that the only philosophically sound way to take prior knowledge into consideration is by changing the model and not with a prior probability, but how do I do that in this case? The prior knowledge has no effect on the sampling distribution at all. The error distribution just comes from a “well calibrated measuring instrument” and isn’t changed by information I have on the thing I’m measuring.
I’m tempted to just throw this prior info away and say it has no effect on the numerical value of SEV, but that can’t be right either. In an extreme case the prior info could be so restrictive that we know for certain mu>mu1. The Bayesian result will handle this extreme case perfectly, but SEV won’t unless it too somehow includes this information.
So can you or Dr. Spanos tell me how to change the sampling distribution to include the knowledge that mu is in [a,b] and how to show that the resulting SEV has much better properties than the equivalent Bayesian calculation?
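As a reference point for the requested comparison, here is a sketch of the Bayesian half of it, i.e. the posterior under the uniform prior restricted to [a,b]; the model and all numbers are the same illustrative assumptions as in the sketch above, and this is not the SEV analogue being asked for:

```python
# Sketch of the Bayesian calculation with the prior restricted to [a, b]:
# a uniform prior on [a, b] yields a N(xbar0, se^2) posterior truncated to
# [a, b]. Illustrative numbers only.
from scipy.stats import norm

se, xbar0, mu1 = 0.2, 0.25, 0.2     # assumed illustrative values
a, b = 0.1, 1.0                     # supposed known range for mu

cdf = lambda t: norm.cdf(t, loc=xbar0, scale=se)
posterior = (cdf(b) - cdf(max(mu1, a))) / (cdf(b) - cdf(a))
print(posterior)                    # P(mu > mu1 | d(x0)) under the [a, b] prior
```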
Dear Guest: “From you I learned that the only philosophically sound way to take prior knowledge into consideration is by changing the model and not with a prior probability.”
You couldn’t have learned this from me. I’ve only said that background knowledge need not and often will not come in the form of a prior. That’s what that whole discussion of ESP was for. The background came in as regards such matters as known flaws and fraud to be avoided. This is not to change the model really, though it would influence the wise choice of design to prevent Geller from cheating, say.
The rest of what you wrote I don’t quite get, but that may be because I’m riding on a ferry and it’s windy! We don’t have to assign a probability to a known claim. If all values other than those are ruled out, then it’s known. I never said a SEV assessment would necessarily have superior properties (from the perspective of a non-error-statistical school?) than a Bayesian one for the same case. Indeed, Bayesians are good at taking known error statistical results that seem sensible and finding ways to match them Bayesianly. The Bayesians have the magic (as Le Cam said). Oy, my scarf just blew off the side of the ferry! Sorry, gotta try and catch it!
Sorry again, there was a formatting problem which made it confusing. The quantity SEV(mu>mu1) in the paper is numerically identical to the Bayesian posterior P(mu>mu1|d(x0)).
This is an identity true for all values of the constants. Just apply a change of variables in the integral used to evaluate SEV to convert it into the integral used to compute the Bayesian posterior.
My point was that, as things stand, all the arguments for SEV in the paper are equally arguments for the Bayesian posterior. This can’t be right, so I wanted to extend this useful example to include prior information and then compare SEV with the new Bayesian posterior so as to show the superiority of SEV.
“Bayesians are good at taking known error statistical results that seem sensible and finding ways to match them Bayesianly”
Unfortunately, P(mu>mu1|data) is exactly the calculation Laplace would have made (and almost certainly did make). So the Bayesian solution predates SEV(mu>mu1) by a couple of centuries. Rather than arguing over who got there first, I’d prefer a clear-cut case where SEV differs from the Bayesian answer and is clearly superior in some way.
guest: What would stop the frequentist from restricting the parameter space to [a,b], if that is known a priori?
Hopefully nothing, but I don’t see how to do it. The sampling distribution on which SEV was calculated is unaffected by this restriction.
“The sampling distribution on which SEV was calculated is unaffected by this restriction.”
I think actually, in this problem, as long as you only want a test and severity (no confidence intervals), it really doesn’t make a difference whether the parameter space is restricted to [a,b] or not, apart from the fact that you shouldn’t use impossible values for mu1.
But is there anything wrong with that? As long as we’re talking frequentist logic, it doesn’t count as an argument that this is in some sense equivalent to a Bayesian analysis with a uniform prior over the whole real line and that the Bayesian solution on [a,b] would be different.
That was my intuition as well, but on deeper thought it seems like a problem. This is a toy problem, but it does represent some real elements of my work. Basically my work involves programming inferences and decisions that have to be made in real time with no human in the loop. In my case there are strong physical restrictions on most parameters, and those restrictions change as a function of time.
To see my problem, suppose we start out with no restriction on mu. Then the Error Statistician and the Bayesian will calculate the same actual number and draw the same kind of conclusion from it. Now suppose there is some known lower bound a, so that mu>a. As a -> .2 the Error Statistician’s conclusion won’t change at all until a just passes .2. There will be a significant discontinuity in the conclusions they draw. The discontinuity can be large if both the Error Statistician and the Bayesian initially concluded there was only slight confirmation of the hypothesis mu>.2.
The Bayesian posterior, however, will transition continuously to 1. The reason is that if we know mu>.19999 before ever taking a measurement, then it already seems pretty unlikely that mu is less than .2. If the data confirms mu>.2 even a little, then the total result is fairly strong confirmation of mu>.2 (at this point the Error Statistician will still be saying “There is only slight confirmation”).
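A small sketch of this continuity point, under the same illustrative model as the earlier sketches: SEV(mu>.2) does not depend on the lower bound a at all, while the truncated-prior posterior rises continuously toward 1 as a approaches .2.

```python
# Illustration of the claimed discontinuity: the truncated-uniform-prior
# posterior for mu > .2 tends continuously to 1 as the known lower bound a
# approaches .2, while SEV(mu > .2) ignores a entirely. Assumed numbers.
from scipy.stats import norm

se, xbar0, mu1, b = 0.2, 0.25, 0.2, 1.0
sev = norm.cdf((xbar0 - mu1) / se)       # about .60: "slight confirmation"

cdf = lambda t: norm.cdf(t, loc=xbar0, scale=se)
for a in (0.0, 0.10, 0.15, 0.19, 0.199):
    posterior = (cdf(b) - cdf(mu1)) / (cdf(b) - cdf(a))
    print(f"a = {a:5.3f}   posterior = {posterior:.3f}   SEV = {sev:.3f}")
```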
Guest: I may be slightly jumping in in the middle of your conversation, but owing to extreme circumstances here, it can’t be helped.
I have in mind that one would report discrepancies that are well indicated as well as those that are not. I never contemplated saying “there is only slight confirmation” of a discrepancy. The quantification of how well a given discrepancy is or is not indicated (or ruled out) concerns the severity or reliability or stringency or the like of the test. If a discrepancy is poorly indicated, it’s not a little bit confirmed; it’s unwarranted to claim there’s evidence for the discrepancy.
This might seem like no big difference, but I’ve come to realize that it is a KEY difference between Popper-Peirce type (error statistical) reasoning and “degree-of-confirmation” or support reasoning. That is why probability logic doesn’t hold for a “logic” of well-testedness. Please ponder this possibility seriously. Take a look at the picture (of the severe testing standpoint) on p. 18 of my slides from the 10 a.m. presentation, “The Confluence Between StatSci & PhilSci: Deep vs Shallow Explorations”, found at
http://www.phil.vt.edu/dmayo/conference_2010/schedule.htm
Sorry for being unclear, Dr. Mayo, but nothing I was describing involved “poorly indicated”.
But now I’m a bit confused. If mu>mu1 has a severity of .6 then its negation (using the equations in your paper) will have severity .4. On the other hand, if mu>mu1 has severity .95 then mu<=mu1 has severity .05.
In both cases mu>mu1 passes a severe test, but are you saying the lopsided severity in the second case doesn’t matter? They both simply “pass” a severe test, and both cases should be treated the same for making future decisions whose outcomes depend on the truth of mu>mu1?
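For concreteness, here is the complement arithmetic in this question, in the same illustrative model as the sketches above; in a continuous model the two severities sum to 1, and the observed means are assumed so as to reproduce roughly the .6/.4 and .95/.05 cases.

```python
# The arithmetic behind the ".6 vs .4" and ".95 vs .05" cases: in a
# continuous model, SEV(mu > mu1) and SEV(mu <= mu1) are complements.
# The observed means below are assumed for illustration.
from scipy.stats import norm

se, mu1 = 0.2, 0.2
for xbar0 in (0.25, 0.53):                   # illustrative observed means
    sev_gt = norm.cdf((xbar0 - mu1) / se)    # SEV(mu > mu1): ~.60, ~.95
    sev_le = 1 - sev_gt                      # SEV(mu <= mu1): ~.40, ~.05
    print(f"xbar0={xbar0}: SEV(mu>mu1)={sev_gt:.2f}, SEV(mu<=mu1)={sev_le:.2f}")
```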
This does not always work, but again I feel in the middle of a discussion comparing two different approaches: one with a clear underlying philosophy (that would be the error statistical) and evidential principle, and the other with a vague, hit-or-miss philosophy that I’m not clear on. I’d like to understand what’s being compared, and once again, I’m crossing on that ferry at night (to get proper internet).
Guest: It is wrong to say that if the Bayesian and the Error Statistician compute the same number, they draw the same kind of conclusion from it. These numbers mean different things. The frequentist number is not a probability for a certain set of parameters and shouldn’t be used like one.
I’m not anti-Bayesian so I’ll grant you that if you want to make a decision computing expected loss involving a distribution of the parameter, you have to do it the Bayesian way (I don’t believe that there is anything objective about this, though). If Mayo thinks otherwise, she has to take over.
Still, there is nothing wrong with the error statistical approach here unless you forget that it delivers you something else that is of interest in its own right but cannot be interpreted in the same way.
I don’t think there is anything wrong with Error Statistics!
Here’s the problem I’m having though. If the relevant number for mu>mu1 is .99 then an Error Statistician says “it passes a test with severity .99”. A Bayesian says “it’s true with probability .99”.
These have deep philosophical differences and mean something very different. But here’s the rub: if I know the Bayesian answer I automatically know the Error Statistician’s answer!
So the Bayesian method tells me everything the Error Statistician’s answer does, but the Bayesian answer seems to do so much more besides. I can use it in my decision theoretic program and I can include some kinds of prior information which you’ve said above the Error Statistician should just leave out of the analysis. By what reason then should my office ever give up their Bayesian methods?
There’s no rub here. In problems where there is a known relationship between numbers, knowing my quantity may give yours, but if you don’t grasp how to use and justify what mine teaches, that’s pretty shallow. On the background information, I thought we were quite clear that it is Bayesians who want background in terms of priors and model features, whereas we get beyond that straitjacket. The examples I think you have in mind are discussed in Cox and Mayo (2010) and in a distinct paper by Spanos. I think they’re shutting the lights now; it’s 2 a.m.
I don’t know how to use your teachings. That’s why I asked the authors how to include the kind of prior info I’d need to include in practice.
The interest wasn’t purely practical though. The “Error Statistics” paper shows that “severity” removes the problems of “classical” statistics. Because of the numerical identity in the example given between SEV and the Bayesian posterior, the paper can be converted (with the appropriate verbal translation) into showing that “Bayesian posteriors” remove the problems of “classical” statistics as well.
I was hoping to remove this defect with a better example. Including the prior info in an “Error Statistical” way may lead to an example that isn’t numerically equal to the Bayesian result but that is better as judged by an “Error Statistical” criterion.
Has your view changed? You’ve written before:
“If there is genuine knowledge, say, about the range of a parameter, then that would seem to be something to be taken care of in a proper model of the case—at least for a non-Bayesian.”
http://andrewgelman.com/2011/12/keeping-things-unridiculous-berger-ohagan-and-me-on-weakly-informative-priors/#comment-70586
Also, I’m a bit confused about the “shallow” comment. If an Error Statistician computes SEV(H) =.999 and a Bayesian computes P(H)=.999, then what is the deeper consequence that the Error Statistician sees from this that the Bayesian failed to understand?
Guest: I don’t know what yours means.
guest: As said before, I’m not going to tell you that your office should give up Bayesian statistics. However, bear in mind that “the event that the true parameter value is in [c,d] is true with probability 0.99” is a rather problematic statement to make for a Bayesian, because it implies that you believe in a true *frequentist* model with a true parameter in the first place.
If you don’t have any justification for any kind of *frequentist* distribution of the true parameter, i.e., your prior, the Bayesian calculus mixes up two kinds of probabilities, epistemic and frequentist, and I don’t see a proper justification for that.
Of course I know that there are indeed “empirical Bayes” situations where assuming a frequentist distribution of true parameter values makes sense, because the setup is so that there is a physical interpretation of this distribution (incl. repetitions with different true parameters). In such situations frequentists would normally be happy to apply Bayesian methods, too.
On the other hand, there is de Finetti’s subjectivism, to which you can stick, too, but this doesn’t license any kind of statement about true parameters, and the prior is subjective, so somebody else can legitimately have a probability of 0.2 for the event for which you get 0.99. (To name a leading non-subjectivist Bayesian, Jaynes does not believe in true parameters either, as far as I can see.)
All this is known to many major players in the Bayesian literature, as you will see if you read for example Bernardo & Smith with care.
Christian,
There isn’t the slightest problem for a Bayesian saying something like “a person’s weight is definitely greater than 0 and less than 10^6 kg”. That information may not be very useful, but it’s ridiculous to say only Frequentists can make statements like this.
I know you don’t see the justification for mixing “epistemic” priors with the likelihood, but that doesn’t mean that I don’t see the justification. This would take us off on another track, however. I’ll just mention that it is well known in my industry that including this kind of information in Bayesian priors does in fact lead to better results. If the net result of Error Statistics is that it denies us this practical benefit, then that’s a big problem.
guest: Bringing in a person’s weight now is very odd, and you don’t even make a probability statement. This has very little to do with what my point was.
Anyway, you can say what you want and I was fine with your industry using Bayes from the very beginning, so no need to convince me in that respect (not much hope either without being more explicit, though). I’m too pluralist to worry about your business just because you’re Bayesian. Somewhere further down this blog I took over the role to defend de Finetti, so there you are.
You can by the way say a hundred times that “the Bayesian posterior does the right thing as well” – still the same numbers interpreted differently are not the same thing. An error statistician should be all fine with whatever mathematical way you find to come up with a correct severity, even if you use a Bayesian prior *as long as you interpret it as a severity* and not as a probability measure on the parameter space.
“Bringing in a person’s weight now is very odd”
Weight was the original example in one of Mayo’s papers and it was explicitly the kind of prior info I mentioned in my original question to Dr. Mayo.
“I was fine with your industry using Bayes from the very beginning”
I wasn’t fine with it, but to get my coworkers to change I’d need to have these kinds of questions answered.
“still the same numbers interpreted differently are not the same thing”
True, but the numbers, once known, can be used in any way. There’s nothing stopping someone from doing a Bayesian calculation and saying “this also tells us SEV(H), so that means XYZ”. Although I’m still a little unclear how much more “XYZ” there is that the Error Statistician would have found and the Bayesian missed.
A related question about wording. If there is a problem with the severity concept (and the Bayesians are right), then the discussion in the “Error Statistics” paper is very unlikely to find it, since the examples are limited to ones in which the Bayesian posterior does the right thing as well.
So is it correct to say “the examples given so far are a low severity test of the ‘SEV’ concept”? If so, then what I’m basically interested in is examples which test ‘SEV’ with high severity.
Guest: I set out “4.1 Criteria for the Philosophical Scrutiny of Methods” in “Statistical Science and Philosophy of Science: Where Do/Should They Meet in 2011 (and Beyond)?”, RMM Vol. 2, 2011, 79–102, Special Topic.
Popperians have argued that any attempt to justify a universal account of method is either circular or self-contradictory. It’s an intriguing argument, but I argue against that in Mayo, D. (2006). “Critical Rationalism and Its Failure to Withstand Critical Scrutiny,” in C. Cheyne and J. Worrall (eds.) Rationality and Reality: Conversations with Alan Musgrave.
Popper himself said that his method (of falsifiability) should not itself be judged by this criterion because it is a philosophical theory, not a scientific one, but again, my position differs.
I’m sorry but I have a tiny amount of time to write a paper and cannot work on the blog comments for a bit, much as I’m keen to. Be assured that I will return, however.