A reader asks: “Can you tell me about disagreements on numbers between a severity assessment within error statistics, and a Bayesian assessment of posterior probabilities?” Sure.

There are differences between Bayesian posterior probabilities and formal error statistical measures, as well as between the latter and a severity (SEV) assessment, which differs from the standard type 1 and 2 error probabilities, p-values, and confidence levels—despite the numerical relationships. Here are some random thoughts that will hopefully be relevant for both types of differences. (Please search this blog for specifics.)

1. The most noteworthy difference is that error statistical inference makes use of outcomes other than the one observed, even after the data are available: there’s no other way to ask things like, how often would you find 1 nominally statistically significant difference in a hunting expedition over *k* or more factors? Or to distinguish optional stopping with sequential trials from fixed sample size experiments. Here’s a quote I came across just yesterday:

“[S]topping ‘when the data looks good’ can be a serious error when combined with frequentist measures of evidence. For instance, if one used the stopping rule [above]…but analyzed the data as if a fixed sample had been taken, one could guarantee arbitrarily strong frequentist ‘significance’ against *H*_{0}.” (Berger and Wolpert, 1988, 77).

The worry about being guaranteed to erroneously exclude the true parameter value here is an error statistical affliction that the Bayesian is spared (even though I don’t think they can be too happy about it, especially when HPD intervals are assured of excluding the true parameter value.) See this post for an amusing note; Mayo and Kruse (2001) below; and, if interested, search the (strong) likelihood principle, and Birnbaum.
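The two error probabilities at issue in this point can be sketched numerically. What follows is only an illustration under assumptions that are not in the post: Normal data with known σ = 1, a true null μ = 0, a nominal two-sided .05 cutoff, and a "try and try again" rule that tests after every observation up to a hypothetical maximum of 100. The function names are made up for the example.

```python
import math
import random

def optional_stopping_rejection_rate(n_max, trials=2000, z_crit=1.96, seed=0):
    """Estimate how often a TRUE null (mu = 0, sigma = 1) is rejected when we
    test after every observation and stop at the first nominal rejection."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(trials):
        total = 0.0
        for n in range(1, n_max + 1):
            total += rng.gauss(0.0, 1.0)
            z = total / math.sqrt(n)  # z-statistic for the running sample mean
            if abs(z) >= z_crit:
                rejections += 1
                break
    return rejections / trials

def hunting_rate(k, alpha=0.05):
    """Chance of at least one nominally significant result among k
    independent tests when every null is true."""
    return 1.0 - (1.0 - alpha) ** k

fixed = optional_stopping_rejection_rate(n_max=1)       # a single fixed look
optional = optional_stopping_rejection_rate(n_max=100)  # up to 100 looks
print(f"fixed look: {fixed:.3f}   optional stopping: {optional:.3f}")
print(f"hunting over 20 factors: {hunting_rate(20):.3f}")
```

The single look should stay near the nominal .05, while the rate under optional stopping is several times larger. That gap is exactly what an analysis that ignores the stopping rule cannot register.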

2. *Highly probable vs. highly probed*. SEV doesn’t obey the probability calculus: for any test T and outcome **x**, the severity for both *H* and ~*H* might be horribly low. Moreover, an error statistical analysis is not in the business of probabilifying hypotheses but evaluating and controlling the capabilities of methods to discern inferential flaws (problems with linking statistical and scientific claims, problems of interpreting statistical tests and estimates, and problems of underlying model assumptions). This is the basis for applying what may be called the Severity principle.

3. I once posted a comment by J. A. Hartigan (1971), on David Bartholomew, that gives a 5-line argument that while Bayes and frequency intervals may sometimes agree with improper priors, they never exactly agree with proper priors (see below [i]). But improper priors are not considered to provide degrees of belief (not even being proper probabilities). *This would seem to suggest that when they (frequentists and Bayesians) agree on numbers, the prior cannot be construed as a proper degree of belief assignment. So is there agreement?* If priors are not probabilities, what then is the interpretation of a posterior? (There’s no suggestion it would be a measure of well-testedness, as is SEV.)

4. *Now it might instead be a “default” or “conventional” prior.* Yet, producing identical numbers could only be taken as performing the tasks of an error statistical inference by reinterpreting them to mean confidence levels and significance levels, not posteriors. Then these could be connected to SEV measures.

For example, there’s agreement between “conventional” Bayesians and frequentists in the case of one-sided Normal (IID) testing (known σ), as noted in Ghosh, Delampady, and Samanta (2006, p. 35). If we wish to reject a null value when “the posterior odds against it are 19:1 or more, i.e., if posterior probability of *H*_{0} is < .05” then the rejection region matches that of the corresponding test of *H*_{0}, (at the .05 level) if that were the null hypothesis. Even with this agreement on numbers, it seems to me this would badly distort the goal and rationale of testing inferences. When properly used, they typically serve as workaday tools for probing a specific question or problem (e.g., about discrepancies). *You jump in, and you jump out.* Even after ensuring a p-value is genuine (“isolated tests” are not enough, search blog for why) the tester is not about to use the information gleaned to assign either “posterior odds” or (actual or rational) degrees of belief to statistical hypotheses within a generally very idealized model. (At least that’s not the direct function of p-values or confidence levels).
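The numerical agreement in the one-sided case can be checked directly. This is a minimal sketch, assuming the “conventional” prior is the improper flat prior on μ, so that μ given the sample mean is Normal(x̄, σ²/n); the data values and function names are hypothetical.

```python
import math
from statistics import NormalDist

def one_sided_p_value(xbar, mu0, sigma, n):
    """P(Xbar >= xbar; mu = mu0) for H0: mu <= mu0 vs. H1: mu > mu0."""
    z = (xbar - mu0) / (sigma / math.sqrt(n))
    return 1.0 - NormalDist().cdf(z)

def flat_prior_posterior_h0(xbar, mu0, sigma, n):
    """P(mu <= mu0 | xbar) under an improper flat prior on mu, which gives
    the posterior mu | xbar ~ Normal(xbar, sigma^2 / n)."""
    return NormalDist(mu=xbar, sigma=sigma / math.sqrt(n)).cdf(mu0)

# Hypothetical data: xbar = 0.4, sigma = 1, n = 25, so z = 2.
p = one_sided_p_value(0.4, 0.0, 1.0, 25)
post = flat_prior_posterior_h0(0.4, 0.0, 1.0, 25)
print(f"p-value: {p:.5f}   posterior of H0: {post:.5f}")
```

The two numbers coincide, so “reject when the posterior of *H*_{0} is below .05” picks out the same region as the .05 level test. The code shows only that the numbers match; it says nothing about whether they mean the same thing, which is the point under dispute.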

5. Then there’s the familiar fact that there would be disagreement between the frequentist and Bayesian if one were testing the two sided Normal mean: *H*_{0}: μ=μ_{0} vs. *H*_{1}: μ≠μ_{0}. In fact, the same outcome that would be regarded as evidence against the null in the one-sided test (for the default Bayesian and frequentist) can result in statistically significant results being Bayesianly construed as no evidence against the null or even evidence *for* it (due to a spiked prior).[ii]

See this post and the chart [iii] comparing p-values and posteriors below.
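The two-sided conflict can be reproduced with a short calculation. The sketch below assumes a particular spiked prior that is not spelled out in the post: prior mass 0.5 on μ = μ_{0} and, under the alternative, a Normal(μ_{0}, τ²) “slab” with τ = σ. The function name and defaults are illustrative only.

```python
import math

def posterior_prob_null(z, n, prior_null=0.5, tau_over_sigma=1.0):
    """Posterior probability of the point null mu = mu0 given a z-statistic,
    with a point mass on mu0 and a Normal(mu0, tau^2) slab elsewhere."""
    r = n * tau_over_sigma ** 2
    # Bayes factor for H0 over H1: ratio of the marginal densities of the
    # sample mean, Normal(mu0, sigma^2/n) vs. Normal(mu0, tau^2 + sigma^2/n).
    bf01 = math.sqrt(1.0 + r) * math.exp(-0.5 * z * z * r / (1.0 + r))
    return prior_null * bf01 / (prior_null * bf01 + (1.0 - prior_null))

# z = 1.96 is "just significant" at the two-sided .05 level for every n.
for n in (1, 10, 100, 1000):
    print(n, round(posterior_prob_null(1.96, n), 3))
```

Under these assumptions the posterior on the null is about .35 at n = 1 and about .6 at n = 100, and it keeps growing with n. So a result that is statistically significant at the .05 level ends up counted as evidence *for* the null.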

6. I’m reminded of Jim Berger’s valiant attempt to get Fisher, Jeffreys, and Neyman to all agree on testing (Berger 2003). Taking the conflict between p-values and Bayesian posteriors in two-sided testing, he offers a revision of tests thought to do a better job from both Bayesian and frequentist perspectives, but neither side liked it very much. See an earlier post. (Also Mayo 2003, Mayo and Spanos 2011 below [iv].)

The ‘spike and slab priors’ (a cute name I heard some statisticians use) also lead to a conflict with Bayesian ‘credibility interval’ reasoning, since 0 is outside the corresponding interval. By contrast, the severe testing inferences are harmonious whether one is doing testing or confidence intervals.

7. Recall in this connection Senn’s note (on this blog) on the allegation that p-values overstate the evidence.

8. Sometimes when people ask for differences they really mean to pose the challenge: “show me a case that I cannot reconstruct Bayesianly.” This leads to “rational reconstructions” or what I call “painting by numbers”.

The search for an agreement/disagreement on numbers across different statistical philosophies is an understandable pastime in foundations of statistics. Identifying matching or unified numbers, apart from what they might mean, might offer a glimpse as to shared (unconscious?) underlying goals. For example, if *H* passes severely, you are free to say you have warrant to believe *H*, and then assign it a high probability, but this is not Bayesian inference (and the converse doesn’t hold). On the face of it, any inference, whether to the adequacy of a model (for a given purpose) or to a posterior probability, can be said to be warranted just to the extent that the inference has withstood severe testing: one with a high capability of having found flaws were they present. But I think it’s a mistake to play a numbers game without the statistical philosophy to point toward interpretation as well as avenues for filling gaps in the approach.

__________

[i] Here are the 5 lines:

We need P(θ < θ_{α}(x)| θ) = α all θ.

Assume θ_{α}(x) has positive density over line, all θ.

Then P(θ < θ_{α}(x)| θ, θ_{α}(x) >0) = α(θ) > α.

So P(θ < θ_{α}(x)| θ_{α}(x) > 0) > α averaging over θ.

So P(θ < θ_{α}(x)| x) = α is impossible.

(J. A. Hartigan, comment on D. J. Bartholomew (1971), “Comparison of Frequentist and Bayesian Approaches to Inference with Prior Knowledge”, in Godambe and Sprott (eds.), *Foundations of Statistical Inference*, p. 432)

[ii] Not all default Bayesians endorse the spiked priors here, meaning there is a lack of agreement on numbers even within the same philosophical school. See this post.

[iii] Chart

[iv] For a short and sweet, and not too bad, overview, with comments by Casella, see Mayo 2004 below.

Bartholomew, D. J. (1971). “A Comparison of Frequentist and Bayesian Approaches to Inferences With Prior Knowledge,” in Godambe and Sprott (eds.), *Foundations of Statistical Inference*, 417-429.

Berger, J. (2003). “Could Fisher, Jeffreys and Neyman Have Agreed on Testing?”, *Statistical Science* *18*, 1–12.

Berger, J. and Sellke, T. (1987). “Testing a point null hypothesis: The irreconcilability of *p* values and evidence,” (with discussion). *J. Amer. Statist. Assoc.* *82*, 112–139.

Berger, J. and Wolpert, R. (1988). *The Likelihood Principle*. 2nd ed. Vol. 6. Lecture Notes-Monograph Series. Hayward, California: Institute of Mathematical Statistics.

Casella, G. and Berger, R. (1987). “Reconciling Bayesian and Frequentist Evidence in the One-sided Testing Problem,” (with discussion). *J. Amer. Statist. Assoc.* *82*, 106–111, 123–139.

Ghosh, Delampady, and Samanta (2006), *An Introduction to Bayesian Analysis, Theory and Methods,* Springer.

Hartigan, J. (1971) Comment on D. J. Bartholomew (1971), “Comparison of Frequentist and Bayesian Approaches to Inference with Prior Knowledge”, in Godambe and Sprott (eds.), *Foundations of Statistical Inference.*

Mayo, D. (2003). Comment on J. O. Berger’s “Could Fisher, Jeffreys and Neyman Have Agreed on Testing?”, *Statistical Science* *18*, 19-24.

Mayo, D. (2004). “An Error-Statistical Philosophy of Evidence,” in M. Taper and S. Lele (eds.) *The Nature of Scientific Evidence: Statistical, Philosophical and Empirical Considerations.* Chicago: University of Chicago Press: 79-118.

Mayo, D. and Kruse, M. (2001). “Principles of Inference and Their Consequences,” in D. Corfield and J. Williamson (eds.) *Foundations of Bayesianism*. Dordrecht: Kluwer Academic Publishers: 381-403.

Mayo, D. and Spanos, A. (2011). “Error Statistics,” in *Philosophy of Statistics*, *Handbook of Philosophy of Science*, Volume 7 (General editors: Dov M. Gabbay, Paul Thagard and John Woods; Volume eds. Prasanta S. Bandyopadhyay and Malcolm R. Forster). Elsevier: 1-46.

I neglected another point of difference which is very important:

9. For the error statistician, data generation and modeling are very much a part of the overall inquiry; we don’t seek an account that imagines data are thrown at you and you have to make an inference. Testing assumptions is the other big area of contrast with Bayesians.

Now the person who sent me the query was mentioning another blog that was seeking to find cases where SEV and posteriors differ, presumably to show the Bayesian posterior gets it right or whatever. But the person seemed to be overlooking the most central points of disagreement, and #9 is yet another.

“even though I don’t think they can be too happy about it, especially when HPD intervals are assured of excluding the true parameter value.”

… if and only if the prior is absolutely continuous and the true parameter value happens to be in the exact center of the optional stopping window.

Typically, if one suspects that this might *actually* be the case, one has prior information (usually of a theoretical nature) justifying putting a point mass on that specific value, and the quoted claim is no longer true.

Corey: The point of inferential tools is to reach inferences when one doesn’t already know the answer. There’s a discussion of some Bayesian moves regarding this case in Mayo and Kruse (2001), pp. 392-399 (the linked PDF takes a few seconds to load).

Mayo: Any non-degenerate prior encodes a state of information in which one doesn’t already know the answer. Spike-and-slab priors aren’t degenerate…

Corey: The point is when the agent sets out to “encode her state of information” to the same hypothesis in the fixed sample size case or an optional stopping case, it would be the same. Or are you saying when the Bayesian hears the type of sampling plan the experimenter will use, he should alter the prior plausibility in the same hypothesis? Why? (This is discussed in Mayo and Kruse 2001, which is why I link to it.) This makes Lindley howl because it violates the likelihood principle. The “simplicity and freedom” Savage touts in being able to ignore the stopping rule disappears (Savage forum). Not just that, his whole point is to protest that there’s no difference in information, and that no special interpretation is needed, the intention when to stop is buried in the agent’s mind somewhere, etc. Besides, why wouldn’t the null be accorded a lower rather than a higher prior in the optional stopping case? The experimenter must really believe it false, I might reason, seeing she’s determined to show it false.

By the way, on your earlier comment, I’m always going to assume and not repeat all the particulars of the example in the discussion I specifically link to in mentioning it, else it’s unwieldy. As you know these are examples trotted out numerous times on the blog, and I took the trouble to link to several.

“As you know these are examples trotted out numerous times on the blog, and I took the trouble to link to several.”

I confess confusion regarding the reason for this gentle chiding. Perhaps you think I’m making a stronger claim than I actually intend, and that’s why you think those examples are relevant to the discussion? I’m just trying to offer the appropriate qualifications to restore correctness to a single inadequately qualified and hence too general claim.

“The point is when the agent sets out to “encode her state of information” to the same hypothesis in the fixed sample size case or an optional stopping case, it would be the same.”

Yes.

“Or are you saying when the Bayesian hears the type of sampling plan the experimenter will use, he should alter the prior plausibility in the same hypothesis?”

Definitely not.

I’m saying that in situations where we know the exact null might actually be true (e.g., because it’s picked out by some theory we regard as worth testing) we aren’t in the situation where an absolutely continuous prior would be appropriate. Then the optional stopping issue is avoided by using a spike-and-slab that accurately reflects the information at our disposal. Conversely, in fields like social science and biology where everything has an effect (perhaps a negligible one but never strictly zero, even if for no other reason than that continuous models don’t apply at the smallest scales) the lack of a point mass at zero is a good reflection of the available information, and the optional stopping issue is a non-issue by hypothesis.

+1 for Corey’s “converse” point.

In these situations it’s not hard to get Bayesian testing procedures (not point null hypothesis tests, as written about by e.g. Berger) and two-sided p-values to agree. The motivation for Bayesian tests that do so is close to that of significance testing – which argues against point 8. NB by “agree” I mean to be identical up to asymptotic approximation, which is the standard way to judge whether two frequentist analyses are essentially doing the same thing.

There’s a lot of other points one could pick on in 1-8 (and 9) but the most egregious is this: 2. Highly probable vs. highly probed. I don’t believe you (or anyone else) have shown why a Bayesian analysis should not be able to reserve judgement on a testing result when conditions on precision (or any other quantity that one can cook up from a posterior) are not met. More generally, Bayesian analyses are free to depend on multiple characteristics of what’s known – those who want lots of probing are free to set up Bayesian analyses that do it. That the literature on point null hypothesis testing does not take this approach does not mean it is unBayesian.

OG: Gee I think it supports #8 quite well: painting by numbers. You know, an account that can explain/accommodate anything does not explain the thing at all: it’s like the flexible theory that accommodates all outcomes. But look, I was just reacting to the query sent my way seeking contrasts…

Berger and Sellke clearly allude to a contrasting notion of “evidence” at a very fundamental level. If you wish to reject that, fine; perhaps all foundational arguments/issues disappear.

Berger, J. and Sellke, T. (1987). “Testing a point null hypothesis: The irreconcilability of p values and evidence,” (with discussion). J. Amer. Statist. Assoc. 82, 112–139.

(link is in the article).

Corey: A few things: First, I’m prepared to allow that Bayesians are happy as clams about the consequences of the Stopping Rule Principle. My parenthetical remark (that they didn’t seem so happy about it) was only a reflection of many, many attempts to jump through hoops to avoid it, e.g., Bernardo, default Bayesians.

Of course, one needn’t consider the most extreme case of being wrong with maximal probability to point up that adherence to the LP precludes error probability control (which we care about, others need not).

You say in social sciences and biology everything is related to everything*, so it’s fine to keep trying and trying til you find an effect, and report things the same way, thereby handily dissolving the main criticisms of “hunting and shopping” for significance, and the like. I realize that’s a perfectly reasonable Bayesian position, and I was merely showing the contrasts with error statistics that came to mind. That was the question, recall. Add to that disagreements about the need to take into account other selection effects, cherry picking, data dependent models, etc. My point is really that the discussion about contrasts shouldn’t be limited to disagreements about numbers but should recognize we’re doing different things, and holding a different statistical philosophy. And any disagreement about numbers shouldn’t just assume a statistical philosophy that is at odds with the nature and rationale of error statistical inference.

By the way, J. Berger’s paper explicitly considers the disagreement with an interval null. They explain that the spiked Bayesian prior is employed to avoid the inference being merely: an improbable hypothesis has become more improbable. The paper by Casella and R. Berger disagrees with the spiked prior.

I might note that even if the null is associated with the established theory, e.g., Newton in 1919, our goal–discerning evidence against it– very much depends on not putting a high prior on Newton. Again, our goals differ: we (error probers) want to say, despite the honorable place occupied by Newton, these are genuine anomalies in the direction of Einstein. And we’re going to investigate them…now.

*I deny all nulls are false, by the way–even in social sciences and biology. I argue it’s based on a fallacy. But anyway, we always want to know how false, and where false, and for that we still need to avoid misleading error probabilities. Thus we still need to take into account aspects of the data and hypotheses generation that alter error probabilities. I mean we error statisticians, we worrywarts who are afflicted with such concerns.

Mayo: My problem is not with “I don’t think they can be too happy about it”. Insofar as I have seemed to be addressing that, allow me to state that I don’t speak for other Bayesians and vice versa.

Are you prepared to allow that “HPD intervals are assured of excluding the true parameter value” is a claim that is only true when the true parameter value is at the exact middle of the optional stopping window?

Are you also prepared to consider the idea that looking at Bayesian analyses through the lens of the notion of a “test” leads you awry by blinding you to the separate, distinguishable contributions of prior information and data information to the posterior state of information?

That is an interesting question. Should we refrain from ever viewing Bayesian analyses as tests?

john byrd: Refrain? No. Just keep in mind that posterior assessments incorporate information outside that of the data. Conceptualizations that fail to do that – e.g., “discerning evidence against [Newtonian mechanics] very much depends on not putting a high prior on Newton” – have gone awry.

Corey: Under what circumstances can I view a Bayesian analysis as a test, with the purpose being to provide an opportunity to refute a hypothesis using data deemed probative?

john byrd: To reduce the chance of miscommunication, can you spell out in greater detail what you mean by “data deemed probative”? Obviously it doesn’t mean definitive…

Corey: Sure, I just mean in the simple sense that it is data believed to be relevant to the problem at hand. So, if my hypothesis is that people belonging to group A are taller on average than people in Group B, then stature data from samples of the two groups would be probative.

john byrd: Okay, good enough. I’ll supply my own interpretation of “refute” — it’s “treat as false for some purpose”. This is both relative and decision-theory-laden: your purposes and mine might not match. If we need to generalize, we can go further, to “treat as false for most purposes”, making plausible guesses about what those purposes might be.

Given this, my line on what counts as a “test” is quite Mayonian: it’s a test if it could go either way before the data are seen.

Mayo has used the example of Isaac’s college entrance exam. (The college entrance exam isn’t the best setting in which to place the mathematical content of the example — actual entrance exams provide more than a pass/fail outcome. But it’s what’s easily available, so let’s go with it.) The pass/fail outcome cannot provide enough evidence to outweigh the fact that Isaac is randomly selected from Fewready town (by hypothesis). Isaac’s home town is highly relevant to the question one hopes to ask, so relevant that it determines the outcome of the procedure all on its own. The end result is that the data are probative by your definition, but not probative enough to turn this procedure into anything I’d call a test.

My take-home message is: don’t do “tests” (or call such procedures “tests”) unless they provide enough information to actually influence what you might do.

I wouldn’t define tests (or refute) this way, and wouldn’t deny poor Isaac the ability to demonstrate his college-readiness simply because he comes from Fewready town. That’s absurd. It links to a post I’ll put up soon about the problem of using probability as an adequate measure of evidence, confirmation, or the like. H: Isaac’s readiness is confirmed (in the sense of made more firm) by the test score, yet H: Isaac is ready is less well confirmed than is ~H, on the test data (P(H|x) < P(~H|x)).
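The two senses of “confirmed” in this comment can be separated with a toy Bayes’ theorem calculation. The numbers below are purely hypothetical (they are not taken from the Isaac example as originally posed): a 5% base rate of readiness in Fewready town, and a test that ready students pass 95% of the time and unready students pass 20% of the time.

```python
def posterior_ready(prior_ready, p_pass_given_ready, p_pass_given_unready):
    """P(ready | pass) by Bayes' theorem for a pass/fail test."""
    numerator = prior_ready * p_pass_given_ready
    evidence = numerator + (1.0 - prior_ready) * p_pass_given_unready
    return numerator / evidence

prior = 0.05  # hypothetical base rate of readiness in Fewready town
post = posterior_ready(prior, p_pass_given_ready=0.95, p_pass_given_unready=0.20)

print(f"P(H) = {prior:.2f}, P(H|x) = {post:.2f}, P(~H|x) = {1 - post:.2f}")
print("H made more firm by x:", post > prior)
print("H still less probable than ~H:", post < 1 - post)
```

With these assumed values the passing score raises the probability of readiness from .05 to .20, yet leaves it well below that of ~H. That is the situation described above, where P(H|x) < P(~H|x) even though x counts in favor of H.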

I read this as: do not seek evidence, perform tests, or otherwise try to test your preconceived notions if those notions are strong convictions. Granted, some might have an objective approach to developing these convictions, but some Bayesians surely do not see it as required. Under the circumstances, this would seem to be an unacceptable approach for scientists. It is unacceptable because there is no push to refute the original conviction or to probe for problems with it as a hypothesis. It is a foregone conclusion. So, we ignore evidence that speaks directly to Isaac because of what we know about his peers. Most of us want to know about Isaac before we draw a conclusion.

John: The first letter I received from Erich Lehmann in 1996 or 7 discusses the problem of college readiness testing. His wife Juliet Shaffer was at the Educational Testing Service in Princeton N.J. at the time–makers of SAT tests (it’s also where I first visited him). What Lehmann wrote, very seriously, is that there was a disagreement about looking at background rates of students because it was felt to disadvantage those already disadvantaged. In effect they’d need a higher score to reach the same probability of “readiness” than students in Manyready town. The other example was testing rare diseases. (Even positive results on a diagnostic could give “no disease” a high posterior.) I was very taken, given I didn’t know him, that he shared a certain degree of outrage about these cases. I didn’t think this could happen with real students (patients) in the real world–just showing my naivety. But I hope things aren’t so crude when it comes to admitting students into college, and I doubt they are; in fact I think points would be given for doing well despite being from Fewready Town. Especially if you’ve got a good college essay!

Mayo: Of course *you* wouldn’t — he’s *your* kid!

More seriously, this shows the problem with the college entrance exam example — it’s a poor match to the math that we’re using to model it. There are two problems: first, the result really does depend on knowing nothing more than the test outcome and the fact of random selection. Additional pertinent information is so often available that our minds can’t help completing the picture with a young scholar toiling away in dismal Fewready town, unfairly rejected… Second, college entrance exams are a game in the sense of game theory: fix a ruleset, and the players will do their best to game the outcome. Treating such situations as a pure problem of statistical inference will often fail to capture relevant information.

Medical testing is a much better fit to the math. Since you bring it up, I’d be curious as to your reaction to the CDC fact sheet info on interpreting tuberculin skin tests.

john d byrd: I can’t help but feel that you’ve come to the conversation with a few preconceived notions of your own. Let me assure you that I think Bayesians who do not see founding priors on objective information are barking mad.

john byrd: That should go, Bayesians who don’t see that founding priors on objective information is necessary for accurate inference are barking mad.

Corey: Thanks. So, I might venture to say that with compelling prior information in hand, I can only add support to my hypothesis by application of severe tests? Could there be a Bayesian test that does not meet the severity requirement? That is, one that a scientist should value.

“I might venture to say that with compelling prior information in hand, I can only add support to my hypothesis by application of severe tests?”

I’m assuming you mean “severe” in Mayo’s sense. I’m puzzled as to why you think that having compelling prior information in hand means that a severe test can only add support for your hypothesis — I don’t know why that would follow…

Maybe you don’t think that, but you suspect I think that?

“Could there be a Bayesian test that does not meet the severity requirement?”

For now, this is an open question. Investigating the disparities between the Bayesian and severity analyses is on my to-do list.

(I believe such a comparison is possible in spite of the differing aims of these two kinds of analyses. In my view, the colloquial notion of well-warrantedness entails colloquial plausibility. If I can find a case in which severity judges a hypothesis well-warranted but Bayes judges it implausible, and this discrepancy is not due to a strong prior, then I have found a contradiction between the formalized notions of well-warrantedness and plausibility and the colloquial notions the formalizations are meant to capture. I think any such contradiction will be highly illuminating. The converse case, a hypothesis that is Bayes-plausible but not severity-well-warranted (again, not due to a strong prior), is also an interesting discrepancy, but not an outright contradiction; it may or may not make sense that such a state of affairs could hold.)

Corey: This was supposed to reply to Corey’s Oct. 4 comment. But it doesn’t want to go there.

We most certainly incorporate information outside the data–in this case, not just problems with Newton’s theory, but the general problem of saving one’s theory at all costs, or claiming the data supports theory T because T has had success in the past. T’s defenders had to show there was another way to explain the data while retaining T, in this case Newton. Even though the impetus to such work was an adherence to Newton, everyone knew that cut no ice with dealing with the data.

On the other hand, the past successes enter in an appropriate degree of “dogmatism” (as Popper would call it): we don’t want to give up a theory too soon, before really understanding it, stressing it, developing it and exploring what it can teach. But the onus, for a Newtonian, would be to explain away the discordant data–through rigorous means.

This they must do, if they be scientists (even though in their heart of hearts they still believed Newton, some of them). This is what science is about. No one could just keep saying “But I must have a Newtonian ether because it is through the ether that I communicate with my departed son, Raymond” (as Oliver Lodge declared).

Mayo: From my perspective, finding a way to save NM is tantamount to finding an explanation of the data consistent with NM — through rigorous means, as you say — that gives a *likelihood* not much smaller than that obtained from the GR prediction.

So yes, “information outside the data” can indeed be relevant to P( Data | NM ). But I can’t see how this line of discussion addresses my example of your perspective gone askew, to wit: “discerning evidence against [Newtonian mechanics] very much depends on not putting a high prior on Newton.”

Corey: Except that that’s not at all how it’s done: NM had a hugely higher probability on the basis of successful predictions, and if only predictions matter, as the instrumentalist/operationalist holds, then most of science couldn’t be explained. They’d never even have gone beyond NM–Kant’s synthetic apriori could live.

Now it’s true, any case can be reconstructed in a manner Bayesians will be happy with. That really is painting by numbers (backward looking rather than forward looking).

Mayo: Actually, it was you who showed me why “NM had a hugely higher probability [prior to the solar eclipse photos] on the basis of successful predictions” is false, although in all likelihood this will come as news to you.

When we generate new theories, we require that they match/generalize well-supported theories in the domains where those theories have been found to apply. The reason for this is that the accumulation of evidence over time for the correctness of a particular theory — say, NM — does not actually single that theory out from the set of theories that make the same predictions. It is actually the disjunction of all theories making the same predictions which accumulates probability mass. GR was in this position relative to NM — it makes the same predictions as NM in low gravity/acceleration situations.

It was your discussion of the Suppean hierarchy of models in EGEK and your remarks in various places about the distinction between statistical and substantive hypotheses that put me onto this notion.

Corey: I’m not sure which part is to be news to me, but I don’t see how you get the lesson you state from EGEK, as it’s not true. This idea of cumulative theory building used to be popular (still is in some quarters) but especially in the kind of appraisal here doesn’t hold.

Firstly, to back up, I was alluding to people’s beliefs in Newton’s theory (even warranted beliefs) based on past success, as I expect Bayesians would. Are you saying GTR starts out with a high probability copped from Newton? No way. Nor was GTR in the “catchall hypothesis” applied by Newtonians.

The second thing I was saying (in my comment) is that the Newtonian beliefs are not carried into the adjudication of the data, except in the sense that one’s beliefs may motivate one to find a way to salvage the theory believed in. So if you read that part of EGEK, you know that the Newtonian’s “beliefs [in the theory under appraisal] had nothing to do with it”.

But back to where I started, this accumulation picture fails. Giving up one theory requires giving up solutions to lots of problems. On the typical “problem solving” calculus, Newton’s theory solved more problems and didn’t introduce apparently spooky metaphysics. Kuhn, among others, put to rest this idea of the old theory being a kind of limiting case. Although he did correctly observe that the winners would likely retell things to show progress.

Now I do say that experimental knowledge (severely corroborated hypotheses) remains, but this is not obtained Bayesianly. And it’s not used Bayesianly (e.g., we don’t give a high probability assignment to GTR, or try to update beliefs in GTR). I could go on….

Having said all that, I appreciate your having read parts of EGEK on this. The more interesting story (on GTR) occurs much later.

When I say that it will come as news to you, I mean that the insights I gained from reading your work will be so foreign to you that you could never predict them.

“Firstly, to back up, I was alluding to people’s beliefs in Newton’s theory (even warranted beliefs) based on past success, as I expect Bayesians would.”

I’m not a subjective Bayesian — I don’t care what people believe except insofar as those beliefs reflect the information at their disposal. The probabilities I compute are relative to states of information.

“Are you saying GTR starts out with a high probability copped from Newton? No way.”

I’m saying that the body of evidence accumulated prior to the eclipse photos did not endow NM with a vastly greater prior probability. Only in situations where the predictions of NM and GR disagree is it possible to obtain evidence favoring one or the other. The body of evidence up to that point contained few cases where NM and GR disagreed, and therefore could not have endowed NM with a vastly greater prior probability, QED.
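The step behind this argument can be spelled out with Bayes’ rule in odds form. This is only a sketch in notation that does not appear in the comment itself: $E$ stands for the body of evidence gathered before the eclipse photos, and the probabilities are taken relative to a state of information.

```latex
\frac{P(\mathrm{GR} \mid E)}{P(\mathrm{NM} \mid E)}
  = \frac{P(E \mid \mathrm{GR})}{P(E \mid \mathrm{NM})}
    \cdot \frac{P(\mathrm{GR})}{P(\mathrm{NM})}
```

Wherever the two theories issue the same predictions, $P(E \mid \mathrm{GR}) = P(E \mid \mathrm{NM})$, so the likelihood ratio is 1 and the odds between them are left where they started. On this reading, only evidence from situations where the predictions disagree can shift the odds.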

“Kuhn, among others, put to rest this idea of the old theory being a kind of limiting case.”

I don’t *think* you mean to claim that Kuhn disproved a straightforward consequence of GR, i.e., that NM provides an adequate approximation to it in low gravity/acceleration situations… you seem to be talking about the sociology of science. I’m talking about the math.

Corey:

“When I say that it will come as news to you, I mean that the insights I gained from reading your work will be so foreign to you that you could never predict them.”

You mean I couldn’t predict you’d so misunderstand me?

“I’m not a subjective Bayesian — I don’t care what people believe except insofar as those beliefs reflect the information at their disposal. The probabilities I compute are relative to states of information.”

Well the information at their disposal consisted of the Newtonian success stories. This “states of information” seems to be a murky business that is not pinned down.

I’m also talking about the math/science (on Kuhn and the issue of purported cumulativity), but I can’t be teaching you philosophy of science in blog comments.

so how’s your “chancy” blog coming?

Mayo:

“so how’s your “chancy” blog coming?”

It’s coming along, slowly. At this rate I might even publish my first post before November!

“I’m also talking about the math/science (on Kuhn and the issue of purported cumulativity), but I can’t be teaching you philosophy of science in blog comments.”

I’m not talking about “math/science” — just math. Unless Kuhn published a remarkable (and false) theorem to the effect that the predictions of GR and NM never agree, he is strictly not relevant to the point I’m making.

“Well the information at their disposal consisted of the Newtonian success stories. This “states of information” seems to be a murky business that is not pinned down.”

For a non-murky example of reasoning about states of information, please read the paragraph immediately following the one you’re reacting to. It starts with, “I’m saying…” and ends with “QED”. You may take “body of evidence” to refer to a state of information.

“You mean I couldn’t predict you’d so misunderstand me?”

Naturally I do not mean that. -_-

Your ideas, in combination with ideas I hold that you reject utterly, led me to consider the consequence of applying Bayesian updating to the disjunction of theories that all make the same prediction in some circumstance. Once I had that notion, the rest was simple math. You could do it yourself — provided you are capable of dispassionately working out the consequences of ideas with which you don’t agree. In fact, since you’ve descended to slanging me, I challenge you to do so in a blog post. You know that I’ve engaged with your ideas enough to understand at least some of them — you’ve said so yourself. How about you extend me the same courtesy?

Corey: Now now, I’m not “slanging” you, whatever that means (did you mean slugging?). I would like to understand your ideas, but as I said, when I see remarks to which I’m inclined to say “oy, it’s a long story”* then I have to be honest. They can’t be just given in little Cliff’s notes. It’s not my rejecting your ideas, at all, it’s that certain things don’t wash, such as making inferences about “the disjunction of theories that all make the same prediction”. This is not any kind of theory, I don’t know how you’d assign it a probability, or appraise it. Maybe it IS just the data. You infer the sense data or the observables, and just say whatever it is that is consistent with this data is… well, responsible. Then you make predictions about new data… how?

Anyway, good luck on the blog.

*If you were around, I’d suggest taking the seminar on philosophy of statistics I’m giving with Spanos next semester.

Mayo: slang.

“It’s not my rejecting your ideas, at all, it’s that certain things don’t wash, such as making inferences about “the disjunction of theories that all make the same prediction”.”

And yet, given prior probabilities for some set of theories and an observation error distribution, Bayesians can easily compute posterior probabilities for arbitrary disjunctions of theories in the original set. My challenge to you is to jump into that framework — nonsensical as you find it to be — and posit an experiment in which a subset of the theories issue the same prediction, work out the consequences for the posterior probabilities, and see what a Bayesian would think in that situation.
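A minimal sketch of the exercise proposed above, under loudly stated assumptions: the three-theory set (with a made-up rival `ALT`), the prior values, the point predictions, the observed values, and the Gaussian error model are all hypothetical choices made for illustration. None of them comes from the discussion or from real data.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of a Normal(mean, sd) observation-error model at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def update(prior, predictions, observed, sd):
    """Bayes' rule over a finite set of theories with point predictions."""
    unnorm = {
        t: prior[t] * normal_pdf(observed, predictions[t], sd) for t in prior
    }
    total = sum(unnorm.values())
    return {t: w / total for t, w in unnorm.items()}


# Hypothetical prior probabilities (illustrative only).
prior = {"NM": 0.5, "GR": 0.1, "ALT": 0.4}

# Hypothetical experiment 1: NM and GR issue the same prediction.
shared = {"NM": 1.0, "GR": 1.0, "ALT": 2.0}
post = update(prior, shared, observed=1.1, sd=0.3)

odds_before = prior["GR"] / prior["NM"]
odds_after = post["GR"] / post["NM"]
disj_before = prior["NM"] + prior["GR"]
disj_after = post["NM"] + post["GR"]

# Hypothetical experiment 2: the predictions of NM and GR now disagree.
split = {"NM": 0.9, "GR": 1.8, "ALT": 3.0}
post2 = update(post, split, observed=1.7, sd=0.3)

print("GR:NM odds, before and after experiment 1:", odds_before, odds_after)
print("P(NM or GR), before and after experiment 1:", disj_before, disj_after)
print("Posterior after experiment 2:", post2)
```

In this toy setup the first experiment raises the probability of the disjunction “NM or GR” while leaving the odds between NM and GR untouched; only the second experiment, where their predictions differ, moves those odds. Whether a disjunction of this kind deserves to be appraised at all is exactly the point under dispute in the exchange, and the sketch does not settle it.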

“I’d suggest taking the seminar on philosophy of statistics I’m giving with Spanos next semester.”

Oh man, I’d love to do that. Alas…

I think that it should be quite obvious that posterior assessment and severity assessment (and other frequentist measures) don’t do the same thing, conceptually and mathematically, so I’d view disagreement between the numbers as the rule and agreement as the (potentially interesting) exception. I’m always puzzled why some people expect these numbers to be the same or expect a specific “explanation” for differences.

I should add that I’m fine with the rather “psychological” explanation given in the posting of why this is so. So I’m actually not that puzzled, but I think it would be progress if people could get away from expecting the two approaches to “do the same thing” somehow, which I think is an obstacle to understanding.

Christian: Thanks so much for your notes, I’m behind in reading comments because I’m snowed under with my book as well as under the weather….

Psychological? Do you mean a difference in aims?

I am told that the person was interested in finding differences to show inferiority of SEV, downplaying or disregarding the difference in aims.

Mayo: I was referring to this: “The search for an agreement/disagreement on numbers across different statistical philosophies is an understandable pastime in foundations of statistics” and what follows.

Christian: Oh. I didn’t see it as psychological, but who knows, maybe that’s what it is. I agree with your point.
