*The following two sections from Aris Spanos’ contribution to the RMM volume are relevant to the points raised by Gelman (as regards what I am calling the “two slogans”)**.*

**6.1 Objectivity in Inference (From Spanos, RMM 2011, pp. 166-7)**

The traditional literature seems to suggest that ‘objectivity’ stems from the mere fact that one assumes a statistical model (a likelihood function), enabling one to accommodate highly complex models. Worse, in Bayesian modeling it is often misleadingly claimed that as long as a prior is determined by the assumed statistical model—the so-called *reference prior*—the resulting inference procedures are objective, or at least as objective as the traditional frequentist procedures:

“Any statistical analysis contains a fair number of subjective elements; these include (among others) the data selected, the model assumptions, and the choice of the quantities of interest. Reference analysis may be argued to provide an ‘objective’ Bayesian solution to statistical inference in just the same sense that conventional statistical methods claim to be ‘objective’: in that the solutions only depend on model assumptions and observed data.” (Bernardo 2010, 117)

This claim brings out the unfathomable gap between the notion of ‘objectivity’ as understood in Bayesian statistics, and the error statistical viewpoint. As argued above, there is nothing ‘subjective’ about the choice of the statistical model *M*_{θ}(**z**) because it is chosen with a view to account for the statistical regularities in data **z**_{0}, and its validity can be objectively assessed using trenchant M-S testing. Model validation, as understood in error statistics, plays a pivotal role in providing an ‘objective scrutiny’ of the reliability of the ensuing inductive procedures.

Objectivity does NOT stem from the mere fact that one ‘assumes’ a statistical model. It stems from establishing a *sound link* between the process generating the data **z**_{0} and the assumed *M*_{θ}(**z**), by securing statistical adequacy. The *sound* application and the *objectivity* of statistical methods turns on the *validity* of the assumed statistical model *M*_{θ}(**z**) for the particular data **z**_{0}. Hence, in the case of ‘reference’ priors, a misspecified statistical model *M*_{θ}(**z**) will also give rise to an inappropriate prior π(θ).

Moreover, there is nothing subjective or arbitrary about the ‘choice of the data and the quantities of interest’ either. The appropriateness of the data is assessed by how well data **z**_{0} correspond to the theoretical concepts underlying the substantive model in question. Indeed, one of the key problems in modeling observational data is the pertinent bridging of the gap between the theory concepts and the available data **z**_{0} (see Spanos 1995). The choice of the quantities of interest, i.e. the statistical parameters, should be assessed in terms of the statistical adequacy of the statistical model in question and how well these parameters enable one to pose and answer the substantive questions of interest.

For error statisticians, *objectivity* in scientific inference is inextricably bound up with the *reliability* of their methods, and hence the emphasis on thorough probing of the different ways an inference can go astray (see Cox and Mayo 2010). It is in this sense that M-S testing to secure statistical adequacy plays a pivotal role in providing an *objective scrutiny* of the reliability of error statistical procedures.

In summary, the well-rehearsed claim that the only difference between frequentist and Bayesian inference is that they both share several subjective and arbitrary choices but the latter is more honest about its presuppositions, constitutes a lame excuse for the ad hoc choices in the latter approach and highlights the huge gap between the two perspectives on modeling and inference. The appropriateness of every choice made by an error statistician, including the statistical model *M*_{θ}(**z**) and the particular data **z**_{0}, is subject to independent scrutiny by other modelers.

**6.2 ‘All models are wrong, but some are useful’**

A related argument—widely used by Bayesians (see Gelman, this volume) and some frequentists—to debase the value of securing statistical adequacy, is that statistical misspecification is inevitable and thus the problem is not as crucial as often claimed. After all, as George Box remarked:

“All models are false, but some are useful!”

A closer look at this locution, however, reveals that it is mired in confusion. *First,* in what sense ‘all models are wrong’?

This catchphrase alludes to the obvious simplification/idealization associated with any form of modeling: it does not represent the real-world phenomenon of interest in all its details. That, however, is very different from claiming that the underlying statistical model is unavoidably misspecified vis-à-vis the data **z**_{0}. In other words, this locution conflates two different aspects of empirical modeling:

(a) the ‘*realisticness*’ of the substantive information (assumptions) comprising the structural model *M*_{φ}(**z**) (substantive *premises*), vis-à-vis the phenomenon of interest, with

(b) the *validity* of the probabilistic assumptions comprising the statistical model *M*_{θ}(**z**) (statistical *premises*), vis-à-vis the data **z**_{0} in question.

It’s one thing to claim that a model is not an exact picture of reality in a substantive sense, and totally another to claim that this statistical model *M*_{θ}(**z**) could *not* have generated data **z**_{0} because the latter is statistically misspecified. The distinction is crucial for two reasons. To begin with, the types of *errors* one needs to probe for and guard against are very different in the two cases. *Substantive adequacy* calls for additional probing of (potential) errors in bridging the gap between theory and data. Without securing *statistical adequacy*, however, probing for substantive adequacy is likely to be misleading. Moreover, even though good fit/prediction is neither *necessary* nor *sufficient* for statistical adequacy, it *is* relevant for *substantive adequacy* in the sense that it provides a measure of the structural model’s comprehensiveness (explanatory capacity) vis-à-vis the phenomenon of interest (see Spanos 2010a). This indicates that part of the confusion pertaining to model validation and its connection (or lack of) to goodness-of-fit/prediction criteria stems from inadequate appreciation of the difference between substantive and statistical information.

*Second,* how wrong does a model have to be to *not* be useful? It turns out that the full quotation reflecting the view originally voiced by Box is given in Box and Draper (1987, 74):

“[. . . ] all models are wrong; the practical question is how wrong do they have to be to not be useful.”

In light of that, the only criterion for deciding when a misspecified model is or is *not* useful is to evaluate its potential unreliability: the implied *discrepancy* between the relevant actual and nominal error probabilities for a particular inference. When this discrepancy is small enough, the estimated model can be useful for inference purposes, otherwise it is not. The onus, however, is on the practitioner to demonstrate that. Invoking vague *generic robustness* claims, like ‘small’ departures from the model assumptions do not affect the reliability of inference, will not suffice because they are often highly misleading when appraised using the error discrepancy criterion. Indeed, it’s not the discrepancy between models that matters for evaluating the robustness of inference procedures, as often claimed in statistics textbooks, but the discrepancy between the relevant actual and nominal error probabilities (see Spanos 2009a).

In general, when the estimated model *M*_{θ}(**z**) is statistically misspecified, it is practically useless for inference purposes, unless one can demonstrate that its reliability is adequate for the particular inferences.
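To make the "actual versus nominal error probability" criterion concrete, here is a minimal simulation sketch. It is not taken from Spanos' paper: the choice of a t-test for a zero mean, the AR(1) form of dependence, and the values `rho=0.5`, `n=50` and the critical value 2.01 (roughly the two-sided 5% point of a t distribution with 49 degrees of freedom) are all illustrative assumptions. The test is nominally a 5% test under independence; the simulation estimates how often it actually rejects a true null when the independence assumption is false.

```python
import random
from math import sqrt

def rejection_rate(rho, n=50, reps=2000, crit=2.01, seed=0):
    """Estimate how often a nominal 5% t-test of mean = 0 rejects a TRUE null
    when the data follow a zero-mean AR(1) process with coefficient rho."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(reps):
        # Start from the stationary distribution of the AR(1) process.
        x = rng.gauss(0, 1 / sqrt(1 - rho ** 2))
        data = []
        for _ in range(n):
            x = rho * x + rng.gauss(0, 1)
            data.append(x)
        mean = sum(data) / n
        var = sum((v - mean) ** 2 for v in data) / (n - 1)
        t_stat = mean / sqrt(var / n)  # assumes independence (the nominal model)
        if abs(t_stat) > crit:
            rejections += 1
    return rejections / reps

nominal = 0.05
actual_iid = rejection_rate(rho=0.0)  # independence holds
actual_dep = rejection_rate(rho=0.5)  # independence is violated

print(f"nominal: {nominal:.2f}")
print(f"actual, independent data: {actual_iid:.3f}")
print(f"actual, AR(1) data with rho=0.5: {actual_dep:.3f}")
```

Under these assumptions the estimated rejection rate for the dependent data should come out well above the nominal 5%, which is the kind of discrepancy the passage says the practitioner has to evaluate.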

*A. Spanos 2011, “Foundational Issues in Statistical Modeling: Statistical Model Specification and Validation,” RMM Vol. 2, 2011, 146–178, Special Topic: Statistical Science and Philosophy of Science

**Note: Aspects of the on-line exchange between me and Senn are now published in RMM; comments you post or send for the blog (on any of the papers in this special RMM volume), if you wish, can similarly be considered for inclusion in the discussions in RMM.

Mayo, this is a nice discussion by Spanos.

But how does it solve Berkson’s paradox? Given enough data, isn’t every model going to be statistically misspecified?

Let me try to be more clear. Let’s use Spanos’ example, CAPM.

On page 161, he rejects almost all assumptions of the model; the only exception is linearity. So, all the inferences made within the model are called into question, because the error probabilities are not properly controlled.

Now let’s suppose for a minute that CAPM had passed all tests. Then, by analogy, all inferences would be correct (in the sense that error probabilities were controlled).

But, like Berkson argued, it’s likely that our model isn’t just exactly right. We just didn’t have enough data to reject our assumption. Suppose, then, that some years after evaluating CAPM as “statistically adequate”, we get more data. And then we end up rejecting every assumption (like Spanos did on page 161). Then, what we had thought of as valid inferences before wouldn’t be valid inferences anymore, right?

Well, if a practitioner gets this kind of “disappointment” very often, like Berkson, then he would extrapolate this kind of thinking to every model he is evaluating.

Carlos

The Berkson paradox stems from assuming that a rejection provides equally good evidence against the null hypothesis, irrespective of the sample size, which is false. Such reasoning ignores the fact that when one rejects a null hypothesis with a smaller sample size (or a test with lower power), it provides better evidence for a particular departure d from the null than rejecting it with a larger sample size. This intuition is harnessed by Mayo’s severity assessment in order to quantify a rejection (or acceptance) in terms of the discrepancy from the null warranted by the particular data; see Mayo and Spanos (2006).
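The point about sample size can be sketched numerically. The setup below is an illustrative assumption, not the CAPM analysis: a one-sided Normal test of H0: μ ≤ μ0 against μ > μ0 with known σ, where after a rejection the severity for the claim μ > μ1 is Φ((x̄ − μ1)/(σ/√n)). The values `mu0 = 0`, `sigma = 1`, the common test statistic 1.82, and the discrepancy `gamma = 0.05` are chosen only for illustration.

```python
from math import sqrt
from statistics import NormalDist

def severity_mu_greater(xbar, mu1, sigma, n):
    """Post-data severity for the claim mu > mu1 after rejecting H0: mu <= mu0
    in a one-sided Normal test with known sigma (illustrative setup)."""
    return NormalDist().cdf((xbar - mu1) / (sigma / sqrt(n)))

mu0, sigma = 0.0, 1.0   # assumed null value and known standard deviation
d_obs = 1.82            # same observed test statistic in both scenarios
gamma = 0.05            # discrepancy from the null being assessed

sev_by_n = {}
for n in (10, 10_000):
    xbar = mu0 + d_obs * sigma / sqrt(n)  # sample mean implied by d_obs and n
    sev_by_n[n] = severity_mu_greater(xbar, mu0 + gamma, sigma, n)
    print(f"n={n}: xbar={xbar:.4f}, SEV(mu > {mu0 + gamma}) = {sev_by_n[n]:.4f}")
```

With the same test statistic, the claim μ > 0.05 should pass with high severity for the small sample and with very low severity for the large one, which is the asymmetry described above.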

In relation to the CAPM example, I have used a much larger sample size (n=1318) to test the theory model by respecifying the underlying statistical model; the original n=64 was totally inadequate for “capturing” the temporal structure and heterogeneity in the data. It turns out that one needs to go beyond the Normal/static family of models into a Student’s t dynamic model with mean heterogeneity to find a statistically adequate model. On the basis of the latter model the substantive restrictions of the CAPM are strongly rejected. I will be happy to share the data with anybody who wants to test the statistical model assumptions independently.

Final note: I will urge people to avoid expressions like “a model or a hypothesis is exactly right”; they are misplaced in the context of inductive (statistical) inference.

Thank you, Spanos.

I’m reading your 2007 paper on curve fitting and I got interested in Kepler’s example.

Berger’s book (1985) mentioned Kepler’s model as one that is seen as good but would be statistically rejected today.

In your paper, you tested the statistical adequacy of Kepler’s model with n=28 and it passed.

With a bigger n, wouldn’t we eventually reject Kepler’s model just as you rejected Ptolemy’s geocentric model?

I have tested the Kepler model using modern data (n=700) and it passes all M-S tests with flying colors! By the way, the Ptolemy model is statistically misspecified even when the M-S testing is performed using much smaller sample sizes, say n=40! Hence, M-S testing is not about sample sizes, as long as there are enough observations to render the selected M-S tests effective (powerful) enough!

Oh, I would like to ask something else too!

In Spanos’ review of Ziliak and McCloskey’s book

also page 161 (what are the odds?) there’s this passage:

“These two examples demonstrate how the same test result τ(x0)=1.82, arising from two different sample sizes, n=10000 and n=10, can give rise to widely different ‘severely passed’ claims concerning the warranted substantive discrepancy: γ 1.1, respectively.”

But wouldn’t it be γ<1.1?

Because we didn't reject H0, so we could say that the discrepancy is, with high severity (0.9 or higher), below 1.1. Not above 1.1.

also the passage:

“[…] the minimum discrepancy warranted by data x0 is γ >1.1”

Wouldn’t it be

“[…] the maximum discrepancy warranted by data x0 is γ <1.1"

Never mind the last two questions hehe

I thought that Spanos had said the data supported a discrepancy higher than 1.1, and I thought it very odd… but he just said that values above 1.1 passed the severity test that u<γ. Hence, we could say that the minimum discrepancy with "high" severity is 1.1.

Sorry.

Glad you straightened the last two questions out…but on the general question, I have this to say:

I think there is very often a confusion between asserting that a hypothesis H is false (or, in some cases, asserting that H is discrepant by more than such and such) and asserting that a procedure can or will find it false. This occurs, for example, when it is proclaimed that any null hypothesizing 0 correlation of some sort is invariably false.

That we use models, or language altogether for that matter, already means we are viewing reality through a “framework”, but given a linguistic (or modeling) framework, whether or not a hypothesis is true/false approximately true/false, adequate/inadequate, etc, is a matter of what is the case.

If you’ve got a procedure that will always declare a hypothesis false, even when it is true, then its declaring H false is very poor evidence for its falsity. “H is false” passes a test with minimal severity.

Note, by the way, that in order to be warranted in finding a scientific theory false, it is required to affirm as true (or approximately true) a “falsifying hypothesis” with severity. See “no pain” philosophy. Sorry, to be dashing in undergrounds—did not want to ignore blog entirely.

I continue to be impressed with this framework, but every time I ask myself why I’m not doing this myself (or rather, teaching myself how to do it from Spanos’s papers) I think, “But what about Cox’s theorem?”

meaning?

Cox’s theorem establishes probability theory as an extension of classical logic to reasoning under uncertainty about the truth values of the propositions under consideration. In particular, the theorem shows that any univariate real-valued measure of plausibility that is not equivalent to Bayes must violate some compelling (to me) desiderata for reasoning under uncertainty. I regard science as the endeavour of reducing uncertainty in our knowledge of the way the universe works; Cox’s theorem shows me how to do the data analysis part of this job quantitatively.

To the extent that things like M-S tests and severity require p-values as univariate real-valued measures of plausibility, they must conflict with the desiderata.

Oy and No—yet very glad you raised this. That ASSUMES an ungodly amount of assumptions (e.g., about assigning probabilities to everything as measures of “plausibility”, taking all bets, coherence, …on and on..and on…), which Cox certainly does not accept. Please check the “ifs” of the “theorem”. Further, deductive updating via Bayes’ theorem, like all deductive moves, always results in no more info than the premises contained—so no growth of knowledge. But, on the other hand, there could be a challenge here…for classical Bayesians. Error Statisticians have an alternative philosophy of scientific and inductive inference; do the Bayesians? Are they essentially seeking foundations under the error statistical umbrella?

The Cox in question isn’t Sir David Cox — it’s Richard T. Cox, an American physicist who studied electric eels.

Bets don’t enter into the desiderata. In this it is quite different from de Finetti’s “bets plus Dutch-book-style coherence” approach and Savage’s “bets plus rational preferences” approach.

Same argument, same problem. You’ll have to demonstrate why you think it applies to severity. I can show that probability logic is the wrong logic for assessment of well-testedness. Example: Irrelevant conjunctions get some support from data that well-corroborates one conjunct, even if it is irrelevant to the other.

I’m not sure why you state “same argument, same problem.” You wrote that “the argument assumes an ungodly number of assumptions,” but the argument that proves Cox’s theorem doesn’t have a list of premises that runs “on and on… and on”. In fact, I’d be interested to know which specific premise(s) of the theorem you reject (you can find a nice intro to the entire argument here).

The other problems you state don’t trouble me: the growth of knowledge comes from the accumulation of data. Propositional logic is a set of rules of inference that can operate on previously unknown data, generating novel deductions, and Bayes is an extension of propositional logic, generating novel plausible inferences.

As to conjunctions get some support from data that well-corroborates one conjunct, even if it is irrelevant to the other, I’m not seeing the problem — that’s just as it should be, as far as I can tell. This is not dangerous because a conjunction cannot be more probable than any of its conjuncts. Perhaps you can point me to an example from one of your papers.

As for a specific example, someone posted a link in the comments of a previous post to a paper (by Morris deGroot, I think?) that shows how p-values fail in this regard. I’ll have to dig it up and rephrase it in terms of severity.

Ah, no — the p-value paper was by Mark Schervish, referenced but not linked by commenter “Guest” in this post.

@Corey, if you haven’t seen it already, John Cook nicely restates Schervish’s example. But the two hypotheses are tested with different power (i.e. severity) which *should* also affect the interpretation of their p-values …even though it’s no caricature to say this doesn’t happen much in practice.

Thanks for the link, Guest. (Severity isn’t the same as power — in the normal model it’s mathematically related to the confidence distribution.)

Sorry I missed this, I need to redirect comments from Elba to me. I see others more or less addressed points except for when you wrote (in your March 10 comment):

“As to conjunctions get some support from data that well-corroborates one conjunct, even if it is irrelevant to the other, I’m not seeing the problem — that’s just as it should be, as far as I can tell. This is not dangerous because a conjunction cannot be more probable than any of its conjuncts. Perhaps you can point me to an example from one of your papers.”

The danger is that if x confirms H & J, merely because x well-corroborates H, but is utterly irrelevant to J, then x confirms J, even a little, and so anything that confirms something confirms anything. Moreover, we want an account that lets us say/show that you have carried out a terrible test of J, one which will easily find some support for J even if J is false.

In general, as I have been arguing, probability logic, which is deductive, does a very poor job of capturing ampliative (evidence-transcending) inference.

I can’t make any sense of your comment, which suggests that you and I are using the same words to mean different things. I’ll explain how I’m using the key words and why your statements make no sense under my definitions, and then maybe we can isolate the problem. To disambiguate, I’ll use the prefix “p-” for my sense of the key words. All of my probability expressions will explicitly condition on a prior state of information Z.

Evidence x is said to p-support (p-confirm, p-well-corroborate, etc.) proposition A (given prior information Z) iff Pr(A | x, Z) > Pr(A | Z).

Evidence x is said to be p-irrelevant to the plausibility of proposition A (given prior information Z) iff for any proposition B, Pr(A | x, B, Z) = Pr(A | B, Z). An immediate implication is Pr(A | x, Z) = Pr(A | Z).

Now consider the p-support offered to a conjunction H & J by evidence x. By the definition of conditional probability,

Pr(H & J | x, Z) / Pr(H & J | Z) = [ Pr(H | x, Z)*Pr(J | x, H, Z) ] / [ Pr(H | Z)*Pr(J | H, Z) ].

Your claim was “if x confirms H & J, merely because x well-corroborates H, but is utterly irrelevant to J, then x confirms J, even a little, and so anything that confirms something confirms anything.” Using my sense of the words, this claim is false. If x is p-irrelevant to J, then by definition Pr(J | x, H, Z) = Pr(J | x, Z) and

Pr(H & J | x, Z) / Pr(H & J | Z) = Pr(H | x, Z) / Pr(H | Z),

that is, x p-supports H & J to the precise degree that it p-supports H alone. Furthermore, as noted above, Pr(J | x, Z) = Pr(J | Z), that is, x does not p-support J.

I just noticed a typo. I wrote:

“If x is p-irrelevant to J, then by definition Pr(J | x, H, Z) = Pr(J | x, Z)…”

That sentence should read:

“If x is p-irrelevant to J, then by definition Pr(J | x, H, Z) = Pr(J | H, Z)…”
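The identity in this comment (with the typo corrected as noted) can be checked numerically on a small example. The numbers below are purely illustrative assumptions: H has prior probability 0.3, x is more likely under H than under not-H, and J is constructed to be independent of both H and x, which is one simple way of making x p-irrelevant to J in the sense defined above.

```python
from itertools import product

# Illustrative assumptions (not from the thread):
p_H = 0.3                             # Pr(H | Z)
p_x_given = {True: 0.8, False: 0.2}   # Pr(x | H, Z) and Pr(x | not H, Z)
p_J = 0.4                             # Pr(J | Z), with J independent of (H, x)

# Joint distribution over (H, J, x), each True/False.
joint = {}
for h, j, x in product([True, False], repeat=3):
    ph = p_H if h else 1 - p_H
    pj = p_J if j else 1 - p_J
    px = p_x_given[h] if x else 1 - p_x_given[h]
    joint[(h, j, x)] = ph * pj * px

def pr(event, given=None):
    """Conditional probability Pr(event | given) computed from the joint table."""
    cond = given if given is not None else (lambda h, j, x: True)
    num = sum(p for k, p in joint.items() if event(*k) and cond(*k))
    den = sum(p for k, p in joint.items() if cond(*k))
    return num / den

saw_x = lambda h, j, x: x

ratio_H = pr(lambda h, j, x: h, saw_x) / pr(lambda h, j, x: h)
ratio_HJ = pr(lambda h, j, x: h and j, saw_x) / pr(lambda h, j, x: h and j)
ratio_J = pr(lambda h, j, x: j, saw_x) / pr(lambda h, j, x: j)

print(f"Pr(H|x)/Pr(H)     = {ratio_H:.4f}")
print(f"Pr(H&J|x)/Pr(H&J) = {ratio_HJ:.4f}")
print(f"Pr(J|x)/Pr(J)     = {ratio_J:.4f}")
```

In this construction the first two ratios agree and the third equals 1, matching the algebra: x p-supports H & J exactly as much as it p-supports H, and does not p-support J. This only illustrates the probabilistic definitions used in the comment; it does not address whether p-support is the right notion of evidence, which is the point in dispute.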

Yes, we’re speaking a different language. SEV is not a probability assignment, and it gets the logic right as regards the problem of irrelevant conjuncts. My problem is already with claiming x supports H & J, with J irrelevant. But one can also go further and notice that the conjunction entails the conjunct J. Will write when I am in one place: unexpected travel.

Thank you again, Spanos! I was going to cite Berger’s example in my dissertation, but then I ran into your paper and now your comment definitely made me think twice about that.

But I’m still trying to understand some things here.

Let’s suppose our model is statistically adequate. Then, I can see how severity assessment can prevent one from committing the fallacies of rejection or acceptance that are so commonly made nowadays (fruits of the hybrid incoherent “null ritual” as defined by Gigerenzer).

But I still didn’t get how to assess severity in the very statistical adequacy we need !

For example, how can I distinguish “how far” my model is from normality? By looking only at the p-value of an M-S test? How do I draw the line, when is it 5% or when is it, say, 19%?

Because, as far as I am making a judgement about a parameter, say, a price elasticity, I can, as an economist, say when it is small and when it is big. So, in the error statistics framework, I could see how an economist could judge, with severity, if the rejection of a null has substantive meaning.

But as to departures from normality, what is small and what is big is still not very clear.

For example, in Kepler’s model, the non-normality p-value was around 10%… why isn’t it significant?

Thanks again!

Carlos

I apologize for the delay in replying; I have been traveling for the last two days. The severity assessment has a role to play in the context of M-S testing. For example, in the case of testing Normality the test statistic is based on a combination of: skewness=0 and kurtosis=3. One can establish the discrepancy from this null warranted by the particular data using the post-data severity evaluation. What is “small” and “big” for this context is not particularly difficult to determine. For instance, a discrepancy from skewness bigger than .5 should be considered serious; it will imply serious asymmetry. Similarly, a discrepancy of bigger than plus or minus 1 from kurtosis=3 is also considered serious enough. In the case of the empirical example with Kepler’s first law the warranted discrepancies from the skewness and kurtosis were considerably less than these thresholds, ensuring that there is no serious departure from the symmetry or the mesokurtosis of the Normal distribution. Of course, one should always apply several different tests for the different model assumptions, both individually and jointly, as a cross-check in order to ensure a more reliable assessment. In the case of Kepler’s law I supplemented the skewness-kurtosis test with a nonparametric test [Shapiro-Wilk] which confirmed the result of no evidence against Normality. Having done a lot of research with financial data, I can testify that Normality is almost never valid for such data because they often exhibit leptokurtosis [kurtosis > 3] and one has to use distributions like the Student’s t when symmetry is not a problem.
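As a rough sketch of the quantities mentioned in this reply, the code below computes the sample skewness and kurtosis and the Jarque-Bera form of the skewness-kurtosis statistic, then compares the sample departures with the thresholds quoted above (0.5 for skewness, 1 for kurtosis). Several things here are assumptions: the data are simulated Normal draws, not Kepler's data; the Jarque-Bera statistic is one common skewness-kurtosis test and may not be the one Spanos used; and comparing point estimates with the thresholds is a simplification, not the post-data severity evaluation itself.

```python
import random
from math import exp

def skew_kurt(data):
    """Sample skewness and (non-excess) kurtosis from central moments."""
    n = len(data)
    mean = sum(data) / n
    m2 = sum((v - mean) ** 2 for v in data) / n
    m3 = sum((v - mean) ** 3 for v in data) / n
    m4 = sum((v - mean) ** 4 for v in data) / n
    return m3 / m2 ** 1.5, m4 / m2 ** 2

def jarque_bera(data):
    """Jarque-Bera statistic and its asymptotic p-value.
    The chi-square(2) survival function is exactly exp(-x/2)."""
    n = len(data)
    s, k = skew_kurt(data)
    jb = n / 6 * (s ** 2 + (k - 3) ** 2 / 4)
    return s, k, jb, exp(-jb / 2)

random.seed(1)
sample = [random.gauss(0, 1) for _ in range(500)]  # simulated, for illustration only

s, k, jb, p = jarque_bera(sample)
serious = abs(s) > 0.5 or abs(k - 3) > 1.0  # thresholds quoted in the reply

print(f"skewness={s:.3f}, kurtosis={k:.3f}, JB={jb:.3f}, p={p:.3f}")
print("serious departure from Normality by these thresholds:", serious)
```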

Thank you, I think it’s getting clearer now!

Let me see if I got it right.

For instance, the Jarque-Bera test assumes S=0 and K=3. Let’s suppose our sample is large enough for the asymptotics of the test to be “just fine”.

Usually, what people do is to just check if the p-value<5% and reject normality. That, if I got it right, can be very misleading. A better practice would be to see what discrepancies of S and K are warranted.

So, let's say my sample was huge, two hundred thousand.

Then, I could get a very small p-value, p<0.00001. But, still, if it turns out that the discrepancies warranted for S and K are around 0.1, then I could say that the departures, though statistically different from S=0 and K=3, are not "significant" in the substantive sense that, for inference purposes, the nominal error probabilities are not very different from the real error probabilities. Is that it?

If it is, I think this approach can indeed make classical testing more sensible.

Thanks,

Carlos

You now have the gist of how one can avoid being misled by inference results due to a large sample size! Indeed, it can work the other way around when the sample is small and the M-S test used does not have enough power to detect an existing departure. The post-data severity evaluation addresses both problems simultaneously.
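The arithmetic behind this exchange can be sketched with the Jarque-Bera formula, JB = (n/6)(S² + (K−3)²/4), whose asymptotic p-value under Normality is exp(−JB/2). The sketch assumes, purely for illustration, that the sample skewness and excess kurtosis are both exactly 0.1 at every sample size; it treats these sample values as stand-ins for the warranted discrepancies, which is a simplification.

```python
from math import exp

def jb_p_value(n, skew, excess_kurt):
    """Jarque-Bera statistic and asymptotic chi-square(2) p-value
    for given sample skewness and excess kurtosis (K - 3)."""
    jb = n / 6 * (skew ** 2 + excess_kurt ** 2 / 4)
    return jb, exp(-jb / 2)

skew, excess_kurt = 0.1, 0.1  # same small departure at every sample size (assumed)

results = {}
for n in (100, 200_000):
    results[n] = jb_p_value(n, skew, excess_kurt)
    jb, p = results[n]
    print(f"n={n}: JB={jb:.2f}, p-value={p:.3g}")
```

The same small departure gives a large p-value at n=100 and a vanishingly small one at n=200,000, so the p-value alone cannot say whether the departure is substantively serious.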

Hi, Mayo,

What do you think about this paper from Hoenig?

See page 3.

He thinks it’s “nonsensical” to do post-data power analysis because it contradicts the notion of the “p-value” as a measure of evidence against the null.

His example is two estimates of the same effect size, but in one of them the standard error is lower. Then, in this one, the Z statistic is higher and the p-value is lower.

But if one were to calculate a post-data power analysis, one would find out that the more precise estimate, with the higher Z, gives a lower upper bound for the true effect size. And he thinks this is “nonsense” because if the p-value is lower, then there should be more evidence against the null.

My opinion is that it is his notion of “p-value” as measure of evidence that is flawed – see Schervish (1996) or Spielman (1973).

What do you think?

Thanks

Carlos

Oy! Please see my posts on “shpower”. You will see that the fault lies not in power but in this completely wrong-headed invention “shpower”. His notion is equivalent to “shpower”.

Thanks, got it!

Reply to Mayo, April 2 (nesting limit reached):

Let me see if I understand your position. You claim that a reasonable definition of support ought to be such that if evidence x supports claim H but is irrelevant to J, then x does not support the conjunction H & J. This is because H & J entails J, so under any reasonable definition of support, any evidence that supports H & J also supports J. Do I have that right?

If I have that right, how crucial would you say the above claim is to your motivation for developing the severity framework?

Cut out everything after “This is because”. Then write: this is because the experiment that produced x has done nothing to rule out the falsity of J. The entailment point, is just another known consequence of all such accounts. Moving in car…will write more later.

When you wrote “Cut out everything before” did you mean everything *after*?

yes, “after”, fixed it (told you I was moving in a vehicle)

Is more forthcoming? (I ask because I have a response drafted, but I was waiting for you to complete your comment.)

Corey: Sorry, forgot to get back to this, but I plan to be talking about the very idea of using probability as an inductive support measure of some sort. It is a view I reject, but I am happy to allow logicians of support or belief or the like to have their research program, and talk instead about an account of well-testedness, warranted evidence, corroboration, severe testing or the like. Hopefully statistically-minded scholars will consider the mileage that they can bring to this (admittedly) different philosophy of ampliative inference. If an account regards data x as warranting hypothesis J, even a little bit, when in fact the procedure had 0 capability of resulting in a denial that J is warranted, even were J false, then the procedure producing x provides no test at all for J. One can put it in zillions of ways but that is essentially the weakest requirement for scientific evidence in my philosophy. See my Popper posts. Of course, in practice, I do not think scientists would contemplate the evidence x supplies to a conjunction when J is utterly irrelevant, and so the two do not share data models (e.g., general relativity and prion theory entail light deflection). Ordinary first-order logic with material conditionals is quite inadequate for scientific reasoning (e.g., false antecedents yield true conditionals).

Mayo,

You’re simply wrong that probabilities get conjunctions wrong. Corey demonstrated explicitly above that they get it right.

Also, probabilities get your “weakest requirement for scientific evidence” right as well. To see why note that it’s not possible for both of these to be true:

(1) There is no x that supports J being false.

(2) There is some x that supports J being true.

Using Corey’s notation these become:

(1) For all x, P(not J|x,Z)P(J|Z)

Roughly these say “there is no x that would increase the chance that J is false” and “ there is some x’ that increased the chance J is true”. You can show in a few lines that (1) and (2) imply the contradiction:

P(J|Z)>P(J|Z)

So you can’t have (1) and (2) both be true.

Moreover, these are not isolated examples. There are famous books by Polya and Jaynes which give example after example where probabilities get these kinds of things right. Many of these are real world and not just toy examples.

Of course all the examples in the world don’t prove that probabilities will always work this way. But then there is Cox’s theorem which Corey mentioned above. Basically, anytime you try to reason by assigning real numbers to hypotheses (it doesn’t matter what you call those real numbers: they could be “evidence” or “belief” or “severity” or whatever) then these numbers will fit into a formalism identical to probability theory or lead to some embarrassing conclusions.

So regardless of how you feel about Cox’s theorem, at the very least it does imply that the formalism of probability is going to be able to handle or mimic how we do reason a surprising amount of the time (otherwise nothing like Cox’s theorem would be true).

At any rate, the demonstrations me and Corey gave are a few lines of elementary mathematics. There shouldn’t be any disagreement about those.

I don’t assign numbers to hypotheses, only to methods, e.g., tests. The problem of “irrelevant conjunction” is a standard one for probabilists and it has been discussed in scads of articles (e.g., (2004). Discussion: Re-Solving Irrelevant Conjunction with Probabilistic Independence. Philosophy of Science 71:505-514.). If you have a new way for probabilists to handle it, you should notify them.

Severity, like error probabilities associated with inferences (in error statistical tests and confidence intervals), are not probabilistic assignments to the inferences. They don’t obey probability relations. It is not I who have my logic wrong–recheck what you wrote.

For some reason the two conditions got cut off above. There must be some kind of formatting issue that is causing the blog to mangle the formulas. Here they are again, in words:

(1) for all x it’s true that P(not J|x,Z) is less than or equal to P(not J|Z)

and

(2) there is some x such that P(J|x,Z) is strictly greater than P(J|Z)

The last comment gets it right. There was a mangling of (1) and (2) and it wasn’t printing out right for some reason.
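For reference, the "few lines" showing that (1) and (2) cannot both hold can be sketched as follows. The sketch adds an assumption the comment leaves implicit: x ranges over a finite set of mutually exclusive and exhaustive outcomes x_1, …, x_m, each with P(x_i | Z) > 0.

```latex
% Condition (1), rewritten using P(\neg J \mid \cdot) = 1 - P(J \mid \cdot):
%   P(J \mid x_i, Z) \ge P(J \mid Z) \quad \text{for all } i.
% Condition (2): P(J \mid x_k, Z) > P(J \mid Z) \quad \text{for some } k.
\begin{align*}
P(J \mid Z)
  &= \sum_{i=1}^{m} P(J \mid x_i, Z)\, P(x_i \mid Z)
     && \text{(law of total probability)} \\
  &> \sum_{i=1}^{m} P(J \mid Z)\, P(x_i \mid Z)
     && \text{(by (1) for } i \ne k \text{, (2) for } i = k \text{, and } P(x_k \mid Z) > 0\text{)} \\
  &= P(J \mid Z),
\end{align*}
% which is the contradiction P(J \mid Z) > P(J \mid Z).
```

This establishes only the probabilistic claim as stated in the comment; whether it answers the severity requirement is the question the thread leaves open.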