The blog “It’s Chancy” (Corey Yanofsky) has a post today about “two severities” which warrants clarification. Two distinctions are being blurred: between formal and informal severity assessments, and between a statistical philosophy (something Corey says he’s interested in) and its relevance to philosophy of science (which he isn’t). I call the latter an error statistical philosophy of science. The former requires formal, semi-formal, and informal severity assessments. Here’s his post:

In the comments to my first post on severity, Professor Mayo noted some apparent and some actual misstatements of her views. To avert misunderstandings, she directed readers to two of her articles, one of which opens by making this distinction:

“Error statistics refers to a standpoint regarding both (1) a general philosophy of science and the roles probability plays in inductive inference, and (2) a cluster of statistical tools, their interpretation, and their justification.”

In Mayo’s writings I see two interrelated notions of severity corresponding to the two items listed in the quote: (1) an informal severity notion that Mayo uses when discussing philosophy of science and specific scientific investigations, and (2) Mayo’s formalization of severity at the data analysis level.

One of my besetting flaws is a tendency to take a narrow conceptual focus to the detriment of the wider context. In the case of Severity, part one, I think I ended up making claims about severity that were wrong. I was narrowly focused on severity in sense (2), in fact on one specific equation within (2), but used a mish-mash of ideas and terminology drawn from all of my readings of Mayo’s work. When read through a philosophy-of-science lens, the result is a distorted and misstated version of severity in sense (1).

As a philosopher of science, I’m a rank amateur; I’m not equipped to add anything to the conversation about severity as a philosophy of science. My topic is statistics, not philosophy, and so I want to warn readers against interpreting Severity, part one as a description of Mayo’s philosophy of science; it’s more of a wordy introduction to the formal definition of severity in sense (2). (It’s Chancy, Jan 11, 2014)

*A needed clarification may be found in a post of mine which begins:*

Error statistics: (1) There is a “statistical philosophy” and a philosophy of science. (a) An error-statistical philosophy alludes to the methodological principles and foundations associated with frequentist error-statistical methods. (b) An error-statistical philosophy of science, on the other hand, involves using the error-statistical methods, formally or informally, to deal with problems of philosophy of science: to model scientific inference (actual or rational), to scrutinize principles of inference, and to address philosophical problems about evidence and inference (the problem of induction, underdetermination, warranting evidence, theory testing, etc.).

I assume the interest here* is on the former, (a). I have stated it in numerous ways, but the basic position is that inductive inference—i.e., data-transcending inference—calls for methods of controlling and evaluating error probabilities (even if only approximate). An inductive inference, in this conception, takes the form of inferring hypotheses or claims to the extent that they have been well tested. It also requires reporting claims that have not passed severely, or have passed with low severity. In the “severe testing” philosophy of induction, the quantitative assessment offered by error probabilities tells us not “how probable” but, rather, “how well probed” hypotheses are. The local canonical hypotheses of formal tests and estimation methods need not be the ones we entertain post data; but they give us a place to start without having to go “the designer-clothes” route.

*The post-data interpretations might be formal, semi-formal, or informal.*

*See also: Staley’s review of Error and Inference (Mayo and Spanos eds.)*

It’s not that I’m not *interested* in the philosophy of science…

Corey: Sure, I meant it’s not your focus, nor did I spoze it was. It’s the blurring of two distinctions that needs correcting. Do you see what I’m getting at? Fun to reblog “It’s Chancy” though….

I think I get it. My use of the word “corresponding” suggests a one-to-one mapping, which is where the blurring of distinctions comes in. Error statistics in the sense of “a cluster of statistical tools, their interpretation, and their justification” comprises an entire statistical philosophy, much more than just the formal SEV function.

Right, but now can you pinpoint the issues you had in mind wrt the realm of statistical philosophy (the ones you’re focusing on)?

Yup. SEV purports to be a formalization of the post-data degree of warrant for an hypothesis. Bayesian posterior probability purports to be a formalization of the post-data degree of plausibility of an hypothesis. In my conception of (informal) well-warranted-ness and (informal) plausibility, an hypothesis’s well-warranted-ness is sufficient for its high plausibility (absent relevant prior information that would militate against said high plausibility). I require (for my own use) that any formalization of those concepts obey this logical relation in some sense.

As is well known in these parts, in simple models the SEV answer and the Bayes answer (with some prior generally accepted among Bayesians as a reasonable default) often coincide numerically. (Of course the interpretation of the numbers is very different.) In other words, it’s rather tricky to see daylight between SEV and Bayes, and that numerical fact leaves open an important question: what is really doing the work? Error statisticians think severity is doing the work; Bayesians say Bayes’ Theorem is doing the work.

I have devised a simple toy model that (I believe) guarantees distinct numbers from the two formal methods. Now I just have to work out what the SEV analysis is, exactly. This requires an answer to the question I broached with you and Spanos in December.

Corey: No, sev in relation to a claim H is an assessment of how good or how poor a job has been done (in the experiment of relevance) in probing specific ways H can be false. It’s part of a report that includes what flaws have been poorly discriminated by test T. But a low sev, earned by poor discriminative capacity, definitely does NOT indicate or give evidence H is false. Conversely, claims can be well warranted and even known to be true, while poorly tested by T.

There’s a sun’s worth of daylight between the two conceptions of learning and science.

It’s curious to say Bayes’ theorem can do (empirical) work, being a deductive theorem. Sev is about ampliative inference (the conclusion goes beyond the premises), based on error probabilistic properties of methods. But we cannot review a whole philosophy of statistics in an exchange of blog comments. Do you have Error and Inference?

Mayo: By design, my toy model has a sufficient statistic, so there’s no question about what test statistic I should use and no question about poor discriminative capacity. I thought that I would thereby avoid worrying about the specification of the test, but it turns out that’s not the case: the issue is precisely what test procedure to apply when the sample space is the Cartesian product {0,1} x R. I’m aiming to derive a UMP test, but I haven’t even sorted out whether it’s possible in my toy model; I should probably just chuck the equations and resort to coding up the power function in R…

I don’t find it that curious that Bayes’ theorem can do empirical work even though it’s a deductive theorem. Even though it’s a theorem, it operates more like a rule of inference in logic. Just as there’s no mystery to the idea that if one knows that X implies Y, then learning that X is true lets one conclude that Y is true, there’s no mystery in the idea that if X is more plausible on Y than on not-Y then learning that X is true gives Y a B-boost.

I do not have Error and Inference.

Corey: You don’t need exotic examples: the very idea that a B-boost for H counts as evidence for H is as radical a departure from severity as is needed. The fundamental problem of inductive inference is that data x can be “fit”, entailed, or even “explained by” all sorts of hypotheses, even rival hypotheses. Hypothetico-deductive inference fails for this reason: merely fitting or entailing x is so weak that one can find a great fit between H and x with high or maximal probability, even if H is false. One finds a great fit between H and x even if nothing has been done that would have found H false, even if it is false. Such a “test procedure” results in H “passing the test” but the test has minimal severity.

That, of course, is why Popperians and critical rationalists dumped H-D inference. Popper’s famous line: before you have evidence for H, you must have “sincerely tried” to detect how H may be wrong, and despite trying, find x accords with H. I’m with them (what I supply Popper is an account of severity that he didn’t have). That’s quite enough of a conflict.

(The ability to be wrong with probability 1, in optional stopping (for Bayesians and likelihoodists), is a formal counterpart to the general lack of error control. Readers unfamiliar with this may search the blog.)
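For readers who want to see the error-control point in numbers, here is a minimal simulation sketch. It assumes the simplest setting (not taken from the post): independent N(0, 1) observations with the null mean of 0 actually true, a two-sided cutoff of 1.96 checked after every observation, and a cap `n_max` on the sample size. The function names and the seed are illustrative choices. As `n_max` grows, the chance of eventually crossing the cutoff approaches 1, which is the "wrong with probability 1" phenomenon.

```python
import math
import random


def rejects_under_optional_stopping(rng, n_max, z_crit=1.96):
    """Draw N(0, 1) data one point at a time (so the null mu = 0 is true)
    and stop as soon as the running z-statistic crosses the nominal cutoff."""
    total = 0.0
    for n in range(1, n_max + 1):
        total += rng.gauss(0.0, 1.0)
        z = total / math.sqrt(n)  # z = sqrt(n) * sample mean, with sigma = 1
        if abs(z) > z_crit:
            return True  # nominally "significant": the experimenter stops here
    return False  # never crossed the cutoff within n_max observations


def false_rejection_rate(n_max, trials=2000, seed=0):
    """Fraction of simulated experiments that reject a true null."""
    rng = random.Random(seed)
    hits = sum(rejects_under_optional_stopping(rng, n_max) for _ in range(trials))
    return hits / trials


if __name__ == "__main__":
    # n_max = 1 is an ordinary fixed-sample test; larger caps allow more looks.
    for n_max in (1, 10, 100):
        print(n_max, false_rejection_rate(n_max))
```

With a single look the rejection rate stays near the nominal level; allowing the experimenter to keep sampling until the cutoff is crossed inflates it well beyond that.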

While I’m on this, interested readers might also go back to the discussion of the tacking paradox (which incidentally was (statistically) one of the top posts of the year).

Now some Bayesians would require the posterior probability of H be high, rather than merely getting a B-boost, as you say you do. We’ve seen that on the B-boost view, the measure of confirmation cannot be equated to probability.

“You don’t need exotic examples: the very idea that a B-boost for H counts as evidence for H is as radical a departure from severity as is needed.”

I’m not convinced. As a practicing statistician, my focus is on applications, so I want a test case in which all of the working parts are fully exposed.

My model may be a toy, but it’s not any more exotic than the optional stopping example you like so much. In fact, it’s a simplification of optional stopping that I intend to use to get to the heart of the matter. (The “wrong with probability 1” observation doesn’t get to the heart of the matter for reasons which have apparently remained obscure to you even though Andrew Gelman and I have both taken whacks at explaining them.)

“Now some Bayesians would require the posterior probability of H be high, rather than merely getting a B-boost, as you say you do.”

Woah, woah, woah. I was talking about the “curiosity” of a deductive theorem doing empirical work. I didn’t say anything about my requirements in that ‘graph. I deny the relevance to science of so-called “Bayesian confirmation measures”; associating my point of view with them is a move that’s about on par with associating error statistics with the fallacies of acceptance and rejection that it defeats.

Mayo: “Now wait a min—did you not robustly defend the B-boost view of confirmation in the lengthy comments to the tacking paradox post? Confused about that.”

I defended the view that Bayes gives a sensible answer to the question of whether a piece of evidence makes a given claim more plausible, less plausible, or leaves its plausibility unchanged. I know of no basis for distinguishing a scale for a *numerical measure* of “confirmation”. Direction, yes; magnitude, no.

Corey: You say “The ‘wrong with probability 1’ observation doesn’t get to the heart of the matter” for you, but it does for me. Nor does one need to refer to sequential sampling, data dependent alternatives will do. See for example Cox and Hinkley 1974, 51-2. Nor need one just look at worst cases to bristle. Unsurprisingly that and other examples are due to Birnbaum, who rejected the likelihood principle because it prevented controlling error probabilities.

“I deny the relevance to science of so-called “Bayesian confirmation measures””

Now wait a min—did you not robustly defend the B-boost view of confirmation in the lengthy comments to the tacking paradox post? Confused about that.

Corey: 2 comments ago, and many times elsewhere, I raised doubts as to whether a B-bump is even indicative of making a claim more plausible in an intuitive sense because H can fit x swimmingly while nothing has been done to probe H (by dint of x).

But putting that aside: for something more radical, you wrote: “I know of no basis for distinguishing a scale for a *numerical measure* of “confirmation”. Direction, yes; magnitude, no.”

Are you saying you don’t think there’s a scale for confirmation or plausibility or what have you, altogether? (or only that a B-boost doesn’t measure it?) Isn’t that the reason Bayesians (and others) appeal to probability in an account of inductive inference?

Mayo: Cox’s theorem shows that any monotonic function of probability can serve as a measure of (Cox-)plausibility. So, while no particular numerical scale can be distinguished as *the* scale on which plausibility is to be measured, probability is nevertheless the central concept.

Corey: But will it be a B-boost type measure, e.g., a Bayes ratio, or a posterior probability? Are the measures to mean the same things in different contexts?

Let’s be careful to distinguish between (i) the plausibility of a claim given a specific fixed state of information, and (ii) changes in the plausibility of a claim induced by changes in the available information.

I don’t think I understand precisely what you’re asking in “Are the measures to mean the same things in different contexts?” Can you clarify, perhaps by giving me a specific example?

Corey: Sure but I took your comment to suggest (surprisingly) that you now (?) deny there’s a probabilistic measure of plausibility. If there’s an assessment of changes in plausibility, and that’s by an increase in probability (B-bump), but then it’s denied that plausibility is measured by probability, then why measure the change this way? I think I just must have misunderstood what you wrote, and since I’m traveling, I didn’t reread all the previous comments.

Mayo: It’s more like, any set of numbers (each attached to a claim) that satisfy Cox’s postulates for a measure of plausibility can be transformed by a monotonic function into a set of numbers that satisfy the probability axioms. So, on the one hand, plausibility has no intrinsic scale; arbitrary monotonic transformations leave the Cox postulates satisfied. On the other hand, the thing about the numbers that makes them representative of plausibility is just that they can be mapped to probabilities.

So we might as well do our computations on the probability scale instead of any other scale. (Actually other scales can be handy in some circumstances, e.g., odds, log-odds.) Also, it turns out (i.e., it is a theorem) that the expected frequency of a specific outcome within a collection of exchangeable Bernoulli random variables is equal to the marginal probability of that outcome for one of the random variables. This connects the probability scale in particular to an observable quantity.
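The theorem about exchangeable Bernoulli variables can be checked in one line. A short sketch, with the notation X_1, …, X_n for the variables (coded 1 for the outcome of interest) introduced here for illustration: exchangeability implies that every X_i has the same marginal distribution, and linearity of expectation does the rest.

```latex
\mathbb{E}\!\left[\frac{1}{n}\sum_{i=1}^{n} X_i\right]
  = \frac{1}{n}\sum_{i=1}^{n} \mathbb{E}[X_i]
  = \frac{1}{n}\sum_{i=1}^{n} \Pr(X_i = 1)
  = \Pr(X_1 = 1)
```

Only the equality of the marginals is used in this step, so the variables need not be independent.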

Corey: So is it correct that your view (not R. Cox’s, yours) is we can assess the plausibility of a hypothesis by means of its posterior probability assignment given data?

“the thing about the numbers that makes them representative of plausibility is just that they can be mapped to probabilities”. That might be so if it’s assumed “plausibility of H” is captured or measured by H’s probability assignment. Maybe it can serve to measure belief in H (given x), but that would be very different from measuring how well tested H is (given x).

But never mind well-testedness for now, we can stick to pinning down the probabilist’s goals. It seems that Bayesians are increasingly unwilling or unable to tell us what their probability/plausibility assignments mean, and why they want to use them for a scientific assessment of claims. It’s something they know not what, but they insist it’s what we want.

I read a post the other day by that Briggs in which he said the meaning of a claim such as “the probability of H is .8” is simply that there is evidence that the probability of H is .8.

“we can assess the plausibility of a hypothesis by means of its posterior probability assignment given data?”

Given data and (sufficient) prior information, yes.

Corey: trick answer. My question was, is that how you recommend assessing plausibility in general?

I don’t know how general “in general” is supposed to be, but I’ll go so far as to say that if you’re in a position to apply statistics at all, I recommend the Bayesian statistical approach.

Mayo: In principle, the prior distribution should encode all of the prior information at one’s disposal. This is subjective in the sense that there is inevitably a being that can be said to know the prior information in the picture. It’s objective in the sense that the prior distribution doesn’t depend on any feature of that being other than the prior information at their disposal; in particular, if two different agents have the same prior information, I consider it axiomatic that they should use the same prior distribution.

In practice, approximations are okay. In particular, if the data specify some parameter value far more precisely than the prior information does, then it usually makes a negligible difference if one uses a default prior instead of the in-principle-correct prior. Also, it’s usual to start by screening off most of the universe via implicit independence assumptions. This is often a reasonable default assumption, but it *is* just an approximation, and it’s possible that the data will end up demonstrating that it’s not a good one.

Corey: But why? (and please don’t say because so and so said decades ago that it represents rationality or the like.) Your reply completely skirts the issue of how formal probability is entering on your view of Bayesianism: Is it to get a posterior? (which is then reported as the degree of plausibility)? Or maybe just to report a Bayes ratio? And how do we obtain/interpret the components on your view?

Mayo: Because I think that the Cox postulates encode a sensible extension of classical logic. It provides a practical approach to data analysis; in particular, the way it handles the task of removing nuisance parameters from inferences makes eminent sense.

I prefer to report posterior distributions rather than Bayes factors.

I still don’t have a good sense of what it is you’re asking of me…

Corey: Just trying to figure out what kind of Bayesianism you espouse. So, now I think I’ve got it: it’s posterior probability that conveys the degree of plausibility of a hypothesis. And the priors are…subjective? conventional (default)? or a bit of both?

Corey: Regarding your toy problem: Without having gone into it in depth, if there isn’t an UMP test, you can look at all sorts of alpha-level tests of H and specify for what specific alternative they are powerful, and then say that they serve to separate H from that specific alternative.

I don’t think that the severity concept relies on optimality of tests, sufficient statistics and the like. Of course, if there is an UMP test, it would be stupid to do anything else because the UMP test will give you uniformly better severity. But the general idea of severity (as far as I understand it) implies that severity can be computed for any test and alternative, and in absence of an UMP one, one would use other “good” tests and could then analyse severity, depending on what kind of alternative is sought to be ruled out.

Christian: This is the first time I’ve heard anyone use the term “uniformly better severity”. Do you think it would have to be qualified, i.e., for the given alpha (or p-value), and given form of inference? Interesting.

Mayo: “This is the first time I’ve heard anyone use the term “uniformly better severity”. Do you think it would have to be qualified, i.e., for the given alpha (or p-value), and given form of inference? Interesting.”

There was not much more behind it than taking the “uniformly better” label over from Neyman-Pearson theory. Severity calculations work basically in the same way as calculations of type II error probabilities, with the observed value of the test statistic instead of what would be used for power calculations before observing the data, which would be the critical value at level alpha (again, I’m writing this without having spent an hour thinking about whether it’s really really true, so correct me if I’m wrong). This leads me to think that if a test is uniformly more powerful than another, severity against alternatives from the same set as used in power calculations will also always be uniformly higher (probably requiring the same monotone likelihood ratio assumption that you find in Neyman-Pearson theory; that this holds is an intuition of mine right now so don’t rely on it without checking, but I’m pretty confident).
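The parallel between power and severity calculations can be sketched for the textbook case, assuming a one-sided test of a normal mean (H0: mu <= mu0 against mu > mu0) with known sigma and fixed n. The function names and the default cutoff of 1.96 (a one-sided level of roughly 0.025) are illustrative assumptions, and the severity formula is the one usually given for the claim mu <= mu1 after a non-significant result. Both quantities are the same tail probability evaluated at mu1; only the cutoff differs.

```python
import math


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def prob_mean_exceeds(cutoff, mu1, sigma, n):
    """P(sample mean > cutoff) when the true mean is mu1."""
    se = sigma / math.sqrt(n)
    return 1.0 - normal_cdf((cutoff - mu1) / se)


def power(mu1, mu0, sigma, n, z_alpha=1.96):
    """Pre-data: the cutoff is the critical value of the sample mean."""
    critical_mean = mu0 + z_alpha * sigma / math.sqrt(n)
    return prob_mean_exceeds(critical_mean, mu1, sigma, n)


def sev_mu_at_most(mu1, xbar_obs, sigma, n):
    """Post-data severity for the claim mu <= mu1 after a non-significant
    result: the same tail probability, with the observed mean as the cutoff."""
    return prob_mean_exceeds(xbar_obs, mu1, sigma, n)


if __name__ == "__main__":
    mu0, sigma, n = 0.0, 1.0, 100
    xbar_obs = 0.1  # hypothetical observed mean, below the critical value
    for mu1 in (0.1, 0.2, 0.3):
        print(mu1, power(mu1, mu0, sigma, n), sev_mu_at_most(mu1, xbar_obs, sigma, n))
```

The type II error probability mentioned in the comment is one minus the power, so the same substitution of observed value for critical value applies there.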

Christian: But this severity measure is relevant for non-statistically significant results, e.g., to set an upper bound for discrepancies from a null hypothesis.

Does this affect what I wrote?

Christian: yes, thanks. Of course if the whole thing is so horribly specified and poorly posed, we might only be able to say there’s no way to assess sev. On those grounds alone, it’s a poor test.

Mayo: If there were no way of assessing SEV in my toy model, I would actually view that as a crushing indictment of SEV! But that’s almost certainly not the case.

Corey: How can it be that there could be no way of assessing SEV in your toy model?

Given a triplet of test statistic, result, and specific alternative, I’d think that SEV is always defined (Mayo, correct me if I’m wrong). So you can take *any* test and get a SEV. It may be impossible to compute it for computational reasons but that’s not really the problem of the philosopher/the concept.

Christian: I’m aiming for a SEV analysis as “canonical” as the SEV analysis for the normal model with known variance and fixed sample size. But no matter how I formulate my model and my test procedure, one could *always* respond, sorry, that’s horribly specified and poorly posed, so that’s not a representative of a SEV analysis. (Of course such a one would have far less intellectual integrity than Mayo.)

I offered a (contrived) model to Spanos a month ago and asked for a “distance measure” (something Spanos has written is a natural mumble mumble statistical model {wave hands} therefore severity) and for the severity analysis, but got no response. In that case, I do concede that the model is horribly specified, and the lack of a nice severity analysis demonstrates very little. I won’t concede that about my toy optional stopping model, though.

Corey: Scholarly advances need to proceed by means of the usual channels, that is, you can’t take not engaging in back and forth blog comments as indicating anything–Spanos very rarely deals in blogs. (No arguments from ignorance.)

“In principle, the prior distribution should encode all of the prior information at one’s disposal.” Since the kind of background repertoire the error statistician needs and uses so rarely comes in by way of a prior probability distribution (unless it’s an ordinary frequentist one, as in screening*), it’s hard to see how it’s encoding all of the prior information at one’s disposal. Background would involve flaws and fallacies in the kind of inference, assumptions of models, theories of instruments, phenomena, links between the substantive and scientific questions, knowledge of biases in the data collection, modeling, and interpretation, etc.

* I distinguish screening from inference.

“Scholarly advances need to proceed by means of the usual channels, that is, you can’t take not engaging in back and forth blog comments as indicating anything–Spanos very rarely deals in blogs. (No arguments from ignorance.)”

I accept that in this case I “can’t take [Spanos’s] not engaging in back and forth blog comments as indicating anything”. That said, I deny that “scholarly advances need to proceed by means of the usual channels”. That’s only true to the extent that scholars demand it. See Polymath Project for a recent counter-example.

Corey: Let me revise that to “you can’t expect it” and regardless, an argument from ignorance is still fallacious. I don’t even know what your example was all about—but I’m scrambling to figure out what to teach next week.

Christian: Yup, and I’ve run through the N-P exercise of maximizing power for a specific alternative while holding alpha fixed. But I’ll be attempting to compare SEV and posterior probability in a way as analogous as possible to the comparison one gets in the normal model, known variance, fixed sample size. In that comparison, SEV and Bayes with a flat prior are numerically equal. I want to isolate the effect of optional stopping on just that comparison.
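The numerical agreement in the fixed-sample normal case is easy to verify. A minimal sketch, assuming known sigma, fixed n, the usual severity formula for the claim mu > mu1, and a flat (improper uniform) prior on mu, under which the posterior is N(xbar_obs, sigma^2 / n). The function names and the example numbers are illustrative.

```python
import math


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def sev_mu_greater(mu1, xbar_obs, sigma, n):
    """SEV(mu > mu1): P(sample mean <= xbar_obs), computed at mu = mu1."""
    se = sigma / math.sqrt(n)
    return normal_cdf((xbar_obs - mu1) / se)


def flat_prior_posterior_mu_greater(mu1, xbar_obs, sigma, n):
    """P(mu > mu1 | data) when the posterior is N(xbar_obs, sigma^2 / n)."""
    se = sigma / math.sqrt(n)
    return 1.0 - normal_cdf((mu1 - xbar_obs) / se)


if __name__ == "__main__":
    xbar_obs, sigma, n = 0.4, 1.0, 25  # hypothetical data summary
    for mu1 in (0.0, 0.2, 0.4, 0.6):
        print(mu1,
              sev_mu_greater(mu1, xbar_obs, sigma, n),
              flat_prior_posterior_mu_greater(mu1, xbar_obs, sigma, n))
```

The two columns agree because of the symmetry of the normal distribution; the interpretations of the numbers remain different, as noted earlier in the thread.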

Corey: Johnson gets a mismatch between p-values and posteriors with a one-sided test as well (given his priors).

I warn you against taking a rejection as evidence for an alternative against which the test has high power. That would be to infer claims that have passed with low severity. This gets to an error in all of the “% true nulls in urns” exercises we keep hearing about.

By the way,

“I warn you against taking a rejection as evidence for an alternative against which the test has high power.”

Noted.

In the severity analysis of a normal mean (with known variance and fixed sample size), all of the details of the test procedure other than the choice of test statistic are irrelevant. In my SEV analysis for my optional stopping toy model I want to get as close to this as possible.

Corey: This should be under your comment:

“I know of no basis for distinguishing a scale for a *numerical measure* of “confirmation”. Direction, yes; magnitude, no.”

If Saul were looking for a numerical measure of confirmation, warrant, or how well-tested P is, given data, you’d tell Saul not to expect to find such a measure in a posterior probability in P. The most would be a measure of increased or decreased confirmation.

OK, but how do you know you’ve got a greater degree of confirmation in P, if you don’t know how to measure confirmation?

I don’t understand the question.

Anonymous: I reread the question, and I think I have a better sense of what you’re asking.

Suppose I’m buying a car, and the dealer quotes prices for three cars: the Toyonda Truck (T), the Jupiter Jalopy (J), and the Mazdades Muscle-Car (M). The wrinkle is that the dealer quotes prices not in dollars but in some unknown monotonic strictly increasing function of dollars. Let’s call the unit for these bizarre prices bizollars.

With prices quoted in bizollars, there’s a lot we don’t know — we don’t even know if 0 bizollars maps to 0 dollars. Perhaps the dealer is secretly offering to pay us to take a car off his hands! But there are a few things we do know:

– we know the *order* of prices in dollars, since that’s the same as the order of prices in bizollars;

– we know that if two cars have the same price in bizollars, then they have the same price in dollars.

Suppose the difference in bizollar price (M – J) is much much bigger than the bizollar price difference (T – J) (and both differences are positive). Right away we know that M costs more than T, and both cost more than J, in dollars. But can we make any quantitative statements about the *dollar price differences*? No, we can’t; the monotonic strictly increasing function could increase at any non-zero rate in the interval [J, T], and likewise for the interval [T, M], so information about bizollar price differences does not convey information about dollar price differences.
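The bizollar point can be made concrete with two made-up conversions. In this sketch the quotes and both conversion functions are hypothetical; each conversion is strictly increasing, so each is a candidate for what the dealer might be using. They agree on the order of the prices but disagree about which dollar gap is larger.

```python
# Hypothetical bizollar quotes: (M - J) is much bigger than (T - J) > 0.
bizollars = {"J": 1.0, "T": 2.0, "M": 10.0}


def linear_dollars(b):
    """One strictly increasing conversion: 1000 dollars per bizollar."""
    return 1000.0 * b


def kinked_dollars(b):
    """Another strictly increasing conversion: steep up to 2 bizollars,
    then nearly flat (but still rising) above that."""
    if b <= 2.0:
        return 10000.0 * b
    return 20000.0 + 10.0 * (b - 2.0)


def ranking(convert):
    """Cars ordered from cheapest to most expensive in dollars."""
    return sorted(bizollars, key=lambda car: convert(bizollars[car]))


def dollar_gaps(convert):
    """Dollar differences (T - J, M - T) under a given conversion."""
    j, t, m = (convert(bizollars[car]) for car in ("J", "T", "M"))
    return t - j, m - t


if __name__ == "__main__":
    for convert in (linear_dollars, kinked_dollars):
        print(convert.__name__, ranking(convert), dollar_gaps(convert))
```

Under the linear conversion the gap from T to M is the larger one; under the kinked conversion the gap from J to T is larger, even though the bizollar quotes are identical.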

Using changes in probability of a claim (or changes in any other scale of Cox-plausibilities) to talk about changes in support (or confirmation or what have you) is analogous to using bizollar price differences to talk about dollar price differences: the meaningful statements about changes in support correspond to valid statements about dollar price differences.

Corey: Without any kind of calibration, these are not scientific measurements. Saul is incapable of judging if he can afford the car, if it’s a gigantic ripoff, or a steal.

Worse than bitcoin!

Corey: I thought probability, for you, measures plausibility. H may get a B-bump but still have very low probability, not to mention be very poorly probed. How do you convey that the inference to H was unreliable, ad hoc, or insevere?

Mayo: More precisely, every system of plausibility assignments that has properties I deem desirable is isomorphic to probability. I’m currently content to simplify/summarize that statement as “probability, for me, is a formalization of the notion of plausibility”; if there’s something misleading about that, I’d be grateful to hear about it.

What, precisely, do you mean by “inference to H”? Once I report that H got a B-bump but still has very low posterior probability, have I performed what you call “inference to H”?

When is an inference to H warranted? when not? And how do you capture the slew of problems of bias (as used in today’s familiar criticisms, e.g., due to data-dependent subgroups, endpoints, stopping rules, and various ad hockeries)? For example, you give a B-boost to the conjunction (GTR and Fukushima water is safe) given the GTR data x (since x “confirms” GTR). I’d say x provides minimal severity and “bad evidence, no test” (BENT) to the conjunction. I’m on a bus: readers can look back at the “tacking paradox” post to which we recently alluded.

Mayo: Sampling plans (in which I include endpoints and stopping rules) are dealt with along lines given in Chapter 8 of BDA3. By data-dependent subgroups, do you mean data-dredging to find subgroups in which effects “look” large? If so, I deal with that by treating subgroups as exchangeable (instead of independent) in the prior distribution; this leads to Gelman-style multilevel models and shrinkage to the overall mean, mitigating the problem of estimates that have large magnitudes due to chance.

As you say, data x that gives GTR a B-boost but has no bearing on — and therefore neither B-boosts nor B-discredits — the claim “Fukushima water is safe” gives their conjunction a B-boost. But what more do I need to do than note that the data has no bearing — and therefore neither B-boosts nor B-discredits — the claim “Fukushima water is safe”?
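The B-boost for the conjunction can be written out in one line. A sketch, writing J for “Fukushima water is safe” and reading “x has no bearing on J” as the assumption P(J | GTR, x) = P(J | GTR) (that reading is an assumption made here for illustration):

```latex
\frac{P(\mathrm{GTR} \wedge J \mid x)}{P(\mathrm{GTR} \wedge J)}
  = \frac{P(\mathrm{GTR} \mid x)\, P(J \mid \mathrm{GTR}, x)}
         {P(\mathrm{GTR})\, P(J \mid \mathrm{GTR})}
  = \frac{P(\mathrm{GTR} \mid x)}{P(\mathrm{GTR})} > 1
```

So the conjunction inherits exactly the boost factor that x gives GTR alone, with J contributing nothing to the ratio. Whether that inherited boost deserves to be called evidence for the conjunction is the point under dispute.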

Corey: Jumping over hoops to (perhaps) get a result–but without the clear, direct and simple rationale from error statistics–scarcely recommends the approach. If Gelman is serious in viewing himself as accepting error statistical reasoning (as in Gelman and Shalizi), then he ought to be happy to apply it directly, when warranted. The test’s capability to have discerned errors has gone way down–and no priors are involved. We actually might have a high prior in the ad hoc hypothesis, and yet we still want to say it was poorly tested by this means.

On the B-boost of irrelevant conjunctions, the exact same problem results here. In real life, we do not say, “Data x gives me evidence in support of both GTR and Fukushima water is safe”, when we know the second is an irrelevant conjunct (and of course x is also evidence of “GTR and Fukushima water is not safe”). The regulatory commissions will have a field day.

And recall the other consequence of B-boost confirmation, e.g., x can confirm AB, even though it confirms neither A nor B.
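That this pattern is arithmetically possible can be checked with a small made-up joint distribution over A, B, and the data x. The weights below are hypothetical, chosen only so that the numbers come out cleanly, and the helper `prob` is an illustrative name. The sketch shows only that the pattern can occur; it does not settle whether it is a problem for the B-boost view.

```python
from fractions import Fraction

# Hypothetical joint weights for (A, B, x), in units of 1/80.
joint = {
    (True, True, True): 4,
    (True, False, True): 5,
    (False, True, True): 5,
    (False, False, True): 6,
    (True, True, False): 4,
    (True, False, False): 27,
    (False, True, False): 27,
    (False, False, False): 2,
}


def prob(event, given=lambda a, b, x: True):
    """Exact conditional probability of `event` given `given`."""
    numerator = sum(w for k, w in joint.items() if event(*k) and given(*k))
    denominator = sum(w for k, w in joint.items() if given(*k))
    return Fraction(numerator, denominator)


def is_a(a, b, x):
    return a


def is_b(a, b, x):
    return b


def is_a_and_b(a, b, x):
    return a and b


def saw_x(a, b, x):
    return x


if __name__ == "__main__":
    for name, event in (("A", is_a), ("B", is_b), ("A and B", is_a_and_b)):
        print(name, "prior:", prob(event), "posterior:", prob(event, saw_x))
```

In this table the conjunction goes up on x while each conjunct on its own goes slightly down.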

Anyway, just to end off where we began: conflicts with severity are so blatantly available with these gambits that you needn’t search for anything exotic. Thanks for the exchange.

Mayo: The above comment has a few points that I deny and a few that are unclear to me. I accept that you want to conclude our exchange, but let me just record for posterity:

– “…scarcely recommends the approach.” …To you.

– “The test’s capability to have discerned errors has gone way down...” What test?

– “Data x gives me evidence in support of both…” Careful: that usage of “both” could be misread very easily.

– No, we don’t usually attach a whole bunch of irrelevant conjuncts. I just claim that if you insist on attaching an irrelevant conjunct, you won’t get nonsense, provided you’re clear on the usage of “confirms” in the sense of “makes more firm” (i.e., increases the plausibility) and not “makes firm” (i.e., makes plausible enough to act on).

– On confirming A and B, but not either of A or B considered individually: I’d have challenged you to give a concrete example in which this yields nonsense.

Thanks to you as well, Mayo!

Anonymous: As I wrote above, “it turns out (i.e., it is a theorem) that the expected frequency of a specific outcome within a collection of exchangeable Bernoulli random variables is equal to the marginal probability of that outcome for one of the random variables”. So this singles out the probability scale from all possible Cox-probability scales as the one relevant to an observable quantity, but you need the extra assumption of exchangeability to get there. This exchange I had with Christian Hennig is also relevant.

(Should be “Cox-plausibility scales”, not “Cox-probability scales”.)