This short paper, together with the response to comments by Casella and McCoy, may provide an OK overview of some issues and ideas, and since I’m making it available for my upcoming PH500 seminar*, I thought I’d post it too. The paper itself was a 15-minute presentation at the Ecological Society of America in 1998; my response to criticisms, around the same length, was requested much later. The time lag shows in some ways (e.g., McCoy’s reference to “reductionist” accounts, part of the popular constructive leanings of the time, and the scant mention of Bayesian developments taking place around then), but it is simple, short, and non-technical**. Also, as I should hope, my own views have gone considerably beyond what I wrote then.

(Taper and Lele did an excellent job with this volume, as long as it took, particularly interspersing the commentary. I recommend it!***)

Mayo, D. (2004). “An Error-Statistical Philosophy of Evidence” in M. Taper and S. Lele (eds.) *The Nature of Scientific Evidence: Statistical, Philosophical and Empirical Considerations*. Chicago: University of Chicago Press: 79-118 (with discussion).

ABSTRACT: Despite the widespread use of error-statistical methods in science, these methods have been the subject of enormous criticism, giving rise to the popular “statistical reform” movement and bolstering subjective Bayesian philosophy of science. Given the new emphasis of philosophers of science on scientific practice, it is surprising to find they are rarely called upon to shed light on the large literature now arising from debates about these reforms, debates that are so often philosophical. I have long proposed reinterpreting standard statistical tests as tools for obtaining experimental knowledge. In my account of testing, data *x* are evidence for a hypothesis *H* to the extent that *H* passes a severe test with *x*. The familiar statistical hypotheses as I see them serve to ask questions about the presence of key errors: mistaking real effects for chance, or mistakes about parameter values, causes, and experimental assumptions. An experimental result is a good indication that an error is absent if there is a very high probability that the error would have been detected if it existed, and yet it was not detected. These results provide a good (poor) indication of a hypothesis *H* to the extent that *H* passes a test with high (low) severity. Tests with low error probabilities are justified by the corresponding reasoning for hypotheses that pass severe tests.

*PH500 Contemporary Philosophy of Statistics: As a visitor of the Centre for Philosophy of Natural and Social Science (CPNSS) at the London School of Economics and Political Science, I am planning to lead 5 seminars in the Department of Philosophy, Logic, and Scientific Method this summer (2) and autumn (3) on Contemporary Philosophy of Statistics under the PH500 rubric (listed under summer term). This will be rather informal, based on the book I am writing with this name. There will be at least one guest seminar leader in the fall. Anyone interested in attending or finding out more may write to me: error@vt.edu.

Wednesday 6th June 3-5pm T206

Wednesday 13th June 3-5pm T206

Autumn term dates: To Be Announced

**I’ve heard it referred to as “Mayo Lite”.

***Never mind that some of the ecologists are or were somewhat under the spell of likelihoodist Richard Royall. Royall told me that he would have preferred having influence over a less messy field, but he got the ecologists (something like that). Personally, I rather like ecologists.

Mayo,

First, it seems to me that Casella’s comment was very odd. His view of NP testing is the textbook view, not the one that you have presented.

But maybe the vagueness that he refers to, and does not explain, is that it is not obvious how to apply severity assessment in cases more complex than the test of a mean in a normal distribution.

For example, there are situations in which the p-value for the hypothesis

H*: mean1=mean2

is 0,3% but the p-value for the hypothesis

H: mean1=zero=mean2

is 10%.

So, even if we do not take an E-R approach, it’s very puzzling to say that the hypothesis not-H* has passed a severe test but the hypothesis not-H has not, because if not-H* is valid, then not-H must be valid too.

Do you have any other practical examples of severity assessment, beyond the test of a simple hypothesis about the mean?

Thanks

Carlos

Carlos: SEV evaluations can be made for any N-P or Fisherian test, but also outside of statistics altogether. The one-sided/two-sided difference, while slight, is not puzzling at all for one using error probabilities to express probativeness. (It is puzzling for an E-R theorist, and recall the huge difference between one-sided and two-sided testing for a Bayesian.)

Carlos: I’ve discussed this already; not odd in the least to distinguish capabilities for error and error probing capacity. Sorry to lack time to repeat just now. But I’m sure it will come around again, else write back…

Ok, Mayo, I’m also in a hurry right now, so I’ll try to think more about it and elaborate the question better later!

But the problem I was referring to is the one Alexandre has written out below: testing the equality of the means of a bivariate normal distribution. Thanks, Carlos.

Dear Mayo,

Consider X1, …, Xn i.i.d. random vectors sampled from a bivariate normal distribution with mean mu = (mu1, mu2)’ and variance-covariance matrix I (identity).

Could you derive a severity procedure for testing

“H: mu1 = mu2 against A: mu1 \neq mu2”

and another for testing

“H’: (mu1, mu2) = (0,0) against A’: (mu1, mu2) \neq (0,0)” ?

OBS: “\neq” means “different”.

I am looking forward to hearing from you.

With best regards,

Alexandre.

Alexandre: Can’t read this, sorry. Try writing it out. Mayo

Dear Mayo,

It seems that you always present examples in a uniparametric context (where it is possible to have UMP tests). However, sometimes we must work in a multi-parametric framework (and UMP tests are no longer available). For instance, (1) to test whether the averages of two (or more) different populations are equal; (2) to test whether the averages of these two (or more) populations equal zero.

To simplify the problem, suppose that the two populations are normally distributed with variance 1, and assume we observe independent and identically distributed random samples of size “n” from both populations.

I want to derive a severity procedure for testing:

(1) H: “the averages of the two populations are equal (i.e., mu1 = mu2)”;

(2) H’: “the averages of the two populations equal zero (i.e., mu1=mu2=0)”.

Thanks,

Alexandre.

N-P tests certainly are not limited to UMP tests! Nor would I mandate a particular test to be used for a particular context. For a general discussion with nuisance parameters that is also informal, the paper that comes to mind is Cox and Mayo in ERROR AND INFERENCE (Mayo and Spanos 2010). Mayo and Cox (in the same volume) has a typology of nulls. Distinct from that is the one-sided/two-sided issue, and selection effects. Still can’t read your example. I see I can read one of your examples now.

Severity does not attach to a procedure, method or test. It requires considering a specific inference, and particular error to be ruled out by that inference (relevant alternatives), as well as the test, data, and model. My construal of tests uses p-values (if and when they are relevant) but is certainly not equal to them. Perhaps you should identify what you think the problem is….

I know your time is too short to discuss an apparently unimportant topic. However, this subject is virtually the core of the logical flaws of p-values and some related frequentist measures of “evidence”.

My question was posed to start a discussion about “logical flaws” in frequentist measures of evidence.

I had tried to explain those “logical flaws”, but you have not answered me yet. I’m trying a different strategy. If you derive your severity procedure for these two hypotheses, maybe I can show you what I mean.

I hope you will engage in this discussion.

Mayo,

I know that your method is not limited to a particular context. But I only see cases in which just one parameter is being tested (even when there are nuisance parameters, the null hypothesis specifies only one parameter. Maybe you have worked with more general hypotheses, which specify more than one parameter, but I haven’t seen them yet). The examples I provided here involve two (or more) parameters being tested in the null hypothesis (this has little or nothing to do with nuisance parameters).

If you cannot read my examples, I think it will be very difficult to have a discussion, since they are the simplest ones I can think of right now.

I can compute the p-values for these two examples via the generalised likelihood ratio statistic. I am not going to write it down here, since this requires special notation. If you want to understand these examples, I refer you to Example 1.1 of my paper “A classical measure of evidence for general null hypotheses”, which is on arXiv.org.

PS: This paper is being enlarged and revised (I’m inserting a new section connecting my evidence measure with the abstract belief calculus.)

As somebody who has applied the post-data severity evaluation extensively in several published and unpublished papers, I find some of the above comments/requests by Carlos and Alexandre rather puzzling. The severity assessment takes whatever frequentist test one has used to derive the accept/reject and p-value results and performs a post-data evaluation to establish the discrepancy from the null warranted by the data in question. In this sense, one needs a lot more information, besides some arbitrary hypothesis and a p-value, before such an evaluation can be performed. This includes a test statistic, a rejection region, a significance level, the estimated parameters involved in the test statistic, its observed value, and the sample size, as well as the distribution of the test statistic under both the null and alternatives in a usable form.

In light of that, I’m totally puzzled by comments like:

(a) how to evaluate the severity for hypotheses:

H*: mean1=mean2

given that the p-value is 0,3%, but the p-value for the hypothesis

H: mean1=zero=mean2, is 10%.

(b) I want to derive a severity procedure for testing:

H: “the averages of the two populations are equal (i.e., mu1 = mu2)”;

H’: “the averages of the two populations equal zero (i.e., mu1=mu2=0)”.

In both cases one needs all the above information I just described, including an explicit test statistic which frames the question one wants to pose. There is nothing difficult about post-data severity evaluations relating to tests of joint (multiple) hypotheses, but one needs to spell out the relevant test statistic and the rest of the information needed.

For instance, in testing the hypotheses:

H0: mu1=mu2, vs. H1: mu1 diff. mu2,

one can use the test statistic based on the standardized difference between the two estimators, or define the data as X-Y and test whether the mean of X-Y is zero or not using the standardized form of the latter mean.

Similarly, for testing H: mu1=mu2=0

one can frame it as two joint hypotheses, mu1=0 and mu2=0, by using an F-type test; extensively used in regression analysis. It should be noted that other test statistics can be devised that combine these hypotheses into one by taking a linear combination of the underlying variables.

In each case one is posing a somewhat different question, but the bottom line is that no severity evaluation can be performed on arbitrary hypotheses without all the details associated with a frequentist test mentioned above. For each such test one can evaluate the post-data severity of the accept/reject or p-value results. The practical side of such an evaluation is no different from evaluating the power of a frequentist test for different discrepancies from the null. Hence, if you have ever done the latter, it will be trivial to adapt your calculations and thresholds to do the severity assessment.
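As an editorial illustration of the kind of calculation described above, here is a minimal sketch under stated assumptions: the two populations are independent normals with variance 1 (so the difference of the sample means has variance 2/n), the test is one-sided (H0: delta <= 0 vs. H1: delta > 0, with delta = mu1 - mu2, rather than the two-sided version in the comment), and the severity of the claim "delta > gamma" after a rejection is taken to be the probability of a difference no larger than the observed one if delta were exactly gamma. The function names and the numbers are hypothetical and are not taken from any of the papers mentioned.

```python
import math

def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def severity_greater_than(gamma, dbar_obs, n, var_diff=2.0):
    """Severity for the claim 'delta > gamma' after a one-sided rejection.

    Computed as P(Dbar <= dbar_obs; delta = gamma), where Dbar is the
    difference of the two sample means and var_diff / n is its variance
    (2/n for two independent unit-variance populations; an assumption).
    """
    standard_error = math.sqrt(var_diff / n)
    return normal_cdf((dbar_obs - gamma) / standard_error)

# Hypothetical data: n = 100 per population, observed difference 0.3.
n = 100
dbar_obs = 0.3
for gamma in (0.0, 0.1, 0.2, 0.3, 0.4):
    sev = severity_greater_than(gamma, dbar_obs, n)
    print(f"SEV(delta > {gamma:.1f}) = {sev:.3f}")
```

The loop has the same shape as a power calculation over a grid of discrepancies, which is the point made in the comment: severity is high for small discrepancies, falls to one half at the observed difference, and is low for discrepancies larger than what was observed.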

Alexandre: Out of curiosity I had a look at your paper and the example as detailed there. My impression is that these examples are not problematic for the severity approach, in the sense that I don’t think that any severity would be attached to the apparently conflicting conclusions about the null hypothesis (one would need to specify alternatives in order to evaluate this and I don’t have the time to do that). The information you posted in the discussion here is actually not enough to understand this, but in the paper (I’m writing this for Mayo and anyone who doesn’t know it) you actually define a procedure and give example data, so severity could be evaluated given an alternative.

I don’t comment on whether your proposed measure is as good as or even better than severity for quantifying evidence in a purely frequentist way, but it looks worth a thought at least.

Thanks so much Aris and Christian: I won’t be able to study any of this for a few days–traveling.

I’m trying to reply, but the blog does not allow me. Let’s see if now it goes…

Aris Spanos,

What you have said is just what I wanted to know: the whole procedure for applying the post-data severity evaluation. You need an explicit test statistic, p-values, specific hypotheses and so on, OK… Those are the things I wanted to know. Thanks. I started asking about severity just to gather information about the process and move the discussion on.

I invite you to read the examples of my paper. There are numerical examples, statistics, p-values and null hypotheses to be tested. They are very simple.

The issue here is the one I have been raising on this blog, but nobody has yet replied. Let me put it again.

Those hypotheses in your item (b), H and H’, have an interesting relation, namely: if H is false, then by logical reasoning H’ must also be false. OK?

However, I showed one example in that paper in which the p-value (0.03) computed under H is considerably smaller than the p-value (0.10) computed under H’ when using the very same data. Naturally, this behavior does not happen all the time.

What is this? It seems to be a logical problem, since at a 5% significance level we have evidence that H is false, but at the same level of significance we don’t have evidence to claim that H’ is also false. Of course there is an explanation for this fact that has to do with the null distribution of the statistics (and I showed it in the paper).

This space is not so pleasant for writing mathematical stuff… Sorry for any confusion.
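For readers who want to see the phenomenon in numbers, here is a minimal editorial sketch, assuming the bivariate normal setup with identity covariance described earlier in the thread and the likelihood ratio statistics for each null: n(xbar1 - xbar2)^2 / 2, which is chi-square with 1 degree of freedom under H, and n(xbar1^2 + xbar2^2), which is chi-square with 2 degrees of freedom under H’. The sample size and sample means below are hypothetical, chosen only to reproduce the pattern; they are not the data from Alexandre’s paper.

```python
import math

def p_value_equal_means(xbar1, xbar2, n):
    """p-value for H: mu1 = mu2, with identity covariance assumed known.

    The likelihood ratio statistic n * (xbar1 - xbar2)**2 / 2 is
    chi-square with 1 degree of freedom under H.
    """
    t = n * (xbar1 - xbar2) ** 2 / 2.0
    # Survival function of chi-square(1): P(chi2_1 > t) = erfc(sqrt(t / 2)).
    return math.erfc(math.sqrt(t / 2.0))

def p_value_both_zero(xbar1, xbar2, n):
    """p-value for H': mu1 = mu2 = 0, with identity covariance assumed known.

    The likelihood ratio statistic n * (xbar1**2 + xbar2**2) is
    chi-square with 2 degrees of freedom under H'.
    """
    t = n * (xbar1 ** 2 + xbar2 ** 2)
    # Survival function of chi-square(2): P(chi2_2 > t) = exp(-t / 2).
    return math.exp(-t / 2.0)

# Hypothetical sample means, symmetric about zero.
n = 100
xbar1, xbar2 = 0.15, -0.15
p_H = p_value_equal_means(xbar1, xbar2, n)
p_Hprime = p_value_both_zero(xbar1, xbar2, n)
print(f"p-value for H  (mu1 = mu2):     {p_H:.4f}")
print(f"p-value for H' (mu1 = mu2 = 0): {p_Hprime:.4f}")
```

With these inputs both statistics take the same value, but it is referred to chi-square distributions with different degrees of freedom, so the p-value for H falls below 0.05 (about 0.03) while the p-value for H’ does not (about 0.11). This is the "null distribution" explanation mentioned in the comment.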

Alexandre: My earlier reply today seems to have disappeared, but as I responded long ago, SEV does not obey the so-called “consequence” condition, and one needn’t go beyond one- vs two-sided tests, or selection effects (e.g., hunting for statistical significance), to see that. I think this is entirely defensible for an account that is capturing the capability of a test to have avoided an error in question. I had hoped you’d devour the published explanations, which are far more satisfactory than these comments (e.g., chapters 9 and 10 of EGEK), and return. On the other hand, as I also pointed out way back, the common Bayesian account distinguishes much more radically (inferences from) one- and two-sided tests (as discussed in recent posts), and there the difference does seem problematic because it is intended as a support or E-R measure. Your response was that you didn’t understand, so I dropped it. In airport, can’t vouch for reliability of connection.

Mayo: thanks for your response.

I haven’t read all the texts about severity, since I also have other things to do and my time is very short. I hoped to find here a clear pointer towards the problem I’m working on. I’ve heard that post-data severity evaluation might correct the “logical problems” of p-values for nested hypotheses. But since, as you point out above, “SEV does not obey the so-called “consequence” condition”, I must conclude that SEV cannot correct those “logical problems”, and maybe p-values definitively cannot be used to say whether one subset of Theta is more plausible than another.

My proposal does not need prior information and is frequentist (i.e., we are on the same team). However, contrary to p-values, my proposal is indeed a plausibility/possibilistic measure on Theta, so we can use it to provide an objective state of belief on the subsets of Theta.

Alexandre: My first chance to respond. The “inconsistency” you think exists assumes an account that is at odds with what I seek, and with what is sought by those who are looking for an assessment of well-testedness. SEV is a measure of how good a job a test has done in ruling out specific errors (or not). To put it informally: ruling out 2 errors, say, at a given level takes more than ruling out 1 error at the same level. Once tests are made out this way, with the conclusion that either (i) there is or (ii) there is not evidence that a specific error is absent/present, it turns out that the correct logical entailments are precisely the ones given by the SEV entailments! So if you try to reverse things, you will fail to capture the logic of well-testedness or corroboration.