The journey to San Francisco was smooth sailing with no plane delays; within two hours of landing I found myself in the E.R. of St. Francis Hospital (with the philosopher of science Ronald Giere), unable to walk. I have just described an unexpected, “anomalous”, highly unusual event, but no one would suppose it was anomalous FOR, i.e., evidence against, some theory, say, in molecular biology. Yet I am getting e-mails (from readers) saying, in effect, that since the improbable coin toss result is very unexpected/anomalous in its own right, it therefore is anomalous for any and all theories, which is patently absurd. What had happened, in case you want to know, is that just as I lunged forward to grab my (bulging) suitcase off the airline baggage thingy, out of the corner of my eye I saw my computer bag being pulled away by someone on my left, and as I simultaneously yanked it back, I tumbled over (very gently, it seemed), twisting my knee in a funny way. To my surprise/alarm, much as I tried, I could put no weight on my right leg without succumbing to a Geppetto-puppet-like collapse. The event, of course, could rightly be regarded as anomalous for hypotheses about my invulnerability to such mishaps, because it runs counter to them. I will assume this issue is now settled for our discussions, yes?

# A Highly Anomalous Event

Categories: Statistics
Tags: anomalous, coin toss, error statistical philosophy, Ronald Giere, significance tests
29 Comments

I’m trying to understand the philosophical arguments in terms of statistical notation. The unfair coin could even be thought of as a legitimate alpha-level statistical hypothesis test for H: E(X) = 0, where X denotes some financial asset return, because it can make sense. However, it definitely is not one for H: E(Y) = 0, where Y denotes the number of words written in this comment, because it simply makes no sense. Did I understand it right?

RP: I’m not sure I understand the question, sorry.

I’m sorry that I didn’t make myself clear enough. Although late, I’ll try again!

The discussion seems to be about what counts as a legitimate test. Regarding this, if you take a formal text like Erich Lehmann’s Testing Statistical Hypotheses, the problem is first stated in terms of a parameter space that is split into two mutually exclusive parts, each one giving rise to one of two complementary hypotheses. Thus, I am assuming that a test cannot be a legitimate one if a value for its test statistic can be assigned by points outside the parameter space. In other words, a legitimate test must consider only test statistics that can ONLY have values assigned by hypotheses derived from the (fixed) parameter space.

In my examples above, the first test is legitimate because E(X) = 0 is inside the parameter space of ‘possible values for financial returns’. However, the second one is not legitimate because E(Y) = 0 is outside the parameter space of ‘possible values for number of words written in that comment’, which, clearly must be greater than zero.

Jay’s example considers a test statistic whose values can be assigned by ANY parameter space. This means that the particular parameter space of a specific problem doesn’t matter. Using the notation of Bill Jeffreys above, since the statistic is independent of the hypotheses, one could write P(S|H_0) = P(S|H_1) = P(S|H_{-1}) = P(S), where H_{-1} means ‘everything that is outside the parameter space even if it doesn’t make sense’. This is insane!

My question is: is this explanation compatible with yours? If not, what are the divergences?

Best wishes with your recovery.

I’m sure we all wish you a speedy recovery.

Thank you; it is annoyingly slow-going.

Naive guy again. It seems obvious to me that, for the coin example from the Uncertainty book, if our hypothesis is that the frequency for heads is 0.05, then we would set up an experiment in which the null hypothesis was that repeated tosses will demonstrate a frequency of heads close to 0.05, with how close determined by sample size (number of tosses), etc. We test for significance of a deviation from 0.05. Thus, the one toss was inconclusive. My question is (and I am interested in moving beyond the coins): how does a book published in 2011 provide such a misleading example? Does the view exemplified by the example represent the thinking of Bayesian proponents? I am intensely interested in understanding what appears to me as flawed reasoning that is gaining support (like a clothing fashion, kinda like Chinese foot-binding, but a fashion). I just attended a forensic conference where a colleague complained to me that in her home country, they are required now to report the Bayesian posterior probability for every identification of human remains. Yet, they have no sound basis for developing priors. They also suffer from loose thinking about likelihoods. The effect showed itself recently when the DNA lab reported out that a set of remains had a probability of >99.0 of being a certain young man. The remains turned out to be female, probably his sister who also was missing. They had some difficulty straightening it out. Sloppy use of Bayes’ model will make a big mess, I think. Scary. Is this where science is going? Is it inevitable?
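To make the repeated-toss idea in the comment above concrete, here is a minimal sketch of an exact one-sided binomial test of the null "frequency of heads is 0.05". The function name and the toss counts are hypothetical choices for illustration only; nothing here comes from the book under discussion.

```python
from math import comb


def upper_tail_p_value(n_tosses, n_heads, p_null=0.05):
    """Return P(X >= n_heads) for X ~ Binomial(n_tosses, p_null).

    This is the exact one-sided p-value for observing at least
    n_heads heads when the null frequency of heads is p_null.
    """
    return sum(
        comb(n_tosses, k) * p_null ** k * (1 - p_null) ** (n_tosses - k)
        for k in range(n_heads, n_tosses + 1)
    )


# A single toss (hypothetical data): one head in one toss.
print(upper_tail_p_value(1, 1))

# Many tosses (hypothetical data): 12 heads in 100 tosses.
print(upper_tail_p_value(100, 12))
```

With a single toss the p-value can only ever be 0.05 (a head) or 1 (no head), which is one way of seeing why one toss says so little about the coin itself.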

This example is not about sample size, but about proposing a test statistic which ignores completely the substantial part of the data.

The example Mayo cites I would agree is peculiar in that nobody would ever do it… The point of interest is whether such a perverse test is permissible under the definitions, but I fully agree it does not criticise non-Bayesian statistics as it is actually practiced…

I would strongly defend this book as being an impressive piece of scholarship. The criticism in the brief final chapter includes discussion of many other problems in non-Bayesian statistics as it is actually practiced… (which I agree is a problem with this example)

This is not to say that you didn’t encounter some very strange Bayesian analysis in your work!

Granted this is one of the more extreme examples, but it brings out certain fundamental misunderstandings of the requirements of significance tests, of both Fisherian and N-P varieties.

“How does a book published in 2011 provide such a misleading example? ” I have the same question, and Kadane is certainly a high priest of subjective Bayesianism. It is to his credit that he hasn’t tried to weasel out of the consequences of the account. But this and other criticisms of frequentist tests don’t hold up. I just started with an extreme example to make the broader philosophical points about scientific testing.

I truly think that it will go that way if people don’t do something about it. That, of course, is my mission. But it requires a certain amount of care, understanding of logic, and intellectual honesty. From critics, it will require avoidance of just giving the same knee-jerk reactions again and again and again. A good example is in dealing with the relevance of error probabilities. Many Bayesians just say they are irrelevant. But, guess what? That’s not an argument. We frequentists give arguments explaining their relevance (to appraising the inference at hand), and the onus on a denier must be to counter those arguments. I will address this soon in considering C. Roberts on Error and Inference. I appreciate his posts, but I hope to get him to stop for just a minute and think (before a glib but question-begging insistence that error probabilities are irrelevant to the particular case).

There is a lot of discussion of the conflict between and the merits of coherence vs other criteria such as coverage and p-values in the Bayesian literature. I have been disappointed to find that this discussion of conflicting principles is almost always from a Bayesian point of view (after a thorough search I have located only a handful of examples and very very rarely from leading figures in frequentist statistics).

The main argument for the use of subjective probability is that it is a primitive for evaluating the expected utility of a decision. I am not aware of any similar argument for frequentist criteria.

I am really interested in more detail on: “We frequentists give arguments explaining their relevance, to appraising the inference at hand”…. I have been searching the literature for years for these arguments. Where are they?!

I think it is quite wrong to say that the claim that “error probabilities are irrelevant” is made without argument. To the contrary, there is a significant literature about obtaining both if possible, and the philosophical tension that arises if only one is possible (mostly resolved in favour of coherence, as you note).

My understanding is that leading frequentist statisticians essentially accept that Bayesian theory is the closest thing to a complete theory of statistics, but that it raises (potentially very serious) difficulties in practice. For example, Bradley Efron says:

“The only complete theory of statistics is the Bayesian theory and even though it’s unassailable it somehow misses part of the story, which is that you can’t use it as an actual driving theory for complicated problems. You always are then forced to do something too complicated, and make up your mind on things you have no opinions on. So somehow Bayesian theory is wonderful but it doesn’t tell the whole story. Frequentist theory is shot full of contradictions but it seems to work so well.”

Similarly, Stephen Senn describes Bayesian theory as a theory of how to be perfect, but one that doesn’t necessarily help you be good.

This criticism of Bayesian theory is one that I acknowledge and respect… In contrast, I remain bewildered about the line of argument Deborah Mayo gives. She does not discuss, as far as I am aware, the philosophical tension that arises between coherence and coverage, and seems to acknowledge no value in coherence at all as far as I can see. I am not sure if she would dismiss subjective Bayesianism as the appropriate theory for decision making in the face of uncertainty as nearly all statisticians of all colours do.

The scope of a theory on decision making under uncertainty as a contributor to philosophy of science is however debatable, as is its meaningful application to complex real world problems.

The question was never (or at least, should never have been) about the obvious lack of legitimacy of tests of hypotheses based on data which aren’t relevant to those hypotheses — I’d hope that would be taken for granted. The question is: how is this notion of (lack of) legitimacy formalized in the context of error-statistical hypothesis testing?

Hope you feel better soon!

Thanks for the well-wishes! It is formalized in the requirements for a test statistic.

I would like to know where it is written down what the requirements for a test statistic S are that would rule out Jay’s example as a legitimate one. As far as I know, the only requirements for alpha-level testing that anyone writes down, for example, are that you observe a statistic and determine whether it falls in a rejection region whose probability, given H_0, is alpha (0.05 in Jay’s example). You can say that it should be a requirement that the probability is actually dependent on H_0, but where is that written down?

But it is clear that the Bayesian approach to the same example automatically fixes the problem. In the Bayesian approach, you would calculate the Bayes factor:

BF=P(S|H_0)/P(S|H_1) [where H_1=not H_0],

and if BF>1 the test favors H_0 and if BF<1 it favors H_1.

But in Jay's example, S is independent of H_0 and H_1: P(S|H_0)=P(S|H_1)=P(S). [To be pedantically correct I should include after the conditioning bar a term B representing the background information we have, e.g., that we got S by tossing a particular coin, but it isn't really needed here so I suppress it.]

Thus, in this example, BF=1 and we learn, automatically, that S gives us no information about what we should think about H_0 and H_1. No additional definition is required.
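The Bayes factor calculation described above can be sketched in a few lines. The function name and the "informative" numbers are hypothetical, chosen only to contrast with the coin case; exact fractions are used so the ratio comes out without rounding.

```python
from fractions import Fraction


def bayes_factor(p_s_given_h0, p_s_given_h1):
    """Return BF = P(S|H_0) / P(S|H_1)."""
    return p_s_given_h0 / p_s_given_h1


# Coin statistic: tails has probability 0.05 whichever hypothesis holds,
# so P(S|H_0) = P(S|H_1) = P(S).
p_tails = Fraction(1, 20)
bf_coin = bayes_factor(p_tails, p_tails)

# Hypothetical contrast: a statistic whose probability does depend on
# the hypothesis (0.05 under H_0, 0.8 under H_1). Numbers are made up.
bf_informative = bayes_factor(Fraction(1, 20), Fraction(4, 5))

print(bf_coin)         # 1
print(bf_informative)  # 1/16
```

A ratio of exactly 1 for the coin is just the statement P(S|H_0)=P(S|H_1) written as a number; the second ratio, being below 1, favors H_1.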

No such automatic evaluation of Jay's example as printed in his book attaches. We could say, for example, that no test is legitimate if the sampling distribution of S is independent of the hypothesis being tested, but does anyone actually write that down?

It is written down in any decent exposition of tests. To begin with, the null must assign probabilities to each value of the statistic. This already fails for the example we are discussing.

In Jay’s example, the null assigns probability 0.05 to tails and 0.95 to heads (the two values of the statistic are tails and heads). How is this not assigning probabilities to each value of the statistic?

It doesn’t fail in this way. Under the null (and under the alternative) the probability of getting heads is known.

No, the null can be about chemistry, physics or what have you and it doesn’t talk about his pet biased coin. Admittedly, that’s just one silly example. But I have dealt with all the others in published papers, and will address them specifically on this blog.

Who are “you”, and where are these papers published?

Who am I? Is that your question? Why of course I am that frequentist witch in exile? Plenty of references on Mayo’s blog.

Yes, I asked “who you are?”

I guess that you are Mayo. But that is not clear. ANYONE can post under another handle, if they wish.

But why use several handles? Are you trying to keep us guessing?

Are you “Error”? “Mayoerror”? “ERRorERRor”? “Phildgs2”? How many other handles are you using here?

What is the point of using all these handles?

Are you using sock puppets to confuse us?

This is not what a scholar should do, if indeed you are the owner of this list.

I post under one name. That’s the honest thing to do.

All the different “handles” are accidental; didn’t I admit from the start that I didn’t know the foggiest thing about how to blog? If it comes out ok, I leave it, if someone else has to go in and fix something, they might restore the writing but then their own “handle” arises. Sorry, but no dishonesty.

I mean, I want the citations so I can look at the papers.

…in addition, I want to note that in my example, the null assigns exactly the same probabilities to each possible value of the test statistic, regardless of the choice of p (since in all cases, under the null, the y’s are drawn from a standard normal distribution, independent of p). In other words, it seems to me that if for p=1 the hypothesis test is legitimate in Mayo’s eyes, the criterion she announced above for legitimacy fails to declare my tests illegitimate for any other value of p, including 0.

Also, in my example, for fixed p>0 and arbitrary but fixed theta, the power for a test based on a single sample decreases (approaching alpha) as p -> 0. However, no matter how small a p>0 you choose, it is possible to make the sample size N large enough so that you will have power as close to 1 as you wish, for that fixed theta.

This sounds to me like a legitimate test, even if a dumb one.

I didn’t say the large n problem led to no tests at all; I haven’t really discussed it on the blog (though it relates to what I’ve been writing about power). One must, in my construal of tests, indicate the extent of the discrepancy from the null that is and is not warranted. As n increases, the “same” level of significance indicates that a smaller discrepancy from the null is warranted. This is essentially the same as with a confidence interval. Please see, for example, the Mayo and Spanos article with the 13 lucky criticisms post. I think this space is getting too small to write in, just like when sample size is increased!
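One rough way to see the point about n is to ask which sample mean just reaches a fixed significance level as n grows. The sketch below assumes a one-sided test of H_0: mu = 0 with known sigma = 1; that setup and the function name are assumptions for illustration, and this is only the significance cutoff, not the fuller discrepancy assessment described in the Mayo and Spanos article.

```python
from statistics import NormalDist


def just_significant_mean(alpha, n, sigma=1.0):
    """Sample mean that exactly reaches one-sided level alpha.

    Assumes a test of H_0: mu = 0 against mu > 0 with known sigma,
    so the cutoff is z_alpha * sigma / sqrt(n).
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha)
    return z_alpha * sigma / n ** 0.5


# The same alpha corresponds to a smaller observed departure as n grows.
for n in (25, 100, 10000):
    print(n, round(just_significant_mean(0.05, n), 4))
```

Quadrupling n halves the cutoff, so a result that is "just significant at 0.05" with a large sample sits much closer to the null value than one with a small sample.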

Hope you are feeling better.

Apologies for the pedantry, but the rare event *is* anomalous; it is “inconsistent with or deviating from what is usual, normal, or expected” [Merriam-Webster]. The rare event is anomalous regardless of anything going on in molecular biology.

More interestingly, rare coin tossing events make for useless tests because they’re *equally* anomalous regardless of anything in molecular biology, i.e. they have minimal power. Are you defining legitimacy as having optimal power, or just having more power than a coin toss?
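The "minimal power" remark can be illustrated by comparing the coin-toss test with a test that actually uses the data. The comparison test here (a one-sided z-test with known sigma = 1) and all function names are assumptions picked for the sketch, not anything specified in the thread.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def coin_test_power(alpha, theta, n):
    """Power of the test that rejects when a biased coin lands tails.

    The coin ignores the data, so theta and n never enter: the
    rejection probability is alpha under every hypothesis.
    """
    return alpha


def z_test_power(alpha, theta, n):
    """Power of a one-sided z-test of H_0: mu = 0 at mu = theta.

    Assumes n independent observations with known sigma = 1.
    """
    cutoff = STD_NORMAL.inv_cdf(1 - alpha)
    return 1 - STD_NORMAL.cdf(cutoff - theta * n ** 0.5)


for n in (1, 10, 100):
    print(n, coin_test_power(0.05, 0.5, n), round(z_test_power(0.05, 0.5, n), 3))
```

The coin test's power stays pinned at alpha for every alternative and every sample size, while the data-based test's power rises with n.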

No. The rare event is not anomalous FOR just any old hypothesis; only those that counterpredict it, statistically or otherwise.

Okay, so you’re saying observed data can’t be “anomalous for” ideas (an epistemological statement) if those observed data are equally “anomalous” (an aleatory statement) under all ideas considered.

How, in your ideas, is this turned into a constructive definition? Without attaching a notion of optimality to the ability to counterpredict, just being “anomalous for” doesn’t rule out silly tests very like Kadane’s (e.g. those in Bill J’s earlier post).

If one does require optimality, how is the optimality determined?

sketching what I think is happening…

the null hypothesis H_0 is the following: the coin flips tails with probability 0.05, and only viruses and bacteria cause infection.

H_0 is widely held so evidence against it would represent a breakthrough…

we also have a coin flip, C and experimental data D.

P(C,D|H_0) = P(C|H_0)P(D|H_0)

Kadane suggests the (crazy) statistic

S(C,D)=C

and then argues that as

P(S=tails|H_0)=0.05

then if this event occurs we can say:

either an unlikely event occurred or the hypothesis is false. This seems technically ok but the hypothesis is a composite one on both the coin and in biology…

I think it would be most concerning if we observed heads so chose not to reject the null.. i.e. we chose to make no conclusion whatsoever…

… but if we also found that P(D|H_0)=0 then it would seem that this analysis missed a significant breakthrough.

… there are quite a few issues floating around this example… including Fisher vs Neyman Pearson frequentist only discussions on the requirement for an alternative hypothesis… in science occasionally you get a Kuru situation where P(D|H_0)=0 so that you really don’t need an alternative hypothesis to reject the null (nor do you need a statistician)… another significant oddity in this example is the way Kadane introduced it, it is hard to imagine any reasonable hypothesis in which C and D were dependent…

Mayo’s wider point, that rejecting frequentist philosophy on the basis of this single caricature of frequentist practice is unwarranted, is well taken.

On the other hand, I don’t think that was Kadane’s intention. The example that Mayo cites is a parenthetical remark buried in a much more substantial example and discussion.