# A Highly Anomalous Event

The journey to San Francisco was smooth sailing with no plane delays; within two hours of landing I found myself in the E.R. of St. Francis Hospital (with the philosopher of science Ronald Giere), unable to walk. I have just described an unexpected, “anomalous”, highly unusual event, but no one would suppose it was anomalous FOR, i.e., evidence against, some theory, say, in molecular biology. Yet I am getting e-mails (from readers) saying, in effect, that since the improbable coin toss result is very unexpected/anomalous in its own right, it is therefore anomalous for any and all theories, which is patently absurd. What had happened, in case you want to know, is that just as I lunged forward to grab my (bulging) suitcase off the airline baggage thingy, out of the corner of my eye I saw my computer bag being pulled away by someone on my left, and as I simultaneously yanked it back, I tumbled over (very gently, it seemed), twisting my knee in a funny way. To my surprise/alarm, much as I tried, I could put no weight on my right leg without succumbing to a Geppetto-puppet-like collapse. The event, of course, could rightly be regarded as anomalous for hypotheses about my invulnerability to such mishaps, because it runs counter to them. I will assume this issue is now settled for our discussions, yes?

Categories: Statistics | Tags: anomalous, coin toss, error statistical philosophy, Ronald Giere, significance tests

### 29 thoughts on “A Highly Anomalous Event”


I’m trying to understand the philosophical arguments in terms of statistical notation. The unfair coin could even be thought of as a legitimate alpha-level statistical hypothesis test for H: E(X) = 0, where X denotes some financial asset return, because that can make sense. However, it definitely is not one for H: E(Y) = 0, where Y denotes the number of words written in this comment, because that simply makes no sense. Did I understand it right?

RP: I’m not sure I understand the question, sorry.

I’m sorry that I didn’t make myself clear enough. Although late, I’ll try again!

The discussion seems to be about what counts as a legitimate test. Regarding this, if you take a formal text like Erich Lehmann’s Testing Statistical Hypotheses, the problem is stated by considering a parameter space that is split into two mutually exclusive parts, each giving rise to one of two complementary hypotheses. Thus, I am assuming that a test cannot be legitimate if a value of its test statistic can be assigned by points outside the parameter space. In other words, a legitimate test must consider only test statistics whose values can ONLY be assigned by hypotheses derived from the (fixed) parameter space.

In my examples above, the first test is legitimate because E(X) = 0 is inside the parameter space of ‘possible values for financial returns’. However, the second one is not legitimate because E(Y) = 0 is outside the parameter space of ‘possible values for number of words written in that comment’, which clearly must be greater than zero.

Jay’s example considers a test statistic whose values can be assigned by ANY parameter space. This means that the particular parameter space of a specific problem doesn’t matter. Using the notation of Bill Jefferys above, since the statistic is independent of the hypotheses, one could write P(S|H_0) = P(S|H_1) = P(S|H_{-1}) = P(S), where H_{-1} means ‘everything that is outside the parameter space even if it doesn’t make sense’. This is insane!

My question is: is this explanation compatible with yours? If not, what are the divergences?

Best wishes with your recovery.

I’m sure we all wish you a speedy recovery.

Thank you; it is annoyingly slow-going.

Naive guy again. It seems obvious to me that for the coin example from the Uncertainty book, if our hypothesis is that the frequency for heads is 0.05, then we would set up an experiment in which the null hypothesis is that repeated tosses will demonstrate a frequency of heads close to 0.05, with how close determined by sample size (number of tosses), etc. We test for the significance of a deviation from 0.05. Thus, the one toss was inconclusive. My question is (and I am interested in moving beyond the coins): how does a book published in 2011 provide such a misleading example? Does the view exemplified by the example represent the thinking of Bayesian proponents? I am intensely interested in understanding what appears to me as flawed reasoning that is gaining support (like a clothing fashion; kinda like Chinese foot-binding, but a fashion). I just attended a forensic conference where a colleague complained to me that in her home country, they are now required to report the Bayesian posterior probability for every identification of human remains. Yet they have no sound basis for developing priors. They also suffer from loose thinking about likelihoods. The effect showed itself recently when the DNA lab reported that a set of remains had a probability of >99.0% of being a certain young man. The remains turned out to be female, probably his sister, who was also missing. They had some difficulty straightening it out. Sloppy use of Bayes’ model will make a big mess, I think. Scary. Is this where science is going? Is it inevitable?
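The test sketched in this comment can be made concrete with an exact one-sided binomial test. This is a minimal sketch, not anything from the Uncertainty book: the function name and the illustrative sample figures (12 heads in 100 tosses) are mine.

```python
from math import comb

def binom_pvalue_upper(k, n, p0):
    """Exact one-sided p-value: P(X >= k) for X ~ Binomial(n, p0)."""
    return sum(comb(n, j) * p0**j * (1 - p0)**(n - j)
               for j in range(k, n + 1))

# Null hypothesis: the frequency of heads is 0.05.
# A single toss is inconclusive: the smallest p-value one toss can
# produce is 0.05 (by observing heads), so one toss can never give
# strong evidence of a deviation.
p_one_toss = binom_pvalue_upper(1, 1, 0.05)

# With repeated tosses the test becomes informative: e.g. 12 heads
# in 100 tosses would be a clear deviation from the hypothesized 0.05.
p_many = binom_pvalue_upper(12, 100, 0.05)
```

Sample size does the work here: the same hypothesized frequency is untestable with one toss but readily testable with a hundred.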

This example is not about sample size, but about proposing a test statistic which ignores completely the substantial part of the data.

The example Mayo cites I would agree is peculiar in that nobody would ever do it… The point of interest is whether such a perverse test is permissible under the definitions, but I fully agree it does not criticise non-Bayesian statistics as it is actually practiced…

I would strongly defend this book as being an impressive piece of scholarship. The criticism in the brief final chapter includes discussion of many other problems in non-Bayesian statistics as it is actually practiced… (which I agree is a problem with this example)

This is not to say that you didn’t encounter some very strange Bayesian analysis in your work!

Granted this is one of the more extreme examples, but it brings out certain fundamental misunderstandings of the requirements of significance tests, of both Fisherian and N-P varieties.

“How does a book published in 2011 provide such a misleading example?” I have the same question, and Kadane is certainly a high priest of subjective Bayesianism. It is to his credit that he hasn’t tried to weasel out of the consequences of the account. But this and other criticisms of frequentist tests don’t hold up. I just started with an extreme example to make the broader philosophical points about scientific testing.

I truly think that it will go that way if people don’t do something about it. That, of course, is my mission. But it requires a certain amount of care, understanding of logic, and intellectual honesty. From critics, it will require avoidance of just giving the same knee-jerk reactions again and again and again. A good example is in dealing with the relevance of error probabilities. Many Bayesians just say they are irrelevant. But, guess what? That’s not an argument. We frequentists give arguments explaining their relevance—to appraising the inference at hand—and the onus of a denier must be to counter those arguments. I will address this soon in considering C. Robert on Error and Inference. I appreciate his posts, but I hope to get him to stop for just a minute and think (before a glib but question-begging insistence that error probabilities are irrelevant to the particular case).

There is a lot of discussion of the conflict between, and the merits of, coherence vs. other criteria such as coverage and p-values in the Bayesian literature. I have been disappointed to find that this discussion of conflicting principles is almost always from a Bayesian point of view (after a thorough search I have located only a handful of examples, and very rarely from leading figures in frequentist statistics).

The main argument for the use of subjective probability is that it is a primitive for evaluating the expected utility of a decision. I am not aware of any similar argument for frequentist criteria.

I am really interested in more detail on: “We frequentists give arguments explaining their relevance—to appraising the inference at hand”…. I have been searching the literature for years for these arguments, where are they?!

I think it is quite wrong to say that the claim that “error probabilities are irrelevant” is made without argument. To the contrary, there is a significant literature about obtaining both if possible, and about the philosophical tension that arises if only one is possible (mostly resolved in favour of coherence, as you note).

My understanding is that leading frequentist statisticians essentially accept that Bayesian theory is the closest thing to a complete theory of statistics, but that it raises (potentially very serious) difficulties in practice. For example, Bradley Efron says:

“The only complete theory of statistics is the Bayesian theory and even though it’s unassailable it somehow misses part of the story, which is that you can’t use it as an actual driving theory for complicated problems. You always are then forced to do something too complicated, and make up your mind on things you have no opinions on. So somehow Bayesian theory is wonderful but it doesn’t tell the whole story. Frequentist theory is shot full of contradictions but it seems to work so well.”

Similarly, Stephen Senn describes Bayesian theory as a theory of how to be perfect, but one that doesn’t necessarily help you be good.

This criticism of Bayesian theory is one that I acknowledge and respect… In contrast, I remain bewildered about the line of argument Deborah Mayo gives. She does not discuss, as far as I am aware, the philosophical tension that arises between coherence and coverage, and seems to acknowledge no value in coherence at all as far as I can see. I am not sure if she would dismiss subjective Bayesianism as the appropriate theory for decision making in the face of uncertainty, as nearly all statisticians of all colours accept it to be.

The scope of a theory on decision making under uncertainty as a contributor to philosophy of science is however debatable, as is its meaningful application to complex real world problems.

The question was never (or at least, should never have been) about the obvious lack of legitimacy of tests of hypotheses based on data which aren’t relevant to those hypotheses — I’d hope that would be taken for granted. The question is: how is this notion of (lack of) legitimacy formalized in the context of error-statistical hypothesis testing?

Hope you feel better soon!

Thanks for the well-wishes! It is formalized in the requirements for a test statistic.

I would like to know where it is written down what the requirements are for a test statistic S that would rule out Jay’s example as a legitimate one. As far as I know, the only requirements for alpha-level testing that anyone writes down are that you observe a statistic and determine whether the probability of observing that statistic, given H_0, falls in the rejection region (0.05 in Jay’s example). You can say that it should be a requirement that this probability actually depend on H_0, but where is that written down?

But it is clear that the Bayesian approach to the same example automatically fixes the problem. In the Bayesian approach, you would calculate the Bayes factor:

BF=P(S|H_0)/P(S|H_1) [where H_1=not H_0],

and if BF>1 the test favors H_0 and if BF<1 it favors H_1.

But in Jay's example, S is independent of H_0 and H_1: P(S|H_0)=P(S|H_1)=P(S). [To be pedantically correct I should include after the conditioning bar a term B representing the background information we have, e.g., that we got S by tossing a particular coin, but it isn't really needed here so I suppress it.]

Thus, in this example, BF=1 and we learn, automatically, that S gives us no information about what we should think about H_0 and H_1. No additional definition is required.

No such automatic evaluation of Jay's example as printed in his book attaches. We could say, for example, that no test is legitimate if the sampling distribution of S is independent of the hypothesis being tested, but does anyone actually write that down?
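The automatic Bayesian diagnosis described above takes only a few lines to sketch. The helper name is mine, for illustration; the numbers come from Jay's coin (P(tails) = 0.05 under every hypothesis).

```python
def bayes_factor(p_s_given_h0, p_s_given_h1):
    """BF = P(S | H0) / P(S | H1); BF > 1 favors H0, BF < 1 favors H1."""
    return p_s_given_h0 / p_s_given_h1

# In Jay's example the coin statistic S is independent of the
# hypotheses: P(S | H0) = P(S | H1) = P(S) = 0.05 for tails, so the
# Bayes factor is automatically 1 and S is uninformative about H0 vs H1.
bf = bayes_factor(0.05, 0.05)
```

The point is that no extra legitimacy criterion is needed: whenever the statistic's distribution is the same under both hypotheses, the ratio is 1 by construction.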

It is written down in any decent exposition of tests. To begin with, the null must assign probabilities to each value of the statistic. This already fails for the example we are discussing.

In Jay’s example, the null assigns probability 0.05 to tails and 0.95 to heads (the two values of the statistic are tails and heads). How is this not assigning probabilities to each value of the statistic?

It doesn’t fail in this way. Under the null (and under the alternative) the probability of getting heads is known.

No, the null can be about chemistry, physics or what have you and it doesn’t talk about his pet biased coin. Admittedly, that’s just one silly example. But I have dealt with all the others in published papers, and will address them specifically on this blog.

Who are “you”, and where are these papers published?

Who am I? Is that your question? Why, of course I am that frequentist witch in exile! Plenty of references on Mayo’s blog.

Yes, I asked “who you are?”

I guess that you are Mayo. But that is not clear. ANYONE can post under another handle, if they wish.

But why use several handles? Are you trying to keep us guessing?

Are you “Error”? “Mayoerror”? “ERRorERRor”? “Phildgs2”? How many other handles are you using here?

What is the point of using all these handles?

Are you using sock puppets to confuse us?

This is not what a scholar should do, if indeed you are the owner of this list.

I post under one name. That’s the honest thing to do.


All the different “handles” are accidental; didn’t I admit from the start that I didn’t know the foggiest thing about how to blog? If it comes out ok, I leave it, if someone else has to go in and fix something, they might restore the writing but then their own “handle” arises. Sorry, but no dishonesty.

I mean, I want the citations so I can look at the papers.

…in addition, I want to note that in my example, the null assigns exactly the same probabilities to each possible value of the test statistic, regardless of the choice of p (since in all cases, under the null, the y’s are drawn from a standard normal distribution, independent of p). In other words, it seems to me that if for p=1 the hypothesis test is legitimate in Mayo’s eyes, the criterion she announced above for legitimacy fails to declare my tests illegitimate for any other value of p, including 0.

Also, in my example, for fixed p>0 and arbitrary but fixed theta, the power for a test based on a single sample decreases (approaching alpha) as p → 0. However, no matter how small a p>0 you choose, it is possible to make the sample size N large enough so that you will have power as close to 1 as you wish, for that fixed theta.

This sounds to me like a legitimate test, even if a dumb one.
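The full construction behind “my example” is not quoted in this thread, so the following is only a plausible reconstruction consistent with the description above: each observation is N(theta, 1) with probability p and N(0, 1) otherwise, so that under the null (theta = 0) the data are standard normal regardless of p. A Monte Carlo sketch of the power behavior being claimed:

```python
import math
import random

def simulate_power(p, theta, n, reps=2000, seed=0):
    """Monte Carlo power of a one-sided z-test of H0: theta = 0 at
    alpha = 0.05, where each y_i is N(theta, 1) with probability p
    and N(0, 1) otherwise (a hypothetical mixture model)."""
    rng = random.Random(seed)
    z_crit = 1.6449  # upper 5% point of the standard normal
    rejections = 0
    for _ in range(reps):
        ys = [rng.gauss(theta if rng.random() < p else 0.0, 1.0)
              for _ in range(n)]
        z = sum(ys) / math.sqrt(n)  # unit variance under the null
        if z > z_crit:
            rejections += 1
    return rejections / reps
```

Under this reconstruction, single-sample power sinks toward alpha as p shrinks, while for any fixed p > 0 a large enough n pushes power toward 1, matching the comment's two claims.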

I didn’t say the large-n problem led to no tests at all; I haven’t really discussed it on the blog (though it relates to what I’ve been writing about power). One must, in my construal of tests, indicate the extent of the discrepancy from the null that is and is not warranted. As n increases, the “same” level of significance indicates that a smaller discrepancy from the null is warranted. Essentially the same as a confidence interval. Please see, for example, the Mayo and Spanos article with the 13 lucky criticisms post. I think this space is getting too small to write in, just like when sample size is increased!

Hope you are feeling better.

Apologies for the pedantry, but the rare event *is* anomalous; it is “inconsistent with or deviating from what is usual, normal, or expected” [Merriam-Webster]. The rare event is anomalous regardless of anything going on in molecular biology.

More interestingly, rare coin tossing events make for useless tests because they’re *equally* anomalous regardless of anything in molecular biology, i.e. they have minimal power. Are you defining legitimacy as having optimal power, or just having more power than a coin toss?

No. The rare event is not anomalous FOR just any old hypothesis; only those that counterpredict it, statistically or otherwise.

Okay, so you’re saying observed data can’t be “anomalous for” ideas (an epistemological statement) if those observed data are equally “anomalous” (an aleatory statement) under all ideas considered.

How, on your view, is this turned into a constructive definition? Without attaching a notion of optimality to the ability to counterpredict, just being “anomalous for” doesn’t rule out silly tests very like Kadane’s (e.g. those in Bill J’s earlier post).

If one does require optimality, how is the optimality determined?

sketching what I think is happening…

the null hypothesis H_0 is the following:

- the coin flips tails with probability 0.05
- only viruses and bacteria cause infection

H_0 is widely held, so evidence against it would represent a breakthrough…

we also have a coin flip C and experimental data D.

P(C,D|H_0) = P(C|H_0)P(D|H_0)

Kadane suggests the (crazy) statistic

S(C,D)=C

and then argues that as

P(S=tails|H_0)=0.05

then if this event occurs we can say:

either an unlikely event occurred or the hypothesis is false. This seems technically OK, but the hypothesis is a composite one, covering both the coin and the biology…

I think it would be most concerning if we observed heads and so chose not to reject the null, i.e. we chose to draw no conclusion whatsoever…

… but if we also found that P(D|H_0)=0 then it would seem that this analysis missed a significant breakthrough.

… there are quite a few issues floating around this example… including Fisher vs. Neyman-Pearson frequentist-only discussions on the requirement for an alternative hypothesis… in science you occasionally get a Kuru situation where P(D|H_0)=0, so that you really don’t need an alternative hypothesis to reject the null (nor do you need a statistician)… another significant oddity in this example is the way Kadane introduced it: it is hard to imagine any reasonable hypothesis under which C and D were dependent…
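The sketch above can be checked by brute force. This is a hedged simulation of my own: the Gaussian model for the experimental data D is a stand-in for whatever the real experiment measures, and the only point that matters is that Kadane's statistic S(C, D) = C never consults D.

```python
import random

def rejection_rate(data_mean, reps=20000, seed=1):
    """Monte Carlo rejection rate of Kadane's 'test': toss the biased
    coin (P(tails) = 0.05), generate experimental data D, then reject
    H_0 iff the coin lands tails. S(C, D) = C ignores D entirely."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(reps):
        tails = rng.random() < 0.05
        _d = rng.gauss(data_mean, 1.0)  # D is drawn, then never used
        if tails:
            rejections += 1
    return rejections / reps
```

The rejection probability is P(tails) = 0.05 whether the substantive hypothesis is true (data_mean = 0) or wildly false (say, data_mean = 10): size and power coincide, which is exactly why the "test" can never detect a breakthrough.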

Mayo’s wider point, that rejection of frequentist philosophy should not rest on this single caricature of frequentist practice, is well taken.

On the other hand, I don’t think that was Kadane’s intention. The example that Mayo cites is a parenthetical remark buried in a much more substantial example and discussion.