A Highly Anomalous Event
The journey to San Francisco was smooth sailing with no plane delays; within two hours of landing I found myself in the E.R. of St. Francis Hospital (with the philosopher of science Ronald Giere), unable to walk. I have just described an unexpected, “anomalous”, highly unusual event, but no one would suppose it was anomalous FOR, i.e., evidence against, some theory, say, in molecular biology. Yet I am getting emails (from readers) saying, in effect, that since the improbable coin toss result is very unexpected/anomalous in its own right, it therefore is anomalous for any and all theories, which is patently absurd. What had happened, in case you want to know, is that just as I lunged forward to grab my (bulging) suitcase off the airline baggage thingy, out of the corner of my eye I saw my computer bag being pulled away by someone on my left, and as I simultaneously yanked it back, I tumbled over (very gently, it seemed), twisting my knee in a funny way. To my surprise/alarm, much as I tried, I could put no weight on my right leg without succumbing to a Geppetto-puppet-like collapse. The event, of course, could rightly be regarded as anomalous for hypotheses about my invulnerability to such mishaps, because it runs counter to them. I will assume this issue is now settled for our discussions, yes?
Categories: Statistics
 Tags: anomalous, coin toss, error statistical philosophy, Ronald Giere, significance tests

29 thoughts on “A Highly Anomalous Event”
I’m trying to understand the philosophical arguments in terms of statistical notation. The unfair coin could even be thought of as a legitimate alpha-level statistical hypothesis test for H: E(X) = 0, where X denotes some financial asset return, because it can make sense. However, it definitely is not one for H: E(Y) = 0, where Y denotes the number of words written in this comment, because it simply makes no sense. Did I understand it right?
RP: I’m not sure I understand the question, sorry.
I’m sorry that I didn’t make myself clear enough. Although late, I’ll try again!
The discussion seems to be about what counts as a legitimate test. Regarding this, if you take a formal text like Erich Lehmann’s Testing Statistical Hypotheses, the problem is first stated in terms of a parameter space that is split into two mutually exclusive parts, each one giving rise to one of two complementary hypotheses. Thus, I am assuming that a test cannot be a legitimate one if a value for its test statistic can be assigned by points outside the parameter space. In other words, a legitimate test must consider only test statistics that can ONLY have values assigned by hypotheses derived from the (fixed) parameter space.
In my examples above, the first test is legitimate because E(X) = 0 is inside the parameter space of ‘possible values for financial returns’. However, the second one is not legitimate because E(Y) = 0 is outside the parameter space of ‘possible values for number of words written in that comment’, which, clearly must be greater than zero.
Jay’s example considers a test statistic whose values can be assigned by ANY parameter space. This means that the particular parameter space of a specific problem doesn’t matter. Using the notation of Bill Jeffreys above, since the statistic is independent of the hypotheses, one could write P(S|H_0) = P(S|H_1) = P(S|H_{-1}) = P(S), where H_{-1} means ‘everything that is outside the parameter space even if it doesn’t make sense’. This is insane!
My question is: is this explanation compatible with yours? If not, what are the divergences?
Best wishes with your recovery.
I’m sure we all wish you a speedy recovery.
Thank you; it is annoyingly slow-going.
Naive guy again. It seems obvious to me that for the coin example from the Uncertainty book, if our hypothesis is that the frequency for heads is 0.05, then we would set up an experiment in which the null hypothesis was that repeated tosses will demonstrate a frequency of heads close to 0.05, with how close determined by sample size (number of tosses), etc. We test for significance of a deviation from 0.05. Thus, the one toss was inconclusive. My question is (and I am interested in moving beyond the coins): how does a book published in 2011 provide such a misleading example? Does the view exemplified by the example represent the thinking of Bayesian proponents? I am intensely interested in understanding what appears to me to be flawed reasoning that is gaining support (like a clothing fashion; kinda like Chinese footbinding, but a fashion). I just attended a forensic conference where a colleague complained to me that in her home country, they are now required to report the Bayesian posterior probability for every identification of human remains. Yet, they have no sound basis for developing priors. They also suffer from loose thinking about likelihoods. The effect showed itself recently when the DNA lab reported out that a set of remains had a probability of >99.0 of being a certain young man. The remains turned out to be female, probably his sister who also was missing. They had some difficulty straightening it out. Sloppy use of Bayes’ model will make a big mess, I think. Scary. Is this where science is going? Is it inevitable?
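The repeated-toss test described in this comment can be sketched with an exact binomial tail probability. This is only an illustration: the function name and the "100 tosses, 12 heads" figures are made up, and only the hypothesized frequency 0.05 comes from the comment. It shows that a single toss can yield only p = 0.05 (heads) or p = 1 (no heads), whereas a longer run can register a deviation from 0.05.

```python
from math import comb

def binomial_upper_tail(k: int, n: int, p0: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p0): a one-sided exact p-value."""
    return sum(comb(n, i) * p0**i * (1 - p0) ** (n - i) for i in range(k, n + 1))

P0 = 0.05  # hypothesized frequency of heads (from the comment)

# A single toss landing heads: the p-value is just P0 itself.
p_single = binomial_upper_tail(1, 1, P0)

# Hypothetical larger experiment (made-up numbers): 100 tosses, 12 heads.
p_many = binomial_upper_tail(12, 100, P0)

print(f"one toss, one head:   p = {p_single:.3f}")
print(f"100 tosses, 12 heads: p = {p_many:.4f}")
```

Either way, this is a test of a hypothesis about the coin; it says nothing about hypotheses that make no prediction about the coin.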
This example is not about sample size, but about proposing a test statistic which ignores completely the substantial part of the data.
The example Mayo cites is, I would agree, peculiar in that nobody would ever do it… The point of interest is whether such a perverse test is permissible under the definitions, but I fully agree it does not criticise non-Bayesian statistics as it is actually practiced…
I would strongly defend this book as being an impressive piece of scholarship. The criticism in the brief final chapter includes discussion of many other problems in non-Bayesian statistics as it is actually practiced… (which I agree is a problem with this example)
This is not to say that you didn’t encounter some very strange Bayesian analysis in your work!
Granted this is one of the more extreme examples, but it brings out certain fundamental misunderstandings of the requirements of significance tests, of both Fisherian and NP varieties.
“How does a book published in 2011 provide such a misleading example?” I have the same question, and Kadane is certainly a high priest of subjective Bayesianism. It is to his credit that he hasn’t tried to weasel out of the consequences of the account. But this and other criticisms of frequentist tests don’t hold up. I just started with an extreme example to make the broader philosophical points about scientific testing.
I truly think that it will go that way if people don’t do something about it. That, of course, is my mission. But it requires a certain amount of care, understanding of logic, and intellectual honesty. From critics, it will require avoiding the same knee-jerk reactions again and again and again. A good example is in dealing with the relevance of error probabilities. Many Bayesians just say they are irrelevant. But, guess what? That’s not an argument. We frequentists give arguments explaining their relevance to appraising the inference at hand, and the onus on a denier must be to counter those arguments. I will address this soon in considering C. Roberts on Error and Inference. I appreciate his posts, but I hope to get him to stop for just a minute and think (before a glib but question-begging insistence that error probabilities are irrelevant to the particular case).
There is a lot of discussion of the conflict between and the merits of coherence vs other criteria such as coverage and p-values in the Bayesian literature. I have been disappointed to find that this discussion of conflicting principles is almost always from a Bayesian point of view (after a thorough search I have located only a handful of examples and very very rarely from leading figures in frequentist statistics).
The main argument for the use of subjective probability is that it is a primitive for evaluating the expected utility of a decision. I am not aware of any similar argument for frequentist criteria.
I am really interested in more detail on: “We frequentists give arguments explaining their relevance to appraising the inference at hand”…. I have been searching the literature for years for these arguments; where are they?!
I think it is quite wrong to say that the claim that “error probabilities are irrelevant” is made without argument. To the contrary, there is a significant literature about obtaining both if possible, and the philosophical tension that arises if only one is possible (mostly resolved in favour of coherence, as you note).
My understanding is that leading frequentist statisticians essentially accept that Bayesian theory is the closest thing to a complete theory of statistics, but that it raises (potentially very serious) difficulties in practice. For example, Bradley Efron says:
“The only complete theory of statistics is the Bayesian theory and even though it’s unassailable it somehow misses part of the story, which is that you can’t use it as an actual driving theory for complicated problems. You always are then forced to do something too complicated, and make up your mind on things you have no opinions on. So somehow Bayesian theory is wonderful but it doesn’t tell the whole story. Frequentist theory is shot full of contradictions but it seems to work so well.”
Similarly, Stephen Senn describes Bayesian theory as a theory of how to be perfect, but one that doesn’t necessarily help you be good.
This criticism of Bayesian theory is one that I acknowledge and respect… In contrast, I remain bewildered about the line of argument Deborah Mayo gives. She does not discuss, as far as I am aware, the philosophical tension that arises between coherence and coverage, and seems to acknowledge no value in coherence at all, as far as I can see. I am not sure whether she would dismiss subjective Bayesianism as the appropriate theory for decision making in the face of uncertainty, a role in which nearly all statisticians of all colours accept it.
The scope of a theory on decision making under uncertainty as a contributor to philosophy of science is however debatable, as is its meaningful application to complex real world problems.
The question was never (or at least, should never have been) about the obvious lack of legitimacy of tests of hypotheses based on data which aren’t relevant to those hypotheses; I’d hope that would be taken for granted. The question is: how is this notion of (lack of) legitimacy formalized in the context of error-statistical hypothesis testing?
Hope you feel better soon!
Thanks for the well-wishes! It is formalized in the requirements for a test statistic.
I would like to know where it is written down what the requirements for a test statistic S are that would rule out Jay’s example as a legitimate one. As far as I know, the only requirements for alpha-level testing that anyone writes down are, for example, that you observe a statistic and determine whether it falls in the rejection region, whose probability given H_0 is alpha (0.05 in Jay’s example). You can say that it should be a requirement that the probability actually depends on H_0, but where is that written down?
But it is clear that the Bayesian approach to the same example automatically fixes the problem. In the Bayesian approach, you would calculate the Bayes factor:
BF = P(S|H_0)/P(S|H_1) [where H_1 = not H_0],
and if BF>1 the test favors H_0 and if BF<1 it favors H_1.
But in Jay's example, S is independent of H_0 and H_1: P(S|H_0) = P(S|H_1) = P(S). [To be pedantically correct I should include after the conditioning bar a term B representing the background information we have, e.g., that we got S by tossing a particular coin, but it isn't really needed here so I suppress it.]
Thus, in this example, BF=1 and we learn, automatically, that S gives us no information about what we should think about H_0 and H_1. No additional definition is required.
No such automatic evaluation attaches to Jay's example as printed in his book. We could say, for example, that no test is legitimate if the sampling distribution of S is independent of the hypothesis being tested, but does anyone actually write that down?
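The Bayes factor calculation in this comment can be sketched in a few lines. The helper name and the "informative" likelihood values are hypothetical, chosen only for contrast; the 0.05 comes from the coin example.

```python
def bayes_factor(lik_h0: float, lik_h1: float) -> float:
    """BF = P(S | H_0) / P(S | H_1) for an observed statistic S."""
    return lik_h0 / lik_h1

# Coin-based statistic: P(S = tails) is 0.05 whatever the hypothesis says.
bf_coin = bayes_factor(0.05, 0.05)

# Hypothetical statistic that does depend on the hypotheses (made-up values).
bf_informative = bayes_factor(0.05, 0.60)

print(bf_coin)         # 1.0: S carries no information about H_0 vs H_1
print(bf_informative)  # below 1: S favors H_1
```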
It is written down in any decent exposition of tests. To begin with, the null must assign probabilities to each value of the statistic. This already fails for the example we are discussing.
In Jay’s example, the null assigns probability 0.05 to tails and 0.95 to heads (the two values of the statistic are tails and heads). How is this not assigning probabilities to each value of the statistic?
It doesn’t fail in this way. Under the null (and under the alternative) the probability of getting heads is known.
No, the null can be about chemistry, physics or what have you and it doesn’t talk about his pet biased coin. Admittedly, that’s just one silly example. But I have dealt with all the others in published papers, and will address them specifically on this blog.
Who are “you”, and where are these papers published?
Who am I? Is that your question? Why of course I am that frequentist witch in exile? Plenty of references on Mayo’s blog.
Yes, I asked “who you are?”
I guess that you are Mayo. But that is not clear. ANYONE can post under another handle, if they wish.
But why use several handles? Are you trying to keep us guessing?
Are you “Error”? “Mayoerror”? “ERRorERRor”? “Phildgs2”? How many other handles are you using here?
What is the point of using all these handles?
Are you using sock puppets to confuse us?
This is not what a scholar should do, if indeed you are the owner of this list.
I post under one name. That’s the honest thing to do.
All the different “handles” are accidental; didn’t I admit from the start that I didn’t know the foggiest thing about how to blog? If it comes out ok, I leave it, if someone else has to go in and fix something, they might restore the writing but then their own “handle” arises. Sorry, but no dishonesty.
I mean, I want the citations so I can look at the papers.
…in addition, I want to note that in my example, the null assigns exactly the same probabilities to each possible value of the test statistic, regardless of the choice of p (since in all cases, under the null, the y’s are drawn from a standard normal distribution, independent of p). In other words, it seems to me that if for p=1 the hypothesis test is legitimate in Mayo’s eyes, the criterion she announced above for legitimacy fails to declare my tests illegitimate for any other value of p, including 0.
Also, in my example, for fixed p>0 and arbitrary but fixed theta, the power for a test based on a single sample decreases (approaching alpha) as p -> 0. However, no matter how small a p>0 you choose, it is possible to make the sample size N large enough so that you will have power as close to 1 as you wish, for that fixed theta.
This sounds to me like a legitimate test, even if a dumb one.
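The power behaviour claimed in this comment can be checked by simulation, but only under an assumed model, since the original example is not restated here. The sketch below ASSUMES that each y is drawn from N(theta, 1) with probability p and from N(0, 1) otherwise, so that under the null (theta = 0) every y is standard normal regardless of p. The function name, the one-sided z-test, and all numerical settings are illustrative choices, not the commenter's.

```python
import random
from statistics import NormalDist

def estimate_power(p, theta, n, alpha=0.05, reps=2000, seed=0):
    """Monte Carlo rejection rate of a one-sided z-test on the sample mean.

    ASSUMED model (not from the thread): each y is N(theta, 1) with
    probability p and N(0, 1) otherwise.
    """
    rng = random.Random(seed)
    cutoff = NormalDist().inv_cdf(1 - alpha)
    rejections = 0
    for _ in range(reps):
        total = 0.0
        for _ in range(n):
            shift = theta if rng.random() < p else 0.0
            total += rng.gauss(shift, 1.0)
        z = total / n**0.5  # sqrt(n) * sample mean; N(0, 1) under the null
        rejections += z > cutoff
    return rejections / reps

theta = 2.0  # arbitrary fixed alternative
power_p001_n1 = estimate_power(p=0.01, theta=theta, n=1, reps=20000)
power_p01_n1 = estimate_power(p=0.1, theta=theta, n=1, reps=20000)
power_p01_n400 = estimate_power(p=0.1, theta=theta, n=400, reps=500)

print("p=0.01, N=1:  ", power_p001_n1)   # close to alpha
print("p=0.10, N=1:  ", power_p01_n1)
print("p=0.10, N=400:", power_p01_n400)  # close to 1
```

Under these assumptions, single-sample power sinks toward alpha as p shrinks, while a large enough N restores high power for any fixed p > 0.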
I didn’t say the large n problem led to no tests at all; I haven’t really discussed it on the blog (though it relates to what I’ve been writing about power). One must, in my construal of tests, indicate the extent of the discrepancy from the null that is and is not warranted. As n increases, the “same” level of significance indicates a smaller discrepancy from the null is warranted. Essentially the same as a confidence interval. Please see, for example, the Mayo and Spanos article with the 13 lucky criticisms post. I think this space is getting too small to write in, just like when sample size is increased!
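The point about n and the “same” significance level can be illustrated with a rough sketch. It assumes a one-sided z-test of a normal mean with known sigma; the function name and the sample sizes are made up. It computes the observed difference from the null that is just significant at a fixed level, which shrinks like 1/sqrt(n).

```python
from statistics import NormalDist

def just_significant_difference(n: int, sigma: float = 1.0, alpha: float = 0.025) -> float:
    """Observed mean difference from the null that just reaches level alpha
    in a one-sided z-test with known sigma (an assumed, simplified setup)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha)
    return z_alpha * sigma / n**0.5

for n in (25, 100, 10_000):
    print(n, round(just_significant_difference(n), 4))
```

So a result that just reaches the 0.025 level with n = 10,000 corresponds to an observed difference one tenth the size of one that just reaches it with n = 100.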
Hope you are feeling better.
Apologies for the pedantry, but the rare event *is* anomalous; it is “inconsistent with or deviating from what is usual, normal, or expected” [Merriam-Webster]. The rare event is anomalous regardless of anything going on in molecular biology.
More interestingly, rare coin tossing events make for useless tests because they’re *equally* anomalous regardless of anything in molecular biology, i.e. they have minimal power. Are you defining legitimacy as having optimal power, or just having more power than a coin toss?
No. The rare event is not anomalous FOR just any old hypothesis; only those that counterpredict it, statistically or otherwise.
Okay, so you’re saying observed data can’t be “anomalous for” ideas (an epistemological statement) if those observed data are equally “anomalous” (an aleatory statement) under all ideas considered.
How, in your ideas, is this turned into a constructive definition? Without attaching a notion of optimality to the ability to counterpredict, just being “anomalous for” doesn’t rule out silly tests very like Kadane’s (e.g. those in Bill J’s earlier post).
If one does require optimality, how is the optimality determined?
Sketching what I think is happening…
The null hypothesis H_0 is the following:
 the coin flips tails with probability 0.05, and
 only viruses and bacteria cause infection.
H_0 is widely held, so evidence against it would represent a breakthrough…
We also have a coin flip, C, and experimental data, D:
P(C,D|H_0) = P(C|H_0)P(D|H_0)
Kadane suggests the (crazy) statistic
S(C,D) = C
and then argues that, as
P(S=tails|H_0) = 0.05,
if this event occurs we can say:
either an unlikely event occurred or the hypothesis is false. This seems technically OK, but the hypothesis is a composite one on both the coin and on biology…
I think it would be most concerning if we observed heads and so chose not to reject the null, i.e. we chose to make no conclusion whatsoever…
… but if we also found that P(D|H_0) = 0, then it would seem that this analysis missed a significant breakthrough.
… there are quite a few issues floating around this example… including the Fisher vs Neyman-Pearson (frequentist-only) discussions on the requirement for an alternative hypothesis… in science occasionally you get a Kuru situation where P(D|H_0) = 0, so that you really don’t need an alternative hypothesis to reject the null (nor do you need a statistician)… another significant oddity in this example is that, the way Kadane introduced it, it is hard to imagine any reasonable hypothesis in which C and D were dependent…
Mayo’s wider point, that frequentist philosophy should not be rejected on the basis of this single caricature of frequentist practice, is well taken.
On the other hand, I don’t think that was Kadane’s intention. The example that Mayo cites is a parenthetical remark buried in a much more substantial example and discussion.
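The sketch of S(C, D) = C above can be turned into a small simulation, under loudly stated assumptions: the coin really does land tails with probability 0.05, and D is a made-up normal measurement whose mean shifts when the biological part of H_0 is false. The function name and the distribution of D are hypothetical. The point it illustrates is that a test based on C alone rejects about 5% of the time whether or not the biological claim holds.

```python
import random

def rejection_rate(h0_bio_true: bool, trials: int = 100_000, seed: int = 1) -> float:
    """Fraction of trials in which the statistic S(C, D) = C rejects H_0.

    D is simulated only to make the point that S never looks at it; its
    dependence on h0_bio_true is a made-up stand-in for real data.
    """
    rng = random.Random(seed)
    rejections = 0
    for _ in range(trials):
        c_is_tails = rng.random() < 0.05                 # coin: tails w.p. 0.05
        d = rng.gauss(0.0 if h0_bio_true else 3.0, 1.0)  # hypothetical D, ignored by S
        rejections += c_is_tails                         # reject iff S = C = tails
    return rejections / trials

rate_when_true = rejection_rate(True, seed=1)
rate_when_false = rejection_rate(False, seed=2)
print(rate_when_true, rate_when_false)  # both near 0.05
```

In this setup the rejection rate under the alternative (the power) equals the rejection rate under the null (the size), so the test cannot discriminate between the two.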