*See “rejected posts”.

# Monthly Archives: November 2013

## Saturday night comedy from a Bayesian diary (rejected post*)

## “The probability that it be a statistical fluke” [iia]

My rationale for the last post is really just to highlight such passages as:

“Particle physicists have agreed, by convention, not to view an observed phenomenon as a discovery until

the probability that it be a statistical fluke be below 1 in a million, a requirement that seems insanely draconian at first glance.” (Strassler)….

Even before the dust had settled regarding the discovery of a Standard Model-like Higgs particle, the nature and rationale of the 5-sigma discovery criterion began to be challenged. But my interest now is not in the fact that the 5-sigma discovery criterion is a convention, nor with the choice of 5. It is the understanding of “the probability that it be a statistical fluke” that interests me, because if we can get this right, I* think we can understand a kind of equivocation that leads many to suppose that significance tests are being misinterpreted—even when they aren’t!* So given that I’m stuck, unmoving, on this bus outside of London for 2+ hours (because of a car accident)—and the internet works—I’ll try to scratch out my point (expect errors, we’re moving now). Here’s another passage…

“Even when the probability of a particular statistical fluke, of a particular type, in a particular experiment seems to be very small indeed, we must remain cautious. …Is it really unlikely that someone, somewhere, will hit the jackpot, and see in their data an amazing statistical fluke that seems so impossible that it convincingly appears to be a new phenomenon?”

A very sketchy nutshell of the Higgs statistics: There is a general model of the detector, and within that model researchers define a “global signal strength” parameter “such that *H _{0}:* μ = 0 corresponds to the background only hypothesis and μ = 1 corresponds to the Standard Model (SM) Higgs boson signal in addition to the background” (quote from an ATLAS report). The statistical test may be framed as a one-sided test; the test statistic records differences in the positive direction, in standard deviation or sigma units. The interest is not in the point against point hypotheses, but in finding discrepancies from

*H*in the direction of the alternative, and then estimating their values. The improbability of the 5-sigma excess alludes to the sampling Continue reading

_{0}## Probability that it is a statistical fluke [i]

**From another blog:**

“…If there are 23 people in a room, the chance that two of them have the same birthday is 50 percent, while the chance that two of them were born on a particular day, say, January 1st, is quite low, a small fraction of a percent. The more you specify the coincidence, the rarer it is; the broader the range of coincidences at which you are ready to express surprise, the more likely it is that one will turn up.

Humans are notoriously incompetent at estimating these types of probabilities… which is why scientists (including particle physicists), when they see something unusual in their data, always try to quantify *the probability that it is a statistical fluke* — a pure chance event. You would not want to be wrong, and celebrate your future Nobel prize only to receive instead a booby prize. (And nature gives out lots and lots of booby prizes.) So scientists, grabbing their statistics textbooks and appealing to the latest advances in statistical techniques, compute these probabilities as best they can. Armed with these numbers, they then try to infer whether it is likely that they have actually discovered something new or not.

And on the whole, it doesn’t work. Unless the answer is so obvious that no statistical argument is needed, the numbers typically do not settle the question.

Despite this remark, you mustn’t think I am arguing against doing statistics. One has to do something better than guessing. But there is a reason for the old saw: “There are three types of falsehoods: lies, damned lies, and statistics.” It’s not that statistics themselves lie, but that to some extent, unless the case is virtually airtight, you can almost always choose to ask a question in such a way as to get any answer you want. … [For instance, in 1991 the volcano Pinatubo in the Philippines had its titanic eruption while a hurricane (or `typhoon’ as it is called in that region) happened to be underway. Oh, and the collapse of Lehman Brothers on Sept 15, 2008 was followed *within three days* by the breakdown of the Large Hadron Collider (LHC) during its first week of running… Coincidence? I-think-so.] One can draw completely different conclusions, both of them statistically sensible, by looking at the same data from two different points of view, and asking for the statistical answer to two different questions.

To a certain extent, this is just why Republicans and Democrats almost never agree, even if they are discussing the same basic data. The point of a spin-doctor is to figure out which question to ask in order to get the political answer that you wanted in advance. Obviously this kind of manipulation is unacceptable in science. Unfortunately it is also unavoidable. Continue reading

## Erich Lehmann: Statistician and Poet

Today is Erich Lehmann’s birthday. The last time I saw him was at the Second Lehmann conference in 2004, at which I organized a session on philosophical foundations of statistics (including David Freedman and D.R. Cox).

I got to know Lehmann, Neyman’s first student, in 1997. One day, I received a bulging, six-page, handwritten letter from him in tiny, extremely neat scrawl (and many more after that). He told me he was sitting in a very large room at an ASA meeting where they were shutting down the conference book display (or maybe they were setting it up), and on a very long, dark table sat just one book, all alone, shiny red. He said he wondered if it might be of interest to him! So he walked up to it…. It turned out to be my *Error and the Growth of Experimental Knowledge* (1996, Chicago), which he reviewed soon after. Some related posts on Lehmann’s letter are here and here.

That same year I remember having a last-minute phone call with Erich to ask how best to respond to a “funny Bayesian example” raised by Colin Howson. It is essentially the case of Mary’s positive result for a disease, where Mary is selected randomly from a population where the disease is very rare. See for example here. (It’s just like the case of our high school student Isaac). His recommendations were extremely illuminating, and with them he sent me a poem he’d written (which you can read in my published response here*). Aside from being a leading statistician, Erich had a (serious) literary bent.

The picture on the right was taken in 2003 (by A. Spanos).

Mayo, D. G (1997a), “Response to Howson and Laudan,” *Philosophy of Science* 64: 323-333.

(Selected) Books

*Testing Statistical Hypotheses*, 1959*Basic Concepts of Probability and Statistics*, 1964, co-author J. L. Hodges*Elements of Finite Probability*, 1965, co-author J. L. Hodges- Lehmann, Erich L.; With the special assistance of H. J. M. D’Abrera (2006).
*Nonparametrics: Statistical methods based on ranks*(Reprinting of 1988 revision of 1975 Holden-Day ed.). New York: Springer. pp. xvi+463. ISBN 978-0-387-35212-1. MR 2279708. *Theory of Point Estimation*, 1983*Elements of Large-Sample Theory (1988)*. New York: Springer Verlag.

*Reminiscences of a Statistician*, 2007, ISBN 978-0-387-71596-4*Fisher, Neyman, and the Creation of Classical Statistics*, 2011, ISBN 978-1-4419-9499-8 [published posthumously]

Articles (3 of very many)

- Lehmann, E.L.; Scheffé, H. (1950). “Completeness, similar regions, and unbiased estimation. I.”.
*Sankhyā: the Indian Journal of Statistics***10**(4): 305–340. JSTOR 25048038. MR 39201. - Lehmann, E.L.; Scheffé, H. (1955). “Completeness, similar regions, and unbiased estimation. II”.
*Sankhyā: the Indian Journal of Statistics***15**(3): 219–236. JSTOR 25048243. MR 72410. - Lehmann, E. L. 1993. “The Fisher, Neyman-Pearson Theories of Testing Hypotheses: One Theory or Two?”
*Journal of the American Statistical Association*88 (424): 1242–1249.

## Lucien Le Cam: “The Bayesians hold the Magic”

Today is Lucien Le Cam’s birthday. He was an error statistician whose remarks in an article, “A Note on Metastatisics,” in a collection on foundations of statistics (Le Cam 1977)* had some influence on me. A statistician at Berkeley, Le Cam was a co-editor with Neyman of the Berkeley Symposia volumes. I hadn’t mentioned him on this blog before, so here are some snippets from EGEK (Mayo, 1996, 337-8; 350-1) that begin with a snippet from a passage from Le Cam (1977) (Here I have fleshed it out):

“One of the claims [of the Bayesian approach] is that the experiment matters little, what matters is the likelihood function after experimentation. Whether this is true, false, unacceptable or inspiring, it tends to undo what classical statisticians have been preaching for many years: think about your experiment, design it as best you can to answer specific questions, take all sorts of precautions against selection bias and your subconscious prejudices. It is only at the design stage that the statistician can help you.

Another claim is the very curious one that if one follows the neo-Bayesian theory strictly one would not randomize experiments….However, in this particular case the injunction against randomization is a typical product of a theory which ignores differences between experiments and experiences and refuses to admit that there is a difference between events which are made equiprobable by appropriate mechanisms and events which are equiprobable by virtue of ignorance. …

In spite of this the neo-Bayesian theory places randomization on some kind of limbo, and thus attempts to distract from the classical preaching that double blind randomized experiments are the only ones really convincing.

There are many other curious statements concerning confidence intervals, levels of significance, power, and so forth. These statements are only confusing to an otherwise abused public”. (Le Cam 1977, 158)

Back to EGEK:

Why does embracing the Bayesian position tend to undo what classical statisticians have been preaching? Because Bayesian and classical statisticians view the task of statistical inference very differently,

In [chapter 3, Mayo 1996] I contrasted these two conceptions of statistical inference by distinguishing evidential-relationship or E-R approaches from testing approaches, … .

The E-R view is modeled on deductive logic, only with probabilities. In the E-R view, the task of a theory of statistics is to say, for given evidence and hypotheses, how well the evidence confirms or supports hypotheses (whether absolutely or comparatively). There is, I suppose, a certain confidence and cleanness to this conception that is absent from the error-statistician’s view of things. Error statisticians eschew grand and unified schemes for relating their beliefs, preferring a hodgepodge of methods that are truly ampliative. Error statisticians appeal to statistical tools as protection from the many ways they know they can be misled by data as well as by their own beliefs and desires. The value of statistical tools for them is to develop strategies that capitalize on their knowledge of mistakes: strategies for collecting data, for efficiently checking an assortment of errors, and for communicating results in a form that promotes their extension by others.

Given the difference in aims, it is not surprising that information relevant to the Bayesian task is very different from that relevant to the task of the error statistician. In this section I want to sharpen and make more rigorous what I have already said about this distinction.

…. the secret to solving a number of problems about evidence, I hold, lies in utilizing—formally or informally—the error probabilities of the procedures generating the evidence. It was the appeal to severity (an error probability), for example, that allowed distinguishing among the well-testedness of hypotheses that fit the data equally well… .

A few pages later in a section titled “*Bayesian Freedom, Bayesian Magic” (350-1):*

A big selling point for adopting the LP (strong likelihood principle), and with it the irrelevance of stopping rules, is that it frees us to do things that are sinful and forbidden to an error statistician.“This irrelevance of stopping rules to statistical inference restores a simplicity and freedom to experimental design that had been lost by classical emphasis on significance levels (in the sense of Neyman and Pearson). . . . Many experimenters would like to feel free to collect data until they have either conclusively proved their point, conclusively disproved it, or run out of time, money or patience … Classical statisticians … have frowned on [this]”. (Edwards, Lindman, and Savage 1963, 239)

^{1}Breaking loose from the grip imposed by error probabilistic requirements returns to us an appealing freedom.

Le Cam, … hits the nail on the head:

“It is characteristic of [Bayesian approaches] [2] . . . that they … tend to treat experiments and fortuitous observations alike. In fact, the main reason for their periodic return to fashion seems to be that they claim to hold the magic which permits [us] to draw conclusions from whatever data and whatever features one happens to notice”. (Le Cam 1977, 145)

In contrast, the error probability assurances go out the window if you are allowed to change the experiment as you go along. Repeated tests of significance (or sequential trials) are permitted, are even desirable for the error statistician; but a penalty must be paid for perseverance—for optional stopping. Before-trial planning stipulates how to select a small enough significance level to be on the lookout for at each trial so that the overall significance level is still low. …. Wearing our error probability glasses—glasses that compel us to see how certain procedures alter error probability characteristics of tests—we are forced to say, with Armitage, that “Thou shalt be misled if thou dost not know that” the data resulted from the try and try again stopping rule. To avoid having a high probability of following false leads, the error statistician must scrupulously follow a specified experimental plan. But that is because we hold that error probabilities of the procedure alter what the data are saying—whereas Bayesians do not. The Bayesian is permitted the luxury of optional stopping and has nothing to worry about. The Bayesians hold the magic.

Or is it voodoo statistics?

When I sent him a note, saying his work had inspired me, he modestly responded that he doubted he could have had all that much of an impact.

_____________

*I had forgotten that this *Synthese* (1977) volume on foundations of probability and statistics is the one dedicated to the memory of Allan Birnbaum after his suicide: “By publishing this special issue we wish to pay homage to professor Birnbaum’s penetrating and stimulating work on the foundations of statistics” (Editorial Introduction). In fact, I somehow had misremembered it as being in a Harper and Hooker volume from 1976. The *Synthese* volume contains papers by Giere, Birnbaum, Lindley, Pratt, Smith, Kyburg, Neyman, Le Cam, and Kiefer.

REFERENCES:

*Journal of the Royal Statistical Society (B)*23:1-37.

_______(1962). Contribution to discussion in *The foundations of statistical inference*, edited by L. Savage. London: Methuen.

_______(1975). *Sequential Medical Trials*. 2nd ed. New York: John Wiley & Sons.

Edwards, W., H. Lindman & L. Savage (1963) Bayesian statistical inference for psychological research. *Psychological Review* 70: 193-242.

Le Cam, L. (1974). J. Neyman: on the occasion of his 80th birthday. *Annals of Statistics*, Vol. 2, No. 3 , pp. vii-xiii, (with E.L. Lehmann).

Le Cam, L. (1977). A note on metastatistics or “An essay toward stating a problem in the doctrine of chances.” *Synthese* 36: 133-60.

Le Cam, L. (1982). A remark on empirical measures in *Festschrift in the honor of E. Lehmann. *P. Bickel, K. Doksum & J. L. Hodges, Jr. eds., Wadsworth pp. 305-327.

Le Cam, L. (1986). The central limit theorem around 1935. *Statistical Science*, Vol. 1, No. 1, pp. 78-96.

Le Cam, L. (1988) Discussion of “The Likelihood Principle,” by J. O. Berger and R. L. Wolpert. IMS Lecture Notes Monogr. Ser. 6 182–185. IMS, Hayward, CA

Le Cam, L. (1996) Comparison of experiments: A short review. In *Statistics, Probability and Game Theory. Papers in Honor of David Blackwell* 127–138. IMS, Hayward, CA.

Le Cam, L., J. Neyman and E. L. Scott (Eds). (1973). *Proceedings of the Sixth Berkeley Symposium on Mathematical Statistics and Probability*, Vol. l: *Theory of Statistics*, Vol. 2: *Probability Theory*, Vol. 3: *Probability Theory*. Univ. of Calif. Press, Berkeley Los Angeles.

Mayo, D. (1996). [EGEK] *Error Statistics and the Growth of Experimental Knowledge. *Chicago: University of Chicago Press. (Chapter 10; Chapter 3)

Neyman, J. and L. Le Cam (Eds). (1967). *Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability*, Vol. I: *Statistics*, Vol. II: *Probability* Part I & Part II. Univ. of Calif. Press, Berkeley and Los Angeles.

[1] For some links on optional stopping on this blog: Highly probably vs highly probed: Bayesian/error statistical differences.; Who is allowed to cheat? I.J. Good and that after dinner comedy hour….; New Summary; Mayo: (section 7) “StatSci and PhilSci: part 2″; After dinner Bayesian comedy hour….; Search for more, if interested.

[2] Le Cam is alluding mostly to Savage, and (what he called) the “neo-Bayesian” accounts.

## S. Stanley Young: More Trouble with ‘Trouble in the Lab’ (Guest post)

* Stanley Young’s guest post arose in connection with Kepler’s Nov. 13, and my November 9 post,and associated comments.*

**S. Stanley Young, PhD** Assistant Director for Bioinformatics National Institute of Statistical Sciences Research Triangle Park, NC

Much is made by some of the experimental biologists that their art is oh so sophisticated that mere mortals do not have a chance [to successfully replicate]. Bunk. Agriculture replicates all the time. That is why food is so cheap. The world is growing much more on fewer acres now than it did 10 years ago. Materials science is doing remarkable things using mixtures of materials. Take a look at just about any sports equipment. These two areas and many more use statistical methods: design of experiments, randomization, blind reading of results, etc. and these methods replicate, quite well, thank you. Read about Edwards Deming. Experimental biology experiments are typically run by small teams in what is in effect a cottage industry. Herr professor is usually not in the lab. He/she is busy writing grants. A “hands” guy is in the lab. A computer guy does the numbers. No one is checking other workers’ work. It is a cottage industry to produce papers.

There is a famous failure to replicate that appeared in Science. A pair of non-estrogens was reported to have a strong estrogenic effect. Six labs wrote into Science saying the could not replicate the effect. I think the back story is as follows. The hands guy tested a very large number of pairs of chemicals. The most extreme pair looked unusual. Lab boss said, write it up. Every assay has some variability, so they reported extreme variability as real. Failure to replicate in six labs. Science editors says, what gives. Lab boss goes to hands guy and says run the pair again. No effect. Lab boss accuses hands guy of data fabrication. They did not replicate their own finding before rushing to publish. I asked the lab for the full data set, but they refused to provide the data. The EPA is still chasing this will of the wisp, environmental estrogens. False positive results with compelling stories can live a very long time. See [i].

Begley and Ellis visited labs. They saw how the work was done. There are instances where something was tried over and over and when it worked “as expected”, it was a rap. Write the paper and move on. I listened to a young researcher say that she tried for 6 months to replicate results of a paper. Informal conversations with scientists support very poor replication.

One can say that the jury is out as there have been few serious attempts to systematically replicate. There is now starting systematic replication. I say less than 50% of experimental biology claims will replicate.

[i]**Hormone Hysterics.** Tulane University researchers published a 1996 study claiming that combinations of manmade chemicals (pesticides and PCBs) disrupted normal hormonal processes, causing everything from cancer to infertility to attention deficit disorder.

Media, regulators and environmentalists hailed the study as “astonishing.” Indeed it was as it turned out to be fraud, according to an October 2001 report by federal investigators. Though the study was retracted from publication, the law it spawned wasn’t and continues to be enforced by the EPA. Read more…

## T. Kepler: “Trouble with ‘Trouble at the Lab’?” (guest post)

Tom Kepler’s guest post arose in connection with my November 9 post & comments.

Professor Thomas B. KeplerDepartment of Microbiology

Department of Mathematics & Statistics

Boston University School of Medicine

There is much to say about the article in the Economist, but the first is to note that it is far more balanced than its sensational headline promises. Promising to throw open the curtain on “Unreliable research” is mere click-bait for the science-averse readers who have recently found validation against their intellectual insecurities in the populist uprising against the shadowy world of the scientist. What with the East Anglia conspiracy, and so on, there’s no such thing as “too skeptical” when it comes to science.

There is some remarkably casual reporting in an article that purports to be concerned with mechanisms to assure that inaccuracies not be perpetuated.

For example, the authors cite the comment in *Nature* by Begley and Ellis and summarize it thus: …scientists at Amgen, an American drug company, tried to replicate 53 studies that they considered landmarks in the basic science of cancer, often co-operating closely with the original researchers to ensure that their experimental technique matched the one used first time round. Stan Young, in his comments to Mayo’s blog adds, “These claims can not be replicated – even by the original investigators! Stop and think of that.” But in fact the role of the original investigators is described as follows in Begley and Ellis: “…when findings could not be reproduced, *an attempt was made* to contact the original authors, discuss the discrepant findings, exchange reagents and repeat experiments under the authors’ direction, *occasionally *even in the laboratory of the original investigator.” (Emphasis added.) Now, please stop and think about what agenda is served by eliding the tempered language of the original.

Both the Begley and Ellis comment and the brief correspondence by Prinz et al. also cited in this discussion are about laboratories in commercial pharmaceutical companies failing to reproduce experimental results. While deciding how to interpret their findings, it would be prudent to bear in mind the insight from Harry Collins, the sociologist of science paraphrased in the Economist piece as indicating that “performing an experiment always entails what sociologists call “tacit knowledge”—craft skills and extemporisations that their possessors take for granted but can pass on only through example. Thus if a replication fails, it could be because the repeaters didn’t quite get these *je-ne-sais-quoi* bits of the protocol right.” Indeed, I would go further and conjecture that few experimental biologists would hold out hope that any one laboratory could claim the expertise necessary to reproduce the results of 53 ground-breaking papers in diverse specialties, even within cancer drug discovery. And to those who are unhappy that authors often do not comply with the journals’ clear policy of data-sharing, how do you suppose you would fare getting such data from the pharmaceutical companies that wrote these damning papers? Or the authors of the papers themselves? Nature had to clarify, writing two months after the publication of Begley and Ellis, “*Nature*, like most journals, requires authors of research papers to make their data available on request. In this less formal Comment, we chose not to enforce this requirement so that Begley and Ellis could abide by the legal agreements [they made with the original authors].” Continue reading

## Beware of questionable front page articles warning you to beware of questionable front page articles (iii)

In this time of government cut-backs and sequester, scientists are under increased pressure to dream up ever new strategies to publish attention-getting articles with eye-catching, but inadequately scrutinized, conjectures. Science writers are under similar pressures, and to this end they have found a way to deliver up at least one fire-breathing, front page article a month. How? By writing minor variations on an article about how in this time of government cut-backs and sequester, scientists are under increased pressure to dream up ever new strategies to publish attention-getting articles with eye-catching, but inadequately scrutinized, conjectures.

Thus every month or so we see retreads on why most scientific claims are unreliable, biased, wrong, and not even wrong. Maybe that’s the reason the authors of a recent article in The Economist (“Trouble at the Lab“) remain anonymous.

I don’t disagree with everything in the article; on the contrary, part of their strategy is to include such well known problems as publication bias, problems with priming studies in psychology, and failed statistical assumptions. But the “big news”–the one that sells– is that “to an alarming degree” science (as a whole) is not reliable and not self-correcting. The main evidence is that there are the factory-like (thumbs up/thumbs down) applications of statistics in exploratory, hypotheses generating contexts wherein the goal is merely screening through reams of associations to identify a smaller batch for further analysis. But do even those screening efforts claim to have evidence of a genuine relationship when a given H is spewed out of their industrial complexes? Do they go straight to press after one statistically significant result? I don’t know, maybe some do. What I do know is that the generalizations we are seeing in these “gotcha” articles are every bit as guilty of sensationalizing without substance as the bad statistics they purport to be impugning. As they see it, scientists, upon finding a single statistically significant result at the 5% level, declare an effect real or a hypothesis true, and then move on to the next hypothesis. No real follow-up scrutiny, no building on discrepancies found, no triangulation, self-scrutiny, etc.

But even so, the argument which purports to follow from “statistical logic”, but which actually is a jumble of “up-down” significance testing, Bayesian calculations, and computations that might at best hold for crude screening exercises (e.g., for associations between genes and disease) commits blunders about statistical power, and founders. Never mind that if the highest rate of true outputs was wanted, scientists would dabble in trivialities….Never mind that I guarantee if you asked Nobel prize winning scientists the rate of correct attempts vs blind alleys they went through before their Prize winning results, they’d say far more than 50% errors, (Perrin and Brownian motion, Prusiner and Prions, experimental general relativity, just to name some I know.)

* But what about the statistics?* Continue reading

## Forthcoming paper on the strong likelihood principle

My paper, “On the Birnbaum Argument for the Strong Likelihood Principle” has been accepted by *Statistical Science*. The latest version is here. (It differs from all versions posted anywhere). If you spot any typos, please let me know (error@vt.edu). If you can’t open this link, please write to me and I’ll send it directly. As always, comments and queries are welcome.

I appreciate considerable feedback on SLP on this blog. Interested readers may search this blog for quite a lot of discussion of the SLP (e.g., here and here) including links to the central papers, “U-Phils” (commentaries) by others (e.g., here, here, and here), and amusing notes (e.g., Don’t Birnbaumize that experiment my friend, and Midnight with Birnbaum), and more…..

Abstract: An essential component of inference based on familiar frequentist notions, such as p-values, significance and confidence levels, is the relevant sampling distribution. This feature results in violations of a principle known as the strong likelihood principle (SLP), the focus of this paper. In particular, if outcomes

x^{∗}andy^{∗}from experimentsE_{1}andE_{2}(both with unknown parameterθ), have different probability modelsf_{1}( . ),f_{2}( . ), then even thoughf_{1}(x^{∗};θ) = cf_{2}(y^{∗};θ) for allθ, outcomesx^{∗}andy^{∗}may have different implications for an inference aboutθ. Although such violations stem from considering outcomes other than the one observed, we argue, this does not require us to consider experiments other than the one performed to produce the data. David Cox (1958) proposes the Weak Conditionality Principle (WCP) to justify restricting the space of relevant repetitions. The WCP says that once it is known whichEproduced the measurement, the assessment should be in terms of the properties of_{i}E. The surprising upshot of Allan Birnbaum’s (1962) argument is that the SLP appears to follow from applying the WCP in the case of mixtures, and so uncontroversial a principle as sufficiency (SP). But this would preclude the use of sampling distributions. The goal of this article is to provide a new clarification and critique of Birnbaum’s argument. Although his argument purports that [(WCP and SP), entails SLP], we show how data may violate the SLP while holding both the WCP and SP. Such cases also refute [WCP entails SLP]._{i}

Key words:Birnbaumization, likelihood principle (weak and strong), sampling theory, sufficiency, weak conditionality

** **

## Oxford Gaol: Statistical Bogeymen

Memory Lane: 2 years ago. Oxford Jail (also called Oxford Castle) is an entirely fitting place to be on (and around) Halloween! Moreover, rooting around this rather lavish set of jail cells (what used to be a single cell is now a dressing room) is every bit as conducive to philosophical reflection as is exile on Elba! (I’m serious, it is now a boutique hotel.) My goal (while in this gaol—as the English sometimes spell it) is to try and free us from the bogeymen and bogeywomen often associated with “classical” statistics. As a start, the very term “classical statistics” should I think be shelved, not that names should matter.

In appraising statistical accounts at the foundational level, we need to realize the extent to which accounts are viewed through the eyeholes of a mask or philosophical theory. Moreover, the mask some wear while pursuing this task might well be at odds with their ordinary way of looking at evidence, inference, and learning. In any event, to avoid non-question-begging criticisms, the standpoint from which the appraisal is launched must itself be independently defended. But for (most) Bayesian critics of error statistics the assumption that uncertain inference demands a posterior probability for claims inferred is thought to be so obvious as not to require support. Critics are implicitly making assumptions that are at odds with the frequentist statistical philosophy. In particular, they assume a certain philosophy about statistical inference (probabilism), often coupled with the allegation that error statistical methods can only achieve radical behavioristic goals, wherein all that matters are long-run error rates (of some sort)

Criticisms then follow readily: the form of one or both:

- Error probabilities do not supply posterior probabilities in hypotheses, interpreted as if they do (and some say we just can’t help it), they lead to inconsistencies
- Methods with good long-run error rates can give rise to counterintuitive inferences in particular cases.
- I have proposed an alternative philosophy that replaces these tenets with different ones:
- the role of probability in inference is to quantify how reliably or severely claims (or discrepancies from claims) have been tested
- the severity goal directs us to the relevant error probabilities, avoiding the oft-repeated statistical fallacies due to tests that are overly sensitive, as well as those insufficiently sensitive to particular errors.
- Control of long run error probabilities, while necessary is not sufficient for good tests or warranted inferences.

What is key on the statistics side of this alternative philosophy is that the probabilities refer to the distribution of a statistic d(x)—the so-called sampling distribution. Hence such accounts are often called sampling theory accounts. Since the sampling distribution is the basis for error probabilities, another term might be error statistical. Continue reading