Excursion 1 Tour I (3rd stop): The Current State of Play in Statistical Foundations: A View From a Hot-Air Balloon (1.3)


How can a discipline, central to science and to critical thinking, have two methodologies, two logics, two approaches that frequently give substantively different answers to the same problems? … Is complacency in the face of contradiction acceptable for a central discipline of science? (Donald Fraser 2011, p. 329)

We [statisticians] are not blameless … we have not made a concerted professional effort to provide the scientific world with a unified testing methodology. (J. Berger 2003, p. 4)

From the aerial perspective of a hot-air balloon, we may see contemporary statistics as a place of happy multiplicity: the wealth of computational ability allows for the application of countless methods, with little handwringing about foundations. Doesn’t this show we may have reached “the end of statistical foundations”? One might have thought so. Yet, descending close to a marshy wetland, and especially scratching a bit below the surface, reveals unease on all sides. The false dilemma between probabilism and long-run performance lets us get a handle on it. In fact, the Bayesian versus frequentist dispute arises as a dispute between probabilism and performance. This gets to my second reason for why the time is right to jump back into these debates: the “statistics wars” present new twists and turns. Rival tribes are more likely to live closer and in mixed neighborhoods since around the turn of the century. Yet, to the beginning student, it can appear as a jungle.

Statistics Debates: Bayesian versus Frequentist

These days there is less distance between Bayesians and frequentists, especially with the rise of objective [default] Bayesianism, and we may even be heading toward a coalition government. (Efron 2013, p. 145)

A central way to formally capture probabilism is by means of the formula for conditional probability, where Pr(x) > 0:

Since Pr(H and x) = Pr(x|H)Pr(H) and Pr(x) = Pr(x|H)Pr(H) + Pr(x|~H)Pr(~H), we get:

where ~H is the denial of H. It would be cashed out in terms of all rivals to H within a frame of reference. Some call it Bayes’ Rule or inverse probability. Leaving probability uninterpreted for now, if the data are very improbable given H, then our probability in H after seeing x, the posterior probability Pr(H|x), may be lower than the probability in H prior to x, the prior prob- ability Pr(H). Bayes’ Theorem is just a theorem stemming from the definition of conditional probability; it is only when statistical inference is thought to be encompassed by it that it becomes a statistical philosophy. Using Bayes’ Theorem doesn’t make you a Bayesian.

Larry Wasserman, a statistician and master of brevity, boils it down to a contrast of goals. According to him (2012b):

The Goal of Frequentist Inference: Construct procedure with frequentist guarantees [i.e., low error rates].

The Goal of Bayesian Inference: Quantify and manipulate your degrees of beliefs. In other words, Bayesian inference is the Analysis of Beliefs.

At times he suggests we use B(H) for belief and F(H) for frequencies. The distinctions in goals are too crude, but they give a feel for what is often regarded as the Bayesian-frequentist controversy. However, they present us with the false dilemma (performance or probabilism) I’ve said we need to get beyond.

Today’s Bayesian–frequentist debates clearly differ from those of some years ago. In fact, many of the same discussants, who only a decade ago were arguing for the irreconcilability of frequentist P-values and Bayesian measures, are now smoking the peace pipe, calling for ways to unify and marry the two. I want to show you what really drew me back into the Bayesian–frequentist debates sometime around 2000. If you lean over the edge of the gondola, you can hear some Bayesian family feuds starting around then or a bit after. Principles that had long been part of the Bayesian hard core are being questioned or even abandoned by members of the Bayesian family. Suddenly sparks are flying, mostly kept shrouded within Bayesian walls, but nothing can long be kept secret even there. Spontaneous combustion looms. Hard core subjectivists are accusing the increasingly popular “objective (non-subjective)” and “reference” Bayesians of practicing in bad faith; the new frequentist–Bayesian unificationists are taking pains to show they are not subjective; and some are calling the new Bayesian kids  on  the block  “pseudo Bayesian.” Then there are the Bayesians camping somewhere in the middle (or perhaps out in left field) who, though they still use the Bayesian umbrella, are flatly denying the very idea that Bayesian updating fits anything they actually do in statistics. Obeisance to Bayesian reasoning remains, but on some kind of a priori philosophical grounds. Let’s start with the unifications.

While subjective Bayesianism offers an algorithm for coherently updating prior degrees of belief in possible hypotheses H1, H2, …, Hn, these unifications fall under the umbrella of non-subjective Bayesian paradigms. Here the prior probabilities in hypotheses are not taken to express degrees of belief but are given by various formal assignments, ideally to have minimal impact on the posterior probability. I will call such Bayesian priors default. Advocates of unifications are keen to show that (i) default Bayesian methods have good performance in a long series of repetitions – so probabilism may yield performance; or alternatively, (ii) frequentist quantities are similar to Bayesian ones (at least in certain cases) – so performance may yield probabilist numbers. Why is this not bliss? Why are so many from all sides dissatisfied?

True blue subjective Bayesians are understandably unhappy with non- subjective priors. Rather than quantify prior beliefs, non-subjective priors are viewed as primitives or conventions for obtaining posterior probabilities. Take Jay Kadane (2008):

The growth in use and popularity of Bayesian methods has stunned many of us who were involved in exploring their implications decades ago. The result … is that there are users of these methods who do not understand the philosophical basis of the methods they are using, and hence may misinterpret or badly use the results … No doubt helping people to use Bayesian methods more appropriately is an important task of our time. (p. 457, emphasis added)

I have some sympathy here: Many modern Bayesians aren’t aware of the traditional philosophy behind the methods they’re buying into. Yet there is not just one philosophical basis for a given set of methods. This takes us to one of the most dramatic shifts in contemporary statistical foundations. It had long been assumed that only subjective or personalistic Bayesianism had a shot at providing genuine philosophical foundations, but you’ll notice that groups holding this position, while they still dot the landscape in 2018, have been gradually shrinking. Some Bayesians have come to question whether the wide- spread use of methods under the Bayesian umbrella, however useful, indicates support for subjective Bayesianism as a foundation.

Marriages of Convenience?

The current frequentist–Bayesian unifications are often marriages of convenience; statisticians rationalize them less on philosophical than on practical grounds. For one thing, some are concerned that methodological conflicts are bad for the profession. For another, frequentist tribes, contrary to expectation, have not disappeared. Ensuring that accounts can control their error probabilities remains a desideratum that scientists are unwilling to forgo. Frequentists have an incentive to marry as well. Lacking a suitable epistemic interpretation of error probabilities – significance levels, power, and confidence levels – frequentists are constantly put on the defensive. Jim Berger (2003) proposes a construal of significance tests on which the tribes of Fisher, Jeffreys, and Neyman could agree, yet none of the chiefs of those tribes concur (Mayo 2003b). The success stories are based on agreements on numbers that are not obviously true to any of the three philosophies. Beneath the surface – while it’s not often said in polite company – the most serious disputes live on. I plan to lay them bare.

If it’s assumed an evidential assessment of hypothesis H should take the form of a posterior probability of H – a form of probabilism – then P-values and confidence levels are applicable only through misinterpretation and mistranslation. Resigned to live with P-values, some are keen to show that construing them as posterior probabilities is not so bad (e.g., Greenland and Poole  2013).  Others  focus  on  long-run  error  control,  but  cede  territory wherein probability captures the epistemological ground of statistical inference. Why assume significance levels and confidence levels lack an authentic epistemological function? I say they do: to secure and evaluate how well probed and how severely tested claims are.

Eclecticism and Ecumenism

If you look carefully between dense forest trees, you can distinguish unification country from lands of eclecticism (Cox 1978) and ecumenism (Box 1983), where tools first constructed by rival tribes are separate, and more or less equal (for different aims). Current-day eclecticisms have a long history – the dabbling in tools from competing statistical tribes has not been thought to pose serious challenges. For example, frequentist methods have long been employed to check or calibrate Bayesian methods (e.g., Box 1983); you might test your statistical model using a simple significance test, say, and then proceed to Bayesian updating. Others suggest scrutinizing a posterior probability or a likelihood ratio from an error probability standpoint. What this boils down to will depend on the notion of probability used. If a procedure frequently gives high probability for claim C even if C is false, severe testers deny convincing evidence has been provided, and never mind about the meaning of probability. One argument is that throwing different methods at a problem is all to the good, that it increases the chances that at least one will get it right. This may be so, provided one understands how to interpret competing answers. Using multiple  methods  is  valuable  when  a  shortcoming  of  one  is  rescued  by a strength in another. For example, when randomized studies are used to expose the failure to replicate observational studies, there is a presumption that the former is capable of discerning problems with the latter. But what happens if one procedure fosters a goal that is not recognized or is even opposed by another? Members of rival tribes are free to sneak ammunition from a rival’s arsenal – but what if at the same time they denounce the rival method as useless or ineffective?

Decoupling. On the horizon is the idea that statistical methods may be decoupled from the philosophies in which they are traditionally couched. In an attempted meeting of the minds (Bayesian and error statistical), Andrew Gelman and Cosma Shalizi (2013) claim that “implicit in the best Bayesian practice is a stance that has much in common with the error-statistical approach of Mayo” (p. 10). In particular, Bayesian model checking, they say, uses statistics to satisfy Popperian criteria for severe tests. The idea of error statistical foundations for Bayesian tools is not as preposterous as it may seem. The concept of severe testing is sufficiently general to apply to any of the methods now in use. On the face of it, any inference, whether to the adequacy of a model or to a posterior probability, can be said to be warranted just to the extent that it has withstood severe testing. Where this will land us is still futuristic.

Why Our Journey?

We have all, or nearly all, moved past these old [Bayesian-frequentist] debates, yet our textbook explanations have not caught up with the eclecticism of statistical practice. (Kass 2011, p. 1)

When Kass proffers “a philosophy that matches contemporary attitudes,” he finds resistance to his big tent. Being hesitant to reopen wounds from old battles does not heal them. Distilling them in inoffensive terms just leads to the marshy swamp. Textbooks can’t “catch-up” by soft-peddling competing statistical accounts. They show up in the current problems of scientific integrity, irreproducibility, questionable research practices, and in the swirl of methodological reforms and guidelines that spin their way down from journals and reports.

From an elevated altitude we see how it occurs. Once high-profile failures of replication spread to biomedicine, and other “hard” sciences, the problem took on a new seriousness. Where does the new scrutiny look? By and large, it collects from the earlier social science “significance test controversy” and the traditional philosophies coupled to Bayesian and frequentist accounts, along with the newer Bayesian–frequentist unifications we just surveyed. This jungle has never been disentangled. No wonder leading reforms and semi-popular guidebooks contain misleading views about all these tools. No wonder we see the same fallacies that earlier reforms were designed to avoid, and even brand new ones. Let me be clear, I’m not speaking about flat-out howlers such as interpreting a P-value as a posterior probability. By and large, they are more subtle; you’ll want to reach your own position on them. It’s not a matter of switching your tribe, but excavating the roots of tribal warfare. To tell what’s true about them. I don’t mean understand them at the socio-psychological levels, although there’s a good story there (and I’ll leak some of the juicy parts during our travels).

How can we make progress when it is difficult even to tell what is true about the different methods of statistics? We must start afresh, taking responsibility to offer a new standpoint from which to interpret the cluster of tools around which there has been so much controversy. Only then can we alter and extend their limits. I admit that the statistical philosophy that girds our explorations is not out there ready-made; if it was, there would be no need for our holiday cruise. While there are plenty of giant shoulders on which we stand, we won’t be restricted by the pronouncements of any of the high and low priests, as sagacious as many of their words have been. In fact, we’ll brazenly question some of their most entrenched mantras. Grab on to the gondola, our balloon’s about to land.

In Tour II, I’ll give you a glimpse of the core behind statistics battles, with a firm promise to retrace the steps more slowly in later trips.

FOR ALL OF TOUR I: SIST Excursion 1 Tour I

THE FULL ITINERARY: Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars: SIST Itinerary



Berger, J. (2003). ‘Could Fisher, Jeffreys and Neyman Have Agreed on Testing?’ and ‘Rejoinder’, Statistical Science 18(1), 1–12; 28–32.

Box, G. (1983). ‘An Apology for Ecumenism in Statistics’, in Box, G., Leonard, T., and Wu, D. (eds.), Scientific Inference, Data Analysis, and Robustness, New York:
Academic Press, 51–84.

Cox, D. (1978). ‘Foundations of Statistical Inference: The Case for Eclecticism’, Australian Journal of Statistics 20(1), 43–59.

Efron, B. (2013). ‘A 250-Year Argument: Belief, Behavior, and the Bootstrap’, Bulletin of the American Mathematical Society 50(1), 126–46.

Fraser, D. (2011). ‘Is Bayes Posterior Just Quick and Dirty Confidence?’ and ‘Rejoinder’, Statistical Science 26(3), 299–316; 329–31.

Gelman, A. and Shalizi, C. (2013). ‘Philosophy and the Practice of Bayesian Statistics’ and ‘Rejoinder’, British Journal of Mathematical and Statistical Psychology 66(1), 8–38; 76–80.

Greenland, S. and Poole, C. (2013). ‘Living with P Values: Resurrecting a Bayesian Perspective on Frequentist Statistics’ and ‘Rejoinder: Living with Statistics in Observational Research’, Epidemiology 24(1), 62–8; 73–8.

Kadane, J. (2008). ‘Comment on Article by Gelman’, Bayesian Analysis 3(3), 455–8.

Kass, R. (2011). ‘Statistical Inference: The Big Picture (with discussion and rejoinder)’, Statistical Science 26(1), 1–20.

Mayo, D. (2003b). ‘Could Fisher, Jeffreys and Neyman Have Agreed on Testing? Commentary on J. Berger’s Fisher Address’, Statistical Science 18, 19–24.

Wasserman, L. (2012b). ‘What is Bayesian/Frequentist Inference?’, Blogpost on normaldeviate.wordpress.com (11/7/2012).


Categories: Statistical Inference as Severe Testing | Leave a comment

Excursion 1 Tour I (2nd stop): Probabilism, Performance, and Probativeness (1.2)


I shall be concerned with the foundations of the subject. But in case it should be thought that this means I am not here strongly concerned with practical applications, let me say right away that confusion about the foundations of the subject is responsible, in my opinion, for much of the misuse of the statistics that one meets in fields of application such as medicine, psychology, sociology, economics, and so forth. (George Barnard 1985, p. 2)

While statistical science (as with other sciences) generally goes about its business without attending to its own foundations, implicit in every statistical methodology are core ideas that direct its principles, methods, and interpretations. I will call this its statistical philosophy. To tell what’s true about statistical inference, understanding the associated philosophy (or philosophies) is essential. Discussions of statistical foundations tend to focus on how to interpret probability, and much less on the overarching question of how probability ought to be used in inference. Assumptions about the latter lurk implicitly behind debates, but rarely get the limelight. If we put the spotlight on them, we see that there are two main philosophies about the roles of probability in statistical inference: We may dub them performance (in the long run) and probabilism. Continue reading

Categories: Statistical Inference as Severe Testing | 3 Comments

Excursion 1 Tour I: Beyond Probabilism and Performance: Severity Requirement (1.1)

The cruise begins…

I’m talking about a specific, extra type of integrity that is [beyond] not lying, but bending over backwards to show how you’re maybe wrong, that you ought to have when acting as a scientist. (Feynman 1974/1985, p. 387)

It is easy to lie with statistics. Or so the cliché goes. It is also very difficult to uncover these lies without statistical methods – at least of the right kind. Self- correcting statistical methods are needed, and, with minimal technical fanfare, that’s what I aim to illuminate. Since Darrell Huff wrote How to Lie with Statistics in 1954, ways of lying with statistics are so well worn as to have emerged in reverberating slogans:

  • Association is not causation.
  • Statistical significance is not substantive significamce
  • No evidence of risk is not evidence of no risk.
  • If you torture the data enough, they will confess.

Exposés of fallacies and foibles ranging from professional manuals and task forces to more popularized debunking treatises are legion. New evidence has piled up showing lack of replication and all manner of selection and publication biases. Even expanded “evidence-based” practices, whose very rationale is to emulate experimental controls, are not immune from allegations of illicit cherry picking, significance seeking, P-hacking, and assorted modes of extra- ordinary rendition of data. Attempts to restore credibility have gone far beyond the cottage industries of just a few years ago, to entirely new research programs: statistical fraud-busting, statistical forensics, technical activism, and widespread reproducibility studies. There are proposed methodological reforms – many are generally welcome (preregistration of experiments, transparency about data collection, discouraging mechanical uses of statistics), some are quite radical. If we are to appraise these evidence policy reforms, a much better grasp of some central statistical problems is needed. Continue reading

Categories: Statistical Inference as Severe Testing, Statistics | 8 Comments

The Physical Reality of My New Book! Here at the RSS Meeting


Categories: SIST | 2 Comments

RSS 2018 – Significance Tests: Rethinking the Controversy


Day 2, Wednesday 05/09/2018

11:20 – 13:20

Keynote 4 – Significance Tests: Rethinking the Controversy Assembly Room

Sir David Cox, Nuffield College, Oxford
Deborah Mayo, Virginia Tech
Richard Morey, Cardiff University
Aris Spanos, Virginia Tech

Intermingled in today’s statistical controversies are some long-standing, but unresolved, disagreements on the nature and principles of statistical methods and the roles for probability in statistical inference and modelling. In reaction to the so-called “replication crisis” in the sciences, some reformers suggest significance tests as a major culprit. To understand the ramifications of the proposed reforms, there is a pressing need for a deeper understanding of the source of the problems in the sciences and a balanced critique of the alternative methods being proposed to supplant significance tests. In this session speakers offer perspectives on significance tests from statistical science, econometrics, experimental psychology and philosophy of science. There will be also be panel discussion.

Categories: Error Statistics | 2 Comments


3 years ago...

3 years ago…

MONTHLY MEMORY LANE: 3 years ago: August 2015. I mark in red 3-4 posts that seem most apt for general background on key issues in this blog, excluding those reblogged recently[1], and in green up to 3 others of relevance to philosophy of statistics [2]. Posts that are part of a “unit” or a group count as one.

August 2015

  • 08/05 Neyman: Distinguishing tests of statistical hypotheses and tests of significance might have been a lapse of someone’s pen
  • 08/08  Statistical Theater of the Absurd: “Stat on a Hot Tin Roof”
  • 08/11 A. Spanos: Egon Pearson’s Neglected Contributions to Statistics (recently reblogged)
  • 08/14 Performance or Probativeness? E.S. Pearson’s Statistical Philosophy
  • 08/15  Severity in a Likelihood Text by Charles Rohde
  • 08/19 Statistics, the Spooky Science
  • 08/20 How to avoid making mountains out of molehills, using power/severity
  • 08/24 3 YEARS AGO (AUGUST 2012): MEMORY LANE
  • 08/31 The Paradox of Replication, and the vindication of the P-value (but she can go deeper) 9/2/15 update (ii)

[1] Monthly memory lanes began at the blog’s 3-year anniversary in Sept, 2014.

[2] New Rule, July 30, 2016, March 30,2017 -a very convenient way to allow data-dependent choices (note why it’s legit in selecting blog posts, on severity grounds).












Categories: 3-year memory lane | Leave a comment

A. Spanos: Egon Pearson’s Neglected Contributions to Statistics

Continuing with the discussion of E.S. Pearson in honor of his birthday:

Egon Pearson’s Neglected Contributions to Statistics

by Aris Spanos

    Egon Pearson (11 August 1895 – 12 June 1980), is widely known today for his contribution in recasting of Fisher’s significance testing into the Neyman-Pearson (1933) theory of hypothesis testing. Occasionally, he is also credited with contributions in promoting statistical methods in industry and in the history of modern statistics; see Bartlett (1981). What is rarely mentioned is Egon’s early pioneering work on:

(i) specification: the need to state explicitly the inductive premises of one’s inferences,

(ii) robustness: evaluating the ‘sensitivity’ of inferential procedures to departures from the Normality assumption, as well as

(iii) Mis-Specification (M-S) testing: probing for potential departures from the Normality  assumption.

Arguably, modern frequentist inference began with the development of various finite sample inference procedures, initially by William Gosset (1908) [of the Student’s t fame] and then Fisher (1915, 1921, 1922a-b). These inference procedures revolved around a particular statistical model, known today as the simple Normal model: Continue reading

Categories: E.S. Pearson, Egon Pearson, Statistics | 1 Comment

Egon Pearson’s Heresy

E.S. Pearson: 11 Aug 1895-12 June 1980.

Today is Egon Pearson’s birthday. In honor of his birthday, I am posting “Statistical Concepts in Their Relation to Reality” (Pearson 1955). I’ve posted it several times over the years, but always find a new gem or two, despite its being so short. E. Pearson rejected some of the familiar tenets that have come to be associated with Neyman and Pearson (N-P) statistical tests, notably the idea that the essential justification for tests resides in a long-run control of rates of erroneous interpretations–what he termed the “behavioral” rationale of tests. In an unpublished letter E. Pearson wrote to Birnbaum (1974), he talks about N-P theory admitting of two interpretations: behavioral and evidential:

“I think you will pick up here and there in my own papers signs of evidentiality, and you can say now that we or I should have stated clearly the difference between the behavioral and evidential interpretations. Certainly we have suffered since in the way the people have concentrated (to an absurd extent often) on behavioral interpretations”.

Continue reading

Categories: phil/history of stat, Philosophy of Statistics, Statistics | Tags: , , | 2 Comments

For Popper’s Birthday: Reading from Conjectures and Refutations (+ self-test)


28 July 1902 – 17 September 1994

Today is Karl Popper’s birthday. I’m linking to a reading from his Conjectures and Refutations[i] along with: Popper Self-Test Questions. It includes multiple choice questions, quotes to ponder, an essay, and thumbnail definitions at the end[ii].

Blog Readers who wish to send me their answers will have their papers graded [use the comments or error@vt.edu.] An A- or better earns a signed copy of my forthcoming book: Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars. [iii]

[i] Popper reading from Conjectures and Refutations
[ii] I might note the “No-Pain philosophy” (3 part) Popper posts on this blog: parts 12, and 3.

[iii] I posted this once before, but now I have a better prize.



Popper, K. (1962). Conjectures and Refutations: The Growth of Scientific Knowledge. New York: Basic Books.



Categories: Popper | Leave a comment


3 years ago...

3 years ago…

MONTHLY MEMORY LANE: 3 years ago: July 2015. I mark in red 3-4 posts from each month that seem most apt for general background on key issues in this blog, excluding those reblogged recently[1], and in green up to 3 others of general relevance to philosophy of statistics [2].  Posts that are part of a “unit” or a group count as one.

July 2015

  • 07/03 Larry Laudan: “When the ‘Not-Guilty’ Falsely Pass for Innocent”, the Frequency of False Acquittals (guest post)
  • 07/09  Winner of the June Palindrome contest: Lori Wike
  • 07/11 Higgs discovery three years on (Higgs analysis and statistical flukes)-reblogged recently
  • 07/14  Spot the power howler: α = ß?
  • 07/17  “Statistical Significance” According to the U.S. Dept. of Health and Human Services (ii)
  • 07/22 3 YEARS AGO (JULY 2012): MEMORY LANE
  • 07/24 Stephen Senn: Randomization, ratios and rationality: rescuing the randomized clinical trial from its critics
  • 07/29  Telling What’s True About Power, if practicing within the error-statistical tribe


[1] Monthly memory lanes began at the blog’s 3-year anniversary in Sept, 2014.

[2] New Rule, July 30, 2016, March 30,2017 -a very convenient way to allow data-dependent choices (note why it’s legit in selecting blog posts, on severity grounds).












Categories: 3-year memory lane | Leave a comment

S. Senn: Personal perils: are numbers needed to treat misleading us as to the scope for personalised medicine? (Guest Post)

Personal perils: are numbers needed to treat misleading us as to the scope for personalised medicine?

A common misinterpretation of Numbers Needed to Treat is causing confusion about the scope for personalised medicine.

Stephen Senn
Consultant Statistician,


Thirty years ago, Laupacis et al1 proposed an intuitively appealing way that physicians could decide how to prioritise health care interventions: they could consider how many patients would need to be switched from an inferior treatment to a superior one in order for one to have an improved outcome. They called this the number needed to be treated. It is now more usually referred to as the number needed to treat (NNT).

Within fifteen years, NNTs were so well established that the then editor of the British Medical Journal, Richard Smith could write:  ‘Anybody familiar with the notion of “number needed to treat” (NNT) knows that it’s usually necessary to treat many patients in order for one to benefit’2. Fifteen years further on, bringing us up to date,  Wikipedia makes a similar point ‘The NNT is the average number of patients who need to be treated to prevent one additional bad outcome (e.g. the number of patients that need to be treated for one of them to benefit compared with a control in a clinical trial).’3

This common interpretation is false, as I have pointed out previously in two blogs on this site: Responder Despondency and  Painful Dichotomies. Nevertheless, it seems to me the point is worth making again and the thirty-year anniversary of NNTs provides a good excuse. Continue reading

Categories: personalized medicine, PhilStat/Med, S. Senn | 7 Comments

Statistics and the Higgs Discovery: 5-6 yr Memory Lane


I’m reblogging a few of the Higgs posts at the 6th anniversary of the 2012 discovery. (The first was in this post.) The following, was originally “Higgs Analysis and Statistical Flukes: part 2″ (from March, 2013).[1]

Some people say to me: “This kind of [severe testing] reasoning is fine for a ‘sexy science’ like high energy physics (HEP)”–as if their statistical inferences are radically different. But I maintain that this is the mode by which data are used in “uncertain” reasoning across the entire landscape of science and day-to-day learning (at least, when we’re trying to find things out)[2] Even with high level theories, the particular problems of learning from data are tackled piecemeal, in local inferences that afford error control. Granted, this statistical philosophy differs importantly from those that view the task as assigning comparative (or absolute) degrees-of-support/belief/plausibility to propositions, models, or theories.  Continue reading

Categories: Higgs, highly probable vs highly probed, P-values | 1 Comment

Replication Crises and the Statistics Wars: Hidden Controversies


Below are the slides from my June 14 presentation at the X-Phil conference on Reproducibility and Replicability in Psychology and Experimental Philosophy at University College London. What I think must be examined seriously are the “hidden” issues that are going unattended in replication research and related statistics wars. An overview of the “hidden controversies” are on slide #3. Although I was presenting them as “hidden”, I hoped they wouldn’t be quite as invisible as I found them through the conference. (Since my talk was at the start, I didn’t know what to expect–else I might have noted some examples that seemed to call for further scrutiny). Exceptions came largely (but not exclusively) from a small group of philosophers (me, Machery and Fletcher). Then again,there were parallel sessions, so I missed some.  However, I did learn something about X-phil, particularly from the very interesting poster session [1]. This new area should invite much, much more scrutiny of statistical methodology from philosophers of science.

[1] The women who organized and ran the conference did an excellent job: Lara Kirfel, a psychology PhD student at UCL, and Pascale Willemsen from Ruhr University.

Categories: Philosophy of Statistics, replication research, slides | Leave a comment

Your data-driven claims must still be probed severely

Vagelos Education Center

Below are the slides from my talk today at Columbia University at a session, Philosophy of Science and the New Paradigm of Data-Driven Science, at an American Statistical Association Conference on Statistical Learning and Data Science/Nonparametric Statistics. Todd was brave to sneak in philosophy of science in an otherwise highly mathematical conference.

Philosophy of Science and the New Paradigm of Data-Driven Science : (Room VEC 902/903)
Organizer and Chair: Todd Kuffner (Washington U)

  1. Deborah Mayo (Virginia Tech) “Your Data-Driven Claims Must Still be Probed Severely”
  2.  Ian McKeague (Columbia) “On the Replicability of Scientific Studies”
  3.  Xiao-Li Meng (Harvard) “Conducting Highly Principled Data Science: A Statistician’s Job and Joy


Categories: slides, Statistics and Data Science | 5 Comments

“Intentions (in your head)” is the code word for “error probabilities (of a procedure)”: Allan Birnbaum’s Birthday

27 May 1923-1 July 1976

27 May 1923-1 July 1976

Today is Allan Birnbaum’s Birthday. Birnbaum’s (1962) classic “On the Foundations of Statistical Inference,” in Breakthroughs in Statistics (volume I 1993), concerns a principle that remains at the heart of today’s controversies in statistics–even if it isn’t obvious at first: the Likelihood Principle (LP) (also called the strong likelihood Principle SLP, to distinguish it from the weak LP [1]). According to the LP/SLP, given the statistical model, the information from the data are fully contained in the likelihood ratio. Thus, properties of the sampling distribution of the test statistic vanish (as I put it in my slides from this post)! But error probabilities are all properties of the sampling distribution. Thus, embracing the LP (SLP) blocks our error statistician’s direct ways of taking into account “biasing selection effects” (slide #10). [Posted earlier here.] Interesting, as seen in a 2018 post on Neyman, Neyman did discuss this paper, but had an odd reaction that I’m not sure I understand. (Check it out.) Continue reading

Categories: Birnbaum, Birnbaum Brakes, frequentist/Bayesian, Likelihood Principle, phil/history of stat, Statistics | 7 Comments

The Meaning of My Title: Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars


Excerpts from the Preface:

The Statistics Wars: 

Today’s “statistics wars” are fascinating: They are at once ancient and up to the minute. They reflect disagreements on one of the deepest, oldest, philosophical questions: How do humans learn about the world despite threats of error due to incomplete and variable data? At the same time, they are the engine behind current controversies surrounding high-profile failures of replication in the social and biological sciences. How should the integrity of science be restored? Experts do not agree. This book pulls back the curtain on why. Continue reading

Categories: Announcement, SIST | Leave a comment

Getting Up to Speed on Principles of Statistics


“If a statistical analysis is clearly shown to be effective … it gains nothing from being … principled,” according to Terry Speed in an interesting IMS article (2016) that Harry Crane tweeted about a couple of days ago [i]. Crane objects that you need principles to determine if it is effective, else it “seems that a method is effective (a la Speed) if it gives the answer you want/expect.” I suspected that what Speed was objecting to was an appeal to “principles of inference” of the type to which Neyman objected in my recent post. This turns out to be correct. Here are some excerpts from Speed’s article (emphasis is mine): Continue reading

Categories: Likelihood Principle, Philosophy of Statistics | 5 Comments

3 YEARS AGO (May 2015): Monthly Memory Lane

3 years ago...               3 years ago…

MONTHLY MEMORY LANE: 3 years ago: May 2015. I mark in red 3-4 posts from each month that seem most apt for general background on key issues in this blog, excluding those reblogged recently[1]. Posts that are part of a “unit” or a group count as one, as in the case of 5/16, 5/19 and 5/24.

May 2015

  • 05/04 Spurious Correlations: Death by getting tangled in bedsheets and the consumption of cheese! (Aris Spanos)
  • 05/08 What really defies common sense (Msc kvetch on rejected posts)
  • 05/09 Stephen Senn: Double Jeopardy?: Judge Jeffreys Upholds the Law (sequel to the pathetic P-value)
  • 05/16 “Error statistical modeling and inference: Where methodology meets ontology” A. Spanos and D. Mayo
  • 05/19 Workshop on Replication in the Sciences: Society for Philosophy and Psychology: (2nd part of double header)
  • 05/24 From our “Philosophy of Statistics” session: APS 2015 convention
  • 05/27 “Intentions” is the new code word for “error probabilities”: Allan Birnbaum’s Birthday
  • 05/30 3 YEARS AGO (MAY 2012): Saturday Night Memory Lane

[1] Monthly memory lanes began at the blog’s 3-year anniversary in Sept, 2014.

I regret being away from blogging as of late (yes, the last bit of proofing on the book): I shall return soon! Send me stuff to post of yours or items of interest in the mean time.


Categories: 3-year memory lane | 1 Comment

Neyman vs the ‘Inferential’ Probabilists continued (a)


Today is Jerzy Neyman’s Birthday (April 16, 1894 – August 5, 1981).  I am posting a brief excerpt and a link to a paper of his that I hadn’t posted before: Neyman, J. (1962), ‘Two Breakthroughs in the Theory of Statistical Decision Making‘ [i] It’s chock full of ideas and arguments, but the one that interests me at the moment is Neyman’s conception of “his breakthrough”, in relation to a certain concept of “inference”.  “In the present paper” he tells us, “the term ‘inferential theory’…will be used to describe the attempts to solve the Bayes’ problem with a reference to confidence, beliefs, etc., through some supplementation …either a substitute a priori distribution [exemplified by the so called principle of insufficient reason] or a new measure of uncertainty” such as Fisher’s fiducial probability. Now Neyman always distinguishes his error statistical performance conception from Bayesian and Fiducial probabilisms [ii]. The surprising twist here is semantical and the culprit is none other than…Allan Birnbaum. Yet Birnbaum gets short shrift, and no mention is made of our favorite “breakthrough” (or did I miss it?). [iii] I’ll explain in later stages of this post & in comments…(so please check back); I don’t want to miss the start of the birthday party in honor of Neyman, and it’s already 8:30 p.m in Berkeley!

Note: In this article,”attacks” on various statistical “fronts” refers to ways of attacking problems in one or another statistical research program. HAPPY BIRTHDAY NEYMAN! Continue reading

Categories: Bayesian/frequentist, Error Statistics, Neyman, Statistics | Leave a comment


3 years ago...

3 years ago…

MONTHLY MEMORY LANE: 3 years ago: April 2015. I mark in red 3-4 posts from each month that seem most apt for general background on key issues in this blog, excluding those reblogged recently[1], and in green up to 3 others of general relevance to philosophy of statistics (in months where I’ve blogged a lot)[2].  Posts that are part of a “unit” or a group count as one.

April 2015

  • 04/01 Are scientists really ready for ‘retraction offsets’ to advance ‘aggregate reproducibility’? (let alone ‘precautionary withdrawals’)
  • 04/04 Joan Clarke, Turing, I.J. Good, and “that after-dinner comedy hour…”
  • 04/08 Heads I win, tails you lose? Meehl and many Popperians get this wrong (about severe tests)!
  • 04/13 Philosophy of Statistics Comes to the Big Apple! APS 2015 Annual Convention — NYC
  • 04/16 A. Spanos: Jerzy Neyman and his Enduring Legacy
  • 04/18 Neyman: Distinguishing tests of statistical hypotheses and tests of significance might have been a lapse of someone’s pen
  • 04/22 NEYMAN: “Note on an Article by Sir Ronald Fisher” (3 uses for power, Fisher’s fiducial argument)
  • 04/24 “Statistical Concepts in Their Relation to Reality” by E.S. Pearson
  • 04/27 3 YEARS AGO (APRIL 2012): MEMORY LANE
  • 04/30 96% Error in “Expert” Testimony Based on Probability of Hair Matches: It’s all Junk!


[1] Monthly memory lanes began at the blog’s 3-year anniversary in Sept, 2014.

[2] New Rule, July 30,2016, March 30,2017 -a very convenient way to allow data-dependent choices (note why it’s legit in selecting blog posts, on severity grounds).












Categories: 3-year memory lane | Leave a comment

Blog at WordPress.com.