Monthly Archives: January 2019

Mementos from Excursion 4: Objectivity & Auditing: Blurbs of Tours I – IV

Excursion 4: Objectivity and Auditing (blurbs of Tours I – IV)



Excursion 4 Tour I: The Myth of “The Myth of Objectivity”

Blanket slogans such as “all methods are equally objective and subjective” trivialize into oblivion the problem of objectivity. Such cavalier attitudes are at odds with the moves to take back science. The goal of this tour is to identify what there is in objectivity that we won’t give up, and shouldn’t. While knowledge gaps leave room for biases and wishful thinking, we regularly come up against data that thwart our expectations and disagree with predictions we try to foist upon the world. This pushback supplies objective constraints on which our critical capacity is built. Supposing an objective method is to supply formal, mechanical rules to process data is a holdover of a discredited logical positivist philosophy. Discretion in data generation and modeling does not warrant concluding that statistical inference is a matter of subjective belief. It is one thing to talk of our models as objects of belief and quite another to maintain that our task is to model beliefs. For a severe tester, a statistical method’s objectivity requires the ability to audit an inference: check assumptions, pinpoint blame for anomalies, falsify, and directly register how biasing selection effects–hunting, multiple testing and cherry-picking–alter its error probing capacities.


objective vs. subjective, objectivity requirements, auditing, dirty hands argument, phenomena vs. epiphenomena, logical positivism, verificationism, loss and cost functions, default Bayesians, equipoise assignments, (Bayesian) wash-out theorems, degenerating program, transparency, epistemology: internal/external distinction


Excursion 4 Tour II: Rejection Fallacies: Who’s Exaggerating What?

We begin with the Mountains out of Molehills Fallacy (large n problem): the fallacy of taking a (P-level) rejection of H0 with a larger sample size as indicating a greater discrepancy from H0 than with a smaller sample size (4.3). The Jeffreys-Lindley paradox shows that, with large enough n, a .05 significant result can correspond to assigning H0 a high posterior probability, such as .95. There are family feuds as to whether this is a problem for Bayesians or frequentists! The severe tester takes account of sample size in interpreting the discrepancy indicated. A modification of confidence intervals (CIs) is required.
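To see the arithmetic behind the Jeffreys-Lindley paradox, here is a minimal sketch, not taken from the book. It assumes a Normal sample mean with known standard deviation `sigma`, a lump prior of .5 on the point null H0: mu = 0, and a Normal(0, tau^2) prior on mu under the alternative. The function name `posterior_null` and the default values are illustrative choices only.

```python
import math

def posterior_null(z, n, sigma=1.0, tau=1.0, prior_null=0.5):
    """Posterior probability of the point null H0: mu = 0.

    Assumptions (illustrative): the sample mean is Normal(mu, sigma^2 / n),
    a lump of prior mass `prior_null` sits on mu = 0, and mu has a
    Normal(0, tau^2) prior under the alternative. `z` is the observed
    standardized difference sqrt(n) * xbar / sigma.
    """
    r = n * tau**2 / sigma**2
    # Bayes factor in favor of H0: ratio of the marginal densities of xbar
    bf01 = math.sqrt(1.0 + r) * math.exp(-0.5 * z**2 * r / (1.0 + r))
    post_odds = (prior_null / (1.0 - prior_null)) * bf01
    return post_odds / (1.0 + post_odds)

# Hold the result fixed at the two-sided .05 cutoff (z = 1.96) and let n grow.
for n in (10, 100, 1000, 100_000):
    print(n, round(posterior_null(1.96, n), 3))
```

Under these assumptions the same just-significant z of 1.96 yields a posterior on H0 that climbs with n and eventually passes .95. Different choices of `tau` or of the prior lump change the numbers, which is part of what the family feuds are about.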

It is commonly charged that significance levels overstate the evidence against the null hypothesis (4.4, 4.5). What’s meant? One answer considered here is that the P-value can be smaller than a posterior probability assigned to the null hypothesis, based on a lump prior (often .5) on a point null hypothesis. There are battles between and within tribes of Bayesians and frequentists. Some argue for lowering the P-value to bring it into line with a particular posterior. Others argue the supposed exaggeration results from an unwarranted lump prior on a wrongly formulated null. We consider how to evaluate reforms based on Bayes factor standards (4.5). Rather than dismiss criticisms of error statistical methods that assume a standard from a rival account, we give them a generous reading. Only once the minimal principle for severity is violated do we reject them. Souvenir R summarizes the severe tester’s interpretation of a rejection in a statistical significance test. At least 2 benchmarks are needed: reports of discrepancies (from a test hypothesis) that are, and those that are not, well indicated by the observed difference.
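The lump-prior comparison above can be made concrete with a small sketch. It is an illustration under stated assumptions, not the book’s own computation: a two-sided Normal test of a point null, a lump prior of .5 on H0, and the alternative given the prior most favorable to it (all its mass on the observed value). The function names are hypothetical.

```python
import math

def two_sided_p_value(z):
    # P(|Z| >= |z|) for a standard Normal Z
    return math.erfc(abs(z) / math.sqrt(2.0))

def min_posterior_null(z, prior_null=0.5):
    """Lower bound on the posterior of a point null over all priors on the
    alternative, given a lump prior `prior_null` on H0 (Normal model assumed)."""
    # The Bayes factor for H0 is smallest when the alternative puts all of
    # its prior mass on the maximum likelihood value.
    min_bf01 = math.exp(-0.5 * z**2)
    post_odds = (prior_null / (1.0 - prior_null)) * min_bf01
    return post_odds / (1.0 + post_odds)

z = 1.96
print("P-value:", round(two_sided_p_value(z), 3))
print("smallest posterior on H0:", round(min_posterior_null(z), 3))
```

Even with the alternative chosen as favorably as possible, the posterior on H0 stays above the P-value in this setup. Whether that shows the P-value exaggerates, or that the lump prior is unwarranted, is exactly the dispute described in the tour.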


significance test controversy, mountains out of molehills fallacy, large n problem, confidence intervals, P-values exaggerate evidence, Jeffreys-Lindley paradox, Bayes/Fisher disagreement, uninformative (diffuse) priors, Bayes factors, spiked priors, spike and slab, equivocating terms, severity interpretation of rejection (SIR)


Excursion 4 Tour III: Auditing: Biasing Selection Effects & Randomization

Tour III takes up Peirce’s “two rules of inductive inference”: predesignation (4.6) and randomization (4.7). The Tour opens on a court case transpiring: the CEO of a drug company is being charged with giving shareholders an overly rosy report based on post-data dredging for nominally significant benefits. Auditing a result includes checking for (i) selection effects, (ii) violations of model assumptions, and (iii) obstacles to moving from statistical to substantive claims. We hear it’s too easy to obtain small P-values, yet replication attempts find it difficult to get small P-values with preregistered results. I call this the paradox of replication. The problem isn’t P-values but failing to adjust them for cherry picking and other biasing selection effects. Adjustments by Bonferroni and false discovery rates are considered. There is a tension between popular calls for preregistering data analysis, and accounts that downplay error probabilities. Worse, in the interest of promoting a methodology that rejects error probabilities, researchers who most deserve lambasting are thrown a handy line of defense. However, data-dependent searching need not be pejorative. In some cases, it can improve severity. (4.6)
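As a rough illustration of why unadjusted P-values mislead after searching, and of the two adjustments mentioned above, here is a minimal sketch. It assumes independent tests for the error-rate calculation, and it uses the standard Bonferroni and Benjamini-Hochberg formulas. The function names and the example P-values are made up for illustration.

```python
def chance_of_some_nominal_significance(alpha, k):
    # Probability that at least one of k independent true-null tests
    # reaches nominal significance at level alpha.
    return 1.0 - (1.0 - alpha) ** k

def bonferroni(pvalues):
    # Multiply each P-value by the number of tests, capped at 1.
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]

def benjamini_hochberg(pvalues):
    # False discovery rate adjusted P-values (step-up procedure).
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted

# Hunting through 20 endpoints at the .05 level:
print(round(chance_of_some_nominal_significance(0.05, 20), 3))

example = [0.01, 0.04, 0.03, 0.005]  # hypothetical reported P-values
print(bonferroni(example))
print(benjamini_hochberg(example))
```

The first number shows how easily a nominally significant result turns up when one is free to search. The adjusted values register that the reported P-values were selected from a larger set.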

Big Data cannot ignore experimental design principles. Unless we take account of the sampling distribution, it becomes difficult to justify resampling and randomization. We consider RCTs in development economics (RCT4D) and genomics. Failing to randomize microarrays is thought to have resulted in a decade lost in genomics. Granted, the rejection of error probabilities is often tied to presupposing that their relevance is limited to long-run behavioristic goals, a presupposition we reject. They are essential for an epistemic goal: controlling and assessing how well or poorly tested claims are. (4.7)


error probabilities and severity, predesignation, biasing selection effects, paradox of replication, capitalizing on chance, Bayes factors, batch effects, preregistration, randomization: Bayes-frequentist rationale, Bonferroni adjustment, false discovery rates, RCT4D, genome-wide association studies (GWAS)


Excursion 4 Tour IV: More Auditing: Objectivity and Model Checking

While all models are false, it’s also the case that no useful models are true. Were a model so complex as to represent data realistically, it wouldn’t be useful for finding things out. A statistical model is useful by being adequate for a problem, meaning it enables controlling and assessing if purported solutions are well or poorly probed and to what degree. We give a way to define severity in terms of solving a problem. (4.8) When it comes to testing model assumptions, many Bayesians agree with George Box (1983) that “it requires frequentist theory of significance tests” (p. 57). Tests of model assumptions, also called misspecification (M-S) tests, are thus a promising area for Bayes-frequentist collaboration. (4.9) When the model is in doubt, the likelihood principle is inapplicable or violated. We illustrate a non-parametric bootstrap resampling. It works without relying on a theoretical probability distribution, but it still has assumptions. (4.10) We turn to the M-S testing approach of econometrician Aris Spanos. (4.11) I present the high points for unearthing spurious correlations, and assumptions of linear regression, employing 7 figures. M-S tests differ importantly from model selection–the latter uses a criterion for choosing among models, but does not test their statistical assumptions. Model selection criteria assess fit rather than whether a model has captured the systematic information in the data.
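For readers who have not seen a non-parametric bootstrap, here is a minimal sketch of the percentile version for a sample mean. It is an illustration, not the example worked in 4.10. It still assumes the observations are independent and identically distributed, which is the kind of assumption the tour says remains. The function name, the data, and the defaults are hypothetical.

```python
import random
import statistics

def bootstrap_mean_interval(data, n_resamples=2000, level=0.95, seed=0):
    """Percentile bootstrap interval for the mean.

    Resamples the observed data with replacement instead of appealing to
    a theoretical sampling distribution. Assumes IID observations.
    """
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    n = len(data)
    means = []
    for _ in range(n_resamples):
        resample = [data[rng.randrange(n)] for _ in range(n)]
        means.append(statistics.fmean(resample))
    means.sort()
    alpha = 1.0 - level
    # Approximate percentile positions in the sorted bootstrap means
    lo_idx = int((alpha / 2.0) * n_resamples)
    hi_idx = int((1.0 - alpha / 2.0) * n_resamples) - 1
    return means[lo_idx], means[hi_idx]

data = [2.1, 3.4, 1.9, 5.0, 4.2, 3.3, 2.8, 4.7]  # made-up measurements
print(bootstrap_mean_interval(data))
```

If the IID assumption fails, for instance with trending or dependent data, resampling this way scrambles the very structure an M-S test would be probing.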


adequacy for a problem, severity (in terms of problem solving), model testing/misspecification (M-S) tests, likelihood principle conflicts, bootstrap, resampling, Bayesian p-value, central limit theorem, nonsense regression, significance tests in model checking, probabilistic reduction, respecification


Where you are in the Journey 

Categories: SIST, Statistical Inference as Severe Testing | Leave a comment

Excerpt from Excursion 4 Tour II: 4.4 “Do P-Values Exaggerate the Evidence?”


Excerpt from Excursion 4 Tour II*


4.4 Do P-Values Exaggerate the Evidence?

“Significance levels overstate the evidence against the null hypothesis,” is a line you may often hear. Your first question is:

What do you mean by overstating the evidence against a hypothesis?

Several (honest) answers are possible. Here is one possibility:

What I mean is that when I put a lump of prior weight π0 of 1/2 on a point null H0 (or a very small interval around it), the P-value is smaller than my Bayesian posterior probability on H0.

More generally, the “P-values exaggerate” criticism typically boils down to showing that if inference is appraised via one of the probabilisms – Bayesian posteriors, Bayes factors, or likelihood ratios – the evidence against the null (or against the null and in favor of some alternative) isn’t as big as 1 − P.

You might react by observing that: (a) P-values are not intended as posteriors in H0 (or Bayes ratios, likelihood ratios) but rather are used to determine if there’s an indication of discrepancy from, or inconsistency with, H0. This might only mean it’s worth getting more data to probe for a real effect. It’s not a degree of belief or comparative strength of support to walk away with. (b) Thus there’s no reason to suppose a P-value should match numbers computed in very different accounts, that differ among themselves, and are measuring entirely different things. Stephen Senn gives an analogy with “height and stones”:

. . . [S]ome Bayesians in criticizing P-values seem to think that it is appropriate to use a threshold for significance of 0.95 of the probability of the alternative hypothesis being true. This makes no more sense than, in moving from a minimum height standard (say) for recruiting police officers to a minimum weight standard, declaring that since it was previously 6 foot it must now be 6 stone. (Senn 2001b, p. 202)

To top off your rejoinder, you might ask: (c) Why assume that “the” or even “a” correct measure of evidence (relevant for scrutinizing the P-value) is one of the probabilist ones?

All such retorts are valid, and we’ll want to explore how they play out here. Yet, I want to push beyond them. Let’s be open to the possibility that evidential measures from very different accounts can be used to scrutinize each other.

Getting Beyond “I’m Rubber and You’re Glue”. The danger in critiquing statistical method X from the standpoint of the goals and measures of a distinct school Y, is that of falling into begging the question. If the P-value is exaggerating evidence against a null, meaning it seems too small from the perspective of school Y, then Y’s numbers are too big, or just irrelevant, from the perspective of school X. Whatever you say about me bounces off and sticks to you. This is a genuine worry, but it’s not fatal. The goal of this journey is to identify minimal theses about “bad evidence, no test (BENT)” that enable some degree of scrutiny of any statistical inference account – at least on the meta-level. Why assume all schools of statistical inference embrace the minimum severity principle? I don’t, and they don’t. But by identifying when methods violate severity, we can pull back the veil on at least one source of disagreement behind the battles.

Thus, in tackling this latest canard, let’s resist depicting the critics as committing a gross blunder of confusing a P-value with a posterior probability in a null. We resist, as well, merely denying we care about their measure of support. I say we should look at exactly what the critics are on about. When we do, we will have gleaned some short-cuts for grasping a plethora of critical debates. We may even wind up with new respect for what a P-value, the least popular girl in the class, really does.

To visit the core arguments, we travel to 1987 to papers by J. Berger and Sellke, and Casella and R. Berger. These, in turn, are based on a handful of older ones (Cox 1977, E, L, & S 1963, Pratt 1965), and current discussions invariably revert back to them. Our struggles through quicksand of Excursion 3, Tour II, are about to pay large dividends.

*This excerpt comes from Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (Mayo, CUP 2018).

Readers can find blog posts that trace out the discussion of this topic, as I was developing it, along with comments. The following 2 are central:

(7/14) “P-values overstate the evidence against the null”: legit or fallacious? (revised) 71 comments

(7/23) Continued: “P-values overstate the evidence against the null”: legit or fallacious? 39 comments


Earlier excerpts and mementos from SIST up to Dec 31, 2018 are here.

Where you are in the journey.



January Invites: Ask me questions (about SIST), Write Discussion Analyses (U-Phils)


ASK ME. Some readers say they’re not sure where to ask a question of comprehension on Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (2018, CUP), or SIST, so here’s a special post to park your questions of comprehension (to be placed in the comments) on a little over the first half of the book. That goes up to and includes Excursion 4 Tour I on “The Myth of ‘The Myth of Objectivity’”. However, I will soon post on Tour II: Rejection Fallacies: Who’s Exaggerating What? So feel free to ask questions of comprehension as far as p. 259.

All of the SIST blog posts (Excerpts and Mementos) so far are here.


WRITE A DISCUSSION NOTE: Beginning January 16, anyone who wishes to write a discussion note (on some aspect or issue up to p. 259) is invited to do so (<750 words, longer if you wish). Send them to my error email. I will post as many as possible on this blog.

We initially called such notes “U-Phils” as in “You do a Philosophical analysis”, which really only means it’s an analytic exercise that strives to first give the most generous interpretation to positions, and then examines them. See the general definition of a U-Phil.

Some Examples:

Mayo, Senn, and Wasserman on Gelman’s RMM** Contribution

U-Phil: A Further Comment on Gelman by Christian Hennig.

For a whole group of reader contributions, including Jim Berger on Jim Berger, see: Earlier U-Phils and Deconstructions

If you’re writing a note on objectivity, you might wish to compare and contrast Excursion 4 Tour I with a paper by Gelman and Hennig (2017): “Beyond subjective and objective in Statistics”.

These invites extend through January.


SIST* Blog Posts: Excerpts & Mementos (to Dec 31 2018)

Surveying SIST Blog Posts So Far


  • 05/19: The Meaning of My Title: Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars
  • 09/08: Excursion 1 Tour I: Beyond Probabilism and Performance: Severity Requirement (1.1)
  • 09/11: Excursion 1 Tour I (2nd stop): Probabilism, Performance, and Probativeness (1.2)
  • 09/15: Excursion 1 Tour I (3rd stop): The Current State of Play in Statistical Foundations: A View From a Hot-Air Balloon (1.3)
  • 09/29: Excursion 2: Taboos of Induction and Falsification: Tour I (first stop)
  • 10/10: Excursion 2 Tour II (3rd stop): Falsification, Pseudoscience, Induction (2.3)
  • 11/30: Where are Fisher, Neyman, Pearson in 1919? Opening of Excursion 3
  • 12/01: Neyman-Pearson Tests: An Episode in Anglo-Polish Collaboration: Excerpt from Excursion 3 (3.2)
  • 12/04: First Look at N-P Methods as Severe Tests: Water plant accident [Exhibit (i) from Excursion 3]
  • 12/11: It’s the Methods, Stupid: Excerpt from Excursion 3 Tour II (Mayo 2018, CUP)
  • 12/20: Capability and Severity: Deeper Concepts: Excerpts From Excursion 3 Tour III
  • 12/26: Excerpt from Excursion 4 Tour I: The Myth of “The Myth of Objectivity” (Mayo 2018, CUP)
  • 12/29: 60 Years of Cox’s (1958) Chestnut: Excerpt from Excursion 3 Tour II

Mementos, Keepsakes and Souvenirs

  • 10/29: Tour Guide Mementos (Excursion 1 Tour II of How to Get Beyond the Statistics Wars)
  • 11/08: Souvenir C: A Severe Tester’s Translation Guide (Excursion 1 Tour II)
  • 10/05: “It should never be true, though it is still often said, that the conclusions are no more accurate than the data on which they are based” (Keepsake by Fisher, 2.1)
  • 11/14: Tour Guide Mementos and Quiz 2.1 (Excursion 2 Tour I Induction and Confirmation)
  • 11/17: Mementos for Excursion 2 Tour II Falsification, Pseudoscience, Induction
  • 12/08: Memento & Quiz (on SEV): Excursion 3, Tour I
  • 12/13: Mementos for “It’s the Methods, Stupid!” Excursion 3 Tour II (3.4-3.6)
  • 12/26: Tour Guide Mementos From Excursion 3 Tour III: Capability and Severity: Deeper Concepts

*Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (Mayo, CUP 2018).


Mayo-Spanos Summer Seminar PhilStat: July 28-Aug 11, 2019: Instructions for Applying Now Available


See the Blog at SummerSeminarPhilStat

Categories: Announcement, Error Statistics, Statistics | Leave a comment
