**Excursion 4 Tour I: The Myth of “The Myth of Objectivity”**

Blanket slogans such as “all methods are equally objective and subjective” trivialize into oblivion the problem of objectivity. Such cavalier attitudes are at odds with the moves to take back science. The goal of this tour is to identify what there is in objectivity that we won’t give up, and shouldn’t. While knowledge gaps leave room for biases and wishful thinking, we regularly come up against data that thwart our expectations and disagree with predictions we try to foist upon the world. This pushback supplies objective constraints on which our critical capacity is built. Supposing an objective method is to supply formal, mechanical rules to process data is a holdover of a discredited logical positivist philosophy. Discretion in data generation and modeling does not warrant concluding that statistical inference is a matter of subjective belief. It is one thing to talk of our models as objects of belief and quite another to maintain that our task is to model beliefs. For a severe tester, a statistical method’s objectivity requires the ability to audit an inference: check assumptions, pinpoint blame for anomalies, falsify, and directly register how biasing selection effects–hunting, multiple testing and cherry-picking–alter its error probing capacities.

**Keywords**

objective vs. subjective, objectivity requirements, auditing, dirty hands argument, phenomena vs. epiphenomena, logical positivism, verificationism, loss and cost functions, default Bayesians, equipoise assignments, (Bayesian) wash-out theorems, degenerating program, transparency, epistemology: internal/external distinction

**Excursion 4 Tour II: Rejection Fallacies: Who’s Exaggerating What?**

We begin with the *Mountains out of Molehills Fallacy* (large *n* problem): the fallacy of taking a (P-level) rejection of *H*_{0} with a larger sample size as indicating a greater discrepancy from *H*_{0} than with a smaller sample size (4.3). The Jeffreys-Lindley paradox shows that, with large enough *n*, a .05 significant result can correspond to assigning *H*_{0} a high probability such as .95. There are family feuds as to whether this is a problem for Bayesians or frequentists! The severe tester takes account of sample size in interpreting the discrepancy indicated. A modification of confidence intervals (CIs) is required.
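To make the large *n* problem concrete, here is a minimal Python sketch. It assumes a one-sided Normal test of μ ≤ 0 with known σ = 1 and a 1.96 standard error cut-off; these numbers, the benchmark μ1 = 0.1, and the function names are illustrative assumptions, not values from the book. The severity formula used for claims of the form μ > μ1 is the usual one for this setup, Pr(sample mean ≤ observed mean; μ = μ1).

```python
import math
from statistics import NormalDist

def just_significant_mean(n, sigma=1.0, z_cutoff=1.96):
    """Observed sample mean lying exactly z_cutoff standard errors above 0."""
    return z_cutoff * sigma / math.sqrt(n)

def severity_mu_greater(xbar, mu1, n, sigma=1.0):
    """Pr(sample mean <= xbar; mu = mu1): how well the claim mu > mu1 is probed."""
    se = sigma / math.sqrt(n)
    return NormalDist().cdf((xbar - mu1) / se)

# Same P-value at every n, but the indicated discrepancy shrinks as n grows.
for n in (25, 100, 10_000):
    xbar = just_significant_mean(n)
    sev = severity_mu_greater(xbar, mu1=0.1, n=n)
    print(f"n={n:>6}  just-significant mean={xbar:.4f}  SEV(mu > 0.1)={sev:.3f}")
```

Under these assumptions the just-significant mean falls with the square root of *n*, so the same P-level rejection gives progressively poorer warrant for a fixed discrepancy such as μ > 0.1.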

It is commonly charged that significance levels overstate the evidence against the null hypothesis (4.4, 4.5). What’s meant? One answer considered here is that the P-value can be smaller than a posterior probability on the null hypothesis, based on a lump prior (often .5) on a point null hypothesis. There are battles between and within tribes of Bayesians and frequentists. Some argue for lowering the P-value to bring it into line with a particular posterior. Others argue the supposed exaggeration results from an unwarranted lump prior to a wrongly formulated null. We consider how to evaluate reforms based on Bayes factor standards (4.5). Rather than dismiss criticisms of error statistical methods that assume a standard from a rival account, we give them a generous reading. Only once the minimal principle for severity is violated do we reject them. Souvenir R summarizes the severe tester’s interpretation of a rejection in a statistical significance test. At least 2 benchmarks are needed: reports of discrepancies (from a test hypothesis) that are, and those that are not, well indicated by the observed difference.

**Keywords**

significance test controversy, mountains out of molehills fallacy, large n problem, confidence intervals, P-values exaggerate evidence, Jeffreys-Lindley paradox, Bayes/Fisher disagreement, uninformative (diffuse) priors, Bayes factors, spiked priors, spike and slab, equivocating terms, severity interpretation of rejection (SIR)

**Excursion 4 Tour III: Auditing: Biasing Selection Effects & Randomization**

Tour III takes up Peirce’s “two rules of inductive inference”: predesignation (4.6) and randomization (4.7). The Tour opens on a court case transpiring: the CEO of a drug company is being charged with giving shareholders an overly rosy report based on post-data dredging for nominally significant benefits. Auditing a result includes checking for (i) selection effects, (ii) violations of model assumptions, and (iii) obstacles to moving from statistical to substantive claims. We hear it’s too easy to obtain small *P*-values, yet replication attempts find it difficult to get small *P*-values with preregistered results. I call this the *paradox of replication*. The problem isn’t *P*-values but failing to adjust them for cherry picking and other *biasing selection effects*. Adjustments by Bonferroni and false discovery rates are considered. There is a tension between popular calls for preregistering data analysis, and accounts that downplay error probabilities. Worse, in the interest of promoting a methodology that rejects error probabilities, researchers who most deserve lambasting are thrown a handy line of defense. However, data dependent searching need not be pejorative. In some cases, it can improve severity. (4.6)
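As a rough illustration of the two adjustments just mentioned, the following Python sketch computes Bonferroni and Benjamini-Hochberg (false discovery rate) adjusted P-values. The five nominal P-values are made up for the example, and the function names are hypothetical; the Benjamini-Hochberg guarantee is usually stated for independent or positively dependent tests.

```python
def bonferroni(pvalues):
    """Multiply each P-value by the number of tests, capped at 1."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]

def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted P-values (step-up false discovery rate)."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Made-up nominal P-values from five endpoints searched post-data.
nominal = [0.001, 0.008, 0.039, 0.041, 0.20]
print("Bonferroni:", bonferroni(nominal))
print("BH (FDR):  ", benjamini_hochberg(nominal))
```

In this toy case the two endpoints that look nominally significant at .05 (0.039 and 0.041) no longer do so after either adjustment, which is the sense in which reporting the unadjusted values would mislead.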

Big Data cannot ignore experimental design principles. Unless we take account of the sampling distribution, it becomes difficult to justify resampling and randomization. We consider RCTs in development economics (RCT4D) and genomics. Failing to randomize microarrays is thought to have resulted in a decade lost in genomics. Granted, the rejection of error probabilities is often tied to presupposing their relevance is limited to long-run behavioristic goals, which we reject. They are essential for an epistemic goal: controlling and assessing how well or poorly tested claims are. (4.7)

**Keywords**

error probabilities and severity, predesignation, biasing selection effects, paradox of replication, capitalizing on chance, Bayes factors, batch effects, preregistration, randomization: Bayes-frequentist rationale, Bonferroni adjustment, false discovery rates, RCT4D, genome-wide association studies (GWAS)

**Excursion 4 Tour IV:** **More Auditing: Objectivity and Model Checking**

While all models are false, it’s also the case that no useful models are true. Were a model so complex as to represent data realistically, it wouldn’t be useful for finding things out. A statistical model is useful by being *adequate for a problem*, meaning it enables controlling and assessing if purported solutions are well or poorly probed and to what degree. We give a way to define severity in terms of solving a problem (4.8). When it comes to testing model assumptions, many Bayesians agree with George Box (1983) that “it requires frequentist theory of significance tests” (p. 57). Tests of model assumptions, also called misspecification (M-S) tests, are thus a promising area for Bayes-frequentist collaboration (4.9). When the model is in doubt, the likelihood principle is inapplicable or violated. We illustrate a non-parametric bootstrap resampling. It works without relying on a theoretical probability distribution, but it still has assumptions (4.10). We turn to the M-S testing approach of econometrician Aris Spanos (4.11). I present the high points for unearthing spurious correlations, and assumptions of linear regression, employing 7 figures. M-S tests differ importantly from model selection–the latter uses a criterion for choosing among models, but does not test their statistical assumptions. Model selection criteria test fit rather than whether a model has captured the systematic information in the data.
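For readers who have not seen one, a non-parametric bootstrap can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the ten observations are made up, the helper names are hypothetical, and resampling with replacement treats the observations as independent and identically distributed, which is one of the assumptions the method still carries.

```python
import random
import statistics

def bootstrap_means(sample, n_resamples=2000, seed=0):
    """Means of resamples drawn with replacement from the observed sample."""
    rng = random.Random(seed)
    n = len(sample)
    return [statistics.fmean(rng.choices(sample, k=n)) for _ in range(n_resamples)]

def percentile_interval(values, level=0.95):
    """Simple percentile interval read off the sorted bootstrap statistics."""
    ordered = sorted(values)
    lo_idx = int((1 - level) / 2 * len(ordered))
    hi_idx = int((1 + level) / 2 * len(ordered)) - 1
    return ordered[lo_idx], ordered[hi_idx]

data = [2.1, 3.4, 1.9, 5.0, 4.2, 3.3, 2.8, 4.9, 3.7, 2.5]  # made-up observations
means = bootstrap_means(data)
low, high = percentile_interval(means)
print(f"sample mean = {statistics.fmean(data):.2f}, bootstrap interval = ({low:.2f}, {high:.2f})")
```

No Normal or other theoretical sampling distribution is invoked; the spread of the resampled means stands in for it.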

**Keywords**

adequacy for a problem, severity (in terms of problem solving), model testing/misspecification (M-S) tests, likelihood principle conflicts, bootstrap, resampling, Bayesian p-value, central limit theorem, nonsense regression, significance tests in model checking, probabilistic reduction, respecification


**4.4 Do P-Values Exaggerate the Evidence?**

“Significance levels overstate the evidence against the null hypothesis,” is a line you may often hear. Your first question is:

What do you mean by overstating the evidence against a hypothesis?

Several (honest) answers are possible. Here is one possibility:

What I mean is that when I put a lump of prior weight π_{0} of 1/2 on a point null *H*_{0} (or a very small interval around it), the P-value is smaller than my Bayesian posterior probability on *H*_{0}.

More generally, the “*P*-values exaggerate” criticism typically boils down to showing that if inference is appraised via one of the probabilisms – Bayesian posteriors, Bayes factors, or likelihood ratios – the evidence against the null (or against the null and in favor of some alternative) isn’t as big as 1 − *P*.
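The critic's answer quoted above can be checked numerically. The sketch below is a hedged illustration, not the book's computation: it assumes a Normal mean with known σ, a point null μ = 0 given lump prior π_{0}, and a N(0, τ²) prior on μ under the alternative with τ = σ. The prior choice and the function name `posterior_null` are assumptions made for the example.

```python
import math
from statistics import NormalDist

def posterior_null(z, n, pi0=0.5, tau_over_sigma=1.0):
    """Posterior probability of the point null given z = sqrt(n) * xbar / sigma.

    Assumes mu ~ N(0, tau^2) under the alternative (an illustrative choice).
    """
    r = n * tau_over_sigma ** 2  # prior variance relative to the variance of xbar
    bf01 = math.sqrt(1 + r) * math.exp(-0.5 * z * z * r / (1 + r))  # Bayes factor for H0
    return pi0 * bf01 / (pi0 * bf01 + (1 - pi0))

z = 1.96
p_two_sided = 2 * (1 - NormalDist().cdf(z))
print(f"two-sided P-value = {p_two_sided:.3f}")
for n in (1, 10, 100, 1000):
    print(f"n={n:>5}  posterior on H0 = {posterior_null(z, n):.3f}")
```

With these assumptions the posterior on the null exceeds the P-value at every *n* shown and climbs as *n* grows, which is the pattern the “exaggeration” charge relies on. Whether that comparison is the right standard is the question taken up in the text.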

You might react by observing that: (a) *P*-values are not intended as posteriors in *H*_{0} (or Bayes ratios, likelihood ratios) but rather are used to determine if there’s an indication of discrepancy from, or inconsistency with, *H*_{0}. This might only mean it’s worth getting more data to probe for a real effect. It’s not a degree of belief or comparative strength of support to walk away with. (b) Thus there’s no reason to suppose a *P*-value should match numbers computed in very different accounts, that differ among themselves, and are measuring entirely different things. Stephen Senn gives an analogy with “height and stones”:

. . . [S]ome Bayesians in criticizing P-values seem to think that it is appropriate to use a threshold for significance of 0.95 of the probability of the alternative hypothesis being true. This makes no more sense than, in moving from a minimum height standard (say) for recruiting police officers to a minimum weight standard, declaring that since it was previously 6 foot it must now be 6 stone. (Senn 2001b, p. 202)

To top off your rejoinder, you might ask: (c) Why assume that “the” or even “a” correct measure of evidence (relevant for scrutinizing the *P*-value) is one of the probabilist ones?

All such retorts are valid, and we’ll want to explore how they play out here. Yet, I want to push beyond them. Let’s be open to the possibility that evidential measures from very different accounts can be used to scrutinize each other.

**Getting Beyond “I’m Rubber and You’re Glue”.** The danger in critiquing statistical method X from the standpoint of the goals and measures of a distinct school Y, is that of falling into begging the question. If the *P*-value is exaggerating evidence against a null, meaning it seems too small from the perspective of school Y, then Y’s numbers are too big, or just irrelevant, from the perspective of school X. Whatever you say about me bounces off and sticks to you. This is a genuine worry, but it’s not fatal. The goal of this journey is to identify minimal theses about “bad evidence, no test (BENT)” that enable some degree of scrutiny of any statistical inference account – at least on the meta-level. Why assume all schools of statistical inference embrace the minimum severity principle? I don’t, and they don’t. But by identifying when methods violate severity, we can pull back the veil on at least one source of disagreement behind the battles.

Thus, in tackling this latest canard, let’s resist depicting the critics as committing a gross blunder of confusing a *P*-value with a posterior probability in a null. We resist, as well, merely denying we care about their measure of support. I say we should look at exactly what the critics are on about. When we do, we will have gleaned some short-cuts for grasping a plethora of critical debates. We may even wind up with new respect for what a *P*-value, the least popular girl in the class, really does.

To visit the core arguments, we travel to 1987 to papers by J. Berger and Sellke, and Casella and R. Berger. These, in turn, are based on a handful of older ones (Cox 1977, E, L, & S 1963, Pratt 1965), and current discussions invariably revert back to them. Our struggles through quicksand of Excursion 3, Tour II, are about to pay large dividends.

This excerpt comes from Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (Mayo, CUP 2018).

Readers can find blogposts that trace out the discussion of this topic, as I was developing it, along with comments. The following 2 are central:

(7/14) “P-values overstate the evidence against the null”: legit or fallacious? (revised) **71 comments**

(7/23) Continued: “P-values overstate the evidence against the null”: legit or fallacious? **39 comments**

Earlier excerpts and mementos from SIST up to Dec 31, 2018 are here.


**ASK ME.** Some readers say they’re not sure where to ask a question of comprehension on *Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars* (2018, CUP)–SIST– so here’s a special post to park your questions of comprehension (to be placed in the comments) on a little over the first half of the book. That goes up to and includes Excursion 4 Tour I on “The Myth of ‘The Myth of Objectivity'”. However, I will soon post on Tour II: Rejection Fallacies: Who’s Exaggerating What? So feel free to ask questions of comprehension as far as p. 259.

All of the SIST BlogPost (Excerpts and Mementos) so far are here.

**WRITE A DISCUSSION NOTE**: Beginning January 16, anyone who wishes to write a discussion note (on some aspect or issue up to p. 259) is invited to do so (<750 words, longer if you wish). Send them to my error email. I will post as many as possible on this blog.

We initially called such notes “U-Phils” as in “You do a Philosophical analysis”, which really only means it’s an analytic exercise that strives to first give the most generous interpretation to positions, and then examines them. See the general definition of a U-Phil.

Some Examples:

Mayo, Senn, and Wasserman on Gelman’s RMM** Contribution

U-Phil: A Further Comment on Gelman by Christian Hennig.

For a whole group of reader contributions, including Jim Berger on Jim Berger, see: Earlier U-Phils and Deconstructions

If you’re writing a note on objectivity, you might wish to compare and contrast Excursion 4 Tour I with a paper by Gelman and Hennig (2017): “Beyond subjective and objective in Statistics”.

These invites extend through January.

*Excerpts*

- 05/19: The Meaning of My Title: *Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars*
- 09/08: Excursion 1 Tour I: Beyond Probabilism and Performance: Severity Requirement (1.1)
- 09/11: Excursion 1 Tour I (2nd stop): Probabilism, Performance, and Probativeness (1.2)
- 09/15: Excursion 1 Tour I (3rd stop): The Current State of Play in Statistical Foundations: A View From a Hot-Air Balloon (1.3)
- 09/29: Excursion 2: Taboos of Induction and Falsification: Tour I (first stop)
- 10/10: Excursion 2 Tour II (3rd stop): Falsification, Pseudoscience, Induction (2.3)
- 11/30: Where are Fisher, Neyman, Pearson in 1919? Opening of Excursion 3
- 12/01: Neyman-Pearson Tests: An Episode in Anglo-Polish Collaboration: Excerpt from Excursion 3 (3.2)
- 12/04: First Look at N-P Methods as Severe Tests: Water plant accident [Exhibit (i) from Excursion 3]
- 12/11: It’s the Methods, Stupid: Excerpt from Excursion 3 Tour II (Mayo 2018, CUP)
- 12/20: Capability and Severity: Deeper Concepts: Excerpts From Excursion 3 Tour III
- 12/26: Excerpt from Excursion 4 Tour I: The Myth of “The Myth of Objectivity” (Mayo 2018, CUP)
- 12/29: 60 Years of Cox’s (1958) Chestnut: Excerpt from Excursion 3 Tour II.

*Mementos, Keepsakes and Souvenirs*

- 10/29: Tour Guide **Mementos** (Excursion 1 Tour II of How to Get Beyond the Statistics Wars)
- 11/8: **Souvenir** C: A Severe Tester’s Translation Guide (Excursion 1 Tour II)
- 10/5: “It should never be true, though it is still often said, that the conclusions are no more accurate than the data on which they are based” (**Keepsake** by Fisher, 2.1)
- 11/14: Tour Guide **Mementos** and Quiz 2.1 (Excursion 2 Tour I Induction and Confirmation)
- 11/17: **Mementos** for Excursion 2 Tour II Falsification, Pseudoscience, Induction
- 12/08: Memento & Quiz (on SEV): Excursion 3, Tour I
- 12/13: Mementos for “It’s the Methods, Stupid!” Excursion 3 Tour II (3.4-3.6)
- 12/26: Tour Guide Mementos From Excursion 3 Tour III: Capability and Severity: Deeper Concepts

*Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars* (Mayo, CUP 2018).

You know how in that Woody Allen movie, “Midnight in Paris,” the main character (I forget who plays it, I saw it on a plane) is a writer finishing a novel, and he steps into a cab that mysteriously picks him up at midnight and transports him back in time where he gets to run his work by such famous authors as Hemingway and Virginia Woolf? (It was a new movie when I began the blog in 2011.) He is wowed when his work earns their approval and he comes back each night in the same mysterious cab… Well, imagine an error statistical philosopher is picked up in a mysterious taxi at midnight (New Year’s Eve ~~2011~~ ~~2012~~, ~~2013~~, ~~2014~~, ~~2015~~, ~~2016~~, ~~2017~~, 2018) and is taken back sixty years and, lo and behold, finds herself in the company of Allan Birnbaum.[i] There are a number of 2018 updates.

ERROR STATISTICIAN: It’s wonderful to meet you Professor Birnbaum; I’ve always been extremely impressed with the important impact your work has had on philosophical foundations of statistics. I happen to be writing on your famous argument about the likelihood principle (LP). (whispers: I can’t believe this!)

BIRNBAUM: Ultimately you know I rejected the LP as failing to control the error probabilities needed for my Confidence concept. But you know all this, I’ve read it in your new book: *Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars* (STINT, 2018, CUP).

ERROR STATISTICIAN: You’ve read my new book? Wow! Then you know I don’t think your argument shows that the LP follows from such frequentist concepts as sufficiency S and the weak conditionality principle WCP. I don’t rehearse my argument there, but I first found it in 2006.[ii] Sorry,…I know it’s famous…

BIRNBAUM: Well, I shall happily invite you to take any case that violates the LP and allow me to demonstrate that the frequentist is led to inconsistency, provided she also wishes to adhere to the WCP and sufficiency (although less than S is needed).

ERROR STATISTICIAN: Well I show that no contradiction follows from holding WCP and S, while denying the LP.

BIRNBAUM: Well, well, well: I’ll bet you a bottle of Elba Grease champagne that I can demonstrate it!

ERROR STATISTICAL PHILOSOPHER: It is a great drink, I must admit that: I love lemons.

BIRNBAUM: OK. (A waiter brings a bottle, they each pour a glass and resume talking). Whoever wins this little argument pays for this whole bottle of vintage Ebar or Elbow or whatever it is Grease.

ERROR STATISTICAL PHILOSOPHER: I really don’t mind paying for the bottle.

BIRNBAUM: Good, you will have to. Take any LP violation. Let x’ be 2-standard deviation difference from the null (asserting μ = 0) in testing a normal mean from the fixed sample size experiment E’, say n = 100; and let x” be a 2-standard deviation difference from an optional stopping experiment E”, which happens to stop at 100. Do you agree that:

(0) For a frequentist, outcome x’ from E’ (fixed sample size) is NOT evidentially equivalent to x” from E” (optional stopping that stops at n)

ERROR STATISTICAL PHILOSOPHER: Yes, that’s a clear case where we reject the strong LP, and it makes perfect sense to distinguish their corresponding p-values (which we can write as p’ and p”, respectively). The searching in the optional stopping experiment makes the p-value quite a bit higher than with the fixed sample size. For n = 100, data x’ yields p’= ~.05; while p” is ~.3. Clearly, p’ is not equal to p”, I don’t see how you can make them equal.
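An aside for readers who want to check the gap between p’ and p” themselves: the rough Python simulation below estimates how often a 2-standard deviation difference turns up at or before n = 100 when the null is true. It assumes Normal data with σ = 1 and that the experimenter checks after every single observation; the dialogue does not fix the checking schedule, so that schedule, the seed, and the function name are assumptions of this sketch.

```python
import math
import random

def optional_stopping_rejection_rate(max_n=100, cutoff=2.0, n_sims=2000, seed=1):
    """Estimate Pr(|z| >= cutoff at some n <= max_n) when the null mu = 0 is true."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        total = 0.0
        for n in range(1, max_n + 1):
            total += rng.gauss(0.0, 1.0)
            # z is the running sum divided by its standard deviation, sqrt(n)
            if abs(total) / math.sqrt(n) >= cutoff:
                hits += 1
                break
    return hits / n_sims

rate = optional_stopping_rejection_rate()
print(f"estimated probability of reaching 2 SD by n = 100: {rate:.2f}")
```

The estimate depends on how often one looks, but under these assumptions it lands far above the nominal fixed-sample value of about .05, which is the point of premise (0).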

BIRNBAUM: Suppose you’ve observed x”, a 2-standard deviation difference from an optional stopping experiment E”, that finally stops at n=100. You admit, do you not, that this outcome could have occurred as a result of a different experiment? It could have been that a fair coin was flipped where it is agreed that heads instructs you to perform E’ (fixed sample size experiment, with n = 100) and tails instructs you to perform the optional stopping experiment E”, stopping as soon as you obtain a 2-standard deviation difference, and you happened to get tails, and performed the experiment E”, which happened to stop with n =100.

ERROR STATISTICAL PHILOSOPHER: Well, that is not how x” was obtained, but ok, it could have occurred that way.

BIRNBAUM: Good. Then you must grant further that your result could have come from a special experiment I have dreamt up, call it a BB-experiment. In a BB- experiment, if the outcome from the experiment you actually performed has an outcome with a proportional likelihood to one in some other experiment not performed, E’, then we say that your result has an “LP pair”. For any violation of the strong LP, the outcome observed, let it be x”, has an “LP pair”, call it x’, in some other experiment E’. In that case, a BB-experiment stipulates that you are to report x” as if you had determined whether to run E’ or E” by flipping a fair coin.

(They fill their glasses again)

ERROR STATISTICAL PHILOSOPHER: You’re saying that if my outcome from trying and trying again, that is, optional stopping experiment E”, with an “LP pair” in the fixed sample size experiment I did not perform, then I am to report x” as if the determination to run E” was by flipping a fair coin (which decides between E’ and E”)?

BIRNBAUM: Yes, and one more thing. If your outcome had actually come from the fixed sample size experiment E’, it too would have an “LP pair” in the experiment you did not perform, E”. Whether you actually observed x” from E”, or x’ from E’, you are to report it as x” from E”.

ERROR STATISTICAL PHILOSOPHER: So let’s see if I understand a Birnbaum BB-experiment: whether my observed 2-standard deviation difference came from E’ or E” (with sample size n) the result is reported as x’, as if it came from E’ (fixed sample size), and as a result of this strange type of a mixture experiment.

BIRNBAUM: Yes, or equivalently you could just report x*: my result is a 2-standard deviation difference and it could have come from either E’ (fixed sampling, n = 100) or E” (optional stopping, which happens to stop at the 100th trial). That’s how I sometimes formulate a BB-experiment.

ERROR STATISTICAL PHILOSOPHER: You’re saying in effect that if my result has an LP pair in the experiment not performed, I should act as if I accept the strong LP and just report its likelihood; so if the likelihoods are proportional in the two experiments (both testing the same mean), the outcomes are evidentially equivalent.

BIRNBAUM: Well, but since the BB- experiment is an imagined “mixture” it is a *single* experiment, so really you only need to apply the *weak LP* which frequentists accept. Yes? (The *weak LP is* the same as the *sufficiency principle*).

ERROR STATISTICAL PHILOSOPHER: But what is the sampling distribution in this imaginary BB- experiment? Suppose I have Birnbaumized my experimental result, just as you describe, and observed a 2-standard deviation difference from optional stopping experiment E”. How do I calculate the p-value within a Birnbaumized experiment?

BIRNBAUM: I don’t think anyone has ever called it that.

ERROR STATISTICAL PHILOSOPHER: I just wanted to have a shorthand for the operation you are describing, there’s no need to use it, if you’d rather I not. So how do I calculate the p-value within a BB-experiment?

BIRNBAUM: You would report the overall p-value, which would be the average over the sampling distributions: (p’ + p”)/2

Say p’ is ~.05, and p” is ~.3; whatever they are, we know they are different, that’s what makes this a violation of the strong LP (given in premise (0)).

ERROR STATISTICAL PHILOSOPHER: So you’re saying that if I observe a 2-standard deviation difference from E’, I do not report the associated p-value p’, but instead I am to report the average p-value, averaging over some other experiment E” that could have given rise to an outcome with a proportional likelihood to the one I observed, even though I didn’t obtain it this way?

BIRNBAUM: I’m saying that you have to grant that x’ from a fixed sample size experiment E’ could have been generated through a BB- experiment.

*My this drink is sour! *

ERROR STATISTICAL PHILOSOPHER: Yes, I love pure lemon.

BIRNBAUM: Perhaps you’re in want of a gene; never mind.

I’m saying you have to grant that x’ from a fixed sample size experiment E’ could have been generated through a BB-experiment. If you are to interpret your experiment as if you are within the rules of a BB experiment, then x’ is evidentially equivalent to x” (is equivalent to x*). This is premise (1).

ERROR STATISTICAL PHILOSOPHER: But the result would be that the p-value associated with x’ (fixed sample size) is reported to be larger than it actually is (.05), because I’d be averaging over fixed and optional stopping experiments; while observing x” (optional stopping) is reported to be smaller than it is–in both cases because of an experiment I did not perform.

BIRNBAUM: Yes, the BB-experiment computes the P-value in an *unconditional* manner: it takes the convex combination over the 2 ways the result could have come about.
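The arithmetic of the unconditional report is easy to spell out. The few lines of Python below are a sketch that uses the approximate values from the dialogue (.05 and .3) and a fair-coin weight of 1/2; the function name is hypothetical.

```python
def bb_p_value(p_fixed, p_optional, weight=0.5):
    """Unconditional p-value in the imagined mixture: a convex combination."""
    return weight * p_fixed + (1 - weight) * p_optional

p_fixed, p_optional = 0.05, 0.30  # approximate values used in the dialogue
p_bb = bb_p_value(p_fixed, p_optional)
print(f"fixed: {p_fixed}, optional stopping: {p_optional}, BB average: {p_bb:.3f}")
```

Whenever the two p-values differ, the average lies strictly between them, so it overstates the one and understates the other.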

ERROR STATISTICAL PHILOSOPHER: This is just a matter of your definitions, it is an analytical or mathematical result, so long as we grant being within your BB experiment.

BIRNBAUM: True, (1) plays the role of the sufficiency assumption, but one need not even appeal to this, it is just a matter of mathematical equivalence.

By the way, I am focusing just on LP violations, therefore, the outcome, by definition, has an LP pair. In other cases, where there is no LP pair, you just report things as usual.

ERROR STATISTICAL PHILOSOPHER: OK, but p’ still differs from p”; so I still don’t see how I’m forced to infer the strong LP which identifies the two. In short, I don’t see the contradiction with my rejecting the strong LP in premise (0). (Also we should come back to the “other cases” at some point….)

BIRNBAUM: Wait! Don’t be so impatient; I’m about to get to step (2). Here, let’s toast to the new year: “To Elbar Grease!”

ERROR STATISTICAL PHILOSOPHER: To Elbar Grease!

BIRNBAUM: So far all of this was step (1).

ERROR STATISTICAL PHILOSOPHER: Oy, what is step 2?

BIRNBAUM: STEP 2 is this: Surely, you agree, that once you know from which experiment the observed 2-standard deviation difference actually came, you ought to report the p-value corresponding to that experiment. You ought NOT to report the average (p’ + p”)/2 as you were instructed to do in the BB experiment.

This gives us premise (2a):

(2a) outcome x”, once it is known that it came from E”, should NOT be analyzed as in a BB- experiment where p-values are averaged. The report should instead use the sampling distribution of the optional stopping test E”, yielding the p-value, p” (~.37). In fact, .37 is the value you give in STINT p. 44 (imagining the experimenter keeps taking 10 more).

ERROR STATISTICAL PHILOSOPHER: So, having first insisted I imagine myself in a Birnbaumized, I mean a BB-experiment, and report an average p-value, I’m now to return to my senses and “condition” in order to get back to the only place I ever wanted to be, i.e., back to where I was to begin with?

BIRNBAUM: Yes, at least if you hold to the weak conditionality principle WCP (of D. R. Cox)—surely you agree to this.

(2b) Likewise, if you knew the 2-standard deviation difference came from E’, then

x’ should NOT be deemed evidentially equivalent to x” (as in the BB experiment), the report should instead use the sampling distribution of fixed test E’, (.05).

ERROR STATISTICAL PHILOSOPHER: So, having first insisted I consider myself in a BB-experiment, in which I report the average p-value, I’m now to return to my senses and allow that if I know the result came from optional stopping, E”, I should “condition” on and report p”.

BIRNBAUM: Yes. There was no need to repeat the whole spiel.

ERROR STATISTICAL PHILOSOPHER: I just wanted to be clear I understood you. Of course, all of this assumes the model is correct or adequate to begin with.

BIRNBAUM: Yes, the SLP is a principle for parametric inference within a given model. So you arrive at (2a) and (2b), yes?

ERROR STATISTICAL PHILOSOPHER: OK, but it might be noted that unlike premise (1), premises (2a) and (2b) are not given by definition, they concern an evidential standpoint about how one ought to interpret a result once you know which experiment it came from. In particular, premises (2a) and (2b) say I should condition and use the sampling distribution of the experiment known to have been actually performed, when interpreting the result.

BIRNBAUM: Yes, and isn’t this weak conditionality principle WCP one that you happily accept?

ERROR STATISTICAL PHILOSOPHER: Well the WCP is defined for actual mixtures, where one flipped a coin to determine if E’ or E” is performed, whereas, you’re requiring I consider an imaginary Birnbaum mixture experiment, where the choice of the experiment not performed will vary depending on the outcome that needs an LP pair; and I cannot even determine what this might be until after I’ve observed the result that would violate the LP? I don’t know what the sample size will be ahead of time.

BIRNBAUM: Sure, but you admit that your observed x” could have come about through a BB-experiment, and that’s all I need. Notice

(1), (2a) and (2b) yield the strong LP!

Outcome x” from E”(optional stopping that stops at n) is evidentially equivalent to x’ from E’ (fixed sample size n).

ERROR STATISTICAL PHILOSOPHER: Clever, but your “proof” is obviously unsound; and before I demonstrate this, notice that the conclusion, were it to follow, asserts p’ = p”, (e.g., .05 = .3!), even though it is unquestioned that p’ is not equal to p”, that is because we must start with an LP violation (premise (0)).

BIRNBAUM: Yes, it is puzzling, but where have I gone wrong?

(The waiter comes by and fills their glasses; they are so deeply engrossed in thought they do not even notice him.)

ERROR STATISTICAL PHILOSOPHER: There are many routes to explaining a fallacious argument. Here’s one. What is required for STEP 1 to hold, is the denial of what’s needed for STEP 2 to hold:

Step 1 requires us to analyze results in accordance with a BB- experiment. If we do so, true enough we get:

premise (1): outcome x” (in a BB experiment) is evidentially equivalent to outcome x’ (in a BB experiment):

That is because in either case, the p-value would be (p’ + p”)/2

Step 2 now insists that we should NOT calculate evidential import as if we were in a BB- experiment. Instead we should consider the experiment from which the data actually came, E’ or E”:

premise (2a): outcome x” (in a BB experiment) is/should be evidentially equivalent to x” from E” (optional stopping that stops at n): its p-value should be p”.

premise (2b): outcome x’ (within a BB experiment) is/should be evidentially equivalent to x’ from E’ (fixed sample size): its p-value should be p’.

If (1) is true, then (2a) and (2b) must be false!

If (1) is true and we keep fixed the stipulation of a BB experiment (which we must to apply step 2), then (2a) is asserting:

The average p-value (p’ + p”)/2 = p”, which is false.

Likewise if (1) is true, then (2b) is asserting:

The average p-value (p’ + p”)/2 = p’, which is false.

Alternatively, we can see what goes wrong by realizing:

If (2a) and (2b) are true, then premise (1) must be false.

In short your famous argument requires us to assess evidence in a given experiment in two contradictory ways: as if we are within a BB- experiment (and report the average p-value) and also that we are not, but rather should report the actual p-value.

I can render it as formally valid, but then its premises can never all be true; alternatively, I can get the premises to come out true, but then the conclusion is false—so it is invalid. In no way does it show the frequentist is open to contradiction (by dint of accepting S, WCP, and denying the LP).

BIRNBAUM: Yet some people still think it is a breakthrough (in favor of Bayesianism).

ERROR STATISTICAL PHILOSOPHER: I have a much clearer exposition of what goes wrong in your argument than I did in the discussion from 2010. There were still several gaps, and lack of a clear articulation of the WCP. In fact, I’ve come to see that clarifying the entire argument turns on defining the WCP. Have you seen my 2014 paper in *Statistical Science?* The key difference is that in (2014), the WCP is stated as an equivalence, as you intended. Cox’s WCP, many claim, was not an equivalence, going in 2 directions. Slides from a presentation may be found on this blogpost.

BIRNBAUM: Yes I have seen your 2014 paper, very clever! Your Rejoinder to some of the critics is gutsy, to say the least. Congratulations! I’ve also seen the slides on your blog.

ERROR STATISTICAL PHILOSOPHER: Thank you, I’m amazed you follow my blog! But look I *must* get your answer to a question before you leave this year.

*Sudden interruption by the waiter*

WAITER: Who gets the tab?

BIRNBAUM: I do. To Elbar Grease! And to your new book SIST!

ERROR STATISTICAL PHILOSOPHER: **To Elbar Grease! To finally finishing SIST in 2018! Happy New Year!**

ERROR STATISTICAL PHILOSOPHER: I have one quick question, Professor Birnbaum, and I swear that whatever you say will be just between us, I won’t tell a soul. In your last couple of papers, you suggest you’d discovered the flaw in your argument for the LP. Am I right? Even in the discussion of your (1962) paper, you seemed to agree with Pratt that WCP can’t do the job you intend.

BIRNBAUM: Savage, you know, never got off my case about remaining at “the half-way house” of likelihood, and not going full Bayesian. Then I wrote the review about the Confidence Concept as the one rock on a shifting scene… Pratt thought the argument should instead appeal to a Censoring Principle (basically, it doesn’t matter if your instrument cannot measure beyond k units if the measurement you’re making is under k units.)

ERROR STATISTICAL PHILOSOPHER: Yes, but who says frequentist error statisticians deny the Censoring Principle? So back to my question, you disappeared before answering last year…I just want to know…you did see the flaw, yes?

WAITER: We’re closing now; shall I call you a taxicab?

BIRNBAUM: Yes.

ERROR STATISTICAL PHILOSOPHER: ‘Yes’, you discovered the flaw in the argument, or ‘yes’ to the taxi?

MANAGER: We’re closing now; I’m sorry you must leave.

ERROR STATISTICAL PHILOSOPHER: We’re leaving; I just need him to clarify his answer….

*Large group of people bustle past.*

Prof. Birnbaum…? Allan? **Where did he go?** (oy, not again!)

**Link to complete discussion:**

Mayo, Deborah G. On the Birnbaum Argument for the Strong Likelihood Principle (with discussion & rejoinder). *Statistical Science* 29 (2014), no. 2, 227-266.

[i] Many links on the strong likelihood principle (LP or SLP) and Birnbaum may be found by searching this blog. Good sources for where to start as well as historical background papers may be found in my last blogpost.

[ii] By the way, Ronald Giere gave me numerous original papers of yours. They’re in files in my attic library. Some are in mimeo, others typed…I mean, obviously for that time that’s what they’d be…now of course, oh never mind, sorry.

An essential component of inference based on familiar frequentist notions: p-values, significance and confidence levels, is the relevant sampling distribution (hence the term *sampling theory, *or my preferred *error statistics, *as we get error probabilities from the sampling distribution). This feature results in violations of a principle known as the *strong likelihood principle* (SLP). To state the SLP roughly, it asserts that all the evidential import in the data (for parametric inference within a model) resides in the likelihoods. If accepted, it would render error probabilities irrelevant post data.

**SLP** (We often drop the “strong” and just call it the LP. The “weak” LP just boils down to sufficiency)

For any two experiments E_{1} and E_{2} with different probability models f_{1}, f_{2}, but with the same unknown parameter θ, if outcomes x* and y* (from E_{1} and E_{2} respectively) determine the same (i.e., proportional) likelihood function (f_{1}(x*; θ) = cf_{2}(y*; θ) for all θ), then x* and y* are inferentially equivalent (for an inference about θ).

(What differentiates the weak and the strong LP is that the weak refers to a single experiment.)

**Violation of SLP:**

An SLP violation occurs whenever outcomes x* and y*, from experiments E_{1} and E_{2} with different probability models f_{1}, f_{2} but with the same unknown parameter θ, satisfy f_{1}(x*; θ) = cf_{2}(y*; θ) for all θ, and yet x* and y* have different implications for an inference about θ.
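To make the definition concrete, here is a minimal sketch using the standard textbook pairing of a Binomial experiment (fixed number of flips) with a Negative Binomial one (flip until a fixed number of tails). This pairing is not the example used in this post, and the data (9 heads, 3 tails) and function names are assumptions chosen for illustration. The likelihoods are proportional for every θ, yet the one-sided P-values for testing θ = 0.5 differ.

```python
from fractions import Fraction
from math import comb

HEADS, TAILS = 9, 3          # hypothetical observed counts
N = HEADS + TAILS            # total number of flips


def binomial_likelihood(theta):
    """E1: flip exactly N times; theta is the probability of heads."""
    return comb(N, HEADS) * theta**HEADS * (1 - theta)**TAILS


def negative_binomial_likelihood(theta):
    """E2: flip until the TAILS-th tail appears (so the last flip is a tail)."""
    return comb(N - 1, HEADS) * theta**HEADS * (1 - theta)**TAILS


# The ratio f1/f2 is the same constant c for every theta on a grid.
ratios = {
    binomial_likelihood(Fraction(k, 10)) / negative_binomial_likelihood(Fraction(k, 10))
    for k in range(1, 10)
}

half = Fraction(1, 2)
# E1: P(at least 9 heads in 12 flips; theta = 0.5)
p_binomial = sum(comb(N, k) for k in range(HEADS, N + 1)) * half**N
# E2: at least 9 heads before the 3rd tail <=> at most 2 tails in the first 11 flips
p_negative_binomial = sum(comb(N - 1, k) for k in range(TAILS)) * half**(N - 1)

print(ratios)                       # a single constant: proportional likelihoods
print(float(p_binomial))            # about 0.073
print(float(p_negative_binomial))   # about 0.033
```

The same likelihood function thus comes with two different error probabilities, which is what the error statistician means by a violation of the SLP.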

For an example of an SLP violation, E_{1} might be sampling from a Normal distribution with a fixed sample size n, and E_{2} the corresponding experiment that uses an optional stopping rule: keep sampling until you obtain a result 2 standard deviations away from a null hypothesis that θ = 0 (and for simplicity, a known standard deviation). When you do, stop and reject the point null (in 2-sided testing).

The SLP tells us (in relation to the optional stopping rule) that once you have observed a 2-standard deviation result, there should be no evidential difference between its having arisen from experiment E_{1} , where n was fixed, say, at 100, and experiment E_{2} where the stopping rule happens to stop at n = 100. For the error statistician, by contrast, there is a difference, and this constitutes a violation of the SLP.
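Why the error statistician sees a difference can be illustrated by simulation. The sketch below is written under stated assumptions: data are drawn from N(0, 1) with the null θ = 0 true, the optional stopping rule is capped at 100 observations so the simulation terminates, and the function names, seed, and number of trials are all hypothetical choices. It compares how often each experiment produces a 2-standard deviation result when there is no effect.

```python
import random
from math import sqrt


def fixed_n_rejects(n, rng, threshold=2.0):
    """E1: take exactly n observations, then check for a 2-standard deviation result."""
    total = sum(rng.gauss(0.0, 1.0) for _ in range(n))
    return abs(total) / sqrt(n) >= threshold


def optional_stopping_rejects(max_n, rng, threshold=2.0):
    """E2: check after every observation and stop at the first 2-standard deviation result.

    The cap max_n is an assumption so the simulation finishes.
    """
    total = 0.0
    for n in range(1, max_n + 1):
        total += rng.gauss(0.0, 1.0)
        if abs(total) / sqrt(n) >= threshold:
            return True
    return False


rng = random.Random(0)  # fixed seed so the run is repeatable
trials = 2000
fixed_rate = sum(fixed_n_rejects(100, rng) for _ in range(trials)) / trials
optional_rate = sum(optional_stopping_rejects(100, rng) for _ in range(trials)) / trials

print(f"fixed n = 100:        erroneous rejection rate ~ {fixed_rate:.3f}")
print(f"optional stopping:    erroneous rejection rate ~ {optional_rate:.3f}")
```

Under the fixed sample size the rate stays near the nominal two-sided level for a 2-standard deviation cutoff, while trying and trying again inflates it severalfold. The likelihoods cannot register this difference; the sampling distribution does.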

———————-

*Now for the surprising part:* Remember the 60-year-old chestnut from my last post where a coin is flipped to decide which of two experiments to perform? David Cox (1958) proposes something called the Weak Conditionality Principle (WCP) to restrict the space of relevant repetitions for frequentist inference. The WCP says that once it is known which E_{i} produced the measurement, the assessment should be in terms of the properties of the particular E_{i}. Nothing could be more obvious.

The surprising upshot of Allan Birnbaum’s (1962) argument is that the SLP appears to follow from applying the WCP in the case of mixtures, together with so uncontroversial a principle as sufficiency (SP). But this would preclude the use of sampling distributions. L. J. Savage calls Birnbaum’s argument “a landmark in statistics” (see [i]).

Although his argument purports that [(WCP and SP) entails SLP], I show how data may violate the SLP while holding both the WCP and SP. Such cases directly refute [WCP entails SLP].

In Birnbaum’s argument, he introduces an informal, and rather vague, notion of the “evidence (or evidential meaning) of an outcome **z** from experiment E”. He writes it: Ev(E, **z**).

In my formulation of the argument, I introduce a new symbol to represent a function from a given experiment-outcome pair, (E,**z**) to a generic inference implication. It (hopefully) lets us be clearer than does Ev.

(E,**z**) ⇒ Infr_{E}(**z**) is to be read “the inference implication from outcome **z** in experiment E” (according to whatever inference type/school is being discussed).

*An outline of my argument is in the slides for a talk below:*

**Binge reading the Likelihood Principle.**

If you’re keen to binge read the SLP–a way to break holiday/winter break doldrums– I’ve pasted most of the early historical sources before the slides. The argument is simple; showing what’s wrong with it took a long time. My earliest treatment, via counterexample, is in Mayo (2010). A deeper argument is in Mayo (2014) in *Statistical Science*.[ii] An intermediate paper Mayo (2013) corresponds to the slides below–they were presented at the JSM. Interested readers may search this blog for quite a lot of discussion of the SLP including “U-Phils” (discussions by readers) (e.g., here, and here), and amusing notes (e.g., Don’t Birnbaumize that experiment my friend, and Midnight with Birnbaum).

You may not wish to engage in what looks to be (and is) a rather convoluted logical argument. That’s fine, but just remember that when someone says “it’s been proved mathematically” that error probabilities are irrelevant to evidence post data, you can say, “I read somewhere that this has been disproved”.

—–

[i] Savage on Birnbaum: “This paper is a landmark in statistics. . . . I, myself, like other Bayesian statisticians, have been convinced of the truth of the likelihood principle for a long time. Its consequences for statistics are very great. . . . [T]his paper is really momentous in the history of statistics. It would be hard to point to even a handful of comparable events. …once the likelihood principle is widely recognized, people will not long stop at that halfway house but will go forward and accept the implications of personalistic probability for statistics” (Savage 1962, 307-308).

The argument purports to follow from principles frequentist error statisticians accept.

[ii] The link includes comments on my paper by Bjornstad, Dawid, Evans, Fraser, Hannig, and Martin and Liu, and my rejoinder.

**Birnbaum Papers:**

- Birnbaum, A. (1962), “On the Foundations of Statistical Inference”, *Journal of the American Statistical Association* 57(298), 269-306.
- Savage, L. J., Barnard, G., Cornfield, J., Bross, I., Box, G., Good, I., Lindley, D., Clunies-Ross, C., Pratt, J., Levene, H., Goldman, T., Dempster, A., Kempthorne, O., and Birnbaum, A. (1962), “Discussion on Birnbaum’s On the Foundations of Statistical Inference”, *Journal of the American Statistical Association* 57(298), 307-326.
- Birnbaum, A. (1970), “Statistical Methods in Scientific Inference” (letter to the editor), *Nature* 225, 1033.
- Birnbaum, A. (1972), “More on Concepts of Statistical Evidence”, *Journal of the American Statistical Association* 67(340), 858-861.

**Note to Reader:** If you look at the “discussion”, you can already see Birnbaum backtracking a bit, in response to Pratt’s comments.

Some additional early discussion papers:

**Durbin:**

- Durbin, J. (1970), “On Birnbaum’s Theorem on the Relation Between Sufficiency, Conditionality and Likelihood”, *Journal of the American Statistical Association*, Vol. 65, No. 329 (Mar., 1970), pp. 395-398.
- Savage, L. J. (1970), “Comments on a Weakened Principle of Conditionality”, *Journal of the American Statistical Association*, Vol. 65, No. 329 (Mar., 1970), pp. 399-401.
- Birnbaum, A. (1970), “On Durbin’s Modified Principle of Conditionality”, *Journal of the American Statistical Association*, Vol. 65, No. 329 (Mar., 1970), pp. 402-403.

There’s also a good discussion in Cox and Hinkley 1974.

**Evans, Fraser, and Monette:**

- Evans, M., Fraser, D. A., and Monette, G. (1986), “On Principles and Arguments to Likelihood”, *The Canadian Journal of Statistics* 14: 181-199.

**Kalbfleisch:**

- Kalbfleisch, J. D. (1975), “Sufficiency and Conditionality”, *Biometrika*, Vol. 62, No. 2 (Aug., 1975), pp. 251-259.
- Barnard, G. A. (1975), “Comments on Paper by J. D. Kalbfleisch”, *Biometrika*, Vol. 62, No. 2 (Aug., 1975), pp. 260-261.
- Barndorff-Nielsen, O. (1975), “Comments on Paper by J. D. Kalbfleisch”, *Biometrika*, Vol. 62, No. 2 (Aug., 1975), pp. 261-262.
- Birnbaum, A. (1975), “Comments on Paper by J. D. Kalbfleisch”, *Biometrika*, Vol. 62, No. 2 (Aug., 1975), pp. 262-264.
- Kalbfleisch, J. D. (1975), “Reply to Comments”, *Biometrika*, Vol. 62, No. 2 (Aug., 1975), p. 268.

**My discussions:**

- Mayo, D. G. (2010), “An Error in the Argument from Conditionality and Sufficiency to the Likelihood Principle”, in *Error and Inference: Recent Exchanges on Experimental Reasoning, Reliability and the Objectivity and Rationality of Science* (D. Mayo and A. Spanos eds.), Cambridge: Cambridge University Press: 305-14.
- Mayo, D. G. (2013), “Presented Version: On the Birnbaum Argument for the Strong Likelihood Principle”, in *JSM Proceedings*, Section on Bayesian Statistical Science. Alexandria, VA: American Statistical Association: 440-453.
- Mayo, D. G. (2014), “On the Birnbaum Argument for the Strong Likelihood Principle”, paper with discussion and Mayo rejoinder, *Statistical Science* 29(2), 227-266.

2018 marked 60 years since the famous weighing machine example from Sir David Cox (1958)[1]. It’s one of the “chestnuts” in the exhibits of “chestnuts and howlers” in Excursion 3 (Tour II) of my new book *Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars* (SIST). It’s especially relevant to take this up now, just before we leave 2018, for reasons that will be revealed over the next day or two. So, let’s go back to it, with an excerpt from SIST (pp. 170-173).

**Exhibit (vi): Two Measuring Instruments of Different Precisions.** *Did you hear about the frequentist who, knowing she used a scale that’s right only half the time, claimed her method of weighing is right 75% of the time?*

She says, “I flipped a coin to decide whether to use a scale that’s right 100% of the time, or one that’s right only half the time, so, overall, I’m right 75% of the time.” (She wants credit because she could have used a better scale, even knowing she used a lousy one.)

*Basis for the joke: *An N-P test bases error probability on all possible outcomes or measurements that could have occurred in repetitions, but did not.

As with many infamous pathological examples, often presented as knockdown criticisms of all of frequentist statistics, this was invented by a frequentist, Cox (1958). It was a way to highlight what could go wrong in the case at hand, if one embraced an unthinking behavioral-performance view. Yes, error probabilities are taken over hypothetical repetitions of a process, but not just any repetitions will do. Here’s the statistical formulation.

We flip a fair coin to decide which of two instruments, E_{1} or E_{2}, to use in observing a Normally distributed random sample **Z** to make inferences about mean *θ*.

In testing a null hypothesis such as *θ* = 0, the same **z** measurement would correspond to a much smaller P-value were it to have come from E_{1} rather than from E_{2}.

Suppose that we know we have observed a measurement from E_{2} with its much larger variance:

The unconditional test says that we can assign this a higher level of significance than we ordinarily do, because if we were to repeat the experiment, we might sample some quite different distribution. But this fact seems irrelevant to the interpretation of an observation which we know came from a distribution [with the larger variance]. (Cox 1958, p. 361)

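Once it is known which E_{i} has produced **z**, the P-value or other inferential assessment should be made with reference to the experiment actually run. As we say in Cox and Mayo (2010):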

The point essentially is that the marginal distribution of a P-value averaged over the two possible configurations is misleading for a particular set of data. It would mean that an individual fortunate in obtaining the use of a precise instrument in effect sacrifices some of that information in order to rescue an investigator who has been unfortunate enough to have the randomizer choose a far less precise tool. From the perspective of interpreting the specific data that are actually available, this makes no sense. (p. 296)

To scotch his famous example, Cox (1958) introduces a principle: weak conditionality.

**Weak Conditionality Principle (WCP):** If a mixture experiment (of the aforementioned type) is performed, then, if it is known which experiment produced the data, inferences about θ are appropriately drawn in terms of the sampling behavior in the experiment known to have been performed (Cox and Mayo 2010, p. 296).
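A small numerical sketch may help fix ideas about what the WCP rules out. Everything specific here is an assumption for illustration: a single measurement rather than a sample, standard deviations of 1 and 1000 for the precise and imprecise instruments, an observed value of 3, and the function and variable names. It computes the conditional P-values for each instrument and the unconditional (averaged) P-value, along with the 75% figure from the joke.

```python
from math import erfc, sqrt


def two_sided_p(z: float, sigma: float) -> float:
    """Two-sided P-value for H0: theta = 0 from one Normal measurement z with known sd sigma."""
    return erfc(abs(z) / (sigma * sqrt(2)))


SIGMA_PRECISE = 1.0       # assumed standard deviation for E1
SIGMA_IMPRECISE = 1000.0  # assumed standard deviation for E2 (much larger variance)

z = 3.0  # hypothetical observed measurement

p1 = two_sided_p(z, SIGMA_PRECISE)    # conditional on E1 having been used
p2 = two_sided_p(z, SIGMA_IMPRECISE)  # conditional on E2 having been used
p_unconditional = (p1 + p2) / 2       # average over the fair coin flip

# The joke: a fair coin picks a scale that is right 100% or 50% of the time.
overall_accuracy = 0.5 * 1.0 + 0.5 * 0.5

print(f"p1 (precise instrument):   {p1:.4f}")
print(f"p2 (imprecise instrument): {p2:.4f}")
print(f"unconditional average:     {p_unconditional:.4f}")
print(f"overall accuracy in joke:  {overall_accuracy}")
```

The averaged P-value describes neither instrument: it badly understates the evidence when the precise tool was used and overstates it when the imprecise one was. The WCP directs us to report p1 or p2 according to which instrument actually produced the measurement.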

It is called weak conditionality because there are more general principles of conditioning that go beyond the special case of mixtures of measuring instruments.

While conditioning on the instrument actually used seems obviously correct, nothing precludes the N-P theory from choosing the procedure “which is best on the average over both experiments” (Lehmann and Romano 2005, p. 394), and it’s even possible that the average or unconditional power is better than the conditional. In the case of such a conflict, Lehmann says relevant conditioning takes precedence over average power (1993b). He allows that in some cases of acceptance sampling, the average behavior may be relevant, but in scientific contexts the conditional result would be the appropriate one (see Lehmann 1993b, p. 1246). Context matters. Did Neyman and Pearson ever weigh in on this? Not to my knowledge, but I’m sure they’d concur with N-P tribe leader Lehmann. Admittedly, if your goal in life is to attain a precise α level, then when discrete distributions preclude this, a solution would be to flip a coin to decide the borderline cases! (See also Example 4.6, Cox and Hinkley 1974, pp. 95–6; Birnbaum 1962, p. 491.)

**Is There a Catch?**

The “two measuring instruments” example occupies a famous spot in the pantheon of statistical foundations, regarded by some as causing “a subtle earthquake” in statistical foundations. Analogous examples are made out in terms of confidence interval estimation methods (Tour III, Exhibit (viii)). It is a warning to the most behavioristic accounts of testing from which we have already distinguished the present approach. Yet justification for the conditioning (WCP) is fully within the frequentist error statistical philosophy, for contexts of scientific inference. There is no suggestion, for example, that only the particular data set be considered. That would entail abandoning the sampling distribution as the basis for inference, and with it the severity goal. Yet we are told that “there is a catch” and that WCP leads to the Likelihood Principle (LP)!

It is not uncommon to see statistics texts argue that in frequentist theory one is faced with the following dilemma: either to deny the appropriateness of conditioning on the precision of the tool chosen by the toss of a coin, or else to embrace the strong likelihood principle, which entails that frequentist sampling distributions are irrelevant to inference once the data are obtained. This is a false dilemma. Conditioning is warranted to achieve objective frequentist goals, and the [weak] conditionality principle coupled with sufficiency does not entail the strong likelihood principle. The ‘dilemma’ argument is therefore an illusion. (Cox and Mayo 2010, p. 298)

There is a large literature surrounding the argument for the Likelihood Principle, made famous by Birnbaum (1962). Birnbaum hankered for something in between radical behaviorism and throwing error probabilities out the window. Yet he himself had apparently proved there is no middle ground (if you accept WCP)! Even people who thought there was something fishy about Birnbaum’s “proof” were discomfited by the lack of resolution to the paradox. It is time for post-LP philosophies of inference. So long as the Birnbaum argument, which Savage and many others deemed important enough to dub a “breakthrough in statistics,” went unanswered, the frequentist was thought to be boxed into the pathological examples. She is not.

In fact, I show there is a flaw in his venerable argument (Mayo 2010b, 2013a, 2014b). That’s a relief. Now some of you will howl, “Mayo, not everyone agrees with your disproof! Some say the issue is not settled.” Fine, please explain where my refutation breaks down. It’s an ideal brainbuster to work on along the promenade after a long day’s tour. Don’t be dismayed by the fact that it has been accepted for so long. But I won’t revisit it here.

From *Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars* (Mayo 2018, CUP).

Excursion 3 Tour II, pp. 170-173.

Note to the Reader:

Textbooks should not call a claim a theorem if it’s not a theorem, i.e., if there isn’t a proof of it (within the relevant formal system). Yet you will find many statistics texts, and numerous discussion articles, that blithely repeat that the (strong) Likelihood Principle is a theorem, shown to follow if you accept the (WCP), which frequentist error statisticians do.[2] Yet I argue it is nothing of the kind, and that Allan Birnbaum’s (1962) alleged proof is circular. So, **in 2019, when you find a text that claims the LP is a theorem, provable from the (WCP), please let me know.**

If statistical inference follows Bayesian posterior probabilism, the LP follows easily. It’s shown in just a couple of pages of Excursion 1 Tour II (45-6). All the excitement is whether the frequentist (error statistician) is bound to hold it. If she is, then error probabilities become irrelevant to the evidential import of data (once the data are given), at least when making parametric inferences within a statistical model.

The LP was a main topic for the first few years of this blog. That’s because I was still refining an earlier disproof from Mayo (2010), based on giving a counterexample. I later saw the need for a deeper argument which I give in Mayo (2014) in *Statistical Science*.[3] (There, among other subtleties, the WCP is put as a logical equivalence as intended.)

“It was the adoption of an unqualified equivalence formulation of conditionality, and related concepts, which led, in my 1962 paper, to the monster of the likelihood axiom,” (Birnbaum 1975, 263).

If you’re keen to try your hand at the arguments (Birnbaum’s or mine), you might start with a summary post (based on slides) here, or an intermediate paper Mayo (2013) that I presented at the JSM. It is *not* included in SIST. It’s a brainbuster, though, I warn you. There’s no real mathematics or statistics involved, it’s pure logic. But it’s very circuitous, which is why the supposed “proof” has stuck around as long as it has.

[1] Cox 1958 has a different variant of the chestnut.

[2] Note sufficiency is not really needed in the “proof”.

[3] The discussion includes commentaries by Dawid, Evans, Martin and Liu, Hannig, and Bjørnstad–some of whom are very unhappy with me. But I’m given the final word in the rejoinder.

**References **(outside of the excerpt; for refs within SIST, please see SIST):

Birnbaum, A. (1962), “On the Foundations of Statistical Inference“, *Journal of the American Statistical Association* 57(298), 269-306.

Birnbaum, A. (1975). *Comments on Paper by J. D. Kalbfleisch*. Biometrika, 62 (2), 262–264.

Cox, D. R. (1958), “Some problems connected with statistical inference“, The Annals of Mathematical Statistics, 29, 357-372.

Mayo, D. G. (2010). “An Error in the Argument from Conditionality and Sufficiency to the Likelihood Principle” in *Error and Inference: Recent Exchanges on Experimental Reasoning, Reliability and the Objectivity and Rationality of Science* (D Mayo and A. Spanos eds.), Cambridge: Cambridge University Press: 305-14.

Mayo, D. G. (2013) “Presented Version: On the Birnbaum Argument for the Strong Likelihood Principle”, in *JSM Proceedings*, Section on Bayesian Statistical Science. Alexandria, VA: American Statistical Association: 440-453.

Mayo, D. G. (2014). “On the Birnbaum Argument for the Strong Likelihood Principle,” paper with discussion and Mayo rejoinder: *Statistical Science* 29(2), pp. 227-239, 261-266.

Objectivity in statistics, as in science more generally, is a matter of both aims and methods. Objective science, in our view, aims to find out what is the case as regards aspects of the world [that hold] independently of our beliefs, biases and interests; thus objective methods aim for the critical control of inferences and hypotheses, constraining them by evidence and checks of error. (Cox and Mayo 2010, p. 276)

Whenever you come up against blanket slogans such as “no methods are objective” or “all methods are equally objective and subjective” it is a good guess that the problem is being trivialized into oblivion. Yes, there are judgments, disagreements, and values in any human activity, which alone makes it too trivial an observation to distinguish among very different ways that threats of bias and unwarranted inferences may be controlled. Is the objectivity–subjectivity distinction really toothless, as many will have you believe? I say no. I know it’s a meme promulgated by statistical high priests, but you agreed, did you not, to use a bit of chutzpah on this excursion? Besides, cavalier attitudes toward objectivity are at odds with even more widely endorsed grass roots movements to promote replication, reproducibility, and to come clean on a number of sources behind illicit results: multiple testing, cherry picking, failed assumptions, researcher latitude, publication bias and so on. The moves to take back science are rooted in the supposition that we can more objectively scrutinize results – even if it’s only to point out those that are BENT. The fact that these terms are used equivocally should not be taken as grounds to oust them but rather to engage in the difficult work of identifying what there is in “objectivity” that we won’t give up, and shouldn’t.

**The Key Is Getting Pushback!** While knowledge gaps leave plenty of room for biases, arbitrariness, and wishful thinking, we regularly come up against data that thwart our expectations and disagree with the predictions we try to foist upon the world. We get pushback! This supplies objective constraints on which our critical capacity is built. Our ability to recognize when data fail to match anticipations affords the opportunity to systematically improve our orientation. Explicit attention needs to be paid to communicating results to set the stage for others to check, debate, and extend the inferences reached. Which conclusions are likely to stand up? Where do the weakest parts remain? Don’t let anyone say you can’t hold them to an objective account.

Excursion 2, Tour II led us from a Popperian tribe to a workable demarcation for scientific inquiry. That will serve as our guide now for scrutinizing the myth of the myth of objectivity. First, good sciences put claims to the test of refutation, and must be able to embark on an inquiry to pin down the sources of any apparent effects. Second, refuted claims aren’t held on to in the face of anomalies and failed replications; they are treated as refuted in further work (at least provisionally); well-corroborated claims are used to build on theory or method: science is not just stamp collecting. The good scientist deliberately arranges inquiries so as to capitalize on pushback, on effects that will not go away, on strategies to get errors to ramify quickly and force us to pay attention to them. The ability to register how hunting, optional stopping, and cherry picking alter their error-probing capacities is a crucial part of a method’s objectivity. In statistical design, day-to-day tricks of the trade to combat bias are consciously amplified and made systematic. It is not because of a “disinterested stance” that we invent such methods; it is that we, quite competitively and self-interestedly, want our theories to succeed in the market place of ideas.

Admittedly, that desire won’t suffice to incentivize objective scrutiny if you can do just as well producing junk. Successful scrutiny is very different from success at grants, getting publications and honors. That is why the reward structure of science is so often blamed nowadays. New incentives, gold stars and badges for sharing data and for resisting the urge to cut corners are being adopted in some fields. Fortunately, for me, our travels will bypass lands of policy recommendations, where I have no special expertise. I will stop at the perimeters of scrutiny of methods which at least provide us citizen scientists armor against being misled. Still, if the allure of carrots has grown stronger than the sticks, we need stronger sticks.

Problems of objectivity in statistical inference are deeply intertwined with a jungle of philosophical problems, in particular with questions about what objectivity demands, and disagreements about “objective versus subjective” probability. On to the jungle!

From *Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars* (Mayo 2018, CUP)

**Notes to the Reader of this Blog:**

Many of the ideas on objectivity in Excursion 4 Tour I are distilled from posts and discussions on this blog. I’ve pasted some of those posts below, starting with a relatively recent one with the title of this Tour. Perusing the comments by readers is valuable in its own right. (You can find a list of all posts on this blog by searching “All She Wrote (so far)”.)

The Myth of “The Myth of Objectivity”

Objectivity #2: The ‘Dirty Hands’ Argument for Ethics in Evidence

Objectivity #3: Clean(er) Hands With Metastatistics

Objectivity (#4) and the “Argument From Discretion”

Objectivity in Statistics: “Arguments From Discretion and 3 Reactions”

A long-standing family feud among frequentists is between hypotheses tests and confidence intervals (CIs). In fact there’s a clear duality between the two: the parameter values within the (1 – α) CI are those that are not rejectable by the corresponding test at level α. (3.7) illuminates both CIs and severity by means of this duality. A key idea is arguing from the **capabilities** of methods to what may be inferred. CIs thereby obtain an inferential rationale (beyond performance), and several benchmarks are reported.
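The duality just stated can be checked numerically. This is a minimal sketch for the simplest case, a two-sided test of a Normal mean with known standard deviation; the sample mean, standard deviation, sample size, and function names are all assumptions chosen for illustration. It confirms, over a grid of hypothesized values, that a value lies inside the (1 – α) CI exactly when the level α test does not reject it.

```python
from math import sqrt
from statistics import NormalDist


def z_test_rejects(xbar, mu0, sigma, n, alpha=0.05):
    """Two-sided level-alpha test of H0: mu = mu0 with known sigma."""
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    return abs(xbar - mu0) / (sigma / sqrt(n)) > z_crit


def confidence_interval(xbar, sigma, n, alpha=0.05):
    """Two-sided (1 - alpha) confidence interval for mu with known sigma."""
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    half_width = z_crit * sigma / sqrt(n)
    return xbar - half_width, xbar + half_width


# Hypothetical data summary: sample mean 0.4, sigma 1, n = 25.
xbar, sigma, n = 0.4, 1.0, 25
lower, upper = confidence_interval(xbar, sigma, n)

# Check the duality for hypothesized means from -1.00 to 2.00 in steps of 0.01.
grid = [k / 100 for k in range(-100, 201)]
duality_holds = all(
    (lower <= mu0 <= upper) == (not z_test_rejects(xbar, mu0, sigma, n))
    for mu0 in grid
)

print(f"95% CI: ({lower:.3f}, {upper:.3f})")
print(f"inside CI <=> not rejected, for every grid value: {duality_holds}")
```

Reporting the interval is thus a compact way of reporting the outcomes of a whole family of tests, which is what allows CIs to be given an inferential rather than a purely performance rationale.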

In (3.8) we reopen a highly controversial matter of interpretation in relation to statistics and the 2012 discovery of the Higgs particle based on a “5 sigma observed effect”. Because the 5 sigma standard refers to frequentist significance testing, the discovery was immediately imbued with controversies that, at bottom, concern statistical philosophy. Some Bayesians even hinted it was “bad science”. One of the knottiest criticisms concerns the very meaning of the phrase: “the probability our results are (merely) a statistical fluctuation”. Failing to clarify it may impinge on the nature of future big science inquiry. The problem is a bit delicate, and my solution is likely to be provocative. Even rejecting my construal will allow readers to see what it’s like to switch from wearing probabilist, to severe testing, glasses.

**Keywords**

Confidence intervals, lower bound, upper bound

Confidence intervals, duality with tests

Duality between CI inferences and severity

Capability and severity

Rubbing off interpretation

Confidence distributions (CD)

Confidence level (coefficient)

Meaning vs application gap (in interpreting CIs)

Higgs particle

ISBA (International Society for Bayesian Analysis)

Look elsewhere effect (local and global P-values)

5 sigma

P-value police

Beyond standard model physics (BSM)

Probable flukes

ASA P-value guide

*Chestnuts*:

A 95% CI known to be true

Confidence sets might have high overall coverage while known to be true in given cases

rigged and pathological CIs

*Excerpts from Excursion 3 Tour III are here.*

**Where you are in the journey.**

From the itinerary: A long-standing family feud among frequentists is between hypotheses tests and confidence intervals (CIs), but in fact there’s a clear duality between the two. The dual mission of the first stop (Section 3.7) of this tour is to illuminate both CIs and severity by means of this duality. A key idea is arguing from the capabilities of methods to what may be inferred. The severity analysis seamlessly blends testing and estimation. A typical inquiry first tests for the existence of a genuine effect and then estimates magnitudes of discrepancies, or inquires if theoretical parameter values are contained within a confidence interval. At the second stop (Section 3.8) we reopen a highly controversial matter of interpretation that is often taken as settled. It relates to statistics and the discovery of the Higgs particle – displayed in a recently opened gallery on the “Statistical Inference in Theory Testing” level of today’s museum.

**3.7 Severity, Capability, and Confidence Intervals (CIs)**

It was shortly before Egon offered him a faculty position at University College starting 1934 that Neyman gave a paper at the Royal Statistical Society (RSS) which included a portion on confidence intervals, intending to generalize Fisher’s fiducial intervals. With K. Pearson retired (he’s still editing Biometrika but across campus with his assistant Florence David), the tension is between E. Pearson, along with remnants of K.P.’s assistants, and Fisher on the second and third floors, respectively. Egon hoped Neyman’s coming on board would melt some of the ice.

Neyman’s opinion was that “Fisher’s work was not really understood by many statisticians . . . mainly due to Fisher’s very condensed form of explaining his ideas” (C. Reid 1998, p. 115). Neyman sees himself as championing Fisher’s goals by means of an approach that gets around these expository obstacles. So Neyman presents his first paper to the Royal Statistical Society (June, 1934), which includes a discussion of confidence intervals, and, as usual, comments (later published) follow. Arthur Bowley (1934), a curmudgeon on the K.P. side of the aisle, rose to thank the speaker. Rubbing his hands together in gleeful anticipation of a blow against Neyman by Fisher, he declares: “I am very glad Professor Fisher is present, as it is his work that Dr Neyman has accepted and incorporated. . . I am not at all sure that the ‘confidence’ is not a confidence trick” (p.132). Bowley was to be disappointed. When it was Fisher’s turn, he was full of praise. “Dr Neyman . . . claimed to have generalized the argument of fiducial probability, and he had every reason to be proud of the line of argument he had developed for its perfect clarity” (Fisher 1934c, p.138). Caveats were to come later (Section 5.7). For now, Egon was relieved:

Fisher had on the whole approved of what Neyman had said. If the impetuous Pole had not been able to make peace between the second and third floors of University College, he had managed at least to maintain a friendly foot on each! (C. Reid 1998, p. 119)

**CIs, Tests, and Severity**. I’m always mystified when people say they find P-values utterly perplexing while they regularly consume polling results in terms of confidence limits. You could substitute one for the other. (SIST p. 190)

…

Not only is there a duality between confidence interval estimation and tests, they were developed by Jerzy Neyman at the same time he was developing tests! The 1934 paper in the opening to this tour builds on Fisher’s fiducial intervals dated in 1930, but Neyman had been lecturing on it in Warsaw for a few years already. Providing upper and lower confidence limits shows the range of plausible values for the parameter and avoids an “up/down” dichotomous tendency of some users of tests. Yet, for some reason, CIs are still often used in a dichotomous manner: rejecting μ values excluded from the interval, accepting (as plausible or the like) those included. There’s the tendency, as well, to fix the confidence level at a single 1 − α, usually 0.9, 0.95, or 0.99. Finally, there’s the adherence to a performance rationale: the estimation method will cover the true θ 95% of the time in a series of uses. We will want a much more nuanced, inferential construal of CIs. We take some first steps toward remedying these shortcomings by relating confidence limits to tests and to severity.
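To make the duality concrete, here is a minimal sketch, assuming a Normal model with known σ and a one-sided test of *H*_{0}: μ ≤ μ_{0} vs. *H*_{1}: μ > μ_{0}. The function names and the numbers (x̄ = 152, σ = 10, n = 100, α = 0.025) are illustrative assumptions, not taken from the book. It checks that the values of μ_{0} the test would reject are exactly those falling below the one-sided (1 − α) lower confidence bound.

```python
from math import sqrt
from statistics import NormalDist

Z = NormalDist()  # standard Normal distribution


def lower_confidence_bound(xbar, sigma, n, alpha):
    """One-sided (1 - alpha) lower confidence bound for mu (sigma known)."""
    se = sigma / sqrt(n)
    return xbar - Z.inv_cdf(1 - alpha) * se


def rejects(xbar, mu0, sigma, n, alpha):
    """One-sided test of H0: mu <= mu0 vs. H1: mu > mu0 at level alpha."""
    se = sigma / sqrt(n)
    p_value = 1 - Z.cdf((xbar - mu0) / se)
    return p_value < alpha


# Illustrative values only (assumed for this sketch).
xbar, sigma, n, alpha = 152.0, 10.0, 100, 0.025
bound = lower_confidence_bound(xbar, sigma, n, alpha)

for mu0 in (149.0, 150.0, 150.5, 151.0):
    # The two columns agree: mu0 is rejected exactly when it lies below the bound.
    print(mu0, rejects(xbar, mu0, sigma, n, alpha), mu0 < bound)
```

Reading the bound this way gives only the dichotomous use of a CI; the severity construal goes further by considering several bounds rather than a single fixed level.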

(A) If you don’t have the book, you can find many discussions in this blog on confidence intervals (search CIs). Of most relevance to Section 3.7 is a post Duality: Confidence intervals and the severity of tests. Another is Do CIs Avoid Fallacies of Tests? Reforming the Reformers.

……

**Live Exhibit (ix). What Should We Say When Severity Is Not Calculable? **(SIST p. 200)

In developing a system like severity, at times a conventional decision must be made. However, the reader can choose a different path and still work within this system.

What if the test or interval estimation procedure does not pass the audit? Consider for the moment that there has been optional stopping, or cherry picking, or multiple testing. Where these selection effects are well understood, we may adjust the error probabilities so that they do pass the audit. But what if the moves are so tortuous that we can’t reliably make the adjustment? Or perhaps we don’t feel secure enough in the assumptions? Should the severity for μ > µ_{0} be low or undefined?

You are free to choose either. The severe tester says SEV(μ > µ_{0}) is low. As she sees it, having evidence requires a minimum threshold for severity, even without setting a precise number. If it’s close to 0.5, it’s quite awful. But if it cannot be computed, it’s also awful, since the onus on the researcher is to satisfy the minimal requirement for evidence. I’ll follow her: If we cannot compute the severity even approximately (which is all we care about), I’ll say it’s low, along with an explanation as to why: It’s low because we don’t have a clue how to compute it!

A probabilist, working with a single “probability pie” as it were, would take a low probability for H as giving a high probability to ~H. By contrast we wish to clearly distinguish between having poor evidence for H and having good evidence for ~H. Our way of dealing with bad evidence, no test (BENT) allows us to do that. Both SEV(H) and SEV(~H) can be low enough to be considered lousy, even when both are computable.

…

**3.8 The Probability Our Results Are Statistical Fluctuations: the Higgs Discovery **(SIST p. 202)

**[B] **Elements of Section 3.8, in early formulations, may be found in the several posts on the Higgs discovery on this blog. One with links to several parts is Higgs Discovery three years on (Higgs analysis and statistical flukes). Even if you have the book, you might find the valuable comments by readers (made to the original posts) worth checking out.

**Excursion 3 Tour II: It’s The Methods, Stupid**

Tour II disentangles a jungle of conceptual issues at the heart of today’s statistics wars. The first stop **(3.4)** unearths the basis for a number of howlers and chestnuts thought to be licensed by Fisherian or N-P tests.* In each exhibit, we study the basis for the joke. Together, they show: the need for an adequate test statistic, the difference between implicationary (i-assumptions) and actual assumptions, and the fact that tail areas serve to raise, and not lower, the bar for rejecting a null hypothesis. (Additional howlers occur in Excursion 3 Tour III.)

*recommended: medium to heavy shovel *

Stop **(3.5)** pulls back the curtain on the view that Fisher and N-P tests form an incompatible hybrid. Incompatibilist tribes retain caricatures of F & N-P tests, and rob each of notions they need (e.g., power and alternatives for F, P-values & post-data error probabilities for N-P). Those who allege that Fisherian P-values are not error probabilities often mean simply that Fisher wanted an evidential not a performance interpretation. This is a philosophical not a mathematical claim. N-P and Fisher tended to use P-values in both ways. It’s time to get beyond incompatibilism. “Even if we couldn’t point to quotes and applications that break out of the strict ‘evidential versus behavioral’ split, we should be the ones to interpret the methods for inference, and supply the statistical philosophy that directs their right use.” (p. 181)

*strongly recommended: light to medium shovel, thick-skinned jacket*

In **(3.6)** we slip into the jungle. Critics argue that P-values are for evidence, unlike error probabilities, but then aver P-values aren’t good measures of evidence either, since they disagree with probabilist measures: likelihood ratios, Bayes Factors or posteriors. A famous peace-treaty between Fisher, Jeffreys & Neyman promises a unification. A bit of magic ensues! The meaning of error probability changes into a type of Bayesian posterior probability. It’s then possible to say ordinary frequentist error probabilities (e.g., type I & II error probabilities) aren’t error probabilities. We get beyond this marshy swamp by introducing subscripts 1 and 2. Whatever you think of the two concepts, they are very different. This recognition suffices to get you out of quicksand.

*required: easily removed shoes, stiff walking stick (review Souvenir M on day of departure)*

*Several of these may be found by searching for “Saturday night comedy” on this blog. In SIST, however, I trace out the basis for the jokes.

**selected key terms and ideas **

Howlers and chestnuts of statistical tests

armchair science

Jeffreys tail area criticism

Limb sawing logic

Two machines with different precisions

Weak conditionality principle (WCP)

Conditioning (see WCP)

Likelihood principle

Long run performance vs probabilism

Alphas and p’s

Fisher as behaviorist

Hypothetical long-runs

Freudian metaphor for significance tests

Pearson, on cases where there’s no repetition

Armour-piercing naval shell

Error probability_{1} and error probability_{2}

Incompatibilist philosophy (F and N-P must remain separate)

Test statistic requirements (p. 159)

**Please send me your list of key terms in the comments; typos would also be appreciated**

These are Tour Guide Mementos from Excursion 3 Tour II of Mayo (2018, CUP): Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars.

To see an excerpt from Excursion 3 Tour II (and “where you are” in the journey), see my last post.

For all excerpts and mementos (on this blog) from SIST (to Nov.30), see this post.

There is perhaps in current literature a tendency to speak of the Neyman–Pearson contributions as some static system, rather than as part of the historical process of development of thought on statistical theory which is and will always go on. (Pearson 1962, 276)

This goes for Fisherian contributions as well. Unlike museums, we won’t remain static. The lesson from Tour I of this Excursion is that Fisherian and Neyman–Pearsonian tests may be seen as offering clusters of methods appropriate for different contexts within the large taxonomy of statistical inquiries. There is an overarching pattern:

Just as with the use of measuring instruments, applied to the specific case, we employ the performance features to make inferences about aspects of the particular thing that is measured, aspects that the measuring tool is appropriately capable of revealing. (Mayo and Cox 2006, p. 84)

This information is used to ascertain what claims have, and have not, passed severely, post-data. Any such proposed inferential use of error probabilities gives considerable fodder for criticism from various tribes of Fisherians, Neyman–Pearsonians, and Bayesians. We can hear them now:

…

How can we reply? To begin, we need to uncover how the charges originate in traditional philosophies long associated with error statistical tools. That’s the focus of Tour II.

Only then do we have a shot at decoupling traditional philosophies from those tools in order to use them appropriately today. This is especially so when the traditional foundations stand on such wobbly grounds, grounds largely rejected by founders of the tools. There is a philosophical disagreement between Fisher and Neyman, but it differs importantly from the ones that you’re presented with and which are widely accepted and repeated in scholarly and popular treatises on significance tests. Neo-Fisherians and N-P theorists, keeping to their tribes, forfeit notions that would improve their methods (e.g., for Fisherians: explicit alternatives, with corresponding notions of sensitivity, and distinguishing statistical and substantive hypotheses; for N-P theorists, making error probabilities relevant for inference in the case at hand).

The spadework on this tour will be almost entirely conceptual: we won’t be arguing for or against any one view. We begin in Section 3.4 by unearthing the basis for some classic counterintuitive inferences thought to be licensed by either Fisherian or N-P tests. That many are humorous doesn’t mean disentangling their puzzles is straightforward; a medium to heavy shovel is recommended. We can switch to a light to medium shovel in Section 3.5: excavations of the evidential versus behavioral divide between Fisher and N-P turn out to be mostly built on sand. As David Cox observes, Fisher is often more performance-oriented in practice, but not in theory, while the reverse is true for Neyman and Pearson. At times, Neyman exaggerates the behavioristic conception just to accentuate how much Fisher’s tests need reining in. Likewise, Fisher can be spotted running away from his earlier behavioristic positions just to derogate the new N-P movement, whose popularity threatened to eclipse the statistics program that was, after all, his baby. Taking the polemics of Fisher and Neyman at face value, many are unaware how much they are based on personality and professional disputes. Hearing the actual voices of Fisher, Neyman, and Pearson (F and N-P), you don’t have to accept the gospel of “what the founders really thought.” Still, there’s an entrenched history and philosophy of F and N-P: A thick-skinned jacket is recommended. On our third stop (Section 3.6) we witness a bit of magic. The very concept of an error probability gets redefined and, hey presto!, a reconciliation between Jeffreys, Fisher, and Neyman is forged. Wear easily removed shoes and take a stiff walking stick. The Unificationist tribes tend to live near underground springs and lakeshore bounds; in the heady magic, visitors have been known to accidentally fall into a pool of quicksand.

**3.4 Some Howlers and Chestnuts of Statistical Tests**

The well-known definition of a statistician as someone whose aim in life is to be wrong in exactly 5 per cent of everything they do misses its target. (Sir David Cox 2006a, p. 197)

Showing that a method’s stipulations could countenance absurd or counterintuitive results is a perfectly legitimate mode of criticism. I reserve the term “howler” for common criticisms based on logical fallacies or conceptual misunderstandings. Other cases are better seen as chestnuts – puzzles that the founders of statistical tests never cleared up explicitly. Whether you choose to see my “howler” as a “chestnut” is up to you. Under each exhibit is the purported basis for the joke……

TO KEEP READING, SEE Mayo (2018, CUP): Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars.

As you enjoy the weekend discussion & concert in the Captain’s Central Limit Library & Lounge, your Tour Guide has prepared a brief overview of Excursion 3 Tour I, and a short (semi-severe) quiz on severity, based on exhibit (i).*

We move from Popper through a gallery on “Data Analysis in the 1919 Eclipse tests of the General Theory of Relativity (GTR)” (3.1) which leads to the main gallery on the origin of statistical tests (3.2) by way of a look at where the main members of our statistical cast are in 1919: Fisher, Neyman and Pearson. From the GTR episode, we identify the key elements of a statistical test–the steps in E.S. Pearson’s opening description of tests in 3.2. The classical testing notions–type I and II errors, power, consistent tests–are shown to grow out of requiring probative tests. The typical (behavioristic) formulation of N-P tests came later. The severe tester breaks out of the behavioristic prison. A first look at the severity construal of N-P tests is in Exhibit (i). Viewing statistical inference as severe testing shows how to do all that N-P tests do (and more) while remaining a member of the Fisherian Tribe (3.3). We consider the frequentist principle of evidence FEV and the divergent interpretations that are called for by Cox’s taxonomy of null hypotheses. The last member of the taxonomy–substantively based null hypotheses–returns us to the opening episode of GTR.

*key terms (incomplete; please send me yours)*

GTR, eclipse test, ether effect, corona effect, PPN framework, statistical test ingredients, Anglo-Polish collaboration, Lambda criterion; Type I error, Type II error, power, P-value, unbiased tests, consistent tests, uniformly most powerful (UMP); severity interpretation of tests, severity function, water plant accident; sufficient statistic; frequentist principle of evidence FEV; sensitivity achieved [same as attained power (att power)], Cox’s taxonomy (embedded, nested, dividing, testing assumptions), Nordtvedt effect, equivalence principle (strong and weak)

**Semi-Severe Severity Quiz, based on the example in Exhibit (i) of Excursion 3**

1. Keeping to Test T+ with *H*_{0}: μ ≤ 150 vs. *H*_{1}: μ > 150, σ = 10, and *n* = 100, observed *x̄* = 152 (i.e., d = 2), find the severity associated with μ > 150.5.

i.e., SEV_{100}(μ > 150.5) = ________

2. Compute 3 or more of the severity assessments for Table 3.2, with *x̄* = 153.

3. **Comparing *n* = 100 with *n* = 10,000**: Keeping to Test T+ with *H*_{0}: μ ≤ 150 vs. *H*_{1}: μ > 150, σ = 10, change the sample size so that *n* = 10,000.

The 2SE rejection rule would now be: reject (i.e., “infer evidence against *H*_{0}”) whenever *X̄* > _____.

Assume *x̄* just reaches this 2SE cut-off. (added previous sentence, Dec 10, I thought it was clear.) What’s the severity associated with inferring μ > 150.5 now?

i.e., SEV_{10,000}(μ > 150.5) = ____

Compare with SEV_{100}(μ > 150.5).

4. NEW. I realized I needed to include a “negative” result. Assume *x̄* = 151.5. Keeping to the same test with *n* = 100, find SEV_{100}(μ ≤ 152).

5. If you’re following the original schedule, you’ll have read Tour II of Excursion 3, so here’s an easy question: Why does Souvenir M tell you to “relax”?

6. **Extra Credit**: supply some key terms from this Tour that I left out in the above list.
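If you would like to check your quiz answers numerically, the calculations for Test T+ can be sketched as follows. This is a minimal sketch, assuming a Normal model with known σ, so that SEV(μ > μ_{1}) = Pr(X̄ ≤ x̄; μ = μ_{1}) = Φ((x̄ − μ_{1})/SE) with SE = σ/√n, and SEV(μ ≤ μ_{1}) is its complement. The function names are hypothetical, and the example values are deliberately not the ones the quiz asks for.

```python
from math import sqrt
from statistics import NormalDist

PHI = NormalDist().cdf  # standard Normal CDF


def standard_error(sigma, n):
    """Standard error of the sample mean, sigma known."""
    return sigma / sqrt(n)


def sev_greater(xbar, mu1, sigma, n):
    """SEV(mu > mu1) for Test T+: Pr(Xbar <= xbar; mu = mu1)."""
    return PHI((xbar - mu1) / standard_error(sigma, n))


def sev_at_most(xbar, mu1, sigma, n):
    """SEV(mu <= mu1) for Test T+: Pr(Xbar > xbar; mu = mu1)."""
    return 1 - sev_greater(xbar, mu1, sigma, n)


def two_se_cutoff(mu0, sigma, n):
    """Sample mean above which the 2SE rule rejects H0: mu <= mu0."""
    return mu0 + 2 * standard_error(sigma, n)


# Illustrative values (not the quiz's): observed mean 152, sigma = 10, n = 100.
for mu1 in (151, 152, 153):
    print(mu1, round(sev_greater(152, mu1, 10, 100), 3))
# prints: 151 0.841, then 152 0.5, then 153 0.159

# The 2SE cut-off shrinks toward mu0 as n grows, e.g. with n = 400:
print(two_se_cutoff(150, 10, 400))  # 151.0
```

Swapping in the quiz values for `xbar`, `mu1`, and `n` gives each of the requested severities; for question 3, pass the output of `two_se_cutoff` as the observed mean.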

*The reference is to Mayo (2018, CUP): Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars.
