Judea Pearl* wrote to me to invite readers of Error Statistics Philosophy to comment on a recent post of his (from his Causal Analysis blog here) pertaining to a guest post by Stephen Senn (“Being a Statistician Means Never Having to Say You Are Certain”). He has added a special addendum for us.[i]
Challenging the Hegemony of Randomized Controlled Trials: Comments on Deaton and Cartwright
Judea Pearl
I was asked to comment on a recent article by Angus Deaton and Nancy Cartwright (D&C), which touches on the foundations of causal inference. The article is titled: “Understanding and misunderstanding randomized controlled trials,” and can be viewed here: https://goo.gl/x6s4Uy
My comments are a mixture of a welcome and a puzzle; I welcome D&C’s stand on the status of randomized trials, and I am puzzled by how they choose to articulate the alternatives.
D&C’s main theme is as follows: “We argue that any special status for RCTs is unwarranted. Which method is most likely to yield a good causal inference depends on what we are trying to discover as well as on what is already known.” (Quoted from their introduction)
As a veteran challenger of the supremacy of the RCT, I welcome D&C’s challenge wholeheartedly. Indeed, “The Book of Why” (forthcoming, May 2018, http://bayes.cs.ucla.edu/WHY/) quotes me as saying:
If our conception of causal effects had anything to do with randomized experiments, the latter would have been invented 500 years before Fisher.
In this, as well as in my other writings, I go so far as claiming that the RCT earns its legitimacy by mimicking the do-operator, not the other way around. In addition, considering the practical difficulties of conducting an ideal RCT, observational studies have a definite advantage: they interrogate populations in their natural habitats, not in artificial environments choreographed by experimental protocols.
Deaton and Cartwright’s challenge of the supremacy of the RCT consists of two parts:
- The first (internal validity) deals with the curse of dimensionality and argues that, in any single trial, the outcome of the RCT can be quite distant from the target causal quantity, which is usually the average treatment effect (ATE). In other words, this part concerns imbalance due to finite samples, and reflects the traditional bias-precision tradeoff in statistical analysis and machine learning.
- The second part (external validity) deals with biases created by inevitable disparities between the conditions and populations under study versus those prevailing in the actual implementation of the treatment program or policy. Here, Deaton and Cartwright propose alternatives to the RCT, calling for the integration of a web of multiple information sources (observational, experimental, quasi-experimental, and theoretical), all collaborating towards the goal of estimating “what we are trying to discover”.
My only qualm with D&C’s proposal is that, in their passion to advocate the integration strategy, they have failed to notice that, in the past decade, a formal theory of integration strategies has emerged from the brewery of causal inference and is currently ready and available for empirical researchers to use. I am referring of course to the theory of Data Fusion which formalizes the integration scheme in the language of causal diagrams, and provides theoretical guarantees of feasibility and performance. (see http://www.pnas.org/content/pnas/113/27/7345.full.pdf )
Let us examine closely D&C’s main motto: “Which method is most likely to yield a good causal inference depends on what we are trying to discover as well as on what is already known.” Clearly, to cast this advice in practical settings, we must devise notation, vocabulary, and logic to represent “what we are trying to discover” as well as “what is already known,” so that we can infer the former from the latter. To accomplish this nontrivial task we need tools, theorems, and algorithms to assure us that what we conclude from our integrated study indeed follows from those precious pieces of knowledge that are “already known.” D&C are notably silent about the language and methodology in which their proposal should be carried out. One is left wondering, therefore, whether they intend their proposal to remain an informal, heuristic guideline, similar to Bradford Hill’s criteria of the 1960s, or to be explicated in some theoretical framework that can distinguish valid from invalid inference. If they aspire to embed their integration scheme within a coherent framework, then they should celebrate: such a framework has been worked out and is now fully developed.
To be more specific, the Data Fusion theory described in http://www.pnas.org/content/pnas/113/27/7345.full.pdf provides us with notation to characterize the nature of each data source, the nature of the population interrogated, whether the source is an observational or experimental study, which variables are randomized and which are measured and, finally, the theory tells us how to fuse all these sources together to synthesize an estimand of the target causal quantity at the target population. Moreover, if we feel uncomfortable about the assumed structure of any given data source, the theory tells us whether an alternative source can furnish the needed information and whether we can weaken any of the model’s assumptions.[i]
You can read the rest of Pearl’s original article here.
…..
Addendum to ” Challenging the Hegemony of RCTs”
March 11, 2018
—————–
Upon re-reading the post above I realized that I have assumed readers to be familiar with Data Fusion theory. This Addendum aims at readers who are not familiar with the theory, and who would probably be asking: “Who needs a new theory to do what statistics does so well?” “Once we recognize the importance of diverse sources of data, statistics can be helpful in making decisions and quantifying uncertainty.” [Quoted from Andrew Gelman’s blog]. The reason I question the sufficiency of statistics to manage the integration of diverse sources of data is that statistics lacks the vocabulary needed for the job. Let us demonstrate it in a couple of toy examples, taken from BP-2015.

Example 1
————–
Suppose we wish to estimate the average causal effect of X on Y, and we have two diverse sources of data:

(1) an RCT in which Z, not X, is randomized, and
(2) an observational study in which X, Y, and Z are measured.

What substantive assumptions are needed to facilitate a solution to our problem? Put another way, how can we be sure that, once we make those assumptions, we can solve our problem?
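The actual solution depends on the assumed causal diagram and is worked out in the fusion papers; but to give a feel for why combining the two sources can succeed where either alone fails, here is a toy simulation of my own construction (not BP-2015’s worked example), under one assumed linear structure in which the randomized Z shifts X while an unobserved U confounds X and Y:

```python
import numpy as np

rng = np.random.default_rng(0)
n, beta = 200_000, 2.0          # beta: true average causal effect of X on Y

# Assumed (hypothetical) linear structure, not Pearl's actual model:
U = rng.normal(size=n)          # unobserved confounder of X and Y
Z = rng.binomial(1, 0.5, n)     # randomized in source (1)
X = 1.5 * Z + U + rng.normal(size=n)
Y = beta * X + 3.0 * U + rng.normal(size=n)

# Observational regression of Y on X: biased by the confounder U
naive = np.cov(X, Y)[0, 1] / np.var(X, ddof=1)

# Instrumental-variable (Wald) ratio, licensed by the randomization of Z
iv = np.cov(Z, Y)[0, 1] / np.cov(Z, X)[0, 1]

print(f"true ACE = {beta:.2f}, naive OLS = {naive:.2f}, IV = {iv:.2f}")
```

Under this assumed structure the naive estimate is biased upward by U, while the instrumental-variable ratio recovers the true effect; whether and when such a combination is licensed in general is exactly what the fusion theory characterizes.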
Example 2
————-
Suppose we wish to estimate the average causal effect (ACE) of X on Y, and we have two diverse sources of data:

(1) an RCT in which the effect of X on both Y and Z is measured, but the recruited subjects had non-typical values of Z, and
(2) an observational study conducted in the target population, in which both X and Z (but not Y) were measured.

What substantive assumptions would enable us to estimate the ACE, and how should we combine data from the two studies so as to synthesize a consistent estimate of it?
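Again with the caveat that this is my own illustrative setup rather than the paper’s worked solution: if Z is assumed to be the only factor on which trial and target populations differ, the RCT’s Z-specific effects can be reweighted by the target distribution of Z estimated from the observational study, the simplest instance of a transport formula:

```python
import numpy as np

rng = np.random.default_rng(1)

def outcome(x, z, noise):
    # hypothetical structural equation: the effect of X is 1 + 2*Z
    return 1.0 * x + 2.0 * x * z + 0.5 * z + noise

n = 100_000
z_trial = rng.binomial(1, 0.7, n)        # trial over-recruits Z = 1 subjects
x_trial = rng.binomial(1, 0.5, n)        # treatment randomized in the RCT
y_trial = outcome(x_trial, z_trial, rng.normal(size=n))

# Source (2): the observational study in the target population gives P*(z)
p_star = {0: 0.8, 1: 0.2}

# Z-specific treatment effects, estimable from the RCT
ace_z = {z: y_trial[(x_trial == 1) & (z_trial == z)].mean()
            - y_trial[(x_trial == 0) & (z_trial == z)].mean()
         for z in (0, 1)}

# Raw trial contrast vs. the transported (reweighted) estimate
ace_trial = y_trial[x_trial == 1].mean() - y_trial[x_trial == 0].mean()
ace_target = sum(ace_z[z] * p_star[z] for z in (0, 1))

print(f"raw trial ACE = {ace_trial:.2f} (true target ACE under this model = 1.40)")
print(f"transported ACE = {ace_target:.2f}")
```

The raw trial contrast reflects the trial’s non-typical Z distribution; the reweighted estimate recovers the target-population effect, but only because of the substantive (causal, not distributional) assumption about the role of Z.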
The nice thing about a toy example is that the solution is known to us in advance, and so we can check any alternative solution for correctness. Curious readers can find the solutions for these two examples in http://ftp.cs.ucla.edu/pub/stat_ser/r450-reprint.pdf. More ambitious readers will probably try to solve them using statistical techniques, such as meta-analysis or partial pooling. The reason I am confident that the second group will end up with disappointment comes from a profound statement made by Nancy Cartwright in 1989: “No Causes In, No Causes Out”. It means not only that you need substantive assumptions to derive causal conclusions; it also means that the vocabulary of statistical analysis, since it is built entirely on properties of distribution functions, is inadequate for expressing those substantive assumptions that are needed for getting causal conclusions.
In our examples, although part of the data is provided by an RCT, hence it is causal, one can still show that the needed assumptions must invoke causal vocabulary; distributional assumptions are insufficient. As someone versed in both graphical modeling and counterfactuals, I would go even further and state that it would be a miracle if anyone succeeds in translating the needed assumptions into a comprehensible language other than causal diagrams. (See http://ftp.cs.ucla.edu/pub/stat_ser/r452-reprint.pdf Appendix, Scenario 3.)
Armed with these examples and findings, we can go back and examine why D&C do not embrace the Data Fusion methodology in their quest for integrating diverse sources of data. The answer, I conjecture, is that D&C were not intimately familiar with what this methodology offers and how vastly different it is from previous attempts to operationalize Cartwright’s dictum: “No causes in, no causes out”.
[i] Pearl’s blog post, originally posted here, ends with the following; I hope that readers take him up on his invitation:
I would be very interested in seeing other readers’ reactions to D&C’s article, as well as to my optimistic assessment of what causal inference can do for us in this day and age. I have read the reactions of Andrew Gelman (on his blog) and Stephen J. Senn (on Deborah Mayo’s blog https://errorstatistics.com/2018/01/), but they seem to be unaware of the latest developments in Data Fusion analysis. I also invite Angus Deaton and Nancy Cartwright to share a comment or two on these issues. I hope they respond positively.
* Chancellor’s Professor of Computer Science and Statistics,
Director, Cognitive Systems Laboratory
University of California Los Angeles,
http://www.cs.ucla.edu/~judea/
http://bayes.cs.ucla.edu/csl_papers.html
MONTHLY MEMORY LANE: 3 years ago: March 2015. I mark in red 3-4 posts from each month that seem most apt for general background on key issues in this blog, excluding those reblogged recently[1], and in green up to 3 others of general relevance to philosophy of statistics (in months where I’ve blogged a lot)[2]. Posts that are part of a “unit” or a group count as one.
March 2015
[1] Monthly memory lanes began at the blog’s 3-year anniversary in September 2014.
[2] New Rule, July 30, 2016, and March 30, 2017 – a very convenient way to allow data-dependent choices (note why it’s legit in selecting blog posts, on severity grounds).
It should be out in July 2018. The “Itinerary”, generally known as the Table of Contents, is below. I forgot to mention that this is not the actual pagination; I don’t have the page proofs yet. These are the pages of the draft I submitted. It should be around 50 pages shorter in the actual page proofs, maybe 380 pages.
Excursion 1: How to Tell What’s True about Statistical Inference | 1
Tour I: Beyond Probabilism and Performance | 1
1.1 Severity Requirement: Bad Evidence, No Test (BENT) | 3
1.2 Probabilism, Performance and Probativeness | 11
1.3 The Current State of Play in Statistical Foundations: A view from a hot air balloon | 22
Tour II: Error Probing Tools vs. Logics of Evidence | 29
1.4 The Law of Likelihood and the Likelihood Principle | 29
1.5 Trying and Trying Again: The Likelihood Principle | 41
Excursion 2: Taboos of Induction and Falsification | 56
Tour I: Induction and Confirmation | 56
2.1 The Traditional Problem of Induction | 56
2.2 Is Probability a Good Measure of Confirmation? | 63
Tour II: Falsification, Pseudoscience, Induction | 72
2.3 Popper, Severity and Methodological Probability | 72
2.4 Novelty and Severity | 87
2.5 Fallacies of Rejection and an Animal Called NHST | 90
2.6 The Reproducibility Revolution (Crisis) in Psychology | 94
2.7 How to Solve the Problem of Induction Now | 105
Excursion 3: Statistical Tests and Scientific Inference | 113
Tour I: Ingenious and Severe Tests | 113
3.1 Statistical Inference and Sexy Science: The 1919 Eclipse Test | 115
3.2 N-P Tests: An Episode in Anglo-Polish Collaboration | 125
3.3 How to Do All N-P Tests Do (and more) While a Member of the Fisherian Tribe | 139
Tour II: It’s The Methods, Stupid | 156
3.4 Some Howlers and Chestnuts of Statistical Tests | 157
3.5 P-Values Aren’t Error Probabilities Because Fisher Rejected Neyman’s Performance Philosophy | 166
3.6 Hocus-pocus: P-values Are Not Error Probabilities, Are Not Even Frequentist! | 175
Tour III: Capability and Severity: Deeper Concepts | 181
3.7 Severity, Capability and Confidence Intervals (CIs) | 181
3.8 The Probability our Results are Statistical Fluctuations: Higgs’ Discovery | 194
Excursion 4: Objectivity and Auditing | 211
Tour I: The Myth of “The Myth of Objectivity” | 211
4.1 Dirty Hands: Statistical Inference is Sullied with Discretionary Choices | 212
4.2 Embrace Your Subjectivity | 218
Tour II: Rejection Fallacies: Who’s Exaggerating What? | 230
4.3 Significant Results with Overly Sensitive Tests: Large n Problem | 231
4.4 Do P-Values Exaggerate the Evidence? | 237
4.5 Who’s Exaggerating? How to Evaluate Reforms Based on Bayes Factor Standards | 251
Tour III: Auditing: Biasing Selection Effects & Randomization | 258
4.6 Error Control is Necessary for Severity Control | 260
4.7 Randomization | 278
Tour IV: More Auditing: Objectivity and Model Checking | 288
4.8 All Models are False | 288
4.9 For Model Checking, They Come Back to Significance Tests | 293
4.10 Bootstrap Resampling: My Sample is a Mirror of the Universe | 298
4.11 Misspecification (M-S) Testing in the Error Statistical Account | 300
Excursion 5: Power and Severity | 313
Tour I: Power: Pre-data and Post-data | 313
5.1 Power Howlers, Trade-offs and Benchmarks | 315
5.2 Cruise Severity Drill: How Tail Areas (Appear to) Exaggerate the Evidence | 322
5.3 Insignificant Results: Power Analysis and Severity | 328
5.4 Severity Interpretation of Tests: Severity Curves | 336
Tour II: How Not to Corrupt Power | 342
5.5 Power Taboos, Retrospective Power, and Shpower | 342
5.6 Positive Predictive Value: Fine for Luggage | 351
Tour III: Deconstructing the N-P vs. Fisher Debates | 361
5.7 Statistical Theatre: “Les Miserables Citations” | 361
5.8 Neyman’s Performance and Fisher’s Fiducial Probability | 372
Excursion 6: (Probabilist) Foundations Lost, (Probative) Foundations Found | 384
Tour I: What Ever Happened to Bayesian Foundations? | 384
6.1 Bayesian Ways: From Classical to Default | 386
6.2 What Are Bayesian Priors? A Gallimaufry | 391
6.3 Unification or Schizophrenia: Bayesian Family Feuds | 399
6.4 Is Bayes’ Rule Irrational? | 406
6.5 Can You Change Your Bayesian Prior? | 408
Tour II: Pragmatic and Error Statistical Bayesians | 415
6.6 Pragmatic Bayesians | 415
6.7 Error Statistical Bayesians: Falsificationist Bayesians | 423
Souvenir (Z) Farewell | 428
This continues my previous post, “Can’t take the fiducial out of Fisher…”, in recognition of Fisher’s birthday, February 17. I supply a few more intriguing articles you may find enlightening to read and/or reread on a Saturday night.
Move up 20 years to the famous 1955/56 exchange between Fisher and Neyman. Fisher clearly connects Neyman’s adoption of a behavioristic-performance formulation to his denying the soundness of fiducial inference. When “Neyman denies the existence of inductive reasoning, he is merely expressing a verbal preference. For him ‘reasoning’ means what ‘deductive reasoning’ means to others.” (Fisher 1955, p. 74).
Fisher was right that Neyman’s calling the outputs of statistical inferences “actions” merely expressed Neyman’s preferred way of talking. Nothing earth-shaking turns on the choice to dub every inference “an act of making an inference”.[i] The “rationality” or “merit” goes into the rule. Neyman, much like Popper, had a good reason for drawing a bright red line between his use of probability (for corroboration or probativeness) and its use by ‘probabilists’ (who assign probability to hypotheses). Fisher’s fiducial probability was in danger of blurring this very distinction. Popper said, and Neyman would have agreed, that he had no problem with our using the word induction so long as it was kept clear it meant testing hypotheses severely.
In Fisher’s next few sentences, things get very interesting. In reinforcing his choice of language, Fisher continues, Neyman “seems to claim that the statement (a) “μ has a probability of 5 per cent. of exceeding M” is a different statement from (b) “M has a probability of 5 per cent. of falling short of μ”. There’s no problem about equating these two so long as M is a random variable. But watch what happens in the next sentence. According to Fisher,
Neyman violates “the principles of deductive logic [by accepting a] statement such as

(1) Pr{(M – ts) < μ < (M + ts)} = α,

as rigorously demonstrated, and yet, when numerical values are available for the statistics M and s, so that on substitution of these and use of the 5 per cent. value of t, the statement would read

(2) Pr{92.99 < μ < 93.01} = 95 per cent.,

to deny to this numerical statement any validity. This evidently is to deny the syllogistic process of making a substitution in the major premise of terms which the minor premise establishes as equivalent” (Fisher 1955, p. 75).
But the move from (1) to (2) is fallacious! Could Fisher really be committing this fallacious probabilistic instantiation? I.J. Good (1971) describes how many felt, and often still feel:
…if we do not examine the fiducial argument carefully, it seems almost inconceivable that Fisher should have made the error which he did in fact make. It is because (i) it seemed so unlikely that a man of his stature should persist in the error, and (ii) because he modestly says ([1959], p. 54) his 1930 explanation ‘left a good deal to be desired’, that so many people assumed for so long that the argument was correct. They lacked the daring to question it.
In responding to Fisher, Neyman (1956, p.292) declares himself at his wit’s end in trying to find a way to convince Fisher of the inconsistencies in moving from (1) to (2).
When these explanations did not suffice to convince Sir Ronald of his mistake, I was tempted to give up. However, in a private conversation David Blackwell suggested that Fisher’s misapprehension may be cleared up by the examination of several simple examples. They illustrate the general rule that valid probability statements regarding relations involving random variables may cease and usually do cease to be valid if random variables are replaced by their observed particular values. (p. 292)[ii]
“Thus if X is a normal random variable with mean zero and an arbitrary variance greater than zero, we may agree” that Pr(X < 0) = .5. But observing, say, X = 1.7 yields Pr(1.7 < 0) = .5, which is clearly illicit. “It is doubtful whether the chaos and confusion now reigning in the field of fiducial argument were ever equaled in any other doctrine. The source of this confusion is the lack of realization that equation (1) does not imply (2)” (Neyman 1956).
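Neyman’s point, that the probability attaches to the procedure and not to any realized instance, is easy to check numerically. Here is a minimal sketch of my own (the parameter values are arbitrary, chosen only for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
mu, sigma, n, z = 10.0, 2.0, 25, 1.96   # arbitrary illustrative values

# Statement (1) is about the procedure: over repetitions of sampling,
# the interval M +/- z*sigma/sqrt(n) covers mu about 95% of the time.
M = rng.normal(mu, sigma / np.sqrt(n), size=100_000)
half = z * sigma / np.sqrt(n)
coverage = ((M - half < mu) & (mu < M + half)).mean()
print(f"long-run coverage: {coverage:.3f}")

# Statement (2) substitutes one realized M: the resulting fixed interval
# either contains mu or it does not, so no 95% probability attaches to it.
m0 = M[0]
print(f"realized interval ({m0 - half:.2f}, {m0 + half:.2f}) "
      f"contains mu: {m0 - half < mu < m0 + half}")
```

The long-run frequency comes out near .95, as (1) asserts; each substituted instance is simply true or false, which is the content of Neyman’s objection to (2).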
For decades scholars have tried to figure out what Fisher might have meant, and while the matter remains unsettled, this much is agreed: the instantiation that Fisher is yelling about, 20 years after the creation of N-P tests and the break with Neyman, is fallacious. Fiducial probabilities can only properly attach to the method. Keeping to “performance” language is a sure way to avoid the illicit slide from (1) to (2). Once the intimate tie-ins with Fisher’s fiducial argument are recognized, the rhetoric of the Neyman-Fisher dispute takes on a completely new meaning. When Fisher says “Neyman only cares for acceptance sampling contexts”, as he does after around 1950, he’s really saying Neyman thinks fiducial inference is contradictory unless it’s viewed in terms of properties of the method in (actual or hypothetical) repetitions. The fact that Neyman (with the contributions of Wald, and later Robbins) went overboard in his behaviorism,[iii] to the extent that even Egon wanted to divorce him (ending his 1955 reply to Fisher with the claim that inductive behavior was “Neyman’s field rather than mine”), is a different matter.
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[i] Fisher also commonly spoke of the output of tests as actions. Neyman rightly says that he is only following Fisher. As the years went by, Fisher comes to renounce things he himself had said earlier in the midst of polemics against Neyman.
[ii] But surely this is the kind of simple example that would have been brought forward right off the bat, before the more elaborate, infamous cases (Fisher-Behrens). Did Fisher ever say “oh now I see my mistake” as a result of these simple examples? Not to my knowledge. So I find this statement of Neyman’s about the private conversation with Blackwell a little curious. Anyone know more about it?
[iii] At least in his theory, but not in his practice. A relevant post is “distinguishing tests of statistical hypotheses and tests of significance might have been a lapse of someone’s pen“.
Fisher, R.A. (1955). “Statistical Methods and Scientific Induction”.
Good, I.J. (1971b), In reply to comments on his “The probabilistic explication of information, evidence, surprise, causality, explanation and utility”. In Godambe and Sprott (1971).
Neyman, J. (1956). “Note on an Article by Sir Ronald Fisher”.
Pearson, E.S. (1955). “Statistical Concepts in Their Relation to Reality”.
Continuing with posts in recognition of R.A. Fisher’s birthday, I post one from a couple of years ago on a topic that had previously not been discussed on this blog: Fisher’s fiducial probability.
[Neyman and Pearson] “began an influential collaboration initially designed primarily, it would seem, to clarify Fisher’s writing. This led to their theory of testing hypotheses and to Neyman’s development of confidence intervals, aiming to clarify Fisher’s idea of fiducial intervals” (D.R. Cox, 2006, p. 195).
The entire episode of fiducial probability is fraught with minefields. Many say it was Fisher’s biggest blunder; others suggest it still hasn’t been understood. The majority of discussions omit the side trip to the Fiducial Forest altogether, finding the surrounding brambles too thorny to penetrate. Besides, a fascinating narrative about the Fisher-Neyman-Pearson divide has managed to bloom and grow while steering clear of fiducial probability–never mind that it remained a centerpiece of Fisher’s statistical philosophy. I now think that this is a mistake. It was thought, following Lehmann (1993) and others, that we could take the fiducial out of Fisher and still understand the core of the Neyman-Pearson vs Fisher (or Neyman vs Fisher) disagreements. We can’t. Quite aside from the intrinsic interest in correcting the “he said/he said” of these statisticians, the issue is intimately bound up with the current (flawed) consensus view of frequentist error statistics.
So what’s fiducial inference? I follow Cox (2006), adapting for the case of the lower limit:
We take the simplest example,…the normal mean when the variance is known, but the considerations are fairly general. The lower limit, [with Z the standard Normal variate, and M the sample mean]:
M_{0} – z_{c} σ/√n
derived from the probability statement
Pr(μ > M – z_{c} σ/√n ) = 1 – c
is a particular instance of a hypothetical long run of statements a proportion 1 – c of which will be true, assuming the model is sound. We can, at least in principle, make such a statement for each c and thereby generate a collection of statements, sometimes called a confidence distribution. (Cox 2006, p. 66).
For Fisher it was a fiducial distribution. Once M_{0} is observed, M_{0} – z_{c} σ/√n is what Fisher calls the fiducial c per cent limit for μ. Making such statements for different c’s yields his fiducial distribution.
In Fisher’s earliest paper on fiducial inference, in 1930, he sets 1 – c at 95 per cent. Start from the significance test of μ (e.g., μ < μ_{0} vs. μ > μ_{0}) with significance level .05. He defines the 95 per cent value of the sample mean M, M_{.95}, such that in 95% of samples M < M_{.95}. In the Normal testing case, M_{.95} = μ_{0} + 1.65σ/√n. Notice that M_{.95} is the cut-off for rejection in a .05 one-sided test T+ (of μ < μ_{0} vs. μ > μ_{0}).
We have a relationship between the statistic [M] and the parameter μ such that M_{.95} is the 95 per cent value corresponding to a given μ. This relationship implies the perfectly objective fact that in 5 per cent of samples M > M_{.95}. (Fisher 1930, p. 533; I use μ for his θ, M in place of T).
That is, Pr(M < μ + 1.65σ/√n) = .95.
The event M > M_{.95} occurs just in case μ_{0} < M − 1.65σ/√n.[i]
For a particular observed M_{0} , M_{0} − 1.65σ/√n is the fiducial 5 per cent value of μ.
We may know as soon as M is calculated what is the fiducial 5 per cent value of μ, and that the true value of μ will be less than this value in just 5 per cent of trials. This then is a definite probability statement about the unknown parameter μ which is true irrespective of any assumption as to its a priori distribution. (Fisher 1930, p. 533; emphasis is mine).
This seductively suggests that μ < μ_{.05} gets the probability .05! But we know we cannot say that Pr(μ < μ_{.05}) = .05.[ii]
However, Fisher’s claim that we obtain “a definite probability statement about the unknown parameter μ” can be interpreted in another way. There’s a kosher probabilistic statement about the pivot Z; it’s just not a probabilistic assignment to a parameter. Instead, a particular substitution is, to paraphrase Cox, “a particular instance of a hypothetical long run of statements 95% of which will be true.” After all, Fisher was abundantly clear that the fiducial bound should not be regarded as an inverse inference to a posterior probability. We could only obtain an inverse inference, Fisher explains, by considering μ to have been selected from a superpopulation of μ’s with known distribution. But then the inverse inference (posterior probability) would be a deductive inference and not properly inductive. Here, Fisher is quite clear, the move is inductive.
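Fisher’s frequency claim, that the true μ falls below the fiducial 5 per cent value in just 5 per cent of trials, is itself a claim about a hypothetical long run and so can be checked by simulation. A sketch, assuming the Normal, known-σ setup of the text (parameter values are my own illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(3)
mu, sigma, n, trials = 0.0, 1.0, 100, 200_000   # illustrative values

# Each trial yields a sample mean M and Fisher's fiducial 5 per cent
# value M - 1.65*sigma/sqrt(n); count how often the true mu lies below it.
M = rng.normal(mu, sigma / np.sqrt(n), size=trials)
fiducial_05 = M - 1.65 * sigma / np.sqrt(n)
frac = (mu < fiducial_05).mean()
print(f"fraction of trials with mu below the fiducial 5% value: {frac:.3f}")
```

The long-run fraction comes out near .05, vindicating the aggregate-level reading; what the simulation cannot supply is a probability for any single substituted statement, which is precisely where the fallacy lurks.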
People are mistaken, Fisher says, when they try to find priors so that they would match the fiducial probability:
In reality the statements with which we are concerned differ materially in logical content from inverse probability statements, and it is to distinguish them from these that we speak of the distribution derived as a fiducial frequency distribution, and of the working limits, at any required level of significance, ….as the fiducial limits at this level. (Fisher 1936, p. 253).
So, what is being assigned the fiducial probability? It is, Fisher tells us, the “aggregate of all such statements…” Or, to put it another way, it’s the method of reaching claims to which the probability attaches. Because M and S (using Student’s T pivot) or M alone (where σ is assumed known) are sufficient statistics, “we may infer, without any use of probabilities a priori, a frequency distribution for μ which shall correspond with the aggregate of all such statements … to the effect that the probability that μ is less than M – 1.65σ/√n is .05.” (Fisher 1936, p. 253)[iii]
Suppose you’re Neyman and Pearson aiming to clarify and justify Fisher’s methods.
“I see what’s going on,” we can imagine Neyman declaring. There’s a method for outputting statements of the general form
μ >M – z_{c}σ/√n
Some would be in error, others not. The method outputs statements with a probability of 1 – c of being correct. The outputs are instances of the general form of statement, and the probability alludes to the relative frequencies with which they would be correct, as given by the chosen significance or fiducial level c. Voila! “We may look at the purpose of tests from another viewpoint,” as Neyman and Pearson (1933) put it. Probability qualifies (and controls) the performance of a method.
There is leeway here for different interpretations and justifications of that probability, from actual to hypothetical performance, and from behavioristic to more evidential–I’m keen to develop the latter. But my main point here is that in struggling to extricate Fisher’s fiducial limits, without slipping into fallacy, they are led to the N-P performance construal. Is there an efficient way to test hypotheses based on probabilities? ask Neyman and Pearson in the opening of the 1933 paper.
Without hoping to know whether each separate hypothesis is true or false, we may search for rules to govern our behavior with regard to them, in following which we insure that, in the long run of experience, we shall not be too often wrong (Neyman and Pearson 1933, pp. 141-2/290-1).
At the time, Neyman thought his development of confidence intervals (in 1930) was essentially the same as Fisher’s fiducial intervals. Fisher’s talk of assigning fiducial probability to a parameter, Neyman thought at first, was merely the result of accidental slips of language, altogether expected in explaining a new concept. There was evidence that Fisher accepted Neyman’s reading. When Neyman gave a paper in 1934 discussing confidence intervals, seeking to generalize fiducial limits, but making it clear that the term “confidence coefficient” is not synonymous with the term probability, Fisher didn’t object. In fact he bestowed high praise, saying Neyman “had every reason to be proud of the line of argument he had developed for its perfect clarity. The generalization was a wide and very handsome one,” the only problem being that there wasn’t a single unique confidence interval, as Fisher had wanted (for fiducial intervals).[iv] Slight hints of the two in a mutual admiration society are heard, with Fisher demurring that “Dr Neyman did him too much honor” in crediting him for the revolutionary insight of Student’s T pivot. Neyman responds that of course in calling it Student’s T he is crediting Student, but “this does not prevent me from recognizing and appreciating the work of Professor Fisher concerning the same distribution.” (Fisher’s comments on Neyman 1934, p. 137). For more on Neyman and Pearson being on Fisher’s side in these early years, see Spanos’s post.
So how does this relate to the current consensus view of Neyman-Pearson vs Fisher? Stay tuned.[v] In the meantime, share your views.
The next installment is here.
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[i] (μ < M – z_{c} σ/√n) iff M > M_{(1 – c)}, where M_{(1 – c)} = μ + z_{c} σ/√n.
[ii] In terms of the pivot Z, the inequality Z > z_{c} is equivalent to the inequality
μ < M –z_{c} σ/√n
“so that this last inequality must be satisfied with the same probability as the first.” But the fiducial value replaces M with M_{0} and then Fisher’s assertion
Pr(μ > M_{0} –z_{c} σ/√n ) = 1 – c
no longer holds. (Fallacy of probabilistic instantiation.) In this connection, see my previous post on confidence intervals in polling.
[iii] If we take a number of samples of size n from the same or from different populations, and for each calculate the fiducial 5 per cent value for μ, then in 5 per cent of cases the true value of μ will be less than the value we have found. There is no contradiction in the fact that this may differ from a posterior probability. “The fiducial probability is more general and, I think, more useful in practice, for in practice our samples will all give different values, and therefore both different fiducial distributions and different inverse probability distributions. Whereas, however, the fiducial values are expected to be different in every case, and our probability statements are relative to such variability, the inverse probability statement is absolute in form and really means something different for each different sample, unless the observed statistic actually happens to be exactly the same.” (Fisher 1930, p. 535)
[iv] Fisher restricts fiducial distributions to special cases where the statistics exhaust the information. He recognizes “The political principle that ‘Anything can be proved with statistics’ if you don’t make use of all the information. This is essential for fiducial inference” (1936, p. 255). There are other restrictions to the approach as he developed it; many have extended it. There are a number of contemporary movements to revive fiducial and confidence distributions. For references, see the discussants on my likelihood principle paper.
[v] For background, search Fisher on this blog. Some of the material here is from my forthcoming book, Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (CUP).
Cox, D. R. (2006), Principles of Statistical Inference. Cambridge.
Fisher, R.A. (1930), “Inverse Probability,” Mathematical Proceedings of the Cambridge Philosophical Society, 26(4): 528-535.
Fisher, R.A. (1936), “Uncertain Inference,” Proceedings of the American Academy of Arts and Sciences 71: 248-258.
Lehmann, E. (1993), “The Fisher, Neyman-Pearson Theories of Testing Hypotheses: One Theory or Two?” Journal of the American Statistical Association 88 (424): 1242–1249.
Neyman, J. (1934), “On the Two Different Aspects of the Representative Method: The Method of Stratified Sampling and the Method of Purposive Selection,” Early Statistical Papers of J. Neyman: 98-141. [Originally published (1934) in The Journal of the Royal Statistical Society 97(4): 558-625.]
“Statistical Methods and Scientific Induction”
by Sir Ronald Fisher (1955)
SUMMARY
The attempt to reinterpret the common tests of significance used in scientific research as though they constituted some kind of acceptance procedure and led to “decisions” in Wald’s sense, originated in several misapprehensions and has led, apparently, to several more.
The three phrases examined here, with a view to elucidating the fallacies they embody, are:
Mathematicians without personal contact with the Natural Sciences have often been misled by such phrases. The errors to which they lead are not only numerical.
To continue reading Fisher’s paper.
“Note on an Article by Sir Ronald Fisher“
by Jerzy Neyman (1956)
Summary
(1) FISHER’S allegation that, contrary to some passages in the introduction and on the cover of the book by Wald, this book does not really deal with experimental design is unfounded. In actual fact, the book is permeated with problems of experimentation. (2) Without consideration of hypotheses alternative to the one under test and without the study of probabilities of the two kinds, no purely probabilistic theory of tests is possible. (3) The conceptual fallacy of the notion of fiducial distribution rests upon the lack of recognition that valid probability statements about random variables usually cease to be valid if the random variables are replaced by their particular values. The notorious multitude of “paradoxes” of fiducial theory is a consequence of this oversight. (4) The idea of a “cost function for faulty judgments” appears to be due to Laplace, followed by Gauss.
“Statistical Concepts in Their Relation to Reality”.
by E.S. Pearson (1955)
Controversies in the field of mathematical statistics seem largely to have arisen because statisticians have been unable to agree upon how theory is to provide, in terms of probability statements, the numerical measures most helpful to those who have to draw conclusions from observational data. We are concerned here with the ways in which mathematical theory may be put, as it were, into gear with the common processes of rational thought, and there seems no reason to suppose that there is one best way in which this can be done. If, therefore, Sir Ronald Fisher recapitulates and enlarges on his views upon statistical methods and scientific induction we can all only be grateful, but when he takes this opportunity to criticize the work of others through misapprehension of their views as he has done in his recent contribution to this Journal (Fisher 1955, “Statistical Methods and Scientific Induction”), it is impossible to leave him altogether unanswered.
In the first place it seems unfortunate that much of Fisher’s criticism of Neyman and Pearson’s approach to the testing of statistical hypotheses should be built upon a “penetrating observation” ascribed to Professor G.A. Barnard, the assumption involved in which happens to be historically incorrect. There was no question of a difference in point of view having “originated” when Neyman “reinterpreted” Fisher’s early work on tests of significance “in terms of that technological and commercial apparatus which is known as an acceptance procedure”. There was no sudden descent upon British soil of Russian ideas regarding the function of science in relation to technology and to five-year plans. It was really much simpler–or worse. The original heresy, as we shall see, was a Pearson one!…
To continue reading, “Statistical Concepts in Their Relation to Reality” click HERE
In recognition of R.A. Fisher’s birthday on February 17….
‘R. A. Fisher: How an Outsider Revolutionized Statistics’
by Aris Spanos
Few statisticians will dispute that R. A. Fisher (February 17, 1890 – July 29, 1962) is the father of modern statistics; see Savage (1976), Rao (1992). Inspired by William Gosset’s (1908) paper on the Student’s t finite sampling distribution, he recast statistics into the modern model-based induction in a series of papers in the early 1920s. He put forward a theory of optimal estimation based on the method of maximum likelihood that has changed only marginally over the last century. His significance testing, spearheaded by the p-value, provided the basis for the Neyman-Pearson theory of optimal testing in the early 1930s. According to Hald (1998)
“Fisher was a genius who almost single-handedly created the foundations for modern statistical science, without detailed study of his predecessors. When young he was ignorant not only of the Continental contributions but even of contemporary publications in English.” (p. 738)
What is not so well known is that Fisher was the ultimate outsider when he brought about this change of paradigms in statistical science. As an undergraduate, he studied mathematics at Cambridge, and then did graduate work in statistical mechanics and quantum theory. His meager knowledge of statistics came from his study of astronomy; see Box (1978). That, however, did not stop him from publishing his first paper in statistics in 1912 (still an undergraduate) on “curve fitting”, questioning Karl Pearson’s method of moments and proposing a new method that was eventually to become the likelihood method in his 1921 paper.
After graduating from Cambridge he drifted into a series of jobs, including subsistence farming and teaching high school mathematics and physics, until his temporary appointment as a statistician at Rothamsted Experimental Station in 1919. During the period 1912-1919 his interest in statistics was driven by his passion for eugenics and a realization that his mathematical knowledge of n-dimensional geometry could be put to good use in deriving finite sample distributions for estimators and tests in the spirit of Gosset’s (1908) paper. Encouraged by his early correspondence with Gosset, he derived the finite sampling distribution of the sample correlation coefficient, which he published in 1915 in Biometrika, the only statistics journal at the time, edited by Karl Pearson. To put this result in proper context, Pearson had been working on this problem for two decades and had published more than a dozen papers with several assistants approximating the first two moments of the sample correlation coefficient; Fisher derived the relevant distribution, not just the first two moments.
Due to its importance, the 1915 paper provided Fisher’s first skirmish with the ‘statistical establishment’. Karl Pearson would not accept being overrun by a ‘newcomer’ lightly. So, he prepared a critical paper with four of his assistants that became known as “the cooperative study”, questioning Fisher’s result as stemming from a misuse of Bayes theorem. He proceeded to publish it in Biometrika in 1917 without bothering to let Fisher know before publication. Fisher was furious at K. Pearson’s move and prepared his answer in a highly polemical style, which Pearson promptly refused to publish in his journal. Eventually Fisher was able to publish his answer, after tempering the style, in Metron, a brand new statistics journal. As a result of this skirmish, Fisher pledged never to send another paper to Biometrika, and declared war against K. Pearson’s perspective on statistics. Fisher not only questioned his method of moments, as giving rise to inefficient estimators, but also his derivation of the degrees of freedom of his chi-square test. Several highly critical published papers ensued.[i]
Between 1922 and 1930 Fisher did most of his influential work in recasting statistics, including publishing a highly successful textbook in 1925, but the ‘statistical establishment’ kept him ‘in his place’: a statistician at an experimental station. All his attempts to find an academic position, including a position in Social Biology at the London School of Economics (LSE), were unsuccessful (see Box, 1978, p. 202). Being turned down for the LSE position was not unrelated to the fact that the professor of statistics at the LSE was Arthur Bowley (1869-1957), second only to Pearson in statistical high priesthood.[ii]
Coming of age as a statistician in 1920s England meant being awarded the Guy medal in gold, silver or bronze, or at least receiving an invitation to present your work to the Royal Statistical Society (RSS). Despite his fundamental contributions to the field, Fisher’s invitation to the RSS would not come until 1934. To put that in perspective, Jerzy Neyman, his junior by some distance, was invited six months earlier! Indeed, one can make a strong case that the statistical establishment kept Fisher away for as long as they could get away with it. However, by 1933 they must have felt that they had to invite Fisher after he accepted a professorship at University College, London. The position was created after Karl Pearson retired and the College decided to split his chair into a statistics position that went to Egon Pearson (Pearson’s son) and a Galton professorship in Eugenics that was offered to Fisher. To make it worse, Fisher’s offer came with a humiliating clause forbidding him to teach statistics at University College (see Box, 1978, p. 258); the father of modern statistics was explicitly told to keep his views on statistics to himself!
Fisher’s presentation to the Royal Statistical Society, on December 18th, 1934, entitled “The Logic of Inductive Inference”, was an attempt to summarize and explain his published work on recasting the problem of statistical induction since his classic 1922 paper. Bowley was (self?) appointed to move the traditional vote of thanks and open the discussion. After some begrudging thanks for Fisher’s ‘contributions to statistics in general’, he went on to disparage his new approach to statistical inference based on the likelihood function by describing it as abstruse, arbitrary and misleading. His comments were predominantly sarcastic and discourteous, and went as far as to accuse Fisher of plagiarism, by not acknowledging Edgeworth’s priority on the likelihood function idea (see Fisher, 1935, pp. 55-7). The litany of churlish comments continued with the rest of the old guard: Isserlis, Irwin and the philosopher Wolf (1935, pp. 57-64), who was brought in by Bowley to undermine Fisher’s philosophical discussion on induction. Jeffreys complained about Fisher’s criticisms of the Bayesian approach (1935, pp. 70-2).
To Fisher’s support came … Egon Pearson, Neyman and Bartlett. E. Pearson argued that:
“When these ideas [on statistical induction] were fully understood … it would be realized that statistical science owed a very great deal to the stimulus Professor Fisher had provided in many directions.” (Fisher, 1935, pp. 64-5)
Neyman too came to Fisher’s support, praising Fisher’s path-breaking contributions, and explaining Bowley’s reaction to Fisher’s critical review of the traditional view of statistics as an understandable attachment to old ideas (1935, p. 73).
Fisher, in his reply to Bowley and the old guard, was equally contemptuous:
“The acerbity, to use no stronger term, with which the customary vote of thanks has been moved and seconded … does not, I confess, surprise me. From the fact that thirteen years have elapsed between the publication, by the Royal Society, of my first rough outline of the developments, which are the subject of to-day’s discussion, and the occurrence of that discussion itself, it is a fair inference that some at least of the Society’s authorities on matters theoretical viewed these developments with disfavour, and admitted with reluctance. … However true it may be that Professor Bowley is left very much where he was, the quotations show at least that Dr. Neyman and myself have not been left in his company. … For the rest, I find that Professor Bowley is offended with me for “introducing misleading ideas”. He does not, however, find it necessary to demonstrate that any such idea is, in fact, misleading. It must be inferred that my real crime, in the eyes of his academic eminence, must be that of “introducing ideas”. (Fisher, 1935, pp. 76-82)[iii]
In summary, the pioneering work of Fisher, later supplemented by Egon Pearson and Neyman, was largely ignored by the Royal Statistical Society (RSS) establishment until the early 1930s. By 1933 it was difficult to ignore their contributions, published primarily in other journals, and the ‘establishment’ of the RSS decided to display its tolerance of their work by creating ‘the Industrial and Agricultural Research Section’, under the auspices of which the papers by Neyman and Fisher were presented in 1934 and 1935, respectively.[iv]
In 1943, Fisher was offered the Balfour Chair of Genetics at the University of Cambridge. Recognition from the RSS came in 1946 with the Guy medal in gold, and he became its president in 1952-1954, just after he was knighted! Sir Ronald Fisher retired from Cambridge in 1957. The father of modern statistics never held an academic position in statistics!
Read more in Spanos 2008 (below)
References
Bowley, A. L. (1902, 1920, 1926, 1937) Elements of Statistics, 2nd, 4th, 5th and 6th editions, Staples Press, London.
Box, J. F. (1978) The Life of a Scientist: R. A. Fisher, Wiley, NY.
Fisher, R. A. (1912), “On an Absolute Criterion for Fitting Frequency Curves,” Messenger of Mathematics, 41, 155-160.
Fisher, R. A. (1915) “Frequency distribution of the values of the correlation coefficient in samples from an indefinitely large population,” Biometrika, 10, 507-21.
Fisher, R. A. (1921) “On the ‘probable error’ of a coefficient deduced from a small sample,” Metron 1, 2-32.
Fisher, R. A. (1922) “On the mathematical foundations of theoretical statistics,” Philosophical Transactions of the Royal Society, A 222, 309-68.
Fisher, R. A. (1922a) “On the interpretation of χ^{2} from contingency tables, and the calculation of p,” Journal of the Royal Statistical Society 85, 87–94.
Fisher, R. A. (1922b) “The goodness of fit of regression formulae and the distribution of regression coefficients,” Journal of the Royal Statistical Society, 85, 597–612.
Fisher, R. A. (1924) “The conditions under which the χ^{2} measures the discrepancy between observation and hypothesis,” Journal of the Royal Statistical Society, 87, 442-450.
Fisher, R. A. (1925) Statistical Methods for Research Workers, Oliver & Boyd, Edinburgh.
Fisher, R. A. (1935) “The logic of inductive inference,” Journal of the Royal Statistical Society 98, 39-54, discussion 55-82.
Fisher, R. A. (1937), “Professor Karl Pearson and the Method of Moments,” Annals of Eugenics, 7, 303-318.
Gosset, W. S. (1908) “The probable error of the mean,” Biometrika, 6, 1-25.
Hald, A. (1998) A History of Mathematical Statistics from 1750 to 1930, Wiley, NY.
Hotelling, H. (1930) “British statistics and statisticians today,” Journal of the American Statistical Association, 25, 186-90.
Neyman, J. (1934) “On the two different aspects of the representative method: the method of stratified sampling and the method of purposive selection,” Journal of the Royal Statistical Society, 97, 558-625.
Rao, C. R. (1992), “R. A. Fisher: The Founder of Modern Statistics,” Statistical Science, 7, 34-48.
RSS (Royal Statistical Society) (1934) Annals of the Royal Statistical Society 1834-1934, The Royal Statistical Society, London.
Savage, L . J. (1976) “On re-reading R. A. Fisher,” Annals of Statistics, 4, 441-500.
Spanos, A. (2008), “Statistics and Economics,” pp. 1057-1097 in The New Palgrave Dictionary of Economics, Second Edition. Eds. Steven N. Durlauf and Lawrence E. Blume, Palgrave Macmillan.
Tippet, L. H. C. (1931) The Methods of Statistics, Williams & Norgate, London.
[i] Fisher (1937), published a year after Pearson’s death, is particularly acerbic. In Fisher’s mind, Karl Pearson went after a young Indian statistician – totally unfairly – just the way he went after him in 1917.
[ii] Bowley received the Guy Medal in silver from the Royal Statistical Society (RSS) as early as 1895, and became a member of the Council of the RSS in 1898. He was awarded the society’s highest honor, the Guy Medal in gold, in 1935.
[iii] It is important to note that Bowley revised his textbook in statistics for the last time in 1937, and predictably, he missed the whole change of paradigms brought about by Fisher, Neyman and Pearson.
[iv] In their centennial volume published in 1934, the RSS acknowledged the development of ‘mathematical statistics’, referring to Galton, Edgeworth, Karl Pearson, Yule and Bowley as the main pioneers, and listed the most important contributions in this sub-field which appeared in its Journal during the period 1909-33, but the three important papers by Fisher (1922a-b; 1924) are conspicuously absent from that list. The list itself is dominated by contributions in vital, commercial, financial and labour statistics (see RSS, 1934, pp. 208-23). There is a single reference to Egon Pearson.
This was first posted on 17 Feb. 2013, here.
HAPPY BIRTHDAY R.A. FISHER!
As part of the week of recognizing R.A. Fisher (February 17, 1890 – July 29, 1962), I reblog a guest post by Stephen Senn from 2012/2017. The comments from 2017 lead to a troubling issue that I will bring up in the comments today.
‘Fisher’s alternative to the alternative’
By: Stephen Senn
[2012 marked] the 50th anniversary of RA Fisher’s death. It is a good excuse, I think, to draw attention to an aspect of his philosophy of significance testing. In his extremely interesting essay on Fisher, Jimmie Savage drew attention to a problem in Fisher’s approach to testing. In describing Fisher’s aversion to power functions Savage writes, ‘Fisher says that some tests are more sensitive than others, and I cannot help suspecting that that comes to very much the same thing as thinking about the power function.’ (Savage 1976, p. 473)
The modern statistician, however, has an advantage here denied to Savage. Savage’s essay was published posthumously in 1976, and the lecture on which it was based was given in Detroit on 29 December 1971 (p. 441). At that time Fisher’s scientific correspondence did not form part of his available oeuvre, but in 1990 Henry Bennett’s magnificent edition of Fisher’s statistical correspondence (Bennett 1990) was published, and this throws light on many aspects of Fisher’s thought, including on significance tests.
The key letter here is Fisher’s reply of 6 October 1938 to Chester Bliss’s letter of 13 September. Bliss himself had reported an issue that had been raised with him by Snedecor on 6 September. Snedecor had pointed out that an analysis using inverse sine transformations of some data that Bliss had worked on gave a different result to an analysis of the original values. Bliss had defended his (transformed) analysis on the grounds that a) if a transformation always gave the same result as an analysis of the original data there would be no point and b) an analysis on inverse sines was a sort of weighted analysis of percentages with the transformation more appropriately reflecting the weight of information in each sample. Bliss wanted to know what Fisher thought of his reply.
Fisher replies with a ‘shorter catechism’ on transformations which ends as follows:
A…Have not Neyman and Pearson developed a general mathematical theory for deciding what tests of significance to apply?
B…Their method only leads to definite results when mathematical postulates are introduced, which could only be justifiably believed as a result of extensive experience….the introduction of hidden postulates only disguises the tentative nature of the process by which real knowledge is built up. (Bennett 1990, p. 246)
It seems clear that by hidden postulates Fisher means alternative hypotheses, and I would sum up Fisher’s argument like this. Null hypotheses are more primitive than statistics: to state a null hypothesis immediately carries an implication about an infinity of test statistics. You have to choose one, however. To say that you should choose the one with the greatest power gets you nowhere. This power depends on the alternative hypothesis, but how will you choose your alternative hypothesis? If you knew that under all circumstances in which the null hypothesis was true you would know which alternative was false, you would already know more than the experiment was designed to find out. All that you can do is apply your experience to use statistics which, when employed in valid tests, reject the null hypothesis most often. Hence statistics are more primitive than alternative hypotheses and the latter cannot be made the justification of the former.
I think that this is an important criticism of Fisher’s but not entirely fair. The experience of any statistician rarely amounts to so much that this can be made the (sure) basis for the choice of test. I think that (s)he uses a mixture of experience and argument. I can give an example from my own practice. In carrying out meta-analyses of binary data I have theoretical grounds (I believe) for a prejudice against the risk difference scale and in favour of odds ratios. I think that this prejudice was originally analytic. To that extent I was being rather Neyman-Pearson. However some extensive empirical studies of large collections of meta-analyses have shown that there is less heterogeneity on the odds ratio scale compared to the risk-difference scale. To that extent my preference is Fisherian. However, there are some circumstances (for example where it was reasonably believed that only a small proportion of patients would respond) under which I could be persuaded that the odds ratio was not a good scale. This strikes me as veering towards the N-P.
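The scale issue behind Senn's preference can be made concrete with a toy calculation (entirely hypothetical numbers, not from any of the meta-analyses he mentions): if trials with different baseline risks share a common odds ratio, the risk difference cannot also be common across them.

```python
def odds_ratio(p1, p0):
    """Odds ratio for event risks p1 (treated) and p0 (control)."""
    return (p1 / (1 - p1)) / (p0 / (1 - p0))

# Hypothetical trials sharing a common odds ratio of 2 but differing in
# baseline risk: the implied risk difference varies almost fourfold.
OR = 2.0
for p0 in (0.05, 0.20, 0.50):
    odds1 = OR * p0 / (1 - p0)   # treated odds implied by the common OR
    p1 = odds1 / (1 + odds1)     # back-transform odds to risk
    print(f"baseline risk {p0:.2f}: OR = {odds_ratio(p1, p0):.1f}, "
          f"risk difference = {p1 - p0:.3f}")
```

So homogeneity on one scale forces heterogeneity on the other; which scale is closer to homogeneous in real collections of trials is the empirical question Senn refers to.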
Nevertheless, I have a lot of sympathy with Fisher’s criticism. It seems to me that what the practicing scientist wants to know is what is a good test in practice rather than what would be a good test in theory if this or that could be believed about the world.
References:
J. H. Bennett (1990) Statistical Inference and Analysis Selected Correspondence of R.A. Fisher, Oxford: Oxford University Press.
L. J. Savage (1976) On rereading R A Fisher. The Annals of Statistics, 441-500.
Today is R.A. Fisher’s birthday. I’ll post some Fisherian items this week in honor of it. This paper comes just before the conflicts with Neyman and Pearson erupted. Fisher links his tests and sufficiency to the Neyman-Pearson lemma in terms of power. It’s as if we may see them as ending up in a similar place while starting from different origins. I quote just the most relevant portions…the full article is linked below. Happy Birthday Fisher!
“Two New Properties of Mathematical Likelihood“
by R.A. Fisher, F.R.S.
Proceedings of the Royal Society, Series A, 144: 285-307 (1934)
The property that where a sufficient statistic exists, the likelihood, apart from a factor independent of the parameter to be estimated, is a function only of the parameter and the sufficient statistic, explains the principal result obtained by Neyman and Pearson in discussing the efficacy of tests of significance. Neyman and Pearson introduce the notion that any chosen test of a hypothesis H_{0} is more powerful than any other equivalent test, with regard to an alternative hypothesis H_{1}, when it rejects H_{0} in a set of samples having an assigned aggregate frequency ε when H_{0} is true, and the greatest possible aggregate frequency when H_{1} is true. If any group of samples can be found within the region of rejection whose probability of occurrence on the hypothesis H_{1} is less than that of any other group of samples outside the region, but is not less on the hypothesis H_{0}, then the test can evidently be made more powerful by substituting the one group for the other.
Consequently, for the most powerful test possible the ratio of the probabilities of occurrence on the hypothesis H_{0} to that on the hypothesis H_{1} is less in all samples in the region of rejection than in any sample outside it. For samples involving continuous variation the region of rejection will be bounded by contours for which this ratio is constant. The regions of rejection will then be required in which the likelihood of H_{0} bears to the likelihood of H_{1}, a ratio less than some fixed value defining the contour. (295)…
It is evident, at once, that such a system is only possible when the class of hypotheses considered involves only a single parameter θ, or, what comes to the same thing, when all the parameters entering into the specification of the population are definite functions of one of their number. In this case, the regions defined by the uniformly most powerful test of significance are those defined by the estimate of maximum likelihood, T. For the test to be uniformly most powerful, moreover, these regions must be independent of θ, showing that the statistic must be of the special type distinguished as sufficient. Such sufficient statistics have been shown to contain all the information which the sample provides relevant to the value of the appropriate parameter θ. It is inevitable therefore that if such a statistic exists it should uniquely define the contours best suited to discriminate among hypotheses differing only in respect of this parameter; and it is surprising that Neyman and Pearson should lay it down as a preliminary consideration that ‘the testing of statistical hypotheses cannot be treated as a problem in estimation.’ When tests are considered only in relation to sets of hypotheses specified by one or more variable parameters, the efficacy of the tests can be treated directly as the problem of estimation of these parameters. Regard for what has been established in that theory, apart from the light it throws on the results already obtained by their own interesting line of approach, should also aid in treating the difficulties inherent in cases in which no sufficient statistic exists. (296)
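Fisher's point that the sufficient statistic alone fixes the best-discriminating contours can be illustrated numerically. In a toy setup of my own (testing H_0: θ = 0 against H_1: θ = 1 for a normal mean with known unit variance), ranking samples by the Neyman-Pearson likelihood ratio reproduces exactly the ranking by the sample mean, the sufficient statistic:

```python
import numpy as np

# For X ~ N(theta, 1), the log of L(H0)/L(H1) works out to n/2 - n * mean(x),
# a strictly decreasing function of the sample mean. So the most powerful
# region {ratio < k} is just {mean > c}: the contour is set by the
# sufficient statistic alone, as Fisher argues.
def log_likelihood_ratio(x, theta0=0.0, theta1=1.0):
    return np.sum((x - theta1) ** 2 - (x - theta0) ** 2) / 2.0  # log L0 - log L1

rng = np.random.default_rng(1)
n = 10
samples = [rng.normal(0.3, 1.0, n) for _ in range(1000)]
means = np.array([s.mean() for s in samples])
llrs = np.array([log_likelihood_ratio(s) for s in samples])

# Ordering by likelihood ratio is exactly the reverse ordering by the mean
order_by_llr = np.argsort(llrs)     # smallest ratio (most anti-H0) first
order_by_mean = np.argsort(-means)  # largest mean first
print(np.array_equal(order_by_llr, order_by_mean))  # True
```

This is the one-parameter case Fisher describes; with a sufficient statistic present, the likelihood-ratio contours and the estimation-based contours coincide.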
MONTHLY MEMORY LANE: 3 years ago: February 2015 [1]. Here are some items for your Saturday night reading and rereading. Three are in preparation for Fisher’s birthday next week (Feb 17). One is a Saturday night comedy where Jeffreys appears to substitute for Jay Leno. The 2/25 entry lets you go back 6 years, where there’s more on Fisher, a bit of statistical theatre (of the absurd), misspecification tests, and a guest post (by Schachtman) on that infamous Matrixx court case (wherein the Supreme Court is thought to have weighed in on statistical significance tests). The comments are often the most interesting parts of these old posts.
February 2015
[1] Monthly memory lanes began at the blog’s 3-year anniversary in Sept, 2014.
Stephen Senn
Head of Competence Center
for Methodology and Statistics (CCMS)
Luxembourg Institute of Health
Twitter @stephensenn
Evidence Based or Person-centred? A statistical debate
It was hearing Stephen Mumford and Rani Lill Anjum (RLA) in January 2017 speaking at the Epistemology of Causal Inference in Pharmacology conference in Munich, organised by Jürgen Landes, Barbara Osmani and Roland Poellinger, that inspired me to buy their book, Causation: A Very Short Introduction[1]. Although I do not agree with all that is said in it, and also could not pretend to understand all it says, I can recommend it highly as an interesting introduction to issues in causality, some of which will be familiar to statisticians but some not at all.
Since I have a long-standing interest in researching into ways of delivering personalised medicine, I was interested to see a reference on Twitter to a piece by RLA, Evidence based or person centered? An ontological debate, in which she claims that the choice between evidence based or person-centred medicine is ultimately ontological[2]. I don’t dispute that thinking about health care delivery in ontological terms might be interesting. However, I do dispute that there is any meaningful choice between evidence based medicine (EBM) and person centred healthcare (PCH). To suggest so is to commit a category mistake by suggesting that means are alternatives to ends.
In fact, EBM will be essential to delivering effective PCH, as I shall now explain.
I shall take a rather unglamorous problem, that of deciding whether a generic form of phenytoin is equivalent in effect to a brand-name version. It may seem that this has little to do with PCH but in fact, unpromising as it may seem, it illuminates many of the problems with the often-made but misleading claim that EBM is essentially about averages and is irrelevant to PCH, a view, in my opinion, that is behind the sentiment expressed in RLA’s final sentence: Causal singularism teaches us what PCH already knows: that each person is unique, and that one size does not fit all.
If you want to prove that a generic formulation is equivalent to a brand-name drug, a common design used to get evidence to that effect is a cross-over, in which a number of subjects are given both formulations on separate occasions and the concentrations in the blood of the two formulations are compared to see if they are similar. Such experiments, referred to as bioequivalence studies[3], may seem simple and trivial, but they exhibit in extreme form several common characteristics of RCTs that contradict standard misconceptions regarding them.
In fact, the whole use of such experiments is highly theory-laden and employs an implied model partly based on assumptions and partly on experience. The idea is that, first, equality of concentration in the blood implies equality of clinical effects and, second, although the concentration in the blood could be quite different between volunteers and patients, the relative bioavailability of two formulations should be similar from the one group to the other. Hence, analysis takes place on this scale, which is judged portable from volunteers to patients. In other words, one size does not fit all but evidence from a sample is used to make judgements about what would be seen in a population.
Consider a concrete example of a trial comparing two formulations of phenytoin reported by Shumaker and Metzler[4]. This was a double cross-over in 26 healthy volunteers. In the first pair of periods each subject was given one of the two formulations, the order being randomised. This was then repeated in a second pair of periods. Figure 1 shows the relative bioavailability, that is to say the ratio of the area under the concentration-time curve for the generic (test) formulation compared to the brand-name (reference), using data from the first pair of periods only. For a philosopher’s recognition of what is necessary to translate results from trials to practice, see Nancy Cartwright’s aptly named article, A philosopher’s view of the long road from RCTs to effectiveness[5], and for a statistician’s see Added Values[6].
This plot may seem less than reassuring. It is true that the values seem to cluster around 1 (dashed line), which would imply equality of the formulations, the object of the study, but one value is perilously close to the limit of 0.8 and another is actually above the limit of 1.25, these two boundaries usually being taken to be acceptable limits of similarity.
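As a crude illustration of the limits mentioned here (a sketch only; as the post goes on to explain, the actual judgement of bioequivalence rests on a confidence interval for the average, not on individual ratios):

```python
def bioequivalent(auc_test, auc_reference, lo=0.8, hi=1.25):
    """Check whether the relative bioavailability (ratio of the areas
    under the concentration-time curves) falls within the conventional
    0.8-1.25 acceptance limits."""
    ratio = auc_test / auc_reference
    return lo <= ratio <= hi

print(bioequivalent(102.0, 100.0))  # ratio 1.02: within limits -> True
print(bioequivalent(130.0, 100.0))  # ratio 1.30: outside limits -> False
```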
However, it would be hasty to assume that the differences in relative bioavailability reflect any personal feature of the volunteers. Because the experiment is rather more complex than usual and each volunteer was tested in two cross-overs, we can plot a second determination of the relative bioavailability against the first. This has been done in Figure 2.
There are 26 points, one for each volunteer, with the X axis value being the relative bioavailability in the first cross-over and the Y axis the corresponding figure for the second. The XY plane can be divided into two regions: one in which the difference between the second determination and the subject’s own first determination is smaller than the difference between the second determination and the mean of all the first determinations (labelled personal better), and one in which the reverse is the case (labelled mean better). The 8 points labelled with blue circles are in the former region and the 18 with black asterisks are in the latter. The net result is that for the majority of subjects one would predict the relative bioavailability on the second occasion better using the average value of all subjects rather than using the value for that subject. Note that since much of the prediction error will be due to the inescapable unpredictability of relative bioavailability from occasion to occasion, the superiority of using the average here is plausibly underestimated. In the long run it would do far better.
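The comparison in Figure 2 can be sketched in code. The data below are simulated stand-ins (the Shumaker-Metzler values are not reproduced here), constructed so that between-subject variation is small relative to occasion-to-occasion variation, the situation Senn describes:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical log relative bioavailability for 26 volunteers:
# a small subject effect plus large occasion-to-occasion noise.
n = 26
subject = rng.normal(0.0, 0.02, n)           # small true between-subject variation
first = subject + rng.normal(0.0, 0.10, n)   # determination from the first cross-over
second = subject + rng.normal(0.0, 0.10, n)  # determination from the second cross-over

# Predict each subject's second determination two ways.
err_personal = np.abs(second - first)        # from the subject's own first value
err_mean = np.abs(second - first.mean())     # from the average over all subjects

mean_better = int((err_mean < err_personal).sum())
print(mean_better, "of", n, "subjects are predicted better by the average")
```

When the occasion-to-occasion noise dominates, the group average tends to beat the personal value for most subjects, which is the pattern of Figure 2.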
In fact, judgement of similarity of the two formulations would be based on the average bioavailability, not the individual values, and a suitable analysis of the data from the first cross-over, fitting subject and period effects in addition to treatment to the log-transformed values, would produce a 90% confidence interval of 0.97 to 1.02.
Of course one could argue that this is an extreme example. A plausible explanation is that the long-run relative bioavailability is the same for every individual and it could be argued that there are many clinical trials in patients measuring more complex outcomes where effects would not be constant. Nevertheless, doing better than using the average is harder than it seems and trying to do better will require more evidence not less.
The moral is that if you are not careful you can easily do worse by attempting to go beyond the average. This is well known in quality control circles, where it is understood that if managers attempt to improve the operation of machines, processes and workers without knowing whether or not observed variation has a specific, identifiable and actionable source, they can make quality worse. In choosing ‘one size does not fit all’, RLA has plumped for a misleading analogy. When fitting someone out with clothes, their measurements can be taken with very little error, and it can be assumed that they will not change much in the near future and that what fits now will do so for some time.
Patients and diseases are not like that. The mistake is to assume that the following statement, ‘the effect on you of a given treatment will almost certainly differ from its effect averaged over others,’ justifies the following policy, ‘I am going to ignore the considerable evidence from others and just back my best hunch about you’. The irony is that doing best for the individual may involve making substantial use of the average.
What statisticians know is that where there is much evidence on many patients and very little on the patient currently presenting, to do best will involve a mixed strategy: finding some optimal compromise between ignoring all the experience of others and ignoring that of the current patient. To produce such optimal strategies requires careful planning, good analysis and many data[7, 8]. The latter are part of what we call evidence and to claim, therefore, that personalisation involves abandoning evidence based medicine is quite wrong. Less ontology and more understanding of the statistics of prediction is needed.
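The ‘optimal compromise’ can be illustrated with a simple precision-weighted shrinkage estimator. This is a generic sketch of the idea, not the specific methods of references [7, 8]; all numbers are hypothetical:

```python
def shrink_estimate(personal_obs, pop_mean, within_var, between_var):
    """Precision-weighted compromise between a patient's own mean and
    the population mean (a standard random-effects shrinkage formula)."""
    n = len(personal_obs)
    personal_mean = sum(personal_obs) / n
    # The weight on the personal mean grows with the number of observations
    # and with genuine between-patient variation; it shrinks when
    # within-patient noise dominates.
    w = between_var / (between_var + within_var / n)
    return w * personal_mean + (1 - w) * pop_mean

# Two noisy readings on one patient; within-patient noise dominates,
# so the estimate stays close to the population mean.
print(round(shrink_estimate([1.30, 0.95], pop_mean=1.0,
                            within_var=0.04, between_var=0.001), 3))  # -> 1.006
```

The individual’s data move the answer only slightly away from the average, which is the sense in which doing best for the individual makes substantial use of the average.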
[1] Mumford, S. & Anjum, R.L. 2013 Causation: A Very Short Introduction. OUP Oxford.
[2] Anjum, R.L. 2016 Evidence based or person centered? An ontological debate.
[3] Senn, S.J. 2001 Statistical issues in bioequivalence. Statistics in Medicine 20, 2785-2799.
[4] Shumaker, R.C. & Metzler, C.M. 1998 The phenytoin trial is a case study of “individual bioequivalence”. Drug Information Journal 32, 1063-1072.
[5] Cartwright, N. 2011 A philosopher’s view of the long road from RCTs to effectiveness. The Lancet 377, 1400-1401.
[6] Senn, S.J. 2004 Added Values: Controversies concerning randomization and additivity in clinical trials. Statistics in Medicine 23, 3729-3753.
[7] Araujo, A., Julious, S. & Senn, S. 2016 Understanding Variation in Sets of N-of-1 Trials. PLoS ONE 11, e0167167. (doi:10.1371/journal.pone.0167167).
[8] Senn, S. 2017 Sample size considerations for n-of-1 trials. Statistical Methods in Medical Research.
Stephen Senn
Head of Competence Center
for Methodology and Statistics (CCMS)
Luxembourg Institute of Health
Twitter @stephensenn
Being a statistician means never having to say you are certain
A recent discussion of randomised controlled trials[1] by Angus Deaton and Nancy Cartwright (D&C) contains much interesting analysis but also, in my opinion, does not escape rehashing some of the invalid criticisms of randomisation with which the literature seems to be littered. The paper has two major sections. The latter, which deals with generalisation of results, or what is sometimes called external validity, I like much more than the former, which deals with internal validity. It is the former I propose to discuss.
The trouble starts, in my opinion, with the discussion of balance. Perfect balance is not, contrary to what is often claimed, a necessary requirement for causal inference, nor is it something that randomisation attempts to provide. Conventional analyses of randomised experiments make an allowance for imbalance, and that allowance is inappropriate if all covariates are balanced. If you analyse a matched-pairs design as if it were completely randomised, you fail that question in Stat 1. (At least if I am marking the exam.) The larger standard error for the completely randomised design is an allowance for the probable imbalance that such a design will have compared to a matched-pairs design.
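Senn’s point can be illustrated with a small simulation (a sketch with made-up parameters, not data from any trial): analysing a matched-pairs design as if it were completely randomised inflates the standard error precisely because the unpaired analysis allows for imbalance that the pairing has already removed.

```python
import numpy as np

rng = np.random.default_rng(42)

# Simulate a matched-pairs design: a shared pair effect induces
# strong positive correlation between the two arms.
n_pairs = 50
pair_effect = rng.normal(0.0, 2.0, n_pairs)               # between-pair variation
treated = pair_effect + 1.0 + rng.normal(0.0, 0.5, n_pairs)
control = pair_effect + rng.normal(0.0, 0.5, n_pairs)

# Paired analysis: standard error of the mean within-pair difference.
d = treated - control
se_paired = d.std(ddof=1) / np.sqrt(n_pairs)

# Analysis as if completely randomised: standard error of a difference
# of independent means, which ignores the pairing.
se_unpaired = np.sqrt(treated.var(ddof=1) / n_pairs + control.var(ddof=1) / n_pairs)

print(round(float(se_paired), 3), round(float(se_unpaired), 3))  # paired SE is much smaller
```

The larger unpaired standard error is exactly the allowance for probable imbalance that a completely randomised design would need and a matched-pairs design does not.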
This brings me on to another criticism. D&C discuss matching as if it were somehow an alternative to randomisation. But Fisher’s motto for designs can be expressed as, “block what you can and randomise what you can’t”. We regularly run cross-over trials, for example, in which there is blocking by patient, since every patient receives each treatment, and also blocking by period, since each treatment appears equally often in each period but we still randomise patients to sequences.
Part of their discussion recognizes this but elsewhere they simply confuse the issue, for example discussing randomisation as if it were an alternative to control. Control makes randomisation possible. Without control, there is no randomisation. Randomisation makes blinding possible; without randomisation there can be no convincing blinding. Thus, in order of importance, they are control, randomisation and blinding, but to set randomisation up as some alternative to control is simply misleading and unhelpful.
Elsewhere they claim, “the RCT strategy is only successful if we are happy with estimates that are arbitrarily far from the truth, just so long as errors cancel out over a series of imaginary experiments”, but this is not what RCTs rely on. The mistake is in becoming fixated with the point estimate. This will, indeed, be in error, but any decent experiment and analysis will deliver an estimate of that error, as, indeed, they concede elsewhere. Being a statistician means never having to say you are certain. To prove a statistician a liar you have to prove that the probability statement is wrong. That is harder than it may seem.
They correctly identify that when it comes to hidden covariates it is the totality of their effect that matters. In this, their discussion is far superior to the “indefinitely many confounders” argument that has been incorrectly proposed by others as being some fatal flaw. (See my previous blog Indefinite Irrelevance.) However, they then undermine this by adding, “but consider the human genome base pairs. Out of all those billions, only one might be important, and if that one is unbalanced, the result of a single trial can be ‘randomly confounded’ and far from the truth”. To which I answer, “so what?”. To see the fallacy in this argument, which simultaneously postulates a rare event and conditions on its having happened, even though it is unobserved, consider the following. I maintain that if a fair die is rolled six times, the probability of six sixes in a row will be 1/46,656 and so rather rare. “Nonsense” say D&C, “suppose that the first five rolls have each produced a six; it will then happen one in six times and so is really not rare at all”.
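The dice version of the fallacy is easy to check by simulation (a sketch; the counts are Monte Carlo estimates, not exact values):

```python
import random

random.seed(0)
trials = 1_000_000
first_five_six = 0  # trials whose first five rolls are all sixes
all_six = 0         # trials that are six sixes in a row

for _ in range(trials):
    # all() short-circuits, so most trials stop after the first non-six.
    if all(random.randint(1, 6) == 6 for _ in range(5)):
        first_five_six += 1
        if random.randint(1, 6) == 6:
            all_six += 1

print(all_six / trials)                  # estimates the unconditional 1/46,656
print(all_six / max(first_five_six, 1))  # given five sixes already: estimates 1/6
```

The unconditional probability really is tiny; only by conditioning on the rare event having already happened does the answer rise to 1/6, which is exactly the move D&C’s argument smuggles in.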
I also consider that their simulation is irrelevant. They ask us to believe that if 100 samples of size 50 are taken from a log-Normal distribution and, for each sample, the values are permuted 1000 times into 25 in the control and 25 in the experimental group, the type I error rate for a nominal 5% two-sample t-test will be 13.5%. In view of what is known about the robustness of the t-test under the null hypothesis (there is a long literature going back to Egon Pearson in the 1930s), this is extremely surprising and as soon as I saw it I disbelieved it. I simulated this myself using 2000 permutations, just for good measure, and found the distribution of type I error rates shown in the accompanying figure.
Each dot represents the type I error rate over 2000 permutations for one of the 100 samples. It can be seen that for most of the samples the proportion of significant t-tests is less than the nominal 5% and in fact, the average for the simulation is 4%. It is, of course, somewhat regrettable that some of the values are above 5% and, indeed, five of them have got a value of nearly 6% but if this worries you, the cure is at hand. Use a permutation t-test rather than a parametric one. (For a history of this approach, see the excellent book by Mielke et al [2].) Don’t confuse the technical details of analysis with the randomisation. Whatever you do for analysis, you will be better off for having randomised whatever you haven’t blocked.
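Senn’s check can be reproduced in outline. The sketch below follows his description (log-Normal samples, permutation into two groups of 25, two-sample t-test), but the distribution’s parameters and his exact code are not given in the post, so a standard log-Normal and fewer permutations are assumed here:

```python
import numpy as np

rng = np.random.default_rng(2018)
n_samples, n, n_perm = 100, 50, 400
t_crit = 2.0106  # two-sided 5% critical value of t with 48 degrees of freedom

def two_sample_t(x, y):
    """Pooled-variance two-sample t statistic."""
    nx, ny = len(x), len(y)
    sp2 = ((nx - 1) * x.var(ddof=1) + (ny - 1) * y.var(ddof=1)) / (nx + ny - 2)
    return (x.mean() - y.mean()) / np.sqrt(sp2 * (1 / nx + 1 / ny))

rates = []
for _ in range(n_samples):
    sample = rng.lognormal(mean=0.0, sigma=1.0, size=n)  # assumed parameters
    rejections = 0
    for _ in range(n_perm):
        perm = rng.permutation(sample)
        # Random split into "control" and "experimental" arms of 25 each.
        rejections += abs(two_sample_t(perm[:25], perm[25:])) > t_crit
    rates.append(rejections / n_perm)

print(round(float(np.mean(rates)), 3))  # Senn reports an average of about 4%
```

Because the two arms are permutations of the same sample, the null is true by construction, and the average rejection rate comes out near or below the nominal 5%, not at 13.5%.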
Why does my result differ from theirs? It is hard for me to work out exactly what they have done but I suspect that it is because they have assumed an impossible situation. They are allowing that the average treatment effect for the millions of patients that might have been included is zero but then sampling varying effects (that is to say the difference the treatment makes), rather than merely values (that is to say the reading for given patients), from this distribution. For any given sample the mean of the effects will not be zero and so the null-hypothesis will, as they point out, not be true for the sample, only for the population. But in analysing clinical trials we don’t consider this population. We have precise control of the allocation algorithm (who gets what if they are in the trial) and virtually none over the presenting process (who gets into the trial) and the null hypothesis that we test is that the effect is zero in the sample not in some fictional population. It may be that I have misunderstood what they are doing but I think that this is the origin of the difference.
This is an example of the sort of approach that led to Neyman’s famous dispute with Fisher. One can argue about the appropriateness of the Fisherian null hypothesis, “the treatments are the same”, but Neyman’s “the treatments are not the same but on average they are the same” is simply incredible[3]. As D&C’s simulation shows, as soon as you allow this, you will never find a sample for which it is true. If there is no sample for which it is true, what exactly are the remarkable properties of the population for which it is true? D&C refer to magical thinking about RCTs dismissively but this is straight out of some wizard’s wonderland.
My view is that randomisation should not be used as an excuse for ignoring what is known and observed but that it does deal validly with hidden confounders[4]. It does not do this by delivering answers that are guaranteed to be correct; nothing can deliver that. It delivers answers about which valid probability statements can be made and, in an imperfect world, this has to be good enough. Another way I sometimes put it is like this: show me how you will analyse something and I will tell you what allocations are exchangeable. If you refuse to choose one at random I will say, “why? Do you have some magical thinking you’d like to share?”
My research on inference for small populations is carried out in the framework of the IDeAl project http://www.ideal.rwth-aachen.de/ and supported by the European Union’s Seventh Framework Programme for research, technological development and demonstration under Grant Agreement no 602552.
References
You know how in that (not-so) recent Woody Allen movie, “Midnight in Paris,” the main character (I forget who plays him, I saw it on a plane) is a writer finishing a novel, and he steps into a cab that mysteriously picks him up at midnight and transports him back in time where he gets to run his work by such famous authors as Hemingway and Virginia Woolf? He is impressed when his work earns their approval and he comes back each night in the same mysterious cab…Well, imagine an error statistical philosopher is picked up in a mysterious taxi at midnight (New Year’s Eve 2011, 2012, 2013, 2014, 2015, 2016, 2017) and is taken back fifty years and, lo and behold, finds herself in the company of Allan Birnbaum.[i] There are a couple of brief updates at the end.
ERROR STATISTICIAN: It’s wonderful to meet you Professor Birnbaum; I’ve always been extremely impressed with the important impact your work has had on philosophical foundations of statistics. I happen to be writing on your famous argument about the likelihood principle (LP). (whispers: I can’t believe this!)
BIRNBAUM: Ultimately you know I rejected the LP as failing to control the error probabilities needed for my Confidence concept.
ERROR STATISTICIAN: Yes, but I actually don’t think your argument shows that the LP follows from such frequentist concepts as sufficiency S and the weak conditionality principle WCP.[ii] Sorry,…I know it’s famous…
BIRNBAUM: Well, I shall happily invite you to take any case that violates the LP and allow me to demonstrate that the frequentist is led to inconsistency, provided she also wishes to adhere to the WCP and sufficiency (although less than S is needed).
ERROR STATISTICIAN: Well, I happen to be a frequentist (error statistical) philosopher; I have recently (2006) found a hole in your proof… er… well, I hope we can discuss it.
BIRNBAUM: Well, well, well: I’ll bet you a bottle of Elba Grease champagne that I can demonstrate it!
ERROR STATISTICAL PHILOSOPHER: It is a great drink, I must admit that: I love lemons.
BIRNBAUM: OK. (A waiter brings a bottle, they each pour a glass and resume talking). Whoever wins this little argument pays for this whole bottle of vintage Ebar or Elbow or whatever it is Grease.
ERROR STATISTICAL PHILOSOPHER: I really don’t mind paying for the bottle.
BIRNBAUM: Good, you will have to. Take any LP violation. Let x’ be a 2-standard deviation difference from the null (asserting μ = 0) in testing a normal mean from the fixed sample size experiment E’, say n = 100; and let x” be a 2-standard deviation difference from an optional stopping experiment E”, which happens to stop at 100. Do you agree that:
(0) For a frequentist, outcome x’ from E’ (fixed sample size) is NOT evidentially equivalent to x” from E” (optional stopping that stops at n).
ERROR STATISTICAL PHILOSOPHER: Yes, that’s a clear case where we reject the strong LP, and it makes perfect sense to distinguish their corresponding p-values (which we can write as p’ and p”, respectively). The searching in the optional stopping experiment makes the p-value quite a bit higher than with the fixed sample size. For n = 100, data x’ yields p’ ~ .05, while p” is ~.3. Clearly, p’ is not equal to p”; I don’t see how you can make them equal.
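(An aside for readers who want to check the two numbers: the fixed-sample p-value is exact, while the optional-stopping figure can be approximated by simulation. This is only a sketch; the stopping scheme assumed here, testing after every observation up to n = 100, is one reading of the setup rather than a detail given in the dialogue.)

```python
import math
import random

random.seed(7)

# p' for the fixed-n experiment E': exact two-sided tail area beyond 2 sd.
p_fixed = 2 * (1 - 0.5 * (1 + math.erf(2 / math.sqrt(2))))
print(round(p_fixed, 3))  # 0.046, i.e. the "~.05"

# p" for optional stopping: under the null, how often does the running
# standardized mean reach 2 at SOME n up to 100?
trials, hits = 10_000, 0
for _ in range(trials):
    total = 0.0
    for n in range(1, 101):
        total += random.gauss(0.0, 1.0)
        if abs(total) / math.sqrt(n) >= 2:
            hits += 1
            break

p_stop = hits / trials
print(round(p_stop, 2))  # far larger than p_fixed: searching inflates the p-value
```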
BIRNBAUM: Suppose you’ve observed x’, a 2-standard deviation difference from E’. You admit, do you not, that this outcome could have occurred as a result of a different experiment? It could have been that a fair coin was flipped where it is agreed that heads instructs you to perform E’ and tails instructs you to perform the optional stopping test E”, and you happened to get heads, and then performed the experiment E’ (with n = 100) and obtained your 2-standard deviation difference x’.
ERROR STATISTICAL PHILOSOPHER: Well, that is not how I got x’, but ok, it could have occurred that way.
BIRNBAUM: Good. Then you must grant further that your result could have come from a special experiment I have dreamt up, call it a BB-experiment. In a BB-experiment, if the outcome of the experiment you actually performed has a proportional likelihood to one in some other experiment not performed, E”, then we say that your result has an “LP pair”. For any violation of the strong LP, the outcome observed, let it be x’, has an “LP pair”, call it x”, in some other experiment E”. In that case, a BB-experiment stipulates that you are to report x’ as if you had determined whether to run E’ or E” by flipping a fair coin.
(They fill their glasses again)
ERROR STATISTICAL PHILOSOPHER: You’re saying that if my outcome from the fixed sample size experiment E’ has an “LP pair” in the (optional stopping) experiment I did not perform, then I am to report x’ as if the determination to run E’ was by flipping a fair coin (which decides between E’ and E”)?
BIRNBAUM: Yes, and one more thing. If your outcome had actually come from the optional stopping experiment E”, it too would have an “LP pair” in the experiment you did not perform, E’. Whether you actually observed x’ from E’, or x” from E”, you are to report it as x’ from E’.
ERROR STATISTICAL PHILOSOPHER: So let’s see if I understand a Birnbaum BB-experiment: whether my observed 2-standard deviation difference came from E’ or E” the result is reported as x’, as if it came from E’, and as a result of this strange type of a mixture experiment.
BIRNBAUM: Yes, or equivalently you could just report x*: my result is a 2-standard deviation difference and it could have come from either E’ (fixed sampling, n = 100) or E” (optional stopping, which happens to stop at the 100th trial). That’s how I sometimes formulate a BB-experiment.
ERROR STATISTICAL PHILOSOPHER: You’re saying in effect that if my result has an LP pair in the experiment not performed, I should act as if I accept the strong LP and just report its likelihood; so if the likelihoods are proportional in the two experiments (both testing the same mean), the outcomes are evidentially equivalent.
BIRNBAUM: Well, but since the BB-experiment is an imagined “mixture” it is a single experiment, so really you only need to apply the weak LP, which frequentists accept. Yes? (The weak LP is the same as the sufficiency principle.)
ERROR STATISTICAL PHILOSOPHER: But what is the sampling distribution in this imaginary BB-experiment? Suppose I have Birnbaumized my experimental result, just as you describe, and observed a 2-standard deviation difference in a fixed sample size experiment E’. How do I calculate the p-value within a Birnbaumized experiment?
BIRNBAUM: I don’t think anyone has ever called it that.
ERROR STATISTICAL PHILOSOPHER: I just wanted to have a shorthand for the operation you are describing, there’s no need to use it, if you’d rather I not. So how do I calculate the p-value within a BB-experiment?
BIRNBAUM: You would report the overall p-value, which would be the average over the sampling distributions: (p’ + p”)/2
Say p’ is ~.05, and p” is ~.3; whatever they are, we know they are different, that’s what makes this a violation of the strong LP (given in premise (0)).
ERROR STATISTICAL PHILOSOPHER: So you’re saying that if I observe a 2-standard deviation difference from E’, I do not report the associated p-value p’, but instead I am to report the average p-value, averaging over some other experiment E” that could have given rise to an outcome with a proportional likelihood to the one I observed, even though I didn’t obtain it this way?
BIRNBAUM: I’m saying that you have to grant that x’ from a fixed sample size experiment E’ could have been generated through a BB- experiment.
My this drink is sour!
ERROR STATISTICAL PHILOSOPHER: Yes, I love pure lemon.
BIRNBAUM: Perhaps you’re in want of a gene; never mind.
I’m saying you have to grant that x’ from a fixed sample size experiment E’ could have been generated through a BB-experiment. If you are to interpret your experiment as if you are within the rules of a BB experiment, then x’ is evidentially equivalent to x” (is equivalent to x*). This is premise (1).
ERROR STATISTICAL PHILOSOPHER: But this is just a matter of your definitions, it is an analytical or mathematical result, so long as we grant being within your BB experiment.
BIRNBAUM: True, (1) plays the role of the sufficiency assumption, but one need not even appeal to this, it is just a matter of mathematical equivalence.
By the way, I am focusing just on LP violations, therefore, the outcome, by definition, has an LP pair. In other cases, where there is no LP pair, you just report things as usual.
ERROR STATISTICAL PHILOSOPHER: OK, but p’ still differs from p”; so I still don’t see how I’m forced to infer the strong LP, which identifies the two. In short, I don’t see the contradiction with my rejecting the strong LP in premise (0). (Also, we should come back to the “other cases” at some point….)
BIRNBAUM: Wait! Don’t be so impatient; I’m about to get to step (2). Here, let’s toast to the new year: “To Elbar Grease!”
ERROR STATISTICAL PHILOSOPHER: To Elbar Grease!
BIRNBAUM: So far all of this was step (1).
ERROR STATISTICAL PHILOSOPHER: Oy, what is step 2?
BIRNBAUM: STEP 2 is this: Surely, you agree, that once you know from which experiment the observed 2-standard deviation difference actually came, you ought to report the p-value corresponding to that experiment. You ought NOT to report the average (p’ + p”)/2 as you were instructed to do in the BB experiment.
This gives us premise (2a):
(2a) outcome x’, once it is known that it came from E’, should NOT be analyzed as in a BB-experiment where p-values are averaged. The report should instead use the sampling distribution of the fixed sample test E’, yielding the p-value, p’ (.05).
ERROR STATISTICAL PHILOSOPHER: So, having first insisted I imagine myself in a Birnbaumized, I mean a BB-experiment, and report an average p-value, I’m now to return to my senses and “condition” in order to get back to the only place I ever wanted to be, i.e., back to where I was to begin with?
BIRNBAUM: Yes, at least if you hold to the weak conditionality principle WCP (of D. R. Cox)—surely you agree to this.
(2b) Likewise, if you knew the 2-standard deviation difference came from E”, then
x” should NOT be deemed evidentially equivalent to x’ (as in the BB experiment); the report should instead use the sampling distribution of the optional stopping test E”. This would yield p-value p” (~.3).
ERROR STATISTICAL PHILOSOPHER: So, having first insisted I consider myself in a BB-experiment, in which I report the average p-value, I’m now to return to my senses and allow that if I know the result came from optional stopping, E”, I should “condition” on E” and report p”.
BIRNBAUM: Yes. There was no need to repeat the whole spiel.
ERROR STATISTICAL PHILOSOPHER: I just wanted to be clear I understood you.
BIRNBAUM: So you arrive at (2a) and (2b), yes?
ERROR STATISTICAL PHILOSOPHER: OK, but it might be noted that unlike premise (1), premises (2a) and (2b) are not given by definition, they concern an evidential standpoint about how one ought to interpret a result once you know which experiment it came from. In particular, premises (2a) and (2b) say I should condition and use the sampling distribution of the experiment known to have been actually performed, when interpreting the result.
BIRNBAUM: Yes, and isn’t this weak conditionality principle WCP one that you happily accept?
ERROR STATISTICAL PHILOSOPHER: Well the WCP is defined for actual mixtures, where one flipped a coin to determine if E’ or E” is performed, whereas, you’re requiring I consider an imaginary Birnbaum mixture experiment, where the choice of the experiment not performed will vary depending on the outcome that needs an LP pair; and I cannot even determine what this might be until after I’ve observed the result that would violate the LP?
BIRNBAUM: Sure, but you admit that your observed x’ could have come about through a BB-experiment, and that’s all I need. Notice
(1), (2a) and (2b) yield the strong LP!
Outcome x’ from E’ (fixed sample size n) is evidentially equivalent to x” from E” (optional stopping that stops at n).
ERROR STATISTICAL PHILOSOPHER: Clever, but your “proof” is obviously unsound; and before I demonstrate this, notice that the conclusion, were it to follow, asserts p’ = p”, (e.g., .05 = .3!), even though it is unquestioned that p’ is not equal to p”, that is because we must start with an LP violation (premise (0)).
BIRNBAUM: Yes, it is puzzling, but where have I gone wrong?
(The waiter comes by and fills their glasses; they are so deeply engrossed in thought they do not even notice him.)
ERROR STATISTICAL PHILOSOPHER: There are many routes to explaining a fallacious argument. Here’s one. What is required for STEP 1 to hold, is the denial of what’s needed for STEP 2 to hold:
Step 1 requires us to analyze results in accordance with a BB-experiment. If we do so, true enough, we get:
premise (1): outcome x’ (in a BB experiment) is evidentially equivalent to outcome x” (in a BB experiment):
That is because in either case, the p-value would be (p’ + p”)/2
Step 2 now insists that we should NOT calculate evidential import as if we were in a BB-experiment. Instead we should consider the experiment from which the data actually came, E’ or E”:
premise (2a): outcome x’ (within a BB experiment) is/should be evidentially equivalent to x’ from E’ (fixed sample size): its p-value should be p’.
premise (2b): outcome x” (in a BB experiment) is/should be evidentially equivalent to x” from E” (optional stopping that stops at n): its p-value should be p”.
If (1) is true, then (2a) and (2b) must be false!
If (1) is true and we keep fixed the stipulation of a BB experiment (which we must, to apply step 2), then (2a) is asserting:
The average p-value (p’ + p”)/2 = p’, which is false.
Likewise, if (1) is true, then (2b) is asserting:
The average p-value (p’ + p”)/2 = p”, which is false.
Alternatively, we can see what goes wrong by realizing:
If (2a) and (2b) are true, then premise (1) must be false.
In short your famous argument requires us to assess evidence in a given experiment in two contradictory ways: as if we are within a BB- experiment (and report the average p-value) and also that we are not, but rather should report the actual p-value.
I can render it as formally valid, but then its premises can never all be true; alternatively, I can get the premises to come out true, but then the conclusion is false—so it is invalid. In no way does it show the frequentist is open to contradiction (by dint of accepting S, WCP, and denying the LP).
BIRNBAUM: Yet some people still think it is a breakthrough (in favor of Bayesianism).
ERROR STATISTICAL PHILOSOPHER: (update 12/31/14) I have a much clearer exposition of what goes wrong in your argument in this published paper from 2010. However, there were still several gaps, and lack of a clear articulation of the WCP. In fact, I’ve come to see that clarifying the entire argument turns on defining the WCP. Have you seen my 2014 paper in Statistical Science?
BIRNBAUM: Yes I have seen it, very clever! Your Rejoinder to some of the critics is gutsy, to say the least. Congratulations!
ERROR STATISTICAL PHILOSOPHER: Thanks, but look I must ask you something.
BIRNBAUM: I do follow your blog. I might even have a palindrome for your December contest.
ERROR STATISTICAL PHILOSOPHER: Wow! I can’t believe you read my blog, but look I must ask you something before you leave this year.
(Sudden interruption by the waiter.)
WAITER: Who gets the tab?
BIRNBAUM: I do. To Elbar Grease!
ERROR STATISTICAL PHILOSOPHER: To Elbar Grease! Happy New Year!
ADD-ONS (12/31/13, 14, 15, 16, & 17):
ERROR STATISTICAL PHILOSOPHER: I have one quick question, Professor Birnbaum, and I swear that whatever you say will be just between us, I won’t tell a soul. In your last couple of papers, you suggest you’d discovered the flaw in your argument for the LP. Am I right?
BIRNBAUM: Savage, you know, never got off my case about remaining at “the half-way house” of likelihood, and not going full Bayesian. Then I wrote the review about the Confidence Concept as the one rock on a shifting scene…
ERROR STATISTICAL PHILOSOPHER: Yes, but back to my question, you disappeared before answering last year…I just want to know…
WAITER: We’re closing now; shall I call you a taxicab?
BIRNBAUM: Yes.
ERROR STATISTICAL PHILOSOPHER: ‘Yes’, you discovered the flaw in the argument, or ‘yes’ to the taxi?
MANAGER: We’re closing now; I’m sorry you must leave.
ERROR STATISTICAL PHILOSOPHER: We’re leaving I just need him to clarify his answer….
(A large group of people bustles past.)
Prof. Birnbaum…? Allan? Where did he go? (oy, not again!)
Link to complete discussion:
Mayo, Deborah G. On the Birnbaum Argument for the Strong Likelihood Principle (with discussion & rejoinder). Statistical Science 29 (2014), no. 2, 227-266.
[i] Many links on the strong likelihood principle (LP or SLP) and Birnbaum may be found by searching this blog. Good sources for where to start as well as historical background papers may be found in my last blogpost.
[ii] By the way, Ronald Giere gave me numerous original papers of yours. They’re in files in my attic library. Some are in mimeo, others typed…I mean, obviously for that time that’s what they’d be…now of course, oh never mind, sorry.