G. A. Barnard: 23 Sept 1915-30 July, 2002
Today is George Barnard’s 101st birthday. In honor of this, I reblog an exchange between Barnard, Savage (and others) on likelihood vs probability. The exchange is from pp 79-84 (of what I call) “The Savage Forum” (Savage, 1962).[i] Six other posts on Barnard are linked below: 2 are guest posts (Senn, Spanos); the other 4 include a play (pertaining to our first meeting), and a letter he wrote to me.
BARNARD:…Professor Savage, as I understand him, said earlier that a difference between likelihoods and probabilities was that probabilities would normalize because they integrate to one, whereas likelihoods will not. Now probabilities integrate to one only if all possibilities are taken into account. This requires in its application to the probability of hypotheses that we should be in a position to enumerate all possible hypotheses which might explain a given set of data. Now I think it is just not true that we ever can enumerate all possible hypotheses. … If this is so we ought to allow that in addition to the hypotheses that we really consider we should allow something that we had not thought of yet, and of course as soon as we do this we lose the normalizing factor of the probability, and from that point of view probability has no advantage over likelihood. This is my general point, that I think while I agree with a lot of the technical points, I would prefer that this is talked about in terms of likelihood rather than probability. I should like to ask what Professor Savage thinks about that, whether he thinks that the necessity to enumerate hypotheses exhaustively, is important. Continue reading
Objectivity in statistics, as in science more generally, is a matter of both aims and methods. Objective science, in our view, aims to find out what is the case as regards aspects of the world [that hold] independently of our beliefs, biases and interests; thus objective methods aim for the critical control of inference and hypotheses, constraining them by evidence and checks of error. (Cox and Mayo 2010, p. 276)
I. The myth of objectivity. Whenever you come up against blanket slogans such as “no methods are objective” or “all methods are equally objective and subjective,” it is a good guess that the problem is being trivialized into oblivion. Yes, there are judgments, disagreements, and values in any human activity, which alone makes it too trivial an observation to distinguish among very different ways that threats of bias and unwarranted inferences may be controlled. Is the objectivity-subjectivity distinction really toothless as many will have you believe? I say no.
Cavalier attitudes toward objectivity are in tension with widely endorsed movements to promote replication, reproducibility, and to come clean on a number of sources behind illicit results: multiple testing, cherry picking, failed assumptions, researcher latitude, publication bias and so on. The moves to take back science–if they are not mere lip-service–are rooted in the supposition that we can more objectively scrutinize results,even if it’s only to point out those that are poorly tested. The fact that the term “objectivity” is used equivocally should not be taken as grounds to oust it, but rather to engage in the difficult work of identifying what there is in “objectivity” that we won’t give up, and shouldn’t. Continue reading
C. S. Peirce: 10 Sept, 1839-19 April, 1914
Today is C.S. Peirce’s birthday. He’s one of my all time heroes. You should read him: he’s a treasure chest on essentially any topic, and he anticipated several major ideas in statistics (e.g., randomization, confidence intervals) as well as in logic. I’ll reblog the first portion of a (2005) paper of mine. Links to Parts 2 and 3 are at the end. It’s written for a very general philosophical audience; the statistical parts are pretty informal. Happy birthday Peirce.
Peircean Induction and the Error-Correcting Thesis
Deborah G. Mayo
Transactions of the Charles S. Peirce Society: A Quarterly Journal in American Philosophy, Volume 41, Number 2, 2005, pp. 299-319
Peirce’s philosophy of inductive inference in science is based on the idea that what permits us to make progress in science, what allows our knowledge to grow, is the fact that science uses methods that are self-correcting or error-correcting:
Induction is the experimental testing of a theory. The justification of it is that, although the conclusion at any stage of the investigation may be more or less erroneous, yet the further application of the same method must correct the error. (5.145)
The consequences of recent criticisms of statistical tests have breathed brand new life into some very old howlers, many of which have been discussed on this blog. What is not funny, though, is how standard notions such as frequentist error probabilities are being redefined in the process, and how we now have arguments built on equivocations. In fact, there are official guidebooks for the statistically perplexed giving inconsistent definitions to the same term (See for just 1 of many examples this post). How much more perplexed will that leave us! Since it’s near the 5-year anniversary of this blog, let’s listen in to a new comedy hour mixing one from 3 years ago with some add-ons*.
Did you hear the one about the frequentist significance tester when he was shown the nonfrequentist nature of p-values?
Critic: I just simulated a long series of tests on a pool of null hypotheses, and I found that among tests with p-values of .05, at least 22%—and typically over 50%—of the null hypotheses are true!
Frequentist Significance Tester: Scratches head: But rejecting the null with a p-value of .05 ensures erroneous rejection no more than 5% of the time!
Raucous laughter ensues!
(Hah, hah… “So funny, I forgot to laugh! Or, I’m crying and laughing at the same time!) Continue reading
Prof. Larry Laudan
Lecturer in Law and Philosophy
University of Texas at Austin
“‘Not Guilty’: The Misleading Verdict and How It Fails to Serve either Society or the Innocent Defendant”
Most legal systems in the developed world share in common a two-tier verdict system: ‘guilty’ and ‘not guilty’. Typically, the standard for a judgment of guilty is set very high while the standard for a not-guilty verdict (if we can call it that) is quite low. That means any level of apparent guilt less than about 90% confidence that the defendant committed the crime leads to an acquittal (90% being the usual gloss on proof beyond a reasonable doubt, although few legal systems venture a definition of BARD that precise). According to conventional wisdom, the major reason for setting the standard as high as we do is the desire, even the moral necessity, to shield the innocent from false conviction. Continue reading
E.S.Pearson on a Gate, Mayo sketch
Here you see my scruffy sketch of Egon drawn 20 years ago for the frontispiece of my book, “Error and the Growth of Experimental Knowledge” (EGEK 1996). The caption is
“I might recall how certain early ideas came into my head as I sat on a gate overlooking an experimental blackcurrant plot… –E.S Pearson, “Statistical Concepts in Their Relation to Reality”.
He is responding to Fisher to “dispel the picture of the Russian technological bogey”. [i]
So, as I said in my last post, just to make a short story long, I’ve recently been scouring around the history and statistical philosophies of Neyman, Pearson and Fisher for purposes of a book soon to be completed, and I discovered a funny little error about this quote. Only maybe 3 or 4 people alive would care, but maybe someone out there knows the real truth.
OK, so I’d been rereading Constance Reid’s great biography of Neyman, and in one place she interviews Egon about the sources of inspiration for their work. Here’s what Egon tells her: Continue reading
E.S. Pearson (11 Aug, 1895-12 June, 1980)
This is a belated birthday post for E.S. Pearson (11 August 1895-12 June, 1980). It’s basically a post from 2012 which concerns an issue of interpretation (long-run performance vs probativeness) that’s badly confused these days. I’ve recently been scouring around the history and statistical philosophies of Neyman, Pearson and Fisher for purposes of a book soon to be completed. I recently discovered a little anecdote that calls for a correction in something I’ve been saying for years. While it’s little more than a point of trivia, it’s in relation to Pearson’s (1955) response to Fisher (1955)–the last entry in this post. I’ll wait until tomorrow or the next day to share it, to give you a chance to read the background.
Are methods based on error probabilities of use mainly to supply procedures which will not err too frequently in some long run? (performance). Or is it the other way round: that the control of long run error properties are of crucial importance for probing the causes of the data at hand? (probativeness). I say no to the former and yes to the latter. This, I think, was also the view of Egon Sharpe (E.S.) Pearson.
Cases of Type A and Type B
“How far then, can one go in giving precision to a philosophy of statistical inference?” (Pearson 1947, 172)
1. PhilSci and StatSci. I’m always glad to come across statistical practitioners who wax philosophical, particularly when Karl Popper is cited. Best of all is when they get the philosophy somewhere close to correct. So, I came across an article by Burnham and Anderson (2014) in Ecology:
“While the exact definition of the so-called ‘scientific method’ might be controversial, nearly everyone agrees that the concept of ‘falsifiability’ is a central tenant [sic] of empirical science (Popper 1959). It is critical to understand that historical statistical approaches (i.e., P values) leave no way to ‘test’ the alternative hypothesis. The alternative hypothesis is never tested, hence cannot be rejected or falsified!… Surely this fact alone makes the use of significance tests and P values bogus. Lacking a valid methodology to reject/falsify the alternative science hypotheses seems almost a scandal.” (Burnham and Anderson p. 629)
Well I am (almost) scandalized by this easily falsifiable allegation! I can’t think of a single “alternative”, whether in a “pure” Fisherian or a Neyman-Pearson hypothesis test (whether explicit or implicit) that’s not falsifiable; nor do the authors provide any. I grant that understanding testability and falsifiability is far more complex than the kind of popularized accounts we hear about; granted as well, theirs is just a short paper. But then why make bold declarations on the topic of the “scientific method and statistical science,” on falsifiability and testability? Continue reading
Head of Competence Center for Methodology and Statistics (CCMS)
Luxembourg Institute of Health
The tweet read “Featured review: Only 10% people with tension-type headaches get a benefit from paracetamol” and immediately I thought, ‘how would they know?’ and almost as quickly decided, ‘of course they don’t know, they just think they know’. Sure enough, on following up the link to the Cochrane Review in the tweet it turned out that, yet again, the deadly mix of dichotomies and numbers needed to treat had infected the brains of researchers to the extent that they imagined that they had identified personal response. (See Responder Despondency for a previous post on this subject.)
The bare facts they established are the following:
The International Headache Society recommends the outcome of being pain free two hours after taking a medicine. The outcome of being pain free or having only mild pain at two hours was reported by 59 in 100 people taking paracetamol 1000 mg, and in 49 out of 100 people taking placebo.
and the false conclusion they immediately asserted is the following
This means that only 10 in 100 or 10% of people benefited because of paracetamol 1000 mg.
To understand the fallacy, look at the accompanying graph. Continue reading
3 years ago…
MONTHLY MEMORY LANE: 3 years ago: July 2013. I mark in red three posts that seem most apt for general background on key issues in this blog, excluding those reblogged recently , and in green up to 3 others I’d recommend. Posts that are part of a “unit” or a group of “U-Phils”(you [readers] philosophize) count as one.
- (7/3) Phil/Stat/Law: 50 Shades of gray between error and fraud
- (7/6) Bad news bears: ‘Bayesian bear’ rejoinder–reblog mashup
- (7/10) PhilStatLaw: Reference Manual on Scientific Evidence (3d ed) on Statistical Significance (Schachtman)
- (7/11) Is Particle Physics Bad Science? (memory lane)
- (7/13) Professor of Philosophy Resigns over Sexual Misconduct (rejected post)
- (7/14) Stephen Senn: Indefinite irrelevance
- (7/17) Phil/Stat/Law: What Bayesian prior should a jury have? (Schachtman)
- (7/19) Msc Kvetch: A question on the Martin-Zimmerman case we do not hear
- (7/20) Guest Post: Larry Laudan. Why Presuming Innocence is Not a Bayesian Prior
- (7/23) Background Knowledge: Not to Quantify, But To Avoid Being Misled By, Subjective Beliefs
- (7/26) New Version: On the Birnbaum argument for the SLP: Slides for JSM talk
 Monthly memory lanes began at the blog’s 3-year anniversary in Sept, 2014.
 New Rule, July 30, 2016.
28 July 1902 – 17 September 1994
Today is Karl Popper’s birthday. I’m linking to a reading from his Conjectures and Refutations[i] along with an undergraduate item I came across: Popper Self-Test Questions. It includes multiple choice questions, quotes to ponder, and thumbnail definitions at the end[ii].
Blog Readers who wish to send me their answers will have their papers graded (at least try the multiple choice; if you’re unsure, do the reading). [Use the comments or e-mail.]
[i] Popper reading from Conjectures and Refutations
[ii] I might note the “No-Pain philosophy” (3 part) Popper posts from this blog: parts 1, 2, and 3.
HAPPY BIRTHDAY POPPER!
Popper, K. (1962). Conjectures and Refutations: The Growth of Scientific Knowledge. New York: Basic Books.
Seeing the world through overly rosy glasses
Taboos about power nearly always stem from misuse of power analysis. Sander Greenland (2012) has a paper called “Nonsignificance Plus High Power Does Not Imply Support for the Null Over the Alternative.” I’m not saying Greenland errs; the error would be made by anyone who interprets power analysis in a manner giving rise to Greenland’s objection. So what’s (ordinary) power analysis?
(I) Listen to Jacob Cohen (1988) introduce Power Analysis
“PROVING THE NULL HYPOTHESIS. Research reports in the literature are frequently flawed by conclusions that state or imply that the null hypothesis is true. For example, following the finding that the difference between two sample means is not statistically significant, instead of properly concluding from this failure to reject the null hypothesis that the data do not warrant the conclusion that the population means differ, the writer concludes, at least implicitly, that there is no difference. The latter conclusion is always strictly invalid, and is functionally invalid as well unless power is high. The high frequency of occurrence of this invalid interpretation can be laid squarely at the doorstep of the general neglect of attention to statistical power in the training of behavioral scientists. Continue reading
3 years ago…
MONTHLY MEMORY LANE: 3 years ago: June 2013. I mark in red three posts that seem most apt for general background on key issues in this blog, excluding those reblogged recently . Posts that are part of a “unit” or a group of “U-Phils”(you [readers] philosophize) count as one. Here I grouped 6/5 and 6/6.
- (6/1) Winner of May Palindrome Contest
- (6/1) Some statistical dirty laundry*(recently reblogged)
- (6/5) Do CIs Avoid Fallacies of Tests? Reforming the Reformers :(6/5 and6/6 are paired as one)
- (6/6) PhilStock: Topsy-Turvy Game
- (6/6) Anything Tests Can do, CIs do Better; CIs Do Anything Better than Tests?* (reforming the reformers cont.)
- (6/8) Richard Gill: “Integrity or fraud… or just questionable research practices?”*(recently reblogged)
- (6/11) Mayo: comment on the repressed memory research [How a conceptual criticism, requiring no statistics, might go.]
- (6/14) P-values can’t be trusted except when used to argue that p-values can’t be trusted!
- (6/19) PhilStock: The Great Taper Caper
- (6/19) Stanley Young: better p-values through randomization in microarrays
- (6/22) What do these share in common: m&ms, limbo stick, ovulation, Dale Carnegie? Sat night potpourri*(recently reblogged)
- (6/26) Why I am not a “dualist” in the sense of Sander Greenland
- (6/29) Palindrome “contest” contest
- (6/30) Blog Contents: mid-year
 Monthly memory lanes began at the blog’s 3-year anniversary in Sept, 2014.
Allan Birnbaum: May 27, 1923- July 1, 1976
Allan Birnbaum died 40 years ago today. He lived to be only 53 [i]. From the perspective of philosophy of statistics and philosophy of science, Birnbaum is best known for his work on likelihood, the Likelihood Principle [ii], and for his attempts to blend concepts of likelihood with error probability ideas to arrive at what he termed “concepts of statistical evidence”. Failing to find adequate concepts of statistical evidence, Birnbaum called for joining the work of “interested statisticians, scientific workers and philosophers and historians of science”–an idea I have heartily endorsed. While known for a result that the (strong) Likelihood Principle followed from sufficiency and conditionality principles (a result that Jimmy Savage deemed one of the greatest breakthroughs in statistics), a few years after publishing it, he turned away from it, perhaps discovering gaps in his argument. A post linking to a 2014 Statistical Science issue discussing Birnbaum’s result is here. Reference  links to the Synthese 1977 volume dedicated to his memory. The editors describe it as their way of “paying homage to Professor Birnbaum’s penetrating and stimulating work on the foundations of statistics”. Ample weekend reading! Continue reading
Professor Richard Gill
It was statistician Richard Gill who first told me about Diederik Stapel (see an earlier post on Diederik). We were at a workshop on Error in the Sciences at Leiden in 2011. I was very lucky to have Gill be assigned as my commentator/presenter—he was excellent! As I was explaining some data problems to him, he suddenly said, “Some people don’t bother to collect data at all!” That’s when I learned about Stapel.
Committees often turn to Gill when someone’s work is up for scrutiny of bad statistics or fraud, or anything in between. Do you think he’s being too easy on researchers when he says, about a given case:
“data has been obtained by some combination of the usual ‘questionable research practices’ [QRPs] which are prevalent in the field in question. Everyone does it this way, in fact, if you don’t, you’d never get anything published. …People are not deliberately cheating: they honestly believe in their theories and believe the data is supporting them.”
Isn’t that the danger in relying on deeply felt background beliefs? Have our attitudes changed (toward QRPs) over the past 3 years (harsher or less harsh)? Here’s a talk of his I blogged 3 years ago (followed by a letter he allowed me to post). I reflect on the pseudoscientific nature of the ‘recovered memories’ program in one of the Geraerts et al. papers in a later post. Continue reading
For entertainment only
In a post 3 years ago (“What do these share in common: m&m’s, limbo stick, ovulation, Dale Carnegie? Sat night potpourri”), I expressed doubts about expending serious effort to debunk the statistical credentials of studies that most readers without any statistical training would regard as “for entertainment only,” dubious, or pseudoscientific quackery. It needn’t even be that the claim is implausible, what’s implausible is that it has been well probed in the experiment at hand. Given the attention being paid to such examples by some leading statisticians, and scores of replication researchers over the past 3 years–attention that has been mostly worthwhile–maybe the bar has been lowered. What do you think? Anyway, this is what I blogged 3 years ago. (Oh, I decided to put in a home-made cartoon!) Continue reading
Right after our session at the SPSP meeting last Friday, I chaired a symposium on replication that included Brian Earp–an active player in replication research in psychology (Replication and Evidence: A tenuous relationship p. 80). One of the first things he said, according to my notes, is that gambits such as cherry picking, p-hacking, hunting for significance, selective reporting, and other QRPs, had
been taught as acceptable become standard practice in psychology, without any special need to adjust p-values or alert the reader to their spuriousness [i]. (He will correct me if I’m wrong.) It shocked me to hear it, even though it shouldn’t have, given what I’ve learned about statistical practice in social science. It was the Report on Stapel that really pulled back the curtain on this attitude toward QRPs in social psychology–as discussed in this blogpost 3 years ago. (If you haven’t read Section 5 of the report on flawed science, you should.) Many of us assumed that QRPs, even if still committed, were at least recognized to be bad statistical practices since the time of Morrison and Henkel’s (1970) Significance Test Controversy. A question now is this: have all the confessions of dirty laundry, the fraudbusting of prominent researchers, the pledges to straighten up and fly right, the years of replication research, done anything to remove the stains? I leave the question open for now. Here’s my “statistical dirty laundry” post from 2013: Continue reading
Here are the slides from our talk at the Society for Philosophy of Science in Practice (SPSP) conference. I covered the first 27, Parker the rest. The abstract is here: