My Rutgers Seminar: tomorrow, December 3, on philosophy of statistics

I’ll be talking about philosophy of statistics tomorrow afternoon at Rutgers University, in the Statistics and Biostatistics Department, if you happen to be in the vicinity and are interested.


Seminar Speaker:     Professor Deborah Mayo, Virginia Tech

Title:           Probing with Severity: Beyond Bayesian Probabilism and Frequentist Performance

Time:          3:20 – 4:20pm, Wednesday, December 3, 2014

Place:         552 Hill Center


Probing with Severity: Beyond Bayesian Probabilism and Frequentist Performance

Getting beyond today’s most pressing controversies revolving around statistical methods, I argue, requires scrutinizing their underlying statistical philosophies. Two main philosophies about the roles of probability in statistical inference are probabilism and performance (in the long-run). The first assumes that we need a method of assigning probabilities to hypotheses; the second assumes that the main function of statistical method is to control long-run performance. I offer a third goal: controlling and evaluating the probativeness of methods. An inductive inference, in this conception, takes the form of inferring hypotheses to the extent that they have been well or severely tested. A report of poorly tested claims must also be part of an adequate inference. I develop a statistical philosophy in which error probabilities of methods may be used to evaluate and control the stringency or severity of tests. I then show how the “severe testing” philosophy clarifies and avoids familiar criticisms and abuses of significance tests and cognate methods (e.g., confidence intervals). Severity may be threatened in three main ways: fallacies of statistical tests, unwarranted links between statistical and substantive claims, and violations of model assumptions.

Categories: Announcement, Statistics | 4 Comments


3 years ago…

MONTHLY MEMORY LANE: 3 years ago: November 2011. I mark in red 3 posts that seem most apt for general background on key issues in this blog.*

  • (11/1) RMM-4: “Foundational Issues in Statistical Modeling: Statistical Model Specification and Validation*” by Aris Spanos, in Rationality, Markets, and Morals (Special Topic: “Statistical Science and Philosophy of Science: Where Do/Should They Meet?”)
  • (11/3) Who is Really Doing the Work?*
  • (11/5) Skeleton Key and Skeletal Points for (Esteemed) Ghost Guest
  • (11/9) Neyman’s Nursery 2: Power and Severity [Continuation of Oct. 22 Post]
  • (11/12) Neyman’s Nursery (NN) 3: SHPOWER vs POWER
  • (11/15) Logic Takes a Bit of a Hit!: (NN 4) Continuing: Shpower (“observed” power) vs Power
  • (11/18) Neyman’s Nursery (NN5): Final Post
  • (11/21) RMM-5: “Low Assumptions, High Dimensions” by Larry Wasserman, in Rationality, Markets, and Morals (Special Topic: “Statistical Science and Philosophy of Science: Where Do/Should They Meet?”) See also my deconstruction of Larry Wasserman.
  • (11/23) Elbar Grease: Return to the Comedy Hour at the Bayesian Retreat
  • (11/28) The UN Charter: double-counting and data snooping
  • (11/29) If you try sometime, you find you get what you need!

*I announced this new, once-a-month feature at the blog’s 3-year anniversary. I will repost and comment on one of the 3-year old posts from time to time. [I’ve yet to repost and comment on the one from Oct. 2011, but will shortly.] For newcomers, here’s your chance to catch up; for old timers, this is philosophy: rereading is essential!


 Oct. 2011

Sept. 2011 (Within “All She Wrote (so far)”)












Categories: 3-year memory lane, Bayesian/frequentist, Statistics | Leave a comment

How likelihoodists exaggerate evidence from statistical tests


I insist on point against point, no matter how much it hurts

Have you ever noticed that some leading advocates of a statistical account, say a testing account A, upon discovering account A is unable to handle a certain kind of important testing problem that a rival testing account, account B, has no trouble at all with, will mount an argument that being able to handle that kind of problem is actually a bad thing? In fact, they might argue that testing account B is not a  “real” testing account because it can handle such a problem? You have? Sure you have, if you read this blog. But that’s only a subliminal point of this post.

I’ve had three posts recently on the Law of Likelihood (LL): Breaking the [LL](a)(b)[c], and [LL] is bankrupt. Please read at least one of them for background. All deal with Royall’s comparative likelihoodist account, which some will say only a few people even use, but I promise you that these same points come up again and again in foundational criticisms from entirely other quarters.[i]

An example from Royall is typical: He makes it clear that an account based on the (LL) is unable to handle composite tests, even simple one-sided tests for which account B supplies uniformly most powerful (UMP) tests. He concludes, not that his test comes up short, but that any genuine test or ‘rule of rejection’ must have a point alternative!  Here’s the case (Royall, 1997, pp. 19-20):

[M]edical researchers are interested in the success probability, θ, associated with a new treatment. They are particularly interested in how θ relates to the old treatment’s success probability, believed to be about 0.2. They have reason to hope θ is considerably greater, perhaps 0.8 or even greater. To obtain evidence about θ, they carry out a study in which the new treatment is given to 17 subjects, and find that it is successful in nine.

Let me interject at this point that of all of Stephen Senn’s posts on this blog, my favorite is the one where he zeroes in on the proper way to think about the discrepancy we hope to find (the .8 in this example). (See note [ii]) Continue reading
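To make the numbers in Royall’s example concrete, here is a minimal sketch of the comparative likelihood calculation, assuming the binomial model the passage describes (17 independent subjects, 9 successes). The function name `binom_lik` is hypothetical, chosen only for illustration; the two point hypotheses, θ = 0.2 and θ = 0.8, are the values mentioned in the quoted passage.

```python
from math import comb

def binom_lik(theta, successes=9, n=17):
    """Binomial likelihood Pr(x; theta) for the observed number of successes."""
    return comb(n, successes) * theta**successes * (1 - theta)**(n - successes)

lik_old = binom_lik(0.2)   # the old treatment's success probability
lik_new = binom_lik(0.8)   # the value the researchers hope for
lr = lik_new / lik_old     # algebraically (0.8/0.2)**9 * (0.2/0.8)**8 = 4

mle = 9 / 17               # the observed proportion, which maximizes the likelihood

print(f"LR(0.8 vs 0.2) = {lr:.2f}")
print(f"LR(MLE vs 0.8) = {binom_lik(mle) / lik_new:.1f}")
```

On a purely comparative reading, the data favor θ = 0.8 over θ = 0.2 by a factor of 4, even though the observed proportion (about 0.53) is far from both; the second line shows that the best-fitting value is in turn favored over 0.8.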

Categories: law of likelihood, Richard Royall, Statistics | Tags: | 18 Comments

Msc Kvetch: “You are a Medical Statistic”, or “How Medical Care Is Being Corrupted”

A NYT op-ed the other day, “How Medical Care Is Being Corrupted” (by Pamela Hartzband and Jerome Groopman, physicians on the faculty of Harvard Medical School), gives a good sum-up of what I fear is becoming the new normal, even under so-called “personalized medicine”.

WHEN we are patients, we want our doctors to make recommendations that are in our best interests as individuals. As physicians, we strive to do the same for our patients.

But financial forces largely hidden from the public are beginning to corrupt care and undermine the bond of trust between doctors and patients. Insurers, hospital networks and regulatory groups have put in place both rewards and punishments that can powerfully influence your doctor’s decisions.

Continue reading

Categories: PhilStat/Med, Statistics | Tags: | 8 Comments

Erich Lehmann: Statistician and Poet

Erich Lehmann 20 November 1917 – 12 September 2009

Memory Lane 1 Year (with update): Today is Erich Lehmann’s birthday. The last time I saw him was at the Second Lehmann conference in 2004, at which I organized a session on philosophical foundations of statistics (including David Freedman and D.R. Cox).

I got to know Lehmann, Neyman’s first student, in 1997.  One day, I received a bulging, six-page, handwritten letter from him in tiny, extremely neat scrawl (and many more after that).  He told me he was sitting in a very large room at an ASA meeting where they were shutting down the conference book display (or maybe they were setting it up), and on a very long, dark table sat just one book, all alone, shiny red.  He said he wondered if it might be of interest to him!  So he walked up to it….  It turned out to be my Error and the Growth of Experimental Knowledge (1996, Chicago), which he reviewed soon after. Some related posts on Lehmann’s letter are here and here.

That same year I remember having a last-minute phone call with Erich to ask how best to respond to a “funny Bayesian example” raised by Colin Howson. It is essentially the case of Mary’s positive result for a disease, where Mary is selected randomly from a population where the disease is very rare. See for example here. (It’s just like the case of our high school student Isaac). His recommendations were extremely illuminating, and with them he sent me a poem he’d written (which you can read in my published response here*). Aside from being a leading statistician, Erich had a (serious) literary bent. Continue reading

Categories: highly probable vs highly probed, phil/history of stat, Sir David Cox, Spanos, Statistics | Tags: , | Leave a comment

Lucien Le Cam: “The Bayesians Hold the Magic”

Today is the birthday of Lucien Le Cam (Nov. 18, 1924-April 25, 2000): Please see my updated 2013 post on him.


Categories: Bayesian/frequentist, Statistics | Leave a comment

Why the Law of Likelihood is bankrupt–as an account of evidence



There was a session at the Philosophy of Science Association meeting last week where two of the speakers, Greg Gandenberger and Jiji Zhang, had insightful things to say about the “Law of Likelihood” (LL)[i]. Recall from recent posts here and here that the (LL) regards data x as evidence supporting H1 over H0 iff

Pr(x; H1) > Pr(x; H0).

On many accounts, the likelihood ratio also measures the strength of that comparative evidence (Royall 1997, p. 3).[ii]

H0 and H1 are statistical hypotheses that assign probabilities to the random variable X taking value x. As I recall, the speakers limited H1 and H0 to simple statistical hypotheses (as Richard Royall generally does)–already restricting the account to rather artificial cases, but I put that to one side. Remember, with likelihoods, the data x are fixed, the hypotheses vary.

1. Maximally likely alternatives. I didn’t really disagree with anything the speakers said. I welcomed their recognition that a central problem facing the (LL) is the ease of constructing maximally likely alternatives: so long as Pr(x; H0) < 1, a maximally likely alternative H1 would be evidentially “favored”. There is no onus on the likelihoodist to predesignate the rival; you are free to search, hunt, post-designate and construct a best (or better) fitting rival. If you’re bothered by this, says Royall, then this just means the evidence disagrees with your prior beliefs.

After all, Royall famously distinguishes between evidence and belief (recall the evidence-belief-action distinction), and these problematic cases, he thinks, do not vitiate his account as an account of evidence. But I think they do! In fact, I think they render the (LL) utterly bankrupt as an account of evidence. Here are a few reasons. (Let me be clear that I am not pinning Royall’s defense on the speakers[iii], so much as saying it came up in the general discussion[iv].) Continue reading
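The ease of constructing a maximally likely alternative can be seen in a toy case. The sketch below is only an illustration under assumed inputs: the coin-toss sequence is made-up data, and `lik_fair` and `lik_rigged` are hypothetical names. H0 says the tosses are fair and independent; H1 is built after seeing the data and says that exactly the observed sequence was certain to occur.

```python
# Hypothetical data: one particular sequence of 10 coin tosses.
x = "HTTHHHTHTT"

def lik_fair(seq):
    """Pr(seq; H0), where H0 says tosses are independent with Pr(H) = 0.5."""
    return 0.5 ** len(seq)

def lik_rigged(seq, observed):
    """Pr(seq; H1), where H1 (constructed post-data) gives `observed` probability 1."""
    return 1.0 if seq == observed else 0.0

lik_h0 = lik_fair(x)
lik_h1 = lik_rigged(x, observed=x)

print(lik_h1 > lik_h0)   # the (LL) condition Pr(x; H1) > Pr(x; H0) holds
print(lik_h1 / lik_h0)   # likelihood ratio: 1024.0
```

Whatever sequence turns up, the same construction yields a rival that is “favored” over the fair-coin hypothesis, which is why the lack of any requirement to predesignate the rival matters.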

Categories: highly probable vs highly probed, law of likelihood, Richard Royall, Statistics | 62 Comments

A biased report of the probability of a statistical fluke: Is it cheating?

One year ago I reblogged a post from Matt Strassler, “Nature is Full of Surprises” (2011). In it he claims that

[Statistical debate] “often boils down to this: is the question that you have asked in applying your statistical method the most even-handed, the most open-minded, the most unbiased question that you could possibly ask?

It’s not asking whether someone made a mathematical mistake. It is asking whether they cheated — whether they adjusted the rules unfairly — and biased the answer through the question they chose…”

(Nov. 2014): I am impressed (i.e., struck by the fact) that he goes so far as to call it “cheating”. Anyway, here is the rest of the reblog from Strassler which bears on a number of recent discussions:

“…If there are 23 people in a room, the chance that two of them have the same birthday is 50 percent, while the chance that two of them were born on a particular day, say, January 1st, is quite low, a small fraction of a percent. The more you specify the coincidence, the rarer it is; the broader the range of coincidences at which you are ready to express surprise, the more likely it is that one will turn up.
Continue reading
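The two birthday probabilities in Strassler’s example can be checked with a short calculation. This is a sketch under the usual simplifying assumptions, which are not stated in the quoted text: 365 equally likely birthdays, independent across people, and “two of them were born on a particular day” read as “at least two”. The function names are hypothetical.

```python
from math import prod

def p_any_shared_birthday(n, days=365):
    """Probability that at least two of n people share some birthday."""
    p_all_distinct = prod((days - i) / days for i in range(n))
    return 1 - p_all_distinct

def p_two_on_given_day(n, days=365):
    """Probability that at least two of n people were born on one prespecified day."""
    q = 1 / days
    p_none = (1 - q) ** n
    p_exactly_one = n * q * (1 - q) ** (n - 1)
    return 1 - p_none - p_exactly_one

print(f"any shared birthday, n=23: {p_any_shared_birthday(23):.3f}")   # about 0.507
print(f"two born on Jan 1,   n=23: {p_two_on_given_day(23):.4f}")      # about 0.0018
```

The unspecified coincidence is roughly 280 times more probable than the prespecified one, which is the contrast the quotation is drawing.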

Categories: Higgs, spurious p values, Statistics | 7 Comments

The Amazing Randi’s Million Dollar Challenge

The NY Times Magazine had a feature on the Amazing Randi yesterday, “The Unbelievable Skepticism of the Amazing Randi.” It described one of the contestants in Randi’s most recent Million Dollar Challenge, Fei Wang:

“[Wang] claimed to have a peculiar talent: from his right hand, he could transmit a mysterious force a distance of three feet, unhindered by wood, metal, plastic or cardboard. The energy, he said, could be felt by others as heat, pressure, magnetism or simply “an indescribable change.” Tonight, if he could demonstrate the existence of his ability under scientific test conditions, he stood to win $1 million.”

Isn’t “an indescribable change” rather vague?

…..The Challenge organizers had spent weeks negotiating with Wang and fine-tuning the protocol for the evening’s test. A succession of nine blindfolded subjects would come onstage and place their hands in a cardboard box. From behind a curtain, Wang would transmit his energy into the box. If the subjects could successfully detect Wang’s energy on eight out of nine occasions, the trial would confirm Wang’s psychic power. …”

After two women failed to detect the “mystic force” the M.C. announced the contest was over.

“With two failures in a row, it was impossible for Wang to succeed. The Million Dollar Challenge was already over.”
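The arithmetic behind the protocol can be sketched as follows. The article excerpt does not say what a subject’s chance of a “successful detection” would be under pure guessing, so the value p = 0.5 below is purely an assumption for illustration, as are the function names.

```python
from math import comb

def p_pass_by_chance(n=9, needed=8, p=0.5):
    """Probability of at least `needed` successes in n independent trials,
    each succeeding with probability p (p = 0.5 is an assumed guessing rate)."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(needed, n + 1))

def can_still_pass(failures_so_far, n=9, needed=8):
    """True if the required number of successes is still reachable."""
    return n - failures_so_far >= needed

print(f"{p_pass_by_chance():.4f}")   # 10/512, about 0.0195 under the assumed p
print(can_still_pass(1))             # True: 8 of 9 is still possible
print(can_still_pass(2))             # False: at most 7 of 9 remain
```

Under the assumed guessing rate, the eight-of-nine criterion would be met by chance about 2% of the time, and a second failure makes it unreachable, which is why the M.C. could stop the test early.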

You’d think they might have given him another chance or something.

“Stepping out from behind the curtain, Wang stood center stage, wearing an expression of numb shock, like a toddler who has just dropped his ice cream in the sand. He was at a loss to explain what had gone wrong; his tests with a paranormal society in Boston had all succeeded. Nothing could convince him that he didn’t possess supernatural powers. ‘This energy is mysterious,’ he told the audience. ‘It is not God.’ He said he would be back in a year, to try again.”

The article is here. If you don’t know who A. Randi is, you should read it.

Randi, much better known during Uri Geller spoon-bending days, has long been the guru to skeptics and fraudbusters, but also a hero to some critical psi believers like I.J. Good. Geller continually sued Randi for calling him a fraud. As such, I.J. Good warned me that I might be taking a risk in my use of “gellerization” in EGEK (1996), but I guess Geller doesn’t read philosophy of science. A post on “Statistics and ESP Research” and Diaconis is here.


I’d love to have seen Randi break out of these chains!


Categories: Error Statistics | Tags: | 3 Comments

“Statistical Flukes, the Higgs Discovery, and 5 Sigma” at the PSA

We had an excellent discussion at our symposium yesterday: “How Many Sigmas to Discovery? Philosophy and Statistics in the Higgs Experiments” with Robert Cousins, Allan Franklin and Kent Staley. Slides from my presentation, “Statistical Flukes, the Higgs Discovery, and 5 Sigma” are posted below (we each only had 20 minutes, so this is clipped, but much came out in the discussion). Even the challenge I read about this morning as to what exactly the Higgs researchers discovered (and I’ve no clue if there’s anything to the idea of a “techni-higgs particle”) would not invalidate* the knowledge of the experimental effects severely tested.


*Although, as always, there may be a reinterpretation of the results. But I think the article is an isolated bit of speculation. I’ll update if I hear more.

Categories: Higgs, highly probable vs highly probed, Statistics | 26 Comments

Philosophy of Science Assoc. (PSA) symposium on Philosophy of Statistics in the Higgs Experiments “How Many Sigmas to Discovery?”



The biennial meeting of the Philosophy of Science Association (PSA) starts this week (Nov. 6-9) in Chicago, together with the History of Science Society. I’ll be part of the symposium:


How Many Sigmas to Discovery?
Philosophy and Statistics in the Higgs Experiments


on Nov. 8 with Robert Cousins, Allan Franklin, and Kent Staley. If you’re in the neighborhood, stop by.



“A 5 sigma effect!” is how the recent Higgs boson discovery was reported. Yet before the dust had settled, the very nature and rationale of the 5 sigma (or 5 standard deviation) discovery criteria began to be challenged and debated both among scientists and in the popular press. Why 5 sigma? How is it to be interpreted? Do p-values in high-energy physics (HEP) avoid controversial uses and misuses of p-values in social and other sciences? The goal of our symposium is to combine the insights of philosophers and scientists whose work interrelates philosophy of statistics, data analysis and modeling in experimental physics, with critical perspectives on how discoveries proceed in practice. Our contributions will link questions about the nature of statistical evidence, inference, and discovery with questions about the very creation of standards for interpreting and communicating statistical experiments. We will bring out some unique aspects of discovery in modern HEP. We also show the illumination the episode offers to some of the thorniest issues revolving around statistical inference, frequentist and Bayesian methods, and the philosophical, technical, social, and historical dimensions of scientific discovery.


1) How do philosophical problems of statistical inference interrelate with debates about inference and modeling in high energy physics (HEP)?

2) Have standards for scientific discovery in particle physics shifted? And if so, how has this influenced when a new phenomenon is “found”?

3) Can understanding the roles of statistical hypothesis tests in HEP resolve classic problems about their justification in both physical and social sciences?

4) How do pragmatic, epistemic and non-epistemic values and risks influence the collection, modeling, and interpretation of data in HEP?


Abstracts for Individual Presentations

(1) Unresolved Philosophical Issues Regarding Hypothesis Testing in High Energy Physics
Robert D. Cousins.
Professor, Department of Physics and Astronomy, University of California, Los Angeles (UCLA)

The discovery and characterization of a Higgs boson in 2012-2013 provide multiple examples of statistical inference as practiced in high energy physics (elementary particle physics). The main methods employed have a decidedly frequentist flavor, drawing in a pragmatic way on both Fisher’s ideas and the Neyman-Pearson approach. A physics model being tested typically has a “law of nature” at its core, with parameters of interest representing masses, interaction strengths, and other presumed “constants of nature”. Additional “nuisance parameters” are needed to characterize the complicated measurement processes. The construction of confidence intervals for a parameter of interest θ is dual to hypothesis testing, in that the test of the null hypothesis θ=θ0 at significance level (“size”) α is equivalent to whether θ0 is contained in a confidence interval for θ with confidence level (CL) equal to 1-α. With CL or α specified in advance (“pre-data”), frequentist coverage properties can be assured, at least approximately, although nuisance parameters bring in significant complications. With data in hand, the post-data p-value can be defined as the smallest significance level α at which the null hypothesis would be rejected, had that α been specified in advance. Carefully calculated p-values (not assuming normality) are mapped onto the equivalent number of standard deviations (“σ”) in a one-tailed test of the mean of a normal distribution. For a discovery such as the Higgs boson, experimenters report both p-values and confidence intervals of interest. Continue reading
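The mapping between a p-value and an equivalent number of standard deviations in a one-tailed normal test, as described in the abstract, can be sketched with the Python standard library. This is only an illustration of the conversion itself; the function names are hypothetical, and nothing here reflects how the experiments actually computed their p-values.

```python
from math import erfc, sqrt
from statistics import NormalDist

def p_from_sigma(z):
    """One-tailed p-value: upper-tail area of a standard normal beyond z."""
    return 0.5 * erfc(z / sqrt(2))

def sigma_from_p(p):
    """Equivalent number of standard deviations for a one-tailed p-value.
    Uses the lower tail and symmetry to keep precision for very small p."""
    return -NormalDist().inv_cdf(p)

for z in (2, 3, 5):
    print(f"{z} sigma -> p = {p_from_sigma(z):.3g}")

print(f"p = 0.05 -> {sigma_from_p(0.05):.3f} sigma")
```

On this convention a 5 sigma effect corresponds to a one-tailed p-value of roughly 3 in 10 million, while the familiar 0.05 level corresponds to only about 1.64 standard deviations.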

Categories: Error Statistics, Higgs, P-values | Tags: | 18 Comments

Oxford Gaol: Statistical Bogeymen

Memory Lane: 3 years ago. Oxford Jail (also called Oxford Castle) is an entirely fitting place to be on (and around) Halloween! Moreover, rooting around this rather lavish set of jail cells (what used to be a single cell is now a dressing room) is every bit as conducive to philosophical reflection as is exile on Elba! (It is now a boutique hotel, though many of the rooms are still too jail-like for me.)  My goal (while in this gaol—as the English sometimes spell it) is to try and free us from the bogeymen and bogeywomen often associated with “classical” statistics. As a start, the very term “classical statistics” should, I think, be shelved, not that names should matter.

In appraising statistical accounts at the foundational level, we need to realize the extent to which accounts are viewed through the eyeholes of a mask or philosophical theory. Moreover, the mask some wear while pursuing this task might well be at odds with their ordinary way of looking at evidence, inference, and learning. In any event, to avoid question-begging criticisms, the standpoint from which the appraisal is launched must itself be independently defended. But for (most) Bayesian critics of error statistics, the assumption that uncertain inference demands a posterior probability for claims inferred is thought to be so obvious as not to require support. Critics are implicitly making assumptions that are at odds with the frequentist statistical philosophy. In particular, they assume a certain philosophy about statistical inference (probabilism), often coupled with the allegation that error statistical methods can only achieve radical behavioristic goals, wherein all that matters are long-run error rates (of some sort).

Criticisms then follow readily, in the form of one or both:

  • Error probabilities do not supply posterior probabilities in hypotheses; interpreted as if they do (and some say we just can’t help it), they lead to inconsistencies.
  • Methods with good long-run error rates can give rise to counterintuitive inferences in particular cases.

I have proposed an alternative philosophy that replaces these tenets with different ones:

  • The role of probability in inference is to quantify how reliably or severely claims (or discrepancies from claims) have been tested.
  • The severity goal directs us to the relevant error probabilities, avoiding the oft-repeated statistical fallacies due to tests that are overly sensitive, as well as those insufficiently sensitive to particular errors.
  • Control of long-run error probabilities, while necessary, is not sufficient for good tests or warranted inferences.

Continue reading

Categories: 3-year memory lane, Bayesian/frequentist, Philosophy of Statistics, Statistics | Tags: , | 30 Comments

To Quarantine or not to Quarantine?: Science & Policy in the time of Ebola



 Bioethicist Arthur Caplan gives “7 Reasons Ebola Quarantine Is a Bad, Bad Idea”. I’m interested to know what readers think (I claim no expertise in this area.) My occasional comments are in red. 

“Bioethicist: 7 Reasons Ebola Quarantine Is a Bad, Bad Idea”

In the fight against Ebola some government officials in the U.S. are now managing fear, not the virus. Quarantines have been declared in New York, New Jersey and Illinois. In Connecticut, nine people are in quarantine: two students at Yale; a worker from AmeriCARES; and a West African family.

Many others are or soon will be.

Quarantining those who do not have symptoms is not the way to combat Ebola. In fact it will only make matters worse. Far worse. Why?

  1. Quarantining people without symptoms makes no scientific sense.

They are not infectious. The only way to get Ebola is to have someone vomit on you, bleed on you, share spit with you, have sex with you or get fecal matter on you when they have a high viral load.

How do we know this?

Because there is data going back to 1975 from outbreaks in the Congo, Uganda, Sudan, Gabon, Ivory Coast, South Africa, not to mention current experience in the United States, Spain and other nations.

The list of “the only way to get Ebola” does not suggest it is so extraordinarily difficult to transmit as to imply the policy “makes no scientific sense”. That there is “data going back to 1975” doesn’t tell us how it was analyzed. They may not be infectious today, but…

  2. Quarantine is next to impossible to enforce.

If you don’t want to stay in your home or wherever you are supposed to stay for three weeks, then what? Do we shoot you, Taser you, drag you back into your house in a protective suit, or what?

And who is responsible for watching you 24-7? Quarantine relies on the honor system. That essentially is what we count on when we tell people with symptoms to call 911 or the health department.

It does appear that this hasn’t been well thought through yet. NY Governor Cuomo said that “Doctors Without Borders”, the group that sponsors many of the volunteers, already requires volunteers to “decompress” for three weeks upon return from Africa, and they compensate their doctors during this time (see the above link). The state of NY would fill in for those sponsoring groups that do not offer compensation (at least in NY). Is the existing 3 week decompression period already a clue that they want people cleared before they return to work? Continue reading

Categories: science communication | Tags: | 49 Comments


3 years ago…

MONTHLY MEMORY LANE: 3 years ago: October 2011 (I mark in red 3 posts that seem most apt for general background on key issues in this blog*)

*I indicated I’d begin this new, once-a-month feature at the 3-year anniversary. I will repost and comment on one each month. (I might repost others that I do not comment on, as Oct. 31, 2014). For newcomers, here’s your chance to catch-up; for old timers, this is philosophy: rereading is essential!

Categories: 3-year memory lane, blog contents, Statistics | Leave a comment

September 2014: Blog Contents

September 2014: Error Statistics Philosophy
Blog Table of Contents 

Compiled by Jean A. Miller

  • (9/30) Letter from George (Barnard)
  • (9/27) Should a “Fictionfactory” peepshow be barred from a festival on “Truth and Reality”? Diederik Stapel says no (rejected post)
  • (9/23) G.A. Barnard: The Bayesian “catch-all” factor: probability vs likelihood
  • (9/21) Statistical Theater of the Absurd: “Stat on a Hot Tin Roof”
  • (9/18) Uncle Sam wants YOU to help with scientific reproducibility!
  • (9/15) A crucial missing piece in the Pistorius trial? (2): my answer (Rejected Post)
  • (9/12) “The Supernal Powers Withhold Their Hands And Let Me Alone”: C.S. Peirce
  • (9/6) Statistical Science: The Likelihood Principle issue is out…!
  • (9/4) All She Wrote (so far): Error Statistics Philosophy Contents-3 years on
  • (9/3) 3 in blog years: Sept 3 is 3rd anniversary of





Categories: Announcement, blog contents, Statistics | Leave a comment

PhilStat/Law: Nathan Schachtman: Acknowledging Multiple Comparisons in Statistical Analysis: Courts Can and Must



The following is from Nathan Schachtman’s legal blog, with various comments and added emphases (by me, in this color). He will try to reply to comments/queries.

“Courts Can and Must Acknowledge Multiple Comparisons in Statistical Analyses”

Nathan Schachtman, Esq., PC * October 14th, 2014

In excluding the proffered testimony of Dr. Anick Bérard, a Canadian perinatal epidemiologist in the Université de Montréal, the Zoloft MDL trial court discussed several methodological shortcomings and failures, including Bérard’s reliance upon claims of statistical significance from studies that conducted dozens and hundreds of multiple comparisons.[i] The Zoloft MDL court was not the first court to recognize the problem of over-interpreting the putative statistical significance of results that were one among many statistical tests in a single study. The court was, however, among a fairly small group of judges who have shown the needed statistical acumen in looking beyond the reported p-value or confidence interval to the actual methods used in a study[1].



A complete and fair evaluation of the evidence in situations as occurred in the Zoloft birth defects epidemiology required more than the presentation of the size of the random error, or the width of the 95 percent confidence interval.  When the sample estimate arises from a study with multiple testing, presenting the sample estimate with the confidence interval, or p-value, can be highly misleading if the p-value is used for hypothesis testing.  The fact of multiple testing will inflate the false-positive error rate. Dr. Bérard ignored the context of the studies she relied upon. What was noteworthy is that Bérard encountered a federal judge who adhered to the assigned task of evaluating methodology and its relationship with conclusions.

*   *   *   *   *   *   *

There is no unique solution to the problem of multiple comparisons. Some researchers use Bonferroni or other quantitative adjustments to p-values or confidence intervals, whereas others reject adjustments in favor of qualitative assessments of the data in the full context of the study and its methods. See, e.g., Kenneth J. Rothman, “No Adjustments Are Needed For Multiple Comparisons,” 1 Epidemiology 43 (1990) (arguing that adjustments mechanize and trivialize the problem of interpreting multiple comparisons). Two things are clear from Professor Rothman’s analysis. First for someone intent upon strict statistical significance testing, the presence of multiple comparisons means that the rejection of the null hypothesis cannot be done without further consideration of the nature and extent of both the disclosed and undisclosed statistical testing. Rothman, of course, has inveighed against strict significance testing under any circumstance, but the multiple testing would only compound the problem.

Second, although failure to adjust p-values or intervals quantitatively may be acceptable, failure to acknowledge the multiple testing is poor statistical practice. The practice is, alas, too prevalent for anyone to say that ignoring multiple testing is fraudulent, and the Zoloft MDL court certainly did not condemn Dr. Bérard as a fraudfeasor[2]. [emphasis mine]

I’m perplexed by this mixture of stances. If you don’t mention the multiple testing for which it is acceptable not to adjust, then you’re guilty of poor statistical practice; but it’s “too prevalent for anyone to say that ignoring multiple testing is fraudulent”. This appears to claim it’s poor statistical practice if you fail to mention your results are due to multiple testing, but “ignoring multiple testing” (which could mean failing to adjust or, more likely, failing to mention it) is not fraudulent. Perhaps it’s a questionable research practice (QRP). It’s back to “50 shades of grey between QRPs and fraud.”

  […read his full blogpost here]

Previous cases have also acknowledged the multiple testing problem. In litigation claims for compensation for brain tumors for cell phone use, plaintiffs’ expert witness relied upon subgroup analysis, which added to the number of tests conducted within the epidemiologic study at issue. Newman v. Motorola, Inc., 218 F. Supp. 2d 769, 779 (D. Md. 2002), aff’d, 78 Fed. App’x 292 (4th Cir. 2003). The trial court explained:

“[Plaintiff’s expert] puts overdue emphasis on the positive findings for isolated subgroups of tumors. As Dr. Stampfer explained, it is not good scientific methodology to highlight certain elevated subgroups as significant findings without having earlier enunciated a hypothesis to look for or explain particular patterns, such as dose-response effect. In addition, when there is a high number of subgroup comparisons, at least some will show a statistical significance by chance alone.”

I’m going to require, as part of its meaning, that a statistically significant difference not be one due to “chance variability” alone. Then, to avoid self-contradiction, this last sentence might be put as follows: “when there is a high number of subgroup comparisons, at least some will show purported or nominal or unaudited statistical significance by chance alone. [Which term do readers prefer?] If one hunts down one’s hypothesized comparison in the data, then the actual p-value will not equal, and will generally be greater than, the nominal or unaudited p-value.”

So, I will insert “nominal” where needed below (in red).

Texas Sharpshooter fallacy

Id. And shortly after the Supreme Court decided Daubert, the Tenth Circuit faced the reality of data dredging in litigation, and its effect on the meaning of “significance”:

“Even if the elevated levels of lung cancer for men had been [nominally] statistically significant a court might well take account of the statistical “Texas Sharpshooter” fallacy in which a person shoots bullets at the side of a barn, then, after the fact, finds a cluster of holes and draws a circle around it to show how accurate his aim was. With eight kinds of cancer for each sex there would be sixteen potential categories here around which to “draw a circle” to show a [nominally] statistically significant level of cancer. With independent variables one would expect one statistically significant reading in every twenty categories at a 95% confidence level purely by random chance.”
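The arithmetic behind the court’s remark can be made explicit. What follows is only a minimal sketch, assuming (as the court does) that the sixteen categories behave like independent tests with every null hypothesis true, each run at a nominal 0.05 level. The function names are illustrative and do not come from the opinion.

```python
def expected_nominal_hits(k: int, alpha: float) -> float:
    """Expected number of nominally significant results among k true nulls."""
    return k * alpha


def prob_at_least_one_hit(k: int, alpha: float) -> float:
    """Chance that at least one of k independent true-null tests reaches level alpha."""
    return 1.0 - (1.0 - alpha) ** k


if __name__ == "__main__":
    k, alpha = 16, 0.05  # eight kinds of cancer for each sex, nominal 5% level
    print(f"expected nominal hits: {expected_nominal_hits(k, alpha):.2f}")
    print(f"P(at least one hit):   {prob_at_least_one_hit(k, alpha):.3f}")
```

Under these assumptions one expects about 0.8 nominally significant categories, and the chance of finding at least one to draw a circle around is roughly 0.56, not 0.05.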

The Texas sharpshooter fallacy is one of my all-time favorites. One purports to be testing the accuracy of his aim, when in fact that is not the process that gave rise to the impressive-looking (nominal) cluster of hits. The results do not warrant inferences about his ability to accurately hit a target, since that hasn’t been well-probed. Continue reading

Categories: P-values, PhilStat Law, Statistics | 12 Comments

Gelman recognizes his error-statistical (Bayesian) foundations


From Gelman’s blog:

“In one of life’s horrible ironies, I wrote a paper “Why we (usually) don’t have to worry about multiple comparisons” but now I spend lots of time worrying about multiple comparisons”

Exhibit A: [2012] Why we (usually) don’t have to worry about multiple comparisons. Journal of Research on Educational Effectiveness 5, 189-211. (Andrew Gelman, Jennifer Hill, and Masanao Yajima)

Exhibit B: The garden of forking paths: Why multiple comparisons can be a problem, even when there is no “fishing expedition” or “p-hacking” and the research hypothesis was posited ahead of time, in press. (Andrew Gelman and Eric Loken) (Shortened version is here.)


The “forking paths” paper, in my reading, basically argues that mere hypothetical possibilities about what you would or might have done had the data been different (in order to secure a desired interpretation) suffice to alter the characteristics of the analysis you actually did. That’s an error statistical argument–maybe even stronger than what some error statisticians would say. What’s really being condemned are overly flexible ways to move from statistical results to substantive claims. The p-values are illicit when taken to provide evidence for those claims because an actual p-value requires Prob(P < p;Ho) = p (and the actual p-value has become much greater by design). The criticism makes perfect sense if you’re scrutinizing inferences according to how well or severely tested they are. Actual error probabilities are accordingly altered or unable to be calculated. However, if one is going to scrutinize inferences according to severity then the same problematic flexibility would apply to Bayesian analyses, whether or not they have a way to pick up on it. (It’s problematic if they don’t.) I don’t see the magic by which a concern for multiple testing disappears in Bayesian analysis (e.g., in the first paper) except by assuming some prior takes care of it.
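The requirement Prob(P < p;Ho) = p can be checked by simulation in a deliberately simple case. The sketch below is an illustration under my own assumptions, not the setup of the forking paths paper: it uses explicit selection of the best of k independent one-sided z-tests with every null true, which is a cruder form of flexibility than the hypothetical paths discussed above. The names `p_value` and `rejection_rates` are hypothetical.

```python
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal, used for one-sided z-test p-values


def p_value(z: float) -> float:
    """One-sided p-value for an observed standard-normal test statistic z."""
    return 1.0 - STD_NORMAL.cdf(z)


def rejection_rates(k: int, alpha: float, trials: int, seed: int = 0):
    """Estimate Prob(P <= alpha; H0) for a prespecified test and for the
    smallest of k p-values, with every null hypothesis true."""
    rng = random.Random(seed)
    prespecified = hunted = 0
    for _ in range(trials):
        pvals = [p_value(rng.gauss(0.0, 1.0)) for _ in range(k)]
        prespecified += pvals[0] <= alpha  # comparison fixed in advance
        hunted += min(pvals) <= alpha      # comparison chosen after looking
    return prespecified / trials, hunted / trials


if __name__ == "__main__":
    fixed, selected = rejection_rates(k=20, alpha=0.05, trials=20_000)
    print(f"prespecified test: {fixed:.3f}")
    print(f"best of 20 tests:  {selected:.3f}")
```

In this toy setting the prespecified test should reject close to 5% of the time, while the hunted comparison reports a nominal p-value below 0.05 in roughly 1 - 0.95^20, or about 64%, of the trials. The nominal figure no longer describes the procedure actually followed.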

See my comment here.

Categories: Error Statistics, Gelman | 17 Comments

BREAKING THE (Royall) LAW! (of likelihood) (C)



With this post, I finally get back to the promised sequel to “Breaking the Law! (of likelihood) (A) and (B)” from a few weeks ago. You might wish to read that one first.* A relevant paper by Royall is here.

Richard Royall is a statistician[1] who has had a deep impact on recent philosophy of statistics by giving a neat proposal that appears to settle disagreements about statistical philosophy! He distinguishes three questions:

  • What should I believe?
  • How should I act?
  • Is this data evidence of some claim? (or How should I interpret this body of observations as evidence?)

It all sounds quite sensible– at first–and, impressively, many statisticians and philosophers of different persuasions have bought into it. At least they appear willing to go this far with him on the 3 questions.

How is each question to be answered? According to Royall’s commandments writings, what to believe is captured by Bayesian posteriors; how to act, by a behavioristic, N-P long-run performance. And what method answers the evidential question? A comparative likelihood approach. You may want to reject all of them (as I do),[2] but just focus on the last.

Remember, with likelihoods, the data x are fixed; the hypotheses vary. A great many critical discussions of frequentist error statistical inference (significance tests, confidence intervals, p-values, power, etc.) start with “the law”. But I fail to see why we should obey it.
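For readers who want to see what a comparative likelihood report looks like, here is a minimal sketch. It assumes the usual statement of the law of likelihood (x favors H over H’ when the likelihood of H given x exceeds that of H’, with the ratio measuring the strength of the comparison), and it assumes a binomial model with made-up data. The numbers and function names are illustrative only and are not taken from Royall.

```python
from math import comb


def binomial_likelihood(theta: float, successes: int, n: int) -> float:
    """Likelihood of success probability theta, with the data (successes, n) held fixed."""
    return comb(n, successes) * theta**successes * (1.0 - theta) ** (n - successes)


def likelihood_ratio(theta_a: float, theta_b: float, successes: int, n: int) -> float:
    """Lik(theta_a; x) / Lik(theta_b; x) for the same fixed data x."""
    return (binomial_likelihood(theta_a, successes, n)
            / binomial_likelihood(theta_b, successes, n))


if __name__ == "__main__":
    successes, n = 12, 20  # hypothetical data: 12 successes in 20 trials
    # The data stay fixed while the hypotheses being compared vary.
    for theta_a, theta_b in [(0.6, 0.5), (0.6, 0.7), (0.5, 0.7)]:
        lr = likelihood_ratio(theta_a, theta_b, successes, n)
        print(f"LR({theta_a} vs {theta_b}) = {lr:.3f}")
```

The output is a set of pairwise ratios and nothing more; no error probabilities attach to them, which is one way of seeing the complaint that follows.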

To begin with, a report of comparative likelihoods isn’t very useful: H might be less likely than H’, given x, but so what? What do I do with that information? It doesn’t tell me I have evidence against or for either.[3] Recall, as well, Hacking’s points here about the variability in the meanings of a likelihood ratio across problems. Continue reading

Categories: law of likelihood, Richard Royall, Statistics | 41 Comments

A (Jan 14, 2014) interview with Sir David Cox by “Statistics Views”

Sir David Cox

The original Statistics Views interview is here:

“I would like to think of myself as a scientist, who happens largely to specialise in the use of statistics”– An interview with Sir David Cox


  • Author: Statistics Views
  • Date: 24 Jan 2014
  • Copyright: Image appears courtesy of Sir David Cox

Sir David Cox is arguably one of the world’s leading living statisticians. He has made pioneering and important contributions to numerous areas of statistics and applied probability over the years, of which perhaps the best known is the proportional hazards model, which is widely used in the analysis of survival data. The Cox point process was named after him.

Sir David studied mathematics at St John’s College, Cambridge and obtained his PhD from the University of Leeds in 1949. He was employed from 1944 to 1946 at the Royal Aircraft Establishment, from 1946 to 1950 at the Wool Industries Research Association in Leeds, and from 1950 to 1955 worked at the Statistical Laboratory at the University of Cambridge. From 1956 to 1966 he was Reader and then Professor of Statistics at Birkbeck College, London. In 1966, he took up the Chair position in Statistics at Imperial College London where he later became Head of the Department of Mathematics for a period. In 1988 he became Warden of Nuffield College and was a member of the Department of Statistics at Oxford University. He formally retired from these positions in 1994 but continues to work in Oxford.

Sir David has received numerous awards and honours over the years. He has been awarded the Guy Medals in Silver (1961) and Gold (1973) by the Royal Statistical Society. He was elected Fellow of the Royal Society of London in 1973, was knighted in 1985 and became an Honorary Fellow of the British Academy in 2000. He is a Foreign Associate of the US National Academy of Sciences and a foreign member of the Royal Danish Academy of Sciences and Letters. In 1990 he won the Kettering Prize and Gold Medal for Cancer Research for “the development of the Proportional Hazard Regression Model” and in 2010 he was awarded the Copley Medal by the Royal Society.

He has supervised and collaborated with many students over the years, many of whom are now successful in statistics in their own right, such as David Hinkley and Past President of the Royal Statistical Society, Valerie Isham. Sir David has served as President of the Bernoulli Society, Royal Statistical Society, and the International Statistical Institute.

This year, Sir David is to turn 90*. Here Statistics Views talks to Sir David about his prestigious career in statistics, working with the late Professor Lindley, his thoughts on Jeffreys and Fisher, being President of the Royal Statistical Society during the Thatcher Years, Big Data and the best time of day to think of statistical methods.

1. With an educational background in mathematics at St John’s College, Cambridge and the University of Leeds, when and how did you first become aware of statistics as a discipline?

I was studying at Cambridge during the Second World War and after two years, one was sent either into the Forces or into some kind of military research establishment. There were very few statisticians then, although it was realised there was a need for statisticians. It was assumed that anybody who was doing reasonably well at mathematics could pick up statistics in a week or so! So, aged 20, I went to the Royal Aircraft Establishment in Farnborough, which is enormous and still there to this day if in a different form, and I worked in the Department of Structural and Mechanical Engineering, doing statistical work. So statistics was forced upon me, so to speak, as was the case for many mathematicians at the time because, aside from UCL, there had been very little teaching of statistics in British universities before the Second World War. Afterwards, it all started to expand.

2. From 1944 to 1946 you worked at the Royal Aircraft Establishment and then from 1946 to 1950 at the Wool Industries Research Association in Leeds. Did statistics have any role to play in your first roles out of university?

Totally. In Leeds, it was largely statistics but also to some extent, applied mathematics because there were all sorts of problems connected with the wool and textile industry in terms of the physics, chemistry and biology of the wool and some of these problems were mathematical but the great majority had a statistical component to them. That experience was not totally uncommon at the time and many who became academic statisticians had, in fact, spent several years working in a research institute first.

3. From 1950 to 1955, you worked at the Statistical Laboratory at Cambridge and would have been there at the same time as Fisher and Jeffreys. The late Professor Dennis Lindley, who was also there at that time, told me that the best people working on statistics were not in the statistics department at that time. What are your memories when you look back on that time and what do you feel were your main achievements?

Lindley was exactly right about Jeffreys and Fisher. They were two great scientists outside statistics – Jeffreys founded modern geophysics and Fisher was a major figure in genetics. Dennis was a contemporary and very impressive and effective. We were colleagues for five years and our children even played together.

The first lectures on statistics I attended as a student consisted of a short course by Harold Jeffreys who had at the time a massive reputation as virtually the inventor of modern geophysics. His Theory of Probability, published first as a monograph in physics, was and remains of great importance but, amongst other things, his nervousness limited the appeal of his lectures, to put it gently. I met him personally a couple of times – he was friendly but uncommunicative. When I was later at the Statistical Laboratory in Cambridge, relations between the Director, Dr Wishart and R.A. Fisher had been at a very low ebb for 20 years and contact between the Lab and Fisher was minimal. I heard him speak on three or four occasions, interesting if often rambunctious occasions. To some, Fisher showed great generosity but not to the Statistics Lab, which was sad in view of the towering importance of his work.

Continue reading

Categories: Sir David Cox | 3 Comments

Diederik Stapel hired to teach “social philosophy” because students got tired of success stories… or something (rejected post)

Oh My*.

(“But I can succeed as a social philosopher”)

The following is from Retraction Watch. UPDATE: OCT 10, 2014**

Diederik Stapel, the Dutch social psychologist and admitted data fabricator — and owner of 54 retraction notices — is now teaching at a college in the town of Tilburg [i].

According to Omroep Brabant, Stapel was offered the job as a kind of adjunct at Fontys Academy for Creative Industries to teach social philosophy. The site quotes a Nick Welman explaining the rationale for hiring Stapel (per Google Translate):

“It came about because students one after another success story were told from the entertainment industry, the industry which we educate them .”

The students wanted something different.

“They wanted to also focus on careers that have failed. On people who have fallen into a black hole, acquainted with the dark side of fame and success.”

Last month, organizers of a drama festival in The Netherlands cancelled a play co-written by Stapel.

I really think Dean Bon puts the rationale most clearly of all.

…A letter from the school’s dean, Pieter Bon, adds:

We like to be entertained and the length of our lives increases. We seek new ways in which to improve our health and we constantly look for new ways to fill our free time. Fashion and looks are important to us; we prefer sustainable products and we like to play games using smart gadgets. This is why Fontys Academy for Creative Industries exists. We train people to create beautiful concepts, exciting concepts, touching concepts, concepts to improve our quality of life. We train them for an industry in which creativity is of the highest value to a product or service. We educate young people who feel at home in the (digital) world of entertainment and lifestyle, and understand that creativity can also mean business. Creativity can be marketed, it’s as simple as that.

We’re sure Prof. Stapel would agree.

[i] Fontys describes itself thusly: Fontys Academy for Creative Industries (Fontys ACI) in Tilburg has 2500 students working towards a bachelor of Business Administration (International Event, Music & Entertainment Studies and Digital Publishing Studies), a bachelor of Communication (International Event, Music & Entertainment Studies) or a bachelor of Lifestyle (International Lifestyle Studies). Fontys ACI hosts a staff of approximately one hundred (teachers plus support staff) as well as about fifty regular visiting lecturers.

 *I wonder if “social philosophy” is being construed as “extreme postmodernist social epistemology”?  

I guess the students are keen to watch that Fictionfactory Peephole.

**Turns out to have been short-lived. Also admits to sockpuppeting at Retraction Watch. Frankly I thought it was more fun to guess who “Paul” was, but they have rules.

[ii] One of my April Fool’s Day posts is turning from part fiction to fact.

Categories: Rejected Posts, Statistics | 9 Comments
