
**Submission Deadline:** December 1st, 2016

**Authors Notified:** February 8th, 2017

We invite papers in formal epistemology, broadly construed. FEW is an interdisciplinary conference, and so we welcome submissions from researchers in philosophy, statistics, economics, computer science, psychology, and mathematics.

Submissions should be prepared for blind review. Contributors should upload a full paper of no more than 6000 words and an abstract of up to 300 words to the EasyChair website. Please submit your full paper in PDF format. The deadline for submissions is December 1st, 2016. Authors will be notified on February 8th, 2017.

The final selection of the program will be made with an eye towards diversity. We especially encourage submissions from PhD candidates, early career researchers and members of groups that are underrepresented in philosophy.

If you have any questions, please email formalepistemologyworkshop2017[AT]gmail, with the appropriate suffix.

Lara Buchak (Berkeley), Vincenzo Crupi (Turin), Sujata Ghosh (ISI Chennai), Simon Huttegger (Irvine), Subhash Lele (Alberta), Hanti Lin (UC Davis), Anna Mahtani (LSE), Daniel Singer (Penn), Michael Titelbaum (Madison), Kevin Zollman (Carnegie Mellon)
Catrin Campbell-Moore (Bristol), Kenny Easwaran (Texas A&M), Nina Gierasimczuk (DTU Compute), Brian Kim (Oklahoma), Fenrong Liu (Tsinghua), Deborah Mayo (Virginia Tech), Carlotta Pavese (Duke/Turin), Sonja Smets (ILLC Amsterdam), Gregory Wheeler (MCMP Munich)
Eleonora Cresto (Buenos Aires), Paul Egre (Institut Jean-Nicod), Leah Henderson (Groningen), Karolina Krzyzanowska (MCMP Munich), Yang Liu (Cambridge), Cailin O’Connor (Irvine), Lavinia Picollo (MCMP Munich), Julia Staffel (WashU in St. Louis), Sylvia Wenmackers (Leuven)

Filed under: Announcement

**International Prize in Statistics Awarded to Sir David Cox for**

**Survival Analysis Model Applied in Medicine, Science, and Engineering**

EMBARGOED until October 19, 2016, at 9 p.m. ET

ALEXANDRIA, VA (October 18, 2016) – Prominent British statistician Sir David Cox has been named the inaugural recipient of the International Prize in Statistics. Like the acclaimed Fields Medal, Abel Prize, Turing Award and Nobel Prize, the International Prize in Statistics is considered the highest honor in its field. It will be bestowed every other year to an individual or team for major achievements using statistics to advance science, technology and human welfare.

Cox is a giant in the field of statistics, but the International Prize in Statistics Foundation is recognizing him specifically for his 1972 paper in which he developed the proportional hazards model that today bears his name. The Cox Model is widely used in the analysis of survival data and enables researchers to more easily identify the risks of specific factors for mortality or other survival outcomes among groups of patients with disparate characteristics. From disease risk assessment and treatment evaluation to product liability, school dropout, reincarceration and AIDS surveillance systems, the Cox Model has been applied in essentially all fields of science, as well as in engineering.

“Professor Cox changed how we analyze and understand the effect of natural or human-induced risk factors on survival outcomes, paving the way for powerful scientific inquiry and discoveries that have impacted human health worldwide,” said Susan Ellenberg, chair of the International Prize in Statistics Foundation. “Use of the ‘Cox Model’ in the physical, medical, life, earth, social and other sciences, as well as engineering fields, has yielded more robust and detailed information that has helped researchers and policymakers address some of society’s most pressing challenges.” Successful application of the Cox Model has led to life-changing breakthroughs with far-reaching societal effects, some of which include the following:

- Demonstrating that a major reduction in smoking-related cardiac deaths could be seen within just one year of smoking cessation, not 10 or more years as previously thought
- Showing the mortality effects of particulate air pollution, a finding that has changed both industrial practices and air quality regulations worldwide
- Identifying risk factors of coronary artery disease and analyzing treatments for lung cancer, cystic fibrosis, obesity, sleep apnea and septic shock

His mark on research is so great that his 1972 paper is one of the three most-cited papers in statistics and ranked 16th in Nature’s list of the top 100 most-cited papers of all time for all fields.

In 2010, Cox received the Copley Medal, the Royal Society’s highest award that has also been bestowed upon such other world-renowned scientists as Peter Higgs, Stephen Hawking, Albert Einstein, Francis Crick and Ronald Fisher. Knighted in 1985, Cox is a fellow of the Royal Society, an honorary fellow of the British Academy and a foreign associate of the U.S. National Academy of Sciences. He has served as president of the Bernoulli Society, Royal Statistical Society and International Statistical Institute.

Cox’s 50-year career included technical and research positions in the private and nonprofit sectors, as well as numerous academic appointments as professor or department chair at Birkbeck College, Imperial College London, Nuffield College and Oxford University. He earned his PhD from the University of Leeds in 1949, after first studying mathematics at St John’s College. Though he retired in 1994, Cox remains active in the profession in Oxford, England.

Cox considers himself to be a scientist who happens to specialize in the use of statistics, which is defined as the science of learning from data. A foundation of scientific inquiry, statistics is a critical component in the development of public policy and has played fundamental roles in vast areas of human development and scientific exploration.

**Note to Editors:** Digital footage of Susan Ellenberg, chair of the International Prize in Statistics Foundation, announcing the recipient will be distributed on October 20. Ellenberg and Ron Wasserstein, director of the International Prize in Statistics Foundation and executive director of the American Statistical Association, will be available for interviews that day.

Link to article: press-release-international-prize-winner

**###**

**About the International Prize in Statistics**

The International Prize in Statistics recognizes a major achievement of an individual or team in the field of statistics and promotes understanding of the growing importance and diverse ways statistics, data analysis, probability and the understanding of uncertainty advance society, science, technology and human welfare. With a monetary award of $75,000, it is given every other year by the International Prize in Statistics Foundation, which is composed of representatives from the American Statistical Association, International Biometric Society, Institute of Mathematical Statistics, International Statistical Institute and Royal Statistical Society. Recipients are chosen by a selection committee of world-renowned academicians and researchers and officially presented with the award at the World Statistics Congress.

**For more information:**

Jill Talley

Public Relations Manager,

American Statistical Association

(703) 684-1221, ext. 1865

jill@amstat.org

@amstatjill

Filed under: Announcement

**MONTHLY MEMORY LANE: 3 years ago: October 2013.** I mark in **red** three posts from each month that seem most apt for general background on key issues in this blog, excluding those reblogged recently [1], and in **green** up to 3 others I’d recommend [2]. Posts that are part of a “unit” or a pair count as one.

**October 2013**

- **(10/3) Will the Real Junk Science Please Stand Up? (critical thinking)**
- **(10/5) Was Janina Hosiasson pulling Harold Jeffreys’ leg?**
- **(10/9) Bad statistics: crime or free speech (II)? Harkonen update: Phil Stat / Law / Stock**
- **(10/12) Sir David Cox: a comment on the post, “Was Hosiasson pulling Jeffreys’ leg?”** (10/5 and 10/12 are a pair)
- (10/19) Blog Contents: September 2013
- **(10/19) Bayesian Confirmation Philosophy and the Tacking Paradox (iv)***
- **(10/25) Bayesian confirmation theory: example from last post…** (10/19 and 10/25 are a pair)
- **(10/26) Comedy hour at the Bayesian (epistemology) retreat: highly probable vs highly probed (vs what?)**
- **(10/31) WHIPPING BOYS AND WITCH HUNTERS** (interesting to see how things have changed and stayed the same over the past few years; share comments)

**[1]** Monthly memory lanes began at the blog’s 3-year anniversary in September 2014.

**[2]** New Rule, July 30, 2016; very convenient.

Filed under: 3-year memory lane, Error Statistics, Statistics

Gelman and Loken (2014) recognize that, even without explicit cherry picking, there is often enough leeway in the “forking paths” between data and inference that artful choices may lead you to one inference, even though it could just as well have gone another way. In good sciences, measurement procedures should interlink with well-corroborated theories and offer a triangulation of checks, often missing in the types of experiments Gelman and Loken are on about. Stating a hypothesis in advance, far from protecting against verification biases, can be the engine that enables data to be “constructed” to reach the desired end [1].

[E]ven in settings where a single analysis has been carried out on the given data, the issue of multiple comparisons emerges because different choices about combining variables, inclusion and exclusion of cases … and many other steps in the analysis could well have occurred with different data (Gelman and Loken 2014, p. 464).

An idea growing out of this recognition is to imagine the results of applying the same statistical procedure, but with different choices at key discretionary junctures, giving rise to a *multiverse analysis* over many data sets, rather than an analysis of a single data set (Steegen, Tuerlinckx, Gelman, and Vanpaemel 2016). One lists the different choices thought to be plausible at each stage of data processing. The multiverse displays “which constellation of choices corresponds to which statistical results” (p. 707). The result of this exercise can, at times, mimic the delineation of possibilities in multiple testing and multiple modeling strategies.
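The bookkeeping involved is simple to sketch. Below is a minimal, purely illustrative toy in Python (the synthetic data, the candidate fertility windows, and the exclusion rules are hypothetical stand-ins of my own, not those of Steegen et al.): each combination of discretionary choices defines one branch of the multiverse, and the same test is run on every branch.

```python
import itertools
import math
import random

random.seed(1)

# Hypothetical raw data: (reported cycle day, voted-for-Obama flag)
# for 200 single respondents. Purely synthetic, with no built-in effect.
data = [(random.randint(1, 28), random.random() < 0.5) for _ in range(200)]

def two_prop_p(x1, n1, x2, n2):
    """Two-sided p-value for a two-proportion z-test (normal approximation)."""
    if n1 == 0 or n2 == 0:
        return 1.0
    p_pool = (x1 + x2) / (n1 + n2)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    if se == 0:
        return 1.0
    z = (x1 / n1 - x2 / n2) / se
    return math.erfc(abs(z) / math.sqrt(2))  # = 2 * (1 - Phi(|z|))

# Discretionary choice points: five defensible "high fertility" day windows,
# and whether to exclude extreme reported cycle days.
windows = [(7, 14), (6, 14), (9, 17), (8, 14), (9, 15)]
exclusions = [("keep all", lambda d: True),
              ("trim tails", lambda d: 3 <= d <= 26)]

multiverse = {}
for (lo, hi), (label, keep) in itertools.product(windows, exclusions):
    kept = [(d, v) for d, v in data if keep(d)]
    high = [v for d, v in kept if lo <= d <= hi]
    low = [v for d, v in kept if not (lo <= d <= hi)]
    multiverse[((lo, hi), label)] = two_prop_p(
        sum(high), len(high), sum(low), len(low))

# The "constellation of choices" and the p-value each branch yields:
for choices, p in sorted(multiverse.items(), key=lambda kv: kv[1]):
    print(choices, round(p, 3))
```

The point of the display is not any single p-value but the spread: with 5 windows and 2 exclusion rules there are already 10 branches, and a researcher free to roam among them has far more chances at significance than the nominal level suggests.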

Steegen et al. consider the rather awful example from 2012 purporting to show that single (vs non-single) women prefer Obama to Romney when they are highly fertile, and the reverse when they’re at low fertility. (I’m guessing there’s a hold on these ovulation studies during the current election season–maybe that’s one good thing in this election cycle. But let me know if you hear of any.)

Two studies with relatively large and diverse samples of women found that ovulation had different effects on religious and political orientation depending on whether women were single or in committed relationships. Ovulation led single women to become more socially liberal, less religious, and more likely to vote for Barack Obama (Durante et al., p. 1013).

What irks me to no end is the assumption that they’re finding effects of ovulation when all they’ve got are a bunch of correlations with lots of flexibility in analysis. (It was discussed in brief on this blogpost.) Unlike the study claiming to show males are more likely to suffer a drop in self-esteem when their partner surpasses them in something (as opposed to when they surpass their partner), this one’s not even intuitively plausible. (For the former case of “Macho Men,” see slides starting from #48 of this post.) The ovulation study was considered so bad that people complained to the network and it had to be pulled.[2] Nevertheless, both studies are open to an analogous critique.

One of the choice points is where to draw the line at “highly fertile” based on days in a woman’s cycle. It wasn’t based on any hormone check, but on an online questionnaire asking subjects when they’d had their last period. There’s latitude in using such information (even assuming it to be accurate) to decide whether to place someone in a low or high fertility group (Steegen et al. find 5 sets of days that could have been used). It turns out that under the other choice points, many of the results were insignificant. Had the evidence been “constructed” along these alternative lines, a negative result would often have ensued. Intuitively, considering what could have happened but didn’t is quite relevant for interpreting the significant result they published. But how?

*1. A severity scrutiny*

Suppose the study is taken as evidence for

*H*_{1}: ovulation makes single women more likely to vote for Obama than Romney.

The data they selected for analysis accords with *H*_{1}, where highly fertile is defined in their chosen manner, leading to significance. The multiverse arrays how many other choice combinations lead to different p-values. We want to determine how good a job has been done in ruling out flaws in the study purporting to have evidence for *H*_{1}. To determine how severely *H*_{1} has passed, we’d ask:

What’s the probability they would *not* have found *some path or other* to yield statistical significance, even if in fact *H*_{1} is false and there’s no genuine effect?

We want this probability to be high, in order to argue that the significant result indicates a genuine effect. That is, we’d like some assurance that the procedure would have alerted us were *H*_{1} unwarranted. I’m not sure how to compute this using the multiverse, but it’s clear there’s more leeway than if one definition for fertility had been pinned down in advance. Perhaps each of the k different consistent combinations can count as a distinct hypothesis, and then one tries to consider the probability of getting r out of k hypotheses statistically significant, even if *H*_{1} is false, taking account of dependencies. Maybe Stan Young’s “resampling-based multiple modeling” techniques could be employed (Westfall & Young, 1993). In any event, the spirit of the multiverse is, or appears to be, a quintessentially error statistical gambit. In appraising the well-testedness of a claim, anything that alters the probative capacity to discern flaws is relevant; anything that increases the flabbiness in uncovering flaws (in what is to be inferred) lowers the *severity* of the test that *H*_{1} has passed. Clearly, taking a walk on a data construction highway does this: the very reason for the common call for preregistration.
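The severity question just posed can at least be approximated by brute force. Here is a rough simulation sketch (my own construction, not a computation from Steegen et al. or Westfall & Young, and all names and numbers in it are hypothetical): generate data with no genuine effect, run each of k overlapping analysis paths, and estimate how often at least one path reaches p < 0.05. The complement of that rate is the probability the multiverse procedure would *not* have handed us some significant result under the null.

```python
import math
import random

random.seed(2)

def two_prop_p(x1, n1, x2, n2):
    """Two-sided p-value for a two-proportion z-test (normal approximation)."""
    if n1 == 0 or n2 == 0:
        return 1.0
    p_pool = (x1 + x2) / (n1 + n2)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    if se == 0:
        return 1.0
    z = (x1 / n1 - x2 / n2) / se
    return math.erfc(abs(z) / math.sqrt(2))

# Five hypothetical, heavily overlapping "high fertility" windows: the k
# analysis paths are correlated, so the family-wise error rate sits above
# the nominal 0.05 but well below k * 0.05.
windows = [(7, 14), (6, 14), (9, 17), (8, 14), (9, 15)]

def some_path_significant(n=200, alpha=0.05):
    """Simulate one null data set; True if any window yields p < alpha."""
    data = [(random.randint(1, 28), random.random() < 0.5) for _ in range(n)]
    for lo, hi in windows:
        high = [v for d, v in data if lo <= d <= hi]
        low = [v for d, v in data if not (lo <= d <= hi)]
        if two_prop_p(sum(high), len(high), sum(low), len(low)) < alpha:
            return True
    return False

trials = 2000
rate = sum(some_path_significant() for _ in range(trials)) / trials
print("P(some path significant | no effect) ~", rate)
print("severity-relevant complement         ~", 1 - rate)
```

Resampling whole data sets is what lets the simulation take the dependencies among paths into account automatically; that is exactly what makes the analytic r-out-of-k computation nontrivial.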

If one hadn’t preregistered, and all the other plausible combinations of choices yield non-significance, there’s a strong inkling that researchers selectively arrived at their result. If one had preregistered, finding that other paths yield non-significance is still informative about the fragility of the result. On the other hand, suppose one had preregistered and obtained a negative result. In the interest of reporting the multiverse, positive results may be disinterred, possibly offsetting the initial negative result.

*2. It Is Said to Be Applicable to Bayesian and Frequentist Approaches*

I find it interesting that the authors say that “a multiverse analysis is valuable, regardless of the inferential framework (frequentist or Bayesian),” and regardless of whether the inference is in the form of p-values, CIs, Bayes factors or posteriors (p. 709). Do the Bayesian tests (posterior or Bayes factors) find evidence against *H*_{1} just when the configuration yields an insignificant result? We’re not told. No, I don’t see why they would. It would depend, of course, on the choice of alternatives and priors. Given how strongly authors Durante et al. believe *H*_{1}, it wouldn’t be surprising if the multiverse continues to find evidence for it (with a high posterior or high Bayes factor in favor of *H*_{1}). Presumably the flexibility in discretionary choices is to show up in diminished Bayesian evidence for *H*_{1}, but it’s not clear to me how. Nevertheless, even if the approach doesn’t itself consider error probabilities of methods, we can set out to appraise severity on the meta-level. We may argue that there’s a high probability of finding evidence in favor of some alternative *H*_{1} or other (varying over definitions of high fertility, say), even if it’s false. Yet I don’t think that’s what Steegen et al. have in mind. I welcome a clarification.

**3. Auditing: Just Falsify the Test, If You Can**

I find a lot to like in the multiverse scrutiny with its recognition of how different choice points in modeling and collecting data introduce the same kind of flexibility as explicit data-dependent searches. There are some noteworthy differences between it and the kind of critique I’ve proposed.

If no strong arguments can be made for certain choices, we are left with many branches of the multiverse that have large p-values. In these cases, the only reasonable conclusion on the effect of fertility is that there is considerable scientific uncertainty. One should reserve judgment…researchers interested in studying the effects of fertility should work hard to deflate the multiverse (Steegen et al., p. 708).

Reserve judgment? Here’s another reasonable conclusion: the core presumptions are falsified (or would be with little effort). What is overlooked in all of these fascinating multiverses is whether the entire inquiry makes any sense. One should expose, or try to expose, the unwarranted presuppositions. This is part of what I call *auditing*. The error statistical account always includes the hypothesis: *the test was poorly run, they’re not measuring what they purport to be, or the assumptions are violated.* Say each person with high fertility in the first study is tested for candidate preference again a month later, when she is in the low-fertility stage. If they have the same voting preferences, *the test is falsified.*

The onus is on the researchers to belie the hypothesis that the test was poorly run; but if they don’t, then we must.[3]

Please share your comments, suggestions, and any links to approaches related to the multiverse analysis.

**Adapted from Mayo, Statistical Inference as Severe Testing (forthcoming)**

[1] I’m reminded of Stapel’s “fix” for science: admit the story you want to tell and how you fixed the statistics to tell it. See this post.

[2] “Last week CNN pulled a story about a study purporting to demonstrate a link between a woman’s ovulation and how she votes, explaining that it failed to meet the cable network’s editorial standards. The story was savaged online as ‘silly,’ ‘stupid,’ ‘sexist,’ and ‘offensive.’ Others were less nice.” (Citation may be found here.)

[3] I have found nearly all experimental studies in the social sciences to be open to a falsification probe, and many are readily falsifiable. The fact that some have built-in ways to try and block falsification brings them closer to falling over the edge into questionable science. This is so, even in cases where their hypotheses are plausible. This is a far faster route to criticism than non-replication and all the rest.

**References:**

Durante, K.M., Rae, A. & Griskevicius, V. 2013, “The Fluctuating Female Vote: Politics, Religion, and the Ovulatory Cycle,” *Psychological Science, *24(6): 1007-1016.

Gelman, A. and Loken, E. 2014. “The statistical crisis in science,” *American Scientist* 102: 460-65.

Mayo, D. *Statistical Inference as Severe Testing*. CUP (forthcoming).

Steegen, S., Tuerlinckx, F., Gelman, A. and Vanpaemel, W. 2016. “Increasing Transparency Through a Multiverse Analysis,” *Perspectives on Psychological Science*, 11: 702-712.

Westfall, P. H. and S.S. Young. 1993. *Resampling-Based Multiple Testing: Examples and Methods for P-Value Adjustment*. A Wiley-Interscience Publication. Wiley.

Filed under: Bayesian/frequentist, Error Statistics, Gelman, P-values, preregistration, reproducibility, Statistics

Leek’s post from yesterday, “Statistical Vitriol” (29 Sep 2016), calls for de-escalating the consequences of statistical mistakes:

Over the last few months there has been a lot of vitriol around statistical ideas. First there were data parasites and then there were methodological terrorists. These epithets came from established scientists who have relatively little statistical training. There was the predictable backlash to these folks from their counterparties, typically statisticians or statistically trained folks who care about open source.

I’m a statistician who cares about open source but I also frequently collaborate with scientists from different fields. It makes me sad and frustrated that statistics – which I’m so excited about and have spent my entire professional career working on – is something that is causing so much frustration, anxiety, and anger.

I have been thinking a lot about the cause of this anger and division in the sciences. As a person who interacts with both groups pretty regularly I think that the reasons are some combination of the following.

1. Data is now everywhere, so every single publication involves some level of statistical modeling and analysis. It can’t be escaped.

2. The deluge of scientific papers means that only big claims get your work noticed, get you into fancy journals, and get you attention.

3. Most senior scientists, the ones leading and designing studies, have little or no training in statistics. There is a structural reason for this: data was sparse when they were trained and there wasn’t any reason for them to learn statistics. So statistics and data science wasn’t (and still often isn’t) integrated into medical and scientific curricula.

*Even for senior scientists in charge of designing statistical studies? *

4. There is an imbalance of power in the scientific process between statisticians/computational scientists and scientific investigators or clinicians. The clinicians/scientific investigators are “in charge” and the statisticians are often relegated to a secondary role. … There are a large number of lonely bioinformaticians out there.

5. Statisticians and computational scientists are also frustrated because there is often no outlet for them to respond to these papers in the formal scientific literature – those outlets are controlled by scientists and rarely have statisticians in positions of influence within the journals.

Since statistics is everywhere (1) and only flashy claims get you into journals (2) and the people leading studies don’t understand statistics very well (3), you get many publications where the paper makes a big claim based on shaky statistics but it gets through. This then frustrates the statisticians because they have little control over the process (4) and can’t get their concerns into the published literature (5).

This used to just result in lots of statisticians and computational scientists complaining behind closed doors. The internet changed all that, everyone is an internet scientist now.

…Sometimes to get attention, statisticians start to have the same problem as scientists; they need their complaints to get attention to have any effect. So they go over the top. They accuse people of fraud, or being statistically dumb, or nefarious, or intentionally doing things with data, or cast a wide net and try to implicate a large number of scientists in poor statistics. The ironic thing is that these things are the same thing that the scientists are doing to get attention that frustrated the statisticians in the first place.

Just to be 100% clear here I am also guilty of this. I have definitely fallen into the hype trap – talking about the “replicability crisis”. I also made the mistake earlier in my blogging career of trashing the statistics of a paper that frustrated me. …

I also understand the feeling of “being under attack”. I’ve had that happen to me too and it doesn’t feel good. So where do we go from here? How do we end statistical vitriol and make statistics a positive force? Here is my six part plan:

1. We should create continuing education for senior scientists and physicians in statistical and open data thinking so people who never got that training can understand the unique requirements of a data rich scientific world.
2. We should encourage journals and funders to incorporate statisticians and computational scientists at the highest levels of influence so that they can drive policy that makes sense in this new data driven time.
3. We should recognize that scientists and data generators have a lot more on the line when they produce a result or a scientific data set. We should give them appropriate credit for doing that even if they don’t get the analysis exactly right.
4. We should de-escalate the consequences of statistical mistakes. Right now the consequences are: retractions that hurt careers, blog posts that are aggressive and often too personal, and humiliation by the community. We should make it easy to acknowledge these errors without ruining careers. This will be hard – scientists’ careers often depend on the results they get (recall 2 above). So we need a way to pump up/give credit to/acknowledge scientists who are willing to sacrifice that to get the stats right.
5. We need to stop treating retractions/statistical errors/mistakes like a sport where there are winners and losers. Statistical criticism should be easy, allowable, publishable and not angry or personal.
6. Any paper where statistical analysis is part of the paper must have a statistically trained author, a statistically trained reviewer, or both. I wouldn’t believe a paper on genomics that was performed entirely by statisticians with no biology training any more than I believe a paper with statistics in it performed entirely by physicians with no statistical training.

I think scientists forget that statisticians feel un-empowered in the scientific process and statisticians forget that a lot is riding on any given study for a scientist. So being a little more sympathetic to the pressures we all face would go a long way to resolving statistical vitriol.

What do you think of his six part plan? More carrots or more sticks? (you can read his post here.)

There may be a fairly wide disparity between the handling of these issues in medicine and biology as opposed to the social sciences. In psychology at least, it appears my predictions (vague, but clear enough) of the likely untoward consequences of their way of handling their “replication crisis” are proving all too true. (See, for example, this post.)

Compare Leek to Gelman’s recent blog on the person raising accusations of “methodological terrorism”, Susan Fiske. (I don’t know if Fiske coined the term, but I consider the analogy reprehensible and think she should retract the term.) Here’s from Gelman:

Who is Susan Fiske and why does she think there are methodological terrorists running around? I can’t be sure about the latter point because she declines to say who these terrorists are or point to any specific acts of terror. Her article provides exactly zero evidence but instead gives some uncheckable half-anecdotes.

I first heard of Susan Fiske because her name was attached as editor to the aforementioned PPNAS articles on himmicanes, etc. So, at least in some cases, she’s a poor judge of social science research….

Fiske’s own published work has some issues too. I make no statement about her research in general, as I haven’t read most of her papers. What I do know is what Nick Brown sent me: [an article] by Amy J. C. Cuddy, Michael I. Norton, and Susan T. Fiske (Journal of Social Issues, 2005). . . .

This paper was just riddled through with errors. First off, its main claims were supported by t statistics of 5.03 and 11.14 . . . ummmmm, upon recalculation the values were actually 1.8 and 3.3. So one of the claims wasn’t even “statistically significant” (thus, under the rules, was unpublishable).

….The short story is that Cuddy, Norton, and Fiske made a bunch of data errors—which is too bad, but such things happen—and then when the errors were pointed out to them, they refused to reconsider anything. Their substantive theory is so open-ended that it can explain just about any result, any interaction in any direction.

And that’s why the authors’ claim that fixing the errors “does not change the conclusion of the paper” is both ridiculous and all too true….

The other thing that’s sad here is how Fiske seems to have felt the need to compromise her own principles here. She deplores “unfiltered trash talk,” “unmoderated attacks” and “adversarial viciousness” and insists on the importance of “editorial oversight and peer review.” According to Fiske, criticisms should be “most often in private with a chance to improve (peer review), or at least in moderated exchanges (curated comments and rebuttals).” And she writes of “scientific standards, ethical norms, and mutual respect.”

But Fiske expresses these views in an unvetted attack in an unmoderated forum with no peer review or opportunity for comments or rebuttals, meanwhile referring to her unnamed adversaries as “methodological terrorists.” Sounds like unfiltered trash talk to me. But, then again, I haven’t seen Fiske on the basketball court so I really have no idea what she sounds like when she’s really trash talkin’. (You can read Gelman’s post, which also includes a useful chronology of events, here.)

How can Leek’s six-point plan of “peaceful engagement” work in cases where authors deny the errors really matter? What if they view statistics as so much holy water to dribble over their data, mere window-dressing to attain a veneer of science? I have heard some (successful) social scientists say this aloud (privately)! Far from representing their inferred claims as unsuccessful attempts to falsify (as good Popperians would demand), the entire effort is a self-sealing affair, dressed up with statistical razzmatazz.

So, I concur with Gelman who has no sympathy for those who wish to protect their work from criticism, going merrily on their way using significance tests illicitly. I also have no sympathy for those who think the cure is merely lowering p-values or embracing methods where the assessment and control of error probabilities are absent. For me, error probability control is not for good long-run error rates, by the way, but to ensure a severe probing of error in the case at hand.

One group may unfairly call the critics “methodological terrorists.” Another may unfairly demonize the statistical methods as the villains to be blamed, banned and eradicated. It’s all the p-value’s fault there’s bad science (never mind that the lack of replication and fraudbusting are based on the use of significance tests). Worse, in some circles, methods that neatly hide the damage from biasing selection effects are championed (in high places)![1]

Gelman says the paradigm of erroneously moving from an already spurious p-value to a substantive claim—thereby doubling up on the blunders–is dead. Is it? That would be swell, but I have my doubts, especially in the most troubling areas. They didn’t nail Potti and Nevins whose erroneous cancer trials had life-threatening consequences; we can scarcely feel confident that such finagling isn’t continuing in clinical trials (see this post), though I think there’s some hope for improvements. But how can it be that “senior scientists, the ones leading and designing studies, have little or no training in statistics,” as Leek says? This is exactly why everyone could say “it’s not my job” in the horror story of the Potti and Nevins fraud. At least social psychologists aren’t using their results to base decisions on chemo treatments for breast cancer patients.

In the social sciences, undergoing a replication revolution has raised awareness, no doubt, and it’s altogether a plus that they’re stressing preregistration. But it’s been such a windfall that one cannot help asking: why would a field whose own members frequently write about its “perverse incentives” have an incentive to kill the cash cow? Especially with all its interesting side-lines? It has a life of its own, and offers a career of its own, with grants aplenty. So grist for its mills would need to continue. That’s rather cynical, but unless they’re prepared to call out bad science, including mounting serious critiques of widely held experimental routines and measurements (which could well lead to whole swaths of inquiry falling by the wayside), I don’t see how any other outcome is to be expected.

Share your thoughts. I wrote much more, but it got too long. I may continue this…

**Related:**

- “Don’t Throw Out the Error Control Baby with the Bad Statistics Bathwater”
- “P-Value Madness: A Puzzle About the Latest Test Ban, or ‘Don’t Ask, Don’t Tell’”
- “Repligate Returns (or, the Non Significance of Nonsignificant Results Are the New Significant Results)”
- “The Paradox of Replication and the Vindication of the P-value, but She Can Go Deeper”

Send me related links you find (in the comments) and I’ll post them.

1) “There’s no tone problem in psychology,” Tal Yarkoni

2) “Menschplaining: Three Ideas for Civil Criticism,” Uri Simonsohn on Data Colada

[1] I do not attribute this stance to Gelman who has made it clear that he cares about what could have happened but didn’t in analyzing tests, and is sympathetic to the idea of statistical tests as error probes:

“But I do not make these decisions on altering, rejecting, and expanding models based on the posterior probability that a model is true. …In statistical terms, an anomaly is a misfit of model to data (or perhaps an internal incoherence of the model), and it can be identified by a (Fisherian) hypothesis test without reference to any particular alternative (what Cox and Hinkley 1974 call “pure significance testing”). … At the next stage, we see science—and applied statistics—as resolving anomalies via the creation of improved models which often include their predecessors as special cases. This view corresponds closely to the error-statistics idea of Mayo (1996)” (Gelman 2011, p. 70).

Filed under: Anil Potti, fraud, Gelman, pseudoscience, Statistics ]]>

Departament de Filosofia & Centre d’Història de la Ciència (CEHIC), Universitat Autònoma de Barcelona (UAB)

Location: CEHIC, Mòdul de Recerca C, Seminari L3-05, c/ de Can Magrans s/n, Campus de la UAB, 08193 Bellaterra (Barcelona)

*Organized by Thomas Sturm & Agustí Nieto-Galan*

Current science is full of uncertainties and risks that weaken the authority of experts. Moreover, sometimes scientists themselves act in ways that weaken their standing: they manipulate data, exaggerate research results, do not give credit where it is due, violate the norms for the acquisition of academic titles, or are unduly influenced by commercial and political interests. Such actions, of which there are numerous examples in past and present times, are widely conceived of as violating standards of good scientific practice. At the same time, while codes of scientific conduct have been developed in different fields, institutions, and countries, there is no universally agreed canon of them, nor is it clear that there should be one. The workshop aims to bring together historians and philosophers of science in order to discuss questions such as the following: What exactly is scientific misconduct? Under which circumstances are researchers more or less liable to misconduct? How far do cases of misconduct undermine scientific authority? How have standards or mechanisms to avoid misconduct, and to regain scientific authority, been developed? How should they be developed?

**All welcome – but since space is limited, please register in advance. Write to:** Thomas.Sturm@uab.cat

09:30 Welcome (Thomas Sturm & Agustí Nieto-Galan)

9:45 José Ramón Bertomeu-Sánchez (IHMC, Universitat de València): *Managing Uncertainty in the Academy and the Courtroom: Normal Arsenic and Nineteenth-Century Toxicology*

10:30 Carl Hoefer (ICREA & Philosophy, University of Barcelona): *Comments on Bertomeu-Sánchez*

10:45 Discussion (Chair: Agustí Nieto-Galan)

11:30 Coffee break

12:00 David Teira (UNED, Madrid): *Does Replication help with Experimental Biases in Clinical Trials?*

12:45 Javier Moscoso (CSIC, Madrid): *Comment on Teira*

13:00 Discussion (Chair: Thomas Sturm)

13:45-15:00 Lunch

15:00 Torsten Wilholt (Philosophy, Leibniz University Hannover): *Bias, Fraud and Interests in Science*

15:45 Oliver Hochadel (IMF, CSIC, Barcelona): *Comments on Wilholt*

16:00 Discussion (Chair: Silvia de Bianchi)

16:45-17:15: Agustí Nieto-Galan & Thomas Sturm: Concluding reflections

**ABSTRACTS**

José Ramón Bertomeu-Sánchez: **Managing Uncertainty in the Academy and the Courtroom: Normal Arsenic and Nineteenth-Century Toxicology**

This paper explores how the enhanced sensitivity of chemical tests sometimes produced unforeseen and puzzling problems in nineteenth-century toxicology. It focuses on the earliest uses of the Marsh test for arsenic and the controversy surrounding “normal arsenic”, i.e., the existence of traces of arsenic in healthy human bodies. The paper follows the circulation of the Marsh test in French toxicology and its appearance in the academy, the laboratory and the courtroom. The new chemical tests could detect very small quantities of poison, but their high sensitivity also offered new opportunities for imaginative defense attorneys to undermine the credibility of expert witnesses. In this context, toxicologists had to dispel the uncertainty associated with the new method, and to find arguments to refute the many possible criticisms (of which “normal arsenic” was one). Meanwhile, new descriptions of animal experiments, autopsies and cases of poisoning produced a steady flow of empirical data, sometimes supporting but, in many cases, questioning previous conclusions about the reliability of chemical tests. This particularly challenging scenario provides many clues about the complex interaction between science and law in the nineteenth century, particularly on how expert authority, credibility and trustworthiness were constructed, and frequently challenged, in the courtroom.

David Teira: **Does Replication help with Experimental Biases in Clinical Trials?**

This is an analysis of the role of replicability in correcting biases in the design and conduct of clinical trials. We take as biases those confounding factors that a community of experimenters acknowledges and for which there are agreed debiasing methods. When these methods are implemented in a trial, we will speak of *unintended biases*, if they occur. Replication helps in detecting and correcting them. *Intended biases* occur when the relevant debiasing method is not implemented. Their effect may be stable and replication, on its own, will not detect them. *Interested* outcomes are treatment variables that not every stakeholder considers clinically relevant. Again, they may be perfectly replicable. Intended biases, unintended biases and interested outcomes are often conflated in the so-called replicability crisis: our analysis shows that fostering replicability, on its own, will not sort out the crisis.

Torsten Wilholt: **Bias, Fraud and Interests in Science**

Cases of fraud and misconduct are the most extreme manifestations of the adverse effects that conflicts of interests can have on science. Fabrication of data and falsification of results may sometimes be difficult to detect, but they are easy to describe as epistemological failures. But arguably, detrimental effects of researchers’ interests can also take more subtle forms. There are numerous ways by which researchers can influence the balance between the sensitivity and the specificity of their investigation. Is it possible to mark out some such trade-offs as cases of detrimental bias? I shall argue that it is, and that the key to understanding bias in science lies in relating it to the phenomenon of epistemic trust. Like fraud, bias exerts its negative epistemic effects by undermining the trust amongst scientists as well as the trust invested in science by the public. I will point out how this analysis can help us to draw the fine lines that separate unexceptionable from biased research and the latter from actual fraud.

Filed under: Announcement, replication research ]]>

**Today is George Barnard’s 101st birthday. In honor of this, I reblog an exchange between Barnard, Savage (and others) on likelihood vs probability. The exchange is from pp. 79-84 of (what I call) “The Savage Forum” (Savage, 1962).[i] Six other posts on Barnard are linked below: 2 are guest posts (Senn, Spanos); the other 4 include a play (pertaining to our first meeting), and a letter he wrote to me.**

♠♠♠♠♠♠♠♠♠♠♠♠♠♠♠♠♠♠♠♠♠♠♠♠♠♠♠♠♠♠♠♠♠♠♠♠♠♠♠♠

**BARNARD**:…Professor Savage, as I understand him, said earlier that a difference between likelihoods and probabilities was that probabilities would normalize because they integrate to one, whereas likelihoods will not. Now probabilities integrate to one only if all possibilities are taken into account. This requires in its application to the probability of hypotheses that we should be in a position to enumerate all possible hypotheses which might explain a given set of data. Now I think it is just not true that we ever can enumerate all possible hypotheses. … If this is so we ought to allow that in addition to the hypotheses that we really consider we should allow something that we had not thought of yet, and of course as soon as we do this we lose the normalizing factor of the probability, and from that point of view probability has no advantage over likelihood. This is my general point, that I think while I agree with a lot of the technical points, I would prefer that this is talked about in terms of likelihood rather than probability. I should like to ask what Professor Savage thinks about that, whether he thinks that the necessity to enumerate hypotheses exhaustively, is important.

**SAVAGE**: Surely, as you say, we cannot always enumerate hypotheses so completely as we like to think. The list can, however, always be completed by tacking on a catch-all ‘something else’. In principle, a person will have probabilities given ‘something else’ just as he has probabilities given other hypotheses. In practice, the probability of a specified datum given ‘something else’ is likely to be particularly vague–an unpleasant reality. The probability of ‘something else’ is also meaningful of course, and usually, though perhaps poorly defined, it is definitely very small. Looking at things this way, I do not find probabilities unnormalizable, certainly not altogether unnormalizable.

Whether probability has an advantage over likelihood seems to me like the question whether volts have an advantage over amperes. The meaninglessness of a norm for likelihood is for me a symptom of the great difference between likelihood and probability. Since you question that symptom, I shall mention one or two others. …

On the more general aspect of the enumeration of all possible hypotheses, I certainly agree that the danger of losing serendipity by binding oneself to an over-rigid model is one against which we cannot be too alert. We must not pretend to have enumerated all the hypotheses in some simple and artificial enumeration that actually excludes some of them. The list can however be completed, as I have said, by adding a general ‘something else’ hypothesis, and this will be quite workable, provided you can tell yourself in good faith that ‘something else’ is rather improbable. The ‘something else’ hypothesis does not seem to make it any more meaningful to use likelihood for probability than to use volts for amperes.

Let us consider an example. Off hand, one might think it quite an acceptable scientific question to ask, ‘What is the melting point of californium?’ Such a question is, in effect, a list of alternatives that pretends to be exhaustive. But, even specifying which isotope of californium is referred to and the pressure at which the melting point is wanted, there are alternatives that the question tends to hide. It is possible that californium sublimates without melting or that it behaves like glass. Who dare say what other alternatives might obtain? An attempt to measure the melting point of californium might, if we are serendipitous, lead to more or less evidence that the concept of melting point is not directly applicable to it. Whether this happens or not, Bayes’s theorem will yield a posterior probability distribution for the melting point given that there really is one, based on the corresponding prior conditional probability and on the likelihood of the observed reading of the thermometer as a function of each possible melting point. Neither the prior probability that there is no melting point, nor the likelihood for the observed reading as a function of hypotheses alternative to that of the existence of a melting point enter the calculation. The distinction between likelihood and probability seems clear in this problem, as in any other.

**BARNARD**: Professor Savage says in effect, ‘add at the bottom of list H_{1}, H_{2},…”something else”’. But what is the probability that a penny comes up heads given the hypothesis ‘something else’. We do not know. What one requires for this purpose is not just that there should be some hypotheses, but that they should enable you to compute probabilities for the data, and that requires very well defined hypotheses. For the purpose of applications, I do not think it is enough to consider only the conditional posterior distributions mentioned by Professor Savage.

**LINDLEY**: I am surprised at what seems to me an obvious red herring that Professor Barnard has drawn across the discussion of hypotheses. I would have thought that when one says this posterior distribution is such and such, all it means is that among the hypotheses that have been suggested the relevant probabilities are such and such; conditionally on the fact that there is nothing new, here is the posterior distribution. If somebody comes along tomorrow with a brilliant new hypothesis, well of course we bring it in.

**BARTLETT**: But you would be inconsistent because your prior probability would be zero one day and non-zero another.

**LINDLEY**: No, it is not zero. My prior probability for other hypotheses may be ε. All I am saying is that conditionally on the other 1 – ε, the distribution is as it is.

**BARNARD**: Yes, but your normalization factor is now determined by ε. Of course ε may be anything up to 1. Choice of letter has an emotional significance.

**LINDLEY**: I do not care what it is as long as it is not one.

**BARNARD**: In that event two things happen. One is that the normalization has gone west, and hence also this alleged advantage over likelihood. Secondly, you are not in a position to say that the posterior probability which you attach to an hypothesis from an experiment with these unspecified alternatives is in any way comparable with another probability attached to another hypothesis from another experiment with another set of possibly unspecified alternatives. This is the difficulty over likelihood. Likelihood in one class of experiments may not be comparable to likelihood from another class of experiments, because of differences of metric and all sorts of other differences. But I think that you are in exactly the same difficulty with conditional probabilities just because they are conditional on your having thought of a certain set of alternatives. It is not rational in other words. Suppose I come out with a probability of a third that the penny is unbiased, having considered a certain set of alternatives. Now I do another experiment on another penny and I come out of that case with the probability one third that it is unbiased, having considered yet another set of alternatives. There is no reason why I should agree or disagree in my final action or inference in the two cases. I can do one thing in one case and other in another, because they represent conditional probabilities leaving aside possibly different events.

**LINDLEY**: All probabilities are conditional.

**BARNARD**: I agree.

**LINDLEY**: If there are only conditional ones, what is the point at issue?

**PROFESSOR E.S. PEARSON**: I suggest that you start by knowing perfectly well that they are conditional and when you come to the answer you forget about it.

**BARNARD**: The difficulty is that you are suggesting the use of probability for inference, and this makes us able to compare different sets of evidence. Now you can only compare probabilities on different sets of evidence if those probabilities are conditional on the same set of assumptions. If they are not conditional on the same set of assumptions they are not necessarily in any way comparable.

**LINDLEY**: Yes, if this probability is a third conditional on that, and if a second probability is a third, conditional on something else, a third still means the same thing. I would be prepared to take my bets at 2 to 1.

**BARNARD**: Only if you knew that the condition was true, but you do not.

**GOOD**: Make a conditional bet.

**BARNARD**: You can make a conditional bet, but that is not what we are aiming at.

**WINSTEN**: You are making a cross comparison where you do not really want to, if you have got different sets of initial experiments. One does not want to be driven into a situation where one has to say that everything with a probability of a third has an equal degree of credence. I think this is what Professor Barnard has really said.

**BARNARD**: It seems to me that likelihood would tell you that you lay 2 to 1 in favour of H_{1} against H_{2}, and the conditional probabilities would be exactly the same. Likelihood will not tell you what odds you should lay in favour of H_{1} as against the rest of the universe. Probability claims to do that, and it is the only thing that probability can do that likelihood cannot.

You can read the rest of pages 78-103 of the Savage Forum here.
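(A numerical aside, not part of the forum.) Lindley’s ε point can be made concrete in a short sketch with made-up numbers: give the catch-all ‘something else’ hypothesis a prior ε and a stipulated likelihood, and compute the posterior. The posterior over the enumerated hypotheses, conditional on the catch-all being false, does not depend on ε or on the stipulated value; the unconditional normalization does, which is Barnard’s complaint.

```python
# Hypothetical numbers illustrating the Barnard-Lindley exchange: two
# enumerated hypotheses H1, H2 plus a catch-all "something else".

def posterior(priors, likelihoods):
    """Bayes's theorem over an explicit, exhaustive list of hypotheses."""
    joint = [p * l for p, l in zip(priors, likelihoods)]
    total = sum(joint)
    return [j / total for j in joint]

epsilon = 0.1                     # prior on "something else" (Lindley's epsilon)
priors = [0.45, 0.45, epsilon]    # H1, H2, catch-all
# Likelihoods of the data are well defined under H1 and H2; under the
# catch-all the value is vague and must simply be stipulated (Barnard's point).
likelihoods = [0.8, 0.2, 0.5]

post = posterior(priors, likelihoods)

# Lindley's move: report the posterior over H1 and H2 conditional on
# "nothing new", i.e. conditional on the catch-all being false.
conditional = [post[0] / (post[0] + post[1]),
               post[1] / (post[0] + post[1])]
print(conditional)  # ≈ [0.8, 0.2]: depends only on the H1, H2 priors and likelihoods
```

All of the numbers here (0.1, 0.45, 0.8, …) are stipulations for illustration; the point is structural, not numerical.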

**HAPPY BIRTHDAY GEORGE!**

**References**

[i] Savage, L. (1962), “Discussion”, in *The Foundations of Statistical Inference: A Discussion*, (G. A. Barnard and D. R. Cox eds.), London: Methuen, 76.

***Six other Barnard links on this blog:**

**Guest Posts: **

**Aris Spanos: Comment on the Barnard and Copas (2002) Empirical Example**

**Stephen Senn: On the (ir)relevance of stopping rules in meta-analysis**

**Posts by Mayo:**

**Barnard, Background Information, and Intentions**

**Statistical Theater of the Absurd: Stat on a Hot tin Roof**

**George Barnard’s 100^{th} Birthday: We Need More Complexity and Coherence in Statistical Education**

**Letter from George Barnard on the Occasion of my Lakatos Award**

**Links to a scan of the entire Savage forum may be found at: https://errorstatistics.com/2013/04/06/who-is-allowed-to-cheat-i-j-good-and-that-after-dinner-comedy-hour/**

Filed under: Barnard, highly probable vs highly probed, phil/history of stat, Statistics ]]>

**I. The Myth of Objectivity.** Whenever you come up against blanket slogans such as “no methods are objective” or “all methods are equally objective and subjective,” it is a good guess that the problem is being trivialized into oblivion. Yes, there are judgments, disagreements, and values in any human activity; but that observation is too trivial, on its own, to distinguish among the very different ways that threats of bias and unwarranted inference may be controlled. Is the objectivity-subjectivity distinction really toothless, as many would have you believe? I say no.

Cavalier attitudes toward objectivity are in tension with widely endorsed movements to promote replication and reproducibility, and to come clean about the many sources behind illicit results: multiple testing, cherry picking, failed assumptions, researcher latitude, publication bias, and so on. The moves to take back science, if they are not mere lip-service, are rooted in the supposition that we can more objectively scrutinize results, even if it’s only to point out those that are poorly tested. The fact that the term “objectivity” is used equivocally should not be taken as grounds to oust it, but rather to engage in the difficult work of identifying what there is in “objectivity” that we won’t give up, and shouldn’t.

**II. The Key is Getting Pushback.** While knowledge gaps leave plenty of room for biases, arbitrariness and wishful thinking, we regularly come up against data that thwart our expectations and disagree with the predictions we try to foist upon the world. *We get pushback!* This supplies objective constraints on which our critical capacity is built. Our ability to recognize when data fail to match anticipations affords the opportunity to systematically improve our orientation. In an adequate account of statistical inference, explicit attention is paid to communicating results so as to set the stage for others to check, debate, extend or refute the inferences reached. Don’t let anyone say you can’t hold them to an objective account of statistical inference.

If you really want to find something out, and have had some experience with flaws and foibles, you deliberately arrange inquiries so as to capitalize on pushback, on effects that will not go away, and on strategies to get errors to ramify quickly to force you to pay attention to them. The ability to register alterations in error probabilities due to hunting, optional stopping, and other questionable research practices (QRPs) is a crucial part of objectivity in statistics. In statistical design, day-to-day tricks of the trade to combat bias are amplified and made systematic. It is not because of a “disinterested stance” that such methods are invented. It is that we, competitively and self-interestedly, want to find things out.
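The effect of such QRPs on error probabilities is easy to exhibit. A minimal simulation sketch (my illustration, with arbitrary parameters): under a true null hypothesis, test after every new observation and stop at the first nominally significant result; the actual probability of an erroneous rejection is several times the nominal 5%.

```python
# Sketch: optional stopping ("try and try again") inflates the type I
# error probability of a nominal 5% two-sided z-test.
import math
import random

random.seed(1)

def rejects_with_optional_stopping(n_max=50):
    """Sample N(0,1) data (H0 true), testing after each new observation."""
    total = 0.0
    for n in range(1, n_max + 1):
        total += random.gauss(0.0, 1.0)
        z = total / math.sqrt(n)
        if abs(z) > 1.96:         # nominal 5% cutoff at each look
            return True           # stop and "reject" at the first crossing
    return False

trials = 2000
rate = sum(rejects_with_optional_stopping() for _ in range(trials)) / trials
print(rate)  # far above the nominal 0.05 when peeking up to 50 times
```

The inflation grows with the maximum number of looks; this is exactly why an account that registers stopping rules in its error probabilities can catch what a look-blind account cannot.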

Admittedly, that desire won’t suffice to incentivize objective scrutiny if you can do just as well producing junk. Succeeding in scientific learning is very different from success at grants, honors, publications, or engaging in technical activism, replication research and meta-research. That’s why the reward structure of science is so often blamed nowadays. New incentives, gold stars and badges for sharing data, preregistration, and resisting the urge to cherry pick, outcome-switch, or otherwise engage in bad science are proposed. I say that if the allure of carrots has grown stronger than the sticks (which it has), then what we need are stronger sticks.

**III. Objective Procedures.** It is often urged that, however much we may aim at objective constraints, we can never have clean hands, free of the influence of beliefs and interests. But the fact that my background knowledge enters in researching a claim does not, by itself, make the inquiry subjective.

Others argue that we invariably sully methods of inquiry by the entry of personal judgments in their specification and interpretation. It’s just human, all too human. *The issue is not that a human is doing the measuring; the issue is whether that which is being measured is something we can reliably use to solve some problem of inquiry.* That an inference is done by machine, untouched by human hands, wouldn’t make it objective in the relevant sense. There are three distinct requirements for an objective procedure for solving problems of inquiry:

*Relevance*: It should be relevant to learning about the intended topic of inquiry; having an uncontroversial way to measure something doesn’t make it relevant to solving a knowledge-based problem of inquiry.

*Reliably capable*: It should not routinely declare the problem solved when it is not solved (or is solved incorrectly); it should be capable of controlling the probability of erroneous reports of purported answers to the question.

*Capacity to learn from error:* If the problem is not solved (or is poorly solved) at a given stage, the method should set the stage for pinpointing why. (It should be able at least to embark on an inquiry for solving “Duhemian problems” of where to lay blame for anomalies.)

Yes, there are numerous choices in collecting, analyzing, modeling, and drawing inferences from data, and there is often disagreement about how they should be made. Why suppose this means all accounts are in the same boat as regards subjective factors? It need not, and they are not. An account of inference shows itself to be objective precisely in how it steps up to the plate in handling potential threats to objectivity.

**IV. Idols of Objectivity.** We should reject phony objectivity and the false trappings of objectivity. They often grow out of one or another philosophical conception of what objectivity requires, even though you will almost surely not see them described that way. If objectivity is thought to be limited to direct observations (whatever they are) plus mathematics and logic, as the typical logical positivist has it, then it’s no surprise to wind up worshiping “the idols of a universal method,” as Gigerenzer and Marewski (2015) call it. Such a method is to supply a formal, ideally mechanical, way to process statements of observations and hypotheses. To recognize that such mechanical rules don’t exist is not, on this view, to relinquish the idea that objectivity demands them. Instead, objectivity goes by the board, replaced by various stripes of relativism and constructivism, or more extreme forms of post-modernism.

Relativists may augment their rather thin gruel with a pseudo-objectivity arising from social or political negotiation, cost-benefits (“they’re buying it”), or a type of consensus (“it’s in a 5 star journal”), but that’s to give away the goat far too soon. The result is to abandon the core stipulations of scientific objectivity. To be clear: There are authentic problems that threaten objectivity. We shouldn’t allow outdated philosophical accounts to induce us into giving it up.

**V. From Discretion to Subjective Probabilities.** Some argue that the “discretionary choices” in tests, which Neyman himself tended to call “subjective”[1], lead us to subjective probabilities in claims. A weak version goes: since you can’t avoid subjective (discretionary) choices in getting the data and the model, there can be little ground for complaint about subjective degrees of belief in the resulting inference. This is weaker than arguing that you must use subjective probabilities; it argues merely that doing so *is no worse than* discretion. But it still misses the point.

First, even if the discretionary judgments that enter in the journey to a statistical inference or model have the capability to introduce subjectivity, they need not. Second, not all discretionary judgments are in the same boat when it comes to being **open to severe testing.**

A stronger version of the argument goes on a slippery slope from the premise of discretion in data generation and modeling to the conclusion: statistical inference just *is a matter of subjective beliefs (or their updates)*. How does that work? One variant, which I do not try to pin on anyone in particular, involves a subtle slide from “our models are merely objects of belief” to “statistical inference is a matter of degrees of belief”. From there it’s a short step to “statistical inference is a matter of subjective probability” (whether my assignments or those of an imaginary omniscient agent).

It is one thing to describe our models as objects of belief and quite another to maintain that our task is to model beliefs.

This is one of those philosophical puzzles of language that might set some people’s eyes rolling. If I believe in the deflection effect (of gravity) then that effect is the object of my belief, but only in the sense that my belief is about said effect. Yet if I’m inquiring into the deflection effect, I’m not inquiring into beliefs about the effect. The philosopher of science Clark Glymour (2010, p. 335) calls this a shift from phenomena (content) to *epiphenomena* (degrees of belief).

Karl Popper argues that *the* central confusion all along was sliding from *the degree of the rationality (or warrantedness) of a belief* to *the degree of rational belief* (1959, p. 424). The former is assessed via degrees of corroboration and well-testedness, rooted in the error-probing capacities of procedures. (These are supplied by error probabilities of methods, formal or informal.)

**VI. Blurring What’s Being Measured vs My Ability to Test It.** You will sometimes hear a Bayesian claim that anyone who says their probability assignments to hypotheses are subjective must also call the use of any model subjective, because it too is based on my choice of specifications. *This is a confusion of two notions of subjective.*

- The first concerns what’s being measured, and for the Bayesian, with some exceptions, probability is supposed to represent a subject’s strength of belief (be it actual or rational), betting odds, or the like.
- The second sense of subjective concerns whether the measurement is checkable or testable.

This goes back to my point about what’s required for a feature to be *relevant* to a method’s objectivity in III.

(Passages, modified, are from Mayo, *Statistical Inference as Severe Testing* (forthcoming).)

[1] But he never would allow subjective probabilities to enter into statistical inference. Objective, i.e., frequentist, priors in a hypothesis H could enter, but he was very clear that this required H’s truth being the result of some kind of stochastic mechanism. He found that idea plausible in some cases; the problem was not knowing the stochastic mechanism sufficiently to assign the priors. Such frequentist (or “empirical”) priors in hypotheses are not given by drawing H randomly from an urn of hypotheses, k% of which are assumed to be true. Yet an “objective” Bayesian like Jim Berger will call these frequentist, resulting in enormous confusion in today’s guidebooks on the probability of type 1 errors.

Cox D. R. and Mayo. D. G. (2010). “Objectivity and Conditionality in Frequentist Inference” in *Error and Inference: Recent Exchanges on Experimental Reasoning, Reliability and the Objectivity and Rationality of Science* (D Mayo and A. Spanos eds.), Cambridge: Cambridge University Press: 276-304.

Gigerenzer, G. and Marewski, J. 2015. ‘Surrogate Science: The Idol of a Universal Method for Scientific Inference,’ *Journal of Management* 41(2): 421-40.

Glymour, C. 2010. ‘Explanation and Truth’, in *Error and Inference: Recent Exchanges on Experimental Reasoning, Reliability and the Objectivity and Rationality of Science* (D. Mayo and A. Spanos eds.), CUP: 331–350.

Mayo, D. (1983). “An Objective Theory of Statistical Testing.” *Synthese* **57**(2): 297-340.

Popper, K. 1959. *The Logic of Scientific Discovery*. New York: Basic Books.

Filed under: Background knowledge Tagged: objectivity ]]>

Today is C.S. Peirce’s birthday. He’s one of my all-time heroes. You should read him: he’s a treasure chest on essentially any topic, and he anticipated several major ideas in statistics (e.g., randomization, confidence intervals) as well as in logic. I’ll reblog the first portion of a (2005) paper of mine. Links to Parts 2 and 3 are at the end. It’s written for a very general philosophical audience; the statistical parts are pretty informal. *Happy birthday Peirce*.

**Peircean Induction and the Error-Correcting Thesis**

Deborah G. Mayo

*Transactions of the Charles S. Peirce Society: A Quarterly Journal in American Philosophy*, Volume 41, Number 2, 2005, pp. 299-319

Peirce’s philosophy of inductive inference in science is based on the idea that what permits us to make progress in science, what allows our knowledge to grow, is the fact that science uses methods that are self-correcting or error-correcting:

Induction is the experimental testing of a theory. The justification of it is that, although the conclusion at any stage of the investigation may be more or less erroneous, yet the further application of the same method must correct the error. (5.145)

Inductive methods—understood as methods of experimental testing—are justified to the extent that they are error-correcting methods. We may call this Peirce’s error-correcting or self-correcting thesis (SCT):

**Self-Correcting Thesis SCT:** methods for inductive inference in science are error correcting; the justification for inductive methods of experimental testing in science is that they are self-correcting.

Peirce’s SCT has been a source of fascination and frustration. By and large, critics and followers alike have denied that Peirce can sustain his SCT as a way to justify scientific induction: “No part of Peirce’s philosophy of science has been more severely criticized, even by his most sympathetic commentators, than this attempted validation of inductive methodology on the basis of its purported self-correctiveness” (Rescher 1978, p. 20).

In this paper I shall revisit the Peircean SCT: properly interpreted, I will argue, Peirce’s SCT not only serves its intended purpose, it also provides the basis for justifying (frequentist) statistical methods in science. While on the one hand, contemporary statistical methods increase the mathematical rigor and generality of Peirce’s SCT, on the other, Peirce provides something current statistical methodology lacks: an account of inductive inference and a philosophy of experiment that links the justification for statistical tests to a more general rationale for scientific induction. Combining the mathematical contributions of modern statistics with the inductive philosophy of Peirce sets the stage for developing an adequate justification for contemporary inductive statistical methodology.

**2. Probabilities are assigned to procedures not hypotheses**

Peirce’s philosophy of experimental testing shares a number of key features with the contemporary (Neyman and Pearson) statistical theory: statistical methods provide, not means for assigning degrees of probability, evidential support, or confirmation to hypotheses, but procedures for testing (and estimation) whose rationale is their predesignated high frequencies of leading to correct results in some hypothetical long run. A Neyman and Pearson (N-P) statistical test, for example, instructs us “To decide whether a hypothesis, *H*, of a given type be rejected or not, calculate a specified character, *x_{0}*, of the observed facts …” (Neyman and Pearson 1933).

The relative frequencies of erroneous rejections and erroneous acceptances in an actual or hypothetical long-run sequence of applications of tests are error probabilities; we may call the statistical tools based on error probabilities *error-statistical* tools. In describing his theory of inference, Peirce could be describing that of the error statistician:

The theory here proposed does not assign any probability to the inductive or hypothetic conclusion, in the sense of undertaking to say how frequently that conclusion would be found true. It does not propose to look through all the possible universes, and say in what proportion of them a certain uniformity occurs; such a proceeding, were it possible, would be quite idle. The theory here presented only says how frequently, in this universe, the special form of induction or hypothesis would lead us right. The probability given by this theory is in every way different—in meaning, numerical value, and form—from that of those who would apply to ampliative inference the doctrine of inverse chances. (2.748)

The doctrine of “inverse chances” alludes to assigning (posterior) probabilities to hypotheses by applying the definition of conditional probability (Bayes’s theorem)—a computation that requires starting out with a (prior or “antecedent”) probability assignment to an exhaustive set of hypotheses:

If these antecedent probabilities were solid statistical facts, like those upon which the insurance business rests, the ordinary precepts and practice [of inverse probability] would be sound. But they are not and cannot be statistical facts. What is the antecedent probability that matter should be composed of atoms? Can we take statistics of a multitude of different universes? (2.777)

For Peircean induction, as in the N-P testing model, the conclusion or inference concerns a hypothesis that either is or is not true in this one universe; thus, assigning a frequentist probability to a particular conclusion, other than the trivial ones of 1 or 0, for Peirce, makes sense only “if universes were as plentiful as blackberries” (2.684). Thus the Bayesian inverse probability calculation seems forced to rely on subjective probabilities for computing inverse inferences, but “subjective probabilities” Peirce charges “express nothing but the conformity of a new suggestion to our prepossessions, and these are the source of most of the errors into which man falls, and of all the worse of them” (2.777).
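Peirce’s point that the probability attaches to the testing procedure (how frequently the method “would lead us right”), not to the hypothesis, can be illustrated by simulating a test’s long-run frequency of erroneous rejections. This sketch is mine, not the paper’s; the normal model, sample size, and 0.05 cutoff are illustrative assumptions.

```python
import math
import random

def type_I_error_rate(n=100, trials=20000, seed=1):
    """Relative frequency of erroneously rejecting a true H0: mu = 0,
    under the one-sided rule: reject when the standardized sample mean
    exceeds 1.645 (the nominal 0.05 cutoff for a normal model)."""
    rng = random.Random(seed)
    z_cutoff = 1.645
    rejections = 0
    for _ in range(trials):
        # Sample from a world in which H0 is true (mu = 0, sigma = 1)
        xbar = sum(rng.gauss(0.0, 1.0) for _ in range(n)) / n
        z = xbar * math.sqrt(n)  # standardized sample mean
        if z > z_cutoff:
            rejections += 1  # an erroneous rejection of the true H0
    return rejections / trials

# The long-run rate of erroneous rejections settles near the predesignated
# 0.05, whatever the truth of any particular conclusion along the way.
print(type_I_error_rate())
```

The probability 0.05 characterizes the rule, not any one hypothesis: rerunning with a different seed changes the particular verdicts but not the long-run rate.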

Hearing Peirce contrast his view of induction with the more popular Bayesian account of his day (the Conceptualists), one could be listening to an error statistician arguing against the contemporary Bayesian (subjective or other)—with one important difference. Today’s error statistician seems to grant too readily that the only justification for N-P test rules is their ability to ensure we will rarely take erroneous actions with respect to hypotheses in the long run of applications. This so-called inductive behavior rationale seems to supply no adequate answer to the question of what is learned in any particular application about the process underlying the data. Peirce, by contrast, was very clear that what is really wanted in inductive inference in science is the ability to control error probabilities of test procedures, i.e., “the trustworthiness of the proceeding”. Moreover, it is only by a faulty analogy with deductive inference, Peirce explains, that many suppose that inductive (synthetic) inference should supply a probability to the conclusion: “… in the case of analytic inference we know the probability of our conclusion (if the premises are true), but in the case of synthetic inferences we only know the degree of trustworthiness of our proceeding” (“The Probability of Induction”, 2.693).

Knowing the “trustworthiness of our inductive proceeding”, I will argue, enables determining the test’s probative capacity, how reliably it detects errors, and the severity of the test a hypothesis withstands. Deliberately making use of known flaws and fallacies in reasoning with limited and uncertain data, tests may be constructed that are highly trustworthy probes in detecting and discriminating errors in particular cases. This, in turn, enables inferring which inferences about the process giving rise to the data are and are not warranted: an inductive inference to hypothesis *H* is warranted to the extent that with high probability the test would have detected a specific flaw or departure from what *H* asserts, and yet it did not.

**3. So why is justifying Peirce’s SCT thought to be so problematic?**

You can read Section 3 here. (It’s not necessary for understanding the rest.)

**4. Peircean induction as severe testing**

… [I]nduction, for Peirce, is a matter of subjecting hypotheses to “the test of experiment” (7.182).

The process of testing it will consist, not in examining the facts, in order to see how well they accord with the hypothesis, but on the contrary in examining such of the probable consequences of the hypothesis … which would be very unlikely or surprising in case the hypothesis were not true. (7.231)

When, however, we find that prediction after prediction, notwithstanding a preference for putting the most unlikely ones to the test, is verified by experiment,…we begin to accord to the hypothesis a standing among scientific results.

This sort of inference it is, from experiments testing predictions based on a hypothesis, that is alone properly entitled to be called induction. (7.206)

While these and other passages are redolent of Popper, Peirce differs from Popper in crucial ways. Peirce, unlike Popper, is primarily interested not in falsifying claims but in the positive pieces of information provided by tests, with “the corrections called for by the experiment” and with the hypotheses, modified or not, that manage to pass severe tests. Even if a hypothesis is highly *corroborated* (by his lights), Popper regards this as at most a report of the hypothesis’s past performance and denies that it affords positive evidence for its correctness or reliability. Further, Popper denies that he could vouch for the reliability of the method he recommends as “most rational”—conjecture and refutation. Indeed, Popper’s requirements for a highly corroborated hypothesis are not sufficient for ensuring severity in Peirce’s sense (Mayo 1996, 2003, 2005). Where Popper recoils from even speaking of warranted inductions, Peirce conceives of a proper inductive inference as one that has passed a severe test—one which would, with high probability, have detected an error if present.

In Peirce’s inductive philosophy, we have evidence for inductively inferring a claim or hypothesis *H* when not only does *H* “accord with” the data **x**; but also, so good an accordance would very probably not have resulted, were *H* false:

*Hypothesis H passes a severe test with* **x** iff (firstly) **x** accords with *H*, and (secondly) with very high probability, the test would have produced a result that accords less well with *H* than **x** does, were *H* false.

The test would “have signaled an error” by having produced results less accordant with *H* than what the test yielded. Thus, we may inductively infer *H* when (and only when) *H* has withstood a test with high error-detecting capacity; the higher this probative capacity, the more severely *H* has passed. What is assessed (quantitatively or qualitatively) is not the amount of support for *H* but the probative capacity of the test of experiment ET (with regard to those errors that an inference to *H* is declaring to be absent) …

You can read the rest of Section 4 here.

**5. The path from qualitative to quantitative induction**

In my understanding of Peircean induction, the difference between qualitative and quantitative induction is really a matter of degree, according to whether their trustworthiness or severity is quantitatively or only qualitatively ascertainable. This reading not only neatly organizes Peirce’s typologies of the various types of induction, but also underwrites the manner in which, within a given classification, Peirce further subdivides inductions by their “strength”.

*(I) First-Order, Rudimentary or Crude Induction*

Consider Peirce’s First Order of induction: the lowest, most rudimentary form, which he dubs the “pooh-pooh argument”. It is essentially an argument from ignorance: lacking evidence for the falsity of some hypothesis or claim *H*, provisionally adopt *H*. In this very weakest sort of induction, crude induction, the most that can be said is that a hypothesis would eventually be falsified if false. (It may correct itself—but with a bang!) It “is as weak an inference as any that I would not positively condemn” (8.237). While uneliminable in ordinary life, Peirce denies that rudimentary induction is to be included as scientific induction. Without some reason to think evidence of *H*‘s falsity would probably have been detected, were *H* false, finding no evidence against *H* is poor inductive evidence *for* *H*. *H* has passed only a highly unreliable error probe.

*(II) Second Order (Qualitative) Induction*

It is only with what Peirce calls “the Second Order” of induction that we arrive at a genuine test, and thereby scientific induction. Within second order inductions, a stronger and a weaker type exist, corresponding neatly to viewing strength as the severity of a testing procedure.

The weaker of these is where the predictions that are fulfilled are merely of the continuance in future experience of the same phenomena which originally suggested and recommended the hypothesis… (7.116)

The other variety of the argument … is where [results] lead to new predictions being based upon the hypothesis of an entirely different kind from those originally contemplated and these new predictions are equally found to be verified. (7.117)

The weaker type occurs where the predictions, though fulfilled, lack novelty; whereas, the stronger type reflects a more stringent hurdle having been satisfied: the hypothesis has had “novel” predictive success, and thereby higher severity. (For a discussion of the relationship between types of novelty and severity see Mayo 1991, 1996). Note that within a second order induction the assessment of strength is qualitative, e.g., very strong, weak, very weak.

The strength of any argument of the Second Order depends upon how much the confirmation of the prediction runs counter to what our expectation would have been without the hypothesis. It is entirely a question of how much; and yet there is no measurable quantity. For when such measure is possible the argument … becomes an induction of the Third Order [statistical induction]. (7.115)

It is upon these and like passages that I base my reading of Peirce. A qualitative induction, i.e., a test whose severity is qualitatively determined, becomes a quantitative induction when the severity is quantitatively determined; when an objective error probability can be given.

*(III) Third Order, Statistical (Quantitative) Induction*

We enter the Third Order of statistical or quantitative induction when it is possible to quantify “how much” the prediction runs counter to what our expectation would have been without the hypothesis. In his discussions of such quantifications, Peirce anticipates to a striking degree later developments of statistical testing and confidence interval estimation (Hacking 1980, Mayo 1993, 1996). Since this is not the place to describe his statistical contributions, I move to more modern methods to make the qualitative-quantitative contrast.

**6. Quantitative and qualitative induction: significance test reasoning**

*Quantitative Severity*

A statistical significance test illustrates an inductive inference justified by a quantitative severity assessment. The significance test procedure has the following components: a *null hypothesis* *H_{0}*, which is an assertion about the distribution of the sample **X**; a test statistic or distance measure *d*(**x**), reflecting how far the data depart from what *H_{0}* asserts; and the *p*-value, the probability of so large a *d*(**x**), computed under the assumption that *H_{0}* is true. For example:

*H_{0}*: there are no increased cancer risks associated with hormone replacement therapy (HRT) in women who have taken them for 10 years.

Let *d*(**x**) measure the increased risk of cancer in the *n* sampled women treated with HRT. Then

*p*-value = Prob(*d*(**X**) ≥ *d*(**x**); *H_{0}*).

If this probability is very small, the data are taken as evidence that the increased risk is not merely fortuitous, i.e., as evidence for

*H**: cancer risks are higher in women treated with HRT

The reasoning is a statistical version of *modus tollens*:

If the hypothesis *H_{0}* is correct then, with high probability, 1 − *p*, the data would not be statistically significant at level *p*.

**x** is statistically significant at level *p*.

Therefore, **x** is evidence of a discrepancy from *H_{0}*, in the direction of *H**.

(i.e., *H** severely passes, where the severity is 1 minus the *p*-value) [iii]

For example, the results of recent, large, randomized treatment-control studies showing statistically significant increased risks (at the 0.001 level) give strong evidence that HRT, taken for over 5 years, increases the chance of breast cancer, the severity being 0.999. If a particular conclusion is wrong, subsequent severe (or highly powerful) tests will with high probability detect it. In particular, if we are wrong to reject *H_{0}* (and *H_{0}* is true), subsequent severe tests would, with high probability, lead us to detect our error. In Peirce’s words:

It is true that the observed conformity of the facts to the requirements of the hypothesis may have been fortuitous. But if so, we have only to persist in this same method of research and we shall gradually be brought around to the truth. (7.115)

The correction is not a matter of getting higher and higher probabilities, it is a matter of finding out whether the agreement is fortuitous; whether it is generated about as often as would be expected were the agreement of the chance variety.
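The significance-test reasoning of this section can be put into a few lines of code. A hedged sketch: the observed standardized difference (3.1) is an illustrative number of my own, not the HRT studies’ actual statistic, and the standard normal model for the test statistic under the null hypothesis is likewise an assumption.

```python
import math

def normal_sf(z):
    """Upper-tail probability P(Z >= z) for a standard normal variate."""
    return 0.5 * math.erfc(z / math.sqrt(2))

d_obs = 3.1                  # hypothetical observed standardized difference d(x)
p_value = normal_sf(d_obs)   # Prob(d(X) >= d(x); H0), with d(X) ~ N(0,1) under H0
severity = 1.0 - p_value     # severity with which the alternative H* passes

print(p_value < 0.001)   # statistically significant at the 0.001 level
print(severity > 0.999)  # H* passes the test with severity exceeding 0.999
```

The smaller the probability of so large a difference under the null hypothesis, the higher the severity with which the alternative withstands the test.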

[Part 2 and Part 3 are here; you can find the full paper here.]

**REFERENCES:**

Hacking, I. 1980 “The Theory of Probable Inference: Neyman, Peirce and Braithwaite”, pp. 141-160 in D. H. Mellor (ed.), *Science, Belief and Behavior: Essays in Honour of R.B. Braithwaite*. Cambridge: Cambridge University Press.

Laudan, L. 1981 *Science and Hypothesis: Historical Essays on Scientific Methodology*. Dordrecht: D. Reidel.

Levi, I. 1980 “Induction as Self Correcting According to Peirce”, pp. 127-140 in D. H. Mellor (ed.), *Science, Belief and Behavior: Essays in Honour of R.B. Braithwaite*. Cambridge: Cambridge University Press.

Mayo, D. 1991 “Novel Evidence and Severe Tests”, *Philosophy of Science*, 58: 523-552.

———- 1993 “The Test of Experiment: C. S. Peirce and E. S. Pearson”, pp. 161-174 in E. C. Moore (ed.), *Charles S. Peirce and the Philosophy of Science*. Tuscaloosa: University of Alabama Press.

——— 1996 *Error and the Growth of Experimental Knowledge*, The University of Chicago Press, Chicago.

———–2003 “Severe Testing as a Guide for Inductive Learning”, in H. Kyburg (ed.), *Probability Is the Very Guide in Life*. Chicago: Open Court Press, pp. 89-117.

———- 2005 “Evidence as Passing Severe Tests: Highly Probed vs. Highly Proved” in P. Achinstein (ed.), *Scientific Evidence*, Johns Hopkins University Press.

Mayo, D. and Kruse, M. 2001 “Principles of Inference and Their Consequences,” pp. 381-403 in *Foundations of Bayesianism*, D. Corfield and J. Williamson (eds.), Dordrecht: Kluwer Academic Publishers.

Mayo, D. and Spanos, A. 2004 “Methodology in Practice: Statistical Misspecification Testing” *Philosophy of Science*, Vol. II, PSA 2002, pp. 1007-1025.

———- (2006). “Severe Testing as a Basic Concept in a Neyman-Pearson Philosophy of Induction”, *The British Journal for the Philosophy of Science* 57: 323-357.

Mayo, D. and Cox, D.R. 2006 “The Theory of Statistics as the ‘Frequentist’s’ Theory of Inductive Inference”, *Institute of Mathematical Statistics (IMS) Lecture Notes-Monograph Series, Contributions to the Second Lehmann Symposium*, *2005*.

Neyman, J. and Pearson, E.S. 1933 “On the Problem of the Most Efficient Tests of Statistical Hypotheses”, in *Philosophical Transactions of the Royal Society*, A: 231, 289-337, as reprinted in J. Neyman and E.S. Pearson (1967), pp. 140-185.

———- 1967 *Joint Statistical Papers*, Berkeley: University of California Press.

Niiniluoto, I. 1984 *Is Science Progressive*? Dordrecht: D. Reidel.

Peirce, C. S. *Collected Papers: Vols. I-VI*, C. Hartshorne and P. Weiss (eds.) (1931-1935). Vols. VII-VIII, A. Burks (ed.) (1958), Cambridge: Harvard University Press.

Popper, K. 1962 *Conjectures and Refutations: the Growth of Scientific Knowledge*, Basic Books, New York.

Rescher, N. 1978 *Peirce’s Philosophy of Science: Critical Studies in His Theory of Induction and Scientific Method*, Notre Dame: University of Notre Dame Press.

[i] Others who relate Peircean induction and Neyman-Pearson tests are Isaac Levi (1980) and Ian Hacking (1980). See also Mayo 1993 and 1996.

[ii] This statement of (b) is regarded by Laudan as the strong thesis of self-correcting. A weaker thesis would replace (b) with (b’): science has techniques for determining unambiguously whether an alternative *T’* is closer to the truth than a refuted *T*.

[iii] If the *p*-value were not very small (generally, small values are 0.1 or less), the difference would be considered statistically insignificant, and we would regard *H_{0}* as consistent with data **x**. But the question remains which discrepancies from *H_{0}* the test had good capacity to detect. The reasoning runs:

If there were a discrepancy from hypothesis *H_{0}* of size δ (or greater), then, with high probability, the data would be statistically significant at level *p*.

**x** is not statistically significant at level *p*.

Therefore, **x** is evidence that any discrepancy from *H_{0}* is less than δ.

For a general treatment of effect size, see Mayo and Spanos (2006).
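The reasoning of footnote [iii], that an insignificant result is evidence that any discrepancy from the null hypothesis is smaller than a size the test had high capacity to detect, can be sketched numerically. The discrepancy size, sample size, and normal model below are my illustrative assumptions; the severity computation is in the spirit of Mayo and Spanos (2006).

```python
import math

def normal_cdf(z):
    """P(Z <= z) for a standard normal variate."""
    return 0.5 * math.erfc(-z / math.sqrt(2))

def severity_discrepancy_less_than(d_obs, delta, n, sigma=1.0):
    """Severity for 'the discrepancy from H0 is less than delta', given an
    insignificant observed mean difference d_obs: the probability that the
    test would have yielded a difference larger than d_obs, were the true
    discrepancy as large as delta (normal model assumed)."""
    se = sigma / math.sqrt(n)
    return 1.0 - normal_cdf((d_obs - delta) / se)

# Observed mean difference 0.1 with n = 100, sigma = 1 gives z = 1.0:
# not statistically significant at the 0.05 level. Had the discrepancy
# been as large as 0.4, so small a difference would almost never occur,
# so 'discrepancy < 0.4' passes with high severity.
print(severity_discrepancy_less_than(d_obs=0.1, delta=0.4, n=100) > 0.99)
```

The smaller the discrepancy asked about, the lower the severity: with delta equal to the observed difference itself, the severity is only 0.5.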

[Ed. Note: A not-bad biographical sketch can be found on Wikipedia.]

Filed under: Bayesian/frequentist, C.S. Peirce, Error Statistics, Statistics ]]>

**Error Statistics Philosophy: Blog Contents (5 years) [i]**

*Dear Reader: It’s hard to believe I’ve been blogging for five years (since Sept. 3, 2011)! A big celebration is taking place at the Elbar Room this evening. If you’re in the neighborhood, stop by for some Elba Grease. *

*Amazingly, this old typewriter still works; better yet, one of the whiz kids on Elba managed to bluetooth it so that what I type goes directly from the typewriter onto the blog (I never got used to computer keyboards). I still must travel to London to get replacement ribbons for this klunker.*

*Please peruse the offerings below, and take advantage of some of the super contributions and discussions by guest posters and readers! I don’t know how much longer I’ll continue blogging, but at least until the publication of my book on statistical inference. After that I plan to run conferences, workshops, and ashrams on PhilStat and PhilSci, and will invite readers to take part! Keep reading and commenting. Sincerely, D. Mayo*

**September 2011**

- (9/3) Frequentists in Exile: The Purpose of this Blog
- (9/3) Overheard at the comedy hour at the Bayesian retreat
- (9/4) Drilling Rule #1
- (9/9) Kuru
- (9/13) In Exile, Clinging to Old Ideas?
- (9/15) SF conferences & E. Lehmann
- (9/16) Getting It Right But for the Wrong Reason
- (9/20) A Highly Anomalous Event
- (9/23) LUCKY 13 (Criticisms)
- (9/26) Whipping Boys and Witch Hunters
- (9/29) Part 1: Imaginary scientist at an imaginary company, Prionvac, and an imaginary reformer

**October 2011**

- (10/3) Part 2 Prionvac: The Will to Understand Power
- (10/4) Part 3 Prionvac: How the Reformers Should Have done Their Job
- (10/5) Formaldehyde Hearing: How to Tell the Truth With Statistically Insignificant Results
- (10/7) Blogging the (Strong) Likelihood Principle
- (10/10) RMM-1: Special Volume on Stat Sci Meets Phil Sci
- (10/10) Objectivity 1: Will the Real Junk Science Please Stand Up?
- (10/13) Objectivity #2: The “Dirty Hands” Argument for Ethics in Evidence
- (10/14) King Tut Includes ErrorStatistics in Top 50 Statblogs!
- (10/16) Objectivity #3: Clean(er) Hands With Metastatistics
- (10/19) RMM-2: “A Conversation Between Sir David Cox & D.G. Mayo”
- (10/20) Blogging the Likelihood Principle #2: Solitary Fishing: SLP Violations
- (10/22) The Will to Understand Power: Neyman’s Nursery (NN1)
- (10/28) RMM-3: Special Volume on Stat Scie Meets Phil Sci (Hendry)
- (10/30) Background Knowledge: Not to Quantify, but to Avoid Being Misled by, Subjective Beliefs
- (10/31) Oxford Gaol: Statistical Bogeymen

**November 2011**

- (11/1) RMM-4: Special Volume on Stat Scie Meets Phil Sci (Spanos)
- (11/3) Who is Really Doing the Work?*
- (11/5) Skeleton Key and Skeletal Points for (Esteemed) Ghost Guest
- (11/9) Neyman’s Nursery 2: Power and Severity [Continuation of Oct. 22 Post]
- (11/12) Neyman’s Nursery (NN) 3: SHPOWER vs POWER
- (11/15) Logic Takes a Bit of a Hit!: (NN 4) Continuing: Shpower (“observed” power) vs Power
- (11/18) Neyman’s Nursery (NN5): Final Post
- (11/21) RMM-5: Special Volume on Stat Scie Meets Phil Sci (Wasserman)
- (11/23) Elbar Grease: Return to the Comedy Hour at the Bayesian Retreat
- (11/28) The UN Charter: double-counting and data snooping
- (11/29) If you try sometime, you find you get what you need!

**December 2011**

- (12/2) Getting Credit (or blame) for Something You Don’t Deserve (and first honorable mention)
- (12/6) Putting the Brakes on the Breakthrough Part 1*

- (12/7) Part II: Breaking Through the Breakthrough*
- (12/11) Irony and Bad Faith: Deconstructing Bayesians
- (12/19) Deconstructing and Deep-Drilling* 2
- (12/22) The 3 stages of the acceptance of novel truths
- (12/25) Little Bit of Blog Log-ic
- (12/26) Contributed Deconstructions: Irony & Bad Faith 3
- (12/29) JIM BERGER ON JIM BERGER!
- (12/31) Midnight With Birnbaum

**January 2012**

- (1/3) Model Validation and the LLP-(Long Playing Vinyl Record)
- (1/8) Don’t Birnbaumize that Experiment my Friend*
- (1/10) Bad-Faith Assertions of Conflicts of Interest?*
- (1/13) U-PHIL: “So you want to do a philosophical analysis?”
- (1/14) “You May Believe You are a Bayesian But You Are Probably Wrong” (Extract from Senn RMM article)
- (1/15) Mayo Philosophizes on Stephen Senn: “How Can We Cultivate Senn’s-Ability?”
- (1/17) “Philosophy of Statistics”: Nelder on Lindley
- (1/19) RMM-6 Special Volume on Stat Sci Meets Phil Sci (Sprenger)
- (1/22) U-Phil: Stephen Senn (1): C. Robert, A. Jaffe, and Mayo (brief remarks)
- (1/23) U-Phil: Stephen Senn (2): Andrew Gelman
- (1/24) U-Phil (3): Stephen Senn on Stephen Senn!
- (1/26) Updating & Downdating: One of the Pieces to Pick up
- (1/29) No-Pain Philosophy: Skepticism, Rationality, Popper, and All That: First of 3 Parts

**February 2012**

- (2/3) Senn Again (Gelman)
- (2/7) When Can Risk-Factor Epidemiology Provide Reliable Tests?
- (2/8) Guest Blogger: Interstitial Doubts About the Matrixx (Schachtman)
- (2/8) Distortions in the Court? (PhilStat/PhilStock): Cobb on Ziliak & McCloskey
- (2/11) R.A. Fisher: Statistical Methods and Scientific Inference
- (2/11) JERZY NEYMAN: Note on an Article by Sir Ronald Fisher
- (2/12) E.S. Pearson: Statistical Concepts in Their Relation to Reality
- (2/12) Guest Blogger. STEPHEN SENN: Fisher’s alternative to the alternative
- (2/15) Guest Blogger. Aris Spanos: The Enduring Legacy of R.A. Fisher
- (2/17) Two New Properties of Mathematical Likelihood
- (2/20) Statistical Theater of the Absurd: “Stat on a Hot Tin Roof”? (Rejected Post Feb 20)
- (2/22) Intro to Misspecification Testing: Ordering From A Full Diagnostic Menu (part 1)
- (2/23) Misspecification Testing: (part 2) A Fallacy of Error “Fixing”
- (2/27) Misspecification Testing: (part 3) Subtracting-out effects “on paper”
- (2/28) Misspecification Tests: (part 4) and brief concluding remarks

**March 2012**

- (3/2) MetaBlog: March 2, 2012
- (3/3) Statistical Science Court?
- (3/6) Mayo, Senn, and Wasserman on Gelman’s RMM** Contribution
- (3/8) Lifting a piece from Spanos’ contribution* will usefully add to the mix
- (3/10) U-PHIL: A Further Comment on Gelman by Christian Hennig (UCL, Statistics)
- (3/11) Blogologue*
- (3/11) RMM-7: Commentary and Response on Senn published: Special Volume on Stat Scie Meets Phil Sci
- (3/14) Objectivity (#4) and the “Argument From Discretion”
- (3/18) Objectivity (#5): Three Reactions to the Challenge of Objectivity (in inference)
- (3/22) Generic Drugs Resistant to Lawsuits
- (3/25) The New York Times Goes to War Against Generic Drug Manufacturers: Schactman
- (3/26) Announcement: Philosophy of Scientific Experiment Conference
- (3/28) Comment on the Barnard and Copas (2002) Empirical Example: Aris Spanos

**April 2012**

- (4/1) Philosophy of Statistics: Retraction Watch, Vol. 1, No. 1
- (4/3) History and Philosophy of Evidence-Based Health Care
- (4/4) Fallacy of Rejection and the Fallacy of Nouvelle Cuisine
- (4/6) Going Where the Data Take Us
- (4/9) N. Schachtman: Judge Posner’s Digression on Regression
- (4/10) Call for papers: Philosepi?
- (4/12) That Promissory Note From Lehmann’s Letter; Schmidt to Speak
- (4/15) U-Phil: Deconstructing Dynamic Dutch-Books?
- (4/16) A. Spanos: Jerzy Neyman and his Enduring Legacy
- (4/17) Earlier U-Phils and Deconstructions
- (4/18) Jean Miller: Happy Sweet 16 to EGEK! (Shalizi Review: “We have Ways of Making You Talk”)
- (4/21) Jean Miller: Happy Sweet 16 to EGEK #2 (Hasok Chang Review of EGEK)
- (4/23) U-Phil: Jon Williamson: Deconstructing DynamicDutch Books
- (4/25) Matching Numbers Across Philosophies
- (4/28) Comedy Hour at the Bayesian Retreat: P-values versus Posteriors

**May 2012**

- (5/1) Stephen Senn: A Paradox of Prior Probabilities
- (5/5) Comedy Hour at the Bayesian (Epistemology) Retreat: Highly Probable vs Highly Probed
- (5/8) LSE Summer Seminar: Contemporary Problems in Philosophy of Statistics
- (5/10) Excerpts from S. Senn’s Letter on “Replication, p-values and Evidence,”
- (5/12) Saturday Night Brainstorming & Task Forces: The TFSI on NHST

- (5/17) Do CIs Avoid Fallacies of Tests? Reforming the Reformers
- (5/20) Betting, Bookies and Bayes: Does it Not Matter?
- (5/23) Does the Bayesian Diet Call For Error-Statistical Supplements?
- (5/24) An Error-Statistical Philosophy of Evidence (PH500, LSE Seminar)
- (5/28) Painting-by-Number #1
- (5/31) Metablog: May 31, 2012

**June 2012**

- (6/2) Anything Tests Can do, CIs do Better; CIs Do Anything Better than Tests?* (reforming the reformers cont.)
- (6/6) Review of Error and Inference by C. Hennig

- (6/9) U-Phil: Is the Use of Power* Open to a Power Paradox?
- (6/12) CMU Workshop on Foundations for Ockham’s Razor
- (6/14) Answer to the Homework & a New Exercise
- (6/15) Scratch Work for a SEV Homework Problem
- (6/17) Repost (5/17/12): Do CIs Avoid Fallacies of Tests? Reforming the Reformers
- (6/17) G. Cumming Response: The New Statistics
- (6/19) The Error Statistical Philosophy and The Practice of Bayesian Statistics: Comments on Gelman and Shalizi
- (6/23) Promissory Note
- (6/26) Deviates, Sloths, and Exiles: Philosophical Remarks on the Ockham’s Razor Workshop*
- (6/29) Further Reflections on Simplicity: Mechanisms

**July 2012**

- (7/1) PhilStatLaw: “Let’s Require Health Claims to Be ‘Evidence Based’” (Schachtman)
- (7/2) More from the Foundations of Simplicity Workshop*
- (7/3) Elliott Sober Responds on Foundations of Simplicity
- (7/4) Comment on Falsification
- (7/6) Vladimir Cherkassky Responds on Foundations of Simplicity
- (7/8) Metablog: Up and Coming
- (7/9) Stephen Senn: Randomization, ratios and rationality: rescuing the randomized clinical trial from its critics
- (7/10) PhilStatLaw: Reference Manual on Scientific Evidence (3d ed) on Statistical Significance (Schachtman)
- (7/11) Is Particle Physics Bad Science?
- (7/12) Dennis Lindley’s “Philosophy of Statistics”
- (7/15) Deconstructing Larry Wasserman – it starts like this…
- (7/16) Peter Grünwald: Follow-up on Cherkassky’s Comments
- (7/19) New Kvetch Posted 7/18/12
- (7/21) “Always the last place you look!”
- (7/22) Clark Glymour: The Theory of Search Is the Economics of Discovery (part 1)
- (7/23) Clark Glymour: The Theory of Search Is the Economics of Discovery (part 2)
- (7/27) P-values as Frequentist Measures
- (7/28) U-PHIL: Deconstructing Larry Wasserman
- (7/31) What’s in a Name? (Gelman’s blog)

**August 2012**

- (8/2) Stephen Senn: Fooling the Patient: an Unethical Use of Placebo? (Phil/Stat/Med)
- (8/5) A “Bayesian Bear” rejoinder practically writes itself…
- (8/6) Bad news bears: Bayesian rejoinder
- (8/8) U-PHIL: Aris Spanos on Larry Wasserman
- (8/10) U-PHIL: Hennig and Gelman on Wasserman (2011)
- (8/11) E.S. Pearson Birthday
- (8/11) U-PHIL: Wasserman Replies to Spanos and Hennig
- (8/13) U-Phil: (concluding the deconstruction) Wasserman/Mayo
- (8/14) Good Scientist Badge of Approval?
- (8/16) E.S. Pearson’s Statistical Philosophy
- (8/18) A. Spanos: Egon Pearson’s Neglected Contributions to Statistics
- (8/20) Higgs Boson: Bayesian “Digest and Discussion”
- (8/22) Scalar or Technicolor? S. Weinberg, “Why the Higgs?”
- (8/27) Knowledge/evidence not captured by mathematical prob.
- (8/30) Frequentist Pursuit
- (8/31) Failing to Apply vs Violating the Likelihood Principle

**September 2012**

- (9/3) After dinner Bayesian comedy hour. …
- (9/6) Stephen Senn: The nuisance parameter nuisance
- (9/8) Metablog: One-Year Anniversary
- (9/8) Return to the comedy hour … (on significance tests)
- (9/12) U-Phil (9/25/12) How should “prior information” enter in statistical inference?
- (9/15) More on using background info
- (9/19) Barnard, background info/intentions
- (9/22) Statistics and ESP research (Diaconis)
- (9/25) Insevere tests and pseudoscience
- (9/26) Levels of Inquiry
- (9/29) Stephen Senn: On the (ir)relevance of stopping rules in meta-analysis
- (9/30) Letter from George (Barnard)

**October 2012**

- (10/02) PhilStatLaw: Infections in the court
- (10/05) Metablog: Rejected posts (blog within a blog)
- (10/05) Deconstructing Gelman, Part 1: “A Bayesian wants everybody else to be a non-Bayesian.”
- (10/07) Deconstructing Gelman, Part 2: Using prior information
- (10/09) Last part (3) of the deconstruction: beauty and background knowledge
- (10/12) U-Phils: Hennig and Aktunc on Gelman 2012
- (10/13) Mayo Responds to U-Phils on Background Information
- (10/15) New Kvetch: race-based academics in Fla
- (10/17) RMM-8: New Mayo paper: “StatSci and PhilSci: part 2 (Shallow vs Deep Explorations)”
- (10/18) Query
- (10/18) Mayo: (first 2 sections) “StatSci and PhilSci: part 2”
- (10/20) Mayo: (section 5) “StatSci and PhilSci: part 2”
- (10/21) Mayo: (section 6) “StatSci and PhilSci: part 2”
- (10/22) Mayo: (section 7) “StatSci and PhilSci: part 2”
- (10/24) Announcement: Ontology and Methodology (Virginia Tech)
- (10/25) New rejected post: phil faux
- (10/27) New rejected post: “Are you butter off now?”
- (10/29) Reblogging: Oxford Gaol: Statistical Bogeymen
- (10/29) Type 1 and 2 errors: Frankenstorm
- (10/30) Guest Post: Greg Gandenberger, “Evidential Meaning and Methods of Inference”
- (10/31) U-Phil: Blogging the Likelihood Principle: New Summary

**November 2012**

- (11/04) PhilStat: So you’re looking for a Ph.D. dissertation topic?
- (11/07) Seminars at the London School of Economics: Contemporary Problems in Philosophy of Statistics
- (11/10) Bad news bears: ‘Bayesian bear’ rejoinder – reblog
- (11/12) new rejected post: kvetch (and query)
- (11/14) continuing the comments. …
- (11/16) Philosophy of Science Association (PSA) 2012 Program
- (11/18) What is Bayesian/Frequentist Inference? (from the normal deviate)
- (11/18) New kvetch/PhilStock: Rapiscan Scam
- (11/19) Comments on Wasserman’s “what is Bayesian/frequentist inference?”
- (11/21) Irony and Bad Faith: Deconstructing Bayesians – reblog
- (11/23) Announcement: 28 November: My Seminar at the LSE (Contemporary PhilStat)
- (11/25) Likelihood Links [for 28 Nov. Seminar and Current U-Phil]
- (11/28) Blogging Birnbaum: on Statistical Methods in Scientific Inference
- (11/30) Error Statistics (brief overview)

**December 2012**

- (12/2) Normal Deviate’s blog on false discovery rates
- (12/2) Statistical Science meets Philosophy of Science
- (12/3) Mayo Commentary on Gelman & Robert’s paper
- (12/6) Announcement: U-Phil Extension: Blogging the Likelihood Principle
- (12/7) Nov. Palindrome Winner: Kepler
- (12/8) Don’t Birnbaumize that experiment my friend*–updated reblog
- (12/11) Announcement: Prof. Stephen Senn to lead LSE grad seminar: 12-12-12
- (12/11) Mayo on S. Senn: “How Can We Cultivate Senn’s-Ability?”
- (12/13) “Bad statistics”: crime or free speech?
- (12/14) PhilStat/Law (“Bad Statistics” Cont.)
- (12/17) PhilStat/Law/Stock: multiplicity and duplicity
- (12/19) PhilStat/Law/Stock: more on “bad statistics”: Schachtman
- (12/21) Rejected Post: Clinical Trial Statistics Doomed by Mayan Apocalypse?
- (12/22) Msc kvetch: unfair but lawful discrimination (vs the irresistibly attractive)
- (12/24) 13 well-worn criticisms of significance tests (and how to avoid them)
- (12/27) 3 msc kvetches on the blog bagel circuit
- (12/30) An established probability theory for hair comparison “is not — and never was”
- (12/31) Midnight with Birnbaum-reblog

**January 2013**

- (1/2) Severity as a ‘Metastatistical’ Assessment
- (1/4) Severity Calculator
- (1/6) Guest post: Bad Pharma? (S. Senn)
- (1/9) RCTs, skeptics, and evidence-based policy
- (1/10) James M. Buchanan
- (1/11) Aris Spanos: James M. Buchanan: a scholar, teacher and friend
- (1/12) Error Statistics Blog: Table of Contents
- (1/15) Ontology & Methodology: Second call for Abstracts, Papers
- (1/18) New Kvetch/PhilStock
- (1/19) Saturday Night Brainstorming and Task Forces: (2013) TFSI on NHST
- (1/22) New PhilStock
- (1/23) P-values as posterior odds?
- (1/26) Coming up: December U-Phil Contributions….
- (1/27) U-Phil: S. Fletcher & N. Jinn
- (1/30) U-Phil: J. A. Miller: Blogging the SLP

**February 2013**

- (2/2) U-Phil: Ton o’ Bricks
- (2/4) January Palindrome Winner
- (2/6) Mark Chang (now) gets it right about circularity
- (2/8) From Gelman’s blog: philosophy and the practice of Bayesian statistics
- (2/9) New kvetch: Filly Fury
- (2/10) U-PHIL: Gandenberger & Hennig: Blogging Birnbaum’s Proof
- (2/11) U-Phil: Mayo’s response to Hennig and Gandenberger
- (2/13) Statistics as a Counter to Heavyweights…who wrote this?
- (2/16) Fisher and Neyman after anger management?
- (2/17) R. A. Fisher: how an outsider revolutionized statistics
- (2/20) Fisher: from ‘Two New Properties of Mathematical Likelihood’
- (2/23) Stephen Senn: Also Smith and Jones
- (2/26) PhilStock: DO < $70
- (2/26) Statistically speaking…

**March 2013**

- (3/1) capitalizing on chance
- (3/4) Big Data or Pig Data?
- (3/7) Stephen Senn: Casting Stones
- (3/10) Blog Contents 2013 (Jan & Feb)
- (3/11) S. Stanley Young: Scientific Integrity and Transparency
- (3/13) Risk-Based Security: Knives and Axes
- (3/15) Normal Deviate: Double Misunderstandings About p-values
- (3/17) Update on Higgs data analysis: statistical flukes (1)
- (3/21) Telling the public why the Higgs particle matters
- (3/23) Is NASA suspending public education and outreach?
- (3/27) Higgs analysis and statistical flukes (part 2)
- (3/31) possible progress on the comedy hour circuit?

**April 2013**

- (4/1) Flawed Science and Stapel: Priming for a Backlash?
- (4/4) Guest Post. Kent Staley: On the Five Sigma Standard in Particle Physics
- (4/6) Who is allowed to cheat? I.J. Good and that after dinner comedy hour….
- (4/10) Statistical flukes (3): triggering the switch to throw out 99.99% of the data
- (4/11) O & M Conference (upcoming) and a bit more on triggering from a participant…..
- (4/14) Does statistics have an ontology? Does it need one? (draft 2)
- (4/19) Stephen Senn: When relevance is irrelevant
- (4/22) Majority say no to inflight cell phone use, knives, toy bats, bow and arrows, according to survey
- (4/23) PhilStock: Applectomy? (rejected post)
- (4/25) Blog Contents 2013 (March)
- (4/27) Getting Credit (or blame) for Something You Didn’t Do (BP oil spill, comedy hour)
- (4/29) What should philosophers of science do? (falsification, Higgs, statistics, Marilyn)

**May 2013**

- (5/3) Schedule for Ontology & Methodology, 2013
- (5/6) Professorships in Scandal?
- (5/9) If it’s called the “The High Quality Research Act,” then ….
- (5/13) ‘No-Shame’ Psychics Keep Their Predictions Vague: New Rejected post
- (5/14) “A sense of security regarding the future of statistical science…” Anon review of Error and Inference
- (5/18) Gandenberger on Ontology and Methodology (May 4) Conference: Virginia Tech
- (5/19) Mayo: Meanderings on the Onto-Methodology Conference
- (5/22) Mayo’s slides from the Onto-Meth conference
- (5/24) Gelman sides w/ Neyman over Fisher in relation to a famous blow-up
- (5/26) Schachtman: High, Higher, Highest Quality Research Act
- (5/27) A.Birnbaum: Statistical Methods in Scientific Inference
- (5/29) K. Staley: review of Error & Inference

**June 2013**

- (6/1) Winner of May Palindrome Contest
- (6/1) Some statistical dirty laundry
- (6/5) Do CIs Avoid Fallacies of Tests? Reforming the Reformers (Reblog 5/17/12):
- (6/6) PhilStock: Topsy-Turvy Game
- (6/6) Anything Tests Can do, CIs do Better; CIs Do Anything Better than Tests?* (reforming the reformers cont.)
- (6/8) Richard Gill: “Integrity or fraud… or just questionable research practices?”
- (6/11) Mayo: comment on the repressed memory research
- (6/14) P-values can’t be trusted except when used to argue that p-values can’t be trusted!
- (6/19) PhilStock: The Great Taper Caper
- (6/19) Stanley Young: better p-values through randomization in microarrays
- (6/22) What do these share in common: m&ms, limbo stick, ovulation, Dale Carnegie? Sat night potpourri
- (6/26) Why I am not a “dualist” in the sense of Sander Greenland
- (6/29) Palindrome “contest” contest
- (6/30) Blog Contents: mid-year

**July 2013**

- (7/3) Phil/Stat/Law: 50 Shades of gray between error and fraud
- (7/6) Bad news bears: ‘Bayesian bear’ rejoinder–reblog mashup
- (7/10) PhilStatLaw: Reference Manual on Scientific Evidence (3d ed) on Statistical Significance (Schachtman)
- (7/11) Is Particle Physics Bad Science? (memory lane)
- (7/13) Professor of Philosophy Resigns over Sexual Misconduct (rejected post)
- (7/14) Stephen Senn: Indefinite irrelevance
- (7/17) Phil/Stat/Law: What Bayesian prior should a jury have? (Schachtman)
- (7/19) Msc Kvetch: A question on the Martin-Zimmerman case we do not hear
- (7/20) Guest Post: Larry Laudan. Why Presuming Innocence is Not a Bayesian Prior
- (7/23) Background Knowledge: Not to Quantify, But To Avoid Being Misled By, Subjective Beliefs
- (7/26) New Version: On the Birnbaum argument for the SLP: Slides for JSM talk

**August 2013**

- (8/1) Blogging (flogging?) the SLP: Response to Reply- Xi’an Robert
- (8/5) At the JSM: 2013 International Year of Statistics
- (8/6) What did Nate Silver just say? Blogging the JSM
- (8/9) 11th bullet, multiple choice question, and last thoughts on the JSM
- (8/11) E.S. Pearson: “Ideas came into my head as I sat on a gate overlooking an experimental blackcurrant plot”
- (8/13) Blogging E.S. Pearson’s Statistical Philosophy
- (8/15) A. Spanos: Egon Pearson’s Neglected Contributions to Statistics
- (8/17) Gandenberger: How to Do Philosophy That Matters (guest post)
- (8/21) Blog contents: July, 2013
- (8/22) PhilStock: Flash Freeze
- (8/22) A critical look at “critical thinking”: deduction and induction
- (8/28) Is being lonely unnatural for slim particles? A statistical argument
- (8/31) Overheard at the comedy hour at the Bayesian retreat-2 years on

**September 2013**

- (9/2) Is Bayesian Inference a Religion?
- (9/3) Gelman’s response to my comment on Jaynes
- (9/5) Stephen Senn: Open Season (guest post)
- (9/7) First blog: “Did you hear the one about the frequentist…”? and “Frequentists in Exile”
- (9/10) Peircean Induction and the Error-Correcting Thesis (Part I)
- (9/10) (Part 2) Peircean Induction and the Error-Correcting Thesis
- (9/12) (Part 3) Peircean Induction and the Error-Correcting Thesis
- (9/14) “When Bayesian Inference Shatters” Owhadi, Scovel, and Sullivan (guest post)
- (9/18) PhilStock: Bad news is good news on Wall St.
- (9/18) How to hire a fraudster chauffeur
- (9/22) Statistical Theater of the Absurd: “Stat on a Hot Tin Roof”
- (9/23) Barnard’s Birthday: background, likelihood principle, intentions
- (9/24) Gelman est effectivement une erreur statistician
- (9/26) Blog Contents: August 2013
- (9/29) Highly probable vs highly probed: Bayesian/ error statistical differences

**October 2013**

- (10/3) Will the Real Junk Science Please Stand Up? (critical thinking)
- (10/5) Was Janina Hosiasson pulling Harold Jeffreys’ leg?
- (10/9) Bad statistics: crime or free speech (II)? Harkonen update: Phil Stat / Law /Stock
- (10/12) Sir David Cox: a comment on the post, “Was Hosiasson pulling Jeffreys’ leg?”
- (10/19) Blog Contents: September 2013
- (10/19) Bayesian Confirmation Philosophy and the Tacking Paradox (iv)*
- (10/25) Bayesian confirmation theory: example from last post…
- (10/26) Comedy hour at the Bayesian (epistemology) retreat: highly probable vs highly probed (vs what ?)
- (10/31) WHIPPING BOYS AND WITCH HUNTERS

**November 2013**

- (11/2) Oxford Gaol: Statistical Bogeymen
- (11/4) Forthcoming paper on the strong likelihood principle
- (11/9) Null Effects and Replication
- (11/9) Beware of questionable front page articles warning you to beware of questionable front page articles (iii)
- (11/13) T. Kepler: “Trouble with ‘Trouble at the Lab’?” (guest post)
- (11/16) PhilStock: No-pain bull
- (11/16) S. Stanley Young: More Trouble with ‘Trouble in the Lab’ (Guest post)
- (11/18) Lucien Le Cam: “The Bayesians hold the Magic”
- (11/20) Erich Lehmann: Statistician and Poet
- (11/23) Probability that it is a statistical fluke [i]
- (11/27) “The probability that it be a statistical fluke” [iia]
- (11/30) Saturday night comedy at the “Bayesian Boy” diary (rejected post*)

**December 2013**

- (12/3) Stephen Senn: Dawid’s Selection Paradox (guest post)
- (12/7) FDA’s New Pharmacovigilance
- (12/9) Why ecologists might want to read more philosophy of science (UPDATED)
- (12/11) Blog Contents for Oct and Nov 2013
- (12/14) The error statistician has a complex, messy, subtle, ingenious piece-meal approach
- (12/15) Surprising Facts about Surprising Facts
- (12/19) A. Spanos lecture on “Frequentist Hypothesis Testing”
- (12/24) U-Phil: Deconstructions [of J. Berger]: Irony & Bad Faith 3
- (12/25) “Bad Arguments” (a book by Ali Almossawi)
- (12/26) Mascots of Bayesneon statistics (rejected post)
- (12/27) Deconstructing Larry Wasserman
- (12/28) More on deconstructing Larry Wasserman (Aris Spanos)
- (12/28) Wasserman on Wasserman: Update! December 28, 2013
- (12/31) Midnight With Birnbaum (Happy New Year)

**January 2014**

- (1/2) Winner of the December 2013 Palindrome Book Contest (Rejected Post)
- (1/3) Error Statistics Philosophy: 2013
- (1/4) Your 2014 wishing well. …
- (1/7) “Philosophy of Statistical Inference and Modeling” New Course: Spring 2014: Mayo and Spanos: (Virginia Tech)
- (1/11) Two Severities? (PhilSci and PhilStat)
- (1/14) Statistical Science meets Philosophy of Science: blog beginnings
- (1/16) Objective/subjective, dirty hands and all that: Gelman/Wasserman blogolog (ii)
- (1/18) Sir Harold Jeffreys’ (tail area) one-liner: Sat night comedy [draft ii]
- (1/22) Phil6334: “Philosophy of Statistical Inference and Modeling” New Course: Spring 2014: Mayo and Spanos (Virginia Tech) UPDATE: JAN 21
- (1/24) Phil 6334: Slides from Day #1: Four Waves in Philosophy of Statistics
- (1/25) U-Phil (Phil 6334) How should “prior information” enter in statistical inference?
- (1/27) Winner of the January 2014 palindrome contest (rejected post)
- (1/29) BOSTON COLLOQUIUM FOR PHILOSOPHY OF SCIENCE: Revisiting the Foundations of Statistics
- (1/31) Phil 6334: Day #2 Slides

**February 2014**

- (2/1) Comedy hour at the Bayesian (epistemology) retreat: highly probable vs highly probed (vs B-boosts)
- (2/3) PhilStock: Bad news is bad news on Wall St. (rejected post)
- (2/5) “Probabilism as an Obstacle to Statistical Fraud-Busting” (draft iii)
- (2/9) Phil6334: Day #3: Feb 6, 2014
- (2/10) Is it true that all epistemic principles can only be defended circularly? A Popperian puzzle
- (2/12) Phil6334: Popper self-test
- (2/13) Phil 6334 Statistical Snow Sculpture
- (2/14) January Blog Table of Contents
- (2/15) Fisher and Neyman after anger management?
- (2/17) R. A. Fisher: how an outsider revolutionized statistics
- (2/18) Aris Spanos: The Enduring Legacy of R. A. Fisher
- (2/20) R.A. Fisher: ‘Two New Properties of Mathematical Likelihood’
- (2/21) STEPHEN SENN: Fisher’s alternative to the alternative
- (2/22) Sir Harold Jeffreys’ (tail-area) one-liner: Sat night comedy [draft ii]
- (2/24) Phil6334: February 20, 2014 (Spanos): Day #5
- (2/26) Winner of the February 2014 palindrome contest (rejected post)
- (2/26) Phil6334: Feb 24, 2014: Induction, Popper and pseudoscience (Day #4)

**March 2014**

- (3/1) Cosma Shalizi gets tenure (at last!) (metastat announcement)
- (3/2) Significance tests and frequentist principles of evidence: Phil6334 Day #6
- (3/3) Capitalizing on Chance (ii)
- (3/4) Power, power everywhere–(it) may not be what you think! [illustration]
- (3/8) Msc kvetch: You are fully dressed (even under your clothes)?
- (3/8) Fallacy of Rejection and the Fallacy of Nouvelle Cuisine
- (3/11) Phil6334 Day #7: Selection effects, the Higgs and 5 sigma, Power
- (3/12) Get empowered to detect power howlers
- (3/15) New SEV calculator (guest app: Durvasula)
- (3/17) Stephen Senn: “Delta Force: To what extent is clinical relevance relevant?” (Guest Post)
- (3/19) Power taboos: Statue of Liberty, Senn, Neyman, Carnap, Severity
- (3/22) Fallacies of statistics & statistics journalism, and how to avoid them: Summary & Slides Day #8 (Phil 6334)
- (3/25) The Unexpected Way Philosophy Majors Are Changing The World Of Business
- (3/26) Phil6334:Misspecification Testing: Ordering From A Full Diagnostic Menu (part 1)
- (3/28) Severe osteometric probing of skeletal remains: John Byrd
- (3/29) Winner of the March 2014 palindrome contest (rejected post)
- (3/30) Phil6334: March 26, philosophy of misspecification testing (Day #9 slides)

**April 2014**

- (4/1) Skeptical and enthusiastic Bayesian priors for beliefs about insane asylum renovations at Dept of Homeland Security: I’m skeptical and unenthusiastic
- (4/3) Self-referential blogpost (conditionally accepted*)
- (4/5) Who is allowed to cheat? I.J. Good and that after dinner comedy hour. . ..
- (4/6) Phil6334: Duhem’s Problem, highly probable vs highly probed; Day #9 Slides
- (4/8) “Out Damned Pseudoscience: Non-significant results are the new ‘Significant’ results!” (update)
- (4/12) “Murder or Coincidence?” Statistical Error in Court: Richard Gill (TEDx video)
- (4/14) Phil6334: Notes on Bayesian Inference: Day #11 Slides
- (4/16) A. Spanos: Jerzy Neyman and his Enduring Legacy
- (4/17) Duality: Confidence intervals and the severity of tests
- (4/19) Getting Credit (or blame) for Something You Didn’t Do (BP oil spill)
- (4/21) Phil 6334: Foundations of statistics and its consequences: Day#12
- (4/23) Phil 6334 Visitor: S. Stanley Young, “Statistics and Scientific Integrity”
- (4/26) Reliability and Reproducibility: Fraudulent p-values through multiple testing (and other biases): S. Stanley Young (Phil 6334: Day #13)
- (4/30) Able Stats Elba: 3 Palindrome nominees for April! (rejected post)

**May 2014**

- (5/1) Putting the brakes on the breakthrough: An informal look at the argument for the Likelihood Principle
- (5/3) You can only become coherent by ‘converting’ non-Bayesianly
- (5/6) Winner of April Palindrome contest: Lori Wike
- (5/7) A. Spanos: Talking back to the critics using error statistics (Phil6334)
- (5/10) Who ya gonna call for statistical Fraudbusting? R.A. Fisher, P-values, and error statistics (again)
- (5/15) Scientism and Statisticism: a conference* (i)
- (5/17) Deconstructing Andrew Gelman: “A Bayesian wants everybody else to be a non-Bayesian.”
- (5/20) The Science Wars & the Statistics Wars: More from the Scientism workshop
- (5/25) Blog Table of Contents: March and April 2014
- (5/27) Allan Birnbaum, Philosophical Error Statistician: 27 May 1923 – 1 July 1976
- (5/31) What have we learned from the Anil Potti training and test data frameworks? Part 1 (draft 2)

**June 2014**

- (6/5) Stephen Senn: Blood Simple? The complicated and controversial world of bioequivalence (guest post)
- (6/9) “The medical press must become irrelevant to publication of clinical trials.”
- (6/11) A. Spanos: “Recurring controversies about P values and confidence intervals revisited”
- (6/14) “Statistical Science and Philosophy of Science: where should they meet?”
- (6/21) Big Bayes Stories? (draft ii)
- (6/25) Blog Contents: May 2014
- (6/28) Sir David Hendry Gets Lifetime Achievement Award
- (6/30) Some ironies in the ‘replication crisis’ in social psychology (4th and final installment)

**July 2014**

- (7/7) Winner of June Palindrome Contest: Lori Wike
- (7/8) Higgs Discovery 2 years on (1: “Is particle physics bad science?”)
- (7/10) Higgs Discovery 2 years on (2: Higgs analysis and statistical flukes)
- (7/14) “P-values overstate the evidence against the null”: legit or fallacious? (revised)
- (7/23) Continued:”P-values overstate the evidence against the null”: legit or fallacious?
- (7/26) S. Senn: “Responder despondency: myths of personalized medicine” (Guest Post)
- (7/31) Roger Berger on Stephen Senn’s “Blood Simple” with a response by Senn (Guest Posts)

**August 2014**

- (08/03) Blogging Boston JSM2014?
- (08/05) Neyman, Power, and Severity
- (08/06) What did Nate Silver just say? Blogging the JSM 2013
- (08/09) Winner of July Palindrome: Manan Shah
- (08/09) Blog Contents: June and July 2014
- (08/11) Egon Pearson’s Heresy
- (08/17) Are P Values Error Probabilities? Or, “It’s the methods, stupid!” (2nd install)
- (08/23) Has Philosophical Superficiality Harmed Science?
- (08/29) BREAKING THE LAW! (of likelihood): to keep their fit measures in line (A), (B 2nd)

**September 2014**

- (9/30) Letter from George (Barnard)
- (9/27) Should a “Fictionfactory” peepshow be barred from a festival on “Truth and Reality”? Diederik Stapel says no (rejected post)
- (9/23) G.A. Barnard: The Bayesian “catch-all” factor: probability vs likelihood
- (9/21) Statistical Theater of the Absurd: “Stat on a Hot Tin Roof”
- (9/18) Uncle Sam wants YOU to help with scientific reproducibility!
- (9/15) A crucial missing piece in the Pistorius trial? (2): my answer (Rejected Post)
- (9/12) “The Supernal Powers Withhold Their Hands And Let Me Alone”: C.S. Peirce
- (9/6) *Statistical Science*: The Likelihood Principle issue is out…!
- (9/4) All She Wrote (so far): Error Statistics Philosophy Contents-3 years on
- (9/3) 3 in blog years: Sept 3 is 3rd anniversary of errorstatistics.com

**October 2014**

- 10/01 Oy Faye! What are the odds of not conflating simple conditional probability and likelihood with Bayesian success stories?
- 10/05 Diederik Stapel hired to teach “social philosophy” because students got tired of success stories… or something (rejected post)
- 10/07 A (Jan 14, 2014) interview with Sir David Cox by “Statistics Views”
- 10/10 BREAKING THE (Royall) LAW! (of likelihood) (C)
- 10/14 Gelman recognizes his error-statistical (Bayesian) foundations
- 10/18 PhilStat/Law: Nathan Schachtman: Acknowledging Multiple Comparisons in Statistical Analysis: Courts Can and Must
- 10/22 September 2014: Blog Contents
- 10/25 3 YEARS AGO: MONTHLY MEMORY LANE
- 10/26 To Quarantine or not to Quarantine?: Science & Policy in the time of Ebola
- 10/31 Oxford Gaol: Statistical Bogeymen

**November 2014**

- 11/01 Philosophy of Science Assoc. (PSA) symposium on Philosophy of Statistics in the Higgs Experiments “How Many Sigmas to Discovery?”
- 11/09 “Statistical Flukes, the Higgs Discovery, and 5 Sigma” at the PSA
- 11/11 The Amazing Randi’s Million Dollar Challenge
- 11/12 A biased report of the probability of a statistical fluke: Is it cheating?
- 11/15 Why the Law of Likelihood is bankrupt–as an account of evidence
- 11/18 Lucien Le Cam: “The Bayesians Hold the Magic”
- 11/20 Erich Lehmann: Statistician and Poet
- 11/22 Msc Kvetch: “You are a Medical Statistic”, or “How Medical Care Is Being Corrupted”
- 11/25 How likelihoodists exaggerate evidence from statistical tests
- 11/30 3 YEARS AGO: MONTHLY (Nov.) MEMORY LANE

**December 2014**

- 12/02 My Rutgers Seminar: tomorrow, December 3, on philosophy of statistics
- 12/04 “Probing with Severity: Beyond Bayesian Probabilism and Frequentist Performance” (Dec 3 Seminar slides)
- 12/06 How power morcellators inadvertently spread uterine cancer
- 12/11 Msc. Kvetch: What does it mean for a battle to be “lost by the media”?
- 12/13 S. Stanley Young: Are there mortality co-benefits to the Clean Power Plan? It depends. (Guest Post)
- 12/17 Announcing Kent Staley’s new book, An Introduction to the Philosophy of Science (CUP)
- 12/21 Derailment: Faking Science: A true story of academic fraud, by Diederik Stapel (translated into English)
- 12/23 All I want for Chrismukkah is that critics & “reformers” quit howlers of testing (after 3 yrs of blogging)! So here’s Aris Spanos “Talking Back!”
- 12/26 3 YEARS AGO: MONTHLY (Dec.) MEMORY LANE
- 12/29 To raise the power of a test is to lower (not raise) the “hurdle” for rejecting the null (Ziliak and McCloskey 3 years on)
- 12/31 Midnight With Birnbaum (Happy New Year)

**January 2015**

- 01/02 Blog Contents: Oct.- Dec. 2014
- 01/03 No headache power (for Deirdre)
- 01/04 Significance Levels are Made a Whipping Boy on Climate Change Evidence: Is .05 Too Strict? (Schachtman on Oreskes)
- 01/07 “When Bayesian Inference Shatters” Owhadi, Scovel, and Sullivan (reblog)
- 01/08 On the Brittleness of Bayesian Inference–An Update: Owhadi and Scovel (guest post)
- 01/12 “Only those samples which fit the model best in cross validation were included” (whistleblower) “I suspect that we likely disagree with what constitutes validation” (Potti and Nevins)
- 01/16 Winners of the December 2014 Palindrome Contest: TWO!
- 01/18 Power Analysis and Non-Replicability: If bad statistics is prevalent in your field, does it follow you can’t be guilty of scientific fraud?
- 01/21 Some statistical dirty laundry
- 01/24 What do these share in common: m&ms, limbo stick, ovulation, Dale Carnegie? Sat night potpourri
- 01/26 Trial on Anil Potti’s (clinical) Trial Scandal Postponed Because Lawyers Get the Sniffles (updated)
- 01/27 3 YEARS AGO: (JANUARY 2012) MEMORY LANE
- 01/31 Saturday Night Brainstorming and Task Forces: (4th draft)

**February 2015**

- 02/05 Stephen Senn: Is Pooling Fooling? (Guest Post)
- 02/10 What’s wrong with taking (1 – β)/α, as a likelihood ratio comparing H0 and H1?
- 02/13 Induction, Popper and Pseudoscience
- 02/16 Continuing the discussion on truncation, Bayesian convergence and testing of priors
- 02/16 R. A. Fisher: ‘Two New Properties of Mathematical Likelihood’: Just before breaking up (with N-P)
- 02/17 R. A. Fisher: How an Outsider Revolutionized Statistics (Aris Spanos)
- 02/19 Stephen Senn: Fisher’s Alternative to the Alternative
- 02/21 Sir Harold Jeffreys’ (tail area) one-liner: Saturday night comedy (b)
- 02/25 3 YEARS AGO: (FEBRUARY 2012) MEMORY LANE
- 02/27 Big Data is the New Phrenology?

**March 2015**

- 03/01 “Probabilism as an Obstacle to Statistical Fraud-Busting”
- 03/05 A puzzle about the latest test ban (or ‘don’t ask, don’t tell’)
- 03/12 All She Wrote (so far): Error Statistics Philosophy: 3.5 years on
- 03/16 Stephen Senn: The pathetic P-value (Guest Post)
- 03/21 Objectivity in Statistics: “Arguments From Discretion and 3 Reactions”
- 03/24 3 YEARS AGO (MARCH 2012): MEMORY LANE
- 03/28 Your (very own) personalized genomic prediction varies depending on who else was around?

**April 2015**

- 04/01 Are scientists really ready for ‘retraction offsets’ to advance ‘aggregate reproducibility’? (let alone ‘precautionary withdrawals’)
- 04/04 Joan Clarke, Turing, I.J. Good, and “that after-dinner comedy hour…”
- 04/08 Heads I win, tails you lose? Meehl and many Popperians get this wrong (about severe tests)!
- 04/13 Philosophy of Statistics Comes to the Big Apple! APS 2015 Annual Convention — NYC
- 04/16 A. Spanos: Jerzy Neyman and his Enduring Legacy
- 04/18 Neyman: Distinguishing tests of statistical hypotheses and tests of significance might have been a lapse of someone’s pen
- 04/22 NEYMAN: “Note on an Article by Sir Ronald Fisher” (3 uses for power, Fisher’s fiducial argument)
- 04/24 “Statistical Concepts in Their Relation to Reality” by E.S. Pearson
- 04/27 3 YEARS AGO (APRIL 2012): MEMORY LANE
- 04/30 96% Error in “Expert” Testimony Based on Probability of Hair Matches: It’s all Junk!

**May 2015**

- 05/04 Spurious Correlations: Death by getting tangled in bedsheets and the consumption of cheese! (Aris Spanos)
- 05/08 What really defies common sense (Msc kvetch on rejected posts)
- 05/09 Stephen Senn: Double Jeopardy?: Judge Jeffreys Upholds the Law (sequel to the pathetic P-value)
- 05/16 “Error statistical modeling and inference: Where methodology meets ontology” A. Spanos and D. Mayo
- 05/19 Workshop on Replication in the Sciences: Society for Philosophy and Psychology: (2nd part of double header)
- 05/24 From our “Philosophy of Statistics” session: APS 2015 convention
- 05/27 “Intentions” is the new code word for “error probabilities”: Allan Birnbaum’s Birthday
- 05/30 3 YEARS AGO (MAY 2012): Saturday Night Memory Lane

**June 2015**

- 06/04 What Would Replication Research Under an Error Statistical Philosophy Be?
- 06/09 “Fraudulent until proved innocent: Is this really the new “Bayesian Forensics”? (rejected post)
- 06/11 Evidence can only strengthen a prior belief in low data veracity, N. Liberman & M. Denzler: “Response”
- 06/14 Some statistical dirty laundry: The Tilberg (Stapel) Report on “Flawed Science”
- 06/18 Can You change Your Bayesian prior? (ii)
- 06/25 3 YEARS AGO (JUNE 2012): MEMORY LANE
- 06/30 Stapel’s Fix for Science? Admit the story you want to tell and how you “fixed” the statistics to support it!

**July 2015**

- 07/03 Larry Laudan: “When the ‘Not-Guilty’ Falsely Pass for Innocent”, the Frequency of False Acquittals (guest post)
- 07/09 Winner of the June Palindrome contest: Lori Wike
- 07/11 Higgs discovery three years on (Higgs analysis and statistical flukes)
- 07/14 Spot the power howler: α = β?
- 07/17 “Statistical Significance” According to the U.S. Dept. of Health and Human Services (ii)
- 07/22 3 YEARS AGO (JULY 2012): MEMORY LANE
- 07/24 Stephen Senn: Randomization, ratios and rationality: rescuing the randomized clinical trial from its critics
- 07/29 Telling What’s True About Power, if practicing within the error-statistical tribe

**August 2015**

- 08/05 Neyman: Distinguishing tests of statistical hypotheses and tests of significance might have been a lapse of someone’s pen
- 08/08 Statistical Theater of the Absurd: “Stat on a Hot Tin Roof”
- 08/11 A. Spanos: Egon Pearson’s Neglected Contributions to Statistics
- 08/14 Performance or Probativeness? E.S. Pearson’s Statistical Philosophy
- 08/15 Severity in a Likelihood Text by Charles Rohde
- 08/19 Statistics, the Spooky Science
- 08/20 How to avoid making mountains out of molehills, using power/severity
- 08/24 3 YEARS AGO (AUGUST 2012): MEMORY LANE
- 08/31 The Paradox of Replication, and the vindication of the P-value (but she can go deeper) 9/2/15 update (ii)

**September 2015**

- 09/05 All She Wrote (so far): Error Statistics Philosophy: 4 years on
- 09/11 Peircean Induction and the Error-Correcting Thesis (Part I)
- 09/12 (Part 2) Peircean Induction and the Error-Correcting Thesis
- 09/14 (Part 3) Peircean Induction and the Error-Correcting Thesis
- 09/16 Popper on pseudoscience: a comment on Pigliucci (i), (ii) 9/18, (iii) 9/20
- 09/22 Statistical rivulets: Who wrote this?
- 09/23 George Barnard: 100th birthday: “We need more complexity” (and coherence) in statistical education
- 09/26 G.A. Barnard: The “catch-all” factor: probability vs likelihood
- 09/28 3 YEARS AGO (SEPTEMBER 2012): MEMORY LANE
- 09/30 Oy Faye! What are the odds of not conflating simple conditional probability and likelihood with Bayesian success stories?

**October 2015**

- 10/04 Will the Real Junk Science Please Stand Up?
- 10/07 In defense of statistical recipes, but with enriched ingredients (scientist sees squirrel)
- 10/10 P-value madness: A puzzle about the latest test ban (or ‘don’t ask, don’t tell’)
- 10/14 “Frequentist Accuracy of Bayesian Estimates” (Efron Webinar announcement)
- 10/18 Statistical “reforms” without philosophy are blind (v update)
- 10/24 3 YEARS AGO (OCTOBER 2012): MEMORY LANE
- 10/31 WHIPPING BOYS AND WITCH HUNTERS (ii)

**November 2015**

- 11/05 S. McKinney: On Efron’s “Frequentist Accuracy of Bayesian Estimates” (Guest Post)
- 11/09 Findings of the Office of Research Misconduct on the Duke U (Potti/Nevins) cancer trial fraud: No one is punished but the patients
- 11/13 “What does it say about our national commitment to research integrity?”
- 11/20 Erich Lehmann: Neyman-Pearson & Fisher on P-values
- 11/25 3 YEARS AGO (NOVEMBER 2012): MEMORY LANE
- 11/28 Return to the Comedy Hour: P-values vs posterior probabilities (1)

**December 2015**

- 12/05 Beware of questionable front page articles warning you to beware of questionable front page articles (2)
- 12/12 Stephen Senn: The pathetic P-value (Guest Post) [3]
- 12/17 Gelman on ‘Gathering of philosophers and physicists unaware of modern reconciliation of Bayes and Popper’
- 12/20 Stephen Senn: Double Jeopardy?: Judge Jeffreys Upholds the Law (sequel to the pathetic P-value)[4]
- 12/26 3 YEARS AGO (DECEMBER 2012): MEMORY LANE
- 12/31 Midnight With Birnbaum (Happy New Year)

**January 2016**

- 01/02 Sir Harold Jeffreys’ (tail area) one-liner: Sat night comedy [draft ii]
- 01/08 Preregistration Challenge: My email exchange
- 01/10 Winner of December Palindrome: Mike Jacovides
- 01/11 “On the Brittleness of Bayesian Inference,” Owhadi, Scovel, and Sullivan (PUBLISHED)
- 01/17 “P-values overstate the evidence against the null”: legit or fallacious?
- 01/19 High error rates in discussions of error rates (1/21/16 update)
- 01/24 Hocus pocus! Adopt a magician’s stance, if you want to reveal statistical sleights of hand
- 01/29 3 YEARS AGO (JANUARY 2013): MEMORY LANE

**February 2016**

- 02/03 Philosophy-laden meta-statistics: Is “technical activism” free of statistical philosophy? (ii)
- 02/12 Rubbing off, uncertainty, confidence, and Nate Silver
- 02/17 Can’t Take the Fiducial Out of Fisher (if you want to understand the N-P performance philosophy) [i]
- 02/20 Deconstructing the Fisher-Neyman conflict wearing fiducial glasses (continued)
- 02/27 3 YEARS AGO (FEBRUARY 2013): MEMORY LANE
- 02/29 Statistical Challenges in Assessing and Fostering the Reproducibility of Scientific Results

**March 2016**

- 03/04 Repligate Returns (or, the non-significance of non-significant results, are the new significant results)
- 03/07 Don’t throw out the error control baby with the bad statistics bathwater
- 03/12: “A small p-value indicates it’s improbable that the results are due to chance alone” –fallacious or not? (more on the ASA p-value doc)
- 03/19: Your chance to continue the “due to chance” discussion in roomier quarters
- 03/22: All She Wrote (so far): Error Statistics Philosophy: 4.5 years on
- 03/26: A. Spanos: Talking back to the critics using error statistics

**April 2016**

- 04/01: Er, about those “other statistical approaches”: Hold off until a balanced critique is in?
- 04/06: I’m speaking at Univ of Minnesota on Friday
- 04/09: Winner of March 2016 Palindrome: Manan Shah
- 04/11: When the rejection ratio (1 – β)/α turns evidence on its head, for those practicing in an error-statistical tribe (ii)
- 04/16: Jerzy Neyman and “Les Miserables Citations” (statistical theater in honor of his birthday)
- 04/23: 3 YEARS AGO (MARCH & APRIL 2013): MEMORY LANE
- 04/30: Yes, these were not (entirely) real–my five April pranks

**May 2016**

- 05/02: Philosophy & Physical Computing Graduate Workshop at VT
- 05/04: My Popper Talk at LSE: The Statistical Replication Crisis: Paradoxes and Scapegoats
- 05/09: Some bloglinks for my LSE talk tomorrow: “The Statistical Replication Crisis: Paradoxes and Scapegoats”
- 05/10: My Slides: “The Statistical Replication Crisis: Paradoxes and Scapegoats”
- 05/12: Excerpts from S. Senn’s Letter on “Replication, p-values and Evidence”
- 05/14: Fallacies of Rejection, Nouvelle Cuisine, and assorted New Monsters
- 05/22: Frequentstein: What’s wrong with (1 – β)/α as a measure of evidence against the null? (ii)
- 05/27: Allan Birnbaum: Foundations of Probability and Statistics (27 May 1923 – 1 July 1976)
- 05/30: 3 YEARS AGO (MAY 2013): MEMORY LANE

**June 2016**

- 06/02: “A sense of security regarding the future of statistical science…” Anon review of Error and Inference
- 06/05: Winner of May 2016 Palindrome Contest: Curtis Williams
- 06/08: “So you banned p-values, how’s that working out for you?” D. Lakens exposes the consequences of a puzzling “ban” on statistical inference
- 06/15: “Using PhilStat to Make Progress in the Replication Crisis in Psych” at Society for PhilSci in Practice (SPSP)
- 06/19: Mayo & Parker “Using PhilStat to Make Progress in the Replication Crisis in Psych” SPSP Slides
- 06/22: Some statistical dirty laundry: have the stains become permanent?
- 06/25: What do these share in common: m&ms, limbo stick, ovulation, Dale Carnegie? Are we lowering the bar?
- 06/29: Richard Gill: “Integrity or fraud… or just questionable research practices?” (Is Gill too easy on them?)

**July 2016**

- 07/01: A. Birnbaum: Statistical Methods in Scientific Inference (May 27, 1923 – July 1, 1976)
- 07/06: 3 YEARS AGO (JUNE 2013): MEMORY LANE
- 07/11: Philosophy and History of Science Announcements
- 07/21: “Nonsignificance Plus High Power Does Not Imply Support for the Null Over the Alternative.”
- 07/28: For Popper’s Birthday: A little Popper self-test & reading from Conjectures and Refutations
- 07/30: 3 YEARS AGO (JULY 2013): MEMORY LANE

**August 2016**

- 08/02: S. Senn: “Painful dichotomies” (Guest Post)
- 08/09: If you think it’s a scandal to be without statistical falsification, you will need statistical Ju
- 08/16: Performance of Probativeness? E.S. Pearson’s Statistical Philosophy
- 08/18: History of statistics sleuths out there? “Ideas came into my head as I sat on a gate overlooking an experimental blackcurrant plot”–No wait, it was apples, probably
- 08/21: Larry Laudan: “‘Not Guilty’: The Misleading Verdict” (Guest Post)
- 08/28: TragiComedy hour: P-values vs posterior probabilities vs diagnostic error rates

**[i] Table of Contents (compiled by N. Jinn & J. Miller)***

*I thank Jean Miller for her assiduous work on the blog. I’m very grateful to guest posters in the past year: Laudan, Spanos, Senn, and to all contributors and readers for helping “frequentists in exile” to feel (and truly become) less exiled–wherever they may be!

Filed under: blog contents, Metablog, Statistics ]]>

*Did you hear the one about the frequentist significance tester when he was shown the nonfrequentist nature of p-values?*

Critic: I just simulated a long series of tests on a pool of null hypotheses, and I found that among tests with p-values of .05, at least 22%—and typically over 50%—of the null hypotheses are true!

Frequentist Significance Tester (scratches head): But rejecting the null with a p-value of .05 ensures erroneous rejection no more than 5% of the time!

Raucous laughter ensues!

(Hah, hah… “So funny, I forgot to laugh! Or, I’m crying and laughing at the same time!”)

The frequentist tester should retort:

Frequentist Tester: But you assume 50% of the null hypotheses are true, compute P(*H*_{0}|**x**) using P(*H*_{0}) = .5, imagine the null is rejected based on a single small p-value, and then blame the p-value for disagreeing with the result of your computation!

At times you even use α and power as likelihoods in your analysis! Such uses violate the principles of both Fisherian and Neyman-Pearson tests.

It is well known that for a fixed p-value, with a sufficiently large *n*, even a statistically significant result can correspond to a large posterior in *H*_{0}. This Jeffreys-Lindley “disagreement” is considered problematic for Bayes ratios (e.g., Bernardo). It is not problematic for error statisticians: we always indicate the extent of discrepancy that is and is not indicated, and avoid making mountains out of molehills (see Spanos 2013). J. Berger and Sellke (1987) attempt to generalize the result to show the “exaggeration” even without large *n*. From their Bayesian perspective, it appears that p-values come up short; error statistical testers (and even some tribes of Bayesians) balk at the fact that using the recommended priors allows highly significant results to be interpreted as no evidence against the null, or even evidence for it!

The conflict between p-values and Bayesian posteriors is typically illustrated with the two-sided test of a Normal mean, *H*_{0}: μ = μ_{0} versus *H*_{1}: μ ≠ μ_{0}.

“If *n* = 50 one can classically ‘reject *H*_{0} at significance level p = .05,’ although Pr(*H*_{0}|**x**) = .52 (which would actually indicate that the evidence favors *H*_{0}).” (Berger and Sellke, 1987, p. 113)

If *n* = 1000, a result statistically significant at the .05 level leads to a posterior probability of .82 on the null!
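For readers who want to check these figures, they follow from the standard setup of the example: prior mass of .5 on *H*_{0}, with the remaining .5 given to a N(μ_{0}, σ²) prior on μ under *H*_{1}, so that the Bayes factor in favor of the null, at z-statistic z and sample size n, is √(n+1)·exp(−(z²/2)·n/(n+1)). Here is a minimal sketch (my own code, not Berger and Sellke’s; the function name is illustrative):

```python
from math import sqrt, exp

def posterior_null(z, n, pi0=0.5):
    """P(H0 | data) for H0: mu = mu0 vs H1: mu != mu0, with prior
    mass pi0 on H0 and a N(mu0, sigma^2) prior on mu under H1
    (the setup used in Berger and Sellke's example)."""
    bayes_factor = sqrt(n + 1) * exp(-(z**2 / 2) * (n / (n + 1)))
    return pi0 * bayes_factor / (pi0 * bayes_factor + (1 - pi0))

# z = 1.96 corresponds to a two-sided p-value of .05
print(round(posterior_null(1.96, 50), 2))    # -> 0.52
print(round(posterior_null(1.96, 1000), 2))  # -> 0.82
```

Holding the p-value fixed at .05 while n grows drives the posterior on the null toward 1: the Jeffreys-Lindley effect.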

Some find the example shows the p-value “overstates evidence against a null” because it uses an “impartial” or “uninformative” Bayesian prior probability assignment of .5 to *H*_{0}, the remaining .5 being spread out over the alternative parameter space. (“Spike and slab,” I’ve heard Gelman call this, derisively.) Others demonstrate that the problem is not p-values but the high prior.

Casella and R. Berger (1987) show that “concentrating mass on the point null hypothesis is biasing the prior in favor of *H*_{0} as much as possible” (p. 111), whether in 1- or 2-sided tests. Note, too, the conflict with confidence interval reasoning, since the null value (here it is 0) lies outside the corresponding confidence interval (Mayo 2005). Many complain that the “spiked concentration of belief in the null” is at odds with the view that “we know all nulls are false” (even though that view is also false). See Senn’s interesting points on this same issue in his letter (to Goodman) here.

But often, as in the opening joke, the prior assignment is claimed to be keeping to the frequentist camp and to frequentist error probabilities. How’s that supposed to work? It is imagined that we sample randomly from a population of hypotheses, k% of which are assumed to be true. 50% is a common number. We randomly draw a hypothesis and get this particular one, maybe it concerns the mean deflection of light, or perhaps it is an assertion of bioequivalence of two drugs or whatever. The percentage “initially true” (in this urn of nulls) serves as the prior probability for your particular *H*_{0}. I see this gambit in statistics, psychology, philosophy and elsewhere, and yet it commits a **fallacious instantiation of probabilities:**

50% of the null hypotheses in a given pool of nulls are true.

This particular null *H*_{0} was randomly selected from this urn (some may wish to add “nothing else is known,” which would scarcely be true here).

Therefore, P(*H*_{0} is true) = .5.

*I discussed this 20 years ago*, in Mayo 1997a and b (links in the references), and ever since. However, statistical fallacies repeatedly return to fashion in slightly different guises. Nowadays, you’re most likely to see it within what may be called *diagnostic screening models* of tests.

It’s not that you can’t play a carnival game of reaching into an urn of nulls (and there are lots of choices for what to put in the urn), and use a Bernoulli model for the chance of drawing a true hypothesis (assuming we even knew the % of true hypotheses, which we do not). But the “event of drawing a true null” is no longer the particular hypothesis one aims to use in computing the probability of data **x**_{0} under hypothesis *H*_{0}. In other words, it’s no longer the *H*_{0} needed for the likelihood portion of the frequentist computation. (Note, too, the selected null would get the benefit of being selected from an urn of nulls where few have been shown false yet: “innocence by association”. See my comment on J. Berger 2003, pp. 19-24.)

In any event, .5 is not the frequentist probability that the selected null *H*_{0} is true–in those cases where a frequentist prior exists. (I first discussed the nature of legitimate frequentist priors with Erich Lehmann; see the poem he wrote for me as a result in Mayo 1997a).

**The diagnostic screening model of tests.** This model has become increasingly commonplace, thanks to Big Data, perverse incentives, nonreplication and all the rest (Ioannidis 2005). As Taleb puts it:

“With big data, researchers have brought cherry-picking to an industrial level”.

Now the diagnostic screening model is apt for various goals: diagnostic screening (for disease) most obviously, but also TSA bag checks, high-throughput studies in genetics, and other contexts where the concern is controlling the noise in the network rather than appraising the well-testedness of your research claim. Dichotomies are fine for diagnostics (disease or not, worth further study or not, dangerous bag or not). But forcing scientific inference into a binary basket is what most of us wish to move away from, yet the new screening model dichotomizes results into significant/non-significant, usually at the .05 level. One shouldn’t mix the notions of prevalence, positive predictive value, negative predictive value, etc. from screening with the concepts of statistical testing in science. Yet people do, and there are at least two tragicomic results. One is that error probabilities are blamed for disagreeing with measures of completely different things. One journal editor claims the fact that p-values differ from posteriors proves the “invalidity” of p-values.
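The two quantities being conflated can be put side by side with a few lines of arithmetic. A minimal sketch (the function name and the α, power, and prevalence values are mine, chosen for illustration, not anyone’s estimates):

```python
def p_null_given_reject(prev_true_null, alpha=0.05, power=0.8):
    """Diagnostic-screening quantity: the proportion of true nulls
    among rejections, P(H0 | p <= alpha), for an imagined urn of
    hypotheses in which prev_true_null of the nulls are true.
    This conditions the opposite way from a type 1 error
    probability, which is P(reject | H0) = alpha, fixed by the test."""
    rejections_of_true_nulls = prev_true_null * alpha
    rejections_of_false_nulls = (1 - prev_true_null) * power
    return rejections_of_true_nulls / (
        rejections_of_true_nulls + rejections_of_false_nulls)

# Same test, same alpha = .05, but the screening number swings
# with the assumed prevalence of true nulls in the urn:
print(round(p_null_given_reject(0.5), 3))  # -> 0.059
print(round(p_null_given_reject(0.9), 3))  # -> 0.36
```

The type 1 error probability stays at .05 in both cases; only the screening quantity moves, because it depends on the assumed prevalence. That is exactly why blaming the p-value for disagreeing with it is comedic.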

The second tragicomic result is that inconsistent meanings of type 1 (and type 2) error probabilities have found their way into the latest reforms, and into guidebooks for how to avoid inconsistent interpretations of statistical concepts. Whereas there’s a trade-off between type 1 and type 2 error probabilities in Neyman-Pearson style hypothesis tests, this is no longer true when a type 1 error probability is defined as the posterior of *H*_{0} conditional on rejecting. Topsy-turvy claims about power readily ensue (search this blog under power for numerous examples).

**Conventional Bayesian variant.** J. Berger doesn’t really imagine selecting from an urn of nulls (he claims). Instead, spiked priors come from one of the systems of default or conventional priors. Curiously, he claims that by adopting his recommended conventional priors, frequentists can become *more frequentist* (than by using their flawed error probabilities). We get what he calls conditional p-values (or conditional error probabilities). Magician that he is, the result is that frequentist error probabilities are no longer error probabilities, or even frequentist!

How this happens is not entirely clear, but it’s based on his defining a “Frequentist Principle” that demands that a type 1 (or 2) error probability yield the same number as his conventional posterior probability. (See Berger 2003, and my comment in Mayo 2003.)

Senn, in a guest post remarks:

The error that is made is not only to assume that P-values should be what they are not but that when one tries to interpret them in the way that one should not, the previous calibration survives.

It is as if in giving recommendations in dosing children one abandoned a formula based on age and adopted one based on weight but insisted on using the same number of kg one had used for years.

**Urn of Nulls.** Others appear to be serious about the urn of nulls metaphor (e.g., Colquhoun 2014). Say 50% of the nulls in the urn are imagined to be true. Then, when you select your null, its initial probability of truth is .5. This, however, is to commit the fallacious probabilistic instantiation described above.

Two moves are made: (1) it’s admitted that this is an erroneous probabilistic instantiation, but the goal is said to be assessing “science-wise error rates,” as in a diagnostic screening context. A second move (2) is to claim that a high positive predictive value (PPV) from the diagnostic model warrants a high “epistemic probability”–whatever that is–for the particular case at hand.

The upshot of both is at odds with the goal of restoring scientific integrity. Even if we were to grant these “prevalence rates” (to allude to diagnostic testing), my question is: why would it be relevant to how good a job you did in testing your particular hypothesis, call it *H**? Sciences with high “crud factors” (Meehl 1990) might well get a high PPV simply because nearly all their nulls are false. This still wouldn’t be evidence of replication ability, nor of understanding of the phenomenon. It would reward non-challenging thinking, and taking the easiest way out.

**Safe Science.** We hear it recommended that research focus on questions and hypotheses with high prior prevalence. Of course we’d never know the % of true nulls (many say all nulls are false, although that too is false) and we could cleverly game the description to have suitably high or low prior prevalence. Just think of how many ways you could describe those urns of nulls to get a desired PPV, especially on continuous parameters. Then there’s the heralding of safe science:

Large-scale evidence should be targeted for research questions where the pre-study probability is already considerably high, so that a significant research finding will lead to a post-test probability that would be considered quite definitive (Ioannidis, 2005, p. 0700).

The diagnostic model, in effect, says keep doing what you’re doing: publish after an isolated significant result, possibly with cherry-picking and selection effects to boot, just make sure there’s high enough prior prevalence. That preregistration often makes previous significant results vanish shows the problem isn’t the statistical method but its abuse. Ioannidis has done much to expose bad methods, but not with the diagnostic model he earlier popularized.

In every case of a major advance or frontier science that I can think of, there had been little success in adequately explaining some effect: low prior prevalence. It took Prusiner 10 years of failed experiments to finally transmit the prion for mad cow disease to chimps. People didn’t believe there could be infection without nucleic acid (some still adhere to the “protein only” hypothesis). He finally won a Nobel Prize, but he would have been spared a lot of torture if he’d just gone along to get along, keeping to the central dogma of biology rather than following the results that upended it. However, it’s the researcher who has worked with a given problem, building on results and subjecting them to scrutiny, who understands the phenomenon well enough not just to replicate it, but to alter the entire process in new ways (e.g., prions are now being linked to Alzheimer’s).

Researchers who have churned out and published isolated significant results, and focused on “research questions where the pre-study probability is already considerably high,” might meet the quota on PPV, but still won’t have the understanding even to show they “know how to conduct an experiment which will rarely fail to give us a statistically significant result”–which was Fisher’s requirement before inferring a genuine phenomenon (Fisher 1947).

Error probabilities are not posterior probabilities. Certainly, there is much more to statistical analysis than P-values but they should be left alone rather than being deformed in some way to become second class Bayesian posterior probabilities. (Senn)

**References & Related articles**

Berger, J. O. (2003). “Could Fisher, Jeffreys and Neyman have Agreed on Testing?” *Statistical Science* 18: 1-12.

Berger, J. O. and Sellke, T. (1987). “Testing a point null hypothesis: The irreconcilability of *p *values and evidence,” (with discussion). *J. Amer. Statist. Assoc. ***82: **112–139.

Casella, G. and Berger, R. (1987). “Reconciling Bayesian and Frequentist Evidence in the One-sided Testing Problem,” (with discussion). *J. Amer. Statist. Assoc.* **82**: 106–111, 123–139.

Colquhoun, D. (2014). “An investigation of the false discovery rate and the misinterpretation of p-values.” *Royal Society Open Science* **1**(3): 1-16.

Fisher, R. A., (1956). *Statistical Methods and Scientific Inference*, Edinburgh: Oliver and Boyd.

Fisher, R.A. (1947), *Design of Experiments.*

Ioannidis, J. (2005). “Why Most Published Research Findings Are False”.

Jeffreys, H. (1939). *Theory of Probability*, Oxford: Oxford University Press.

Mayo, D. (1997a). “Duhem’s Problem, The Bayesian Way, and Error Statistics, or ‘What’s Belief got To Do With It?’” and “Response to Howson and Laudan,” *Philosophy of Science* **64**(1): 222-244 and 323-333. NOTE: This issue only comes up in the “Response”, but it made most sense to include both here.

Mayo, D. (1997b) “Error Statistics and Learning from Error: Making a Virtue of Necessity,” in L. Darden (ed.) *Supplemental Issue PSA 1996: Symposia Papers, Philosophy of Science ***64**: S195-S212.

Mayo, D. (2003). Comment on J. O. Berger’s “Could Fisher, Jeffreys and Neyman Have Agreed on Testing?”, *Statistical Science* **18**: 19-24.

Mayo, D. (2004). “An Error-Statistical Philosophy of Evidence,” in M. Taper and S. Lele (eds.) The Nature of Scientific Evidence: Statistical, Philosophical and Empirical Considerations. Chicago: University of Chicago Press: 79-118.

Mayo (2005). “Philosophy of Statistics” in S. Sarkar and J. Pfeifer (eds.) Philosophy of Science: An Encyclopedia, London: Routledge: 802-815. (Has typos.)

Mayo, D.G. and Cox, D. R. (2006). “Frequentists Statistics as a Theory of Inductive Inference,” *Optimality: The Second Erich L. Lehmann Symposium *(ed. J. Rojo), Lecture Notes-Monograph series, Institute of Mathematical Statistics (IMS), Vol. 49: 77-97.

Mayo, D. and Kruse, M. (2001). “Principles of Inference and Their Consequences,” in D. Cornfield and J. Williamson (eds.) *Foundations of Bayesianism*. Dordrecht: Kluwer Academic Publishes: 381-403.

Mayo, D. and Spanos, A. (2011). “Error Statistics” in *Philosophy of Statistics , Handbook of Philosophy of Science* Volume 7 *Philosophy of Statistics*, (General editors: Dov M. Gabbay, Paul Thagard and John Woods; Volume eds. Prasanta S. Bandyopadhyay and Malcolm R. Forster.) Elsevier: 1-46.

Meehl, P. E. (1990). Why summaries of research on psychological theories are often uninterpretable. *Psychological Reports* 66 (1): 195-244.

Pratt, J. (1987). “Testing a point null hypothesis: The irreconcilability of *p* values and evidence: Comment.” *J. Amer. Statist. Assoc.* **82**: 123-125.

Prusiner, S. (1991). Molecular Biology of Prion Diseases. *Science,* *252*(5012), 1515-1522.

Prusiner, S. B. (2014) *Madness and Memory: The Discovery of Prions—a New Biological Principle of Disease*, New Haven, Connecticut: Yale University Press.

Spanos, A. (2013). “Who Should Be Afraid of the Jeffreys-Lindley Paradox”.

Taleb, N. (2013). “Beware the Big Errors of Big Data”. Wired.

Filed under: Bayesian/frequentist, Comedy, significance tests, Statistics ]]>

**Prof. Larry Laudan**

*Lecturer in Law and Philosophy*

*University of Texas at Austin*

**“‘Not Guilty’: The Misleading Verdict and How It Fails to Serve either Society or the Innocent Defendant”**

Most legal systems in the developed world share a two-tier verdict system: ‘guilty’ and ‘not guilty’. Typically, the standard for a judgment of guilty is set very high while the standard for a not-guilty verdict (if we can call it that) is quite low. That means any level of apparent guilt less than about 90% confidence that the defendant committed the crime leads to an acquittal (90% being the usual gloss on proof beyond a reasonable doubt, although few legal systems venture a definition of BARD that precise). According to conventional wisdom, the major reason for setting the standard as high as we do is the desire, even the moral necessity, to shield the innocent from false conviction.

There is, however, an egregious drawback to a legal system so structured. To wit, a verdict of ‘not guilty’ tells us nothing whatever about whether it is reasonable to believe that the defendant did not commit the crime. It offers no grounds whatever for inferring that an acquitted defendant probably did not commit the crime. That fact alone should make most of us leery about someone acquitted of a felony. Will a bank happily hire someone recently acquitted of a forgery charge? Are the neighbors going to rest easy when one of them was charged with, and then acquitted of, child molestation?

While the current proof standard provides ample protection to the innocent from being falsely convicted (the false positive rate is ~3%), it does little or nothing to protect the reputation of the truly innocent defendants. If properly understood, it fails to send any message to the general public about how they should regard and treat an acquitted defendant because it fails to tell the public whether it’s likely or unlikely that he committed the crime.

It would not be difficult to remedy this centuries-old mess, both for the public and for the acquitted defendant, by employing a *three-verdict* system, as the Scots have been doing for some time. Their verdicts are: guilty, guilt not proven and innocent. In a Scottish trial, if guilt is proven beyond a reasonable doubt, the defendant is found guilty; if the jury thinks it more likely than not that defendant committed no crime, his verdict is ‘innocent’; if the jury suspects that defendant did the crime but is not sure beyond all reasonable doubt, the verdict is ‘guilt not proven’. Both the guilt-not-proven verdict and the innocence verdict are officially acquittals in the sense that those receiving it serve no jail time. (This gives a whole new meaning to the well-known phrase ‘going scot-free’.)

The Scottish verdict pattern serves the interests of both the innocent defendant and the general society. The Scots know that if a defendant received an innocent verdict, then the jury believed it likely that he committed no crime and that he should be treated accordingly. That is both important information for the citizenry and a substantial protection for the innocent defendant himself, since the innocent verdict is in effect an exoneration, entailing the likelihood of his innocence.

On the other hand, the Scottish guilt-not-proven verdict sends out the important message to citizens that no other Anglo-Saxon legal system can; to wit, that the acquitted defendant (with a guilt-not-proven verdict) should be treated warily by society since he was probably the culprit, even though he was neither convicted nor punished.

Interestingly, there is ample use of the intermediary verdict. In a study of criminal prosecutions in 2005 and 2006, the Scottish government reports that 71% of those defendants tried for homicide and acquitted received a ‘guilt-not-proven’ verdict. That means that about 7-in-10 acquittals for murder in Scotland involved defendants regarded by the jurors as having probably committed the crime.[1] In a more recent analysis, the Scottish government reported that in rape cases some 35% of acquittals resulted in ‘guilt not proven’ verdicts. In murder cases, the probably-guilty verdict rate was 27% of all acquittals.[2]

It’s worth adding that Scotland’s intermediary verdict gives us access to information on an error whose frequency no other Western legal system can easily compute: to wit, the frequency of false acquittals. It tells us that, at least in Scotland, the rate of false acquittals hovers between 1-in-4 and 1-in-3. That is crucial information for those of us who believe that a legitimate system of inquiry—whether a legal one or otherwise— must get a handle on its error rates. Without knowing that, we cannot possibly figure out whether the distribution of erroneous verdicts is in line with our beliefs about the respective costs of the two errors.

Scottish criminal law has one other interesting feature worthy of mention in this context: a verdict there requires only a majority vote from the 15 citizens who serve as the jury. By contrast, most American states require a unanimous vote among 12 jurors, contributing to a situation in which mistrials are both expensive and common. They are expensive because they usually lead to re-trials, which are rarely cheap. In some jurisdictions in the US, 20% or more of trials end in a hung jury.[3] Not surprisingly, hung juries in Scottish cases are much less frequent.

***

[1] See http://www.scotland.gov.uk/Publications/2006/04/25104019/11. See also the *Scottish Government Statistical Bulletin*, Crim/2006/Part 11.

[2] See Scottish Government, Criminal Proceedings in Scotland, 2013-14, Table 2B.

[3] A study by Paula Agor *et al*. (*Are Hung Juries a Problem?* National Center for State Courts and National Institute of Justice, 2002) found that in Washington, D.C. Superior Courts some 22.4% of jury trials ended in a hung jury; in Los Angeles Superior Courts, the hung jury rate was 19.5%.

**ADDITIONAL RESOURCES:**

- Larry Laudan, “Need Verdicts Come in Pairs?”
*International Journal of Evidence and Proof*, vol. 14 (2010), 1-24.

*Previous guest posts:*

- Larry Laudan (July 20, 2013): Guest Post: Larry Laudan. Why Presuming Innocence is Not a Bayesian Prior
- Larry Laudan (July 3, 2015): “When the ‘Not-Guilty’ Falsely Pass for Innocent”, the Frequency of False Acquittals (guest post)

*Among Laudan’s books:*

1977. *Progress and its Problems: Towards a Theory of Scientific Growth*

1981.

Filed under: L. Laudan, PhilStat Law Tagged: L. Laudan ]]>

Here you see my scruffy sketch of Egon, drawn 20 years ago for the frontispiece of my book *Error and the Growth of Experimental Knowledge* (EGEK 1996). The caption reads:

“I might recall how certain early ideas came into my head as I sat on a gate overlooking an experimental blackcurrant plot…” –E.S. Pearson, “Statistical Concepts in Their Relation to Reality”

He is responding to Fisher to “dispel the picture of the Russian technological bogey”. [i]

So, as I said in my last post, just to make a short story long, I’ve recently been scouring around the history and statistical philosophies of Neyman, Pearson and Fisher for purposes of a book soon to be completed, and I discovered a funny little error about this quote. Only maybe 3 or 4 people alive would care, but maybe someone out there knows the real truth.

OK, so I’d been rereading Constance Reid’s great biography of Neyman, and in one place she interviews Egon about the sources of inspiration for their work. Here’s what Egon tells her:

One day at the beginning of April 1926, down ‘in the middle of small samples,’ wandering among apple plots at East Malling, where a cousin was director of the fruit station, he was ‘suddenly smitten,’ as he later expressed it, with a ‘doubt’ about the justification for using Student’s ratio (the t-statistic) to test a normal mean. (Quotes are from Pearson in Reid, p. 60)

Soon after, Egon contacted Neyman and their joint work began.

I assumed the meanderings over apple plots were from a different time, and that Egon just had a habit of conducting his deepest statistical thinking while overlooking fruit. Yet the episode shared certain unique features with the revelation while gazing over at the blackcurrant plot, as in my picture, if only in the date and the great importance he accorded it (although I never recall his saying he was “smitten” before). I didn’t think more about it. Then, late one night last week, I grabbed a peculiar book off my shelf that contains a smattering of writings by Pearson for a work he never completed: *Student: A Statistical Biography of William Sealy Gosset* (1990, edited and augmented by Plackett and Barnard, Clarendon, Oxford). The very *first* thing I open up to is a note by Egon Pearson:

I cannot recall now what was the form of the doubt which struck me at East Malling, but it would naturally have arisen when discussing there the interpretation of results derived from small experimental plots. I seem to visualize myself sitting alone on a gate thinking over the basis of ‘small sample’ theory and ‘mathematical statistics Mark II’ [i.e., Fisher]. When nearly thirty years later (JRSS B, 17, 204 1955), I wrote refuting the suggestion of R.A.F. [Fisher] that the Neyman-Pearson approach to testing statistical hypotheses had arisen in industrial acceptance procedures, the plot which the gate was overlooking had through the passage of time become a blackcurrant one! (Pearson 1990 p. 81)

**What? This is weird.** So that must mean it wasn’t blackcurrants after all, and Egon is mistaken in the caption under the picture I drew 20 years ago. Yet he doesn’t say here that it was apples either, only that the plot had “become a blackcurrant” one in a later retelling. So, not blackcurrants; putting this clue together with what he told Constance Reid, it must have been apples. It appears I can no longer quote that “blackcurrant” statement, at least not without explaining that, in all likelihood, it was really apples.

[i] Some of the previous lines, and 6 following words:

There was no question of a difference in point of view having ‘originated’ when Neyman ‘re-interpreted’ Fisher’s early work on tests of significance ‘in terms of that technological and commercial apparatus which is known as an acceptance procedure’. …

Indeed, to dispel the picture of the Russian technological bogey, I might recall how certain early ideas came into my head as I sat on a gate overlooking an experimental blackcurrant plot at the East Malling Research Station!–E.S. Pearson, “Statistical Concepts in Their Relation to Reality”

[ii] As Erich Lehmann put it in his EGEK review, Pearson is “the hero of Mayo’s story” because I found in his work, if only in brief discussions, hints, and examples, the key elements for an “inferential” or “evidential” interpretation of Neyman. So I should get the inspirational fruit correct.

[iii] I’m not saying I know the answer isn’t in the book on Student, or someplace else.

Fisher, R.A. (1955), “Statistical Methods and Scientific Induction”.

Pearson, E.S. (1955), “Statistical Concepts in Their Relation to Reality”.

Reid, C. 1998, *Neyman–From Life*. Springer.

Filed under: E.S. Pearson, phil/history of stat, Statistics ]]>

This is a belated birthday post for E.S. Pearson (11 August 1895 – 12 June 1980). It’s basically a post from 2012 which concerns an issue of interpretation (long-run performance vs probativeness) that’s badly confused these days. I’ve recently been scouring around the history and statistical philosophies of Neyman, Pearson and Fisher for purposes of a book soon to be completed. I recently discovered a little anecdote that calls for a correction in something I’ve been saying for years. While it’s little more than a point of trivia, it’s in relation to Pearson’s (1955) response to Fisher (1955)–the last entry in this post. I’ll wait until tomorrow or the next day to share it, to give you a chance to read the background.

Are methods based on error probabilities of use mainly to supply procedures which will not err too frequently in some long run? (*performance*). Or is it the other way round: that the control of long run error properties are of crucial importance for probing the causes of the data at hand? (*probativeness*). I say no to the former and yes to the latter. This, I think, was also the view of Egon Sharpe (E.S.) Pearson.

*Cases of Type A and Type B*

“How far then, can one go in giving precision to a philosophy of statistical inference?” (Pearson 1947, 172)

Pearson considers the rationale that might be given to N-P tests in two types of cases, A and B:

“(A) At one extreme we have the case where repeated decisions must be made on results obtained from some routine procedure…

(B) At the other is the situation where statistical tools are applied to an isolated investigation of considerable importance…?” (ibid., 170)

In cases of type A, long-run results are clearly of interest, while in cases of type B, repetition is impossible and may be irrelevant:

“In other and, no doubt, more numerous cases there is no repetition of the same type of trial or experiment, but all the same we can and many of us do use the same test rules to guide our decision, following the analysis of an isolated set of numerical data. Why do we do this? What are the springs of decision? Is it because the formulation of the case in terms of hypothetical repetition helps to that clarity of view needed for sound judgment?

Or is it because we are content that the application of a rule, now in this investigation, now in that, should result in a long-run frequency of errors in judgment which we control at a low figure?” (Ibid., 173)

Although Pearson leaves this tantalizing question unanswered, claiming, “On this I should not care to dogmatize”, in studying how Pearson treats cases of type B, it is evident that in his view, “the formulation of the case in terms of hypothetical repetition helps to that clarity of view needed for sound judgment” in learning about the particular case at hand.

“Whereas when tackling problem A it is easy to convince the practical man of the value of a probability construct related to frequency of occurrence, in problem B the argument that ‘if we were to repeatedly do so and so, such and such result would follow in the long run’ is at once met by the commonsense answer that we never should carry out a precisely similar trial again.

Nevertheless, it is clear that the scientist with a knowledge of statistical method behind him can make his contribution to a round-table discussion…” (Ibid., 171).

Pearson gives the following example of a case of type B (from his wartime work), where he claims no repetition is intended:

“Example of type B. Two types of heavy armour-piercing naval shell of the same caliber are under consideration; they may be of different design or made by different firms…. Twelve shells of one kind and eight of the other have been fired; two of the former and five of the latter failed to perforate the plate….” (Pearson 1947, 171)

“Starting from the basis that individual shells will never be identical in armour-piercing qualities, however good the control of production, he has to consider how much of the difference between (i) two failures out of twelve and (ii) five failures out of eight is likely to be due to this inevitable variability…” (Ibid.)

*We’re interested in considering what other outcomes could have occurred, and how readily, in order to learn what variability alone is capable of producing.* As a noteworthy aside, Pearson shows that treating the observed difference (between the two proportions) in one way yields an observed significance level of 0.052; treating it differently (along Barnard’s lines), he gets 0.025 as the (upper) significance level. But in scientific cases, Pearson insists, the difference in error probabilities makes no real difference to substantive judgments in interpreting the results. Only in an unthinking, automatic, routine use of tests would it matter:

“Were the action taken to be decided automatically by the side of the 5% level on which the observation point fell, it is clear that the method of analysis used would here be of vital importance. But no responsible statistician, faced with an investigation of this character, would follow an automatic probability rule.” (ibid., 192)

The two analyses correspond to the tests effectively asking different questions, and if we recognize this, says Pearson, different meanings may be appropriately attached.
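Pearson’s first significance level can be checked directly: 0.052 is the one-sided hypergeometric (Fisher-exact-style) tail probability for the observed 2×2 table. A minimal sketch in Python, using only the standard library (the 0.025 figure arises from a different treatment, along Barnard’s lines, and is not reproduced here):

```python
from math import comb

def hypergeom_tail(k, n1, n2, k_total):
    """P(X <= k), where X is the number of the k_total failures that fall
    among the n1 shells of the first type, under the null hypothesis that
    the two shell types are equally liable to fail (conditioning on the
    margins, as in Fisher's exact test)."""
    n = n1 + n2
    return sum(comb(k_total, j) * comb(n - k_total, n1 - j)
               for j in range(k + 1)) / comb(n, n1)

# Pearson's naval-shell data: 2 failures out of 12 vs. 5 failures out of 8
p = hypergeom_tail(k=2, n1=12, n2=8, k_total=7)
print(round(p, 3))  # 0.052
```

Conditioning on the seven total failures, the probability of two or fewer failures among the twelve shells of the first type comes out at about 0.052, matching Pearson’s first figure.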

*Three Steps in the Original Construction of Tests*

After setting up the test (or null) hypothesis, and the alternative hypotheses against which “we wish the test to have maximum discriminating power” (Pearson 1947, 173), Pearson defines three steps in specifying tests:

“Step 1. We must specify the experimental probability set, the set of results which could follow on repeated application of the random process used in the collection of the data…

Step 2. We then divide this set [of possible results] by a system of ordered boundaries…such that as we pass across one boundary and proceed to the next, we come to a class of results which makes us more and more inclined on the Information available, to reject the hypothesis tested in favour of alternatives which differ from it by increasing amounts” (Pearson 1966a, 173).

“Step 3. We then, if possible[i], associate with each contour level the chance that, if [the null] is true, a result will occur in random sampling lying beyond that level” (ibid.).

Pearson warns that:

“Although the mathematical procedure may put Step 3 before 2, we cannot put this into operation before we have decided, under Step 2, on the guiding principle to be used in choosing the contour system. That is why I have numbered the steps in this order.” (Ibid. 173).

Strict behavioristic formulations jump from step 1 to step 3, after which one may calculate how the test has in effect accomplished step 2. However, the resulting test, while having adequate error probabilities, may have an inadequate distance measure and may even be irrelevant to the hypothesis of interest. This is one reason critics can construct howlers that appear to be licensed by N-P methods, and which make their way from time to time into this blog.

So step 3 remains crucial, even for cases of type [B]. There are two reasons: pre-data planning—that’s familiar enough—but secondly, post-data scrutiny. Post data, step 3 enables determining the capability of the test to have detected various discrepancies, departures, and errors, on which a critical scrutiny of the inferences is based. More specifically, the error probabilities are used to determine how well/poorly corroborated, or how severely tested, various claims are, post-data.

If we can readily bring about statistically significantly higher rates of success with the first type of armour-piercing naval shell than with the second (in the above example), we have evidence the first is superior. Or, as Pearson modestly puts it: the results “raise considerable doubts as to whether the performance of the [second] type of shell was as good as that of the [first]….” (Ibid., 192)[ii]

Still, while error rates of procedures may be used to determine how severely claims have or have not passed tests, they do not automatically do so—hence, again, opening the door to potential howlers that neither Egon nor Jerzy, for that matter, would have countenanced.

*Neyman Was the More Behavioristic of the Two*

Pearson was (rightly) considered to have rejected the more behaviorist leanings of Neyman.

Here’s a snippet from an unpublished letter he wrote to Birnbaum (1974) about the idea that the N-P theory admits of two interpretations: behavioral and evidential:

“I think you will pick up here and there in my own papers signs of evidentiality, and you can say now that we or I should have stated clearly the difference between the behavioral and evidential interpretations. Certainly we have suffered since in the way the people have concentrated (to an absurd extent often) on behavioral interpretations”.

In Pearson’s (1955) response to Fisher (blogged here):

“To dispel the picture of the Russian technological bogey, I might recall how certain early ideas came into my head as I sat on a gate overlooking an experimental blackcurrant plot….!” (Pearson 1955, 204)

“To the best of my ability I was searching for a way of expressing in mathematical terms what appeared to me to be the requirements of the scientist in applying statistical tests to his data. After contact was made with Neyman in 1926, the development of a joint mathematical theory proceeded much more surely; it was not till after the main lines of this theory had taken shape with its necessary formalization in terms of critical regions, the class of admissible hypotheses, the two sources of error, the power function, etc., that the fact that there was a remarkable parallelism of ideas in the field of acceptance sampling became apparent. Abraham Wald’s contributions to decision theory of ten to fifteen years later were perhaps strongly influenced by acceptance sampling problems, but that is another story.” (ibid., 204-5)

“It may be readily agreed that in the first Neyman and Pearson paper of 1928, more space might have been given to discussing how the scientific worker’s attitude of mind could be related to the formal structure of the mathematical probability theory….Nevertheless it should be clear from the first paragraph of this paper that we were not speaking of the final acceptance or rejection of a scientific hypothesis on the basis of statistical analysis…. Indeed, from the start we shared Professor Fisher’s view that in scientific enquiry, a statistical test is ‘a means of learning’…” (Ibid., 206)

“Professor Fisher’s final criticism concerns the use of the term ‘inductive behavior’; this is Professor Neyman’s field rather than mine.” (Ibid., 207)

__________________________


**References:**

Pearson, E. S. (1935), The Application of Statistical Methods to Industrial Standardization and Quality Control, London: British Standards Institution.

Pearson, E. S. (1947), “The Choice of Statistical Tests Illustrated on the Interpretation of Data Classed in a 2×2 Table,” *Biometrika* 34(1/2): 139-167.

Pearson, E. S. (1955), “Statistical Concepts in Their Relation to Reality,” *Journal of the Royal Statistical Society, Series B (Methodological)* 17(2): 204-207.

Neyman, J. and Pearson, E. S. (1928), “On the Use and Interpretation of Certain Test Criteria for Purposes of Statistical Inference, Part I.” *Biometrika* 20(A): 175-240.

[i] In some cases only an upper limit to this error probability may be found.

[ii] Pearson inadvertently switches from number of failures to number of successes in the conclusion of this paper.

[iii] I thank Aris Spanos for locating this work of Pearson’s from 1935.

Filed under: 4 years ago!, highly probable vs highly probed, phil/history of stat, Statistics Tagged: E S Pearson

1. **PhilSci and StatSci.** I’m always glad to come across statistical practitioners who wax philosophical, particularly when Karl Popper is cited. Best of all is when they get the philosophy somewhere close to correct. So, I came across an article by Burnham and Anderson (2014) in *Ecology*:

“While the exact definition of the so-called ‘scientific method’ might be controversial, nearly everyone agrees that the concept of ‘falsifiability’ is a central tenant [sic] of empirical science (Popper 1959). It is critical to understand that historical statistical approaches (i.e., P values) leave no way to ‘test’ the alternative hypothesis. The alternative hypothesis is never tested, hence cannot be rejected or falsified!… Surely this fact alone makes the use of significance tests and P values bogus. Lacking a valid methodology to reject/falsify the alternative science hypotheses seems almost a scandal.” (Burnham and Anderson p. 629)

Well I am (almost) scandalized by this easily falsifiable allegation! I can’t think of a single “alternative”, whether in a “pure” Fisherian or a Neyman-Pearson hypothesis test (whether explicit or implicit) that’s not falsifiable; nor do the authors provide any. I grant that understanding testability and falsifiability is far more complex than the kind of popularized accounts we hear about; granted as well, theirs is just a short paper.^{[1]} But then why make bold declarations on the topic of the “scientific method and statistical science,” on falsifiability and testability?

We know that literal deductive falsification only occurs with trite examples like “All swans are white”: a single black swan falsifies the universal claim C: all swans are white, whereas observing a single white swan wouldn’t allow inferring C (unless there was only 1 swan, or no variability in color). But Burnham and Anderson are discussing statistical falsification, and statistical methods of testing. Moreover, the authors champion a methodology that they say has nothing to do with testing or falsifying: “Unlike significance testing”, the approaches they favor “are not ‘tests,’ are not about testing” (p. 628). I’m not disputing their position that likelihood ratios, odds ratios, and Akaike model selection methods are not about testing, but *falsification is all about testing*! No tests, no falsification, not even of the null hypotheses (which they presumably agree significance tests can falsify). It seems almost a scandal, and it would be one if critics of statistical testing were held to a more stringent, more severe, standard of evidence and argument than they are.

**I may add installments/corrections (certainly on E. Pearson’s birthday Thursday); I’ll update with (i), (ii) and the date.**

**A bit of background.** I view significance tests as only a part of a general statistical methodology of testing, estimation, and modeling that employs error probabilities of methods to control and assess how capable methods are at probing errors, and blocking misleading interpretations of data. I call it an *error statistical methodology*. I reformulate statistical tests as tools for severe testing. The outputs report on the discrepancies that have and have not been tested with severity. There’s much in Popper I agree with: data *x* only count as evidence for a claim *H* to the extent that *H* has passed a severe test with *x*.

**2. Popper, Fisher-Neyman-Pearson, and falsification.**

Popper’s philosophy shares quite a lot with the stringent testing ideas found in Fisher, and also Neyman-Pearson–something Popper himself recognized in the work the authors cite (LSD). Here is Popper:

We say that a theory is falsified only if we have accepted basic statements which contradict it…. This condition is necessary but not sufficient; for we have seen that non-reproducible single occurrences are of no significance to science. Thus a few stray basic statements contradicting a theory will hardly induce us to reject it as falsified. We shall take it as falsified only if we discover a reproducible effect which refutes the theory. In other words, we only accept the falsification if a low level empirical hypothesis which describes such an effect is proposed and corroborated. (Popper LSD, 1959, 203)

Such “a low level empirical hypothesis” is well captured by a statistical claim. Unlike the logical positivists, Popper realized that singular observation statements couldn’t provide the “basic statements” for science. In the same spirit, Fisher warned that in order to use significance tests to legitimately indicate incompatibility with hypotheses, we need not an isolated low P-value, but an experimental phenomenon.

[W]e need, not an isolated record, but a reliable method of procedure. In relation to the test of significance, we may say that a phenomenon is experimentally demonstrable when we know how to conduct an experiment which will rarely fail to give us a statistically significant result. (Fisher 1947, p. 14)

If such statistically significant effects are produced reliably, as Fisher required, they indicate a genuine effect. Conjectured statistical effects are likewise falsified if they contradict data and/or could only be retained through ad hoc saves, verification biases and “exception incorporation”. Moving in stages between data collection, modeling, inferring, and from statistical to substantive hypotheses and back again, learning occurs by a series of piecemeal steps with the same reasoning. The fact that at one stage *H*_{1} might be the alternative, at another, the test hypothesis, is no difficulty. The logic differs from inductively updating the probability of a hypothesis, as well as from a comparison of how much more probable *H*_{1} makes the data than does *H*_{0}, as in likelihood ratios. These are two variants of *probabilism*.

Now there are many who embrace probabilism who deny they need tools to reject or falsify hypotheses. That’s fine. *But having declared it a scandal (almost) for a statistical account to lack a methodology to reject/falsify, it’s a bit surprising to learn their account offers no such falsificationist tools.* (Perhaps I’m misunderstanding; I invite correction.) For example, the likelihood ratio, they declare, “is an evidence ratio about parameters, *given* the model and the data. It is the likelihood ratio that defines evidence (Royall 1997)” (Burnham and Anderson, p. 628). They italicize “given” which underscores that these methods begin their work only after models are specified. Richard Royall is mentioned often, but Royall is quite clear that for data to favor *H*_{1} over *H*_{0} is *not* to have supplied evidence against *H*_{0}. (“the fact that we can find some other hypothesis that is better supported than H does not mean that the observations are evidence against H” (1997, pp.21-2).) There’s no such thing as evidence for or against a single hypothesis for him. But without evidence against *H*_{0}, one can hardly mount a falsification of *H*_{0}. Thus, I fail to see how their preferred account promotes falsification. It’s (almost) a scandal.
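Royall’s point is easy to exhibit with a hypothetical coin example (my illustration, not Burnham and Anderson’s): since the best-supported alternative is fitted to the data, it is always favored over *H*_{0} in a likelihood ratio, so “some better-supported hypothesis exists” cannot by itself be evidence against *H*_{0}:

```python
from math import comb

def binom_lik(p, k, n):
    """Binomial likelihood of success probability p, given k successes in n trials."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Hypothetical data: 6 heads in 10 tosses; H0: p = 0.5
k, n = 6, 10
lik_null = binom_lik(0.5, k, n)
lik_mle = binom_lik(k / n, k, n)  # the best-supported alternative: p = 0.6

# The ratio in favor of the data-dependent alternative is always >= 1,
# whatever H0 is and whatever data occur.
lr = lik_mle / lik_null
print(round(lr, 2))  # 1.22
```

A ratio of about 1.22 “favors” p = 0.6 over p = 0.5, yet it would be absurd to count this as a falsification of the fair-coin hypothesis.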

Maybe all they mean is that “historical” Fisher said the tests have only a null, so the only alternative would be its denial. First, we shouldn’t be limiting ourselves to what Fisher thought, nor keeping an arbitrary distinction between Fisherian and N-P tests, or between tests and confidence intervals. David Cox is a leading Fisherian and his tests have either implicit or explicit alternatives. The choice of a test statistic indicates the alternative, even if it’s only directional. (In N-P tests, the test hypothesis and the alternative may be swapped.) Second, even if one imagines the alternative is limited to either of the following:

(i) the effect is real/ non-spurious, or (ii) a parametric non-zero claim (e.g., μ ≠ 0),

they are *still statistically falsifiable*. An example of the first came last week. Shock waves were felt in high energy particle physics (HEP) when early indications (from last December) of a genuine new particle—one that would falsify the highly corroborated Standard Model (SM)—were themselves falsified. This was based on falsifying a common statistical alternative in a significance test: the observed “resonance” (a great term) is real. (The “bumps” began to fade with more data [2].) As for case (ii), some of the most important results in science are null results. By means of high precision null hypothesis tests, bounds for statistical parameters are inferred by rejecting (or falsifying) discrepancies beyond the limits the tests are capable of detecting. Think of the famous negative result of the Michelson-Morley experiments that falsified the “ether” (or aether) of the type ruled out by special relativity, or the famous equivalence principles of experimental GTR. An example of each is briefly touched upon in a paper with David Cox (Mayo and Cox 2006). Of course, background knowledge about the instruments and theories is operative throughout. More typical are the cases where power analysis can be applied, as discussed in this post.

“Nonsignificance Plus High Power Does Not Imply Support for the Null Over the Alternative.”
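The power-analysis reasoning can be sketched for a one-sided Normal test (an illustrative calculation with made-up numbers, not drawn from any of the papers under discussion): a nonsignificant result licenses ruling out, i.e., falsifying, discrepancies the test had high power to detect.

```python
from math import sqrt
from statistics import NormalDist

def detectable_discrepancy(alpha, power, sigma, n):
    """Smallest discrepancy delta from the null (mu = 0) that a one-sided
    Normal test of size alpha, on n observations with known sd sigma,
    detects with the stated power.  A nonsignificant result is then grounds
    for denying discrepancies as large as delta."""
    z = NormalDist()
    return (z.inv_cdf(1 - alpha) + z.inv_cdf(power)) * sigma / sqrt(n)

# Illustration: alpha = 0.05, power = 0.9, sigma = 1, n = 100
delta = detectable_discrepancy(0.05, 0.9, 1.0, 100)
print(round(delta, 2))  # 0.29
```

With these numbers, a nonsignificant result indicates mu < 0.29: the test would have detected anything that large with probability 0.9, so failing to detect it counts against it.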

Perhaps they only mean to say that Fisherian tests don’t directly try to falsify “the effect is real”. But they’re supposed to: it should be very difficult to bring about statistically significant results if the world is like *H*_{0}.
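That difficulty is easy to check by simulation (a sketch with assumed Normal data and illustrative numbers): if the world is like *H*_{0}, significant results at the 5% level occur only about 5% of the time, whereas a genuine effect produces them routinely, just as Fisher’s “rarely fail” criterion requires.

```python
import random
from statistics import NormalDist, mean

random.seed(1)  # for reproducibility
z_crit = NormalDist().inv_cdf(0.95)  # one-sided 5% cutoff
n, trials = 25, 2000

def significant(mu):
    """One trial: draw n observations (known sd = 1) and ask whether
    the z-statistic exceeds the 5% cutoff."""
    xbar = mean(random.gauss(mu, 1) for _ in range(n))
    return xbar * sqrt_n > z_crit

sqrt_n = n ** 0.5

# Under H0 (mu = 0) significance is rare; under a real effect it is routine
rate_null = sum(significant(0.0) for _ in range(trials)) / trials
rate_real = sum(significant(0.5) for _ in range(trials)) / trials
print(rate_null, rate_real)  # roughly 0.05 vs. roughly 0.8
```

Reliably producing significant results (the second rate) is just what cannot be expected under *H*_{0} (the first).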

**3. Model validation, specification and falsification.**

When serious attention is paid to the discovery of new ways to extend models and theories, and to model validation, basic statistical tests are looked to. This is so even for Bayesians, be they ecumenical like George Box, or “falsificationists” like Gelman.

For Box, any account that relies on statistical models requires “diagnostic checks and tests of fit which, I will argue, require frequentist theory significance tests for their formal justification” (Box 1983, p. 57). This leads Box to advocate ecumenism. He asks,

[w]hy can’t all criticism be done using Bayes posterior analysis?…The difficulty with this approach is that by supposing all possible sets of assumptions are known a priori, it discredits the possibility of new discovery. But new discovery is, after all, the most important object of the scientific process (ibid., p. 73).

Listen to Andrew Gelman (2011):

At a philosophical level, I have been persuaded by the arguments of Popper (1959), Kuhn (1970), Lakatos (1978), and others that scientific revolutions arise from the identification and resolution of anomalies. In statistical terms, an anomaly is a misfit of model to data (or perhaps an internal incoherence of the model), and it can be identified by a (Fisherian) hypothesis test without reference to any particular alternative (what Cox and Hinkley 1974 call ‘pure significance testing’).^{[3]} (Gelman 2011, p. 70)

Discovery, model checking and correcting rely on statistical testing, formal or informal.

**4. “An explicit, objective criterion of ‘best’ models” using methods that obey the LP (p. 628).**

Say Burnham and Anderson:

“At a deeper level, P values are not proper evidence as they violate the likelihood principle (Royall 1997)” (p. 627).

A list of pronouncements by Royall follows. *What we know at a much deeper level is that any account that obeys the likelihood principle (LP) is not an account that directly assesses or controls the error probabilities of procedures.* Control of error probabilities, even approximately, is essential for good tests, and this grows out of a concern, not for controlling error rates in the long run, but for evaluating how well tested models and hypotheses are with the data in hand. As with others who embrace the LP, the authors reject adjusting for selection effects, data dredging, multiple testing, etc.–gambits that alter the sampling distribution and, handled cavalierly, are responsible for much of the bad statistics we see. By the way, reference or default Bayesians also violate the LP. You can’t just make declarations about “proper evidence” without proper evidence. (There’s quite a lot on the LP on this blog; see also links to posts below the references.)

Burnham and Anderson are concerned with how old a method is. Oh the horrors of being a “historical” method. Appealing to ridicule (“progress should not have to ride in a hearse”) is no argument. Besides, it’s manifestly silly to suppose you use a single method, or that error statistical tests haven’t been advanced as well as reinterpreted since Fisher’s day. Moreover, the LP is a historical, baked-on principle suitable for ye olde logical positivist days when empirical observations were treated as “given”. Within that statistical philosophy, it was typical to hold that the *data speak for themselves*, and that questionable research practices such as cherry-picking, data-dredging, data-dependent selections, and optional stopping are irrelevant to “what the data are saying”! It’s redolent of the time when statistical philosophy sought a single, “objective” evidential relationship to hold between given data, model and hypotheses. Holders of the LP still say this, and the authors are no exception.

[The LP was, I believe, articulated by George Barnard who announced he rejected it at the 1959 Savage forum for all but predesignated simple hypotheses. If you have a date or correction, please let me know. 8/10]

The truth is that one of the biggest problems behind the “replication crisis” is the violation of some age-old truisms about science. It’s the consumers of bad science (in medicine at least) that are likely to ride in a hearse. There’s something wistful about remarks we hear from some quarters now. Listen to Ben Goldacre (2016) in *Nature*: “The basics of a rigorous scientific method were worked out many years ago, but there is now growing concern about systematic structural flaws that undermine the integrity of published data,” which he follows with a list of selective publication, data dredging and all the rest, “leading collectively to the ‘replication crisis’.”

He’s trying to remind us that the rules for good science were all in place long ago and somehow are now being ignored or trampled over, in some fields. Wherever there’s a legitimate worry about “perverse incentives,” it’s not a good idea to employ methods on which selection effects make no difference to the evidence.

**5.** **Concluding comments**

I don’t endorse many of the applications of significance tests in the literature, especially in the social sciences. Many p-values reported are vitiated by fallacious interpretations (going from a statistical to a substantive effect), violated assumptions, and biasing selection effects. I’ve long recommended a reformulation of the tools to avoid fallacies of rejection and non-rejection. In some cases, sadly, better statistical inference cannot help, but that doesn’t make me want to embrace methods that do not directly pick up on the effects of biasing selections. Just the opposite.

If the authors are serious about upholding Popperian tenets of good science, then they’ll want to ensure the claims they make can be regarded as having passed a stringent probe into their falsity. I invite comments and corrections.

(Look for updates.)

____________

^{[1]}They are replying to an article by Paul Murtaugh. See the link to his paper here.

^{[2]} http://www.physicsmatt.com/blog/2016/8/5/standard-model-1-diphotons-0

^{[3]}Gelman continues: “At the next stage, we see science–and applied statistics–as resolution of anomalies via the creation of improved models which often include their predecessors as special cases. This view corresponds closely to the error-statistics idea of Mayo (1996).”

**REFERENCES:**

- Box, G. 1983. “An Apology for Ecumenism in Statistics,” in Box, G. E. P., Leonard, T. and Wu, D. F. J. (eds.), *Scientific Inference, Data Analysis, and Robustness*, pp. 51-84. New York: Academic Press. [1982 Technical Summary Report #2408 for U.S. Army version here.]
- Burnham, K. P. & Anderson, D. R. 2014. “P values are only an index to evidence: 20th- vs. 21st-century statistical science”, *Ecology* 95(3): 627-630.
- Cox, D. R. and Hinkley, D. 1974. *Theoretical Statistics*. London: Chapman and Hall.
- Fisher, R. A. 1947. *The Design of Experiments* (4th ed.). Edinburgh: Oliver and Boyd.
- Gelman, A. 2011. “Induction and Deduction in Bayesian Data Analysis”, *Rationality, Markets and Morals* (*RMM*) 2, Special Topic: Statistical Science and Philosophy of Science, pp. 67-78.
- Goldacre, B. 2016. “Make Journals Report Clinical Trials Properly”, *Nature* 530: 7 (04 February 2016).
- Kuhn, T. S. 1970. *The Structure of Scientific Revolutions*, 2nd ed. Chicago: University of Chicago Press.
- Lakatos, I. 1978. *The Methodology of Scientific Research Programmes*. Cambridge: Cambridge University Press.
- Mayo, D. 1996. *Error and the Growth of Experimental Knowledge*. Science and Its Conceptual Foundations. Chicago: University of Chicago Press.
- Mayo, D. and Cox, D. R. 2006. “Frequentist Statistics as a Theory of Inductive Inference,” in *Optimality: The Second Erich L. Lehmann Symposium* (ed. J. Rojo), Lecture Notes-Monograph Series, Vol. 49. Institute of Mathematical Statistics (IMS): 77-97.
- Murtaugh, P. A. 2014. “In defense of P values”, *Ecology* 95(3): 611-617.
- Murtaugh, P. A. 2014. “Rejoinder”, *Ecology* 95(3): 651-653.
- Popper, K. 1959. *The Logic of Scientific Discovery*. New York: Basic Books.
- Royall, R. 1997. *Statistical Evidence: A Likelihood Paradigm*. London: Chapman and Hall/CRC Press.
- Spanos, A. 2014. “Recurring controversies about P values and confidence intervals revisited”, *Ecology* 95(3): 645-651.

**Related Blogposts**

**LAW OF LIKELIHOOD: ROYALL**

8/29/14: BREAKING THE LAW! (of likelihood): to keep their fit measures in line (A), (B 2nd)

10/10/14: BREAKING THE (Royall) LAW! (of likelihood) (C)

11/15/14: Why the Law of Likelihood is bankrupt—as an account of evidence

11/25/14: How likelihoodists exaggerate evidence from statistical tests

**P-VALUES EXAGGERATE**

7/14/14: “P-values overstate the evidence against the null”: legit or fallacious? (revised)

7/23/14: Continued: “P-values overstate the evidence against the null”: legit or fallacious?

5/12/16: Excerpts from S. Senn’s Letter on “Replication, p-values and Evidence”

Filed under: P-values, Severity, statistical tests, Statistics, StatSci meets PhilSci