**MONTHLY MEMORY LANE: 3 years ago: November 2013.** I mark in **red** three posts from each month that seem most apt for general background on key issues in this blog, excluding those reblogged recently[1], and in **green** up to 3 others I’d recommend[2]. Posts that are part of a “unit” or a group count as one. Here I’m counting 11/9, 11/13, and 11/16 as one.

**November 2013**

- (11/2) **Oxford Gaol: Statistical Bogeymen**
- (11/4) **Forthcoming paper on the strong likelihood principle**
- (11/9) Null Effects and Replication (cartoon pic)
- (11/9) **Beware of questionable front page articles warning you to beware of questionable front page articles** (iii)
- (11/13) **T. Kepler: “Trouble with ‘Trouble at the Lab’?” (guest post)**
- (11/16) PhilStock: No-pain bull
- (11/16) **S. Stanley Young: More Trouble with ‘Trouble in the Lab’ (Guest post)**
- (11/18) **Lucien Le Cam: “The Bayesians hold the Magic”**
- (11/20) Erich Lehmann: Statistician and Poet
- (11/23) **Probability that it is a statistical fluke [i]**
- (11/27) **“The probability that it be a statistical fluke” [iia]**
- (11/30) Saturday night comedy at the “Bayesian Boy” diary (rejected post*)

**[1]** Monthly memory lanes began at the blog’s 3-year anniversary in Sept, 2014.

**[2]** New Rule, July 30, 2016: very convenient.

Filed under: 3-year memory lane, Error Statistics, Statistics

**GLYMOUR’S ARGUMENT (in a nutshell):**

“The anti-exploration argument has everything backwards,” says Glymour (slide #11). While John Ioannidis maintains that “Research findings are more likely true in confirmatory designs,” the opposite is so, according to Glymour. (Ioannidis 2005, Glymour’s slide #6). Why? To answer this he describes an exploratory research account for causal search that he has been developing:

What’s confirmatory research for Glymour? It’s moving directly from rejecting a null hypothesis with a low P-value to inferring a causal claim.

**MAYO ON GLYMOUR:**

I have my problems with Ioannidis, but Glymour’s description of exploratory inquiry is not what Ioannidis is on about. What Ioannidis is or ought to be criticizing are findings obtained through cherry picking, trying and trying again, p-hacking, multiple testing with selective reporting, hunting and snooping, exploiting researcher flexibility—*where those gambits make it easy to output a “finding” even though it’s false.* In those cases, the purported finding fails to pass a *severe test*. One reports the observed effect is difficult to achieve unless it’s genuine, when in fact it’s easy (frequent) to attain just by expected chance variability. The central sources of nonreplicability are precisely these data-dependent selection effects, and that’s why they’re criticized.

If you’re testing purported claims with stringency and multiple times, as Glymour describes, subjecting claims arrived at in one stage to *checks at another*, then you’re not really in “exploratory inquiry” as Ioannidis and others describe it. There can be no qualms with testing a conjecture arrived at through searching, using new data (so long as the probability of affirming the finding isn’t assured, even if the causal claim is false). I have often said that the terms exploratory and confirmatory should be dropped, and we should talk just about poorly tested and well tested claims, and reliable versus unreliable inquiries.

(Added Nov. 20, 2016 in burgundy): Admittedly, and this may be Glymour’s main point, Ioannidis’ categories of exploratory and confirmatory inquiries are too coarse. Here’s Ioannidis’ chart:

But nowadays, to come away from a discussion thinking that the warranted criticism of unreliable explorations can be ignored, is dangerous. Hence my comment.

Thus we can agree that compared to Glymour’s “exploratory inquiry,” what he calls “confirmatory inquiry” is inferior. Doubtless some people conduct statistical tests this way (*shame on them*!), but to do so commits two glaring fallacies: (1) moving from a single statistically significant result to a genuine effect; and (2) moving from a statistically significant effect to a causal claim. Admittedly, Ioannidis’ (2005) critique is aimed at such abusive uses of significance tests.

R.A. Fisher denounced these fallacies donkey’s years ago:

“[W]e need, not an isolated record, but a reliable method of procedure. In relation to the test of significance, we may say that a phenomenon is experimentally demonstrable when we know how to conduct an experiment which will rarely fail to give us a statistically significant result.” (Fisher 1947, p. 14)

“[A]ccording to Fisher, rejecting the null hypothesis is not equivalent to accepting the efficacy of the cause in question. The latter…requires obtaining more significant results when the experiment, or an improvement of it, is repeated at other laboratories or under other conditions.” (Gigerenzer et al 1989, pp. 95-6)

Glymour has been a leader in developing impressive techniques for causal exploration and modeling. I take his thesis to be that stringent modeling and self-critical causal searching are likely to do better than extremely lousy “experiments.” (He will correct me if I’m wrong.)

I have my own severe gripes with Ioannidis’ portrayal and criticism of significance tests in terms of what I call “the diagnostic screening model of tests.” I can’t tell from Glymour’s slides whether he agrees that looking at the positive predictive value (PPV), as Ioannidis does, is legitimate for scientific inference, and as a basis for criticizing a proper use of significance tests. I think it’s a big mistake and has caused serious misunderstandings. Two slides from my PSA presentation allude to this[i].

Moreover, using “power” as a conditional probability for a Bayesian-type computation here is problematic (the null and alternative don’t exhaust the space of possibilities). Issues with the diagnostic screening model of tests have come up a lot on this blog. Some relevant posts are at the end. **Please share your thoughts.**
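To make the diagnostic-screening arithmetic at issue concrete, here is a minimal sketch of the kind of PPV computation Ioannidis-style critics rely on (my own illustration, not from the post; the function name and the prevalence, alpha, and power values are all made up for the example):

```python
# Diagnostic screening model of tests: treat hypotheses like patients being
# screened, with a prior "prevalence" of true effects, a type I error rate
# alpha, and power = P(reject | effect is real). Note the criticism in the
# post: power here is being used as if it were an ordinary conditional
# probability, and null/alternative are assumed to exhaust the possibilities.

def screening_ppv(prevalence: float, alpha: float, power: float) -> float:
    """Positive predictive value: P(effect is real | test rejects)."""
    true_positives = prevalence * power
    false_positives = (1 - prevalence) * alpha
    return true_positives / (true_positives + false_positives)

# With few true effects and modest power, most "findings" are false positives:
print(screening_ppv(prevalence=0.1, alpha=0.05, power=0.8))   # 0.64
print(screening_ppv(prevalence=0.01, alpha=0.05, power=0.2))  # under 0.04
```

On this model, a “positive” (statistically significant) result carries little assurance unless the field’s base rate of true effects is high, which is precisely the screening logic Mayo argues is the wrong frame for evaluating a particular inference.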

**Here are Clark Glymour’s slides:**

**Clark Glymour** (Alumni University Professor in Philosophy, Carnegie Mellon University, Pittsburgh, Pennsylvania) *“Exploratory Research is More Reliable Than Confirmatory Research”* (Abstract)

[i] I haven’t posted my PSA slides yet; I wanted the focus of this post to be on Glymour.

**Blogposts relating to the “diagnostic model of tests”**

- (11/9) Beware of questionable front page articles warning you to beware of questionable front page articles (iii)
- (03/16) Stephen Senn: The pathetic P-value (Guest Post)
- (05/09) Stephen Senn: Double Jeopardy?: Judge Jeffreys Upholds the Law (sequel to the pathetic P-value)
- (01/17) “P-values overstate the evidence against the null”: legit or fallacious?
- (01/19) High error rates in discussions of error rates (1/21/16 update)
- (01/24) Hocus pocus! Adopt a magician’s stance, if you want to reveal statistical sleights of hand
- (04/11) When the rejection ratio (1 – β)/α turns evidence on its head, for those practicing in an error-statistical tribe (ii)
- (08/28) TragiComedy hour: P-values vs posterior probabilities vs diagnostic error rates

**References**:

- Fisher, R. A. (1947). *The Design of Experiments* (4th ed.). Edinburgh: Oliver and Boyd.
- Gigerenzer, G., Swijtink, Z., Porter, T., Daston, L., Beatty, J., & Krüger, L. (1989). *The Empire of Chance: How Probability Changed Science and Everyday Life*. Cambridge, UK: Cambridge University Press.
- Ioannidis, J. (2005). “Why most published research findings are false”, *PLoS Med* 2(8): 0696-0701.

Filed under: fallacy of rejection, P-values, replication research

Science isn’t about predicting one-off events like election results, but that doesn’t mean the way to make election forecasts scientific (which they should be) is to build “theories of voting.” A number of people have sent me articles on statistical aspects of the recent U.S. election, but I don’t have much to say and I like to keep my blog non-political. I won’t violate this rule in making a couple of comments on Faye Flam’s Nov. 11 article: “Why Science Couldn’t Predict a Trump Presidency”[i].

For many people, Donald Trump’s surprise election victory was a jolt to the very idea that humans are rational creatures. It tore away the comfort of believing that science has rendered our world predictable. The upset led two New York Times reporters to question whether data science could be trusted in medicine and business. A Guardian columnist declared that big data works for physics but breaks down in the realm of human behavior.

But the unexpected result wasn’t a failure of science. Yes, there were multiple, confident forecasts of a win for Clinton, but those emerged from a process that doesn’t qualify as science. And while social scientists weren’t equipped to see a Trump win coming, they have started to test theories of voting behavior that could shed light on why it happened…..

Not that these methods are pseudoscience; in fact, they employ some critical tools of science. The most prominent among those is Bayesian statistics, a way of calculating the probability that something is true or will come true.

Bayesian analysis is a core principle laid out in political forecaster Nate Silver’s book “The Signal and the Noise.” Though developed in the 1700s, Bayesian statistics had a resurgence in the science of the early 21st century. …Why don’t Bayesian statistics work the same sort of consistent magic for political forecasts? In science, what matters isn’t the forecast but the nature of the models. Scientists are after explicit rules, patterns and insights that explain how the world works. Those give other scientists something to build on — allowing science to self-correct in a way that other intellectual ventures can’t…..

Now that it’s over, there’s still a chance for science to explain why so many people voted for Trump. There are all kinds of guesses and judgments being thrown around about Trump voters — that they’re racist or sexist, or responding to the call of tribalism. Those aren’t the least bit scientific, but they could be turned into testable hypotheses.

You can read the rest of her article here.

Anytime a purportedly scientific method fails, a defender can always maintain the failures weren’t really scientific applications of the method. I think we did see a failure of many of the polling methods as the basis for the best-known forecasts. Methods used by Trump’s internal polling alerted them to what was happening in the “rust belt states” (according to campaign manager and pollster Kellyanne Conway), but the other polls largely missed it. They didn’t really share those internal results, and the attention Trump gave to typically blue states perplexed many [For some other activities kept under wraps, see ii]. Bill Clinton, on the other hand, “had pleaded with Robby Mook, Mrs. Clinton’s campaign manager, to do more outreach with working-class white and rural voters. But his advice fell on deaf ears.” (Link is here.)

Flam suggests, on the basis of her interviews with social scientists, that the way to turn forecasts into science is to build theories of voting. My guess is that’s the wrong way to go (I don’t claim any expertise here.) It’s an understanding of the threats to the assumptions in the *particular case, with all its idiosyncrasies,* that’s called for. The only thing general might be the ways you can go wildly wrong. Do their theories include tunnel vision by pollsters? Perhaps they should have asked: “If you were a person planning to vote for Trump, would you be reluctant to tell me, if I asked?”[iii]. In Trump’s internal polling, they would deliberately ask a number of related questions to ferret out the truth. Of course pollsters are well aware of the “undercover” or “shy” voter who is too worried about giving an unacceptable answer to be frank. If there was ever a case where this would be likely, it’s this—yet it was downplayed. Ironically, one might expect the more the “shy” voter should have been a concern, the *less* seriously a pollster would take it. (You can ponder why I say this.) It’s not enough to have a repertoire of errors if they’re not taken seriously.

As for the “consistent magic” of Bayesianism, since in this case we’re talking about an event, frequentists, error statisticians, and Bayesians can talk Bayesianly if they so choose, but my understanding is that most polling is in the form of frequentist interval estimates (perhaps with various weights attached). Maybe, as Flam suggests, some formally combine prior beliefs with the statistical data, but that’s all the more reason to have been ultra-self-critical and probe how capable the method is at disinterring fundamental flaws in the model. They should have been giving their assumptions a hard time, bending over backwards to disinter biases, and self-sealing fallacies, not baking them into the analysis.
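For concreteness, the frequentist interval estimate behind a poll’s reported “margin of error” looks like this (a minimal sketch of my own; the sample numbers are invented). The point relevant to the post: the interval quantifies sampling variability only, so systematic errors such as shy-voter misreporting or biased nonresponse are baked into the estimate and never show up in the stated margin:

```python
import math

def poll_ci(successes: int, n: int, z: float = 1.96):
    """95% Wald confidence interval for a proportion (z = 1.96)."""
    p_hat = successes / n
    moe = z * math.sqrt(p_hat * (1 - p_hat) / n)  # margin of error: sampling error only
    return p_hat - moe, p_hat + moe

# 480 of 1,000 respondents say they support candidate A:
low, high = poll_ci(480, 1000)
print(f"{low:.3f} to {high:.3f}")  # roughly a +/- 3-point margin of error
```

If respondents systematically misreport, `p_hat` itself is off, and no amount of narrowing the interval by increasing `n` fixes that: exactly the model-assumption threat Mayo says forecasters should have been probing.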

Share your thoughts.

[i] Flam is the one who interviewed me, Gelman, Simonsohn, Senn and others for that NYT article on Bayesian and frequentist methods discussed on this post.

[ii] Apparently they also kept hidden in the “Trump bunker” a fairly extensive use of data analytics (on the order of $70 million a month, according to the Bloomberg article “Inside the Trump Bunker”), encouraging people to think it was a fledgling effort. Their polls were in sync with Nate Silver’s, they say, except for the time lag in Silver’s, owing to his reliance on other polls, but their inferences about what voters really thought differed.

Trump’s data scientists, including some from the London firm Cambridge Analytica who worked on the “Leave” side of the Brexit initiative, think they’ve identified a small, fluctuating group of people who are reluctant to admit their support for Trump and may be throwing off public polls. (Inside the Trump Bunker)

The article admits they also worked toward selectively depressing the vote. The overall data analytic project was to be the basis of an enterprise to pursue after a potential loss!

[iii] This is reminiscent of the question that permits you to get at the truth (about the correct road to town) when confronted with people who either always lie or always tell the truth.

Filed under: Bayesian/frequentist, evidence-based policy

Gerd Gigerenzer, Andrew Gelman, Clark Glymour and I took part in a very interesting symposium on Philosophy of Statistics at the Philosophy of Science Association last Friday. I jotted down lots of notes, but I’ll limit myself to brief reflections and queries on a small portion of each presentation in turn, starting with Gigerenzer’s “Surrogate Science: How Fisher, Neyman-Pearson, & Bayes Were Transformed into the Null Ritual.” His complete slides are below my comments. I may write this in stages, this being (i).

SLIDE #19

- Good scientific practice–bold theories, double-blind experiments, minimizing measurement error, replication, etc.–became reduced in the social sciences to a surrogate: statistical significance.

I agree that “good scientific practice” isn’t some great big mystery, and that “bold theories, double-blind experiments, minimizing measurement error, replication, etc.” are central and interconnected keys to finding things out in error prone inquiry. *Do the social sciences really teach that inquiry can be reduced to cookbook statistics? Or is it simply that, in some fields, carrying out surrogate science suffices to be a “success”?*

- Instead of teaching a toolbox of statistical methods by Fisher, Neyman-Pearson, Bayes, and others, textbook writers created a hybrid theory with the null ritual at its core, and presented it anonymously as statistics per se.

I’m curious as to how he/we might cash out teaching “a toolbox of statistical methods by Fisher, Neyman-Pearson, Bayes….” Each has been open to caricature, to rival interpretations and philosophies, and each includes several methods. There should be a way to recognize distinct questions and roles, without reinforcing the “received view” that lies behind the guilt and anxiety confronting the researcher in Gigerenzer’s famous “superego-ego-id” metaphor (SLIDE #3):

In this view, N-P demands fixed, long-run performance criteria and are relevant for acceptance sampling only–no inference allowed; Fisherian significance tests countenance moving from small p-values to substantive scientific claims, as in the illicit animal dubbed NHST. The Bayesian “id” is the voice of wishful thinking that tempts some to confuse the question: “How stringently have I tested *H*?” with “How strongly do I believe *H*?”

As with all good caricatures, there are several grains of truth in Gigerenzer’s colorful Freudian metaphor, but I say it’s time to move away from exaggerating the differences between N-P and Fisher. I think we must first see why Fisher and N-P statistics do not form an inconsistent hybrid, if we’re to see their overlapping roles in the “toolbox”.

(Added 11/11/16: here’s the early, best-known (so far as I’m aware) introduction of the Freudian metaphor: https://www.mpib-berlin.mpg.de/volltexte/institut/dok/full/gg/ggstehfda/ggstehfda.html)

Full text

A treatment that is unbiased as well as historically and statistically adequate might be possible, but would have to be created anew (Vic Barnett’s *Comparative Statistical Inference* comes close). Whether this would be at all practical for a routine presentation of statistics is another issue.

3(a). The null ritual requires delusions about the meaning of the p-value. Its blind spots led to studies with a power so low that throwing a coin would do better. To compensate, researchers engage in bad science to produce significant results which are unlikely to be reproducible.

I put aside my queries about the “required delusions” of the first sentence, and improving power by throwing a coin in the second. The point about power in the last sentence is important. It might help explain the confusion I increasingly see between (i) reaching small significance levels with a test of low power and (ii) engaging in questionable research practices (QRPs) in order to arrive at the small p-value. If you achieve (i) without the QRPs of (ii), you have a good indication of a discrepancy from a null hypothesis. It would *not* yield an exaggerated effect size as some allege, if correctly interpreted. The problem is when QRPs are used “to compensate” for a test that would otherwise produce *bubkes*.
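The ease of “compensating” via QRPs can be shown in a small simulation (my own sketch, not Gigerenzer’s; it assumes the simple case of 20 independent tests of exactly true null hypotheses, with only the smallest p-value reported):

```python
import random
import statistics

# With 20 independent null tests and selective reporting, the chance of at
# least one nominal p < .05 is 1 - 0.95**20, about 64%: a small REPORTED
# p-value is easy to attain even though every null is true.

random.seed(1)

def one_sided_p(n: int = 30) -> float:
    """p-value of a one-sided z-test on a sample mean; the null (mu = 0) is true."""
    xs = [random.gauss(0, 1) for _ in range(n)]
    z = statistics.fmean(xs) * n ** 0.5  # standardized sample mean
    return 1 - statistics.NormalDist().cdf(z)

def min_p_over_k_tests(k: int = 20) -> float:
    """Smallest p-value across k independent null tests (the QRP)."""
    return min(one_sided_p() for _ in range(k))

trials = 2000
hits = sum(min_p_over_k_tests() < 0.05 for _ in range(trials))
print(hits / trials)  # far above the nominal 0.05
```

This is the contrast Mayo draws: a small p-value from a single pre-designated (even low-power) test is informative, while the same number produced by hunting across 20 tests is not, because the *actual* probability of reporting “significance” is high.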

3(b). Researchers’ delusion that the p-value already specifies the probability of replication (1 – p) makes replication studies appear superfluous.

Hmm. They should read Fisher’s denunciation of taking an isolated p-value as indicating a genuine experimental effect:

In order to assert that a natural phenomenon is experimentally demonstrable we need, not an isolated record, but a reliable method of procedure. In relation to the test of significance we may say that a phenomenon is experimentally demonstrable when we know how to conduct an experiment which will rarely fail to give us a statistically significant result. (Fisher 1947, p. 14)

4. The replication crisis in the social and biomedical sciences is typically attributed to wrong incentives. But that is only half the story. Researchers tend to *believe* in the ritual, and the null ritual also explains why these incentives and not others were set in the first place.

The perverse incentives generally refer to the “publish or perish” mantra, the competition to produce novel and sexy results, and editorial biases in favor of a neat narrative, free of those messy qualifications or careful, critical caveats. Is Gigerenzer saying that the reason these incentives were set is because researchers believe that recipe statistics is a good way to do science? What do people think?

It would follow that if researchers rejected statistical rituals as a good way to do science, then incentives would change. To some extent that might be happening. The trouble is, even if researchers recognize the inadequacy of the statistical rituals that have been lampooned for 80+ years, it doesn’t follow that they will return to the notion of “good scientific practice” described in point #1.

**Gerd Gigerenzer** (Director of Max Planck Institute for Human Development, Berlin, Germany) *“Surrogate Science: How Fisher, Neyman-Pearson, and Bayes Were Transformed into the Null Ritual”* (Abstract)

**Some Relevant Posts:**

“Erich Lehmann: Neyman Pearson vs Fisher on P-values”

“Jerzy Neyman and ‘Les Miserables Citations’: Statistical Theater in honor of His Birthday”

Stephen Senn: “The Pathetic P-value” (guest post)

**REFERENCES**

- Barnett, V. (1999). *Comparative Statistical Inference* (3rd ed.). Chichester; New York: Wiley.
- Fisher, R. A. (1947). *The Design of Experiments* (4th ed.). Edinburgh: Oliver and Boyd.
- Gigerenzer, G. (1993). “The Superego, the Ego, and the Id in statistical reasoning.” In G. Keren & C. Lewis (Eds.), *A Handbook for Data Analysis in the Behavioral Sciences: Methodological Issues* (pp. 311-339). Hillsdale, NJ: Erlbaum.

Filed under: Fisher, frequentist/Bayesian, Gigerenzer, P-values, spurious p values, Statistics

Link to Seminar Flyer pdf.

Filed under: Announcement

Friday November 4th 9-11:45 am

- **Deborah Mayo** (Professor of Philosophy, Virginia Tech, Blacksburg, Virginia) *“Controversy Over the Significance Test Controversy”* (Abstract)
- **Gerd Gigerenzer** (Director of Max Planck Institute for Human Development, Berlin, Germany) *“Surrogate Science: How Fisher, Neyman-Pearson, and Bayes Were Transformed into the Null Ritual”* (Abstract)
- **Andrew Gelman** (Professor of Statistics & Political Science, Columbia University, New York) *“Confirmationist and Falsificationist Paradigms in Statistical Practice”* (Abstract)
- **Clark Glymour** (Alumni University Professor in Philosophy, Carnegie Mellon University, Pittsburgh, Pennsylvania) *“Exploratory Research is More Reliable Than Confirmatory Research”* (Abstract)

**Key Words**: big data, frequentist and Bayesian philosophies, history and philosophy of statistics, meta-research, p-values, replication, significance tests.

**Summary:**

Science is undergoing a crisis over reliability and reproducibility. High-powered methods are prone to cherry-picking correlations, significance-seeking, and assorted modes of extraordinary rendition of data. The Big Data revolution may encourage a reliance on statistical methods without sufficient scrutiny of whether they are teaching us about causal processes of interest. Mounting failures of replication in the social and biological sciences have resulted in new institutes for meta-research, replication research, and widespread efforts to restore scientific integrity and transparency. Statistical significance test controversies, long raging in the social sciences, have spread to all fields using statistics. At the same time, foundational debates over frequentist and Bayesian methods have shifted in important ways that are often overlooked in the debates. The problems introduce philosophical and methodological questions about probabilistic tools, and science and pseudoscience—intertwined with technical statistics and the philosophy and history of statistics. Our symposium goal is to address foundational issues around which the current crisis in science revolves. We combine the insights of philosophers, psychologists, and statisticians whose work interrelates philosophy and history of statistics, data analysis and modeling.

**Topic:**

Philosophy of statistics tackles conceptual and epistemological problems in using probabilistic methods to collect, model, analyze, and draw inferences from data. The problems concern the nature of uncertain evidence, the role and interpretation of probability, reliability, and robustness—all of which link to a long history of disputes of personality and philosophy between frequentists, Bayesians, and likelihoodists (e.g., Fisher, Neyman, Pearson, Jeffreys, Lindley, Savage). Replication failures have led researchers to reexamine their statistical methods. Although nowadays novel statistical techniques use simulations to detect cherry-picking and p-hacking, we see a striking recapitulation of Bayesian-frequentist debates of old. New philosophical issues arise from successes of machine learning and Big Data analysis: How do its predictions succeed when parameters in models are merely black boxes? One thing we learned in 2015 is why they fail: a tendency to overlook classic statistical issues– confounders, multiple testing, bias, model assumptions, and overfitting. The time is ripe for a forum that illuminates current developments and points to the directions of future work by philosophers and methodologists of science.

*The New Statistical Significance Test Controversy.* Mechanical, cookbook uses of statistical significance tests have long been lampooned in social sciences, but once high-placed failures revealed poor rates of replication in medicine and cancer research, the problem took on a new seriousness. Drawing on criticisms from social science, however, the new significance test controversy retains caricatures of a “hybrid” view of significance testing, common in psychology (Gigerenzer). Well-known criticisms—statistical significance is not substantive significance, p-values are invalidated by significance seeking and violated model assumptions—are based on uses of methods warned against by the founders of Fisherian and Neyman-Pearson (N-P) tests. A genuine experimental effect, Fisher insisted, cannot be based on a single, isolated significant result (a single low p-value); low p-values had to be generated in multiple settings. Yet sweeping criticisms and recommended changes of method are often based on the rates of false positives assuming a single, just-significant result, with biasing selection effects to boot!

Foundational controversies are tied up with Fisher’s bitter personal feuds with Neyman, and Neyman’s attempt to avoid inconsistencies in Fisher’s “fiducial” probability by means of confidence levels. Only a combined understanding of the early statistical and historical developments can get beyond the received views of the philosophical differences between Fisherian and N-P tests. People should look at the *properties of the methods*, independent of what the founders supposedly thought.

**Bayesian-Frequentist Debates.** The Bayesian-frequentist debates need to be revisited. Many discussants, who only a decade ago argued for the “irreconcilability” of frequentist p-values and Bayesian measures, now call for ways to reconcile the two. In today’s most popular Bayesian accounts, prior probabilities in hypotheses do not express degrees of belief but are given by various formal assignments or “defaults,” ideally with minimal impact on the posterior probability. Advocates of unifications are keen to show that Bayesian methods have good (frequentist) long-run performance; and that it is often possible to match frequentist and Bayesian quantities, despite differences in meaning and goals. Other Bayesians deny the idea that Bayesian updating fits anything they actually do in statistics (Gelman). Statistical methods are being decoupled from the philosophies in which they are traditionally couched, calling for new foundations and insights from philosophers.

Is the “Bayesian revolution,” like the significance test revolution before it, ushering in the latest in a universal method and surrogate science (Gigerenzer)? If the key problems of significance tests occur equally with Bayes ratios, confidence intervals and credible regions, then we need a new statistical philosophy to underwrite alternative, more self-critical methods (Mayo).

**The Big Data Revolution.** New data acquisition procedures in biology and neuroscience yield enormous quantities of high-dimensional data, which can only be analyzed by computerized search procedures. But the most commonly used search procedures have known liabilities and can often only be validated using computer simulations. Analyses used to find predictors in areas such as medical diagnostics are so new that often their statistical properties are unknown, making them ethically problematic. Questions about the very nature of replication and validation, and of reliability and robustness enter. Without a more critical analysis of the foibles, the current Human Connectome project to understand brain processes may result in the same disappointments as gene regulation discovery, with its so far unfulfilled promise of reliably predicting personalized cancer treatments (Glymour). [See new topic.]

The wealth of computational ability allows for the application of countless methods with little handwringing about foundations, but they introduce new quandaries. The techniques that Big Data requires to “clean” and process data introduce biases that are difficult to detect. Can sufficient data obviate the need to satisfy long-standing principles of experimental design? Can data-dependent simulations, resampling and black-box models ever count as valid replications or genuine model validations?

**The Contributors:** While participants represent diverse statistical philosophies, there is agreement that a central problem concerns the gaps between the outputs of formal statistical methods and research claims of interest. In addition to illuminating problems, each participant will argue for an improved methodology: an error statistical account of inference (Mayo), a heuristic toolbox (Gigerenzer), Bayesian falsification via predictive distributions (Gelman), and a distinct causal-modeling approach (Glymour).

**Abstracts:**

*Controversy Over the Significance Test Controversy*
Deborah Mayo

In the face of misinterpretations and proposed bans of statistical significance tests, the American Statistical Association gathered leading statisticians in 2015 to articulate statistical fallacies and galvanize discussion of statistical principles. I discuss the philosophical assumptions lurking in the background of their recommendations, linking also my co-symposiasts. As is common, probability is assumed to accord with one of two statistical philosophies: (1) *probabilism* and (2) (long-run) *performance*. (1) assumes probability should supply degrees of confirmation, support or belief in hypotheses, e.g., Bayesian posteriors, likelihood ratios, and Bayes factors; (2) limits probability to long-run reliability in a series of applications, e.g., a “behavioristic” construal of N-P type 1 and 2 error probabilities; false discovery rates in Big Data.

Assuming probabilism, significance levels are relevant to a particular inference only if misinterpreted as posterior probabilities. Assuming performance, they are criticized as relevant only for quality control, and contexts of repeated applications. Performance is just what’s needed in Big Data searching through correlations (Glymour). But for inference, I sketch a third construal: (3) probativeness. In (2) and (3), unlike (1), probability attaches to methods (testing or estimation), not the hypotheses. These “methodological probabilities” report on a method’s ability to control the probability of erroneous interpretations of data: *error probabilities*. While significance levels (p-values) are error probabilities, the probing construal in (3) directs their evidentially relevant use.

That a null hypothesis of “no effect” or “no increased risk” is rejected at the .01 level (given adequate assumptions) tells us that 99% of the time, a smaller observed difference would result from expected variability, as under the null hypothesis. If such statistically significant effects are produced reliably, as Fisher required, they indicate a genuine effect. Looking at the entire p-value distribution under various discrepancies from the null allows inferring those that are well or poorly indicated. This is akin to confidence intervals but we do not fix a single confidence level, and we distinguish the warrant for different points in any interval. My construal connects to Birnbaum’s *confidence concept*, Popperian *corroboration*, and possibly Fisherian *fiducial* probability. The probativeness interpretation better meets the goals driving current statistical reforms.
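The numerical claim above can be checked directly. The following sketch (my own notation, not Mayo’s; a one-sided z-test of H0: mu = 0 vs mu > 0 with sigma = 1 and n = 100, discrepancy values chosen for illustration) computes the .01-level cutoff and then the rejection probability under several discrepancies from the null, the kind of p-value-distribution reasoning the abstract describes:

```python
from statistics import NormalDist

N = NormalDist()
n, alpha = 100, 0.01
se = 1 / n ** 0.5                      # standard error of the sample mean
cutoff = N.inv_cdf(1 - alpha) * se     # reject when the sample mean exceeds this

# Under H0, 99% of the time a smaller observed difference would result
# from expected variability:
print(N.cdf(cutoff / se))              # 0.99

# P(reject | mu) for discrepancies mu from the null shows which
# discrepancies are well or poorly indicated by a just-significant result:
for mu in (0.0, 0.1, 0.2, 0.3):
    p_reject = 1 - N.cdf((cutoff - mu) / se)
    print(mu, round(p_reject, 3))
```

Scanning `p_reject` over `mu` is akin to reading off a family of confidence bounds at once, rather than fixing a single confidence level, which is the contrast with standard confidence intervals drawn in the abstract.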

Much handwringing stems from hunting for an impressive-looking effect and then inferring a statistically significant finding. The *actual* probability of erroneously finding significance with this gambit is not low but high, so a *reported* small p-value is invalid. Flexible choices along “forking paths” from data to inference cause the same problem, even when the criticism is informal (Gelman). The same flexibility, however, afflicts probabilist reforms, be they likelihood ratios, Bayes factors, highest posterior density (HPD) intervals, or lowering the p-value (until the maximally likely alternative gets a .95 posterior). But lost are the direct grounds to criticize them as flouting error-statistical control. I concur with Gigerenzer’s criticisms of ritual uses of p-values, but without understanding their valid (if limited) role, there’s a danger of accepting reforms that throw out the error-control baby with the “bad statistics” bathwater.
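The gap between the reported and actual error probability is easy to exhibit numerically. The following is a minimal simulation of my own (nothing here comes from the abstracts): each “path” is modeled as an independent nominal .05-level test on pure noise, so the best-of-5 rate approaches 1 − 0.95⁵ ≈ 0.23. Real forking paths share data and so inflate somewhat less, but the direction is the same.

```python
import random

random.seed(1)

def reject(n=20):
    """One nominal .05-level two-sided z-test on pure noise (null true,
    sigma = 1 known), so it rejects about 5% of the time."""
    xs = [random.gauss(0, 1) for _ in range(n)]
    z = (sum(xs) / n) * n ** 0.5          # sample mean / (sigma / sqrt(n))
    return abs(z) > 1.96

def forked_reject(paths=5):
    """Report 'significant' if ANY of `paths` analysis choices rejects.
    Modeling the choices as independent looks overstates the inflation
    (real paths are correlated), but illustrates the mechanism."""
    return any(reject() for _ in range(paths))

trials = 10_000
single = sum(reject() for _ in range(trials)) / trials
forked = sum(forked_reject() for _ in range(trials)) / trials
print(f"actual error rate, one pre-specified test: {single:.3f}")   # ~ 0.05
print(f"actual error rate, best of 5 paths:        {forked:.3f}")   # ~ 0.23
```

The p-value reported from the chosen path stays small, while the actual probability that some path or other reaches significance is several times larger; that is the sense in which the reported value is invalid.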

*Surrogate Science: How Fisher, Neyman-Pearson, and Bayes Were Transformed into the Null Ritual*
Gerd Gigerenzer

(Director of Max Planck Institute for Human Development, Berlin, Germany)

If statisticians agree on one thing, it is that scientific inference should not be made mechanically. Despite virulent disagreements on other issues, Ronald Fisher and Jerzy Neyman, two of the most influential statisticians of the 20th century, were of one voice on this matter. Good science requires both statistical tools and informed judgment about what model to construct, what hypotheses to test, and what tools to use. Practicing statisticians rely on a “statistical toolbox” and on their expertise to select a proper tool. Social scientists, in contrast, tend to rely on a single tool.

In this talk, I trace the historical transformation of Fisher’s null hypothesis testing, Neyman-Pearson decision theory, and Bayesian statistics into a single mechanical procedure that is performed like compulsive hand washing: the null ritual. In the social sciences, this transformation has fundamentally changed research practice, making statistical inference its centerpiece. The essence of the null ritual is:

1. Set up a null hypothesis of “no mean difference” or “zero correlation.” Do not specify the predictions of your own research hypothesis.
2. Use 5% as a convention for rejecting the null. If significant, accept your research hypothesis. Report the result as p < .05, p < .01, or p < .001, whichever comes next to the obtained p-value.
3. Always perform this procedure.

I use the term “ritual” because this procedure shares features that define social rituals: (i) the repetition of the same action, (ii) a focus on special numbers or colors, (iii) fears about serious sanctions for rule violations, and (iv) wishful thinking and delusions that virtually eliminate critical thinking. The null ritual has each of these four characteristics: mindless repetition; the magical 5% number; fear of sanctions by editors or advisors; and delusions about what a p-value means, which block researchers’ intelligence. Starting in the 1940s, writers of bestselling statistical textbooks for the social sciences have silently transformed rival statistical systems into an apparently monolithic method that could be used mechanically. The idol of a universal method for scientific inference has been worshipped and institutionalized since the “inference revolution” of the 1950s. Because no such method has ever been found, surrogates have been created, most notably the quest for significant p-values. I show that this form of surrogate science fosters delusions and argue that it is one of the reasons for “borderline cheating,” which has done much harm, creating, for one, a flood of irreproducible results in fields such as psychology, cognitive neuroscience, and tumor marker research.

Today, proponents of the “Bayesian revolution” are in a similar danger of chasing the same chimera: an apparently universal inference procedure. A better path would be to promote an understanding of the various devices in the “statistical toolbox.” I discuss possible explanations why a toolbox approach to statistics has been so far successfully prevented by journal editors, textbook writers, and social scientists.

*Confirmationist and Falsificationist Paradigms in Statistical Practice*

**Andrew Gelman**

There is a divide in statistics between classical frequentist and Bayesian methods. Classical hypothesis testing is generally taken to follow a falsificationist, Popperian philosophy in which research hypotheses are put to the test and rejected when data do not accord with predictions. Bayesian inference is generally taken to follow a confirmationist philosophy in which data are used to update the probabilities of different hypotheses. We disagree with this conventional Bayesian-frequentist contrast: We argue that classical null hypothesis significance testing is actually used in a confirmationist sense and in fact does not do what it purports to do; and we argue that Bayesian inference cannot in general supply reasonable probabilities of models being true. The standard research paradigm in social psychology (and elsewhere) seems to be that the researcher has a favorite hypothesis A. But, rather than trying to set up hypothesis A for falsification, the researcher picks a null hypothesis B to falsify, which is then taken as evidence in favor of A. Research projects are framed as quests for confirmation of a theory, and once confirmation is achieved, there is a tendency to declare victory and not think too hard about issues of reliability and validity of measurements.

Instead, we recommend a falsificationist Bayesian approach in which models are altered and rejected based on data. The conventional Bayesian confirmation view blinds many Bayesians to the benefits of predictive model checking. The view is that any Bayesian model necessarily represents a subjective prior distribution and as such could never be tested. It is not only Bayesians who avoid model checking. Quantitative researchers in political science, economics, and sociology regularly fit elaborate models without even the thought of checking their fit.

We can perform a Bayesian test by first assuming the model is true, then obtaining the posterior distribution, and then determining the distribution of the test statistic under hypothetical replicated data under the fitted model. A posterior distribution is not the final end, but is part of the derived prediction for testing. In practice, we implement this sort of check via simulation.

Posterior predictive checks are disliked by some Bayesians because of their low power arising from their allegedly “using the data twice”. This is not a problem for us: it simply represents a dimension of the data that is virtually automatically fit by the model.

What can statistics learn from philosophy? Falsification and the notion of scientific revolutions can make us willing to check our model fit and to vigorously investigate anomalies rather than treat prediction as the only goal of statistics. What can the philosophy of science learn from statistical practice? The success of inference using elaborate models, full of assumptions that are certainly wrong, demonstrates the power of deductive inference, and posterior predictive checking demonstrates that ideas of falsification and error statistics can be applied in a fully Bayesian environment with informative likelihoods and prior distributions.
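The simulation-based check Gelman describes can be sketched in a few lines. The model, data, and test statistic below are toy choices of mine (a normal model with known variance, a flat prior, and the sample maximum as test statistic), not an example from the abstract; the shape of the procedure is the point: draw parameters from the posterior, simulate replicated data, and compare a test statistic.

```python
import random

random.seed(7)

# Toy 'observed' data: mostly standard normal, plus a heavy right tail
# that a normal model cannot reproduce.
y = [random.gauss(0, 1) for _ in range(95)] + [6, 7, 8, 9, 10]
n = len(y)
T_obs = max(y)                        # test statistic: sample maximum

# Model: y_i ~ N(mu, 1) with a flat prior, so mu | y ~ N(ybar, 1/n).
ybar = sum(y) / n

sims, count = 2000, 0
for _ in range(sims):
    mu = random.gauss(ybar, 1 / n ** 0.5)            # posterior draw
    y_rep = [random.gauss(mu, 1) for _ in range(n)]  # replicated data
    if max(y_rep) >= T_obs:
        count += 1

ppp = count / sims                    # posterior predictive p-value
print(f"posterior predictive p-value for T = max(y): {ppp:.3f}")  # ~ 0.000
# The fitted model essentially never reproduces the observed maximum,
# so the check flags a misfit that the posterior alone would not reveal.
```

Here the posterior is not the final end but part of a derived prediction that the data can falsify, which is the falsificationist Bayesian stance of the abstract.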

*Exploratory Research is More Reliable Than Confirmatory Research*
Clark Glymour

Ioannidis (2005) argued that most published research findings are false, and that “exploratory” research in which many hypotheses are assessed automatically is especially likely to produce false positive relations. Colquhoun (2014) estimates via simulation that 30 to 40% of positive results using the conventional .05 cutoff for rejection of a null hypothesis are false. Their explanation is that true relationships in a domain are rare and the selection of hypotheses to test is roughly independent of their truth, so most relationships tested will in fact be false. Conventional use of hypothesis tests, in other words, suffers from a base rate fallacy. I will show that the reverse is true for modern search methods for causal relations because: (a) each hypothesis is tested or assessed multiple times; (b) the methods are biased against positive results; and (c) systems in which true relationships are rare are an advantage for these methods. I will substantiate the claim with both empirical data and with simulations of data from systems with a thousand to a million variables that result in fewer than 5% false positive relationships and in which 90% or more of the true relationships are recovered.
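The base-rate argument that Glymour contests reduces to two lines of arithmetic. The α = .05 and power = .80 below are conventional stand-in values, not figures from the abstract; they suffice to show how the share of false positives explodes as true relationships become rare.

```python
def false_positive_share(prior_true, alpha=0.05, power=0.80):
    """Among nominally 'positive' results, the share that are false,
    given the base rate of true relationships among hypotheses tested."""
    true_pos = prior_true * power
    false_pos = (1 - prior_true) * alpha
    return false_pos / (true_pos + false_pos)

for base in (0.5, 0.1, 0.01):
    share = false_positive_share(base)
    print(f"base rate of true relationships {base:>4}: "
          f"{share:.0%} of positives are false")
# base rate 0.5 -> 6%;  0.1 -> 36%;  0.01 -> 86%
```

When true relationships are rare, most positives are false despite the nominal .05 level; Glymour’s claim is that multiple assessment and a bias against positives reverse this verdict for causal search methods.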

Filed under: Announcement


**Submission Deadline:** December 1st, 2016

**Authors Notified:** February 8th, 2017

We invite papers in formal epistemology, broadly construed. FEW is an interdisciplinary conference, and so we welcome submissions from researchers in philosophy, statistics, economics, computer science, psychology, and mathematics.

Submissions should be prepared for blind review. Contributors ought to upload a full paper of no more than 6000 words and an abstract of up to 300 words to the Easychair website. Please submit your full paper in .pdf format. The deadline for submissions is December 1st, 2016. Authors will be notified by February 8th, 2017.

The final selection of the program will be made with an eye towards diversity. We especially encourage submissions from PhD candidates, early career researchers and members of groups that are underrepresented in philosophy.

If you have any questions, please email formalepistemologyworkshop2017[AT]gmail, with the appropriate suffix.

Lara Buchak (Berkeley), Vincenzo Crupi (Turin), Sujata Ghosh (ISI Chennai), Simon Huttegger (Irvine), Subhash Lele (Alberta), Hanti Lin (UC Davis), Anna Mahtani (LSE), Daniel Singer (Penn), Michael Titelbaum (Madison), Kevin Zollman (Carnegie Mellon),
Catrin Campbell-Moore (Bristol), Kenny Easwaran (Texas A&M), Nina Gierasimczuk (DTU Compute), Brian Kim (Oklahoma), Fenrong Liu (Tsinghua), Deborah Mayo (Virginia Tech), Carlotta Pavese (Duke/Turin), Sonja Smets (ILLC Amsterdam), Gregory Wheeler (MCMP Munich),
Eleonora Cresto (Buenos Aires), Paul Egre (Institut Jean-Nicod), Leah Henderson (Groningen), Karolina Krzyzanowska (MCMP Munich), Yang Liu (Cambridge), Cailin O’Connor (Irvine), Lavinia Picollo (MCMP Munich), Julia Staffel (WashU in St. Louis), Sylvia Wenmackers (Leuven)

Filed under: Announcement

**International Prize in Statistics Awarded to Sir David Cox for**

**Survival Analysis Model Applied in Medicine, Science, and Engineering**

EMBARGOED until October 19, 2016, at 9 p.m. ET

ALEXANDRIA, VA (October 18, 2016) – Prominent British statistician Sir David Cox has been named the inaugural recipient of the International Prize in Statistics. Like the acclaimed Fields Medal, Abel Prize, Turing Award and Nobel Prize, the International Prize in Statistics is considered the highest honor in its field. It will be bestowed every other year to an individual or team for major achievements using statistics to advance science, technology and human welfare.

Cox is a giant in the field of statistics, but the International Prize in Statistics Foundation is recognizing him specifically for his 1972 paper in which he developed the proportional hazards model that today bears his name. The Cox Model is widely used in the analysis of survival data and enables researchers to more easily identify the risks of specific factors for mortality or other survival outcomes among groups of patients with disparate characteristics. From disease risk assessment and treatment evaluation to product liability, school dropout, reincarceration and AIDS surveillance systems, the Cox Model has been applied in essentially all fields of science, as well as in engineering.

“Professor Cox changed how we analyze and understand the effect of natural or human-induced risk factors on survival outcomes, paving the way for powerful scientific inquiry and discoveries that have impacted human health worldwide,” said Susan Ellenberg, chair of the International Prize in Statistics Foundation. “Use of the ‘Cox Model’ in the physical, medical, life, earth, social and other sciences, as well as engineering fields, has yielded more robust and detailed information that has helped researchers and policymakers address some of society’s most pressing challenges.” Successful application of the Cox Model has led to life-changing breakthroughs with far-reaching societal effects, some of which include the following:

- Demonstrating that a major reduction in smoking-related cardiac deaths could be seen within just one year of smoking cessation, not 10 or more years as previously thought
- Showing the mortality effects of particulate air pollution, a finding that has changed both industrial practices and air quality regulations worldwide
- Identifying risk factors of coronary artery disease and analyzing treatments for lung cancer, cystic fibrosis, obesity, sleep apnea and septic shock

His mark on research is so great that his 1972 paper is one of the three most-cited papers in statistics and ranked 16th in Nature’s list of the top 100 most-cited papers of all time for all fields.

In 2010, Cox received the Copley Medal, the Royal Society’s highest award that has also been bestowed upon such other world-renowned scientists as Peter Higgs, Stephen Hawking, Albert Einstein, Francis Crick and Ronald Fisher. Knighted in 1985, Cox is a fellow of the Royal Society, an honorary fellow of the British Academy and a foreign associate of the U.S. National Academy of Sciences. He has served as president of the Bernoulli Society, Royal Statistical Society and International Statistical Institute.

Cox’s 50-year career included technical and research positions in the private and nonprofit sectors, as well as numerous academic appointments as professor or department chair at Birkbeck College, Imperial College London, Nuffield College and Oxford University. He earned his PhD from the University of Leeds in 1949, after first studying mathematics at St John’s College, Cambridge. Though he retired in 1994, Cox remains active in the profession in Oxford, England.

Cox considers himself to be a scientist who happens to specialize in the use of statistics, which is defined as the science of learning from data. A foundation of scientific inquiry, statistics is a critical component in the development of public policy and has played fundamental roles in vast areas of human development and scientific exploration.

**Note to Editors:** Digital footage of Susan Ellenberg, chair of the International Prize in Statistics Foundation, announcing the recipient will be distributed on October 20. Ellenberg and Ron Wasserstein, director of the International Prize in Statistics Foundation and executive director of the American Statistical Association, will be available for interviews that day.

Link to article: press-release-international-prize-winner

**###**

**About the International Prize in Statistics**

The International Prize in Statistics recognizes a major achievement of an individual or team in the field of statistics and promotes understanding of the growing importance and diverse ways statistics, data analysis, probability and the understanding of uncertainty advance society, science, technology and human welfare. With a monetary award of $75,000, it is given every other year by the International Prize in Statistics Foundation, which is comprised of representatives from the American Statistical Association, International Biometric Society, Institute of Mathematical Statistics, International Statistical Institute and Royal Statistical Society. Recipients are chosen from a selection committee comprised of world-renowned academicians and researchers and officially presented with the award at the World Statistics Congress.

**For more information:**

Jill Talley

Public Relations Manager,

American Statistical Association

(703) 684-1221, ext. 1865

jill@amstat.org

@amstatjill

Filed under: Announcement

**MONTHLY MEMORY LANE: 3 years ago: October 2013.** I mark in **red** three posts from each month that seem most apt for general background on key issues in this blog, excluding those reblogged recently [1], and in **green** up to 3 others I’d recommend [2]. Posts that are part of a “unit” or a pair count as one.

**October 2013**

- **(10/3) Will the Real Junk Science Please Stand Up? (critical thinking)**
- **(10/5) Was Janina Hosiasson pulling Harold Jeffreys’ leg?**
- **(10/9) Bad statistics: crime or free speech (II)? Harkonen update: Phil Stat / Law / Stock**
- **(10/12) Sir David Cox: a comment on the post, “Was Hosiasson pulling Jeffreys’ leg?”** (10/5 and 10/12 are a pair)
- (10/19) Blog Contents: September 2013
- **(10/19) Bayesian Confirmation Philosophy and the Tacking Paradox (iv)***
- **(10/25) Bayesian confirmation theory: example from last post…** (10/19 and 10/25 are a pair)
- **(10/26) Comedy hour at the Bayesian (epistemology) retreat: highly probable vs highly probed (vs what?)**
- **(10/31) WHIPPING BOYS AND WITCH HUNTERS** (interesting to see how things have changed and stayed the same over the past few years; share comments)

**[1]** Monthly memory lanes began at the blog’s 3-year anniversary in Sept, 2014.

**[2]** New Rule, July 30, 2016-very convenient.

Filed under: 3-year memory lane, Error Statistics, Statistics

Gelman and Loken (2014) recognize that even without explicit cherry picking there is often enough leeway in the “forking paths” between data and inference that artful choices may lead to one inference, even though the data could also have led another way. In good sciences, measurement procedures interlink with well-corroborated theories and offer a triangulation of checks – often missing in the types of experiments Gelman and Loken are on about. Stating a hypothesis in advance, far from protecting against verification biases, can be the engine that enables data to be “constructed” to reach the desired end [1].

[E]ven in settings where a single analysis has been carried out on the given data, the issue of multiple comparisons emerges because different choices about combining variables, inclusion and exclusion of cases…and many other steps in the analysis could well have occurred with different data (Gelman and Loken 2014, p. 464).

An idea growing out of this recognition is to imagine the results of applying the same statistical procedure under different choices at key discretionary junctures, giving rise to a *multiverse* of data sets rather than a single data set (Steegen, Tuerlinckx, Gelman, and Vanpaemel 2016). One lists the different choices thought to be plausible at each stage of data processing. The multiverse displays “which constellation of choices corresponds to which statistical results” (p. 797). The result of this exercise can, at times, mimic the delineation of possibilities in multiple testing and multiple modeling strategies.
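The bookkeeping of a multiverse analysis is just an enumeration of choice combinations. In this sketch the choice points and the analysis are hypothetical placeholders (the window labels, exclusion rules, and `run_analysis` stub are my inventions, not Steegen et al.’s actual pipeline); a real multiverse would re-process the raw data and re-run the study’s test at each combination.

```python
import itertools
import random

random.seed(3)

# Hypothetical discretionary choice points:
fertility_windows   = ["6-14", "7-14", "8-14", "9-15", "9-17"]   # cycle days
exclusions          = ["none", "drop irregular cycles"]
relationship_coding = ["binary", "three-level"]

# One universe per combination of choices.
multiverse = list(itertools.product(fertility_windows, exclusions,
                                    relationship_coding))

def run_analysis(choices):
    """Placeholder: a real analysis would re-process the data according
    to `choices` and return the resulting p-value."""
    return round(random.uniform(0.001, 0.8), 3)

results = {c: run_analysis(c) for c in multiverse}
significant = [c for c, p in results.items() if p < 0.05]

print(f"{len(multiverse)} universes, "
      f"{len(significant)} significant at .05")    # 20 universes
```

The resulting table of universes against p-values is exactly the “constellation of choices” display the authors describe.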

Steegen et al. consider the rather awful example from 2012 purporting to show that single (vs. non-single) women prefer Obama to Romney when they are highly fertile, and the reverse when they’re at low fertility. (I’m guessing there’s a hold on these ovulation studies during the current election season – maybe that’s one good thing in this election cycle. But let me know if you hear of any.)

Two studies with relatively large and diverse samples of women found that ovulation had different effects on religious and political orientation depending on whether women were single or in committed relationships. Ovulation led single women to become more socially liberal, less religious, and more likely to vote for Barack Obama (Durante et al., p. 1013).

What irks me to no end is the assumption that they’re finding effects of ovulation when all they’ve got are a bunch of correlations with lots of flexibility in analysis. (It was discussed in brief on this blogpost.) Unlike the study claiming to show that males are more likely to suffer a drop in self-esteem when their partner surpasses them in something (as opposed to when they surpass their partner), this one’s not even intuitively plausible. (For the former case of “Macho Men,” see slides starting from #48 of this post.) The ovulation study was considered so bad that people complained to the network and it had to be pulled.[2] Nevertheless, both studies are open to an analogous critique.

One of the choice points is where to draw the line at “highly fertile” based on days in a woman’s cycle. It wasn’t based on any hormone check, but on an online questionnaire asking subjects when they’d had their last period. There’s latitude in using such information (even assuming it to be accurate) to decide whether to place someone in a low or high fertility group (Steegen et al. find 5 sets of days that could have been used). It turns out that under the other choice points, many of the results were insignificant. Had the evidence been “constructed” along these alternative lines, a negative result would often have ensued. Intuitively, considering what could have happened but didn’t is quite relevant for interpreting the significant result they published. But how?

*1. A severity scrutiny*

Suppose the study is taken as evidence for

*H*_{1}: ovulation makes single women more likely to vote for Obama than Romney.

The data they selected for analysis accord with *H*_{1}, where “highly fertile” is defined in their chosen manner, leading to significance. The multiverse arrays how many other choice combinations lead to different p-values. We want to determine how good a job has been done in ruling out flaws in the study purporting to have evidence for *H*_{1}. To determine how severely *H*_{1} has passed, we’d ask:

What’s the probability they would *not* have found *some path or other* to yield statistical significance, even if in fact *H*_{1} is false and there’s no genuine effect?

We want this probability to be high, in order to argue that the significant result indicates a genuine effect. That is, we’d like some assurance that the procedure would have alerted us were *H*_{1} unwarranted. I’m not sure how to compute this using the multiverse, but it’s clear there’s more leeway than if one definition of fertility had been pinned down in advance. Perhaps each of the k different consistent combinations can count as a distinct hypothesis, and then one tries to compute the probability of getting r out of k hypotheses statistically significant, even if *H*_{1} is false, taking account of dependencies. Maybe Stan Young’s “resampling-based multiple modeling” techniques could be employed (Westfall & Young, 1993). In any event, the spirit of the multiverse is, or appears to be, a quintessentially error-statistical gambit. In appraising the well-testedness of a claim, anything that alters the probative capacity to discern flaws is relevant; anything that increases the flabbiness in uncovering flaws (in what is to be inferred) lowers the *severity* of the test that *H*_{1} has passed. Clearly, taking a walk on a data-construction highway does this – the very reason for the common call for preregistration.
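A toy version of that probability can be simulated directly. Everything below, from the window cutpoints to the two-proportion z-test, is a hypothetical stand-in of my own rather than the study’s actual analysis; it estimates the chance that some fertility definition or other yields nominal significance when cycle day is unrelated to candidate preference.

```python
import random

random.seed(11)

# Five overlapping 'high fertility' windows (hypothetical cutpoints):
WINDOWS = [(6, 14), (7, 14), (8, 14), (9, 15), (9, 17)]

def some_window_rejects(n=200):
    """Simulate n women under the null: cycle day (1-28) is unrelated
    to a binary candidate preference. Return True if ANY window's
    high/low split rejects at nominal .05 (two-proportion z-test)."""
    day  = [random.randint(1, 28) for _ in range(n)]
    pref = [random.random() < 0.5 for _ in range(n)]
    for lo, hi in WINDOWS:
        high = [p for d, p in zip(day, pref) if lo <= d <= hi]
        low  = [p for d, p in zip(day, pref) if not (lo <= d <= hi)]
        p1, p2 = sum(high) / len(high), sum(low) / len(low)
        pooled = (sum(high) + sum(low)) / n
        se = (pooled * (1 - pooled) * (1/len(high) + 1/len(low))) ** 0.5
        if se > 0 and abs(p1 - p2) / se > 1.96:
            return True
    return False

trials = 4000
rate = sum(some_window_rejects() for _ in range(trials)) / trials
print(f"P(some window 'finds' an effect | no effect): {rate:.3f}")
# Exceeds the nominal .05, though less than five independent tests
# would, since the overlapping windows are highly correlated.
```

Whatever the exact number, the probability of *not* stumbling on significance somewhere is lower than advertised, which is precisely the severity deficit.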

If one hadn’t preregistered, and all the other plausible combinations of choices yield non-significance, there’s a strong inkling that researchers selectively arrived at their result. If one had preregistered, finding that other paths yield non-significance is still informative about the fragility of the result. On the other hand, suppose one had preregistered and obtained a negative result. In the interest of reporting the multiverse, positive results may be disinterred, possibly offsetting the initial negative result.

*2. It Is Meant to Apply to Both Bayesian and Frequentist Approaches*

I find it interesting that the authors say that “a multiverse analysis is valuable, regardless of the inferential framework (frequentist or Bayesian)” and regardless of whether the inference is in the form of p-values, CIs, Bayes factors, or posteriors (p. 709). Do the Bayesian tests (posteriors or Bayes factors) find evidence against *H*_{1} just when the configuration yields an insignificant result? We’re not told. No, I don’t see why they would. It would depend, of course, on the choice of alternatives and priors. Given how strongly authors Durante et al. believe *H*_{1}, it wouldn’t be surprising if the multiverse continues to find evidence for it (with a high posterior or high Bayes factor in favor of *H*_{1}). Presumably the flexibility in discretionary choices is to show up in diminished Bayesian evidence for *H*_{1}, but it’s not clear to me how. Nevertheless, even if the approach doesn’t itself consider error probabilities of methods, we can set out to appraise severity on the meta-level. We may argue that there’s a high probability of finding evidence in favor of some alternative *H*_{1} or other (varying over definitions of high fertility, say), even if it’s false. Yet I don’t think that’s what Steegen et al. have in mind. I welcome a clarification.

**3. Auditing: Just Falsify the Test, If You Can**

I find a lot to like in the multiverse scrutiny with its recognition of how different choice points in modeling and collecting data introduce the same kind of flexibility as explicit data-dependent searches. There are some noteworthy differences between it and the kind of critique I’ve proposed.

If no strong arguments can be made for certain choices, we are left with many branches of the multiverse that have large p-values. In these cases, the only reasonable conclusion on the effect of fertility is that there is considerable scientific uncertainty. One should reserve judgment…researchers interested in studying the effects of fertility should work hard to deflate the multiverse (Steegen et al., p. 708).

Reserve judgment? Here’s another reasonable conclusion: the core presumptions are falsified (or would be with little effort). What is overlooked in all of these fascinating multiverses is whether the entire inquiry makes any sense. One should expose or try to expose the unwarranted presuppositions. This is part of what I call *auditing*. The error-statistical account always includes the hypothesis: *the test was poorly run, they’re not measuring what they purport to be, or the assumptions are violated.* Say each person with high fertility in the first study is tested for candidate preference at a time next month when she is in the low fertility stage. If the voting preferences are the same, *the test is falsified.*

The onus is on the researchers to belie the hypothesis that the test was poorly run; but if they don’t, then we must.[3]

Please share your comments, suggestions, and any links to approaches related to the multiverse analysis.

**Adapted from Mayo, Statistical Inference as Severe Testing (forthcoming)**

[1] I’m reminded of Stapel’s “fix” for science: admit the story you want to tell and how you fixed the statistics to tell it. See this post.

[2] “Last week CNN pulled a story about a study purporting to demonstrate a link between a woman’s ovulation and how she votes, explaining that it failed to meet the cable network’s editorial standards. The story was savaged online as ‘silly,’ ‘stupid,’ ‘sexist,’ and ‘offensive.’ Others were less nice.” (Citation may be found here.)

[3] I have found nearly all experimental studies in the social sciences to be open to a falsification probe, and many are readily falsifiable. The fact that some have built-in ways to try and block falsification brings them closer to falling over the edge into questionable science. This is so, even in cases where their hypotheses are plausible. This is a far faster route to criticism than non-replication and all the rest.

**References:**

Durante, K.M., Rae, A., and Griskevicius, V. 2013. “The Fluctuating Female Vote: Politics, Religion, and the Ovulatory Cycle,” *Psychological Science* 24(6): 1007-1016.

Gelman, A. and Loken, E. 2014. “The Statistical Crisis in Science,” *American Scientist* 102(6): 460-465.

Mayo, D. *Statistical Inference as Severe Testing*. CUP (forthcoming).

Steegen, S., Tuerlinckx, F., Gelman, A., and Vanpaemel, W. 2016. “Increasing Transparency Through a Multiverse Analysis,” *Perspectives on Psychological Science* 11(5): 702-712.

Westfall, P. H. and S.S. Young. 1993. *Resampling-Based Multiple Testing: Examples and Methods for P-Value Adjustment*. A Wiley-Interscience Publication. Wiley.

Filed under: Bayesian/frequentist, Error Statistics, Gelman, P-values, preregistration, reproducibility, Statistics

Leek’s post from yesterday, “Statistical Vitriol” (29 Sep 2016), calls for de-escalation of the consequences of statistical mistakes:

Over the last few months there has been a lot of vitriol around statistical ideas. First there were data parasites and then there were methodological terrorists. These epithets came from established scientists who have relatively little statistical training. There was the predictable backlash to these folks from their counterparties, typically statisticians or statistically trained folks who care about open source.

I’m a statistician who cares about open source but I also frequently collaborate with scientists from different fields. It makes me sad and frustrated that statistics – which I’m so excited about and have spent my entire professional career working on – is something that is causing so much frustration, anxiety, and anger.

I have been thinking a lot about the cause of this anger and division in the sciences. As a person who interacts with both groups pretty regularly I think that the reasons are some combination of the following.

1. Data is now everywhere, so every single publication involves some level of statistical modeling and analysis. It can’t be escaped.

2. The deluge of scientific papers means that only big claims get your work noticed, get you into fancy journals, and get you attention.

3. Most senior scientists, the ones leading and designing studies, have little or no training in statistics. There is a structural reason for this: data was sparse when they were trained and there wasn’t any reason for them to learn statistics. So statistics and data science wasn’t (and still often isn’t) integrated into medical and scientific curricula.

*Even for senior scientists in charge of designing statistical studies?*

4. There is an imbalance of power in the scientific process between statisticians/computational scientists and scientific investigators or clinicians. The clinicians/scientific investigators are “in charge” and the statisticians are often relegated to a secondary role. … There are a large number of lonely bioinformaticians out there.

5. Statisticians and computational scientists are also frustrated because there is often no outlet for them to respond to these papers in the formal scientific literature – those outlets are controlled by scientists and rarely have statisticians in positions of influence within the journals.

Since statistics is everywhere (1) and only flashy claims get you into journals (2) and the people leading studies don’t understand statistics very well (3), you get many publications where the paper makes a big claim based on shaky statistics but it gets through. This then frustrates the statisticians because they have little control over the process (4) and can’t get their concerns into the published literature (5).

This used to just result in lots of statisticians and computational scientists complaining behind closed doors. The internet changed all that, everyone is an internet scientist now.

…Sometimes to get attention, statisticians start to have the same problem as scientists; they need their complaints to get attention to have any effect. So they go over the top. They accuse people of fraud, or being statistically dumb, or nefarious, or intentionally doing things with data, or cast a wide net and try to implicate a large number of scientists in poor statistics. The ironic thing is that these things are the same thing that the scientists are doing to get attention that frustrated the statisticians in the first place.

Just to be 100% clear here I am also guilty of this. I have definitely fallen into the hype trap – talking about the “replicability crisis”. I also made the mistake earlier in my blogging career of trashing the statistics of a paper that frustrated me. …

I also understand the feeling of “being under attack”. I’ve had that happen to me too and it doesn’t feel good. So where do we go from here? How do we end statistical vitriol and make statistics a positive force? Here is my six part plan:

1. We should create continuing education for senior scientists and physicians in statistical and open data thinking, so that people who never got that training can understand the unique requirements of a data-rich scientific world.
2. We should encourage journals and funders to incorporate statisticians and computational scientists at the highest levels of influence, so that they can drive policy that makes sense in this new data-driven time.
3. We should recognize that scientists and data generators have a lot more on the line when they produce a result or a scientific data set. We should give them appropriate credit for doing that even if they don't get the analysis exactly right.
4. We should de-escalate the consequences of statistical mistakes. Right now the consequences are: retractions that hurt careers, blog posts that are aggressive and often too personal, and humiliation by the community. We should make it easy to acknowledge these errors without ruining careers. This will be hard – scientists' careers often depend on the results they get (recall 2 above). So we need a way to pump up/give credit to/acknowledge scientists who are willing to sacrifice that to get the stats right.
5. We need to stop treating retractions/statistical errors/mistakes like a sport where there are winners and losers. Statistical criticism should be easy, allowable, publishable and not angry or personal.
6. Any paper where statistical analysis is part of the paper must have a statistically trained author, a statistically trained reviewer, or both. I wouldn't believe a paper on genomics that was performed entirely by statisticians with no biology training any more than I believe a paper with statistics in it performed entirely by physicians with no statistical training.

I think scientists forget that statisticians feel un-empowered in the scientific process and statisticians forget that a lot is riding on any given study for a scientist. So being a little more sympathetic to the pressures we all face would go a long way to resolving statistical vitriol.

What do you think of his six-part plan? More carrots or more sticks? (You can read his post here.)

There may be a fairly wide disparity between the handling of these issues in medicine and biology as opposed to the social sciences. In psychology at least, it appears my predictions (vague, but clear enough) of the likely untoward consequences of their way of handling their “replication crisis” are proving all too true. (See, for example, this post.)

Compare Leek to Gelman's recent blog post on Susan Fiske, the person raising accusations of "methodological terrorism." (I don't know if Fiske coined the term, but I consider the analogy reprehensible and think she should retract it.) Here's from Gelman:

Who is Susan Fiske and why does she think there are methodological terrorists running around? I can’t be sure about the latter point because she declines to say who these terrorists are or point to any specific acts of terror. Her article provides exactly zero evidence but instead gives some uncheckable half-anecdotes.

I first heard of Susan Fiske because her name was attached as editor to the aforementioned PPNAS articles on himmicanes, etc. So, at least in some cases, she’s a poor judge of social science research….

Fiske’s own published work has some issues too. I make no statement about her research in general, as I haven’t read most of her papers. What I do know is what Nick Brown sent me: [an article] by Amy J. C. Cuddy, Michael I. Norton, and Susan T. Fiske (Journal of Social Issues, 2005). . . .

This paper was just riddled through with errors. First off, its main claims were supported by t statistics of 5.03 and 11.14 . . . ummmmm, upon recalculation the values were actually 1.8 and 3.3. So one of the claims wasn't even "statistically significant" (thus, under the rules, was unpublishable).

….The short story is that Cuddy, Norton, and Fiske made a bunch of data errors—which is too bad, but such things happen—and then when the errors were pointed out to them, they refused to reconsider anything. Their substantive theory is so open-ended that it can explain just about any result, any interaction in any direction.

And that’s why the authors’ claim that fixing the errors “does not change the conclusion of the paper” is both ridiculous and all too true….

The other thing that’s sad here is how Fiske seems to have felt the need to compromise her own principles here. She deplores “unfiltered trash talk,” “unmoderated attacks” and “adversarial viciousness” and insists on the importance of “editorial oversight and peer review.” According to Fiske, criticisms should be “most often in private with a chance to improve (peer review), or at least in moderated exchanges (curated comments and rebuttals).” And she writes of “scientific standards, ethical norms, and mutual respect.”

But Fiske expresses these views in an unvetted attack in an unmoderated forum with no peer review or opportunity for comments or rebuttals, meanwhile referring to her unnamed adversaries as "methodological terrorists." Sounds like unfiltered trash talk to me. But, then again, I haven't seen Fiske on the basketball court so I really have no idea what she sounds like when she's really trash talkin'.

(You can read Gelman's post, which also includes a useful chronology of events, here.)

How can Leek's 6-point plan of "peaceful engagement" work in cases where authors deny the errors really matter? What if they view statistics as so much holy water to dribble over their data, mere window-dressing to attain a veneer of science? I have heard some (successful) social scientists say this aloud (privately)! Far from showing that the claims they infer have withstood genuine attempts at falsification (as good Popperians would demand), the entire effort is a self-sealing affair, dressed up with statistical razzmatazz.

So, I concur with Gelman, who has no sympathy for those who wish to protect their work from criticism while going merrily on their way using significance tests illicitly. I also have no sympathy for those who think the cure is merely lowering p-values, or embracing methods where the assessment and control of error probabilities are absent. For me, by the way, error probability control is not about securing good long-run error rates; it is what ensures a severe probing of error in the case at hand.

One group may unfairly call the critics "methodological terrorists." Another may unfairly demonize the statistical methods as the villains to be blamed, banned and eradicated. It's all the p-value's fault that there's bad science (never mind that both the failed replications and the fraudbusting rest on the use of significance tests). Worse, in some circles, methods that neatly hide the damage from biasing selection effects are championed (in high places)![1]

Gelman says the paradigm of erroneously moving from an already spurious p-value to a substantive claim (thereby doubling up on the blunders) is dead. Is it? That would be swell, but I have my doubts, especially in the most troubling areas. They didn't nail Potti and Nevins, whose erroneous cancer trials had life-threatening consequences; we can scarcely feel confident that such finagling isn't continuing in clinical trials (see this post), though I think there's some hope for improvements. But how can it be that "senior scientists, the ones leading and designing studies, have little or no training in statistics," as Leek says? This is exactly why everyone could say "it's not my job" in the horror story of the Potti and Nevins fraud. At least social psychologists aren't using their results to base decisions on chemo treatments for breast cancer patients.

In the social sciences, the replication revolution has raised awareness, no doubt, and it's altogether a plus that they're stressing preregistration. But it's been such a windfall, one cannot help asking: why would a field whose own members frequently write about its "perverse incentives" have an incentive to kill the cash cow? Especially with all its interesting side-lines? Replication research has a life of its own, and offers a career of its own with grants aplenty. So grist for its mills would need to continue. That's rather cynical, but unless they're prepared to call out bad science, including mounting serious critiques of widely held experimental routines and measurements (which could well lead to whole swaths of inquiry falling by the wayside), I don't see how any other outcome is to be expected.

Share your thoughts. I wrote much more, but it got too long. I may continue this…

**Related:**

- “Don’t Throw Out the Error Control Baby with the Bad Statistics Bathwater”
- “P-Value Madness: A Puzzle About the Latest Test Ban, or ‘Don’t Ask, Don’t Tell’”
- “Repligate Returns (or, the Non Significance of Nonsignificant Results Are the New Significant Results)”
- “The Paradox of Replication and the Vindication of the P-value, but She Can Go Deeper”

Send me related links you find (on comments) and I’ll post them.

1) “There’s no tone problem in psychology,” Tal Yarkoni

2) “Menschplaining: Three Ideas for Civil Criticism,” Uri Simonsohn on Data Colada

[1] I do not attribute this stance to Gelman, who has made it clear that he cares about what could have happened but didn’t in analyzing tests, and who is sympathetic to the idea of statistical tests as error probes:

“But I do not make these decisions on altering, rejecting, and expanding models based on the posterior probability that a model is true. …In statistical terms, an anomaly is a misfit of model to data (or perhaps an internal incoherence of the model), and it can be identified by a (Fisherian) hypothesis test without reference to any particular alternative (what Cox and Hinkley 1974 call “pure significance testing”). … At the next stage, we see science—and applied statistics—as resolving anomalies via the creation of improved models which often include their predecessors as special cases. This view corresponds closely to the error-statistics idea of Mayo (1996)” (Gelman 2011, p. 70).

Filed under: Anil Potti, fraud, Gelman, pseudoscience, Statistics

Departament de Filosofia & Centre d’Història de la Ciència (CEHIC), Universitat Autònoma de Barcelona (UAB)

Location: CEHIC, Mòdul de Recerca C, Seminari L3-05, c/ de Can Magrans s/n, Campus de la UAB, 08193 Bellaterra (Barcelona)

*Organized by Thomas Sturm & Agustí Nieto-Galan*

Current science is full of uncertainties and risks that weaken the authority of experts. Moreover, sometimes scientists themselves act in ways that weaken their standing: they manipulate data, exaggerate research results, do not give credit where it is due, violate the norms for the acquisition of academic titles, or are unduly influenced by commercial and political interests. Such actions, of which there are numerous examples in past and present times, are widely conceived of as violating standards of good scientific practice. At the same time, while codes of scientific conduct have been developed in different fields, institutions, and countries, there is no universally agreed canon of them, nor is it clear that there should be one. The workshop aims to bring together historians and philosophers of science in order to discuss questions such as the following: What exactly is scientific misconduct? Under which circumstances are researchers more or less liable to misconduct? How far do cases of misconduct undermine scientific authority? How have standards or mechanisms to avoid misconduct, and to regain scientific authority, been developed? How should they be developed?

**All welcome – but since space is limited, please register in advance. Write to:** Thomas.Sturm@uab.cat

09:30 Welcome (Thomas Sturm & Agustí Nieto-Galan)

9:45 José Ramón Bertomeu-Sánchez (IHMC, Universitat de València): *Managing Uncertainty in the Academy and the Courtroom: Normal Arsenic and Nineteenth-Century Toxicology*

10:30 Carl Hoefer (ICREA & Philosophy, University of Barcelona): *Comments on Bertomeu-Sánchez*

10:45 Discussion (Chair: Agustí Nieto-Galan)

11:30 Coffee break

12:00 David Teira (UNED, Madrid): *Does Replication help with Experimental Biases in Clinical Trials?*

12:45 Javier Moscoso (CSIC, Madrid): *Comment on Teira*

13:00 Discussion (Chair: Thomas Sturm)

13:45-15:00 Lunch

15:00 Torsten Wilholt (Philosophy, Leibniz University Hannover): *Bias, Fraud and Interests in Science*

15:45 Oliver Hochadel (IMF, CSIC, Barcelona): *Comments on Wilholt*

16:00 Discussion (Chair: Silvia de Bianchi)

16:45-17:15: Agustí Nieto-Galan & Thomas Sturm: Concluding reflections

**ABSTRACTS**

José Ramón Bertomeu-Sánchez: **Managing Uncertainty in the Academy and the Courtroom: Normal Arsenic and Nineteenth-Century Toxicology**

This paper explores how the enhanced sensitivity of chemical tests sometimes produced unforeseen and puzzling problems in nineteenth-century toxicology. It focuses on the earliest uses of the Marsh test for arsenic and the controversy surrounding “normal arsenic”, i.e., the existence of traces of arsenic in healthy human bodies. The paper follows the circulation of the Marsh test in French toxicology and its appearance in the academy, the laboratory and the courtroom. The new chemical tests could detect very small quantities of poison, but their high sensitivity also offered new opportunities for imaginative defense attorneys to undermine the credibility of expert witnesses. In this context, toxicologists had to dispel the uncertainty associated with the new method, and to find arguments to refute the many possible criticisms (of which “normal arsenic” was one). Meanwhile, new descriptions of animal experiments, autopsies and cases of poisoning produced a steady flow of empirical data, sometimes supporting but, in many cases, questioning previous conclusions about the reliability of chemical tests. This particularly challenging scenario provides many clues about the complex interaction between science and law in the nineteenth century, particularly on how expert authority, credibility and trustworthiness were constructed, and frequently challenged, in the courtroom.

David Teira: **Does Replication help with Experimental Biases in Clinical Trials?**

This is an analysis of the role of replicability in correcting biases in the design and conduct of clinical trials. We take as biases those confounding factors that a community of experimenters acknowledges and for which there are agreed debiasing methods. When these methods are implemented in a trial, we will speak of *unintended biases*, if they occur. Replication helps in detecting and correcting them. *Intended biases* occur when the relevant debiasing method is not implemented. Their effect may be stable and replication, on its own, will not detect them. *Interested* outcomes are treatment variables that not every stakeholder considers clinically relevant. Again, they may be perfectly replicable. Intended biases, unintended biases and interested outcomes are often conflated in the so-called replicability crisis: our analysis shows that fostering replicability, on its own, will not sort out the crisis.

Torsten Wilholt: **Bias, Fraud and Interests in Science**

Cases of fraud and misconduct are the most extreme manifestations of the adverse effects that conflicts of interests can have on science. Fabrication of data and falsification of results may sometimes be difficult to detect, but they are easy to describe as epistemological failures. But arguably, detrimental effects of researchers’ interests can also take more subtle forms. There are numerous ways by which researchers can influence the balance between the sensitivity and the specificity of their investigation. Is it possible to mark out some such trade-offs as cases of detrimental bias? I shall argue that it is, and that the key to understanding bias in science lies in relating it to the phenomenon of epistemic trust. Like fraud, bias exerts its negative epistemic effects by undermining the trust amongst scientists as well as the trust invested in science by the public. I will point out how this analysis can help us to draw the fine lines that separate unexceptionable from biased research and the latter from actual fraud.

Filed under: Announcement, replication research

**Today is George Barnard’s 101st birthday. In honor of this, I reblog an exchange between Barnard, Savage (and others) on likelihood vs probability. The exchange is from pp. 79-84 of (what I call) “The Savage Forum” (Savage, 1962).[i] Six other posts on Barnard are linked below: 2 are guest posts (Senn, Spanos); the other 4 include a play (pertaining to our first meeting), and a letter he wrote to me.**

♠♠♠♠♠♠♠♠♠♠♠♠♠♠♠♠♠♠♠♠♠♠♠♠♠♠♠♠♠♠♠♠♠♠♠♠♠♠♠♠

**BARNARD**:…Professor Savage, as I understand him, said earlier that a difference between likelihoods and probabilities was that probabilities would normalize because they integrate to one, whereas likelihoods will not. Now probabilities integrate to one only if all possibilities are taken into account. This requires in its application to the probability of hypotheses that we should be in a position to enumerate all possible hypotheses which might explain a given set of data. Now I think it is just not true that we ever can enumerate all possible hypotheses. … If this is so we ought to allow that in addition to the hypotheses that we really consider we should allow something that we had not thought of yet, and of course as soon as we do this we lose the normalizing factor of the probability, and from that point of view probability has no advantage over likelihood. This is my general point, that I think while I agree with a lot of the technical points, I would prefer that this is talked about in terms of likelihood rather than probability. I should like to ask what Professor Savage thinks about that, whether he thinks that the necessity to enumerate hypotheses exhaustively, is important.

**SAVAGE**: Surely, as you say, we cannot always enumerate hypotheses so completely as we like to think. The list can, however, always be completed by tacking on a catch-all ‘something else’. In principle, a person will have probabilities given ‘something else’ just as he has probabilities given other hypotheses. In practice, the probability of a specified datum given ‘something else’ is likely to be particularly vague–an unpleasant reality. The probability of ‘something else’ is also meaningful of course, and usually, though perhaps poorly defined, it is definitely very small. Looking at things this way, I do not find probabilities unnormalizable, certainly not altogether unnormalizable.

Whether probability has an advantage over likelihood seems to me like the question whether volts have an advantage over amperes. The meaninglessness of a norm for likelihood is for me a symptom of the great difference between likelihood and probability. Since you question that symptom, I shall mention one or two others. …

On the more general aspect of the enumeration of all possible hypotheses, I certainly agree that the danger of losing serendipity by binding oneself to an over-rigid model is one against which we cannot be too alert. We must not pretend to have enumerated all the hypotheses in some simple and artificial enumeration that actually excludes some of them. The list can however be completed, as I have said, by adding a general ‘something else’ hypothesis, and this will be quite workable, provided you can tell yourself in good faith that ‘something else’ is rather improbable. The ‘something else’ hypothesis does not seem to make it any more meaningful to use likelihood for probability than to use volts for amperes.

Let us consider an example. Off hand, one might think it quite an acceptable scientific question to ask, ‘What is the melting point of californium?’ Such a question is, in effect, a list of alternatives that pretends to be exhaustive. But, even specifying which isotope of californium is referred to and the pressure at which the melting point is wanted, there are alternatives that the question tends to hide. It is possible that californium sublimates without melting or that it behaves like glass. Who dare say what other alternatives might obtain? An attempt to measure the melting point of californium might, if we are serendipitous, lead to more or less evidence that the concept of melting point is not directly applicable to it. Whether this happens or not, Bayes’s theorem will yield a posterior probability distribution for the melting point given that there really is one, based on the corresponding prior conditional probability and on the likelihood of the observed reading of the thermometer as a function of each possible melting point. Neither the prior probability that there is no melting point, nor the likelihood for the observed reading as a function of hypotheses alternative to that of the existence of a melting point enter the calculation. The distinction between likelihood and probability seems clear in this problem, as in any other.

**BARNARD**: Professor Savage says in effect, ‘add at the bottom of list H_{1}, H_{2},…”something else”’. But what is the probability that a penny comes up heads given the hypothesis ‘something else’. We do not know. What one requires for this purpose is not just that there should be some hypotheses, but that they should enable you to compute probabilities for the data, and that requires very well defined hypotheses. For the purpose of applications, I do not think it is enough to consider only the conditional posterior distributions mentioned by Professor Savage.

**LINDLEY**: I am surprised at what seems to me an obvious red herring that Professor Barnard has drawn across the discussion of hypotheses. I would have thought that when one says this posterior distribution is such and such, all it means is that among the hypotheses that have been suggested the relevant probabilities are such and such; conditionally on the fact that there is nothing new, here is the posterior distribution. If somebody comes along tomorrow with a brilliant new hypothesis, well of course we bring it in.

**BARTLETT**: But you would be inconsistent because your prior probability would be zero one day and non-zero another.

**LINDLEY**: No, it is not zero. My prior probability for other hypotheses may be ε. All I am saying is that conditionally on the other 1 – ε, the distribution is as it is.

**BARNARD**: Yes, but your normalization factor is now determined by ε. Of course ε may be anything up to 1. Choice of letter has an emotional significance.

**LINDLEY**: I do not care what it is as long as it is not one.

**BARNARD**: In that event two things happen. One is that the normalization has gone west, and hence also this alleged advantage over likelihood. Secondly, you are not in a position to say that the posterior probability which you attach to an hypothesis from an experiment with these unspecified alternatives is in any way comparable with another probability attached to another hypothesis from another experiment with another set of possibly unspecified alternatives. This is the difficulty over likelihood. Likelihood in one class of experiments may not be comparable to likelihood from another class of experiments, because of differences of metric and all sorts of other differences. But I think that you are in exactly the same difficulty with conditional probabilities just because they are conditional on your having thought of a certain set of alternatives. It is not rational in other words. Suppose I come out with a probability of a third that the penny is unbiased, having considered a certain set of alternatives. Now I do another experiment on another penny and I come out of that case with the probability one third that it is unbiased, having considered yet another set of alternatives. There is no reason why I should agree or disagree in my final action or inference in the two cases. I can do one thing in one case and other in another, because they represent conditional probabilities leaving aside possibly different events.

**LINDLEY**: All probabilities are conditional.

**BARNARD**: I agree.

**LINDLEY**: If there are only conditional ones, what is the point at issue?

**PROFESSOR E.S. PEARSON**: I suggest that you start by knowing perfectly well that they are conditional and when you come to the answer you forget about it.

**BARNARD**: The difficulty is that you are suggesting the use of probability for inference, and this makes us able to compare different sets of evidence. Now you can only compare probabilities on different sets of evidence if those probabilities are conditional on the same set of assumptions. If they are not conditional on the same set of assumptions they are not necessarily in any way comparable.

**LINDLEY**: Yes, if this probability is a third conditional on that, and if a second probability is a third, conditional on something else, a third still means the same thing. I would be prepared to take my bets at 2 to 1.

**BARNARD**: Only if you knew that the condition was true, but you do not.

**GOOD**: Make a conditional bet.

**BARNARD**: You can make a conditional bet, but that is not what we are aiming at.

**WINSTEN**: You are making a cross comparison where you do not really want to, if you have got different sets of initial experiments. One does not want to be driven into a situation where one has to say that everything with a probability of a third has an equal degree of credence. I think this is what Professor Barnard has really said.

**BARNARD**: It seems to me that likelihood would tell you that you lay 2 to 1 in favour of H_{1} against H_{2}, and the conditional probabilities would be exactly the same. Likelihood will not tell you what odds you should lay in favour of H_{1} as against the rest of the universe. Probability claims to do that, and it is the only thing that probability can do that likelihood cannot.

You can read the rest of pages 78-103 of the Savage Forum here.
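Barnard’s normalization worry can be put numerically (a made-up illustration, not part of the Forum; the hypotheses, data, and ε are all invented here). The likelihood ratio between two specified hypotheses is fixed by the data alone, but the posterior attached to either one depends on the likelihood of the catch-all “something else” — a quantity Barnard insists we cannot compute:

```python
from math import comb

def lik(theta, heads=7, n=10):
    # Binomial likelihood of the (invented) data, 7 heads in 10 tosses,
    # under heads-probability theta
    return comb(n, heads) * theta**heads * (1 - theta)**(n - heads)

L1, L2 = lik(0.5), lik(0.7)  # H1: theta = 0.5,  H2: theta = 0.7
print(f"likelihood ratio H2:H1 = {L2 / L1:.2f}")  # fixed by the data alone

eps = 0.1                    # prior reserved for the catch-all "something else"
p1 = p2 = (1 - eps) / 2      # remaining prior split evenly over H1 and H2
for L3 in (0.0, 0.1, 0.5):   # guesses at P(data | something else) -- unknown!
    post1 = p1 * L1 / (p1 * L1 + p2 * L2 + eps * L3)
    print(f"assumed catch-all likelihood {L3}: P(H1 | data) = {post1:.2f}")
```

The printed posterior for H1 shifts with each guess at the catch-all’s likelihood, while the likelihood ratio never moves — which is just Barnard’s point that the catch-all makes normalization, and hence cross-experiment comparison of posteriors, depend on something “we do not know.”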

**HAPPY BIRTHDAY GEORGE!**

**References**

[i] Savage, L. (1962), “Discussion”, in *The Foundations of Statistical Inference: A Discussion*, (G. A. Barnard and D. R. Cox eds.), London: Methuen, 76.

***Six other Barnard links on this blog:**

**Guest Posts: **

**Aris Spanos: Comment on the Barnard and Copas (2002) Empirical Example**

**Stephen Senn: On the (ir)relevance of stopping rules in meta-analysis**

**Posts by Mayo:**

**Barnard, Background Information, and Intentions**

**Statistical Theater of the Absurd: Stat on a Hot tin Roof**

**George Barnard’s 100th Birthday: We Need More Complexity and Coherence in Statistical Education**

**Letter from George Barnard on the Occasion of my Lakatos Award**

**Links to a scan of the entire Savage forum may be found at: https://errorstatistics.com/2013/04/06/who-is-allowed-to-cheat-i-j-good-and-that-after-dinner-comedy-hour/**

Filed under: Barnard, highly probable vs highly probed, phil/history of stat, Statistics

**I. The myth of objectivity.** Whenever you come up against blanket slogans such as “no methods are objective” or “all methods are equally objective and subjective,” it is a good guess that the problem is being trivialized into oblivion. Yes, there are judgments, disagreements, and values in any human activity; but on its own that observation is too trivial to distinguish among the very different ways that threats of bias and unwarranted inference may be controlled. Is the objectivity-subjectivity distinction really toothless, as many would have you believe? I say no.

Cavalier attitudes toward objectivity are in tension with widely endorsed movements to promote replication and reproducibility, and to come clean on a number of sources behind illicit results: multiple testing, cherry picking, failed assumptions, researcher latitude, publication bias and so on. The moves to take back science, if they are not mere lip-service, are rooted in the supposition that we can more objectively scrutinize results, even if it’s only to point out those that are poorly tested. The fact that the term “objectivity” is used equivocally should not be taken as grounds to oust it, but rather to engage in the difficult work of identifying what there is in “objectivity” that we won’t give up, and shouldn’t.

**II. The Key is Getting Pushback.** While knowledge gaps leave plenty of room for biases, arbitrariness and wishful thinking, we regularly come up against data that thwart our expectations and disagree with the predictions we try to foist upon the world. *We get pushback!* This supplies objective constraints on which our critical capacity is built. Our ability to recognize when data fail to match anticipations affords the opportunity to systematically improve our orientation. In an adequate account of statistical inference, explicit attention is paid to communicating results to set the stage for others to check, debate, extend or refute the inferences reached. Don’t let anyone say you can’t hold them to an objective account of statistical inference.

If you really want to find something out, and have had some experience with flaws and foibles, you deliberately arrange inquiries so as to capitalize on pushback, on effects that will not go away, and on strategies to get errors to ramify quickly to force you to pay attention to them. The ability to register alterations in error probabilities due to hunting, optional stopping, and other questionable research practices (QRPs) is a crucial part of objectivity in statistics. In statistical design, day-to-day tricks of the trade to combat bias are amplified and made systematic. It is not because of a “disinterested stance” that such methods are invented. It is that we, competitively and self-interestedly, want to find things out.
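The alteration in error probabilities due to one such QRP, optional stopping, can be made concrete with a small simulation (a hypothetical sketch, not from the post; the sample sizes and cutoff are invented here). An experimenter samples from a true null, runs a two-sided test after every new observation, and “stops at significance” the first time the nominal p-value dips below 0.05:

```python
import math
import random

def z_test_p(xs):
    # Two-sided p-value for H0: mean = 0, with known sd = 1
    z = sum(xs) / math.sqrt(len(xs))
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

def try_and_try_again(max_n, rng):
    # Sample from H0 (mean 0); test after each observation from n = 10
    # to max_n and report a "finding" at the first nominal p < 0.05.
    xs = []
    for _ in range(max_n):
        xs.append(rng.gauss(0, 1))
        if len(xs) >= 10 and z_test_p(xs) < 0.05:
            return True  # a "finding", though H0 is true
    return False

rng = random.Random(1)
trials = 2000
hits = sum(try_and_try_again(100, rng) for _ in range(trials))
print(f"actual false-positive rate: {hits / trials:.2f}")  # well above the nominal 0.05
```

The reported rate of spurious “findings” comes out several times the advertised 5%: exactly the sense in which one reports the observed effect as difficult to achieve by chance when in fact the stopping rule has made it easy.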

Admittedly, that desire won’t suffice to incentivize objective scrutiny if you can do just as well producing junk. Succeeding in scientific learning is very different from success at grants, honors, publications, or engaging in technical activism, replication research and meta-research. That’s why the reward structure of science is so often blamed nowadays. New incentives, gold stars and badges for sharing data, preregistration, and resisting the urge to cherry pick, outcome-switch, or otherwise engage in bad science are proposed. I say that if the allure of carrots has grown stronger than that of the sticks (which it has), then what we need are stronger sticks.

**III. Objective procedures.** It is often urged that, however much we may aim at objective constraints, we can never have clean hands, free of the influence of beliefs and interests. But the fact that my background knowledge enters in researching a claim does not, by itself, show the inquiry to be subjective; the question is whether that knowledge enters in ways that can themselves be checked.

Others argue that we invariably sully methods of inquiry by the entry of personal judgments in their specification and interpretation. It’s just human all too human. *The issue is not that a human is doing the measuring; the issue is whether that which is being measured is something we can reliably use to solve some problem of inquiry.* That an inference is done by machine, untouched by human hands, wouldn’t make it objective, in the relevant sense. There are three distinct requirements for an objective procedure for solving problems of inquiry:

*Relevance*: It should be relevant to learning about the intended topic of inquiry; having an uncontroversial way to measure something doesn’t make it relevant to solving a knowledge-based problem of inquiry.

*Reliably capable*: It should not routinely declare the problem solved when it is not solved (or solved incorrectly); it should be capable of controlling the frequency of erroneous reports of purported answers to questions.

*Capacity to learn from error:* If the problem is not solved (or poorly solved) at a given stage, the method should set the stage for pinpointing why. (It should be able to at least embark on an inquiry for solving “Duhemian problems” of where to lay blame for anomalies.)

Yes, there are numerous choices in collecting, analyzing, modeling, and drawing inferences from data, and there is often disagreement about how they should be made. Why suppose this means all accounts are in the same boat as regards subjective factors? It need not, and they are not. An account of inference shows itself to be objective precisely in how it steps up to the plate in handling potential threats to objectivity.

**IV. Idols of Objectivity. **We should reject phony objectivity and the false trappings of objectivity. They often grow out of one or another philosophical conception of what objectivity requires—even though you will almost surely not see them described that way. If objectivity is thought to be limited to direct observations (whatever those are) plus mathematics and logic, as the typical logical positivist held, then it’s no surprise to wind up worshiping “the idols of a universal method,” as Gigerenzer and Marewski (2015) call it. Such a method is to supply a formal, ideally mechanical, way to process statements of observations and hypotheses. Recognizing that such mechanical rules don’t exist is not, for many, a reason to relinquish the view that objectivity demands them. Instead, objectivity goes by the board, replaced by various stripes of relativism and constructivism, or more extreme forms of postmodernism.

Relativists may augment their rather thin gruel with a pseudo-objectivity arising from social or political negotiation, cost-benefits (“they’re buying it”), or a type of consensus (“it’s in a 5 star journal”), but that’s to give away the goat far too soon. The result is to abandon the core stipulations of scientific objectivity. To be clear: there are authentic problems that threaten objectivity. We shouldn’t allow outdated philosophical accounts to induce us to give it up.

**V. From Discretion to Subjective Probabilities. **Some argue that the “discretionary choices” in tests, which Neyman himself tended to call “subjective”[1], lead us to subjective probabilities in claims. A weak version goes: since you can’t avoid subjective (discretionary) choices in getting the data and the model, there can be little ground for complaint about subjective degrees of belief in the resulting inference. This is weaker than arguing you must use subjective probabilities; it argues merely that doing so *is no worse than* discretion. But it still misses the point.

First, even if the discretionary judgments made on the journey to a statistical inference/model have the capability to introduce subjectivity, they need not. Second, not all discretionary judgments are in the same boat when it comes to being **open to severe testing.**

A stronger version of the argument goes on a slippery slope from the premise of discretion in data generation and modeling to the conclusion: statistical inference just *is a matter of subjective beliefs (or their updates)*. How does that work? One variant, which I do not try to pin on anyone in particular, involves a subtle slide from “our models are merely objects of belief”, to “statistical inference is a matter of degrees of belief”. From there it’s a short step to “statistical inference is a matter of subjective probability” (whether my assignments or those of an imaginary omniscient agent).

It is one thing to describe our models as objects of belief and quite another to maintain that our task is to model beliefs.

This is one of those philosophical puzzles of language that might set some people’s eyes rolling. If I believe in the deflection effect (of gravity) then that effect is the object of my belief, but only in the sense that my belief is about said effect. Yet if I’m inquiring into the deflection effect, I’m not inquiring into beliefs about the effect. The philosopher of science Clark Glymour (2010, p. 335) calls this a shift from phenomena (content) to *epiphenomena* (degrees of belief).

Karl Popper argues that *the* central confusion all along was sliding from *the degree of the rationality (or warrantedness) of a belief* to *the degree of rational belief* (1959, p. 424). The former is assessed via degrees of corroboration and well-testedness, rooted in the error probing capacities of procedures. (These are supplied by error probabilities of methods, formal or informal.)

**VI. Blurring What’s Being Measured vs My Ability to Test It. **You will sometimes hear a Bayesian claim that anyone who says their probability assignments to hypotheses are subjective must also call the use of any model subjective, because it too is based on my choice of specifications. *This is a confusion of two notions of subjective.*

- The first concerns what’s being measured, and for the Bayesian, with some exceptions, probability is supposed to represent a subject’s strength of belief (be it actual or rational), betting odds, or the like.
- The second sense of subjective concerns whether the measurement is checkable or testable.

This goes back to my point about what’s required for a feature to be *relevant* to a method’s objectivity in III.

(Passages, modified, are from Mayo, *Statistical Inference as Severe Testing* (forthcoming).)

[1] But he never would allow subjective probabilities to enter into statistical inference. Objective, i.e., frequentist, priors in a hypothesis H could enter, but he was very clear that this required H’s truth being the result of some kind of stochastic mechanism. He found that idea plausible in some cases; the problem was not knowing the stochastic mechanism sufficiently well to assign the priors. Such frequentist (or “empirical”) priors in hypotheses are not given by drawing H randomly from an urn of hypotheses, k% of which are assumed to be true. Yet an “objective” Bayesian like Jim Berger will call these frequentist, resulting in enormous confusion in today’s guidebooks on the probability of type 1 errors.

Cox D. R. and Mayo. D. G. (2010). “Objectivity and Conditionality in Frequentist Inference” in *Error and Inference: Recent Exchanges on Experimental Reasoning, Reliability and the Objectivity and Rationality of Science* (D Mayo and A. Spanos eds.), Cambridge: Cambridge University Press: 276-304.

Gigerenzer, G. and Marewski, J. 2015. ‘Surrogate Science: The Idol of a Universal Method for Scientific Inference,’ *Journal of Management* 41(2): 421-40.

Glymour, C. 2010. ‘Explanation and Truth’, in *Error and Inference: Recent Exchanges on Experimental Reasoning, Reliability and the Objectivity and Rationality of Science* (D. Mayo and A. Spanos eds.), CUP: 331–350.

Mayo, D. (1983). “An Objective Theory of Statistical Testing.” *Synthese* **57**(2): 297-340.

Popper, K. 1959. *The Logic of Scientific Discovery*. New York: Basic Books.

Filed under: Background knowledge Tagged: objectivity

Today is C.S. Peirce’s birthday. He’s one of my all-time heroes. You should read him: he’s a treasure chest on essentially any topic, and he anticipated several major ideas in statistics (e.g., randomization, confidence intervals) as well as in logic. I’ll reblog the first portion of a (2005) paper of mine. Links to Parts 2 and 3 are at the end. It’s written for a very general philosophical audience; the statistical parts are pretty informal. *Happy birthday Peirce*.

**Peircean Induction and the Error-Correcting Thesis**

Deborah G. Mayo

*Transactions of the Charles S. Peirce Society: A Quarterly Journal in American Philosophy*, Volume 41, Number 2, 2005, pp. 299-319

Peirce’s philosophy of inductive inference in science is based on the idea that what permits us to make progress in science, what allows our knowledge to grow, is the fact that science uses methods that are self-correcting or error-correcting:

Induction is the experimental testing of a theory. The justification of it is that, although the conclusion at any stage of the investigation may be more or less erroneous, yet the further application of the same method must correct the error. (5.145)

Inductive methods—understood as methods of experimental testing—are justified to the extent that they are error-correcting methods. We may call this Peirce’s error-correcting or self-correcting thesis (SCT):

**Self-Correcting Thesis SCT:** methods for inductive inference in science are error correcting; the justification for inductive methods of experimental testing in science is that they are self-correcting.

Peirce’s SCT has been a source of fascination and frustration. By and large, critics and followers alike have denied that Peirce can sustain his SCT as a way to justify scientific induction: “No part of Peirce’s philosophy of science has been more severely criticized, even by his most sympathetic commentators, than this attempted validation of inductive methodology on the basis of its purported self-correctiveness” (Rescher 1978, p. 20).

In this paper I shall revisit the Peircean SCT: properly interpreted, I will argue, Peirce’s SCT not only serves its intended purpose, it also provides the basis for justifying (frequentist) statistical methods in science. While on the one hand, contemporary statistical methods increase the mathematical rigor and generality of Peirce’s SCT, on the other, Peirce provides something current statistical methodology lacks: an account of inductive inference and a philosophy of experiment that links the justification for statistical tests to a more general rationale for scientific induction. Combining the mathematical contributions of modern statistics with the inductive philosophy of Peirce sets the stage for developing an adequate justification for contemporary inductive statistical methodology.

**2. Probabilities are assigned to procedures not hypotheses**

Peirce’s philosophy of experimental testing shares a number of key features with the contemporary (Neyman and Pearson) Statistical Theory: statistical methods provide, not means for assigning degrees of probability, evidential support, or confirmation to hypotheses, but procedures for testing (and estimation), whose rationale is their predesignated high frequencies of leading to correct results in some hypothetical long run. A Neyman and Pearson (NP) statistical test, for example, instructs us “To decide whether a hypothesis, *H*, of a given type be rejected or not, calculate a specified character, *x_{0}*, of the observed facts; if *x* > *x_{0}* reject *H*; if *x* < *x_{0}* accept *H*” (Neyman and Pearson 1933).

The relative frequencies of erroneous rejections and erroneous acceptances in an actual or hypothetical long run sequence of applications of tests are error probabilities; we may call the statistical tools based on error probabilities, error statistical tools. In describing his theory of inference, Peirce could be describing that of the error-statistician:

The theory here proposed does not assign any probability to the inductive or hypothetic conclusion, in the sense of undertaking to say how frequently that conclusion would be found true. It does not propose to look through all the possible universes, and say in what proportion of them a certain uniformity occurs; such a proceeding, were it possible, would be quite idle. The theory here presented only says how frequently, in this universe, the special form of induction or hypothesis would lead us right. The probability given by this theory is in every way different—in meaning, numerical value, and form—from that of those who would apply to ampliative inference the doctrine of inverse chances. (2.748)

The doctrine of “inverse chances” alludes to assigning (posterior) probabilities in hypotheses by applying the definition of conditional probability (Bayes’s theorem)—a computation requires starting out with a (prior or “antecedent”) probability assignment to an exhaustive set of hypotheses:

If these antecedent probabilities were solid statistical facts, like those upon which the insurance business rests, the ordinary precepts and practice [of inverse probability] would be sound. But they are not and cannot be statistical facts. What is the antecedent probability that matter should be composed of atoms? Can we take statistics of a multitude of different universes? (2.777)

For Peircean induction, as in the N-P testing model, the conclusion or inference concerns a hypothesis that either is or is not true in this one universe; thus, assigning a frequentist probability to a particular conclusion, other than the trivial ones of 1 or 0, for Peirce, makes sense only “if universes were as plentiful as blackberries” (2.684). Thus the Bayesian inverse probability calculation seems forced to rely on subjective probabilities for computing inverse inferences, but “subjective probabilities” Peirce charges “express nothing but the conformity of a new suggestion to our prepossessions, and these are the source of most of the errors into which man falls, and of all the worse of them” (2.777).

Hearing Peirce contrast his view of induction with the more popular Bayesian account of his day (the Conceptualists), one could be listening to an error statistician arguing against the contemporary Bayesian (subjective or other)—with one important difference. Today’s error statistician seems to grant too readily that the only justification for N-P test rules is their ability to ensure we will rarely take erroneous actions with respect to hypotheses in the long run of applications. This so-called inductive behavior rationale seems to supply no adequate answer to the question of what is learned in any particular application about the process underlying the data. Peirce, by contrast, was very clear that what is really wanted in inductive inference in science is the ability to control error probabilities of test procedures, i.e., “the trustworthiness of the proceeding”. Moreover, it is only by a faulty analogy with deductive inference, Peirce explains, that many suppose that inductive (synthetic) inference should supply a probability to the conclusion: “… in the case of analytic inference we know the probability of our conclusion (if the premises are true), but in the case of synthetic inferences we only know the degree of trustworthiness of our proceeding” (“The Probability of Induction”, 2.693).

Knowing the “trustworthiness of our inductive proceeding”, I will argue, enables determining the test’s probative capacity, how reliably it detects errors, and the severity of the test a hypothesis withstands. Deliberately making use of known flaws and fallacies in reasoning with limited and uncertain data, tests may be constructed that are highly trustworthy probes in detecting and discriminating errors in particular cases. This, in turn, enables inferring which inferences about the process giving rise to the data are and are not warranted: an inductive inference to hypothesis *H* is warranted to the extent that with high probability the test would have detected a specific flaw or departure from what *H* asserts, and yet it did not.
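The “trustworthiness of the proceeding” can be illustrated with a small simulation of a test’s long-run error probabilities. This is a hedged sketch: the one-sided test of a normal mean, the sample size, and the discrepancy used here are hypothetical choices for illustration, not anything from the paper.

```python
import math
import random

# Hypothetical setup: H0: mu = 0 vs H1: mu > 0, known sigma = 1,
# sample size n = 25, one-sided significance level alpha = 0.05.
random.seed(1)

N_TRIALS = 10_000
n, sigma = 25, 1.0
z_crit = 1.645  # one-sided 0.05 critical value of the standard normal
cutoff = z_crit * sigma / math.sqrt(n)  # reject H0 when the sample mean exceeds this

def rejection_rate(true_mu: float) -> float:
    """Relative frequency of rejecting H0 over many repeated samples."""
    rejections = 0
    for _ in range(N_TRIALS):
        xbar = sum(random.gauss(true_mu, sigma) for _ in range(n)) / n
        if xbar > cutoff:
            rejections += 1
    return rejections / N_TRIALS

type_1 = rejection_rate(0.0)   # erroneous rejections when H0 is true: ~alpha
power = rejection_rate(0.5)    # correct rejections when mu = 0.5
print(f"type I error rate ~ {type_1:.3f}, power at mu=0.5 ~ {power:.3f}")
```

The simulated type I error rate hovers near the predesignated 0.05: the error probabilities describe how frequently the *procedure* errs in repeated applications, not a degree of belief in any particular conclusion.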

**3. So why is justifying Peirce’s SCT thought to be so problematic?**

You can read Section 3 here. (it’s not necessary for understanding the rest).

**4. Peircean induction as severe testing**

… [I]nduction, for Peirce, is a matter of subjecting hypotheses to “the test of experiment” (7.182).

The process of testing it will consist, not in examining the facts, in order to see how well they accord with the hypothesis, but on the contrary in examining such of the probable consequences of the hypothesis … which would be very unlikely or surprising in case the hypothesis were not true. (7.231)

When, however, we find that prediction after prediction, notwithstanding a preference for putting the most unlikely ones to the test, is verified by experiment,…we begin to accord to the hypothesis a standing among scientific results.

This sort of inference it is, from experiments testing predictions based on a hypothesis, that is alone properly entitled to be called induction. (7.206)

While these and other passages are redolent of Popper, Peirce differs from Popper in crucial ways. Peirce, unlike Popper, is primarily interested not in falsifying claims but in the positive pieces of information provided by tests, with “the corrections called for by the experiment” and with the hypotheses, modified or not, that manage to pass severe tests. For Popper, even if a hypothesis is highly *corroborated (by his lights)*, he regards this as at most a report of the hypothesis’ past performance and denies it affords positive evidence for its correctness or reliability. Further, Popper denies that he could vouch for the reliability of the method he recommends as “most rational”—conjecture and refutation. Indeed, Popper’s requirements for a highly corroborated hypothesis are not sufficient for ensuring severity in Peirce’s sense (Mayo 1996, 2003, 2005). Where Popper recoils from even speaking of warranted inductions, Peirce conceives of a proper inductive inference as what had passed a severe test—one which would, with high probability, have detected an error if present.

In Peirce’s inductive philosophy, we have evidence for inductively inferring a claim or hypothesis *H* when not only does *H* “accord with” the data **x**; but also, so good an accordance would very probably not have resulted, were *H* false.

*Hypothesis H passes a severe test with* **x** iff (firstly) **x** agrees with or “fits” *H*, and (secondly) with very high probability, the test would have produced a result that accords less well with *H* than **x** does, were *H* false.

The test would “have signaled an error” by having produced results less accordant with *H* than what the test yielded. Thus, we may inductively infer *H* when (and only when) *H* has withstood a test with high error detecting capacity, the higher this probative capacity, the more severely *H* has passed. What is assessed (quantitatively or qualitatively) is not the amount of support for *H* but the probative capacity of the test of experiment ET (with regard to those errors that an inference to *H* is declaring to be absent)……….

You can read the rest of Section 4 here.

**5. The path from qualitative to quantitative induction**

In my understanding of Peircean induction, the difference between qualitative and quantitative induction is really a matter of degree, according to whether their trustworthiness or severity is quantitatively or only qualitatively ascertainable. This reading not only neatly organizes Peirce’s typologies of the various types of induction, it underwrites the manner in which, within a given classification, Peirce further subdivides inductions by their “strength”.

*(I) First-Order, Rudimentary or Crude Induction*

Consider Peirce’s First Order of induction: the lowest, most rudimentary form, which he dubs the “pooh-pooh argument”. It is essentially an argument from ignorance: Lacking evidence for the falsity of some hypothesis or claim *H*, provisionally adopt *H*. In this very weakest sort of induction, crude induction, the most that can be said is that a hypothesis would eventually be falsified if false. (It may correct itself—but with a bang!) It “is as weak an inference as any that I would not positively condemn” (8.237). While uneliminable in ordinary life, Peirce denies that rudimentary induction is to be included as scientific induction. Without some reason to think evidence of *H*‘s falsity would probably have been detected, were *H* false, finding no evidence against *H* is poor inductive evidence *for* *H*. *H* has passed only a highly unreliable error probe.

*(II) Second Order (Qualitative) Induction*

It is only with what Peirce calls “the Second Order” of induction that we arrive at a genuine test, and thereby scientific induction. Within second order inductions, a stronger and a weaker type exist, corresponding neatly to viewing strength as the severity of a testing procedure.

The weaker of these is where the predictions that are fulfilled are merely of the continuance in future experience of the same phenomena which originally suggested and recommended the hypothesis… (7.116)

The other variety of the argument … is where [results] lead to new predictions being based upon the hypothesis of an entirely different kind from those originally contemplated and these new predictions are equally found to be verified. (7.117)

The weaker type occurs where the predictions, though fulfilled, lack novelty; whereas, the stronger type reflects a more stringent hurdle having been satisfied: the hypothesis has had “novel” predictive success, and thereby higher severity. (For a discussion of the relationship between types of novelty and severity see Mayo 1991, 1996). Note that within a second order induction the assessment of strength is qualitative, e.g., very strong, weak, very weak.

The strength of any argument of the Second Order depends upon how much the confirmation of the prediction runs counter to what our expectation would have been without the hypothesis. It is entirely a question of how much; and yet there is no measurable quantity. For when such measure is possible the argument … becomes an induction of the Third Order [statistical induction]. (7.115)

It is upon these and like passages that I base my reading of Peirce. A qualitative induction, i.e., a test whose severity is qualitatively determined, becomes a quantitative induction when the severity is quantitatively determined; when an objective error probability can be given.

*(III) Third Order, Statistical (Quantitative) Induction*

We enter the Third Order of statistical or quantitative induction when it is possible to quantify “how much” the prediction runs counter to what our expectation would have been without the hypothesis. In his discussions of such quantifications, Peirce anticipates to a striking degree later developments of statistical testing and confidence interval estimation (Hacking 1980, Mayo 1993, 1996). Since this is not the place to describe his statistical contributions, I move to more modern methods to make the qualitative-quantitative contrast.

**6. Quantitative and qualitative induction: significance test reasoning**

*Quantitative Severity*

A statistical significance test illustrates an inductive inference justified by a quantitative severity assessment. The significance test procedure has the following components: (1) a *null hypothesis* *H_{0}*, which is an assertion about the distribution of the sample **X** = (*X_{1}*, …, *X_{n}*); (2) a test statistic *d*(**x**), which measures the difference between the data **x** and what is expected under *H_{0}*; and (3) the *p*-value, the probability of a difference as large as *d*(**x**), computed under the assumption that *H_{0}* is true. For example:

*H_{0}*: there are no increased cancer risks associated with hormone replacement therapy (HRT) in women who have taken them for 10 years.

Let *d*(**x**) measure the increased risk of cancer in women treated with HRT as compared with those who were not.

*p*-value = Prob(*d*(**X**) ≥ *d*(**x**); *H_{0}*)

If this probability is very small, the data are taken as evidence that

*H**: cancer risks are higher in women treated with HRT

The reasoning is a statistical version of *modus tollens*.

If the hypothesis *H_{0}* is correct then, with high probability, 1 − *p*, the data would not be statistically significant at level *p*.

**x** is statistically significant at level *p*.

Therefore, **x** is evidence of a discrepancy from *H_{0}*, in the direction of the alternative *H**.

*(i.e., H* severely passes, where the severity is 1 minus the p-value) [iii]*

For example, the results of recent, large, randomized treatment-control studies showing statistically significant increased risks (at the 0.001 level) give strong evidence that HRT, taken for over 5 years, increases the chance of breast cancer, the severity being 0.999. If a particular conclusion is wrong, subsequent severe (or highly powerful) tests will with high probability detect it. In particular, if we are wrong to reject *H_{0}* (and *H_{0}* is in fact true), we would, with high probability, fail to find the significant effect recurring in subsequent probes, and thereby be alerted to our error.

It is true that the observed conformity of the facts to the requirements of the hypothesis may have been fortuitous. But if so, we have only to persist in this same method of research and we shall gradually be brought around to the truth. (7.115)

The correction is not a matter of getting higher and higher probabilities, it is a matter of finding out whether the agreement is fortuitous; whether it is generated about as often as would be expected were the agreement of the chance variety.
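The quantitative severity reasoning of this section can be sketched numerically. A hedged sketch, with made-up numbers (the observed risk difference and its standard error below are hypothetical illustrations, not the HRT study figures), using a one-sided normal approximation; as in the text, the severity with which *H** passes is 1 minus the *p*-value.

```python
import math

def normal_cdf(z: float) -> float:
    """Standard normal CDF, computed via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def p_value(d_obs: float, se: float) -> float:
    """Prob(d(X) >= d(x); H0): chance of so large a difference, were H0 true."""
    return 1.0 - normal_cdf(d_obs / se)

# Hypothetical numbers for illustration only:
# observed risk difference d(x) = 0.031 with standard error 0.01.
d_obs, se = 0.031, 0.01
p = p_value(d_obs, se)
severity = 1.0 - p  # severity with which H* passes, per the text
print(f"p-value ~ {p:.4f}, severity ~ {severity:.4f}")
```

With these numbers the observed difference is about 3.1 standard errors from the null, so the *p*-value is on the order of 0.001 and the severity near 0.999: so good an accordance with *H** would very rarely arise were *H_{0}* true.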

[Part 2 and Part 3 are here; you can find the full paper here.]

**REFERENCES:**

Hacking, I. 1980 “The Theory of Probable Inference: Neyman, Peirce and Braithwaite”, pp. 141-160 in D. H. Mellor (ed.), *Science, Belief and Behavior: Essays in Honour of R.B. Braithwaite*. Cambridge: Cambridge University Press.

Laudan, L. 1981 *Science and Hypothesis: Historical Essays on Scientific Methodology*. Dordrecht: D. Reidel.

Levi, I. 1980 “Induction as Self Correcting According to Peirce”, pp. 127-140 in D. H. Mellor (ed.), *Science, Belief and Behavior: Essays in Honor of R.B. Braithwaite*. Cambridge: Cambridge University Press.

Mayo, D. 1991 “Novel Evidence and Severe Tests”, *Philosophy of Science*, 58: 523-552.

———- 1993 “The Test of Experiment: C. S. Peirce and E. S. Pearson”, pp. 161-174 in E. C. Moore (ed.), *Charles S. Peirce and the Philosophy of Science*. Tuscaloosa: University of Alabama Press.

——— 1996 *Error and the Growth of Experimental Knowledge*, The University of Chicago Press, Chicago.

———–2003 “Severe Testing as a Guide for Inductive Learning”, in H. Kyburg (ed.), *Probability Is the Very Guide in Life*. Chicago: Open Court Press, pp. 89-117.

———- 2005 “Evidence as Passing Severe Tests: Highly Probed vs. Highly Proved” in P. Achinstein (ed.), *Scientific Evidence*, Johns Hopkins University Press.

Mayo, D. and Kruse, M. 2001 “Principles of Inference and Their Consequences,” pp. 381-403 in *Foundations of Bayesianism*, D. Cornfield and J. Williamson (eds.), Dordrecht: Kluwer Academic Publishers.

Mayo, D. and Spanos, A. 2004 “Methodology in Practice: Statistical Misspecification Testing” *Philosophy of Science*, Vol. II, PSA 2002, pp. 1007-1025.

———- (2006). “Severe Testing as a Basic Concept in a Neyman-Pearson Theory of Induction”, *The British Journal of Philosophy of Science* 57: 323-357.

Mayo, D. and Cox, D.R. 2006 “The Theory of Statistics as the ‘Frequentist’s’ Theory of Inductive Inference”, *Institute of Mathematical Statistics (IMS) Lecture Notes-Monograph Series, Contributions to the Second Lehmann Symposium*, *2005*.

Neyman, J. and Pearson, E.S. 1933 “On the Problem of the Most Efficient Tests of Statistical Hypotheses”, in *Philosophical Transactions of the Royal Society*, A: 231, 289-337, as reprinted in J. Neyman and E.S. Pearson (1967), pp. 140-185.

———- 1967 *Joint Statistical Papers*, Berkeley: University of California Press.

Niiniluoto, I. 1984 *Is Science Progressive*? Dordrecht: D. Reidel.

Peirce, C. S. *Collected Papers: Vols. I-VI*, C. Hartshorne and P. Weiss (eds.) (1931-1935). Vols. VII-VIII, A. Burks (ed.) (1958), Cambridge: Harvard University Press.

Popper, K. 1962 *Conjectures and Refutations: the Growth of Scientific Knowledge*, Basic Books, New York.

Rescher, N. 1978 *Peirce’s Philosophy of Science: Critical Studies in His Theory of Induction and Scientific Method*, Notre Dame: University of Notre Dame Press.

[i] Others who relate Peircean induction and Neyman-Pearson tests are Isaac Levi (1980) and Ian Hacking (1980). See also Mayo 1993 and 1996.

[ii] This statement of (b) is regarded by Laudan as the strong thesis of self-correcting. A weaker thesis would replace (b) with (b’): science has techniques for determining unambiguously whether an alternative *T’* is closer to the truth than a refuted *T*.

[iii] If the *p*-value were not very small, then the difference would be considered statistically insignificant (generally, small values are 0.1 or less). We would then regard *H_{0}* as consistent with data **x**, and reason as follows:

If there were a discrepancy from hypothesis *H_{0}* of magnitude *γ* (or greater), then, with high probability, 1 − *p*, the data would be statistically significant at level *p*.

**x** is not statistically significant at level *p*.

Therefore, **x** is evidence that any discrepancy from *H_{0}* is less than *γ*.

For a general treatment of effect size, see Mayo and Spanos (2006).
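The footnote’s reasoning can be made concrete with a small numerical sketch (hypothetical numbers and a one-sided normal approximation, not from the paper): an insignificant result is good evidence that the discrepancy is less than *γ* only when the probability of reaching significance, were the discrepancy as large as *γ*, is high.

```python
import math

def normal_cdf(z: float) -> float:
    """Standard normal CDF, computed via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def prob_significant(gamma: float, se: float, z_crit: float = 1.645) -> float:
    """Probability the data reach one-sided significance at ~0.05,
    if the true discrepancy from H0 is gamma (normal approximation)."""
    return 1.0 - normal_cdf(z_crit - gamma / se)

# Hypothetical standard error for illustration:
se = 0.1
high = prob_significant(0.5, se)   # gamma = 0.5: detection nearly certain
low = prob_significant(0.05, se)   # gamma = 0.05: detection unlikely
print(f"P(significant; gamma=0.5) ~ {high:.3f}")
print(f"P(significant; gamma=0.05) ~ {low:.3f}")
```

An insignificant result thus warrants “the discrepancy is less than 0.5” (the test would almost surely have signaled so large a discrepancy) but not “the discrepancy is less than 0.05” (the test would very probably have missed one that small).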

[Ed. Note: A not bad biographical sketch can be found on wikipedia.]

Filed under: Bayesian/frequentist, C.S. Peirce, Error Statistics, Statistics