Bayesian/frequentist

TragiComedy hour: P-values vs posterior probabilities vs diagnostic error rates

Did you hear the one about the frequentist significance tester when he was shown the nonfrequentist nature of p-values?

Critic: I just simulated a long series of tests on a pool of null hypotheses, and I found that among tests with p-values of .05, at least 22%—and typically over 50%—of the null hypotheses are true!

Frequentist Significance Tester: Scratches head: But rejecting the null with a p-value of .05 ensures erroneous rejection no more than 5% of the time!

Raucous laughter ensues!

(Hah, hah… “So funny, I forgot to laugh!” Or, I’m crying and laughing at the same time!)

The frequentist tester should retort:

Frequentist Tester: But you assume 50% of the null hypotheses are true, compute P(H0|x) using P(H0) = .5, imagine the null is rejected based on a single small p-value, and then blame the p-value for disagreeing with the result of your computation!

At times you even use α and power as likelihoods in your analysis! These tests violate both Fisherian and Neyman-Pearson tests.

It is well-known that for a fixed p-value, with a sufficiently large n, even a statistically significant result can correspond to a large posterior probability for H0. This Jeffreys-Lindley “disagreement” is considered problematic for Bayes ratios (e.g., Bernardo). It is not problematic for error statisticians. We always indicate the extent of discrepancy that is and is not indicated, and avoid making mountains out of molehills (see Spanos 2013). J. Berger and Sellke (1987) attempt to generalize the result to show the “exaggeration” even without large n. From their Bayesian perspective, it appears that p-values come up short; error statistical testers (and even some tribes of Bayesians) balk at the fact that using the recommended priors allows highly significant results to be interpreted as no evidence against the null, or even evidence for it!

The conflict between p-values and Bayesian posteriors typically considers the two sided test of the Normal mean, H0: μ = μ0 versus H1: μ ≠ μ0.

“If n = 50 one can classically ‘reject H0 at significance level p = .05,’ although Pr (H0|x) = .52 (which would actually indicate that the evidence favors H0).” (Berger and Sellke, 1987, p. 113).

If n = 1000, a result statistically significant at the .05 level leads to a posterior to the null of .82!
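For readers who want to see where numbers like .52 and .82 come from, here is a minimal sketch. It assumes the setup usually attached to this example: known σ, a spike of prior mass .5 on μ0, and the remaining .5 spread over the alternative as a N(μ0, σ²) prior. The function name and its arguments are mine, chosen for illustration, and other choices of prior over the alternative would give other posteriors.

```python
from math import exp, sqrt


def posterior_null(z, n, prior_null=0.5):
    """P(H0 | x) for H0: mu = mu0 vs H1: mu != mu0 with known sigma.

    Assumptions (illustrative): prior mass `prior_null` sits on mu0 and
    the rest is spread over the alternative as N(mu0, sigma^2).
    z is the observed standardized statistic sqrt(n)*|xbar - mu0|/sigma.
    """
    # Bayes factor in favor of H0: the N(mu0, sigma^2/n) density of xbar
    # divided by its marginal density under H1, N(mu0, sigma^2*(1 + 1/n)).
    bayes_factor_01 = sqrt(1 + n) * exp(-0.5 * z**2 * n / (n + 1))
    prior_odds = prior_null / (1 - prior_null)
    posterior_odds = prior_odds * bayes_factor_01
    return posterior_odds / (1 + posterior_odds)


if __name__ == "__main__":
    # z = 1.96 is the two-sided .05 cutoff; the posteriors round to .52 and .82.
    for n in (50, 1000):
        print(n, round(posterior_null(1.96, n), 2))
```

Notice that the p-value is held fixed at .05 throughout; all of the movement in the posterior comes from n and from the spiked prior.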

Table 1 (modified) from J.O. Berger and T. Sellke (1987) “Testing a Point Null Hypothesis,” JASA 82(397): 113.

Some find the example shows the p-value “overstates evidence against a null” because it claims to use an “impartial” or “uninformative” Bayesian prior probability assignment of .5 to H0, the remaining .5 being spread out over the alternative parameter space. (“Spike and slab” I’ve heard Gelman call this, derisively.) Others demonstrate that the problem is not p-values but the high prior. 

Casella and R. Berger (1987) show that “concentrating mass on the point null hypothesis is biasing the prior in favor of H0 as much as possible” (p. 111) whether in 1 or 2-sided tests. Note, too, the conflict with confidence interval reasoning since the null value (here it is 0) lies outside the corresponding confidence interval (Mayo 2005). Many complain the “spiked concentration of belief in the null” is at odds with the view that “we know all nulls are false” (even though that view is also false). See Senn’s interesting points on this same issue in his letter (to Goodman).

But often, as in the opening joke, the prior assignment is claimed to be keeping to the frequentist camp and to frequentist error probabilities. How’s that supposed to work? It is imagined that we sample randomly from a population of hypotheses, k% of which are assumed to be true. 50% is a common number. We randomly draw a hypothesis and get this particular one, maybe it concerns the mean deflection of light, or perhaps it is an assertion of bioequivalence of two drugs or whatever. The percentage “initially true” (in this urn of nulls) serves as the prior probability for your particular H0. I see this gambit in statistics, psychology, philosophy and elsewhere, and yet it commits a fallacious instantiation of probabilities:

50% of the null hypotheses in a given pool of nulls are true.

This particular null H0 was randomly selected from this urn (some may wish to add “nothing else is known” which would scarcely be true here).

Therefore P(H0 is true) = .5.

I discussed this 20 years ago, Mayo 1997a and b (links in the references) and ever since. However, statistical fallacies repeatedly return to fashion in slightly different guises. Nowadays, you’re most likely to see it within what may be called diagnostic screening models of tests.

It’s not that you can’t play a carnival game of reaching into an urn of nulls (and there are lots of choices for what to put in the urn), and use a Bernoulli model for the chance of drawing a true hypothesis (assuming we even knew the % of true hypotheses, which we do not), but the “event of drawing a true null” is no longer the particular hypothesis one aims to use in computing the probability of data x0 under hypothesis H0. In other words, it’s no longer the H0 needed for the likelihood portion of the frequentist computation. (Note, too, the selected null would get the benefit of being selected from an urn of nulls where few have been shown false yet: “innocence by association”. See my comment on J. Berger 2003, pp. 19-24.)

In any event, .5 is not the frequentist probability that the selected null H0 is true–in those cases where a frequentist prior exists. (I first discussed the nature of legitimate frequentist priors with Erich Lehmann; see the poem he wrote for me as a result in Mayo 1997a).

The diagnostic screening model of tests. The diagnostic screening model of tests has become increasingly commonplace, thanks to Big Data, perverse incentives, nonreplication and all the rest (Ioannidis 2005). As Taleb puts it:

“With big data, researchers have brought cherry-picking to an industrial level”.

Now the diagnostic screening model is apt for various goals–diagnostic screening (for disease) most obviously, but also for TSA bag checks, high throughput studies in genetics and other contexts where the concern is controlling the noise in the network rather than appraising the well-testedness of your research claim. Dichotomies are fine for diagnostics (disease or not, worth further study or not, dangerous bag or not). Forcing scientific inference into a binary basket is what most of us wish to move away from, yet the new screening model dichotomizes results into significant/non-significant, usually at the .05 level. One shouldn’t mix the notions of prevalence, positive predictive value, negative predictive value, etc. from screening with the concepts from statistical testing in science. Yet people do, and there are at least 2 tragicomic results: One is that error probabilities are blamed for disagreeing with measures of completely different things. One journal editor claims the fact that p-values differ from posteriors proves the “invalidity” of p-values.

The second tragicomic result is that inconsistent meanings of type 1 (and 2) error probabilities have found their way into the latest reforms, and into guidebooks for how to avoid inconsistent interpretations of statistical concepts. Whereas there’s a trade-off between type 1 error and type 2 error probabilities in Neyman-Pearson style hypothesis tests, this is no longer true when a type 1 error probability is defined as the posterior of H0 conditional on rejecting. Topsy turvy claims about power readily ensue (search this blog under power for numerous examples).
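To make the contrast concrete, here is a rough sketch using a one-sided Normal test (H0: μ ≤ 0 vs H1: μ > 0, known σ). The sample size, the point alternative μ1 = 0.5, and the prevalence figures are all numbers I made up for illustration. The first function gives the ordinary type 2 error probability at μ1; the second gives the screening-style quantity P(H0 | reject), which needs an assumed prevalence of true nulls that the Neyman-Pearson test never mentions.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard Normal, mean 0 and sd 1


def type2_error(alpha, n, mu1, sigma=1.0):
    """beta at the point alternative mu1 for the one-sided test of H0: mu <= 0."""
    cutoff = STD_NORMAL.inv_cdf(1 - alpha)  # reject when z exceeds this
    return STD_NORMAL.cdf(cutoff - mu1 * n**0.5 / sigma)


def posterior_null_given_reject(alpha, beta, prevalence_true_nulls):
    """Screening-style P(H0 | reject); the prevalence is an outside assumption."""
    false_rejections = prevalence_true_nulls * alpha
    true_rejections = (1 - prevalence_true_nulls) * (1 - beta)
    return false_rejections / (false_rejections + true_rejections)


if __name__ == "__main__":
    for alpha in (0.05, 0.01):
        beta = type2_error(alpha, n=25, mu1=0.5)
        print(f"alpha={alpha}: beta={beta:.3f}")  # beta rises as alpha falls
        for prevalence in (0.5, 0.9):
            rate = posterior_null_given_reject(alpha, beta, prevalence)
            print(f"  prevalence={prevalence}: P(H0 | reject)={rate:.3f}")
```

With n fixed, lowering α raises β, which is the familiar trade-off. The second number moves with the assumed prevalence while α stays put, so calling it a “type 1 error probability” changes the subject.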

Conventional Bayesian variant. J. Berger doesn’t really imagine selecting from an urn of nulls (he claims). Instead, spiked priors come from one of the systems of default or conventional priors. Curiously, he claims that by adopting his recommended conventional priors, frequentists can become more frequentist (than using flawed error probabilities). We get what he calls conditional p-values (or conditional error probabilities). Magician that he is, the result is that frequentist error probabilities are no longer error probabilities, or even frequentist!

How it happens is not entirely clear, but it’s based on his defining a “Frequentist Principle” that demands that a type 1 (or 2) error probability yield the same number as his conventional posterior probability. (See Berger 2003, and my comment in Mayo 2003).

Senn, in a guest post remarks:

The error that is made is not only to assume that P-values should be what they are not but that when one tries to interpret them in the way that one should not, the previous calibration survives.

It is as if in giving recommendations in dosing children one abandoned a formula based on age and adopted one based on weight but insisted on using the same number of kg one had used for years.

Urn of Nulls. Others appear to be serious about the urn of nulls metaphor (e.g., Colquhoun 2014). Say 50% of the nulls in the urn are imagined to be true. Then, when you select your null, its initial probability of truth is .5. This, however, is to commit the fallacy of probabilistic instantiation.

Two moves are made: (1) it’s admitted it’s an erroneous probabilistic instantiation, but the goal is said to be assessing “science wise error rates” as in a diagnostic screening context. A second move (2) is to claim that a high positive predictive value PPV from the diagnostic model warrants high “epistemic probability”–whatever that is– to the particular case at hand.

The upshot of both is at odds with the goal of restoring scientific integrity. Even if we were to grant these “prevalence rates” (to allude to diagnostic testing), my question is: Why would it be relevant to how good a job you did in testing your particular hypothesis, call it H*? Sciences with high “crud factors” (Meehl 1990) might well get a high PPV simply because nearly all their nulls are false. This still wouldn’t be evidence of replication ability, nor of understanding of the phenomenon. It would reward non-challenging thinking, and taking the easiest way out.
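The arithmetic behind that worry is easy to sketch. The function below is the standard positive predictive value formula from diagnostic screening; the two “fields” and all of their numbers are hypothetical, invented only to show how the PPV can be driven by the prevalence of false nulls rather than by how well anything was tested.

```python
def ppv(prevalence_false_nulls, alpha, power):
    """Positive predictive value in the screening model.

    prevalence_false_nulls: assumed share of nulls in the urn that are false
    alpha: chance of a significant result when the null is true
    power: chance of a significant result when the null is false
    """
    true_positives = prevalence_false_nulls * power
    false_positives = (1 - prevalence_false_nulls) * alpha
    return true_positives / (true_positives + false_positives)


if __name__ == "__main__":
    # Hypothetical "high crud" field: nearly all nulls false, feeble tests.
    print("high crud, low power:", ppv(0.95, alpha=0.05, power=0.20))
    # Hypothetical frontier field: few nulls false, well-powered tests.
    print("frontier, high power:", ppv(0.10, alpha=0.05, power=0.90))
```

Under these made-up inputs the feebly powered field comes out with the higher PPV, though nothing in it was probed more stringently.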

Safe Science. We hear it recommended that research focus on questions and hypotheses with high prior prevalence. Of course we’d never know the % of true nulls (many say all nulls are false, although that too is false) and we could cleverly game the description to have suitably high or low prior prevalence. Just think of how many ways you could describe those urns of nulls to get a desired PPV, especially on continuous parameters. Then there’s the heralding of safe science:

Large-scale evidence should be targeted for research questions where the pre-study probability is already considerably high, so that a significant research finding will lead to a post-test probability that would be considered quite definitive (Ioannidis, 2005, p. 0700).

The diagnostic model, in effect, says keep doing what you’re doing: publish after an isolated significant result, possibly with cherry-picking and selection effects to boot, just make sure there’s high enough prior prevalence. That preregistration often makes previous significant results vanish shows the problem isn’t the statistical method but its abuse. Ioannidis has done much to expose bad methods, but not with the diagnostic model he earlier popularized.

In every case of a major advance or frontier science that I can think of, there had been little success in adequately explaining some effect–low prior prevalence. It took Prusiner 10 years of failed experiments to finally transmit the prion for mad cow to chimps. People didn’t believe there could be infection without nucleic acid (some still adhere to the “protein only” hypothesis.) He finally won a Nobel Prize, but he would have had a lot less torture if he’d just gone along to get along, and kept to the central dogma of biology rather than follow the results that upended it. However, it’s the researcher who has worked with a given problem, building on results and subjecting them to scrutiny, who understands the phenomenon well enough to not just replicate, but alter the entire process in new ways (e.g., prions are now being linked to Alzheimer’s).

Researchers who have churned out and published isolated significant results, and focused on “research questions where the pre-study probability is already considerably high” might meet the quota on PPV, but still won’t have the understanding to even show they “know how to conduct an experiment which will rarely fail to give us a statistically significant result”–which was Fisher’s requirement before inferring a genuine phenomenon (Fisher 1947).

Error probabilities are not posterior probabilities. Certainly, there is much more to statistical analysis than P-values but they should be left alone rather than being deformed in some way to become second class Bayesian posterior probabilities. (Senn)

References & Related articles

Berger, J. O.  (2003). “Could Fisher, Jeffreys and Neyman have Agreed on Testing?” Statistical Science 18: 1-12.

Berger, J. O. and Sellke, T.  (1987). “Testing a point null hypothesis: The irreconcilability of p values and evidence,” (with discussion). J. Amer. Statist. Assoc. 82: 112–139.

Casella, G. and Berger, R. (1987). “Reconciling Bayesian and Frequentist Evidence in the One-sided Testing Problem,” (with discussion). J. Amer. Statist. Assoc. 82: 106–111, 123–139.

Colquhoun, D. (2014) “An investigation of the false discovery rate and the misinterpretation of p-values.” Royal Society Open Science, 2014 1(3): pp. 1-16.

Fisher, R. A., (1956). Statistical Methods and Scientific Inference, Edinburgh: Oliver and Boyd.

Fisher, R.A. (1947), Design of Experiments.

Ioannidis, J. (2005). “Why Most Published Research Findings Are False”.

Jeffreys, H. (1939). Theory of Probability, Oxford: Oxford University Press.

Mayo, D. (1997a). “Duhem’s Problem, The Bayesian Way, and Error Statistics, or ‘What’s Belief got To Do With It?’” and “Response to Howson and Laudan,” Philosophy of Science 64(1): 222-244 and 323-333. NOTE: This issue only comes up in the “Response”, but it made most sense to include both here.

Mayo, D. (1997b) “Error Statistics and Learning from Error: Making a Virtue of Necessity,” in L. Darden (ed.) Supplemental Issue PSA 1996: Symposia Papers, Philosophy of Science 64: S195-S212.

Mayo, D. (2003). Comment on J. O. Berger’s “Could Fisher, Jeffreys and Neyman Have Agreed on Testing?” Statistical Science 18: 19-24.

Mayo, D. (2004). “An Error-Statistical Philosophy of Evidence,” in M. Taper and S. Lele (eds.) The Nature of Scientific Evidence: Statistical, Philosophical and Empirical Considerations. Chicago: University of Chicago Press: 79-118.

Mayo (2005). “Philosophy of Statistics” in S. Sarkar and J. Pfeifer (eds.) Philosophy of Science: An Encyclopedia, London: Routledge: 802-815. (Has typos.)

Mayo, D.G. and Cox, D. R. (2006). “Frequentist Statistics as a Theory of Inductive Inference,” Optimality: The Second Erich L. Lehmann Symposium (ed. J. Rojo), Lecture Notes-Monograph series, Institute of Mathematical Statistics (IMS), Vol. 49: 77-97.

Mayo, D. and Kruse, M. (2001). “Principles of Inference and Their Consequences,” in D. Corfield and J. Williamson (eds.) Foundations of Bayesianism. Dordrecht: Kluwer Academic Publishers: 381-403.

Mayo, D. and Spanos, A. (2011). “Error Statistics” in Philosophy of Statistics , Handbook of Philosophy of Science Volume 7 Philosophy of Statistics, (General editors: Dov M. Gabbay, Paul Thagard and John Woods; Volume eds. Prasanta S. Bandyopadhyay and Malcolm R. Forster.) Elsevier: 1-46.

Meehl, P. E. (1990). Why summaries of research on psychological theories are often uninterpretable. Psychological Reports 66 (1): 195-244.

Pratt, J. (1987). “Testing a point null hypothesis: The irreconcilability of p values and evidence: Comment.” J. Amer. Statist. Assoc. 82: 123-125.

Prusiner, S. (1991). Molecular Biology of Prion Diseases. Science, 252(5012), 1515-1522.

Prusiner, S. B. (2014) Madness and Memory: The Discovery of Prions—a New Biological Principle of Disease, New Haven, Connecticut: Yale University Press.

Spanos, A. (2013). “Who Should Be Afraid of the Jeffreys-Lindley Paradox”.

Taleb, N. (2013). “Beware the Big Errors of Big Data”. Wired.

 


Categories: Bayesian/frequentist, Comedy, significance tests, Statistics | 4 Comments

Er, about those “other statistical approaches”: Hold off until a balanced critique is in?


I could have told them that the degree of accordance enabling the “6 principles” on p-values was unlikely to be replicated when it came to most of the “other approaches” with which some would supplement or replace significance tests– notably Bayesian updating, Bayes factors, or likelihood ratios (confidence intervals are dual to hypotheses tests). [My commentary is here.] So now they may be advising a “hold off” or “go slow” approach until some consilience is achieved. Is that it? I don’t know. I was tweeted an article about the background chatter taking place behind the scenes; I wasn’t one of the people interviewed for this. Here are some excerpts; I may add more later after it has had time to sink in. (check back later)

“Reaching for Best Practices in Statistics: Proceed with Caution Until a Balanced Critique Is In”

J. Hossiason

“[A]ll of the other approaches*, as well as most statistical tools, may suffer from many of the same problems as the p-values do. What level of likelihood ratio in favor of the research hypothesis will be acceptable to the journal? Should scientific discoveries be based on whether posterior odds pass a specific threshold (P3)? Does either measure the size of an effect (P5)?…How can we decide about the sample size needed for a clinical trial—however analyzed—if we do not set a specific bright-line decision rule? 95% confidence intervals or credence intervals…offer no protection against selection when only those that do not cover 0, are selected into the abstract (P4).” (Benjamini, ASA commentary, pp. 3-4)

What’s sauce for the goose is sauce for the gander, right? Many statisticians seconded George Cobb who urged “the board to set aside time at least once every year to consider the potential value of similar statements” to the recent ASA p-value report. Disappointingly, a preliminary survey of leaders in statistics, many from the original p-value group, aired striking disagreements on best and worst practices with respect to these other approaches. The Executive Board is contemplating a variety of recommendations, minimally, Continue reading

Categories: Bayesian/frequentist, Statistics | 84 Comments

“P-values overstate the evidence against the null”: legit or fallacious?

The allegation that P-values overstate the evidence against the null hypothesis continues to be taken as gospel in discussions of significance tests. All such discussions, however, assume a notion of “evidence” that’s at odds with significance tests–generally likelihood ratios, or Bayesian posterior probabilities (conventional or of the “I’m selecting hypotheses from an urn of nulls” variety). I’m reblogging the bulk of an earlier post as background for a new post to appear tomorrow.  It’s not that a single small P-value provides good evidence of a discrepancy (even assuming the model, and no biasing selection effects); Fisher and others warned against over-interpreting an “isolated” small P-value long ago.  The problem is that the current formulation of the “P-values overstate the evidence” meme is attached to a sleight of hand (on meanings) that is introducing brand new misinterpretations into an already confused literature! 

 

Categories: Bayesian/frequentist, fallacy of rejection, highly probable vs highly probed, P-values | 2 Comments

“On the Brittleness of Bayesian Inference,” Owhadi, Scovel, and Sullivan (PUBLISHED)


The record number of hits on this blog goes to “When Bayesian Inference shatters,” where Houman Owhadi presents a “Plain Jane” explanation of results now published in “On the Brittleness of Bayesian Inference”. A follow-up was 1 year ago. Here’s how their paper begins:

 

 

Houman Owhadi
Professor of Applied and Computational Mathematics and Control and Dynamical Systems, Computing + Mathematical Sciences,
California Institute of Technology, USA

Clint Scovel
Senior Scientist,
Computing + Mathematical Sciences,
California Institute of Technology, USA

Tim Sullivan
Warwick Zeeman Lecturer,
Assistant Professor,
Mathematics Institute,
University of Warwick, UK

 

 

“On the Brittleness of Bayesian Inference”

ABSTRACT: With the advent of high-performance computing, Bayesian methods are becoming increasingly popular tools for the quantification of uncertainty throughout science and industry. Since these methods can impact the making of sometimes critical decisions in increasingly complicated contexts, the sensitivity of their posterior conclusions with respect to the underlying models and prior beliefs is a pressing question to which there currently exist positive and negative answers. We report new results suggesting that, although Bayesian methods are robust when the number of possible outcomes is finite or when only a finite number of marginals of the data-generating distribution are unknown, they could be generically brittle when applied to continuous systems (and their discretizations) with finite information on the data-generating distribution. If closeness is defined in terms of the total variation (TV) metric or the matching of a finite system of generalized moments, then (1) two practitioners who use arbitrarily close models and observe the same (possibly arbitrarily large amount of) data may reach opposite conclusions; and (2) any given prior and model can be slightly perturbed to achieve any desired posterior conclusion. The mechanism causing brittleness/robustness suggests that learning and robustness are antagonistic requirements, which raises the possibility of a missing stability condition when using Bayesian inference in a continuous world under finite information.

© 2015, Society for Industrial and Applied Mathematics
Permalink: http://dx.doi.org/10.1137/130938633 Continue reading

Categories: Bayesian/frequentist, Statistics | 16 Comments

Gelman on ‘Gathering of philosophers and physicists unaware of modern reconciliation of Bayes and Popper’

 I’m reblogging Gelman’s post today: “Gathering of philosophers and physicists unaware of modern reconciliation of Bayes and Popper”. I concur with Gelman’s arguments against all Bayesian “inductive support” philosophies, and welcome the Gelman and Shalizi (2013) ‘meeting of the minds’ between an error statistical philosophy and Bayesian falsification (which I regard as a kind of error statistical Bayesianism). Just how radical a challenge these developments pose to other stripes of Bayesianism has yet to be explored. My comment on them is here.


“Gathering of philosophers and physicists unaware of modern reconciliation of Bayes and Popper” by Andrew Gelman

Hiro Minato points us to a news article by physicist Natalie Wolchover entitled “A Fight for the Soul of Science.”

I have no problem with most of the article, which is a report about controversies within physics regarding the purported untestability of physics models such as string theory (as for example discussed by my Columbia colleague Peter Woit). Wolchover writes:

Whether the fault lies with theorists for getting carried away, or with nature, for burying its best secrets, the conclusion is the same: Theory has detached itself from experiment. The objects of theoretical speculation are now too far away, too small, too energetic or too far in the past to reach or rule out with our earthly instruments. . . .

Over three mild winter days, scholars grappled with the meaning of theory, confirmation and truth; how science works; and whether, in this day and age, philosophy should guide research in physics or the other way around. . . .

To social and behavioral scientists, this is all an old old story. Concepts such as personality, political ideology, and social roles are undeniably important but only indirectly related to any measurements. In social science we’ve forever been in the unavoidable position of theorizing without sharp confirmation or falsification, and, indeed, unfalsifiable theories such as Freudian psychology and rational choice theory have been central to our understanding of much of the social world.

But then somewhere along the way the discussion goes astray: Continue reading

Categories: Bayesian/frequentist, Error Statistics, Gelman, Shalizi, Statistics | 20 Comments

Return to the Comedy Hour: P-values vs posterior probabilities (1)

Comedy Hour


Did you hear the one about the frequentist significance tester when he was shown the nonfrequentist nature of p-values?

JB [Jim Berger]: I just simulated a long series of tests on a pool of null hypotheses, and I found that among tests with p-values of .05, at least 22%—and typically over 50%—of the null hypotheses are true!(1)

Frequentist Significance Tester: Scratches head: But rejecting the null with a p-value of .05 ensures erroneous rejection no more than 5% of the time!

Raucous laughter ensues!

(Hah, hah…. I feel I’m back in high school: “So funny, I forgot to laugh!”)

The frequentist tester should retort:

Frequentist Significance Tester: But you assumed 50% of the null hypotheses are true, and  computed P(H0|x) (imagining P(H0)= .5)—and then assumed my p-value should agree with the number you get, if it is not to be misleading!

Yet, our significance tester is not heard from as they move on to the next joke…. Continue reading

Categories: Bayesian/frequentist, Comedy, PBP, significance tests, Statistics | 27 Comments

S. McKinney: On Efron’s “Frequentist Accuracy of Bayesian Estimates” (Guest Post)


Steven McKinney, Ph.D.
Statistician
Molecular Oncology and Breast Cancer Program
British Columbia Cancer Research Centre

                    

On Bradley Efron’s: “Frequentist Accuracy of Bayesian Estimates”

Bradley Efron has produced another fine set of results, yielding a valuable estimate of variability for a Bayesian estimate derived from a Markov Chain Monte Carlo algorithm, in his latest paper “Frequentist accuracy of Bayesian estimates” (J. R. Statist. Soc. B (2015) 77, Part 3, pp. 617–646). I give a general overview of Efron’s brilliance via his Introduction discussion (his words “in double quotes”).

“1. Introduction

The past two decades have witnessed a greatly increased use of Bayesian techniques in statistical applications. Objective Bayes methods, based on neutral or uninformative priors of the type pioneered by Jeffreys, dominate these applications, carried forward on a wave of popularity for Markov chain Monte Carlo (MCMC) algorithms. Good references include Ghosh (2011), Berger (2006) and Kass and Wasserman (1996).”

A nice concise summary, one that should bring joy to anyone interested in Bayesian methods after all the Bayesian-bashing of the middle 20th century. Efron himself has crafted many beautiful results in the Empirical Bayes arena. He has reviewed important differences between Bayesian and frequentist outcomes that point to some as-yet unsettled issues in statistical theory and philosophy such as his scales of evidence work. Continue reading

Categories: Bayesian/frequentist, objective Bayesians, Statistics | 44 Comments

Statistical “reforms” without philosophy are blind (v update)


Is it possible, today, to have a fair-minded engagement with debates over statistical foundations? I’m not sure, but I know it is becoming of pressing importance to try. Increasingly, people are getting serious about methodological reforms—some are quite welcome, others are quite radical. Too rarely do the reformers bring out the philosophical presuppositions of the criticisms and proposed improvements. Today’s (radical?) reform movements are typically launched from criticisms of statistical significance tests and P-values, so I focus on them. Regular readers know how often the P-value (that most unpopular girl in the class) has made her appearance on this blog. Here, I tried to quickly jot down some queries. (Look for later installments and links.) What are some key questions we need to ask to tell what’s true about today’s criticisms of P-values? 

I. To get at philosophical underpinnings, the single most important question is this:

(1) Do the debaters distinguish different views of the nature of statistical inference and the roles of probability in learning from data? Continue reading

Categories: Bayesian/frequentist, Error Statistics, P-values, significance tests, Statistics, strong likelihood principle | 193 Comments

Oy Faye! What are the odds of not conflating simple conditional probability and likelihood with Bayesian success stories?


Faye Flam

ONE YEAR AGO, the NYT “Science Times” (9/29/14) published Faye Flam’s article, first blogged here.

Congratulations to Faye Flam for finally getting her article published at the Science Times at the New York Times, “The odds, continually updated” after months of reworking and editing, interviewing and reinterviewing. I’m grateful that one remark from me remained. Seriously I am. A few comments: The Monty Hall example is simple probability not statistics, and finding that fisherman who floated on his boots at best used likelihoods. I might note, too, that critiquing that ultra-silly example about ovulation and voting–a study so bad they actually had to pull it at CNN due to reader complaints[i]–scarcely required more than noticing the researchers didn’t even know the women were ovulating[ii]. Experimental design is an old area of statistics developed by frequentists; on the other hand, these ovulation researchers really believe their theory (and can point to a huge literature)….. Anyway, I should stop kvetching and thank Faye and the NYT for doing the article at all[iii]. Here are some excerpts:


silly pic that accompanied the NYT article

…….When people think of statistics, they may imagine lists of numbers — batting averages or life-insurance tables. But the current debate is about how scientists turn data into knowledge, evidence and predictions. Concern has been growing in recent years that some fields are not doing a very good job at this sort of inference. In 2012, for example, a team at the biotech company Amgen announced that they’d analyzed 53 cancer studies and found it could not replicate 47 of them.

Similar follow-up analyses have cast doubt on so many findings in fields such as neuroscience and social science that researchers talk about a “replication crisis”

Continue reading

Categories: Bayesian/frequentist, Statistics | Leave a comment

(Part 2) Peircean Induction and the Error-Correcting Thesis

C. S. Peirce
9/10/1839 – 4/19/1914

Continuation of “Peircean Induction and the Error-Correcting Thesis”

Deborah G. Mayo
Transactions of the Charles S. Peirce Society: A Quarterly Journal in American Philosophy, Volume 41, Number 2, 2005, pp. 299-319

Part 1 is here.

There are two other points of confusion in critical discussions of the SCT, that we may note here:

I. The SCT and the Requirements of Randomization and Predesignation

The concern with “the trustworthiness of the proceeding” for Peirce, like the concern with error probabilities (e.g., significance levels) for error statisticians generally, is directly tied to their view that inductive method should closely link inferences to the methods of data collection as well as to how the hypothesis came to be formulated or chosen for testing.

This account of the rationale of induction is distinguished from others in that it has as its consequences two rules of inductive inference which are very frequently violated (1.95) namely, that the sample be (approximately) random and that the property being tested not be determined by the particular sample x— i.e., predesignation.

The picture of Peircean induction that one finds in critics of the SCT disregards these crucial requirements for induction: Neither enumerative induction nor H-D testing, as ordinarily conceived, requires such rules. Statistical significance testing, however, clearly does. Continue reading

Categories: Bayesian/frequentist, C.S. Peirce, Error Statistics, Statistics | Leave a comment

Peircean Induction and the Error-Correcting Thesis (Part I)

C. S. Peirce: 10 Sept, 1839-19 April, 1914

Yesterday was C.S. Peirce’s birthday. He’s one of my all-time heroes. You should read him: he’s a treasure chest on essentially any topic. I only recently discovered a passage where Popper calls Peirce one of the greatest philosophical thinkers ever (I don’t have it handy). If Popper had taken a few more pages from Peirce, he would have seen how to solve many of the problems in his work on scientific inference, probability, and severe testing. I’ll blog the main sections of a (2005) paper of mine over the next few days. It’s written for a very general philosophical audience; the statistical parts are pretty informal. I first posted it in 2013. Happy (slightly belated) Birthday, Peirce.

Peircean Induction and the Error-Correcting Thesis
Deborah G. Mayo
Transactions of the Charles S. Peirce Society: A Quarterly Journal in American Philosophy, Volume 41, Number 2, 2005, pp. 299-319

Peirce’s philosophy of inductive inference in science is based on the idea that what permits us to make progress in science, what allows our knowledge to grow, is the fact that science uses methods that are self-correcting or error-correcting:

Induction is the experimental testing of a theory. The justification of it is that, although the conclusion at any stage of the investigation may be more or less erroneous, yet the further application of the same method must correct the error. (5.145)

Inductive methods—understood as methods of experimental testing—are justified to the extent that they are error-correcting methods. We may call this Peirce’s error-correcting or self-correcting thesis (SCT):

Self-Correcting Thesis SCT: methods for inductive inference in science are error correcting; the justification for inductive methods of experimental testing in science is that they are self-correcting. Continue reading

Categories: Bayesian/frequentist, C.S. Peirce, Error Statistics, Statistics | Leave a comment

Can You change Your Bayesian prior? (ii)

This is one of the questions high on the “To Do” list I’ve been keeping for this blog.  The question grew out of discussions of “updating and downdating” in relation to papers by Stephen Senn (2011) and Andrew Gelman (2011) in Rationality, Markets, and Morals.[i]

“As an exercise in mathematics [computing a posterior based on the client’s prior probabilities] is not superior to showing the client the data, eliciting a posterior distribution and then calculating the prior distribution; as an exercise in inference Bayesian updating does not appear to have greater claims than ‘downdating’.” (Senn, 2011, p. 59)

“If you could really express your uncertainty as a prior distribution, then you could just as well observe data and directly write your subjective posterior distribution, and there would be no need for statistical analysis at all.” (Gelman, 2011, p. 77)

But if uncertainty is not expressible as a prior, then a major lynchpin for Bayesian updating seems questionable. If you can go from the posterior to the prior, on the other hand, perhaps it can also lead you to come back and change it.
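To make Senn’s “downdating” concrete, here is a minimal sketch under assumptions that are mine and not in the quoted papers: a Beta prior on a binomial success probability, hypothetical function names (`update`, `downdate`), and made-up counts. In this conjugate setting, going from prior to posterior and from posterior back to prior are the same arithmetic run in opposite directions.

```python
from typing import Tuple

# (a, b) parameters of a Beta distribution; an illustrative assumption,
# not something specified by Senn or Gelman.
Beta = Tuple[float, float]


def update(prior: Beta, successes: int, failures: int) -> Beta:
    """Conjugate Bayesian updating: Beta prior + binomial data -> Beta posterior."""
    a, b = prior
    return (a + successes, b + failures)


def downdate(posterior: Beta, successes: int, failures: int) -> Beta:
    """'Downdating': recover the Beta prior implied by an elicited posterior."""
    a, b = posterior
    prior = (a - successes, b - failures)
    if prior[0] <= 0 or prior[1] <= 0:
        # No proper Beta prior could have produced this posterior from these data.
        raise ValueError("no proper Beta prior is consistent with this posterior")
    return prior


prior = (2.0, 2.0)   # hypothetical client prior
data = (7, 3)        # hypothetical data: 7 successes, 3 failures

posterior = update(prior, *data)
recovered = downdate(posterior, *data)
print("posterior:", posterior)
print("recovered prior:", recovered)
```

Under these assumptions the arithmetic gives no privileged direction, which is Senn’s mathematical point. The `ValueError` branch marks the case where an elicited posterior corresponds to no proper prior in the assumed family.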

Is it legitimate to change one’s prior based on the data?

I don’t mean update it, but reject the one you had and replace it with another. My question may yield different answers depending on the particular Bayesian view. I am prepared to restrict the entire question of changing priors to Bayesian “probabilisms”, meaning the inference takes the form of updating priors to yield posteriors, or to report a comparative Bayes factor. Interpretations can vary. In many Bayesian accounts the prior probability distribution is a way of introducing prior beliefs into the analysis (as with subjective Bayesians) or, conversely, to avoid introducing prior beliefs (as with reference or conventional priors). Empirical Bayesians employ frequentist priors based on similar studies or well established theory. There are many other variants.

S. SENN: According to Senn, one test of whether an approach is Bayesian is that while Continue reading

Categories: Bayesian/frequentist, Gelman, S. Senn, Statistics | 111 Comments

From our “Philosophy of Statistics” session: APS 2015 convention

“The Philosophy of Statistics: Bayesianism, Frequentism and the Nature of Inference,” at the 2015 Association for Psychological Science (APS) Annual Convention in NYC, May 23, 2015:

D. Mayo: “Error Statistical Control: Forfeit at your Peril”

S. Senn: “‘Repligate’: reproducibility in statistical studies. What does it mean and in what sense does it matter?”

A. Gelman: “The statistical crisis in science” (this is not his exact presentation, but he focussed on some of these slides)

 

For more details see this post.

Categories: Bayesian/frequentist, Error Statistics, P-values, reforming the reformers, reproducibility, S. Senn, Statistics | 10 Comments

Philosophy of Statistics Comes to the Big Apple! APS 2015 Annual Convention — NYC

Start Spreading the News…..

 The Philosophy of Statistics: Bayesianism, Frequentism and the Nature of Inference,
2015 APS Annual Convention
Saturday, May 23  
2:00 PM- 3:50 PM in Wilder

(Marriott Marquis 1535 B’way)

 

 

Andrew Gelman

Professor of Statistics & Political Science
Columbia University

Stephen Senn

Head of Competence Center
for Methodology and Statistics (CCMS)

Luxembourg Institute of Health

D.G. Mayo, Philosopher


Richard Morey, Session Chair & Discussant

Senior Lecturer
School of Psychology
Cardiff University
Categories: Announcement, Bayesian/frequentist, Statistics | 8 Comments

Joan Clarke, Turing, I.J. Good, and “that after-dinner comedy hour…”

I finally saw The Imitation Game about Alan Turing and code-breaking at Bletchley Park during WWII. This short clip of Joan Clarke, who was engaged to Turing, includes my late colleague I.J. Good at the end (he’s not second as the clip lists him). Good used to talk a great deal about Bletchley Park and his code-breaking feats while asleep there (see note[a]), but I never imagined Turing’s code-breaking machine (which, by the way, was called the Bombe and not Christopher as in the movie) was so clunky. The movie itself has two tiny scenes including Good. Below I reblog: “Who is Allowed to Cheat?”—one of the topics he and I debated over the years. Links to the full “Savage Forum” (1962) may be found at the end (creaky, but better than nothing.)

[a] “Some sensitive or important Enigma messages were enciphered twice, once in a special variation cipher and again in the normal cipher. …Good dreamed one night that the process had been reversed: normal cipher first, special cipher second. When he woke up he tried his theory on an unbroken message – and promptly broke it.” This, and further examples, may be found in this obituary.

[b] Pictures comparing the movie cast and the real people may be found here. Continue reading

Categories: Bayesian/frequentist, optional stopping, Statistics, strong likelihood principle | 6 Comments

What’s wrong with taking (1 – β)/α, as a likelihood ratio comparing H0 and H1?

Here’s a quick note on something that I often find in discussions on tests, even though it treats “power”, which is a capacity-of-test notion, as if it were a fit-with-data notion…..

1. Take a one-sided Normal test T+ with n iid samples:

H0: µ ≤ 0 against H1: µ > 0

σ = 10, n = 100, σ/√n = σ_x̄ = 1, α = .025.

So the test would reject H0 iff Z > c_.025 = 1.96. (1.96 is the “cut-off”.)

~~~~~~~~~~~~~~

  2. Simple rules for alternatives against which T+ has high power:
  • If we add σ_x̄ (here 1) to the cut-off (here, 1.96), we are at an alternative value for µ that test T+ has .84 power to detect.
  • If we add 3σ_x̄ to the cut-off, we are at an alternative value for µ that test T+ has ~ .999 power to detect. We can write this value as µ.999 = 4.96.

Let the observed outcome just reach the cut-off to reject the null, z0 = 1.96.

If we were to form a “likelihood ratio” of μ = 4.96 compared to μ0 = 0 using

[Power(T+, 4.96)]/α,

it would be 40.  (.999/.025).

It is absurd to say the alternative 4.96 is supported 40 times as much as the null, understanding support as likelihood or comparative likelihood. (The data 1.96 are even closer to 0 than to 4.96. The same point can be made with less extreme cases.) What is commonly done next is to assign priors of .5 to the two hypotheses, yielding

Pr(H0|z0) = 1/(1 + 40) = .024, so Pr(H1|z0) = .976.

Such an inference is highly unwarranted and would almost always be wrong. Continue reading
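
The arithmetic above can be checked with a few lines of Python. This is a minimal sketch, assuming the sample mean is Normal with standard error 1 and using the rounded cut-off 1.96 as in the post. The function names and the final comparison with the ordinary ratio of Normal densities at the observed value are my additions for illustration; they are not part of the original computation.

```python
from statistics import NormalDist

SIGMA_XBAR = 1.0   # sigma / sqrt(n) = 10 / sqrt(100)
ALPHA = 0.025
CUTOFF = 1.96      # rounded cut-off used in the post


def power(mu: float) -> float:
    """Probability that the sample mean exceeds the cut-off when the true mean is mu."""
    return 1.0 - NormalDist(mu, SIGMA_XBAR).cdf(CUTOFF)


def density_ratio(x: float, mu1: float, mu0: float = 0.0) -> float:
    """Ordinary likelihood ratio of mu1 to mu0 at the observed sample mean x."""
    return NormalDist(mu1, SIGMA_XBAR).pdf(x) / NormalDist(mu0, SIGMA_XBAR).pdf(x)


power_296 = power(CUTOFF + 1 * SIGMA_XBAR)   # cut-off plus one standard error
power_496 = power(CUTOFF + 3 * SIGMA_XBAR)   # cut-off plus three standard errors

pseudo_lr = power_496 / ALPHA                # the "(1 - beta)/alpha" ratio
posterior_h0 = 1.0 / (1.0 + pseudo_lr)       # with priors of .5 on each hypothesis

actual_lr = density_ratio(CUTOFF, 4.96)      # likelihood of mu = 4.96 vs mu = 0 at 1.96

print(f"power at 2.96: {power_296:.3f}")
print(f"power at 4.96: {power_496:.3f}")
print(f"power/alpha:   {pseudo_lr:.1f}")
print(f"Pr(H0|z0):     {posterior_h0:.3f}")
print(f"density ratio: {actual_lr:.3f}")
```

On these assumptions the power/α ratio comes out near 40, while the ordinary density ratio at the observed 1.96 is below 1, so by comparative likelihood the data favor µ = 0 over µ = 4.96. That gap is the sense in which power is a capacity-of-test notion and not a fit-with-data notion.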

Categories: Bayesian/frequentist, law of likelihood, Statistical power, statistical tests, Statistics, Stephen Senn | 87 Comments

On the Brittleness of Bayesian Inference–An Update: Owhadi and Scovel (guest post)

Houman Owhadi

Professor of Applied and Computational Mathematics and Control and Dynamical Systems,
Computing + Mathematical Sciences
California Institute of Technology, USA

 

Clint Scovel
Senior Scientist,
Computing + Mathematical Sciences
California Institute of Technology, USA

 

 “On the Brittleness of Bayesian Inference: An Update”

Dear Readers,

This is an update on the results discussed in http://arxiv.org/abs/1308.6306 (“On the Brittleness of Bayesian Inference”) and a high level presentation of the more  recent paper “Qualitative Robustness in Bayesian Inference” available at http://arxiv.org/abs/1411.3984.

In http://arxiv.org/abs/1304.6772 we looked at the robustness of Bayesian Inference in the classical framework of Bayesian Sensitivity Analysis. In that (classical) framework, the data is fixed, and one computes optimal bounds on (i.e. the sensitivity of) posterior values with respect to variations of the prior in a given class of priors. Now it is already well established that when the class of priors is finite-dimensional then one obtains robustness. What we observe is that, under general conditions, when the class of priors is finite co-dimensional, then the optimal bounds on posterior values are as large as possible, no matter the number of data points.

Our motivation for specifying a finite co-dimensional class of priors is to look at what classical Bayesian sensitivity analysis would conclude under finite information. The best way to understand this notion of “brittleness under finite information” is through the simple example already given in https://errorstatistics.com/2013/09/14/when-bayesian-inference-shatters-owhadi-scovel-and-sullivan-guest-post/ and recalled in Example 1. The mechanism causing this “brittleness” has its origin in the fact that, in classical Bayesian Sensitivity Analysis, optimal bounds on posterior values are computed after the observation of the specific value of the data, and that the probability of observing the data under some feasible prior may be arbitrarily small (see Example 2 for an illustration of this phenomenon). This data dependence of worst priors is inherent to this classical framework and the resulting brittleness under finite-information can be seen as an extreme occurrence of the dilation phenomenon (the fact that optimal bounds on prior values may become less precise after conditioning) observed in classical robust Bayesian inference [6]. Continue reading

Categories: Bayesian/frequentist, Statistics | 13 Comments

“When Bayesian Inference Shatters” Owhadi, Scovel, and Sullivan (reblog)

I’m about to post an update of this most-viewed blog post, so I reblog it here as a refresher. If interested, you might check the original discussion.

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

I am grateful to Drs. Owhadi, Scovel and Sullivan for replying to my request for “a plain Jane” explication of their interesting paper, “When Bayesian Inference Shatters”, and especially for permission to post it. 

—————————————-

Houman Owhadi
Professor of Applied and Computational Mathematics and Control and Dynamical Systems, Computing + Mathematical Sciences,
California Institute of Technology, USA

Clint Scovel
Senior Scientist,
Computing + Mathematical Sciences,
California Institute of Technology, USA

Tim Sullivan
Warwick Zeeman Lecturer,
Assistant Professor,
Mathematics Institute,
University of Warwick, UK

“When Bayesian Inference Shatters: A plain Jane explanation”

This is an attempt at a “plain Jane” presentation of the results discussed in the recent arxiv paper “When Bayesian Inference Shatters” located at http://arxiv.org/abs/1308.6306 with the following abstract:

“With the advent of high-performance computing, Bayesian methods are increasingly popular tools for the quantification of uncertainty throughout science and industry. Since these methods impact the making of sometimes critical decisions in increasingly complicated contexts, the sensitivity of their posterior conclusions with respect to the underlying models and prior beliefs is becoming a pressing question. We report new results suggesting that, although Bayesian methods are robust when the number of possible outcomes is finite or when only a finite number of marginals of the data-generating distribution are unknown, they are generically brittle when applied to continuous systems with finite information on the data-generating distribution. This brittleness persists beyond the discretization of continuous systems and suggests that Bayesian inference is generically ill-posed in the sense of Hadamard when applied to such systems: if closeness is defined in terms of the total variation metric or the matching of a finite system of moments, then (1) two practitioners who use arbitrarily close models and observe the same (possibly arbitrarily large amount of) data may reach diametrically opposite conclusions; and (2) any given prior and model can be slightly perturbed to achieve any desired posterior conclusions.”

Now, it is already known from classical Robust Bayesian Inference that Bayesian Inference has some robustness if the random outcomes live in a finite space or if the class of priors considered is finite-dimensional (i.e. what you know is infinite and what you do not know is finite). What we have shown is that if the random outcomes live in an approximation of a continuous space (for instance, when they are decimal numbers given to finite precision) and your class of priors is finite co-dimensional (i.e. what you know is finite and what you do not know may be infinite) then, if the data is observed at a fine enough resolution, the range of posterior values is the deterministic range of the quantity of interest, irrespective of the size of the data. Continue reading

Categories: 3-year memory lane, Bayesian/frequentist, Statistics | 1 Comment

“Probing with Severity: Beyond Bayesian Probabilism and Frequentist Performance” (Dec 3 Seminar slides)

Below are the slides from my Rutgers seminar for the Department of Statistics and Biostatistics yesterday, since some people have been asking me for them. The abstract is here. I don’t know how explanatory a bare outline like this can be, but I’d be glad to try and answer questions[i]. I am impressed at how interested in foundational matters I found the statisticians (both faculty and students) to be. (There were even a few philosophers in attendance.) It was especially interesting to explore, prior to the seminar, possible connections between severity assessments and confidence distributions, where the latter are along the lines of Min-ge Xie (some recent papers of his may be found here).

“Probing with Severity: Beyond Bayesian Probabilism and Frequentist Performance”

[i]They had requested a general overview of some issues in philosophical foundations of statistics. Much of this will be familiar to readers of this blog.

 

 

Categories: Bayesian/frequentist, Error Statistics, Statistics | 11 Comments

3 YEARS AGO: MONTHLY (Nov.) MEMORY LANE

3 years ago…

MONTHLY MEMORY LANE: 3 years ago: November 2011. I mark in red 3 posts that seem most apt for general background on key issues in this blog.*

  • (11/1) RMM-4: “Foundational Issues in Statistical Modeling: Statistical Model Specification and Validation*” by Aris Spanos, in Rationality, Markets, and Morals (Special Topic: “Statistical Science and Philosophy of Science: Where Do/Should They Meet?”)
  • (11/3) Who is Really Doing the Work?*
  • (11/5) Skeleton Key and Skeletal Points for (Esteemed) Ghost Guest
  • (11/9) Neyman’s Nursery 2: Power and Severity [Continuation of Oct. 22 Post]
  • (11/12) Neyman’s Nursery (NN) 3: SHPOWER vs POWER
  • (11/15) Logic Takes a Bit of a Hit!: (NN 4) Continuing: Shpower (“observed” power) vs Power
  • (11/18) Neyman’s Nursery (NN5): Final Post
  • (11/21) RMM-5: “Low Assumptions, High Dimensions” by Larry Wasserman, in Rationality, Markets, and Morals (Special Topic: “Statistical Science and Philosophy of Science: Where Do/Should They Meet?”) See also my deconstruction of Larry Wasserman.
  • (11/23) Elbar Grease: Return to the Comedy Hour at the Bayesian Retreat
  • (11/28) The UN Charter: double-counting and data snooping
  • (11/29) If you try sometime, you find you get what you need!

*I announced this new, once-a-month feature at the blog’s 3-year anniversary. I will repost and comment on one of the 3-year old posts from time to time. [I’ve yet to repost and comment on the one from Oct. 2011, but will shortly.] For newcomers, here’s your chance to catch up; for old timers, this is philosophy: rereading is essential!

Previous 3 YEAR MEMORY LANES:

 Oct. 2011

Sept. 2011 (Within “All She Wrote (so far)”)

Categories: 3-year memory lane, Bayesian/frequentist, Statistics | Leave a comment
