# Oy Faye! What are the odds of not conflating simple conditional probability and likelihood with Bayesian success stories?

Faye Flam

Congratulations to Faye Flam for finally getting her article published at the Science Times at the New York Times, “The odds, continually updated” after months of reworking and editing, interviewing and reinterviewing. I’m grateful too, that one remark from me remained. Seriously I am. A few comments: The Monty Hall example is simple probability not statistics, and finding that fisherman who floated on his boots at best used likelihoods. I might note, too, that critiquing that ultra-silly example about ovulation and voting–a study so bad they actually had to pull it at CNN due to reader complaints[i]–scarcely required more than noticing the researchers didn’t even know the women were ovulating[ii]. Experimental design is an old area of statistics developed by frequentists; on the other hand, these ovulation researchers really believe their theory, so the posterior checks out.

The article says, Bayesian methods can “crosscheck work done with the more traditional or ‘classical’ approach.” Yes, but on traditional frequentist grounds. What many would like to know is how to cross check Bayesian methods—how do I test your beliefs? Anyway, I should stop kvetching and thank Faye and the NYT for doing the article at all[iii]. Here are some excerpts:

Statistics may not sound like the most heroic of pursuits. But if not for statisticians, a Long Island fisherman might have died in the Atlantic Ocean after falling off his boat early one morning last summer.

The man owes his life to a once obscure field known as Bayesian statistics — a set of mathematical rules for using new data to continuously update beliefs or existing knowledge.

The method was invented in the 18th century by an English Presbyterian minister named Thomas Bayes — by some accounts to calculate the probability of God’s existence. In this century, Bayesian statistics has grown vastly more useful because of the kind of advanced computing power that didn’t exist even 20 years ago.

It is proving especially useful in approaching complex problems, including searches like the one the Coast Guard used in 2013 to find the missing fisherman, John Aldridge (though not, so far, in the hunt for Malaysia Airlines Flight 370).

Now Bayesian statistics are rippling through everything from physics to cancer research, ecology to psychology. Enthusiasts say they are allowing scientists to solve problems that would have been considered impossible just 20 years ago. And lately, they have been thrust into an intense debate over the reliability of research results.

When people think of statistics, they may imagine lists of numbers — batting averages or life-insurance tables. But the current debate is about how scientists turn data into knowledge, evidence and predictions. Concern has been growing in recent years that some fields are not doing a very good job at this sort of inference. In 2012, for example, a team at the biotech company Amgen announced that they’d analyzed 53 cancer studies and found it could not replicate 47 of them.

Similar follow-up analyses have cast doubt on so many findings in fields such as neuroscience and social science that researchers talk about a “replication crisis.”

Some statisticians and scientists are optimistic that Bayesian methods can improve the reliability of research by allowing scientists to crosscheck work done with the more traditional or “classical” approach, known as frequentist statistics. The two methods approach the same problems from different angles.

The essence of the frequentist technique is to apply probability to data. If you suspect your friend has a weighted coin, for example, and you observe that it came up heads nine times out of 10, a frequentist would calculate the probability of getting such a result with an unweighted coin. The answer (about 1 percent) is not a direct measure of the probability that the coin is weighted; it’s a measure of how improbable the nine-in-10 result is — a piece of information that can be useful in investigating your suspicion.
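The "about 1 percent" figure in the coin example can be checked directly. The following is a minimal sketch, not something from the article: it assumes a fair coin (heads probability 0.5) and computes both the chance of exactly nine heads in 10 tosses and the tail area of nine or more heads, which is the quantity a frequentist test would normally report. The function name `binomial_tail` is just an illustrative choice.

```python
from math import comb


def binomial_tail(n: int, k: int, p: float = 0.5) -> float:
    """Return P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# Probability of exactly nine heads in ten tosses of a fair coin: 10/1024.
p_exact = comb(10, 9) * 0.5**10

# Probability of a result at least this extreme (nine or ten heads): 11/1024.
p_tail = binomial_tail(10, 9)

print(f"P(exactly 9 heads | fair coin) = {p_exact:.4f}")
print(f"P(9 or more heads | fair coin) = {p_tail:.4f}")
```

Both numbers round to about 1 percent, so the excerpt's figure holds under either reading. Neither is the probability that the coin is weighted.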

By contrast, Bayesian calculations go straight for the probability of the hypothesis, factoring in not just the data from the coin-toss experiment but any other relevant information — including whether you’ve previously seen your friend use a weighted coin.
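To see what "going straight for the probability of the hypothesis" involves, here is a rough sketch using Bayes' rule with two competing hypotheses. Everything numerical in it is an assumption made for illustration and not taken from the article: the weighted coin is supposed to land heads 90 percent of the time, and the prior probability that your friend brought a weighted coin is tried at several hypothetical values.

```python
from math import comb


def posterior_weighted(prior_weighted: float,
                       p_heads_weighted: float = 0.9,  # assumed bias, hypothetical
                       heads: int = 9,
                       tosses: int = 10) -> float:
    """Posterior probability that the coin is weighted, given the tosses.

    Two hypotheses only: 'fair' (heads prob 0.5) and 'weighted'
    (heads prob p_heads_weighted). Both choices are illustrative.
    """
    tails = tosses - heads
    like_weighted = comb(tosses, heads) * p_heads_weighted**heads * (1 - p_heads_weighted) ** tails
    like_fair = comb(tosses, heads) * 0.5**tosses
    numerator = prior_weighted * like_weighted
    return numerator / (numerator + (1 - prior_weighted) * like_fair)


# The same nine-in-ten data under three hypothetical priors.
for prior in (0.01, 0.1, 0.5):
    print(f"prior = {prior:.2f} -> posterior = {posterior_weighted(prior):.3f}")
```

The same data give quite different posteriors depending on the prior, which is exactly the point in dispute in the discussion that follows: where the prior comes from and how it can be checked.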

Scientists who have learned Bayesian statistics often marvel that it propels them through a different kind of scientific reasoning than they’d experienced using classical methods.

“Statistics sounds like this dry, technical subject, but it draws on deep philosophical debates about the nature of reality,” said the Princeton University astrophysicist Edwin Turner, who has witnessed a widespread conversion to Bayesian thinking in his field over the last 15 years.

Countering Pure Objectivity

Frequentist statistics became the standard of the 20th century by promising just-the-facts objectivity, unsullied by beliefs or biases. In the 2003 statistics primer “Dicing With Death,” Stephen Senn traces the technique’s roots to 18th-century England, when a physician named John Arbuthnot set out to calculate the ratio of male to female births.

…..But there’s a danger in this tradition, said Andrew Gelman, a statistics professor at Columbia. Even if scientists always did the calculations correctly — and they don’t, he argues — accepting everything with a p-value of 5 percent means that one in 20 “statistically significant” results are nothing but random noise.

The proportion of wrong results published in prominent journals is probably even higher, he said, because such findings are often surprising and appealingly counterintuitive, said Dr. Gelman, an occasional contributor to Science Times.

Looking at Other Factors

Take, for instance, a study concluding that single women who were ovulating were 20 percent more likely to vote for President Obama in 2012 than those who were not. (In married women, the effect was reversed.)

Dr. Gelman re-evaluated the study using Bayesian statistics. That allowed him to look at probability not simply as a matter of results and sample sizes, but in the light of other information that could affect those results.

He factored in data showing that people rarely change their voting preference over an election cycle, let alone a menstrual cycle. When he did, the study’s statistical significance evaporated. (The paper’s lead author, Kristina M. Durante of the University of Texas, San Antonio, said she stood by the finding.)

Dr. Gelman said the results would not have been considered statistically significant had the researchers used the frequentist method properly. He suggests using Bayesian calculations not necessarily to replace classical statistics but to flag spurious results.

…..Bayesian reasoning combined with advanced computing power has also revolutionized the search for planets orbiting distant stars, said Dr. Turner, the Princeton astrophysicist.

One downside of Bayesian statistics is that it requires prior information — and often scientists need to start with a guess or estimate. Assigning numbers to subjective judgments is “like fingernails on a chalkboard,” said physicist Kyle Cranmer, who helped develop a frequentist technique to identify the latest new subatomic particle — the Higgs boson.

Others say that in confronting the so-called replication crisis, the best cure for misleading findings is not Bayesian statistics, but good frequentist ones. It was frequentist statistics that allowed people to uncover all the problems with irreproducible research in the first place, said Deborah Mayo, a philosopher of science at Virginia Tech. The technique was developed to distinguish real effects from chance, and to prevent scientists from fooling themselves.

Uri Simonsohn, a psychologist at the University of Pennsylvania, agrees. Several years ago, he published a paper that exposed common statistical shenanigans in his field — logical leaps, unjustified conclusions, and various forms of unconscious and conscious cheating.

He said he had looked into Bayesian statistics and concluded that if people misused or misunderstood one system, they would do just as badly with the other. Bayesian statistics, in short, can’t save us from bad science. …


### 47 thoughts on “Oy Faye! What are the odds of not conflating simple conditional probability and likelihood with Bayesian success stories?”

1. Seems like a good overview, although Gelman felt some things he wanted to communicate were lost in translation…

Just want to note that “finding that fisherman who floated on his boots at best used likelihoods” is false. This description of SAROPS isn’t great — it’s not detailed enough and stats jargon is misused — but anyone familiar with Bayesian decision theory in its online sequential form will recognize the paradigm.

• Well, I know these particular fishermen, and that’s what they told me (I used to live there). But even so, we’re talking the most rudimentary probability. Actually their probabilities told them he couldn’t still be alive! It was the aim of probing how such probabilistic assessments can be wrong that led to persevering. But the whole thing is silly as some success story for Bayesian inference as opposed to a mere use of Bayes’ rule.

• I think we must be talking about different rescued fishermen… I don’t see anything about John Aldridge “floating in his boots” in the article, but I do see SAROPS mentioned.

“But the whole thing is silly as some success story for Bayesian inference as opposed to a mere use of Bayes’ rule.”

Well, this is a curious standard — a kind of converse of the no-true-Scotsman fallacy: any use of Bayes’ rule that you’d license as legitimate is, by definition, not Bayesian. I could argue until I’m blue in the face that SAROPS approach is grounded in ideas clearly expressed in Myron Tribus’s book Rational Descriptions, Decisions, and Designs, which is itself firmly in the Cox-Jaynes tradition, and you’ll just say that the provenance of the ideas doesn’t matter — it’s “mere” rudimentary probability.

• “Treading water awkwardly, Aldridge reached down and pulled off his left boot. Straining, he turned it upside down, raised it up until it cleared the waves, then plunged it back into the water, trapping a boot-size bubble of air inside. He tucked the inverted boot under his left armpit. Then he did the same thing with the right boot. It worked; they were like twin pontoons, and treading water with his feet alone was now enough to keep him stable and afloat.
The boots gave Aldridge a chance to think.”

As for your other comments, I don’t know anyone who equates using a theorem on inverse probability with Bayesian statistics.

• The quote comes from a different article, I see. After examining the state of the ship and noticing various clues including the broken handle on the cooler, Sosinski posited that:

“Aldridge had gone overboard somewhere between the 40-fathom curve, about 25 miles offshore, and the Anna Mary’s first trawl, about 40 miles offshore.”

This describes the prior information available to the searchers about an event that occurred at a definite time and place. The ship’s course provides a linear function from time to location, so the uncertainty is over a nice one-dimensional finite interval (either time or location, take your pick). How do you think this information is represented in SAROPS? Supposing that it’s represented as a probability distribution (perhaps a uniform one), would you call the subsequent calculations Bayesian, or a mere application of Bayes’ rule?

(From the article, it does appear that SAROPS didn’t exactly cover itself in glory — it crashed at a critical juncture, leaving Coast Guard personnel to do its job by intuition.)
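For readers wondering what a Bayesian treatment of the question above would look like, here is a textbook-style sketch of one update step in a search problem. It is emphatically not SAROPS's actual algorithm, which the thread itself could not pin down. The assumptions are all mine: the 25-to-40-mile stretch is cut into fifteen one-mile cells, the prior is uniform over them, and a searched cell reveals the target with a hypothetical detection probability of 0.8 if he is there.

```python
def update_after_failed_search(prior, searched, p_detect):
    """Posterior over cells after searching `searched` and finding nothing.

    prior    : list of prior probabilities, one per cell (sums to 1)
    searched : set of cell indices that were searched
    p_detect : assumed chance of spotting the target if he is in a searched cell
    """
    # Likelihood of "nothing found" is (1 - p_detect) in searched cells, 1 elsewhere.
    unnormalised = [
        p * (1 - p_detect) if i in searched else p
        for i, p in enumerate(prior)
    ]
    total = sum(unnormalised)
    return [u / total for u in unnormalised]


# Fifteen one-mile cells between 25 and 40 miles offshore, uniform prior (assumed).
n_cells = 15
prior = [1 / n_cells] * n_cells

# Suppose the five cells nearest shore are searched without success.
posterior = update_after_failed_search(prior, searched={0, 1, 2, 3, 4}, p_detect=0.8)

for i, (before, after) in enumerate(zip(prior, posterior)):
    print(f"mile {25 + i}-{26 + i}: prior {before:.3f} -> posterior {after:.3f}")
```

Probability mass shifts away from the cells already searched and toward the unsearched ones. Whether one calls this "Bayesian inference" or "a mere use of Bayes' rule" is the disagreement in this thread; the arithmetic is the same either way.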

• Corey: It was THE article Faye cites describing the fisherman incident. It just so happens I know the details of this story because of my friends in Montauk. But look, anyone who thinks background information and the use of empirical information are the sole prerogatives of Bayesians, that everyone else starts with a blank slate every minute, is just making stuff up.

• Mayo: I don’t dispute any of that. What’s at issue here is your claim that “finding that fisherman who floated on his boots at best used likelihoods.” In relation to that, I can’t help but notice that you didn’t answer the two questions I posed above.

• Corey: I think this is a non serious issue, but I spoze, if Bayes rule really were applied here, it would depend on whether the priors were degrees of belief as opposed to frequentist based. But the issue isn’t to deny the existence of fully Bayesian applications; I mean see the “Big Bayes” stories on this blog. I just don’t happen to think that any of the examples given here are really relevant to the issue (about Bayes vs frequentist). All the many reports and reports on reports and guidelines and panels and grants etc. erupting from the problems of non-replication (e.g., in omics research, pre-clinical trials in cancer, etc) point to obvious stuff: don’t ignore unfavorable data, use controls, use blinding, insist on randomization, wash your hands and don’t “do stupid stuff”. The main way that statistical inference principles arise is that frequentist error statistics has grounds for criticizing the results and insisting on the experimental design procedures: to ignore them is to lose the control of error probabilities. The error probabilities you report may have no relation to the actual ones. That’s what’s important. But Bayesians generally deny the importance of these error probabilities (they violate the likelihood principle.)

• Mayo: You think that the possibility that a claim that you make in the very first sentence of your comments that cuts against your ideological opponents, is, in fact, false, is a non-serious issue? Huh.

In my digging into SAROPS, I have only found suggestive evidence that it employs probabilities in what I would call a Bayesian sense; I’ve not found anything unequivocal. SAROPS’s direct antecedent is Lawrence Stone’s book Theory of Optimal Search, which describes itself as Bayesian but takes the prior distribution as given without asking from where it is to come. In this slide deck the search problem is described as requiring that uncertainties be estimated (slide 5), and in the program itself the target’s initial position can be specified with “bivariate normal uncertainty” (slide 8) and ‘scenarios may be “weighted”’. I know from other reading that the weighting amounts to a prior mass being ascribed to each scenario, but in all of this I cannot find an example of use or other unmistakable sign that the distinction between frequentist and Bayesian notions of probability is clearly understood, much less that SAROPS employs a consciously Bayesian approach. That said, it does seem to me that the “balance of probabilities” leans toward SAROPS being fully Bayesian. (Tangentially: SAROPS is not “rudimentary” anything; you can look up the math on that!)

• rasmusab

“As for your other comments, I don’t know anyone who equates using a theorem on inverse probability with Bayesian statistics.”

I do! 🙂 And if I read Gelman right, he does too (http://andrewgelman.com/2012/07/31/what-is-a-bayesian/). Do I read you right that you don’t have anything against Bayesian model fitting (which is just application of the ol’ theorem on inverse probability) but only that you don’t approve of a Bayesian philosophy of science?

• Anyone who uses conditional probability or inverse probability uses Bayes rule. It’s a theorem holding for events or RVs. That doesn’t make you a Bayesian regarding statistical inference about hypotheses. Normal deviate had a post on this:
http://normaldeviate.wordpress.com/2012/11/17/what-is-bayesianfrequentist-inference/
Maybe that’s why Lindley said there’s no one less Bayesian than an empirical Bayesian. But I’d go further: not all empirical Bayesians can provide error probabilities of relevance for scientific inference.

• rasmusab

Ok, so you are saying that I’m no true no true Scotsm… Bayesian. And again, do I read you (and Normal deviate) right that you would not have anything against using Bayesian model fitting as such?

2. TheThirdWay

“how do I test your beliefs?”

There are more things in heaven and earth than are dreamt of in your philosophy. For you, probabilities are either frequencies or beliefs; you can’t imagine anything else.

There is a third option, which is mentioned here in hopes of avoiding further “talking past each other”. Probability distributions define a range of possibilities to consider for an unknown but true value.

For example, an error distribution defines a range of possibilities of the true errors that exist in the data taken.

This is not a frequency, even approximately. The “range” doesn’t specify frequencies of any kind. It, in essence, is a concrete representation of the uncertainty in the true errors existing in the data taken. The smaller the range the better they are known.

This is not an “opinion”. Either the true errors lie in the region considered or they don’t. It’s objective.

This is testable. It’s significantly easier to “test” whether the true errors are in this range than it is to test frequentist beliefs about the limiting frequencies of future errors in measurements which will never be taken.

• I didn’t say those were the two choices; I’ve actually listed elsewhere on the blog ~5 different construals given by non subjective Bayesians alone. And I wish the subjective Bayesians would give us a clear meaning, sometimes it’s degree of belief in a frequentist model, sometimes closer to a personal weight in a hypothesis or parameter value, other times, a mere hunch, a claim about betting behavior, still other times, an expression of bias. Frequently, it measures or is based on how frequently this or that happens (as in empirical Bayes). I find it harder to figure out how to test something whose meaning shifts around. In this case, the researchers had very strong beliefs and are prepared to defend them (they didn’t appeal to entropy). The only legitimate criticism is methodological, and this methodological critique revolves around a number of reasons that the purported error probabilities (e.g., the p-value) fail to give actual or warranted p-values. Using background knowledge to systematically critique the various assumptions of an inference is important. But formal priors needn’t and typically do not enter in this critique.
We have had A LOT of discussion of the role of background on this blog, including several exchanges with Gelman. Please search the blog if interested.

All that said, I admit that frequentist vs subjective probability fails to capture the central debate. But to your remark about long-runs, frequentist claims have short run implications that are testable now.

3. Gelman has some corrections to Faye’s version of his position. You can read it here, since it’s not very simply put:
http://andrewgelman.com/2014/09/30/didnt-say/#more-23785

4. E.Berk

A related discussion is on Leek’s “Simply Statistics” blog: “You think P-values are bad? I say show me the data.”
Or should it be, show me the prior probabilities?

• vl

Leek has really jumped the shark. This is the kind of lazy “if I can show that I have a low validation error I don’t need to think through what I’m doing” argument that used to be the exclusive domain of machine learning.

• vl: I don’t have a clue what you mean. Leek is right to question the promotional ads for Bayesian ways. As Wasserman would put it, adapting, either the methods have good error probabilities or they don’t. If they do, might as well use the frequentist error probs; if they don’t, why use them?

5. A blogpost, “Bayes in our Times,” makes some of the same points I do, though not about the boots. http://for-sci-law-now.blogspot.com/2014/09/bayes-in-our-times.html

6. anonymous

What does the caption summary mean: “In Bayesian statistics, new data is used to shape assumptions, the opposite of the frequentist (classical) approach”? Maybe someone thinks Bayesians report P(H|E) and frequentists P(E|H), but that’s wrong. Likelihoodists report P(E|H), or the likelihood ratio, but frequentists only care about the probability that the ratio takes values as high or higher than observed, assuming mere chance.

• I don’t know where that little blurb appears except on the NYT page. It’s true that it is very often quite erroneously claimed that we error statisticians report P(data|Ho) or the like, which we do not do. A frequentist test is a general rule:

Rule R: Whenever data differs from the null by more than k, then regard the observed difference as statistically significant at level alpha (small like .01)

But we MUST consider the properties of the test rule:

Prob(R regards difference as significant at level alpha; Ho) must = ~ alpha.

This is an error probability, and it attaches to the method or rule.

So if people are using a test rule that would readily declare a difference “rare” under Ho–even when the difference is quite common under Ho– then that requirement is violated.

The high probability of relevance concerns the test rule: that it probably would have warned us:

Prob(Rule would have warned me it’s mere chance, when it is mere chance) = high

for details, see articles.
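The requirement on Rule R can be illustrated with a small simulation. This is only a sketch under assumptions I am supplying: a two-sided z-test on 25 normal observations with known standard deviation 1, a nominal alpha of .05 (the comment mentions .01; .05 is used here only to keep the simulation small), and a hypothetical "hunting" variant in which the rule is applied to 20 independent samples and any single significant result is reported as if it were the only test run. All data are generated under Ho, so every rejection is an error.

```python
import random
from statistics import NormalDist, mean

ALPHA = 0.05                                  # nominal level (assumed for the sketch)
K = NormalDist().inv_cdf(1 - ALPHA / 2)       # cutoff k for the two-sided z-test


def rule_r_rejects(sample, sigma=1.0):
    """Rule R: regard the difference as significant when |z| exceeds k."""
    z = mean(sample) / (sigma / len(sample) ** 0.5)
    return abs(z) > K


def rejection_rate(n_trials, looks, n=25, seed=0):
    """Share of trials, all generated under Ho (true mean 0), in which at
    least one of `looks` independent samples is declared significant."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_trials):
        if any(rule_r_rejects([rng.gauss(0, 1) for _ in range(n)])
               for _ in range(looks)):
            hits += 1
    return hits / n_trials


honest = rejection_rate(2000, looks=1)    # rule applied once, as advertised
hunting = rejection_rate(2000, looks=20)  # best of 20 looks, reported as one

print(f"nominal alpha:               {ALPHA}")
print(f"actual error rate, 1 look:   {honest:.3f}")
print(f"actual error rate, 20 looks: {hunting:.3f}")
```

With a single look the actual error rate sits near the nominal alpha, as the requirement demands. With 20 looks the rule declares "significance" under Ho most of the time, so the reported alpha no longer has any relation to the actual error probability.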

7. Nathan Schachtman

Mayo,

Not surprised that Gelman wanted to correct what seems to have been attributed to him.

“Today, this kind of number is called a p-value, the probability that an observed phenomenon or one more extreme could have occurred by chance.”

Boy, if they were going to play out Gelman’s criticisms of frequentist statistics, you would think that they could have defined a p-value correctly!?

This was hardly a rush article for page one.

Nathan

• Nathan: I couldn’t tell if they were attributing this to Gelman. It sounds like that early absolute test that allows finding patterns everywhere. I did talk with Faye about the proper understanding of a p-value when it’s a legit error probability. What she wrote about what the frequentist allegedly does is definitely not from me.

8. Steven McKinney

From the New York Times article on the search and rescue

Really a gripping story, as so many fisher accident stories are. I am always in awe of boat fishers, and at the same time ashamed of how poorly we as a society treat fishers, though in this case thankfully there were resources to rescue one of them in need.

The clues that Sosinski put together were hugely valuable: “Sosinski had also been having second thoughts about the search area. After his initial conversation with Davis, he inspected the boat more carefully, and he found a few important clues. . . Together Sosinski and Winters came up with a new theory: Aldridge had gone overboard somewhere between the 40-fathom curve, about 25 miles offshore, and the Anna Mary’s first trawl, about 40 miles offshore. ”

Then the inevitable, the Sarops computer crashes. The report is unclear about whether Sarops was generating new maps, or whether the team was looking at a pre-crash map, and “Averill proposed a simple track-line search: the Jayhawk would head south-southeast for about 10 miles, straight through the main search area, then turn sharply to the north for another 10 miles, then veer north-northwest, which would take the crew straight back to Air Station Cape Cod. It wasn’t a conventional pattern, and it wasn’t Sarops-generated, but it would have to do.”

The key phrase to me in the story is “It wasn’t a conventional pattern, and it wasn’t Sarops-generated, but it would have to do.” This is a story of human ingenuity, and is yet another illustration of how powerful human minds are at computation, pattern recognition and problem solving, still more powerful than any computer algorithm, Bayesian or otherwise. Sosinski searched the boat for valuable clues, and Averill quickly crafted a search pattern that accommodated the helicopter’s position and fuel restrictions.

Using this story as some kind of proof that Bayesian methods are somehow superior is disingenuous. If the entire search effort had blindly followed Sarops patterns, who knows if Aldridge would be alive today.

As Simonsohn aptly notes, “if people misused or misunderstood one system, they would do just as badly with the other. Bayesian statistics, in short, can’t save us from bad science”, a point clearly exemplified by the Duke University cancer research fiasco.

• Unfortunately, the article failed to note that Simonsohn has explicitly said that using Bayesian methods introduces additional flexibility, making it far more difficult to hold people accountable. Not to mention that there isn’t a single agreed upon Bayesian approach even to run of the mill problems.

• john byrd

Steven: These are great points about human ingenuity. Perhaps Bayesians wish to capture some of that, but it is not “Bayesian” in any sense. And not just human. Watch trained retrievers (dogs) in a challenging field trial. They will start in what they reckon is the most likely spot to find the bird, but when that fails they revert quickly into a systematic search and rely heavily on process of elimination.

• Bayesian dog training? I found my cat tending to be much more of a falsificationist error statistician.

9. I find it curious that in the article Gelman is reported to have said Bayesian methods can help crosscheck frequentist ones, by which he means, background beliefs might make one suspect a result, but that fails utterly to show what’s flawed about the study. It becomes belief vs belief with no criteria to adjudicate. We report error probabilities, and you can knock them down, if warranted, by showing they do not hold. There’s nothing comparable in the Bayesian approach. They report a posterior probability, and there’s no measure of reliability or error probability placed on it. It just is. I can see why Faye was rather disappointed in this article. They should have omitted the ugly talk show host and put in some of the real substance of the debate.
Why such an urge to talk down to people when they’re the ones who are being presented with “personalized medicine” based on one or another statistical methodology? They do care, and they sure can understand when there’s lots of leeway and no report of accuracy, or checkable error probabilities. Anil Potti comes to mind.

• BayesAstro

I can answer this question. Suppose you take a single measurement for m and get a single observation y:

y_1=m+e_1

where e_1 is the actual error in this measurement.

The frequentist distribution P(e) is evaluated by seeing if it’s the limiting shape of a histogram of an infinite number of future measurements e_2,e_3,…. which will not be taken.

The Bayesian distribution P(e) is evaluated by seeing if e_1 is in the high probability region of the distribution.

It’s left to the reader to figure out which one of these is easier to evaluate in real life.

Using this Bayesian version, Laplace was able to estimate the mass of Saturn. Laplace’s quote on the subject was:

“it is a bet of 11,000 to 1 that the error in this result is not 1/100th of its value”

Laplace was right. The mass of Saturn is within those limits according to modern measurements. This would seem to be objective, testable, and provide a measure of how strong the evidence is (that “11,000 to 1” odds part).

• That’s like saying you can’t use & test the observable entailments of geometry because we don’t have real points.

• BayesAstro

The point was that Bayesians absolutely do have (objective) ‘criteria to adjudicate’ whether P(e) is a good/bad model of our uncertainty of the one e_1 that exists. The criteria aren’t frequentist in nature, but that shouldn’t be surprising because they weren’t using P(e) to model non-existent frequencies of e_2,e_3,….

Again, I’ll leave it to the reader to figure out the relative ease of being objective about errors that do exist versus errors that don’t exist.

• Bayes astro: you seem to have an appropriate name.

• BayesAstro

Well Laplace did get Saturn’s mass right.

It’s worth looking at how he did it. That way you can direct your anti-Bayesian arguments against real Bayes rather than Savage et al.

Here is the logic behind how Laplace did it. Assume a measurement y_1=m+e_1. There would have been more than one measurement, but I’ll just keep it simple here.

Create a P(e) which models our uncertainty in e_1. P(e) is not the frequency of anything, even approximately. Call the high probability region of this distribution R.

If e_1 is in R, then this is a good model. If it isn’t, then it’s not a good model. That’s the ‘criteria to adjudicate’ you wondered about.

If R is big, then there is lots of uncertainty about e_1. If it’s small, there’s little uncertainty. R is the ‘range of possibilities we’re going to consider for e_1’.

Then Laplace does a Bayesian calculation to get an interval estimate for m. Call this interval M.

This calculation shows (roughly speaking) the following: for every 1 possibility in R that makes the true mass outside M, there are 11,000 possibilities in R that put the true mass in M.

That is his justification for using the interval M to estimate Saturn’s mass. The “11,000 to 1” odds expresses the strength of this evidence.

Evidently, the true errors e_1 were part of the “11,000” and not one of the rare exceptions, which is why M turned out to be right according to modern measurements.
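The shape of the argument above can be sketched numerically. What follows is not Laplace's actual calculation. It assumes a normal error model P(e) with a known standard deviation, a flat prior on m, and made-up measurements; under those assumptions the posterior for m is normal around the sample mean, and the interval M is chosen so that the posterior odds of m lying inside it are 11,000 to 1.

```python
from statistics import NormalDist, mean


def interval_for_odds(measurements, sigma, odds_for=11_000):
    """Interval M for m with posterior odds `odds_for` to 1 of containing m.

    Assumes y_i = m + e_i with e_i ~ Normal(0, sigma), sigma known, and a
    flat prior on m, so the posterior is Normal(mean(y), sigma / sqrt(n)).
    Returns (lower, upper, z) where z is the half-width in posterior SDs.
    """
    prob_inside = odds_for / (odds_for + 1)
    z = NormalDist().inv_cdf(0.5 + prob_inside / 2)
    centre = mean(measurements)
    half_width = z * sigma / len(measurements) ** 0.5
    return centre - half_width, centre + half_width, z


# Hypothetical measurements in arbitrary units; sigma is assumed known.
ys = [100.2, 99.7, 100.4, 99.9]
lower, upper, z = interval_for_odds(ys, sigma=0.5)

print(f"half-width in posterior standard deviations: {z:.2f}")
print(f"M = ({lower:.3f}, {upper:.3f})")
```

Under these assumptions, odds of 11,000 to 1 correspond to an interval a little under four posterior standard deviations either side of the mean. The dispute in this thread is about what licenses the assumed P(e) in the first place, which the sketch does not settle.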

• john byrd

BayesAstro: “This calculation shows (roughly speaking) the following: for every 1 possibility in R that makes the true mass outside M, there are 11,000 possibilities in R that put the true mass in M.”. Is it o.k. with you to utilize ” possibilities” in the inference procedure?

• john byrd

BayesAstro: You seem to have the luxury of ignoring the many ways you might be wrong. The Burmese have an old adage: Even a blind chicken stumbles into a rice pot at some point. Cherry picking instances where it seems to have worked, or journalists crediting the approach for things it should not be credited with, is not satisfying. We have to consider error probabilities to produce honest and sound inferences. Some Bayesian types seem to do that, but you are advocating not. That seems to be bad advice.

• vl

At least a bayesian _attempts_ to model underdetermination. Flawed though one’s model might be, the typical application of p-values doesn’t even make an attempt.

If you want to see “a stopped clock is right twice a day” logic in action, look at the application of p-values in genomics.


• If you want to read about how frequentists solve underdetermination, please read Error and the Growth of Experimental Knowledge (1996) or a paper on underdetermination on my publication page.
The invalid uses of p-values in genomics are the ones licensed by Bayesians and all others who ignore or reject the sampling distribution. I hope that you’ve heard of the difference between computed and actual p-values? Ignoring the sample space (as the likelihood principle recommends/requires) and failing to test assumptions are what lead to bad measures, be they p-values or posteriors.

If you want a free copy of EGEK 1996, send me your address. Please try to get a bit more educated, since you obviously care about these issues.

• Steven McKinney

“We report error probabilities, and you can knock them down, if warranted, by showing they do not hold. There’s nothing comparable in the Bayesian approach. They report a posterior probability, and there’s no measure of reliability or error probability placed on it. It just is.”

Any purported scientist or researcher who makes such claims about Bayesian methods is peddling snake oil. Just because a statistical technique is derived in a Bayesian fashion does not excuse it from assessment of performance characteristics. For example, if a Bayesian-based method is developed to predict patient response to this or that drug, then that method can and must be assessed for its ability to improve health outcomes across any set of patients subjected to the method.

There is no reason that a Bayesian-based method can not have its success rate or other performance characteristics assessed, via simulation studies using data sets with known structure initially, and later via clinical trials or other sound experimental checks. This is the problem with so many claims concerning Bayesian methods – unproven performance assertions accepted by too many non-critical thinkers.
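A simulation of the kind described above can be sketched in a few lines. This is a minimal illustration, not anyone's actual method: it assumes a normal model with known standard deviation and a conjugate normal prior, and the function names `credible_interval` and `coverage` are hypothetical. It generates data from a known true mean and counts how often the 95% posterior interval contains that mean.

```python
import random
from statistics import NormalDist, fmean


def credible_interval(data, sigma, prior_mean, prior_sd, level=0.95):
    """Central posterior interval for a normal mean (known sigma, normal prior)."""
    n = len(data)
    prior_prec = 1.0 / prior_sd ** 2
    data_prec = n / sigma ** 2
    post_var = 1.0 / (prior_prec + data_prec)
    post_mean = post_var * (prior_prec * prior_mean + data_prec * fmean(data))
    z = NormalDist().inv_cdf(0.5 + level / 2)
    half = z * post_var ** 0.5
    return post_mean - half, post_mean + half


def coverage(true_mu, sigma, n, prior_mean, prior_sd, reps=5000, seed=0):
    """Fraction of simulated data sets whose interval contains the true mean."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        data = [rng.gauss(true_mu, sigma) for _ in range(n)]
        lo, hi = credible_interval(data, sigma, prior_mean, prior_sd)
        hits += lo <= true_mu <= hi
    return hits / reps


# Same data-generating process, two different priors (both centred at 0).
vague = coverage(true_mu=1.0, sigma=1.0, n=20, prior_mean=0.0, prior_sd=100.0)
tight_wrong = coverage(true_mu=1.0, sigma=1.0, n=20, prior_mean=0.0, prior_sd=0.1)

print(f"vague prior coverage: {vague:.3f}")
print(f"confident but wrong prior coverage: {tight_wrong:.3f}")
```

With the vague prior the interval's long-run coverage should sit close to the stated 95%, while the confident, badly centred prior should cover the true mean far less often. The posterior interval alone reports "95%" in both cases; the simulation is what supplies the performance assessment.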

• Steven: What I meant is that the Bayesian posterior probability does not come with an assessment of reliability. I didn’t mean one couldn’t add performance characteristics. I don’t know if that turns them into error statistical methods. “There is no reason that a Bayesian-based method can not have its success rate or other performance characteristics assessed, via simulation studies using data sets with known structure initially”. I think this has to be clarified. For one thing, one would want to know the cause of the success rate. The error properties of methods generally interrelate the design, data generation, modeling and analysis–they are not just an after-trial success rate appraisal. And how can one contrast with what would have occurred if other routes had been pursued? I admit there may be a big difference if one is in the business to “improve health outcomes across any set of patients subjected to the method” as opposed to finding things out, and increasing understanding of the disease process or other phenomenon. I’m more focused on the latter.

10. HEP

We would have learned a lot more about statistics in science if Kyle Cranmer, who helped develop a frequentist technique to identify the Higgs boson (one of the biggest scientific achievements in recent years relying on statistics), had been given more than the tiniest shoutout in this article. There’s an obvious imbalance when game show probabilities are presented in terms of a revolutionary new advance, while the statistical basis for the Higgs discovery is sidelined.

• HEP: I’m glad to be reminded of this, nearly forgot. I second the outrage and frank befuddlement at the statistics community keeping this statistical success story under wraps (or so it seems). We had the International Year of Statistics last year and I don’t think there was a single session at JSM devoted to this great success story. Correct me if I’m wrong. There was enough of an ecumenical approach in parts of the work for all to cheer. Can anyone think of another science where something analogous occurs? I’m guessing there are.

11. I want to scream when I hear things like frequentist methods promised “just-the-facts objectivity, unsullied by beliefs or biases”, when the more accurate description is that frequentist and related statistics and experimental design have fought tooth and nail to invent ways to avoid biases and use probabilities, not to represent beliefs, but to avoid being misled by beliefs and biases. Both mine and yours. Putting it in terms of this inane “immaculate conception” idea is completely backwards! And dangerous. It’s no surprise that, as people let down their guard on experimental design, demanding, stringent tests, and bias-detecting methods, we have a lot of the junk science we now see. As Fisher said, those who claim you don’t require explicit techniques for error control are “selling the moon”. When it comes to fraud busting, it’s to frequentist methods that people turn.

12. Richard D. Morey

“Experimental design is an old area of statistics developed by frequentists…” Wow, I can just imagine how angry Fisher would be at this misrepresentation of history.

“Some statisticians and scientists are optimistic that Bayesian methods can improve the reliability of research by allowing scientists to crosscheck work done with the more traditional or “classical” approach, known as frequentist statistics.” Confusing classical and frequentist statistics is simply wrong (though common). Go back and read how Fisher felt about frequentism.

• The name classical was foisted on Fisherian and N-P methodologies. (See title of recent book by Lehmann.) But they are also called frequentist–another not-so-good name. Frequentist is also a name from the outside, and it is not the same as behavioristic statistics. Faye’s article made the division frequentist vs Bayesian, so I had to pick one. I very much dislike classical (which sounds like classical probability). I think that if I were confused about Fisherian statistics, Sir David Cox would have let me know during the two papers I wrote with him and through several conferences with him.

You’re obviously confusing “frequentist” with a view of the rationale of statistics as controlling long-run errors, which both I and Cox reject, but Fisher was all over the place on.

So read up on the history and you can get it right here on this blog.

Mayo, D. G. and Cox, D. R. (2010). “Frequentist Statistics as a Theory of Inductive Inference” in Error and Inference: Recent Exchanges on Experimental Reasoning, Reliability and the Objectivity and Rationality of Science (D Mayo and A. Spanos eds.), Cambridge: Cambridge University Press: 1-27. This paper appeared in The Second Erich L. Lehmann Symposium: Optimality, 2006, Lecture Notes-Monograph Series, Volume 49, Institute of Mathematical Statistics, pp. 247-275.

My preferred term for accounts that make use of error probabilities is “error statistical”. So now that we’ve gone through some of your naming perplexities, what in the world do you spoze you’re saying/asking?

• Richard: In case you didn’t figure this out, your second quote comes from the article! It’s Faye reporting on what Gelman said or she thought he said. It’s not Mayo. Feel free to apologize.