*See “rejected posts”.
Our favorite high school student, Isaac, gets a better shot at showing his college readiness using one of the comparative measures of support or confirmation discussed last week. Their assessment thus seems more in sync with the severe tester, but they are not purporting that z is evidence for inferring (or even believing) an H to which z affords a high B-boost*. Their measures identify a third category that reflects the degree to which H would predict z (where the comparison might be predicting without z, or under ~H or the like). At least if we give it an empirical, rather than a purely logical, reading. Since it’s Saturday night let’s listen in to one of the comedy hours at the Bayesian retreat as reblogged from May 5, 2012.
The problem was, the epistemic probability in H was so low that H couldn’t be believed! Instead we believe its denial H’! So, she will infer hypotheses that are simply unbelievable!
So it appears the error statistical testing account fails to serve as an account of knowledge or evidence (i.e., an epistemic account). However severely I might wish to say that a hypothesis H has passed a test, this Bayesian critic assigns a sufficiently low prior probability to H so as to yield a low posterior probability in H[i]. But this is no argument about why this counts in favor of, rather than against, their particular Bayesian computation as an appropriate assessment of the warrant to be accorded to hypothesis H.
To begin with, in order to use techniques for assigning frequentist probabilities to events, their examples invariably involve “hypotheses” that consist of asserting that a sample possesses a characteristic, such as “having a disease” or “being college ready” or, for that matter, “being true.” This would not necessarily be problematic if it were not for the fact that their criticism requires shifting the probability to the particular sample selected—for example, a student Isaac is college-ready, or this null hypothesis (selected from a pool of nulls) is true. This was, recall, the fallacious probability assignment that we saw in Berger’s attempt, later (perhaps) disavowed. Also there are just two outcomes, say s and ~s, and no degrees of discrepancy from H. Read more
Dear Reader: Tonight marks the 2-year anniversary of this blog; so I’m reblogging my very first posts from 9/3/11 here and here (from the rickety old blog site)*. (One was the “about”.) The current blog was included once again in the top 50 statistics blogs. Amazingly, I have received e-mails from different parts of the world describing experimental recipes for the special concoction we exiles favor! (Mine is here.) If you can fly over to the Elbar Room, please join us: I’m treating everyone to doubles of Elbar Grease! Thanks for reading and contributing! D. G. Mayo
(*The old blogspot is a big mix; it was before Rejected blogs. Yes, I still use this old typewriter [ii])
“Did you hear the one about the frequentist . . .
- “who claimed that observing “heads” on a biased coin that lands heads with probability .05 is evidence of a statistically significant improvement over the standard treatment of diabetes, on the grounds that such an event occurs with low probability (.05)?”
- “who defended the reliability of his radiation reading, despite using a broken radiometer, on the grounds that most of the time he uses one that works, so on average he’s pretty reliable?”
Such jests may work for an after-dinner laugh, but if it turns out that, despite being retreads of “straw-men” fallacies, they form the basis of why some reject frequentist methods, then they are not such a laughing matter. But surely the drubbing of frequentist methods could not be based on a collection of howlers, could it? I invite the curious reader to stay and find out.
If we are to take the criticisms seriously, and put to one side the possibility that they are deliberate distortions of frequentist statistical methods, we need to identify their sources. To this end I consider two interrelated areas around which to organize foundational issues in statistics: (1) the roles of probability in induction and inference, and (2) the nature and goals of statistical inference in science or learning. Frequentist sampling statistics, which I prefer to call “error statistics,” continues to be raked over the coals in the foundational literature, but with little scrutiny of the presuppositions about goals and methods, without which the criticisms lose all force.
First, there is the supposition that an adequate account must assign degrees of probability to hypotheses, an assumption often called probabilism. Second, there is the assumption that the main, if not the only, goal of error-statistical methods is to evaluate long-run error rates. Given the wide latitude with which some critics define “controlling long-run error,” it is not surprising to find them arguing that (i) error statisticians approve of silly methods, and/or (ii) rival (e.g., Bayesian) accounts also satisfy error statistical demands. Absent this sleight of hand, Bayesian celebrants would have to go straight to the finale of their entertainment hour: a rousing rendition of “There’s No Theorem Like Bayes’s Theorem.”
Never mind that frequentists have responded to these criticisms, they keep popping up (verbatim) in many Bayesian textbooks and articles on philosophical foundations. The difficulty of articulating a statistical philosophy that fully explains the basis for both (i) insisting on error-statistical guarantees, while (ii) avoiding pathological examples in practice, has turned many a frequentist away from venturing into foundational battlegrounds. Some even concede the distorted perspectives drawn from overly literal and radical expositions of what Fisher, Neyman, and Pearson “really thought”. Many others just find the “statistical wars” distasteful.
Here is where I view my contribution—as a philosopher of science—to the long-standing debate: not merely to call attention to the howlers that pass as legitimate criticisms of frequentist error statistics, but also to sketch the main lines of an alternative statistical philosophy within which to better articulate the roles and value of frequentist tools. Let me be clear that I do not consider this the only philosophical framework for frequentist statistics—different terminology could do as well. I will consider myself successful if I can provide one way of building, or one standpoint from which to build, a frequentist, error- statistical philosophy.
But given this is a blog, I shall be direct and to the point: I hope to cultivate the interests of others who might want to promote intellectual honesty within a generally very lopsided philosophical debate. I will begin with the first entry to the comedy routine, as it is put forth by leading Bayesians……
“Frequentists in Exile” 9/3/11 by D. Mayo
Confronted with the position that “arguments for this personalistic theory were so persuasive that anything to any extent inconsistent with that theory should be discarded” (Cox 2006, 196), frequentists might have seen themselves in a kind of exile when it came to foundations, even those who had been active in the dialogues of an earlier period [i]. Sometime around the late 1990s there were signs that this was changing. Regardless of the explanation, the fact that it did occur and is occurring is of central importance to statistical philosophy.
Now that Bayesians have stepped off their a priori pedestal, it may be hoped that a genuinely deep scrutiny of the frequentist and Bayesian accounts will occur. In some corners of practice it appears that frequentist error statistical foundations are being discovered anew. Perhaps frequentist foundations, never made fully explicit, but at most lying deep below the ocean floor, are finally being disinterred. But let’s learn from some of the mistakes in the earlier attempts to understand it. With this goal I invite you to join me in some deep water drilling, here as I cast about on my Isle of Elba.
Cox, D. R. (2006), Principles of Statistical Inference, CUP.
[i] Yes, that’s the Elba connection: Napolean’s exile (from which he returned to fight more battles).
[ii] I have discovered a very reliable antique typewriter shop in Oxford that was able to replace the two missing typewriter keys. So long as my “ribbons” and carbon sheets don’t run out, I’m set.
It’s nearly two years since I began this blog, and some are wondering if I’ve covered all the howlers thrust our way? Sadly, no. So since it’s Saturday night here at the Elba Room, let’s listen in on one of the more puzzling fallacies–one that I let my introductory logic students spot…
“Did you hear the one about significance testers sawing off their own limbs?
‘Suppose we decide that the effect exists; that is, we reject [null hypothesis] H0. Surely, we must also reject probabilities conditional on H0, but then what was the logical justification for the decision? Orthodox logic saws off its own limb.’ ”
Ha! Ha! By this reasoning, no hypothetical testing or falsification could ever occur. As soon as H is falsified, the grounds for falsifying disappear! If H: all swans are white, then if I see a black swan, H is falsified. But according to this critic, we can no longer assume the deduced prediction from H! What? The entailment from a hypothesis or model H to x, whether it is statistical or deductive, does not go away after the hypothesis or model H is rejected on grounds that the prediction is not born out.[i] When particle physicists deduce that the events could not be due to background alone, the statistical derivation (to what would be expected under H: background alone) does not get sawed off when H is denied!
The above quote is from Jaynes (p. 524) writing on the pathologies of “orthodox” tests. How does someone writing a great big book on “the logic of science” get this wrong? To be generous, we may assume that in the heat of criticism, his logic takes a wild holiday. Unfortunately, I’ve heard several of his acolytes repeat this. There’s a serious misunderstanding of how hypothetical reasoning works: 6 lashes, and a pledge not to uncritically accept what critics say, however much you revere them.
Jaynes, E. T. 2003. Probability Theory: The Logic of Science. Cambridge: Cambridge University Press.
[i]Of course there is also no warrant for inferring an alternative hypothesis, unless it is a non-null warranted with severity—even if the alternative entails the existence of a real effect. (Statistical significance is not substantive significance—it is by now cliché . Search this blog for fallacies of rejection.)
A few previous comedy hour posts:
(09/03/11) Overheard at the comedy hour at the Bayesian retreat
(4/4/12) Jackie Mason: Fallacy of Rejection and the Fallacy of Nouvelle Cuisine
(04/28/12) Comedy Hour at the Bayesian Retreat: P-values versus Posteriors
(05/05/12) Comedy Hour at the Bayesian (Epistemology) Retreat: Highly Probable vs Highly Probed
(09/03/12) After dinner Bayesian comedy hour…. (1 year anniversary)
(09/08/12) Return to the comedy hour…(on significance tests)
(04/06/13) Who is allowed to cheat? I.J. Good and that after dinner comedy hour….
(04/27/13) Getting Credit (or blame) for Something You Didn’t Do (BP oil spill, comedy hour)
Oh No! It’s those mutant bears again. To my dismay, I’ve been sent, for the third time, that silly, snarky, adolescent, clip of those naughty “what the p-value” bears (first posted on Aug 5, 2012), who cannot seem to get a proper understanding of significance tests into their little bear brains. So apparently some people haven’t seen my rejoinder which, as I said then, practically wrote itself. So since it’s Saturday night here at the Elbar Room, let’s listen in to a mashup of both the clip and my original rejoinder (in which p-value bears are replaced with hypothetical Bayesian bears).
These stilted bear figures and their voices are sufficiently obnoxious in their own right, even without the tedious lampooning of p-values and the feigned horror at learning they should not be reported as posterior probabilities.
Bear #1: Do you have the results of the study?
Bear #2:Yes. The good news is there is a .996 probability of a positive difference in the main comparison.
Bear #1: Great. So I can be well assured that there is just a .004 probability that such positive results would occur if they were merely due to chance.
Bear #2: Not really, that would be an incorrect interpretation. Read more
Three years ago, many of us were glued to the “spill cam” showing, in real time, the gushing oil from the April 20, 2010 explosion sinking the Deepwater Horizon oil rig in the Gulf of Mexico, killing 11, and spewing oil until July 15. Trials have been taking place this month, as people try to meet the 3 year deadline to sue BP and others. But what happened to the 200 million gallons of oil? (Is anyone up to date on this?) Has it vanished or just sunk to the bottom of the sea by dispersants which may have caused hidden destruction of sea life? I don’t know, but given it’s Saturday night around the 3 year anniversary, let’s listen into a reblog of a spill-related variation on the second of two original “overheard at the comedy hour” jokes.
In effect, it accuses the frequentist error-statistical account of licensing the following (make-believe) argument after the 2010 oil spill:
Senator: But you conceded that whenever your measuring tool showed dangerous or ambiguous readings, you continually lowered the pressure, and that the stringent “cement bond log” test was entirely skipped.
Oil Exec: Granted, we omitted reliable checks on April 20, 2010, but usually we do a better job—I am reporting the average! You see, we use a randomizer that most of the time directs us to run the gold-standard check on pressure. But, but April 20 just happened to be one of those times we did the nonstringent test; but on average we do ok.
Senator: But you don’t know that your system would have passed the more stringent test you didn’t perform!
Oil Exec: That’s the beauty of the the frequentist test!
Even if we grant (for the sake of the joke) that overall, this “test” rarely errs in the report it outputs (pass or fail), that is irrelevant to appraising the inference from the data on April 20, 2010 (which would have differed had the more stringent test been run). That interpretation violates the severity criterion: the observed passing result was altogether common if generated from a source where the pressure level was unacceptably high, Therefore it misinterprets the actual data. The question is why anyone would saddle the frequentist with such shenanigans on averages? … Lest anyone think I am inventing a criticism, here is a familiar statistical instantiation, where the choice for each experiment is given to be .5 (Cox 1958).
Two Measuring Instruments with Different Precisions:
A single observation X is to be made on a normally distributed random variable with unknown mean m, but the measurement instrument is chosen by a coin flip: with heads we use instrument E’ with a known small variance, say 10-4, while with tails, we use E”, with a known large variance, say 104. The full data indicates whether E’ or E” was performed, and the particular value observed, which we can write as x’ and x”, respectively. (This example comes up in, ton o’bricks).
In applying our test T+ (see November 2011 blog post ) to a null hypothesis, say, µ = 0, the “same” value of X would correspond to a much smaller p-value were it to have come from E’ than if it had come from E”. Denote the two p-values as p’ and p”, respectively. However, or so the criticism proceeds, the error statistician would report the average p-value: .5(p’ + p”).
But this would give a misleading assessment of the precision and corresponding severity with either measurement! Instead you should report the p-value of the result in the experiment actually run (this is Cox’s Weak Conditionality Principle, WCP).
But what could lead the critic to suppose the error statistician must average over experiments not even performed? Rule #2 for legitimate criticism is to give the position being criticized the most generous construal one can think of. Perhaps the critic supposes what is actually a distortion of even the most radical behavioristic construal:
- If you consider outcomes that could have occurred in hypothetical repetitions of this experiment, you must also consider other experiments you did not run (but could have been run) in reasoning from the data observed (from the test you actually ran), and report some kind of frequentist average!
The severity requirement makes explicit that such a construal is to be rejected—I would have thought it obvious, and not in need of identifying a special principle. Since it wasn’t, I articulated this special notion for interpreting tests and the corresponding severity criterion. Read more
It was from my Virginia Tech colleague I.J. Good (in statistics), who died four years ago (April 5, 2009), at 93, that I learned most of what I call “howlers” on this blog. His favorites were based on the “paradoxes” of stopping rules.
“In conversation I have emphasized to other statisticians, starting in 1950, that, in virtue of the ‘law of the iterated logarithm,’ by optional stopping an arbitrarily high sigmage, and therefore an arbitrarily small tail-area probability, can be attained even when the null hypothesis is true. In other words if a Fisherian is prepared to use optional stopping (which usually he is not) he can be sure of rejecting a true null hypothesis provided that he is prepared to go on sampling for a long time. The way I usually express this ‘paradox’ is that a Fisherian [but not a Bayesian] can cheat by pretending he has a plane to catch like a gambler who leaves the table when he is ahead” (Good 1983, 135) [*]
This paper came from a conference where we both presented, and he was extremely critical of my error statistical defense on this point. (I was a year out of grad school, and he a University Distinguished Professor.)
One time, years later, after hearing Jack give this howler for the nth time, “a Fisherian [but not a Bayesian] can cheat, etc.,” I was driving him to his office, and suddenly blurted out what I really thought:
“You know Jack, as many times as I have heard you tell this, I’ve always been baffled as to its lesson about who is allowed to cheat. Error statisticians require the overall and not the ‘computed’ significance level be reported. To us, what would be cheating would be reporting the significance level you got after trying and trying again in just the same way as if the test had a fixed sample size. True, we are forced to fret about how stopping rules alter the error probabilities of tests, while the Bayesian is free to ignore them, but why isn’t the real lesson that the Bayesian is allowed to cheat?” (A published version of my remark may be found in EGEK p. 351: “As often as my distinguished colleague presents this point…”)
To my surprise, or actually shock, after pondering this a bit, Jack said something like, “Hmm, I never thought of it this way.”
By the way, the story of the “after dinner Bayesian comedy hour” on this blog, did not allude to Jack but to someone who gave a much more embellished version. Since it’s Saturday night, let’s once again listen into the comedy hour that unfolded at my dinner table at an academic conference:
Did you hear the one about the researcher who gets a phone call from the guy analyzing his data? First the guy congratulates him and says, “The results show a Read more
Saturday Night Brainstorming: The TFSI on NHST–reblogging with a 2013 update
Each year leaders of the movement to reform statistical methodology in psychology, social science and other areas of applied statistics get together around this time for a brainstorming session. They review the latest from the Task Force on Statistical Inference (TFSI), propose new regulations they would like the APA publication manual to adopt, and strategize about how to institutionalize improvements to statistical methodology.
While frustrated that the TFSI has still not banned null hypothesis significance testing (NHST), since attempts going back to at least 1996, the reformers have created, and very successfully published in, new meta-level research paradigms designed expressly to study (statistically!) a central question: have the carrots and sticks of reward and punishment been successful in decreasing the use of NHST, and promoting instead use of confidence intervals, power calculations, and meta-analysis of effect sizes? Or not?
This year there are a couple of new members who are pitching in to contribute what they hope are novel ideas for reforming statistical practice. Since it’s Saturday night, let’s listen in on part of an (imaginary) brainstorming session of the New Reformers. This is a 2013 update of an earlier blogpost. Read more
“Are you butter off now? Deconstructing the butter bust of the President” http://rejectedpostsofdmayo.com/
These days, so many theater productions are updated reviews of older standards. Same with the comedy hours at the Bayesian retreat, and task force meetings of significance test reformers. So (on the 1-year anniversary of this blog) let’s listen in to one of the earliest routines (with highest blog hits), but with some new reflections (first considered here and here).
‘ “Did you hear the one about the frequentist . . .
“who claimed that observing “heads” on a biased coin that lands heads with probability .05 is evidence of a statistically significant improvement over the standard treatment of diabetes, on the grounds that such an event occurs with low probability (.05)?”
The joke came from J. Kadane’s Principles of Uncertainty (2011, CRC Press*).
“Flip a biased coin that comes up heads with probability 0.95, and tails with probability 0.05. If the coin comes up tails reject the null hypothesis. Since the probability of rejecting the null hypothesis if it is true is 0.05, this is a valid 5% level test. It is also very robust against data errors; indeed it does not depend on the data at all. It is also nonsense, of course, but nonsense allowed by the rules of significance testing.” (439)
But is it allowed? I say no. The null hypothesis in the joke can be in any field, perhaps it concerns mean transmission of Scrapie in mice (as in my early Kuru post). I know some people view significance tests as merely rules that rarely reject erroneously, but I claim this is mistaken. Both in significance tests and in scientific hypothesis testing more generally, data indicate inconsistency with H only by being counter to what would be expected under the assumption that H is correct (as regards a given aspect observed). Were someone to tell Prusiner that the testing methods he follows actually allow any old “improbable” event (a stock split in Apple?) to reject a hypothesis about prion transmission rates, Prusiner would say that person didn’t understand the requirements of hypothesis testing in science. Since the criticism would hold no water in the analogous case of Prusiner’s test, it must equally miss its mark in the case of significance tests**. That, recall, was Rule #1. Read more
Given it’s the first anniversary of this blog, which opened with the howlers in “Overheard at the comedy hour …” let’s listen in as a Bayesian holds forth on one of the most famous howlers of the lot: the mysterious role that psychological intentions are said to play in frequentist methods such as statistical significance tests. Here it is, essentially as I remember it (though shortened), in the comedy hour that unfolded at my dinner table at an academic conference:
Did you hear the one about the researcher who gets a phone call from the guy analyzing his data? First the guy congratulates him and says, “The results show a statistically significant difference at the .05 level—p-value .048.” But then, an hour later, the phone rings again. It’s the same guy, but now he’s apologizing. It turns out that the experimenter intended to keep sampling until the result was 1.96 standard deviations away from the 0 null—in either direction—so they had to reanalyze the data (n=169), and the results were no longer statistically significant at the .05 level.
So the researcher is tearing his hair out when the same guy calls back again. “Congratulations!” the guy says. “I just found out that the experimenter actually had planned to take n=169 all along, so the results are statistically significant.”
Howls of laughter.
But then the guy calls back with the bad news . . .
It turns out that failing to score a sufficiently impressive effect after n’ trials, the experimenter went on to n” trials, and so on and so forth until finally, say, on trial number 169, he obtained a result 1.96 standard deviations from the null.
It continues this way, and every time the guy calls in and reports a shift in the p-value, the table erupts in howls of laughter! From everyone except me, sitting in stunned silence, staring straight ahead. The hilarity ensues from the idea that the experimenter’s reported psychological intentions about when to stop sampling is altering the statistical results. Read more
APRIL FOOL’S DAY POST: This morning I received a paper I have been asked to review (anonymously as is typical). It is to head up a forthcoming issue of a new journal called Philosophy of Statistics: Retraction Watch. This is the first I’ve heard of the journal, and I plan to recommend they publish the piece, conditional on revisions. I thought I would post the abstract here. It’s that interesting.
“Some Slightly More Realistic Self-Criticism in Recent Work in Philosophy of Statistics,” Philosophy of Statistics: Retraction Watch, Vol. 1, No. 1 (2012), pp. 1-19.
In this paper we delineate some serious blunders that we and others have made in published work on frequentist statistical methods. First, although we have claimed repeatedly that a core thesis of the frequentist testing approach is that a hypothesis may be rejected with increasing confidence as the power of the test increases, we now see that this is completely backwards, and we regret that we have never addressed, or even fully read, the corrections found in Deborah Mayo’s work since at least 1983, and likely even before that.
Second, we have been wrong to claim that Neyman-Pearson (N-P) confidence intervals are inconsistent because in special cases it is possible for a specific 95% confidence interval to be known to be correct. Not only are the examples required to show this absurdly artificial, but the frequentist could simply interpret this “vacuous interval” “as a statement that all parameter values are consistent with the data at a particular level,” which, as Cox and Hinkley note, is an informative statement about the limitations in the data (Cox and Hinkley 1974, 226). Read more
|Ruler at the Bottom of Ocean|
Senator: But you conceded that whenever your measuring tool showed dangerous or ambiguous readings, you continually lowered the pressure, and that the stringent “cement bond log” test was entirely skipped. Read more