Author Archives: Mayo

About Mayo

I am a professor in the Department of Philosophy at Virginia Tech and hold a visiting appointment at the Center for the Philosophy of Natural and Social Science of the London School of Economics. I am the author of Error and the Growth of Experimental Knowledge, which won the 1998 Lakatos Prize, awarded to the most outstanding contribution to the philosophy of science during the previous six years. I have applied my approach toward solving key problems in philosophy of science: underdetermination, the role of novel evidence, Duhem's problem, and the nature of scientific progress. I am also interested in applications to problems in risk analysis and risk controversies, and co-edited Acceptable Evidence: Science and Values in Risk Management (with Rachelle Hollander). I teach courses in introductory and advanced logic (including the metatheory of logic and modal logic), in scientific method, and in philosophy of science. I also teach special topics courses in Science and Technology Studies.

TragiComedy hour: P-values vs posterior probabilities vs diagnostic error rates

Did you hear the one about the frequentist significance tester when he was shown the nonfrequentist nature of p-values?

Critic: I just simulated a long series of tests on a pool of null hypotheses, and I found that among tests with p-values of .05, at least 22%—and typically over 50%—of the null hypotheses are true!

Frequentist Significance Tester: Scratches head: But rejecting the null with a p-value of .05 ensures erroneous rejection no more than 5% of the time!

Raucous laughter ensues!

(Hah, hah… “So funny, I forgot to laugh! Or, I’m crying and laughing at the same time!”)

The frequentist tester should retort:

Frequentist Tester: But you assume 50% of the null hypotheses are true, compute P(H0|x) using P(H0) = .5, imagine the null is rejected based on a single small p-value, and then blame the p-value for disagreeing with the result of your computation!

At times you even use α and power as likelihoods in your analysis! These tests violate both Fisherian and Neyman-Pearson tests.

It is well-known that for a fixed p-value, with a sufficiently large n, even a statistically significant result can correspond to large posteriors in H0. This Jeffreys-Lindley “disagreement” is considered problematic for Bayes ratios (e.g., Bernardo). It is not problematic for error statisticians. We always indicate the extent of discrepancy that is and is not indicated, and avoid making mountains out of molehills (see Spanos 2013). J. Berger and Sellke (1987) attempt to generalize the result to show the “exaggeration” even without large n. From their Bayesian perspective, it appears that p-values come up short; error statistical testers (and even some tribes of Bayesians) balk at the fact that using the recommended priors allows highly significant results to be interpreted as no evidence against the null, or even evidence for it!

The conflict between p-values and Bayesian posteriors typically considers the two sided test of the Normal mean, H0: μ = μ0 versus H1: μ ≠ μ0.

“If n = 50 one can classically ‘reject H0 at significance level p = .05,’ although Pr (H0|x) = .52 (which would actually indicate that the evidence favors H0).” (Berger and Sellke, 1987, p. 113).

If n = 1000, a result statistically significant at the .05 level leads to a posterior to the null of .82!

Table 1 (modified) from J.O. Berger and T. Sellke (1987) “Testing a Point Null Hypothesis,” JASA 82(397): 113.
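The two posteriors quoted above can be reproduced with a few lines of arithmetic. What follows is only a sketch under stated assumptions: prior mass .5 on H0, known σ, and, under H1, a Normal prior on μ centered at μ0 with variance σ² (the assumption that appears to match the quoted figures). The function name and its arguments are illustrative and are not taken from Berger and Sellke.

```python
import math

def posterior_null(z, n, prior_null=0.5):
    """P(H0 | x) for H0: mu = mu0 vs H1: mu != mu0 with known sigma.

    Assumptions (illustrative): prior mass `prior_null` on H0 and,
    under H1, mu ~ N(mu0, sigma^2). `z` is the observed standardized
    statistic sqrt(n) * (xbar - mu0) / sigma.
    """
    # Bayes factor in favor of H0: density of xbar under H0 divided by
    # its marginal density under H1, which is N(mu0, sigma^2 * (1 + 1/n)).
    bf01 = math.sqrt(1 + n) * math.exp(-0.5 * z**2 * n / (1 + n))
    posterior_odds = (prior_null / (1 - prior_null)) * bf01
    return posterior_odds / (1 + posterior_odds)

Z_05 = 1.96  # |z| giving a two-sided p-value of .05

for n in (50, 1000):
    print(n, round(posterior_null(Z_05, n), 2))
# 50 0.52
# 1000 0.82
```

The p-value is held at .05 throughout; only n changes. A different prior under H1, or a different prior mass on H0, gives different posteriors from the very same data.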







Some find the example shows the p-value “overstates evidence against a null” because it claims to use an “impartial” or “uninformative” Bayesian prior probability assignment of .5 to H0, the remaining .5 being spread out over the alternative parameter space. (“Spike and slab” I’ve heard Gelman call this, derisively.) Others demonstrate that the problem is not p-values but the high prior. 

Casella and R. Berger (1987) show that “concentrating mass on the point null hypothesis is biasing the prior in favor of H0 as much as possible” (p. 111) whether in 1 or 2-sided tests. Note, too, the conflict with confidence interval reasoning since the null value (here it is 0) lies outside the corresponding confidence interval (Mayo 2005). Many complain the “spiked concentration of belief in the null” is at odds with the view that “we know all nulls are false” (even though that view is also false). See Senn’s interesting points on this same issue in his letter (to Goodman).

But often, as in the opening joke, the prior assignment is claimed to be keeping to the frequentist camp and to frequentist error probabilities. How’s that supposed to work? It is imagined that we sample randomly from a population of hypotheses, k% of which are assumed to be true. 50% is a common number. We randomly draw a hypothesis and get this particular one, maybe it concerns the mean deflection of light, or perhaps it is an assertion of bioequivalence of two drugs or whatever. The percentage “initially true” (in this urn of nulls) serves as the prior probability for your particular H0. I see this gambit in statistics, psychology, philosophy and elsewhere, and yet it commits a fallacious instantiation of probabilities:

50% of the null hypotheses in a given pool of nulls are true.

This particular null H0 was randomly selected from this urn (some may wish to add “nothing else is known” which would scarcely be true here).

Therefore P(H0 is true) = .5.

I discussed this 20 years ago, Mayo 1997a and b (links in the references) and ever since. However, statistical fallacies repeatedly return to fashion in slightly different guises. Nowadays, you’re most likely to see it within what may be called diagnostic screening models of tests.

It’s not that you can’t play a carnival game of reaching into an urn of nulls (and there are lots of choices for what to put in the urn), and use a Bernoulli model for the chance of drawing a true hypothesis (assuming we even knew the % of true hypotheses, which we do not), but the “event of drawing a true null” is no longer the particular hypothesis one aims to use in computing the probability of data x0 under hypothesis H0. In other words, it’s no longer the H0 needed for the likelihood portion of the frequentist computation. (Note, too, the selected null would get the benefit of being selected from an urn of nulls where few have been shown false yet: “innocence by association”. See my comment on J. Berger 2003, pp. 19-24.)
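For readers who want to play the carnival game, here is a rough simulation sketch. Everything in it is an assumption made for illustration: the fraction of true nulls in the urn, the single fixed shift (2.8 standard errors) standing in for every false null, and the window of p-values counted as "about .05". The function names are hypothetical.

```python
import math
import random

def two_sided_p(z):
    """Two-sided p-value for a standard Normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2))

def urn_game(n_draws=200_000, frac_true=0.5, shift=2.8,
             window=(0.04, 0.06), seed=1):
    """Relative frequency of true nulls among draws with p near .05.

    Assumed setup: each drawn null is true with probability `frac_true`
    (a Bernoulli draw from the urn); a false null shifts the mean of z
    by `shift`; only p-values inside `window` are counted.
    """
    rng = random.Random(seed)
    hits = 0        # draws whose p-value lands in the window
    true_nulls = 0  # those among the hits whose null was true
    for _ in range(n_draws):
        null_is_true = rng.random() < frac_true
        z = rng.gauss(0.0 if null_is_true else shift, 1.0)
        if window[0] <= two_sided_p(z) <= window[1]:
            hits += 1
            true_nulls += null_is_true
    return true_nulls / hits if hits else float("nan")

print(urn_game(frac_true=0.5))
print(urn_game(frac_true=0.9))
```

The number that comes out is a relative frequency over the urn, and it moves with what is put into the urn (compare the two calls). Nothing in the computation refers to the particular H0 whose likelihood a test uses.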

In any event, .5 is not the frequentist probability that the selected null H0 is true–in those cases where a frequentist prior exists. (I first discussed the nature of legitimate frequentist priors with Erich Lehmann; see the poem he wrote for me as a result in Mayo 1997a).

The diagnostic screening model of tests. The diagnostic screening model of tests has become increasingly commonplace, thanks to Big Data, perverse incentives, nonreplication and all the rest. As Taleb puts it:

“With big data, researchers have brought cherry-picking to an industrial level”.

Now the diagnostic screening model is apt for various goals–diagnostic screening (for disease) most obviously, but also for TSA bag checks, high throughput studies in genetics and other contexts where the concern is controlling the noise in the network rather than appraising the well-testedness of your research claim. Dichotomies are fine for diagnostics (disease or not, worth further study or not, dangerous bag or not). Forcing scientific inference into a binary basket is what most of us wish to move away from, yet the new screening model dichotomizes results into significant/non-significant, usually at the .05 level. One shouldn’t mix the notions of prevalence, positive predictive value, negative predictive value, etc. from screening with the concepts from statistical testing in science. Yet people do, and there are at least 2 tragicomic results: One is that error probabilities are blamed for disagreeing with measures of completely different things. One journal editor claims the fact that p-values differ from posteriors proves the “invalidity” of p-values.

The second tragicomic result is that inconsistent meanings of type 1 (and 2) error probabilities have found their way into the latest reforms, and into guidebooks for how to avoid inconsistent interpretations of statistical concepts. Whereas there’s a trade-off between type 1 error and type 2 error probabilities in Neyman-Pearson style hypothesis tests, this is no longer true when a type 1 error probability is defined as the posterior of H0 conditional on rejecting. Topsy turvy claims about power readily ensue (search this blog under power for numerous examples).

Conventional Bayesian variant. J. Berger doesn’t really imagine selecting from an urn of nulls (he claims). Instead, spiked priors come from one of the systems of default or conventional priors. Curiously, he claims that by adopting his recommended conventional priors, frequentists can become more frequentist (than using flawed error probabilities). We get what he calls conditional p-values (or conditional error probabilities). Magician that he is, the result is that frequentist error probabilities are no longer error probabilities, or even frequentist!

How it happens is not entirely clear, but it’s based on his defining a “Frequentist Principle” that demands that a type 1 (or 2) error probability yield the same number as his conventional posterior probability. (See Berger 2003, and my comment in Mayo 2003).

Senn, in a guest post remarks:

The error that is made is not only to assume that P-values should be what they are not but that when one tries to interpret them in the way that one should not, the previous calibration survives.

It is as if in giving recommendations in dosing children one abandoned a formula based on age and adopted one based on weight but insisted on using the same number of kg one had used for years.

Urn of Nulls. Others appear to be serious about the urn of nulls metaphor (e.g., Colquhoun 2014). Say 50% of the nulls in the urn are imagined to be true. Then, when you select your null, its initial probability of truth is .5. This, however, is to commit the fallacy of probabilistic instantiation.

Two moves are made: (1) It’s admitted it’s an erroneous probabilistic instantiation, but the goal is said to be assessing “science wise error rates” as in a diagnostic screening context. A second move (2) is to claim that a high positive predictive value PPV from the diagnostic model warrants high “epistemic probability”–whatever that is– to the particular case at hand.

The upshot of both is at odds with the goal of restoring scientific integrity. Even if we were to grant these “prevalence rates” (to allude to diagnostic testing), my question is: Why would it be relevant to how good a job you did in testing your particular hypothesis, call it H*? Sciences with high “crud factors” (Meehl) might well get a high PPV simply because nearly all their nulls are false. This still wouldn’t be evidence of replication ability, nor of understanding of the phenomenon. It would reward non-challenging thinking, and taking the easiest way out.
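To see how much of a PPV is owed to the urn rather than to the test, here is a minimal sketch of the usual screening arithmetic. It treats α and power as if they were the false positive and true positive rates of a diagnostic test, which is exactly the move criticized above; the function name, the default values and the prevalence figures are assumptions for illustration only.

```python
def ppv(prevalence_false_nulls, alpha=0.05, power=0.8):
    """Positive predictive value under the diagnostic screening model.

    `prevalence_false_nulls` is the assumed fraction of nulls in the
    urn that are false (i.e. 'real effects').
    """
    true_positives = power * prevalence_false_nulls
    false_positives = alpha * (1 - prevalence_false_nulls)
    return true_positives / (true_positives + false_positives)

# A well-powered test in a field where few nulls are false ...
print(round(ppv(0.1, power=0.8), 3))   # 0.64
# ... versus a poorly powered test in a high "crud factor" field.
print(round(ppv(0.9, power=0.2), 3))   # 0.973
```

Under these assumed numbers the weaker test earns the higher PPV, purely because of the composition of the urn. Nothing in the figure reflects how well the particular H* was probed.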

Safe Science. We hear it recommended that research focus on questions and hypotheses with high prior prevalence. Of course we’d never know the % of true nulls (many say all nulls are false, although that too is false) and we could cleverly game the description to have suitably high or low prior prevalence. Just think of how many ways you could describe those urns of nulls to get a desired PPV, especially on continuous parameters. Then there’s the heralding of safe science:

Large-scale evidence should be targeted for research questions where the pre-study probability is already considerably high, so that a significant research finding will lead to a post-test probability that would be considered quite definitive (Ioannidis, 2005, p. 0700).

In every case of a major advance or frontier science that I can think of, there had been little success in adequately explaining some effect. It took Prusiner 10 years of failed experiments to finally transmit the prion for mad cow to chimps. People still didn’t believe there could be infection without nucleic acid (some still adhere to the “protein only” hypothesis). He finally won a Nobel prize, but he would have had a lot less torture if he’d just gone along to get along, keeping to the central dogma of biology rather than following the results that upended it. However, it’s the researcher who has worked with a given problem, building on results and subjecting them to scrutiny, who understands the phenomenon well enough to not just replicate, but alter the entire process in new ways (e.g., prions are now being linked to Alzheimer’s). Researchers who have churned out and published isolated significant results, and focused on “research questions where the pre-study probability is already considerably high” might meet the quota on PPV, but still won’t have the understanding to even show they “know how to conduct an experiment which will rarely fail to give us a statistically significant result”–which was Fisher’s requirement before inferring a genuine phenomenon (Fisher 1947).

Error probabilities are not posterior probabilities. Certainly, there is much more to statistical analysis than P-values but they should be left alone rather than being deformed in some way to become second class Bayesian posterior probabilities. (Senn)

References & Related articles

Berger, J. O.  (2003). “Could Fisher, Jeffreys and Neyman have Agreed on Testing?” Statistical Science 18: 1-12.

Berger, J. O. and Sellke, T.  (1987). “Testing a point null hypothesis: The irreconcilability of p values and evidence,” (with discussion). J. Amer. Statist. Assoc. 82: 112–139.

Casella, G. and Berger, R. (1987). “Reconciling Bayesian and Frequentist Evidence in the One-sided Testing Problem,” (with discussion). J. Amer. Statist. Assoc. 82: 106–111, 123–139.

Colquhoun, D. (2014) “An investigation of the false discovery rate and the misinterpretation of p-values.” Royal Society Open Science, 2014. 1(3): p. 140216.

Fisher, R. A., (1956). Statistical Methods and Scientific Inference, Edinburgh: Oliver and Boyd.

Fisher, R.A. (1947), Design of Experiments.

Ioannidis, J. (2005). “Why Most Published Research Findings Are False”.

Jeffreys, (1939). Theory of Probability, Oxford: Oxford University Press.

Mayo, D. (1997a). “Duhem’s Problem, The Bayesian Way, and Error Statistics, or ‘What’s Belief got To Do With It?’” and “Response to Howson and Laudan,” Philosophy of Science 64(1): 222-244 and 323-333. NOTE: This issue only comes up in the “Response”, but it made most sense to include both here.

Mayo, D. (1997b) “Error Statistics and Learning from Error: Making a Virtue of Necessity,” in L. Darden (ed.) Supplemental Issue PSA 1996: Symposia Papers, Philosophy of Science 64: S195-S212.

Mayo, D. (2003). Comment on J. O. Berger’s “Could Fisher, Jeffreys and Neyman Have Agreed on Testing?” Statistical Science 18: 19-24.

Mayo, D. (2004). “An Error-Statistical Philosophy of Evidence,” in M. Taper and S. Lele (eds.) The Nature of Scientific Evidence: Statistical, Philosophical and Empirical Considerations. Chicago: University of Chicago Press: 79-118.

Mayo, D. (2005). “Philosophy of Statistics”. (has typos)

Mayo, D.G. and Cox, D. R. (2006). “Frequentist Statistics as a Theory of Inductive Inference,” Optimality: The Second Erich L. Lehmann Symposium (ed. J. Rojo), Lecture Notes-Monograph series, Institute of Mathematical Statistics (IMS), Vol. 49: 77-97.

Mayo, D. and Kruse, M. (2001). “Principles of Inference and Their Consequences,” in D. Corfield and J. Williamson (eds.) Foundations of Bayesianism. Dordrecht: Kluwer Academic Publishers: 381-403.

Mayo, D. and Spanos, A. (2011). “Error Statistics” in Philosophy of Statistics, Handbook of Philosophy of Science Volume 7 Philosophy of Statistics, (General editors: Dov M. Gabbay, Paul Thagard and John Woods; Volume eds. Prasanta S. Bandyopadhyay and Malcolm R. Forster.) Elsevier: 1-46.

Pratt, J. (1987). “Testing a point null hypothesis: The irreconcilability of p values and evidence: Comment.” J. Amer. Statist. Assoc. 82: 123-125.

Spanos, A. (2013). “Who Should Be Afraid of the Jeffreys-Lindley Paradox”.

Taleb, N. (2013). “Beware the Big Errors of Big Data”. Wired.



Categories: Bayesian/frequentist, Comedy, significance tests, Statistics | Leave a comment

Larry Laudan: “‘Not Guilty’: The Misleading Verdict” (Guest Post)

Prof. Larry Laudan
Lecturer in Law and Philosophy
University of Texas at Austin

“‘Not Guilty’: The Misleading Verdict and How It Fails to Serve either Society or the Innocent Defendant”

Most legal systems in the developed world share in common a two-tier verdict system: ‘guilty’ and ‘not guilty’.  Typically, the standard for a judgment of guilty is set very high while the standard for a not-guilty verdict (if we can call it that) is quite low. That means any level of apparent guilt less than about 90% confidence that the defendant committed the crime leads to an acquittal (90% being the usual gloss on proof beyond a reasonable doubt, although few legal systems venture a definition of BARD that precise). According to conventional wisdom, the major reason for setting the standard as high as we do is the desire, even the moral necessity, to shield the innocent from false conviction.

There is, however, an egregious drawback to a legal system so structured. To wit, a verdict of ‘not guilty’ tells us nothing whatever about whether it is reasonable to believe that the defendant did not commit the crime. It offers no grounds whatever for inferring that an acquitted defendant probably did not commit the crime. That fact alone should make most of us leery about someone acquitted of a felony. Will a bank happily hire someone recently acquitted of a forgery charge? Are the neighbors going to rest easy when one of them was charged with, and then acquitted of, child molestation?

While the current proof standard provides ample protection to the innocent from being falsely convicted (the false positive rate is ~3%), it does little or nothing to protect the reputation of the truly innocent defendants. If properly understood, it fails to send any message to the general public about how they should regard and treat an acquitted defendant because it fails to tell the public whether it’s likely or unlikely that he committed the crime.

It would not be difficult to remedy this centuries-old mess, both for the public and for the acquitted defendant, by employing a three-verdict system, as the Scots have been doing for some time. Their verdicts are: guilty, guilt not proven and innocent. In a Scottish trial, if guilt is proven beyond a reasonable doubt, the defendant is found guilty; if the jury thinks it more likely than not that defendant committed no crime, his verdict is ‘innocent’; if the jury suspects that defendant did the crime but is not sure beyond all reasonable doubt, the verdict is ‘guilt not proven’. Both the guilt-not-proven verdict and the innocence verdict are officially acquittals in the sense that those receiving it serve no jail time. (This gives a whole new meaning to the well-known phrase ‘going scot-free’.)

The Scottish verdict pattern serves the interests of both the innocent defendant and the general society.  The Scots know that if a defendant received an innocent verdict, then the jury believed it likely that he committed no crime and that he should be treated accordingly. That is both important information for the citizenry and a substantial protection for the innocent defendant himself, since the innocent verdict is in effect an exoneration, entailing the likelihood of his innocence.

On the other hand, the Scottish guilt-not-proven verdict sends out the important message to citizens that no other Anglo-Saxon legal system can; to wit, that the acquitted defendant (with a guilt-not-proven verdict) should be treated warily by society since he was probably the culprit, even though he was neither convicted nor punished.

Interestingly, there is ample use of the intermediary verdict. The Scottish government reports in a study of criminal prosecutions in 2005 and 2006 that it turned out that 71% of those defendants tried for homicide and acquitted received a ‘guilt-not-proven’ verdict. That means that about 7-in-10 acquittals for murder in Scotland involved defendants regarded by the jurors as having probably committed the crime.[1] In a more recent analysis, the Scottish government reported that in rape cases some 35% of acquittals resulted in ‘guilt not proven’ verdicts. In murder cases, the probably guilty verdict rate was 27% of all acquittals.[2]

It’s worth adding that Scotland’s intermediary verdict gives us access to information on an error whose frequency no other Western legal system can easily compute: to wit, the frequency of false acquittals. It tells us that, at least in Scotland, the rate of false acquittals hovers between 1-in-4 and 1-in-3.  That is crucial information for those of us who believe that a legitimate system of inquiry—whether a legal one or otherwise— must get a handle on its error rates. Without knowing that, we cannot possibly figure out whether the distribution of erroneous verdicts is in line with our beliefs about the respective costs of the two errors.

Scottish criminal law has one other interesting feature worthy of mention in this context: a verdict there requires only a majority vote from the 15 citizens who serve as the jury.  By contrast, most American states require a unanimous vote among 12 jurors, contributing to a situation in which mistrials are both expensive and common.  They are expensive because they usually lead to re-trials, which are rarely cheap. In some jurisdictions in the US, 20% or more of trials end in a hung jury.[3] Not surprisingly, hung juries in Scottish cases are much less frequent.


[1] See also the Scottish Government Statistical Bulletin, Crim/2006/Part 11.

[2] See Scottish Government, Criminal Proceedings in Scotland, 2013-14, Table 2B.

[3] A study by Paula Agor et al. (Are Hung Juries a Problem? National Center for State Courts and National Institute of Justice, 2002) found that in Washington, D.C. Superior Courts some 22.4% of jury trials ended in a hung jury; in Los Angeles Superior Courts, the hung jury rate was 19.5%.



Previous guest posts:

  • Larry Laudan (July 20, 2013): Guest Post: Larry Laudan. Why Presuming Innocence is Not a Bayesian Prior
  • Larry Laudan (July 3, 2015): “When the ‘Not-Guilty’ Falsely Pass for Innocent”, the Frequency of False Acquittals (guest post)

Among Laudan’s books:

1977. Progress and its Problems: Towards a Theory of Scientific Growth
1981. Science and Hypothesis
1984. Science and Values
1990. Science and Relativism: Dialogues on the Philosophy of Science
1996. Beyond Positivism and Relativism
2006. Truth, Error and Criminal Law: An Essay in Legal Epistemology


Categories: L. Laudan, PhilStat Law | Tags: | 22 Comments

History of statistics sleuths out there? “Ideas came into my head as I sat on a gate overlooking an experimental blackcurrant plot”–No wait, it was apples, probably

E.S.Pearson on a Gate, Mayo sketch

Here you see my scruffy sketch of Egon drawn 20 years ago for the frontispiece of my book, “Error and the Growth of Experimental Knowledge” (EGEK 1996). The caption is

“I might recall how certain early ideas came into my head as I sat on a gate overlooking an experimental blackcurrant plot…” –E.S. Pearson, “Statistical Concepts in Their Relation to Reality”.

He is responding to Fisher to “dispel the picture of the Russian technological bogey”. [i]

So, as I said in my last post, just to make a short story long, I’ve recently been scouring around the history and statistical philosophies of Neyman, Pearson and Fisher for purposes of a book soon to be completed, and I discovered a funny little error about this quote. Only maybe 3 or 4 people alive would care, but maybe someone out there knows the real truth.

OK, so I’d been rereading Constance Reid’s great biography of Neyman, and in one place she interviews Egon about the sources of inspiration for their work. Here’s what Egon tells her:

One day at the beginning of April 1926, down ‘in the middle of small samples,’ wandering among apple plots at East Malling, where a cousin was director of the fruit station, he was ‘suddenly smitten,’ as he later expressed it, with a ‘doubt’ about the justification for using Student’s ratio (the t-statistic) to test a normal mean (Quotes are from Pearson in Reid, p. 60).

Soon after, Egon contacted Neyman and their joint work began.

I assumed the meanderings over apple plots was a different time, and that Egon just had a habit of conducting his deepest statistical thinking while overlooking fruit. Yet it shared certain unique features with the revelation when gazing over at the blackcurrant plot, as in my picture, if only in the date and the great importance he accorded it (although I never recall his saying he was “smitten” before). I didn’t think more about it. Then, late one night last week I grabbed a peculiar book off my shelf that contains a smattering of writings by Pearson for a work he never completed: “Student: A Statistical Biography of William Sealy Gosset” (1990, edited and augmented by Plackett and Barnard, Clarendon, Oxford). The very first thing I open up to is a note by Egon Pearson:

I cannot recall now what was the form of the doubt which struck me at East Malling, but it would naturally have arisen when discussing there the interpretation of results derived from small experimental plots. I seem to visualize myself sitting alone on a gate thinking over the basis of ‘small sample’ theory and ‘mathematical statistics Mark II’ [i.e., Fisher]. When nearly thirty years later (JRSS B, 17, 204 1955), I wrote refuting the suggestion of R.A.F. [Fisher] that the Neyman-Pearson approach to testing statistical hypotheses had arisen in industrial acceptance procedures, the plot which the gate was overlooking had through the passage of time become a blackcurrant one! (Pearson 1990 p. 81)

What? This is weird. So that must mean it wasn’t blackcurrants after all, and Egon is mistaken in the caption under the picture I drew 20 years ago. Yet, he doesn’t say here that it was apples either, only that it had “become a blackcurrant” plot in a later retelling. So, not blackcurrant, so, it must have been apple, putting this clue together with what he told Constance Reid. So it appears I can no longer quote that “blackcurrant” statement, at least not without explaining that, in all likelihood, it was really apples. If any statistical sleuths out there can corroborate that it was apples, or know the correct fruit that Egon was gazing at (and, come to think of it, why couldn’t it have been both?) I’d be very grateful to know [ii]. I will happily cite you. I know this is a bit of minutiae–don’t say I didn’t warn you [iii]. By contrast, the Pearson paper replying to Fisher is extremely important (and very short). It’s entitled “Statistical Concepts in Their Relation to Reality”. You can read the paper HERE.


[i] Some of the previous lines, and 6 following words:

There was no question of a difference in point of view having ‘originated’ when Neyman ‘re-interpreted’ Fisher’s early work on tests of significance ‘in terms of that technological and commercial apparatus which is known as an acceptance procedure’. …
Indeed, to dispel the picture of the Russian technological bogey, I might recall how certain early ideas came into my head as I sat on a gate overlooking an experimental blackcurrant plot at the East Malling Research Station! –E.S. Pearson, “Statistical Concepts in Their Relation to Reality”

[ii] As Erich Lehmann put it in his EGEK review, Pearson is “the hero of Mayo’s story” because I found in his work, if only in brief discussions, hints, and examples, the key elements for an “inferential” or “evidential” interpretation of Neyman. So I should get the inspirational fruit correct.

[iii] I’m not saying I know the answer isn’t in the book on Student, or someplace else.

Fisher, R.A. 1955, “Statistical Methods and Scientific Induction”.

Pearson, E.S. 1955, “Statistical Concepts in Their Relation to Reality”.

Reid, C. 1998, Neyman–From Life. Springer.

Categories: E.S. Pearson, phil/history of stat, Statistics | 1 Comment

Performance or Probativeness? E.S. Pearson’s Statistical Philosophy

E.S. Pearson (11 Aug, 1895-12 June, 1980)

This is a belated birthday post for E.S. Pearson (11 August 1895-12 June, 1980). It’s basically a post from 2012 which concerns an issue of interpretation (long-run performance vs probativeness) that’s badly confused these days. I’ve recently been scouring around the history and statistical philosophies of Neyman, Pearson and Fisher for purposes of a book soon to be completed. I recently discovered a little anecdote that calls for a correction in something I’ve been saying for years. While it’s little more than a point of trivia, it’s in relation to Pearson’s (1955) response to Fisher (1955)–the last entry in this post.  I’ll wait until tomorrow or the next day to share it, to give you a chance to read the background. 


Are methods based on error probabilities of use mainly to supply procedures which will not err too frequently in some long run? (performance). Or is it the other way round: that the control of long run error properties is of crucial importance for probing the causes of the data at hand? (probativeness). I say no to the former and yes to the latter. This, I think, was also the view of Egon Sharpe (E.S.) Pearson.

Cases of Type A and Type B

“How far then, can one go in giving precision to a philosophy of statistical inference?” (Pearson 1947, 172)

Continue reading

Categories: 4 years ago!, highly probable vs highly probed, phil/history of stat, Statistics | Tags: | Leave a comment

If you think it’s a scandal to be without statistical falsification, you will need statistical tests (ii)


1. PhilSci and StatSci. I’m always glad to come across statistical practitioners who wax philosophical, particularly when Karl Popper is cited. Best of all is when they get the philosophy somewhere close to correct. So, I came across an article by Burnham and Anderson (2014) in Ecology:

“While the exact definition of the so-called ‘scientific method’ might be controversial, nearly everyone agrees that the concept of ‘falsifiability’ is a central tenant [sic] of empirical science (Popper 1959). It is critical to understand that historical statistical approaches (i.e., P values) leave no way to ‘test’ the alternative hypothesis. The alternative hypothesis is never tested, hence cannot be rejected or falsified!… Surely this fact alone makes the use of significance tests and P values bogus. Lacking a valid methodology to reject/falsify the alternative science hypotheses seems almost a scandal.” (Burnham and Anderson p. 629)

Well I am (almost) scandalized by this easily falsifiable allegation! I can’t think of a single “alternative”, whether in a “pure” Fisherian or a Neyman-Pearson hypothesis test (whether explicit or implicit) that’s not falsifiable; nor do the authors provide any. I grant that understanding testability and falsifiability is far more complex than the kind of popularized accounts we hear about; granted as well, theirs is just a short paper.[1] But then why make bold declarations on the topic of the “scientific method and statistical science,” on falsifiability and testability? Continue reading

Categories: P-values, Severity, statistical tests, Statistics, StatSci meets PhilSci | 20 Comments

S. Senn: “Painful dichotomies” (Guest Post)


Stephen Senn
Head of Competence Center for Methodology and Statistics (CCMS)
Luxembourg Institute of Health
Twitter @stephensenn

Painful dichotomies

The tweet read “Featured review: Only 10% people with tension-type headaches get a benefit from paracetamol” and immediately I thought, ‘how would they know?’ and almost as quickly decided, ‘of course they don’t know, they just think they know’. Sure enough, on following up the link to the Cochrane Review in the tweet it turned out that, yet again, the deadly mix of dichotomies and numbers needed to treat had infected the brains of researchers to the extent that they imagined that they had identified personal response. (See Responder Despondency for a previous post on this subject.)

The bare facts they established are the following:

The International Headache Society recommends the outcome of being pain free two hours after taking a medicine. The outcome of being pain free or having only mild pain at two hours was reported by 59 in 100 people taking paracetamol 1000 mg, and in 49 out of 100 people taking placebo.

and the false conclusion they immediately asserted is the following

This means that only 10 in 100 or 10% of people benefited because of paracetamol 1000 mg.

To understand the fallacy, look at the accompanying graph. Continue reading
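To see how the dichotomy can mislead, here is a minimal sketch under a purely hypothetical model (it is not necessarily the model behind the graph): suppose headache duration under placebo is exponentially distributed, and paracetamol shortens every patient's headache by the same proportional factor. The function names, the exponential assumption and the two-hour cutoff arithmetic are illustrative choices; only the 49% and 59% figures come from the review.

```python
import math

# Assumption: durations under placebo are exponential, and the drug multiplies
# EVERY patient's duration by the same factor `scale` (so everyone benefits).

def common_scale_factor(p_placebo, p_treated):
    """Scale factor that moves the fraction pain-free at the cutoff
    from p_placebo to p_treated when every duration is multiplied by it."""
    return math.log(1 - p_placebo) / math.log(1 - p_treated)

def pain_free_by(cutoff_hours, rate, scale=1.0):
    """P(scaled exponential duration <= cutoff)."""
    return 1 - math.exp(-rate * cutoff_hours / scale)

CUTOFF = 2.0                      # hours, as in the reported outcome
P_PLACEBO, P_TREATED = 0.49, 0.59  # figures quoted from the review

rate = -math.log(1 - P_PLACEBO) / CUTOFF          # placebo rate matching 49%
scale = common_scale_factor(P_PLACEBO, P_TREATED)  # same factor for everyone

print(f"every headache shortened to {scale:.1%} of its placebo duration")
print(f"pain free at 2h, placebo:     {pain_free_by(CUTOFF, rate):.2f}")
print(f"pain free at 2h, paracetamol: {pain_free_by(CUTOFF, rate, scale):.2f}")
```

Under these assumptions all 100 in 100 patients benefit (each headache is roughly a quarter shorter), yet the dichotomized outcome still shows 59 versus 49 in 100. The observed difference of 10 points is therefore compatible with far more than 10% of patients benefiting.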

Categories: junk science, PhilStat/Med, Statistics, Stephen Senn | 27 Comments


3 years ago...

MONTHLY MEMORY LANE: 3 years ago: July 2013. I mark in red three posts that seem most apt for general background on key issues in this blog, excluding those reblogged recently [1], and in green up to 3 others I’d recommend [2]. Posts that are part of a “unit” or a group of “U-Phils” (you [readers] philosophize) count as one.

July 2013

  • (7/3) Phil/Stat/Law: 50 Shades of gray between error and fraud
  • (7/6) Bad news bears: ‘Bayesian bear’ rejoinder–reblog mashup
  • (7/10) PhilStatLaw: Reference Manual on Scientific Evidence (3d ed) on Statistical Significance (Schachtman)
  • (7/11) Is Particle Physics Bad Science? (memory lane)
  • (7/13) Professor of Philosophy Resigns over Sexual Misconduct (rejected post)
  • (7/14) Stephen Senn: Indefinite irrelevance
  • (7/17) Phil/Stat/Law: What Bayesian prior should a jury have? (Schachtman)
  • (7/19) Msc Kvetch: A question on the Martin-Zimmerman case we do not hear
  • (7/20) Guest Post: Larry Laudan. Why Presuming Innocence is Not a Bayesian Prior
  • (7/23) Background Knowledge: Not to Quantify, But To Avoid Being Misled By, Subjective Beliefs
  • (7/26) New Version: On the Birnbaum argument for the SLP: Slides for JSM talk

[1] Monthly memory lanes began at the blog’s 3-year anniversary in Sept, 2014.

[2] New Rule, July 30, 2016.







Categories: 3-year memory lane, Error Statistics, Statistics | Leave a comment

For Popper’s Birthday: A little Popper self-test & reading from Conjectures and Refutations


28 July 1902 – 17 September 1994

Today is Karl Popper’s birthday. I’m linking to a reading from his Conjectures and Refutations[i] along with an undergraduate item I came across: Popper Self-Test Questions. It includes multiple choice questions, quotes to ponder, and thumbnail definitions at the end[ii].

Blog Readers who wish to send me their answers will have their papers graded (at least try the multiple choice; if you’re unsure, do the reading). [Use the comments or e-mail.]

[i] Popper reading from Conjectures and Refutations
[ii] I might note the “No-Pain philosophy” (3 part) Popper posts from this blog: parts 1, 2, and 3.



Popper, K. (1962). Conjectures and Refutations: The Growth of Scientific Knowledge. New York: Basic Books.



Categories: Popper | 11 Comments

“Nonsignificance Plus High Power Does Not Imply Support for the Null Over the Alternative.”


Seeing the world through overly rosy glasses

Taboos about power nearly always stem from misuse of power analysis. Sander Greenland (2012) has a paper called “Nonsignificance Plus High Power Does Not Imply Support for the Null Over the Alternative.”  I’m not saying Greenland errs; the error would be made by anyone who interprets power analysis in a manner giving rise to Greenland’s objection. So what’s (ordinary) power analysis?

(I) Listen to Jacob Cohen (1988) introduce Power Analysis

“PROVING THE NULL HYPOTHESIS. Research reports in the literature are frequently flawed by conclusions that state or imply that the null hypothesis is true. For example, following the finding that the difference between two sample means is not statistically significant, instead of properly concluding from this failure to reject the null hypothesis that the data do not warrant the conclusion that the population means differ, the writer concludes, at least implicitly, that there is no difference. The latter conclusion is always strictly invalid, and is functionally invalid as well unless power is high. The high frequency of occurrence of this invalid interpretation can be laid squarely at the doorstep of the general neglect of attention to statistical power in the training of behavioral scientists. Continue reading
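For readers who want to see what an ordinary power calculation involves, here is a minimal sketch assuming a one-sided z-test of H0: μ ≤ μ0 with known σ. The function name, the sample size and the alternatives are illustrative assumptions, not taken from Cohen or Greenland.

```python
from statistics import NormalDist

def power_one_sided_z(mu1, mu0=0.0, sigma=1.0, n=100, alpha=0.025):
    """Probability that the test rejects H0: mu <= mu0 when the true mean is mu1."""
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha)   # rejection cutoff on the z scale
    shift = (mu1 - mu0) * n ** 0.5 / sigma    # mean of the z statistic under mu1
    return 1 - std_normal.cdf(z_alpha - shift)

# Power is always computed against a particular alternative mu1.
for mu1 in (0.0, 0.1, 0.2, 0.3, 0.4):
    print(f"mu1 = {mu1:.1f}: power = {power_one_sided_z(mu1):.3f}")
```

Power is a property of the test at a stated alternative: at μ1 = μ0 it equals α, and it rises as the discrepancy μ1 − μ0 grows. Any claim about "high power" therefore has to say high against which alternative.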

Categories: Cohen, Greenland, power, Statistics | 46 Comments

Philosophy and History of Science Announcements



2016 UK-EU Foundations of Physics Conference

Start Date: 16 July 2016

Categories: Announcement | Leave a comment


3 years ago...

MONTHLY MEMORY LANE: 3 years ago: June 2013. I mark in red three posts that seem most apt for general background on key issues in this blog, excluding those reblogged recently [1]. Posts that are part of a “unit” or a group of “U-Phils” (you [readers] philosophize) count as one. Here I grouped 6/5 and 6/6.

June 2013

  • (6/1) Winner of May Palindrome Contest
  • (6/1) Some statistical dirty laundry* (recently reblogged)
  • (6/5) Do CIs Avoid Fallacies of Tests? Reforming the Reformers (6/5 and 6/6 are paired as one)
  • (6/6) PhilStock: Topsy-Turvy Game
  • (6/6) Anything Tests Can do, CIs do Better; CIs Do Anything Better than Tests?* (reforming the reformers cont.)
  • (6/8) Richard Gill: “Integrity or fraud… or just questionable research practices?”* (recently reblogged)
  • (6/11) Mayo: comment on the repressed memory research [How a conceptual criticism, requiring no statistics, might go.]
  • (6/14) P-values can’t be trusted except when used to argue that p-values can’t be trusted!
  • (6/19) PhilStock: The Great Taper Caper
  • (6/19) Stanley Young: better p-values through randomization in microarrays
  • (6/22) What do these share in common: m&ms, limbo stick, ovulation, Dale Carnegie? Sat night potpourri* (recently reblogged)
  • (6/26) Why I am not a “dualist” in the sense of Sander Greenland
  • (6/29) Palindrome “contest” contest
  • (6/30) Blog Contents: mid-year

[1] Monthly memory lanes began at the blog’s 3-year anniversary in Sept, 2014.






Categories: 3-year memory lane, Error Statistics, Statistics | Leave a comment

A. Birnbaum: Statistical Methods in Scientific Inference (May 27, 1923 – July 1, 1976)

Allan Birnbaum: May 27, 1923- July 1, 1976

Allan Birnbaum died 40 years ago today. He lived to be only 53 [i]. From the perspective of philosophy of statistics and philosophy of science, Birnbaum is best known for his work on likelihood, the Likelihood Principle [ii], and for his attempts to blend concepts of likelihood with error probability ideas to arrive at what he termed “concepts of statistical evidence”. Failing to find adequate concepts of statistical evidence, Birnbaum called for joining the work of “interested statisticians, scientific workers and philosophers and historians of science”–an idea I have heartily endorsed. While known for a result that the (strong) Likelihood Principle followed from sufficiency and conditionality principles (a result that Jimmy Savage deemed one of the greatest breakthroughs in statistics), a few years after publishing it, he turned away from it, perhaps discovering gaps in his argument. A post linking to a 2014 Statistical Science issue discussing Birnbaum’s result is here. Reference [5] links to the Synthese 1977 volume dedicated to his memory. The editors describe it as their way of “paying homage to Professor Birnbaum’s penetrating and stimulating work on the foundations of statistics”. Ample weekend reading! Continue reading

Categories: Birnbaum, Likelihood Principle, phil/history of stat, Statistics | Tags: | 62 Comments

Richard Gill: “Integrity or fraud… or just questionable research practices?” (Is Gill too easy on them?)

Professor Gill

Professor Richard Gill
Statistics Group
Mathematical Institute
Leiden University

It was statistician Richard Gill who first told me about Diederik Stapel (see an earlier post on Diederik). We were at a workshop on Error in the Sciences at Leiden in 2011. I was very lucky to have Gill assigned as my commentator/presenter—he was excellent! As I was explaining some data problems to him, he suddenly said, “Some people don’t bother to collect data at all!” That’s when I learned about Stapel.

Committees often turn to Gill when someone’s work is up for scrutiny of bad statistics or fraud, or anything in between. Do you think he’s being too easy on researchers when he says, about a given case:

“data has been obtained by some combination of the usual ‘questionable research practices’ [QRPs] which are prevalent in the field in question. Everyone does it this way, in fact, if you don’t, you’d never get anything published. …People are not deliberately cheating: they honestly believe in their theories and believe the data is supporting them.”

Isn’t that the danger in relying on deeply felt background beliefs?  Have our attitudes changed (toward QRPs) over the past 3 years (harsher or less harsh)? Here’s a talk of his I blogged 3 years ago (followed by a letter he allowed me to post). I reflect on the pseudoscientific nature of the ‘recovered memories’ program in one of the Geraerts et al. papers in a later post. Continue reading

Categories: 3-year memory lane, junk science, Statistical fraudbusting, Statistics | 4 Comments

What do these share in common: m&ms, limbo stick, ovulation, Dale Carnegie? Are we lowering the bar?


For entertainment only

In a post 3 years ago (“What do these share in common: m&m’s, limbo stick, ovulation, Dale Carnegie? Sat night potpourri”), I expressed doubts about expending serious effort to debunk the statistical credentials of studies that most readers without any statistical training would regard as “for entertainment only,” dubious, or pseudoscientific quackery. It needn’t even be that the claim is implausible; what’s implausible is that it has been well probed in the experiment at hand. Given the attention being paid to such examples by some leading statisticians, and scores of replication researchers over the past 3 years–attention that has been mostly worthwhile–maybe the bar has been lowered. What do you think? Anyway, this is what I blogged 3 years ago. (Oh, I decided to put in a home-made cartoon!) Continue reading

Categories: junk science, replication research, Statistics | 2 Comments

Some statistical dirty laundry: have the stains become permanent?



Right after our session at the SPSP meeting last Friday, I chaired a symposium on replication that included Brian Earp–an active player in replication research in psychology (Replication and Evidence: A tenuous relationship p. 80). One of the first things he said, according to my notes, is that gambits such as cherry picking, p-hacking, hunting for significance, selective reporting, and other QRPs had been taught as acceptable and had become standard practice in psychology, without any special need to adjust p-values or alert the reader to their spuriousness [i]. (He will correct me if I’m wrong[2].) It shocked me to hear it, even though it shouldn’t have, given what I’ve learned about statistical practice in social science. It was the Report on Stapel that really pulled back the curtain on this attitude toward QRPs in social psychology–as discussed in this blogpost 3 years ago. (If you haven’t read Section 5 of the report on flawed science, you should.) Many of us assumed that QRPs, even if still committed, were at least recognized to be bad statistical practices since the time of Morrison and Henkel’s (1970) Significance Test Controversy. A question now is this: have all the confessions of dirty laundry, the fraudbusting of prominent researchers, the pledges to straighten up and fly right, the years of replication research, done anything to remove the stains? I leave the question open for now. Here’s my “statistical dirty laundry” post from 2013: Continue reading

Categories: junk science, reproducibility, spurious p values, Statistics | 4 Comments

Mayo & Parker “Using PhilStat to Make Progress in the Replication Crisis in Psych” SPSP Slides

Here are the slides from our talk at the Society for Philosophy of Science in Practice (SPSP) conference. I covered the first 27, Parker the rest. The abstract is here:

Categories: P-values, reforming the reformers, replication research, Statistics, StatSci meets PhilSci | Leave a comment

“Using PhilStat to Make Progress in the Replication Crisis in Psych” at Society for PhilSci in Practice (SPSP)

I’m giving a joint presentation with Caitlin Parker[1] on Friday (June 17) at the meeting of the Society for Philosophy of Science in Practice (SPSP): “Using Philosophy of Statistics to Make Progress in the Replication Crisis in Psychology” (Rowan University, Glassboro, N.J.)[2] The Society grew out of a felt need to break out of the sterile straightjacket wherein philosophy of science occurs divorced from practice. The topic of the relevance of PhilSci and PhilStat to Sci has often come up on this blog, so people might be interested in the SPSP mission statement below our abstract.

Using Philosophy of Statistics to Make Progress in the Replication Crisis in Psychology

Deborah Mayo, Virginia Tech, Department of Philosophy, United States
Caitlin Parker, Virginia Tech, Department of Philosophy, United States

Continue reading

Categories: Announcement, replication research, reproducibility | 8 Comments

“So you banned p-values, how’s that working out for you?” D. Lakens exposes the consequences of a puzzling “ban” on statistical inference



I came across an excellent post on a blog kept by Daniel Lakens: “So you banned p-values, how’s that working out for you?” He refers to the journal that recently banned significance tests, confidence intervals, and a vague assortment of other statistical methods, on the grounds that all such statistical inference tools are “invalid” since they don’t provide posterior probabilities of some sort (see my post). The editors’ charge of “invalidity” could only hold water if these error statistical methods purport to provide posteriors based on priors, which is false. The entire methodology is based on methods in which probabilities arise to qualify the method’s capabilities to detect and avoid erroneous interpretations of data [0]. The logic is of the falsification variety found throughout science. Lakens, an experimental psychologist, does a great job delineating some of the untoward consequences of their inferential ban. I insert some remarks in black. Continue reading

Categories: frequentist/Bayesian, Honorary Mention, P-values, reforming the reformers, science communication, Statistics | 45 Comments

Winner of May 2016 Palindrome Contest: Curtis Williams



Winner of the May 2016 Palindrome contest

Curtis Williams: Inventor, entrepreneur, and professional actor

The winning palindrome (a dialog): 


“Disable preplan?… I, Mon Ami?”


“Calm…Sit, fella.”

“No! I tag. I vandalized Dezi, lad.”

“Navigational leftism lacks aim…a nominal perp: Elba’s id.”

The requirement: A palindrome using “navigate” or “navigation” (and Elba, of course).

Book choice: Error and Inference (D. Mayo & A. Spanos, Cambridge University Press, 2010)



Bio: Curtis Mark Williams is the co-founder of WavHello and the inventor of Bellybuds. He also counts himself an occasional professional actor, having performed on Broadway [1] and in several television shows and films.
He currently resides in Los Angeles with his lovely wife, two daughters, his dog, Newton, and his framed New Yorker Caption Contest winning cartoon. [He has been a finalist twice and the one he won is contest #329, by Joe Dator (inspired by his theatrical background. :)] Continue reading

Categories: Palindrome | Leave a comment

“A sense of security regarding the future of statistical science…” Anon review of Error and Inference



Aris Spanos, my colleague (in economics) and co-author, came across this anonymous review of our Error and Inference (2010) [E & I]. Interestingly, the reviewer remarks that “The book gives a sense of security regarding the future of statistical science and its importance in many walks of life.” We’re not sure what the reviewer means–but it’s appreciated regardless. This post was from yesterday’s 3-year memory lane and was first posted here.

2010 American Statistical Association and the American Society for Quality

TECHNOMETRICS, AUGUST 2010, VOL. 52, NO. 3, Book Reviews, 52:3, pp. 362-370.

Error and Inference: Recent Exchanges on Experimental Reasoning, Reliability, and the Objectivity and Rationality of Science, edited by Deborah G. MAYO and Aris SPANOS, New York: Cambridge University Press, 2010, ISBN 978-0-521-88008-4, xvii+419 pp., $60.00.

This edited volume contemplates the interests of both scientists and philosophers regarding gathering reliable information about the problem/question at hand in the presence of error, uncertainty, and with limited data information.

The volume makes a significant contribution in bridging the gap between scientific practice and the philosophy of science. The main contribution of this volume pertains to issues of error and inference, and showcases intriguing discussions on statistical testing and providing alternative strategy to Bayesian inference. In words, it provides cumulative information towards the philosophical and methodological issues of scientific inquiry at large.

The target audience of this volume is quite general and open to a broad readership. With some reasonable knowledge of probability theory and statistical science, one can get the maximum benefit from most of the chapters of the volume. The volume contains original and fascinating articles by eminent scholars (nine, including the editors) who range from names in statistical science to philosophy, including D. R. Cox, a name well known to statisticians. Continue reading

Categories: 3-year memory lane, Review of Error and Inference, Statistics | 3 Comments
