2023 Captain’s Biblio

Selected Bibliography: a selection of articles with links.

Amrhein, Greenland, & McShane (2019). “Comment: Retire Statistical Significance”, Nature 567: 305-308.

Achinstein (2010). Mill’s Sins or Mayo’s Errors? (E&I: 170-188).

Bacchus, Kyburg, & Thalos (1990). Against Conditionalization, Synthese (85): 475-506.

Barnett (1999). Comparative Statistical Inference (Chapter 6: Bayesian Inference), John Wiley & Sons.

Begley & Ellis (2012). Raise standards for preclinical cancer research. Nature 483: 531-533.

Bem (2011). Feeling the Future: Experimental Evidence for Anomalous Retroactive Influences on Cognition and Affect, Journal of Personality and Social Psychology 100(3), 407–425.

Bem, Utts & Johnson (2011). Must Psychologists Change the Way They Analyze Their Data? Journal of Personality and Social Psychology, 101(4), 716–719.

Benjamin, Berger, Johannesson et al. (2017). Redefine Statistical Significance, Nature Human Behaviour 2, 6-10.

Benjamini & Hochberg (1995). Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing, Journal of The Royal Statistical Society.

Benjamini, Y., De Veaux, R., Efron, B., et al. (2021). The ASA President’s task force statement on statistical significance and replicability. The Annals of Applied Statistics. (Online June 20, 2021.)

Berger, J. (2003). Could Fisher, Jeffreys and Neyman have Agreed on Testing? Stat Sci 18: 1-12.

Berger, J. (2006). The Case for Objective Bayesian Analysis and Rejoinder, Bayesian Analysis 1(3), 385–402; 457–64.

Berger, J. & Sellke (1987). Testing a Point Null Hypothesis: The Irreconcilability of P Values and Evidence (with Discussion and Rejoinder), Journal of the American Statistical Association 82(397), 112–22; 135–9.

Bernardo, J. (1997). Non-informative Priors Do Not Exist: A Dialogue with Jose M. Bernardo, Journal of Statistical Planning and Inference 65(1), 159-77.

Bernardo, J. (2010). Integrated Objective Bayesian Estimation and Hypothesis Testing (with discussion), Bayesian Statistics 9, 1–68.

Bickel, D. R. (2021). Null hypothesis significance testing defended and calibrated by Bayesian model checking. The American Statistician, 75(3), 249–255.

Brown, E. N. and Kass, R. E. (2009). What is Statistics? (with discussion), The American Statistician 63, 105–23.

Birnbaum, A. (1970). Statistical Methods in Scientific Inference (letter to the Editor), Nature 225(5237): 1033.

For extensive Birnbaum references see this post on Error Statistics Philosophy Blog

Casella & R. Berger (1987a). Reconciling Bayesian and Frequentist Evidence in the One-sided Testing Problem, Journal of the American Statistical Association 82(397), 106–11.

Casella, G. and Berger, R. (1987b). Comment on Testing Precise Hypotheses by J. O. Berger and M. Delampady, Statistical Science 2(3), 344–7.

Colquhoun, D. (2014). ‘An Investigation of the False Discovery Rate and the Misinterpretation of P-values’, Royal Society Open Science 1(3), (16 pages).

Cousins, R. (2017). ‘The Jeffreys-Lindley Paradox and Discovery Criteria in High Energy Physics’, Synthese 194, 395–432.

Cox, D. (1977). The Role of Significance Tests (with Discussion), Scandinavian Journal of Statistics 4, 49–70.

Cox, D. (2006a). Principles of Statistical Inference, CUP.

Cox & Mayo (2010). Objectivity and Conditionality in Frequentist Inference (E&I: 276-304).

Cox & Mayo (2011). A Statistical Scientist Meets a Philosopher of Science: A Conversation between Sir David Cox and Deborah Mayo (as recorded, June 2011). Rationality, Markets and Morals (RMM), 2, Special Topic: Statistical Science and Philosophy of Science, 103-114.

Crupi & Tentori (2010). Irrelevant Conjunction: Statement and Solution of a New Paradox, Phil Sci, 77, 1–13.

Earman, J. and Glymour, C. (1980). ‘Relativity and Eclipses: The British Eclipse Expeditions of 1919 and Their Predecessors’, Historical Studies in the Physical Sciences 11(1), 49–85.

Edwards, Lindman & Savage (E, L, & S) (1963). Bayesian Statistical Inference for Psychological Research, Psychological Review 70(3), 193–242.

Efron, B. (1986). Why Isn’t Everyone a Bayesian?, The American Statistician 40(1), 1–5. (3)

Efron, B. (1998). R. A. Fisher in the 21st Century and Rejoinder, Statistical Science 13(3), 95–114; 121–2.

Efron (2013). A 250-Year Argument: Belief, Behavior, and the Bootstrap, Bulletin of the American Mathematical Society 50(1), 126–46.

Feynman (1974). Cargo Cult Science (Graduation Speech)

Fisher (1930). Inverse Probability, Mathematical Proceedings of the Cambridge Philosophical Society 26(4), 528–35.

Fisher (1934). Two New Properties of Mathematical Likelihood, Proceedings of the Royal Society of London Series A 144 (852), 285–307.

Fisher (1935a)/(1947). The Design of Experiments, 1st ed., Edinburgh: Oliver and Boyd. Reprinted in Fisher 1990. (Lady Tasting Tea)

Fisher, R. A. (1936), Uncertain Inference, Proceedings of the American Academy of Arts and Sciences 71, 248–58.

Fisher (1955), Statistical Methods and Scientific Induction, J R Stat Soc (B) 17: 69-78.

Gardiner, G., & Zaharatos, B. (2022). The safe, the sensitive, and the severely tested: a unified account. Synthese: An International Journal for Epistemology, Methodology and Philosophy of Science 200(5).

Gelman (2011). Induction and Deduction in Bayesian Data Analysis, RMM 2, 67-78.

Gelman & Carlin (2014). Beyond Power Calculations: Assessing Type S (Sign) and Type M (Magnitude) Errors, Perspectives on Psychological Science 9, 641–51.

Gelman & Hennig (2017). Beyond Subjective and Objective in Statistics, Journal of the Royal Statistical Society: Series A 180(4), 967–1033.

Gelman & Loken (2014). The Statistical Crisis in Science, American Scientist 2, 460-5.

Gelman & Shalizi (2013). Philosophy and the Practice of Bayesian Statistics (with discussion), Brit. J. Math. Stat. Psy. 66(1): 5-64.

Gigerenzer, G. (2002). Adaptive thinking: Rationality in the real world. Oxford: Oxford University Press. (Chapter V, see p. 279.)

Gigerenzer and Marewski (2017). Surrogate Science: The Idol of a Universal Method for Scientific Inference, Journal of Management 41(2), 421-40.

Gonick & Smith (1992). The Cartoon Guide to Statistics, HarperPerennial. Online from VT libraries at this link: https://virginiatech.on.worldcat.org/oclc/1148230277. (Note: there may be other online links to it from the VT library.)

Goodman (1993). P-values, Hypothesis Tests, and Likelihood-Implications for Epidemiology of a Neglected Historical Debate, American Journal of Epidemiology 137(5), 485–96.

Goodman (1999). Toward Evidence-Based Medical Statistics. 2: The Bayes Factor, Annals of Internal Medicine, 130(12), 1005–13.

Greenland (2012). Nonsignificance Plus High Power Does Not Imply Support for the Null Over the Alternative, Annals of Epidemiology 22, 364–8.

Greenland & Poole (2013). Living with P Values: Resurrecting a Bayesian Perspective on Frequentist Statistics and Rejoinder: Living with Statistics in Observational Research, Epidemiology 24(1), 62–8; 73–8. Gelman comment.

Greenland, Senn, Rothman et al. (2016). Statistical Tests, P values, Confidence Intervals, and Power: A Guide to Misinterpretations, European Journal of Epidemiology 31(4), 337–50.

Hacking (1972). Review: Likelihood, The British Journal for the Philosophy of Science 23(2), 132–7.

Hacking (1980). The Theory of Probable Inference: Neyman, Peirce and Braithwaite, in Mellor, D. (ed.), Science, Belief and Behavior: Essays in Honour of R. B. Braithwaite, Cambridge: Cambridge University Press, pp. 141–60.

Hacking, I. (2001). An introduction to probability and inductive logic. Cambridge University Press.

Haig, B. (2016). ‘Tests of Statistical Significance Made Sound’, Educational and Psychological Measurement 77(3) 489–506.

Haig, B. (2020). What can psychology’s statistics reformers learn from the error-statistical perspective? Methods in Psychology, 2, (November 2020). 100020–100020.

Hand, D. J. (2021). Trustworthiness of statistical inference. Journal of the Royal Statistical Society: Series A (Statistics in Society), (20211012).

Hardwicke, T. E., & Ioannidis, J. P. A. (2019). Petitions in scientific argumentation: dissecting the request to retire statistical significance. European Journal of Clinical Investigation 49(10). https://doi.org/10.1111/eci.13162

Hawthorne & Fitelson (2004). Re-Solving Irrelevant Conjunction with Probabilistic Independence, Phil Sci 71: 505–514.

Howson, C. (1997). Error probabilities in error. Philosophy of Science, 64(S), S185-194.

Howson (1997). A Logic of Induction, Philosophy of Science, 64(2): 268-290.

Howson (2017). Putting on the Garber Style? Better Not, Philosophy of Science 84(4), 659-76.

Howson & Urbach (1993) Chapter 15, (2006) Chapter 5. Scientific Reasoning: The Bayesian Approach, 2nd & 3rd (Chapter 5) eds. Open Court.

Hubbard & Bayarri (2003). Confusion Over Measures of Evidence versus Errors and Rejoinder, The American Statistician 57(3), 171-8; 181-2.

Ioannidis (2005). Why most published research findings are false. PLoS Med 2(8): e124.

Ioannidis J. (2019). The importance of predefined rules and prespecified statistical analyses: Do not abandon significance. Journal of the American Medical Association (JAMA), 321, 2067-2068.

Kadane (2016). Beyond Hypothesis Testing, Entropy 18(5), article 199, 1–5.

Kadane, J. B. (2020). Principles of uncertainty (Second ed.). Chapman & Hall/CRC. Chapter 12 (see p. 251).

Kass (2011). Statistical Inference: The Big Picture (with discussion and rejoinder), Statistical Science 26(1), 1–20.

Kass & Wasserman (1996). The Selection of Prior Distributions by Formal Rules, Journal of the American Statistical Association 91, 1343–70.

Lakens et al. (2018). Justify Your Alpha, Nature Human Behaviour 2, 168-71.

Lambert & Black (2012). Learning From Our GWAS Mistakes: From Experimental Design to Scientific Method, Biostatistics 13(2), 195–203.

Lehmann (1993a). ‘The Bertrand-Borel Debate and the Origins of the Neyman-Pearson Theory’, in Ghosh, J., Mitra, S., Parthasarathy, K. and Prak Ma Rao, L. (eds.), Statistics and Probability: A Raghu Raj Bahadur Festschrift, New Delhi: Wiley Eastern, 371–80. Reprinted in Lehmann 2012, pp. 965–74.

Levelt Committee, Noort Committee, Drenth Committee (2012). Flawed Science: The Fraudulent Research Practices of Social Psychologist Diederik Stapel, Stapel Investigation: Joint Tilburg/Groningen/Amsterdam investigation of the publications by Mr. Stapel (commissielevelt.nl/).

Lindley (2000). The Philosophy of Statistics (with Discussion), Journal of the Royal Statistical Society: Series D 49(3), 293–337.

Mayo general bibliography

Mayo (1996). Error and the Growth of Experimental Knowledge, U of Chicago P.

Mayo, D. (1997). “Duhem’s Problem, The Bayesian Way, and Error Statistics, or ‘What’s Belief got To Do With It?’” Philosophy of Science 64(2): 222-244.

Mayo (1997). Response to Howson and Laudan, Phil Sci 64(2): 323-333.

Mayo (2003). Commentary on J. Berger’s Fisher Address, Stat Sci 18: 19-24.

Mayo (2004). An Error-Statistical Philosophy of Evidence in The Nature of Scientific Evidence: Statistical, Philosophical & Empirical Considerations. (Taper & Lele eds.), UCP: 79-118.

Mayo (2005). Philosophy of Statistics in Sarkar & Pfeifer (eds.) Philosophy of Science: An Encyclopedia, Routledge: 802-815.

Mayo (2010b). An Error in the Argument from Conditionality and Sufficiency to the Likelihood Principle (E&I: 305-14).

Mayo (2010c). Sins of the Epistemic Probabilist: Exchanges with Achinstein (E&I: 189-201).

Mayo (2010e). Learning from Error: The Theoretical Significance of Experimental Knowledge, The Modern Schoolman. Guest editor, Kent Staley. 87(3/4), (March/ May 2010). Experimental and Theoretical Knowledge, The Ninth Henle Conference in the History of Philosophy, 191–217.

Mayo (2013) Presented Version: On the Birnbaum Argument for the Strong Likelihood Principle. In JSM Proceedings, Section on Bayesian Statistical Science. Alexandria, VA: American Statistical Association, 440-453.

Mayo (2013). Comments on A. Gelman and C. Shalizi, Brit. J. Math. Stat. Psy. 66(1): 57-64.

Mayo (2014). On the Birnbaum Argument for the Strong Likelihood Principle, (with discussion) Statistical Science 29(2) pp. 227-239, 261-266.

Mayo (2016). Don’t Throw Out the Error Control Baby with the Bad Statistics Bathwater: A Commentary on Wasserstein, R. L. and Lazar, N. A. 2016, The ASA’s Statement on p-Values: Context, Process, and Purpose, The American Statistician 70(2) (supplemental materials).

Mayo, D. G. (2019). “P-value Thresholds: Forfeit at Your Peril,” European Journal of Clinical Investigation 49(10). EJCI-2019-0447

Mayo, D. G. (2020). “P-Values on Trial: Selective Reporting of (Best Practice Guides Against) Selective Reporting” Harvard Data Science Review 2.1.

Mayo, D. G. (2021). The Statistics Wars and Intellectual Conflicts of Interest (Editorial). Conservation Biology. (December 2021 online).

Mayo & Cox (2006). Frequentist Statistics as a Theory of Inductive Inference, Optimality: The Second Erich L. Lehmann Symposium (ed. J. Rojo), Lecture Notes-Monograph series, Institute of Mathematical Statistics (IMS), Vol. 49: 77-97.

Mayo & Hand (2022). Statistical significance and its critics: practicing damaging science, or damaging scientific practice? Synthese 200, 220.

Mayo & Spanos (2004). Methodology in Practice: Statistical Misspecification Testing, Phil Sci 71: 1007-1025.

Mayo & Spanos (2006). Severe Testing as a Basic Concept in a Neyman-Pearson Philosophy of Induction, Brit. J. Phil. Sci., 57: 323-357.

Mayo & Spanos (eds) (2010). Error and Inference: Recent Exchanges on Experimental Reasoning, Reliability and the Objectivity and Rationality of Science, CUP. (E&I)

Mayo & Spanos (2011). Error Statistics in Philosophy of Statistics, Handbook of Philosophy of Science 7, Philosophy of Statistics, (Gabbay, Thagard & Woods (eds); Bandyopadhyay & Forster (Vol eds.)) Elsevier: 1-46.

Mayo, Spanos & Staley (Guest eds.) (2011-2012): Rationality, Markets and Morals: Studies at the Intersection of Philosophy and Economics, (Albert, Kliemt, Lahno eds.). Special Topic: Statistical Science and Philosophy of Science: Where Do (Should) They Meet in 2011 and Beyond? (Complete collection of papers).

McShane, B. B., Gal, D., Gelman, A., Robert, C., & Tackett, J. L. (2019). Abandon statistical significance. American Statistician, 73, 235–245.

Meehl (1978). Theoretical Risks and Tabular Asterisks: Sir Karl, Sir Ronald, and the Slow Progress of Soft Psychology, Journal of Consulting and Clinical Psychology 46: 806-834.

NEJM (New England Journal of Medicine) Author Guidelines (2019): Retrieved from: https://www.nejm.org/authorcenter/new-manuscripts on July 19, 2019.

Neyman, J. (1934). ‘On the Two Different Aspects of the Representative Method: The Method of Stratified Sampling and the Method of Purposive Selection’, The Journal of the Royal Statistical Society 97(4), 558–625. Reprinted 1967 Early Statistical Papers of J. Neyman, 98–141.

Neyman (1956). Note on an Article by Sir Ronald Fisher, J R Stat Soc (B) 18: 288-294.

Neyman (1957b). The Use of the Concept of Power in Agricultural Experimentation, Journal of the Indian Society of Agricultural Statistics IX(1), 9–17.

Neyman (1962). Two Breakthroughs in the Theory of Statistical Decision Making, Revue De l’Institut International De Statistique / Review of the International Statistical Institute, 30(1),11–27.

Neyman (1976). Tests of Statistical Hypotheses and Their Use in Studies of Natural Phenomena, Communications in Statistics: Theory and Methods 5(8), 737–51.

Neyman (1977). Frequentist Probability and Frequentist Statistics, Synthese 36(1), 97–131.

Neyman & Pearson (1928). On the Use and Interpretation of Certain Test Criteria for Purposes of Statistical Inference: Part I, Biometrika 20A(1/2), 175–240. Reprinted in Joint Statistical Papers, 1–66.

Neyman & Pearson (1933) On the Problem of the Most Efficient Tests of Statistical Hypotheses, Philosophical Transactions of the Royal Society of London Series A 231, 289–337. Reprinted in Joint Statistical Papers, 140–85.

Pearson (1947). The Choice of Statistical Tests Illustrated on the Interpretation of Data Classed in a 2 x 2 Table, Biometrika 34 (1/2), 139–167. Reprinted 1966 in The Selected Papers of E. S. Pearson, pp. 169–200.

Pearson (1955). Statistical Concepts in Their Relation to Reality, J R Stat Soc (B) 17: 204-207.

Pearson & Chandra Sekar (1936). ‘The Efficiency of Statistical Tools and a Criterion for the Rejection of Outlying Observations’, Biometrika 28 (3/4), 308–20. Reprinted 1966 in The Selected Papers of E. S. Pearson, pp. 118–30.

Pearson & Neyman (1930). ‘On the Problem of Two Samples,’ Bulletin of the Academy of Polish Sciences, 73–96. Reprinted 1966 in Joint Statistical Papers, 99–115.

Peng, Dominici & Zeger (2006). Reproducible Epidemiologic Research, American Journal of Epidemiology 163 (9), 783-789.

Popper (1962). Conjectures and Refutations: The Growth of Scientific Knowledge. Basic Books. (Chapter 1)

Ratliff & Oishi (2013). Gender Differences in Implicit Self-Esteem Following a Romantic Partner’s Success or Failure, Journal of Personality and Social Psychology 105(4), 688–702.

Reid & Cox (2015). ‘On Some Principles of Statistical Inference’, International Statistical Review 83(2), 293–308.

Ryan, E. G., Brock, K., Gates, S., & Slade, D. (2020). Do we need to adjust for interim analyses in a Bayesian adaptive trial design? BMC Medical Research Methodology, 20(1). https://doi.org/10.1186/s12874-020-01042-7

Savage Forum (1962). The Foundations of Statistical Inference: A Discussion, London: Methuen.

Senn (2001b). ‘Two Cheers for P-values?’ Journal of Epidemiology and Biostatistics 6(2), 193–204.

Senn (2002). ‘A Comment on Replication, P-values and Evidence, S. N. Goodman, Statistics in Medicine 1992; 11:875-879’, Statistics in Medicine 21(16), 2437–44.

Senn (2011). You May Believe You Are a Bayesian But You Are Probably Wrong. RMM 2.

Simmons, Nelson & Simonsohn (2011). False-Positive Psychology: Undisclosed Flexibility in Data Collection and Analysis Allows Presenting Anything as Significant, Psych. Sci., 22(11): 1359-1366.

Simmons, Nelson & Simonsohn (2012). ‘A 21 word solution’, Dialogue: The Official Newsletter of the Society for Personality and Social Psychology 26(2), 4–7.

Singh, Xie & Strawderman (2007). Confidence Distribution (CD): Distribution Estimator of a Parameter, IMS Lecture Notes–Monograph Series, Volume 54, Complex Datasets and Inverse Problems: Tomography, Networks and Beyond, pp. 132–50.

Spanos (2000). Revisiting Data Mining: “Hunting” with or without a License, Journal of Economic Methodology 7(2), 231–64.

Spanos (2007). Curve Fitting, the Reliability of Inductive Inference, and the Error-Statistical Approach, Philosophy of Science 74(5), 1046-1066.

Spanos (2008a). Review of S. T. Ziliak and D. N. McCloskey’s The Cult of Statistical Significance, Erasmus Journal for Philosophy and Economics 1(1), 154–64.

Spanos (2010a). Akaike-type Criteria and the Reliability of Inference: Model Selection Versus Statistical Model Specification, Journal of Econometrics 158(2), 204–20.

Spanos, A. (2011b). ‘Foundational Issues in Statistical Modeling: Statistical Model Specification and Validation’, Rationality, Markets and Morals (RMM) 2, 146–78.

Spanos (2012). Revisiting the Berger Location Model: Fallacious Confidence Interval or a Rigged Example? Statistical Methodology, 9, 555–61. (6) (7)

Spanos (2013). Who Should Be Afraid of the Jeffreys-Lindley Paradox? Phil Sci 80(1): 73-93.

Spiegelhalter (2012). Explaining 5 Sigma for the Higgs: How Well Did They Do?, Blogpost on Understandinguncertainty.org (8/7/2012).

Staley (2017). Pragmatic Warrant for Frequentist Statistical Practice: The Case of High Energy Physics, Synthese 194(2), 355–76.

Stapel (2014). Faking Science: A True Story of Academic Fraud. Translated by Brown, N. from the original 2012 Dutch Ontsporing (Derailment).

Wagenmakers (2007). A Practical Solution to the Pervasive Problems of P values, Psychonomic Bulletin & Review 14(5), 779–804.

Wagenmakers & Grünwald (2006). A Bayesian Perspective on Hypothesis Testing: A Comment on Killeen (2005), Psychological Science 17(7), 641–2.

Wagenmakers, Wetzels, Borsboom & van der Maas (2011). Why Psychologists Must Change the Way They Analyze Their Data: The Case of Psi: Comment on Bem (2011), Journal of Personality and Social Psychology 100, 426–32.

Wasserstein & Lazar (2016). The ASA’s Statement on P-values: Context, Process and Purpose, (and supplemental materials), The American Statistician 70(2), 129–33.

Wasserstein, Schirm & Lazar (2019). Moving to a World Beyond “p < 0.05”, The American Statistician, 73 sup 1, 1-19.

Zabell (1992). R. A. Fisher and the Fiducial Argument, Statistical Science 7(3), 369–87.

These items are from the bibliography of Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (Mayo 2018, CUP). Newer items are being added as we proceed.
