January 28 Phil Stat Forum “How Can We Improve Replicability?” (Alexander Bird)

The fifth meeting of our Phil Stat Forum*:

The Statistics Wars
and Their Casualties

28 January, 2021

TIME: 15:00-16:45 (London); 10-11:45 a.m. (New York, EST)


“How can we improve replicability?”

Alexander Bird 

Phil Stat Forum Website: phil-stat-wars.com

Alexander Bird is President of the British Society for the Philosophy of Science, Bertrand Russell Professor in the Department of Philosophy, University of Cambridge, and Fellow and Director of Studies at St John’s College, Cambridge. Previously he was the Peter Sowerby Professor of Philosophy and Medicine in the Department of Philosophy, King’s College London; before that he held the chair in Philosophy at the University of Bristol, and before that was lecturer and then reader at the University of Edinburgh. His work is principally in those areas where philosophy of science overlaps with metaphysics and epistemology. He has a particular interest in the philosophy of medicine, especially methodological issues in causal and statistical inference.

ABSTRACT: It is my view that the unthinking application of null hypothesis significance testing is a leading cause of a high rate of replication failure in certain fields.  What can be done to address this, within the NHST framework?


Readings:

Bird, A. Understanding the Replication Crisis as a Base Rate Fallacy, The British Journal for the Philosophy of Science, axy051, (13 August 2018).

7 pages from D. Mayo: Statistical Inference as Severe Testing: How to get beyond the statistics wars (SIST), pp. 361-370:  Section 5.6 “Positive Predictive Value: Fine for Luggage”.

For information about the Phil Stat Wars forum and how to join, click on this link. 


Slides and Video Links: (to be posted when available)

Alexander Bird Presentation:

Alexander Bird Discussion:


Mayo’s Memos: Any info or events that arise that seem relevant to share with y’all before the meeting. Please check back closer to the meeting day.

*Meeting 13 of the general Phil Stat series, which began with the LSE Seminar PH500 on May 21.



2 thoughts on “January 28 Phil Stat Forum “How Can We Improve Replicability?” (Alexander Bird)”

  1. Huw Llewelyn

    Thank you for your fascinating talk, which I listened to when the recording appeared a few days ago. Unfortunately I was unable to participate in the on-line discussion. I apologise for my comment being so lengthy, but it is made with sincerity, great care and in a spirit of constructiveness, from the viewpoint of a general (internal) physician as opposed to that of an epidemiologist using screening analogies, which is the current approach. I would be grateful for any views you might have on the possible contribution of my comments to understanding and improving replicability.

    In defence of the Harvard doctors and students, and its relevance to replication

    Firstly, I feel some responsibility to defend the physicians, students, and house officers who were apparently stopped in the corridors at Harvard Medical School in 1978 by Casscells, Schoenberger, and Graboys! The term ‘false positive rate’ is ambiguous at face value; perhaps it should have been defined when posing the question. A false positive rate could easily mean the frequency with which a disease is absent in those with a positive result, rather than the other way around. Any medical student or doctor will be completely familiar from daily experience with the fact that the likelihood of a symptom in someone with a disease is quite different from the probability of the disease conditional on having the symptom, and that the two are rarely the same. However, likelihoods and probabilities can be identical in the special case of random sampling when the prior probabilities of all possibilities are uniform. I would like to suggest another medical analogy to illustrate these differences and perhaps throw light on the replication crisis, one based on a clinical diagnostic reasoning model rather than the screening model that uses sensitivity and specificity.

    False positives etc

    The explanation for a false positive rate is usually as follows: a positive result means that, according to the test, the person has the disease (implying that the test result is one of its many sufficient criteria). The false positive rate is the proportion of positive results among people who do not have the disease. This suggests that despite one sufficient criterion confirming the presence of the disease, that positive result is also false. This logical contradiction is confusing; in Stephen Senn’s terminology, it is a classical “saying of Confuseus”! As Stephen pointed out during the discussion, it is also a fallacy that the specificity is constant across populations with different prevalences of disease. I explain in the Oxford Handbook of Clinical Diagnosis that, on the contrary, the specificity tends to be much higher in a population where the disease is rare, so that the positive predictive value may be the same in that population as in other populations in which the disease is common [see link: Things that affect ‘differential’ and ‘overall’ likelihood ratios – Oxford Medicine].
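    A minimal sketch of the relationship being discussed, with made-up prevalence, sensitivity and specificity values (none of them from the comment or the talk): positive predictive value (PPV) via Bayes’ rule, showing that with a fixed specificity the PPV falls as prevalence falls, whereas a higher specificity in the low-prevalence population can keep the PPV roughly constant.

    # Hypothetical illustration: PPV from prevalence, sensitivity and specificity.
    def ppv(prevalence, sensitivity, specificity):
        true_pos = prevalence * sensitivity
        false_pos = (1 - prevalence) * (1 - specificity)
        return true_pos / (true_pos + false_pos)

    # Fixed specificity: PPV drops sharply as the disease becomes rarer.
    for prev in (0.5, 0.1, 0.01):
        print(prev, round(ppv(prev, sensitivity=0.95, specificity=0.95), 3))

    # If specificity is much higher where the disease is rare (as argued above),
    # the PPV can remain comparable to that in a high-prevalence population.
    print(0.01, round(ppv(0.01, sensitivity=0.95, specificity=0.9995), 3))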

    Differential diagnoses

    In clinical medicine, we do not think in terms of sensitivity, false positive rates, false negative rates and specificity. A symptom or screening test result is interpreted by considering its list of differential diagnoses. These lists are formulated so that it is highly probable that a sufficient criterion will be found for one of the diagnoses in the list. The list is investigated by choosing one diagnosis and looking for one of its sufficient criteria, which is usually a combination of findings. Alternatively, the list can be reduced in size by looking for the absence of a necessary criterion of one or more of the diagnoses, so that they can be ‘ruled out’ (at least for the time being). The probabilities of the diagnoses in this list, and how they change with evidence, can be estimated using a theorem derived for this purpose from the extended version of Bayes’ rule, with or without an assumption of statistical dependence [see link: Reasoning by elimination – Oxford Medicine].
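    A rough sketch of this kind of reasoning (the diagnoses, priors and likelihoods below are invented, and this is only the simple form of Bayes’ rule, not the theorem referred to in the linked chapter): renormalise prior × likelihood over the differential list, then renormalise again after a diagnosis is ‘ruled out’.

    # Hypothetical differential diagnosis list with invented numbers.
    priors      = {"D1": 0.6, "D2": 0.3, "D3": 0.1}   # P(diagnosis) before the finding
    likelihoods = {"D1": 0.8, "D2": 0.4, "D3": 0.05}  # P(finding | diagnosis)

    # Update on the finding: renormalise prior * likelihood over the list.
    joint = {d: priors[d] * likelihoods[d] for d in priors}
    total = sum(joint.values())
    posteriors = {d: joint[d] / total for d in joint}
    print({d: round(p, 3) for d, p in posteriors.items()})

    # 'Rule out' D3 (a necessary criterion is absent) and renormalise the rest.
    posteriors.pop("D3")
    total = sum(posteriors.values())
    print({d: round(p / total, 3) for d, p in posteriors.items()})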

    Diagnostic criteria

    Diagnostic criteria do not attempt to specify when a ‘disease’ is present or absent. In clinical medicine, diagnostic criteria simply give guidance as to when it is justified to adopt a working diagnosis and thus act as if the disease were present. In other words, these are criteria for when acting on a hypothesis is justified. The ‘action’ may be to do further tests to assess the probability of benefit from a treatment, or to give the treatment and assess the outcome (e.g. in an emergency). Diagnostic criteria therefore justify acting on a hypothesis; they do not decide what label to attach for use as a disease marker during epidemics (e.g. the use of the RT-PCR test for Covid-19). It is important to note that once a diagnostic criterion has been observed, the probability with which it was previously anticipated becomes ‘past history’ and is no longer relevant. In other words, the presence of a diagnostic criterion means that it is certain, with a probability of one.

    Numerical test results, diagnostic thresholds and replication

    If diagnostic criteria are based on numerical results, then thresholds or cut-off points are used to decide whether the criterion for adopting a working diagnosis or hypothesis is satisfied. These are usually based on the upper or lower two standard deviations of samples taken from healthy people (which I do not think is the best way of doing so, by the way [see link: https://onlinelibrary.wiley.com/doi/abs/10.1111/jep.12981]). Because test results are not completely reproducible, the criterion may be satisfied by one test result but not by the next. If the final decision is based on the first test result alone, it therefore carries an element of arbitrariness. The situation can be improved slightly by using the average of a number of measurements. Also, the further the result lies beyond the threshold, the more probable it is that the result will be replicated. This brings us to the analogy between replication of diagnostic tests and replication in scientific hypothesis testing.
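    The last point can be illustrated with a small sketch under assumptions of my own (normally distributed measurement error with a known SD and a flat prior for the true value, so that the predictive SD for a single repeat is √2 times the measurement SD): the further the first result lies above the threshold, the higher the chance that a repeat measurement exceeds the threshold again.

    from statistics import NormalDist

    def p_repeat_above_threshold(first_result, threshold, sd):
        # Flat-prior predictive for one repeat measurement: centred on the
        # first result with SD inflated by sqrt(2).
        predictive = NormalDist(mu=first_result, sigma=sd * 2 ** 0.5)
        return 1 - predictive.cdf(threshold)

    # First result 0.5, 1 and 2 measurement SDs above the threshold (illustrative values).
    for excess in (0.5, 1.0, 2.0):
        print(excess, round(p_repeat_above_threshold(excess, threshold=0.0, sd=1.0), 3))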

    The three types of diagnostic test replication

    An initial test result that is above a threshold can be replicated in 3 ways: (1) By the subsequent single result being the same as the first result or even higher. (2) By the subsequent result simply being above the diagnostic threshold again when repeated once. (3) By the mean of an infinite number of repeat results (i.e. the ‘true’ result) being above the threshold. The probability of replication should be lowest for situation (1) and highest for situation (3). This is because (3) is analogous to a moving shooter aiming at a fixed target. Situation (2) is analogous to a moving shooter aiming at a moving target and situation (1) is analogous to a moving shooter aiming at the bulls-eye of a moving target!
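    Under the same illustrative normal model (flat prior, known measurement SD; my assumptions, not the author’s calculation), a quick Monte Carlo check of the claimed ordering, taking the first result to be one SD above the threshold:

    import random

    random.seed(0)
    threshold, sd, first_result, n = 0.0, 1.0, 1.0, 200_000

    c1 = c2 = c3 = 0
    for _ in range(n):
        true_value = random.gauss(first_result, sd)  # flat-prior posterior draw for the 'true' result
        repeat = random.gauss(true_value, sd)        # one repeat measurement
        c1 += repeat >= first_result                 # (1) repeat at least as high as the first result
        c2 += repeat > threshold                     # (2) repeat above the threshold
        c3 += true_value > threshold                 # (3) 'true' (infinite-repeat) mean above the threshold
    print(c1 / n, c2 / n, c3 / n)  # roughly 0.50 < 0.76 < 0.84 under these assumptions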

    Analogy with scientific replication

    The probability of replicating a study result on the same side of the null hypothesis as the original (i.e. no less extreme than the null value) when the study is repeated with an infinite number of observations (analogous to a moving shooter aiming at a fixed target) can be shown to be 1 − P (one sided). This is analogous to situation (3) above, so that when P = 0.025 one sided, the probability of replication is 0.975. If P one sided were 0.025, then the probability of getting a repeat result on the same side of the null hypothesis again with the same number of observations (analogous to situation (2) above) can be shown to be 0.917. However, the probability of getting a P value of 0.025 or lower again (analogous to situation (1) above) would be 0.284. The details of these calculations and an explanatory figure are shown in the following link: https://osf.io/j2wq8/.
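    The first two figures can be reproduced under a simple normal model with a flat prior (my reading of the setup; the figure for situation (1) depends on the further modelling choices set out in the OSF link and is not reproduced here):

    from statistics import NormalDist

    p_one_sided = 0.025
    z = NormalDist().inv_cdf(1 - p_one_sided)         # observed z of about 1.96

    # (3) 'true' (infinite-n) result on the original side of the null: 1 - P
    print(round(NormalDist().cdf(z), 3))              # 0.975

    # (2) one same-sized repeat on the original side of the null: the predictive
    #     variance doubles, so the effective z shrinks by sqrt(2)
    print(round(NormalDist().cdf(z / 2 ** 0.5), 3))   # 0.917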

    The apparent poor rate of study replication

    This suggests that the apparently poor rate of study replication of 28.3% with the same P value or lower is just as expected. It can be interpreted in a positive way by saying that the corresponding long-term probability of replication on the same side of the null hypothesis is 0.975! If it has been decided that the threshold for a positive result is that the proportion with an outcome on treatment is greater than the proportion with the same outcome on placebo, then once this criterion is observed the prediction is confirmed. The prior or preceding probability of making that observation is immaterial and does not affect the above three probabilities of replicating it.

    A prior probability distribution that would affect the probability of replication would be one conditional on a previous study result obtained in an identical way, so that it could be regarded as another random sample from the same population. Updating this prior with the likelihood distribution of the new study would have the same effect as combining them in a meta-analysis. For example, if the prior Binomial distribution were based on a real or imaginary random sample result of 9/12 and a subsequent likelihood distribution were based on another random sample that gave a result of 21/37, then the updated Binomial posterior probability distribution would be based on a combined random sample of (9+21)/(12+37) = 30/49 [see figure and preceding section 3.6 ‘Combining probability distributions using Bayes’ rule’ in the following link: https://journals.plos.org/plosone/article?id=10.1371%2Fjournal.pone.0212302#pone-0212302-g006 ].
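    A minimal sketch of the combining step with a conjugate Beta prior (the choice of a uniform Beta(1, 1) starting prior is mine; the counts are the ones quoted above):

    # Combine two binomial samples by successive Bayes updates of a Beta prior.
    a, b = 1, 1                              # uniform Beta(1, 1) starting prior (my assumption)
    for successes, trials in [(9, 12), (21, 37)]:
        a += successes
        b += trials - successes

    print(a - 1, (a - 1) + (b - 1))          # combined sample: 30 out of 49
    print(round(a / (a + b), 3))             # posterior mean of the proportion, about 0.608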

    Severe testing

    These calculations of the probability of replication are based on the assumption that the study observations can be regarded as a true random sample from a population made up of a larger identical study with an infinite number of observations, which therefore gives the true mean or proportion. This assumes that P-hacking, poor study design (e.g. no randomisation), lack of pre-registration, dishonesty, poor description of how study subjects were selected, etc. have been shown to be highly improbable by severe testing. This is analogous to showing all but one of a number of differential diagnoses to be highly improbable. If such study faults could not be shown to be highly improbable, then the random sampling model would be inapplicable.

    Bayes rule

    Random sampling has to take place from a single set (e.g. those in the community with a diagnosis D), its aim being to estimate the proportion of that set intersecting with another set (e.g. those with a finding F), giving an estimate of the proportion p(F|D). If the assumed prior probability of the diagnosis in the community were 0.3 {i.e. p(D) = 0.3} and the assumed prior probability of the finding were 0.2 {i.e. p(F) = 0.2}, and if from random sampling there was a 97.5% chance that the likelihood of the finding conditional on the diagnosis was at least 0.5 {i.e. p(F|D) ≥ 0.5}, then there would be a 97.5% chance that the probability of the diagnosis conditional on the finding would be at least 0.75 {i.e. p(D|F) ≥ p(D)·p(F|D)/p(F) = 0.3 × 0.5 / 0.2 = 0.75}. It is important to distinguish between the application of Bayes’ rule to make predictions such as diagnostic criteria (when prior probabilities are rarely uniform) and its use in random sampling (a special case in which the priors are uniform).
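    The arithmetic in that bound, written out (values as quoted above):

    # p(D|F) = p(D) * p(F|D) / p(F), with the values quoted above.
    p_D, p_F, p_F_given_D = 0.3, 0.2, 0.5
    p_D_given_F = p_D * p_F_given_D / p_F
    print(p_D_given_F)   # 0.75: a 97.5% chance that p(F|D) >= 0.5 translates into a
                         # 97.5% chance that p(D|F) >= 0.75, given p(D) and p(F)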

    I would be interested in your views about how the above analogies with clinical (as opposed to epidemiological) diagnostic thinking throw a light on the replication crisis.

  2. Huw Llewelyn

    Apologies! I have noticed that two link-short-cuts failed to work when copied from Word:

    To access ‘Things that affect “differential” and “overall” likelihood ratios’ – Oxford Medicine, use the link:
    https://oxfordmedicine.com/view/10.1093/med/9780199679867.001.0001/med-9780199679867-chapter-13#med-9780199679867-chapter-13-div1-10

    … and to access ‘Reasoning by elimination’ – Oxford Medicine, use the link:
    https://oxfordmedicine.com/view/10.1093/med/9780199679867.001.0001/med-9780199679867-chapter-13#med-9780199679867-chapter-13-div1-5

