
Many scientific studies involve multiple observations that are summarised by a proportion or an average as opposed to a single observation as in the above example. In order to interpret statistical calculations in terms of the concepts of replication used in medicine, one has to take a different path to the one familiar to statisticians. This can be done by recognising that the probability of an observation or something more extreme given a null hypothesis is equal to the probability of the same null hypothesis or something more extreme given the same observation. The rest follows as shown in my previous comments. David Colquhoun wishes to interpret statistical concepts from the viewpoint of a biological scientist. It is interesting to note that DC’s estimation of the ‘false discovery rate’, my ‘probability of non-replication’ and Stephen Senn’s ‘posterior probability’ are all different to the ‘P’ value. Our paths may converge after starting from different viewpoints.

So let’s deal with the remaining points of disagreement.

(4) I don’t think that anyone is saying that it’s possible to find a unique relationship between P values and false positive rates. Clearly there isn’t. But since the latter is what experimenters want to know, that immediately poses a problem for P values. Both Johnson and Berger suggest ways of dealing with the unknown prior, and although the methods are different, both come up with broadly similar results. And their conclusions are consistent with my simpler approach. The idea of uniformly most powerful Bayesian tests (UMPBTs) seems like a sensible approach to the problem. What’s wrong with it? I can’t do a “proper Bayesian analysis” because it can give any answer you want (see next point).
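To make point (4) concrete, here is a minimal sketch of why no single false positive rate corresponds to a given P value threshold. It assumes the simple screening-style argument in which every result with P at or below alpha is called a discovery. The function name, the power of 0.8 and the prior values are illustrative assumptions, not figures taken from this thread.

```python
def false_positive_rate(alpha, power, prior_real):
    """Fraction of 'significant' results that come from true null hypotheses.

    alpha      -- significance threshold (assumed 0.05 below)
    power      -- probability of detecting a real effect (assumed)
    prior_real -- assumed prior probability that the effect is real
    """
    false_positives = alpha * (1.0 - prior_real)  # true nulls that pass the threshold
    true_positives = power * prior_real           # real effects that pass the threshold
    return false_positives / (false_positives + true_positives)


# Same alpha and power, different (assumed) priors give very different answers.
for prior in (0.1, 0.5, 0.9):
    rate = false_positive_rate(alpha=0.05, power=0.8, prior_real=prior)
    print(f"prior = {prior:.1f} -> false positive rate = {rate:.3f}")
```

With alpha = 0.05 and power = 0.8, a prior of 0.1 gives a false positive rate of 0.36, far above 0.05. This is why the unknown prior cannot be avoided in any such calculation.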

(5) The real disagreement seems to lie in our attitude to priors. The point prior seems to me, as an experimenter, to be exactly what I want to test (admittedly that could be influenced by the fact that it’s what has been taught by generations of statisticians). As you have often pointed out, if you ask a subjective Bayesian how to analyse your experiment, you are likely to get as many answers as there are Bayesians. Insofar as that’s true, all they do is to bring statistics into disrepute (as people who can help you make sense of data). That’s why I have never found (subjective) Bayesian approaches to be of any practical value. And that’s why I find the approach via UMPBTs so interesting.

(6) “I disagree that 1/20 is an appropriate posterior probability to aim for.” That doesn’t seem to me to be a disagreement at all. I used 1/20 in my examples for obvious reasons, but you can choose any value you like.

You go on: “If you get your prior distribution right (and I have published some papers showing it is hard [1-3]) …”. While true, this is totally unhelpful in practice, because of the “If”.

The philosophical discussions are fascinating and ingenious, but if we agree that one job of statisticians is to help non-statisticians to make sensible decisions, it would be a real help if they didn’t spend so much time squabbling with each other!

We don’t want a “proper likelihood” or likelihood ratio alone because likelihood ratios fail to control error probabilities and fail to take account of the very gambits you list as leading to non-replicability. Given how many points are confused, following the rule on this blog, I’ll just direct you to published work:

http://www.phil.vt.edu/dmayo/personal_website/Error_Statistics_2011.pdf


My concern about the ‘P’ value is not that it is conditional upon a hypothetical fact (which is no different to a ‘sufficient’ criterion of a suspected diagnosis when describing a ‘likelihood’) but that it also partly predicts another hypothetical fact (i.e. also something more extreme than what has actually been observed). The other problem is that this definition of a ‘P’ value is not a proper likelihood and it cannot be used in a Bayesian calculation.


My understanding is that in Bayesian terms we have to imagine all the possible unknown outcomes and their unknown distribution if the study were repeated with all its faults using a large number of observations, giving us a continuous distribution of outcome results (one of which by the way could be chosen as a ‘null hypothesis’). We then have to estimate the small prior probabilities of getting each result in the distribution conditional on all the ‘prior’ facts about the study except the study result (which is why the other facts are ‘prior’). For each possible outcome result, we then have to calculate (by using the binomial distribution if the result is a proportion) the likelihood of the actual observed result e.g. 6/9 (i.e. NOT the actual observed result of 6/9 OR some ‘hypothetical observation’ that is more extreme but that was not observed e.g. 7/9 or 8/9 or 9/9). We then use this to calculate a posterior probability of each possible outcome. We then choose a range of values called the credibility interval and sum the small probabilities inside this interval to get the probability of replicating the study with a result inside the interval (NOT replicating it for the range outside the interval).
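The steps described above can be sketched numerically. This is only an illustration under stated assumptions: the continuous distribution is approximated by 101 equally spaced candidate proportions, the prior is taken as uniform purely for convenience, and the interval 0.4 to 0.9 is an arbitrary choice. All function names are hypothetical.

```python
from math import comb


def binom_pmf(k, n, p):
    """Likelihood of exactly k 'successes' in n observations when the true proportion is p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def grid_posterior(x, n, prior):
    """prior: list of (proportion, prior probability) pairs.

    Returns (proportion, posterior probability) pairs given the observed x out of n.
    Only the likelihood of the observed result is used, not of more extreme results.
    """
    weighted = [(theta, pr * binom_pmf(x, n, theta)) for theta, pr in prior]
    total = sum(w for _, w in weighted)
    return [(theta, w / total) for theta, w in weighted]


def prob_in_interval(posterior, low, high):
    """Sum the small posterior probabilities that fall inside [low, high]."""
    return sum(p for theta, p in posterior if low <= theta <= high)


# Assumed grid of possible long-run proportions with a uniform prior.
thetas = [i / 100 for i in range(101)]
uniform_prior = [(theta, 1 / len(thetas)) for theta in thetas]

posterior = grid_posterior(6, 9, uniform_prior)
print("P(0.4 <= proportion <= 0.9 | 6/9) =", round(prob_in_interval(posterior, 0.4, 0.9), 3))
```

Replacing `uniform_prior` with any other list of weights shows directly how much the answer depends on the prior, which is the contentious step.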


The disadvantage of all this is that it is dependent from the beginning on the non-transparent ‘prior belief’ or ‘prejudice’ of a person who may have a conflict of interest in what the study result should imply (e.g. because of commercial gain from the sales of a new drug or the headache of having to fund it from a limited budget and other competing resources). It is this contentious aspect and lack of transparency about how the ‘prior subjective probability’ was estimated that I find obscure and undefined. It is this that can lead to unresolvable disagreement about the prior probabilities of the distribution of study outcomes. I gather that if a Bayesian is ‘indifferent’ about the outcome prior probabilities, so that the distribution is uniform, then the probability of non-replication for a ‘credibility interval’ beyond Ho is always equal to ‘P’. Is there a general proof for this interesting result?
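The question can at least be probed numerically. The sketch below compares the one-sided ‘P’ value with the uniform-prior posterior probability that the true proportion lies at or below Ho, for the 6/9 example against an assumed Ho of 0.5. The function names and the integration step count are assumptions. Exact equality is known to hold for a normal mean with known variance and a flat prior; for a binomial proportion the two numbers are only close, so treat this as a check and not a proof.

```python
from math import comb


def one_sided_p(x, n, theta0):
    """P(X >= x) when X ~ Binomial(n, theta0): the observed result or something more extreme."""
    return sum(comb(n, k) * theta0**k * (1 - theta0) ** (n - k) for k in range(x, n + 1))


def posterior_below(x, n, theta0, steps=20_000):
    """P(true proportion <= theta0 | x of n) under a uniform prior.

    Uses midpoint-rule integration of the likelihood theta^x (1 - theta)^(n - x).
    """
    h = 1.0 / steps
    below = 0.0
    total = 0.0
    for i in range(steps):
        theta = (i + 0.5) * h
        weight = theta**x * (1 - theta) ** (n - x)
        total += weight
        if theta <= theta0:
            below += weight
    return below / total


p_value = one_sided_p(6, 9, 0.5)
posterior_tail = posterior_below(6, 9, 0.5)
print(f"one-sided P = {p_value:.4f}, uniform-prior posterior tail = {posterior_tail:.4f}")
```

For 6/9 against 0.5 this gives roughly 0.254 for ‘P’ and 0.172 for the posterior tail. With so few observations the discreteness of the binomial keeps the two apart; they move closer as the number of observations grows.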


The advantage of re-interpreting the ‘P’ value as the probability of non-replication given the numerical result of the study alone, using the same number of observations, is that it gives a simple, ‘objective’ preliminary indication of the probability of replication before other more arguable facts are taken into account. (‘Objective’ does not mean it is correct, but that the same result will be found by different operators who make the same assumptions.) If this preliminary probability of non-replication is not low (e.g. <0.05), then the attempt to show a high probability of replication (or low probability of non-replication) of the study will have failed at the first hurdle. If it clears the first hurdle, then the probability of replication given all the facts can be estimated subsequently by using Bayesian methods. This preliminary probability can also be incorporated into a more transparent type of reasoning by probabilistic elimination. Here, the different causes of non-replication (e.g. data-dredging, multiple sub-group analyses, vague details of patient selection, etc.) are hopefully shown one by one to be improbable by a more transparent reasoning process. Again, if one of these ‘causes’ cannot be shown to be improbable, the attempt to show a high probability of replication fails at that hurdle. This approach also depends on ‘subjective’ likelihoods or probabilities but I think that it would be an improvement on simply asserting a prior probability of replication given the other facts about the study, without addressing each fact in turn in a transparent way.

Regarding your last point that “viewing a study as drawn from a pool of studies is obscure”, perhaps I should try to state it in a clearer way and to give a concrete example. A particular study result of 6/9 would be an element of the set of all study results where p(X) = 6/9. If we chose 9 of these ‘p(X) = 6/9’ studies at random and observed the nature of the next (10th) outcome (either X or ‘Not X’) for each set of 6/9, the possible proportions with ‘X’ would be 0/9 or 1/9, … or 6/9 … or 9/9. If we were to choose such 9 studies many times over, the proportion of the time that we would get 0/9 or 1/9, … or 6/9 … or 9/9 would be determined by the binomial distribution calculated using p = 0.667 (or Laplace’s p = (1+6)/(2+9) or ‘p’ with a lower bound of 0.6 and an upper bound of 0.7 by definition of a ‘mathematical probability’). In contrast to this, a Bayesian probability is an attempt to guess an unknown distribution with an infinitely (and thus unknown) large number of data points. Now I accept that all these are ‘thought experiments’, being dependent on the nature of a mathematical model and various assumptions. What I propose as the set of sets with ‘p(X) = a/b’ is very easy for me to picture and not at all obscure. It also allows me to get a similar result to the ‘P’ value when I use it to calculate the probability of non-replication of a study given its numerical result alone, e.g. 6/9. Irrespective of all this I can also choose ‘subjectively’ to regard the ‘P’ value as being logically equivalent to the probability of non-replication of the ‘null hypothesis’ or something more extreme, conditional on the numerical result and number of observations in a study.
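The binomial distribution referred to above is easy to tabulate. The following sketch prints the probability of each possible proportion from 0/9 to 9/9 using p = 6/9 and Laplace’s p = (1+6)/(2+9), both taken from the paragraph above. The helper name `binom_pmf` is an assumption for illustration.

```python
from math import comb


def binom_pmf(k, n, p):
    """Probability of exactly k outcomes of type X in n selections with proportion p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


n = 9
p_observed = 6 / 9             # the observed proportion, 0.667
p_laplace = (1 + 6) / (2 + 9)  # Laplace's rule of succession

table_observed = [binom_pmf(k, n, p_observed) for k in range(n + 1)]
table_laplace = [binom_pmf(k, n, p_laplace) for k in range(n + 1)]

print("k/9   p = 6/9   p = 7/11")
for k in range(n + 1):
    print(f"{k}/9   {table_observed[k]:.4f}    {table_laplace[k]:.4f}")
```

Under both choices of p the single most probable result is 6/9 again, though most of the probability is spread over the neighbouring proportions.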


My objective is to try to make it easier for statisticians, doctors and scientists to communicate better by using different analogies / models that are easier to share.

Thanks for the email, I will contact you.

> I think that to regard a study result as a sample drawn from an unknowable pool of other study results is impossibly obscure and makes those not trained as statisticians feel that they can never understand the subject.

I certainly learned this from a group of about 20 epidemiologists in a webinar I gave. The left-hand machine in my animation, which represents what actually happened in nature with all but the observed sample being unknowable, was incomprehensible to all 20, even with repeated one-to-one email exchanges with the most interested. Those with statistical training (another webinar) seemed to have little difficulty here (some are a bit shocked about not needing any math).

It’s like this: for most people, known unknowns are dealt with by asking an expert who knows, and unknown unknowns are sillier than angels dancing on the head of a pin.

Keith O’Rourke

On the other hand, viewing a study as drawn from a pool of studies is obscure. That is why I regard transplanting the use of specificity and sensitivity from screening to scientific inference as problematic. This was discussed in the discussion of the ‘pathetic p-value’ post prior to this one.

My problem is that I find it impossible to come to a logical conclusion if it involves unknowable things, e.g. hypothetical populations, angels on the heads of pins, etc. In the same vein, I feel very uncomfortable about the definition of the ‘P’ value as “the probability of a ‘real’ observed result, OR something hypothetical that is more extreme, conditional upon another hypothetical observation (the null hypothesis)”. As Stephen Senn has pointed out elsewhere, this is meaningless.

I have never accepted this as a reasonable concept and I was not forced to accept it in order to pass any examinations as a young person so that it never became a fixed part of my system of ideas. Instead, I have regarded ‘P’ as being equal to the probability of non-replication of a study conditional ONLY on the ‘fact’ of the numerical result of the study. However, the non-replication (or replication) of the entire study depends also on other ‘conditional’ facts such as how well the work was done, other similar study results etc.

I think that to regard a study result as a sample drawn from an unknowable pool of other study results is impossibly obscure and makes those not trained as statisticians feel that they can never understand the subject. However, this is such an established part of statistics that I cannot see it being dropped, with the result that statisticians will be condemned to a purgatory of endless unresolvable arguments and being misunderstood for their troubles.

A closely related problem is applying ‘specificity’ and ‘false positive rates’ to diagnosis. One can define ‘sensitivity’ easily enough: the frequency of a ‘finding’ in those with a ‘diagnostic criterion’ (accepting of course that there may be ranges of severity of the finding and diagnosis). However, ‘those without the diagnostic criterion’ are very difficult to define and the ‘specificity’ will vary enormously depending on the populations in which they are measured and applied, being mostly a function of the prevalence of those with the diagnosis in the study population.

In medicine we reason by probabilistic elimination using ‘differential diagnoses’ that CAN be defined. I explain all this with a mathematical proof in Chapter 13 of the Oxford Handbook of Clinical Diagnosis (pp 615 to 642 – see ‘look inside’ on Amazon: http://www.amazon.co.uk/Handbook-Clinical-Diagnosis-Medical-Handbooks/dp/019967986X#reader_019967986X). I will send a personal (i.e. not to be copied again) PDF of Chapter 15 by email on request to me at hul2@aber.ac.uk.

Appears interesting and I did locate this http://www.clinsci.org/cs/057/cs0570477.htm, though even that is behind a paywall.

It does remind me of discussions with clinical researchers evaluating the QMR artificial intelligence diagnostic program in the 1980s.

In trying to communicate statistical logic to non-statisticians, including say Ian Hacking when I was in Toronto or Iain Chalmers in Oxford, it’s almost never actually successful.

Not sure how what you are talking about is related to what I am trying to animate here https://galtonbayesianmachine.shinyapps.io/GaltonBayesianMachine/

That is completely free and we can discuss it openly here (assuming OK with blog owner).

Keith O’Rourke


My concepts of statistics are coloured by my background as a physician. My concept of sampling is not an attempt to use a set of observations to estimate the parameters of a larger population (something that interested R. A. Fisher when using a small sampling frame to estimate plant proportions in a large field). Instead, my inclination is to assume that the next patient can be regarded as an element of the set of past patients. For example if there are 9 patients with central crushing chest pain and 6 have angina, the probability of one of these 9 patients in the set having angina is 0.67. However if a 10th patient arrives into the set with central chest pain, then the probability of that patient in the new set with an extra element having angina is either 6/10 or 7/10 (depending on what is eventually found). So the set allows me to measure the probability much in the same way as a ruler graduated in millimetres allows me to measure a length of ≥6/10cm to <7/10cm. The pair of values also allows me to know the original proportion in order to make other calculations. A larger set would provide a more precise probability, where the interval between the upper and lower value approaches zero.
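The ‘ruler’ arithmetic above can be written out in a few lines. This is a small sketch in which the function name is hypothetical and the larger sets (60/90 and 600/900) are invented examples that keep the same proportion, used only to show the interval narrowing.

```python
from fractions import Fraction


def bounds_after_next_case(cases_with_finding, cases_total):
    """Lower and upper proportion once one more patient joins the set.

    Lower bound: the new patient does not have the diagnosis.
    Upper bound: the new patient does have the diagnosis.
    """
    lower = Fraction(cases_with_finding, cases_total + 1)
    upper = Fraction(cases_with_finding + 1, cases_total + 1)
    return lower, upper


# 6/9 is the example from the text; the larger sets are hypothetical.
for with_angina, total in [(6, 9), (60, 90), (600, 900)]:
    lower, upper = bounds_after_next_case(with_angina, total)
    width = upper - lower
    print(f"{with_angina}/{total}: {float(lower):.4f} to {float(upper):.4f} (width {float(width):.4f})")
```

The width of the interval is always 1/(n + 1), so it shrinks towards zero as the set of past patients grows.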


I would also regard an observation of 6/9 as an element of the set of all 6/9 observations, with the probability of getting another element of 6/9 estimated from the binomial distribution of 9 selections from a large population with a proportion of 0.667 (or 0.6 to 0.7 to provide a ‘sensitivity analysis’). The mathematical skills and innovations required to take things forward from this different sampling point of view are exactly the same as those used for other sampling concepts.


In medicine, we also use probabilistic reasoning by elimination based on an ‘expanded’ form of Bayes’ rule termed by me ‘the probabilistic elimination theorem’ (see: http://blog.oup.com/2013/09/medical-diagnosis-reasoning-probable-elimination/).

Applied statistical efforts undertake to identify useful models that approximate what we see in the world. Often there are several candidates that do a reasonable job of mimicking real world measured data.

Fisher and others a hundred years ago were developing methods for people faced with a few dozen observations, and in that scenario the infamous "p < 0.05" evolved: no one had a calculator or computer with which to calculate p-values, so tables of critical values were valuable tools. Generating a table of critical values for the F distribution or the Student's t distribution took many people many days or weeks.

So David Colquhoun's example of a data set of 60,000 dwell times is a modern data set for which the old p < 0.05 small-data-set paradigm is inappropriate. With 60,000 data points, almost any comparison of differing models or parameter values will be associated with small p-values. Comparing those p-values to the small-sample, 100-year-old paradigm of alpha = 0.05 critical value tables is inappropriate. But the p-values themselves, small though they may be, can still be scrutinized to rank competing models, just as can AIC or BIC or any number of other statistics.
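The effect of sample size on p-values is easy to see in a toy calculation. The sketch below is not the dwell-time analysis: it assumes a plain one-sample z-test with known standard deviation and a fixed, practically negligible shift of 0.02 standard deviations. Those numbers and the function names are illustrative assumptions.

```python
from math import erfc, sqrt


def z_for_mean_shift(delta, sigma, n):
    """z statistic when the sample mean sits exactly delta away from the null value."""
    return delta / (sigma / sqrt(n))


def two_sided_p(z):
    """Two-sided p-value for a standard normal test statistic."""
    return erfc(abs(z) / sqrt(2))


# Same tiny shift, increasing sample sizes up to the 60,000 mentioned above.
sample_sizes = [60, 600, 6_000, 60_000]
p_values = {n: two_sided_p(z_for_mean_shift(0.02, 1.0, n)) for n in sample_sizes}

for n in sample_sizes:
    print(f"n = {n:>6}: p = {p_values[n]:.3g}")
```

The shift never changes, yet the p-value falls from well above 0.05 at n = 60 to far below it at n = 60,000. A fixed alpha = 0.05 cut-off therefore says little at this scale, although the ordering of p-values across competing models can still be informative.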

Colquhoun is correct to entertain models that also comport with the physical characteristics of the system under study, because of course the idea is to find a representative model, not just find p-values below some threshold.

The issue now is that we have computers, and can evaluate huge sets of data, and thus need new analytical paradigms for guiding reasonable model choices and other statistical decision making efforts.

I see nothing pathetic about a p-value – it is just a statistic with some very interesting distributional properties under various scenarios. What is pathetic is shoe-horning an analysis situation (e.g. modern data sets of thousands or millions of values) into an old paradigm developed for use with smaller data sets and then decrying the silly outcomes that ensue. That's poor and fallacious philosophical practice. (I'm not pointing a finger at anyone here, or meaning to disparage anyone, just using materials in this blog post as illustration.)

The reason this blog, and Mayo's current philosophical efforts are so valuable is that we need to develop new paradigms for new data scenarios. The p < 0.05 paradigm was developed through years of philosophical debate in the small data set era of the early 20th century and served well then. We need renewed philosophical debate to develop reasonable decision making and model fitting paradigms in this era of huge amounts of data where some scenarios yield data with parameter space dimensionality far larger than the data space dimensionality and so on. Bradley Efron's paper "Scales of Evidence for Model Selection: Fisher versus Jeffreys" is one such valuable excursion down this pathway – more such philosophical debate is needed to yield reasonable model selection paradigms in this big data era. Mayo's Severe Testing concepts are also valuable, and will involve different measures or rule sets for data sets from such different dimensionalities.

The differences in performance seen between some Bayesian and other Error Statistical approaches point to the need to reassess analytical paradigms and develop new ones for modern problems, not vilify a particular statistic that served well in prior times and still does, in appropriate scenarios, today. Long live the p-value!
